Re: Timer race conditions
Duane Pauls <Duane.Pauls@...>
I’d originally raised two bugs to track this issue, one against the commonTimer infrastructure to release the lock, and the other against the qpid bridge. The changes could have been deployed separately. However, I discovered a problem with the fix, and submitted a new patch to fix the problem instead. With the new patch, the bridge has a dependency on a new function provided by the commonTimer library in OpenMAMA. Therefore, I’ve closed one of the bugs, leaving a single bug to track the issue.
Please see this bug for more details if you haven’t seen it already:
Do you think this patch is suitable to be applied to OpenMAMA? If so, do you know when this patch might be able to be applied to a release? I’m not familiar with the process for getting fixes like this into the OpenMAMA codebase, so if there is anything else we should do to help, please let me know.
From: Frank Quinn [mailto:fquinn.ni@...]
OK I understand now, thanks - got confused between 'the thread on which timer events are firing' and the 'timer thread'.
I have given consideration in the past to simply ripping out this timer code (it's literally not used by anything other than avis / qpid bridges) and using a libevent evtimer or something like that instead directly from the bridge - I even have a mock implementation for that lying around somewhere, but that's a matter for another day.
In the interim, yeah it would be great if you could raise a bugzilla ticket with a patch for timers.c and the qpid bridge (don't worry about avis because the qpid implementation is about to replace it anyway) and we'll look to get that merged.
On Thu, Jul 30, 2015 at 10:01 PM, Duane Pauls <Duane.Pauls@...> wrote:
The mamaTimer is only being destroyed from one thread. The problem is that the wombat timerElement is being destroyed from two threads at once, but not the mamaTimer.
One of the timerElement destroys is occurring from within the bridge’s timerElement callback, which is necessary to implement a reoccurring timer with a one-shot timer primitive. The second destroy is occurring from an application call to mamaTimer_destroy(). The application has no way of knowing the timer is currently expiring.
The natural solution is to put locks in the bridge implementation to prevent certain things from occurring at the same time in different threads (which I believe should be done in the qpid and avis bridges), but then you end up with deadlock because the heap is holding the lock while calling the bridge’s callback.
In our experience with messaging API’s, you need to be very careful about holding locks while calling application callbacks. We try to avoid this in our API’s. I don’t see any reason the timer heap needs to hold a lock during the callback here.
I read this as the problem being triggered by the timer being destroyed from multiple threads is that right? If so, that is an illegal sequence of events and would constitute a bug in the application itself. From the MAMA Developer's guide section on timers:
"A timer can be destroyed from within a timer callback or any other callback on the queue. This function must be called from the same thread dispatching on the associated event queue unless both the default queue and dispatch queue are not actively dispatching."
On Thu, Jul 30, 2015 at 6:54 PM, Duane Pauls <Duane.Pauls@...> wrote:
At Solace our timer implementation in our middleware bridge is very similar to that of qpid. So although my comments below are based primarily on our experience with our own middleware bridge, I believe the problems described apply equally to qpid. The natural fix for the problem, as I’ll explain below, leads to the possibility of deadlock, which I believe should be fixed in common/c_cpp/src/c/timers.c.
I’d like to get feedback on my assessment, to see if we agree that the timers module should be modified in OpenMAMA. Or, if anyone has a suggestion for another way to solve the problem, I’d be interested in that as well.
We discovered rare cases of crashes when destroying timers. By looking at the dumps and inspecting the code, I believe the problem is that the bridgeImpl’s timerElement is not protected against multi-threaded access. Consider:
1. A timer expires. In the timer thread, the bridge’s callback calls ‘BridgeMamaTimer_reset’, which destroys (i.e. frees) the timerElement.
2. Before BridgeMamaTimer_reset recreates the timerElement, an application thread decides to destroy the timer (i.e. it is being destroyed as it is expiring). The destroy flows through to the bridge, which is asked to destroy the timer. The bridge then frees the timerElement. This timerElement has already been freed in #1 above, leading to a crash.
The natural fix here is to protect the bridge’s timerElement with a lock. In the BridgeMamaTimer_reset function, use the lock to make the destroy/create an atomic operation, and also protect accesses to the timerElement.
If this is done however, there is now a possibility for deadlock. When a timer expires:
1. timers.c ‘dispatchEntry’ locks the timer heap.
2. The bridge timer callback is called.
3. The bridge callback needs to reset the timer, which means taking the timer lock.
Now suppose an application (for example MamaTimerTestCPP_ForTimer) decides that when a timer expires it wants to destroy the timer (i.e. implement a one-shot timer). The following sequence would occur:
1. The application timer callback is called on the application’s queue dispatch thread.
2. The application calls mamaTimer_destroy().
3. mamaTimer_destroy() calls BridgeMamaTimer_destroy. This involves taking the timerElement lock.
4. BridgeMamaTimer_destroy calls destroyTimer, which involves taking the timer heap lock.
Because the above sequences take the same locks, but in different orders, we have the opportunity for deadlock.
A patch that I would propose to solve this issue is:
@@ -176,7 +176,9 @@
while (!timercmp(&ele->mTimeout, &now, >))
RB_REMOVE (orderedTimeRBTree_, &heap->mTimeTree, ele);
+ wthread_mutex_unlock (&heap->mLock);
ele->mCb (ele, ele->mClosure);
+ wthread_mutex_lock (&heap->mLock);
ele = RB_MIN (orderedTimeRBTree_, &heap->mTimeTree);
/* No timers left so break */
if (ele == NULL)
This means the timer heap lock is not held for the duration of an application callback. It should always be safe to re-acquire the timer heap lock after the callback since something that could have been invalidated, such as an iterator, is not being used – the function calls RB_MIN to look at the first item on the heap.
Does this seem like a reasonable approach to the problem I’ve described? Or does someone have an alternate proposal for how to handle this in the bridge? Or another solution with OpenMAMA code?
Should a bug be raised for this issue?
Note qpid is not currently susceptible to this issue, but I believe it is susceptible to the double-free described at the beginning.