Timer race conditions


Duane Pauls <Duane.Pauls@...>
 

At Solace our timer implementation in our middleware bridge is very similar to that of qpid.  So although my comments below are based primarily on our experience with our own middleware bridge, I believe the problems described apply equally to qpid.  The natural fix for the problem, as I’ll explain below, leads to the possibility of deadlock, which I believe should be fixed in common/c_cpp/src/c/timers.c.

 

I’d like to get feedback on my assessment, to see if we agree that the timers module should be modified in OpenMAMA.  Or, if anyone has a suggestion for another way to solve the problem, I’d be interested in that as well.

 

We discovered rare cases of crashes when destroying timers.  By looking at the dumps and inspecting the code, I believe the problem is that the bridgeImpl’s timerElement is not protected against multi-threaded access.  Consider:

1.       A timer expires.  In the timer thread, the bridge’s callback calls ‘BridgeMamaTimer_reset’, which destroys (i.e. frees) the timerElement.

2.       Before BridgeMamaTimer_reset recreates the timerElement, an application thread decides to destroy the timer (i.e. it is being destroyed as it is expiring).  The destroy flows through to the bridge, which is asked to destroy the timer.  The bridge then frees the timerElement.  This timerElement has already been freed in #1 above, leading to a crash.

 

The natural fix here is to protect the bridge’s timerElement with a lock.  In the BridgeMamaTimer_reset function, use the lock to make the destroy/create an atomic operation, and also protect accesses to the timerElement.

 

If this is done however, there is now a possibility for deadlock.  When a timer expires:

1.       timers.c ‘dispatchEntry’ locks the timer heap.

2.       The bridge timer callback is called.

3.       The bridge callback needs to reset the timer, which means taking the timer lock.

 

Now suppose an application (for example MamaTimerTestCPP_ForTimer) decides that when a timer expires it wants to destroy the timer (i.e. implement a one-shot timer).  The following sequence would occur:

1.       The application timer callback is called on the application’s queue dispatch thread.

2.       The application calls mamaTimer_destroy().

3.       mamaTimer_destroy() calls BridgeMamaTimer_destroy.  This involves taking the timerElement lock.

4.       BridgeMamaTimer_destroy calls destroyTimer, which involves taking the timer heap lock.

 

Because the above sequences take the same locks, but in different orders, we have the opportunity for deadlock.

 

A patch that I would propose to solve this issue is:

 

Index: timers.c

===================================================================

@@ -176,7 +176,9 @@

                 while (!timercmp(&ele->mTimeout, &now, >))

                 {

                     RB_REMOVE (orderedTimeRBTree_, &heap->mTimeTree, ele);

+                    wthread_mutex_unlock (&heap->mLock);

                     ele->mCb (ele, ele->mClosure);

+                    wthread_mutex_lock (&heap->mLock);

                     ele = RB_MIN (orderedTimeRBTree_, &heap->mTimeTree);

                     /* No timers left so break */

                     if (ele == NULL)

 

 

This means the timer heap lock is not held for the duration of an application callback.  It should always be safe to re-acquire the timer heap lock after the callback since something that could have been invalidated, such as an iterator, is not being used – the function calls RB_MIN to look at the first item on the heap.

                                                                                                                                                                                                                                                        

Does this seem like a reasonable approach to the problem I’ve described?  Or does someone have an alternate proposal for how to handle this in the bridge?  Or another solution with OpenMAMA code?

 

Should a bug be raised for this issue?

 

Note qpid is not currently susceptible to this issue, but I believe it is susceptible to the double-free described at the beginning.

 

Cheers,

Duane

Join Openmama-dev@lists.openmama.org to automatically receive all group messages.