Re: let's move bridgeMamaSubscriptionMute call up in hierarchy out of throttle locked context
Igor Kovalenko
Classification: Public
Hi Frank,
I confirm that with your change I can reproduce the deadlock I described earlier:
(gdb) info threads Id Target Id Frame 6 Thread 0x7f51b9fcd700 (LWP 26406) "mamalistencpp_c" 0x00007f51bb6a9993 in select () from /lib64/libc.so.6 5 Thread 0x7f51b903b700 (LWP 26407) "mamalistencpp_c" 0x00007f51bb6a9993 in select () from /lib64/libc.so.6 4 Thread 0x7f51b3433700 (LWP 26408) "mamalistencpp_c" 0x00007f51bb6a9993 in select () from /lib64/libc.so.6 3 Thread 0x7f51b2c32700 (LWP 26414) "mamalistencpp_c" 0x00007f51bb99054d in __lll_lock_wait () from /lib64/libpthread.so.0 2 Thread 0x7f51b2431700 (LWP 26415) "mamalistencpp_c" 0x00007f51bb67984d in nanosleep () from /lib64/libc.so.6 * 1 Thread 0x7f51bd02d7c0 (LWP 26405) "mamalistencpp_c" 0x00007f51bb99054d in __lll_lock_wait () from /lib64/libpthread.so.0
(gdb) thread 1 [Switching to thread 1 (Thread 0x7f51bd02d7c0 (LWP 26405))] #0 0x00007f51bb99054d in __lll_lock_wait () from /lib64/libpthread.so.0 (gdb) frame 3 #3 0x00007f51bc0912e4 in wlock_lock (lock=0x7f519dda0490) at /xxx/openmama/src/common/c_cpp/src/c/wlock.c:64 64 wthread_mutex_lock( &impl->mMutex ); (gdb) print impl->mMutex $4 = {__data = {__lock = 2, __count = 1, __owner = 26414, __nusers = 1, __kind = 1, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = "\002\000\000\000\001\000\000\000.g\000\000\001\000\000\000\001", '\000' <repeats 22 times>, __align = 4294967298}
(gdb) thread 3 [Switching to thread 3 (Thread 0x7f51b2c32700 (LWP 26414))] #0 0x00007f51bb99054d in __lll_lock_wait () from /lib64/libpthread.so.0 (gdb) frame 3 #3 0x00007f51bc0912e4 in wlock_lock (lock=0xa3f6a0) at /xxx/openmama/src/common/c_cpp/src/c/wlock.c:64 64 wthread_mutex_lock( &impl->mMutex ); (gdb) print impl->mMutex $3 = {__data = {__lock = 2, __count = 1, __owner = 26405, __nusers = 1, __kind = 1, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = "\002\000\000\000\001\000\000\000%g\000\000\001\000\000\000\001", '\000' <repeats 22 times>, __align = 4294967298}
From: Openmama-dev@... [mailto:Openmama-dev@...]
On Behalf Of Igor Kovalenko
Sent: Wednesday, May 13, 2020 14:43 To: Frank Quinn <fquinn@...>; openmama-dev@... Subject: Re: [Openmama-dev] let's move bridgeMamaSubscriptionMute call up in hierarchy out of throttle locked context
Classification: Public
Hi Frank,
Thank you for looking into this issue. Testing this change will take some time, but I think this change would introduce a deadlock due to change in locking order:
wombatThrottle_sendQueuedMessage [A] acquires list_lock (self->mMsgQueue) and calls info->mActionCb() which is wrapped mamaSubscriptionImpl_completeMarketDataInitialisation() [B] which acquires impl->mCreateDestroyLock
.. while your implementation does locking in B-A order.
My implementation of this reduced locked scope preserves AB locking order. In fact it looks very much like the following pseudo-code
{ acquire throttle lock acquire mCreateDestroyLock /* protecting impl from changes via throttle sendQueuedMessage callback invocation */ check if impl state is MAMA_SUBSCRIPTION_ACTIVATING if so remove throttle action and set impl state to MAMA_SUBSCRIPTION_DEACTIVATED, so throttle sendQueuedMessage callback invocation would skip this impl drop mCreateDestroyLock drop throttle lock } /* do not acquire throttle lock, since throttle sendQueuedMessage will already skip impl not in MAMA_SUBSCRIPTION_ACTIVATING state*/ acquire mCreateDestroyLock unconditionally call mute subscription *before mamaSubscription_deactivate_internal has any chance* here goes original code to handle impl states (now the case of MAMA_SUBSCRIPTION_ACTIVATING is a no-op since we handled it above) drop mCreateDestroyLock
From: Frank Quinn [mailto:fquinn@...]
Hi Igor, can you try out this branch please?
https://github.com/fquinner/OpenMAMA/tree/feature/subscription-throttle-lock-scope-reduction
Cheers, Frank
Frank Quinn, Cascadium | +44 (0) 28 8678 8015 | http://cascadium.io
From: Igor Kovalenko <igor.kovalenko@...>
Classification: Public
Hi Frank,
Yes I will be able to review and test the PR.
From: Frank Quinn [mailto:fquinn@...]
Hi Igor,
Erm, I think I understand the issue… If I put a PR together, can you review and verify if this fixes the issue?
Cheers, Frank
Frank Quinn, Cascadium | +44 (0) 28 8678 8015 | http://cascadium.io
From:
Openmama-dev@... <Openmama-dev@...>
On Behalf Of Igor Kovalenko via lists.openmama.org
Classification: Public
Hi
I now seem to have successfully tested a proper fix for the issue which requires calling bridgeMamaSubscriptionMute outside of throttle locked context.
All it takes is to handle a subscription in state MAMA_SUBSCRIPTION_ACTIVATING as special case in mamaSubscription_deactivate with throttle lock still held to drain related event from throttle queue. Following that all other cases do not require throttle lock anymore. Now when throttle lock is dropped but with mCreateDestroyLock acquired it is safe to call bridgeMamaSubscriptionMute outside of throttle locked context.
In my local testing I can confirm this (+ a couple of obvious lines to tick42rmds code) fixes the old issue reported back in July, 2017 regarding crash with tick42rmds bridge wrt concurrent subscription.destroy() call. As a bonus the queueing solution in tick42rmds bridge is no longer required, which reduces memory consumption and improves latency.
I think that history regarding old issue is still available if one searches for this subject in mail archives: [Openmama-dev] [Openmama-users] Concurrent subscription.destroy() ? Crash when using tick42rmds transport.
Frank, I hope you will be able to have a look into this to possibly land this change into the openmama code.
From:
Openmama-dev@... [mailto:Openmama-dev@...]
On Behalf Of Igor Kovalenko
Classification: Public
Hi openmama team,
I'd like to propose moving bridgeMamaSubscriptionMute call from mamaSubscription_deactivate_internal up in the call hierarchy to mamaSubscription_deactivate so Mute() is called before any throttle lock is acquired (that is, just after checking if subscription is not NULL in mamaSubscription_deactivate).
Since in this case the client is removing subscription anyway, it should not matter that transport bridge Mute() method will be called a bit earlier in subscription lifecycle.
I'll try to explain why this change is needed for tick42rmds bridge to operate reliably. A required change to tick42rmds is outlined below as well. Locally I tested a combination of this proposed change to openmama with tick42rmds bridge modified as outlined below, and also with other transport bridges (unmodified) that we use, and found no negative side effects.
A couple of years back there was a discussion regarding tick42rmds transport issue. My understanding is that tick42rmds design choice to process subscriptions on bridge thread as opposed to Mama.start() thread violates a few openmama threading assumptions. As a consequence we can reproduce a crash during moderately fast subscribe/unsubscribe sequence where tick42rmds bridge is still trying to deliver data to callback of already destroyed mama subscription. This crash is caused by a race condition detailed below.
Tick42rmds bridge implements tick42rmdsBridgeMamaSubscription_mute method which does call Shutdown() on bridge thread subscription object. Intention there appears to be to mark bridge thread subscription object as shut down so that bridge thread ignores it and is not trying to deliver a callbacks on mama subscription objects being destroyed. Unfortunately there is no proper locking there so that seldom a Mute() call returns before Shutdown() takes effect. If this happens, bridge thread proceeds to deliver next message to mama subscription callback while mama subscription object is being destroyed by the thread returned from Mute().
Without significant redesign of tick42rmds this race can be closed by adding proper locking between tick42rmdsBridgeMamaSubscription_mute and bridge thread, so that bridge thread is guaranteed to see result of Shutdown() before Mute() call returns. Simple implementation would be to add a mutex shared between Shutdown() call and bridge thread methods issuing callbacks to mama subscriptions.
So far so good. Yet unfortunately openmama locking scheme needs to be modified for this to work due to a problem with locking order of MAMA_THROTTLE_DEFAULT throttle. There are many places where MAMA_THROTTLE_DEFAULT lock is acquired. According to my analysis with valgrind/helgrind/drd acquisition of corresponding mutex must always be a leaf call in locking hierarchy to prevent ABBA-style deadlock.
If Mute() and Shutdown() are synchronized with a mutex, then the following would happen:
client thread destroying mama subscription eventually calls mamaSubscription_destroy which calls mamaSubscription_deactivate() [A] .. mamaSubscription_deactivate acquires lock [A] on throttle object then calls mamaSubscription_deactivate_internal() .. which calls impl->mBridgeImpl->bridgeMamaSubscriptionMute (impl->mSubscBridge) which is tick42rmdsBridgeMamaSubscription_mute [B] .. tick42rmdsBridgeMamaSubscription_mute acquires bridge thread subscription object lock [B] XX client thread gets suspended here for some reason so it is not calling Shutdown() yet
tick42rmds bridge thread calls RMDSBridgeSubscription::OnMessage (or anything else which requires calling mamaSubscription_processMsg later) [B] .. bridge thread acquires bridge thread subscription object lock [B] bridge thread checks if Shutdown() is called on this object XX .. since client thread is suspended, Shutdown() is not in effect bridge thread calls mamaSubscription_processMsg() to deliver message to mama subscription .. [A] imageRequest_stopWaitForResponse() is called which tries to acquire throttle lock [A]
If my proposal to move bridgeMamaSubscriptionMute call out of locked scope is implemented, this deadlock would not occur.
I will not be able to share any code change due to our organizational restrictions but will try to answer other questions regarding this proposal.
--
Kind regards,
--- This e-mail may contain confidential and/or privileged information. If you are not the intended recipient (or have received this e-mail in error) please notify the sender immediately and delete this e-mail. Any unauthorized copying, disclosure or distribution of the material in this e-mail is strictly forbidden. Please refer to https://www.db.com/disclosures for additional EU corporate and regulatory disclosures and to http://www.db.com/unitedkingdom/content/privacy.htm for information about privacy. |
|