Re: Socket pair used for timer addition
Duane Pauls <Duane.Pauls@...>
Thanks for picking up the patch. I’ve attached another patch to bug 220 which applies the same fix to avis timers, if that saves you any effort.
I’ve created bug 222 to track the socketpair problem. I’ve also attached a patch to the bug, which addresses the problem. We run the unit tests regularly and haven’t seen an issue since applying the patch.
From: Frank Quinn [mailto:f.quinn@...]
Sent: Tuesday, September 22, 2015 3:54 AM
To: Duane Pauls; openmama-dev@...
Cc: David McKay
Subject: RE: Socket pair used for timer addition
Just realized we never informed the list but this was merged into next last week as well:
Wrt the EAGAIN change, that sounds like a separate issue which just happens to be in the same area to me so I think a separate patch / pull request / github ticket would be appropriate here.
Also for future reference (we’ll do it this time via issue #25), could you also include avis fixes for code which is currently (and regrettably) duplicated such as the middleware bridge timer code?
One other note related to this: While debugging, I read 4 bytes from the socket (so that the number of bytes available dropped from 423 to 419), then continued the program. This cleared the busy loop/deadlock condition and the unit test passed. So it supports the theory that the socket was full and couldn’t be written.
This occurred on CentOS 6.5.
I’ve been running the timer unit tests repeatedly for over a week with the patch submitted in bug 220:
MamaTimerTestCPP.ForTimer recently entered a busy loop + deadlock. After debugging the problem, I don’t think the problem I’m seeing is specific to the patched OpenMAMA. I’d like to open a quick discussion to see if my proposed fix sounds reasonable.
I saw that the socket write in _addTimer (a new function in the patch, but the same logic as the wwrite in the original OpenMAMA) was failing due to ‘EAGAIN’, so _addTimer entered a busy loop continually trying to write to the socket that was full. The reader is unable to read from the socket since the lock is being held for the duration of the write, and the reader tries to take the lock before reading. By calling ioctl(heapImpl->mSockPair, FIONREAD, buf), I determined there were 423 bytes available for reading. It seems this is the maximum number of records that can be written to a socket pair in Linux. See:
I’m thinking that to fix this, we can simply ignore EAGAIN (and for maximum portability EWOULDBLOCK as well). If the socket is full, there are plenty of records there to wake up the reader, so it isn’t necessary to add another ‘wake-up’ record. Since the lock was taken before the new timer was added and continues to be held through the write, we can just release the lock knowing the reader has something to wake it up.
If there are no objections, I’ll submit another patch along with bug 220. Or should I open a separate bug?
I’m also interested to know if anyone can provide a timeline for applying the patch in bug 220 to a release.