Re: Socket pair used for timer addition
Duane Pauls <Duane.Pauls@...>
One other note related to this: While debugging, I read 4 bytes from the socket (so that the number of bytes available dropped from 423 to 419), then continued the program. This cleared the busy loop/deadlock condition and the unit test passed. So it supports the theory that the socket was full and couldn’t be written.
This occurred on CentOS 6.5.
From: Duane Pauls
Sent: Friday, September 04, 2015 10:13 AM
Cc: David McKay
Subject: Socket pair used for timer addition
I’ve been running the timer unit tests repeatedly for over a week with the patch submitted in bug 220:
MamaTimerTestCPP.ForTimer recently entered a busy loop + deadlock. After debugging the problem, I don’t think the problem I’m seeing is specific to the patched OpenMAMA. I’d like to open a quick discussion to see if my proposed fix sounds reasonable.
I saw that the socket write in _addTimer (a new function in the patch, but the same logic as the wwrite in the original OpenMAMA) was failing due to ‘EAGAIN’, so _addTimer entered a busy loop continually trying to write to the socket that was full. The reader is unable to read from the socket since the lock is being held for the duration of the write, and the reader tries to take the lock before reading. By calling ioctl(heapImpl->mSockPair, FIONREAD, buf), I determined there were 423 bytes available for reading. It seems this is the maximum number of records that can be written to a socket pair in Linux. See:
I’m thinking that to fix this, we can simply ignore EAGAIN (and for maximum portability EWOULDBLOCK as well). If the socket is full, there are plenty of records there to wake up the reader, so it isn’t necessary to add another ‘wake-up’ record. Since the lock was taken before the new timer was added and continues to be held through the write, we can just release the lock knowing the reader has something to wake it up.
If there are no objections, I’ll submit another patch along with bug 220. Or should I open a separate bug?
I’m also interested to know if anyone can provide a timeline for applying the patch in bug 220 to a release.