Socket pair used for timer addition


Frank Quinn <f.quinn@...>
 

Hi Folks,

 

Just to update the list (update was already provided on the BZ ticket), the socketpair problem is currently being tracked via https://github.com/OpenMAMA/OpenMAMA/issues/26

 

Cheers,

Frank

 

From: Duane Pauls [mailto:Duane.Pauls@...]
Sent: 22 September 2015 16:46
To: Frank Quinn <f.quinn@...>; openmama-dev@...
Cc: David McKay <David.McKay@...>
Subject: RE: Socket pair used for timer addition

 

Hi Frank,

 

Thanks for picking up the patch.  I’ve attached another patch to bug 220 which applies the same fix to avis timers, if that saves you any effort.

 

I’ve created bug 222 to track the socketpair problem.  I’ve also attached a patch to the bug, which addresses the problem.  We run the unit tests regularly and haven’t seen an issue since applying the patch.

 

Cheers,

Duane

 

From: Frank Quinn [mailto:f.quinn@...]
Sent: Tuesday, September 22, 2015 3:54 AM
To: Duane Pauls;
openmama-dev@...
Cc: David McKay
Subject: RE: Socket pair used for timer addition

 

Thanks Duane,

 

Just realized we never informed the list but this was merged into next last week as well:

 

https://github.com/OpenMAMA/OpenMAMA/commit/a4a684e3ab0ce1ec6a2d5eff3c95c29c7cffee2f

 

Wrt the EAGAIN change, that sounds like a separate issue which just happens to be in the same area to me so I think a separate patch / pull request / github ticket would be appropriate here.

 

Also for future reference (we’ll do it this time via issue #25), could you also include avis fixes for code which is currently (and regrettably) duplicated such as the middleware bridge timer code?

 

Cheers.

Frank

 

From: openmama-dev-bounces@... [mailto:openmama-dev-bounces@...] On Behalf Of Duane Pauls
Sent: 04 September 2015 15:17
To:
openmama-dev@...
Cc: David McKay <
David.McKay@...>
Subject: Re: [Openmama-dev] Socket pair used for timer addition

 

One other note related to this: While debugging, I read 4 bytes from the socket (so that the number of bytes available dropped from 423 to 419), then continued the program.  This cleared the busy loop/deadlock condition and the unit test passed.  So it supports the theory that the socket was full and couldn’t be written.

 

This occurred on CentOS 6.5.

 

Duane

 

From: Duane Pauls
Sent: Friday, September 04, 2015 10:13 AM
To:
openmama-dev@...
Cc: David McKay
Subject: Socket pair used for timer addition

 

I’ve been running the timer unit tests repeatedly for over a week with the patch submitted in bug 220:

http://bugs.openmama.org/show_bug.cgi?id=220

 

MamaTimerTestCPP.ForTimer recently entered a busy loop + deadlock.  After debugging the problem, I don’t think the problem I’m seeing is specific to the patched OpenMAMA.  I’d like to open a quick discussion to see if my proposed fix sounds reasonable.

 

I saw that the socket write in _addTimer (a new function in the patch, but the same logic as the wwrite in the original OpenMAMA) was failing due to ‘EAGAIN’, so _addTimer entered a busy loop continually trying to write to the socket that was full.  The reader is unable to read from the socket since the lock is being held for the duration of the write, and the reader tries to take the lock before reading.  By calling ioctl(heapImpl->mSockPair[0], FIONREAD, buf), I determined there were 423 bytes available for reading.  It seems this is the maximum number of records that can be written to a socket pair in Linux.  See:

http://stackoverflow.com/questions/10899814/af-unix-socket-overhead

 

I’m thinking that to fix this, we can simply ignore EAGAIN (and for maximum portability EWOULDBLOCK as well).  If the socket is full, there are plenty of records there to wake up the reader, so it isn’t necessary to add another ‘wake-up’ record.  Since the lock was taken before the new timer was added and continues to be held through the write, we can just release the lock knowing the reader has something to wake it up.

 

If there are no objections, I’ll submit another patch along with bug 220.  Or should I open a separate bug?

 

I’m also interested to know if anyone can provide a timeline for applying the patch in bug 220 to a release.

 

Cheers,

Duane


The information contained in this message may be privileged and confidential and protected from disclosure. If the reader of this message is not the intended recipient, you are hereby notified that any dissemination, distribution or copying of this communication is strictly prohibited. If you have received this communication in error, please notify us immediately by replying to this message and deleting it from your computer. Thank you. SR Labs LLC


The information contained in this message may be privileged and confidential and protected from disclosure. If the reader of this message is not the intended recipient, you are hereby notified that any dissemination, distribution or copying of this communication is strictly prohibited. If you have received this communication in error, please notify us immediately by replying to this message and deleting it from your computer. Thank you. SR Labs LLC


Duane Pauls <Duane.Pauls@...>
 

Hi Frank,

 

Thanks for picking up the patch.  I’ve attached another patch to bug 220 which applies the same fix to avis timers, if that saves you any effort.

 

I’ve created bug 222 to track the socketpair problem.  I’ve also attached a patch to the bug, which addresses the problem.  We run the unit tests regularly and haven’t seen an issue since applying the patch.

 

Cheers,

Duane

 

From: Frank Quinn [mailto:f.quinn@...]
Sent: Tuesday, September 22, 2015 3:54 AM
To: Duane Pauls; openmama-dev@...
Cc: David McKay
Subject: RE: Socket pair used for timer addition

 

Thanks Duane,

 

Just realized we never informed the list but this was merged into next last week as well:

 

https://github.com/OpenMAMA/OpenMAMA/commit/a4a684e3ab0ce1ec6a2d5eff3c95c29c7cffee2f

 

Wrt the EAGAIN change, that sounds like a separate issue which just happens to be in the same area to me so I think a separate patch / pull request / github ticket would be appropriate here.

 

Also for future reference (we’ll do it this time via issue #25), could you also include avis fixes for code which is currently (and regrettably) duplicated such as the middleware bridge timer code?

 

Cheers.

Frank

 

From: openmama-dev-bounces@... [mailto:openmama-dev-bounces@...] On Behalf Of Duane Pauls
Sent: 04 September 2015 15:17
To: openmama-dev@...
Cc: David McKay <David.McKay@...>
Subject: Re: [Openmama-dev] Socket pair used for timer addition

 

One other note related to this: While debugging, I read 4 bytes from the socket (so that the number of bytes available dropped from 423 to 419), then continued the program.  This cleared the busy loop/deadlock condition and the unit test passed.  So it supports the theory that the socket was full and couldn’t be written.

 

This occurred on CentOS 6.5.

 

Duane

 

From: Duane Pauls
Sent: Friday, September 04, 2015 10:13 AM
To: openmama-dev@...
Cc: David McKay
Subject: Socket pair used for timer addition

 

I’ve been running the timer unit tests repeatedly for over a week with the patch submitted in bug 220:

http://bugs.openmama.org/show_bug.cgi?id=220

 

MamaTimerTestCPP.ForTimer recently entered a busy loop + deadlock.  After debugging the problem, I don’t think the problem I’m seeing is specific to the patched OpenMAMA.  I’d like to open a quick discussion to see if my proposed fix sounds reasonable.

 

I saw that the socket write in _addTimer (a new function in the patch, but the same logic as the wwrite in the original OpenMAMA) was failing due to ‘EAGAIN’, so _addTimer entered a busy loop continually trying to write to the socket that was full.  The reader is unable to read from the socket since the lock is being held for the duration of the write, and the reader tries to take the lock before reading.  By calling ioctl(heapImpl->mSockPair[0], FIONREAD, buf), I determined there were 423 bytes available for reading.  It seems this is the maximum number of records that can be written to a socket pair in Linux.  See:

http://stackoverflow.com/questions/10899814/af-unix-socket-overhead

 

I’m thinking that to fix this, we can simply ignore EAGAIN (and for maximum portability EWOULDBLOCK as well).  If the socket is full, there are plenty of records there to wake up the reader, so it isn’t necessary to add another ‘wake-up’ record.  Since the lock was taken before the new timer was added and continues to be held through the write, we can just release the lock knowing the reader has something to wake it up.

 

If there are no objections, I’ll submit another patch along with bug 220.  Or should I open a separate bug?

 

I’m also interested to know if anyone can provide a timeline for applying the patch in bug 220 to a release.

 

Cheers,

Duane


The information contained in this message may be privileged and confidential and protected from disclosure. If the reader of this message is not the intended recipient, you are hereby notified that any dissemination, distribution or copying of this communication is strictly prohibited. If you have received this communication in error, please notify us immediately by replying to this message and deleting it from your computer. Thank you. SR Labs LLC


Frank Quinn <f.quinn@...>
 

Thanks Duane,

 

Just realized we never informed the list but this was merged into next last week as well:

 

https://github.com/OpenMAMA/OpenMAMA/commit/a4a684e3ab0ce1ec6a2d5eff3c95c29c7cffee2f

 

Wrt the EAGAIN change, that sounds like a separate issue which just happens to be in the same area to me so I think a separate patch / pull request / github ticket would be appropriate here.

 

Also for future reference (we’ll do it this time via issue #25), could you also include avis fixes for code which is currently (and regrettably) duplicated such as the middleware bridge timer code?

 

Cheers.

Frank

 

From: openmama-dev-bounces@... [mailto:openmama-dev-bounces@...] On Behalf Of Duane Pauls
Sent: 04 September 2015 15:17
To: openmama-dev@...
Cc: David McKay <David.McKay@...>
Subject: Re: [Openmama-dev] Socket pair used for timer addition

 

One other note related to this: While debugging, I read 4 bytes from the socket (so that the number of bytes available dropped from 423 to 419), then continued the program.  This cleared the busy loop/deadlock condition and the unit test passed.  So it supports the theory that the socket was full and couldn’t be written.

 

This occurred on CentOS 6.5.

 

Duane

 

From: Duane Pauls
Sent: Friday, September 04, 2015 10:13 AM
To: openmama-dev@...
Cc: David McKay
Subject: Socket pair used for timer addition

 

I’ve been running the timer unit tests repeatedly for over a week with the patch submitted in bug 220:

http://bugs.openmama.org/show_bug.cgi?id=220

 

MamaTimerTestCPP.ForTimer recently entered a busy loop + deadlock.  After debugging the problem, I don’t think the problem I’m seeing is specific to the patched OpenMAMA.  I’d like to open a quick discussion to see if my proposed fix sounds reasonable.

 

I saw that the socket write in _addTimer (a new function in the patch, but the same logic as the wwrite in the original OpenMAMA) was failing due to ‘EAGAIN’, so _addTimer entered a busy loop continually trying to write to the socket that was full.  The reader is unable to read from the socket since the lock is being held for the duration of the write, and the reader tries to take the lock before reading.  By calling ioctl(heapImpl->mSockPair[0], FIONREAD, buf), I determined there were 423 bytes available for reading.  It seems this is the maximum number of records that can be written to a socket pair in Linux.  See:

http://stackoverflow.com/questions/10899814/af-unix-socket-overhead

 

I’m thinking that to fix this, we can simply ignore EAGAIN (and for maximum portability EWOULDBLOCK as well).  If the socket is full, there are plenty of records there to wake up the reader, so it isn’t necessary to add another ‘wake-up’ record.  Since the lock was taken before the new timer was added and continues to be held through the write, we can just release the lock knowing the reader has something to wake it up.

 

If there are no objections, I’ll submit another patch along with bug 220.  Or should I open a separate bug?

 

I’m also interested to know if anyone can provide a timeline for applying the patch in bug 220 to a release.

 

Cheers,

Duane


The information contained in this message may be privileged and confidential and protected from disclosure. If the reader of this message is not the intended recipient, you are hereby notified that any dissemination, distribution or copying of this communication is strictly prohibited. If you have received this communication in error, please notify us immediately by replying to this message and deleting it from your computer. Thank you. SR Labs LLC


Duane Pauls <Duane.Pauls@...>
 

One other note related to this: While debugging, I read 4 bytes from the socket (so that the number of bytes available dropped from 423 to 419), then continued the program.  This cleared the busy loop/deadlock condition and the unit test passed.  So it supports the theory that the socket was full and couldn’t be written.

 

This occurred on CentOS 6.5.

 

Duane

 

From: Duane Pauls
Sent: Friday, September 04, 2015 10:13 AM
To: openmama-dev@...
Cc: David McKay
Subject: Socket pair used for timer addition

 

I’ve been running the timer unit tests repeatedly for over a week with the patch submitted in bug 220:

http://bugs.openmama.org/show_bug.cgi?id=220

 

MamaTimerTestCPP.ForTimer recently entered a busy loop + deadlock.  After debugging the problem, I don’t think the problem I’m seeing is specific to the patched OpenMAMA.  I’d like to open a quick discussion to see if my proposed fix sounds reasonable.

 

I saw that the socket write in _addTimer (a new function in the patch, but the same logic as the wwrite in the original OpenMAMA) was failing due to ‘EAGAIN’, so _addTimer entered a busy loop continually trying to write to the socket that was full.  The reader is unable to read from the socket since the lock is being held for the duration of the write, and the reader tries to take the lock before reading.  By calling ioctl(heapImpl->mSockPair[0], FIONREAD, buf), I determined there were 423 bytes available for reading.  It seems this is the maximum number of records that can be written to a socket pair in Linux.  See:

http://stackoverflow.com/questions/10899814/af-unix-socket-overhead

 

I’m thinking that to fix this, we can simply ignore EAGAIN (and for maximum portability EWOULDBLOCK as well).  If the socket is full, there are plenty of records there to wake up the reader, so it isn’t necessary to add another ‘wake-up’ record.  Since the lock was taken before the new timer was added and continues to be held through the write, we can just release the lock knowing the reader has something to wake it up.

 

If there are no objections, I’ll submit another patch along with bug 220.  Or should I open a separate bug?

 

I’m also interested to know if anyone can provide a timeline for applying the patch in bug 220 to a release.

 

Cheers,

Duane


Duane Pauls <Duane.Pauls@...>
 

I’ve been running the timer unit tests repeatedly for over a week with the patch submitted in bug 220:

http://bugs.openmama.org/show_bug.cgi?id=220

 

MamaTimerTestCPP.ForTimer recently entered a busy loop + deadlock.  After debugging the problem, I don’t think the problem I’m seeing is specific to the patched OpenMAMA.  I’d like to open a quick discussion to see if my proposed fix sounds reasonable.

 

I saw that the socket write in _addTimer (a new function in the patch, but the same logic as the wwrite in the original OpenMAMA) was failing due to ‘EAGAIN’, so _addTimer entered a busy loop continually trying to write to the socket that was full.  The reader is unable to read from the socket since the lock is being held for the duration of the write, and the reader tries to take the lock before reading.  By calling ioctl(heapImpl->mSockPair[0], FIONREAD, buf), I determined there were 423 bytes available for reading.  It seems this is the maximum number of records that can be written to a socket pair in Linux.  See:

http://stackoverflow.com/questions/10899814/af-unix-socket-overhead

 

I’m thinking that to fix this, we can simply ignore EAGAIN (and for maximum portability EWOULDBLOCK as well).  If the socket is full, there are plenty of records there to wake up the reader, so it isn’t necessary to add another ‘wake-up’ record.  Since the lock was taken before the new timer was added and continues to be held through the write, we can just release the lock knowing the reader has something to wake it up.

 

If there are no objections, I’ll submit another patch along with bug 220.  Or should I open a separate bug?

 

I’m also interested to know if anyone can provide a timeline for applying the patch in bug 220 to a release.

 

Cheers,

Duane