
Add support for XPUB/PUB option ZMQ_XPUB_NODROP to combat message loss #1070

Open · mnmr opened this issue Aug 27, 2023 · 4 comments

@mnmr (Contributor) commented Aug 27, 2023

The socket option ZMQ_XPUB_NODROP (69) toggles the socket's behavior when the send high-water mark (SNDHWM) is reached. If 0/false (the default), messages are silently dropped. If 1/true, sending a message will instead fail with an error (EAGAIN).
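For illustration, roughly how this might look from NetMQ if the option were exposed (the `XPubNoDrop` property is hypothetical; NetMQ has no such option today, which is what this issue asks for):

```csharp
using NetMQ;
using NetMQ.Sockets;

using var pub = new XPublisherSocket();
pub.Options.SendHighWatermark = 1_000;

// Hypothetical property -- NetMQ does not currently expose ZMQ_XPUB_NODROP.
// In libzmq, setting the option to 1 makes a send on a full socket fail
// (EAGAIN) instead of silently dropping the message.
// pub.Options.XPubNoDrop = true;

pub.Bind("tcp://*:5556");
```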

We're using TrySendMultipartMessage to send messages, and our logic assumes it returns false when a message couldn't be sent, but this does not always seem to be the case (or we wouldn't be seeing lost messages). I checked the NetMQ (v4, master) source code, but was unable to work out what actually happens in NetMQ when the socket's queue is full.

The motivation for requesting this is that we experience occasional message loss (not a slow joiner, but while the system is running) between publishers and subscribers. All subscribers always lose the same number of messages, so I'm thinking it must be a publisher-side problem. We verify sequence numbers both when we send a message and when we receive it: the publisher code reports no SN gaps, but subscribers do. We've bumped the SNDHWM from the default 1_000 to 100_000 and also increased the buffer sizes, but the problem persists during message spikes.
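The shape of our SN check is roughly the following (a sketch; names and the endpoint are illustrative, not our actual code):

```csharp
using System;
using NetMQ;
using NetMQ.Sockets;

// Publisher side (its own process): append a sequence number to every
// message and treat a false return from TrySendMultipartMessage as a
// reportable gap.
using var pub = new PublisherSocket();
pub.Options.SendHighWatermark = 100_000;
pub.Bind("tcp://*:5556");

long seq = 0;
var msg = new NetMQMessage();
msg.Append("topic");
msg.Append((++seq).ToString());
msg.Append("payload");
if (!pub.TrySendMultipartMessage(msg))
    Console.WriteLine($"send failed at SN {seq}"); // never logged in our runs

// Subscriber side (separate process): verify sequence numbers are contiguous.
using var sub = new SubscriberSocket();
sub.Connect("tcp://localhost:5556");
sub.Subscribe("topic");

long expected = 1;
var received = sub.ReceiveMultipartMessage();
long sn = long.Parse(received[1].ConvertToString());
if (sn != expected)
    Console.WriteLine($"SN gap: expected {expected}, got {sn}"); // this fires
```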

We're going to try bumping the HWM even further, but that feels like a hacky workaround rather than a solution.

@mnmr (Contributor, Author) commented Sep 5, 2023

I've created a set of unit tests to verify that this problem exists. When I send messages and the XPUB socket reaches its HWM, it appears not to report back correctly (in the call to TrySendMultipartMessage) that its buffer is full until some time after the fact, causing message loss.
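The tests are roughly the following shape (a sketch of the repro, not my exact code; names and the endpoint are illustrative):

```csharp
using System.Threading;
using NetMQ;
using NetMQ.Sockets;
using Xunit;

public class XPubHwmTests
{
    [Fact]
    public void TrySend_ReportsFailure_WhenHwmReached()
    {
        using var pub = new XPublisherSocket();
        pub.Options.SendHighWatermark = 1_000;
        pub.Bind("tcp://127.0.0.1:5557");

        using var sub = new SubscriberSocket();
        sub.Options.ReceiveHighWatermark = 1_000;
        sub.Connect("tcp://127.0.0.1:5557");
        sub.Subscribe("");
        Thread.Sleep(100); // let the subscription reach the publisher

        // Never read from `sub`, so the publisher's outbound queue fills up.
        int reportedFailures = 0;
        for (int i = 0; i < 100_000; i++)
        {
            var msg = new NetMQMessage();
            msg.Append(i.ToString());
            msg.Append(new byte[1024]); // padding so the buffers fill quickly
            if (!pub.TrySendMultipartMessage(msg))
                reportedFailures++;
        }

        // Expected: once the HWM is hit, every further send reports failure.
        // Observed: TrySendMultipartMessage keeps returning true for a while
        // after the queue is full, and those messages are silently dropped.
        Assert.True(reportedFailures > 0);
    }
}
```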

I am going to try to replicate this unit test in the NetMQ project to demonstrate that the problem exists and isn't caused by our (reasonably thin) wrappers around NetMQSocket.

As a workaround we've now bumped the HWMs to 2_000_000.
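That is, something like this (illustrative; only the values matter):

```csharp
using NetMQ.Sockets;

using var pub = new PublisherSocket();
using var sub = new SubscriberSocket();

// Workaround: raise both high-water marks far above any observed burst.
pub.Options.SendHighWatermark = 2_000_000;
sub.Options.ReceiveHighWatermark = 2_000_000;
```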

@mnmr (Contributor, Author) commented Sep 5, 2023

Are the unit tests for NetMQ in working order? I did a fresh checkout but I get a ton of failures when trying to run the tests.

Looking at the code, some things also don't make sense. E.g. in TcpListener there is an Assumes.NotNull(m_handle) right before the code that assigns to this variable. Commenting out this Assumes allows a bunch more tests to pass, but there are still many failing ones, and I don't really want to modify this rather complex codebase without a test suite that passes before any edits.
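The pattern, paraphrased (illustrative, not the actual TcpListener source; Debug.Assert stands in for the Assumes helper):

```csharp
using System.Diagnostics;

internal class ListenerSketch
{
    private object? m_handle;

    public void Start()
    {
        Debug.Assert(m_handle is not null); // asserts a field that is still null...
        m_handle = new object();            // ...and is only assigned afterwards
    }
}
```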

Is this the end of the road for NetMQ or am I doing something wrong?

@drewnoakes (Member) commented

#1073 fixes that assertion in TcpListener, along with some other modernisation. In CI, the tests run on release builds, where that assert doesn't trip. Thanks for pointing it out.

The tests still fail for me in VS, though they have been passing on Ubuntu CI. We just need someone to spend some time understanding the failures.
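For anyone wondering why the build flavour matters: assertion helpers of this kind are typically conditional on debug builds, e.g. (a generic sketch, assuming Assumes wraps something like Debug.Assert; not the actual NetMQ source):

```csharp
using System.Diagnostics;

internal static class Assumes
{
    // [Conditional("DEBUG")] strips calls to this method from release
    // builds at compile time, so the release CI never executes the check.
    [Conditional("DEBUG")]
    public static void NotNull(object? value)
        => Debug.Assert(value is not null, "Expected a non-null value.");
}
```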

@mnmr (Contributor, Author) commented Sep 6, 2023

Ok, glad to hear it. I was just surprised to see so many failing tests, considering there haven't been that many commits to the project. I may take a second look and see if I can help get the tests working, as I'd really like to help squash the HWM bug.
