
[Bug]: Queue events does not always capture all completed events #2517

bobthekingofegypt opened this issue Apr 10, 2024 · 2 comments
Labels: bug (Something isn't working)
Version

v5.7.1

Platform

NodeJS

What happened?

We recently migrated a legacy project from the old bull to bullmq.

With the old bull we did some slightly unusual wrapping and monitoring of this legacy project from our newer orchestration tool: the tool used the completed callbacks from the queues to know when this legacy stage had finished processing everything. That always worked fine. After switching to bullmq we have noticed that the orchestrator no longer detects that the queue work is complete. The jobs have all been processed fine and saved off to our database without issue, but the orchestrator keeps waiting for a few callbacks so that the completed callback count equals the submitted event count, and those callbacks never arrive. Not sure if we have done something wrong, or if this is expected behaviour.

In summary:

  • lots of workers doing the work
  • a single process monitoring the queue for completed events; it submits X jobs and waits until the completed callback count reaches X before stopping the workers (roughly the sketch after this list)
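
The monitoring side is essentially this pattern (a minimal sketch, not our actual code; the queue name, connection details and the shutdown step are placeholders):

```ts
import { QueueEvents } from 'bullmq';

// Minimal sketch of the counting pattern; 'legacy-stage' and the Redis
// connection are placeholders, not our real configuration.
const queueEvents = new QueueEvents('legacy-stage', {
  connection: { host: 'localhost', port: 6379 },
});

const submitted = 1_000_000; // X = number of jobs the producer added
let completed = 0;

queueEvents.on('completed', ({ jobId }) => {
  completed += 1;
  if (completed === submitted) {
    console.log('all jobs completed, stopping workers');
    // orchestrator would shut the workers down here
  }
});
```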

How to reproduce.

https://github.com/bobthekingofegypt/check_bull_complete_count

I uploaded this repo as a minimal test. It contains a monitor, a consumer and a producer. The monitor listens for the completed events, node:cluster starts up a load of workers, and the producer submits 1 million jobs. But the monitor doesn't always register 1 million completed callbacks. I'm testing this on a 20-core machine.
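
The producer/worker side of the repro is roughly shaped like this (a simplified sketch, not the exact contents of the linked repo; queue name, core count and chunk sizes are illustrative):

```ts
import cluster from 'node:cluster';
import { Queue, Worker } from 'bullmq';

// Illustrative sketch only; the real repro lives in the linked repository.
const connection = { host: 'localhost', port: 6379 };

if (cluster.isPrimary) {
  // Fork a worker process per core, then enqueue the jobs.
  for (let i = 0; i < 20; i++) cluster.fork();

  const queue = new Queue('legacy-stage', { connection });
  // addBulk in chunks so we don't send one enormous command
  for (let start = 0; start < 1_000_000; start += 10_000) {
    const chunk = Array.from({ length: 10_000 }, (_, i) => ({
      name: 'job',
      data: { index: start + i },
    }));
    await queue.addBulk(chunk);
  }
} else {
  // Each forked process runs a Worker whose processor completes immediately.
  new Worker('legacy-stage', async () => 'done', { connection });
}
```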

Relevant log output

No response

Code of Conduct

  • I agree to follow this project's Code of Conduct
bobthekingofegypt added the bug label on Apr 10, 2024
manast (Contributor) commented Apr 10, 2024

I think it is possible this has to do with the max events length. Since you are processing very quickly, the Redis stream holding the events may get trimmed before QueueEvents manages to read them. You can try increasing this setting to a larger value to see if it improves: https://api.docs.bullmq.io/interfaces/v5.QueueOptions.html#streams. The default is 10k; you could try 100k instead.
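
For reference, setting that option when constructing the queue looks roughly like this (queue name and connection are placeholders):

```ts
import { Queue } from 'bullmq';

// Sketch of raising the events stream length as suggested above;
// 'legacy-stage' and the connection details are placeholders.
const queue = new Queue('legacy-stage', {
  connection: { host: 'localhost', port: 6379 },
  streams: {
    events: {
      maxLen: 100_000, // default is 10_000
    },
  },
});
```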

bobthekingofegypt (Author) commented
Tried it with 100k, sadly no difference.

I originally had the reproducible test case running with random sleeps to better match our production machines' throughput, but when I saw the same issue without them I removed them for simplicity. Our production machines don't consume events super quickly; the completion monitor is attached to the end queue of a stream of processors. That queue has the task of saving to Postgres, so its throughput isn't very high.
