PDU to Tagged Stream confuses the scheduler by failing to produce output #6628

Open
argilo opened this issue Apr 9, 2023 · 8 comments · May be fixed by #7028
Labels
Bug, FEC, medium, pdu (PDU-related, i.e. about packetized data in messages), QA

Comments

argilo (Member) commented Apr 9, 2023

A CI run (https://github.com/gnuradio/gnuradio/actions/runs/4651656779/jobs/8231524712) recently failed with the following error:

117/265 Test #117: qa_fecapi_dummy ..............................***Failed    0.61 sec
.............
======================================================================
FAIL: test_async_00 (__main__.test_fecapi_dummy) (packed=False, rev_pack=True)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/__w/gnuradio/gnuradio/gr-fec/python/fec/qa_fecapi_dummy.py", line 279, in test_async_00
    self.assertSequenceEqualGR(data_in, data_out)
  File "/__w/gnuradio/gnuradio/gnuradio-runtime/python/gnuradio/gr_unittest.py", line 115, in assertSequenceEqualGR
    self.assertEqual(len(data_in), len(data_out), msg="Lengths do not match")
AssertionError: 15360 != 15120 : Lengths do not match

----------------------------------------------------------------------
Ran 14 tests in 0.393s

FAILED (failures=1)

I was able to reproduce this on my laptop by running the test in a loop (while gr-fec/python/fec/qa_fecapi_dummy_test.sh; do :; done) in parallel with a heavy CPU load (stress -c 16).
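The same loop can be scripted for repeated runs. This sketch assumes a GNU Radio build tree in which ctest has generated the qa_fecapi_dummy_test.sh wrapper (as in the command above) and that stress is installed:

```python
# Re-run the flaky test until it fails, with background CPU load.
# Paths are those from the report above; adjust to your build tree.
import subprocess

load = subprocess.Popen(["stress", "-c", "16"])  # heavy CPU load in the background
try:
    while subprocess.run(["gr-fec/python/fec/qa_fecapi_dummy_test.sh"]).returncode == 0:
        pass
    print("qa_fecapi_dummy failed -- flake reproduced")
finally:
    load.terminate()
```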

Debian build logs show that qa_fecapi_repetition fails on i386:

https://buildd.debian.org/status/fetch.php?pkg=gnuradio&arch=i386&ver=3.10.9.0%7Erc1-2&stamp=1703091669&raw=0
https://buildd.debian.org/status/fetch.php?pkg=gnuradio&arch=i386&ver=3.10.9.0%7Erc1-3&stamp=1703129935&raw=0

Again, I was able to reproduce the failure by repeatedly executing the test in parallel with a heavy CPU load.

The failing tests use _qa_helper_async to build a flow graph containing FEC Async Encoder and FEC Async Decoder blocks. The flow graph alternates between stream and message connections. I suspect that the scheduler is deciding that the flow graph is "done" even though there are still messages in flight.
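For context, here is a minimal sketch of the same stream-to-message-to-stream round trip. It is not the actual _qa_helper_async graph (the FEC async blocks are omitted), the module paths assume GNU Radio 3.10 (where the PDU adapters live in gr-pdu; in 3.8 they are blocks.tagged_stream_to_pdu and blocks.pdu_to_tagged_stream), and whether a given run actually loses data depends on scheduler timing and buffer sizes:

```python
# Minimal stream <-> message round trip, loosely modelled on _qa_helper_async.
# GNU Radio 3.10 module paths assumed; the FEC async blocks are omitted, so
# this only illustrates the topology that confuses the scheduler.
from gnuradio import gr, blocks, pdu

frame_size = 240                                # one frame, as in the CI failure
data = [i % 2 for i in range(frame_size * 64)]  # 15360 items, as in the CI test

tb = gr.top_block()
src = blocks.vector_source_b(data, False)
to_ts = blocks.stream_to_tagged_stream(gr.sizeof_char, 1, frame_size, "packet_len")
to_pdu = pdu.tagged_stream_to_pdu(gr.types.byte_t, "packet_len")     # stream -> message
to_stream = pdu.pdu_to_tagged_stream(gr.types.byte_t, "packet_len")  # message -> stream
snk = blocks.vector_sink_b()

tb.connect(src, to_ts, to_pdu)
tb.msg_connect(to_pdu, "pdus", to_stream, "pdus")
tb.connect(to_stream, snk)
tb.run()

# If the scheduler declares pdu_to_tagged_stream "done" with PDUs still queued,
# the output comes up short, which is the symptom seen in the failing tests.
assert len(snk.data()) == len(data), "flow graph finished with PDUs still in flight"
```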

argilo changed the title from "qa_fecapi_dummy is flaky" to "qa_fecapi_* async tests are flaky" on Dec 21, 2023
argilo (Member, Author) commented Dec 21, 2023

Various other tests involving message passing were failing in the past, and #5754 was intended to fix that. It may be worth looking over the fix to see whether it properly handles the flow graphs in the FEC async tests.

argilo (Member, Author) commented Dec 21, 2023

Some observations:

  • In the CI qa_fecapi_dummy failure, the output was 240 bits short, which is exactly one frame. This suggests a message was dropped somewhere along the line.
  • I tried to induce failures by adding sleeps to the message outputs of FEC Async Encoder, FEC Async Decoder, and Tagged Stream to PDU, but this didn't work.
  • In the Debian qa_fecapi_repetition failures, the output was 512 bits (= 64 bytes) short in the packed case, and 64 bits short in the unpacked case. This is not a multiple of the frame size, which suggests that some other problem occurred.
  • The Debian qa_fecapi_repetition failures occurred for all four async test cases in two consecutive builds. This suggests that the failure may not be intermittent. Strangely, the failures did not occur for qa_fecapi_cc, qa_fecapi_dummy, or qa_fecapi_ldpc, which use the same setup.

argilo (Member, Author) commented Dec 22, 2023

It turns out there are two separate bugs here:

  1. The flow graph sometimes completes with messages still in PDU to Tagged Stream's input queue. This can readily be reproduced by increasing the rep parameter in the test.
  2. FEC Async Decoder calculates the frame length in floating point, then converts it to an integer without rounding. Floating point error can cause it to be one bit/byte too low (a short illustration follows below).

I'll reopen #7022 to track bug 2.
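To illustrate bug 2 (the decoder's actual arithmetic lives in gr-fec's async_decoder; the numbers below are made up): a frame length that is mathematically a whole number can come out fractionally low in double precision, and truncation then loses a bit where rounding would not.

```python
# Failure-mode illustration only, not the decoder's actual expression.
frame_len = (0.1 + 0.7) * 10   # mathematically 8.0
print(frame_len)               # 7.999999999999999 with IEEE-754 doubles
print(int(frame_len))          # 7 -- truncation drops a bit
print(round(frame_len))        # 8 -- the intended frame length
```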

argilo changed the title from "qa_fecapi_* async tests are flaky" to "PDU to Tagged Stream confuses the scheduler by failing to produce output" on Dec 22, 2023
argilo (Member, Author) commented Dec 22, 2023

I was able to track down the bug after adding some debug logging. Here's what goes wrong:

  • In the flow graph, an "Async Encoder" block produces PDU messages and sends them to a "PDU to Tagged Stream" block.
  • The "Async Encoder" block sends along a burst of PDUs, then finishes its work and sends a system message to the "PDU to Tagged Stream" block indicating that it is "done". This marks the "PDU to Tagged Stream" block as "done" receiving messages.
  • The "PDU to Tagged Stream" block writes PDUs to its stream output, but eventually its output buffer fills to the point that there is no longer room for the next PDU to be copied in full. When this happens, it refuses to produce output. (More specifically, it uses calculate_output_stream_length to tell its parent class, tagged_stream_block, that it requires noutput_items to be at least as large as the next PDU, and the parent's general_work method then returns no output.)
  • The scheduler sees that the "PDU to Tagged Stream" block is "done" receiving messages, and refuses to fill its output buffer. The scheduler concludes that the block must be "done" and stops invoking it.

Prior to #323, the PDU to Tagged Stream block would output partial PDUs. I think we should restore this behaviour so that run-to-completion flow graphs will finish properly.

argilo (Member, Author) commented Dec 23, 2023

Unfortunately, it's the tagged_stream_block class that insists on having sufficient space to write the entire PDU in a single shot, so this couldn't be fixed in the PDU to Tagged Stream block unless it is changed to no longer be a Tagged Stream Block.

It seems likely that other Tagged Stream Blocks could confuse the scheduler in a similar way when used in run-to-completion flow graphs.

jacobagilbert (Contributor) commented:
FWIW, this exact issue caused me to immediately abandon tagged stream blocks at the start of my PDU work. As you have identified, it's a property of TSBs, which, as expounded in #1446, are inherently evil.

argilo (Member, Author) commented Dec 23, 2023

TSBs [...] are inherently evil

I'm starting to see that! They make unreasonable demands of the scheduler, so it's not a great surprise that it responds badly.

Maybe the right path forward for the FECAPI tests is to rewrite them to avoid the "Tagged Stream to PDU" and "PDU to Tagged Stream" blocks.

If that turns out to be impractical, then I suppose it would be possible to make the "PDU to Tagged Stream" block behave by setting its output multiple to the FEC frame size.

argilo (Member, Author) commented Dec 23, 2023

Maybe the right path forward for the FECAPI tests is to rewrite them to avoid the "Tagged Stream to PDU" and "PDU to Tagged Stream" blocks.

On second thought, rewriting tests to avoid these blocks wouldn't fix the root cause. The "Tagged Stream to PDU" and "PDU to Tagged Stream" blocks don't need to use the scheduler-hostile tagged_stream_block class; they could instead be adjusted to handle partial PDUs.
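As a very rough sketch of what "handle partial PDUs" could look like (Python, invented block and variable names, not a proposed patch): a plain basic_block can buffer incoming PDUs and copy only as much as fits into the output buffer on each work call, carrying the remainder over.

```python
import pmt
import numpy as np
from collections import deque
from gnuradio import gr


class partial_pdu_to_stream(gr.basic_block):
    """Copy each PDU to the stream output in as many pieces as needed, rather
    than requiring output-buffer room for a whole PDU at once."""

    def __init__(self, tag_key="packet_len"):
        gr.basic_block.__init__(self, name="partial_pdu_to_stream",
                                in_sig=None, out_sig=[np.uint8])
        self.tag_key = pmt.intern(tag_key)
        self.queue = deque()   # PDUs received but not yet started
        self.current = None    # remainder of the PDU currently being written
        self.message_port_register_in(pmt.intern("pdus"))
        self.set_msg_handler(pmt.intern("pdus"), self.handle_pdu)

    def handle_pdu(self, msg):
        # Assumes a standard (metadata . u8vector) PDU pair.
        self.queue.append(np.array(pmt.u8vector_elements(pmt.cdr(msg)),
                                   dtype=np.uint8))

    def general_work(self, input_items, output_items):
        out = output_items[0]
        produced = 0
        while produced < len(out):
            if self.current is None:
                if not self.queue:
                    break
                self.current = self.queue.popleft()
                # The length tag goes on the first item of the PDU, even if
                # the payload ends up spread over several work calls.
                self.add_item_tag(0, self.nitems_written(0) + produced,
                                  self.tag_key, pmt.from_long(len(self.current)))
            n = min(len(out) - produced, len(self.current))  # copy only what fits
            out[produced:produced + n] = self.current[:n]
            self.current = self.current[n:] if n < len(self.current) else None
            produced += n
        return produced
```

The open question, as discussed above, is how such a block should signal to the scheduler that it is not done when it produces zero items while PDUs are still queued or in flight.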

argilo added the pdu (PDU-related, i.e. about packetized data in messages) label on Dec 23, 2023
argilo linked a pull request on Dec 23, 2023 that will close this issue