PDU to Tagged Stream confuses the scheduler by failing to produce output #6628

Open
argilo opened this issue Apr 9, 2023 · 8 comments · May be fixed by #7028
Labels
Bug, FEC, medium, pdu (PDU-related, i.e. about packetized data in messages), QA

Comments

argilo (Member) commented Apr 9, 2023

A CI run (https://github.com/gnuradio/gnuradio/actions/runs/4651656779/jobs/8231524712) recently failed with the following error:

117/265 Test #117: qa_fecapi_dummy ..............................***Failed    0.61 sec
.............
======================================================================
FAIL: test_async_00 (__main__.test_fecapi_dummy) (packed=False, rev_pack=True)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/__w/gnuradio/gnuradio/gr-fec/python/fec/qa_fecapi_dummy.py", line 279, in test_async_00
    self.assertSequenceEqualGR(data_in, data_out)
  File "/__w/gnuradio/gnuradio/gnuradio-runtime/python/gnuradio/gr_unittest.py", line 115, in assertSequenceEqualGR
    self.assertEqual(len(data_in), len(data_out), msg="Lengths do not match")
AssertionError: 15360 != 15120 : Lengths do not match

----------------------------------------------------------------------
Ran 14 tests in 0.393s

FAILED (failures=1)

I was able to reproduce this on my laptop by running the test in a loop (while gr-fec/python/fec/qa_fecapi_dummy_test.sh; do :; done) in parallel with a heavy CPU load (stress -c 16).
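The same loop can be scripted for repeated runs. This sketch assumes a GNU Radio build tree in which ctest has generated the qa_fecapi_dummy_test.sh wrapper (as in the command above) and that stress is installed:

```python
# Re-run the flaky test until it fails, with background CPU load.
# Paths are those from the report above; adjust to your build tree.
import subprocess

load = subprocess.Popen(["stress", "-c", "16"])  # heavy CPU load in the background
try:
    while subprocess.run(["gr-fec/python/fec/qa_fecapi_dummy_test.sh"]).returncode == 0:
        pass
    print("qa_fecapi_dummy failed -- flake reproduced")
finally:
    load.terminate()
```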

Debian build logs show that qa_fecapi_repetition fails on i386:

https://buildd.debian.org/status/fetch.php?pkg=gnuradio&arch=i386&ver=3.10.9.0%7Erc1-2&stamp=1703091669&raw=0
https://buildd.debian.org/status/fetch.php?pkg=gnuradio&arch=i386&ver=3.10.9.0%7Erc1-3&stamp=1703129935&raw=0

Again, I was able to reproduce the failure by repeatedly executing the test in parallel with a heavy CPU load.

The failing tests use _qa_helper_async to build a flow graph containing FEC Async Encoder and FEC Async Decoder blocks. The flow graph alternates between stream and message connections. I suspect that the scheduler is deciding that the flow graph is "done" even though there are still messages in flight.
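For context, here is a minimal sketch of the same stream-to-message-to-stream round trip. It is not the actual _qa_helper_async graph (the FEC async blocks are omitted), the module paths assume GNU Radio 3.10 (where the PDU adapters live in gr-pdu; in 3.8 they are blocks.tagged_stream_to_pdu and blocks.pdu_to_tagged_stream), and whether a given run actually loses data depends on scheduler timing and buffer sizes:

```python
# Minimal stream <-> message round trip, loosely modelled on _qa_helper_async.
# GNU Radio 3.10 module paths assumed; the FEC async blocks are omitted, so
# this only illustrates the topology that confuses the scheduler.
from gnuradio import gr, blocks, pdu

frame_size = 240                                # one frame, as in the CI failure
data = [i % 2 for i in range(frame_size * 64)]  # 15360 items, as in the CI test

tb = gr.top_block()
src = blocks.vector_source_b(data, False)
to_ts = blocks.stream_to_tagged_stream(gr.sizeof_char, 1, frame_size, "packet_len")
to_pdu = pdu.tagged_stream_to_pdu(gr.types.byte_t, "packet_len")     # stream -> message
to_stream = pdu.pdu_to_tagged_stream(gr.types.byte_t, "packet_len")  # message -> stream
snk = blocks.vector_sink_b()

tb.connect(src, to_ts, to_pdu)
tb.msg_connect(to_pdu, "pdus", to_stream, "pdus")
tb.connect(to_stream, snk)
tb.run()

# If the scheduler declares pdu_to_tagged_stream "done" with PDUs still queued,
# the output comes up short, which is the symptom seen in the failing tests.
assert len(snk.data()) == len(data), "flow graph finished with PDUs still in flight"
```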

argilo changed the title from "qa_fecapi_dummy is flaky" to "qa_fecapi_* async tests are flaky" on Dec 21, 2023
argilo (Member, Author) commented Dec 21, 2023

Various other tests involving message passing were failing in the past, and #5754 was intended to fix that. It may be worth looking over the fix to see whether it properly handles the flow graphs in the FEC async tests.

argilo (Member, Author) commented Dec 21, 2023

Some observations:

  • In the CI qa_fecapi_dummy failure, the output was 240 bits short, which is exactly one frame. This suggests a message was dropped somewhere along the line.
  • I tried to induce failures by adding sleeps to the message outputs of FEC Async Encoder, FEC Async Decoder, and Tagged Stream to PDU, but this didn't work.
  • In the Debian qa_fecapi_repetition failures, the output was 512 bits (= 64 bytes) short in the packed case, and 64 bits short in the unpacked case. This is not a multiple of the frame size, which suggests that some other problem occurred.
  • The Debian qa_fecapi_repetition failures occurred for all four async test cases in two consecutive builds. This suggests that the failure may not be intermittent. Strangely, the failures did not occur for qa_fecapi_cc, qa_fecapi_dummy, or qa_fecapi_ldpc, which use the same setup.

argilo (Member, Author) commented Dec 22, 2023

It turns out there are two separate bugs here:

  1. The flow graph sometimes completes with messages still in PDU to Tagged Stream's input queue. This can readily be reproduced by increasing the rep parameter in the test.
  2. FEC Async Decoder calculates the frame length in floating point, then converts it to an integer without rounding. Floating point error can cause it to be one bit/byte too low (a short illustration follows below).

I'll reopen #7022 to track bug 2.
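To illustrate bug 2 (the decoder's actual arithmetic lives in gr-fec's async_decoder; the numbers below are made up): a frame length that is mathematically a whole number can come out fractionally low in double precision, and truncation then loses a bit where rounding would not.

```python
# Failure-mode illustration only, not the decoder's actual expression.
frame_len = (0.1 + 0.7) * 10   # mathematically 8.0
print(frame_len)               # 7.999999999999999 with IEEE-754 doubles
print(int(frame_len))          # 7 -- truncation drops a bit
print(round(frame_len))        # 8 -- the intended frame length
```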

argilo changed the title from "qa_fecapi_* async tests are flaky" to "PDU to Tagged Stream confuses the scheduler by failing to produce output" on Dec 22, 2023
argilo (Member, Author) commented Dec 22, 2023

I was able to track down the bug after adding some debug logging. Here's what goes wrong:

  • In the flow graph, an "Async Encoder" block produces PDU messages and sends them to a "PDU to Tagged Stream" block.
  • The "Async Encoder" block sends along a burst of PDUs, then finishes its work and sends a system message to the "PDU to Tagged Stream" block indicating that it is "done". This marks the "PDU to Tagged Stream" block as "done" receiving messages.
  • The "PDU to Tagged Stream" block writes PDUs to its stream output, but eventually its output buffer fills to the point that there is no longer room for the next PDU to be copied in full. When this happens, it refuses to produce output. (More specifically, it uses calculate_output_stream_length to tell its parent class, tagged_stream_block, that it requires noutput_items to be at least as large as the next PDU, and the parent's general_work method then returns no output.)
  • The scheduler sees that the "PDU to Tagged Stream" block is "done" receiving messages, and refuses to fill its output buffer. The scheduler concludes that the block must be "done" and stops invoking it.

Prior to #323, the PDU to Tagged Stream block would output partial PDUs. I think we should restore this behaviour so that run-to-completion flow graphs will finish properly.

argilo (Member, Author) commented Dec 23, 2023

Unfortunately, it's the tagged_stream_block class that insists on having sufficient space to write the entire PDU in a single shot, so this couldn't be fixed in the PDU to Tagged Stream block unless it is changed to no longer be a Tagged Stream Block.

It seems likely that other Tagged Stream Blocks could confuse the scheduler in a similar way when used in run-to-completion flow graphs.

jacobagilbert (Contributor) commented:
FWIW, this exact issue caused me to immediately abandon tagged stream blocks at the start of my PDU work. As you have identified, it's a property of TSBs, which, as expounded in #1446, are inherently evil.

argilo (Member, Author) commented Dec 23, 2023

TSBs [...] are inherently evil

I'm starting to see that! They make unreasonable demands of the scheduler, so it's not a great surprise that it responds badly.

Maybe the right path forward for the FECAPI tests is to rewrite them to avoid the "Tagged Stream to PDU" and "PDU to Tagged Stream" blocks.

If that turns out to be impractical, then I suppose it would be possible to make the "PDU to Tagged Stream" block behave by setting its output multiple to the FEC frame size.

argilo (Member, Author) commented Dec 23, 2023

Maybe the right path forward for the FECAPI tests is to rewrite them to avoid the "Tagged Stream to PDU" and "PDU to Tagged Stream" blocks.

On second thought, rewriting tests to avoid these blocks wouldn't fix the root cause. The "Tagged Stream to PDU" and "PDU to Tagged Stream" blocks don't need to use the scheduler-hostile tagged_stream_block class; they could instead be adjusted to handle partial PDUs.
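As a very rough sketch of what "handle partial PDUs" could look like (Python, invented block and variable names, not a proposed patch): a plain basic_block can buffer incoming PDUs and copy only as much as fits into the output buffer on each work call, carrying the remainder over.

```python
import pmt
import numpy as np
from collections import deque
from gnuradio import gr


class partial_pdu_to_stream(gr.basic_block):
    """Copy each PDU to the stream output in as many pieces as needed, rather
    than requiring output-buffer room for a whole PDU at once."""

    def __init__(self, tag_key="packet_len"):
        gr.basic_block.__init__(self, name="partial_pdu_to_stream",
                                in_sig=None, out_sig=[np.uint8])
        self.tag_key = pmt.intern(tag_key)
        self.queue = deque()   # PDUs received but not yet started
        self.current = None    # remainder of the PDU currently being written
        self.message_port_register_in(pmt.intern("pdus"))
        self.set_msg_handler(pmt.intern("pdus"), self.handle_pdu)

    def handle_pdu(self, msg):
        # Assumes a standard (metadata . u8vector) PDU pair.
        self.queue.append(np.array(pmt.u8vector_elements(pmt.cdr(msg)),
                                   dtype=np.uint8))

    def general_work(self, input_items, output_items):
        out = output_items[0]
        produced = 0
        while produced < len(out):
            if self.current is None:
                if not self.queue:
                    break
                self.current = self.queue.popleft()
                # The length tag goes on the first item of the PDU, even if
                # the payload ends up spread over several work calls.
                self.add_item_tag(0, self.nitems_written(0) + produced,
                                  self.tag_key, pmt.from_long(len(self.current)))
            n = min(len(out) - produced, len(self.current))  # copy only what fits
            out[produced:produced + n] = self.current[:n]
            self.current = self.current[n:] if n < len(self.current) else None
            produced += n
        return produced
```

The open question, as discussed above, is how such a block should signal to the scheduler that it is not done when it produces zero items while PDUs are still queued or in flight.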

argilo added the pdu (PDU-related, i.e. about packetized data in messages) label on Dec 23, 2023
argilo linked a pull request on Dec 23, 2023 that will close this issue