busctl remains stuck even after mount-monitor-dispatch leaves the rate limit state #30573
Comments
One more journal with interesting logs:
Full journal: https://mrc0mmand.fedorapeople.org/busctl_session_timeout/system.journal.3
This looks like a bug in the mount ratelimiting machinery that was uncovered by @yuwata's recently added tests for bootctl, which do a lot of (un)mounting in a short timespan:
It can be somewhat reliably triggered with the following crude reproducer:

diff --git a/test/units/testsuite-74.sh b/test/units/testsuite-74.sh
index 9c2a033aa9..61851f4e26 100755
--- a/test/units/testsuite-74.sh
+++ b/test/units/testsuite-74.sh
@@ -6,6 +6,18 @@ set -o pipefail
 # shellcheck source=test/units/test-control.sh
 . "$(dirname "$0")"/test-control.sh
 
-run_subtests
+mkdir -p /tmp/mnt{0..9};
+while :; do
+    n=$((RANDOM % 10))
+    for i in $(seq 0 $n); do
+        systemd-mount /dev/disk/by-label/systemd_boot /tmp/mnt$i
+    done
+    for i in $(seq 0 $n); do
+        systemd-umount /tmp/mnt$i
+    done
+
+    SYSTEMD_BUS_TIMEOUT=10 busctl status --user --machine=testuser@.host
+    systemctl stop user@4711.service
+done
 
 touch /testok

by simply running:
(it's important to not forward logs to console, since that slows the whole test down)
I suspect this could be worked around in the CI by adding
(Hopefully) a temporary workaround for systemd#30573 where starting a user session when PID 1 is rate limited stalls even after it leaves the rate limited state:

[ 11.658201] H systemd[1]: Sent message type=signal sender=n/a destination=n/a path=/org/freedesktop/systemd1 interface=org.freedesktop.systemd1.Manager member=UnitRemoved cookie=4208 reply_cookie=0 signature=so error-name=n/a error-mes>
[ 11.658233] H systemd[1]: Event source 0x559babdd8bb0 (mount-monitor-dispatch) left rate limit state.
[ 101.562697] H busctl[784]: Failed to get credentials: Transport endpoint is not connected
[ 101.563480] H systemd[1]: systemd-journald.service: Got notification message from PID 300 (WATCHDOG=1)
[ 101.563725] H testsuite-74.sh[784]: BusAddress=unixexec:path=systemd-run,argv1=-M.host,argv2=-PGq,argv3=--wait,argv4=-pUser%3dtestuser,argv5=-pPAMName%3dlogin,argv6=systemd-stdio-bridge,argv7=-punix:path%3d%24%7bXDG_RUNTIME_DIR%7d/bus
[ 101.564136] H systemd[1]: Successfully forked off '(sd-expire)' as PID 787.
[ 101.564754] H systemd[1]: Successfully forked off '(sd-expire)' as PID 788.
[ 101.564831] H testsuite-74.sh[381]: + echo 'Subtest /usr/lib/systemd/tests/testdata/units/testsuite-74.busctl.sh failed'

The issue appeared after ee07fff, which does a bunch of mounts/umounts that get PID 1 into a rate limited state, and is frequent enough to be annoying, so let's temporarily bump the rate limit to alleviate that.
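To put a number on the stall in the excerpt above, here is a minimal sketch that measures the time between two journal lines. The `stall_gap` helper is hypothetical (it is not part of systemd or its test suite), and it assumes each line starts with a `[ <seconds>]` timestamp as in the excerpt.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of systemd: print the number of seconds
# between the first line matching pattern $1 and the first later line
# matching pattern $2. Reads "[ <seconds>] ..." style lines from stdin.
stall_gap() {
    awk -v start_re="$1" -v end_re="$2" '
        # Extract the timestamp between the leading brackets.
        function ts(line,    s) {
            s = line
            sub(/^\[ */, "", s)   # drop the opening bracket and padding
            sub(/\].*$/, "", s)   # drop everything from the first "]" on
            return s + 0
        }
        !seen_start && $0 ~ start_re { start = ts($0); seen_start = 1; next }
        seen_start && $0 ~ end_re    { printf "%.6f\n", ts($0) - start; exit }
    '
}

# The two relevant lines from the excerpt above.
stall_gap 'left rate limit state' 'Failed to get credentials' <<'EOF'
[ 11.658233] H systemd[1]: Event source 0x559babdd8bb0 (mount-monitor-dispatch) left rate limit state.
[ 101.562697] H busctl[784]: Failed to get credentials: Transport endpoint is not connected
EOF
```

For these two lines the result is 89.904464, i.e. busctl was still stuck roughly 90 seconds after the event source left the rate limit state.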
Hmpf, this doesn't seem to be a recent regression, since with the reproducer above I can reproduce it even on v253.
I tweaked the reproducer a bit [0], and apart from just confirming where the timeout occurs:
I also noticed that when you have some background process that periodically queries PID1's bus (the commented out "heartbeat" stuff in the reproducer), the deadlock(?) doesn't seem to happen (or I'm just extremely lucky).

[0] reproducer
Note: it may still take up to ~300 iterations to hit the deadlock, YMMV.

diff --git a/test/TEST-74-AUX-UTILS/test.sh b/test/TEST-74-AUX-UTILS/test.sh
index d870d57dcc..164602f166 100755
--- a/test/TEST-74-AUX-UTILS/test.sh
+++ b/test/TEST-74-AUX-UTILS/test.sh
@@ -9,7 +9,7 @@ NSPAWN_ARGUMENTS="--private-network"
 . "${TEST_BASE_DIR:?}/test-functions"
 
 # (Hopefully) a temporary workaround for https://github.com/systemd/systemd/issues/30573
-KERNEL_APPEND="${KERNEL_APPEND:-} SYSTEMD_DEFAULT_MOUNT_RATE_LIMIT_BURST=100"
+KERNEL_APPEND="${KERNEL_APPEND:-}"
 
 # Make sure vsock is available in the VM
 CID=$((RANDOM + 3))
diff --git a/test/test-functions b/test/test-functions
index 80fdcd26b9..6c2546f0ea 100644
--- a/test/test-functions
+++ b/test/test-functions
@@ -1464,6 +1464,9 @@ install_systemd() {
     # Remove unneeded documentation
     rm -fr "${initdir:?}"/usr/share/{man,doc}
 
+    mkdir -p "$initdir/etc/systemd/journald.conf.d/"
+    echo -ne "[Journal]\nRateLimitBurst=0\n" >"$initdir/etc/systemd/journald.conf.d/99-ratelimit.conf"
+
     # Enable debug logging in PID1
     mkdir -p "$initdir/etc/systemd/system.conf.d/"
     echo -ne "[Manager]\nLogLevel=debug\n" >"$initdir/etc/systemd/system.conf.d/10-log-level.conf"
diff --git a/test/units/testsuite-74.sh b/test/units/testsuite-74.sh
index 9c2a033aa9..10413e1ae4 100755
--- a/test/units/testsuite-74.sh
+++ b/test/units/testsuite-74.sh
@@ -3,9 +3,33 @@
 set -eux
 set -o pipefail
 
-# shellcheck source=test/units/test-control.sh
-. "$(dirname "$0")"/test-control.sh
+systemctl mask --now --runtime systemd-oomd.{socket,service}
 
-run_subtests
+trap "cat /strace.log || :" EXIT
+
+heartbeat() {
+    while :; do
+        systemctl is-active systemd-journald.service
+        sleep 5
+    done
+}
+
+#heartbeat &
+#disown $!
+
+mkdir -p /tmp/mnt{0..9};
+for i in {0..500}; do
+    echo $i >/dev/ttyS0
+    n=$((RANDOM % 10))
+    for i in $(seq 0 $n); do
+        systemd-mount /dev/disk/by-label/systemd_boot /tmp/mnt$i
+    done
+    for i in $(seq 0 $n); do
+        systemd-umount /tmp/mnt$i
+    done
+
+    SYSTEMD_BUS_TIMEOUT=10 strace -ftttyv -s 500 -o /strace.log busctl status --user --machine=testuser@.host
+    systemctl stop user@4711.service
+done
 
 touch /testok
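For anyone who wants to toggle the heartbeat from the reproducer above without leaving a stray background loop behind, a small sketch follows. The `start_heartbeat` and `stop_heartbeat` helpers and the `HEARTBEAT_PID` variable are hypothetical names made up for illustration; only the polled command (`systemctl is-active systemd-journald.service`) and the 5 second interval come from the reproducer.

```shell
#!/usr/bin/env bash
# Hypothetical helpers, not part of the systemd test suite.

# Run "$@" every $1 seconds in a background subshell and remember its PID.
start_heartbeat() {
    local interval=$1
    shift
    (
        while :; do
            "$@" || :            # ignore failures, keep polling
            sleep "$interval"
        done
    ) >/dev/null 2>&1 &
    HEARTBEAT_PID=$!
}

# Stop the background loop started by start_heartbeat, if any.
stop_heartbeat() {
    [ -n "${HEARTBEAT_PID:-}" ] || return 0
    kill "$HEARTBEAT_PID" 2>/dev/null || :
    wait "$HEARTBEAT_PID" 2>/dev/null || :
    HEARTBEAT_PID=
}

# Usage inside the reproducer (commented out so that sourcing this file
# has no side effects):
#   start_heartbeat 5 systemctl is-active systemd-journald.service
#   trap 'stop_heartbeat; cat /strace.log || :' EXIT
```

Note that `stop_heartbeat` only kills the polling subshell; a `sleep` that is already running may linger until its interval expires.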
@mrc0mmand I cannot reproduce the issue with your reproducer on the current git HEAD (3037616). Can you still reproduce it?
This reverts commit 3f4b00a. The issue systemd#30573 seems to be fixed somehow. Let's revert the workaround.
As mentioned in #31863 (review), the issue seems to be gone (or at least I can't reproduce it locally anymore either), so let's close this.
I forgot the details, but I recently saw a similar failure.
If it was in CentOS CI, I'll eventually get to it (there are currently 185 unread emails from CentOS CI in my inbox, so it might take a couple of days).
Yup, you're absolutely right:
It's the same scenario as the original issue: the
A couple of full journals:
Title changed from "busctl test from TEST-74-AUX-UTILS became unstable" to "busctl remains stuck even after mount-monitor-dispatch leaves the rate limit state"
The same workaround commit message as above, with its backport trail: (cherry picked from commit c707e34) (cherry picked from commit 4d60fb7) (cherry picked from commit 7215018)
After going through a pile of failed CentOS CI jobs, I noticed there's a new unstable test which might require some attention:
Some example journals:
I haven't dug deeper into this yet; just opening this for reference.