Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

busctl remains stuck even after mount-monitor-dispatch leaves the rate limit state #30573

Open
mrc0mmand opened this issue Dec 21, 2023 · 9 comments
Labels
bug 🐛 Programming errors, that need preferential fixing busctl tests
Milestone

Comments

@mrc0mmand
Copy link
Member

After going through a pile of failed CentOS CI jobs, I noticed there's a new unstable test which might require some attention:

$ journalctl --file system.journal -o short-monotonic --no-hostname 
...
[   11.017441] testsuite-74.sh[772]: + busctl status --user --machine=testuser@.host
...
[   11.157277] (o-bridge)[788]: pam_unix(login:session): session opened for user testuser(uid=4711) by testuser(uid=0)
...
[   11.239340] systemd[1]: Sent message type=signal sender=n/a destination=n/a path=/org/freedesktop/systemd1/job/762 interface=org.freedesktop.DBus.Properties member=PropertiesChanged cookie=4889 reply_cookie=0 signature=sa{sv}as error-n>
[   11.239371] systemd[1]: Sent message type=signal sender=n/a destination=n/a path=/org/freedesktop/systemd1/job/764 interface=org.freedesktop.DBus.Properties member=PropertiesChanged cookie=4890 reply_cookie=0 signature=sa{sv}as error-n>
[   11.239403] systemd[1]: home-testuser.mount: Collecting.
[   11.239438] systemd[1]: Sent message type=signal sender=n/a destination=n/a path=/org/freedesktop/systemd1 interface=org.freedesktop.systemd1.Manager member=UnitRemoved cookie=4891 reply_cookie=0 signature=so error-name=n/a error-messa>
[   11.239469] systemd[1]: Event source 0x55946d5a1c00 (mount-monitor-dispatch) left rate limit state.
[  101.114526] busctl[786]: Failed to get credentials: Transport endpoint is not connected
[  101.116390] systemd[1]: systemd-journald.service: Got notification message from PID 297 (WATCHDOG=1)
[  101.116753] testsuite-74.sh[786]: BusAddress=unixexec:path=systemd-run,argv1=-M.host,argv2=-PGq,argv3=--wait,argv4=-pUser%3dtestuser,argv5=-pPAMName%3dlogin,argv6=systemd-stdio-bridge,argv7=-punix:path%3d%24%7bXDG_RUNTIME_DIR%7d/bus
[  101.116815] systemd[1]: Successfully forked off '(sd-expire)' as PID 789.
[  101.116877] testsuite-74.sh[381]: + echo 'Subtest /usr/lib/systemd/tests/testdata/units/testsuite-74.busctl.sh failed'
[  101.116877] testsuite-74.sh[381]: Subtest /usr/lib/systemd/tests/testdata/units/testsuite-74.busctl.sh failed
[  101.116877] testsuite-74.sh[381]: + return 1
...
[  101.148830] systemd[1]: Unable to add match type='signal',sender='org.freedesktop.DBus',path='/org/freedesktop/DBus',interface='org.freedesktop.DBus',member='NameOwnerChanged',arg0=':1.0', failing connection: Method call timed out
[  101.148906] systemd[1]: Bus bus-api-system: changing state RUNNING → CLOSING
[  101.149138] systemd[1]: Bus bus-api-system: changing state CLOSING → CLOSED
[  101.150816] systemd[1]: session-1.scope: starting held back, waiting for: user@4711.service
[  101.191085] systemd[1]: user-4711.slice changed dead -> active
[  101.191163] systemd[1]: user-4711.slice: Job 762 user-4711.slice/start finished, result=done
[  101.191221] systemd[1]: Created slice user-4711.slice.

Some example journals:

I didn't dig deeper into this, yet, just opening this for a reference.

@mrc0mmand mrc0mmand added bug 🐛 Programming errors, that need preferential fixing tests labels Dec 21, 2023
@yuwata yuwata added this to the v256 milestone Dec 21, 2023
@mrc0mmand
Copy link
Member Author

One more journal with interesting logs:

[   11.467615] H testsuite-74.sh[770]: + busctl status --user --machine=testuser@.host
[   11.524340] H systemd[1]: Started run-u9.service.
[   11.600801] H dbus-daemon[384]: [system] Activating via systemd: service name='org.freedesktop.home1' unit='dbus-org.freedesktop.home1.service' requested by ':1.10' (uid=0 pid=786 comm="(o-bridge)")
[   11.601339] H dbus-daemon[384]: [system] Activation via systemd failed for unit 'dbus-org.freedesktop.home1.service': Unit dbus-org.freedesktop.home1.service not found.
[   11.604893] H (o-bridge)[786]: pam_unix(login:session): session opened for user testuser(uid=4711) by testuser(uid=0)
[  101.562697] H busctl[784]: Failed to get credentials: Transport endpoint is not connected
[  101.563725] H testsuite-74.sh[784]: BusAddress=unixexec:path=systemd-run,argv1=-M.host,argv2=-PGq,argv3=--wait,argv4=-pUser%3dtestuser,argv5=-pPAMName%3dlogin,argv6=systemd-stdio-bridge,argv7=-punix:path%3d%24%7bXDG_RUNTIME_DIR%7d/bus
[  101.564831] H testsuite-74.sh[381]: + echo 'Subtest /usr/lib/systemd/tests/testdata/units/testsuite-74.busctl.sh failed'
[  101.564831] H testsuite-74.sh[381]: Subtest /usr/lib/systemd/tests/testdata/units/testsuite-74.busctl.sh failed
[  101.564831] H testsuite-74.sh[381]: + return 1
[  101.565287] H systemd[1]: systemd-oomd.service: Unexpected error response from GetNameOwner(): Connection terminated
[  101.565533] H systemd[1]: testsuite-74.service: Main process exited, code=exited, status=1/FAILURE
[  101.565693] H systemd[1]: testsuite-74.service: Failed with result 'exit-code'.
[  101.591049] H systemd[1]: Failed to start testsuite-74.service.
[  101.591241] H systemd[1]: testsuite-74.service: Consumed 1.796s CPU time, 18.2M memory peak, 0B memory swap peak.
[  101.592446] H systemd-logind[378]: Failed to start session scope session-1.scope: Message recipient disconnected from message bus without replying
[  101.592853] H (o-bridge)[786]: pam_systemd(login:session): Failed to create session: Message recipient disconnected from message bus without replying
[  101.594660] H (o-bridge)[786]: run-u9.service: Referenced but unset environment variable evaluates to an empty string: XDG_RUNTIME_DIR
[  101.621700] H systemd-stdio-bridge[786]: Failed to start bus client: No such file or directory
[  101.621778] H (sd-pam)[789]: pam_unix(login:session): session closed for user testuser
[  101.622089] H systemd[1]: Created slice user-4711.slice.
[  101.623056] H systemd[1]: Reached target testsuite.target.
[  101.623417] H systemd[1]: Starting end.service...
[  101.623732] H systemd[1]: Starting user-runtime-dir@4711.service...
[  101.632656] H end.sh[790]: + set -o pipefail
[  101.633210] H end.sh[792]: + journalctl -q -o short-monotonic --grep 'didn'\''t pass validation'
[  101.678035] H systemd[1]: run-u9.service: Main process exited, code=exited, status=1/FAILURE
[  101.678087] H systemd[1]: run-u9.service: Failed with result 'exit-code'.

Full journal: https://mrc0mmand.fedorapeople.org/busctl_session_timeout/system.journal.3

YHNdnzj added a commit to YHNdnzj/systemd that referenced this issue Dec 25, 2023
YHNdnzj added a commit to YHNdnzj/systemd that referenced this issue Dec 25, 2023
YHNdnzj added a commit to YHNdnzj/systemd that referenced this issue Dec 25, 2023
@YHNdnzj YHNdnzj added the busctl label Dec 25, 2023
yuwata pushed a commit that referenced this issue Dec 25, 2023
@mrc0mmand
Copy link
Member Author

This looks like a bug in the mount ratelimiting machinery that was uncovered by @yuwata's recently added tests for bootctl, which do a lot of (un)mounting in a short timespan:

[   11.658201] H systemd[1]: Sent message type=signal sender=n/a destination=n/a path=/org/freedesktop/systemd1 interface=org.freedesktop.systemd1.Manager member=UnitRemoved cookie=4208 reply_cookie=0 signature=so error-name=n/a error-mes>
[   11.658233] H systemd[1]: Event source 0x559babdd8bb0 (mount-monitor-dispatch) left rate limit state.
[  101.562697] H busctl[784]: Failed to get credentials: Transport endpoint is not connected
[  101.563480] H systemd[1]: systemd-journald.service: Got notification message from PID 300 (WATCHDOG=1)
[  101.563725] H testsuite-74.sh[784]: BusAddress=unixexec:path=systemd-run,argv1=-M.host,argv2=-PGq,argv3=--wait,argv4=-pUser%3dtestuser,argv5=-pPAMName%3dlogin,argv6=systemd-stdio-bridge,argv7=-punix:path%3d%24%7bXDG_RUNTIME_DIR%7d/bus
[  101.564136] H systemd[1]: Successfully forked off '(sd-expire)' as PID 787.
[  101.564754] H systemd[1]: Successfully forked off '(sd-expire)' as PID 788.
[  101.564831] H testsuite-74.sh[381]: + echo 'Subtest /usr/lib/systemd/tests/testdata/units/testsuite-74.busctl.sh failed'

It can be somewhat reliably triggered with following crude reproducer:

diff --git a/test/units/testsuite-74.sh b/test/units/testsuite-74.sh
index 9c2a033aa9..61851f4e26 100755
--- a/test/units/testsuite-74.sh
+++ b/test/units/testsuite-74.sh
@@ -6,6 +6,18 @@ set -o pipefail
 # shellcheck source=test/units/test-control.sh
 . "$(dirname "$0")"/test-control.sh
 
-run_subtests
+mkdir -p /tmp/mnt{0..9};
+while :; do
+       n=$((RANDOM % 10))
+       for i in $(seq 0 $n); do
+               systemd-mount /dev/disk/by-label/systemd_boot /tmp/mnt$i
+       done
+       for i in $(seq 0 $n); do
+               systemd-umount /tmp/mnt$i
+       done
+
+       SYSTEMD_BUS_TIMEOUT=10 busctl status --user --machine=testuser@.host
+       systemctl stop user@4711.service
+done
 
 touch /testok

by simply running:

# make -C test/TEST-74-AUX-UTILS/ clean setup run BUILD_DIR=$PWD/build TEST_NO_NSPAWN=1

(it's important to not forward logs to console, since that slows the whole test down)

I suspect this could be worked around in the CI by adding SYSTEMD_DEFAULT_MOUNT_RATE_LIMIT_BURST=100 to the qemu kernel command line until a proper fix is prepared, assuming this is a bug (but I find it strange that the busctl call/user daemon start stalls even after we leave the rate limit state).

mrc0mmand added a commit to mrc0mmand/systemd that referenced this issue Jan 3, 2024
(Hopefully) a temporary workaround for systemd#30573 where the starting the
user session when PID 1 is rate limited stalls even after it leaves the
rate limited state:

[   11.658201] H systemd[1]: Sent message type=signal sender=n/a destination=n/a path=/org/freedesktop/systemd1 interface=org.freedesktop.systemd1.Manager member=UnitRemoved cookie=4208 reply_cookie=0 signature=so error-name=n/a error-mes>
[   11.658233] H systemd[1]: Event source 0x559babdd8bb0 (mount-monitor-dispatch) left rate limit state.
[  101.562697] H busctl[784]: Failed to get credentials: Transport endpoint is not connected
[  101.563480] H systemd[1]: systemd-journald.service: Got notification message from PID 300 (WATCHDOG=1)
[  101.563725] H testsuite-74.sh[784]: BusAddress=unixexec:path=systemd-run,argv1=-M.host,argv2=-PGq,argv3=--wait,argv4=-pUser%3dtestuser,argv5=-pPAMName%3dlogin,argv6=systemd-stdio-bridge,argv7=-punix:path%3d%24%7bXDG_RUNTIME_DIR%7d/bus
[  101.564136] H systemd[1]: Successfully forked off '(sd-expire)' as PID 787.
[  101.564754] H systemd[1]: Successfully forked off '(sd-expire)' as PID 788.
[  101.564831] H testsuite-74.sh[381]: + echo 'Subtest /usr/lib/systemd/tests/testdata/units/testsuite-74.busctl.sh failed'

The issue appeared after ee07fff which does a bunch of mounts/umounts
that get is into the rate limited state, and is frequent enough to be
annoying, so let's temporarily bump the rate limit to alleviate that.
mrc0mmand added a commit to mrc0mmand/systemd that referenced this issue Jan 3, 2024
(Hopefully) a temporary workaround for systemd#30573 where starting a user
session when PID 1 is rate limited stalls even after it leaves the rate
limited state:

[   11.658201] H systemd[1]: Sent message type=signal sender=n/a destination=n/a path=/org/freedesktop/systemd1 interface=org.freedesktop.systemd1.Manager member=UnitRemoved cookie=4208 reply_cookie=0 signature=so error-name=n/a error-mes>
[   11.658233] H systemd[1]: Event source 0x559babdd8bb0 (mount-monitor-dispatch) left rate limit state.
[  101.562697] H busctl[784]: Failed to get credentials: Transport endpoint is not connected
[  101.563480] H systemd[1]: systemd-journald.service: Got notification message from PID 300 (WATCHDOG=1)
[  101.563725] H testsuite-74.sh[784]: BusAddress=unixexec:path=systemd-run,argv1=-M.host,argv2=-PGq,argv3=--wait,argv4=-pUser%3dtestuser,argv5=-pPAMName%3dlogin,argv6=systemd-stdio-bridge,argv7=-punix:path%3d%24%7bXDG_RUNTIME_DIR%7d/bus
[  101.564136] H systemd[1]: Successfully forked off '(sd-expire)' as PID 787.
[  101.564754] H systemd[1]: Successfully forked off '(sd-expire)' as PID 788.
[  101.564831] H testsuite-74.sh[381]: + echo 'Subtest /usr/lib/systemd/tests/testdata/units/testsuite-74.busctl.sh failed'

The issue appeared after ee07fff which does a bunch of mounts/umounts
that get is into the rate limited state, and is frequent enough to be
annoying, so let's temporarily bump the rate limit to alleviate that.
mrc0mmand added a commit to mrc0mmand/systemd that referenced this issue Jan 3, 2024
(Hopefully) a temporary workaround for systemd#30573 where starting a user
session when PID 1 is rate limited stalls even after it leaves the rate
limited state:

[   11.658201] H systemd[1]: Sent message type=signal sender=n/a destination=n/a path=/org/freedesktop/systemd1 interface=org.freedesktop.systemd1.Manager member=UnitRemoved cookie=4208 reply_cookie=0 signature=so error-name=n/a error-mes>
[   11.658233] H systemd[1]: Event source 0x559babdd8bb0 (mount-monitor-dispatch) left rate limit state.
[  101.562697] H busctl[784]: Failed to get credentials: Transport endpoint is not connected
[  101.563480] H systemd[1]: systemd-journald.service: Got notification message from PID 300 (WATCHDOG=1)
[  101.563725] H testsuite-74.sh[784]: BusAddress=unixexec:path=systemd-run,argv1=-M.host,argv2=-PGq,argv3=--wait,argv4=-pUser%3dtestuser,argv5=-pPAMName%3dlogin,argv6=systemd-stdio-bridge,argv7=-punix:path%3d%24%7bXDG_RUNTIME_DIR%7d/bus
[  101.564136] H systemd[1]: Successfully forked off '(sd-expire)' as PID 787.
[  101.564754] H systemd[1]: Successfully forked off '(sd-expire)' as PID 788.
[  101.564831] H testsuite-74.sh[381]: + echo 'Subtest /usr/lib/systemd/tests/testdata/units/testsuite-74.busctl.sh failed'

The issue appeared after ee07fff which does a bunch of mounts/umounts
that get PID 1 into a rate limited state, and is frequent enough to be
annoying, so let's temporarily bump the rate limit to alleviate that.
yuwata pushed a commit that referenced this issue Jan 3, 2024
(Hopefully) a temporary workaround for #30573 where starting a user
session when PID 1 is rate limited stalls even after it leaves the rate
limited state:

[   11.658201] H systemd[1]: Sent message type=signal sender=n/a destination=n/a path=/org/freedesktop/systemd1 interface=org.freedesktop.systemd1.Manager member=UnitRemoved cookie=4208 reply_cookie=0 signature=so error-name=n/a error-mes>
[   11.658233] H systemd[1]: Event source 0x559babdd8bb0 (mount-monitor-dispatch) left rate limit state.
[  101.562697] H busctl[784]: Failed to get credentials: Transport endpoint is not connected
[  101.563480] H systemd[1]: systemd-journald.service: Got notification message from PID 300 (WATCHDOG=1)
[  101.563725] H testsuite-74.sh[784]: BusAddress=unixexec:path=systemd-run,argv1=-M.host,argv2=-PGq,argv3=--wait,argv4=-pUser%3dtestuser,argv5=-pPAMName%3dlogin,argv6=systemd-stdio-bridge,argv7=-punix:path%3d%24%7bXDG_RUNTIME_DIR%7d/bus
[  101.564136] H systemd[1]: Successfully forked off '(sd-expire)' as PID 787.
[  101.564754] H systemd[1]: Successfully forked off '(sd-expire)' as PID 788.
[  101.564831] H testsuite-74.sh[381]: + echo 'Subtest /usr/lib/systemd/tests/testdata/units/testsuite-74.busctl.sh failed'

The issue appeared after ee07fff which does a bunch of mounts/umounts
that get PID 1 into a rate limited state, and is frequent enough to be
annoying, so let's temporarily bump the rate limit to alleviate that.
intelfx pushed a commit to intelfx/systemd that referenced this issue Feb 2, 2024
(Hopefully) a temporary workaround for systemd#30573 where starting a user
session when PID 1 is rate limited stalls even after it leaves the rate
limited state:

[   11.658201] H systemd[1]: Sent message type=signal sender=n/a destination=n/a path=/org/freedesktop/systemd1 interface=org.freedesktop.systemd1.Manager member=UnitRemoved cookie=4208 reply_cookie=0 signature=so error-name=n/a error-mes>
[   11.658233] H systemd[1]: Event source 0x559babdd8bb0 (mount-monitor-dispatch) left rate limit state.
[  101.562697] H busctl[784]: Failed to get credentials: Transport endpoint is not connected
[  101.563480] H systemd[1]: systemd-journald.service: Got notification message from PID 300 (WATCHDOG=1)
[  101.563725] H testsuite-74.sh[784]: BusAddress=unixexec:path=systemd-run,argv1=-M.host,argv2=-PGq,argv3=--wait,argv4=-pUser%3dtestuser,argv5=-pPAMName%3dlogin,argv6=systemd-stdio-bridge,argv7=-punix:path%3d%24%7bXDG_RUNTIME_DIR%7d/bus
[  101.564136] H systemd[1]: Successfully forked off '(sd-expire)' as PID 787.
[  101.564754] H systemd[1]: Successfully forked off '(sd-expire)' as PID 788.
[  101.564831] H testsuite-74.sh[381]: + echo 'Subtest /usr/lib/systemd/tests/testdata/units/testsuite-74.busctl.sh failed'

The issue appeared after ee07fff which does a bunch of mounts/umounts
that get PID 1 into a rate limited state, and is frequent enough to be
annoying, so let's temporarily bump the rate limit to alleviate that.

(cherry picked from commit c707e34)
@mrc0mmand
Copy link
Member Author

Hmpf, this doesn't seem to be a recent regression, since with the reproducer above I can reproduce it even on v253.

@mrc0mmand
Copy link
Member Author

I tweaked the reproducer a bit [0], and apart from just confirming where the timeout occurs:

[  297.557487] testsuite-74.sh[273]: + SYSTEMD_BUS_TIMEOUT=10
[  297.557487] testsuite-74.sh[273]: + strace -ftttyv -s 500 -o /strace.log busctl status --user --machine=testuser@.host
[  297.617115] busctl[9192]: status: sd_bus_get_address() returned 0
[  297.617650] busctl[9192]: status: sd_bus_get_scope() returned -61
...
[  297.745642] systemd[1]: home-testuser.mount: Collecting.
[  297.745714] systemd[1]: Sent message type=signal sender=n/a destination=n/a path=/org/freedesktop/systemd1 interface=org.freedesktop.systemd1.Manager member=UnitRemoved cookie=75561 reply_cookie=0 signature=so error-name=n/a error-mess>
[  297.745782] systemd[1]: Event source 0x5684d8b3b480 (mount-monitor-dispatch) left rate limit state.
[  307.751642] testsuite-74.sh[9193]: Failed to query unit state: Connection timed out
...
[  387.697584] busctl[9192]: status: sd_bus_get_bus_id() returned -107
[  387.698185] busctl[9192]: status: sd_bus_get_owner_creds() returned -107
[  387.698563] busctl[9192]: Failed to get credentials: Transport endpoint is not connected

I also noticed that when you have some background process that periodically queries PID1's bus (the commented out "heartbeat" stuff in the reproducer), the deadlock(?) doesn't seem to happen (or I'm just extremely lucky).

[0]

reproducer Note: it may still take up to ~300 iterations to hit the deadlock, YMMV.
diff --git a/test/TEST-74-AUX-UTILS/test.sh b/test/TEST-74-AUX-UTILS/test.sh
index d870d57dcc..164602f166 100755
--- a/test/TEST-74-AUX-UTILS/test.sh
+++ b/test/TEST-74-AUX-UTILS/test.sh
@@ -9,7 +9,7 @@ NSPAWN_ARGUMENTS="--private-network"
 . "${TEST_BASE_DIR:?}/test-functions"
 
 # (Hopefully) a temporary workaround for https://github.com/systemd/systemd/issues/30573
-KERNEL_APPEND="${KERNEL_APPEND:-} SYSTEMD_DEFAULT_MOUNT_RATE_LIMIT_BURST=100"
+KERNEL_APPEND="${KERNEL_APPEND:-}"
 
 # Make sure vsock is available in the VM
 CID=$((RANDOM + 3))
diff --git a/test/test-functions b/test/test-functions
index 80fdcd26b9..6c2546f0ea 100644
--- a/test/test-functions
+++ b/test/test-functions
@@ -1464,6 +1464,9 @@ install_systemd() {
     # Remove unneeded documentation
     rm -fr "${initdir:?}"/usr/share/{man,doc}
 
+    mkdir -p "$initdir/etc/systemd/journald.conf.d/"
+    echo -ne "[Journal]\nRateLimitBurst=0\n" >"$initdir/etc/systemd/journald.conf.d/99-ratelimit.conf"
+
     # Enable debug logging in PID1
     mkdir -p "$initdir/etc/systemd/system.conf.d/"
     echo -ne "[Manager]\nLogLevel=debug\n" >"$initdir/etc/systemd/system.conf.d/10-log-level.conf"
diff --git a/test/units/testsuite-74.sh b/test/units/testsuite-74.sh
index 9c2a033aa9..10413e1ae4 100755
--- a/test/units/testsuite-74.sh
+++ b/test/units/testsuite-74.sh
@@ -3,9 +3,33 @@
 set -eux
 set -o pipefail
 
-# shellcheck source=test/units/test-control.sh
-. "$(dirname "$0")"/test-control.sh
+systemctl mask --now --runtime systemd-oomd.{socket,service}
 
-run_subtests
+trap "cat /strace.log || :" EXIT
+
+heartbeat() {
+       while :; do
+               systemctl is-active systemd-journald.service
+               sleep 5
+       done
+}
+
+#heartbeat &
+#disown $!
+
+mkdir -p /tmp/mnt{0..9};
+for i in {0..500}; do
+       echo $i >/dev/ttyS0
+       n=$((RANDOM % 10))
+       for i in $(seq 0 $n); do
+               systemd-mount /dev/disk/by-label/systemd_boot /tmp/mnt$i
+       done
+       for i in $(seq 0 $n); do
+               systemd-umount /tmp/mnt$i
+       done
+
+       SYSTEMD_BUS_TIMEOUT=10 strace -ftttyv -s 500 -o /strace.log busctl status --user --machine=testuser@.host
+       systemctl stop user@4711.service
+done
 
 touch /testok

@yuwata
Copy link
Member

yuwata commented Mar 20, 2024

@mrc0mmand I cannot reproduce the issue with your reproducer with the current git HEAD (3037616). Do you still reproduce the issue?

yuwata added a commit to yuwata/systemd that referenced this issue Mar 20, 2024
This reverts commit 3f4b00a.

The issue systemd#30573 seems to be fixed somehow. Let's revert the workaround.
@mrc0mmand
Copy link
Member Author

As mentioned in #31863 (review) the issue seems to be gone (or at least I can't reproduce it locally anymore as well), so let's close this.

mrc0mmand pushed a commit that referenced this issue Mar 20, 2024
This reverts commit 3f4b00a.

The issue #30573 seems to be fixed somehow. Let's revert the workaround.
chunyi-wu pushed a commit to chunyi-wu/systemd that referenced this issue Apr 3, 2024
This reverts commit 3f4b00a.

The issue systemd#30573 seems to be fixed somehow. Let's revert the workaround.
@yuwata
Copy link
Member

yuwata commented Apr 20, 2024

I forgot details, but I recently found similar failure.

@mrc0mmand
Copy link
Member Author

I forgot details, but I recently found similar failure.

If it was in CentOS CI, I'll eventually get to it (there's currently 185 unread emails from CentOS CI in my inbox, so it might take a couple of days).

@mrc0mmand
Copy link
Member Author

mrc0mmand commented Apr 23, 2024

Yup, you're absolutely right:

...
[   11.409921] systemd[1]: systemd-journald.service: Got notification message from PID 301 (FDSTORE=1)
[   11.409976] systemd[1]: systemd-journald.service: Trying to stash fd for dev=0:8/inode=18092
[   11.410146] systemd[1]: systemd-journald.service: Added fd 89 (stored) to fd store.
[   11.413109] systemd[1]: Event source 0x56e785fee0b0 (mount-monitor-dispatch) entered rate limit state.
[   11.510414] kernel: BTRFS info (device loop0p3): last unmount of filesystem 568a8c47-3d57-48fa-a586-2521ab29cb45
[   11.416977] systemd[1]: Received SIGCHLD from PID 749 (3).
...
[   11.892361] testsuite-74.sh[793]: + busctl status --user --machine=testuser@.host
[   11.904714] systemd[1]: Got message type=method_call sender=:1.10 destination=org.freedesktop.systemd1 path=/org/freedesktop/systemd1 interface=org.freedesktop.DBus.Properties member=Get  cookie=2 reply_cookie=0 signature=ss error-name=n/a error-message=n/a
[   11.904837] systemd[1]: Sent message type=method_return sender=n/a destination=:1.10 path=n/a interface=n/a member=n/a cookie=4898 reply_cookie=2 signature=v error-name=n/a error-message=n/a
[   11.905004] systemd[1]: Got message type=method_call sender=:1.10 destination=org.freedesktop.systemd1 path=/org/freedesktop/systemd1 interface=org.freedesktop.systemd1.Manager member=StartTransientUnit  cookie=3 reply_cookie=0 signature=ssa(sv)a(sa(sv)) error-name=n/a error-message=n/a
[   11.905073] systemd[1]: Trying to acquire polkit authentication for 'org.freedesktop.systemd1.manage-units'.
[   11.905117] systemd[1]: Sent message type=method_call sender=n/a destination=org.freedesktop.DBus path=/org/freedesktop/DBus interface=org.freedesktop.DBus member=GetConnectionUnixUser cookie=4899 reply_cookie=0 signature=s error-name=n/a error-message=n/a
[   11.905164] systemd[1]: Got message type=method_return sender=org.freedesktop.DBus destination=:1.1 path=n/a interface=n/a member=n/a  cookie=4294967295 reply_cookie=4899 signature=u error-name=n/a error-message=n/a
[   11.905289] systemd[1]: run-u10.service: Failed to load configuration: No such file or directory
[   11.905542] systemd[1]: Sent message type=method_call sender=n/a destination=org.freedesktop.DBus path=/org/freedesktop/DBus interface=org.freedesktop.DBus member=AddMatch cookie=4900 reply_cookie=0 signature=s error-name=n/a error-message=n/a
[   11.905595] systemd[1]: Sent message type=method_call sender=n/a destination=org.freedesktop.DBus path=/org/freedesktop/DBus interface=org.freedesktop.DBus member=GetNameOwner cookie=4901 reply_cookie=0 signature=s error-name=n/a error-message=n/a
[   11.905668] systemd[1]: Got message type=method_return sender=org.freedesktop.DBus destination=:1.1 path=n/a interface=n/a member=n/a  cookie=4294967295 reply_cookie=4901 signature=s error-name=n/a error-message=n/a
[   11.906000] systemd[1]: run-u10.service: Trying to enqueue job run-u10.service/start/fail
[   11.906078] systemd[1]: Added job run-u10.service/start to transaction.
[   11.906124] systemd[1]: Pulling in sysinit.target/start from run-u10.service/start
[   11.906166] systemd[1]: Added job sysinit.target/start to transaction.
[   11.906207] systemd[1]: Pulling in proc-sys-fs-binfmt_misc.automount/start from sysinit.target/start
[   11.906247] systemd[1]: Added job proc-sys-fs-binfmt_misc.automount/start to transaction.
[   11.906289] systemd[1]: Pulling in shutdown.target/stop from proc-sys-fs-binfmt_misc.automount/start
[   11.906330] systemd[1]: Added job shutdown.target/stop to transaction.
[   11.906377] systemd[1]: Pulling in systemd-soft-reboot.service/stop from shutdown.target/stop
[   11.906423] systemd[1]: Added job systemd-soft-reboot.service/stop to transaction.
[   11.906461] systemd[1]: Pulling in soft-reboot.target/stop from systemd-soft-reboot.service/stop
[   11.906501] systemd[1]: Added job soft-reboot.target/stop to transaction.
[   11.906539] systemd[1]: Pulling in systemd-poweroff.service/stop from shutdown.target/stop
[   11.906577] systemd[1]: Added job systemd-poweroff.service/stop to transaction.
[   11.906615] systemd[1]: Pulling in poweroff.target/stop from systemd-poweroff.service/stop
[   11.906663] systemd[1]: Added job poweroff.target/stop to transaction.
...
[   12.035679] systemd[1]: dbus-org.freedesktop.home1.service: Collecting.
[   12.035751] systemd[1]: Sent message type=signal sender=n/a destination=n/a path=/org/freedesktop/systemd1 interface=org.freedesktop.systemd1.Manager member=UnitRemoved cookie=4938 reply_cookie=0 signature=so error-name=n/a error-message=n/a
[   12.035834] dbus-broker-launch[394]: Activation request for 'org.freedesktop.home1' failed: The systemd unit 'dbus-org.freedesktop.home1.service' could not be found.
[   12.048868] (o-bridge)[810]: pam_unix(login:session): session opened for user testuser(uid=4711) by testuser(uid=0)
[   12.049042] systemd[1]: varlink: New incoming connection.
...
[   12.140508] systemd[1]: Sent message type=signal sender=n/a destination=n/a path=/org/freedesktop/systemd1/job/821 interface=org.freedesktop.DBus.Properties member=PropertiesChanged cookie=5020 reply_cookie=0 signature=sa{sv}as error-name=n/a error-message=n/a
[   12.140541] systemd[1]: Sent message type=signal sender=n/a destination=n/a path=/org/freedesktop/systemd1/job/819 interface=org.freedesktop.DBus.Properties member=PropertiesChanged cookie=5021 reply_cookie=0 signature=sa{sv}as error-name=n/a error-message=n/a
[   12.140573] systemd[1]: home-testuser.mount: Collecting.
[   12.140607] systemd[1]: Sent message type=signal sender=n/a destination=n/a path=/org/freedesktop/systemd1 interface=org.freedesktop.systemd1.Manager member=UnitRemoved cookie=5022 reply_cookie=0 signature=so error-name=n/a error-message=n/a
[   12.140644] systemd[1]: Event source 0x56e785fee0b0 (mount-monitor-dispatch) left rate limit state.
[  101.988722] busctl[808]: Failed to get credentials: Transport endpoint is not connected
[  101.993090] systemd[1]: systemd-journald.service: Got notification message from PID 301 (WATCHDOG=1)
[  101.993694] testsuite-74.sh[808]: BusAddress=unixexec:path=systemd-run,argv1=-M.host,argv2=-PGq,argv3=--wait,argv4=-pUser%3dtestuser,argv5=-pPAMName%3dlogin,argv6=systemd-stdio-bridge,argv7=-punix:path%3d%24%7bXDG_RUNTIME_DIR%7d/bus
[  101.993802] systemd[1]: Successfully forked off '(sd-expire)' as PID 812.
[  101.993872] testsuite-74.sh[386]: + echo 'Subtest /usr/lib/systemd/tests/testdata/units/testsuite-74.busctl.sh failed'
[  101.993872] testsuite-74.sh[386]: Subtest /usr/lib/systemd/tests/testdata/units/testsuite-74.busctl.sh failed
[  101.993872] testsuite-74.sh[386]: + return 1
[  101.994664] systemd[1]: Received SIGCHLD from PID 386 (bash).
[  101.994734] systemd[1]: Child 386 (bash) died (code=exited, status=1/FAILURE)
[  101.994794] systemd[1]: testsuite-74.service: Child 386 belongs to testsuite-74.service.
[  101.994864] systemd[1]: testsuite-74.service: Main process exited, code=exited, status=1/FAILURE
[  101.994982] systemd[1]: testsuite-74.service: Failed with result 'exit-code'.

It's the same scenario as the original issue: the busctl command is stuck even after the mount source left the rate limit state.

A couple of full journals:

@mrc0mmand mrc0mmand reopened this Apr 23, 2024
@mrc0mmand mrc0mmand changed the title busctl test from TEST-74-AUX-UTILS became unstable busctl remains stuck even after mount-monitor-dispatch leaves the rate limit state Apr 23, 2024
@mrc0mmand mrc0mmand pinned this issue Apr 23, 2024
yuwata pushed a commit to yuwata/systemd that referenced this issue Apr 26, 2024
(Hopefully) a temporary workaround for systemd#30573 where starting a user
session when PID 1 is rate limited stalls even after it leaves the rate
limited state:

[   11.658201] H systemd[1]: Sent message type=signal sender=n/a destination=n/a path=/org/freedesktop/systemd1 interface=org.freedesktop.systemd1.Manager member=UnitRemoved cookie=4208 reply_cookie=0 signature=so error-name=n/a error-mes>
[   11.658233] H systemd[1]: Event source 0x559babdd8bb0 (mount-monitor-dispatch) left rate limit state.
[  101.562697] H busctl[784]: Failed to get credentials: Transport endpoint is not connected
[  101.563480] H systemd[1]: systemd-journald.service: Got notification message from PID 300 (WATCHDOG=1)
[  101.563725] H testsuite-74.sh[784]: BusAddress=unixexec:path=systemd-run,argv1=-M.host,argv2=-PGq,argv3=--wait,argv4=-pUser%3dtestuser,argv5=-pPAMName%3dlogin,argv6=systemd-stdio-bridge,argv7=-punix:path%3d%24%7bXDG_RUNTIME_DIR%7d/bus
[  101.564136] H systemd[1]: Successfully forked off '(sd-expire)' as PID 787.
[  101.564754] H systemd[1]: Successfully forked off '(sd-expire)' as PID 788.
[  101.564831] H testsuite-74.sh[381]: + echo 'Subtest /usr/lib/systemd/tests/testdata/units/testsuite-74.busctl.sh failed'

The issue appeared after ee07fff which does a bunch of mounts/umounts
that get PID 1 into a rate limited state, and is frequent enough to be
annoying, so let's temporarily bump the rate limit to alleviate that.

(cherry picked from commit c707e34)
(cherry picked from commit 4d60fb7)
yuwata pushed a commit to yuwata/systemd that referenced this issue Apr 26, 2024
(Hopefully) a temporary workaround for systemd#30573 where starting a user
session when PID 1 is rate limited stalls even after it leaves the rate
limited state:

[   11.658201] H systemd[1]: Sent message type=signal sender=n/a destination=n/a path=/org/freedesktop/systemd1 interface=org.freedesktop.systemd1.Manager member=UnitRemoved cookie=4208 reply_cookie=0 signature=so error-name=n/a error-mes>
[   11.658233] H systemd[1]: Event source 0x559babdd8bb0 (mount-monitor-dispatch) left rate limit state.
[  101.562697] H busctl[784]: Failed to get credentials: Transport endpoint is not connected
[  101.563480] H systemd[1]: systemd-journald.service: Got notification message from PID 300 (WATCHDOG=1)
[  101.563725] H testsuite-74.sh[784]: BusAddress=unixexec:path=systemd-run,argv1=-M.host,argv2=-PGq,argv3=--wait,argv4=-pUser%3dtestuser,argv5=-pPAMName%3dlogin,argv6=systemd-stdio-bridge,argv7=-punix:path%3d%24%7bXDG_RUNTIME_DIR%7d/bus
[  101.564136] H systemd[1]: Successfully forked off '(sd-expire)' as PID 787.
[  101.564754] H systemd[1]: Successfully forked off '(sd-expire)' as PID 788.
[  101.564831] H testsuite-74.sh[381]: + echo 'Subtest /usr/lib/systemd/tests/testdata/units/testsuite-74.busctl.sh failed'

The issue appeared after ee07fff which does a bunch of mounts/umounts
that get PID 1 into a rate limited state, and is frequent enough to be
annoying, so let's temporarily bump the rate limit to alleviate that.

(cherry picked from commit c707e34)
(cherry picked from commit 4d60fb7)
(cherry picked from commit 7215018)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug 🐛 Programming errors, that need preferential fixing busctl tests
Development

No branches or pull requests

3 participants