[227 regression] Boot occasionally fails with "Connection timed out" #1505

martinpitt · 2015-10-09T07:07:49Z

https://bugs.debian.org/801354 reported that 227 introduced a rather major boot regression: You often get failures like

Oct 09 09:00:19 autopkgtest systemd-logind[616]: Failed to enable subscription: Connection timed out
Oct 09 09:00:19 autopkgtest systemd-logind[616]: Failed to fully start up daemon: Connection timed out
Oct 09 09:00:19 autopkgtest systemd[1]: Failed to subscribe to activation signal: Connection timed out
Oct 09 09:00:44 autopkgtest systemd[1]: Failed to register name: Connection timed out
Oct 09 09:00:44 autopkgtest systemd[1]: Failed to set up API bus: Connection timed out
Oct 09 09:01:56 autopkgtest su[764]: pam_systemd(su:session): Failed to create session: Connection timed out
Oct 09 09:01:58 autopkgtest lightdm[1578]: pam_systemd(lightdm-greeter:session): Failed to create session: Connection timed out

(same for lightdm, etc.) I now found a way to reproduce this in a VM, so I'll go bisect hunting.

As for the pam_systemd thing: I wonder if that's related to the regression that the user systemd instance is now trying to unmount things during boot:

systemd[1653]: Reached target Default.
systemd[1653]: boot.mount: Mount process exited, code=exited status=1
systemd[1653]: Failed unmounting /boot.
systemd[1653]: run-user-1000.mount: Mount process exited, code=exited status=1
systemd[1653]: Failed unmounting /run/user/1000.
systemd[1653]: dev-dm\x2d2.swap: Unit entered failed state.
systemd[1653]: sys-kernel-debug.mount: Mount process exited, code=exited status=1
systemd[1653]: Failed unmounting /sys/kernel/debug.
[...]
umount[1657]: umount: /boot: umount failed: Operation not permitted
umount[1658]: umount: /run/user/1000: umount failed: Operation not permitted
swapoff[1659]: swapoff: Not superuser.

This might be something entirely different (and then mostly cosmetic), or it might be the cause of the failure to create a user session.

martinpitt · 2015-10-09T07:52:22Z

For the record:

[user systemd trying to unmount stuff] might be something entirely different (and then mostly cosmetic)

It's actually not -- as https://bugs.debian.org/801361 points out, this will actually unmount stuff if you log in as root.

I have the bisect running now, which will show if the unmounts and the timeouts have one and the same root cause. If not, I'll open a separate issue for the unmounts.

martinpitt · 2015-10-09T09:16:27Z

Found the culprit: It's a5bd3c32a from PR #1215. Reverting that fixes the connection timeouts and everything is hunky-dory again.

Curiously, pretty much the exact same symptom happened with 219 in February -- do you guys still remember http://lists.freedesktop.org/archives/systemd-devel/2015-February/028640.html ? Back then it was fixed in 64144440 which already (more or less accidentally) fixed the CMSG_SPACE allocation. The justification back then was different though (http://lists.freedesktop.org/archives/systemd-devel/2015-April/031364.html), so this isn't just the old bug back.

@maciejaszek , @dvdhrm , any idea what the real fix looks like? Thanks!

martinpitt · 2015-10-09T09:45:16Z

The "user systemd tries to unmount" stuff is unrelated to this, I now reported issue #1507 about that.

martinpitt · 2015-10-09T09:49:24Z

FTR, I actually did see this occasionally when running CI tests for the Ubuntu trunk builds, but as the Debian ones (which I do more often) worked fine I didn't pay enough attention to them. Sorry for the process fail, it's always frustrating when we release with a major bug that we could have spotted before. I'll be more suspicious when the "boot smoke" test fails in the future!

zonque · 2015-10-09T11:10:41Z

Thanks, @martinpitt, for bisecting this! Could you give this patch a try, please, and see if it fixes the regression? zonque/systemd@9f58f915

zonque · 2015-10-09T11:15:41Z

Hmm, wait. No, that doesn't explain it.

martinpitt · 2015-10-09T13:34:50Z

@zonque: No surprise, but FTR: no difference with that patch.

poettering · 2015-10-09T14:52:41Z

@martinpitt soo, how and why precisely does the sendmsg() fail and in which process? Do you have an strace of the process maybe, so that we can have a look?

martinpitt · 2015-10-12T07:14:05Z

@poettering: I'll see whether I can come up with an early systemd unit which attaches strace to pid 1 early. Things like logind etc. do a Subscribe D-Bus call to pid 1 (which fails with "Connection timed out"), so stracing logind etc. is uninteresting. I did try to add some log_warnings to sd_pid_notify_with_fds(), but the initial ones (before a successful sendmsg()) mysteriously never appeared in the journal or stderr; I only ever saw successful calls. strace wouldn't show us the interesting numbers for the control header allocation, just the error code. But I'll keep digging.

@maciejaszek : Your original PR #1215 looks wrong. "CMSG_SPACE(0) may return value other than 0" is intended as it contains padding for alignment:

 #define CMSG_SPACE(len) (CMSG_ALIGN (len) \
                    + CMSG_ALIGN (sizeof (struct cmsghdr)))

I think this padding must at least be part of the allocation below:

msghdr.msg_control = alloca(msghdr.msg_controllen);

Thus the original code before that PR looked right. However, you said that you got some EINVAL in some cases (without giving further detail/logs).

cmsg(3) is a bit confusing and contradictory. It says that msg.msg_controllen needs to be set twice, first "with the length of the control message buffer" without explaining that further (but sum of all CMSG_SPACEs is certainly plausible), and after writing all the elements it needs to be "set to the sum of the CMSG_SPACE() of the length of all control messages in the buffer." This is what the original code (before PR #1215) already did, and it has worked for quite a number of releases.

But the example on the manpage actually sets it to the sum of the CMSG_LENs, not the sum of the CMSG_SPACEs. So it might be that the description in the manpage is wrong and the example is right, and the manpage should instead say "Finally, the msg_controllen field of the msghdr should be set to the sum of the CMSG_LEN() of the length of all control messages in the buffer." I tested this theory with http://paste.ubuntu.com/12761809/ but that didn't help, I still get the connection timeouts.

So if we instead assume that the description is right and the example is wrong, we keep msg_controllen as the sum of CMSG_SPACEs, but in the special case of !have_pid or n_fds == 0 we keep the extra padding for the allocation. That would be http://paste.ubuntu.com/12761826/ but it still fails.

So going back, the only code that actually works is the one with PR #1215 reverted, and I don't know why @maciejaszek got EINVALs there.
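To make the CMSG_LEN vs. CMSG_SPACE distinction discussed above concrete, here is a minimal sketch in C. The helper names rights_len(), rights_space() and print_cmsg_sizes() are made up for illustration and are not part of systemd; the actual numbers depend on the platform's alignment rules.

```c
#include <stddef.h>
#include <stdio.h>
#include <sys/socket.h>

/* Value to store in cmsg_len for one SCM_RIGHTS message carrying n_fds
 * descriptors: header plus payload, without trailing padding.
 * (Hypothetical helper, for illustration only.) */
static size_t rights_len(size_t n_fds) {
        return CMSG_LEN(sizeof(int) * n_fds);
}

/* Space the same message occupies in the control buffer, including the
 * padding needed to align a header that might follow it. */
static size_t rights_space(size_t n_fds) {
        return CMSG_SPACE(sizeof(int) * n_fds);
}

/* Print both values for a few descriptor counts. */
static void print_cmsg_sizes(void) {
        for (size_t n = 0; n <= 2; n++)
                printf("n_fds=%zu CMSG_LEN=%zu CMSG_SPACE=%zu\n",
                       n, rights_len(n), rights_space(n));
}
```

Calling print_cmsg_sizes() shows that CMSG_SPACE(0) is not zero, because it still accounts for an aligned header. On x86_64 Linux, for example, a single descriptor gives CMSG_LEN 20 and CMSG_SPACE 24.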

zonque · 2015-10-12T07:53:34Z

@martinpitt if either n_fds == 0 or !have_pid, we only attach one cmsg to the message. Hence, msg.msg_controllen must not contain the size of the 2nd, unused message header, and CMSG_NXTHDR is never called, which is correct. So #1215 does the right thing, but the code is somewhat obfuscated. I still don't know what the actual problem is though.
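The rule described above (count only the control messages that are really attached) can be sketched as follows. This is not the systemd code: control_len() and count_headers() are hypothetical helpers, cred_size stands in for sizeof(struct ucred), and the buffer is never sent anywhere. The buffer is zero-initialized because some libc versions of CMSG_NXTHDR() look at the cmsg_len of the following header.

```c
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>

#ifndef SCM_CREDENTIALS
#define SCM_CREDENTIALS 0x02 /* Linux value; assumption, only used as a tag here */
#endif

/* Control length for the messages actually attached.
 * cred_size == 0 means "no credentials message". */
static size_t control_len(size_t n_fds, size_t cred_size) {
        return (n_fds > 0 ? CMSG_SPACE(sizeof(int) * n_fds) : 0) +
               (cred_size > 0 ? CMSG_SPACE(cred_size) : 0);
}

/* Lay out the headers in a zeroed buffer, then count them by walking the
 * chain. Returns the number of headers found, or -1 on failure. */
static int count_headers(size_t n_fds, size_t cred_size) {
        struct msghdr mh;
        struct cmsghdr *c;
        int n = 0;

        memset(&mh, 0, sizeof(mh));
        mh.msg_controllen = control_len(n_fds, cred_size);
        if (mh.msg_controllen == 0)
                return 0;
        mh.msg_control = calloc(1, mh.msg_controllen);
        if (!mh.msg_control)
                return -1;

        c = CMSG_FIRSTHDR(&mh);
        if (n_fds > 0) {
                c->cmsg_level = SOL_SOCKET;
                c->cmsg_type = SCM_RIGHTS;
                c->cmsg_len = CMSG_LEN(sizeof(int) * n_fds);
                /* Only advance when a second message really follows. */
                if (cred_size > 0)
                        c = CMSG_NXTHDR(&mh, c);
        }
        if (cred_size > 0) {
                if (!c) {
                        free(mh.msg_control);
                        return -1;
                }
                c->cmsg_level = SOL_SOCKET;
                c->cmsg_type = SCM_CREDENTIALS;
                c->cmsg_len = CMSG_LEN(cred_size);
        }

        for (c = CMSG_FIRSTHDR(&mh); c; c = CMSG_NXTHDR(&mh, c))
                n++;

        free(mh.msg_control);
        return n;
}
```

With one descriptor and a 12-byte credentials payload this yields two headers; with either part missing it yields exactly one, and the control length shrinks accordingly.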

martinpitt · 2015-10-12T07:56:03Z

Wrt. stracing, this actually works quite well:

[Unit]
Description=strace pid 1
DefaultDependencies=no

[Service]
ExecStart=/usr/bin/strace -e socket,sendmsg,close -vvs1024 -o /run/pid1.trace -p 1

[Install]
WantedBy=sysinit.target

First I ran this stracing with current git head, i.e. with this bug.

From a failed boot the journal is http://paste.ubuntu.com/12761903/ and the strace output is http://paste.ubuntu.com/12761900/ . Note that there are zero instances of msg_controllen not being zero, and sendmsg() only ever fails with EAGAIN (10 times) or EPIPE (21 times).

From a successful boot the journal is http://paste.ubuntu.com/12761945/ and the strace is http://paste.ubuntu.com/12761943/ . Note that there are no EAGAIN errors, just 19 EPIPE ones (and no other error codes). Just like above all msg_controllens are 0.

AFAICS the socket isn't marked as nonblocking, so the usual meaning of EAGAIN doesn't apply here. The second variant is documented as "(Internet domain datagram sockets) The socket referred to by sockfd had not previously been bound to an address and, upon attempting to bind it to an ephemeral port, it was determined that all port numbers in the ephemeral port range are currently in use." This also sounds implausible.

So I'm afraid I can't really make sense of the EAGAIN here...

martinpitt · 2015-10-12T09:25:53Z

I re-ran with the reverted patch, so that the "Connection timed out" errors disappear. strace is http://paste.ubuntu.com/12762156/. As expected there are no more EAGAIN errors, but there is still not a single sendmsg() call with a msg_controllen != 0. I ran this in a loop, and I never got a single EAGAIN. So I suppose the cases where control messages are actually being sent are from the client side (logind, journald, etc.). I'll get some straces there to compare.

With current git head (i.e. without reverting), the EAGAINs coincide exactly with the failed boots with connection errors (in pid 1).

maciejaszek · 2015-10-12T09:32:36Z

@martinpitt: bug occurred when I tried to pass fds, but have_pid was set to 0. After my change everything booted without any problems. I'm looking at the code, but it would be good to test changes - do you have any image on which I can reproduce this error?

martinpitt · 2015-10-12T09:46:31Z

I do have an image, but it's 2.1 GB. I started uploading it now, but it'll take some two hours.

This is more or less a standard Ubuntu cloud image (http://cloud-images.ubuntu.com/wily/current/wily-server-cloudimg-amd64-disk1.img) with some extra stuff installed (network-manager, policykit-1, lightdm), upgraded to systemd 227, with the persistent journal enabled, and then rebooted in a loop. I also got this behaviour on full desktop images back then. The whole machinery to do that (autopkgtest etc.) is not that simple to reproduce on other distros. But it seems to me that enabling the persistent journal is somehow a key ingredient.

In the previous case (http://lists.freedesktop.org/archives/systemd-devel/2015-February/028640.html) we eventually got bug reports on pretty much every distro (Arch, Fedora, Debian, Ubuntu, etc.), but it seems no developer except me could reliably reproduce this. It's such an iffy heisenbug.

martinpitt · 2015-10-12T10:04:09Z

When this bug happens, stracing logind shows 27 sendmsg() calls, all of which succeed. There is exactly one call with ancillary data:

sendmsg(11, {msg_name(0)=NULL, msg_iov(2)=[{"l\2\1\1\4\0\0\0\17\0\0\0(\0\0\0\5\1u\0\21\0\0\0\6\1s\0\5\0\0\0:1.11\0\0\0\10\1g\0\1h\0\0\t\1u\0\1\0\0\0", 56}, {"\0\0\0\0", 4}], msg_controllen=20, {cmsg_len=20, cmsg_level=SOL_SOCKET, cmsg_type=SCM_RIGHTS, {17}}, msg_flags=0}, MSG_DONTWAIT|MSG_NOSIGNAL) = 60

Whereas with the reverted patch we get multiple calls but they all have exactly the same control length "20", and again there are no errors:

sendmsg(11, {msg_name(0)=NULL, msg_iov(2)=[{"l\2\1\1\4\0\0\0\r\0\0\0(\0\0\0\5\1u\0\35\0\0\0\6\1s\0\4\0\0\0:1.5\0\0\0\0\10\1g\0\1h\0\0\t\1u\0\1\0\0\0", 56}, {"\0\0\0\0", 4}], msg_controllen=20, {cmsg_len=20, cmsg_level=SOL_SOCKET, cmsg_type=SCM_RIGHTS, {17}}, msg_flags=0}, MSG_DONTWAIT|MSG_NOSIGNAL) = 60
sendmsg(11, {msg_name(0)=NULL, msg_iov(2)=[{"l\2\1\1`\0\0\0\37\0\0\0000\0\0\0\5\1u\0\2\0\0\0\6\1s\0\5\0\0\0:1.12\0\0\0\10\1g\0\10soshusub\0\0\0\t\1u\0\1\0\0\0", 64}, {"\2\0\0\0c1\0\0\"\0\0\0/org/freedesktop/login1/session/c1\0\0\r\0\0\0/run/user/109\0\0\0\0\0\0\0m\0\0\0\5\0\0\0seat0\0\0\0\7\0\0\0\0\0\0\0", 96}], msg_controllen=20, {cmsg_len=20, cmsg_level=SOL_SOCKET, cmsg_type=SCM_RIGHTS, {18}}, msg_flags=0}, MSG_DONTWAIT|MSG_NOSIGNAL) = 160
sendmsg(11, {msg_name(0)=NULL, msg_iov(2)=[{"l\2\1\1`\0\0\0/\0\0\0000\0\0\0\5\1u\0\2\0\0\0\6\1s\0\5\0\0\0:1.15\0\0\0\10\1g\0\10soshusub\0\0\0\t\1u\0\1\0\0\0", 64}, {"\2\0\0\0c2\0\0\"\0\0\0/org/freedesktop/login1/session/c2\0\0\r\0\0\0/run/user/109\0\0\0\0\0\0\0m\0\0\0\5\0\0\0seat0\0\0\0\7\0\0\0\0\0\0\0", 96}], msg_controllen=20, {cmsg_len=20, cmsg_level=SOL_SOCKET, cmsg_type=SCM_RIGHTS, {18}}, msg_flags=0}, MSG_DONTWAIT|MSG_NOSIGNAL) = 160
sendmsg(11, {msg_name(0)=NULL, msg_iov(2)=[{"l\2\1\1`\0\0\0?\0\0\0000\0\0\0\5\1u\0\2\0\0\0\6\1s\0\5\0\0\0:1.18\0\0\0\10\1g\0\10soshusub\0\0\0\t\1u\0\1\0\0\0", 64}, {"\2\0\0\0c3\0\0\"\0\0\0/org/freedesktop/login1/session/c3\0\0\r\0\0\0/run/user/109\0\0\0\0\0\0\0m\0\0\0\5\0\0\0seat0\0\0\0\7\0\0\0\0\0\0\0", 96}], msg_controllen=20, {cmsg_len=20, cmsg_level=SOL_SOCKET, cmsg_type=SCM_RIGHTS, {18}}, msg_flags=0}, MSG_DONTWAIT|MSG_NOSIGNAL) = 160
sendmsg(11, {msg_name(0)=NULL, msg_iov(2)=[{"l\2\1\1`\0\0\0O\0\0\0000\0\0\0\5\1u\0\2\0\0\0\6\1s\0\5\0\0\0:1.21\0\0\0\10\1g\0\10soshusub\0\0\0\t\1u\0\1\0\0\0", 64}, {"\2\0\0\0c4\0\0\"\0\0\0/org/freedesktop/login1/session/c4\0\0\r\0\0\0/run/user/109\0\0\0\0\0\0\0m\0\0\0\5\0\0\0seat0\0\0\0\7\0\0\0\0\0\0\0", 96}], msg_controllen=20, {cmsg_len=20, cmsg_level=SOL_SOCKET, cmsg_type=SCM_RIGHTS, {18}}, msg_flags=0}, MSG_DONTWAIT|MSG_NOSIGNAL) = 160
sendmsg(11, {msg_name(0)=NULL, msg_iov(2)=[{"l\2\1\1`\0\0\0_\0\0\0000\0\0\0\5\1u\0\2\0\0\0\6\1s\0\5\0\0\0:1.24\0\0\0\10\1g\0\10soshusub\0\0\0\t\1u\0\1\0\0\0", 64}, {"\2\0\0\0c5\0\0\"\0\0\0/org/freedesktop/login1/session/c5\0\0\r\0\0\0/run/user/109\0\0\0\0\0\0\0m\0\0\0\5\0\0\0seat0\0\0\0\7\0\0\0\0\0\0\0", 96}], msg_controllen=20, {cmsg_len=20, cmsg_level=SOL_SOCKET, cmsg_type=SCM_RIGHTS, {18}}, msg_flags=0}, MSG_DONTWAIT|MSG_NOSIGNAL) = 160
sendmsg(11, {msg_name(0)=NULL, msg_iov(2)=[{"l\2\1\1X\0\0\0s\0\0\0000\0\0\0\5\1u\0\2\0\0\0\6\1s\0\5\0\0\0:1.25\0\0\0\10\1g\0\10soshusub\0\0\0\t\1u\0\1\0\0\0", 64}, {"\2\0\0\0c6\0\0\"\0\0\0/org/freedesktop/login1/session/c6\0\0\v\0\0\0/run/user/0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 88}], msg_controllen=20, {cmsg_len=20, cmsg_level=SOL_SOCKET, cmsg_type=SCM_RIGHTS, {18}}, msg_flags=0}, MSG_DONTWAIT|MSG_NOSIGNAL) = 152

Sorry, this makes no sense at all to me, at this point I'm just monkey-patching around..
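For readers unfamiliar with what the traced calls do: each of them carries one file descriptor as SCM_RIGHTS ancillary data. Below is a minimal sketch of that mechanism over a socketpair, not the sd-bus implementation. send_fd(), recv_fd() and fd_roundtrip_demo() are illustrative names, and unlike the traced calls this sketch sets msg_controllen to the full buffer size.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* Control buffer for exactly one descriptor; the union keeps it aligned
 * for struct cmsghdr. */
union fd_control {
        struct cmsghdr hdr;
        char buf[CMSG_SPACE(sizeof(int))];
};

/* Send one dummy payload byte plus `fd` as SCM_RIGHTS ancillary data. */
static int send_fd(int sock, int fd) {
        union fd_control control;
        char byte = 0;
        struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
        struct msghdr mh;
        struct cmsghdr *c;

        memset(&control, 0, sizeof(control));
        memset(&mh, 0, sizeof(mh));
        mh.msg_iov = &iov;
        mh.msg_iovlen = 1;
        mh.msg_control = control.buf;
        mh.msg_controllen = sizeof(control.buf);

        c = CMSG_FIRSTHDR(&mh);
        c->cmsg_level = SOL_SOCKET;
        c->cmsg_type = SCM_RIGHTS;
        c->cmsg_len = CMSG_LEN(sizeof(int));
        memcpy(CMSG_DATA(c), &fd, sizeof(int));

        return sendmsg(sock, &mh, 0) < 0 ? -1 : 0;
}

/* Receive one message and return the descriptor it carried, or -1. */
static int recv_fd(int sock) {
        union fd_control control;
        char byte;
        struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
        struct msghdr mh;
        struct cmsghdr *c;
        int fd = -1;

        memset(&mh, 0, sizeof(mh));
        mh.msg_iov = &iov;
        mh.msg_iovlen = 1;
        mh.msg_control = control.buf;
        mh.msg_controllen = sizeof(control.buf);

        if (recvmsg(sock, &mh, 0) < 0)
                return -1;

        for (c = CMSG_FIRSTHDR(&mh); c; c = CMSG_NXTHDR(&mh, c))
                if (c->cmsg_level == SOL_SOCKET && c->cmsg_type == SCM_RIGHTS &&
                    c->cmsg_len == CMSG_LEN(sizeof(int)))
                        memcpy(&fd, CMSG_DATA(c), sizeof(int));
        return fd;
}

/* Pass the write end of a pipe across a socketpair and use the copy.
 * Returns 0 if a byte written via the received fd arrives on the pipe. */
static int fd_roundtrip_demo(void) {
        int sv[2], p[2], got = -1, r = -1;
        char c = 0;

        if (socketpair(AF_UNIX, SOCK_DGRAM, 0, sv) < 0)
                return -1;
        if (pipe(p) < 0) {
                close(sv[0]);
                close(sv[1]);
                return -1;
        }
        if (send_fd(sv[0], p[1]) == 0 && (got = recv_fd(sv[1])) >= 0) {
                if (write(got, "x", 1) == 1 && read(p[0], &c, 1) == 1 && c == 'x')
                        r = 0;
                close(got);
        }
        close(p[0]);
        close(p[1]);
        close(sv[0]);
        close(sv[1]);
        return r;
}
```

The received descriptor is a new number in the receiving process that refers to the same open file, which is why writing to it shows up on the original pipe.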

dvdhrm · 2015-10-12T10:34:48Z

Just for clarity, a5bd3c3 looks correct. The previous code was definitely not correct. Maybe there is still something wrong, but I checked all the kernel CMSG macros and internal handling, and it looks fine. Please correct me, if I'm wrong.

Furthermore, the log-messages don't mention a failure in sendmsg(2). Instead, what I see is lightdm calling pam, calling logind, calling pid1, calling AddMatch on dbus-daemon. The latter fails and the error code is passed through the chain to lightdm. Maybe this is not a direct chain, but rather an activation chain. But that's probably irrelevant.
This makes me wonder why AddMatch on dbus-daemon fails with ETIMEDOUT. This might be worth investigating.

Anyway, we cannot ignore that reverting a5bd3c3 fixes your issue. This somehow smells like stack corruption to me. I'll see whether valgrind throws something interesting.

Furthermore, can you give some details on how you reproduced this? Is this 32bit? 64bit? 32bit on 64bit? x86? ARM? etc. That is, I'm trying to figure out why none of us sees it on their production system.

martinpitt · 2015-10-12T11:17:58Z

@dvdhrm: We got at least three different reporters hours after we uploaded 227 to Debian sid. People there used i386 (32 bit) and x86_64, lightdm or gdm3 etc., and as this also kills journald, logind, rfkill I don't believe this is dependent on a particular login manager or even architecture.

However, I could never reproduce it on my Debian test images. I just tried taking a standard x86_64 Ubuntu desktop VM install, upgrading to systemd 227, enabling persistent journal, and rebooting often; but I couldn't trigger it like that. Half a year ago it pretty much felt exactly the same, and I got it once or twice with manual tests on a desktop VM, but it was too frustrating to reproduce that way as this seems to be highly dependent on timing, sun rays, local air pressure, and what not.

The upload of my autopkgtest VM where this is reasonably easy to reproduce finally finished: http://people.canonical.com/~pitti/tmp/adt-wily+systemd227.img (2.1 GB)

qemu-system-x86_64 -enable-kvm -m 2048 -drive file=adt-wily+systemd227.img,if=virtio -net nic,model=virtio -net user,hostfwd=tcp::2222-:22 -nographic -serial stdio -monitor none -snapshot

As this was based on a minimal VM, lightdm doesn't actually start up, so there's no usable graphic output. The VM can be driven by the serial console, or (once it starts up), over ssh. The above QEMU command starts it with a console on stdio, and you can use ssh -p 2222 ubuntu@localhost to log into it with ssh.

This VM mostly just needs to be rebooted a couple of times (like 5), and then it reliably produces that hang for me. User is "ubuntu", password "ubuntu", sudo works without password.

martinpitt · 2015-10-12T11:31:10Z

Some missing info:

It is enough to replace these three binaries for testing a fix/change: /lib/systemd/systemd{,-logind,-journald} .
systemd was configured like this: ./configure CFLAGS='-g -O0 -ftrapv' --sysconfdir=/etc --localstatedir=/var --libdir=/usr/lib --with-rootprefix= --with-rootlibdir=/lib --enable-split-usr PYTHON=python3

maciejaszek · 2015-10-12T11:39:23Z

I've written a test service which calls sd_pid_notify_with_fds, and straced pid 1:

recvmsg(16, {msg_name(0)=NULL, msg_iov(1)=[{"FDSTORE=1", 4096}], msg_controllen=56, [{cmsg_len=28, cmsg_level=SOL_SOCKET, cmsg_type=SCM_CREDENTIALS, {pid=330, uid=0, gid=0}}, {cmsg_len=20, cmsg_level=SOL_SOCKET, cmsg_type=SCM_RIGHTS, [13]}], msg_flags=MSG_CMSG_CLOEXEC}, MSG_DONTWAIT|MSG_CMSG_CLOEXEC) = 9
open("/proc/330/cgroup", O_RDONLY|O_CLOEXEC) = 15

// ...

epoll_ctl(4, EPOLL_CTL_ADD, 13, {0, {u32=3773119200, u64=94329844873952}}) = 0
recvmsg(16, 0x7ffc0f5f1b60, MSG_DONTWAIT|MSG_CMSG_CLOEXEC) = -1 EAGAIN (Resource temporarily unavailable)

It looks like the manager receives it properly, adds it to the watched fds, and then tries to receive something one more time, which fails.
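One plausible reading of the trailing EAGAIN in the trace above is that the receiver simply keeps reading with MSG_DONTWAIT until the queue is empty, in which case the last call is expected to fail that way. Whether pid 1's event loop does exactly this is an assumption here. The sketch below shows the general pattern on a socketpair; drain_datagrams() and drain_demo() are hypothetical names.

```c
#include <errno.h>
#include <sys/socket.h>
#include <unistd.h>

/* Read datagrams until the queue is empty. Returns the number read, or -1
 * on an error other than "nothing left to read". */
static int drain_datagrams(int sock) {
        char buf[4096];
        int n = 0;

        for (;;) {
                ssize_t r = recv(sock, buf, sizeof(buf), MSG_DONTWAIT);
                if (r >= 0) {
                        n++;
                        continue;
                }
                if (errno == EINTR)
                        continue;
                /* Empty queue: the normal way out of the loop, not a failure. */
                if (errno == EAGAIN || errno == EWOULDBLOCK)
                        return n;
                return -1;
        }
}

/* Queue three datagrams on a socketpair and drain them again. */
static int drain_demo(void) {
        int sv[2], n;

        if (socketpair(AF_UNIX, SOCK_DGRAM, 0, sv) < 0)
                return -1;
        for (int i = 0; i < 3; i++)
                if (send(sv[0], "FDSTORE=1", 9, 0) != 9) {
                        close(sv[0]);
                        close(sv[1]);
                        return -1;
                }
        n = drain_datagrams(sv[1]);
        close(sv[0]);
        close(sv[1]);
        return n;
}
```

In this pattern every drain ends with one "failed" receive, so an EAGAIN after a successful recvmsg() does not by itself point at a problem.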

martinpitt · 2015-10-12T12:35:34Z

For the record: a5bd3c3 changed both the allocation and the value of msg_controllen. As @dvdhrm suspected a memory error I tried with

-                msghdr.msg_control = alloca(msghdr.msg_controllen);
+                msghdr.msg_control = alloca(msghdr.msg_controllen + 100);

which keeps the current control length, but allocates some extra space in case it overflows due to padding. With this the bug is still present, so I don't believe it's just a short allocation.

As for my earlier "AFAICS the socket isn't marked as nonblocking, so the usual meaning of EAGAIN doesn't apply here" → this is obviously wrong. The socket itself is not opened with SOCK_NONBLOCK, and sendmsg() is not called with MSG_DONTWAIT in the code -- but in the strace we clearly see MSG_DONTWAIT|MSG_NOSIGNAL. This explains the EAGAIN.

Are these sendmsg() really intended to be non-blocking? If so, where is that coming from, I don't see that in the code? And then we actually need to handle the EAGAIN case, as otherwise it falls through to the "try with our own ucred instead" case and eventually fails completely. If the non-blocking send is not intended, that's a likely thing to fix then?

zonque · 2015-10-12T12:49:04Z

Hmm, there's a nasty detail in CMSG_NXTHDR(), but that should have resulted in an assertion. Anyway, could you try zonque/systemd@41b112f ?

dvdhrm · 2015-10-12T13:12:04Z

Seriously? This loop fixes it?

martinpitt · 2015-10-12T13:12:19Z

@zonque: That's it, you are a star! No more hangs with the alloca0. Almost no EAGAINs now either (only a few from sending log messages to the journal). Turns out the EAGAIN/DONTWAIT bit was a red herring, as these were from journal sendmsg. I guess I got a lot of those as journald was hanging when the bug happened.

Please just fix the "sd-daemin" typo in the commit log :-)

Thanks!

martinpitt · 2015-10-29T00:16:19Z

As per our conversation I updated http://people.canonical.com/~pitti/tmp/adt-wily+systemd227.img to have gdb, debug symbols for libc, dbus, and libcap, and I removed the apt proxy config.

There are two snapshots now: "upstream" is with systemd, systemd-logind, and systemd-journald from current master, and "revert-a5bd3c3" has that commit reverted (in spirit). The VM defaults to "upstream". Note that this reversion actually stopped working reliably, some commit in the last days broke this as a workaround (perhaps #1707); it was meant to provide a reliable first boot before test stuff can be deployed, so it's not that useful any more. With that this VM fails quickly: usually at the first or second boot, but in my runs tonight it never survived more than 5.

Note that there is no snapshot yet of the running state when this happens. I need to look into whether and how this can be done with qemu (didn't see this in https://en.wikibooks.org/wiki/QEMU/Monitor or the manpages). But the hang reproduces very often, so this is maybe not that important.

I also did one more experiment: http://paste.ubuntu.com/12995227/ (resulting in the journal http://paste.ubuntu.com/12995347/). I think we stared at the sendmsg() side long enough now, so I looked at the receiving end. I ensured that all ctrl messages received fall into the two existing cases "fd array" or "ucred", i. e. we don't receive any unhandled/unknown messages.

BTW: manager_dispatch_notify_fd does not actually seem to handle a pid? I know we don't use that feature in systemd itself, but as the sender API supports it now, shouldn't control messages with a pid be handled somehow? But this is just a tangent...

An interesting (I think) observation is that even with completely ignoring received fd arrays (as in the above patch) we still get the bug. So the problem is not with their handling and triggering further notifications further down in the code (I thought this might lead to deadlocks due to cyclic notifications perhaps); it almost seems like the mere act of receiving an fd array and doing nothing with it already causes this. @zonque had some theory of this being some kernel bug above, maybe this corroborates this?

benjarobin · 2015-10-29T00:45:41Z

The backtraces of systemd, dbus-daemon, systemd-logind, systemd-journal, systemd-udevd, obtained with gdb a little bit later, running systemd 227 + cherry-pick of commit ref #1707 : http://benjarobin.free.fr/Divers/benjarobin-systemd-1505-4.tar.xz
I don't have time to analyse it now; I hope it will help you.

journald is not able to write the content of the log, journald just hangs. journald gets unstuck when I send SIGKILL to all processes using the Magic SysRq key.

benjarobin · 2015-10-29T10:43:20Z

The analysis of the deadlock with systemd 227 + commit b215b0e:
We do have a nice deadlock loop:
[journald] --- sd_pid_notify_with_fds ---> [systemd] --- sd_bus_call_method --> [dbus-daemon] --- vsyslog --> [journald]

[logind] is stuck since the process tries to communicate with systemd (notify) or dbus, and these processes hang since they are stuck in a deadlock loop.

zonque · 2015-10-29T10:51:05Z

@benjarobin hmm, could you elaborate a little more, please? The notify socket is a non-blocking DGRAM socket. Even if we bail from manager_dispatch_notify_fd without doing anything (ignore and dispose any message immediately in case it contains FDSTORE=), we still see the issue. And in my gdb and strace debug attempts, I don't see any such deadlocks. What am I missing?

benjarobin · 2015-10-29T12:05:19Z

@zonque Well, I am just showing the facts (take for example the 7AKu3T-d3.log of the 1505-4.tar.xz archive): journald is stuck inside the sd_pid_notify_with_fds function. Yes, the process shouldn't be stuck like that; I have no idea why all backtraces show this process stuck inside this function.

It's very easy to reproduce the problem on my computer; it's much harder to boot normally on this computer. With systemd 226 I do not have any problem.

benjarobin · 2015-10-29T12:36:05Z

@zonque The socket of sendmsg is not opened with SOCK_NONBLOCK, so the DGRAM socket can block if there is not enough room to store the message.

zonque · 2015-10-29T13:47:00Z

The socket of sendmsg is not opened with SOCK_NONBLOCK, so the DGRAM socket can block if there is not enough room to store the message.

This is confusing. Our messages are tiny, and way smaller than PIPE_BUF, so this should never happen. Are you really seeing this call block in strace or gdb? Does journald hang in your case? Because it doesn't in my setup.

benjarobin · 2015-10-29T13:54:55Z

@zonque Yes, I do see this call block with gdb (I did not set up strace). Did you check the archives, which contain the test script and the result log?
Could you please share your test setup?

I am currently trying to reproduce the problem with systemd patched to use SOCK_NONBLOCK for sendmsg.

dvdhrm · 2015-10-29T14:26:50Z

@benjarobin But systemd (pid1) never does sd_bus_call_method(). In other words, pid1 never does a synchronous method call. So I cannot see why pid1 is stuck in your case? If it weren't stuck, then it should still dispatch notify messages and journal would continue as usual.

benjarobin · 2015-10-29T14:44:47Z

@dvdhrm Well, maybe you are wrong. Did you check the backtrace? If anybody has a problem accessing/downloading the archives, let me know.

Is src/core/manager.c:2039 the code of pid1? I don't know anything about the code of systemd, so I may have made a mistake; that's why I gave you the test script and the result log.

dvdhrm · 2015-10-29T14:52:21Z

@benjarobin, manager.c:2039 is the main-loop, so this is not really correct. Anyway, I can see that AddMatch and friends are blocking DBus calls, and they will indeed dead-lock if dbus-daemon logs and journald blocks on pid1. @poettering might wanna comment on this.

benjarobin · 2015-10-29T15:11:34Z

I am not able to reproduce the hang with this code:
Systemd 227 + b215b0e + inside sd_pid_notify_with_fds() the socket() call uses SOCK_NONBLOCK and sendmsg() uses MSG_DONTWAIT

I set up an auto-reboot of the computer on boot success, and everything looks fine...

But the applied patch is not a solution: we shouldn't drop the sd_pid_notify_with_fds() call if the kernel buffer is full.

dvdhrm · 2015-10-29T15:17:22Z

@benjarobin, you might be on to something. Talking with @zonque about this a bit more, imagine the following:

system boots up, everything goes on as normal. dbus-daemon is started up and triggers pid1 bus activation. pid1 receives the call and triggers API bus registration. pid1 goes into bus_init_api(), which calls sd_bus_add_match() synchronously on dbus-daemon. This eventually fails, causing bus_init_api() to fail without ever setting m->api_bus. Thus, we end up with an ETIMEDOUT error on sd_bus_add_match() in the logs (as described by @martinpitt), and we also get the behavior @zonque and I experienced, where pid1 is fully working via the registered API, but never sends out any bus-events on the api-bus (because its vtables are registered and working, but the api_bus pointer is not set).

Now the remaining question is: why does sd_bus_add_match() time out? And more importantly, why is that related to the sd_notify_*() call in the journal?

Some facts that might help: dbus-daemon logs synchronously. Hence, that log might be full, thus dbus-daemon is blocking on the log-stream to the journal. The journal might be blocking (who knows why) on pid1, and pid1 blocks on dbus-daemon via sd_bus_add_match(). The latter times out at some point, causing the problems as described.

I still don't get why the journal blocks, though. The sd_notify call allocates a new socket and only sends a single DGRAM message. This should never block as the kernel has a small, but existent internal buffer per dgram socket.

Also weird: why does a sleep() call in front of the sd_notify solve the issue?

I think the blocking sd_bus_add_match might indeed be the underlying cause, but I'm still not sure how the full pseudo-deadlock circle goes.

zonque · 2015-10-29T15:55:04Z

Ok, @benjarobin, you deserve a prize for this. I profiled the sd_pid_notify_with_fds() call in journald, and in the failed case, it in fact spends >25s in that call. This explains the deadlock, now we need to think about a solution. Of course, after that timeout, everything goes back to normal, and hence none of the tasks remain in a hung state. Thanks a million, this one was driving me nuts.

benjarobin · 2015-10-29T16:01:52Z

Thanks to @martinpitt, who gave me the base idea for the test script. And thanks to my computer, which has a reproduction rate close to 90%.
But could anyone explain to me why sd_pid_notify_with_fds() is blocking?

dvdhrm · 2015-10-29T16:04:14Z

@benjarobin: Simple. pid1 has a single DGRAM socket where it receives sd_notify() messages from all other processes. The receive-queue is limited to 16 messages by default. If there are 16 unit startups in parallel, the queue is full and journald blocks on the sd_notify().

At the same time pid1 blocks on AddMatch, which blocks on dbus-daemon, which blocks on logging.

Fix is probably to make dbus-daemon logging async, and to make journald sd_notify async as well.
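The "queue is full, so the sender blocks" behaviour described above can be observed in isolation with a short experiment. The sketch below is only an approximation: it uses a connected socketpair rather than pid 1's bound notify socket, so the limit it hits may be a different one (for example the send buffer size rather than the queue length), and the count varies between systems. queue_until_full() and queue_demo() are hypothetical names.

```c
#include <errno.h>
#include <sys/socket.h>
#include <unistd.h>

/* Send 1-byte datagrams without ever blocking until the kernel refuses.
 * Returns how many were queued, -1 on an unexpected error, or -2 if `cap`
 * sends succeeded without hitting any limit. */
static long queue_until_full(int sock, long cap) {
        for (long n = 0; n < cap; n++) {
                if (send(sock, "x", 1, MSG_DONTWAIT) == 1)
                        continue;
                /* EAGAIN is what Linux reports; ENOBUFS is accepted as well
                 * because some other systems use it for the same condition. */
                if (errno == EAGAIN || errno == EWOULDBLOCK || errno == ENOBUFS)
                        return n;
                return -1;
        }
        return -2;
}

/* Nobody reads from the peer, so the sender must eventually hit a limit. */
static long queue_demo(void) {
        int sv[2];
        long n;

        if (socketpair(AF_UNIX, SOCK_DGRAM, 0, sv) < 0)
                return -1;
        n = queue_until_full(sv[0], 1000000);
        close(sv[0]);
        close(sv[1]);
        return n;
}
```

Without MSG_DONTWAIT (and without O_NONBLOCK on the socket), the send that fails here would instead sleep until the reader makes room, which is the state journald was found in.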

poettering · 2015-11-01T21:19:33Z

My patch to fix this properly waits for you in #1745 now. Please have a look. But note that #1737 should be reviewed/merged first, as #1745 builds on it.

Otherwise we might run into deadlocks, when journald blocks on the notify socket on PID 1, and PID 1 blocks on IPC to dbus-daemon and dbus-daemon blocks on logging to journald. Break this cycle by making sure that journald never ever blocks on PID 1. Note that this change disables support for event loop watchdog support, as these messages are sent in blocking style by sd-event. That should not be a big loss though, as people reported frequent problems with the watchdog hitting journald on excessively slow IO. Fixes: systemd#1505.
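As a rough illustration of "never blocks on PID 1", and not the actual patch from #1745 (which is not reproduced in this thread), a sender can mark its socket non-blocking and treat a full queue as "retry later" instead of sleeping. set_nonblocking(), try_notify() and nonblocking_demo() are hypothetical names, and the socketpair stands in for the real notify socket.

```c
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Switch an already-open descriptor to non-blocking mode. */
static int set_nonblocking(int fd) {
        int flags = fcntl(fd, F_GETFL, 0);

        if (flags < 0)
                return -1;
        return fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0 ? -1 : 0;
}

/* Try to send a status string. Returns 1 if it was queued, 0 if the queue
 * is full and the caller should retry later, -1 on any other error. */
static int try_notify(int sock, const char *state) {
        if (send(sock, state, strlen(state), MSG_NOSIGNAL) >= 0)
                return 1;
        if (errno == EAGAIN || errno == EWOULDBLOCK || errno == ENOBUFS)
                return 0;
        return -1;
}

/* Keep notifying a peer that never reads. Returns 0 once try_notify()
 * reports "would block" instead of hanging, -1 otherwise. */
static int nonblocking_demo(void) {
        int sv[2], r = -1;

        if (socketpair(AF_UNIX, SOCK_DGRAM, 0, sv) < 0)
                return -1;
        if (set_nonblocking(sv[0]) == 0)
                for (long i = 0; i < 1000000; i++) {
                        int s = try_notify(sv[0], "WATCHDOG=1");
                        if (s == 1)
                                continue;
                        if (s == 0)
                                r = 0;
                        break;
                }
        close(sv[0]);
        close(sv[1]);
        return r;
}
```

The design trade-off is the one raised earlier in the thread: a non-blocking sender needs a policy for messages that could not be queued, since silently dropping them is not acceptable for every kind of notification.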

martinpitt added the regression label Oct 9, 2015

martinpitt added this to the v228 milestone Oct 9, 2015

martinpitt self-assigned this Oct 9, 2015

martinpitt added the release-critical label Oct 9, 2015

martinpitt removed their assignment Oct 9, 2015

poettering added the needs-reporter-feedback label Oct 9, 2015

manover pushed a commit to manover/systemd that referenced this issue Oct 11, 2015

Revert "sd_pid_notify_with_fds: fix computing msg_controllen"

c11d9e0

It causes connection errors from various services on boot. systemd/systemd#1505 Closes: #801354

poettering mentioned this issue Nov 1, 2015

Make sure journald never blocks on sd_notify() to PID 1 #1745

Merged

poettering closed this as completed in e22aa3d Nov 2, 2015

DeX77 added a commit to frugalware/frugalware-current that referenced this issue Nov 7, 2015

systemd-227-5-x86_64

7113e69

* relbump * fixes issues about "Connection timed out" see systemd/systemd#1505

martinpitt mentioned this issue Nov 27, 2015

systemd sporadically fails to start polkitd #2019

Closed

evverx mentioned this issue Dec 30, 2015

Support for Configuration Reload in journald #2236

Closed

benjarobin mentioned this issue Apr 5, 2016

Sometimes systemd-notify(3)'s cgroup="/", causing message to get dropped #2739

Closed
