[227 regression] Boot occasionally fails with "Connection timed out" #1505
For the record:
It's actually not -- as https://bugs.debian.org/801361 points out, this will actually unmount stuff if you log in as root. I have the bisect running now, which will show whether the unmounts and the timeouts have one and the same root cause. If not, I'll open a separate issue for the unmounts.
Found the culprit: it's a5bd3c32a from PR #1215. Reverting that fixes the connection timeouts and everything is hunky-dory again. Curiously, pretty much the exact same symptom happened with 219 in February -- do you guys still remember http://lists.freedesktop.org/archives/systemd-devel/2015-February/028640.html ? Back then it was fixed in 64144440 which already (more or less accidentally) fixed the

@maciejaszek, @dvdhrm, any idea what the real fix looks like? Thanks!
The "user systemd tries to unmount" stuff is unrelated to this; I have now reported issue #1507 about that.
FTR, I actually did see this occasionally when running CI tests for the Ubuntu trunk builds, but as the Debian ones (which I do more often) worked fine, I didn't pay enough attention to them. Sorry for the process fail; it's always frustrating when we release with a major bug that we could have spotted before. I'll be more suspicious when the "boot smoke" test fails in the future!
Thanks, @martinpitt, for bisecting this! Could you give this patch a try, please, and see if it fixes the regression? zonque/systemd@9f58f915
Hmm, wait. No, that doesn't explain it. |
@zonque: No surprise, but FTR: no difference with that patch. |
@martinpitt soo, how and why precisely does the sendmsg() fail and in which process? Do you have an strace of the process maybe, so that we can have a look?
@poettering: I'll see whether I can come up with an early systemd unit which attaches strace to pid 1. Things like logind etc. do a

@maciejaszek: Your original PR #1215 looks wrong. "CMSG_SPACE(0) may return value other than 0" is intended, as it contains padding for alignment:
I think this padding must at least be part of the allocation below:
Thus the original code before that PR looked right. However, you said that you got some
But the example on the manpage actually sets it to the sum of the

So if we instead assume that the description is right and the example is wrong, we keep

So going back, the only code that actually works is the one with PR #1215 reverted, and I don't know why @maciejaszek got
@martinpitt if either
Wrt. stracing, this actually works quite well:
First I ran this stracing with current git head, i. e. with this bug. From a failed boot the journal is http://paste.ubuntu.com/12761903/ and the strace output is http://paste.ubuntu.com/12761900/ . Note that there are zero instances of

From a successful boot the journal is http://paste.ubuntu.com/12761945/ and the strace is http://paste.ubuntu.com/12761943/ . Note that there are no

AFAICS the socket isn't marked as nonblocking, so the usual meaning of

So I'm afraid I can't really make sense of the
I re-ran with the reverted patch, so that the "Connection timed out" errors disappear. strace is http://paste.ubuntu.com/12762156/. As expected there are no more

With current git head (i. e. without reverting), the EAGAINs coincide exactly with the failed boots with connection errors (in pid 1).
@martinpitt: the bug occurred when I tried to pass fds but have_pid was set to 0. After my change everything booted without any problems. I'm looking at the code, but it would be good to test the changes -- do you have any image on which I can reproduce this error?
I do have an image, but it's 2.1 GB. I started uploading it now, but it'll take some two hours. This is more or less a standard Ubuntu cloud image (http://cloud-images.ubuntu.com/wily/current/wily-server-cloudimg-amd64-disk1.img) with some extra stuff installed (network-manager, policykit-1, lightdm), upgraded to systemd 227 with persistent journal enabled, and then rebooted in a loop. I also got this behaviour on full desktop images back then.

The whole machinery to do that (autopkgtest etc.) is not that simple to reproduce on other distros. But it seems to me that enabling the persistent journal is somehow a key ingredient. In the previous case (http://lists.freedesktop.org/archives/systemd-devel/2015-February/028640.html) we eventually got bug reports on pretty much every distro (Arch, Fedora, Debian, Ubuntu, etc.), but it seems no developer except me could reliably reproduce this. It's such an iffy heisenbug.
When this bug happens, stracing logind shows 27
Whereas with the reverted patch we get multiple calls but they all have exactly the same control length "20", and again there are no errors:
Sorry, this makes no sense at all to me; at this point I'm just monkey-patching around.
Just for clarity, a5bd3c3 looks correct. The previous code was definitely not correct. Maybe there is still something wrong, but I checked all the kernel CMSG macros and internal handling, and it looks fine. Please correct me if I'm wrong.

Furthermore, the log messages don't mention a failure in sendmsg(2). Instead, what I see is lightdm calling pam, calling logind, calling pid1, calling AddMatch on dbus-daemon. The latter fails and the error code is passed through the chain to lightdm. Maybe this is not a direct chain, but rather an activation chain. But that's probably irrelevant.

Anyway, we cannot ignore that reverting a5bd3c3 fixes your issue. This somehow smells like stack corruption to me. I'll see whether valgrind throws something interesting. Furthermore, can you give some details on how you reproduced this? Is this 32bit? 64bit? 32bit on 64bit? x86? ARM? etc. That is, I'm trying to figure out why none of us sees it on their production systems.
@dvdhrm: We got at least three different reporters within hours after we uploaded 227 to Debian sid. People there used i386 (32 bit) and x86_64, lightdm or gdm3, etc., and as this also kills journald, logind, and rfkill, I don't believe this is dependent on a particular login manager or even architecture. However, I could never reproduce it on my Debian test images.

I just tried taking a standard x86_64 Ubuntu desktop VM install, upgrading to systemd 227, enabling the persistent journal, and rebooting often; but I couldn't trigger it like that. Half a year ago it pretty much felt exactly the same, and I got it once or twice with manual tests on a desktop VM, but it was too frustrating to reproduce that way as this seems to be highly dependent on timing, sun rays, local air pressure, and what not.

The upload of my autopkgtest VM where this is reasonably easy to reproduce finally finished: http://people.canonical.com/~pitti/tmp/adt-wily+systemd227.img (2.1 GB)
As this was based on a minimal VM, lightdm doesn't actually start up, so there's no usable graphic output. The VM can be driven by the serial console or (once it starts up) over ssh. The above QEMU command starts it with a console on stdio, and you can use

This VM mostly just needs to be rebooted a couple of times (like 5), and then it reliably produces that hang for me. User is "ubuntu", password "ubuntu"; sudo works without password.
Some missing info:
I've written some test service, which calls sd_pid_notify_with_fds and straced pid 1:
It looks like the manager is receiving it properly, adds it to the watched fds, and then tries to receive something one more time, which fails.
For the record: a5bd3c3 changed both the allocation and the value of
which keeps the current control length, but allocates some extra space in case it overflows due to padding. With this the bug is still present, so I don't believe it's just a short allocation.

As for my earlier "AFAICS the socket isn't marked as nonblocking, so the usual meaning of EAGAIN doesn't apply here" → this is obviously wrong. The socket itself is not opened with

Are these
Hmm, there's a nasty detail in
Seriously? This loop fixes it? |
@zonque: That's it, you are a star! No more hangs with the

Please just fix the "sd-daemin" typo in the commit log :-) Thanks!
As per our conversation I updated http://people.canonical.com/~pitti/tmp/adt-wily+systemd227.img to have gdb, debug symbols for libc, dbus, and libcap, and I removed the apt proxy config. There are two snapshots now: "upstream" is with systemd, systemd-logind, and systemd-journald from current master, and "revert-a5bd3c3" has that commit reverted (in spirit). The VM defaults to "upstream". Note that this reversion actually stopped working reliably as a workaround; some commit in the last days broke it (perhaps #1707). It was meant to provide a reliable first boot before test stuff can be deployed, so it's not that useful any more. With that, this VM fails quickly: usually at the first or second boot, and in my runs tonight it never survived more than 5.

Note that there is no snapshot yet of the running state when this happens. I need to look into whether and how this can be done with qemu (didn't see this in https://en.wikibooks.org/wiki/QEMU/Monitor or the manpages). But the hang reproduces very often, so this is maybe not that important.

I also did one more experiment: http://paste.ubuntu.com/12995227/ (resulting in the journal http://paste.ubuntu.com/12995347/). I think we stared at the

BTW: An interesting (I think) observation is that even with completely ignoring received fd arrays (as in the above patch) we still get the bug. So the problem is not with their handling triggering further notifications further down in the code (I thought this might perhaps lead to deadlocks due to cyclic notifications); it almost seems like the mere act of receiving an fd array and doing nothing with it already causes this. @zonque had some theory above about this being a kernel bug; maybe this corroborates it?
The backtraces of systemd, dbus-daemon, systemd-logind, systemd-journald, and systemd-udevd, obtained with gdb a little later, running with systemd 227 + a cherry-pick of commit ref #1707: http://benjarobin.free.fr/Divers/benjarobin-systemd-1505-4.tar.xz

journald is not able to write the content of the log; it just hangs. journald only gets unstuck when I send SIGKILL to all processes using the Magic SysRq key.
The analysis of the deadlock with systemd 227 + commit b215b0e: [logind] is stuck because the process tries to communicate with systemd (notify) or dbus, and these processes hang, stuck in a deadlock loop.
@benjarobin hmm, could you elaborate a little more, please? The notify socket is a non-blocking DGRAM socket. Even if we bail from
@zonque Well, I just stated the facts (take for example the 7AKu3T-d3.log of the 1505-4.tar.xz archive): journald is stuck inside the sd_pid_notify_with_fds function. Yes, the process shouldn't be stuck like that; I have no idea why all backtraces show this process stuck inside this function. It's very easy to reproduce the problem on my computer; it's much harder to boot normally on it. With systemd 226 I do not have any problem.
@zonque The socket of sendmsg is not opened with SOCK_NONBLOCK, so the DGRAM socket can block if there is not enough room to store the message. |
This is confusing. Our messages are tiny, and way smaller than
@zonque Yes, I do see this call block with gdb (I did not set up strace). Did you check the archives which contain the test script and the result log? I am currently trying to reproduce the problem with a systemd patch that uses SOCK_NONBLOCK for sendmsg.
@benjarobin But systemd (pid1) never does sd_bus_call_method(). In other words, pid1 never does a synchronous method call, so I cannot see why pid1 would be stuck in your case. If it weren't stuck, then it should still dispatch notify messages and the journal would continue as usual.
@dvdhrm Well, maybe you are wrong. Did you check the backtrace? If anybody has a problem accessing/downloading the archives, let me know.
@benjarobin, manager.c:2039 is the main loop, so this is not really correct. Anyway, I can see that AddMatch and friends are blocking DBus calls, and they will indeed deadlock if dbus-daemon logs and journald blocks on pid1. @poettering might want to comment on this.
I am not able to reproduce the hang with this code: I set up an auto-reboot of the computer on boot success, and everything looks fine... But the applied patch is not a solution; we shouldn't drop the sd_pid_notify_with_fds() call if the kernel buffer is full.
@benjarobin, you might be on to something. Talking with @zonque about this a bit more, imagine the following: the system boots up, everything goes on as normal. dbus-daemon is started up and triggers pid1 bus activation. pid1 receives the call and triggers API bus registration. pid1 goes into

Now the remaining question is: why does

Some facts that might help: dbus-daemon logs synchronously. Hence, that log might be full, thus dbus-daemon is blocking on the log-stream to the journal. The journal might be blocking (who knows why) on pid1, and pid1 blocks on dbus-daemon via

I still don't get why the journal blocks, though. The

Also weird: why does a sleep() call in front of the

I think the blocking
Ok, @benjarobin, you deserve a prize for this. I profiled the
Thanks to @martinpitt, who gave me the base idea for the test script. And thanks to my computer, which has a reproduction rate close to 90%.
@benjarobin: Simple. pid1 has a single DGRAM socket where it receives sd_notify() messages from all other processes. The receive queue is limited to 16 messages by default. If there are 16 unit startups in parallel, the queue is full and journald blocks on the sd_notify(). At the same time pid1 blocks on AddMatch, which blocks on dbus-daemon, which blocks on logging. The fix is probably to make dbus-daemon logging async, and to make journald's sd_notify async as well.
Otherwise we might run into deadlocks, when journald blocks on the notify socket on PID 1, and PID 1 blocks on IPC to dbus-daemon and dbus-daemon blocks on logging to journald. Break this cycle by making sure that journald never ever blocks on PID 1. Note that this change disables support for event loop watchdog support, as these messages are sent in blocking style by sd-event. That should not be a big loss though, as people reported frequent problems with the watchdog hitting journald on excessively slow IO. Fixes: systemd#1505.
https://bugs.debian.org/801354 reported that 227 introduced a rather major boot regression: You often get failures like
(same for lightdm, etc.) I now found a way to reproduce this in a VM, so I'll go bisect hunting.
As for the pam_systemd thing: I wonder if that's related to the regression where the user systemd instance is now trying to unmount things during boot:
This might be something entirely different (and then mostly cosmetic), or be the cause of failing to create a user session.