Race condition causing sd_notify messages to get dropped. #2737

Closed
LukeShu opened this issue Feb 25, 2016 · 5 comments

Comments

@LukeShu
Contributor

LukeShu commented Feb 25, 2016

Submission type

[X] Bug report

[ ] Request for enhancement (RFE)

systemd version the issue has been seen with

229

Used distribution

Parabola GNU/Linux-libre (derivative of Arch Linux)

In case of bug report: Expected behaviour you didn't see

Call sd_notify(3) just before your process exits (as is done in systemd-notify(1)); I expect the message to always make it to where it's going (and show up in the journal if applicable).

In case of bug report: Unexpected behaviour you saw

Sometimes the message doesn't make it there; instead, the warning from log_warning("Cannot find unit for notify message of PID "PID_FMT".", ucred->pid); shows up in the journal.

In case of bug report: Steps to reproduce the problem

Call systemd-notify(1) repeatedly with something that will show up in the journal. A small percentage of the messages won't make it. This is because the manager decides which unit a message applies to based on the sender's cgroup string, and it determines that string by reading /proc/${sending_pid}/cgroup, which no longer exists if the sending process has been cleaned up before systemd gets around to handling the message.

It's tempting to say "well, the process is exiting anyway, so it probably doesn't matter if we lose its last words," but systemd-notify(1) is exactly such a short-lived process, and sending these messages is its whole purpose.
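
To make the race concrete, here is a minimal Python sketch of the lookup described above. It is a simplified illustration, not systemd's actual code: the function names `parse_cgroup` and `cgroup_of` and the `proc_root` parameter are made up for this example, and the parser only handles the `name=systemd` hierarchy and the unified (`0::`) entry.

```python
import os

def parse_cgroup(text):
    """Extract a cgroup path from the contents of /proc/<pid>/cgroup.

    Prefers the "name=systemd" hierarchy and falls back to the unified
    hierarchy entry ("0::<path>"). Returns None if neither is present.
    """
    fallback = None
    for line in text.splitlines():
        # Each line has the form "<hierarchy-id>:<controllers>:<path>".
        parts = line.split(":", 2)
        if len(parts) != 3:
            continue
        hierarchy_id, controllers, path = parts
        if controllers == "name=systemd":
            return path
        if hierarchy_id == "0" and controllers == "":
            fallback = path
    return fallback

def cgroup_of(pid, proc_root="/proc"):
    """Look up the cgroup of a PID; None if the process is already gone.

    proc_root is a hypothetical parameter so the lookup can be pointed
    at a different directory for testing.
    """
    try:
        with open(os.path.join(proc_root, str(pid), "cgroup")) as f:
            return parse_cgroup(f.read())
    except (FileNotFoundError, ProcessLookupError):
        # The sender exited and was reaped before we got here: this is
        # the case in which the notify message cannot be matched to a unit.
        return None
```

If `cgroup_of()` returns None for the sender's PID, the receiver has nothing left to map the message to a unit with, which corresponds to the "Cannot find unit for notify message" warning.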

@ohsix

ohsix commented Feb 25, 2016

Are they actually lost, or are they in the journal but lacking the keys you're using to look for them (like -u / _SYSTEMD_UNIT)? Try using -o json-pretty and looking for nearby messages.

this is probably an artifact of a well known bug, https://bugs.freedesktop.org/show_bug.cgi?id=50184

@LukeShu
Contributor Author

LukeShu commented Feb 25, 2016

I'm not entirely sure whether it's getting totally lost because of this bug; I found it while trying to track down another bug with notify events getting lost. But they are getting lost :-)

It's definitely related to that bug. It's at least conceptually the same, though I'm not sure if it's the same code. (I'd thought that everything had been migrated to GitHub. Shows what happens when I make assumptions...)

Saying "sometimes things from short-lived processes get lost" would be a sort-of acceptable caveat, except that systemd ships with a short-lived process for the purpose of sending these messages.

As Lennart notes, the correct fix is to get the kernel to send cgroup information. But, unlike in the linked bug, a possible workaround exists here: create a separate socket for each unit. You could filter out messages from PIDs known to be in a different unit (although I suppose that introduces another race with PID reuse), but if the process has already exited, just assume that it was in the right group, since it knew the correct value for $NOTIFY_SOCKET.
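
The decision rule proposed here could look roughly like the following Python sketch. It only illustrates the idea and is not how systemd is implemented; the function `unit_for_message` and its arguments are hypothetical names for this example.

```python
def unit_for_message(socket_unit, sender_unit):
    """Decide which unit a notify message should be credited to.

    socket_unit: the unit owning the per-unit socket the message arrived on.
    sender_unit: the unit found from the sender's cgroup, or None if the
                 sender had already exited and could not be looked up.

    Returns the unit to credit, or None if the message should be dropped.
    """
    if sender_unit is None:
        # Sender is gone; it knew this unit's $NOTIFY_SOCKET, so assume
        # it belonged to the unit that owns the socket.
        return socket_unit
    if sender_unit != socket_unit:
        # Sender is known to be in a different unit: filter it out.
        return None
    return socket_unit
```

As noted above, the filtering branch is still subject to PID reuse: a recycled PID could make a stale sender appear to belong to some other unit.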

@benjarobin
Contributor

If the parent of the process calling sd_notify() is another process (and not systemd itself), you should use sd_pid_notify() instead, passing the ppid as the first parameter.

@LukeShu
Contributor Author

LukeShu commented Feb 27, 2016

That only works if the process is root (perhaps that should be noted in the man page).

@poettering
Member

Duplicate of #2739. Let's close this version.

binford2k added a commit to binford2k/abalone that referenced this issue Sep 13, 2017
There's a bug in `systemd-notify` such that the command sends the message and then exits. Since its lifetime is so short, systemd doesn't have time to do the housekeeping to figure out which service unit to associate it with. The mitigation is to just open the socket and send the message directly ourselves, since our process is long-lived.

See systemd/systemd#2737

Fixes #16
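
The mitigation described in the commit message above (sending the message from the long-lived service process itself instead of spawning `systemd-notify`) can be sketched in Python as follows. This is an illustration of the notify datagram protocol under stated assumptions, not the code from the referenced commit; the function name `notify` and its `env` parameter are made up for this example.

```python
import os
import socket

def notify(state, env=None):
    """Send a notify message such as "READY=1" to $NOTIFY_SOCKET.

    Returns False if NOTIFY_SOCKET is not set (e.g. not running under a
    service manager), True once the datagram has been sent.
    env is a hypothetical parameter that defaults to os.environ.
    """
    env = os.environ if env is None else env
    path = env.get("NOTIFY_SOCKET")
    if not path:
        return False
    if path.startswith("@"):
        # A leading "@" denotes a Linux abstract-namespace socket,
        # which is addressed with a leading NUL byte instead.
        addr = b"\0" + path[1:].encode("utf-8")
    else:
        addr = os.fsencode(path)
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(state.encode("utf-8"), addr)
    return True
```

Because the calling process stays alive after `notify("READY=1")` returns, its /proc entry is still present when the manager handles the message, which sidesteps the race discussed in this issue.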