Introduce sd_notify_barrier #15547

kkdwivedi · 2020-04-22T19:36:28Z

This is just a proposal I'm posting for review and feedback, it isn't really in its final form yet. It just outlines the basic idea.

sd_notify_barrier allows the caller to synchronize against the reception of messages by systemd, and ensure that all notification changes have been picked up (whether they have been processed successfully or not, e.g. in case of FDSTORE=1 cannot be determined by this) which were sent before the barrier call. This relies on a few guarantees provided by the notification socket and systemd, i.e. SOCK_DGRAM means messages will be processed in order, and systemd closes all foreign file descriptors it receives that are not sent with FDSTORE=1 (which is necessary anyway).

The write end of a pipe is sent to the manager over the notification socket, and client polls the read end for POLLHUP. Once poll returns successfully, we indicate success, otherwise, the timeout field allows for a graceful fallback in cases where systemd isn't responding (probably due to a deadlock).

There are a few questions I have:

Is it better to be explicit and add a BARRIER=1 command? I can rework the PR to do that.
Since this is a light wrapper over poll(2), I am exposing the timeout as int, but for more precision ppoll(2) can be used as well. The other concern was if usec_t (uint64_t) should be used instead.

The advantages over previous approaches are:

We avoid the race reliably in all cases. The call should take at most timeout seconds anyway, so the caller cannot really block indefinitely, and in many cases poll may return instantly because by the time the call is made, systemd may have already closed the auxiliary file descriptor it received.
When something like SCM_CGROUPS lands in the kernel, we can deprecate this call and make it return 0 immediately. Until then, it fixes the issue for good.

Fixes: #2739

poettering · 2020-04-23T16:48:16Z

hmm, interesting concept. can you elaborate on the precise usecase for you? trying to wrap my head around why one would want this. (not saying it was a bad idea to have, just wondering)

src/libsystemd/sd-daemon/sd-daemon.c

poettering · 2020-04-23T17:01:09Z

hmm, and I think there should be an explicit property that should be send here, just to keep our options open, and so that the host side can recognize this case explicitly, and no generate a warning about such messages (i mean pid1 has every right to log warnings about peers sending it arbitrary fds it doesn't expect, hence let's make sure this kind of message is recognizable properly so that the fd is expected)

kkdwivedi · 2020-04-23T19:01:39Z

hmm, interesting concept. can you elaborate on the precise usecase for you? trying to wrap my head around why one would want this. (not saying it was a bad idea to have, just wondering)

I believe the use of this call eliminates any race conditions involving lookup on the cgroup of the client by PID after it is already gone, because we wait until the fd is closed on the manager end (which would mean our previous messages have already been read and processed by then). The parent PID trick has been quite helpful in case of systemd-notify, but it's still not a complete fix (and only works with enough privileges). Using this call should mean NotifyAccess=all begins working reliably again (if you're not the child of a control process and you exit too quickly).

The linked bug really shows one other corner case of systemd-notify being invoked in a PID namespace, but talking to the host NOTIFY_SOCKET. You may consider that unsupported, but for a container payload it seems like valid usage (and it would work with SCM_CGROUPS anyway).

Ofcourse, anything that has systemd as its client or is long-lived should use the current async API and shouldn't need this.

I'll be pushing with the suggested changes soon.

src/core/manager.c

src/libsystemd/sd-daemon/sd-daemon.c

src/notify/notify.c

poettering · 2020-04-28T08:41:05Z

looks pretty good.

man page would be good to have for this too though...

kkdwivedi · 2020-04-28T08:47:01Z

@poettering Thanks for the review.

I've been wondering about one other thing: Currently the BARRIER=1 command is not allowed to be sent with anything else, but it is only problematic when sent with FDSTORE=1, as then we're not able to discern which one to close and which ones to keep (unless we decide on closing the first one in the fd array and let others be kept). Would it be better to only ignore in case it's been sent with FDSTORE=1 (and then everything else sent with them is also not processed)?

This has the advantage that we're able to send READY=1 or potentially anything other than FDSTORE=1 with BARRIER=1 in one sendmsg call and determine when it's picked up, as the destructor for fd set is invoked after the notify_message callback is made, but this could be relying too much on the implementation details...

poettering · 2020-04-28T09:03:18Z

I must say I kinda like the way you did it right now, i.e. that its a message of its own, and cannot be combined with anything else. Because it makes the ordering very clear, everything you issue before is processed before, everything you issue after is processed after. If we'd allow doing other stuff and the barrier in one msg that property is a bit lost, since suddenly the barrier itself also can carry other operations for which the order isn't as clear.

Moreover, I think it kinda pushes people towards enqueuing stuff and then syncing on it, rather than possibly just syncing along with messages all the time, dunno. i.e. there's benefit in making the barrier an extra function you need to call instead of just making it a slightly different flavour of a call ou need to make anyway.

But of course this is all just philosophical. Anyway, please leave this concept as it is, let's declare that there are barrier messages and payload messages, but not mixed messages.

man/sd_notify.xml

man/systemd-notify.xml

src/core/manager.c

src/libsystemd/sd-daemon/sd-daemon.c

src/notify/notify.c

kkdwivedi · 2020-04-29T22:28:07Z

Pushed addressing everything but the --no-block behaviour. I think the new change is better, since it is surprising if systemd-notify --ready doesn't work when directly invoked from PID 1.

I want to know if the attestation using ppid when --no-block is not used is a good idea still, given the only reason it was added was to avoid races when NotifyAccess=all was used, which should now be gone. Anyone using --no-block is then asking for the race to happen (they can use --pid without an argument to attest the message using their PID, but that sets MAINPID= too). So either we drop the whole thing altogether, or do it as a second best effort thing in case of --no-block (but that still means systemd-notify --ready --no-block invoked from PID 1 is broken).

Prompted by the discussions on systemd#15547.

Prompted by the discussions on #15547.

poettering · 2020-04-30T16:36:39Z

Could you rebase now that #15642 is merged?

src/libsystemd/sd-daemon/sd-daemon.c

poettering · 2020-04-30T16:49:32Z

BTW, you could add "Fixes #2739" to your commit msgs somewhere.

kkdwivedi · 2020-04-30T17:32:37Z

I rearranged the documentation again reflecting your --pid= additions.

poettering · 2020-04-30T17:50:14Z

excellent! lgtm!

poettering · 2020-04-30T21:46:09Z

Hmmm. Needs another rebase, sorry

This adds the sd_notify_barrier function, to allow users to synchronize against the reception of sd_notify(3) status messages. It acts as a synchronization point, and a successful return gurantees that all previous messages have been consumed by the manager. This can be used to eliminate race conditions where the sending process exits too early for systemd to associate its PID to a cgroup and attribute the status message to a unit correctly. systemd-notify now uses this function for proper notification delivery and be useful for NotifyAccess=all units again in user mode, or in cases where it doesn't have a control process as parent. Fixes: systemd#2739

Add note for change of behaviour in systemd-notify, where parent pid trick is only used when --no-block is passed, and with enough privileges ofcourse. Also, fix a small error in systemd(1).

kkdwivedi · 2020-04-30T21:58:39Z

There you go.

- This function is called for each message, and almost all of them will not contain BARRIER=1. So 5 comparaisons are realized in previous code. Now only one strstr is used. - If BARRIER=1 is found, drop the message in all case. So always return true when BARRIER=1 is found at the beginning of a new line Follow up of systemd#15547

poettering requested changes Apr 23, 2020

View reviewed changes

poettering added reviewed/needs-rework 🔨 PR has been reviewed and needs another round of reworks sd-daemon labels Apr 23, 2020

kkdwivedi force-pushed the notify-barrier branch from 8ea8cab to 763fec3 Compare April 24, 2020 05:05

kkdwivedi marked this pull request as ready for review April 24, 2020 05:11