Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce sd_notify_barrier #15547

Merged
merged 2 commits into from May 1, 2020
Merged

Conversation

kkdwivedi
Copy link
Contributor

@kkdwivedi kkdwivedi commented Apr 22, 2020

This is just a proposal I'm posting for review and feedback, it isn't really in its final form yet. It just outlines the basic idea.

sd_notify_barrier allows the caller to synchronize against the reception of messages by systemd, and ensure that all notification changes have been picked up (whether they have been processed successfully or not, e.g. in case of FDSTORE=1 cannot be determined by this) which were sent before the barrier call. This relies on a few guarantees provided by the notification socket and systemd, i.e. SOCK_DGRAM means messages will be processed in order, and systemd closes all foreign file descriptors it receives that are not sent with FDSTORE=1 (which is necessary anyway).

The write end of a pipe is sent to the manager over the notification socket, and client polls the read end for POLLHUP. Once poll returns successfully, we indicate success, otherwise, the timeout field allows for a graceful fallback in cases where systemd isn't responding (probably due to a deadlock).

There are a few questions I have:

  • Is it better to be explicit and add a BARRIER=1 command? I can rework the PR to do that.
  • Since this is a light wrapper over poll(2), I am exposing the timeout as int, but for more precision ppoll(2) can be used as well. The other concern was if usec_t (uint64_t) should be used instead.

The advantages over previous approaches are:

  • We avoid the race reliably in all cases. The call should take at most timeout seconds anyway, so the caller cannot really block indefinitely, and in many cases poll may return instantly because by the time the call is made, systemd may have already closed the auxiliary file descriptor it received.
  • When something like SCM_CGROUPS lands in the kernel, we can deprecate this call and make it return 0 immediately. Until then, it fixes the issue for good.

Fixes: #2739

@poettering
Copy link
Member

hmm, interesting concept. can you elaborate on the precise usecase for you? trying to wrap my head around why one would want this. (not saying it was a bad idea to have, just wondering)

src/libsystemd/sd-daemon/sd-daemon.c Outdated Show resolved Hide resolved
src/libsystemd/sd-daemon/sd-daemon.c Outdated Show resolved Hide resolved
src/libsystemd/sd-daemon/sd-daemon.c Outdated Show resolved Hide resolved
src/libsystemd/sd-daemon/sd-daemon.c Outdated Show resolved Hide resolved
src/libsystemd/sd-daemon/sd-daemon.c Outdated Show resolved Hide resolved
src/libsystemd/sd-daemon/sd-daemon.c Outdated Show resolved Hide resolved
@poettering
Copy link
Member

hmm, and I think there should be an explicit property that should be send here, just to keep our options open, and so that the host side can recognize this case explicitly, and no generate a warning about such messages (i mean pid1 has every right to log warnings about peers sending it arbitrary fds it doesn't expect, hence let's make sure this kind of message is recognizable properly so that the fd is expected)

@poettering poettering added reviewed/needs-rework 🔨 PR has been reviewed and needs another round of reworks sd-daemon labels Apr 23, 2020
@kkdwivedi
Copy link
Contributor Author

kkdwivedi commented Apr 23, 2020

hmm, interesting concept. can you elaborate on the precise usecase for you? trying to wrap my head around why one would want this. (not saying it was a bad idea to have, just wondering)

I believe the use of this call eliminates any race conditions involving lookup on the cgroup of the client by PID after it is already gone, because we wait until the fd is closed on the manager end (which would mean our previous messages have already been read and processed by then). The parent PID trick has been quite helpful in case of systemd-notify, but it's still not a complete fix (and only works with enough privileges). Using this call should mean NotifyAccess=all begins working reliably again (if you're not the child of a control process and you exit too quickly).

The linked bug really shows one other corner case of systemd-notify being invoked in a PID namespace, but talking to the host NOTIFY_SOCKET. You may consider that unsupported, but for a container payload it seems like valid usage (and it would work with SCM_CGROUPS anyway).

Ofcourse, anything that has systemd as its client or is long-lived should use the current async API and shouldn't need this.

I'll be pushing with the suggested changes soon.

@kkdwivedi kkdwivedi marked this pull request as ready for review April 24, 2020 05:11
src/core/manager.c Outdated Show resolved Hide resolved
src/core/manager.c Outdated Show resolved Hide resolved
src/core/manager.c Outdated Show resolved Hide resolved
src/core/manager.c Outdated Show resolved Hide resolved
src/notify/notify.c Outdated Show resolved Hide resolved
@poettering
Copy link
Member

looks pretty good.

man page would be good to have for this too though...

@kkdwivedi
Copy link
Contributor Author

kkdwivedi commented Apr 28, 2020

@poettering Thanks for the review.

I've been wondering about one other thing: Currently the BARRIER=1 command is not allowed to be sent with anything else, but it is only problematic when sent with FDSTORE=1, as then we're not able to discern which one to close and which ones to keep (unless we decide on closing the first one in the fd array and let others be kept). Would it be better to only ignore in case it's been sent with FDSTORE=1 (and then everything else sent with them is also not processed)?

This has the advantage that we're able to send READY=1 or potentially anything other than FDSTORE=1 with BARRIER=1 in one sendmsg call and determine when it's picked up, as the destructor for fd set is invoked after the notify_message callback is made, but this could be relying too much on the implementation details...

@poettering
Copy link
Member

I must say I kinda like the way you did it right now, i.e. that its a message of its own, and cannot be combined with anything else. Because it makes the ordering very clear, everything you issue before is processed before, everything you issue after is processed after. If we'd allow doing other stuff and the barrier in one msg that property is a bit lost, since suddenly the barrier itself also can carry other operations for which the order isn't as clear.

Moreover, I think it kinda pushes people towards enqueuing stuff and then syncing on it, rather than possibly just syncing along with messages all the time, dunno. i.e. there's benefit in making the barrier an extra function you need to call instead of just making it a slightly different flavour of a call ou need to make anyway.

But of course this is all just philosophical. Anyway, please leave this concept as it is, let's declare that there are barrier messages and payload messages, but not mixed messages.

man/sd_notify.xml Outdated Show resolved Hide resolved
man/systemd-notify.xml Outdated Show resolved Hide resolved
src/core/manager.c Outdated Show resolved Hide resolved
src/notify/notify.c Outdated Show resolved Hide resolved
@kkdwivedi
Copy link
Contributor Author

kkdwivedi commented Apr 29, 2020

Pushed addressing everything but the --no-block behaviour. I think the new change is better, since it is surprising if systemd-notify --ready doesn't work when directly invoked from PID 1.

I want to know if the attestation using ppid when --no-block is not used is a good idea still, given the only reason it was added was to avoid races when NotifyAccess=all was used, which should now be gone. Anyone using --no-block is then asking for the race to happen (they can use --pid without an argument to attest the message using their PID, but that sets MAINPID= too). So either we drop the whole thing altogether, or do it as a second best effort thing in case of --no-block (but that still means systemd-notify --ready --no-block invoked from PID 1 is broken).

poettering added a commit to poettering/systemd that referenced this pull request Apr 30, 2020
Prompted by the discussions on systemd#15547.
poettering added a commit that referenced this pull request Apr 30, 2020
Prompted by the discussions on #15547.
@poettering
Copy link
Member

Could you rebase now that #15642 is merged?

@poettering poettering added needs-rebase notify and removed reviewed/needs-rework 🔨 PR has been reviewed and needs another round of reworks labels Apr 30, 2020
@poettering
Copy link
Member

BTW, you could add "Fixes #2739" to your commit msgs somewhere.

@kkdwivedi
Copy link
Contributor Author

I rearranged the documentation again reflecting your --pid= additions.

@poettering
Copy link
Member

excellent! lgtm!

@poettering poettering added good-to-merge/waiting-for-ci 👍 PR is good to merge, but CI hasn't passed at time of review. Please merge if you see CI has passed and removed needs-rebase labels Apr 30, 2020
@poettering
Copy link
Member

Hmmm. Needs another rebase, sorry

This adds the sd_notify_barrier function, to allow users to synchronize against
the reception of sd_notify(3) status messages. It acts as a synchronization
point, and a successful return gurantees that all previous messages have been
consumed by the manager. This can be used to eliminate race conditions where
the sending process exits too early for systemd to associate its PID to a
cgroup and attribute the status message to a unit correctly.

systemd-notify now uses this function for proper notification delivery and be
useful for NotifyAccess=all units again in user mode, or in cases where it
doesn't have a control process as parent.

Fixes: systemd#2739
Add note for change of behaviour in systemd-notify, where parent pid trick
is only used when --no-block is passed, and with enough privileges ofcourse.

Also, fix a small error in systemd(1).
@kkdwivedi
Copy link
Contributor Author

There you go.

@poettering poettering merged commit 6eb35fd into systemd:master May 1, 2020
benjarobin added a commit to benjarobin/systemd that referenced this pull request May 1, 2020
 - This function is called for each message, and almost all of them will not
   contain BARRIER=1. So 5 comparaisons are realized in previous code. Now only
   one strstr is used.

 - If BARRIER=1 is found, drop the message in all case. So always return true
   when BARRIER=1 is found at the beginning of a new line

Follow up of systemd#15547
@kkdwivedi kkdwivedi deleted the notify-barrier branch July 5, 2020 06:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
good-to-merge/waiting-for-ci 👍 PR is good to merge, but CI hasn't passed at time of review. Please merge if you see CI has passed notify sd-daemon
Development

Successfully merging this pull request may close these issues.

Sometimes systemd-notify(3)'s cgroup="/", causing message to get dropped
2 participants