New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improvements for notify services (including #4212) #4745
Conversation
@joukewitteveen , I'd like to postpone this PR. |
I'll take a look at #4752 later this week when I have time, but skimming the report I noticed the phrase 'race condition'. The second commit of this PR (only fail [...] during start) might solve this problem. Also: failing exit.service with a protocol error is correct, since the READY=1 message never makes it. |
Sure
Hm, not sure I understand
doesn't fail. Is it correct? |
I'd say yes - in the first case you have race condition between main process termination and receiving of |
I don't have race condition: |
Hm,
And
works fine. But fails sometimes of course (if
|
3fc75e5
to
e0c662c
Compare
Added a section to NEWS. |
I restarted the semaphore run |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks pretty good, just some minor changes!
if (s->type == SERVICE_NOTIFY) | ||
/* No chance of getting a ready notification anymore */ | ||
service_enter_signal(s, SERVICE_FINAL_SIGTERM, SERVICE_FAILURE_PROTOCOL); | ||
else if (s->pid_file_pathspec) { | ||
else /* Fall through */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
uh, dropping into the next case as "else" statement looks strange. Quite franky, I am surprised this works at all...
Can you rewrite this and make the first if branch end in a "break", so that we don't need the "else" here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No problem.
Do we want this change at all? It is clear that the test should apply only to SERVICE_START, yet as far as I can tell it is impossible for Type=notify services to end up here while in a SERVICE_START_POST state. The main reason for this change is, I think, that it is more 'correct'.
@@ -1252,7 +1252,8 @@ static int service_spawn( | |||
if (!our_env) | |||
return -ENOMEM; | |||
|
|||
if ((flags & EXEC_IS_CONTROL) ? s->notify_access == NOTIFY_ALL : s->notify_access != NOTIFY_NONE) | |||
if ((flags & EXEC_IS_CONTROL) ? IN_SET(s->notify_access, NOTIFY_ALL, NOTIFY_EXEC) | |||
: s->notify_access != NOTIFY_NONE) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i figure this check got complex enough to deserve its function. service_exec_needs_notify_sockets(Service *s, flags)
or so, with some explanatory comments owuld be great!
pid, s->main_pid, s->control_pid); | ||
else if (s->main_pid != 0 || s->control_pid != 0) | ||
log_unit_warning(u, "Got notification message from PID "PID_FMT", but reception only permitted for %s PID "PID_FMT, | ||
pid, s->main_pid ? "main" : "control", s->main_pid | s->control_pid); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That binary OR here is evil as fuck! ;-)
I think this would be better with a ternary operator, after all you generate the "main" and "control" string already that way...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assume s->main_pid ?: s->control_pid
(?:
is a GCC extension) is out of the question?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we use ?:
pretty frequently... but in this case I'd actually avoid it, as we try to get people to do numeric != 0 checks by spelling out != 0
, instead of relying on C's degrade-to-bool feature (we use that only for ptr and bool-like objects, but not for numeric variables)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, but then the ternary operator for "main" vs. "control" has to be changed too and the part becomes
s->main_pid != 0 ? "main" : "control", s->main_pid != 0 ? s->main_pid : s->control_pid
This almost warrants a separate if block.
The alternative
s->main_pid ? "main" : "control", s->main_pid ?: s->control_pid
looks pretty neat if you ask me.
In the end I can see the appeal in either. Shall I add a clause to the if block?
if (s->main_pid != 0 && s->control_pid != 0)
log_unit_warning(u, "Got notification message from PID "PID_FMT", but reception only permitted for main PID "PID_FMT" and control PID "PID_FMT,
pid, s->main_pid, s->control_pid);
else if (s->main_pid != 0)
log_unit_warning(u, "Got notification message from PID "PID_FMT", but reception only permitted for main PID "PID_FMT, pid, s->main_pid);
else if (s->control_pid != 0)
log_unit_warning(u, "Got notification message from PID "PID_FMT", but reception only permitted for control PID "PID_FMT, pid, s->control_pid);
else
log_unit_debug(u, "Got notification message from PID "PID_FMT", but reception only permitted for main PID and control PID which are currently not known", pid);
return;
P.S. Do we really want the last case to be a debug message?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, the version with 4 branches indeed looks nicer. I think making all of them warnings would make sense.
@@ -17,6 +17,9 @@ CHANGES WITH 233 in spe | |||
The 'n' choice for the confirmation spawn prompt has been removed, | |||
because its meaning was confusing. | |||
|
|||
* Services of type notify require a READY=1 notification to be sent |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please write that as "Services of Type=notify require…"...
@@ -17,6 +17,9 @@ CHANGES WITH 233 in spe | |||
The 'n' choice for the confirmation spawn prompt has been removed, | |||
because its meaning was confusing. | |||
|
|||
* Services of type notify require a READY=1 notification to be sent | |||
during startup. If no such message is sent, the service now fails. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe also say: "…, even if the process exited with a zero exit code." or so?
-bash-4.3# systemctl cat a --no-pager
# /etc/systemd/system/a.service
[Service]
Type=notify
ExecStart=/bin/sh -x -c 'exit 0'
-bash-4.3# systemctl start a
Job for a.service failed.
See "systemctl status a.service" and "journalctl -xe" for details. Shouldn't we update the -bash-4.3# systemctl cat b --no-pager
# /etc/systemd/system/b.service
[Service]
Type=notify
ExecStart=/bin/sh -x -c 'exit 1'
-bash-4.3# systemctl start b
Job for b.service failed because the control process exited with error code.
See "systemctl status b.service" and "journalctl -xe" for details. Why not the |
That's matter for |
I need some help with this one. For the explanations table I am not sure what the right values are. We only raise the error type through For the BTW: Documentation of error types is quite scattered. I am sorry for not covering it all in the commit that introduced the new error type. There is also another table under https://www.freedesktop.org/software/systemd/man/systemd.exec#$EXIT_CODE ... Edit: In the above message, I mixed up the explanations table and the Summary of possible service result variable values table in the systemd.service manpage. The explanations table is easy to fix. Time for some sleep. |
@joukewitteveen , that's a good point. But I think Also, I think we should return the first meaningful error: |
CI failure is issue #4753 again and unrelated. As this is "changes requested" anyway, there is going to be a new run anyway, so not restarting. @joukewitteveen, it would be great if you could rebase against master so that we get some useful debugging out of that failure. Thanks! |
We stay in the SERVICE_START while no READY=1 notification message has been received. When we are in the SERVICE_START_POST state, we have already received a ready notification. Hence we should not fail when the cgroup becomes empty in that state.
We assume a process can be only one of the two in service_sigchld_event.
e0c662c
to
6047ebf
Compare
Processed the requested changes and added an entry to the explanation table. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks great, one minor string issue
@@ -764,6 +764,7 @@ static const struct { | |||
const char *result, *explanation; | |||
} explanations [] = { | |||
{ "resources", "of unavailable resources or another system error" }, | |||
{ "protocol", "the service did not take the steps required by its configuration" }, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm "configuration" might be misleading. most people would probably look into the daemon's own configuratoin when debugging this when they read this. Maybe simply add in a "unit" there, so that it becomes "… required by its unit configuration"...
so the Restart= table in the docs is incomplete currently. I am not entirely sure it needs to list all result codes however (but if somebody preps a patch I'd happily merge it!), hence I think your PR is actually good as it is right now in this regard. The restart code should require no updates for the new result code, the existing restart modes should cover this well enough already. Hence, please do the minor string fix, then it's good to merge. And if you want to update the restart docs, that'd be welcome too, but I think not necessary. |
6047ebf
to
0b6fb8b
Compare
Cool! Updated. |
What about https://www.freedesktop.org/software/systemd/man/systemd.exec#$EXIT_CODE ?
What about https://github.com/systemd/systemd/pull/4724/files ?
|
True, that table should indeed be updated. @joukewitteveen can you add that please? (and I figure a quick comment next to the enum referencing that man page and suggesting to update it when when adding a new enum might be great) |
Yupp, that should be updated to match the message string. @joukewitteveen can you fix that too, please? |
What would the values be for the https://www.freedesktop.org/software/systemd/man/systemd.exec#$EXIT_CODE table? |
@joukewitteveen well, for "protocol" the exit status would be clean, right, hence $EXIT_STATUS would be "exited", and $EXIT_STATUS would be "0", in the typical case, right? |
That also depends on what future uses of the error there may be. It is currently used for Type=notify services that never send READY=1 and for services that promise to write a pid file but don't. |
Hm, we don't trigger the -bash-4.3# systemctl cat a --no-pager
# /etc/systemd/system/a.service
[Service]
Type=notify
ExecStart=/bin/sh -x -c 'exit 0'
ExecStopPost=/bin/sh -x -c 'set'
-bash-4.3# systemctl start a
...
Nov 28 20:58:03 sh[48]: + exit 0
Nov 28 20:58:03 systemd[1]: a.service: Child 48 belongs to a.service
Nov 28 20:58:03 systemd[1]: a.service: Main process exited, code=exited, status=0/SUCCESS
Nov 28 20:58:03 systemd[1]: a.service: Changed start -> failed
Nov 28 20:58:03 systemd[1]: a.service: Job a.service/start finished, result=failed
Nov 28 20:58:03 systemd[1]: Failed to start a.service.
Nov 28 20:58:03 systemd[1]: a.service: Unit entered failed state.
Nov 28 20:58:03 systemd[1]: a.service: Failed with result 'protocol'.
Nov 28 20:58:03 systemd[1]: a.service: cgroup is empty |
@joukewitteveen well, there's always he option to chicken out and just say "any of the above" in the table, like the "resources" line already does... ;-) |
@evverx We do here and here. |
@joukewitteveen will, the table isn't totally honest anyway, as you can fiddle with the clean exit statuses anyway with So, the table at best shows only "typical" results... So adding the line with exited/0 sounds OK to me. |
@joukewitteveen , well, I start this unit [Service]
Type=notify
ExecStart=/bin/sh -x -c 'exit 0'
ExecStopPost=/bin/sh -x -c 'set' And |
@evverx for Type=notify services, we never go through ExecStop{,Post}. For the pid file situation we sometimes do. |
So
and
doesn't make sense. Right? |
Setting NotifyAccess=exec allows notifications coming directly from any control process.
0b6fb8b
to
6375bd2
Compare
Ok, so I only fixed the phrasing in the systemd.exec manpage and did not address the table and failure behavior. Here is a way to trigger STOP_POST with a protocol error:
Resulting in
Note the absence of both EXIT_STATUS and EXIT_CODE. |
lgtm! |
<option>all</option>. If <option>none</option>, no daemon status | ||
updates are accepted from the service processes, all status | ||
update messages are ignored. If <option>main</option>, only | ||
service updates sent from the main process of the service are |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, we should fix the main
-description too. Actually, root
can send "fake"-ucred data, i.e.
User=0
NotifyAccess=main
ExecStart=/bin/sh -x -c 'sleep 2; systemd-notify --ready; sleep 10'
works fine
But this unrelated to this PR, of course
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@joukewitteveen , thanks!
Three improvements for services of type notify.
src/core/service.c:service_sigchld_event
assumes a process is the main process or the control process, but not both.