systemd notification: missing shell escape can cause startup failures #1187
Comments
besides that, IMHO it's the wrong service, anyway:
however:
|
You can easily reproduce it by using your very own chef recipes on e.g. macOS:
To get a shell while this job is running/hanging, start in the same directory:
|
@rmoriz can you please be more specific when you claim that something is "in the wrong place"? systemd support has been around for at least 6 months and somehow it works for many. |
@michaelklishin I really tried to give as many hints as possible. It doesn't work for your own CI… https://travis-ci.org/rabbitmq/chef-cookbook/jobs/221400408 |
@rmoriz "the wrong place" is not specific, I'm sorry. We have a separate CI service (not on Travis) which tests RPM and Debian packages against 8 distributions or so. It does work there and it does work for at least some real world users, including those who contributed systemd notification support (and some of them deploy RabbitMQ in a non-trivial number of varying environments). Anyhow, providing specifics would be a lot more productive than finger pointing. |
@rmoriz I should also point out that the cookbook is under our GitHub org but it is maintained by a different group of folks (mostly from Chef, Inc). |
Srsly, did you read the issue? I provided examples, logs, the path to your Erlang source and even a recipe to run everything locally. And all of that code is maintained by your org. |
@rmoriz I did, and I do not understand what "the wrong service" means exactly. Not everyone is a systemd expert. |
Here's a CentOS 7 package verification build from Concourse. I don't know if it's publicly accessible so here's a gist of the most interesting part. It does roughly the following:
The output strongly suggests that the node does start. Therefore there must be a difference between your environment and the one we use. Concourse also happens to be using containers, although it's not the point of the test. |
@michaelklishin I guess you need to involve someone who wrote the code and understands the systemd logic (maybe @binarin ?) The issues start with looking up the current service by shelling out and looking up the pid: rabbitmq-server/src/rabbit.erl Line 410 in e07ca0e
( systemctl status 202 in my example).
That returns '-.slice' which then is used for another shell-out without shell escaping to get the state - and that crashes and no notification is ever sent to systemd, leaving the job hanging ("activating"). rabbitmq-server/src/rabbit.erl Line 433 in e07ca0e
(see https://www.freedesktop.org/software/systemd/man/sd_notify.html for the systemd details) The other issue is that, IMHO, the pid (202 in my example) should actually belong to I started reverse-engineering the issue by grepping for the journald/systemd error message in your Erlang source code. |
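To make the failure mode described above concrete, here is a minimal Python sketch (not the Erlang code from `rabbit.erl`). It assumes the command is built by plain string concatenation, which is only a guess at the shape of the real shell-out; `build_unsafe_command` and `looks_like_option` are hypothetical helper names. It shows that the unit name `-.slice` reaches the program as an argument starting with a dash, which option parsers typically read as an option rather than a unit name.

```python
import shlex


def build_unsafe_command(unit_name):
    # Assumed shape of the problematic shell-out: the unit name is pasted
    # into the command string with no quoting and no "--" terminator.
    return "systemctl show --property=ActiveState " + unit_name


def looks_like_option(token):
    # getopt-style parsers treat arguments starting with "-" as options.
    return token.startswith("-") and token != "-"


command = build_unsafe_command("-.slice")
tokens = shlex.split(command)  # roughly what the shell hands to systemctl

print(tokens)
print(looks_like_option(tokens[-1]))
```

The last token is the unit name, and it is indistinguishable from an option, so the state query can fail before any notification logic runs.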
@rmoriz do you have any thoughts on the above? |
OK, so we depend on |
Your CI looks good, but you're using |
OK, now that we understand where the variability comes from and have a failing test suite, we can look into it. I wonder how much of the escaping would be sufficient here. |
I got happy news for you. 👍 The only issue left here is the un-escaping which may hurt if someone decides to run rabbitmq with a unit-name that requires escaping. I guess it's minor. Feel free to close or discuss/fix internally. This issue covered (at least) another two issues which are a part of the
Sorry for the hassle… |
No worries, I think this is a legit problem that may affect some. We will discuss it next week. |
And thank you for looking into the issues with the cookbook test suite! |
BTW, this can be re-implemented in a cleaner and more robust way using the UNIX domain socket support in Erlang 19. And even this will not be needed if systemd/systemd#2739 gets fixed by the time 18 support is dropped (but I don't put much hope in it). |
Any updates for this bug? |
The conclusion is that the issue is far more complex and deeper than it seems. It involves an interplay of many things that RabbitMQ packages do not control. Even if we were to switch to using UNIX sockets, systemd/systemd#2739 is still not resolved, and when it is, it will take years. So there is no solution that our packages can provide. I'm inclined to close this as it is not actionable as things stand right now, and #1187 (comment) suggests the OP at least somewhat agrees with that. |
@michaelklishin Are there any side effects if we change
|
@axot this is not a support forum. Please post your questions to rabbitmq-users. systemd notification support was contributed by Fedora/RHEL engineers, so I assume the |
We could change the
That would end the option list with Works on my workstation:
|
@lukebakken please submit a PR. |
Also terminate systemctl args with `--` in case the unit name starts with a dash Fixes #1187
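A small Python sketch of the idea in this commit message, offered as an illustration rather than the actual patch: pass `--` before the unit name so it cannot be parsed as an option, and quote every argument if a shell string is unavoidable. The function names are hypothetical.

```python
import shlex


def active_state_argv(unit_name):
    # "--" marks the end of options, so a unit name such as "-.slice"
    # is treated as a positional argument instead of an option.
    return ["systemctl", "show", "--property=ActiveState", "--", unit_name]


def active_state_shell_command(unit_name):
    # If the command must go through a shell, quote each argument so that
    # spaces or metacharacters in the unit name cannot change its meaning.
    return " ".join(shlex.quote(arg) for arg in active_state_argv(unit_name))


print(active_state_shell_command("-.slice"))
print(active_state_shell_command("odd name;x.service"))
```

Passing the argument list directly to a process-spawning API (without a shell) avoids the quoting step altogether; the `--` is still needed in that case.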
It won't help, real problem is that we can't reliably determine systemd unit name, which gives us Another option is to allow forcing unit name from environment or config, but I'm not sure that it'll be helpful - strange things happen with systemd inside containers. |
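The "force the unit name from environment or config" option mentioned above could look roughly like the following Python sketch. Everything here is hypothetical: the variable name `EXAMPLE_SYSTEMD_UNIT` and the `resolve_unit_name` helper are made up for illustration and do not exist in RabbitMQ.

```python
import os


def resolve_unit_name(detect, env=os.environ, var="EXAMPLE_SYSTEMD_UNIT"):
    # Prefer an explicit override (hypothetical variable name); fall back
    # to automatic detection only when no override is set.
    forced = env.get(var, "").strip()
    if forced:
        return forced
    return detect()


# Usage: detection yields a misleading name, the override wins.
print(resolve_unit_name(lambda: "-.slice",
                        env={"EXAMPLE_SYSTEMD_UNIT": "rabbitmq-server.service"}))
```

As the comment above notes, an override like this would not necessarily help inside containers where systemd itself behaves unexpectedly.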
@binarin this will at least address the problem of passing invalid arguments to
How will that be any different than what is now happening? RabbitMQ's startup correctly informs I tested both the original code and my changes on CentOS 7 and Ubuntu 16. Both used to work fine, and work fine with my changes. The bug reported here seems to be out of scope. |
@rmoriz in your docker environment, what is the output of this command?
Thanks! |
@lukebakken see the opening post. But as I wrote, this only occurs if you don't properly mount the host's Just to clarify, this only happens when:
|
@lukebakken All this mess with detecting the systemd unit name and then querying its status is needed to detect whether systemd has successfully received our notification, so that we can then safely close the 'socat' process. But if notifications are sent from within the main process itself, it will be just fire and forget: systemd will always be able to detect who sent a message to it. And then |
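For readers unfamiliar with the protocol, the "fire and forget" approach amounts to sending one datagram to the UNIX socket named by `NOTIFY_SOCKET`, as described in the sd_notify documentation. The following is a minimal Python sketch of that idea, not RabbitMQ's implementation (which would be written in Erlang); the function name `sd_notify` is simply borrowed from the C API for readability.

```python
import os
import socket


def sd_notify(state="READY=1"):
    """Send one notification datagram to the socket named by NOTIFY_SOCKET."""
    path = os.environ.get("NOTIFY_SOCKET")
    if not path:
        return False  # no notification socket was provided to this process
    if path.startswith("@"):
        # A leading "@" denotes a Linux abstract-namespace socket.
        path = "\0" + path[1:]
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(state.encode("utf-8"), path)
    return True
```

Because the datagram is sent by the main service process itself, no helper process such as 'socat' has to be kept alive, and no unit-name lookup is needed to confirm delivery.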
@binarin thanks. I noticed that Erlang has unix domain socket support last October and added the change you described to our internal tracker back then. |
This is affecting our rabbitmq systems as well. |
We merged a short term fix for |
@tbennett6421 - please see this comment - |
Weird, /sys/fs/cgroup is not empty for us. It may be a different issue then. |
@tbennett6421 are you seeing the same output from
|
Tried to debug a failing rabbitmq-server start in docker (see rabbitmq/chef-cookbook#435).
The shellout shenanigans in
rabbitmq-server/src/rabbit.erl
Line 433 in e07ca0e
What it does:
What it should:
systemctl show --property=ActiveState \\-.slice
ActiveState=inactive
Result:
=> never notifies systemd, "start" hangs forever.
"-.slice" is probably a CentOS7 thing.