Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

systemd notification: missing shell escape can cause startup failures #1187

Closed
rmoriz opened this issue Apr 22, 2017 · 35 comments
Closed

systemd notification: missing shell escape can cause startup failures #1187

rmoriz opened this issue Apr 22, 2017 · 35 comments

Comments

@rmoriz
Copy link

rmoriz commented Apr 22, 2017

Tried to debug a failing rabbitmq-server start in docker (see rabbitmq/chef-cookbook#435).

The shellout shenanigans in

case os:cmd("systemctl show --property=ActiveState " ++ Unit) of
are not using proper shell escaping:

What it does:

[root@01b3d8c9ee66 /]# systemctl show --property=ActiveState -.slice
systemctl: invalid option -- '.'

What it should:

systemctl show --property=ActiveState \\-.slice
ActiveState=inactive

Result:

[root@01b3d8c9ee66 /]# systemctl status rabbitmq-server.service
● rabbitmq-server.service - RabbitMQ broker
   Loaded: loaded (/usr/lib/systemd/system/rabbitmq-server.service; enabled; vendor preset: disabled)
   Active: activating (start) since Sat 2017-04-22 14:49:26 UTC; 14min ago
 Main PID: 3169 (beam.smp)
   CGroup: /docker/01b3d8c9ee668b5396d2d374bb279181567a5f73840432192d5bd9bb62b14eea/system.slice/rabbitmq-server.service
           ├─3169 /usr/lib64/erlang/erts-5.10.4/bin/beam.smp -W w -A 64 -P 1048576 -t 5000000 -stbt db -zdbbl 32000 -K true -- -root /usr/li...
           ├─3393 inet_gethost 4
           └─3394 inet_gethost 4
           ‣ 3169 /usr/lib64/erlang/erts-5.10.4/bin/beam.smp -W w -A 64 -P 1048576 -t 5000000 -stbt db -zdbbl 32000 -K true -- -root /usr/li...

Apr 22 14:49:28 01b3d8c9ee66 rabbitmq-server[3169]: ##  ##      Licensed under the MPL.  See http://www.rabbitmq.com/
Apr 22 14:49:28 01b3d8c9ee66 rabbitmq-server[3169]: ##  ##
Apr 22 14:49:28 01b3d8c9ee66 rabbitmq-server[3169]: ##########  Logs: /var/log/rabbitmq/rabbit@01b3d8c9ee66.log
Apr 22 14:49:28 01b3d8c9ee66 rabbitmq-server[3169]: ######  ##        /var/log/rabbitmq/rabbit@01b3d8c9ee66-sasl.log
Apr 22 14:49:28 01b3d8c9ee66 rabbitmq-server[3169]: ##########
Apr 22 14:49:28 01b3d8c9ee66 rabbitmq-server[3169]: Starting broker...
Apr 22 14:49:30 01b3d8c9ee66 rabbitmq-server[3169]: systemd unit for activation check: "-.slice"
Apr 22 14:49:30 01b3d8c9ee66 rabbitmq-server[3169]: Unexpected status from systemd "systemctl: invalid option -- '.'\n"
Apr 22 14:49:30 01b3d8c9ee66 rabbitmq-server[3169]: systemd READY notification failed, beware of timeouts
Apr 22 14:49:30 01b3d8c9ee66 rabbitmq-server[3169]: completed with 0 plugins.

=> never notifies systemd, "start" hangs forever.

"-.slice" is probably a CentOS7 thing.

[root@01b3d8c9ee66 /]# rpm -qf /usr/lib/systemd/system/-.slice 
systemd-219-30.el7_3.8.x86_64
@rmoriz
Copy link
Author

rmoriz commented Apr 22, 2017

besides that, IMHO it's the wrong service, anyway:

● rabbitmq-server.service - RabbitMQ broker
   Loaded: loaded (/usr/lib/systemd/system/rabbitmq-server.service; enabled; vendor preset: disabled)
   Active: activating (start) since Sat 2017-04-22 15:19:07 UTC; 11min ago
 Main PID: 202 (beam.smp)
   CGroup: /docker/01b3d8c9ee668b5396d2d374bb279181567a5f73840432192d5bd9bb62b14eea/system.slice/rabbitmq-server.service
           ├─202 /usr/lib64/erlang/erts-5.10.4/bin/beam.smp -W w -A 64 -P 1048576 -t 5000000 -stbt db -zdbbl 32000 -K true -- -root /usr/...
           ├─426 inet_gethost 4
           └─427 inet_gethost 4
           ‣ 202 /usr/lib64/erlang/erts-5.10.4/bin/beam.smp -W w -A 64 -P 1048576 -t 5000000 -stbt db -zdbbl 32000 -K true -- -root /usr/...

however:

systemctl status 202
● -.slice - Root Slice
   Loaded: loaded (/usr/lib/systemd/system/-.slice; static; vendor preset: disabled)
   Active: active since Sat 2017-04-22 15:19:06 UTC; 13min ago
     Docs: man:systemd.special(7)
   CGroup: /docker/01b3d8c9ee668b5396d2d374bb279181567a5f73840432192d5bd9bb62b14eea
           ├─1 /usr/lib/systemd/systemd
           └─system.slice
             ├─dbus.service
             │ └─57 /bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activation
             ├─rabbitmq-server.service
             │ ├─202 /usr/lib64/erlang/erts-5.10.4/bin/beam.smp -W w -A 64 -P 1048576 -t 5000000 -stbt db -zdbbl 32000 -K true -- -root /...
             │ ├─426 inet_gethost 4
             │ └─427 inet_gethost 4
             ├─system-epmd.slice
             │ └─epmd@0.0.0.0.service
             │   └─314 /usr/bin/epmd -systemd
             ├─system-getty.slice
             │ └─getty@tty1.service
             │   └─72 /sbin/agetty --noclear tty1 linux
             ├─systemd-logind.service
             │ └─67 /usr/lib/systemd/systemd-logind
             ├─systemd-udevd.service
             │ └─28 /usr/lib/systemd/systemd-udevd
             └─systemd-journald.service
               └─22 /usr/lib/systemd/systemd-journald

@rmoriz
Copy link
Author

rmoriz commented Apr 22, 2017

You can easily reproduce it by using your very own chef recipes on e.g. macOS:

# get docker for mac + chefdk
docker -v || brew cask install docker 
/opt/chefdk/bin/chef -v || brew cask install chefdk

git clone https://github.com/rabbitmq/chef-cookbook
cd chef-cookbook

KITCHEN_LOCAL_YAML=.kitchen.dokken.yml /opt/chefdk/embedded/bin/kitchen converge default-centos-72

to get a shell while this job is running/hanging, start in the same direcotry:

KITCHEN_LOCAL_YAML=.kitchen.dokken.yml /opt/chefdk/embedded/bin/kitchen login default-centos-72

@michaelklishin
Copy link
Member

@rmoriz can you please be more specific when you claim that something is "in the wrong place"?

systemd support has been around for at least 6 months and somehow it works for many.

@rmoriz
Copy link
Author

rmoriz commented Apr 22, 2017

@michaelklishin I really tried to give as much hints as possible. It doesn't work for your own CI… https://travis-ci.org/rabbitmq/chef-cookbook/jobs/221400408

@michaelklishin
Copy link
Member

michaelklishin commented Apr 22, 2017

@rmoriz "the wrong place" is not specific, I'm sorry. We have a separate CI service (not on Travis) which tests RPM and Debian packages against 8 distributions or so. It does work there and it does work for at least some real world users, including those who contributed systemd notification support (and some of them deploy RabbitMQ in a non-trivial number of varying environments).

Anyhow, providing specifics would be a lot more productive than finger pointing.

@michaelklishin
Copy link
Member

@rmoriz I should also point our that the cookbook is under our GitHub org but it is maintained by a different group of folks (mostly from Chef, Inc).

@rmoriz
Copy link
Author

rmoriz commented Apr 22, 2017

Srsly, did you read the issue? I provided examples, logs, path to your erlang source and even a recipe to run everything local. And all of that code is maintained by your org.

@michaelklishin
Copy link
Member

michaelklishin commented Apr 22, 2017

@rmoriz I did and I do not understand what does "the wrong service" mean exactly. Not everyone is a systemd expert.

@michaelklishin
Copy link
Member

Here's a CentOS 7 package verification build from Concourse.

I don't know if it's publicly accessible so here's a gist of the most interesting part. It does roughly the following:

  • Installs an RPM package
  • Does some basic sanity checking for server startup and CLI tools and service commands
  • Uninstalls the package

The output strongly suggests that the node does start. Therefore there must be a difference between your environment and the one we use. Concourse also happens to be using containers, although it's not the point of the test.

@rmoriz
Copy link
Author

rmoriz commented Apr 22, 2017

@michaelklishin I guess you need to involve someone who wrote the code and understands the systemd logic (maybe @binarin ?)

The issues start with looking up the current service by shelling out and looking up the pid:

case catch re:run(os:cmd("systemctl status " ++ os:getpid()), "([-.@0-9a-zA-Z]+)", [unicode, {capture, all_but_first, list}]) of

(systemctl status 202 in my example).

That returns '-.slice' which then is used for another shell-out without shell escaping to get the state - and that crashes and no notification is ever sent to systemd, leaving the job hanging ("activating").

case os:cmd("systemctl show --property=ActiveState " ++ Unit) of

(see https://www.freedesktop.org/software/systemd/man/sd_notify.html for the systemd details)

The other issue is, that IMHO the pid (202 in my example) actually (IMHO) should belong to rabbitmq-server.service and not to the root slice (-.slice) anyway.

I started reverseing the issue by grepping thethe journald/systemd error message in your erlang source code.

@michaelklishin
Copy link
Member

@rmoriz do you have any thoughts on the above?

@michaelklishin
Copy link
Member

OK, so we depend on systemctl status output here. That's probably explains it. Thanks, that was very helpful.

@rmoriz
Copy link
Author

rmoriz commented Apr 22, 2017

Your CI looks good, but you're using /sbin/init as pid1 command and the chef docker-setup uses /usr/lib/systemd/systemd (not sure if this is a problem yet)

@michaelklishin
Copy link
Member

OK, now that we understand where the variability comes from and have a failing test suite, we can look into it. I wonder how much of the escaping would be sufficient here.

@rmoriz
Copy link
Author

rmoriz commented Apr 22, 2017

I got happy news for you. 👍

The only issue left here is the un-escaping which may hurt if someone decides to run rabbitmq with a unit-name that requires escaping. I guess it's minor. Feel free to close or discuss/fix internally.

This issue covered (at least) another two issues which are a part of the chef-cookbook repo:

  • Looks like Docker container need the cgroup fs binding even when they run privilged. Then rabbitmq gets the right service (not -.slice) and starts as expected (also because rabbitmq-server.service does not need escaping).
    This is broken in the chef-cookbook CI setup but also not configured by almost all chef-cookbook systemd CI setups on GitHub yet. I guess this is not an issue because most software projects are too lazy to implement sd_notify and other "advanced" systemd features yet. 😩

  • chef-cookbook tests still fail due to other reasons. I'll continue my deep journey (with PRs) in CentOS 7 is broken (in dokken) chef-cookbook#435

Sorry for the hassle…

@michaelklishin
Copy link
Member

No worries, I think this is a legit problem that may affect some. We will discuss it next week.

@michaelklishin
Copy link
Member

michaelklishin commented Apr 22, 2017

And thank you for looking into the issues with the cookbook test suite!

@binarin
Copy link
Contributor

binarin commented Apr 23, 2017

BTW, this can be re-implemented in a more clean and robust way using unix domain socket support in erlang 19. And even this will not be needed if systemd/systemd#2739 gets fixed by the time of dropping 18 support (but I don't put much hope on it :trollface: )

@michaelklishin michaelklishin removed this from the 3.6.10 milestone Apr 23, 2017
@michaelklishin michaelklishin changed the title systemd notification: missing shell escape causes failing startup on CentOS 7 systemd notification: missing shell escape can cause startup failures Apr 23, 2017
@axot
Copy link

axot commented Feb 2, 2018

Any updates for this bug?

@michaelklishin michaelklishin removed the bug label Feb 2, 2018
@michaelklishin
Copy link
Member

michaelklishin commented Feb 2, 2018

The conclusion is that the issue is way more complex and deeper than it seems. It involves an interplay of many things that RabbitMQ packages do not control. Even if were to switch to using UNIX sockets, systemd/systemd#2739 is still not resolved and when it is, it will take years
until services can depend on it (only support distributions that ship a systemd version that new).

So there is no solution that our packages can provide. I'm inclined to close this as it is not actionable as things stand right now and #1187 (comment) suggests the OP at least somewhat agrees with that.

@axot
Copy link

axot commented Feb 2, 2018

@michaelklishin Is there any side-effects if we change Type=notify to Type=simple?

@michaelklishin
Copy link
Member

@axot this is not a support forum. Please post your questions to rabbitmq-users. systemd notification support was contributed by Fedora/RHEL engineers, so I assume the simple strategy doesn't work very well, e.g. when a node can take a while to start, e.g. during an upgrade or just recovering/reindexing a large enough data set.

@lukebakken
Copy link
Collaborator

We could change the os:cmd call to this:

case os:cmd("systemctl show --property=ActiveState -- '" ++ Unit ++ $') of

That would end the option list with -- and quote Unit using single quotes. I suspect if your unit name contains a single quote you are intentionally asking for problems.

Works on my workstation:

[root@shostakovich ~]# systemctl show --property=ActiveState -- '-.slice'
ActiveState=active

@michaelklishin
Copy link
Member

@lukebakken please submit a PR.

lukebakken added a commit that referenced this issue Feb 2, 2018
Also terminate systemctl args with `--` in case the unit name starts with a dash

Fixes #1187
@binarin
Copy link
Contributor

binarin commented Feb 2, 2018

It won't help, real problem is that we can't reliably determine systemd unit name, which gives us -.slice. The only real solution is to re-implement notifications using unix-domain socket support which is now available in Erlang for some time. That will make all problems completely go away, as in that case it will be rabbitmq process itself who reports to systemd.

Another option is to allow forcing unit name from environment or config, but I'm not sure that it'll be helpful - strange things happen with systemd inside containers.

@lukebakken
Copy link
Collaborator

lukebakken commented Feb 2, 2018

@binarin this will at least address the problem of passing invalid arguments to systemctl.

The only real solution is to re-implement notifications using unix-domain socket support which is now available in Erlang for some time

How will that be any different than what is now happening? RabbitMQ's startup correctly informs systemd of its PID. If systemctl status PID doesn't return the correct unit, how is that RabbitMQ's fault?

I tested both the original code and my changes on CentOS 7 and Ubuntu 16. Both used to work fine, and work fine with my changes. The bug reported here seems to be out of scope.

@lukebakken
Copy link
Collaborator

@rmoriz in your docker environment, what is the output of this command?

systemctl status rabbitmq-server.service

Thanks!

@rmoriz
Copy link
Author

rmoriz commented Feb 2, 2018

@lukebakken see opening post. But as I wrote, this only occurs if you don't proper mount hosts' /sys/fs/cgroup into a container. This is required and documented but was not very well known among tooling https://developers.redhat.com/blog/2016/09/13/running-systemd-in-a-non-privileged-container/

Just to clarify, this only happens when:

  1. Without /sys/fs/cgroup mounted, the unit will be resolved/named -.slice
  2. not properly escaping (the leading) '-' causes the failure for which I opened this issue.

@binarin
Copy link
Contributor

binarin commented Feb 3, 2018

@lukebakken All this mess with detecting systemd unit name and then querying its status is needed to detect whether systemd has successfully received our notification, so we can then safely close 'socat' process. But if notifications are sent from within the main process itself, it will be just fire and forget - systemd will be always able to detect who sent a message to it. And then NotifyAccess=all can be also removed from unit files.

@lukebakken
Copy link
Collaborator

@binarin thanks. I noticed that Erlang has unix domain socket support last October and added the change you described to our internal tracker back then.

@tbennett6421
Copy link

This is affecting our rabbitmq systems as well.

@michaelklishin michaelklishin added this to the 3.7.4 milestone Feb 13, 2018
@michaelklishin
Copy link
Member

We merged a short term fix for 3.7.4 in #1494.

@lukebakken
Copy link
Collaborator

@tbennett6421 - please see this comment - /sys/fs/cgroup must be available.

@tbennett6421
Copy link

weird /sys/fs/cgroup is not empty for us. may be a different issue then.

@lukebakken
Copy link
Collaborator

@tbennett6421 are you seeing the same output from systemctl status rabbitmq-server.service as reported initially?

Apr 22 14:49:30 01b3d8c9ee66 rabbitmq-server[3169]: systemd unit for activation check: "-.slice"
Apr 22 14:49:30 01b3d8c9ee66 rabbitmq-server[3169]: Unexpected status from systemd "systemctl: invalid option -- '.'\n"

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

7 participants