regression: some rkt-tests fail #4752

Closed
evverx opened this issue Nov 27, 2016 · 12 comments
Labels: pid1, regression

Comments

evverx (Member) commented Nov 27, 2016

I occasionally run the rkt test suite (https://github.com/coreos/rkt/tree/master/tests).

$ ~/rkt/build-rkt-1.20.0+git/target/bin/rkt version
rkt Version: 1.20.0+git3a72142
appc Version: 0.8.9
Go Version: go1.6.3
Go OS/Arch: linux/amd64
Features: +TPM +SDJOURNAL

1a1b13c: all tests pass
a748a01 (master): the following tests fail:

--- FAIL: TestCapsSeveralAppWithPatches (19.67s)
--- FAIL: TestCapsSeveralAppWithFlags (20.28s)
--- FAIL: TestPodManifest (169.52s)
--- FAIL: TestAppUserGroup (110.90s)
--- FAIL: TestVolumes (31.34s)

Seems like #4693 introduced the regression. I'll try to bisect tomorrow.

cc @lucab, @s-urbaniak
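
A bisect over this range would follow the usual pattern (sketch only, using the good/bad commits named above; the full session log appears in a later comment):

# Sketch of the planned bisect; at each step, build systemd and
# run the rkt test suite against the resulting binaries.
git bisect start
git bisect bad a748a01      # master: tests fail
git bisect good 1a1b13c     # all tests passed
# ...then, after each build/test cycle, mark the checkout:
#   git bisect good    (tests pass)
#   git bisect bad     (tests fail)
# until git prints the first bad commit.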

evverx added the nspawn and regression labels Nov 27, 2016
lucab (Contributor) commented Nov 27, 2016

Thanks for the heads-up! According to our nightly builds, rkt:

  • was building fine with systemd-master until 2016-11-23, 04:00
  • is failing with systemd-master since 2016-11-23, 16:00

In that timeframe we didn't merge any relevant PRs, and v231 is still green.

(yes, rkt configure should probably spit out the systemd commit in use instead of "master" :)

evverx (Member, Author) commented Nov 28, 2016

(yes, rkt configure should probably spit out the systemd commit in use instead of "master" :)

👍

Seems like #4693 introduced the regression

Hm, I was wrong: #4259 (3d474ef) introduced the regression.

$ git bisect log
git bisect start
# bad: [a748a0169a11ed0803fd5b1df6f8eb1c4af61803] Merge pull request #4736 from dobyrch/calendar-cleanup
git bisect bad a748a0169a11ed0803fd5b1df6f8eb1c4af61803
# good: [1a1b13c9573b8cd30a4ab8dca2ec7961e460f083] seccomp: add @filesystem syscall group (#4537)
git bisect good 1a1b13c9573b8cd30a4ab8dca2ec7961e460f083
# bad: [fadc06bb8166b7ee494ed90b054f083ac4db4e11] Merge pull request #4259 from joukewitteveen/notify
git bisect bad fadc06bb8166b7ee494ed90b054f083ac4db4e11
# good: [ef8b0084552e05f28b9132d5dfc75edae164a991] sd-dhcp-client: use free_and_strdup
git bisect good ef8b0084552e05f28b9132d5dfc75edae164a991
# good: [c5c755e1bc76165a635e7036b0f7888395d82ae7] Merge pull request #4693 from poettering/nspawn-ephemeral
git bisect good c5c755e1bc76165a635e7036b0f7888395d82ae7
# good: [3d4cf7de48a74726694abbaa09f9804b845ff3ba] build-sys: check for lz4 in the old and new numbering scheme (#4717)
git bisect good 3d4cf7de48a74726694abbaa09f9804b845ff3ba
# bad: [3d474ef7a687e2052aa303e0f95893b2fc610475] service: fix main processes exit behavior for type notify services
git bisect bad 3d474ef7a687e2052aa303e0f95893b2fc610475
# good: [c35755fb878af58b80ac62a501a75f79c90a3763] service: introduce protocol error type
git bisect good c35755fb878af58b80ac62a501a75f79c90a3763
# first bad commit: [3d474ef7a687e2052aa303e0f95893b2fc610475] service: fix main processes exit behavior for type notify services

I'm trying to work out a "minimal" reproducer. I need help :-)

cc @joukewitteveen

evverx (Member, Author) commented Nov 28, 2016

According to our nightly builds, rkt:
is failing with systemd-master since 2016-11-23, 16:00

BTW, in

https://jenkins-rkt-public.prod.coreos.systems/job/rkt-master-periodic/325/os_type=debian-testing,stage1_flavor=src_master_v999/consoleText

I don't see

--- FAIL: TestAppUserGroup

and the TestAppUserGroup test passes against 3d474ef. So that one is a separate regression/bug...

evverx (Member, Author) commented Nov 28, 2016

Hm: Failed with result 'protocol'.

Some logs:

Nov 28 02:01:33 rkt-1128e42d-322e-4367-9479-7224833f7cd4 systemd-journald[10]: Time spent on flushing to /var is 1.158ms for 0 entries.
Nov 28 02:01:33 rkt-1128e42d-322e-4367-9479-7224833f7cd4 systemd-journald[10]: Journal started
Nov 28 02:01:33 rkt-1128e42d-322e-4367-9479-7224833f7cd4 systemd-journald[10]: System journal (/var/log/journal/1128e42d322e436794797224833f7cd4) is 8.0M, max 3.9G, 3.9G free.
Nov 28 02:01:33 rkt-1128e42d-322e-4367-9479-7224833f7cd4 systemd[1]: Starting Prepare minimum environment for chrooted applications...
Nov 28 02:01:33 rkt-1128e42d-322e-4367-9479-7224833f7cd4 systemd[1]: Starting Prepare minimum environment for chrooted applications...
Nov 28 02:01:33 rkt-1128e42d-322e-4367-9479-7224833f7cd4 systemd[1]: Started Prepare minimum environment for chrooted applications.
Nov 28 02:01:33 rkt-1128e42d-322e-4367-9479-7224833f7cd4 systemd[1]: Started Prepare minimum environment for chrooted applications.
Nov 28 02:01:33 rkt-1128e42d-322e-4367-9479-7224833f7cd4 systemd[1]: Starting Application=rkt-inspect Image=coreos.com/rkt-inspect...
Nov 28 02:01:33 rkt-1128e42d-322e-4367-9479-7224833f7cd4 systemd[1]: Starting Application=rkt-inspect-2 Image=coreos.com/rkt-inspect-2...
Nov 28 02:01:33 rkt-1128e42d-322e-4367-9479-7224833f7cd4 systemd[1]: Failed to start Application=rkt-inspect Image=coreos.com/rkt-inspect.
Nov 28 02:01:33 rkt-1128e42d-322e-4367-9479-7224833f7cd4 systemd[1]: reaper-rkt-inspect.service: Unit not needed anymore. Stopping.
Nov 28 02:01:33 rkt-1128e42d-322e-4367-9479-7224833f7cd4 systemd[1]: rkt-inspect.service: Unit entered failed state.
Nov 28 02:01:33 rkt-1128e42d-322e-4367-9479-7224833f7cd4 systemd[1]: rkt-inspect.service: Triggering OnFailure= dependencies.
Nov 28 02:01:33 rkt-1128e42d-322e-4367-9479-7224833f7cd4 systemd[1]: rkt-inspect.service: Failed with result 'protocol'.
Nov 28 02:01:33 rkt-1128e42d-322e-4367-9479-7224833f7cd4 systemd[1]: Stopping rkt-inspect-2 Reaper...
Nov 28 02:01:33 rkt-1128e42d-322e-4367-9479-7224833f7cd4 systemd[1]: shutdown.service: Unit not needed anymore. Stopping.
Nov 28 02:01:33 rkt-1128e42d-322e-4367-9479-7224833f7cd4 systemd[1]: Stopping rkt-inspect Reaper...
Nov 28 02:01:33 rkt-1128e42d-322e-4367-9479-7224833f7cd4 systemd[1]: Stopped rkt-inspect Reaper.
Nov 28 02:01:33 rkt-1128e42d-322e-4367-9479-7224833f7cd4 systemd[1]: shutdown.service: Unit not needed anymore. Stopping.
Nov 28 02:01:33 rkt-1128e42d-322e-4367-9479-7224833f7cd4 systemd[1]: Stopped rkt-inspect-2 Reaper.
Nov 28 02:01:33 rkt-1128e42d-322e-4367-9479-7224833f7cd4 systemd[1]: shutdown.service: Unit not needed anymore. Stopping.
Nov 28 02:01:33 rkt-1128e42d-322e-4367-9479-7224833f7cd4 systemd[1]: Reached target Halt.

I removed --pre-sleep=1 and the TestVolumes test passed:

-               `/bin/sh -c "export FILE=/dir1/file CONTENT=1 ; ^RKT_BIN^ --debug --insecure-options=image run --mds-register=false --volume=dir1,kind=host,source=^TMPDIR^ --inherit-env=true ^VOL_RW_WRITE_FILE_ONLY^ ^VOL_RO_READ_FILE_ONLY^ --exec /inspect -- --pre-sleep=1 --read-file"`,
+               `/bin/sh -c "export FILE=/dir1/file CONTENT=1 ; ^RKT_BIN^ --debug --insecure-options=image run --mds-register=false --volume=dir1,kind=host,source=^TMPDIR^ --inherit-env=true ^VOL_RW_WRITE_FILE_ONLY^ ^VOL_RO_READ_FILE_ONLY^ --exec /inspect -- --read-file"`,
                `<<<1>>>`,
                0,
        },

Hm. I don't understand how the one-second sleep affects the protocol error. It looks like a race between the SIGCHLD and the notify event.
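
A sketch of the suspected race, assuming a throwaway unit (notify-race.service is a hypothetical name): the main process sends READY=1 and exits immediately, so PID 1 may reap the child before it has read the notify datagram; repeated starts should occasionally hit the losing ordering:

# Hypothetical reproducer for the suspected SIGCHLD/notify race:
# the main process exits right after notifying.
cat > /etc/systemd/system/notify-race.service <<'EOF'
[Service]
Type=notify
NotifyAccess=all
ExecStart=/bin/sh -c 'systemd-notify --ready'
EOF
systemctl daemon-reload

# Loop until a start fails with result 'protocol':
while systemctl start notify-race; do :; done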

evverx removed the nspawn label Nov 28, 2016
evverx (Member, Author) commented Nov 28, 2016

@lucab, why does rkt-inspect.service contain Type=notify? The /inspect binary doesn't send READY=1 (or maybe I'm missing something).

I've prepared a patch:

diff --git a/tests/image/manifest b/tests/image/manifest
index 75fd5d3..bc52c47 100644
--- a/tests/image/manifest
+++ b/tests/image/manifest
@@ -73,7 +73,7 @@
         },
         {
             "name": "appc.io/executor/supports-systemd-notify",
-            "value": "true"
+            "value": "false"
         }
     ]
 }

With this patch, all tests pass against 3d474ef :-) (except TestNetDefaultRestrictedConnectivity, which is rkt/rkt#2943).

Anyway, that sleep race is a real bug (though I'm not sure how to reproduce it reliably).

evverx (Member, Author) commented Nov 28, 2016

Also, from https://github.com/systemd/systemd/pull/4724/files:

protocol (in case of a protocol violation; if a service did not take the steps required by its configuration)

What does that mean? :-)

evverx (Member, Author) commented Nov 28, 2016

-bash-4.3# systemctl cat exit --no-pager
# /etc/systemd/system/exit.service
[Service]
Type=notify
NotifyAccess=all
ExecStart=/bin/sh -x -c 'systemd-notify --ready'

At 3d474ef~1:

Nov 28 07:04:54 systemd-testsuite systemd[1]: exit.service: Collecting.
Nov 28 07:04:54 systemd-testsuite systemd[1]: exit.service: Collecting.
Nov 28 07:04:59 systemd-testsuite systemd[1]: exit.service: Trying to enqueue job exit.service/start/replace
Nov 28 07:04:59 systemd-testsuite systemd[1]: exit.service: Installed new job exit.service/start as 186
Nov 28 07:04:59 systemd-testsuite systemd[1]: exit.service: Enqueued job exit.service/start as 186
Nov 28 07:04:59 systemd-testsuite systemd[1]: exit.service: Failed to reset devices.list: No such file or directory
Nov 28 07:04:59 systemd-testsuite systemd[1]: exit.service: Failed to set pids.max: No such file or directory
Nov 28 07:04:59 systemd-testsuite systemd[1]: exit.service: Passing 0 fds to service
Nov 28 07:04:59 systemd-testsuite systemd[1]: exit.service: About to execute: /bin/sh -x -c 'systemd-notify --ready'
Nov 28 07:04:59 systemd-testsuite systemd[1]: exit.service: Forked /bin/sh as 40
Nov 28 07:04:59 systemd-testsuite systemd[1]: exit.service: Changed dead -> start
Nov 28 07:04:59 systemd-testsuite systemd[40]: exit.service: Executing: /bin/sh -x -c 'systemd-notify --ready'
Nov 28 07:04:59 systemd-testsuite systemd[1]: Starting exit.service...
Nov 28 07:04:59 systemd-testsuite sh[40]: + systemd-notify --ready
Nov 28 07:04:59 systemd-testsuite systemd[1]: exit.service: Child 40 belongs to exit.service
Nov 28 07:04:59 systemd-testsuite systemd[1]: exit.service: Main process exited, code=exited, status=0/SUCCESS
Nov 28 07:04:59 systemd-testsuite systemd[1]: exit.service: Changed start -> dead
Nov 28 07:04:59 systemd-testsuite systemd[1]: exit.service: Job exit.service/start finished, result=done
Nov 28 07:04:59 systemd-testsuite systemd[1]: Started exit.service.
Nov 28 07:04:59 systemd-testsuite systemd[1]: exit.service: cgroup is empty
Nov 28 07:04:59 systemd-testsuite systemd[1]: exit.service: Collecting.
Nov 28 07:04:59 systemd-testsuite systemd[1]: exit.service: Failed to send unit change signal for exit.service: Connection reset by peer
Nov 28 07:04:59 systemd-testsuite systemd[1]: exit.service: Failed to send unit remove signal for exit.service: Transport endpoint is not connected

At 3d474ef:

Nov 28 06:25:23 systemd-testsuite systemd[1]: exit.service: Trying to enqueue job exit.service/start/replace
Nov 28 06:25:23 systemd-testsuite systemd[1]: exit.service: Installed new job exit.service/start as 3795
Nov 28 06:25:23 systemd-testsuite systemd[1]: exit.service: Enqueued job exit.service/start as 3795
Nov 28 06:25:23 systemd-testsuite systemd[1]: exit.service: Failed to reset devices.list: No such file or directory
Nov 28 06:25:23 systemd-testsuite systemd[1]: exit.service: Failed to set pids.max: No such file or directory
Nov 28 06:25:23 systemd-testsuite systemd[1]: exit.service: Passing 0 fds to service
Nov 28 06:25:23 systemd-testsuite systemd[1]: exit.service: About to execute: /bin/sh -x -c 'systemd-notify --ready'
Nov 28 06:25:23 systemd-testsuite systemd[1]: exit.service: Forked /bin/sh as 467
Nov 28 06:25:23 systemd-testsuite sh[467]: + systemd-notify --ready
Nov 28 06:25:23 systemd-testsuite systemd[1]: exit.service: Changed dead -> start
Nov 28 06:25:23 systemd-testsuite systemd[1]: Starting exit.service...
Nov 28 06:25:23 systemd-testsuite systemd[1]: exit.service: Child 467 belongs to exit.service
Nov 28 06:25:23 systemd-testsuite systemd[1]: exit.service: Main process exited, code=exited, status=0/SUCCESS
Nov 28 06:25:23 systemd-testsuite systemd[1]: exit.service: Changed start -> failed
Nov 28 06:25:23 systemd-testsuite systemd[1]: exit.service: Job exit.service/start finished, result=failed
Nov 28 06:25:23 systemd-testsuite systemd[1]: Failed to start exit.service.
Nov 28 06:25:23 systemd-testsuite systemd[1]: exit.service: Unit entered failed state.
Nov 28 06:25:23 systemd-testsuite systemd[1]: exit.service: Failed with result 'protocol'.
Nov 28 06:25:23 systemd-testsuite systemd[1]: exit.service: cgroup is empty

Related: Race condition causing sd_notify messages to get dropped

lucab (Contributor) commented Nov 28, 2016

@evverx you are right, it doesn't send READY=1 yet; the PR implementing that somehow got stuck at rkt/rkt#3227. Toggling the boolean in the manifest seems like the right thing to do until that PR is re-spun. Would you like to submit your one-liner?

Regarding the sleep race, it looks like the whole test is broken with Type=notify but should be back to normal with your one-liner. Do you still see failures without Type=notify and with the sleep in place?

evverx (Member, Author) commented Nov 28, 2016

@lucab , thanks!

So, the regression is:

-bash-4.3# systemctl cat exit --no-pager
# /etc/systemd/system/exit.service
[Service]
User=1001
Type=notify
NotifyAccess=all
ExecStart=/bin/sh -x -c 'systemd-notify --ready --pid=$$$$'

-bash-4.3# while systemctl start exit; do :; done
See "systemctl status exit.service" and "journalctl -xe" for details.

evverx (Member, Author) commented Nov 28, 2016

BTW this is a regression too:

-bash-4.3# systemctl cat exit --no-pager
# /etc/systemd/system/exit.service
[Service]
Type=notify
ExecStart=/bin/sh -x -c 'sleep 20'

-bash-4.3# journalctl -u exit --no-pager
...
exit.service: Failed with result 'protocol'.

The docs say nothing about 'protocol', except:

protocol (in case of a protocol violation; if a service did not take the steps required by its configuration)

and that's not very clear.

Another manpage:

Behavior of notify is similar to simple; however, it is expected that the daemon sends a notification message via sd_notify(3) or an equivalent call when it has finished starting up.

But it doesn't say that the daemon must send a notification message.

So why should we break all such services? And why not add a warning to NEWS?

joukewitteveen (Contributor) commented
When a notify service fails after no READY notification was received, that is not a regression but improved behavior. Previously, such units would silently succeed and ignore any ExecStartPost= commands (see the commit message of 3d474ef). That was wrong.

The docs say nothing about 'protocol', except:

protocol (in case of a protocol violation; if a service did not take the steps required by its configuration)

and that's not very clear.

I couldn't find better wording that wasn't too long (#4724). Service configurations come with their own set of expectations; when the unit doesn't exhibit the expected behavior, it fails with this new protocol error.

Another manpage:

Behavior of notify is similar to simple; however, it is expected that the daemon sends a notification message via sd_notify(3) or an equivalent call when it has finished starting up.

But it doesn't say that the daemon must send a notification message.

So why should we break all such services?

As noted before, these services were already broken; we just didn't fail on them. As I read it, 'it is expected' should be read as 'must' in the manpage.
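
For illustration, a minimal sketch (not from the thread) of how the earlier Type=notify + sleep 20 unit can satisfy the protocol: signal readiness before doing the work. NotifyAccess=all is needed here because systemd-notify runs as a child of the shell rather than as the main process.

# Sketch: a notify unit that actually sends READY=1, so the start job
# completes instead of failing with result 'protocol'.
[Service]
Type=notify
NotifyAccess=all
ExecStart=/bin/sh -c 'systemd-notify --ready; sleep 20'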

And why not add a warning to NEWS?

Agreed. This is added to #4745.

evverx (Member, Author) commented Nov 28, 2016

This is added to #4745.

@joukewitteveen , thanks!

I'm closing this one.
