Fix tests and services with PrivateNetwork=yes running under LXC with AppArmor #32945
I think the change in behaviour is fine: the change in "business logic" is narrowly tailored and looks like it cannot have any effect except in the specific case it targets.
guys, can we please stop taping over all kinds of not completely understood issues in other projects? let's take a step back please, this is a hack, and this should not be committed like this, in particular without any input from aa folks.
If you have a better solution for fixing test failures please propose a PR to do it, otherwise this will have to do as it's a release blocker. You can find the test infrastructure with instructions on how to reproduce it locally here: https://ci.debian.net/packages/s/systemd/ (documentation tab at the top)
Also I'm not really sure what's left to understand. There's a denial, and things break. The only other solution is to change the policy, but obviously you cannot do that from the guest, and infrastructure owners are perfectly entitled to stop network namespace nesting in their infrastructure, and we should not fall flat on our face when that happens. PrivateNetwork=yes is already gracefully skipped when denied on other occasions (SELinux); this is just adding a missing bit for detecting it on this occasion.
Sorry, but MAC denials result in EPERM (or maybe EACCES), but not EAGAIN. Can you explain the EAGAIN to me?
and we do handle MAC denials gracefully, generically, i.e. EPERM/EACCES. But any code that handles EAGAIN like EPERM/EACCES under some weird, specific conditions raises all alarm bells
Well, denials can result in a variety of things, depending on which syscall or action or hook or LSM is used. I agree it's a bit weird to see EAGAIN here, but it's what I am getting. That is clear not only from the logs but also from the fact that this change works and fixes the issues.
Sorry, but this seems like a bug, not expected behaviour. This needs more investigation. For example, looking at setup_shareable_ns() there might be other cases where we might hit EAGAIN, so maybe it's really not the namespace stuff that fails with EAGAIN but our logic around it. Hence, this deserves more investigation, this is just taping over some bug otherwise. A bug either in our code or in the kernel, doesn't really matter, we should figure out what's going on here.
the 2nd commit is fine, lgtm, good to merge. the 1st commit is a hack i'd rather not see. it sets in stone a workaround for a temporary bug in debian/aa/lxc, and we generally don't do that. It's fine to gracefully handle MAC denials even if they are overzealous, but I am pretty sure that workarounds against temporary bugs in other projects should not be merged into our tree like the first commit. in particular as the fixes are already merged upstream, debian is just a bit slow. can't you get these bugs fixed in debian? backported?
This is not a Debian problem, it's a kernel regression, and as already mentioned we have been trying to get the fix backported to existing LTS kernels for a year, which has not happened yet. Every kernel below 6.2 is affected, and our baseline is 4.9 or so
Debian does not patch upstream kernels, backport fixes and stuff? they use vanilla kernels? interesting. i am surprised anything at all works for them then... if debian insists on testing on broken kernels/aa without avenue to ever get this fixed, then maybe patch this out locally in the debian package? there must be a way to apply patches at debian package build time, no? rpm has that, it's quite widely used.
As already mentioned the fix is very complicated, it's not a one-liner, so it is not really surprising they don't want to take it without upstream support.
Once again, this affects anybody running on kernel < v6.2, not just debian. Our kernel baseline as per README is 3.15.
As per the documentation, EACCES is only returned when F_SETLK is used, and only on some platforms, which doesn't seem to include Linux: https://github.com/torvalds/linux/blob/master/fs/locks.c F_OFD_SETLK is documented to only return EAGAIN, and F_SETLKW/F_OFD_SETLKW are blocking operations so this logic doesn't apply to them in the first place. Hence, only automatically convert EACCES into EAGAIN for F_SETLK operations, and propagate the original error in the other cases. This is important because in some cases we catch permission errors and gracefully fallback, which is not possible if the original error is lost. This is an issue in practice because, due to a kernel bug present before v6.2, AppArmor denies locking on file descriptors to LXC containers. We support all currently maintained LTS kernels, including v6.1, where despite a lot of effort and attempts over almost a year, the bugfix still hasn't been backported, as it is complex and requires large changes to AppArmor. On affected kernels, all services running with PrivateNetwork=yes fail and do not recover, instead of the normal behaviour of gracefully downgrading to PrivateNetwork=no. The integration tests in the Debian CI fail due to this issue: https://ci.debian.net/packages/s/systemd/testing/arm64/46828037/
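To make the errno handling described in the commit message above concrete, here is a minimal sketch in C. It is not the code from the PR: `normalize_lock_errno()` and `try_lock_fd()` are hypothetical names invented for illustration, and the sketch only assumes the behaviour the commit message states, namely that EACCES is folded into EAGAIN for F_SETLK and every other error is propagated unchanged.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper (not a systemd function): decide which errno the
 * caller should see for a failed fcntl() locking command. */
int normalize_lock_errno(int cmd, int error) {
        /* Only F_SETLK is documented to report lock contention as EACCES
         * on some platforms, so only there is EACCES folded into EAGAIN. */
        if (cmd == F_SETLK && error == EACCES)
                return EAGAIN;

        /* For every other command keep the original error, so that a
         * permission denial stays visible to callers that want to fall
         * back gracefully. */
        return error;
}

/* Hypothetical wrapper: lock the whole file behind fd.
 * Returns 0 on success or a negative errno-style value on failure. */
int try_lock_fd(int fd, int cmd, short type) {
        struct flock fl = {
                .l_type = type,        /* F_RDLCK, F_WRLCK or F_UNLCK */
                .l_whence = SEEK_SET,
                .l_start = 0,
                .l_len = 0,            /* 0 means "until end of file" */
        };

        assert(fd >= 0);

        if (fcntl(fd, cmd, &fl) < 0)
                return -normalize_lock_errno(cmd, errno);

        return 0;
}
```

With this shape, a caller using a command other than F_SETLK that is denied by an LSM would see the permission error itself rather than something that looks like ordinary lock contention.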
I've changed the test to explicitly look for permission errors as we commonly do (including in this test, which checks for uid 0 already), instead of checking for lxc+apparmor, and it works in the CI env. So hopefully this is good to go now, PTAL.
Dunno, I ran such kernels for a long time, never had an issue. As i understand AA is pretty much an ubuntu/debian only thing, no? But this doesn't trip on Ubuntu either, does it?
Yeah I meant with apparmor ofc, given it's a regression in that LSM. Yes it affects Ubuntu too, but that CI runs in qemu nowadays for all architectures, so it wasn't spotted. Debian's runs in LXC. So to be really pedantic, everybody running on a kernel < 6.2 with apparmor in a container is affected. Anyway, it's moot, I have changed the test to spot permission errors as it's normally done.
Note, we already have the check below. That should be mostly for LXC, right? (systemd/test/units/TEST-43-PRIVATEUSER-UNPRIV.sh, lines 11 to 14 in 7a321b5)
Not just lxc, iirc that's the default on Ubuntu now, so it affects bare metal/qemu runs too, as user namespaces are disabled by default via apparmor. It's unrelated to this issue, which is about locking operations.
LGTM. I think not mangling the errno value (for the blocking case) is generally the right thing to do. It's unfortunate that the syscall uses the same return code for two very different situations in the non-blocking case, but this is a problem that we cannot fix. But at least we shouldn't make it more widespread.
i don't like using errnos as process exit status, we so far never did this, please use the bsd error status EX_NOPERM instead. otherwise looks ok to me.
When running in LXC with AppArmor we'll most likely get an error when creating a network namespace due to a kernel regression in < v6.2 affecting AppArmor, resulting in denials. Like other tests, avoid failing in case of permission issues and handle it gracefully.
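As a rough illustration of the review suggestion to report permission problems with the BSD exit status EX_NOPERM instead of a raw errno, the following C sketch maps an error to a process exit status. The function name `exit_status_for_error()` is hypothetical and the mapping is an assumption made for illustration; the actual test in the PR may do this differently.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <sysexits.h>

/* Hypothetical helper: translate a positive errno value (or 0 for success)
 * into a process exit status, without leaking the raw errno. */
int exit_status_for_error(int error) {
        assert(error >= 0);

        if (error == 0)
                return EXIT_SUCCESS;

        /* Permission problems, e.g. a MAC denial when creating a network
         * namespace, are reported with the sysexits.h status EX_NOPERM so
         * that a caller can recognise them and skip gracefully. */
        if (error == EPERM || error == EACCES)
                return EX_NOPERM;

        /* Anything else is an ordinary failure. */
        return EXIT_FAILURE;
}
```

A caller could then compare the child's exit status against EX_NOPERM to tell "not permitted in this environment" apart from a genuine failure.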