Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix tests and services with PrivateNetwork=yes running under LXC with AppArmor #32945

Merged
merged 2 commits into from
May 22, 2024

Conversation

bluca
Copy link
Member

@bluca bluca commented May 21, 2024

As per the documentation, EACCES is only returned when F_SETLK is
used, and only on some platforms, which doesn't seem to include
Linux:

https://github.com/torvalds/linux/blob/master/fs/locks.c

F_OFD_SETLK is documented to only return EAGAIN, and F_SETLKW/F_OFD_SETLKW
are blocking operations so this logic doesn't apply to them in the
first place.

Hence, only automatically convert EACCES into EAGAIN for F_SETLK
operations, and propagate the original error in the other cases.

This is important because in some cases we catch permission errors
and gracefully fallback, which is not possible if the original error
is lost.

This is an issue in practice because, due to a kernel bug present
before v6.2, AppArmor denies locking on file descriptors to LXC
containers. We support all currently maintained LTS kernels,
including v6.1, where despite a lot of effort and attempts over almost
a year, the bugfix still hasn't been backported, as it is complex and
requires large changes to AppArmor.
On affected kernels, all services running with PrivateNetwork=yes
fail and do not recover, instead of the normal behaviour of gracefully
downgrading to PrivateNetwork=no.

The integration tests in the Debian CI fail due to this issue:

https://ci.debian.net/packages/s/systemd/testing/arm64/46828037/

@bluca bluca added this to the v256 milestone May 21, 2024
@github-actions github-actions bot added the please-review PR is ready for (re-)review by a maintainer label May 21, 2024
Copy link
Member

@keszybz keszybz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the change in behaviour is fine: the change in "business logic" is narrowly tailored and looks like it cannot have any effect except in the specific case it targets.

src/test/test-namespace.c Outdated Show resolved Hide resolved
@keszybz keszybz added good-to-merge/with-minor-suggestions and removed please-review PR is ready for (re-)review by a maintainer labels May 21, 2024
@keszybz keszybz added good-to-merge/waiting-for-ci 👍 PR is good to merge, but CI hasn't passed at time of review. Please merge if you see CI has passed and removed good-to-merge/with-minor-suggestions labels May 21, 2024
src/core/unit.c Outdated Show resolved Hide resolved
src/core/exec-invoke.c Outdated Show resolved Hide resolved
@poettering
Copy link
Member

guys, can we please stop taping over all kinds of not completely understood issues in other projects? let's take a step back please, this is a a hack, and this should not be commited like this, in particular without any input from aa folks.

@poettering poettering added needs-discussion 🤔 and removed good-to-merge/waiting-for-ci 👍 PR is good to merge, but CI hasn't passed at time of review. Please merge if you see CI has passed labels May 21, 2024
@bluca
Copy link
Member Author

bluca commented May 21, 2024

guys, can we please stop taping over all kinds of not completely understood issues in other projects? let's take a step back please, this is a a hack, and this should not be commited like this, in particular without any input from aa folks.

If you have a better solution for fixing test failures please propose a PR to do it, otherwise this will have to do as it's a release blocker. You can find the test infrastructure with instruction on how to reproduce it locally here: https://ci.debian.net/packages/s/systemd/ (documentation tab at the top)

@bluca
Copy link
Member Author

bluca commented May 21, 2024

guys, can we please stop taping over all kinds of not completely understood issues in other projects? let's take a step back please, this is a a hack, and this should not be commited like this, in particular without any input from aa folks.

Also I'm not really sure what's left to understand. There's a denial, and things break. The only other solution is to change the policy, but obviously you cannot do that from the guest, and infrastructure owners are perfectly entitled to stop network namespace nesting in their infrastructure, and we should not fall flat on our face when that happens - PrivateNetwork=yes is already gracefully skipped when denied in other occasions (selinux), this is just adding a missing bit for detecting it in this occasion.

@bluca
Copy link
Member Author

bluca commented May 21, 2024

Also issues with LXC + AppArmor are not really new, what changed now is that the Debian CI is always running and is now mandatory, so I cannot really ignore it anymore. E.g.:

#10166
#10011
#9700

@poettering
Copy link
Member

Also I'm not really sure what's left to understand. There's a denial, and things break.

Sorry, but MAC denials result in EPERM (or maybe EACCES), but not EAGAIN.

Can you explain the EAGAIN to me?

@poettering
Copy link
Member

and we do handle MAC denials gracefully, generically, i.e. EPERM/EACCES.

But any code that handles EAGAIN like EPERM/EACCESS under some weird, specific conditioning raises all alarm bells

@bluca
Copy link
Member Author

bluca commented May 21, 2024

Also I'm not really sure what's left to understand. There's a denial, and things break.

Sorry, but MAC denials result in EPERM (or maybe EACCES), but not EAGAIN.

Can you explain the EAGAIN to me?

Well, denials can result in a variety of things, depending on which syscall or action or hook or LSM is used. I agree it's a bit weird to see EAGAIN here, but it's what I am getting, it's not only clear from the logs but also from the fact that this change works and fixes the issues.

@github-actions github-actions bot added the please-review PR is ready for (re-)review by a maintainer label May 21, 2024
@poettering
Copy link
Member

Well, denials can result in a variety of things, depending on which syscall or action or hook or LSM is used. I agree it's a bit weird to see EAGAIN here, but it's what I am getting, it's not only clear from the logs but also from the fact that this change works and fixes the issues.

Sorry, but this seems like a bug, not expected behaviour.

This needs more investigation. For example, looking at setup_shareable_ns() there might be other cases we might hit EAGAIN, so maybe it's really not the namespace stuff that fails with EAGAIN but our logic around it.

Hence, this deserves more investigation, this is just taping over some bug otherwise. A bug either in our code or in the kernel, doesn't really matter, we should figure out what's going on here.

@poettering
Copy link
Member

the 2nd commit is fine, lgtm, good to merge. the 1st commit is a hack i'd rather not see. it sets in stone a workaround for a temporary bug in debian/aa/lxc that we shouldn't set in stone, we generally don't do that.

It's fine to gracefully handling MAC denials even if they are overzealous, but I am pretty sure that workarounds against temporary bugs in other projects should not be merged into our tree like the first commit. in particular as the fixes are already merged upstream, debian is just a bit slow.

can't you get these bugs fixed in debian? backported?

@bluca
Copy link
Member Author

bluca commented May 22, 2024

can't you get these bugs fixed in debian? backported?

This is not a Debian problem, it's a kernel regression, and as already mentioned we have been trying to get it backported to existing LTS kernels for a year, did not happen yet. Every kernel below 6.2 is affected, and our baseline is 4.9 or so

@poettering
Copy link
Member

This is not a Debian problem, it's a kernel regression, and as already mentioned we have been trying to get it backported to existing LTS kernels for a year, did not happen yet. Every kernel below 6.2 is affected, and our baseline is 4.9 or so

Debian does not patch upstream kernels, backport fixes and stuff? they use vanilla kernels? interesting. i am surprised anything at all works for them then...

if debian insists on testing on broken kernels/aa without avenue to ever get this fixed, then maybe patch this out locally in the debian package? there must be a way to apply patches at debian package build time, no? rpm has that, it's quite widely used.

@bluca
Copy link
Member Author

bluca commented May 22, 2024

This is not a Debian problem, it's a kernel regression, and as already mentioned we have been trying to get it backported to existing LTS kernels for a year, did not happen yet. Every kernel below 6.2 is affected, and our baseline is 4.9 or so

Debian does not patch upstream kernels, backport fixes and stuff? they use vanilla kernels? interesting. i am surprised anything at all works for them then...

As already mentioned the fix is very complicated, it's not a one-liner, so it is not really surprising they don't want to take it without upstream support.

if debian insists on testing on broken kernels/aa without avenue to ever get this fixed, then maybe patch this out locally in the debian package? there must be a way to apply patches at debian package build time, no? rpm has that, it's quite widely used.

Once again, this affects anybody running on kernel < v6.2, not just debian. Our kernel baseline as per README is 3.15.

As per the documentation, EACCES is only returned when F_SETLK is
used, and only on some platforms, which doesn't seem to include
Linux:

https://github.com/torvalds/linux/blob/master/fs/locks.c

F_OFD_SETLK is documented to only return EAGAIN, and F_SETLKW/F_OFD_SETLKW
are blocking operations so this logic doesn't apply to them in the
first place.

Hence, only automatically convert EACCES into EAGAIN for F_SETLK
operations, and propagate the original error in the other cases.

This is important because in some cases we catch permission errors
and gracefully fallback, which is not possible if the original error
is lost.

This is an issue in practice because, due to a kernel bug present
before v6.2, AppArmor denies locking on file descriptors to LXC
containers. We support all currently maintained LTS kernels,
including v6.1, where despite a lot of effort and attempts over almost
a year, the bugfix still hasn't been backported, as it is complex and
requires large changes to AppArmor.
On affected kernels, all services running with PrivateNetwork=yes
fail and do not recover, instead of the normal behaviour of gracefully
downgrading to PrivateNetwork=no.

The integration tests in the Debian CI fail due to this issue:

https://ci.debian.net/packages/s/systemd/testing/arm64/46828037/
@bluca
Copy link
Member Author

bluca commented May 22, 2024

I've changed the test to explicitly look for permission errors as we commonly do (including in this test, which checks for uid 0 already), instead for lxc+apparmor, and it works in the CI env. So hopefully this is good to go now, PTAL.

@poettering
Copy link
Member

Once again, this affects anybody running on kernel < v6.2, not just debian. Our kernel baseline as per README is 3.15.

Dunno, I ran such kernels for a long time, never had an issue. As i understand AA is pretty much an ubuntu/debian only thing, no? But this doesn't trip on Ubuntu either, does it?

@bluca
Copy link
Member Author

bluca commented May 22, 2024

Once again, this affects anybody running on kernel < v6.2, not just debian. Our kernel baseline as per README is 3.15.

Dunno, I ran such kernels for a long time, never had an issue. As i understand AA is pretty much an ubuntu/debian only thing, no? But this doesn't trip on Ubuntu either, does it?

Yeah I meant with apparmor ofc, given it's a regression in that LSM. Yes it affects Ubuntu too, but that CI runs in qemu nowadays for all architectures, so it wasn't spotted. Debian's run in LXC. So to be really pedantic, everybody running on a kernel < 6.2 with apparmor in a container is affected.

Anyway, it's moot, I have changed the test to spot permission errors as it's normally done.

@yuwata
Copy link
Member

yuwata commented May 22, 2024

Note, we already have below. That should be mostly for LXC, right?

if [[ "$(sysctl -ne kernel.apparmor_restrict_unprivileged_userns)" -eq 1 ]]; then
echo "Cannot create unprivileged user namespaces" >/skipped
exit 77
fi

@bluca
Copy link
Member Author

bluca commented May 22, 2024

Note, we already have below. That should be mostly for LXC, right?

if [[ "$(sysctl -ne kernel.apparmor_restrict_unprivileged_userns)" -eq 1 ]]; then
echo "Cannot create unprivileged user namespaces" >/skipped
exit 77
fi

Not just lxc, iirc that's the default on Ubuntu now, so it affects bare metal/qemu runs too, as user namespaces are disabled by default via apparmor. It's unrelated to this issue, which is about locking operations.

Copy link
Member

@keszybz keszybz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. I think not mangling the errno value (for the blocking case) is generally the right thing to do. It's unfortunate that the syscall uses the same return code for two very diffferent situations in non-blocking case, but this is a problem that we cannot fix. But at least we shouldn't make it more widespread.

@keszybz keszybz added good-to-merge/waiting-for-ci 👍 PR is good to merge, but CI hasn't passed at time of review. Please merge if you see CI has passed and removed please-review PR is ready for (re-)review by a maintainer ci-failure-appears-unrelated labels May 22, 2024
@poettering
Copy link
Member

i don't like using errnos as process exit status, we so far never did this, please use the bsd error status EX_NOPERM instead.

otherwise looks ok to me.

@poettering poettering added good-to-merge/with-minor-suggestions and removed good-to-merge/waiting-for-ci 👍 PR is good to merge, but CI hasn't passed at time of review. Please merge if you see CI has passed labels May 22, 2024
When running in LXC with AppArmor we'll most likely get an error when creating
a network namespace due to a kernel regression in < v6.2 affecting AppArmor,
resulting in denials. Like other tests, avoid failing in case of permission
issues and handle it gracefully.
@bluca bluca added good-to-merge/waiting-for-ci 👍 PR is good to merge, but CI hasn't passed at time of review. Please merge if you see CI has passed and removed good-to-merge/with-minor-suggestions labels May 22, 2024
@bluca bluca merged commit 6840ecb into systemd:main May 22, 2024
42 of 49 checks passed
@bluca bluca deleted the lxc_network_test branch May 22, 2024 20:07
@github-actions github-actions bot removed the good-to-merge/waiting-for-ci 👍 PR is good to merge, but CI hasn't passed at time of review. Please merge if you see CI has passed label May 22, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging this pull request may close these issues.

None yet

4 participants