Let's group these three up. LSM is how AppArmor and SELinux are implemented.
LSMs are the de facto way to accomplish mandatory or role-based access control on Linux, as opposed to the default user/group "discretionary" access control (DAC).
SELinux was built for very powerful policies, often to its detriment: it was not unusual for policies to be flawed because of that complexity. These days, though, I think it's doing pretty well on Android.
AppArmor is much simpler, though less powerful. It uses a per-process sandboxing model, mostly focusing on read/write/execute permissions on the filesystem for a given process.
> - seccomp
> - seccomp-bpf
The OG seccomp gave a program access to four system calls: 'exit', 'sigreturn', 'read', and 'write'.
Basically if you had, say, a parser, you could fork it off and have the child enter the seccomp sandbox before touching any input (it can't exec afterwards, since execve isn't one of the four). All it could do is read/write to file descriptors it already had open, so it could read stdin and write to stdout.
Really cool, but limited.
seccomp-bpf was an extension of the idea behind seccomp - that is, policies on system calls.
The main idea is that system calls are the way that shit gets done, whether it's exploiting the kernel or actually just performing malicious actions. So, we want to limit them.
This works by loading a filter program into the kernel's BPF virtual machine (classic BPF, not eBPF), and that filter runs on every system call and decides what to allow or not. It's very powerful and effective, but it's hard to maintain because you have to know your syscalls ahead of time.
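To make the "filter decides per syscall" part concrete, here's a rough sketch of what such a filter looks like as data. It builds an allowlist as classic BPF instructions and runs it through a toy interpreter. Assumptions: x86_64 syscall numbers, an allowlist I picked to mirror the original seccomp, and `build_filter`, `to_bytes`, and `evaluate` are names I made up. It doesn't install anything into the kernel; a real program would hand the packed bytes to `prctl`/`seccomp(2)` or use libseccomp.

```python
import struct

# Classic BPF opcodes, values as in <linux/bpf_common.h>
BPF_LD_W_ABS = 0x20  # A = 32-bit word at offset k in seccomp_data
BPF_JEQ_K = 0x15     # if A == k skip jt instructions, else skip jf
BPF_RET_K = 0x06     # return the constant k as the verdict

# Verdicts and arch constant, values as in <linux/seccomp.h> and <linux/audit.h>
SECCOMP_RET_ALLOW = 0x7FFF0000
SECCOMP_RET_KILL_PROCESS = 0x80000000
AUDIT_ARCH_X86_64 = 0xC000003E

# x86_64 syscall numbers; the allowlist itself is an illustrative assumption
ALLOWED = {"read": 0, "write": 1, "exit": 60, "exit_group": 231}


def build_filter(allowed_nrs):
    """Return the filter as a list of (code, jt, jf, k) instructions."""
    prog = [
        (BPF_LD_W_ABS, 0, 0, 4),                      # A = seccomp_data.arch
        (BPF_JEQ_K, 1, 0, AUDIT_ARCH_X86_64),         # right arch: skip the kill
        (BPF_RET_K, 0, 0, SECCOMP_RET_KILL_PROCESS),  # wrong arch: kill
        (BPF_LD_W_ABS, 0, 0, 0),                      # A = seccomp_data.nr
    ]
    for nr in allowed_nrs:
        prog.append((BPF_JEQ_K, 0, 1, nr))            # no match: skip the allow
        prog.append((BPF_RET_K, 0, 0, SECCOMP_RET_ALLOW))
    prog.append((BPF_RET_K, 0, 0, SECCOMP_RET_KILL_PROCESS))  # default deny
    return prog


def to_bytes(prog):
    """Pack into the struct sock_filter layout: u16 code, u8 jt, u8 jf, u32 k."""
    return b"".join(struct.pack("<HBBI", *ins) for ins in prog)


def evaluate(prog, nr, arch=AUDIT_ARCH_X86_64):
    """Toy interpreter for the three opcodes above; not the kernel's code."""
    # seccomp_data: int nr, u32 arch, u64 instruction_pointer, u64 args[6]
    data = struct.pack("<iIQ6Q", nr, arch, 0, 0, 0, 0, 0, 0, 0)
    pc = acc = 0
    while True:
        code, jt, jf, k = prog[pc]
        if code == BPF_LD_W_ABS:
            acc = struct.unpack_from("<I", data, k)[0]
            pc += 1
        elif code == BPF_JEQ_K:
            pc += 1 + (jt if acc == k else jf)
        elif code == BPF_RET_K:
            return k
        else:
            raise ValueError(f"unsupported opcode {code:#x}")
```

Note the filter only ever looks at the syscall number and arch here. It sees argument values too, but a pointer argument is just a number to it, which is why path-based rules aren't possible at this layer.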
> - cgroups
> - cgroupsV2
cgroups are less about sandboxing and more about restricting resource usage (CPU, memory, I/O, number of processes).
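If you want to see which cgroup a process is in, a minimal sketch (assuming Linux with `/proc` mounted; `parse_cgroup` and `own_cgroups` are just names I picked) is to parse `/proc/self/cgroup`, where each line is `hierarchy-ID:controllers:path`:

```python
from pathlib import Path


def parse_cgroup(text):
    """Parse /proc/<pid>/cgroup content into a list of dicts."""
    entries = []
    for line in text.splitlines():
        if not line.strip():
            continue
        # Split at most twice: the cgroup path itself may contain ':'
        hierarchy, controllers, path = line.split(":", 2)
        entries.append({
            "hierarchy": int(hierarchy),
            # cgroup v2 lines have hierarchy 0 and an empty controller list
            "controllers": controllers.split(",") if controllers else [],
            "path": path,
        })
    return entries


def own_cgroups(proc_file="/proc/self/cgroup"):
    """Return this process's cgroup entries, or [] if the file is unavailable."""
    try:
        return parse_cgroup(Path(proc_file).read_text())
    except OSError:
        return []  # not Linux, or /proc not mounted
```

On a cgroup v2 system the limits themselves live in files such as `memory.max`, `pids.max`, and `cpu.max` in the matching directory under the cgroup mount point (commonly `/sys/fs/cgroup`).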
> - namespaces
Namespaces give a program its own view of some slice of your system. In a pid namespace it won't see other processes, unless they're in that same namespace. In a network namespace the process will think it's got its own network to itself.
This is obviously quite useful for containers.
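A quick way to look at this, sketched under the assumption of a Linux `/proc`: every process has symlinks under `/proc/<pid>/ns`, and two processes share a namespace exactly when the link targets match. `namespaces` and `shared` are hypothetical helper names.

```python
import os


def namespaces(pid="self"):
    """Map namespace name -> identifier such as 'pid:[4026531836]'."""
    ns_dir = f"/proc/{pid}/ns"
    result = {}
    try:
        names = os.listdir(ns_dir)
    except OSError:
        return result  # not Linux, no such pid, or no permission
    for name in names:
        try:
            result[name] = os.readlink(os.path.join(ns_dir, name))
        except OSError:
            pass  # skip entries we are not allowed to read
    return result


def shared(a, b):
    """Names of the namespaces that two processes have in common."""
    return sorted(n for n in a if n in b and a[n] == b[n])
```

Comparing `namespaces()` with `namespaces(some_pid)` for a containerized process would typically show different `pid`, `net`, and `mnt` entries, though reading another process's links can require privileges.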
There's even more than that though, way more.
There's Yama, another LSM. There's chroots, chroot jails. There's the entire DAC system with users and groups. Root "capabilities".
Windows isn't a ton better fwiw, it has jobs, integrity (including undocumented hidden integrity levels), virtual desktops, DAC, AppContainer (or something?), AppLocker, UAC virtualization, etc.
In terms of what to use, that depends. Whether you get AppArmor or SELinux depends on the distro, but IMO AppArmor is pretty trivial to write and maintain so it's sort of a harmless thing to deliver alongside a service/application.
Seccomp is arguably the most powerful, in particular if it's combined with namespaces. A seccomp filter can't dereference pointer arguments to system calls, so filtering on file path etc is out, and namespaces cover that gap. Together, assuming a tight seccomp filter, this is really damn powerful. But you'll want really good test coverage so that you can determine ahead of time if you accidentally introduced a new system call.
The main thing imo is to take the dangerous part of your program and isolate it into a small component. That's just the most important thing to do, and from there sandboxing becomes (more or less) trivial. That component can run as an unprivileged user, or in a namespace with no access, or in a seccomp v1 jail with nothing but a pipe for reading and writing.
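Here's a minimal sketch of that split, with a made-up "parser" that sums comma-separated integers. The risky code runs in its own process and only talks over stdin/stdout; `PARSER_SRC` and `parse_untrusted` are hypothetical names. To be clear, this shows only the process boundary. By itself it's not a sandbox; the child is where you'd then drop to an unprivileged user, enter a namespace, or apply seccomp.

```python
import json
import subprocess
import sys

# The "dangerous" component: reads stdin, writes one JSON object to stdout.
PARSER_SRC = r"""
import json, sys
fields = [int(x) for x in sys.stdin.read().split(",")]
json.dump({"count": len(fields), "total": sum(fields)}, sys.stdout)
"""


def parse_untrusted(payload, timeout=5.0):
    """Run the parser in a child process; return its result or None on failure."""
    try:
        proc = subprocess.run(
            [sys.executable, "-I", "-c", PARSER_SRC],  # -I: isolated mode
            input=payload,
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return None  # a hung parser only costs us the child
    if proc.returncode != 0:
        return None  # a crashing parser does not take the parent down
    try:
        return json.loads(proc.stdout)
    except json.JSONDecodeError:
        return None  # never trust the child's output format either
```

The parent treats the child's output as untrusted too, which is the point: once the interface is "bytes in, bytes out", tightening the child's permissions doesn't require touching the rest of the program.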
Thanks for the detailed explanation, this is very useful! This is a pretty open-ended question, but what approach would you choose in order to get the most security for the buck, for a piece of software that runs containers for non-technical users, where the apps come from 3rd party developers? containerd and runc expose some of these security levers but it's a bit daunting to build the right approach without burdening the user too much.
If you're running 3rd party code, that more or less moves you into the namespace/container world or the LSM world. You aren't going to want to maintain a seccomp filter for 3rd party software.
It's going to really depend on what you're going for, but probably something along the lines of "docker + best practices" is the most bang for your buck solution.
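As a rough idea of what "best practices" could mean on the command line, here's a sketch that builds a locked-down `docker run` invocation. The specific selection of flags, the default limits, the uid/gid, and the image name are my assumptions, not a canonical list; `hardened_docker_argv` is a hypothetical helper and it only builds the argument list, it doesn't run anything.

```python
def hardened_docker_argv(image, memory="256m", pids=128, network=False):
    """Build a `docker run` argv with a conservative set of restrictions."""
    argv = [
        "docker", "run", "--rm",
        "--cap-drop=ALL",                    # drop all root capabilities
        "--security-opt=no-new-privileges",  # block setuid-style escalation
        "--read-only",                       # read-only root filesystem
        "--user=1000:1000",                  # assumed unprivileged uid:gid
        f"--memory={memory}",                # cgroup memory limit
        f"--pids-limit={pids}",              # cgroup process-count limit
    ]
    if not network:
        argv.append("--network=none")        # empty network namespace
    argv.append(image)
    return argv
```

Something like `subprocess.run(hardened_docker_argv("example/app:1.0"))` would launch it; apps that need capabilities, a writable path, or network access would then have to ask for them explicitly.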
Ok, that's in line with what I expected. I was also thinking of mandating that developers embed some kind of manifest with the app which states what resources it consumes (public ports, Internet connectivity to specific hosts etc) so that the system could enforce some boundaries around that, perhaps using eBPF. For fencing off filesystem access, the Landlock LSM might be of use.
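To sketch what I mean by a manifest (the field names and the checks are purely hypothetical, and the actual enforcement via eBPF or firewall rules isn't shown), the policy side could be as small as:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Manifest:
    """What an app declares it needs; anything not declared is denied."""
    public_ports: frozenset = frozenset()
    allowed_hosts: frozenset = frozenset()


def may_listen(manifest, port):
    """True if the app declared this public port."""
    return port in manifest.public_ports


def may_connect(manifest, host):
    """True if the app declared outbound access to this host name."""
    # Normalize case and a trailing dot so 'API.Example.com.' still matches
    return host.lower().rstrip(".") in manifest.allowed_hosts
```

For example, `Manifest(public_ports=frozenset({8080}), allowed_hosts=frozenset({"api.example.com"}))` would allow listening on 8080 and connecting to that one host, and nothing else.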
I've heard multiple times that docker isn't suited for security isolation as it wasn't designed for it (even though the underlying mechanisms can be used for that purpose). Has something changed recently on that front?