Hardening docker

Introduction

About a week ago I had the pleasure of speaking at Poznań Security Meetup 3, where I talked about docker security to around 100 people! Since I’m sure I glossed over some important details, let me use this post to fill in any gaps, or to bring you up to speed if you weren’t there. If you want to see the presentation you can download it here. Anyways, this is me:

my photo on the event

NOTE: In the talk I covered the difference between LXC and docker containers and gave an intro to how the docker ecosystem works. When I use the term “containers/docker containers” in this blog post, I mean OCI containers.

Why do containers work?

You may have heard this magical online passage:

I’d just like to interject for a moment. What you’re referring to as Linux, is in fact, GNU/Linux, or as I’ve recently taken to calling it, GNU plus Linux. Linux is not an operating system unto itself, but rather another free component of a fully functioning GNU system made useful by the GNU corelibs, shell utilities and vital system components comprising a full OS as defined by POSIX. ~ Not Richard Stallman

and in our case it has tons of educational merit. You see, containers work wonders because they share the Linux kernel with the host operating system. This makes containers fast, because they don’t need to emulate hardware devices. Since we are running on the same kernel, most of the security features that we’re going to deal with are implemented and exposed by the kernel itself. And because nothing is emulated, most of what we are going to talk about relates to the art of “escaping” to the host system.

How does the docker socket work?

The docker command is just a CLI for talking to the docker daemon, and the daemon manages things like networking, volumes and images. They communicate through a REST API over a Unix socket file located at /var/run/docker.sock, which we can check using curl on the host machine. Running curl --unix-socket /var/run/docker.sock http://localhost/images/json should get us the available images, so let’s see about that…

adam@host ~ > curl --unix-socket /var/run/docker.sock http://localhost/images/json
curl: (7) Failed to connect to localhost port 80 after 0 ms: Couldn't connect to server

Couldn’t connect? Maybe something with the file?

adam@host ~ > ls -l /var/run/docker.sock
srw-rw----. 1 root docker 0 Jun 22 11:21 /var/run/docker.sock
adam@host ~ > groups
adam wheel pkg-build libvirt

The socket file belongs to the docker group, and my user is not in it! That’s why you need to be in the docker group to spawn containers without sudo. After adding my user to the docker group, let’s run that command again and pipe the output to jq:

adam@host ~ > curl --silent --unix-socket /var/run/docker.sock http://localhost/images/json | jq           
[
  {
    "Containers": -1,
    "Created": 1719404777,
    "Id": "sha256:ac173dc647dad829d7b9e8530779ad8b2bf66c45768c8b2963e03e8a9547af91",
    "Labels": null,
    "ParentId": "",
    "RepoDigests": [],
    "RepoTags": [
      "test-secret:latest"
    ],
    "SharedSize": -1,
    "Size": 7794716
  },
...

We can see that this image, test-secret:latest, corresponds to the output of docker image ls:

adam@host ~ > docker image ls                                            
REPOSITORY                               TAG               IMAGE ID       CREATED             SIZE
test-secret                              latest            ac173dc647da   About an hour ago   7.79MB
test-arg                                 latest            3ca41158419b   2 hours ago         7.79MB
...

Basically, the docker socket is the gateway between the frontend and the backend of the docker runtime. Later, we will discuss the implications of centralizing communication like that.
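The permission check from the transcript above can be scripted. Here is a minimal sketch; socket_status is a name made up for this example, not something docker ships. It relies on the fact that connecting to a Unix socket needs write permission on the socket file:

```shell
# socket_status is a hypothetical helper, not part of docker.
# It prints one word describing whether the current user can use the socket.
socket_status() {
    sock="${1:-/var/run/docker.sock}"
    if [ ! -S "$sock" ]; then
        echo "missing"      # there is no socket file at that path
    elif [ -w "$sock" ]; then
        echo "accessible"   # we may connect, so API calls should go through
    else
        echo "denied"       # the socket exists but we cannot write to it (not in the docker group?)
    fi
}

socket_status
```

If it prints denied, you are most likely in the same spot as the first curl attempt above. Keep in mind that root passes the write check no matter what the file mode says.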

Namespaces

I said that containers use the same kernel as the host system. Let’s check that with a simple test. Run:

adam@host ~ > docker run --interactive --tty --rm alpine:3.20.0 sleep 25 &
[1] 163159
adam@host ~ > ps u -u root | grep "sleep 25"
root      163255  0.2  0.0   1612   640 pts/0    Ss+  15:07   0:00 sleep 25

Let’s break this down: docker run creates a container from the image alpine:3.20.0. The --interactive flag makes the container interactive, --tty allocates a pseudo-tty, and --rm removes the container once it exits. We execute sleep 25 and, as we can see, the sleep 25 inside the container is visible from the host machine as a process run by the root user. If you check the inverse (i.e. run sleep 25 on the host machine and ps aux in the container) you will notice that the container cannot see the host processes. That is because of a magical thing called kernel namespaces. As per man namespaces:

A namespace wraps a global system resource in an abstraction that makes it appear to the processes within the namespace that they have their own isolated instance of the global resource. Changes to the global resource are visible to other processes that are members of the namespace, but are invisible to other processes.

Here are the available namespace types:

   Namespace types
       Namespace Page                  Isolates
       Cgroup    cgroup_namespaces(7)  Cgroup root directory
       IPC       ipc_namespaces(7)     System V IPC, POSIX  message
                                       queues
       Network   network_namespaces(7) Network   devices,   stacks,
                                       ports, etc.
       Mount     mount_namespaces(7)   Mount points
       PID       pid_namespaces(7)     Process IDs
       Time      time_namespaces(7)    Boot and monotonic clocks
       User      user_namespaces(7)    User and group IDs
       UTS       uts_namespaces(7)     Hostname and NIS domain name

That’s why --network=host makes LAN devices visible to a container - the container gets added to the host’s Network namespace. Let’s try the process example again but this time, adding the container to the host’s PID namespace with --pid=host:

adam@host ~ > sleep 25 &
[1] 232866
adam@host ~ > docker run --interactive --tty --rm --pid=host alpine:3.20.0 ps aux | grep "sleep 25"
232866 1000      0:00 sleep 25
adam@host ~ > id                                                                                   
uid=1000(adam) gid=1000(adam) groups=1000(adam),10(wheel),971(pkg-build),974(docker),983(libvirt)

Since we added the container to the host’s PID namespace, we can see the sleep 25 that was started on the host! The process seen by the container has its owner ID set to 1000, which is the ID of the adam user. Namespaces basically limit what you can see as a container. Now let’s explore how to limit what you can do.
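You can also look at namespace membership directly: the kernel exposes it as symlinks under /proc/<pid>/ns, and two processes are in the same namespace when the links point at the same identifier. A minimal sketch follows; same_ns is a helper name invented for this post:

```shell
# same_ns is a hypothetical helper: it succeeds when two PIDs share a namespace.
# $1, $2 = process IDs, $3 = namespace name as listed in /proc/<pid>/ns (pid, net, mnt, ...)
same_ns() {
    a="$(readlink "/proc/$1/ns/$3")" || return 2   # 2 = could not read the first link
    b="$(readlink "/proc/$2/ns/$3")" || return 2   # 2 = could not read the second link
    [ "$a" = "$b" ]
}

# Print the namespace identifiers of the current shell.
for ns in pid net mnt uts ipc user; do
    printf '%s -> %s\n' "$ns" "$(readlink "/proc/$$/ns/$ns")"
done
```

To compare your shell with a containerized process, pass the PID that the host sees (163255 in the first example). Reading the links of a process owned by root needs root, so expect same_ns to return 2 when run as a regular user against such a process.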

Capabilities

The manual page for capabilities says the following:

Starting with Linux 2.2, Linux divides the privileges traditionally associated with superuser into distinct units, known as capabilities, which can be independently enabled and disabled.

Capabilities basically break the power of root down into separate parts, so that root in the container does not have to be equal to root on the host. There are many of them, so let’s just mention the ones given to containers by default:

func DefaultCapabilities() []string {
    return []string{
        "CAP_CHOWN",             // Allows changing the owner of files.
        "CAP_DAC_OVERRIDE",      // Bypasses file read, write, and execute permission checks.
        "CAP_FSETID",            // Keeps the set-user-ID and set-group-ID mode bits when a file is modified.
        "CAP_FOWNER",            // Bypasses permission checks that require the process UID to match the file owner.
        "CAP_MKNOD",             // Allows the creation of special files using the mknod system call.
        "CAP_NET_RAW",           // Allows the use of RAW and PACKET sockets.
        "CAP_SETGID",            // Allows setting the GID (Group ID) of a process.
        "CAP_SETUID",            // Allows setting the UID (User ID) of a process.
        "CAP_SETFCAP",           // Allows setting file capabilities.
        "CAP_SETPCAP",           // Allows modifying process capabilities.
        "CAP_NET_BIND_SERVICE",  // Allows binding to network ports below 1024.
        "CAP_SYS_CHROOT",        // Allows the use of chroot() system call
        "CAP_KILL",              // Allows sending signals to processes
        "CAP_AUDIT_WRITE",       // Allows writing to the audit logs.
    }
}

There are also some dangerous ones that are not given by default, like:

  • CAP_SYS_ADMIN - Allows a LOT of system administration operations. Basically enough to gain the other privileges.
  • CAP_SYS_BOOT - Allows the use of reboot and kexec_load. You can just build a custom kernel with a malicious kernel module and load it into the host system.

Why am I mentioning this? Because some people advocate using the --privileged flag in “I don’t want permission issues” cases. It is really important to understand that it gives all the capabilities to the container, making it essentially root on the host. You can drop capabilities for containers using --cap-drop=ALL and add them back using, for example, --cap-add=CAP_NET_ADMIN. Let’s check that this works using our trusty alpine:3.20.0:

adam@host ~ > docker run --interactive --tty --rm --cap-drop=ALL --cap-add=CAP_NET_ADMIN alpine:3.20.0 sh
/ # apk add libcap
fetch https://dl-cdn.alpinelinux.org/alpine/v3.20/main/x86_64/APKINDEX.tar.gz
fetch https://dl-cdn.alpinelinux.org/alpine/v3.20/community/x86_64/APKINDEX.tar.gz
...
OK: 8 MiB in 19 packages
/ # capsh --print # print available capabilities
Current: cap_net_admin=ep
Bounding set =cap_net_admin
Ambient set =
Current IAB: !cap_chown,!cap_dac_override,!cap_dac_read_search,!cap_fowner,!cap_fsetid,!cap_kill,!cap_setgid,!cap_setuid,!cap_setpcap,!cap_linux_immutable,!cap_net_bind_service,!cap_net_broadcast,!cap_net_raw,!cap_ipc_lock,!cap_ipc_owner,!cap_sys_module,!cap_sys_rawio,!cap_sys_chroot,!cap_sys_ptrace,!cap_sys_pacct,!cap_sys_admin,!cap_sys_boot,!cap_sys_nice,!cap_sys_resource,!cap_sys_time,!cap_sys_tty_config,!cap_mknod,!cap_lease,!cap_audit_write,!cap_audit_control,!cap_setfcap,!cap_mac_override,!cap_mac_admin,!cap_syslog,!cap_wake_alarm,!cap_block_suspend,!cap_audit_read,!cap_perfmon,!cap_bpf,!cap_checkpoint_restore

As we can see, we successfully dropped all the capabilities except for CAP_NET_ADMIN. You should examine your docker containers and drop every capability they do not need, to limit the potential damage.
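If you don’t want to install libcap just to check, the kernel also reports the effective capabilities as a hex bitmask in the CapEff line of /proc/<pid>/status. Below is a minimal sketch that tests single bits of that mask. The helpers has_cap and report are made up for this example, and the bit numbers (0, 12, 21) are the ones defined in linux/capability.h:

```shell
# has_cap is a hypothetical helper: succeeds if capability bit $2 is set in hex mask $1.
has_cap() {
    [ $(( (0x$1 >> $2) & 1 )) -eq 1 ]
}

# report prints "<name>: yes" or "<name>: no" for mask $1, name $2 and bit $3.
report() {
    if has_cap "$1" "$3"; then echo "$2: yes"; else echo "$2: no"; fi
}

# Effective capability mask of the current process (empty if /proc is unavailable).
mask="$(awk '/^CapEff:/ {print $2}' /proc/self/status 2>/dev/null)"
if [ -n "$mask" ]; then
    echo "CapEff: $mask"
    report "$mask" CAP_CHOWN 0
    report "$mask" CAP_NET_ADMIN 12
    report "$mask" CAP_SYS_ADMIN 21
fi
```

The fourteen default capabilities listed earlier add up to the mask a80425fb, while a container started with --cap-drop=ALL --cap-add=CAP_NET_ADMIN should be left with only bit 12 set. Running the snippet in both kinds of container is a quick way to see the difference.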

Docker socket exposure

Another thing that is very dangerous but seen quite often is exposing /var/run/docker.sock to containers. Why would somebody do this? Software like traefik in my homelab uses the information it gets from the socket to load balance a route based on how many containers are available. Exposing the docker socket is extremely dangerous and stupid and you shouldn’t do that! It may seem innocent at first, but let’s do a quick demo of what you can do once you find something like this:

adam@host ~ > docker run --interactive --tty --rm --volume "/var/run/docker.sock:/var/run/docker.sock" alpine:3.20.0 sh
/ # apk add docker
fetch https://dl-cdn.alpinelinux.org/alpine/v3.20/main/x86_64/APKINDEX.tar.gz
fetch https://dl-cdn.alpinelinux.org/alpine/v3.20/community/x86_64/APKINDEX.tar.gz
...
OK: 280 MiB in 27 packages
/ # docker run --interactive --tty --volume "/:/psm" ubuntu:24.04 sh
# cat /psm/etc/hostname
host

As you can see, the docker client inside the container uses the host’s socket to start a new container with the host root filesystem mounted on /psm. We verify this by checking the hostname of the host system. We could also just spin up a --privileged container and have basically full control over the host system. Try not to expose the docker socket for no reason.
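To find out whether any of your running containers already has the socket mounted, you can filter the mount sources reported by docker inspect. This is only a sketch: find_socket_mounts is a hypothetical filter written for this post, and it assumes mount paths contain no spaces.

```shell
# find_socket_mounts is a hypothetical filter. It reads lines of the form
#   <container name> <mount source> <mount source> ...
# and prints the names of containers that mount a path ending in docker.sock.
find_socket_mounts() {
    awk '{ for (i = 2; i <= NF; i++) if ($i ~ /docker\.sock$/) { print $1; break } }'
}

# Intended usage on a docker host (commented out so the snippet has no side effects):
# docker ps --quiet | xargs -r docker inspect \
#     --format '{{.Name}} {{range .Mounts}}{{.Source}} {{end}}' | find_socket_mounts
```

Every name it prints is a container that can talk to the daemon, so treat those containers as if they had root on the host.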

But what if I need to communicate?

You can filter kernel calls at the host OS level with mechanisms like SELinux, to allow only an identified set of actions for the container client (or the “socket exposer” process). You can also (with less config) expose the Docker socket over TCP or SSH instead of the socket file. This allows different implementation levels of the AAA (Authentication, Authorization, Accounting) concepts, depending on your security assessment. For more info refer to protecting the daemon socket.

Limiting container resources with control groups

Now imagine that someone breaks into your container and runs this magical line: :(){ :|:& };:. If you use Linux from time to time you might know what it means: it’s a fork bomb that makes the CPU go boom boom by spawning processes infinitely.

Fortunately, Docker containers provide a layer of protection against such attacks through control groups, or cgroups. They are a kernel feature that allows you to allocate resources such as CPU, memory, and I/O bandwidth among user-defined groups of tasks (in this case, containers). The different resources that can be controlled are listed in man cgroups (you just gotta love man, man). Let’s set off a fork bomb in alpine:

adam@host ~ > docker run --interactive --tty --rm alpine:3.20.0 sh
/ # bomb(){ bomb|bomb& };bomb
/ #
[1]+  Done                       bomb | bomb

This is the output of my host’s btop. As you can see, I am ripping the threads:

btop showing 90%+ CPU usage on every core

If you’re lucky you will be able to kill the container via docker kill. How do we remedy this? Let’s try the fork bomb again, but this time with the --memory and --cpus flags:

adam@host ~ > docker run --interactive --tty --rm --memory 256m --cpus 4 alpine:3.20.0 sh
/ # bomb(){ bomb|bomb& };bomb
/ #
[1]+  Done                       bomb | bomb

The output is the same, but we put a cap on how many resources the container can use, effectively making denial-of-service attacks much harder to execute. We can also use the --restart on-failure:5 policy to stop attacks that abuse the restart-on-failure mechanic (and maybe set up alerts if the container fails to start an abysmal 5 times).
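One caveat worth testing yourself: --memory and --cpus cap memory and CPU time, but not the number of processes. docker run also accepts --pids-limit, which the run above does not use. It limits the process count through the pids cgroup controller, so it targets fork bombs directly. The sketch below assumes a cgroup v2 host, and show_limits is a hypothetical helper that prints the limits stored in a cgroup directory:

```shell
# show_limits is a hypothetical helper. It prints the memory, CPU and process
# limits found in a cgroup v2 directory (default: the current cgroup's view).
show_limits() {
    dir="${1:-/sys/fs/cgroup}"
    for f in memory.max cpu.max pids.max; do
        if [ -r "$dir/$f" ]; then
            printf '%s: %s\n' "$f" "$(cat "$dir/$f")"
        else
            printf '%s: unavailable\n' "$f"   # file missing, e.g. cgroup v1 or the root cgroup
        fi
    done
}

# Intended usage (commented out because it needs a docker host). The value 100 is arbitrary:
# docker run --rm --memory 256m --cpus 4 --pids-limit 100 alpine:3.20.0 \
#     sh -c 'cat /sys/fs/cgroup/memory.max /sys/fs/cgroup/cpu.max /sys/fs/cgroup/pids.max'
show_limits
```

With a process limit in place the bomb should run into fork failures once the limit is reached instead of eating every thread on the host. Pick a limit that still leaves room for your actual workload.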

Build time secrets

So far we have mostly talked about hardening the runtime and how to bend the daemon to our will. Let’s now talk about less administrator-like actions. Our fake company wants to build an image and push it to Docker Hub, but all our files are hidden behind an rsync-enabled server. How do we get those files while building the image? We need a key and a way to use it during the build, but in a way that keeps it out of the built image. A common solution is to use build arguments. Create a Dockerfile.build-args file with the following contents:

FROM alpine:3.20.0
ARG SECRET
RUN echo $SECRET

Build it using docker build --file Dockerfile.build-args --tag test-arg --build-arg SECRET="my secret" . (the trailing dot is the build context). The secret is now sitting in plaintext in the build layers! Don’t believe me? Let’s use docker history test-arg:

docker history showing a build argument leakage

We can remedy this using build secrets, a feature of the BuildKit image builder. Let’s create a file called SECRET with the contents hello world! and create the file Dockerfile.secret:

FROM alpine:3.20.0
RUN --mount=type=secret,id=mysecret cat /run/secrets/mysecret

We build it using the special secret syntax: we need to supply the id of the secret and the path of the file containing it. It is advisable to give that file restrictive permissions, so that us mortals cannot read it easily. Build using docker build --no-cache --secret id=mysecret,src=$PWD/SECRET --file Dockerfile.secret --tag test-secret . and then check out docker history test-secret --no-trunc:

docker history good secret hiding

We can also use the special RUN --mount instruction with ssh keys to allow for secure remote data fetching at build time! Check out this docker page for more info!
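For the rsync scenario from the beginning of this section, an SSH mount could look roughly like the sketch below. Everything specific in it is a placeholder I made up: the host files.example.com, the builder user, the /srv/assets path and the test-ssh tag. It also assumes an ssh-agent with the right key loaded on the build machine.

```shell
# Write a Dockerfile that uses an SSH mount. files.example.com, the builder
# user and the /srv/assets path are placeholders, not real endpoints.
cat > Dockerfile.ssh <<'EOF'
FROM alpine:3.20.0
RUN apk add --no-cache openssh-client rsync
# Record the server's host key up front so rsync does not prompt during the build.
RUN mkdir -p -m 0700 ~/.ssh && ssh-keyscan files.example.com >> ~/.ssh/known_hosts
# The forwarded SSH agent is available only while this RUN step executes.
RUN --mount=type=ssh rsync -a -e ssh builder@files.example.com:/srv/assets/ /assets/
EOF

# Build with the local SSH agent forwarded (commented out, it needs docker and a reachable server):
# docker build --ssh default --file Dockerfile.ssh --tag test-ssh .
```

The private key itself never enters the build, only access to the agent socket for that one step. Note that ssh-keyscan trusts whatever key the server presents at build time, so for anything serious you would rather copy a known host key into the image.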

Host OS security

Since the kernel is shared with the host OS, we can allow ourselves a very “light” host runtime environment. This is your call to just install Gentoo and have some fun setting up the worst production server ever! But seriously, if you’re planning on self-hosting container infrastructure, it would be very wise to have a minimal base OS and to harden it with things like SELinux (here). You can also use AppArmor or modify the built-in seccomp profiles.

Summary

Containers are basically just isolated processes, so we need to make sure of two things:

  1. The container cannot escape to the host
  2. An attacker cannot affect the host’s “quality of life”, like CPU and memory

“What should I do again?” I hear you say. Uncle Adam has you covered:

  • Drop all non-needed capabilities
  • Harden the host OS - use SELinux or similar
  • Limit chance of escalation using the USER directive and maybe namespace remapping
  • Limit resources of your containers to prevent resource depletion attacks (try not to forkbomb yourself)

You should probably also scan your containers.
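Most of the runtime items from the list can be folded into one wrapper. Treat this as a starting point rather than a recipe: hardened_run is a hypothetical name, the UID 1000 and the limit values are arbitrary, --user here is the run-time counterpart of the USER directive, and --pids-limit is an extra flag that the examples in this post did not use.

```shell
# hardened_run is a hypothetical wrapper around docker run. Set DOCKER=echo to
# print the arguments instead of executing them.
hardened_run() {
    "${DOCKER:-docker}" run --rm \
        --cap-drop=ALL \
        --user 1000:1000 \
        --memory 256m \
        --cpus 1 \
        --pids-limit 100 \
        "$@"
}

# Dry run: prints the arguments that would be passed to docker, no daemon needed.
DOCKER=echo hardened_run alpine:3.20.0 id
```

Add back single capabilities with --cap-add only where the container demonstrably needs them, and raise the limits per workload instead of removing them.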

Sources

What I used to prepare for the talk: