Privileged Container Escape: Linux Capability Abuse and Host Device Access for Breakout

binary_exploitation Difficulty 1–5 30 min certifiable

Theory

Why This Matters

In 2019, the runc vulnerability CVE-2019-5736 demonstrated that container escape was possible without --privileged, but --privileged containers have always been trivially escapable by design. Kubernetes cluster compromises (the 2018 Tesla AWS cryptojacking incident, which began with an exposed Kubernetes dashboard, is a frequently cited example) and numerous red team engagements show how quickly code execution inside a cluster turns into broader access. Where privileged containers are deployed as CI/CD agents, monitoring tools, or system utilities, code execution inside the container gives the attacker immediate host access. The --privileged flag is so dangerous precisely because it appears to be a simple configuration option while functionally eliminating nearly every security boundary Docker provides.

Core Concept

The --privileged flag in Docker does several things at once: it grants the container every Linux capability the kernel defines (including CAP_SYS_ADMIN, CAP_NET_ADMIN, CAP_SYS_PTRACE, and CAP_SYS_MODULE; about 40 in total on current kernels), it disables the default seccomp and AppArmor profiles, removing syscall filtering, and it exposes all host devices under /dev. The combined effect is that a root process inside a privileged container is functionally equivalent to a root process on the host. The container still has its own mount, PID, and network namespaces, but these are no barrier to such a process, and the PID and network namespaces are removed outright by --pid=host --net=host.
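To make the capability grant concrete, the sketch below decodes a capability bitmask such as the CapEff value in /proc/self/status, in the spirit of capsh --decode. The names decode_caps, dangerous_caps, and HIGH_RISK are illustrative, not part of any library, and the table assumes the capability numbering of linux/capability.h up to bit 40 (Linux 5.9 and later); bits beyond that are reported as unknown rather than guessed.

```python
# Capability names indexed by bit number (assumed from linux/capability.h, bits 0-40).
CAP_NAMES = [
    "CAP_CHOWN", "CAP_DAC_OVERRIDE", "CAP_DAC_READ_SEARCH", "CAP_FOWNER",
    "CAP_FSETID", "CAP_KILL", "CAP_SETGID", "CAP_SETUID", "CAP_SETPCAP",
    "CAP_LINUX_IMMUTABLE", "CAP_NET_BIND_SERVICE", "CAP_NET_BROADCAST",
    "CAP_NET_ADMIN", "CAP_NET_RAW", "CAP_IPC_LOCK", "CAP_IPC_OWNER",
    "CAP_SYS_MODULE", "CAP_SYS_RAWIO", "CAP_SYS_CHROOT", "CAP_SYS_PTRACE",
    "CAP_SYS_PACCT", "CAP_SYS_ADMIN", "CAP_SYS_BOOT", "CAP_SYS_NICE",
    "CAP_SYS_RESOURCE", "CAP_SYS_TIME", "CAP_SYS_TTY_CONFIG", "CAP_MKNOD",
    "CAP_LEASE", "CAP_AUDIT_WRITE", "CAP_AUDIT_CONTROL", "CAP_SETFCAP",
    "CAP_MAC_OVERRIDE", "CAP_MAC_ADMIN", "CAP_SYSLOG", "CAP_WAKE_ALARM",
    "CAP_BLOCK_SUSPEND", "CAP_AUDIT_READ", "CAP_PERFMON", "CAP_BPF",
    "CAP_CHECKPOINT_RESTORE",
]

# Capabilities this card singles out as escape-relevant (illustrative selection).
HIGH_RISK = {"CAP_SYS_ADMIN", "CAP_SYS_MODULE", "CAP_SYS_PTRACE", "CAP_NET_ADMIN"}


def decode_caps(mask_hex):
    """Return the capability names set in a hex bitmask, lowest bit first."""
    mask = int(mask_hex, 16)
    names = []
    bit = 0
    while mask:
        if mask & 1:
            if bit < len(CAP_NAMES):
                names.append(CAP_NAMES[bit])
            else:
                # Newer kernel than this table knows about; do not guess a name.
                names.append(f"UNKNOWN_BIT_{bit}")
        mask >>= 1
        bit += 1
    return names


def dangerous_caps(mask_hex):
    """Return the high-risk capabilities present in the mask, sorted by name."""
    return sorted(set(decode_caps(mask_hex)) & HIGH_RISK)
```

As a usage example, dangerous_caps("00000000a80425fb") (the mask commonly seen in a default, unprivileged Docker container) returns an empty list, while a mask with all 41 bits set returns all four high-risk names.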

Escape technique — mounting the host disk: Because a privileged container can access all host devices via /dev, the attacker mounts the host root disk partition directly, gaining full read-write access to the host filesystem without needing any container escape vulnerability.

Escape technique — loading a kernel module: With CAP_SYS_MODULE, the attacker compiles and loads a malicious kernel module (rootkit) directly from inside the container, executing code in kernel space with no isolation whatsoever.

Escape technique — cgroup release_agent: The CAP_SYS_ADMIN capability allows mounting cgroup filesystems. A technique published by Felix Wilhelm in 2019 uses the cgroup v1 release_agent file to execute arbitrary commands on the host when the last process in a cgroup exits. CVE-2022-0492 is a related kernel flaw that made the same release_agent write reachable without CAP_SYS_ADMIN in some configurations.
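For assessment purposes, one precondition of the release_agent technique can be checked without attempting it. The sketch below scans the contents of /proc/mounts for cgroup v1 hierarchies mounted read-write. The function name writable_cgroup_v1_mounts is hypothetical, and the check is only a heuristic: a process holding CAP_SYS_ADMIN can mount a fresh cgroup filesystem itself, so an empty result does not prove the technique is blocked.

```python
def writable_cgroup_v1_mounts(mounts_text):
    """Return mount points of cgroup v1 hierarchies that are mounted read-write.

    mounts_text is the content of /proc/mounts, where each line has the form:
    device mount_point fs_type options dump pass
    """
    found = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        mount_point, fs_type, options = fields[1], fields[2], fields[3]
        # fs_type "cgroup" is v1; "cgroup2" (the unified hierarchy) has no release_agent.
        if fs_type == "cgroup" and "rw" in options.split(","):
            found.append(mount_point)
    return found
```

Inside a container this could be called as writable_cgroup_v1_mounts(open("/proc/mounts").read()); any mount point it returns is a candidate location for a writable release_agent file and worth noting in the assessment.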

Legitimate use cases for --privileged are narrow: loading kernel modules for storage drivers and certain hardware testing scenarios. Network packet capture is often cited as well, but raw socket access needs only CAP_NET_RAW rather than full privilege. In almost all production deployments where --privileged is found, it was added to "fix" a capability error and never removed.

Detection: docker inspect --format='{{.HostConfig.Privileged}}' CONTAINER_ID returns true for privileged containers. The CIS Docker Benchmark explicitly prohibits privileged containers (check 5.4; the number can differ between benchmark versions).
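The Privileged field is only one of several HostConfig settings worth flagging. As a minimal sketch, the function below takes the HostConfig object from docker inspect (already parsed from JSON) and returns a list of findings. The name assess_host_config, the finding strings, and the choice of which capabilities count as risky are assumptions made for illustration, not a standard.

```python
# Capabilities treated as high risk when added individually (illustrative selection).
RISKY_CAPS = {"ALL", "SYS_ADMIN", "SYS_PTRACE", "SYS_MODULE", "NET_ADMIN"}


def assess_host_config(hc):
    """Return a list of finding strings for a docker inspect HostConfig dict."""
    findings = []

    if hc.get("Privileged"):
        findings.append("privileged")

    # CapAdd may be null in the JSON; entries may or may not carry a CAP_ prefix.
    for cap in hc.get("CapAdd") or []:
        name = cap.upper()
        if name.startswith("CAP_"):
            name = name[4:]
        if name in RISKY_CAPS:
            findings.append(f"cap_add:{name}")

    # Accept both the current "key=value" and the older "key:value" spelling.
    for opt in hc.get("SecurityOpt") or []:
        normalized = opt.replace(":", "=", 1)
        if normalized in ("seccomp=unconfined", "apparmor=unconfined"):
            findings.append(f"security_opt:{normalized}")

    if hc.get("PidMode") == "host":
        findings.append("pid_host")
    if hc.get("NetworkMode") == "host":
        findings.append("net_host")

    return findings
```

An empty list means none of these specific settings were found; it is not a statement that the container is safe.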

Technical Deep-Dive

# Detect all running privileged containers
docker ps -q | while read cid; do
    name=$(docker inspect "$cid" --format '{{.Name}}')
    priv=$(docker inspect "$cid" --format '{{.HostConfig.Privileged}}')
    caps=$(docker inspect "$cid" --format '{{.HostConfig.CapAdd}}')
    if [ "$priv" = "true" ]; then
        echo "PRIVILEGED: $name (ID: $cid)"
    elif [ "$caps" != "[]" ] && [ -n "$caps" ]; then
        echo "EXTRA_CAPS: $name  CapAdd=$caps"
    fi
done

# Check a specific container's full security configuration
docker inspect target-container --format '{{json .HostConfig}}' | 
  python3 -c "
import sys, json
hc = json.load(sys.stdin)
print('Privileged:', hc.get('Privileged'))
print('CapAdd:', hc.get('CapAdd'))
print('CapDrop:', hc.get('CapDrop'))
print('SecurityOpt:', hc.get('SecurityOpt'))
print('PidMode:', hc.get('PidMode'))
print('NetworkMode:', hc.get('NetworkMode'))
print('ReadonlyRootfs:', hc.get('ReadonlyRootfs'))
"

# From inside a privileged container: verify escape is possible
# Check available capabilities
grep CapEff /proc/self/status
# Decode capability bitmask
capsh --decode="$(awk '/^CapEff/ {print $2}' /proc/self/status)"

# ESCAPE: Mount host root disk (for authorised testing only)
# Find the host root device
fdisk -l 2>/dev/null | grep "Linux filesystem"
# Mount it (/dev/sda1 is an example; substitute the device found above)
mkdir -p /tmp/host && mount /dev/sda1 /tmp/host
ls /tmp/host/root/   # host root home directory
cat /tmp/host/etc/shadow   # host shadow password file

# Run CIS Docker Benchmark
docker run --rm --net host --pid host --userns host --cap-add audit_control \
  -e DOCKER_CONTENT_TRUST=$DOCKER_CONTENT_TRUST \
  -v /etc:/etc:ro -v /lib/systemd/system:/lib/systemd/system:ro \
  -v /usr/bin/containerd:/usr/bin/containerd:ro \
  -v /usr/bin/runc:/usr/bin/runc:ro \
  -v /usr/lib/systemd:/usr/lib/systemd:ro \
  -v /var/lib:/var/lib:ro \
  -v /var/run/docker.sock:/var/run/docker.sock:ro \
  --label docker_bench_security \
  docker/docker-bench-security

# VULNERABLE Kubernetes pod spec
spec:
  containers:
  - name: monitoring-agent
    image: monitor:latest
    securityContext:
      privileged: true   # CRITICAL — remove this

# SECURE: use only required capabilities
spec:
  containers:
  - name: monitoring-agent
    image: monitor:latest
    securityContext:
      privileged: false
      allowPrivilegeEscalation: false
      readOnlyRootFilesystem: true
      runAsNonRoot: true
      runAsUser: 1000
      capabilities:
        drop: ["ALL"]
        add: ["NET_RAW"]   # only if packet capture is required

Security Assessment Methodology

Enumerate all privileged containers. Inspect every running container for HostConfig.Privileged: true. In Kubernetes, use kubectl get pods -A -o json | jq -r '.items[] | select(any(.spec.containers[]; .securityContext.privileged == true)) | "\(.metadata.namespace)/\(.metadata.name)"' to list the affected pods cluster-wide, and repeat the check for .spec.initContainers.
Check for dangerous individual capabilities. Even without full --privileged, containers with CAP_SYS_ADMIN, CAP_SYS_PTRACE, CAP_SYS_MODULE, CAP_NET_ADMIN, or CAP_DAC_READ_SEARCH added individually can be exploited. (CAP_DAC_OVERRIDE is already part of Docker's default set, so it will not appear in CapAdd.) List added caps via docker inspect.
Verify seccomp and AppArmor profiles. Check SecurityOpt and AppArmorProfile in docker inspect. A privileged container disables both. For a non-privileged container, an empty SecurityOpt means the runtime defaults apply (on AppArmor hosts, AppArmorProfile typically reads docker-default). Entries such as seccomp=unconfined or apparmor=unconfined indicate the container has wider syscall access than necessary.
Check for --pid=host or --net=host. These flags in combination with --privileged or high capabilities allow accessing host process memory and host network interfaces, enabling MITM attacks and credential extraction from host process memory.
Demonstrate the escape in a safe test environment. Using the device mount technique or cgroup release_agent technique, demonstrate that a process in the privileged container can achieve root on the host. Document the full chain.
Remediate by removing --privileged and replacing it with only the specific capabilities required. capsh --print inside the container shows which capabilities are currently granted, not which ones the workload actually uses, so determine the required set empirically: drop all capabilities by default (--cap-drop=ALL) and add back only those whose absence causes a failure. Enable read-only root filesystem and enforce seccomp profiles.
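The Kubernetes enumeration step can be sketched as a small function over the output of kubectl get pods -A -o json. It walks regular, init, and ephemeral containers so that none of the three is skipped. The function name privileged_containers and the tuple layout are illustrative choices, not an established tool.

```python
def privileged_containers(pod_list):
    """Return (namespace, pod, container_kind, container_name) for every
    container with securityContext.privileged set to true.

    pod_list is the parsed JSON from: kubectl get pods -A -o json
    """
    hits = []
    for pod in pod_list.get("items", []):
        meta = pod.get("metadata", {})
        spec = pod.get("spec", {})
        for kind in ("initContainers", "containers", "ephemeralContainers"):
            # Missing or null container lists are treated as empty.
            for container in spec.get(kind) or []:
                ctx = container.get("securityContext") or {}
                if ctx.get("privileged") is True:
                    hits.append((meta.get("namespace"), meta.get("name"),
                                 kind, container.get("name")))
    return hits
```

One way to feed it is json.load(sys.stdin) with the kubectl output piped in. Reporting the container kind alongside the pod name makes it clear when the privileged container is an init container rather than the main workload.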

Common Assessment Errors

Treating Kubernetes Pod Security Standards as automatic protection. The Pod Security Standards (PSS) baseline and restricted profiles both block privileged: true, but PSS must be enforced via admission control (Pod Security Admission or an OPA policy). Many clusters run PSS in warn mode rather than enforce, so privileged pods are still admitted.
Overlooking init containers. Kubernetes init containers run before the main container and may have privileged: true for setup tasks. They deserve the same scrutiny as main containers — a privileged init container can modify the host before the main container starts.
Missing containers with capabilities equivalent to privileged. CAP_SYS_ADMIN alone provides most of the attack surface of full --privileged. An assessment that only flags Privileged: true and misses CapAdd: [SYS_ADMIN] misses equivalently dangerous configurations.
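Not testing AppArmor/seccomp bypass. Some containers are started with seccomp=unconfined or apparmor=unconfined in SecurityOpt without full --privileged (older Docker releases write these as seccomp:unconfined). These containers have unrestricted syscall access even though Privileged: false. Always check SecurityOpt.
Assuming the container image's USER instruction prevents escalation. A non-root process in a privileged container does not start with capabilities in its effective set, but the bounding set is still complete, the host devices are still present, and seccomp and AppArmor are still disabled. Any path to UID 0 inside the container (a setuid binary, sudo, or docker exec -u 0) therefore yields the full capability set, after which mounting host disks and loading kernel modules works as usual.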
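The first error above, assuming Pod Security Standards are enforced when they are only warned on, can be checked per namespace. The sketch below reads the output of kubectl get namespaces -o json and lists namespaces whose pod-security.kubernetes.io/enforce label is missing or set to privileged. The function name is hypothetical, and the result is a heuristic: a cluster-wide Pod Security Admission default or a separate policy engine may still block privileged pods in a namespace this flags.

```python
# Namespace label read by Pod Security Admission for the enforce mode.
ENFORCE_LABEL = "pod-security.kubernetes.io/enforce"


def namespaces_not_enforcing(ns_list):
    """Return (namespace, enforce_level) pairs where the enforce label does not
    select the baseline or restricted profile.

    ns_list is the parsed JSON from: kubectl get namespaces -o json
    enforce_level is None when the label is absent.
    """
    weak = []
    for ns in ns_list.get("items", []):
        meta = ns.get("metadata", {})
        labels = meta.get("labels") or {}
        level = labels.get(ENFORCE_LABEL)
        # warn and audit labels alone do not stop a privileged pod from running.
        if level not in ("baseline", "restricted"):
            weak.append((meta.get("name"), level))
    return weak
```

A namespace that carries only the warn or audit label appears in the result with an enforce level of None, which is exactly the warn-but-not-enforce situation described above.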

NICE Framework Alignment

Code	Knowledge/Skill/Task Statement	How This Card Develops It
K0053	Knowledge of security risk management processes	Understanding that `--privileged` containers eliminate Docker's security model entirely — the risk is not container-level but host-level compromise
K0167	Knowledge of system administration, network, and OS hardening techniques	Hardening container security contexts: dropping all capabilities, enforcing seccomp/AppArmor profiles, and using least-capability principle
S0073	Skill in conducting vulnerability scans and recognizing vulnerabilities	Using `docker inspect`, `kubectl get pods`, and CIS Docker Benchmark scans to detect privileged containers and dangerous capability additions
T0144	Conduct penetration testing as required for new or updated applications	Demonstrating host escape from privileged containers using device mounting and cgroup release_agent techniques during container security assessments
T0395	Write code to address security vulnerabilities	Writing secure Kubernetes pod security contexts with `privileged: false`, `capabilities.drop: [ALL]`, and explicit narrow capability additions