nicholas amorim.

Collection of notes on whatever I’m learning about 🪀

15 Jan 2022

Demystifying Container Runtimes

What is a “Container Runtime”?

The term “runtime” traditionally refers either to the lifecycle phase when a program is running, or to the specific implementation of a language that supports its execution, such as the Go runtime. The latter meaning is closest to the definition of a “container runtime”. At its core, a container runtime is the software responsible for running containers. As we will describe, different container runtimes implement varying levels of features, but running a container is all that’s required to classify something as a container runtime.

Docker

Docker is a popular monolithic platform for all things containers with niceties such as:

  • A container image format
  • A method for building container images (Dockerfile / docker build)
  • A way to manage container images (docker images, docker rmi, etc.)
  • A way to manage instances of containers (docker ps, docker rm, etc.)
  • A way to share containers (docker push/pull)
  • A way to run containers (docker run)

However, none of these features really depend on the others. Each could be implemented in smaller, more focused tools that work together through a common format, a container standard. Because of this, the Open Container Initiative (OCI) was created. Docker then broke out its container-running code into a separate tool and library, runc, and donated it to the OCI as a reference implementation of the OCI Runtime Specification.

It was initially confusing what Docker had contributed to OCI. What they contributed was a standard way to “run” containers, nothing more. When you run a Docker container, these are the steps Docker actually goes through:

  1. Download the image.
  2. Unpack the image into a “bundle”. This flattens the image’s layers into a single filesystem.
  3. Run the container from the bundle (what Docker actually standardized).
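Step 2, flattening the layers, can be sketched with plain tar. The toy layers below stand in for an image’s real layers, which would come from a registry:

```shell
# Demonstrate how unpacking flattens layers: later layers overwrite earlier files.
# (Toy layers for illustration; real layers come from the image registry.)
mkdir -p layers/l1 layers/l2 bundle/rootfs
echo "from layer 1" > layers/l1/etc_issue
echo "from layer 2" > layers/l2/etc_issue
tar -cf layer1.tar -C layers/l1 .
tar -cf layer2.tar -C layers/l2 .
for layer in layer1.tar layer2.tar; do
    tar -xf "$layer" -C bundle/rootfs    # extract lowest layer first
done
cat bundle/rootfs/etc_issue              # prints "from layer 2"
```

The resulting bundle/rootfs directory is the single filesystem the runtime is then pointed at.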

Low-Level & High-Level Container Runtimes

Examples of container runtimes include: runc, lxc, lmctfy, Docker (containerd), rkt, cri-o. Each of these is built for different situations and implements different features. Some, like containerd and cri-o, actually use runc to run the container but implement image management and APIs on top. You can think of these features which include image transport, image management, image unpacking, and APIs, as high-level features compared to runc’s low-level implementation.

Therefore, for practical purposes, actual container runtimes that focus on just running containers are referred to as “low-level container runtimes”. Runtimes that implement higher-level features are usually referred to as “high-level container tools”, “high-level container runtimes”, or simply “container runtimes”. It’s important to note that low-level and high-level container runtimes are fundamentally different things that solve different problems.

What is a Container?

Containers are implemented using Linux namespaces and cgroups. Namespaces let you virtualize system resources, like the filesystem or the network, for each container. Cgroups provide a way to limit the amount of resources, such as CPU and memory, that each container can use. At the lowest level, a container runtime is responsible for setting up these namespaces and cgroups for a container and then running commands inside those namespaces and cgroups. Low-level runtimes handle setting up and using these operating system features.
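Assuming a Linux system, you can see both building blocks from any shell, since every process already lives inside a set of namespaces and cgroups:

```shell
# Every Linux process belongs to a set of namespaces...
ls /proc/self/ns            # e.g. cgroup ipc mnt net pid user uts ...
# ...and to cgroups that account for and limit its resource usage.
head -n 3 /proc/self/cgroup
# A container runtime creates *new* namespaces and cgroups (via unshare/clone
# and the cgroup filesystem) and starts the container's process inside them.
```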

Why Do We Need High-Level Container Runtimes?

Typically, developers who want to run applications in containers will need more than just the features the low-level runtimes provide. They need APIs and features around image formats, image management, and sharing images, which are usually provided by high-level container runtimes. Low-level container runtimes just don’t provide enough features for this everyday use; for that reason, the main users of low-level runtimes are developers who implement higher-level runtimes and tools for containers.

Developers who implement low-level runtimes will say that higher-level runtimes like containerd and cri-o are not actually container runtimes, since from their perspective they outsource the implementation of running a container to runc. But from the user’s perspective, they are a single component that provides the ability to run containers, and one implementation can be swapped for another, so it still makes sense to call them runtimes from that perspective.

Anatomy of a Low-Level Container Runtime

Low-level runtimes have a limited feature set and typically perform the low-level tasks required for running a container. They are usually implemented as simple tools or libraries that developers of higher-level runtimes and tools can use for the low-level features.

Building a Sample Runtime

We can perform the basic steps of running a container using the standard Linux cgcreate, cgset, cgexec, chroot, and unshare commands.

First, let’s set up a root filesystem for our container. We’ll use the busybox Docker container as our base. Here we create a temporary directory and extract busybox into it.

CID=$(docker create busybox)
ROOTFS=$(mktemp -d)
docker export $CID | tar -xf - -C $ROOTFS

Now, let’s create our cgroup and set restrictions on the memory and CPU. Memory limits are set in bytes; here we are setting the limit to 100MB.

UUID=$(uuidgen)
cgcreate -g cpu,memory:$UUID
cgset -r memory.limit_in_bytes=100000000 $UUID
cgset -r cpu.shares=512 $UUID

CPU usage can be restricted in one of two ways. Here we set our CPU limit using CPU “shares”. Shares are an amount of CPU time relative to other processes running at the same time. A container running by itself can use the whole CPU, but if other containers are running, each can use an amount of CPU proportional to its CPU shares.

CPU limits based on CPU cores are a bit more complicated. They let you set hard limits on the number of CPU cores a container can use. Limiting CPU cores requires setting two options on the cgroup: cfs_period_us and cfs_quota_us. cfs_period_us specifies how often CPU usage is checked, and cfs_quota_us specifies the amount of time a task can run on one core within one period. Both are specified in microseconds.

For instance, if we wanted to limit our container to two cores we could specify a period of one second and a quota of two seconds (one second is 1,000,000 microseconds) and this would effectively allow our process to use two cores during a one-second period.

cgset -r cpu.cfs_period_us=1000000 $UUID
cgset -r cpu.cfs_quota_us=2000000 $UUID
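The arithmetic behind those two values is simple multiplication; a quick sketch with illustrative variable names:

```shell
# Compute cfs_quota_us for a desired number of cores (variables are illustrative).
CORES=2
PERIOD_US=1000000                   # check CPU usage every 1 second
QUOTA_US=$((CORES * PERIOD_US))     # allow 2 seconds of CPU time per period
echo "cpu.cfs_period_us=$PERIOD_US cpu.cfs_quota_us=$QUOTA_US"
# → cpu.cfs_period_us=1000000 cpu.cfs_quota_us=2000000
```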

Next, we can execute a command in the container. This will execute the command within the cgroup we created, unshare the specified namespaces, set the hostname, and chroot to our filesystem.

cgexec -g cpu,memory:$UUID \
  unshare -uinpUrf --mount-proc \
  sh -c "/bin/hostname $UUID && chroot $ROOTFS /bin/sh"
  / # echo "Hello from in a container"
  Hello from in a container
  / # exit

Finally, after our command has completed, we can clean up by deleting the cgroup and temporary directory we created.

cgdelete -r -g cpu,memory:$UUID
rm -r $ROOTFS

Examples of Low-Level Container Runtimes

lmctfy

Though not in wide use, one container runtime of note is lmctfy. It is a project by Google based on the internal container runtime that Borg uses. One of its most interesting features is that it supports container hierarchies that use cgroup hierarchies via the container names. For example, a root container called “busybox” could create sub-containers under the name “busybox/sub1” or “busybox/sub2” where the names form a kind of path structure. As a result, each sub-container can have its own cgroups that are then limited by the parent container’s cgroup.

runc

runc is currently the most widely used container runtime. It was originally developed as part of Docker and was later extracted into a separate tool and library. runc implements the OCI runtime spec, which means it runs containers from a specific “OCI bundle” format. The bundle consists of a config.json file with the container’s configuration and a root filesystem for the container.

First, create the root filesystem (here we’ll use busybox again).

mkdir rootfs
docker export $(docker create busybox) | tar -xf - -C rootfs

Next, create a config.json file.

runc spec

This command creates a template config.json for our container. It should look something like this:

{
  "ociVersion": "1.0.0",
  "process": {
          "terminal": true,
          "user": {
                  "uid": 0,
                  "gid": 0
          },
          "args": [
                  "sh"
          ],
          "env": [
                  "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
                  "TERM=xterm"
          ],
          "cwd": "/",
          "capabilities": {
...

By default, it runs the sh command in a container with a root filesystem at ./rootfs. Since that’s exactly the setup we want, we can just go ahead and run the container.

sudo runc run mycontainerid
/ # echo "Hello from in a container"
Hello from in a container

Anatomy of a High-Level Container Runtime

High-level runtimes sit higher up the stack than low-level runtimes. While low-level runtimes are responsible for the mechanics of actually running a container, high-level runtimes are responsible for the transport and management of container images, unpacking the image, and handing off to the low-level runtime to run the container. Typically, high-level runtimes provide a daemon and an API that remote applications can use to run containers and monitor them, but they sit on top of, and delegate to, low-level runtimes or other high-level runtimes for the actual work.

High-level runtimes can also provide features that sound low-level, but are used across individual containers on a machine. For example, one feature might be the management of network namespaces, and allowing containers to join another container’s network namespace.
Examples of High-Level Runtimes

Docker

Docker is one of the first open source container runtimes. It was developed by the platform-as-a-service company dotCloud, and was used to run their users' web applications in containers.

Docker is a container runtime that incorporates building, packaging, sharing, and running containers. Docker has a client/server architecture and was originally built as a monolithic daemon, dockerd, and the docker client application. The daemon provided most of the logic of building containers, managing images, and running containers, along with an API. The command line client could be run to send commands and to get information from the daemon. It was the first popular runtime to incorporate all of the features needed during the lifecycle of building and running containers.

Docker originally implemented both high-level and low-level runtime features, but those pieces have since been broken out into separate projects as runc and containerd. Docker now consists of the dockerd daemon, and the docker-containerd daemon along with docker-runc. docker-containerd and docker-runc are just Docker packaged versions of vanilla containerd and runc.

dockerd provides features such as building images and uses docker-containerd for image management and running containers. For instance, Docker’s build step is actually just some logic that interprets a Dockerfile, runs the necessary commands in a container using containerd, and saves the resulting container file system as an image.

containerd

containerd is a high-level runtime that was split off from Docker. Like runc, which was broken off as the low-level runtime piece, containerd was broken off as the high-level runtime piece of Docker. containerd implements downloading images, managing them, and running containers from images. When it needs to run a container, it unpacks the image into an OCI runtime bundle and shells out to runc to run it.

containerd also provides an API and a client application that can be used to interact with it. The containerd command-line client is ctr, which can be used to tell containerd to:

  • Pull an image - sudo ctr images pull docker.io/library/redis:latest,
  • List the images you have - sudo ctr images list,
  • Run a container based on an image - sudo ctr container create docker.io/library/redis:latest redis,
  • List the running containers - sudo ctr containers list, and
  • Stop the container - sudo ctr container delete redis.

These commands are similar to how a user interacts with Docker. In contrast with Docker, however, which was built for end-user and developer use cases, containerd is focused solely on running containers, so it does not provide a mechanism for building container images; such tasks are left to other tools.

rkt

rkt is a runtime that has both low-level and high-level features. For instance, much like Docker, rkt allows you to build container images, fetch and manage container images in a local repository, and run them, all from a single command. rkt stops short of Docker’s functionality, however, in that it doesn’t provide a long-running daemon and remote API. With rkt you can:

  • Fetch remote images - sudo rkt fetch coreos.com/etcd:v3.3.10,
  • List the images installed locally - sudo rkt image list, and
  • Delete images - sudo rkt image rm coreos.com/etcd:v3.3.10.

rkt, however, no longer appears to be actively developed.

Kubernetes Container Runtimes & CRI

Kubernetes runtimes are high-level container runtimes that support the Container Runtime Interface (CRI). CRI was introduced in Kubernetes 1.5 and acts as a bridge between the kubelet and the container runtime. High-level container runtimes that want to integrate with Kubernetes are expected to implement CRI. The runtime must handle the management of images, support Kubernetes pods, and manage the individual containers, so a Kubernetes runtime must be a high-level runtime; low-level runtimes just don’t have the necessary features.

Container Runtime Interface (CRI)

In order to understand more about CRI, it’s worth taking a look at the overall Kubernetes architecture. The kubelet is an agent that sits on each worker node in the Kubernetes cluster and is responsible for managing the container workloads for its node. When it comes to actually running the workloads, the kubelet uses CRI to communicate with the container runtime running on that same node. In this way, CRI is simply an abstraction layer or API that allows you to swap out container runtime implementations instead of having them built into the kubelet.

Examples of CRI Runtimes

Here are some CRI runtimes that can be used with Kubernetes.

containerd

containerd is possibly the most popular CRI runtime currently. It implements CRI as a plugin, which is enabled by default. It listens on a Unix socket by default, so you can configure crictl to connect to containerd like this:

cat <<EOF | sudo tee /etc/crictl.yaml
runtime-endpoint: unix:///run/containerd/containerd.sock
EOF

It is an interesting high-level runtime in that it supports multiple low-level runtimes via something called a “runtime handler” starting in version 1.2. The runtime handler is passed via a field in CRI and based on that runtime handler, containerd runs an application called a shim to start the container. This can be used to run containers using low-level runtimes other than runc, like gVisor, Kata Containers, or Nabla Containers. The runtime handler is exposed in the Kubernetes API using the RuntimeClass object which is alpha in Kubernetes 1.12.
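As a sketch of how this surfaces to users, a RuntimeClass maps a name to a runtime handler, and pods opt into it. The schema has changed across Kubernetes versions (this uses the later stable v1 shape rather than the 1.12 alpha), and the gvisor/runsc names are illustrative:

```yaml
# RuntimeClass mapping a name to a containerd runtime handler
# (schema varies by Kubernetes version; names are illustrative).
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: gvisor
handler: runsc          # containerd runtime handler to use
---
# A pod opting into that runtime:
apiVersion: v1
kind: Pod
metadata:
  name: sandboxed-nginx
spec:
  runtimeClassName: gvisor
  containers:
  - name: nginx
    image: nginx
```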

Docker

Docker support for CRI was the first to be developed and was implemented as a shim between the kubelet and Docker. Docker has since broken out many of its features into containerd and now supports CRI through containerd. When modern versions of Docker are installed, containerd is installed along with it, and CRI can talk directly to that containerd. For that reason, Docker itself isn’t necessary to support CRI, so you can install containerd directly or via Docker depending on your use case.

cri-o

cri-o is a lightweight CRI runtime made as a Kubernetes specific high-level runtime. It supports the management of OCI compatible images and pulls from any OCI compatible image registry. It supports runc and Clear Containers as low-level runtimes. It supports other OCI compatible low-level runtimes in theory, but relies on compatibility with the runc OCI command line interface, so in practice it isn’t as flexible as containerd’s shim API.

cri-o’s endpoint is at /var/run/crio/crio.sock by default so you can configure crictl like so:

cat <<EOF | sudo tee /etc/crictl.yaml
runtime-endpoint: unix:///var/run/crio/crio.sock
EOF

The CRI Specification

CRI is a protocol buffers and gRPC API. The specification is defined in a protobuf file in the Kubernetes repository under the kubelet. CRI defines several remote procedure calls (RPCs) and message types. The RPCs are for operations like:

  • “Pull Image” - ImageService.PullImage,
  • “Create Pod” - RuntimeService.RunPodSandbox,
  • “Create Container” - RuntimeService.CreateContainer,
  • “Start Container” - RuntimeService.StartContainer,
  • “Stop Container” - RuntimeService.StopContainer, etc.
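An abridged sketch of what those service definitions look like in the protobuf file (the vast majority of RPCs and all message fields are omitted):

```protobuf
// Abridged sketch of the CRI gRPC services (many RPCs and all fields omitted).
service RuntimeService {
    rpc RunPodSandbox(RunPodSandboxRequest) returns (RunPodSandboxResponse) {}
    rpc CreateContainer(CreateContainerRequest) returns (CreateContainerResponse) {}
    rpc StartContainer(StartContainerRequest) returns (StartContainerResponse) {}
    rpc StopContainer(StopContainerRequest) returns (StopContainerResponse) {}
}

service ImageService {
    rpc PullImage(PullImageRequest) returns (PullImageResponse) {}
    rpc ListImages(ListImagesRequest) returns (ListImagesResponse) {}
}
```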

We can interact with a CRI runtime directly using the crictl tool. crictl lets us send gRPC messages to a CRI runtime directly from the command line. We can use this to debug and test out CRI implementations without starting up a full-blown kubelet or Kubernetes cluster.

You can configure crictl by creating a configuration file under /etc/crictl.yaml. Here you should specify the runtime’s gRPC endpoints as either a Unix socket file (unix:///path/to/file) or a TCP endpoint (tcp://<host>:<port>). We will use containerd for this example:

cat <<EOF | sudo tee /etc/crictl.yaml
runtime-endpoint: unix:///run/containerd/containerd.sock
EOF

Or, you can specify the runtime endpoint on each command-line invocation:

crictl --runtime-endpoint unix:///run/containerd/containerd.sock ...

Let’s run a pod with a single container using crictl. First, tell the runtime to pull the nginx image, since you can’t start a container without the image stored locally.

sudo crictl pull nginx

Next, create a Pod creation request. You can do this as a JSON file.

cat <<EOF | tee sandbox.json
{
    "metadata": {
        "name": "nginx-sandbox",
        "namespace": "default",
        "attempt": 1,
        "uid": "hdishd83djaidwnduwk28bcsb"
    },
    "linux": {
    },
    "log_directory": "/tmp"
}
EOF

And then create the pod sandbox. We will store the ID of the sandbox as SANDBOX_ID.

SANDBOX_ID=$(sudo crictl runp sandbox.json)

Next, we will create the container creation request in a JSON file.

cat <<EOF | tee container.json
{
  "metadata": {
      "name": "nginx"
    },
  "image":{
      "image": "nginx"
    },
  "log_path":"nginx.0.log",
  "linux": {
  }
}
EOF

We can then create and start the container inside the Pod we created earlier.

{
  CONTAINER_ID=$(sudo crictl create ${SANDBOX_ID} container.json sandbox.json)
  sudo crictl start ${CONTAINER_ID}
}

You can inspect the running pod and the running container:

sudo crictl inspectp ${SANDBOX_ID}
sudo crictl inspect ${CONTAINER_ID}

Clean up by stopping and deleting the container:

{
  sudo crictl stop ${CONTAINER_ID}
  sudo crictl rm ${CONTAINER_ID}
}

And then stop and delete the Pod:

{
  sudo crictl stopp ${SANDBOX_ID}
  sudo crictl rmp ${SANDBOX_ID}
}

Sources

  • Container Runtimes Part 1: An Introduction to Container Runtimes
  • Container Runtimes Part 2: Anatomy of a Low-Level Container Runtime
  • Container Runtimes Part 3: High-Level Runtimes
  • Container Runtimes Part 4: Kubernetes Container Runtimes & CRI