libcontainer
This crate is one of the core crates of the youki workspace, and has functions and structs that deal with the actual craetion and management of the container processes.
Remember, in the end, a container is just another process in Linux, which has control groups, namespaces, pivot_root and other mechanisms applied to it. The program executing has the impression that is is running on a complete system, but from the host system's perspective, it is just another process, and has attributes such as pid, file descriptors, etc. associated with it like any other process.
Along with the container related functions, this crate also provides Youki Config, a subset of the OCI spec config. This config contains only the essential data required for running the containers, and due to its smaller size, parsing it and passing it around is more efficient than the complete OCI spec config struct.
Other than that, this crate also provides a wrapper over basic Linux sockets, which are then used internally as well as by youki to communicate between the main youki process and the forked container process as well as the intermediate process.
This crate also provides an interface for Apparmor which is another Linux Kernel module allowing to apply security profiles on a per-program basis. More information about it can be found at https://apparmor.net/.
Notes
Some other modules expose by this crate are
- rootfs, which is a ramfs like simple filesystem used by kernel during initialization
- hooks, which allow running of specified program at certain points in the container lifecycle, such as before and after creation, start etc.
- signals, which provide a wrapper to convert to and from signal numbers and text representation of signal names
- capabilities, which has functions related to set and reset specific capabilities, as well as to drop extra privileges
- tty module which daels with providing terminal interface to the container process
- pseudoterminal man page : Information about the pseudoterminal system, useful to understand console_socket parameter in create subcommand
Executor
By default and traditionally, the executor forks and execs into the binary
command that specified in the oci spec. Using executors, we can override this
behavior. For example, youki
uses executor to implement running wasm
workloads. Instead of running the command specified in the process section of
the OCI spec, the wasm related executors can choose to execute wasm code
instead. The executor will run at the end of the container init process.
The API accepts only a single executor, so when using multiple executors, (try wasm first, then defaults to running a binary), the users should compose multiple executors into a single executor. The executor will return an error when the executor can't handle the workload.
Namespaces : namespaces provide isolation of resources such as filesystem, process ids networks etc on kernel level. This module contains structs and functions related to applying or un-applying namespaces to the calling process
Note: clone(2) offers us the ability to enter into user and pid namespace by creating only one process. However, clone(2) can only create new pid namespace, but cannot enter into existing pid namespaces. Therefore, to enter into existing pid namespaces, we would need to fork twice. Currently, there is no getting around this limitation.
Pausing and resuming
Pausing a container indicates suspending all processes in it. This can be done with signals SIGSTOP and SIGCONT, but these can be intercepted. Using cgroups to suspend and resume processes without letting tasks know.
The following are some resources that can help understand with various Linux features used in the code of this crate
- oom-score-adj
- unshare man page
- user-namespace man page
- wait man page
- pipe2 man page : Definition and usage of pipe2
- Unix Sockets man page : Useful to understand sockets
- prctl man page : Process control man pages