Virtualisation

Introduction

Virtualization technologies such as VMs (Virtual Machines) and containers allow users to define custom software environments to facilitate deployment of applications along with all their dependencies, or to help with software development and validation tasks.

On this cluster, the pcocc tool (pronounced like "peacock") can deploy both VMs and containers on compute resources.

Container images built with popular tools such as Docker or Podman can be launched interactively or within Slurm jobs. High performance networks and GPUs can be used from within containers.

Clusters of VMs can be instantiated on the fly with a single command. This command allocates the resources needed to host the virtual machines, including private Ethernet and/or InfiniBand networks, and starts VMs with ephemeral disks created from a common image.

Containers

A container is a set of processes which share a custom view of system resources such as the filesystem. Typically, a container is started from an image (a set of files forming a root filesystem) and processes within the container see the image as their root filesystem instead of the filesystem of the host (login or compute node where the container is launched).

A container image can be created using several popular tools (Docker, Podman, …) or downloaded from a registry (DockerHub, NVIDIA container registry, …). Images are usually distributed in the OCI format, a standard image format that allows interoperability between container build tools and runtimes. Another image format is SIF, which is mainly used by Singularity (or Apptainer).

Using pcocc, images in the OCI format (limited support is also provided for SIF images) can be imported and run efficiently on a compute cluster. pcocc facilitates integration with the host environment including mounting dataspaces, launching jobs in containers or using high performance interconnects and GPUs.

pcocc also provides a way to spawn a dedicated single-node Docker environment in which Docker can be used directly to build containers or run services.

Importing an image and launching containers

  1. Generate a tar archive from an OCI image

On your local workstation, build and export a Docker image in the tar archive format:

$ docker build -t my_docker_image . # Build an image
$ docker run -ti my_docker_image # Test the image
$ docker save my_docker_image -o my_docker_image.tar # Export the image

  2. Copy the newly obtained tar file (here: my_docker_image.tar) to the cluster using, for instance, scp or rsync

  3. Import the image into your pcocc image repository on the cluster (here under the name my_pcocc_image):

    $ pcocc-rs image import docker-archive:my_docker_image.tar my_pcocc_image

  4. Check that the image is available in your repository:

    $ pcocc-rs image list

  5. Start a container from your image (once the image has been imported it can be launched multiple times without re-importing)

A single container on the current node:

$ pcocc-rs run my_pcocc_image [arg1,arg2,...]

Multiple tasks in containers on remote nodes in a batch job:

#!/bin/bash
#MSUB -q rome
#MSUB -Q test
#MSUB -T 120
#MSUB -n 2
#MSUB -c 2
#MSUB -A <project>
ccc_mprun -n 2 -c 2 -C my_pcocc_image -- <cmd>

In that case, one container per node is created and multiple tasks within a node share a container.
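
The cluster-side steps above (import, list, run) can be gathered in a small script. The following is a minimal sketch rather than an official tool: the run helper, the import_and_check function and the DRY_RUN variable are hypothetical names introduced here. By default the script only prints the commands so that they can be reviewed; setting DRY_RUN=0 executes them.

```shell
#!/bin/sh
# Cluster-side sketch: import a docker-archive tarball, then smoke-test it.
# DRY_RUN (hypothetical variable) defaults to 1, which only prints commands.

run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

import_and_check() {
    archive="$1"    # tar file already copied to the cluster
    image="$2"      # name to give the image in the pcocc repository
    run pcocc-rs image import "docker-archive:${archive}" "${image}"
    run pcocc-rs image list
    run pcocc-rs run "${image}"
}

import_and_check my_docker_image.tar my_pcocc_image
```

Keeping the import and the first launch together makes it easy to confirm that the name given at import time is the one used at launch time.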

When launching a container image with pcocc, there are some notable differences in the execution environment compared to the same image launched using Docker:

  • Processes within the container are executed as your user instead of root or the default user specified in the container image.
  • A default set of pcocc modules is applied which mounts storage spaces (home, scratch, work, store, …) in the container and propagates a minimal set of common environment variables from the host. Other environment variables are not propagated by default, as when using Docker. The default set of modules may be disabled using the --no-defaults option.
  • The current working directory is propagated to the container unless a working directory is specified in the image.
  • When launching a container through Slurm (ccc_mprun -C or srun --ctr), the entrypoint and/or command defined when building the image are ignored. However, they are executed normally when using pcocc-rs run. The entrypoint may be disabled using the --no-ep option.

Mounting files and directories

Files or directories from the host can be mounted into the container. By default, cluster storage spaces (home, scratch, …) are mounted in the container at the same location as on the host. This can be disabled using the --no-defaults option. Custom mount points may be specified on the command line as follows:

For a local launch:

$ pcocc-rs run my_pcocc_image --mount src=<source directory>,dst=<target directory> -- <cmd>

For a remote launch using ccc_mprun:

$ ccc_mprun -C my_pcocc_image -E'--ctr-mount src=<source directory>,dst=<target directory>' -- <cmd>

Alternatively, modules or container templates can be defined in a configuration file.

Mounting SquashFS images

SquashFS images can also be mounted using a similar syntax:

For a local launch:

$ pcocc-rs run my_pcocc_image --mount src=<squashfs_image>,dst=<target directory>,type=squashfs -- <cmd>

For a remote launch using ccc_mprun:

$ ccc_mprun -C my_pcocc_image -E'--ctr-mount src=<squashfs_image>,dst=<target directory>,type=squashfs' -- <cmd>

This can be used in combination with a passthrough container which replicates the host to mount SquashFS images on top of the host filesystem, for example:

$ ccc_mprun -C host-passthrough -E'--ctr-mount src=<squashfs_image>,dst=<target directory>,type=squashfs' -- <cmd>
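
The mount options of the last two sections share one syntax: src=...,dst=... with an optional type. As a minimal sketch, the helper below builds such a string and refuses a source that does not exist on the host. The name mount_spec is hypothetical, and the existence check is a convenience of this sketch, not a documented pcocc-rs requirement. Paths containing commas or spaces are not handled.

```shell
#!/bin/sh
# mount_spec SRC DST [TYPE]
# Prints "src=SRC,dst=DST" or "src=SRC,dst=DST,type=TYPE".
mount_spec() {
    src="$1"
    dst="$2"
    fstype="${3:-}"
    if [ ! -e "$src" ]; then
        echo "mount_spec: source not found: $src" >&2
        return 1
    fi
    if [ -n "$fstype" ]; then
        printf 'src=%s,dst=%s,type=%s\n' "$src" "$dst" "$fstype"
    else
        printf 'src=%s,dst=%s\n' "$src" "$dst"
    fi
}

# Example: the current directory exists, so a mount string is printed.
mount_spec "$PWD" /data
```

The result can then be passed to a local launch, for example pcocc-rs run my_pcocc_image --mount "$(mount_spec "$PWD" /data)" -- <cmd>, or placed after --ctr-mount in the -E option of ccc_mprun.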

Propagating environment variables

Additional environment variables may be propagated from the host to the container using the following syntax:

$ export VARIABLE_TEST=value
$ pcocc-rs run my_pcocc_image --env VARIABLE_TEST -- <command>
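
The effect of propagation can be pictured without any container. The sketch below is only an analogy: env -i starts a process with an empty environment, which resembles a container that does not inherit arbitrary host variables, and passing the variable explicitly plays the role of --env. It does not reproduce the minimal set of variables that pcocc propagates by default.

```shell
#!/bin/sh
# Analogy only: no container is started here.
export VARIABLE_TEST=value

# Not propagated: the child process does not see the host variable.
env -i /bin/sh -c 'echo "without propagation: ${VARIABLE_TEST:-<unset>}"'

# Explicitly propagated: the host value is handed to the child process,
# which is the effect of "--env VARIABLE_TEST" for a container.
env -i VARIABLE_TEST="$VARIABLE_TEST" /bin/sh -c 'echo "with propagation: ${VARIABLE_TEST:-<unset>}"'
```

To check the real behaviour, the <command> in the launch line above can simply be printenv VARIABLE_TEST.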

Alternatively, modules or container templates can be defined in a configuration file.

Container modules

Container modules provide a more convenient way than command line options to configure a set of host files to mount in a container and a set of environment variables to propagate.

Modules can be defined in your personal configuration file in ~/.config/pcocc/containers.yaml. Additional configuration files can be provided in custom locations. In addition, some modules are provided by the compute center, for example to mount storage spaces, GPU drivers or inject MPI libraries.

To list available modules:

$ pcocc-rs module list

Modules listed as default modules are automatically included unless the --no-defaults option is specified.

The syntax to define a module is as follows:

modules:
  my_module:
    mounts:
      # Mount source_dir_1 on target_dir in the container
      - source: <source_dir_1>
        destination: <target_dir>
      # Mount source_dir_2 on the same location in the container
      - source: <source_dir_2>
    env:
      # Propagate a variable from the host
      - SLURM_JOB_ID
      # Set a variable to a specific value
      - MY_VARIABLE=<value>
      # Prefix content to a PATH-like variable in the container
      - pp(LD_LIBRARY_PATH)=/opt/mylib
      # Suffix content to a PATH-like variable in the container
      - ps(PATH)=$HOME/mylib
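
The pp() and ps() forms can be read as ordinary PATH manipulation. The shell sketch below mimics them outside of pcocc, assuming that entries are joined with a colon as in PATH. The function names path_prefix and path_suffix are hypothetical, and the handling of an empty variable is a choice of this sketch, since this documentation does not say what pcocc does in that case.

```shell
#!/bin/sh
# path_prefix CURRENT ENTRY: ENTRY goes in front, like pp(VAR)=ENTRY
path_prefix() {
    if [ -n "$1" ]; then
        printf '%s:%s\n' "$2" "$1"
    else
        printf '%s\n' "$2"
    fi
}

# path_suffix CURRENT ENTRY: ENTRY goes at the end, like ps(VAR)=ENTRY
path_suffix() {
    if [ -n "$1" ]; then
        printf '%s:%s\n' "$1" "$2"
    else
        printf '%s\n' "$2"
    fi
}

path_prefix "/usr/lib64" "/opt/mylib"      # prints /opt/mylib:/usr/lib64
path_suffix "/usr/bin:/bin" "/opt/mylib"   # prints /usr/bin:/bin:/opt/mylib
```

Prefixing is the usual choice when the added directory must take precedence over what the image already provides, and suffixing when it should only act as a fallback.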

To use a module on the command line:

For a local launch:

$ pcocc-rs run my_pcocc_image --ctr-module my_module -- <cmd>

For a remote launch using ccc_mprun:

$ ccc_mprun -C my_pcocc_image -E'--ctr-module my_module' -- <cmd>

Using MPI and NVIDIA GPUs in containers

Parallel MPI applications as well as CUDA based applications require low level libraries which talk directly to kernel drivers and/or high performance interconnects and GPUs. These libraries have to match the kernel drivers and hardware of the host.

One way to achieve this is to mount these libraries at runtime using modules.

The nvidia module mounts low-level libraries needed to access the NVIDIA drivers on compute nodes. For the container to be compatible with the injected libraries it should use a CUDA version that is compatible with the driver version of the compute node (which can be checked with the nvidia-smi command) or use the libcudacompat library.

The openmpi-4.1.4 module mounts the OpenMPI library and recommended settings for using the compute nodes' high performance interconnect. It injects an equivalent of the default version of the mpi/openmpi module. No other versions of the OpenMPI modules are available inside the container. For the container to be compatible with the injected libraries, it should ideally be based on the same OS distribution as the one installed on the host node. More recent Linux distributions should generally work but incompatibilities may exist. The containerized application must be compiled using the same major version of OpenMPI as the one being injected in the container.
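
The "same major version" rule can be checked before launching a job. The sketch below compares two version strings by their first component. The helper names major_of and same_major are hypothetical, and the version 4.0.5 stands for whatever OpenMPI release the application was actually built with.

```shell
#!/bin/sh
# major_of VERSION: prints the part before the first dot (4.1.4 gives 4)
major_of() {
    printf '%s\n' "${1%%.*}"
}

# same_major A B: succeeds when both versions share the same major number
same_major() {
    [ "$(major_of "$1")" = "$(major_of "$2")" ]
}

injected="4.1.4"      # version of the injected module named above
built_with="4.0.5"    # hypothetical version used to build the application

if same_major "$injected" "$built_with"; then
    echo "OpenMPI ${built_with} and ${injected} share major version $(major_of "$injected")"
else
    echo "major version mismatch: ${built_with} vs ${injected}" >&2
fi
```

If the image ships the OpenMPI tools, the version used at build time can usually be read by running mpirun --version inside the container.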

Container templates

A container template combines a container image with runtime launch options to facilitate launching containers which require advanced configuration options.

Container templates can be defined in your personal configuration file in ~/.config/pcocc/containers.yaml. Additional configuration files can be provided in custom locations.

The syntax to define a container template is as follows:

containers:
 my_template:
   rootfs: "user:my_pcocc_image"
   # Optional. If true, default modules are not loaded
   # (equivalent to pcocc-rs run --no-defaults)
   no_defaults: false
   modules:
     - my_module
   env:
     # Supports the same syntax as modules
     - MY_VARIABLE
   mounts:
     # Supports the same syntax as modules
     - source: <source_dir_1>
       destination: <target_dir>

A container template can reference modules or define environment variables and mounts inline. The rootfs field defines the container image to use. In this example the my_pcocc_image image in the ‘user’ repository is selected.

The template name can be passed to pcocc-rs run or ccc_mprun -C instead of a container image name.

For a local launch:

$ pcocc-rs run my_template -- <cmd>

For a remote launch using ccc_mprun:

$ ccc_mprun -C my_template -- <cmd>
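
As an illustration, a template can bundle an image with the modules described in the MPI and GPU section so that they do not have to be repeated on every command line. This is a sketch using only the fields shown above: the template name my_gpu_mpi_template is hypothetical, and the module names should be checked against the output of pcocc-rs module list on your cluster.

```yaml
containers:
  my_gpu_mpi_template:            # hypothetical template name
    rootfs: "user:my_pcocc_image"
    modules:
      # Module names taken from the MPI and GPU section; verify them
      # with "pcocc-rs module list" before relying on this template
      - nvidia
      - openmpi-4.1.4
```

Such a template would then be launched like any other, for example with ccc_mprun -n 2 -C my_gpu_mpi_template -- <cmd>.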

Configuration files and image repository locations

By default, images and container configurations (modules, templates, …) are sourced from the per-user configuration directory and image repository. The per-user configuration directory is located at ~/.config/pcocc/. Container configuration files for modules and templates can be defined in ~/.config/pcocc/containers.yaml and ~/.config/pcocc/containers.d/*.yaml.

Additional image repositories can be defined in ~/.config/pcocc/repositories.yaml and ~/.config/pcocc/repositories.d/*.yaml.

The syntax to define a repository is as follows:

repositories:
  myrepo:
    path: </path/to/my/repo>
    description: This is my repository

A new repository is automatically initialized if the path is an empty directory or if it doesn’t exist (the parent directory must however exist).
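
A repository definition can also be generated from the shell. The helper below is a minimal sketch: add_repo is a hypothetical function name, and it writes one file per repository under repositories.d, which is one of the two locations listed above. The demonstration writes to a temporary directory; leaving out the third argument targets ~/.config/pcocc instead.

```shell
#!/bin/sh
# add_repo NAME PATH [CONFIG_DIR]
# Writes CONFIG_DIR/repositories.d/NAME.yaml using the syntax shown above.
add_repo() {
    name="$1"
    path="$2"
    config_dir="${3:-$HOME/.config/pcocc}"
    mkdir -p "${config_dir}/repositories.d"
    cat > "${config_dir}/repositories.d/${name}.yaml" <<EOF
repositories:
  ${name}:
    path: ${path}
    description: Repository ${name}
EOF
}

# Demonstration against a throw-away directory, not the real configuration
demo_dir="$(mktemp -d)"
add_repo myrepo /path/to/my/repo "$demo_dir"
cat "$demo_dir/repositories.d/myrepo.yaml"
```

The helper does not create the repository path itself, since pcocc initializes it when needed.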

To use an image from a non-default repository, the repository name must be prefixed to the image name, for example:

$ pcocc-rs run myrepo:myimage

To list images from a specific repository use the --repo option, for example:

$ pcocc-rs image list --repo myrepo

Additional configuration directories can be defined by setting the PCOCC_CONFIG_PATH=<dir1>[:<dirN>] variable. In each additional configuration directory, the same configuration files as in the per-user configuration directory can be set. This can be combined with shared spaces and extenv modules to provide container definitions to a group of users.

Importing SIF images

SIF images can be imported using:

$ pcocc-rs image import sif:<mysifimage.sif> mysifimage

Note, however, that environment variables defined in SIF images may not be fully preserved when converted for use with pcocc-rs.

Migrating from pcocc to pcocc-rs

pcocc-rs is a new version of pcocc written in Rust which is now recommended for launching containers. While the pcocc command is still available, it should no longer be used for that purpose.

Image repositories are shared between pcocc and pcocc-rs but if you imported images with the pcocc command, they will at first appear with a forbidden sign in the output of pcocc-rs image list. In that case, you have to rebuild them for pcocc-rs using the following command before running them:

$ pcocc-rs image rebuild my_docker_image

The “No squashfs has been built for this image” error message usually indicates that pcocc-rs is trying to start a pcocc image that hasn’t been rebuilt yet.

Docker environment

pcocc can create a dedicated, single-node, virtual Docker environment. Inside this environment, Docker commands can be used as normal to build and test Docker containers. Behind the scenes, a Docker daemon is started in a VM and handles the requests from the Docker CLI as if the daemon were running locally. Once containers have been built in this environment, they can be exported to a pcocc image repository to start them directly on login or compute nodes without using a Docker daemon.

This Docker environment should only be used to build or develop with containers, or for workflows that are tightly integrated with Docker. It only scales to a single node and induces substantial overhead on I/O operations when mounting data from the host cluster in containers.

To run containerized applications efficiently, containers should be started directly using pcocc-rs instead of using Docker (see the Containers section).

In any case, access to external resources (such as public image registries, package or source code repositories, etc.) is only possible if the host cluster can access the Internet.

Starting a Docker environment interactively

To start an interactive Docker session:

$ pcocc docker alloc -p <partition> -c <number of cores>
salloc: Pending job allocation XXXXXX
salloc: job XXXXXX queued and waiting for resources
salloc: job XXXXXX has been allocated resources
salloc: Granted job allocation XXXXXX
salloc: Waiting for resource configuration
salloc: Nodes node6020 are ready for job
Configuring hosts... (done)
Waiting for Docker VM to start ...
(pcocc/XXXXXX) $ docker --version
Docker version XX.YY.ZZ, build xxxxxxx
(pcocc/XXXXXX) $ exit
$

In some cases you may be asked for an account which can be specified with -A <account>. Please ask the hotline if you don’t know which account to use.

Please note that by default an ephemeral VM is used for the Docker daemon, where all containers and images are stored in memory. The number of allocated cores determines the amount of memory available for Docker to store containers. Alternatively, the Docker daemon can also be configured to use persistent data.

In the Docker session, Docker commands can be used as normal. However, access to external resources is only possible on clusters where Internet access is authorized:

(pcocc/XXXXXX) $ docker search ubuntu
NAME   DESCRIPTION                                   STARS OFFICIAL AUTOMATED
ubuntu Ubuntu is a Debian-based Linux operating sys… 15605 [OK]
(pcocc/XXXXXX) $ docker pull ubuntu:latest
(pcocc/XXXXXX) $ docker image list
REPOSITORY   TAG       IMAGE ID       CREATED              SIZE
ubuntu       latest    58db3edaf2be   3 weeks ago          77.8MB
(pcocc/XXXXXX) $ docker run -ti ubuntu:latest bash
root@c02d04424d02:/# cat /etc/debian_version
bookworm/sid
root@c02d04424d02:/# exit
(pcocc/XXXXXX) $ cat Dockerfile
FROM ubuntu:latest
RUN apt update && apt upgrade -y
(pcocc/XXXXXX) $ docker build -t myubuntu:latest .
(pcocc/XXXXXX) $ docker image list
REPOSITORY   TAG       IMAGE ID       CREATED              SIZE
myubuntu     latest    c56a83f1ec3a   About a minute ago   128MB
ubuntu       latest    58db3edaf2be   3 weeks ago          77.8MB
(pcocc/XXXXXX) $ docker save myubuntu:latest -o myubuntu_latest.tar

At the end of a Docker session, the whole state of the Docker daemon, including all containers and container images, is destroyed. Container images can be exported to a pcocc image repository to persist them and/or launch containers on the cluster without using Docker, as described in the Containers section:

(pcocc/XXXXXX) $ pcocc-rs image import docker-daemon:myubuntu:latest myubuntu
Successfully imported docker-daemon:myubuntu:latest
(pcocc/XXXXXX) $ exit
$ pcocc-rs image list
Name                          Type        Modified
=============================================================
myubuntu                      container   20XX-02-16T18:11:15

To retrieve the Docker image from a pcocc image repository in a future session:

$ pcocc docker alloc -p <partition> -c <number of cores>
(pcocc/XXXXXX) $ pcocc docker import user:myubuntu myubuntu:latest
(pcocc/XXXXXX) $ docker image list
REPOSITORY   TAG       IMAGE ID       CREATED             SIZE
myubuntu     latest    1f382964a2b1   About an hour ago   128MB

Persisting the Docker daemon state across allocations

Docker sessions can be persisted across successive allocations using the persistent Docker template. First, the persistent environment must be initialized once by creating a persistent disk image for the VM hosting the Docker daemon:

$ qemu-img create -f raw $CCCWORKDIR/.pcocc.docker.persistent.img 100G
$ mkfs.ext4 $CCCWORKDIR/.pcocc.docker.persistent.img

To make use of this disk image, set the option -t docker-persist when launching a Docker session:

$ pcocc docker alloc -t docker-persist -p <partition> -c <number of cores>

Please be aware that in this mode, a single Docker daemon can be allocated at a time since this data disk cannot be shared by multiple daemons.

Submitting jobs in a Docker environment

The pcocc docker batch command can be used to submit a batch job which can execute Docker commands.

$ pcocc docker batch -p <partition> -c <number of cores>

Once the Docker daemon is running in batch mode the Docker environment can be joined using pcocc docker shell.

A build script containing Docker commands can be used in either interactive or batch mode.

For example, given this script:

$ cat docker_job.sh
#!/usr/bin/bash
docker build -t myubuntu:latest .
docker save myubuntu:latest -o myubuntu_latest.tar

It can be run with:

$ pcocc docker alloc -E 'docker_job.sh'

or

$ pcocc docker batch -E 'docker_job.sh'
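
Because the Docker daemon state is destroyed at the end of a session, a batch build script may want to import its result into a pcocc image repository before exiting. The sketch below writes such a variant of docker_job.sh and checks its syntax without running it. It assumes that pcocc-rs can be called from the script in the same way as from the interactive Docker session shown earlier, which should be verified on your cluster.

```shell
#!/bin/sh
# Write a stricter variant of docker_job.sh and check its syntax.
cat > docker_job.sh <<'EOF'
#!/usr/bin/bash
# Stop at the first failing command so a broken build is never imported
set -euo pipefail

docker build -t myubuntu:latest .
# Keep the result: the Docker daemon state is lost when the session ends
pcocc-rs image import docker-daemon:myubuntu:latest myubuntu
EOF
chmod +x docker_job.sh

# Parse only; nothing is executed here
sh -n docker_job.sh && echo "docker_job.sh: syntax OK"
```

Importing directly from the daemon avoids the intermediate tar archive produced by docker save.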

Virtual machines

pcocc can run clusters of VMs on compute nodes, alongside regular jobs.

To launch a virtual cluster, a user selects one or more templates from which to instantiate VMs and the number of requested VMs. For each virtual cluster, pcocc allocates the necessary resources to host the VMs, including private Ethernet and/or InfiniBand networks, creates temporary disk images from the selected templates and instantiates the requested VMs. The VMs are ephemeral and use Copy-on-Write disk drives instantiated using the template image as a backing file.

Full documentation is provided through the pcocc man pages. Use man pcocc for an index, and refer to the individual man pages listed at the bottom of the page for more specific information, for example man pcocc-templates.yaml. The documentation is also available on the web: <https://pcocc.readthedocs.io/en/latest>

The following sections provide a quick overview of how some common tasks can be performed.

Building or importing images

Any disk image in a file format supported by Qemu can be imported using:

$ pcocc image import /path/to/my/image user:[image name]

Depending on the image format, and filename extension, the file format may need to be specified using the --fmt option.

There are no particular requirements on images except that DHCP should be enabled on the main interface to configure the network automatically. It is also useful to configure images to use the serial console. Vanilla cloud images from Linux distributions are usually well suited as a base for creating images that work well with pcocc.

To manage images and image repositories, please refer to the dedicated man page with man pcocc-image.

Cloud images can also be configured by pcocc using cloud-init. Please refer to the VM creation tutorial (man pcocc-newvm-tutorial).

Launching a virtual cluster

Once an image has been imported, a cluster of VMs can be instantiated from it. More precisely, VMs are instantiated from templates which refer to an image and other configuration parameters such as network resources or host files to expose to the VM. However, for each image, a template with the same name is implicitly created, which makes it possible to instantiate a VM from the image with a default configuration.

The default network is a private Ethernet network interconnecting all VMs of a virtual cluster. It features a virtual router providing access to the host cluster from the VM private network using NAT. It also ensures that VMs are reachable through SSH from the host cluster using reverse NAT.

For more information about VM templates, please refer to man pcocc-templates.yaml.

To start a cluster of VMs from VM images:

$ pcocc alloc -p <partition> -c <cores per VM>  myimage1:<number of VMs>,myimage2:<number of VMs>
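
The last argument of pcocc alloc is a comma-separated list of image:count pairs. When the pairs come from a script, a small helper can assemble it. This is a minimal sketch and vm_spec is a hypothetical function name.

```shell
#!/bin/sh
# vm_spec IMAGE COUNT [IMAGE COUNT ...]
# Joins the pairs as IMAGE:COUNT,IMAGE:COUNT,...
vm_spec() {
    spec=""
    while [ "$#" -ge 2 ]; do
        spec="${spec:+${spec},}$1:$2"
        shift 2
    done
    printf '%s\n' "$spec"
}

vm_spec myimage1 2 myimage2 4   # prints myimage1:2,myimage2:4
```

The output can be substituted on the command line, for example pcocc alloc -p <partition> -c <cores per VM> "$(vm_spec myimage1 2 myimage2 4)".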

The boot of a VM can usually be monitored through its console (hit Ctrl-C 3 times to exit the console):

(pcocc/1016696) $ pcocc console vm0
[    0.000000] Initializing cgroup subsys cpuset
[    0.000000] Initializing cgroup subsys cpu
[    0.000000] Initializing cgroup subsys cpuacct
[...]

If SSH access has been properly configured, VMs should be reachable using SSH:

(pcocc/1016696) $ pcocc ssh vm0

For more information about launching VMs with pcocc, please refer to the main man page (man pcocc) and to the man pages for each subcommand, for example man pcocc-alloc. Help is also available for each subcommand, for example pcocc batch --help. The documentation is also available on the web: <https://pcocc.readthedocs.io/en/latest>