Bug #17244
open
Make sure cgroupsV2 works with Arvados
0%
Description
Reading
https://docs.docker.com/config/containers/runmetrics/
Running Docker on cgroup v2
Docker supports cgroup v2 experimentally since Docker 20.10. Running Docker on cgroup v2 also requires the following conditions to be satisfied:
containerd: v1.4 or later
runc: v1.0.0-rc91 or later
Kernel: v4.15 or later (v5.2 or later is recommended)
Note that the cgroup v2 mode behaves slightly different from the cgroup v1 mode:
The default cgroup driver (dockerd --exec-opt native.cgroupdriver) is “systemd” on v2, “cgroupfs” on v1.
The default cgroup namespace mode (docker run --cgroupns) is “private” on v2, “host” on v1.
The docker run flags --oom-kill-disable and --kernel-memory are discarded on v2.
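As a rough illustration of the kernel requirement quoted above, the following sketch classifies a kernel release string against the 4.15 minimum and the 5.2 recommendation. The function and constant names are hypothetical and not part of Arvados or Docker; only the two version thresholds come from the quoted documentation.

```python
import re
from typing import Tuple

# Thresholds taken from the Docker requirements quoted above.
MINIMUM_KERNEL = (4, 15)
RECOMMENDED_KERNEL = (5, 2)


def kernel_version(release: str) -> Tuple[int, int]:
    """Extract (major, minor) from a release string such as '5.9.0-5-amd64'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def kernel_status(release: str) -> str:
    """Classify a kernel release as 'too old', 'supported' or 'recommended'."""
    version = kernel_version(release)
    if version < MINIMUM_KERNEL:
        return "too old"
    if version < RECOMMENDED_KERNEL:
        return "supported"
    return "recommended"
```

On a live host this could be called as `kernel_status(platform.release())`. It checks the kernel only; the containerd and runc versions would need separate checks.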
With all these changes, we have to make sure that:
- We can run on a distro that has cgroup v2 by default (as in Fedora 2020), or with kernel parameters that boot with cgroups v2 enabled in systemd (kernel param systemd.unified_cgroup_hierarchy=1), and docker version >= 2020.04
- We can guide the admin through the upgrade to cgroup v2 and provide a simple test case to check that Arvados will run
The last point is important because the current error is kind of cryptic:
applying cgroup configuration for process caused: cannot enter cgroupv2 "/sys/fs/cgroup/docker" with domain controllers
There are also cryptic messages on a cgroupsv2-enabled host with Docker 19.03.13:
Status: Downloaded newer image for hello-world:latest
docker: Error response from daemon: cgroups: cgroup mountpoint does not exist: unknown.
ERRO[0005] error waiting for container: context canceled
https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v2.html
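A simple host check of the kind asked for above could look like the sketch below. It assumes the usual systemd layouts: on a unified (v2) host the file cgroup.controllers sits directly under /sys/fs/cgroup, and on a hybrid host it sits under /sys/fs/cgroup/unified. The function name and the returned labels are hypothetical.

```python
import os


def detect_cgroup_mode(root: str = "/sys/fs/cgroup") -> str:
    """Guess the cgroup layout mounted at `root`.

    Returns 'v2' (unified), 'hybrid', 'v1' (legacy) or 'unknown'.
    The path conventions are assumptions based on common systemd setups.
    """
    if not os.path.isdir(root):
        return "unknown"
    # A cgroup v2 root exposes cgroup.controllers at the top level.
    if os.path.exists(os.path.join(root, "cgroup.controllers")):
        return "v2"
    # systemd's hybrid mode mounts the v2 hierarchy under 'unified'.
    if os.path.exists(os.path.join(root, "unified", "cgroup.controllers")):
        return "hybrid"
    return "v1"
```

The `root` parameter exists so the check can be pointed at a temporary directory in tests. It only reports what is mounted, and does not verify that Docker or crunch-run can actually start a container.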
Updated by Nico César almost 4 years ago
- Category set to Crunch
- Target version set to 2021-01-20 Sprint
Updated by Javier Bértoli almost 4 years ago
I tried Arvados with the following setup:
1. Built binaries/images from current master (commit e98f4df4a@arvados)
2. Created a cluster
3. Ran the test script from the salt-install test dir
4. With kernel Linux 5.9.0-5-amd64 & cgroups2 (as documented here, I have /sys/fs/cgroup/cgroup.controllers)
5. Using docker 20.10
6. Using containerd 1.4.3
7. When I run the script, I get:
+ cwl-runner hasher-workflow.cwl hasher-workflow-job.yml
INFO /usr/bin/cwl-runner 2.1.1, arvados-python-client 2.1.1, cwltool 3.0.20200807132242
INFO Resolved 'hasher-workflow.cwl' to 'file:///usr/src/arvados/tests/hasher-workflow.cwl'
INFO hasher-workflow.cwl:36:7: Unknown hint WorkReuse
INFO hasher-workflow.cwl:50:7: Unknown hint WorkReuse
INFO hasher-workflow.cwl:64:7: Unknown hint WorkReuse
INFO Using cluster arvie (https://arvie.arv.local:8000/)
INFO Upload local files: "test.txt"
INFO Using collection f55e750025853f5b8ccae3ca79240f65+54 (arvie-4zz18-zbm7cmmt5h9d5rg)
INFO Using collection cache size 256 MiB
INFO [container hasher-workflow.cwl] submitted container_request arvie-xvhdp-7jpooik0zd8aj1t
INFO [container hasher-workflow.cwl] arvie-xvhdp-7jpooik0zd8aj1t is Final
ERROR [container hasher-workflow.cwl] (arvie-dz642-4v8xcwcvjvp5j2f) error log:
2021-01-11T20:56:51.604627332Z crunch-run crunch-run dev (go1.15) started
2021-01-11T20:56:51.604709650Z crunch-run Executing container 'arvie-dz642-4v8xcwcvjvp5j2f'
2021-01-11T20:56:51.604763728Z crunch-run Executing on host '27d4cb3c42e2'
2021-01-11T20:56:51.871544244Z crunch-run Fetching Docker image from collection '0428f2e88f4b398b8489f6c454e7e9ae+261'
2021-01-11T20:56:51.940054697Z crunch-run Using Docker image id 'sha256:0dd5078a5bec49810c1fcb86b60e1bda6b9c1e12dc2c3de75453b2fd37a55885'
2021-01-11T20:56:51.943832124Z crunch-run Docker image is available
2021-01-11T20:56:51.952139500Z crunch-run Running [arv-mount --foreground --allow-other --read-write --crunchstat-interval=10 --file-cache 268435456 --mount-tmp tmp0 --mount-by-pdh by_id /tmp/crunch-run.arvie-dz642-4v8xcwcvjvp5j2f.288172359/keep406717434]
2021-01-11T20:56:52.454639768Z crunch-run Creating Docker container
2021-01-11T20:56:52.509556810Z crunch-run Attaching container streams
2021-01-11T20:56:53.205291750Z crunch-run Starting Docker container id '7d91dac5eb133131cc9b131d1f0280810acf9c4eda6209b674546bb885c90606'
2021-01-11T20:56:53.397951196Z crunch-run error in Run: could not start container: Error response from daemon: OCI runtime create failed: container_linux.go:370: starting container process caused: process_linux.go:326: applying cgroup configuration for process caused: cannot enter cgroupv2 "/sys/fs/cgroup/docker" with domain controllers -- it is in an invalid state: unknown
2021-01-11T20:56:53.752428822Z crunch-run Cancelled
ERROR Overall process status is permanentFail
INFO Final output collection None
{}
WARNING Final process status is permanentFail
Using the same images and setup with:
- Linux 4.19.0-13-amd64 with systemd 241.7 (with cgroupsv1) works OK.
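To make these failures less cryptic for admins, the two error fragments reported in this ticket could be mapped to short hints. This is only a sketch: the table, the function name and the hint wording are hypothetical, and the hints restate what was observed here, not a confirmed root cause.

```python
from typing import Optional

# Error fragments copied from the messages reported in this ticket.
# The hints summarize the setups in which they were observed.
KNOWN_CGROUP_ERRORS = [
    (
        "cannot enter cgroupv2",
        "Observed on a cgroup v2 host with Docker 20.10 and containerd 1.4.3; "
        "see Bug #17244.",
    ),
    (
        "cgroup mountpoint does not exist",
        "Observed on a cgroup v2 host with Docker 19.03.13, which predates "
        "the cgroup v2 support added in Docker 20.10.",
    ),
]


def explain_cgroup_error(message: str) -> Optional[str]:
    """Return a hint for a known cgroup-related Docker error, else None."""
    for fragment, hint in KNOWN_CGROUP_ERRORS:
        if fragment in message:
            return hint
    return None
```

A caller such as a test script could print the hint next to the raw daemon error, and fall back to the raw message alone when `None` is returned.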
Updated by Javier Bértoli almost 4 years ago
According to this issue, Debian's systemd defaults to cgroupsv2 since 242-7 and docker 20.10.x
Updated by Peter Amstutz almost 4 years ago
- Target version deleted (2021-01-20 Sprint)
Updated by Nico César almost 4 years ago
- Related to Bug #17270: Test for docker cgroups issues in crunch-run works on ubuntu 20.04 added