Bug #19702
closed
singularity failure "plugin type="portmap" failed (add): netplugin failed with no error message: signal: killed"
100%
Description
https://workbench2.tordo.arvadosapi.com/processes/tordo-xvhdp-u55x8qxyai66qu3
2022-11-04T01:36:42.720240040Z using local keepstore process (pid 31086) at http://10.253.254.237:36637
2022-11-04T01:36:43.527944057Z gateway server listening at 10.253.254.237:37517
2022-11-04T01:36:43.529210341Z crunch-run 2.5.0~dev20221031202240 (go1.17.7) started
2022-11-04T01:36:43.529730722Z crunch-run process has uid=0(root) gid=0(root) groups=0(root)
2022-11-04T01:36:50.403375515Z Using FUSE mount: /usr/bin/arv-mount 2.5.0.dev20220908150551
2022-11-04T01:37:00.964489187Z Using container runtime: singularity-ce version 3.9.9
2022-11-04T01:37:00.965279062Z Executing container: tordo-dz642-2jn257pktac5pds
2022-11-04T01:37:00.965437914Z Executing on host 'ip-10-253-254-237'
2022-11-04T01:37:01.072429734Z container token "v2/tordo-gj3su-pss9lb029q7r5tk/16ci5mwyml0v1tgs0bnpq05kyr3ax5wlhg71jdkuzf1nmw8zjs/tordo-dz642-2jn257pktac5pds"
2022-11-04T01:37:01.073177130Z Running [arv-mount --foreground --read-write --storage-classes default --crunchstat-interval=10 --file-cache 268435456 --mount-tmp tmp0 --mount-by-pdh by_id --disable-event-listening --mount-by-id by_uuid /tmp/crunch-run.tordo-dz642-2jn257pktac5pds.4195713021/keep3363330801]
2022-11-04T01:37:01.980388274Z Fetching Docker image from collection 'f8d78c661b100d071829d0600e01d2a6+513'
2022-11-04T01:37:02.088701070Z Using Docker image id "sha256:fb0ac87078b3916df22a477743a911c933447bc6ed6310af48dcc3cad3c5c815"
2022-11-04T01:37:02.088734688Z Loading Docker image from keep
2022-11-04T01:37:02.507284910Z building singularity image
2022-11-04T01:37:02.508061814Z [singularity build /tmp/crunch-run.tordo-dz642-2jn257pktac5pds.4195713021/keep3363330801/by_uuid/tordo-4zz18-y71qluoxmymynse/image.sif docker-archive:///tmp/crunch-run-singularity-2503404092/image.tar]
2022-11-04T01:38:45.488012829Z INFO: Starting build...
2022-11-04T01:38:45.488012829Z Getting image source signatures
2022-11-04T01:38:45.488012829Z Copying blob sha256:2141d9a2bb10152a46970ba69da724943d79d19bc0cd194945cc4ec2d1bc4ae2
2022-11-04T01:38:45.488012829Z Copying blob sha256:270a8dc08c4bb67a19b86398d4cfee8cdfcc344f7f3af88362a0a5eedfb5d2f9
2022-11-04T01:38:45.488012829Z Copying blob sha256:6be90f1a2d3f1eb115203b6adb2ce1014fab9a9f8f1b2afa31343397063603d3
2022-11-04T01:38:45.488012829Z Copying blob sha256:2761f8a9e627669ad97308c19bfb1dc2069a585c2614e83f220daf2dcef7c67e
2022-11-04T01:38:45.488012829Z Copying blob sha256:2761f8a9e627669ad97308c19bfb1dc2069a585c2614e83f220daf2dcef7c67e
2022-11-04T01:38:45.488012829Z Copying blob sha256:fa83b8d448a9dc6ab0ace6ee87bc8fd7ad2afc48536bad3722f198ce2f761872
2022-11-04T01:38:45.488012829Z Copying blob sha256:6483dc4da598b97363463ffda4351a39938c5dd1a7da7f7624dd57d3c6e50340
2022-11-04T01:38:45.488012829Z Copying blob sha256:c6f565b0be6987fcf58b3cc5466c25daf5b4ff9e0729a6194c4d7312577eb1a0
2022-11-04T01:38:45.488012829Z Copying blob sha256:7a51c5dbb21b520720e67b568c5da49bf1fa76af11f84ce5dcb2ae2c4e2714c1
2022-11-04T01:38:45.488012829Z Copying blob sha256:2edb48854a1856466fddbfaa009706ea3b86119977d2d87847d14dd3abe90657
2022-11-04T01:38:45.488012829Z Copying blob sha256:d0b9905f86257c13c172ba5dfef25db80eea39d6b1a5df4897b742e0a82a71ea
2022-11-04T01:38:45.488012829Z Copying config sha256:33b0fcf52b3adeb6a3ffb8d19414dc601ff893e862c1fd0f6819d7b32ccf8aad
2022-11-04T01:38:45.488012829Z Writing manifest to image destination
2022-11-04T01:38:45.488012829Z Storing signatures
2022-11-04T01:38:45.488012829Z 2022/11/04 01:37:32 info unpack layer: sha256:c719853e88efcc312969f220cd8e62ed9c46449a6bf5a7f3a3fa7dd403390aa6
2022-11-04T01:38:45.488012829Z 2022/11/04 01:37:33 info unpack layer: sha256:f12b85199b52ac3a1df407f52e4ca01b65d205852457d62d56e2504bb9db79e8
2022-11-04T01:38:45.488012829Z 2022/11/04 01:37:40 info unpack layer: sha256:5cc050ed8d38cfaa70b4510dca7867744d2c1003dc43e98413bb96ade4803d7a
2022-11-04T01:38:45.488012829Z 2022/11/04 01:37:40 info unpack layer: sha256:3ba58afa464a775d93de58a18d2a684b6a9eb3b830123c595aec9ce9277f9423
2022-11-04T01:38:45.488012829Z 2022/11/04 01:37:40 info unpack layer: sha256:3ba58afa464a775d93de58a18d2a684b6a9eb3b830123c595aec9ce9277f9423
2022-11-04T01:38:45.488012829Z 2022/11/04 01:37:40 info unpack layer: sha256:a4c191a15cf848288a39f0182b45ac9a11fdee6f6b741b3426fa3ca813888090
2022-11-04T01:38:45.488012829Z 2022/11/04 01:37:40 info unpack layer: sha256:6d16249d98c1bc9ba8f3c2cf97cb56612c89963043f6f9f702e7b9b1c3a7081a
2022-11-04T01:38:45.488012829Z 2022/11/04 01:37:40 info unpack layer: sha256:19f6b11482751f58dcde924f562d84e77925464dc00dce9b5e6daf0540441c02
2022-11-04T01:38:45.488012829Z 2022/11/04 01:37:40 info unpack layer: sha256:5fe831bf67816b7d24c7df6c4424f6b68b5499a5b56621138feba1cbeb71dc25
2022-11-04T01:38:45.488012829Z 2022/11/04 01:37:41 info unpack layer: sha256:6b53f2346bd04c68c8024bbef8b511e5aa9d59dfdd75280e338ec57786c05368
2022-11-04T01:38:45.488012829Z 2022/11/04 01:37:41 info unpack layer: sha256:70ce1999407af4e1f02c3c4a3b4c43d958f5eb8cea82d6eff244243b891f8e8a
2022-11-04T01:38:45.488012829Z INFO: Creating SIF file...
2022-11-04T01:38:45.488012829Z INFO: Build complete: /tmp/crunch-run.tordo-dz642-2jn257pktac5pds.4195713021/keep3363330801/by_uuid/tordo-4zz18-y71qluoxmymynse/image.sif
2022-11-04T01:38:45.759894753Z Starting container
2022-11-04T01:38:45.761689735Z Waiting for container to finish
2022-11-04T01:38:58.798455389Z FATAL: container creation failed: plugin type="portmap" failed (add): netplugin failed with no error message: signal: killed
2022-11-04T01:38:58.813852446Z Container exited with status code 255 (signal -1)
2022-11-04T01:38:59.011943529Z Complete
Updated by Tom Clegg about 2 years ago
I don't see any great clues here.
"signal: killed" might mean OOM while setting up the container. Perhaps 2 GB RAM is not enough for singularity to work reliably while other node-startup things are happening, and ReserveExtraRAM needs to be increased?
Updated by Peter Amstutz about 2 years ago
Tom Clegg wrote in #note-3:
I don't see any great clues here.
"signal: killed" might mean OOM while setting up the container. Perhaps 2 GB RAM is not enough for singularity to work reliably while other node-startup things are happening, and ReserveExtraRAM needs to be increased?
In theory, you just merged a feature that should be recording that information?
Updated by Tom Clegg about 2 years ago
Indeed:
2022-11-04T01:38:45.760663239Z mem 100184064 cache 2277 pgmajfault 1048678400 rss
2022-11-04T01:38:51.073166536Z procmem 778432512 arv-mount 43061248 crunch-run 289300480 keepstore
2022-11-04T01:38:55.761151799Z mem 152002560 cache 2475 pgmajfault 1208025088 rss
778432512+43061248+289300480+1208025088 = 2318819328 > 2006636k
Subtracting the requested keep_cache_ram (268435456) from arv-mount+crunch-run+keepstore, we have
(778432512-268435456)+43061248+289300480 = 842358784
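The two sums above can be rechecked with a short script. This is only a sketch: the variable names are made up for illustration, and it assumes that "2006636k" means KiB and that the last rss sample is the one that matters.

```python
# Figures copied from the crunchstat lines in this comment (all in bytes).
procmem = {
    "arv-mount": 778432512,
    "crunch-run": 43061248,
    "keepstore": 289300480,
}
container_rss = 1208025088   # last "rss" sample before the failure
keep_cache_ram = 268435456   # matches --file-cache in the arv-mount command line
node_ram = 2006636 * 1024    # assumption: "2006636k" is KiB

MiB = 1 << 20

# Everything resident on the node: helper processes plus the container itself.
total = sum(procmem.values()) + container_rss

# Helper-process memory that was not explicitly requested as Keep cache.
overhead = sum(procmem.values()) - keep_cache_ram

print("total bytes:", total, "exceeds node RAM:", total > node_ram)
print("overhead bytes:", overhead, "approx MiB:", round(overhead / MiB))
```

The overhead comes to roughly 800 MiB, well above the 256 MiB that ReserveExtraRAM sets aside by default.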
Perhaps
- Default ReserveExtraRAM should increase from 256 MiB to 550 MiB
- ChooseInstanceType should add
((NBuffers * 64 MiB) + 200 MiB) * 1.1
when LocalKeepBlobBuffersPerVCPU>0, instead of just NBuffers*64
(adding some for non-buffer memory use, and 10% for GOGC=10)
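As a rough illustration of the second proposal, the formula can be written out as below. The function name is hypothetical and this is not the actual ChooseInstanceType code; the number of buffers in use on the failing node is also not stated in this ticket, so the loop values are arbitrary examples.

```python
MiB = 1 << 20


def keepstore_ram_estimate(n_buffers: int) -> int:
    """Hypothetical helper: bytes to reserve for a local keepstore process.

    Implements ((NBuffers * 64 MiB) + 200 MiB) * 1.1 using integer arithmetic.
    """
    buffers = n_buffers * 64 * MiB   # 64 MiB per block buffer
    fixed = 200 * MiB                # non-buffer memory use
    return (buffers + fixed) * 11 // 10  # add 10%, per the GOGC=10 note


for n in (1, 2, 4):
    old = n * 64 * MiB  # the current "just NBuffers*64" allowance
    new = keepstore_ram_estimate(n)
    print(f"buffers={n} old={old} new={new} extra_MiB={(new - old) // MiB}")
```

For a single buffer the estimate is about 290 MiB, which is in the same range as the 289300480 bytes that keepstore reported in the procmem line.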
Updated by Tom Clegg about 2 years ago
- Status changed from New to In Progress
- Assigned To set to Tom Clegg
19702-memory-overhead @ 9cd2fc2cd84000e706d73d1ff8316ce46b1be54d -- developer-run-tests: #3359
Updated by Peter Amstutz about 2 years ago
Tom Clegg wrote in #note-6:
19702-memory-overhead @ 9cd2fc2cd84000e706d73d1ff8316ce46b1be54d -- developer-run-tests: #3359
I'm wondering if there's something about the conversion from Docker to SIF that is leaving arv-mount with a larger than normal footprint.
Keepstore having 200 MiB of overhead before accounting for buffers seems high. Although the numbers are the numbers.
Does this mean we can't run on 2 GiB nodes any more?
Otherwise this LGTM.
Updated by Tom Clegg about 2 years ago
I agree, we should be able to make those numbers lower.
Does this mean we can't run on 2 GiB nodes any more?
I suppose so, if the container requests more than 1 GiB of RAM + arv-mount cache.
Updated by Tom Clegg about 2 years ago
- Status changed from In Progress to Resolved
Applied in changeset arvados|8a0527fcb8948720a873aa35fd1800076c3859a2.