Bug #17175
closedarv-boot sometimes fail to start (ie: WB2 integration tests)
100%
Description
From time to time arv-boot fails to start, producing the following output in the case of wb2 integration tests pipelines:
... 18:37:19 sudo is already the newest version (1.8.27-1+deb10u2). 18:37:19 wget is already the newest version (1.20.1-1.1). 18:37:19 xvfb is already the newest version (2:1.20.4-1+deb10u1). 18:37:19 zlib1g-dev is already the newest version (1:1.2.11.dfsg-1). 18:37:19 ca-certificates is already the newest version (20200601~deb10u1). 18:37:20 0 upgraded, 0 newly installed, 0 to remove and 57 not upgraded. 18:37:20 INFO[2020-12-01T21:37:20.117040482Z] ruby 2.5.7 already installed 18:37:20 INFO[2020-12-01T21:37:20.122311238Z] go 1.14 already installed 18:37:20 INFO[2020-12-01T21:37:20.154868995Z] phantomjs 1.9.8 already installed 18:37:20 INFO[2020-12-01T21:37:20.156001403Z] geckodriver 0.24.0 already installed 18:37:20 INFO[2020-12-01T21:37:20.159528791Z] nodejs v8.15.1 already installed 18:37:20 INFO[2020-12-01T21:37:20.884538246Z] gradle 5.3.1 already installed 18:37:20 INFO[2020-12-01T21:37:20.886043239Z] locale en_US.UTF-8 already installed 18:37:20 INFO[2020-12-01T21:37:20.920794537Z] postgresql cluster 11-main is down; trying to start 18:37:20 2020-12-01 21:37:20.996 UTC [792] LOG: listening on IPv4 address "127.0.0.1", port 5432 18:37:20 2020-12-01 21:37:20.996 UTC [792] LOG: could not bind IPv6 address "::1": Cannot assign requested address 18:37:20 2020-12-01 21:37:20.996 UTC [792] HINT: Is another postmaster already running on port 5432? If not, wait a few seconds and retry. 18:37:21 2020-12-01 21:37:21.001 UTC [792] LOG: listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432" 18:37:21 2020-12-01 21:37:21.020 UTC [795] LOG: database system was shut down at 2020-09-28 16:47:32 UTC 18:37:21 2020-12-01 21:37:21.047 UTC [792] LOG: database system is ready to accept connections 18:37:22 2020-12-01 21:37:22.009 UTC [803] root@root FATAL: role "root" does not exist 18:37:22 INFO[2020-12-01T21:37:22.362835777Z] postgresql supports collation en_US 18:37:22 INFO[2020-12-01T21:37:22.423076600Z] postgresql supports collation en_US.UTF-8 18:37:22 INFO[2020-12-01T21:37:22.503320541Z] arvados role exists; superuser privileges added, password updated 18:37:22 INFO[2020-12-01T21:37:22.503416234Z] sent SIGTERM; waiting for postgres to shut down 18:37:22 2020-12-01 21:37:22.503 UTC [792] LOG: received smart shutdown request 18:37:22 2020-12-01 21:37:22.508 UTC [792] LOG: background worker "logical replication launcher" (PID 801) exited with exit code 1 18:37:22 2020-12-01 21:37:22.511 UTC [796] LOG: shutting down 18:37:22 2020-12-01 21:37:22.577 UTC [792] LOG: database system is shut down 18:37:22 Launching arvados in test mode... 18:39:58 + set +e +o pipefail 18:39:58 + kill 834 18:39:58 + wait 834 18:39:58 + true 18:39:58 + '[' 1 -eq 1 ']' 18:39:58 + rm -rf /tmp/tmp.1sNVIKdaCS 18:39:59 + echo done 18:39:59 done 18:39:59 make: *** [Makefile:67: integration-tests] Error 1 18:39:59 Build step 'Execute shell' marked build as failure 18:39:59 Archiving artifacts 18:39:59 An attempt to send an e-mail to empty list of recipients, ignored. 18:39:59 Started calculate disk usage of build 18:39:59 Finished Calculation of disk usage of build in 0 seconds 18:39:59 Started calculate disk usage of workspace 18:40:00 Finished Calculation of disk usage of workspace in 1 second 18:40:00 Finished: FAILURE
Attached is the arvados.log
file from one of those failures when running the tests locally.
Files
Updated by Lucas Di Pentima about 4 years ago
- Target version changed from 2020-12-16 Sprint to 2021-01-06 Sprint
Updated by Tom Clegg almost 4 years ago
- Status changed from New to In Progress
- Assigned To changed from Lucas Di Pentima to Tom Clegg
17175-boot-fail-rake-seed @ e4a6240bf2ca163c36a4357ef0f80958c40bd39a -- developer-run-tests: #2237
If the rake task and the app start concurrently, they both try to set up the database seeds. The exclusive-locking mechanism in services/api/app/lib/current_api_client.rb doesn't cover this case because the rake task and app don't use the same cache. If the app commits the database seeds between the rake task's "check existing" and "commit" steps, commit fails and the rake task aborts.
(also included a fix for the logging labels that were showing the first argument ("-l" or "-D") instead of the program name ("initdb" or "postgres") in the setuidgid-wrapper case)
Updated by Tom Clegg almost 4 years ago
The database errors in the issue description are normal -- it seems we have a number of "FATAL" logs that are actually harmless -- but the attached arvados-failed.log
file shows [what I think is] the rake-vs.-app race problem.
Updated by Lucas Di Pentima almost 4 years ago
The fix LGTM, I was suffering around 50% of failed test runs on lib/controller
but after merging your branch the failures are gone. Thanks!
Updated by Tom Clegg almost 4 years ago
- Status changed from In Progress to Resolved
Applied in changeset arvados|1ce3429c4005567a450a28f774e3f543010cd5c6.