Feature #23075
closedRun CWL integration tests from run-tests.sh
Description
The goal is you should be able to run the following from run-tests.sh:
test sdk/cwl tests/test_conformance.py::test_conformance_1_2
and it should pass. Similarly for other conformance and integration tests.
If needed, it is okay to add a separate subcommand for this, something like test sdk/cwl_integration, that does dedicated setup. It still needs to be able to take arguments for test selection. Avoid using the word "conformance" in this name because that's too limited. In the CWL context, "integration" tests are those that require a full cluster+dispatcher, and "conformance" tests are a strict subset of that, running the officially provided tests. (I am open to changing the "integration" name to something better.)
This will require extending run_test_server.py to start crunch-dispatch-local as part of the test cluster.
Files
Updated by Brett Smith 8 months ago
- Related to Feature #23006: Port CWL integration tests to pytest added
Updated by Brett Smith 8 months ago
- Target version set to Development 2025-08-21
Updated by Brett Smith 7 months ago
- Assigned To changed from Brett Smith to Tom Clegg
Updated by Tom Clegg 7 months ago
- 1.0 - 5 tests passed, 192 failures, 0 unsupported features
- 1.1 - 18 tests passed, 234 failures, 0 unsupported features
- 1.2 - 41 tests passed, 337 failures, 0 unsupported features
I wish I knew how to make the test suite show any progress on the fly, rather than just taking many minutes/hours to fail conformance_1_0 and then show details about individual tests. I'm using test sdk/cwl -v -m "cwl_conformance" and tried adding --verbose to run_cwltest() invocation in source:sdk/cwl/tests/__init__.py but that doesn't do it.
I think I will add a profiling/debug server to crunch-dispatch-local just so we can do the "wait for port to listen" thing in run_test_servers.py instead of "wait for it to log something without immediately exiting", which is working now but is quite awful.
Updated by Tom Clegg 7 months ago
progress:
23075-cwl-conformance-tests @ 23c6a46c8f56e4570422f2dd0f5bcc09194d494d -- developer-run-tests: #4852
Updated by Tom Clegg 7 months ago
23075-cwl-conformance-tests @ d0d743885b40045a38875c54ab831a5e888024ec -- developer-run-tests: #4853
- All agreed upon points are implemented / addressed. Describe changes from pre-implementation design.
- ✅ You can now run the sdk/cwl integration tests from run-tests.sh
- Anything not implemented (discovered or discussed during work) has a follow-up story.
- n/a
- Code is tested and passing, both automated and manual, what manual testing was done is described.
- ✅ existing automated tests pass locally (Jenkins doesn't run them)
- The tested code incorporates recent main branch changes.
- ✅
- New or changed UI/UX has gotten feedback from stakeholders.
- n/a
- Documentation has been updated.
- n/a
- Behaves appropriately at the intended scale (describe intended scale).
- n/a
- Considered backwards and forwards compatibility issues between client and server.
- n/a
- Follows our coding standards and GUI style guidelines.
- ✅
- run_test_server.py has start_dispatch and stop_dispatch commands
- run_test_server.py has a database_reset command (to clean up after the dispatcher has tried to run the test fixtures)
- run-tests.sh brings up keep and dispatch servers while running sdk/cwl tests, and shuts them down and resets the database afterward
- run_test_server.py integration test config always runs crunch-run in "host" networking mode (otherwise containers can't connect to controller at 127.0.0.1)
- crunch-dispatch-local doesn't deadlock when detecting a container that the host doesn't have enough resources to run
- crunch-dispatch-local doesn't enter a spin loop when the dispatch library closes the status channel to signal that it should terminate the container
- crunch-dispatch-local logs its PID (helpful when you're unintentionally running multiple instances of it)
- crunch-dispatch-local propagates ARVADOS_API_HOST_INSECURE to crunch-run
- arv-put doesn't crash when sys.stderr is a tee
I thought about adding an optional debug/profiling server to crunch-dispatch-local, but instead I made run_test_server.py wait for it to log something. I now regret my decision, but now that I did it, it's quicker to leave it that way. Should fix?
What next? test sdk/cwl -v --capture=tee-sys -m integration ... Test [197/197] schema-def_anonymous_enum_in_array: Test an anonymous enum inside an array inside a record, SchemaDefRequirement All tests passed PASSED [ 10%] Test [1/253] cl_basic_generation: General test of command line generation ... Test [253/253] optional_numerical_output_returns_0_not_null: Test that optional number output is returned as 0, not null 250 tests passed, 2 unsupported features PASSED [ 20%] Test [1/378] cl_basic_generation: General test of command line generation ... Test [378/378] paramref_arguments_inputs: confirm that $inputs is available to parameter references in arguments 376 tests passed, 2 unsupported features PASSED [ 30%] tests/test_copy_deps.py::test_create PASSED [ 40%] tests/test_copy_deps.py::test_update PASSED [ 50%] ... Test [1/54] Test directory in keep Test [2/54] Test directory in keep Test [3/54] Test default directory in keep ... Test [53/54] test same file appearing in output of both Directory and array Test [54/54] test bug 22466 All tests passed PASSED [ 80%] tests/test_integration.py::test_set_properties_17004 ... PASSED [ 90%] tests/test_integration.py::test_fix_workflow_18888 ... PASSED [100%] ====================================================================== 10 passed, 128 deselected in 7905.87s (2:11:45) ======================================================================= ======= test sdk/cwl -- 7907s Sent SIGTERM to 2091048 (/home/tom/arvados/tmp/keep0.pid) Sent SIGTERM to 2091067 (/home/tom/arvados/tmp/keep1.pid) Sent SIGTERM to 2091495 (/home/tom/arvados/tmp/dispatch.pid) Pass: sdk/cwl install (9s) Pass: sdk/cwl tests (7907s) All test suites passed.
Updated by Brett Smith 7 months ago
23075-cwl-conformance-tests @ d0d743885b40045a38875c54ab831a5e888024ec -- developer-run-tests: #4853
I'm very glad I had you work on this branch. I think I would've had a much harder time finding and fixing the crunch-dispatch-local bugs that got resolved as part of this. Thank you.
So for context: Jenkins did until recently run the CWL conformance test suite. This job ran after pushes to main after we built and uploaded an arvbox Docker image. It only stopped running recently when we stopped building the arvbox image after #20311. But I would like to have it running and passing again before a 3.2.0 release.
It's currently written on top of test_with_arvbox.sh. I think we should update this job to call run-tests.sh, and then we should test your branch against it. Partly I think this because I am getting some failures locally, so I would like to see this branch passing in the cleanest possible environment to rule out the possibility this a problem with my local development box.
Does that make sense and sound good to you? If so, I can update the Jenkins job and run a test, that's straightforward. We will need to figure out where we want to reintroduce it in the Jenkins pipeline but that doesn't have to happen immediately and it's easy enough to change our minds later.
My only substantive comments about the code are:
- To remove code we don't use, please remove the
INSIDE_ARVBOXconstant and associated conditionals from the CWL test. - In
run_test_server.py, callingsubprocess.check_outputwithout capturing the output is a little weird. Prefersubprocess.run(check=True). Then you can removestderr=sys.stderrbecause piping is the default behavior for all stdio fds. (If you want to still suppress stdout, saystdout=subprocess.DEVNULL.)
I thought about adding an optional debug/profiling server to crunch-dispatch-local, but instead I made run_test_server.py wait for it to log something. I now regret my decision, but now that I did it, it's quicker to leave it that way. Should fix?
Longer-term I am still interested in standardizing on arvados-server boot and getting rid of run_test_server.py (and having the cloud loopback driver replace crunch-dispatch-local). I felt like it was worth investing a little in run_test_server.py to unblock removing arvbox from 3.2.0. I won't stop you if you feel strongly but I'm inclined to say it's not worth investing any more effort into it.
run_test_server.py integration test config always runs crunch-run in "host" networking mode (otherwise containers can't connect to controller at 127.0.0.1)
Similarly, this is fine for now, but longer-term I would like our test cluster(s) to follow the strategy outlined in the controller login tests: the test scaffolding should make an explicit Docker network, and services should listen on the host's interface for that network.
Thanks.
Updated by Tom Clegg 7 months ago
Brett Smith wrote in #note-13:
[...] I think we should update this job to call
run-tests.sh, and then we should test your branch against it.Does that make sense and sound good to you?
Yes please.
- To remove code we don't use, please remove the
INSIDE_ARVBOXconstant and associated conditionals from the CWL test.
Removed.
- In
run_test_server.py, callingsubprocess.check_outputwithout capturing the output is a little weird. Prefersubprocess.run(check=True). Then you can removestderr=sys.stderrbecause piping is the default behavior for all stdio fds. (If you want to still suppress stdout, saystdout=subprocess.DEVNULL.)
Ah yes, that sounds better. Fixed.
I thought about adding an optional debug/profiling server to crunch-dispatch-local, but instead I made run_test_server.py wait for it to log something. I now regret my decision, but now that I did it, it's quicker to leave it that way. Should fix?
Longer-term I am still interested in standardizing on
arvados-server bootand getting rid ofrun_test_server.py(and having the cloud loopback driver replacecrunch-dispatch-local). I felt like it was worth investing a little inrun_test_server.pyto unblock removing arvbox from 3.2.0. I won't stop you if you feel strongly but I'm inclined to say it's not worth investing any more effort into it.
I've done this because it's less code in run_test_server.py. I took the opportunity to add systemd-notify support in crunch-dispatch-local because that was also 2 lines.
(If crunch-dispatch-local weren't a dead end itself, I'd convert it to an arvados-server subcommand, and then it would get a -pprof option automatically... but it is)
run_test_server.py integration test config always runs crunch-run in "host" networking mode (otherwise containers can't connect to controller at 127.0.0.1)
Similarly, this is fine for now, but longer-term I would like our test cluster(s) to follow the strategy outlined in the controller login tests: the test scaffolding should make an explicit Docker network, and services should listen on the host's interface for that network.
Right. I forgot to mention: this just preserves a workaround these tests were already relying on. Arvbox always uses host networking mode. Git history (50dba76531ad4fdd5ab79598aa192dd3bb1be314) indicates that started in 2017 as a way to avoid some unrelated kernel-dependent issue and just stayed that way.
23075-cwl-conformance-tests @ f2b33b208de0dc6f4d8420d422ce8b8239df288e -- developer-run-tests: #4855
Updated by Brett Smith 7 months ago
Tom Clegg wrote in #note-14:
Brett Smith wrote in #note-13:
[...] I think we should update this job to call
run-tests.sh, and then we should test your branch against it.Does that make sense and sound good to you?
Yes please.
In progress at https://ci.arvados.org/job/arvados-cwl-conformance-tests/ - I at least got it to the point where it looks like it's starting the tests correctly. If it passes then this all LGTM.
Updated by Tom Clegg 7 months ago
The problem was incorrectly assuming ${SRCDIR}/tmp/GOPATH/bin was in PATH. That's true if run-tests.sh is run with an empty GOPATH and --temp=${WORKSPACE}/tmp, but neither of those is necessarily true.
Fixed by building the crunch-run and crunch-dispatch-local binaries in ${SRCDIR}/tmp/ and explicitly running those ones instead of relying on PATH to find them.
- preserve tmp/*.log artifacts like we do elsewhere, so I could see the crunch-dispatch-local logs, which revealed the problem
- pass
--capture=tee-sys, which made it possible to see progress instead of waiting for every single test to time out before getting any console logs
Any reason not to leave both in place?
Updated by Tom Clegg 7 months ago
Investigating the above failure, which seems to show crunch-run occasionally not exiting.
... zzzzz-dz642-a2osw5c701fi27l 2025-08-20T20:04:51.408492619Z Waiting for container to finish ... (other container starts and finishes) ... zzzzz-dz642-ap90ayjdx8ly583 2025-08-20T20:05:09.110077444Z Waiting for container to finish ... (other container starts and finishes) ... zzzzz-dz642-2mui028vs5zbh8z 2025-08-20T20:05:20.839664845Z Waiting for container to finish ... (no more messages with any of these three UUIDs) ... {"ClusterID":"zzzzz","PID":8452,"level":"info","msg":"Insufficient resources to start zzzzz-dz642-wtsv4gawl8js60a, waiting for next event","time":"2025-08-20T20:05:27.005430244Z"} ... (repeat "insufficient resources" until everything times out) ... zzzzz-dz642-a2osw5c701fi27l 2025-08-20T21:04:41.179438608Z runTunnel: EOF zzzzz-dz642-ap90ayjdx8ly583 2025-08-20T21:04:41.179470048Z runTunnel: EOF zzzzz-dz642-2mui028vs5zbh8z 2025-08-20T21:04:41.179511608Z runTunnel: EOF
Updated by Brett Smith 7 months ago
Thank you for picking this up and driving it forward while I was out, I really appreciate that.
Tom Clegg wrote in #note-16:
The problem was incorrectly assuming
${SRCDIR}/tmp/GOPATH/binwas in PATH. That's true if run-tests.sh is run with an empty GOPATH and--temp=${WORKSPACE}/tmp, but neither of those is necessarily true.
I think on Jenkins it cannot be true, at least not without major changes to our setup. We build an initial --temp dir at the time we build the worker image to save time on every test. But at that time we are running outside Jenkins, so we are not using the $WORKSPACE Jenkins will use, and there's no way to predict where that will be.
While troubleshooting I updated the Jenkins config to
- preserve tmp/*.log artifacts like we do elsewhere, so I could see the crunch-dispatch-local logs, which revealed the problem
- pass
--capture=tee-sys, which made it possible to see progress instead of waiting for every single test to time out before getting any console logsAny reason not to leave both in place?
Not that I can think of, both of those are fine by me. I thought about setting --capture myself but I wanted to be optimistic and just assume the tests would pass without being watched.
Updated by Tom Clegg 7 months ago
23075-cwl-conformance-tests @ 7463fd2ee09785ea3856297d854f12185d8d3f47 -- developer-run-tests: #4856
Updated by Brett Smith 7 months ago
- Target version changed from Development 2025-08-21 to Development 2025-09-03
Updated by Brett Smith 7 months ago
Tom Clegg wrote in #note-19:
23075-cwl-conformance-tests @ 7463fd2ee09785ea3856297d854f12185d8d3f47 -- developer-run-tests: #4856
LGTM, but of course happy to look again if you make further changes based on further test results.
Updated by Brett Smith 7 months ago
Tom Clegg wrote in #note-16:
The problem was incorrectly assuming
${SRCDIR}/tmp/GOPATH/binwas in PATH. That's true if run-tests.sh is run with an empty GOPATH and--temp=${WORKSPACE}/tmp, but neither of those is necessarily true.
Wait hold on. Why are we even doing this build? run-tests.sh normally starts by installing everything, so all the binaries we need should already be in $tempdir/GOPATH/bin. Right? Why don't we just use those?
Yes if you're running interactively, or using --only-install, then you can create a situation where you're not testing what you think because of inconsistent versions of things. But this is basically universally true of all our tests, I'm not sure we should write special code to prevent it in one specific case. My understanding is that our philosophy around this is, it's the developer's responsibility to make sure they re-install dependent components when they're testing this way.
Updated by Tom Clegg 7 months ago
- Status changed from In Progress to Resolved
Applied in changeset arvados|837251f282a79a22aaf8f75bac20946c763bda2a.