Bug #9552
closed[CWL] arvados-cwl-runner runs too slowly when reusing jobs
75%
Description
arvados-cwl-runner --local --enable-reuse
takes 1 hour 24 minutes: https://workbench.qr1hi.arvadosapi.com/pipeline_instances/qr1hi-d1hrv-dwpq8wwvf89s6cu
- First step (merge-fastqs): 17m8s (output size: 2 files totalling 15.1 GiB)
- Second step (star): 1h7m6s (output size: 6 files totalling 58.6 GiB)
- Third step (salmon): 2s (output size: 3 files totalling 4.7 MiB)
Updated by Brett Smith over 8 years ago
- Story points set to 1.0
This story is timeboxed for this sprint at 2 days. Spend up to that much time diagnosing and fixing the bug. At that time, let's reassess.
Updated by Tom Clegg over 8 years ago
- Category set to Performance
- Assigned To set to Tom Clegg
Updated by Radhika Chippada over 8 years ago
- Assigned To changed from Tom Clegg to Radhika Chippada
Updated by Sarah Guthrie over 8 years ago
The CWL which exhibited this behavior can be found in the git repository on qr1hi: sguthrie/unc under the "cwl" directory.
The command, run in unc/cwl in an environment with ARVADOS_API_TOKEN and ARVADOS_API_HOST set and arvados-cwl-runner installed, is below:
arvados-cwl-runner --local --enable-reuse merge-star-salmon.cwl basic_test.json
Let me know if you need anything else to reproduce the bug
Updated by Radhika Chippada over 8 years ago
- Status changed from New to In Progress
The time being spent is all happening in our end.
sdk/cwl/arvados_cwl/arvjob.py -> ArvadosJob.done method is parsing through each line of the log to determine the tmpdir, outdir, keepdir using tmpdirre, outdirre, keepre resp, which is taking all the extra time after “finding” the job to be reused. This information is then passed to done.py -> done method, which is using this to set builder.outdir and keepdir.
self.builder.outdir = outdir self.builder.pathmapper.keepdir = keepdir return self.collect_outputs("keep:" + record["output"])
For the merge-fastqs job with log 263kb size, it took 11 mins to process the log lines.
2016-07-11 13:16:52 arvados.cwl-runner[33584] INFO: START 2016-07-11 13:16:55 arvados.cwl-runner[33584] INFO: GOT LOG = [u'qr1hi-8i9sb-a4p07cjcfg3906t.log.txt'] 2016-07-11 13:27:11 arvados.cwl-runner[33584] INFO: DONE GETTING OUTPUTS
This seems like the most inefficient way of handling the need to set outdir and keepdir. Should this information be stored as metadata on the job record (using properties) by crunch itself? And on older jobs where this information may not be available, set this on first “reuse” to save time in all future reuses?
Updated by Tom Clegg over 8 years ago
Parsing 269KB of log entries is inefficient, but not inefficient enough to explain 11 minutes. Is there something wrong with the regexps? With the Collection class?
Updated by Radhika Chippada over 8 years ago
I reran the test with the regexp parsing lines commented out completely, thus leaving only the log line reading from Collection in place. This again took 10m to go though the log lines (but doing nothing with each of the lines) in the log Collection. Thus, it appears that the issue is with reading the lines from the log Collection.
2016-07-11 15:58:01 arvados.cwl-runner[58009] INFO: GOT LOG = [u'qr1hi-8i9sb-a4p07cjcfg3906t.log.txt'] 2016-07-11 16:08:40 arvados.cwl-runner[58009] INFO: DONE GETTING OUTPUTS
Updated by Tom Clegg over 8 years ago
Radhika Chippada wrote:
I reran the test with the regexp parsing lines commented out completely, thus leaving only the log line reading from Collection in place. This again took 10m to go though the log lines (but doing nothing with each of the lines) in the log Collection. Thus, it appears that the issue is with reading the lines from the log Collection.
That narrows it down. Have you been able to reproduce it outside CWL?
This test script runs in less than one second on shell.qr1hi, using python-arvados-python-client 0.1.20160517202250-2:
import arvados
c=arvados.collection.Collection("2a3d13d2aac3242ac0b92a91b6875c58+87")
f=c.open("qr1hi-8i9sb-a4p07cjcfg3906t.log.txt")
for x in f:
print x,
Perhaps something breaks the Python SDK's read cache in this context, and every iteration of "for l in log" re-fetches the entire file from the network? That would increase network latency by 1640×, which is the order of magnitude of the slowdown we're seeing.
FWIW I also downloaded the log from afar (browser via keep-web) and although I didn't time it, it was somewhere around 1s, certainly nowhere near 10m.
Updated by Radhika Chippada over 8 years ago
Sorry for the misinformation, but the issue is neither with the regexp parsing nor with the Collection class.
It is taking those 10+ minutes in collect_outputs call in done.py -> done method
return self.collect_outputs("keep:" + record["output"])
It appears that this is from the cwl tool library?
Updated by Peter Amstutz over 8 years ago
cwltool/draft2tool.py L412:
with builder.fs_access.open(files["location"], "rb") as f: contents = f.read(CONTENT_LIMIT) if binding.get("loadContents"): files["contents"] = contents filesize = 0 while contents != "": checksum.update(contents) filesize += len(contents) contents = f.read(1024*1024)
Add a flag to skip the loop while contents != "":
the make a pull requests against the cwltool
repository.
Updated by Radhika Chippada over 8 years ago
Made PR for cwltool update: https://github.com/common-workflow-language/cwltool/pull/127
Updated by Radhika Chippada over 8 years ago
PR 127 had been merged into cwltool master.
Updated by Radhika Chippada over 8 years ago
- Target version changed from 2016-07-20 sprint to 2016-08-03 sprint
- Story points changed from 1.0 to 0.5
Updated by Radhika Chippada over 8 years ago
- Updated cwltool version we are using and now we have compete_checksum flag available
- Reran Sally's job in qr1hi (pointing to this branch) and the pipeline instance completed within 15 seconds for the first job. (The pipeline on the whole fails for me in my env for the second job, but we can prove that the job reuse has happened for the first job). https://workbench.qr1hi.arvadosapi.com/pipeline_instances/qr1hi-d1hrv-tu3ac1kf556zkex
Updated by Radhika Chippada over 8 years ago
- Status changed from In Progress to Resolved
- % Done changed from 50 to 100
Applied in changeset arvados|commit:4c98448b0966aefbf23b92f3da3a9278d2b73cab.
Updated by Radhika Chippada over 8 years ago
- Status changed from Resolved to In Progress
Reopening this because in our jenkins environment we run the conformance tests and these fail without checksum computing.
Updated by Radhika Chippada over 8 years ago
Added compute-checksum flag to arvados-cwl-runn at cfa7968
Updated by Tom Clegg over 8 years ago
(More background)
I understand the checksum is supposed to be optional, but the conformance test suite doesn't know this yet.
Meanwhile, a command line flag plus a wrapper script should make it possible to pass conformance tests.
#!/bin/sh exec arvados-cwl-runner --compute-checksum "$@"
Does cfa7968 work? It looks like this command line flag won't make a difference, because arvExecutor() says kwargs["compute_checksum"]=False
. And arvExecutor already gets all the command line flags. So, instead of adding a new argument to __init__()
and a new instance variable, can we just change that line in arvExecutor to say kwargs["compute_checksum"]=kwargs.get("compute_checksum")
...?
Updated by Tom Clegg over 8 years ago
The command line flag doesn't seem to be effective...
Test [6/73] Test failed: /usr/bin/cwl-runner --outdir=/tmp/tmpJro1gw --quiet v1.0/template-tool.cwl v1.0/cat-job.json Test CreateFileRequirement ExpressionEngineRequirement.engineConfig feature expected output object { "foo": { "checksum": "sha1$63da67422622fbf9251a046d7a34b7ea0fd4fead", "class": "File", "location": "foo.txt", "size": 22 } } got { "foo": { "basename": "foo.txt", "class": "File", "location": "keep:b3e162ad900be8b7c15ab5552e7c1c1c+112/foo.txt", "size": 22 } } Compare failure expected { "checksum": "sha1$63da67422622fbf9251a046d7a34b7ea0fd4fead", "class": "File", "size": 22 } got { "class": "File", "size": 22 }
Updated by Tom Clegg over 8 years ago
I think what we're missing is a mechanism to pass the compute_checksum flag to the Arvados job that runs the tool. The only case where we propagate kwargs["compute_checksum"] here is the one where we execute a CommandLineTool locally. The conformance test is using the Crunch1 work API, though, so it hits the RunnerJob line:
if self.work_api == "containers":
if tool.tool["class"] == "CommandLineTool":
runnerjob = tool.job(job_order,
self.output_callback,
**kwargs).next()
else:
runnerjob = RunnerContainer(self, tool, job_order, kwargs.get("enable_reuse"))
else:
runnerjob = RunnerJob(self, tool, job_order, kwargs.get("enable_reuse"))
https://github.com/curoverse/arvados/blob/master/sdk/cwl/arvados_cwl/__init__.py#L151
Updated by Radhika Chippada over 8 years ago
Tom, thanks for the hint.
I used crunch2 to run the tests, and most tests passed.
exec arvados-cwl-runner --compute-checksum --api containers "$@"
The following tests failed and the rest (including, test 70) passed:
Test [36/73] v1.0/env-job.json Test [37/73] v1.0/env-job.json Test [39/73] v1.0/revsort-job.json Test [45/73] v1.0/schemadef-job.json Test [64/73] v1.0/conflict-job.json
Would it be possible to run the tests using crunch2 in c97qk?
If this env is not ready for crunch2, would it be possible to (1) disable this jenkins job until that time, (2) run just the first 5 or so of the conformance tests (which can only ensure that the environment is setup properly)?
If none of these options are acceptable, please let me know what updates need to be made to RunnerJob. (preferably make this update in a separate ticket ...). We probably also need to create another ticket to see why those 5 tests are failing in crunch2 and fix any issues.
Updated by Radhika Chippada over 8 years ago
- Target version changed from 2016-08-03 sprint to 2016-08-17 sprint
Updated by Peter Amstutz over 8 years ago
Here's a thought, instead of passing through the --compute-checksum flag to the runner job, there's no reason that the client can't check if the checksums are present, and if not, compute the checksum itself.
Updated by Peter Amstutz over 8 years ago
Implemented branch 9552-compute-checksum-final-output which fills in the checksums before returning the final output object which --compute-checksum is provided.
Updated by Tom Clegg over 8 years ago
9552-compute-checksum-final-output LGTM, thanks!
Updated by Peter Amstutz over 8 years ago
- Status changed from In Progress to Resolved