Bug #9773: [CWL] It takes 1.25 hours to re-use 300 jobs
Status: Closed
% Done: 100%
Description
For example: https://workbench.qr1hi.arvadosapi.com/pipeline_instances/qr1hi-d1hrv-nkdgmvy6d0tzgbi
The pipeline instance was created at 3:52pm 8/10/16
The variant_effect_predictor job started at 5:06pm 8/10/16
I would like it a lot faster :)
Updated by Bryan Cosca over 8 years ago
It seems to take 3-4 seconds to reuse a job for this pipeline though...
arvados-cwl-runner --local --verbose bclfastq-wf.cwl bclfastq.yml
/home/bcosc/venv-test3/bin/arvados-cwl-runner 1.0.20160811221813, arvados-python-client 0.1.20160727172639, cwltool 1.0.20160811184335
2016-08-18 13:56:54 arvados.cwl-runner[63497] INFO: Pipeline instance qr1hi-d1hrv-tf8jy3x3hbbwepx
2016-08-18 13:56:55 arvados.arv-run[63497] INFO: Upload local files: "bclfastq2.sh"
2016-08-18 13:56:56 arvados.arv-run[63497] INFO: Using collection qr1hi-4zz18-20vkqqo8c5r02oz
2016-08-18 13:56:58 arvados.cwl-runner[63497] INFO: Job bcl2fastq (qr1hi-8i9sb-cqxav9rgw0jnvtz) is Complete
[step bcl2fastq] completion status is success
2016-08-18 13:57:00 arvados.arv-run[63497] INFO: Upload local files: "move-fastqs-to-dir.py"
2016-08-18 13:57:00 arvados.arv-run[63497] INFO: Uploaded to qr1hi-4zz18-4ohl74z33bppze4
2016-08-18 13:57:02 arvados.cwl-runner[63497] INFO: Job movetodir (qr1hi-8i9sb-jmscypd7qfrbgiv) is Complete
[step movetodir] completion status is success
2016-08-18 13:57:05 arvados.cwl-runner[63497] INFO: Job archerdemux (qr1hi-8i9sb-qux4i851f0l3ybf) is Complete
2016-08-18 13:57:09 arvados.cwl-runner[63497] INFO: Job archerdemux_2 (qr1hi-8i9sb-m2ig7cr4ibuu92j) is Complete
2016-08-18 13:57:12 arvados.cwl-runner[63497] INFO: Job archerdemux_3 (qr1hi-8i9sb-97cghd8jw3o57he) is Complete
2016-08-18 13:57:16 arvados.cwl-runner[63497] INFO: Job archerdemux_4 (qr1hi-8i9sb-ja4llmwtd4rk8ot) is Complete
2016-08-18 13:57:20 arvados.cwl-runner[63497] INFO: Job archerdemux_5 (qr1hi-8i9sb-kynvdn4zf68arpw) is Complete
2016-08-18 13:57:23 arvados.cwl-runner[63497] INFO: Job archerdemux_6 (qr1hi-8i9sb-sng1ilw8ylvp3cz) is Complete
2016-08-18 13:57:27 arvados.cwl-runner[63497] INFO: Job archerdemux_7 (qr1hi-8i9sb-on80m1ollu8w5m0) is Complete
This one has 150 jobs, so I'm still waiting roughly 10 minutes on each re-run...
Updated by Bryan Cosca over 8 years ago
Maybe the job reuse lookups could happen in parallel? Is that how Arvados pipeline templates work, but not CWL?
Updated by Bryan Cosca over 8 years ago
I think it's increasing over time...
(from the same instance as note #1)
unner[63497] INFO: Job archerdemux_120 (qr1hi-8i9sb-58y47df4c3lvh6x) is Complete
2016-08-18 14:07:16 arvados.cwl-runner[63497] INFO: Job archerdemux_121 (qr1hi-8i9sb-6xxojux716h7r4v) is Complete
2016-08-18 14:07:23 arvados.cwl-runner[63497] INFO: Job archerdemux_122 (qr1hi-8i9sb-6gth0en6kxgbx13) is Complete
2016-08-18 14:07:29 arvados.cwl-runner[63497] INFO: Job archerdemux_123 (qr1hi-8i9sb-y8hnjx5npiap30q) is Complete
2016-08-18 14:07:36 arvados.cwl-runner[63497] INFO: Job archerdemux_124 (qr1hi-8i9sb-5nmvg9cs0ujtxdk) is Complete
2016-08-18 14:07:43 arvados.cwl-runner[63497] INFO: Job archerdemux_125 (qr1hi-8i9sb-x6plniqy53tpnpo) is Complete
2016-08-18 14:07:50 arvados.cwl-runner[63497] INFO: Job archerdemux_126 (qr1hi-8i9sb-ona1snr3ypgsg1y) is Complete
2016-08-18 14:07:56 arvados.cwl-runner[63497] INFO: Job archerdemux_127 (qr1hi-8i9sb-4sp08zcrc7hwmtm) is Complete
2016-08-18 14:08:03 arvados.cwl-runner[63497] INFO: Job archerdemux_128 (qr1hi-8i9sb-n5xrbjgknz8wyzc) is Complete
2016-08-18 14:08:10 arvados.cwl-runner[63497] INFO: Job archerdemux_129 (qr1hi-8i9sb-2a6x8l07dy6vold) is Complete
2016-08-18 14:08:16 arvados.cwl-runner[63497] INFO: Job archerdemux_130 (qr1hi-8i9sb-8gq459ailvr46ns) is Complete
2016-08-18 14:08:23 arvados.cwl-runner[63497] INFO: Job archerdemux_131 (qr1hi-8i9sb-vv99lpfwx62yfzo) is Complete
2016-08-18 14:08:30 arvados.cwl-runner[63497] INFO: Job archerdemux_132 (qr1hi-8i9sb-ojcy0fw50bel8pa) is Complete
2016-08-18 14:08:37 arvados.cwl-runner[63497] INFO: Job archerdemux_133 (qr1hi-8i9sb-883ike8c5b6ig3p) is Complete
2016-08-18 14:08:44 arvados.cwl-runner[63497] INFO: Job archerdemux_134 (qr1hi-8i9sb-wh9wzq55xa805hu) is Complete
2016-08-18 14:08:51 arvados.cwl-runner[63497] INFO: Job archerdemux_135 (qr1hi-8i9sb-iz67yk93gti8rk5) is Complete
2016-08-18 14:08:58 arvados.cwl-runner[63497] INFO: Job archerdemux_136 (qr1hi-8i9sb-zuoibeypq840kb1) is Complete
2016-08-18 14:09:05 arvados.cwl-runner[63497] INFO: Job archerdemux_137 (qr1hi-8i9sb-7778iw7qhb1uneo) is Complete
2016-08-18 14:09:12 arvados.cwl-runner[63497] INFO: Job archerdemux_138 (qr1hi-8i9sb-jom8ps6ikt6pjgm) is Complete
2016-08-18 14:09:19 arvados.cwl-runner[63497] INFO: Job archerdemux_139 (qr1hi-8i9sb-w3280yxbppu8xri) is Complete
2016-08-18 14:09:26 arvados.cwl-runner[63497] INFO: Job archerdemux_140 (qr1hi-8i9sb-wuq8m5km0otuo9l) is Complete
2016-08-18 14:09:34 arvados.cwl-runner[63497] INFO: Job archerdemux_141 (qr1hi-8i9sb-xgof4czghf4w1fk) is Complete
2016-08-18 14:09:41 arvados.cwl-runner[63497] INFO: Job archerdemux_142 (qr1hi-8i9sb-39a5xlllwya0a3v) is Complete
2016-08-18 14:09:49 arvados.cwl-runner[63497] INFO: Job archerdemux_143 (qr1hi-8i9sb-6dfdsjatasevyrt) is Complete
2016-08-18 14:09:56 arvados.cwl-runner[63497] INFO: Job archerdemux_144 (qr1hi-8i9sb-xtrzupjeb66s6nz) is Complete
2016-08-18 14:10:04 arvados.cwl-runner[63497] INFO: Job archerdemux_145 (qr1hi-8i9sb-g3uxz5ut215deio) is Complete
2016-08-18 14:10:11 arvados.cwl-runner[63497] INFO: Job archerdemux_146 (qr1hi-8i9sb-cvne85iaf7f5hod) is Complete
2016-08-18 14:10:19 arvados.cwl-runner[63497] INFO: Job archerdemux_147 (qr1hi-8i9sb-c07fz5qm3el13nq) is Complete
2016-08-18 14:10:26 arvados.cwl-runner[63497] INFO: Job archerdemux_148 (qr1hi-8i9sb-xmw0aa9w597v18q) is Complete
2016-08-18 14:10:34 arvados.cwl-runner[63497] INFO: Job archerdemux_149 (qr1hi-8i9sb-w0d2jqq9eneo5sf) is Complete
2016-08-18 14:10:41 arvados.cwl-runner[63497] INFO: Job archerdemux_150 (qr1hi-8i9sb-rkhpyn53uxeykgb) is Complete
2016-08-18 14:10:49 arvados.cwl-runner[63497] INFO: Job archerdemux_151 (qr1hi-8i9sb-b9red2lfdcqdvm3) is Complete
2016-08-18 14:10:57 arvados.cwl-runner[63497] INFO: Job archerdemux_152 (qr1hi-8i9sb-jhr7yw47ds5het1) is Complete
2016-08-18 14:11:04 arvados.cwl-runner[63497] INFO: Job archerdemux_153 (qr1hi-8i9sb-ri2w2l5543heyay) is Complete
2016-08-18 14:11:12 arvados.cwl-runner[63497] INFO: Job archerdemux_154 (qr1hi-8i9sb-afkfh92ubf19ull) is Complete
2016-08-18 14:11:19 arvados.cwl-runner[63497] INFO: Job archerdemux_155 (qr1hi-8i9sb-adf2npm03a02a96) is Complete
2016-08-18 14:11:28 arvados.cwl-runner[63497] INFO: Job archerdemux_156 (qr1hi-8i9sb-8dg8m9b3murldfu) is Complete
2016-08-18 14:11:36 arvados.cwl-runner[63497] INFO: Job archerdemux_157 (qr1hi-8i9sb-92tllhasky87tk3) is Complete
2016-08-18 14:11:44 arvados.cwl-runner[63497] INFO: Job archerdemux_158 (qr1hi-8i9sb-g9s91sn3may7n4l) is Complete
2016-08-18 14:11:51 arvados.cwl-runner[63497] INFO: Job archerdemux_159 (qr1hi-8i9sb-fblmoo9tst40ldq) is Complete
Updated by Tom Morris over 8 years ago
- Assigned To set to Tom Clegg
- Target version set to 2016-08-31 sprint
Updated by Tom Clegg over 8 years ago
POST /arvados/v1/jobs?...&find_or_create=true&... taking ~1.5 seconds
- update pipeline instance with found job
- get collection (by PDH), select manifest_text
- create new collection
- get same collection again (by PDH)
This seems a little wasteful, but I don't think it comes close to explaining 4 seconds per job.
The 1.5-second "find or create" query is something we might be able to speed up on the API side. Improving the cwl runner to do multiple concurrent queries would probably help too, but that might be harder, and won't always help (e.g., in a serial pipeline, the N+1st query can't be built until the response to the Nth query is received).
Updated by Tom Clegg over 8 years ago
"Getting worse over time" suggests the "get all maybe-matching jobs from Postgres, and filter them further in Ruby" approach is making a significant contribution.
This pipeline has lots of jobs with identical docker image, command, version, etc., but different script_parameters. Currently we filter on docker/command/version in Postgres, and filter script_parameters in Ruby, so this is just the sort of pattern we'd expect to be slow (and slower with every run).
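To make the contrast concrete, here is a minimal plain-Ruby sketch of the digest idea (illustration only, not the API server code: the deep-sort-then-hash scheme is an assumption, and the name sorted_hash_digest is borrowed from the controller snippet quoted later in this thread). Once each job row carries such a digest in a script_parameters_digest column, the "maybe matching" query can narrow on that single indexed column in Postgres instead of re-comparing every candidate's script_parameters in Ruby, so lookups stop getting slower as matching jobs accumulate.

  require 'digest'
  require 'yaml'

  # Deep-sort hash keys so logically identical script_parameters always
  # serialize to the same string before hashing.
  def sorted(obj)
    case obj
    when Hash
      obj.keys.sort_by(&:to_s).each_with_object({}) { |k, h| h[k] = sorted(obj[k]) }
    when Array
      obj.map { |v| sorted(v) }
    else
      obj
    end
  end

  # Hypothetical analogue of Job.sorted_hash_digest: one short, stable
  # string per distinct set of script_parameters.
  def sorted_hash_digest(params)
    Digest::SHA1.hexdigest(sorted(params).to_yaml)
  end

  a = { 'input' => 'abc+123', 'flags' => { 'x' => 1, 'y' => 2 } }
  b = { 'flags' => { 'y' => 2, 'x' => 1 }, 'input' => 'abc+123' }  # same params, different order
  puts sorted_hash_digest(a) == sorted_hash_digest(b)              # => true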
Updated by Bryan Cosca over 8 years ago
FWIW, this kind of job pattern is getting very common when using CWL. CWL's scatter (ScatterMethod) creates a new job for each sample file. I'm making pipelines with 300+ jobs, almost all of which match the pattern you described because of this scatter method.
Updated by Tom Clegg over 8 years ago
9773-faster-find-or-create @ 051a325 queued for jenkins testing.
Updated by Ward Vandewege over 8 years ago
Tom Clegg wrote:
9773-faster-find-or-create @ 051a325 queued for jenkins testing.
Build 147 failed (services/api and services/nodemanager). The nodemanager test looks like it may be a race. The API server test I'm not sure about:
1) Failure:
ArvadosModelTest#test_search_index_exists_on_models_that_go_into_projects [/data/1/jenkins/workspace/developer-test-job/services/api/test/unit/arvados_model_test.rb:144]:
jobs has no search index with columns ["uuid", "owner_uuid", "modified_by_client_uuid", "modified_by_user_uuid", "submit_id", "script", "script_version", "cancelled_by_client_uuid", "cancelled_by_user_uuid", "output", "is_locked_by_uuid", "log", "repository", "supplied_script_version", "docker_image_locator", "state", "arvados_sdk_version", "script_parameters_digest"].
Instead found search index with columns ["uuid", "owner_uuid", "modified_by_client_uuid", "modified_by_user_uuid", "submit_id", "script", "script_version", "cancelled_by_client_uuid", "cancelled_by_user_uuid", "output", "is_locked_by_uuid", "log", "repository", "supplied_script_version", "docker_image_locator", "state", "arvados_sdk_version"]
I kicked it off again just in case, as build 149.
Updated by Ward Vandewege over 8 years ago
Ward Vandewege wrote:
I kicked it off again just in case, as build 149.
149 failed too, with the same services/api error, so I think that one is real. The nodemanager error is gone, so that one is a race.
Updated by Tom Clegg over 8 years ago
Fixed the index test (by dropping script_parameters_digest from searchable_columns) in a3e6fef and kicked off
https://ci.curoverse.com/job/developer-test-job/150/
https://ci.curoverse.com/job/developer-test-job/151/
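For reference, the shape of that fix is presumably something like the following sketch (hypothetical: I haven't inspected a3e6fef, and the searchable_columns override signature shown here is an assumption about the ArvadosModel API rather than verified code):

  # app/models/job.rb (sketch only)
  class Job < ArvadosModel
    # script_parameters_digest is internal bookkeeping, so exclude it from
    # the generic search machinery; the search-index test's expected column
    # list then stays unchanged.
    def self.searchable_columns(operator)
      super - ["script_parameters_digest"]
    end
  end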
Updated by Ward Vandewege over 8 years ago
Tom Clegg wrote:
Fixed the index test (by dropping script_parameters_digest from searchable_columns) in a3e6fef and kicked off
https://ci.curoverse.com/job/developer-test-job/150/
https://ci.curoverse.com/job/developer-test-job/151/
Hrm, both failed with the same set of selenium errors in the workbench integration tests. I kicked off another build just in case.
Updated by Tom Clegg over 8 years ago
https://ci.curoverse.com/job/developer-test-job/152/ - flaky node manager test
https://ci.curoverse.com/job/developer-test-job/153/ - passed
Updated by Tom Clegg over 8 years ago
Note: you can expect the migration 20160819195725_populate_script_parameters_digest to take a while on a production site with lots of job records.
Updated by Radhika Chippada over 8 years ago
- In the PopulateScriptParametersDigest migration script
- Every job in the selected batch gets its own update_all here (because of the id=? clause), even though each update_all already covers all other jobs with the same script_parameters and so removes them from the next batch. So I am wondering whether fetching a smaller number of objects per batch (limit(200)?) would yield better overall performance?
Job.
  where('id=? or script_parameters=?', j.id, j.script_parameters.to_yaml).
  update_all(script_parameters_digest: j.update_script_parameters_digest)
- In JobsController: in the following code, could there be a scenario where there is a Completed job alongside Running / Queued jobs? In that scenario, I'd assume that "if @object.output != j.output" would result in ignoring both and starting a third job?
+      where('state = ? or (owner_uuid = ? and state in (?))',
+            Job::Complete, current_user.uuid, [Job::Queued, Job::Running]).
+      where('script_parameters_digest = ?', Job.sorted_hash_digest(resource_attrs[:script_parameters])).
+      where('nondeterministic is distinct from ?', true).
+      order('state desc, created_at') # prefer Running jobs over Queued
LGTM with those questions clarified.
Updated by Tom Clegg over 8 years ago
Radhika Chippada wrote:
I am wondering if fetching a smaller number of objects (limit(200)?) would yield better overall performance?
Yes, I think you're right. No particular advantage to having a larger batch size. Changed to 200.
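To make that concrete, the backfill presumably ends up shaped roughly like this (a hedged reconstruction from the snippet quoted above, not the actual migration: it assumes the script_parameters_digest column already exists and that update_script_parameters_digest computes and returns the digest for a given job):

  class PopulateScriptParametersDigest < ActiveRecord::Migration
    def up
      loop do
        # Fetch a modest batch of jobs that still need a digest.
        batch = Job.where(script_parameters_digest: nil).limit(200).to_a
        break if batch.empty?
        batch.each do |j|
          # update_all also covers other jobs with identical script_parameters,
          # so later batches shrink faster than 200 rows at a time.
          Job.where('id=? or script_parameters=?', j.id, j.script_parameters.to_yaml).
            update_all(script_parameters_digest: j.update_script_parameters_digest)
        end
      end
    end

    def down
      # Rollback can leave the digests in place (or null them out).
    end
  end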
- In JobsController: In the following code, would there be a scenario where there is Completed job and Running / Queued jobs? In this scenario, I’d assume that “if @object.output != j.output” would result in ignoring both and starting a third job?
Ah, yes. If we have
job | state    | output
A   | Complete | 123
B   | Complete | 456
C   | Running  |
...then a3e6fef would start a new job, where the old code would have used job C.
It seems better to preserve the old behavior in this case.
→ 897f8c0
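To spell out the behavior being preserved, here is a toy plain-Ruby approximation of the decision rule (illustrative only, not the controller code in 897f8c0, which works on the ActiveRecord candidates returned by the query quoted above):

  # Stand-ins for real Job records.
  Candidate = Struct.new(:name, :state, :output)

  def pick_reusable(candidates)
    completed = candidates.select { |j| j.state == 'Complete' }
    in_flight = candidates.find { |j| ['Running', 'Queued'].include?(j.state) }
    if completed.any? && completed.map(&:output).uniq.length == 1
      completed.first  # all finished candidates agree on their output: reuse one
    elsif in_flight
      in_flight        # finished candidates disagree (or none exist): fall back to an in-flight job
    else
      nil              # nothing safely reusable: the caller starts a new job
    end
  end

  # The A/B/C example above: the two Complete jobs disagree, so the Running
  # job C is reused rather than starting a fourth job.
  jobs = [
    Candidate.new('A', 'Complete', '123'),
    Candidate.new('B', 'Complete', '456'),
    Candidate.new('C', 'Running', nil),
  ]
  best = pick_reusable(jobs)
  puts best ? "reuse #{best.name}" : 'start a new job'   # => reuse C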
Updated by Tom Clegg over 8 years ago
- Target version changed from 2016-08-31 sprint to 2016-09-14 sprint
Updated by Tom Clegg over 8 years ago
- Status changed from In Progress to Feedback