Reusable tasks » History » Version 3
Tom Clegg, 10/15/2014 03:44 PM
| 1 | 1 | Tom Clegg | {{>toc}} |
|---|---|---|---|
| 2 | |||
| 3 | h1=. Reusable tasks |
||
| 4 | |||
| 5 | p>. *"Tom Clegg":mailto:tom@curoverse.com |
||
| 6 | Last Updated: October 6, 2014* |
||
| 7 | |||
| 8 | h2. Overview |
||
| 9 | |||
| 10 | h3. Objective |
||
| 11 | |||
| 12 | Say jobs A and B, although not identical, have some tasks in common. Job A is complete. Job B is starting now. They use the same script, version, docker image, etc. The only difference between A and B is that B's input collection has one more file; the rest of the files are identical. The script processes each input file independently, and it is a pure function (re-computing the same files will produce the same result). This means most of Job B's work has already been done. Task re-use will allow Arvados to recognize this condition and re-use the outputs of Job A's tasks instead of recomputing them. |
||
| 13 | |||
| 14 | Task re-use will not attempt to detect equivalence conditions like differently-encoded collection manifests with identical data, differing git commits with identical trees, and differing docker images with functionally equivalent content. |
||
| 15 | |||
| 16 | The intended audience for this document is software engineers. |
||
| 17 | |||
| 18 | h3. Background |
||
| 19 | |||
| 20 | 2 | Tom Clegg | The arvados.v1.jobs.create API offers a find_or_create feature that searches for an existing job that meets criteria specified by the client (e.g., same script, compatible script_version) and additional criteria (e.g., did not fail, is not marked impure/nondeterministic, does not disagree with other jobs passing the same criteria about what the correct output is). |
| 21 | 1 | Tom Clegg | |
| 22 | 2 | Tom Clegg | * http://doc.arvados.org/api/methods/jobs.html#create |
| 23 | |||
| 24 | 1 | Tom Clegg | h3. Alternatives |
| 25 | |||
| 26 | 2 | Tom Clegg | Always recompute each task (i.e., leave existing behavior). |
| 27 | 1 | Tom Clegg | |
| 28 | 2 | Tom Clegg | bq. This makes desirable use cases prohibitively expensive. |
| 29 | 1 | Tom Clegg | |
| 30 | 2 | Tom Clegg | Use smaller jobs, and more jobs per pipeline. |
| 31 | 1 | Tom Clegg | |
| 32 | 2 | Tom Clegg | bq. We could make the dynamic-structure capabilities of crunch jobs available at the pipeline level, and de-emphasize or stop using the features that encourage long-running jobs. Disadvantages include: |
| 33 | * The process of running a pipeline is not done in a controlled environment. This effectively reduces the utility of reproducibility and provenance features. |
||
| 34 | * Pipelines are currently encoded as JSON which is awkward to use as a DSL. |
||
| 35 | 1 | Tom Clegg | |
| 36 | 2 | Tom Clegg | h3. Tradeoffs |
| 37 | 1 | Tom Clegg | |
| 38 | 2 | Tom Clegg | _TODO_ |
| 39 | 1 | Tom Clegg | |
| 40 | 2 | Tom Clegg | h3. High Level Design |
| 41 | 1 | Tom Clegg | |
| 42 | 3 | Tom Clegg | Before executing a job_task that qualifies for re-use, crunch-job uses the API to discover existing job_tasks that are functionally identical, are marked as "pure", and have already finished. If any are found, crunch-job copies the output of one of them into the new job_task instead of executing the task. |
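
The decision described above can be sketched as follows. This is only an illustration under stated assumptions, not crunch-job's actual code: the @JobTask@ class is a simplified stand-in for the real record, and @find_equivalent@ and @execute@ are hypothetical callables standing in for the API lookup and for normal task execution.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class JobTask:
    # Hypothetical, simplified view of a job_task record.
    uuid: str
    is_pure: bool
    finished: bool = False
    output: Optional[str] = None


def run_or_reuse(task: JobTask,
                 find_equivalent: Callable[[JobTask], Iterable[JobTask]],
                 execute: Callable[[JobTask], str]) -> bool:
    """Return True if an existing output was copied, False if the task ran."""
    if task.is_pure:
        for other in find_equivalent(task):
            # Only pure tasks that have already finished qualify as a source.
            if other.is_pure and other.finished and other.output is not None:
                task.output = other.output
                task.finished = True
                return True
    # No reusable task was found (or this task is not pure): run it normally.
    task.output = execute(task)
    task.finished = True
    return False
```

A task that is not marked pure never consults @find_equivalent@, so existing behavior is unchanged for it.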
| 43 | 2 | Tom Clegg | |
| 44 | 1 | Tom Clegg | h2. Specifics |
| 45 | |||
| 46 | h3. Detailed Design |
||
| 47 | |||
| 48 | 2 | Tom Clegg | The JobTask schema has a new boolean flag @is_pure@ (not null, default @false@). |
| 49 | 1 | Tom Clegg | |
| 50 | 2 | Tom Clegg | Just before starting a task having @is_pure==true@, crunch-job queries the API to look up other tasks with @is_pure==true@ and identical inputs, parameters, script_version, etc. |
| 51 | * Some attributes like script and script_version are currently stored in the job record, not the job_task record. This will make the lookup interesting, in the absence of a generic "join" API. |
||
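
Without a generic "join" API, one possible workaround is to join the two record types on the client side. The sketch below assumes that tasks and jobs have already been fetched as plain dictionaries, that each task carries a @job_uuid@ and a @parameters@ hash, and that the job attributes to compare are the ones listed in @JOB_KEYS@ (the text names script and script_version; @docker_image_locator@ is an assumed addition). The function name is hypothetical.

```python
from typing import Dict, Iterable, List

# Attributes assumed to be stored on the job record rather than the job_task.
JOB_KEYS = ("script", "script_version", "docker_image_locator")


def matching_tasks(new_job: Dict, new_task: Dict,
                   tasks: Iterable[Dict],
                   jobs_by_uuid: Dict[str, Dict]) -> List[Dict]:
    """Client-side join: keep pure tasks whose parameters and job attributes match."""
    matches = []
    for t in tasks:
        if not t.get("is_pure"):
            continue
        if t.get("parameters") != new_task.get("parameters"):
            continue
        job = jobs_by_uuid.get(t.get("job_uuid"))
        if job is None:
            # The owning job was not fetched or is not readable: cannot compare.
            continue
        if all(job.get(k) == new_job.get(k) for k in JOB_KEYS):
            matches.append(t)
    return matches
```

The cost of this approach is that the client may have to fetch many job and job_task records only to reject most of them.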
| 52 | 1 | Tom Clegg | |
| 53 | 2 | Tom Clegg | Job tasks have one especially noteworthy side effect: queueing additional tasks. In order to reuse tasks safely without races, we need additional constraints: |
| 54 | * Tasks with @is_pure==true@ cannot queue additional tasks, *and* @is_pure@ cannot change from @false@ to @true@. |
||
| 55 | 3 | Tom Clegg | * Tasks do not qualify for reuse until they have completed[1]. When reusing a task, copy (and reset to "todo" state) each task whose @created_by_job_task_uuid@ attribute references the task being reused. |
| 56 | 1 | Tom Clegg | |
| 57 | 2 | Tom Clegg | fn1. At least in the short term, this constraint is a good way to limit the complexity of implementation without sacrificing too much of the user benefit. |
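
The copy-and-reset step for queued tasks might look like the sketch below. It assumes tasks are plain dictionaries, that a @state@ field holds the "todo" marker mentioned above, and that a copied task should start with no output; the function name, the @state@ and @output@ handling, and the generated UUIDs are all hypothetical.

```python
import copy
import uuid
from typing import Dict, List


def copy_child_tasks(reused_uuid: str, new_task_uuid: str, new_job_uuid: str,
                     existing_tasks: List[Dict]) -> List[Dict]:
    """Copy each task that was queued by the reused task, reset to "todo"."""
    children = []
    for t in existing_tasks:
        if t.get("created_by_job_task_uuid") != reused_uuid:
            continue
        child = copy.deepcopy(t)
        # Placeholder identifier; a real record would get its UUID from the API.
        child["uuid"] = "hypothetical-" + uuid.uuid4().hex
        child["job_uuid"] = new_job_uuid
        # The copy is attributed to the new task, not the one being reused.
        child["created_by_job_task_uuid"] = new_task_uuid
        child["state"] = "todo"   # assumed field name for the reset state
        child["output"] = None    # assumption: a reset task has no output yet
        children.append(child)
    return children
```

Each copied task is itself a candidate for reuse when crunch-job reaches it, provided it is marked pure.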
| 58 | 1 | Tom Clegg | |
| 59 | h3. Code Location |
||
| 60 | |||
| 61 | 2 | Tom Clegg | @sdk/cli/bin/crunch-job@ will have new task reuse logic. |
| 62 | 1 | Tom Clegg | |
| 63 | 2 | Tom Clegg | @services/api/db/migrate@ will have a new migration, which will be reflected in @services/api/db/structure.sql@. |
| 64 | |||
| 65 | 3 | Tom Clegg | @services/api/app/models/job_task.rb@ will add @is_pure@ to the API response and prohibit any transaction that changes @is_pure@ from @false@ to @true@. In other words, @is_pure@ can be set to @true@ only at creation time. |
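
The update rule itself is small. The following is a language-neutral illustration written in Python, not the Rails model code; the function name and the choice of exception are hypothetical.

```python
def check_is_pure_update(old_is_pure: bool, new_is_pure: bool) -> None:
    """Reject an update that changes is_pure from false to true.

    This applies to updates of an existing record only; at creation time
    is_pure may be set to either value.
    """
    if not old_is_pure and new_is_pure:
        raise ValueError("is_pure cannot change from false to true")
```

Only the false-to-true transition is rejected here, because that is the only transition the design prohibits.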
| 66 | 2 | Tom Clegg | |
| 67 | 3 | Tom Clegg | @doc/api/schema/JobTask.html.textile.liquid@ will document the @is_pure@ flag. |
| 68 | 2 | Tom Clegg | |
| 69 | 1 | Tom Clegg | h3. Testing Plan |
| 70 | |||
| 71 | 2 | Tom Clegg | _TODO_ |
| 72 | 1 | Tom Clegg | |
| 73 | h3. Logging |
||
| 74 | |||
| 75 | 2 | Tom Clegg | @crunch-job@ will log each case in which it has copied a task's output attribute (and, if applicable, queued additional tasks) from an existing completed task. |
| 76 | 1 | Tom Clegg | |
| 77 | h3. Debugging |
||
| 78 | |||
| 79 | 2 | Tom Clegg | _TODO_ |
| 80 | 1 | Tom Clegg | |
| 81 | h3. Caveats |
||
| 82 | |||
| 83 | To be determined. |
||
| 84 | |||
| 85 | h3. Security Concerns |
||
| 86 | |||
| 87 | 3 | Tom Clegg | The existing permission model can prevent user A's job from reusing completed tasks merely because they were initiated by a different user. In such cases (where user A has no other way of knowing about user B's job or task), this is preferable to exposing to user A the fact that any other user has run the task. |
| 88 | 1 | Tom Clegg | |
| 89 | h3. Open Questions and Risks |
||
| 90 | |||
| 91 | 3 | Tom Clegg | Should purity be enforced or monitored? |
| 92 | * Each task could be given a token with scopes restricting it to reading the collection hashes in its @parameters@ hash and its own JobTask and Job resources. |
||
| 93 | 1 | Tom Clegg | |
| 94 | 3 | Tom Clegg | Will there be a special-purpose API for looking up a reusable task, or a generic join-and-filter API? If neither, crunch-job will have to fetch multiple pages of job_tasks and jobs in order to reject ones with mismatched script, script_version, docker image, etc. |
| 95 | |||
| 96 | Do we indicate in the job_task record that the output was copied from an existing task? If so, how? (Note that a reference to the existing job_task can become stale due to permission changes.) |
||
| 97 | |||
| 98 | What are the appropriate values for a job_task's start/finish timestamp attributes, if the task's outputs were copied from existing tasks? |
||
| 99 | |||
| 100 | 1 | Tom Clegg | h3. Work Estimates |
| 101 | |||
| 102 | _TODO_ |
||
| 103 | |||
| 104 | h3. Future Work |
||
| 105 | |||
| 106 | 3 | Tom Clegg | The database tables could be refactored into @jobs@, @job_tasks@, and @tasks@ where @job_tasks@ establishes a many-to-many relationship. |
| 107 | 1 | Tom Clegg | |
| 108 | 3 | Tom Clegg | |Table|Significance of a row| |
| 109 | |jobs|A user initiated some work (requested an output) using Crunch.| |
||
| 110 | |job_tasks|A job must run a task in order to generate part of its output.| |
||
| 111 | |tasks|A unit of work was (or will be, or is being) performed as part of a job.| |
||
| 112 | |||
| 113 | This way, jobs could reference existing tasks directly rather than copying data between rows in @job_tasks@. Jobs could share tasks even before the tasks have completed. |
||
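
A minimal sketch of the refactored layout, using an in-memory SQLite database purely for illustration. The column names are assumptions, and the real tables would carry many more attributes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE jobs (uuid TEXT PRIMARY KEY);
CREATE TABLE tasks (uuid TEXT PRIMARY KEY, output TEXT);
-- job_tasks links jobs to tasks, many-to-many.
CREATE TABLE job_tasks (
    job_uuid  TEXT NOT NULL REFERENCES jobs(uuid),
    task_uuid TEXT NOT NULL REFERENCES tasks(uuid),
    PRIMARY KEY (job_uuid, task_uuid)
);
""")

# Two jobs reference the same task; the task has no output yet,
# so it is shared before it has completed.
conn.executemany("INSERT INTO jobs VALUES (?)", [("job-A",), ("job-B",)])
conn.execute("INSERT INTO tasks VALUES (?, ?)", ("task-1", None))
conn.executemany("INSERT INTO job_tasks VALUES (?, ?)",
                 [("job-A", "task-1"), ("job-B", "task-1")])


def jobs_sharing(db: sqlite3.Connection, task_uuid: str) -> list:
    """Return the UUIDs of all jobs that reference the given task."""
    rows = db.execute(
        "SELECT job_uuid FROM job_tasks WHERE task_uuid = ? ORDER BY job_uuid",
        (task_uuid,))
    return [r[0] for r in rows]
```

When the task finishes, a single update to its row in @tasks@ makes the output visible to every job that references it, with no copying between rows.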
| 114 | |||
| 115 | A facility (and incentive) could be provided to denote tasks as reusable even by users to whom they are otherwise invisible: "If you can guess exactly what I did, and you have permission to read the inputs, I'll admit I did that work and I'll show you the output." |
||
| 116 | |||
| 117 | 1 | Tom Clegg | h3. Revision History |
| 118 | |||
| 119 | |_.Date |_.Revisions Made |_.Author |_.Reviewed By | |
||
| 120 | | October 6, 2014 | Initial Draft | Tom Clegg |=. ---- | |
||
| 121 | 3 | Tom Clegg | | October 15, 2014 | (cont'd) | Tom Clegg |=. ---- | |