Python SDK » History » Version 2

Tom Clegg, 08/16/2014 01:26 AM

h1. Python SDK

(design draft)

h3. Example crunch script

<pre><code class="python">
#!/usr/bin/env python

from arvados import CrunchJob

import examplelib
import re

class NormalizeMatchingFiles(CrunchJob):
    @CrunchJob.task()
    def grep_files(self):
        # CrunchJob instantiates input parameters based on the
        # dataclass attribute.  When we ask for the input parameter,
        # CrunchJob sees that it's a Collection, and returns a
        # CollectionReader object.
        for filename in self.job_param('input').filenames():
            self.grep_file(filename)

    @CrunchJob.task()
    def grep_file(self, filename):
        regexp = re.compile(self.job_param('pattern'))
        with self.job_param('input').open(filename) as in_file:
            for line in in_file:
                if regexp.search(line):
                    self.normalize(filename)
                    break

    # examplelib is already multi-threaded and will peg the whole
    # compute node.  These tasks should run sequentially.
    @CrunchJob.task(parallel_with=[])
    def normalize(self, filename):
        output = examplelib.frob(self.job_param('input').mount_path(filename))
        # self.output is a CollectionWriter.  When this task method finishes,
        # CrunchJob checks if we wrote anything to it.  If so, it takes care
        # of finishing the upload process, and sets this task's output to the
        # Collection UUID.
        with self.output.open(filename) as out_file:
            out_file.write(output)


if __name__ == '__main__':
    NormalizeMatchingFiles(task0='grep_files').main()
</code></pre>

Notes/todo:
* Important concurrency limits that job scripts must be able to express (one possible decorator-based expression is sketched below):
** Task Z cannot start until all outputs/side effects of tasks W, X, Y are known/complete (e.g., because Z uses W, X, and Y's outputs as its inputs).
** Tasks Y and Z cannot run on the same worker node without interfering with each other (e.g., due to RAM requirements).
* In general, the output name is not known until the task is nearly finished, but it is frequently clearer to specify it when the task is queued. We should provide a convenient way to do this without any boilerplate in the queued task (see the second sketch below).
* It would be nice to pass an openable object to a task rather than a filename (i.e., @grep_file()@ shouldn't have to repeat @job_param('input')@; that should be implicit in its argument); see the third sketch below.
* A second example that uses a "case" and "control" input (e.g., "tumor" and "normal") might help reveal features.
* We should be clearer about how the output of the job (as opposed to the output of the last task) is to be set. The obvious way (concatenating all task outputs) should be a one-liner, if not implicit. Either way, it should run in a task rather than being left up to @crunch-job@ (see the last sketch below).
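
One possible way to express the two concurrency limits is with additional arguments to the task decorator, in the spirit of @parallel_with@ in the example above. This is only a sketch: the @after@ and @exclusive_with@ names are placeholders, not part of the draft API.

<pre><code class="python">
from arvados import CrunchJob


class ConstraintSketch(CrunchJob):
    # Placeholder: task_z is not queued until task_w, task_x, and task_y
    # have finished and their outputs/side effects are known.
    @CrunchJob.task(after=['task_w', 'task_x', 'task_y'])
    def task_z(self):
        pass

    # Placeholder: task_y may run concurrently with task_z, but the two
    # must never share a worker node (e.g., because of RAM requirements).
    @CrunchJob.task(exclusive_with=['task_z'])
    def task_y(self):
        pass
</code></pre>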
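
For the output-name note, one convenient shape would be to attach the name at the point where the task is queued. The @output_name@ keyword is a placeholder; CrunchJob would record it when the task is queued and apply it when the task's output collection is finalized.

<pre><code class="python">
# Sketch only: inside grep_file() above, queue the task with its output
# name attached, so normalize() itself contains no naming boilerplate.
self.normalize(filename, output_name=filename + '.normalized')
</code></pre>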
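
For the openable-object note, @grep_file()@ could receive a file-like object that CrunchJob opens from @job_param('input')@ before running the task body. Sketch only; the framework behavior and the @name@ attribute on the passed object are assumptions.

<pre><code class="python">
    @CrunchJob.task()
    def grep_file(self, in_file):
        # Placeholder behavior: CrunchJob resolves the queued filename
        # against job_param('input') and passes an already-open file object.
        regexp = re.compile(self.job_param('pattern'))
        for line in in_file:
            if regexp.search(line):
                self.normalize(in_file.name)
                break
</code></pre>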
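
For the last note, the concatenate-all-task-outputs default could itself be a one-line final task rather than something left to @crunch-job@. Sketch only; @after='*'@, @task_outputs()@, and @set_job_output()@ are placeholder names.

<pre><code class="python">
    @CrunchJob.task(after='*')
    def finalize_output(self):
        # Placeholder names: task_outputs() yields each task's output
        # collection; set_job_output() concatenates them and records the
        # result as the output of the whole job, not just the last task.
        self.set_job_output(self.task_outputs())
</code></pre>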