Story #3609
closed[SDKs] arv-run wrapper to interactively submit 'run-command' jobs
Added by Peter Amstutz over 10 years ago. Updated about 10 years ago.
100%
Updated by Peter Amstutz over 10 years ago
Brainstorming:
/keep/foo/1.baz /keep/foo/2.baz /keep/foo/3.baz
Example command line:
arv-run grep /keep/foo/*.baz '|' cut -c5 '>' stuff
Runs as separate tasks:bash -c grep /keep/foo/1.baz | cut -c5 > stuff
bash -c grep /keep/foo/2.baz | cut -c5 > stuff
bash -c grep /keep/foo/3.baz | cut -c5 > stuff
Combine "stuff" at the end.
Collect everything at the end of the command line that doesn't start with '-' as separate files. If necessary, signal start of file list using '--'
Run one task per file, or specify grouping
arv-run --group-by=2 bwa mem -- *.fastq
Updated by Tom Clegg over 10 years ago
- Subject changed from arv-run wrapper to interactively submit 'run-command' jobs to [SDKs] arv-run wrapper to interactively submit 'run-command' jobs
- Story points set to 2.0
Updated by Ward Vandewege over 10 years ago
- Target version changed from Arvados Future Sprints to 2014-09-17 sprint
Updated by Peter Amstutz over 10 years ago
- Category set to SDKs
- Assigned To set to Peter Amstutz
Updated by Peter Amstutz over 10 years ago
- Status changed from New to In Progress
Updated by Peter Amstutz over 10 years ago
- Target version changed from 2014-09-17 sprint to 2014-10-08 sprint
Updated by Tom Clegg about 10 years ago
- Target version changed from 2014-10-08 sprint to Arvados Future Sprints
Updated by Peter Amstutz about 10 years ago
- Target version changed from Arvados Future Sprints to 2014-10-29 sprint
Updated by Brett Smith about 10 years ago
Reviewing the revamped arv-ws at 55888e6.
events module¶
- Please add tests for PollClient.
- It shouldn't be necessary to do your own JSON encoding of API parameters; the API client library should take care of that for you. It definitely works for filters, and seems to do OK with a single-string order parameter too. It seems to mess up when you pass a list to order. Maybe that's a bug in our discovery document?
_logger.error("Web sockets not available at %s" % api._rootDesc['websocketUrl'])
can crash if there is no websocketUrl (e.g., we got here because PollClient raised an exception). I think you could avoid these sorts of problems, and reduce code that's redundant between the branching logic and the exception logic, by having thetry
clause only cover the main websockets setup code, and do necessary cleanup if it fails.- This is a modernization thing not directly related to your branch, but while you're here, the websocket URL should be generated using
api.api_token
rather thanconfig.get('ARVADOS_API_TOKEN')
.
ws command¶
- Please add help for all the options. I don't think I would've figured out that
--filters
expects a JSON string without reading the code. - The name
--poll-fallback
sounds like a boolean, especially since it's accompanied by--no-poll-fallback
. Please consider a different name that clarifies that it's defining a poll interval. lambda ev: on_message(ev)
is redundant. You can just writeon_message
here.- Please remove the commented code from
on_message
.
Thanks.
Updated by Brett Smith about 10 years ago
Reviewing arv-ws follow-ups at 58014b9.
- Sleeping in multithreaded tests is really dicey business. It's difficult to find values that are both fast for developer machines, and accommodating enough for Jenkins. What if on_event in both these test classes set a threading.Event for state 2, and the test method waited for that with a generous timeout (10 seconds, maybe)? There's really no surefire way to test state 3 with the current structure, since that requires solving the halting problem.
- In the tests,
lambda ev: self.on_event(ev)
is redundant as before. - Suggest
signal.pause()
instead oftime.sleep(60)
to wait for the child thread to finish. - The polling-related switches would be nicer as a mutually exclusive group, since it makes no sense to set a poll interval and turn polling off.
- The help and error messages inconsistently use both "websockets" and "web sockets." Please figure out the preferred spelling and make them consistent.
Thanks.
Updated by Peter Amstutz about 10 years ago
Brett Smith wrote:
Reviewing the revamped arv-ws at 55888e6.
events module¶
- Please add tests for PollClient.
Test added.
- It shouldn't be necessary to do your own JSON encoding of API parameters; the API client library should take care of that for you. It definitely works for filters, and seems to do OK with a single-string order parameter too. It seems to mess up when you pass a list to order. Maybe that's a bug in our discovery document?
You're right, the discovery document does specify type=string (instead of type=array) for the "order" parameter. However changing this falls into a gray area of backwards compatibility that seems out of scope for this branch. Instead, since there's only a single order column, I just changed to be a string (no json encoding necessary).
_logger.error("Web sockets not available at %s" % api._rootDesc['websocketUrl'])
can crash if there is no websocketUrl (e.g., we got here because PollClient raised an exception). I think you could avoid these sorts of problems, and reduce code that's redundant between the branching logic and the exception logic, by having thetry
clause only cover the main websockets setup code, and do necessary cleanup if it fails.
Fixed.
- This is a modernization thing not directly related to your branch, but while you're here, the websocket URL should be generated using
api.api_token
rather thanconfig.get('ARVADOS_API_TOKEN')
.
Fixed.
ws command¶
- Please add help for all the options. I don't think I would've figured out that
--filters
expects a JSON string without reading the code.
Fixed.
- The name
--poll-fallback
sounds like a boolean, especially since it's accompanied by--no-poll-fallback
. Please consider a different name that clarifies that it's defining a poll interval.
Fixed.
lambda ev: on_message(ev)
is redundant. You can just writeon_message
here.
Fixed.
- Please remove the commented code from
on_message
.
Fixed.
Updated by Peter Amstutz about 10 years ago
Brett Smith wrote:
Reviewing arv-ws follow-ups at 58014b9.
- Sleeping in multithreaded tests is really dicey business. It's difficult to find values that are both fast for developer machines, and accommodating enough for Jenkins. What if on_event in both these test classes set a threading.Event for state 2, and the test method waited for that with a generous timeout (10 seconds, maybe)? There's really no surefire way to test state 3 with the current structure, since that requires solving the halting problem.
Changed to use event variable.
- In the tests,
lambda ev: self.on_event(ev)
is redundant as before.
Fixed.
- Suggest
signal.pause()
instead oftime.sleep(60)
to wait for the child thread to finish.
Didn't know about that one. Fixed.
- The polling-related switches would be nicer as a mutually exclusive group, since it makes no sense to set a poll interval and turn polling off.
That was my mistake, I intended them to be mutually exclusive but used the wrong method.
- The help and error messages inconsistently use both "websockets" and "web sockets." Please figure out the preferred spelling and make them consistent.
Fixed.
Updated by Brett Smith about 10 years ago
Reviewing 2379619
The new logic for cloning an API client for PollClient seems to be inconsistently applied. We use a threaded WebSocketClient, so it seems like that whatever thread safety policy we apply to PollClient, we should also apply to EventClient.
Personally, I think we should make that the caller's responsibility: "When you pass an API client to subscribe(), you can't use it again until you close the client." Like you pointed out on IRC, api() kwargs make it difficult to do proper cloning, and it seems like most users will fall into the pattern that they have an API client to do some setup, and then use it for websockets the rest of the time.
But, if you feel strongly about cloning, then I think we need to do the clone at the top of subscribe(), before we build any client.
Peter Amstutz wrote:
Brett Smith wrote:
Reviewing arv-ws follow-ups at 58014b9.
- Sleeping in multithreaded tests is really dicey business. It's difficult to find values that are both fast for developer machines, and accommodating enough for Jenkins. What if on_event in both these test classes set a threading.Event for state 2, and the test method waited for that with a generous timeout (10 seconds, maybe)? There's really no surefire way to test state 3 with the current structure, since that requires solving the halting problem.
Changed to use event variable.
What about the time.sleep(1)
before creating the object? If we need to make sure the 200 OK is handled before we send that, I think we need an event flag for that too. If not, I think we can ditch the sleep.
- Suggest
signal.pause()
instead oftime.sleep(60)
to wait for the child thread to finish.Didn't know about that one. Fixed.
You can remove the unused time
import now.
- The help and error messages inconsistently use both "websockets" and "web sockets." Please figure out the preferred spelling and make them consistent.
Fixed.
There's still a logger warning with "web sockets" on events.py line 100.
Updated by Peter Amstutz about 10 years ago
Brett Smith wrote:
Reviewing 2379619
Personally, I think we should make that the caller's responsibility: "When you pass an API client to subscribe(), you can't use it again until you close the client." Like you pointed out on IRC, api() kwargs make it difficult to do proper cloning, and it seems like most users will fall into the pattern that they have an API client to do some setup, and then use it for websockets the rest of the time.
Fixed.
What about the
time.sleep(1)
before creating the object? If we need to make sure the 200 OK is handled before we send that, I think we need an event flag for that too. If not, I think we can ditch the sleep.
Yep, added a "subscribed" event variable too. Refactored EventClientTest and PollClientTest to use the same test code.
You can remove the unused
time
import now.
Fixed.
There's still a logger warning with "web sockets" on events.py line 100.
Fixed.
Updated by Brett Smith about 10 years ago
Peter Amstutz wrote:
Yep, added a "subscribed" event variable too. Refactored EventClientTest and PollClientTest to use the same test code.
Just a couple of small things about the new test structure:
- In EventTestBase.runTest, assertIsInstance would generate a nicer diagnostic than the current condition+fail.
- In the subclasses' tearDown, please guard against the possibility that
self.ws is None
. super(run_test_server.TestCaseWithServers, self).tearDown()
should have the current class (WebsocketTest
orPollClientTest
) as the first argument. The current invocation is running the tearDown method of TestCaseWithServers' superclass, which is presumably a noop.
Please go ahead and merge with these changes. Thanks.
Updated by Tim Pierce about 10 years ago
Review @ 0edcc26f:
sdk/python/arvados/commands/run.py
- The
slots
stuff was really confusing to me, and I had to read the code and the man page a few times to tell what was going on. Can I suggest a comment to clarify what's going on here? Something like# Parse the command arguments into 'slots'. # All words following '>' are output arguments and are collected into slots[0]. # All words following '<' are input arguments and are collected into slots[1]. # slots[2] and on represent individual pipelines. # # e.g. arv-run foo arg1 arg2 '|' bar arg3 arg4 '<' input1 input2 input3 '>' output.txt # will be parsed into: # [['output.txt'], # ['input1', 'input2', 'input3'], # ['foo', 'arg1', 'arg2'], # ['bar', 'arg3', 'arg4']]
determine_project
andis_in_collection
should list which exceptions they're catching- import
arvados.commands._util
so we can include the default--retries
argument, instead of defaultingmax_retries
to 3 - L123-146:
- I'm not sure I understand what I'm going on here. Is this here to catch command-line flags that may be hiding filenames that need to be translated, like
--filename=zzzzz-4zz18-23u3hxugbm71qmn.txt
or-fblurfl.txt
? If so:- Add a comment explaining what we're doing.
- Does this correctly handle directories or filenames with equal signs in them?
- I'm not sure I understand what I'm going on here. Is this here to catch command-line flags that may be hiding filenames that need to be translated, like
- L191:
- if f is more than one directory deep under the current dir, this looks like it will make the stream name just the top level directory. Shouldn't the stream name be
os.path.basename(f)
or am I confused? - If f does not include a "/" then the stream name will be the empty string "", not ".".
- if f is more than one directory deep under the current dir, this looks like it will make the stream name just the top level directory. Shouldn't the stream name be
- L245: trailing "\"?
- L267: needs
ensure_unique_name=True
crunch_scripts/run-command
- I'm pretty lost about how the list parsing works here, particularly on what
var_items
,get_items
andexpand_item
are for and how they fit together, and what overall goal they accomplish. Document at least the general logic flow for a new reader. - L395: I think this is equivalent to
success = any([status == 0 for status in rcode.values()])
? Thereduce
call is sort of confusing to me because it suggests that something is being accumulated.
Updated by Peter Amstutz about 10 years ago
- Target version changed from 2014-10-29 sprint to 2014-11-19 sprint
Updated by Peter Amstutz about 10 years ago
Tim Pierce wrote:
Review @ 0edcc26f:
sdk/python/arvados/commands/run.py
- The
slots
stuff was really confusing to me, and I had to read the code and the man page a few times to tell what was going on. Can I suggest a comment to clarify what's going on here? Something like
[...]
Done
determine_project
andis_in_collection
should list which exceptions they're catching
Done
- import
arvados.commands._util
so we can include the default--retries
argument, instead of defaultingmax_retries
to 3
Done
- L123-146:
- I'm not sure I understand what I'm going on here. Is this here to catch command-line flags that may be hiding filenames that need to be translated, like
--filename=zzzzz-4zz18-23u3hxugbm71qmn.txt
or-fblurfl.txt
? If so:
- Add a comment explaining what we're doing.
That's correct. Added comments.
- Does this correctly handle directories or filenames with equal signs in them?
I reworked the code a bit so that it tests to see if something is a filename first, instead of pattern matching first, so if a filename has a equals sign it should get recognized correctly.
I just added the ability to recognize directories in collections.
Currently directories are not uploaded automatically since I don't feel like I have enough data to know what the right behavior is; recursively uploading the contents of a directory seems much more dangerous than uploading explicitly specified files.
- L191:
- if f is more than one directory deep under the current dir, this looks like it will make the stream name just the top level directory. Shouldn't the stream name be
os.path.basename(f)
or am I confused?
os.path.split() in python returns 2 items, the entire leading directory path in [0] and the leaf name in [1]
- If f does not include a "/" then the stream name will be the empty string "", not ".".
Fortunately, CollectionWriter already handles that case and turns "" into "."
- L245: trailing "\"?
Fixed.
- L267: needs
ensure_unique_name=True
Actually I don't think this is necessary as we currently enforce uniqueness for pipeline instance names, but out of an abundance of caution I added it anyway.
crunch_scripts/run-command
- I'm pretty lost about how the list parsing works here, particularly on what
var_items
,get_items
andexpand_item
are for and how they fit together, and what overall goal they accomplish. Document at least the general logic flow for a new reader.
I added some comments, probably inadequate but I'm trying to turn this code around.
- L395: I think this is equivalent to
success = any([status == 0 for status in rcode.values()])
? Thereduce
call is sort of confusing to me because it suggests that something is being accumulated.
Not quite, but I did come up with a use for any() that does do the right thing. I agree that using reduce() here is being excessively clever.
Thanks!
Updated by Tim Pierce about 10 years ago
Comments at 51de3bf2
doc/user/topics/run-command.html.textile.liquid
- The documentation for list expansion functions uses a lot of repeated component names like
a
andb
. That makes it confusing to tell which parameters are expanded to what values. This would be much easier to grasp if each example is rewritten to use unique parameter names. - The
group
list function in particular needs to be clarified. The innerforeach
loop is puzzling. I can work out that the outer b variable in this scope is getting expanded to the lists ["alice", "carol", "dave"] and ["bob"] based on how they match the regex, but the nested foreach loops are confusing, especially that one foreach is on"$(b)"
and one is on"b"
-- do those both work and are they different? - The section List context describes what happens "when a parameter is evaluated in a list context", but it doesn't describe what a list context is or how the user puts an expression into a list context. Is a list context something the user has any control over, or is it just implicitly invoked by list functions like
foreach
?
crunch_scripts/run-command
- The logic for evaluating a parameter template into an argument list is hard to follow. The run-command user documentation fills in some of the gaps here, which is helpful -- we should copy some or most of this into the source as well, to make reading the source easier.
- It is true that this process is a little bit like having a tiny language interpreter in the code. We need to start checking inputs and types more strictly here, e.g. these fields are assumed to hold a string value:
c['foreach']
c['list']
c['var']
- (others?)
- But if one of these holds a list instead, or a dict, or None, things will start to go sideways quickly (e.g.
var_items
will return None forvar
and this will setparams[None]
to each item in turn). As far as I can tell, that's a badly formed pipeline and should not work anyway, but we should proactively detect those cases and raise an exception. - FWIW I think we are going to need to make the definition and parsing of this metalanguage more rigorous, and that should happen sooner rather than later, before we get locked into weird implementation idiosyncrasies that are hard to back out.
Updated by Peter Amstutz about 10 years ago
Tim Pierce wrote:
Comments at 51de3bf2
doc/user/topics/run-command.html.textile.liquid
- The documentation for list expansion functions uses a lot of repeated component names like
a
andb
. That makes it confusing to tell which parameters are expanded to what values. This would be much easier to grasp if each example is rewritten to use unique parameter names.
Improved the formatting and added explicit "var" naming instead of using the implicit variable shadowing.
- The
group
list function in particular needs to be clarified. The innerforeach
loop is puzzling. I can work out that the outer b variable in this scope is getting expanded to the lists ["alice", "carol", "dave"] and ["bob"] based on how they match the regex, but the nested foreach loops are confusing, especially that one foreach is on"$(b)"
and one is on"b"
-- do those both work and are they different?
Fixed.
- The section List context describes what happens "when a parameter is evaluated in a list context", but it doesn't describe what a list context is or how the user puts an expression into a list context. Is a list context something the user has any control over, or is it just implicitly invoked by list functions like
foreach
?
I tried to clarify this, let me know if you have any more questions.
crunch_scripts/run-command
- The logic for evaluating a parameter template into an argument list is hard to follow. The run-command user documentation fills in some of the gaps here, which is helpful -- we should copy some or most of this into the source as well, to make reading the source easier.
Added comments about what each of the functions do.
- It is true that this process is a little bit like having a tiny language interpreter in the code. We need to start checking inputs and types more strictly here, e.g. these fields are assumed to hold a string value:
c['foreach']
c['list']
c['var']
- (others?)
- But if one of these holds a list instead, or a dict, or None, things will start to go sideways quickly (e.g.
var_items
will return None forvar
and this will setparams[None]
to each item in turn). As far as I can tell, that's a badly formed pipeline and should not work anyway, but we should proactively detect those cases and raise an exception.
Added more typechecking.
- FWIW I think we are going to need to make the definition and parsing of this metalanguage more rigorous, and that should happen sooner rather than later, before we get locked into weird implementation idiosyncrasies that are hard to back out.
In terms of strategy, I don't want to continue extending run-command's bespoke template language, but instead start implementing the common workflow language tool description which (a) will use Javascript instead of a custom expression language and (b) have a somewhat narrower scope (things like task parallelization will be handled at a different layer). (Eventually, if we're using tool description files in pipelines, then run-command in its current form may only be necessary support arv-run.)
Updated by Tim Pierce about 10 years ago
This LGTM enough to merge, thanks. The documentation makes me more confident that a new and unfamiliar user (i.e. me) could write a template based on this description, and knowing that we expect the common workflow language to replace this templating system makes me feel a lot more comfortable with going with it in its current state.
Until then we should be aggressive about writing tests for these templates. I'll add a separate story for that.
Updated by Anonymous about 10 years ago
- Status changed from In Progress to Resolved
Applied in changeset arvados|commit:788ecdf8085f5e69cd3dc960f510b49f11432cb3.