Bug #3168
closed[Crunch] Source git repository for a job should not be fetched into internal arvados repository over and over again even when requested node profile is not available
100%
Description
It fetches the git repository into the Arvados internal repository even if the desired commit is already available in the internal repository. It should test to see if the commit is available in the internal repository, and only run "git fetch-pack" when the commit is not available. This will also suppress the "fetch over and over" behavior.
Updated by Tom Clegg over 10 years ago
- Target version set to Arvados Future Sprints
Updated by Peter Amstutz over 10 years ago
- Subject changed from Git repository for jobs is deployed over and over again even when requested node is busy to Source git repository for a job is fetched into internal arvados repository over and over again even when requested node is busy
- Description updated (diff)
Updated by Brett Smith over 10 years ago
- Description updated (diff)
(Split the second half of this bug into #3278, since they're unrelated.)
Updated by Tom Clegg over 10 years ago
- Subject changed from Source git repository for a job is fetched into internal arvados repository over and over again even when requested node is busy to [Crunch] Source git repository for a job should not be fetched into internal arvados repository over and over again even when requested node profile is not available
Updated by Tom Clegg over 10 years ago
- Target version changed from Arvados Future Sprints to 2014-10-08 sprint
Updated by Peter Amstutz over 10 years ago
- Status changed from New to In Progress
- Assigned To set to Peter Amstutz
Updated by Tom Clegg over 10 years ago
At 9a15397
- It would be nice to avoid repeating that "print cmd; print cmd output; if error then sleep 1 untake next" template 3 times. But I guess I can live with it for now.
- Rather than running git N times for N scheduling attempts, could we just run it once, and remember "script_version X has been fetched" in a hash table? (I thought we had already done something similar for API tokens, but I don't see it -- is that still broken too?)
- At
unless tag_rev == job.script_version
-- If a script_version changes after an unsuccessful bid to start the job, does that leave the job stuck in the queue because the git-tag will always fail-and-retry from now on? Would it be better to fail the job here?- Another alternative would be to force-update the tag, but I think that creates a race bug where the winner of the "start the job" race uses version A but the git repo ends up tagged at version B by the loser.
- Perhaps a check for "is the revision tagged correctly" should be done in crunch-job, after the job has been locked, to reliably detect races.
Updated by Peter Amstutz over 10 years ago
Tom Clegg wrote:
At 9a15397
- It would be nice to avoid repeating that "print cmd; print cmd output; if error then sleep 1 untake next" template 3 times. But I guess I can live with it for now.
Yea, I'm not thrilled by the repetition either, but it would require some more aggressive refactoring that I was willing to do in the context of this (fairly narrow) bug fix.
- Rather than running git N times for N scheduling attempts, could we just run it once, and remember "script_version X has been fetched" in a hash table? (I thought we had already done something similar for API tokens, but I don't see it -- is that still broken too?)
Although if you have multiple crunch jobs, you'll still be calling git multiple times. While forking to call git is kind of heavyweight compared to doing a hash table lookup, keeping a cache means more code and introduces the "multiple sources of truth" problem.
- At
unless tag_rev == job.script_version
-- If a script_version changes after an unsuccessful bid to start the job, does that leave the job stuck in the queue because the git-tag will always fail-and-retry from now on? Would it be better to fail the job here?
You're right, it would be better in the case of a tag collision to fail instead of linger. Fixed in 1d8f975.
A tag collision should never happen unless someone is messing around with the internal git repository and/or the job record. I wrote up #3988 about protecting the job record better. I also added #4003 since crunch dispatch errors (such as git errors) don't go to the logs table so are not visible to the user (the job would just fail silently).
Updated by Tom Clegg over 10 years ago
Peter Amstutz wrote:
Yea, I'm not thrilled by the repetition either, but it would require some more aggressive refactoring that I was willing to do in the context of this (fairly narrow) bug fix.
Fair enough
Although if you have multiple crunch jobs, you'll still be calling git multiple times. While forking to call git is kind of heavyweight compared to doing a hash table lookup, keeping a cache means more code and introduces the "multiple sources of truth" problem.
Given that we're talking about "has a sha1 been added to this repo?" and the whole point of that repo is that commits, once added, do not disappear, I figured a hash would be extremely safe. (And by "once" I meant one time per commit sha1, not one time ever!)
But, OK. Don't need to hold up the bugfix for it.
- At
unless tag_rev == job.script_version
-- If a script_version changes after an unsuccessful bid to start the job, does that leave the job stuck in the queue because the git-tag will always fail-and-retry from now on? Would it be better to fail the job here?You're right, it would be better in the case of a tag collision to fail instead of linger. Fixed in 1d8f975.
A tag collision should never happen unless someone is messing around with the internal git repository and/or the job record. I wrote up #3988 about protecting the job record better. I also added #4003 since crunch dispatch errors (such as git errors) don't go to the logs table so are not visible to the user (the job would just fail silently).
LGTM, except: I think save!
(2×) is not correct (even though we do use it elsewhere). Every time we use save!
we risk exiting crunch-dispatch and losing supervision/logging of whatever other jobs happen to be running right now. Is this error really worth crashing on? Where possible, we should stick with save or recover
-- even if recover is just "log something and continue" -- instead of save or crash
.
Updated by Anonymous over 10 years ago
- Status changed from In Progress to Resolved
- % Done changed from 0 to 100
Applied in changeset arvados|commit:c2c38069d28fc68dea6e1b2cb0d5f4f36e1ef03f.