Feature #8707: Arvados job: download data from remote site into Keep - Tapestry - Arvados

Feature #8707

open

Arvados job: download data from remote site into Keep

Added by Tom Clegg about 9 years ago. Updated almost 6 years ago.

Status: In Progress
Priority: Normal
Assigned To: Tom Clegg
Category: Third party integration
Target version: Interpretation automation
Start date: 03/15/2016
Due date:
% Done: 100%
Estimated time: (Total: 0.00 h)
Story points: 1.0

Description

...to satisfy an API request like #8688

Implementation

One task per requested file -- this avoids retrying everything whenever one file fails

Use writable FUSE (task output dir)

Run wget or curl, probably with some sort of batch-progress flag
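
The one-task-per-file idea can be illustrated with a minimal Python 3 sketch. It does not use the Arvados task API; `run_per_file_tasks`, the `download` callable, and `max_attempts` are hypothetical names chosen for illustration, and the only point shown is that a failure is retried for that file alone.

```python
from typing import Callable, Dict, Iterable, List


def run_per_file_tasks(
    urls: Iterable[str],
    download: Callable[[str], None],
    max_attempts: int = 3,  # assumption: not specified in the ticket
) -> Dict[str, int]:
    """Run one independent download task per URL.

    Returns a mapping of URL to the number of attempts it needed.
    Raises RuntimeError naming the URLs that never succeeded.
    """
    attempts_used: Dict[str, int] = {}
    failed: List[str] = []
    for url in urls:
        for attempt in range(1, max_attempts + 1):
            try:
                download(url)
            except OSError:
                continue  # retry this file only; other files are unaffected
            attempts_used[url] = attempt
            break
        else:
            # The inner loop ran out of attempts without a success.
            failed.append(url)
    if failed:
        raise RuntimeError("downloads failed: " + ", ".join(failed))
    return attempts_used
```

In the actual job, retrying would presumably be left to the task scheduler rather than an in-process loop; the sketch only shows why per-file granularity keeps a single bad file from forcing every download to be repeated.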

Subtasks 3 (0 open — 3 closed)

Related issues 1 (0 open — 1 closed)

Updated by Tom Clegg about 9 years ago

Description updated (diff)

Updated by Tom Clegg about 9 years ago

Story points set to 1.0

Updated by Tom Clegg about 9 years ago

Category set to Third party integration
Assigned To set to Tom Clegg

Updated by Tom Clegg about 9 years ago

8707-download @ db7bd2a8f4981c079ced6c09646ac297790326ae

failure due to successful download with right size but wrong md5sum: https://crvr.se/su92l-8i9sb-ful8qhzowkshfoq
success: https://crvr.se/su92l-8i9sb-aizw0cupzxafowf
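
The first test job above fails because the size matches but the MD5 digest does not. A minimal Python 3 sketch of that kind of check follows; `verify_download` and its parameters are hypothetical names, not taken from the 8707-download branch, and it checks an in-memory byte string for brevity.

```python
import hashlib


def verify_download(data: bytes, expected_size: int, expected_md5: str) -> None:
    """Raise ValueError if the size or the MD5 digest does not match."""
    if len(data) != expected_size:
        raise ValueError(
            "size mismatch: got %d, expected %d" % (len(data), expected_size)
        )
    got_md5 = hashlib.md5(data).hexdigest()
    # hexdigest() is lowercase, so normalize the expected value too.
    if got_md5 != expected_md5.lower():
        raise ValueError(
            "md5 mismatch: got %s, expected %s" % (got_md5, expected_md5)
        )
```

Checking the size first gives a cheaper and more specific error for truncated downloads; the digest check then catches content that has the right length but the wrong bytes.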

Updated by Brett Smith about 9 years ago

Reviewing db7bd2a. This is good to merge; these are all just "idiomatic Python" nits that you can take or leave as you like.

cStringIO provides the same API as StringIO with better performance. You can switch to it with a one-line change by changing your import to import cStringIO as StringIO.

It seems a little odd that you open the URL, then check its scheme. Maybe move that up? You might also consider saving the result of urlparse.urlparse() and reusing it, but that's really small potatoes.
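
One way to apply both suggestions is sketched below in Python 3, where `urlparse` lives in `urllib.parse` (the reviewed code appears to use the Python 2 `urlparse` module). The function name `parse_download_url` and the set of allowed schemes are assumptions for illustration, not taken from the branch.

```python
from urllib.parse import ParseResult, urlparse

# Assumption: the job only accepts these schemes; the ticket does not list them.
ALLOWED_SCHEMES = ("http", "https")


def parse_download_url(url: str) -> ParseResult:
    """Parse the URL once and reject unsupported schemes before any I/O."""
    parsed = urlparse(url)
    if parsed.scheme not in ALLOWED_SCHEMES:
        raise ValueError("unsupported URL scheme: %r" % parsed.scheme)
    # Callers can reuse the parsed result (netloc, path, ...) instead of
    # calling urlparse() again.
    return parsed
```

Calling this before opening the connection means a bad scheme is rejected without any network traffic.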

Your download loop can be written a little DRYer as:

    with open(outpath, 'w') as outfile:
        for chunk in iter(lambda: httpresp.read(BUFFER_SIZE), ''):
            outfile.write(chunk)
            got_md5.update(chunk)
        got_size = outfile.tell()
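
For reference, here is a self-contained Python 3 sketch of the same two-argument `iter()` idiom. The snippet above is Python 2 style, where `read()` returns `str` and the sentinel is `''`; in Python 3 a binary stream returns `bytes`, so the sentinel must be `b''`. The name `copy_with_md5` and the `BUFFER_SIZE` value are assumptions for illustration.

```python
import hashlib
import io
from typing import BinaryIO, Tuple

BUFFER_SIZE = 1 << 20  # assumption: 1 MiB; the real value is not shown in the thread


def copy_with_md5(
    src: BinaryIO, dst: BinaryIO, buffer_size: int = BUFFER_SIZE
) -> Tuple[int, str]:
    """Copy src to dst in chunks; return (bytes copied, MD5 hex digest)."""
    got_md5 = hashlib.md5()
    got_size = 0
    # iter(callable, sentinel) keeps calling read() until it returns b'',
    # which a binary stream does at end of input.
    for chunk in iter(lambda: src.read(buffer_size), b""):
        dst.write(chunk)
        got_md5.update(chunk)
        got_size += len(chunk)
    return got_size, got_md5.hexdigest()


if __name__ == "__main__":
    source = io.BytesIO(b"example payload")
    target = io.BytesIO()
    print(copy_with_md5(source, target, buffer_size=4))
```

Counting bytes with `len(chunk)` instead of calling `tell()` lets the helper work with destinations that are not seekable.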

Thanks.

Updated by Tom Clegg about 9 years ago

All of that sounds better, thanks. I was torn between the two uglies -- while-True-if-cond-break and duplicating the read() -- the iter solution is just what I was wishing for.

Now at aee617c with new test jobs: