Feature #12244

open

API server bulk transfers for keep-balance collection retrieval

Added by Joshua Randall over 7 years ago. Updated almost 2 years ago.

Status: New
Priority: Normal
Assigned To: -
Category: API
Target version: -
Start date: 09/13/2017
Due date:
% Done: 0%
Estimated time:
Story points: -
Release:
Release relationship: Auto

Description

In Issue 9998 (https://dev.arvados.org/issues/9998) we managed to improve performance of retrieving batches of collections via the API server by close to 50% by eliminating unnecessary steps in the processing of each collection record. However, retrieving data directly from postgres remains over 100x faster than it is via the API server, even after those improvements.

A full cycle of keep-balance on our system today takes ~13h (10M+ collections), while the data can be retrieved from postgres in 6.5 minutes:

# time echo "COPY (select * from collections) TO STDOUT (format text)" | psql -U arvados -w -h localhost arvados_production > /data/tmp/collections.dump

real    6m28.068s
user    0m46.619s
sys     0m45.584s
# wc -l /data/tmp/collections.dump
10075589 /data/tmp/collections.dump
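
To put the figures above on a common scale, here is a quick back-of-the-envelope calculation. It is only a sketch: it treats the "~13h" cycle as exactly 13 hours, and it compares the dump against the whole keep-balance cycle, which may include work other than collection retrieval.

```python
# Figures quoted in this issue.
ROWS = 10_075_589                # line count of the dump (wc -l)
DUMP_SECONDS = 6 * 60 + 28.068   # "real 6m28.068s"
CYCLE_SECONDS = 13 * 3600        # "~13h", treated as exactly 13h (assumption)
TARGET_SECONDS = 15 * 60         # the 15m goal suggested in this issue


def rows_per_second(rows: int, seconds: float) -> float:
    """Average throughput in rows per second."""
    return rows / seconds


dump_rate = rows_per_second(ROWS, DUMP_SECONDS)
cycle_rate = rows_per_second(ROWS, CYCLE_SECONDS)
target_rate = rows_per_second(ROWS, TARGET_SECONDS)

print(f"COPY dump:          {dump_rate:,.0f} rows/s")
print(f"keep-balance cycle: {cycle_rate:,.0f} rows/s")
print(f"15m target:         {target_rate:,.0f} rows/s")
print(f"dump vs. cycle:     {CYCLE_SECONDS / DUMP_SECONDS:.1f}x")
print(f"target vs. cycle:   {CYCLE_SECONDS / TARGET_SECONDS:.0f}x")
```

Under those assumptions the dump runs at roughly 26,000 rows/s against roughly 215 rows/s for the current cycle, a gap of about 120x, and a 15m cycle would need a little over 11,000 rows/s (52x the current rate).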

I'd expect some overhead from going through the API rather than dumping the database table directly, no matter what. Still, I suspect that a bulk transfer API of some kind that bypasses the ORM could bring the keep-balance cycle time down to under 15m (roughly 50x faster than today). I think that would be worth doing.
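
As a rough illustration of the client side of such a transfer, here is a minimal sketch of a streaming reader. It assumes (the issue does not specify this) that a bulk endpoint would stream rows in the same PostgreSQL COPY text format as the dump above. The names `unescape_field` and `iter_rows` are hypothetical, and the sample row is made up.

```python
import io
from typing import Iterator, List, Optional, TextIO

# Single-character backslash escapes used by the COPY text format.
# Octal (\NNN) and hex (\xNN) escapes are NOT handled here; the PostgreSQL
# documentation states that COPY TO does not emit them.
_ESCAPES = {"b": "\b", "f": "\f", "n": "\n", "r": "\r",
            "t": "\t", "v": "\v", "\\": "\\"}


def unescape_field(raw: str) -> Optional[str]:
    # A field consisting of backslash + "N" is the default NULL marker.
    if raw == "\\N":
        return None
    if "\\" not in raw:
        return raw  # fast path: nothing to decode
    out = []
    i = 0
    while i < len(raw):
        ch = raw[i]
        if ch == "\\" and i + 1 < len(raw):
            nxt = raw[i + 1]
            # Any other backslashed character stands for itself.
            out.append(_ESCAPES.get(nxt, nxt))
            i += 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)


def iter_rows(stream: TextIO) -> Iterator[List[Optional[str]]]:
    # Tabs and newlines inside values are escaped, so each physical line
    # is one row and each raw tab is a column delimiter.
    for line in stream:
        yield [unescape_field(f) for f in line.rstrip("\n").split("\t")]


# Made-up sample: one row with a NULL column and an embedded newline.
sample = "example-uuid\t\\N\tline one\\nline two\n"
for row in iter_rows(io.StringIO(sample)):
    print(row)
```

Because rows are decoded one line at a time, memory use stays flat regardless of table size, which is the property a 10M+ row transfer would need.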

#1

Updated by Lucas Di Pentima almost 2 years ago

  • Release set to 60