Scaling things » History » Version 1
Tom Clegg, 06/07/2023 03:02 PM
h1. Scaling things

In principle, an Arvados cluster with access to sufficient hardware/cloud resources should be able to handle arbitrarily large datasets, computations, and interactive usage. In practice, there are limitations. This wiki aims to catalog limitations and strategies to address them.

h2. Collection size

Collections with a large number of files

* Slowness due to large manifest being sent over the network in order to load/update a single file
* High memory usage (in several components) due to large manifest

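To give a feel for why file count drives manifest size, here is a minimal sketch that builds a synthetic single-stream manifest and measures its length. It follows the documented manifest layout (stream name, block locators of the form @md5+size@, then @position:size:filename@ tokens), but the helper names, the file names, and the one-block-per-file simplification are assumptions made for illustration. It is not how any Arvados component builds manifests, and it ignores filename escaping and locator hints.

```python
import hashlib


def stream_line(stream_name, files):
    """Build one manifest stream line from a {filename: bytes} mapping.

    Simplification (assumption): every file is stored as its own block,
    and filenames contain no characters that would need escaping.
    """
    locators = []
    file_tokens = []
    pos = 0
    for name, data in files.items():
        # Block locator: md5 hex digest, "+", block size in bytes.
        locators.append(f"{hashlib.md5(data).hexdigest()}+{len(data)}")
        # File token: offset within the stream, file size, file name.
        file_tokens.append(f"{pos}:{len(data)}:{name}")
        pos += len(data)
    return " ".join([stream_name] + locators + file_tokens) + "\n"


def synthetic_manifest(n_files, file_size=1024):
    """Return manifest text for n_files distinct files of file_size bytes."""
    files = {
        f"sample_{i:08d}.dat": i.to_bytes(8, "big") * (file_size // 8)
        for i in range(n_files)
    }
    return stream_line(".", files)


if __name__ == "__main__":
    for n in (1_000, 10_000, 100_000):
        size = len(synthetic_manifest(n))
        print(f"{n:>7} files: {size:>10} bytes of manifest text, "
              f"{size / n:.1f} bytes per file")
```

Under these assumptions the manifest grows roughly linearly with the number of files, so any operation that transfers or parses the whole manifest to touch one file pays a cost proportional to the size of the collection.
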
h2. Total data size

Large number of blocks

* High memory usage in keep-balance
* High garbage collection / replication adjustment latency due to long keep-balance iterations
* High sensitivity to back-end errors (a back-end error while indexing can abort an entire keep-balance iteration)

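As a rough way to reason about how block count scales with stored data, the following back-of-envelope sketch derives a lower bound on the number of blocks from the 64 MiB maximum Keep block size. The @avg_fill@, @replicas@ and @bytes_per_replica_entry@ parameters are hypothetical knobs for illustration only; the real per-block memory cost in keep-balance is not stated on this page and should be measured, not taken from this sketch.

```python
import math

# Maximum size of a single Keep block (64 MiB).
KEEP_MAX_BLOCK_SIZE = 64 * 2**20


def min_block_count(total_bytes, avg_fill=1.0):
    """Lower bound on block count for total_bytes of stored data.

    avg_fill is a hypothetical average block fullness (1.0 = every block
    is a full 64 MiB); real clusters usually have many smaller blocks.
    """
    return math.ceil(total_bytes / (KEEP_MAX_BLOCK_SIZE * avg_fill))


def index_memory_estimate(total_bytes, replicas=2,
                          bytes_per_replica_entry=100, avg_fill=1.0):
    """Very rough in-memory index size, in bytes.

    bytes_per_replica_entry is an assumed placeholder, not a measured
    figure for keep-balance.
    """
    blocks = min_block_count(total_bytes, avg_fill)
    return blocks * replicas * bytes_per_replica_entry


if __name__ == "__main__":
    pib = 2**50
    for fill in (1.0, 0.5, 0.1):
        blocks = min_block_count(pib, fill)
        mem_gib = index_memory_estimate(pib, avg_fill=fill) / 2**30
        print(f"1 PiB at {fill:.0%} average fill: {blocks:,} blocks, "
              f"~{mem_gib:.1f} GiB index (assumed entry size)")
```

The point of the sketch is only the proportionality: both the memory needed to hold an index of all blocks and the time needed to walk it grow with the number of blocks, which is at least total data size divided by 64 MiB.
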
h2. Container queue size

Large number of queued containers

* Higher dispatcher latency due to reloading the entire queue
* Excessive controller/rails/db load due to the dispatcher reloading the entire queue every N seconds
* Scheduling/prioritization effects when cloud services are limited (e.g., instance quota)
* High dispatcher memory use (a function of the number of queued plus running containers, not just the number running)

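The queue-reload cost described above can be put in rough numbers. This is a minimal sketch assuming a simplified model in which every poll fetches one record per queued or running container. The function name, the 10 second interval and the container counts are hypothetical, and the model does not describe the dispatcher's actual request pattern.

```python
def records_per_second(queued, running, poll_interval_s):
    """Average container records fetched per second when each poll
    reloads every queued and running container (simplified model)."""
    if poll_interval_s <= 0:
        raise ValueError("poll_interval_s must be positive")
    return (queued + running) / poll_interval_s


if __name__ == "__main__":
    # Hypothetical figures: 500 running containers, 10 second poll interval.
    for queued in (100, 10_000, 100_000):
        rate = records_per_second(queued, running=500, poll_interval_s=10)
        print(f"{queued:>7} queued: ~{rate:,.0f} container records/s")
```

In this model the load on controller/rails/db scales with the total number of queued and running containers even when the number of running containers stays fixed, which matches the observation that a long queue alone is enough to cause trouble.
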
Large number of running containers

* Lock contention due to cascading container/container request updates
* Controller/rails bottleneck causes container/log/output updates to take longer
* Interactive usage suffers when controller/rails is busy servicing many containers