Bug #15148
keep-balance incorrectly accounts for blocks in collections with null `modified_at` field
Status: closed (100%)
Description
In certain circumstances, when collections have a null `modified_at` field (which should normally never happen), keep-balance can mark blocks for deletion even though there are still references to them. A check for this will be included in 1.4 and 1.3.2: keep-balance will refuse to run at all if this situation is detected.
An API server bug that causes the `modified_at` field to be null was introduced by #13561 and released with Arvados v1.3. The API server bug was fixed as a side effect of #14595, and this fix will also be included in 1.4 and 1.3.2.
Related: Recovering lost data
Updated by Tom Morris over 5 years ago
- Wherever keep-balance is running: systemctl stop keep-balance (or otherwise disable keep-balance) until a fixed version is released and installed
- On every keepstore node: add TrashCheckInterval: 87600h to /etc/arvados/keepstore/keepstore.yml, and then systemctl restart keepstore (or equivalent) to avoid deleting any trashed blocks that are still recoverable
- Install a fixed version of keep-balance and arvados-api-server (≥1.3.2 or ≥1.4)
- Enable keep-balance
- Use the keepstore untrash API to recover any blocks that were trashed but not yet deleted (details TBD)
- Delete/revert TrashCheckInterval in keepstore configs and restart keepstore processes
Any system with containers that finished while running Arvados 1.3 will need a migration to fix the collections table for the output collections of those containers.
The following fixes have been made:
- keep-balance now refuses to run if any `modified_at` field is null: 15112-dont-trash-needed-replica @ 8b9ea19ebde9f4653d6adc145ef6fcbd36d2aace
- Database migration to repair any null `modified_at`s:
  - 15112-migration @ 243130b8c5a8558d6bd132d4a062483be93ef7bc
  - 15112-migration-1.3 @ a2c3d1ffd627974c8daa3bff300c0ad96f07d3a0
  - Update migration to handle empty database case @ 7bc8e8add
- Cherry-pick of #14595: git cherry-pick 2aa58f31ac8fc696361214a05ab9ba75a5140b08 4e32f0b140ec0ec7f96c1f9eaae00950c176ff03
Updated by Tom Clegg over 5 years ago
- Install a fixed version of keep-balance and arvados-api-server (≥1.3.2 or ≥1.4)
- Enable keep-balance
- Use the keepstore untrash API to recover any blocks that were trashed but not yet deleted (details TBD)
Details: Untrashing lost blocks
Any system with containers that finished while running Arvados 1.3 will need a migration to fix the collections table for the output collections of those containers.
This migration runs during the upgrade to arvados ≥1.3.2 or ≥1.4.
Updated by Tom Clegg over 5 years ago
15148-lost-collection-pdh @ 6c5852fb18c0b6422c079c6fee66891a273ad089 -- https://ci.curoverse.com/view/Developer/job/developer-run-tests/1219/
Updated by Peter Amstutz over 5 years ago
Tom Clegg wrote:
15148-lost-collection-pdh @ 6c5852fb18c0b6422c079c6fee66891a273ad089 -- https://ci.curoverse.com/view/Developer/job/developer-run-tests/1219/
Waiting on jenkins but this LGTM.
Updated by Tom Clegg over 5 years ago
This change is in master (48d2a213b, destined for 1.4) and 1.3-dev (675237bec, destined for 1.3.3).
Each line of the "lost blocks" file will now be "BLOCKHASH PDH1 PDH2 ..." where PDH* are all collections that refer to BLOCKHASH. From here you can get a complete list of affected collection PDHs:
cut -d" " -f2- < lost-blocks.txt | tr " " "\n" | sort -u > lost-collections.txt
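If you also need to know which lost blocks each collection refers to, the same file can be inverted. The following is a minimal Python sketch, assuming only the "BLOCKHASH PDH1 PDH2 ..." line format described above; the function name parse_lost_blocks is hypothetical and not part of Arvados.

```python
from collections import defaultdict


def parse_lost_blocks(lines):
    """Map each collection PDH to the set of lost block hashes it references.

    Each input line is expected to look like "BLOCKHASH PDH1 PDH2 ...".
    """
    blocks_by_pdh = defaultdict(set)
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        block_hash, pdhs = fields[0], fields[1:]
        for pdh in pdhs:
            blocks_by_pdh[pdh].add(block_hash)
    return dict(blocks_by_pdh)


if __name__ == "__main__":
    # lost-blocks.txt is the file name used in the command above.
    with open("lost-blocks.txt") as f:
        for pdh, blocks in sorted(parse_lost_blocks(f).items()):
            print(pdh, len(blocks))
```

Run as a script, this prints each affected collection PDH followed by the number of lost blocks it references. The sorted keys are the same set of PDHs that the cut/tr/sort pipeline writes to lost-collections.txt.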
Updated by Tom Clegg over 5 years ago
- Status changed from In Progress to Resolved
Updated by Tom Morris over 5 years ago
- Related to Feature #13561: [API] Store, and add APIs to retrieve, previous versions of collection objects added