Bug #21302
openContainer created with corrupted mounts
Description
A user's container request could not run because the container created for it had corrupted mounts, so Crunch could not dispatch it. There were ~2,250 mounts of individual files. This container was near the end of a workflow, so all its inputs came from a single collection, the output of a previous step.
The mounts were corrupted in different ways:
- Some had their path changed, so the mount for a given file would be backed by a different file.
- Some became the empty array [].
- Some became the literal object {"for": "to_hash"}.
We haven't yet identified a pattern for which mounts got corrupted which way.
We did check the database record and confirmed the mounts were corrupted in the container's mounts column. This suggests that the problem happened during container creation, and isn't an issue with, e.g., serializing the record to JSON.
See child ticket for supporting documentation.
Subtasks
Updated by Tom Clegg over 2 years ago
- Related to Bug #21314: a-d-c should cancel a container if it can't be loaded added
Updated by Peter Amstutz about 2 years ago
- Target version changed from To be scheduled to Future
Updated by Peter Amstutz over 1 year ago
This happened again.
syslog:Oct 1 00:00:11 ip-172-25-144-184 arvados-dispatch-cloud[1414811]: {"ClusterID":"xngs1","ContainerUUID":"xxxxx-dz642-9vlpjyz07bs8jpj","PID":1414811,"error":"json: cannot unmarshal array into Go struct field Container.mounts of type arvados.Mount","level":"warning","msg":"error getting mounts","time":"2024-10-01T00:00:11.229977070Z"}
"/keep/9675bc74facc27e95dbd682f50014f1c+1654":{
"kind":"collection",
"portable_data_hash":"9675bc74facc27e95dbd682f50014f1c+1654"
},
"/keep/9682a92f38b185d61899911be9cae755+1596":[],
"/keep/9694fecee4a8866f68a6378ccc479fbf+940":{
"for":"to_hash"
},
"/keep/96c1bcaeead7f643209ee3e585d687fd+940":{
"for":"to_hash"
},
"/keep/96d31dba7b507bc51e83d37b20d57719+1696":[],
"/keep/96d90ebb580218cd2e1ad6d2e5b288de+1762":{
"kind":"collection",
"portable_data_hash":"96d90ebb580218cd2e1ad6d2e5b288de+1762"
},
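For reference, the corruption patterns in the excerpt above can be classified mechanically. This is a hypothetical helper (not Arvados code) run against a trimmed copy of the excerpt; a well-formed mount is a JSON object with a "kind" key, while the corrupted entries were the empty array and the literal {"for": "to_hash"} object:

```ruby
require 'json'

# Hypothetical helper (not part of Arvados): classify a mount entry by the
# corruption patterns observed in this ticket.
def classify_mount(value)
  return :empty_array if value == []
  return :for_to_hash if value == {"for" => "to_hash"}
  return :ok if value.is_a?(Hash) && value.key?("kind")
  :unknown
end

# Trimmed copy of the corrupted excerpt from the mounts column above.
mounts = JSON.parse(<<~MOUNTS)
  {
    "/keep/9675bc74facc27e95dbd682f50014f1c+1654": {
      "kind": "collection",
      "portable_data_hash": "9675bc74facc27e95dbd682f50014f1c+1654"
    },
    "/keep/9682a92f38b185d61899911be9cae755+1596": [],
    "/keep/9694fecee4a8866f68a6378ccc479fbf+940": {"for": "to_hash"}
  }
MOUNTS

counts = Hash.new(0)
mounts.each_value { |v| counts[classify_mount(v)] += 1 }
# counts => {:ok=>1, :empty_array=>1, :for_to_hash=>1}
```

Note that the third corruption pattern (a path pointing at the wrong file) can't be detected this way, since the entry itself is still well-formed.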
Updated by Peter Amstutz over 1 year ago
I'm wondering if there's something weird in the Ruby serialization or deserialization that causes them to think that these particular objects are something else.
If so, it should be possible to reproduce it by submitting a container request whose mounts include examples of the fields that are being corrupted.
Updated by Peter Amstutz over 1 year ago
So, simply adding a couple of mounts with the PDH of mounts that were corrupted before doesn't trigger the bug.
The bug has only been observed when there are large numbers of mounts (1000+? 2000+?).
I feel like the {"for":"to_hash"} strongly implicates something weird happening in Ruby land; I just can't see that value getting added by Go or Postgres.
Updated by Brett Smith over 1 year ago
- Target version changed from Future to Development 2024-10-09 sprint
- Assigned To set to Brett Smith
Updated by Brett Smith over 1 year ago
- Status changed from New to In Progress
After a handful of attempts and various tweaks to get as close as possible to the original, I have not been able to reproduce this on pirca (pirca-j7d0g-042lw4miq9kw5q5).
The user clusters where this has happened are running Ubuntu 20.04, while pirca is running Debian 11. It's possible that the bug only happens with a specific JSON gem and/or C library.
Updated by Brett Smith over 1 year ago
bundle list gives byte-identical output on both pirca and a user cluster where we saw this bug. Which, good, it's supposed to! That rules out a gem difference (but not a stdlib difference).
On Ubuntu 20.04: ruby 2.7.0p0 (2019-12-25 revision 647ee6f091) [x86_64-linux-gnu]
On pirca via RVM: ruby 2.7.5p203 (2021-11-24 revision f69aeb8314) [x86_64-linux]
Ruby 2.7.1 did include a security fix for the json gem. However, it seems like that fix got backported to Ubuntu: it has the json 2.3.0 gem with the fix.
Neither the json gem nor the oj gem (which should be what multi_json selects according to the documentation) seem to have any external dependencies. ldd from pirca:
% pwd
/usr/local/rvm/rubies/default/lib/ruby/2.7.0/x86_64-linux/json/ext
% ldd *.so /var/www/arvados-api/shared/vendor_bundle/ruby/2.7.0/gems/oj-3.9.2/ext/oj/oj.so
generator.so:
linux-vdso.so.1 (0x00007ffe04922000)
libruby.so.2.7 => /usr/local/rvm/rubies/ruby-2.7.5/lib/libruby.so.2.7 (0x00007fbc29090000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007fbc28f44000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fbc28d70000)
libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x00007fbc28d53000)
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007fbc28d31000)
librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007fbc28d27000)
libgmp.so.10 => /lib/x86_64-linux-gnu/libgmp.so.10 (0x00007fbc28ca4000)
libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007fbc28c9e000)
libcrypt.so.1 => /lib/x86_64-linux-gnu/libcrypt.so.1 (0x00007fbc28c63000)
/lib64/ld-linux-x86-64.so.2 (0x00007fbc293ee000)
parser.so:
linux-vdso.so.1 (0x00007ffdb95b9000)
libruby.so.2.7 => /usr/local/rvm/rubies/ruby-2.7.5/lib/libruby.so.2.7 (0x00007f3cba6a0000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f3cba554000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f3cba380000)
libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x00007f3cba363000)
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f3cba341000)
librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007f3cba337000)
libgmp.so.10 => /lib/x86_64-linux-gnu/libgmp.so.10 (0x00007f3cba2b4000)
libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f3cba2ae000)
libcrypt.so.1 => /lib/x86_64-linux-gnu/libcrypt.so.1 (0x00007f3cba273000)
/lib64/ld-linux-x86-64.so.2 (0x00007f3cba9fc000)
/var/www/arvados-api/shared/vendor_bundle/ruby/2.7.0/gems/oj-3.9.2/ext/oj/oj.so:
linux-vdso.so.1 (0x00007ffcd1bd1000)
libruby.so.2.7 => /usr/local/rvm/rubies/ruby-2.7.5/lib/libruby.so.2.7 (0x00007f3fc51fd000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f3fc50b1000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f3fc4edd000)
libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x00007f3fc4ec0000)
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f3fc4e9e000)
librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007f3fc4e94000)
libgmp.so.10 => /lib/x86_64-linux-gnu/libgmp.so.10 (0x00007f3fc4e11000)
libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f3fc4e0b000)
libcrypt.so.1 => /lib/x86_64-linux-gnu/libcrypt.so.1 (0x00007f3fc4dd0000)
/lib64/ld-linux-x86-64.so.2 (0x00007f3fc55b2000)
Updated by Brett Smith over 1 year ago
I am pretty sure we are just using Oj:
- Our Gemfile declares a dependency on multi_json, but we don't actually use it anywhere.
- source:services/api/config/initializers/oj_mimic_json.rb does what you'd expect: it causes Oj to slot itself in as a JSON gem replacement, including with Rails. It runs the incantation documented by Oj for this.
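The core of that initializer is Oj's documented "mimic" incantation, which replaces the stdlib JSON module's implementation with Oj's. This is a sketch of the mechanism, not the actual oj_mimic_json.rb contents, and it is guarded with a rescue so it also runs where the oj gem isn't installed:

```ruby
begin
  require 'oj'
  Oj.mimic_JSON  # stdlib JSON calls are now served by Oj
rescue LoadError
  # oj not available; fall back to the plain stdlib json gem
end
require 'json'

# In either mode the output for simple data should be identical, which is
# exactly why a subtle Oj bug would be hard to spot from the outside.
out = JSON.generate({"kind" => "collection"})
```

The point of mimicking is that callers can't tell the difference, so any divergence would only show up as data corruption like the kind seen here, not as an obvious API error.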
Updated by Brett Smith over 1 year ago
Stray thoughts:
Another difference between the clusters is PostgreSQL version. Even if the client code is the same, it may go down different code paths based on what it believes the database server supports.
The container mounts column is declared serialize :mounts, Hash. Another possibility is that JSON is working fine, but the bug is in Rails' serialization process. (Again, same version of Rails across clusters, but maybe different code paths.)
Another "simple" possibility is that this bug depends on a race condition and I haven't had the (bad?) luck to trigger it yet.
Updated by Peter Amstutz over 1 year ago
- Target version changed from Development 2024-10-09 sprint to Development 2024-10-23 sprint
Updated by Peter Amstutz over 1 year ago
PA to find Oj bug report from 2017 that had kind of similar behavior
Updated by Peter Amstutz over 1 year ago
This doesn't describe exactly the same behavior but it is kind of similar, it is blamed on the 'curb' gem:
https://github.com/ohler55/oj/issues/330
This one is blamed on faulty hardware:
Updated by Brett Smith over 1 year ago
Peter Amstutz wrote in #note-14:
This doesn't describe exactly the same behavior but it is kind of similar, it is blamed on the 'curb' gem:
Yeah, this is interesting but it's hard to know what to do with this.
- We don't use the curb gem specifically (it was removed from Gemfile.lock in Jan 2020 in 14d53b3386296eb9b4d25c91cd9d40a5dfb52b75).
- It's possible another gem we use is causing a similar bad interaction with GC, but I don't know enough about Ruby internals to know how we'd start investigating that.
- Similarly, I don't know how we'd implement the proposed workaround of disabling GC since this happens deep in the Rails stack. Maybe we could write our own wrappers around the wrappers that get installed in the "use Oj" initializer. But that might also have serious performance implications.
If it is something like this, that does suggest there's probably a race condition involved.
Updated by Peter Amstutz over 1 year ago
- Target version changed from Development 2024-10-23 sprint to Future
Updated by Lucas Di Pentima 6 months ago
Today we got a report from a user who was getting 500 errors when listing container requests. The cause was a container request record with its output_storage_classes field set to {"for": "to_hash"}, which the arvados-controller service couldn't unmarshal into a []string Go struct field.
Note: The record was created (and updated) on Feb 13th, 2024. I think this might be relevant if the root cause is a buggy library.
Updated by Brett Smith 6 months ago
Rough theory for how this could be caused by #23152:
- User submits a container request with malformed mounts.
  - Note this request only returns an error when controller tries to reserialize it. This means it can return a 200 OK if it has a select parameter that excludes mounts.
- A later update/lock/??? request processes mounts in a way that assumes it is well-formed but happens to operate with other types.
  - Note this could skip a lot of validation if this request doesn't modify mounts itself.
  - Note this could actually be a series of updates, not just one.
My problem is that, at this point, there are so many possibilities for how the data could be malformed, and so many code paths it could go through, that it's hard to know where to look. I could be right and still try to reproduce this lots of different ways with no luck.
One piece of evidence that the explanation could be this rather than a race condition or similar: we've only seen it on a single user install. This could also be explained by a site-specific software stack or configuration, but their custom client code is at least plausible. If it was a "pure" race condition I'd expect to see it on other installs.
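The validation direction of #23152 can be sketched like this (hypothetical code, not the actual implementation): reject any mounts value whose entries aren't hashes with a string "kind". A check of this shape would have caught two of the three corruption patterns at submission time:

```ruby
# Hypothetical validator sketching the #23152 approach: every mount entry
# must be a Hash with a String "kind". The empty-array and {"for":"to_hash"}
# corruption patterns both fail this check; a changed-but-valid path would not.
def valid_mounts?(mounts)
  mounts.is_a?(Hash) && mounts.all? do |path, mount|
    path.is_a?(String) && mount.is_a?(Hash) && mount["kind"].is_a?(String)
  end
end

valid_mounts?("/keep/x" => {"kind" => "collection"})  # => true
valid_mounts?("/keep/x" => [])                        # => false
valid_mounts?("/keep/x" => {"for" => "to_hash"})      # => false
```

Running a validator like this on both create and update would also narrow down the malformed-client theory: if the bad data arrives in the request body, it gets rejected at the door instead of surfacing later as a Go unmarshal error.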
Updated by Brett Smith 6 months ago
- Related to Bug #23152: Validate schema of complex fields added