Bug #21302
openContainer created with corrupted mounts
Description
A user's container request could not run because the container created for it had corrupted mounts, so Crunch could not dispatch it. There were ~2,250 mounts of individual files. This container was near the end of a workflow, so all its inputs came from a single collection, the output of a previous step.
The mounts were corrupted in different ways:
- Some had their path changed, so the mount for a given file would be backed by a different file.
- Some became the empty array [].
- Some became the literal object {"for": "to_hash"}.
We haven't yet identified a pattern for which mounts got corrupted which way.
We did check the database record and confirmed the mounts were corrupted in the container's mounts column. This suggests that the problem happened during container creation, and isn't an issue with, e.g., serializing the record to JSON.
See child ticket for supporting documentation.
Subtasks
Updated by Tom Clegg over 2 years ago
- Related to Bug #21314: a-d-c should cancel a container if it can't be loaded added
Updated by Peter Amstutz about 2 years ago
- Target version changed from To be scheduled to Future
Updated by Peter Amstutz over 1 year ago
This happened again.
syslog:Oct 1 00:00:11 ip-172-25-144-184 arvados-dispatch-cloud[1414811]: {"ClusterID":"xngs1","ContainerUUID":"xxxxx-dz642-9vlpjyz07bs8jpj","PID":1414811,"error":"json: cannot unmarshal array into Go struct field Container.mounts of type arvados.Mount","level":"warning","msg":"error getting mounts","time":"2024-10-01T00:00:11.229977070Z"}
"/keep/9675bc74facc27e95dbd682f50014f1c+1654":{
"kind":"collection",
"portable_data_hash":"9675bc74facc27e95dbd682f50014f1c+1654"
},
"/keep/9682a92f38b185d61899911be9cae755+1596":[],
"/keep/9694fecee4a8866f68a6378ccc479fbf+940":{
"for":"to_hash"
},
"/keep/96c1bcaeead7f643209ee3e585d687fd+940":{
"for":"to_hash"
},
"/keep/96d31dba7b507bc51e83d37b20d57719+1696":[],
"/keep/96d90ebb580218cd2e1ad6d2e5b288de+1762":{
"kind":"collection",
"portable_data_hash":"96d90ebb580218cd2e1ad6d2e5b288de+1762"
},
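For reference, the corruption patterns in the excerpt above can be classified mechanically. This is a hypothetical helper (not Arvados code) run against a trimmed copy of the excerpt; a well-formed mount is a JSON object with a "kind" key, while the corrupted entries were the empty array and the literal {"for": "to_hash"} object:

```ruby
require 'json'

# Hypothetical helper (not part of Arvados): classify a mount entry by the
# corruption patterns observed in this ticket.
def classify_mount(value)
  return :empty_array if value == []
  return :for_to_hash if value == {"for" => "to_hash"}
  return :ok if value.is_a?(Hash) && value.key?("kind")
  :unknown
end

# Trimmed copy of the corrupted excerpt from the mounts column above.
mounts = JSON.parse(<<~MOUNTS)
  {
    "/keep/9675bc74facc27e95dbd682f50014f1c+1654": {
      "kind": "collection",
      "portable_data_hash": "9675bc74facc27e95dbd682f50014f1c+1654"
    },
    "/keep/9682a92f38b185d61899911be9cae755+1596": [],
    "/keep/9694fecee4a8866f68a6378ccc479fbf+940": {"for": "to_hash"}
  }
MOUNTS

counts = Hash.new(0)
mounts.each_value { |v| counts[classify_mount(v)] += 1 }
# counts => {:ok=>1, :empty_array=>1, :for_to_hash=>1}
```

Note that the third corruption pattern (a path pointing at the wrong file) can't be detected this way, since the entry itself is still well-formed.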
Updated by Peter Amstutz over 1 year ago
I'm wondering if there's something weird in the Ruby serialization or deserialization that causes them to think that these particular objects are something else.
If so, it should be possible to reproduce it by submitting a container request whose mounts include examples of the fields that are being corrupted.
Updated by Peter Amstutz over 1 year ago
So, simply adding a couple of mounts with the PDH of mounts that were corrupted before doesn't trigger the bug.
The bug has only been observed when there are large numbers of mounts (1000+? 2000+?).
I feel like the {"for":"to_hash"} strongly implicates something weird happening in Ruby land; I just can't see that value getting added by Go or Postgres.
Updated by Brett Smith over 1 year ago
- Target version changed from Future to Development 2024-10-09 sprint
- Assigned To set to Brett Smith
Updated by Brett Smith over 1 year ago
- Status changed from New to In Progress
After a handful of attempts and various tweaks to get as close as possible to the original, I have not been able to reproduce this on pirca (pirca-j7d0g-042lw4miq9kw5q5).
The user clusters where this has happened are running Ubuntu 20.04, while pirca is running Debian 11. It's possible that the bug only happens with a specific JSON gem and/or C library.
Updated by Brett Smith over 1 year ago
bundle list gives byte-identical output on both pirca and a user cluster where we saw this bug. Which, good, it's supposed to! That rules out a gem difference (but not a stdlib difference).
On Ubuntu 20.04: ruby 2.7.0p0 (2019-12-25 revision 647ee6f091) [x86_64-linux-gnu]
On pirca via RVM: ruby 2.7.5p203 (2021-11-24 revision f69aeb8314) [x86_64-linux]
Ruby 2.7.1 did include a security fix for the json gem. However, it seems like that fix got backported to Ubuntu: it has the json 2.3.0 gem with the fix.
Neither the json gem nor the oj gem (which should be what multi_json selects according to the documentation) seem to have any external dependencies. ldd from pirca:
% pwd
/usr/local/rvm/rubies/default/lib/ruby/2.7.0/x86_64-linux/json/ext
% ldd *.so /var/www/arvados-api/shared/vendor_bundle/ruby/2.7.0/gems/oj-3.9.2/ext/oj/oj.so
generator.so:
linux-vdso.so.1 (0x00007ffe04922000)
libruby.so.2.7 => /usr/local/rvm/rubies/ruby-2.7.5/lib/libruby.so.2.7 (0x00007fbc29090000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007fbc28f44000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fbc28d70000)
libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x00007fbc28d53000)
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007fbc28d31000)
librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007fbc28d27000)
libgmp.so.10 => /lib/x86_64-linux-gnu/libgmp.so.10 (0x00007fbc28ca4000)
libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007fbc28c9e000)
libcrypt.so.1 => /lib/x86_64-linux-gnu/libcrypt.so.1 (0x00007fbc28c63000)
/lib64/ld-linux-x86-64.so.2 (0x00007fbc293ee000)
parser.so:
linux-vdso.so.1 (0x00007ffdb95b9000)
libruby.so.2.7 => /usr/local/rvm/rubies/ruby-2.7.5/lib/libruby.so.2.7 (0x00007f3cba6a0000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f3cba554000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f3cba380000)
libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x00007f3cba363000)
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f3cba341000)
librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007f3cba337000)
libgmp.so.10 => /lib/x86_64-linux-gnu/libgmp.so.10 (0x00007f3cba2b4000)
libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f3cba2ae000)
libcrypt.so.1 => /lib/x86_64-linux-gnu/libcrypt.so.1 (0x00007f3cba273000)
/lib64/ld-linux-x86-64.so.2 (0x00007f3cba9fc000)
/var/www/arvados-api/shared/vendor_bundle/ruby/2.7.0/gems/oj-3.9.2/ext/oj/oj.so:
linux-vdso.so.1 (0x00007ffcd1bd1000)
libruby.so.2.7 => /usr/local/rvm/rubies/ruby-2.7.5/lib/libruby.so.2.7 (0x00007f3fc51fd000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f3fc50b1000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f3fc4edd000)
libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x00007f3fc4ec0000)
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f3fc4e9e000)
librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007f3fc4e94000)
libgmp.so.10 => /lib/x86_64-linux-gnu/libgmp.so.10 (0x00007f3fc4e11000)
libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f3fc4e0b000)
libcrypt.so.1 => /lib/x86_64-linux-gnu/libcrypt.so.1 (0x00007f3fc4dd0000)
/lib64/ld-linux-x86-64.so.2 (0x00007f3fc55b2000)
Updated by Brett Smith over 1 year ago
I am pretty sure we are just using Oj:
- Our Gemfile declares a dependency on multi_json, but we don't actually use it anywhere.
- source:services/api/config/initializers/oj_mimic_json.rb does what you'd expect: it causes Oj to slot itself in as a JSON gem replacement, including with Rails. It runs the incantation documented by Oj for this.
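The core of that initializer is Oj's documented "mimic" incantation, which replaces the stdlib JSON module's implementation with Oj's. This is a sketch of the mechanism, not the actual oj_mimic_json.rb contents, and it is guarded with a rescue so it also runs where the oj gem isn't installed:

```ruby
begin
  require 'oj'
  Oj.mimic_JSON  # stdlib JSON calls are now served by Oj
rescue LoadError
  # oj not available; fall back to the plain stdlib json gem
end
require 'json'

# In either mode the output for simple data should be identical, which is
# exactly why a subtle Oj bug would be hard to spot from the outside.
out = JSON.generate({"kind" => "collection"})
```

The point of mimicking is that callers can't tell the difference, so any divergence would only show up as data corruption like the kind seen here, not as an obvious API error.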
Updated by Brett Smith over 1 year ago
Stray thoughts:
Another difference between the clusters is PostgreSQL version. Even if the client code is the same, it may go down different code paths based on what it believes the database server supports.
The container mounts column is declared serialize :mounts, Hash. Another possibility is that JSON is working fine, but the bug is in Rails' serialization process. (Again, same version of Rails across clusters, but maybe different code paths.)
Another "simple" possibility is that this bug depends on a race condition and I haven't had the (bad?) luck to trigger it yet.
Updated by Peter Amstutz over 1 year ago
- Target version changed from Development 2024-10-09 sprint to Development 2024-10-23 sprint
Updated by Peter Amstutz over 1 year ago
PA to find Oj bug report from 2017 that had kind of similar behavior
Updated by Peter Amstutz over 1 year ago
This doesn't describe exactly the same behavior but it is kind of similar, it is blamed on the 'curb' gem:
https://github.com/ohler55/oj/issues/330
This one is blamed on faulty hardware:
Updated by Brett Smith over 1 year ago
Peter Amstutz wrote in #note-14:
This doesn't describe exactly the same behavior but it is kind of similar, it is blamed on the 'curb' gem:
Yeah, this is interesting but it's hard to know what to do with this.
- We don't use the curb gem specifically (it was removed from Gemfile.lock in Jan 2020 in 14d53b3386296eb9b4d25c91cd9d40a5dfb52b75).
- It's possible another gem we use is causing a similar bad interaction with GC, but I don't know enough about Ruby internals to know how we'd start investigating that.
- Similarly, I don't know how we'd implement the proposed workaround of disabling GC since this happens deep in the Rails stack. Maybe we could write our own wrappers around the wrappers that get installed in the "use Oj" initializer. But that might also have serious performance implications.
If it is something like this, that does suggest there's probably a race condition involved.
Updated by Peter Amstutz over 1 year ago
- Target version changed from Development 2024-10-23 sprint to Future
Updated by Lucas Di Pentima 6 months ago
Today we got a report from a user who was getting 500 errors when listing container requests. The cause was a container request record with its output_storage_classes field set to {"for": "to_hash"}, which the arvados-controller service couldn't unmarshal into a []string Go struct field.
Note: The record was created (and updated) on Feb 13th, 2024. I think this might be relevant if the root cause is a buggy library.
Updated by Brett Smith 6 months ago
Rough theory for how this could be caused by #23152:
- User submits a container request with malformed mounts.
  - Note this request only returns an error when controller tries to reserialize it. This means it can return a 200 OK if it has a select parameter that excludes mounts.
- A later update/lock/??? request processes mounts in a way that assumes it is well-formed but happens to operate with other types.
  - Note this could skip a lot of validation if this request doesn't modify mounts itself.
  - Note this could actually be a series of updates, not just one.
My problem is that, at this point, there are so many possibilities for how the data could be malformed, and so many code paths it could go through, that it's hard to know where to look. I could be right and still try to reproduce this lots of different ways with no luck.
One piece of evidence that the explanation could be this rather than a race condition or similar: we've only seen it on a single user install. This could also be explained by a site-specific software stack or configuration, but their custom client code is at least plausible. If it was a "pure" race condition I'd expect to see it on other installs.
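The validation direction of #23152 can be sketched like this (hypothetical code, not the actual implementation): reject any mounts value whose entries aren't hashes with a string "kind". A check of this shape would have caught two of the three corruption patterns at submission time:

```ruby
# Hypothetical validator sketching the #23152 approach: every mount entry
# must be a Hash with a String "kind". The empty-array and {"for":"to_hash"}
# corruption patterns both fail this check; a changed-but-valid path would not.
def valid_mounts?(mounts)
  mounts.is_a?(Hash) && mounts.all? do |path, mount|
    path.is_a?(String) && mount.is_a?(Hash) && mount["kind"].is_a?(String)
  end
end

valid_mounts?("/keep/x" => {"kind" => "collection"})  # => true
valid_mounts?("/keep/x" => [])                        # => false
valid_mounts?("/keep/x" => {"for" => "to_hash"})      # => false
```

Running a validator like this on both create and update would also narrow down the malformed-client theory: if the bad data arrives in the request body, it gets rejected at the door instead of surfacing later as a Go unmarshal error.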
Updated by Brett Smith 6 months ago
- Related to Bug #23152: Validate schema of complex fields added