Bug #4956
closed[API] cannot create collections with very large manifests
86%
Description
From #4621:
A job generating a manifest consisting of one 28MB line of text could not be saved.
The pipeline log for failed job qr1hi-8i9sb-t6qzh3f0vkybftx says:
2015-01-08_21:59:08 qr1hi-8i9sb-t6qzh3f0vkybftx 15570 57 success in 1223 seconds 2015-01-08_21:59:08 qr1hi-8i9sb-t6qzh3f0vkybftx 15570 57 output ecf907baa43ecdcc5b74d29b2d1d593f+28007813+A42020335df3e977930d81cc7f2070b1ab35e050b@54c172a9 2015-01-08_21:59:08 qr1hi-8i9sb-t6qzh3f0vkybftx 15570 wait for last 0 children to finish 2015-01-08_21:59:08 qr1hi-8i9sb-t6qzh3f0vkybftx 15570 status: 62 done, 0 running, 0 todo 2015-01-08_21:59:08 qr1hi-8i9sb-t6qzh3f0vkybftx 15570 release job allocation 2015-01-08_21:59:08 qr1hi-8i9sb-t6qzh3f0vkybftx 15570 Freeze not implemented 2015-01-08_21:59:08 qr1hi-8i9sb-t6qzh3f0vkybftx 15570 collate 2015-01-08_22:00:29 qr1hi-8i9sb-t6qzh3f0vkybftx 15570 Failed to write output collection 2015-01-08_22:00:29 qr1hi-8i9sb-t6qzh3f0vkybftx 15570 finish
The crunch-dispatch logs include some more context:
2015-01-08_21:59:08.90514 qr1hi-8i9sb-t6qzh3f0vkybftx 15570 Freeze not implemented 2015-01-08_21:59:08.90515 qr1hi-8i9sb-t6qzh3f0vkybftx 15570 collate 2015-01-08_21:59:10.18634 dispatch: update compute14 state to {:state=>"idle", :job=>nil} 2015-01-08_22:00:29.67611 qr1hi-8i9sb-t6qzh3f0vkybftx ! Traceback (most recent call last): 2015-01-08_22:00:29.67620 qr1hi-8i9sb-t6qzh3f0vkybftx ! File "<string>", line 1, in <module> 2015-01-08_22:00:29.67627 qr1hi-8i9sb-t6qzh3f0vkybftx ! File "/usr/local/lib/python2.7/dist-packages/oauth2client/util.py", line 132, in positional_wrapper 2015-01-08_22:00:29.91142 qr1hi-8i9sb-t6qzh3f0vkybftx ! return wrapped(*args, **kwargs) 2015-01-08_22:00:29.91149 qr1hi-8i9sb-t6qzh3f0vkybftx ! File "/usr/local/lib/python2.7/dist-packages/apiclient/http.py", line 723, in execute 2015-01-08_22:00:29.91154 qr1hi-8i9sb-t6qzh3f0vkybftx ! raise HttpError(resp, content, uri=self.uri) 2015-01-08_22:00:29.91158 qr1hi-8i9sb-t6qzh3f0vkybftx ! arvados.errors.ApiError: <HttpError 500 when requesting https://qr1hi.arvadosapi.com/arvados/v1/collections?alt=json returned "Internal Server Error"> 2015-01-08_22:00:29.94333 qr1hi-8i9sb-t6qzh3f0vkybftx 15570 Failed to write output collection 2015-01-08_22:00:29.94346 qr1hi-8i9sb-t6qzh3f0vkybftx 15570 finish
However, the API server production logs do not have any record of a failed attempt to create this collection.
See if we can reliably reproduce this and determine whether it represents a flaw in the API server, a simple timeout, or something else.
Updated by Tim Pierce about 10 years ago
- Project changed from 35 to Arvados
- Category set to API
Updated by Tom Clegg about 10 years ago
- Target version changed from Arvados Future Sprints to 2015-02-18 sprint
Updated by Tom Clegg almost 10 years ago
- Target version changed from 2015-02-18 sprint to 2015-03-11 sprint
Updated by Peter Amstutz almost 10 years ago
Re-ran the job and got the same error: qr1hi-8i9sb-77epylk1ecakg3a
Updated by Peter Amstutz almost 10 years ago
Was able to recreate the error. Notes:
- This is a severely degenerate manifest. There are many instances of sequential file segments that could be collapsed. There is also interleaving of writes to multiple files in the same block which greatly bloats the number of segments required to represent the file. The same information could probably be represented by a manifest 10% of the size (or even smaller.)
- The manifest text is successfully written to Keep. The failure is when trying to create the collection record.
- I can't find anything in the API server logs indicating that the request made it to the Rails code. Along with the 500 error code, this suggests it is getting rejected earlier by the nginx proxy (?) or in passenger.
Updated by Peter Amstutz almost 10 years ago
Here's the log:
10.13.26.53 - - [23/Feb/2015:19:02:00 +0000] "POST /arvados/v1/collections?alt=json HTTP/1.1" 500 192 "-" "google-api-python-client/1.3.1 (gzip)"
Updated by Peter Amstutz almost 10 years ago
Whoops.
This manifest isn't 28 MiB. It's at least two orders of magnitude larger than that. The 28 MiB manifest is only the first of 60 tasks. The actual requests concatenates all those tasks and results in a manifest that is around 887 MiB.
Updated by Peter Amstutz almost 10 years ago
Analysis:
There are two underlying causes here:
- CollectionWriter was used in a way that created fragmented manifests which were orders of magnitude larger than necessary
- Failure to have explicit limits on how large a single API request and/or manifest is allowed to be, so that instead of getting specific errors (413 This manifest is too large) we get generic errors (500 Internal Server Error) that are much harder to debug.
Updated by Tom Clegg almost 10 years ago
The test in sdk/python/tests/test_api.py
would be a bit more stable if it looked up the request limit in the discovery doc, instead of assuming 128 MiB. (If it's an error when limit>128MiB
then we could check for that on the API side in rake config:check
... but I'm assuming it isn't.)
If I'm following correctly, this added line would make more sense in __init__()
than in __authorize__()
:
def authorize(self, http):
http.arvados_api_token = self.api_token
http.orig_http_request = http.request
http.request = types.MethodType(self.http_request, http)
+ http.max_request_size = 0
return http
The comment in services/api/config/application.default.yml
would probably be a good place to add a note about enforcement. Perhaps:
"Note this limit only affects clients that pay attention to it. Therefore we also recommend configuring the web server, or an upstream proxy server, to enforce a request size limit."
Everything else LGTM.
Updated by Peter Amstutz almost 10 years ago
- Status changed from New to Resolved