Project

General

Profile

Actions

Bug #4956

closed

[API] cannot create collections with very large manifests

Added by Tim Pierce about 10 years ago. Updated almost 10 years ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
API
Target version:
Start date:
01/09/2015
Due date:
% Done:

86%

Estimated time:
(Total: 0.00 h)
Story points:
1.0

Description

From #4621:

A job generating a manifest consisting of one 28MB line of text could not be saved.

The pipeline log for failed job qr1hi-8i9sb-t6qzh3f0vkybftx says:

2015-01-08_21:59:08 qr1hi-8i9sb-t6qzh3f0vkybftx 15570 57 success in 1223 seconds
2015-01-08_21:59:08 qr1hi-8i9sb-t6qzh3f0vkybftx 15570 57 output ecf907baa43ecdcc5b74d29b2d1d593f+28007813+A42020335df3e977930d81cc7f2070b1ab35e050b@54c172a9
2015-01-08_21:59:08 qr1hi-8i9sb-t6qzh3f0vkybftx 15570  wait for last 0 children to finish
2015-01-08_21:59:08 qr1hi-8i9sb-t6qzh3f0vkybftx 15570  status: 62 done, 0 running, 0 todo
2015-01-08_21:59:08 qr1hi-8i9sb-t6qzh3f0vkybftx 15570  release job allocation
2015-01-08_21:59:08 qr1hi-8i9sb-t6qzh3f0vkybftx 15570  Freeze not implemented
2015-01-08_21:59:08 qr1hi-8i9sb-t6qzh3f0vkybftx 15570  collate
2015-01-08_22:00:29 qr1hi-8i9sb-t6qzh3f0vkybftx 15570  Failed to write output collection
2015-01-08_22:00:29 qr1hi-8i9sb-t6qzh3f0vkybftx 15570  finish

The crunch-dispatch logs include some more context:

2015-01-08_21:59:08.90514 qr1hi-8i9sb-t6qzh3f0vkybftx 15570  Freeze not implemented
2015-01-08_21:59:08.90515 qr1hi-8i9sb-t6qzh3f0vkybftx 15570  collate
2015-01-08_21:59:10.18634 dispatch: update compute14 state to {:state=>"idle", :job=>nil}
2015-01-08_22:00:29.67611 qr1hi-8i9sb-t6qzh3f0vkybftx ! Traceback (most recent call last):
2015-01-08_22:00:29.67620 qr1hi-8i9sb-t6qzh3f0vkybftx !   File "<string>", line 1, in <module>
2015-01-08_22:00:29.67627 qr1hi-8i9sb-t6qzh3f0vkybftx !   File "/usr/local/lib/python2.7/dist-packages/oauth2client/util.py", line 132, in positional_wrapper
2015-01-08_22:00:29.91142 qr1hi-8i9sb-t6qzh3f0vkybftx !     return wrapped(*args, **kwargs)
2015-01-08_22:00:29.91149 qr1hi-8i9sb-t6qzh3f0vkybftx !   File "/usr/local/lib/python2.7/dist-packages/apiclient/http.py", line 723, in execute
2015-01-08_22:00:29.91154 qr1hi-8i9sb-t6qzh3f0vkybftx !     raise HttpError(resp, content, uri=self.uri)
2015-01-08_22:00:29.91158 qr1hi-8i9sb-t6qzh3f0vkybftx ! arvados.errors.ApiError: <HttpError 500 when requesting https://qr1hi.arvadosapi.com/arvados/v1/collections?alt=json returned "Internal Server Error">
2015-01-08_22:00:29.94333 qr1hi-8i9sb-t6qzh3f0vkybftx 15570  Failed to write output collection
2015-01-08_22:00:29.94346 qr1hi-8i9sb-t6qzh3f0vkybftx 15570  finish

However, the API server production logs do not have any record of a failed attempt to create this collection.

See if we can reliably reproduce this and determine whether it represents a flaw in the API server, a simple timeout, or something else.


Subtasks 6 (0 open6 closed)

Task #5299: Determine causeResolvedPeter Amstutz01/09/2015

Actions
Task #5102: Try to reproduceResolvedPeter Amstutz01/09/2015

Actions
Task #5358: In Python SDK, check each request's size against discovered limitResolvedPeter Amstutz01/09/2015

Actions
Task #5357: Advertise configurable request size limit in discovery doc (default 128MiB)ResolvedPeter Amstutz01/09/2015

Actions
Task #5366: Review 4956-limit-request-sizeResolvedTom Clegg01/09/2015

Actions
Task #5308: Decide on explicit, documented limits on maximum manifest size, maximum API server request size.Resolved01/09/2015

Actions
Actions #1

Updated by Tim Pierce about 10 years ago

  • Project changed from 35 to Arvados
  • Category set to API
Actions #2

Updated by Tom Clegg about 10 years ago

  • Target version changed from Arvados Future Sprints to 2015-02-18 sprint
Actions #3

Updated by Peter Amstutz almost 10 years ago

  • Assigned To set to Peter Amstutz
Actions #4

Updated by Tom Clegg almost 10 years ago

  • Target version changed from 2015-02-18 sprint to 2015-03-11 sprint
Actions #5

Updated by Peter Amstutz almost 10 years ago

Re-ran the job and got the same error: qr1hi-8i9sb-77epylk1ecakg3a

Actions #6

Updated by Peter Amstutz almost 10 years ago

Was able to recreate the error. Notes:

  • This is a severely degenerate manifest. There are many instances of sequential file segments that could be collapsed. There is also interleaving of writes to multiple files in the same block which greatly bloats the number of segments required to represent the file. The same information could probably be represented by a manifest 10% of the size (or even smaller.)
  • The manifest text is successfully written to Keep. The failure is when trying to create the collection record.
  • I can't find anything in the API server logs indicating that the request made it to the Rails code. Along with the 500 error code, this suggests it is getting rejected earlier by the nginx proxy (?) or in passenger.
Actions #7

Updated by Peter Amstutz almost 10 years ago

Here's the log:

10.13.26.53 - - [23/Feb/2015:19:02:00 +0000] "POST /arvados/v1/collections?alt=json HTTP/1.1" 500 192 "-" "google-api-python-client/1.3.1 (gzip)" 
Actions #8

Updated by Peter Amstutz almost 10 years ago

Whoops.

This manifest isn't 28 MiB. It's at least two orders of magnitude larger than that. The 28 MiB manifest is only the first of 60 tasks. The actual requests concatenates all those tasks and results in a manifest that is around 887 MiB.

Actions #9

Updated by Peter Amstutz almost 10 years ago

Analysis:

There are two underlying causes here:

  1. CollectionWriter was used in a way that created fragmented manifests which were orders of magnitude larger than necessary
  2. Failure to have explicit limits on how large a single API request and/or manifest is allowed to be, so that instead of getting specific errors (413 This manifest is too large) we get generic errors (500 Internal Server Error) that are much harder to debug.
Actions #10

Updated by Tom Clegg almost 10 years ago

The test in sdk/python/tests/test_api.py would be a bit more stable if it looked up the request limit in the discovery doc, instead of assuming 128 MiB. (If it's an error when limit>128MiB then we could check for that on the API side in rake config:check... but I'm assuming it isn't.)

If I'm following correctly, this added line would make more sense in __init__() than in __authorize__():

     def authorize(self, http):
         http.arvados_api_token = self.api_token
         http.orig_http_request = http.request
         http.request = types.MethodType(self.http_request, http)
+        http.max_request_size = 0
         return http

The comment in services/api/config/application.default.yml would probably be a good place to add a note about enforcement. Perhaps:

"Note this limit only affects clients that pay attention to it. Therefore we also recommend configuring the web server, or an upstream proxy server, to enforce a request size limit."

Everything else LGTM.

Actions #11

Updated by Peter Amstutz almost 10 years ago

  • Status changed from New to Resolved
Actions

Also available in: Atom PDF