Bug #8224

Job failed due to API call 503 Service Unavailable: This website is under heavy load

Added by Joshua Randall about 10 years ago. Updated about 6 years ago.

Status: Closed
Priority: Normal
Assigned To: -
Category: -
Target version: -
Story points: -

Description

A job with a large number of tasks and a huge number of available slots failed last night. The last few messages in the job log were:

2016-01-19_01:01:59 z8ta6-8i9sb-ckih5ofn1ydh3bv 46721 593 child 19822 started on humgen-05-15.22
2016-01-19_01:01:59 z8ta6-8i9sb-ckih5ofn1ydh3bv 46721 594 job_task z8ta6-ot0gb-ovr359ezjhi4bpy
2016-01-19_01:01:59 z8ta6-8i9sb-ckih5ofn1ydh3bv 46721 594 child 19831 started on humgen-05-16.22
2016-01-19_01:01:59 API call /job_tasks/z8ta6-ot0gb-ovr359ezjhi4bpy failed: 503 Service Unavailable
2016-01-19_01:01:59 <h1>This website is under heavy load</h1><p>We're sorry, too many people are accessing this website at the same time. We're working on this problem. Please try again later.</p> at /usr/share/perl5/Arvados/ResourceProxy.pm line 15
2016-01-19_01:01:59 salloc: Relinquishing job allocation 4263
2016-01-19_01:01:59 salloc: Job allocation 4263 has been revoked.

There were a total of 39201 tasks in this job, but as you can see above it had only gotten as far as starting the 594th task when it failed.

This job had an allocation of 27 nodes with a total of 952 cores, so with only 594 tasks started it had not even filled all of its available cores before it apparently overloaded the API server and died.

There are plenty of cores (40) and memory (192GB) available on the API server, so I've tried raising `passenger_max_pool_size` in nginx.conf from the default (6) to 64. I will report back on whether that addresses the problem. Even if it does, I think there is still a problem with how the job handles this failure: if the API server has a transient issue (such as overload), perhaps the job should back off and retry rather than fail (maybe crunch-job could send SIGTSTP to the crunch script?). I'm not sure whether the failing API call came from the crunch script or from the crunch-job process, but the "API call ... failed" error is not something that my crunch script itself printed.
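
To make the backing-off suggestion concrete, here is a rough sketch of retrying a call with exponential backoff and jitter when the server reports it is overloaded. This is not how crunch-job or the Arvados SDK actually behaves; `ServiceUnavailable` and `call_with_backoff` are hypothetical names used only for illustration, and the retry limits are arbitrary.

```python
import random
import time


class ServiceUnavailable(Exception):
    """Hypothetical error standing in for a 503 response from the API server."""


def call_with_backoff(call, max_attempts=6, base_delay=1.0, max_delay=60.0,
                      sleep=time.sleep, rng=random.random):
    """Run call(), retrying on ServiceUnavailable with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return call()
        except ServiceUnavailable:
            if attempt == max_attempts - 1:
                raise  # out of retries: let the caller decide to fail the job
            # The delay ceiling doubles each attempt (1s, 2s, 4s, ...),
            # capped at max_delay.
            cap = min(max_delay, base_delay * 2 ** attempt)
            # Random jitter so that hundreds of tasks do not all retry
            # at the same instant and overload the server again.
            sleep(cap * rng())
```

A caller would wrap each API request, for example `call_with_backoff(lambda: fetch_job_task(uuid))`, where `fetch_job_task` is again a hypothetical function that raises `ServiceUnavailable` on a 503. The jitter matters here because the job was starting several tasks per second, so unjittered retries would arrive at the API server in bursts.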

#1

Updated by Peter Amstutz about 6 years ago

  • Status changed from New to Closed