Bug #18102
max dispatch attempts error
Status: Closed
Added by Jiayong Li over 3 years ago. Updated about 3 years ago.
% Done: 100%
Description
https://workbench.2xpu4.arvadosapi.com/container_requests/2xpu4-xvhdp-vzc0tvmwarbfnx1
Error: Failed to start container. Cancelled after exceeding 'Containers.MaxDispatchAttempts' (lock_count=5)
Updated by Tom Clegg over 3 years ago
example of affected container: 2xpu4-dz642-3duf09jaxd3b9gv
{"ContainerUUID":"2xpu4-dz642-3duf09jaxd3b9gv","InstanceType":"m58xlarge.spot","PID":23276,"Priority":562948322809997302,"State":"Queued","level":"info","msg":"adding container to queue","time":"2021-09-02T20:20:53.543265064Z"} {"ContainerUUID":"2xpu4-dz642-3duf09jaxd3b9gv","InstanceType":"m58xlarge.spot","PID":23276,"level":"info","msg":"creating new instance","time":"2021-09-02T20:21:51.709046394Z"} {"InstanceType":"m58xlarge.spot","PID":23276,"error":"InsufficientInstanceCapacity: We currently do not have sufficient m5.8xlarge capacity in the Availability Zone you requested (us-east-1a). Our system will be working on provisioning additional capacity. You can currently get m5.8xlarge capacity by not specifying an Availability Zone in your request or choosing us-east-1b, us-east-1c, us-east-1d, us-east-1f.\n\tstatus code: 500, request id: 1a987cc4-80fc-4105-85bb-2d25fdf8dbbd","level":"error","msg":"create failed","time":"2021-09-02T20:21:52.628528300Z"} {"ContainerUUID":"2xpu4-dz642-3duf09jaxd3b9gv","PID":23276,"error":"request failed: https://2xpu4.arvadosapi.com/arvados/v1/containers/2xpu4-dz642-3duf09jaxd3b9gv/unlock: 422 Unprocessable Entity: #\u003cArvadosModel::InvalidStateTransitionError: cannot unlock when Queued\u003e (req-a9h7sg2h1e1u11ity3xs)","level":"warning","msg":"error unlocking","time":"2021-09-02T20:21:53.392472387Z"} {"ContainerUUID":"2xpu4-dz642-3duf09jaxd3b9gv","PID":23276,"error":"request failed: https://2xpu4.arvadosapi.com/arvados/v1/containers/2xpu4-dz642-3duf09jaxd3b9gv/unlock: 422 Unprocessable Entity: #\u003cArvadosModel::InvalidStateTransitionError: cannot unlock when Queued\u003e (req-14br8w8ut0nf9nek9xex)","level":"warning","msg":"error unlocking","time":"2021-09-02T20:21:53.974055346Z"} {"ContainerUUID":"2xpu4-dz642-3duf09jaxd3b9gv","PID":23276,"error":"request failed: https://2xpu4.arvadosapi.com/arvados/v1/containers/2xpu4-dz642-3duf09jaxd3b9gv/unlock: 422 Unprocessable Entity: #\u003cArvadosModel::InvalidStateTransitionError: cannot unlock when Cancelled\u003e (req-4f9hcjkzs9x61409czrt)","level":"warning","msg":"error unlocking","time":"2021-09-02T20:21:56.305846772Z"} {"ContainerUUID":"2xpu4-dz642-3duf09jaxd3b9gv","PID":23276,"State":"Cancelled","level":"info","msg":"container finished -- dropping from queue","time":"2021-09-02T20:21:56.305932024Z"} {"ContainerUUID":"2xpu4-dz642-3duf09jaxd3b9gv","PID":23276,"State":"Cancelled","level":"info","msg":"dropping container from queue","time":"2021-09-02T20:21:56.305946078Z"}
Updated by Tom Clegg over 3 years ago
Flapping lock/unlock calls
The scheduler assumed, incorrectly, that sorting containers by priority would give consistent results from one iteration to the next. This is not true when multiple containers have identical priority, which is how we do child containers. As a result, in a situation where the scheduler is waiting for new instances to come up and knows that there are more containers than new instances, the scheduler keeps changing its mind about which containers are "first" and should be locked, and which should be unlocked because they won't fit on the new instances.
Fix by tracking FirstSeenAt timestamp on each queue entry, and using that to sort identical-priority containers.
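For illustration, a minimal sketch of that kind of tie-breaking sort. The type and field names here (queueEnt, Priority, FirstSeenAt) are stand-ins rather than the actual scheduler types:

package main

import (
    "fmt"
    "sort"
    "time"
)

// queueEnt is an illustrative stand-in for a scheduler queue entry;
// the real types and field names in lib/dispatchcloud may differ.
type queueEnt struct {
    UUID        string
    Priority    int64
    FirstSeenAt time.Time
}

// sortQueue orders entries by descending priority, breaking ties with
// FirstSeenAt so equal-priority containers keep a stable order from one
// scheduling iteration to the next.
func sortQueue(ents []queueEnt) {
    sort.Slice(ents, func(i, j int) bool {
        if ents[i].Priority != ents[j].Priority {
            return ents[i].Priority > ents[j].Priority
        }
        return ents[i].FirstSeenAt.Before(ents[j].FirstSeenAt)
    })
}

func main() {
    now := time.Now()
    ents := []queueEnt{
        {UUID: "ctr-b", Priority: 5, FirstSeenAt: now.Add(2 * time.Second)},
        {UUID: "ctr-a", Priority: 5, FirstSeenAt: now},
        {UUID: "ctr-c", Priority: 9, FirstSeenAt: now.Add(time.Second)},
    }
    sortQueue(ents)
    for _, e := range ents {
        fmt.Println(e.UUID) // ctr-c, ctr-a, ctr-b -- same order every iteration
    }
}

Because FirstSeenAt never changes for a given entry, equal-priority containers keep the same relative order across iterations, so the scheduler stops flip-flopping over which ones to lock.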
"Invalid state transition" errors¶
In runQueue, when ctr.State is Locked and sch.pool.AtQuota() is true, we call sch.queue.Unlock(), but we also include the current container in the overquota slice, which means we call sch.queue.Unlock() on it again in the "unlock any containers ..." block below. Fix by removing the first call to sch.queue.Unlock().
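A small self-contained sketch of why the doubled Unlock produced the 422 responses in the logs above; mockQueue here is a made-up stand-in, not the real queue type:

package main

import (
    "errors"
    "fmt"
)

// mockQueue is an illustrative stand-in for the dispatcher's container queue.
// Like the real API server, it treats unlocking a container that is already
// Queued as an invalid state transition.
type mockQueue struct {
    state map[string]string // container UUID -> state
}

func (q *mockQueue) Unlock(uuid string) error {
    if q.state[uuid] != "Locked" {
        return errors.New("cannot unlock when " + q.state[uuid])
    }
    q.state[uuid] = "Queued"
    return nil
}

func main() {
    q := &mockQueue{state: map[string]string{"ctr-1": "Locked"}}

    // Old behavior: unlock in the Locked-and-AtQuota branch...
    fmt.Println(q.Unlock("ctr-1")) // <nil>
    // ...then again in the "unlock any containers over quota" loop:
    fmt.Println(q.Unlock("ctr-1")) // cannot unlock when Queued
}

With the first call removed, the overquota loop performs the only Unlock, so each container is unlocked exactly once.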
18102-max-dispatch-attempts @ 6ba70e34ec9bc9941e8c04d6654106e6d1bbae82 -- developer-run-tests: #2678
Extra thoughts
Rate limiting:
We're getting some rate-limiting errors (RequestLimitExceeded) from AWS. It's not clear whether they're caused by having additional create requests in progress when the first one hits InsufficientInstanceCapacity, or simply by creating several dozen instances at once.
Either way, these configs should probably not be zero/unlimited. Perhaps we should update the defaults to something reasonable.
MaxCloudOpsPerSecond: 0
MaxConcurrentInstanceCreateOps: 0
MaxConcurrentInstanceCreateOps was invented to accommodate Azure but it seems useful for AWS too.
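For illustration only, non-zero settings might look something like this in the cluster config; the values and exact location shown here are placeholders, not proposed defaults:

Containers:
  CloudVMs:
    MaxCloudOpsPerSecond: 1
    MaxConcurrentInstanceCreateOps: 1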
Error handling:
If we get an error from the cloud provider when trying to create an instance, but the error isn't recognized as a quota or rate-limiting error, we retry very quickly. That would be unhelpful for an error like "invalid instance type". I don't think this is happening here (InsufficientInstanceCapacity is recognized as a quota error), but we might want to consider other options, like a per-instance-type quiet period for unrecognized errors (which would probably be a better way to respond to InsufficientInstanceCapacity as well).
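A hedged sketch of one way such a per-instance-type quiet period could work (an idea only, not code from a-d-c; the names here are invented):

package main

import (
    "fmt"
    "sync"
    "time"
)

// quietPeriods tracks, per instance type, the earliest time at which it is
// worth retrying a create request after an unrecognized (or capacity) error.
type quietPeriods struct {
    mu    sync.Mutex
    until map[string]time.Time
    d     time.Duration
}

func newQuietPeriods(d time.Duration) *quietPeriods {
    return &quietPeriods{until: map[string]time.Time{}, d: d}
}

// ReportError starts (or extends) the quiet period for the given instance type.
func (qp *quietPeriods) ReportError(instType string) {
    qp.mu.Lock()
    defer qp.mu.Unlock()
    qp.until[instType] = time.Now().Add(qp.d)
}

// OK reports whether it is reasonable to try creating this instance type now.
func (qp *quietPeriods) OK(instType string) bool {
    qp.mu.Lock()
    defer qp.mu.Unlock()
    return time.Now().After(qp.until[instType])
}

func main() {
    qp := newQuietPeriods(10 * time.Minute)
    qp.ReportError("m5.8xlarge")     // e.g. after InsufficientInstanceCapacity
    fmt.Println(qp.OK("m5.8xlarge")) // false: skip this type for a while
    fmt.Println(qp.OK("c5.4xlarge")) // true: other types unaffected
}

Create failures for a given type would call ReportError, and the scheduler would skip types whose quiet period hasn't expired yet.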
Updated by Tom Clegg over 3 years ago
- Related to Bug #11561: [API] Limit number of lock/unlock cycles for a given container added
Updated by Lucas Di Pentima over 3 years ago
Just one small question:
- In lib/dispatchcloud/scheduler/sync.go: wouldn't the changes at lines 87-93 almost completely avoid the logging of unexpected state? Maybe we could log the message just once per container?
The rest LGTM, including your extra thoughts, although maybe we want to discuss them with the rest of the team?
Updated by Tom Clegg over 3 years ago
- Related to Feature #18113: [a-d-c] non-zero defaults for MaxCloudOpsPerSecond and MaxConcurrentInstanceCreateOps added
Updated by Tom Clegg over 3 years ago
- Related to Bug #18114: [a-d-c] slow down retries when CreateInstance returns non-quota/non-throttle errors added
Updated by Tom Clegg over 3 years ago
- Status changed from In Progress to Resolved
- % Done changed from 0 to 100
Applied in changeset arvados-private:commit:arvados|89bdd6bf07147ad707125fc3882caebb5402ab09.