Bug #23178


a-d-c is treating rate limit errors as capacity issues, should not

Added by Brett Smith 6 months ago. Updated 5 months ago.

Status: Resolved
Priority: Normal
Assigned To: Tom Clegg
Category: Crunch
Target version: Development 2025-10-15
Story points: -
Release relationship: Auto

Description

We're seeing many errors like this in a-d-c logs:

{"ClusterID":"zzzzz","InstanceType":"m5a2xlarge.preemptible","PID":626945,"error":"operation error EC2: RunInstances, failed to get rate limit token, retry quota exceeded, 4 available, 5 requested","level":"error","msg":"create failed","time":"2025-09-25T15:02:23.154868810Z"}

Looking at source:lib/dispatchcloud/worker/pool.go, based on the log msg, a-d-c apparently treats this as a global instance capacity error. It is not: it means we have exceeded the rate limit for RunInstances. See the EC2 documentation. Note that the rate limit is account-wide, so we might hit this error even if a-d-c itself stays well under the limits.

In the same logs, we only see one instance of "suspending remote calls due to rate-limit error" from CheckRateLimitError.

When we receive this error, a-d-c should pause making RunInstances requests for a time, but it should not decrease sch.maxConcurrency. In other words (I think this is accurate), it should not go down the "if err, ok := err.(cloud.CapacityError); ok && err.IsCapacityError()" branch in pool.go; it should instead go through the CheckRateLimitError handling in throttle.go. A sketch of the intended dispatch follows.
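
A minimal sketch of that intended dispatch. The RateLimitError shape here is an assumption modeled on the QuotaError/CapacityError interfaces quoted later in this ticket; the real definitions live in lib/cloud, and rateLimitErr is a hypothetical stand-in, not driver code:

    package main

    import (
        "errors"
        "fmt"
        "time"
    )

    // CapacityError mirrors the check quoted from pool.go below. RateLimitError
    // is assumed to follow the same pattern, carrying the earliest-retry
    // deadline that throttle.go's CheckRateLimitError works from.
    type CapacityError interface {
        error
        IsCapacityError() bool
    }

    type RateLimitError interface {
        error
        EarliestRetry() time.Time
    }

    // classify shows the intended precedence: a rate-limit error should pause
    // RunInstances calls for a while (the throttle.go path) and must not be
    // treated as a capacity error, which would lower sch.maxConcurrency.
    func classify(err error) string {
        var rl RateLimitError
        if errors.As(err, &rl) {
            return fmt.Sprintf("rate limited: pause create calls until %v", rl.EarliestRetry())
        }
        var ce CapacityError
        if errors.As(err, &ce) && ce.IsCapacityError() {
            return "capacity: reduce maxConcurrency"
        }
        return "other: log create failed"
    }

    // rateLimitErr is a hypothetical stand-in for the error the EC2 driver
    // should return when RunInstances is throttled.
    type rateLimitErr struct{ until time.Time }

    func (e rateLimitErr) Error() string            { return "RunInstances rate limit exceeded" }
    func (e rateLimitErr) EarliestRetry() time.Time { return e.until }

    func main() {
        fmt.Println(classify(rateLimitErr{until: time.Now().Add(time.Minute)}))
    }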

I think part of the reason this is happening is that the AWS SDK v2 reportedly changed the way it handles the underlying API errors and bubbles them up. See this Terraform issue. Unfortunately the linked "newly published guidance" doesn't seem to be available any more, but I think my more immediate concern is just that our error checks probably aren't working as originally intended because of these changes. See also this long AWS SDK issue.
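
To illustrate the failure mode (isEC2Throttle below is a hypothetical stand-in for that kind of check, not the actual driver code):

    package main

    import (
        "errors"
        "fmt"

        "github.com/aws/smithy-go"
    )

    // In SDK v2 the server-side error code is only reachable by unwrapping
    // with errors.As. The "failed to get rate limit token" error in the log
    // above is generated client-side by the SDK's standard retryer, so it
    // carries no smithy.APIError and a check like this never matches it.
    func isEC2Throttle(err error) bool {
        var apiErr smithy.APIError
        return errors.As(err, &apiErr) && apiErr.ErrorCode() == "RequestLimitExceeded"
    }

    func main() {
        err := errors.New("failed to get rate limit token, retry quota exceeded")
        fmt.Println(isEC2Throttle(err)) // false: the throttle goes undetected
    }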


Subtasks 1 (0 open, 1 closed)

Task #23189: Review 23178-ec2-rate-limit (Resolved, Tom Clegg, 10/09/2025)

Related issues 1 (0 open, 1 closed)

Related to Arvados - Bug #23196: Should track instance capacity state separately for preemptible and non-preemptible instance types (Resolved, Tom Clegg, 10/14/2025)
#2

Updated by Brett Smith 6 months ago

  • Description updated (diff)
#3

Updated by Brett Smith 6 months ago

I think "Retries and Timeouts" is the documentation Terraform referred to. I'm especially looking at the "Client-side rate limiting" section, which says:

If you find that the default behavior does not fit your application's needs, you can disable it with ratelimit.None.

This seems to be how you restore SDK v1-like behavior, which I think is what we want given that we do our own retry handling at the application level.
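
Per that documentation, the config change would look something like this (a sketch; a-d-c builds its EC2 client from its own cluster config rather than LoadDefaultConfig):

    package main

    import (
        "context"
        "log"

        "github.com/aws/aws-sdk-go-v2/aws"
        "github.com/aws/aws-sdk-go-v2/aws/ratelimit"
        "github.com/aws/aws-sdk-go-v2/aws/retry"
        "github.com/aws/aws-sdk-go-v2/config"
        "github.com/aws/aws-sdk-go-v2/service/ec2"
    )

    func main() {
        // Keep the standard retryer but disable its client-side rate limiter,
        // per the "Client-side rate limiting" section of Retries and Timeouts.
        cfg, err := config.LoadDefaultConfig(context.TODO(),
            config.WithRetryer(func() aws.Retryer {
                return retry.NewStandard(func(o *retry.StandardOptions) {
                    o.RateLimiter = ratelimit.None
                })
            }),
        )
        if err != nil {
            log.Fatal(err)
        }
        _ = ec2.NewFromConfig(cfg)
    }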

#4

Updated by Brett Smith 6 months ago

  • Target version set to Development 2025-10-15
  • Assigned To set to Tom Clegg
#5

Updated by Brett Smith 6 months ago

  • Subtask #23189 added
#6

Updated by Tom Clegg 6 months ago

Aside: That log message does not really mean a-d-c interpreted the error as a capacity error:

    if err, ok := err.(cloud.QuotaError); ok && err.IsQuotaError() {
            ... // quota-error handling
    }
    if err, ok := err.(cloud.CapacityError); ok && err.IsCapacityError() {
            ... // capacity-error handling
    }
    // this log line runs in every case, whether or not a branch above matched
    logger.WithError(err).Error("create failed")

But the main point stands, i.e., the AWS SDK's auto-retry behavior is interfering with a-d-c's error handling.

#7

Updated by Tom Clegg 6 months ago

  • Status changed from New to In Progress

23178-ec2-rate-limit @ 0c3e4f98d6941321f14a846cd55756472c7e7f83 -- developer-run-tests: #4897
  • Disable the AWS SDK's auto-retry (trivial).
  • When handling a quota error, log how long the pause will be, which instance type/group it applies to, etc.
  • When reporting an instance create error, log whether it was handled as a quota and/or throttle error.
#8

Updated by Brett Smith 6 months ago

On review, I got a little touchy about the distinction between retries and rate limiting. The specific error we're seeing gets logged when the internal rate limiter has no more tokens, which makes me think we should be looking to disable rate limiting, not retrying.

However, note also that that happens in a standard retryer method. A look at the corresponding NopRetryer method confirms that, true to its name, it does not talk to a rate limiter at all.

So, if I've followed right, switching to NopRetryer is the change that gets us closest to the SDK v1 behavior that a-d-c expects, and therefore is the best way to resolve this issue. If we're on the same page about this, then this LGTM, thanks.
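
For reference, the NopRetryer wiring is roughly this (a sketch, not the branch's actual diff; aws.NopRetryer makes MaxAttempts() return 1 and never consults a rate limiter, so errors surface to a-d-c immediately):

    package main

    import (
        "context"
        "log"

        "github.com/aws/aws-sdk-go-v2/aws"
        "github.com/aws/aws-sdk-go-v2/config"
        "github.com/aws/aws-sdk-go-v2/service/ec2"
    )

    func main() {
        // Disable both auto-retry and client-side rate limiting: every error,
        // including RunInstances throttles, is returned to the caller at once.
        cfg, err := config.LoadDefaultConfig(context.TODO(),
            config.WithRetryer(func() aws.Retryer { return aws.NopRetryer{} }),
        )
        if err != nil {
            log.Fatal(err)
        }
        _ = ec2.NewFromConfig(cfg)
    }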

#9

Updated by Tom Clegg 6 months ago

> On review, I got a little touchy about the distinction between retries and rate limiting. The specific error we're seeing gets logged when the internal rate limiter has no more tokens, which makes me think we should be looking to disable rate limiting, not retrying.

Sorry, I mentioned auto-retry in the commit message but neglected to call out in my note that auto-retry is also undesirable for a-d-c. It's better for a-d-c to see and report errors immediately, and retry later if the node is still needed. As you noted, switching to NopRetryer gets rid of both.

> So, if I've followed right, switching to NopRetryer is the change that gets us closest to the SDK v1 behavior that a-d-c expects, and therefore is the best way to resolve this issue. If we're on the same page about this, then this LGTM, thanks.

Yes.

#10

Updated by Tom Clegg 6 months ago

  • Status changed from In Progress to Resolved
#11

Updated by Tom Clegg 6 months ago

  • Related to Bug #23196: Should track instance capacity state separately for preemptible and non-preemptible instance types added
#12

Updated by Brett Smith 5 months ago

  • Release changed from 82 to 79
