Bug #23178


a-d-c is treating rate limit errors as capacity issues, should not

Added by Brett Smith 6 months ago. Updated 5 months ago.

Status: Resolved
Priority: Normal
Assigned To: Tom Clegg
Category: Crunch
Target version: Development 2025-10-15
Story points: -
Release relationship: Auto

Description

We're seeing many errors like this in a-d-c logs:

{"ClusterID":"zzzzz","InstanceType":"m5a2xlarge.preemptible","PID":626945,"error":"operation error EC2: RunInstances, failed to get rate limit token, retry quota exceeded, 4 available, 5 requested","level":"error","msg":"create failed","time":"2025-09-25T15:02:23.154868810Z"}

Looking at source:lib/dispatchcloud/worker/pool.go, based on the log msg, a-d-c apparently treats this as a global instance capacity error. It is not: it means we have exceeded the rate limit for RunInstances. See the EC2 documentation. Note that the rate limit is account-wide, so we might hit this error even if a-d-c itself stays well under the limits.

In the same logs, we only see one instance of "suspending remote calls due to rate-limit error" from CheckRateLimitError.

When we receive this error, a-d-c should pause making RunInstances requests for a time, but it should not decrease sch.maxConcurrency. In other words (I think this is accurate), it should not go down the "if err, ok := err.(cloud.CapacityError); ok && err.IsCapacityError()" branch in pool.go; it should instead go through the CheckRateLimitError handling in throttle.go. A sketch of the intended dispatch follows.
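
A minimal sketch of that intended dispatch. The RateLimitError shape here is an assumption modeled on the QuotaError/CapacityError interfaces quoted later in this ticket; the real definitions live in lib/cloud, and rateLimitErr is a hypothetical stand-in, not driver code:

    package main

    import (
        "errors"
        "fmt"
        "time"
    )

    // CapacityError mirrors the check quoted from pool.go below. RateLimitError
    // is assumed to follow the same pattern, carrying the earliest-retry
    // deadline that throttle.go's CheckRateLimitError works from.
    type CapacityError interface {
        error
        IsCapacityError() bool
    }

    type RateLimitError interface {
        error
        EarliestRetry() time.Time
    }

    // classify shows the intended precedence: a rate-limit error should pause
    // RunInstances calls for a while (the throttle.go path) and must not be
    // treated as a capacity error, which would lower sch.maxConcurrency.
    func classify(err error) string {
        var rl RateLimitError
        if errors.As(err, &rl) {
            return fmt.Sprintf("rate limited: pause create calls until %v", rl.EarliestRetry())
        }
        var ce CapacityError
        if errors.As(err, &ce) && ce.IsCapacityError() {
            return "capacity: reduce maxConcurrency"
        }
        return "other: log create failed"
    }

    // rateLimitErr is a hypothetical stand-in for the error the EC2 driver
    // should return when RunInstances is throttled.
    type rateLimitErr struct{ until time.Time }

    func (e rateLimitErr) Error() string            { return "RunInstances rate limit exceeded" }
    func (e rateLimitErr) EarliestRetry() time.Time { return e.until }

    func main() {
        fmt.Println(classify(rateLimitErr{until: time.Now().Add(time.Minute)}))
    }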

I think part of the reason this is happening is that the AWS SDK v2 reportedly changed the way it handles the underlying API errors and bubbles them up. See this Terraform issue. Unfortunately the linked "newly published guidance" doesn't seem to be available any more, but I think my more immediate concern is just that our error checks probably aren't working as originally intended because of these changes. See also this long AWS SDK issue.
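
To illustrate the failure mode (isEC2Throttle below is a hypothetical stand-in for that kind of check, not the actual driver code):

    package main

    import (
        "errors"
        "fmt"

        "github.com/aws/smithy-go"
    )

    // In SDK v2 the server-side error code is only reachable by unwrapping
    // with errors.As. The "failed to get rate limit token" error in the log
    // above is generated client-side by the SDK's standard retryer, so it
    // carries no smithy.APIError and a check like this never matches it.
    func isEC2Throttle(err error) bool {
        var apiErr smithy.APIError
        return errors.As(err, &apiErr) && apiErr.ErrorCode() == "RequestLimitExceeded"
    }

    func main() {
        err := errors.New("failed to get rate limit token, retry quota exceeded")
        fmt.Println(isEC2Throttle(err)) // false: the throttle goes undetected
    }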


Subtasks 1 (0 open, 1 closed)

Task #23189: Review 23178-ec2-rate-limit (Resolved, Tom Clegg, 10/09/2025)

Related issues 1 (0 open, 1 closed)

Related to Arvados - Bug #23196: Should track instance capacity state separately for preemptible and non-preemptible instance types (Resolved, Tom Clegg, 10/14/2025)
#2

Updated by Brett Smith 6 months ago

  • Description updated (diff)
#3

Updated by Brett Smith 6 months ago

I think "Retries and Timeouts" is the documentation Terraform referred to. I'm especially looking at the "Client-side rate limiting" section, which says:

If you find that the default behavior does not fit your application's needs, you can disable it with ratelimit.None.

This seems to be how you restore SDK v1-like behavior, which I think is what we want given that we do our own retry handling at the application level.
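
Per that documentation, the config change would look something like this (a sketch; a-d-c builds its EC2 client from its own cluster config rather than LoadDefaultConfig):

    package main

    import (
        "context"
        "log"

        "github.com/aws/aws-sdk-go-v2/aws"
        "github.com/aws/aws-sdk-go-v2/aws/ratelimit"
        "github.com/aws/aws-sdk-go-v2/aws/retry"
        "github.com/aws/aws-sdk-go-v2/config"
        "github.com/aws/aws-sdk-go-v2/service/ec2"
    )

    func main() {
        // Keep the standard retryer but disable its client-side rate limiter,
        // per the "Client-side rate limiting" section of Retries and Timeouts.
        cfg, err := config.LoadDefaultConfig(context.TODO(),
            config.WithRetryer(func() aws.Retryer {
                return retry.NewStandard(func(o *retry.StandardOptions) {
                    o.RateLimiter = ratelimit.None
                })
            }),
        )
        if err != nil {
            log.Fatal(err)
        }
        _ = ec2.NewFromConfig(cfg)
    }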

#4

Updated by Brett Smith 6 months ago

  • Target version set to Development 2025-10-15
  • Assigned To set to Tom Clegg
#5

Updated by Brett Smith 6 months ago

  • Subtask #23189 added
#6

Updated by Tom Clegg 6 months ago

Aside: That log message does not really mean a-d-c interpreted the error as a capacity error:

    if err, ok := err.(cloud.QuotaError); ok && err.IsQuotaError() {
            ... // quota-error handling
    }
    if err, ok := err.(cloud.CapacityError); ok && err.IsCapacityError() {
            ... // capacity-error handling
    }
    // this log line runs in every case, whether or not a branch above matched
    logger.WithError(err).Error("create failed")

But the main point stands, i.e., the AWS SDK's auto-retry behavior is interfering with a-d-c's error handling.

#7

Updated by Tom Clegg 6 months ago

  • Status changed from New to In Progress

23178-ec2-rate-limit @ 0c3e4f98d6941321f14a846cd55756472c7e7f83 -- developer-run-tests: #4897
  • Disable the AWS SDK's auto-retry (trivial).
  • When handling a quota error, log how long the pause will be, which instance type/group it applies to, etc.
  • When reporting an instance create error, log whether it was handled as a quota and/or throttle error.
#8

Updated by Brett Smith 6 months ago

On review, I got a little touchy about the distinction between retries and rate limiting. The specific error we're seeing gets logged when the internal rate limiter has no more tokens, which makes me think we should be looking to disable rate limiting, not retrying.

However, note also that that happens in a standard retryer method. A look at the corresponding NopRetryer method confirms that, true to its name, it does not talk to a rate limiter at all.

So, if I've followed right, switching to NopRetryer is the change that gets us closest to the SDK v1 behavior that a-d-c expects, and therefore is the best way to resolve this issue. If we're on the same page about this, then this LGTM, thanks.
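
For reference, the NopRetryer wiring is roughly this (a sketch, not the branch's actual diff; aws.NopRetryer makes MaxAttempts() return 1 and never consults a rate limiter, so errors surface to a-d-c immediately):

    package main

    import (
        "context"
        "log"

        "github.com/aws/aws-sdk-go-v2/aws"
        "github.com/aws/aws-sdk-go-v2/config"
        "github.com/aws/aws-sdk-go-v2/service/ec2"
    )

    func main() {
        // Disable both auto-retry and client-side rate limiting: every error,
        // including RunInstances throttles, is returned to the caller at once.
        cfg, err := config.LoadDefaultConfig(context.TODO(),
            config.WithRetryer(func() aws.Retryer { return aws.NopRetryer{} }),
        )
        if err != nil {
            log.Fatal(err)
        }
        _ = ec2.NewFromConfig(cfg)
    }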

#9

Updated by Tom Clegg 6 months ago

> On review, I got a little touchy about the distinction between retries and rate limiting. The specific error we're seeing gets logged when the internal rate limiter has no more tokens, which makes me think we should be looking to disable rate limiting, not retrying.

Sorry, I mentioned auto-retry in the commit message but neglected to call out in my note that auto-retry is also undesirable for a-d-c. It's better for a-d-c to see and report errors immediately, and retry later if the node is still needed. As you noted, switching to NopRetryer gets rid of both.

> So, if I've followed right, switching to NopRetryer is the change that gets us closest to the SDK v1 behavior that a-d-c expects, and therefore is the best way to resolve this issue. If we're on the same page about this, then this LGTM, thanks.

Yes.

#10

Updated by Tom Clegg 6 months ago

  • Status changed from In Progress to Resolved
#11

Updated by Tom Clegg 6 months ago

  • Related to Bug #23196: Should track instance capacity state separately for preemptible and non-preemptible instance types added
#12

Updated by Brett Smith 5 months ago

  • Release changed from 82 to 79
