Project

General

Profile

Actions

Bug #22779

closed

Salt installer sets AWS credentials in /root/.aws that avoids a-d-c to work properly

Added by Lucas Di Pentima 12 months ago. Updated 10 months ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
Deployment
Target version:
Story points:
-
Release relationship:
Auto

Description

Presumably when deploying a multi node cluster in AWS with Route53 and LetsEncrypt, there're related credentials that get stashed in /root/.aws/ so that the installer can use DNS based auth to request TLS certs.
When this happens, the dispatcher service cannot launch new compute nodes, probably because of the recent change in AWS SDK version that we're using in Golang code.
When these credentials are removed and a-d-c restarted, everything goes back to normal.

See https://docs.aws.amazon.com/sdkref/latest/guide/settings-reference.html#EVarSettings for potential options to try.


Subtasks 1 (0 open1 closed)

Task #22803: Review 22779-adc-aws-creds (arvados-formula & arvados repos)ClosedBrett Smith05/28/2025Actions
Actions #2

Updated by Lucas Di Pentima 11 months ago

  • Assigned To set to Lucas Di Pentima
  • Description updated (diff)
Actions #3

Updated by Peter Amstutz 11 months ago

  • Subtask #22803 added
Actions #4

Updated by Peter Amstutz 11 months ago

  • Target version changed from Development 2025-04-30 to Development 2025-05-14
Actions #5

Updated by Peter Amstutz 11 months ago

  • Target version changed from Development 2025-05-14 to Development 2025-05-28
Actions #6

Updated by Lucas Di Pentima 10 months ago

  • Status changed from New to In Progress
Actions #7

Updated by Lucas Di Pentima 10 months ago

22779-adc-aws-creds at 36aee5b0 (arvados repo)
22779-adc-aws-creds at commit 928762f4 (arvados-formula repo)

test-provision: #1197

  • All agreed upon points are implemented / addressed. Describe changes from pre-implementation design.
    • Yes. This change installs an arvados-dispatch-cloud service override to pass a couple of environment variables that makes the AWS SDK to ignore credentials from ~/.aws/. Details described below.
  • Anything not implemented (discovered or discussed during work) has a follow-up story.
    • Discovered a potential bug in a-d-c where using a ED25519 type SSH key made the service to output the following error message in journal logs: "Could not make key fingerprint: Unmarshal failed to parse public key: ssh: short read". Filed a new ticket for this: https://dev.arvados.org/issues/22935
  • Code is tested and passing, both automated and manual, what manual testing was done is described.
    • Yes. Manual testing was done by deploying a multi node test cluster with the old arvados-formula, confirming that the problem was happening and then updating the provision script to point to the new arvados-formula. After re-running the installer, a-d-c started working again.
  • New or changed UI/UX and has gotten feedback from stakeholders.
    • No changes in UI/UX
  • Documentation has been updated.
    • No documentation updates needed as this is a fix for a bug produced by us changing the AWS SDK version used in a-d-c a while ago.
  • Behaves appropriately at the intended scale (describe intended scale).
    • No changes in scale
  • Considered backwards and forwards compatibility issues between client and server.
    • No compatibility issues in this case.
  • Follows our coding standards and GUI style guidelines.
    • Not applicable

The following describes how I got to the final solution:

  • 1st attempt: Making certbot use a different AWS profile so that the default is left empty
    • The dns-route53 plugin only uses the default profile, so an alternative would be to pass credentials as envvars: https://certbot-dns-route53.readthedocs.io/en/stable/#credentials
    • This might require passing envvars to the certbot timer (for renewals) and modifying the letsencrypt-formula to use them on the initial cert request, so it's not super trivial.
  • 2nd attempt: Setting an a-d-c service override with an envvar like AWS_PROFILE=nonexistant
    • This makes a-d-c to fail to start
  • 3rd attempt: Setting an a-d-c service override with an envvar like AWS_PROFILE=arvados-dispatch-cloud and setting this profile in /root/.aws/ as being empty
    • This allows a-d-c to start but when scheduling a container request, it starts erroring with messages like: "Could not make key fingerprint: Unmarshal failed to parse public key: ssh: short read"
  • 4th attempt: Setting an a-d-c service override with envvars AWS_SDK_LOAD_CONFIG=0 & AWS_SHARED_CREDENTIALS_FILE=/dev/null
    • Confirming that the instance profile is correct by running: AWS_SDK_LOAD_CONFIG=0 AWS_SHARED_CREDENTIALS_FILE=/dev/null /usr/bin/aws sts get-caller-identity
      • Got the name of the dispatcher role
    • Getting the same key fingerprint error message as before, so I'm thinking this is another issue, most probably a config error from my part
    • Yes... set a different dispatcher ssh key and it started working (RSA vs ED25519 key). This might be a real bug as SSH currently uses ED25519 by default and the documentation doesn't ask for a RSA key.

So, I opted for the 4th attempt solution as looks the more elegant option.

Actions #8

Updated by Brett Smith 10 months ago

I'm not convinced the Salt installer is the right place to address this. /root/.aws could be created by all kinds of tools, not just letsencrypt-formula. If a-d-c doesn't behave correctly when credentials exist there, then we should fix that in a-d-c.

Easiest thing would be to set those environment variables directly in source:cmd/arvados-server/arvados-dispatch-cloud.service.

Even better would be extend the service to run with DynamicUser=on. This would bring a lot of security benefits, including not loading unintended user configuration.

I'm guessing we could also more specifically tell the AWS SDK what credentials we do/n't want to load.

Actions #9

Updated by Tom Clegg 10 months ago

Using credentials from ~/.aws is normal AWS client behavior. We could declare that it's normal for the arvados-dispatch-cloud service to skip that source of credentials, and the sysadmin can un-skip it using systemd override files. Or we could leave it alone, and the sysadmin (or salt installer) can skip it using systemd override files.

Either way, similar to the certbot situation, we have an arvados-server cloudtest tool which ideally should use the same credentials as a-d-c, since it's intended to help troubleshoot things like this.

Actions #10

Updated by Brett Smith 10 months ago

Tom Clegg wrote in #note-9:

Using credentials from ~/.aws is normal AWS client behavior. We could declare that it's normal for the arvados-dispatch-cloud service to skip that source of credentials, and the sysadmin can un-skip it using systemd override files. Or we could leave it alone, and the sysadmin (or salt installer) can skip it using systemd override files.

I'm basically agnostic on the question of whether or not it reads ~/.aws. What I care about more is the (lack of) migration story. If I've followed right, a-d-c didn't used to read ~/.aws, and now not only does it read it, it prioritizes that source of credentials over everything else, and that change was completely undocumented, because it was basically an unintended side effect of upgrading the SDK. Which leads to situations like this ticket where an install that used to work perfectly fine suddenly doesn't, with no warning in our documentation.

If I followed that right: personally I think we should ship a fix for 3.1.x that keeps a-d-c's behavior as close to its pre-SDK-upgrade behavior as possible, to minimize breakage. I'd be open to revisiting this for 3.2.0 with an appropriate upgrade note. And yes, whatever we decide, the beauty of implementing it in systemd is that the local sysadmin always has an easy way to override our decision.

Either way, similar to the certbot situation, we have an arvados-server cloudtest tool which ideally should use the same credentials as a-d-c, since it's intended to help troubleshoot things like this.

Does arvados-server cloudtest currently check that it's running with access to the same IAMInstanceProfile as a-d-c? (Does it even have a way to know what that profile is?)

Actions #11

Updated by Brett Smith 10 months ago

One thing I'm curious about (and would be willing to test) is: if you set a specific AWS_PROFILE, and that profile doesn't exist at all, does a-d-c fall back to the IAMInstanceProfile? Or does it error out? If the former, that seems like a nice way forward: we can set a specific AWS_PROFILE, and administrators who want to set that can do so, but if you haven't you still get the pre-SDK-upgrade behavior.

Sorry, I got tunnel vision looking at Lucas' 3rd attempt and forgot the 2nd covered this, my bad.

Actions #12

Updated by Brett Smith 10 months ago

Discussed at standup: we will set the two environment variables in the service for 3.1.2. This gets us closer to the original behavior while leaving a way for administrators to override. We may change the behavior for 3.2.0, but it'll probably involve a code change (and maybe some work on arvados-server cloudtest to help keep it consistent) and be appropriately documented.

Looking at the AWS SDK code, I did not find any references to AWS_SDK_LOAD_CONFIG. It is used by other AWS tools, but it doesn't seem to be supported by this one. The algorithm is:

  • first it reads settings for AWS_PROFILE (default default) from the corresponding section of AWS_SHARED_CREDENTIALS_FILE (default ~/.aws/credentials)
  • then it adds settings for AWS_PROFILE from the corresponding profile section of AWS_CONFIG_FILE (default ~/.aws/config)

So I believe the best way to prevent reading ~/.aws at all is to set both AWS_CONFIG_FILE and AWS_SHARED_CREDENTIALS_FILE to /dev/null. I believe that achieves the same result as Lucas' suggestion, while using variables that are definitely read by the AWS SDK (see config/env_config.go).

Actions #13

Updated by Brett Smith 10 months ago

  • Status changed from In Progress to Resolved

With 9b0c6975b4c70985add11402b712e1424d8d990e, I'm saying we have fixed the bug and this is resolved. We can have a separate ticket about changing this behavior in a future release—because we still need to decide what that will actually be.

Actions #14

Updated by Brett Smith 10 months ago

  • Release set to 78
Actions

Also available in: Atom PDF