Support #22238
Prototype Ansible installer
Status: Closed
Description
Requirements:
- Primary configuration used by the playbook should come from config.yml, i.e. the user creates an Arvados config and the installer figures out what needs to go where based on that. The installer might still process the config.yml to produce the final one that is sent to the nodes, but the basic structure of the input should match config.yml.
We want to avoid creating additional configuration layers that just get turned into config.yml values.
"Don't force the user to reconfigure anything that is straightforwardly configured in config.yml"
However, stuff that isn't adequately described in the current config.yml, such as cluster topology, should be expressed separately.
Should be able to reason about roles, e.g. node X is the load balancer node, so it needs to write an nginx config that refers to nodes Y and Z, which are backend controller nodes.
Also, secret values should be kept in a separate file, so we have the option of keeping them in a secret store. This could be as simple as a separate "secrets" file that also has the same shape as "config.yml" and is merged with the main config.yml at build time.
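As a hedged illustration of that idea (the keys shown are real config.yml keys, but the file split and the merge step are just one possible shape, not a settled design), the secrets file could mirror config.yml and be merged over it before the playbook runs:

  # config.yml: non-secret cluster configuration
  Clusters:
    zzzzz:
      Services:
        Controller:
          ExternalURL: "https://controller.example.com"

  # secrets.yml: same shape, secret values only, merged over config.yml at build time
  Clusters:
    zzzzz:
      SystemRootToken: "example-not-a-real-token"
      PostgreSQL:
        Connection:
          password: "example-not-a-real-password"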
Success criteria:
- A multi-role Ansible playbook that deploys Arvados services on a single host and passes Arvados diagnostics
- A new Jenkins job that tests the installer by performing the install and running diagnostics
Files
Updated by Peter Amstutz over 1 year ago
- Target version changed from Development 2024-11-20 to Development 2024-12-04
Updated by Peter Amstutz over 1 year ago
- Blocked by Support #22318: Prototype using ansible to set up local test environments added
Updated by Peter Amstutz over 1 year ago
- Target version changed from Development 2024-12-04 to Development 2025-01-08
Updated by Peter Amstutz about 1 year ago
- Target version changed from Development 2025-01-08 to Development 2025-01-29
Updated by Peter Amstutz about 1 year ago
- Target version changed from Development 2025-01-29 to Development 2025-02-12
Updated by Peter Amstutz about 1 year ago
- Target version changed from Development 2025-02-12 to Development 2025-01-29
Updated by Peter Amstutz about 1 year ago
- Target version changed from Development 2025-01-29 to Development 2025-02-12
Updated by Peter Amstutz about 1 year ago
- Target version changed from Development 2025-02-12 to Development 2025-02-26
Updated by Brett Smith about 1 year ago
- Target version changed from Development 2025-02-26 to Development 2025-03-19
Updated by Peter Amstutz about 1 year ago
- Target version changed from Development 2025-03-19 to Development 2025-04-02
Updated by Peter Amstutz 12 months ago
- Related to Idea #18337: Easier install using Ansible added
Updated by Brett Smith 11 months ago
What This Is
tools/ansible/install-arvados-cluster.yml is a playbook to deploy all core Arvados services in a cluster. The methodology is similar to the existing Salt installer: each service corresponds to a group of hosts in the inventory like arvados_api, arvados_workbench, etc. The playbook deploys that service on each host in the group. Thanks to this, it naturally supports single-node and multi-node installs. For a single-node install, you just put the same host in each service group. For a multi-node install, you have different hosts in different groups. The playbook serializes plays in a way that will minimize downtime for any individual service as much as possible.
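As a hedged sketch of that layout (hostnames are illustrative, and only two of the service groups described above are shown), an Ansible YAML inventory for a single-node install could list the same host in every group:

  # Sketch only: hostnames are examples, not taken from the branch.
  all:
    children:
      arvados_api:
        hosts:
          cluster.example.com:
      arvados_workbench:
        hosts:
          cluster.example.com:   # same host in each service group = single-node install

A multi-node install would instead put different hosts in the different groups.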
It reads as much information as possible out of the Arvados cluster configuration file to write supporting infrastructure like service environment variables, nginx front-end sites, etc. You just need to add some additional variables to the configuration file describing stuff that isn't in the cluster configuration file like your network topology and source of TLS certificates.
This approach gives us a clean and easy solution to the upgrade problem that has dogged the Salt installer forever. Arvados developers are responsible for the playbook; cluster administrators are responsible for their Arvados cluster configuration and Ansible inventory. When an administrator wants to upgrade to a new release of Arvados, they just get the corresponding Ansible playbook, and then run that new version with their own cluster configuration and inventory that they maintain separately.
Using this playbook, I have deployed a test cluster across multiple systemd-nspawn VMs that is fully functional: I can use all of Workbench and the cluster passes all arvados-client diagnostics (barring #22828).
What This Doesn't Do (Yet)
Except for stuff that's strictly required like Docker, there is no deployment of or integration with third-party software: no Let's Encrypt, no Loki, no Grafana. Those are all follow-up tasks. In particular, I think we should think carefully about whether those need to be integrated into this installer playbook, or if they could be written as separate playbooks to make it easier to pick and choose services.
For similar reasons, there's no code to deploy the SLURM or LSF dispatchers. We discussed scoping this out in a meeting. Those can also be separate tasks.
I am sure there are bugs. In particular, I am sure there are some possible cluster configurations that will trip up the Ansible playbook. Those can also be fixed in follow-ups. One I'm aware of is #22830, but that's already an issue in playbooks in main like the compute node builder, so I don't think it even makes sense to address it as part of this ticket.
This does not make any attempt to be backwards compatible with the Salt installer. We would need to work on the above to achieve feature parity first. Once we get there (or at least close enough), I think we should consider a separate playbook to "migrate" a Salt install to an Ansible install. I would also be open to making minor changes to configuration file paths, etc. to improve compatibility.
This does not deploy webshell because we'd rather deal with #20802 first.
How I Tested
I am going to try to give you everything so you can reproduce exactly what I did, down to the cluster ID if you really want.
- Set up systemd-nspawn following the wiki. Note this setup assumes you have mymachines in the hosts line of nsswitch.conf.
- Bootstrap the first VM:
  ansible-playbook -e "image_name=z2a05a image_authorized_keys=$HOME/.ssh/authorized_keys debootstrap_suite=bullseye" build-debian-nspawn-vm.yml
  I'm using bullseye because that was the oldest Debian we supported when I started development. In principle bookworm should work but that's not how I tested.
- Configure it (your ResolvConf setting depends on your host DNS settings):
  sudo tee /etc/systemd/nspawn/z2a05a.nspawn >/dev/null <<EOF
  [Exec]
  ResolvConf=bind-uplink
  [Network]
  Zone=z2a05
  EOF
- Clone other nodes as desired:
  for suf in b1 b2 c1 f s; do machinectl clone z2a05a "z2a05$suf" || break; done
- Privilege nodes that will run Docker:
  ansible-playbook -Ke container_name=z2a05c1 privilege-nspawn-vm.yml
  Repeat this process for each node running Docker, like z2a05s.
- Start the nodes:
  machinectl start z2a05{a,b1,b2,c1,f,s}
- Set up DNS and TLS on those nodes:
  ansible-playbook -Ki z2a05.ini setup-nspawn-zone.yml
  Basically this playbook sets up DNS and TLS for the cluster. This is the kind of stuff that would typically be handled by an external service in production.
- Combine the certificates into one trusted set:
  cat z2a05[a-z]*.pem > z2a05.pem
- Install the cluster:
  ansible-playbook -Ki z2a05.ini install-arvados-cluster.yml
- Create Keep service records for your keepstore servers, update z2a05.yml with your new service UUIDs, and rerun install-arvados-cluster.yml to update all the servers.
- Tell Firefox to trust the certificates: in Firefox settings,
  1. search for Certificates
  2. open "View Certificates…"
  3. open "Add exception…"
  4. enter the controller external URL "https://z2a05a/"
  5. press "Get certificate"
  6. check "Permanently store this exception"
  7. press "Confirm Security Exception"
  8. repeat this process from step 3 for your keep-web, WebDAV, and Workbench external URLs
- Note that when you run Ruby CLI tools, you'll want to export SSL_CERT_FILE=/etc/arvados/ca-certificates.crt to get them to trust your own certificates like other tools.
Updated by Brett Smith 11 months ago
- File z2a05.ini z2a05.ini added
- File setup-nspawn-zone.yml setup-nspawn-zone.yml added
- File z2a05.yml z2a05.yml added
Updated by Peter Amstutz 11 months ago
These are just comments about the "Setting up systemd-nspawn VMs" page.
The section about writing nspawn-image.yml is hand-wavy about which fields are required and which are not. Technically I guess none of them are required, but you end up with an image called "stable".
I was confused by how image_name refers to {{ debootstrap_suite }} and I thought maybe debootstrap_suite had some special meaning provided by the playbook, but actually it is just the variable defined on the next line. If it's meaningful/useful to have that be the default for image_name, they should be switched so that debootstrap_suite is defined first, but actually if we want to force the user to provide any field at all, it should probably be the name of the image.
I put my public key in image_authorized_keys (e.g. "image_authorized_keys: ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIEXWIBlCQPd5goeChL7F4KotAssIiG/qAG8YKa4BbMl2 peter@peter-curii-app"). I left image_passhash and image_gecos the way they were and it didn't create a user. So maybe the passhash is actually required?
I ran ansible-playbook and it just sat there for five minutes at "TASK [debootstrap]". Finally I hit control-C and tried again and on the second try it worked fine. I think I must have mistyped my password (ansible-playbook was using like 3% CPU so it wasn't quite completely asleep but it wasn't doing very much either). I guess there's not much you can do about that, but that kind of sucks that it won't error out but just hangs when that happens?
For networking, I'm finding that a virtual ethernet bridge which allows your VMs to get IP addresses from your home router is actually a lot less fussy than a separate private network that has to be masqueraded to talk to anything else. It makes DNS slightly more tractable by avoiding an extra layer of network address translation. You can also run avahi and have the VMs advertise "foo.local" addresses. Setting that up with systemd-nspawn is firmly a "me" problem, though.
I'll keep fiddling with it.
Updated by Brett Smith 11 months ago
Peter Amstutz wrote in #note-22:
I put my public key in image_authorized_keys (e.g. "image_authorized_keys: ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIEXWIBlCQPd5goeChL7F4KotAssIiG/qAG8YKa4BbMl2 peter@peter-curii-app"). I left image_passhash and image_gecos the way they were and it didn't create a user. So maybe the passhash is actually required?
I don't have an explanation for this and would need logs or something to understand what happened. The initial user account setup is completely separate from the key authorization. My best guess is you ran Ansible as root so the playbook just modified the existing root user in the image rather than making a new user account—but I don't think that's a good guess, just the only thing I can think of.
I ran ansible-playbook and it just sat there for five minutes at "TASK [debootstrap]". Finally I hit control-C and tried again and on the second try it worked fine. I think I must have mistyped my password (ansible-playbook was using like 3% CPU so it wasn't quite completely asleep but it wasn't doing very much either). I guess there's not much you can do about that, but that kind of sucks that it won't error out but just hangs when that happens?
In my experience if you type your password wrong you get an "Incorrect sudo password" error as soon as the playbook tries to become. Try it yourself. I suspect what happened here was:
- debootstrap took a while. Hard to say whether it was abnormally slow or if you just got impatient. But it makes sense that Ansible wasn't using much CPU since it was just waiting for debootstrap.
- But it got far enough to make /var/lib/machines/stable/etc/os-release, so on the second run the playbook saw that and took it as a sign that it didn't need to re-run debootstrap.
So I would guess you have an incomplete image and probably want to rebuild it. And I guess the playbook needs a more explicit success check than relying on /etc/os-release.
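For illustration, a hedged sketch of what a more explicit check could look like (the nspawn_image_root variable and stamp file path are hypothetical, not what the playbook currently uses): only run debootstrap when a completion stamp is missing, and only write the stamp after debootstrap succeeds.

  # Sketch only: nspawn_image_root and the stamp path are hypothetical.
  - name: Check for debootstrap completion stamp
    ansible.builtin.stat:
      path: "{{ nspawn_image_root }}/.debootstrap-complete"
    register: debootstrap_stamp

  - name: Run debootstrap
    ansible.builtin.command:
      argv:
        - debootstrap
        - "{{ debootstrap_suite }}"
        - "{{ nspawn_image_root }}"
    when: not debootstrap_stamp.stat.exists

  - name: Record that debootstrap finished
    ansible.builtin.file:
      path: "{{ nspawn_image_root }}/.debootstrap-complete"
      state: touch
    when: not debootstrap_stamp.stat.exists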
For networking, I'm finding that a virtual ethernet bridge which allows your VMs to get IP addresses from your home router is actually a lot less fussy than a separate private network that has to be masqueraded to talk to anything else. It makes DNS slightly more tractable by avoiding an extra layer of network address translation. You can also run avahi and have the VMs advertise "foo.local" addresses. Setting that up with systemd-nspawn is firmly a "me" problem, though.
Yeah, like, that sounds cool, go for it, but relying on that just adds to the list of undocumentable prerequisites, I can't tell every Arvados developer how to configure their home router to do this.
Updated by Peter Amstutz 11 months ago
In general, I determined that almost all of my problems were issues with properly setting up nspawn-image.yml, so I have been editing the wiki page to reflect my understanding of what needs to go into that file.
Brett Smith wrote in #note-23:
I don't have an explanation for this and would need logs or something to understand what happened. The initial user account setup is completely separate from the key authorization. My best guess is you ran Ansible as root so the playbook just modified the existing root user in the image rather than making a new user account—but I don't think that's a good guess, just the only thing I can think of.
It turns out ansible_user_id was not present in ansible_facts, meaning using the example as written would leave the username blank, so it didn't set up a user.
In my experience if you type your password wrong you get an "Incorrect sudo password" error as soon as the playbook tries to become. Try it yourself. I suspect what happened here was:
- debootstrap took a while. Hard to say whether it was abnormally slow or if you just got impatient. But it makes sense that Ansible wasn't using much CPU since it was just waiting for debootstrap.
It was definitely abnormally slow, but I don't know what it was waiting on, and I haven't had it happen again. Debootstrap usually completes in about a minute, and when this happened I had given it something like ten minutes before giving up.
Yeah, like, that sounds cool, go for it, but relying on that just adds to the list of undocumentable prerequisites, I can't tell every Arvados developer how to configure their home router to do this.
You don't have to. I updated the wiki page with more information about that as an alternative option.
Updated by Peter Amstutz 11 months ago
OK, the one issue I ran into that I think needs to be fixed in build-debian-nspawn-vm.yml is that it needs to delete /etc/hostname. Otherwise, when you start the VM with machinectl start, it will take the hostname from /etc/hostname and not from the machine name you used.
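For reference, a minimal sketch of that fix as an extra task in build-debian-nspawn-vm.yml (the nspawn_image_root variable name is an assumption about how the playbook refers to the image directory):

  # Sketch only: nspawn_image_root is a hypothetical variable for the image directory.
  - name: Remove /etc/hostname so the machine name is used as the hostname
    ansible.builtin.file:
      path: "{{ nspawn_image_root }}/etc/hostname"
      state: absent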
Updated by Lucas Di Pentima 11 months ago
I've tested this branch by deploying to a QEMU VM. My comments & questions below:
- The use of the ansible_fqdn fact created issues on my test, where I had a different hostname from the DNS name assigned to the VM. Might be convenient to use inventory_hostname instead? Just a thought.
- Made the mistake of not setting a SystemRootToken in the config file, so it was set to the empty string. This made the VM machine query step fail with a "not logged in" error from the API. Maybe adding a validation on those values would make things more user friendly.
- There's at least one variable that is set in the .ini (inventory?) file: arvados_tls, which is required to have a special key ('Default' in this case), but I don't see it declared anywhere else in the branch; it's just included in the example .ini file attached to this ticket. I think we should have an example .ini file committed to the repo as a template, or maybe this arvados_tls should have a default value set elsewhere?
- Can you explain the difference between include_role and ansible.builtin.include_role? Maybe they're the same thing but the latter is its fully qualified name?
Updated by Brett Smith 11 months ago
Lucas Di Pentima wrote in #note-26:
- The use of the ansible_fqdn fact created issues on my test, where I had a different hostname from the DNS name assigned to the VM. Might be convenient to use inventory_hostname instead? Just a thought.
Yeah, this is probably a good call. The inventory_hostname can also be an alias; there's nothing in Ansible that requires it to be real, but we can document that you need to write your inventory to use the same hostnames as your Arvados external URLs. Administrators can separately configure ansible_host if they need to connect to a different hostname than the Arvados configuration uses.
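As a hedged sketch of that pattern (hostnames and addresses are illustrative; arvados_api is one of the service groups mentioned above), the inventory hostname can match the Arvados external URL while ansible_host carries the address Ansible actually connects to:

  # Sketch only: hostname and address are examples.
  arvados_api:
    hosts:
      controller.example.com:      # matches the hostname used in the Arvados external URLs
        ansible_host: 192.0.2.10   # address Ansible uses for the SSH connection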
- Made the mistake of not setting a SystemRootToken in the config file, so it was set to the empty string. This made the VM machine query step fail with a "not logged in" error from the API. Maybe adding a validation on those values would make things more user friendly.
Yes. Like default values, this is an issue that affects our existing playbooks and it would be good to have a general solution for it. Filed #22862 to follow up.
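For illustration, one hedged sketch of what such a validation could look like as a pre-flight assertion (arvados_config and arvados_cluster_id are hypothetical variable names for the parsed configuration and the cluster ID, not the playbook's actual ones):

  # Sketch only: variable names are hypothetical.
  - name: Check that SystemRootToken is set
    ansible.builtin.assert:
      that:
        - arvados_config.Clusters[arvados_cluster_id].SystemRootToken | default('') | length > 0
      fail_msg: SystemRootToken must be set to a non-empty value in the cluster configuration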
- There's at least one variable that is set in the .ini (inventory?) file: arvados_tls, which is required to have a special key ('Default' in this case), but I don't see it declared anywhere else in the branch; it's just included in the example .ini file attached to this ticket. I think we should have an example .ini file committed to the repo as a template, or maybe this arvados_tls should have a default value set elsewhere?
Agreed this needs to be documented, but it's hard to know the right place to do it before we've documented the playbook at all: e.g., should we have a file of example facts, should it be documented on some install guide page, etc.
In general the structure of this object is:
arvados_tls:
  ServiceName:
    cert: PATH
    key: PATH
    remote: false
  …
The ServiceName corresponds to a key under Services in the Arvados cluster configuration. When the installer is configuring a TLS service, if it does not find the service defined in arvados_tls, it will fall back to using the Default values. So it's possible to configure TLS separately for each service, but there's also a single place to write common TLS configuration.
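For example, a hedged sketch of what that could look like (certificate paths are illustrative; Controller is one of the Services keys in the cluster configuration):

  # Sketch only: certificate paths are examples.
  arvados_tls:
    Default:
      cert: /etc/ssl/certs/arvados.pem
      key: /etc/ssl/private/arvados.key
      remote: false
    Controller:
      cert: /etc/ssl/certs/controller.pem
      key: /etc/ssl/private/controller.key
      remote: false

Here Controller gets its own certificate and every other TLS service falls back to the Default entry.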
- Can you explain the difference between include_role and ansible.builtin.include_role? Maybe they're the same thing but the latter is its fully qualified name?
You have it right: they're the same module, and the latter is its fully qualified name. Stylistically I think it's better to write the latter, but I didn't know this myself until I'd written more of the playbook.
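To illustrate (the role name is just an example), these two task forms load the same built-in action; the second simply uses the fully qualified collection name:

  # Both forms invoke the same built-in include_role action.
  - include_role:
      name: example_role
  - ansible.builtin.include_role:
      name: example_role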
Updated by Brett Smith 11 months ago
Lucas Di Pentima wrote in #note-26:
- The use of the ansible_fqdn fact created issues on my test, where I had a different hostname from the DNS name assigned to the VM. Might be convenient to use inventory_hostname instead? Just a thought.
Done in aa1d00f980daa49eea6649a2fbad1eecb24e49b3.
- There's at least one variable that is set in the .ini (inventory?) file: arvados_tls, which is required to have a special key ('Default' in this case), but I don't see it declared anywhere else in the branch; it's just included in the example .ini file attached to this ticket. I think we should have an example .ini file committed to the repo as a template, or maybe this arvados_tls should have a default value set elsewhere?
Added a complete example inventory in 46fcafa504884ca44bf44c5f8d00ed883061b3d5. This documents all the host groups as well as variables administrators will need or may likely want to configure.
Updated by Brett Smith 11 months ago
- Status changed from In Progress to Resolved
Applied in changeset arvados|cd7c12d37780d13db0d7b5b12626272d60565941.