Revision 20 (Tom Clegg, 01/23/2019 05:57 PM) → Revision 21/33 (Tom Clegg, 01/23/2019 06:52 PM)
h1. Cluster configuration

We are (2019) consolidating configuration from per-microservice yaml/json/ini files into a single cluster configuration document that is used by all components.

* Long term: system nodes automatically keep their configs synchronized (using something like consul).
* Short term: sysadmin uses tools like puppet and terraform to ensure /etc/arvados/config.yml is identical on all system nodes.
* Hosts without config files (e.g., hosts outside the cluster) can retrieve the config document from the API server.

h2. Discovery document

Previously, we copied selected config values from the API server config into the API discovery document so clients could see them. When clients can get the configuration document itself, this won't be needed. The discovery document should advertise APIs provided by the server, not cluster configuration.

h2. Secrets

Secrets like BlobSigningKey can be given literally in the config file (convenient for dev/test, consul-template, etc) or indirectly using a secret backend. Anticipated backends:

* <code class="yaml">BlobSigningKey: foobar</code> ⇒ the secret is literally <code>foobar</code>
* <code class="yaml">BlobSigningKey: "vault:foobar"</code> ⇒ the secret can be obtained from vault using the vault key "foobar"
* <code class="yaml">BlobSigningKey: "file:/foobar"</code> ⇒ the secret can be read from the local file @/foobar@
* <code class="yaml">BlobSigningKey: "env:FOOBAR"</code> ⇒ the secret can be read from the environment variable @FOOBAR@

h2. Implementation

Development strategy for switching config file format/location in an operator-friendly way:

# Read the new config file into an internal struct, if the new config file exists.
# Copy old config file values into the new config struct.
# Use the new config struct internally (the old config is no longer referenced except in the load-and-copy-to-new-struct step).
# Add a mechanism for dumping the new config struct at startup/runtime after loading both new and old configs. [optional?]
# At startup, if some parts of the old config are not redundant (i.e., haven't been migrated to the new config file by the operator), log a deprecation warning. [optional?]
# Wait one minor version release cycle.
# Error out if the new config file does not exist.
# Error out if the old config file exists (...and some parts of the old config are not redundant [optional?]).

h2. Example config file

(Format not yet frozen!)

Notes:

* Keys are CamelCase, except in special cases like PostgreSQL connection settings, which are passed through to another system without being interpreted by Arvados.
* Arrays and lists are not permitted. These cannot be expressed natively in consul, and tend to be troublesome anyway: "what changed?" is harder to answer usefully, significance of duplicate elements is unclear, etc.

<pre><code class="yaml">
Clusters:
  xyzzy:
    ManagementToken: eec1999ccb6d75840a2c09bc70b6d3cbc990744e
    BlobSigningKey: ungu355able
    BlobSignatureTTL: 172800
    SessionKey: 186005aa54cab1ca95a3738e6e954e0a35a96d3d13a8ea541f4156e8d067b4f3
    PostgreSQL:
      ConnectionPool: 32 # max concurrent connections per arvados server daemon
      Connection:
        # All parameters here are passed to the PG client library in a connection string;
        # see https://www.postgresql.org/docs/current/static/libpq-connect.html#LIBPQ-PARAMKEYWORDS
        Host: localhost
        Port: 5432
        User: arvados
        Password: s3cr3t
        DBName: arvados_production
        client_encoding: utf8
        fallback_application_name: arvados
    HTTPRequestTimeout: 5m
    Defaults:
      CollectionReplication: 2
      TrashLifetime: 2w
    UserActivation:
      ActivateNewUsers: true
      AutoAdminUser: root@example.com
      UserProfileNotificationAddress: notify@example.com
      NewUserNotificationRecipients: {}
      NewInactiveUserNotificationRecipients: {}
    RequestLimits:
      MaxRequestLogParamsSize: 2KB
      MaxRequestSize: 128MiB
      MaxIndexDatabaseRead: 128MiB
      MaxItemsPerResponse: 1000
      MultiClusterRequestConcurrency: 4
    LogLevel: info
    CloudVMs:
      BootProbeCommand: "docker ps -q"
      SSHPort: 22
      SyncInterval: 1m    # how often to get list of active instances from cloud provider
      TimeoutIdle: 1m     # shutdown if idle longer than this
      TimeoutBooting: 10m # shutdown if exists longer than this without running BootProbeCommand successfully
      TimeoutProbe: 2m    # shutdown if (after booting) communication fails longer than this, even if ctrs are running
      TimeoutShutdown: 1m # shutdown again if node still exists this long after shutdown
      Driver: Amazon
      DriverParameters:
        Region: us-east-1
        APITimeout: 20s
        AWSAccessKeyID: abcdef
        AWSSecretAccessKey: abcdefghijklmnopqrstuvwxyz
        ImageID: ami-0a01b48b88d14541e
        SubnetID: subnet-24f5ae62
        SecurityGroups: sg-3ec53e2a
    AuditLogs:
      MaxAge: 2w
      DeleteBatchSize: 100000
      UnloggedAttributes: {} # example: {"manifest_text": true}
    ContainerLogStream:
      BatchSize: 4KiB
      BatchTime: 1s
      ThrottlePeriod: 1m
      ThrottleThresholdSize: 64KiB
      ThrottleThresholdLines: 1024
      TruncateSize: 64MiB
      PartialLineThrottlePeriod: 5s
    Timers:
      TrashSweepInterval: 60s
      ContainerDispatchPollInterval: 10s
      APIRequestTimeout: 20s
    Scaling:
      MaxComputeNodes: 64
      EnablePreemptibleInstances: false
    DisableAPIMethods: {} # example: {"jobs.create": true}
    DockerImageFormats: {"v2": true}
    Crunch1:
      Enable: true
      CrunchJobWrapper: none
      CrunchJobUser: crunch
      CrunchRefreshTrigger: /tmp/crunch_refresh_trigger
      DefaultDockerImage: false
    NodeProfiles:
      # Key is a profile name; can be specified on service prog command line, defaults to $(hostname)
      keep:
        # Don’t run other services automatically -- only specified ones
        Default: {Disable: true}
        Keepstore: {Listen: ":25107"}
      apiserver:
        Default: {Disable: true}
        RailsAPI: {Listen: ":9000", TLS: true}
        Controller: {Listen: ":9100"}
        Websocket: {Listen: ":9101"}
        Health: {Listen: ":9199"}
      keep:
        Default: {Disable: true}
        KeepProxy: {Listen: ":9102"}
        KeepWeb: {Listen: ":9103"}
      "*":
        # This section used for a node whose profile name is not listed above
        Default: {Disable: false} # (this is the default behavior)
    Volumes:
      xyzzy-keep-0:
        Type: s3
        Region: us-east
        Bucket: xyzzy-keep-0
        # [rest of keepstore volume config goes here]
    WebRoutes:
      # “default” means route according to method/host/path (e.g., if host is a login shell, route there)
      xyzzy.arvadosapi.com: default
      # “collections” means always route to keep-web
      collections.xyzzy.arvadosapi.com: collections
      # leading * is a wildcard (longest match wins)
      "*--collections.xyzzy.arvadosapi.com": collections
      cloud.curoverse.com: workbench
      workbench.xyzzy.arvadosapi.com: workbench
      "*.xyzzy.arvadosapi.com": default
    InstanceTypes:
      m4.large:
        VCPUs: 2
        RAM: 8000000000
        Scratch: 31000000000
        Price: 0.1
      m4.large-1t:
        # same instance type as m4.large but our scripts attach more scratch
        ProviderType: m4.large
        VCPUs: 2
        RAM: 8000000000
        Scratch: 999000000000
        Price: 0.12
      m4.xlarge:
        VCPUs: 4
        RAM: 16000000000
        Scratch: 78000000000
        Price: 0.2
      m4.8xlarge:
        VCPUs: 40
        RAM: 160000000000
        Scratch: 156000000000
        Price: 2
      m4.16xlarge:
        VCPUs: 64
        RAM: 256000000000
        Scratch: 310000000000
        Price: 3.2
      c4.large:
        VCPUs: 2
        RAM: 3750000000
        Price: 0.1
      c4.8xlarge:
        VCPUs: 36
        RAM: 60000000000
        Price: 1.591
    RemoteClusters:
      xrrrr:
        Host: xrrrr.arvadosapi.com
        Proxy: true        # proxy requests to xrrrr on behalf of our clients
        AuthProvider: true # users authenticated by xrrrr can use our cluster
</code></pre>
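
As an illustration of the secret backends listed under "Secrets", the following is a minimal sketch of how a loader might turn a configured value into the actual secret. It is not the Arvados implementation. The function name @resolve_secret@, the injectable @env@ mapping and @vault_lookup@ callable, and the choice to strip trailing whitespace from file contents are all assumptions made for this example; only the @vault:@, @file:@ and @env:@ prefixes come from the text above.

```python
import os
from typing import Callable, Mapping, Optional


def resolve_secret(
    value: str,
    env: Optional[Mapping[str, str]] = None,
    vault_lookup: Optional[Callable[[str], str]] = None,
) -> str:
    """Return the secret that a config value refers to.

    Hypothetical helper: a value with no recognized prefix is treated
    as the literal secret.
    """
    env = os.environ if env is None else env

    if value.startswith("file:"):
        # "file:/foobar" -> contents of the local file /foobar.
        # Stripping surrounding whitespace is an assumption of this sketch.
        with open(value[len("file:"):], encoding="utf-8") as f:
            return f.read().strip()

    if value.startswith("env:"):
        # "env:FOOBAR" -> value of the environment variable FOOBAR.
        name = value[len("env:"):]
        if name not in env:
            raise KeyError(f"environment variable {name!r} is not set")
        return env[name]

    if value.startswith("vault:"):
        # "vault:foobar" -> whatever the caller-supplied lookup returns for
        # the vault key "foobar". No real vault client is assumed here.
        if vault_lookup is None:
            raise RuntimeError("no vault lookup function configured")
        return vault_lookup(value[len("vault:"):])

    # No prefix: the secret is given literally.
    return value
```

One consequence of this scheme, visible in the sketch, is that a literal secret that happens to start with @file:@, @env:@ or @vault:@ cannot be expressed without some escaping rule, which the text above does not define.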
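
The load-and-copy steps of the development strategy under "Implementation" can be sketched roughly as follows. This is only an illustration under simplifying assumptions: both configs are flat dictionaries, old keys have already been translated to their new names, and a value counts as "not redundant" when the new config lacks it or disagrees with it. The function name @load_cluster_config@ and the logger name are hypothetical.

```python
import logging
from typing import Any, Dict, List, Optional, Tuple

log = logging.getLogger("config")


def load_cluster_config(
    new_cfg: Optional[Dict[str, Any]],
    old_cfg: Optional[Dict[str, Any]],
) -> Tuple[Dict[str, Any], List[str]]:
    """Merge old config values into the new config struct.

    Returns the merged config and the sorted list of old keys that the
    operator has not yet migrated to the new config file.
    """
    # Start from the new config file, if it exists.
    merged: Dict[str, Any] = dict(new_cfg) if new_cfg is not None else {}

    unmigrated: List[str] = []
    for key, value in (old_cfg or {}).items():
        # An old value is redundant only if the new file has the same value.
        if new_cfg is None or key not in new_cfg or new_cfg[key] != value:
            unmigrated.append(key)
        # Copy old config file values into the new config struct.
        merged[key] = value

    unmigrated.sort()
    if unmigrated:
        # Deprecation warning for values still coming from the old config.
        log.warning(
            "deprecated: old config values not yet migrated to the new "
            "config file: %s",
            ", ".join(unmigrated),
        )
    return merged, unmigrated
```

After the waiting period described in the later steps, the same function could be tightened by raising an error when @new_cfg@ is @None@, or when @unmigrated@ is non-empty, instead of logging a warning.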