Revision 27 (Tom Clegg, 04/24/2019 01:00 PM) → Revision 28/33 (Tom Clegg, 05/06/2019 03:21 PM)
h1. Cluster configuration

We are (2019) consolidating configuration from per-microservice yaml/json/ini files into a single cluster configuration document that is used by all components.

* Long term: system nodes automatically keep their configs synchronized (using something like consul).
* Short term: sysadmin uses tools like puppet and terraform to ensure /etc/arvados/config.yml is identical on all system nodes.
* Hosts without config files (e.g., hosts outside the cluster) can retrieve the config document from the API server.

h2. Discovery document

Previously, we copied selected config values from the API server config into the API discovery document so clients could see them. When clients can get the configuration document itself, this won't be needed. The discovery document should advertise APIs provided by the server, not cluster configuration.

h2. Secrets

Secrets like BlobSigningKey can be given literally in the config file (convenient for dev/test, consul-template, etc) or indirectly using a secret backend.

Anticipated backends:

* <code class="yaml">BlobSigningKey: foobar</code> ⇒ the secret is literally <code>foobar</code>
* <code class="yaml">BlobSigningKey: "vault:foobar"</code> ⇒ the secret can be obtained from vault using the vault key "foobar"
* <code class="yaml">BlobSigningKey: "file:/foobar"</code> ⇒ the secret can be read from the local file @/foobar@
* <code class="yaml">BlobSigningKey: "env:FOOBAR"</code> ⇒ the secret can be read from the environment variable @FOOBAR@

h2. Instructions for ops

Tentative instructions for switching config file format/location:

# Upgrade Arvados to a version that supports loading all configs from the new cluster-wide config file (maybe 1.4). When services come back up, they will still use your old configuration files, but they will log some deprecation warnings.
# Migrate your configuration to the new config file, one component at a time. For each component:
## Restart the component.
## Inspect the deprecation warning that is logged at startup. It will tell you either "old config file is superfluous" or "new config file is incomplete".
## If your old config file is superfluous, delete it. You're done.
## Run the component with the "--config-diff" flag. This suggests changes to your new config file which will make your old config file obsolete. (Alternatively, run the component with the "--config-dump" flag. This outputs a new config file that would make your old config file obsolete. Saving this might be easier than applying a diff, but it will reorder keys and lose comments.)
## Make the suggested changes.
## Repeat until finished.
# Upgrade to a version that doesn't support old config files at all (maybe 1.5).

h2. Implementation

Development strategy for facilitating the above ops instructions:

# Read the new config file into an internal struct, if the new config file exists.
# Copy old config file values into the new config struct.
# Use the new config struct internally (the old config is no longer referenced except in the load-and-copy-to-new-struct step).
# Add a mechanism for showing the effect of the old config file on the resulting config struct (see "--config-diff" above).
# At startup, if the old config has any effect (i.e., some parts haven't been migrated to the new config file by the operator), log a deprecation warning recommending "--config-diff" and RTFM.
# Wait one minor version release cycle.
# Error out if the new config file does not exist.
# Error out if the old config file exists (...and some parts of the old config are not redundant [optional?]).

h2. Example/template config file

See also [[Config migration key mapping]]

(Format not yet frozen!)

Notes:

* Keys are CamelCase, except in special cases like PostgreSQL connection settings, which are passed through to another system without being interpreted by Arvados.
* Arrays and lists are not permitted. These cannot be expressed natively in consul, and tend to be troublesome anyway: "what changed?" is harder to answer usefully, significance of duplicate elements is unclear, etc.

<pre><code class="yaml">
Clusters:
  xyzzy: # api-server/uuid_prefix, sso/uuid_prefix
    SystemRootToken: # arvados-git-sync.rb/arvados_api_token, keepstore/SystemAuthTokenFile, c-d-s/AuthToken
    ManagementToken: # {arvados-ws,keepstore,keepproxy,keep-balance}/ManagementToken (& others)
    Services:
      RailsAPI:
        InternalURLs:
          "http://zzzzz:8000/": {} # api-server/(protocol,host,port)
        ExternalURL: "https://zzzzz.arvadosapi.com/"
        Insecure: false
      GitHTTP:
        InternalURLs:
          "http://git:9001/": {}
        ExternalURL: "https://git.zzzzz.arvadosapi.com/" # api-server/git_repo_https_base
      Keepstore:
        InternalURLs:
          "http://keep0:25107/": {Unlisted: true}
          "http://keep1:25107/": {Debug: true}
      Controller:
        InternalURLs:
          "http://zzzzz:9004/": {} # controller/NodeProfiles.$cluster.Controller.Listen
        ExternalURL: "https://zzzzz.arvadosapi.com/" # composer/apiEndPoint, workbench2/API_HOST, workbench/arvados_{login,v1}_base, arvados-ws/Client, keepproxy/Client
      Websocket:
        InternalURLs:
          "http://ws:9003/": {} # arvados-ws/Listen
        ExternalURL: "https://ws.zzzzz.arvadosapi.com/" # api-server/websocket_address
      Keepbalance:
        InternalURLs:
          "http://zzzzz:9005": {} # keepbalance/Listen
      GitHTTP:
        InternalURLs:
          "http://zzzzz:9001": {} # arvados-git-httpd/Listen
        ExternalURL: "https://git.zzzzz.arvadosapi.com/" # api-server/git_repo_https_base
      GitSSH:
        ExternalURL: "git@git.zzzzz.arvadosapi.com" # api-server/git_repo_ssh_base
      DispatchCloud:
        InternalURLs:
          "http://zzzzz:9006": {} # a-d-c/NodeProfiles
      SSO:
        ExternalURL: "https://auth.zzzzz.arvadosapi.com/" # api-server/sso_provider_url
      Keepproxy:
        InternalURLs:
          "http://keep:25107/": {} # keepproxy/Listen
        ExternalURL: "https://keep.zzzzz.arvadosapi.com/"
      WebDAV:
        InternalURLs:
          "http://keep:9002/": {} # keep-web/Listen
        ExternalURL: "https://*.collections.zzzzz.arvadosapi.com/" # api-server/keep_web_service_url, workbench/keep_web_url
      WebDAVDownload:
        InternalURLs:
          "http://keep:9002/": {} # keep-web/Listen
        ExternalURL: "https://download.zzzzz.arvadosapi.com/" # keep-web/AttachmentOnlyHost, workbench/keep_web_download_url
      Keepstore:
        InternalURLs:
          "https://keep0:25107/": {} # keepstore/Listen
          "https://keep1:25107/": {} # keepstore/Listen
      Composer:
        ExternalURL: "http://composer.zzzzz.arvadosapi.com/" # workbench/composer_url
      WebShell:
        ExternalURL: "http://webshell.zzzzz.arvadosapi.com/" # workbench/shell_in_a_box_url
      Workbench1:
        InternalURLs:
          "http://workbench:9000": {} # workbench/Nginx.server.listen
        ExternalURL: "http://workbench.zzzzz.arvadosapi.com/" # workbench/Nginx.server.listen, api-server/workbench_address
      Workbench2:
        ExternalURL: "http://workbench2.zzzzz.arvadosapi.com/" # workbench/workbench2_url
    PostgreSQL:
      Connection: # arvados-ws/Postgres, controller/PostgreSQL.Connection
        # All parameters here are passed to the PG client library in a connection string;
        # see https://www.postgresql.org/docs/current/static/libpq-connect.html#LIBPQ-PARAMKEYWORDS
        Host: localhost
        Port: 5432
        User: arvados
        Password: s3cr3t
        DBName: arvados_production
        client_encoding: utf8
        fallback_application_name: arvados
      ConnectionPool: # arvados-ws/PostgresPool
    TLS:
      Certificate: # (literal, file, or acme dir) keepstore/TLSCertificateFile
      Key: # (literal, file, or acme dir) keepstore/TLSKeyFile
      Insecure: true # workbench/arvados_insecure_https, api-server/sso_insecure
    Git:
      GitoliteAdminRepo: # arvados-git-sync.rb/gitolite_url
      GitoliteAdminPublicKey: # arvados-git-sync.rb/gitolite_arvados_git_user_key
      GitoliteSyncWorkDir: # arvados-git-sync.rb/gitolite_tmp
      GitCommand: # arv-git-httpd/GitCommand
      GitoliteHome: # arv-git-httpd/GitoliteHome
      Repositories: # api-server/git_repositories_dir (crunch1 only; just assume {GitoliteHome}/repositories?)
    API:
      DisabledAPIs: # api-server/disable_api_methods
      WebsocketKeepaliveTimeout: # arvados-ws/PingTimeout
      WebsocketClientEventQueue: # arvados-ws/ClientEventQueue
      WebsocketServerEventQueue: # arvados-ws/ServerEventQueue
      KeepServiceRequestTimeout: # keepproxy/Timeout
      MaxMemoryBuffers: # keepstore/MaxBuffers
      MaxConcurrentRequests: # keepstore/MaxRequests
      MaxRequestSize: # api-server/max_request_size
      MaxIndexDatabaseRead: # api-server/max_index_database_read
      MaxItemsPerResponse: # api-server/max_items_per_response, keep-balance/CollectionBatchSize, keep-balance/CollectionBuffers
      MaxRequestAmplification: # controller/RequestLimits.MultiClusterRequestConcurrency
      AsyncPermissionsUpdateInterval: # api-server/async_permissions_update_interval
    Users:
      AutoSetupNewUsers: # api-server/auto_setup_new_users
      AutoSetupNewUsersWithVmUUID: # api-server/auto_setup_new_users_with_vm_uuid
      AutoSetupNewUsersWithRepository: # api-server/auto_setup_new_users_with_repository
      AutoSetupUsernameBlacklist: # api-server/auto_setup_name_blacklist
      NewUsersAreActive: # api-server/new_users_are_active
      AutoAdminUserWithEmail: # api-server/auto_admin_user
      AutoAdminFirstUser: # api-server/auto_admin_first_user
      UserProfileNotificationAddress: # api-server/user_profile_notification_address
      AdminNotifierEmailFrom: # api-server/admin_notifier_email_from
      EmailSubjectPrefix: # api-server/email_subject_prefix
      UserNotifierEmailFrom: # api-server/user_notifier_email_from
      NewUserNotificationRecipients: # api-server/new_user_notification_recipients
      NewInactiveUserNotificationRecipients: # api-server/new_inactive_user_notification_recipients
      AnonymousUserToken: # workbench/anonymous_user_token, keep-web/AnonymousTokens
    Login:
      SiteTitle: # sso/site_title
      DefaultLinkTitle: # sso/default_link_title
      DefaultLinkURL: # sso/default_link_url
      AllowAccountRegistration: # sso/allow_account_registration
      RequireEmailConfirmation: # sso/require_email_confirmation
      Google:
        ClientID: # sso/google_oauth2_client_id
        ClientSecret: # sso/google_oauth2_client_secret
      LDAP: # sso/use_ldap
        Title: # sso/use_ldap.title
        Host: # sso/use_ldap.host
        Port: # sso/use_ldap.port
        Method: # sso/use_ldap.method
        Base: # sso/use_ldap.base
        Uid: # sso/use_ldap.uid
        EmailDomain: # sso/use_ldap.email_domain
        BindDN: # sso/use_ldap.BindDN
        Password: # sso/user_ldap.password
      SecretToken: # sso/secret_token
      ProviderAppSecret: # api-server/sso_app_secret
      ProviderAppID: # api-server/sso_app_id
    AuditLogs:
      Enable:
      MaxAge: # api-server/max_audit_log_age
      MaxDeleteBatch: # api-server/max_audit_log_delete_batch
      UnloggedAttributes: # api-server/unlogged_attributes (applies to logs table)
    SystemLogs:
      LogLevel: # keepstore/Debug, keepproxy/Debug, arvados-ws/LogLevel
      Format: # keepstore/LogFormat, arvados-ws/LogFormat
      MaxRequestLogParamsSize: # api-server/max_request_log_params_size
    Collections:
      DefaultReplication: # api-server/default_collection_replication, keepproxy/DefaultReplicas
      DefaultTrashLifetime: # api-server/default_trash_lifetime
      CollectionVersioning: # api-server/collection_versioning
      PreserveVersionIfIdle: # api-server/preserve_version_if_idle
      TrustAllContent: # keep-web/TrustAllContent, workbench/trust_all_content
      TrashSweepInterval: # api-server/trash_sweep_interval
      BlobSigningKey: # api-server/blob_signing_key, keepstore/BlobSigningKeyFile
      BlobSigningTTL: # api-server/blob_signature_ttl, keepstore/BlobSignatureTTL
      BlobSigning: # keepstore/RequireSignatures, api-server/permit_create_collection_with_unsigned_manifest
      BlobTrash: # keepstore/EnableDelete
      BlobTrashLifetime: # keepstore/TrashLifetime
      BlobTrashCheckInterval: # keepstore/TrashCheckInterval
      BlobTrashConcurrency: # keepstore/TrashWorkers, keep-balance/-commit-trash
      BlobDeleteConcurrency: # keepstore/EmptyTrashWorkers
      BlobReplicateConcurrency: # keepstore/PullWorkers, keep-balance/-commit-pulls
      KeepBalanceRunPeriod: 10m # keepbalance/RunPeriod
      WebDAVCache:
        TTL: # keep-web/Cache.TTL
        UUIDTTL: # keep-web/Cache.UUIDTTL
        MaxCollectionEntries: # keep-web/Cache.MaxCollectionEntries
        MaxCollectionBytes: # keep-web/Cache.MaxCollectionBytes
        MaxPermissionEntries: # keep-web/Cache.MaxPermissionEntries
        MaxUUIDEntries: # keep-web/Cache.MaxUUIDEntries
    Containers: # control how Arvados runs user containers
      SupportedDockerImageFormats: # api-server/docker_image_formats
      LogReuseDecisions: # api-server/log_reuse_decisions
      DefaultKeepCacheRAM: # api-server/container_default_keep_cache_ram
      MaxDispatchAttempts: # api-server/max_container_dispatch_attempts
      MaxRetryAttempts: # api-server/container_count_max
      PollInterval: 10s # c-d-s/PollPeriod, a-d-c/Dispatch/PollInterval
      MinRetryPeriod: 30s # c-d-s/MinRetryPeriod (optional? in case ContainerDispatchPollInterval is too short)
      CrunchRunCommand: "crunch-run" # c-d-s/CrunchRunCommand
      CrunchRunArguments: '["-cgroup-parent-subsystem=memory", "-foo=bar"]' # c-d-s/CrunchRunCommand (should this be named CrunchRunArgumentsJSON?)
      ReserveExtraRAM: 256MiB # c-d-s/ReserveExtraRAM
      UsePreemptibleInstances: # api-server/preemptible_instances
      MaxComputeVMs: # api-server/max_compute_nodes
      DispatchPrivateKey: # a-d-c/Dispatch/PrivateKey
      StaleLockTimeout: # a-d-c/Dispatch/StaleLockTimeout
      Logging:
        LogBytesPerEvent: # api-server/crunch_log_bytes_per_event
        LogSecondsBetweenEvents: # api-server/crunch_log_seconds_between_events
        LogThrottlePeriod: # api-server/crunch_log_throttle_period
        LogThrottleBytes: # api-server/crunch_log_throttle_bytes
        LogThrottleLines: # api-server/crunch_log_throttle_lines
        LimitLogBytesPerJob: # api-server/crunch_limit_log_bytes_per_job
        LogPartialLineThrottlePeriod: # api-server/crunch_log_partial_line_throttle_period
        LogUpdatePeriod: # api-server/crunch_log_update_period
        LogUpdateSize: # api-server/crunch_log_update_size
        MaxAge: # api-server/clean_container_log_rows_after, api-server/clean_job_log_rows_after
      CloudVMs:
        Enable: # arvados-dispatch-cloud is in use
        BootProbeCommand: # a-d-c/CloudVMs/BootProbeCommand
        ProbeInterval: # a-d-c/Dispatch/ProbeInterval
        MaxProbesPerSecond: # a-d-c/Dispatch/MaxProbesPerSecond
        TimeoutSignal: # a-d-c/Dispatch/TimeoutSignal
        TimeoutTERM: # a-d-c/Dispatch/TimeoutTERM
        MaxCloudOpsPerSecond: # a-d-c/CloudVMs/MaxCloudOpsPerSecond
        SSHPort: # a-d-c/CloudVMs/SSHPort
        SyncInterval: # a-d-c/CloudVMs/SyncInterval
        TimeoutIdle: # a-d-c/CloudVMs/TimeoutIdle
        TimeoutBooting: # a-d-c/CloudVMs/TimeoutBooting
        TimeoutProbe: # a-d-c/CloudVMs/TimeoutProbe
        TimeoutShutdown: # a-d-c/CloudVMs/TimeoutShutdown
        ImageID: # a-d-c/CloudVMs/ImageID
        Driver: Amazon # a-d-c/CloudVMs/Driver
        DriverParameters: # a-d-c/CloudVMs/DriverParameters
          Region: us-east-1
          APITimeout: 20s
          AWSAccessKeyID: abcdef
          AWSSecretAccessKey: abcdefghijklmnopqrstuvwxyz
          ImageID: ami-0a01b48b88d14541e
          SubnetID: subnet-24f5ae62
          SecurityGroups: sg-3ec53e2a
      SLURM:
        Enable: # crunch-dispatch-slurm is in use
        PrioritySpread: 1000 # c-d-s/PrioritySpread
        SbatchArguments: '["-partition=PartitionName"]' # c-d-s/SbatchArguments
        KeepServices:
          00000-bi6l4-000000000000000:
            InternalURLs:
              "http://127.0.0.1:25107": {} # c-d-s/KeepServiceURIs
        Managed:
          Enable: # arvados-node-manager is in use
          DNSServerConfDir: # api-server/dns_server_conf_dir
          DNSServerConfTemplate: # api-server/dns_server_conf_template
          DNSServerReloadCommand: # api-server/dns_server_reload_command
          DNSServerUpdateCommand: # api-server/dns_server_update_command
          ComputeNodeDomain: # api-server/compute_node_domain
          ComputeNodeNameservers: # api-server/compute_node_nameservers
          AssignNodeHostname: # api-server/assign_node_hostname
      JobsAPI:
        Enable: # api-server/enable_legacy_jobs_api (crunch1)
        CrunchJobWrapper: # api-server/crunch_job_wrapper (crunch1)
        CrunchJobUser: # api-server/crunch_job_user (crunch1)
        CrunchRefreshTrigger: # api-server/crunch_refresh_trigger (crunch1)
        GitInternalDir: # api-server/git_internal_dir (crunch1)
        ReuseJobIfOutputsDiffer: # api-server/reuse_job_if_outputs_differ
        DefaultDockerImage: # api-server/default_docker_image_for_jobs
    Volumes: # keepstore/Volumes, keep-balance/KeepServiceTypes
      # TODO: some keepstores are closer to specific volumes
      zzzzz-ivpuk-voihjznerfweefq:
        AccessViaHosts: # replaces differing configs on keepstore hosts
          "http://keep0:25107": {ReadOnly: true}
          "http://keep1:25107": {}
          "http://keep2:25107": {ReadOnly: true}
          "http://keep3:25107": {ReadOnly: true}
        StorageClasses: # keepstore/S3Volume.StorageClasses, keepstore/AzureBlobVolume.StorageClasses, keepstore/UnixVolume.StorageClasses
          default: true
          cold: true
        Replication: 2 # keepstore/S3Volume.S3Replication, keepstore/AzureBlobVolume.AzureReplication, keepstore/UnixVolume.DirectoryReplication
        ReadOnly: false # keepstore/S3Volume.ReadOnly, keepstore/AzureBlobVolume.ReadOnly, keepstore/UnixVolume.ReadOnly
        Driver: S3 # keepstore/Volumes[].Type
        DriverParameters:
          AccessKey: # keepstore/S3Volume.AccessKey
          SecretKey: # keepstore/S3Volume.SecretKey
          Endpoint: # keepstore/S3Volume.Endpoint
          Region: # keepstore/S3Volume.Region
          Bucket: # keepstore/S3Volume.Bucket
          LocationConstraint: # keepstore/S3Volume.LocationConstraint
          IndexPageSize: # keepstore/S3Volume.IndexPageSize
          S3Replication:
          ConnectTimeout: # keepstore/S3Volume.ConnectTimeout
          ReadTimeout: # keepstore/S3Volume.ReadTimeout
          RaceWindow: # keepstore/S3Volume.RaceWindow
          ReadOnly: #
          UnsafeDelete: # keepstore/S3Volume.UnsafeDelete
      zzzzz-ivpuk-adbtuyuiivjhbnmb:
        AccessViaHosts: # replaces differing configs on keepstore hosts (TBD: do we need “readonly from these hosts”?)
          "http://keep1:25107": {ReadOnly: false}
        StorageClasses: # keepstore/S3Volume.StorageClasses, keepstore/AzureBlobVolume.StorageClasses, keepstore/UnixVolume.StorageClasses
          default: true
          cold: false
        Replication: 2 # keepstore/S3Volume.S3Replication, keepstore/AzureBlobVolume.AzureReplication, keepstore/UnixVolume.DirectoryReplication
        ReadOnly: false # keepstore/S3Volume.ReadOnly, keepstore/AzureBlobVolume.ReadOnly, keepstore/UnixVolume.ReadOnly
        Driver: Azure # keepstore/Volumes[].Type
        DriverParameters:
          StorageAccountName: # keepstore/AzureBlobVolume.StorageAccountName
          StorageAccountKey: # keepstore/AzureBlobVolume.StorageAccountKeyFile
          StorageBaseURL: # keepstore/AzureBlobVolume.StorageBaseURL
          ContainerName: # keepstore/AzureBlobVolume.ContainerName
          RequestTimeout: # keepstore/AzureBlobVolume.RequestTimeout
      zzzzz-ivpuk-2344guvaiubbae4wa:
        Driver: Filesystem # keepstore/Volumes[].Type
        DriverParameters:
          Root: # keepstore/UnixVolume.Root
          Serialize: # keepstore/UnixVolume.Serialize
          BlockDeviceUUID: # (disable if this is non-empty and does not match the local filesystem device)
    Mail:
      MailchimpAPIKey: # api-server/mailchimp_api_key
      MailchimpListID: # api-server/mailchimp_list_id
      SendUserSetupNotificationEmail: # workbench/send_user_setup_notification_email
      IssueReporterEmailFrom: # workbench/issue_reporter_email_from
      IssueReporterEmailTo: # workbench/issue_reporter_email_to
      SupportEmailAddress: # workbench/support_email_address
      EmailFrom: # workbench/email_from
    RemoteClusters: # api-server/remote_hosts
      xyzzx:
        Host:
        Proxy: false
        Scheme: https
        Insecure: false
        ActivateUsers: false
      "*": # api-server/remote_hosts_via_dns
        ActivateUsers: false
    Workbench:
      Theme: default # workbench/arvados_theme
      ActivationContactLink: # workbench/activation_contact_link
      ArvadosDocsite: # workbench/arvados_docsite
      ArvadosPublicDataDocURL: # workbench/arvados_public_data_doc_url
      ShowUserAgreementInline: # workbench/show_user_agreement_inline
      SecretToken: # workbench/secret_token
      SecretKeyBase: # workbench/secret_key_base
      RepositoryCache: # workbench/repository_cache
      UserProfileFormFields: # workbench/user_profile_form_fields
      UserProfileFormMessage: # workbench/user_profile_form_message
      ApplicationMimetypesWithViewIcon: # workbench/application_mimetypes_with_view_icon
      LogViewerMaxBytes: # workbench/log_viewer_max_bytes
      EnablePublicProjectsPage: # workbench/enable_public_projects_page
      EnableGettingStartedPopup: # workbench/enable_getting_started_popup
      ApiResponseCompression: # workbench/api_response_compression
      APIClientConnectTimeout: # workbench/api_client_connect_timeout
      APIClientReceiveTimeout: # workbench/api_client_receive_timeout
      RunningJobLogRecordsToFetch: # workbench/running_job_log_records_to_fetch
      ShowRecentCollectionsOnDashboard: # workbench/show_recent_collections_on_dashboard
      ShowUserNotifications: # workbench/show_user_notifications
      MultiSiteSearch: # workbench/multi_site_search
      Repositories: # workbench/repositories
      SiteName: # workbench/site_name
      VocabularyURL: # workbench2/VOCABULARY_URL
      FileViewersConfigURL: # workbench2/FILE_VIEWERS_CONFIG_URL
    InstanceTypes:
      x1l:
        ProviderType: x1.large
        VCPUs: 16
        RAM: 128GiB
        Scratch: 128GB
        IncludedScratch: 128GB
        AddedScratch: 0
        Price: 1.23
        Preemptible: false
    TODO:
      KeepproxyDisableGet: # keepproxy/DisableGet (retire this feature / use Nginx instead / use a per-token permission instead)
      KeepproxyDisablePut: # keepproxy/DisablePut (retire this feature / use Nginx instead / use a per-token permission instead)
      RailsSessionSecretToken: # api-server/secret_token (should this be generated at runtime from superusertoken?)
      InternalIPNetworks: # Nginx $external_client
</code></pre>

h2. Go Configuration Framework Options

Viper and go-config seem to be the leading go config framework contenders considering some of our long term goals (config synchronization); but viper seems to be the more widely adopted of the two.

*spf13/viper:* https://github.com/spf13/viper

*micro/go-config:* https://github.com/micro/go-config - more useful - https://micro.mu/docs/go-config.html

Both solutions are very similar in terms of reported functionality. Both have watch support, and would allow for merging flags, environment variables, remote key stores (Consul), and our master YAML config. Viper also supports encrypted remote key/value access.
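
Returning to the backends listed under "Secrets": whichever framework is chosen, the prefix rules could be resolved by a small helper along the following lines. This is only a sketch using the Go standard library. The @secretResolver@ type, its fields, and the @Resolve@ method are hypothetical names, not existing Arvados code. The vault backend is left unimplemented, and stripping a trailing newline from file contents is an assumption.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"strings"
)

// secretResolver turns a config value such as "env:FOOBAR" into the secret it
// refers to. The type and its fields are hypothetical, not part of Arvados.
type secretResolver struct {
	LookupEnv func(name string) (string, bool)
	ReadFile  func(path string) ([]byte, error)
}

// newSecretResolver returns a resolver backed by the real environment and
// filesystem.
func newSecretResolver() secretResolver {
	return secretResolver{LookupEnv: os.LookupEnv, ReadFile: os.ReadFile}
}

var errVaultNotImplemented = errors.New("vault backend not implemented in this sketch")

// Resolve applies the prefix rules from the "Secrets" section. A value with
// no recognized prefix is returned unchanged as a literal secret.
func (r secretResolver) Resolve(value string) (string, error) {
	switch {
	case strings.HasPrefix(value, "env:"):
		name := strings.TrimPrefix(value, "env:")
		v, ok := r.LookupEnv(name)
		if !ok {
			return "", fmt.Errorf("environment variable %q is not set", name)
		}
		return v, nil
	case strings.HasPrefix(value, "file:"):
		path := strings.TrimPrefix(value, "file:")
		buf, err := r.ReadFile(path)
		if err != nil {
			return "", fmt.Errorf("reading secret file: %w", err)
		}
		// Assumption: a trailing newline in the file is not part of the secret.
		return strings.TrimRight(string(buf), "\r\n"), nil
	case strings.HasPrefix(value, "vault:"):
		return "", errVaultNotImplemented
	default:
		return value, nil
	}
}

func main() {
	r := newSecretResolver()
	for _, v := range []string{"foobar", "env:FOOBAR", "file:/foobar", "vault:foobar"} {
		secret, err := r.Resolve(v)
		if err != nil {
			fmt.Printf("%-14s error: %v\n", v, err)
			continue
		}
		fmt.Printf("%-14s resolved (%d bytes)\n", v, len(secret))
	}
}
```

Injecting @LookupEnv@ and @ReadFile@ as fields keeps the prefix logic testable without touching the real environment or filesystem. One open question this sketch does not settle is how to express a literal secret that happens to begin with one of the prefixes.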