Cluster configuration » History » Version 26
Peter Amstutz, 02/06/2019 02:37 PM
h1. Cluster configuration

We are (2019) consolidating configuration from per-microservice YAML/JSON/INI files into a single cluster configuration document that is used by all components.
* Long term: system nodes automatically keep their configs synchronized (using something like consul).
* Short term: sysadmin uses tools like puppet and terraform to ensure /etc/arvados/config.yml is identical on all system nodes.
* Hosts without config files (e.g., hosts outside the cluster) can retrieve the config document from the API server.

h2. Discovery document

Previously, we copied selected config values from the API server config into the API discovery document so clients could see them. Once clients can retrieve the configuration document itself, this will no longer be needed. The discovery document should advertise the APIs provided by the server, not cluster configuration.

h2. Secrets

Secrets like BlobSigningKey can be given literally in the config file (convenient for dev/test, consul-template, etc.) or indirectly using a secret backend. Anticipated value formats:
* <code class="yaml">BlobSigningKey: foobar</code> ⇒ the secret is literally <code>foobar</code>
* <code class="yaml">BlobSigningKey: "vault:foobar"</code> ⇒ the secret can be obtained from vault using the vault key "foobar"
* <code class="yaml">BlobSigningKey: "file:/foobar"</code> ⇒ the secret can be read from the local file @/foobar@
* <code class="yaml">BlobSigningKey: "env:FOOBAR"</code> ⇒ the secret can be read from the environment variable @FOOBAR@

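The prefix scheme above could be resolved along the following lines. This is only a sketch under stated assumptions: the function name @resolveSecret@ is hypothetical, the vault backend is stubbed out with an error rather than implemented, and trimming a trailing newline from file contents is a guess at convenient behavior, not something this page specifies.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"strings"
)

// resolveSecret interprets a config value using the prefixes listed above.
// The name and exact behavior are illustrative assumptions, not Arvados code.
func resolveSecret(value string) (string, error) {
	switch {
	case strings.HasPrefix(value, "vault:"):
		// A real implementation would query the vault backend here.
		return "", errors.New("vault backend is not implemented in this sketch")
	case strings.HasPrefix(value, "file:"):
		buf, err := os.ReadFile(strings.TrimPrefix(value, "file:"))
		if err != nil {
			return "", err
		}
		// Assumption: a trailing newline in the file is not part of the secret.
		return strings.TrimRight(string(buf), "\n"), nil
	case strings.HasPrefix(value, "env:"):
		name := strings.TrimPrefix(value, "env:")
		v, ok := os.LookupEnv(name)
		if !ok {
			return "", fmt.Errorf("environment variable %q is not set", name)
		}
		return v, nil
	default:
		// No recognized prefix: the value is the secret itself.
		return value, nil
	}
}

func main() {
	os.Setenv("FOOBAR", "s3cr3t")
	for _, v := range []string{"foobar", "env:FOOBAR", "vault:foobar"} {
		secret, err := resolveSecret(v)
		fmt.Printf("%q -> %q, err=%v\n", v, secret, err)
	}
}
```

One consequence of this scheme is that a literal secret that happens to start with @vault:@, @file:@ or @env:@ cannot be expressed directly; the page does not say how that case would be handled.
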
h2. Instructions for ops

Tentative instructions for switching config file format/location:
# Upgrade Arvados to a version that supports loading all configs from the new cluster-wide config file (maybe 1.4). When services come back up, they will still use your old configuration files, but they will log some deprecation warnings.
# Migrate your configuration to the new config file, one component at a time. For each component:
## Restart the component.
## Inspect the deprecation warning that is logged at startup. It will tell you either "old config file is superfluous" or "new config file is incomplete".
## If your old config file is superfluous, delete it. You're done with this component.
## Otherwise, run the component with the "--config-diff" flag. This suggests changes to your new config file which will make your old config file obsolete. (Alternatively, run the component with the "--config-dump" flag. This outputs a new config file that would make your old config file obsolete. Saving this might be easier than applying a diff, but it will reorder keys and lose comments.)
## Make the suggested changes.
## Repeat from the restart step until the old config file is reported as superfluous.
# Upgrade to a version that doesn't support old config files at all (maybe 1.5).

h2. Implementation

Development strategy for facilitating the above ops instructions:
# Read the new config file into an internal struct, if the new config file exists.
# Copy old config file values into the new config struct.
# Use the new config struct internally (the old config is no longer referenced except in the load-and-copy-to-new-struct step).
# Add a mechanism for showing the effect of the old config file on the resulting config struct (see "--config-diff" above).
# At startup, if the old config has any effect (i.e., some parts haven't been migrated to the new config file by the operator), log a deprecation warning recommending "--config-diff" and pointing to the documentation.
# Wait one minor version release cycle.
# Error out if the new config file does not exist.
# Error out if the old config file exists (...and some parts of the old config are not redundant [optional?]).

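The load-copy-diff steps above can be sketched roughly as follows. Everything here is an assumption made for illustration: the flattened @Config@ map stands in for the real internal struct, @mergeLegacy@ is a hypothetical helper name, and the printed output is not the actual "--config-diff" format.

```go
package main

import (
	"fmt"
	"sort"
)

// Config is a flattened stand-in for the internal config struct (assumption).
type Config map[string]string

// mergeLegacy copies legacy (old config file) values over a copy of cfg.
// It returns the merged config and the sorted list of keys for which the
// legacy file had an effect, i.e. keys missing from or different in cfg.
func mergeLegacy(cfg, legacy Config) (Config, []string) {
	merged := Config{}
	for k, v := range cfg {
		merged[k] = v
	}
	var changed []string
	for k, v := range legacy {
		if cur, ok := cfg[k]; !ok || cur != v {
			changed = append(changed, k)
		}
		merged[k] = v
	}
	sort.Strings(changed)
	return merged, changed
}

func main() {
	newCfg := Config{"BlobSigningKey": "ungu355able", "LogLevel": "info"}
	oldCfg := Config{"BlobSigningKey": "ungu355able", "LogLevel": "debug", "MaxItemsPerResponse": "1000"}

	merged, changed := mergeLegacy(newCfg, oldCfg)
	if len(changed) == 0 {
		fmt.Println("old config file is superfluous")
		return
	}
	fmt.Println("new config file is incomplete; suggested changes:")
	for _, k := range changed {
		fmt.Printf("  %s: %s\n", k, merged[k])
	}
}
```

Keeping the list of affected keys separate from the merged result means the same comparison can drive both the startup deprecation warning and the diff output.
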
h2. Example config file

See also [[Config migration key mapping]]

(Format not yet frozen!)

Notes:
* Keys are CamelCase, except in special cases like PostgreSQL connection settings, which are passed through to another system without being interpreted by Arvados.
* Arrays and lists are not permitted. These cannot be expressed natively in consul, and tend to be troublesome anyway: "what changed?" is harder to answer usefully, significance of duplicate elements is unclear, etc.

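Where a list would otherwise be used, the example config expresses a set as a map with boolean values (for instance @DockerImageFormats: {"v2": true}@). A minimal sketch of consuming such a value is shown here; the helper name @enabledKeys@ is hypothetical, and treating a @false@ value as "not a member" is an assumption rather than documented behavior.

```go
package main

import (
	"fmt"
	"sort"
)

// enabledKeys returns the sorted keys whose value is true, treating the map
// as a set. Keys mapped to false are treated as absent (assumption).
func enabledKeys(set map[string]bool) []string {
	var keys []string
	for k, on := range set {
		if on {
			keys = append(keys, k)
		}
	}
	sort.Strings(keys)
	return keys
}

func main() {
	formats := map[string]bool{"v2": true, "v1": false}
	fmt.Println(enabledKeys(formats))
}
```

With this representation, duplicate elements cannot occur and a change to the set shows up as a change to individual keys.
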
<pre><code class="yaml">
Clusters:
  xyzzy:
    ManagementToken: eec1999ccb6d75840a2c09bc70b6d3cbc990744e
    BlobSigningKey: ungu355able
    BlobSignatureTTL: 172800
    SessionKey: 186005aa54cab1ca95a3738e6e954e0a35a96d3d13a8ea541f4156e8d067b4f3
    PostgreSQL:
      ConnectionPool: 32 # max concurrent connections per arvados server daemon
      Connection:
        # All parameters here are passed to the PG client library in a connection string;
        # see https://www.postgresql.org/docs/current/static/libpq-connect.html#LIBPQ-PARAMKEYWORDS
        Host: localhost
        Port: 5432
        User: arvados
        Password: s3cr3t
        DBName: arvados_production
        client_encoding: utf8
        fallback_application_name: arvados
    HTTPRequestTimeout: 5m
    Defaults:
      CollectionReplication: 2
      TrashLifetime: 2w
    UserActivation:
      ActivateNewUsers: true
      AutoAdminUser: root@example.com
      UserProfileNotificationAddress: notify@example.com
      NewUserNotificationRecipients: {}
      NewInactiveUserNotificationRecipients: {}
    RequestLimits:
      MaxRequestLogParamsSize: 2KB
      MaxRequestSize: 128MiB
      MaxIndexDatabaseRead: 128MiB
      MaxItemsPerResponse: 1000
      MultiClusterRequestConcurrency: 4
    LogLevel: info
    CloudVMs:
      BootProbeCommand: "docker ps -q"
      SSHPort: 22
      SyncInterval: 1m    # how often to get list of active instances from cloud provider
      TimeoutIdle: 1m     # shutdown if idle longer than this
      TimeoutBooting: 10m # shutdown if exists longer than this without running BootProbeCommand successfully
      TimeoutProbe: 2m    # shutdown if (after booting) communication fails longer than this, even if ctrs are running
      TimeoutShutdown: 1m # shutdown again if node still exists this long after shutdown
      Driver: Amazon
      DriverParameters:
        Region: us-east-1
        APITimeout: 20s
        AWSAccessKeyID: abcdef
        AWSSecretAccessKey: abcdefghijklmnopqrstuvwxyz
        ImageID: ami-0a01b48b88d14541e
        SubnetID: subnet-24f5ae62
        SecurityGroups: sg-3ec53e2a
    AuditLogs:
      MaxAge: 2w
      DeleteBatchSize: 100000
      UnloggedAttributes: {} # example: {"manifest_text": true}
    ContainerLogStream:
      BatchSize: 4KiB
      BatchTime: 1s
      ThrottlePeriod: 1m
      ThrottleThresholdSize: 64KiB
      ThrottleThresholdLines: 1024
      TruncateSize: 64MiB
      PartialLineThrottlePeriod: 5s
    Timers:
      TrashSweepInterval: 60s
      ContainerDispatchPollInterval: 10s
      APIRequestTimeout: 20s
    Scaling:
      MaxComputeNodes: 64
      EnablePreemptibleInstances: false
    DisableAPIMethods: {} # example: {"jobs.create": true}
    DockerImageFormats: {"v2": true}
    Crunch1:
      Enable: true
      CrunchJobWrapper: none
      CrunchJobUser: crunch
      CrunchRefreshTrigger: /tmp/crunch_refresh_trigger
      DefaultDockerImage: false
    NodeProfiles:
      # Key is a profile name; can be specified on service prog command line, defaults to $(hostname)
      keep:
        # Don’t run other services automatically -- only specified ones
        Default: {Disable: true}
        Keepstore: {Listen: ":25107"}
      apiserver:
        Default: {Disable: true}
        RailsAPI: {Listen: ":9000", TLS: true}
        Controller: {Listen: ":9100"}
        Websocket: {Listen: ":9101"}
        Health: {Listen: ":9199"}
      keep:
        Default: {Disable: true}
        KeepProxy: {Listen: ":9102"}
        KeepWeb: {Listen: ":9103"}
      "*":
        # This section used for a node whose profile name is not listed above
        Default: {Disable: false} # (this is the default behavior)
    Volumes:
      xyzzy-keep-0:
        Type: s3
        Region: us-east
        Bucket: xyzzy-keep-0
        # [rest of keepstore volume config goes here]
    WebRoutes:
      # “default” means route according to method/host/path (e.g., if host is a login shell, route there)
      xyzzy.arvadosapi.com: default
      # “collections” means always route to keep-web
      collections.xyzzy.arvadosapi.com: collections
      # leading * is a wildcard (longest match wins)
      "*--collections.xyzzy.arvadosapi.com": collections
      cloud.curoverse.com: workbench
      workbench.xyzzy.arvadosapi.com: workbench
      "*.xyzzy.arvadosapi.com": default
    InstanceTypes:
      m4.large:
        VCPUs: 2
        RAM: 8000000000
        Scratch: 31000000000
        Price: 0.1
      m4.large-1t:
        # same instance type as m4.large but our scripts attach more scratch
        ProviderType: m4.large
        VCPUs: 2
        RAM: 8000000000
        Scratch: 999000000000
        Price: 0.12
      m4.xlarge:
        VCPUs: 4
        RAM: 16000000000
        Scratch: 78000000000
        Price: 0.2
      m4.8xlarge:
        VCPUs: 40
        RAM: 160000000000
        Scratch: 156000000000
        Price: 2
      m4.16xlarge:
        VCPUs: 64
        RAM: 256000000000
        Scratch: 310000000000
        Price: 3.2
      c4.large:
        VCPUs: 2
        RAM: 3750000000
        Price: 0.1
      c4.8xlarge:
        VCPUs: 36
        RAM: 60000000000
        Price: 1.591
    RemoteClusters:
      xrrrr:
        Host: xrrrr.arvadosapi.com
        Proxy: true        # proxy requests to xrrrr on behalf of our clients
        AuthProvider: true # users authenticated by xrrrr can use our cluster
</code></pre>
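
The WebRoutes comments in the example say that a leading * is a wildcard and that the longest match wins. One way to read that rule is sketched here. The function name @matchRoute@ is hypothetical, and letting an exact host entry take precedence over any wildcard is an assumption that the example does not spell out.

```go
package main

import (
	"fmt"
	"strings"
)

// matchRoute returns the route target for host.
// An exact entry wins outright (assumption). Otherwise a pattern with a
// leading "*" matches any host ending with the rest of the pattern, and
// the longest matching pattern wins.
func matchRoute(routes map[string]string, host string) (string, bool) {
	if target, ok := routes[host]; ok {
		return target, true
	}
	best := ""
	for pattern := range routes {
		if strings.HasPrefix(pattern, "*") &&
			strings.HasSuffix(host, pattern[1:]) &&
			len(pattern) > len(best) {
			best = pattern
		}
	}
	if best == "" {
		return "", false
	}
	return routes[best], true
}

func main() {
	// Routes copied from the example config above.
	routes := map[string]string{
		"xyzzy.arvadosapi.com":                "default",
		"collections.xyzzy.arvadosapi.com":    "collections",
		"*--collections.xyzzy.arvadosapi.com": "collections",
		"cloud.curoverse.com":                 "workbench",
		"workbench.xyzzy.arvadosapi.com":      "workbench",
		"*.xyzzy.arvadosapi.com":              "default",
	}
	hosts := []string{
		"abc--collections.xyzzy.arvadosapi.com",
		"shell.xyzzy.arvadosapi.com",
		"example.com",
	}
	for _, host := range hosts {
		target, ok := matchRoute(routes, host)
		fmt.Printf("%s -> %q (matched=%v)\n", host, target, ok)
	}
}
```

Because two distinct wildcard patterns of the same length cannot both be suffixes of the same host, the longest-match rule never produces a tie between wildcards in this reading.
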

h2. Go Configuration Framework Options

Viper and go-config seem to be the leading Go config framework contenders considering some of our long-term goals (config synchronization), but Viper seems to be the more widely adopted of the two.

*spf13/viper:* https://github.com/spf13/viper

*micro/go-config:* https://github.com/micro/go-config - more useful - https://micro.mu/docs/go-config.html

Both solutions are very similar in terms of reported functionality. Both have watch support, and would allow for merging flags, environment variables, remote key stores (Consul), and our master YAML config. Viper also supports encrypted remote key/value access.