Bug #23079
Status: Closed
arv-put cannot put a file when the Arvados cluster is behind a Tailscale funnel (proxy)
Description
Summary: When running arv-put on a host that does not run an Arvados cluster (i.e. running arv-put on a purely "client" host), file upload fails.
To reproduce, set up an Arvados cluster (I'm using a single-host 3.1.2 install on Ubuntu 24.04) and a separate "client" host with a working environment configured to use the newly set-up cluster (I'm using config files under $HOME/.config/arvados).
Then, when uploading a collection using arv-put, the following error messages appear. (I'm disabling retries and caches to speedily reproduce the errors, but otherwise the error is the same.)
$ arv-put --no-cache --retries 0 --name example example-dir/
2025-07-28 14:25:37 arvados.arv_put[9840] INFO: Calculating upload size, this could take some time...
2025-07-28 14:25:37 arvados.arv_put[9840] INFO: No cache usage requested for this run.
0M / 0M 0.0%
2025-07-28 14:25:38 arvados.arv_put[9840] ERROR: arv-put: Error writing some blocks: block bea8252ff4e80f41719ea13cdf007273+14 raised KeepWriteError ([req-307ucnh3wnvrso38jr9y] failed to write bea8252ff4e80f41719ea13cdf007273 after 1 attempt (wanted (1, ['default']) copies but wrote (0, [])): service http://localhost:8088/ responded with 0 (7, 'Failed to connect to localhost port 8088 after 0 ms: Could not connect to server')) (X-Request-Id: req-307ucnh3wnvrso38jr9y)
Notice the part of the error message that reads service http://localhost:8088/ responded with 0 (7, 'Failed to connect to localhost port 8088 after 0 ms: Could not connect to server'). This is intriguing because http://localhost:8088/ is the InternalURL of the Keepstore service (as configured by the single-node Ansible install). I'm not sure why it is being tried at all; it seems that on the client host arv-put doesn't know to use the cluster's Keep proxy.
This does not happen when running arv-put from the same host on which Arvados services are running. In that case the upload succeeds without errors.
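For context, the Keep service URLs that arv-put tries come from the cluster's keep_services/accessible API endpoint. The sketch below shows how such a response maps to the URL seen in the error; the JSON is an illustrative example shaped like the real response (the field names service_host, service_port, and service_ssl_flag are real Arvados API fields, but the uuid and values are placeholders, not captured from this cluster):

```python
import json

# Illustrative keep_services/accessible response (values are placeholders).
response = json.loads("""
{"items": [
  {"uuid": "xtmp1-bi6l4-000000000000000",
   "service_host": "localhost", "service_port": 8088,
   "service_ssl_flag": false, "service_type": "disk"}
]}
""")

# Reconstruct the service root URL the client would contact for each entry.
for svc in response["items"]:
    scheme = "https" if svc["service_ssl_flag"] else "http"
    print(f'{scheme}://{svc["service_host"]}:{svc["service_port"]}/')
```

If the cluster hands back an entry like this one, the client ends up contacting http://localhost:8088/, exactly the URL in the error above, which only resolves to a real Keepstore on the cluster host itself.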
Updated by Brett Smith 8 months ago
- Assigned To set to Brett Smith
- Category set to Deployment
Zoë,
The Arvados controller tells clients to contact different Keep services depending on whether they are on the same network as the Arvados cluster or not. Arvados relies on nginx configuration to tell which is which. The installer is responsible for setting up this configuration based on the user's settings.
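The selection Brett describes can be sketched roughly as follows. This is a hypothetical illustration of the idea, not the controller's actual implementation; the CIDR ranges and both URLs are assumptions chosen for this example (the internal URL matches the one in the error above, and 25107 is only a commonly used keepproxy port):

```python
import ipaddress

# Example "internal" networks; a real deployment's nginx/controller config
# would define its own notion of which client addresses count as internal.
INTERNAL_NETS = [ipaddress.ip_network(n)
                 for n in ("127.0.0.0/8", "10.0.0.0/8", "192.168.0.0/16")]

def keep_url_for(forwarded_for: str) -> str:
    """Return the Keep URL a client at this address would be told to use."""
    addr = ipaddress.ip_address(forwarded_for)
    if any(addr in net for net in INTERNAL_NETS):
        return "http://localhost:8088/"                      # InternalURL (keepstore)
    return "https://xtmp1.halley-mirzam.ts.net:25107/"       # ExternalURL (keepproxy, hypothetical)

print(keep_url_for("127.0.0.1"))       # a proxied request that appears local
print(keep_url_for("100.110.145.89"))  # a Tailnet client address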
Can you please describe how the cluster and client hosts are related to each other? For example, are they two different VMs on the same host system? Is the client on the host system directly? Something else? When the client system contacts the Arvados cluster, how are they networked?
When the client initially contacts the cluster, you should be able to see it request the discovery document in the arvados-controller journal. For example:
Jul 29 09:12:15 arvados-controller[213]: {"ClusterID":"z2a01","PID":213,"RequestID":"req-dgu5tha4s4vq5cmj6jr0","level":"info","msg":"request","remoteAddr":"127.0.0.1:46360","reqBytes":0,"reqForwardedFor":"192.168.137.1","reqHost":"z2a01:8443","reqMethod":"GET","reqPath":"discovery/v1/apis/arvados/v1/rest","reqQuery":"","time":"2025-07-29T09:12:15.159413994-04:00"}
Can you please find and share that line for your arv-put?
Updated by Zoë Ma 8 months ago
Thank you for the hint about examining the networking and the journals. I think I now have a grip on what was happening.
The cluster and the "client" hosts are both VM images running on the same physical host. They are also part of the same Tailnet. They are known to each other as hostnames xtmp1.halley-mirzam.ts.net (Arvados cluster) and xtmp2.halley-mirzam.ts.net (client). The Arvados API server is at xtmp1.halley-mirzam.ts.net:8443.
At the same time, what I really wanted to do was to "export" xtmp1 to the public internet with Tailscale funnel. I was doing this with commands like sudo tailscale funnel --bg --tcp 8443 tcp://localhost:8443 (to export the controller). I think the particular way this TCP proxy is set up means that requests from the client (xtmp2) appear to be local requests, which confuses the client program.
The journal output for the request looks like:
Jul 29 10:50:21 xtmp1 arvados-controller[2029]: {"ClusterID":"xtmp1","PID":2029,"RequestID":"req-9lr4l3tnpa593xq6nuxh","level":"info","msg":"request","remoteAddr":"127.0.0.1:35348","reqBytes":0,"reqForwardedFor":"127.0.0.1","reqHost":"xtmp1.halley-mirzam.ts.net:8443","reqMethod":"GET","reqPath":"arvados/v1/config","reqQuery":"alt=json","time":"2025-07-29T10:50:21.097327947-04:00"}
Jul 29 10:50:21 xtmp1 arvados-controller[2029]: {"ClusterID":"xtmp1","PID":2029,"RequestID":"req-9lr4l3tnpa593xq6nuxh","level":"info","msg":"response","priority":0,"queue":"api","remoteAddr":"127.0.0.1:35348","reqBytes":0,"reqForwardedFor":"127.0.0.1","reqHost":"xtmp1.halley-mirzam.ts.net:8443","reqMethod":"GET","reqPath":"arvados/v1/config","reqQuery":"alt=json","respBytes":4001,"respStatus":"OK","respStatusCode":200,"time":"2025-07-29T10:50:21.098124782-04:00","timeToStatus":0.000748,"timeTotal":0.000793,"timeWriteBody":0.000046,"tokenUUIDs":["xlogn-gj3su-b4h28emjx5w3n0p"]}
Jul 29 10:50:21 xtmp1 arvados-controller[2029]: {"ClusterID":"xtmp1","PID":2029,"RequestID":"req-9lr4l3tnpa593xq6nuxh","level":"info","msg":"request","remoteAddr":"127.0.0.1:35362","reqBytes":0,"reqForwardedFor":"127.0.0.1","reqHost":"xtmp1.halley-mirzam.ts.net:8443","reqMethod":"GET","reqPath":"arvados/v1/users/current","reqQuery":"alt=json","time":"2025-07-29T10:50:21.100193631-04:00"}
Jul 29 10:50:21 xtmp1 arvados-controller[2029]: {"ClusterID":"xtmp1","PID":2029,"RequestID":"req-9lr4l3tnpa593xq6nuxh","level":"info","msg":"response","priority":0,"queue":"api","remoteAddr":"127.0.0.1:35362","reqBytes":0,"reqForwardedFor":"127.0.0.1","reqHost":"xtmp1.halley-mirzam.ts.net:8443","reqMethod":"GET","reqPath":"arvados/v1/users/current","reqQuery":"alt=json","respBytes":612,"respStatus":"OK","respStatusCode":200,"time":"2025-07-29T10:50:21.186843365-04:00","timeToStatus":0.086633,"timeTotal":0.086642,"timeWriteBody":0.000010,"tokenUUIDs":["xlogn-gj3su-b4h28emjx5w3n0p"]}
Jul 29 10:50:21 xtmp1 arvados-controller[2029]: {"ClusterID":"xtmp1","PID":2029,"RequestID":"req-9lr4l3tnpa593xq6nuxh","level":"info","msg":"request","remoteAddr":"127.0.0.1:35368","reqBytes":0,"reqForwardedFor":"127.0.0.1","reqHost":"xtmp1.halley-mirzam.ts.net:8443","reqMethod":"GET","reqPath":"arvados/v1/keep_services/accessible","reqQuery":"alt=json","time":"2025-07-29T10:50:21.217019028-04:00"}
Jul 29 10:50:21 xtmp1 arvados-controller[2029]: {"ClusterID":"xtmp1","PID":2029,"RequestID":"req-9lr4l3tnpa593xq6nuxh","level":"info","msg":"response","priority":0,"queue":"api","remoteAddr":"127.0.0.1:35368","reqBytes":0,"reqForwardedFor":"127.0.0.1","reqHost":"xtmp1.halley-mirzam.ts.net:8443","reqMethod":"GET","reqPath":"arvados/v1/keep_services/accessible","reqQuery":"alt=json","respBytes":522,"respStatus":"OK","respStatusCode":200,"time":"2025-07-29T10:50:21.225436597-04:00","timeToStatus":0.008345,"timeTotal":0.008410,"timeWriteBody":0.000065}
Meanwhile, without the funnel (TCP proxy), a successful request looks like:
Jul 29 10:51:41 xtmp1 arvados-controller[2029]: {"ClusterID":"xtmp1","PID":2029,"RequestID":"req-nsnqtc0j93ng1tv8o9js","level":"info","msg":"request","remoteAddr":"127.0.0.1:35274","reqBytes":0,"reqForwardedFor":"100.110.145.89","reqHost":"xtmp1.halley-mirzam.ts.net:8443","reqMethod":"GET","reqPath":"arvados/v1/config","reqQuery":"alt=json","time":"2025-07-29T10:51:41.638061795-04:00"}
Jul 29 10:51:41 xtmp1 arvados-controller[2029]: {"ClusterID":"xtmp1","PID":2029,"RequestID":"req-nsnqtc0j93ng1tv8o9js","level":"info","msg":"response","priority":0,"queue":"api","remoteAddr":"127.0.0.1:35274","reqBytes":0,"reqForwardedFor":"100.110.145.89","reqHost":"xtmp1.halley-mirzam.ts.net:8443","reqMethod":"GET","reqPath":"arvados/v1/config","reqQuery":"alt=json","respBytes":4001,"respStatus":"OK","respStatusCode":200,"time":"2025-07-29T10:51:41.638818889-04:00","timeToStatus":0.000641,"timeTotal":0.000753,"timeWriteBody":0.000112,"tokenUUIDs":["xlogn-gj3su-b4h28emjx5w3n0p"]}
Jul 29 10:51:41 xtmp1 arvados-controller[2029]: {"ClusterID":"xtmp1","PID":2029,"RequestID":"req-nsnqtc0j93ng1tv8o9js","level":"info","msg":"request","remoteAddr":"127.0.0.1:35284","reqBytes":0,"reqForwardedFor":"100.110.145.89","reqHost":"xtmp1.halley-mirzam.ts.net:8443","reqMethod":"GET","reqPath":"arvados/v1/users/current","reqQuery":"alt=json","time":"2025-07-29T10:51:41.640789097-04:00"}
Jul 29 10:51:41 xtmp1 arvados-controller[2029]: {"ClusterID":"xtmp1","PID":2029,"RequestID":"req-nsnqtc0j93ng1tv8o9js","level":"info","msg":"response","priority":0,"queue":"api","remoteAddr":"127.0.0.1:35284","reqBytes":0,"reqForwardedFor":"100.110.145.89","reqHost":"xtmp1.halley-mirzam.ts.net:8443","reqMethod":"GET","reqPath":"arvados/v1/users/current","reqQuery":"alt=json","respBytes":612,"respStatus":"OK","respStatusCode":200,"time":"2025-07-29T10:51:41.649432704-04:00","timeToStatus":0.008617,"timeTotal":0.008626,"timeWriteBody":0.000009,"tokenUUIDs":["xlogn-gj3su-b4h28emjx5w3n0p"]}
Jul 29 10:51:41 xtmp1 arvados-controller[2029]: {"ClusterID":"xtmp1","PID":2029,"RequestID":"req-nsnqtc0j93ng1tv8o9js","level":"info","msg":"request","remoteAddr":"127.0.0.1:35292","reqBytes":0,"reqForwardedFor":"100.110.145.89","reqHost":"xtmp1.halley-mirzam.ts.net:8443","reqMethod":"GET","reqPath":"arvados/v1/keep_services/accessible","reqQuery":"alt=json","time":"2025-07-29T10:51:41.677661043-04:00"}
Jul 29 10:51:41 xtmp1 arvados-controller[2029]: {"ClusterID":"xtmp1","PID":2029,"RequestID":"req-nsnqtc0j93ng1tv8o9js","level":"info","msg":"response","priority":0,"queue":"api","remoteAddr":"127.0.0.1:35292","reqBytes":0,"reqForwardedFor":"100.110.145.89","reqHost":"xtmp1.halley-mirzam.ts.net:8443","reqMethod":"GET","reqPath":"arvados/v1/keep_services/accessible","reqQuery":"alt=json","respBytes":539,"respStatus":"OK","respStatusCode":200,"time":"2025-07-29T10:51:41.685657205-04:00","timeToStatus":0.007936,"timeTotal":0.007991,"timeWriteBody":0.000055}
Jul 29 10:51:41 xtmp1 arvados-controller[2029]: {"ClusterID":"xtmp1","PID":2029,"RequestID":"req-nsnqtc0j93ng1tv8o9js","level":"info","msg":"request","remoteAddr":"127.0.0.1:35298","reqBytes":0,"reqForwardedFor":"100.94.70.12","reqHost":"xtmp1.halley-mirzam.ts.net:8443","reqMethod":"GET","reqPath":"arvados/v1/users/current","reqQuery":"","time":"2025-07-29T10:51:41.722517649-04:00"}
Jul 29 10:51:41 xtmp1 arvados-controller[2029]: {"ClusterID":"xtmp1","PID":2029,"RequestID":"req-nsnqtc0j93ng1tv8o9js","level":"info","msg":"response","priority":0,"queue":"api","remoteAddr":"127.0.0.1:35298","reqBytes":0,"reqForwardedFor":"100.94.70.12","reqHost":"xtmp1.halley-mirzam.ts.net:8443","reqMethod":"GET","reqPath":"arvados/v1/users/current","reqQuery":"","respBytes":612,"respStatus":"OK","respStatusCode":200,"time":"2025-07-29T10:51:41.730866399-04:00","timeToStatus":0.008333,"timeTotal":0.008342,"timeWriteBody":0.000008,"tokenUUIDs":["xlogn-gj3su-b4h28emjx5w3n0p"]}
Jul 29 10:51:41 xtmp1 arvados-controller[2029]: {"ClusterID":"xtmp1","PID":2029,"RequestID":"req-nsnqtc0j93ng1tv8o9js","level":"info","msg":"request","remoteAddr":"127.0.0.1:35308","reqBytes":213,"reqForwardedFor":"100.110.145.89","reqHost":"xtmp1.halley-mirzam.ts.net:8443","reqMethod":"POST","reqPath":"arvados/v1/collections","reqQuery":"ensure_unique_name=true\u0026alt=json","time":"2025-07-29T10:51:41.770618906-04:00"}
Jul 29 10:51:41 xtmp1 arvados-controller[2029]: {"ClusterID":"xtmp1","PID":2029,"RequestID":"req-nsnqtc0j93ng1tv8o9js","level":"info","msg":"response","priority":0,"queue":"api","remoteAddr":"127.0.0.1:35308","reqBytes":213,"reqForwardedFor":"100.110.145.89","reqHost":"xtmp1.halley-mirzam.ts.net:8443","reqMethod":"POST","reqPath":"arvados/v1/collections","reqQuery":"ensure_unique_name=true\u0026alt=json","respBytes":938,"respStatus":"OK","respStatusCode":200,"time":"2025-07-29T10:51:41.790078440-04:00","timeToStatus":0.019447,"timeTotal":0.019456,"timeWriteBody":0.000009,"tokenUUIDs":["xlogn-gj3su-b4h28emjx5w3n0p"]}
(Note the reqForwardedFor fields: 127.0.0.1 through the funnel versus the real Tailnet addresses without it.) It seems that the client is confused by the proxy.
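The difference is easy to spot mechanically, since the arvados-controller journal lines are JSON objects. A small diagnostic sketch (the two sample lines below are abridged from the journal output above):

```python
import json

# Abridged journal lines: one from the funneled run, one from the direct run.
lines = [
    '{"RequestID":"req-9lr4l3tnpa593xq6nuxh","reqForwardedFor":"127.0.0.1","reqPath":"arvados/v1/keep_services/accessible"}',
    '{"RequestID":"req-nsnqtc0j93ng1tv8o9js","reqForwardedFor":"100.110.145.89","reqPath":"arvados/v1/keep_services/accessible"}',
]

# Flag requests whose forwarded-for address looks like localhost, i.e.
# requests that a TCP proxy has made appear local to the cluster.
for line in lines:
    rec = json.loads(line)
    tag = "looks local (proxied?)" if rec["reqForwardedFor"].startswith("127.") else "remote client"
    print(rec["RequestID"], rec["reqForwardedFor"], tag)
```

Running something like this over the journal makes it obvious which runs went through the funnel: the funneled keep_services/accessible request carries reqForwardedFor 127.0.0.1, so the cluster treats the client as local.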
Given that we do want to be able to export services currently hosted on Tailscale to the public internet, is there a way to make client SDK command-line programs more aware of the possibility of a proxy in between (such as Tailscale funnel)?
Updated by Alexander Wait Zaranek 8 months ago
I suspect we should have started this discussion in a PGP Incubator issue? Is there a use case for uploading to a public Arvados instance where the uploading user doesn't have access to the Tailnet? That seems very unlikely to me.
The Tailscale funnel docs mention "The funnel command offers a TCP forwarder to forward TLS-terminated TCP packets to a local TCP server like Caddy or other TCP-based protocols such as SSH or RDP. By default, the TCP forwarder forwards raw packets."
If you follow the breadcrumbs to the Caddy docs, they directly address wildcard domains/certificates. Perhaps fewer ports plus a Caddy-managed wildcard subdomain would give us everything we want? Worth continuing at GitHub?
Updated by Brett Smith 8 months ago
Alexander Wait Zaranek wrote in #note-4:
I suspect we should have started this discussion in a PGP Incubator issue?
In retrospect, probably. I think Zoë started with a good faith belief that this was an arv-put code bug, and I continued the discussion on the good faith belief that we might be able to do work in the Ansible installer to accommodate this deployment strategy.
But given my reading so far, Tailscale Funnel does not seem like an appropriate way to make an Arvados cluster publicly accessible. This article describes Funnel as a tool for temporary resource sharing. It doesn't seem designed for long-running services.
This article describes the limitations of Funnel, which include:
Funnel can only listen on ports 443, 8443, and 10000.
The current Arvados architecture expects you to have at least five public-facing services, and that's before you add anything like service containers. Today there's simply no way to expose all the functionality of an Arvados cluster through three ports.
Even if we did substantial engineering work to lift that limitation, there's also:
Traffic sent over a Funnel is subject to non-configurable bandwidth limits.
Given the amount of data Arvados clusters typically handle, I think even if you got a cluster available through Tailscale Funnel, it would be unpleasant to use.
Updated by Zoë Ma 8 months ago
I think I didn't make it clear: this ticket is less about "the public cannot interact with Arvados using arv-put when the cluster is exported by funnel" and more about "arv-put fails with a TCP proxy in a rather surprising way" (of course, "surprising" is subjective).
What is broken is that merely enabling a Tailscale funnel confuses arv-put (and, by extension, arv-keepdocker, arvados-cwl-runner, etc.) for clients on the private Tailnet, not just on the public internet.
I agree that this discussion is better had on GitHub in the PGPincubator repo. I'll write a bit over there and create an issue about figuring out ways to export selected Arvados functionality to the public while not breaking "backend" work on the private network. Then I'll update this ticket to point to that GitHub issue.
Updated by Zoë Ma 8 months ago
For reference, I created a public GitHub issue following the discussion here: https://github.com/PGPinformatics/PGPincubator/issues/11