Bug #16393
closedkeepproxy timeout on lugli
100%
Description
Keepproxy on lugli seems to be suffering from a timeout error if it can't write an entire block's worth to keepstore within 15s:
May 01 18:44:39 ip-10-255-0-201 keepproxy[2367]: time="2020-05-01T18:44:39Z" level=info msg=request RequestID=req-35oyym3mwx6bb8qwm7sk remoteAddr="127.0.0.1:35746" reqBytes=67108864 reqForwardedFor=207.172.223.247 reqHost=keep.lugli.arvadosapi.com reqMethod=PUT reqPath=abc3977c2c709626b57927dc7388a9fc reqQuery= May 01 18:44:39 ip-10-255-0-201 keepproxy[2367]: time="2020-05-01T18:44:39Z" level=info msg="DEBUG: [req-35oyym3mwx6bb8qwm7sk] Begin upload abc3977c2c709626b57927dc7388a9fc to http://keep0.lugli.arvadosapi.com:25107" May 01 18:44:39 ip-10-255-0-201 keepproxy[2367]: time="2020-05-01T18:44:39Z" level=info msg="DEBUG: [req-35oyym3mwx6bb8qwm7sk] Replicas remaining to write: 2 active uploads: 1" May 01 18:44:54 ip-10-255-0-201 keepproxy[2367]: time="2020-05-01T18:44:54Z" level=info msg="DEBUG: [req-35oyym3mwx6bb8qwm7sk] Upload failed http://keep0.lugli.arvadosapi.com:25107/abc3977c2c709626b57927dc7388a9fc error: Put http://keep0.lugli.arvadosapi.com:25107/abc3977c2c709626b57927dc7388a9fc: net/http: request canceled (Client.Timeout exceeded while awaiting headers)" May 01 18:44:54 ip-10-255-0-201 keepproxy[2367]: time="2020-05-01T18:44:54Z" level=info msg="DEBUG: [req-35oyym3mwx6bb8qwm7sk] Replicas remaining to write: 2 active uploads: 0" May 01 18:44:54 ip-10-255-0-201 keepproxy[2367]: time="2020-05-01T18:44:54Z" level=info msg="DEBUG: [req-35oyym3mwx6bb8qwm7sk] Begin upload abc3977c2c709626b57927dc7388a9fc to http://keep0.lugli.arvadosapi.com:25107" May 01 18:44:54 ip-10-255-0-201 keepproxy[2367]: time="2020-05-01T18:44:54Z" level=info msg="DEBUG: [req-35oyym3mwx6bb8qwm7sk] Replicas remaining to write: 2 active uploads: 1" May 01 18:45:09 ip-10-255-0-201 keepproxy[2367]: time="2020-05-01T18:45:09Z" level=info msg="DEBUG: [req-35oyym3mwx6bb8qwm7sk] Upload failed http://keep0.lugli.arvadosapi.com:25107/abc3977c2c709626b57927dc7388a9fc error: Put http://keep0.lugli.arvadosapi.com:25107/abc3977c2c709626b57927dc7388a9fc: net/http: request canceled (Client.Timeout exceeded while awaiting headers)" May 01 18:45:09 ip-10-255-0-201 keepproxy[2367]: time="2020-05-01T18:45:09Z" level=info msg="DEBUG: [req-35oyym3mwx6bb8qwm7sk] Replicas remaining to write: 2 active uploads: 0" May 01 18:45:09 ip-10-255-0-201 keepproxy[2367]: time="2020-05-01T18:45:09Z" level=info msg="DEBUG: [req-35oyym3mwx6bb8qwm7sk] Begin upload abc3977c2c709626b57927dc7388a9fc to http://keep0.lugli.arvadosapi.com:25107" May 01 18:45:09 ip-10-255-0-201 keepproxy[2367]: time="2020-05-01T18:45:09Z" level=info msg="DEBUG: [req-35oyym3mwx6bb8qwm7sk] Replicas remaining to write: 2 active uploads: 1" May 01 18:45:12 ip-10-255-0-201 keepproxy[2367]: time="2020-05-01T18:45:12Z" level=info msg="DEBUG: [req-35oyym3mwx6bb8qwm7sk] Upload http://keep0.lugli.arvadosapi.com:25107/abc3977c2c709626b57927dc7388a9fc success" May 01 18:45:12 ip-10-255-0-201 keepproxy[2367]: time="2020-05-01T18:45:12Z" level=info msg="207.172.223.247,127.0.0.1:35746 PUT /abc3977c2c709626b57927dc7388a9fc 200 67108864 2 2 abc3977c2c709626b57927dc7388a9fc+67108864+A5a1c5246763239c3a5d474cdcadba9850dff2b6e@5ebee338 <nil>" May 01 18:45:12 ip-10-255-0-201 keepproxy[2367]: time="2020-05-01T18:45:12Z" level=info msg=response RequestID=req-35oyym3mwx6bb8qwm7sk remoteAddr="127.0.0.1:35746" reqBytes=67108864 reqForwardedFor=207.172.223.247 reqHost=keep.lugli.arvadosapi.com reqMethod=PUT reqPath=abc3977c2c709626b57927dc7388a9fc reqQuery= respBytes=92 respStatus=OK respStatusCode=200 timeToStatus=33.241195 timeTotal=33.241230 timeWriteBody=0.000035
The third attempt works, it is much faster than the first 2.
Why does it start out so slow?
Updated by Ward Vandewege over 4 years ago
- Description updated (diff)
What seems to be happening here is that a slow client is uploading data, which keepproxy is streaming through to keepstore. The client doesn't send the whole block within 15s, so keepproxy aborts the connection to keepstore:
May 01 18:44:54 ip-10-255-0-201 keepproxy[2367]: time="2020-05-01T18:44:54Z" level=info msg="DEBUG: [req-35oyym3mwx6bb8qwm7sk] Upload failed http://keep0.lugli.arvadosapi.com:25107/abc3977c2c709626b57927dc7388a9fc error: Put http://keep0.lugli.arvadosapi.com:25107/abc3977c2c709626b57927dc7388a9fc: net/http: request canceled (Client.Timeout exceeded while awaiting headers)"
That happens once again at the 30s mark. Keepproxy keeps retrying the connection to keepstore, and by the third try the entire upload of the block completes before the 15s timeout window, and the upload to keesptore completes.
The client is none the wiser, the upload appears to complete without problems.
For even slower clients, though, the upload fails because the retry limit for the keepproxy-> keepstore connection is hit before all data can be uploaded to keepproxy.
Updated by Ward Vandewege over 4 years ago
- Status changed from New to In Progress
- Assigned To set to Ward Vandewege
- Target version set to 2020-05-06 Sprint
Ready for review in defdea5df1baf318b3a34c60e2fe148afff66a51 on branch 16393-keepproxy-timeout-with-slow-clients
Updated by Ward Vandewege over 4 years ago
- Category set to Keep
- Release set to 33
Updated by Peter Amstutz over 4 years ago
Ward Vandewege wrote:
Ready for review in defdea5df1baf318b3a34c60e2fe148afff66a51 on branch 16393-keepproxy-timeout-with-slow-clients
Looking at the original code, it takes the timeout from the passed-in client struct, which appears to be one created at keepproxy.go:128 main(). So maybe the timeout should be set in main() instead?
Updated by Ward Vandewege over 4 years ago
Peter Amstutz wrote:
Ward Vandewege wrote:
Ready for review in defdea5df1baf318b3a34c60e2fe148afff66a51 on branch 16393-keepproxy-timeout-with-slow-clients
Looking at the original code, it takes the timeout from the passed-in client struct, which appears to be one created at keepproxy.go:128 main(). So maybe the timeout should be set in main() instead?
I don't think so; the keepclient.MakeKeepClient call does all sorts of stuff, including magic with KeepClient.foundNonDiskSvc, but all that is a red herring (and that magic flag is not exported... why do we have 2 timeout profiles and no way to tell the client which one to use?), I went down that road first and spent a good bit of time there.
The actual timeout that we were hitting was 15s; that was the default value for cluster.API.KeepServiceRequestTimeout, and the http.Client timeout for the connection to keepstore is set to that value (by way of MakeRESTRouter) on line 621 in keepproxy.go, which is what I changed.
And the fix works, as tested with a binary on lugli.
Updated by Peter Amstutz over 4 years ago
My original comment was wrong.
The place where the timeout needs to be changed is on line 166:
@router = MakeRESTRouter(kc, time.Duration(cluster.API.KeepServiceRequestTimeout), cluster.ManagementToken)@
This sets the value of h.timeout
that gets used to initialize the keepclient later on.
Hardcoding the timeout in makeKeepClient
may break a couple of tests which call MakeRESTRouter() with a different timeout.
Updated by Ward Vandewege over 4 years ago
Thanks, I made this change in b4603c6c684246942d6d8747d51b441b9ffc42fb. Tests pass, and I've tested the binary on lugli as well, it still works as expected.
Updated by Peter Amstutz over 4 years ago
Ward Vandewege wrote:
Thanks, I made this change in b4603c6c684246942d6d8747d51b441b9ffc42fb. Tests pass, and I've tested the binary on lugli as well, it still works as expected.
Perfect, thank you, LGTM.
Updated by Anonymous over 4 years ago
- Status changed from In Progress to Resolved
- % Done changed from 0 to 100
Applied in changeset arvados|696df23123aac8b1eafb6eea6198de2143b8650f.