Bug #17339
closed[keepstore] unexplained memory use with new S3 driver
100%
Description
I'm using the new S3 driver on 2xpu4. This is keepstore 2.1.1:
keep0.2xpu4:~# keepstore -version keepstore 2.1.1 (go1.13.4)
The keepstore daemon on keep0.2xpu4 seems to use more memory than expected.
The machine it runs on is modest, with just 3.7G of ram:
# free total used free shared buff/cache available Mem: 3827036 3219100 317440 1000 290496 3255540 Swap: 0 0 0
I used our little formula to calculate a suitable value for MaxKeepBlobBuffers:
$ perl -e 'print 3827036/88/1024' 42.4697709517045
That was too many; keepstore kept getting killed by the OOM killer. I reduced it to 40, same problem. I reduced it to 20 and that made it happy, stable at an RSS use of 2999284 bytes, and suitably making clients wait when its 20 buffers are in use (writing roughly 600 MiB/sec into S3).
After an hour of being mostly idle, the keepstore RSS memory use is unchanged:
root 3366 31.2 78.3 4218032 2999284 ? Ssl 16:22 25:05 /usr/bin/keepstore
Two surprising things/questions here:
a) why is it using ~2.86GiB of ram with only 20 buffers? I was expecting memory use to be 1.28GiB of ram (20 * 64 MiB) plus a bit of overhead.
b) why is the memory use not going down after a long period of idle?
Here is a snapshot of the metrics for the keepstore (also available in prometheus/grafana as a time series):
# HELP arvados_concurrent_requests Number of requests in progress # TYPE arvados_concurrent_requests gauge arvados_concurrent_requests 0 # HELP arvados_keepstore_bufferpool_allocated_bytes Number of bytes allocated to buffers # TYPE arvados_keepstore_bufferpool_allocated_bytes gauge arvados_keepstore_bufferpool_allocated_bytes 5.905580032e+09 # HELP arvados_keepstore_bufferpool_inuse_buffers Number of buffers in use # TYPE arvados_keepstore_bufferpool_inuse_buffers gauge arvados_keepstore_bufferpool_inuse_buffers 0 # HELP arvados_keepstore_bufferpool_max_buffers Maximum number of buffers allowed # TYPE arvados_keepstore_bufferpool_max_buffers gauge arvados_keepstore_bufferpool_max_buffers 20 # HELP arvados_keepstore_pull_queue_inprogress_entries Number of pull requests in progress # TYPE arvados_keepstore_pull_queue_inprogress_entries gauge arvados_keepstore_pull_queue_inprogress_entries 0 # HELP arvados_keepstore_pull_queue_pending_entries Number of queued pull requests # TYPE arvados_keepstore_pull_queue_pending_entries gauge arvados_keepstore_pull_queue_pending_entries 0 # HELP arvados_keepstore_trash_queue_inprogress_entries Number of trash requests in progress # TYPE arvados_keepstore_trash_queue_inprogress_entries gauge arvados_keepstore_trash_queue_inprogress_entries 0 # HELP arvados_keepstore_trash_queue_pending_entries Number of queued trash requests # TYPE arvados_keepstore_trash_queue_pending_entries gauge arvados_keepstore_trash_queue_pending_entries 0 # HELP arvados_keepstore_volume_errors Number of volume errors # TYPE arvados_keepstore_volume_errors counter arvados_keepstore_volume_errors{device_id="s3:///2xpu4-nyw5e-000000000000000-volume",error_type="*s3manager.multiUploadError 000 MultipartUpload"} 2 arvados_keepstore_volume_errors{device_id="s3:///2xpu4-nyw5e-000000000000000-volume",error_type="s3.requestFailure 404 NotFound"} 1927 # HELP arvados_keepstore_volume_io_bytes Volume I/O traffic in bytes # TYPE arvados_keepstore_volume_io_bytes counter arvados_keepstore_volume_io_bytes{device_id="s3:///2xpu4-nyw5e-000000000000000-volume",direction="in"} 2.63069164e+08 arvados_keepstore_volume_io_bytes{device_id="s3:///2xpu4-nyw5e-000000000000000-volume",direction="out"} 1.25395672825e+11 # HELP arvados_keepstore_volume_operations Number of volume operations # TYPE arvados_keepstore_volume_operations counter arvados_keepstore_volume_operations{device_id="s3:///2xpu4-nyw5e-000000000000000-volume",operation="get"} 28 arvados_keepstore_volume_operations{device_id="s3:///2xpu4-nyw5e-000000000000000-volume",operation="head"} 5857 arvados_keepstore_volume_operations{device_id="s3:///2xpu4-nyw5e-000000000000000-volume",operation="list"} 112 arvados_keepstore_volume_operations{device_id="s3:///2xpu4-nyw5e-000000000000000-volume",operation="put"} 5821 # HELP arvados_max_concurrent_requests Maximum number of concurrent requests # TYPE arvados_max_concurrent_requests gauge arvados_max_concurrent_requests 0 # HELP request_duration_seconds Summary of request duration. # TYPE request_duration_seconds summary request_duration_seconds_sum{code="200",method="get"} 118.02206067300001 request_duration_seconds_count{code="200",method="get"} 42 request_duration_seconds_sum{code="200",method="put"} 29078.17948122801 request_duration_seconds_count{code="200",method="put"} 3906 request_duration_seconds_sum{code="500",method="put"} 19.722034916 request_duration_seconds_count{code="500",method="put"} 4 # HELP time_to_status_seconds Summary of request TTFB. # TYPE time_to_status_seconds summary time_to_status_seconds_sum{code="200",method="get"} 106.74335901900001 time_to_status_seconds_count{code="200",method="get"} 42 time_to_status_seconds_sum{code="200",method="put"} 29075.862950967014 time_to_status_seconds_count{code="200",method="put"} 3906 time_to_status_seconds_sum{code="500",method="put"} 19.72120516 time_to_status_seconds_count{code="500",method="put"} 4
Files