Project

General

Profile

Actions

Bug #17339

closed

[keepstore] unexplained memory use with new S3 driver

Added by Ward Vandewege almost 4 years ago. Updated about 3 years ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
-
Target version:
Start date:
12/04/2021
Due date:
% Done:

100%

Estimated time:
(Total: 0.00 h)
Story points:
-
Release relationship:
Auto

Description

I'm using the new S3 driver on 2xpu4. This is keepstore 2.1.1:

keep0.2xpu4:~# keepstore -version
keepstore 2.1.1 (go1.13.4)

The keepstore daemon on keep0.2xpu4 seems to use more memory than expected.

The machine it runs on is modest, with just 3.7G of ram:

# free
              total        used        free      shared  buff/cache   available
Mem:        3827036     3219100      317440        1000      290496     3255540
Swap:             0           0           0

I used our little formula to calculate a suitable value for MaxKeepBlobBuffers:

$  perl -e 'print 3827036/88/1024'
42.4697709517045

That was too many; keepstore kept getting killed by the OOM killer. I reduced it to 40, same problem. I reduced it to 20 and that made it happy, stable at an RSS use of 2999284 bytes, and suitably making clients wait when its 20 buffers are in use (writing roughly 600 MiB/sec into S3).

After an hour of being mostly idle, the keepstore RSS memory use is unchanged:

root      3366 31.2 78.3 4218032 2999284 ?     Ssl  16:22  25:05 /usr/bin/keepstore

Two surprising things/questions here:

a) why is it using ~2.86GiB of ram with only 20 buffers? I was expecting memory use to be 1.28GiB of ram (20 * 64 MiB) plus a bit of overhead.

b) why is the memory use not going down after a long period of idle?

Here is a snapshot of the metrics for the keepstore (also available in prometheus/grafana as a time series):

# HELP arvados_concurrent_requests Number of requests in progress
# TYPE arvados_concurrent_requests gauge
arvados_concurrent_requests 0
# HELP arvados_keepstore_bufferpool_allocated_bytes Number of bytes allocated to buffers
# TYPE arvados_keepstore_bufferpool_allocated_bytes gauge
arvados_keepstore_bufferpool_allocated_bytes 5.905580032e+09
# HELP arvados_keepstore_bufferpool_inuse_buffers Number of buffers in use
# TYPE arvados_keepstore_bufferpool_inuse_buffers gauge
arvados_keepstore_bufferpool_inuse_buffers 0
# HELP arvados_keepstore_bufferpool_max_buffers Maximum number of buffers allowed
# TYPE arvados_keepstore_bufferpool_max_buffers gauge
arvados_keepstore_bufferpool_max_buffers 20
# HELP arvados_keepstore_pull_queue_inprogress_entries Number of pull requests in progress
# TYPE arvados_keepstore_pull_queue_inprogress_entries gauge
arvados_keepstore_pull_queue_inprogress_entries 0
# HELP arvados_keepstore_pull_queue_pending_entries Number of queued pull requests
# TYPE arvados_keepstore_pull_queue_pending_entries gauge
arvados_keepstore_pull_queue_pending_entries 0
# HELP arvados_keepstore_trash_queue_inprogress_entries Number of trash requests in progress
# TYPE arvados_keepstore_trash_queue_inprogress_entries gauge
arvados_keepstore_trash_queue_inprogress_entries 0
# HELP arvados_keepstore_trash_queue_pending_entries Number of queued trash requests
# TYPE arvados_keepstore_trash_queue_pending_entries gauge
arvados_keepstore_trash_queue_pending_entries 0
# HELP arvados_keepstore_volume_errors Number of volume errors
# TYPE arvados_keepstore_volume_errors counter
arvados_keepstore_volume_errors{device_id="s3:///2xpu4-nyw5e-000000000000000-volume",error_type="*s3manager.multiUploadError 000 MultipartUpload"} 2
arvados_keepstore_volume_errors{device_id="s3:///2xpu4-nyw5e-000000000000000-volume",error_type="s3.requestFailure 404 NotFound"} 1927
# HELP arvados_keepstore_volume_io_bytes Volume I/O traffic in bytes
# TYPE arvados_keepstore_volume_io_bytes counter
arvados_keepstore_volume_io_bytes{device_id="s3:///2xpu4-nyw5e-000000000000000-volume",direction="in"} 2.63069164e+08
arvados_keepstore_volume_io_bytes{device_id="s3:///2xpu4-nyw5e-000000000000000-volume",direction="out"} 1.25395672825e+11
# HELP arvados_keepstore_volume_operations Number of volume operations
# TYPE arvados_keepstore_volume_operations counter
arvados_keepstore_volume_operations{device_id="s3:///2xpu4-nyw5e-000000000000000-volume",operation="get"} 28
arvados_keepstore_volume_operations{device_id="s3:///2xpu4-nyw5e-000000000000000-volume",operation="head"} 5857
arvados_keepstore_volume_operations{device_id="s3:///2xpu4-nyw5e-000000000000000-volume",operation="list"} 112
arvados_keepstore_volume_operations{device_id="s3:///2xpu4-nyw5e-000000000000000-volume",operation="put"} 5821
# HELP arvados_max_concurrent_requests Maximum number of concurrent requests
# TYPE arvados_max_concurrent_requests gauge
arvados_max_concurrent_requests 0
# HELP request_duration_seconds Summary of request duration.
# TYPE request_duration_seconds summary
request_duration_seconds_sum{code="200",method="get"} 118.02206067300001
request_duration_seconds_count{code="200",method="get"} 42
request_duration_seconds_sum{code="200",method="put"} 29078.17948122801
request_duration_seconds_count{code="200",method="put"} 3906
request_duration_seconds_sum{code="500",method="put"} 19.722034916
request_duration_seconds_count{code="500",method="put"} 4
# HELP time_to_status_seconds Summary of request TTFB.
# TYPE time_to_status_seconds summary
time_to_status_seconds_sum{code="200",method="get"} 106.74335901900001
time_to_status_seconds_count{code="200",method="get"} 42
time_to_status_seconds_sum{code="200",method="put"} 29075.862950967014
time_to_status_seconds_count{code="200",method="put"} 3906
time_to_status_seconds_sum{code="500",method="put"} 19.72120516
time_to_status_seconds_count{code="500",method="put"} 4

Files


Subtasks 1 (0 open1 closed)

Task #17411: ReviewResolvedWard Vandewege12/04/2021

Actions
Actions #1

Updated by Ward Vandewege almost 4 years ago

  • Status changed from New to In Progress
Actions #2

Updated by Ward Vandewege almost 4 years ago

  • Description updated (diff)
Actions #3

Updated by Ward Vandewege almost 4 years ago

  • Description updated (diff)
Actions #5

Updated by Ward Vandewege almost 4 years ago

  • Subject changed from [keepstore] unexplained memory use to [keepstore] unexplained memory use with new S3 driver
Actions #6

Updated by Ward Vandewege almost 4 years ago

  • Description updated (diff)
Actions #7

Updated by Ward Vandewege almost 4 years ago

  • Target version changed from 2021-02-17 sprint to 2021-03-03 sprint
Actions #8

Updated by Peter Amstutz almost 4 years ago

  • Target version changed from 2021-03-03 sprint to 2021-03-17 sprint
Actions #9

Updated by Ward Vandewege almost 4 years ago

  • Target version changed from 2021-03-17 sprint to 2021-03-31 sprint
Actions #10

Updated by Peter Amstutz almost 4 years ago

  • Target version changed from 2021-03-31 sprint to 2021-04-14 sprint
Actions #11

Updated by Peter Amstutz almost 4 years ago

  • Target version changed from 2021-04-14 sprint to 2021-04-28 bughunt sprint
Actions #12

Updated by Peter Amstutz over 3 years ago

  • Target version changed from 2021-04-28 bughunt sprint to 2021-05-12 sprint
Actions #13

Updated by Peter Amstutz over 3 years ago

  • Target version changed from 2021-05-12 sprint to 2021-05-26 sprint
Actions #14

Updated by Peter Amstutz over 3 years ago

  • Target version changed from 2021-05-26 sprint to 2021-06-09 sprint
Actions #15

Updated by Peter Amstutz over 3 years ago

  • Target version changed from 2021-06-09 sprint to 2021-06-23 sprint
Actions #16

Updated by Ward Vandewege over 3 years ago

  • Target version changed from 2021-06-23 sprint to 2021-07-07 sprint
Actions #17

Updated by Peter Amstutz over 3 years ago

  • Target version changed from 2021-07-07 sprint to 2021-07-21 sprint
Actions #18

Updated by Peter Amstutz over 3 years ago

  • Target version changed from 2021-07-21 sprint to 2021-08-04 sprint
Actions #19

Updated by Peter Amstutz over 3 years ago

  • Target version changed from 2021-08-04 sprint to 2021-08-18 sprint
Actions #20

Updated by Peter Amstutz over 3 years ago

  • Target version changed from 2021-08-18 sprint to 2021-09-01 sprint
Actions #21

Updated by Peter Amstutz over 3 years ago

  • Target version changed from 2021-09-01 sprint to 2021-09-15 sprint
Actions #22

Updated by Peter Amstutz over 3 years ago

  • Target version changed from 2021-09-15 sprint to 2021-09-29 sprint
Actions #23

Updated by Peter Amstutz over 3 years ago

  • Target version changed from 2021-09-29 sprint to 2021-10-13 sprint
Actions #24

Updated by Peter Amstutz over 3 years ago

  • Target version changed from 2021-10-13 sprint to 2021-10-27 sprint
Actions #25

Updated by Peter Amstutz about 3 years ago

  • Target version changed from 2021-10-27 sprint to 2021-11-10 sprint
Actions #26

Updated by Ward Vandewege about 3 years ago

  • Target version changed from 2021-11-10 sprint to 2021-11-24 sprint
Actions #27

Updated by Peter Amstutz about 3 years ago

  • Target version changed from 2021-11-24 sprint to 2021-12-08 sprint

Updated by Ward Vandewege about 3 years ago

Running some experiments in a controlled environment. I've attached before-profile-arv-put.bb.gz which is a pprof snapshot taken after running an upload experiment a couple times. This is the corresponding graph that shows cumulative memory allocations, which points at the problem:

Clearly, an (unnecessary) copy of uploaded blocks is being made inside the S3 sdk.

This is also visible in the "top10" output:

File: keepstore
Build ID: 78315dc4f3813def38521cb385cc1fdaecdcf72b
Type: inuse_space
Time: Dec 2, 2021 at 2:43pm (EST)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof)  sample_index=alloc_space
(pprof) top10
Showing nodes accounting for 6319.45MB, 98.28% of 6430.30MB total
Dropped 308 nodes (cum <= 32.15MB)
Showing top 10 nodes out of 28
      flat  flat%   sum%        cum   cum%
 3140.14MB 48.83% 48.83%  3140.14MB 48.83%  github.com/aws/aws-sdk-go-v2/service/s3/s3manager.(*maxSlicePool).newSlice
    3136MB 48.77% 97.60%     3136MB 48.77%  git.arvados.org/arvados.git/services/keepstore.(*handler).setup.func1
   42.81MB  0.67% 98.27%    62.87MB  0.98%  io.copyBuffer
    0.50MB 0.0078% 98.28%    47.61MB  0.74%  github.com/aws/aws-sdk-go-v2/service/s3/s3manager.(*multiuploader).send
         0     0% 98.28%    58.24MB  0.91%  bufio.(*Writer).ReadFrom
         0     0% 98.28%  3155.16MB 49.07%  git.arvados.org/arvados.git/services/keepstore.(*S3AWSVolume).Put
         0     0% 98.28%  3155.16MB 49.07%  git.arvados.org/arvados.git/services/keepstore.(*S3AWSVolume).WriteBlock
         0     0% 98.28%  3155.16MB 49.07%  git.arvados.org/arvados.git/services/keepstore.(*S3AWSVolume).writeObject
         0     0% 98.28%     3136MB 48.77%  git.arvados.org/arvados.git/services/keepstore.(*bufferPool).Get
         0     0% 98.28%  3154.66MB 49.06%  git.arvados.org/arvados.git/services/keepstore.PutBlock.func1

Tom helped looking at the code, and found that the issue is that we are passing in an io.Reader, not a `readerAtSeeker`, which causes nextReader() in `/service/s3/s3manager/upload.go` to make an extra memory allocation.

The cause of this is our use of `putWithPipe` in the `Put` method in `services/keepstore/s3aws_volume.go`, which is fundamentally incompatible with seeking. We use `putWithPipe` to be able to release the allocated memory as soon as possible when a client hangs up. We can outsource that problem to the S3 sdk since we pass in a context.

After 464c1562415ac7b5b5503f41b20c3183610dc899, the extra memory allocation on the write path is gone. This is a Profile (after-profile-arv-put.bb.gz) taken after writing roughly 15GB of data with arv-put:

File: keepstore
Build ID: 3fa6ed4b9b8bab06a255c44ae9b017f5743ad637
Type: inuse_space
Time: Dec 2, 2021 at 3:50pm (EST)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) sample_index=alloc_space
(pprof) top
Showing nodes accounting for 2094.03MB, 94.76% of 2209.78MB total
Dropped 327 nodes (cum <= 11.05MB)
Showing top 10 nodes out of 67
      flat  flat%   sum%        cum   cum%
    1728MB 78.20% 78.20%     1728MB 78.20%  git.arvados.org/arvados.git/services/keepstore.(*handler).setup.func1
  160.04MB  7.24% 85.44%   160.04MB  7.24%  git.arvados.org/arvados.git/services/keepstore.(*statsTicker).TickOutBytes
  109.33MB  4.95% 90.39%   300.90MB 13.62%  io.copyBuffer
   30.50MB  1.38% 91.77%    30.50MB  1.38%  context.WithCancel
   16.56MB  0.75% 92.52%    16.56MB  0.75%  bufio.NewReaderSize
   15.05MB  0.68% 93.20%    15.05MB  0.68%  bufio.NewWriterSize
   14.02MB  0.63% 93.83%    14.02MB  0.63%  strings.(*Builder).grow (inline)
    9.01MB  0.41% 94.24%    15.01MB  0.68%  github.com/aws/aws-sdk-go-v2/aws/signer/v4.(*httpSigner).buildCanonicalHeaders
    7.51MB  0.34% 94.58%    13.01MB  0.59%  github.com/aws/aws-sdk-go-v2/aws.New
       4MB  0.18% 94.76%    55.54MB  2.51%  github.com/aws/aws-sdk-go-v2/aws/signer/v4.(*httpSigner).Build
Actions #30

Updated by Ward Vandewege about 3 years ago

On the read path, a similar thing happens, though this time the extra memory allocation happens in the Keep driver, not in the S3 sdk.

Top10:

File: keepstore
Build ID: 3fa6ed4b9b8bab06a255c44ae9b017f5743ad637
Type: inuse_space
Time: Dec 2, 2021 at 4:08pm (EST)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) sample_index=alloc_space
(pprof) top10
Showing nodes accounting for 27992.35MB, 98.70% of 28359.95MB total
Dropped 351 nodes (cum <= 141.80MB)
Showing top 10 nodes out of 27
      flat  flat%   sum%        cum   cum%
14976.50MB 52.81% 52.81% 15016.70MB 52.95%  git.arvados.org/arvados.git/services/keepstore.(*S3AWSVolume).readWorker
   12672MB 44.68% 97.49%    12672MB 44.68%  git.arvados.org/arvados.git/services/keepstore.(*handler).setup.func1
  189.81MB  0.67% 98.16%   525.38MB  1.85%  io.copyBuffer
  153.54MB  0.54% 98.70%   153.54MB  0.54%  git.arvados.org/arvados.git/services/keepstore.(*statsTicker).TickInBytes
    0.50MB 0.0018% 98.70%   475.79MB  1.68%  github.com/aws/aws-sdk-go-v2/service/s3/s3manager.(*downloader).downloadChunk
         0     0% 98.70%   182.03MB  0.64%  bufio.(*Reader).Read
         0     0% 98.70%   173.16MB  0.61%  crypto/tls.(*Conn).HandshakeContext (inline)
         0     0% 98.70%   182.03MB  0.64%  crypto/tls.(*Conn).Read
         0     0% 98.70%   173.16MB  0.61%  crypto/tls.(*Conn).handshakeContext
         0     0% 98.70% 15174.88MB 53.51%  git.arvados.org/arvados.git/services/keepstore.(*S3AWSVolume).ReadBlock

Updated by Ward Vandewege about 3 years ago

After 5ce5bf966dfabbc0beb7330d4c976a23fde3fd83, the extra memory allocation in the read path is gone. This is a Profile (after-profile-arv-mount-read.bb.gz) taken after reading roughly 15GB of data through arv-mount:

pprof after-profile-arv-mount-read.bb.gz
File: keepstore
Build ID: 91924b715b34947abbb234201802d2364bc4749f
Type: inuse_space
Time: Dec 3, 2021 at 3:01pm (EST)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) sample_index=alloc_space
(pprof) top10
Showing nodes accounting for 3460.61MB, 96.73% of 3577.68MB total
Dropped 314 nodes (cum <= 17.89MB)
Showing top 10 nodes out of 60
      flat  flat%   sum%        cum   cum%
    3072MB 85.87% 85.87%     3072MB 85.87%  git.arvados.org/arvados.git/services/keepstore.(*handler).setup.func1
  175.88MB  4.92% 90.78%   363.53MB 10.16%  io.copyBuffer
  136.01MB  3.80% 94.58%   136.01MB  3.80%  context.WithCancel
   37.65MB  1.05% 95.64%    37.65MB  1.05%  bytes.makeSlice
   19.07MB  0.53% 96.17%    19.07MB  0.53%  bufio.NewWriterSize (inline)
   17.50MB  0.49% 96.66%    54.14MB  1.51%  crypto/tls.(*Conn).readFromUntil
    1.50MB 0.042% 96.70%   447.14MB 12.50%  github.com/aws/aws-sdk-go-v2/service/s3/s3manager.(*downloader).downloadChunk
    0.50MB 0.014% 96.71%    35.03MB  0.98%  github.com/aws/aws-sdk-go-v2/aws/signer/v4.(*httpSigner).Build
    0.50MB 0.014% 96.73%    18.06MB   0.5%  github.com/aws/aws-sdk-go-v2/private/protocol/xml.NewEncoder (inline)
         0     0% 96.73%   187.64MB  5.24%  bufio.(*Reader).Read
Actions #34

Updated by Tom Clegg about 3 years ago

Functionally LGTM, just some code suggestions:

In count.go:

NewCountingReaderAtSeeker(r interface{}, f func(uint64)) should be NewCountingReaderAtSeeker(r readerAtSeeker, f func(uint64)) ... then sending the wrong type will be a compile error instead of a runtime panic.

No need to write Seek() explicitly -- countingReaderAtSeeker embeds readerAtSeeker so Seek will pass through implicitly.

In translateError() I think it would be better form to check for the concrete type *aws.RequestCanceledError before checking the interface type awserr.Error so we still do the same thing in case the type starts implementing the interface. Also, mixing switch and if-else here is a bit awkward, could just pick one.

if _, ok := err.(*awsRequestCanceledError); ok {
} else if err, ok := err.(aws.Error); ok {
}

Looks like Get() is an unnecessary wrapper -- could we just rename ReadBlock() to Get()?

Might be better to call v.bucket.stats.TickInBytes(uint64(count)) from readWorker() instead, so ReadBlock() doesn't need to call it twice.

In readWorker, the line buf = awsBuf.Bytes() doesn't do anything (buf is already the buffer where the block has been downloaded, and if it wasn't, this wouldn't help anyway since we don't do anything with buf after this). (With that, the last 5 lines can reduce to "return count, v.translateError(err)")

WriteBlock() is a bit misleading since it takes an io.Reader but only works the way we want if the io.Reader also happens to be a readAtSeeker. The reason it existed in the first place was to use the "pipe adapter" stuff, which we now know it can't use, so we could just put "rdr := bytes.NewReader(block)" at the top and then WriteBlock would be the whole Put func.

Actions #35

Updated by Ward Vandewege about 3 years ago

Tom Clegg wrote:

Functionally LGTM, just some code suggestions:

In count.go:

NewCountingReaderAtSeeker(r interface{}, f func(uint64)) should be NewCountingReaderAtSeeker(r readerAtSeeker, f func(uint64)) ... then sending the wrong type will be a compile error instead of a runtime panic.

Yes! Fixed.

No need to write Seek() explicitly -- countingReaderAtSeeker embeds readerAtSeeker so Seek will pass through implicitly.

Done.

In translateError() I think it would be better form to check for the concrete type *aws.RequestCanceledError before checking the interface type awserr.Error so we still do the same thing in case the type starts implementing the interface. Also, mixing switch and if-else here is a bit awkward, could just pick one.

Done, thanks.

Looks like Get() is an unnecessary wrapper -- could we just rename ReadBlock() to Get()?

Yes, done!

Might be better to call v.bucket.stats.TickInBytes(uint64(count)) from readWorker() instead, so ReadBlock() doesn't need to call it twice.

Fixed.

In readWorker, the line buf = awsBuf.Bytes() doesn't do anything (buf is already the buffer where the block has been downloaded, and if it wasn't, this wouldn't help anyway since we don't do anything with buf after this). (With that, the last 5 lines can reduce to "return count, v.translateError(err)")

Done thanks.

WriteBlock() is a bit misleading since it takes an io.Reader but only works the way we want if the io.Reader also happens to be a readAtSeeker. The reason it existed in the first place was to use the "pipe adapter" stuff, which we now know it can't use, so we could just put "rdr := bytes.NewReader(block)" at the top and then WriteBlock would be the whole Put func.

Done.

Ready for another look at 24e333637a1ecf1fd01ad501d62778a24985948f on branch 17339-s3aws-driver-memory-footprint
Tests are running at developer-run-tests: #2832

Actions #36

Updated by Tom Clegg about 3 years ago

LGTM, thanks!

Actions #37

Updated by Ward Vandewege about 3 years ago

  • Status changed from In Progress to Resolved
  • % Done changed from 0 to 100
Actions #38

Updated by Ward Vandewege about 3 years ago

  • Release set to 46
Actions

Also available in: Atom PDF