Bug #6844
closed[SDKs] arv-copy consistently gets MemoryError trying to copy an 8GiB collection
100%
Description
qr1hi-4zz18-c1t7uidtqs5sg82 is public reference data for the Tuxedo pipeline. It's 8GiB.
I am trying to copy this data, using a shell node that has 4GiB RAM and 4GiB swap and current master arv-copy. While it's copying blocks, it eventually gets a MemoryError, and then the whole copy falls apart. Here's a transcript leading up to the badness, with ARVADOS_DEBUG enabled:
qr1hi-4zz18-c1t7uidtqs5sg82: 3904M / 8184M 47.7% 2015-07-31 18:48:02 arvados.keep[11112] DEBUG: c948eb1f7afcaf183552c531c08dc114+67108864+A…: ['https://keep.qr1hi.arvadosapi.com:443/'] 2015-07-31 18:48:02 arvados.keep[11112] DEBUG: Request: GET https://keep.qr1hi.arvadosapi.com:443/c948eb1f7afcaf183552c531c08dc114+67108864+A… 2015-07-31 18:48:03 arvados.keep[11112] INFO: 200 response: 67108864 bytes in 1650.02799034 msec (38.787 MiB/sec) 2015-07-31 18:48:03 arvados.keep[11112] DEBUG: c948eb1f7afcaf183552c531c08dc114+67108864: ['http://keep0.xyzzy.arvadosapi.com:25107/', 'http://keep1.xyzzy.arvadosapi.com:25107/'] 2015-07-31 18:48:03 arvados.keep[11112] DEBUG: KeepWriterThread <KeepWriterThread(Thread-123, started 139671556409088)> proceeding c948eb1f7afcaf183552c531c08dc114+67108864 http://keep1.xyzzy.arvadosapi.com:25107/ 2015-07-31 18:48:03 arvados.keep[11112] DEBUG: Request: PUT http://keep1.xyzzy.arvadosapi.com:25107/c948eb1f7afcaf183552c531c08dc114 2015-07-31 18:48:03 arvados.keep[11112] DEBUG: KeepWriterThread <KeepWriterThread(Thread-124, started 139671564801792)> proceeding c948eb1f7afcaf183552c531c08dc114+67108864 http://keep0.xyzzy.arvadosapi.com:25107/ 2015-07-31 18:48:03 arvados.keep[11112] DEBUG: Request: PUT http://keep0.xyzzy.arvadosapi.com:25107/c948eb1f7afcaf183552c531c08dc114 2015-07-31 18:48:05 arvados.keep[11112] DEBUG: KeepWriterThread <KeepWriterThread(Thread-123, started 139671556409088)> succeeded c948eb1f7afcaf183552c531c08dc114+67108864 http://keep1.xyzzy.arvadosapi.com:25107/ 2015-07-31 18:48:05 arvados.keep[11112] DEBUG: KeepWriterThread <KeepWriterThread(Thread-124, started 139671564801792)> succeeded c948eb1f7afcaf183552c531c08dc114+67108864 http://keep0.xyzzy.arvadosapi.com:25107/ qr1hi-4zz18-c1t7uidtqs5sg82: 3968M / 8184M 48.5% 2015-07-31 18:48:05 arvados.keep[11112] DEBUG: 7c8de9368ec9ad742a3a27336a134dc5+67108864+A…: ['https://keep.qr1hi.arvadosapi.com:443/'] 2015-07-31 18:48:05 arvados.keep[11112] DEBUG: Request: GET https://keep.qr1hi.arvadosapi.com:443/7c8de9368ec9ad742a3a27336a134dc5+67108864+A… 2015-07-31 18:48:07 arvados.keep[11112] INFO: 200 response: 67108864 bytes in 1569.37599182 msec (40.781 MiB/sec) 2015-07-31 18:48:07 arvados.keep[11112] DEBUG: 7c8de9368ec9ad742a3a27336a134dc5+67108864: ['http://keep1.xyzzy.arvadosapi.com:25107/', 'http://keep0.xyzzy.arvadosapi.com:25107/'] 2015-07-31 18:48:07 arvados.keep[11112] DEBUG: KeepWriterThread <KeepWriterThread(Thread-125, started 139671564801792)> proceeding 7c8de9368ec9ad742a3a27336a134dc5+67108864 http://keep1.xyzzy.arvadosapi.com:25107/ 2015-07-31 18:48:07 arvados.keep[11112] DEBUG: Request: PUT http://keep1.xyzzy.arvadosapi.com:25107/7c8de9368ec9ad742a3a27336a134dc5 2015-07-31 18:48:07 arvados.keep[11112] DEBUG: KeepWriterThread <KeepWriterThread(Thread-126, started 139671556409088)> proceeding 7c8de9368ec9ad742a3a27336a134dc5+67108864 http://keep0.xyzzy.arvadosapi.com:25107/ 2015-07-31 18:48:07 arvados.keep[11112] DEBUG: Request: PUT http://keep0.xyzzy.arvadosapi.com:25107/7c8de9368ec9ad742a3a27336a134dc5 2015-07-31 18:48:09 arvados.keep[11112] DEBUG: KeepWriterThread <KeepWriterThread(Thread-125, started 139671564801792)> succeeded 7c8de9368ec9ad742a3a27336a134dc5+67108864 http://keep1.xyzzy.arvadosapi.com:25107/ 2015-07-31 18:48:09 arvados.keep[11112] DEBUG: KeepWriterThread <KeepWriterThread(Thread-126, started 139671556409088)> succeeded 7c8de9368ec9ad742a3a27336a134dc5+67108864 http://keep0.xyzzy.arvadosapi.com:25107/ qr1hi-4zz18-c1t7uidtqs5sg82: 4032M / 8184M 49.3% 2015-07-31 18:48:09 arvados.keep[11112] DEBUG: 7ac2a965fe398a919e86392d67bce52b+67108864+A…: ['https://keep.qr1hi.arvadosapi.com:443/'] 2015-07-31 18:48:09 arvados.keep[11112] DEBUG: Request: GET https://keep.qr1hi.arvadosapi.com:443/7ac2a965fe398a919e86392d67bce52b+67108864+A… 2015-07-31 18:48:10 arvados.keep[11112] INFO: 200 response: 67108864 bytes in 1575.77109337 msec (40.615 MiB/sec) 2015-07-31 18:48:10 arvados.keep[11112] DEBUG: 7ac2a965fe398a919e86392d67bce52b+67108864: ['http://keep0.xyzzy.arvadosapi.com:25107/', 'http://keep1.xyzzy.arvadosapi.com:25107/'] 2015-07-31 18:48:10 arvados.keep[11112] DEBUG: KeepWriterThread <KeepWriterThread(Thread-127, started 139671556409088)> proceeding 7ac2a965fe398a919e86392d67bce52b+67108864 http://keep1.xyzzy.arvadosapi.com:25107/ 2015-07-31 18:48:10 arvados.keep[11112] DEBUG: Request: PUT http://keep1.xyzzy.arvadosapi.com:25107/7ac2a965fe398a919e86392d67bce52b 2015-07-31 18:48:10 arvados.keep[11112] DEBUG: KeepWriterThread <KeepWriterThread(Thread-128, started 139671564801792)> proceeding 7ac2a965fe398a919e86392d67bce52b+67108864 http://keep0.xyzzy.arvadosapi.com:25107/ 2015-07-31 18:48:10 arvados.keep[11112] DEBUG: Request: PUT http://keep0.xyzzy.arvadosapi.com:25107/7ac2a965fe398a919e86392d67bce52b 2015-07-31 18:48:12 arvados.keep[11112] DEBUG: KeepWriterThread <KeepWriterThread(Thread-128, started 139671564801792)> succeeded 7ac2a965fe398a919e86392d67bce52b+67108864 http://keep0.xyzzy.arvadosapi.com:25107/ 2015-07-31 18:48:13 arvados.keep[11112] DEBUG: KeepWriterThread <KeepWriterThread(Thread-127, started 139671556409088)> succeeded 7ac2a965fe398a919e86392d67bce52b+67108864 http://keep1.xyzzy.arvadosapi.com:25107/ qr1hi-4zz18-c1t7uidtqs5sg82: 4096M / 8184M 50.0% 2015-07-31 18:48:13 arvados.keep[11112] DEBUG: 22e4cbeae1375cb0aea342ab68ddbfef+67108864+A…: ['https://keep.qr1hi.arvadosapi.com:443/'] 2015-07-31 18:48:13 arvados.keep[11112] DEBUG: Request: GET https://keep.qr1hi.arvadosapi.com:443/22e4cbeae1375cb0aea342ab68ddbfef+67108864+A… MemoryError: out of memory 2015-07-31 18:48:14 arvados.keep[11112] DEBUG: Request fail: GET https://keep.qr1hi.arvadosapi.com:443/22e4cbeae1375cb0aea342ab68ddbfef+67108864+A… => <class 'arvados.errors.HttpError'>: 2015-07-31 18:48:17 arvados.keep[11112] DEBUG: {u'qr1hi-bi6l4-ebw6id3jibfc3cm': OrderedDict([(u'href', u'/keep_services/qr1hi-bi6l4-ebw6id3jibfc3cm'), (u'kind', u'arvados#keepService'), (u'etag', u'dlywp1t0nne9dw3gds9tcyhs3'), (u'uuid', u'qr1hi-bi6l4-ebw6id3jibfc3cm'), (u'owner_uuid', u'qr1hi-tpzed-327pde5hktairf9'), (u'created_at', u'2014-07-08T02:19:44.015581000Z'), (u'modified_by_client_uuid', u'qr1hi-ozdt8-obw7foaks3qjyej'), (u'modified_by_user_uuid', u'qr1hi-tpzed-327pde5hktairf9'), (u'modified_at', u'2014-07-08T02:20:11.712252000Z'), (u'service_host', u'keep.qr1hi.arvadosapi.com'), (u'service_port', 443), (u'service_ssl_flag', True), (u'service_type', u'proxy'), (u'read_only', False), ('_service_root', 'https://keep.qr1hi.arvadosapi.com:443/')])} 2015-07-31 18:48:17 arvados.keep[11112] DEBUG: [OrderedDict([(u'href', u'/keep_services/qr1hi-bi6l4-ebw6id3jibfc3cm'), (u'kind', u'arvados#keepService'), (u'etag', u'dlywp1t0nne9dw3gds9tcyhs3'), (u'uuid', u'qr1hi-bi6l4-ebw6id3jibfc3cm'), (u'owner_uuid', u'qr1hi-tpzed-327pde5hktairf9'), (u'created_at', u'2014-07-08T02:19:44.015581000Z'), (u'modified_by_client_uuid', u'qr1hi-ozdt8-obw7foaks3qjyej'), (u'modified_by_user_uuid', u'qr1hi-tpzed-327pde5hktairf9'), (u'modified_at', u'2014-07-08T02:20:11.712252000Z'), (u'service_host', u'keep.qr1hi.arvadosapi.com'), (u'service_port', 443), (u'service_ssl_flag', True), (u'service_type', u'proxy'), (u'read_only', False), ('_service_root', 'https://keep.qr1hi.arvadosapi.com:443/')])] 2015-07-31 18:48:17 arvados.keep[11112] DEBUG: 22e4cbeae1375cb0aea342ab68ddbfef+67108864+A…: ['https://keep.qr1hi.arvadosapi.com:443/'] 2015-07-31 18:48:17 arvados.keep[11112] INFO: No more Keep services to try; giving up Traceback (most recent call last): File "/home/brett/pyvenv/bin/arv-copy", line 4, in <module> __import__('pkg_resources').run_script('arvados-python-client==0.1', 'arv-copy') File "/home/brett/pyvenv/local/lib/python2.7/site-packages/pkg_resources/__init__.py", line 735, in run_script self.require(requires)[0].run_script(script_name, ns) File "/home/brett/pyvenv/local/lib/python2.7/site-packages/pkg_resources/__init__.py", line 1652, in run_script exec(code, namespace, namespace) File "/home/brett/pyvenv/lib/python2.7/site-packages/arvados_python_client-0.1-py2.7.egg/EGG-INFO/scripts/arv-copy", line 4, in <module> main() File "/home/brett/pyvenv/local/lib/python2.7/site-packages/arvados_python_client-0.1-py2.7.egg/arvados/commands/arv_copy.py", line 119, in main args) File "/home/brett/pyvenv/local/lib/python2.7/site-packages/arvados_python_client-0.1-py2.7.egg/arvados/commands/arv_copy.py", line 563, in copy_collection data = src_keep.get(word) File "/home/brett/pyvenv/local/lib/python2.7/site-packages/arvados_python_client-0.1-py2.7.egg/arvados/retry.py", line 154, in num_retries_setter return orig_func(self, *args, **kwargs) File "/home/brett/pyvenv/local/lib/python2.7/site-packages/arvados_python_client-0.1-py2.7.egg/arvados/keep.py", line 907, in get "failed to read {}".format(loc_s), service_errors, label="service") arvados.errors.KeepReadError: failed to read 22e4cbeae1375cb0aea342ab68ddbfef+67108864+A…: service https://keep.qr1hi.arvadosapi.com:443/ responded with 0 (23, 'Failed writing body (0 != 16384)')
I note that the KeepWriterThread ids keep going up monotonically. Is it possible they're not being released and eventually garbage collected?
Updated by Peter Amstutz over 9 years ago
Check the keepproxy logs. This is a case where the real error is hidden from view.
Updated by Brett Smith over 9 years ago
Peter Amstutz wrote:
Check the keepproxy logs. This is a case where the real error is hidden from view.
Everything's fine there. The "MemoryError: out of memory" line earlier in the transcript suggests that this a purely client-side OOM condition.
Updated by Brett Smith over 9 years ago
I spent an hour digging into this and didn't immediately find the source. I spent some time scrutinizing our PyCurl client pool, but everything seems on the up-and-up there. And the writer threads etc. should be garbage collectible after put() returns.
I wonder if the issue is with our use of PyCurl itself? Should we be closing or releasing objects that we're currently not?
You might have to do some instrumentation to check the refcounts of live objects to help track this down.
Updated by Tom Clegg over 9 years ago
6844-py-mem-leak @ f809883
The fix is in 0988acb
Bad news: lost a feature.- The memory leak was
@retry_method
's fault. Evidently inspect.getcallargs causes reference cycles (or something like that) which is bad if an arg is often, say, a 64 MiB string. I tried various things with "del" to no avail, eventually decided that the ability to accept num_retries as a positional argument isn't important enough to worry about too much. Found a total of 3 such uses, including the test case for that specific feature.
- Clean DB after each websocket test -- and make "clean DB" clean the humans table too (even though we have no fixtures there). Now you can leave an API server up and run Python SDK tests against it repeatedly, without "websocket" leaving stuff in the DB that breaks the "empty list" test.
- Remove arv-copy code that preserves the "missing trailing newline" manifest format error when copying a collection.
- Remove arv-copy code that uselessly stores an unsigned manifest in Keep.
- Fix overly broad exception-catching in arv-copy
- Cleanup sdk/python websocket tests (faster, test more things)
Updated by Tom Clegg over 9 years ago
I'd still like to see about making a memory-leak test case for this, but I don't think it should block merging the fix -- and maybe it shouldn't even block closing the issue, if it turns out to be hard to test reliably.
Updated by Tom Clegg over 9 years ago
- Status changed from In Progress to Resolved
Applied in changeset arvados|commit:2a96c097e5a176018d078a5d6985403072e8672e.
Updated by Brett Smith over 9 years ago
FWIW I was able to use the latest version to copy the original collection. Thanks.