Bug #13999
closed
Python SDK downloading a different file from Keep
Description
To reproduce this issue, I ran `mkdir keep && arv-mount keep` to mount Keep in my home directory. Then, I ran the Python script below:
import numpy as np
import arvados
import arvados.collection
import os
# file details
collection_uuid = "su92l-4zz18-b8rs5x7t6gry16k"
filename = "all-info.npy"
# load the file through the Keep FUSE mount
tiled_data_dir = os.path.join("/home/kfang/keep/by_id/", collection_uuid)
all_info = np.load(os.path.join(tiled_data_dir, filename))
api = arvados.api()
c = arvados.collection.CollectionReader(collection_uuid)
# open a file in Arvados
with c.open(filename, "rb") as reader:
    info = reader.read()
# load a numpy array from binary
info_pysdk = np.fromstring(info, dtype=np.uint32)
# compare the two files
# the shape of all_info is (21310012,)
print(all_info.shape)
# whereas the shape of info_pysdk is (21310032,)
print(info_pysdk.shape)
# this returns false
print(np.array_equal(all_info, info_pysdk))
# whereas this returns true
print(np.array_equal(all_info, info_pysdk[20:]))
The script successfully reads the array from Keep both ways, but np.array_equal returns False. The array sizes differ: 21,310,012 elements through the mount versus 21,310,032 through the Python SDK.
Just for reference, the variable that contains the keep-mounted file is all_info and the variable that stores the Arvados Python SDK downloaded array is info_pysdk.
However, once I drop the first 20 elements of the Python SDK-loaded array (info_pysdk[20:]), np.array_equal returns True and the arrays are the same.
Updated by Tom Morris over 7 years ago
numpy doesn't seem relevant. What is different in the byte streams before numpy starts playing with them?
Updated by Kevin Fang over 7 years ago
Hmm. Upon further inspection, this doesn't actually look like a bug in the Arvados Python SDK but rather a strange "feature" in NumPy. It looks like the np.fromstring function is adding something to the beginning of the array, so an offset needs to be applied to the binary downloaded by the Python SDK. If the binary from the Python SDK is saved to a file and then opened using np.load, the arrays are exactly the same.
Updated by Kevin Fang over 7 years ago
I.e. doing something like:
[...]
with c.open(filename, "rb") as reader:
    info = reader.read()
# write the raw bytes in binary mode, then let np.load parse the file
with open('arr.npy', 'wb') as f:
    f.write(info)
np.load('arr.npy')
[...]
will result in the correct array being loaded.
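The same round trip should also work without a temporary file, since np.load accepts a seekable file-like object. The following is a minimal sketch under that assumption: load_npy_bytes is a hypothetical helper name, and the bytes are produced locally with np.save as a stand-in for what reader.read() returns from the collection.

```python
import io

import numpy as np


def load_npy_bytes(raw: bytes) -> np.ndarray:
    """Parse a complete .npy payload that is already held in memory."""
    return np.load(io.BytesIO(raw))


# Stand-in for the bytes returned by reader.read(); in the issue these
# would come from the CollectionReader instead.
original = np.arange(10, dtype=np.uint32)
buf = io.BytesIO()
np.save(buf, original)
info = buf.getvalue()

restored = load_npy_bytes(info)
print(np.array_equal(restored, original))
```

Wrapping the bytes in io.BytesIO keeps everything in memory, which may matter for a file of this size only if disk space or I/O is a concern.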
Updated by Joshua Randall over 7 years ago
I'm not sure if this is relevant, but in the original example you are relying on deprecated functionality of numpy's fromstring function. The documentation says that when `sep` is not specified: "fromstring falls back on the behaviour of frombuffer after encoding unicode string inputs as either utf-8 (python 3), or the default encoding (python 2)." (Deprecated since version 1.14; https://docs.scipy.org/doc/numpy/reference/generated/numpy.fromstring.html)
I suspect the problem might be to do with the encoding handling? If you change to using `frombuffer` directly on the bytestring returned from `read()` instead of going through the deprecated `fromstring` behaviour, that may solve the problem?
Updated by Kevin Fang over 7 years ago
Do you mean something like this? info = np.frombuffer(reader.read(), dtype=np.uint32). This unfortunately produces the same behavior, with the 20 extra indices at the front.
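One explanation consistent with these numbers: a .npy file begins with a header (a magic string, a version, and a dictionary describing dtype and shape) that np.load parses and skips, whereas np.frombuffer and np.fromstring interpret every byte as data. Twenty extra uint32 values correspond to 80 bytes, which is a plausible header size. The sketch below uses a small synthetic array rather than the collection from this issue, and the variable names are illustrative.

```python
import io

import numpy as np

# Build an in-memory .npy payload from a synthetic array.
data = np.arange(1000, dtype=np.uint32)
buf = io.BytesIO()
np.save(buf, data)
raw = buf.getvalue()

# Everything before the array payload is the .npy header.
header_bytes = len(raw) - data.nbytes
starts_with_magic = raw[:6] == b"\x93NUMPY"

# Interpreting the whole file as uint32 also decodes the header bytes.
as_raw = np.frombuffer(raw, dtype=np.uint32)
extra = as_raw.size - data.size

# Skipping the header recovers the original values.
payload = np.frombuffer(raw, dtype=np.uint32, offset=header_bytes)

print(header_bytes, extra, np.array_equal(payload, data))
```

The header length depends on the NumPy version and the array metadata, so it seems safer to compute it (or simply use np.load) than to hard-code an offset such as 20 elements.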
Updated by Tom Morris over 7 years ago
- Status changed from New to Rejected
One possibility that comes to mind is that it's something to do with text vs binary mode (similar to Josh's suggestion about encoding), but, in any case, it doesn't sound like it's an Arvados bug.