Bug #13999: Python SDK downloading a different file from Keep - Arvados

Bug #13999

closed

Python SDK downloading a different file from Keep

Added by Kevin Fang over 7 years ago. Updated over 7 years ago.

Status: Rejected
Priority: High
Assigned To:
Category: SDKs
Target version:
Story points:

Description

To reproduce this issue, I ran mkdir keep && arv-mount keep to mount Keep in my home directory. Then I ran the Python script below:

import numpy as np
import arvados
import arvados.collection
import os

# file details
collection_uuid = "su92l-4zz18-b8rs5x7t6gry16k" 
filename = "all-info.npy" 

# path to the collection on the arv-mount mount point
tiled_data_dir = os.path.join("/home/kfang/keep/by_id/", collection_uuid)

all_info = np.load(os.path.join(tiled_data_dir, filename))

api = arvados.api()

c = arvados.collection.CollectionReader(collection_uuid)

# open a file in Arvados
with c.open(filename, "rb") as reader:
    info = reader.read()

# load a numpy array from binary
info_pysdk = np.fromstring(info, dtype=np.uint32)

# compare the two files

# the shape of all_info is (21310012,)
print(all_info.shape)
# whereas the shape of info_pysdk is (21310032,)
print(info_pysdk.shape)

# this returns false
print(np.array_equal(all_info, info_pysdk))
# whereas this returns true
print(np.array_equal(all_info, info_pysdk[20:]))

The script successfully reads the NumPy array from Keep through both paths, but np.array_equal returns False. The array sizes differ: 21,310,012 elements from the mount versus 21,310,032 from the SDK.

Just for reference, the variable that contains the keep-mounted file is all_info and the variable that stores the Arvados Python SDK downloaded array is info_pysdk.

However, once I cut out the first 20 indices of the Python SDK-loaded file (info_pysdk[20:]), np.array_equal returns True and the arrays are the same.

Updated by Tom Morris over 7 years ago

numpy doesn't seem relevant. What is different in the byte streams before numpy starts playing with them?
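
A minimal sketch of the comparison being asked for here: read the file through both paths as raw bytes and compare them before NumPy interprets anything. The helper `first_difference` and the stand-in byte strings are hypothetical; in the report's setup they would be replaced by `open(path, "rb").read()` on the arv-mount path and by `reader.read()` from the `CollectionReader`.

```python
def first_difference(a: bytes, b: bytes):
    """Return the index of the first differing byte, or None if identical."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    # One stream is a prefix of the other: they differ where the shorter ends.
    if len(a) != len(b):
        return min(len(a), len(b))
    return None


# Stand-in data (assumption): replace with the bytes read via the mount
# and via the SDK respectively.
mount_bytes = b"\x93NUMPY" + b"payload"
sdk_bytes = b"\x93NUMPY" + b"payload"

print(len(mount_bytes), len(sdk_bytes))
print(first_difference(mount_bytes, sdk_bytes))
```

If the lengths match and no difference is found, the two download paths return the same file, and any discrepancy comes from how the bytes are decoded afterwards.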

Updated by Kevin Fang over 7 years ago

Hmm. Upon further inspection, this doesn't actually look like a bug in the Arvados Python SDK but rather a strange "feature" in Numpy. It looks like the np.fromstring function is adding something to the beginning of the file and an offset needs to be added to the Python SDK-downloaded binary. If the binary from the Python SDK is saved to file and then opened using np.load, the files are the exact same.
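
One explanation consistent with these observations, though nobody in the thread confirms it: a `.npy` file begins with a header (magic string, version, and a description of dtype and shape) ahead of the array data. `np.load` parses that header, while `np.fromstring` and `np.frombuffer` interpret every byte, header included, as data. An 80-byte header would show up as exactly 20 extra `uint32` values. The sketch below reproduces the effect with a small in-memory array; the array contents are made up for illustration.

```python
import io

import numpy as np

# Illustrative array (assumption): any uint32 array shows the same effect.
arr = np.arange(10, dtype=np.uint32)

# Serialize in .npy format to an in-memory buffer instead of a file.
buf = io.BytesIO()
np.save(buf, arr)
raw = buf.getvalue()

# Everything before the array data is the .npy header.
header_bytes = len(raw) - arr.nbytes

# Interpreting the whole byte string as uint32 also decodes the header.
as_raw = np.frombuffer(raw, dtype=np.uint32)
extra = as_raw.size - arr.size

print(raw[:6])              # the .npy magic string
print(header_bytes, extra)  # header size in bytes and in uint32 elements
print(np.array_equal(as_raw[extra:], arr))
```

The exact header size depends on the NumPy version and on the array's shape and dtype, so the offset should not be hard-coded.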

Updated by Kevin Fang over 7 years ago

I.e. doing something like:

[...]
with c.open(filename, "rb") as reader:
    info = reader.read()
with open('arr.npy', 'wb') as f:
    f.write(info)
np.load('arr.npy')
[...]

will result in the correct array being loaded.
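
The temporary file is probably not required: `np.load` also accepts file-like objects, so the downloaded bytes can be wrapped in `io.BytesIO`. This is a sketch under that assumption; `load_npy_bytes` is a hypothetical helper, and `raw` stands for the bytes returned by `reader.read()`.

```python
import io

import numpy as np


def load_npy_bytes(raw: bytes) -> np.ndarray:
    """Parse an in-memory .npy file, header included, and return the array."""
    return np.load(io.BytesIO(raw))


# Stand-in for the SDK download (assumption): build .npy bytes locally.
buf = io.BytesIO()
np.save(buf, np.array([1, 2, 3], dtype=np.uint32))
info = buf.getvalue()

arr = load_npy_bytes(info)
print(arr.dtype, arr.shape)
```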

Updated by Joshua Randall over 7 years ago

I'm not sure if this is relevant, but in the original example you are relying on deprecated functionality of numpy's fromstring function. The documentation says that when `sep` is not specified: "fromstring falls back on the behaviour of frombuffer after encoding unicode string inputs as either utf-8 (python 3), or the default encoding (python 2)." (Deprecated since version 1.14; https://docs.scipy.org/doc/numpy/reference/generated/numpy.fromstring.html)

I suspect the problem might be to do with the encoding handling? If you change to using `frombuffer` directly on the bytestring returned from `read()` instead of going through the deprecated `fromstring` behaviour, that may solve the problem?

Updated by Kevin Fang over 7 years ago

Do you mean something like this? info = np.frombuffer(reader.read(), dtype=np.uint32). This unfortunately produces the same behavior, with the 20 extra indices at the front.
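
If `np.frombuffer` is preferred over `np.load`, the header has to be skipped explicitly. The sketch below uses the reader functions in `numpy.lib.format` to parse the header and find where the data starts, assuming the bytes are a version 1.0 or 2.0 `.npy` file holding a non-object dtype. `frombuffer_npy` is a hypothetical helper name, not part of NumPy or the Arvados SDK.

```python
import io
import math

import numpy as np
from numpy.lib import format as npy_format


def frombuffer_npy(raw: bytes) -> np.ndarray:
    """View the data section of in-memory .npy bytes as an array."""
    fp = io.BytesIO(raw)
    version = npy_format.read_magic(fp)
    if version == (1, 0):
        shape, fortran_order, dtype = npy_format.read_array_header_1_0(fp)
    elif version == (2, 0):
        shape, fortran_order, dtype = npy_format.read_array_header_2_0(fp)
    else:
        raise ValueError(f"unsupported .npy version: {version}")

    # The stream position after the header is the offset of the array data.
    offset = fp.tell()
    count = math.prod(shape)
    flat = np.frombuffer(raw, dtype=dtype, count=count, offset=offset)
    return flat.reshape(shape, order="F" if fortran_order else "C")


# Stand-in for the SDK download (assumption): build .npy bytes locally.
expected = np.arange(12, dtype=np.uint32).reshape(3, 4)
buf = io.BytesIO()
np.save(buf, expected)

result = frombuffer_npy(buf.getvalue())
print(np.array_equal(result, expected))
```

Because `np.frombuffer` returns a view onto the immutable `bytes` object, the result is read-only; call `.copy()` on it if it needs to be modified.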

Updated by Tom Morris over 7 years ago

Status changed from New to Rejected

One possibility that comes to mind is that it's something to do with text vs binary mode (similar to Josh's suggestion about encoding), but, in any case, it doesn't sound like it's an Arvados bug.
