Project

General

Profile

Actions

Index of files in collections

Currently the manifest_text column contains information about the individual files in collections. However, utility is limited because the data is not structured in a way that PostgreSQL understands.
  • searching filenames is difficult/impossible because even the "list of filenames" column is too long for PostgreSQL to index properly.
  • searching collections with a given block locator (or locator pattern, which is useful for partitioning keep-balance work) is inefficient.

These problems (and some other opportunities) could be addressed by keeping a separate table of files.

pdh dir filename bytesize filehash†
abcd1234+123 foo/bar baz.txt 1234 dcba4321
abcd1234+123 foo/bar waz.txt 1235 efab8912

† In general filehash cannot be computed just from the manifest. This column would presumably allow null ("not known") and might not exist at all.

New rows would be added to the files table whenever a collection is saved with a PDH that isn't already present.

Old rows would be deleted from the files table whenever the last remaining collection with a given PDH is removed.

Once this table is populated, searching collection filenames would be implemented by searching the files table and joining the collections table on PDH.

Whatever the index/search mechanism is, it should be able to find "Sample_RMF1U7F_S27_R1_001.fastq.gz" by searching for the following strings:

"sample_rmf1u7f_s27_r1_001.fastq.gz" (or "sample*")
"rmf1u7f_s27_r1_001.fastq.gz" (or "rmf1u7f*")
"s27_r1_001.fastq.gz" (...)
"r1_001.fastq.gz"
"001.fastq.gz"
"fastq.gz"
"gz"

Updated by Tom Clegg almost 6 years ago · 2 revisions