Index of files in collections¶
Currently the manifest_text column contains information about the individual files in collections. However, utility is limited because the data is not structured in a way that PostgreSQL understands.- searching filenames is difficult/impossible because even the "list of filenames" column is too long for PostgreSQL to index properly.
- searching collections with a given block locator (or locator pattern, which is useful for partitioning keep-balance work) is inefficient.
These problems (and some other opportunities) could be addressed by keeping a separate table of files.
pdh | dir | filename | bytesize | filehash† |
abcd1234+123 | foo/bar | baz.txt | 1234 | dcba4321 |
abcd1234+123 | foo/bar | waz.txt | 1235 | efab8912 |
† In general filehash cannot be computed just from the manifest. This column would presumably allow null ("not known") and might not exist at all.
New rows would be added to the files table whenever a collection is saved with a PDH that isn't already present.
Old rows would be deleted from the files table whenever the last remaining collection with a given PDH is removed.
Once this table is populated, searching collection filenames would be implemented by searching the files table and joining the collections table on PDH.
Whatever the index/search mechanism is, it should be able to find "Sample_RMF1U7F_S27_R1_001.fastq.gz" by searching for the following strings:
"sample_rmf1u7f_s27_r1_001.fastq.gz" (or "sample*")
"rmf1u7f_s27_r1_001.fastq.gz" (or "rmf1u7f*")
"s27_r1_001.fastq.gz" (...)
"r1_001.fastq.gz"
"001.fastq.gz"
"fastq.gz"
"gz"
Updated by Tom Clegg almost 6 years ago · 2 revisions