Index of files in collections » History » Version 2
Tom Clegg, 02/20/2019 08:11 PM
h1. Index of files in collections

Currently the manifest_text column contains information about the individual files in collections. However, utility is limited because the data is not structured in a way that PostgreSQL understands.
* searching filenames is difficult/impossible because even the "list of filenames" column is too long for PostgreSQL to index properly.
* searching collections with a given block locator (or locator pattern, which is useful for partitioning keep-balance work) is inefficient.

These problems (and some other opportunities) could be addressed by keeping a separate table of files.

|pdh|dir|filename|bytesize|filehash†|
|abcd1234+123|foo/bar|baz.txt|1234|dcba4321|
|abcd1234+123|foo/bar|waz.txt|1235|efab8912|

† In general filehash cannot be computed just from the manifest. This column would presumably allow null ("not known") and might not exist at all.

New rows would be added to the files table whenever a collection is saved with a PDH that isn't already present.

Old rows would be deleted from the files table whenever the last remaining collection with a given PDH is removed.

Once this table is populated, searching collection filenames would be implemented by searching the files table and joining the collections table on PDH.

Whatever the index/search mechanism is, it should be able to find "Sample_RMF1U7F_S27_R1_001.fastq.gz" by searching for the following strings:

"sample_rmf1u7f_s27_r1_001.fastq.gz" (or "sample*")
"rmf1u7f_s27_r1_001.fastq.gz" (or "rmf1u7f*")
"s27_r1_001.fastq.gz" (...)
"r1_001.fastq.gz"
"001.fastq.gz"
"fastq.gz"
"gz"
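
The proposal above could be prototyped roughly as follows. This is only a sketch under several assumptions, not a committed design: it uses Python's built-in sqlite3 module in place of PostgreSQL so that it runs standalone; the file_suffixes table, the index name, and the function names are all hypothetical; and "token" is assumed to mean a maximal run of letters and digits. It shows one possible mechanism for the search requirement: store every lowercased suffix of a filename that starts at a token boundary, then answer a query with an exact or prefix match on that suffix and a join to collections on PDH.

```python
import re
import sqlite3

# Assumed schema. Only pdh/dir/filename/bytesize/filehash come from the
# proposal; file_suffixes and all other names are hypothetical.
SCHEMA = """
CREATE TABLE collections (uuid TEXT PRIMARY KEY, pdh TEXT NOT NULL);
CREATE TABLE files (
    pdh TEXT NOT NULL, dir TEXT NOT NULL, filename TEXT NOT NULL,
    bytesize INTEGER NOT NULL,
    filehash TEXT  -- NULL means "not known"
);
CREATE TABLE file_suffixes (
    pdh TEXT NOT NULL, dir TEXT NOT NULL, filename TEXT NOT NULL,
    suffix TEXT NOT NULL
);
CREATE INDEX file_suffixes_suffix ON file_suffixes (suffix);
"""


def open_db():
    db = sqlite3.connect(":memory:")
    db.executescript(SCHEMA)
    return db


def token_suffixes(filename):
    """Lowercased suffixes of filename that start at a token boundary."""
    name = filename.lower()
    # [^\W_]+ matches a run of letters/digits (underscore is a separator).
    return [name[m.start():] for m in re.finditer(r"[^\W_]+", name)]


def save_collection(db, uuid, pdh, files):
    """Save a collection; index its files only if the PDH is not yet present.

    files is an iterable of (dir, filename, bytesize) tuples.
    """
    with db:
        known = db.execute(
            "SELECT 1 FROM files WHERE pdh = ? LIMIT 1", (pdh,)).fetchone()
        db.execute("INSERT INTO collections (uuid, pdh) VALUES (?, ?)",
                   (uuid, pdh))
        if known is not None:
            return
        for d, name, size in files:
            db.execute(
                "INSERT INTO files (pdh, dir, filename, bytesize) "
                "VALUES (?, ?, ?, ?)", (pdh, d, name, size))
            db.executemany(
                "INSERT INTO file_suffixes VALUES (?, ?, ?, ?)",
                [(pdh, d, name, s) for s in token_suffixes(name)])


def delete_collection(db, uuid):
    """Remove a collection; drop file rows when no collection shares its PDH."""
    with db:
        row = db.execute(
            "SELECT pdh FROM collections WHERE uuid = ?", (uuid,)).fetchone()
        if row is None:
            return
        pdh = row[0]
        db.execute("DELETE FROM collections WHERE uuid = ?", (uuid,))
        remaining = db.execute(
            "SELECT 1 FROM collections WHERE pdh = ? LIMIT 1",
            (pdh,)).fetchone()
        if remaining is None:
            db.execute("DELETE FROM files WHERE pdh = ?", (pdh,))
            db.execute("DELETE FROM file_suffixes WHERE pdh = ?", (pdh,))


def search(db, query):
    """Return (uuid, dir, filename) rows matching query; a trailing * means prefix."""
    q = query.lower()
    if q.endswith("*"):
        prefix = q[:-1]
        # substr() rather than LIKE, so "_" in filenames is not a wildcard.
        cond, args = "substr(s.suffix, 1, ?) = ?", (len(prefix), prefix)
    else:
        cond, args = "s.suffix = ?", (q,)
    sql = ("SELECT DISTINCT c.uuid, s.dir, s.filename "
           "FROM file_suffixes s JOIN collections c ON c.pdh = s.pdh "
           "WHERE " + cond + " ORDER BY c.uuid, s.dir, s.filename")
    return db.execute(sql, args).fetchall()
```

With this layout, each of the search strings listed above is either equal to one stored suffix or a prefix of one, so the lookup never needs a substring scan of the filename. The trade-off is one extra row per token in every filename. In PostgreSQL the prefix condition would more likely be written so that an index on suffix can serve it; the substr() form here is chosen only to keep the sketch short and portable.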