Story #3491
[Keep] Support transparent compression of blocks in Keep
Status: open
% Done: 0%
Description
Support automatic compression of blocks in the Keep server. Proposed design:
- Keep server can accept PUT blocks compressed with gzip, or use gzip to compress uncompressed blocks before saving to disk. Compare compressed/uncompressed sizes to ensure that compression isn't adding unnecessary overhead.
- On GET, Keep clients provide the "Accept-Encoding: gzip" header, and the server responds with "Content-Encoding: gzip" and spools the compressed data directly off disk (see the sketch after this list).
- Keep client decompresses the data before delivering it to the application.
Benefits:
- Support random access into large files without needing special file formats or explicit application support.
- Reduce disk and network usage across the board.
- Transparent to the user.
Drawbacks:
- Adds a "decompress and then re-compress on Keep block boundaries" step when working with a collection that's already compressed at the file level.
- May increase latency and client overhead, because each block needs to be decompressed before it can be used.
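A minimal sketch, in Go, of the PUT-side compress-or-keep-raw decision and the GET-side Accept-Encoding negotiation proposed above. All names here (keepsketch, maybeCompress, storedBlock, serveBlock) are placeholders for illustration, not Keep's actual internals.

```go
package keepsketch

import (
	"bytes"
	"compress/gzip"
	"io"
	"net/http"
	"os"
	"strings"
)

// storedBlock is a hypothetical on-disk record: the bytes actually written,
// plus a flag recording whether they are gzip-compressed.
type storedBlock struct {
	data       []byte
	compressed bool
}

// maybeCompress gzips an uncompressed block and keeps whichever
// representation is smaller, so blocks that are already compressed at the
// file level (BAM, compressed images, ...) are not inflated by the attempt.
func maybeCompress(raw []byte) storedBlock {
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	if _, err := zw.Write(raw); err != nil {
		return storedBlock{data: raw}
	}
	if err := zw.Close(); err != nil {
		return storedBlock{data: raw}
	}
	if buf.Len() >= len(raw) {
		// Compression added overhead; store the original bytes.
		return storedBlock{data: raw}
	}
	return storedBlock{data: buf.Bytes(), compressed: true}
}

// serveBlock spools a stored block back to the client. If the block is
// stored gzip-compressed and the client sent "Accept-Encoding: gzip", the
// compressed bytes go straight off disk with "Content-Encoding: gzip" set;
// otherwise the server decompresses before sending.
func serveBlock(w http.ResponseWriter, r *http.Request, path string, compressed bool) {
	f, err := os.Open(path)
	if err != nil {
		http.Error(w, "block not found", http.StatusNotFound)
		return
	}
	defer f.Close()

	acceptsGzip := strings.Contains(r.Header.Get("Accept-Encoding"), "gzip")

	switch {
	case compressed && acceptsGzip:
		w.Header().Set("Content-Encoding", "gzip")
		io.Copy(w, f)
	case compressed:
		zr, err := gzip.NewReader(f)
		if err != nil {
			http.Error(w, "corrupt block", http.StatusInternalServerError)
			return
		}
		defer zr.Close()
		io.Copy(w, zr)
	default:
		io.Copy(w, f)
	}
}
```

Keeping whichever representation is smaller is what prevents already-compressed blocks from paying a size penalty, at the cost of one extra compression pass on PUT.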
Updated by Peter Amstutz over 10 years ago
- Subject changed from Support transparent compression of blocks in Keep to [Keep] Support transparent compression of blocks in Keep
- Description updated
- Category set to Keep
Updated by Stanislaw Adaszewski almost 5 years ago
I would let the user decide whether blocks should be compressed or stored raw, but this is definitely a great feature with potential for a lot of space savings; personally I would like to have it. I would implement it slightly differently, though: use the checksum of the compressed data as the block address, so there is no need to decompress to verify the checksum and re-compress to send. Then the only remaining piece would be for the FUSE driver to decompress blocks marked as gzip-compressed on the fly. If algorithms other than gzip were an option, there are compression schemes designed to be much faster to decompress, e.g. WKdm, which is used for memory compression on Mac OS. That would be less convenient insofar as HTTP doesn't support it as a content encoding, but it is much faster.
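For illustration, a short Go sketch of that addressing idea: hash the compressed bytes themselves, so the server can verify a block against its locator without decompressing it. Keep locators are md5-based; the "+gzip" hint appended here, and the function name, are purely hypothetical.

```go
package keepsketch

import (
	"bytes"
	"compress/gzip"
	"crypto/md5"
	"fmt"
)

// compressedLocator gzips a block and derives the locator from the compressed
// representation, e.g. "d41d8cd9...+1024+gzip", where the hypothetical "+gzip"
// hint would tell the FUSE driver (or other clients) to decompress on the fly.
func compressedLocator(raw []byte) (string, []byte, error) {
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	if _, err := zw.Write(raw); err != nil {
		return "", nil, err
	}
	if err := zw.Close(); err != nil {
		return "", nil, err
	}
	stored := buf.Bytes()
	locator := fmt.Sprintf("%x+%d+gzip", md5.Sum(stored), len(stored))
	return locator, stored, nil
}
```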
Updated by Peter Amstutz almost 5 years ago
The main reason this hasn't been a priority is that many file formats already have domain specific compression such as BAM and various compressed image formats. Trying to compress already-compressed files is counterproductive, since at best it is a waste of time and at worst the result is larger than if you had left it alone. It also turns out that at gigabit+ transfer speeds, involving the CPU to do compression/decompression can be a huge bottleneck compared to just sending the data uncompressed (for typical compression ratios).
Updated by Stanislaw Adaszewski almost 5 years ago
Thank you for your reply; this makes sense. However, I recently unpacked UniRef30, for example, and it jumped from 42 GB compressed to 162 GB uncompressed. It would be neat to have compression as a user-controlled option. Some brainstorming on this could be worthwhile, as I encounter this kind of ratio pretty often.