Block-Level Deduping in ZFS

“File-level deduplication has the lowest processing overhead but is the least efficient method. Block-level dedupe requires more processing power, and is said to be good for virtual machine images. Byte-range dedupe uses the most processing power and is ideal for small pieces of data that may be replicated and are not block-aligned, such as e-mail attachments. […] ZFS provides block-level deduplication, using SHA256 hashing, and it maps naturally to ZFS’s 256-bit block checksums. The deduplication is done inline […]” (Chris Mellor) [ZFS gets inline dedupe]
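To make the idea concrete, here is a minimal sketch of block-level dedupe keyed on SHA-256 digests. It is a toy in-memory model, not how ZFS implements it: the `DedupStore` class, the 4096-byte `BLOCK_SIZE`, and the two fake "images" are all assumptions made up for illustration.

```python
import hashlib

BLOCK_SIZE = 4096  # hypothetical fixed block size for this sketch


class DedupStore:
    """Toy content-addressed block store (illustrative only, not ZFS)."""

    def __init__(self):
        self.blocks = {}    # SHA-256 digest -> block bytes, stored once
        self.refcount = {}  # SHA-256 digest -> number of logical references

    def write(self, data):
        """Split data into fixed-size blocks; return one digest per block."""
        digests = []
        for offset in range(0, len(data), BLOCK_SIZE):
            block = data[offset:offset + BLOCK_SIZE]
            digest = hashlib.sha256(block).digest()
            if digest not in self.blocks:
                # First time this content is seen: actually store it.
                self.blocks[digest] = block
            self.refcount[digest] = self.refcount.get(digest, 0) + 1
            digests.append(digest)
        return digests

    def read(self, digests):
        """Reassemble the original data from its list of digests."""
        return b"".join(self.blocks[d] for d in digests)


store = DedupStore()
# Two made-up "VM images" that share their first two blocks.
image_a = b"A" * BLOCK_SIZE + b"B" * BLOCK_SIZE + b"C" * BLOCK_SIZE
image_b = b"A" * BLOCK_SIZE + b"B" * BLOCK_SIZE + b"D" * BLOCK_SIZE
refs_a = store.write(image_a)
refs_b = store.write(image_b)
print(len(refs_a) + len(refs_b), "logical blocks,", len(store.blocks), "stored")
```

Six logical blocks end up as four stored ones, because the shared blocks hash to the same digest and are kept only once. The dedupe happens inside `write`, which is the "inline" part.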

“Good for virtual machine images.” I’d like to nominate this remark for the understatement-of-the-year contest. Note that it would also be great for vectorized OLAP. Also note that dictionary compressors like LZW already do a kind of “byte-range dedupe” on a data stream. … Implementing byte-range dedupe on a random-access system won’t be easy. … Besides, performance might suffer if the contents of every block are expressed in terms of other blocks. … A dynamic dictionary is probably worth a try. … Somebody stop me!
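Why bother with byte-range dedupe at all? A rough sketch of the alignment problem, under made-up assumptions (a 4096-byte block size, random test data, a hypothetical `block_digests` helper): prepend a single byte to a buffer and fixed-size block hashing no longer finds anything in common, even though almost every byte is duplicated.

```python
import hashlib
import random

BLOCK_SIZE = 4096  # hypothetical fixed block size for this sketch


def block_digests(data):
    """SHA-256 digest of each fixed-size block of data."""
    return [
        hashlib.sha256(data[i:i + BLOCK_SIZE]).digest()
        for i in range(0, len(data), BLOCK_SIZE)
    ]


rng = random.Random(0)  # fixed seed so the run is repeatable
original = bytes(rng.getrandbits(8) for _ in range(8 * BLOCK_SIZE))
shifted = b"\x00" + original  # same content, pushed off alignment by one byte

aligned_matches = set(block_digests(original)) & set(block_digests(original))
shifted_matches = set(block_digests(original)) & set(block_digests(shifted))

print("blocks shared with an aligned copy:", len(aligned_matches))
print("blocks shared with a shifted copy:", len(shifted_matches))
```

The aligned copy shares every block; the shifted copy shares none, because each block boundary now cuts the data at a different place. That is the case byte-range dedupe is meant to catch.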