[plug] ZFS and deduplication?
Andrew Furey
andrew.furey at gmail.com
Mon Dec 23 07:59:22 UTC 2013
Hi all,
I'm testing different deduplicating filesystems on Wheezy for storing
database backups (somewhat-compressed database dumps, averaging about 25GB
times 12 clients, ideally 30 days' worth, so 9 terabytes raw). To test, I
have a set of 4 days' worth from the same server, about 21GB each day.
I first played with opendedup (aka SDFS), which is Java-based and so loads
up the system a bit when reading and writing (not nearly as bad on physical
hardware as on a VM, though). With that, the first file takes the full 21GB
or close to it, while the subsequent ones are a bit smaller - one of them is
down to 5.4GB, as reported by a simple du.
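In case it matters, that du figure is the allocated size; GNU du can report
the logical size as well for comparison (the path here is just illustrative):

du -h /mnt/sdfs/day2.dump                    # allocated (post-dedup) size
du -h --apparent-size /mnt/sdfs/day2.dump    # logical size as written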
Next I'm trying ZFS, as something a bit more native would be preferable. I
have a 1.06TB raw LVM logical volume, so I run:
zpool create -O dedup=on backup /dev/VolGroup00/LogVol01
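(For what it's worth, the -O there sets the property on the pool's root
dataset; it can be double-checked with

zfs get dedup backup

which should report the value as "on" with source "local".)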
zpool list gives:
NAME      SIZE  ALLOC   FREE  CAP  DEDUP  HEALTH  ALTROOT
backup   1.05T   183K  1.05T   0%  1.00x  ONLINE  -
I then create a filesystem (dataset) under it (I tried without it first; it
made no difference to what follows):
zfs create -o dedup=on backup/admin
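As I understand it, ZFS dedups per block, so the dataset's recordsize (128K
by default) sets the dedup granularity; it can be checked with:

zfs get recordsize backup/admin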
Now zfs list gives:
NAME           USED  AVAIL  REFER  MOUNTPOINT
backup         104K  1.04T    21K  /backup
backup/admin    21K  1.04T    21K  /backup/admin
Looks OK so far.
Trouble is, when I copy my 80GB-odd set to it with plain rsync (same as
before), I only get a dedup ratio of 1.01x (i.e. nothing at all):
NAME      SIZE  ALLOC   FREE  CAP  DEDUP  HEALTH  ALTROOT
backup   1.05T  78.5G  1001G   7%  1.01x  ONLINE  -
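Out of interest, the zdb man page says -S will simulate dedup across the
pool's existing data and print a DDT histogram with an estimated ratio,
without dedup needing to be enabled (it reads everything, so it's slow and
memory-hungry):

zdb -S backup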
I also found "zdb backup | grep plain", which indicates that no deduping is
being done on any files on the disk, including the schema files that are
also in the set (column 7 should be something less than 100, as I
understand it):
   107    2    16K   128K   2.75M   2.75M  100.00  ZFS plain file
   108    2    16K   128K   2.13M   2.12M  100.00  ZFS plain file
   109    1    16K     8K      8K      8K  100.00  ZFS plain file
   110    1    16K   9.5K    9.5K    9.5K  100.00  ZFS plain file
   111    1    16K   9.5K    9.5K    9.5K  100.00  ZFS plain file
   112    1    16K  12.0K   12.0K   12.0K  100.00  ZFS plain file
   113    1    16K   9.5K    9.5K    9.5K  100.00  ZFS plain file
   114    4    16K   128K   19.9G   19.9G  100.00  ZFS plain file
   115    1    16K    512     512     512  100.00  ZFS plain file
   116    1    16K     8K      8K      8K  100.00  ZFS plain file
   117    1    16K   9.5K    9.5K    9.5K  100.00  ZFS plain file
   118    1    16K   9.5K    9.5K    9.5K  100.00  ZFS plain file
   119    1    16K  14.5K   14.5K   14.5K  100.00  ZFS plain file
   120    1    16K  14.5K   14.5K   14.5K  100.00  ZFS plain file
   121    1    16K  3.50K   3.50K   3.50K  100.00  ZFS plain file
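(The dedup table itself can apparently be dumped as well; if I'm reading the
man page right,

zdb -DD backup

should print the DDT statistics and a histogram.)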
95% of those schema files are in fact identical, so filesystem hard links
would dedupe them perfectly...
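As a crude point of comparison (file names made up), hard-linking the
byte-identical copies against a reference would be something like:

ref=/backup/admin/day1/schema.sql          # reference copy
for f in /backup/admin/day*/schema.sql; do
    [ "$f" = "$ref" ] && continue          # skip the reference itself
    # replace byte-identical copies with a hard link
    cmp -s "$ref" "$f" && ln -f "$ref" "$f"
done

That only helps with whole identical files, though, not the big dumps that
differ day to day.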
I must be missing something, surely? Or should I just go ahead with
opendedup and be done with it? Are there any others I should know about
(btrfs didn't sound terribly stable, from what I've been reading)?
TIA and Merry Christmas,
Andrew
--
Linux supports the notion of a command line or a shell for the same
reason that only children read books with only pictures in them.
Language, be it English or something else, is the only tool flexible
enough to accomplish a sufficiently broad range of tasks.
-- Bill Garrett