[plug] Filesystems for lots of inodes

Byron Hammond byronester at gmail.com
Sun Feb 2 15:02:49 AWST 2020


Thanks for that Brad

I did some testing a number of years ago when I got my RAID controller,
benchmarking xfs and ext4 as well.

From memory, xfs came out on top in those cases too.
It also seemed to have the benefit of a dedicated filesystem backup
tool in one of the xfs packages; I think it's called xfsdump now. It
might be a good idea to look at using that.
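
Roughly how it gets used, going from memory (so treat the exact
options as an assumption and check the man pages; the paths here are
just for illustration):

  # level-0 (full) dump of an xfs mount point to a file
  xfsdump -l 0 -L "weekly" -M "media0" -f /backup/server.xfsdump /srv

  # restore that dump into an empty xfs filesystem
  xfsrestore -f /backup/server.xfsdump /mnt/restore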


from my Tablet

On Sun, 2 Feb. 2020, 2:42 pm Brad Campbell, <brad at fnarfbargle.com> wrote:

> On 15/1/20 10:35 pm, Brad Campbell wrote:
> > On 9/1/20 2:12 pm, Brad Campbell wrote:
> >> On 8/1/20 13:39, Byron Hammond wrote:
> >>> I'm keeping my eye on this thread with great interest.
> >>>
> >>> I'm really curious to see what your findings are and how you got there.
> >>>
> >>
> >> It will be interesting. I'll say in response to "how you got there",
> >> the current answer is _slowly_.
> >
> > Untarring the backup file onto a clean ext4 filesystem on a 5 drive
> > RAID5 took 176 hours for the bulk restore, and then tar seems to do
> > another pass removing symlinks, creating a new symlink and then
> > hardlinking that. That took an additional 7 hours.
> >
> > So ~183 hours to restore the tar file onto a clean ext4 filesystem.
> >
> > At least I have a reproducible test case. That averaged 5.3MB/s.
> >
> > Correcting my mistake, this filesystem has 50.2 million inodes and 448
> > million files/directories.
> >
> > root@test:/server# tar -tvzf backups.tgz | wc -l
> > 448763241
> >
> > Just the tar | wc -l took the best part of a day. This might take a
> while.
> >
> >
>
> So, from here on this e-mail has just been built up in sequence as
> things were tried. Lots of big gaps (like 183 hours to build the ext4
> filesystem), so it's all kinda "stream of consciousness". I'll put a
> summary at the end.
>
> I needed some way of speeding this process up, and write caching the
> drives seemed like the sanest way to do it. I ran up a new qemu (KVM)
> Devuan instance, passed it the raw block device and set the caching
> method to "unsafe". That basically ignores all data safety requests
> (sync/fsync/flush) and lets the host page cache act as one huge write
> cache.
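>
> Something along these lines (paraphrasing from memory; the device
> name, memory size and image name are made up for illustration):
>
>   qemu-system-x86_64 -enable-kvm -m 8192 \
>     -drive file=devuan.qcow2,format=qcow2,if=virtio \
>     -drive file=/dev/sdb,format=raw,if=virtio,cache=unsafe
>
> With cache=unsafe the guest's flush/sync requests are ignored, so the
> host page cache soaks up the writes. Fine for a throwaway test
> filesystem, not for anything you care about.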
>
> So, the filesystem had already been created and populated (183 hours
> worth). This is a simple find on the filesystem from inside the VM.
>
> ext4:
> root@devfstest:/mnt/source# time find . | wc -l
> 448763242
>
> real    1300m56.182s
> user    3m18.904s
> sys     12m56.012s
>
> I've created a new xfs filesystem and restored the same single big
> tarball onto it :
> real    10130m14.072s
> user    9631m11.388s
> sys     325m38.168s
>
> So 168 hours for xfs.
>
> I've noticed an inordinate amount of time being spent inside tar, so I
> took the time to create the archive again, this time with a separate
> tarball for each backed up directory.
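>
> Roughly along these lines (a sketch; the source directory is made up
> for illustration and the real invocation may have differed):
>
>   # one compressed tarball per top-level backed-up directory
>   cd /backed-up-dirs
>   for d in */ ; do
>       tar -c "$d" | pigz > /server/fred/"${d%/}".tgz
>   done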
>
> So, let's repeat that restore with xfs, this time from the separate
> tarballs : a bit more convoluted, but let's see what happens. Surely
> it can't be slower!
>
> root@devfstest:/mnt# time for i in `ssh test ls /server/fred/*.tgz` ; do
> echo $i ; ssh test cat $i | pigz -d | tar -x ; done
> real    730m25.915s
> user    496m28.968s
> sys     209m7.068s
>
> 12.1 hours using separate tarballs vs 168 hours for the one big tarball.
>
> So, in this instance tar was/is the bottleneck! All future tests will be
> done using the multiple tarball archive.
>
> Right, so now create a new ext4 filesystem on there and repeat the test
>
> real    1312m53.829s
> user    481m3.272s
> sys     194m49.744s
>
> Summary :
>
> xfs  : 12.1 hours
> ext4 : 21.8 hours
>
> Filesystem population test win : XFS by a long margin.
>
> Now, I wasn't clever enough to do a find test on xfs before doing the
> ext4 creation test, so let's run the find on ext4, then re-create the
> xfs and do it again.
>
> It should be interesting to see how this compares to the initial find
> test on the fs created on the bare metal and then the block device
> passed through to the VM (first result in this mail, some 1300
> minutes). Not entirely a fair test as the filesystems differ in
> content. The "one big tarball" was taken about 10 days before the
> "multiple smaller tarballs", but still ~45-50 million inodes.
>
> Lesson learned: make sure the filesystem is mounted noatime before the
> test. It took several restarts before I figured out what was writing
> to the disk.
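>
> i.e. something along the lines of (device name here is just for
> illustration):
>
>   mount -o noatime /dev/vdb /mnt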
>
> Find test on ext4 :
> cd /mnt ; time find . | wc -l
>
> ext4 :
> real    1983m45.609s
> user    3m32.184s
> sys     14m2.420s
>
> Not so pretty: 50% longer than last time. Still, different filesystem
> contents, so not directly comparable. Right, let's build up a new xfs
> filesystem and repeat the test :
>
> root@devfstest:/mnt# time for i in `ssh test ls /server/fred/*.tgz` ; do
> echo $i ; ssh test cat $i | pigz -d | tar -x ; done
> real    711m17.118s
> user    498m12.424s
> sys     210m50.748s
>
> So the create was 730 mins last time and 711 mins this time, ~3%
> variance. Close enough.
>
> root@devfstest:/mnt# time find . | wc -l
> 497716103
>
> real    43m13.998s
> user    2m49.624s
> sys     6m33.148s
>
> xfs ftw! 43 mins vs ~1984 mins for ext4.
>
> So, summary.
> xfs create : 12.1 hours
> ext4 create : 21.8 hours
>
> xfs find : 43 min
> ext4 find : 33.1 hours
>
> Let's do a tar test and see how long it takes to read the entire
> filesystem. This would be a good indicator of the time to replicate.
> Again, because I wasn't clever enough to have thought this stuff up
> beforehand, I'll have to do it on xfs, then recreate the ext4 and run
> it again.
>
> root@devfstest:/mnt# time for i in * ; do echo $i ; tar -cp $i >
> /dev/null ; done
> real    108m59.595s
> user    20m14.032s
> sys     50m48.216s
>
> Seriously?!? 108 minutes for 3.5TB of data. I've done something wrong
> obviously. Let's retest that with pipebench to make sure it's actually
> archiving data :
>
> root@devfstest:/mnt# time for i in * ; do echo $i ; tar -cp $i |
> pipebench -b 32768 > /dev/null ; done
> real    308m44.940s
> user    31m58.108s
> sys     98m8.844s
>
> Better. Just over 5 hours.
>
> Let's do a du -hcs *
> root@devfstest:/mnt# time du -hcs *
> real    73m20.487s
> user    2m53.884s
> sys     29m49.184s
>
> xfs tar test : 5.1 hours
> xfs du -hcs test : 73 minutes
>
> Right, now to re-populate the filesystem as ext4 and re-test.
> Hrm. Just realised that all previous ext4 creation tests were at the
> mercy of lazy_init, so I created the new one with no lazy init on the
> inode tables or journal.
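>
> That is, roughly (device name again just for illustration):
>
>   mkfs.ext4 -E lazy_itable_init=0,lazy_journal_init=0 /dev/vdb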
>
> real    1361m53.562s
> user    499m20.168s
> sys     212m6.524s
>
> So ext4 create : 22.6 hours. Still about right.
>
> Time for the tar create test :
> root@devfstest:/mnt# time for i in * ; do echo $i ; sleep 5 ; tar -cp $i
> | pipebench -b 32768 > /dev/null ; done
> real    2248m18.299s
> user    35m6.968s
> sys     98m57.936s
>
> Right. That wasn't really a surprise, but the magnitude of the
> difference was.
> xfs : 5.1 hours
> ext4 : 37.4 hours
>
> Now the du -hcs * test :
> real    1714m21.503s
> user    3m40.596s
> sys     37m24.928s
>
> xfs : 73 minutes
> ext4 : 28.5 hours
>
>
> Summary
>
> Populate fresh & empty fs from tar files :
> xfs  : 12.1 hours
> ext4 : 21.8 hours
>
> Find :
> xfs  : 43 min
> ext4 : 33.1 hours
>
> du -hcs * :
> xfs  : 73 minutes
> ext4 : 28.5 hours
>
> tar create :
> xfs  : 5.1 hours
> ext4 : 37.4 hours
>
> I think there's a pattern there.
>
> So, using one VM config and hardware set. One set of source tar files.
>
> Tests were performed sequentially, so there were likely workload
> variations on the host server, but nothing significant and certainly not
> enough to make more than a couple of percent difference either way.
>
> So I still need to go back and figure out what happened with the first
> xfs tar test and how it possibly exceeded the available throughput for
> the disks. Everything else was pretty sane.
>
> It would appear xfs destroys ext4 for this perverse use case.
>
> I suppose my next step is migrating the system across to xfs: either
> take the time to copy the whole thing across, probably forgoing a
> couple of nights' backups, or just start a new drive from scratch and
> put the current ext4 drive in the safe for a couple of months.
>
> Regards,
> Brad
> --
> An expert is a person who has found out by his own painful
> experience all the mistakes that one can make in a very
> narrow field. - Niels Bohr
> _______________________________________________
> PLUG discussion list: plug at plug.org.au
> http://lists.plug.org.au/mailman/listinfo/plug
> Committee e-mail: committee at plug.org.au
> PLUG Membership: http://www.plug.org.au/membership
>

