[plug] Filesystems for lots of inodes

Ian Kent <raven@themaw.net>
Mon Feb 3 07:49:30 AWST 2020


On Sun, 2020-02-02 at 14:42 +0800, Brad Campbell wrote:
> On 15/1/20 10:35 pm, Brad Campbell wrote:
> > On 9/1/20 2:12 pm, Brad Campbell wrote:
> > > On 8/1/20 13:39, Byron Hammond wrote:
> > > > I'm keeping my eye on this thread with great interest.
> > > > 
> > > > I'm really curious to see what your findings are and how you
> > > > got there.
> > > > 
> > > 
> > > It will be interesting. I'll say in response to "how you got
> > > there", 
> > > the current answer is _slowly_.
> > 
> > Untarring the backup file onto a clean ext4 filesystem on a 5-drive
> > RAID5 took 176 hours for the bulk restore, and then tar seems to do
> > another pass removing each symlink, creating a new symlink and then
> > hardlinking that. That took an additional 7 hours.
> > 
> > So ~183 hours to restore the tar file onto a clean ext4 filesystem.
> > 
> > At least I have a reproducible test case. That averaged 5.3MB/s.
> > 
> > Correcting my mistake, this filesystem has 50.2 million inodes and
> > 448 
> > million files/directories.
> > 
> > root@test:/server# tar -tvzf backups.tgz | wc -l
> > 448763241
> > 
> > Just the tar | wc -l took the best part of a day. This might take a
> > while.
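> > 
> > (In hindsight, the inode half of that count is instant without any
> > tree walk; on the mounted filesystem, something like:
> > 
> > df -i /mnt
> > 
> > would have reported it. It's only the 448 million names that need
> > the full walk.)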
> > 
> > 
> 
> So from here on, this e-mail has just been built up in sequence as
> things were tried. Lots of big gaps (like 183 hours to build the ext4
> filesystem), so it's all kinda "stream of consciousness". I'll put a
> summary at the end.
> 
> I needed some way of speeding this process up and write caching the
> drives seemed like the sanest way to do it. I ran up a new qemu(kvm)
> devuan instance, passed it the raw block device and set the caching
> method to "unsafe".  That basically ignores all data safety requests
> (sync/fsync/flush) and allows the machine to act as a huge cache.
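> 
> For reference, the passthrough looked something like this (a sketch
> from memory; device name and memory size illustrative):
> 
> # raw RAID5 array handed straight to the guest; cache=unsafe ignores
> # all flush/FUA requests so the host page cache soaks up the writes
> qemu-system-x86_64 -enable-kvm -m 8192 \
>     -drive file=/dev/md0,format=raw,if=virtio,cache=unsafe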
> 
> So, the filesystem had already been created and populated (183 hours
> worth). This is a simple find on the filesystem from inside the VM.
> 
> ext4:
> root@devfstest:/mnt/source# time find . | wc -l
> 448763242
> 
> real	1300m56.182s
> user	3m18.904s
> sys	12m56.012s
> 
> I've created a new xfs filesystem and repeated the restore :
> real    10130m14.072s
> user    9631m11.388s
> sys     325m38.168s
> 
> So 168 hours for xfs.
> 
> I've noticed an inordinate amount of time being spent inside tar, so
> I
> took the time to create the archive again, this time with a separate
> tarball for each backed up directory.
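> 
> The re-archive was along these lines (a sketch; paths illustrative):
> 
> # one tarball per top-level directory instead of one huge archive
> cd /backup
> for d in * ; do tar -c "$d" | pigz > /server/fred/"$d".tgz ; done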
> 
> So, let's repeat that test with xfs :  A bit complexish, but let's
> see
> what happens. Surely can't be slower!
> 
> root@devfstest:/mnt# time for i in `ssh test ls /server/fred/*.tgz` ; do
>     echo $i ; ssh test cat $i | pigz -d | tar -x ; done
> real    730m25.915s
> user    496m28.968s
> sys     209m7.068s
> 
> 12.1 hours using separate tarballs vs one big tarball.
> 
> So, in this instance tar was/is the bottleneck! All future tests will
> be
> done using the multiple tarball archive.
> 
> Right, so now create a new ext4 filesystem on there and repeat the
> test
> 
> real    1312m53.829s
> user    481m3.272s
> sys     194m49.744s
> 
> Summary :
> 
> xfs  : 12.1 hours
> ext4 : 21.8 hours
> 
> Filesystem population test win : XFS by a long margin.
> 
> Now, I wasn't clever enough to do a find test on xfs before doing the
> ext4 creation test, so let's run the find on ext4, then re-create the
> xfs and do it again.
> 
> It will be interesting to see how this compares to the initial find
> test on the fs created on bare metal with the block device then
> passed through to the VM (first result in this mail, some 1300
> minutes). It's not an entirely fair test as the filesystems differ in
> content; the "one big tarball" was about 10 days older than the
> "multiple smaller tarballs", but it's still ~45-50 million inodes.
> 
> Lesson learned: make sure the filesystem is mounted noatime before
> the test. It took several restarts before I figured out what was
> writing to the disk.
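> 
> i.e. something like (device name illustrative):
> 
> mount -o noatime /dev/vdb /mnt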
> 
> Find test on ext4 :
> cd /mnt ; time find . | wc -l
> 
> ext4 :
> real    1983m45.609s
> user    3m32.184s
> sys     14m2.420s
> 
> Not so pretty. So 50% longer than last time. Still, different
> filesystem contents, so not directly comparable. Right, let's build
> up a new xfs filesystem and repeat the test :
> 
> root@devfstest:/mnt# time for i in `ssh test ls /server/fred/*.tgz` ; do
>     echo $i ; ssh test cat $i | pigz -d | tar -x ; done
> real    711m17.118s
> user    498m12.424s
> sys     210m50.748s
> 
> So create was 730 mins last time and 711 mins this time. ~3%
> variance. Close enough.
> 
> root@devfstest:/mnt# time find . | wc -l
> 497716103
> 
> real    43m13.998s
> user    2m49.624s
> sys     6m33.148s
> 
> xfs ftw! 43 mins vs 1983 mins.
> 
> So, summary.
> xfs create : 12.1 hours
> ext4 create : 21.8 hours
> 
> xfs find : 43 min
> ext4 find : 33 hours
> 
> Let's do a tar test and see how long it takes to read the entire
> filesystem. This would be a good indicator of time to replicate.
> Again, because I wasn't clever enough to have this stuff thought up
> beforehand, I'll have to do it on xfs, then recreate the ext4 and run
> it again.
> 
> root@devfstest:/mnt# time for i in * ; do echo $i ; tar -cp $i > /dev/null ; done
> real    108m59.595s
> user    20m14.032s
> sys     50m48.216s
> 
> Seriously?!? 108 minutes for 3.5TB of data. I've obviously done
> something wrong. Let's retest with pipebench to make sure it's
> actually archiving data :
> 
> root@devfstest:/mnt# time for i in * ; do echo $i ; tar -cp $i |
>     pipebench -b 32768 > /dev/null ; done
> real    308m44.940s
> user    31m58.108s
> sys     98m8.844s
> 
> Better. Just over 5 hours.
> 
> Let's do a du -hcs *
> root@devfstest:/mnt# time du -hcs *
> real    73m20.487s
> user    2m53.884s
> sys     29m49.184s
> 
> xfs tar test : 5.1 hours
> xfs du -hcs test : 73 minutes
> 
> Right, now to re-populate the filesystem with ext4 and re-test.
> Hrm. Just realised that all previous ext4 creation tests were at the
> mercy of lazy_init, so create the new one with no lazy init on the
> inode tables or journal.
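> 
> That is, something like (device name illustrative):
> 
> mkfs.ext4 -E lazy_itable_init=0,lazy_journal_init=0 /dev/vdb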
> 
> real    1361m53.562s
> user    499m20.168s
> sys     212m6.524s
> 
> So ext4 create : 22.6 hours. Still about right.
> 
> Time for the tar create test :
> root@devfstest:/mnt# time for i in * ; do echo $i ; sleep 5 ; tar -cp $i |
>     pipebench -b 32768 > /dev/null ; done
> real    2248m18.299s
> user    35m6.968s
> sys     98m57.936s
> 
> Right. That wasn't really a surprise, but the magnitude of the 
> difference was.
> xfs : 5.1 hours
> ext4 : 37.4 hours
> 
> Now the du -hcs * test :
> real    1714m21.503s
> user    3m40.596s
> sys     37m24.928s
> 
> xfs : 74 minutes
> ext4 : 28.5 hours
> 
> 
> Summary
> 
> Populate fresh & empty fs from tar files :
> xfs  : 12.1 hours
> ext4 : 21.8 hours
> 
> Find :
> xfs  : 43 min
> ext4 : 33 hours
> 
> du -hcs * :
> xfs  : 74 minutes
> ext4 : 28.5 hours
> 
> tar create :
> xfs  : 5.1 hours
> ext4 : 37.4 hours
> 
> I think there's a pattern there.
> 
> So, using one VM config and hardware set. One set of source tar
> files.
> 
> Tests were performed sequentially, so there were likely workload 
> variations on the host server, but nothing significant and certainly
> not 
> enough to make more than a couple of percent difference either way.
> 
> So I still need to go back and figure out what happened with the
> first xfs tar test and how it possibly exceeded the available
> throughput for the disks. (One suspect: GNU tar can avoid actually
> reading file data when it sees the archive going to /dev/null, which
> would explain why forcing everything through pipebench slowed it
> back down to disk speed.) Everything else was pretty sane.
> 
> It would appear xfs destroys ext4 for this perverse use case.
> 
> I suppose my next step is migrating the system across to xfs. Either
> I take the time to copy the whole thing across, probably foregoing a
> couple of nights' backups, or I just start a new drive from scratch
> and put the current ext4 drive in the safe for a couple of months.
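> 
> If I do the copy, it'll be something like this (a sketch; mount
> points illustrative, and -H is the important bit so the hardlinks
> survive the move):
> 
> rsync -aH --info=progress2 /mnt/ext4/ /mnt/xfs/
> 
> Though rsync has to remember every hardlinked inode it has seen, so
> with ~50 million inodes that could get interesting memory-wise.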

The xfs performance is quite impressive.

All I can say is that the xfs developers I see on the mailing list
and irc do try hard to help with problems.

Some recent xfs patches I did showed them to be quite thorough
(picky really, but that's a good thing), although that might be
because some of them work in the same group as me, so I got some
free attention right at the start.

They put quite a bit of effort into the xfstests package too, so
much so that it's become fairly widely used for file systems other
than xfs itself (although I have to admit it is a bit buggy).
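
For anyone who hasn't used it: once xfstests' local.config points at
a test device and a scratch device, a typical run is just something
like:

    ./check -g quick

from the top of the checkout.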

I had also recently subscribed to the ext4 mailing list, and my
impression is that they aren't quite as receptive or quite as helpful
as the xfs folks, but that impression is largely unfounded, TBH.

I don't use xfs in anger myself, so I can't really say it's better
than anything else, but I thought you might find my recent impressions
and experience useful. ;)

Ian


