[plug] Filesystems for lots of inodes

Brad Campbell brad at fnarfbargle.com
Sun Feb 2 14:42:27 AWST 2020

On 15/1/20 10:35 pm, Brad Campbell wrote:
> On 9/1/20 2:12 pm, Brad Campbell wrote:
>> On 8/1/20 13:39, Byron Hammond wrote:
>>> I'm keeping my eye on this thread with great interest.
>>> I'm really curious to see what your findings are and how you got there.
>> It will be interesting. I'll say in response to "how you got there", 
>> the current answer is _slowly_.
> Untarring the backup file onto a clean ext4 filesystem on a 5 drive 
> RAID5 took 176 hours for the bulk restore, and then tar seems to do 
> another pass removing symlinks, creating a new symlink and then 
> hardlinking that. That took an additional 7 hours.
> So ~183 hours to restore the tar file onto a clean ext4 filesystem.
> At least I have a reproducible test case. That averaged 5.3MB/s.
> Correcting my mistake, this filesystem has 50.2 million inodes and 448 
> million files/directories.
> root@test:/server# tar -tvzf backups.tgz | wc -l
> 448763241
> Just the tar | wc -l took the best part of a day. This might take a while.

So from here on this E-mail has just been built up in sequence as things
are tried. Lots of big gaps (like 183 hours to build the ext4
filesystem), so it's all kinda "stream of consciousness". I'll put a
summary at the end.

I needed some way of speeding this process up and write caching the
drives seemed like the sanest way to do it. I ran up a new qemu(kvm)
devuan instance, passed it the raw block device and set the caching
method to "unsafe".  That basically ignores all data safety requests
(sync/fsync/flush) and allows the machine to act as a huge cache.
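
For anyone wanting to reproduce the setup, here is a minimal sketch of what such a guest invocation might look like. The exact command used is not given in this mail, so the device path (/dev/md0), the memory size and the virtio interface are all assumptions; only cache=unsafe comes from the description above. The helper only prints the command rather than running it.

```shell
# Dry-run sketch: print a qemu command line that passes a raw host block
# device to a KVM guest with the "unsafe" cache mode.
# ASSUMPTIONS: /dev/md0, 8G of guest RAM and if=virtio are illustrative only.
qemu_unsafe_cmd() {
    # $1 = raw block device on the host (hypothetical)
    printf '%s\n' "qemu-system-x86_64 -enable-kvm -m 8G -drive file=$1,format=raw,if=virtio,cache=unsafe"
}

qemu_unsafe_cmd /dev/md0
```

With cache=unsafe the host page cache absorbs writes and guest flush requests are ignored, so a host crash can lose or corrupt the guest filesystem. That is acceptable for a throwaway benchmark, not for real data.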

So, the filesystem had already been created and populated (183 hours
worth). This is a simple find on the filesystem from inside the VM.

root@devfstest:/mnt/source# time find . | wc -l

real	1300m56.182s
user	3m18.904s
sys	12m56.012s

I've created a new xfs filesystem and restored the same single big tarball onto it :
real    10130m14.072s
user    9631m11.388s
sys     325m38.168s

So 168 hours for xfs.

I've noticed an inordinate amount of time being spent inside tar, so I
took the time to create the archive again, this time with a separate
tarball for each backed up directory.

So, let's repeat that test with xfs :  A bit complexish, but let's see
what happens. Surely can't be slower!

root@devfstest:/mnt# time for i in `ssh test ls /server/fred/*.tgz` ; do
echo $i ; ssh test cat $i | pigz -d | tar -x ; done
real    730m25.915s
user    496m28.968s
sys     209m7.068s

12.1 hours using separate tarballs vs 168 hours using one big tarball.

So, in this instance tar was/is the bottleneck! All future tests will be
done using the multiple tarball archive.

Right, so now create a new ext4 filesystem on there and repeat the test

real    1312m53.829s
user    481m3.272s
sys     194m49.744s

Summary :

xfs  : 12.1 hours
ext4 : 21.8 hours

Filesystem population test win : XFS by a long margin.
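
The hour figures come straight from the "real" lines printed by time. A small sketch of that conversion, assuming the NNNmSS.sssS format shown in this mail (the helper name real_to_hours is made up for illustration):

```shell
# Convert a "real" value as printed by time (e.g. 730m25.915s) to hours.
real_to_hours() {
    # $1 = elapsed time in the form <minutes>m<seconds>s
    awk -v t="$1" 'BEGIN {
        split(t, part, "m")     # part[1] = minutes, part[2] = seconds plus "s"
        secs = part[2]
        sub(/s$/, "", secs)     # drop the trailing "s"
        printf "%.2f\n", (part[1] + secs / 60) / 60
    }'
}

echo "xfs  : $(real_to_hours 730m25.915s) hours"
echo "ext4 : $(real_to_hours 1312m53.829s) hours"
```

This prints two decimals, whereas the figures quoted in the mail are cut to one decimal.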

Now, I wasn't clever enough to do a find test on xfs before doing the
ext4 creation test, so let's run the find on ext4, then re-create the
xfs and do it again.

This should be interesting to see how it compares to the initial find
test on the fs created on the bare metal and then the block device
passed through to the VM (first result in this mail, some 1300 minutes).
Not entirely a fair test as the filesystems differ in content. The "one
big tarball" was about 10 days before the "multiple smaller tarballs",
but still ~45-50 million inodes.

Lesson learned: make sure the filesystem is mounted noatime before the test.
It took several restarts before I figured out what was writing to the disk.
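
A quick pre-flight check along these lines would have caught it. This is only a sketch: the helper name has_noatime is made up, and it assumes a Linux-style /proc/mounts (an alternative file can be passed as the second argument).

```shell
# Succeed (exit 0) if the given mount point is mounted with noatime.
has_noatime() {
    # $1 = mount point, $2 = mounts file (defaults to /proc/mounts)
    awk -v mp="$1" '
        $2 == mp {
            n = split($4, opt, ",")          # field 4 holds the mount options
            for (i = 1; i <= n; i++)
                if (opt[i] == "noatime") found = 1
        }
        END { exit !found }
    ' "${2:-/proc/mounts}"
}

if has_noatime /mnt 2>/dev/null; then
    echo "/mnt is noatime, ok to start the read tests"
else
    echo "/mnt is not noatime, try: mount -o remount,noatime /mnt"
fi
```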

Find test on ext4 :
cd /mnt ; time find . | wc -l

ext4 :
real    1983m45.609s
user    3m32.184s
sys     14m2.420s

Not so pretty. So 50% longer than last time. Still, different filesystem
contents so not directly comparable. Right, let's build up a new xfs
filesystem and repeat the test :

root@devfstest:/mnt# time for i in `ssh test ls /server/fred/*.tgz` ; do
echo $i ; ssh test cat $i | pigz -d | tar -x ; done
real    711m17.118s
user    498m12.424s
sys     210m50.748s

So create was 730 mins last time and 711 mins this time. ~3% variance.
Close enough.

root@devfstest:/mnt# time find . | wc -l

real    43m13.998s
user    2m49.624s
sys     6m33.148s

xfs ftw! 43 mins vs 1983 mins.

So, summary.
xfs create : 12.1 hours
ext4 create : 21.8 hours

xfs find : 43 min
ext4 find : 33.0 hours

Let's do a tar test and see how long it takes to read the entire
filesystem. This would be a good indicator of time to replicate. Again,
because I wasn't clever enough to have this stuff thought up before
hand, I'll have to do it on xfs, then recreate the ext4 and run it again.

root@devfstest:/mnt# time for i in * ; do echo $i ; tar -cp $i >
/dev/null ; done
real    108m59.595s
user    20m14.032s
sys     50m48.216s

Seriously?!? 108 minutes for 3.5TB of data. I've done something wrong
obviously. Let's retest that with pipebench to make sure it's actually
archiving data :

root@devfstest:/mnt# time for i in * ; do echo $i ; tar -cp $i |
pipebench -b 32768 > /dev/null ; done
real    308m44.940s
user    31m58.108s
sys     98m8.844s

Better. Just over 5 hours.
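
One likely explanation for the gap between the two runs: GNU tar notices when its archive is /dev/null and then avoids reading file contents, so the first run was mostly a metadata walk. Sending the stream through a pipe, as pipebench does, forces the data to be read. A minimal sketch of a read test built that way (the helper name read_tree is made up):

```shell
# Read a whole directory tree through tar and report how many archive
# bytes were produced. The pipe stops tar from seeing /dev/null as its
# archive, so file contents really are read.
read_tree() {
    # $1 = directory to archive
    tar -cpf - "$1" | wc -c | tr -d ' '
}

# Usage, mirroring the loop above:
#   for i in * ; do echo "$i : $(read_tree "$i") bytes" ; done
```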

Let's do a du -hcs *
root@devfstest:/mnt# time du -hcs *
real    73m20.487s
user    2m53.884s
sys     29m49.184s

xfs tar test : 5.1 hours
xfs du -hcs test : 73 minutes

Right, now to re-populate the filesystem as ext4 and re-test.
Hrm. Just realised that all previous ext4 creation tests were at the
mercy of lazy_init, so create the new one with no lazy init on inode
tables or journal.
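
The mkfs command itself is not shown in this mail. With e2fsprogs, turning off lazy initialisation would typically look like the sketch below; the device name /dev/vdb is an assumption, and the helper only prints the command instead of running it.

```shell
# Dry-run sketch: print an mkfs.ext4 command with lazy inode table and
# lazy journal initialisation both disabled.
# ASSUMPTION: /dev/vdb stands in for whatever device the guest sees.
mkfs_ext4_cmd() {
    # $1 = target block device (hypothetical)
    printf '%s\n' "mkfs.ext4 -E lazy_itable_init=0,lazy_journal_init=0 $1"
}

mkfs_ext4_cmd /dev/vdb
```

With both options set to 0, mkfs writes the inode tables and journal up front, so no background initialisation competes with the test for disk time.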

real    1361m53.562s
user    499m20.168s
sys     212m6.524s

So ext4 create : 22.6 hours. Still about right.

Time for the tar create test :
root@devfstest:/mnt# time for i in * ; do echo $i ; sleep 5 ; tar -cp $i
| pipebench -b 32768 > /dev/null ; done
real    2248m18.299s
user    35m6.968s
sys     98m57.936s

Right. That wasn't really a surprise, but the magnitude of the 
difference was.
xfs : 5.1 hours
ext4 : 37.4 hours

Now the du -hcs * test :
real    1714m21.503s
user    3m40.596s
sys     37m24.928s

xfs : 73 minutes
ext4 : 28.5 hours


Populate fresh & empty fs from tar files :
xfs  : 12.1 hours
ext4 : 21.8 hours

Find :
xfs  : 43 min
ext4 : 33.0 hours

du -hcs * :
xfs  : 73 minutes
ext4 : 28.5 hours

tar create :
xfs  : 5.1 hours
ext4 : 37.4 hours

I think there's a pattern there.
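
To put numbers on the pattern, here is a small sketch that divides the ext4 time by the xfs time for each test. The inputs are the "real" times from the runs above, converted to minutes and rounded to one decimal; the helper name ratio is made up.

```shell
# Print how many times longer the first duration is than the second.
ratio() {
    # $1 = slower time, $2 = faster time (same unit, minutes here)
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", a / b }'
}

echo "populate   : ext4 took $(ratio 1312.9 730.4)x as long as xfs"
echo "find       : ext4 took $(ratio 1983.8 43.2)x as long as xfs"
echo "du -hcs    : ext4 took $(ratio 1714.4 73.3)x as long as xfs"
echo "tar create : ext4 took $(ratio 2248.3 308.7)x as long as xfs"
```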

So, using one VM config and hardware set. One set of source tar files.

Tests were performed sequentially, so there were likely workload 
variations on the host server, but nothing significant and certainly not 
enough to make more than a couple of percent difference either way.

So I still need to go back and figure out what happened with the first 
xfs tar test and how it possibly exceeded the available throughput for 
the disks. Everything else was pretty sane.
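
A rough sanity check on that first run, as a sketch: it treats the 3.5TB figure as 3.5 x 10^12 bytes (an assumption, the mail does not say whether TB or TiB is meant) and divides by the elapsed "real" time in seconds. The helper name mb_per_s is made up.

```shell
# Average throughput in MB/s (10^6 bytes per second).
mb_per_s() {
    # $1 = bytes read, $2 = elapsed seconds
    awk -v bytes="$1" -v secs="$2" 'BEGIN { printf "%.0f\n", bytes / secs / 1000000 }'
}

# 108m59.595s = 6539.595 s (tar straight to /dev/null)
echo "first xfs tar run  : $(mb_per_s 3500000000000 6539.595) MB/s"
# 308m44.940s = 18524.940 s (tar through pipebench)
echo "second xfs tar run : $(mb_per_s 3500000000000 18524.940) MB/s"
```

The first figure is what the array would have had to sustain if every byte had really been read.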

It would appear xfs destroys ext4 for this perverse use case.

I suppose my next step is migrating the system across to xfs. Either I 
take the time to copy the whole thing across, probably forgoing a 
couple of nights' backups, or I just start a new drive from scratch and 
put the current ext4 drive in the safe for a couple of months.

An expert is a person who has found out by his own painful
experience all the mistakes that one can make in a very
narrow field. - Niels Bohr
