[plug] Filesystems for lots of inodes
Brad Campbell
brad at fnarfbargle.com
Sun Feb 2 14:42:27 AWST 2020
On 15/1/20 10:35 pm, Brad Campbell wrote:
> On 9/1/20 2:12 pm, Brad Campbell wrote:
>> On 8/1/20 13:39, Byron Hammond wrote:
>>> I'm keeping my eye on this thread with great interest.
>>>
>>> I'm really curious to see what your findings are and how you got there.
>>>
>>
>> It will be interesting. I'll say in response to "how you got there",
>> the current answer is _slowly_.
>
> Untarring the backup file onto a clean ext4 filesystem on a 5 drive
> RAID5 took 176 hours for the bulk restore, and then tar seems to do
> another pass removing symlinks, creating a new symlink and then
> hardlinking that. That took an additional 7 hours.
>
> So ~183 hours to restore the tar file onto a clean ext4 filesystem.
>
> At least I have a reproducible test case. That averaged 5.3MB/s.
>
> Correcting my mistake, this filesystem has 50.2 million inodes and 448
> million files/directories.
>
> root@test:/server# tar -tvzf backups.tgz | wc -l
> 448763241
>
> Just the tar | wc -l took the best part of a day. This might take a while.
>
>
So from here on this E-mail has just been built up in sequence as things
were tried. Lots of big gaps (like 183 hours to build the ext4
filesystem), so it's all kinda "stream of consciousness". I'll put a
summary at the end.
I needed some way of speeding this process up, and write caching the
drives seemed like the sanest way to do it. I ran up a new qemu (kvm)
devuan instance, passed it the raw block device and set the caching
method to "unsafe". That basically ignores all data safety requests
(sync/fsync/flush) from the guest and lets the host's page cache act as
one huge write cache.
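For anyone wanting to reproduce that, the relevant bit is qemu's cache
mode. A rough sketch only; the memory size, disk image and device names
below are made up for illustration, not taken from the real VM:

# cache=unsafe discards the guest's sync/fsync/flush requests, so the
# host's page cache behaves as a large volatile write cache. Fine for a
# throwaway test filesystem, dangerous for data you care about.
qemu-system-x86_64 -enable-kvm -m 16384 -smp 4 \
    -drive file=devuan-root.qcow2,format=qcow2,if=virtio \
    -drive file=/dev/md0,format=raw,if=virtio,cache=unsafe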
So, the filesystem had already been created and populated (183 hours
worth). This is a simple find on the filesystem from inside the VM.
ext4:
root@devfstest:/mnt/source# time find . | wc -l
448763242
real 1300m56.182s
user 3m18.904s
sys 12m56.012s
I've created a new xfs filesystem and repeated the restore from the one
big tarball :
real 10130m14.072s
user 9631m11.388s
sys 325m38.168s
So 168 hours for xfs.
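For reference, a stock xfs create on the passed-through block device is
nothing fancier than the line below; /dev/vdb stands in for whatever
name the guest gives the device, and no particular mkfs tuning is
implied:

# defaults only; -f clobbers any existing filesystem signature
mkfs.xfs -f /dev/vdb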
I've noticed an inordinate amount of time being spent inside tar, so I
took the time to create the archive again, this time with a separate
tarball for each backed up directory.
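Something like the following produces that layout. The source path is
hypothetical (the real backup layout isn't shown here), and pigz on the
compression side is an assumption; it only appears for decompression
below. The /server/fred destination is where the restore loop reads
from:

# one compressed tarball per top-level backup directory instead of a
# single monolithic archive
cd /backup
for d in */ ; do
    tar -cp "$d" | pigz > /server/fred/"${d%/}".tgz
done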
So, let's repeat that populate test with xfs, this time from the
separate tarballs. A bit convoluted, but let's see what happens. Surely
it can't be slower!
root@devfstest:/mnt# time for i in `ssh test ls /server/fred/*.tgz` ; do
echo $i ; ssh test cat $i | pigz -d | tar -x ; done
real 730m25.915s
user 496m28.968s
sys 209m7.068s
12.1 hours using separate tarballs vs 168 hours from the one big
tarball. So, in this instance tar itself was the bottleneck! All future
tests will be done using the multiple tarball archive.
Right, so now create a new ext4 filesystem on there and repeat the test
real 1312m53.829s
user 481m3.272s
sys 194m49.744s
Summary :
xfs : 12.1 hours
ext4 : 21.8 hours
Filesystem population test win : XFS by a long margin.
Now, I wasn't clever enough to do a find test on xfs before doing the
ext4 creation test, so let's run the find on ext4, then re-create the
xfs and do it again.
It will be interesting to see how this compares to the initial find
test on the fs that was created on bare metal and then had its block
device passed through to the VM (first result in this mail, some 1300
minutes).
Not entirely a fair test as the filesystems differ in content. The "one
big tarball" was taken about 10 days before the "multiple smaller
tarballs", but it's still ~45-50 million inodes.
Lesson learned: make sure the filesystem is mounted noatime before the
test. It took several restarts before I figured out what was writing to
the disk.
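In other words, before each timed run, something like the following;
the device and mount point names are hypothetical:

# no atime updates, so a read-only crawl doesn't turn into a stream of
# dirty inodes being written back to disk
mount -o noatime /dev/vdb /mnt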
Find test on ext4 :
cd /mnt ; time find . | wc -l
ext4 :
real 1983m45.609s
user 3m32.184s
sys 14m2.420s
Not so pretty. That's roughly 50% longer than last time. Still,
different filesystem contents, so not directly comparable. Right, let's
build up a new xfs filesystem and repeat the test :
root@devfstest:/mnt# time for i in `ssh test ls /server/fred/*.tgz` ; do
echo $i ; ssh test cat $i | pigz -d | tar -x ; done
real 711m17.118s
user 498m12.424s
sys 210m50.748s
So the populate was 730 mins last time and 711 mins this time. ~3%
variance. Close enough.
root@devfstest:/mnt# time find . | wc -l
497716103
real 43m13.998s
user 2m49.624s
sys 6m33.148s
xfs ftw! 43 mins vs the 1984 mins the same find took on ext4.
So, summary.
xfs create : 12.1 hours
ext4 create : 21.8 hours
xfs find : 43 min
ext4 find : 33 hours
Let's do a tar test and see how long it takes to read the entire
filesystem. This would be a good indicator of time to replicate. Again,
because I wasn't clever enough to have thought all this up beforehand,
I'll have to do it on xfs first, then recreate the ext4 and run it again.
root@devfstest:/mnt# time for i in * ; do echo $i ; tar -cp $i >
/dev/null ; done
real 108m59.595s
user 20m14.032s
sys 50m48.216s
Seriously?!? 108 minutes for 3.5TB of data. I've done something wrong
obviously. Let's retest that with pipebench to make sure it's actually
archiving data :
root@devfstest:/mnt# time for i in * ; do echo $i ; tar -cp $i |
pipebench -b 32768 > /dev/null ; done
real 308m44.940s
user 31m58.108s
sys 98m8.844s
Better. Just over 5 hours.
Let's do a du -hcs * :
root@devfstest:/mnt# time du -hcs *
real 73m20.487s
user 2m53.884s
sys 29m49.184s
xfs tar test : 5.1 hours
xfs du -hcs test : 73 minutes
Right, now to re-populate the filesystem as ext4 and re-test.
Hrm. Just realised that all previous ext4 creation tests were at the
mercy of lazy_init, so the new filesystem was created with lazy
initialisation disabled for both the inode tables and the journal.
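That is, something along these lines (device name hypothetical):

# lazy_itable_init/lazy_journal_init default to on, which defers inode
# table and journal initialisation to a background thread after mount
# and can skew timings taken straight after mkfs
mkfs.ext4 -E lazy_itable_init=0,lazy_journal_init=0 /dev/vdb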
real 1361m53.562s
user 499m20.168s
sys 212m6.524s
So ext4 create : 22.6 hours. Still about right.
Time for the tar create test :
root@devfstest:/mnt# time for i in * ; do echo $i ; sleep 5 ; tar -cp $i
| pipebench -b 32768 > /dev/null ; done
real 2248m18.299s
user 35m6.968s
sys 98m57.936s
Right. That wasn't really a surprise, but the magnitude of the
difference was.
xfs : 5.1 hours
ext4 : 37.4 hours
Now the du -hcs * test :
real 1714m21.503s
user 3m40.596s
sys 37m24.928s
xfs : 73 minutes
ext4 : 28.5 hours
Summary
Populate fresh & empty fs from tar files :
xfs : 12.1 hours
ext4 : 21.8 hours
Find :
xfs : 43 min
ext4 : 33 hours
du -hcs * :
xfs : 73 minutes
ext4 : 28.5 hours
tar create :
xfs : 5.1 hours
ext4 : 37.4 hours
I think there's a pattern there.
So: one VM config and hardware set, and one set of source tar files.
Tests were performed sequentially, so there were likely workload
variations on the host server, but nothing significant, and certainly
not enough to make more than a couple of percent difference either way.
Everything else was pretty sane, but I still need to go back and
confirm what happened with the first xfs tar test and how it apparently
exceeded the available throughput of the disks. My suspicion is GNU
tar's habit of skipping the file contents when it notices the archive
is /dev/null; going through pipebench turns the output into a pipe,
which would explain the gap.
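A quick way to test that theory on a small tree (directory name
hypothetical):

# GNU tar skips reading file contents when it detects the archive is
# /dev/null, so the first run is suspiciously fast; pushing the stream
# through a pipe first forces it to actually read the data
time tar -cp somedir > /dev/null
time tar -cp somedir | cat > /dev/null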
It would appear xfs destroys ext4 for this perverse use case.
I suppose my next step is migrating the system across to xfs: either
take the time to copy the whole thing across (probably foregoing a
couple of nights' backups), or just start a new drive from scratch and
put the current ext4 drive in the safe for a couple of months.
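If I do go the copy route it will need to preserve the hardlinks, so
something along the lines of the following; the mount points are
hypothetical:

# -H preserves hardlinks, which matters with ~450 million names pointing
# at ~50 million inodes; -A and -X carry ACLs and xattrs across if any
rsync -aHAX --progress /mnt/ext4/ /mnt/xfs/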
Regards,
Brad
--
An expert is a person who has found out by his own painful
experience all the mistakes that one can make in a very
narrow field. - Niels Bohr