Arch Linux based File Server, Btrfs

Again, I chose Btrfs over ZFS since the former ships with the mainline kernel and seemed like the least headache when it comes to upgrading the OS.  However, I hear ZFS on Linux (ZoL) is quite good, so feel free to go with that if you prefer.  If the Btrfs filesystems on my Arch Linux systems ever become corrupted, I may consider restoring my Borg backup to a ZFS pool instead.  Time will tell.

In any case, on tennessine, my Arch Linux file server, I started by partitioning each disk with a GUID Partition Table (GPT), creating two partitions per disk.  The first partition on each disk is reserved for EFI, though only the first one (/dev/sda1) was formatted with a vfat filesystem and is actually in use.  These EFI partitions would not have been necessary had the Dell R730xd been able to boot off my NVMe SSD, which is attached via a PCIe M.2 daughter card.  Each EFI partition is only 511M.  The second partition on each of the first six disks becomes a device in the Btrfs volume.  To save time, I ran parted in a Bash/zsh for loop (this can be executed during the Arch Linux installation):

for disk in /dev/sd{a..h}; do
    echo "${disk}"
    parted -s "${disk}" -- mklabel gpt \
      mkpart 1 fat32 2048s 512MiB \
      mkpart 2 btrfs 512MiB -1s
done

I also partitioned my NVMe SSD, since it will house my swap partition, root filesystem, and home filesystem:

parted -s /dev/nvme0n1 -- mklabel gpt \
    mkpart 1 linux-swap 8192s 20GiB \
    mkpart 2 ext4 20GiB 60GiB \
    mkpart 3 ext4 60GiB -1s    

Next, I created the filesystems and initialized the swap device:

mkfs.fat /dev/sda1 # the one EFI partition actually in use
mkfs.ext4 /dev/nvme0n1p2
mkfs.ext4 /dev/nvme0n1p3
mkfs.btrfs --label data \
    --metadata raid10 \
    --data raid10 \
    /dev/sd{a..f}2
mkswap /dev/nvme0n1p1
swapon /dev/nvme0n1p1

When running pacstrap as part of the Arch installation, I made sure to include btrfs-progs in the list of packages to install.  The key is the mkfs.btrfs command, which this package provides.  It creates the filesystem (labeled data above) along with its top-level subvolume, which always has subvolume ID 5 and is the root subvolume (/) of the filesystem.  Everything else is created as a subvolume of this top-level subvolume.  See the Btrfs wiki for more details.
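
For reference, a minimal pacstrap invocation on a current install ISO might look like this; the packages besides btrfs-progs are illustrative, so include whatever your installation needs:

pacstrap -K /mnt base linux linux-firmware btrfs-progs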

Next, I mounted the root, home, and boot/EFI partitions:

mount /dev/nvme0n1p2 /mnt
mkdir /mnt/{boot,home}
mount /dev/nvme0n1p3 /mnt/home
mount /dev/sda1 /mnt/boot

After this, I mounted the top-level subvolume to /mnt/data:

mkdir /mnt/data
mount -t btrfs -o rw,compress=zstd,subvol=/ /dev/sda2 /mnt/data

You only need to specify one of the member disks when mounting a multi-device Btrfs filesystem.  Subvolumes are directly related to snapshots: a snapshot is itself a subvolume, one that captures the files within the snapshotted subvolume at a point in time but excludes any child subvolumes.  For tennessine, I created my subvolumes using the following commands:

mkdir /data/media # plain directory, not a subvolume
btrfs subvolume create /data/media/music
btrfs subvolume create /data/media/video
btrfs subvolume create /data/backup
btrfs subvolume create /data/backup/borg
btrfs subvolume create /data/backup/borg/encrypted # much later than initial setup

I now have these subvolumes:

btrfs subvolume list /data
ID 258 gen 110452 top level 5 path media/music
ID 259 gen 110453 top level 5 path media/video
ID 260 gen 110450 top level 5 path backup
ID 261 gen 110455 top level 260 path backup/borg
ID 48245 gen 148658 top level 261 path backup/borg/encrypted

NOTE:  My /data directory on tennessine was created after I installed Arch, so the Btrfs mount points needed to be added to /etc/fstab manually.  As listed above, I chose zstd compression (default level 3) as a mount option for the top-level subvolume, and created everything else as a subvolume of that.  Here is the partition layout, where you can see the UUIDs of all the filesystems:

lsblk -f
NAME        FSTYPE FSVER LABEL           UUID                                 FSAVAIL FSUSE% MOUNTPOINTS
sda
├─sda1      vfat   FAT16                 B863-C60E                             453.2M    11% /boot
└─sda2      btrfs        data            384a9c1d-bbd6-4253-a74c-a38d41200bb9   32.1T    16% /mnt/snapshots/borg
                                                                                             /srv/nfs/video
                                                                                             /srv/nfs/music
                                                                                             /data/backup/borg
                                                                                             /data/media/video
                                                                                             /data/media/music
                                                                                             /data/backup
                                                                                             /data
sdb
├─sdb1
└─sdb2      btrfs        data            384a9c1d-bbd6-4253-a74c-a38d41200bb9
sdc
├─sdc1
└─sdc2      btrfs        data            384a9c1d-bbd6-4253-a74c-a38d41200bb9
sdd
├─sdd1
└─sdd2      btrfs        data            384a9c1d-bbd6-4253-a74c-a38d41200bb9
sde
├─sde1
└─sde2      btrfs        data            384a9c1d-bbd6-4253-a74c-a38d41200bb9
sdf
├─sdf1
└─sdf2      btrfs        data            384a9c1d-bbd6-4253-a74c-a38d41200bb9
sdg
├─sdg1
└─sdg2
sdh
├─sdh1      ntfs         System Reserved D81068E11068C858
└─sdh2
nvme0n1
├─nvme0n1p1 swap   1                     52f7589e-d6db-4f3e-8c53-d1eb251b6827                [SWAP]
├─nvme0n1p2 ext4   1.0                   a2919159-0976-4131-8ab0-8076d3fb55f5   37.2G    32% /
└─nvme0n1p3 ext4   1.0                   80fa7e18-4605-478c-a62d-877ed5b9113b  122.4G    10% /home

Incidentally, I had installed Windows on the eighth hard drive (sdh) in order to receive some support from Seagate.  That was a nightmare, and I complained as loudly as I could that it shouldn't have been necessary.  Note that the UUID of the Btrfs filesystem is identical on each second partition; I used that UUID in /etc/fstab:

# Static information about the filesystems.
# See fstab(5) for details.

# <file system> <dir> <type> <options> <dump> <pass>
# /dev/nvme0n1p2
UUID=a2919159-0976-4131-8ab0-8076d3fb55f5       /               ext4            rw,relatime     0 1

# /dev/sda1
UUID=B863-C60E          /boot           vfat            rw,relatime,fmask=0022,dmask=0022,codepage=437,iocharset=ascii,shortname=mixed,utf8,errors=remount-ro 0 2

# /dev/nvme0n1p3
UUID=80fa7e18-4605-478c-a62d-877ed5b9113b       /home           ext4            rw,relatime     0 2

# /data => /dev/sd[a-f]2
UUID=384a9c1d-bbd6-4253-a74c-a38d41200bb9 /data              btrfs   rw,compress=zstd,subvol=/,noauto,defaults,relatime         0   0
UUID=384a9c1d-bbd6-4253-a74c-a38d41200bb9 /data/media/music  btrfs   rw,compress=zstd,subvol=/media/music,defaults,relatime     0   0
UUID=384a9c1d-bbd6-4253-a74c-a38d41200bb9 /data/media/video  btrfs   rw,compress=zstd,subvol=/media/video,defaults,relatime     0   0
UUID=384a9c1d-bbd6-4253-a74c-a38d41200bb9 /data/backup       btrfs   rw,compress=zstd,subvol=/backup,defaults,relatime          0   0
UUID=384a9c1d-bbd6-4253-a74c-a38d41200bb9 /data/backup/borg  btrfs   rw,compress=zstd,subvol=/backup/borg,user,relatime     0   0

# NFS binds
/data/media/music /srv/nfs/music    none    bind    0   0
/data/media/video /srv/nfs/video    none    bind    0   0

Note that the subvol mount option gives the subvolume path relative to the top-level subvolume (subvolume /).  You can use subvolid instead, giving the numeric subvolume ID (see the output from btrfs subvolume list above).  I set the top-level subvolume to not mount on boot (noauto), which is a common Btrfs convention (see the section on Subvolumes in the Btrfs documentation).  If I need to create any child subvolumes in the top-level subvolume, I mount /data first.
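
For example, the music entry above could be written equivalently with subvolid, using ID 258 from the btrfs subvolume list output:

UUID=384a9c1d-bbd6-4253-a74c-a38d41200bb9 /data/media/music  btrfs   rw,compress=zstd,subvolid=258,defaults,relatime    0   0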

Note that my latest subvolume, /data/backup/borg/encrypted, was created over a year after the initial setup.  It has no explicit entry in /etc/fstab; as a nested subvolume, it is simply available once /data/backup/borg is mounted.  Still, since it is a subvolume in its own right, it gets snapshots created independently; see the snapper configuration below.
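
Once /data/backup/borg is mounted, you can confirm the nested subvolume is reachable with:

btrfs subvolume show /data/backup/borg/encrypted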

snapper

Now I set up regularly scheduled snapshots using snapper, a tool from openSUSE.  Setup is pretty straightforward:

snapper -c music create-config /data/media/music
snapper -c video create-config /data/media/video
snapper -c backup create-config /data/backup
snapper -c borg create-config /data/backup/borg
snapper -c encrypted create-config /data/backup/borg/encrypted

The -c <config_name> option sets the configuration name; the create-config subcommand writes the configuration to /etc/snapper/configs/<config_name>.  It also adds <config_name> to the SNAPPER_CONFIGS variable in /etc/conf.d/snapper.  Finally, it creates a .snapshots subvolume inside each specified subvolume, where subsequent snapshots are stored.
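
You can confirm all five configurations were registered with:

snapper list-configs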

Let's look at one of the configurations; each is populated with quite a few settings by default.  We'll look at my encrypted configuration, as that is the most important one for my purposes with tennessine.  It is essentially a Bourne-compatible shell script, containing nothing but variable definitions:


# subvolume to snapshot
SUBVOLUME="/data/backup/borg/encrypted"

# filesystem type
FSTYPE="btrfs"


# btrfs qgroup for space aware cleanup algorithms
QGROUP=""


# fraction of the filesystems space the snapshots may use
SPACE_LIMIT="0.5"

# fraction of the filesystems space that should be free
FREE_LIMIT="0.2"


# users and groups allowed to work with config
ALLOW_USERS=""
ALLOW_GROUPS=""

# sync users and groups from ALLOW_USERS and ALLOW_GROUPS to .snapshots
# directory
SYNC_ACL="no"


# start comparing pre- and post-snapshot in background after creating
# post-snapshot
BACKGROUND_COMPARISON="yes"


# run daily number cleanup
NUMBER_CLEANUP="yes"

# limit for number cleanup
NUMBER_MIN_AGE="1800"
NUMBER_LIMIT="20"
NUMBER_LIMIT_IMPORTANT="10"


# create hourly snapshots
TIMELINE_CREATE="yes"

# cleanup hourly snapshots after some time
TIMELINE_CLEANUP="yes"

# limits for timeline cleanup
TIMELINE_MIN_AGE="1800"
TIMELINE_LIMIT_HOURLY="6"
TIMELINE_LIMIT_DAILY="7"
TIMELINE_LIMIT_WEEKLY="3"
TIMELINE_LIMIT_MONTHLY="2"
TIMELINE_LIMIT_YEARLY="1"


# cleanup empty pre-post-pairs
EMPTY_PRE_POST_CLEANUP="yes"

# limits for empty pre-post-pair cleanup
EMPTY_PRE_POST_MIN_AGE="1800"

The most important parts of this configuration are the TIMELINE_CREATE and TIMELINE_LIMIT_* variables, which allow snapper to periodically create, and later clean up, snapshots.  With the limits above, I keep only the six most recent hourly snapshots, seven daily, three weekly, two monthly, and one yearly.  This particular subvolume's latest read-only snapshot is what gets backed up to my Backblaze B2 cloud bucket.
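
Once the timers described below have been running for a while, you can see what the cleanup algorithm has kept for a given configuration with:

snapper -c encrypted list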

The next step is to enable and start the systemd snapper-timeline.timer and snapper-cleanup.timer:

systemctl enable --now snapper-timeline.timer
systemctl enable --now snapper-cleanup.timer

A major caveat to snapper-timeline and snapper-cleanup: the snapper developers set these services to Type=simple, which is really meant for daemons that keep running in the background.  The snapper-helper command that actually runs the timeline and the cleanup exits once it finishes, so I overrode this in both service unit files, with sudo systemctl edit snapper-timeline.service and sudo systemctl edit snapper-cleanup.service, setting Type=oneshot in the [Service] section.
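
Each override drop-in only needs the following:

[Service]
Type=oneshot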

I also updated the snapper-related systemd timers.  By default, snapper-timeline.timer fires hourly, but since my Backblaze B2 uploads can last longer than an hour, I needed to reduce the cadence to every four hours.  See man systemd.timer for the details; a sketch of the override follows.
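
As a sketch, an override created with sudo systemctl edit snapper-timeline.timer could look like this (the four-hour OnCalendar expression is my illustration, not copied from tennessine); the empty OnCalendar= line clears the default hourly schedule before the new one is set:

[Timer]
OnCalendar=
OnCalendar=0/4:00:00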

One final note, which I also mention in my Borg and Backblaze B2 articles in this series: to facilitate mounting the latest snapshots, my snapper-timeline.service override includes an ExecStartPost parameter that automatically mounts the latest snapshot, so Borg or B2 can read it when backing up the necessary files.
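
The exact script lives in those articles; as a hypothetical sketch, the idea is to find the highest-numbered snapshot under the subvolume's .snapshots directory and mount it read-only at /mnt/snapshots/borg (the mount point visible in the lsblk output above):

# hypothetical sketch, not the exact script from the Borg/B2 articles
latest=$(ls /data/backup/borg/encrypted/.snapshots | sort -n | tail -n 1)
mount -o ro,subvol=/backup/borg/encrypted/.snapshots/${latest}/snapshot \
    UUID=384a9c1d-bbd6-4253-a74c-a38d41200bb9 /mnt/snapshots/borg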

Scrub

Both the Btrfs and Arch Linux wikis recommend periodically running btrfs scrub on any Btrfs filesystem; once a month is the suggested frequency.  The btrfs-progs package ships with systemd service and timer templates, which I enabled for my /data mount point like so:

systemctl enable --now btrfs-scrub@data.timer

By default, this timer has AccuracySec set to 1d (one day) and RandomizedDelaySec set to 1w (one week).  After reviewing the logs, I realized that several scrub runs had been aborted due to a system reboot (likely because I was upgrading tennessine at those times).  Setting the following systemd unit override for btrfs-scrub@data.timer corrected this:

[Timer]
AccuracySec=1s
RandomizedDelaySec=1h

Now it should run on the first of the month, within an hour after midnight local time.  I'm far less likely to be administering the system at that time, so hopefully the scrubs will no longer be aborted.
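
To review the result of the most recent run, btrfs-progs provides a status subcommand:

btrfs scrub status /data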

Conclusion

That's about it for my Btrfs setup on tennessine.  I use a similar setup on most of my other Arch Linux systems, since creating a snapshot and mounting it read-only (as mentioned in the article on Subvolumes in the Btrfs documentation) prevents the backup tool from being tripped up by files that change mid-backup.

Next Steps

My series on an Arch Linux based File Server continues: