At work, we have a couple of Backblaze storage pods (version 3, with 4TB drives) that we use for backup purposes. They were obtained before my time because quick, bulk storage was needed to back up our object storage platform, Swift.
Sadly, the boxes were deployed in an unsatisfactory manner whereby all 45 drives were pooled together into one gigantic LVM volume, meaning the death of any one disk would cause data loss.
Fast-forward to the present day: I got the chance to start from scratch with one of these boxes. I decided almost instantly that ZFS would be an awesome candidate for this much storage, easily providing great reliability, amazing speed, and fantastic redundancy. Let’s get started!
Installation: Or why does GRUB hate me?
I decided to use the latest Ubuntu 14.04.2 64-bit LTS release on this box. Note that it was initially running 12.04, and luckily a do-release-upgrade worked outstandingly. But there was a problem on boot-up… of course :(
Below is what you’ll see a lot of at boot whilst the box initializes all the devices:
[ 60.332970] ata7.05: SATA link up 1.5 Gbps (SStatus 113 SControl 320)
[ 60.333909] ata7.00: ATA-8: ST4000DM000-1F2168, CC52, max UDMA/133
[ 60.333911] ata7.00: 7814037168 sectors, multi 16: LBA48 NCQ (depth 31/32)
[ 60.334582] ata7.00: configured for UDMA/100
[ 60.335493] ata7.01: ATA-8: ST4000DM000-1F2168, CC52, max UDMA/133
[ 60.335494] ata7.01: 7814037168 sectors, multi 0: LBA48 NCQ (depth 31/32)
[ 60.336145] ata7.01: configured for UDMA/100
[ 60.337061] ata7.02: ATA-8: ST4000DM000-1F2168, CC52, max UDMA/133
[ 60.337062] ata7.02: 7814037168 sectors, multi 0: LBA48 NCQ (depth 31/32)
[ 60.337729] ata7.02: configured for UDMA/100
[ 60.338693] ata7.03: ATA-8: ST4000DM000-1F2168, CC52, max UDMA/133
[ 60.338694] ata7.03: 7814037168 sectors, multi 0: LBA48 NCQ (depth 31/32)
[ 60.339359] ata7.03: configured for UDMA/100
[ 60.339953] ata7.04: ATA-8: ST4000DM000-1F2168, CC52, max UDMA/133
[ 60.339954] ata7.04: 7814037168 sectors, multi 0: LBA48 NCQ (depth 31/32)
[ 60.340602] ata7.04: configured for UDMA/100
[ 60.340649] ata7: EH complete
[ 60.340703] scsi 6:0:0:0: Direct-Access ATA ST4000DM000-1F21 CC52 PQ: 0 ANSI: 5
[ 60.340830] sd 6:0:0:0: [sdu] 7814037168 512-byte logical blocks: (4.00 TB/3.63 TiB)
[ 60.340832] sd 6:0:0:0: [sdu] 4096-byte physical blocks
[ 60.340866] sd 6:0:0:0: [sdu] Write Protect is off
[ 60.340867] sd 6:0:0:0: [sdu] Mode Sense: 00 3a 00 00
[ 60.340882] sd 6:0:0:0: [sdu] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[ 60.341014] sd 6:0:0:0: Attached scsi generic sg20 type 0
[ 60.341078] scsi 6:1:0:0: Direct-Access ATA ST4000DM000-1F21 CC52 PQ: 0 ANSI: 5
[ 60.341163] sd 6:1:0:0: [sdv] 7814037168 512-byte logical blocks: (4.00 TB/3.63 TiB)
[ 60.341166] sd 6:1:0:0: [sdv] 4096-byte physical blocks
[ 60.341168] sd 6:1:0:0: Attached scsi generic sg21 type 0
[ 60.341219] sd 6:1:0:0: [sdv] Write Protect is off
But get this: the main root disk plugged into the motherboard doesn’t get /dev/sda or get recognized as the first SCSI disk in the system. Instead, it inherits some outrageous /dev/sdft{1,2,5} name.
Sadly, with GRUB, a race condition exists: the version packaged with 12.04 and up doesn’t actually wait for the root device to become available before trying to load the kernel (or maybe it just assumes the proper device will always be sda?). Hence we consistently hit the infamous (initramfs) prompt. Luckily, others have seen this before, and there is a solution in rootdelay=90. The more proper option, which I would consider an acceptable default, is rootwait, but for whatever reason that didn’t work for me :(
With that out of the way, here’s the final GRUB config:
#/etc/default/grub - ensure you update-grub
GRUB_DEFAULT=0
GRUB_HIDDEN_TIMEOUT_QUIET=false
GRUB_TIMEOUT=8
GRUB_RECORDFAIL_TIMEOUT=$GRUB_TIMEOUT
GRUB_DISTRIBUTOR=`lsb_release -i -s 2> /dev/null || echo Debian`
GRUB_CMDLINE_LINUX_DEFAULT="rootdelay=90 console=tty0 console=ttyS0,9600n8"
GRUB_CMDLINE_LINUX="console=tty0 console=ttyS0"
GRUB_SERIAL_COMMAND="serial --speed=9600"
GRUB_TERMINAL="serial"
GRUB_TERMINAL_OUTPUT="console serial"
GRUB_TERMINAL_INPUT="console serial"
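The config only takes effect once you regenerate grub.cfg, so don’t forget update-grub. A quick sanity check (just standard commands, nothing exotic) to confirm the new kernel arguments actually landed:
sudo update-grub                    # regenerates /boot/grub/grub.cfg from /etc/default/grub
grep rootdelay /boot/grub/grub.cfg  # the new kernel args should show up here
cat /proc/cmdline                   # ...and here, after the next reboot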
In that config we also disable the stupid “I’m not going to boot because last time the system went down hard/gross, and that means you probably want me to just sit here and suck my thumb” behavior (that’s what the GRUB_RECORDFAIL_TIMEOUT line is for).
You can also see the serial port configuration, which I just tacked on as it already existed in the previous config. More on setting that up here.
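Since 14.04 still uses Upstart, the serial getty side of things can be handled with a tiny job file. This is just a minimal sketch (the file name and baud rate are my assumptions, picked to match the GRUB settings above):
# /etc/init/ttyS0.conf - spawn a login prompt on the serial port
start on stopped rc RUNLEVEL=[2345]
stop on runlevel [!2345]
respawn
exec /sbin/getty -L 9600 ttyS0 vt102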
Meat and Potatoes: I’m ready for you, ZoL
Because of licensing, ZFS isn’t part of the mainline Linux kernel, so there’s no in-tree implementation. Fortunately, ZFS on Linux (ZoL) exists and is pretty much everything you need to ZFS hardcore on Linux systems.
We’ll use the official PPA to get the Ubuntu package ubuntu-zfs:
add-apt-repository ppa:zfs-native/stable
apt-get update
apt-get install ubuntu-zfs
modprobe zfs
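A quick sanity check to make sure the module actually loaded and the userland tools can see it:
lsmod | grep zfs   # the zfs (and spl) modules should be listed
zpool status       # prints "no pools available" on a box with no pools yet
zfs list           # prints "no datasets available"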
Before continuing on and envisioning how you’d like to cut and format your array, I suggest reading a few articles on the options ZFS offers.
- Ars gives us a walkthrough
- Aaron Toponce has too much free time
- Arch Linux always has great docs
- There’s even a subreddit!
The Layout
After some deliberation with my co-worker, we decided upon the following:
One zpool of 4 vdevs, each with RAIDZ2 across 11 disks. One disk sits on the sidelines as a hot spare. This is effectively RAID60. (11 * 4) + 1 = 45
Now, the first thing you’re probably thinking is: why in the hell are they using an odd number of disks for each vdev? Whilst there are about 6 million pages on the Interwebz covering this, we talked it through and decided that whatever small performance loss we’d suffer would be negligible for the core goals of this box and our needs. I can’t stress this enough:
Everyone has slightly different needs and goals. It’s essential you really decide your priorities for what your storage must provide before blindly rolling something out that “looks” and “seems” right.
With this configuration, we can survive losing up to 8 of the “right” drives (two per vdev), but losing 3 of the “wrong” drives (three in the same vdev) would mean data loss. I won’t delve too deeply into how RAID works, but for us this provided a good enough level of availability and usable disk space (which ended up being roughly 120TB).
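For the curious, the back-of-the-envelope capacity math looks roughly like this (it ignores the finer points of RAIDZ allocation overhead, so treat it as a rough sketch):
# each "4TB" drive is ~3.64TiB once you convert decimal TB to binary TiB
#   data disks per vdev : 11 - 2 (RAIDZ2 parity) = 9
#   data disks total    : 9 * 4 vdevs            = 36
#   raw data capacity   : 36 * 3.64TiB           ~= 131TiB
# which lands around the ~120T ZFS ends up reporting once overhead and reserved space are subtracted
echo $(( (11 - 2) * 4 ))   # 36 data disks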
Creation
Literally, one command to rule them all!
zpool \
create \
-f \
-o ashift=12 -O compression=on \
backup \
raidz2 \
/dev/disk/by-id/ata-ST4000DM000-1F2168_W3009P73 \
/dev/disk/by-id/ata-ST4000DM000-1F2168_W3008ANS \
/dev/disk/by-id/ata-ST4000DM000-1F2168_W3008WL3 \
/dev/disk/by-id/ata-ST4000DM000-1F2168_W3009ACP \
/dev/disk/by-id/ata-ST4000DM000-1F2168_W300CKT0 \
/dev/disk/by-id/ata-ST4000DM000-1F2168_W300AMAC \
/dev/disk/by-id/ata-ST4000DM000-1F2168_W3008KJN \
/dev/disk/by-id/ata-ST4000DM000-1F2168_W300A8ET \
/dev/disk/by-id/ata-ST4000DM000-1F2168_W30086WX \
/dev/disk/by-id/ata-ST4000DM000-1F2168_W300AM5Q \
/dev/disk/by-id/ata-ST4000DM000-1F2168_W3008WLE \
raidz2 \
/dev/disk/by-id/ata-ST4000DM000-1F2168_W300AHR6 \
/dev/disk/by-id/ata-ST4000DM000-1F2168_W300ALYN \
/dev/disk/by-id/ata-ST4000DM000-1F2168_W30088YL \
/dev/disk/by-id/ata-ST4000DM000-1F2168_W3009NFP \
/dev/disk/by-id/ata-ST4000DM000-1F2168_W300B0FK \
/dev/disk/by-id/ata-ST4000DM000-1F2168_W300CTDB \
/dev/disk/by-id/ata-ST4000DM000-1F2168_W300A8R8 \
/dev/disk/by-id/ata-ST4000DM000-1F2168_W3009NQ3 \
/dev/disk/by-id/ata-ST4000DM000-1F2168_W300959J \
/dev/disk/by-id/ata-ST4000DM000-1F2168_W3009176 \
/dev/disk/by-id/ata-ST4000DM000-1F2168_W300A6HQ \
raidz2 \
/dev/disk/by-id/ata-ST4000DM000-1F2168_W3008ERM \
/dev/disk/by-id/ata-ST4000DM000-1F2168_W30090R8 \
/dev/disk/by-id/ata-ST4000DM000-1F2168_W3009A5D \
/dev/disk/by-id/ata-ST4000DM000-1F2168_W300974C \
/dev/disk/by-id/ata-ST4000DM000-1F2168_W300915V \
/dev/disk/by-id/ata-ST4000DM000-1F2168_W3009146 \
/dev/disk/by-id/ata-ST4000DM000-1F2168_W300A3MK \
/dev/disk/by-id/ata-ST4000DM000-1F2168_W3009NCC \
/dev/disk/by-id/ata-ST4000DM000-1F2168_W300AHNM \
/dev/disk/by-id/ata-ST4000DM000-1F2168_W300BZBP \
/dev/disk/by-id/ata-ST4000DM000-1F2168_W3008JYF \
raidz2 \
/dev/disk/by-id/ata-ST4000DM000-1F2168_W3008QL2 \
/dev/disk/by-id/ata-ST4000DM000-1F2168_W3008W5S \
/dev/disk/by-id/ata-ST4000DM000-1F2168_W3008KQ7 \
/dev/disk/by-id/ata-ST4000DM000-1F2168_W300ADHE \
/dev/disk/by-id/ata-ST4000DM000-1F2168_W3009P1R \
/dev/disk/by-id/ata-ST4000DM000-1F2168_W300AE5J \
/dev/disk/by-id/ata-ST4000DM000-1F2168_W3009NG0 \
/dev/disk/by-id/ata-ST4000DM000-1F2168_W3009P3J \
/dev/disk/by-id/ata-ST4000DM000-1F2168_W300ADZK \
/dev/disk/by-id/ata-ST4000DM000-1F2168_W3008BTB \
/dev/disk/by-id/ata-ST4000DM000-1F2168_W3008WWS \
spare \
/dev/disk/by-id/ata-ST4000DM000-1F2168_W3008KVP
Quick breakdown of options:
- -f: since these drives used to be part of an LVM setup, we tell ZFS to disregard any previous affiliation
- ashift=12: accounts for the 4K sectors
- -O compression=on: compression works really well, so we enable it for the entire zpool
- backup: the name of our zpool
The rest is self-explanatory. You can see we decided to use the by-id names for the disks. Note that these references are what ZFS actually uses to interface with your drives, so using /dev/sdX is probably a bad idea (those names aren’t permanent). Personally, I prefer by-id because I can easily see the type and serial of each disk in the array. Another great option would be to use UUIDs. ZoL mentions this in their FAQ, but I actually disagree with their recommendation on what to use for larger pools.
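If you’re wondering where those by-id names come from, they’re just the stable symlinks udev creates under /dev/disk/by-id/. Something like this makes it easy to grab all 45 of them when building the create command:
# list the stable names (model + serial) for the 4TB Seagates, skipping the -partN entries
ls -l /dev/disk/by-id/ | grep ata-ST4000DM000 | grep -v part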
One last thing to mention: this zpool create command took literally 10 seconds to complete. That’s pretty amazing considering mdadm would have taken hours to finish building the array (granted, you’re allowed to use it whilst that process takes place).
Assessing/Status
Let’s get some details about our pool!
[~]> zpool status backup
  pool: backup
 state: ONLINE
  scan: none requested
config:

        NAME                                 STATE     READ WRITE CKSUM
        backup                               ONLINE       0     0     0
          raidz2-0                           ONLINE       0     0     0
            ata-ST4000DM000-1F2168_W3009P73  ONLINE       0     0     0
            ata-ST4000DM000-1F2168_W3008ANS  ONLINE       0     0     0
            ata-ST4000DM000-1F2168_W3008WL3  ONLINE       0     0     0
            ata-ST4000DM000-1F2168_W3009ACP  ONLINE       0     0     0
            ata-ST4000DM000-1F2168_W300CKT0  ONLINE       0     0     0
            ata-ST4000DM000-1F2168_W300AMAC  ONLINE       0     0     0
            ata-ST4000DM000-1F2168_W3008KJN  ONLINE       0     0     0
            ata-ST4000DM000-1F2168_W300A8ET  ONLINE       0     0     0
            ata-ST4000DM000-1F2168_W30086WX  ONLINE       0     0     0
            ata-ST4000DM000-1F2168_W300AM5Q  ONLINE       0     0     0
            ata-ST4000DM000-1F2168_W3008WLE  ONLINE       0     0     0
          raidz2-1                           ONLINE       0     0     0
            ata-ST4000DM000-1F2168_W300AHR6  ONLINE       0     0     0
            ata-ST4000DM000-1F2168_W300ALYN  ONLINE       0     0     0
            ata-ST4000DM000-1F2168_W30088YL  ONLINE       0     0     0
            ata-ST4000DM000-1F2168_W3009NFP  ONLINE       0     0     0
            ata-ST4000DM000-1F2168_W300B0FK  ONLINE       0     0     0
            ata-ST4000DM000-1F2168_W300CTDB  ONLINE       0     0     0
            ata-ST4000DM000-1F2168_W300A8R8  ONLINE       0     0     0
            ata-ST4000DM000-1F2168_W3009NQ3  ONLINE       0     0     0
            ata-ST4000DM000-1F2168_W300959J  ONLINE       0     0     0
            ata-ST4000DM000-1F2168_W3009176  ONLINE       0     0     0
            ata-ST4000DM000-1F2168_W300A6HQ  ONLINE       0     0     0
          raidz2-2                           ONLINE       0     0     0
            ata-ST4000DM000-1F2168_W3008ERM  ONLINE       0     0     0
            ata-ST4000DM000-1F2168_W30090R8  ONLINE       0     0     0
            ata-ST4000DM000-1F2168_W3009A5D  ONLINE       0     0     0
            ata-ST4000DM000-1F2168_W300974C  ONLINE       0     0     0
            ata-ST4000DM000-1F2168_W300915V  ONLINE       0     0     0
            ata-ST4000DM000-1F2168_W3009146  ONLINE       0     0     0
            ata-ST4000DM000-1F2168_W300A3MK  ONLINE       0     0     0
            ata-ST4000DM000-1F2168_W3009NCC  ONLINE       0     0     0
            ata-ST4000DM000-1F2168_W300AHNM  ONLINE       0     0     0
            ata-ST4000DM000-1F2168_W300BZBP  ONLINE       0     0     0
            ata-ST4000DM000-1F2168_W3008JYF  ONLINE       0     0     0
          raidz2-3                           ONLINE       0     0     0
            ata-ST4000DM000-1F2168_W3008QL2  ONLINE       0     0     0
            ata-ST4000DM000-1F2168_W3008W5S  ONLINE       0     0     0
            ata-ST4000DM000-1F2168_W3008KQ7  ONLINE       0     0     0
            ata-ST4000DM000-1F2168_W300ADHE  ONLINE       0     0     0
            ata-ST4000DM000-1F2168_W3009P1R  ONLINE       0     0     0
            ata-ST4000DM000-1F2168_W300AE5J  ONLINE       0     0     0
            ata-ST4000DM000-1F2168_W3009NG0  ONLINE       0     0     0
            ata-ST4000DM000-1F2168_W3009P3J  ONLINE       0     0     0
            ata-ST4000DM000-1F2168_W300ADZK  ONLINE       0     0     0
            ata-ST4000DM000-1F2168_W3008BTB  ONLINE       0     0     0
            ata-ST4000DM000-1F2168_W3008WWS  ONLINE       0     0     0
        spares
          ata-ST4000DM000-1F2168_W3008KVP    AVAIL

errors: No known data errors
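Speaking of that hot spare: it just sits there until it’s told to do something. If a drive ever dies, swapping the spare in manually would look roughly like this (the failed-disk name below is obviously a placeholder, not a real serial from this pool):
# resilver onto the hot spare; the first device id here is hypothetical
zpool replace backup ata-ST4000DM000-1F2168_DEADDISK ata-ST4000DM000-1F2168_W3008KVP
zpool status backup   # the spare shows up as INUSE while the resilver runs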
Create and configure Filesystems
Here we’ll create a filesystem on top of our pool and disable atime.
[~]> zfs create backup/mcorral
[~]> zfs set atime=off backup/mcorral
[~]> zfs get all backup/mcorral
NAME PROPERTY VALUE SOURCE
backup/mcorral type filesystem -
backup/mcorral creation Wed Apr 8 21:45 2015 -
backup/mcorral used 37.4G -
backup/mcorral available 120T -
backup/mcorral referenced 37.4G -
backup/mcorral compressratio 1.77x -
backup/mcorral mounted yes -
backup/mcorral quota none default
backup/mcorral reservation none default
backup/mcorral recordsize 128K default
backup/mcorral mountpoint /backup/mcorral default
backup/mcorral sharenfs off default
backup/mcorral checksum on default
backup/mcorral compression on inherited from backup
backup/mcorral atime off local
backup/mcorral devices on default
backup/mcorral exec on default
backup/mcorral setuid on default
backup/mcorral readonly off default
backup/mcorral zoned off default
backup/mcorral snapdir hidden default
backup/mcorral aclinherit restricted default
backup/mcorral canmount on default
backup/mcorral xattr on default
backup/mcorral copies 1 default
backup/mcorral version 5 -
backup/mcorral utf8only off -
backup/mcorral normalization none -
backup/mcorral casesensitivity sensitive -
backup/mcorral vscan off default
backup/mcorral nbmand off default
backup/mcorral sharesmb off default
backup/mcorral refquota none default
backup/mcorral refreservation none default
backup/mcorral primarycache all default
backup/mcorral secondarycache all default
backup/mcorral usedbysnapshots 0 -
backup/mcorral usedbydataset 37.4G -
backup/mcorral usedbychildren 0 -
backup/mcorral usedbyrefreservation 0 -
backup/mcorral logbias latency default
backup/mcorral dedup off default
backup/mcorral mlslabel none default
backup/mcorral sync standard default
backup/mcorral refcompressratio 1.77x -
backup/mcorral written 37.4G -
backup/mcorral logicalused 61.6G -
backup/mcorral logicalreferenced 61.6G -
backup/mcorral snapdev hidden default
backup/mcorral acltype off default
backup/mcorral context none default
backup/mcorral fscontext none default
backup/mcorral defcontext none default
backup/mcorral rootcontext none default
backup/mcorral relatime off default
You can see the compression option gets inherited from our pool. Let’s check our compression ratio (this is after we started backing up our Swift cluster).
[~]> zfs get compressratio backup/mcorral
NAME PROPERTY VALUE SOURCE
backup/mcorral compressratio 1.77x -
That’s all folks
For now, we have backups running and ZFS just chillin’ doing its thing. I plan to report back on anything else I run into or options I decide to modify.
I’ll leave you with the Reddit comment page for the new ZoL release. Some of the commenters offer insight into the differences between native ZFS on Solaris and ZoL, which gives you a good idea of how far along ZoL is.