At work, we have a couple of Backblaze storage pods (version 3 with 4TB drives) that we use for backup purposes. They were obtained before my time because quick, bulk storage was necessary to back up our object storage platform, Swift.

Sadly, the boxes were deployed in an unsatisfactory manner: all 45 drives were pooled together in one gigantic LVM formation, meaning the death of any one disk would cause data loss.

Entering present day, I got the chance to start from scratch with one of these boxes. I decided almost instantly that ZFS would be an awesome candidate for this much storage. It could easily provide great reliability, amazing speed, and fantastic redundancy. Let's get started!

Installation: Or why does GRUB hate me.

I decided to use the latest Ubuntu 14.04.2 64-bit LTS release on this box. Note that it was running 12.04 initially, and luckily, a do-release-upgrade worked outstandingly. But there was a problem on boot up... of course :(

The below is what you'll see a lot of at boot whilst the box initializes all the devices:

[   60.332970] ata7.05: SATA link up 1.5 Gbps (SStatus 113 SControl 320)
[   60.333909] ata7.00: ATA-8: ST4000DM000-1F2168, CC52, max UDMA/133
[   60.333911] ata7.00: 7814037168 sectors, multi 16: LBA48 NCQ (depth 31/32)
[   60.334582] ata7.00: configured for UDMA/100
[   60.335493] ata7.01: ATA-8: ST4000DM000-1F2168, CC52, max UDMA/133
[   60.335494] ata7.01: 7814037168 sectors, multi 0: LBA48 NCQ (depth 31/32)
[   60.336145] ata7.01: configured for UDMA/100
[   60.337061] ata7.02: ATA-8: ST4000DM000-1F2168, CC52, max UDMA/133
[   60.337062] ata7.02: 7814037168 sectors, multi 0: LBA48 NCQ (depth 31/32)
[   60.337729] ata7.02: configured for UDMA/100
[   60.338693] ata7.03: ATA-8: ST4000DM000-1F2168, CC52, max UDMA/133
[   60.338694] ata7.03: 7814037168 sectors, multi 0: LBA48 NCQ (depth 31/32)
[   60.339359] ata7.03: configured for UDMA/100
[   60.339953] ata7.04: ATA-8: ST4000DM000-1F2168, CC52, max UDMA/133
[   60.339954] ata7.04: 7814037168 sectors, multi 0: LBA48 NCQ (depth 31/32)
[   60.340602] ata7.04: configured for UDMA/100
[   60.340649] ata7: EH complete
[   60.340703] scsi 6:0:0:0: Direct-Access     ATA      ST4000DM000-1F21 CC52 PQ: 0 ANSI: 5
[   60.340830] sd 6:0:0:0: [sdu] 7814037168 512-byte logical blocks: (4.00 TB/3.63 TiB)
[   60.340832] sd 6:0:0:0: [sdu] 4096-byte physical blocks
[   60.340866] sd 6:0:0:0: [sdu] Write Protect is off
[   60.340867] sd 6:0:0:0: [sdu] Mode Sense: 00 3a 00 00
[   60.340882] sd 6:0:0:0: [sdu] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[   60.341014] sd 6:0:0:0: Attached scsi generic sg20 type 0
[   60.341078] scsi 6:1:0:0: Direct-Access     ATA      ST4000DM000-1F21 CC52 PQ: 0 ANSI: 5
[   60.341163] sd 6:1:0:0: [sdv] 7814037168 512-byte logical blocks: (4.00 TB/3.63 TiB)
[   60.341166] sd 6:1:0:0: [sdv] 4096-byte physical blocks
[   60.341168] sd 6:1:0:0: Attached scsi generic sg21 type 0
[   60.341219] sd 6:1:0:0: [sdv] Write Protect is off

But get this: the main root disk plugged into the motherboard doesn't get /dev/sda, nor is it recognized as the first SCSI disk in the system. Instead, it inherits some outrageous /dev/sdft{1,2,5} name.

Sadly, with GRUB, a race condition exists, as the version packaged with 12.04 and up doesn't actually wait for the root device to become available before trying to load the kernel (or maybe it just thinks the proper device is always going to be sda?). Hence we consistently landed at the infamous (initramfs) prompt. Luckily, others have seen this before, and there is a solution in the kernel parameter rootdelay=90, which pauses for 90 seconds before trying to mount the root filesystem. The proper option, which I would consider an acceptable default, is rootwait, which waits for the root device to show up for as long as it takes. Sadly, for whatever reason, it didn't work for me :(

With that out of the way, here's the final GRUB config:

#/etc/default/grub - ensure you update-grub
GRUB_DEFAULT=0  
GRUB_HIDDEN_TIMEOUT_QUIET=false  
GRUB_TIMEOUT=8  
GRUB_RECORDFAIL_TIMEOUT=$GRUB_TIMEOUT  
GRUB_DISTRIBUTOR=`lsb_release -i -s 2> /dev/null || echo Debian`

GRUB_CMDLINE_LINUX_DEFAULT="rootdelay=90 console=tty0 console=ttyS0,9600n8"  
GRUB_CMDLINE_LINUX="console=tty0 console=ttyS0"  
GRUB_SERIAL_COMMAND="serial --speed=9600"

GRUB_TERMINAL="serial"  
GRUB_TERMINAL_OUTPUT="console serial"  
GRUB_TERMINAL_INPUT="console serial"  

We also disable the stupid "I'm not going to boot because last time the system went down hard/gross, and that means you probably want me to just sit here and suck my thumb" behavior; that's what the GRUB_RECORDFAIL_TIMEOUT line is for.

You can also see the serial port configuration, which I just tacked on as it already existed in the previous config. More on setting that up here.
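
After running update-grub and rebooting, it's worth confirming that rootdelay actually made it onto the kernel command line. Here's a minimal sketch of how one might check; the function name get_rootdelay is made up for illustration, and it simply reads /proc/cmdline, where the running kernel exposes its boot parameters.

```shell
#!/bin/sh
# Print the rootdelay value found on a kernel command line, or nothing if
# the parameter is absent. Reads /proc/cmdline unless another file is given.
# NOTE: get_rootdelay is a hypothetical helper, not part of any package.
get_rootdelay() {
    cmdline_file="${1:-/proc/cmdline}"
    # One parameter per line, then keep only the value after "rootdelay=".
    tr ' ' '\n' < "$cmdline_file" | sed -n 's/^rootdelay=//p'
}
```

On a box booted with the config above, calling get_rootdelay with no arguments should print the delay that is in effect; an empty result means the parameter never reached the kernel (usually a forgotten update-grub).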

Meat and Potatoes: I'm ready for you ZoL

Because of licensing, ZFS isn't included in the Linux kernel itself. Fortunately, ZFS on Linux (ZoL) exists and is pretty much everything you need to ZFS hardcore on Linux systems.

We'll use the official PPA to get the Ubuntu package ubuntu-zfs:

add-apt-repository ppa:zfs-native/stable  
apt-get update  
apt-get install ubuntu-zfs  
modprobe zfs  
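
To double check that the module really loaded before going any further, something like the following works. This is just a sketch: the helper name zfs_module_loaded is hypothetical, and all it does is look for a zfs entry in /proc/modules, which is the same data lsmod prints.

```shell
#!/bin/sh
# Succeed (exit 0) if a module named exactly "zfs" is listed as loaded.
# Reads /proc/modules unless another file is given.
# NOTE: zfs_module_loaded is a hypothetical helper name.
zfs_module_loaded() {
    modules_file="${1:-/proc/modules}"
    # Each line starts with the module name followed by a space.
    grep -q '^zfs ' "$modules_file"
}
```

For example: zfs_module_loaded && echo "zfs is loaded" || echo "zfs is NOT loaded".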

Before continuing on and envisioning how you'd like to cut and format your array, I suggest reading a few articles on the options ZFS offers.

The Layout

After some deliberation with my co-worker, we decided upon the following:

One zpool of 4 vdevs, each a RAIDZ2 of 11 disks. One disk sits on the sidelines as a hot spare. This is effectively RAID60. (11 * 4) + 1 = 45

Now the first thing you're probably thinking is: why in the hell are they using an odd number of disks for each vdev? There are about 6 million pages on the Interwebz covering this, but we talked it through and decided that whatever small loss in performance we'd suffer would be negligible next to the core goals of this box. I can't stress this enough:

Everyone has slightly different needs and goals. It's essential you really decide your priorities for what your storage must provide before blindly rolling something out that "looks" and "seems" right.

With this configuration, we can survive losing up to 8 of the "right" drives (2 in each vdev), but losing just 3 of the "wrong" drives (all in the same vdev) would take out the whole pool. I won't delve too deeply into how RAID works, but to us, this provided a good enough level of availability and usable disk space (it ended up being roughly 120TB).
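
For a rough feel of where those numbers come from, here's a back-of-the-envelope sketch in shell. The function names are made up for illustration, and the math deliberately ignores ZFS metadata, RAIDZ allocation overhead, and the TB vs TiB difference, which is presumably why the pool ends up reporting less than the raw figure.

```shell
#!/bin/sh
# Back-of-the-envelope numbers for a pool built from identical RAIDZ vdevs.
# Arguments to every function:
#   <vdevs> <disks per vdev> <parity disks per vdev> <disk size in TB>
# NOTE: these helper names are hypothetical, purely for illustration.

# Raw data capacity in TB (decimal), before any ZFS overhead.
raidz_data_tb() {
    echo $(( $1 * ($2 - $3) * $4 ))
}

# Most disk failures the pool survives, if they spread evenly across vdevs.
raidz_best_case_failures() {
    echo $(( $1 * $3 ))
}

# Fewest disk failures that kill the pool (all landing in a single vdev).
raidz_worst_case_failures() {
    echo $(( $3 + 1 ))
}

raidz_data_tb 4 11 2 4              # prints 144
raidz_best_case_failures 4 11 2 4   # prints 8
raidz_worst_case_failures 4 11 2 4  # prints 3
```

144 TB works out to about 131 TiB, so the gap to the usable figure quoted above is overhead this sketch doesn't model.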

Creation

Literally, one command to rule them all!

zpool create -f \
-o ashift=12 -O compression=on \
backup \
raidz2 \
/dev/disk/by-id/ata-ST4000DM000-1F2168_W3009P73 \
/dev/disk/by-id/ata-ST4000DM000-1F2168_W3008ANS \
/dev/disk/by-id/ata-ST4000DM000-1F2168_W3008WL3 \
/dev/disk/by-id/ata-ST4000DM000-1F2168_W3009ACP \
/dev/disk/by-id/ata-ST4000DM000-1F2168_W300CKT0 \
/dev/disk/by-id/ata-ST4000DM000-1F2168_W300AMAC \
/dev/disk/by-id/ata-ST4000DM000-1F2168_W3008KJN \
/dev/disk/by-id/ata-ST4000DM000-1F2168_W300A8ET \
/dev/disk/by-id/ata-ST4000DM000-1F2168_W30086WX \
/dev/disk/by-id/ata-ST4000DM000-1F2168_W300AM5Q \
/dev/disk/by-id/ata-ST4000DM000-1F2168_W3008WLE \
raidz2 \
/dev/disk/by-id/ata-ST4000DM000-1F2168_W300AHR6 \
/dev/disk/by-id/ata-ST4000DM000-1F2168_W300ALYN \
/dev/disk/by-id/ata-ST4000DM000-1F2168_W30088YL \
/dev/disk/by-id/ata-ST4000DM000-1F2168_W3009NFP \
/dev/disk/by-id/ata-ST4000DM000-1F2168_W300B0FK \
/dev/disk/by-id/ata-ST4000DM000-1F2168_W300CTDB \
/dev/disk/by-id/ata-ST4000DM000-1F2168_W300A8R8 \
/dev/disk/by-id/ata-ST4000DM000-1F2168_W3009NQ3 \
/dev/disk/by-id/ata-ST4000DM000-1F2168_W300959J \
/dev/disk/by-id/ata-ST4000DM000-1F2168_W3009176 \
/dev/disk/by-id/ata-ST4000DM000-1F2168_W300A6HQ \
raidz2 \
/dev/disk/by-id/ata-ST4000DM000-1F2168_W3008ERM \
/dev/disk/by-id/ata-ST4000DM000-1F2168_W30090R8 \
/dev/disk/by-id/ata-ST4000DM000-1F2168_W3009A5D \
/dev/disk/by-id/ata-ST4000DM000-1F2168_W300974C \
/dev/disk/by-id/ata-ST4000DM000-1F2168_W300915V \
/dev/disk/by-id/ata-ST4000DM000-1F2168_W3009146 \
/dev/disk/by-id/ata-ST4000DM000-1F2168_W300A3MK \
/dev/disk/by-id/ata-ST4000DM000-1F2168_W3009NCC \
/dev/disk/by-id/ata-ST4000DM000-1F2168_W300AHNM \
/dev/disk/by-id/ata-ST4000DM000-1F2168_W300BZBP \
/dev/disk/by-id/ata-ST4000DM000-1F2168_W3008JYF \
raidz2 \
/dev/disk/by-id/ata-ST4000DM000-1F2168_W3008QL2 \
/dev/disk/by-id/ata-ST4000DM000-1F2168_W3008W5S \
/dev/disk/by-id/ata-ST4000DM000-1F2168_W3008KQ7 \
/dev/disk/by-id/ata-ST4000DM000-1F2168_W300ADHE \
/dev/disk/by-id/ata-ST4000DM000-1F2168_W3009P1R \
/dev/disk/by-id/ata-ST4000DM000-1F2168_W300AE5J \
/dev/disk/by-id/ata-ST4000DM000-1F2168_W3009NG0 \
/dev/disk/by-id/ata-ST4000DM000-1F2168_W3009P3J \
/dev/disk/by-id/ata-ST4000DM000-1F2168_W300ADZK \
/dev/disk/by-id/ata-ST4000DM000-1F2168_W3008BTB \
/dev/disk/by-id/ata-ST4000DM000-1F2168_W3008WWS \
spare \
/dev/disk/by-id/ata-ST4000DM000-1F2168_W3008KVP

Quick breakdown of options:

  • -f - Since these drives used to be part of an LVM setup, we tell ZFS to disregard any previous affiliation
  • -o ashift=12 - Accounting for the 4k sectors: 2^12 = 4096 bytes, matching the 4096-byte physical blocks the drives report at boot
  • -O compression=on - Compression works really well, so we enable it for the entire zpool
  • backup - This is the name of our zpool
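
If you're unsure which ashift fits your own drives, it's just the base-2 logarithm of the physical sector size. Here's a small sketch; the function name ashift_for is made up, and it assumes the sector size is a power of two (which it is for any drive I've seen).

```shell
#!/bin/sh
# Print the ashift value for a given physical sector size in bytes,
# i.e. how many times the size can be halved before reaching 1.
# NOTE: ashift_for is a hypothetical helper; assumes a power-of-two size.
ashift_for() {
    size="$1"
    shift_bits=0
    while [ "$size" -gt 1 ]; do
        size=$(( size / 2 ))
        shift_bits=$(( shift_bits + 1 ))
    done
    echo "$shift_bits"
}

ashift_for 4096   # prints 12
ashift_for 512    # prints 9
```

On Linux the physical sector size of a disk can be read from sysfs, for example: ashift_for "$(cat /sys/block/sdu/queue/physical_block_size)".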

The rest is self-explanatory. You can see we decided to use the by-id names for the disks. Note that these references are what is actually used by ZFS to interface with your drives, so using /dev/sdX is probably a bad idea (as they aren't permanent). Personally, I prefer using by-id as I can easily see the type and serial of each disk in the array. Another great option would be to use UUIDs. ZoL mentions this in their FAQ, but I actually disagree with their recommendations on what to use for larger pools.
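
Typing 45 by-id paths by hand is error prone, so here's a sketch of how the vdev arguments could be generated instead. To be clear, this is not how the command above was built; build_vdev_args is a hypothetical helper that reads one disk path per line and groups them into raidz2 vdevs of a given size, with any leftovers becoming spares.

```shell
#!/bin/sh
# Read one disk path per line on stdin and print the vdev arguments for
# `zpool create` on a single line: every full group of $1 disks becomes a
# raidz2 vdev, and any leftover disks are listed as spares.
# NOTE: build_vdev_args is a hypothetical helper, not a ZFS tool.
build_vdev_args() {
    per_vdev="$1"
    awk -v n="$per_vdev" '
        { disk[NR] = $0 }
        END {
            full = int(NR / n) * n      # disks that fit into complete vdevs
            for (i = 1; i <= full; i++) {
                # Start a new raidz2 group every n disks.
                if ((i - 1) % n == 0) printf "%sraidz2", (i == 1 ? "" : " ")
                printf " %s", disk[i]
            }
            if (NR > full) {
                printf "%sspare", (full > 0 ? " " : "")
                for (i = full + 1; i <= NR; i++) printf " %s", disk[i]
            }
            printf "\n"
        }'
}
```

Assuming a glob like /dev/disk/by-id/ata-ST4000DM000-* matches only the whole data disks (filter out any -partN entries first), piping that list into build_vdev_args 11 would yield four raidz2 groups and one spare for 45 disks. Eyeball the output before handing it to zpool create.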

One last thing to mention is that this zpool create command took literally 10 seconds to complete. That's pretty amazing considering mdadm would have taken hours to finish setting up the array, although, granted, you are allowed to use the array whilst that process takes place.

Assessing/Status

Let's get some details about our pool!

[~]> zpool status backup
  pool: backup
 state: ONLINE
  scan: none requested
config:

        NAME                                 STATE     READ WRITE CKSUM
        backup                               ONLINE       0     0     0
          raidz2-0                           ONLINE       0     0     0
            ata-ST4000DM000-1F2168_W3009P73  ONLINE       0     0     0
            ata-ST4000DM000-1F2168_W3008ANS  ONLINE       0     0     0
            ata-ST4000DM000-1F2168_W3008WL3  ONLINE       0     0     0
            ata-ST4000DM000-1F2168_W3009ACP  ONLINE       0     0     0
            ata-ST4000DM000-1F2168_W300CKT0  ONLINE       0     0     0
            ata-ST4000DM000-1F2168_W300AMAC  ONLINE       0     0     0
            ata-ST4000DM000-1F2168_W3008KJN  ONLINE       0     0     0
            ata-ST4000DM000-1F2168_W300A8ET  ONLINE       0     0     0
            ata-ST4000DM000-1F2168_W30086WX  ONLINE       0     0     0
            ata-ST4000DM000-1F2168_W300AM5Q  ONLINE       0     0     0
            ata-ST4000DM000-1F2168_W3008WLE  ONLINE       0     0     0
          raidz2-1                           ONLINE       0     0     0
            ata-ST4000DM000-1F2168_W300AHR6  ONLINE       0     0     0
            ata-ST4000DM000-1F2168_W300ALYN  ONLINE       0     0     0
            ata-ST4000DM000-1F2168_W30088YL  ONLINE       0     0     0
            ata-ST4000DM000-1F2168_W3009NFP  ONLINE       0     0     0
            ata-ST4000DM000-1F2168_W300B0FK  ONLINE       0     0     0
            ata-ST4000DM000-1F2168_W300CTDB  ONLINE       0     0     0
            ata-ST4000DM000-1F2168_W300A8R8  ONLINE       0     0     0
            ata-ST4000DM000-1F2168_W3009NQ3  ONLINE       0     0     0
            ata-ST4000DM000-1F2168_W300959J  ONLINE       0     0     0
            ata-ST4000DM000-1F2168_W3009176  ONLINE       0     0     0
            ata-ST4000DM000-1F2168_W300A6HQ  ONLINE       0     0     0
          raidz2-2                           ONLINE       0     0     0
            ata-ST4000DM000-1F2168_W3008ERM  ONLINE       0     0     0
            ata-ST4000DM000-1F2168_W30090R8  ONLINE       0     0     0
            ata-ST4000DM000-1F2168_W3009A5D  ONLINE       0     0     0
            ata-ST4000DM000-1F2168_W300974C  ONLINE       0     0     0
            ata-ST4000DM000-1F2168_W300915V  ONLINE       0     0     0
            ata-ST4000DM000-1F2168_W3009146  ONLINE       0     0     0
            ata-ST4000DM000-1F2168_W300A3MK  ONLINE       0     0     0
            ata-ST4000DM000-1F2168_W3009NCC  ONLINE       0     0     0
            ata-ST4000DM000-1F2168_W300AHNM  ONLINE       0     0     0
            ata-ST4000DM000-1F2168_W300BZBP  ONLINE       0     0     0
            ata-ST4000DM000-1F2168_W3008JYF  ONLINE       0     0     0
          raidz2-3                           ONLINE       0     0     0
            ata-ST4000DM000-1F2168_W3008QL2  ONLINE       0     0     0
            ata-ST4000DM000-1F2168_W3008W5S  ONLINE       0     0     0
            ata-ST4000DM000-1F2168_W3008KQ7  ONLINE       0     0     0
            ata-ST4000DM000-1F2168_W300ADHE  ONLINE       0     0     0
            ata-ST4000DM000-1F2168_W3009P1R  ONLINE       0     0     0
            ata-ST4000DM000-1F2168_W300AE5J  ONLINE       0     0     0
            ata-ST4000DM000-1F2168_W3009NG0  ONLINE       0     0     0
            ata-ST4000DM000-1F2168_W3009P3J  ONLINE       0     0     0
            ata-ST4000DM000-1F2168_W300ADZK  ONLINE       0     0     0
            ata-ST4000DM000-1F2168_W3008BTB  ONLINE       0     0     0
            ata-ST4000DM000-1F2168_W3008WWS  ONLINE       0     0     0
        spares
          ata-ST4000DM000-1F2168_W3008KVP    AVAIL

errors: No known data errors  
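
With this many disks, nobody wants to eyeball that listing every day. Here's a sketch of a filter that prints only the entries that aren't healthy; the function name unhealthy_devices is made up, and it assumes the column layout shown above. (For a quick yes/no answer, zpool status -x is the built-in alternative.)

```shell
#!/bin/sh
# Read `zpool status` output on stdin and print the name of every entry in
# the config table whose state is neither ONLINE nor AVAIL. This includes
# the pool and vdev rows themselves when they are degraded.
# NOTE: unhealthy_devices is a hypothetical helper name.
unhealthy_devices() {
    awk '
        $1 == "NAME" && $2 == "STATE" { in_config = 1; next }  # table header
        /^errors:/                    { in_config = 0 }        # table is over
        in_config && NF >= 2 && $2 != "ONLINE" && $2 != "AVAIL" { print $1 }
    '
}
```

Usage would be along the lines of: zpool status backup | unhealthy_devices. No output means every disk and spare is in the state you want.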

Create and configure Filesystems

Here we'll create a filesystem on top of our pool and disable atime.

[~]> zfs create backup/mcorral
[~]> zfs set atime=off backup/mcorral
[~]> zfs get all backup/mcorral
NAME            PROPERTY              VALUE                  SOURCE  
backup/mcorral  type                  filesystem             -  
backup/mcorral  creation              Wed Apr  8 21:45 2015  -  
backup/mcorral  used                  37.4G                  -  
backup/mcorral  available             120T                   -  
backup/mcorral  referenced            37.4G                  -  
backup/mcorral  compressratio         1.77x                  -  
backup/mcorral  mounted               yes                    -  
backup/mcorral  quota                 none                   default  
backup/mcorral  reservation           none                   default  
backup/mcorral  recordsize            128K                   default  
backup/mcorral  mountpoint            /backup/mcorral        default  
backup/mcorral  sharenfs              off                    default  
backup/mcorral  checksum              on                     default  
backup/mcorral  compression           on                     inherited from backup  
backup/mcorral  atime                 off                    local  
backup/mcorral  devices               on                     default  
backup/mcorral  exec                  on                     default  
backup/mcorral  setuid                on                     default  
backup/mcorral  readonly              off                    default  
backup/mcorral  zoned                 off                    default  
backup/mcorral  snapdir               hidden                 default  
backup/mcorral  aclinherit            restricted             default  
backup/mcorral  canmount              on                     default  
backup/mcorral  xattr                 on                     default  
backup/mcorral  copies                1                      default  
backup/mcorral  version               5                      -  
backup/mcorral  utf8only              off                    -  
backup/mcorral  normalization         none                   -  
backup/mcorral  casesensitivity       sensitive              -  
backup/mcorral  vscan                 off                    default  
backup/mcorral  nbmand                off                    default  
backup/mcorral  sharesmb              off                    default  
backup/mcorral  refquota              none                   default  
backup/mcorral  refreservation        none                   default  
backup/mcorral  primarycache          all                    default  
backup/mcorral  secondarycache        all                    default  
backup/mcorral  usedbysnapshots       0                      -  
backup/mcorral  usedbydataset         37.4G                  -  
backup/mcorral  usedbychildren        0                      -  
backup/mcorral  usedbyrefreservation  0                      -  
backup/mcorral  logbias               latency                default  
backup/mcorral  dedup                 off                    default  
backup/mcorral  mlslabel              none                   default  
backup/mcorral  sync                  standard               default  
backup/mcorral  refcompressratio      1.77x                  -  
backup/mcorral  written               37.4G                  -  
backup/mcorral  logicalused           61.6G                  -  
backup/mcorral  logicalreferenced     61.6G                  -  
backup/mcorral  snapdev               hidden                 default  
backup/mcorral  acltype               off                    default  
backup/mcorral  context               none                   default  
backup/mcorral  fscontext             none                   default  
backup/mcorral  defcontext            none                   default  
backup/mcorral  rootcontext           none                   default  
backup/mcorral  relatime              off                    default  

You can see the compression option gets inherited from our pool.
Let's check our compression ratio (this is after we started backing up our Swift cluster).

[~]> zfs get compressratio backup/mcorral
NAME            PROPERTY       VALUE  SOURCE  
backup/mcorral  compressratio  1.77x  -  
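
A ratio like that is easier to reason about as "percent of space saved". Here's a small sketch of the conversion; the function name space_saved_pct is made up, and it expects the value exactly as ZFS prints it (with the trailing x).

```shell
#!/bin/sh
# Convert a ZFS compressratio value such as "1.77x" into the percentage of
# space saved, using saved = (1 - 1/ratio) * 100, rounded to one decimal.
# NOTE: space_saved_pct is a hypothetical helper name.
space_saved_pct() {
    ratio="${1%x}"   # strip the trailing "x"
    awk -v r="$ratio" 'BEGIN { printf "%.1f\n", (1 - 1 / r) * 100 }'
}

space_saved_pct 1.77x   # prints 43.5
```

In a script, the raw value can be pulled without headers via zfs get -H -o value compressratio backup/mcorral and fed straight into the function.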

That's all folks

For now, we have backups running and ZFS just chillin doing its thing. I plan to report back on any other things I run into or options I decide to modify.

I'll leave you with the Reddit comment page for the new ZoL release. Some of the commenters bring some insight into the differences between the native ZFS on Solaris and ZoL, which gives you a good idea of how far along ZoL is.

Mario Loria


./scriptthe.net

Because 127.0.0.1 gets old after a while.
