Advantages of ZFS
ZFS is a combined file system and volume manager originally developed by Sun Microsystems for Solaris, a proprietary Unix operating system (Sun has since been acquired by Oracle). Thanks to OpenZFS, ZFS is also available on Linux, where performance on non-system drives is rock solid. ZFS offers the following desirable features.
- ZFS provides enhanced data integrity. ZFS uses a “copy-on-write” system which writes new data to new blocks while preserving old data. Read/writes are verified by a checksum. Running fsck after a system crash or power interruption is not needed.
- ZFS can create a file system that spans across several drives in a pool. ZFS provides mirroring and other RAID configurations with automatic error correction. There is no need for a RAID controller or third party software.
- Any size drive can be used. Any number of drives can function as one device.
- It is easy to organize your data by creating datasets and managing them independently.
- ZFS offers reliable and efficient encryption and compression. Compression is so efficient it can actually increase the read/write speed of a drive.
- One caveat: using ZFS for system drives (drives or partitions containing the Linux operating system) is currently experimental. This feature is under active development and support continues to improve.
Installing ZFS
If you are using Linux Mint 20 or later there is nothing to install. ZFS is natively available. If you are using Ubuntu 20.04 or later and ZFS was configured during installation, there is also nothing to install.
To check whether ZFS is already included with your Linux distribution, enter the following command in the terminal (bash). You need ZFS version 0.8.3 or above.
zfs --version
For Ubuntu 20.04, if ZFS was not configured during installation, it is easy to install ZFS as described below. For Linux Mint, also use the commands below for versions of Mint earlier than Mint 20.
sudo apt update
sudo apt install zfsutils-linux
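If you want to script the minimum-version check, here is a small sketch. It assumes `zfs --version` prints a first line of the common form "zfs-0.8.3-1ubuntu12"; the version comparison itself is done with `sort -V`.

```shell
#!/bin/sh
# ver_ge A B -> succeeds (exit 0) if version A >= version B.
ver_ge() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n1)" = "$2" ]
}

required="0.8.3"
# Strip the leading "zfs-" prefix and any trailing "-distro" suffix.
installed=$(zfs --version 2>/dev/null | head -n1 | sed 's/^zfs-//; s/-.*//')

if [ -n "$installed" ] && ver_ge "$installed" "$required"; then
    echo "OK: ZFS $installed meets the $required minimum"
else
    echo "ZFS missing or older than $required"
fi
```

The `sed` expression is tied to the output format assumed above; adjust it if your distribution prints the version differently.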
Preparing the External Drive
Place your external drive in a hard drive docking station or hard drive enclosure. Power on the drive, wait until it spins up, and then connect it to your machine via USB or eSATA. I used a StarTech Hard Drive Docking Station SATDOCK22U3S with a 10TB Western Digital Red drive connected by USB.
Warning: the following procedure will destroy all files on the external drive. Make sure you know which drive is your external drive.
Use the Linux program Gparted to locate your external drive. The drive should not be partitioned and all space should be unallocated. Remove partitions if necessary. If you are starting with a new unused drive, you need to create a GPT partition table. This can be done using Gparted: Device > Create Partition Table.
Good record keeping is important. It is best to name each drive and write the name on the drive with a sharpie being careful not to cover up the serial number. Keep a list of all your drives with a description of the data they contain and the ZFS commands to manage them.
Creating the Pool
Start by defining a pool which encompasses your storage area, typically one or several drives. [Later on we will create datasets, which live inside the zpool, that contain your data.] In ZFS lingo, the zpool is made up of one or more vdevs. Each vdev is usually a group of drives that work together (a RAID group for example). For this tutorial, yourpool contains one vdev containing a single drive which is the simplest way to create a backup disk.
Use the Disks program, Gparted, or other disk utility program to find the name of the external drive that you want to format with ZFS. In this tutorial we will format sdb.
Warning: when you create the zpool using the next command, it will wipe all existing data off this drive. In the following command you must replace sdb with the name of your external drive.
sudo zpool create yourpool /dev/sdb
In the above, I named the pool “yourpool”. You can use any name you want. Whatever name you choose, it is good practice to make the name end in “pool”. This will remind you that it is a pool name, for example “jennypool”.
Use the Disks program or Gparted to view your drive. You will see that when the pool was created, it made two partitions on your drive. The first partition is a ZFS pool which takes up almost the entire drive, in our case sdb1. The second partition sdb2 is marked “Solaris Reserved” which is where ZFS does its housekeeping. For now, we will reference the pool as being on sdb1.
Recharacterizing the Pool
Your pool was created on sdb1. You may be concerned about what will happen if someday you want to connect this external drive to another machine. Users have little control over what names Linux assigns to drives. On the other machine, the drive containing the pool may be assigned to sdc1, not sdb1. Will the pool be found? This is a real problem with external drives. Luckily, there is an easy fix. Instead of referring to the pool location with the common name of the partition sdb1, we will refer to it with a name which contains the serial number of the drive. The serial number will not change no matter what machine the drive is connected to. It will always be the same as the serial number on the paper label affixed to the drive. Look in this directory.
cd /dev/disk/by-id
ls -alF
Here you can find the “by-id” name of your pool partition. This name appears as a soft link to sdb1. The name contains the type of drive, the model number, the serial number and the partition number. Here is an example:
ata-WDC_WD101EFAX-68LDBN0_VCG88VNN-part1 -> ../../sdb1
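If the listing is long, a short sketch like the following can pick out exactly which by-id names resolve to a given partition. The directory and partition names in the final call are examples; substitute your own.

```shell
#!/bin/sh
# by_id_names DIR TARGET
# Print the names of symlinks in DIR that resolve to the same device
# node as TARGET (e.g. which /dev/disk/by-id entries point at sdb1).
by_id_names() {
    dir="$1"; target="$2"
    for link in "$dir"/*; do
        [ -L "$link" ] || continue
        if [ "$(readlink -f "$link")" = "$(readlink -f "$target")" ]; then
            basename "$link"
        fi
    done
}

# Example usage -- replace sdb1 with your own partition:
by_id_names /dev/disk/by-id /dev/sdb1
```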
Now we will recharacterize the pool partition with its by-id name. To do this we will use a trick. This only has to be done once.
Export the pool so that it will no longer be recognized by your machine.
sudo zpool export yourpool
zpool status
Import the pool so that the machine will once again recognize the pool. Use the by-id name of your pool.
sudo zpool import -d /dev/disk/by-id/ata-WDC_WD101EFAX-68LDBN0_VCG88VNN-part1 yourpool -N
zpool status
The zpool status command should show that the pool is now being referred to with its by-id name. Always use an import command like the above when you want to import a pool. Never use a common name like sdb1 again.
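One advantage of the by-id name is that the serial number embedded in it can be matched against the paper label on the drive. Assuming the common "ata-MODEL_SERIAL-partN" layout shown earlier, plain shell parameter expansion is enough to pull the pieces apart:

```shell
#!/bin/sh
# Extract model and serial number from a by-id name of the form
# ata-MODEL_SERIAL-partN (the example value is from this tutorial).
byid="ata-WDC_WD101EFAX-68LDBN0_VCG88VNN-part1"

stripped=${byid#ata-}          # drop the bus prefix
stripped=${stripped%-part*}    # drop the partition suffix
serial=${stripped##*_}         # serial is the last underscore-separated field
model=${stripped%_"$serial"}   # everything before the serial is the model

echo "model=$model serial=$serial"
```

This is a sketch tied to that naming layout; drives on other buses (nvme-, scsi-, wwn- prefixes) will need adjusted patterns.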
Adding Compression to the Pool
You can add compression to your pool and I would encourage you do so. Do not be worried that compression will slow down your read/write speed. In most cases, compression actually speeds up a ZFS disk. This is because the performance boost of reduced file size more than offsets the performance cost of compression/decompression.
sudo zfs set compression=on yourpool
The only case where you may not want to use compression is when most of your data is photos or video. The compression ratio for this type of data is low for all file systems including ZFS. In this case, compression will not reduce the data size significantly.
Planning to Mount the Datasets
A dataset is a space where you put your data. Datasets are flexible in size and are located inside your pool. In this tutorial we chose “yourdataset” as the name of your dataset.

It is important to mount your datasets in a good location. ZFS will allow you to mount your datasets anywhere you choose, so choose wisely. The default is the directory /yourpool/yourdataset. In most cases this is NOT the best place to mount your ZFS dataset.

The default mountpoint can cause catastrophic problems for people who use Timeshift or other backup programs. Timeshift automatically takes snapshots of your system (OS) so that the system can be rolled back if an upgrade or new package install damages it. When Timeshift takes a snapshot, it automatically excludes the /mnt and /media directories, which are common mountpoints for external drives. It is best practice to mount your ZFS datasets in one of these directories so that they too will be excluded by Timeshift. In most cases you do not want a Timeshift snapshot of an external drive, particularly if the external drive is a backup drive. A particular problem that the suggested mountpoint avoids is writing the entire contents of your external drive to a Timeshift snapshot stored on an internal drive, perhaps beyond that drive’s capacity. This can cause your system to crash.
The easiest way to set the mountpoint of your datasets is to set the mountpoint of the pool to which the datasets will belong. Set the pool’s mountpoint immediately after the pool is created. All datasets created later in this pool will “inherit” that mountpoint.
sudo zfs set mountpoint=/mnt/yourpool yourpool
All new datasets created in yourpool will inherit the new mountpoint. In our case, yourdataset will be mounted on /mnt/yourpool/yourdataset.
Creating Encrypted Datasets
Datasets can be used to help you organize your data. For example, consider the following scenarios.
- You need to control access to two different types of data, for example accounting data and medical data. You can put the data in separate datasets and encrypt them with different passwords.
- You want to mount different types of data in different directories. Create several datasets, each with a different mountpoint.
- You use “Back In Time” to back up your home directory to an external drive. You want to encrypt your data, but do not want to use the default EncFS encryption method, which is known to have security issues: https://backintime.readthedocs.io/en/latest/settings.html#local-encrypted Instead, back up to a secure ZFS aes-256 encrypted dataset.
In this tutorial, we will create an encrypted dataset named “yourdataset”. When the dataset is created you will be prompted to create a passphrase. The encryption method is aes-256. Remember that encryption only provides security for “data at rest” (for datasets that are not mounted).
sudo zfs create -o encryption=on -o keylocation=prompt -o keyformat=passphrase yourpool/yourdataset
The mountpoint of yourdataset is /mnt/yourpool/yourdataset (the mountpoint was inherited from yourpool). Verify this.
As discussed above, this directory is a good safe mountpoint. However, if it does not suit your needs, you can change it with the following command. [Do not bother to create the directory of your new mountpoint prior to running this command. ZFS will create the directory for you.]
sudo zfs set mountpoint=/any/path/you/want yourpool/yourdataset
A note on terminology. When reading about ZFS on the internet, the terms “dataset” and “file system” are often used interchangeably, which can be confusing. A ZFS “file system” is actually a type of dataset. It is the only type of dataset that is considered in this tutorial. [Other types include ZFS snapshots and clones.] The term “file system” has a very different meaning outside the ZFS world and its use in this tutorial could be confusing. I chose to use the more general term “dataset” instead of “file system” to make this tutorial more understandable.
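When `zfs create` prompts for a passphrase, ZFS requires it to be at least 8 characters (up to 512 bytes). A tiny local check like the sketch below, run before the interactive `zfs create`, avoids a failed round trip; the function name and example phrase are illustrative, not part of ZFS.

```shell
#!/bin/sh
# passphrase_ok PHRASE -> succeeds if PHRASE meets the ZFS 8-character
# minimum for keyformat=passphrase.
passphrase_ok() {
    [ "${#1}" -ge 8 ]
}

# Example usage:
if passphrase_ok "correct horse battery staple"; then
    echo "passphrase accepted"
else
    echo "passphrase too short (minimum 8 characters)"
fi
```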
Mounting Datasets
ZFS makes mounting/unmounting easy. You do not need to manage ZFS entries in your /etc/fstab file. Each ZFS dataset is mounted according to properties of that dataset. Before mounting a dataset, you should check to see if the pool has been imported.
If the machine has been restarted (or the drive has been moved to another machine) you need to import the pool.
sudo zpool import -d /dev/disk/by-id/ata-WDC_WD101EFAX-68LDBN0_VCG88VNN-part1 yourpool -N
Mount the dataset. Use the -l option only for encrypted datasets.
sudo zfs mount -l yourpool/yourdataset
Now you can read/write to your dataset. When datasets are first mounted they are owned by root. You may want to change the file permissions.
Using the External Drive on Any Machine
Now that the external drive has been properly formatted with ZFS, it can be moved and used on a different machine. All you have to do is connect the drive, import the pool, and mount the dataset (providing a password if required).
sudo zpool import -d /dev/disk/by-id/ata-WDC_WD101EFAX-68LDBN0_VCG88VNN-part1 yourpool -N
sudo zfs mount -l yourpool/yourdataset
That’s all it boils down to. Just two commands. You may want to keep a list of your drives with the particular commands needed for each drive. Two quick cut and pastes into the terminal and your drive is connected.
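Instead of a cut-and-paste list, you could keep one small "connect" script per drive. The sketch below is one way to do it (the script name, variable names, and dry-run flag are my own conventions, not ZFS features); by default it only prints the commands so you can sanity-check the list, and passing --run executes them with sudo.

```shell
#!/bin/sh
# connect-yourpool.sh -- import and mount one specific external drive.
# Default is a dry run that prints the commands; pass --run to execute.
BYID=/dev/disk/by-id/ata-WDC_WD101EFAX-68LDBN0_VCG88VNN-part1
POOL=yourpool
DATASET=yourpool/yourdataset

# run CMD... -> echo the command in dry-run mode, otherwise sudo it.
run() {
    if [ "$DRY_RUN" = "yes" ]; then echo "$@"; else sudo "$@"; fi
}

[ "$1" = "--run" ] || DRY_RUN=yes

run zpool import -d "$BYID" "$POOL" -N
run zfs mount -l "$DATASET"
```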
Unmounting and Powering Down
If our OS were Windows, we would “eject” the external drive before disconnecting the USB or eSATA adaptor. In Linux Mint, for an external ZFS drive, there are a few more steps.
Flush memory and write all data to your drives.
sudo zfs unmount yourpool/yourdataset
sudo zpool export yourpool
The export command is mandatory if you are going to move the external drive to a different machine.
To power down the drive, open the Disks program, select the drive, and push the power button symbol at the top of the screen.
Using the steps above can be considered a “best practice”. However, if you omit some of these steps, chances are that you will get away with it. ZFS is very robust. I was once in the middle of writing a large file to a ZFS dataset when my system crashed and I had to do an emergency reboot. This was a worst-case scenario in the sense that the pool was online, the dataset was mounted, and unwritten data was in memory. I had no issues after rebooting. The pool imported and the dataset mounted with no problems. I lost only the file that was being written at the time of the crash.
Checking Data Integrity
In our quest for data integrity, our goal is to identify and replace drives that are likely to begin developing unrecoverable errors. Almost all modern drives support SMART, a monitoring system that reports hard drive problems. It is important to know that you can (and should!) monitor your drives using both SMART and ZFS. A rule of thumb is to SMART test your drives once per month and ZFS test (scrub) your drives every two weeks. SMART is best used as an early warning system. ZFS is better at preventing problems in the first place and often better at indicating what action to take when problems develop. With ZFS, before discarding a bad disk it is often possible to find out which file(s) have been corrupted. There may be very valuable files on the bad disk that are not corrupted and can still be recovered.
SMART Monitoring
There are many programs that will extract and display SMART data. For Linux (or Windows), I recommend Gsmartcontrol which is a GUI interface for Smartmontools. SMART reports a long list of drive attributes. These are the attributes that are most critical:
- ID 5 – Reallocated_Sector_Count
- ID 187 – Reported_Uncorrectable_Errors
- ID 196 – Reallocation_Event_Count
- ID 197 – Current_Pending_Sector_Count
- ID 198 – Offline_Uncorrectable
Interpreting SMART data is a bit subjective, but most companies will replace drives which have a non-zero count in ID# 187 or ID# 198 (expressed as a non-zero raw value). If this has occurred, the drive hardware ECC has been unable to fix an error by repeatedly trying to read the data and relocate it to a good sector. In other words, there is bad data on the drive which the drive’s internal correction mechanism cannot fix. The other attributes listed above are a measure of the number of times data has been or will be reallocated to a healthy sector. Consider replacing the drive if these numbers are high or rapidly increasing.
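If you prefer the command line to GSmartControl, the critical attributes can be filtered out of `smartctl -A` output with awk. The sketch below runs against a trimmed, made-up sample table so it can be tried without a real drive; on a live system, replace the sample with the output of `sudo smartctl -A /dev/sdb` (attribute names vary slightly by vendor).

```shell
#!/bin/sh
# Filter the critical SMART attribute IDs (5, 187, 196, 197, 198) and
# print their raw values (the last column of smartctl's attribute table).
sample='  5 Reallocated_Sector_Ct   0x0033   100   100   140    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       3
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0'

echo "$sample" | awk '$1==5 || $1==187 || $1==196 || $1==197 || $1==198 {
    printf "ID %s (%s): raw value %s\n", $1, $2, $NF
}'
```

In this made-up sample, the non-zero Current_Pending_Sector count is exactly the kind of number you would watch for growth.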
ZFS Monitoring with Scrub
Scrub drives to check data integrity using this command.
sudo zpool scrub yourpool
Only one ZFS pool can be scrubbed at a time. The pool can remain online and you can continue to work on your machine while scrubbing. However, scrubbing can slow down your system, so it is best to perform scrubs at night.
If you want to stop scrubbing, use this command.
sudo zpool scrub -s yourpool
If you restart a scrub, it will not resume where the previous scrub left off. You will be starting the scrub all over again.
When you check the status of your pool, you will see the results of your last scrub.
sudo zpool status -v yourpool
If any errors were found, you will be given the status of each drive and a recommended action for repairing the errors. For details see https://illumos.org/books/zfs-admin/gavwg.html#gbbzs
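A quick way to eyeball the last scrub result is to look for the "with 0 errors" phrase on the scan line of zpool status. The sketch below uses a canned sample of that output so it runs without a pool; on a live system, pipe `zpool status yourpool` into the same grep.

```shell
#!/bin/sh
# Check the scrub result line from zpool status output.
# The sample text stands in for real `zpool status yourpool` output.
sample='  pool: yourpool
 state: ONLINE
  scan: scrub repaired 0B in 10:25:31 with 0 errors on Sun Mar  6 03:25:32 2022
errors: No known data errors'

if echo "$sample" | grep -q 'with 0 errors'; then
    echo "last scrub completed with no errors"
else
    echo "check zpool status -v for scrub errors"
fi
```

This is only a convenience check; `zpool status -v` remains the authoritative report of which drives and files are affected.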
For users who want to test ZFS to see how well it can detect and repair errors, there is a ZFS utility called zinject which creates artificial problems in a ZFS pool by simulating data corruption or device failures. This program is very dangerous and should not be used with pools containing valuable information.