
Trouble in ZFS land

posted Aug 13, 2017, 1:20 AM by ms b0b
An interesting incident over the last month took out my storage. There were several lessons learned, so a write-up is deserved.

Around Independence Day, I decided to reorganize the storage server for the following reasons:
  1. Completely fill all 8 drive bays with 2TB hard disk drives
  2. Improve IOPS in VM
To accomplish 1, I purchased an adapter that allows me to mount the L2ARC SSD in the R510's 12.7 mm optical bay. That went without a hitch.

To accomplish 2, I was torn between four striped mirrors and two striped RAIDZ2 sets. Striped mirrors are the ZFS equivalent of RAID10, which is known for the best IOPS among traditional RAID setups. Striping two RAIDZ2 sets together would let me lose two disks in each set without losing data. I decided to go with striped mirrors.

Here's my plan on how to make the migration.
  1. Insert the two extra 2TB hard disk drives.
  2. Create a new pool that stripes the two new hard disk drives for 3.6TB of effective space. About 3TB is in use, so this pool is just big enough.
  3. Use the zfs send command to copy data from the live pool to the backup pool.
  4. Stop the SMB, iSCSI and tftp services.
  5. Destroy the main pool.
  6. Recreate the main pool as striped 3x mirrors.
  7. Use the zfs recv command to copy data from the backup pool to the live pool.
  8. Restart the services.
  9. Verify file system integrity.
  10. Destroy the backup pool.
  11. Add the backup disks as mirror to the live pool.
  12. Done.
Sounds easy...

At step 7, the zfs recv command returned "invalid stream (bad magic number)". What? Panic. And thus started my crash course on how zfs send/recv works.

The zstreamdump tool showed that one of the BEGIN blocks did not have the correct magic number of 0x00000002f5bacbac. Because the x86-64 architecture is little-endian, that 64-bit value is stored on disk as the byte sequence ac cb ba f5 02 00 00 00.
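To make the byte order concrete, here is a minimal sketch that packs the magic number the way a little-endian machine writes it and searches a chunk of a stream file for it. The constant name and both helper functions are illustrative names chosen for this example, not part of any ZFS tool.

```python
import struct

# Magic number quoted in the text above (the constant name is illustrative).
STREAM_MAGIC = 0x00000002F5BACBAC


def magic_bytes_little_endian(magic: int = STREAM_MAGIC) -> bytes:
    """Return the 8 bytes a little-endian machine writes for the 64-bit magic."""
    return struct.pack("<Q", magic)


def find_magic_offsets(blob: bytes) -> list:
    """Return every byte offset in blob where the little-endian magic appears."""
    needle = magic_bytes_little_endian()
    offsets = []
    start = blob.find(needle)
    while start != -1:
        offsets.append(start)
        start = blob.find(needle, start + 1)
    return offsets


if __name__ == "__main__":
    print(magic_bytes_little_endian().hex())
```

Running a search like this over the first few megabytes of the file is a quick way to see whether each magic number sits where a BEGIN record expects it, or whether stray text has pushed it out of place.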

Opening the first part of the stream file in a hex editor showed that the zfs send output was interleaved with the TIME SENT SNAPSHOT header and every progress callout.

Looking back at step 3, the command I used was zfs send -Rv livepool > livepool.zfssend. To prevent the job from aborting when I logged out of the shell, I put nohup in front. Looks good, right? Wrong.

With nohup in front, the redirection applies to the nohup command rather than to zfs send alone. When stderr is a terminal, nohup points it at the same place as stdout, so the progress callouts that zfs send -v writes to stderr ended up in the same file as the stream itself. The spurious text threw off the ZFS binary stream reader.
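The failure mode can be reproduced without ZFS. The sketch below uses a small stand-in child process (an assumption for illustration, not zfs send itself) that writes binary data to stdout and a progress line to stderr, then compares what lands in the "stream file" when stderr is merged into stdout, as nohup does when stderr is a terminal, against what lands there when the two are kept apart.

```python
import subprocess
import sys

# Stand-in for `zfs send -v`: binary payload on stdout, progress text on stderr.
CHILD = (
    "import sys;"
    "sys.stdout.buffer.write(bytes([0xAC, 0xCB, 0xBA, 0xF5]));"
    "sys.stdout.flush();"
    "sys.stderr.write('01:02:03   12.3T   livepool/dataset@snapshot\\n')"
)


def capture(merge_stderr: bool) -> bytes:
    """Run the stand-in child and return the bytes that reach its stdout target."""
    result = subprocess.run(
        [sys.executable, "-c", CHILD],
        stdout=subprocess.PIPE,
        # STDOUT mimics nohup pointing stderr at the same file as stdout.
        stderr=subprocess.STDOUT if merge_stderr else subprocess.PIPE,
        check=True,
    )
    return result.stdout


if __name__ == "__main__":
    print("merged:  ", capture(merge_stderr=True))
    print("separate:", capture(merge_stderr=False))
```

In shell terms, the equivalent precaution is to give stderr its own destination (for example a separate log file) so it never shares a file with the binary stream.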

I whipped up a program to strip out the progress callouts. The pattern was consistently "hh:mm:ss   12.3T   livepool/dataset@snapshot\n", which made it easier to process.
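The original program is not shown, so the following is only a sketch of the idea, assuming the callouts match the pattern quoted above and that dataset and snapshot names use only letters, digits and the characters _ . : / -. It works on an in-memory byte string; a multi-terabyte file would need to be processed in chunks with care at chunk boundaries, and there is always some risk that binary data happens to look like a progress line.

```python
import re

# "hh:mm:ss   12.3T   pool/dataset@snapshot\n" (names restricted by assumption).
PROGRESS_LINE = re.compile(
    rb"\d{2}:\d{2}:\d{2}[ \t]+[0-9.]+[BKMGTP]?[ \t]+"
    rb"[A-Za-z0-9_.:/-]+@[A-Za-z0-9_.:-]+\n"
)
# Column header printed before the progress lines.
HEADER_LINE = re.compile(rb"TIME[ \t]+SENT[ \t]+SNAPSHOT\n")


def strip_progress(blob: bytes) -> bytes:
    """Remove header and progress callout lines from a polluted stream."""
    without_header = HEADER_LINE.sub(b"", blob)
    return PROGRESS_LINE.sub(b"", without_header)


if __name__ == "__main__":
    payload = bytes([0xAC, 0xCB, 0xBA, 0xF5, 0x02, 0x00, 0x00, 0x00])
    dirty = (
        b"TIME        SENT   SNAPSHOT\n"
        + payload[:4]
        + b"01:02:03   12.3T   livepool/dataset@snapshot\n"
        + payload[4:]
    )
    print(strip_progress(dirty) == payload)
```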

Then the error moved from the start of the file to a bad checksum at around the 2.6 TB mark. Since the checksum is verified at every block, that means the first 2.6 TB of data is intact. The checksum error sits where the first snapshot ends and the second snapshot starts. zfs send orders the data from the oldest snapshot to the newest; each newer snapshot is a log of changes against the previous one.

If I could insert the correct END records so that zfs recv recognizes the end of the stream, maybe the directory structure would be created properly. Using the stream of an empty dataset snapshot as a reference, I inserted two back-to-back END records: one with a checksum and one with zeros as payload. I used test runs to tweak my way toward the correct checksum. Each test run took 3 hours, so it became very tedious.
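An alternative to trial and error is to compute the expected checksum offline. The sketch below assumes the stream uses the Fletcher-4 checksum found elsewhere in ZFS (four 64-bit accumulators updated over little-endian 32-bit words) and that an END record carries the running checksum of the bytes before it; both assumptions should be checked against the ZFS source before relying on the result. Pure Python would be slow over terabytes, so treat this as a reference for the arithmetic only.

```python
import struct

MASK64 = (1 << 64) - 1  # accumulators wrap at 64 bits


def fletcher4(data: bytes, state=(0, 0, 0, 0)):
    """Update a Fletcher-4 state with data, read as little-endian 32-bit words.

    Passing the previous return value as `state` continues a running checksum.
    """
    if len(data) % 4 != 0:
        raise ValueError("data length must be a multiple of 4 bytes")
    a, b, c, d = state
    for (word,) in struct.iter_unpack("<I", data):
        a = (a + word) & MASK64
        b = (b + a) & MASK64
        c = (c + b) & MASK64
        d = (d + c) & MASK64
    return a, b, c, d


if __name__ == "__main__":
    print(fletcher4(struct.pack("<2I", 1, 2)))
```

Because the state can be carried from one call to the next, the checksum can be accumulated record by record while walking the stream file.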

Finally, one month after the original incident, zfs recv reported success: the file storage dataset snapshot was restored. Browsing the directories confirmed that most files were present. What lay between 2.6 TB and 3.2 TB? More snapshots of the file system, plus the two logical volumes used as iSCSI targets. Those only held test VMs with no important data, and I can recreate them if needed.

Lessons learned:
  1. Don't destroy the original pool until you have tested that the restore works.
  2. When possible replicate the ZFS pool instead of using a file. Alternatively, copy the files to a different file system.
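As a sketch of the second lesson, the following pipes zfs send straight into zfs recv on the backup pool, so no intermediate file exists for stray text to end up in. The snapshot name livepool@migrate and the pool name backuppool are hypothetical, and the flag choice (-R on send, -F and -d on recv) is one reasonable combination rather than the only one.

```python
import subprocess


def build_replication_commands(snapshot: str, target_pool: str):
    """Return the argv lists for a recursive send and the matching receive."""
    send_cmd = ["zfs", "send", "-R", snapshot]
    recv_cmd = ["zfs", "recv", "-F", "-d", target_pool]
    return send_cmd, recv_cmd


def replicate(snapshot: str, target_pool: str) -> None:
    """Pipe zfs send directly into zfs recv; stderr stays out of the stream."""
    send_cmd, recv_cmd = build_replication_commands(snapshot, target_pool)
    sender = subprocess.Popen(send_cmd, stdout=subprocess.PIPE)
    receiver = subprocess.Popen(recv_cmd, stdin=sender.stdout)
    sender.stdout.close()  # let the sender see a broken pipe if recv exits
    recv_rc = receiver.wait()
    send_rc = sender.wait()
    if send_rc != 0 or recv_rc != 0:
        raise RuntimeError(f"replication failed: send={send_rc}, recv={recv_rc}")


if __name__ == "__main__":
    # Hypothetical names; only print the commands rather than run them.
    print(build_replication_commands("livepool@migrate", "backuppool"))
```

With the data received into a real pool, the first lesson becomes easy to follow: the copy can be browsed and scrubbed before the original pool is destroyed.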
Quite an adventure, and I can answer all sorts of ZFS send/receive stream questions until this knowledge gets pushed out by new knowledge.