To get familiar with configuring a zfs-based Lustre filesystem, it is helpful to set up a small single-node filesystem. This configuration is designed to illustrate several of the available options. If you're not already familiar with configuring a traditional ldiskfs-based Lustre filesystem, you should review the Lustre documentation first.

The filesystem will be named lustre and it will consist of:

  • 1 MGS: 128 MiB zfs file-backed 2-vdev mirror array
  • 1 MDT: 128 MiB zfs file-backed 2-vdev mirror array
  • 2 OSTs: each a 1 GiB zfs file-backed 3-vdev raidz2 array

First, ensure SELinux is disabled. The filesystem types ldiskfs, zfs, and lustre are not part of the default SELinux policy, which can cause mount failures because SELinux is unaware that these filesystems support xattrs. To disable SELinux, edit /etc/selinux/config and change the SELINUX line to SELINUX=disabled. Then reboot your system to remove the existing policy.
$ cat /etc/selinux/config
# This file controls the state of SELinux on the system.
# SELINUX= can take one of these three values:
#     enforcing - SELinux security policy is enforced.
#     permissive - SELinux prints warnings instead of enforcing.
#     disabled - No SELinux policy is loaded.
SELINUX=disabled
# SELINUXTYPE= can take one of these two values:
#     targeted - Targeted processes are protected,
#     mls - Multi Level Security protection.
SELINUXTYPE=targeted
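
The edit described above can also be scripted. The following is a minimal sketch, not part of the original procedure: the helper names selinux_config_state and selinux_config_disable are hypothetical, and it assumes GNU sed (for the -i option). The demonstration runs against a scratch copy so that nothing under /etc is touched.

```shell
#!/bin/sh
# Print the value of the active SELINUX= line in an SELinux config file.
# The path defaults to /etc/selinux/config; pass another path to inspect a copy.
selinux_config_state() {
    config_file="${1:-/etc/selinux/config}"
    # '^SELINUX=' skips comment lines and does not match SELINUXTYPE=.
    sed -n 's/^SELINUX=\([a-z]*\).*/\1/p' "$config_file" | tail -n 1
}

# Rewrite the SELINUX= line to "disabled" in the given file (assumes GNU sed -i).
selinux_config_disable() {
    config_file="${1:-/etc/selinux/config}"
    sed -i 's/^SELINUX=.*/SELINUX=disabled/' "$config_file"
}

# Demonstration on a scratch file rather than the real config.
demo=$(mktemp)
printf '%s\n' '# example config' 'SELINUX=enforcing' 'SELINUXTYPE=targeted' > "$demo"
selinux_config_state "$demo"      # state before the edit
selinux_config_disable "$demo"
selinux_config_state "$demo"      # state after the edit
rm -f "$demo"
```

To apply it to the real file you would run the disable helper as root with no argument and then reboot. After the reboot, the getenforce command should report Disabled.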

Next, configure the required Lustre services. This process differs slightly from the one described in the official Lustre documentation. Aside from the zfs-specific changes, these packages contain Livermore's init scripts, which provide a traditional interface for starting and stopping services. They are also integrated with the Linux Heartbeat package to provide automatic Lustre failover. While you do not have to use the init scripts, this example does.

Create a 128 MiB MGS, a 128 MiB MDT, and two 1 GiB OSTs. To specify zfs-based services, use the --backfstype=zfs option to override the ldiskfs default. When testing, you can also include the --vdev-size option to automatically create sparse files for vdevs, which avoids the need to set up traditional loopback devices. The sizes passed below correspond to KiB (131072 KiB = 128 MiB, 1048576 KiB = 1 GiB). Next, for MDT and OST services you must explicitly set the service index and the NID of the MGS host. Finally, you must list the dataset name and the desired zfs pool configuration.
$ sudo mkfs.lustre --mgs --backfstype=zfs --vdev-size 131072 \
  lustre-mgs/mgs mirror /tmp/lustre-mgs-disk0 /tmp/lustre-mgs-disk1

$ sudo mkfs.lustre --mdt --backfstype=zfs --vdev-size 131072 --index=0 --mgsnode=192.168.1.100@tcp \
  lustre-mdt0/mdt0 mirror /tmp/lustre-mdt0-disk0 /tmp/lustre-mdt0-disk1

$ sudo mkfs.lustre --ost --backfstype=zfs --vdev-size 1048576 --index=0 --mgsnode=192.168.1.100@tcp \
  lustre-ost0/ost0 raidz2 /tmp/lustre-ost0-disk0 /tmp/lustre-ost0-disk1 /tmp/lustre-ost0-disk2

$ sudo mkfs.lustre --ost --backfstype=zfs --vdev-size 1048576 --index=1 --mgsnode=192.168.1.100@tcp \
  lustre-ost1/ost1 raidz2 /tmp/lustre-ost1-disk0 /tmp/lustre-ost1-disk1 /tmp/lustre-ost1-disk2

$ sudo zfs list
NAME               USED  AVAIL  REFER  MOUNTPOINT
lustre-mdt0        128K  90.9M    30K  /lustre-mdt0
lustre-mdt0/mdt0    20K  90.9M    20K  /lustre-mdt0/mdt0
lustre-mgs         128K  90.9M    30K  /lustre-mgs
lustre-mgs/mgs      20K  90.9M    20K  /lustre-mgs/mgs
lustre-ost0        127K  90.9M    29K  /lustre-ost0
lustre-ost0/ost0    20K  90.9M    20K  /lustre-ost0/ost0
lustre-ost1        130K   158M    30K  /lustre-ost1
lustre-ost1/ost1    20K   158M    20K  /lustre-ost1/ost1
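
As a quick arithmetic check on the --vdev-size values used in the mkfs.lustre commands above, the sketch below converts the stated sizes from MiB to KiB. The helper name mib_to_kib is hypothetical. The assumption that the option takes KiB is inferred from the numbers in this example (131072 for 128 MiB), so confirm it against your mkfs.lustre version.

```shell
#!/bin/sh
# Convert a size in MiB to KiB, the unit the --vdev-size values in this
# example appear to use (assumption inferred from 131072 <-> 128 MiB).
mib_to_kib() {
    echo $(( $1 * 1024 ))
}

mib_to_kib 128     # per-vdev size for the MGS and MDT mirrors
mib_to_kib 1024    # per-vdev size for the OST raidz2 arrays (1 GiB)
```

The two values printed match the 131072 and 1048576 arguments passed to mkfs.lustre in this example.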

You may have noticed that we created a separate zfs pool and dataset for each service. While all the services can be created in a single zfs pool, doing so causes the available free space to be over-reported: just like normal zfs filesystems, each dataset may use any available free space in the pool.

Now create a file called /etc/ldev.conf. It is used by the init scripts and the failover infrastructure to control the Lustre services. At a minimum, this file must contain the hostname where the service should run, the service label, and the ldiskfs block device or zfs dataset name. Make sure you update this file with your hostname.

$ cat /etc/ldev.conf
# example /etc/ldev.conf
#
# local  foreign/-  label       [md|zfs:]device-path     [journal-path]
#
toss-2-0-amd64 - mgs     zfs:lustre-mgs/mgs
toss-2-0-amd64 - mdt0    zfs:lustre-mdt0/mdt0
toss-2-0-amd64 - ost0    zfs:lustre-ost0/ost0
toss-2-0-amd64 - ost1    zfs:lustre-ost1/ost1
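
Because the hostname column must match the local machine, it can be convenient to generate these lines rather than type them. The following is a minimal sketch under stated assumptions: the helper name ldev_lines is hypothetical, it assumes the lustre-<label>/<label> pool and dataset naming used in this example, and it writes "-" in the foreign column because this single-node setup has no failover partner.

```shell
#!/bin/sh
# Emit one ldev.conf line per service label for the given host.
# Columns follow the header of the example file: local, foreign/-, label, device.
ldev_lines() {
    host="$1"; shift
    for label in "$@"; do
        # Pool/dataset names assume this example's lustre-<label>/<label> scheme.
        printf '%s - %-7s zfs:lustre-%s/%s\n' "$host" "$label" "$label" "$label"
    done
}

# Print the four service lines for the local host (uname -n gives the node name).
ldev_lines "$(uname -n)" mgs mdt0 ost0 ost1
```

Review the output before copying it into /etc/ldev.conf, since the init scripts rely on the hostname and labels being exact.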

Everything is now in place to start the Lustre services. Start the MGS first, then the MDT, and lastly the OSTs. Once everything is running, the Lustre filesystem can be mounted. A word of caution before starting the services: the Lustre Orion code still requires some polish and you may encounter a few bugs. It is not currently considered ready for production use.

$ sudo service lustre start mgs
Mounting lustre-mgs/mgs on /mnt/lustre/local/mgs

$ sudo service lustre start mdt0
Mounting lustre-mdt0/mdt0 on /mnt/lustre/local/mdt0

$ sudo service lustre start ost0
Mounting lustre-ost0/ost0 on /mnt/lustre/local/ost0

$ sudo service lustre start ost1
Mounting lustre-ost1/ost1 on /mnt/lustre/local/ost1

$ sudo mkdir -p /mnt/lustre/client
$ sudo mount -t lustre 192.168.1.100:/lustre /mnt/lustre/client

$ df -h /mnt/lustre/client
Filesystem            Size  Used Avail Use% Mounted on
192.168.1.100:/lustre
                      1.9G   75M  1.8G   5% /mnt/lustre/client
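
The start-up order used above (MGS, then MDT, then OSTs) can be captured in a small loop. This is only a sketch: the function name start_lustre_services and the LUSTRE_RUN switch are hypothetical, and the service labels are the ones from this example's /etc/ldev.conf. By default it only prints the commands so the order can be reviewed without touching any services.

```shell
#!/bin/sh
# Print (or run) the service start commands in dependency order:
# MGS first, then the MDT, then the OSTs.
# LUSTRE_RUN is a hypothetical switch for this sketch: leave it unset to
# print the commands, set LUSTRE_RUN=1 to actually execute them.
start_lustre_services() {
    for svc in mgs mdt0 ost0 ost1; do
        if [ -n "${LUSTRE_RUN:-}" ]; then
            # Stop at the first failure so later services are not started
            # without the ones they depend on.
            sudo service lustre start "$svc" || return 1
        else
            echo "sudo service lustre start $svc"
        fi
    done
}

start_lustre_services
```

Stopping on the first failure is a deliberate choice here, since an OST started without a running MGS and MDT is unlikely to be useful.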