You can use the administrative utilities provided with Lustre to set up a system with many different configurations. This chapter shows how to configure a simple Lustre system composed of a combined MGS/MDT, an OST and a client, and includes the following sections:
4.1 Configuring the Lustre File System
A Lustre file system consists of four types of subsystems: a Management Server (MGS), a Metadata Target (MDT), Object Storage Targets (OSTs) and clients. We recommend running these components on different systems, although, technically, they can co-exist on a single system. Together, the OSSs and MDS present a Logical Object Volume (LOV), an abstraction that appears in the configuration.
It is possible to set up the Lustre system with many different configurations by using the administrative utilities provided with Lustre. Some sample scripts are included in the directory where Lustre is installed. If you have installed the Lustre source code, the scripts are located in the lustre/tests sub-directory. These scripts enable quick setup of some simple, standard Lustre configurations.
Note – We recommend that you use dotted-quad IP addressing (IPv4) rather than host names. This aids in reading debug logs, and helps greatly when debugging configurations with multiple interfaces. |
1. Define the module options for Lustre networking (LNET), by adding this line to the /etc/modprobe.conf file[1].
options lnet networks=<network interfaces that LNET can use>
This step restricts LNET to use only the specified network interfaces and prevents LNET from using all network interfaces.
As an alternative to modifying the modprobe.conf file, you can modify the modprobe.local file or the configuration files in the modprobe.d directory.
Note – For details on configuring networking and LNET, see Configuring LNET. |
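For a simple TCP-only cluster, the resulting /etc/modprobe.conf entry could look like the fragment below. The interface name eth0 is an assumption for illustration; substitute the interface your Lustre nodes actually communicate over.

```
options lnet networks=tcp0(eth0)
```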
2. (Optional) Prepare the block devices to be used as OSTs or MDTs.
Depending on the hardware used in the MDS and OSS nodes, you may want to set up a hardware or software RAID to increase the reliability of the Lustre system. For more details on how to set up a hardware or software RAID, see the documentation for your RAID controller or see Lustre Software RAID Support.
3. Create a combined MGS/MDT file system.
a. Consider the MDT size needed to support the file system.
When calculating the MDT size, the only important factor is the number of files to be stored in the file system. This determines the number of inodes needed, which drives the MDT sizing. For more information, see Sizing the MDT and Planning for Inodes. Make sure the MDT is properly sized before performing the next step, as a too-small MDT can cause the space on the OSTs to be unusable.
b. Create the MGS/MDT file system on the block device. On the MDS node, run:
mkfs.lustre --fsname=<fsname> --mgs --mdt <block device name>
The default file system name (fsname) is lustre.
Note – If you plan to generate multiple file systems, the MGS should be on its own dedicated block device. |
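The sizing consideration in Step 3a can be sketched with simple shell arithmetic. This is a rough sketch, assuming one inode per 4 KB of MDT space (matching the -i 4096 option visible in the formatting output later in this chapter); see Sizing the MDT and Planning for Inodes for authoritative guidance.

```shell
# Back-of-envelope MDT sizing. Assumption: one inode consumes 4 KB of MDT
# space (the "-i 4096" mkfs default shown in this chapter's example output).
files=100000000            # expected number of files in the file system
bytes_per_inode=4096       # MDT space consumed per inode (assumed default)
mdt_bytes=$((files * bytes_per_inode))
mdt_gb=$((mdt_bytes / 1024 / 1024 / 1024))
echo "Plan for at least ${mdt_gb} GB of MDT space for ${files} files"
```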
4. Mount the combined MGS/MDT file system on the block device. On the MDS node, run:
mount -t lustre <block device name> <mount point>
5. Create the OST[2]. On the OSS node, run:
mkfs.lustre --ost --fsname=<fsname> --mgsnode=<NID> <block device name>
You can have as many OSTs per OSS as the hardware or drivers allow.
Use only one OST per block device. Optionally, you can create an OST which uses the raw block device and does not require partitioning.
6. Mount the OST. On the OSS node where the OST was created, run:
mount -t lustre <block device name> <mount point>
Note – To create additional OSTs, repeat Step 5 and Step 6. |
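Repeating Steps 5 and 6 for several OSTs can be sketched as a small dry-run helper that only prints the commands it would run. This is a sketch; the fsname temp, the MGS NID, and the device names are illustrative placeholders, not values from your cluster.

```shell
# Dry-run helper: print the mkfs.lustre and mount commands for a list of OST
# block devices on one OSS node. Nothing is actually formatted or mounted.
ost_setup_cmds() {
    fsname=$1; mgsnid=$2; shift 2
    i=1
    for dev in "$@"; do
        echo "mkfs.lustre --ost --fsname=${fsname} --mgsnode=${mgsnid} ${dev}"
        echo "mount -t lustre ${dev} /mnt/ost${i}"
        i=$((i + 1))
    done
}

# Example: two OSTs on one OSS (placeholder fsname, NID and devices)
ost_setup_cmds temp 10.2.0.1@tcp0 /dev/sdc /dev/sdd
```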
7. Create the client (mount the file system on the client). On the client node, run:
mount -t lustre <MGS node>:/<fsname> <mount point>
Note – To create additional clients, repeat Step 7. |
8. Verify that the file system started and is working correctly by running the df, dd and ls commands on the client node.
a. Run the lfs df -h command:
[root@client1 /] lfs df -h
The lfs df -h command lists space usage per OST and the MDT in human-readable format.
b. Run the lfs df -ih command:
[root@client1 /] lfs df -ih
The lfs df -ih command lists inode usage per OST and the MDT.
c. Run the dd command:
[root@client1 /] cd /lustre
[root@client1 /lustre] dd if=/dev/zero of=/lustre/zero.dat bs=4M count=2
The dd command verifies write functionality by creating a file containing all zeros (0s). In this command, an 8 MB file is created.
d. Run the ls command:
[root@client1 /lustre] ls -lsah
The ls -lsah command lists files and directories in the current working directory.
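The dd portion of this check can be rehearsed locally before a client is even mounted. This sketch writes to a temporary directory instead of the Lustre mount point (/lustre), so only the size arithmetic carries over: bs=4M count=2 produces an 8 MB (8388608-byte) file.

```shell
# Local rehearsal of the dd write test from Step 8, against a temp directory.
tmpdir=$(mktemp -d)
dd if=/dev/zero of="${tmpdir}/zero.dat" bs=4M count=2 2>/dev/null
size=$(wc -c < "${tmpdir}/zero.dat")
echo "wrote ${size} bytes"
rm -rf "$tmpdir"
```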
If you have a problem mounting the file system, check the syslogs for errors and also check the network settings. A common issue with newly-installed systems is hosts.deny or firewall rules that prevent connections on port 988.
Tip – Now that you have configured Lustre, you can collect and register service tags in Lustre 1.8.3 and earlier versions. Note that service tags have been discontinued in Lustre 1.8.4 and later releases. For more information, see Service Tags. |
4.1.0.1 Simple Lustre Configuration Example
To see the steps in a simple Lustre configuration, follow this worked example in which a combined MGS/MDT and two OSTs are created. Three block devices are used, one for the combined MGS/MDS node and one for each OSS node. Common parameters used in the example are listed below, along with individual node parameters.
1. Define the module options for Lustre networking (LNET), by adding this line to the /etc/modprobe.conf file.
options lnet networks=tcp
2. Create a combined MGS/MDT file system on the block device. On the MDS node, run:
[root@mds /]# mkfs.lustre --fsname=temp --mgs --mdt /dev/sdb
This command generates this output:
Permanent disk data:
Target:     temp-MDTffff
Index:      unassigned
Lustre FS:  temp
Mount type: ldiskfs
Flags:      0x75
            (MDT MGS needs_index first_time update )
Persistent mount opts: errors=remount-ro,iopen_nopriv,user_xattr
Parameters: mdt.group_upcall=/usr/sbin/l_getgroups

checking for existing Lustre data: not found
device size = 16MB
2 6 18
formatting backing filesystem ldiskfs on /dev/sdb
        target name  temp-MDTffff
        4k blocks    0
        options      -i 4096 -I 512 -q -O dir_index,uninit_groups -F
mkfs_cmd = mkfs.ext2 -j -b 4096 -L temp-MDTffff -i 4096 -I 512 -q -O dir_index,uninit_groups -F /dev/sdb
Writing CONFIGS/mountdata
3. Mount the combined MGS/MDT file system on the block device. On the MDS node, run:
[root@mds /]# mount -t lustre /dev/sdb /mnt/mdt
This command generates this output:
Lustre: temp-MDT0000: new disk, initializing
Lustre: 3009:0:(lproc_mds.c:262:lprocfs_wr_group_upcall()) temp-MDT0000: group upcall set to /usr/sbin/l_getgroups
Lustre: temp-MDT0000.mdt: set parameter group_upcall=/usr/sbin/l_getgroups
Lustre: Server temp-MDT0000 on device /dev/sdb has started
4. Create the OSTs. In this example, the OSTs (ost1 and ost2) are created on different OSS nodes (oss1 and oss2).
a. Create ost1. On oss1 node, run:
[root@oss1 /]# mkfs.lustre --ost --fsname=temp --mgsnode=10.2.0.1@tcp0 /dev/sdc
The command generates this output:
Permanent disk data:
Target:     temp-OSTffff
Index:      unassigned
Lustre FS:  temp
Mount type: ldiskfs
Flags:      0x72
            (OST needs_index first_time update)
Persistent mount opts: errors=remount-ro,extents,mballoc
Parameters: mgsnode=10.2.0.1@tcp

checking for existing Lustre data: not found
device size = 16MB
2 6 18
formatting backing filesystem ldiskfs on /dev/sdc
        target name  temp-OSTffff
        4k blocks    0
        options      -I 256 -q -O dir_index,uninit_groups -F
mkfs_cmd = mkfs.ext2 -j -b 4096 -L temp-OSTffff -I 256 -q -O dir_index,uninit_groups -F /dev/sdc
Writing CONFIGS/mountdata
b. Create ost2. On oss2 node, run:
[root@oss2 /]# mkfs.lustre --ost --fsname=temp --mgsnode=10.2.0.1@tcp0 /dev/sdd
The command generates this output:
Permanent disk data:
Target:     temp-OSTffff
Index:      unassigned
Lustre FS:  temp
Mount type: ldiskfs
Flags:      0x72
            (OST needs_index first_time update)
Persistent mount opts: errors=remount-ro,extents,mballoc
Parameters: mgsnode=10.2.0.1@tcp

checking for existing Lustre data: not found
device size = 16MB
2 6 18
formatting backing filesystem ldiskfs on /dev/sdd
        target name  temp-OSTffff
        4k blocks    0
        options      -I 256 -q -O dir_index,uninit_groups -F
mkfs_cmd = mkfs.ext2 -j -b 4096 -L temp-OSTffff -I 256 -q -O dir_index,uninit_groups -F /dev/sdd
Writing CONFIGS/mountdata
5. Mount each OST (ost1 and ost2) on the OSS node where the OST was created.
a. Mount ost1. On oss1 node, run:
[root@oss1 /]# mount -t lustre /dev/sdc /mnt/ost1
The command generates this output:
LDISKFS-fs: file extents enabled
LDISKFS-fs: mballoc enabled
Lustre: temp-OST0000: new disk, initializing
Lustre: Server temp-OST0000 on device /dev/sdc has started
Shortly afterwards, this output appears:
Lustre: temp-OST0000: received MDS connection from 10.2.0.1@tcp0
Lustre: MDS temp-MDT0000: temp-OST0000_UUID now active, resetting orphans
b. Mount ost2. On oss2 node, run:
[root@oss2 /]# mount -t lustre /dev/sdd /mnt/ost2
The command generates this output:
LDISKFS-fs: file extents enabled
LDISKFS-fs: mballoc enabled
Lustre: temp-OST0001: new disk, initializing
Lustre: Server temp-OST0001 on device /dev/sdd has started
Shortly afterwards, this output appears:
Lustre: temp-OST0001: received MDS connection from 10.2.0.1@tcp0
Lustre: MDS temp-MDT0000: temp-OST0001_UUID now active, resetting orphans
6. Create the client (mount the file system on the client). On the client node, run:
[root@client1 /]# mount -t lustre 10.2.0.1@tcp0:/temp /lustre
This command generates this output:
Lustre: Client temp-client has started
7. Verify that the file system started and is working by running the df, dd and ls commands on the client node.
[root@client1 /] lfs df -h
This command generates output similar to this:
Filesystem                       Size  Used  Avail  Use%  Mounted on
/dev/mapper/VolGroup00-LogVol00  7.2G  2.4G  4.5G   35%   /
/dev/sda1                         99M   29M   65M   31%   /boot
tmpfs                             62M     0   62M    0%   /dev/shm
10.2.0.1@tcp0:/temp               30M  8.5M   20M   30%   /lustre
[root@client1 /] cd /lustre
[root@client1 /lustre] dd if=/dev/zero of=/lustre/zero.dat bs=4M count=2
This command generates output similar to this:
2+0 records in
2+0 records out
8388608 bytes (8.4 MB) copied, 0.159628 seconds, 52.6 MB/s
[root@client1 /lustre] ls -lsah
This command generates output similar to this:
total 8.0M
4.0K drwxr-xr-x  2 root root 4.0K Oct 16 15:27 .
8.0K drwxr-xr-x 25 root root 4.0K Oct 16 15:27 ..
8.0M -rw-r--r--  1 root root 8.0M Oct 16 15:27 zero.dat
4.1.0.2 Module Setup
Make sure the modules (like LNET) are installed in the appropriate /lib/modules directory. The mkfs.lustre utility tries to automatically load LNET (via the Lustre module) with the default network settings (using all available network interfaces). To change this default setting, use the networks=... option to specify the network(s) that LNET should use:
modprobe -v lustre "networks=XXX"
For example, to load Lustre with multiple-interface support (meaning LNET will use more than one physical circuit for communication between nodes), load the Lustre module with the following networks=... option:
modprobe -v lustre "networks=tcp0(eth0),o2ib0(ib0)"
tcp0 is the network itself (TCP/IP)
eth0 is the physical device (card) that is used (Ethernet)
o2ib0 is the interconnect (InfiniBand)
4.1.1 Scaling the Lustre File System
A Lustre file system can be scaled by adding OSTs or clients. For instructions on creating additional OSTs, see Step 5 and Step 6 above; for clients, see Step 7.
4.2 Additional Lustre Configuration
Once the Lustre file system is configured, it is ready for use. If additional configuration is necessary, several configuration utilities are available. For man pages and reference information, see:
System Configuration Utilities (man8), which describes utilities such as lustre_rmmod, e2scan, l_getgroups, llobdstat, llstat, plot-llstat, routerstat, and ll_recover_lost_found_objs, as well as tools to manage large clusters, perform application profiling, and debug Lustre.
4.3 Basic Lustre Administration
Once you have the Lustre file system up and running, you can use the procedures in this section to perform these basic Lustre administration tasks:
4.3.1 Specifying the File System Name
The file system name is limited to 8 characters. Lustre encodes the file system and target information in the disk label, so you can mount by label. This allows system administrators to move disks around without worrying about issues such as SCSI disk reordering or getting the /dev/device wrong for a shared target. Eventually, file system naming will be made as fail-safe as possible. Currently, Linux disk labels are limited to 16 characters. To identify the target within the file system, 8 characters are reserved, leaving 8 characters for the file system name:
<fsname>-MDT0000 or <fsname>-OST0a19
To mount by label, use this command:
$ mount -t lustre -L <file system label> <mount point>
This is an example of mount-by-label:
$ mount -t lustre -L testfs-MDT0000 /mnt/mdt
Caution – Mount-by-label should NOT be used in a multi-path environment. |
Although the file system name is internally limited to 8 characters, you can mount the clients at any mount point, so file system users are not subjected to short names. Here is an example:
mount -t lustre uml1@tcp0:/shortfs /mnt/<long-file_system-name>
4.3.2 Starting up Lustre
The startup order of Lustre components depends on whether you have a combined MGS/MDT or these components are separate.
4.3.3 Mounting a Server
Starting a Lustre server is straightforward and involves only the mount command. To list the Lustre devices and file systems currently mounted, run:
mount -t lustre
The mount command generates output similar to this:
/dev/sda1 on /mnt/test/mdt type lustre (rw)
/dev/sda2 on /mnt/test/ost0 type lustre (rw)
192.168.0.21@tcp:/testfs on /mnt/testfs type lustre (rw)
In this example, the MDT, an OST (ost0) and file system (testfs) are mounted.
Lustre servers can be added to /etc/fstab:
LABEL=testfs-MDT0000 /mnt/test/mdt  lustre defaults,_netdev,noauto 0 0
LABEL=testfs-OST0000 /mnt/test/ost0 lustre defaults,_netdev,noauto 0 0
In general, it is wise to specify noauto and let your high-availability (HA) package manage when to mount the device. If you are not using failover, make sure that networking has been started before mounting a Lustre server. RedHat, SuSE, Debian (and perhaps others) use the _netdev flag to ensure that these disks are mounted after the network is up.
We are mounting by disk label here; the label of a device can be read with e2label. The label of a newly-formatted Lustre server ends in FFFF, meaning that it has yet to be assigned. The assignment takes place when the server is first started, and the disk label is updated.
Caution – Do not do this when the client and OSS are on the same node, as memory pressure between the client and OSS can lead to deadlocks. |
Caution – Mount-by-label should NOT be used in a multi-path environment. |
4.3.4 Unmounting a Server
To stop a Lustre server, use the umount <mount point> command.
For example, to stop ost0 on mount point /mnt/test, run:
$ umount /mnt/test
Gracefully stopping a server with the umount command preserves the state of the connected clients. The next time the server is started, it waits for clients to reconnect, and then goes through the recovery procedure.
If the force (-f) flag is used, then the server evicts all clients and stops WITHOUT recovery. Upon restart, the server does not wait for recovery. Any currently connected clients receive I/O errors until they reconnect.
Note – If you are using loopback devices, use the -d flag. This flag cleans up loop devices and can always be safely specified. |
4.3.5 Working with Inactive OSTs
To mount a client or an MDT with one or more inactive OSTs, run commands similar to this:
client> mount -o exclude=testfs-OST0000 -t lustre uml1:/testfs /mnt/testfs
client> cat /proc/fs/lustre/lov/testfs-clilov-*/target_obd
To activate an inactive OST on a live client or MDT, use the lctl activate command on the OSC device. For example:
lctl --device 7 activate
Note – A colon-separated list can also be specified. For example, exclude=testfs-OST0000:testfs-OST0001. |
4.3.6 Finding Nodes in the Lustre File System
There may be situations in which you need to find all nodes in your Lustre file system or get the names of all OSTs.
To get a list of all Lustre nodes, run this command on the MGS:
# cat /proc/fs/lustre/mgs/MGS/live/*
Note – This command must be run on the MGS. |
In this example, file system lustre has three nodes, lustre-MDT0000, lustre-OST0000, and lustre-OST0001.
cfs21:/tmp# cat /proc/fs/lustre/mgs/MGS/live/* fsname: lustre flags: 0x0 gen: 26 lustre-MDT0000 lustre-OST0000 lustre-OST0001
To get the names of all OSTs, run this command on the MDS:
# cat /proc/fs/lustre/lov/<fsname>-mdtlov/target_obd
Note – This command must be run on the MDS. |
In this example, there are two OSTs, lustre-OST0000 and lustre-OST0001, which are both active.
cfs21:/tmp# cat /proc/fs/lustre/lov/lustre-mdtlov/target_obd 0: lustre-OST0000_UUID ACTIVE 1: lustre-OST0001_UUID ACTIVE
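On a file system with many OSTs, this output can be filtered with standard tools. The sketch below counts ACTIVE entries from a copy of the sample output above; on a live MDS you would pipe the cat command shown above into the same grep.

```shell
# Count ACTIVE OSTs from target_obd-style output (sample text, not live data).
sample="0: lustre-OST0000_UUID ACTIVE
1: lustre-OST0001_UUID ACTIVE"
active=$(printf '%s\n' "$sample" | grep -c ' ACTIVE$')
echo "${active} active OSTs"
```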
4.3.7 Mounting a Server Without Lustre Service
If you are using a combined MGS/MDT, but you only want to start the MGS and not the MDT, run this command:
mount -t lustre <MDT partition> -o nosvc <mount point>
The <MDT partition> variable is the combined MGS/MDT.
In this example, the combined MGS/MDT is testfs-MDT0000 and the mount point is /mnt/test/mdt.
$ mount -t lustre -L testfs-MDT0000 -o nosvc /mnt/test/mdt
4.3.8 Specifying Failout/Failover Mode for OSTs
Lustre uses two modes, failout and failover, to handle an OST that has become unreachable because it fails, is taken off the network, is unmounted, etc.
By default, the Lustre file system uses failover mode for OSTs. To specify failout mode instead, run this command:
$ mkfs.lustre --fsname=<fsname> --ost --mgsnode=<MGS node NID> --param="failover.mode=failout" <block device name>
In this example, failout mode is specified for the OSTs on MGS uml1, file system testfs.
$ mkfs.lustre --fsname=testfs --ost --mgsnode=uml1 --param="failover.mode=failout" /dev/sdb
Caution – Before running this command, unmount all OSTs that will be affected by the change in the failover/failout mode. |
Note – After initial file system configuration, use the tunefs.lustre utility to change the failover/failout mode. For example, to set the failout mode, run:
$ tunefs.lustre --param failover.mode=failout <OST partition>
4.3.9 Running Multiple Lustre File Systems
There may be situations in which you want to run multiple file systems. This is possible, as long as you follow specific naming conventions.
By default, the mkfs.lustre command creates a file system named lustre. To specify a different file system name (limited to 8 characters), run this command:
mkfs.lustre --fsname=<new file system name>
To mount a client on the file system, run:
mount -t lustre mgsnode:/<new fsname> <mountpoint>
For example, to mount a client on file system foo at mount point /mnt/lustre1, run:
mount -t lustre mgsnode:/foo /mnt/lustre1
Note – If a client(s) will be mounted on several file systems, add the following line to /etc/xattr.conf file to avoid problems when files are moved between the file systems: lustre.* skip |
Note – The MGS is universal; there is only one MGS per Lustre installation, not per file system. |
Note – There is only one file system per MDT. Therefore, specify --mdt --mgs on one file system and --mdt --mgsnode=<MGS node NID> on the other file systems. |
A Lustre installation with two file systems (foo and bar) could look like this, where the MGS node is mgsnode@tcp0 and the mount points are /mnt/lustre1 and /mnt/lustre2.
mgsnode# mkfs.lustre --mgs /mnt/lustre1
mdtfoonode# mkfs.lustre --fsname=foo --mdt --mgsnode=mgsnode@tcp0 /mnt/lustre1
ossfoonode# mkfs.lustre --fsname=foo --ost --mgsnode=mgsnode@tcp0 /mnt/lustre1
ossfoonode# mkfs.lustre --fsname=foo --ost --mgsnode=mgsnode@tcp0 /mnt/lustre2
mdtbarnode# mkfs.lustre --fsname=bar --mdt --mgsnode=mgsnode@tcp0 /mnt/lustre1
ossbarnode# mkfs.lustre --fsname=bar --ost --mgsnode=mgsnode@tcp0 /mnt/lustre1
ossbarnode# mkfs.lustre --fsname=bar --ost --mgsnode=mgsnode@tcp0 /mnt/lustre2
To mount a client on file system foo at mount point /mnt/lustre1, run:
mount -t lustre mgsnode@tcp0:/foo /mnt/lustre1
To mount a client on file system bar at mount point /mnt/lustre2, run:
mount -t lustre mgsnode@tcp0:/bar /mnt/lustre2
4.3.10 Setting and Retrieving Lustre Parameters
There are several options for setting parameters in Lustre.
- When the file system is created, using mkfs.lustre. See Setting Parameters with mkfs.lustre
- When a server is stopped, using tunefs.lustre. See Setting Parameters with tunefs.lustre
- When the file system is running, using lctl. See Setting Parameters with lctl
Additionally, you can use lctl to retrieve Lustre parameters. See Reporting Current Parameter Values.
4.3.10.1 Setting Parameters with mkfs.lustre
When the file system is created, parameters can simply be added as a --param option to the mkfs.lustre command. For example:
$ mkfs.lustre --mdt --param="sys.timeout=50" /dev/sda
4.3.10.2 Setting Parameters with tunefs.lustre
If a server (OSS or MDS) is stopped, parameters can be added using the --param option to the tunefs.lustre command. For example:
$ tunefs.lustre --param="failover.node=192.168.0.13@tcp0" /dev/sda
With tunefs.lustre, parameters are “additive”; new parameters are specified in addition to old parameters, they do not replace them. To erase all old tunefs.lustre parameters and use only newly-specified parameters, run:
$ tunefs.lustre --erase-params --param=<new parameters>
The tunefs.lustre command can be used to set any parameter settable in a /proc/fs/lustre file and that has its own OBD device, so it can be specified as <obd|fsname>.<obdtype>.<proc_file_name>=<value>. For example:
$ tunefs.lustre --param mdt.group_upcall=NONE /dev/sda1
4.3.10.3 Setting Parameters with lctl
When the file system is running, the lctl command can be used to set parameters (temporary or permanent) and report current parameter values. Temporary parameters are active as long as the server or client is not shut down. Permanent parameters live through server and client reboots.
Note – Lustre 1.8.4 adds the lctl list_param command, which enables users to list all parameters that can be set. See Listing Parameters. |
Setting Temporary Parameters
Use the lctl set_param command to set temporary parameters on the node where it is run. These parameters map to items in /proc/{fs,sys}/{lnet,lustre}. The lctl set_param command uses this syntax:
lctl set_param [-n] <obdtype>.<obdname>.<proc_file_name>=<value>
# lctl set_param osc.*.max_dirty_mb=1024
osc.myth-OST0000-osc.max_dirty_mb=32
osc.myth-OST0001-osc.max_dirty_mb=32
osc.myth-OST0002-osc.max_dirty_mb=32
osc.myth-OST0003-osc.max_dirty_mb=32
osc.myth-OST0004-osc.max_dirty_mb=32
Setting Permanent Parameters
Use the lctl conf_param command to set permanent parameters. In general, the lctl conf_param command can be used to specify any parameter settable in a /proc/fs/lustre file, with its own OBD device. The lctl conf_param command uses this syntax (same as the mkfs.lustre and tunefs.lustre commands):
<obd|fsname>.<obdtype>.<proc_file_name>=<value>
Here are a few examples of lctl conf_param commands:
$ mgs> lctl conf_param testfs-MDT0000.sys.timeout=40
$ lctl conf_param testfs-MDT0000.mdt.group_upcall=NONE
$ lctl conf_param testfs.llite.max_read_ahead_mb=16
$ lctl conf_param testfs-MDT0000.lov.stripesize=2M
$ lctl conf_param testfs-OST0000.osc.max_dirty_mb=29.15
$ lctl conf_param testfs-OST0000.ost.client_cache_seconds=15
$ lctl conf_param testfs.sys.timeout=40
Caution – Parameters specified with the lctl conf_param command are set permanently in the file system’s configuration file on the MGS. |
Listing Parameters
To list Lustre or LNET parameters that are available to set, use the lctl list_param command. For example:
lctl list_param [-FR] <obdtype>.<obdname>
The following arguments are available for the lctl list_param command.
-F Add ‘/’, ‘@’ or ‘=’ for directories, symlinks and writeable files, respectively
-R Recursively lists all parameters under the specified path
$ lctl list_param obdfilter.lustre-OST0000
4.3.10.4 Reporting Current Parameter Values
To report current Lustre parameter values, use the lctl get_param command with this syntax:
lctl get_param [-n] <obdtype>.<obdname>.<proc_file_name>
This example reports data on RPC service times.
$ lctl get_param -n ost.*.ost_io.timeouts
service : cur 1 worst 30 (at 1257150393, 85d23h58m54s ago) 1 1 1 1
This example reports the number of inodes available on each OST.
# lctl get_param osc.*.filesfree
osc.myth-OST0000-osc-ffff88006dd20000.filesfree=217623
osc.myth-OST0001-osc-ffff88006dd20000.filesfree=5075042
osc.myth-OST0002-osc-ffff88006dd20000.filesfree=3762034
osc.myth-OST0003-osc-ffff88006dd20000.filesfree=91052
osc.myth-OST0004-osc-ffff88006dd20000.filesfree=129651
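The per-OST values can be totaled with awk. The sketch below runs against a copy of two of the sample lines above rather than live output; on a client you would pipe "lctl get_param osc.*.filesfree" into the same awk program.

```shell
# Sum free inodes across OSTs from filesfree-style output (sample text).
sample="osc.myth-OST0000-osc-ffff88006dd20000.filesfree=217623
osc.myth-OST0001-osc-ffff88006dd20000.filesfree=5075042"
total=$(printf '%s\n' "$sample" | awk -F= '{sum += $2} END {print sum}')
echo "total free inodes across sampled OSTs: ${total}"
```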
4.3.11 Regenerating Lustre Configuration Logs
If the Lustre system’s configuration logs are in a state where the file system cannot be started, use the writeconf command to erase them. After the writeconf command is run and the servers restart, the configuration logs are re-generated and stored on the MGS (as in a new file system).
You should only use the writeconf command if:
The writeconf command is destructive to some configuration items (i.e., OST pools information and items set via conf_param), and should be used with caution. To avoid problems:
To regenerate Lustre’s system configuration logs:
1. Shut down the file system in this order.
2. Make sure the MDT and OST devices are available.
3. Run the writeconf command on all servers.
Run writeconf on the MDT first, and then the OSTs.
<mdt node>$ tunefs.lustre --writeconf <device>
<ost node>$ tunefs.lustre --writeconf <device>
4. Restart the file system in this order.
a. Mount the MGS (or the combined MGS/MDT).
After the writeconf command is run, the configuration logs are re-generated as servers restart.
4.3.12 Changing a Server NID
If you need to change the NID on the MDT or an OST, run the writeconf command to erase Lustre configuration information (including server NIDs), and then re-generate the system configuration using updated server NIDs.
Change a server NID in these situations:
1. Update the LNET configuration in the /etc/modprobe.conf file so the list of server NIDs (lctl list_nids) is correct.
The lctl list_nids command indicates which network(s) are configured to work with Lustre.
2. Shut down the file system in this order.
3. Run the writeconf command on all servers.
Run writeconf on the MDT first, and then the OSTs.
a. On the MDT, run:
<mdt node>$ tunefs.lustre --writeconf <device>
b. On each OST, run:
<ost node>$ tunefs.lustre --writeconf <device>
c. If the NID on the MGS was changed, communicate the new MGS location to each server. Run:
tunefs.lustre --erase-params --mgsnode=<new_nid(s)> --writeconf /dev/..
4. Restart the file system in this order.
a. Mount the MGS (or the combined MGS/MDT).
After the writeconf command is run, the configuration logs are re-generated as servers restart, and server NIDs in the updated list_nids file are used.
4.3.13 Removing and Restoring OSTs
OSTs can be removed from and restored to a Lustre file system. Currently in Lustre, removing an OST really means that the OST is ‘deactivated’ in the file system, not permanently removed. A removed OST still appears in the file system; do not create a new OST with the same name.
You may want to remove (deactivate) an OST and prevent new files from being written to it in several situations:
4.3.13.1 Removing an OST from the File System
When removing an OST, remember that the MDT does not communicate directly with OSTs. Rather, each OST has a corresponding OSC which communicates with the MDT. It is necessary to determine the device number of the OSC that corresponds to the OST. Then, you use this device number to deactivate the OSC on the MDT.
To remove an OST from the file system:
1. For the OST to be removed, determine the device number of the corresponding OSC on the MDT.
a. List all OSCs on the node, along with their device numbers. Run:
lctl dl | grep " osc "
This is sample lctl dl | grep " osc " output:
11 UP osc lustre-OST-0000-osc-cac94211 4ea5b30f-6a8e-55a0-7519-2f20318ebdb4 5
12 UP osc lustre-OST-0001-osc-cac94211 4ea5b30f-6a8e-55a0-7519-2f20318ebdb4 5
13 IN osc lustre-OST-0000-osc lustre-MDT0000-mdtlov_UUID 5
14 UP osc lustre-OST-0001-osc lustre-MDT0000-mdtlov_UUID 5
b. Determine the device number of the OSC that corresponds to the OST to be removed.
2. Temporarily deactivate the OSC on the MDT. On the MDT, run:
$ mdt> lctl --device <devno> deactivate
For example, based on the command output in Step 1, to deactivate device 13 (the MDT’s OSC for OST-0000), the command would be:
$ mdt> lctl --device 13 deactivate
This marks the OST as inactive on the MDS, so no new objects are assigned to the OST. This does not prevent use of existing objects for reads or writes.
Note – Do not deactivate the OST on the clients. Doing so causes errors (EIOs) and causes the copy out to fail. |
Caution – Do not use lctl conf_param to deactivate the OST. It permanently sets a parameter in the file system configuration. |
3. Discover all files that have objects residing on the deactivated OST. Run:
lfs find --obd <OST UUID> <mount_point>
4. Copy (not move) the files to a new directory in the file system.
Copying the files forces object re-creation on the active OSTs.
5. Move (not copy) the files back to their original directory in the file system.
Moving the files causes the original files to be deleted, as the copies replace them.
6. Once all files have been moved, permanently deactivate the OST on the clients and the MDT. On the MGS, run:
# mgs> lctl conf_param <OST name>.osc.active=0
Note – A removed OST still appears in the file system; do not create a new OST with the same name. |
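The copy-then-move pattern in Steps 4 and 5 can be simulated on ordinary local directories to see why the data survives. On Lustre, the copy recreates the file's objects on the active OSTs and the move back replaces the original; here, temporary directories stand in for directories inside the Lustre mount.

```shell
# Local simulation of the copy-then-move pattern (Steps 4 and 5).
work=$(mktemp -d)
mkdir "$work/orig" "$work/migrate"
echo "payload" > "$work/orig/file.dat"
cp "$work/orig/file.dat" "$work/migrate/file.dat"   # Step 4: copy (new objects created)
mv "$work/migrate/file.dat" "$work/orig/file.dat"   # Step 5: move back (original replaced)
content=$(cat "$work/orig/file.dat")
rm -rf "$work"
```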
Temporarily Deactivating an OST in the File System
You may encounter situations when it is necessary to temporarily deactivate an OST, rather than permanently deactivate it. For example, you may need to deactivate a failed OST that cannot be immediately repaired, but want to continue to access the remaining files on the available OSTs.
To temporarily deactivate an OST:
1. Mount the Lustre file system.
2. On the MDS and all clients, run:
# lctl set_param osc.<fsname>-<OST name>-*.active=0
Clients accessing files on the deactivated OST receive an IO error (-5), rather than pausing until the OST completes recovery.
4.3.13.2 Restoring an OST in the File System
Restoring an OST to the file system is as easy as activating it. When the OST is active, it is automatically added to the normal stripe rotation and files are written to it.
1. Make sure the OST to be restored is running.
2. Reactivate the OST. On the MGS, run:
# mgs> lctl conf_param <OST name>.osc.active=1
4.3.14 Aborting Recovery
You can abort recovery either with the lctl utility or by mounting the target with the abort_recov option (mount -o abort_recov). When starting a target, run:
$ mount -t lustre -L <MDT name> -o abort_recov <mount point>
Note – The recovery process is blocked until all OSTs are available. |
4.3.15 Determining Which Machine is Serving an OST
In the course of administering a Lustre file system, you may need to determine which machine is serving a specific OST. This is not as simple as identifying the machine’s IP address: IP is only one of several networking protocols that Lustre uses, so LNET identifies nodes by NID rather than by IP address.
To identify the NID that is serving a specific OST, run one of the following commands on a client (you do not need to be a root user):
client$ lctl get_param osc.${fsname}-${OSTname}*.ost_conn_uuid
client$ lctl get_param osc.*-OST0000*.ost_conn_uuid osc.myth-OST0000-osc-f1579000.ost_conn_uuid=192.168.20.1@tcp
client$ lctl get_param osc.*.ost_conn_uuid osc.myth-OST0000-osc-f1579000.ost_conn_uuid=192.168.20.1@tcp osc.myth-OST0001-osc-f1579000.ost_conn_uuid=192.168.20.1@tcp osc.myth-OST0002-osc-f1579000.ost_conn_uuid=192.168.20.1@tcp osc.myth-OST0003-osc-f1579000.ost_conn_uuid=192.168.20.1@tcp osc.myth-OST0004-osc-f1579000.ost_conn_uuid=192.168.20.1@tcp
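Because the value after the "=" is the serving NID, it can be extracted with plain shell parameter expansion. The sample line below is copied from the output above; on a client, substitute the lctl get_param call.

```shell
# Extract the serving NID from ost_conn_uuid-style output (sample line).
line="osc.myth-OST0000-osc-f1579000.ost_conn_uuid=192.168.20.1@tcp"
nid=${line#*=}
echo "OST served by ${nid}"
```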
4.4 More Complex Configurations
If a node has multiple network interfaces, it may have multiple NIDs. When a node is specified, all of its NIDs must be listed, delimited by commas (,) so other nodes can choose the NID that is appropriate for their network interfaces. When failover nodes are specified, they are delimited by a colon (:) or by repeating a keyword (--mgsnode= or --failnode=). To obtain all NIDs from a node (while LNET is running), run:
lctl list_nids
This displays the server’s NIDs (networks configured to work with Lustre).
4.4.1 Failover
This example has a combined MGS/MDT failover pair on uml1 and uml2, and an OST failover pair on uml3 and uml4. There are corresponding Elan addresses on uml1 and uml2.
uml1> mkfs.lustre --fsname=testfs --mdt --mgs --failnode=uml2,2@elan /dev/sda1
uml1> mount -t lustre /dev/sda1 /mnt/test/mdt
uml3> mkfs.lustre --fsname=testfs --ost --failnode=uml4 --mgsnode=uml1,1@elan --mgsnode=uml2,2@elan /dev/sdb
uml3> mount -t lustre /dev/sdb /mnt/test/ost0
client> mount -t lustre uml1,1@elan:uml2,2@elan:/testfs /mnt/testfs
uml1> umount /mnt/mdt
uml2> mount -t lustre /dev/sda1 /mnt/test/mdt
uml2> cat /proc/fs/lustre/mds/testfs-MDT0000/recovery_status
Where multiple NIDs are specified, comma-separation (for example, uml2,2@elan) means that the two NIDs refer to the same host, and that Lustre needs to choose the “best” one for communication. Colon-separation (for example, uml1:uml2) means that the two NIDs refer to two different hosts, and should be treated as failover locations (Lustre tries the first one, and if that fails, it tries the second one.)
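The separator rules can be sketched with ordinary text tools: colons separate failover hosts, commas separate alternate NIDs for the same host. The spec string below comes from the failover mount example above.

```shell
# Parse a failover spec: ":" splits hosts, "," splits NIDs within a host.
spec="uml1,1@elan:uml2,2@elan"
hosts=$(printf '%s\n' "$spec" | tr ':' '\n')
nhosts=$(printf '%s\n' "$hosts" | wc -l)
nids_first=$(printf '%s\n' "$hosts" | head -n 1 | tr ',' '\n' | wc -l)
echo "${nhosts} failover hosts; ${nids_first} NIDs for the first host"
```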
4.5 Operational Scenarios
In the operational scenarios below, the management node is the MDS. The management server is co-located on the MDS and started with the MDT.
Tip – All targets that are configured for failover must have some kind of shared storage among two server nodes. |
IP Network, Combined MGS/MDS, Single OST, No Failover
mkfs.lustre --mgs --mdt --fsname=<fsname> <partition>
mount -t lustre <partition> <mountpoint>
mkfs.lustre --ost --mgsnode=<MGS NID> --fsname=<fsname> <partition>
mount -t lustre <partition> <mountpoint>
mount -t lustre <MGS NID>:/<fsname> <mountpoint>
IP Network, Failover MGS/MDS
For failover, storage holding target data must be available as shared storage to failover server nodes. Failover nodes are statically configured as mount options.
mkfs.lustre --mgs --mdt --fsname=<fsname> --failnode=<failover MGS NID> <partition>
mount -t lustre <partition> <mount point>
mkfs.lustre --ost --fsname=<fsname> --mgsnode=<MGS NID>[,<failover MGS NID>] --failnode=<failover OSS NID> <partition>
mount -t lustre <partition> <mount point>
mount -t lustre <MGS NID>[,<failover MGS NID>]:/<fsname> <mount point>
IP Network, Failover MGS/MDS and OSS
mkfs.lustre --mgs --mdt --fsname=<fsname> --failnode=<failover MGS NID> <partition>
mount -t lustre <partition> <mount point>
mkfs.lustre --ost --fsname=<fsname> --mgsnode=<MGS NID>[,<failover MGS NID>] --failnode=<failover OSS NID> <partition>
mount -t lustre <partition> <mount point>
mount -t lustre <MGS NID>[,<failover MGS NID>]:/<fsname> <mount point>
4.5.1 Changing the Address of a Failover Node
To change the address of a failover node (e.g., to use node X instead of node Y), run this command on the OSS/OST partition:
tunefs.lustre --erase-params --failnode=<NID> <device>
Copyright © 2010, Oracle and/or its affiliates. All rights reserved.