Configuring Lustre

You can use the administrative utilities provided with Lustre to set up a system with many different configurations. This chapter shows how to configure a simple Lustre system consisting of a combined MGS/MDT, an OST and a client, and includes the following sections:

    • Configuring the Lustre File System
    • Additional Lustre Configuration
    • Basic Lustre Administration
    • More Complex Configurations
    • Operational Scenarios

4.1 Configuring the Lustre File System

A Lustre file system consists of four types of subsystems: a Management Server (MGS), a Metadata Target (MDT), Object Storage Targets (OSTs) and clients. We recommend running these components on different systems, although, technically, they can all co-exist on a single system. Together, the OSSs and the MDS present a Logical Object Volume (LOV), an abstraction that appears in the configuration.

It is possible to set up the Lustre system with many different configurations by using the administrative utilities provided with Lustre. Some sample scripts are included in the directory where Lustre is installed. If you have installed the Lustre source code, the scripts are located in the lustre/tests sub-directory. These scripts enable quick setup of some simple, standard Lustre configurations.
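For example, on a single test node with the Lustre source installed, a quick sanity setup might look like this (a sketch, assuming the llmount.sh helper script is present in lustre/tests and the Lustre modules are installed):

# cd lustre/tests
# sh llmount.sh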

 


Note – We recommend that you use dotted-quad IP addressing (IPv4) rather than host names. This aids in reading debug logs, and helps greatly when debugging configurations with multiple interfaces.

 

1. Define the module options for Lustre networking (LNET), by adding this line to the /etc/modprobe.conf file[1].

options lnet networks=<network interfaces that LNET can use> 

This step restricts LNET to the specified network interfaces; without it, LNET uses all available network interfaces.

As an alternative to modifying the modprobe.conf file, you can modify the modprobe.local file or the configuration files in the modprobe.d directory.
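For example, a minimal sketch of an equivalent entry in a file under /etc/modprobe.d (the file name lustre.conf and the interface name eth0 are only examples; substitute your own):

# /etc/modprobe.d/lustre.conf
options lnet networks=tcp0(eth0)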

 


Note – For details on configuring networking and LNET, see Configuring LNET.

 

2. (Optional) Prepare the block devices to be used as OSTs or MDTs.

Depending on the hardware used in the MDS and OSS nodes, you may want to set up a hardware or software RAID to increase the reliability of the Lustre system. For more details on how to set up a hardware or software RAID, see the documentation for your RAID controller or see Lustre Software RAID Support.

3. Create a combined MGS/MDT file system.

a. Consider the MDT size needed to support the file system.

When calculating the MDT size, the only important factor is the number of files to be stored in the file system. This determines the number of inodes needed, which drives the MDT sizing. For more information, see Sizing the MDT and Planning for Inodes. Make sure the MDT is properly sized before performing the next step, as a too-small MDT can cause the space on the OSTs to be unusable.
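As a rough worked example (assuming the default ldiskfs inode ratio of one inode per 4096 bytes, visible in the mkfs.lustre output later in this chapter), a file system expected to hold 100 million files would need approximately 100,000,000 x 4 KB = 400 GB of MDT space, before any safety margin.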

b. Create the MGS/MDT file system on the block device. On the MDS node, run:

mkfs.lustre --fsname=<fsname> --mgs --mdt <block device name>

The default file system name (fsname) is lustre.

 


Note – If you plan to generate multiple file systems, the MGS should be on its own dedicated block device.
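
In that case, a minimal sketch (device names and the MGS NID are placeholders) is to format the MGS on its own device and then format the MDT pointing at it:

mkfs.lustre --mgs <MGS block device name>
mkfs.lustre --fsname=<fsname> --mdt --mgsnode=<MGS NID> <MDT block device name>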

 

4. Mount the combined MGS/MDT file system on the block device. On the MDS node, run:

mount -t lustre <block device name> <mount point>

5. Create the OST[2]. On the OSS node, run:

mkfs.lustre --ost --fsname=<fsname> --mgsnode=<NID> <block device name>

You can have as many OSTs per OSS as the hardware or drivers allow.

Use only one OST per block device. Optionally, you can create an OST that uses the raw block device and does not require partitioning.

 


Note – Lustre currently supports block devices up to 16 TB on OEL 5/RHEL 5 (up to 8 TB on other distributions). If the device size is only slightly larger than 16 TB, we recommend that you limit the file system size to 16 TB at format time. If the device size is significantly larger than 16 TB, you should reconfigure the storage into devices smaller than 16 TB. We recommend that you not place partitions on top of RAID 5/6 block devices due to negative impacts on performance.

 

6. Mount the OST. On the OSS node where the OST was created, run:

mount -t lustre <block device name> <mount point>

 


Note – To create additional OSTs, repeat Step 5 and Step 6.

 

7. Create the client (mount the file system on the client). On the client node, run:

mount -t lustre <MGS node>:/<fsname> <mount point> 

 


Note – To create additional clients, repeat Step 7.

 

8. Verify that the file system started and is working correctly by running the lfs df, dd and ls commands on the client node.

a. Run the lfs df -h command.

[root@client1 /] lfs df -h

The lfs df -h command lists space usage per OST and the MDT in human-readable format.

b. Run the lfs df -ih command.

[root@client1 /] lfs df -ih

The lfs df -ih command lists inode usage per OST and the MDT.

c. Run the dd command.

[root@client1 /] cd /lustre
[root@client1 /lustre] dd if=/dev/zero of=/lustre/zero.dat bs=4M count=2

The dd command verifies write functionality by creating a file containing all zeros (0s). In this command, an 8 MB file is created.

d. Run the ls command.

[root@client1 /lustre] ls -lsah

The ls -lsah command lists files and directories in the current working directory.

If you have a problem mounting the file system, check the syslogs for errors and also check the network settings. A common issue with newly-installed systems is hosts.deny entries or firewall rules that prevent connections on port 988.
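If a mount fails, a quick low-level check is to verify LNET connectivity from the client to the MGS (a sketch; substitute the actual MGS NID):

# lctl ping <MGS NID>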

 


Tip – Now that you have configured Lustre, you can collect and register service tags in Lustre 1.8.3 and earlier versions. Note that service tags have been discontinued in Lustre 1.8.4 and later releases. For more information, see Service Tags.

 

4.1.0.1 Simple Lustre Configuration Example

To see the steps in a simple Lustre configuration, follow this worked example in which a combined MGS/MDT and two OSTs are created. Three block devices are used, one for the combined MGS/MDS node and one for each OSS node. Common parameters used in the example are listed below, along with individual node parameters.

 

Common Parameters    Value            Description
MGS/MDS node         10.2.0.1@tcp0    Node for the combined MGS/MDS
file system          temp             Name of the Lustre file system
network type         TCP/IP           Network type used for Lustre file system temp



Node Parameters      Value        Description

MGS/MDS node
  MGS/MDS node       mdt1         MDS in Lustre file system temp
  block device       /dev/sdb     Block device for the combined MGS/MDS node
  mount point        /mnt/mdt     Mount point for the mdt1 block device (/dev/sdb) on the MGS/MDS node

First OSS node
  OSS node           oss1         First OSS node in Lustre file system temp
  OST                ost1         First OST in Lustre file system temp
  block device       /dev/sdc     Block device for the first OSS node (oss1)
  mount point        /mnt/ost1    Mount point for the ost1 block device (/dev/sdc) on the oss1 node

Second OSS node
  OSS node           oss2         Second OSS node in Lustre file system temp
  OST                ost2         Second OST in Lustre file system temp
  block device       /dev/sdd     Block device for the second OSS node (oss2)
  mount point        /mnt/ost2    Mount point for the ost2 block device (/dev/sdd) on the oss2 node

Client node
  client node        client1      Client in Lustre file system temp
  mount point        /lustre      Mount point for Lustre file system temp on the client1 node

 

1. Define the module options for Lustre networking (LNET), by adding this line to the /etc/modprobe.conf file.

options lnet networks=tcp

2. Create a combined MGS/MDT file system on the block device. On the MDS node, run:

[root@mds /]# mkfs.lustre --fsname=temp --mgs --mdt /dev/sdb

This command generates this output:

    Permanent disk data:
Target:            temp-MDTffff
Index:         unassigned
Lustre FS:     temp
Mount type:        ldiskfs
Flags:         0x75
   (MDT MGS needs_index first_time update )
Persistent mount opts: errors=remount-ro,iopen_nopriv,user_xattr
Parameters: mdt.group_upcall=/usr/sbin/l_getgroups
 
checking for existing Lustre data: not found
device size = 16MB
2 6 18
formatting backing filesystem ldiskfs on /dev/sdb
   target name     temp-MDTffff
   4k blocks       0
   options -i 4096 -I 512 -q -O dir_index,uninit_groups -F
mkfs_cmd = mkfs.ext2 -j -b 4096 -L temp-MDTffff  -i 4096 -I 512 -q -O 
dir_index,uninit_groups -F /dev/sdb
Writing CONFIGS/mountdata 

3. Mount the combined MGS/MDT file system on the block device. On the MDS node, run:

[root@mds /]# mount -t lustre /dev/sdb /mnt/mdt

This command generates this output:

Lustre: temp-MDT0000: new disk, initializing 
Lustre: 3009:0:(lproc_mds.c:262:lprocfs_wr_group_upcall()) \ temp-MDT0000: group upcall set to /usr/sbin/l_getgroups
Lustre: temp-MDT0000.mdt: set parameter \ group_upcall=/usr/sbin/l_getgroups
Lustre: Server temp-MDT0000 on device /dev/sdb has started 

4. Create the OSTs.

In this example, the OSTs (ost1 and ost2) are being created on different OSSs (oss1 and oss2).

a. Create ost1. On oss1 node, run:

[root@oss1 /]# mkfs.lustre --ost --fsname=temp --mgsnode=10.2.0.1@tcp0 /dev/sdc

The command generates this output:

    Permanent disk data:
Target:            temp-OSTffff
Index:         unassigned
Lustre FS:     temp
Mount type:        ldiskfs
Flags:         0x72
(OST needs_index first_time update)
Persistent mount opts: errors=remount-ro,extents,mballoc
Parameters: mgsnode=10.2.0.1@tcp
 
checking for existing Lustre data: not found
device size = 16MB
2 6 18
formatting backing filesystem ldiskfs on /dev/sdc
   target name     temp-OSTffff
   4k blocks       0
   options -I 256 -q -O dir_index,uninit_groups -F
mkfs_cmd = mkfs.ext2 -j -b 4096 -L temp-OSTffff  -I 256 -q -O
dir_index,uninit_groups -F /dev/sdc
Writing CONFIGS/mountdata 

b. Create ost2. On oss2 node, run:

[root@oss2 /]# mkfs.lustre --ost --fsname=temp --mgsnode=10.2.0.1@tcp0 /dev/sdd

The command generates this output:

    Permanent disk data:
Target:            temp-OSTffff
Index:         unassigned
Lustre FS:     temp
Mount type:        ldiskfs
Flags:         0x72
(OST needs_index first_time update)
Persistent mount opts: errors=remount-ro,extents,mballoc
Parameters: mgsnode=10.2.0.1@tcp
checking for existing Lustre data: not found
device size = 16MB
2 6 18
formatting backing filesystem ldiskfs on /dev/sdd
   target name     temp-OSTffff
   4k blocks       0
   options -I 256 -q -O dir_index,uninit_groups -F
mkfs_cmd = mkfs.ext2 -j -b 4096 -L temp-OSTffff  -I 256 -q -O
dir_index,uninit_groups -F /dev/sdd
Writing CONFIGS/mountdata 

5. Mount the OSTs.

Mount each OST (ost1 and ost2), on the OSS where the OST was created.

a. Mount ost1. On oss1 node, run:

[root@oss1 /]# mount -t lustre /dev/sdc /mnt/ost1

The command generates this output:

LDISKFS-fs: file extents enabled 
LDISKFS-fs: mballoc enabled
Lustre: temp-OST0000: new disk, initializing
Lustre: Server temp-OST0000 on device /dev/sdc has started

Shortly afterwards, this output appears:

Lustre: temp-OST0000: received MDS connection from 10.2.0.1@tcp0
Lustre: MDS temp-MDT0000: temp-OST0000_UUID now active, resetting orphans 

b. Mount ost2. On oss2 node, run:

[root@oss2 /]# mount -t lustre /dev/sdd /mnt/ost2

The command generates this output:

LDISKFS-fs: file extents enabled 
LDISKFS-fs: mballoc enabled
Lustre: temp-OST0001: new disk, initializing
Lustre: Server temp-OST0001 on device /dev/sdd has started

Shortly afterwards, this output appears:

Lustre: temp-OST0001: received MDS connection from 10.2.0.1@tcp0
Lustre: MDS temp-MDT0000: temp-OST0001_UUID now active, resetting orphans

6. Create the client (mount the file system on the client). On the client node, run:

[root@client1 /] mount -t lustre 10.2.0.1@tcp0:/temp /lustre

This command generates this output:

Lustre: Client temp-client has started

7. Verify that the file system started and is working by running the df, dd and ls commands on the client node.

a. Run the df command:

[root@client1 /] df -h

This command generates output similar to this:

Filesystem                       Size   Used   Avail   Use%   Mounted on
/dev/mapper/VolGroup00-LogVol00  7.2G   2.4G   4.5G    35%    /
/dev/sda1                         99M    29M    65M    31%    /boot
tmpfs                             62M      0    62M     0%    /dev/shm
10.2.0.1@tcp0:/temp               30M   8.5M    20M    30%    /lustre

b. Run the dd command:

[root@client1 /] cd /lustre
[root@client1 /lustre] dd if=/dev/zero of=/lustre/zero.dat bs=4M count=2

This command generates output similar to this:

2+0 records in
2+0 records out
8388608 bytes (8.4 MB) copied, 0.159628 seconds, 52.6 MB/s

c. Run the ls command:

[root@client1 /lustre] ls -lsah

This command generates output similar to this:

total 8.0M
4.0K drwxr-xr-x  2 root root 4.0K Oct 16 15:27 .
8.0K drwxr-xr-x 25 root root 4.0K Oct 16 15:27 ..
8.0M -rw-r--r--  1 root root 8.0M Oct 16 15:27 zero.dat 

4.1.0.2 Module Setup

Make sure the modules (like LNET) are installed in the appropriate /lib/modules directory. The mkfs.lustre utility tries to automatically load LNET (via the Lustre module) with the default network settings (using all available network interfaces). To change this default setting, use the networks=... option to specify the network(s) that LNET should use:

modprobe -v lustre "networks=XXX"

For example, to load Lustre with multiple-interface support (meaning LNET will use more than one physical circuit for communication between nodes), load the Lustre module with the following networks=... option:

modprobe -v lustre "networks=tcp0(eth0),o2ib0(ib0)"

where:

tcp0 is the network itself (TCP/IP)

eth0 is the physical device (card) that is used (Ethernet)

o2ib0 is the InfiniBand network, which uses the physical interface ib0

4.1.1 Scaling the Lustre File System

A Lustre file system can be scaled by adding OSTs or clients. For instructions on creating and mounting additional OSTs, see Step 5 and Step 6 in Configuring the Lustre File System; for mounting additional clients, see Step 7.
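For example, a sketch of adding a third OST to the example file system temp from the previous section (the host oss3 and the device /dev/sde are placeholders):

[root@oss3 /]# mkfs.lustre --ost --fsname=temp --mgsnode=10.2.0.1@tcp0 /dev/sde
[root@oss3 /]# mount -t lustre /dev/sde /mnt/ost3

Once mounted, the new OST joins the normal stripe rotation and its capacity appears in lfs df output on the clients.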


4.2 Additional Lustre Configuration

Once the Lustre file system is configured, it is ready for use. If additional configuration is necessary, several configuration utilities are available. For man pages and reference information, see:

System Configuration Utilities (man8): describes utilities (e.g., lustre_rmmod, e2scan, l_getgroups, llobdstat, llstat, plot-llstat, routerstat, and ll_recover_lost_found_objs), as well as tools to manage large clusters, perform application profiling, and debug Lustre.


4.3 Basic Lustre Administration

Once you have the Lustre file system up and running, you can use the procedures in this section to perform these basic Lustre administration tasks:

4.3.1 Specifying the File System Name

The file system name is limited to 8 characters. The file system and target information are encoded in the disk label, so you can mount by label. This allows system administrators to move disks around without worrying about issues such as SCSI disk reordering or getting the /dev/device wrong for a shared target. Eventually, file system naming will be made as fail-safe as possible. Currently, Linux disk labels are limited to 16 characters. To identify the target within the file system, 8 characters are reserved, leaving 8 characters for the file system name:

<fsname>-MDT0000 or <fsname>-OST0a19

To mount by label, use this command:

$ mount -t lustre -L <file system label> <mount point>

This is an example of mount-by-label:

$ mount -t lustre -L testfs-MDT0000 /mnt/mdt

 


Caution – Mount-by-label should NOT be used in a multi-path environment.

 

Although the file system name is internally limited to 8 characters, you can mount the clients at any mount point, so file system users are not subjected to short names. Here is an example:

mount -t lustre uml1@tcp0:/shortfs /mnt/<long-file_system-name>

4.3.2 Starting up Lustre

The startup order of Lustre components depends on whether you have a combined MGS/MDT or these components are separate.

    • If you have a combined MGS/MDT, the recommended startup order is OSTs, then the MGS/MDT, and then clients.

    • If the MGS and MDT are separate, the recommended startup order is: MGS, then OSTs, then the MDT, and then clients.

 


Note – If an OST is added to a Lustre file system with a combined MGS/MDT, then the startup order changes slightly; the MGS must be started first because the OST needs to write its configuration data to it. In this scenario, the startup order is MGS/MDT, then OSTs, then the clients.

 

4.3.3 Mounting a Server

Starting a Lustre server is straightforward and only involves the mount command. On the server, run:

mount -t lustre <block device name> <mount point>

Lustre servers can also be added to /etc/fstab, for example:

LABEL=testfs-MDT0000 /mnt/test/mdt lustre defaults,_netdev,noauto 0 0
LABEL=testfs-OST0000 /mnt/test/ost0 lustre defaults,_netdev,noauto 0 0

Running the mount command with no arguments lists the mounted file systems and generates output similar to this:

/dev/sda1 on /mnt/test/mdt type lustre (rw)
/dev/sda2 on /mnt/test/ost0 type lustre (rw)
192.168.0.21@tcp:/testfs on /mnt/testfs type lustre (rw)

In this example, the MDT, an OST (ost0) and the file system (testfs) are mounted.

In general, it is wise to specify noauto and let your high-availability (HA) package manage when to mount the device. If you are not using failover, make sure that networking has been started before mounting a Lustre server. RedHat, SuSE, Debian (and perhaps others) use the _netdev flag to ensure that these disks are mounted after the network is up.

We are mounting by disk label here; the label of a device can be read with e2label. The label of a newly-formatted Lustre server ends in FFFF, meaning that it has yet to be assigned. The assignment takes place when the server is first started, and the disk label is updated.
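For example, a quick check of a label before mounting (a sketch using the MDT device from the example above):

$ e2label /dev/sda1
testfs-MDT0000

On a newly-formatted device that has not yet been started, the same command reports a label ending in ffff (for example, testfs-MDTffff).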

 


Caution – Do not do this when the client and OSS are on the same node, as memory pressure between the client and OSS can lead to deadlocks.

 

 


Caution – Mount-by-label should NOT be used in a multi-path environment.

 

4.3.4 Unmounting a Server

To stop a Lustre server, use the umount <mount point> command.

For example, to stop ost0 on mount point /mnt/test, run:

$ umount /mnt/test

Gracefully stopping a server with the umount command preserves the state of the connected clients. The next time the server is started, it waits for clients to reconnect, and then goes through the recovery procedure.

If the force (-f) flag is used, then the server evicts all clients and stops WITHOUT recovery. Upon restart, the server does not wait for recovery. Any currently connected clients receive I/O errors until they reconnect.
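For example, to force-stop the OST from the previous example without recovery, a sketch would be:

$ umount -f /mnt/test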

 


Note – If you are using loopback devices, use the -d flag. This flag cleans up loop devices and can always be safely specified.

 

4.3.5 Working with Inactive OSTs

To mount a client or an MDT with one or more inactive OSTs, run commands similar to this:

client> mount -o exclude=testfs-OST0000 -t lustre uml1:/testfs /mnt/testfs
client> cat /proc/fs/lustre/lov/testfs-clilov-*/target_obd

To activate an inactive OST on a live client or MDT, use the lctl activate command on the OSC device. For example:

lctl --device 7 activate

 


Note – A colon-separated list can also be specified. For example, exclude=testfs-OST0000:testfs-OST0001.

 

4.3.6 Finding Nodes in the Lustre File System

There may be situations in which you need to find all nodes in your Lustre file system or get the names of all OSTs.

To get a list of all Lustre nodes, run this command on the MGS:

# cat /proc/fs/lustre/mgs/MGS/live/*

 


Note – This command must be run on the MGS.

 

In this example, file system lustre has three nodes: lustre-MDT0000, lustre-OST0000, and lustre-OST0001.

cfs21:/tmp# cat /proc/fs/lustre/mgs/MGS/live/* 
fsname: lustre 
flags: 0x0     gen: 26 
lustre-MDT0000 
lustre-OST0000 
lustre-OST0001 

To get the names of all OSTs, run this command on the MDS:

# cat /proc/fs/lustre/lov/<fsname>-mdtlov/target_obd 

 


Note – This command must be run on the MDS.

 

In this example, there are two OSTs, lustre-OST0000 and lustre-OST0001, which are both active.

cfs21:/tmp# cat /proc/fs/lustre/lov/lustre-mdtlov/target_obd 
0: lustre-OST0000_UUID ACTIVE 
1: lustre-OST0001_UUID ACTIVE 

4.3.7 Mounting a Server Without Lustre Service

If you are using a combined MGS/MDT, but you only want to start the MGS and not the MDT, run this command:

mount -t lustre <MDT partition> -o nosvc <mount point>

The <MDT partition> variable is the combined MGS/MDT.

In this example, the combined MGS/MDT is testfs-MDT0000 and the mount point is /mnt/test/mdt.

$ mount -t lustre -L testfs-MDT0000 -o nosvc /mnt/test/mdt

4.3.8 Specifying Failout/Failover Mode for OSTs

Lustre uses two modes, failout and failover, to handle an OST that has become unreachable because it fails, is taken off the network, is unmounted, etc.

    • In failout mode, Lustre clients immediately receive errors (EIOs) after a timeout, instead of waiting for the OST to recover.

    • In failover mode, Lustre clients wait for the OST to recover.

By default, the Lustre file system uses failover mode for OSTs. To specify failout mode instead, run this command:

$ mkfs.lustre --fsname=<fsname> --ost --mgsnode=<MGS node NID> --param="failover.mode=failout" <block device name>

In this example, failout mode is specified for the OSTs on MGS uml1, file system testfs.

$ mkfs.lustre --fsname=testfs --ost --mgsnode=uml1 --param="failover.mode=failout" /dev/sdb

 


Caution – Before running this command, unmount all OSTs that will be affected by the change in the failover/failout mode.


Note – After initial file system configuration, use the tunefs.lustre utility to change the failover/failout mode. For example, to set the failout mode, run:

$ tunefs.lustre --param failover.mode=failout <OST partition>

4.3.9 Running Multiple Lustre File Systems

There may be situations in which you want to run multiple file systems. This is doable, as long as you follow specific naming conventions.

By default, the mkfs.lustre command creates a file system named lustre. To specify a different file system name (limited to 8 characters), run this command:

mkfs.lustre --fsname=<new file system name>

 


Note – The MDT, OSTs and clients in the new file system must use the same file system name (which is prepended to the target name). For example, for a new file system named foo, the MDT and two OSTs would be named foo-MDT0000, foo-OST0000, and foo-OST0001.

 

To mount a client on the file system, run:

mount -t lustre mgsnode:/<new fsname> <mountpoint>

For example, to mount a client on file system foo at mount point /mnt/lustre1, run:

mount -t lustre mgsnode:/foo /mnt/lustre1

 


Note – If clients will be mounted on several file systems, add the following line to the /etc/xattr.conf file to avoid problems when files are moved between the file systems: lustre.* skip




Note – The MGS is universal; there is only one MGS per Lustre installation, not per file system.




Note – There is only one file system per MDT. Therefore, specify --mdt --mgs on one file system and --mdt --mgsnode=<MGS node NID> on the other file systems.

 

A Lustre installation with two file systems (foo and bar) could look like this, where the MGS node is mgsnode@tcp0 and the client mount points are /mnt/lustre1 and /mnt/lustre2.

mgsnode# mkfs.lustre --mgs /dev/sda
mdtfoonode# mkfs.lustre --fsname=foo --mdt --mgsnode=mgsnode@tcp0 /dev/sda
ossfoonode# mkfs.lustre --fsname=foo --ost --mgsnode=mgsnode@tcp0 /dev/sda
ossfoonode# mkfs.lustre --fsname=foo --ost --mgsnode=mgsnode@tcp0 /dev/sdb
mdtbarnode# mkfs.lustre --fsname=bar --mdt --mgsnode=mgsnode@tcp0 /dev/sda
ossbarnode# mkfs.lustre --fsname=bar --ost --mgsnode=mgsnode@tcp0 /dev/sda
ossbarnode# mkfs.lustre --fsname=bar --ost --mgsnode=mgsnode@tcp0 /dev/sdb

To mount a client on file system foo at mount point /mnt/lustre1, run:

mount -t lustre mgsnode@tcp0:/foo /mnt/lustre1

To mount a client on file system bar at mount point /mnt/lustre2, run:

mount -t lustre mgsnode@tcp0:/bar /mnt/lustre2

4.3.10 Setting and Retrieving Lustre Parameters

There are several options for setting parameters in Lustre.

Additionally, you can use lctl to retrieve Lustre parameters. See Reporting Current Parameter Values.

4.3.10.1 Setting Parameters with mkfs.lustre

When the file system is created, parameters can simply be added as a --param option to the mkfs.lustre command. For example:

$ mkfs.lustre --mdt --param="sys.timeout=50" /dev/sda

4.3.10.2 Setting Parameters with tunefs.lustre

If a server (OSS or MDS) is stopped, parameters can be added using the --param option to the tunefs.lustre command. For example:

$ tunefs.lustre --param="failover.node=192.168.0.13@tcp0" /dev/sda

With tunefs.lustre, parameters are "additive"; new parameters are specified in addition to old parameters, they do not replace them. To erase all old tunefs.lustre parameters and use only newly-specified parameters, run:

$ tunefs.lustre --erase-params --param=<new parameters> 

The tunefs.lustre command can be used to set any parameter that is settable in a /proc/fs/lustre file and that has its own OBD device, so it can be specified as <obd|fsname>.<obdtype>.<proc_file_name>=<value>. For example:

$ tunefs.lustre --param mdt.group_upcall=NONE /dev/sda1

4.3.10.3 Setting Parameters with lctl

When the file system is running, the lctl command can be used to set parameters (temporary or permanent) and report current parameter values. Temporary parameters are active as long as the server or client is not shut down. Permanent parameters live through server and client reboots.

 


Note – Lustre 1.8.4 adds the lctl list_param command, which enables users to list all parameters that can be set. See Listing Parameters.

 

Setting Temporary Parameters

Use the lctl set_param command to set temporary parameters on the node where it is run. These parameters map to items in /proc/{fs,sys}/{lnet,lustre}. The lctl set_param command uses this syntax:

lctl set_param [-n] <obdtype>.<obdname>.<proc_file_name>=<value>

For example:

# lctl set_param osc.*.max_dirty_mb=1024
osc.myth-OST0000-osc.max_dirty_mb=1024 
osc.myth-OST0001-osc.max_dirty_mb=1024 
osc.myth-OST0002-osc.max_dirty_mb=1024 
osc.myth-OST0003-osc.max_dirty_mb=1024 
osc.myth-OST0004-osc.max_dirty_mb=1024

Setting Permanent Parameters

Use the lctl conf_param command to set permanent parameters. In general, the lctl conf_param command can be used to specify any parameter settable in a /proc/fs/lustre file, with its own OBD device. The lctl conf_param command uses this syntax (same as the mkfs.lustre and tunefs.lustre commands):

<obd|fsname>.<obdtype>.<proc_file_name>=<value>

Here are a few examples of lctl conf_param commands:

$ mgs> lctl conf_param testfs-MDT0000.sys.timeout=40
$ lctl conf_param testfs-MDT0000.mdt.group_upcall=NONE 
$ lctl conf_param testfs.llite.max_read_ahead_mb=16 
$ lctl conf_param testfs-MDT0000.lov.stripesize=2M 
$ lctl conf_param testfs-OST0000.osc.max_dirty_mb=29.15 
$ lctl conf_param testfs-OST0000.ost.client_cache_seconds=15 
$ lctl conf_param testfs.sys.timeout=40 

 


Caution – Parameters specified with the lctl conf_param command are set permanently in the file system’s configuration file on the MGS.

 

Listing Parameters

To list Lustre or LNET parameters that are available to set, use the lctl list_param command. For example:

lctl list_param [-FR] <obdtype>.<obdname>

The following arguments are available for the lctl list_param command.

-F Add ‘/’, ‘@’ or ‘=’ for directories, symlinks and writeable files, respectively

-R Recursively lists all parameters under the specified path

For example:

$ lctl list_param obdfilter.lustre-OST0000 

4.3.10.4 Reporting Current Parameter Values

To report current Lustre parameter values, use the lctl get_param command with this syntax:

lctl get_param [-n] <obdtype>.<obdname>.<proc_file_name>

This example reports data on RPC service times.

$ lctl get_param -n ost.*.ost_io.timeouts 
service : cur 1 worst 30 (at 1257150393, 85d23h58m54s ago) 1 1 1 1 

This example reports the number of inodes available on each OST.

# lctl get_param osc.*.filesfree
osc.myth-OST0000-osc-ffff88006dd20000.filesfree=217623 
osc.myth-OST0001-osc-ffff88006dd20000.filesfree=5075042 
osc.myth-OST0002-osc-ffff88006dd20000.filesfree=3762034 
osc.myth-OST0003-osc-ffff88006dd20000.filesfree=91052 
osc.myth-OST0004-osc-ffff88006dd20000.filesfree=129651

4.3.11 Regenerating Lustre Configuration Logs

If the Lustre system’s configuration logs are in a state where the file system cannot be started, use the writeconf command to erase them. After the writeconf command is run and the servers restart, the configuration logs are re-generated and stored on the MGS (as in a new file system).

You should only use the writeconf command if:

    • The configuration logs are in a state where the file system cannot start

    • A server NID is being changed

The writeconf command is destructive to some configuration items (i.e., OST pools information and items set via conf_param), and should be used with caution. To avoid problems:

    • Shut down the file system before running the writeconf command

    • Run the writeconf command on all servers (MDT first, then OSTs)

    • Start the file system in this order:

      • MGS (or the combined MGS/MDT)

      • MDT

      • OSTs

      • Lustre clients

 


Caution – Lustre 1.8 introduces the OST pools feature, which enables a group of OSTs to be named for file striping purposes. If you use OST pools, be aware that running the writeconf command erases all pools information (as well as any other parameters set via lctl conf_param). We recommend that the pools definitions (and conf_param settings) be executed via a script, so they can be reproduced easily after a writeconf is performed.
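
A minimal sketch of such a script (the pool name, OST names, and parameter shown are placeholders for whatever your site actually uses; run it on the MGS after the file system is running):

#!/bin/sh
# Re-create OST pools after a writeconf
lctl pool_new testfs.pool1
lctl pool_add testfs.pool1 testfs-OST0000 testfs-OST0001
# Re-apply permanent parameters previously set with conf_param
lctl conf_param testfs.sys.timeout=40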

 

To regenerate Lustre’s system configuration logs:

1. Shut down the file system in this order.

a. Unmount the clients.

b. Unmount the MDT.

c. Unmount all OSTs.

2. Make sure the MDT and OST devices are available.

3. Run the writeconf command on all servers.

Run writeconf on the MDT first, and then the OSTs.

a. On the MDT, run:

<mdt node>$ tunefs.lustre --writeconf <device>

b. On each OST, run:

<ost node>$ tunefs.lustre --writeconf <device>

4. Restart the file system in this order.

a. Mount the MGS (or the combined MGS/MDT).

b. Mount the MDT.

c. Mount the OSTs.

d. Mount the clients.

After the writeconf command is run, the configuration logs are re-generated as servers restart.

4.3.12 Changing a Server NID

If you need to change the NID on the MDT or an OST, run the writeconf command to erase Lustre configuration information (including server NIDs), and then re-generate the system configuration using updated server NIDs.

Change a server NID in these situations:

    • New server hardware is added to the file system, and the MDS or an OSS is being moved to the new machine

    • New network card is installed in the server

    • You want to reassign IP addresses

To change a server NID:

1. Update the LNET configuration in the /etc/modprobe.conf file so the list of server NIDs (lctl list_nids) is correct.

The lctl list_nids command indicates which network(s) are configured to work with Lustre.
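For example, if the server’s Lustre traffic is moving from interface eth0 to eth1, the LNET options line would change along these lines (the interface names are placeholders):

options lnet networks=tcp0(eth1)

After the lnet module is reloaded with the new options, lctl list_nids on the server should report the NID of the new interface.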

2. Shut down the file system in this order.

a. Unmount the clients.

b. Unmount the MDT.

c. Unmount all OSTs.

3. Run the writeconf command on all servers.

Run writeconf on the MDT first, and then the OSTs.

a. On the MDT, run:

<mdt node>$ tunefs.lustre --writeconf <device>

b. On each OST, run:

<ost node>$ tunefs.lustre --writeconf <device>

c. If the NID on the MGS was changed, communicate the new MGS location to each server. Run:

tunefs.lustre --erase-param --mgsnode=<new_nid(s)> --writeconf /dev/..

4. Restart the file system in this order.

a. Mount the MGS (or the combined MGS/MDT).

b. Mount the MDT.

c. Mount the OSTs.

d. Mount the clients.

After the writeconf command is run, the configuration logs are re-generated as the servers restart, and the servers use the NIDs from the updated LNET configuration (as reported by lctl list_nids).

4.3.13 Removing and Restoring OSTs

OSTs can be removed from and restored to a Lustre file system. Currently in Lustre, removing an OST really means that the OST is ‘deactivated’ in the file system, not permanently removed. A removed OST still appears in the file system; do not create a new OST with the same name.

You may want to remove (deactivate) an OST and prevent new files from being written to it in several situations:

    • Hard drive has failed and a RAID resync/rebuild is underway

    • OST is nearing its space capacity

4.3.13.1 Removing an OST from the File System

When removing an OST, remember that the MDT does not communicate directly with OSTs. Rather, each OST has a corresponding OSC which communicates with the MDT. It is necessary to determine the device number of the OSC that corresponds to the OST. Then, you use this device number to deactivate the OSC on the MDT.

To remove an OST from the file system:

1. For the OST to be removed, determine the device number of the corresponding OSC on the MDT.

a. List all OSCs on the node, along with their device numbers. Run:

lctl dl | grep " osc "

This is sample lctl dl | grep " osc " output:

11 UP osc lustre-OST-0000-osc-cac94211 4ea5b30f-6a8e-55a0-7519-2f20318ebdb4 5
12 UP osc lustre-OST-0001-osc-cac94211 4ea5b30f-6a8e-55a0-7519-2f20318ebdb4 5
13 IN osc lustre-OST-0000-osc lustre-MDT0000-mdtlov_UUID 5
14 UP osc lustre-OST-0001-osc lustre-MDT0000-mdtlov_UUID 5

b. Determine the device number of the OSC that corresponds to the OST to be removed.

2. Temporarily deactivate the OSC on the MDT. On the MDT, run:

$ mdt> lctl --device <devno> deactivate

For example, based on the command output in Step 1, to deactivate device 13 (the MDT’s OSC for OST-0000), the command would be:

$ mdt> lctl --device 13 deactivate

This marks the OST as inactive on the MDS, so no new objects are assigned to the OST. This does not prevent use of existing objects for reads or writes.

 


Note – Do not deactivate the OST on the clients. Doing so causes errors (EIOs) and causes the file copy in Step 4 to fail.




Caution – Do not use lctl conf_param to deactivate the OST. It permanently sets a parameter in the file system configuration.

 

3. Discover all files that have objects residing on the deactivated OST. Run:

lfs find --obd <OST UUID> <mount point>

4. Copy (not move) the files to a new directory in the file system.

Copying the files forces object re-creation on the active OSTs.

5. Move (not copy) the files back to their original directory in the file system.

Moving the files causes the original files to be deleted, as the copies replace them.
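For example, a per-file sketch of Steps 4 and 5 (path names are placeholders; the copy re-creates the file's objects on the active OSTs, and the move back replaces the original file):

client# cp /mnt/lustre/dir/file /mnt/lustre/new_dir/file
client# mv /mnt/lustre/new_dir/file /mnt/lustre/dir/file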

6. Once all files have been moved, permanently deactivate the OST on the clients and the MDT. On the MGS, run:

# mgs> lctl conf_param <OST name>.osc.active=0 

 


Note – A removed OST still appears in the file system; do not create a new OST with the same name.

 

Temporarily Deactivating an OST in the File System

You may encounter situations when it is necessary to temporarily deactivate an OST, rather than permanently deactivate it. For example, you may need to deactivate a failed OST that cannot be immediately repaired, but want to continue to access the remaining files on the available OSTs.

To temporarily deactivate an OST:

1. Mount the Lustre file system.

2. On the MDS and all clients, run:

# lctl set_param osc.<fsname>-<OST name>-*.active=0

Clients accessing files on the deactivated OST receive an IO error (-5), rather than pausing until the OST completes recovery.

4.3.13.2 Restoring an OST in the File System

Restoring an OST to the file system is as easy as activating it. When the OST is active, it is automatically added to the normal stripe rotation and files are written to it.

To restore an OST:

1. Make sure the OST to be restored is running.

2. Reactivate the OST. On the MGS, run:

# mgs> lctl conf_param <OST name>.osc.active=1

4.3.14 Aborting Recovery

You can abort recovery with either the lctl utility or by mounting the target with the abort_recov option (mount -o abort_recov). When starting a target, run:

$ mount -t lustre -L <MDT name> -o abort_recov <mount point>

 


Note – The recovery process is blocked until all OSTs are available.

 

4.3.15 Determining Which Machine is Serving an OST

In the course of administering a Lustre file system, you may need to determine which machine is serving a specific OST. This is not as simple as identifying the machine’s IP address; IP is only one of several networking protocols Lustre can use, so LNET identifies nodes by NID rather than by IP address.

To identify the NID that is serving a specific OST, run one of the following commands on a client (you do not need to be a root user):

client$ lctl get_param osc.${fsname}-${OSTname}*.ost_conn_uuid

For example:

client$ lctl get_param osc.*-OST0000*.ost_conn_uuid 
osc.myth-OST0000-osc-f1579000.ost_conn_uuid=192.168.20.1@tcp

– OR –

client$ lctl get_param osc.*.ost_conn_uuid 
osc.myth-OST0000-osc-f1579000.ost_conn_uuid=192.168.20.1@tcp
osc.myth-OST0001-osc-f1579000.ost_conn_uuid=192.168.20.1@tcp
osc.myth-OST0002-osc-f1579000.ost_conn_uuid=192.168.20.1@tcp
osc.myth-OST0003-osc-f1579000.ost_conn_uuid=192.168.20.1@tcp
osc.myth-OST0004-osc-f1579000.ost_conn_uuid=192.168.20.1@tcp


4.4 More Complex Configurations

If a node has multiple network interfaces, it may have multiple NIDs. When a node is specified, all of its NIDs must be listed, delimited by commas (,) so other nodes can choose the NID that is appropriate for their network interfaces. When failover nodes are specified, they are delimited by a colon (:) or by repeating a keyword (--mgsnode= or --failnode=). To obtain all NIDs from a node (while LNET is running), run:

lctl list_nids

This displays the server’s NIDs (networks configured to work with Lustre).

4.4.1 Failover

This example has a combined MGS/MDT failover pair on uml1 and uml2, and an OST failover pair on uml3 and uml4. There are corresponding Elan addresses on uml1 and uml2.

uml1> mkfs.lustre --fsname=testfs --mdt --mgs \ 
--failnode=uml2,2@elan /dev/sda1
uml1> mount -t lustre /dev/sda1 /mnt/test/mdt
uml3> mkfs.lustre --fsname=testfs --ost --failnode=uml4 \ 
--mgsnode=uml1,1@elan --mgsnode=uml2,2@elan /dev/sdb
uml3> mount -t lustre /dev/sdb /mnt/test/ost0
client> mount -t lustre uml1,1@elan:uml2,2@elan:/testfs /mnt/testfs
uml1> umount /mnt/test/mdt
uml2> mount -t lustre /dev/sda1 /mnt/test/mdt
uml2> cat /proc/fs/lustre/mds/testfs-MDT0000/recovery_status

Where multiple NIDs are specified, comma-separation (for example, uml2,2@elan) means that the two NIDs refer to the same host, and that Lustre needs to choose the “best” one for communication. Colon-separation (for example, uml1:uml2) means that the two NIDs refer to two different hosts, and should be treated as failover locations (Lustre tries the first one, and if that fails, it tries the second one.)

 


Note – If you have an MGS or MDT configured for failover, perform these steps:

1. On the OST, list the NIDs of all MGS nodes at mkfs time.

OST# mkfs.lustre --fsname sunfs --ost --mgsnode=10.0.0.1 \ 
--mgsnode=10.0.0.2 /dev/{device}

2. On the client, mount the file system.

client# mount -t lustre 10.0.0.1:10.0.0.2:/sunfs /cfs/client/


 


4.5 Operational Scenarios

In the operational scenarios below, the management node is the MDS. The management server is co-located on the MDS and started with the MDT.

 


Tip – All targets that are configured for failover must have some kind of shared storage among two server nodes.

 

IP Network, Combined MGS/MDS, Single OST, No Failover

On the MDS, run:

mkfs.lustre --mgs --mdt --fsname=<fsname> <partition> 
mount -t lustre <partition> <mountpoint>

On the OSS, run:

mkfs.lustre --ost --mgsnode=<MGS NID> --fsname=<fsname> <partition> 
mount -t lustre <partition> <mountpoint>

On the client, run:

mount -t lustre <MGS NID>:/<fsname> <mount point>

IP Network, Failover MGS/MDS

For failover, storage holding target data must be available as shared storage to failover server nodes. Failover nodes are statically configured as mount options.

On the MDS, run:

mkfs.lustre --mgs --mdt --fsname=<fsname> --failnode=<failover MGS NID> <partition> 
mount -t lustre <partition> <mount point>

On the OSS, run:

mkfs.lustre --ost --fsname=<fsname> --mgsnode=<MGS NID>[:<failover MGS NID>] --failnode=<failover OSS NID> <partition> 
mount -t lustre <partition> <mount point>

On the client, run:

mount -t lustre <MGS NID>[:<failover MGS NID>]:/<fsname> <mount point>

IP Network, Failover MGS/MDS and OSS

On the MDS, run:

mkfs.lustre --mgs --mdt --fsname=<fsname> --failnode=<failover MGS NID> <partition> 
mount -t lustre <partition> <mount point>

On the OSS, run:

mkfs.lustre --ost --fsname=<fsname> --mgsnode=<MGS NID>[:<failover MGS NID>] --failnode=<failover OSS NID> <partition> 
mount -t lustre <partition> <mount point>

On the client, run:

mount -t lustre <MGS NID>[:<failover MGS NID>]:/<fsname> <mount point>

4.5.1 Changing the Address of a Failover Node

To change the address of a failover node (e.g., to use node X instead of node Y), run this command on the OSS/OST partition:

tunefs.lustre --erase-params --failnode=<NID> <device>

1 (Footnote) The modprobe.conf file is a Linux configuration file, located at /etc/modprobe.conf, that specifies options to be used when kernel modules are loaded.

2 (Footnote) When you create the OST, you are defining a storage device (‘sd’), a device number (a, b, c, d), and a partition (1, 2, 3) where the OST node lives.
