Sys Admin > v13, i04: Monitoring and Managing Linux Software RAID


Systems administrators managing a data center face numerous challenges to achieve required availability and uptime. Two of the main challenges are shrinking budgets (for hardware, software, and staffing) and short deadlines in which to deliver solutions. The Linux community has developed kernel support for software RAID (Redundant Array of Inexpensive Disks) to help meet those challenges. Software RAID, properly implemented, can eliminate system downtime caused by disk drive errors. The source code to the Linux kernel, the RAID modules, and the raidtools package is available at minimal cost under the GNU General Public License. The interface is well documented and comprehensible to a moderately experienced Linux systems administrator.

In this article, I'll provide an overview of the software RAID implementation in the Linux 2.4.X kernel. I will describe the creation and activation of software RAID devices as well as the management of active RAID devices. Finally, I will discuss some procedures for recovering from a failed disk unit.

Introduction to RAID

RAID is a set of algorithms for writing data blocks to disk devices. Each RAID mode, or level, specifies the layout of data blocks on multiple disks. Each RAID mode provides an enhancement in one aspect of data management: redundancy or reliability, read or write performance, or logical unit capacity. Simple RAID modes are named with an integer number: RAID 0, RAID 1, or RAID 5. Complex RAID modes that combine multiple simple modes are named with a combined name: RAID 0+1, RAID 1+0.

RAID 0, or striping, writes consecutive data blocks across multiple disk devices. It is used to enhance the read/write performance of large data sets and to increase logical unit capacity beyond the limits of a single disk device. RAID 0 provides no data recovery capability.

RAID 1, or mirroring, is used to provide high reliability and fault tolerance. In RAID 1, each data block is written to multiple disk devices simultaneously. If one disk device were to fail, all data could be recovered from one of the mirror disks. The cost of RAID 1 data sets increases with the number of mirror sets used.

RAID 5, also referred to as striping with parity, distributes data blocks and parity blocks across all devices in a RAID device. Parity is calculated for each write operation and is used to regenerate data if a disk failure is detected. Because data is striped across all of the member disks, read performance is good. RAID 5 also makes efficient use of disk storage: only one disk's worth of capacity is consumed by parity, so the device gains additional capacity with each member disk without losing redundancy.
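
To make the parity mechanism concrete, here is a minimal shell sketch (the values are arbitrary): the parity block is the bitwise XOR of the data blocks in a stripe, so any single lost block can be rebuilt from the surviving blocks and the parity.

$ d0=0xA5; d1=0x3C
$ parity=$(( d0 ^ d1 ))                      # parity block stored on the remaining column
$ printf 'parity=%#x rebuilt_d1=%#x\n' $(( parity )) $(( parity ^ d0 ))
parity=0x99 rebuilt_d1=0x3c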

Complex RAID modes combine the benefits of single RAID levels into the same logical unit. RAID 0+1 is the striping or concatenation of multiple disk devices into a larger logical unit, with additional disk devices allocated to mirror the striped devices. A 45-GB RAID 0+1 logical unit would require ten 9-GB disk devices. Data blocks would be striped or concatenated on five of the disk devices. Simultaneously, each data block would be written to one of the other five disk devices to provide a mirror for the entire logical unit. RAID 1+0 is the use of mirrored disk devices to form larger striped or concatenated logical units. There is no difference between the logical unit presented to the operating system from RAID 0+1 and RAID 1+0 -- it is 45 GB either way. The same number of disk devices is required to implement either.

Current Linux kernels support RAID -- both hardware and software -- for many disk devices and controllers. The distinction between hardware RAID and software RAID is the location of the RAID mode implementation. Hardware RAID solutions require specialized hardware (disk controllers, disk enclosures, and/or drives). Software RAID is implemented by the operating system of the server to which the devices are attached. The trade-off is between price and performance. Most hardware RAID controllers have dedicated processing units and non-volatile cache memory. The controller can acknowledge a write as completed to the operating system immediately and perform the physical writes to disk later, increasing performance and removing processing load from the server. Some RAID hardware devices support replacing physical disk units without taking the server offline.

Software RAID is an emulation of what hardware RAID devices do. There are some disadvantages of software RAID compared to hardware RAID: some disk device write performance is lost; there is additional processing burden on the server; and the hot-swappability of disk units is not available. However, the cost of standard disk controllers and devices is much less than those that support RAID modes in hardware. Often a combination of hardware RAID devices and software RAID will provide a flexible and maintainable solution that fits within the availability and budget constraints of the application.

Hardware RAID controllers allocate storage from a pool of available disks into a logical unit of disk storage, which is presented to the operating system. Most RAID controllers support RAID 0, RAID 1, RAID 5, RAID 0+1, and RAID 1+0. When RAID layouts are implemented in software, the kernel is responsible for managing individual disk units. The RAID drivers keep track of which disk units are assigned to each logical unit and where to read or write the raw data. The RAID logical unit is presented to the operating system as an abstract disk, upon which any type of Linux disk filesystem (e.g., ext2, ext3, reiserfs) can be installed. The filesystem interface is unaware both that the "disk unit" is actually an array of multiple disks and of how the data is laid out among them.

The remainder of this article will deal specifically with the Linux RAID implementation in software. In Linux documentation, the software RAID implementation is also referred to as MD (multiple disk). Many of the commands demonstrated are from the raidtools package that must be installed to manage RAID devices. The mdadm package is also available to create, manage, and monitor MD devices. This tool provides a variety of advanced features, but will not be covered in this article. For additional information on mdadm, please see the references.

A word of caution -- the examples in this article are from my test RAID systems. Please study all relevant documentation and plan carefully before adding or changing system parameters. I highly recommend that everything be implemented and tested on a non-production system before making changes to any live systems.

Linux Software RAID Implementation

The Linux kernel supports RAID 0, RAID 1, RAID 4, and RAID 5. RAID devices can also be combined to implement RAID 1+0 or RAID 0+1 layouts for additional availability or performance. The kernel also supports the allocation of one or more hot spare disk units per RAID device. A hot spare disk is one that is not used to store data or parity blocks -- it is available to the RAID device for recovery if one of the other disks comprising the device fails. When a disk fault is detected, the operating system begins to rebuild the failed disk's data onto the hot spare disk from the remaining members of the device. The faulty disk drive can be replaced later.

Kernel Support

A recent version of the Linux kernel should be used to implement software RAID. All examples in this article are from kernel version 2.4.20. Some Linux distributions ship kernels with RAID support precompiled and include the raidtools package, which is required to manage software RAID devices. Recent versions of the Red Hat and SuSE distributions include kernels with built-in RAID support, which can be configured at installation time or once the server is operational.

If you compile a new kernel, you must enable RAID support in your kernel configuration. Include support for all RAID layouts (modes) that will be used:

Multiple devices driver support (RAID and LVM) (CONFIG_MD) [Y/n/?] y
  RAID support (CONFIG_BLK_DEV_MD) [M/n/y/?] y
  Linear (append) mode (CONFIG_MD_LINEAR) [M/n/y/?] y
  RAID-0 (striping) mode (CONFIG_MD_RAID0) [M/n/y/?] y
  RAID-1 (mirroring) mode (CONFIG_MD_RAID1) [M/n/y/?] y
  RAID-4/RAID-5 mode (CONFIG_MD_RAID5) [M/n/y/?] y

Once the kernel has been configured with support for software RAID, it must be compiled and installed, and the system should be booted with the new kernel. The dmesg command can verify that software RAID is enabled in the running kernel:

$ dmesg | grep ^md | head -3
md: md driver 0.90.0 MAX_MD_DEVS=256, MD_SB_DISKS=27
md: Autodetecting RAID arrays.
md: autorun ...
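
If the RAID personalities were built as loadable modules rather than compiled into the kernel, they can be loaded manually before the arrays are started. A minimal sketch, assuming the stock 2.4 module names:

$ modprobe raid1
$ modprobe raid5
$ lsmod | grep raid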
                                                
The presence of md messages indicates that the kernel has loaded the software RAID drivers and is attempting to detect RAID devices attached to the server. During the autodetect process, the kernel reads the partition identification tags of each disk partition available to the system. Any partition that has a partition identification tag of 0xFD is a RAID partition. The operating system will attempt to start each RAID partition at autodetect time. The fdisk utility can be used to view the partition tags of any disk device:

$ fdisk -l /dev/hda

Disk /dev/hda: 100.0 GB, 100030242816 bytes
255 heads, 63 sectors/track, 12161 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

   Device Boot    Start       End     Blocks   Id  System
/dev/hda1   *         1     12095   97153056   fd  Linux raid autodetect
/dev/hda2         12096     12160     522112+  fd  Linux raid autodetect
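
New partitions intended for software RAID must be tagged with partition type 0xFD before the kernel will autodetect them. The type can be set interactively with the t command in fdisk, or non-interactively with sfdisk; a sketch, assuming the classic util-linux sfdisk and an illustrative target partition:

$ sfdisk --change-id /dev/hdc 1 fd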
                                                
Software RAID Devices

A software RAID device consists of one or more disk partitions (or other disk devices) that are combined to form a RAID device. The devices are combined in one of the supported basic RAID modes: RAID 0, RAID 1, RAID 4, or RAID 5. The configuration file that assigns partitions and records other details of each RAID device is /etc/raidtab. Each section of /etc/raidtab that defines a software RAID device begins with the keyword raiddev:

raiddev                     /dev/md0
    raid-level              1
    nr-raid-disks           2
    persistent-superblock   1
    nr-spare-disks          0
    device                  /dev/hda1
    raid-disk               0
    device                  /dev/hdc1
    raid-disk               1

The device name is identified in the raiddev directive. The RAID mode is defined in the raid-level directive. The nr-raid-disks directive specifies how many disks (partitions or other disk devices) will comprise the RAID device. The persistent-superblock directive instructs the kernel to write the RAID configuration data at the end of each partition, which helps the kernel identify the RAID configuration on the system. The nr-spare-disks directive is used to define hot spare disks, described later. Following the general device definitions, each disk device that will comprise the RAID device is specified with a device directive. The device specified can be a disk partition, a disk device, or another RAID device. Each device directive can be further qualified by other directives. The raid-disk directive indicates the relative position of each disk device within the RAID layout (from 0 to nr-raid-disks - 1). For striping layouts, such as RAID 5, this indicates the column where that disk device will lie. For RAID 1 and RAID 5, it is important to allocate disk units to each raid-disk that have the same physical disk geometry (cylinders, sectors, tracks) and are the same size.
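
Before committing partitions to a mirrored or parity layout, it is worth confirming that the member devices report the same geometry and size. One simple check (the device names are illustrative; the output shown is taken from the earlier fdisk example):

$ fdisk -l /dev/hda | grep heads
255 heads, 63 sectors/track, 12161 cylinders
$ fdisk -l /dev/hdc | grep heads
255 heads, 63 sectors/track, 12161 cylinders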

RAID 5 devices are defined with the raid-level directive, and a number 5 to specify the RAID mode. An example three-column RAID device would be:

raiddev                     /dev/md1
    raid-level              5
    nr-raid-disks           3
    nr-spare-disks          0
    persistent-superblock   1
    parity-algorithm        left-symmetric
    chunk-size              32
    device                  /dev/hda1
    raid-disk               0
    device                  /dev/hdc1
    raid-disk               1
    device                  /dev/hde1
    raid-disk               2

The chunk-size directive specifies how many kilobytes will be written as a data block to each column of the RAID device. This is also the size of the parity block that will be calculated for each data block. The optimal value for chunk-size depends on the application and the disk devices. The parity-algorithm directive specifies how parity blocks will be calculated based on the data blocks, and how to organize the location of the parity blocks across columns.
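
When an ext2 or ext3 filesystem is later created on a striped device, telling mke2fs the chunk size can improve block allocation. With a 32-KB chunk and 4-KB filesystem blocks, the stride is 8 blocks. A sketch only; the option name varies between e2fsprogs releases (-E stride= is shown here, and older releases used -R stride=):

$ mke2fs -j -b 4096 -E stride=8 /dev/md1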

Initializing a Software RAID Device

Each software RAID device that is defined in /etc/raidtab must be initialized. The mkraid utility creates the device node, initializes all the devices that comprise the RAID device, and starts the RAID device. Be aware that initialization with mkraid will destroy data that currently exists on any of the devices that are specified in /etc/raidtab for that RAID device:

$ mkraid /dev/md0

The status of RAID device initialization can be monitored via /proc/mdstat.
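
For example, the initialization progress can be watched as it updates (watch is part of the procps package):

$ watch -n 5 cat /proc/mdstat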

When the devices have been initialized and started, filesystems can be created on the RAID device:

$ mke2fs -j /dev/md0
$ mkdir /data/snowball
$ mount /dev/md0 /data/snowball

After a filesystem has been created on the RAID device, the RAID device can be managed like any other Linux filesystem. To have a filesystem mount upon system boot, add an entry in /etc/fstab:

/dev/md0      /data/snowball      ext3      defaults        1 2

Once mounted as a filesystem, the RAID device is used like any other filesystem on the server.

Stopping and Starting Software RAID Devices

Under normal circumstances, software RAID devices are started during the autodetect process. Some events may require that the RAID devices be stopped while the server is still running. Before stopping a software RAID device, the kernel forces any buffered data to be written to the RAID device. After all buffered data has been written, the RAID device is stopped. The data on a stopped RAID device is not accessible to the operating system. The underlying disk devices that comprise the RAID device remain accessible, although they should not be modified or the RAID device will be corrupted. The raidstop utility is used to stop a RAID device. Mounted filesystems on the RAID device must be unmounted before stopping the underlying device:

$ umount /data/snowball
$ raidstop /dev/md0

A stopped software RAID device must be explicitly started before it can be used by the server. The raidstart utility is used to start a RAID device. Once started, the filesystem can be mounted:

$ raidstart /dev/md0
$ mount /dev/md0 /data/snowball

Hot Spare Disk Devices

The Linux software RAID implementation allows one or more hot spare devices to be assigned to a RAID device. A hot spare device is a disk device that is available to a RAID device to replace one of the component disk devices in case of a disk fault or failure. Spare disks enable the RAID device to continue operating, and to begin recovery procedures, in real time. When the kernel receives notice of a failed disk device that is part of a RAID device, the RAID device is checked for an available hot spare device. If a spare disk is available, the kernel will logically replace the faulted disk with the spare in the RAID device.

The redundant RAID modes support continued operation after the loss of one disk unit in the device because each data block can be reconstructed from the remaining disks, either from a mirror copy or from parity. The nr-spare-disks directive in /etc/raidtab indicates how many spare disks are available to the RAID device. A device directive is specified for each spare disk, followed by a spare-disk directive instead of the raid-disk directive. The following excerpt from /etc/raidtab defines the previous example RAID device with one hot spare disk available:

raiddev                     /dev/md0
    raid-level              1
    nr-raid-disks           2
    persistent-superblock   1
    nr-spare-disks          1
    device                  /dev/hda1
    raid-disk               0
    device                  /dev/hdc1
    raid-disk               1
    device                  /dev/hde1
    spare-disk              0

The raidhotadd utility will add a hot spare disk to a RAID device that is running. raidhotadd will not modify /etc/raidtab. If a spare disk is added using raidhotadd, the systems administrator must add it to the appropriate RAID device specification in /etc/raidtab before the system reboots:

$ raidhotadd /dev/md0 /dev/hde1

The resynchronization of a hot spare device can be monitored via /proc/mdstat.

Monitoring Software RAID Devices

The Linux software RAID implementation reports the status of all RAID devices in /proc/mdstat. /proc/mdstat is a pseudo-file that can be read by any Linux utility that manipulates text files (e.g., more, grep). When all RAID devices are started and operating correctly, output like the following would be seen:

$ cat /proc/mdstat
Personalities : [raid1]
read_ahead 1024 sectors
md0 : active raid1 hda1[0] hdc1[1] hde1[2]
      97152960 blocks [2/2] [UU]

For each RAID device, the output includes the status (active), the RAID mode (raid1), the partitions comprising the device and their order, the total device size, and a status code letter for each active partition. A status code of U indicates that the partition is operational. A status code of _ (underscore) indicates that the partition has had a disk fault. When a partition has faulted, it is removed from the list of active partitions.

The following output would be seen for the RAID device when /dev/hdc1 is operational:

$ cat /proc/mdstat
Personalities : [raid1]
read_ahead 1024 sectors
md0 : active raid1 hda1[0] hdc1[1] hde1[2]
      97152960 blocks [2/2] [UU]

The following output would be seen for the same RAID device after /dev/hdc1 has faulted:

$ cat /proc/mdstat
Personalities : [raid1]
read_ahead 1024 sectors
md0 : active raid1 hda1[0] hde1[2]
      97152960 blocks [2/1] [U_]

The following output would be seen for the same RAID device while the spare, /dev/hde1, is being resynchronized from /dev/hda1:

$ cat /proc/mdstat
md0 : active raid1 hda1[0] hde1[2]
      97152960 blocks [2/1] [U_]
      [========>............]  recovery = 43.9% (42713908/97152960) finish=82.4min speed=11002K/sec

The following example script monitors /proc/mdstat for indication of a RAID device failure. If a RAID device fails, a message is logged via syslog and an email message is sent to the on-call pager. The script can be run periodically via cron or modified to run continuously as a daemon on system startup:

#!/bin/bash
# mdcheck: report failures of software RAID devices listed in /proc/mdstat

ADMIN="page-oncall@mydomain.com"
HOSTNAME=`/bin/hostname`

# An underscore inside the status brackets indicates a failed member partition
if egrep "\[.*_.*\]" /proc/mdstat > /dev/null
then
    logger -p daemon.error "mdcheck: Failure of one or more software RAID devices"
    echo "Failure of one or more software RAID devices on ${HOSTNAME}" | \
        /bin/mail -s "$0: Software RAID device failure on ${HOSTNAME}" ${ADMIN}
fi
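
As an illustration, a crontab entry that runs the check every 15 minutes might look like this (the install path for the script is hypothetical):

*/15 * * * * /usr/local/sbin/mdcheck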
                                                
Recovering from a Failed Disk Drive

Eventually, a disk drive comprising a software RAID device will fail and require replacement. The failed disk can be determined from the output of /proc/mdstat, and the failed unit must be removed and replaced with a working disk unit. It is important to replace the disk with one that is physically identical to the failed disk. The replacement disk must also be partitioned identically to the failed disk. The raidhotadd command can then be issued to tell the RAID device drivers to activate the replacement partition and begin the resynchronization of data. The progress of resynchronization can be monitored via /proc/mdstat:

$ raidhotadd /dev/md0 /dev/hdc1
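
A typical end-to-end replacement sequence with raidtools is sketched below with illustrative device names: the failed partition is removed from the running array, the partition table of a surviving disk is copied to the new drive, and the partition is re-added to begin resynchronization. Verify device names carefully before running anything like this on a live system.

$ raidsetfaulty /dev/md0 /dev/hdc1       # mark the member failed, if the kernel has not already
$ raidhotremove /dev/md0 /dev/hdc1       # remove it from the running array
# ... physically replace the failed drive ...
$ sfdisk -d /dev/hda | sfdisk /dev/hdc   # copy the partition layout from the surviving disk
$ raidhotadd /dev/md0 /dev/hdc1          # re-add the partition and begin resynchronization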
                                                
Conclusion

Linux software RAID provides systems administrators with the means to implement the reliability and performance of RAID without the cost of hardware RAID devices. The kernel supports all basic RAID modes, and complex RAID devices can be created by using existing RAID devices as the components of another RAID device. The examples presented were simple RAID 1 and RAID 5 configurations. The judicious allocation of hot spare devices to RAID units will reduce the window of degraded operation when a disk device fails. The best results will be achieved by planning the desired configuration in a diagram before modifying system configurations. Any changes should be thoroughly tested before implementing them on a production system.

References

Linux Software RAID HOWTO -- http://www.tldp.org/HOWTO/Software-RAID-HOWTO.html

mdadm MD device management package -- http://www.cse.unsw.edu.au/~neilb/source/mdadm/

Acknowledgements

Ryan would like to thank Rob Jenson from Spotch Consulting for editing the article. He would also like to thank the kernel developers, and the architects and programmers responsible for the MD device driver code.

Ryan Matteson has been a Unix systems administrator for eight years. He specializes in emergent and Web technologies, Storage Area Networks, high-availability systems, Linux, and Solaris operating systems. Questions and comments about this article can be addressed to matty91@bellsouth.net.
