Scale your file system with Parallel NFS

Read and write hundreds of gigabytes per second

Summary: The Network File System (NFS) is a stalwart component of most modern local area networks (LANs). But NFS is inadequate for the demanding input- and output-intensive applications commonly found in high-performance computing—or, at least, it was. The newest revision of the NFS standard includes Parallel NFS (pNFS), a parallelized implementation of file sharing that multiplies transfer rates by orders of magnitude. Here's a primer.

[Note: The article has been updated with regard to vendor involvement in the origin and development of pNFS. —Ed.]

Date: 26 Nov 2008 (Published 04 Nov 2008)
Through NFS, which consists of server and client software and the protocols that run between them, a computer can share its physical file system with many other computers connected to the same network. NFS masks the implementation and type of the server's file system: to applications running on an NFS client, the shared file system appears to be local, native storage. Figure 1 illustrates a common deployment of NFS within a network of heterogeneous operating systems, including Linux, Mac OS X, and Windows, all of which support the NFS standard. (NFS is the sole file system standard supported by the Internet Engineering Task Force.)

Figure 1. A simple NFS configuration

In Figure 1, the Linux machine is the NFS server; it shares, or exports (in NFS parlance), one or more of its physical, attached file systems. The Mac OS X and Windows machines are NFS clients. Each consumes, or mounts, the shared file system. Indeed, mounting an NFS file system yields the same result as mounting a local drive partition: once mounted, applications simply read and write files, subject to access control, oblivious to the machinations required to persist data. In the case of a file system shared through NFS, Read and Write operations travel from the client (in this case, the Windows machine) to the server, as represented by the blue shadow, and the server ultimately fulfills requests to retrieve or persist data or to alter file metadata, such as permissions and last-modified time.

NFS is quite capable, as evidenced by its widespread use as Network Attached Storage (NAS). It runs over both Transmission Control Protocol (TCP) and User Datagram Protocol (UDP) and is (relatively) easy to administer. Furthermore, NFS version 4, the most recent ratified version of the standard, improves security, furthers interoperability between Windows and UNIX-like systems, and provides better exclusivity through lock leases. (NFSv4 was ratified in 2003.) NFS infrastructure is also inexpensive, because it typically runs well on common Ethernet hardware. NFS suits most problem domains.

However, one domain not traditionally well served by NFS is high-performance computing (HPC), where data files are very large, sometimes huge, and the number of NFS clients can reach into the thousands. (Think of a compute cluster or grid composed of thousands of commodity computing nodes.) Here, NFS is a liability, because the limits of the NFS server—be it bandwidth, storage capacity, or processor speed—throttle the overall performance of the computation. NFS is a bottleneck.

Or, at least, it was. The next revision of NFS, version 4.1, includes an extension called Parallel NFS (pNFS) that combines the advantages of stock NFS with the massive transfer rates proffered by parallelized input and output (I/O). Using pNFS, file systems are shared from server to clients as before, but data does not pass through the NFS server. Instead, client systems and the data storage system connect directly, providing numerous parallelized, high-speed data paths for massive data transfers. After a bit of initialization and handshaking, the pNFS server is left "out of the loop" and no longer hinders transfer rates.
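The performance difference is easy to see with some rough arithmetic. The toy model below (plain Python; every bandwidth figure is an invented assumption, not a measurement) compares aggregate throughput when all bytes funnel through a single NFS server against throughput when each client has its own direct path to the storage system:

```python
# Toy throughput model: classic NFS versus pNFS-style direct data paths.
# All numbers are illustrative assumptions, not benchmarks.

NUM_CLIENTS = 1000           # nodes in the compute cluster
CLIENT_PATH_GBPS = 0.1       # usable bandwidth per client data path (GB/s)
NFS_SERVER_GBPS = 1.0        # total bandwidth of a single NFS server (GB/s)
STORAGE_FABRIC_GBPS = 500.0  # aggregate bandwidth of the parallel storage system (GB/s)

# Classic NFS: every Read and Write passes through the one server, so
# aggregate throughput can never exceed the server's own bandwidth.
nfs_throughput = min(NUM_CLIENTS * CLIENT_PATH_GBPS, NFS_SERVER_GBPS)

# pNFS: after the initial handshaking, clients talk to the storage system
# directly, so throughput scales with the number of parallel paths until
# the storage fabric itself saturates.
pnfs_throughput = min(NUM_CLIENTS * CLIENT_PATH_GBPS, STORAGE_FABRIC_GBPS)

print(f"Classic NFS aggregate: {nfs_throughput:.1f} GB/s")   # 1.0 GB/s
print(f"pNFS aggregate:        {pnfs_throughput:.1f} GB/s")  # 100.0 GB/s
```

The exact numbers are invented, but the shape of the result is the point: the single server caps the classic configuration, while pNFS throughput grows with the number of clients.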
Figure 2 shows a pNFS configuration. At the top are the nodes of a compute cluster, such as a large pool of inexpensive, Linux-powered blades. At the left is the NFSv4.1 server. (For this discussion, let's just call it a pNFS server.) At the bottom is a large parallel file system.

Figure 2. The conceptual organization of pNFS

Like NFS, the pNFS server exports file systems and retains and maintains the canonical metadata describing every file in the data store. As with NFS, a pNFS client—here, a node in the cluster—mounts the server's exported file systems. Like NFS, each node treats the file system as if it were local and physically attached. Changes to metadata propagate through the network back to the pNFS server. Unlike NFS, however, a Read or Write of data managed with pNFS is a direct operation between a node and the storage system itself, pictured at the bottom of Figure 2. The pNFS server is removed from data transactions, giving pNFS a definite performance advantage.

Thus, pNFS retains all the niceties and conveniences of NFS and improves performance and scalability. The number of clients can be expanded to provide more computing power, while the storage system can grow with little impact on client configuration. All you need to do is keep the pNFS catalog and the storage system in sync.

So, how does it work? As shown in Figure 3, pNFS is implemented as a collection of three protocols.

Figure 3. The triad of pNFS protocols

The pNFS protocol transfers file metadata, formally known as a layout, between the pNFS server and a client node. You can think of a layout as a map that describes how a file is distributed across the data store, such as how it is striped across multiple spindles. Additionally, a layout contains permissions and other file attributes. With metadata captured in a layout and persisted on the pNFS server, the storage system simply performs I/O.

The storage access protocol specifies how a client accesses data from the data store. As you might guess, each storage access protocol defines its own form of layout, because the access protocol and the organization of the data must be concordant.

The control protocol synchronizes state between the metadata server and the data servers. Synchronization, such as reorganizing files on media, is hidden from clients. Further, the control protocol is not specified in NFSv4.1; it can take many forms, allowing vendors the flexibility to compete on performance, cost, and features.

Given those protocols, you can follow the client-access process.
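Before stepping through that process, it helps to have a concrete picture of a layout. The sketch below is a deliberately simplified model in Python; the class, its fields, and the round-robin striping rule are illustrative assumptions, not the NFSv4.1 wire format. It shows the essential idea: a layout is a map that tells the client which data server holds which part of a file.

```python
from dataclasses import dataclass

@dataclass
class FileLayout:
    """Simplified stand-in for a pNFS layout: where a file's bytes live."""
    stripe_unit: int         # bytes written to one data server before moving on
    data_servers: list[str]  # opaque IDs of the data servers holding the stripes

    def locate(self, offset: int) -> tuple[str, int]:
        """Map a file offset to (data server ID, offset within that stripe unit)."""
        stripe_index = offset // self.stripe_unit
        server = self.data_servers[stripe_index % len(self.data_servers)]
        return server, offset % self.stripe_unit

# A 1 MiB stripe unit spread round-robin across four data servers.
layout = FileLayout(stripe_unit=1 << 20, data_servers=["ds1", "ds2", "ds3", "ds4"])

print(layout.locate(0))              # ('ds1', 0)
print(layout.locate(5 * (1 << 20)))  # ('ds2', 0): the sixth stripe unit wraps to ds2
```

With a layout in hand, the client no longer needs to ask the pNFS server where anything lives; it can compute the target data server itself and talk to it directly.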
More specifically, a Read operation is a series of protocol operations: the client requests the layout for the file from the pNFS server, uses the layout to read the data directly from the storage system, and returns the layout when it no longer needs it.

A Write operation is similar, except that the client must also issue a commit (LAYOUTCOMMIT in NFSv4.1) when it finishes writing, so that the metadata server can bring the file's canonical metadata up to date.
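Sketched as code, the sequence looks roughly like the following. This is a conceptual walk-through in Python, not a working NFS client: the operation names LAYOUTGET, LAYOUTCOMMIT, and LAYOUTRETURN come from the NFSv4.1 draft, but the classes and method signatures are invented for illustration.

```python
# Conceptual walk-through of pNFS Read and Write, not a working NFS client.
# MetadataServer and DataServer stand in for the real protocol endpoints.

class DataServer:
    """Pretend storage node reached via the storage access protocol."""
    def __init__(self):
        self.blocks = {}
    def read(self, offset, length):
        return self.blocks.get(offset, b"")[:length]
    def write(self, offset, data):
        self.blocks[offset] = data

class MetadataServer:
    """Pretend pNFS (NFSv4.1) metadata server."""
    def __init__(self, data_servers):
        self.data_servers = data_servers
    def layoutget(self, path):
        # pNFS protocol: hand the client a layout describing where the file
        # lives. Here, trivially: everything sits on data server 0.
        return {"path": path, "server": 0}
    def layoutcommit(self, path, layout):
        # The client finished writing; update the canonical file metadata.
        pass
    def layoutreturn(self, path, layout):
        # The client no longer needs the layout.
        pass

def pnfs_write(mds, path, offset, data):
    layout = mds.layoutget(path)                             # 1. fetch the layout
    mds.data_servers[layout["server"]].write(offset, data)   # 2. direct I/O to storage
    mds.layoutcommit(path, layout)                           # 3. commit so metadata is updated
    mds.layoutreturn(path, layout)                           # 4. return the layout

def pnfs_read(mds, path, offset, length):
    layout = mds.layoutget(path)                                    # 1. fetch the layout
    data = mds.data_servers[layout["server"]].read(offset, length)  # 2. direct I/O
    mds.layoutreturn(path, layout)                                  # 3. return the layout
    return data

mds = MetadataServer([DataServer()])
pnfs_write(mds, "/export/results.dat", 0, b"hello, pNFS")
print(pnfs_read(mds, "/export/results.dat", 0, 11))  # b'hello, pNFS'
```

The important property is visible even in this toy: the metadata server is consulted only to hand out and retire layouts, and the data bytes themselves never pass through it.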
Layouts can be cached in each client, further enhancing speed, and a client can voluntarily relinquish a layout to the server if it's no longer of use. A server can also restrict the byte range of a Write layout to avoid quota limits or to reduce allocation overhead, among other reasons. To prevent stale caches, the metadata server recalls layouts that have become inaccurate. Following a recall, every affected client must cease I/O and either fetch the layout anew or access the file through plain NFS. Recalls are mandatory before the server attempts any file administration, such as migration or re-striping.

It's location, location, location

As mentioned above, each storage access protocol defines a type of layout, and new access protocols and layouts can be added freely. To bootstrap the use of pNFS, the vendors and researchers shaping pNFS have already defined three storage techniques: file, block, and object stores.
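Roughly speaking, the three techniques correspond to three layout types. The sketch below is only a summary in Python of how they differ; the one-line characterizations follow the general descriptions in the NFSv4.1 drafts, while the names and structure are illustrative assumptions.

```python
from enum import Enum

class LayoutType(Enum):
    # Data is striped across NFSv4.1 data servers and reached with
    # ordinary NFS read/write operations.
    FILE = "file"
    # Data lives on SAN block devices; the layout describes which blocks
    # or extents of which volumes hold the file.
    BLOCK = "block"
    # Data lives on object storage devices (OSDs); the layout names the
    # objects that make up the file.
    OBJECT = "object"

# Each layout type implies a matching storage access protocol, which is
# why a layout's format depends on its type.
for t in LayoutType:
    print(t.name, "layout ->", t.value, "storage access protocol")
```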
No matter the type of layout, pNFS uses a common scheme to refer to servers. Instead of a hostname or volume name, each server is referred to by a unique ID, and that ID is mapped to the access protocol-specific server reference. Which of these storage techniques is better? The answer is, "It depends." Budget, speed, scale, simplicity, and other factors are all part of the equation.

Before you break out your checkbook, let's look at the state of pNFS. As of this writing in November 2008, the draft Request for Comments (RFC) for NFSv4.1 is entering "last call," a two-month period set aside to collect and consider comments before the RFC is published and opened to industry-wide scrutiny. When published, the formal RFC review period often lasts a year. In addition to providing broad exposure, the draft proposed standard captured in the RFC lays a firm foundation for actual product development. Because only minor changes to the standard are expected during the forthcoming review period, vendors can design and build workable, marketable solutions now. Products from multiple vendors will be available some time next year. In the immediate term, open source prototype implementations of pNFS on Linux are available from a git repository located at the University of Michigan (see Resources for a link). IBM, Panasas, NetApp, and the University of Michigan Center for Information Technology Integration (CITI) are leading the development of NFSv4.1 and pNFS for Linux.

The potential for pNFS as an open-source parallel file system client is enormous. The fastest supercomputer in the world (as ranked by the Top500 survey) and the first computer to reach a petaflop uses the parallel file system built by Panasas, a supporter of the pNFS object-based implementation. (A petaflop is one thousand trillion floating-point operations per second.) Dubbed Roadrunner, located at the Los Alamos National Laboratory, and pictured in Figure 4, the gargantuan system has 12,960 processors, runs Linux, and is the first supercomputer to be constructed from heterogeneous processor types: both AMD Opteron X64 processors and IBM's Cell Broadband Engine drive computation. In 2006, Roadrunner demonstrated a peak 1.6 gigabytes-per-second transfer rate using an early version of Panasas's parallel file system. In 2008, the Roadrunner parallel storage system can sustain hundreds of gigabytes per second. In comparison, traditional NFS typically peaks at hundreds of megabytes per second.

Figure 4. Roadrunner, the world's first petaflop supercomputer

The entire NFSv4.1 standard and pNFS are substantive improvements to the NFS standard and represent the most radical changes made to a twenty-something-year-old technology that originated with Sun Microsystems' Bill Joy in the 1980s. Five years in development, NFSv4.1 and pNFS now stand ready (or nearly so) to provide super-storage speeds to supercomputing machines. We have seen the future, and it is parallel storage.
Martin Streicher is a freelance Ruby on Rails developer and the former Editor-in-Chief of Linux Magazine. Martin holds a Master of Science degree in computer science from Purdue University and has programmed UNIX-like systems since 1986. He collects art and toys.