Chapter 9. Networks

The subject of network configuration and performance has been extensively covered by other writers [1]. For that reason, this chapter concentrates on Sun-specific networking issues, such as the performance characteristics of the many network adapters and operating system releases.
New NFS Metrics

Local disk usage and NFS usage are functionally interchangeable, so Solaris 2.6 was changed to instrument NFS client mount points as if they were disks! NFS mounts are always shown by iostat and sar. Automounted directories come and go far more often than disks do, which may be an issue for performance tools that don’t expect the number of iostat or sar records to change often. The full instrumentation includes the wait queue for commands in the client (biod wait) that have not yet been sent to the server; the active queue for commands currently in the server; and utilization (%busy) for the server mount point activity level. Note that unlike the case with disks, 100% busy does not indicate that the server itself is saturated; it just indicates that the client always has outstanding requests to that server. An NFS server is much more complex than a disk drive and can handle many more simultaneous requests than a single disk drive can. Figure 9-1 shows the new -xnP option, although NFS mounts appear in all formats. Note that the P option suppresses disks and shows only disk partitions. The xn option breaks down the response time, svc_t, into wait and active times and puts the expanded device name at the end of the line so that long names don’t mess up the columns. The vold entry is used to mount floppy and CD-ROM devices.

Figure 9-1. Example iostat Output Showing NFS Mount Points

crun% iostat -xnP

New Network Metrics

The standard SNMP network management MIB for a network interface is supposed to contain IfInOctets and IfOutOctets counters that report the number of bytes input and output on the interface. These were not measured by network devices for Solaris 2, so the MIB always reported zero. Brian Wong and I filed bugs against all the different interfaces a few years ago, and bugs were filed more recently against the SNMP implementation. The result is that these counters have been added to the “le” and “hme” interfaces in Solaris 2.6, and the fix has been backported in patches for Solaris 2.5.1, as 103903-03 (le) and 104212-04 (hme). The new counters added are the per-interface input and output byte counts. The full set of data collected for each interface can be obtained as described in “The Solaris 2 “kstat” Interface” on page 387. An SE script, called dumpkstats.se, prints out all of the available data, and an undocumented option, netstat -k, prints out the data. In Solaris 2.6, netstat -k takes an optional kstat name, as shown in Figure 9-2, so you don’t have to search through the reams of data to find what you want.

Figure 9-2. Solaris 2.6 Example of netstat -k to See Network Interface Data in Detail
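As a quick sketch of how these new metrics can be viewed from the command line (the interface name hme0 and the 30-second interval are illustrative assumptions, not part of the original figures):

% iostat -xnP 30
% netstat -k hme0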
Virtual IP Addresses

You can configure more than one IP address on each interface, as shown in Figure 9-3. This is one way that a large machine can pretend to be many smaller machines consolidated together. It is also used in high-availability failover situations. In earlier releases, up to 256 addresses could be configured on each interface. Some large virtual web sites found this limiting, and now a new ndd tunable in Solaris 2.6 can be used to increase that limit. Up to about 8,000 addresses on a single interface have been tested. Some work was also done to speed up ifconfig of large numbers of interfaces. You configure a virtual IP address by using ifconfig on the interface, with the number separated by a colon. Solaris 2.6 also allows groups of interfaces to feed several ports on a network switch on the same network to get higher bandwidth.

Figure 9-3. Configuring More Than 256 IP Addresses Per Interface
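A minimal sketch of the kind of commands involved (the addresses, netmask, and logical instance numbers are placeholders, and the tunable name ip_addrs_per_if is my assumption for the Solaris 2.6 ndd variable mentioned above):

# ifconfig hme0:1 192.168.42.10 netmask 255.255.255.0 up
# ifconfig hme0:2 192.168.42.11 netmask 255.255.255.0 up
# ndd -set /dev/ip ip_addrs_per_if 8192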
Network Interface Types

There are many interface types in use on Sun systems. In this section, I discuss some of their distinguishing features.

10-Mbit SBus Interfaces — “le” and “qe”

The “le” interface is used on many SPARC desktop machines. The built-in Ethernet interface shares its direct memory access (DMA) connection to the SBus with the SCSI interface but has higher priority, so heavy Ethernet activity can reduce disk throughput. This can be a problem with the original DMA controller used in the SPARCstation 1, 1+, SLC, and IPC, but subsequent machines have enough DMA bandwidth to support both. The add-on SBus Ethernet card uses exactly the same interface as the built-in Ethernet but has an SBus DMA controller to itself. The more recent buffered Ethernet interfaces used in the SPARCserver 600, the SBE/S, the FSBE/S, and the DSBE/S have a 256-Kbyte buffer to provide a low-latency source and sink for the Ethernet. This buffer cuts down on dropped packets, especially when many Ethernets are configured in a system that also has multiple CPUs consuming the memory bandwidth. The disadvantage is increased CPU utilization as data is copied between the buffer and main memory. The most recent and efficient “qe” Ethernet interface uses a buffer but has a DMA mechanism to transfer data between the buffer and memory. This interface is found in the SQEC/S qe quadruple 10-Mbit Ethernet SBus card and the 100-Mbit “be” Ethernet interface SBus card.

100-Mbit Interfaces — “be” and “hme”

The 100baseT standard takes the approach of requiring shorter, higher-quality twisted-pair cables, then running the normal Ethernet standard at ten times the speed. Performance is similar to FDDI, but with the Ethernet characteristic of collisions under heavy load. It is most useful to connect a server to a hub, which converts the 100baseT signal into many conventional 10baseT signals for the client workstations.

FDDI Interfaces

Two FDDI interfaces have been produced by Sun, and several third-party PCIbus and SBus options are available as well. FDDI runs at 100 Mbits/s and so has ten times the bandwidth of standard Ethernet. The SBus FDDI/S 2.0 “bf” interface is the original Sun SBus FDDI board and driver. It is a single-width SBus card that provides single-attach only. The SBus FDDI/S 3.0, 4.0, and 5.0 “nf” software supports a range of SBus FDDI cards, including both single- and dual-attach types. These are OEM products from Network Peripherals Inc. The nf_stat command provided in /opt/SUNWconn/SUNWnf may be useful for monitoring the interface.

SBus ATM 155-Mbit Asynchronous Transfer Mode Cards

There are two versions of the SBus ATM 155-Mbit Asynchronous Transfer Mode card: one version uses a fiber interface, the other uses twisted-pair cables like the 100baseT card. The ATM standard allows isochronous connections to be set up (so audio and video data can be piped at a constant rate), but the AAL5 standard used to carry IP protocol data makes it behave like a slightly faster FDDI or 100baseT interface for general-purpose use. You can connect systems back-to-back with just a pair of ATM cards and no switch if you only need a high-speed link between two systems. ATM configures a 9-Kbyte segment size for TCP, which is much more efficient than Ethernet’s 1.5-Kbyte segment.

622-Mbit ATM Interface

The 622-Mbit ATM interface is one of the few cards that comes close to saturating an SBus. Over 500 Mbits/s of TCP traffic have been measured on a dual-CPU Ultra 2/2200.
The PCIbus version has a few refinements and a higher-bandwidth bus interface, so it runs a little more efficiently. It was used for the SPECweb96 benchmark results when the Enterprise 450 server was announced. The four-CPU E450 needed two 622-Mbit ATM interfaces to deliver maximum web server throughput. See “SPECweb96 Performance Results” on page 83.

Gigabit Ethernet Interfaces — “vge”

Gigabit Ethernet is the latest development. With the initial release, a single interface cannot completely fill the network, but this will be improved over time. If a server is feeding multiple 100-Mbit switches, then a gigabit interface may be useful because all the packets are the same 1.5-Kbyte size. Overall, Gigabit Ethernet is less efficient than ATM and slower than 622-Mbit ATM because of its small packet size and relative immaturity as a technology. If the ATM interface were going to be feeding many Ethernet networks, ATM’s large segment size would not be used, so Gigabit Ethernet may be a better choice for integrating into existing Ethernet networks.

Using NFS Effectively

The NFS protocol itself limits throughput to about 3 Mbytes/s per active client-side process because it has limited prefetch and small block sizes. The NFS version 3 protocol allows larger block sizes and other changes that improve performance on high-speed networks. This limit doesn’t apply to the aggregate throughput if you have many active client processes on a machine. First, some references:
How Many NFS Server Threads?

In SunOS 4, the NFS daemon nfsd services requests from the network, and several nfsd daemons are started so that a number of outstanding requests can be processed in parallel. Each nfsd takes one request off the network and passes it to the I/O subsystem. To cope with bursts of NFS traffic, you should configure a large number of nfsds, even on low-end machines. All the nfsds run in the kernel and do not context switch in the same way as user-level processes do, so the number of hardware contexts is not a limiting factor (despite folklore to the contrary!). If you want to “throttle back” the NFS load on a server so that it can do other things, you can reduce the number. If you configure too many nfsds, some may not be used, but it is unlikely that there will be any adverse side effects as long as you don’t run out of process table entries. Take the highest number you get by applying the following three rules:
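Whichever rule wins, the resulting count is simply the argument given to nfsd when it is started at boot time; a minimal sketch of where it is set (the count of 16 and the exact file contents are illustrative assumptions):

# SunOS 4: the nfsd line in /etc/rc.local
nfsd 16 &

# Solaris 2: the nfsd line in /etc/init.d/nfs.server
/usr/lib/nfs/nfsd -a 16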
What Is a Typical NFS Operation Mix?

There are characteristic NFS operation mixes for each environment. The SPECsfs mix is based on the load generated by slow diskless workstations with a small amount of memory that are doing intensive software development. It has a large proportion of writes compared to the typical load mix from a modern workstation. If workstations are using the cachefs option, then many reads will be avoided, so the total load is less, but the percentage of writes is more like the SPECsfs mix. Table 9-1 summarizes the information.

The nfsstat Command

The nfsstat -s command shows operation counts for the components of the NFS mix. This section is based upon the Solaris 2.4 SMCC NFS Server Performance and Tuning Guide. Figure 9-4 illustrates the results of an nfsstat -s command.

Figure 9-4. NFS Server Operation Counts
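To capture a representative sample of the mix rather than counts accumulated since boot, the counters can be zeroed first; a small sketch, assuming root access and an arbitrary ten-minute measurement interval:

# nfsstat -z
# sleep 600; nfsstat -s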
The meaning and interpretation of the measurements are as follows:
NFS Clients

On each client machine, use nfsstat -c to see the mix, as shown in Figure 9-5; for Solaris 2.6 or later clients, use iostat -xnP to see the response times.

Figure 9-5. NFS Client Operation Counts (Solaris 2.4 Version)
You can also view each UDP-based mount point by using the nfsstat -m command on a client, as shown in Figure 9-6. TCP-based NFS mounts do not use these timers.

Figure 9-6. NFS Operation Response Times Measured by Client
This output shows the smoothed round-trip times (srtt), the deviation or variability of this measure (dev), and the current time-out level for retransmission (cur). Values are converted into milliseconds and are quoted separately for read, write, lookup, and all types of calls. The system will seem slow if any of the round-trip times exceeds 50 ms. If you find a problem, watch the iostat -x measures on the server for the disks that export the slow file system, as described in “How iostat Uses the Underlying Disk Measurements” on page 194. If the write operations are much slower than the other operations, you may need a Prestoserve, assuming that writes are an important part of your mix.

NFS Server Not Responding

If you see the “not responding” message on clients and the server has been running without any coincident downtime, then you have a serious problem. Either the network connections or the network routing is having problems, or the NFS server is completely overloaded.

The netstat Command

Several options to the netstat command show various parts of the TCP/IP protocol parameters and counters. The most useful options are the basic netstat command, which monitors a single interface, and the netstat -i command, which summarizes all the interfaces. Figure 9-7 shows output from the netstat -i command.

Figure 9-7. netstat -i Output Showing Multiple Network Interfaces
From a single measurement, you can calculate the collision rate since boot time; by noting the difference in the packet and collision counts over time, you can calculate the ongoing collision rate as Collis * 100 / Opkts for each device. In this case, lo0 is the internal loopback device; bf0 is an FDDI interface, so it has no collisions; le1 has a 2.6 percent collision rate; le2 has 1.5 percent; and le3 has 1.2 percent. For more useful network performance summaries, see the network commands of the SE toolkit, as described starting with “net.se” on page 486.
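As a quick sketch of the since-boot calculation described above, the output of netstat -i can be fed through awk (this assumes the usual column order of Name, Mtu, Net/Dest, Address, Ipkts, Ierrs, Opkts, Oerrs, Collis, Queue):

% netstat -i | awk 'NR > 1 && $7 > 0 { printf "%-6s %6.2f%%\n", $1, $9 * 100 / $7 }'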