
Chapter 10. Processors

Previous chapters have looked at the disk and network components of a system. This chapter covers the processor component. The details of particular processors are dealt with later. This chapter explains the measurements that each version of Solaris makes on a per processor basis. It also discusses the trade-offs made by different multiprocessor architectures.

Monitoring Processors

This section lists the most important commands that you should know and explains where their data comes from.

CPU Control Commands — psrinfo, psradm, and psrset

psrinfo tells you which CPUs are in use and when they were last enabled or disabled. psradm actually controls the CPUs. Note that in server systems, interrupts are bound to specific CPUs to improve the cache hit rate, and clock interrupts always go to CPU 0. Even if a CPU is disabled, it still receives interrupts. The psrset command manages processor sets in Solaris 2.6.
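
If you need the same information from a program, these commands are built on a small set of processor control system calls. Below is a minimal sketch along the lines of psrinfo, assuming the p_online(2) call and the _SC_CPUID_MAX sysconf variable; it reports only on-line/off-line status and skips processor ids that are not configured.

/* minimal psrinfo-like status listing - a sketch using p_online(2) */
#include <sys/types.h>
#include <sys/processor.h>
#include <unistd.h>
#include <stdio.h>

int
main(void)
{
        long id, max = sysconf(_SC_CPUID_MAX);  /* highest possible CPU id */

        for (id = 0; id <= max; id++) {
                int status = p_online((processorid_t)id, P_STATUS);
                if (status == -1)
                        continue;               /* no processor with this id */
                printf("cpu %ld is %s\n", id,
                    status == P_ONLINE ? "on-line" : "off-line");
        }
        return (0);
}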

The Load Average

The load average is the sum of the run queue length and the number of jobs currently running on CPUs. The three figures given are averages over the last 1, 5, and 15 minutes. They can be used as an average run queue length indicator to see if more CPU power is required.

% uptime 
11:40pm up 7 day(s), 3:54, 1 user, load average: 0.27, 0.22, 0.24

The Data Behind vmstat and mpstat Output

What do all those columns of data in vmstat mean? How do they relate to the data from mpstat, and where does it all come from?

First, let’s look at Figure 10-1 to remind ourselves what vmstat itself looks like.

Figure 10-1. Example vmstat Output
% vmstat 5 
procs memory page disk faults cpu
r b w swap free re mf pi po fr de sr f0 s0 s2 s3 in sy cs us sy id
0 0 0 72724 25348 0 2 3 1 1 0 0 0 0 1 0 63 362 85 1 1 98
0 0 0 64724 25184 0 24 56 0 0 0 0 0 0 19 0 311 1112 356 2 4 94
0 0 0 64724 24796 0 5 38 0 0 0 0 0 0 15 0 92 325 212 0 1 99
0 0 0 64680 24584 0 12 106 0 0 0 0 0 0 41 0 574 1094 340 2 5 93
0 1 0 64632 23736 0 0 195 0 0 0 0 0 0 66 0 612 870 270 1 7 92
0 0 0 64628 22796 0 0 144 0 0 0 0 0 0 59 0 398 764 222 1 8 91
0 0 0 64620 22796 0 0 79 0 0 0 0 0 0 50 0 255 1383 136 2 18 80


The command prints the first line of data immediately, then every five seconds prints a new line that gives the average rates over the five-second interval. The first line is also the average rate over the interval that started when the system was booted because the numbers are stored by the system as counts of the number of times each event has happened. To average over a time interval, you measure the counters at the start and end and divide the difference by the time interval. For the very first measure, there are zero counters to subtract, so you automatically get the count since boot, divided by the time since boot. The absolute counters themselves can be seen with another option to vmstat, as shown in Figure 10-2.

Figure 10-2. Example vmstat -s Raw Counters Output
% vmstat -s 
0 swap ins
0 swap outs
0 pages swapped in
0 pages swapped out
208724 total address trans. faults taken
45821 page ins
3385 page outs
61800 pages paged in
27132 pages paged out
712 total reclaims
712 reclaims from free list
0 micro (hat) faults
208724 minor (as) faults
44636 major faults
34020 copy-on-write faults
77883 zero fill page faults
9098 pages examined by the clock daemon
1 revolutions of the clock hand
27748 pages freed by the clock daemon
1333 forks
187 vforks
1589 execs
6730851 cpu context switches
12848989 device interrupts
340014 traps
28393796 system calls
285638 total name lookups (cache hits 91%)
108 toolong
159288 user cpu
123409 system cpu
15185004 idle cpu
192794 wait cpu


The other closely related command is mpstat, which shows basically the same data but on a per-CPU basis. Figure 10-3 shows some output on a dual CPU system.

Figure 10-3. Example mpstat Output
% mpstat 5 
CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
0 1 0 4 82 17 43 0 5 1 0 182 1 1 1 97
2 1 0 3 81 17 42 0 5 2 0 181 1 1 1 97
CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
0 2 0 39 156 106 42 0 5 21 0 30 0 2 61 37
2 0 0 0 158 106 103 5 4 8 0 1704 3 36 61 0
CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
0 0 19 28 194 142 96 1 4 18 0 342 1 8 76 16
2 0 6 11 193 141 62 4 4 10 0 683 5 15 74 6
CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
0 0 22 33 215 163 87 0 7 0 0 287 1 4 90 5
2 0 22 29 214 164 88 2 8 1 0 304 2 5 89 4


The Sources of vmstat Data

The data is read from the kernel statistics interface that is described in “The Solaris 2 “kstat” Interface” on page 387. Most of the data is maintained on a per-CPU basis by the kernel and is combined into the overall summaries by the commands themselves. The same kstat(3K) programming interface that is used to get at the per-disk statistics is used. It is very lightweight—it takes only a few microseconds to retrieve each data structure. The data structures are based on those described in the file /usr/include/sys/sysinfo.h, the system information header file. Of course, all the raw kstat data is directly available to the SE toolkit, which contains customized scripts; see “vmstat.se” on page 500 and “mpvmstat.se” on page 485.
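
To illustrate how lightweight the interface is, here is a hedged sketch that reads the global sysinfo kstat directly; it assumes the kstat is named unix:0:sysinfo and that the program is linked with -lkstat.

/* read the global sysinfo kstat - a sketch, assuming unix:0:sysinfo */
#include <kstat.h>
#include <sys/sysinfo.h>
#include <stdio.h>

int
main(void)
{
        kstat_ctl_t *kc = kstat_open();         /* open /dev/kstat */
        kstat_t *ksp;
        sysinfo_t si;

        if (kc == NULL)
                return (1);
        ksp = kstat_lookup(kc, "unix", 0, "sysinfo");
        if (ksp == NULL || kstat_read(kc, ksp, &si) == -1)
                return (1);
        printf("updates %lu runque %lu swpque %lu waiting %lu\n",
            si.updates, si.runque, si.swpque, si.waiting);
        kstat_close(kc);
        return (0);
}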

Process Queues

As we follow through the fields of vmstat, we see that the first one is labeled procs, r, b, w. This field is derived from the sysinfo data, and there is a single global kstat with the same form as the sysinfo structure shown in Figure 10-4.

Figure 10-4. The sysinfo Process Queue Data Structure
typedef struct sysinfo {        /* (update freq) update action          */
        ulong   updates;        /* (1 sec) ++ */
        ulong   runque;         /* (1 sec) += num runnable procs */
        ulong   runocc;         /* (1 sec) ++ if num runnable procs > 0 */
        ulong   swpque;         /* (1 sec) += num swapped procs */
        ulong   swpocc;         /* (1 sec) ++ if num swapped procs > 0 */
        ulong   waiting;        /* (1 sec) += jobs waiting for I/O */
} sysinfo_t;

As the comments indicate, the fields are updated once per second. The run queue counts the number of runnable processes that are not running; if you add the number of processes actually running on CPUs in the system, you get the measure on which the load average is based. The extra data values, called runocc and swpocc, are displayed by sar -q and show the occupancy of the queues. Solaris 2 counts the total number of swapped-out idle processes, so if you see any swapped jobs registered in swpque, don’t be alarmed. sar -q is strange: if the number to be displayed is zero, it displays nothing at all, just white space. waiting counts how many processes are waiting for a block device transfer (a disk I/O) to complete; it shows up as the b field in vmstat.

Virtual Memory Counters

The next part of vmstat lists the free swap space and memory. This information is obtained as a kstat from a single global vminfo structure, as shown in Figure 10-5.

Figure 10-5. The vminfo Memory Usage Data Structure
typedef struct vminfo {         /* (update freq) update action          */
        longlong_t      freemem;        /* (1 sec) += freemem in pages */
        longlong_t      swap_resv;      /* (1 sec) += reserved swap in pages */
        longlong_t      swap_alloc;     /* (1 sec) += allocated swap in pages */
        longlong_t      swap_avail;     /* (1 sec) += unreserved swap in pages */
        longlong_t      swap_free;      /* (1 sec) += unallocated swap in pages */
} vminfo_t;

The only swap number shown by vmstat is swap_avail, which is the most important one. If this number ever goes to zero, your system will hang up and be unable to start more processes! For some strange reason, sar -r reports swap_free instead and converts the data into useless units of 512-byte blocks. The bizarre state of the sar command is one reason why we were motivated to create and develop the SE toolkit in the first place!

These measures are accumulated, so you calculate an average level by taking the values at two points in time and dividing the difference by the number of updates.
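
For example, here is a sketch of that calculation, assuming two vminfo samples and the matching sysinfo updates counter taken at the same two points in time:

/* average unreserved swap in pages over an interval - a sketch */
#include <sys/sysinfo.h>

double
avg_swap_avail(vminfo_t *v1, vminfo_t *v2, sysinfo_t *s1, sysinfo_t *s2)
{
        unsigned long updates = s2->updates - s1->updates; /* one per second */

        if (updates == 0)
                return (0.0);
        /* swap_avail is accumulated once per second, so the difference
           divided by the number of updates is the average level */
        return ((double)(v2->swap_avail - v1->swap_avail) / updates);
}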

Paging Counters

The paging counters shown in Figure 10-6 are maintained on a per-cpu basis; it’s also clear that vmstat and sar don’t show all of the available information. The states and state transitions being counted are described in detail in “The Life Cycle of a Typical Physical Memory Page” on page 326.

Figure 10-6. The cpu_vminfo Per CPU Paging Counters Structure
typedef struct cpu_vminfo {
        ulong   pgrec;          /* page reclaims (includes page-out) */
        ulong   pgfrec;         /* page reclaims from free list */
        ulong   pgin;           /* page-ins */
        ulong   pgpgin;         /* pages paged in */
        ulong   pgout;          /* page-outs */
        ulong   pgpgout;        /* pages paged out */
        ulong   swapin;         /* swap-ins */
        ulong   pgswapin;       /* pages swapped in */
        ulong   swapout;        /* swap-outs */
        ulong   pgswapout;      /* pages swapped out */
        ulong   zfod;           /* pages zero filled on demand */
        ulong   dfree;          /* pages freed by daemon or auto */
        ulong   scan;           /* pages examined by page-out daemon */
        ulong   rev;            /* revolutions of the page daemon hand */
        ulong   hat_fault;      /* minor page faults via hat_fault() */
        ulong   as_fault;       /* minor page faults via as_fault() */
        ulong   maj_fault;      /* major page faults */
        ulong   cow_fault;      /* copy-on-write faults */
        ulong   prot_fault;     /* protection faults */
        ulong   softlock;       /* faults due to software locking req */
        ulong   kernel_asflt;   /* as_fault()s in kernel addr space */
        ulong   pgrrun;         /* times pager scheduled */
} cpu_vminfo_t;


A few of these might need some extra explanation. Protection faults occur when a program tries to access memory it shouldn’t, gets a segmentation violation signal, and dumps a core file. Hat faults occur only on systems that have a software-managed memory management unit (sun4c and sun4u). Hat stands for hardware address translation.

I’ll skip the disk counters because vmstat is just reading the same kstat data as iostat and providing a crude count of the number of operations for the first four disks.

CPU Usage and Event Counters

Let’s remind ourselves again what vmstat looks like.

% vmstat 5 
procs memory page disk faults cpu
r b w swap free re mf pi po fr de sr f0 s0 s2 s3 in sy cs us sy id
0 0 0 72724 25348 0 2 3 1 1 0 0 0 0 1 0 63 362 85 1 1 98


The last six columns show the interrupt rate, system call rate, context switch rate, and CPU user, system, and idle time. The per-CPU structure from which these are derived is the biggest structure yet, with about 60 values. Some of them are summarized by sar, but a lot of interesting stuff here is being carefully recorded by the kernel, is read by vmstat, and is then just thrown away. Look down the comments in Figure 10-7, and at the end I’ll point out some nonobvious and interesting values. The cpu and wait states are arrays holding the four CPU states—usr/sys/idle/wait—and three wait states, io/swap/pio. Only the io wait state is implemented, so the swap and pio measurements are superfluous. (I made the mpvmstat.se script print out the wait states before I realized that they were always zero.)

Figure 10-7. The cpu_sysinfo Per-CPU System Information Structure
typedef struct cpu_sysinfo {
        ulong   cpu[CPU_STATES];        /* CPU utilization */
        ulong   wait[W_STATES];         /* CPU wait time breakdown */
        ulong   bread;                  /* physical block reads */
        ulong   bwrite;                 /* physical block writes (sync+async) */
        ulong   lread;                  /* logical block reads */
        ulong   lwrite;                 /* logical block writes */
        ulong   phread;                 /* raw I/O reads */
        ulong   phwrite;                /* raw I/O writes */
        ulong   pswitch;                /* context switches */
        ulong   trap;                   /* traps */
        ulong   intr;                   /* device interrupts */
        ulong   syscall;                /* system calls */
        ulong   sysread;                /* read() + readv() system calls */
        ulong   syswrite;               /* write() + writev() system calls */
        ulong   sysfork;                /* forks */
        ulong   sysvfork;               /* vforks */
        ulong   sysexec;                /* execs */
        ulong   readch;                 /* bytes read by rdwr() */
        ulong   writech;                /* bytes written by rdwr() */
        ulong   rcvint;                 /* XXX: UNUSED */
        ulong   xmtint;                 /* XXX: UNUSED */
        ulong   mdmint;                 /* XXX: UNUSED */
        ulong   rawch;                  /* terminal input characters */
        ulong   canch;                  /* chars handled in canonical mode */
        ulong   outch;                  /* terminal output characters */
        ulong   msg;                    /* msg count (msgrcv()+msgsnd() calls) */
        ulong   sema;                   /* semaphore ops count (semop() calls) */
        ulong   namei;                  /* pathname lookups */
        ulong   ufsiget;                /* ufs_iget() calls */
        ulong   ufsdirblk;              /* directory blocks read */
        ulong   ufsipage;               /* inodes taken with attached pages */
        ulong   ufsinopage;             /* inodes taken with no attached pages */
        ulong   inodeovf;               /* inode table overflows */
        ulong   fileovf;                /* file table overflows */
        ulong   procovf;                /* proc table overflows */
        ulong   intrthread;             /* interrupts as threads (below clock) */
        ulong   intrblk;                /* intrs blkd/preempted/released (swtch) */
        ulong   idlethread;             /* times idle thread scheduled */
        ulong   inv_swtch;              /* involuntary context switches */
        ulong   nthreads;               /* thread_create()s */
        ulong   cpumigrate;             /* cpu migrations by threads */
        ulong   xcalls;                 /* xcalls to other cpus */
        ulong   mutex_adenters;         /* failed mutex enters (adaptive) */
        ulong   rw_rdfails;             /* rw reader failures */
        ulong   rw_wrfails;             /* rw writer failures */
        ulong   modload;                /* times loadable module loaded */
        ulong   modunload;              /* times loadable module unloaded */
        ulong   bawrite;                /* physical block writes (async) */
        /* Following are gathered only under #ifdef STATISTICS in source */
        ulong   rw_enters;              /* tries to acquire rw lock */
        ulong   win_uo_cnt;             /* reg window user overflows */
        ulong   win_uu_cnt;             /* reg window user underflows */
        ulong   win_so_cnt;             /* reg window system overflows */
        ulong   win_su_cnt;             /* reg window system underflows */
        ulong   win_suo_cnt;            /* reg window system user overflows */
} cpu_sysinfo_t;


Some of the numbers printed by mpstat are visible here. The smtx value used to watch for kernel contention is mutex_adenters. The srw value is the sum of the failures to obtain a readers/writer lock. The term xcalls is shorthand for cross-calls. A cross-call occurs when one CPU wakes up another CPU by interrupting it.
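
In other words, the mpstat columns are simple combinations of cpu_sysinfo counters. Here is a sketch of the derivation, assuming two cpu_sysinfo_t samples (Figure 10-7) taken a known number of seconds apart:

/* derive the mpstat smtx, srw, and xcal rates from cpu_sysinfo - a sketch */
#include <sys/sysinfo.h>
#include <stdio.h>

void
print_lock_rates(cpu_sysinfo_t *prev, cpu_sysinfo_t *now, double interval)
{
        double smtx = (now->mutex_adenters - prev->mutex_adenters) / interval;
        double srw  = ((now->rw_rdfails - prev->rw_rdfails) +
            (now->rw_wrfails - prev->rw_wrfails)) / interval;
        double xcal = (now->xcalls - prev->xcalls) / interval;

        printf("smtx/s %.0f srw/s %.0f xcal/s %.0f\n", smtx, srw, xcal);
}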

vmstat prints out 22 columns of numbers, summarizing over a hundred underlying measures (even more on a multiprocessor). It’s good to have a lot of different things summarized in one place, but the layout of vmstat (and sar) is as much a result of their long history as it is by design. Could you do better?

I’m afraid I’m going to end up plugging the SE toolkit again. It’s just so easy to get at this data and do things with it. All the structures can be read by any user, with no need to be setuid root. (This is a key advantage of Solaris 2; other Unix systems read the kernel directly, so you would have to obtain root permissions.)

If you want to customize your very own vmstat, you could either write one from scratch in C, using the kstat library as described in “The Solaris 2 “kstat” Interface” on page 387, or you could load the SE toolkit and spend a few seconds modifying a trivial script. Either way, if you come up with something that you think is an improvement, while staying with the basic concept of one line of output that fits in 80 columns, send it to me, and we’ll add it to the next version of SE.

If you are curious about how some of these numbers behave but can’t be bothered to write your own SE scripts, you should try out the GUI front end to the raw kstat data that is provided in the SE toolkit as infotool.se. A sample snapshot is shown in Figure 16-12 on page 480.

Use of mpstat to Monitor Interrupts and Mutexes

The mpstat output shown below was measured on a 16-CPU SPARCcenter 2000 running Solaris 2.4 and with about 500 active time-shared database users. One of the key measures is smtx, the number of times the CPU failed to obtain a mutex immediately. If the number is more than about 200 per CPU, then usually system time begins to climb. The exception is the master CPU that is taking the 100 Hz clock interrupt, normally CPU 0 but in this case CPU 4. It has a larger number (1,000 or more) of very short mutex stalls that don’t hurt performance. Higher-performance CPUs can cope with higher smtx values before they start to have a problem; more recent releases of Solaris 2 are tuned to remove sources of mutex contention. The best way to solve high levels of mutex contention is to upgrade to at least Solaris 2.6, then use the lockstat command to find which mutexes are suffering from contention, as described in “Monitoring Solaris 2.6 Lock Statistics” on page 239.

Figure 10-8. Monitoring All the CPUs with mpstat
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl 
0 45 1 0 232 0 780 234 106 201 0 950 72 28 0 0
1 29 1 0 243 0 810 243 115 186 0 1045 69 31 0 0
2 27 1 0 235 0 827 243 110 199 0 1000 75 25 0 0
3 26 0 0 217 0 794 227 120 189 0 925 70 30 0 0
4 9 0 0 234 92 403 94 84 1157 0 625 66 34 0 0
5 30 1 0 510 304 764 213 119 176 0 977 69 31 0 0
6 35 1 0 296 75 786 224 114 184 0 1030 68 32 0 0
7 29 1 0 300 96 754 213 116 190 0 982 69 31 0 0
9 41 0 0 356 126 905 231 109 226 0 1078 69 31 0 0
11 26 0 0 231 2 805 235 120 199 0 1047 71 29 0 0
CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
12 29 0 0 405 164 793 238 117 183 0 879 66 34 0 0
15 31 0 0 288 71 784 223 121 200 0 1049 66 34 0 0
16 71 0 0 263 55 746 213 115 196 0 983 69 31 0 0
17 34 0 0 206 0 743 212 115 194 0 969 69 31 0 0


Cache Affinity Algorithms

When a system that has multiple caches is in use, a process may run on a CPU and load part of itself into that cache, then stop running for a while. When it resumes, the Unix scheduler must decide which CPU to run it on. To reduce cache traffic, the process must preferentially be allocated to its original CPU, but that CPU may be in use or the cache may have been cleaned out by another process in the meantime.

The cost of migrating a process to a new CPU depends on the time since it last ran, the size of the cache, and the speed of the central bus. To strike a delicate balance, a general-purpose algorithm that adapts to all kinds of workloads was developed. In the case of the E10000, this algorithm manages up to 256 Mbytes of cache and has a significant effect on performance. The algorithm works by moving jobs to a private run queue for each CPU. The job stays on a CPU’s run queue unless another CPU becomes idle, looks at all the queues, finds a job that hasn’t run for a while on another queue, and migrates the job to its own run queue. A cache will be overwritten very quickly, so there is little benefit from binding a process very tightly to a CPU.

There is also a surprising amount of low-level background activity from various daemons and dormant applications. Although this activity does not add up to a significant amount of CPU time used, it is enough to cause a large, CPU-bound job to stall long enough for it to migrate to a different CPU. The time-share scheduler ensures that CPU-bound jobs get a long time slice but a low priority. Daemons get a short time slice and a high priority, because they wake up only briefly. The average time slice for a process can be calculated from the number of context switches and the CPU time used (available from the PRUSAGE call described in “Process Data Sources” on page 416). The SE Toolkit calculates and displays this data with an example script; see “pea.se” on page 488.
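
A sketch of that calculation, assuming the prusage_t fields pr_utime, pr_stime, pr_vctx, and pr_ictx described with the process data sources:

/* average time slice = CPU time used / number of context switches - sketch */
#include <sys/procfs.h>

double
avg_timeslice(prusage_t *pru)
{
        double cpu = pru->pr_utime.tv_sec + pru->pr_utime.tv_nsec / 1e9 +
            pru->pr_stime.tv_sec + pru->pr_stime.tv_nsec / 1e9;
        unsigned long switches = pru->pr_vctx + pru->pr_ictx;

        return (switches ? cpu / switches : cpu);  /* seconds per dispatch */
}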

Unix on Shared Memory Multiprocessors

The Unix kernel has many critical regions, or sections of code, where a data structure is being created or updated. These regions must not be interrupted by a higher-priority interrupt service routine. A uniprocessor Unix kernel manages these regions by setting the interrupt mask to a high value while executing in the region. On a multiprocessor, there are other processors with their own interrupt masks, so a different technique must be used to manage critical regions.

The Spin Lock or Mutex

One key capability in shared memory, multiprocessor systems is the ability to perform interprocessor synchronization by means of atomic load/store or swap instructions. All SPARC chips have an instruction called LDSTUB, which means load-store-unsigned-byte. The instruction reads a byte from memory into a register, then writes 0xFF into memory in a single, indivisible operation. The value in the register can then be examined to see if it was already 0xFF, which means that another processor got there first, or if it was 0x00, which means that this processor is in charge. This instruction is used to make mutual exclusion locks (known as mutexes) that make sure only one processor at a time can hold the lock. The lock is acquired through LDSTUB and cleared by storing 0x00 back to memory.

If a processor does not get the lock, then it may decide to spin by sitting in a loop and testing the lock until it becomes available. By checking with a normal load instruction in a loop before issuing an LDSTUB, the processor performs the spin within the cache, and the bus snooping logic watches for the lock being cleared. In this way, spinning causes no bus traffic, so processors that are waiting do not slow down those that are working. A spin lock is appropriate when the wait is expected to be short. If a long wait is expected, the process should sleep for a while so that a different job can be scheduled onto the CPU. The extra SPARC V9 instructions introduced with UltraSPARC include an atomic compare-and-swap operation, which is used by the UltraSPARC-specific version of the kernel to make locking more efficient.
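
The sketch below shows the test-and-test-and-set idea in portable C. The C11 atomic_exchange call stands in for LDSTUB, so treat it as an illustration of the technique rather than the kernel's actual mutex code.

/* test-and-test-and-set spin lock - atomic_exchange stands in for LDSTUB */
#include <stdatomic.h>

typedef struct {
        _Atomic unsigned char locked;   /* 0x00 = free, 0xFF = held */
} spin_t;

void
spin_lock(spin_t *lk)
{
        for (;;) {
                /* LDSTUB analogue: read the old value, store 0xFF atomically */
                if (atomic_exchange(&lk->locked, 0xFF) == 0x00)
                        return;         /* we now hold the lock */
                /* spin with ordinary loads so the wait stays in the cache;
                   bus snooping brings in the line when the lock is cleared */
                while (atomic_load_explicit(&lk->locked,
                    memory_order_relaxed) != 0x00)
                        ;
        }
}

void
spin_unlock(spin_t *lk)
{
        atomic_store(&lk->locked, 0x00);        /* clear the lock */
}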

Code Locking

The simplest way to convert a Unix kernel that is using interrupt levels to control critical regions for use with multiprocessors is to replace the call that sets interrupt levels high with a call to acquire a mutex lock. At the point where the interrupt level was lowered, the lock is cleared. In this way, the same regions of code are locked for exclusive access. This method has been used to a greater or lesser extent by most MP Unix implementations, including SunOS 4 on the SPARCserver 600MP machines. The amount of actual concurrency that can take place in the kernel is controlled by the number and position of the locks.

In SunOS 4 there is effectively a single lock around the entire kernel. The reason for using a single lock is to make these MP systems totally compatible with user programs and device drivers written for uniprocessor systems. User programs can take advantage of the extra CPUs, but only one of the CPUs can be executing kernel code at a time.

When code locking is used, there are a fixed number of locks in the system; this number can be used to characterize how much concurrency is available. On a very busy, highly configured system, the code locks are likely to become bottlenecks, so that adding extra processors will not help performance and may actually reduce performance.

Data Locking and Solaris 2

The problem with code locks is that different processors often want to use the same code to work on different data. To allow this use, locks must be placed in data structures rather than in code. Unfortunately, this solution requires an extensive rewrite of the kernel—one reason why Solaris 2 took several years to create. The result is that the kernel has a lot of concurrency available and can be tuned to scale well with large numbers of processors. The same kernel is used on uniprocessors and multiprocessors, so that all device drivers and user programs must be written to work in the same environment and there is no need to constrain concurrency for compatibility with uniprocessor systems. The locks are still needed in a uniprocessor because the kernel can switch between kernel threads at any time to service an interrupt.

With data locking, there is a lock for each instance of a data structure. Since table sizes vary dynamically, the total number of locks grows as the tables grow, and the amount of concurrency that is available to exploit is greater on a very busy, highly configured system. Adding extra processors to such a system is likely to be beneficial. Solaris 2 has hundreds of different data locks; multiplied by the number of instances of the data, that adds up to many thousands of locks in existence on a running system. As Solaris 2 is tuned for more concurrency, some of the remaining code locks are turned into data locks, and “large” locks are broken down into finer-grained locks. There is a trade-off between having lots of mutexes for good MP scalability and few mutexes for reduced CPU overhead on uniprocessors.
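
A user-level sketch of the difference, with pthread mutexes standing in for kernel locks: under code locking, one lock covers every entry; under data locking, each instance carries its own lock, so updates to different entries can proceed on different CPUs at the same time.

/* code locking vs. data locking - user-level illustration, not kernel code */
#include <pthread.h>

pthread_mutex_t table_lock = PTHREAD_MUTEX_INITIALIZER; /* code lock: one for all */

struct entry {
        pthread_mutex_t lock;   /* data lock: one per instance */
        int             value;
};

void
update_entry(struct entry *e, int v)
{
        pthread_mutex_lock(&e->lock);   /* contention only on the same entry */
        e->value = v;
        pthread_mutex_unlock(&e->lock);
}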

Monitoring Solaris 2.6 Lock Statistics

In Solaris 2.6, the lockstat(1) command offers a kernel lock monitoring capability that is very powerful and easy to use. The kernel can dynamically change its locks to start collecting information and return to the normal, highly optimized lock code when the collection is complete. There is some extra overhead while the locks are instrumented, but useful measurements can be taken in a few seconds without disruption of the work on production machines. Data is presented clearly with several options for how it is summarized, but you still must have a very good understanding of Unix kernel architecture and Solaris 2, in particular, to really understand what is happening.

You should read the lockstat(1) manual page to see all its options, but in normal operation, it is run by the root user for the duration of a process for which you want to monitor the effects. If you want to run it for an interval, use the sleep command. I had trouble finding an example that would cause lock contention in Solaris 2.6. Eventually, I found that putting a C shell into an infinite loop, as shown below, generates 120,000 or more system calls per second duplicating and closing file descriptors. When two shells are run at the same time on a dual-CPU system, mpstat shows that hundreds of mutex spins per second occur on each CPU. This test is not intended to be serious! The system call mix was monitored with truss -c, as shown in Figure 10-9. With truss running, the system call rate dropped to about 12,000 calls per second, so I ran lockstat separately from truss.

% while 1 
end

Figure 10-9. System Call Mix for C Shell Loop
% truss -c -p 351
^C
syscall           seconds    calls  errors
open                  .26     2868
close                 .43    17209
getpid                .44    17208
dup                   .31    14341
sigprocmask           .65    28690
                     ----   ------  ------
sys totals:          2.09    80316       0
usr time:             .39
elapsed:             6.51

The intention was to see what system calls were causing the contention, and it appears to be a combination of dup and close, possibly sigprocmask as well. Although there are a lot of calls to getpid, they should not cause any contention.

The example output shows that there were 3,318 adaptive mutex spins in 5 seconds, which is 663 mutex stalls per second in total and which matches the data reported by mpstat. The locks and callers are shown; the top lock seems to be flock_lock, which is a file-related lock. Some other locks that are not named in this display may be members of hashed lock groups. Several of the calling routines mention file or filesystem operations and sigprocmask, which ties up with the system call trace.

If you see high levels of mutex contention, you need to identify both the locks that are being contended on and the component of the workload that is causing the contention, for example, system calls or a network protocol stack. You can then try to change the workload to avoid the problem, and you should make a service call to Sun to report the contention. As Sun pushes multiprocessor scalability to 64 processors and beyond, there are bound to be new workloads and new areas of contention that do not occur on smaller configurations. Each new release of Solaris 2 further reduces contention to allow even higher scalability and new workloads. There is no point reporting high mutex contention levels in anything but the latest release of Solaris 2.

Figure 10-10. Example lockstat Output
# lockstat sleep 5 


Adaptive mutex spin: 3318 events

Count indv cuml rcnt spin Lock Caller
-------------------------------------------------------------------------------
601 18% 18% 1.00 1 flock_lock cleanlocks+0x10
302 9% 27% 1.00 7 0xf597aab0 dev_get_dev_info+0x4c
251 8% 35% 1.00 1 0xf597aab0 mod_rele_dev_by_major+0x2c
245 7% 42% 1.00 3 0xf597aab0 cdev_size+0x74
160 5% 47% 1.00 7 0xf5b3c738 ddi_prop_search_common+0x50
157 5% 52% 1.00 1 0xf597aab0 ddi_hold_installed_driver+0x2c
141 4% 56% 1.00 2 0xf5c138e8 dnlc_lookup+0x9c
129 4% 60% 1.00 6 0xf5b1a790 ufs_readlink+0x120
128 4% 64% 1.00 1 0xf5c46910 cleanlocks+0x50
118 4% 67% 1.00 7 0xf5b1a790 ufs_readlink+0xbc
117 4% 71% 1.00 6 stable_lock specvp+0x8c
111 3% 74% 1.00 1 0xf5c732a8 spec_open+0x54
107 3% 77% 1.00 3 stable_lock stillreferenced+0x8
97 3% 80% 1.00 1 0xf5b1af00 vn_rele+0x24
92 3% 83% 1.00 19 0xf5c732a8 spec_close+0x104
84 3% 86% 1.00 1 0xf5b0711c sfind+0xc8
57 2% 87% 1.00 1 0xf5c99338 vn_rele+0x24
53 2% 89% 1.00 54 0xf5b5b2f8 sigprocmask+0xa0
46 1% 90% 1.00 1 0xf5b1af00 dnlc_lookup+0x12c
38 1% 91% 1.00 1 0xf5b1af00 lookuppn+0xbc
30 1% 92% 1.00 2 0xf5c18478 dnlc_lookup+0x9c
28 1% 93% 1.00 33 0xf5b5b4a0 sigprocmask+0xa0
24 1% 94% 1.00 4 0xf5c120b8 dnlc_lookup+0x9c
20 1% 95% 1.00 1 0xf5915c58 vn_rele+0x24
19 1% 95% 1.00 1 0xf5af1e58 ufs_lockfs_begin+0x38
16 0% 96% 1.00 1 0xf5915c58 dnlc_lookup+0x12c
16 0% 96% 1.00 1 0xf5c732a8 spec_close+0x7c
15 0% 97% 1.00 1 0xf5915b08 vn_rele+0x24
15 0% 97% 1.00 17 kmem_async_lock kmem_async_dispatch+0xc
15 0% 97% 1.00 1 0xf5af1e58 ufs_lockfs_end+0x24
12 0% 98% 1.00 1 0xf5c99338 dnlc_lookup+0x12c
11 0% 98% 1.00 1 0xf5c8ab10 vn_rele+0x24
9 0% 98% 1.00 1 0xf5b0711c vn_rele+0x24
8 0% 99% 1.00 3 0xf5c0c268 dnlc_lookup+0x9c
8 0% 99% 1.00 5 kmem_async_lock kmem_async_thread+0x290
7 0% 99% 1.00 2 0xf5c11128 dnlc_lookup+0x9c
7 0% 99% 1.00 1 0xf5b1a720 vn_rele+0x24
5 0% 99% 1.00 2 0xf5b5b2f8 clock+0x308
5 0% 100% 1.00 1 0xf5b1a720 dnlc_lookup+0x12c
3 0% 100% 1.00 1 0xf5915b08 dnlc_lookup+0x12c
2 0% 100% 1.00 1 0xf5b5b4a0 clock+0x308
2 0% 100% 1.00 1 0xf5c149b8 dnlc_lookup+0x9c
2 0% 100% 1.00 1 0xf5c8ab10 dnlc_lookup+0x12c
1 0% 100% 1.00 1 0xf5b183d0 esp_poll_loop+0xcc
1 0% 100% 1.00 1 plocks+0x110 polldel+0x28
1 0% 100% 1.00 2 kmem_async_lock kmem_async_thread+0x1ec
1 0% 100% 1.00 6 mml_table+0x18 srmmu_mlist_enter+0x18
1 0% 100% 1.00 3 mml_table+0x10 srmmu_mlist_enter+0x18
--------------------------------------------------------------------------


Adaptive mutex block: 24 events

Count indv cuml rcnt nsec Lock Caller
-------------------------------------------------------------------------------
3 12% 12% 1.00 90333 0xf5b3c738 ddi_prop_search_common+0x50
3 12% 25% 1.00 235500 flock_lock cleanlocks+0x10
2 8% 33% 1.00 112250 0xf5c138e8 dnlc_lookup+0x9c
2 8% 42% 1.00 281250 stable_lock specvp+0x8c
2 8% 50% 1.00 107000 0xf597aab0 ddi_hold_installed_driver+0x2c
2 8% 58% 1.00 89750 0xf5b5b2f8 clock+0x308
1 4% 63% 1.00 92000 0xf5c120b8 dnlc_lookup+0x9c
1 4% 67% 1.00 174000 0xf5b0711c sfind+0xc8
1 4% 71% 1.00 238500 stable_lock stillreferenced+0x8
1 4% 75% 1.00 143500 0xf5b17ab8 esp_poll_loop+0x8c
1 4% 79% 1.00 44500 0xf5b183d0 esp_poll_loop+0xcc
1 4% 83% 1.00 81000 0xf5c18478 dnlc_lookup+0x9c
1 4% 88% 1.00 413000 0xf5c0c268 dnlc_lookup+0x9c
1 4% 92% 1.00 1167000 0xf5c46910 cleanlocks+0x50
1 4% 96% 1.00 98500 0xf5c732a8 spec_close+0x7c
1 4% 100% 1.00 94000 0xf5b5b4a0 clock+0x308
--------------------------------------------------------------------------


Spin lock spin: 254 events

Count indv cuml rcnt spin Lock Caller
-------------------------------------------------------------------------------
48 19% 19% 1.00 3 atomic_lock+0x29e rw_exit+0x34
45 18% 37% 1.00 25 atomic_lock+0x24e rw_exit+0x34
44 17% 54% 1.00 1 atomic_lock+0x29d rw_exit+0x34
35 14% 68% 1.00 1 atomic_lock+0x24e rw_enter+0x34
32 13% 80% 1.00 19 cpus+0x44 disp+0x78
22 9% 89% 1.00 1 atomic_lock+0x29d rw_enter+0x34
17 7% 96% 1.00 52 cpu[2]+0x44 disp+0x78
6 2% 98% 1.00 10 cpu[2]+0x44 setbackdq+0x15c
5 2% 100% 1.00 1 atomic_lock+0x29e rw_enter+0x34
--------------------------------------------------------------------------


Thread lock spin: 3 events

Count indv cuml rcnt spin Lock Caller
-------------------------------------------------------------------------------
3 100% 100% 1.00 19 0xf68ecf5b ts_tick+0x8
-------------------------------------------------------------------------------


Multiprocessor Hardware Configurations

At any point in time, there exist CPU designs that represent the best performance that can be obtained with current technology at a reasonable price. The cost and technical difficulty of pushing the technology further mean that the most cost-effective way of increasing computer power is to use several processors. There have been many attempts to harness multiple CPUs and, today, there are many different machines on the market. Software has been the problem for these machines. It is hard to design software that will be portable across a large range of machines and that will scale up well when large numbers of CPUs are configured. Over many years, Sun has shipped high unit volumes of large-scale multiprocessor machines and has built up a large portfolio of well-tuned applications that are optimized for the Sun architecture.

The most common applications that can use multiprocessors are time-shared systems and database servers. These have been joined by multithreaded versions of CPU-intensive programs like MCAD finite element solvers, EDA circuit simulation packages, and other High Performance Computing (HPC) applications. For graphics-intensive applications where the X server uses a lot of the CPU power on a desktop machine, configuring a dual CPU system will help—the X protocol allows buffering and batched commands with few commands needing acknowledgments. In most cases, the application sends X commands to the X server and continues to run without waiting for them to complete. One CPU runs the application, and the other runs the X server and drives the display hardware.

Two classes of multiprocessor machines have some possibility of software compatibility, and both have SPARC-based implementations. For illustrations of these two classes, see Figure 10-11 and Figure 10-12.

Figure 10-11. Typical Distributed Memory Multiprocessor with Mesh Network


Figure 10-12. Typical Small-Scale, Shared Memory Multiprocessor


Distributed Memory Multiprocessors

Distributed memory multiprocessors, also known as Massively Parallel Processors (MPP), can be thought of as a network of uniprocessors. Each processor has its own memory, and data must be explicitly copied over the network to another processor before it can be used. The benefit of this approach is that there is no contention for memory bandwidth to limit the number of processors. Moreover, if the network is made up of point-to-point links, then the network throughput increases as the number of processors increases. There is no theoretical limit to the number of processors that can be used in a system of this type, but there are problems with the high latency of the point-to-point links, and it is hard to find algorithms that scale with large numbers of processors.

Shared Memory Multiprocessors

A shared memory multiprocessor is much more tightly integrated than a distributed memory multiprocessor and consists of a fairly conventional starting point of CPU, memory, and I/O subsystem, with extra CPUs added onto the central bus.

This configuration multiplies the load on the memory system by the number of processors, and the shared bus becomes a bottleneck. To reduce the load, caches are always used and very fast memory systems and buses are built. If more and faster processors are added to the design, the cache size must increase, the memory system must be improved, and the bus throughput must be increased. Most small-scale MP machines support up to four processors. Larger ones support a few tens of processors. With some workloads, the bus or memory system will saturate before the maximum number of processors has been configured.

Special circuitry snoops activity on the bus at all times so that all the caches can be kept coherent. If the current copy of some data is in more than one of the caches, then it will be marked as being shared. If it is updated in one cache, then the copies in the other caches are either invalidated or updated automatically. From a software point of view, there is never any need to explicitly copy data from one processor to another, and shared memory locations are used to communicate values among CPUs. The cache-to-cache transfers still occur when two CPUs use data in the same cache line, so such transfers must be considered from a performance point of view, but the software does not have to worry about it.
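
One common software precaution is to keep data that is updated by different CPUs in separate cache lines. The sketch below pads per-CPU counters out to an assumed 64-byte line so that updates from two CPUs never ping-pong the same cache line between caches.

/* pad per-CPU counters to separate cache lines - 64-byte line is an assumption */
#define CACHE_LINE      64

struct percpu_counter {
        unsigned long   count;
        char            pad[CACHE_LINE - sizeof (unsigned long)];
};

struct percpu_counter counters[64];     /* indexed by CPU number */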

There are many examples of shared memory mainframes and minicomputers. SPARC-based Unix multiprocessors range from the dual-processor Ultra 2 and the 4-processor E450 to the 30-processor E6000 and the 64-processor E10000.

The high-end machines normally have multiple I/O subsystems and multiple memory subsystems connected to a much wider central bus, allowing more CPUs to be configured without causing bottlenecks in the I/O and memory systems. The E10000 goes one step further by using a crossbar switch as its interconnect. Multiple transfers occur between CPU, memory, and I/O on separate point-to-point data paths that all transfer data concurrently. This mechanism provides higher throughput, low contention, and low latency.

An unavoidable fact of life is that the memory latency increases as more CPUs are handled by a system design. Even with the same CPU module configuration, an Ultra 2 or E450 has lower main memory latency than does an E6000, which in turn has lower latency than an E10000. The performance of a single uniprocessor job is therefore higher on a smaller machine. The latency of a system increases at high load levels, and this load-dependent increase is more of an issue on a smaller machine. When comparing the E6000 and E10000, it seems that the E6000 has an advantage up to 20 CPUs, and the E10000 is usually faster for loads that consume 24–28 CPUs. The E10000 is the only option from 30–64 CPUs, and it maintains a lower latency under heavy load than any other system. So, purely from a performance point of view, it is far faster to have a row of four-CPU E450s than to use an E10000 that is divided into many separate four-CPU domains.

Clusters of Shared Memory Machines

In any situation, the most efficient configuration is to use the smallest number of the fastest CPUs communicating over the lowest latency interconnect. This configuration minimizes contention and maximizes scalability. It is often necessary to compromise for practical or cost reasons, but it is clear that the best approach to building a large-scale system is to use the largest shared memory systems you can get, then cluster them together with the smallest number of nodes. The performance of a cluster is often limited by activities that are too hard to distribute, and so performance degrades to the performance of a single node for some of the time. If that node is as large as possible, then the overall efficiency is much higher.

Shared/Distributed Memory Multiprocessors

Although the two types of MP machines started out as very different approaches, the most recent examples of these types have converged.

The shared memory machines all have large caches for each CPU. For good performance, it is essential for the working set of data and instructions to be present in the cache. The cache is now analogous to the private memory of a distributed memory machine, and many algorithms that have been developed to partition data for a distributed memory machine are applicable to a cache-based, shared memory machine.

The latest distributed memory machines have moved from a software- or DMA-driven point-to-point link to an MMU-driven, cache-line-based implementation of distributed shared memory. With this system, the hardware is instructed to share a region of address space with another processor, and cache coherence is maintained over a point-to-point link. As the speed of these links approaches the speed of a shared memory backplane, the distributed memory system can be used to run shared memory algorithms more efficiently. Link speeds are comparable to the low end of shared memory backplane speeds, although they typically have much higher latency. The extra latency is due to the physical distances involved and the use of complex switching systems to interconnect large numbers of processors. Large distributed memory systems have higher latency than do smaller ones.

NUMA Multiprocessors

The hybrid that has emerged is known as nonuniform memory access, or NUMA. NUMAs are shared memory, multiprocessor machines where the memory latency varies. If the access is to memory in the local node, then access is fast; if the access request has to travel over one or more communication links, then access becomes progressively slower. The node itself is often a single-board SMP with two to four CPUs. The SGI Origin 2000 system uses a twin-CPU node; the various Intel Pentium-based systems from Sequent, Data General, and others use a 4-CPU node. However, it makes far more sense when building NUMA systems to use an entire high-end server as a node, with 20 or more CPUs. The bulk of high-end server system requirements are for 10 to 30 or so CPUs, and customers like to buy a half-configured system to allow for future expansion. For Sun this requirement can be satisfied by an SMP such as the E6000 or E10000, so Sun had no need to develop a NUMA-based system for the 1996–1998 generation. NUMA-based competitors must configure multiple nodes to satisfy today’s requirements, and users of these systems are suffering from the new problems that come with this technology.

Benchmarks such as the TPC-D data warehouse test—see http://www.—have shown that Sun’s SMP machines beat the NUMA machines on price and performance over the whole range of database sizes. This benchmark is ideally suited to NUMA because the query structures are well understood and the data layout can be optimized to keep the NUMA nodes well balanced. In real life, data warehouse queries are often constructed ad hoc, and it is not possible to optimize data placement in advance. The TPC-D benchmark therefore substantially overstates the performance of NUMA machines as data warehouse engines. Data placement is not an issue in one large SMP node. In general, the workload on a system is not constant; it can vary from one query to the next or from one day to the next. In a benchmark situation, a NUMA machine can be optimized for the fixed benchmark workload but would need to be continuously reoptimized for real-world workloads.

A technique used to speed up NUMA systems is data replication. The same data can be replicated in every node so that it is always available at a short memory latency. This technique has been overused by some vendors in benchmarks, with hundreds of Mbytes of data duplicated over all the nodes. The effect is that a large amount of RAM is wasted because there are many nodes over which to duplicate the data. In a 32-CPU system made up of 2-way or 4-way nodes, you might need to configure extra memory to hold 16 or 8 copies of the data. On a 32-CPU SMP system, there is a single copy. If there are any writes to this data, the NUMA machines must synchronize all their writes, a significant extra load.

For all these reasons, the best MPP or NUMA configuration should be built by clustering the smallest number of the largest possible SMP nodes.

CPU Caches

Performance can vary even among machines that have the same CPU and clock rate if they have different memory systems. The interaction between caches and algorithms can cause performance problems, and an algorithm change may be called for to work around the problem. This section provides information on the CPU caches and memory management unit designs of various SPARC platforms so that application developers can understand the implications of differing memory systems [1]. This section may also be useful to compiler writers.

[1] High Performance Computing by Keith Dowd covers this subject very well.

CPU Cache History

Historically, the memory systems on Sun machines provided equal access times for all memory locations. This was true on the Sun-2™ machines and was true for data accesses on the Sun-3/50™ and Sun-3/60 machines.

Equal access was achieved by running the entire memory system as fast as the processor could access it. On Motorola 68010 and 68020 processors, four clock cycles were required for each memory access. On a 10 MHz Sun-2, the memory ran at a 400 ns cycle time; on a 20 MHz Sun-3/60, it ran at a 200 ns cycle time, so the CPU never had to wait for memory to be ready.

SPARC processors run at higher clock rates and want to access memory in a single cycle. Main memory cycle times have improved a little, and very wide memory systems can transfer a 32- or 64-byte block almost as quickly as a single word. The CPU cache is designed to cope with the mismatch between the way DRAM can handle blocks of memory efficiently and the way that the CPU wants to access one word at a time at much higher speed.

The CPU cache is an area of fast memory made from static RAM, or SRAM. The cache transfers data from main DRAM memory using page mode in a block of 16, 32, or 64 bytes at a time. Hardware keeps track of the data in the cache and copies data into the cache when required.

More advanced caches have multiple levels and can be split so that instructions and data use separate caches. The UltraSPARC CPU has a 16-Kbyte instruction cache and separate 16-Kbyte data cache, which are loaded from a second-level 512-Kbyte to 4-Mbyte combined cache that loads from the memory system. We first examine simple caches before we look at the implications of more advanced configurations.

Cache Line and Size Effects

A SPARCstation 2 is a very simple design with clearly predictable behavior. More recent CPU designs are extremely complicated, so I’ll start with a simple example. The SPARCstation 2 has its 64 Kbytes of cache organized as 2048 blocks of 32 bytes each. When the CPU accesses an address, the cache controller checks to see if the right data is in the cache; if it is, then the CPU loads it without stalling. If the data needs to be fetched from main memory, then the CPU clock is effectively stopped for 24 or 25 cycles while the cache block is loaded. The implications for performance tuning are obvious. If your application is accessing memory very widely and its accesses are missing the cache rather than hitting the cache, then the CPU may spend a lot of its time stopped. By changing the application, you may be able to improve the hit rate and achieve a worthwhile performance gain. The 25-cycle delay is known as the miss cost, and the effective performance of an application is reduced by a large miss cost and a low hit rate. The effect of context switches is to further reduce the cache hit rate because after a context switch the contents of the cache will need to be replaced with instructions and data for the new process.

Applications access memory on almost every cycle. Most instructions take a single cycle to execute, and the instructions must be read from memory; data accesses typically occur on 20–30 percent of the cycles. The effect of changes in hit rate for a 25-cycle miss cost is shown in Table 10-1 and Figure 10-13 in both tabular and graphical forms. A 25-cycle miss cost implies that a hit takes 1 cycle and a miss takes 26 cycles.

Figure 10-13. Application Speed Changes as Hit Rate Varies with a 25-Cycle Miss Cost


Table 10-1. Application Speed Changes as Hit Rate Varies with a 25-Cycle Miss Cost
Hit Rate   Hit Time   Miss Time   Total Time   Performance
 100%        100%         0%         100%          100%
  99%         99%        26%         125%           80%
  98%         98%        52%         150%           66%
  96%         96%       104%         200%           50%

The execution time increases dramatically as the hit rate drops. Although a 96 percent hit rate sounds quite high, you can see that the program will be running at half speed. Many small benchmarks like Dhrystone run at 100 percent hit rate; real applications such as databases are moving data around and following linked lists and so have a poor hit rate. It isn’t that difficult to do things that are bad for the cache, so it is a common cause of reduced performance.
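
The numbers in Table 10-1 follow directly from the 1-cycle hit and 26-cycle miss costs; the short sketch below reproduces them.

/* reproduce Table 10-1: relative performance for a 25-cycle miss cost */
#include <stdio.h>

int
main(void)
{
        double rates[] = { 1.00, 0.99, 0.98, 0.96 };
        int i;

        for (i = 0; i < 4; i++) {
                double cycles = rates[i] * 1.0 + (1.0 - rates[i]) * 26.0;
                printf("hit rate %3.0f%%  performance %5.1f%%\n",
                    rates[i] * 100.0, 100.0 / cycles);
        }
        return (0);
}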

A complex processor such as UltraSPARC I does not have fixed cache miss costs; the cost depends upon the exact instruction sequencing, and in many cases the effective miss cost is reduced by complex hardware pipelining and compiler optimizations. The on-chip, 16-Kbyte data cache takes 1 cycle, the external cache between 1 and 7 clock cycles (it can supply a word of data in every cycle but may stall the CPU for up to 7 cycles if the data is used immediately), and main memory between 40 and 50 cycles, depending upon the CPU and system clock rates. With the cache hit rate varying in a two-level cache, we can’t draw a simple performance curve, but relative performance is better than the curve shown in Figure 10-13 when the code is running inside the large second-level cache, and worse when it is running from main memory.

A Problem with Linked Lists

Applications that traverse large linked lists cause problems with caches. Let’s look at this in detail on a simple SPARCstation 2 architecture and assume that the list has 5,000 entries. Each block on the list contains some data and a link to the next block. If we assume that the link is located at the start of the block and that the data is in the middle of a 100-byte block, as shown in Figure 10-14, then we can deduce the effect on the memory system of chaining down the list.

Figure 10-14. Linked List Example


The code to perform the search is a tight loop, shown in Figure 10-15. This code fits in seven words, one or two cache lines at worst, so the cache is working well for code accesses. Data accesses occur when the link and data locations are read. If the code is simply looking for a particular data value, then these data accesses will occur every few cycles. They will never be in the same cache line, so every data access will cause a 25-cycle miss that reads in 32 bytes of data when only 4 bytes were wanted. Also, with a 64-Kbyte cache, only 2,048 cache lines are available, so after 1,024 blocks have been read in, the cache lines must be reused. This means that a second attempt to search the list will find that the start of the list is no longer in the cache.

Figure 10-15. Linked List Search Code in C
/* C code to search a linked list for a value and miss the cache a lot */

struct block {
        struct block *link;     /* link to next block */
        int pad1[11];
        int data;               /* data item to check for */
        int pad2[12];
} blocks[5000];

struct block *find(pb, value)
struct block *pb;
int value;
{
        while (pb)                      /* check for end of linked list */
        {
                if (pb->data == value)  /* check for value match */
                        return pb;      /* return matching block */
                pb = pb->link;          /* follow link to next block */
        }
        return (struct block *)0;       /* return null if no match */
}

The only solution to this problem is an algorithmic change. The problem will occur on any of the current generation of computer systems. In fact, the problem gets worse as the processor gets faster since the miss cost tends to increase because of the difference in speed between the CPU and cache clock rate and the main memory speed. With more recent processors, it may be possible to cache a larger linked list in external cache, but the on-chip cache is smaller than 64 Kbytes. What may happen is that the first time through the list, each cache miss costs 50 cycles, and on subsequent passes it costs 8 cycles.
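
One possible restructuring, sketched below, keeps a packed index of the data values alongside pointers to the blocks: a sequential scan of the packed keys pulls eight 4-byte values from each 32-byte cache line instead of one, and the list blocks are only touched on a match. This is an illustration of the kind of change that helps, not the only option.

/* cache-friendly search using a packed index of the keys - a sketch */
struct block;                           /* as defined in Figure 10-15 */

struct packed_index {
        int             keys[5000];     /* contiguous copy of each data field */
        struct block    *blocks[5000];  /* corresponding block pointers */
        int             n;              /* number of entries in use */
};

struct block *
find_packed(struct packed_index *ix, int value)
{
        int i;

        for (i = 0; i < ix->n; i++)     /* sequential scan stays in the cache */
                if (ix->keys[i] == value)
                        return (ix->blocks[i]);
        return ((struct block *)0);
}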

The while loop compiles with optimization to just seven instructions, including two loads, two tests, two branches, and a no-op, as shown in Figure 10-16. Note that the instruction after a branch is always executed on a SPARC processor. This loop executes in 9 cycles (225 ns) on a SPARCstation 2 if it hits the cache, and in 59 cycles (1,475 ns) if both loads miss. On a 200 MHz UltraSPARC, the loop would run in approximately 5 cycles (25 ns) if it hit the on-chip cache, 20 cycles (100 ns) in the off-chip cache, and 100 cycles (500 ns) from main memory the first time through.

Figure 10-16. Linked List Search Loop in Assembler
LY1:                                    /* loop setup code omitted */
        cmp     %o2,%o1                 /* see if data == value */
        be      L77016                  /* and exit loop if matched */
        nop                             /* pad branch delay slot */
        ld      [%o0],%o0               /* follow link to next block */
        tst     %o0                     /* check for end of linked list */
        bne,a   LY1                     /* branch back to start of loop */
        ld      [%o0+48],%o2            /* load data in branch delay slot */

The memory latency test in Larry McVoy’s lmbench set of benchmarks is based on this situation. It gives a worst-case situation of unrelated back-to-back loads but is very common in real application code. See also “The Cache-Aligned Block Copy Problem” on page 265 for a description of another cache problem.

For detailed information on specific cache architectures, see “SPARC CPU Cache Architectures” on page 258. Chapter 12, “Caches,” on page 299 provides a more detailed explanation of the way in which caches in general work.


