Chapter 10. Processors

Previous chapters have looked at the disk and network components of a system. This chapter covers the processor component. The details of particular processors are dealt with later. This chapter explains the measurements that each version of Solaris makes on a per-processor basis. It also discusses the trade-offs made by different multiprocessor architectures.

Monitoring Processors

This section lists the most important commands that you should know and explains where their data comes from.

CPU Control Commands — psrinfo, psradm, and psrset

psrinfo tells you which CPUs are in use and when they were last enabled or disabled. psradm actually controls the CPUs. Note that in server systems, interrupts are bound to specific CPUs to improve the cache hit rate, and clock interrupts always go to CPU 0. Even if a CPU is disabled, it still receives interrupts. The psrset command manages processor sets in Solaris 2.6.

The Load Average

The load average is the sum of the run queue length and the number of jobs currently running on CPUs. The three figures given are averages over the last 1, 5, and 15 minutes. They can be used as an average run queue length indicator to see if more CPU power is required.

% uptime

The Data Behind vmstat and mpstat Output

What do all those columns of data in vmstat mean? How do they relate to the data from mpstat, and where does it all come from? First, let’s look at Figure 10-1 to remind ourselves what vmstat itself looks like.

Figure 10-1. Example vmstat Output
The command prints the first line of data immediately, then every five seconds prints a new line that gives the average rates over the five-second interval. The first line is also the average rate over the interval that started when the system was booted because the numbers are stored by the system as counts of the number of times each event has happened. To average over a time interval, you measure the counters at the start and end and divide the difference by the time interval. For the very first measure, there are zero counters to subtract, so you automatically get the count since boot, divided by the time since boot. The absolute counters themselves can be seen with another option to vmstat, as shown in Figure 10-2. Figure 10-2. Example vmstat -s Raw Counters Output
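The counter arithmetic just described can be sketched in a few lines of C. The function name and snapshot values are illustrative; the point is that a rate is always a difference between two cumulative counts divided by the elapsed time, with boot time acting as an implicit all-zero first snapshot.

```c
#include <assert.h>

/* vmstat-style rate calculation: the kernel keeps cumulative event
 * counts, so the per-second rate over an interval is the difference
 * between two snapshots divided by the elapsed time.  The very first
 * "previous snapshot" is implicitly zero (at boot), which is why the
 * first line of vmstat shows the average rate since boot. */
double rate_per_sec(unsigned long long prev_count,
                    unsigned long long curr_count,
                    double interval_secs)
{
    return (double)(curr_count - prev_count) / interval_secs;
}
```

For the first line of output, pass a previous count of zero and the time since boot as the interval; for every subsequent line, pass the last snapshot and the five-second interval.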
The other closely related command is mpstat, which shows basically the same data but on a per-CPU basis. Figure 10-3 shows some output on a dual CPU system. Figure 10-3. Example mpstat Output
The Sources of vmstat Data

The data is read from the kernel statistics interface that is described in “The Solaris 2 “kstat” Interface” on page 387. Most of the data is maintained on a per-CPU basis by the kernel and is combined into the overall summaries by the commands themselves. The same kstat(3K) programming interface that is used to get at the per-disk statistics is used. It is very lightweight—it takes only a few microseconds to retrieve each data structure. The data structures are based on those described in the file /usr/include/sys/sysinfo.h, the system information header file. Of course, all the raw kstat data is directly available to the SE toolkit, which contains customized scripts; see “vmstat.se” on page 500 and “mpvmstat.se” on page 485.

Process Queues

As we follow through the fields of vmstat, we see that the first one is labeled procs, r, b, w. This field is derived from the sysinfo data, and there is a single global kstat with the same form as the sysinfo structure shown in Figure 10-4.

Figure 10-4. The sysinfo Process Queue Data Structure
As the comments indicate, the field is updated once per second. The run queue counts the number of runnable processes that are not running. The extra data values, called runocc and swpocc, are displayed by sar -q and show the occupancy of the queues. Solaris 2 counts the total number of swapped-out idle processes, so if you see any swapped jobs registered in swpque, don’t be alarmed. sar -q is strange: if the number to be displayed is zero, sar -q displays nothing at all, just white space. If you add the number of processes actually running on CPUs in the system, then you get the same measure on which the load average is based. waiting counts how many processes are waiting for a block device transfer (a disk I/O) to complete. It shows up as the b field in vmstat.

Virtual Memory Counters

The next part of vmstat lists the free swap space and memory. This information is obtained as a kstat from a single global vminfo structure, as shown in Figure 10-5.

Figure 10-5. The vminfo Memory Usage Data Structure
The only swap number shown by vmstat is swap_avail, which is the most important one. If this number ever goes to zero, your system will hang up and be unable to start more processes! For some strange reason, sar -r reports swap_free instead and converts the data into useless units of 512-byte blocks. The bizarre state of the sar command is one reason why we were motivated to create and develop the SE toolkit in the first place! These measures are accumulated, so you calculate an average level by taking the values at two points in time and dividing the difference by the number of updates.

Paging Counters

The paging counters shown in Figure 10-6 are maintained on a per-CPU basis; it’s also clear that vmstat and sar don’t show all of the available information. The states and state transitions being counted are described in detail in “The Life Cycle of a Typical Physical Memory Page” on page 326.

Figure 10-6. The cpu_vminfo Per-CPU Paging Counters Structure
A few of these might need some extra explanation. Protection faults occur when a program tries to access memory it shouldn’t, gets a segmentation violation signal, and dumps a core file. Hat faults occur only on systems that have a software-managed memory management unit (sun4c and sun4u). Hat stands for hardware address translation. I’ll skip the disk counters because vmstat is just reading the same kstat data as iostat and providing a crude count of the number of operations for the first four disks.

CPU Usage and Event Counters

Let’s remind ourselves again what vmstat looks like.
% vmstat 5

The last six columns show the interrupt rate, system call rate, context switch rate, and CPU user, system, and idle time. The per-CPU structure from which these are derived is the biggest structure yet, with about 60 values. Some of them are summarized by sar, but a lot of interesting stuff here is being carefully recorded by the kernel, is read by vmstat, and is then just thrown away. Look down the comments in Figure 10-7, and at the end I’ll point out some nonobvious and interesting values. The cpu and wait states are arrays holding the four CPU states—usr/sys/idle/wait—and three wait states, io/swap/pio. Only the io wait state is implemented, so this measurement is superfluous. (I made the mpvmstat.se script print out the wait states before I realized that they were always zero.)

Figure 10-7. The cpu_sysinfo Per-CPU System Information Structure
Some of the numbers printed by mpstat are visible here. The smtx value used to watch for kernel contention is mutex_adenters. The srw value is the sum of the failures to obtain a readers/writer lock. The term xcalls is shorthand for cross-calls. A cross-call occurs when one CPU wakes up another CPU by interrupting it. vmstat prints out 22 columns of numbers, summarizing over a hundred underlying measures (even more on a multiprocessor). It’s good to have a lot of different things summarized in one place, but the layout of vmstat (and sar) is as much a result of their long history as it is by design. Could you do better? I’m afraid I’m going to end up plugging the SE toolkit again. It’s just so easy to get at this data and do things with it. All the structures can be read by any user, with no need to be setuid root. (This is a key advantage of Solaris 2; other Unix systems read the kernel directly, so you would have to obtain root permissions.) If you want to customize your very own vmstat, you could either write one from scratch in C, using the kstat library as described in “The Solaris 2 “kstat” Interface” on page 387, or you could load the SE toolkit and spend a few seconds modifying a trivial script. Either way, if you come up with something that you think is an improvement, while staying with the basic concept of one line of output that fits in 80 columns, send it to me, and we’ll add it to the next version of SE. If you are curious about how some of these numbers behave but can’t be bothered to write your own SE scripts, you should try out the GUI front end to the raw kstat data that is provided in the SE toolkit as infotool.se. A sample snapshot is shown in Figure 16-12 on page 480.

Use of mpstat to Monitor Interrupts and Mutexes

The mpstat output shown below was measured on a 16-CPU SPARCcenter 2000 running Solaris 2.4 and with about 500 active time-shared database users.
One of the key measures is smtx, the number of times the CPU failed to obtain a mutex immediately. If the number is more than about 200 per CPU, then usually system time begins to climb. The exception is the master CPU that is taking the 100 Hz clock interrupt, normally CPU 0 but in this case CPU 4. It has a larger number (1,000 or more) of very short mutex stalls that don’t hurt performance. Higher-performance CPUs can cope with higher smtx values before they start to have a problem; more recent releases of Solaris 2 are tuned to remove sources of mutex contention. The best way to solve high levels of mutex contention is to upgrade to at least Solaris 2.6, then use the lockstat command to find which mutexes are suffering from contention, as described in “Monitoring Solaris 2.6 Lock Statistics” on page 239. Figure 10-8. Monitoring All the CPUs with mpstat
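The rule of thumb above can be captured in a few lines. This function and its threshold are illustrative only, not part of any Solaris interface: it flags CPUs whose smtx rate exceeds roughly 200 per second, skipping the clock-interrupt CPU, which routinely shows much higher but harmless counts.

```c
#include <assert.h>
#include <stddef.h>

/* Flag mutex-contention suspects from per-CPU smtx rates.  The
 * 200/second threshold comes from the rule of thumb in the text;
 * the clock CPU is excluded because its many very short stalls do
 * not hurt performance. */
int count_contended_cpus(const int smtx[], size_t ncpu, int clock_cpu)
{
    int suspects = 0;
    for (size_t i = 0; i < ncpu; i++) {
        if ((int)i == clock_cpu)
            continue;           /* clock CPU: high count, harmless */
        if (smtx[i] > 200)
            suspects++;
    }
    return suspects;
}
```

On a higher-performance CPU or a newer Solaris release, the threshold at which system time starts to climb would be higher, so treat the constant as a starting point rather than a rule.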
Cache Affinity Algorithms

When a system that has multiple caches is in use, a process may run on a CPU and load part of itself into that cache, then stop running for a while. When it resumes, the Unix scheduler must decide which CPU to run it on. To reduce cache traffic, the process must preferentially be allocated to its original CPU, but that CPU may be in use or the cache may have been cleaned out by another process in the meantime. The cost of migrating a process to a new CPU depends on the time since it last ran, the size of the cache, and the speed of the central bus. To strike a delicate balance, a general-purpose algorithm that adapts to all kinds of workloads was developed. In the case of the E10000, this algorithm manages up to 256 Mbytes of cache and has a significant effect on performance. The algorithm works by moving jobs to a private run queue for each CPU. The job stays on a CPU’s run queue unless another CPU becomes idle, looks at all the queues, finds a job that hasn’t run for a while on another queue, and migrates the job to its own run queue. A cache will be overwritten very quickly, so there is little benefit from binding a process very tightly to a CPU. There is also a surprising amount of low-level background activity from various daemons and dormant applications. Although this activity does not add up to a significant amount of CPU time used, it is enough to cause a large, CPU-bound job to stall long enough for it to migrate to a different CPU. The time-share scheduler ensures that CPU-bound jobs get a long time slice but a low priority. Daemons get a short time slice and a high priority, because they wake up only briefly. The average time slice for a process can be calculated from the number of context switches and the CPU time used (available from the PRUSAGE call described in “Process Data Sources” on page 416). The SE Toolkit calculates and displays this data with an example script; see “pea.se” on page 488.
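The time-slice calculation just mentioned is simple arithmetic. This sketch assumes you have already obtained the CPU seconds and the context-switch count for the process over some interval:

```c
#include <assert.h>

/* Average time slice as described in the text: CPU time consumed
 * divided by the number of context switches over the same interval.
 * A CPU-bound job shows a long slice; a frequently waking daemon
 * shows a very short one. */
double avg_timeslice_ms(double cpu_secs, unsigned long context_switches)
{
    return cpu_secs * 1000.0 / (double)context_switches;
}
```

For example, a process that used 10 seconds of CPU while switching context 500 times averaged a 20-millisecond slice.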
Unix on Shared Memory Multiprocessors

The Unix kernel has many critical regions, or sections of code, where a data structure is being created or updated. These regions must not be interrupted by a higher-priority interrupt service routine. A uniprocessor Unix kernel manages these regions by setting the interrupt mask to a high value while executing in the region. On a multiprocessor, there are other processors with their own interrupt masks, so a different technique must be used to manage critical regions.

The Spin Lock or Mutex

One key capability in shared memory, multiprocessor systems is the ability to perform interprocessor synchronization by means of atomic load/store or swap instructions. All SPARC chips have an instruction called LDSTUB, which means load-store-unsigned-byte. The instruction reads a byte from memory into a register, then writes 0xFF into memory in a single, indivisible operation. The value in the register can then be examined to see if it was already 0xFF, which means that another processor got there first, or if it was 0x00, which means that this processor is in charge. This instruction is used to make mutual exclusion locks (known as mutexes) that make sure only one processor at a time can hold the lock. The lock is acquired through LDSTUB and cleared by storing 0x00 back to memory. If a processor does not get the lock, then it may decide to spin by sitting in a loop and testing the lock until it becomes available. By checking with a normal load instruction in a loop before issuing an LDSTUB, the processor performs the spin within the cache, and the bus snooping logic watches for the lock being cleared. In this way, spinning causes no bus traffic, so processors that are waiting do not slow down those that are working. A spin lock is appropriate when the wait is expected to be short. If a long wait is expected, the process should sleep for a while so that a different job can be scheduled onto the CPU.
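The LDSTUB-based lock can be sketched at user level with C11 atomics standing in for the SPARC instruction. This is a minimal illustration, not kernel source: atomic_exchange plays the role of LDSTUB (read the old byte and write 0xFF in one indivisible step), and the spin loop tests with plain loads first so that waiting happens in the cache rather than on the bus.

```c
#include <stdatomic.h>

/* A user-level sketch of an LDSTUB-style spin lock.  The byte holds
 * 0x00 when free and 0xFF when held, matching the description in the
 * text. */
typedef atomic_uchar spinlock_t;

int try_lock(spinlock_t *lock)
{
    /* Getting 0x00 back means we acquired the lock; 0xFF means
     * another processor got there first. */
    return atomic_exchange(lock, 0xFF) == 0x00;
}

void spin_lock(spinlock_t *lock)
{
    for (;;) {
        /* Spin on a plain load first, so the wait stays in the cache
         * and generates no bus traffic; snooping sees the release. */
        while (atomic_load(lock) != 0x00)
            ;
        if (try_lock(lock))
            return;     /* our exchange saw 0x00: lock acquired */
    }
}

void spin_unlock(spinlock_t *lock)
{
    atomic_store(lock, 0x00);   /* clear by storing 0x00 back */
}
```

The test-then-exchange pattern is exactly the "check with a normal load before issuing an LDSTUB" optimization described above.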
The extra SPARC V9 instructions introduced with UltraSPARC include an atomic compare-and-swap operation, which is used by the UltraSPARC-specific version of the kernel to make locking more efficient.

Code Locking

The simplest way to convert a Unix kernel that is using interrupt levels to control critical regions for use with multiprocessors is to replace the call that sets interrupt levels high with a call to acquire a mutex lock. At the point where the interrupt level was lowered, the lock is cleared. In this way, the same regions of code are locked for exclusive access. This method has been used to a greater or lesser extent by most MP Unix implementations, including SunOS 4 on the SPARCserver 600MP machines. The amount of actual concurrency that can take place in the kernel is controlled by the number and position of the locks. In SunOS 4 there is effectively a single lock around the entire kernel. The reason for using a single lock is to make these MP systems totally compatible with user programs and device drivers written for uniprocessor systems. User programs can take advantage of the extra CPUs, but only one of the CPUs can be executing kernel code at a time. When code locking is used, there are a fixed number of locks in the system; this number can be used to characterize how much concurrency is available. On a very busy, highly configured system, the code locks are likely to become bottlenecks, so that adding extra processors will not help performance and may actually reduce performance.

Data Locking and Solaris 2

The problem with code locks is that different processors often want to use the same code to work on different data. To allow this use, locks must be placed in data structures rather than in code. Unfortunately, this solution requires an extensive rewrite of the kernel—one reason why Solaris 2 took several years to create.
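Data locking can be sketched in miniature with POSIX threads. The struct here is an illustrative stand-in for a kernel table entry, not real Solaris source: instead of one lock around all the code that touches any entry, each instance of the data structure carries its own mutex, so two CPUs can update two different entries at the same time.

```c
#include <pthread.h>

/* Per-instance data lock: every table entry carries its own mutex,
 * so operations on different entries can proceed concurrently.  Under
 * code locking, one global lock would serialize them all. */
struct table_entry {
    pthread_mutex_t lock;   /* data lock for this instance only */
    long value;
};

void entry_init(struct table_entry *e)
{
    pthread_mutex_init(&e->lock, NULL);
    e->value = 0;
}

void entry_add(struct table_entry *e, long delta)
{
    pthread_mutex_lock(&e->lock);   /* contends only on this entry */
    e->value += delta;
    pthread_mutex_unlock(&e->lock);
}
```

Because the lock lives inside the structure, the number of locks grows with the tables, which is exactly why data locking scales where code locking bottlenecks.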
The result is that the kernel has a lot of concurrency available and can be tuned to scale well with large numbers of processors. The same kernel is used on uniprocessors and multiprocessors, so that all device drivers and user programs must be written to work in the same environment and there is no need to constrain concurrency for compatibility with uniprocessor systems. The locks are still needed in a uniprocessor because the kernel can switch between kernel threads at any time to service an interrupt. With data locking, there is a lock for each instance of a data structure. Since table sizes vary dynamically, the total number of locks grows as the tables grow, and the amount of concurrency that is available to exploit is greater on a very busy, highly configured system. Adding extra processors to such a system is likely to be beneficial. Solaris 2 has hundreds of different data locks and, multiplied by the number of instances of the data, there will typically be many thousands of locks in existence on a running system. As Solaris 2 is tuned for more concurrency, some of the remaining code locks are turned into data locks, and “large” locks are broken down into finer-grained locks. There is a trade-off between having lots of mutexes for good MP scalability and few mutexes for reduced CPU overhead on uniprocessors.

Monitoring Solaris 2.6 Lock Statistics

In Solaris 2.6, the lockstat(1) command offers a kernel lock monitoring capability that is very powerful and easy to use. The kernel can dynamically change its locks to start collecting information and return to the normal, highly optimized lock code when the collection is complete. There is some extra overhead while the locks are instrumented, but useful measurements can be taken in a few seconds without disruption of the work on production machines.
Data is presented clearly with several options for how it is summarized, but you still must have a very good understanding of Unix kernel architecture and Solaris 2, in particular, to really understand what is happening. You should read the lockstat(1) manual page to see all its options, but in normal operation, it is run by the root user for the duration of a process for which you want to monitor the effects. If you want to run it for an interval, use the sleep command. I had trouble finding an example that would cause lock contention in Solaris 2.6. Eventually, I found that putting a C shell into an infinite loop, as shown below, generates 120,000 or more system calls per second duplicating and shutting file descriptors. When two shells are run at the same time on a dual CPU system, mpstat shows that hundreds of mutex spins/second occur on each CPU. This test is not intended to be serious! The system call mix was monitored with truss -c, as shown in Figure 10-9. With truss running, the system call rate dropped to about 12,000 calls per second, so I ran lockstat separately from truss.

% while 1

Figure 10-9. System Call Mix for C Shell Loop
The intention was to see what system calls were causing the contention, and it appears to be a combination of dup and close, possibly sigprocmask as well. Although there are a lot of calls to getpid, they should not cause any contention. The example output shows that there were 3,318 adaptive mutex spins in 5 seconds, which is 663 mutex stalls per second in total and which matches the data reported by mpstat. The locks and callers are shown; the top lock seems to be flock_lock, which is a file-related lock. Some other locks that are not named in this display may be members of hashed lock groups. Several of the calling routines mention file or filesystem operations and sigprocmask, which ties up with the system call trace. If you see high levels of mutex contention, you need to identify both the locks that are being contended on and the component of the workload that is causing the contention, for example, system calls or a network protocol stack. You can then try to change the workload to avoid the problem, and you should make a service call to Sun to report the contention. As Sun pushes multiprocessor scalability to 64 processors and beyond, there are bound to be new workloads and new areas of contention that do not occur on smaller configurations. Each new release of Solaris 2 further reduces contention to allow even higher scalability and new workloads. There is no point reporting high mutex contention levels in anything but the latest release of Solaris 2. Figure 10-10. Example lockstat Output
Multiprocessor Hardware Configurations

At any point in time, there exist CPU designs that represent the best performance that can be obtained with current technology at a reasonable price. The cost and technical difficulty of pushing the technology further means that the most cost-effective way of increasing computer power is to use several processors. There have been many attempts to harness multiple CPUs and, today, there are many different machines on the market. Software has been the problem for these machines. It is hard to design software that will be portable across a large range of machines and that will scale up well when large numbers of CPUs are configured. Over many years, Sun has shipped high unit volumes of large-scale multiprocessor machines and has built up a large portfolio of well-tuned applications that are optimized for the Sun architecture. The most common applications that can use multiprocessors are time-shared systems and database servers. These have been joined by multithreaded versions of CPU-intensive programs like MCAD finite element solvers, EDA circuit simulation packages, and other High Performance Computing (HPC) applications. For graphics-intensive applications where the X server uses a lot of the CPU power on a desktop machine, configuring a dual CPU system will help—the X protocol allows buffering and batched commands, with few commands needing acknowledgments. In most cases, the application sends X commands to the X server and continues to run without waiting for them to complete. One CPU runs the application, and the other runs the X server and drives the display hardware. Two classes of multiprocessor machines have some possibility of software compatibility, and both have SPARC-based implementations. For illustrations of these two classes, see Figure 10-11 and Figure 10-12.

Figure 10-11. Typical Distributed Memory Multiprocessor with Mesh Network
Figure 10-12. Typical Small-Scale, Shared Memory Multiprocessor
Distributed Memory Multiprocessors

Distributed memory multiprocessors, also known as Massively Parallel Processors (MPP), can be thought of as a network of uniprocessors. Each processor has its own memory, and data must be explicitly copied over the network to another processor before it can be used. The benefit of this approach is that there is no contention for memory bandwidth to limit the number of processors. Moreover, if the network is made up of point-to-point links, then the network throughput increases as the number of processors increases. There is no theoretical limit to the number of processors that can be used in a system of this type, but there are problems with the high latency of the point-to-point links, and it is hard to find algorithms that scale with large numbers of processors.

Shared Memory Multiprocessors

A shared memory multiprocessor is much more tightly integrated than a distributed memory multiprocessor and consists of a fairly conventional starting point of CPU, memory, and I/O subsystem, with extra CPUs added onto the central bus. This configuration multiplies the load on the memory system by the number of processors, and the shared bus becomes a bottleneck. To reduce the load, caches are always used and very fast memory systems and buses are built. If more and faster processors are added to the design, the cache size must increase, the memory system must be improved, and the bus throughput must be increased. Most small-scale MP machines support up to four processors. Larger ones support a few tens of processors. With some workloads, the bus or memory system will saturate before the maximum number of processors has been configured. Special circuitry snoops activity on the bus at all times so that all the caches can be kept coherent. If the current copy of some data is in more than one of the caches, then it will be marked as being shared.
If it is updated in one cache, then the copies in the other caches are either invalidated or updated automatically. From a software point of view, there is never any need to explicitly copy data from one processor to another, and shared memory locations are used to communicate values among CPUs. The cache-to-cache transfers still occur when two CPUs use data in the same cache line, so such transfers must be considered from a performance point of view, but the software does not have to worry about it. There are many examples of shared memory mainframes and minicomputers. SPARC-based Unix multiprocessors range from the dual-processor Ultra 2 and the 4-processor E450 to the 30-processor E6000 and the 64-processor E10000. The high-end machines normally have multiple I/O subsystems and multiple memory subsystems connected to a much wider central bus, allowing more CPUs to be configured without causing bottlenecks in the I/O and memory systems. The E10000 goes one step further by using a crossbar switch as its interconnect. Multiple transfers occur between CPU, memory, and I/O on separate point-to-point data paths that all transfer data concurrently. This mechanism provides higher throughput, low contention, and low latency. An unavoidable fact of life is that the memory latency increases as more CPUs are handled by a system design. Even with the same CPU module configuration, an Ultra 2 or E450 has lower main memory latency than does an E6000, which in turn has lower latency than an E10000. The performance of a single uniprocessor job is therefore higher on a smaller machine. The latency of a system increases at high load levels, and this load-dependent increase is more of an issue on a smaller machine. When comparing the E6000 and E10000, it seems that the E6000 has an advantage up to 20 CPUs, and the E10000 is usually faster for loads that consume 24–28 CPUs.
The E10000 is the only option from 30–64 CPUs, and it maintains a lower latency under heavy load than any other system. So, from purely a performance point of view, it is far faster to have a row of four-CPU E450s than to use an E10000 that is divided into many separate four-CPU domains.

Clusters of Shared Memory Machines

In any situation, the most efficient configuration is to use the smallest number of the fastest CPUs communicating over the lowest latency interconnect. This configuration minimizes contention and maximizes scalability. It is often necessary to compromise for practical or cost reasons, but it is clear that the best approach to building a large-scale system is to use the largest shared memory systems you can get, then cluster them together with the smallest number of nodes. The performance of a cluster is often limited by activities that are too hard to distribute, and so performance degrades to the performance of a single node for some of the time. If that node is as large as possible, then the overall efficiency is much higher.

Shared/Distributed Memory Multiprocessors

Although the two types of MP machines started out as very different approaches, the most recent examples of these types have converged. The shared memory machines all have large caches for each CPU. For good performance, it is essential for the working set of data and instructions to be present in the cache. The cache is now analogous to the private memory of a distributed memory machine, and many algorithms that have been developed to partition data for a distributed memory machine are applicable to a cache-based, shared memory machine. The latest distributed memory machines have moved from a software- or DMA-driven point-to-point link to an MMU-driven, cache-line-based implementation of distributed shared memory. With this system, the hardware is instructed to share a region of address space with another processor, and cache coherence is maintained over a point-to-point link.
As the speed of these links approaches the speed of a shared memory backplane, the distributed memory system can be used to run shared memory algorithms more efficiently. Link speeds are comparable to the low end of shared memory backplane speeds, although they typically have much higher latency. The extra latency is due to the physical distances involved and the use of complex switching systems to interconnect large numbers of processors. Large distributed memory systems have higher latency than do smaller ones.

NUMA Multiprocessors

The hybrid that has emerged is known as nonuniform memory access, or NUMA. NUMAs are shared memory, multiprocessor machines where the memory latency varies. If the access is to memory in the local node, then access is fast; if the access request has to travel over one or more communication links, then access becomes progressively slower. The node itself is often a single-board SMP with two to four CPUs. The SGI Origin 2000 system uses a twin-CPU node; the various Intel Pentium-based systems from Sequent, Data General, and others use a 4-CPU node. However, it makes far more sense when building NUMA systems to use an entire high-end server as a node, with 20 or more CPUs. The bulk of high-end server system requirements are for 10 to 30 or so CPUs, and customers like to buy a half-configured system to allow for future expansion. For Sun this requirement can be satisfied by an SMP such as the E6000 or E10000, so Sun had no need to develop a NUMA-based system for the 1996–1998 generation. NUMA-based competitors must configure multiple nodes to satisfy today’s requirements, and users of these systems are suffering from the new problems that come with this technology. Benchmarks such as the TPC-D data warehouse test—see http://www.—have shown that Sun’s SMP machines do beat the NUMA machines on price and performance over the whole range of database sizes.
This benchmark is ideally suited to NUMA because the query structures are well understood and the data layout can be optimized to keep the NUMA nodes well balanced. In real life, data warehouse queries are often constructed ad hoc, and it is not possible to optimize data placement in advance. The TPC-D benchmark is substantially overstating the performance of NUMA machines as data warehouse engines because of this effect. Data placement is not an issue in one large SMP node. In general, the workload on a system is not constant; it can vary from one query to the next or from one day to the next. In a benchmark situation, a NUMA machine can be optimized for the fixed benchmark workload but would need to be continuously reoptimized for real-world workloads. A technique used to speed up NUMA systems is data replication. The same data can be replicated in every node so that it is always available at a short memory latency. This technique has been overused by some vendors in benchmarks, with hundreds of Mbytes of data duplicated over all the nodes. The effect is that a large amount of RAM is wasted because there are many nodes over which to duplicate the data. In a 32-CPU system made up of 2-way or 4-way nodes, you might need to configure extra memory to hold 16 or 8 copies of the data. On a 32-CPU SMP system, there is a single copy. If there are any writes to this data, the NUMA machines must synchronize all their writes, a significant extra load. For all these reasons, the best MPP or NUMA configuration should be built by clustering the smallest number of the largest possible SMP nodes.

CPU Caches

Performance can vary even among machines that have the same CPU and clock rate if they have different memory systems. The interaction between caches and algorithms can cause performance problems, and an algorithm change may be called for to work around the problem.
This section provides information on the CPU caches and memory management unit designs of various SPARC platforms so that application developers can understand the implications of differing memory systems [1]. This section may also be useful to compiler writers.
CPU Cache History

Historically, the memory systems on Sun machines provided equal access times for all memory locations. This was true on the Sun-2™ machines and was true for data accesses on the Sun-3/50™ and Sun-3/60 machines. Equal access was achieved by running the entire memory system as fast as the processor could access it. On Motorola 68010 and 68020 processors, four clock cycles were required for each memory access. On a 10 MHz Sun-2, the memory ran at a 400 ns cycle time; on a 20 MHz Sun-3/60, it ran at a 200 ns cycle time, so the CPU never had to wait for memory to be ready. SPARC processors run at higher clock rates and want to access memory in a single cycle. Main memory cycle times have improved a little, and very wide memory systems can transfer a 32- or 64-byte block almost as quickly as a single word. The CPU cache is designed to cope with the mismatch between the way DRAM can handle blocks of memory efficiently and the way that the CPU wants to access one word at a time at much higher speed. The CPU cache is an area of fast memory made from static RAM, or SRAM. The cache transfers data from main DRAM memory using page mode in a block of 16, 32, or 64 bytes at a time. Hardware keeps track of the data in the cache and copies data into the cache when required. More advanced caches have multiple levels and can be split so that instructions and data use separate caches. The UltraSPARC CPU has a 16-Kbyte instruction cache and separate 16-Kbyte data cache, which are loaded from a second-level 512-Kbyte to 4-Mbyte combined cache that loads from the memory system. We first examine simple caches before we look at the implications of more advanced configurations.

Cache Line and Size Effects

A SPARCstation 2 is a very simple design with clearly predictable behavior. More recent CPU designs are extremely complicated, so I’ll start with a simple example. The SPARCstation 2 has its 64 Kbytes of cache organized as 2048 blocks of 32 bytes each.
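Assuming a direct-mapped organization, the cache line that holds any given address on such a design is a simple function of the address bits; the constants below are the SPARCstation 2 figures quoted above, and the function itself is an illustration, not vendor code.

```c
#include <assert.h>

/* Which of the 2048 32-byte lines an address maps to, assuming a
 * direct-mapped 64-Kbyte cache like the SPARCstation 2 example in
 * the text: drop the 5 offset bits within the line, then wrap at
 * the cache size. */
unsigned cache_line(unsigned addr)
{
    const unsigned line_size = 32;      /* bytes per line */
    const unsigned num_lines = 2048;    /* 64 Kbytes / 32 bytes */
    return (addr / line_size) % num_lines;
}
```

Two addresses exactly 64 Kbytes apart land on the same line and evict each other, which is the kind of interaction the next sections explore.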
When the CPU accesses an address, the cache controller checks to see if the right data is in the cache; if it is, then the CPU loads it without stalling. If the data needs to be fetched from main memory, then the CPU clock is effectively stopped for 24 or 25 cycles while the cache block is loaded. The implications for performance tuning are obvious. If your application is accessing memory very widely and its accesses are missing the cache rather than hitting it, then the CPU may spend a lot of its time stopped. By changing the application, you may be able to improve the hit rate and achieve a worthwhile performance gain. The 25-cycle delay is known as the miss cost, and the effective performance of an application is reduced by a large miss cost and a low hit rate. Context switches further reduce the cache hit rate because after a context switch the contents of the cache must be replaced with instructions and data for the new process.

Applications access memory on almost every cycle. Most instructions take a single cycle to execute, and the instructions must be read from memory; data accesses typically occur on 20–30 percent of the cycles. The effects of changes in hit rate for a 25-cycle miss cost are shown in Table 10-1 and Figure 10-13 in both tabular and graphical forms. A 25-cycle miss cost implies that a hit takes 1 cycle and a miss takes 26 cycles.

Figure 10-13. Application Speed Changes as Hit Rate Varies with a 25-Cycle Miss Cost
The execution time increases dramatically as the hit rate drops. Although a 96 percent hit rate sounds quite high, you can see that the program will be running at half speed. Many small benchmarks like Dhrystone run at a 100 percent hit rate; real applications such as databases move data around and follow linked lists and so have a poor hit rate. It isn't that difficult to do things that are bad for the cache, so this is a common cause of reduced performance.

A complex processor such as the UltraSPARC I does not have fixed cache miss costs; the cost depends upon the exact instruction sequencing, and in many cases the effective miss cost is reduced by complex hardware pipelining and compiler optimizations. The on-chip, 16-Kbyte data cache takes 1 cycle; the external cache takes between 1 and 7 clock cycles (it can supply a word of data in every cycle but may stall the CPU for up to 7 cycles if the data is used immediately); and main memory takes between 40 and 50 cycles, depending upon the CPU and system clock rates. With the hit rate varying across two cache levels, we can't draw a single performance curve, but relative to the curve shown in Figure 10-13, performance is better when the system is running inside the large second-level cache and worse when it is running from main memory.

A Problem with Linked Lists

Applications that traverse large linked lists cause problems with caches. Let's look at this in detail on a simple SPARCstation 2 architecture, assuming a list of 5,000 entries. Each block on the list contains some data and a link to the next block. If we assume that the link is located at the start of the block and that the data is in the middle of a 100-byte block, as shown in Figure 10-14, then we can deduce the effect on the memory system of chaining down the list.

Figure 10-14. Linked List Example
The code to perform the search is a tight loop, shown in Figure 10-15. This code fits in seven words, one or two cache lines at worst, so the cache is working well for code accesses. Data accesses occur when the link and data locations are read. If the code is simply looking for a particular data value, then these data accesses will occur every few cycles. They will never be in the same cache line, so every data access will cause a 25-cycle miss that reads in 32 bytes of data when only 4 bytes were wanted. Also, with a 64-Kbyte cache, only 2048 cache lines are available, so after 1024 blocks have been read in, the cache lines must be reused. This means that a second attempt to search the list will find that the start of the list is no longer in the cache.

Figure 10-15. Linked List Search Code in C
The only solution to this problem is an algorithmic change. The problem will occur on any of the current generation of computer systems. In fact, the problem gets worse as the processor gets faster since the miss cost tends to increase because of the difference in speed between the CPU and cache clock rate and the main memory speed. With more recent processors, it may be possible to cache a larger linked list in external cache, but the on-chip cache is smaller than 64 Kbytes. What may happen is that the first time through the list, each cache miss costs 50 cycles, and on subsequent passes it costs 8 cycles. The while loop compiles with optimization to just seven instructions, including two loads, two tests, two branches, and a no-op, as shown in Figure 10-16. Note that the instruction after a branch is always executed on a SPARC processor. This loop executes in 9 cycles (225 ns) on a SPARCstation 2 if it hits the cache, and in 59 cycles (1,475 ns) if both loads miss. On a 200 MHz UltraSPARC, the loop would run in approximately 5 cycles (25 ns) if it hit the on-chip cache, 20 cycles (100 ns) in the off-chip cache, and 100 cycles (500 ns) from main memory the first time through. Figure 10-16. Linked List Search Loop in Assembler
The memory latency test in Larry McVoy's lmbench set of benchmarks is based on this situation. It measures the worst case of unrelated back-to-back loads, a pattern that is nonetheless very common in real application code. See also "The Cache-Aligned Block Copy Problem" on page 265 for a description of another cache problem. For detailed information on specific cache architectures, see "SPARC CPU Cache Architectures" on page 258. Chapter 12, "Caches," on page 299 provides a more detailed explanation of the way in which caches in general work.