Chapter 10. Processors

Previous chapters have looked at the disk and network components of a system. This chapter covers the processor component. The details of particular processors are dealt with later. This chapter explains the measurements that each version of Solaris makes on a per-processor basis. It also discusses the trade-offs made by different multiprocessor architectures.

Monitoring Processors

This section lists the most important commands that you should know and explains where their data comes from.

CPU Control Commands — psrinfo, psradm, and psrset

psrinfo tells you which CPUs are in use and when they were last enabled or disabled. psradm actually controls the CPUs. Note that in server systems, interrupts are bound to specific CPUs to improve the cache hit rate, and clock interrupts always go to CPU 0. Even if a CPU is disabled, it still receives interrupts. The psrset command manages processor sets in Solaris 2.6.

The Load Average

The load average is the sum of the run queue length and the number of jobs currently running on CPUs. The three figures given are averages over the last 1, 5, and 15 minutes. They can be used as an average run queue length indicator to see if more CPU power is required.

% uptime

The Data Behind vmstat and mpstat Output

What do all those columns of data in vmstat mean? How do they relate to the data from mpstat, and where does it all come from? First, let’s look at Figure 10-1 to remind ourselves what vmstat itself looks like.

Figure 10-1. Example vmstat Output
The command prints the first line of data immediately, then every five seconds prints a new line that gives the average rates over the five-second interval. The first line is also the average rate over the interval that started when the system was booted because the numbers are stored by the system as counts of the number of times each event has happened. To average over a time interval, you measure the counters at the start and end and divide the difference by the time interval. For the very first measure, there are zero counters to subtract, so you automatically get the count since boot, divided by the time since boot. The absolute counters themselves can be seen with another option to vmstat, as shown in Figure 10-2. Figure 10-2. Example vmstat -s Raw Counters Output
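The counter arithmetic just described can be sketched in a few lines of C. The function name and snapshot values are illustrative; the point is that a rate is always a difference between two cumulative counts divided by the elapsed time, with boot time acting as an implicit all-zero first snapshot.

```c
#include <assert.h>

/* vmstat-style rate calculation: the kernel keeps cumulative event
 * counts, so the per-second rate over an interval is the difference
 * between two snapshots divided by the elapsed time.  The very first
 * "previous snapshot" is implicitly zero (at boot), which is why the
 * first line of vmstat shows the average rate since boot. */
double rate_per_sec(unsigned long long prev_count,
                    unsigned long long curr_count,
                    double interval_secs)
{
    return (double)(curr_count - prev_count) / interval_secs;
}
```

For the first line of output, pass a previous count of zero and the time since boot as the interval; for every subsequent line, pass the last snapshot and the five-second interval.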
The other closely related command is mpstat, which shows basically the same data but on a per-CPU basis. Figure 10-3 shows some output on a dual CPU system. Figure 10-3. Example mpstat Output
The Sources of vmstat Data

The data is read from the kernel statistics interface that is described in “The Solaris 2 “kstat” Interface” on page 387. Most of the data is maintained on a per-CPU basis by the kernel and is combined into the overall summaries by the commands themselves. The same kstat(3K) programming interface that is used to get at the per-disk statistics is used. It is very lightweight—it takes only a few microseconds to retrieve each data structure. The data structures are based on those described in the file /usr/include/sys/sysinfo.h, the system information header file. Of course, all the raw kstat data is directly available to the SE toolkit, which contains customized scripts; see “vmstat.se” on page 500 and “mpvmstat.se” on page 485.

Process Queues

As we follow through the fields of vmstat, we see that the first one is labeled procs, r, b, w. This field is derived from the sysinfo data, and there is a single global kstat with the same form as the sysinfo structure shown in Figure 10-4.

Figure 10-4. The sysinfo Process Queue Data Structure
As the comments indicate, the field is updated once per second. The run queue counts the number of runnable processes that are not running. The extra data values, called runocc and swpocc, are displayed by sar -q and show the occupancy of the queues. Solaris 2 counts the total number of swapped-out idle processes, so if you see any swapped jobs registered in swpque, don’t be alarmed. sar -q is strange: if the number to be displayed is zero, sar -q displays nothing at all, just white space. If you add the number of processes actually running on CPUs in the system, then you get the same measure on which the load average is based. waiting counts how many processes are waiting for a block device transfer (a disk I/O) to complete. It shows up as the b field in vmstat.

Virtual Memory Counters

The next part of vmstat lists the free swap space and memory. This information is obtained as a kstat from a single global vminfo structure, as shown in Figure 10-5.

Figure 10-5. The vminfo Memory Usage Data Structure
The only swap number shown by vmstat is swap_avail, which is the most important one. If this number ever goes to zero, your system will hang up and be unable to start more processes! For some strange reason, sar -r reports swap_free instead and converts the data into useless units of 512-byte blocks. The bizarre state of the sar command is one reason why we were motivated to create and develop the SE toolkit in the first place! These measures are accumulated, so you calculate an average level by taking the values at two points in time and dividing the difference by the number of updates.

Paging Counters

The paging counters shown in Figure 10-6 are maintained on a per-CPU basis; it’s also clear that vmstat and sar don’t show all of the available information. The states and state transitions being counted are described in detail in “The Life Cycle of a Typical Physical Memory Page” on page 326.

Figure 10-6. The cpu_vminfo Per-CPU Paging Counters Structure
A few of these might need some extra explanation. Protection faults occur when a program tries to access memory it shouldn’t, gets a segmentation violation signal, and dumps a core file. Hat faults occur only on systems that have a software-managed memory management unit (sun4c and sun4u). Hat stands for hardware address translation. I’ll skip the disk counters because vmstat is just reading the same kstat data as iostat and providing a crude count of the number of operations for the first four disks.

CPU Usage and Event Counters

Let’s remind ourselves again what vmstat looks like.
% vmstat 5

The last six columns show the interrupt rate, system call rate, context switch rate, and CPU user, system, and idle time. The per-CPU structure from which these are derived is the biggest structure yet, with about 60 values. Some of them are summarized by sar, but a lot of interesting stuff here is being carefully recorded by the kernel, is read by vmstat, and is then just thrown away. Look down the comments in Figure 10-7, and at the end I’ll point out some nonobvious and interesting values. The cpu and wait states are arrays holding the four CPU states—usr/sys/idle/wait—and three wait states, io/swap/pio. Only the io wait state is implemented, so this measurement is superfluous. (I made the mpvmstat.se script print out the wait states before I realized that they were always zero.)

Figure 10-7. The cpu_sysinfo Per-CPU System Information Structure
Some of the numbers printed by mpstat are visible here. The smtx value used to watch for kernel contention is mutex_adenters. The srw value is the sum of the failures to obtain a readers/writer lock. The term xcalls is shorthand for cross-calls. A cross-call occurs when one CPU wakes up another CPU by interrupting it. vmstat prints out 22 columns of numbers, summarizing over a hundred underlying measures (even more on a multiprocessor). It’s good to have a lot of different things summarized in one place, but the layout of vmstat (and sar) is as much a result of their long history as it is by design. Could you do better? I’m afraid I’m going to end up plugging the SE toolkit again. It’s just so easy to get at this data and do things with it. All the structures can be read by any user, with no need to be setuid root. (This is a key advantage of Solaris 2; other Unix systems read the kernel directly, so you would have to obtain root permissions.) If you want to customize your very own vmstat, you could either write one from scratch in C, using the kstat library as described in “The Solaris 2 “kstat” Interface” on page 387, or you could load the SE toolkit and spend a few seconds modifying a trivial script. Either way, if you come up with something that you think is an improvement, while staying with the basic concept of one line of output that fits in 80 columns, send it to me, and we’ll add it to the next version of SE. If you are curious about how some of these numbers behave but can’t be bothered to write your own SE scripts, you should try out the GUI front end to the raw kstat data that is provided in the SE toolkit as infotool.se. A sample snapshot is shown in Figure 16-12 on page 480.

Use of mpstat to Monitor Interrupts and Mutexes

The mpstat output shown below was measured on a 16-CPU SPARCcenter 2000 running Solaris 2.4 and with about 500 active time-shared database users.
One of the key measures is smtx, the number of times the CPU failed to obtain a mutex immediately. If the number is more than about 200 per CPU, then usually system time begins to climb. The exception is the master CPU that is taking the 100 Hz clock interrupt, normally CPU 0 but in this case CPU 4. It has a larger number (1,000 or more) of very short mutex stalls that don’t hurt performance. Higher-performance CPUs can cope with higher smtx values before they start to have a problem; more recent releases of Solaris 2 are tuned to remove sources of mutex contention. The best way to solve high levels of mutex contention is to upgrade to at least Solaris 2.6, then use the lockstat command to find which mutexes are suffering from contention, as described in “Monitoring Solaris 2.6 Lock Statistics” on page 239. Figure 10-8. Monitoring All the CPUs with mpstat
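The rule of thumb above can be captured in a few lines. This function and its threshold are illustrative only, not part of any Solaris interface: it flags CPUs whose smtx rate exceeds roughly 200 per second, skipping the clock-interrupt CPU, which routinely shows much higher but harmless counts.

```c
#include <assert.h>
#include <stddef.h>

/* Flag mutex-contention suspects from per-CPU smtx rates.  The
 * 200/second threshold comes from the rule of thumb in the text;
 * the clock CPU is excluded because its many very short stalls do
 * not hurt performance. */
int count_contended_cpus(const int smtx[], size_t ncpu, int clock_cpu)
{
    int suspects = 0;
    for (size_t i = 0; i < ncpu; i++) {
        if ((int)i == clock_cpu)
            continue;           /* clock CPU: high count, harmless */
        if (smtx[i] > 200)
            suspects++;
    }
    return suspects;
}
```

On a higher-performance CPU or a newer Solaris release, the threshold at which system time starts to climb would be higher, so treat the constant as a starting point rather than a rule.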
Cache Affinity Algorithms

When a system that has multiple caches is in use, a process may run on a CPU and load part of itself into that cache, then stop running for a while. When it resumes, the Unix scheduler must decide which CPU to run it on. To reduce cache traffic, the process must preferentially be allocated to its original CPU, but that CPU may be in use or the cache may have been cleaned out by another process in the meantime. The cost of migrating a process to a new CPU depends on the time since it last ran, the size of the cache, and the speed of the central bus. To strike a delicate balance, a general-purpose algorithm that adapts to all kinds of workloads was developed. In the case of the E10000, this algorithm manages up to 256 Mbytes of cache and has a significant effect on performance. The algorithm works by moving jobs to a private run queue for each CPU. The job stays on a CPU’s run queue unless another CPU becomes idle, looks at all the queues, finds a job that hasn’t run for a while on another queue, and migrates the job to its own run queue. A cache will be overwritten very quickly, so there is little benefit from binding a process very tightly to a CPU. There is also a surprising amount of low-level background activity from various daemons and dormant applications. Although this activity does not add up to a significant amount of CPU time used, it is enough to cause a large, CPU-bound job to stall long enough for it to migrate to a different CPU. The time-share scheduler ensures that CPU-bound jobs get a long time slice but a low priority. Daemons get a short time slice and a high priority, because they wake up only briefly. The average time slice for a process can be calculated from the number of context switches and the CPU time used (available from the PRUSAGE call described in “Process Data Sources” on page 416). The SE Toolkit calculates and displays this data with an example script; see “pea.se” on page 488.
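The time-slice calculation just mentioned is simple arithmetic. This sketch assumes you have already obtained the CPU seconds and the context-switch count for the process over some interval:

```c
#include <assert.h>

/* Average time slice as described in the text: CPU time consumed
 * divided by the number of context switches over the same interval.
 * A CPU-bound job shows a long slice; a frequently waking daemon
 * shows a very short one. */
double avg_timeslice_ms(double cpu_secs, unsigned long context_switches)
{
    return cpu_secs * 1000.0 / (double)context_switches;
}
```

For example, a process that used 10 seconds of CPU while switching context 500 times averaged a 20-millisecond slice.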
Unix on Shared Memory Multiprocessors

The Unix kernel has many critical regions, or sections of code, where a data structure is being created or updated. These regions must not be interrupted by a higher-priority interrupt service routine. A uniprocessor Unix kernel manages these regions by setting the interrupt mask to a high value while executing in the region. On a multiprocessor, there are other processors with their own interrupt masks, so a different technique must be used to manage critical regions.

The Spin Lock or Mutex

One key capability in shared memory, multiprocessor systems is the ability to perform interprocessor synchronization by means of atomic load/store or swap instructions. All SPARC chips have an instruction called LDSTUB, which means load-store-unsigned-byte. The instruction reads a byte from memory into a register, then writes 0xFF into memory in a single, indivisible operation. The value in the register can then be examined to see if it was already 0xFF, which means that another processor got there first, or if it was 0x00, which means that this processor is in charge. This instruction is used to make mutual exclusion locks (known as mutexes) that make sure only one processor at a time can hold the lock. The lock is acquired through LDSTUB and cleared by storing 0x00 back to memory. If a processor does not get the lock, then it may decide to spin by sitting in a loop and testing the lock until it becomes available. By checking with a normal load instruction in a loop before issuing an LDSTUB, the processor performs the spin within the cache, and the bus snooping logic watches for the lock being cleared. In this way, spinning causes no bus traffic, so processors that are waiting do not slow down those that are working. A spin lock is appropriate when the wait is expected to be short. If a long wait is expected, the process should sleep for a while so that a different job can be scheduled onto the CPU.
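The LDSTUB-based lock can be sketched at user level with C11 atomics standing in for the SPARC instruction. This is a minimal illustration, not kernel source: atomic_exchange plays the role of LDSTUB (read the old byte and write 0xFF in one indivisible step), and the spin loop tests with plain loads first so that waiting happens in the cache rather than on the bus.

```c
#include <stdatomic.h>

/* A user-level sketch of an LDSTUB-style spin lock.  The byte holds
 * 0x00 when free and 0xFF when held, matching the description in the
 * text. */
typedef atomic_uchar spinlock_t;

int try_lock(spinlock_t *lock)
{
    /* Getting 0x00 back means we acquired the lock; 0xFF means
     * another processor got there first. */
    return atomic_exchange(lock, 0xFF) == 0x00;
}

void spin_lock(spinlock_t *lock)
{
    for (;;) {
        /* Spin on a plain load first, so the wait stays in the cache
         * and generates no bus traffic; snooping sees the release. */
        while (atomic_load(lock) != 0x00)
            ;
        if (try_lock(lock))
            return;     /* our exchange saw 0x00: lock acquired */
    }
}

void spin_unlock(spinlock_t *lock)
{
    atomic_store(lock, 0x00);   /* clear by storing 0x00 back */
}
```

The test-then-exchange pattern is exactly the "check with a normal load before issuing an LDSTUB" optimization described above.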
The extra SPARC V9 instructions introduced with UltraSPARC include an atomic compare-and-swap operation, which is used by the UltraSPARC-specific version of the kernel to make locking more efficient.

Code Locking

The simplest way to convert a Unix kernel that is using interrupt levels to control critical regions for use with multiprocessors is to replace the call that sets interrupt levels high with a call to acquire a mutex lock. At the point where the interrupt level was lowered, the lock is cleared. In this way, the same regions of code are locked for exclusive access. This method has been used to a greater or lesser extent by most MP Unix implementations, including SunOS 4 on the SPARCserver 600MP machines. The amount of actual concurrency that can take place in the kernel is controlled by the number and position of the locks. In SunOS 4 there is effectively a single lock around the entire kernel. The reason for using a single lock is to make these MP systems totally compatible with user programs and device drivers written for uniprocessor systems. User programs can take advantage of the extra CPUs, but only one of the CPUs can be executing kernel code at a time. When code locking is used, there are a fixed number of locks in the system; this number can be used to characterize how much concurrency is available. On a very busy, highly configured system, the code locks are likely to become bottlenecks, so that adding extra processors will not help performance and may actually reduce performance.

Data Locking and Solaris 2

The problem with code locks is that different processors often want to use the same code to work on different data. To allow this use, locks must be placed in data structures rather than in code. Unfortunately, this solution requires an extensive rewrite of the kernel—one reason why Solaris 2 took several years to create.
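Data locking can be sketched in miniature with POSIX threads. The struct here is an illustrative stand-in for a kernel table entry, not real Solaris source: instead of one lock around all the code that touches any entry, each instance of the data structure carries its own mutex, so two CPUs can update two different entries at the same time.

```c
#include <pthread.h>

/* Per-instance data lock: every table entry carries its own mutex,
 * so operations on different entries can proceed concurrently.  Under
 * code locking, one global lock would serialize them all. */
struct table_entry {
    pthread_mutex_t lock;   /* data lock for this instance only */
    long value;
};

void entry_init(struct table_entry *e)
{
    pthread_mutex_init(&e->lock, NULL);
    e->value = 0;
}

void entry_add(struct table_entry *e, long delta)
{
    pthread_mutex_lock(&e->lock);   /* contends only on this entry */
    e->value += delta;
    pthread_mutex_unlock(&e->lock);
}
```

Because the lock lives inside the structure, the number of locks grows with the tables, which is exactly why data locking scales where code locking bottlenecks.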
The result is that the kernel has a lot of concurrency available and can be tuned to scale well with large numbers of processors. The same kernel is used on uniprocessors and multiprocessors, so that all device drivers and user programs must be written to work in the same environment and there is no need to constrain concurrency for compatibility with uniprocessor systems. The locks are still needed in a uniprocessor because the kernel can switch between kernel threads at any time to service an interrupt. With data locking, there is a lock for each instance of a data structure. Since table sizes vary dynamically, the total number of locks grows as the tables grow, and the amount of concurrency that is available to exploit is greater on a very busy, highly configured system. Adding extra processors to such a system is likely to be beneficial. Solaris 2 has hundreds of different data locks and, multiplied by the number of instances of the data, there will typically be many thousands of locks in existence on a running system. As Solaris 2 is tuned for more concurrency, some of the remaining code locks are turned into data locks, and “large” locks are broken down into finer-grained locks. There is a trade-off between having lots of mutexes for good MP scalability and few mutexes for reduced CPU overhead on uniprocessors.

Monitoring Solaris 2.6 Lock Statistics

In Solaris 2.6, the lockstat(1) command offers a kernel lock monitoring capability that is very powerful and easy to use. The kernel can dynamically change its locks to start collecting information and return to the normal, highly optimized lock code when the collection is complete. There is some extra overhead while the locks are instrumented, but useful measurements can be taken in a few seconds without disruption of the work on production machines.
Data is presented clearly with several options for how it is summarized, but you still must have a very good understanding of Unix kernel architecture and Solaris 2, in particular, to really understand what is happening. You should read the lockstat(1) manual page to see all its options, but in normal operation, it is run by the root user for the duration of a process for which you want to monitor the effects. If you want to run it for an interval, use the sleep command. I had trouble finding an example that would cause lock contention in Solaris 2.6. Eventually, I found that putting a C shell into an infinite loop, as shown below, generates 120,000 or more system calls per second duplicating and shutting file descriptors. When two shells are run at the same time on a dual CPU system, mpstat shows that hundreds of mutex spins/second occur on each CPU. This test is not intended to be serious! The system call mix was monitored with truss -c, as shown in Figure 10-9. With truss running, the system call rate dropped to about 12,000 calls per second, so I ran lockstat separately from truss.

% while 1

Figure 10-9. System Call Mix for C Shell Loop
The intention was to see what system calls were causing the contention, and it appears to be a combination of dup and close, possibly sigprocmask as well. Although there are a lot of calls to getpid, they should not cause any contention. The example output shows that there were 3,318 adaptive mutex spins in 5 seconds, which is 663 mutex stalls per second in total and which matches the data reported by mpstat. The locks and callers are shown; the top lock seems to be flock_lock, which is a file-related lock. Some other locks that are not named in this display may be members of hashed lock groups. Several of the calling routines mention file or filesystem operations and sigprocmask, which ties up with the system call trace. If you see high levels of mutex contention, you need to identify both the locks that are being contended on and the component of the workload that is causing the contention, for example, system calls or a network protocol stack. You can then try to change the workload to avoid the problem, and you should make a service call to Sun to report the contention. As Sun pushes multiprocessor scalability to 64 processors and beyond, there are bound to be new workloads and new areas of contention that do not occur on smaller configurations. Each new release of Solaris 2 further reduces contention to allow even higher scalability and new workloads. There is no point reporting high mutex contention levels in anything but the latest release of Solaris 2. Figure 10-10. Example lockstat Output
Multiprocessor Hardware Configurations

At any point in time, there exist CPU designs that represent the best performance that can be obtained with current technology at a reasonable price. The cost and technical difficulty of pushing the technology further means that the most cost-effective way of increasing computer power is to use several processors. There have been many attempts to harness multiple CPUs and, today, there are many different machines on the market. Software has been the problem for these machines. It is hard to design software that will be portable across a large range of machines and that will scale up well when large numbers of CPUs are configured. Over many years, Sun has shipped high unit volumes of large-scale multiprocessor machines and has built up a large portfolio of well-tuned applications that are optimized for the Sun architecture. The most common applications that can use multiprocessors are time-shared systems and database servers. These have been joined by multithreaded versions of CPU-intensive programs like MCAD finite element solvers, EDA circuit simulation packages, and other High Performance Computing (HPC) applications. For graphics-intensive applications where the X server uses a lot of the CPU power on a desktop machine, configuring a dual CPU system will help—the X protocol allows buffering and batched commands, with few commands needing acknowledgments. In most cases, the application sends X commands to the X server and continues to run without waiting for them to complete. One CPU runs the application, and the other runs the X server and drives the display hardware. Two classes of multiprocessor machines have some possibility of software compatibility, and both have SPARC-based implementations. For illustrations of these two classes, see Figure 10-11 and Figure 10-12.

Figure 10-11. Typical Distributed Memory Multiprocessor with Mesh Network
Figure 10-12. Typical Small-Scale, Shared Memory Multiprocessor
Distributed Memory Multiprocessors

Distributed memory multiprocessors, also known as Massively Parallel Processors (MPP), can be thought of as a network of uniprocessors. Each processor has its own memory, and data must be explicitly copied over the network to another processor before it can be used. The benefit of this approach is that there is no contention for memory bandwidth to limit the number of processors. Moreover, if the network is made up of point-to-point links, then the network throughput increases as the number of processors increases. There is no theoretical limit to the number of processors that can be used in a system of this type, but there are problems with the high latency of the point-to-point links, and it is hard to find algorithms that scale with large numbers of processors.

Shared Memory Multiprocessors

A shared memory multiprocessor is much more tightly integrated than a distributed memory multiprocessor and consists of a fairly conventional starting point of CPU, memory, and I/O subsystem, with extra CPUs added onto the central bus. This configuration multiplies the load on the memory system by the number of processors, and the shared bus becomes a bottleneck. To reduce the load, caches are always used and very fast memory systems and buses are built. If more and faster processors are added to the design, the cache size must increase, the memory system must be improved, and the bus throughput must be increased. Most small-scale MP machines support up to four processors. Larger ones support a few tens of processors. With some workloads, the bus or memory system will saturate before the maximum number of processors has been configured. Special circuitry snoops activity on the bus at all times so that all the caches can be kept coherent. If the current copy of some data is in more than one of the caches, then it will be marked as being shared.
If it is updated in one cache, then the copies in the other caches are either invalidated or updated automatically. From a software point of view, there is never any need to explicitly copy data from one processor to another, and shared memory locations are used to communicate values among CPUs. The cache-to-cache transfers still occur when two CPUs use data in the same cache line, so such transfers must be considered from a performance point of view, but the software does not have to worry about it. There are many examples of shared memory mainframes and minicomputers. SPARC-based Unix multiprocessors range from the dual-processor Ultra 2 and the 4-processor E450 to the 30-processor E6000 and the 64-processor E10000. The high-end machines normally have multiple I/O subsystems and multiple memory subsystems connected to a much wider central bus, allowing more CPUs to be configured without causing bottlenecks in the I/O and memory systems. The E10000 goes one step further by using a crossbar switch as its interconnect. Multiple transfers occur between CPU, memory, and I/O on separate point-to-point data paths that all transfer data concurrently. This mechanism provides higher throughput, low contention, and low latency. An unavoidable fact of life is that the memory latency increases as more CPUs are handled by a system design. Even with the same CPU module configuration, an Ultra 2 or E450 has lower main memory latency than does an E6000, which in turn has lower latency than an E10000. The performance of a single uniprocessor job is therefore higher on a smaller machine. The latency of a system increases at high load levels, and this load-dependent increase is more of an issue on a smaller machine. When comparing the E6000 and E10000, it seems that the E6000 has an advantage up to 20 CPUs, and the E10000 is usually faster for loads that consume 24–28 CPUs.
The E10000 is the only option from 30–64 CPUs, and it maintains a lower latency under heavy load than any other system. So, from purely a performance point of view, it is far faster to have a row of four-CPU E450s than to use an E10000 that is divided into many separate four-CPU domains.

Clusters of Shared Memory Machines

In any situation, the most efficient configuration is to use the smallest number of the fastest CPUs communicating over the lowest latency interconnect. This configuration minimizes contention and maximizes scalability. It is often necessary to compromise for practical or cost reasons, but it is clear that the best approach to building a large-scale system is to use the largest shared memory systems you can get, then cluster them together with the smallest number of nodes. The performance of a cluster is often limited by activities that are too hard to distribute, and so performance degrades to the performance of a single node for some of the time. If that node is as large as possible, then the overall efficiency is much higher.

Shared/Distributed Memory Multiprocessors

Although the two types of MP machines started out as very different approaches, the most recent examples of these types have converged. The shared memory machines all have large caches for each CPU. For good performance, it is essential for the working set of data and instructions to be present in the cache. The cache is now analogous to the private memory of a distributed memory machine, and many algorithms that have been developed to partition data for a distributed memory machine are applicable to a cache-based, shared memory machine. The latest distributed memory machines have moved from a software- or DMA-driven point-to-point link to an MMU-driven, cache-line-based implementation of distributed shared memory. With this system, the hardware is instructed to share a region of address space with another processor, and cache coherence is maintained over a point-to-point link.
As the speed of these links approaches the speed of a shared memory backplane, the distributed memory system can be used to run shared memory algorithms more efficiently. Link speeds are comparable to the low end of shared memory backplane speeds, although they typically have much higher latency. The extra latency is due to the physical distances involved and the use of complex switching systems to interconnect large numbers of processors. Large distributed memory systems have higher latency than do smaller ones.

NUMA Multiprocessors

The hybrid that has emerged is known as nonuniform memory access, or NUMA. NUMAs are shared memory, multiprocessor machines where the memory latency varies. If the access is to memory in the local node, then access is fast; if the access request has to travel over one or more communication links, then access becomes progressively slower. The node itself is often a single-board SMP with two to four CPUs. The SGI Origin 2000 system uses a twin-CPU node; the various Intel Pentium-based systems from Sequent, Data General, and others use a 4-CPU node. However, it makes far more sense when building NUMA systems to use an entire high-end server as a node, with 20 or more CPUs. The bulk of high-end server system requirements are for 10 to 30 or so CPUs, and customers like to buy a half-configured system to allow for future expansion. For Sun this requirement can be satisfied by an SMP such as the E6000 or E10000, so Sun had no need to develop a NUMA-based system for the 1996–1998 generation. NUMA-based competitors must configure multiple nodes to satisfy today’s requirements, and users of these systems are suffering from the new problems that come with this technology. Benchmarks such as the TPC-D data warehouse test—see http://www.—have shown that Sun’s SMP machines do beat the NUMA machines on price and performance over the whole range of database sizes.
This benchmark is ideally suited to NUMA because the query structures are well understood and the data layout can be optimized to keep the NUMA nodes well balanced. In real life, data warehouse queries are often constructed ad hoc, and it is not possible to optimize data placement in advance. The TPC-D benchmark is substantially overstating the performance of NUMA machines as data warehouse engines because of this effect. Data placement is not an issue in one large SMP node. In general, the workload on a system is not constant; it can vary from one query to the next or from one day to the next. In a benchmark situation, a NUMA machine can be optimized for the fixed benchmark workload but would need to be continuously reoptimized for real-world workloads. A technique used to speed up NUMA systems is data replication. The same data can be replicated in every node so that it is always available at a short memory latency. This technique has been overused by some vendors in benchmarks, with hundreds of Mbytes of data duplicated over all the nodes. The effect is that a large amount of RAM is wasted because there are many nodes over which to duplicate the data. In a 32-CPU system made up of 2-way or 4-way nodes, you might need to configure extra memory to hold 16 or 8 copies of the data. On a 32-CPU SMP system, there is a single copy. If there are any writes to this data, the NUMA machines must synchronize all their writes, a significant extra load. For all these reasons, the best MPP or NUMA configuration should be built by clustering the smallest number of the largest possible SMP nodes.

CPU Caches

Performance can vary even among machines that have the same CPU and clock rate if they have different memory systems. The interaction between caches and algorithms can cause performance problems, and an algorithm change may be called for to work around the problem.
This section provides information on the CPU caches and memory management unit designs of various SPARC platforms so that application developers can understand the implications of differing memory systems [1]. This section may also be useful to compiler writers.
CPU Cache History

Historically, the memory systems on Sun machines provided equal access times for all memory locations. This was true on the Sun-2™ machines and was true for data accesses on the Sun-3/50™ and Sun-3/60 machines. Equal access was achieved by running the entire memory system as fast as the processor could access it. On Motorola 68010 and 68020 processors, four clock cycles were required for each memory access. On a 10 MHz Sun-2, the memory ran at a 400 ns cycle time; on a 20 MHz Sun-3/60, it ran at a 200 ns cycle time, so the CPU never had to wait for memory to be ready. SPARC processors run at higher clock rates and want to access memory in a single cycle. Main memory cycle times have improved a little, and very wide memory systems can transfer a 32- or 64-byte block almost as quickly as a single word. The CPU cache is designed to cope with the mismatch between the way DRAM can handle blocks of memory efficiently and the way that the CPU wants to access one word at a time at much higher speed. The CPU cache is an area of fast memory made from static RAM, or SRAM. The cache transfers data from main DRAM memory using page mode in a block of 16, 32, or 64 bytes at a time. Hardware keeps track of the data in the cache and copies data into the cache when required. More advanced caches have multiple levels and can be split so that instructions and data use separate caches. The UltraSPARC CPU has a 16-Kbyte instruction cache and separate 16-Kbyte data cache, which are loaded from a second-level 512-Kbyte to 4-Mbyte combined cache that loads from the memory system. We first examine simple caches before we look at the implications of more advanced configurations.

Cache Line and Size Effects

A SPARCstation 2 is a very simple design with clearly predictable behavior. More recent CPU designs are extremely complicated, so I’ll start with a simple example. The SPARCstation 2 has its 64 Kbytes of cache organized as 2048 blocks of 32 bytes each.
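Assuming a direct-mapped organization, the cache line that holds any given address on such a design is a simple function of the address bits; the constants below are the SPARCstation 2 figures quoted above, and the function itself is an illustration, not vendor code.

```c
#include <assert.h>

/* Which of the 2048 32-byte lines an address maps to, assuming a
 * direct-mapped 64-Kbyte cache like the SPARCstation 2 example in
 * the text: drop the 5 offset bits within the line, then wrap at
 * the cache size. */
unsigned cache_line(unsigned addr)
{
    const unsigned line_size = 32;      /* bytes per line */
    const unsigned num_lines = 2048;    /* 64 Kbytes / 32 bytes */
    return (addr / line_size) % num_lines;
}
```

Two addresses exactly 64 Kbytes apart land on the same line and evict each other, which is the kind of interaction the next sections explore.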
When the CPU accesses an address, the cache controller checks to see if the right data is in the cache; if it is, then the CPU loads it without stalling. If the data needs to be fetched from main memory, then the CPU clock is effectively stopped for 24 or 25 cycles while the cache block is loaded. The implications for performance tuning are obvious. If your application is accessing memory very widely and its accesses are missing the cache rather than hitting it, then the CPU may spend a lot of its time stopped. By changing the application, you may be able to improve the hit rate and achieve a worthwhile performance gain. The 25-cycle delay is known as the miss cost, and the effective performance of an application is reduced by a large miss cost and a low hit rate. Context switches further reduce the cache hit rate because after a context switch the contents of the cache must be replaced with instructions and data for the new process.

Applications access memory on almost every cycle. Most instructions take a single cycle to execute, and the instructions must be read from memory; data accesses typically occur on 20–30 percent of the cycles. The effects of changes in hit rate for a 25-cycle miss cost are shown in Table 10-1 and Figure 10-13 in both tabular and graphical forms. A 25-cycle miss cost implies that a hit takes 1 cycle and a miss takes 26 cycles.

Figure 10-13. Application Speed Changes as Hit Rate Varies with a 25-Cycle Miss Cost
The execution time increases dramatically as the hit rate drops. Although a 96 percent hit rate sounds quite high, you can see that the program will be running at half speed. Many small benchmarks like Dhrystone run at a 100 percent hit rate; real applications such as databases move data around and follow linked lists and so have a poor hit rate. It isn't that difficult to do things that are bad for the cache, so this is a common cause of reduced performance.

A complex processor such as the UltraSPARC I does not have fixed cache miss costs; the cost depends upon the exact instruction sequencing, and in many cases the effective miss cost is reduced by complex hardware pipelining and compiler optimizations. The on-chip, 16-Kbyte data cache takes 1 cycle; the external cache takes between 1 and 7 clock cycles (it can supply a word of data in every cycle but may stall the CPU for up to 7 cycles if the data is used immediately); and main memory takes between 40 and 50 cycles, depending upon the CPU and system clock rates. With the hit rate varying across two cache levels, we can't draw a single performance curve, but relative to the curve shown in Figure 10-13, performance is better when the system is running inside the large second-level cache and worse when it is running from main memory.

A Problem with Linked Lists

Applications that traverse large linked lists cause problems with caches. Let's look at this in detail on a simple SPARCstation 2 architecture, assuming a list of 5,000 entries. Each block on the list contains some data and a link to the next block. If we assume that the link is located at the start of the block and that the data is in the middle of a 100-byte block, as shown in Figure 10-14, then we can deduce the effect on the memory system of chaining down the list.

Figure 10-14. Linked List Example
The code to perform the search is a tight loop, shown in Figure 10-15. This code fits in seven words, one or two cache lines at worst, so the cache is working well for code accesses. Data accesses occur when the link and data locations are read. If the code is simply looking for a particular data value, then these data accesses will occur every few cycles. They will never be in the same cache line, so every data access will cause a 25-cycle miss that reads in 32 bytes of data when only 4 bytes were wanted. Also, with a 64-Kbyte cache, only 2048 cache lines are available, so after 1024 blocks have been read in, the cache lines must be reused. This means that a second attempt to search the list will find that the start of the list is no longer in the cache.

Figure 10-15. Linked List Search Code in C
The only solution to this problem is an algorithmic change. The problem will occur on any of the current generation of computer systems. In fact, the problem gets worse as the processor gets faster since the miss cost tends to increase because of the difference in speed between the CPU and cache clock rate and the main memory speed. With more recent processors, it may be possible to cache a larger linked list in external cache, but the on-chip cache is smaller than 64 Kbytes. What may happen is that the first time through the list, each cache miss costs 50 cycles, and on subsequent passes it costs 8 cycles. The while loop compiles with optimization to just seven instructions, including two loads, two tests, two branches, and a no-op, as shown in Figure 10-16. Note that the instruction after a branch is always executed on a SPARC processor. This loop executes in 9 cycles (225 ns) on a SPARCstation 2 if it hits the cache, and in 59 cycles (1,475 ns) if both loads miss. On a 200 MHz UltraSPARC, the loop would run in approximately 5 cycles (25 ns) if it hit the on-chip cache, 20 cycles (100 ns) in the off-chip cache, and 100 cycles (500 ns) from main memory the first time through. Figure 10-16. Linked List Search Loop in Assembler
The memory latency test in Larry McVoy's lmbench set of benchmarks is based on this situation. It measures the worst case of unrelated back-to-back loads, a pattern that is nonetheless very common in real application code. See also "The Cache-Aligned Block Copy Problem" on page 265 for a description of another cache problem. For detailed information on specific cache architectures, see "SPARC CPU Cache Architectures" on page 258. Chapter 12, "Caches," on page 299 provides a more detailed explanation of the way in which caches in general work.