[Cockcroft98] Chapter 7. Applications

Chapter 7. Applications

This chapter discusses the ways in which a user running an application on a Sun machine can control or monitor it on a program-by-program basis.

Tools for Applications

When you don’t have the source code for an application, you must use special tools to figure out what the application is really doing.

Tracing Applications

Applications make frequent calls into the operating system, both to shared libraries and to the kernel via system calls. System call tracing has been a feature of Solaris for a long time, and in Solaris 2.6 a new capability allows tracing and profiling of the shared library interface as well.

Tracing System Calls With truss

The Solaris 2 truss command has many features not found in the original SunOS 4 trace command. It can trace child processes, and it can count and time system calls and signals. Other options allow named system calls to be excluded or focused on, and data structures can be printed out in full. Here is an excerpt showing a fragment of truss output with the -v option to set verbose mode for data structures, and an example of truss -c showing the system call counts.

% truss -v all cp NewDocument Tuning 
execve("/usr/bin/cp", 0xEFFFFB28, 0xEFFFFB38) argc = 3
open("/usr/lib/libintl.so.1", O_RDONLY, 035737561304) = 3
mmap(0x00000000, 4096, PROT_READ, MAP_SHARED, 3, 0) = 0xEF7B0000
fstat(3, 0xEFFFF768) = 0
d=0x0080001E i=29585 m=0100755 l=1 u=2 g=2 sz=14512
at = Apr 27 11:30:14 PDT 1993 [ 735935414 ]
mt = Mar 12 18:35:36 PST 1993 [ 731990136 ]
ct = Mar 29 11:49:11 PST 1993 [ 733434551 ]
bsz=8192 blks=30 fs=ufs
....
% truss -c cp NewDocument Tuning
syscall      seconds   calls  errors
_exit            .00       1
write            .00       1
open             .00      10       4
close            .01       7
creat            .01       1
chmod            .01       1
stat             .02       2       1
lseek            .00       1
fstat            .00       4
execve           .00       1
mmap             .01      18
munmap           .00       9
memcntl          .01       1
                ----     ---     ---
sys totals:      .07      57       5
usr time:        .02
elapsed:         .43


An especially powerful technique is to log all the file open, close, directory lookup, read, and write calls to a file, then figure out what parts of the system the application is accessing. A trivial example is shown in Figure 7-1.

Figure 7-1. Example Using truss to Track Process File Usage
% truss -o /tmp/ls.truss -topen,close,read,write,getdents ls / >/dev/null 
% more /tmp/ls.truss
open("/dev/zero", O_RDONLY) = 3
open("/usr/lib/libw.so.1", O_RDONLY) = 4
close(4) = 0
open("/usr/lib/libintl.so.1", O_RDONLY) = 4
close(4) = 0
open("/usr/lib/libc.so.1", O_RDONLY) = 4
close(4) = 0
open("/usr/lib/libdl.so.1", O_RDONLY) = 4
close(4) = 0
open("/usr/platform/SUNW,Ultra-2/lib/libc_psr.so.1", O_RDONLY) = 4
close(4) = 0
close(3) = 0
open("/", O_RDONLY|O_NDELAY) = 3
getdents(3, 0x0002D110, 1048) = 888
getdents(3, 0x0002D110, 1048) = 0
close(3) = 0
write(1, " T T _ D B", 5) = 5
write(1, "\n b i n\n c d r o m\n c".., 251) = 251

Tracing Shared Library Calls With sotruss

The dynamic linker has many new features. Read the ld(1) manual page for details. Two features that help with performance tuning are tracing and profiling. The Solaris 2.6 and later sotruss command is similar in use to the truss command and can be told which calls you are interested in monitoring. Library calls are, however, much more frequent than system calls and can easily generate too much output.

Profiling Shared Libraries with LD_PROFILE

The LD_PROFILE profiling option was new in Solaris 2.5. It allows the usage of a shared library to be recorded and accumulated from multiple commands. This data has been used to tune window system libraries and libc.so, which is used by every command in the system. Profiling is enabled by setting the LD_PROFILE environment variable to the name of the library you wish to profile. By default, profile data accumulates in /var/tmp, but the value of LD_PROFILE_OUTPUT, if it has one, can be used to set an alternative directory. As for a normal profile, gprof is used to process the data. Unlike the case for a normal profile, no special compiler options are needed, and it can be used on any program.

For all these utilities, security is maintained by limiting their use on set-uid programs to the root user and searching for libraries only in the standard directories.

Timing

The C shell has a built-in time command that is used during benchmarking or tuning to see how a particular process is running. In Solaris 2, the shell does not compute all this data, so the last six values are always zero.

% time man madvise 
...
0.1u 0.5s 0:03 21% 0+0k 0+0io 0pf+0w
%

In this case, 0.1 seconds of user CPU and 0.5 seconds of system CPU were used in 3 seconds elapsed time, which accounted for 21% of the CPU. Solaris 2 has a timex command that uses system accounting records to summarize process activity, but the command works only if accounting is enabled. See the manual pages for more details.

Process Monitoring Tools

Processes are monitored and controlled via the /proc interface. In addition to the familiar ps command, a set of example programs is described in the proc(1) manual page, including ptree, which prints out the process hierarchy, and ptime, which provides accurate and high-resolution process timing.

% /usr/proc/bin/ptime man madvise 
real 1.695
user 0.005
sys 0.009

Note the difference in user and system time. I first ran this command a very long time ago on one of the early SPARC machines and, using time, recorded the output. The measurement above using ptime was taken on a 300 MHz UltraSPARC, and by this measurement, the CPU resources used to view the manual page have decreased by a factor of 43, from 0.6 seconds to 0.014 seconds. If you use the csh built-in time command, you get zero CPU usage for this measurement on the fast system because there is not enough resolution to see anything under 0.1 seconds.

The ptime command uses microstate accounting to get the high-resolution measurement. It obtains but fails to print out many other useful measurements. Naturally, this can be fixed by writing a script in SE to get the missing data and show it all. The script is described in “msacct.se” on page 485, and it shows that many process states are being measured. Microstate accounting itself is described in “Network Protocol (MIB) Statistics via Streams” on page 403. The msacct.se command is given a process ID to monitor and produces the output shown in Figure 7-2.

Figure 7-2. Example Display from msacct.se
% se msacct.se 354 
Elapsed time 3:29:26.344 Current time Tue May 23 01:54:57 1995
User CPU time 5.003 System call time 1.170
System trap time 0.004 Text pfault sleep 0.245
Data pfault sleep 0.000 Kernel pfault sleep 0.000
User lock sleep 0.000 Other sleep time 9:09.717
Wait for CPU time 1.596 Stopped time 0.000

The Effect of Underlying Filesystem Type

Some programs are predominantly I/O intensive or may open and close many temporary files. SunOS has a wide range of filesystem types, and the directory used by the program could be placed onto one of the following types.

Unix File System (UFS)

The standard file system on disk drives is the Unix File System, which from SunOS 4.1 onward is the Berkeley Fat Fast File system. Files that are read stay in RAM until a RAM shortage reuses the pages for something else. Files that are written are sent out to disk as described in “Disk Writes and the UFS Write Throttle” on page 172, but the file stays in RAM until the pages are reused for something else. There is no special buffer cache allocation, unlike other Berkeley-derived versions of Unix. SunOS 4 and SVR4 both use the whole of memory to cache pages of code, data, or I/O. The more RAM there is, the better the effective I/O throughput is.

UFS with Transaction Logging

The combination of Solaris 2.4 and Online: DiskSuite™ 3.0 or later releases supports a new option to standard UFS. Synchronous writes and directory updates are written sequentially to a transaction log that can be on a different device. The effect is similar to the Prestoserve, nonvolatile RAM cache, but the transaction log device can be shared with another system in a dual-host, failover configuration. The filesystem check with fsck requires that only the log is read, so very large file systems are checked in a few seconds.

Tmpfs

Tmpfs is a RAM disk filesystem type. Files that are written are never put out to disk as long as some RAM is available to keep them in memory. If there is a RAM shortage, then the pages are stored in the swap space. The most common way to use this filesystem type in SunOS 4.X is to uncomment the line in /etc/rc.local for mount /tmp. The /tmp directory is accelerated with tmpfs by default in Solaris 2.

One side effect of this feature is that the free swap space can be seen by means of df. The tmpfs file system limits itself to prevent using up all the swap space on a system.

% df /tmp 
Filesystem kbytes used avail capacity Mounted on
swap 15044 808 14236 5% /tmp

The NFS Distributed Computing File System

NFS is a networked file system coming from a disk on a remote machine. It tends to have reasonable read performance but can be poor for writes and is slow for file locking. Some programs that do a lot of locking run very slowly on NFS-mounted file systems.

Cachefs

New since Solaris 2.3 is the cachefs filesystem type. It uses a fast file system to overlay accesses to a slower file system. The most useful way to use cachefs is to mount, via a local UFS disk cache, NFS file systems that are mostly read-only. The first time a file is accessed, blocks of it are copied to the local UFS disk. Subsequent accesses check the NFS attributes to see if the file has changed, and if not, the local disk is used. Any writes to the cachefs file system are written through to the underlying files by default, although there are several options that can be used in special cases for better performance. Another good use for cachefs is to speed up accesses to slow devices like magneto-optical disks and CD-ROMs.

When there is a central server that holds application binaries, these binaries can be cached on demand at client workstations. This practice reduces the server and network load and improves response times. See the cfsadmin manual page for more details. Solaris 2.5 includes the cachefsstat(1M) command to report cache hit rate measures.

Caution

Cachefs should not be used to cache shared, NFS-mounted mail directories, and it can slow down access to write-intensive home directories.


Veritas VxFS File System

Veritas provides the VxFS file system for resale by Sun and other vendors. Compared to UFS, it has several useful features. UFS itself has some features (like disk quotas) that were not in early releases of VxFS, but VxFS is now a complete superset of all the functions of UFS.

VxFS is an extent-based file system, which is completely different from UFS, an indirect, block-based file system. The difference is most noticeable for large files. An indirect block-based file system breaks the file into 8-Kbyte blocks that can be spread all over the disk. Additional 8-Kbyte indirect blocks keep track of the location of the data blocks. For files of over a few Mbytes, double indirect blocks are needed to keep track of the location of the indirect blocks. If you try to read a UFS file sequentially at high speed, the system has to keep seeking to pick up indirect blocks and scattered data blocks. This seek procedure limits the maximum sequential read rate to about 30 Mbytes/s, even with the fastest CPU and disk performance.

In contrast, VxFS keeps track of data by using extents. Each extent contains a starting point and a size. If a 2-Gbyte file is written to an empty disk, it can be allocated as a single 2-Gbyte extent. There are no indirect blocks, and a sequential read of the file reads an extent record, then reads data for the complete extent. In November 1996, Sun published a benchmark result, using VxFS on an E6000 system, where a single file was read at a sustained rate of about 1 Gbyte/s. The downside of large extents is that they fragment the disk. After lots of files have been created and deleted, it could be difficult to allocate a large file efficiently. Veritas provides tools with VxFS that de-fragment the data in a file system by moving and merging extents.

The second advanced feature provided is snapshot backup. If you want a consistent online backup of a file system without stopping applications that are writing new data, you tell VxFS to snapshot the state of the file system at that point. Any new data or deletions are handled separately. You can back up the snapshot, freeing the snapshot later when the backup is done and recovering the extra disk space used by the snapshot.

Direct I/O Access

In some cases, applications that access very large files do not want them buffered by the normal filesystem code and can run better with raw disk access. Raw access can be administratively inconvenient because the raw disk partitions are hard to keep track of. The simplest fix for this situation is to use a volume management GUI such as the Veritas VxVM to label and keep track of raw disk space. A need for many small raw files could still be inconvenient, so options are provided for direct I/O access: unbuffered access to a file in a normal file system. The VxFS extent layout is closer on disk to raw, so its direct I/O option is reasonably fast; a limitation is that direct I/O can be used only for block-aligned reads and writes. The VxFS file system can be used with Solaris 2.5.1.

UFS directio is a new feature in Solaris 2.6; see mount_ufs(1M). UFS still suffers from indirect blocks and fragmented data placement, so directio access is less efficient than raw. A useful feature is that directio reverts automatically to buffer any unaligned access to an 8-Kbyte disk block, allowing a mixture of direct and buffered accesses to the same file.
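
As a rough sketch of how an application requests this behavior (illustrative only, not code from the book; the file name is made up and error handling is minimal), a program can enable direct I/O on an already-open file with the directio(3C) routine provided in Solaris 2.6:

#include <sys/types.h>
#include <sys/fcntl.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int
main(void)
{
    int fd = open("/bigdata/table.dat", O_RDWR);  /* hypothetical large file */

    if (fd == -1) {
        perror("open");
        return 1;
    }
    /* Ask the file system to bypass the page cache for this file; if the
       underlying file system does not support direct I/O, the call fails
       and ordinary buffered access continues to work. */
    if (directio(fd, DIRECTIO_ON) == -1)
        perror("directio");
    /* ... 8-Kbyte-aligned reads and writes go here ... */
    (void) close(fd);
    return 0;
}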

Customizing the Execution Environment

The execution environment is largely controlled by the shell. A built-in command can be used to constrain a program that is hogging too many resources: for csh the command is limit; for sh and ksh it is ulimit. A default set of Solaris 2 resource limits is shown in Table 7-1.

Users can increase limits up to the hard system limit. The superuser can set higher limits. The limits on data size and stack size are 2 Gbytes on recent machines with the SPARC Reference MMU but are limited to 512 Mbytes and 256 Mbytes respectively by the sun4c MMU used in the SPARCstation 1 and 2 families of machines.

Table 7-1. Resource Limits
Resource Name          Soft User Limit          Hard System Limit
cputime                unlimited                unlimited
filesize               unlimited                unlimited
datasize               524280–2097148 Kbytes    524280–2097148 Kbytes
stacksize              8192 Kbytes              261120–2097148 Kbytes
coredumpsize           unlimited                unlimited
descriptors            64                       1024
memorysize (virtual)   unlimited                unlimited

In Solaris 2.6, you can increase the datasize to almost 4 Gbytes. The memorysize parameter limits the size of the virtual address space, not the real usage of RAM, and can be useful to prevent programs that leak from consuming all the swap space.

Useful changes to the defaults are those made to prevent core dumps from happening when they aren’t wanted:

% limit coredumpsize 0

To run programs that use vast amounts of stack space:

% limit stacksize unlimited

File Descriptor Limits

To run programs that need to open more than 64 files at a time, you must increase the file descriptor limit. The safest way to run such a program is to start it from a script that sets the soft user limit higher or to use the setrlimit call to increase the limit in the code:

% limit descriptors 256

The maximum number of descriptors in SunOS 4.X is 256. This maximum was increased to 1024 in Solaris 2, although the standard I/O package still handles only 256. The definition of FILE in /usr/include/stdio.h has only a single byte to record the underlying file descriptor index. This data structure is so embedded in the code base that it cannot be increased without breaking binary compatibility for existing applications. Raw file descriptors are used for socket-based programs, and they can use file descriptors above the stdio.h limit. Problems occur in mixed applications when stdio tries to open a file after sockets have consumed all the low-numbered descriptors. This situation can occur if the name service is invoked late in a program, because the nsswitch.conf file is read via stdio.
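
As a minimal sketch of the setrlimit approach mentioned above (not code from the book), a program can raise its own soft descriptor limit as far as its hard limit before it starts opening sockets and files:

#include <sys/resource.h>
#include <stdio.h>

int
main(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_NOFILE, &rl) == -1) {
        perror("getrlimit");
        return 1;
    }
    rl.rlim_cur = rl.rlim_max;   /* raise the soft limit up to the hard limit */
    if (setrlimit(RLIMIT_NOFILE, &rl) == -1)
        perror("setrlimit");
    printf("descriptors: soft %lu hard %lu\n",
        (unsigned long) rl.rlim_cur, (unsigned long) rl.rlim_max);
    return 0;
}

Remember that descriptors above 255 cannot be used by stdio, so raising the limit helps only the raw descriptor and socket code in a mixed application.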

At higher levels, additional problems occur. The select(3C) system call uses a bitfield with 1024 bits to track which file descriptors are being selected. It cannot handle more than 1,024 file descriptors and cannot be extended without breaking binary compatibility. Some system library routines still use select, including some X Window System routines. The official solution is to use the underlying poll(2) system call instead. This call avoids the bitfield issue and can be used with many thousands of open files. It is a very bad idea to increase the default limits for a program unless you know that it is safe; programs should increase limits themselves by using setrlimit. Programs that run as root can also raise the hard limit, which is necessary for daemons that need thousands of connections.
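
The sketch below (illustrative only, not from the book) shows the poll interface; because poll takes an array of pollfd structures rather than a fixed-size bitfield, the array can describe many thousands of descriptors:

#include <poll.h>
#include <stdio.h>

int
main(void)
{
    struct pollfd fds[1];

    fds[0].fd = 0;            /* standard input; a server would fill in       */
    fds[0].events = POLLIN;   /* thousands of socket descriptors here instead */

    /* Wait up to five seconds for the descriptor to become readable. */
    switch (poll(fds, 1, 5000)) {
    case -1:
        perror("poll");
        return 1;
    case 0:
        printf("timed out\n");
        break;
    default:
        if (fds[0].revents & POLLIN)
            printf("stdin is readable\n");
    }
    return 0;
}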

The only opportunity to increase these limits comes with the imminent 64-bit address space ABI. It marks a clean break with the past, so some of the underlying implementation limits in Solaris can be fixed at the same time as 64-bit address support is added. The implications are discussed in “When Does “64 Bits” Mean More Performance?” on page 134.

Databases and Configurable Applications

Examples of configurable applications include relational databases, such as Oracle, Ingres, Informix, and Sybase, that have large numbers of configuration parameters and an SQL-based configuration language. Many CAD systems and Geographical Information systems also have sophisticated configuration and extension languages. This section concentrates on the Sun-specific database issues at a superficial level; the subject of database tuning is beyond the scope of this book.

Hire an Expert!

For serious tuning, you either need to read all the manuals cover-to-cover and attend training courses or hire an expert for the day. The black box mentality of using the system exactly the way it came off the tape, with all parameters set to default values, will get you going, but there is no point in tuning the rest of the system if it spends 90 percent of its time inside a poorly configured database. Experienced database consultants will have seen most problems before. They know what to look for and are likely to get quick results. Hire them, closely watch what they do, and learn as much as you can from them.

Basic Tuning Ideas

Several times I have discovered database installations that have not even started basic tuning, so some basic recommendations on the first things to try may be useful. They apply to most database systems in principle, but I will use Oracle as an example, as I have watched over the shoulders of a few Oracle consultants in my time.

Increasing Buffer Sizes

Oracle uses an area of shared memory to cache data from the database so that all Oracle processes can access the cache. In old releases, the cache defaults to about 400 Kbytes, but it can be increased to be bigger than the entire data set if needed. I recommend that you increase it to at least 20%, and perhaps as much as 50% of the total RAM in a dedicated database server if you are using raw disk space to hold the database tables. There are ways of looking at the cache hit rate within Oracle, so increase the size until the hit rate stops improving or until the rest of the system starts showing signs of memory shortage. Avoiding unnecessary random disk I/O is one of the keys to database tuning.

Solaris 2 implements a feature called intimate shared memory by which the virtual address mappings are shared as well as the physical memory pages. ISM makes virtual memory operations and context switching more efficient when very large, shared memory areas are used. In Solaris 2, ISM is enabled by the application when it attaches to the shared memory region. Oracle 7 and Sybase System 10 and later releases both enable ISM automatically by setting the SHM_SHARE_MMU flag in the shmat(2) call. In Solaris 2.6 on UltraSPARC systems, the shared memory segment is mapped by use of large (4 Mbyte) pages of contiguous RAM rather than many more individual 8-Kbyte pages. This mapping scheme greatly reduces memory management unit overhead and saves on CPU system time.
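
For illustration, here is a minimal sketch of how an application asks for ISM (not code from the book; the 64-Mbyte size and the private key are arbitrary). The only difference from an ordinary System V shared memory attach is the SHM_SHARE_MMU flag passed to shmat:

#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <stdio.h>

int
main(void)
{
    size_t size = 64 * 1024 * 1024;   /* illustrative 64-Mbyte segment size */
    void *addr;
    int id;

    id = shmget(IPC_PRIVATE, size, IPC_CREAT | 0600);
    if (id == -1) {
        perror("shmget");
        return 1;
    }
    /*
     * SHM_SHARE_MMU requests intimate shared memory: the address
     * translations are shared by every process that attaches, and on
     * UltraSPARC systems running Solaris 2.6 the segment is mapped
     * with large contiguous pages.
     */
    addr = shmat(id, NULL, SHM_SHARE_MMU);
    if (addr == (void *) -1) {
        perror("shmat");
        return 1;
    }
    printf("ISM segment attached at %p\n", addr);
    (void) shmdt(addr);
    (void) shmctl(id, IPC_RMID, NULL);
    return 0;
}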

Using Raw Disk Rather Than File Systems

During installation, you should create several empty disk partitions or stripes, spread across as many different disks and controllers as possible (but avoiding slices zero and two). You can then change the raw devices to be owned by Oracle (do this by using the VxVM GUI if you created stripes) and, when installing Oracle, specify the raw devices rather than files in the file system to hold the system, redo logs, rollback, temp, index, and data table spaces.

File systems incur more CPU overhead than do raw devices and can be much slower for writes due to inode and indirect block updates. Two or three blocks in widely spaced parts of the disk must be written to maintain the file system, whereas only one block needs to be written on a raw partition. Oracle normally uses 2 Kbytes as its I/O size, and the file system uses 8 Kbytes, so each 2-Kbyte read is always rounded up to 8 Kbytes, and each 2-Kbyte write causes an 8-Kbyte read, 2-Kbyte insert, and 8-Kbyte write sequence. You can avoid this excess by configuring an 8-Kbyte basic block size for Oracle, but this solution wastes memory and increases the amount of I/O done while reading the small items that are most common in the database. The data will be held in the Oracle SGA as well as in the main memory filesystem cache, thus wasting RAM. Improvements in the range of 10%–25% or more in database performance and reductions in RAM requirements have been reported after a move from file systems to raw partitions. A synchronous write accelerator (see “Disk Write Caching” on page 173) should be used with databases to act as a log file accelerator.

If you persist in wanting to run in the file system, three tricks may help you get back some of the performance. The first trick is to turn on the “sticky bit” for database files. This makes the inode updates for the file asynchronous and is completely safe because the file is preallocated to a fixed size. This trick is used by swap files; if you look at files created with mkfile by the root user, they always have the sticky bit set.

# chmod +t table 
# ls -l table
-rw------T 1 oracle dba 104857600 Nov 30 22:01 table

The second trick is to use the direct I/O option discussed in “Direct I/O Access” on page 161. This option at least avoids the memory double buffering overhead. The third trick is to configure the temporary tablespace to be raw; there is often a large amount of traffic to and from the temporary tablespace, and, by its nature, it doesn’t need to be backed up and it can be re-created whenever the database starts up.

Fast Raw Backups

You can back up small databases by copying the data from the raw partition to a file system. Often, it is important to have a short downtime for database backups, and a disk-to-disk transfer is much faster than a backup to tape. Compressing the data as it is copied can save on disk space but is very CPU intensive; I recommend compressing the data if you have a high-end multiprocessor machine. For example,

# dd if=/dev/rsd1d bs=56k | compress > /home/data/dump_rsd1d.Z

Balance the Load over All the Disks

The log files should be on a separate disk from the data. This separation is particularly important for databases that have a lot of update activity. It also helps to put indexes and temporary tablespace on their own disks or to split the database tables over as many disks as possible. The operating system disk is often lightly used, and on a very small two-disk system, I would put the log files on the system disk and put the rest on its own disk. To balance I/O over a larger number of disks, stripe them together by using Veritas VxVM, Solstice DiskSuite, or a hardware RAID controller. Also see “Disk Load Monitoring” on page 183.

Which Disk Partition to Use

If you use the first partition on a disk as a raw Oracle partition, then you will lose the disk’s label. If you are lucky, you can recover the label by using the “search for backup labels” option of the format command, but it is better to avoid the problem by putting a file system, swap space, Solstice DiskSuite state database, or a small, unused partition at the start of the disk.

On modern disks, the first part of the disk is the fastest, so, for best performance, I recommend a tiny first partition followed by a database partition covering the first half of the disk. See “Zoned Bit Rate (ZBR) Disk Drives” on page 205 for more details and an explanation.

The Effect of Indices

When you look up an item in a database, your request must be matched against all the entries in a (potentially large) table. Without an index, a full table scan must be performed, and the database reads the entire table from disk in order to search every entry. If there is an index on the table, then the database looks up the request in the index and knows which entries in the table need to be read from disk. Some well-chosen indexes can dramatically reduce the amount of disk I/O and CPU time required to perform a query. Poorly designed or untuned databases are often underindexed. The problem with indexes is that when an indexed table is updated, the index must be updated as well, so peak database write performance can be reduced.

How to Configure for a Large Number of Users

One configuration scenario is for the users to interact with the database through an ASCII forms-based interface. The forms front end is usually created by means of high-level, application-builder techniques and in some cases can consume a large amount of CPU. It inputs and echoes characters one at a time from the user over a direct serial connection or via telnet from a terminal server. Output tends to be in large blocks of text. The operating system overhead of handling one character at a time over telnet is quite high, and when hundreds of users are connected to a single machine, the Unix kernel consumes a lot of CPU power moving these characters around one at a time. In Solaris 2.5, the telnet and rlogin processing was moved into the kernel by means of streams modules. The old implementation uses a pair of daemon processes, one for each direction of each connection; the in-kernel version still has a single daemon for handling the protocol, but data traffic does not flow through the daemon. This configuration has been tested with up to 3,000 direct connections. Higher numbers are normally configured by means of a Transaction Processing Monitor, such as Tuxedo.

The most scalable form of client-server configuration is for each user to have a workstation or a PC running the forms-based application and generating SQL calls directly to the backend server. Even more users can be connected this way because they do not log in to the database server; only a socket connection is made.

Database Tuning Summary

When you are tuning databases, it is useful to realize that in many cases the sizing rules that have been developed by database software vendors in the past do not scale well to today’s systems. In the mainframe and minicomputer worlds, disk I/O capacity is large, processors are slow, and RAM is expensive. With today’s systems, the disk I/O capacity, CPU power, and typical RAM sizes are all huge, but the latency for a single disk read is still very slow in comparison. It is worth trading off a little extra CPU overhead and extra RAM usage for a reduction in I/O requirements, so don’t be afraid to experiment with database buffer sizes that are much larger than those recommended in the database vendors’ documentation.

The next chapter examines the reasons why disk I/O is often the problem.

