
Performance Analysis Tools for Linux Developers: Part 1


Performance analysis and profiling for Intel Processor Architectures

By Mark Gray and Julien Carreno
October 20, 2009
URL:http://www./open-source/220700195

Mark Gray is a software development engineer at Intel working on real-time embedded systems for telephony. Julien Carreno is a software architect and senior software developer specializing in embedded real-time applications on Linux.


With the advent of the Intel Atom processor and multicore processors, Intel architecture processors are proliferating in a number of new market segments, most notably embedded systems where good performance is essential. In parallel with this trend, Linux is becoming an established operating system option for embedded designs. The two trends combined pose an interesting problem statement: "How to get the most out of my embedded application running on an Intel platform and a general purpose operating system?" During all kinds of application development, there comes a time when a certain level of performance analysis and profiling is required, either to fix an issue or to improve on current performance. Whether it is memory usage and leaks, CPU usage, or optimal cache usage, analysis and profiling would be almost impossible without the right tool set. This article seeks to help developers understand the more common tools available and select the most appropriate tools for their specific performance analysis needs.

In Part 1 of this article, we summarize some of the performance tools available to Linux developers on Intel architecture. In Part 2 we cover a set of standard performance profiling and analysis goals and scenarios that demonstrate what tool or combination of tools to select for each scenario. In some scenarios, the depth of analysis is also a determining factor in selecting the tool required. With increasingly deeper levels of investigation, we need to change tools to get the increased level of detail and focus from them. This is similar to using a microscope with different magnification lenses. We start from the smallest magnification and gradually increase magnification as we focus on a specific area.

top and ps

The top and ps commands are freely available on all Linux distributions and are generally installed by default. The ps command provides an instantaneous snapshot of system activity on a per-thread basis, whereas the top command provides mostly the same information as ps updated at defined intervals, which can be as small as hundredths of a second. They are frequently overlooked as tools for understanding process performance at a system level. For example, most users tend to use the ps -ef command only to check which processes are currently executing. However, ps can also print useful information such as resident set size or number of page faults for a process. A thorough examination of the ps man pages reveals these options. Likewise, top can also display all this information in various formats while updating it in real-time. The top command window also displays summary information at the top of the window on a per-CPU basis.

In Figure 1, top is showing information for all threads of a process on a multicore machine. Using this more detailed view, we can see total activity on each CPU, all threads of the process "app", and the CPU on which each thread is scheduled at that instant (the P column). You can also see memory usage for the process, including resident set size (RES) and total virtual memory use (VIRT).

Figure 1: top View (Idle System)

In Figure 2, we can see similar information using ps, including per-thread CPU usage at a granularity of 0.1%. Note that this is the cumulative CPU percentage since the thread was spawned.

Figure 2: ps View (Idle System)

As can be seen, top and ps provide a good general overview of system performance and the performance of each process running on the system.
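By way of a sketch (assuming a process named "app", as in the figures; flags may vary slightly between distributions), the views described above can be obtained with invocations along these lines:

    # Per-thread view in top; press 'H' interactively, or start with -H.
    # -p restricts output to a single process.
    top -H -p $(pidof app)

    # Per-thread listing with ps: LWP is the thread ID, PSR the CPU the
    # thread last ran on, %CPU its cumulative CPU usage since it was spawned.
    ps -Lo pid,lwp,psr,pcpu,comm -p $(pidof app)

    # System-wide snapshot including resident set size, virtual size, and
    # minor/major page fault counts for every process.
    ps -eo pid,rss,vsz,min_flt,maj_flt,pcpu,comm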

free

The free utility is freely available on all Linux distributions and is generally installed by default. Similar information can be found using top or sar, but free is a convenient command for viewing a snapshot of system memory usage and can be used to identify memory leaks (the allocation of memory blocks without ever freeing them) or disk thrashing due to excessive swapping.

Figure 3: free View (Idle System)
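As a rough sketch of typical usage, the -m flag reports sizes in megabytes, and wrapping the command in watch gives a simple view of the trend over time, which is often enough to spot a slow leak or growing swap usage:

    # One-off snapshot of memory and swap usage, in megabytes.
    free -m

    # Refresh every 5 seconds; steadily shrinking free memory or steadily
    # growing swap usage can point to a leak or to thrashing.
    watch -n 5 free -m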

oProfile

The oProfile utility is a system-wide profiler and performance monitoring tool for user space as well as kernel space (the kernel itself can be included in the profiling). The profiler introduces minimal overhead and as such can be seen as relatively unobtrusive. However, it does require that the code be built with the debugging (-g) flag. Although active since 2002 and stable on a majority of platforms, oProfile still dubs itself an alpha-quality open source tool. The tool is released under the GPL and can, in fact, be found in post-2.6 kernels by default. The tool works by collecting data via a kernel module from various CPU counters, then exposing that information to user space via a pseudo file system, in the same way as ps collects data via the /proc file system.

Figure 4: opreport from oProfile
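A minimal system-wide profiling session with the legacy opcontrol front end looks roughly like the following (newer oProfile releases replace opcontrol with operf; the binary name is a placeholder):

    # Profile without kernel symbols (pass --vmlinux=<uncompressed image>
    # instead to include the kernel in the profile).
    opcontrol --no-vmlinux

    opcontrol --start          # start collecting samples system-wide
    ./app                      # run the workload of interest
    opcontrol --dump           # flush collected samples to disk
    opcontrol --stop           # stop the profiler

    # Per-symbol breakdown for the application binary.
    opreport --symbols ./app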

gprof

The GNU profiler, gprof, is an application-level profiler. The tool is open source, licensed under the GPL, and is available as standard on most Linux distributions. Compiling the code using gcc with the -pg flag instruments it, producing an executable that measures the wall-clock execution time of functions with hundredth-of-a-second accuracy and exports this information to a file. This file can then be parsed by the gprof application, giving a flat-profile representation of the performance data and a call graph.

Figure 5: gProf View

The profiler collects data at sampling intervals in the same way as many of the tools described in this paper. Therefore, there may be some statistical inaccuracy in the timing figures if a function's run time is close to the sampling interval. By running your application for long periods of time, you can reduce any statistical inaccuracies. As can be seen from the output in Figure 5, gprof can help locate hot spots at function granularity. However, it can also report this information at a finer, line-level granularity via the -l flag.
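As a sketch of this workflow (file names are illustrative): compile with -pg, run the binary to produce gmon.out in the working directory, then post-process it with gprof:

    # Instrument the build; -g retains the line information needed by -l.
    gcc -pg -g -o app app.c

    # Running the instrumented binary writes profiling data to ./gmon.out.
    ./app

    # Flat profile and call graph at function granularity.
    gprof ./app gmon.out > profile.txt

    # Line-by-line profile, the finer granularity mentioned above.
    gprof -l ./app gmon.out > line_profile.txt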

As an unexpected side-benefit, gprof can suggest function and file orderings within your binary to improve performance.

valgrind

valgrind is an instrumentation framework used primarily for detecting memory-related errors and threading problems, but it is also extensible. It is an open source tool licensed under GPL2. The tool can detect errors such as memory leaks and incorrect freeing of memory. valgrind detects these errors automatically and dynamically as the code is executing. In some cases it can produce false positives.

However, the developers of valgrind claim that it produces correct results 99% of the time, and any false positives can be suppressed. Although it is a very useful tool, it can be extremely intrusive: code runs much slower than its true execution speed (by a factor of 50 in some cases), and it needs to be compiled with the gcc -g flag. It is also recommended to compile with optimization disabled (the gcc -O0 flag). An example of running a small binary under valgrind can be seen below.

Figure 6: valgrind Example
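For reference, a typical leak-checking run along these lines might look like the following sketch, assuming a small test binary built as recommended above:

    # Build with debug symbols and without optimization so that reports
    # map cleanly back to source lines.
    gcc -g -O0 -o app app.c

    # memcheck is the default tool; --leak-check=full prints a stack trace
    # for each leaked block rather than just a summary.
    valgrind --leak-check=full ./app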

Although it may be useful in some cases, for real-time applications that wait on I/O, valgrind can be so obtrusive as to make the checking unreliable. However, valgrind can be a highly useful tool when used in conjunction with a unit test and/or nightly build strategy. A clean run of valgrind in a nightly build allows the developer to keep track of any newly introduced latent memory errors.

Like many of the tools presented here, valgrind is not limited to the purpose that most developers have in mind. For example, valgrind can also check for cache misses and branch mispredictions. We strongly encourage you to read the relevant documentation and play around with this and all the other tools to fully appreciate their power.
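For instance, the cache and branch behavior mentioned above can be examined with valgrind's cachegrind tool, and threading problems with helgrind; a sketch, assuming the same test binary:

    # Simulate the cache hierarchy and, with --branch-sim=yes, the branch
    # predictor; a summary is printed on exit and full data is written to
    # cachegrind.out.<pid> for post-processing with cg_annotate.
    valgrind --tool=cachegrind --branch-sim=yes ./app

    # Detect data races and lock-ordering problems in threaded code.
    valgrind --tool=helgrind ./app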

VTune

VTune from Intel is a proprietary system-level profiler and performance analysis tool for Intel architecture. It introduces minimal overhead and therefore can be perceived as relatively unobtrusive. VTune works by collecting data via a kernel module from various CPU counters. This information is collected when an interrupt is generated. The granularity of the data can run from process level down to instruction level and is accessible through a highly usable and configurable GUI.

VTune, when fully configured for your application and operating system, can identify performance issues at several levels of granularity, from system level to microarchitecture level. As a tool for developers, it is extremely valuable since it has a global view at all granularities. OS performance counters can also be monitored and correlated to instruction-level hotspots. By using this correlation, we can answer questions such as "When the memory use in our system begins to ramp, what happens to our application's CPU usage?" If the source code of your test application is hooked into the VTune application, we can also drill down from the application level into threads and down to individual functions.

It is impossible to outline all the features of VTune, or indeed of many of the tools described in this paper; the interested reader is directed to the references.

Intel Thread Checker

The Intel Thread Checker is a plug-in for the VTune debugging environment. It can be used to locate hard-to-find threading errors such as race conditions and deadlocks.

sar

The system activity reporter (sar) is a lightweight open source tool, licensed under the GPL, that is used for collecting system-wide performance measures. The tool is generally installed by default on Linux; if not, it can be installed via the sysstat package. Like top and ps, sar collects data from operating system counters via the proc file system. It provides performance data at system-level granularity, reporting on a wide variety of metrics such as CPU usage, disk I/O, memory, network I/O, and IRQs. The tool can update these values at intervals as small as 1 second.

sar provides information only at system-level granularity and is used to give snapshots and overviews of overall system performance. Spurious or unexpected measurements from sar can be a first indication of performance issues in the system as a whole or in a single process or group of processes. It can be configured to run in the background, constantly maintaining a readily accessible database of system performance for any second of the day.

Figure 7: sar System-wide CPU Usage View

Figure 8: sar System-wide Memory Usage View
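As a sketch of typical use, sar can be run interactively at a given interval and count, or used to read back data collected in the background by the sysstat cron jobs (the archive path varies by distribution, e.g. /var/log/sa or /var/log/sysstat):

    # CPU utilization, sampled every second, ten samples.
    sar -u 1 10

    # Memory and swap usage at the same interval.
    sar -r 1 10

    # Read back today's archived data; the file name encodes the day of
    # the month.
    sar -u -f /var/log/sa/sa$(date +%d)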

LTT

The Linux Trace Toolkit (LTT) consists of a kernel patch and tool chain that give the user the ability to trace events on the system. These events can be kernel events (such as context switches or system calls) or any application-level event. It is GPL licensed and has minimal impact on the run-time performance of traced applications. It can be used to isolate performance problems on parallel and real-time systems and to analyze application timing. Any code that the user would like analyzed must be recompiled so that it can be instrumented by LTT.

Alternatively, LTTng (Next Generation) is also available, which adds features such as a GUI Trace Viewer. See Figure 9.

Figure 9: Sample LTTng Viewer
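The exact workflow depends on the LTT generation in use; with the more recent LTTng 2.x tool chain, a kernel tracing session looks roughly like the sketch below (session and event names are illustrative):

    # Create a session and enable a couple of kernel tracepoints.
    lttng create demo-session
    lttng enable-event --kernel sched_switch,irq_handler_entry

    lttng start        # begin tracing
    sleep 10           # ... run the workload of interest ...
    lttng stop         # stop tracing

    # Print the recorded events; the GUI viewer can load the same trace.
    lttng view
    lttng destroy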

iostat

The iostat command is used for monitoring system input/output block device loading. With multiple block devices in the system, it can be useful for determining which device is currently the bottleneck. iostat provides a per-device view of the number of transfers per second on each device as well as read and write rates. See Figure 10 for an example of the extended, device-only iostat output during a large file copy; note the temporary increase in device activity while the file was being copied.

Figure 10: Sample iostat View (File Copy Example)
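The extended, device-only report in Figure 10 can be reproduced with something along these lines (interval and count are illustrative):

    # Extended per-device statistics (-x), device report only (-d),
    # refreshed every 2 seconds for 5 iterations.
    iostat -dx 2 5

    # Add -k (or -m) to report throughput in kB/s (or MB/s) rather than
    # in blocks per second.
    iostat -dxk 2 5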

iotop

iotop is a Python program with a top-like user interface that can be used to associate processes with I/O. It requires Python version 2.5 or greater and a Linux kernel version 2.6.20 or later with the TASK_DELAY_ACCT and TASK_IO_ACCOUNTING options enabled. The kernel may therefore need to be recompiled if these options have not been enabled by default. iotop is licensed under the GPL. It provides data on the amount of disk I/O occurring within the system on a per-process basis, which lets users determine which applications are using the disk(s) the most.

Figure 11: Sample iotop View
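A sketch of common invocations; iotop generally needs root privileges to read the per-process I/O accounting data:

    # Interactive, top-like view; -o shows only processes currently doing I/O.
    sudo iotop -o

    # Batch mode for logging: three non-interactive samples written to stdout.
    sudo iotop -b -n 3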

Tools Summary

Alternative Tools
