UNIX Performance Management
It doesn’t have to cost a fortune.
By Jaqui Lynch
Manager Systems Services
The purpose of this paper is to introduce the performance analyst to some of the free tools available to monitor and manage performance on UNIX systems, and to provide a guideline on how to diagnose and fix performance problems in that environment. The paper is based on the author's experiences with AIX and will cover many of the tools available on that and other UNIX platforms. It will also provide some Rules of Thumb for analyzing the performance of UNIX systems.
As more mission-critical work finds its way from the mainframe to distributed systems, performance management for those systems is becoming more important. The goal for systems management is not only to maximize system throughput, but also to reduce response time. In order to do this it is necessary not only to work on the system resources, but also to work on profiling and tuning applications.
In UNIX there are 7 major resource types that need to be monitored and tuned - CPU, memory, disk space and arms, communications lines, I/O Time, Network Time and application programs. There are also standard rules of thumb in most of these areas. From the user's perspective the only one they see is total execution time, so we will start by looking at that.
Total execution time from a user's perspective consists of wall-clock time. At a process level this is measured by running the time command. This provides you with real time (wall-clock), user code CPU and system code CPU. If user + sys > 80% then there is a good chance the system is CPU constrained. The components of total running time include:
1. User-state CPU - the actual amount of time the CPU spends running the user's program in the user state. It includes time spent executing library calls, but does not include time spent in the kernel on its behalf. This value can be greatly affected by the use of optimization at compile time and by writing efficient code.
2. System-state CPU - this is the amount of time the CPU spends in the system state on behalf of this program. All I/O routines require kernel services. The programmer can affect this value by the use of blocking for I/O transfers.
3. I/O Time and Network Time - these are the amount of time spent moving data and servicing I/O requests.
4. Virtual Memory Performance - This includes context switching and swapping.
5. Time spent running other programs.
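The breakdown above can be reproduced for a single command with a short script. The following is a minimal sketch, not a replacement for time or timex: the function names are hypothetical, and it assumes the 80% rule is applied to (user + sys) as a fraction of real time.

```python
import os
import subprocess
import time


def cpu_share(real, user, sys_):
    """Return (user + sys) / real, or 0.0 when no wall-clock time elapsed."""
    return (user + sys_) / real if real > 0 else 0.0


def looks_cpu_bound(real, user, sys_, threshold=0.80):
    """Apply the 80% rule of thumb (assumed to be relative to real time)."""
    return cpu_share(real, user, sys_) > threshold


def time_command(argv):
    """Run argv and return (real, user, sys) in seconds, much as time does."""
    before = os.times()
    start = time.perf_counter()
    subprocess.run(argv, check=False)
    real = time.perf_counter() - start
    after = os.times()
    # children_* fields accumulate CPU time of child processes that were waited for
    user = after.children_user - before.children_user
    sys_ = after.children_system - before.children_system
    return real, user, sys_
```

Calling looks_cpu_bound(*time_command(["sort", "bigfile"])) would then flag a run whose CPU time dominates its wall-clock time; the command and file name are placeholders.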
In order to measure these areas there are a multitude of tools available. The most useful are:
cron          Process scheduling
nice/renice   Change priorities
setpri        Set priorities
netstat       Network statistics
nfsstat       NFS statistics
time/timex    Process CPU Utilization
uptime        System Load Average
ps            Process Statistics
iostat        BSD tool for I/O
sar           Bulk System Activity
vmstat        BSD tool for Virtual Memory
gprof         Call Graph profiling
prof          Process Profiling
trace         Used to get more depth
Other commands that will be useful include lsvg, lspv, lslv, lsps and lsdev. Each of these will be discussed below and then a general problem solving approach will be offered. It is important to note that the results and options for all of these commands may differ depending on the platform they are being run on. Most of the options discussed below are those for AIX and some of the tools are specific to AIX such as:
tprof         CPU Usage
svmon         Memory Usage
filemon       Filesystem, LV, etc. activity
netpmon       Network resources
The first tool to be discussed is uptime. This provides the analyst with the System Load Average (SLA). It is important to note that the SLA can only be used as a rough indicator as it does not take into account scheduling priority and it counts as runnable all jobs waiting for disk I/O, including NFS I/O. However, uptime is a good place to start when trying to determine whether a bottleneck is CPU or I/O based.
When uptime is run it provides three load averages - the first is for the last minute, the second is for the last 5 minutes and the third is for the last 15 minutes. If the value is borderline but has been falling over the last 15 minutes, then it would be wise to just monitor the situation. However, a value between 4 and 7 is fairly heavy and means that performance is being negatively affected. Above 7 means the system needs serious review and below 3 means the workload is relatively light. If the system is a single user workstation then the load average should be less than 2. There is also a command called ruptime that allows you to request uptime information remotely.
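The thresholds above can be written down as a small classifier. This is a sketch only: the function names are hypothetical, and the label "borderline" for values between 3 and 4 is an assumption, since the text does not name that band.

```python
def classify_load(load):
    """Map one load average to the rule-of-thumb bands described above."""
    if load > 7:
        return "needs serious review"
    if load >= 4:
        return "fairly heavy"
    if load < 3:
        return "light"
    return "borderline"  # assumed label for the 3-4 gap


def is_falling(one_min, five_min, fifteen_min):
    """True when the load has been dropping over the last 15 minutes."""
    return one_min < five_min < fifteen_min
```

On UNIX platforms Python's os.getloadavg() returns the same three averages that uptime prints, so classify_load(os.getloadavg()[0]) gives a quick reading of the last minute.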
The sar command provides a good alternative to uptime with the -q option. It provides statistics on the average length of the run queue, the percentage of time the run queue is occupied, the average length of the swap queue and the percentage of time the swap queue is occupied. The run queue lists jobs that are in memory and runnable, but does not include jobs that are waiting for I/O or sleeping. The run queue size should be less than 2. If the load is high and the runqocc=0 then the problem is most likely memory or I/O, not CPU. The swap queue lists jobs that are ready to run but have been swapped out.
The sar command deserves special mention as it is a very powerful command. The command is run by typing in:
sar -options int #samples
where valid options generally are:
-g or -p      Paging
-q            Average Q length
-u            CPU Usage
-w            Swapping and Paging
-y            Terminal activity
-v            State of kernel tables
After determining that the problem may well be CPU based it would then be necessary to move on to iostat to get more detail. Running iostat provides a great deal of information, but the values of concern here are the %user and %sys. If (%user + %sys) > 80% over a period of time then it is very likely the bottleneck is CPU. In particular it is necessary to watch for average CPU being greater than 70% with peaks above 90%. It is also possible to get similar information by running the ps -au or sar -u commands. Both of these provide information about CPU time. The sar -u command, in particular, breaks the time down into user, system, time waiting for blocked I/O (e.g. NFS, disk) and idle time.
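The CPU rules of thumb above lend themselves to a simple check over a series of samples. The sketch below assumes each sample has already been parsed into a dictionary with "user" and "sys" percentages; that data shape and the function name are illustrative and not the output format of iostat or sar.

```python
def cpu_bottleneck(samples, sustained=80.0, avg_limit=70.0, peak_limit=90.0):
    """Flag a likely CPU bottleneck from a list of {'user': %, 'sys': %} samples."""
    busy = [s["user"] + s["sys"] for s in samples]
    if not busy:
        return False
    average = sum(busy) / len(busy)
    # Rule 1: %user + %sys stays above 80% for the whole period
    sustained_high = all(b > sustained for b in busy)
    # Rule 2: average above 70% with at least one peak above 90%
    average_with_peaks = average > avg_limit and max(busy) > peak_limit
    return sustained_high or average_with_peaks
```

Treating "over a period of time" as "every sample in the window" is a design choice made for this sketch; a looser test, such as most samples, may suit noisy systems better.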
The ps -au command also provides information on the %physical memory the process is using and the current status for the process. Statuses shown are:
P    Waiting on Pagein
D    Waiting on I/O
S    Sleeping < 20 secs
I    Idle - sleeping > 20 secs
Z    Zombie or defunct
W    Process is swapped out
>    Mem. soft limit exceeded
N    Niced Process (low pri)
<    Niced Process (high pri)
The cron or at command can be used to automatically schedule execution of these commands to ensure snapshots are taken at the appropriate times. The atq command can be used to list what is in the at queue and the crontab -e command edits the cron table.
Once it has been determined that the problem is a CPU bottleneck there are several options. It is possible to limit the cputime a process can use by the limit command. If the problem relates to one process then it is also possible to model or profile that process using the prof, gprof or tprof command to find out whether it is possible to optimize the program code.
Prof and gprof are very similar and have several disadvantages when compared to tprof. Both prof and gprof require a recompile of the program using either the -p or the -pg option and they impact performance of that program very badly. With tprof the program only needs to be recompiled in order to do source code level profiling (-qlist option). In particular tprof exhibits the following characteristics:
No count of routine calls
No call graph
Source statement profiling
Summary of all CPU usage
No recompile needed for routine level profiling
No increase in User CPU
Prof/gprof differ as follows:
Count of routine calls
Call graph (gprof)
Routine level profiling only
Single Process CPU usage
10-300% increase in User CPU
So, the recommendation would be to use tprof if it is available on the chosen platform. It is also possible that the vendor will have their own equivalent to tprof.
Running the time or timex commands can also give a good indication of whether the process is CPU intensive. Compiler options have been proven to extensively affect the performance of CPU intensive programs. It is well worth trying different options when compiling the program such as -O, -O2, -O3 and -Q (which inlines the code). Time/timex can give you an indication of how much benefit this will provide. Timex can also be run using the -s option which causes a full set of sar output to be generated for the duration of the program's execution. As can be seen from the table below, it is possible to see reductions on the order of 50% in CPU utilization by using optimization.
User CPU running Program phntest

Compiler Options    Seconds    % of CPU for None
None                53.03      100%
-O                  26.34      49.67%
-O -Q               25.11      47.35%
-O2                 27.04      50.99%
-O2 -Q              24.92      46.99%
-O3                 28.48      53.71%
-O3 -Q              26.13      49.27%
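The percentage column is simply each timing divided by the unoptimized timing. A short sketch that recomputes it from the seconds reported in the table (the function name is hypothetical):

```python
# User CPU seconds for program phntest, as reported in the table above
timings = {
    "None": 53.03,
    "-O": 26.34,
    "-O -Q": 25.11,
    "-O2": 27.04,
    "-O2 -Q": 24.92,
    "-O3": 28.48,
    "-O3 -Q": 26.13,
}


def percent_of_baseline(seconds, baseline):
    """CPU time as a percentage of the unoptimized run, to two decimals."""
    return round(100 * seconds / baseline, 2)


baseline = timings["None"]
for options, seconds in timings.items():
    print(f"{options:8s} {seconds:6.2f} {percent_of_baseline(seconds, baseline):6.2f}%")
```

The same calculation can be applied to time or timex output from any program to compare compiler options.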
It is also possible to change the priority of the process so that other processes can gain control of the CPU ahead of it. This can be done by using the nice, renice or setpri commands. Renice is not available on all platforms. Before using these commands, it is useful to understand how the priority scheme works in UNIX.
Priorities range from 0-127 with 127 being the lowest priority. The actual priority of a task is set by the following calculation:
Puser normally defaults to 40 and nice to 20 unless the nice or renice commands have been used against the process. On AIX a tick is 1/100th of a second and new priorities are calculated every tick as follows:
Every second, tick is recalculated as tick = tick/2 and new-pri is then recalculated.
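A minimal sketch of a decaying-priority scheme of this kind follows. It assumes the commonly documented form new-pri = puser + nice + tick/2; that formula, the function names and the clamping to 127 are assumptions made for illustration and should be checked against the platform documentation.

```python
def new_priority(tick, puser=40, nice=20):
    """Priority after `tick` recent CPU ticks (assumed formula), capped at 127."""
    return min(127, puser + nice + tick // 2)


def decay(tick):
    """Once-per-second decay of the recent CPU usage counter."""
    return tick // 2


# A process that used 100 ticks in the last second, then goes idle:
tick = 100
for second in range(4):
    print(second, tick, new_priority(tick))
    tick = decay(tick)
```

Under these assumptions a CPU-heavy process drifts toward a numerically higher (worse) priority while it runs and recovers quickly once it stops consuming CPU.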
Otherwise, a CPU upgrade may be the only solution if there is no other machine that the workload can be run on.
If the problem does not appear to be CPU then it becomes necessary to investigate memory and I/O as possible causes. Again, it is possible to use iostat or sar to get the information that is needed here. The iowait field shown in the iostat command is a good indicator of whether there are I/O problems. If iowait is greater than 40% then it becomes necessary to investigate the physical volumes to ensure that none of them are greater than 70% full. The lspv command can be used to determine utilization of the physical volume.
Iostat is a low overhead tool that can be automated and provides local counts for I/O data. Unlike sar, iostat does not provide timestamps in the output so it is important to make a note of start/stop times. However, iostat uses kernel data which makes it hardware specific with respect to the results obtained.
Iostat provides data on several important values for each physical disk. These include: %time the physical disk was busy, kilobytes per second to/from the disk, transfers per second to/from, kilobytes read and kilobytes written. This will help to determine if there is an imbalance of I/O amongst the physical volumes. If all appears to be normal here then the next step is to investigate which filesystems the I/O is directed at. If most of the I/O is directed at the page files then memory needs to be investigated.
Information on cylinder access and seek distances is available using the sadp command and cache statistics for disk are available using the sar -b command. Further information can be obtained by running filemon and looking to see what the most active filesystems are.
Filemon provides a list of the most active segments, the most active logical volumes and physical volumes, and detailed statistics for the busiest files as well as the segments, physical and logical volumes. Details include transfers, reads, read sizes, read times in msecs, logical seeks, write times, seek distances, throughput (kb/sec) and utilization percentages. However, it is important to note that filemon runs trace in the background which can affect performance. It is also possible to run fileplace which gives information on space efficiency and sequentiality.
This would be a good time to run lsvg, lslv and lspv to get a map of the layout of the physical and logical volumes on the system as well as the various volume groups. This will make it much simpler to get more in-depth information. By running lsdev -C it is also possible to determine what kind of disk devices are installed and what size they are. By using a combination of the above commands a map can be produced for each physical volume of the filesystems and their placement on the disk. The lsattr -E -l sys0 command can be used to obtain information on system parameters such as cache sizes and other associated values.
If the bulk of the I/O (>30%) is going to a logical volume or filesystem that is not used for paging then the problem is most likely user I/O. This can be resolved by one of several options - checking fragmentation, reorganizing the filesystem, adding physical volumes or splitting the data in another manner. Adding memory may still help with the problem.
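The split between paging I/O and user I/O can be sketched as a small triage function. The input shape (a mapping of logical volume name to transfer count), the function name and the choice to test the paging share first are all assumptions for illustration; the 40% iowait and 30% share thresholds are the rules of thumb used in this paper.

```python
def io_triage(iowait_pct, io_by_volume, paging_volumes,
              iowait_limit=40.0, share_limit=0.30):
    """Classify an I/O problem as 'paging' or 'user I/O' from per-volume counts."""
    total = sum(io_by_volume.values())
    if iowait_pct <= iowait_limit or total == 0:
        return "no I/O problem indicated"
    paging = sum(count for name, count in io_by_volume.items()
                 if name in paging_volumes)
    # More than 30% of all I/O going to page space points at memory
    if paging / total > share_limit:
        return "paging"
    return "user I/O"
```

For example, io_triage(55, {"hd6": 400, "lv01": 600}, {"hd6"}) reports "paging", where hd6 and lv01 are placeholder volume names.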
Other items to take into account when looking at I/O performance include the intra and inter policies, mirroring of disks, write verify and the scheduling policy for logical volumes. It is also important to remember that the SCSI card can only talk to one device at a time. Where multiple disks are behind one SCSI card, sequential reads/writes are helped if they are spread across multiple adapters. Newer technologies such as SCSI-2, fast and wide and RAID will also help improve performance. Some of the newer controllers also provide buffers for each disk and can perform two way searches.
If the bulk of the I/O is going to paging (i.e. the page LV is > 30%) then it becomes necessary to investigate further. The only options available to cure a paging problem are to write more memory efficient code, move the process to another system, add memory, reschedule the process so it doesn’t contend with other memory intensive workloads, or add physical volumes or more page datasets. There are three commands that are used to investigate paging - lsps (or pstat), vmstat and svmon.
lsps -a will provide information on all of the page spaces available including physical volume and utilization. Vmstat is another low overhead tool and provides information on actual memory, free pages, processes on the I/O waitq, reclaims, pageins, pageouts, pages freed by the stealer per second, interrupts, system calls and CPU utilization. Like iostat, vmstat does not provide timestamps. Svmon -G provides similar information except it breaks memory down into work, persistent and client pages that are either in use or pinned. It is also possible to use the sar -w command.
When looking at paging it is important to note that the stealer will run whenever there are only ((2 x real) - 8) pages left. So on a 32MB machine the stealer will run if there are only 56 pages left. The Rule of Thumb for page space versus real memory is generally on the order of Page = 2 x real. On some systems not all of the kernel processes are pinned so they can also be paged out. A pagein rate of >5/sec means that the system is memory constrained. Also, if fre is less than (0.1 x AVM) then this may indicate that the system is real memory constrained. This depends on the way the VMM uses memory. For instance, AIX will use all memory for disk caching, etc. before it reuses any so it is not unusual to see fre very low (110-120). Looking at pageins, pageouts, and the FR to SR ratio is a much more meaningful indicator for problems.
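The arithmetic behind these memory rules of thumb is simple enough to script. The sketch below uses hypothetical function names and takes real memory in MB, following the 32MB example in the text; the fre and avm values are assumed to be in the same units as vmstat reports them.

```python
def stealer_threshold(real_mb):
    """Free-page count at which the stealer starts: (2 x real) - 8."""
    return 2 * real_mb - 8


def recommended_page_space(real_mb):
    """Rule of thumb: page space = 2 x real memory."""
    return 2 * real_mb


def memory_warnings(pageins_per_sec, fre, avm):
    """Return the memory rules of thumb that the given vmstat values trip."""
    warnings = []
    if pageins_per_sec > 5:
        warnings.append("pagein rate above 5/sec")
    if fre < 0.1 * avm:
        warnings.append("fre below 0.1 x avm")
    return warnings


print(stealer_threshold(32))       # the 32MB example from the text
print(recommended_page_space(32))
```

As noted above, a low fre value alone is weak evidence on AIX, so the second warning is best read together with the pagein and pageout rates.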
So, if at this point there is no problem with CPU and the system is not disk bound it becomes necessary to investigate the network to check whether it is remote file I/O bound. This is the last step before running the more resource heavy trace command to determine what is really happening. To look at network statistics there are three useful commands - netstat, netpmon and nfsstat.
Netstat -i shows the network interfaces along with input and output packets and errors. It also gives the number of collisions. The Mtu field shows the maximum IP packet size (transfer unit) and should be the same on all systems. In AIX it defaults to 1500. Both Oerrs (number of output errors since boot) and Ierrs (input errors since boot) should be < 0.025. If Oerrs > 0.025 then it is worth increasing the send queue size. Ierrs includes checksum errors and can also be an indicator of a hardware error such as a bad connector or terminator. The Collis field shows the number of collisions since boot and can be as high as 10%. If it is greater, then it is necessary to reorganize the network as the network is obviously overloaded on that segment.
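These interface checks can be sketched as ratios over the netstat -i counters. The sketch assumes that the 0.025 limits are fractions of the corresponding packet counts (Ierrs/Ipkts and Oerrs/Opkts) and that the 10% collision limit is relative to output packets; the text does not state the denominators, so both are assumptions, as is the function name.

```python
def interface_warnings(ipkts, ierrs, opkts, oerrs, collis,
                       err_limit=0.025, collision_limit=0.10):
    """Return rule-of-thumb warnings for one interface's netstat -i counters."""
    warnings = []
    if ipkts and ierrs / ipkts > err_limit:
        warnings.append("input errors high: check connectors and terminators")
    if opkts and oerrs / opkts > err_limit:
        warnings.append("output errors high: consider a larger send queue")
    if opkts and collis / opkts > collision_limit:
        warnings.append("collisions high: segment may be overloaded")
    return warnings
```

Because the counters are totals since boot, comparing two snapshots taken some minutes apart gives a better picture of current behaviour than a single reading.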
Netstat -m is used to analyze the use of mbufs in order to determine whether these are the bottleneck. The no -a command is used to see what the current values are. Values of interest are thewall, lowclust, lowmbuf and dogticks.
An mbuf is a kernel buffer that uses pinned memory and is used to service network communications. Mbufs come in two sizes - 256 bytes and 4096 bytes (clusters of 256 bytes). Thewall is the maximum memory that can be taken up for mbufs. Lowmbuf is the minimum number of mbufs to be kept free while lowclust is the minimum number of clusters to be kept free. Mb_cl_hiwat is the maximum number of free buffers to be kept in the free buffer pool and should be set to at least twice the value of lowclust to avoid thrashing.
Netstat -v is used to look at queues and other information. If Max packets on S/W transmit queue is >0 and is equal to current HW transmit queue length then the send queue size should be increased. If the No mbuf errors count is large then the receive queue size needs to be increased.
Nfsstat is used to report on client and server NFS information, primarily at the daemon level. Nfsstat -c provides client information such as retrans and badxid. If badxid = retrans and retrans > 5% of calls then the server is the problem, but if retrans > 5% of calls and badxid < retrans then the network is the problem. Also, if there are lots of timeouts then it is useful to increase the number of NFSDs and the qsize.
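The retrans and badxid rule can be expressed as a small decision function. In this sketch the function name is hypothetical, and the 10% tolerance used to decide that badxid is "equal" to retrans is an assumption, since real counters rarely match exactly.

```python
def diagnose_nfs(calls, retrans, badxid, limit=0.05, tolerance=0.10):
    """Point at 'server' or 'network' from nfsstat -c client counters."""
    if calls == 0 or retrans / calls <= limit:
        return "ok"
    # badxid roughly equal to retrans: the server is answering, but too slowly
    if abs(badxid - retrans) <= tolerance * retrans:
        return "server"
    # far fewer duplicate replies than retransmissions: requests are being lost
    if badxid < retrans:
        return "network"
    return "unclear"
```

For example, diagnose_nfs(10000, 800, 790) returns "server", while diagnose_nfs(10000, 800, 20) returns "network".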
Netpmon is a further command that focuses on CPU, network adapters, remote nodes and LAN traffic. It is used to get a feeling for what is happening overall. By using a combination of the above commands it is possible to obtain a very clear view of what is happening at the network level.
At this point it is important to mention the UNIX kernel tables, as these can affect performance without any real indicators as to the cause. To find out what they are set to, the pstat -T or sar -v commands can be used. Most of the values are calculated based on the value for maxusers so it is important to know what that is set to. It is often recommended that maxusers generally be determined by the following calculation:
Max Users = (2+ # users act + #NFS clients + .5 NFS exports )
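As a worked example of this calculation, the sketch below evaluates it for a hypothetical system; the function name and the sample figures are illustrative only.

```python
def max_users(active_users, nfs_clients, nfs_exports):
    """maxusers = 2 + active users + NFS clients + 0.5 x NFS exports."""
    return 2 + active_users + nfs_clients + 0.5 * nfs_exports


# Hypothetical system: 10 active users, 4 NFS clients, 2 exported filesystems
print(max_users(10, 4, 2))
```

With these figures the calculation gives 2 + 10 + 4 + 1 = 17.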
In particular, attention should be paid to the following table sizes:
Process Table Size (NPROCS) - this is the maximum number of processes that can be in the system. On systems where Xwindows is heavily used this needs to be increased. If the table is full, then the process will fail to start.
Text Table Size (NTEXT) - This is the maximum number of executables that can be in the system at a time. If the table is full then the executable will not run.
Inode Table Size (NINODE) - This is a cache of the active inode entries. If this table fills up then performance slows down.
File Table Size (NFILE) - This is the maximum number of files that can be open at one time. If the table fills up the open will fail.
Callout Table Size (NCALLOUT) - This is the maximum number of timers that can be active at one time. Since timers are used heavily by device drivers to monitor I/O devices, the system will crash if this table fills up.
General Default Calculations (may be platform Specific)
Other kernel settings that should be reviewed are the number of processes per user, maximum open files per user and maximum mounted filesystems. All of these can have unpredictable effects on performance.
If none of the above provides a reasonable clue to what is going on it is necessary to bring out the most powerful tool of all - trace. Trace will provide in-depth information on everything that is going on in the system. However, it will definitely affect performance and thus should be used judiciously.
As can be seen above there is a great deal of information that can be gleaned from the system for relatively minimal effort. Figure 1 contains some of the Rules of Thumb (ROT) that are useful along with what they apply to and the tool that best provides the information. These ROTs can then be used as follows to diagnose and fix performance problems in UNIX systems.
So to reiterate: first iostat, sar and uptime are run to determine whether it appears to be a CPU problem. If it is CPU then it is possible to try profiling, time/timex, optimization, priority changing or a CPU upgrade. If the problem is not CPU then it is necessary to investigate for possible I/O problems further using iostat, and then filemon, lsvg, lsdev, lsattr, lspv and lslv. I/O solutions include adding disk space and reorganizing filesystems.
If the I/O breakdown indicates the problem is with paging (page lv>30%) then svmon, lsps or pstat should be used. Possible solutions include adding memory or disk space. If the system does not appear to be disk bound then it is time to check for remote file I/O problems using nfsstat, netpmon and netstat. Finally, if none of these identify the problem it is time to resort to trace.
By taking such a structured approach to problem diagnosis it is possible to rapidly isolate the problem area. Taking these measurements when the system is behaving normally is also a useful option as this provides a baseline to compare future measurements with.
To do performance measurement properly it is helpful to automate the reporting process using a scripting language (such as Perl) combined with scheduling commands such as at or cron. These languages can also be used to create graphical representations of the output from the tools.
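One practical use of such automation is to add the timestamps that iostat and vmstat omit. The following is a minimal sketch in Python rather than Perl; the function names are hypothetical and the wrapper simply prefixes each output line with the time it was read.

```python
import subprocess
from datetime import datetime


def stamp(line, now):
    """Prefix one line of tool output with a timestamp."""
    return f"{now:%Y-%m-%d %H:%M:%S} {line.rstrip()}"


def run_and_stamp(argv, out):
    """Run a monitoring command and write its output to `out`, line by line."""
    with subprocess.Popen(argv, stdout=subprocess.PIPE, text=True) as proc:
        for line in proc.stdout:
            out.write(stamp(line, datetime.now()) + "\n")
    return proc.returncode
```

A script calling run_and_stamp(["vmstat", "5", "12"], log_file) could then be scheduled from cron to collect a minute of samples at regular intervals; the interval, count and log file are placeholders.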
By using the above mentioned tools and methodology, it is possible to diagnose performance problems on most UNIX systems, using non-proprietary tools that come standard with the system.
1. System Performance Tuning, Mike Loukides, O’Reilly and Associates
2. SC23-2365, Performance Monitoring and Tuning Guide, IBM
Figure 1: Rules of Thumb
fre < .1(avm) or >=110-120
PV util > 80%
IOWait > 40%
Tm_act > 70%
retrans > 5% (total calls)
runqsz <2 is good
System Load Average (CPU)
> 10 - very bad
4-7 - fairly heavy
<3 - light load
Single user station <2