UNIX and Web PERFORMANCE
By Jaqui Lynch
Manager Systems Services
The purpose of this paper is to introduce the performance analyst to some of the free tools available to monitor and manage performance on UNIX systems, and to provide a guideline on how to diagnose and fix performance problems in that environment. The paper is based on the author's experiences with AIX and will cover many of the tools available on that and other UNIX platforms. It will also provide some Rules of Thumb for analyzing the performance of UNIX systems. The paper has been updated to include some of the things that can be done to make sure that the web server will perform well.
As more mission critical work finds its way from the mainframe to distributed systems, performance management for those systems is becoming more important. The goal for systems management is not only to maximize system throughput, but also to reduce response time. To do this it is necessary not only to monitor and tune the system resources, but also to work on profiling and tuning applications. When a web server is added to the system it is necessary to alter some system values that one would normally not touch. The first part of this paper addresses normal UNIX performance issues and the second part looks at some of the settings that can be changed to improve web server performance.
Part 1 - UNIX Performance
In UNIX there are 7 major resource types that need to be monitored and tuned - CPU, memory, disk space and arms, communications lines, I/O time, network time and application programs. There are standard rules of thumb in most of these areas. From the user's perspective the only one that is visible is total execution time, so we will start by looking at that.
Total execution time from a user's perspective consists of wall-clock time. At a process level this is measured by running the time command. This provides you with real time (wall-clock), user code CPU and system code CPU. If user + sys > 80% of real time then there is a good chance the system is CPU constrained. The components of total running time include:
1. User-state CPU - the actual amount of time the CPU spends running the user's program in the user state. It includes time spent executing library calls, but does not include time spent in the kernel on its behalf. This value can be greatly affected by the use of optimization at compile time and by writing efficient code.
2. System-state CPU - this is the amount of time the CPU spends in the system state on behalf of this program. All I/O routines require kernel services. The programmer can affect this value by the use of blocking for I/O transfers.
3. I/O Time and Network Time - these are the amount of time spent moving data and servicing I/O requests.
4. Virtual Memory Performance - This includes context switching and swapping.
5. Time spent running other programs - i.e. when the system is not servicing this application because another application currently has the CPU.
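The breakdown reported by the time command can be approximated in a few lines of Python. This is a minimal sketch rather than a replacement for time or timex: the function names time_command and cpu_fraction are illustrative, and the child command (the Python interpreter itself) is only a stand-in for a real workload.

```python
import os
import subprocess
import sys

def time_command(argv):
    """Run argv and return (real, user, sys) seconds, similar to the time command."""
    before = os.times()
    subprocess.run(argv, check=False, stdout=subprocess.DEVNULL)
    after = os.times()
    real = after.elapsed - before.elapsed
    # children_* counters cover child processes that have been waited on.
    user = after.children_user - before.children_user
    system = after.children_system - before.children_system
    return real, user, system

def cpu_fraction(real, user, system):
    """Fraction of wall-clock time spent on the CPU (user + sys)."""
    if real <= 0:
        return 0.0
    return (user + system) / real

# Stand-in workload: a short CPU-bound loop in a child interpreter.
real, user, system = time_command([sys.executable, "-c", "sum(range(10**6))"])
print(f"real={real:.2f}s user={user:.2f}s sys={system:.2f}s "
      f"cpu={cpu_fraction(real, user, system):.0%}")
```

A result well below 80% suggests the process spent most of its elapsed time waiting on I/O, memory or other programs rather than on the CPU.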
To measure these areas there are a multitude of tools available. The most useful are:
cron Process scheduling
nice/renice Change priorities
setpri Set priorities
netstat Network statistics
nfsstat NFS statistics
time/timex Process CPU Utilization
uptime System Load Average
ps Process Statistics
iostat BSD tool for I/O
sar Bulk System Activity
vmstat BSD tool for V. Memory
gprof Call Graph profiling
prof Process Profiling
trace Used to get more depth
Other commands that will be useful include lsvg, lspv, lslv, lsps and lsdev. Each of these will be discussed later along with a general problem solving approach. It is important to note that the results and options for all of these commands may differ depending on the platform they are being run on. Most of the options discussed below are those for AIX and some of the tools are specific to AIX such as:
tprof CPU Usage
svmon Memory Usage
filemon Filesystem, LV .. activity
netpmon Network resources
The first command to be discussed is uptime. This provides the analyst with the System Load Average (SLA). It is important to note that the SLA can only be used as a rough indicator as it does not take scheduling priority into account. It also counts as runnable all jobs waiting for disk I/O, including NFS I/O. However, uptime is a good place to start when trying to determine whether a bottleneck is CPU or I/O based.
When uptime is run it provides three load averages - the first is for the last minute, the second is for the last 5 minutes and the third is for the last 15 minutes. If the value is borderline but has been falling over the last 15 minutes, then it would be wise to just monitor the situation. However, a value between 4 and 7 is fairly heavy and means that performance is being negatively affected. Above 7 means the system needs serious review and below 3 means the workload is relatively light. If the system is a single user workstation then the load average should be less than 2. There is also a command called ruptime that allows you to request uptime information remotely.
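The load average rules of thumb above can be written down as a small classifier. This is a sketch only: the label for values between 3 and 4 ("borderline") is an assumption, since the text gives no rule for that range, and "falling" is taken here to mean that the 1 minute average is below the 15 minute average.

```python
import os

def classify_load(load):
    """Map a load average onto the rough categories described in the text."""
    if load < 3:
        return "light"
    if load < 4:
        return "borderline"   # assumption: the text does not name this range
    if load <= 7:
        return "heavy"
    return "needs serious review"

def is_falling(one_min, fifteen_min):
    """Assumed definition: the most recent average is below the 15 minute one."""
    return one_min < fifteen_min

try:
    one, five, fifteen = os.getloadavg()
    print(f"1 min: {one:.2f} ({classify_load(one)}), "
          f"falling: {is_falling(one, fifteen)}")
except OSError:
    # Load averages are not obtainable on every platform.
    print("load average not available")
```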
The sar command provides a good alternative to uptime with the -q option. It provides statistics on the average length of the run queue, the percentage of time the run queue is occupied, the average length of the swap queue and the percentage of time the swap queue is occupied. The run queue lists jobs that are in memory and runnable, but does not include jobs that are waiting for I/O or sleeping. The run queue size should be less than 2. If the load is high and the runqocc=0 then the problem is most likely memory or I/O, not CPU. The swap queue lists jobs that are ready to run but have been swapped out.
The sar command deserves special mention as it is a very powerful command. The command is run by typing in:
sar -options int #samples
where valid options generally are:
-g or -p Paging
-q Average Q length
-u CPU Usage
-w Swapping and Paging
-y Terminal activity
-v State of kernel tables
After determining that the problem may well be CPU based by using uptime and sar, it would then be necessary to move on to iostat to get more detail. Running iostat provides a great deal of information, but the values of concern here are the %user and %sys. If (%user + %sys) > 80% over a period of time then it is very likely the bottleneck is CPU. In particular it is necessary to watch for average CPU being greater than 70% with peaks above 90%. It is also possible to get similar information by running the ps -au or sar -u commands. Both of these provide information about CPU time. The sar -u command, in particular, breaks the time down into user, system, time waiting for blocked I/O (i.e. NFS, disk, ..) and idle time.
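The CPU rules of thumb can be applied to a series of samples collected from iostat or sar -u. The sketch below assumes the samples have already been parsed into (%user, %sys) pairs; the function name and the interpretation of "over a period of time" as "in every sample" are assumptions.

```python
def cpu_bound(samples, busy_limit=80.0, avg_limit=70.0, peak_limit=90.0):
    """Return True if a list of (pct_user, pct_sys) samples looks CPU bound.

    Two rules from the text are checked:
      - %user + %sys above 80% throughout the sampling period
      - average above 70% with at least one peak above 90%
    """
    busy = [user + system for user, system in samples]
    if not busy:
        return False
    average = sum(busy) / len(busy)
    sustained = all(value > busy_limit for value in busy)
    peaky = average > avg_limit and max(busy) > peak_limit
    return sustained or peaky

print(cpu_bound([(60, 25), (70, 15), (65, 20)]))  # every sample is 85% busy
print(cpu_bound([(20, 5), (30, 10)]))             # lightly loaded
```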
The ps -au command also provides information on the %physical memory the process is using and the current status for the process. The status will be one of:
P Waiting on Pagein
D Waiting on I/O
S Sleeping < 20 secs
I Idle - sleeping >20 secs
Z Zombie or defunct
W Process is swapped out
> Mem. soft limit exceeded
N Niced Process (low pri)
< Niced Process (high pri)
The cron or at command can be used to automatically schedule execution of these commands to ensure snapshots are taken at the appropriate times. The atq command can be used to list what is in the at queue and the crontab -e command edits the cron table.
Once it has been determined that the problem is a CPU bottleneck there are several options. It is possible to limit the CPU time a process can use with the limit command. If the problem relates to one process then it is also possible to model or profile that process using the prof, gprof or tprof command to find out whether it is possible to optimize the program code.
prof and gprof are very similar and have several disadvantages when compared to tprof. Both prof and gprof require a recompile of the program using either the -p or the -pg option and they slow down the program a lot. tprof needs the program to be recompiled to do source code level profiling (-qlist option). In particular, tprof exhibits the following characteristics:
No count of routine calls
No call graph
Source statement profiling
Summary of all CPU usage
No recompile needed for routine level profiling
No increase in User CPU
prof/gprof differ as follows:
Count of routine calls
Call graph (gprof)
Routine level profiling only
Single Process CPU usage
10-300% increase in User CPU
So, the recommendation would be to use tprof if it is available on the chosen platform. It is also possible that the vendor will have their own equivalent to tprof.
Running the time or timex commands can also give a good indication of whether the process is CPU intensive. Compiler options can significantly affect the performance of CPU intensive programs, as can be seen from the table below. It is well worth trying different optimization options when compiling the program, such as -O, -O2, -O3 and -Q (which inlines code). More information can be found on these options by using the "man cc" command. time/timex can give you an indication of how much benefit this will provide. timex can also be run using the -s option, which causes a full set of sar output to be generated for the duration of the program's execution. In this example, optimization reduced CPU utilization by roughly 50%.
User CPU running Program phntest

Compiler Options    Seconds    % of CPU for None
None                53.03      100%
-O                  26.34      49.67%
-O -Q               25.11      47.35%
-O2                 27.04      50.99%
-O2 -Q              24.92      46.99%
-O3                 28.48      53.71%
-O3 -Q              26.13      49.27%
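The percentage column in the table is simply each timing divided by the unoptimized timing. A short sketch of that arithmetic, using the figures from the table (the variable and function names are illustrative):

```python
# User CPU seconds for program phntest, taken from the table above.
timings = {
    "None": 53.03,
    "-O": 26.34,
    "-O -Q": 25.11,
    "-O2": 27.04,
    "-O2 -Q": 24.92,
    "-O3": 28.48,
    "-O3 -Q": 26.13,
}

def percent_of_baseline(seconds, baseline):
    """CPU time as a percentage of the unoptimized run, to two decimals."""
    return round(100 * seconds / baseline, 2)

baseline = timings["None"]
for options, seconds in timings.items():
    print(f"{options:8} {seconds:6.2f} {percent_of_baseline(seconds, baseline):6.2f}%")

# The option set with the lowest user CPU time.
best = min(timings, key=timings.get)
print("fastest:", best)
```

Note that in these measurements higher optimization levels did not always help: -O3 was slower than -O, which is why it is worth timing each combination.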
It is also possible to change the priority of the process so that other processes can gain control of the CPU ahead of it. This can be done by using the nice, renice or setpri commands. Renice is not available on all platforms. Before using these commands, it is useful to understand how the priority scheme works in UNIX.
Priorities range from 0-127 with 127 being the lowest priority. The actual priority of a task is set by the following calculation:
Puser normally defaults to 40 and nice to 20 unless the nice or renice commands have been used against the process. On AIX a tick is 1/100th of a second and new priorities are calculated every tick as follows:
Every second, tick is recalculated as tick = tick/2, and new-pri is then recalculated.
If none of the above improve performance and the problem still appears to be CPU then there are only two options left:
Run the workload on another more powerful system
Upgrade the CPU
If the problem does not appear to be CPU then it becomes necessary to investigate memory and I/O as possible causes. Again, it is possible to use iostat or sar to get the information that is needed here. The iowait field shown in the iostat command is a good indicator of whether there are I/O problems. If iowait is greater than 40% then it becomes necessary to investigate the physical volumes to ensure that none of them are greater than 70% full. The lspv command can be used to determine utilization of the physical volume.
Iostat is a low overhead tool that can be automated and provides local counts for I/O data. Unlike sar, iostat does not provide timestamps in the output so it is important to make a note of start/stop times. However, iostat uses kernel data which makes it hardware specific with respect to the results obtained.
Iostat provides data on several important values for each physical disk. These include: %time the physical disk was busy, kilobytes per second to/from the disk, transfers per second to/from, kilobytes read and kilobytes written. This will help to determine if there is an imbalance of I/O amongst the physical volumes. If all appears to be normal here then the next step is to investigate which filesystems the I/O is directed at. If most of the I/O is directed at the page files then memory needs to be investigated.
Information on cylinder access and seek distances is available using the sadp command and cache statistics for disk are available using the sar -b command. Further information can be obtained by running filemon and looking to see what the most active filesystems are.
filemon provides a list of the most active segments, the most active logical volumes and physical volumes, and detailed statistics for the busiest files as well as the segments, physical and logical volumes. Details include transfers, reads, read sizes, read times in msecs, logical seeks, write times, seek distances, throughput (kb/sec) and utilization percentages. However, it is important to note that filemon runs trace in the background which can affect performance. It is also possible to run fileplace which gives information on space efficiency and sequentiality.
Running lsvg, lslv and lspv provides a map of the layout of the physical and logical volumes on the system as well as the various volume groups. This will make it much simpler to get more in depth information. By running lsdev -C it is also possible to determine what kind of disk devices are installed and what size they are. By using a combination of the above commands a map can be produced for each physical volume of the filesystems and their placement on the disk. The lsattr -E -l sys0 command can be used to obtain information on system parameters such as cache sizes and other associated values.
If the bulk of the I/O (>30%) is going to a logical volume or filesystem that is not used for paging then the problem is most likely user I/O. This can be resolved by one of several options:
Reorganizing the filesystem
Adding physical volumes
Splitting the data in another manner
Adding memory may also help
Other items to take into account when looking at I/O performance include the intra and inter policies, mirroring of disks, write verify and the scheduling policy for logical volumes. It is also important to remember that the SCSI card can only talk to one device at a time. Where multiple disks are behind one SCSI card, sequential read/writes are helped if they are spread across multiple adapters. Newer technologies such as SCSI-2 (with fast and wide and/or differential features) and RAID will also help improve performance. Some of the newer controllers also provide buffers for each disk and can perform two way searches.
If the bulk of the I/O is going to paging (i.e. the page LV is > 30%) then it becomes necessary to investigate further. The only options available to cure a paging problem are to:
Write more memory efficient code
Move the process to another system
Reschedule the process so it doesn't contend with other memory intensive workloads
Add physical volumes or more page datasets.
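The 30% rules for user I/O and paging I/O can be applied to per-logical-volume throughput figures, such as those reported by filemon. The sketch below assumes the figures have already been collected into a dictionary; the logical volume names are illustrative (hd6 is used here as the paging logical volume), and the function name is hypothetical.

```python
def busiest_lvs(kb_by_lv, paging_lvs, threshold=0.30):
    """Return (lv, share, kind) for each logical volume taking more than
    `threshold` of the total I/O, busiest first.

    kind is "paging" for paging logical volumes, otherwise "user I/O".
    """
    total = sum(kb_by_lv.values())
    findings = []
    if total == 0:
        return findings
    ranked = sorted(kb_by_lv.items(), key=lambda item: item[1], reverse=True)
    for lv, kb in ranked:
        share = kb / total
        if share > threshold:
            kind = "paging" if lv in paging_lvs else "user I/O"
            findings.append((lv, round(share, 2), kind))
    return findings

# Illustrative figures in KB transferred during the sampling interval.
sample = {"hd6": 500, "hd2": 100, "datalv": 400}
print(busiest_lvs(sample, paging_lvs={"hd6"}))
```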
There are three commands that are used to investigate paging - lsps (or pstat), vmstat and svmon. lsps -a will provide information on all of the page spaces available including physical volume and utilization. vmstat is another low overhead tool and provides information on actual memory, free pages, processes on the I/O waitq, reclaims, pageins, pageouts, pages freed by the stealer per second, interrupts, system calls and CPU utilization. Like iostat, vmstat does not provide timestamps. svmon -G provides similar information except it breaks memory down into work, persistent and client pages that are either in use or pinned. It is also possible to use the sar -w command.
When looking at paging it is important to note that the stealer will run whenever there are only ((2 x real) - 8) pages left. So on a 32 MB machine the stealer will run if there are only 56 pages left. The Rule of Thumb for page space versus real memory is generally in the order of Page = 2 x real. On some systems not all of the kernel processes are pinned (fixed in memory) so they can also be paged out. A pagein rate of >5/sec is a reasonable indicator for any system that the system is memory constrained.
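The paging rules of thumb are simple arithmetic and can be checked as follows. This is a sketch using the formulas exactly as stated in the text, with real memory expressed in megabytes; the function names are illustrative.

```python
def stealer_threshold_pages(real_mb):
    """Free-page count at which the stealer starts: (2 x real) - 8."""
    return 2 * real_mb - 8

def recommended_page_space_mb(real_mb):
    """Rule of thumb for page space: 2 x real memory."""
    return 2 * real_mb

def memory_constrained(pageins_per_sec, limit=5.0):
    """A pagein rate above about 5 per second suggests a memory constraint."""
    return pageins_per_sec > limit

real_mb = 32
print(stealer_threshold_pages(real_mb))    # 56 pages, as in the text
print(recommended_page_space_mb(real_mb))  # 64 MB of page space
print(memory_constrained(7.5))
```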
If there is no performance problem with the CPU and the system is not disk bound it becomes necessary to investigate the network to check whether it is remote file I/O bound. This is the last step before running the more resource heavy trace command to determine what is really happening. To look at network statistics there are three useful commands:
netstat -i shows the network interfaces along with input and output packets and errors. It also gives the number of collisions. The Mtu field shows the maximum IP packet size (transfer unit) and should be the same on all systems. In AIX it defaults to 1500. Both oerrs (number of output errors since boot) and ierrs (input errors since boot) should be < 0.025. If oerrs>0.025 then it is worth increasing the send queue size. Ierrs includes checksum errors and can also be an indicator of a hardware error such as a bad connector or terminator. The collis field shows the number of collisions since boot and can be as high as 10%. If it is greater, then it is necessary to reorganize the network as the network is overloaded on that segment.
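The netstat -i checks can be sketched as below. Two assumptions are made that the text does not state explicitly: the 0.025 limit is treated as the ratio of errors to packets in the same direction, and the 10% collision limit is treated as collisions divided by output packets. The function name and the sample counter values are illustrative.

```python
def check_interface(ipkts, ierrs, opkts, oerrs, collisions,
                    err_limit=0.025, coll_limit=0.10):
    """Apply the netstat -i rules of thumb to one interface's counters."""
    findings = []
    if ipkts and ierrs / ipkts > err_limit:
        findings.append("input errors high: check connectors and terminators")
    if opkts and oerrs / opkts > err_limit:
        findings.append("output errors high: consider a larger send queue")
    if opkts and collisions / opkts > coll_limit:
        findings.append("collisions high: network segment is overloaded")
    return findings

# Illustrative counters: a healthy interface, then a troubled one.
print(check_interface(100000, 10, 80000, 5, 400))
print(check_interface(1000, 50, 1000, 0, 200))
```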
netstat -m is used to analyze the use of mbufs to determine whether mbufs are the bottleneck. The no -a command is used to see what the current values are. Values of interest are thewall, lowclust, lowmbuf and dogticks.
An mbuf is a kernel buffer that uses pinned memory and is used to service network communications. mbufs come in two sizes - 256 bytes and 4096 bytes (clusters of 256 bytes). thewall is the maximum memory that can be taken up for mbufs. lowmbuf is the minimum number of mbufs to be kept free, while lowclust is the minimum number of clusters to be kept free. mb_cl_hiwat is the maximum number of free buffers to be kept in the free buffer pool and should be set to at least twice the value of lowclust to avoid thrashing.
netstat -v is used to look at queues and other information. If max packets on the S/W transmit queue is >0 and is equal to current hardware transmit queue length then the send queue size should be increased. If the "no mbuf errors" is large then the receive queue size needs to be increased.
nfsstat reports on client and server NFS information, primarily at the daemon level. nfsstat -c provides client information such as retrans (retransmissions due to errors) and badxid. If badxid=retrans and retrans > 5% of calls the server is the problem; but if retrans > 5% of calls and badxid < retrans, then the network is the problem. Also, if there are lots of timeouts, then it is useful to increase the number of NFSDs (NFS daemons) and the qsize.
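The nfsstat -c decision rule can be expressed as a short function. In this sketch "badxid = retrans" is relaxed to "within 10% of retrans", which is an assumption, since the two counters are rarely exactly equal in practice; the function name is illustrative.

```python
def diagnose_nfs(calls, retrans, badxid, retrans_limit=0.05, tolerance=0.10):
    """Classify NFS client statistics as 'ok', 'server', 'network' or 'unclear'."""
    if calls == 0 or retrans / calls <= retrans_limit:
        return "ok"
    # badxid roughly equal to retrans: the server is responding too slowly.
    if abs(badxid - retrans) <= tolerance * retrans:
        return "server"
    # Retransmissions without matching duplicate replies: requests are being lost.
    if badxid < retrans:
        return "network"
    return "unclear"

print(diagnose_nfs(calls=10000, retrans=800, badxid=790))  # server
print(diagnose_nfs(calls=10000, retrans=800, badxid=50))   # network
```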
netpmon is another command that focuses on CPU, network adapters, remote nodes and LAN traffic. It is used to get an overall sense of network performance. By using a combination of the above commands, it is possible to obtain a very clear view of what is happening at the network level.
At this point it is important to mention the UNIX kernel tables, as these can affect performance without any real indicators as to the cause. To find out what they are set to, the pstat -T or sar -v commands are used. Most of the values are calculated based on the value for maxusers, so it is important to know what that is set to. It is often recommended that maxusers generally be determined by the following calculation:
maxusers = 2 + (# active users) + (# NFS clients) + 0.5 x (# NFS exports)
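As a worked example of this calculation (a sketch only, with illustrative input values):

```python
def maxusers(active_users, nfs_clients, nfs_exports):
    """Rule-of-thumb maxusers: 2 + active users + NFS clients + half the NFS exports."""
    return 2 + active_users + nfs_clients + 0.5 * nfs_exports

# 20 active users, 10 NFS clients and 4 exported filesystems.
print(maxusers(20, 10, 4))
```

With these inputs the rule gives 34, which would then drive the sizes of the kernel tables that are derived from maxusers.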
In particular, attention should be paid to the following table sizes:
Process Table Size (NPROCS) - this is the maximum number of processes that can be in the system. On systems where X Windows is heavily used, this needs to be increased. If the table is full, then new processes will fail to start.
Text Table Size (NTEXT) - This is the maximum number of executables that can be in the system at a time. If the table is full, then the exe will not run.
Inode Table Size (NINODE) - This is a cache of the active inode entries. If this table fills up, then performance slows down.
File Table Size (NFILE) - This is the maximum number of files that can be open at one time. If the table fills up, the open will fail.
Callout Table Size (NCALLOUT) - This is the maximum number of timers that can be active at one time. Since timers are used heavily by device drivers to monitor I/O devices, the system will crash if this table fills up.
General Default Calculations (may be platform Specific)
Other kernel settings that should be reviewed are the number of processes per user, maximum open files per user and maximum mounted filesystems. All of these can have unpredictable effects on performance.
If none of these provides a reasonable clue to what is going on, it is necessary to bring out the most powerful tool of all - trace. Trace will provide in-depth information on everything that is going on in the system. However, it will definitely affect performance and thus, should be used judiciously.
As can be seen there is a great deal of information that can be gleaned from the system for relatively minimal effort. This information can now be used as follows to diagnose and fix performance problems in UNIX systems.
First - run iostat, sar and uptime
if there is a CPU problem
if there is an I/O problem
add disk space
if there is a paging problem
lsps or pstat
if there is a network problem
if all else fails run trace
By taking such a structured approach to problem diagnosis, it is possible to rapidly isolate the problem area. Taking these measurements when the system is behaving normally is also useful, as this provides a baseline to compare future measurements with.
To do performance measurement properly, it is helpful to automate the reporting process using a scripting language (such as perl) combined with scheduling commands such as at or cron. These languages can also be used to create graphical views of the output from the tools. By using these tools and this methodology, it is possible to diagnose performance problems on most UNIX systems, using non-proprietary tools that come standard with the system.
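Since iostat and vmstat do not timestamp their output, an automated collection script can add the timestamps itself. The sketch below is written in Python rather than perl and is only one way to do this; the function name is illustrative, and the demonstration uses a stand-in command in place of a real monitoring tool.

```python
import subprocess
import sys
import time

def run_with_timestamps(argv):
    """Run argv and return its output lines, each prefixed with the time it was read."""
    stamped = []
    with subprocess.Popen(argv, stdout=subprocess.PIPE, text=True) as proc:
        for line in proc.stdout:
            stamp = time.strftime("%Y-%m-%d %H:%M:%S")
            stamped.append(f"{stamp} {line.rstrip()}")
    return stamped

# Stand-in command; on a real system this might be ["vmstat", "5", "12"]
# or ["iostat", "5", "12"], with the script itself started from cron.
demo = run_with_timestamps(
    [sys.executable, "-c", "print('r b avm fre'); print('1 0 100 200')"]
)
for line in demo:
    print(line)
```

Appending the returned lines to a dated log file gives a baseline that later measurements can be compared against.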
Part 2 - Web Server Performance
Once a web server is installed on the system there are many things that need to be considered that can impact performance. Obviously the network becomes a factor but so do all the other servers that are being accessed. It is important to understand how the web applications are written, what optimization flags have been used, how many servers are in place as well as how performance is being measured. Several of the parameters chosen in the web server configuration can have a serious impact on performance and may need to be reevaluated. Some examples of areas to investigate include program design, web server parameters, security, unwanted visitors and spiders, network parameters and log analyzer programs. Each of these is addressed below.
Poor program design can adversely affect the scalability of a web application. It is important to look at how the program is written, what optimization parameters were chosen on compile, whether counters are being used and how they are being calculated, what languages are being used and how file locking is being implemented.
For example, interpreted languages such as Perl will always be a little slower than compiled languages such as C. However, they provide more flexibility and are a standard UNIX tool that is heavily used. When looking at such programs it is important to look at how data is received and sent, making sure the program does not sit in a loop waiting for data from another system, which causes it to use CPU time periodically for no real value. File locking is another important thing to watch for.
Other design features that can adversely affect performance include the use of directory listings and the use of graphics. It is important to know your audience and to have some idea of the kinds of connections they will be using before adding large graphics. At 9600bps these are incredibly slow to download. Directory listings are both a performance and a security concern. It is best to turn off this ability on all directories.
Web Server and its Parameters
Dnsmode has from two to four possible settings - none, minimum, standard and maximum, or off and on. Basically, dnsmode is used to determine whether the system will try to do a DNS lookup on the requesting IP address so it has a name to put in the logs. Maximum means it will do a reverse DNS lookup and will not allow the connection if this fails. These DNS lookups are very important for security reasons but are major performance bottlenecks. On a very active server this should probably be set to off or none.
Another couple of useful parameters are Max children or Max active threads. These are the number of children or threads that can be active concurrently. If this is set too low then users will be denied access. Most of the current web servers also preallocate processes to handle requests instead of forking a process when a request comes in - this is very helpful for performance.
If the web server supports HTTP 1.1 keep-alive then this should be used. Basically this means that the server maintains persistent connections with clients, which saves the overhead incurred in starting a new connection for a user whenever they make an HTTP request during the session.
Security and Unwanted Visitors
Logging is one of those processes that slows down the webserver but is needed for security. Unless they are needed it is possible to turn off the referrer and agent logs. The other logs (access and error) should be in their own filesystem on separate actuators from the documents and cgi-bin, if possible.
Security and encryption algorithms also impair performance but may be necessary for the server. There are 3 basic options on most servers - no security, basic or uuencoded data and finally SSL (certificate based with encryption). Unless security is necessary don't even configure it into the web server.
If there are directories that are very heavily hit then it may be helpful to list them in a file called robots.txt, which resides in your document root directory. This is an indication to spiders and web crawlers that they are not to visit those directories, which will reduce unproductive hits.
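The effect of a robots.txt file can be checked with the parser in the Python standard library. This is a sketch: the directory names and the host name are illustrative, and well-behaved crawlers honor the file only voluntarily.

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt contents asking all crawlers to skip two directories.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /archive/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

def allowed(path, agent="*"):
    """Return True if a compliant crawler may fetch the given path."""
    return parser.can_fetch(agent, "http://www.example.com" + path)

print(allowed("/index.html"))
print(allowed("/cgi-bin/search"))
```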
The network is key to the web server's performance. If at all possible connect the system using a 100 Mbps card on a switch. This will provide better throughput for the system. netstat, no -a and lsattr should also be used to check the network parameters set on the system.
Of particular interest are the transmit queue size, TCP send and receive space and the number of collisions. NFS mounts should also be closely monitored.
Apart from generic UNIX tuning as discussed above in Part 1, there are 2 parameters that need to be watched. The first is the maximum number of processes in the system and the second is the maximum number of processes allowed per user (maxuproc). Since the server forks or spawns subprocesses for each connection, maxuproc may need to be increased substantially.
Administration and Log Analyzer Programs
It is important that the impact on the system of log analyzers not be underestimated. It is suggested that only one analyzer be run and then only once a day at a low period, or that the logs be moved/copied to another system for analysis there.
Backups are another problem area and the times they are scheduled should be monitored very carefully. I/O pacing should be considered as part of a backup strategy to ensure the ability to provide consistent levels of service.
There are also many options that are specific to a particular platform. These become critical when planning the overall design for multiple web servers with load balancing. There are alternatives to NFS such as GPFS (General Parallel File System) and VSD (virtual shared disk). There are also alternatives to round robin DNS such as Cisco Localdirector or Interactive Network Dispatcher. These latter two allow sessions for most ports (not just the web) to be redirected to different servers depending on load on the systems.
As can be seen above there are many different options at both the UNIX level and the web server level that can either impair or enhance performance and throughput. It is important that the system administrator and webmaster evaluate the options mentioned above in order to avoid performance problems. A methodology was also proposed to help in the cases where performance problems need to be diagnosed.
1. System Performance Tuning, Mike Loukides, O'Reilly and Associates
2. SC23-2365, Performance Monitoring and Tuning Guide, IBM
3. UNIX Performance Tuning by Jaqui Lynch, CMG96 and UKCMG97