UNIX Support: Analyzing and Improving System Loads

Do you have a UNIX computer that is unusually slow, and you do not know why? Let me introduce a few UNIX commands that can help.

The first UNIX command is the system activity reporter, or "sar". It reports the history of a UNIX computer's load for several system activities. Let's take a look at this example:

 $ sar
 HP-UX machinename B.11.11 U 9000/800    08/13/07
 00:00:00    %usr    %sys    %wio   %idle
 00:20:00       4       2       0      64
 00:40:00       4       2       0      90
 01:00:00       5       3       0      92
 01:20:00       6       4       1      88
 01:40:00       6       3       2      85
 02:00:00       5       3       3      80
 02:20:00       7       3       2      87
 02:40:00       6       3       2      80
 03:00:00       5       3       2      80
 03:20:00       5       3       3      81
 03:40:00       6       3       3      78
 04:00:00       8       3       6      72
 04:20:00       6       3       5      74
 Average        6       3       1      81

The first row gives basic information about the machine: the OS, machine name, OS version, and so on. The interesting data is in the columns that follow. The first column is the timestamp at which the data was collected. In this example, that is every 20 minutes; the interval depends on how data collection is scheduled. The second column (%usr) is the percentage of CPU time spent running user-mode code, which is the load that user processes are putting on the system. The third column (%sys) is the percentage of CPU time spent in system (kernel) mode, for example servicing system calls on behalf of processes. The fourth column (%wio) is the percentage of time the CPU was idle while processes waited for hardware reads or writes (input and output, also known as I/O). The final column (%idle) is the percentage of time the system was otherwise idle; in other words, how often it had no work to do.

If the idle percentage stays below 20% for long periods of time, the server's users will notice slower-than-normal performance from the machine. They will then contact the system administrator to say that something is wrong, and running the "sar" UNIX command is a good first step in finding out why the server is slow.

If the idle percentage is low, and there is therefore a server issue, look at the other columns for more information. Next, check the last two rows of data. Use the Average row to determine whether the problem is a recurring issue that needs a long-term solution or just a one-time occurrence. You may need to look at the sar history of several days to determine this. Use the second-to-last row, which is the most recent sample, to determine whether the machine is performing well right now. If the most recent idle percentage is high, then nothing presently running is using a lot of system resources.
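
The 20% idle check described above can be automated. The following Python sketch assumes sar output in the same five-column layout as the example; the helper names parse_sar and low_idle are hypothetical, and the sample text is copied from a few rows of the example output.

```python
# A few rows copied from the example sar output above.
SAR_SAMPLE = """\
HP-UX machinename B.11.11 U 9000/800    08/13/07
00:00:00    %usr    %sys    %wio   %idle
00:20:00       4       2       0      64
00:40:00       4       2       0      90
04:00:00       8       3       6      72
04:20:00       6       3       5      74
Average        6       3       1      81
"""


def parse_sar(text):
    """Return (timestamp, usr, sys, wio, idle) tuples for each sample row.

    The banner, the column header, and the Average row are skipped.
    """
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Sample rows have five fields and start with an HH:MM:SS timestamp.
        if len(fields) != 5 or fields[0].count(":") != 2:
            continue
        try:
            usr, sys_, wio, idle = (float(f) for f in fields[1:])
        except ValueError:
            continue  # the column header row ("%usr", "%sys", ...)
        rows.append((fields[0], usr, sys_, wio, idle))
    return rows


def low_idle(rows, threshold=20.0):
    """Timestamps of samples whose idle percentage is below the threshold."""
    return [ts for ts, _usr, _sys, _wio, idle in rows if idle < threshold]


if __name__ == "__main__":
    samples = parse_sar(SAR_SAMPLE)
    print("samples below 20% idle:", low_idle(samples))
```

In practice the text would come from saved sar output rather than a string in the script, and the threshold could be tuned to what is normal for the machine.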

If the system or user percentages are high, then there is probably a rogue process (or several) eating up a lot of system resources. To list the processes running on the machine, sorted with the highest load first, run the "top" UNIX command (exit this tool by pressing Control-C). Here is an example of this command:

 $ top
 Load averages: 0.06, 0.06, 0.07
 845 processes: 813 sleeping, 22 running, 6 stopped, 6 zombies
 Cpu states:
  0    0.06   0.0%   0.0%   0.0% 100.0%   0.0%   0.0%   0.0%   0.0%
  1    0.06   0.0%   0.0%   1.0%  99.0%   0.0%   0.0%   0.0%   0.0%
  2    0.04   1.0%   0.0%   0.0%  98.0%   0.0%   0.0%   0.0%   0.0%
  3    0.06   0.0%   0.0%   0.0% 100.0%   0.0%   0.0%   0.0%   0.0%
 ---   ----  -----  -----  -----  -----  -----  -----  -----  -----
 avg   0.06   1.0%   0.0%   0.0%  99.0%   0.0%   0.0%   0.0%   0.0%
 Memory: 1093116K (283720K) real, 1320320K (332308K) virtual, 10717400K free  Page# 1/30
  3 pts/th    18029 jim   154 20 39732K 18844K sleep    3:24  0.69  0.69 perl
  3   ?          39 root     152 20  1888K  1888K run      2:36  0.65 0.65 vxfsd
  3   ?         752 root     154 20  4180K  1756K sleep   34:04  0.64 0.64 autom
  0   ?        9074 john   154 20 16812K  8892K sleep    2:09  0.63 0.63 Xvnc
  1 pts/11    28419 phil   178 20    32K    32K zomb     0:00  0.42 0.42 perl

The upper part of the output shows the status of each CPU and the averages across them, along with the number of zombie processes and other useful information. A zombie is a process that has already exited but whose parent has not yet collected its exit status. It uses no CPU or memory beyond a slot in the process table, and it cannot be killed directly; it disappears once the parent collects the status or the parent itself exits. A large number of zombies usually points to a misbehaving parent process. The next section shows how much memory the computer has and is using, including virtual memory. This can indicate whether the system needs more RAM installed, especially if there is a lot of thrashing between RAM and the disk drive. Finally, the bottom section lists the processes, with the most load-intensive first. Note that this tool refreshes its data frequently, so look for processes that are consistently high and that can be killed without any bad effects. If you do not know how to kill a UNIX process, please speak with a system administrator. Be careful not to terminate good processes that normally have a high load. If no particular process or set of processes is running amok, then it is probably time to upgrade your RAM or the entire machine.
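
Because top refreshes constantly, one way to find processes that are consistently high is to save several snapshots and keep only the processes that are busy in all of them. The Python sketch below does that. It assumes the process rows have the 13-field layout seen in the example (CPU, TTY, PID, user, priority, nice, size, resident size, state, time, two CPU percentages, command); the header row is not shown in the example, so this layout is an assumption. The names parse_top and consistently_busy and the 0.5 cutoff are hypothetical. Rows in the zomb state are skipped, since a zombie has already exited and cannot be terminated again.

```python
# Process rows copied from the example top output above.
TOP_SNAPSHOT = """\
 3 pts/th    18029 jim   154 20 39732K 18844K sleep    3:24  0.69  0.69 perl
 3   ?          39 root     152 20  1888K  1888K run      2:36  0.65 0.65 vxfsd
 3   ?         752 root     154 20  4180K  1756K sleep   34:04  0.64 0.64 autom
 0   ?        9074 john   154 20 16812K  8892K sleep    2:09  0.63 0.63 Xvnc
 1 pts/11    28419 phil   178 20    32K    32K zomb     0:00  0.42 0.42 perl
"""


def parse_top(text):
    """Parse process rows into dicts; any line that is not a process row is skipped."""
    procs = []
    for line in text.splitlines():
        f = line.split()
        # Assumed layout: 13 whitespace-separated fields with the PID third.
        if len(f) != 13 or not f[2].isdigit():
            continue
        procs.append({
            "pid": int(f[2]),
            "user": f[3],
            "state": f[8],
            "cpu": float(f[11]),
            "command": f[12],
        })
    return procs


def consistently_busy(snapshots, min_cpu=0.5):
    """PIDs at or above min_cpu percent in every snapshot, zombies excluded."""
    busy = None
    for text in snapshots:
        pids = {
            p["pid"]
            for p in parse_top(text)
            if p["state"] != "zomb" and p["cpu"] >= min_cpu
        }
        busy = pids if busy is None else busy & pids
    return busy or set()


if __name__ == "__main__":
    # A real run would pass several different snapshots taken over time.
    print(sorted(consistently_busy([TOP_SNAPSHOT, TOP_SNAPSHOT])))
```

The result is only a list of candidates to investigate; whether any of them can safely be killed still needs a human decision.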

Finally, if the I/O percentages are high, the hardware is doing so many reads and writes that application processes are consistently waiting for it to finish. This is a common issue with high-traffic websites, where the high rate of webpage reads taxes the capability of the hardware. Once you have determined that the I/O load is high, you must find out which drive(s) carry the load by running the "iostat" UNIX command. Here is an example of this command:

 $ iostat
 device    bps     sps    msps  
 c2t0d0      0     0.0     1.0  
 c2t1d0      0     0.0     1.0  
 c4t1d1      0     0.0     1.0  
 c6t1d1      0     0.0     1.0  
 c8t2d0      0     0.0     1.0  
 c6t2d0      0     0.0     1.0  
 c4t2d1      0     0.0     1.0  

This command lists each device by its internal name, followed by the kilobytes transferred per second (bps), the number of seeks per second (sps), and the milliseconds per average seek (msps). Look for unusually high numbers for any particular device. To know what is unusual, keep historical records of this data so that you can spot values that are higher than normal. If the computer transfers a lot of data, such as downloading and uploading files, then closely monitor the transfer rate (bps). If the computer reads many small pieces of information scattered throughout different databases, then look at the other two columns. Either way, this will quickly reveal any overloaded drives that need upgrading or whose load should be moved to less busy drives.
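
Comparing current iostat numbers against saved history can also be scripted. The Python sketch below assumes the four-column layout shown in the example; the names parse_iostat and busy_devices are hypothetical, and flagging a device when its bps exceeds twice its historical average is an arbitrary rule chosen for illustration.

```python
from statistics import mean

# Two rows copied from the example iostat output above.
IOSTAT_SAMPLE = """\
device    bps     sps    msps
c2t0d0      0     0.0     1.0
c2t1d0      0     0.0     1.0
"""


def parse_iostat(text):
    """Map each device name to its (bps, sps, msps) values."""
    stats = {}
    for line in text.splitlines():
        f = line.split()
        if len(f) != 4 or f[0] == "device":
            continue  # skip the header and anything that is not a device row
        try:
            stats[f[0]] = (float(f[1]), float(f[2]), float(f[3]))
        except ValueError:
            continue
    return stats


def busy_devices(history, current, factor=2.0):
    """Devices whose current bps exceeds factor times their historical mean.

    history is a list of earlier parse_iostat() results and current is the
    newest one. Devices with no history are not flagged.
    """
    flagged = []
    for dev, (bps, _sps, _msps) in current.items():
        past = [sample[dev][0] for sample in history if dev in sample]
        if past and bps > factor * mean(past):
            flagged.append(dev)
    return sorted(flagged)


if __name__ == "__main__":
    baseline = parse_iostat(IOSTAT_SAMPLE)
    print(busy_devices([baseline], baseline))
```

The same comparison could be applied to the sps or msps columns for machines whose workload is dominated by many small reads.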

That pretty much summarizes this introduction to monitoring, analyzing, and improving UNIX system loads. Don't forget to read the manual pages for these commands to see additional options: "man sar", "man top", and "man iostat".

Otherwise, please feel free to contact me for free advice.

