WMD Zone: September 2012

If you're an admin for Linux servers that are going to be doing any real kind of work, you'll need to know how to make sure they're running right. You need to understand how the CPU, memory and disk get utilised by the OS, and to do that you need to know how to use a few essential tools and how to interpret the results.

I'll try and write this so admins coming from a Windows background can understand how Linux works compared to Windows.

CPU

There are three things you need to watch when judging how the system is performing from a processor point of view.

1. CPU utilisation percentage
2. CPU run queue (load)
3. CPU I/O wait

In Windows, you're mostly concerned with CPU utilisation as a single percentage figure with a maximum of 100%. That isn't the whole story, though.

In Linux, if you have 4 cores in total, tools such as top (in its per-process %CPU column) show CPU utilisation as a percentage with a maximum of 400%. That may seem strange to someone used to seeing 100% as the maximum, but it makes sense: the figure is simply the total of every core added together.

The thing to understand about this is that CPU utilisation isn't actually how busy your system is. It's a part of it, but not the whole story. It's simply a representation of how long the CPU was seen as being busy over a time period. If the system looks at a CPU core for 10ms, and that core was busy for 2ms, it will be 20% busy. It will then sample the other 3 cores, and add those to the total. If they were also all busy for 2ms out of that 10, the total CPU utilisation of the system will be 80%, with the maximum being 400%.
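The arithmetic above can be sketched in a few lines of Python. This is only an illustration of the sum; the function name is my own, and real tools take their samples from kernel counters rather than a list like this.

```python
def total_cpu_utilisation(busy_ms_per_core, window_ms):
    """Sum the per-core busy percentages, so N cores give a ceiling of N * 100."""
    if window_ms <= 0:
        raise ValueError("window_ms must be positive")
    return sum(100.0 * busy / window_ms for busy in busy_ms_per_core)


# Four cores, each busy for 2ms of a 10ms window, as in the example above.
cores = [2, 2, 2, 2]
print(total_cpu_utilisation(cores, 10))  # 80.0
print(100 * len(cores))                  # 400, the ceiling for this machine
```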

We have a percentage of how busy the CPU is, so why isn't that the whole story?

Well, if a CPU core is used by 1 process for 2ms out of 10ms, but for those 2ms there are also 5 other processes waiting to jump on that core and do stuff, a utilisation of 20% isn't really accurate, is it? For those 2ms, the system had 5 more processes' worth of work queued up than that core could handle.

When you understand that both the CPU utilisation _and_ the CPU load are factors to be taken in conjunction with each other, you can interpret what the tools tell you.

top

top - 16:49:48 up 14 days, 6:18, 5 users, load average: 2.75, 3.64, 3.87
Tasks: 315 total, 1 running, 314 sleeping, 0 stopped, 0 zombie
Cpu(s): 8.3%us, 1.6%sy, 0.0%ni, 86.4%id, 3.1%wa, 0.1%hi, 0.5%si, 0.0%st
Mem: 98871212k total, 81501412k used, 17369800k free, 50108k buffers
Swap: 9446212k total, 32700k used, 9413512k free, 7281528k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM     TIME+ COMMAND
32031 mysql    -10   0 71.4g  69g 6132 S   97 74.2  13958:45 mysqld
28358 root      20   0 44176  15m 1280 D   63  0.0 210:01.71 mysqlbackup
19749 root      20   0 69624  18m 3160 S    7  0.0   3:02.40 iotop
 6183 root      RT   0  161m  37m  17m S    5  0.0   1188:46 aisexec
 5397 root      39  19     0    0    0 S    1  0.0 241:39.12 kipmi0
 2971 root      15  -5     0    0    0 S    0  0.0  65:52.74 kjournald
    1 root      20   0  1064  392  324 S    0  0.0   0:16.52 init

top is the standard age-old tool for quickly looking at what's going on. The system above has 8 cores, which are hyperthreaded, so I know that it has 16 logical processors available (generally found out from cat /proc/cpuinfo). When I look at the processes, the mysqld process is taking 97%, but that's from a maximum of _around_ 1600%.
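If you'd rather not count the entries in /proc/cpuinfo by eye, something like the following would do it. It's a rough sketch that assumes the usual x86 layout of one "processor : N" line per logical processor; the sample text is made up for illustration and is not from the server above.

```python
def count_logical_processors(cpuinfo_text):
    """Count the 'processor : N' entries in /proc/cpuinfo-style text."""
    count = 0
    for line in cpuinfo_text.splitlines():
        key, _, _ = line.partition(":")
        if key.strip() == "processor":
            count += 1
    return count


# Made-up two-processor excerpt in the usual x86 layout (an assumption,
# not output from the server discussed here).
SAMPLE_CPUINFO = """\
processor\t: 0
model name\t: Example CPU
processor\t: 1
model name\t: Example CPU
"""

print(count_logical_processors(SAMPLE_CPUINFO))  # 2
```

On a live system you could pass in open("/proc/cpuinfo").read() instead of the sample. Python's os.cpu_count() should normally give you the same number without any parsing.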

Then, as I said above, we can also look at the system load, which is represented as load average. In the output above, I can see that the first figure of 2.75 is the average over the last 1 minute, 3.64 over the last 5 minutes and 3.87 over the last 15 minutes.

What do these figures mean?

While the system was sampling how busy each usable CPU core was, it also counted how many processes were running or waiting to run. With 16 logical processors available, an average of around 3 processes were running or queued at any time. One process was using 97% out of a possible 1600%, and another 63%. So a system that looked fairly busy at first glance actually has a lot of room to get busier. Until the load average is consistently approaching the number of logical processors (16 on this system) and the CPU utilisation is getting near the total of 1600%, we don't need to worry.
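To put a number on "room to get busier", you can divide each load average by the number of logical processors. The sketch below pulls the three figures out of the top header shown earlier. The helper names are mine, and "below 1.0 per processor" is only a rule of thumb, not a hard limit.

```python
import re


def parse_load_averages(top_header):
    """Pull the 1, 5 and 15 minute load averages out of top's first line."""
    match = re.search(
        r"load average:\s*([\d.]+),\s*([\d.]+),\s*([\d.]+)", top_header
    )
    if match is None:
        raise ValueError("no load average found in: %r" % top_header)
    return tuple(float(value) for value in match.groups())


def load_per_processor(load, logical_processors):
    """Load average divided by the number of logical processors."""
    return load / logical_processors


HEADER = "top - 16:49:48 up 14 days, 6:18, 5 users, load average: 2.75, 3.64, 3.87"
LOGICAL_PROCESSORS = 16  # 8 hyperthreaded cores, as on the server above

for minutes, load in zip((1, 5, 15), parse_load_averages(HEADER)):
    ratio = load_per_processor(load, LOGICAL_PROCESSORS)
    print("%2d min: load %.2f, %.2f per logical processor" % (minutes, load, ratio))
```

On a live box, os.getloadavg() returns the same three figures directly, so the regex is only needed when you're working from captured top output.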

The following is a Munin graph of the same system. We can see that the maximum is 1600 (every logical processor fully idle), and our usage is nowhere near it.

And this graph shows the load average:

Again, it backs up what we saw from top. We don't have to worry about the load on this system, and we know this by combining the utilisation and load average to see what's really going on.

But what about IO?

A third variable comes into the mix and complicates things a little further: IO wait. If a process is running on a CPU core, but you have a slow IO subsystem (e.g. a slow disk, or a saturated fibre channel host bus adapter), the process can be stuck waiting for an IO request to complete. That waiting shows up as IO wait in the CPU figures, and because Linux counts processes blocked on IO towards the load average, it pushes the load average up too.

If you're seeing high CPU usage and need to find out why, you can see if it's IO wait by using vmstat.

These figures are from a web server. You can see that the io column has no blocks in and a few blocks out now and again. The blocks out are likely to be log files being written, and as it's a web server, everything is already in memory and doesn't need to be read in. No IO issues here.

procs -----------memory---------- ---swap-- -----io---- -system-- -----cpu------
 r  b   swpd   free   buff   cache   si   so    bi    bo    in    cs us sy id wa st
 3  0    100 266688 302404 5135804    0    0     0     0 17822 24008 15  1 84  0  0
15  0    100 266532 302404 5135820    0    0     0   124 16510 24104 12  1 87  0  0
 0  0    100 265504 302404 5135848    0    0     0     0 18332 24488 17  2 82  0  0
 4  0    100 264312 302404 5135852    0    0     0     0 16986 23787 14  2 84  0  0
 6  0    100 265476 302404 5135864    0    0     0   344 16711 23948 15  1 83  1  0

This one is from a database server. You can see that the blocks in and blocks out (1 block is 1KB) are a lot larger. I ran this as vmstat 1, so it reports every second, which means the server was reading 30-50MB/s and writing 10-20MB/s.

procs -----------memory---------- ---swap-- -----io---- -system-- -----cpu------
 r  b   swpd   free   buff  cache   si   so    bi    bo    in    cs us sy id wa st
 3  0     33    879     57  24340    0    0 36108 15304 27169 42413 10  3 83  4  0
 2  1     33    852     57  24384    0    0 36576 15762 26833 40486  9  3 85  4  0
 2  0     33    780     57  24439    0    0 47296  9735 21587 33633  8  2 85  4  0
 0  1     33    721     57  24499    0    0 49496 19881 22993 36320  8  3 86  4  0
 4  0     33    683     57  24547    0    0 42176 13993 23573 36176  8  2 87  2  0
 5  2     33    632     57  24595    0    0 38748 10611 26785 41753 11  3 76 10  0
 4  0     33    584     57  24636    0    0 37636 12618 23149 36298 14  2 80  4  0
 6  0     33    551     57  24685    0    0 34060 13504 25268 39642 14  2 79  5  0
 3  0     33    481     57  24739    0    0 50360 10973 24150 37552 13  2 82  3  0

That's a lot of throughput. Is it affecting the CPU by making it wait on IO? Well, the 'wa' column under 'cpu' is a percentage out of 100%, and its mostly single-digit figures are small next to the 'id' (idle) column, so the CPU isn't waiting on IO for very long at all. This server is heavily utilised for IO, but because the IO subsystem is decent, that isn't hurting CPU utilisation or system load.
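If you want to turn vmstat rows into throughput and IO wait figures without squinting at columns, a small parser along these lines would do. It's a sketch that assumes the 17-column layout shown above, a 1 second interval and 1KB blocks. The three sample rows are copied from the database server output; the names are my own.

```python
VMSTAT_COLUMNS = "r b swpd free buff cache si so bi bo in cs us sy id wa st".split()


def parse_vmstat_row(line):
    """Turn one vmstat data row into a dict keyed by column name."""
    values = [int(field) for field in line.split()]
    if len(values) != len(VMSTAT_COLUMNS):
        raise ValueError("expected %d columns, got %d" % (len(VMSTAT_COLUMNS), len(values)))
    return dict(zip(VMSTAT_COLUMNS, values))


def kb_to_mb(kilobytes):
    """Convert 1KB blocks to MB (1024-based)."""
    return kilobytes / 1024


# Three rows copied from the database server output above.
SAMPLE_ROWS = """\
3 0 33 879 57 24340 0 0 36108 15304 27169 42413 10 3 83 4 0
5 2 33 632 57 24595 0 0 38748 10611 26785 41753 11 3 76 10 0
3 0 33 481 57 24739 0 0 50360 10973 24150 37552 13 2 82 3 0
"""

rows = [parse_vmstat_row(line) for line in SAMPLE_ROWS.splitlines()]
for row in rows:
    print("read %.1f MB/s, write %.1f MB/s, wa %d%%, id %d%%"
          % (kb_to_mb(row["bi"]), kb_to_mb(row["bo"]), row["wa"], row["id"]))
```

Even in the worst of these samples the CPU spent 10% of its time waiting on IO and 76% idle, which is the comparison that matters here.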

IO is a bit easier to see with iostat, which shows how busy each device in your IO subsystem is in the %util column.

# iostat -x -d 1

Linux 2.6.27.29-0.1-default (xxxxxxx) 09/19/12 _x86_64_

Device:  rrqm/s  wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz  await  svctm  %util
sda        0.00    0.00    0.00    0.00     0.00     0.00     0.00     0.00   0.00   0.00   0.00
sda1       0.00    0.00    0.00    0.00     0.00     0.00     0.00     0.00   0.00   0.00   0.00
sda2       0.00    0.00    0.00    0.00     0.00     0.00     0.00     0.00   0.00   0.00   0.00
sdb        0.00    0.00  349.00   71.00 92320.00 21136.00   270.13     0.99   2.35   1.04  43.60
sdc        0.00    0.00    0.00    0.00     0.00     0.00     0.00     0.00   0.00   0.00   0.00
dm-0     145.00 1795.00  349.00   71.00 92320.00 21136.00   270.13     1.15   2.77   1.07  44.80
dm-1       0.00    0.00  493.00 1857.00 91976.00 14856.00    45.46    30.01  13.64   0.19  44.40
sdd        0.00    0.00    0.00    0.00     0.00     0.00     0.00     0.00   0.00   0.00   0.00
sde        0.00    0.00    0.00   14.00     0.00  9490.00   677.86     0.20  14.57   2.00   2.80
dm-2       0.00    0.00    0.00   14.00     0.00  9490.00   677.86     0.21  15.14   2.57   3.60
dm-3       0.00    0.00    0.00   14.00     0.00  9490.00   677.86     0.21  15.14   2.57   3.60
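The rsec/s and wsec/s columns are in sectors rather than bytes, so they take a little arithmetic to compare with the vmstat figures. Here is a quick sketch of the conversion, assuming the 512-byte sectors that iostat reports in. The function name is mine, and the numbers are the sdb row from the output above.

```python
SECTOR_BYTES = 512  # assumption: iostat's rsec/s and wsec/s use 512-byte sectors


def sectors_to_mb_per_s(sectors_per_s):
    """Convert a sectors-per-second figure to MB/s (1024-based)."""
    return sectors_per_s * SECTOR_BYTES / (1024 * 1024)


# rsec/s and wsec/s for sdb in the iostat output above.
sdb_read = sectors_to_mb_per_s(92320.00)
sdb_write = sectors_to_mb_per_s(21136.00)
print("sdb: read %.1f MB/s, write %.1f MB/s" % (sdb_read, sdb_write))
```

That works out at roughly 45MB/s read and 10MB/s written on sdb, at a little under 44% utilisation of the device.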

Even easier to use is iotop, which shows IO per process in a top-style interface.

Finally, on to memory

Memory is really misunderstood in Linux. Unused memory is wasted memory, so the kernel puts as much of it to work as it can. Some people see the below and panic.

# free -m
             total       used       free     shared    buffers     cached
Mem:         96553      93561       2992          0         73      21044
-/+ buffers/cache:      72443      24110
Swap:         9224         31       9192

Going by the Mem: row, that's 93GB used out of 96GB of installed RAM in the server.

Wrong. The Linux kernel grabs as much memory as it can, leaving only a small amount unused and then dishes it out to applications which request it. Anything which isn't requested by an application is then utilised for buffers and caches, including the IO buffer. Read the values in the -/+ buffers/cache line. 24GB is free, and 72GB is used by applications. That's obviously still a lot, but this is a database server, and we want to give the database engine as much memory to cache stuff as possible.
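The -/+ buffers/cache row is just the Mem: row with buffers and cache moved from "used" to "free". A minimal sketch of that sum, with a function name of my own choosing and the figures from the free -m output above:

```python
def apps_view_of_memory(used, free, buffers, cached):
    """Return (used_by_apps, free_for_apps), treating buffers and cache as free."""
    reclaimable = buffers + cached
    return used - reclaimable, free + reclaimable


# Mem: row from the database server above, in MB.
used_by_apps, free_for_apps = apps_view_of_memory(
    used=93561, free=2992, buffers=73, cached=21044
)
print(used_by_apps, free_for_apps)  # 72444 24109
```

The result is a megabyte out from the 72443 and 24110 that free printed, most likely because free does the sum on the underlying figures and only then rounds to MB.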

Here's another one from a slightly more modest server:

# free -m
             total       used       free     shared    buffers     cached
Mem:           463        397         65          0        134        136
-/+ buffers/cache:        126        336
Swap:          475         10        465

463MB of RAM and only 65MB free?! Nope, 336MB is free, as the kernel hasn't needed to dish it out and has allocated it to buffers and caches.