Tuesday, February 7, 2012

Troubleshooting Paging Issues on an Oracle Cluster- A Dive into the Kernel with SystemTap

This was an interesting problem that took a while to solve, so I decided it's worth a write up. If you have any questions, criticisms, or comments, let me know.

This post was peppered with "Thanks, Jon" so I deleted those and will say it here- Thanks to Jon Miller for steering me toward GNU Global and SystemTap, as well as helping me through each step of this process.


A work request came in from a DBA who ran across the following alert in OEM:

"Virtual memory paging: Host operating system was experiencing significant paging but no particular root cause could be detected. Investigate processes that do not belong to this instance running on the host that are consuming significant amount of virtual memory. Also consider adding more physical memory to the host."

Unfortunately, this was all of the information that he had available.


Not knowing exactly what I was looking for, I ran some quick diagnostics and found:

  • There were quite a few major faults/second on average:
    root@box:PROD:~> sar -B 1 30
    Linux 2.6.18-194.el5 (box) 01/17/2012
    02:17:17 PM pgpgin/s pgpgout/s fault/s majflt/s
    02:17:18 PM 92016.00 4552.00 9369.00 3.00
    02:17:19 PM 98780.20 3897.03 11154.46 2734.65
    02:17:20 PM 122900.00 1109.00 5503.00 131.00

    This is in contrast with other Oracle servers, which have an average majflt/s closer to zero.

  • Running "top f u" shows that the top offenders were Oracle processes:
    17021 oracle 15 0 649m 86m 79m S 32.3 0.1 0:23.74 3144 oracle
    24008 oracle 15 0 1423m 487m 481m S 9.2 0.7 1:55.51 3283 oracle
    6905 oracle 15 0 649m 66m 60m S 8.6 0.1 0:02.14 3098 oracle

    That doesn't really give us a good idea of why the processes are faulting. Looking at how much free buffered/cached RAM there is:

    root@box:PROD:~> free -m
    total used free shared buffers cached
    Mem: 72439 70537 1902 0 198 63027
    -/+ buffers/cache: 7311 65128
    Swap: 16383 824 15559

There are only a few scenarios in which excessive page faulting is normal:

  1. The system is out of buffered/cached physical memory and must page in and out from swap
  2. The system is spawning a lot of processes. Those familiar with the program stack will know that when a process is started, pages that aren't necessary at execution (e.g. the initialized data section) will remain on disk and be paged in as necessary. A system spawning processes constantly would see a lot of paging activity (to get an idea of this, try running "/usr/bin/time -v firefox" and note the number of major page faults on first execution).
  3. Swappiness is set to 100. This tells the kernel to use swap at every opportunity, regardless of the amount of free physical RAM. There are a few reasons to do this, but that is beyond the scope of this document.

The problem I ran into at this point is that I wasn't fully aware of what EXACTLY triggers a major fault. The documentation for major page faults is scattered and not explicit:

root@box:PROD:~> man top
"A page fault occurs when a process attempts to read from or write to a virtual page that is not currently present in its address space. A major page fault is when disk access is involved in making that page available."

That seems straightforward, but it's still a bit vague; for example, does this include mmap()ed files or is it limited to frames paged in and out of swap? After scouring manual pages and the Internet for a definitive answer, I decided it was time to dive into the kernel source.

This is a good time to bring up GNU Global; if you aren't aware of this, it's a source tagging system that indexes a collection of source files to make them easier to search. As of this writing, you can install Global on a Fedora or Red Hat system with:

yum install global-5.7.5-2.fc12.x86_64

Also, grab a copy of the latest kernel source from kernel.org and extract it.

Have Global installed and a fresh copy of the kernel source in front of you? Good. Go ahead and index the source by CDing to the source directory and running "gtags". This may take a few minutes to complete.

After the index has been created, it's time to hone down on events that trigger a page fault. But where to start? If you aren't already familiar with the kernel source tree layout, it can be difficult to navigate. Let's start with something simple:

$> global -rxi '.*major'
huge list of results

That's a huge result list for "major", which isn't unexpected. Let's try refining those:

$> global -rxi '.*major' | grep -i fault
VM_FAULT_MAJOR 78 arch/um/kernel/trap.c if (fault & VM_FAULT_MAJOR)
VM_FAULT_MAJOR 1135 arch/x86/mm/fault.c if (fault & VM_FAULT_MAJOR) {
VM_FAULT_MAJOR 117 arch/xtensa/mm/fault.c if (fault & VM_FAULT_MAJOR)
VM_FAULT_MAJOR 95 fs/ncpfs/mmap.c return VM_FAULT_MAJOR;
VM_FAULT_MAJOR 1545 mm/filemap.c ret = VM_FAULT_MAJOR;
VM_FAULT_MAJOR 1347 mm/memory.c if (ret & VM_FAULT_MAJOR)
VM_FAULT_MAJOR 2552 mm/memory.c ret = VM_FAULT_MAJOR;
VM_FAULT_MAJOR 1270 mm/shmem.c if (type && !(*type & VM_FAULT_MAJOR)) {
VM_FAULT_MAJOR 1272 mm/shmem.c *type |= VM_FAULT_MAJOR;

Much better! The majority of the results reference "VM_FAULT_MAJOR", which is almost certainly what we're looking for. A closer look at some of the results reveals that they're architecture-specific and can safely be discarded, which leaves us with:

VM_FAULT_MAJOR 1545 mm/filemap.c ret = VM_FAULT_MAJOR;
VM_FAULT_MAJOR 1347 mm/memory.c if (ret & VM_FAULT_MAJOR)
VM_FAULT_MAJOR 2552 mm/memory.c ret = VM_FAULT_MAJOR;
VM_FAULT_MAJOR 1270 mm/shmem.c if (type && !(*type & VM_FAULT_MAJOR)) {
VM_FAULT_MAJOR 1272 mm/shmem.c *type |= VM_FAULT_MAJOR;

Only two instances of a return value actually being set to VM_FAULT_MAJOR, one in mm/filemap.c and one in mm/memory.c.

filemap.c: /*
This file handles the generic file mmap semantics used by
most "normal" filesystems (but you don't have to use this:
the NFS filesystem used to do this differently, for example)

The function in which "ret = VM_FAULT_MAJOR" is set (filemap_fault()) describes triggering a major fault when a frame is paged in or out from disk. In memory.c, the function containing "ret = VM_FAULT_MAJOR" is called do_swap_page() and describes triggering a major fault when a frame is paged in or out from swap.

We could dig into this a bit more, but we have our answer- a major fault is triggered under the following circumstances:

  1. A process attempts to access a frame that currently resides in swap
  2. A process attempts to access a frame in an mmap()ed file that currently resides on disk

Armed with this new information, we can get back to tackling the problem at hand.

At this point, it would be really useful to have a tool to monitor system calls system-wide rather than just per-process. Solaris has a native tool called DTrace that provides this functionality, but unfortunately Linux is a bit behind in this field. Luckily, we do have an alternative tool called SystemTap.

SystemTap works by compiling SystemTap (STP) scripts as kernel modules and dynamically loading them into the kernel. After execution, the module is removed. This allows direct access to kernel resources that are generally untouchable from user-land.

To run SystemTap, you'll need to install kernel-debuginfo and kernel-debuginfo-common. On Fedora:

yum -y --enablerepo fedora-debuginfo install kernel-debuginfo
yum -y install kernel-devel
yum -y install systemtap systemtap-runtime

That will vary based on your distro and architecture. On RHEL:

yum -y --enablerepo rhel-debuginfo install kernel-debuginfo
yum -y install kernel-devel
yum -y install systemtap systemtap-runtime

Now, how should we proceed? Why not start by monitoring every major fault on the system?

$> stap /usr/share/doc/systemtap-1.6/examples/memory/pfaults.stp -c "sleep 10" | grep major | tee stap.log

That will gather all major page faults that occur over 10 seconds and dump it to stap.log. Looking at some sample output:


From the SystemTap documentaion:

"Each line contains a timestamp (in microseconds) when the page fault servicing was completed, the pid of the process, the address of the page fault, the type of access (read or write), the type of fault (major or minor), and the elapsed time for page fault."

That's even more information than we need! Let's figure out which processes are faulting the most:

$> cut -d: -f2 stap.log | sort | uniq -c | sort -n | tail -5
4312 7383
4313 7376
4313 7390
4556 7399
4604 6434
$> ps aux | grep 6434
oracle 6434 22.5 0.0 135988 70044 ? S 12:45 0:03 rman home/oracle/xxxxxxxx/xxxx.xxx/bin/rman

Notice that the process was just spawned? You may remember from earlier that newly-spawned processes page frames in only when necessary, triggering major faults. Let's look at all of the processes that are triggering major faults:

$> while : do for i in `stap pfaults.stp -c 'sleep 0.1' | grep major | cut -d: -f2 | uniq`; do ps aux | grep $i | grep -v grep; done done

In my case, almost all of the results were processes that were 1-2 minutes old; the processes vanished shortly thereafter.

All of this tells us what we suspected; Oracle is spawning processes constantly, which is triggering faults at a much higher rate than usual. But we're not quite finished! It would be a waste of effort if we didn't make use of the knowledge that major page faults are triggered by mmap()ed files as well. Pulling from pfaults.stp and a few other example scripts, Jon and I (mainly Jon) cobbled together a SystemTap script that shows us everything we'd want to know about the behavior of processes that are faulting:

#!/usr/bin/env stap
# major_faults.stp
# This script reports the number of exec* calls (not just forks) a particular process
# is making as well as mmap calls as the two primary causes for Major Faults occuring
# Major faults per process are also reported.
global proc_execs, proc_mmap, proc_mfault
probe vm.pagefault.return {
p = pid()
if (vm_fault_contains(fault_type,VM_FAULT_MAJOR)) {
proc_mfault[p] +=1
probe syscall.mmap* {
p = pid()
proc_mmap[p] += 1
probe syscall.exec* {
p = pid()
proc_execs[p] += 1
probe begin { printf("# started %s\n", ctime(gettimeofday_s())); }
probe end, error {
printf("# ended %s\n", ctime(gettimeofday_s()));
foreach(p+ in proc_execs) {
if (proc_execs[p] > 1) {
pm = proc_mmap[p]
if (pm) {
printf("process(%d) exec'ed %d times, faulted %d times, mmap'ed %d times\n", p, proc_execs[p], proc_mfault[p] ,pm);
} else {
printf("process(%d) exec'ed %d times, faulted %d times. Zero mmap calls.\n", p, proc_execs[p], proc_mfault[p]);
# Now need to cover the only mmap callers...
foreach(p+ in proc_mmap) {
pf = proc_execs[p]
if (!pf) {
printf("process(%d) exec'ed 0 times, but called mmap %d times and faulted %d times.\n", p, proc_mmap[p], proc_mfault[p]);

Sample output:

~> sudo stap /var/tmp/major_faults.stp -c "sleep 10"
# started Tue Feb 7 17:07:53 2012
# ended Tue Feb 7 17:08:03 2012
process(16048) exec'ed 2 times, faulted 0 times, mmap'ed 8 times
process(1458) exec'ed 0 times, but called mmap 2 times and faulted 0 times.
process(2432) exec'ed 0 times, but called mmap 4 times and faulted 0 times.
process(2723) exec'ed 0 times, but called mmap 119 times and faulted 0 times.
process(3629) exec'ed 0 times, but called mmap 11 times and faulted 0 times.

That's a surprising number of mmap() calls. So that's gotta be it, right? Oracle is mmap()ing files and paging in data from disk?

Not quite. Look at the number of faults on processes that are calling mmap(). Most of them are not faulting at all, which means that the mmap()ed files are already in physical memory. Those processes must be accessing the same files. Since they're already located in a shared address space, they don't have to page the data in from disk.

So there we are. The excessive faults are not being caused by mmap(), even though Oracle is making extensive use of the mmap() call. The majority of the processes that are faulting are new. We can conclusively say that the faults are being caused by an excessive spawn rate of Oracle processes. At this point, we can defer to the Oracle team; we've gathered the data that we needed to isolate the problem.

There's a lot more that you can do with SystemTap; check out the rest of the scripts in /usr/share/doc/systemtap-1.6/examples, and don't be afraid to dive into the kernel with Globals to seek clarification for a concept. All of us can benefit from a deep familiarity with virtual memory and the kernel in general.