This was an interesting problem that took a while to solve, so I decided it's worth a write up. If you have any questions, criticisms, or comments, let me know.
This post was peppered with "Thanks, Jon," so I deleted those and will say it here: thanks to Jon Miller for steering me toward GNU Global and SystemTap, and for helping me through each step of this process.
A work request came in from a DBA who ran across the following alert in OEM:
"Virtual memory paging: Host operating system was experiencing significant paging but no particular root cause could be detected. Investigate processes that do not belong to this instance running on the host that are consuming significant amount of virtual memory. Also consider adding more physical memory to the host."
Unfortunately, this was all of the information that he had available.
Not knowing exactly what I was looking for, I ran some quick diagnostics and found:
There were quite a few major faults/second on average:
root@box:PROD:~> sar -B 1 30
Linux 2.6.18-194.el5 (box) 01/17/2012
02:17:17 PM pgpgin/s pgpgout/s fault/s majflt/s
This is in contrast with other Oracle servers, which have an average majflt/s closer to zero.
Running top and enabling the page-fault column (press "f", then "u") shows that the top offenders were Oracle processes:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ nFLT COMMAND
17021 oracle 15 0 649m 86m 79m S 32.3 0.1 0:23.74 3144 oracle
24008 oracle 15 0 1423m 487m 481m S 9.2 0.7 1:55.51 3283 oracle
6905 oracle 15 0 649m 66m 60m S 8.6 0.1 0:02.14 3098 oracle
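If you'd rather not drive top interactively, ps from procps can print and sort on the same counter (note that maj_flt is a cumulative count, not a per-second rate); a quick sketch:

```shell
# List the processes with the largest accumulated major-fault counts.
ps -eo maj_flt,pid,comm --sort=-maj_flt | head
```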
That doesn't tell us why the processes are faulting, though. Let's look at how much free and buffered/cached RAM there is:
root@box:PROD:~> free -m
total used free shared buffers cached
-/+ buffers/cache: 7311 65128
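The same picture is available straight from /proc/meminfo; the sketch below totals the free, buffered, and cached fields (field names per proc(5)) to estimate how much memory could be reclaimed before the system would need to swap:

```shell
# Sum MemFree + Buffers + Cached (all reported in kB) from /proc/meminfo.
reclaimable=$(awk '/^(MemFree|Buffers|Cached):/ {sum += $2} END {print sum}' /proc/meminfo)
echo "${reclaimable} kB free or easily reclaimable"
```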
There are only a few scenarios in which excessive page faulting is normal:
The system is out of buffered/cached physical memory and must page in and out from swap
The system is spawning a lot of processes. Those familiar with the program stack will know that when a process is started, pages that aren't necessary at execution (e.g. the initialized data section) will remain on disk and be paged in as necessary. A system spawning processes constantly would see a lot of paging activity (to get an idea of this, try running "/usr/bin/time -v firefox" and note the number of major page faults on first execution).
Swappiness is set to 100. This tells the kernel to use swap at every opportunity, regardless of the amount of free physical RAM. There are a few reasons to do this, but that is beyond the scope of this document.
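The third scenario is easy to rule out; the current value is readable from procfs (sysctl vm.swappiness shows the same thing):

```shell
# 0 means avoid swap as long as possible, 100 means swap aggressively;
# 60 is the usual default on RHEL-era kernels.
swappiness=$(cat /proc/sys/vm/swappiness)
echo "vm.swappiness = ${swappiness}"
```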
The problem I ran into at this point is that I wasn't fully aware of what EXACTLY triggers a major fault. The documentation for major page faults is scattered and not explicit:
root@box:PROD:~> man top
"A page fault occurs when a process attempts to read from or write to a virtual page that is not currently present in its address space. A major page fault is when disk access is involved in making that page available."
That seems straightforward, but it's still a bit vague; for example, does this include mmap()ed files or is it limited to frames paged in and out of swap? After scouring manual pages and the Internet for a definitive answer, I decided it was time to dive into the kernel source.
This is a good time to bring up GNU Global; if you aren't familiar with it, it's a source tagging system that indexes a collection of source files to make them easier to search. As of this writing, you can install Global on a Fedora or Red Hat system with:
yum install global-5.7.5-2.fc12.x86_64
Also, grab a copy of the latest kernel source from kernel.org and extract it.
Have Global installed and a fresh copy of the kernel source in front of you? Good. Go ahead and index the source by CDing to the source directory and running "gtags". This may take a few minutes to complete.
After the index has been created, it's time to home in on the events that trigger a page fault. But where do we start? If you aren't already familiar with the kernel source tree layout, it can be difficult to navigate. Let's start with something simple:
$> global -rxi '.*major'
huge list of results
That's a huge result list for "major", which isn't unexpected. Let's try refining those:
Much better! The majority of the results reference "VM_FAULT_MAJOR", which is almost certainly what we're looking for. A closer look at some of the results reveals that they're architecture-specific and can safely be discarded, which leaves us with:
VM_FAULT_MAJOR 1545 mm/filemap.c ret = VM_FAULT_MAJOR;
Only two instances of a return value actually being set to VM_FAULT_MAJOR, one in mm/filemap.c and one in mm/memory.c.
From the comment at the top of mm/filemap.c:
"This file handles the generic file mmap semantics used by
most "normal" filesystems (but you don't have to use this:
the NFS filesystem used to do this differently, for example)"
In filemap.c, the function that sets "ret = VM_FAULT_MAJOR" (filemap_fault()) flags a major fault when a page of a memory-mapped file has to be read in from disk. In memory.c, the function containing "ret = VM_FAULT_MAJOR" is do_swap_page(), which flags a major fault when a page has to be read back in from swap.
We could dig into this a bit more, but we have our answer- a major fault is triggered under the following circumstances:
A process attempts to access a frame that currently resides in swap
A process attempts to access a frame in an mmap()ed file that currently resides on disk
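You can also read these counters for any process without top: fields 10 and 12 of /proc/&lt;pid&gt;/stat are the accumulated minor and major fault counts (see proc(5)). A minimal sketch, reading awk's own entry via /proc/self:

```shell
# Field 10 = minflt, field 12 = majflt (cumulative since process start).
minflt=$(awk '{print $10}' /proc/self/stat)
majflt=$(awk '{print $12}' /proc/self/stat)
echo "minflt=${minflt} majflt=${majflt}"
```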
Armed with this new information, we can get back to tackling the problem at hand.
At this point, it would be really useful to have a tool to monitor system calls system-wide rather than just per-process. Solaris has a native tool called DTrace that provides this functionality, but unfortunately Linux is a bit behind in this field. Luckily, we do have an alternative tool called SystemTap.
SystemTap works by compiling SystemTap (STP) scripts as kernel modules and dynamically loading them into the kernel. After execution, the module is removed. This allows direct access to kernel resources that are generally untouchable from user-land.
To run SystemTap, you'll need to install the kernel-debuginfo and kernel-debuginfo-common packages that match your running kernel. On Fedora:
yum install kernel-debuginfo kernel-debuginfo-common
Now, how should we proceed? Why not start by monitoring every major fault on the system?
$> stap /usr/share/doc/systemtap-1.6/examples/memory/pfaults.stp -c "sleep 10" | grep major | tee stap.log
That will gather all major page faults that occur over 10 seconds and dump it to stap.log. Looking at some sample output:
From the SystemTap documentation:
"Each line contains a timestamp (in microseconds) when the page fault servicing was completed, the pid of the process, the address of the page fault, the type of access (read or write), the type of fault (major or minor), and the elapsed time for page fault."
That's even more information than we need! Let's figure out which processes are faulting the most:
oracle 6434 22.5 0.0 135988 70044 ? S 12:45 0:03 rman home/oracle/xxxxxxxx/xxxx.xxx/bin/rman
Notice that the process was just spawned? You may remember from earlier that newly-spawned processes page frames in only when necessary, triggering major faults. Let's look at all of the processes that are triggering major faults:
$> while :; do for i in `stap pfaults.stp -c 'sleep 0.1' | grep major | cut -d: -f2 | uniq`; do ps aux | grep $i | grep -v grep; done; done
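Once a capture is sitting in stap.log, ranking the offenders is a one-liner. The sketch below fakes a few log lines in the colon-separated format the SystemTap docs describe (the PIDs and values are made up):

```shell
# Fake capture: timestamp:pid:address:access:fault-type:elapsed-us.
printf '%s\n' \
  '1326820000001:6434:0x7f3a1000:r:major:1200' \
  '1326820000002:6434:0x7f3a2000:r:major:900' \
  '1326820000003:7010:0x00401000:w:major:700' > stap.log

# Count major faults per PID, busiest first.
cut -d: -f2 stap.log | sort | uniq -c | sort -rn
```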
In my case, almost all of the results were processes that were 1-2 minutes old; the processes vanished shortly thereafter.
All of this tells us what we suspected; Oracle is spawning processes constantly, which is triggering faults at a much higher rate than usual. But we're not quite finished! It would be a waste of effort if we didn't make use of the knowledge that major page faults are triggered by mmap()ed files as well. Pulling from pfaults.stp and a few other example scripts, Jon and I (mainly Jon) cobbled together a SystemTap script that shows us everything we'd want to know about the behavior of processes that are faulting:
# This script reports the number of exec* calls (not just forks) a particular process
# is making, as well as mmap calls, since these are the two primary causes of major faults.
process(1458) exec'ed 0 times, but called mmap 2 times and faulted 0 times.
process(2432) exec'ed 0 times, but called mmap 4 times and faulted 0 times.
process(2723) exec'ed 0 times, but called mmap 119 times and faulted 0 times.
process(3629) exec'ed 0 times, but called mmap 11 times and faulted 0 times.
That's a surprising number of mmap() calls. So that's gotta be it, right? Oracle is mmap()ing files and paging in data from disk?
Not quite. Look at the number of faults on the processes that are calling mmap(): most of them aren't faulting at all, which means the mmap()ed files are already in physical memory. Those processes are mapping the same files, and since the data is already resident in the shared page cache, it never has to be paged in from disk.
So there we are. The excessive faults are not being caused by mmap(), even though Oracle is making extensive use of the mmap() call. The majority of the processes that are faulting are new. We can conclusively say that the faults are being caused by an excessive spawn rate of Oracle processes. At this point, we can defer to the Oracle team; we've gathered the data that we needed to isolate the problem.
There's a lot more that you can do with SystemTap; check out the rest of the scripts in /usr/share/doc/systemtap-1.6/examples, and don't be afraid to dive into the kernel with Global to clarify a concept. All of us can benefit from a deep familiarity with virtual memory and the kernel in general.