Tuesday, January 15, 2013

Live Partition Table Updates in RHEL 6

A typical process for us is sending a request to the ESX team to expand a disk.  Once the expansion is done, we rescan the SCSI bus, create a new LVM-type partition, probe the partition table with partprobe to update /proc/partitions, and add the new partition as an LVM physical volume.
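On RHEL 5, the whole sequence looks something like this (the SCSI address, device names, partition number, and volume group are placeholders for illustration, not our actual layout):

# Rescan the expanded device so the kernel picks up the new size
echo 1 > /sys/class/scsi_device/0:0:1:0/device/rescan
# Create the new partition (type 8e, Linux LVM) with fdisk /dev/sdb, then:
partprobe /dev/sdb             # push the updated table into /proc/partitions
pvcreate /dev/sdb2             # register the new partition as an LVM physical volume
vgextend vg_data /dev/sdb2     # grow the volume group onto it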

This was ideal for us on our RHEL 5 guests, since it allowed our ESX team to manage one or two large disks per guest rather than dozens of small ones.

Unfortunately, if a block device has partitions mounted, the ability to probe it for new partitions with partprobe is no longer present on RHEL 6.  There's a great post summing up the reasons for that here, along with a workaround using the partx command.  However, the post suggests, perhaps wisely, that partx is a dangerous workaround and that the safest option is a reboot.

I'm prepared to bite the bullet and suggest that the ESX team start provisioning a new disk for every storage expansion request, but the fact remains that we had a production guest that already had a disk expanded and we wanted to avoid a reboot if possible.

It turns out that partx -a /dev/sdb and partprobe /dev/sdb end up making exactly the same kernel call.  From libparted/disk.c:

/**
 * Tell the operating system kernel about the partition table layout
 * of \p disk.
 *
 * This is rather loosely defined: for example, on old versions of Linux,
 * it simply calls the BLKRRPART ioctl, which tells the kernel to
 * reread the partition table. On newer versions (2.4.x), it will
 * use the new blkpg interface to tell Linux where each partition
 * starts/ends, etc. In this case, Linux does not need to have support for
 * a specific type of partition table.
 *
 * \return 0 on failure, 1 otherwise.
 */

int
ped_disk_commit_to_os (PedDisk* disk)

So it looks like it makes an ioctl() call with BLKRRPART for pre-2.4 kernels and BLKPG on more recent kernels.
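If you want to verify that on your own system, strace makes the ioctl traffic easy to see (a quick sanity check, not part of the workflow itself):

# On a 2.6 kernel this should show BLKPG ioctls, one per partition;
# on very old kernels you'd see a single BLKRRPART instead.
strace -e trace=ioctl partprobe /dev/sdb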

Let's see what partx does, from util-linux-ng/addpart.c:partx_add_partition():

return ioctl(fd, BLKPG, &a);

The partx command makes the same ioctl() call.  Dangerous?  Possibly, but since it's doing the same thing we've done on RHEL 5 for ages, I wasn't particularly worried about it.  I ran partx -a /dev/sdb and voila!  The kernel updated its partition table and I was able to add the new partition as an LVM physical volume.
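For the record, the full sequence was something like this (the partition number and volume group name are placeholders):

partx -a /dev/sdb              # add any new partitions from the on-disk table
grep sdb /proc/partitions      # the new partition should now be listed
pvcreate /dev/sdb2             # then carry on with LVM as usual
vgextend vg_data /dev/sdb2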

In the future, we'll be adding new disks instead of expanding old ones, but if you find yourself in the same position of having to add a new partition on a live disk, hopefully this will help.

Thursday, January 3, 2013

Errybody's Talkin' 'Bout Notifications

DMNotify- The Latest Iteration of Irssi Notifications

It seems like everybody has their own notification solution for IRC.  I've gone through a few over the years, including GMNotify, hodgepodges of Perl and Python scripts unprofessionally cobbled together to leverage Growl notifications, and no notifications at all!

DMNotify is the latest solution, and the best thus far.  It uses the Twitter API via Net::Twitter to send DMs on nick highlights and private messages, and it lets you respond to the last person who notified you.  The full README at the GitHub link provides a bit more information.

Enjoy!

Wednesday, December 5, 2012

Clearing Phantom Paths on a Server with PowerPath Installed


Overview

Problem: Extra paths are visible in powermt display dev=all output. We are still working on determining the root cause. What seems to happen is:
1) Initially, the number of paths is correct. Assuming that each LUN should be using four paths, you can determine the correct number as follows:
echo $(( $(powermt display dev=all | grep emcpower | wc -l) * 4 ))

And the current number:
powermt display dev=all | grep qla | wc -l

The current number of SCSI devices:
lsscsi | wc -l

The number of paths the HBA itself reports (subtracting 4 because the last four devices are LUNZ/VRAID rather than Symmetrix or Clariion):
echo $(( $(scli -l 0 | grep ^LUN | wc -l) - 4 + $(scli -l 1 | grep ^LUN | wc -l) ))

And the /dev/sg* device nodes currently on the system:
ls -l /dev/sg*

2) A LIP is issued:
echo 1 | tee /sys/class/fc_host/host?/issue_lip
# This operation performs a Loop Initialization Protocol (LIP)
#    and then scans the interconnect and causes the SCSI layer to be updated
#    to reflect the devices currently on the bus. A LIP is, essentially, a bus reset,
#    and will cause device addition and removal. This procedure is necessary to configure
#    a new SCSI target on a Fibre Channel interconnect. Bear in mind that issue_lip is
#    an asynchronous operation. The command may complete before the entire scan has completed.
#    You must monitor /var/log/messages to determine when it is done.
#    The lpfc and qla2xxx drivers support issue_lip.
#    For more information about the API capabilities supported by each driver in Red Hat Enterprise Linux,
#    refer to Table 1, "Fibre-Channel API Capabilities".

3) Several new paths appear.
The extra /dev/sg* devices are created when the LIP reports that new devices have been discovered on the SCSI bus. Since the HBA driver is responsible for reporting the paths to the system, we currently believe that the HBA driver and/or the storage frame is at fault. EMC specifically mentions that the SPC-2 bit must be set to "enabled", which is not an online change; the hosts must be rebooted to pick it up. That doesn't seem to be related, however: the paths appear spontaneously, and querying the HBA with scli shortly afterward still shows the correct number of paths.
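For convenience, here's a quick health check that runs all of the counts above in one shot (a small wrapper around the same commands; it assumes four paths per LUN, qla2xxx HBAs, and the four LUNZ/VRAID devices as in our environment):

#!/bin/bash
# Compare the expected path count against what PowerPath, lsscsi,
# and the HBA each report.  All four numbers should match.
EXPECTED=$(( $(powermt display dev=all | grep -c emcpower) * 4 ))
POWERMT=$(powermt display dev=all | grep -c qla)
SCSI=$(( $(lsscsi | wc -l) - 4 ))
HBA=$(( $(scli -l 0 | grep -c ^LUN) - 4 + $(scli -l 1 | grep -c ^LUN) ))
echo "expected: ${EXPECTED}  powermt: ${POWERMT}  lsscsi: ${SCSI}  hba: ${HBA}"
[ "${POWERMT}" -eq "${EXPECTED}" ] || echo "WARNING: phantom paths likely present"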

Example and Mitigation

This was happening on [server name redacted].
# The number of paths that "scli -l" shows that the HBA is reporting to the server.
# Subtracting 4 because the last 4 devices are LUNZ/VRAID devices instead of Symmetrix or Clariion.
root@xxxxx:TEST:scsi_device> echo $(( $(scli -l 0 | grep ^LUN | wc -l) - 4 + $(scli -l 1 | grep ^LUN | wc -l) ))
440
# The number of paths that should exist
root@xxxxx:TEST:scsi_device> echo $(( $(powermt display dev=all | grep emcpower | wc -l) * 4 ))
440
# The number of paths that PowerPath is reporting
root@xxxxx:TEST:scsi_device> powermt display dev=all | grep qla | wc -l
1419
# The number of SCSI devices that the system sees
# Subtracting 4 because the last 4 devices are LUNZ/VRAID devices
root@xxxxx:TEST:scsi_device> echo $(( $(lsscsi | wc -l) - 4 ))
1419
And another "healthy" server in the same cluster:
# The number of paths that "scli -l" shows that the HBA is reporting to the server.
# Subtracting 4 because the last 4 devices are LUNZ/VRAID devices instead of Symmetrix or Clariion.
root@xxxxx:TEST:~> echo $(( $(scli -l 0 | grep ^LUN | wc -l) - 4 + $(scli -l 1 | grep ^LUN | wc -l) ))
440
# The number of paths that should exist
root@xxxxx:TEST:~> echo $(( $(powermt display dev=all | grep emcpower | wc -l) * 4 ))
440
# The number of paths that PowerPath is reporting
root@xxxxx:TEST:~> powermt display dev=all | grep qla | wc -l
440
# The number of SCSI devices that the system sees
# Subtracting 4 because the last 4 devices are LUNZ/VRAID devices
root@xxxxx:TEST:~> echo $(( $(lsscsi | wc -l) - 4 ))
440
The first 440 /dev/sg* devices were created on system bootup. The extraneous 900+ were created on November 15th, 2012, at the same time some LUNs were added:
xxxxx,xxxxx,2012-11-15 21:29:57.398563,"Added LUN(s) [redacted]..."
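One quick way to see this for yourself (assuming udev created the /dev/sg* nodes at discovery time, so the node timestamps reflect when each path appeared):

# Group the /dev/sg* device nodes by creation date; field 7 is the date
# because character devices list "major, minor" in place of a size.
ls -l --time-style=long-iso /dev/sg* | awk '{print $7}' | sort | uniq -c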
The script we use to add storage issues a LIP, which is what caused the extra /dev/sg* paths to be discovered. Here's an example emcpower device with too many paths:
root@xxxxx:TEST:device> powermt display dev=emcpowercz
Pseudo name=emcpowercz
Symmetrix ID=xxxxx
Logical device ID=YYYY
state=alive; policy=SymmOpt; priority=0; queued-IOs=0
==============================================================================
---------------- Host ---------------   - Stor -   -- I/O Path -  -- Stats ---
###  HW Path                I/O Paths    Interf.   Mode    State  Q-IOs Errors
==============================================================================
   1 qla2xxx                   sdabg     FA 15gB   active  alive      0      0
   1 qla2xxx                   sdabt     FA 15gB   active  alive      0      0
   1 qla2xxx                   sdajq     FA 15gB   active  alive      0      0
   1 qla2xxx                   sdakd     FA 15gB   active  alive      0      0
   1 qla2xxx                   sdasa     FA 15gB   active  alive      0      0
   1 qla2xxx                   sdasn     FA 15gB   active  alive      0      0
   1 qla2xxx                   sdbak     FA 15gB   active  alive      0      0
   1 qla2xxx                   sdbax     FA 15gB   active  alive      0      0
   0 qla2xxx                   sdgu      FA  5eB   active  alive      0      0
   1 qla2xxx                   sdgv      FA 12eB   active  alive      0      0
   0 qla2xxx                   sdor      FA  2gB   active  alive      0      0
   1 qla2xxx                   sdox      FA 15gB   active  alive      0      0
   1 qla2xxx                   sdtj      FA 15gB   active  alive      0      0
Let's figure out how those paths correlate to /dev/sg* devices:
root@xxxxx:TEST:device> lsscsi -g | egrep "sdabg|sdabt|sdajq|sdakd|sdasa|sdasn|sdbak|sdbax|sdgu|sdgv|sdor|sdox|sdtj"
[0:0:0:101]  disk    EMC      SYMMETRIX        5874  /dev/sdgu  /dev/sg202
[0:0:1:101]  disk    EMC      SYMMETRIX        5874  /dev/sdor  /dev/sg407
[1:0:0:101]  disk    EMC      SYMMETRIX        5874  /dev/sdgv  /dev/sg203
[1:0:1:101]  disk    EMC      SYMMETRIX        5874  /dev/sdox  /dev/sg413
[1:0:1:12389]disk    EMC      SYMMETRIX        5874  /dev/sdtj  /dev/sg529
[1:0:1:34309]disk    EMC      SYMMETRIX        5874  /dev/sdabg  /dev/sg734
[1:0:1:34325]disk    EMC      SYMMETRIX        5874  /dev/sdabt  /dev/sg747
[1:0:1:38405]disk    EMC      SYMMETRIX        5874  /dev/sdajq  /dev/sg952
[1:0:1:38421]disk    EMC      SYMMETRIX        5874  /dev/sdakd  /dev/sg965
[1:0:1:42501]disk    EMC      SYMMETRIX        5874  /dev/sdasa  /dev/sg1170
[1:0:1:42517]disk    EMC      SYMMETRIX        5874  /dev/sdasn  /dev/sg1183
[1:0:1:46597]disk    EMC      SYMMETRIX        5874  /dev/sdbak  /dev/sg1388
[1:0:1:46613]disk    EMC      SYMMETRIX        5874  /dev/sdbax  /dev/sg1401
Look suspicious? There are four /dev/sg* devices numbered under 440, which is the total number of SCSI paths that should be on the system. Let's make sure those four are valid:
for dev in sdgu sdor sdgv sdox; do oracleasm querydisk /dev/${dev}1; done
Device "/dev/sdgu1" is marked an ASM disk with the label "XXXXXXXXX"
Device "/dev/sdor1" is marked an ASM disk with the label "XXXXXXXXX"
Device "/dev/sdgv1" is marked an ASM disk with the label "XXXXXXXXX"
Device "/dev/sdox1" is marked an ASM disk with the label "XXXXXXXXX"
That's just an example for RAC. The point is to make sure the valid /dev/sd* devices are accessible before we blow away the extraneous ones. Something like this would work as well:
for dev in sdgu sdor sdgv sdox; do od -c /dev/${dev}1 | head -10; done
All four paths valid? Good. Now that we know how to suss out invalid paths manually, let's do it the easy way:
Note: This will delete extra SCSI paths; the actual 'delete' command has been commented out below. Make sure you're okay with the possibility of a server crash before uncommenting and running it.
VALID_TMP="/tmp/.valid_devices.$(date "+%m%d%Y")"
ALL_TMP="/tmp/.all_devices.$(date "+%m%d%Y")"
 
# Discover the paths that the HBA is reporting
for hba in 0 1; do
    sudo scli -l ${hba} | grep -Po "sd\w+"
done > ${VALID_TMP}
 
# Discover the paths that the OS is reporting
sudo powermt display dev=all | grep -Po "sd\w+" > ${ALL_TMP}
 
for device in $(cat ${ALL_TMP}); do
    # Any path that PowerPath reports but the HBA doesn't is a phantom
    if ! grep -qP "^${device}$" ${VALID_TMP}; then
        echo "Device ${device} is invalid.  Deleting..."
        #echo 1 | sudo tee /sys/block/${device}/device/delete &>/dev/null
    fi
done
 
rm -f ${VALID_TMP} ${ALL_TMP}
And finally, issue another LIP:
echo 1 | sudo tee /sys/class/fc_host/host?/issue_lip
Our device should be back to normal:
root@xxxxxx:TEST:~> powermt display dev=emcpowercz
Pseudo name=emcpowercz
Symmetrix ID=xxxxxx
Logical device ID=YYYY
state=alive; policy=SymmOpt; priority=0; queued-IOs=0
==============================================================================
---------------- Host ---------------   - Stor -   -- I/O Path -  -- Stats ---
###  HW Path                I/O Paths    Interf.   Mode    State  Q-IOs Errors
==============================================================================
   0 qla2xxx                   sdgu      FA  5eB   active  alive      0      0
   1 qla2xxx                   sdgv      FA 12eB   active  alive      0      0
   0 qla2xxx                   sdor      FA  2gB   active  alive      0      0
   1 qla2xxx                   sdox      FA 15gB   active  alive      0      0