Tuesday, January 15, 2013

Live Partition Table Updates in RHEL 6

A typical process for us is sending a request to the ESX team to expand a disk.  Once the expansion is done, we rescan the SCSI bus, create a new LVM-type partition, probe the partition table with partprobe to update /proc/partitions, and add the new partition as an LVM physical volume.
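On RHEL 5, the whole sequence looks something like this (the SCSI address, device names, partition number, and volume group are placeholders for illustration, not our actual layout):

# Rescan the expanded device so the kernel picks up the new size
echo 1 > /sys/class/scsi_device/0:0:1:0/device/rescan
# Create the new partition (type 8e, Linux LVM) with fdisk /dev/sdb, then:
partprobe /dev/sdb             # push the updated table into /proc/partitions
pvcreate /dev/sdb2             # register the new partition as an LVM physical volume
vgextend vg_data /dev/sdb2     # grow the volume group onto it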

This was ideal for us on our RHEL 5 guests, since it allowed our ESX team to manage one or two large disks per guest rather than dozens of small ones.

Unfortunately, if a block device has partitions mounted, the ability to probe it for new partitions with partprobe is no longer present on RHEL 6.  There's a great post summing up the reasons for that here, along with a workaround using the partx command.  However, the post suggests, perhaps wisely, that partx is a dangerous workaround and that the safest option is a reboot.

I'm prepared to bite the bullet and suggest that the ESX team start provisioning a new disk for every storage expansion request, but the fact remains that we had a production guest that already had a disk expanded and we wanted to avoid a reboot if possible.

It turns out that partx -a /dev/sdb and partprobe /dev/sdb end up making exactly the same kernel call.  From libparted/disk.c:

/**
 * Tell the operating system kernel about the partition table layout
 * of \p disk.
 *
 * This is rather loosely defined: for example, on old versions of Linux,
 * it simply calls the BLKRRPART ioctl, which tells the kernel to
 * reread the partition table. On newer versions (2.4.x), it will
 * use the new blkpg interface to tell Linux where each partition
 * starts/ends, etc. In this case, Linux does not need to have support for
 * a specific type of partition table.
 *
 * \return 0 on failure, 1 otherwise.
 */

int
ped_disk_commit_to_os (PedDisk* disk)

So it looks like it makes an ioctl() call with BLKRRPART for pre-2.4 kernels and BLKPG on more recent kernels.
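If you want to verify that on your own system, strace makes the ioctl traffic easy to see (a quick sanity check, not part of the workflow itself):

# On a 2.6 kernel this should show BLKPG ioctls, one per partition;
# on very old kernels you'd see a single BLKRRPART instead.
strace -e trace=ioctl partprobe /dev/sdb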

Let's see what partx does, from util-linux-ng/addpart.c:partx_add_partition():

return ioctl(fd, BLKPG, &a);

The partx command makes the same ioctl() call.  Dangerous?  Possibly, but since it's doing the same thing we've done on RHEL 5 for ages, I wasn't particularly worried about it.  I ran partx -a /dev/sdb and voila!  The kernel updated its partition table and I was able to add the new partition as an LVM physical volume.
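For the record, the full sequence was something like this (the partition number and volume group name are placeholders):

partx -a /dev/sdb              # add any new partitions from the on-disk table
grep sdb /proc/partitions      # the new partition should now be listed
pvcreate /dev/sdb2             # then carry on with LVM as usual
vgextend vg_data /dev/sdb2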

In the future, we'll be adding new disks instead of expanding old ones, but if you find yourself in the same position of having to add a new partition on a live disk, hopefully this will help.

Thursday, January 3, 2013

Errybody's Talkin' 'Bout Notifications

DMNotify- The Latest Iteration of Irssi Notifications

It seems like everybody has their own notification solution for IRC.  I've gone through a few over the years, including GMNotify, hodgepodges of Perl and Python scripts unprofessionally cobbled together to leverage Growl notifications, and no notifications at all!

DMNotify is the latest solution, and the best thus far.  It uses the Twitter API via Net::Twitter to send DMs on nick highlights and private messages, and it lets you respond to the last person who notified you.  The full README at the GitHub link provides a bit more information.

Enjoy!

Wednesday, December 5, 2012

Clearing Phantom Paths on a Server with PowerPath Installed


Overview

Problem: Extra paths are visible in powermt display dev=all output. We are still working on determining the root cause. What seems to happen is:
1) Initially, the number of paths is correct. Assuming that each LUN should be using four paths, you can determine the correct number as follows:
echo $(( $(powermt display dev=all | grep emcpower | wc -l) * 4 ))

And the current number:
powermt display dev=all | grep qla | wc -l

The current number of SCSI devices:
lsscsi | wc -l

The number of paths the HBA itself reports (subtracting 4 because the last four devices are LUNZ/VRAID rather than Symmetrix or Clariion):
echo $(( $(scli -l 0 | grep ^LUN | wc -l) - 4 + $(scli -l 1 | grep ^LUN | wc -l) ))

And the /dev/sg* device nodes currently on the system:
ls -l /dev/sg*

2) A LIP is issued:
echo 1 | tee /sys/class/fc_host/host?/issue_lip
# This operation performs a Loop Initialization Protocol (LIP)
#    and then scans the interconnect and causes the SCSI layer to be updated
#    to reflect the devices currently on the bus. A LIP is, essentially, a bus reset,
#    and will cause device addition and removal. This procedure is necessary to configure
#    a new SCSI target on a Fibre Channel interconnect. Bear in mind that issue_lip is
#    an asynchronous operation. The command may complete before the entire scan has completed.
#    You must monitor /var/log/messages to determine when it is done.
#    The lpfc and qla2xxx drivers support issue_lip.
#    For more information about the API capabilities supported by each driver in Red Hat Enterprise Linux,
#    refer to Table 1, "Fibre-Channel API Capabilities".

3) Several new paths appear.
The extra /dev/sg* devices are created when the LIP reports that new devices have been discovered on the SCSI bus. Since the HBA driver is responsible for reporting the paths to the system, we currently believe that the HBA driver and/or the storage frame is at fault. EMC specifically mentions that the SPC-2 bit must be set to "enabled", which is not an online change; the hosts must be rebooted to pick it up. That doesn't seem to be related, however: the paths appear spontaneously, and querying the HBA with scli shortly afterward still shows the correct number of paths.
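For convenience, here's a quick health check that runs all of the counts above in one shot (a small wrapper around the same commands; it assumes four paths per LUN, qla2xxx HBAs, and the four LUNZ/VRAID devices as in our environment):

#!/bin/bash
# Compare the expected path count against what PowerPath, lsscsi,
# and the HBA each report.  All four numbers should match.
EXPECTED=$(( $(powermt display dev=all | grep -c emcpower) * 4 ))
POWERMT=$(powermt display dev=all | grep -c qla)
SCSI=$(( $(lsscsi | wc -l) - 4 ))
HBA=$(( $(scli -l 0 | grep -c ^LUN) - 4 + $(scli -l 1 | grep -c ^LUN) ))
echo "expected: ${EXPECTED}  powermt: ${POWERMT}  lsscsi: ${SCSI}  hba: ${HBA}"
[ "${POWERMT}" -eq "${EXPECTED}" ] || echo "WARNING: phantom paths likely present"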

Example and Mitigation

This was happening on [server name redacted].
# The number of paths that "scli -l" shows that the HBA is reporting to the server.
# Subtracting 4 because the last 4 devices are LUNZ/VRAID devices instead of Symmetrix or Clariion.
root@xxxxx:TEST:scsi_device> echo $(( $(scli -l 0 | grep ^LUN | wc -l) - 4 + $(scli -l 1 | grep ^LUN | wc -l) ))
440
# The number of paths that should exist
root@xxxxx:TEST:scsi_device> echo $(( $(powermt display dev=all | grep emcpower | wc -l) * 4 ))
440
# The number of paths that PowerPath is reporting
root@xxxxx:TEST:scsi_device> powermt display dev=all | grep qla | wc -l
1419
# The number of SCSI devices that the system sees
# Subtracting 4 because the last 4 devices are LUNZ/VRAID devices
root@xxxxx:TEST:scsi_device> echo $(( $(lsscsi | wc -l) - 4 ))
1419
And another "healthy" server in the same cluster:
# The number of paths that "scli -l" shows that the HBA is reporting to the server.
# Subtracting 4 because the last 4 devices are LUNZ/VRAID devices instead of Symmetrix or Clariion.
root@xxxxx:TEST:~> echo $(( $(scli -l 0 | grep ^LUN | wc -l) - 4 + $(scli -l 1 | grep ^LUN | wc -l) ))
440
# The number of paths that should exist
root@xxxxx:TEST:~> echo $(( $(powermt display dev=all | grep emcpower | wc -l) * 4 ))
440
# The number of paths that PowerPath is reporting
root@xxxxx:TEST:~> powermt display dev=all | grep qla | wc -l
440
# The number of SCSI devices that the system sees
# Subtracting 4 because the last 4 devices are LUNZ/VRAID devices
root@xxxxx:TEST:~> echo $(( $(lsscsi | wc -l) - 4 ))
440
The first 440 /dev/sg* devices were created on system bootup. The extraneous 900+ were created on November 15th, 2012, at the same time some LUNs were added:
xxxxx,xxxxx,2012-11-15 21:29:57.398563,"Added LUN(s) [redacted]..."
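One quick way to see this for yourself (assuming udev created the /dev/sg* nodes at discovery time, so the node timestamps reflect when each path appeared):

# Group the /dev/sg* device nodes by creation date; field 7 is the date
# because character devices list "major, minor" in place of a size.
ls -l --time-style=long-iso /dev/sg* | awk '{print $7}' | sort | uniq -c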
The script we use to add storage issues a LIP, which is what caused the extra /dev/sg* paths to be discovered. Here's an example emcpower device with too many paths:
root@xxxxx:TEST:device> powermt display dev=emcpowercz
Pseudo name=emcpowercz
Symmetrix ID=xxxxx
Logical device ID=YYYY
state=alive; policy=SymmOpt; priority=0; queued-IOs=0
==============================================================================
---------------- Host ---------------   - Stor -   -- I/O Path -  -- Stats ---
###  HW Path                I/O Paths    Interf.   Mode    State  Q-IOs Errors
==============================================================================
   1 qla2xxx                   sdabg     FA 15gB   active  alive      0      0
   1 qla2xxx                   sdabt     FA 15gB   active  alive      0      0
   1 qla2xxx                   sdajq     FA 15gB   active  alive      0      0
   1 qla2xxx                   sdakd     FA 15gB   active  alive      0      0
   1 qla2xxx                   sdasa     FA 15gB   active  alive      0      0
   1 qla2xxx                   sdasn     FA 15gB   active  alive      0      0
   1 qla2xxx                   sdbak     FA 15gB   active  alive      0      0
   1 qla2xxx                   sdbax     FA 15gB   active  alive      0      0
   0 qla2xxx                   sdgu      FA  5eB   active  alive      0      0
   1 qla2xxx                   sdgv      FA 12eB   active  alive      0      0
   0 qla2xxx                   sdor      FA  2gB   active  alive      0      0
   1 qla2xxx                   sdox      FA 15gB   active  alive      0      0
   1 qla2xxx                   sdtj      FA 15gB   active  alive      0      0
Let's figure out how those paths correlate to /dev/sg* devices:
root@xxxxx:TEST:device> lsscsi -g | egrep "sdabg|sdabt|sdajq|sdakd|sdasa|sdasn|sdbak|sdbax|sdgu|sdgv|sdor|sdox|sdtj"
[0:0:0:101]  disk    EMC      SYMMETRIX        5874  /dev/sdgu  /dev/sg202
[0:0:1:101]  disk    EMC      SYMMETRIX        5874  /dev/sdor  /dev/sg407
[1:0:0:101]  disk    EMC      SYMMETRIX        5874  /dev/sdgv  /dev/sg203
[1:0:1:101]  disk    EMC      SYMMETRIX        5874  /dev/sdox  /dev/sg413
[1:0:1:12389]disk    EMC      SYMMETRIX        5874  /dev/sdtj  /dev/sg529
[1:0:1:34309]disk    EMC      SYMMETRIX        5874  /dev/sdabg  /dev/sg734
[1:0:1:34325]disk    EMC      SYMMETRIX        5874  /dev/sdabt  /dev/sg747
[1:0:1:38405]disk    EMC      SYMMETRIX        5874  /dev/sdajq  /dev/sg952
[1:0:1:38421]disk    EMC      SYMMETRIX        5874  /dev/sdakd  /dev/sg965
[1:0:1:42501]disk    EMC      SYMMETRIX        5874  /dev/sdasa  /dev/sg1170
[1:0:1:42517]disk    EMC      SYMMETRIX        5874  /dev/sdasn  /dev/sg1183
[1:0:1:46597]disk    EMC      SYMMETRIX        5874  /dev/sdbak  /dev/sg1388
[1:0:1:46613]disk    EMC      SYMMETRIX        5874  /dev/sdbax  /dev/sg1401
Look suspicious? There are four /dev/sg* devices numbered under 440, which is the total number of SCSI paths that should be on the system. Let's make sure those four are valid:
for dev in sdgu sdor sdgv sdox; do oracleasm querydisk /dev/${dev}1; done
Device "/dev/sdgu1" is marked an ASM disk with the label "XXXXXXXXX"
Device "/dev/sdor1" is marked an ASM disk with the label "XXXXXXXXX"
Device "/dev/sdgv1" is marked an ASM disk with the label "XXXXXXXXX"
Device "/dev/sdox1" is marked an ASM disk with the label "XXXXXXXXX"
That's just an example for RAC. The point is to make sure the valid /dev/sd* devices are accessible before we blow away the extraneous ones. Something like this would work as well:
for dev in sdgu sdor sdgv sdox; do od -c /dev/${dev}1 | head -10; done
All four paths valid? Good. Now that we know how to suss out invalid paths manually, let's do it the easy way:
Note: This will delete extra SCSI paths; the actual 'delete' command has been commented out below. Make sure you're okay with the possibility of a server crash before uncommenting and running it.
VALID_TMP="/tmp/.valid_devices.$(date "+%m%d%Y")"
ALL_TMP="/tmp/.all_devices.$(date "+%m%d%Y")"
 
# Discover the paths that the HBA is reporting
for hba in 0 1; do
    sudo scli -l ${hba} | grep -Po "sd\w+"
done > ${VALID_TMP}
 
# Discover the paths that the OS is reporting
sudo powermt display dev=all | grep -Po "sd\w+" > ${ALL_TMP}
 
for device in $(cat ${ALL_TMP}); do
    # Any path that PowerPath reports but the HBA doesn't is a phantom
    if ! grep -qP "^${device}$" ${VALID_TMP}; then
        echo "Device ${device} is invalid.  Deleting..."
        #echo 1 | sudo tee /sys/block/${device}/device/delete &>/dev/null
    fi
done
 
rm -f ${VALID_TMP} ${ALL_TMP}
And finally, issue another LIP:
echo 1 | sudo tee /sys/class/fc_host/host?/issue_lip
Our device should be back to normal:
root@xxxxxx:TEST:~> powermt display dev=emcpowercz
Pseudo name=emcpowercz
Symmetrix ID=xxxxxx
Logical device ID=YYYY
state=alive; policy=SymmOpt; priority=0; queued-IOs=0
==============================================================================
---------------- Host ---------------   - Stor -   -- I/O Path -  -- Stats ---
###  HW Path                I/O Paths    Interf.   Mode    State  Q-IOs Errors
==============================================================================
   0 qla2xxx                   sdgu      FA  5eB   active  alive      0      0
   1 qla2xxx                   sdgv      FA 12eB   active  alive      0      0
   0 qla2xxx                   sdor      FA  2gB   active  alive      0      0
   1 qla2xxx                   sdox      FA 15gB   active  alive      0      0