1. Multipath.conf is shortened (only a few devices are shown). Internal disks are ignored (vendor "Sun"). Also note that changes to uid or mode on multipath devices become visible only after a reboot (bug?).
# cat /etc/multipath.conf
blacklist {
    devnode "^(ram|raw|loop|fd|md|dm-|sr|scd|st)[0-9]*"
    devnode "^hd[a-z][[0-9]*]"
    device {
        vendor Sun
    }
}
devices {
    device {
        vendor                "SUN"
        product               "LCSM100_F"
        getuid_callout        "/sbin/scsi_id -g -u -s /block/%n"
        prio_callout          "/sbin/mpath_prio_rdac /dev/%n"
        features              "0"
        hardware_handler      "1 rdac"
        path_grouping_policy  group_by_prio
        failback              immediate
        rr_weight             uniform
        no_path_retry         queue
        rr_min_io             1000
        path_checker          rdac
    }
}
multipaths {
    multipath {
        wwid   3600a0b80005b0a760000042b4ad4bf74
        alias  s2-dbdata1
        mode   660
        uid    500
        gid    500
    }
    multipath {
        wwid   3600a0b80005b15e3000004594ad4bf47
        alias  s2-dbredo1
        mode   660
        uid    500
        gid    500
    }
}
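If you want to apply and check this config without waiting for a reboot, something like the following can be tried (a sketch only; maps that are mounted or otherwise in use cannot be flushed, which is probably why the uid/mode change above only shows up after a reboot):

multipath -F                     # flush unused maps
multipath -v2                    # rebuild them from the new multipath.conf
multipath -ll                    # the aliases s2-dbdata1 / s2-dbredo1 should appear
ls -l /dev/mapper/s2-dbdata1     # compare owner/group/mode with uid/gid/mode above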
2. In lvm.conf, it is good to modify two lines to reduce LVM discovery time.
/etc/lvm/lvm.conf:
filter = [ "a/dev/mpath/.*/", "a/dev/sda.*/", "a/dev/sdb.*/", "a/dev/md.*/", "r/.*/" ]
types = [ "device-mapper", 1]
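A quick sanity check of the new filter (just a sketch; the device names are the ones from this setup) can look like this:

lvmdiskscan | grep -E 'mpath|sd[ab]'   # only the filtered-in devices should be scanned
time vgscan                            # discovery should now be noticeably faster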
3. Also modify the udev rules to shorten boot time (the WAIT_FOR_SYSFS rule is commented out with ##):
/etc/udev/rules.d/05-udev-early.rules:
##ACTION=="add", DEVPATH=="/devices/*", ENV{PHYSDEVBUS}=="?*", WAIT_FOR_SYSFS="bus"
4. Modprobe.conf:
/etc/modprobe.conf:
alias eth0 e1000e
alias eth1 e1000e
alias eth2 e1000e
alias eth3 e1000e
alias scsi_hostadapter aacraid
alias scsi_hostadapter1 ata_piix
alias scsi_hostadapter2 ahci
alias scsi_hostadapter3 usb-storage
alias scsi_hostadapter4 qla2xxx
alias scsi_hostadapter5 dm_multipath
alias scsi_hostadapter6 scsi_dh_rdac
5. The last step is critical: recreate the initrd so that scsi_dh_rdac is loaded first (otherwise there will be tons of errors during boot). Also note that you need at least update 3 of RHEL 5.
[root@db2 boot]# mkinitrd -v -f initrd-2.6.18-164.2.1.el5.img 2.6.18-164.2.1.el5 --preload scsi_dh_rdac
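To verify the result, you can unpack the new image (on RHEL 5 the initrd is a gzipped cpio archive) and check the insmod order in its init script; a rough sketch:

mkdir /tmp/initrd-check && cd /tmp/initrd-check
zcat /boot/initrd-2.6.18-164.2.1.el5.img | cpio -idm
grep insmod init | grep -E 'scsi_dh_rdac|qla2xxx'
# scsi_dh_rdac.ko should be listed before qla2xxx.ko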
Thank you for this, we've been having difficulty getting our paths up with almost identical equipment.
Are you getting I/O errors on load for the qla2xxx module? We have 4 paths to the SAN, but 2 throw I/O errors (the ones listed as 'ghost' by multipath). I can't copy/paste into this box to show you the errors.
Any assistance would be greatly appreciated, thanks!
-Daniel
Yes, there are a few I/O errors on boot (before the root filesystem is mounted). But they are kept to a minimum, as I recreated the initrd (step 5). If you omit this step, the scsi_dh_rdac module will be loaded after the qla2xxx module, and you will see tons of I/O errors from accessing the two passive (ghost) paths. The scsi_dh_rdac module is responsible for correct multipath behaviour (by raising the priority of active paths), so it has to be loaded as soon as possible. I hope this bug will be fixed in future Red Hat releases, so we don't need to recreate the initrd after each kernel update.
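On a running system you can also get a rough idea from the kernel log whether the rdac handler attached before qla2xxx started probing the LUNs (just a sketch; the exact messages vary between kernel versions):

dmesg | grep -iE 'rdac|qla2xxx' | head -40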
I also changed the way comments are posted, so copy/paste should be OK.
Nice name btw ;)
Thanks for your help on this. What would be great is if they shipped the installer with either scsi_dh_rdac preloaded, or a means to check whether it should be, prior to loading qla2xxx. I had to gut qla2xxx from the installer's initrd so that we could do automated installs. But I should now be able to put it back in while preloading scsi_dh_rdac.
Thanks again!
-Daniel
You're welcome.
When you have time after finishing the deployment, please leave a comment on whether you were able to do automated installs the new way (preload) or had to fall back to the method without the qla2xxx module. I am curious, thanks :)
By chance, are the I/O errors you are seeing coming from the SAN's access device (Universal Xport)? I haven't figured out a way to make it so that qla2xxx does not create dev nodes for that device. It is contributing to most/all of the I/O errors I'm now seeing.
Sorry... didn't see your comment about the automated installs. I haven't tested out my approach yet, but I was looking into modifying the installer's initrd and making scsi_dh_rdac a dependency for qla2xxx. In general, it is not a dependency for qla2xxx, but it appears to be when using StorageTek 2500 series SANs.
To be honest though, I don't need the SAN for installs. The easier approach would be to simply copy over the modified initrd in the post-install section of kickstart. But, I want to try anyway :)
Yes, those few I/O errors come from the SAN access device. And for automated installs: I would personally do it the simpler way, just as you suggest (in the post-install script, or not at all during install), as there are fewer problems (no need to recreate the boot image every time a new OS update comes). The OS eventually boots, and then I would solve the problems :)
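If those access-LUN errors ever get annoying, one thing that could be tried (untested here) is adding a device entry for the Universal Xport LUN to the existing blacklist section of multipath.conf, roughly:

# untested: add inside the blacklist { } section of /etc/multipath.conf
device {
    vendor  "SUN"
    product "Universal Xport"
}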
Have you experienced any unusually high loads? I've been rsync'ing a 200GB dataset over ssh, via a gigabit switch, to a SAN volume on a machine fibre-attached to the SAN. This drives load averages as high as 25, with iowait hovering between 60-80%. Very unexpected for a 50-60MB/s transfer rate...
Please try it again without CPU-intensive ssh+rsync, so we can isolate the problem. Try a netcat pump, which I wrote about here: http://www.ha-obsession.net/2007/12/fast-remote-dir-copy-using-netcat.html
A little update: I can't confirm this yet, but according to https://bugzilla.redhat.com/show_bug.cgi?id=515326 they fixed the loading order of the scsi_dh_rdac and qla2xxx modules in the latest update, RHEL 5.5.
Sorry it's taken a while to get back to you. Our high-load issues may be associated with other problems. FYI, you should read this link:
https://lists.linux-foundation.org/pipermail/bugme-new/2007-June/016354.html
I'm currently in communication with Sun/Oracle as well as LSI (the actual manufacturer of the ST2540) and their official position is that DM-MP is not supported. I believe it to be due to how the ST2540 is asymmetrical. Here is another link that talks about it somewhat. The ST6140 is more or less the same as the ST2540:
http://southbrain.com/south/2009/10/qla-linux-and-rdac---sun-6140.html
I'd be curious to know what your take on all this is. We started observing controller resets for no known reason back in February and have been diagnosing since. I'm of the opinion that it is due to AVT. I'm currently testing a multipath configuration where the second controller is disabled; multiple paths, but no failover. Failover is not essential in our environment, but would be nice to have. We are unfortunately unable to use mppRDAC, as far as we can tell, as it does not actually support our HBAs. Sun is somewhat confused as to what is and is not supported hardware: their documents conflict. Officially, our HBAs (QLE2560s) and switches (QLogic 5802s) are not compatible with the ST2540. Needless to say, we are very happy with our reseller...
We also use these cards in our servers:
QLogic Fibre Channel HBA Driver: 8.03.00.1.05.05-k
QLogic QLE2560 - Sun StorageTek 8Gb FC PCIe HBA, single port
ISP2532: PCIe (2.5Gb/s x8) @ 0000:03:00.0 hdma+, host#=8, fw=4.04.09 (85)
Yes, we recently talked to the LSI guys, so I know their position about not supporting dm-multipath. I was really surprised.
I have a few things for you if you want to investigate further: are you using the latest fcode (firmware) for your HBAs? Can you paste to pastebin.org all the configs I mentioned in the original article (multipath.conf, lvm.conf, ...) plus "uname -a" and maybe logs? We are not observing controller resets. On one cluster we use about 10 LUNs from the disk array (all owned by one controller), and on the second cluster also about 10 LUNs (owned by the second controller).
What servers exactly are you using? For example, in X4270 servers you should put the HBAs only in PCIe slots 0 and 3!
We're using the same HBAs. I don't have the firmware version on hand, but it should be recent as of this fall. We upgraded firmware on everything across all of our systems. The SANs are at the latest as well.
We've got a mix of x4140s and x4450s. I don't remember which PCIe slot the HBA is in; however, those were installed by our reseller, not by us.
We are not using LVM; it's not needed for our environment.
Your usage is considerably heavier than ours. But regardless, I can trigger controller resets on a single 250GB volume (FS-agnostic, we used both ext3 and OCFS2), with nothing else using the SAN, no other volumes, and only a single host with an initiator. The behavior only manifests itself when both controllers are fibre-attached. If we unplug one controller (removing the failover possibility), the configuration is stable. We know for a fact that the controller is resetting: I've become very familiar with the SAN service adviser dumps.
I pasted multipath.conf, multipath -v2 -ll, uname -a, modprobe.conf and logs from one of our machines that was observing the behavior. We are using a rebuilt initrd that preloads scsi_dh_rdac.
http://pastebin.org/149331
Do both of your hosts see both controllers? i.e., does multipath -v2 -ll show 2 active paths and 2 ghost paths per LUN?
I appreciate your help on this. Sounds like we have a lot of similarities in our setup, I'm jealous of yours though ;)
Hi,
yes, both hosts see both controllers.
For example:
s1-dbbackup1 (3600a0b80005b156d000007664ae719e7) dm-12 SUN,LCSM100_F
[size=144G][features=1 queue_if_no_path][hwhandler=1 rdac][rw]
\_ round-robin 0 [prio=200][active]
\_ 8:0:2:9 sdak 66:64 [active][ready]
\_ 9:0:2:9 sdal 66:80 [active][ready]
\_ round-robin 0 [prio=0][enabled]
\_ 8:0:0:9 sdh 8:112 [active][ghost]
\_ 9:0:0:9 sdv 65:80 [active][ghost]
Here is our setup plus debug messages from last boot.
db1 (sorry for the format, it's from "script"):
http://pastebin.org/191638
db2:
http://pastebin.org/191641
I see you have small differences in multipath.conf (the no_path_retry param and blacklist), lvm.conf (no filter for SAN devices) and modprobe.conf (no "scsi_hostadapter4 qla2xxx", "scsi_hostadapter5 dm-multipath" and "scsi_hostadapter6 scsi_dh_rdac" alias lines, but you do have a special qlport_down_retry parameter defined). Even if you don't use LVM, you should make the filter changes to lvm.conf, because an LVM scan is performed at boot.
Hmm, actually I have no idea why you are observing such problems... pretty strange, I will think about it...
Thanks for the info. There are some differences, but I don't believe they should result in sporadic AVT. Then again, I don't have much FC/SAN experience. Does that seem reasonable to you?
To give you an update, we've got high-level people within Sun/Oracle scrambling on this now. Had a tense meeting yesterday.
I have unofficial information from a friend who is a Sun engineer that these LUN trespasses are a known issue and should be fixed in RHEL 5.5. I can't confirm this info, but I hope it helps :)
Hi there,
We're running fully patched and updated RHEL 5.5 on some Sun X4600M2 servers with Emulex FC cards connected to Sun 2540 arrays. For the ones we're multipathing, we see huge numbers of buffer I/O errors on the ghost path on every boot and regularly during operation.
Currently we're booting with "pci=noacpi irqpoll" added to the kernel line, have rebuilt the initrd with "--preload=scsi_dh_rdac", and have "alias scsi_hostadapter3 lpfc", "alias scsi_hostadapter4 dm_multipath" and "alias scsi_hostadapter5 scsi_dh_rdac" in modprobe.conf, all without being able to get rid of the buffer I/O errors.
Can you please give some suggestions on what we're missing? I'm beginning to tear my hair out over this. Obviously there are no functional problems (everything works):
datavol (3600a0b800038b3e500000224477df1d2) dm-2 SUN,LCSM100_F
[size=1.5T][features=1 queue_if_no_path][hwhandler=1 rdac][rw]
\_ round-robin 0 [prio=100] [active]
\_ 2:0:0:0 sdc 8:32 [active] [ready]
\_ round-robin 0 [prio=0][enabled]
\_ 1:0:0:0 sdb 8:16 [active] [ghost]
but it's annoying to think things could be cleaner.
With thanks,
Ben
Hi Ben,
can you please post your /etc/multipath.conf, /etc/modprobe.conf and /etc/lvm/lvm.conf? During normal operation there should be no I/O errors, only during boot.
Daniel
http://fpaste.org/iWYy/
/etc/lvm/lvm.conf
http://fpaste.org/G9vx/
/etc/modprobe.conf
http://fpaste.org/2mKm/
/etc/multipath.conf
NOTE: They'll expire within 24 hours.
Hi Ben,
my first guess is that you are experiencing errors due to the default /etc/lvm/lvm.conf file. By default it scans all devices (which includes the ghost ones). Please modify it according to step 2 in my original post.
2. In lvm.conf, it is good to modify two lines to reduce LVM discovery time.
filter = [ "a/dev/mpath/.*/", "a/dev/sda.*/", "a/dev/sdb.*/", "a/dev/md.*/", "r/.*/" ]
types = [ "device-mapper", 1]
Please post the results if it helps.
Daniel
What happens if we have a multipath device called /dev/sdag? For example, on another problem server we have the following:
# multipath -ll
[...]
vold08 (3600a0b8000498550000001ff492aa0ca) dm-11 SUN,LCSM100_F
[size=1.4T][features=1 queue_if_no_path][hwhandler=1 rdac][rw]
\_ round-robin 0 [prio=100][active]
\_ 3:0:0:1 sdd 8:48 [active][ready]
vold10 (360050768019c038e9000000000000003) dm-2 IBM,2145
[size=500G][features=1 queue_if_no_path][hwhandler=0][rw]
\_ round-robin 0 [prio=200][active]
\_ 6:0:0:1 sdai 66:32 [active][ready]
\_ 6:0:2:1 sdaw 67:0 [active][ready]
\_ 4:0:0:1 sdf 8:80 [active][ready]
\_ 4:0:2:1 sdt 65:48 [active][ready]
\_ round-robin 0 [prio=40][enabled]
\_ 4:0:3:1 sdaa 65:160 [active][ready]
\_ 6:0:1:1 sdap 66:144 [active][ready]
\_ 6:0:3:1 sdbd 67:112 [active][ready]
\_ 4:0:1:1 sdm 8:192 [active][ready]
vold07 (3600a0b8000498550000001fd492aa0a0) dm-10 SUN,LCSM100_F
[size=1.4T][features=1 queue_if_no_path][hwhandler=1 rdac][rw]
\_ round-robin 0 [prio=100][active]
\_ 3:0:0:0 sdc 8:32 [active][ready]
vold01 (3600a0b800038b29a000001f0477b118c) dm-8 SUN,LCSM100_F
[size=1.5T][features=1 queue_if_no_path][hwhandler=1 rdac][rw]
\_ round-robin 0 [prio=100][active]
\_ 1:0:0:0 sdb 8:16 [active][ready]
\_ round-robin 0 [prio=0][enabled]
\_ 5:0:0:0 sdag 66:0 [active][ghost]
It's dm-8 that's the problem child. We get the I/O errors on sdag. Will your suggested filter work with all of the above? Would it need to be modified to something of the form
filter = [ "a/dev/mpath/.*/", "a/dev/sda[0-9]/", "a/dev/sdb[0-9]/", "a/dev/md.*/", "r/.*/" ]
instead?
Ben,
in the filter line, include only the block devices which you think LVM should scan during boot or normal operation. So if you have LVM on boot disks sda and sdb, include only those, so maybe this will work:
filter = [ "a/dev/mpath/.*/", "a/dev/sda[0-9]+/", "a/dev/sdb[0-9]+/", "a/dev/md.*/", "r/.*/" ]
If you don't use LVM on the boot disks, but only on multipath devices, use something like this:
filter = [ "a/dev/mpath/.*/", "a/dev/md.*/", "r/.*/" ]
But please, after changing lvm.conf, test everything first using "vgscan -vv" or "vgscan -vvv", so you don't make a mistake.
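For example, something like this (a sketch only; the verbose output format differs between LVM versions) should show whether the ghost path from your listing is now rejected by the filter:

vgscan -vvv 2>&1 | grep -i sdag
# ideally sdag only shows up as skipped/excluded by the filter, not opened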
Daniel
We don't even use LVM on sdb as far as I can tell. Only sda. So with a little tweaking that might work perfectly.
I'll give it a shot. Thank you.
In fact, I think
filter = [ "a/dev/sda[0-9]+/", "r/.*/" ]
is all we really want as we only use LVM on the internal RAID1 array (/dev/sda).
That makes sense, doesn't it? (-:
Yes :) I just wanted to be sure that you don't forget anything, because lvm.conf is copied into the initrd boot image, so any mistakes here are fatal :) This copy occurs during the install of a new kernel, or in step 5 of my original post.
I could not "see" the ST2540... UNTIL I upgraded to the latest kernel, 2.6.32-100.24.1.el5. We are using Oracle Unbreakable Linux. I can't seem to go over 4Gb with a QLogic QLE2562 card to the ST2540? What gives?
QLogic Fibre Channel HBA Driver: 8.03.01.01.32.1-k9
QLogic Fibre Channel HBA Driver: 8.03.01.01.32.1-k9
QLogic QLE2562 - Sun StorageTek 8Gb FC PCIe HBA, dual port
QLogic Fibre Channel HBA Driver: 8.03.01.01.32.1-k9
QLogic QLE2562 - Sun StorageTek 8Gb FC PCIe HBA, dual port
I just thought I would point out that your blog post is still helpful, 2 years after you wrote it. Thanks!