1. Multipath.conf is shortened (only a few devices are shown). Internal disks are ignored (vendor "Sun"). Also note that changes to uid or mode on multipath devices become visible only after a reboot (bug?).
# cat /etc/multipath.conf
blacklist {
    devnode "^(ram|raw|loop|fd|md|dm-|sr|scd|st)[0-9]*"
    devnode "^hd[a-z][[0-9]*]"
    device {
        vendor Sun
    }
}
devices {
    device {
        vendor                "SUN"
        product               "LCSM100_F"
        getuid_callout        "/sbin/scsi_id -g -u -s /block/%n"
        prio_callout          "/sbin/mpath_prio_rdac /dev/%n"
        features              "0"
        hardware_handler      "1 rdac"
        path_grouping_policy  group_by_prio
        failback              immediate
        rr_weight             uniform
        no_path_retry         queue
        rr_min_io             1000
        path_checker          rdac
    }
}
multipaths {
    multipath {
        wwid   3600a0b80005b0a760000042b4ad4bf74
        alias  s2-dbdata1
        mode   660
        uid    500
        gid    500
    }
    multipath {
        wwid   3600a0b80005b15e3000004594ad4bf47
        alias  s2-dbredo1
        mode   660
        uid    500
        gid    500
    }
}
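If you want to apply and check this config without waiting for a reboot, something like the following can be tried (a sketch only; maps that are mounted or otherwise in use cannot be flushed, which is probably why the uid/mode change above only shows up after a reboot):

multipath -F                     # flush unused maps
multipath -v2                    # rebuild them from the new multipath.conf
multipath -ll                    # the aliases s2-dbdata1 / s2-dbredo1 should appear
ls -l /dev/mapper/s2-dbdata1     # compare owner/group/mode with uid/gid/mode above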
2. In lvm.conf, it is good to modify two lines to reduce LVM discovery time.
/etc/lvm/lvm.conf:
filter = [ "a/dev/mpath/.*/", "a/dev/sda.*/", "a/dev/sdb.*/", "a/dev/md.*/", "r/.*/" ]
types = [ "device-mapper", 1]
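A quick sanity check of the new filter (just a sketch; the device names are the ones from this setup) can look like this:

lvmdiskscan | grep -E 'mpath|sd[ab]'   # only the filtered-in devices should be scanned
time vgscan                            # discovery should now be noticeably faster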
3. Also modify the udev rules to shorten boot time (the WAIT_FOR_SYSFS rule is commented out with ##):
/etc/udev/rules.d/05-udev-early.rules:
##ACTION=="add", DEVPATH=="/devices/*", ENV{PHYSDEVBUS}=="?*", WAIT_FOR_SYSFS="bus"
4. Modprobe.conf:
/etc/modprobe.conf:
alias eth0 e1000e
alias eth1 e1000e
alias eth2 e1000e
alias eth3 e1000e
alias scsi_hostadapter aacraid
alias scsi_hostadapter1 ata_piix
alias scsi_hostadapter2 ahci
alias scsi_hostadapter3 usb-storage
alias scsi_hostadapter4 qla2xxx
alias scsi_hostadapter5 dm_multipath
alias scsi_hostadapter6 scsi_dh_rdac
5. The last step is critical: recreate the initrd so that scsi_dh_rdac is loaded first (otherwise there will be tons of errors during boot). Also note that you need at least update 3 of RHEL 5.
[root@db2 boot]# mkinitrd -v -f initrd-2.6.18-164.2.1.el5.img 2.6.18-164.2.1.el5 --preload scsi_dh_rdac
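To verify the result, you can unpack the new image (on RHEL 5 the initrd is a gzipped cpio archive) and check the insmod order in its init script; a rough sketch:

mkdir /tmp/initrd-check && cd /tmp/initrd-check
zcat /boot/initrd-2.6.18-164.2.1.el5.img | cpio -idm
grep insmod init | grep -E 'scsi_dh_rdac|qla2xxx'
# scsi_dh_rdac.ko should be listed before qla2xxx.ko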
Thank you for this, we've been having difficulty getting our paths up with almost identical equipment.
Are you getting I/O errors on load for the qla2xxx module? We have 4 paths to the SAN, but 2 throw I/O errors (the ones listed as 'ghost' by multipath). I can't copy/paste into this box to show you the errors.
Any assistance would be greatly appreciated, thanks!
-Daniel
Yes, there are a few I/O errors on boot (before the root filesystem is mounted). But they are kept to a minimum, as I recreated the initrd (step 5). If you omit this step, the scsi_dh_rdac module will be loaded after the qla2xxx module, and you will see tons of I/O errors from accessing the two passive (ghost) paths. The scsi_dh_rdac module is responsible for correct multipath behaviour (by raising the priority of active paths), so it has to be loaded as soon as possible. I hope this bug will be fixed in future Red Hat releases, so we don't need to recreate the initrd after each kernel update.
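On a running system you can also get a rough idea from the kernel log whether the rdac handler attached before qla2xxx started probing the LUNs (just a sketch; the exact messages vary between kernel versions):

dmesg | grep -iE 'rdac|qla2xxx' | head -40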
I also changed the way comments are posted, so copy/paste should be OK.
Nice name btw ;)
Thanks for your help on this. What would be great is if they shipped the installer with either scsi_dh_rdac preloaded, or a means to check whether it should be, prior to loading qla2xxx. I had to gut qla2xxx from the installer's initrd so that we could do automated installs. But I should now be able to put it back in while preloading scsi_dh_rdac.
Thanks again!
-Daniel
You're welcome.
When you have time after finishing the deployment, please leave a comment on whether you were able to do automated installs the new way (preload) or had to fall back to the method without the qla2xxx module. I am curious, thanks :)
By chance, are the I/O errors you are seeing coming from the SAN's access device (Universal Xport)? I haven't figured out a way to make it so that qla2xxx does not create dev nodes for that device. It is contributing to most/all of the I/O errors I'm now seeing.
Sorry... didn't see your comment about the automated installs. I haven't tested out my approach yet, but I was looking into modifying the installer's initrd and making scsi_dh_rdac a dependency for qla2xxx. In general, it is not a dependency for qla2xxx, but it appears to be when using StorageTek 2500 series SANs.
To be honest though, I don't need the SAN for installs. The easier approach would be to simply copy over the modified initrd in the post-install section of kickstart. But, I want to try anyway :)
Yes, those few I/O errors come from the SAN access device. And for automated installs: I would personally do it the simpler way, just as you suggest (in the post-install script, or not at all during install), as there are fewer problems (no need to recreate the boot image every time a new OS update comes). The OS eventually boots, and then I would solve the problems :)
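If those access-LUN errors ever get annoying, one thing that could be tried (untested here) is adding a device entry for the Universal Xport LUN to the existing blacklist section of multipath.conf, roughly:

# untested: add inside the blacklist { } section of /etc/multipath.conf
device {
    vendor  "SUN"
    product "Universal Xport"
}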
Have you experienced any unusually high loads? I've been rsync'ing a 200GB dataset over ssh, via a gigabit switch, to a SAN volume on a machine fibre-attached to the SAN. This drives load averages as high as 25, with iowait hovering between 60-80%. Very unexpected for a 50-60MB/s transfer rate...
Please try it again without CPU-intensive ssh+rsync, so we can isolate the problem. Try a netcat pump, which I wrote about here: http://www.ha-obsession.net/2007/12/fast-remote-dir-copy-using-netcat.html
A little update: I can't confirm this yet, but according to https://bugzilla.redhat.com/show_bug.cgi?id=515326 they fixed the loading order of the scsi_dh_rdac and qla2xxx modules in the latest update, RHEL 5.5.
Sorry it's taken a while to get back to you. Our high-load issues may be associated with other problems. FYI, you should read this link:
https://lists.linux-foundation.org/pipermail/bugme-new/2007-June/016354.html
I'm currently in communication with Sun/Oracle as well as LSI (the actual manufacturer of the ST2540) and their official position is that DM-MP is not supported. I believe it to be due to how the ST2540 is asymmetrical. Here is another link that talks about it somewhat. The ST6140 is more or less the same as the ST2540:
http://southbrain.com/south/2009/10/qla-linux-and-rdac---sun-6140.html
I'd be curious to know what your take on all this is. We started observing controller resets for no known reason back in February and have been diagnosing since. I'm of the opinion that it is due to AVT. I'm currently testing a multipath configuration where the second controller is disabled; multiple paths, but no failover. Failover is not essential in our environment, but would be nice to have. We are unfortunately unable to use mppRDAC, as far as we can tell, as it does not actually support our HBAs. Sun is somewhat confused as to what is and is not supported hardware: their documents conflict. Officially, our HBAs (QLE2560s) and switches (QLogic 5802s) are not compatible with the ST2540. Needless to say, we are very happy with our reseller...
We also use these cards in our servers:
QLogic Fibre Channel HBA Driver: 8.03.00.1.05.05-k
QLogic QLE2560 - Sun StorageTek 8Gb FC PCIe HBA, single port
ISP2532: PCIe (2.5Gb/s x8) @ 0000:03:00.0 hdma+, host#=8, fw=4.04.09 (85)
Yes, we recently talked to the LSI guys, so I know their position about not supporting dm-multipath. I was really surprised.
I have a few things for you if you want to investigate further: are you using the latest fcode (firmware) for your HBAs? Can you paste to pastebin.org all the configs I mentioned in the original article (multipath.conf, lvm.conf, ...) plus "uname -a" and maybe logs? We are not observing controller resets. On one cluster we use about 10 LUNs from the disk array (all owned by one controller), and on the second cluster also about 10 LUNs (owned by the second controller).
What servers exactly are you using? For example, in X4270 servers you should put the HBAs only in PCIe slots 0 and 3!
We're using the same HBAs. I don't have the firmware version on hand, but it should be recent as of this fall. We upgraded firmware on everything across all of our systems. The SANs are at the latest as well.
We've got a mix of x4140s and x4450s. I don't remember which PCIe slot the HBA is in; however, those were installed by our reseller, not by us.
We are not using LVM; it's not needed for our environment.
Your usage is considerably heavier than ours. But regardless, I can trigger controller resets on a single 250GB volume (FS-agnostic, we used both ext3 and OCFS2), with nothing else using the SAN, no other volumes, and only a single host with an initiator. The behavior only manifests itself when both controllers are fibre-attached. If we unplug one controller (removing the failover possibility), the configuration is stable. We know for a fact that the controller is resetting: I've become very familiar with the SAN service adviser dumps.
I pasted multipath.conf, multipath -v2 -ll, uname -a, modprobe.conf and logs from one of our machines that was observing the behavior. We are using a rebuilt initrd that preloads scsi_dh_rdac.
http://pastebin.org/149331
Do both of your hosts see both controllers? i.e., does multipath -v2 -ll show 2 active paths and 2 ghost paths per LUN?
I appreciate your help on this. Sounds like we have a lot of similarities in our setup, I'm jealous of yours though ;)
Hi,
yes, both hosts see both controllers.
For example:
s1-dbbackup1 (3600a0b80005b156d000007664ae719e7) dm-12 SUN,LCSM100_F
[size=144G][features=1 queue_if_no_path][hwhandler=1 rdac][rw]
\_ round-robin 0 [prio=200][active]
\_ 8:0:2:9 sdak 66:64 [active][ready]
\_ 9:0:2:9 sdal 66:80 [active][ready]
\_ round-robin 0 [prio=0][enabled]
\_ 8:0:0:9 sdh 8:112 [active][ghost]
\_ 9:0:0:9 sdv 65:80 [active][ghost]
Here is our setup plus debug messages from last boot.
db1 (sorry for the format, it's from "script"):
http://pastebin.org/191638
db2:
http://pastebin.org/191641
I see you have small differences in multipath.conf (the no_path_retry param and blacklist), lvm.conf (no filter for SAN devices) and modprobe.conf (no "scsi_hostadapter4 qla2xxx", "scsi_hostadapter5 dm-multipath" and "scsi_hostadapter6 scsi_dh_rdac" alias lines, but you do have a special qlport_down_retry parameter defined). Even if you don't use LVM, you should make the filter changes to lvm.conf, because an LVM scan is performed at boot.
Hmm, actually I have no idea why you are observing such problems... pretty strange, I will think about it...
Thanks for the info. There are some differences, but I don't believe they should result in sporadic AVT. Then again, I don't have much FC/SAN experience. Does that seem reasonable to you?
To give you an update, we've got high-level people within Sun/Oracle scrambling on this now. Had a tense meeting yesterday.
I have unofficial information from a friend who is a Sun engineer that these LUN trespasses are a known issue and should be fixed in RHEL 5.5. I can't confirm this info, but I hope it helps :)
Hi there,
We're running fully patched and updated RHEL 5.5 on some Sun X4600M2 servers with Emulex FC cards connected to Sun 2540 arrays. For the ones we're multipathing, we see huge numbers of buffer I/O errors on the ghost path on every boot and regularly during operation.
Currently we're booting with "pci=noacpi irqpoll" added to the kernel line, have rebuilt the initrd with "--preload=scsi_dh_rdac", and have "alias scsi_hostadapter3 lpfc", "alias scsi_hostadapter4 dm_multipath" and "alias scsi_hostadapter5 scsi_dh_rdac" in modprobe.conf, all without being able to get rid of the buffer I/O errors.
Can you please give some suggestions on what we're missing? I'm beginning to tear my hair out over this. Obviously there are no functional problems (everything works):
datavol (3600a0b800038b3e500000224477df1d2) dm-2 SUN,LCSM100_F
[size=1.5T][features=1 queue_if_no_path][hwhandler=1 rdac][rw]
\_ round-robin 0 [prio=100] [active]
\_ 2:0:0:0 sdc 8:32 [active] [ready]
\_ round-robin 0 [prio=0][enabled]
\_ 1:0:0:0 sdb 8:16 [active] [ghost]
but it's annoying to think things could be cleaner.
With thanks,
Ben
Hi Ben,
can you please post your /etc/multipath.conf, /etc/modprobe.conf and /etc/lvm/lvm.conf? During normal operation there should be no I/O errors, only during boot.
Daniel
http://fpaste.org/iWYy/
/etc/lvm/lvm.conf
http://fpaste.org/G9vx/
/etc/modprobe.conf
http://fpaste.org/2mKm/
/etc/multipath.conf
NOTE: They'll expire within 24 hours.
Hi Ben,
my first guess is that you are experiencing errors due to the default /etc/lvm/lvm.conf file. By default it scans all devices (which includes the ghost ones). Please modify it according to step 2 in my original post.
2. In lvm.conf, it is good to modify two lines to reduce LVM discovery time.
filter = [ "a/dev/mpath/.*/", "a/dev/sda.*/", "a/dev/sdb.*/", "a/dev/md.*/", "r/.*/" ]
types = [ "device-mapper", 1]
Please post the results if it helps.
Daniel
What happens if we have a multipath device called /dev/sdag? For example, on another problem server we have the following:
# multipath -ll
[...]
vold08 (3600a0b8000498550000001ff492aa0ca) dm-11 SUN,LCSM100_F
[size=1.4T][features=1 queue_if_no_path][hwhandler=1 rdac][rw]
\_ round-robin 0 [prio=100][active]
\_ 3:0:0:1 sdd 8:48 [active][ready]
vold10 (360050768019c038e9000000000000003) dm-2 IBM,2145
[size=500G][features=1 queue_if_no_path][hwhandler=0][rw]
\_ round-robin 0 [prio=200][active]
\_ 6:0:0:1 sdai 66:32 [active][ready]
\_ 6:0:2:1 sdaw 67:0 [active][ready]
\_ 4:0:0:1 sdf 8:80 [active][ready]
\_ 4:0:2:1 sdt 65:48 [active][ready]
\_ round-robin 0 [prio=40][enabled]
\_ 4:0:3:1 sdaa 65:160 [active][ready]
\_ 6:0:1:1 sdap 66:144 [active][ready]
\_ 6:0:3:1 sdbd 67:112 [active][ready]
\_ 4:0:1:1 sdm 8:192 [active][ready]
vold07 (3600a0b8000498550000001fd492aa0a0) dm-10 SUN,LCSM100_F
[size=1.4T][features=1 queue_if_no_path][hwhandler=1 rdac][rw]
\_ round-robin 0 [prio=100][active]
\_ 3:0:0:0 sdc 8:32 [active][ready]
vold01 (3600a0b800038b29a000001f0477b118c) dm-8 SUN,LCSM100_F
[size=1.5T][features=1 queue_if_no_path][hwhandler=1 rdac][rw]
\_ round-robin 0 [prio=100][active]
\_ 1:0:0:0 sdb 8:16 [active][ready]
\_ round-robin 0 [prio=0][enabled]
\_ 5:0:0:0 sdag 66:0 [active][ghost]
It's dm-8 that's the problem child. We get the I/O errors on sdag. Will your suggested filter work with all of the above? Would it need to be modified to something of the form
filter = [ "a/dev/mpath/.*/", "a/dev/sda[0-9]/", "a/dev/sdb[0-9]/", "a/dev/md.*/", "r/.*/" ]
instead?
Ben,
in the filter line, include only the block devices which you think LVM should scan during boot or normal operation. So if you have LVM on boot disks sda and sdb, include only those, so maybe this will work:
filter = [ "a/dev/mpath/.*/", "a/dev/sda[0-9]+/", "a/dev/sdb[0-9]+/", "a/dev/md.*/", "r/.*/" ]
If you don't use LVM on the boot disks, but only on multipath devices, use something like this:
filter = [ "a/dev/mpath/.*/", "a/dev/md.*/", "r/.*/" ]
But please, after changing lvm.conf, test everything first using "vgscan -vv" or "vgscan -vvv", so you don't make a mistake.
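For example, something like this (a sketch only; the verbose output format differs between LVM versions) should show whether the ghost path from your listing is now rejected by the filter:

vgscan -vvv 2>&1 | grep -i sdag
# ideally sdag only shows up as skipped/excluded by the filter, not opened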
Daniel
We don't even use LVM on sdb as far as I can tell. Only sda. So with a little tweaking that might work perfectly.
I'll give it a shot. Thank you.
In fact, I think
filter = [ "a/dev/sda[0-9]+/", "r/.*/" ]
is all we really want as we only use LVM on the internal RAID1 array (/dev/sda).
That makes sense, doesn't it? (-:
Yes :) I just wanted to be sure that you don't forget anything, because lvm.conf is copied into the initrd boot image, so any mistakes here are fatal :) This copy occurs during the install of a new kernel, or in step 5 of my original post.
I could not "see" the ST2540... UNTIL I upgraded to the latest kernel, 2.6.32-100.24.1.el5. We are using Oracle Unbreakable Linux. I can't seem to go over 4Gb with a QLogic QLE2562 card to the ST2540? What gives?
QLogic Fibre Channel HBA Driver: 8.03.01.01.32.1-k9
QLogic Fibre Channel HBA Driver: 8.03.01.01.32.1-k9
QLogic QLE2562 - Sun StorageTek 8Gb FC PCIe HBA, dual port
QLogic Fibre Channel HBA Driver: 8.03.01.01.32.1-k9
QLogic QLE2562 - Sun StorageTek 8Gb FC PCIe HBA, dual port
I just thought I would point out that your blog post is still helpful, 2 years after you wrote it. Thanks!