From b93cda0f8143c9f5026bfe15ce71fd1486cc4a3c Mon Sep 17 00:00:00 2001
From: Ales Kozumplik
Date: Thu, 27 Oct 2011 11:25:39 +0200
Subject: Document iscsi and multipath implementations.

---
 docs/iscsi.txt     | 169 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 docs/multipath.txt | 143 +++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 312 insertions(+)
 create mode 100644 docs/iscsi.txt
 create mode 100644 docs/multipath.txt

diff --git a/docs/iscsi.txt b/docs/iscsi.txt
new file mode 100644
index 000000000..ee415826c
--- /dev/null
+++ b/docs/iscsi.txt
@@ -0,0 +1,169 @@
+==================
+iSCSI and Anaconda
+==================
+
+
+Introduction
+------------
+
+An iSCSI device is a SCSI device connected to your computer via a TCP/IP
+network. The communication can be handled either in hardware or in software, or
+as a hybrid --- part software, part hardware.
+
+The terminology:
+
+- 'initiator', the client in the iscsi connection. The computer we are running
+  Anaconda on is typically an initiator.
+- 'target', the storage device behind the network. This is where the data is
+  physically stored and read from. You can turn any Fedora/RHEL machine into a
+  target (or several) via scsi-target-utils.
+- 'HBA' or Host Bus Adapter. A device (PCI card typically) you connect to a
+  computer. It acts as a NIC and if you configure it properly it transparently
+  connects to the target when started and all you can see is a block device on
+  your system.
+- 'software initiator' is what you end up with if you emulate most of what an HBA is
+  doing and just use a regular NIC for the iscsi communication. The modern Linux
+  kernel has a software initiator. To use it, you need the Open-ISCSI software
+  stack [1, 2] installed. It is known as iscsi-initiator-utils in Fedora/RHEL.
+- 'partial offload card'. Similar to an HBA but needs some support from the kernel and
+  iscsi-initiator-utils.
+  The least pleasant to work with, particularly because
+  there is no standard amount of manual setup that needs to be done
+  (some connect to the target just like HBAs, some need you to bring their NIC
+  part up manually etc.). Partial offload cards exist to get better performing
+  I/O with less processor load than with a software initiator.
+- 'iBFT' as in 'iSCSI Boot Firmware Table'. A table in the card's BIOS that
+  contains its network and target settings. This allows the card to configure
+  itself, connect to a target and boot from it before any operating system or a
+  bootloader has the chance. We can also read this information from
+  /sys/firmware/ibft after the system starts and then use it to bring the card
+  up (again) in Linux.
+- 'CHAP' is the authentication used for iSCSI connections. The authentication
+  can happen during target discovery or target login or both. It can happen in
+  both directions too: the initiator authenticates itself to the target and the
+  target is sometimes required to authenticate itself to the initiator.
+
+
+What is expected from Anaconda
+------------------------------
+
+We are expected to:
+
+- use an HBA like an ordinary disk. It is usually smart enough to bring itself
+  up during boot, connect to the target and just act as an ordinary disk.
+- allow creating new software initiator connections in the UI, both IPv4 and IPv6.
+- facilitate bringing up iBFT connections for partial offload cards.
+- install the root and/or /boot filesystems on any iSCSI initiator known to us
+- remember to install dracut-network if we are booting from an iSCSI initiator that
+  requires iscsi-initiator-utils in the ramdisk (most of them do)
+- boot from an iSCSI initiator using dracut; this requires generating an
+  appropriate set of kernel boot arguments for it [3].
+
+
+How Anaconda handles iscsi
+--------------------------
+
+iSCSI comes into play several times while Anaconda does its thing:
+
+In loader, when deciding which NIC we should set up, we check if we have iBFT
+information from one of the cards. If we do, we set that card up with what we
+found in the table; it usually boils down to an IPv4 static or IPv4
+DHCP-obtained address. [4][5]
+
+Next, after the main UI startup during filtering (or storage scan, whichever
+comes first) we start up the iscsi support code in Anaconda [6]. This currently
+involves:
+- manually modprobing related kernel modules
+- starting the iscsiuio daemon (required by some partial offload cards)
+- most importantly, starting the iscsid daemon
+
+All iBFT connections are brought up next by looking at the cards' iBFT data, if
+any. The filtering screen has a feature to add advanced storage devices,
+including iSCSI. Both connection types are handled by libiscsi (see below). The
+brought up iSCSI devices appear as /dev/sdX and are treated as ordinary block
+devices.
+
+When DeviceTree scans all the block devices it uses the udev data (particularly
+the ID_BUS and ID_PATH keys) to decide if the device is an iscsi disk. If it is,
+it is represented with an iScsiDiskDevice class instance. This helps Anaconda
+remember that:
+
+- we need to install dracut-network so the generated dracut image is able to
+  bring up the underlying NIC and establish the iscsi connection.
+- if we are booting from the device we need to pass dracut a proper set of
+  arguments that will allow it to do so.
+
+
+Libiscsi
+--------
+
+How are iSCSI targets found and logged into? Originally Anaconda was just
+running iscsiadm as an external program through execWithRedirect(). This
+ultimately proved awkward, especially due to the difficulties of handling the
+CHAP passphrases this way.
+That is why Hans de Goede, the
+previous maintainer of the Anaconda iscsi subsystem, decided to write a better
+interface and created libiscsi (do not confuse this with the libiscsi.c in
+kernel). Currently libiscsi lives as a couple of patches in the RHEL6
+iscsi-initiator-utils CVS (and in Fedora package git, in a somewhat outdated
+version). Since Anaconda is libiscsi's only client at the moment it is
+maintained by the Anaconda team.
+
+The promise of libiscsi is to provide a simple C/Python API to handle iSCSI
+connections while being somewhat stable and independent of the changes in the
+underlying initiator-utils (while otherwise being tied to it on the
+implementation level).
+
+And at the moment libiscsi does just that. It has a set of functions to discover
+and log in to software targets. It supports making connections through
+partial offload interfaces, but the only discovery method supported at this
+moment is through firmware (iBFT). Its public data structures are independent of
+iscsi-initiator-utils. And there is some python boilerplate that wraps the core
+functions so we can easily call those from Anaconda.
+
+To start nontrivial hacking on libiscsi prepare to spend some time familiarizing
+yourself with the iscsi-initiator-utils internals (it is complex but quite
+nice).
+
+
+Debugging iSCSI bugs
+--------------------
+
+There is some information in anaconda.log and storage.log but libiscsi itself is
+quite bad at logging. Most times useful information can be found by sshing onto
+the machine and inspecting the output of different iscsiadm commands [2][7],
+especially querying the existing sessions and known interfaces.
+
+If for some reason the DeviceTree fails at recognizing iscsi devices as such,
+'udevadm info --export-db' is of interest.
+
+The booting problems are either due to incorrectly generated dracut boot
+arguments or they are simply dracut bugs.
+
+Note that many of the iscsi adapters are installed in different Red Hat machines
+and so the issues can often be reproduced and debugged.
+
+
+Future of iSCSI in Anaconda
+---------------------------
+
+- extend libiscsi to allow initializing arbitrary connections from a partial
+  offload card. Implement the Anaconda UI to utilize this. Difficulty hard.
+- extend libiscsi with device binding support. Difficulty hard.
+- work with the iscsi-initiator-utils maintainer to get libiscsi.c upstream and then
+  to rawhide Fedora. Then the partial offload patches in the RHEL6 Anaconda can
+  be migrated there too and partial offload can be tested. This is something
+  that needs to be done before RHEL7. Difficulty medium.
+- improve libiscsi's logging capabilities. Difficulty easy.
+
+
+
+[1] http://www.open-iscsi.org/
+[2] /usr/share/doc/iscsi-initiator-utils-6.*/README
+[3] man 7 dracut.kernel
+[4] Anaconda git repository, anaconda/loader/ibft.c
+[5] Anaconda git repository, anaconda/loader/net.c, chooseNetworkInterface()
+[6] Anaconda git repository, anaconda/storage/iscsi.py
+[7] 'man 8 iscsiadm'
+
+
+---
+Red Hat Author(s): Ales Kozumplik
diff --git a/docs/multipath.txt b/docs/multipath.txt
new file mode 100644
index 000000000..e8af24ec9
--- /dev/null
+++ b/docs/multipath.txt
@@ -0,0 +1,143 @@
+======================
+Multipath and Anaconda
+======================
+
+
+Introduction
+------------
+
+If there are two block devices in your /dev for which udev reports the same
+'ID_SERIAL' then you can create a certain device mapper device which arbitrarily
+uses those devices to access the physical device. And that is Multipath [1].
+
+For instance, suppose there are
+
+/dev/sda, with ID_SERIAL of 20090ef12700001d2, and
+/dev/sdb, with the same ID_SERIAL.
+
+Those are probably some adapters in the system that just connect your box to a
+storage area network (SAN) somewhere.
+There are perhaps two cables, one for sda,
+one for sdb, and if one of the cables gets cut the other can still transmit
+data. Normally the system won't recognize that sda and sdb have this special
+relation to each other, but by creating a suitable device map using multipath
+tools [2] we can create a DM device /dev/mapper/mpatha and use it for storing
+and retrieving data.
+
+The device mapper then automatically routes IO requests for /dev/mapper/mpatha to
+either sda or sdb depending on the load of the line or network congestion on the
+particular network etc.
+
+The nomenclature I will use here is:
+
+- 'multipath device' for the smart /dev/mapper/mpathX device.
+- 'multipath member device' for the '/dev/sdX' devices. Also 'a path'.
+
+
+What is expected from Anaconda
+------------------------------
+
+Anaconda is expected to:
+- detect that there are multipath devices present
+- coalesce all relevant (e.g. exclusiveDisks) multipath devices.
+- only let the user interact with the multipath devices in filtering,
+  cleardiskssel and partition screens; that is, once we know 'sdc' and 'sdd' are
+  part of 'mpathb', show only 'mpathb' and never the paths.
+- install the bootloader and boot from an mpath device
+- make sure all the multipath devices (whether or not they carry the root
+  filesystem) we used for installation are correctly coalesced in the booted
+  system. This is achieved by generating a suitable /etc/multipath.conf and
+  writing it into sysroot.
+- be able to refer to mpath devices from kickstart, either by name like 'mpatha'
+  or by their id like 'disk/by-id/scsi-20090ef12700001d2'
+
+
+How Anaconda handles multipath
+------------------------------
+
+To detect the presence of multipath devices we rely on multipath tools. We do the
+same for coalescing; see pyanaconda/storage/devicelibs/mpath.py, the file that
+provides some abstraction from mpath tools.
+During the device scan we use the
+'multipath -d' output to find out what devices are going to end up as multipath
+members. The MultipathTopology object also enhances the multipath members' udev
+dictionaries with 'ID_FS_TYPE' set to 'multipath_member' (yes, this is a hack
+surviving from the original mpath implementation, and righteous is he who
+eradicates it). This information is picked up by DeviceTree when populating
+itself. Meaning, if 'sda' and 'sdb' are multipath member devices DeviceTree
+gives them MultipathMember format and creates one MultipathDevice for them (we
+know its name from 'multipath -d'). We end up with:
+
+DiskDevice 'sda', format 'MultipathMember'
+DiskDevice 'sdb', format 'MultipathMember'
+MultipathDevice 'mpatha', parents are 'sda' and 'sdb'.
+
+From then on, Anaconda only deals with the MultipathDevice and generally leaves
+anything with 'MultipathMember' format alone (understand, this is an inert
+format that really is not there but we use it just to mark the device as
+"useless beyond a multipath member", kind of like MDRaidMember).
+
+Partitioning happens over the multipath device and during the preinstallconfig step
+/mnt/sysimage/etc/multipath.conf is created and filled with information about
+the coalesced devices. This is handled in the Storage.write() method. It is
+important that this file and /etc/multipath/wwids (autogenerated by mpath tools)
+make it to the sysimage before the dracut image is generated.
+
+
+Debugging multipath bugs
+------------------------
+
+Unlike with iSCSI, to reproduce a multipath bug one does not need the same
+specific hardware as the reporter. Just find any box connected to a multipathed
+SAN and you are fine (at the moment, connecting to the same iSCSI target through
+its IPv4 and IPv6 address also produces a multipathed device).
+
+On top of that, much of the necessary information is already included in the
+anaconda logs or can be easily extracted from the reporter.
+The things to
+particularly look at are:
+
+- storage.log, the output around 'devices to scan for multipath' and 'devices
+  post multipath scan'. The latter shows a triple with regular disks, disks
+  comprising multipath devices and partitions. This helps you quickly find out
+  what the target system is about.
+
+- this information is also in program.log's calls to 'multipath' [3]. If mpath
+  devices are mysteriously appearing/disappearing between filtering and
+  partitioning screens look at those. 'multipath -ll' is called to display
+  currently coalesced mpath devices, 'multipath -d' is called to show the mpath
+  devices that would be coalesced if we ran 'multipath' now. This is exploited
+  by the device filtering screen.
+
+
+Future of multipath in Anaconda
+-------------------------------
+
+Overall as of RHEL6.2, the shape of multipath in Anaconda is good and, what's
+more important, it is flexible enough to sustain new RFEs and bugs. These are,
+however, the ones that I expect to appear sometime soon:
+
+- enable or disable mpath_friendly_names in kickstart. Disabling friendly names
+  just means the mpath devices are called by their wwid,
+  e.g. /dev/mapper/360334332345343234, not '/dev/mapper/mpathc'. This is
+  straightforward to implement.
+- extend support for mpath devices in kickstart in general. Currently mpath
+  devices should be accepted in most commands but I am sure there will be corner
+  cases. Difficulty medium.
+- [rawhide] stop extending the udev info dictionary with 'ID_FS_TYPE' and
+  'ID_MPATH_NAME'. Doing it this way is asking for trouble if the dictionary
+  of a particular mpath device is reloaded from udev without running it through
+  the MultipathTopology object as it will miss those entries (and DeviceTree
+  depends on them a lot). Difficulty hard, but includes a lot of pleasant
+  refactoring.
+- Improve support for multipathing iSCSI devices.
+  Someone might ask for it one
+  day (in fact, with the NIC bonding they already did), and it will make mpath
+  debugging possible on any virt machine with multiple virt NICs.
+
+
+
+[1] http://akozumpl.fedorapeople.org/archive/Multipass.jpg
+[2] http://christophe.varoqui.free.fr/
+[3] 'man 8 multipath'
+
+
+
+---
+Red Hat Author(s): Ales Kozumplik
-- cgit
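
Note on the multipath introduction in docs/multipath.txt: it defines multipath
candidates as block devices for which udev reports the same 'ID_SERIAL'. The
sketch below illustrates only that idea. It is not how Anaconda detects
multipath (the document says Anaconda relies on the 'multipath -d' output). The
function names are hypothetical, the input is assumed to be text in the
"E: KEY=value" style that 'udevadm info' prints, and the sample records are
made up apart from the serial number quoted in the document.

```python
from collections import defaultdict

# Hypothetical sample data in the "E: KEY=value" style of udevadm info output.
# Records are separated by blank lines. Only the first serial comes from the
# document; the device layout and the second serial are invented.
SAMPLE = """\
E: DEVNAME=/dev/sda
E: DEVTYPE=disk
E: ID_SERIAL=20090ef12700001d2

E: DEVNAME=/dev/sda1
E: DEVTYPE=partition
E: ID_SERIAL=20090ef12700001d2

E: DEVNAME=/dev/sdb
E: DEVTYPE=disk
E: ID_SERIAL=20090ef12700001d2

E: DEVNAME=/dev/sdc
E: DEVTYPE=disk
E: ID_SERIAL=3600508b400105e210000900000490000
"""


def parse_records(text):
    """Split udev-style text into one {KEY: value} dict per device record."""
    records = []
    for chunk in text.strip().split("\n\n"):
        props = {}
        for line in chunk.splitlines():
            # Only property lines ("E: KEY=value") are of interest here.
            if line.startswith("E: ") and "=" in line:
                key, value = line[3:].split("=", 1)
                props[key] = value
        if props:
            records.append(props)
    return records


def multipath_candidates(records):
    """Return {ID_SERIAL: [DEVNAME, ...]} for serials shared by 2+ whole disks."""
    by_serial = defaultdict(list)
    for props in records:
        # Partitions carry the parent's serial, so look at whole disks only.
        if props.get("DEVTYPE") == "disk" and "ID_SERIAL" in props:
            by_serial[props["ID_SERIAL"]].append(props.get("DEVNAME", "?"))
    return {serial: sorted(devs)
            for serial, devs in by_serial.items() if len(devs) > 1}


if __name__ == "__main__":
    for serial, devs in multipath_candidates(parse_records(SAMPLE)).items():
        print(serial, "->", ", ".join(devs))
```

To try this against a real system, the SAMPLE text could be replaced with the
captured output of 'udevadm info --export-db'. A shared serial is only a hint
that the disks are paths to the same storage; the multipath tools remain the
authority on what actually gets coalesced.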