Diffstat (limited to 'Documentation/infiniband')
-rw-r--r--  Documentation/infiniband/core_locking.txt | 114
-rw-r--r--  Documentation/infiniband/ipoib.txt        | 102
-rw-r--r--  Documentation/infiniband/sysfs.txt        |  66
-rw-r--r--  Documentation/infiniband/user_mad.txt     | 148
-rw-r--r--  Documentation/infiniband/user_verbs.txt   |  69
5 files changed, 0 insertions, 499 deletions
diff --git a/Documentation/infiniband/core_locking.txt b/Documentation/infiniband/core_locking.txt deleted file mode 100644 index e1678542279..00000000000 --- a/Documentation/infiniband/core_locking.txt +++ /dev/null @@ -1,114 +0,0 @@ -INFINIBAND MIDLAYER LOCKING - - This guide is an attempt to make explicit the locking assumptions - made by the InfiniBand midlayer. It describes the requirements on - both low-level drivers that sit below the midlayer and upper level - protocols that use the midlayer. - -Sleeping and interrupt context - - With the following exceptions, a low-level driver implementation of - all of the methods in struct ib_device may sleep. The exceptions - are any methods from the list: - - create_ah - modify_ah - query_ah - destroy_ah - bind_mw - post_send - post_recv - poll_cq - req_notify_cq - map_phys_fmr - - which may not sleep and must be callable from any context. - - The corresponding functions exported to upper level protocol - consumers: - - ib_create_ah - ib_modify_ah - ib_query_ah - ib_destroy_ah - ib_bind_mw - ib_post_send - ib_post_recv - ib_req_notify_cq - ib_map_phys_fmr - - are therefore safe to call from any context. - - In addition, the function - - ib_dispatch_event - - used by low-level drivers to dispatch asynchronous events through - the midlayer is also safe to call from any context. - -Reentrancy - - All of the methods in struct ib_device exported by a low-level - driver must be fully reentrant. The low-level driver is required to - perform all synchronization necessary to maintain consistency, even - if multiple function calls using the same object are run - simultaneously. - - The IB midlayer does not perform any serialization of function calls. - - Because low-level drivers are reentrant, upper level protocol - consumers are not required to perform any serialization. However, - some serialization may be required to get sensible results. 
For
-  example, a consumer may safely call ib_poll_cq() on multiple CPUs
-  simultaneously. However, the ordering of the work completion
-  information between different calls of ib_poll_cq() is not defined.
-
-Callbacks
-
-  A low-level driver must not perform a callback directly from the
-  same callchain as an ib_device method call. For example, it is not
-  allowed for a low-level driver to call a consumer's completion event
-  handler directly from its post_send method. Instead, the low-level
-  driver should defer this callback by, for example, scheduling a
-  tasklet to perform the callback.
-
-  The low-level driver is responsible for ensuring that multiple
-  completion event handlers for the same CQ are not called
-  simultaneously. The driver must guarantee that only one CQ event
-  handler for a given CQ is running at a time. In other words, the
-  following situation is not allowed:
-
-        CPU1                                    CPU2
-
-  low-level driver ->
-    consumer CQ event callback:
-      /* ... */
-      ib_req_notify_cq(cq, ...);
-                                        low-level driver ->
-      /* ... */                           consumer CQ event callback:
-                                            /* ... */
-      return from CQ event handler
-
-  The context in which completion event and asynchronous event
-  callbacks run is not defined. Depending on the low-level driver, it
-  may be process context, softirq context, or interrupt context.
-  Upper level protocol consumers may not sleep in a callback.
-
-Hot-plug
-
-  A low-level driver announces that a device is ready for use by
-  consumers when it calls ib_register_device(); all initialization
-  must be complete before this call. The device must remain usable
-  until the driver's call to ib_unregister_device() has returned.
-
-  A low-level driver must call ib_register_device() and
-  ib_unregister_device() from process context. It must not hold any
-  semaphores that could cause deadlock if a consumer calls back into
-  the driver across these calls.
-
-  An upper level protocol consumer may begin using an IB device as
-  soon as the add method of its struct ib_client is called for that
-  device. A consumer must finish all cleanup and free all resources
-  relating to a device before returning from the remove method.
-
-  A consumer is permitted to sleep in its add and remove methods.
diff --git a/Documentation/infiniband/ipoib.txt b/Documentation/infiniband/ipoib.txt
deleted file mode 100644
index 64eeb55d0c0..00000000000
--- a/Documentation/infiniband/ipoib.txt
+++ /dev/null
@@ -1,102 +0,0 @@
-IP OVER INFINIBAND
-
-  The ib_ipoib driver is an implementation of the IP over InfiniBand
-  protocol as specified by RFC 4391 and 4392, issued by the IETF ipoib
-  working group. It is a "native" implementation in the sense of
-  setting the interface type to ARPHRD_INFINIBAND and the hardware
-  address length to 20 (earlier proprietary implementations
-  masqueraded to the kernel as ethernet interfaces).
-
-Partitions and P_Keys
-
-  When the IPoIB driver is loaded, it creates one interface for each
-  port using the P_Key at index 0. To create an interface with a
-  different P_Key, write the desired P_Key into the main interface's
-  /sys/class/net/<intf name>/create_child file. For example:
-
-    echo 0x8001 > /sys/class/net/ib0/create_child
-
-  This will create an interface named ib0.8001 with P_Key 0x8001. To
-  remove a subinterface, use the "delete_child" file:
-
-    echo 0x8001 > /sys/class/net/ib0/delete_child
-
-  The P_Key for any interface is given by the "pkey" file, and the
-  main interface for a subinterface is in "parent."
-
-Datagram vs Connected modes
-
-  The IPoIB driver supports two modes of operation: datagram and
-  connected. The mode is set and read through an interface's
-  /sys/class/net/<intf name>/mode file.
-
-  In datagram mode, the IB UD (Unreliable Datagram) transport is used
-  and so the interface MTU is equal to the IB L2 MTU minus the
-  IPoIB encapsulation header (4 bytes).
For example, in a typical IB - fabric with a 2K MTU, the IPoIB MTU will be 2048 - 4 = 2044 bytes. - - In connected mode, the IB RC (Reliable Connected) transport is used. - Connected mode takes advantage of the connected nature of the IB - transport and allows an MTU up to the maximal IP packet size of 64K, - which reduces the number of IP packets needed for handling large UDP - datagrams, TCP segments, etc and increases the performance for large - messages. - - In connected mode, the interface's UD QP is still used for multicast - and communication with peers that don't support connected mode. In - this case, RX emulation of ICMP PMTU packets is used to cause the - networking stack to use the smaller UD MTU for these neighbours. - -Stateless offloads - - If the IB HW supports IPoIB stateless offloads, IPoIB advertises - TCP/IP checksum and/or Large Send (LSO) offloading capability to the - network stack. - - Large Receive (LRO) offloading is also implemented and may be turned - on/off using ethtool calls. Currently LRO is supported only for - checksum offload capable devices. - - Stateless offloads are supported only in datagram mode. - -Interrupt moderation - - If the underlying IB device supports CQ event moderation, one can - use ethtool to set interrupt mitigation parameters and thus reduce - the overhead incurred by handling interrupts. The main code path of - IPoIB doesn't use events for TX completion signaling so only RX - moderation is supported. - -Debugging Information - - By compiling the IPoIB driver with CONFIG_INFINIBAND_IPOIB_DEBUG set - to 'y', tracing messages are compiled into the driver. They are - turned on by setting the module parameters debug_level and - mcast_debug_level to 1. These parameters can be controlled at - runtime through files in /sys/module/ib_ipoib/. - - CONFIG_INFINIBAND_IPOIB_DEBUG also enables files in the debugfs - virtual filesystem. 
By mounting this filesystem, for example with - - mount -t debugfs none /sys/kernel/debug - - it is possible to get statistics about multicast groups from the - files /sys/kernel/debug/ipoib/ib0_mcg and so on. - - The performance impact of this option is negligible, so it - is safe to enable this option with debug_level set to 0 for normal - operation. - - CONFIG_INFINIBAND_IPOIB_DEBUG_DATA enables even more debug output in - the data path when data_debug_level is set to 1. However, even with - the output disabled, enabling this configuration option will affect - performance, because it adds tests to the fast path. - -References - - Transmission of IP over InfiniBand (IPoIB) (RFC 4391) - http://ietf.org/rfc/rfc4391.txt - IP over InfiniBand (IPoIB) Architecture (RFC 4392) - http://ietf.org/rfc/rfc4392.txt - IP over InfiniBand: Connected Mode (RFC 4755) - http://ietf.org/rfc/rfc4755.txt diff --git a/Documentation/infiniband/sysfs.txt b/Documentation/infiniband/sysfs.txt deleted file mode 100644 index ddd519b72ee..00000000000 --- a/Documentation/infiniband/sysfs.txt +++ /dev/null @@ -1,66 +0,0 @@ -SYSFS FILES - - For each InfiniBand device, the InfiniBand drivers create the - following files under /sys/class/infiniband/<device name>: - - node_type - Node type (CA, switch or router) - node_guid - Node GUID - sys_image_guid - System image GUID - - In addition, there is a "ports" subdirectory, with one subdirectory - for each port. 
For example, if mthca0 is a 2-port HCA, there will - be two directories: - - /sys/class/infiniband/mthca0/ports/1 - /sys/class/infiniband/mthca0/ports/2 - - (A switch will only have a single "0" subdirectory for switch port - 0; no subdirectory is created for normal switch ports) - - In each port subdirectory, the following files are created: - - cap_mask - Port capability mask - lid - Port LID - lid_mask_count - Port LID mask count - rate - Port data rate (active width * active speed) - sm_lid - Subnet manager LID for port's subnet - sm_sl - Subnet manager SL for port's subnet - state - Port state (DOWN, INIT, ARMED, ACTIVE or ACTIVE_DEFER) - phys_state - Port physical state (Sleep, Polling, LinkUp, etc) - - There is also a "counters" subdirectory, with files - - VL15_dropped - excessive_buffer_overrun_errors - link_downed - link_error_recovery - local_link_integrity_errors - port_rcv_constraint_errors - port_rcv_data - port_rcv_errors - port_rcv_packets - port_rcv_remote_physical_errors - port_rcv_switch_relay_errors - port_xmit_constraint_errors - port_xmit_data - port_xmit_discards - port_xmit_packets - symbol_error - - Each of these files contains the corresponding value from the port's - Performance Management PortCounters attribute, as described in - section 16.1.3.5 of the InfiniBand Architecture Specification. - - The "pkeys" and "gids" subdirectories contain one file for each - entry in the port's P_Key or GID table respectively. For example, - ports/1/pkeys/10 contains the value at index 10 in port 1's P_Key - table. 
-
-MTHCA
-
-  The Mellanox HCA driver also creates the files:
-
-    hw_rev         - Hardware revision number
-    fw_ver         - Firmware version
-    hca_type       - HCA type: "MT23108", "MT25208 (MT23108 compat mode)",
-                     or "MT25208"
diff --git a/Documentation/infiniband/user_mad.txt b/Documentation/infiniband/user_mad.txt
deleted file mode 100644
index 8a366959f5c..00000000000
--- a/Documentation/infiniband/user_mad.txt
+++ /dev/null
@@ -1,148 +0,0 @@
-USERSPACE MAD ACCESS
-
-Device files
-
-  Each port of each InfiniBand device has a "umad" device and an
-  "issm" device attached. For example, a two-port HCA will have two
-  umad devices and two issm devices, while a switch will have one
-  device of each type (for switch port 0).
-
-Creating MAD agents
-
-  A MAD agent can be created by filling in a struct ib_user_mad_reg_req
-  and then calling the IB_USER_MAD_REGISTER_AGENT ioctl on a file
-  descriptor for the appropriate device file. If the registration
-  request succeeds, a 32-bit id will be returned in the structure.
-  For example:
-
-        struct ib_user_mad_reg_req req = { /* ... */ };
-        ret = ioctl(fd, IB_USER_MAD_REGISTER_AGENT, (char *) &req);
-        if (!ret)
-                my_agent = req.id;
-        else
-                perror("agent register");
-
-  Agents can be unregistered with the IB_USER_MAD_UNREGISTER_AGENT
-  ioctl. Also, all agents registered through a file descriptor will
-  be unregistered when the descriptor is closed.
-
-Receiving MADs
-
-  MADs are received using read(). The receive side now supports
-  RMPP. The buffer passed to read() must be at least one
-  struct ib_user_mad + 256 bytes.
-
-  If the buffer passed is not large enough to hold the received
-  MAD (RMPP), errno is set to ENOSPC and the length of the
-  buffer needed is set in mad.length.
-
-  Example for normal MAD (non RMPP) reads:
-        struct ib_user_mad *mad;
-        mad = malloc(sizeof *mad + 256);
-        ret = read(fd, mad, sizeof *mad + 256);
-        if (ret != sizeof *mad + 256) {
-                perror("read");
-                free(mad);
-        }
-
-  Example for RMPP reads:
-        struct ib_user_mad *mad;
-        mad = malloc(sizeof *mad + 256);
-        ret = read(fd, mad, sizeof *mad + 256);
-        if (ret < 0 && errno == ENOSPC) {
-                /* buffer too small: hdr.length holds the size needed */
-                length = mad->hdr.length;
-                free(mad);
-                mad = malloc(sizeof *mad + length);
-                ret = read(fd, mad, sizeof *mad + length);
-        }
-        if (ret < 0) {
-                perror("read");
-                free(mad);
-        }
-
-  In addition to the actual MAD contents, the other struct ib_user_mad
-  fields will be filled in with information on the received MAD. For
-  example, the remote LID will be in mad.lid.
-
-  If a send times out, a receive will be generated with mad.status set
-  to ETIMEDOUT. Otherwise, when a MAD has been successfully received,
-  mad.status will be 0.
-
-  poll()/select() may be used to wait until a MAD can be read.
-
-Sending MADs
-
-  MADs are sent using write(). The agent ID for sending should be
-  filled into the id field of the MAD, the destination LID should be
-  filled into the lid field, and so on. The send side supports
-  RMPP, so MADs of arbitrary length can be sent. For example:
-
-        struct ib_user_mad *mad;
-
-        mad = malloc(sizeof *mad + mad_length);
-
-        /* fill in mad->data */
-
-        mad->hdr.id  = my_agent;        /* req.id from agent registration */
-        mad->hdr.lid = my_dest;         /* in network byte order... */
-        /* etc. */
-
-        ret = write(fd, mad, sizeof *mad + mad_length);
-        if (ret != sizeof *mad + mad_length)
-                perror("write");
-
-Transaction IDs
-
-  Users of the umad devices can use the lower 32 bits of the
-  transaction ID field (that is, the least significant half of the
-  field in network byte order) in MADs being sent to match
-  request/response pairs. The upper 32 bits are reserved for use by
-  the kernel and will be overwritten before a MAD is sent.
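The split of the transaction ID can be illustrated byte by byte. This is a minimal sketch, assuming the 8-byte field is handled as a raw byte array in network byte order, so bytes 0-3 are the kernel-owned upper half and bytes 4-7 are the half left to the user. Both helper names are hypothetical and are not part of the umad interface.

```c
#include <stdint.h>

/*
 * The transaction ID is 8 bytes in network (big-endian) byte order:
 * tid[0..3] is the upper half, reserved for the kernel;
 * tid[4..7] is the lower half, available to the umad user.
 * Helper names are hypothetical, for illustration only.
 */

/* Store a caller-chosen 32-bit tag in the lower half of the TID. */
void tid_set_user(uint8_t tid[8], uint32_t tag)
{
    tid[4] = (uint8_t)(tag >> 24);
    tid[5] = (uint8_t)(tag >> 16);
    tid[6] = (uint8_t)(tag >> 8);
    tid[7] = (uint8_t)tag;
}

/* Recover the tag from a response, ignoring the kernel-owned upper half. */
uint32_t tid_get_user(const uint8_t tid[8])
{
    return ((uint32_t)tid[4] << 24) | ((uint32_t)tid[5] << 16) |
           ((uint32_t)tid[6] << 8)  |  (uint32_t)tid[7];
}
```

Working on bytes rather than on a 64-bit integer avoids any dependence on host endianness. A caller would set the tag before write() and compare tid_get_user() on each MAD returned by read() to find the matching request.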
- -P_Key Index Handling - - The old ib_umad interface did not allow setting the P_Key index for - MADs that are sent and did not provide a way for obtaining the P_Key - index of received MADs. A new layout for struct ib_user_mad_hdr - with a pkey_index member has been defined; however, to preserve - binary compatibility with older applications, this new layout will - not be used unless the IB_USER_MAD_ENABLE_PKEY ioctl is called - before a file descriptor is used for anything else. - - In September 2008, the IB_USER_MAD_ABI_VERSION will be incremented - to 6, the new layout of struct ib_user_mad_hdr will be used by - default, and the IB_USER_MAD_ENABLE_PKEY ioctl will be removed. - -Setting IsSM Capability Bit - - To set the IsSM capability bit for a port, simply open the - corresponding issm device file. If the IsSM bit is already set, - then the open call will block until the bit is cleared (or return - immediately with errno set to EAGAIN if the O_NONBLOCK flag is - passed to open()). The IsSM bit will be cleared when the issm file - is closed. No read, write or other operations can be performed on - the issm file. - -/dev files - - To create the appropriate character device files automatically with - udev, a rule like - - KERNEL=="umad*", NAME="infiniband/%k" - KERNEL=="issm*", NAME="infiniband/%k" - - can be used. This will create device nodes named - - /dev/infiniband/umad0 - /dev/infiniband/issm0 - - for the first port, and so on. 
The InfiniBand device and port
-  associated with these devices can be determined from the files
-
-    /sys/class/infiniband_mad/umad0/ibdev
-    /sys/class/infiniband_mad/umad0/port
-
-  and
-
-    /sys/class/infiniband_mad/issm0/ibdev
-    /sys/class/infiniband_mad/issm0/port
diff --git a/Documentation/infiniband/user_verbs.txt b/Documentation/infiniband/user_verbs.txt
deleted file mode 100644
index e5092d696da..00000000000
--- a/Documentation/infiniband/user_verbs.txt
+++ /dev/null
@@ -1,69 +0,0 @@
-USERSPACE VERBS ACCESS
-
-  The ib_uverbs module, built by enabling CONFIG_INFINIBAND_USER_VERBS,
-  enables direct userspace access to IB hardware via "verbs," as
-  described in chapter 11 of the InfiniBand Architecture Specification.
-
-  To use the verbs, the libibverbs library, available from
-  http://www.openfabrics.org/, is required. libibverbs contains a
-  device-independent API for using the ib_uverbs interface.
-  libibverbs also requires appropriate device-dependent kernel and
-  userspace drivers for your InfiniBand hardware. For example, to use
-  a Mellanox HCA, you will need the ib_mthca kernel module and the
-  libmthca userspace driver installed.
-
-User-kernel communication
-
-  Userspace communicates with the kernel for slow path, resource
-  management operations via the /dev/infiniband/uverbsN character
-  devices. Fast path operations are typically performed by writing
-  directly to hardware registers mmap()ed into userspace, with no
-  system call or context switch into the kernel.
-
-  Commands are sent to the kernel via write()s on these device files.
-  The ABI is defined in drivers/infiniband/include/ib_user_verbs.h.
-  The structs for commands that require a response from the kernel
-  contain a 64-bit field used to pass a pointer to an output buffer.
-  Status is returned to userspace as the return value of the write()
-  system call.
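The idea of passing an output-buffer pointer through a 64-bit field can be sketched as follows. The struct and function names here are hypothetical and chosen only for illustration; they are not the real ib_uverbs ABI, whose actual layouts live in the header named above.

```c
#include <stdint.h>

/* Hypothetical command layout for illustration; NOT the real ib_uverbs ABI. */
struct demo_cmd {
    uint64_t response;   /* userspace address of the output buffer */
    uint32_t arg;        /* placeholder command argument */
    uint32_t reserved;   /* keeps the struct size a multiple of 8 bytes */
};

/* Store a pointer in the fixed-width field, as a command sender would. */
void demo_cmd_set_response(struct demo_cmd *cmd, void *out)
{
    cmd->response = (uint64_t)(uintptr_t)out;
}

/* Recover the pointer from the field, as the receiving side conceptually does. */
void *demo_cmd_get_response(const struct demo_cmd *cmd)
{
    return (void *)(uintptr_t)cmd->response;
}
```

A plausible reason for a fixed 64-bit field rather than a native pointer is that the struct then has the same size and layout for 32-bit and 64-bit userspace; the source text does not state the motivation, so treat this as an assumption.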
-
-Resource management
-
-  Since creation and destruction of all IB resources is done by
-  commands passed through a file descriptor, the kernel can keep track
-  of which resources are attached to a given userspace context. The
-  ib_uverbs module maintains idr tables that are used to translate
-  between kernel pointers and opaque userspace handles, so that kernel
-  pointers are never exposed to userspace and userspace cannot trick
-  the kernel into following a bogus pointer.
-
-  This also allows the kernel to clean up when a process exits and
-  prevent one process from touching another process's resources.
-
-Memory pinning
-
-  Direct userspace I/O requires that memory regions that are potential
-  I/O targets be kept resident at the same physical address. The
-  ib_uverbs module manages pinning and unpinning memory regions via
-  get_user_pages() and put_page() calls. It also accounts for the
-  amount of memory pinned in the process's locked_vm, and checks that
-  unprivileged processes do not exceed their RLIMIT_MEMLOCK limit.
-
-  Pages that are pinned multiple times are counted each time they are
-  pinned, so the value of locked_vm may be an overestimate of the
-  number of pages pinned by a process.
-
-/dev files
-
-  To create the appropriate character device files automatically with
-  udev, a rule like
-
-    KERNEL=="uverbs*", NAME="infiniband/%k"
-
-  can be used. This will create device nodes named
-
-    /dev/infiniband/uverbs0
-
-  and so on. Since the InfiniBand userspace verbs should be safe for
-  use by non-privileged processes, it may be useful to add an
-  appropriate MODE or GROUP to the udev rule.
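Returning to the RLIMIT_MEMLOCK check described under "Memory pinning": an application can make a rough estimate, before registering memory, of whether a request is likely to fit under its limit. The sketch below is a userspace approximation only. The function name memlock_would_fit is hypothetical, the caller must supply its own running total of pinned bytes, and the kernel's accounting (which works in pages and counts repeated pins each time) remains authoritative.

```c
#include <stddef.h>
#include <sys/resource.h>

/*
 * Hypothetical pre-check, for illustration only: would pinning `bytes`
 * more, on top of `already_locked` bytes, stay within this process's
 * RLIMIT_MEMLOCK soft limit?
 * Returns 1 if it fits, 0 if it does not, -1 if the limit cannot be read.
 */
int memlock_would_fit(size_t already_locked, size_t bytes)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_MEMLOCK, &rl) != 0)
        return -1;
    if (rl.rlim_cur == RLIM_INFINITY)
        return 1;                       /* no limit configured */
    if (already_locked > rl.rlim_cur)
        return 0;                       /* already over the limit */
    /* compare against the remaining headroom to avoid overflow */
    return bytes <= rl.rlim_cur - already_locked;
}
```

A result of 1 does not guarantee that registration succeeds, since other locked memory in the process also counts against the limit; a result of 0 suggests raising the limit (for example with ulimit -l) before retrying.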