Diffstat (limited to 'Documentation/infiniband')
-rw-r--r-- Documentation/infiniband/core_locking.txt 114
-rw-r--r-- Documentation/infiniband/ipoib.txt 102
-rw-r--r-- Documentation/infiniband/sysfs.txt 66
-rw-r--r-- Documentation/infiniband/user_mad.txt 148
-rw-r--r-- Documentation/infiniband/user_verbs.txt 69
5 files changed, 0 insertions, 499 deletions
diff --git a/Documentation/infiniband/core_locking.txt b/Documentation/infiniband/core_locking.txt
deleted file mode 100644
index e1678542279..00000000000
--- a/Documentation/infiniband/core_locking.txt
+++ /dev/null
@@ -1,114 +0,0 @@
-INFINIBAND MIDLAYER LOCKING
-
- This guide is an attempt to make explicit the locking assumptions
- made by the InfiniBand midlayer. It describes the requirements on
- both low-level drivers that sit below the midlayer and upper level
- protocols that use the midlayer.
-
-Sleeping and interrupt context
-
- With the following exceptions, a low-level driver implementation of
- all of the methods in struct ib_device may sleep. The exceptions
- are any methods from the list:
-
- create_ah
- modify_ah
- query_ah
- destroy_ah
- bind_mw
- post_send
- post_recv
- poll_cq
- req_notify_cq
- map_phys_fmr
-
- which may not sleep and must be callable from any context.
-
- The corresponding functions exported to upper level protocol
- consumers:
-
- ib_create_ah
- ib_modify_ah
- ib_query_ah
- ib_destroy_ah
- ib_bind_mw
- ib_post_send
- ib_post_recv
- ib_req_notify_cq
- ib_map_phys_fmr
-
- are therefore safe to call from any context.
-
- In addition, the function
-
- ib_dispatch_event
-
- used by low-level drivers to dispatch asynchronous events through
- the midlayer is also safe to call from any context.
-
-Reentrancy
-
- All of the methods in struct ib_device exported by a low-level
- driver must be fully reentrant. The low-level driver is required to
- perform all synchronization necessary to maintain consistency, even
- if multiple function calls using the same object are run
- simultaneously.
-
- The IB midlayer does not perform any serialization of function calls.
-
- Because low-level drivers are reentrant, upper level protocol
- consumers are not required to perform any serialization. However,
- some serialization may be required to get sensible results. For
- example, a consumer may safely call ib_poll_cq() on multiple CPUs
- simultaneously. However, the ordering of the work completion
- information between different calls of ib_poll_cq() is not defined.
-
-Callbacks
-
- A low-level driver must not perform a callback directly from the
- same callchain as an ib_device method call. For example, it is not
- allowed for a low-level driver to call a consumer's completion event
- handler directly from its post_send method. Instead, the low-level
- driver should defer this callback by, for example, scheduling a
- tasklet to perform the callback.
-
- The low-level driver is responsible for ensuring that multiple
- completion event handlers for the same CQ are not called
- simultaneously. The driver must guarantee that only one CQ event
- handler for a given CQ is running at a time. In other words, the
- following situation is not allowed:
-
-        CPU1                                    CPU2
-
-  low-level driver ->
-    consumer CQ event callback:
-      /* ... */
-      ib_req_notify_cq(cq, ...);
-                                        low-level driver ->
-      /* ... */                           consumer CQ event callback:
-                                            /* ... */
-      return from CQ event handler
-
- The context in which completion event and asynchronous event
- callbacks run is not defined. Depending on the low-level driver, it
- may be process context, softirq context, or interrupt context.
- Upper level protocol consumers may not sleep in a callback.
-
-Hot-plug
-
- A low-level driver announces that a device is ready for use by
- consumers when it calls ib_register_device(); all initialization
- must be complete before this call. The device must remain usable
- until the driver's call to ib_unregister_device() has returned.
-
- A low-level driver must call ib_register_device() and
- ib_unregister_device() from process context. It must not hold any
- semaphores that could cause deadlock if a consumer calls back into
- the driver across these calls.
-
- An upper level protocol consumer may begin using an IB device as
- soon as the add method of its struct ib_client is called for that
- device. A consumer must finish all cleanup and free all resources
- relating to a device before returning from the remove method.
-
- A consumer is permitted to sleep in its add and remove methods.
diff --git a/Documentation/infiniband/ipoib.txt b/Documentation/infiniband/ipoib.txt
deleted file mode 100644
index 64eeb55d0c0..00000000000
--- a/Documentation/infiniband/ipoib.txt
+++ /dev/null
@@ -1,102 +0,0 @@
-IP OVER INFINIBAND
-
- The ib_ipoib driver is an implementation of the IP over InfiniBand
- protocol as specified by RFC 4391 and 4392, issued by the IETF ipoib
- working group. It is a "native" implementation in the sense of
- setting the interface type to ARPHRD_INFINIBAND and the hardware
- address length to 20 (earlier proprietary implementations
- masqueraded to the kernel as ethernet interfaces).
-
-Partitions and P_Keys
-
- When the IPoIB driver is loaded, it creates one interface for each
- port using the P_Key at index 0. To create an interface with a
- different P_Key, write the desired P_Key into the main interface's
- /sys/class/net/<intf name>/create_child file. For example:
-
- echo 0x8001 > /sys/class/net/ib0/create_child
-
- This will create an interface named ib0.8001 with P_Key 0x8001. To
- remove a subinterface, use the "delete_child" file:
-
- echo 0x8001 > /sys/class/net/ib0/delete_child
-
- The P_Key for any interface is given by the "pkey" file, and the
- main interface for a subinterface is in "parent."
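The naming rule in the example above (parent name, a dot, then the P_Key as four hex digits) can be sketched in C. The helper name ipoib_child_name() is hypothetical, and the "%s.%04x" format is an assumption inferred from the ib0.8001 example rather than a documented interface.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Build the subinterface name for a parent interface and a 16-bit P_Key,
 * e.g. "ib0" and 0x8001 give "ib0.8001". The helper name and the format
 * string are assumptions based on the example in the text.
 * Returns 0 on success, -1 if the buffer is too small.
 */
static int ipoib_child_name(char *buf, size_t len, const char *parent,
                            unsigned int pkey)
{
    int n;

    assert(buf != NULL && parent != NULL);
    n = snprintf(buf, len, "%s.%04x", parent, pkey & 0xffffu);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

A caller could use the resulting name to locate /sys/class/net/<name>/pkey or /sys/class/net/<name>/parent for the new subinterface.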
-
-Datagram vs Connected modes
-
- The IPoIB driver supports two modes of operation: datagram and
- connected. The mode is set and read through an interface's
- /sys/class/net/<intf name>/mode file.
-
- In datagram mode, the IB UD (Unreliable Datagram) transport is used
- and so the interface MTU is equal to the IB L2 MTU minus the
- IPoIB encapsulation header (4 bytes). For example, in a typical IB
- fabric with a 2K MTU, the IPoIB MTU will be 2048 - 4 = 2044 bytes.
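The datagram-mode arithmetic above can be written as a small C helper. This is only a sketch: the names IPOIB_ENCAP_LEN and ipoib_ud_mtu() are made up for illustration, and the 4-byte header size is taken directly from the text.

```c
#include <assert.h>

/* Size of the IPoIB encapsulation header in bytes, as stated in the text. */
#define IPOIB_ENCAP_LEN 4

/* Datagram-mode IPoIB MTU for a given IB link MTU, both in bytes. */
static int ipoib_ud_mtu(int ib_mtu)
{
    assert(ib_mtu > IPOIB_ENCAP_LEN);
    return ib_mtu - IPOIB_ENCAP_LEN;
}
```

For the 2K fabric in the example, ipoib_ud_mtu(2048) yields 2044.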
-
- In connected mode, the IB RC (Reliable Connected) transport is used.
- Connected mode takes advantage of the connected nature of the IB
- transport and allows an MTU up to the maximal IP packet size of 64K,
- which reduces the number of IP packets needed for handling large UDP
- datagrams, TCP segments, etc., and increases the performance for large
- messages.
-
- In connected mode, the interface's UD QP is still used for multicast
- and communication with peers that don't support connected mode. In
- this case, RX emulation of ICMP PMTU packets is used to cause the
- networking stack to use the smaller UD MTU for these neighbours.
-
-Stateless offloads
-
- If the IB HW supports IPoIB stateless offloads, IPoIB advertises
- TCP/IP checksum and/or Large Send (LSO) offloading capability to the
- network stack.
-
- Large Receive (LRO) offloading is also implemented and may be turned
- on/off using ethtool calls. Currently LRO is supported only for
- checksum offload capable devices.
-
- Stateless offloads are supported only in datagram mode.
-
-Interrupt moderation
-
- If the underlying IB device supports CQ event moderation, one can
- use ethtool to set interrupt mitigation parameters and thus reduce
- the overhead incurred by handling interrupts. The main code path of
- IPoIB doesn't use events for TX completion signaling so only RX
- moderation is supported.
-
-Debugging Information
-
- By compiling the IPoIB driver with CONFIG_INFINIBAND_IPOIB_DEBUG set
- to 'y', tracing messages are compiled into the driver. They are
- turned on by setting the module parameters debug_level and
- mcast_debug_level to 1. These parameters can be controlled at
- runtime through files in /sys/module/ib_ipoib/.
-
- CONFIG_INFINIBAND_IPOIB_DEBUG also enables files in the debugfs
- virtual filesystem. By mounting this filesystem, for example with
-
- mount -t debugfs none /sys/kernel/debug
-
- it is possible to get statistics about multicast groups from the
- files /sys/kernel/debug/ipoib/ib0_mcg and so on.
-
- The performance impact of this option is negligible, so it
- is safe to enable this option with debug_level set to 0 for normal
- operation.
-
- CONFIG_INFINIBAND_IPOIB_DEBUG_DATA enables even more debug output in
- the data path when data_debug_level is set to 1. However, even with
- the output disabled, enabling this configuration option will affect
- performance, because it adds tests to the fast path.
-
-References
-
- Transmission of IP over InfiniBand (IPoIB) (RFC 4391)
- http://ietf.org/rfc/rfc4391.txt
- IP over InfiniBand (IPoIB) Architecture (RFC 4392)
- http://ietf.org/rfc/rfc4392.txt
- IP over InfiniBand: Connected Mode (RFC 4755)
- http://ietf.org/rfc/rfc4755.txt
diff --git a/Documentation/infiniband/sysfs.txt b/Documentation/infiniband/sysfs.txt
deleted file mode 100644
index ddd519b72ee..00000000000
--- a/Documentation/infiniband/sysfs.txt
+++ /dev/null
@@ -1,66 +0,0 @@
-SYSFS FILES
-
- For each InfiniBand device, the InfiniBand drivers create the
- following files under /sys/class/infiniband/<device name>:
-
- node_type - Node type (CA, switch or router)
- node_guid - Node GUID
- sys_image_guid - System image GUID
-
- In addition, there is a "ports" subdirectory, with one subdirectory
- for each port. For example, if mthca0 is a 2-port HCA, there will
- be two directories:
-
- /sys/class/infiniband/mthca0/ports/1
- /sys/class/infiniband/mthca0/ports/2
-
- (A switch will only have a single "0" subdirectory for switch port
- 0; no subdirectory is created for normal switch ports.)
-
- In each port subdirectory, the following files are created:
-
- cap_mask - Port capability mask
- lid - Port LID
- lid_mask_count - Port LID mask count
- rate - Port data rate (active width * active speed)
- sm_lid - Subnet manager LID for port's subnet
- sm_sl - Subnet manager SL for port's subnet
- state - Port state (DOWN, INIT, ARMED, ACTIVE or ACTIVE_DEFER)
- phys_state - Port physical state (Sleep, Polling, LinkUp, etc)
-
- There is also a "counters" subdirectory, with files
-
- VL15_dropped
- excessive_buffer_overrun_errors
- link_downed
- link_error_recovery
- local_link_integrity_errors
- port_rcv_constraint_errors
- port_rcv_data
- port_rcv_errors
- port_rcv_packets
- port_rcv_remote_physical_errors
- port_rcv_switch_relay_errors
- port_xmit_constraint_errors
- port_xmit_data
- port_xmit_discards
- port_xmit_packets
- symbol_error
-
- Each of these files contains the corresponding value from the port's
- Performance Management PortCounters attribute, as described in
- section 16.1.3.5 of the InfiniBand Architecture Specification.
-
- The "pkeys" and "gids" subdirectories contain one file for each
- entry in the port's P_Key or GID table respectively. For example,
- ports/1/pkeys/10 contains the value at index 10 in port 1's P_Key
- table.
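As a sketch of how a program might use this layout, the C helpers below build the path of one P_Key table entry and read it. The function names are hypothetical, and the assumption that the file holds the value as hexadecimal text (for example 0xffff) is not stated in the text above.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Format the sysfs path of one P_Key table entry.
 * Returns 0 on success, -1 if the buffer is too small.
 */
static int ib_pkey_path(char *buf, size_t len, const char *dev,
                        int port, int index)
{
    int n;

    assert(buf != NULL && dev != NULL);
    n = snprintf(buf, len, "/sys/class/infiniband/%s/ports/%d/pkeys/%d",
                 dev, port, index);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/*
 * Read one P_Key table entry. Assumes the file contains hexadecimal text.
 * Returns 0 on success, -1 if the file is missing or cannot be parsed.
 */
static int ib_read_pkey(const char *dev, int port, int index,
                        unsigned int *pkey)
{
    char path[256];
    FILE *f;
    int ok;

    if (ib_pkey_path(path, sizeof path, dev, port, index) < 0)
        return -1;
    f = fopen(path, "r");
    if (!f)
        return -1;
    ok = fscanf(f, "%x", pkey) == 1;
    fclose(f);
    return ok ? 0 : -1;
}
```

With the arguments "mthca0", 1, 10, ib_pkey_path() produces the path of the ports/1/pkeys/10 entry mentioned above.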
-
-MTHCA
-
- The Mellanox HCA driver also creates the files:
-
- hw_rev - Hardware revision number
- fw_ver - Firmware version
- hca_type - HCA type: "MT23108", "MT25208 (MT23108 compat mode)",
- or "MT25208"
diff --git a/Documentation/infiniband/user_mad.txt b/Documentation/infiniband/user_mad.txt
deleted file mode 100644
index 8a366959f5c..00000000000
--- a/Documentation/infiniband/user_mad.txt
+++ /dev/null
@@ -1,148 +0,0 @@
-USERSPACE MAD ACCESS
-
-Device files
-
- Each port of each InfiniBand device has a "umad" device and an
- "issm" device attached. For example, a two-port HCA will have two
- umad devices and two issm devices, while a switch will have one
- device of each type (for switch port 0).
-
-Creating MAD agents
-
- A MAD agent can be created by filling in a struct ib_user_mad_reg_req
- and then calling the IB_USER_MAD_REGISTER_AGENT ioctl on a file
- descriptor for the appropriate device file. If the registration
- request succeeds, a 32-bit id will be returned in the structure.
- For example:
-
- struct ib_user_mad_reg_req req = { /* ... */ };
- ret = ioctl(fd, IB_USER_MAD_REGISTER_AGENT, (char *) &req);
- if (!ret)
-     my_agent = req.id;
- else
-     perror("agent register");
-
- Agents can be unregistered with the IB_USER_MAD_UNREGISTER_AGENT
- ioctl. Also, all agents registered through a file descriptor will
- be unregistered when the descriptor is closed.
-
-Receiving MADs
-
- MADs are received using read(). The receive side now supports
- RMPP. The buffer passed to read() must be at least one
- struct ib_user_mad + 256 bytes.
-
- If the buffer passed is not large enough to hold the received
- MAD (RMPP), errno is set to ENOSPC and the length of the
- buffer needed is set in mad.length.
-
- Example for normal MAD (non RMPP) reads:
- struct ib_user_mad *mad;
- mad = malloc(sizeof *mad + 256);
- ret = read(fd, mad, sizeof *mad + 256);
- if (ret != sizeof *mad + 256) {
-     perror("read");
-     free(mad);
- }
-
- Example for RMPP reads:
- struct ib_user_mad *mad;
- mad = malloc(sizeof *mad + 256);
- ret = read(fd, mad, sizeof *mad + 256);
- if (ret < 0 && errno == ENOSPC) {
-     /* Buffer too small: retry with the length the kernel reported. */
-     length = mad->hdr.length;
-     free(mad);
-     mad = malloc(sizeof *mad + length);
-     ret = read(fd, mad, sizeof *mad + length);
- }
- if (ret < 0) {
-     perror("read");
-     free(mad);
- }
-
- In addition to the actual MAD contents, the other struct ib_user_mad
- fields will be filled in with information on the received MAD. For
- example, the remote LID will be in mad.lid.
-
- If a send times out, a receive will be generated with mad.status set
- to ETIMEDOUT. Otherwise when a MAD has been successfully received,
- mad.status will be 0.
-
- poll()/select() may be used to wait until a MAD can be read.
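A minimal sketch of the poll() approach follows. It works on any readable file descriptor; using it on a umad device is the intended case but is not exercised here. The helper name read_with_timeout() and the choice to report a timeout as EAGAIN are assumptions made for this illustration.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Wait up to timeout_ms for fd to become readable, then read() once.
 * Returns the read() result, or -1 with errno set to EAGAIN on timeout
 * (or to whatever poll() reported on failure).
 */
static ssize_t read_with_timeout(int fd, void *buf, size_t len,
                                 int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    int n;

    assert(fd >= 0 && buf != NULL);
    n = poll(&pfd, 1, timeout_ms);
    if (n < 0)
        return -1;              /* errno set by poll() */
    if (n == 0) {
        errno = EAGAIN;         /* nothing arrived before the timeout */
        return -1;
    }
    return read(fd, buf, len);
}
```

For a umad descriptor the buffer would be sized as described above, that is, at least one struct ib_user_mad plus 256 bytes.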
-
-Sending MADs
-
- MADs are sent using write(). The agent ID for sending should be
- filled into the id field of the MAD, the destination LID should be
- filled into the lid field, and so on. The send side supports RMPP,
- so MADs of arbitrary length can be sent. For example:
-
- struct ib_user_mad *mad;
-
- mad = malloc(sizeof *mad + mad_length);
-
- /* fill in mad->data */
-
- mad->hdr.id = my_agent; /* req.id from agent registration */
- mad->hdr.lid = my_dest; /* in network byte order... */
- /* etc. */
-
- ret = write(fd, mad, sizeof *mad + mad_length);
- if (ret != sizeof *mad + mad_length)
-     perror("write");
-
-Transaction IDs
-
- Users of the umad devices can use the lower 32 bits of the
- transaction ID field (that is, the least significant half of the
- field in network byte order) in MADs being sent to match
- request/response pairs. The upper 32 bits are reserved for use by
- the kernel and will be overwritten before a MAD is sent.
-
-P_Key Index Handling
-
- The old ib_umad interface did not allow setting the P_Key index for
- MADs that are sent and did not provide a way for obtaining the P_Key
- index of received MADs. A new layout for struct ib_user_mad_hdr
- with a pkey_index member has been defined; however, to preserve
- binary compatibility with older applications, this new layout will
- not be used unless the IB_USER_MAD_ENABLE_PKEY ioctl is called
- before a file descriptor is used for anything else.
-
- In September 2008, the IB_USER_MAD_ABI_VERSION will be incremented
- to 6, the new layout of struct ib_user_mad_hdr will be used by
- default, and the IB_USER_MAD_ENABLE_PKEY ioctl will be removed.
-
-Setting IsSM Capability Bit
-
- To set the IsSM capability bit for a port, simply open the
- corresponding issm device file. If the IsSM bit is already set,
- then the open call will block until the bit is cleared (or return
- immediately with errno set to EAGAIN if the O_NONBLOCK flag is
- passed to open()). The IsSM bit will be cleared when the issm file
- is closed. No read, write or other operations can be performed on
- the issm file.
-
-/dev files
-
- To create the appropriate character device files automatically with
- udev, a rule like
-
- KERNEL=="umad*", NAME="infiniband/%k"
- KERNEL=="issm*", NAME="infiniband/%k"
-
- can be used. This will create device nodes named
-
- /dev/infiniband/umad0
- /dev/infiniband/issm0
-
- for the first port, and so on. The InfiniBand device and port
- associated with these devices can be determined from the files
-
- /sys/class/infiniband_mad/umad0/ibdev
- /sys/class/infiniband_mad/umad0/port
-
- and
-
- /sys/class/infiniband_mad/issm0/ibdev
- /sys/class/infiniband_mad/issm0/port
diff --git a/Documentation/infiniband/user_verbs.txt b/Documentation/infiniband/user_verbs.txt
deleted file mode 100644
index e5092d696da..00000000000
--- a/Documentation/infiniband/user_verbs.txt
+++ /dev/null
@@ -1,69 +0,0 @@
-USERSPACE VERBS ACCESS
-
- The ib_uverbs module, built by enabling CONFIG_INFINIBAND_USER_VERBS,
- enables direct userspace access to IB hardware via "verbs," as
- described in chapter 11 of the InfiniBand Architecture Specification.
-
- To use the verbs, the libibverbs library, available from
- http://www.openfabrics.org/, is required. libibverbs contains a
- device-independent API for using the ib_uverbs interface.
- libibverbs also requires appropriate device-dependent kernel and
- userspace drivers for your InfiniBand hardware. For example, to use
- a Mellanox HCA, you will need the ib_mthca kernel module and the
- libmthca userspace driver to be installed.
-
-User-kernel communication
-
- Userspace communicates with the kernel for slow path, resource
- management operations via the /dev/infiniband/uverbsN character
- devices. Fast path operations are typically performed by writing
- directly to hardware registers mmap()ed into userspace, with no
- system call or context switch into the kernel.
-
- Commands are sent to the kernel via write()s on these device files.
- The ABI is defined in drivers/infiniband/include/ib_user_verbs.h.
- The structs for commands that require a response from the kernel
- contain a 64-bit field used to pass a pointer to an output buffer.
- Status is returned to userspace as the return value of the write()
- system call.
-
-Resource management
-
- Since creation and destruction of all IB resources is done by
- commands passed through a file descriptor, the kernel can keep track
- of which resources are attached to a given userspace context. The
- ib_uverbs module maintains idr tables that are used to translate
- between kernel pointers and opaque userspace handles, so that kernel
- pointers are never exposed to userspace and userspace cannot trick
- the kernel into following a bogus pointer.
-
- This also allows the kernel to clean up when a process exits and
- prevent one process from touching another process's resources.
-
-Memory pinning
-
- Direct userspace I/O requires that memory regions that are potential
- I/O targets be kept resident at the same physical address. The
- ib_uverbs module manages pinning and unpinning memory regions via
- get_user_pages() and put_page() calls. It also accounts for the
- amount of memory pinned in the process's locked_vm, and checks that
- unprivileged processes do not exceed their RLIMIT_MEMLOCK limit.
-
- Pages that are pinned multiple times are counted each time they are
- pinned, so the value of locked_vm may be an overestimate of the
- number of pages pinned by a process.
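An application can estimate in advance whether a registration is likely to fit under its limit. The sketch below uses getrlimit(); the helper name and the idea of tracking the already pinned byte count in the application are assumptions, and the kernel's own accounting, done as described above, remains authoritative.

```c
#include <assert.h>
#include <sys/resource.h>

/*
 * Check whether pinning `extra` more bytes, on top of `locked` bytes the
 * caller believes are already pinned, stays within the soft
 * RLIMIT_MEMLOCK. Returns 1 if it fits, 0 if not, -1 if the limit cannot
 * be read.
 */
static int memlock_would_fit(unsigned long long locked,
                             unsigned long long extra)
{
    struct rlimit rl;
    unsigned long long limit;

    if (getrlimit(RLIMIT_MEMLOCK, &rl) != 0)
        return -1;
    if (rl.rlim_cur == RLIM_INFINITY)
        return 1;
    assert(rl.rlim_cur <= rl.rlim_max); /* soft limit never exceeds hard */
    limit = (unsigned long long)rl.rlim_cur;
    /* Written to avoid overflow in locked + extra. */
    if (extra > limit || locked > limit - extra)
        return 0;
    return 1;
}
```

Because pages pinned more than once are counted each time, a caller tracking its own total should count them the same way.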
-
-/dev files
-
- To create the appropriate character device files automatically with
- udev, a rule like
-
- KERNEL=="uverbs*", NAME="infiniband/%k"
-
- can be used. This will create device nodes named
-
- /dev/infiniband/uverbs0
-
- and so on. Since the InfiniBand userspace verbs should be safe for
- use by non-privileged processes, it may be useful to add an
- appropriate MODE or GROUP to the udev rule.