iSCSI Extensions for RDMA (iSER)
================================

This is a detailed description of the iSER tgtd target. It covers
topics from the design to how to set it up manually.

NOTE:
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
To run this iSER target you must have installed the libibverbs
and librdmacm rpms on your system. They will not get brought in
automatically when installing this rpm.
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

See man tgt-admin and the example /etc/tgt/targets.conf file for how
to set up a persistent configuration that is started when the tgtd
service is started (when "service tgtd start" is run).

Copyright (C) 2007 Pete Wyckoff <pw@osc.edu>


Background
----------

There is an IETF standards track RFC 5046 that extends the iSCSI
protocol to work on RDMA-capable networks as well as on traditional
TCP/IP:

    Internet Small Computer System Interface (iSCSI) Extensions for
    Remote Direct Memory Access (RDMA), Mike Ko, October 2007.

RDMA stands for Remote Direct Memory Access, a way of accessing the
memory of a remote node directly through the network without
involving the processor of that remote node. Many network devices
implement some form of RDMA. Two of the more popular RDMA-capable
network technologies are InfiniBand (IB) and iWARP. IB uses its own
physical and network layer, while iWARP sits on top of TCP/IP (or
SCTP).

Using these devices requires a new application programming interface
(API). The Linux kernel has many components of the OpenFabrics
software stack, including APIs for access from user space and drivers
for some popular RDMA-capable NICs, including IB cards with the
Mellanox chipset and iWARP cards from NetEffect, Chelsio, and Ammasso.
Most Linux distributions ship the user space libraries for device
access and RDMA connection management.


RDMA in tgtd
------------

The Linux kernel can act as a SCSI initiator on the iSER transport,
but not as a target. tgtd is a user space target that supports
multiple transports, including iSCSI/TCP, and now iSER on RDMA
devices. The iSER code was written by researchers at the Ohio
Supercomputer Center in early 2007:

    Dennis Dalessandro <dennis@osc.edu>
    Ananth Devulapalli <ananth@osc.edu>
    Pete Wyckoff <pw@osc.edu>

We wanted to use a faster transport to test the capabilities of an
object-based storage device (OSD) emulator we had previously written.
Our cluster has InfiniBand cards, and while running TCP/IP over IB is
possible, the performance is not nearly as good as using native IB
directly.

A report describing this implementation and some performance results
appears in IEEE conference proceedings as:

    Dennis Dalessandro, Ananth Devulapalli and Pete Wyckoff,
    iSER Storage Target for Object-based Storage Devices,
    Proceedings of MSST'07, SNAPI Workshop, San Diego, CA,
    September 2007.

and is available at:

    http://www.osc.edu/~pw/papers/iser-snapi07.pdf

Slides of the talk with more results and analysis are also available
at:

    http://www.osc.edu/~pw/papers/wyckoff-iser-snapi07-talk.pdf

The code mostly lives in iscsi/iscsi_rdma.c, with a few places in
iscsi/iscsid.c that check if the transport is RDMA or not and behave
accordingly. iSCSI already had the idea of a transport, with just the
single TCP one defined. We added the RDMA transport and virtualized
some more functions where TCP and RDMA behave differently.
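To give a feel for what "virtualizing" these functions means, here is
a minimal sketch of a per-transport table of function pointers. The
structure and names below are illustrative only and do not match the
actual tgtd/iscsid code.

    /*
     * Illustrative only: a per-transport operations table of the kind
     * used to let the core iSCSI code stay oblivious to whether a
     * connection runs over TCP or RDMA.  Hypothetical names.
     */
    struct iscsi_conn;                      /* opaque connection handle */

    struct iscsi_transport_ops {
        const char *name;                           /* "iscsi" or "iser" */
        int  (*ep_listen)(void);                    /* accept connections */
        int  (*ep_recv_pdu)(struct iscsi_conn *c);  /* receive a PDU */
        int  (*ep_send_pdu)(struct iscsi_conn *c);  /* send a PDU or data */
        void (*ep_release)(struct iscsi_conn *c);   /* tear down connection */
    };

    /* One table per transport; a connection remembers which table it
     * was created with and the core code calls through it. */
    static struct iscsi_transport_ops iscsi_tcp_ops;   /* TCP/IP  */
    static struct iscsi_transport_ops iscsi_rdma_ops;  /* iSER/RDMA */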
Design Issues
-------------

In general, a SCSI system includes two components, an initiator and a
target. The initiator submits commands and awaits responses. The
target services commands from initiators and returns responses. Data
may flow to the initiator, from the initiator, or both ways
(bidirectional). The iSER specification requires all data transfers
to be started by the target, regardless of direction. In a read
operation, the target uses RDMA Write to move data to the initiator,
while a write operation uses RDMA Read to fetch data from the
initiator.

1. Memory registration

One of the most severe stumbling blocks in moving any application to
take advantage of RDMA features is memory registration. Before using
RDMA, both the sending and receiving buffers must be registered with
the operating system. This operation ensures that the underlying
physical pages will not be moved during the transfer, and provides
the physical addresses of the buffers to the network card. However,
registration itself is time consuming and CPU intensive. Previous
investigations have shown that for InfiniBand, with a nominal
transfer rate of 900 MB/s, the throughput drops to around 500 MB/s
when memory registration and deregistration are included in the
critical path.

Our target implementation uses pre-registered buffers for RDMA
operations. In general such a scheme is difficult to justify due to
the large per-connection resource requirements. However, in this
application it may be appropriate. Since the target always initiates
RDMA operations and never advertises RDMA buffers, it can securely
use one pool of buffers for multiple clients and can manage its
memory resources explicitly. Also, the architecture of the code is
such that the iSCSI layer dictates incoming and outgoing buffer
locations to the storage device layer, so supplying a registered
buffer is relatively easy.
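As a rough illustration of the pre-registration approach, the sketch
below registers one large pool with libibverbs at startup so that no
registration work appears on the data path. It is not the actual tgtd
code; the pool size, names, and error handling are assumptions.

    /* Sketch: register one buffer pool once, reuse it for all
     * connections.  Hypothetical names; abbreviated error handling. */
    #include <stdlib.h>
    #include <stdio.h>
    #include <infiniband/verbs.h>

    #define POOL_SIZE (64 * 1024 * 1024)    /* example: 64 MB total */

    struct rdma_pool {
        void          *base;    /* start of the registered region */
        struct ibv_mr *mr;      /* handle; mr->lkey is used in SGEs */
    };

    static int pool_init(struct rdma_pool *p, struct ibv_pd *pd)
    {
        p->base = malloc(POOL_SIZE);
        if (!p->base)
            return -1;

        /* LOCAL_WRITE lets the HCA place incoming data (receives and
         * RDMA Read responses) into these buffers.  No remote access
         * is granted, since the target never advertises buffers. */
        p->mr = ibv_reg_mr(pd, p->base, POOL_SIZE, IBV_ACCESS_LOCAL_WRITE);
        if (!p->mr) {
            perror("ibv_reg_mr");
            free(p->base);
            return -1;
        }
        return 0;
    }

    static void pool_fini(struct rdma_pool *p)
    {
        ibv_dereg_mr(p->mr);
        free(p->base);
    }

Buffers carved out of such a pool already carry a valid lkey, so the
iSCSI layer can hand them to the storage device layer and later use
them directly in send, receive, and RDMA work requests.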
2. Event management

There is a mismatch between what the tgtd event framework assumes and
what the RDMA notification interface provides. The existing TCP-based
iSCSI target code has one file descriptor per connection, and it is
driven by readability or writeability of the socket. A single poll
system call returns which sockets can be serviced, driving the TCP
code to read or write as appropriate.

The RDMA interface can be used in accordance with this design by
requesting interrupts from the network card on work request
completions. Notifications appear on the file descriptor that
represents a completion queue to which all RDMA events are delivered.
However, the existing sockets-based code goes beyond this and changes
the bitmask of requested events to control its code flow. For
instance, after it finishes sending a response, it will modify the
bitmask to only look for readability. Even if the socket is
writeable, there is no data to write, hence polling for that status
is not useful. The code also disables new message arrival during
command execution as a sort of exclusion facility, again by modifying
the bitmask.

We cannot do this with the RDMA interface. Hence we must maintain an
active list of tasks that have data to write and drive a progress
engine to service them. The need for progress is tracked by a
counter, and the tgtd event loop checks this counter and calls into
the iSER-specific progress code while the counter is still non-zero.
tgtd will block in the poll call when it must wait on network
activity. No dedicated thread is needed for iSER.
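A minimal sketch of this interaction follows; the function and
variable names are hypothetical and the real tgtd loop handles many
more descriptors, but the shape is the same: drain the pending-work
counter, then block in poll on the completion channel.

    /* Sketch only: hypothetical names, single descriptor. */
    #include <poll.h>

    extern int  iser_cq_fd;                   /* completion channel fd  */
    extern int  iser_work_pending;            /* tasks needing progress */
    extern void iser_handle_cq_event(void);   /* reap completions       */
    extern void iser_progress(void);          /* post queued work       */

    void event_loop(void)
    {
        struct pollfd pfd = { .fd = iser_cq_fd, .events = POLLIN };

        for (;;) {
            /* Keep posting queued RDMA and send work until nothing
             * is left, then block waiting for the card. */
            while (iser_work_pending > 0)
                iser_progress();

            if (poll(&pfd, 1, -1) > 0 && (pfd.revents & POLLIN))
                iser_handle_cq_event();
        }
    }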
3. Padding

The iSCSI specification clearly states that all segments in the
protocol data unit (PDU) must be individually padded to four-byte
boundaries. However, the iSER specification is silent on the subject
of padding. It is clear from an implementation perspective that
padding data segments is unnecessary and would add considerable
overhead to implement. (Possibly a memory copy or extra SG entry on
the initiator when sending directly from user memory.) RDMA is used
to move all data, with byte granularity provided by the network. The
need for padding in the TCP case was motivated by the optional marker
support to work around the limitations of the streaming mode of TCP.
IB and iWARP are message-based networks and would never need markers.
And finally, the Linux initiator does not add padding either.


Using iSER
----------

Start the daemon (as root):

    ./tgtd

It will send messages to syslog. You can add "-d 9" to turn on debug
messages.

Configure the running target with one or more devices, using the
tgtadm program you just built (also as root). Full information is in
doc/README.iscsi. Here is a quick-start guide:

    ./tgtimg --op new --device-type disk --type disk --size 1024 \
        --file /tmp/tid1lun1
    ./tgtadm --lld iscsi --mode target \
        --op new --tid 1 --targetname $(hostname)
    ./tgtadm --lld iscsi --mode target \
        --op bind --tid 1 --initiator-address ALL
    ./tgtadm --lld iscsi --mode logicalunit \
        --op new --tid 1 --lun 1 --backing-store /tmp/tid1lun1

To make your initiator use RDMA, make sure the "ib_iser" module is
loaded in your kernel. Then do discovery as usual, over TCP:

    iscsiadm -m discovery -t sendtargets -p $targetip

where $targetip is the IP address of your IPoIB interface. Discovery
traffic will use IPoIB, but login and full feature phase will use
RDMA natively. Then do something like the following to change the
transport type:

    iscsiadm -m node -p $targetip -T $targetname --op update \
        -n node.transport_name -v iser

Next, login as usual:

    iscsiadm -m node -p $targetip -T $targetname --login

And access the new block device, e.g. /dev/sdb.


Errata
------

There is a major bug in the mthca driver in linux kernels before
2.6.21. This includes the popular rhel5 kernels, such as
2.6.18-8.1.6.el5 and possibly later. The critical commit is:

    608d8268be392444f825b4fc8fc7c8b509627129
    IB/mthca: Fix data corruption after FMR unmap on Sinai

If you use single-port memfree cards, SCSI read operations will
frequently result in randomly corrupted memory, leading to bad
application data or unexplainable kernel crashes. Older kernels are
also missing some nice iSCSI changes that avoid crashes in some
situations where the target goes away. Stock kernel.org linux
2.6.22-rc5 and 2.6.23-rc6 have been tested and are known to work.

The Linux kernel iSER initiator is currently lacking support for
bidirectional transfers, and for extended command descriptor blocks
(CDBs). Progress toward adding this is being made, with patches
frequently appearing on the relevant mailing lists.

The Linux kernel iSER initiator uses a different header structure on
its packets than is in the iSER specification. This is described in
an InfiniBand document and is required for that network, which only
supports Zero-Based Addressing. A non-IB initiator that does not need
this header extension will not work with tgtd. There may be some way
to negotiate the header format. Using iWARP hardware devices with the
Linux kernel iSER initiator also will not work due to its reliance on
fast memory registration (FMR), an InfiniBand-only feature.

The current code sizes its per-connection resource consumption based
on negotiated parameters. However, the Linux iSER initiator does not
support negotiation of MaxOutstandingUnexpectedPDUs, so that value is
hard-coded in the target. Also, open-iscsi is hard-coded with a very
small value of TargetRecvDataSegmentLength, so even though the target
would be willing to accept a larger size, it cannot. This may limit
performance of small transfers on high-speed networks: transfers
bigger than 8 kB, but not large enough to amortize a round-trip for
RDMA setup.

The data structures for connection management in the iSER code are
designed to handle multiple devices, but they have never been tested
with such hardware.