Thursday, July 9, 2009

RDMA (Remote Direct Memory Access) for the Data Center Ethernet

Now that the T11 Technical Committee has completed its Standards Work on FCoE (Fibre Channel over Ethernet), it is probably time to look at additional technologies that will be able to complement FCoE.


But first, a bit of FCoE history: the T11 Ad Hoc Working Group known as FC-BB-5 has been working since 2007 to define a way that the Fibre Channel (FC) protocol can be carried on an Ethernet infrastructure. As part of that effort they managed to get some complementary work going within the IEEE 802.1 committee. That committee defined what has been called Converged Enhanced Ethernet (CEE), aka Data Center Ethernet or Data Center Bridging, which includes Priority-based Flow Control and a Discovery protocol, among other mechanisms that permit vendors to build what T11 calls a “Lossless Ethernet”.
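
To give a rough feel for the Priority-based Flow Control piece, the sketch below builds the kind of per-priority pause frame defined in IEEE 802.1Qbb. The pause quanta and source MAC are illustrative values, and padding to the minimum frame size plus the FCS are left out.

```python
import struct

# Rough sketch of an IEEE 802.1Qbb Priority-based Flow Control (PFC) frame.
# Padding to minimum frame size and the FCS are omitted; the pause quanta
# below are illustrative, not taken from any real switch.

PFC_DEST_MAC = bytes.fromhex("0180c2000001")   # reserved MAC Control multicast address
MAC_CONTROL_ETHERTYPE = 0x8808
PFC_OPCODE = 0x0101

def build_pfc_frame(src_mac: bytes, pause_quanta: list) -> bytes:
    """Build a PFC frame that pauses every priority given a non-zero quantum."""
    assert len(pause_quanta) == 8              # one quantum per 802.1p priority
    enable_vector = 0
    for prio, quanta in enumerate(pause_quanta):
        if quanta:
            enable_vector |= 1 << prio         # bit set -> that priority is paused
    frame = PFC_DEST_MAC + src_mac
    frame += struct.pack("!HHH", MAC_CONTROL_ETHERTYPE, PFC_OPCODE, enable_vector)
    frame += struct.pack("!8H", *pause_quanta)
    return frame

# Example: pause only priority 3 (commonly used for FCoE traffic) for the
# maximum pause time of 0xFFFF quanta.
frame = build_pfc_frame(bytes.fromhex("020000000001"), [0, 0, 0, 0xFFFF, 0, 0, 0, 0])
```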

This “Lossless Ethernet” (CEE) is defined to operate only within a single subnet (no IP routing). When messages are required to be sent beyond a single CEE subnet, one of two things must be true (a small decision sketch follows the list):


  1. The message is NOT an FCoE message and may therefore transit a router as a normal IP message, where losses may occur;
    Or
  2. The message is an FCoE message and may transit what is called an FCF (Fibre Channel Forwarder), which acts like a Router for FC messages. The FC messages may be carried onto another subnet via FCoE on a CEE link or on a physical FC link.
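
Here is a minimal restatement of that forwarding rule in code. The FCoE Ethertype (0x8906) is the real assigned value; the function and return strings are purely illustrative.

```python
FCOE_ETHERTYPE = 0x8906   # Ethertype assigned to FCoE
IPV4_ETHERTYPE = 0x0800

def forward_beyond_subnet(ethertype: int) -> str:
    """Illustrative restatement of the two cases above; the names are hypothetical."""
    if ethertype == FCOE_ETHERTYPE:
        # Case 2: FCoE frames may only leave the CEE subnet through an FCF,
        # which carries the FC frames onto another CEE subnet or a physical FC link.
        return "forward via FCF (Fibre Channel Forwarder)"
    if ethertype == IPV4_ETHERTYPE:
        # Case 1: ordinary IP traffic leaves the lossless domain through a normal
        # IP router, where drops become possible again.
        return "forward via IP router (losses may occur)"
    return "confined to the local CEE subnet"
```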

The advantage of the FCoE Network is that it’s made up of Lossless Links and Switches, and is primarily being defined for 10Gbps CEE fabrics.


The value of this CEE Network is that it can be used by normal Ethernet packets as well as FCoE packets. This means that it is possible to share the same physical network for all networking requirements. This includes not only the normal Client/Server messaging, but also the Server to Server messaging as well as the Storage Input/Output (I/O).



Because one of the keys to Lossless Ethernet is operating only in a single subnet and not passing through an IP router, it will have a very limited distance capability. However, this limited distance matches the current major Server to Server messaging environments.


One can see examples of Server to Server messaging in the general business environment with the Front-end to Back-end messaging requirements as well as Cluster messaging requirements. However, the most demanding of all Server to Server messaging requirements are found within the environments known as High Performance Computing (HPC), where high performance and low latency are most highly prized. In all these environments you will normally find the Server configurations to be within a single Subnet, with as few Switches between the Servers as possible. This is another reason that the CEE environment seems to be very compatible with Enterprise and HPC Server to Server messaging.


Now every vendor’s equipment will, of course, be better than every other vendor’s equipment; however, the goal is clearly to have the total send/receive latency under 2-4 microseconds so that Server to Server messaging can fully exploit it. As part of the needed infrastructure, some vendors will provide 10GE CEE switches that operate in the sub-microsecond range. With these types of goals and equipment, many vendors believe that the latency of the egress/ingress path is the remaining problem to be solved. They believe that the Host side Adapters or Host Network Stacks need to shed the TCP/IP overhead so that these low latencies can be achieved. However, without TCP/IP, Server to Server messaging is only practical if all the connections are built with CEE components and stay within the same CEE subnet. Whether or not TCP/IP needs to be removed will be covered later, but it is safe to say that a CEE Subnet will eliminate many of the retry and error scenarios, so that even TCP/IP will operate well in a CEE environment.
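
To make the 2-4 microsecond target concrete, here is a back-of-the-envelope budget. Every number in it (per-hop switch latency, hop count, total target) is an assumption chosen for illustration, not a measurement.

```python
# Back-of-the-envelope send/receive latency budget; every number here is an
# assumption for illustration, not a measurement.
target_total_us   = 3.0    # middle of the 2-4 microsecond goal stated above
switch_latency_us = 0.5    # assumed per-hop latency of a sub-microsecond 10GE CEE switch
switch_hops       = 2      # e.g. two switches between the two servers

fabric_us      = switch_hops * switch_latency_us
host_budget_us = target_total_us - fabric_us   # what is left for both host stacks/adapters

print(f"fabric: {fabric_us} us, remaining for the two host sides: {host_budget_us} us")
# With roughly 1 microsecond left per host side, the egress/ingress path
# (adapter plus software stack) becomes the dominant term, which is the
# argument for trimming or offloading TCP/IP.
```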


This entire discussion means that since the FCoE protocol is based on the use of a CEE fabric, and since the CEE network includes Priority-based Flow Control, the needs of the Server to Server messaging and the Storage I/O seem to be very compatible.


Taking all of the above into account, the question then comes down to whether the Host adapters in a CEE environment can be shared between the Ethernet based messaging and the Storage I/O. The answer seems to be “YES”, since a number of vendors are producing such devices, which are called Converged Network Adapters (CNAs). These CNA devices are an evolution of the FC host adapters that were called Host Bus Adapters (HBAs), but they now also provide the normal Ethernet Network Interface Controller (NIC) functions and share that NIC with a Fibre Channel (FC) function called FCoE.


The early versions of these CNAs were made up of a NIC Chip, an FC chip, and an FCoE encapsulation chip which interfaced the FC function to the NIC Chip. Since then most CNA vendors have integrated those functions into a single chip (aka ASIC). In any event, the same physical port can be used for normal NIC functions (which might include normal IP and TCP/IP messaging) as well as Storage I/O, all operating at 10Gbps.
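
As a sketch of what that FCoE encapsulation function does, the snippet below wraps an FC frame in an Ethernet frame carrying the FCoE Ethertype (0x8906). The reserved-field sizes, SOF/EOF codes, and trailing padding defined in FC-BB-5 are simplified here and should be treated as indicative only.

```python
import struct

FCOE_ETHERTYPE = 0x8906   # Ethertype assigned to FCoE

def encapsulate_fcoe(dest_mac: bytes, src_mac: bytes, fc_frame: bytes) -> bytes:
    """Simplified FCoE encapsulation: Ethernet header + FCoE header + FC frame.

    The real FC-BB-5 encoding defines a version field, reserved bits, specific
    SOF/EOF ordered-set codes, and trailing padding; the byte layout here only
    indicates the overall structure.
    """
    eth_header  = dest_mac + src_mac + struct.pack("!H", FCOE_ETHERTYPE)
    fcoe_header = struct.pack("!B12x", 0x00)   # version + reserved bytes (simplified)
    sof, eof    = 0x2E, 0x41                   # example start/end-of-frame codes
    return eth_header + fcoe_header + bytes([sof]) + fc_frame + bytes([eof])
```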

The Internet Engineering Task Force (IETF) standards group defined an RDMA protocol that can be used in a general Ethernet environment; this protocol is called iWARP. The “i” stands for Internet and “WARP” is just a cool name (indicating fast, i.e. warp drive from Star Trek) and has no acronym meaning. The iWARP standard included techniques and protocols for operating on a normal IP network; this included Ethernet and any other network type that would handle IP protocols. To accomplish this it was necessary to use TCP/IP as the Transport.
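
The resulting layering is RDMAP over DDP over MPA over TCP/IP. The snippet below only pictures that nesting order; the “headers” are placeholders, not the actual wire formats defined in the IETF RFCs.

```python
# The iWARP suite layers RDMAP over DDP over MPA over TCP/IP. The "headers"
# below are placeholders that show only the nesting order, not the actual
# wire formats defined in the IETF RFCs.

IWARP_LAYERS = [
    "Ethernet",   # or any other link type that carries IP
    "IP",         # routable, so iWARP can cross subnets
    "TCP",        # reliable byte-stream transport required by iWARP
    "MPA",        # Marker PDU Aligned framing on top of TCP
    "DDP",        # Direct Data Placement into registered buffers
    "RDMAP",      # RDMA read/write/send semantics
]

def wrap(payload: bytes) -> bytes:
    """Nest an application payload inside placeholder headers, innermost first."""
    for layer in reversed(IWARP_LAYERS):
        payload = f"[{layer}]".encode() + payload
    return payload

print(wrap(b"app data"))
# b'[Ethernet][IP][TCP][MPA][DDP][RDMAP]app data'
```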

In general, today’s CNAs do not have built-in RDMA functions via iWARP. As a result, installations that wish to have RDMA functions between their servers and also have FC based Storage I/O are not able to consolidate onto the same CNAs. In an Ethernet environment this generally requires an iWARP adapter and a separate CNA/HBA for Storage I/O. It was the search for a complete CEE CNA that caused folks to consider whether it was possible to combine the RDMA functions with the other capabilities of CNAs.

(It should be noted that it is possible to have a convergence of RDMA messaging (via iWARP) and Storage I/O if the installation uses iSCSI for its Storage I/O. However, the enterprise business opportunity is usually found in large physical installations that have FC based Storage devices. Except for HPC environments, the integration of iWARP and iSCSI has not really happened, and it is in the large business enterprises where the large, profitable opportunity exists.)


TCP/IP has all the necessary things built into its protocol to operate on a lossy and error prone network. As a result, some folks have felt that it has too much overhead for a network of servers that may be located on a single subnet. However, until the creation of the CEE capability as part of the FCoE effort there was no practical alternative. Now that a CEE fabric can be created, and since the most strenuous low latency requirements seem to be within a single subnet, there are thoughts about how best to place the RDMA protocol on the CEE network. There are currently two proposals that will be discussed here. The first is to just use iWARP as is, and the second is to create a dWARP (the “d” stands for Data Center). The proposals can be summarized as follows:

  1. Use iWARP – We already have iWARP defined, and TCP/IP will work very well on a CEE network. The fact that there are no message drops and the error rate is low means that TCP/IP does not enter its error path; TCP/IP Slow Start and other such mechanisms are not exercised, so they do not impact the performance of iWARP. Further, the predominant providers of iWARP NICs are offloading the TCP/IP into a TOE (TCP/IP Offload Engine), so the latency is kept to a minimum.

  2. Use dWARP – Regardless of whether CEE reduces the error conditions, there is additional path length needed when TCP/IP is used, and that will affect latency in a negative manner. Also, there are always fights between the Server OS’s native TCP/IP implementation and the adapter vendor’s TCP/IP, so eliminating TCP/IP would remove this needless conflict. Further, some vendors believe that including TOE capabilities requires a lot of state maintenance, so any ASIC implementation will get very large, require much more electrical power, and in general cost more. Hence the wish to create a CEE based RDMA function that does not need a TOE. There are at least two approaches to this (both of which keep the RDMA host Interfaces/APIs the same as iWARP):


    • Encapsulate the RDMA functions either directly onto Ethernet packets or onto packets with IP headers (a hypothetical sketch of the first approach follows this list). In one case you would create your own headers from scratch; in the other, the headers would be “IP like” even if the Ethertype prevented them from being treated and Routed like IP headers. Therefore, one could possibly build a specialty dWARP Router sometime in the future without having to reinvent the things that have been learned about IP Routing, and perhaps reuse the same code for many functions if an IP like header is used.

    • Exploit the capabilities of FCoE by placing the RDMA functions into FC protocols. In this case dWARP could ride along with all the capabilities being built into Data Center Ethernet and even be forwarded (Routed) to other subnets via an FCF, if that function were ever required.

    Either of these dWARP proposals might result in the smallest CNA ASIC chip, since the addition of a TOE would not be required.
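
As promised above, here is a purely hypothetical sketch of the first dWARP option: RDMA segments carried directly in Ethernet frames under a made-up Ethertype. The Ethertype value, header fields, and function name are all invented for illustration and do not reflect any actual T11 definition.

```python
import struct

# Purely hypothetical sketch of the first dWARP option: RDMA segments carried
# directly in Ethernet frames. The Ethertype and header fields are invented
# for illustration and do not reflect any actual T11 definition.

HYPOTHETICAL_DWARP_ETHERTYPE = 0x88FF   # invented value, not an assigned Ethertype

def build_dwarp_frame(dest_mac: bytes, src_mac: bytes,
                      queue_pair: int, sequence: int, rdma_segment: bytes) -> bytes:
    eth_header = dest_mac + src_mac + struct.pack("!H", HYPOTHETICAL_DWARP_ETHERTYPE)
    # With no TCP underneath, the lossless CEE fabric is relied on for no-drop,
    # in-order delivery, so only a thin header (here a queue pair id and a
    # sequence number) is carried instead of a full transport layer.
    dwarp_header = struct.pack("!IH", queue_pair, sequence)
    return eth_header + dwarp_header + rdma_segment
```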


At this moment there is work in the T11 Standards group to define dWARP, and as of this writing it looks like it will take the form of the second dWARP approach above (RDMA carried within FC protocols over FCoE). This means that any installation that wants to use FCoE for its Storage I/O will be able, with little if any additional cost, to have a Low Latency RDMA protocol that can be used on its CEE fabric.



On the other hand, if the installation desires to use RDMA across a routable non-CEE network, then iWARP is currently the only game in town. An example of the usefulness of this capability can be seen in Client/Server messaging, in which Clients are almost always located outside the Data Center. Unfortunately there are very few Client/Server installations that use iWARP, because of:


  • The cost of the adapters is relatively high for a Client system

  • At this time, the predominant desktop OS manufacturer (Microsoft) has decided NOT to implement iWARP in software, as they did for iSCSI (Internet Small Computer System Interface), so the potential reduction in cost for each Client system (which often has CPU cycles to spare) has not been possible. (This is regrettable since the true value of RDMA in a Client/Server environment is the reduction of overhead in the Server, which could bear the additional cost of a physical iWARP adapter.)

  • iWARP client software, outside of the Microsoft environment, is still very embryonic and is not seeing traction in Enterprise environments.


  • As a result, the business for iWARP outside of a Single Subnet has been very small.


On the other hand, if a software client were commercially available -- for desktop systems -- that might also foster the development of Bridges/Proxies that could sit on the edge of the CEE network and map dWARP server packets into iWARP client packets (and vice versa).


Summary


Without a software implementation for Clients being widely available, the primary place where iWARP will be found is within a Data Center and on a single Subnet where Servers send messages to other Servers. That being the case, there is a strong motivation to exploit the capabilities of CEE, integrate these RDMA functions with the current CNAs, and permit the complete convergence of the Data Center Ethernet Fabric (CEE) using dWARP enabled CNAs (Converged Network Adapters without a TOE).


Whether or not a CNA using dWARP is a significantly better performer, and is cheaper, than a CNA using iWARP on CEE is yet to be shown; however, this is where the new RDMA messaging battleground will be fought.