
Thursday, November 17, 2011

"CLOUD" Infrastructure as a Service (IaaS) and FCoE VN2VN

When the new FCoE (Fibre Channel over Ethernet) VN2VN (aka Direct End Node to End Node) protocol was defined in the T11.3 FC-BB-6 Ad Hoc Working Group, it was assumed that it would find a niche in low- to medium-sized IT organizations that wanted compatibility with Fibre Channel (FC). Though that is still valid, it looks as though it may also be important to some of the new "Cloud" services that provide Infrastructure as a Service (IaaS).

FCoE VN2VN is an additional FCoE protocol which permits FCoE End Nodes such as Servers acting as "Initiators" and FCoE End Nodes such as Storage Controllers acting as "Targets" to either attach directly to each other or attach with only lossless Ethernet switches between them (perhaps as few as one switch between the End Nodes).  This form of FCoE does not require any FC/FCoE networking equipment.

FCoE VN2VN permits the IaaS organization to provide storage interconnectivity that is compatible with FC and/or FCoE.  FCoE VN2VN capability can be used to give a customer an FCoE VN2VN connection between the servers and the storage supplied by the IaaS provider. This VN2VN interconnect can provide the fastest end-to-end connection with as few "hops" as possible.  That is, the data path between the server and the storage unit may pass through perhaps as few as one lossless Ethernet switch. No FCF (Fibre Channel Forwarder) is required, which means that no additional FC switching processes and overhead are involved in the data path.  In addition, the lossless Ethernet switch can be provided by a great number of vendors, thus permitting the lowest possible cost data path.  This means that the IaaS provider can give a customer the fastest interconnect at the lowest possible cost.

Enabling this type of capability has certain implications for the configuration of the "Cloud" installation. For example, if the customer would like to purchase infrastructure where the required servers and storage can fit into a single rack (or even a 2-3 rack side-by-side configuration), they are a candidate for FCoE VN2VN interconnection.  In such a configuration a lossless Ethernet switch can be placed at the top of the rack (or rack set) and Ethernet connections run from the servers to the switch and then to the storage units. For total installation flexibility, the Top-of-Rack (ToR) switches may also be physically interconnected to an End-of-Row (EoR) Director class FCoE switch that may have full FCF capabilities.  However, the EoR Director would have no direct involvement with the data path for this IaaS rack set.  It is also possible to have a ToR switch at the top of each rack and have them interconnected with each other. In this case, the data path may go through two ToR switches but would still not need to go through the EoR FCoE Director.

So depending on the needs of the customer, and the physical configuration required by the provider, it is possible to obtain the minimum switch/"hop" count and the lowest latency interconnect. This means that the provider of IaaS services can "carve out" a rack or set of racks dedicated to a specific IaaS customer and give them isolated service. Yet when that customer grows and has a much larger requirement, or leaves the IaaS provider's installation, the provider can easily re-task the servers and storage, or expand to other racks of servers and storage, without needing to physically re-cable the network configuration.
 
In this example, the IaaS systems and storage are given their own VLANs that can be used by FCoE VN2VN to permit a "direct" connection between the IaaS customer's servers and storage without involving other systems within the IaaS provider's installation. It should be noted that when the customer either leaves the installation or expands, the provider can re-task the equipment and remove the VLAN specification, and in the case of expansion utilize a regular FCoE interconnect (via the EoR Director FCoE switches).
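To make the per-tenant isolation a bit more concrete, below is a minimal sketch, in C on Linux, of creating such a tenant VLAN interface with the legacy VLAN ioctl. The parent interface name (eth0) and VLAN ID (101) are placeholder assumptions, and the sketch covers only the VLAN plumbing; the FCoE VN2VN instance itself would then be configured on the resulting VLAN interface by the host's FCoE initiator software.

    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <sys/socket.h>
    #include <linux/if_vlan.h>
    #include <linux/sockios.h>

    int main(void)
    {
        struct vlan_ioctl_args req;
        int fd;

        /* Any ordinary socket works as a handle for the VLAN control ioctl. */
        fd = socket(AF_INET, SOCK_DGRAM, 0);
        if (fd < 0) { perror("socket"); return 1; }

        memset(&req, 0, sizeof(req));
        req.cmd = ADD_VLAN_CMD;
        strncpy(req.device1, "eth0", sizeof(req.device1) - 1); /* parent NIC: assumption */
        req.u.VID = 101;                                        /* tenant VLAN ID: assumption */

        /* Creates the VLAN interface (typically named eth0.101); requires root. */
        if (ioctl(fd, SIOCSIFVLAN, &req) < 0) {
            perror("SIOCSIFVLAN (ADD_VLAN_CMD)");
            close(fd);
            return 1;
        }

        close(fd);
        return 0;
    }

The same ioctl with DEL_VLAN_CMD (and the VLAN interface name in device1) removes the interface again when the rack is re-tasked, which is the point of keeping the tenant separation purely in the VLAN configuration rather than in the cabling.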

Likewise, a company often needs to provide IaaS-like services to various internal departments which, for technical or "political" reasons, need to be provided with dedicated server and storage rack(s) that can function as isolated environments for specific departments and projects. This becomes an internal IaaS "Cloud" environment, and FCoE VN2VN can often be an appropriate solution to this configuration requirement.

But independent of the internal or external "Cloud" IaaS environments, FCoE VN2VN is still appropriate for smaller computing environments such as "Big Box" stores, "Disaster Recovery Trailers", and small to medium IT installations.

Smaller organizations such as local "Big Box" stores can have their whole data center located in a single rack which holds the appropriate servers and storage, all inclusive.  In this type of configuration the various server vendors can be asked to bid on the "total rack" that includes FCoE VN2VN, often yielding a "total solution" at a minimum cost.  I was once associated with an organization that wanted to sell such configurations to the big box stores but was deterred by the cost of the Fibre Channel connections and switches. That concern is no longer relevant when FCoE and VN2VN connections, within the rack, are utilized.

I also understand that various "disaster recovery trailers" can utilize such configurations in their trailers when they are used to provide temporary IT service to big box stores (and others) after various disasters. 

And, of course, when it comes to small to medium IT installations (ones that fit within a single rack or a few racks), FCoE VN2VN configurations seem to offer a high-performing, low-cost storage interconnect that is compatible with future growth into a full FCoE or FC installation.  These types of installations may also be seen as a valuable asset that can easily be integrated, during a merger or buy-out, with larger organizations that probably have an FC and/or FCoE infrastructure.

Thursday, July 9, 2009

RDMA (Remote Direct Memory Access) for the Data Center Ethernet

Now that the T11 Technical Committee has completed its standards work on FCoE (Fibre Channel over Ethernet), it is probably time to look at additional technologies that will be able to complement FCoE.


But first, a bit of FCoE history: the T11 Ad Hoc Working Group known as FC-BB-5 has been working since 2007 to define a way that the Fibre Channel (FC) protocol can be carried on an Ethernet infrastructure. As part of that effort they managed to get some complementary work going within the IEEE 802.1 committee. That committee defined what has been called Converged Enhanced Ethernet (CEE), aka Data Center Ethernet or Data Center Bridging, which includes Priority-based Flow Control and a discovery protocol, among others, that will permit vendors to build what T11 calls a “Lossless Ethernet”.

This “Lossless Ethernet” (CEE) is defined to operate only within a single subnet (no IP routing). When messages are required to be sent beyond a single CEE subnet, one of two things must be true:


  1. The message must NOT be an FCoE message and may therefore transit a router as a normal IP message, where losses may occur
    Or
  2. The message is an FCoE message and may transit what is called an FCF (Fibre Channel Forwarder), which acts like a router for FC messages. The FC messages may be carried onto another subnet via FCoE on a CEE link or on a physical FC link

The advantage of the FCoE Network is that it’s made up of Lossless Links and Switches, and is primarily being defined for 10Gbps CEE fabrics.


The value of this CEE Network is that it can be used by normal Ethernet packets as well as FCoE packets. This means that it is possible to share the same physical network for all networking requirements. This includes not only the normal Client/Server messaging, but also the Server to Server messaging as well as the Storage Input/Output (I/O).



Because one of the keys to Lossless Ethernet is operating only in a single subnet and not passing through an IP router, it will have a very limited distance capability. However, this limited distance matches the current major Server to Server messaging environments.


One can see examples of Server to Server messaging in the general business environment with Front-end to Back-end messaging requirements as well as cluster messaging requirements. However, the most demanding of all Server to Server messaging requirements are found within the environments known as High Performance Computing (HPC), where high performance and low latency are most highly prized. In all these environments you will normally find the server configurations to be within a single subnet with as few switches between the servers as possible. This is another reason that the CEE environment seems to be very compatible with Enterprise and HPC Server to Server messaging.


Now every vendor’s equipment will, of course, be better than every other vendor’s equipment; however, the goal is clearly to have the total send/receive latency under 2-4 microseconds so that Server to Server messaging can fully exploit it. As part of the needed infrastructure, some vendors will provide 10GE CEE switches that operate in the sub-microsecond range. With these types of goals and equipment, many vendors believe that the latency of the egress/ingress path is the remaining problem to be solved. They believe that the host-side adapters or host network stacks need to shed the TCP/IP overhead so that these low latencies can be achieved. However, without TCP/IP, Server to Server messaging is only practical if all the connections are built with CEE components and stay within the same CEE subnet. Whether or not TCP/IP needs to be removed will be covered later, but it is safe to say that a CEE subnet will eliminate many of the retry and error scenarios, so even TCP/IP will operate well in a CEE environment.


This entire discussion means that since the FCoE protocol is based on the use of a CEE fabric, and since the CEE network includes Priority-based Flow Control, the needs of Server to Server messaging and Storage I/O seem to be very compatible.


Taking all of the above into account, the question then comes down to whether the host adapters in a CEE environment can be shared between the Ethernet-based messaging and the Storage I/O. The answer to this seems to be “YES”, since a number of vendors are producing such devices, which are called Converged Network Adapters (CNAs). These CNA devices are an evolution of the FC host adapters that were called Host Bus Adapters (HBAs), but which now also provide the normal Ethernet Network Interface Controller (NIC) functions and manage to share that NIC with a Fibre Channel (FC) function called FCoE.


The early versions of these CNAs were made up of a NIC chip, an FC chip, and an FCoE encapsulation chip which interfaced the FC function to the NIC chip. Since then most CNA vendors have integrated those functions into a single chip (aka ASIC). In any event, the same physical port can be used for normal NIC functions (which might include normal IP and TCP/IP messaging) as well as Storage I/O, all operating at 10Gbps.

The Internet Engineering Task Force (IETF) standards group defined an RDMA protocol that can be used in a general Ethernet environment; this protocol is called iWARP. The “i” stands for Internet and “WARP” is just a cool name (indicating fast, i.e. warp drive from Star Trek) and has no acronym meaning. The iWARP standard includes techniques and protocols for operating on a normal IP network; this includes Ethernet and any other network type that can handle IP protocols. To accomplish this it was necessary to use TCP/IP as the transport.
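To show what the host-side RDMA interface looks like to an application (an interface that, as discussed later, the dWARP proposals intend to keep essentially the same as iWARP’s), here is a minimal sketch of an RDMA “send” client written in C against the librdmacm connection-manager wrappers on Linux. The server address (192.0.2.10), port (7471), and message contents are placeholder assumptions, error handling is reduced to the bare minimum, and a real application would also post receive buffers; it is a sketch of the verbs-style programming model, not of any particular vendor’s iWARP implementation.

    #include <stdio.h>
    #include <string.h>
    #include <rdma/rdma_cma.h>
    #include <rdma/rdma_verbs.h>

    int main(void)
    {
        struct rdma_addrinfo hints, *res;
        struct ibv_qp_init_attr attr;
        struct rdma_cm_id *id;
        struct ibv_mr *mr;
        struct ibv_wc wc;
        char msg[64] = "hello over RDMA";
        int ret;

        /* Resolve the (hypothetical) server address and port. */
        memset(&hints, 0, sizeof(hints));
        hints.ai_port_space = RDMA_PS_TCP;
        ret = rdma_getaddrinfo("192.0.2.10", "7471", &hints, &res);
        if (ret) { perror("rdma_getaddrinfo"); return 1; }

        /* Create an endpoint (queue pair plus protection domain). */
        memset(&attr, 0, sizeof(attr));
        attr.cap.max_send_wr = attr.cap.max_recv_wr = 1;
        attr.cap.max_send_sge = attr.cap.max_recv_sge = 1;
        attr.sq_sig_all = 1;
        ret = rdma_create_ep(&id, res, NULL, &attr);
        if (ret) { perror("rdma_create_ep"); return 1; }

        /* Register the buffer so the adapter can DMA it directly. */
        mr = rdma_reg_msgs(id, msg, sizeof(msg));
        if (!mr) { perror("rdma_reg_msgs"); return 1; }

        /* Connect, post the send, and wait for its completion. */
        ret = rdma_connect(id, NULL);
        if (ret) { perror("rdma_connect"); return 1; }
        ret = rdma_post_send(id, NULL, msg, sizeof(msg), mr, 0);
        if (ret) { perror("rdma_post_send"); return 1; }
        while ((ret = rdma_get_send_comp(id, &wc)) == 0)
            ;
        if (ret < 0) perror("rdma_get_send_comp");

        rdma_disconnect(id);
        rdma_dereg_mr(mr);
        rdma_destroy_ep(id);
        rdma_freeaddrinfo(res);
        return 0;
    }

Built with -lrdmacm -libverbs, this kind of application code is largely independent of which RDMA-capable transport the underlying adapter provides, which is why the host interface question and the wire-protocol question discussed below can be separated.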

In general, CNAs today do not have built-in RDMA functions via iWARP. As a result, those installations that wish to have RDMA functions between their servers and also have FC-based Storage I/O are not able to consolidate onto the same CNAs. In an Ethernet environment this generally requires an iWARP adapter and a separate CNA/HBA for Storage I/O. It was the search for a complete CEE CNA that caused folks to consider whether it was possible to combine the RDMA functions with the other capabilities of CNAs.

(It should be noted that it is possible to converge RDMA messaging (via iWARP) and Storage I/O if the installation uses iSCSI for its Storage I/O. However, the enterprise business opportunity is usually found in a large physical installation that has FC-based storage devices. Except for HPC environments, the integration of iWARP and iSCSI has not really happened, and it is in the large business enterprises where the large, profitable opportunity exists.)


TCP/IP has all the necessary things built into its protocol to operate on a lossy and error-prone network. As a result, some folks have felt that it has too much overhead for a network of servers located on a single subnet. However, until the creation of the CEE capability as part of the FCoE effort there was no practical alternative. Now that a CEE fabric can be created, and since the most strenuous low-latency requirements seem to be within a single subnet, there are thoughts about how best to place the RDMA protocol on the CEE network. Two proposals will be discussed here: the first is to just use iWARP as is, and the second is to create a dWARP (the “d” stands for Data Center). The proposals can be summarized as follows:

  1. Use iWARP – We already have iWARP defined, and TCP/IP will work very well on a CEE network. The fact that there are no message drops and the error rate is low means that TCP/IP is not entering its error path, and TCP/IP Slow Start and other such capabilities are not used, so they are not impacting the performance of iWARP. Further, the predominant providers of iWARP NICs are offloading the TCP/IP into a TOE (TCP/IP Offload Engine), so the latency is kept to a minimum.

  2. Use dWARP – Regardless of whether CEE reduces the error conditions, there is additional path length when TCP/IP is used, and that will affect latency in a negative manner. Also, there are always fights between the server OS’s native TCP/IP implementation and the adapter vendor’s TCP/IP, so eliminating TCP/IP will remove this needless conflict. Further, some vendors believe that TOE capabilities require a lot of state maintenance, so any ASIC implementation will get very large, require much more electrical power, and in general cost more. Hence the wish to create a CEE-based RDMA function that does not need a TOE. There are at least two approaches to this (both of which keep the RDMA host interfaces/APIs the same as iWARP’s):


    • Encapsulate the functions either directly onto Ethernet packets or onto packets with IP-like headers. In one case you would create your own headers from scratch, and in the other the headers would be “IP-like” even if the Ethertype prevented them from being treated and routed like IP headers. With the latter, one could possibly build a specialty dWARP router sometime in the future without having to reinvent the things that have been learned about IP routing, and perhaps reuse the same code for many functions. (A minimal sketch of the raw-Ethernet side of this idea appears just after this list.)

    • Exploit the capabilities of FCoE by placing the RDMA functions into FC protocols. In this case it could ride along with all the capabilities being built into Data Center Ethernet and could even be forwarded (routed) to other subnets via an FCF, if that function were ever required.

    Either of these dWARP proposals might result in the smallest CNA ASIC chip, since the addition of a TOE would not be required.
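Neither dWARP encapsulation has been published as running code, but the basic mechanism behind the first option (placing payloads directly into Ethernet frames under a dedicated Ethertype, bypassing the TCP/IP stack entirely) can be sketched in C on Linux with a packet socket. Everything here is illustrative: the Ethertype is one of the IEEE “local experimental” values (0x88B5), and the interface name, destination MAC, and payload are arbitrary placeholders rather than anything defined by a dWARP proposal.

    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <arpa/inet.h>
    #include <sys/socket.h>
    #include <linux/if_packet.h>
    #include <net/ethernet.h>
    #include <net/if.h>

    #define DEMO_ETHERTYPE 0x88B5   /* IEEE local-experimental Ethertype (placeholder) */

    int main(void)
    {
        unsigned char dst[ETH_ALEN] = { 0x02, 0x00, 0x00, 0x00, 0x00, 0x01 }; /* placeholder MAC */
        char payload[] = "RDMA-style payload carried directly on Ethernet";
        struct sockaddr_ll addr;
        int fd;

        /* SOCK_DGRAM packet socket: the kernel builds the Ethernet header for us,
           and the frame never touches the IP/TCP stack. Requires CAP_NET_RAW. */
        fd = socket(AF_PACKET, SOCK_DGRAM, htons(DEMO_ETHERTYPE));
        if (fd < 0) { perror("socket"); return 1; }

        memset(&addr, 0, sizeof(addr));
        addr.sll_family   = AF_PACKET;
        addr.sll_protocol = htons(DEMO_ETHERTYPE);
        addr.sll_ifindex  = if_nametoindex("eth0");   /* interface name is an assumption */
        addr.sll_halen    = ETH_ALEN;
        memcpy(addr.sll_addr, dst, ETH_ALEN);

        /* One raw frame, with no transport protocol underneath it; on a lossless
           (CEE) link the drop/retransmit machinery of TCP is what this avoids. */
        if (sendto(fd, payload, sizeof(payload), 0,
                   (struct sockaddr *)&addr, sizeof(addr)) < 0)
            perror("sendto");

        close(fd);
        return 0;
    }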


At this moment there is work in the T11 standards group to define dWARP, and as of this writing, it looks like it will take the form of the second option under proposal 2 above (placing the RDMA functions into FC protocols carried via FCoE). This means that any installation that wants to use FCoE for its Storage I/O will be able, with little if any additional cost, to have a low-latency RDMA protocol that can be used on its CEE fabric.



On the other hand, if the installation desires to use RDMA across a routable, non-CEE network, then iWARP is currently the only game in town. An example of the usefulness of this capability can be seen in Client/Server messaging, in which clients are almost always located outside the data center. Unfortunately, there are very few Client/Server installations that use iWARP, because:


  • The cost of the adapters is relatively high for a Client system

  • At this time, the predominant desktop OS manufacturer (Microsoft) has decided NOT to implement iWARP in software, as it did for iSCSI (Internet Small Computer System Interface), so the potential reductions in cost for each Client system (which often has CPU cycles to spare) have not been possible. (This is regrettable since the true value of RDMA in a Client/Server environment is the reduction of overhead in the Server, which could bear the additional cost of a physical iWARP adapter.)

  • iWARP client software outside of the Microsoft environment is still very embryonic and is not seeing traction in Enterprise environments.


  • As a result, the business for iWARP outside of a single subnet has been very small.


On the other hand, if a software client were commercially available for desktop systems, that might also foster the development of bridges/proxies that could sit on the edge of the CEE network and map dWARP server packets into iWARP client packets (and vice versa).


Summary


Without a software implementation for clients being widely available, the primary place where iWARP will be found is within a data center and on a single subnet where servers send messages to other servers. That being the case, there is a strong motivation to exploit the capabilities of CEE, integrate these RDMA functions with the current CNAs, and permit the complete convergence of the Data Center Ethernet fabric (CEE) using dWARP-enabled CNAs (Converged Network Adapters without a TOE).


Whether a CNA using dWARP is a significantly better performer, and cheaper, than a CNA using iWARP on CEE is yet to be shown; however, this is the battleground where the new RDMA messaging fight will take place.