Why did we first choose and then abandon InfiniBand? The story of the development of the best cloud computing storage
Posted by Marketing Department on 11.10.2016 14:10

From January 2013 until June 2016 the Oktawave storage network was based on InfiniBand technology. Between June and mid-July we carried out probably our greatest revolution and completely replaced InfiniBand with the well-known FC SAN (in this case 16 Gbps), thereby admitting that our original assumptions were perhaps a bit too ambitious. But, as they say, every cloud has a silver lining. Our new architecture is more efficient and at the same time extremely stable.

How did it all begin?
Let's go back in time for a moment. It is mid-2011, we are working hard on building our platform, and the agreed deadline is the first quarter of 2012, when we plan to start beta testing. One of the most important objectives of the platform was to provide the fastest possible storage, not found anywhere else (and at prices acceptable to the customer).

So we gave up on purchasing a typical disk array (1) and decided to build our own solution based on expensive PCIe NVMe cards. We were laying the foundations of the physical and software architecture and began to design the way all of that performance would be delivered to virtual machines. The result was to be a service called OVS (2).

The market at that time was moving between FC SAN at 4 and 8 Gbps and iSCSI implementations on early 10G networks. There were a few other technologies, but they were basically marginal. We came up with the idea of the data node (and there are many of them), each with a capacity of about 600 thousand to 800 thousand IOPS (random 4K block operations). FC SAN 8 Gbps can do 200 thousand on one channel (in theory; in practice, at the level of the virtual environment and taking the hypervisor into account, it is closer to 160 thousand). Tests showed that spreading traffic across many paths increases throughput (1 GB/s), but the overhead associated with switching introduces such long delays that IOPS actually end up lower. Well, the formula IOPS = 1 (s) / latency (s) x queue_depth (count) is rather unforgiving. Queues cannot be increased indefinitely, and we cannot lower the latency with a shorter cable, or by using time dilation or Lorentz contraction. I won't even mention iSCSI, because the values achieved there are of a different order of magnitude.
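To make the formula above concrete, here is a minimal Python sketch of it. The function names, the queue depth of 32 and the sample latencies are assumptions chosen purely for illustration; they are not Oktawave measurements. Only the formula itself and the 600 thousand IOPS target come from the text.

```python
def iops_ceiling(latency_s: float, queue_depth: int) -> float:
    """Upper bound on IOPS from the formula: (1 / latency) * queue_depth."""
    if latency_s <= 0:
        raise ValueError("latency_s must be positive")
    if queue_depth < 1:
        raise ValueError("queue_depth must be at least 1")
    return queue_depth / latency_s


def max_latency_s(target_iops: float, queue_depth: int) -> float:
    """Longest per-request latency (seconds) that still permits target_iops."""
    if target_iops <= 0:
        raise ValueError("target_iops must be positive")
    return queue_depth / target_iops


if __name__ == "__main__":
    # Hypothetical queue depth and latencies, for illustration only.
    queue_depth = 32
    for latency_us in (500, 100, 50):
        ceiling = iops_ceiling(latency_us * 1e-6, queue_depth)
        print(f"{latency_us:>4} us at queue depth {queue_depth}: "
              f"at most {ceiling:,.0f} IOPS")

    # Latency budget for the lower data node figure quoted in the text.
    budget_us = max_latency_s(600_000, queue_depth) * 1e6
    print(f"600,000 IOPS at queue depth {queue_depth} "
          f"needs latency of at most {budget_us:.1f} us")
```

With these assumed inputs, a latency of half a millisecond at a queue depth of 32 caps a path at 64,000 IOPS, which shows why every extra delay on the path matters more than raw bandwidth.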

What about InfiniBand?
Suddenly someone threw out an idea: what about InfiniBand? After all, it has been used for years to build supercomputers, QDR is about 40 Gbps, and the (then) upcoming FDR is even faster. Latency for QDR is 1.3 microseconds (FC was somewhere closer to half a millisecond). So far so good, but InfiniBand usually means RDMA, and we need a SCSI implementation. We are in luck, because there is also SRP (SCSI RDMA Protocol). It sounds promising; it is just a pity that SRP was available for VMware ESX, our primary hypervisor, in version 4.x but not in version 5.0, with which we planned to start (and which was itself still in beta). We began talks with the supplier: by phone, by e-mail, at conferences. Finally, the answer: in a few months we would have the required version.

The Oktawave beta platform was to start on FC, but later we switched it to the desired InfiniBand! With that, our adventure begins. As traffic began to appear, so did the ills of a completely immature SRP driver for VMware: among others, lost paths, I/O blocking, timeouts and, as a result, virtual machines losing access to storage, resets of panicked ESX hosts, and switching to read-only (r/o) mode. It doesn't really matter that bandwidth is fine (these 3 GB/s could actually be squeezed from Tier-2 and higher disks at the level of the virtual machine) if IOPS are only average and frequent restarts cause deep frustration on both the client side and the administrators' side. We do not give up; we are still convinced that we can solve this problem. Over time new drivers appear, and we focus on improving the configuration, which brings peace for several months. Further increases in load, however, take us back to the starting point.

Time to "man up"
We recognized that this situation could not go on, and that we could not count on software vendors doing their job. Our network was already very large, comprising three independent subregions, and at the beginning of this year we had planned probably the biggest upgrade in the history of our infrastructure: we actually wanted to double its size. That was the moment when we decided to withdraw from InfiniBand completely. Initially only for the new infrastructure, but ultimately we replaced everything.

In addition to completing the upgrade (as an aside: subregion PL-004 is now available, PL-005 will follow shortly, and probably a few nice surprises as well), in each of the subregions we replaced the cabling, the switches, the HCA/HBA cards and the data node configuration. In a word, it was a large-scale operation performed online. Without going into the details of the operation, each of the subregions was eventually migrated to FC SAN 16 Gbps, as available on the market today (sorry, but still nothing else will do, and this is our unshakable personal opinion).

New and wonderful Oktawave
Here, real numbers are already at the level of 350 thousand IOPS on a single path, measured from the level of a virtual machine. On two paths, 3 GB/s of transfer can also be exceeded. All of that applies only to the Intel 2660 v3; the older Intel processors working in PL-001 can reach 200 thousand IOPS, and the AMD processors in PL-002 and PL-003 break the magical barrier of 100 thousand IOPS.

To complete the picture and for the sake of clarity, we must also mention that we have slightly changed the architecture of the data nodes. We abandoned the idea of synchronous or asynchronous replication, and in most cases we now use the classic layout found in regular disk arrays, i.e. each data node consists of a pair of devices (servers) called heads (head A/B), each equipped with appropriate interfaces (HBA) for connecting to the data networks, its own data controllers, shared data storage and its own software, which in principle is analogous to a classic disk array with two controllers working in A/AND mode. The exceptions are the nodes that provide Tier-5, where the configuration is quite different, but that is material for another article. Technically, the heads are built on classic x86 servers, but with a configuration optimized for I/O in terms of both hardware and software.

In summary, since the migration virtually all problems with the transmission medium have ended: no reboots, no path switching, no unexpected refusals to cooperate. The conclusion is that we probably tried to get ahead of our time, but we put our money on the wrong horse. Today some technologies have caught up with the speeds we are trying to deliver, but we now look at new and ambitious ideas much more carefully: we are no longer a small experimental laboratory embedded in the world of startups, and the comfort of our customers is an absolute priority for us.

Soon we will publish a document in our knowledge base with standardized data on OVS speed in the various subregions, optimization and tuning methods, and several other tips.

Finally, we invite everyone to see how it all works now; subregion PL-004 deserves special attention.



(1) Back then, the prices of disk arrays that even came close to the assumed parameters ran to six- and seven-digit figures (USD).

(2) OVS drives are an implementation of block devices (the conventional disks known from desktops and servers) in the virtualized Oktawave data network. They are an independent service, separate from other services at Oktawave (in particular OCI). Currently, using OVS requires at least one OCI server to which the OVS drive can be connected. As a separate service, OVS drives also have their own nominal and maximum operating parameters (independent of the OCI server), and whether these are reached from the OCI server's perspective depends on many factors, including network bandwidth in the subregion, the performance and type of processor in the subregion, the type of driver installed on the OCI server and the queue length settings.
