Why did we first choose and then abandon InfiniBand? The story of building the best cloud computing storage
Posted by Marketing Department on 11.10.2016 14:10
From January 2013 until June 2016 the Oktawave storage network was based on InfiniBand technology. Between June and mid-July we carried out probably our greatest revolution and completely replaced InfiniBand with the well-established FC SAN (in this case 16 Gbps), admitting that our original assumptions were perhaps a bit too ambitious. But - as they say - every cloud has a silver lining. Our new architecture is more efficient and at the same time extremely stable.
How did it all begin?
We gave up on purchasing a typical disk array (1) and decided to build our own solution based on expensive PCIe NVMe cards. We laid the foundations of the physical and software architecture and began designing the process of delivering all that performance to virtual machines. As a result, the service called OVS was built (2).
The market at the time was split between FC SAN 4 and 8 Gbps technology and iSCSI implementations on early 10G networks. There were a few other technologies, but they were marginal. We came up with our concept for the data node (and there are many of them) - each with a capacity of about 600 to 800 thousand IOPS (random 4K block operations). FC SAN 8 Gbps can do 200 thousand on one channel (in theory; in practice, at the level of the virtual environment and accounting for the presence of the hypervisor, it is closer to 160 thousand). Tests showed that spreading traffic across many paths increases throughput (1 GB/s), but the overhead of path switching introduces such long delays that the IOPS level actually drops. Well, the formula IOPS = queue_depth / latency (s) is rather unforgiving. Queues cannot be increased indefinitely, and we cannot lower the latency with a shorter cable, or by using time dilation or Lorentz contraction. I won't even mention iSCSI, because the values it achieves are of a different order of magnitude.
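The formula above can be illustrated with a minimal sketch (the numbers are illustrative, not Oktawave measurements):

```python
def max_iops(latency_s: float, queue_depth: int) -> float:
    """Theoretical IOPS ceiling: IOPS = queue_depth / latency.

    With a fixed queue depth, latency alone bounds the achievable IOPS,
    regardless of how much raw bandwidth the link offers.
    """
    return queue_depth / latency_s

# Example: 100 microseconds of per-I/O latency, queue depth 20
print(max_iops(0.0001, 20))  # 200000.0 - roughly the single-channel FC SAN 8 Gbps figure
# Doubling latency halves IOPS; extra bandwidth cannot buy it back
print(max_iops(0.0002, 20))  # 100000.0
```

This is why multipath dispersion raised throughput but lowered IOPS: the switching overhead added latency, and latency sits in the denominator.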
What about InfiniBand?
The Oktawave beta platform was about to launch on FC, but we later switched to the desired InfiniBand - and that is where our adventure began. As traffic started to appear, so did the ills of a completely immature SRP driver for VMware: lost paths, I/O blocking, timeouts and, as a result, loss of storage access from virtual machines, resets of panicked ESX hosts, filesystems switching to read-only. It hardly matters that the bandwidth was fine (those 3 GB/s really could be squeezed from the disks at Tier-2 and up, at the virtual machine level) if IOPS were merely average and frequent restarts caused deep frustration both on the client side and among the administrators. We did not give up; we remained convinced we could solve the problem. Over time new drivers appeared, so we focused on improving the configuration, which brought peace for several months. The next load levels, however, sent us back to square one.
Time to "man up"
In addition to completing the upgrade (on a side note: subregion PL-004 is now available, PL-005 will follow shortly, and probably a few nice surprises as well), in each of the subregions we replaced the wiring, the switches, the HCA/HBA cards and the data node configuration - in a word, an operation on a large scale, performed online. Without going into the details, eventually each of the subregions was migrated to FC SAN 16 Gbps, the standard on today's market (sorry, but still nothing else will do, and this is our unshakable personal opinion).
New and wonderful Oktawave
To complete the picture, and for clarity of the message, we must also mention that we slightly changed the architecture of the data nodes. We abandoned the idea of synchronous or asynchronous replication and now, in most cases, use the classic layout found in regular disk arrays: each data node consists of a pair of devices (servers) called heads (head A/B), each equipped with appropriate interfaces (HBA) for connecting to the data networks, its own data controllers, shared data storage and its own software - in principle analogous to a classic array with two controllers working in active/active mode. The exception is the nodes providing Tier-5, where the configuration is quite different - but that's material for another article. Technically, the heads are devices built on classic x86 servers, but with a configuration optimized for I/O in terms of both hardware and software.
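The dual-head layout can be sketched as a toy model (the class and head names are invented for illustration; this is not Oktawave's actual software):

```python
# Minimal model of an active/active dual-head data node: both heads
# export paths to the same shared storage, so losing one head leaves
# the data reachable through the other.

class Head:
    def __init__(self, name: str):
        self.name = name
        self.online = True

class DataNode:
    def __init__(self):
        # A pair of heads (head A/B) sharing one backing store
        self.heads = [Head("head-A"), Head("head-B")]

    def available_paths(self) -> list[str]:
        # In active/active mode every online head serves I/O
        return [h.name for h in self.heads if h.online]

node = DataNode()
print(node.available_paths())   # ['head-A', 'head-B']
node.heads[0].online = False    # head A fails...
print(node.available_paths())   # ['head-B'] - I/O continues via head B
```

The point of the design is exactly what the toy model shows: a single head failure degrades the number of paths, not the availability of the data.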
In summary, since the migration virtually all problems with the transmission medium have ended: no reboots, no path switching, no unexpected refusals to cooperate. The conclusion is that we probably tried to get ahead of our time, but bet on the wrong horse. Today some technologies have grown into the speeds we are trying to deliver, but we now look at new and ambitious ideas much more carefully; we are no longer a small experimental laboratory embedded in the world of startups, and the comfort of our customers is an absolute priority for us.
Soon we will publish in our knowledge base a document with standardized OVS performance data for the various subregions, optimization and tuning methods, and several other tips.
Finally, we invite everyone to see how it all works now - subregion PL-004 deserves special attention.
(1) At the time, the prices of disk arrays that even approached our assumed parameters ran to six and seven figures (USD).
(2) OVS drives are an independent service, separate from the other Oktawave services (in particular OCI). Currently, using OVS requires having at least one OCI server to which the OVS drive can be attached. OVS disks, as a separate service, also have their own nominal and maximum operating parameters (independent of the OCI server), and achieving them from the OCI server's perspective depends on many factors - including network bandwidth in the subregion, the performance and type of processor in the subregion, the type of driver installed on the OCI server, and the queue length settings.