by Josef Spillner

In the second invited talk in our colloquium series in 2018, Alan Sill from Texas Tech University’s Cloud and Autonomic Computing Center shared his views on how to manage data centres the right way. In the talk «Topics in robust modern API design for data center control and scientific applications», he pointed out many issues whose proper solution will affect the whole cloud stack, up to the way cloud-native applications are designed and equipped with deep self-management capabilities. This blog post captures both the talk and the debates that were mixed in.

The speaker passed through a number of historic developments to gather arguments for future work on management APIs. He started off with Distributed Management Task Force (DMTF) references, comparing CIMI (fully standardised, but not adopted) to OCCI (many implementations, including OCCIware). The lesson drawn was that standardisation requires people to sit together to test and implement ideas much more than it requires rigid, isolated committee work. This point was also stressed with references to the Cloud Plugfests, which have worked best with around a dozen active participants who produced interoperability code. A more recent reference, and the heart of the talk, was Redfish, which improves on the too-late, too-chaotic standardisation of IPMI with a RESTful and actually useful interface.

The implications of deploying Redfish, which is included in the backplanes of all recent servers, are manifold. First, future large-scale and exascale installations require full hardware control for administrators so that users cannot accidentally lock up or break parts of the system. A data point was given that Texas Tech University runs a 17,000-core cluster whose operation is indeed affected by such issues. Second, the API design allows for proper discovery with schemas. Third, the bottleneck moves from the servers to the client, which deals with hundreds of parallel open HTTP connections in scatter-gather mode to speed up operations. However, the need for direct hardware control was challenged for two reasons. First, it is not clear whether operating systems can cope with dynamic changes that are more far-reaching than dynamic CPU or memory additions at the hypervisor level. Second, cloud programming should occur at a higher level of abstraction, and any reference to hardware should be avoided. This second point was in turn refuted by pointing out the need for full use of computing resources in several application classes, especially in scientific computing. Cloud-native applications are often designed for failure because failure cannot be fully mitigated even in local systems. The answer, then, may be a selective pass-through of management capabilities, in contrast to otherwise high abstraction levels, with a full audit trail and resetting capabilities to ensure that modifications do not extend across multi-tenancy and session boundaries. An intuitive use case is throttling down the CPU frequency during high temperatures, which should happen autonomously.
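To make the scatter-gather remark more concrete, here is a minimal Python sketch of a client that sends the same Redfish request to many hosts in parallel and gathers the answers. It is not the code shown in the talk. The function names `fetch_json` and `scatter_gather`, the host names and the worker count are assumptions made for illustration, and authentication and TLS certificate handling, which a real deployment would need, are left out. The path `/redfish/v1/Systems` is the standard Redfish systems collection.

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def fetch_json(url, timeout=5.0):
    """Fetch one URL and decode its JSON body (illustration only: no auth, default TLS)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)


def scatter_gather(hosts, path="/redfish/v1/Systems", fetch=fetch_json, max_workers=64):
    """Send the same GET request to every host in parallel and gather the results.

    Returns a dict mapping each host to the decoded JSON document, or to the
    exception that was raised, so one unreachable host does not abort the run.
    """
    def query(host):
        try:
            return host, fetch(f"https://{host}{path}")
        except Exception as exc:  # keep the failure as that host's result
            return host, exc

    # The thread pool bounds the number of simultaneously open connections.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(query, hosts))
```

A call such as `scatter_gather(["node001.example.org", "node002.example.org"])` (hypothetical host names) would return one entry per host. Passing the `fetch` function as a parameter keeps the sketch testable without a network, and it also shows where the client-side bottleneck sits: in how many connections the client can keep open at once.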

The speaker furthermore touched on emerging Metal-as-a-Service offerings such as Chameleon and CloudLab, which allow for rapid and reproducible instantiation of machines including pre-installed software. Indeed, institutional cloud computing is now becoming more interesting for recomputable SaaS-level research such as our work on composite application migration. There are several interesting collaboration opportunities for European researchers who want to make use of these testbeds.

The talk also highlighted the advantages of standardised APIs described with OpenAPI, derived by transformation from Redfish’s JSON schemas. This process opens the door to code generation and other service tooling, and to rapid prototyping of complex API-based demonstrators. Indeed, in one of the student works co-supervised by the speaker, an emulation called Redfish Mockup was run in thousands of instances of a Docker container, which demonstrates the scalability of the approach.
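The idea of deriving an OpenAPI description from a JSON schema can be illustrated with a small Python sketch. This is not the DMTF conversion tooling, which is considerably more involved. The function name `json_schema_to_openapi`, the heavily simplified `ComputerSystem` schema and the version string are assumptions made for illustration. The sketch wraps one schema as a reusable component and exposes it through a single GET operation in an OpenAPI 3.0 document.

```python
import json
import re


def json_schema_to_openapi(name, path, schema, version="0.0.1"):
    """Wrap one JSON schema into a minimal OpenAPI 3.0 document with one GET operation."""
    # Drop top-level JSON Schema keywords that OpenAPI 3.0 schema objects do not accept.
    # Nested schemas are copied unchanged in this sketch.
    component = {k: v for k, v in schema.items() if k not in ("$schema", "$id")}

    # Every {placeholder} in the path must be declared as a path parameter.
    parameters = [
        {"name": p, "in": "path", "required": True, "schema": {"type": "string"}}
        for p in re.findall(r"\{(\w+)\}", path)
    ]

    operation = {
        "responses": {
            "200": {
                "description": f"A {name} resource",
                "content": {
                    "application/json": {
                        "schema": {"$ref": f"#/components/schemas/{name}"}
                    }
                },
            }
        }
    }
    if parameters:
        operation["parameters"] = parameters

    return {
        "openapi": "3.0.3",
        "info": {"title": f"{name} API", "version": version},
        "paths": {path: {"get": operation}},
        "components": {"schemas": {name: component}},
    }


# Hypothetical, heavily simplified schema; the real Redfish schemas are much larger.
example_schema = {
    "$schema": "http://json-schema.org/draft-07/schema#",
    "type": "object",
    "properties": {
        "Id": {"type": "string"},
        "Name": {"type": "string"},
        "PowerState": {"type": "string", "enum": ["On", "Off"]},
    },
    "required": ["Id", "Name"],
}

document = json_schema_to_openapi(
    "ComputerSystem", "/redfish/v1/Systems/{SystemId}", example_schema
)
print(json.dumps(document, indent=2))
```

Once such a document exists, generic OpenAPI tooling such as client generators, mock servers and documentation renderers can be pointed at it, which is the door to code generation that the talk referred to.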

Looking into the near future, it becomes evident that scientific applications need to consider these novel developments at the data centre level, given that comparable public cloud offerings remain expensive and less flexible despite often being more user-friendly. A particular example given was the DRMAA interface specification, which is rooted in grids. Today there are PaaS offerings instead, so web APIs would be needed on top in order to get frictionless integration, interoperability and developer acceptance. For cloud-native applications with adaptive self-management capabilities at the highest level of maturity, too, a discoverable set of potentially disruptive platform actions may lead to deep self-management even in public clouds. The differentially priced actions may encompass anything beyond restarts: scheduler policy changes, volume recreations, network state resets, redeployments and other dynamic operations.

While at ZHAW, in the true sense of having productive meetings, members of the ICCLab and SPLab sat down with Dr. Sill to discuss how large-scale verification and validation of standard application programming interfaces could be achieved. ZHAW is a key partner in the EU H2020 research project ElasTest, whose main aim is to provide the tooling and framework to allow for large-scale distributed testing. With this in mind, we began to use the ElasTest framework to rapidly prototype the feasibility of such an approach to ensuring compliance of an API with a specification. To do this, we used the set of Redfish tools available from DMTF along with ElasTest, and within an hour we had an initial prototype of a scalable compliance testing framework. We aim to elaborate on this work and report back.

We thank Alan Sill for his talk and views and hope to see him again in Switzerland later this year.