Service hosting platforms such as IaaS and PaaS offer a lot of convenience for the service engineer. They take care of proper provisioning, scaling, healing and profiling. Yet, this platform support is limited when it comes to decisions which require insight into the application state and logic, especially considering applications or services ranging across multiple platforms with composition and orchestration.
The Active Service Management research initiative of the Service Prototoyping Lab aims at improving the state of the art by letting applications signal their states, conditions and requirements, and by letting platforms understand these signals. Emerging from the work on Cloud Native Applications (CNA), this initiative subsumes work on pro-active/predictive auto-scaling with application metric such as numbers of users and self-* properties such as self-healing by replacing crashed or unresponsive application parts with new instances.
Comparative evaluation of active service management techniques.
Novel contributions to some of the techniques, in particular to scaling and resilience, but also service evolution.
Turn research results into best practices to achieve an extended CNA design and appropriate hosting platforms.
Relevance to current and future markets
The commercial landscape of service hosting infrastructures generally assumes an issue-free operation without failures and demand spikes. Due to SLAs, this becomes costly when the assumptions do not hold anymore. With active service management contributions, many of the failures and spikes can be mitigated so that business continuity remains assured.
Relevant Standards and Articles
While management of services is an established topic in industry, the specific issue of actively managing environment-aware services is a genuine research topic. We present a selection of useful reading.
From a datacenter operator or cloud provider point of view, IT services are intangible entities which must run reliably 24/7, be provisioned on demand with the right scale, and be documented and certified properly. In the Service Operations research initiative of the Service Prototyping Lab, we take a closer look at the needs of businesses which operate infrastructure, platform and application software services. There are differences in the structural appearance of services (e.g. daemon, virtual machine, container, plugin archive) and in the level of assurance against risks (technical, legal). These differences need to be accounted for when planning and scheduling the service execution.
Operation of testbeds for service execution. A BladeCenter is already set up for this purpose.
Solving business needs regarding service-level agreements (SLAs), software and process certifications, governance, high availability, failover, as well as further technical protection schemes.
Relevance to current and future markets
IT services are the foundation of all digital processes between individuals, enterprises and organisations. Increasingly, processes are going digital, which saves paper but demands fully reliable and automated IT service delivery and governance. Therefore, this initiative serves as enabler and helps in particular companies to dry-run their services in a controlled environment before rolling them out for the target consumers.
Working with remote services requires appropriate and decent tooling. A service idea may take just five seconds (“I want to offer a robust note-taking service”), but its realisation may take much longer (“Which programming language and model?”, “How to describe the service?”, “Where do I find a fitting file service to store the notes on unless I want to take care of backups by myself?”, “Where do I publish my service so that it runs and generates income?”). Therefore, modelling, engineering and integration tools are primarily needed. These tools work in combination with a certain service environment, or ecosystem, consisting of more tools, dependency services, and service platforms which bring services to life.
Open source service platforms such as SPACE and FIWARE went from being architectural visions to actually usable platforms. However, in comparison to cloud stacks and commercial cloud services, their popularity is limited and they are far from being used pervasively. Therefore, the Service Tooling research initiative of the Service Prototyping Lab intends to identify tools and platform services which are straightforward to deploy, easy to use and generic enough to be re-usable in many service scenarios.
For this purpose, the initiative follows a triple structure with three topics of increasing industrial and societal interest: Function-as-a-Service (FaaS), Stealth Computing, Cloud Ecosystems.
Research and innovation in the entire service lifecycle through advanced tooling: Modelling, publishing, running, consuming and evolving services and service-based applications. This initiative will therefore contribute open source tools to build service platforms, ecosystems and individual applications.
Layered architectures for services and clients, including adaptive invocation and stealth data management, to benefit from service and cloud environments while overcoming their limitations.
In connection with CNA, identification of suitable tooling for engineering cloud-native applications, in particular aiming at extreme microservices (nanoservices) with FaaS.
Adaptation of applications to execution technologies in general: VMs, containers, packages, functions, unikernels. None of these should be a concern to an application engineer and therefore automated tooling is required.
Stealth Computing Architecture
In this part of the initiative, there are a number of architectures depending on the use case and the lifecycle phase of a service. The following diagram represents a typical multi-cloud service integration point with stealth properties. Software applications and services benefit from spreading their data and functions across providers in a tightly controlled, re-usable layer with standard interfaces such as files (e.g. POSIX) and data (e.g. SQL). Users are more willing to adopt cloud environments when explicit user control is made possible by stealth computing.
Cloud Ecosystems Architecture
This part of the initiative explores marketplaces, brokers, dashboards, cloud migration tools, API generators, aggregators and other enablers of thriving ecosystems with service producers and consumers. The research focuses on prototyping techniques with description/implementation roundtripping, a library of utility services which aid in establishing ecosystems, and improved client-side tools such as CLI helpers.
In the FaaS part of the initiative, tools to bring legacy code into FaaS environments as well as tools to advance the environments themselves are investigated. There are software decomposition tools for Python (Lambada) and for Java (Podilizer, Termite). Furthermore, there is a flexible client/server tool to migrate, execute, test and deploy functions written in several languages (Snafu).
Articles and Publications
Note: Preprints are made available in a timely manner. Check preprints.
J. Spillner, M. Beck, A. Schill, T. M. Bohnert: Stealth Databases: Ensuring User-Controlled Queries in Untrusted Cloud Environments, 8th IEEE/ACM International Conference on Utility and Cloud Computing (UCC), Limassol, Cyprus, December 2015. (PDF author version) (Slides)
J. Spillner: Secure Distributed Data Stream Analytics in Stealth Applications. 3rd IEEE International Black Sea Conference on Communications and Networking (BlackSeaCom), Constanța, Romania, May 2015.
J. Spillner, J. Müller: PICav: Precise, Iterative and Complement-based Cloud Storage Availability Calculation Scheme. 7th IEEE/ACM International Conference on Utility and Cloud Computing (UCC), London, UK, December 2014. (PDF)
J. Spillner, A. Chaichenko, A. Brito, F. Brasileiro, A. Schill: Cloud Resource Recycling: An Addition of Species to the Zoo of Virtualised, Overlaid, Federated, Multiplexed and Nested Clouds. SDPS Transactions: Journal of Integrated Design and Process Science (JIDPS), vol. 18, no. 1, pp. 5-19, April 2014.
J. Spillner, S. Illgen, A. Schill: Engineering Service Level Agreements: A Constrained-Domain and Transformation Approach. 3rd International Conference on Cloud Computing and Services Science (CLOSER), Aachen, Germany, May 2013.
The connection between the physical world and the virtual world has never been as exciting, accessible, and economically viable as today. Sensors, actors and robots are able to deliver many physical services in several scenarios, including industrial production and home automation, elderly care, assisted living, logistics and cooperative maintenance.
In isolation, computing capabilities of robots are however limited by embedded CPUs and small on-board storage units. By connecting robots among each other and to cloud computing, cloud storage, and other Internet technologies centered around the benefits of converged infrastructure and shared services, two main advantages can be exploited. First, computation can be outsourced to cloud services leveraging an on-demand pay-per-use elastic model. Second, robots can access a plethora of services complementing their capabilities (e.g., speech analysis, object recognition, knowledge sharing), enabling new complex functionalities and supporting learning.
Cloud robotics is a natural extension to the Internet of Things (IoT). Where IoT devices will gather information about an environment to help make smarter decisions, cloud robotics will be able to use this information and act on it.
Although there is clear recognition that Cloud access is required to complement robotics computation and enable functionalities needed for robotic tasks (e.g., self-driving cars), it is still unclear how to best support these scenarios.
State of the practice
The Robot Operating System (ROS) (http://www.ros.org/) has gained massive traction both in industry and research.
ROS allows robotics developers to concentrate on one specific functionality at a time (e.g., planning, navigation, 3d reconstruction) implemented into so-called ROS nodes that can communicate through pub/sub or RPC-like messages to implement a coherent robotic behavior.
Its next release iteration (ROS2) will address most of the shortcomings of the previous version, for instance removing the need for a centralized master node and adopting a peer to peer discovery mechanism for nodes.
Current ROS development is still faced with low-level interoperability and compilation concerns. Containers in this context are used for the moment simply as a way of providing a consistent running environment for ROS code, however none of the best practices from cloud computing (e.g., resource and container management, placement automation, resource and SW orchestration, continuous integration and deployment) have been adopted yet.
Our lab as seen and mapped by one of our turtlebots
The initiative’s goal is to enable robotic applications to take full advantage of cloud computing services, resources, best practices, and automation by integrating ROS nodes and robots just as other composable services in the cloud.
The results will benefit all robotic stakeholders:
Robotic and cloud developers will be able to deploy their application code in containers with a click, triggering the orchestration of all needed supporting services (e.g., databases, caches, load balancers, speech and video processing, ROS nodes) and both virtual (e.g., containers, virtual machines) and physical resources (e.g., robots, sensors, cameras, bare-metal machines). ROS nodes and robotic services will be easily composable and accessible from a market generating revenue for SW development of generic and ad hoc solutions;
Robot Producers will be able to leverage the advancements in cloud infrastructure and platforms needed to support robotic applications. This will enable them to provide end customers with fully functional robots that do not require on premises compute infrastructure and configuration. Moreover they will be able to leverage from a developer community and their SW artifacts;
End users will benefit from robots and applications that work out of the box and can be easily integrated among themselves and with their favourite cloud services.
Cloud orchestration will cover the entire lifecycle of a robotics service, catering for timely resource allocation and dismissal, taking full advantage of the cloud pay-per-use model, sensibly reducing operational costs with respect to an always-on solution.
Moreover, by taking a service oriented approach and adopting modern cloud-development methodologies, developers will benefit from continuous integration / deployment practices, resulting in shorter release cycles and higher productivity.
This research initiative will also address:
Review frameworks for offloading processing and storage tasks into suitable cloud services, including one of the first such frameworks, AdAPtS (2012).
Delivery of a bundled set of service tooling, including a dynamic service registry, for the purpose of letting robots access cloud services many years after production despite constant service evolution.
Fleet management for robots in the cloud.
Collaborative knowledge sharing between robots using suitable online services.
Identification of beneficial services to augment robot capabilities, such as image recognition and messaging.
Relevance to current and future markets
Robotics is a hot market with competitive manufacturers especially in Switzerland and Europe, including Rapyuta, Aldebaran and (for educational purposes) LEGO Mindstorms. The much larger and currently mostly untapped market is about robots connected to appropriate feedback-loop and cooperation services.
Giovanni Toffetti, Tobias Lötscher, Saken Kenzhegulov, Josef Spillner, and Thomas Michael Bohnert. 2017. Cloud Robotics: SLAM and Autonomous Exploration on PaaS. In Companion Proceedings of the10th International Conference on Utility and Cloud Computing (UCC ’17 Companion). ACM, New York, NY, USA, 65-70. DOI: https://doi.org/10.1145/3147234.3148100
J. Spillner, C. Piechnick, C. Wilke, U. Aßmann, A. Schill: Autonomous Participation in Cloud Services. 2nd International Workshop on Intelligent Techniques and Architectures for Autonomic Clouds (ITAAC), Chicago, Illinois, USA, November 2012. (PDF)
Todays Application developer tooling is focusing on single (mostly monolithic) applications and local development on developer workstations. Modern “Cloud Native Applications” on the other hand are distributed, decoupled, resilient and highly scalable. Sometimes a Business Application consists 20+ so called Microservices.
In the cloud age, customers expect fast innovation and a downtime-free application provisioning. Modern cloud development tools, cloud automation and continuous delivery of software as a service makes this possible. This delivery model depends on a continuous service deployment functionality in the hosting environment, i.e. in the PaaS or IaaS stack, in combination with powerful version control systems and automated continuous testing and integration. To bring software to scale and avoid failures, decentralised stacks and systems are used to deploy the services. Once they are deployed, scalability and resilience are taken care of by run-time methods such as CNA.
Cloud Application Developer tooling and Continuous deployment on PaaS platforms is a particularly popular research topic with industrial relevance. Tools are being created to support CloudFoundry, OpenShift, Heroku, and Azure. The goal is to get the services running with a single button click and no more worries about dependencies, random breakage or interface incompatibilities.
Innovation in the entire application development lifecycle: design, modelling, testing, packaging, deployment, debugging, publishing, running, monitoring. This initiative will therefore contribute open source tools to build application development ecosystems.
Provide advanced tooling to develop and operate large and complex cloud application.
Innovation along the whole DevOps process: design, coding, testing, continuous integration, continuous deployment, provisioning, operation,…
Enhancing techniques and tools to support multi-stage zero-downtime deployment of complex cloud service systems.
Multi-Service/Multi-Cloud support (PaaS & IaaS)
Focus on Cloud Native Applications, but also supporting complex scenarios, like service interdependencies, versioning and data migrations.
Although there are several continuous deployment approaches and quite a few comparisons, there is a lack of a detailed analysis of how to choose the best techniques for deploying services in cloud environments. The Cloud Application Developer Tooling research initiative of the Service Prototyping Lab is exploring ways to get more knowledge and findings on this topic.
Relevance to current and future markets
Automation is a massive cost cutter in the IT industry. By leveraging modern distributed development technologies, application developers and operators can streamline the application lifecycle, eliminate the downtime, and governing it with a minimum effort required for testing and deployment resources.
Software Defined Networking (SDN) is a technology that has introduced an important paradigm shift in the networking world. With the OpenFlow protocol as a main technological enabler, the essential goal is to extend the conventional network configuration approach by introducing the concept of network control and programmability.
The advances in the OpenFlow protocol and the strong community involved in the OpenDaylight (ODL) framework has significantly leveraged SDN over the past few years, which booked it a ticket as a de-facto technology in the datacenter network management journey. To follow this initiative, OpenStack has been a pioneer technology that urged to provide a direct SDN support for Neutron. Such approach has introduced new challenges arising from the direct mapping of network traffic between the physical hosts and the virtual tenant networks.
Identifying scenarios that embrace different issues to consider, has a high priority in the current SDN world. With the main focus on SDN-managed datacenter networks, this initiative will provide a technical implementation and know-how on managing cloud-based network resources in a straightforward manner.
The “SDN for clouds at the ICCLab” mission involves: establishing use cases, revealing potential issues, analyzing alternative approaches and optimizations in order to achieve efficient networking for classical datacenters, network carriers, Internet Service Providers and Cloud providers. The tasks to achieve this include:
Provide on-demand, scalable, commodity deployment to facilitate SDN knowledge transfer to academia and business partners
Provide Network as a Service for the tenants
Monitor and optimize intra-cloud-traffic
Automate changing flows with the SDN-controller
Minimize complexity of the network logic
Efficient handling of QoS and QoE network parameters
Independent network-hardware vendors
The on-going technology and protocols applied to cloud networking are not optimal in terms of resource usage, reliability, deployment and maintenance. For example, the current implementation of Open Stack Neutron relies on different tunnelling mechanisms in order to provide isolation and multi-tenancy support. From a network application developer point of view, this is inefficient since it injects additional overhead and impedes a transparent application development.
To address the issues in that context, we define the following research tasks:
Reconsider the current concepts and state of the art proposals and determine a sophisticated solution towards optimized SDN design for modern cloud architectures
Define competitive use cases as direct controllers and evaluators of our SDN solution
Provide a high level framework for management of cloud based network resources in a uniform manner
Relevance to current and future markets
Having in-house deployment implies an up and running environment prepared to leverage ideas deployed and tested over commodity-hardware. The ICCLab SDN testbed will essentially facilitate the validation of use cases towards comprehensive solutions. The high-level framework on top of the ODL controller will provide smart virtual datacenter management in OpenStack deployments, and potentially target industry partners among the content delivery network companies, like Akamai for example, IPTV and streaming service providers. We also aim to expand the cooperation by exchanging technical expertise with industry partners involved in the SDN-Cloud field.
Cloud computing service is comparable to public utility services like gas, telephone or water supply.
Economical value of cloud computing service is determined by reliability, availability and maintainability (RAM) characteristics.
Availability impacts the value of cloud computing as it is perceived by end users. High Availability systems increase guaranteed availability of a cloud computing service. Therefore they increase the economical value of a cloud computing service.
Cloud HA initiative has the objectives:
To provide a service to analyze problems related with reliability and availability of cloud computing systems
To provide systems and services that increase reliability and availability of cloud computing systems
The following challenges exist currently:
Measuring and analyzing availability: how can we experimentally determine reliability of cloud computing systems (VMs, storage etc.)? Design of adequate reliability measurement experiments is difficult, since we often have to rely on simulation of an outage.
Adapt reliability engineering methods to cloud computing: many reliability analysis and engineering techniques do exist (Fault Tree Analysis, FME(C)A, HAZOP, Markov Chains). How can we apply them to the area of cloud computing?
Analytic and monitoring systems: build systems that automatically monitor reliability of cloud resources and analyze problems.
Failure recovery and intelligent event management systems: build systems that intelligently detect and react to failures.
Currently there is almost no data available on reliability of different virtualization technologies like OpenStack or Docker.
Cloud vendors and manufacturers simply claim that their systems operate reliably without providing data to prove their claims. Think about an engineering company (like e. g. ABB or Siemens). Would they still be on the market if they were not able to tell their customers the exact hazard rates and MTBFs of their products? The IT industry is lagging behind other engineering industries. IT reliability engineering could be an interesting discipline that adds value to IT products and services.
Relevance to current and future markets
Existing High Availability solutions:
Pacemaker: resource monitor that automatically detects failures and recovers failed components. Highly configurable, but also heavyweight. System administrators notoriously complain about its bad configuration interface. A bad configuration can make the system 7-8 times slower than a good configuration.
Keepalived: lightweight resource monitor. Unclear if this tool is well supported by its community.
IBM Tivoli: extremely heavyweight resource monitor and configuration management tool.
HAProxy: light load balancer. Great for web applications, but only applicable to HTTP-based services.
DRBD: disk replication technology. Fast and lightweight. Suitable for small disk networks.
Ceph: distributed storage and file system. Highly decentralized and great scalability.
GlusterFS: distributed storage and file system. Better scalability, but sometimes problem with partition tolerance.
Galera: MySQL cluster. True multimaster solution.
MySQL NDB Cluster: maps MySQL to simple key,value store. Requires adaption of applications to database interface.
Nagios: great monitoring system. Extendability and many plugins available.
Currently today, large internet-scale services are still architected using the principles of service-orientation. The key overarching idea is that a service is not one large monolith but indeed a composite of cooperating sub-services. How these sub-services are designed and implemented are given either by the respective business function, as in the case of traditional SOA to technical function/domain-context as in the case of the microservice approach to SOA.
In the end what both approaches result in, is a set of services, each of which carrying out a specific task/function. However, in order to bring all these service units together an overarching process needs to be provided to stitch them together and manage their runtimes. In doing so present the complete service to the end-user and for the developer/provider of the service.
The basic management process of stitching these services together is known as orchestration.
Orchestration & Automation
These are two concepts that are often conflated and used as if they’re equivocal. They’re not but they are certainly related, especially when Automation refers to configuration management (CM; e.g. puppet, chef, etc.).
Nonetheless, what both certainly share is that they are oriented around the idea of software systems that expose an API. With that API, manual processes once conducted through user interfaces or command line interfaces can now be programmed and then directed by higher level supervisory software processes.
Orchestration goes beyond automation in this regard. Automation (CM) is the process that enables the provisioning and configuration of an individual node without consideration for the dependencies that node might have on others or vice versa. This is where orchestration comes into play. Orchestration, in combination with automation, ensures the phases of:
“Deploy”: the complete fleet of resources and services are deployed according to a plan. At this stage they are not configured.
“Provision”: each resource and service is correctly provisioned and configured. This must be done such that one service or resource is not without a required operational dependency (e.g. a php application without its database).
This process is of course a simplified one and does not include the steps of design, build and runtime management of the orchestrated components (services and/or resources).
Design: where the topology and dependencies of each component is specified. The model here typically takes the form of a graph.
Build: how the deployable artefacts such as VM images, python eggs, Java WAR files are created either from source or pre-existing assets. This usually has a relationship to a continuous build and integration process.
Runtime: once all components of an orchestration are running the next key element is that they are managed. To manage means at the most basic level to monitor the components. Based on metrics extracted, performance indicators can be formulated using logic-based rules. These when notified where an indicator’s threshold is breached, an Orchestrator could take a remedial action ensuring reliability.
Disposal: Where a service is deployed through cloud services (e.g. infrastructure; VMs) it may be required to destroy the complete orchestration to redeploy a new version or indeed part of the orchestration destroyed.
Ultimately the goal of orchestration is to stitch together (deploy, provision) many components to deliver a functional system (e.g. replicated database system) or service (e.g. a 3-tier web application with API) that operates reliably.
The key objective of this initiatives are:
Provide a reactive architecture that covers not only the case of controlling services but also service provider specific resources. What this means is that the architecture will exhibit responsiveness, resiliency, elasticity and be message-oriented. This architecture will accommodate all aspects that answer our identified research challenges.
Deliver an open-source framework that implements orchestration for services in general and more specifically cloud-based services.
Provide orchestration that provides reliable and cloud-native service delivery
There are other objectives that are more related to delivering other research challenges.
How to best enable and support a SOA, Microservices design patterns?
How to get insight and tracing within each service and across services so problems can be identified, understood?
Efficient management of large-scale composed service and resource instance graphs
Scaling based on ‘useful’ monitoring, resource- and service-level metrics
Consider monitoring system and scaling systems e.g. monasca
How to program the scaling of an orchestrator spanning multiple providers and very different services?
Provision of architectural recommendations and optimisation based on orchestration logic analysis
How to exploit orchestration capabilities to ensure reliability? ie, “load balancer for high availability” for cloud applications. How can load balancing service be automatically injected ensuring automatic scaling?
How could a service orchestration framework bring the techniques of netflix and amazon (internal services) to a wider audience?
Snapshot your service, rollback to your service’s previous state
Reliability of the Service Orchestrator – how to implement this? HAProxy? Pacemaker?
Orchestration logic should be able to be written in many popular languages
Continuous integration of orchestration code and assets
Provider independent orchestration execution and accomdate many resource/service providers.
Hybrid cloud deployments not well considered. How can this be done?
Adoption of well known standards, openid, openauth and custom providers
Authentication services – how to do this over disparate providers?
How to create market places to offer services. Either the service being orchestrated or that service consuming others.
Integration of business services that service owners can charge clients
Containers for service workloads. Where might CoreOS, Docker, Rocket, Solaris Zones fit in the picture?
If windows is not a hard requirement then it makes sense from a provider’s perspective to utilise container tech.
Do we really need full-blown “traditional” IaaS frameworks to offer orchestration?
Relevance to Current & Future Markets
Many companies’ products aim to provide orchestration of resources in the Cloud, such as Sixsq (Slipstream), Cloudify, ZenOSS ControlCenter, Nirmata… There are also several open source projects, especially related to OpenStack, who touch the orchestration topic: OpenStack Heat, Murano, Solum.
Our market survey established a lack of non-cross domain (different service providers), service-oriented orchestration, with many of them taking the lower-level approach of orchestrating resources directly, and very often on a single provider. One aspect that all these solutions are very different in terms of programming models, however there is a growing interest in leveraging a standards-based orchestration description, with TOSCA being the most talked about. Another identified issue is the lack of reliability of services/resources orchestrated by these products, which is a barrier to adoption this initiative aims to solve. Along with this is that many solutions either have no runtime management or has limited capabilities.
In a more general point of view, cloud orchestration brings the following benefits to customers:
Orchestration reduces the overhead of configuring manually all services comprising a cloud-native application
Orchestration allows to get out new updates to a service implementation faster and better tested through continuous testing integration and deployment
Reliable orchestration ensures the linkage and composition of services remaining running all the time, even where one or more components fail. This reduces downtime experienced by clients and keeps the service providers service always available.
Orchestration brings reproducibility and portability in cloud services, which may run on any cloud provider which the orchestration software controls
The key entities of the architecture and their relationships to basic entities are shown in the follow diagram. To understand the complete detailed architecture, click on the picture to get the complete view.
Resource Management in Cloud Computing is a topic that has received much interest both within the research community and within the operations of the large cloud providers; naturally, as it has a significant impact on the cloud provider’s bottom line. Much of the work to date on resource management focuses on Service Level Agreements (for different definitions of an SLA); some of the work also considers energy as a factor.
The primary objective of this work is to develop an energy aware load management solution for Openstack: variants of this have been proposed before and indeed implemented in other stacks (e.g. Eucalyptus) but no such capability exists for Openstack as yet. As well as realizing the solution, the work will involve deploying a variant of the solution on the cloud platform without impacting the operation of the platform and determining what energy savings can be made. It is worth noting that the classical load balancing approach which is very typical for resource managers in cloud contexts is somewhat contradictory to minimizing energy consumption; consequently, the very standard load management tools are not suitable for minimizing cloud energy consumption.
The research challenges are the following:
How to characterize the load in the system, particularly relating to spikes in demand
How much buffer space to maintain to accommodate load spikes
How to perform load consolidation – what load should be moved to what machines?
When to perform load consolidation – how frequently should it take place?
What are the energy gains that can be achieved from such a dynamic system?
Relevance to current and future markets
Advanced resource management mechanisms are a necessity for cloud computing generally. In the case of large deployments, Facebook’s autoscale is an example of how they can be used to achieve energy savings of the order of 15%. In the case of smaller deployments, it is still the case that there are many [[ https://gigaom.com/2013/11/30/the-sorry-state-of-server-utilization-and-the-impending-post-hypervisor-era/ | highly underutilized servers ]] in typical Data Centres and ultimately there will be a need to reduce costs and realize energy efficiencies. The problem is a large, general problem and energy is one specific aspect of it – one of the challenges for this work is how to integrate with other active parts of the ecosystem.
There are some commercial offering which explicitly address energy efficiency in the cloud context. These include:
Hauwei offers Fusion Compute which includes energy management as one of its capabilities;
Sardina Systems has an offering which focuses on energy management for Openstack specifically.
Link to Code
So far this work has focused on understanding performance of live migration – the code to perform advanced load management has not been pushed out to our public repo as it is currently in its very initial stages.
Energy in general and energy consumption in particular is a major issue for the large cloud providers today. Smaller cloud providers – both private and public – also have an interest in reducing their energy consumption, although it is often not their most important concern. With increasing competition and decreasing margins in the IaaS sector, management of energy costs will become increasingly important.
A basic prerequisite of advanced energy management solutions is a good understanding of energy consumption. This is increasingly available in multiple ways as energy meters proliferate: as well as having energy meters on racks, energy meters typically exist in modern hardware and even at subsystem level within today’s hardware. That said, energy metering is something that is commonly coupled to proprietary management systems.
The focus of this initiative is to develop an understanding of cloud energy consumption through measurement and analysis of usage.
The objectives of the energy monitoring initiative are:
to develop a tool to visualize how energy is being consumed within the cloud resources;
to understand the correlation between usage of cloud resources and energy consumption;
to understand what level of granularity is appropriate for capturing energy data;
to devise mechanisms to disaggregate energy consumption amongst users of cloud platforms.
Understanding cloud energy consumption does not give rise to fundamental research challenges – indeed, it is more of an enabler for a more advanced energy management system. However, to have a comprehensive understanding of cloud energy consumption, some research effort is required. The following research challenges arise in this context:
How to consolidate energy consumption from disparate sources to realize a clear understanding of energy consumption within the cloud environment
How to correlate energy consumption with revenue generating services at a fine-grained level (compute, storage and networking)
Relevance to current and future markets
Understanding energy consumption is essential for the large cloud providers as well as for today’s Data Centre providers. Consequently, there are already solutions available which support monitoring of energy consumption of IT resources. Today’s solutions typically do not have specific knowledge of cloud resource utilization and consequently, there is an opportunity for new tools which correlate cloud usage with energy monitoring.
In the Gartner Hype Cycle for Green IT 2014, there are some related technologies which have growth potential over the coming years. Specifically, these are:
Server Digital Power Management Module
Demand Response Management Tools
As such, there are future market opportunities for such energy related work. However, we are still evaluating its commercial potential.