As the trend continues towards Serverless Computing, Edge Computing and Functions as a Service (FaaS), the need for a storage system that can adapt to these architectures keeps growing. When a smart car has to make a decision in an instant, it has no time to ask a data center what to do. Such scenarios are a driver for new storage solutions in more distributed architectures. In our work, we have been considering a scenario in which a distributed storage solution exposes different local endpoints to applications distributed over a mix of cloud and local resources; such applications can give the storage infrastructure an indicator of the nature of the data, which can then be used to determine where it should be stored. For example, data could be considered to be either latency-sensitive (in which case the storage system should try to store it as locally as possible) or loss-sensitive (in which case the storage system should ensure it is on reliable storage).
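The placement idea described above can be sketched in a few lines. This is a minimal illustration under our own assumptions, not the actual storage system: the `Sensitivity` hint, the `Endpoint` fields and the endpoint names are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Sensitivity(Enum):
    LATENCY = "latency-sensitive"
    LOSS = "loss-sensitive"


@dataclass(frozen=True)
class Endpoint:
    name: str
    latency_ms: float  # round-trip time seen by the application (hypothetical value)
    durable: bool      # True if backed by reliable storage


def choose_endpoint(hint: Sensitivity, endpoints: list[Endpoint]) -> Endpoint:
    """Pick a storage endpoint based on the hint supplied by the application."""
    if hint is Sensitivity.LOSS:
        # Loss-sensitive data must land on reliable storage.
        candidates = [e for e in endpoints if e.durable]
        if not candidates:
            raise ValueError("no durable endpoint available")
    else:
        # Latency-sensitive data may go anywhere; closeness is all that counts.
        candidates = endpoints
    # Among the admissible endpoints, prefer the one closest to the application.
    return min(candidates, key=lambda e: e.latency_ms)


endpoints = [
    Endpoint("edge-node", latency_ms=2.0, durable=False),
    Endpoint("cloud-bucket", latency_ms=40.0, durable=True),
]
print(choose_endpoint(Sensitivity.LATENCY, endpoints).name)
print(choose_endpoint(Sensitivity.LOSS, endpoints).name)
```

A real system would likely weigh more factors (capacity, cost, replication state); the sketch only shows how a per-object hint can drive the decision.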
Service hosting platforms such as IaaS and PaaS offer a lot of convenience for the service engineer. They take care of proper provisioning, scaling, healing and profiling. Yet, this platform support is limited when it comes to decisions which require insight into the application state and logic, especially considering applications or services ranging across multiple platforms with composition and orchestration.
The Active Service Management research initiative of the Service Prototyping Lab aims at improving the state of the art by letting applications signal their states, conditions and requirements, and by letting platforms understand these signals. Emerging from the work on Cloud Native Applications (CNA), this initiative subsumes work on pro-active/predictive auto-scaling with application metrics such as numbers of users and self-* properties such as self-healing by replacing crashed or unresponsive application parts with new instances.
- Comparative evaluation of active service management techniques.
- Novel contributions to some of the techniques, in particular to scaling and resilience, but also service evolution.
- Turn research results into best practices to achieve an extended CNA design and appropriate hosting platforms.
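To illustrate scaling on an application metric and a simple self-healing step, consider the following sketch. It is not the Dynamite engine; the function names, the users-per-instance capacity and the instance limits are assumptions made for this example.

```python
import math


def desired_instances(active_users: int, users_per_instance: int,
                      min_instances: int = 1, max_instances: int = 10) -> int:
    """Scale on an application-level metric (users) instead of CPU load."""
    needed = math.ceil(active_users / users_per_instance)
    # Clamp the result to the allowed range of instances.
    return max(min_instances, min(max_instances, needed))


def replace_unhealthy(instances: dict[str, bool]) -> list[str]:
    """Self-healing step: return the names of instances that should be replaced."""
    return [name for name, healthy in instances.items() if not healthy]


print(desired_instances(450, users_per_instance=100))
print(replace_unhealthy({"web-1": True, "web-2": False}))
```

A predictive variant would feed a forecast of `active_users` into the same function rather than the current value.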
Relevance to current and future markets
The commercial landscape of service hosting infrastructures generally assumes an issue-free operation without failures and demand spikes. Due to SLAs, this becomes costly when the assumptions do not hold anymore. With active service management contributions, many of the failures and spikes can be mitigated so that business continuity remains assured.
Relevant Standards and Articles
While management of services is an established topic in industry, the specific issue of actively managing environment-aware services is a genuine research topic. We present a selection of useful reading.
- Scryer predictive auto-scaling by Netflix: part 1, part 2
- Tardigrade: Leveraging Lightweight Virtual Machines to Easily and Efficiently Construct Fault-Tolerant Services
The following architecture figure describes Dynamite, a novel auto-scaling engine. This rule-based and re-usable engine has been designed in the context of CNA.
Articles and Publications
- Giovanni Toffetti Carughi, Sandro Brunner, Martin Blochinger, Florian Dudouet and Andrew Edmonds, “An architecture for self-managing microservices”, International Workshop on Automated Incident Management in Cloud (AIMC’15), Bordeaux, France, April 2015
Open Source Software
- Dynamite scaling engine for CNA using custom metrics for its decisions
Giovanni Toffetti Carughi: toff(at)zhaw.ch
From a datacenter operator or cloud provider point of view, IT services are intangible entities which must run reliably 24/7, be provisioned on demand with the right scale, and be documented and certified properly. In the Service Operations research initiative of the Service Prototyping Lab, we take a closer look at the needs of businesses which operate infrastructure, platform and application software services. There are differences in the structural appearance of services (e.g. daemon, virtual machine, container, plugin archive) and in the level of assurance against risks (technical, legal). These differences need to be accounted for when planning and scheduling the service execution.
- Operation of testbeds for service execution. A BladeCenter is already set up for this purpose.
- Solving business needs regarding service-level agreements (SLAs), software and process certifications, governance, high availability, failover, as well as further technical protection schemes.
Relevance to current and future markets
IT services are the foundation of all digital processes between individuals, enterprises and organisations. Increasingly, processes are going digital, which saves paper but demands fully reliable and automated IT service delivery and governance. Therefore, this initiative serves as an enabler and in particular helps companies to dry-run their services in a controlled environment before rolling them out to the target consumers.
Pietro Brossi: brpi(at)zhaw.ch
Working with remote services requires appropriate and decent tooling. A service idea may take just five seconds (“I want to offer a robust note-taking service”), but its realisation may take much longer (“Which programming language and model?”, “How to describe the service?”, “Where do I find a fitting file service to store the notes on unless I want to take care of backups by myself?”, “Where do I publish my service so that it runs and generates income?”). Therefore, modelling, engineering and integration tools are primarily needed. These tools work in combination with a certain service environment, or ecosystem, consisting of more tools, dependency services, and service platforms which bring services to life.
Open source service platforms such as SPACE and FIWARE went from being architectural visions to actually usable platforms. However, in comparison to cloud stacks and commercial cloud services, their popularity is limited and they are far from being used pervasively. Therefore, the Service Tooling research initiative of the Service Prototyping Lab intends to identify tools and platform services which are straightforward to deploy, easy to use and generic enough to be re-usable in many service scenarios.
For this purpose, the initiative follows a triple structure with three topics of increasing industrial and societal interest: Function-as-a-Service (FaaS), Stealth Computing, Cloud Ecosystems.
- Research and innovation in the entire service lifecycle through advanced tooling: Modelling, publishing, running, consuming and evolving services and service-based applications. This initiative will therefore contribute open source tools to build service platforms, ecosystems and individual applications.
- Layered architectures for services and clients, including adaptive invocation and stealth data management, to benefit from service and cloud environments while overcoming their limitations.
- In connection with CNA, identification of suitable tooling for engineering cloud-native applications, in particular aiming at extreme microservices (nanoservices) with FaaS.
- Adaptation of applications to execution technologies in general: VMs, containers, packages, functions, unikernels. None of these should be a concern to an application engineer and therefore automated tooling is required.
Stealth Computing Architecture
In this part of the initiative, there are a number of architectures depending on the use case and the lifecycle phase of a service. The following diagram represents a typical multi-cloud service integration point with stealth properties. Software applications and services benefit from spreading their data and functions across providers in a tightly controlled, re-usable layer with standard interfaces such as files (e.g. POSIX) and data (e.g. SQL). Users are more willing to adopt cloud environments when explicit user control is made possible by stealth computing.
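One simple way to spread data across providers so that no single provider can read it is XOR-based secret splitting. The sketch below is only an illustration of that general idea; it is an assumption on our part and not necessarily the scheme used by the stealth computing tools of the initiative.

```python
import secrets


def split(data: bytes) -> tuple[bytes, bytes]:
    """Split data into two shares; each share on its own is random noise."""
    pad = secrets.token_bytes(len(data))
    masked = bytes(a ^ b for a, b in zip(data, pad))
    return pad, masked  # store each share with a different provider


def combine(share_a: bytes, share_b: bytes) -> bytes:
    """Recover the original data from both shares."""
    return bytes(a ^ b for a, b in zip(share_a, share_b))


share_a, share_b = split(b"my private note")
print(combine(share_a, share_b))
```

Both shares are needed for recovery, so this variant trades availability for confidentiality; dispersal schemes with redundancy address that trade-off differently.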
Cloud Ecosystems Architecture
This part of the initiative explores marketplaces, brokers, dashboards, cloud migration tools, API generators, aggregators and other enablers of thriving ecosystems with service producers and consumers. The research focuses on prototyping techniques with description/implementation roundtripping, a library of utility services which aid in establishing ecosystems, and improved client-side tools such as CLI helpers.
In the FaaS part of the initiative, tools to bring legacy code into FaaS environments as well as tools to advance the environments themselves are investigated. There are software decomposition tools for Python (Lambada) and for Java (Podilizer, Termite). Furthermore, there is a flexible client/server tool to migrate, execute, test and deploy functions written in several languages (Snafu).
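As a rough idea of what the first step of a decomposition tool for Python might look like, the following sketch lists the top-level functions of a module as candidates for separate FaaS units. It uses only the standard `ast` module and is not the implementation of Lambada; the helper name and the sample code are hypothetical.

```python
import ast


def candidate_functions(source: str) -> list[str]:
    """List top-level function names that could become separate FaaS units."""
    tree = ast.parse(source)
    # Only direct children of the module are considered; methods are skipped.
    return [node.name for node in tree.body if isinstance(node, ast.FunctionDef)]


legacy = '''
def add(a, b):
    return a + b

def greet(name):
    return "hello " + name

class Helper:
    def method(self):
        pass
'''
print(candidate_functions(legacy))
```

A complete tool would additionally have to resolve imports, shared state and calls between the extracted functions.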
Articles and Publications
Note: Preprints are made available in a timely manner. Check preprints.
- J. Spillner, M. Beck, A. Schill, T. M. Bohnert: Stealth Databases: Ensuring User-Controlled Queries in Untrusted Cloud Environments, 8th IEEE/ACM International Conference on Utility and Cloud Computing (UCC), Limassol, Cyprus, December 2015. (PDF author version) (Slides)
- J. Spillner: Secure Distributed Data Stream Analytics in Stealth Applications. 3rd IEEE International Black Sea Conference on Communications and Networking (BlackSeaCom), Constanța, Romania, May 2015.
- J. Spillner, J. Müller: PICav: Precise, Iterative and Complement-based Cloud Storage Availability Calculation Scheme. 7th IEEE/ACM International Conference on Utility and Cloud Computing (UCC), London, UK, December 2014. (PDF)
- J. Spillner, A. Chaichenko, A. Brito, F. Brasileiro, A. Schill: Cloud Resource Recycling: An Addition of Species to the Zoo of Virtualised, Overlaid, Federated, Multiplexed and Nested Clouds. SDPS Transactions: Journal of Integrated Design and Process Science (JIDPS), vol. 18, no. 1, pp. 5-19, April 2014.
- J. Spillner, S. Illgen, A. Schill: Engineering Service Level Agreements: A Constrained-Domain and Transformation Approach. 3rd International Conference on Cloud Computing and Services Science (CLOSER), Aachen, Germany, May 2013.
Note: Latest posts are on top.
- Transducing service descriptions into SaaS prototypes
- Introducing Podilizer: Automated Java code translator for AWS Lambda
- Rapid API generation with Ramses
- Programmatic identification of cloud providers
- FaaS: Function hosting services and their technical characteristics
- Walk-through: Importing virtual machine images into EC2
- Making Tools Robust and Breaking Robust Tools
- Talk of J. Spillner: The Next Service Wave: Prototyping Cloud-Native and Stealthy Applications. IBM Research Zurich, September 2015. (Slides)
- Talk of J. Spillner: Safe File Storage and Databases. GÉANT3+ Datacenter IaaS Workshop, Helsinki, Finland, September 2014. (Slides)
- Talk of J. Spillner: Operating the Cloud from Inside Out. HPI Operating the Cloud Symposium, Potsdam, Germany, September 2013. (Video)
- Talk of J. Spillner: Flexible Service Ecosystems: The serviceplatform.org perspective. 8th KuVS NGSDP Expert Talk, Königswinter, Germany, April 2013. (Slides)
Open Source Software
Note: The software repositories are hosted in the Service Prototyping Lab Github account. Some of our smaller tools are operated live on a Labsite. Check labsite.
- Podilizer: Decompose legacy Java code into functions and deploy them into an AWS Lambda environment.
- Transducer: Service interface transducer for rapid prototyping. Creates a running service mockup from a RAML description.
- Lambda Control Plane applications: Lambackup & LaMa. Store and process data in the AWS Lambda control plane.
- Whatcloud: Identification of cloud provider by network location.
- AWS-CLI-Retry: AWS-CLI tools with retry patches. A wrapper around AWS-CLI for more robustness.
- now archived: Open Source Service Platform Research Initiative, with further links to the SPACE service platform, spotmarkets, crowdserving portal, π-box for user-controlled access to clouds, nested virtualisation etc.
- now archived: Cloud Storage Lab, with further links to dispersed storage and computing as well as stealth computing tools, such as NubiSave and StealthDB
Josef Spillner: josef.spillner(at)zhaw.ch
The connection between the physical world and the virtual world has never been as exciting, accessible, and economically viable as today. Sensors, actuators and robots are able to deliver many physical services in several scenarios, including industrial production and home automation, elderly care, assisted living, logistics and cooperative maintenance.
In isolation, computing capabilities of robots are however limited by embedded CPUs and small on-board storage units. By connecting robots among each other and to cloud computing, cloud storage, and other Internet technologies centered around the benefits of converged infrastructure and shared services, two main advantages can be exploited. First, computation can be outsourced to cloud services leveraging an on-demand pay-per-use elastic model. Second, robots can access a plethora of services complementing their capabilities (e.g., speech analysis, object recognition, knowledge sharing), enabling new complex functionalities and supporting learning.
Cloud robotics is a natural extension to the Internet of Things (IoT). Where IoT devices will gather information about an environment to help make smarter decisions, cloud robotics will be able to use this information and act on it.
Although there is clear recognition that Cloud access is required to complement robotics computation and enable functionalities needed for robotic tasks (e.g., self-driving cars), it is still unclear how to best support these scenarios.
State of the practice
The Robot Operating System (ROS) (http://www.ros.org/) has gained massive traction both in industry and research.
ROS allows robotics developers to concentrate on one specific functionality at a time (e.g., planning, navigation, 3D reconstruction), implemented in so-called ROS nodes that communicate through pub/sub or RPC-like messages to produce a coherent robotic behavior.
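The topic-based pub/sub pattern between nodes can be illustrated without ROS itself. The following is a tiny in-process stand-in written for this text; the `Bus` class and the `/goal` topic are hypothetical and do not use the ROS client libraries.

```python
from collections import defaultdict
from typing import Any, Callable


class Bus:
    """Tiny in-process stand-in for topic-based pub/sub messaging."""

    def __init__(self) -> None:
        self._subscribers: dict[str, list[Callable[[Any], None]]] = defaultdict(list)

    def subscribe(self, topic: str, callback: Callable[[Any], None]) -> None:
        self._subscribers[topic].append(callback)

    def publish(self, topic: str, message: Any) -> None:
        # Deliver the message to every callback registered for the topic.
        for callback in self._subscribers.get(topic, []):
            callback(message)


bus = Bus()
received = []
# A "navigation" node listens for goals produced by a "planning" node.
bus.subscribe("/goal", received.append)
bus.publish("/goal", {"x": 1.0, "y": 2.0})
print(received)
```

Publisher and subscriber only share the topic name, which is what lets each node be developed and replaced independently.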
Its next release iteration (ROS2) will address most of the shortcomings of the previous version, for instance removing the need for a centralized master node and adopting a peer to peer discovery mechanism for nodes.
Current ROS development still faces low-level interoperability and compilation concerns. For the moment, containers are used in this context simply as a way of providing a consistent running environment for ROS code; none of the best practices from cloud computing (e.g., resource and container management, placement automation, resource and SW orchestration, continuous integration and deployment) have been adopted yet.
The initiative’s goal is to enable robotic applications to take full advantage of cloud computing services, resources, best practices, and automation by integrating ROS nodes and robots just as other composable services in the cloud.
The results will benefit all robotic stakeholders:
- Robotic and cloud developers will be able to deploy their application code in containers with a click, triggering the orchestration of all needed supporting services (e.g., databases, caches, load balancers, speech and video processing, ROS nodes) and both virtual (e.g., containers, virtual machines) and physical resources (e.g., robots, sensors, cameras, bare-metal machines). ROS nodes and robotic services will be easily composable and accessible from a market generating revenue for SW development of generic and ad hoc solutions;
- Robot producers will be able to leverage the advancements in cloud infrastructure and platforms needed to support robotic applications. This will enable them to provide end customers with fully functional robots that do not require on-premises compute infrastructure and configuration. Moreover, they will be able to benefit from a developer community and its SW artifacts;
- End users will benefit from robots and applications that work out of the box and can be easily integrated among themselves and with their favourite cloud services.
Cloud orchestration will cover the entire lifecycle of a robotics service, catering for timely resource allocation and release and taking full advantage of the cloud pay-per-use model, thereby noticeably reducing operational costs compared to an always-on solution.
Moreover, by taking a service oriented approach and adopting modern cloud-development methodologies, developers will benefit from continuous integration / deployment practices, resulting in shorter release cycles and higher productivity.
This research initiative will also address:
- Review of frameworks for offloading processing and storage tasks into suitable cloud services, including one of the first such frameworks, AdAPtS (2012).
- Delivery of a bundled set of service tooling, including a dynamic service registry, for the purpose of letting robots access cloud services many years after production despite constant service evolution.
- Fleet management for robots in the cloud.
- Collaborative knowledge sharing between robots using suitable online services.
- Identification of beneficial services to augment robot capabilities, such as image recognition and messaging.
Relevance to current and future markets
Robotics is a hot market with competitive manufacturers especially in Switzerland and Europe, including Rapyuta, Aldebaran and (for educational purposes) LEGO Mindstorms. The much larger and currently mostly untapped market is about robots connected to appropriate feedback-loop and cooperation services.
Relevant Standards and Architectures
ROS (Robot Operating System): http://www.ros.org/
Rapyuta: a Cloud Robotics framework http://rapyuta.org/
The following diagram presents AdAPtS, an existing framework for connecting robots to clouds. We will review and extend this and other architectures as needed, and present a set of complementary architectures for virtual robots and robot coordination in the cloud.
Articles and Publications
- J. Spillner, C. Piechnick, C. Wilke, U. Aßmann, A. Schill: Autonomous Participation in Cloud Services. 2nd International Workshop on Intelligent Techniques and Architectures for Autonomic Clouds (ITAAC), Chicago, Illinois, USA, November 2012. (PDF)
Related blog posts
Giovanni Toffetti: toff(at)zhaw.ch
Tobias Lötscher: loeh(at)zhaw.ch
Today's application developer tooling focuses on single (mostly monolithic) applications and local development on developer workstations. Modern “Cloud Native Applications”, on the other hand, are distributed, decoupled, resilient and highly scalable. Sometimes a business application consists of 20+ so-called microservices.
In the cloud age, customers expect fast innovation and downtime-free application provisioning. Modern cloud development tools, cloud automation and continuous delivery of software as a service make this possible. This delivery model depends on a continuous service deployment functionality in the hosting environment, i.e. in the PaaS or IaaS stack, in combination with powerful version control systems and automated continuous testing and integration. To bring software to scale and avoid failures, decentralised stacks and systems are used to deploy the services. Once they are deployed, scalability and resilience are taken care of by run-time methods such as CNA.
Cloud Application Developer tooling and Continuous deployment on PaaS platforms is a particularly popular research topic with industrial relevance. Tools are being created to support CloudFoundry, OpenShift, Heroku, and Azure. The goal is to get the services running with a single button click and no more worries about dependencies, random breakage or interface incompatibilities.
Innovation in the entire application development lifecycle: design, modelling, testing, packaging, deployment, debugging, publishing, running, monitoring. This initiative will therefore contribute open source tools to build application development ecosystems.
- Provide advanced tooling to develop and operate large and complex cloud applications.
- Innovation along the whole DevOps process: design, coding, testing, continuous integration, continuous deployment, provisioning, operation,…
- Enhancing techniques and tools to support multi-stage zero-downtime deployment of complex cloud service systems.
- Multi-Service/Multi-Cloud support (PaaS & IaaS)
- Focus on Cloud Native Applications, but also supporting complex scenarios, like service interdependencies, versioning and data migrations.
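One common technique for zero-downtime deployment is blue-green deployment: the new version is staged next to the live one and traffic is switched only after a health check passes. The sketch below illustrates that technique in isolation; it is our own example, and the `Router` class and its methods are hypothetical rather than part of the framework developed in this initiative.

```python
from typing import Callable, Optional


class Router:
    """Routes traffic to one of two deployment slots (blue-green deployment)."""

    def __init__(self, live_version: str) -> None:
        self.slots: dict[str, Optional[str]] = {"blue": live_version, "green": None}
        self.live = "blue"

    def idle(self) -> str:
        return "green" if self.live == "blue" else "blue"

    def deploy(self, version: str, health_check: Callable[[str], bool]) -> bool:
        """Stage the version in the idle slot; switch traffic only if it is healthy."""
        slot = self.idle()
        self.slots[slot] = version
        if not health_check(version):
            return False  # the live slot was never touched, so users see no downtime
        self.live = slot  # a single assignment stands in for the traffic switch
        return True

    def serving(self) -> Optional[str]:
        return self.slots[self.live]


router = Router("v1")
router.deploy("v2", health_check=lambda version: True)
print(router.serving())
```

Multi-stage deployments with data migrations need more than this switch, for example schema versions that both slots can work with during the transition.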
Although there are several continuous deployment approaches and quite a few comparisons, there is a lack of a detailed analysis of how to choose the best techniques for deploying services in cloud environments. The Cloud Application Developer Tooling research initiative of the Service Prototyping Lab is exploring ways to get more knowledge and findings on this topic.
Relevance to current and future markets
Automation is a massive cost cutter in the IT industry. By leveraging modern distributed development technologies, application developers and operators can streamline the application lifecycle, eliminate downtime, and govern the lifecycle with minimal effort spent on testing and deployment resources.
Articles and Publications
Open Source Software
CF-WebUI is a modern single page Web User Interface for Cloud Foundry based on AngularJS and Bootstrap http://icclab.github.io/cf-webui/
- Continuous Deployment Framework for complex cloud service systems
- currently supporting CloudFoundry PaaS, goal to support other PaaS
- flexible workflows, CI tool independent
- coming soon
Christof Marti: mach(at)zhaw.ch
Software Defined Networking (SDN) is a technology that has introduced an important paradigm shift in the networking world. With the OpenFlow protocol as a main technological enabler, the essential goal is to extend the conventional network configuration approach by introducing the concept of network control and programmability.
Advances in the OpenFlow protocol and the strong community around the OpenDaylight (ODL) framework have significantly advanced SDN over the past few years, establishing it as a de-facto technology for datacenter network management. Following this trend, OpenStack has been a pioneering technology in providing direct SDN support for Neutron. This approach has introduced new challenges arising from the direct mapping of network traffic between the physical hosts and the virtual tenant networks.
Identifying scenarios that cover the different issues to consider has a high priority in the current SDN world. With the main focus on SDN-managed datacenter networks, this initiative will provide a technical implementation and know-how on managing cloud-based network resources in a straightforward manner.
The “SDN for clouds at the ICCLab” mission involves: establishing use cases, revealing potential issues, analyzing alternative approaches and optimizations in order to achieve efficient networking for classical datacenters, network carriers, Internet Service Providers and Cloud providers. The tasks to achieve this include:
- Provide on-demand, scalable, commodity deployment to facilitate SDN knowledge transfer to academia and business partners
- Provide Network as a Service for the tenants
- Monitor and optimize intra-cloud-traffic
- Automate changing flows with the SDN-controller
- Minimize complexity of the network logic
- Efficient handling of QoS and QoE network parameters
- Independence from network-hardware vendors
The technologies and protocols currently applied to cloud networking are not optimal in terms of resource usage, reliability, deployment and maintenance. For example, the current implementation of OpenStack Neutron relies on different tunnelling mechanisms in order to provide isolation and multi-tenancy support. From a network application developer's point of view, this is inefficient since it injects additional overhead and impedes transparent application development.
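To give a feeling for the encapsulation overhead, the following sketch computes the share of bytes on the wire that is spent on tunnel headers. It takes VXLAN over IPv4 without an outer VLAN tag as an example tunnelling mechanism; the choice of VXLAN and the frame sizes are assumptions for illustration, and other mechanisms have different header sizes.

```python
# Header sizes in bytes for VXLAN over IPv4 without an outer VLAN tag
OUTER_ETHERNET = 14
OUTER_IPV4 = 20
OUTER_UDP = 8
VXLAN_HEADER = 8
VXLAN_OVERHEAD = OUTER_ETHERNET + OUTER_IPV4 + OUTER_UDP + VXLAN_HEADER


def overhead_fraction(inner_frame_bytes: int) -> float:
    """Share of bytes on the wire that is encapsulation rather than tenant traffic."""
    return VXLAN_OVERHEAD / (inner_frame_bytes + VXLAN_OVERHEAD)


for size in (64, 512, 1514):
    print(size, f"{overhead_fraction(size):.1%}")
```

The relative overhead is largest for small frames, which is why workloads with many small packets are affected most.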
To address the issues in that context, we define the following research tasks:
- Reconsider the current concepts and state of the art proposals and determine a sophisticated solution towards optimized SDN design for modern cloud architectures
- Define competitive use cases as direct controllers and evaluators of our SDN solution
- Provide a high level framework for management of cloud based network resources in a uniform manner
Relevance to current and future markets
Having in-house deployment implies an up and running environment prepared to leverage ideas deployed and tested over commodity-hardware. The ICCLab SDN testbed will essentially facilitate the validation of use cases towards comprehensive solutions. The high-level framework on top of the ODL controller will provide smart virtual datacenter management in OpenStack deployments, and potentially target industry partners among the content delivery network companies, like Akamai for example, IPTV and streaming service providers. We also aim to expand the cooperation by exchanging technical expertise with industry partners involved in the SDN-Cloud field.
Articles and Info
- An Introduction to Software-Defined Networking (SDN)
- SDN – OpenFlow Presentation
- Setting up a Learning Switch
Irena Trajkovska – mailto:firstname.lastname@example.org
Cloud computing means:
- On-demand self service
- Elastic resource provisioning
A cloud computing service is comparable to public utility services such as gas, telephone or water supply.
The economic value of a cloud computing service is determined by its reliability, availability and maintainability (RAM) characteristics.
Availability impacts the value of cloud computing as it is perceived by end users. High Availability systems increase the guaranteed availability of a cloud computing service and therefore increase its economic value.
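The effect of redundancy on availability can be made concrete with the standard steady-state formulas. The sketch below assumes independent failures, and the MTBF/MTTR figures are hypothetical numbers chosen for illustration, not measurements.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability of a single repairable component."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def redundant_availability(single: float, replicas: int) -> float:
    """Availability of N independent replicas where one survivor suffices."""
    return 1.0 - (1.0 - single) ** replicas


single = availability(mtbf_hours=999.0, mttr_hours=1.0)  # hypothetical values
print(single)
print(redundant_availability(single, replicas=2))
```

In practice failures are often correlated (shared power, network or software bugs), so the redundant figure is an upper bound rather than a guarantee.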
The Cloud HA initiative has the following objectives:
- To provide a service to analyze problems related to the reliability and availability of cloud computing systems
- To provide systems and services that increase reliability and availability of cloud computing systems
The following challenges exist currently:
Measuring and analyzing availability: how can we experimentally determine reliability of cloud computing systems (VMs, storage etc.)? Design of adequate reliability measurement experiments is difficult, since we often have to rely on simulation of an outage.
Adapt reliability engineering methods to cloud computing: many reliability analysis and engineering techniques do exist (Fault Tree Analysis, FME(C)A, HAZOP, Markov Chains). How can we apply them to the area of cloud computing?
Analytic and monitoring systems: build systems that automatically monitor reliability of cloud resources and analyze problems.
Failure recovery and intelligent event management systems: build systems that intelligently detect and react to failures.
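As an example of applying a classical reliability engineering method to a cloud setup, the following sketch evaluates a small fault tree with AND and OR gates. It assumes independent failure events; the tree structure and the probabilities are hypothetical.

```python
from math import prod


def and_gate(probabilities: list[float]) -> float:
    """Output event occurs only if all independent input events occur."""
    return prod(probabilities)


def or_gate(probabilities: list[float]) -> float:
    """Output event occurs if at least one independent input event occurs."""
    return 1.0 - prod(1.0 - p for p in probabilities)


# Hypothetical tree: the service fails if storage fails OR both VM replicas fail.
p_vm, p_storage = 0.01, 0.001
p_service = or_gate([p_storage, and_gate([p_vm, p_vm])])
print(p_service)
```

In this made-up tree the non-redundant storage dominates the result, which is the kind of insight such an analysis is meant to provide.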
Currently there is almost no data available on reliability of different virtualization technologies like OpenStack or Docker.
Cloud vendors and manufacturers simply claim that their systems operate reliably without providing data to prove their claims. Think about an engineering company (e.g. ABB or Siemens). Would they still be on the market if they were not able to tell their customers the exact hazard rates and MTBFs of their products? The IT industry is lagging behind other engineering industries. IT reliability engineering could be an interesting discipline that adds value to IT products and services.
Relevance to current and future markets
Existing High Availability solutions:
Pacemaker: resource monitor that automatically detects failures and recovers failed components. Highly configurable, but also heavyweight. System administrators notoriously complain about its bad configuration interface. A bad configuration can make the system 7-8 times slower than a good configuration.
Keepalived: lightweight resource monitor. Unclear if this tool is well supported by its community.
IBM Tivoli: extremely heavyweight resource monitor and configuration management tool.
HAProxy: lightweight load balancer. Great for web applications; applicable to TCP- and HTTP-based services.
DRBD: disk replication technology. Fast and lightweight. Suitable for small disk networks.
Ceph: distributed storage and file system. Highly decentralized and great scalability.
GlusterFS: distributed storage and file system. Better scalability, but occasional problems with partition tolerance.
Galera: MySQL cluster. True multimaster solution.
MySQL NDB Cluster: maps MySQL to a simple key-value store. Requires adaptation of applications to the database interface.
Nagios: great monitoring system. Extensible, with many plugins available.
Elasticsearch, Logstash, Kibana (ELK): log file monitoring system.
There are many HA systems available on the market, but almost no tool to analyze reliability of OpenStack and allow for automated intelligent recovery from failure.
A Nagios-based OpenStack monitoring system that automatically adapts to elastic changes in the VM infrastructure
- VM Reliability Tester
A tool to test performance of OpenStack virtual machines.
Today, large internet-scale services are still architected using the principles of service-orientation. The key overarching idea is that a service is not one large monolith but a composite of cooperating sub-services. How these sub-services are designed and implemented is determined either by the respective business function, as in the case of traditional SOA, or by technical function/domain context, as in the case of the microservice approach to SOA.
In the end, both approaches result in a set of services, each of which carries out a specific task/function. However, in order to bring all these service units together, an overarching process needs to be provided to stitch them together and manage their runtimes, and in doing so present the complete service to the end-user and to the developer/provider of the service.
The basic management process of stitching these services together is known as orchestration.
Orchestration & Automation
These are two concepts that are often conflated and used as if they were equivalent. They are not, but they are certainly related, especially when automation refers to configuration management (CM; e.g. Puppet, Chef, etc.).
Nonetheless, what both certainly share is that they are oriented around the idea of software systems that expose an API. With that API, manual processes once conducted through user interfaces or command line interfaces can now be programmed and then directed by higher level supervisory software processes.
Orchestration goes beyond automation in this regard. Automation (CM) is the process that enables the provisioning and configuration of an individual node without consideration for the dependencies that node might have on others or vice versa. This is where orchestration comes into play. Orchestration, in combination with automation, ensures the phases of:
“Deploy”: the complete fleet of resources and services is deployed according to a plan. At this stage they are not configured.
“Provision”: each resource and service is correctly provisioned and configured. This must be done such that one service or resource is not without a required operational dependency (e.g. a php application without its database).
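The requirement that no component is provisioned before its dependencies can be met by ordering the dependency graph topologically. The following sketch uses `graphlib` from the Python standard library (Python 3.9 or later); the component names are hypothetical and extend the php/database example above.

```python
from graphlib import TopologicalSorter

# Hypothetical topology: each component maps to the components it depends on.
dependencies = {
    "database": set(),
    "php-app": {"database"},
    "load-balancer": {"php-app"},
}

# static_order() yields every component after all of its dependencies.
order = list(TopologicalSorter(dependencies).static_order())
print(order)
```

A cyclic dependency makes `static_order()` raise `graphlib.CycleError`, which is a useful early check on the design of the orchestration.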
This process is of course a simplified one and does not include the steps of design, build and runtime management of the orchestrated components (services and/or resources).
Design: where the topology and dependencies of each component are specified. The model here typically takes the form of a graph.
Build: how the deployable artefacts such as VM images, python eggs, Java WAR files are created either from source or pre-existing assets. This usually has a relationship to a continuous build and integration process.
Runtime: once all components of an orchestration are running the next key element is that they are managed. To manage means at the most basic level to monitor the components. Based on metrics extracted, performance indicators can be formulated using logic-based rules. These when notified where an indicator’s threshold is breached, an Orchestrator could take a remedial action ensuring reliability.
Disposal: Where a service is deployed through cloud services (e.g. infrastructure; VMs) it may be required to destroy the complete orchestration to redeploy a new version or indeed part of the orchestration destroyed.
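The runtime step described above (metrics, rule-based indicators, remedial action on a breached threshold) can be sketched as follows. This is only an illustration under stated assumptions: the `Rule` class, the `evaluate` function, the indicator name `error_rate` and the "restart" action are all hypothetical and do not correspond to any particular orchestrator's API.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Rule:
    """A logic-based rule: act when `indicator` exceeds `threshold`."""
    indicator: str
    threshold: float
    action: Callable[[str], str]  # receives the component name


def evaluate(rules: List[Rule], metrics: Dict[str, Dict[str, float]]) -> List[str]:
    """Check each component's metrics against every rule.

    `metrics` maps component name -> {indicator name: measured value}.
    Returns the remedial actions that were triggered.
    """
    triggered = []
    for component, values in metrics.items():
        for rule in rules:
            value = values.get(rule.indicator)
            # A missing metric is skipped rather than treated as a breach.
            if value is not None and value > rule.threshold:
                triggered.append(rule.action(component))
    return triggered


if __name__ == "__main__":
    rules = [Rule("error_rate", 0.05, lambda c: f"restart {c}")]
    metrics = {"web": {"error_rate": 0.2}, "db": {"error_rate": 0.01}}
    print(evaluate(rules, metrics))
```

Here the remedial action is just a string describing what would be done; in practice it would call into the orchestrator to restart, replace or scale the affected component.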
Ultimately the goal of orchestration is to stitch together (deploy, provision) many components to deliver a functional system (e.g. replicated database system) or service (e.g. a 3-tier web application with API) that operates reliably.
The key objectives of this initiative are:
- Provide a reactive architecture that covers not only the case of controlling services but also service provider specific resources. What this means is that the architecture will exhibit responsiveness, resiliency, elasticity and be message-oriented. This architecture will accommodate all aspects that answer our identified research challenges.
- Deliver an open-source framework that implements orchestration for services in general and more specifically cloud-based services.
- Provide orchestration that enables reliable and cloud-native service delivery.
There are further objectives that relate to the following research challenges:
- How to best enable and support SOA and microservices design patterns?
- How to get insight and tracing within each service and across services so that problems can be identified and understood?
- Efficient management of large-scale composed service and resource instance graphs
- Scaling based on ‘useful’ monitoring, resource- and service-level metrics
- Consider monitoring and scaling systems, e.g. Monasca
- How to program the scaling of an orchestrator spanning multiple providers and very different services?
- Provision of architectural recommendations and optimisation based on orchestration logic analysis
- How to exploit orchestration capabilities to ensure reliability, i.e. “load balancer for high availability” for cloud applications? How can a load balancing service be automatically injected so that automatic scaling is ensured?
- How could a service orchestration framework bring the techniques of Netflix and Amazon (internal services) to a wider audience?
- Snapshot your service, rollback to your service’s previous state
- Reliability of the Service Orchestrator – how to implement this? HAProxy? Pacemaker?
- Orchestration logic should be able to be written in many popular languages
- Continuous integration of orchestration code and assets
- Provider-independent orchestration execution that accommodates many resource/service providers.
- Hybrid cloud deployments are not well considered. How can this be done?
- Adoption of well-known standards such as OpenID and OAuth, as well as custom providers
- Authentication services – how to do this over disparate providers?
- How to create marketplaces to offer services, whether the service being orchestrated or a service that consumes others?
- Integration of business services so that service owners can charge clients
- Containers for service workloads. Where might CoreOS, Docker, Rocket, Solaris Zones fit in the picture?
- If Windows is not a hard requirement then it makes sense from a provider’s perspective to utilise container tech.
- Do we really need full-blown “traditional” IaaS frameworks to offer orchestration?
Relevance to Current & Future Markets
Many companies’ products aim to provide orchestration of resources in the Cloud, such as SixSq (SlipStream), Cloudify, Zenoss Control Center, Nirmata… There are also several open source projects, especially related to OpenStack, that touch on the orchestration topic: OpenStack Heat, Murano, Solum.
Our market survey established a lack of cross-domain (i.e. spanning different service providers), service-oriented orchestration: many products take the lower-level approach of orchestrating resources directly, and very often on a single provider. These solutions also differ widely in their programming models; however, there is growing interest in leveraging a standards-based orchestration description, with TOSCA being the most talked about. Another identified issue is the lack of reliability of the services/resources orchestrated by these products, which is a barrier to adoption that this initiative aims to remove. In addition, many solutions either have no runtime management or offer only limited capabilities.
From a more general point of view, cloud orchestration brings the following benefits to customers:
- Orchestration reduces the overhead of manually configuring all the services comprising a cloud-native application
- Orchestration allows new updates to a service implementation to be released faster and better tested, through continuous testing, integration and deployment
- Reliable orchestration ensures that the linkage and composition of services remain running all the time, even where one or more components fail. This reduces the downtime experienced by clients and keeps the service provider’s service available.
- Orchestration brings reproducibility and portability in cloud services, which may run on any cloud provider which the orchestration software controls
The key entities of the architecture and their relationships to basic entities are shown in the following diagram.
Resource Management in Cloud Computing is a topic that has received much interest both within the research community and within the operations of the large cloud providers; naturally, as it has a significant impact on the cloud provider’s bottom line. Much of the work to date on resource management focuses on Service Level Agreements (for different definitions of an SLA); some of the work also considers energy as a factor.
The primary objective of this work is to develop an energy-aware load management solution for OpenStack: variants of this have been proposed before and indeed implemented in other stacks (e.g. Eucalyptus), but no such capability exists for OpenStack as yet. As well as realizing the solution, the work will involve deploying a variant of the solution on the cloud platform without impacting the operation of the platform, and determining what energy savings can be made. It is worth noting that the classical load balancing approach, which is very typical for resource managers in cloud contexts, is somewhat contradictory to minimizing energy consumption: balancing spreads load across all machines, whereas saving energy means concentrating load on as few machines as possible. Consequently, the standard load management tools are not suitable for minimizing cloud energy consumption.
The research challenges are the following:
- How to characterize the load in the system, particularly relating to spikes in demand
- How much buffer space to maintain to accommodate load spikes
- How to perform load consolidation – what load should be moved to what machines?
- When to perform load consolidation – how frequently should it take place?
- What are the energy gains that can be achieved from such a dynamic system?
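To make the consolidation question more concrete, the sketch below packs VM loads onto as few hosts as possible while keeping some buffer space free on each host for load spikes. It is an illustration only, not the solution developed in this work: it uses the well-known first-fit-decreasing bin-packing heuristic, and the function name `consolidate`, the integer load units and the `headroom` parameter are assumptions made for the example.

```python
def consolidate(vm_loads, host_capacity, headroom):
    """Pack VMs onto as few hosts as possible (first-fit decreasing).

    vm_loads:      dict of VM name -> load, in integer capacity units
    host_capacity: capacity of each (identical) host, in the same units
    headroom:      units kept free on every host to absorb load spikes
    Returns a list of hosts, each a list of the VM names placed on it.
    """
    usable = host_capacity - headroom
    hosts = []  # each entry: [current_load, [vm names]]

    # Place the largest VMs first; this usually yields a tighter packing.
    for vm, load in sorted(vm_loads.items(), key=lambda kv: kv[1], reverse=True):
        if load > usable:
            raise ValueError(f"{vm} does not fit on any host with this headroom")
        for host in hosts:
            if host[0] + load <= usable:
                host[0] += load
                host[1].append(vm)
                break
        else:
            # No existing host has room, so another host must stay powered on.
            hosts.append([load, [vm]])

    return [names for _, names in hosts]


if __name__ == "__main__":
    placement = consolidate({"a": 50, "b": 30, "c": 20, "d": 40},
                            host_capacity=100, headroom=20)
    print(len(placement), "hosts needed:", placement)
```

Hosts that receive no VMs can then be powered down or put into a low-power state. Note that this ignores the cost of the live migrations needed to reach the new placement, which is exactly why the questions of when and how often to consolidate matter.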
Relevance to current and future markets
Advanced resource management mechanisms are a necessity for cloud computing generally. In the case of large deployments, Facebook’s Autoscale is an example of how they can be used to achieve energy savings of the order of 15%. In the case of smaller deployments, it is still the case that there are many highly underutilized servers (https://gigaom.com/2013/11/30/the-sorry-state-of-server-utilization-and-the-impending-post-hypervisor-era/) in typical Data Centres and ultimately there will be a need to reduce costs and realize energy efficiencies. The problem is a large, general problem and energy is one specific aspect of it – one of the challenges for this work is how to integrate with other active parts of the ecosystem.
There are some commercial offerings which explicitly address energy efficiency in the cloud context. These include:
- Eucalyptus has support for Energy Efficient Management of VMs which uses essentially the same techniques as in this work, albeit in the context of a different cloud stack;
- Huawei offers FusionCompute, which includes energy management as one of its capabilities;
- Sardina Systems has an offering which focuses on energy management for OpenStack specifically.
Link to Code
So far this work has focused on understanding the performance of live migration – the code to perform advanced load management has not been pushed out to our public repo as it is currently in its very initial stages.
- Performance analysis of “post-copy” live migration in Openstack, Dec 2014
- Setting up post-copy live migration in OpenStack, Dec 2014
- The impact of ephemeral VM disk usage on the performance of Live Migration in Openstack, Oct 2014
- Performance of Live Migration in Openstack under CPU and network load, Sept 2014
- An analysis of the performance of live migration in Openstack, Sept 2014
- An analysis of the performance of block live migration in Openstack, Sept 2014
- Setting up Live Migration in Openstack Icehouse, Sept 2014
- Vojtech’s vbrownbag talk at Openstack Summit Paris, Nov 2014
Link to related projects and initiatives
See the Energy Theme for the larger system architecture.
The next steps on the implementation roadmap are as follows:
- Get tunnelled post-copy live migration working with modifications to libvirt (Jan 2015)
- See if this can be pushed upstream to libvirt
- Consolidate live migration work into clearer message relating to the potential of live migration (Jan 2015)
- Devise control mechanism which can be used to provide energy based control (Feb 2015)
- Deploy and test on Arcus servers (Mar 2015)
- Determine if it is ready for deployment on Bart/Lisa (April 2015)
- Seán Murphy <email@example.com>