Dependability Modeling: Testing Availability from an End User’s Perspective

In a previous article we wrote about testing High Availability in OpenStack with the Chaos Monkey. While the Chaos Monkey is a great tool to test what happens when some system components fail, it does not reveal anything about the general strengths and weaknesses of different system architectures. In order to determine whether an architecture with 2 redundant controller nodes and 2 compute nodes offers a higher availability level than an architecture with 3 compute nodes and only 1 controller node, a framework for testing different architectures is required. The “Dependability Modeling Framework” is a promising way to evaluate different system architectures on their ability to achieve the availability levels required by end users.
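As a toy illustration of the kind of question such a framework answers, the two candidate architectures can be compared under the simplifying assumption that the system is available whenever at least one controller and at least one compute node are up; the per-node availability used here is a made-up estimate, not a measured value:

```python
# Toy comparison of the two architectures mentioned above, under the assumed
# simplification that the system is up if at least one controller and at
# least one compute node are up. The per-node availability is a guess.

def redundant(a, n):
    """Availability of n redundant nodes (the group fails only if all fail)."""
    return 1.0 - (1.0 - a) ** n

a_node = 0.99                                          # assumed node availability

arch_a = redundant(a_node, 2) * redundant(a_node, 2)   # 2 controllers, 2 computes
arch_b = redundant(a_node, 1) * redundant(a_node, 3)   # 1 controller, 3 computes

print(f"2 controllers + 2 computes: {arch_a:.6f}")
print(f"1 controller  + 3 computes: {arch_b:.6f}")
```

Under these assumptions the redundant-controller design wins clearly, because the single controller dominates the failure probability of the second design; the point of the framework is precisely to make such assumptions explicit and testable.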

Overcome biased design decisions

The Dependability Modeling Framework is a hierarchical modeling framework for the dependability evaluation of system architectures. Its purpose is to model different alternative architectural solutions for one IT system and then calculate the dependability characteristics of each IT system realization. The calculated dependability values can help IT architects to rate system architectures before they are implemented and to choose the “best” approach from the possible alternatives. Design decisions based on the Dependability Modeling Framework have the potential to be more reflective and less biased than purely intuitive design decisions, since no particular architectural design is preferred over the others. The fit of a particular solution is tested against previously defined criteria before any decision is taken.

Build models on different levels

The Dependability Models are built on four levels: the user level, the function level, the service level and the resource level. The levels reflect the method of first identifying user interactions as well as the system functions and services provided to users, and then finding the resources which contribute to the accomplishment of the required functions. Once all user interactions, system functions, services and resources are identified, models are built (on each of the four levels) to assess the impact of component failures on the quality of the service delivered to end users. The models are connected in a dependency graph to show the dependencies between user interactions, system functions, services and system resources. Once all dependencies are clear, the impact of a system resource outage on user functions can be calculated in a straightforward way: if the failing resource was the only resource delivering functions which are critical to the end user, the impact of the resource outage is very high. If there are redundant resources, services or functions, the impact is much less severe.
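The outage-impact reasoning above can be sketched as a small recursive evaluation over such a dependency graph; all node names below are hypothetical, not taken from a real deployment:

```python
# Minimal sketch of a dependability dependency graph. Each node depends on a
# list of redundancy groups: the node is delivered if, within every group,
# at least one member is still delivered.

def works(node, graph, failed):
    """Return True if `node` is still delivered despite the `failed` resources."""
    if node in failed:
        return False
    groups = graph.get(node, [])                 # leaf resources have no deps
    return all(any(works(d, graph, failed) for d in group) for group in groups)

# User interaction -> function -> services -> resources (illustrative names)
graph = {
    "start_vm":          [["compute_function"]],
    "compute_function":  [["api_service"], ["scheduler_service"]],
    "api_service":       [["controller1", "controller2"]],   # redundant pair
    "scheduler_service": [["controller1", "controller2"]],
}

print(works("start_vm", graph, failed={"controller1"}))                  # True
print(works("start_vm", graph, failed={"controller1", "controller2"}))   # False
```

If a failing resource is the only provider of a critical function the user interaction fails; with redundancy the impact disappears, exactly as described above.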
The dependency graph below demonstrates how end user interactions depend on functions, services and resources.

Fig. 1: Dependency Graph

The Dependability Model makes the impact of resource outages calculable. A Chaos Monkey test can verify such dependency graphs, since the Chaos Monkey effectively tests the outage of system resources by randomly unplugging devices. The less obvious part of the Dependability Modeling Framework is the calculation of resource outage probabilities. The probability of an outage can only be obtained by regularly measuring the unavailability of resources over a long time frame. Since no such data is available so far, one must estimate the probabilities and use these estimates as parameters to calculate the dependability characteristics of resources. A sensitivity analysis can then reveal whether the proposed architecture offers a reliable and highly available solution.
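A one-at-a-time sensitivity analysis of this kind can be sketched as follows; the architecture (1 controller, 3 compute nodes) and all probability estimates are illustrative assumptions:

```python
# Sketch of a sensitivity analysis on estimated outage probabilities: each
# availability estimate is perturbed slightly to see how strongly the
# user-level availability reacts to an estimation error.

def system_availability(a_ctrl, a_comp, n_ctrl=1, n_comp=3):
    """Toy model: up iff at least one controller and one compute node is up."""
    up_ctrl = 1.0 - (1.0 - a_ctrl) ** n_ctrl
    up_comp = 1.0 - (1.0 - a_comp) ** n_comp
    return up_ctrl * up_comp

base = system_availability(0.99, 0.99)
eps = 0.001                                    # perturbation of the estimate

for name, perturbed in [
    ("controller", system_availability(0.99 - eps, 0.99)),
    ("compute",    system_availability(0.99, 0.99 - eps)),
]:
    print(f"sensitivity to {name} estimate: {(base - perturbed) / eps:.4f}")
```

In this toy setup the result reacts far more strongly to the controller estimate than to the compute estimate, which flags the single controller as the critical resource even before the outage probabilities are known precisely.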


Dependability Modeling on OpenStack HA Environment

Dependability Modeling could also be performed on the OpenStack HA environment we use at the ICCLab. High Availability can obviously be realized in many different ways: we could use e.g. a distributed DRBD device to store all data used in OpenStack and synchronize the DRBD device with Pacemaker. Another possible solution is to build Ceph clusters and again use Pacemaker as the synchronization tool. An alternative to Pacemaker is keepalived, which also offers synchronization and control mechanisms for Load Balancing and High Availability. And of course one could also think of using HAProxy for Load Balancing on top of Ceph or DRBD.
In short: different architectures can be modelled. How this is done will be the subject of a future blog post.

EU Report: “Advances in Clouds: Report from the Cloud Computing Expert Working Group”

# Introduction
This is a brief summary of the [EU Report: “Advances in Clouds: Report from the CLOUD Computing Expert Working Group.”](http://cordis.europa.eu/fp7/ict/ssai/docs/future-cc-2may-finalreport-experts.pdf) In this report a set of appointed cloud experts have studied the current cloud computing landscape and have come out with a set of recommendations for advancing the future cloud. They note a large number of challenges present in cloud computing today which, where tackled, provide an opportunity to European innovators. Quoting the report: *“Many long-known ICT challenges continue and may be enhanced in a CLOUD environment. These include large data transmission due to inadequate bandwidth; proprietarily of services and programming interfaces causing lock-in; severe problems with trust, security and privacy (which has legal as well as technical aspects); varying capabilities in elasticity and scaling; lack of interoperation interfaces between CLOUD (resources and services) offerings and between CLOUDs and other infrastructures and many more.”*

They see that performance aspects in the cloud are as pressing as ever and require tackling. *“What is more, spawning (scaling) of objects – no matter whether for the purpose of horizontal or vertical scale – is thereby still slow in modern CLOUD environments and therefore also suboptimal, as it has to take a degree of lag (and hence variance) into account.”*

As ever, the topics of **SLAs and QoS**, which provide aspects of **dependability and transparency** to clients, arise: *“lacking quality of service control on network level, limitations of storage, consistency management.”* The worry here is: *“If the QoS is only observable per resource instance, instead of per user, some users will not get the quality they subscribed to.”*

They say that **interoperability and portability** are still challenges and that “In general there is a lack of support for porting applications (source code) with respect to all aspects involved in the process” and that due to demand of cloud services “the need for resources will exceed the availability of individual providers” however “current federation and interoperability support is still too weak to realise this”.

More related to **business models**, “generally insufficient experience and expertise about the relationship between pricing, effort and benefit: most users cannot assess the impact of moving to the CLOUD”.

Many of the topics highlighted in this report are themes that are being pursued here at the **ICCLab**, especially in the areas of performance, workload management, dependability and interoperability.

# Identified Essential Research Issues
From the report the following key research issues and challenges were noted.

– **Business and cost models**
– Accounting, billing, auditing: pricing models and appropriate dynamic systems are required, including monitoring of resources and charging for them with associated audit functions. This should ideally be supported by integrated quota management for both provider and user, to help keep within budget limits.
– Monitoring: common monitoring standards and methods are required to allow user choice over offerings and to match user expectations in billing. There are issues in managing multi-tenancy accounting, real time monitoring and the need for feedback from expectations depending on resource usage and costs.
– Expertise: The lack of expertise requires research to develop best practice. This includes user choices and their effect on costs and other parameters and the impact of CLOUDs on an ICT budget and user experience. Use cases could be a useful tool.
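The integrated metering, billing and quota idea recommended above could be sketched roughly like this; the class, unit price and numbers are invented for illustration:

```python
# Hypothetical sketch of integrated metering with quota enforcement: usage is
# metered per tenant, billed at an assumed unit price, logged for auditing,
# and checked against a budget limit before new resources are granted.

from dataclasses import dataclass, field

@dataclass
class TenantAccount:
    budget: float                      # spending limit agreed with the user
    unit_price: float = 0.05           # assumed price per resource-hour
    usage_hours: float = 0.0
    records: list = field(default_factory=list)   # audit trail

    def meter(self, hours: float) -> None:
        self.usage_hours += hours
        self.records.append(("usage", hours))

    def bill(self) -> float:
        return self.usage_hours * self.unit_price

    def can_allocate(self, hours: float) -> bool:
        """Quota check: would the extra usage exceed the budget?"""
        return (self.usage_hours + hours) * self.unit_price <= self.budget

acct = TenantAccount(budget=10.0)
acct.meter(100)                        # 100 resource-hours used so far
print(acct.bill())                     # 5.0
print(acct.can_allocate(120))          # False: 220 h would cost 11.0 > 10.0
```

The audit trail and the pre-allocation quota check are exactly the two mechanisms the report asks to be integrated, for provider and user alike.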

– **Data management and handling**
– Handling of big data across large scales;
– Dealing with real-time requirements – particularly streamed multimedia;
– Distribution of a huge amount of data from sensors to CLOUD centres;
– Relationship to code – there is a case for complete independence, and for mobile code which moves the code to the (bulky) data;
– Types of storage & types of data – there is a need for appropriate storage for the access pattern (and digital preservation) pattern required. Different kinds of data may optimally utilise different kinds of storage technology. Issues of security and privacy are also factors.
– Data structuring & integrity – the problem is to have the representation of the real world encoded appropriately inside the computer – and to validate the stored representation against the real world. This takes time (constraint handling) and requires elastic scalable solutions for distributed transactions across multiple nodes;
– Scalability & elasticity are needed in all aspects of data handling to deal with ‘bursty’ data, highly variable demand for access for control and analysis and for simulation work including comparing analytical and simulated representations;

– **Resource awareness/Management**

– Generic ways to define characteristics: there is a need for an architecture of metadata to a common framework (with internal standards) to describe all the components of a system from end-user to CLOUD centre;
– Way to exploit these characteristics (programmatically, resource management level): the way in which software (dominantly middleware but also, for example, user interface management) interacts with and utilises the metadata is the key to elasticity, interoperation, federation and other aspects;
– Relates to programmability & resource management: there are issues with the systems development environment such that the software generated has appropriate interfaces to the metadata;
– Depending on the usage, “resources” may incorporate other services;
– Virtualisation – by metadata descriptions utilised by middleware –
– Of all types of devices
– Of network
– Of distributed infrastructures
– Of distributed data / files / storage
– Deal with scale and heterogeneity: the metadata has to have rich enough semantics;
– Multidimensional, dynamic and large scale scheduling respecting timing and QoS;
– Efficient scale up & down: this requires dynamic rescheduling based on predicted demand;
– Allow portable programmability: this is critical to move the software to the appropriate resource;
– Exploit specifics on all levels: high performance and high throughput applications tend to have specific requirements which must be captured by the metadata;
– Energy efficient management of resources: in the ‘green environment’ the cost of energy is not only financial and so good management practices – another factor in the scheduling and optimisation of resources – have to be factored in;
– Resource consumption management: clearly managing the resources used contributes to the expected cost savings in an elastic CLOUD environment;
– Advanced reservation: this is important for time or business critical tasks and a mechanism is required;
– Fault tolerance, resilience, adaptability: it is of key importance to maintain the SLA/QoS.
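The “efficient scale up & down … based on predicted demand” item above could be sketched, under the assumption of a simple exponential-smoothing forecast, as follows (model, capacities and thresholds are illustrative, not from the report):

```python
import math

# Sketch of scaling on predicted demand: forecast the next load sample with
# exponential smoothing, then provision enough instances to cover it plus
# some headroom.

def predict_next(history, alpha=0.5):
    """Exponentially smoothed forecast of the next load sample."""
    forecast = history[0]
    for sample in history[1:]:
        forecast = alpha * sample + (1 - alpha) * forecast
    return forecast

def target_instances(history, capacity_per_instance=100, headroom=1.2):
    """Provision for the forecast load plus headroom, at least one instance."""
    needed = predict_next(history) * headroom / capacity_per_instance
    return max(1, math.ceil(needed))

load = [80, 120, 150, 170]          # requests/s over the last intervals
print(target_instances(load))       # scale up to 2 instances
```

Running the forecast ahead of demand is what makes scale-up “efficient”: instances are started before the load arrives rather than after an SLA violation.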

– **Multi-tenancy impact**
– Isolate performance, isolate network slices: this is needed to manage resources and security;
– No appropriate programming mechanism: this requires research and development to find an appropriate systems development method, probably utilising service-oriented techniques;
– Co-design of management and programming model: since the execution of the computation requires management of the resources co-design is an important aspect requiring the programmer to have extensive knowledge of the tools available in the environment;

– **Programmability**

– Restructure algorithms / identify kernels: in order to place in the new systems development context – this is re-use of old algorithms in a new context;
– Design models (reusability, code portability, etc.): to provide a systematic basis for the above;
– Control scaling behaviour (incl. scale down, restrict behaviour etc.): this requires to be incorporated in the parameters of the metadata associated with the code;
– Understand and deal with the interdependency of (different) applications with the management of large scale environments;
– Different levels of scale: this is important depending on the application requirements and the characteristics of different scales need to be recorded in the metadata;
– Integrate monitoring information: dynamic re-orchestration and execution time changes to maintain SLA/QoS require the monitoring information to be available to the environment of the executing application;
– Multi-tenancy: as discussed above this raises particular aspects related to systems development and programmability;
– Ease of use: the virtualised experience of the end-user depends on the degree with which the non-functional aspects of the executing application are hidden and managed autonomically;
– Placement optimisation algorithms for energy efficiency, load balancing, high availability and QoS: this is the key aspect of scheduling resources for particular executing applications to optimise resource usage within the constraints of SLA and QoS;
– Elasticity, horizontal & vertical: as discussed before this feature is essential to allow optimised resource usage maintaining SLA/QoS;
– Relationship between code and data: the greater the separation of code and data (with the relationships encoded in metadata) the better the optimisation opportunities. Includes aspects of external data representation;
– Consider a wide range of device types and according properties, including energy efficiency etc.; but also wide range of users & use cases (see also business models): this concerns the optimal use of device types for particular applications;
– Personalisation vs. general programming: as programming moves from a ’cottage knitting’ industry to a managed engineering discipline the use of general code modules and their dynamic recomposition and parameterisation (by metadata) will increasingly become the standard practice. However this requires research in systems development methods including requirements capture and matching to available services.

– **Network Management**

– Guaranteeing bandwidth / latency performance, but also adjusting it on demand for individual tenants (elastic bandwidth / latency): this is a real issue for an increasing number of applications. It is necessary for the network to exhibit some elasticity to match that of the CLOUD centres. This may require network slices with adaptive QoS for virtualising the communication paths;
– Compensating for off-line time / maintain mobile connectivity (internationally): intermittent mobile connectivity threatens integrity in computer systems (and also allows for potential security breaches). This relates to better mechanisms for maintaining sessions / restarting sessions from a checkpoint;
– Isolating performance, connectivity etc.: there is a requirement for the path from end-user to CLOUD to be virtualised but maintaining the QoS and any SLA. This leads to intelligent diagnostics to discover any problems in connectivity or performance and measures to activate autonomic processes to restore elastically the required service.

– **Legalisation and Policy**
– Privacy concerns: especially in international data transfers from user to CLOUD;
– Location awareness: required to certify conformity with legislation;
– Self-destructive data: if one-off processing is allowed;

– **Federation**
– Portability, orchestration, composition: this is a huge and important topic requiring research into semi-automated systems development methods allowing execute time dynamic behaviour;
– Merged CLOUDs: virtualisation such that the end-user does not realise the application is running on multiple CLOUD providers’ offerings;
– Management: management of an application in a federated environment requires solutions from the topics listed above but with even higher complexity;
– Brokering algorithms: are needed to find the best services given the user requirements and the resource provision;
– Sharing of resources between CLOUD providers: this mechanism would allow CLOUD providers to take on user demands greater than their own capacity by expanding elastically (with appropriate agreements) to utilise the resources of other CLOUD suppliers;
– Networking in the deployment of services across multiple CLOUD providers: this relates to the above and also to the Networking topic earlier;
– SLA negotiation and management between CLOUD providers: this is complex with technical, economic and legal aspects;
– Support for context-aware services: is necessary for portability of (fragments of) an application across multiple CLOUD service providers;
– Common standards for interfaces and data formats: if this could be achieved then federated CLOUDs could become a reality;
– Federation of virtualized resources (this is not the same as federation of CLOUDs!) is required to allow selected resources from different CLOUD suppliers to be utilised for a particular application or application instance. It has implications for research in
– Gang-Scheduling
– End-to-End Virtualisation
– Scalable orchestration of virtualized resources and data: co-orchestration is highly complex and requires earlier research on dynamic re- orchestration/composition of services;
– CLOUD bursting, replication & scale of applications across CLOUDs: this relies on all of the above.

– **Security**
– Process applications without disclosing information (homomorphic encryption): this offers some chance of preserving security (and privacy);
– Static & dynamic compliance: this requires the requirements for compliance to be available as metadata to be monitored by the running application;
– Interoperability, respectively common standards for service level and security: this relates to standard interfaces since the need is to encode in metadata;
– Security policy management: policies change with the perceived threats and since the CLOUD environment is so dynamic policies will need to also be dynamic.
– Detection of faults and attacks: in order to secure the services, data and resources, threats need to be detected early (relates to reliability);
– Isolation of workloads: particular workloads of high security may require isolation and execution at specific locations with declared security policies that are appropriate;
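To illustrate what “processing applications without disclosing information” means, here is a toy sketch of the Paillier cryptosystem: the product of two ciphertexts decrypts to the sum of the plaintexts, so a CLOUD provider could aggregate values it cannot read. The primes are tiny and the code is deliberately insecure, for illustration of the principle only:

```python
import math
import random

# Toy Paillier cryptosystem (additively homomorphic). Deliberately tiny,
# insecure parameters -- never use anything like this in practice.

p, q = 293, 433                   # toy primes; real keys use ~1024-bit primes
n, n2 = p * q, (p * q) ** 2
g = n + 1                         # standard generator choice
lam = (p - 1) * (q - 1)
mu = pow(lam, -1, n)              # modular inverse (Python 3.8+)

def encrypt(m):
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:    # r must be a unit modulo n
        r = random.randrange(1, n)
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def decrypt(c):
    x = pow(c, lam, n2)
    return (((x - 1) // n) * mu) % n

c1, c2 = encrypt(20), encrypt(22)
print(decrypt((c1 * c2) % n2))    # 42: the sum, computed under encryption
```

The provider multiplying the ciphertexts never learns 20, 22 or 42; only the key holder can decrypt the aggregate, which is the property the report points to for preserving security and privacy.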