Cloud High Availability

Overview

Cloud computing means:

  • On-demand self service
  • Virtualization
  • Elastic resource provisioning

A cloud computing service is comparable to public utilities such as gas, telephone or water supply.

The economic value of a cloud computing service is determined by its reliability, availability and maintainability (RAM) characteristics.

Availability affects the value of a cloud computing service as perceived by end users. High Availability (HA) systems increase the guaranteed availability of a cloud computing service and therefore increase its economic value.
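
To make the link between redundancy and guaranteed availability concrete, the following minimal sketch computes steady-state availability from MTBF and MTTR and shows how parallel replicas raise it. The figures used (an MTBF of 1000 hours, an MTTR of 10 hours) and the assumption that replicas fail independently are illustrative assumptions, not measurements of any real cloud system.

```python
HOURS_PER_YEAR = 8760.0


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: A = MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def redundant_availability(single: float, replicas: int) -> float:
    """Availability of `replicas` identical components in parallel.

    Assumes independent failures: the service is down only when
    every replica is down at the same time.
    """
    return 1.0 - (1.0 - single) ** replicas


def downtime_hours_per_year(a: float) -> float:
    """Expected yearly downtime for a given availability."""
    return (1.0 - a) * HOURS_PER_YEAR


if __name__ == "__main__":
    # Assumed, purely illustrative figures.
    single = availability(mtbf_hours=1000.0, mttr_hours=10.0)
    for n in (1, 2, 3):
        a = redundant_availability(single, n)
        print(f"{n} replica(s): A = {a:.6f}, "
              f"downtime = {downtime_hours_per_year(a):.2f} h/year")
```

Under these assumptions a second replica cuts expected downtime from roughly 87 hours to under one hour per year. Correlated failures (shared power, network or software bugs) make the real gain smaller, which is one reason measured data matters.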

Objectives

The Cloud HA initiative has the following objectives:

  • To provide a service to analyze problems related to the reliability and availability of cloud computing systems
  • To provide systems and services that increase the reliability and availability of cloud computing systems

Research Challenges

The following challenges exist currently:

  • Measuring and analyzing availability: how can we experimentally determine the reliability of cloud computing systems (VMs, storage etc.)? Designing adequate reliability measurement experiments is difficult, since we often have to rely on simulated outages rather than real ones.

  • Adapting reliability engineering methods to cloud computing: many reliability analysis and engineering techniques already exist (Fault Tree Analysis, FME(C)A, HAZOP, Markov Chains). How can we apply them to cloud computing?

  • Analytic and monitoring systems: build systems that automatically monitor reliability of cloud resources and analyze problems.

  • Failure recovery and intelligent event management systems: build systems that intelligently detect and react to failures.
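
The first two challenges can be illustrated with a small sketch: estimating MTBF, MTTR and availability from a log of observed (or simulated) outages, and feeding the result into the simplest Markov model, a two-state up/down chain. The `Outage` record, the function names and the sample data in the comments are hypothetical and only serve the illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Outage:
    start: float  # hours since the start of the observation window
    end: float    # hours since the start of the observation window


def estimate(outages: List[Outage], window_hours: float) -> Dict[str, float]:
    """Estimate MTBF, MTTR and availability from observed outages."""
    if not outages:
        raise ValueError("need at least one observed outage")
    downtime = sum(o.end - o.start for o in outages)
    uptime = window_hours - downtime
    n = len(outages)
    return {
        "mtbf": uptime / n,        # mean operating time per failure
        "mttr": downtime / n,      # mean time to repair
        "availability": uptime / window_hours,
    }


def markov_availability(failure_rate: float, repair_rate: float) -> float:
    """Steady-state probability of the 'up' state in a two-state
    (up/down) continuous-time Markov chain with constant rates."""
    return repair_rate / (failure_rate + repair_rate)


if __name__ == "__main__":
    # Hypothetical log: two outages in a 1000-hour experiment.
    log = [Outage(100.0, 102.0), Outage(500.0, 504.0)]
    est = estimate(log, window_hours=1000.0)
    model = markov_availability(1.0 / est["mtbf"], 1.0 / est["mttr"])
    print(est, model)
```

With only a handful of outages the estimates carry large uncertainty, which is why the design of the measurement experiment matters as much as the arithmetic.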

Currently there is almost no data available on the reliability of cloud and virtualization technologies such as OpenStack or Docker.

Cloud vendors and manufacturers simply claim that their systems operate reliably, without providing data to back up their claims. Consider an engineering company such as ABB or Siemens: would it still be in the market if it could not tell its customers the exact hazard rates and MTBFs of its products? The IT industry lags behind other engineering industries in this respect. IT reliability engineering could be an interesting discipline that adds value to IT products and services.
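
As an illustration of the figures such a vendor would be expected to publish, the sketch below derives a hazard rate and an MTBF from field data and evaluates the probability of surviving a given operating time. It assumes a constant hazard rate (exponential lifetime model), which is a common simplification and not a claim about any particular product; the sample numbers in the comments are invented.

```python
import math


def hazard_rate(failures: int, device_hours: float) -> float:
    """Constant hazard rate estimate: failures per device-hour."""
    return failures / device_hours


def mtbf(rate: float) -> float:
    """Under a constant hazard rate, MTBF = 1 / rate."""
    return 1.0 / rate


def reliability(rate: float, t_hours: float) -> float:
    """Probability of operating without failure for t_hours:
    R(t) = exp(-rate * t)."""
    return math.exp(-rate * t_hours)


if __name__ == "__main__":
    # Invented field data: 5 failures over 100,000 device-hours.
    lam = hazard_rate(failures=5, device_hours=100_000.0)
    print(f"hazard rate = {lam:.2e} per hour, MTBF = {mtbf(lam):.0f} h")
    print(f"R(1 year)   = {reliability(lam, 8760.0):.3f}")
```

Under this model a component is more likely than not to fail before reaching its MTBF, since R(MTBF) = exp(-1), about 0.37. An MTBF figure is therefore only meaningful together with the lifetime model behind it.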

Relevance to current and future markets

Business impact

Existing High Availability solutions:

  • Pacemaker: cluster resource manager that automatically detects failures and recovers failed components. Highly configurable, but also heavyweight. System administrators frequently complain about its poor configuration interface. A bad configuration can make the system 7-8 times slower than a good one.

  • Keepalived: lightweight resource monitor. It is unclear whether this tool is well supported by its community.

  • IBM Tivoli: extremely heavyweight resource monitor and configuration management tool.

  • HAProxy: lightweight load balancer. Great for web applications; it balances TCP-based as well as HTTP-based services.

  • DRBD: disk replication technology. Fast and lightweight. Suitable for small disk networks.

  • Ceph: distributed storage and file system. Highly decentralized, with great scalability.

  • GlusterFS: distributed storage and file system. Scales well, but sometimes has problems with partition tolerance.

  • Galera: synchronous replication cluster for MySQL. True multi-master solution.

  • MySQL NDB Cluster: maps MySQL to a simple key-value store. Requires applications to be adapted to the database interface.

  • Nagios: great monitoring system. Extensible, with many plugins available.

  • Elasticsearch, Logstash, Kibana (ELK): log file monitoring system.

There are many HA systems available on the market, but almost no tools that analyze the reliability of OpenStack and allow for automated, intelligent recovery from failure.
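
What automated recovery means at its simplest can be sketched as a heartbeat supervisor: a resource is declared failed after a number of consecutive missed heartbeats, and a recovery action is triggered. This is a hypothetical toy model, not the mechanism of any tool listed above; the function name, the `recover` callback and the threshold are assumptions made for the example.

```python
from typing import Callable, Iterable, List


def supervise(heartbeats: Iterable[bool],
              max_misses: int,
              recover: Callable[[], None]) -> List[int]:
    """Call `recover` whenever `max_misses` consecutive heartbeats are missed.

    Returns the indices of the heartbeats at which recovery was triggered.
    """
    triggered: List[int] = []
    misses = 0
    for i, alive in enumerate(heartbeats):
        if alive:
            misses = 0  # a healthy heartbeat clears the miss counter
            continue
        misses += 1
        if misses == max_misses:
            recover()
            triggered.append(i)
            misses = 0  # give the recovered resource a fresh start
    return triggered


if __name__ == "__main__":
    # Hypothetical heartbeat trace: True = answered, False = missed.
    trace = [True, False, False, False, True, False, False, False, False]
    points = supervise(trace, max_misses=3,
                       recover=lambda: print("restarting resource"))
    print("recovery triggered at heartbeats:", points)
```

The threshold trades detection speed against false alarms: a low value reacts quickly but may restart a resource that was only briefly slow. Choosing it from measured failure data rather than by guesswork is the kind of analysis the initiative aims at.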

Results

Presentation

HA_initiative_factsheet

Contact

Konstantin Benz
Obere Kirchgasse 2
CH-8400 Winterthur
Mail: benn__(at)__zhaw.ch