Cloud Incident Management

Overview

Cloud Incident Management is a new research direction which focuses on conducting forensic investigations, electronic discovery (eDiscovery), and other critical aspects of security that are inherent in a multi-tenant, highly virtualized environment, along with any standards that need to be followed.

An Incident is an event which occurs outside the standard operation plan and which can lead to a reduction or interruption of quality of service. Incidents, in Cloud Computing, can lead to service shortages at all infrastructure levels (IaaS, PaaS, SaaS).

Incident Management provides a solid approach to address SLA incidents by covering aspects pertaining to service runtime in cloud through monitoring and analysis of events that may not cause SLA breaches but may disrupt service execution, or by covering aspects related to security by correlating and analyzing information coming from logs and generating adequate corrective responses.

Objectives

Current research will focus on addressing a series of research challenges pertaining to the Cloud Incident Management field:

  • Tackle possible temporary or long-term failures through the development of incident management tools, reference architectures and guidance for cloud customers to build systems resilient to cloud service failure.
  • Automated management of incident prevention, detection and response as well as recovery via clear SLA commitments and continuous monitoring will increase reliability, resilience, availability, trustworthiness and even accountability of cloud providers and customers.

Research Challenges and Open Issues

Current research challenges and open issues are as follows:

  • Correct identification, aggregation and correlation of events that make up an incident
  • Automated incident classification
  • Automated incident / problem management (workflow, processes)
  • Root cause analysis in cloud computing
  • Assessing business impact
  • Incident management in multi-cloud approaches
  • Transparency and audit
  • Cloud anti-patterns
  • Clear definition of outages given by cloud service providers

Architecture

A high level overview of the architecture can be seen below

Cloud Incident Management Architecture

Cloud Incident Management Architecture

Relevance to current and future markets

Business Impact

The following items represent the business impact incident management brings:

  • Automating incident management reduces the time spent by specialized personnel
  • Automation reduces response time to incidents and thus prevents or reduces downtime as it is able to act as soon as the incident has happened
  • Return on investment though availability, response time and throughput
  • Incident management increases efficiency, reduces operating expenses, offers agility and reliability for business users

Contact point

For further information or assistance please contact Valon Mamudi.