Today marks the release of Watchtower v1.0.0. Watchtower is a Cloud Incident Management solution that we, the ICCLab, have been working on. It has been developed, primarily by Victor Munteanu, as part of the initiative of the same name.
The ICCLab organized the International Workshop on Automated Incident Management in Cloud (AIMC’15) in conjunction with the EuroSys’15 conference.
We are conducting research in order to find out more about how companies handle Cloud Incident Management in their infrastructure (ICT / Cloud).
To this end, we would be grateful if you could fill in the following survey, which will tell us more about how your company handles incidents, which tools you use (if any), and any other opinions you have on the matter. Filling in the survey will only take a few minutes of your precious time.
Once we have enough data for meaningful statistics, we will publish anonymized results of the survey.
In this final part of the tutorial we will verify that all things are working properly.
Because of the large number of moving components, the norm is that things will go wrong. Code changes daily in both Monasca and OpenStack, and quite often things need further adjusting.
As a general rule, if a component is misbehaving, change its logging level from INFO to DEBUG in /etc/monasca and restart the component. Then look at the logs in /var/log/monasca to see what is wrong. For Thresh / Storm, logs can be found in /opt/storm/current/logs.
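The debugging routine above can be sketched in Python. This is only an illustration: the helper names `set_log_level` and `find_problems` are hypothetical, and the sketch assumes the component's configuration stores the level as a `level: INFO` or `level = INFO` entry, which may not match every Monasca component. It works on text you have already read from the files, so it does not touch /etc/monasca or /var/log/monasca by itself.

```python
import re
from typing import Iterable, List


def set_log_level(config_text: str, old: str = "INFO", new: str = "DEBUG") -> str:
    """Return config_text with 'level: OLD' / 'level = OLD' entries switched to NEW.

    Assumption: the level is stored under a key literally named 'level'.
    """
    pattern = re.compile(r"(\blevel\s*[:=]\s*)" + re.escape(old) + r"\b")
    return pattern.sub(lambda m: m.group(1) + new, config_text)


def find_problems(log_lines: Iterable[str]) -> List[str]:
    """Keep only the log lines that mention an error or an exception."""
    markers = ("ERROR", "Exception")
    return [line for line in log_lines if any(m in line for m in markers)]


# Hypothetical configuration fragment and log lines, for illustration only.
sample_config = "logging:\n  level: INFO\n"
print(set_log_level(sample_config))

sample_log = [
    "INFO  starting component",
    "ERROR could not connect to the message queue",
    "DEBUG retrying in 5 seconds",
]
print(find_problems(sample_log))
```

After writing the changed configuration back, the component still has to be restarted for the new level to take effect.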
In this part of the tutorial, we will install and configure all Monasca components.
In my previous blog post I covered my initial impressions of Monasca. In the following trilogy I will cover its installation, setup, and testing. The installation will be performed for the Java version of Monasca, as some components have both Java and Python code available. For those who only need a quick local setup, the best option is to use the Vagrant setup found here.
The installation will be performed on Ubuntu 14.04 and will be split into 3 posts. The first one (this one) will cover dependency installation and configuration, the next one will cover Monasca’s installation and configuration, and the final one will cover the testing of the whole setup.
CALL FOR PAPERS AND PARTICIPATION
International Workshop on Automated Incident Management in Cloud (AIMC’15)
April 21-24 2015, Bordeaux, France
Held in conjunction with
European Conference on Computer Systems (EuroSys)
http://eurosys2015.labri.fr/
One of the focuses of the Cloud Incident Management research initiative is Monitoring as a Service solutions, as they provide the building blocks for incident detection and resolution. As such, part of the work carried out throughout the initiative was on identifying good, maintainable monitoring solutions which can be easily adapted and integrated into a greater incident management architecture. The current blog post tries to cover Monasca, showing the good, the bad and the ugly.
Monasca is a Monitoring as a Service solution which comes from HP and Rackspace: it focuses on providing a complete monitoring solution for OpenStack. Monasca is an open source solution designed to be highly scalable, performant and fault-tolerant for a multi-tenant environment. It features a RESTful API through which one can interact with the system in order to query it or send metrics for processing.
The solution monitors both the OpenStack infrastructure and the VMs which run on it. Further, it can be easily integrated with Rackspace’s StackTach, which forwards all OpenStack events coming from its different components to be processed by Monasca. Additionally, Keystone serves as both its primary authentication mechanism and its service catalog.
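Sending a metric through the RESTful API with a Keystone token could look roughly like the sketch below. It only builds the HTTP request and does not send it. The base URL, port, token value, metric name and dimensions are placeholders chosen for illustration, `build_metric_request` is a hypothetical helper, and the payload layout (name, dimensions, timestamp, value posted to `/v2.0/metrics`, with the timestamp in milliseconds) is an assumption that should be checked against the API version you deploy.

```python
import json
import time
import urllib.request


def build_metric_request(api_url, token, name, value, dimensions):
    """Build (but do not send) a POST request carrying a single metric."""
    body = {
        "name": name,
        "dimensions": dimensions,
        # Assumption: the API expects milliseconds since the epoch.
        "timestamp": int(time.time() * 1000),
        "value": value,
    }
    return urllib.request.Request(
        url=api_url.rstrip("/") + "/v2.0/metrics",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Token previously obtained from Keystone (placeholder value below).
            "X-Auth-Token": token,
        },
        method="POST",
    )


request = build_metric_request(
    "http://localhost:8080",   # placeholder endpoint
    "TOKEN-FROM-KEYSTONE",     # placeholder token
    "cpu.user_perc",
    12.5,
    {"hostname": "devstack"},
)
print(request.get_method(), request.full_url)
```

Passing the resulting object to `urllib.request.urlopen` would actually submit the metric, provided the endpoint and token are valid.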
Cloud Incident Management is a new research direction which focuses on conducting forensic investigations, electronic discovery (eDiscovery), and other critical aspects of security that are inherent in a multi-tenant, highly virtualized environment, along with any standards that need to be followed.
An Incident is an event which occurs outside the standard operation plan and which can lead to a reduction or interruption of quality of service. Incidents, in Cloud Computing, can lead to service shortages at all infrastructure levels (IaaS, PaaS, SaaS).
Incident Management provides a solid approach to addressing SLA incidents. It covers aspects pertaining to service runtime in the cloud by monitoring and analyzing events that may not cause SLA breaches but may disrupt service execution, and it covers aspects related to security by correlating and analyzing information coming from logs and generating adequate corrective responses.
Current research will focus on addressing a series of research challenges pertaining to the Cloud Incident Management field:
- Tackle possible temporary or long-term failures through the development of incident management tools, reference architectures and guidance for cloud customers to build systems resilient to cloud service failure.
- Automated management of incident prevention, detection and response as well as recovery via clear SLA commitments and continuous monitoring will increase reliability, resilience, availability, trustworthiness and even accountability of cloud providers and customers.
Research Challenges and Open Issues
Current research challenges and open issues are as follows:
- Correct identification, aggregation and correlation of events that make up an incident
- Automated incident classification
- Automated incident / problem management (workflow, processes)
- Root cause analysis in cloud computing
- Assessing business impact
- Incident management in multi-cloud approaches
- Transparency and audit
- Cloud anti-patterns
- Clear definition of outages given by cloud service providers
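To make the first challenge in the list more concrete, here is a deliberately naive sketch of event aggregation: events are sorted by time and chained into the same candidate incident whenever they are no more than a fixed gap apart. The `Event` type, the `group_into_incidents` helper and the 60-second default are all hypothetical and are not taken from any existing tool; real correlation would also need to consider sources, topology and causality.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Event:
    timestamp: float  # seconds since the epoch
    source: str       # component that emitted the event
    message: str


def group_into_incidents(events: List[Event], max_gap: float = 60.0) -> List[List[Event]]:
    """Chain events that are no more than max_gap seconds apart into one candidate incident."""
    incidents: List[List[Event]] = []
    for event in sorted(events, key=lambda e: e.timestamp):
        # Compare against the last event of the most recent candidate incident.
        if incidents and event.timestamp - incidents[-1][-1].timestamp <= max_gap:
            incidents[-1].append(event)
        else:
            incidents.append([event])
    return incidents


# Hypothetical events, for illustration only.
events = [
    Event(0.0, "compute", "host unreachable"),
    Event(30.0, "network", "port down"),
    Event(200.0, "storage", "volume detached"),
]
for incident in group_into_incidents(events):
    print([e.message for e in incident])
```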
A high-level overview of the architecture can be seen below.
Relevance to current and future markets
The following items represent the business impact incident management brings:
- Automating incident management reduces the time spent by specialized personnel
- Automation reduces response time to incidents and thus prevents or reduces downtime as it is able to act as soon as the incident has happened
- Return on investment through availability, response time and throughput
- Incident management increases efficiency, reduces operating expenses, offers agility and reliability for business users
For further information or assistance please contact Valon Mamudi.