We are using Ceilometer to collect energy data from our servers. As noted previously, we were having some performance issues that needed further investigation. In this blog post we cover our approach to profiling the Ceilometer API to determine where the problems arose.
Of course, the first step was to take a look at the log files (in /var/log/ceilometer-all.log); as there was nothing unusual in there, we decided to profile the code.
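As an illustration of the kind of profiling described above, here is a minimal sketch using Python's standard cProfile and pstats modules. It is not the setup used in the post: the helper profile_call and the function slow_query are hypothetical names, and slow_query merely stands in for whichever API handler is under suspicion.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, **kwargs):
    """Run func under cProfile; return (result, text report).

    Hypothetical helper for illustration only.
    """
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Sort by cumulative time so the most expensive call paths come first,
    # and limit the report to the top 10 entries.
    stats.sort_stats("cumulative").print_stats(10)
    return result, buffer.getvalue()


def slow_query(n):
    # Hypothetical stand-in for an API handler suspected of being slow.
    return sum(i * i for i in range(n))


result, report = profile_call(slow_query, 100_000)
print(report)
```

Sorting by cumulative time tends to be a useful first view, because it points at the call paths where time accumulates rather than at individual hot functions.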
We at ICCLab are embarking upon an exciting project to build a software development kit that enables SDN researchers to develop new products and innovative protocols, overcoming the challenges and drawbacks of the decades-old network protocols in use today. We had a long internal debate about which programming language to use for this development. Since we had strong and vocal supporters of both Java and Python, it led to a stalemate. So how did we resolve it?
Cloud Incident Management is a new research direction which focuses on conducting forensic investigations, electronic discovery (eDiscovery), and other critical aspects of security that are inherent in a multi-tenant, highly virtualized environment, along with any standards that need to be followed.
An Incident is an event which occurs outside the standard operation plan and which can lead to a reduction or interruption of quality of service. Incidents, in Cloud Computing, can lead to service shortages at all infrastructure levels (IaaS, PaaS, SaaS).
Incident Management provides a solid approach to addressing SLA incidents in two ways. First, it covers service runtime in the cloud by monitoring and analyzing events that may not cause SLA breaches but may still disrupt service execution. Second, it covers security by correlating and analyzing information coming from logs and generating adequate corrective responses.
Current research will focus on addressing a series of research challenges pertaining to the Cloud Incident Management field:
- Tackling possible temporary or long-term failures through the development of incident management tools, reference architectures and guidance that help cloud customers build systems resilient to cloud service failure.
- Automating the management of incident prevention, detection, response and recovery, backed by clear SLA commitments and continuous monitoring, in order to increase the reliability, resilience, availability, trustworthiness and accountability of cloud providers and customers.
Research Challenges and Open Issues
Current research challenges and open issues are as follows:
- Correct identification, aggregation and correlation of events that make up an incident
- Automated incident classification
- Automated incident / problem management (workflow, processes)
- Root cause analysis in cloud computing
- Assessing business impact
- Incident management in multi-cloud approaches
- Transparency and audit
- Cloud anti-patterns
- Clear definition of outages given by cloud service providers
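To make the first challenge in the list above more concrete, the following is a minimal sketch of one simple way to correlate events into incidents: events are sorted by time and grouped whenever the gap to the previous event stays within a fixed window. The Event type, the correlate function and the 60-second default are assumptions made purely for illustration; they do not describe the architecture discussed here, and a real system would also need to correlate on source, topology and event type.

```python
from dataclasses import dataclass


@dataclass
class Event:
    timestamp: float  # seconds since some reference point
    source: str       # e.g. a host or service name (hypothetical)
    message: str


def correlate(events, window=60.0):
    """Group events into incidents by temporal proximity.

    An event joins the current incident if it occurs at most `window`
    seconds after the previous event; otherwise it starts a new incident.
    """
    incidents = []
    for event in sorted(events, key=lambda e: e.timestamp):
        if incidents and event.timestamp - incidents[-1][-1].timestamp <= window:
            incidents[-1].append(event)
        else:
            incidents.append([event])
    return incidents


events = [
    Event(200.0, "storage-1", "disk latency high"),
    Event(0.0, "compute-3", "VM unreachable"),
    Event(30.0, "network-2", "link flapping"),
]
incidents = correlate(events)
for number, incident in enumerate(incidents, start=1):
    print(number, [e.message for e in incident])
```

A time window alone is a crude heuristic: it can merge unrelated events that happen to coincide and split slow-moving incidents, which is part of why correct correlation remains an open issue.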
A high-level overview of the architecture can be seen below.
[Figure: Cloud Incident Management Architecture]
Relevance to current and future markets
The following items summarize the business impact of incident management:
- Automating incident management reduces the time spent by specialized personnel
- Automation reduces response time to incidents, and thus prevents or reduces downtime, because it can act as soon as an incident occurs
- Return on investment through improved availability, response time and throughput
- Incident management increases efficiency, reduces operating expenses, and offers agility and reliability for business users
For further information or assistance please contact Valon Mamudi.