Month: April 2013 (page 2 of 2)

PaaS on OpenStack

Description

In this initiative we focus on bringing Platform as a Service (PaaS) to the ICCLab testbed, on top of OpenStack. We are investigating and evaluating the requirements for running various open source PaaS solutions such as Cloud Foundry (http://www.cloudfoundry.org), OpenShift (http://www.openshift.org) and Cloudify (http://www.cloudifysource.org), and extending the testbed for monitoring, rating, charging and billing at the PaaS level.

Platform as a Service (PaaS) focuses on developers as customers by providing them with a platform that contains the whole technology stack needed to run applications and services, supporting the typical cloud characteristics such as On-Demand Self-Service, Rapid Elasticity, Measured Service and Resource Pooling. Typically these platforms consist of the following building blocks (a small example of how an application consumes such a service follows the list):

  • Runtime environments (Java, Ruby, Python, NodeJs, .Net, …),
  • Frameworks (Spring, JEE, Rails, Django, … ) and
  • Services like
    • Datastores (SQL, NoSQL, Key-Value-Stores, File-/Object-Storage,…),
    • Messaging (Queuing, PubSub, EventProcessing,…)
    • Management Services (authentication, logging, monitoring,…)
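
As an example of how such a provisioned service reaches an application, the following minimal sketch shows how a Python application running on a Cloud Foundry-style platform could read the credentials of a bound datastore from the VCAP_SERVICES environment variable; the service label and credential fields are illustrative and depend on the concrete service broker.

```python
import json
import os

# Cloud Foundry injects credentials of bound services as JSON in VCAP_SERVICES.
# The service label ("postgres" here) and the credential fields are
# illustrative; they depend on the concrete service offering.
vcap = json.loads(os.environ.get("VCAP_SERVICES", "{}"))

db = None
for label, instances in vcap.items():
    if "postgres" in label.lower() and instances:
        db = instances[0]["credentials"]
        break

if db:
    print("Connecting to %s:%s/%s" % (db.get("host"), db.get("port"), db.get("name")))
else:
    print("No bound datastore found; falling back to local settings.")
```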

Our full PaaS (mid- to long-term) mission is described on the PaaS research theme page.

Problem Statement

PaaS technologies and offerings are still at an early stage. There is a lot of hype and movement in the market, and standards are not yet established. Many of the open source tools, such as Cloud Foundry and OpenShift, are still in beta and not yet mature. Moving the responsibility for operating runtimes, frameworks and services to the cloud provider creates many new challenges. First of all, deployment and operation have to be fully automated, and tooling for operation and management is needed. New parameters for monitoring and rating are required, and new charging models have to be developed and evaluated. Other challenges are the automated interfacing with the underlying infrastructure layer (in our case OpenStack) to provide and guarantee the requested performance and scalability. Last but not least, we have to investigate how to extend the frameworks with new services and runtimes.
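
As a simple illustration of what such a PaaS-level charging model could look like, the sketch below meters an application in terms of allocated memory over time and service requests; the rates and metering parameters are purely assumed for illustration and are not an ICCLab rating scheme.

```python
# Illustrative PaaS charging model: a price per GB-hour of allocated runtime
# memory plus a price per 1000 service (e.g. datastore) requests.
RATE_GB_HOUR = 0.05           # CHF per GB-hour (assumed)
RATE_PER_1K_REQUESTS = 0.01   # CHF per 1000 requests (assumed)

def monthly_charge(memory_gb: float, hours: float, requests: int) -> float:
    """Return the charge for one application over one billing period."""
    return memory_gb * hours * RATE_GB_HOUR + requests / 1000.0 * RATE_PER_1K_REQUESTS

# An app with 2 GB of memory running for a 720-hour month and 3 million requests:
print(round(monthly_charge(2.0, 720, 3_000_000), 2))  # 102.0
```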

Articles and Info

There are a number of presentations about PaaS in general and about Cloud Foundry/BOSH as used in the ICCLab:

Contact Point

The core components of any HA strategy

In his excellent article in Linux Technical Review #04, Jens-Christoph Brendel proposes a new way to implement High Availability (HA) in current IT architectures. According to Brendel, modern IT architectures continually gain in complexity, which makes it difficult to guarantee a certain level of availability. Nevertheless, High Availability is not merely a competitive advantage: for many companies, keeping availability above 99.999 % per year is a matter of survival. A few systematic steps should therefore help in planning and implementing high availability in an IT environment. This article shows a possible strategy for planning High Availability in the Mobile Cloud environment.

Redundancy vs. Complexity

According to Brendel, every HA strategy starts with an evaluation of the degree of availability each architecture component requires. Basically, availability can be increased by adding redundant components (as mentioned in my former article). On the other hand, every new component makes the overall system more complex and increases the risk of component failures. In short: there is always a trade-off between avoiding outages of system components and adding complexity (and possible points of failure) to the overall architecture by introducing redundant components. For the OpenStack environment this means one has to classify the different OpenStack components according to the availability an OpenStack user requires.
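
To make this trade-off concrete, the following back-of-the-envelope sketch (my own illustration, not taken from Brendel's article) shows how adding redundant instances of a component raises the combined availability, assuming the instances fail independently:

```python
# Availability of n redundant, independently failing instances of a component:
# the group is unavailable only if all n instances are down at the same time.
def combined_availability(single: float, n: int) -> float:
    return 1.0 - (1.0 - single) ** n

single = 0.99  # one instance: 99% availability
for n in (1, 2, 3):
    print(f"{n} instance(s): {combined_availability(single, n):.6%}")

# Output:
# 1 instance(s): 99.000000%
# 2 instance(s): 99.990000%
# 3 instance(s): 99.999900%
```

Each added instance is, however, itself something that can fail, be misconfigured or need maintenance, which is exactly the complexity cost described above.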

AEC-classification proposal for OpenStack

One possible classification for IT components is the AEC classification developed by the Harvard Research Group. The AEC classes range from AEC-0 (non-critical systems, typically 90 % availability) to AEC-5 (disaster-tolerant systems with 99.999 % – "five nines" – availability or better). OpenStack basically consists of the following components: Nova (including Nova-Compute, Nova-Volume and Nova-Network), Horizon, Swift (ObjectStore), Glance, Cinder, Quantum and Keystone. A typical OpenStack end user has to deal with these components in order to manage their cloud installation. One has to think about the targeted availability level of each of these components in order to know more about the overall stability of the OpenStack cloud environment. Some components need not be AEC-5, but for others AEC-5 is a must. The following table is a proposal of AEC classes for each of the OpenStack components.

[Table: proposed AEC classes for the OpenStack components]

Of course the real availability architecture of a production OpenStack implementation also depends on how many OpenStack nodes are used and on the underlying virtual and even physical infrastructure, but this proposal serves as a good starting point for thinking about adequate levels of availability in production OpenStack architectures. How do we secure critical components like Nova or Keystone against failures? Any OpenStack HA strategy must focus on this question first.

Risk Management and the “Chaos Monkey”

The next steps towards developing an OpenStack HA strategy are risk identification and risk management. Obviously the risk of a component failure depends on the underlying physical and virtual infrastructure of the OpenStack implementation and on the requirements of the end users, but to investigate failure probabilities and impacts we need a way to test what happens to the OpenStack cloud when components fail. One such test is the "Chaos Monkey" approach developed by Netflix. A Chaos Monkey is a service which identifies groups of systems in an IT environment and randomly terminates some of them, thereby simulating random failures in a complex IT environment. The risk of component failures in an OpenStack installation could be assessed by using such Chaos Monkey services: by running multiple tests on multiple OpenStack configurations, one can learn whether the current architecture is able to reach the required availability level or not.
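
A very small Chaos-Monkey-style sketch for an OpenStack test installation might look as follows; the host inventory, the service names and the use of SSH with init-style service commands are assumptions for illustration and would have to match the actual deployment (and should certainly not be pointed at a production cloud):

```python
import random
import subprocess

# Hypothetical inventory of OpenStack services per test host.
SERVICES = {
    "controller1": ["nova-api", "nova-scheduler", "keystone"],
    "compute1":    ["nova-compute"],
    "compute2":    ["nova-compute"],
}

def kill_random_service():
    """Pick a random host and service in the test cloud and stop it via SSH."""
    host = random.choice(list(SERVICES))
    service = random.choice(SERVICES[host])
    print(f"Chaos monkey: stopping {service} on {host}")
    # Assumes password-less SSH and init/upstart-style service scripts.
    subprocess.call(["ssh", host, "sudo", "service", service, "stop"])

if __name__ == "__main__":
    kill_random_service()
    # After the kill, availability probes (API calls, instance boots, etc.)
    # would be run to measure the impact of the failure.
```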

Further thoughts

Will OpenStack improve in terms of availability and redundancy? According to TechTarget, the OpenStack Grizzly release should be more scalable and reliable than former releases. A Chaos Monkey test could reveal whether the decentralization of components like Keystone or Cinder leads to enhanced availability levels.

OpenStack on SmartOS

SmartOS is an open source type 1 hypervisor platform based on Illumos, a descendant of OpenSolaris, and developed by Joyent. SmartOS is a live operating system, meaning that it can be booted via PXE, USB or an ISO image, and runs entirely from memory, leaving the full space on the local disk to be used for virtual machines. This architecture makes SmartOS very secure and easy to upgrade and recover. Given its performance and reliability, SmartOS has been chosen in the context of the Mobile Cloud Networking project to support telco-grade workloads and provide carrier-grade performance.

SmartOS as Cloud OS

Cloud providers must be able to offer a single server to multiple users without them noticing that they are not the only user of that machine. This means that the underlying operating system must be able to provision and deprovision, i.e. create and destroy, virtual machines in a very fast and seamless way; it should also allocate physical resources efficiently and fairly amongst the users and should be able to support multithreaded and multi-processor hardware. Lastly, the operating system must be highly reliable and, in case something doesn’t work as it should, it must provide a way to quickly determine the cause. A customer of the cloud provider will also expect the server to be fast, meaning that the observed latency should be minimal. The provided server should also give the flexibility to get extra power when needed, i.e. bursting and scaling, and be secure, meaning that neighboring users must not interfere with each other.

Thanks to the Illumos inheritance, SmartOS presents a set of features that address these needs and make it a perfect candidate for a true Cloud OS:

  • OS Virtualization. SmartOS offers both container-based virtualization, i.e. a lightweight solution combining resource controls and Solaris zones, and KVM virtual machines, a full, hardware-assisted virtualization solution for running a variety of guest OSs, including Linux and Windows (a small provisioning sketch follows this list). Brendan Gregg of Joyent wrote a post comparing the performance of OS virtualization techniques.
  • ZFS and I/O throttling. ZFS combines a file system and a logical volume manager in a single feature. Key characteristics of ZFS are fast file system creation and guaranteed data integrity. ZFS also includes storage pools, copy-on-write snapshots and snapshot cloning. Joyent further extended SmartOS by adding disk I/O throttling. This feature, particularly interesting for a Cloud OS, overcomes a drawback of classic Solaris, where a single zone or application could effectively monopolize access to local storage and degrade the performance of other applications or zones. With this feature, all zones and applications are guaranteed a fair turn at reading from and writing to disk.
  • Network Virtualization. SmartOS makes use of Crossbow to provide a network virtualization layer. Crossbow is fully integrated with vmadm, the virtual machine administration tool of SmartOS, and allows each virtual machine to have up to 32 virtual network interfaces (VNICs). But with the ability to offer so many VNICs, how can sufficient bandwidth be supplied? As a Solaris derivative, SmartOS can leverage advanced networking features such as IP multipathing (IPMP). Operating at a lower level, on the data link layer, SmartOS can also leverage data link multipathing (DLMP), which is similar to trunk aggregation.
  • Observability with DTrace. DTrace is a performance analysis tool included by default in several operating systems, among them Illumos and Solaris and therefore SmartOS. DTrace, short for Dynamic Tracing, can instrument code by modifying a program after it has been loaded into memory. DTrace is not limited to user-space applications; it can also be used to inspect the OS kernel and device drivers. In SmartOS, DTrace can be used to analyze and troubleshoot issues across all zones on a server or within an entire datacenter.
  • Resource control. Resource control is an essential part of container-based virtualization. In SmartOS there are two methods to control resource consumption: the fair share scheduler and CPU capping. The fair share scheduler allows the administrator to set a minimum guaranteed share of CPU, ensuring that all zones get a fair share of CPU when the system is busy. CPU capping sets a limit on the amount of CPU that a particular user will get. In addition to these two methods, Joyent added a CPU bursting feature that lets administrators define a base level of CPU usage and an upper bound, and also limit how long a zone may burst.
  • Security. Thanks to the Illumos and Solaris inheritance, SmartOS offers a high level of security. Zones are completely separate environments, and activity in one zone will not affect neighbouring zones on the same server. Data security is also guaranteed through the use of zones and ZFS file systems.
  • Reliability. SmartOS offers Fault Management (FMA) and the Service Management Facility (SMF), which make it more reliable. The Fault Management feature helps detect, report and diagnose any fault or defect that can occur on a SmartOS system. The Service Management Facility, another feature SmartOS inherits from Solaris, introduces dependencies between services – meaning that the system ensures that all services a particular service depends on are up and running before starting it – as well as parallel starting and automatic restart upon failure for fast boot times and service recovery, and delegation of services to non-root users to limit the privileges of a given service. Complementing these is the ability to provide highly available load balancing with the Virtual Router Redundancy Protocol (VRRP). This is an additional feature that needs to be installed on SmartOS, yet it provides a means to implement hot failover via virtual IP sharing, very similar to the combination of Pacemaker and Corosync.
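
To give an impression of how zones are provisioned on SmartOS, the sketch below drives vmadm from Python with a minimal JSON manifest; the image UUID and resource values are placeholders, and the manifest fields should be checked against the vmadm documentation of the SmartOS release in use.

```python
import json
import subprocess

# Minimal, illustrative manifest for an OS (container) zone. Field names
# follow vmadm's JSON manifest format; all values are placeholders.
manifest = {
    "brand": "joyent",                  # "kvm" would create a hardware VM instead
    "alias": "demo-zone",
    "image_uuid": "00000000-0000-0000-0000-000000000000",  # placeholder image
    "max_physical_memory": 512,         # MB
    "quota": 10,                        # GB of ZFS quota
    "nics": [{"nic_tag": "admin", "ip": "dhcp"}],
}

# vmadm reads the manifest from stdin: `vmadm create` provisions the zone.
proc = subprocess.run(["vmadm", "create"], input=json.dumps(manifest).encode())
print("vmadm exited with", proc.returncode)
```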

OpenStack on SmartOS

Given the set of features that makes SmartOS the ideal Cloud OS, it seems only logical to combine it with OpenStack to provide a reliable, high-performance cloud platform. This idea was already blueprinted within OpenStack, and some preliminary work has already been carried out by Thijs, Andy and Hendrik.

The existing work has now been further extended: the code has been updated to the latest OpenStack release, Grizzly, and is available on GitHub. At the moment the nova-compute service runs on SmartOS and is able to instantiate virtual machines, both container-based and KVM. The nova-network service is still a work in progress, and further work is needed to make SmartOS fully Quantum compatible.

Further interesting work includes enabling the integration of OpenFlow controllers (e.g. Ryu, Trema, Floodlight). This, coupled with IPMP and DLMP, will make SmartOS a truly high-performance virtualization platform. With the high availability features of SmartOS, valuable and reliable compute services can be offered with both container and KVM virtualization techniques. All these capabilities are immensely useful; however, in order to truly manage this hypervisor platform, in-depth monitoring will be required, and this is where DTrace will be leveraged as a configurable source of system metrics. These metrics can be supplied to the OpenStack Ceilometer monitoring system for both performance and billing purposes. Whereas we are currently focused on the compute and networking capabilities, SmartOS’s ZFS storage capabilities will also be leveraged to provide block-type storage services.

ICCLab at the Kick-off Meeting of the ISSS Special Interest Group Cloud Computing Security – Goals and Work planned for 2013

by Josef Spillner

At the first meeting of the SIG Cloud Computing Security (SIG CCS) under the new lead of Bernhard Tellenbach, member of the ISSS board, the nine participants had a lively discussion about the goals of this SIG.
Among the goals discussed were ideas such as writing a white paper on a CCS problem where existing Best Practices, guides or textbooks offer little or no guidance and the publishing of an overview of existing and future cloud certifications.
But in the end, the SIG members decided to create a standardized presentation covering relevant aspects of cloud computing security. Since large enterprises typically have the required know-how already in-house and small enterprises usually outsource their IT infrastructure, the SIG decided to gear the presentation toward medium-sized enterprises.
When finished, each SIG member is expected to give the presentation at least three times at events organized by trade and industry associations, SME organizations, cloud computing interest groups etc. which have members from medium-sized enterprises in their target audience.
To avoid reinventing the wheel, the SIG first conducts a thorough review of existing literature and resources on cloud computing security. In the next step, the SIG selects the content to be included in the presentation. Finally, the presentation is built from both existing and new material to fill the gaps identified during the review phase.
In parallel to this work, the SIG contacts groups and organizations which work in the same domain as this SIG. By exchanging information and know-how, the quality of the output is improved and the SIG can make sure that it does not redo the work of others.

The next meeting of the SIG CCS is scheduled for 22 April at Zurich main station. If you want more information on this SIG or if you want to participate, please drop an email to bernhard.tellenbach@isss.ch.

Pacemaker: clusters to allow HA in OpenStack

OpenStack’s capabilities to support High Availability are very limited. If a virtual machine crashes, there is no automatic recovery. Clustering software seems to be a great workaround to allow redundancy and implement High Availability (HA).

Pacemaker is a scalable cluster resource manager developed by Clusterlabs. Its advantages are:

  • Support of many different deployment scenarios
  • Monitoring of resources
  • Recovery from outages

According to the OpenStack documentation website, the OpenStack HA environment builds on Pacemaker and Corosync. Corosync is Pacemaker’s messaging layer and is responsible for distributing the clustering messages. Pacemaker uses resource agents that manage the different resources and communicate via Corosync. Shared state is kept on DRBD block devices, virtual devices layered on top of the machine nodes’ own block devices (such as hard disks) that replicate their content between the nodes of a cluster. Pacemaker resource agents control the DRBD devices and, through Corosync, keep the cluster coordinated, and are therefore able to provide high availability of machine nodes in an OpenStack environment.
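
To give a flavour of how such a resource is declared, the following sketch drives the crm shell from Python to define a virtual IP address as a monitored cluster resource; the resource name, address and monitoring interval are placeholders, and in practice these commands are usually typed directly into the crm shell.

```python
import subprocess

# Declare a virtual IP address as a Pacemaker resource, managed by the
# ocf:heartbeat:IPaddr2 resource agent and monitored every 30 seconds.
# The resource name, IP address and netmask are placeholders.
crm_args = [
    "crm", "configure", "primitive", "p_vip", "ocf:heartbeat:IPaddr2",
    "params", "ip=192.168.100.10", "cidr_netmask=24",
    "op", "monitor", "interval=30s",
]

subprocess.check_call(crm_args)
```

OpenStack services and the underlying DRBD devices are declared analogously with their respective resource agents, so that Pacemaker can restart or migrate them when a node fails.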

Integration of Pacemaker into OpenStack is a major step towards creating an HA cloud environment. There is an ongoing evaluation of how Pacemaker fits into the MobileCloud environment, but it is obvious that there should be a test procedure to evaluate the availability of cloud resources in different integration scenarios. Follow-up information on this subject will be posted in a future blog post.

Distributed File Systems

Description

Distributed File Systems are file systems that allow access to files from multiple hosts via a computer network, making it possible for multiple users on multiple machines to share files and storage resources.

Distributed File Systems are designed to be “transparent” in a number of aspects (e.g.: location, concurrency, failure, replication), i.e. client programs see a system which is similar to a local file system. Behind the scenes, the Distributed FS handles locating files, transporting data, and potentially providing other features listed below.

Distributed File Systems can be categorised as follows:

  • Distributed File Systems, also called network file systems. Many implementations have been made; they are location dependent and use access control lists (ACLs).
  • Distributed fault-tolerant File Systems replicate data between nodes (between servers or servers/clients) for high availability and offline (disconnected) operation.
  • Distributed parallel File Systems stripe data over multiple servers for high performance. They are normally used in high-performance computing (HPC).
  • Distributed parallel fault-tolerant File Systems stripe and replicate data over multiple servers for high performance and to maintain data integrity: even if a server fails, no data is lost. These file systems are used in both high-performance computing (HPC) and high-availability clusters.

The objectives of this research initiative are:

  • Evaluate and compare the performance of various Distributed File Systems (a minimal throughput probe is sketched after this list)
  • Explore and evaluate the use of Distributed File Systems as Object Storage
  • Explore the use of Distributed File Systems in OpenStack
  • Explore the use of Distributed File Systems in Hadoop
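
As a starting point for such performance comparisons, a very simple sequential-write probe like the sketch below can be pointed at the mount point of each file system under test; the mount path is a placeholder, and a real evaluation would use established tools (e.g. iozone or dd) and repeated runs.

```python
import os
import time

MOUNT_POINT = "/mnt/dfs-under-test"   # placeholder: where the DFS is mounted
SIZE_MB = 256
CHUNK = b"x" * (1024 * 1024)          # 1 MiB chunk

def write_throughput(path: str) -> float:
    """Write SIZE_MB megabytes sequentially and return the rate in MB/s."""
    target = os.path.join(path, "dfs_bench.tmp")
    start = time.time()
    with open(target, "wb") as f:
        for _ in range(SIZE_MB):
            f.write(CHUNK)
        f.flush()
        os.fsync(f.fileno())          # make sure the data actually hit the DFS
    elapsed = time.time() - start
    os.remove(target)
    return SIZE_MB / elapsed

if __name__ == "__main__":
    print(f"sequential write: {write_throughput(MOUNT_POINT):.1f} MB/s")
```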

Problem Statement

With the increasing need for and use of cloud storage services, providers must be able to deliver a reliable service that is also easy to manage. Distributed File Systems provide the basis for such a cloud storage service.

Articles and Info

Distributed File Systems Blog post Series:

Contact Point

Cloud Performance

Description

Virtualisation is at the core of Cloud Computing, and therefore its performance is crucial for delivering a top-of-the-class service. Being able to provide an adequate virtualised environment based on the user requirements is also key for cloud providers.

SmartOS, a descendant of Illumos and OpenSolaris, presents features such as container and KVM virtualisation and network virtualisation through Crossbow that make it particularly interesting in this context.

This research initiative aims to:

  • Evaluate the performance of SmartOS virtualisation with respect to compute (i.e. containers and KVM), storage and networking (a minimal CPU probe is sketched after this list)
  • Compare SmartOS virtualisation with other techniques (Linux KVM, VMware, Xen)
  • Identify the use cases and workloads that best suit the different techniques
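
One simple way to start such a comparison is to run an identical CPU-bound workload inside a SmartOS zone, inside a KVM guest and on bare metal, and to compare wall-clock times over several runs; the sketch below is such a probe (it only covers compute, and a real evaluation would rely on established benchmark suites).

```python
import time

def cpu_probe(iterations: int = 5_000_000) -> float:
    """Time a fixed CPU-bound workload and return the elapsed seconds."""
    start = time.time()
    total = 0
    for i in range(iterations):
        total += i * i % 7
    return time.time() - start

if __name__ == "__main__":
    # Run the identical script in each environment (zone, KVM guest, bare metal)
    # and compare the reported times across several runs.
    print(f"cpu probe: {cpu_probe():.2f} s")
```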

Problem Statement

Cloud providers must be able to offer a single server to multiple users without them noticing that they are not the only user of that machine. This means that the underlying operating system must be able to provision and deprovision, i.e. create and destroy, virtual machines in a very fast and seamless way; it should also allocate physical resources efficiently and fairly amongst the users and should be able to support multithreaded and multi-processor hardware. Lastly, the operating system must be highly reliable and, in case something doesn’t work as it should, it must provide a way to quickly determine the cause. At the same time, a customer of the cloud provider will expect the server to be fast, meaning that the observed latency should be minimal. The provided server should also give the flexibility to get extra power when needed, i.e. bursting and scaling, and be secure, meaning that neighboring users must not interfere with each other.

Articles and Info

Contact Point

Cloud Monitoring

Description

A monitoring system, especially in an Infrastructure as a Service environment, should be considered indispensable. Knowing which resources are used by which virtual machines (and tenants) is crucial for cloud computing providers as well as for their customers.

Customers want to be sure they get what they pay for at any time, whereas the cloud provider needs the information for its rating and billing system. Furthermore, this information can be useful when it comes to dimensioning and scalability questions.

For monitoring a Cloud environment there are different requirements:

  • A cloud monitoring tool must be able to monitor not only physical machines but also virtual machines and network devices.
  • The information about the monitored resources must be assignable to the corresponding tenant.
  • The metered values must be collected and correlated automatically.
  • The monitoring tool must be as generic as possible to ensure support for any kind of device.
  • The monitoring tool must offer an API.

Problem Statement

Many of the available monitoring tools allow data to be collected from particular devices such as physical machines or virtual machines. However, most of these tools do not automatically monitor newly created instances in a cloud environment. For this reason the ICCLab decided to use Ceilometer to monitor its OpenStack installation. Ceilometer is a core project of OpenStack, but it does not collect data from physical devices like network switches. Therefore, the ICCLab is extending Ceilometer to allow it to collect data from physical devices as well.
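
As an illustration of how the collected data can be consumed, the sketch below reads meters and samples back through Ceilometer's API using the python-ceilometerclient library; the credentials are placeholders and the exact client calls are assumptions based on the Grizzly-era v2 API, so they would need to be checked against the installed client version.

```python
from ceilometerclient import client

# Credentials are placeholders; in a real setup they would come from the
# OpenStack environment (the usual OS_* variables).
ceilometer = client.get_client(
    2,
    os_username="admin",
    os_password="secret",
    os_tenant_name="admin",
    os_auth_url="http://controller:5000/v2.0",
)

# List the meters Ceilometer currently knows about, then fetch recent
# CPU utilisation samples for inspection or rating/charging purposes.
for meter in ceilometer.meters.list():
    print(meter.name, meter.resource_id)

for sample in ceilometer.samples.list(meter_name="cpu_util", limit=10):
    print(sample.timestamp, sample.resource_id, sample.counter_volume)
```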

Articles and Info

Contact Point
