SmartOS Series: Storage

In the third part of the SmartOS Series (check the first two parts: Intro and Virtualisation) we take a deep dive into SmartOS storage: ZFS.

ZFS is a combination of file system and logical volume manager originally introduced in Solaris 10. ZFS includes many features that are key for a Cloud OS such as SmartOS:

  • Data integrity. ZFS is designed with a focus on data integrity, with particular emphasis on preventing silent data corruption. Data integrity is achieved by checksumming (hashing) throughout the whole file system tree. Each checksum is stored in the pointer to the block rather than in the block itself. The checksum of that block pointer is in turn stored in its parent pointer, and so on all the way up to the root node. When a block is accessed, its checksum is calculated and compared to the one stored in the pointer to the block. If the checksums match, the data is passed to the process requesting it; if they do not match, ZFS retrieves one of the redundant copies and tries to heal the damaged block.
  • Snapshots. ZFS uses a copy-on-write transactional model: when new data is written, the block containing the old data is retained, which makes it possible to create a snapshot of a file system. Snapshots are both quick to create and space efficient, since the data is already stored and unmodified blocks are shared with the file system. ZFS allows snapshots to be moved between pools, either on the same host or on a remote host. The zfs send command creates a stream representation of either the entire snapshot or just the blocks that have been modified, providing an efficient and fast way to back up snapshots.
  • Clones. Snapshots can be cloned, creating an identical writable copy. Clones share blocks and, as changes are made to any of the clones, new data blocks are created following the copy-on-write principle. Given that clones initially share all their blocks, cloning is practically instantaneous and initially consumes no additional disk space. This means that SmartOS is able to quickly create new, almost identical virtual machines.
  • Adaptive Replacement Cache (ARC). ZFS makes use of different levels of disk caches to improve file system and disk performance and to reduce latency. Data is cached in a hierarchical manner: RAM for frequently accessed data, SSDs for less frequently accessed data. The first level of disk cache (RAM) is always used and runs a variant of the ARC algorithm; its role is similar to that of a level 1 CPU cache. The second level (SSDs) is optional and is split into two different caches, one for reads and one for writes. Populating the read cache, called L2ARC, can take some hours; if the L2ARC SSD is lost, no data is lost, because everything it caches is also stored on the pool's disks. The write cache, called the log device, stores small synchronous writes, which are flushed to disk periodically.
  • Fast file system creation. In ZFS, creating or modifying a filesystem within a storage pool is much easier than creating or modifying a volume in a traditional filesystem. This also means that creating a new virtual machine in SmartOS, i.e. adding a new customer for a cloud provider, is extremely fast.
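
To make the data integrity mechanism described above more concrete, here is a minimal Python sketch. It is not ZFS code: the class names, the two-way mirror and the choice of SHA-256 are assumptions made for illustration only. It shows a checksum kept in the pointer to a block, verified on every read, with a damaged copy healed from a redundant one.

```python
import hashlib
from dataclasses import dataclass


def checksum(data: bytes) -> str:
    # SHA-256 is an illustrative choice, not a statement about ZFS defaults.
    return hashlib.sha256(data).hexdigest()


@dataclass
class BlockPointer:
    """Hypothetical pointer: the block's address plus the block's checksum."""
    address: int
    expected_checksum: str


class MirroredStore:
    """Toy two-way mirror in which every block is stored twice."""

    def __init__(self):
        self.copies = [{}, {}]
        self.next_address = 0

    def write(self, data: bytes) -> BlockPointer:
        address = self.next_address
        self.next_address += 1
        for copy in self.copies:
            copy[address] = data
        # The checksum travels with the pointer, not with the block itself.
        return BlockPointer(address, checksum(data))

    def read(self, pointer: BlockPointer) -> bytes:
        for copy in self.copies:
            data = copy[pointer.address]
            if checksum(data) == pointer.expected_checksum:
                # Found a good copy: overwrite every copy with it ("healing").
                for other in self.copies:
                    other[pointer.address] = data
                return data
        raise IOError("all copies of the block are corrupt")
```

Corrupting one copy by hand (for example `store.copies[0][pointer.address] = b"garbage"`) and then calling `store.read(pointer)` returns the original data and repairs the damaged copy.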

Another very interesting storage feature added by Joyent in SmartOS is disk I/O throttling. In Solaris, a zone or an application could potentially monopolize access to the local storage, almost blocking other zones and applications from accessing it and causing performance degradation. With disk I/O throttling in place, all zones are guaranteed a fair amount of access to disk. If a zone requests disk access beyond its limits, each of its I/O calls is delayed by up to 100 microseconds, leaving enough time for the system to process I/O requests from other zones. Disk I/O throttling only comes into effect when the system is under heavy load from multiple tenants; during quiet times tenants are able to enjoy faster I/O.
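
The throttling idea can be sketched in a few lines of Python. Only the 100 microsecond upper bound and the "only under load" rule come from the description above; the function name, the utilisation inputs and the linear scaling of the delay are assumptions made for illustration and do not describe the actual SmartOS implementation.

```python
MAX_DELAY_US = 100.0  # upper bound per I/O call, as described in the text


def io_delay_us(zone_utilisation: float, fair_utilisation: float, system_busy: bool) -> float:
    """Return a hypothetical per-call delay in microseconds for one zone.

    fair_utilisation must be greater than zero.
    """
    # No throttling during quiet times or while the zone stays within its limit.
    if not system_busy or zone_utilisation <= fair_utilisation:
        return 0.0
    # Assumption: the delay grows linearly with the excess and is capped.
    excess_ratio = (zone_utilisation - fair_utilisation) / fair_utilisation
    return min(MAX_DELAY_US, excess_ratio * MAX_DELAY_US)
```

A zone using twice its fair amount on a busy system gets the full delay, while the same zone on an idle system is not delayed at all.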

The main ZFS components in SmartOS are pools and datasets:

  • Pools are an aggregation of virtual devices (vdevs), which in turn can be constructed of different block devices, such as files, partitions or entire disks. Block devices within a vdev can be configured with different RAID levels, depending on redundancy needs. Pools can be composed of a mix of different types of block devices, e.g. partitions and disks, and their size can be flexibly extended simply by adding a new vdev to the pool.
  • Datasets are a tree of blocks within a pool, presented either as a filesystem, i.e. file interface, or as a volume, i.e. block interface. Datasets can be easily resized and volumes can be thinly provisioned. Both zones and KVMs make use of ZFS datasets: zones use ZFS filesystems while KVMs use ZFS volumes.

In the next part of the SmartOS Series we’ll explore DTrace, a performance analysis and troubleshooting tool.

About the European Cloud Partnership

From the report of the European Cloud Partnership Steering Board meeting of 4 July 2013 in Tallinn, dated 12 August 2013: “On 27 September 2012 the Commission adopted the European Cloud Strategy in the form of a Communication entitled ‘Unleashing the Potential of Cloud Computing in Europe’, in which it announced the intention to set up a European Cloud Partnership (ECP). Under the guidance of the Steering Board, the ECP brings together public authorities and industry consortia to advance the objectives of the Strategy towards a digital single market for cloud computing. On the 4th of July 2013, the ECP organised its second full Steering Board meeting in Tallinn.” The full report can be downloaded here: Report European Cloud Partnership Steering Board Meeting July 2013

Some thoughts on the two topics discussed:

PRISM: No big surprise to see that on the agenda. The hope is that a high-profile activity like the ECP will not engage in the all-too-frequent FUD-type discussions but will support its strategy and implementation with solid social and technology research.

Cloud Standards: From the report: “There is a need to identify minimal standards, based on existing best practices. These should focus on public sector needs, but the private sector is free to adopt these if it sees a benefit to doing so. Past experiences with the GSM standards are recalled, where a strong and forced EU level standardisation push made the EU a global leader in mobile technology.” Drawing a line between GSM and Cloud Computing looks all too typical for the European ICT sector, which is mostly known for its Telco incumbents. It would be interesting to learn more about the actual motivations behind this bold statement. So far, the Cloud Computing sector has shown that standards are not needed. Also, since this statement is from a Telefonica representative, it needs to be viewed in the context of Telco standardization efforts beyond communication systems (WAC, GSMA, Parlay/Parlay-X and RCSe, to name a few), all of which still have to prove real impact on the Internet.

Cloud for Europe: Is this in any way connected to the FI-PPP and, most importantly, FI-WARE? Or to the EGI Federated Cloud? The document refers to a presentation of Helix Nebula, but given that Cloud for Europe is a new project (a large one, an IP with 20+ partners from 11 EU countries), one risks fragmentation across the by now many different European Cloud research and development activities.

OpenStack HA: why is Pacemaker such a slow recovery tool?

If you have ever tried to implement High Availability in OpenStack using Pacemaker, you might have been disappointed by Pacemaker’s extremely slow recovery speed. Pacemaker recovers OpenStack at a very low pace and, even worse, it sometimes detects outages when none have occurred. As a result Pacemaker starts unnecessary, computationally intensive recovery actions which are very slow and decrease OpenStack’s availability. This article describes why Pacemaker recovery actions are sometimes slow and what we can do about it.

Pacemaker is distributed software that monitors and controls the execution of programs or services on different computers in a cluster. The controlled services are called “resources”, and Pacemaker needs a “resource agent” interface in order to be able to manage a resource. Resource management actions are performed by programs that run locally on each computer of the cluster: the “Local Resource Management Daemons” (LRMDs). LRMDs are programs that can monitor the execution of services and restart them in case of failure. The LRMD actions are orchestrated by the “Cluster Resource Manager” (CRM). LRMDs know how to manage resources (from the resource agent specifications), but they do not monitor, stop or restart local IT services autonomously: the CRM has to tell them when and at what time interval they have to perform failover actions. The CRM is configured through a distributed XML file: the “Cluster Information Base” (CIB). The CIB contains all information that is necessary to orchestrate the LRMD actions. The communication between CRM and LRMDs is performed by a “Cluster Communication Manager” (CCM). Typical CCMs that are used in combination with Pacemaker are Corosync or Heartbeat.

Fig. 1: OpenStack HA with Pacemaker.

OpenStack can be made highly available by installing redundant OpenStack services (Keystone, Nova, Glance etc.) on different machines and letting Pacemaker control the execution of the OpenStack services. Custom resource agents must be installed in order to allow the LRMDs to manage OpenStack resources. Then the CIB must be configured so the CRM can orchestrate the LRMD actions. An example of such an OpenStack HA architecture using Pacemaker is shown in Fig. 1.

Why is Pacemaker slow?

Pacemaker failover actions can sometimes be very slow. There are several possible reasons why recovering OpenStack with Pacemaker is such a time-consuming task. The most common ones are these:

  • Suboptimal initialization scripts: OpenStack services do not write their process ID (pid) to a pid file by default. Therefore Pacemaker is not able to identify OpenStack services as manageable entities or resources. Some hacking is necessary in order to make OpenStack services Pacemaker-compliant.
  • Missing resource agents: no OCF-compliant OpenStack resource agents are delivered out of the box. Without them, Pacemaker’s Local Resource Management Daemons (LRMDs) are not able to manage OpenStack services.
  • Bad Cluster Information Base (CIB) configuration: The worst thing is a messy CIB configuration. If, e.g., recovery tasks are kept in large groups and monitoring intervals are too long to discover outages quickly, Pacemaker recovery will be very slow, because Pacemaker has to recover large resource groups and recovery actions start late.

What can be done to make Pacemaker faster?

The first and most important step to make Pacemaker recovery faster is to identify the cause of the slowness. Once you have done that, you can take one of the following actions:

  • Optimize initialization scripts: Depending on your init system (SysV init, Upstart, systemd), you must customize the startup of services so that they generate pid files, which help Pacemaker to identify the service on the system. On Ubuntu, OpenStack services are started by Upstart, so you must customize the Upstart job files to generate pid files automatically. This can be done by changing the configuration files in /etc/init. For the quantum server, e.g., you have to change the /etc/init/quantum-server.conf file to contain several lines which tell the Upstart daemon to create a pid file and place it in a specified folder (typically /var/run). Pid files can be created using start-stop-daemon. For more information on start-stop-daemon, read its man page.
  • Create custom resource agents: there are no OpenStack resource agents delivered out of the box, but you can create them if you want. Resource agents must be placed in the /usr/lib/ocf/resource.d/ folder. They must contain methods to monitor, start and stop services as well as a method to check the execution status of the service. Some good examples of OpenStack resource agents can be found on the Hastexo website.
  • Improve Cluster Information Base (CIB) configuration: Most improvements can be achieved by changing the CIB configuration. Ideally, OpenStack services should run redundantly at the same time on two different OpenStack nodes which can be reached through a shared virtual IP. In case of a service failure on one node, Pacemaker just has to route traffic to the node where the service is still running. If the service is not already running on the fallback node before the failure occurs, Pacemaker has to start the service on at least one of the nodes. Switching traffic over is usually faster than starting whole services. Therefore redundant nodes must always keep redundant OpenStack services up and running. It is really important to ensure that parallel execution of redundant services is configured in the CIB.
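
To illustrate why pid files matter for a resource agent, here is a minimal sketch of a "monitor" action written in Python. Real OCF resource agents are typically shell scripts, and the function name and structure below are assumptions made for illustration only. The return values follow the OCF convention of 0 for success and 7 for "not running".

```python
import os

# OCF return codes used by the monitor action.
OCF_SUCCESS = 0
OCF_NOT_RUNNING = 7


def monitor(pid_file: str) -> int:
    """Report whether the service that wrote pid_file is still alive."""
    try:
        with open(pid_file) as handle:
            pid = int(handle.read().strip())
    except (OSError, ValueError):
        # No pid file, or unreadable content: the service cannot be identified.
        return OCF_NOT_RUNNING
    if pid <= 0:
        return OCF_NOT_RUNNING
    try:
        os.kill(pid, 0)  # signal 0 only checks that the process exists
    except ProcessLookupError:
        return OCF_NOT_RUNNING
    except PermissionError:
        # The process exists but belongs to another user.
        return OCF_SUCCESS
    return OCF_SUCCESS
```

Without a pid file the function can only answer "not running", which is exactly the situation described above for OpenStack services that do not write one.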

If you improve OpenStack initialization scripts, optimize OpenStack resource agents and improve the CIB configuration, Pacemaker should be a great tool to make OpenStack services highly available.


Overview: Network Functions as-a-Service over Virtualised Infrastructures

Network Functions Virtualisation (NFV) is an emerging concept. It refers to the migration of certain network functionalities, traditionally performed by hardware elements, to virtualized IT infrastructures, where they are deployed as software components. NFV leverages commodity servers and storage, including cloud platforms, to enable rapid deployment, reconfiguration and elastic scaling of network functionalities.

Network Function Virtualization Concept

With the aim of promoting the NFV concept, T-NOVA introduces a novel enabling framework, allowing operators not only to deploy virtualized Network Functions (NFs) for their own needs, but also to offer them to their customers, as value-added services. Virtual network appliances (gateways, proxies, firewalls, transcoders, analyzers etc.) can be provided on-demand as-a-Service, eliminating the need to acquire, install and maintain specialized hardware at customers’ premises.

High Level Architecture of T-Nova Platform

T-NOVA will design and implement a management/orchestration platform for the automated provision, configuration, monitoring and optimization of Network Functions-as-a-Service (NFaaS) over virtualised Network/IT infrastructures.

T-NOVA leverages and enhances cloud management architectures for the elastic provision and (re-) allocation of IT resources assigned to the hosting of Network Functions. It also exploits and extends Software Defined Networking platforms for efficient management of the network infrastructure.

Furthermore, in order to facilitate the involvement of diverse actors in the NFV scene and attract new market entrants, T-NOVA establishes an “NFV Marketplace”, where network services and functions by several developers can be published and brokered/traded. Via the Marketplace, customers can browse and select the services and virtual appliances which best match their needs, as well as negotiate the associated SLAs and be charged under various billing models.


More info here.

SmartOS Series: Virtualisation

Last week we started a new blog post series on SmartOS. Today we continue the series and explore in detail the virtualisation aspects of SmartOS.

SmartOS offers two types of OS virtualisation: the Solaris-inherited container-based virtualisation, i.e. zones, and the hosted virtualisation ported to SmartOS by Joyent, KVM.

Containers are a combination of resource controls and Solaris zones, i.e. completely isolated virtual environments, that provide an efficient virtualisation solution and a complete and secure user space environment on a single global kernel. SmartOS uses sparse zones, meaning that only a portion of the file system is replicated in the zone, while the rest of the file system and other resources, e.g. packages, are shared across all zones. This limits the duplication of resources, provides a very lightweight virtualisation layer and makes OS upgrading and patching very easy. Given that no hardware emulation is involved and that guest applications talk directly to the native kernel, container-based virtualisation gives a close-to-native level of performance.


SmartOS container-based virtualisation (Source:

SmartOS provides two resource control methods: the fair share scheduler and CPU capping. With the fair share scheduler a system administrator is able to define a minimum guaranteed share of CPU for a zone; this guarantees that, when the system is busy, all zones will get their fair share of CPU. CPU capping sets an upper limit on the amount of CPU that a zone will get. Joyent also introduced a CPU bursting feature that lets system administrators define a base level of CPU usage and an upper limit and also specify how much time a zone is allowed to burst, making it possible for the zone to get more resources when required.
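
How shares and caps interact can be shown with a small Python calculation. This is a simplified sketch, not the SmartOS scheduler: the function name is hypothetical, caps are assumed to be expressed as a percentage of total CPU, and CPU left unused by a capped zone is not redistributed to the other zones.

```python
def cpu_entitlements(shares, busy_zones, caps=None):
    """Return the percentage of total CPU each busy zone may use.

    shares maps zone name to a positive number of shares.
    caps optionally maps zone name to an upper limit in percent.
    """
    caps = caps or {}
    # Only zones that currently want CPU compete for it.
    total_shares = sum(shares[zone] for zone in busy_zones)
    entitlements = {}
    for zone in busy_zones:
        fair_share = 100.0 * shares[zone] / total_shares
        # A cap is a hard upper limit, regardless of the fair share.
        entitlements[zone] = min(fair_share, caps.get(zone, 100.0))
    return entitlements
```

With shares of 2, 1 and 1 for three busy zones, the first zone is entitled to half of the CPU; if one of the other zones goes idle, its portion is split among the remaining busy zones in proportion to their shares.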

SmartOS already offers a wide set of features, but to make it a true Cloud OS an important feature was missing: hosted virtualisation. Joyent bridged this gap by porting to SmartOS one of the best hosted virtualisation platforms: KVM. KVM on SmartOS is only available on Intel processors with VT-x and EPT (Extended Page Tables) enabled and only supports x86 and x86-64 guests. Nonetheless, this still gives the capability to run unmodified Linux or Windows guests on top of SmartOS.

In hosted virtualisation, hardware is emulated and exposed to the virtual machine; in SmartOS, KVM doesn’t emulate hardware itself, but it exposes an interface that is then used by QEMU (Quick Emulator). When the guest emulated architecture is the same as the host architecture, QEMU can make use of KVM features such as acceleration to increase performance.


SmartOS KVM virtualisation (Source:

KVM virtual machines on SmartOS still run inside a zone, therefore combining the benefits of container-based virtualisation with the power of hosted virtualisation, with QEMU as the only process running in the zone.

In the next part of the SmartOS Series we will look into ZFS, SmartOS’s powerful storage component.

Dell visiting ZHAW ICCLAB

The ICCLAB regularly organises and hosts discussions with key professors from academia and players from industry.

On 18 July, Alba Julio and Montserrat Pellicer from Dell (Enterprise Solutions & Networking) visited us for an open discussion on Software Defined Networking (SDN) technologies. Most of the attendees were ICCLAB researchers.

The event, which lasted about two hours, was dedicated to introducing the respective activities and solutions on SDN, in particular for cloud infrastructures. Given the participants' background in network technologies, some time was also dedicated to reviewing how the introduction of SDN / OpenFlow is progressing, with varying success, in the cloud and in the public network.

After a briefing on his technical background and career, Julio illustrated Dell's SDN strategy and the solutions that will be offered to the market. These are solutions for data center cluster infrastructures which have already been introduced in some large customer installations.

ICCLAB is currently involved in studies of SDN based on OpenFlow solutions, as also reported in other blog posts. Some of the OpenStack cluster solutions can be deployed easily with an SDN architecture for the ICCLAB internal network infrastructure and data center test environments.

The second part of the meeting was dedicated to reviewing possible forms of future collaboration. Concerning collaboration in EU research projects, Thomas Bohnert introduced the Future Internet Public-Private Partnership Programme (FI PPP) as one of the key elements for funding in many ICT sectors. Possible participation in the FI PPP and other programmes (H2020), together with joint participation in scientific programmes and workshops, will be evaluated further.

Distributed File System Series: GlusterFS Introduction

In the first part of the Distributed File Systems Series we gave an introduction to Ceph. Today we will explore another file system: GlusterFS.

GlusterFS is a clustered network filesystem that uses FUSE and allows different storage devices, bricks in GlusterFS terminology, to be aggregated into a single storage pool. The storage in each brick is formatted using a local file system, e.g. XFS or ext4, and then exposed to GlusterFS for storing data. The user space approach gives GlusterFS flexible release cycles and ease of use without taking a toll on the performance of the file system.

Two of the main concepts of GlusterFS are volumes and translators.

A volume is a collection of one or more bricks. There are three different types of volumes in GlusterFS, which differ in how the volume stores the data in the bricks:

  • Distribute Volume. In this type of volume all the data is distributed throughout all the bricks. The data distribution is based on an algorithm that takes into account the size available in each brick. This is the default volume type.
  • Replicate Volume. In a replicate volume the data is duplicated, hence replicate, over every brick in the volume. The number of bricks must be a multiple of the replica count.
  • Stripe Volume. In a stripe volume the data is striped into units of a given size among the bricks. The default unit size is 128KB and the number of bricks should be a multiple of the stripe count.
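
The placement rules for replicate and stripe volumes can be sketched in Python. This is only an illustration, not GlusterFS code: the function names are hypothetical, and round-robin placement of stripe units and grouping of consecutive bricks into replica sets are assumptions. Only the 128 KB default unit size and the "multiple of the count" constraint come from the list above.

```python
STRIPE_UNIT = 128 * 1024  # default stripe unit size mentioned above, in bytes


def stripe_brick(offset, bricks, unit=STRIPE_UNIT):
    """Return the brick that holds the byte at the given file offset.

    Assumption: stripe units are placed on the bricks in round-robin order.
    """
    return bricks[(offset // unit) % len(bricks)]


def replica_sets(bricks, replica_count):
    """Group bricks into sets that each hold identical copies of the data."""
    if len(bricks) % replica_count != 0:
        raise ValueError("number of bricks must be a multiple of the replica count")
    return [bricks[i:i + replica_count] for i in range(0, len(bricks), replica_count)]
```

For example, with two bricks and the default unit size, the first 128 KB of a file lands on the first brick, the next 128 KB on the second, and the third unit wraps around to the first brick again.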

A translator is a GlusterFS component that has a very specific function; some of the main functionalities of GlusterFS are implemented as translators, e.g. I/O scheduling, striping and replication, load balancing, failover of volumes and disk caching. A translator connects to one or more volumes, performs its specific function and offers a volume connection; therefore translators can be hooked together to provide a file system tailored to certain needs. The whole set of translators linked together is called a graph.

GlusterFS Translators graph (Source:

GlusterFS offers a fairly long list of translators for different needs:

  • Storage Translators. These translators define the behaviour of the back-end storage for GlusterFS. A storage translator is typically the first translator in a chain.
    • POSIX: tells GlusterFS to use a normal POSIX file system as the backend, e.g. ext4.
    • BDB: tells GlusterFS to use the Berkeley DB as the backend storage mechanism. This translator uses key-value pairs to store data and POSIX directories to store directories.
  • Clustering Translators. These translators are used to allow GlusterFS to use multiple servers to create a cluster. These translators are used to define the basic behaviour of GlusterFS.
    • Unify: with this translator all the subvolumes from the storage servers will appear as a single volume. An important feature of this translator is that a given file can exist on only one of the subvolumes in the cluster. The translator uses a scheduler to determine where a file resides, e.g. Adaptive Least Usage, Round Robin, random.
    • Distribute: this translator aggregates storage from several storage servers.
    • Replicate: this translator replicates files and directories across the subvolumes. If there are two subvolumes then a copy of each file will be on each subvolume.
    • Stripe: with this translator the content of a file is distributed across subvolumes.
  • Performance Translators. These translators are used to improve the performance of GlusterFS.
    • Read ahead: this translator pre-fetches data before it’s required, usually data that appears next in the file.
    • Write behind: this allows the write operation to return even if the operation has not completed.
    • Booster: with this translator applications can skip FUSE and access GlusterFS directly.

More translators can be written according to specific requirements.
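
The idea of hooking translators together into a graph can be illustrated with a small Python sketch. The classes below are hypothetical stand-ins and do not reflect the GlusterFS translator API: an in-memory storage translator takes the place of POSIX, and the replicate and write-behind translators are reduced to their core behaviour as described above.

```python
class Translator:
    """Common interface: every translator offers the same operations."""

    def write(self, path, data):
        raise NotImplementedError

    def read(self, path):
        raise NotImplementedError


class MemoryStorage(Translator):
    """Stand-in for a storage translator; keeps files in a dict."""

    def __init__(self):
        self.files = {}

    def write(self, path, data):
        self.files[path] = data

    def read(self, path):
        return self.files[path]


class Replicate(Translator):
    """Writes every file to all subvolumes and reads from the first one."""

    def __init__(self, subvolumes):
        self.subvolumes = list(subvolumes)

    def write(self, path, data):
        for subvolume in self.subvolumes:
            subvolume.write(path, data)

    def read(self, path):
        return self.subvolumes[0].read(path)


class WriteBehind(Translator):
    """Lets write() return before the data has reached the subvolume."""

    def __init__(self, subvolume):
        self.subvolume = subvolume
        self.pending = {}

    def write(self, path, data):
        self.pending[path] = data  # buffered only, not yet stored below

    def read(self, path):
        if path in self.pending:
            return self.pending[path]
        return self.subvolume.read(path)

    def flush(self):
        for path, data in self.pending.items():
            self.subvolume.write(path, data)
        self.pending.clear()
```

A graph is then built by nesting the objects, for instance `WriteBehind(Replicate([MemoryStorage(), MemoryStorage()]))`; because every translator exposes the same interface, they can be stacked in whatever order a given setup needs.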

For the next part of the Distributed File Systems Series we will be looking at XtreemFS.