SmartOS Series: Storage

In the third part of the SmartOS Series (check the first two parts: Intro and Virtualisation) we are going to take a deep dive into SmartOS storage: ZFS.

ZFS is a combination of file system and logical volume manager originally introduced in Solaris 10. ZFS includes many features that are key for a Cloud OS such as SmartOS:

  • Data integrity. ZFS is designed with a focus on data integrity, with particular emphasis on preventing silent data corruption. Data integrity in ZFS is achieved using a checksum or a hash throughout the whole file system tree. Checksums are stored in the pointer to the block rather than in the block itself. The checksum of the block pointer is stored in its pointer, and so on all the way up to the root node. When a block is accessed, its checksum is calculated and compared to the one stored in the pointer to the block. If the checksums match, the data is passed to the process requesting it; if they do not match, ZFS will retrieve one of the redundant copies and try to heal the damaged block.
  • Snapshots. ZFS uses a copy-on-write transactional model: when new data is written, the block containing the old data is retained, making it possible to create a snapshot of a file system. Snapshots are both quick to create and space efficient, since the data is already stored and unmodified blocks are shared with the file system. ZFS allows snapshots to be moved between pools, either on the same host or on a remote host. The ZFS send command creates a stream representation of either the entire snapshot or just the blocks that have been modified, providing an efficient and fast way to back up snapshots.
  • Clones. Snapshots can be cloned, creating an identical writable copy. Clones share blocks and, as changes are made to any of the clones, new data blocks are created following the copy-on-write principle. Given that clones initially share all their blocks, cloning is nearly instantaneous and doesn’t consume additional disk space. This means that SmartOS is able to quickly create new, almost identical virtual machines.
  • Adaptive Replacement Cache (ARC). ZFS makes use of different levels of disk caches to improve file system and disk performance and to reduce latency. Data is cached in a hierarchical manner: RAM for frequently accessed data, SSD disks for less frequently accessed data. The first level of disk cache (RAM) is always used for caching and uses a variant of the ARC algorithm, playing a role similar to a level 1 CPU cache. The second level (SSD disks) is optional and is split into two different caches, one for reads and one for writes. Populating the read cache, called L2ARC, can take some hours and, in case the L2ARC SSD disk is lost, no data is lost, since data is still safely written to disk. The write cache, called the log device, stores small synchronous writes and flushes them to disk periodically.
  • Fast file system creation. In ZFS, creating or modifying a filesystem within a storage pool is much easier than creating or modifying a volume in a traditional filesystem. This also means that creating a new virtual machine in SmartOS, i.e. adding a new customer for a cloud provider, is extremely fast.
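As a sketch, the snapshot, clone and send operations described above map to a handful of zfs commands; the pool, dataset and host names below are hypothetical:

```shell
# Take a snapshot of a filesystem (fast and initially space-free)
zfs snapshot zones/myfs@monday

# Clone the snapshot into a writable copy that shares unmodified blocks
zfs clone zones/myfs@monday zones/myfs-clone

# Send the full snapshot stream to a remote pool...
zfs send zones/myfs@monday | ssh backuphost zfs receive tank/myfs

# ...or send only the blocks modified since the previous snapshot
zfs send -i zones/myfs@monday zones/myfs@tuesday | ssh backuphost zfs receive tank/myfs
```

The incremental form of zfs send is what makes periodic off-host backups of many virtual machines practical.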

Another very interesting storage feature added by Joyent to SmartOS is disk I/O throttling. In Solaris, a zone or an application could potentially monopolize access to local storage, effectively blocking other zones and applications from accessing it and causing performance degradation. With disk I/O throttling in place, all zones are guaranteed a fair amount of access to disk. If a zone is requesting disk access beyond its limits, each I/O call is delayed by up to 100 microseconds, leaving the system enough time to process I/O requests from other zones. Disk I/O throttling only comes into effect when the system is under heavy load from multiple tenants; during quiet times tenants are able to enjoy faster I/O.

The main ZFS components in SmartOS are pools and datasets:

  • Pools are an aggregation of virtual devices (vdevs), which in turn can be constructed of different block devices, such as files, partitions or entire disks. Block devices within a vdev can be configured with different RAID levels, depending on redundancy needs. Pools can be composed of a mix of different types of block devices, e.g. partitions and disks, and their size can be flexibly extended, as easily as adding a new vdev to the pool.
  • Datasets are a tree of blocks within a pool, presented either as a filesystem, i.e. file interface, or as a volume, i.e. block interface. Datasets can be easily resized and volumes can be thinly provisioned. Both zones and KVMs make use of ZFS datasets: zones use ZFS filesystems while KVMs use ZFS volumes.
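A minimal sketch of pools and datasets using the zpool and zfs commands; the disk and dataset names are illustrative:

```shell
# Create a pool from a mirrored vdev of two whole disks
zpool create tank mirror c0t0d0 c0t1d0

# Extend the pool by adding another vdev
zpool add tank mirror c0t2d0 c0t3d0

# A filesystem dataset (file interface) with a quota, resizable at any time
zfs create -o quota=10G tank/zone1

# A thinly provisioned (-s) volume (block interface), as used by KVM guests
zfs create -s -V 20G tank/kvm1-disk0
```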

In the next part of the SmartOS Series we’ll explore DTrace, a performance analysis and troubleshooting tool.

SmartOS Series: Virtualisation

Last week we started a new blog post series on SmartOS. Today we continue in this series and explore in detail the virtualisation aspects of SmartOS.

SmartOS offers two types of OS virtualisation: the Solaris-inherited container-based virtualisation, i.e. zones, and the hosted virtualisation ported to SmartOS by Joyent, KVM.

Containers are a combination of resource controls and Solaris zones, i.e. completely isolated virtual environments, that provide an efficient virtualisation solution and a complete and secure user space environment on a single global kernel. SmartOS uses sparse zones, meaning that only a portion of the file system is replicated in the zone, while the rest of the file system and other resources, e.g. packages, are shared across all zones. This limits the duplication of resources, provides a very lightweight virtualisation layer and makes OS upgrading and patching very easy. Given that no hardware emulation is involved and that guest applications talk directly to the native kernel, container-based virtualisation gives a close-to-native level of performance.


SmartOS container-based virtualisation (Source: wiki.smartos.org)

SmartOS provides two resource control methods: the fair share scheduler and CPU capping. With the fair share scheduler a system administrator is able to define a minimum guaranteed share of CPU for a zone; this guarantees that, when the system is busy, all zones will get their fair share of CPU. CPU capping sets an upper limit on the amount of CPU that a zone will get. Joyent also introduced a CPU bursting feature that lets system administrators define a base level of CPU usage and an upper limit, and also specify how much time a zone is allowed to burst, making it possible for the zone to get more resources when required.
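On SmartOS these resource controls are typically set per zone through vmadm; a hedged sketch (the UUID is a placeholder, and cpu_cap is expressed as a percentage of a single CPU):

```shell
# Raise the zone's fair-share-scheduler weight relative to other zones
vmadm update 54f1cc77-68f1-42ab-acac-5c4f64f5d6e0 cpu_shares=100

# Cap the zone at the equivalent of 1.5 CPUs (150% of one CPU)
vmadm update 54f1cc77-68f1-42ab-acac-5c4f64f5d6e0 cpu_cap=150
```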

SmartOS already offers a wide set of features, but to make it a truly Cloud OS an important feature was missing: hosted virtualisation. Joyent bridged this gap by porting to SmartOS one of the best hosted virtualisation platforms: KVM. KVM on SmartOS is only available on Intel processors with VT-x and EPT (Extended Page Tables) enabled and only supports x86 and x86-64 guests. Nonetheless, this still gives the capability to run unmodified Linux or Windows guests on top of SmartOS.

In hosted virtualisation, hardware is emulated and exposed to the virtual machine; in SmartOS, KVM doesn’t emulate hardware itself, but exposes an interface that is then used by QEMU (Quick Emulator). When the guest emulated architecture is the same as the host architecture, QEMU can make use of KVM features such as acceleration to increase performance.


SmartOS KVM virtualisation (Source: wiki.smartos.org)

KVM virtual machines on SmartOS still run inside a zone, therefore combining the benefits of container-based virtualisation with the power of hosted virtualisation, with QEMU as the only process running in the zone.

In the next part of the SmartOS Series we will look into ZFS, SmartOS powerful storage component.

Distributed File System Series: GlusterFS Introduction

In the first part of the Distributed File Systems Series we gave an introduction to Ceph. Today we will explore another file system: GlusterFS.

GlusterFS is a clustered network filesystem that uses FUSE and that allows aggregating different storage devices, bricks in GlusterFS terminology, into a single storage pool. The storage in each brick is formatted using a local file system, e.g. XFS or ext4, and then exposed to GlusterFS for storing data. The user space approach gives GlusterFS flexible release cycles and ease of use without taking a toll on the performance of the file system.

Two of the main concepts of GlusterFS are volumes and translators.

A volume is a collection of one or more bricks. There are three different types of volumes in GlusterFS, which differ in how the volume stores the data in the bricks:

  • Distribute Volume. In this type of volume all the data is distributed throughout all the bricks. The data distribution is based on an algorithm that takes into account the space available in each brick. This is the default volume type.
  • Replicate Volume. In a replicate volume the data is duplicated, hence the name, over every brick in the volume. The number of bricks must be a multiple of the replica count.
  • Stripe Volume. In a stripe volume the data is striped into units of a given size among the bricks. The default unit size is 128KB and the number of bricks should be a multiple of the stripe count.
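The three volume types above are created with the gluster CLI; a sketch with hypothetical server names and brick paths:

```shell
# Distribute volume (the default): files spread across the bricks
gluster volume create distvol server1:/export/brick1 server2:/export/brick1

# Replicate volume: the brick count must be a multiple of the replica count
gluster volume create replvol replica 2 server1:/export/brick2 server2:/export/brick2

# Stripe volume: file contents striped in fixed-size units across the bricks
gluster volume create stripevol stripe 2 server1:/export/brick3 server2:/export/brick3

# A volume must be started before clients can mount it
gluster volume start distvol
```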

A translator is a GlusterFS component that has a very specific function; some of the main functionalities of GlusterFS are implemented as translators, e.g. I/O scheduling, striping and replication, load balancing, failover of volumes and disk caching. A translator connects to one or more volumes, performs its specific function and offers a volume connection; therefore translators can be hooked together to provide a file system tailored to certain needs. The whole set of translators linked together is called a graph.
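Translator graphs are described in volume files (volfiles); a minimal hand-written excerpt, with hypothetical names, chaining a POSIX storage translator under an io-cache performance translator:

```
volume posix1
  type storage/posix            # storage translator: POSIX backend
  option directory /export/brick1
end-volume

volume iocache
  type performance/io-cache     # performance translator layered on top
  subvolumes posix1             # connects to the translator below it
end-volume
```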

GlusterFS Translators graph (Source: gluster.org)

GlusterFS offers a fairly long list of translators for different needs:

  • Storage Translators. These translators define the behaviour of the back-end storage for GlusterFS. A storage translator is typically the first translator in a chain.
    • POSIX: tells GlusterFS to use a normal POSIX file system as the backend, e.g. ext4.
    • BDB: tells GlusterFS to use the Berkeley DB as the backend storage mechanism. This translator uses key-value pairs to store data and POSIX directories to store directories.
  • Clustering Translators. These translators allow GlusterFS to combine multiple servers into a cluster and define its basic behaviour.
    • Unify: with this translator all the subvolumes from the storage servers will appear as a single volume. An important feature of this translator is that a given file can exist on only one of the subvolumes in the cluster. The translator uses a scheduler to determine where a file resides, e.g. Adaptive Least Usage, Round Robin, random.
    • Distribute: this translator aggregates storage from several storage servers.
    • Replicate: this translator replicates files and directories across the subvolumes. If there are two subvolumes then a copy of each file will be on each subvolume.
    • Stripe: with this translator the content of a file is distributed across subvolumes.
  • Performance Translators. These translators are used to improve the performance of GlusterFS.
    • Read ahead: this translator pre-fetches data before it’s required, usually data that appears next in the file.
    • Write behind: this allows the write operation to return even if the operation has not completed.
    • Booster: using this translator, applications are allowed to skip FUSE and access GlusterFS directly.

More translators can be written according to specific requirements.

For the next part of the Distributed File Systems Series we will be looking at XtreemFS.

SmartOS Series: A SmartOS Primer

Some time back we introduced a project we are working on: OpenStack on SmartOS. Today we start a new blog post series to dig into SmartOS and its features. We’ll start with a quick introduction to SmartOS to get everyone started with this platform.

SmartOS is an open source live operating system mainly dedicated to offering a virtualisation platform. It’s based on illumos, which in turn is derived from OpenSolaris, and thus inherits many Solaris features, such as zones, ZFS and DTrace. Joyent, the company behind SmartOS, further enhanced the illumos platform by adding a port of KVM and features like disk I/O throttling. The core features of SmartOS will be the topic of the next posts in this series. Thanks to these features, SmartOS makes a perfect candidate for a truly Cloud OS.

The following presentation will walk you through the basic tasks to setup, configure and administer SmartOS:

In the next posts we will cover SmartOS virtualisation (Zones and KVM), SmartOS storage (ZFS), SmartOS networking (Crossbow) and SmartOS observability (DTrace).

Distributed File Systems Series: Ceph Introduction

With this post we are going to start a new series on Distributed File Systems. We are going to start with an introduction to a file system that is enjoying a good amount of success: Ceph.

Ceph is a distributed parallel fault-tolerant file system that can offer object, block and file storage from a single cluster. Ceph’s objective is to provide an open source storage platform with no single point of failure that is highly available and highly scalable.

A Ceph Cluster has three main components:

  • OSDs. Ceph Object Storage Devices (OSDs) are the core of a Ceph cluster and are in charge of storing data, handling data replication and recovery, and rebalancing data. A Ceph Cluster requires at least two OSDs. OSDs also check other OSDs for a heartbeat and provide this information to the Ceph Monitors.
  • Monitors: A Ceph Monitor keeps the state of the Ceph Cluster using maps, e.g. the monitor map, the OSD map and the CRUSH map. Ceph also maintains a history, called an epoch, of each state change in the Ceph Cluster components.
  • MDSs: A Ceph MetaData Server (MDS) stores metadata for Ceph FileSystem clients. Thanks to Ceph MDSs, POSIX file system users are able to execute basic commands such as ls and find without overloading the OSDs. Ceph MDSs can provide both metadata high availability, i.e. multiple MDS instances with at least one in standby, and scalability, i.e. multiple MDS instances, all active and managing different directory subtrees.
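The state kept by these components can be inspected from any admin node; for example:

```shell
# Overall cluster health, plus the current epoch of the main maps
ceph status

# The OSDs and their position in the CRUSH hierarchy
ceph osd tree

# The monitor quorum
ceph quorum_status
```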

Ceph Architecture (Source: docs.openstack.org)

One of the key features of Ceph is the way data is managed. Ceph clients and OSDs compute data locations using a pseudo-random algorithm called Controlled Replication Under Scalable Hashing (CRUSH). The CRUSH algorithm distributes the work amongst clients and OSDs, which frees them from depending on a central lookup table to retrieve location information and allows for a high degree of scaling. CRUSH also uses intelligent data replication to guarantee resiliency.
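Because placement is computed rather than looked up, you can ask the cluster where CRUSH would put any given object; the pool and object names here are hypothetical:

```shell
# Shows the placement group and the set of OSDs CRUSH maps the object to
ceph osd map mypool myobject
```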

Ceph allows clients to access data through different interfaces:

  • Object Storage: The RADOS Gateway (RGW), the Ceph Object Storage component, provides RESTful APIs compatible with Amazon S3 and OpenStack Swift. It sits on top of the Ceph Storage Cluster and has its own user database, authentication and access control. The RADOS Gateway makes use of a unified namespace, meaning that you can write data using one API, e.g. the Amazon S3-compatible API, and read it with another, e.g. the OpenStack Swift-compatible API. Ceph Object Storage doesn’t make use of the Ceph MetaData Servers.

Ceph Clients (Source: ceph.com)

  • Block Devices: The RADOS Block Device (RBD), the Ceph Block Device component, provides resizable, thin-provisioned block devices. The block devices are striped across multiple OSDs in the Ceph cluster for high performance. The Ceph Block Device component also provides image snapshotting and snapshot layering, i.e. cloning of images. Ceph RBD supports QEMU/KVM hypervisors and can easily be integrated with OpenStack and CloudStack (or any other cloud stack that uses libvirt).
  • Filesystem: CephFS, the Ceph Filesystem component, provides a POSIX-compliant filesystem layered on top of the Ceph Storage Cluster, meaning that files get mapped to objects in the Ceph cluster. Ceph clients can mount the Ceph Filesystem either as a kernel object or as a Filesystem in User Space (FUSE). CephFS separates the metadata from the data, storing the metadata in the MDSs and the file data in one or more OSDs in the Ceph cluster. Thanks to this separation the Ceph Filesystem can provide high performance without stressing the Ceph Storage Cluster.
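A sketch of the block and filesystem interfaces in practice; the pool, image and monitor names are placeholders:

```shell
# Create a thin-provisioned 10 GB RBD image
rbd create --size 10240 mypool/myimage

# Snapshot the image, protect the snapshot, and clone it (layering)
rbd snap create mypool/myimage@base
rbd snap protect mypool/myimage@base
rbd clone mypool/myimage@base mypool/myclone

# Mount CephFS with the kernel client or, alternatively, with FUSE
mount -t ceph mon1:6789:/ /mnt/cephfs -o name=admin,secret=<key>
ceph-fuse /mnt/cephfs
```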

Our next topic in the Distributed File Systems Series will be an introduction to GlusterFS.

ICCLab Colloquium: Byte-Code

Many thanks to Davide Panelli and Raffaele Cigni (Solutions Architects and co-founders) from Byte-Code for their visit and talk about a scalable e-commerce platform.

Byte-Code is a SME based in Milan, Italy, providing IT consulting services to customers around the world. The company has a strong focus on Open Source solutions and strategic partnerships with leaders of different markets.

The presentation (slides) introduces the company as well as the challenges of bringing Cloud Computing into the Enterprise world. Davide then went on to introduce a novel e-commerce platform capable of scaling dynamically thanks to features offered by technologies such as MongoDB and Amazon Web Services.

About Davide & Raffaele

Davide Panelli is a Solutions Architect and Scrum Master at Byte-Code. He’s responsible for creating Enterprise Architecture based on open source products.

Raffaele Cigni is a Solutions Architect and Groovy Specialist at Byte-Code. He worked for many years on mission critical J2EE/JEE projects, constantly searching for better technologies and methodologies to develop enterprise class software, and for this reason he started working with Groovy and Grails. Recently he’s focused on developing data processing systems based on DSLs created with Groovy.

 

OpenStack on SmartOS

SmartOS is an open source type 1 hypervisor platform based on Illumos, a descendant of OpenSolaris, and developed by Joyent. SmartOS is a live operating system, meaning that it can be booted via PXE, USB or an ISO image, and runs entirely from memory, leaving the full space on the local disk to be used for virtual machines. This type of architecture makes SmartOS very secure and easy to upgrade and recover. Given its performance and reliability, in the context of the Mobile Cloud Networking project SmartOS has been chosen to support telco-grade workloads and provide carrier-grade performance.

SmartOS as Cloud OS

Cloud providers must be able to offer a single server to multiple users without them noticing that they are not the only user of that machine. This means that the underlying operating system must be able to provision and deprovision, i.e. create and destroy, virtual machines in a very fast, seamless way; it should also allocate physical resources efficiently and fairly amongst the users and should be able to support multithreaded and multi-processor hardware. Lastly, the operating system must be highly reliable and, in case something doesn’t work as it should, it must provide a way to quickly determine the cause. A customer of the cloud provider will also expect the server to be fast, meaning that the observed latency should be minimal. The provided server should also give the flexibility to get extra power when needed, i.e. bursting and scaling, and be secure, meaning that neighboring users must not interfere with each other.

Thanks to the Illumos inheritance, SmartOS presents a set of features that address these needs and make it a perfect candidate as a truly Cloud OS:

  • OS Virtualization. SmartOS offers both container-based virtualization, i.e. a lightweight solution combining resource controls and Solaris zones, and KVM virtual machines, a full, hardware-assisted virtualization solution for running a variety of guest OSs, including Linux and Windows. Brendan Gregg of Joyent wrote a post comparing the performance of OS virtualization techniques.
  • ZFS and I/O throttling. ZFS combines file system and logical volume manager in a single feature. Key characteristics of ZFS are fast file system creation and data integrity guarantee. ZFS also includes storage pools, copy-on-write snapshot creation and snapshot cloning. Joyent further extended SmartOS adding disk I/O throttling. This feature, particularly interesting for a Cloud OS, overcomes a drawback in classic Solaris where a zone or application could effectively monopolize access to local storage, causing performance degradation for other applications or zones. With this new feature all zones/applications are ensured to get a reliable turn at reading/writing to disk.
  • Network Virtualization. SmartOS makes use of Crossbow to provide a network virtualization layer. Crossbow is fully integrated with the virtual machine administration tool of SmartOS, i.e. vmadm, and allows each virtual machine to have up to 32 virtual network interfaces (VNICs). But with the ability to offer so many VNICs, how can we supply sufficient bandwidth? As SmartOS is a Solaris derivative it can leverage advanced networking features such as multipath IP (IPMP). Operating at a lower level, the data link layer, SmartOS can also leverage data link multipathing (DLMP), which is close to trunk aggregation.
  • Observability with DTrace. DTrace is a performance analysis tool included by default in several operating systems, among them Illumos and Solaris, and therefore SmartOS. DTrace, short for Dynamic Tracing, can instrument code by modifying a program after it has been loaded into memory. DTrace is not limited to user-space applications, but can also be used to inspect the OS kernel and device drivers. In SmartOS, DTrace can be used to analyze and troubleshoot issues across all zones on a server or within an entire datacenter.
  • Resource control. Resource control is an essential part of container-based virtualization. In SmartOS there are two methods to control resource consumption: the fair share scheduler and CPU capping. The fair share scheduler allows the administrator to set a minimum guaranteed share of CPU, ensuring that all zones get a fair share of CPU when the system is busy. CPU capping sets a limit on the amount of CPU that a particular user will get. In addition to these two methods, Joyent added a CPU bursting feature that lets administrators define a base level of CPU usage and an upper bound, and also limit how much time a zone can burst.
  • Security. Thanks to the Illumos and Solaris inheritance, SmartOS offers a high level of security. Zones are completely separate environments, and activity in one zone will not affect neighbouring zones on the same server. Data security is also guaranteed through the use of zones and ZFS file systems.
  • Reliability. SmartOS offers Fault Management (FMA) and the Service Management Facility (SMF), which make it more reliable. Fault Management helps detect, report and diagnose any fault or defect that can occur on a SmartOS system. The Service Management Facility, another feature SmartOS inherits from Solaris, introduces dependencies between services (the system ensures that all services a particular service depends on are up and running before starting it), parallel starting and automatic restart upon failure (allowing fast boot times and service recovery), and delegation of services to non-root users (limiting the privileges of a given service). Complementing these is the ability to provide highly available load balancing with the Virtual Router Redundancy Protocol (VRRP). This is an additional feature that needs to be installed on SmartOS, yet it provides a means to implement hot failover via virtual IP sharing, very similar to the combination of pacemaker and corosync.
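As a taste of the observability described above, DTrace one-liners can aggregate activity across all zones on a box; for example:

```shell
# Count system calls per zone, machine-wide
dtrace -n 'syscall:::entry { @[zonename] = count(); }'

# Show which files each process is opening
dtrace -n 'syscall::open*:entry { printf("%s %s", execname, copyinstr(arg0)); }'
```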

OpenStack on SmartOS

Given the set of features that makes SmartOS the ideal Cloud OS, it seems only logical to combine it with OpenStack to provide a reliable, high-performance cloud platform. This idea was already blueprinted within OpenStack and some preliminary work has already been carried out by Thijs, Andy and Hendrik.

The existing work has now been further extended: the code has been updated to the latest OpenStack release, Grizzly, and is available on GitHub. At the moment the nova-compute service runs on SmartOS and is able to instantiate virtual machines, both container-based and KVM. The nova-network service is still a work in progress and further work is needed to make SmartOS fully Quantum compatible.

Further interesting work includes enabling the integration of OpenFlow controllers (e.g. Ryu, Trema, Floodlight). This, coupled with IPMP and DLMP, will make SmartOS a truly high-performance virtualisation platform. With the high-availability features of SmartOS, valuable and reliable compute services can be offered with both container and KVM virtualisation techniques. Having all these capabilities is immensely useful; however, in order to truly manage this hypervisor platform, in-depth monitoring will be required, and this is where DTrace will be leveraged as a configurable source of system metrics. These metrics can be supplied to the OpenStack Ceilometer monitoring system, for both performance and billing purposes. While we’re currently focused on the compute and networking capabilities, SmartOS’s ZFS storage capabilities will also be leveraged to provide block-type storage services.

Distributed File Systems

Description

Distributed File Systems are file systems that allow access to files from multiple hosts via a computer network, making it possible for multiple users on multiple machines to share files and storage resources.

Distributed File Systems are designed to be “transparent” in a number of aspects (e.g.: location, concurrency, failure, replication), i.e. client programs see a system which is similar to a local file system. Behind the scenes, the Distributed FS handles locating files, transporting data, and potentially providing other features listed below.

Distributed File Systems can be categorised as:

  • Distributed File Systems (also called network file systems). Many implementations have been made; they are location dependent and they have access control lists (ACLs).
  • Distributed fault-tolerant File Systems replicate data between nodes (between servers or servers/clients) for high availability and offline (disconnected) operation.
  • Distributed parallel File Systems stripe data over multiple servers for high performance. They are normally used in high-performance computing (HPC).
  • Distributed parallel fault-tolerant File Systems stripe and replicate data over multiple servers for high performance and to maintain data integrity. Even if a server fails no data is lost. The file systems are used in both high-performance computing (HPC) and high-availability clusters.

The objectives of this research initiative are:

  • Evaluate and compare performance of various Distributed File Systems
  • Explore and evaluate the use of Distributed File Systems as Object Storage
  • Explore the use of Distributed File Systems in OpenStack
  • Explore the use of Distributed File Systems in Hadoop

Problem Statement

With the increasing need for and use of cloud storage services, providers must be able to deliver a reliable service that is also easily managed. Distributed File Systems provide the basis for a Cloud Storage Service.

Articles and Info

Distributed File Systems Blog post Series:

Contact Point

Cloud Performance

Description

Virtualisation is at the core of Cloud Computing and therefore its performance is crucial to delivering a top-of-the-class service. Also, being able to provide an adequate virtualised environment based on user requirements is key for cloud providers.

SmartOS, a descendant of Illumos and OpenSolaris, presents features such as container and KVM virtualisation and network virtualisation through Crossbow that make it particularly interesting in this context.

This research initiative aims to:

  • Evaluate the performance of SmartOS virtualisation with respect to compute, i.e. containers and KVM, storage and networking
  • Compare SmartOS virtualisation with other techniques (Linux KVM, VMware, Xen)
  • Identify the use cases and workloads that best suit the different techniques

Problem Statement

Cloud providers must be able to offer a single server to multiple users without them noticing that they are not the only user of that machine. This means that the underlying operating system must be able to provision and deprovision, i.e. create and destroy, virtual machines in a very fast, seamless way; it should also allocate physical resources efficiently and fairly amongst the users and should be able to support multithreaded and multi-processor hardware. Lastly, the operating system must be highly reliable and, in case something doesn’t work as it should, it must provide a way to quickly determine the cause. At the same time, a customer of the cloud provider will also expect the server to be fast, meaning that the observed latency should be minimal. The provided server should also give the flexibility to get extra power when needed, i.e. bursting and scaling, and be secure, meaning that neighboring users must not interfere with each other.

Articles and Info

Contact Point

Daniele Stroppa

Daniele Stroppa is a researcher in the InIT Cloud Computing Lab. His research interests include virtualization and cloud performance. He is currently involved in the MobileCloud Networking project.

After receiving his Master’s degree in Mobile Computing, Daniele worked as a Software Engineer in the telecommunications industry, first with Vodafone and Fastweb in Italy and then with Nexus Telecom AG and Alcatel-Lucent in Switzerland.