In the third part of the SmartOS Series (check the first two parts: Intro and Virtualisation) we are going to take a deep dive into SmartOS storage: ZFS.
ZFS is a combination of file system and logical volume manager originally introduced in Solaris 10. ZFS includes many features that are key for a Cloud OS such as SmartOS:
- Data integrity. ZFS is designed with a focus on data integrity, with particular emphasis on preventing silent data corruption. Data integrity in ZFS is achieved by keeping a checksum or a hash throughout the whole file system tree: checksums are stored in the pointer to a block rather than in the block itself, the checksum of that block pointer is stored in its own pointer, and so on all the way up to the root node. When a block is accessed, its checksum is calculated and compared to the one stored in the pointer to the block. If the checksums match, the data is passed to the process requesting it; if they do not match, ZFS retrieves one of the redundant copies and tries to heal the damaged block (a scrub example exercising this machinery follows this list).
- Snapshots. ZFS uses a copy-on-write transactional model in which, when new data is written, the blocks containing the old data are retained, making it possible to create a snapshot of a file system. Snapshots are both quick to create and space efficient, since their data is already stored and unmodified blocks are shared with the file system. ZFS also allows snapshots to be moved between pools, either on the same host or on a remote host: the ZFS send command creates a stream representation of either the entire snapshot or just the blocks that have been modified, providing a fast and efficient way to back up snapshots (see the snapshot, send and clone examples after this list).
- Clones. Snapshots can be cloned, creating an identical writable copy. Clones share blocks and, as changes are made to any of the clones, new data blocks are created following the copy-on-write principle. Since a clone initially shares all of its blocks with the snapshot, cloning is practically instantaneous and consumes no additional disk space. This means that SmartOS is able to quickly create new, almost identical virtual machines.
- Adaptive Replacement Cache (ARC). ZFS makes use of different levels of disk cache to improve file system and disk performance and to reduce latency. Data is cached in a hierarchical manner: RAM for frequently accessed data, SSDs for less frequently accessed data. The first level of disk cache (RAM) is always used and relies on a variant of the ARC algorithm, playing a role similar to a level 1 CPU cache. The second level (SSDs) is optional and is split into two separate caches, one for reads and one for writes. The read cache, called L2ARC, can take some hours to populate, and if the L2ARC SSD is lost no data is lost, because everything is safely written to disk regardless. The write cache, called the log device, stores small synchronous writes and flushes them to disk periodically (adding cache and log devices is a one-line operation, as shown after this list).
- Fast file system creation. In ZFS, creating or modifying a filesystem within a storage pool is much easier and faster than creating or modifying a volume with a traditional volume manager. This also means that creating a new virtual machine in SmartOS, i.e. adding a new customer for a cloud provider, is extremely fast.
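To see the self-healing machinery in action, a pool can be scrubbed on demand. This is a minimal sketch using the standard zpool commands; the pool name zones (SmartOS's default pool) is used purely as an example.

```
# Walk the whole pool, verify every block against its checksum and
# repair any damaged blocks from redundant copies.
zpool scrub zones

# Report scrub progress and any checksum errors found or repaired.
zpool status -v zones
```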
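The snapshot, incremental send and clone workflow described above boils down to a handful of commands. The dataset, snapshot and host names below are hypothetical.

```
# Take an instantaneous, space-efficient snapshot of a filesystem.
zfs snapshot zones/myfs@monday

# Stream the full snapshot to another pool on a remote host.
zfs send zones/myfs@monday | ssh backup-host zfs receive backup/myfs

# Later, send only the blocks modified since the previous snapshot.
zfs snapshot zones/myfs@tuesday
zfs send -i zones/myfs@monday zones/myfs@tuesday | ssh backup-host zfs receive backup/myfs

# Turn a snapshot into a writable clone; no extra space is consumed
# until the clone starts to diverge.
zfs clone zones/myfs@monday zones/myfs-clone
```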
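Attaching the optional second-level caches is equally simple; the disk device names here are placeholders.

```
# Add an SSD as an L2ARC read cache.
zpool add zones cache c1t2d0

# Add an SSD as a separate log device for small synchronous writes.
zpool add zones log c1t3d0
```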
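And filesystem creation itself is a single, near-instantaneous command, with properties such as quotas set at creation time; the dataset name and quota are illustrative.

```
# Create a new filesystem with a 10 GB quota inside the pool.
zfs create -o quota=10G zones/customer42
```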
Another very interesting storage feature added by Joyent to SmartOS is disk I/O throttling. In Solaris, a zone or an application could potentially monopolize access to local storage, all but blocking other zones and applications from accessing it and causing performance degradation. With disk I/O throttling in place, all zones are guaranteed a fair share of disk access. If a zone requests disk access beyond its limits, each I/O call is delayed by up to 100 microseconds, leaving enough time for the system to process I/O requests from other zones. Disk I/O throttling only comes into effect when the system is under heavy load from multiple tenants; during quiet times tenants are able to enjoy faster I/O.
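As a rough sketch of how this is exposed to operators, a zone's relative share of disk I/O can be tuned through its zfs_io_priority property with vmadm; the UUID and value below are placeholders, and the exact semantics of the property are described in the vmadm man page.

```
# $uuid holds the zone's UUID (placeholder).
# Raise the zone's relative disk I/O priority.
vmadm update $uuid zfs_io_priority=200

# Inspect the current value.
vmadm get $uuid | json zfs_io_priority
```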
The main ZFS components in SmartOS are pools and datasets:
- Pools are an aggregation of virtual devices (vdevs), which in turn can be constructed from different block devices, such as files, partitions or entire disks. Block devices within a vdev can be configured with different RAID levels, depending on redundancy needs. Pools can be composed of a mix of different types of block devices, e.g. partitions and disks, and their size can be extended flexibly, as easily as adding a new vdev to the pool (see the example after this list).
- Datasets are a tree of blocks within a pool, presented either as a filesystem, i.e. a file interface, or as a volume, i.e. a block interface. Datasets can be easily resized, and volumes can be thinly provisioned. Both zones and KVM virtual machines make use of ZFS datasets: zones use ZFS filesystems, while KVM virtual machines use ZFS volumes.
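Here is a short sketch tying pools and datasets together. The pool layout and names are illustrative only; on SmartOS the zones pool is created for you at installation time.

```
# Create a pool from two mirrored vdevs (four disks).
zpool create tank mirror c1t0d0 c1t1d0 mirror c1t2d0 c1t3d0

# Grow the pool later simply by adding another vdev.
zpool add tank mirror c1t4d0 c1t5d0

# A dataset exposed as a filesystem (file interface), as used by zones.
zfs create tank/zone-root

# A dataset exposed as a thinly provisioned 20 GB volume (block
# interface), as used by KVM virtual machines.
zfs create -s -V 20G tank/kvm-disk0
```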
In the next part of the SmartOS Series we’ll explore DTrace, a performance analysis and troubleshooting tool.