By now most of us have heard of Object Storage and the grand vision that life would be simpler if storage spent less time worrying about block placement on disks and walking up and down directory trees, and simply treated files like objects in a giant bucket.
So let's look at what you get from Object Storage, and how it differs from traditional block and file, and also cover off a few examples.
Here’s a summary table I put together as a quick introduction:
| |Block storage|Object storage|File storage|
|---|---|---|---|
|Core value proposition|Best performance for low I/O latency|Best scalability and lowest cost due to simplicity|Intuitive analogue for paper file storage|
|Defining characteristic|Small chunks accessed by making a SCSI call to the chunk address|Whole objects addressed by a single ID|A file serving app layered on top of block or object storage|
|Scalability|~10 petabytes|Trillions of objects|~1 billion files|
|Multi-site sync|The need for low latency limits options|Flat pool structure makes it easier to span sites|File trees and locking add additional complexity|
|Data classification|The storage doesn't know about files. It deals in blocks and addresses|Include as much object/file metadata as you want|Limited metadata can be accessed for each file|
|What does the hardware usually look like?|Generally on proprietary hardware|Generally on a cluster of Lintel servers|Generally file servers layered on top of block or object storage|
|Limitations|Difficult to achieve cost-effective scalability|Dealing in whole objects can add latency|Complexity|
There is no directory tree
Objects live in Buckets (also known as Pools or Containers) and some policies, e.g. replication, are applied per Bucket rather than per object. Bucket metadata includes access permissions and physical location. In addition, S3 bucket metadata supports audit logs, event notifications, cost tags, versioning, life-cycling and replication.
Folders don’t exist as a real hierarchical structure, but only as a view of certain object names. If you put a prefix on the front of an object’s name, the system will represent the object as being in a folder of that name. To enhance the illusion a ‘/’ is used as the prefix separator. For example, an object called starbellied/sneech1 can be ‘viewed’ as an object called sneech1 in a folder called starbellied.
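To make the folder illusion concrete, here is a minimal Python sketch of how a flat list of object names can be presented as folders. The function name `list_folder` and the sample bucket contents are my own inventions for illustration; this is not any vendor's API, just the prefix-and-separator idea described above.

```python
def list_folder(keys, prefix="", delimiter="/"):
    """Return (objects, subfolders) that appear directly under `prefix`.

    `keys` is a flat list of object names; no real directory tree exists.
    """
    objects, subfolders = [], set()
    for key in keys:
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        if delimiter in rest:
            # Anything up to the next separator is shown as a "folder".
            subfolders.add(prefix + rest.split(delimiter, 1)[0] + delimiter)
        else:
            objects.append(key)
    return sorted(objects), sorted(subfolders)


# Hypothetical bucket contents, stored as a flat list of names.
bucket = [
    "starbellied/sneech1",
    "starbellied/sneech2",
    "plainbellied/sneech3",
    "readme.txt",
]

top_objects, top_folders = list_folder(bucket)
star_objects, star_folders = list_folder(bucket, prefix="starbellied/")
print(top_objects, top_folders)
print(star_objects, star_folders)
```

The bucket itself never changes shape; only the view does, depending on which prefix you ask for.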
While object storage cannot compete with block storage for low latency transaction processing, the flat structure does mean you can get simple consistent latency as you scale out.
The Object Mullet: Object in the front, party out the back?
When we talk about Object storage we normally mean a system that is object storage at the back-end and has a REST API at the front-end. Some other storage systems, like IBM’s Spectrum Scale (formerly GPFS), support the S3 object storage API even though they are block-based systems at the back-end.
There are three common REST (Representational State Transfer) APIs for object storage:
- Amazon S3 access protocol is the de facto standard with more than double the adoption of any other option.
- OpenStack Swift has many fewer features than S3, but OpenStack fans don’t seem to mind.
- SNIA CDMI was intended to be an open industry standard, but it is rapidly fading from relevance.
When it comes to object operations, essentially we’re talking about GET, PUT / POST, DELETE, COPY, HEAD (metadata), OPTIONS.
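As a rough feel for what those verbs mean, here is a toy in-memory model in Python. It is only a sketch: `ToyBucket` and its method names are hypothetical, there is no HTTP involved, and real S3 or Swift calls carry authentication, headers and error codes that are left out here. The point is simply that every operation deals in whole objects addressed by a single key.

```python
class ToyBucket:
    """In-memory stand-in for a bucket; not a real REST client."""

    def __init__(self):
        self._objects = {}  # key -> (data bytes, metadata dict)

    def put(self, key, data, metadata=None):
        # PUT replaces the whole object; there is no partial in-place update.
        self._objects[key] = (bytes(data), dict(metadata or {}))

    def get(self, key):
        # GET returns the object's data.
        return self._objects[key][0]

    def head(self, key):
        # HEAD returns metadata only, without the data itself.
        data, metadata = self._objects[key]
        return {"content-length": len(data), **metadata}

    def copy(self, src, dst):
        # COPY duplicates data and metadata under a new key.
        data, metadata = self._objects[src]
        self._objects[dst] = (data, dict(metadata))

    def delete(self, key):
        # DELETE removes the object; deleting a missing key is a no-op here.
        self._objects.pop(key, None)


bucket = ToyBucket()
bucket.put("starbellied/sneech1", b"hello", {"species": "sneech"})
bucket.copy("starbellied/sneech1", "plainbellied/sneech1")
bucket.delete("starbellied/sneech1")
print(bucket.head("plainbellied/sneech1"))
```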
For those interested in a comparison between S3 and Swift, check out this reliable source.
IDC MarketScape for Object-Based Storage
IDC last published a MarketScape analysis in December 2014. It was quite an interesting departure from the 2013 version.
Among the software-only guys (where my personal interests are focused) the leading contenders in Dec’14 were:
- Cleversafe who were subsequently bought out by IBM in 2015
- Scality (are there still rumours about HP buying these guys?)
- Amplidata who were recently bought out by HGST (now owned by Western Digital)
- Cloudian who just moved across from major player in 2013 to leader in 2014
Others like Red Hat (having bought Inktank Ceph in 2014) showed up a little lower down the field. VMware’s VSAN wasn’t included which isn’t surprising. Even though VSAN does use an Object Store, the focus is on hyper-convergence rather than on storage per se. As well as the software guys, the traditional hardware vendors (DDN, HDS, EMC) featured, although EMC shrank dramatically from its revenue position as depicted in 2013.
Object storage examples
In recent times I have been looking at solution designs for a few named contenders and I thought that a bit of discussion plus some hardware block diagrams for these might help to give you a feel for the basic infrastructure that each needs. The examples are really just those that I have been working on designs for recently. There are plenty of others, although ViFX does have a particular interest and skills around the last three (VSAN, Cloudian and AWS).
|Example Company|Example Product|Brief Description|
|---|---|---|
|Scality|Scality RING|On-premises S/W-defined object storage with NAS options|
|Red Hat|Ceph|On-premises S/W-defined S3/Swift object storage, CephFS for Hadoop|
|VMware|Virtual SAN|VMware S/W-defined object storage|
|Cloudian|HyperStore|On-premises S/W-defined S3-compliant storage with NAS & HSM to Amazon|
|AWS|S3|Public Cloud API Storage with options|
Red Hat and Ceph
Ceph supports Swift, and most (but not all) of S3. You can also attach via RADOS, Cinder, and use CephFS for Hadoop. Things to think about with Ceph are whether you also need to deploy Red Hat Satellite or at least Katello to manage however many instances of RHEL you will be deploying as the OS underneath Ceph. Ceph currently dominates the OpenStack market. Ceph is priced both per TB and per node, but the licences are sold in bundles starting at 256TB and 12 nodes.
Ceph requires monitor nodes (min 3, max 5) which own the CRUSH (Controlled Replication under Scalable Hashing) map. The CRUSH map is used to decide where to put things and to track which disks are in which servers. Plus you’ll need a minimum of 5 storage nodes, although a minimum of 10 is recommended. So you’re starting out with at least 8 nodes, or 13 if you follow best practice.
Ceph pools control the replication level: the number of copies or the type of erasure coding. Nett space is, not surprisingly, around a third of raw when there are 3 copies. When you use Erasure Coding it’s likely to be between 60 and 80% of raw. EC apparently delivers slower reads, but (surprisingly) faster writes. Ceph has a concept of Placement Groups (min 100 PGs per disk) and the number of PGs is set in concrete at pool creation time, so there are some limits to dynamic reconfiguration that you need to be careful of when doing your detailed design.
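The node counts and nett-to-raw ratios above are simple arithmetic, and a short Python sketch may help when comparing options. The function names are my own, and the k+m erasure coding profiles shown (3+2, 4+1, 8+2) are illustrative assumptions chosen because they land at the ends of the 60 to 80% range; they are not recommendations.

```python
def replicated_efficiency(copies):
    """Fraction of raw capacity that is usable with N full copies."""
    return 1 / copies


def erasure_coded_efficiency(k, m):
    """Fraction of raw capacity usable with k data chunks and m coding chunks."""
    return k / (k + m)


# Node counts quoted in the text above.
MONITOR_NODES = 3
MIN_STORAGE_NODES = 5
RECOMMENDED_STORAGE_NODES = 10

minimum_cluster = MONITOR_NODES + MIN_STORAGE_NODES              # 8 nodes
best_practice_cluster = MONITOR_NODES + RECOMMENDED_STORAGE_NODES  # 13 nodes

print(f"3 copies: {replicated_efficiency(3):.0%} of raw")
for k, m in [(3, 2), (4, 1), (8, 2)]:  # assumed example profiles
    print(f"EC {k}+{m}: {erasure_coded_efficiency(k, m):.0%} of raw")
```

Real clusters also lose some space to headroom and filesystem overhead, which this sketch ignores.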
Recommended Linux options are CentOS, Debian, Fedora, RHEL and Ubuntu. For a fully supported world you might use RHEL and then do management via Calamari and Red Hat Satellite, but I expect you could use Katello etc if you were prepared to put up with a less integrated option.
Scality RING
Scality RING deploys a single Supervisor Node per cluster, and optional Connector Nodes where file access is used and performance is important. Connector Nodes offload front-end NFS and SMB processing and you can start with either none or two.
A minimum of 6 Intel server Storage Nodes are supported and a RING can span sites if the latency is less than 15 ms (or you can use async replication where latency is higher). Scality uses a distributed hash table so there is no separate set of metadata servers like there is for Ceph. You can set it up to use anywhere from 1 to 6 copies of each object or you can use Erasure Coding. Scality EC is quoted as having higher sequential but lower random I/O performance than copies. Scality RING is priced per usable TB of disk, starting at 200TB.
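To show why a hash table can remove the need for a metadata lookup, here is a generic consistent-hashing sketch in Python. To be clear, this is a textbook illustration and not Scality's actual algorithm: the `HashRing` class, the use of MD5, the 2**32 ring size and the node names are all my own assumptions. The idea is that any client can compute where an object lives from its name alone.

```python
import bisect
import hashlib


def _position(name):
    """Map a string to a point on a 2**32 ring (MD5 is an illustrative choice)."""
    return int(hashlib.md5(name.encode("utf-8")).hexdigest(), 16) % (2 ** 32)


class HashRing:
    def __init__(self, nodes):
        # Each node sits at a fixed point on the ring derived from its name.
        self._ring = sorted((_position(n), n) for n in nodes)
        self._points = [p for p, _ in self._ring]

    def nodes_for(self, key, copies=1):
        """Walk clockwise from the key's position, collecting distinct nodes."""
        if copies < 1:
            raise ValueError("copies must be at least 1")
        start = bisect.bisect_right(self._points, _position(key))
        chosen = []
        for i in range(len(self._ring)):
            node = self._ring[(start + i) % len(self._ring)][1]
            if node not in chosen:
                chosen.append(node)
            if len(chosen) == copies:
                break
        return chosen


# Six hypothetical storage nodes, matching the minimum quoted above.
ring = HashRing([f"node{i}" for i in range(6)])
placement = ring.nodes_for("starbellied/sneech1", copies=3)
print(placement)
```

Because placement is computed rather than looked up, every node or client holding the same node list arrives at the same answer.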
Scality RING is hardware independent as you’d expect of any of these software-defined storage solutions, and runs on your choice of RHEL, CentOS, Debian or Ubuntu.
VMware Virtual SAN
VMware Virtual SAN also uses an object store and it makes an interesting contrast with the other options seeing as it is pitched at a completely different market. VSAN is scalable to 64 hosts & 770 TB data (1 PB nett capacity) and a best practice entry system would start with four nodes (although 2 and 3 nodes are possible). The following is a sample Lenovo VSAN Ready-Node (Ready-nodes are designed to make it easy to spec a supported VSAN and to streamline the support process):
- 8 x 4TB drives, 4 x 400GB SSDs
- Raw disk per node 32 TB and 1.6 TB SSD
- 256GB RAM, 2 x 10GigE
- 2 x 12core 2.3GHz CPUs
- vSphere6 Standard or better (priced per CPU)
- VSAN6 (priced per CPU)
Total raw disk is therefore 128 TB plus 6.4 TB SSD with ~63 TB nett capacity designed for a data requirement of ~36 TB. This provides ~17% SSD cache (~4TB read cache, ~1TB mirrored write cache). This sizing assumes failures-to-tolerate=1, plus one spare host, plus 30% headroom as recommended by the VSAN6 Design & Sizing Guide:
“VMware is recommending, if possible, 30% free capacity across the Virtual SAN datastore. The reasoning for this slack space size is that Virtual SAN begins automatic rebalancing when a disk reaches the 80% full threshold, generating rebuild traffic on the cluster. If possible, this situation should be avoided. Ideally we want configurations to be 10% less than this threshold of 80%. This is the reason for the 30% free capacity recommendation.”
A very rough rule of thumb is to allow 3 x raw disk to data size ratio for FTT=1, and 5 x for FTT=2.
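The sizing numbers above can be reproduced with a few lines of Python. This is a back-of-envelope sketch only: the variable and function names are mine, the inputs are the four-node Ready-Node example from this post, and it does not replace the VSAN Design & Sizing Guide.

```python
# Inputs taken from the sample four-node Ready-Node configuration above.
NODES = 4
HDD_PER_NODE, HDD_TB = 8, 4.0
SSD_PER_NODE, SSD_TB = 4, 0.4
DATA_TB = 36.0

raw_hdd_tb = NODES * HDD_PER_NODE * HDD_TB   # total raw spinning disk
raw_ssd_tb = NODES * SSD_PER_NODE * SSD_TB   # total raw SSD
ssd_to_data = raw_ssd_tb / DATA_TB           # SSD as a fraction of the data size


def rule_of_thumb_raw_tb(data_tb, ftt):
    """Very rough raw disk needed: 3 x data for FTT=1, 5 x data for FTT=2."""
    multiplier = {1: 3, 2: 5}[ftt]
    return data_tb * multiplier


print(f"Raw HDD: {raw_hdd_tb:.0f} TB, raw SSD: {raw_ssd_tb:.1f} TB")
print(f"SSD to data ratio: {ssd_to_data:.1%}")
print(f"Rule-of-thumb raw for FTT=1: {rule_of_thumb_raw_tb(DATA_TB, 1):.0f} TB")
```

For the 36 TB example the rule of thumb asks for about 108 TB raw, so the 128 TB in the four-node design sits comfortably above it.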
Some of the maxima are:
- 6,000 VMs per cluster
- 200 VMs and 9,000 object components per ESXi host
- Max file size is 255GB
- Up to 5 disk groups per host (best to give each a dedicated controller)
- Each disk group can have up to 7 HDDs plus a single SSD of at least 10% of the actual data size
- 30% of the SSD is used for mirrored write cache and 70% is used for read cache
- Max VMDK size is 62 TB
In terms of data locality, VSAN does not try to cache a VMDK locally to the VM host. This is designed to avoid hotspots and reduce data movement following a vMotion. Reads are split across copies, but set address ranges of an object are always read from the same node so as to increase the cache hit rate and avoid double-caching.
After I wrote most of this, VSAN 6.2 was released, introducing three different editions of VSAN licensing, as per the following table:
Cloudian & AWS
IDC has rated Cloudian HyperStore as a leader in Object Storage. HyperStore doesn’t use Supervisor or Monitor Nodes and supports browser-based management. They are an AWS partner and their pitch is that they alone are “100% S3-compliant”. They also support HSM out to S3/Glacier (leaving stubs behind on-premises).
They do a bundled NFSv3 and have also traditionally played well with front-end Windows file-servers, SoftNAS or Panzura if you need SMB/CIFS support. They now also support their own new option for NFSv4, SMB/CIFS and FTP support via separate File Access Nodes.
Cloudian says it has tested linear performance from 2 to 200 nodes, and it does support clusters as small as 2 nodes and 10TB, as well as scaling up to 1 billion objects, which it says it has also tested. That gives Cloudian an easy entry point for PoCs and pilots. Licensing is priced per nett usable TB (and it doesn’t matter how many copies you protect with).
As you would expect it supports both object mirrors and Erasure Coding. Cloudian uses Cassandra under the covers (something it shares with Nutanix) but that is not exposed to admins or users. Hashing is down to the disk level, and it is also hardware independent. Cloudian also has an automated deployment process that takes much of the pain out of getting CentOS and Cloudian installed and running.
ViFX already has experience with Cloudian and is able to provide a very low cost per GB per month fully managed OPEX-only implementation on your premises or in a data centre of your choosing.
AWS S3 and Glacier
There are a few different versions of S3 (Simple Storage Service) that you can subscribe to:
S3 Standard is for frequently accessed data. It’s about US 3.3 cents/GB-mth but they also charge a small fee for PUT, COPY, POST, and LIST requests (US .55 cents/1000) and for GET and other requests (US .44 cents/10,000).
S3 Standard IA (Infrequent Access) is kind of self-explanatory. If you think you need Glacier, then S3 IA is probably what you actually need. There is a 128KB minimum allocation/charge per object (i.e. don’t put a lot of small files on there) and the minimum charge is for 30 days (so don’t put short-lived files on there or use it for temporary storage). IA is charged at US 1.9 cents/GB-mth, plus PUT, COPY, POST requests US 1.0 cents/1000, GET and other requests US 1.0 cents/10,000 and Data retrievals US 1.0 cents/GB. It looks like maybe it was designed in response to Google Nearline.
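To see why small files and IA don't mix, here is a rough Python cost sketch using only the prices quoted in this post, which will of course date quickly. The `Tier` class and `monthly_cost_usd` function are my own names, I'm assuming binary units (1 GB = 1024 x 1024 KB), and the 30-day minimum charge is not modelled.

```python
from dataclasses import dataclass


@dataclass
class Tier:
    storage_cents_per_gb: float
    put_cents_per_1000: float
    get_cents_per_10000: float
    retrieval_cents_per_gb: float = 0.0
    min_object_kb: int = 0  # minimum billable size per object


# Prices as quoted in this post (US cents).
STANDARD = Tier(3.3, 0.55, 0.44)
STANDARD_IA = Tier(1.9, 1.0, 1.0, retrieval_cents_per_gb=1.0, min_object_kb=128)


def monthly_cost_usd(tier, object_count, object_kb, puts=0, gets=0, retrieved_gb=0.0):
    """Estimate one month's bill in US dollars for equally sized objects."""
    billed_kb = max(object_kb, tier.min_object_kb)
    stored_gb = object_count * billed_kb / (1024 * 1024)
    cents = (
        stored_gb * tier.storage_cents_per_gb
        + puts / 1000 * tier.put_cents_per_1000
        + gets / 10000 * tier.get_cents_per_10000
        + retrieved_gb * tier.retrieval_cents_per_gb
    )
    return cents / 100


# One million 16 KB objects versus one thousand 1 GB objects, storage only.
small_std = monthly_cost_usd(STANDARD, 1_000_000, 16)
small_ia = monthly_cost_usd(STANDARD_IA, 1_000_000, 16)
large_std = monthly_cost_usd(STANDARD, 1_000, 1024 * 1024)
large_ia = monthly_cost_usd(STANDARD_IA, 1_000, 1024 * 1024)
print(f"Small objects: Standard ${small_std:.2f}, IA ${small_ia:.2f}")
print(f"Large objects: Standard ${large_std:.2f}, IA ${large_ia:.2f}")
```

With these assumptions the small objects cost more on IA than on Standard because each one is billed as 128KB, while the large objects come out cheaper on IA.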
S3 Standard is automatically replicated across multiple facilities (think Data Centres) within a Region, and S3 Standard Cross Region Replication (CRR), which is async with SSL, is only a tick-box and an invoice away. Note however that if you turn this on later, it replicates new primary objects only. You can use IA as the destination. Versioning must be turned on at both locations.
S3 Reduced Redundancy (RRS) is less durable (data is in two places rather than three) and is priced at US 2.6 cents/GB-mth with PUT, COPY, POST, LIST requests US .55 cents/1000 and GET and other requests US .44 cents/10,000.
Glacier can be driven from S3 or the Glacier API or SDKs. The name of the service is a strong clue that this is archival storage. It takes about 4 hours to start a retrieval from Glacier and it can be very expensive if you want to retrieve a lot of data on one day. Don’t put stuff there unless you think you probably won’t need it back. Consider IA instead. Glacier charges are in the ballpark of US 1.2 cents/GB/mth, with archive & restore requests of US 6 cents/1000. There are penalties of US 3.6 cents/GB for objects deleted within 90 days and US 1.2 cents/GB if you restore more than 0.17% of your data on any given day i.e. bringing back the occasional file is OK, but this is not a good location for operational backups. This is for inactive archives.
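The Glacier thresholds are easier to judge with numbers attached, so here is a small Python sketch based purely on the figures quoted above. The function names are mine, and the early-deletion fee is treated as a flat per-GB charge exactly as described in this post, which is a simplification of how AWS actually bills.

```python
# Figures as quoted in this post.
EARLY_DELETE_CENTS_PER_GB = 3.6
EARLY_DELETE_WINDOW_DAYS = 90
FREE_DAILY_RESTORE_FRACTION = 0.0017  # 0.17% of stored data per day


def free_daily_restore_gb(stored_gb):
    """How much you can restore in a day before the retrieval penalty applies."""
    return stored_gb * FREE_DAILY_RESTORE_FRACTION


def early_delete_fee_usd(deleted_gb, age_days):
    """Flat-rate estimate of the penalty for deleting data inside the window."""
    if age_days >= EARLY_DELETE_WINDOW_DAYS:
        return 0.0
    return deleted_gb * EARLY_DELETE_CENTS_PER_GB / 100


archive_gb = 100_000  # hypothetical archive of roughly 100 TB
print(f"Free restore per day: {free_daily_restore_gb(archive_gb):.0f} GB")
print(f"Deleting 1,000 GB after 30 days: ${early_delete_fee_usd(1000, 30):.2f}")
```

Even a large archive only allows a modest daily restore before penalties start, which is the arithmetic behind the advice that this is not a good location for operational backups.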
ViFX offers a free Billing Optimisation portal for AWS, which provides analytics to help you make sense of your billing and plot trends, projections and set billing alerts. Did I mention that it's free? It's not often you get something for free!
It’s not quite a case of, as Tom Lehrer put it at the end of the Periodic Table of Elements song, “These are the only ones of which the news has come to Har-vard / and there may be many others but they haven’t been dis-covered.” There are plenty of other Object Storage offerings, but hopefully this blog has provided a high level view and a place to start.