Object Storage Overview Tutorial

This tutorial explains Object Storage – how it works, its protocols, benefits and limitations, use cases, and how it compares to SAN and NAS.


It's easier to explain Object Storage if I review SAN and NAS storage first. We can then compare the different storage types and you'll see where Object Storage fits in.

Block Storage

 

Block storage stores and manages data in blocks - that's why it's called ‘block storage’. The storage is accessed via low level protocols using SCSI and NVMe commands.

 

SAN protocols (Fibre Channel, FCoE and iSCSI) provide block access over a network where the disks are not directly attached to the individual client. The experience for the user and applications using block storage is similar to using a local disk inside that actual computer.

 

The low level direct access to the data reduces overhead by minimizing abstraction layers. Higher level tasks such as multi-user access, sharing, locking, and security are not handled by the block level access protocol. Instead they are managed by the client’s operating system.

Block Storage Metadata

 

Metadata is data about other data, such as file name, owner, creation date etc.

 

As mentioned earlier, there’s very little overhead with block storage. It keeps no storage side metadata, only the block address. The block is simply a chunk of data that has no description, no association, and no owner.
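To make the idea of "only the block address" concrete, here is a minimal sketch that treats an in-memory byte array as a pretend disk. The 512 byte block size, the function names and the tiny 8 block disk are all assumptions for illustration, not how any particular SAN works.

```python
BLOCK_SIZE = 512  # assumed block size, for illustration only


def write_block(device: bytearray, lba: int, data: bytes) -> None:
    """Write one block at a logical block address (LBA)."""
    if len(data) > BLOCK_SIZE:
        raise ValueError("data does not fit in one block")
    start = lba * BLOCK_SIZE
    # Pad with zero bytes so every block is exactly BLOCK_SIZE long.
    device[start:start + BLOCK_SIZE] = data.ljust(BLOCK_SIZE, b"\x00")


def read_block(device: bytearray, lba: int) -> bytes:
    """Read one block back; the address is the only thing we know about it."""
    start = lba * BLOCK_SIZE
    return bytes(device[start:start + BLOCK_SIZE])


disk = bytearray(BLOCK_SIZE * 8)  # a pretend 8 block disk
write_block(disk, 3, b"hello")
print(read_block(disk, 3)[:5])
```

Notice that nothing in the "disk" records a name, an owner or a date for block 3. Working out which blocks belong to which file is left to the file system on the client.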

Block Storage Use Cases

 

Block storage is considered the best solution for performance sensitive, transactional, and database oriented applications. Because there's very little overhead it provides the best performance of the available external storage types.

 

Block storage is mostly used for primary storage (not secondary storage like backup and archiving).

 

The storage is typically accessed frequently by the clients, and the storage system and clients are usually both located in the same physical location. Because performance is important, we don't want to have a lot of distance between them because that would add to the latency and harm performance.

File Storage

 

NAS protocols (SMB, including its older CIFS dialect, and NFS) use file storage, which stores data as a file hierarchy in a file system - that's why it's called ‘file storage’. The hierarchy is similar to a physical file cabinet like you would find in an office with folders (also known as directories) and sub-folders (also known as sub-directories).

 

The user or application connects to the file system through a share if it’s CIFS/SMB, or by mounting an export with NFS. CIFS/SMB is typically used by Windows clients and NFS by Unix and Linux clients, although each type of client can support either protocol.

File Storage Metadata

 

File system metadata is recorded separately from the file itself and records basic file attributes such as the file name, the creation date, who the creator is, the file type (such as PDF or Word document), the most recent change, and when it was last accessed.

 

The metadata is fixed and standardized to that particular file system.
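As a quick illustration of that fixed set of attributes, the sketch below creates a temporary file and reads its metadata with Python's standard `os.stat` call. The file contents and the dictionary keys are my own choices for the example. The exact fields available vary by operating system and file system.

```python
import os
import tempfile
import time

# Create a throwaway file so the example is self-contained.
with tempfile.NamedTemporaryFile(suffix=".txt", delete=False) as f:
    f.write(b"project notes")
    path = f.name

info = os.stat(path)  # the file system's own record about the file
metadata = {
    "name": os.path.basename(path),
    "size_bytes": info.st_size,
    "owner_uid": info.st_uid,
    "last_modified": time.ctime(info.st_mtime),
    "last_accessed": time.ctime(info.st_atime),
}
for key, value in metadata.items():
    print(f"{key}: {value}")

os.remove(path)  # clean up the temporary file
```

You get whatever attributes the file system defines and nothing more. There is no field here for something like "project code" or "patient name".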

 

Some file systems support a limited amount of custom metadata through ‘extended attributes’, but to add rich, searchable custom metadata you or the vendor typically have to build a custom application and database. That adds a lot of complexity and is not commonly done.

File Storage Use Cases

 

File storage is well suited to general purpose data, especially data which is edited frequently and data which is concurrently accessed by multiple users or applications. An example is a shared Word document that is constantly being updated by different people.

 

It's designed to be accessed over both the local network and remotely. Performance is not typically as much a concern as it is for the SAN protocols, so it's common that users can access the data both in the same building and also over a wide area network.

Block and File Storage Limitations

 

Block and file storage systems can be scaled out by adding more disks or nodes, but they're typically limited in scale. There's a maximum size for a single system, and it's usually limited to a single geographic area. You can’t normally have it spread across multiple locations.

 

NAS file index tables (inodes) have a maximum size and can affect performance if they grow too large, so scaling a NAS file system can negatively impact performance.

 

SAN and NAS systems both need to be backed up offsite for resiliency because your storage system is in one physical location. If there is a natural disaster like a flood or fire it would all be gone, so you need to back up the data to a separate physical location.

Object Storage

 

Object Storage organizes data into containers of flexible sizes, with the individual pieces of data referred to as ‘objects’. An object is typically a file such as a photo or a video.

 

The objects are stored and managed in a flat organization of flexibly sized containers called ‘buckets’ or ‘containers’. The two terms mean the same thing; which one is used depends on the particular platform you're using.

Object Storage Scalability

 

The servers which make up the Object Storage platform are known as nodes. Each node has its own processor, memory and internal or external disks. A single node could be used but Object Storage systems are designed for scalability and in practice there will be multiple nodes.

 

The same bucket can span multiple nodes which can be spread across multiple geographic locations. This is a big difference between Object Storage and SAN and NAS. Object based storage architectures can be scaled out and managed simply by adding additional nodes in any location. It can grow out to massive sizes, and performance is not negatively impacted as it gets bigger.

 

Object Storage is very commonly offered by cloud providers. The best known example is S3 from Amazon Web Services. On premises and hybrid solutions are also available.

 

A hybrid solution is a platform that supports on premises nodes and can also store data on a cloud provider's Object Storage. Data can be replicated across both on premises and cloud locations.

Object Storage Attributes

 

Each object contains three things - the actual data itself (like a photo or video), customizable metadata, and a globally unique identifier.

Object Storage Metadata

 

Block storage doesn't keep metadata (that would be managed at the higher level of the operating system), and on NAS storage the metadata is fixed. With Object Storage the metadata is fully customizable, so you can add your own custom attributes. It links directly with the object and contains additional descriptive properties. This provides better indexing and management capabilities.

 

For example, we could add custom metadata to a photo that says ‘black’ and ‘cat’. This information is indexed and searchable. The search options are fully flexible and not limited to the filename, path and fixed metadata.

 

Another example would be for the medical industry. If you are uploading x-rays about a particular patient, the metadata could include the patient's name and the injury type.
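Here is a rough sketch of the idea in Python, using an in-memory dictionary as a stand-in for a bucket. The class names, the `put`/`get`/`search` methods, the metadata keys and the patient details are all hypothetical and only there to show how data, custom metadata and a unique ID travel together. A real platform exposes this through its own API.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class StoredObject:
    data: bytes
    metadata: dict
    # Every object gets its own unique identifier when it is created.
    object_id: str = field(default_factory=lambda: str(uuid.uuid4()))


class Bucket:
    def __init__(self):
        self._objects = {}  # flat namespace: object_id -> StoredObject

    def put(self, data, **metadata):
        obj = StoredObject(data=data, metadata=metadata)
        self._objects[obj.object_id] = obj
        return obj.object_id

    def get(self, object_id):
        return self._objects[object_id]

    def search(self, **criteria):
        """Return the IDs of objects whose metadata matches every criterion."""
        return [
            obj.object_id
            for obj in self._objects.values()
            if all(obj.metadata.get(k) == v for k, v in criteria.items())
        ]


bucket = Bucket()
cat_id = bucket.put(b"<jpeg bytes>", color="black", animal="cat")
dog_id = bucket.put(b"<jpeg bytes>", color="black", animal="dog")
xray_id = bucket.put(b"<image bytes>", patient="Jane Doe", injury="fractured wrist")

print(bucket.search(color="black", animal="cat"))  # only the cat photo's ID
```

The search never looks at a file name or a folder path. It only matches on the tags that were attached when the object was stored.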

 

You could also include all the useful information in a file name if you were using NAS, but file names would get crazy long and it wouldn’t be practical to manage. It would be very unwieldy and really just wouldn't work. Object Storage offers the most flexible and scalable indexing and search when you've got a large amount of unstructured data.

 

The customizable metadata also provides better management. You could for example set replication instructions through metadata, where different tags indicate how many copies of a particular object should be stored and where. Different objects can have different replication policies depending on their value.

 

Another example of customizable metadata providing enhanced, flexible management is control of storage tiering. Individual objects can be stored in different classes of storage depending on a metadata tag. For example, ‘gold’ objects are stored on SSD and ‘bronze’ objects are stored on SATA drives. The policy can move objects to lower performance storage or delete them as they age, and it can be updated anytime as requirements change.
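A policy like this could be sketched as a simple lookup table. Everything in the example below is an assumption made up for illustration: the `class` tag name, the copy counts, and the age thresholds. Real platforms have their own policy engines and syntax.

```python
# Hypothetical policy table keyed by a 'class' metadata tag.
POLICIES = {
    "gold": {"tier": "ssd", "copies": 3,
             "demote_after_days": 90, "delete_after_days": 3650},
    "bronze": {"tier": "sata", "copies": 2,
               "demote_after_days": None, "delete_after_days": 365},
}


def placement(metadata: dict, age_days: int):
    """Return (tier, copies) for an object, or None if it should be deleted."""
    # Untagged or unknown classes fall back to the 'bronze' policy.
    policy = POLICIES.get(metadata.get("class"), POLICIES["bronze"])
    if age_days >= policy["delete_after_days"]:
        return None
    demote = policy["demote_after_days"]
    if demote is not None and age_days >= demote:
        return "sata", policy["copies"]  # aged out of the fast tier
    return policy["tier"], policy["copies"]


print(placement({"class": "gold"}, age_days=10))
print(placement({"class": "gold"}, age_days=120))
print(placement({"class": "bronze"}, age_days=400))
```

Because the policy lives in a table rather than in the objects themselves, changing a threshold changes the treatment of every object carrying that tag.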

Globally Unique Identifier

 

A globally unique identifier is used rather than a file name and path as in NAS protocols. As the name suggests, the globally unique identifier is unique across the entire namespace and is used to find the object over the distributed system without having to know the physical location of the data. That removes the complexity and scalability challenges of a NAS hierarchical file system, which is based on complex file paths.
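One simple way to picture how an ID alone can lead to the data is to hash it and use the result to pick a node. This is only a sketch under my own assumptions (three made-up node names and plain modulo placement). Production platforms use more sophisticated placement schemes, because simple modulo placement would move most objects whenever a node is added.

```python
import hashlib

NODES = ["node-a", "node-b", "node-c"]  # hypothetical node names


def node_for(object_id: str, nodes=NODES) -> str:
    """Pick a node from the object's ID alone, with no path or location needed."""
    digest = hashlib.sha256(object_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(nodes)
    return nodes[index]


object_id = "3f2b9c1e-example-id"  # made-up identifier
print(node_for(object_id))
```

Any client or node that knows the ID and the node list can compute the same answer, so nobody has to walk a directory tree to find the object.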

Data Protection – Replication and Erasure Coding

 

Object Storage provides resiliency through replication and/or erasure coding.

 

Replication is used to store multiple copies of objects on different nodes which can be in the same or different data centers. The object is still available on node failure because there are copies on multiple nodes. This is very suitable for small objects but not so good for very large ones. Having multiple copies of very large objects would take up a lot of space which would add to costs.

 

A suitable data protection option for very large objects is erasure coding. With erasure coding the object is broken up into smaller distributed parts. It’s kind of similar to RAID, but we’re striping an object across nodes rather than data across a set of disks. Because parity information is included, the object can be reconstructed if there is a node failure.

 

With replication multiple copies of the same object are stored on multiple nodes. With erasure coding a single object is broken up into multiple parts which are spread across multiple nodes.
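The sketch below shows the parity idea in its simplest form: the object is split into four fragments and a single XOR parity fragment is added, so any one lost fragment can be rebuilt. The fragment count and function names are assumptions for illustration. Real erasure coding schemes such as Reed-Solomon can tolerate more than one failure, which this simple version cannot.

```python
from functools import reduce


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def encode(data: bytes, k: int) -> list:
    """Split data into k equal fragments plus one XOR parity fragment."""
    frag_len = -(-len(data) // k)  # ceiling division
    padded = data.ljust(frag_len * k, b"\x00")
    fragments = [padded[i * frag_len:(i + 1) * frag_len] for i in range(k)]
    parity = reduce(xor_bytes, fragments)
    return fragments + [parity]


def reconstruct(fragments: list) -> list:
    """Rebuild the single missing fragment (marked None) from the survivors."""
    missing = fragments.index(None)
    survivors = [f for f in fragments if f is not None]
    rebuilt = list(fragments)
    rebuilt[missing] = reduce(xor_bytes, survivors)
    return rebuilt


original = b"a very large video object"
pieces = encode(original, k=4)  # imagine each piece on a different node
pieces[2] = None                # one node fails
restored = reconstruct(pieces)
recovered = b"".join(restored[:4])[:len(original)]
print(recovered == original)
```

With these assumed numbers the space saving is easy to see. Three full replicas take 3 times the size of the object, while four data fragments plus one parity fragment take 1.25 times the size.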

 

Traditional offsite backups are not required with Object Storage because offsite copies are built in, as long as replication and/or erasure coding are configured across multiple locations. If there is a failure the data is still available and this is transparent to users and applications. They still get their data as if nothing had happened.

File Locking and In Place Edits are Not Supported

 

Object Storage handles changes differently from a NAS file system. It does not typically support file locking, and files can't be updated in place. It was deliberately designed like this to keep data protection through replication and erasure coding simple.

 

Up to this point you may have thought "everything is great about Object Storage. Why would I even use NAS anymore? Why don't we just use Object Storage for everything?" Well, Object Storage is not suitable when you've got data that is going to be edited by lots of different people and you want to have an ongoing single master copy of that.

 

Let’s say you've got a Word document associated with a project being run by your company, and it gets edited by multiple people. You need to have a single master copy of that document. NAS file systems support file locking to make sure that there's one consistent copy. Multiple people can read the file at the same time, but only one person can edit it at a time, which keeps the data consistent and avoids conflicts. The user saves the file when they are done making changes, and then the next person who wants to make changes edits the same single copy of the file.

With Object Storage it's different - it doesn't support file locking or in place editing. If multiple users update the same object concurrently, the system simply writes different versions of the object and you end up with multiple copies. That would make managing the project documentation in our example very impractical, so it's not a good use case for Object Storage.

 

Object Storage is designed for puts and gets, not for data which will have multiple edits from multiple users such as transactional databases or office Word documents. This has traditionally made it more suitable for secondary data - backups and archives, rather than primary data.

Versioning

 

Object Storage supports versioning where the old version of an object can be automatically saved if it is changed. This provides some data protection - if somebody overwrites an object or edits it, it will automatically save the old version so you don't lose the original data.
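A toy model of versioning, under my own assumptions about naming, might look like the sketch below. Each write to the same key is stored as a new version instead of editing the old data in place, and earlier versions can still be fetched.

```python
class VersionedBucket:
    """In-memory stand-in for a bucket with versioning turned on."""

    def __init__(self):
        self._versions = {}  # key -> list of every version written

    def put(self, key: str, data: bytes) -> int:
        """Store a new version and return its 1-based version number."""
        history = self._versions.setdefault(key, [])
        history.append(data)
        return len(history)

    def get(self, key: str, version=None) -> bytes:
        """Return the latest version, or a specific older one."""
        history = self._versions[key]
        return history[-1] if version is None else history[version - 1]


bucket = VersionedBucket()
v1 = bucket.put("report.docx", b"draft by Alice")
v2 = bucket.put("report.docx", b"edit by Bob")  # does not modify v1

print(bucket.get("report.docx"))             # latest version
print(bucket.get("report.docx", version=1))  # the original is still there
```

This protects against accidental overwrites, but it is not the same as file locking. Two people writing at the same time simply produce two versions, and nothing merges their changes.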

Cost

 

Object Storage is typically used as secondary storage (backup and archive) where performance is not a priority. It's usually one of the lowest cost storage options from a cloud provider, again because it's not typically running on high performance hardware. You can have high performance Object Storage if you want to, but that's not appropriate for most use cases.

 

On premises Object Storage platforms can typically be bought as an appliance or software only. When you buy an appliance it includes the hardware with the software installed. When you buy software only you install it on your own hardware. You can install the software on low cost commodity server hardware.

Object Storage Protocols and APIs

 

Object Storage uses RESTful APIs, which means it uses HTTP style requests: PUT and POST to write data, GET to read it, and DELETE to remove it. A HEAD request returns the metadata about an object without the data itself.
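To show what those requests look like, the sketch below builds (but does not send) a request for each verb using Python's standard `urllib`. The endpoint `objects.example.com`, the bucket and key names, and the `X-Meta-` header prefix are all made up for the example. A real service defines its own URL layout and metadata headers, and requires authentication.

```python
from urllib.request import Request

BASE = "https://objects.example.com"  # hypothetical endpoint


def build_request(method, bucket, key, data=None, metadata=None):
    """Build an HTTP request for one object; nothing is sent over the network."""
    headers = {f"X-Meta-{name}": value for name, value in (metadata or {}).items()}
    return Request(f"{BASE}/{bucket}/{key}", data=data, headers=headers, method=method)


put_req = build_request("PUT", "photos", "cat.jpg",
                        data=b"<jpeg bytes>", metadata={"color": "black"})
get_req = build_request("GET", "photos", "cat.jpg")
head_req = build_request("HEAD", "photos", "cat.jpg")      # metadata only
delete_req = build_request("DELETE", "photos", "cat.jpg")

for req in (put_req, get_req, head_req, delete_req):
    print(req.get_method(), req.full_url)
```

The same URL identifies the object in every case. Only the HTTP verb changes what happens to it.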

 

Applications often access Object Storage directly through the API (Application Programming Interface). Open standard APIs have been developed to help accelerate Object Storage use across the industry.

 

CDMI is the Cloud Data Management Interface and is controlled by the Storage Networking Industry Association (SNIA).

 

S3 (Simple Storage Service) is an Object Storage service that you can buy from Amazon Web Services, but the S3 APIs have also been made available publicly. The S3 API is supported by on-premises Object Storage platforms from other vendors such as NetApp and Dell EMC. A benefit is that these platforms can then easily integrate with the AWS S3 service for hybrid storage.

 

The other API commonly used is OpenStack Swift. OpenStack is an open source initiative for cloud services and Swift is its Object Storage component.

 

CDMI is not in common use now; S3 and OpenStack Swift are more prevalent. Both S3 and OpenStack Swift are typically supported on on-premises platforms.

End User Access and Cloud Gateways

 

As well as direct access from applications, end users can usually also access the storage directly via a web browser. A web interface is well suited for this because the APIs use web style commands. Storage browsing applications such as Cyberduck are also available for end user access.

 

Support for access via NAS protocols is typically also included. This functionality can either be built directly into the platform by the vendor or it can be provided via a cloud gateway. A cloud gateway is a hardware appliance or a virtual machine which sits between the clients and the Object Storage and translates between them. The clients talk to the cloud gateway using standard NAS protocols like SMB or NFS, and then the cloud gateway converts that to Object Storage APIs on the other side (and vice versa for traffic in the other direction).
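The translation step can be pictured with a small sketch. Here I assume the gateway maps the first folder of a path to a bucket, keeps the rest of the path as the object key, and maps file operations to HTTP verbs. The mapping rules and names are my own illustration, and actual gateways differ in how they do this.

```python
from pathlib import PurePosixPath

# Assumed mapping from file operations to Object Storage API verbs.
OPERATION_TO_VERB = {"read": "GET", "write": "PUT", "delete": "DELETE"}


def path_to_object(path: str):
    """Turn a file path such as /media/videos/intro.mp4 into (bucket, key)."""
    parts = PurePosixPath(path).parts  # ('/', 'media', 'videos', 'intro.mp4')
    if len(parts) < 3 or parts[0] != "/":
        raise ValueError("expected /<share>/<file path>")
    bucket = parts[1]
    key = "/".join(parts[2:])  # folders survive only as part of a flat key
    return bucket, key


def translate(operation: str, path: str):
    """Describe the object request a gateway might issue for a file operation."""
    bucket, key = path_to_object(path)
    return OPERATION_TO_VERB[operation], bucket, key


print(translate("write", "/media/videos/intro.mp4"))
```

The client still sees folders and files. On the Object Storage side there is only a bucket and a flat key that happens to contain slashes.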

Object Storage Benefits Summary

 

  • Single namespace with almost infinite scale
  • Scales across multiple physical locations
  • Performance does not degrade with scale
  • Customizable metadata for better indexing and management
  • Supports data management functions such as replication at object-level granularity
  • Typically low cost

Object Storage Limitations

 

Lower cost is a benefit but also a drawback as it means lower performance. This means Object Storage is not suitable for databases or other applications which require high performance.

 

It doesn’t have locking and file sharing facilities, so it’s not suitable for data which may be accessed concurrently and changed by multiple users or applications.

Object Storage Use Cases

 

Object Storage is best suited as a massively scalable store for unstructured data.

 

It's well suited for file content in the cloud space, especially images and videos - think YouTube.

 

It can be used as an additional storage tier beyond transactional storage for inactive data or as archival storage. Your primary storage system provides the required performance, and then you can tier the data off to the Object Storage beyond the primary storage.

 

As Object Storage evolves it may become more suitable for primary data, but it’s not a normal use case now. Object Storage is newer than SAN and NAS so it will evolve over time and other use cases are likely to emerge.

Use Case Examples

 

Media – large repositories of unstructured photos and video

Medical – patient records

Oil and Gas – seismic data

Object Storage Platform Examples

Cloud Providers:

Amazon Web Services Simple Storage Service (S3)

Microsoft Azure Blob Storage

Proprietary:

Facebook Haystack

On-Premises/Hybrid:

NetApp StorageGRID

 

Free ‘Introduction to SAN and NAS Storage’ training course

The video is an excerpt from my Introduction to SAN and NAS Storage training. You can get the entire series for free here:

http://learn.flackbox.com/courses/introduction-to-san-and-nas-storage

When you're ready to take your storage knowledge to the next level, you can get my 'Data ONTAP Complete' NetApp Training Course.