Deduplication and Compression
In this tutorial we cover deduplication and compression. Both are storage efficiency techniques available in ONTAP that save disk space. We’ll cover deduplication first.
Deduplication eliminates duplicate blocks in a volume. By saving only one instance of a block rather than multiple copies of it, disk space usage can be significantly reduced.
For example, let’s say a company has 10 departments, each department has their own folder in a volume and each folder contains the same 10 MB spreadsheet file. That’s going to be 10 copies of the same 10 MB file in 10 different folders. That adds up to 100 MB of disk space used when we’re not using deduplication.
If we turn on deduplication, only one copy of the blocks that make up the file is saved in the volume. Duplicate blocks are stored as references only (basically just pointers) to the original block. Now, rather than having 10 copies of that 10 MB file on disk, we only have one copy taking up disk space. The other 9 copies are made up of pointers, so we’re using about 10 MB of disk space rather than 100 MB.
Let’s have a look at how deduplication works. In the example below I have a file called “financials.xls” and it’s taking up 20 blocks. I have it saved in my volume in folder one. I then make a copy of the file in folder two, another copy in folder three, and then edit the copy in folder three. When I make the edit I add 10 additional blocks in this example.
If we weren’t using deduplication, we’d have the first copy taking 20 blocks, the second copy taking another 20 blocks and the third copy taking up the original 20 blocks plus the 10 blocks I added. That would be a total of 70 blocks, as shown here:
If we turn on deduplication we can see how much space we’re going to save by looking at the identical blocks. In the diagram below, the blocks on the left that are coloured in blue are identical. We can see that each of the three files has the same 20 identical blocks, and file three has 10 additional unique blocks. If we use deduplication, we keep only one copy of the 20 identical blocks, plus the 10 unique blocks, so the total number of blocks is now only 30.
As we saw earlier, without deduplication the three files took up 70 blocks. With deduplication they take up only 30 blocks. That is a saving of 40 blocks, or roughly 57%.
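The block counts in this example can be reproduced with a short sketch. This is only an illustration of the idea of keeping one copy of each identical block. It is not how ONTAP is implemented: the 4 KB block size, the SHA-256 fingerprints, and the helper names `make_blocks` and `dedup_stats` are all assumptions made for the example.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed block size for this sketch


def make_blocks(prefix, count):
    """Return `count` distinct blocks, each padded to BLOCK_SIZE bytes."""
    return [f"{prefix}-{i}".encode().ljust(BLOCK_SIZE, b"\0") for i in range(count)]


original = make_blocks("financials", 20)  # the 20 blocks of financials.xls
edit_blocks = make_blocks("edit", 10)     # the 10 blocks added by the edit

files = {
    "folder1/financials.xls": original,
    "folder2/financials.xls": list(original),          # exact copy
    "folder3/financials.xls": original + edit_blocks,  # copy plus edit
}


def dedup_stats(files):
    """Return (logical, physical) block counts for a set of files."""
    logical = sum(len(blocks) for blocks in files.values())
    # Identical blocks produce identical fingerprints, so a set keeps one of each.
    fingerprints = {
        hashlib.sha256(block).digest()
        for blocks in files.values()
        for block in blocks
    }
    return logical, len(fingerprints)


logical, physical = dedup_stats(files)
savings = 1 - physical / logical
print(logical, physical, f"{savings:.0%}")  # prints: 70 30 57%
```

The set of fingerprints plays the role of the single stored copy of each block. Every other occurrence of the same block would be stored as a pointer to it.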
Deduplication normally runs as a scheduled task outside business hours, but we can also run it manually on demand. It runs as a background process and is transparent to clients. It can be run on any type of storage, primary or backup. If you run it on backup storage you’re typically going to get significant space savings, because backups generally contain a lot of duplicate data: you’re backing up much of the same data on multiple days.
Let’s have a look at compression now. Compression is very similar to deduplication and we use it for the same reason, which is to save disk space. Compression works a little differently though. Where deduplication looks for duplicate blocks in a volume, compression attempts to reduce the size of a file by removing redundant data within the file.
Deduplication works at the block level while compression works at the file level. By making files smaller through compression, less disk space is consumed and more files can then be stored on disk.
For example, let’s say that we’ve got a 100 kilobyte text file. We may be able to compress that to 52 kilobytes by encoding long runs of repeated characters (such as extra spaces) compactly and by replacing duplicate character strings with shorter representations. The compression is lossless: an algorithm recreates the original data exactly when the file is read. The more redundant data there is in a file, the more compressible it will be.
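To see how redundancy affects compressibility, here is a minimal sketch using Python’s standard `zlib` module. It is a stand-in for illustration only. ONTAP uses its own compression algorithms, and the sample data and the `compression_ratio` helper are assumptions made for this example.

```python
import os
import zlib


def compression_ratio(data: bytes) -> float:
    """Compress data, check it round-trips exactly, and return compressed/original size."""
    compressed = zlib.compress(data)
    # Lossless: decompressing must give back the original bytes.
    assert zlib.decompress(compressed) == data
    return len(compressed) / len(data)


# 100,000 bytes of highly repetitive text (a 20-byte string repeated 5,000 times).
redundant = b"quarterly report    " * 5000

# 100,000 bytes of random data, which contains almost no redundancy.
random_data = os.urandom(100_000)

redundant_ratio = compression_ratio(redundant)
random_ratio = compression_ratio(random_data)

print(f"redundant text: {redundant_ratio:.1%} of original size")
print(f"random data:    {random_ratio:.1%} of original size")
```

The repetitive text shrinks to a small fraction of its original size, while the random data barely shrinks at all. Real files fall somewhere in between, which is why savings vary by workload.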
Post-process and Inline Compression
Compression can be configured to run post-process (during idle time), or we can run both inline compression (as data is written) and post-process compression together. Inline compression can increase latency. If you are going to enable it, test it with the particular workload first to make sure the performance impact is acceptable.
Deduplication and Compression Space Savings
Deduplication and compression are both configured on and work at the volume level. This can affect how you want to approach your volume design. If you’ve got duplicate data but it’s in different volumes, you’re not going to get the deduplication savings. The blocks have to be in the same volume for deduplication to take effect on them.
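The effect of volume boundaries on deduplication can be illustrated with a small sketch. It assumes, as described above, that duplicate blocks are only matched within a single volume. The volume names, file paths, and helper functions are hypothetical.

```python
import hashlib


def physical_blocks(volume_files):
    """Count the unique blocks stored in ONE volume after deduplication."""
    return len({
        hashlib.sha256(block).digest()
        for blocks in volume_files.values()
        for block in blocks
    })


def total_physical(layout):
    """Sum the per-volume unique block counts; volumes never share blocks."""
    return sum(physical_blocks(files) for files in layout.values())


# A 20-block file that two departments both keep a copy of.
spreadsheet = [f"block-{i}".encode() for i in range(20)]

# Layout A: both copies live in the same volume.
same_volume = {
    "vol1": {"sales/report.xls": spreadsheet, "hr/report.xls": spreadsheet},
}

# Layout B: each copy lives in its own volume.
separate_volumes = {
    "vol1": {"sales/report.xls": spreadsheet},
    "vol2": {"hr/report.xls": spreadsheet},
}

print(total_physical(same_volume))       # prints: 20
print(total_physical(separate_volumes))  # prints: 40
```

With both copies in one volume the duplicate blocks are stored once. Split across two volumes, each volume has to keep its own full copy.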
Space savings achieved depend on the workload that is using the volume. Different workloads are going to have different amounts of duplicate data at the block level and different amounts of compressible data at the file level. Obviously, workloads with a lot of duplicate blocks will see large benefits from enabling deduplication and workloads where the files have a lot of duplicate data will see large benefits from enabling compression.
Deduplication and compression can be enabled independently or you can enable them both at the same time to run in combination with each other. If you do enable them both, compression will be completed before deduplication.
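As a rough sketch of why the two techniques combine well, the example below compresses each block first and then deduplicates the compressed results. It is a conceptual model only, not ONTAP’s on-disk format: `zlib`, SHA-256, the 4 KB block size, and the `block` helper are assumptions made for the example.

```python
import hashlib
import zlib

BLOCK_SIZE = 4096  # assumed block size for this sketch


def block(text):
    """Build one BLOCK_SIZE-byte block by repeating `text`."""
    return (text.encode() * BLOCK_SIZE)[:BLOCK_SIZE]


# Six logical blocks: three copies of one block, two of another, one unique.
blocks = [block("AAAA")] * 3 + [block("BBBB")] * 2 + [block("report row ")]
logical_bytes = sum(len(b) for b in blocks)

# Step 1: compression. The same input always compresses to the same output.
compressed = [zlib.compress(b) for b in blocks]

# Step 2: deduplication of the compressed blocks, keyed by fingerprint.
unique = {hashlib.sha256(c).digest(): c for c in compressed}
stored_bytes = sum(len(c) for c in unique.values())

print(f"logical blocks: {len(blocks)}, stored blocks: {len(unique)}")
print(f"logical bytes: {logical_bytes}, stored bytes: {stored_bytes}")
```

Because identical blocks compress to identical output, compressing first does not stop duplicates from being found afterwards. Compression shrinks each block and deduplication then removes the repeats.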
Typical Storage Savings
The image below gives a ballpark representation of the kind of space savings you can expect from deduplication and compression.
As the legend indicates, the bar in blue shows only compression enabled, green is deduplication only and yellow shows both enabled. For a lot of workloads we’ll typically get really good space savings of over 60%. For example, if you’re running VDI (virtual desktop infrastructure), all of those virtual machines will most likely be running the same operating system. They’re going to have the same patches applied and the same applications installed, so you’re going to have a lot of duplicate data. You can get great space savings from enabling deduplication and compression in this scenario. The same applies to other workloads like databases and file services.
See my previous tutorial to learn about the space efficiency technique Thin Provisioning.
Text by Alex Papas.
Alex Papas has been working with Data Center technologies for the last 20 years. His first job was in local government, since then he has worked in areas such as the Building sector, Finance, Education and IT Consulting. Currently he is the Network Lead for Costa, one of the largest agricultural companies in Australia. The project he’s working on right now involves upgrading a VMware environment running on NetApp storage with migration to a hybrid cloud DR solution. When he’s not knee deep in technology you can find Alex performing with his band 2am