In this NetApp tutorial, you’ll learn about ONTAP Storage QoS and how you can use it to control the performance that will be given to your workloads.
Storage Quality of Service allows you to deliver consistent and predictable performance to your workloads in a multi-workload environment. On your ONTAP cluster, you've got a certain amount of hardware and if you've got multiple workloads that are running on that cluster, they're all competing for the available resources.
You can set a throughput ceiling on one or more workloads to prevent them from bullying other workloads by taking too much of those resources and negatively impacting them. You can also set a throughput floor for a workload to give it a minimum throughput level, regardless of the demand from competing workloads.
You can set a ceiling, a floor, or both on a workload, and you can also set Storage QoS to only monitor throughput without actually setting a ceiling or a floor.
A throughput ceiling limits throughput for a workload to a maximum number of IOPS or MB/s. You'll normally set either IOPS or MB/s, but you can set both. If you do set both, then the limit will kick in as soon as either one is reached. This ensures the workload does not take more than its expected share of system resources and bully the other workloads running on the system.
Your throughput ceilings can be applied to a volume, a file, a LUN, or an SVM. A throughput ceiling also throttles throughput directly, meaning that if you set a limit on a volume for example, then it will limit that volume directly. It stops it from using a lot of resources.
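To make the "either one is reached" behavior concrete, here is a minimal Python sketch. It is only a model of the rule described above, not ONTAP code, and the function name and parameters are made up for illustration.

```python
from typing import Optional


def ceiling_reached(iops: float, mbps: float,
                    max_iops: Optional[float] = None,
                    max_mbps: Optional[float] = None) -> bool:
    """Return True when either configured limit has been reached.

    A limit of None means that unit was not configured on the ceiling.
    This is a hypothetical model, not how ONTAP is implemented.
    """
    hit_iops = max_iops is not None and iops >= max_iops
    hit_mbps = max_mbps is not None and mbps >= max_mbps
    return hit_iops or hit_mbps


# Small-block workload: reaches the IOPS limit first.
print(ceiling_reached(iops=5000, mbps=20, max_iops=5000, max_mbps=100))
# Large-block workload: reaches the MB/s limit first.
print(ceiling_reached(iops=800, mbps=100, max_iops=5000, max_mbps=100))
# Under both limits, so no throttling in this model.
print(ceiling_reached(iops=800, mbps=20, max_iops=5000, max_mbps=100))
```

Setting both units is mostly useful when a workload's block size varies, because a small-block workload tends to hit the IOPS limit first and a large-block workload tends to hit the MB/s limit first.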
QoS Max Caveats
Throughput to workloads may exceed the specified ceiling by up to 10%, especially if a workload experiences rapid changes in throughput. The ceiling might also be exceeded by up to 50% to handle bursts.
So, when you do configure Storage QoS, just be aware that it can sometimes go a little outside the settings that you specified. That is normal. Also, it takes some time for QoS settings to take effect, usually a few minutes.
If you do configure QoS, don't go and put in the monitoring commands immediately, and then be surprised when it looks like it's not taking effect. That's also normal. It does take a few minutes to take effect. Therefore, after you configure Storage QoS, go get yourself a cup of coffee and then come back and check that it's working.
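As a quick worked example of those caveats, this small Python sketch turns the 10% and 50% figures quoted above into absolute numbers for a given ceiling. The function name and dictionary keys are just illustrative.

```python
def ceiling_bounds(ceiling_iops: int) -> dict:
    """Return the configured ceiling plus the overshoot levels quoted above.

    Integer arithmetic is used so the results are exact whole IOPS.
    """
    return {
        "configured": ceiling_iops,
        "up_to_10_percent_over": ceiling_iops * 110 // 100,
        "up_to_50_percent_burst": ceiling_iops * 150 // 100,
    }


print(ceiling_bounds(50000))
```

So if you set a 50,000 IOPS ceiling and the monitoring output briefly shows something in the range of 55,000 IOPS, or a short burst up towards 75,000 IOPS, that is within what the caveats describe.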
We can also use QoS Min to set a QoS floor. A throughput floor guarantees that throughput for a workload does not fall below a minimum number of IOPS. When we set a ceiling, we could set a value in either IOPS or MB/s or both.
When we set a floor, it is always going to be IOPS that we use. Throughput floors can be applied to a volume, a file, or a LUN, but not to an SVM. When QoS first came out, it was only available for setting a ceiling. Setting a floor became available more recently. There are not quite as many options available for setting a floor right now, but in future versions of ONTAP, I expect that there will be.
A throughput floor throttles throughput indirectly, by giving priority to the workloads for which the floor has been set. The ONTAP system cannot tell a workload, "You need to send me more work. Send me at least this amount." Obviously, the workload is going to send whatever it is sending.
The way that the throughput floor works is that if it looks like it's not going to get the amount of resources that it needs, then ONTAP will actually limit the other workloads on the system. It will give priority to this workload to make sure that it gets at least the minimum that you've configured.
QoS Min Caveat
Throughput to a workload might fall below the specified floor if there's insufficient performance capacity available on the node or aggregate. Obviously, QoS is not magic. It cannot somehow magically give more performance to your workloads than is available in the underlying hardware. Even when sufficient capacity is available on that underlying hardware, throughput to a workload might fall below the specified floor by up to 5%.
Just like it was with the ceiling, this is not always exact. It can sometimes go a little outside what you've set. The same caveat applies to throughput floors as well. Throughput floors are supported on AFF systems and ONTAP Select premium with SSD. Again, like I just said, this is a newer feature. There are some limitations with it, which I expect will be removed in later versions.
Now, let's look at how to actually configure QoS. That's done with policy groups. You define the throughput floor and/or ceiling in a policy group at the command line. You then add the storage objects to that policy group, which again can be volumes, files, or LUNs, or SVMs if you're setting a ceiling (SVMs are not supported with floors).
You can define multiple policy groups and apply different QoS settings to them. You might have different workloads that have got different requirements for the throughput floor or ceiling. To configure that, you would configure different policy groups for them.
A single or multiple storage objects can be listed in the same policy group. If you've got, for example, multiple volumes with the same requirements, then you could configure one policy group for them and put them all in that policy group. If you had another volume with different requirements, that would require a different policy group.
Best practice is to not mix different objects in the same policy group. For example, don't put volumes and LUNs in the same policy group. You could have volumes and other volumes in the same policy group, that is okay.
You cannot assign a storage object to a policy group if its containing object or its child objects belong to the policy group. For example, you couldn't have a policy group with an SVM in there, and also a volume that is in that SVM. Obviously, that would create a conflict.
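The containing-object rule can be sketched as a simple check. This is a toy model under stated assumptions: storage objects are written as slash-separated paths such as "NAS/vol1/lun1" (a naming scheme invented here, not ONTAP syntax), and the function names are hypothetical.

```python
from typing import Iterable


def contains(parent: str, child: str) -> bool:
    """True if `parent` is a containing object of `child` in the path model."""
    return child.startswith(parent + "/")


def can_assign(new_obj: str, members: Iterable[str]) -> bool:
    """Reject the assignment if a parent or child of new_obj is already a member."""
    for member in members:
        if contains(member, new_obj) or contains(new_obj, member):
            return False
    return True


# The SVM "NAS" is already in the group, so a volume inside it is rejected.
print(can_assign("NAS/vol1", ["NAS"]))
# A LUN inside vol1 is already in the group, so vol1 itself is rejected.
print(can_assign("NAS/vol1", ["NAS/vol1/lun1"]))
# Sibling volumes do not contain each other, so this is allowed.
print(can_assign("NAS/vol2", ["NAS/vol1"]))
```

ONTAP does this validation for you when you run the assignment command. The sketch is only there to show which combinations count as a conflict.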
Non-Shared Policy Groups
The different types of policy groups are shared and also non-shared policy groups. A non-shared QoS policy group specifies that the defined throughput ceiling or floor applies to each member workload individually.
For example, if you configured a non-shared policy group with a maximum throughput of 50,000 IOPS, each workload in that policy group will be limited to 50,000 IOPS. If Workload 1 can use 50,000 IOPS, Workload 2 can also use 50,000 IOPS. Each of them gets that limit that you set available to them individually.
Shared Policy Groups
We've also got our shared policy groups, and for throughput ceilings, the total throughput for all workloads in a shared policy group cannot exceed the specified ceiling when it's shared.
For example, if you configured a shared policy group with a maximum throughput of 50,000 IOPS, the aggregate total IOPS of all the member workloads is limited to that 50,000 IOPS. So, if Workload 1 uses 30,000 IOPS, only 20,000 IOPS are left for Workload 2. They don't get 50,000 IOPS each; it's shared between all of them.
For throughput floors, the minimum is applied to each workload individually, always. For example, if the minimum is set to 5,000 IOPS, each workload in the policy group is guaranteed 5,000 IOPS.
When we're setting a ceiling, you can see there would be some scenarios where you can say, "Okay, I've got 50,000 IOPS available and I'm going to make that available to this group of volumes. They get that 50,000 IOPS between them, and it's first come, first served."
That would make sense, but a throughput floor that is shared doesn't really make sense at all. You couldn't say, "Okay, we've got a 5,000 IOPS minimum, which is shared between these particular volumes." The minimum has to be applied to an individual volume. It's the only way it really makes sense. So, for throughput floors, it's always applied individually.
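The difference between shared and non-shared ceilings is easy to see with a small Python model. This is a simplified sketch: it treats "first come, first served" literally as dictionary order, which is an assumption for illustration and not a description of how ONTAP actually schedules I/O. The function and workload names are made up.

```python
def apply_ceiling(demands: dict, ceiling: int, shared: bool) -> dict:
    """Return the IOPS granted to each workload under a policy group ceiling.

    shared=True:  all members draw from one pool of `ceiling` IOPS.
    shared=False: each member is capped at `ceiling` IOPS individually.
    """
    granted = {}
    remaining = ceiling
    for name, demand in demands.items():
        if shared:
            granted[name] = min(demand, remaining)
            remaining -= granted[name]
        else:
            granted[name] = min(demand, ceiling)
    return granted


demands = {"workload1": 30000, "workload2": 60000}

# Shared: workload1 takes 30,000, leaving only 20,000 for workload2.
print(apply_ceiling(demands, 50000, shared=True))
# Non-shared: each workload is limited to 50,000 on its own.
print(apply_ceiling(demands, 50000, shared=False))
```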
That was our traditional Storage QoS, which has been available since the feature was first introduced. There's also a newer type of QoS, which is Adaptive QoS. The throughput that a storage object needs usually changes if the size of the object changes.
For example, an increase in the amount of space used in a volume usually requires a corresponding increase in its throughput ceiling. So, if a volume gets bigger and you've got QoS applied to there, as it gets bigger, you're usually going to want to give it more throughput.
Now, prior to ONTAP 9.3, the throughput configured in a policy group was always fixed. You set a particular value and it stayed the same. If the size of the volume, for example, increased and you wanted to give it more throughput, then you would have to change the setting on the policy group manually. If you had a lot of objects with QoS applied to them, then this was quite a lot of administrative overhead.
The new feature that helps with that is Adaptive QoS. Adaptive QoS can optionally be used to automatically scale the policy group value to the workload size, maintaining the ratio of IOPS or MB/s to the size in TB/GB as the size of the workload changes.
As the size of the volume goes up, it will get more throughput, if it goes down, it will get less throughput. When adaptive QoS is used, it's typically used to adjust throughput ceilings, rather than floors. The workload size is expressed as either the allocated space for the storage object or space actually used by the storage object.
Adaptive QoS Policy Groups – Allocated Space
When you use allocated space, an allocated space policy maintains the IOPS to TB/GB ratio, according to the nominal size of the storage object. For example, if the ratio is 100 IOPS/GB, a 300 GB volume will have a throughput ceiling of 30,000 IOPS.
If the volume is resized to 500 GB, adaptive QoS adjusts the throughput ceiling to 50,000 IOPS. You can see here that it doesn't care how much data is actually in the volume. It's applied based on the size of the volume itself. With adaptive QoS working here, as the size of the volume increased, that volume was given more throughput.
Adaptive QoS Policy Groups – Used Space (Default)
We've also got used space policies available. A used space policy maintains the IOPS to TB/GB ratio according to the amount of actual data stored, rather than the size of the object. This is the amount of data before storage efficiencies such as deduplication and compression have been applied.
For example, if the ratio is 100 IOPS/GB, a 300 GB volume that has 100 GB of data stored in it would have a throughput ceiling of 10,000 IOPS. So, it's based on the amount of data in the volume, not the size of the volume itself. As the amount of used space changes, adaptive QoS adjusts the throughput ceiling according to the ratio.
With allocated space, if you have set up the system with volume autogrow, then this is where that would be most likely to take effect. It would also take effect if you manually changed the size of the volume. If you've got autogrow set on there and the volume autogrows, then this would give it more throughput.
When you're using used space, it's not going to be based just on autogrow. As the actual amount of data changes, the throughput allocated will change in line with that.
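The arithmetic for the two allocation types can be sketched in a few lines of Python, using the same numbers as the examples above. The function and parameter names are illustrative; only the "allocated-space" and "used-space" labels follow the ONTAP option values.

```python
def adaptive_ceiling(ratio_iops_per_gb: int, allocated_gb: int,
                     used_gb: int, allocation: str) -> int:
    """Scale a throughput ceiling to the workload size.

    allocation="allocated-space": scale by the nominal size of the object.
    allocation="used-space":      scale by the amount of data actually stored.
    """
    if allocation == "allocated-space":
        size_gb = allocated_gb
    elif allocation == "used-space":
        size_gb = used_gb
    else:
        raise ValueError(f"unknown allocation type: {allocation}")
    return ratio_iops_per_gb * size_gb


# 300 GB volume holding 100 GB of data, ratio of 100 IOPS/GB.
print(adaptive_ceiling(100, 300, 100, "allocated-space"))  # 30000
print(adaptive_ceiling(100, 300, 100, "used-space"))       # 10000
# Volume resized to 500 GB with the same data in it.
print(adaptive_ceiling(100, 500, 100, "allocated-space"))  # 50000
```

Notice that resizing the volume changes only the allocated space result. The used space result stays at 10,000 IOPS until more data is actually written.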
Adaptive QoS Policy Groups
Our adaptive QoS policy groups are always non-shared. The defined throughput ceiling or floor applies to each member workload individually, so we don't have shared with adaptive QoS. When a workload changes size, updates to the QoS policy take around five minutes to take effect. Just like when you first configure a policy, it's going to take around five minutes to take effect.
With adaptive QoS, if the actual size of the volume or the data in the volume changes, that's going to take about five minutes after that to update as well. I've been using a volume as an example, but adaptive QoS can also be used for your files, your LUNs as well.
Default Adaptive QoS Policy Groups
With adaptive QoS, there's actually some default adaptive QoS policies that are built into the system. As soon as you install and enable the cluster, these default adaptive QoS policy groups will be there. The defaults are Value, Performance, and Extreme, and it's pretty obvious from the name that Extreme gives you the best performance, Value is the least performance.
Value is suitable for email, web, and file shares. Each default policy has an expected IOPS value, a peak IOPS value, an absolute minimum IOPS value, and an expected latency. Expected IOPS is the minimum that the workload is expected to get. With Value, if you apply that to a volume, the volume will get 128 IOPS/TB.
The absolute minimum IOPS covers the case where, say, you've set up a new volume with an adaptive QoS policy on it and there's no data in it yet. You're scaling based on the amount of data, which is zero, so it would be given 0 IOPS. Obviously, you don't want that to happen. You want it to have a minimum.
The absolute minimum is the lowest value it can fall to. When there is no data in the volume, it will be using the absolute minimum IOPS. When it gets up to a level that is above that, that's when the expected IOPS will kick in, or the peak IOPS, and that is going to be then scaling in line with the actual amount of data that is in that volume.
The Expected IOPS is the QoS floor. The peak IOPS is the QoS ceiling, and the absolute minimum is the absolute minimum that it will be allowed, that's for when the volume is empty.
Each default policy also has an expected latency, which is the kind of latency that would be expected if this particular adaptive QoS policy is applied. We've got Value, then Performance, which has got higher values and better performance, and then Extreme, which is the best. Performance is suitable for databases and hypervisors like VMware. Extreme is suitable for your workloads that require the lowest latency.
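Here is a rough Python sketch of how the three values fit together. It is a simplified reading of the description above, assuming the absolute minimum simply overrides whichever scaled value falls below it; the real ONTAP behavior may differ in detail. The numbers are the Value policy figures (128 IOPS/TB expected, 512 IOPS/TB peak, 75 IOPS absolute minimum), and the function name is made up.

```python
def adaptive_limits(expected_per_tb: int, peak_per_tb: int,
                    absolute_min: int, size_tb: float) -> tuple:
    """Return (floor, ceiling) in IOPS for a workload of size_tb terabytes.

    Assumption: the absolute minimum acts as a lower bound on both values.
    """
    floor = max(absolute_min, expected_per_tb * size_tb)
    ceiling = max(absolute_min, peak_per_tb * size_tb)
    return floor, ceiling


# Empty volume: scaling alone would give 0 IOPS, so the absolute minimum applies.
print(adaptive_limits(128, 512, 75, 0))
# 2 TB of data: the scaled expected and peak values apply.
print(adaptive_limits(128, 512, 75, 2))
```

With an empty volume both values sit at the 75 IOPS absolute minimum. Once there is 2 TB of data, the floor scales to 256 IOPS and the ceiling to 1,024 IOPS.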
Storage QoS Workflow
You usually specify the throughput limit when you create the policy group. However, you might have a workload where you don't know what values to set, and the vendor of that workload does not publish values for it, so you've got no idea what you should set the maximum throughput to.
In that case, you can create a policy group and not set any values on it, in which case it will be monitor only. You would do this if you do want to set a maximum or a minimum value for that workload, but you don't know what to set it to. You can configure it to monitor first, view the actual throughput that the workload is using, and then base your values on that.
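One possible way to turn monitored values into a ceiling is sketched below in Python. To be clear, the "observed peak plus headroom" rule and the 20% default are assumptions made up for this illustration, not a NetApp recommendation, and the sample numbers are invented.

```python
def suggest_ceiling(observed_iops: list, headroom_pct: int = 20) -> int:
    """Suggest a ceiling as the observed peak plus a percentage of headroom.

    Hypothetical heuristic: pick whatever rule fits your own workload.
    """
    if not observed_iops:
        raise ValueError("need at least one observed sample")
    peak = max(observed_iops)
    return peak * (100 + headroom_pct) // 100


# Invented IOPS samples collected while the policy group was monitor only.
samples = [12000, 18500, 15250, 21000]
print(suggest_ceiling(samples))
```

Whatever rule you use, the idea is the same as in the workflow above: monitor first, then set the value on the policy group once you know what normal looks like.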
NetApp Storage QoS Configuration Example
This configuration example is an excerpt from my ‘NetApp ONTAP 9 Complete’ course. Full configuration examples using both the CLI and System Manager GUI are available in the course.
- You want to limit the maximum throughput of vol1 to prevent it from bullying the other workloads, but you need to monitor its current throughput first. Configure Storage QoS to monitor vol1’s throughput.
cluster1::> qos policy-group create -policy-group vol1 -vserver NAS
cluster1::> volume modify -vserver NAS -volume vol1 -qos-policy-group vol1
Volume modify successful on volume vol1 of Vserver NAS.
- Configure a shared maximum of 50,000 IOPS for vol2 and vol3.
cluster1::> qos policy-group create -policy-group vol2and3 -vserver NAS -max-throughput 50000iops
cluster1::> volume modify -vserver NAS -volume vol2 -qos-policy-group vol2and3
Volume modify successful on volume vol2 of Vserver NAS.
cluster1::> volume modify -vserver NAS -volume vol3 -qos-policy-group vol2and3
Volume modify successful on volume vol3 of Vserver NAS.
- Use a single Storage QoS policy group to configure a maximum throughput of 50,000 IOPS each for vol4 and vol5.
cluster1::> qos policy-group create -policy-group vol4and5 -vserver NAS -max-throughput 50000iops -is-shared false
cluster1::> volume modify -vserver NAS -volume vol4 -qos-policy-group vol4and5
Volume modify successful on volume vol4 of Vserver NAS.
cluster1::> volume modify -vserver NAS -volume vol5 -qos-policy-group vol4and5
Volume modify successful on volume vol5 of Vserver NAS.
- Configure a minimum of 1000 IOPS for vol6. (You will receive an error message because QoS minimums are only supported on AFF and ONTAP Select Premium with SSD systems.)
cluster1::> qos policy-group create -policy-group vol6 -vserver NAS -min-throughput 1000iops
cluster1::> volume modify -vserver NAS -volume vol6 -qos-policy-group vol6
Error: command failed: Invalid QoS policy group specified "vol6". The specified QoS policy group has a min-throughput value set, and the workload being assigned does not reside on a performance optimized AFF platform. Only workloads on performance optimized AFF platforms are supported by a policy group with a min-throughput value set.
- For vol7, configure a minimum of 100 IOPS for every TB of space used in the volume, a maximum of 1000 IOPS for every TB of space used in the volume, and an absolute minimum of 100 IOPS.
cluster1::> qos adaptive-policy-group create -policy-group vol7 -vserver NAS -expected-iops 100iops/tb -peak-iops 1000iops/tb -peak-iops-allocation used-space -absolute-min-iops 100iops
cluster1::> volume modify -vserver NAS -volume vol7 -qos-adaptive-policy-group vol7
Volume modify successful on volume vol7 of Vserver NAS.
- Verify the QoS policy groups have been created successfully.
cluster1::> qos policy-group show
Name             Vserver     Class        Wklds Throughput   Is Shared
---------------- ----------- ------------ ----- ------------ ---------
vol1             NAS         user-defined 1     0-INF        true
vol2and3         NAS         user-defined 2     0-50000IOPS  true
vol4and5         NAS         user-defined 2     0-50000IOPS  false
vol6             NAS         user-defined 0     1000IOPS-INF true
4 entries were displayed.
cluster1::> qos adaptive-policy-group show
                             Expected    Peak         Minimum  Block
Name         Vserver  Wklds  IOPS        IOPS         IOPS     Size
------------ -------- ------ ----------- ------------ -------- -----
extreme      cluster1 0      6144IOPS/TB 12288IOPS/TB 1000IOPS ANY
performance  cluster1 0      2048IOPS/TB 4096IOPS/TB  500IOPS  ANY
value        cluster1 0      128IOPS/TB  512IOPS/TB   75IOPS   ANY
vol7         NAS      1      100IOPS/TB  1000IOPS/TB  100IOPS  ANY
4 entries were displayed.
- View the QoS performance statistics. (Note you will not get meaningful statistics as the system is not processing any storage requests.)
cluster1::> qos statistics performance show
Policy Group         IOPS     Throughput      Latency    Is Adaptive? Is Shared?
-------------------- -------- --------------- ---------- ------------ ----------
-total-              42       4.99KB/s        81.00us    -            -
_System-Work         42       4.99KB/s        81.00us    false        true
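If you want to check these settings from a script, the Throughput column of the qos policy-group show output above can be split into its floor and ceiling. The Python sketch below handles only the IOPS forms that appear in this example ("0-INF", "0-50000IOPS", "1000IOPS-INF"). Anything else, such as MB/s values, is rejected rather than guessed at. The function name is made up.

```python
import re
from typing import Optional, Tuple


def parse_throughput(field: str) -> Tuple[Optional[int], Optional[int]]:
    """Split a 'MIN-MAX' Throughput field into (min_iops, max_iops).

    None means no limit is set on that side ('0' floor or 'INF' ceiling).
    """
    low, high = field.split("-", 1)

    def to_iops(token: str) -> Optional[int]:
        if token in ("0", "INF"):
            return None
        match = re.fullmatch(r"(\d+)IOPS", token)
        if match is None:
            raise ValueError(f"unsupported throughput value: {token}")
        return int(match.group(1))

    return to_iops(low), to_iops(high)


print(parse_throughput("0-INF"))         # monitor only: no floor, no ceiling
print(parse_throughput("0-50000IOPS"))   # ceiling only
print(parse_throughput("1000IOPS-INF"))  # floor only
```

A policy group that comes back with no floor and no ceiling, like vol1 here, is the monitor only case from the first step of the example.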
Managing workload performance by using Storage QoS: https://library.netapp.com/ecmdocs/ECMP1196798/html/GUID-660A6C00-6D7E-4EE5-B97E-9D33C0B706B5.html
Assigning volumes to Storage QoS: https://docs.netapp.com/ontap-9/index.jsp?topic=%2Fcom.netapp.doc.onc-sm-help-900%2FGUID-432D6037-D91A-4296-BDE4-D226BFA26091.html
Text by Libby Teofilo, Technical Writer at www.flackbox.com