In this NetApp tutorial, you’ll learn about ONTAP upgrades.
I’ll cover where to find the documentation you should check before doing the upgrade. We’ll also look at the manual checks that you should do before the upgrade and how to actually perform it, and we’ll go into some of the caveats about whether the process will be disruptive or not. Scroll down for the video and also the text tutorial.
NetApp ONTAP Upgrades Video Tutorial
An upgrade is a major change to your storage system, so make sure you have everything lined up before you start, beginning with the relevant documentation.
The first thing you should do is use Upgrade Advisor in Active IQ to generate an Upgrade Plan. This is a document, in either PDF or Excel format, that takes you step by step through the upgrade process for your particular system.
Also, check the Release Notes for the version of ONTAP you’re upgrading to. The Upgrade Express Guide and the Upgrade and Revert/Downgrade Guide are also available as PDFs.
The recommended way to do the upgrade is ANDU, the Automatic Non-Disruptive Upgrade. With ANDU, you run a wizard in System Manager and most of the process is done automatically for you, which makes it a very easy way to upgrade.
Manual Checks Before ANDU
Even when you use ANDU, there are some checks you should do manually first. ANDU automates most of the process, but it doesn’t cover everything. These checks are documented in the Upgrade Plan from Upgrade Advisor and in the Upgrade Express Guide PDF. You need to:
- Review the Release Notes for the new ONTAP version and check that they don’t list any issues relevant to your particular environment.
- Confirm that your hardware platform is supported. An older platform may only be supported up to a certain ONTAP version rather than the latest one.
- Check that the switches you’re using, both the cluster switches and your network switches, are supported and correctly configured.
- Confirm any SAN components. If you’re using SAN protocols, make sure the clients run an operating system that is supported by the new ONTAP version you’re upgrading to, not just by the current one.
Also check that the Host Utilities version installed on the clients is compatible with the target ONTAP version. You might need to upgrade Host Utilities as well as ONTAP.
- Verify whether you can upgrade to the target ONTAP version directly from your current version or need to stage the upgrade. For example, if you’re on ONTAP 9.2 and want to reach 9.4, you can’t go straight from 9.2 to 9.4. You have to upgrade from 9.2 to 9.3 first, and then from 9.3 to 9.4.
- Verify that the cluster does not exceed the system limits for your platform. For example, make sure you don’t have too many snapshots.
- Verify that CPU and disk utilization are below 50%. An upgrade relies on high availability, because installing a new ONTAP version requires a reboot. If you’re on v9.3 and upgrading to v9.4, v9.4 is installed on the node, and the node then reboots and comes back up running v9.4.
While that node is down, its storage fails over to its HA partner, which serves all of the storage for both nodes until the upgraded node is back.
The partner is therefore doing double its normal work. That’s why you don’t want CPU or disk utilization above 50%: if it is, the combined load puts too much strain on the HA partner while it’s doing all of the work.
- Suspend any jobs such as SnapMirror or backups until after the upgrade is complete.
- Lastly, download the ONTAP software image from the NetApp website and copy it to an HTTP or FTP server, from which it is then downloaded onto the storage system.
Starting in ONTAP 9.4, you don’t need the external HTTP or FTP server. You can download the software image to your laptop and upload it directly from there to the storage system.
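Two of the checks above come down to simple logic: staging the upgrade one release at a time, and making sure one node can carry both workloads during takeover. The Python sketch below models both under stated assumptions. The helper names are hypothetical, each hop is assumed to move to the next ONTAP 9 release (as in the 9.2 to 9.3 to 9.4 example), and adding the two nodes’ utilization is only a rough approximation of the load on the surviving node. It does not query a real cluster; always take the actual supported path from your Upgrade Plan.

```python
# Hypothetical pre-upgrade helpers: they model two manual checks and do not
# talk to a real cluster.


def upgrade_path(current, target):
    """Return the releases to install in order, e.g. "9.2" -> "9.4" gives ["9.3", "9.4"].

    Assumption: every hop goes to the next ONTAP 9 minor release, as in the
    9.2 -> 9.3 -> 9.4 example. Real supported paths differ between releases,
    so confirm them in your Upgrade Plan.
    """
    cur_major, cur_minor = (int(x) for x in current.split("."))
    tgt_major, tgt_minor = (int(x) for x in target.split("."))
    if cur_major != 9 or tgt_major != 9 or tgt_minor <= cur_minor:
        raise ValueError("this sketch only handles upgrades within ONTAP 9")
    return [f"9.{minor}" for minor in range(cur_minor + 1, tgt_minor + 1)]


def ha_headroom_ok(node_util_pct, partner_util_pct):
    """Rough check that one node can carry both workloads during takeover.

    Assumption: the surviving node's load is approximately the sum of both
    nodes' utilization. Keeping each node below 50% keeps the sum below 100%.
    """
    return node_util_pct + partner_util_pct < 100
```

For example, `upgrade_path("9.2", "9.4")` returns `["9.3", "9.4"]`, and two nodes at 60% and 55% fail the headroom check because the partner would need roughly 115% of one node’s capacity during takeover.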
ANDU Automatic Non-Disruptive Upgrade
I said earlier that we normally do the upgrade using ANDU, the Automatic Non-Disruptive Upgrade, which runs as a wizard in the System Manager GUI. The wizard has three stages:
1. Select. At the Select stage, the administrator uploads the ONTAP software image to the cluster and selects it. Before this, you’ve already downloaded the new ONTAP software image from the NetApp website and copied it to an HTTP or FTP server.
In this first stage of the wizard, the image is uploaded from the HTTP or FTP server to the NetApp storage system. You specify the ONTAP version you want to upgrade to, then click Next.
2. Validate. At the Validate stage, the System Manager wizard automatically checks cluster health to verify that the cluster is ready to be upgraded. It runs a series of checks to make sure there are no issues on the system.
If there are issues on your cluster, you don’t want to be upgrading it at that time. So if the cluster doesn’t pass the Validate stage, the wizard won’t let you move on to the upgrade.
It gives you a full report of what the issues are and tells you the remedial action that you need to take. You then go and fix those issues. You can then come back and try the upgrade again.
3. Upgrade. Once the cluster has passed validation, the last stage is the upgrade, where the upgrade is actually performed.
ANDU is a very simple wizard: three steps that are basically Next, Next, Next, after which the system does most of the work for you. It makes upgrades very simple.
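To make the gating between the three stages concrete, here is a minimal Python model of the wizard’s flow. The stage names come from this tutorial; `run_andu` and the health-check names are hypothetical, and the real wizard runs far more checks than a dictionary of booleans can represent.

```python
# Hypothetical model of the three ANDU wizard stages. Stage names follow the
# tutorial; the function and check names are illustrative only.


def run_andu(image_version, health_checks):
    """health_checks maps a check name to True (passed) or False (failed)."""
    # 1. Select: the image has been uploaded and chosen as the target version.
    log = [f"select: {image_version}"]

    # 2. Validate: any failed check stops the wizard before the upgrade starts.
    failures = sorted(name for name, passed in health_checks.items() if not passed)
    if failures:
        log.append("validate: failed -> " + ", ".join(failures))
        return False, log
    log.append("validate: passed")

    # 3. Upgrade: only reached when validation is clean.
    log.append(f"upgrade: installing {image_version}")
    return True, log
```

Passing `{"cluster_health": True, "ha_config": False}` stops at Validate and never reaches the Upgrade stage, in the same way that the wizard refuses to continue until you fix the reported issue.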
Rolling vs Batch Upgrades
There are two types of upgrade: a Rolling Upgrade or a Batch Upgrade. With a Rolling Upgrade, nodes are upgraded one at a time. While a node is being upgraded, it is taken offline.
Meanwhile, its HA partner takes over its storage. Once the new version of ONTAP has been installed on the node, it reboots with the new version and takes back control of its storage.
This is then repeated on the partner node. Each HA pair is upgraded like this, one at a time in sequence, until the whole cluster is done. Each node takes around half an hour to upgrade.
The other option is a Batch Upgrade. With a Batch Upgrade, the cluster is split into two batches, each made up of multiple HA pairs. In the first batch, one node from each HA pair is upgraded at the same time, and then their partners. That is then repeated on the second batch.
So the upgrade is actually split into four parts: the first half of batch one, then the second half of batch one, then the first half of batch two, and finally the second half of batch two.
Batch Upgrades are only supported on clusters of eight or more nodes, but on those clusters the upgrade completes more quickly than a Rolling Upgrade would.
If that didn’t quite make sense, let’s walk through an example. The first type was the Rolling Upgrade, which upgrades one node at a time. In this example, we have an eight-node cluster made up of four HA pairs.
We upgrade Node 1. While that is happening, its HA partner, Node 2, takes control of its storage. When Node 1 is done, Node 2 is upgraded, and while that happens, Node 1 takes care of Node 2’s storage. When Node 2 is finished, Nodes 1 and 2 are back to normal. We then do the same for Nodes 3 to 8.
Looking at a Batch Upgrade now, we have the same eight-node cluster of four HA pairs, but it’s split into two batches. First, Node 1 and Node 3 are upgraded at the same time. We wouldn’t upgrade Node 1 and Node 2 together, because they’re HA partners of each other.
While Node 1 is being upgraded, we need Node 2 online and serving its storage. So we do Node 1 and Node 3 first, at the same time. When they’re done, we do their HA partners, Node 2 and Node 4. Then we do Nodes 5 and 7 in batch two, and finally Nodes 6 and 8.
With the Rolling Upgrade, we did the eight nodes one at a time. At about half an hour per node, the upgrade would take about four hours.
With a Batch Upgrade on eight nodes, that time is cut in half, to about two hours. That’s the benefit of a Batch Upgrade: it reduces the time the upgrade takes.
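The ordering and timing above can be reproduced with a short Python sketch. It assumes nodes are numbered so that Nodes 1 and 2, 3 and 4, and so on are HA partners, and it uses the rough half-hour-per-node figure from this tutorial; the function names are illustrative, not an ONTAP API.

```python
NODE_MINUTES = 30  # rough per-node upgrade time quoted in this tutorial


def partner(node):
    """HA partner, assuming (1, 2), (3, 4), ... are the HA pairs."""
    return node + 1 if node % 2 else node - 1


def rolling_order(num_nodes):
    """Rolling Upgrade: one node per step, in sequence."""
    if num_nodes % 2:
        raise ValueError("expected a cluster built from HA pairs")
    return [[n] for n in range(1, num_nodes + 1)]


def batch_order(num_nodes):
    """Batch Upgrade: two batches, each upgraded in two halves."""
    if num_nodes < 8 or num_nodes % 2:
        raise ValueError("batch upgrades need 8 or more nodes in HA pairs")
    pairs = [(n, n + 1) for n in range(1, num_nodes + 1, 2)]
    half = len(pairs) // 2
    steps = []
    for batch in (pairs[:half], pairs[half:]):
        steps.append([a for a, _ in batch])  # one node from every pair in the batch
        steps.append([b for _, b in batch])  # then their HA partners
    return steps


def estimated_hours(steps):
    # Nodes in the same step upgrade in parallel, so time scales with the step count.
    return len(steps) * NODE_MINUTES / 60
```

For the eight-node example, `batch_order(8)` gives `[[1, 3], [2, 4], [5, 7], [6, 8]]`, and the estimates come out at 4.0 hours for rolling and 2.0 hours for batch. Time scales with the number of steps rather than the number of nodes, which is where the Batch Upgrade saves time.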
I said earlier that ANDU, using the System Manager wizard, is the recommended upgrade method. However, it is not supported with MetroCluster. With MetroCluster, you have to do a manual Rolling Upgrade, so the nodes are upgraded one at a time, and you perform it yourself at the command line.
Cluster health validation is also performed manually before the upgrade. You don’t get the automated validation that System Manager performs, which checks a lot of things about the cluster, so a manual upgrade is a bit more involved.
These pre-upgrade checks take a while, because you have to enter all of the commands at the command line yourself to confirm that the cluster is healthy. All of the commands and the entire process are documented in the Upgrade and Revert/Downgrade Guide, which you can get from the NetApp website.
Single Node Clusters
As you saw earlier, the upgrade works by upgrading one node of an HA pair at a time. While a node is being upgraded, its HA partner takes ownership of its storage, which is what makes the upgrade non-disruptive.
The cluster as a whole, and all of the storage on it, remains available to clients during the upgrade. There are some caveats to that, which we’ll cover below, but as long as you have at least two nodes in the cluster, you can do this.
A single-node cluster, however, can’t be upgraded non-disruptively, because it has no HA partner to fail over to. While that single node is offline for the upgrade, your data is not available to clients.
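Pulling together the rules from the last few sections, the choice of method can be sketched as a small decision function. This is a hypothetical helper that encodes only what this tutorial states; it assumes that when a cluster qualifies for a Batch Upgrade you would want one, and it ignores any other factors that might affect the choice on a real system.

```python
# Hypothetical decision helper built only from the rules in this tutorial.


def upgrade_method(num_nodes, metrocluster=False):
    """Suggest an upgrade approach for a cluster of num_nodes nodes."""
    if num_nodes == 1:
        # No HA partner, so data is unavailable while the node reboots.
        return "disruptive: single node, no HA partner to take over"
    if metrocluster:
        # ANDU is not supported here, per this tutorial.
        return "manual rolling upgrade from the CLI"
    if num_nodes >= 8:
        # Assumption: prefer batch whenever the cluster is large enough.
        return "ANDU batch upgrade"
    return "ANDU rolling upgrade"
```

A four-node cluster gets an ANDU Rolling Upgrade, an eight-node cluster qualifies for a Batch Upgrade, and the MetroCluster flag overrides both.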
Disruption Considerations – Stateless Protocols
So let’s talk about those caveats. Even though it’s called an Automatic Non-Disruptive Upgrade, in some situations it can cause disruption, mostly if you’re using CIFS in your environment.
Stateless protocols do not constantly maintain the state of their connection to the server, so they won’t have an issue during the upgrade. If there is a temporary break in connectivity between the client and the server (the server being your storage system in this case), any operation in progress is completed when connectivity is restored.
The session only goes down if no communication is possible between client and server for a certain period, the timeout period. If a temporary break in connectivity is shorter than the timeout, there are no issues. Only if the break lasts long enough to reach the timeout are the connections torn down.
For our storage, stateless protocols such as NFSv3, Fibre Channel, and iSCSI are less susceptible to service interruptions during upgrades than session-oriented protocols such as CIFS.
NFSv3, Fibre Channel, and iSCSI clients see no disruption as long as the client timeout is longer than any disruption period on the cluster, for example, the time an HA giveback takes. So with stateless protocols, the upgrade is normally non-disruptive.
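The timeout rule for stateless protocols reduces to one comparison. In this sketch, the function name and the example timeout and disruption values are hypothetical; in practice you would compare your clients’ configured timeouts against the failover and giveback times you expect on your cluster.

```python
# The protocols this tutorial treats as stateless.
STATELESS = {"NFSv3", "FC", "iSCSI"}


def stateless_client_ok(protocol, client_timeout_s, disruption_s):
    """True if a stateless client rides out a pause of disruption_s seconds.

    The client only tears its session down once the pause reaches its
    timeout, so it survives any disruption shorter than that timeout.
    """
    if protocol not in STATELESS:
        raise ValueError(f"{protocol} is not one of the stateless protocols")
    return disruption_s < client_timeout_s
```

With illustrative numbers, an iSCSI client with a 60-second timeout rides out a 20-second giveback, while an NFSv3 client with a 30-second timeout would drop its session during a 45-second pause.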
Disruption Considerations – Stateful Protocols
There are also stateful protocols. Stateful protocols maintain the session constantly and do not have a timeout. So with stateful protocols, you need to direct users to end their sessions before you upgrade the NetApp system.
CIFS is a stateful protocol, so if services are disrupted, the state information about any operation in progress is lost and the user must restart the operation. During the upgrade, there is a failover to the HA partner, and then a giveback when the upgrade completes.
This causes only a short outage, but it’s long enough to cause problems with your CIFS sessions. Therefore, if you’re using CIFS, it’s highly recommended to do the upgrade in a maintenance window when no clients are connecting to the storage system.
NFSv4 is also a stateful protocol, but it handles this better than CIFS: clients automatically recover from connection losses during the upgrade.
If you have stateful applications running, the effect on them depends on the application’s timeout. If the application’s timeout is longer than any of the short disruptions on the storage system, there will be no problem.
If the timeout is shorter, the application will be disrupted. In that case, check whether you can lengthen the application’s timeout so it isn’t affected.
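The per-protocol behaviour described in this section can be summarized in a small lookup. This is an illustrative sketch: the client-type labels and return strings are made up for the example, and the application case uses the same timeout comparison described above.

```python
# Illustrative summary of the stateful cases in this tutorial; labels and
# messages are hypothetical.


def expected_impact(client_type, app_timeout_s=None, disruption_s=None):
    """Describe how a client type is expected to fare during failover/giveback."""
    if client_type == "CIFS":
        return "in-progress operations lost; schedule a maintenance window"
    if client_type == "NFSv4":
        return "client recovers automatically"
    if client_type == "application":
        if app_timeout_s is None or disruption_s is None:
            raise ValueError("need the application timeout and disruption length")
        if app_timeout_s > disruption_s:
            return "no problem"
        return "disrupted; try raising the application timeout"
    raise ValueError(f"unknown client type: {client_type}")
```

For instance, an application with a 90-second timeout is unaffected by a 30-second disruption, while one with a 10-second timeout would need its timeout raised.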
System, Disk, and Disk Shelf Firmware
There are three kinds of firmware: system (motherboard) firmware, disk firmware, and disk shelf firmware. The latest system, disk, and disk shelf firmware is bundled with the ONTAP upgrade packages. When you download a new version of ONTAP, the package doesn’t just include the operating system; it includes the firmware as well, all in one file.
Upgrading ONTAP also upgrades your firmware non-disruptively at the same time. When new disks or shelves are added, their firmware is automatically upgraded to the current version on the storage system. If you buy a new disk whose firmware is older than what is on your system, it is upgraded automatically when you add the disk, and the same goes for shelves.
So your firmware is upgraded automatically when you do an ONTAP upgrade. You can also upgrade it manually between ONTAP upgrades. For example, if there were a bug affecting your particular disk model and a firmware upgrade fixed it, you would want to apply that firmware upgrade.
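The automatic upgrade on adding a disk or shelf hinges on a version comparison. The sketch below assumes hypothetical dotted numeric firmware versions (real disk firmware revisions follow NetApp’s own naming) and shows why versions should be compared numerically rather than as strings.

```python
# Hypothetical firmware versions in dotted numeric form, e.g. "1.10".


def parse_version(version):
    """Split "1.10" into (1, 10) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def needs_auto_upgrade(component_fw, system_fw):
    """True when a newly added disk or shelf carries older firmware than the system."""
    return parse_version(component_fw) < parse_version(system_fw)
```

Comparing the raw strings would wrongly treat "1.9" as newer than "1.10"; the tuple comparison correctly flags a "1.9" disk for upgrade on a system running "1.10".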
DQP Disk Qualification Package
DQP stands for Disk Qualification Package. Only approved disks, known as qualified disks, are supported in NetApp systems. When a new disk is added, it is checked against the Disk Qualification Package.
Therefore, you have to buy disks officially approved by NetApp. Unlike your disk firmware, the DQP is not updated as part of an ONTAP upgrade. You need to download and install the latest DQP before:
- You add a new drive type or size that is not already on the system.
- You upgrade to a new version of ONTAP.
- You update the disk firmware.
So before your ONTAP upgrade, upgrade the DQP first. It’s easy: download the file from the NetApp website, enter a couple of commands at the command line, and you’re done. You can then do the main ONTAP upgrade using the System Manager wizard.
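The three DQP triggers above work well as a pre-change checklist. In this Python sketch, the action names are hypothetical labels; the function returns a reminder for each planned action that calls for installing the latest DQP first.

```python
# Hypothetical action labels mapped to the DQP triggers listed above.
DQP_TRIGGERS = {
    "add_new_drive_type_or_size": "a drive type or size not already on the system is being added",
    "upgrade_ontap": "ONTAP is being upgraded to a new version",
    "update_disk_firmware": "disk firmware is being updated",
}


def dqp_reminders(planned_actions):
    """List why the latest DQP should be installed first (empty if no trigger applies)."""
    return [
        f"Install the latest DQP first: {DQP_TRIGGERS[action]}"
        for action in sorted(set(planned_actions))
        if action in DQP_TRIGGERS
    ]
```

Planning an ONTAP upgrade alongside an unrelated change yields a single reminder, and actions that aren’t DQP triggers are simply ignored.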
Performing an Automatic Nondisruptive Upgrade Using the CLI: https://docs.netapp.com/ontap-9/index.jsp?topic=%2Fcom.netapp.doc.dot-cm-ug-rdg%2FGUID-7A3BEEA9-90D1-43D8-A9FA-EFE751CBDC0F.html
Upgrading an ONTAP Cluster Using the Automated Method: https://docs.netapp.com/ontap-9/index.jsp?topic=%2Fcom.netapp.doc.exp-dot-upgrade%2FGUID-CDEB0ADB-8D24-4944-A6FD-CD53E993DF49.html
Text by Libby Teofilo, Technical Writer at www.flackbox.com
With a mission to spread network awareness through writing, Libby consistently immerses herself into the unrelenting process of knowledge acquisition and dissemination. If not engrossed in technology, you might see her with a book in one hand and a coffee in the other.