You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Copy file name to clipboardExpand all lines: articles/operator-service-manager/safe-upgrade-practices.md
+41-31Lines changed: 41 additions & 31 deletions
Display the source diff
Display the rich diff
Original file line number
Diff line number
Diff line change
@@ -3,34 +3,32 @@ title: Get started with Azure Operator Service Manager Safe Upgrade Practices
3
3
description: Safely execute complex upgrades of CNF workloads on Azure Operator Nexus
4
4
author: msftadam
5
5
ms.author: adamdor
6
-
ms.date: 05/20/2024
6
+
ms.date: 03/09/2026
7
7
ms.topic: upgrade-and-migration-article
8
8
ms.service: azure-operator-service-manager
9
9
ms.custom:
10
10
- build-2025
11
11
---
12
12
13
13
# Get started with safe upgrade practices
14
-
This article introduces Azure Operator Service Manager (AOSM) safe upgrade practices (SUP). This feature set enables upgrades to complex container network function (CNF) hosted on Azure Operator Nexus. These upgrades generally support partner In Service Software Upgrade (ISSU) methods and requirements. While this article introduces basic concepts, look for other articles which expand on advanced SUP features and capabilities.
14
+
This article introduces Azure Operator Service Manager safe upgrade practices (SUP). These practices enable complex upgrades of network functions hosted on Azure Operator Nexus. Compliance with in-service software upgrades (ISSU), in-service software downgrade (ISSD), and in-service rollback (ISSR) is supported within documented limits. This article focuses on basic upgrade concepts, look for other articles to expand on more advanced capabilities.
15
15
16
16
## Introduction to safe upgrades
17
-
A given network service supported by AOSM, composed of one to many CNFs, includes components which, over time, require software and/or configuration changes. To make these component level changes it's necessary to run one to many helm operations, upgrading each network function application (nfApp) in a particular order and in a manner which least impacts the network service. AOSM safe upgrade practices apply the following high level capabilities to handle upgrade process and workflow requirements:
17
+
A given container network function is composed of one to many nf Applications (nfApps) which, over time, require lifecycle software version and configuration provisioning management. This genearlly requires the execution of the appropriate helm operations on each nfApp to set the active software or configuration version. SUP intelligently automates managing thes operations at-scale, to occur a particular order, and in a manner which least impacts the network service. The following feature list summarizes SUP available capabilities. These features should be considered during the appropriate stage of publishing, designing and operating.
18
18
19
-
* SNS Reput Support - Execute helm upgrade operation across all nfApps in network function design version (NFDV).
20
-
* Nexus Platform - Support SNS reput operations on Nexus platform targets.
21
-
*Operation Timeouts - Ability to set operational timeouts for each nfApp operation.
22
-
* Synchronous Operations - Ability to run one serial nfApp operation at a time.
23
-
* Control Upgrade Order - Define different nfApp sequence for install and upgrade.
24
-
*Pause On Failure - Default behavior pauses after an nfApp operation failure.
25
-
*Rollback On Failure - Optional behavior, rollback completed nfApps before failed nfApp.
26
-
*Single Chart Test Validation - Running a helm test operation after a create or update.
27
-
* Skip nfApp on No Change - Skip processing of nfApps where no changes result.
28
-
*Image Preloading - Ability to preload images to edge repository.
19
+
* SNS Reput Support - Calculate and execute correct helm operations in sequence across a series of nfApps.
20
+
*Azure Operator Nexus Cluster - Support for workloads hosted by Nexus arc-enabled kubernetes clusters.
21
+
*Helm Parameter - Ability to set helm operating parameters like atomic, timeout, and wait for each nfApp operation.
22
+
* Synchronous Operations - Ability to run one operation against one nfApp at-a-time.
23
+
* Control Upgrade Order - Define different nfApp sequences for installs, upgrades and deletes.
24
+
*nfApp Validation - Run a helm test hook operation after a helm create or update operation.
25
+
*Pause On Failure - If an nfApp fails, pause sequence of operations for diagnostics.
26
+
*Rollback On Failure - If an nfApp fails, rollback any prior completed nfApps to starting state.
27
+
* Skip nfApp on No Change - Manual or automatic skipping of nfApps where no change is required.
28
+
*Interruption - Set a cancellation flag to stop sequence of operations before next nfApp.
29
29
30
30
## Safe upgrade approach
31
-
To update an existing AOSM site network service (SNS), the operator executes a reput request against the deployed SNS resource. Where the SNS contains CNFs with multiple nfApps, the request is fanned out across all nfApps defined in the network function definition version (NFDV). By default, in the order, which they appear, or optionally in the order defined by `updateDependsOn` parameter.
32
-
33
-
For each nfApp, the reput request supports various changes including increasing a helm chart version, adding/removing helm values and/or adding/removing any nfApps. While timeouts can be set per nfApp, based on known allowable runtimes, nfApps can only be processed in serial order, one after the other. The reput update implements the following processing logic:
31
+
To update an existing Azure Operator Service Manager site network service (SNS), the operator starts by executing a reput request against a previously deployed SNS resource. Where the SNS contains multiple nfApps, the proper helm operation is calculateed for each nfpp. Those operations are then fanned out across all nfApps defined in the network function definition version (NFDV). By default, in the order, which they appear, or optionally in the order defined by `updateDependsOn` parameter. The reput request supports various per nfApp changes including increasing a helm chart version, adding/removing helm values and/or adding/removing any nfApps. Some helm parameters, like `atomic`, `timeout` and `wait` can be set per nfApp. The basic logic used by the reput operation is as follows:
34
32
35
33
* nfApps are processed following either `updateDependsOn` ordering, or in the sequential order they appear.
36
34
* nfApps with parameter `applicationEnabled` set to disable are skipped.
@@ -41,26 +39,36 @@ For each nfApp, the reput request supports various changes including increasing
41
39
42
40
To ensure outcomes, nfApp testing is supported using helm methods, either tests triggered by helm pre or post hooks, or using the standalone helm test hook. For pre or post hook failure, the `atomic` parameter is honored. With atomic/true, the failed chart is rolled back. With atomic/false, no rollback is executed. For standalone helm test hook failure, the `rollbackOnTestFailure` is honored, following similar logic as atomic. For more information on standalone helm testing, see the following article: [Run tests after install or upgrade](safe-upgrades-helm-test.md)
43
41
44
-
When an nfApp operation failure occurs, and after the failed nfApp is handled via `atomic` or `rollbackOnTestFailure` parameters, the operator can control behavior on how to handle any nfApps changed before the failed nfApp. With pause-on-failure the operator can force AOSM to break after addressing the failed nfApp, preserving the mixed version environment. With rollback-on-failure the operator can force AOSM to rollback any prior nfApp, restoring the original environment snapshot. For more information on controlling upgrade failure behavior, see the following article: [Control upgrade failure behavior](safe-upgrades-nf-level-rollback.md)
42
+
When an nfApp operation failure occurs, and after the failed nfApp is handled, the operator can control behavior of nfApps changed before the failed nfApp. With pause-on-failure the operator can force the reput operation to terminate after the failure. This best allows for diagnostics, as the broken environment is preserved. With rollback-on-failure the operator can force the reput operation to rollback any nfApp completed prior to the failure. This restores the starting state environment and has the least impact to the running service. For more information on controlling upgrade failure behavior, see the following article: [Control upgrade failure behavior](safe-upgrades-nf-level-rollback.md)
45
43
46
44
## Considerations for in-service upgrades
47
-
Azure Operator Service Manager generally supports in service upgrades, an upgrade method which advances a deployment version without interrupting the running service. Some network function owner considerations are necessary to ensure the proper behavior of AOSM during ISSU operations.
48
-
* Where AOSM performs an upgrade against an ordered set of multiple nfApps, AOSM first upgrades or creates all new nfApps, then deletes all old nfApps. This approach ensures service isn't impacted until all new nfApps are ready but requires extra platform capacity for transient hosting of both old and new nfApps.
49
-
* Where AOSM upgrades an nfApp with multiple replicas, AOSM honors the deployment profile settings for either the rolling or recreate option. Where rolling is used, expose the values `maxUnavailable` and `maxSurge` as CGS parameters, which can then be set via operator CGV at run-time.
45
+
Azure Operator Service Manager generally supports three stages of in-service operations. These stages advance, downgrade or rollback a deployed version with minimal to no interruption of the running service.
46
+
* ISSU: In-service upgrades. Advance software or configuration versions forward, to a specific NFDV target.
47
+
* ISSD: In-service downgrades. Reduce software or configuration version backwards, to a specific NFDV target.
48
+
* ISSR: In-service rollback. Rollback software or configuration versions, to the last helm release.
49
+
50
+
While ISSU and ISSD are both operator executed operations, ISSR only applies when a failures occurs in a ISSU or ISSD operation. Consider the follow high level requirements to ensure a network function behaves as desired when operated against durring ISSU, ISSD or ISSR stages.
51
+
52
+
* The reput operation upgrades or installs new nfApps first, then deletes missing nfApps last.
53
+
* This approach ensures service isn't impacted until all new nfApps are ready.
54
+
* This approach requires sufficient platform capacity for transient hosting of both old and new nfApps.
55
+
* The reput operation honors the nfApp's helm rolling or recreate deployment profile settings.
56
+
* Make sure the right setting is used to preserve service functionality during helm operations.
57
+
* Where rolling is used, consider exposing the values `maxUnavailable` and `maxSurge` as CGS parameters.
50
58
51
-
Ultimately, the ability for a given service to be upgraded without interruption is a feature of the service itself. Consult further with the service publisher to understand the in-service upgrade capabilities and ensure they're aligned with the proper AOSM behavioral options.
59
+
Ultimately, the ability for a given network function to be upgraded without service interruption requires careful co-ordination betwen both Azure Operator Service Manager and the network function. Consult further with the network function publisher to better understand supported in-service upgrade capabilities and ensure they're aligned with the proper Azure Operator Service Manager configuration options. All ISSU, ISSD and ISSU operations should be testing first in a controlled lab environment.
52
60
53
61
## Safe upgrade prerequisites
54
-
When planning for an upgrade using AOSM, address the following requirements in advance of upgrade execution, to optimize time spent attempting and ensure success of the upgrade.
62
+
When planning for an upgrade, address the following requirements in advance, to optimize time spent attempting and ensure success of the upgrade.
55
63
56
64
- Onboard updated artifacts using publisher and/or designer workflows.
57
65
- In most cases, use the existing publisher to host new version artifacts.
58
-
- Using an existing publisher supports `helm upgrade` to update an SNS to a different version.
66
+
- Using an existing publisher supports `helm upgrade` to upgrade an SNS to a different version.
59
67
- Using a new publisher requires a `helm delete` on the current SNS and then a `helm install` for the new SNS version.
60
68
- Artifact store, network service design group (NSDG), and network function design group (NFDG) are immutable and can't change.
61
69
- Changing one of these resources requires deployment of a new SNS.
62
-
- A new artifact manifest is needed to store the new charts and images.
63
-
- See [onboarding documentation](how-to-manage-artifacts-nexus.md) for details on uploading new charts and images.
70
+
- A new artifact manifest is needed to store any new charts or images.
71
+
- See [onboarding documentation](how-to-manage-artifacts-nexus.md) for details on uploading new charts oe images.
64
72
- A new NFDV, and optionally network service design version (NSDV), is needed.
65
73
- NFDV changes can be complex. We cover only basic changes in this article.
66
74
- New NSDV is only required if a new configuration group schema (CGS) version is being introduced.
@@ -82,24 +90,26 @@ Follow the following process to trigger an upgrade with AOSM.
82
90
* Create new NFDV resource
83
91
* For new NFDV versions, it must be in a valid SemVer format. The new version can be an upgrade, a greater value versus the deployed version, or a downgrade, a lower value versus the deployed version. The new version can differ by major, minor, or patch values.
84
92
* Update new NFDV parameters
85
-
* Helm chart versions can be updated, or Helm values can be updated or parameterized as necessary. New nfApps can also be added where they didn't exist in deployed version.
93
+
* Helm chart versions can be updated, or helm values can be updated or parameterized as necessary. New nfApps can also be added where they didn't exist in deployed version.
86
94
* Update NFDV for desired nfApp order
87
95
* UpdateDependsOn is an NFDV parameter used to specify ordering of nfApps during update operations. If `updateDependsOn` isn't provided, serial ordering of CNF applications, as appearing in the NFDV is used.
88
96
* Update ARM template for desired upgrade behavior
89
-
* Make sure to set any desired CNF application`timeout`, the `atomic` parameter, and `rollbackOnTestFailure` parameter. It may be useful to change these parameters over time as more confidence is gained in the upgrade.
97
+
* Make sure to set any desired `roleoverride` values for`timeout`, the `atomic` parameter, and `rollbackOnTestFailure` parameter. It may be useful to change these parameters over time as more confidence is gained in the upgrade.
90
98
* Issue SNS reput
91
99
* With onboarding complete, the reput operation is submitted. Depending on the number, size and complexity of the nfApps, the reput operation could take some time to complete (multiple hours).
92
100
* Examine reput results
93
-
* If the reput is reporting a successful result, the upgrade is complete and the user should validate the state and availability of the service. If the reput is reporting a failure, follow the steps in the upgrade failure recovery section to continue.
101
+
* If the reput is reporting a successful result, the upgrade is complete and the user should validate the state and availability of the service, as well as the component version reported for all nfApps. If the reput is reporting a failure, follow the steps in the upgrade failure recovery section to continue.
94
102
95
103
## Safe upgrade retry procedure
96
104
In cases where a reput update fails, the following process can be followed to retry the operation.
97
105
98
106
* Diagnose failed nfApp
99
-
* Resolve the root cause for nfApp failure by analyzing logs and other debugging information.
107
+
* Resolve the root cause for nfApp failure by analyzing logs and other debugging information. Pre and post hooks are a common failure point.
100
108
* Manually skip completed charts
101
-
* After fixing the failed nfApp, but before attempting an upgrade retry, consider changing the `applicationEnablement` parameter to accelerate retry behavior. This parameter can be set false, where an nfApp should be skipped. This parameter can be useful where an nfApp doesn't require an upgraded.
102
-
* Issue SNS reput retry (repeat until success)
109
+
* After fixing the failed nfApp, but before attempting an upgrade retry, consider changing the `applicationEnablement` parameter to accelerate retry behavior.
110
+
* This parameter can be set false, where an nfApp should be skipped.
111
+
* This parameter can be useful where an nfApp doesn't require an upgraded.
112
+
* Issue SNS reput retry (repeat until success).
103
113
* By default, the reput retries nfApps in the declared update order, unless they're skipped using `applicationEnablement` flag.
104
114
105
115
## Control timeouts with installOptions and UpgradeOptions
0 commit comments