This guide describes the Azure Operator Service Manager (AOSM) upgrade failure behavior features for container network functions (CNFs). For faster retries, use pause on failure. To return to the starting point, use rollback on failure.
-## Pause on failure
+## Pause on failure overview
Any upgrade using AOSM starts with a site network service (SNS) reput operation. The reput operation processes the network function applications (nfApps) found in the network function design version (NFDV). The reput operation implements the following default logic:
* A user initiates an SNS reput operation with pause-on-failure enabled.
* nfApps are processed following either `updateDependsOn` ordering, or in the sequential order they appear.
@@ -40,7 +39,7 @@ An upgrade is considered unsuccessful if any nfApp generates a helm install or h
To address the risk of mismatched nfApp versions, Azure Operator Service Manager supports NF-level rollback on failure. With this option enabled, if an nfApp operation fails, both the failed nfApp and all previously completed nfApps can be rolled back to their initial version state. This method minimizes, or eliminates, the time the NF is exposed to nfApp version mismatches. The optional rollback on failure feature works as follows:
* A user initiates an SNS reput operation with rollback on failure enabled.
* nfApps are processed following either `updateDependsOn` ordering, or in the sequential order they appear.
@@ -86,7 +85,7 @@ A rollback is considered unsuccessful if any prior completed nfApps fail to reac
The most flexible method to control failure behavior is to extend a new configuration group schema (CGS) parameter, `rollbackEnabled`, to allow for configuration group value (CGV) control via `roleOverrideValues` in the NF payload. First, define the CGS parameter:
```
{
@@ -133,11 +132,48 @@ example:
## Manage nfApps that don't support rollback
Almost all publishers report some nfApps that aren't compatible with helm rollback operations. These nfApps may be sourced from third parties who don't commonly support such strict resiliency requirements, or they may be database applications with complicated schema management requirements. In these cases, special consideration is needed to deal with nfApps that don't support rollback.
-* The strong preference is to push vendors to support helm rollback.
+* The strong preference is to push publishers to support helm rollback for all nfApps.
* nfApps that don't support rollback can't be skipped.
* nfApp rollback order can't change.
* Incremental-NFDV approach must be used in these situations.
+### Selective rollback using incremental NFDVs
+A network function’s composition often includes one or more nfApplications that can't support a helm rollback operation, such as Elastic or VoltDb. If a rollback is attempted on one of these nfApplications, the resulting nfApplication is broken. Work on additional automation and other enhancements to make these nfApplications rollback compliant is underway, but because of the long lead time, AOSM must support a method to prevent rollback on selected nfApplications. Note that skipping rollback for selected applications requires thorough testing by the network function owners, because it creates multiple permutations and combinations of application versions during upgrade and rollback.
+
+#### Problem Statement
+At the NF level, AOSM currently supports rollback-on-failure. When `nfRollbackEnabled` is true, if a non-compliant nfApplication is upgraded, and a failure occurs later in the order, a rollback is executed on the non-compliant nfApplication. At the nfApplication level, AOSM currently supports `applicationEnablement`, `atomic`, and `skipUpgrade` via `roleOverrideValues` in CGVs, but does not support selective rollback. Currently, AOSM relies on NFDVs to ensure deterministic workload states that map to well-defined and tested deployment configurations. Allowing selective rollbacks introduces the risk of ending up in an undefined state that does not correspond to any known NFDV. This leads to non-deterministic behavior, increases the testing surface significantly, and undermines the reliability guarantees of our deployment process.
+
+#### Proposed Solution
+AOSM proposes that publishers use a combination of `skipUpgrade` and `nfRollbackEnabled` configurations in CGVs, along with multiple NFDVs, to logically segment nfApplications based on rollback compatibility. This multi-NFDV strategy allows customers to bypass rollback for select charts while preserving safety for the rest. The approach is production-safe and aligns with existing AOSM mechanisms. It effectively simulates per-chart rollback behavior using NFDV-level constructs. Consider the following example, where a network function is composed of 20 nfApps, 5 of which don't support rollback.
+
+* NFDV1:
+  * Performs an initial install of all 20 charts at version v1.0.
+  * In CGV1: `rollbackEnabled` is irrelevant (fresh install).
+* NFDV2:
+  * Contains all 20 charts, with only the 5 Helm charts that lack rollback support upgraded to v2.0.
+  * In CGV2:
+    * Use `skipUpgrade: true` for the remaining 15 charts.
+    * Set `nfRollbackEnabled: false`.
+  * Result:
+    * Success: Only the 5 charts upgrade.
+    * Failure:
+      * No rollback occurs if the upgrade fails.
+      * NOTE: In this case, the workload is left in a non-deterministic state, because the chart limitation makes rollback impossible and rollback is intentionally kept disabled. There are two options to recover from this state:
+        * Upgrade with a working NFDV2.
+        * Upgrade with NFDV1 and `skipUpgrade` disabled for every nfApplication.
+* NFDV3:
+  * Contains all charts, with the 15 rollback-compatible charts upgraded to v2.0.
+  * In CGV3:
+    * Use `skipUpgrade: true` for the 5 charts already handled in NFDV2.
+    * Set `nfRollbackEnabled: true`.
+  * Result: The remaining 15 charts upgrade; rollback occurs on failure.
+
+> [!NOTE]
+> * The 5 rollback-incompatible charts must not have runtime upgrade dependencies on charts in NFDV3.
+> * AOSM's rollback design assumes that rollback restores the workload state to the previous NFDV state.
+
+This approach provides cleaner separation and manageability of applications that don't support standard helm operations. It maintains the operation’s idempotency, and the state on the cluster is reflected by the last operation. NFDV2 and NFDV3 can also be used directly for install operations (installing the previous version isn't needed) without any difference in goal state. Overall upgrade time and deployment reliability remain the same.
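
As a rough illustration of the staged approach described above, the following Python sketch models the NFDV2 and NFDV3 stages and what happens to chart versions when an upgrade fails. It does not call AOSM or Azure; `Stage`, `plan_stages`, `apply_stage`, and the chart names `app01` to `app20` are hypothetical names introduced only for this example, and the choice of which 5 charts lack rollback support is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    """One NFDV/CGV pair in the staged upgrade (hypothetical model, not an AOSM type)."""
    name: str
    upgrade: tuple             # charts this stage moves to the target version
    skip_upgrade: tuple        # charts that would carry skipUpgrade: true in the CGV
    nf_rollback_enabled: bool  # NF-level rollback-on-failure switch


def plan_stages(charts, no_rollback):
    """Split one upgrade into the two stages from the example: NFDV2, then NFDV3."""
    unsafe = tuple(c for c in charts if c in no_rollback)
    safe = tuple(c for c in charts if c not in no_rollback)
    return [
        Stage("NFDV2", upgrade=unsafe, skip_upgrade=safe, nf_rollback_enabled=False),
        Stage("NFDV3", upgrade=safe, skip_upgrade=unsafe, nf_rollback_enabled=True),
    ]


def apply_stage(versions, stage, target, fail_on=None):
    """Return chart versions after a stage; fail_on names a chart whose upgrade fails."""
    result = dict(versions)
    for chart in stage.upgrade:
        if chart == fail_on:
            # With NF-level rollback, the pre-stage versions are restored.
            # Without it, charts upgraded so far keep the new version.
            return dict(versions) if stage.nf_rollback_enabled else result
        result[chart] = target
    return result


charts = [f"app{i:02d}" for i in range(1, 21)]
no_rollback = set(charts[:5])  # assumption: the first 5 charts can't roll back
stages = plan_stages(charts, no_rollback)
v1 = {c: "v1.0" for c in charts}

# NFDV2 succeeds, then NFDV3 fails partway and is rolled back.
after_nfdv2 = apply_stage(v1, stages[0], "v2.0")
after_failed_nfdv3 = apply_stage(after_nfdv2, stages[1], "v2.0", fail_on="app10")

# NFDV2 fails partway: no rollback, so the workload is left in a mixed state.
after_failed_nfdv2 = apply_stage(v1, stages[0], "v2.0", fail_on="app03")
```

In this model, a failed NFDV3 returns the workload to the post-NFDV2 state, while a failed NFDV2 leaves some charts upgraded and others not, which matches the recovery note in the example (retry with a working NFDV2, or go back to NFDV1).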
+
## How to troubleshoot rollback on failure
### Understand pod states
Understanding the different pod states is crucial for effective troubleshooting. The following are the most common pod states: