This guide describes the Azure Operator Service Manager (AOSM) upgrade failure behavior features for container network functions (CNFs). For faster retries, use pause on failure. To return to the starting point, use rollback on failure.
-## Pause on failure
+## Pause on failure overview
Any upgrade using AOSM starts with a site network service (SNS) reput operation. The reput operation processes the network function applications (nfApps) found in the network function design version (NFDV). The reput operation implements the following default logic:
* A user initiates an SNS reput operation with pause-on-failure enabled.
* nfApps are processed following either `updateDependsOn` ordering, or in the sequential order they appear.
@@ -40,7 +39,7 @@ An upgrade is considered unsuccessful if any nfApp generates a helm install or h
To address the risk of mismatched nfApp versions, Azure Operator Service Manager supports NF-level rollback on failure. With this option enabled, if an nfApp operation fails, both the failed nfApp and all previously completed nfApps can be rolled back to their initial version state. This method minimizes, or eliminates, the time the NF is exposed to nfApp version mismatches. The optional rollback on failure feature works as follows:
* A user initiates an SNS reput operation with rollback on failure enabled.
* nfApps are processed following either `updateDependsOn` ordering, or in the sequential order they appear.
@@ -86,7 +85,7 @@ A rollback is considered unsuccessful if any prior completed nfApps fail to reac
The most flexible method to control failure behavior is to extend a new configuration group schema (CGS) parameter, `rollbackEnabled`, to allow for configuration group value (CGV) control via `roleOverrideValues` in the NF payload. First, define the CGS parameter:
```
{
@@ -133,11 +132,48 @@ example:
## Manage nfApps that don't support rollback
Almost all publishers report some nfApps that aren't compatible with helm rollback operations. These nfApps may be sourced from third parties who don't commonly support such strict resiliency requirements, or they may be database applications with complicated schema management requirements. In these cases, take special care with nfApps that don't support rollback.
-* The strong preference is to push vendors to support helm rollback.
+* The strong preference is to push publishers to support helm rollback for all nfApps.
* nfApps that don't support rollback can't be skipped.
* nfApp rollback order can't change.
* Incremental-NFDV approach must be used in these situations.
+### Selective rollback using incremental NFDVs
+A network function’s composition often includes one or more nfApplications that can't support a helm rollback operation, such as Elastic or VoltDb. If a rollback is attempted on one of these nfApplications, the resulting nfApplication is broken. Pursuing publisher enhancements to make these nfApplications rollback compliant is the best solution. Recognizing the potential for long publisher enhancement lead times, a method to prevent execution of rollback on selected nfApplications is needed. Selectively skipping rollback requires thorough testing with the network function owners because it results in a transient condition where multiple version permutations exist.
+
+#### Problem Statement
+At the network function level, when nfRollbackEnabled is true and a failure occurs during an upgrade or install, a rollback is executed across all nfApps that precede the failure. This may include nfApps that are rollback noncompliant. A selective rollback parameter isn't supported because it introduces the risk of an operational state that doesn't correspond to a defined NFDV. This state mismatch results in nondeterministic behavior, significantly increases the testing surface, and undermines the reliability guarantees of deployment processes. Instead, we rely on NFDVs to ensure deterministic workload states that map to well-defined and tested deployment configurations.
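The scope of an NF-level rollback can be illustrated with a minimal Python sketch. This is a reasoning aid only, not AOSM code: the `NfApp` class, the `supports_rollback` flag, and both helper functions are hypothetical names introduced here, and the sketch assumes nfApps are processed in list order. It doesn't model the order in which rollbacks run.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NfApp:
    """Hypothetical model of an nfApp; not an AOSM type."""
    name: str
    supports_rollback: bool = True


def rollback_scope(nf_apps, failed_index, nf_rollback_enabled):
    """Return the nfApps touched by an NF-level rollback.

    The scope is the failed nfApp plus every nfApp processed before it.
    With rollback disabled, nothing is rolled back.
    """
    if not nf_rollback_enabled:
        return []
    return list(nf_apps[: failed_index + 1])


def noncompliant_in_scope(nf_apps, failed_index, nf_rollback_enabled):
    """Names of nfApps in the rollback scope that can't tolerate rollback."""
    scope = rollback_scope(nf_apps, failed_index, nf_rollback_enabled)
    return [app.name for app in scope if not app.supports_rollback]


# Illustrative nfApp names; the first one is assumed to be rollback noncompliant.
apps = [NfApp("database", supports_rollback=False), NfApp("api"), NfApp("frontend")]

# A failure on the last nfApp pulls the noncompliant one into the rollback scope.
print(noncompliant_in_scope(apps, failed_index=2, nf_rollback_enabled=True))
```

Because the scope always includes everything processed before the failure, a noncompliant nfApp early in the ordering is exposed to rollback whenever any later nfApp fails.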
+
+#### Proposed Solution
+AOSM proposes that publishers use a combination of skipUpgrade and nfRollbackEnabled configurations in CGVs, along with multiple NFDVs, to logically segment nfApplications based on rollback compatibility. This multi-NFDV strategy allows customers to bypass rollback for select charts while preserving safety for the rest. The approach is production-safe, aligns with existing AOSM mechanisms, and effectively simulates per-chart rollback behavior using NFDV-level constructs. Consider the following example, where a network function is composed of 20 nfApps, five of which don't support rollback.
+
+* NFDV1:
+  * Performs initial install of all 20 charts with version v1.0.
+  * In CGV1: rollbackEnabled: irrelevant (fresh install).
+* NFDV2:
+  * Contains all 20 charts, with only the five Helm charts that lack rollback support upgraded to v2.0.
+  * In CGV2:
+    * Use skipUpgrade: true for the remaining 15 charts.
+    * Set nfRollbackEnabled: false.
+  * Result:
+    * Success: Only the five charts upgrade.
+    * Failure:
+      * No rollback occurs if the upgrade fails.
+      * Due to chart limitations, the workload is left in a nondeterministic state and no rollback is possible. To recover, there are two options:
+        * Upgrade with a working NFDV2.
+        * Upgrade with NFDV1 and skipUpgrade disabled for every nfApplication.
+* NFDV3:
+  * Contains all 20 charts, with the 15 rollback-compatible charts upgraded to v2.0.
+  * In CGV3:
+    * Use skipUpgrade: true for the five charts already handled in NFDV2.
+    * Set nfRollbackEnabled: true.
+  * Result: The remaining 15 charts upgrade; rollback occurs on failure.
+
+> [!NOTE]
+> * The five rollback-incompatible charts must not have runtime upgrade dependencies on charts in NFDV3.
+> * AOSM's rollback design assumes that rollback restores the workload state to the previous NFDV state.
+
+This approach provides cleaner separation and manageability of applications that don't support standard helm operations. It maintains the operation’s idempotency, and the state on the cluster reflects the last operation. NFDV2 and NFDV3 can also be used directly for install operations (installing the previous version isn't needed) with no difference in goal state. Overall upgrade time and deployment reliability remain the same.
+
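The staged NFDV2 and NFDV3 example can be checked with a small Python sketch. It is only a model of the reasoning, under stated assumptions: `Stage`, `charts_upgraded`, `check_plan`, and the `chart-01` to `chart-20` names are hypothetical and aren't part of any AOSM API or schema. The sketch confirms that no stage with rollback enabled upgrades a rollback-incompatible chart, and that every chart is upgraded by some stage.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    """Hypothetical model of one NFDV/CGV upgrade stage."""
    name: str
    skip_upgrade: frozenset      # charts with skipUpgrade: true in this stage
    nf_rollback_enabled: bool    # value of nfRollbackEnabled for this stage


def charts_upgraded(all_charts, stage):
    """Charts that actually upgrade in a stage (those not skipped)."""
    return [c for c in all_charts if c not in stage.skip_upgrade]


def check_plan(all_charts, no_rollback_charts, stages):
    """Return a list of problems found in a staged upgrade plan."""
    problems = []
    covered = set()
    for stage in stages:
        upgraded = set(charts_upgraded(all_charts, stage))
        exposed = upgraded & set(no_rollback_charts)
        # A rollback-incompatible chart must not upgrade with rollback enabled.
        if stage.nf_rollback_enabled and exposed:
            problems.append(f"{stage.name}: rollback enabled for {sorted(exposed)}")
        covered |= upgraded
    missing = set(all_charts) - covered
    if missing:
        problems.append(f"never upgraded: {sorted(missing)}")
    return problems


# 20 charts; the first five are assumed to lack rollback support.
all_charts = [f"chart-{i:02d}" for i in range(1, 21)]
no_rollback = frozenset(all_charts[:5])
compatible = frozenset(all_charts[5:])

plan = [
    Stage("NFDV2", skip_upgrade=compatible, nf_rollback_enabled=False),
    Stage("NFDV3", skip_upgrade=no_rollback, nf_rollback_enabled=True),
]
print(check_plan(all_charts, no_rollback, plan))  # an empty list means no problems
```

Setting `nf_rollback_enabled=True` on the NFDV2 stage would make `check_plan` report the five incompatible charts, which is the condition the multi-NFDV split is meant to avoid.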
## How to troubleshoot rollback on failure
### Understand pod states
Understanding the different pod states is crucial for effective troubleshooting. The following are the most common pod states: