articles/operator-service-manager/safe-upgrades-nf-level-rollback.md
55 additions & 13 deletions
@@ -3,55 +3,89 @@ title: Control upgrade failure behavior with Azure Operator Service Manager
3
3
description: Learn about recovery behaviors including pause on failure and rollback on failure.
4
4
author: msftadam
5
5
ms.author: adamdor
6
-
ms.date: 08/30/2024
6
+
ms.date: 03/06/2026
7
7
ms.topic: upgrade-and-migration-article
8
8
ms.service: azure-operator-service-manager
9
9
---
10
10
11
11
# Control upgrade failure behavior
12
12
13
-
## Overview
14
-
This guide describes the Azure Operator Service Manager (AOSM) upgrade failure behavior features for container network functions (CNFs). These features, as part of the AOSM safe upgrade practices initiative, offer a choice between faster retries, with pause on failure, versus return to starting point, with rollback on failure.
13
+
This guide describes the Azure Operator Service Manager (AOSM) upgrade failure behavior features for container network functions (CNFs). For faster retries, use pause on failure. To return to the starting point, use rollback on failure.
15
14
16
15
## Pause on failure
17
16
Any upgrade using AOSM starts with a site network service (SNS) reput operation. The reput operation processes the network function applications (nfApps) found in the network function design version (NFDV). The reput operation implements the following default logic:
17
+
* A user initiates an SNS reput operation with pause-on-failure enabled.
18
18
* nfApps are processed following either `updateDependsOn` ordering, or in the sequential order they appear.
19
-
* nfApps with parameter `applicationEnabled` set to disable are skipped.
20
-
* nfApps present, but not referenced by the new NFDV are deleted.
21
-
* The execution sequence is paused if any of the nfApp upgrades fail and an atomic rollback is considered.
22
-
* The failure leaves the NF resource in a failed state.
19
+
* If an nfApp install or upgrade operation fails, the atomic rollback setting for that operation and nfApp is honored.
20
+
* Previously completed nfApps aren't operated on further.
21
+
* The task terminates and leaves the SNS resource in a failed state.
23
22
24
-
With pause on failure, AOSM rolls back only the failed nfApp, via the `testOptions`, `installOptions`, or `upgradeOptions` parameters. No action is taken on any nfApps which proceed the failed nfApp. This method allows the end user to troubleshoot the failed nfApp and then restart the upgrade from that point forward. As the default behavior, this method is the most efficient method, but may cause network function (NF) inconsistencies while in a mixed version state.
23
+
With pause on failure, AOSM rolls back only the failed nfApp, via the `testOptions`, `installOptions`, or `upgradeOptions` operation parameters. No action is taken on any nfApps preceding the failed nfApp. This method allows the end user to troubleshoot the failed nfApp and then restart the upgrade from that point forward. As the default behavior, this method is the most efficient, but it may cause network function (NF) inconsistencies while in a mixed version state.
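
The pause-on-failure flow described above can be sketched as a small Python simulation. This is an illustrative model only, not AOSM code: the `NfApp` class, its `fails` flag, and the `reput_pause_on_failure` function are hypothetical names introduced here, and the helm operations are simulated rather than executed.

```python
from dataclasses import dataclass


@dataclass
class NfApp:
    name: str
    fails: bool = False  # hypothetical flag that simulates a helm install/upgrade failure


def reput_pause_on_failure(nf_apps):
    """Process nfApps in order and stop at the first failure.

    Completed nfApps stay at the new version; only the failed nfApp
    would be rolled back, according to its own atomic setting.
    """
    completed = []
    for app in nf_apps:
        if app.fails:
            # Pause here: nfApps after this one are never attempted,
            # and nfApps before it are left untouched.
            return {"state": "Failed", "completed": completed, "failed": app.name}
        completed.append(app.name)
    return {"state": "Succeeded", "completed": completed, "failed": None}


result = reput_pause_on_failure([NfApp("a"), NfApp("b", fails=True), NfApp("c")])
print(result)
```

In this sketch, nfApp `a` remains upgraded while `b` fails and `c` is never attempted, which is the mixed version state the paragraph above warns about.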
24
+
25
+
### Upgrade successful
26
+
An upgrade is considered successful if all nfApps reach the desired target state without generating helm install or helm upgrade failures. In such conditions, Azure Operator Service Manager returns the following operational status and message:
27
+
28
+
```
29
+
- Upgrade Succeeded
30
+
- Provisioning State: Succeeded
31
+
- Message: <empty>
32
+
```
33
+
34
+
### Upgrade unsuccessful
35
+
An upgrade is considered unsuccessful if any nfApp generates a helm install or helm upgrade failure. In such conditions, Azure Operator Service Manager returns the following operational status and message:
To address risk of mismatched nfApp versions, AOSM now supports NF level rollback on failure. With this option enabled, if an nfApp operation fails, both the failed nfApp, and all prior completed nfApps, can be rolled back to initial version state. This method minimizes, or eliminates, the amount of time the NF is exposed to nfApp version mismatches. The optional rollback on failure feature works as follows:
28
-
* A user initiates an SNS reput operation and enables rollback on failure.
44
+
To address risk of mismatched nfApp versions, Azure Operator Service Manager supports NF level rollback on failure. With this option enabled, if an nfApp operation fails, both the failed nfApp, and all prior completed nfApps, can be rolled back to initial version state. This method minimizes, or eliminates, the amount of time the NF is exposed to nfApp version mismatches. The optional rollback on failure feature works as follows:
45
+
* A user initiates an SNS reput operation with rollback on failure enabled.
46
+
* nfApps are processed following either `updateDependsOn` ordering, or in the sequential order they appear.
47
+
* Atomic state for all nfApps is forced to true; any operator-provided values are ignored.
29
48
* A snapshot of the current nfApp versions is captured and stored.
30
49
* The snapshot is used to determine the individual nfApp actions taken to reverse actions that completed successfully.
31
50
  - `helm install` action on deleted components,
32
51
  - `helm rollback` action on upgraded components,
33
52
  - `helm delete` action on newly installed components
34
-
* nfApp failure occurs, AOSM restores the nfApps to the snapshot version state before the upgrade, with most recent actions reverted first.
53
+
* If an nfApp install or upgrade operation fails, an atomic rollback of the failed nfApp is executed first.
54
+
* After the atomic rollback, the prior completed nfApps are restored to the original snapshot version, with the most recent actions reverted first.
55
+
* The task terminates and leaves the SNS resource in a failed state.
35
56
36
57
> [!NOTE]
37
58
> * AOSM doesn't create a snapshot if a user doesn't enable rollback on failure.
38
59
> * A rollback on failure only applies to the successfully completed nfApps.
39
-
> - Use the `testOptions`, `installOptions`, or `upgradeOptions` parameters to control rollback of the failed nfApp.
60
+
> * An error with the atomic rollback isn't treated as a rollback failure.
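
The snapshot-based reversal in the steps above can be sketched in Python as follows. This is a hedged illustration under stated assumptions, not the AOSM implementation: `plan_rollback`, the snapshot dictionary layout, and the action labels `install`, `upgrade`, and `delete` are hypothetical names chosen for this example. Only the mapping from completed action to reverse helm action comes from the list above.

```python
# Reverse helm action for each completed action, as listed above.
REVERSE_ACTION = {
    "delete": "helm install",    # deleted components are reinstalled
    "upgrade": "helm rollback",  # upgraded components are rolled back
    "install": "helm delete",    # newly installed components are deleted
}


def plan_rollback(snapshot, completed_actions):
    """Build the rollback plan, most recent completed action first.

    snapshot: dict mapping nfApp name to its version before the upgrade
              (absent if the nfApp did not exist before).
    completed_actions: list of (name, action) tuples in completion order.
    """
    plan = []
    for name, action in reversed(completed_actions):
        plan.append((name, REVERSE_ACTION[action], snapshot.get(name)))
    return plan


snapshot = {"db": "1.0", "api": "1.0"}
completed = [("db", "upgrade"), ("cache", "install"), ("api", "upgrade")]
for step in plan_rollback(snapshot, completed):
    print(step)
```

The plan covers only the successfully completed nfApps. The failed nfApp is assumed to have been handled separately by its atomic rollback before this plan runs.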
61
+
62
+
### Upgrade successful
63
+
An upgrade is considered successful if all nfApps reach the desired target state without generating helm install or helm upgrade failures. In such conditions, Azure Operator Service Manager returns the following operational status and message:
40
64
41
-
AOSM returns the following operational status and messages, given the respective results:
42
65
```
43
66
- Upgrade Succeeded
44
67
- Provisioning State: Succeeded
45
68
- Message: <empty>
69
+
```
46
70
71
+
### Rollback successful
72
+
A rollback is considered successful if all prior completed nfApps reach the original snapshot state without generating a helm rollback failure. In such conditions, Azure Operator Service Manager returns the following operational status and message:
A rollback is considered unsuccessful if any prior completed nfApps fail to reach the original snapshot state, instead generating a helm rollback failure. In such conditions, Azure Operator Service Manager stops processing any further rollback-eligible nfApps and terminates with the following operational status and message:
The most flexible method to control failure behavior is to extend a new configuration group schema (CGS) parameter, `rollbackEnabled`, to allow for configuration group value (CGV) control via `roleOverrideValues` in the NF payload. First, define the CGS parameter:
57
91
```
@@ -96,6 +130,14 @@ example:
96
130
> * Each `roleOverrideValues` entry overrides the default behavior of the nfApps.
97
131
> * If multiple entries of `nfConfiguration` are found in the `roleOverrideValues`, then the NF reput is returned as a bad request.
98
132
133
+
## Manage nfApps that don't support rollback
134
+
Almost all publishers report some nfApps that aren't compatible with helm rollback operations. These nfApps may be sourced from third parties who don't commonly support such strict resiliency requirements, or they may be database applications with complicated schema management requirements. In these cases, take special care when dealing with nfApps that don't support rollback.
135
+
136
+
* The strong preference is to push vendors to support helm rollback.
137
+
* nfApps that don't support rollback can't be skipped.
138
+
* nfApp rollback order can't change.
139
+
* An incremental-NFDV approach must be used in these situations.
140
+
99
141
## How to troubleshoot rollback on failure
100
142
### Understand pod states
101
143
Understanding the different pod states is crucial for effective troubleshooting. The following are the most common pod states: