
Commit c3ad446

Merge pull request #312857 from msftadam/patch-579405
Revise upgrade failure behavior documentation
2 parents fa73c10 + d484e89 commit c3ad446

1 file changed

Lines changed: 55 additions & 13 deletions


articles/operator-service-manager/safe-upgrades-nf-level-rollback.md

@@ -3,55 +3,89 @@ title: Control upgrade failure behavior with Azure Operator Service Manager
33
description: Learn about recovery behaviors including pause on failure and rollback on failure.
44
author: msftadam
55
ms.author: adamdor
6-
ms.date: 08/30/2024
6+
ms.date: 03/06/2026
77
ms.topic: upgrade-and-migration-article
88
ms.service: azure-operator-service-manager
99
---
1010

1111
# Control upgrade failure behavior
1212

13-
## Overview
14-
This guide describes the Azure Operator Service Manager (AOSM) upgrade failure behavior features for container network functions (CNFs). These features, as part of the AOSM safe upgrade practices initiative, offer a choice between faster retries, with pause on failure, versus return to starting point, with rollback on failure.
13+
This guide describes the Azure Operator Service Manager (AOSM) upgrade failure behavior features for container network functions (CNFs). For faster retries, use pause on failure. To return to the starting point, use rollback on failure.
1514

1615
## Pause on failure
1716
Any upgrade using AOSM starts with a site network service (SNS) reput operation. The reput operation processes the network function applications (nfApps) found in the network function design version (NFDV). The reput operation implements the following default logic:
17+
* A user initiates an SNS reput operation with pause-on-failure enabled.
1818
* nfApps are processed following either `updateDependsOn` ordering, or in the sequential order they appear.
19-
* nfApps with parameter `applicationEnabled` set to disable are skipped.
20-
* nfApps present, but not referenced by the new NFDV are deleted.
21-
* The execution sequence is paused if any of the nfApp upgrades fail and an atomic rollback is considered.
22-
* The failure leaves the NF resource in a failed state.
19+
* If an nfApp install or upgrade operation fails, the atomic rollback setting for that operation and nfApp is honored.
20+
* No previously completed nfApps are operated on further.
21+
* The task terminates and leaves the SNS resource in a failed state.
2322

24-
With pause on failure, AOSM rolls back only the failed nfApp, via the `testOptions`, `installOptions`, or `upgradeOptions` parameters. No action is taken on any nfApps which proceed the failed nfApp. This method allows the end user to troubleshoot the failed nfApp and then restart the upgrade from that point forward. As the default behavior, this method is the most efficient method, but may cause network function (NF) inconsistencies while in a mixed version state.
23+
With pause on failure, AOSM rolls back only the failed nfApp, via the `testOptions`, `installOptions`, or `upgradeOptions` operation parameters. No action is taken on any nfApps preceding the failed nfApp. This method allows the end user to troubleshoot the failed nfApp and then restart the upgrade from that point forward. As the default behavior, this method is the most efficient, but it may cause network function (NF) inconsistencies while the NF is in a mixed version state.
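
The pause-on-failure flow above can be sketched in a few lines of Python. This is an illustrative model only, not AOSM code: the `NfApp` class, `processing_order`, and `upgrade_with_pause` are hypothetical names, and the sketch assumes `updateDependsOn` lists the names of nfApps that must be processed first. Handling of the failed nfApp itself (the atomic rollback) is assumed to happen inside the caller-supplied `upgrade` callable.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class NfApp:
    # Hypothetical model of an nfApp entry; not an AOSM type.
    name: str
    update_depends_on: list[str] = field(default_factory=list)


def processing_order(apps: list[NfApp]) -> list[NfApp]:
    """Place each nfApp after its dependencies, otherwise keep listed order."""
    by_name = {app.name: app for app in apps}
    ordered: list[NfApp] = []
    visiting: set[str] = set()
    done: set[str] = set()

    def visit(app: NfApp) -> None:
        if app.name in done:
            return
        if app.name in visiting:
            raise ValueError(f"dependency cycle at {app.name}")
        visiting.add(app.name)
        for dep in app.update_depends_on:
            visit(by_name[dep])
        visiting.discard(app.name)
        done.add(app.name)
        ordered.append(app)

    for app in apps:
        visit(app)
    return ordered


def upgrade_with_pause(
    apps: list[NfApp], upgrade: Callable[[NfApp], bool]
) -> tuple[list[str], Optional[str]]:
    """Stop at the first failure and leave completed nfApps untouched."""
    completed: list[str] = []
    for app in processing_order(apps):
        if not upgrade(app):
            # Pause here: earlier nfApps stay on the new version.
            return completed, app.name
        completed.append(app.name)
    return completed, None
```

For example, with nfApps `a`, `b` (depending on `c`), `c`, and `d`, a failure on `b` leaves `a` and `c` upgraded and `d` unprocessed, which is the mixed version state described above.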
24+
25+
### Upgrade successful
26+
An upgrade is considered successful if all nfApps reach the desired target state without generating helm install or helm upgrade failures. In such conditions, Azure Operator Service Manager returns the following operational status and message:
27+
28+
```
29+
- Upgrade Succeeded
30+
- Provisioning State: Succeeded
31+
- Message: <empty>
32+
```
33+
34+
### Upgrade unsuccessful
35+
An upgrade is considered unsuccessful if any nfApp generates a helm install or helm upgrade failure. In such conditions, Azure Operator Service Manager returns the following operational status and message:
36+
37+
```
38+
- Upgrade Failed
39+
- Provisioning State: Failed
40+
- Message: Application(<ComponentName>) : <Failure Reason>
41+
```
2542

2643
## Rollback on failure
27-
To address risk of mismatched nfApp versions, AOSM now supports NF level rollback on failure. With this option enabled, if an nfApp operation fails, both the failed nfApp, and all prior completed nfApps, can be rolled back to initial version state. This method minimizes, or eliminates, the amount of time the NF is exposed to nfApp version mismatches. The optional rollback on failure feature works as follows:
28-
* A user initiates an SNS reput operation and enables rollback on failure.
44+
To address the risk of mismatched nfApp versions, Azure Operator Service Manager supports NF-level rollback on failure. With this option enabled, if an nfApp operation fails, both the failed nfApp and all previously completed nfApps can be rolled back to their initial version state. This method minimizes, or eliminates, the amount of time the NF is exposed to nfApp version mismatches. The optional rollback on failure feature works as follows:
45+
* A user initiates an SNS reput operation with rollback on failure enabled.
46+
* nfApps are processed following either `updateDependsOn` ordering, or in the sequential order they appear.
47+
* Atomic state for all nfApps is forced to true; any operator-provided values are ignored.
2948
* A snapshot of the current nfApp versions is captured and stored.
3049
* The snapshot is used to determine the individual nfApp actions taken to reverse actions that completed successfully.
3150
- `helm install` action on deleted components,
3251
- `helm rollback` action on upgraded components,
3352
- `helm delete` action on newly installed components
34-
* nfApp failure occurs, AOSM restores the nfApps to the snapshot version state before the upgrade, with most recent actions reverted first.
53+
* If an nfApp install or upgrade operation fails, an atomic rollback of the failed nfApp is executed first.
54+
* After the atomic rollback, the previously completed nfApps are restored to the original snapshot version, with the most recent actions reverted first.
55+
* The task terminates and leaves the SNS resource in a failed state.
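
As a rough illustration of the steps above, the following Python sketch models how a snapshot could drive the reverse actions. It is not AOSM code and does not invoke helm: `reverse_action`, `rollback_completed`, and the `run` callable are hypothetical, versions are plain strings, and `None` is assumed to mean "component not present".

```python
from typing import Callable, Optional


def reverse_action(before: Optional[str], after: Optional[str]) -> Optional[str]:
    """Pick the action that undoes a completed change, based on the snapshot."""
    if before is None and after is not None:
        return "helm delete"    # component was newly installed
    if before is not None and after is None:
        return "helm install"   # component was deleted
    if before != after:
        return "helm rollback"  # component was upgraded
    return None                 # nothing changed, nothing to undo


def rollback_completed(
    snapshot: dict[str, Optional[str]],
    completed: list[tuple[str, Optional[str]]],
    run: Callable[[str, str], bool],
) -> tuple[str, list[tuple[str, str]]]:
    """Revert completed nfApps, most recent first; stop at the first failure."""
    executed: list[tuple[str, str]] = []
    for name, new_version in reversed(completed):
        action = reverse_action(snapshot.get(name), new_version)
        if action is None:
            continue
        executed.append((action, name))
        if not run(action, name):
            # No further rollback-eligible nfApps are processed.
            return "Rollback Failed", executed
    return "Rollback Succeeded", executed
```

The `completed` list holds only nfApps whose operation finished successfully; the failed nfApp is assumed to have been handled separately by its atomic rollback before this function is called.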
3556

3657
> [!NOTE]
3758
> * AOSM doesn't create a snapshot if a user doesn't enable rollback on failure.
3859
> * A rollback on failure only applies to the successfully completed nfApps.
39-
> - Use the `testOptions`, `installOptions`, or `upgradeOptions` parameters to control rollback of the failed nfApp.
60+
> * An error with the atomic rollback isn't treated as a rollback failure.
61+
62+
### Upgrade successful
63+
An upgrade is considered successful if all nfApps reach the desired target state without generating helm install or helm upgrade failures. In such conditions, Azure Operator Service Manager returns the following operational status and message:
4064

41-
AOSM returns the following operational status and messages, given the respective results:
4265
```
4366
- Upgrade Succeeded
4467
- Provisioning State: Succeeded
4568
- Message: <empty>
69+
```
4670

71+
### Rollback successful
72+
A rollback is considered successful if all previously completed nfApps reach the original snapshot state without generating a helm rollback failure. In such conditions, Azure Operator Service Manager returns the following operational status and message:
73+
74+
```
4775
- Upgrade Failed, Rollback Succeeded
4876
- Provisioning State: Failed
4977
- Message: Application(<ComponentName>) : <Failure Reason>; Rollback succeeded
78+
```
5079

80+
### Rollback unsuccessful
81+
A rollback is considered unsuccessful if any prior completed nfApps fail to reach the original snapshot state, instead generating a helm rollback failure. In such conditions, Azure Operator Service Manager stops processing any further rollback-eligible nfApps and terminates with the following operational status and message:
82+
83+
```
5184
- Upgrade Failed, Rollback Failed
5285
- Provisioning State: Failed
5386
- Message: Application(<ComponentName>) : <Failure reason>; Rollback Failed (<RollbackComponentName>) : <Rollback Failure reason>
5487
```
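
The three rollback-on-failure outcomes can be summarized with a small helper. This is a sketch for illustration only: `upgrade_status` is a hypothetical function, not part of any AOSM SDK, and it simply reproduces the status, provisioning state, and message formats shown in the blocks above.

```python
from typing import Optional

Failure = tuple[str, str]  # (component name, failure reason)


def upgrade_status(
    failure: Optional[Failure] = None,
    rollback_failure: Optional[Failure] = None,
) -> tuple[str, str, str]:
    """Return (status, provisioning state, message) for an upgrade result."""
    if failure is None:
        return ("Upgrade Succeeded", "Succeeded", "")
    message = f"Application({failure[0]}) : {failure[1]}"
    if rollback_failure is None:
        return (
            "Upgrade Failed, Rollback Succeeded",
            "Failed",
            f"{message}; Rollback succeeded",
        )
    return (
        "Upgrade Failed, Rollback Failed",
        "Failed",
        f"{message}; Rollback Failed ({rollback_failure[0]}) : {rollback_failure[1]}",
    )
```

A helper like this can be handy in tests or tooling that needs to match the returned message against an expected outcome.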
88+
5589
## How to configure rollback on failure
5690
The most flexible method to control failure behavior is to extend a new configuration group schema (CGS) parameter, `rollbackEnabled`, to allow for configuration group value (CGV) control via `roleOverrideValues` in the NF payload. First, define the CGS parameter:
5791
```
@@ -96,6 +130,14 @@ example:
96130
> * Each `roleOverrideValues` entry overrides the default behavior of the nfApps.
97131
> * If multiple entries of `nfConfiguration` are found in the `roleOverrideValues`, then the NF reput is returned as a bad request.
98132
133+
## Manage nfApps that don't support rollback
134+
Almost all publishers report some nfApps that aren't compatible with helm rollback operations. These nfApps may be sourced from third parties who don't commonly support such strict resiliency requirements, or they may be database applications with complicated schema management requirements. In these cases, take special care when dealing with nfApps that don't support rollback.
135+
136+
* The strong preference is to push vendors to support helm rollback.
137+
* nfApps that don't support rollback can't be skipped.
138+
* nfApp rollback order can't change.
139+
* Incremental-NFDV approach must be used in these situations.
140+
99141
## How to troubleshoot rollback on failure
100142
### Understand pod states
101143
Understanding the different pod states is crucial for effective troubleshooting. The following are the most common pod states:
