Commit 8c3cba5

Merge pull request #312976 from msftadam/patch-631519
Update nfApps rollback compatibility and NFDV strategy
2 parents 63acef2 + cd7a9d8 commit 8c3cba5

2 files changed

Lines changed: 49 additions & 28 deletions

File tree

articles/operator-service-manager/configuration-guide.md

Lines changed: 24 additions & 1 deletion
@@ -12,7 +12,7 @@ ms.service: azure-operator-service-manager
1212

1313
This article provides Azure Operator Service Manager guidelines to optimize the design of configuration group schemas (CGSs) and the operation of configuration group values (CGVs). Network function (NF) vendors, telco operators, and their partners should keep these practices in mind when onboarding and deploying NFs.
1414

15-
## Configuring group resource approach
15+
## Configuration group approach
1616

1717
Consider the following meta-schema guidelines when you're designing configuration resources:
1818

@@ -91,6 +91,29 @@ This example shows the resulting CGV resource that Azure Operator Service Manage
9191
}
9292
```
9393

94+
## CGS with secrets
95+
Other than separating secrets into a unique CGS, no special CGS requirements exist for secret support.
96+
97+
## CGV with secrets
98+
Consider the following configuration requirements to properly obscure secret values:
99+
* Use `configurationType: 'Secret'` in the resource properties.
100+
* Once a CGV is deployed, this prevents the display of the resource in most Azure methods.
101+
* Use a reference to Azure Key Vault (AKV) in place of the plain-text secret.
102+
* This obscures the display of the secret in the CGV deployment template.
103+
104+
The following example shows how to include an AKV reference in an ARM template:
105+
```json
106+
"password": {
107+
"reference": {
108+
"keyVault": {
109+
"id": "/subscriptions/xxx/resourceGroups/yyy/providers/Microsoft.KeyVault/vaults/zz"
110+
},
111+
"secretName": "passwd"
112+
}
}
113+
```
114+
115+
To further secure resources, restrict access to the following RBAC action: `Microsoft.Resources/deployments/exportTemplate/action`
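
As a minimal sketch of the reference structure shown above, the following Python helper builds the Key Vault reference object and flags any secret parameter that doesn't use one. The function names `key_vault_reference` and `find_plaintext_secrets`, the `apiToken` parameter, and its value are hypothetical; they aren't part of Azure Operator Service Manager or any Azure SDK.

```python
import json


def key_vault_reference(vault_id, secret_name):
    """Build a parameter value that points at a Key Vault secret."""
    if not vault_id.startswith("/subscriptions/"):
        raise ValueError("vault_id must be a full Azure resource ID")
    if not secret_name:
        raise ValueError("secret_name must not be empty")
    return {"reference": {"keyVault": {"id": vault_id}, "secretName": secret_name}}


def find_plaintext_secrets(parameters, secret_names):
    """Return the secret parameter names that don't use a Key Vault reference."""
    plaintext = []
    for name in sorted(secret_names):
        value = parameters.get(name)
        is_reference = isinstance(value, dict) and "reference" in value
        if not is_reference:
            plaintext.append(name)
    return plaintext


# Placeholder resource ID copied from the example above.
VAULT_ID = "/subscriptions/xxx/resourceGroups/yyy/providers/Microsoft.KeyVault/vaults/zz"

parameters = {
    "password": key_vault_reference(VAULT_ID, "passwd"),
    "apiToken": {"value": "not-a-real-token"},  # hypothetical plain-text value
}

print(json.dumps(parameters["password"], indent=2))
print(find_plaintext_secrets(parameters, {"password", "apiToken"}))
```

A check like this could run before a CGV template is deployed, so that plain-text secrets are caught before they appear in a deployment template.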
116+
94117
## Overview of JSON Schema
95118

96119
JSON Schema is an Internet Engineering Task Force (IETF) standard that provides a format for what JSON data is required for an application and how to interact with it. Applying such standards for a JSON document helps you enforce consistency and data validity across JSON data.

articles/operator-service-manager/safe-upgrades-nf-level-rollback.md

Lines changed: 25 additions & 27 deletions
@@ -3,7 +3,7 @@ title: Control upgrade failure behavior with Azure Operator Service Manager
33
description: Learn about recovery behaviors including pause on failure and rollback on failure.
44
author: msftadam
55
ms.author: adamdor
6-
ms.date: 03/06/2026
6+
ms.date: 03/10/2026
77
ms.topic: upgrade-and-migration-article
88
ms.service: azure-operator-service-manager
99
---
@@ -130,51 +130,49 @@ example:
130130
> * If multiple entries of `nfConfiguration` are found in the `roleOverrideValues`, then the NF reput is returned as a bad request.
131131
132132
## Manage nfApps that don't support rollback
133-
Almost all publishers report some nfApps that aren't compatible with helm rollback operations. These nfApps maybe sourced from third-parties who don't common support such strict resiliency requirements. These nfApps maybe related to database applications with complicated schema management requirements. In these cases, special consideration should be taken to deal with nfApps that don't support rollback.
134-
135-
* The strong preference is to push publishers to support helm rollback for all nfApps.
133+
Almost all publishers have some nfApps that aren't compatible with Helm rollback. These nfApps may be sourced from third parties who don't commonly support strict resiliency requirements, or they may be database applications with complicated schema management requirements. Consider the following restrictions when onboarding services with nfApps that don't support rollback.
136134
* nfApps that don't support rollback can't be skipped.
137135
* nfApp rollback order can't change.
138136
* Incremental-NFDV approach must be used in these situations.
139137

140138
### Selective rollback using incremental NFDVs
141-
A network function’s composition often includes one, or more, nfApplications that can't support a helm rollback operation, such as Elastic or VoltDb. If a rollback is attempted on one of these nfApplications, the resulting nfApplication is broken. Pursuing publisher enhancements, to make these nfApplications rollback complaint is the best solution. Recognizing the potential for long publisher enhancement lead times, a method to prevent execution of rollback on selective nfApplications is needed. Selectively skipping rollback requires thorough testing with the network function owners as it resulting in transiet condition where multiple version permutation exist.
142-
143-
#### Problem Statement
144-
At the network function level, when nfRollbackEnabled is true, and a failure occurs during an upgrade or install, a rollback is executed across all nfApps which proceed the failure. This may include those which are rollback noncompliant. A selective rollback parameter is not supported. It introduces risk of an operational state that doesn't correspond to a defined NFDV. This state mismatch results in nondeterministic behavior, increases the testing surface significantly, and undermines the reliability guarantees of deployment processes. Instead we rely on NFDVs to ensure deterministic workload states that map to well-defined and tested deployment configurations.
139+
A network function’s composition may include nfApps that don't support a Helm rollback. Known examples are Elastic and VoltDb. An attempt to roll back one of these nfApps breaks the nfApp. Pursuing publisher enhancements to make these nfApps rollback compliant is the best solution. A parameter to skip rollback isn't supported because it introduces the risk of a deployed state that isn't defined in an NFDV. Such nondeterministic behavior increases the testing surface area significantly and undermines the reliability guarantees of deployments. Instead, the incremental NFDV method enables selective rollback execution while ensuring deterministic deployment states.
145140

146-
#### Proposed Solution
147-
AOSM proposes that publishers should use a combination of skipUpgrade and nfRollbackEnabled configurations in CGVs, along with multiple NFDVs, to logically segment nfApplications based on rollback compatibility. This multi-NFDV strategy allows customers to bypass rollback for select charts while preserving safety for the rest. This approach is production-safe and aligns with existing AOSM mechanisms. This staged approach effectively simulates per-chart rollback behavior using NFDV-level constructs. Consider the following example where a network function is composed of 20 nfApps with five nfApps that don't support rollback.
141+
#### Incremental NFDV approach
142+
Publishers should use a combination of `applicationEnablement`, `skipUpgrade`, and `nfRollbackEnabled` configurations in CGVs, along with multiple NFDVs, to logically segment nfApps into sets based on rollback compatibility. This incremental NFDV strategy allows operators to break a deployment down into multiple operations, bypassing rollback for select charts while preserving rollback for the rest. This approach effectively simulates per-chart rollback behavior using NFDV-level constructs. Consider the following example, where a network function is composed of 20 nfApps, five of which don't support rollback.
148143

149144
* NFDV1:
150-
* Performs initial install of all 20 charts with version v1.0.
151-
* In CGV1: rollbackEnabled: irrelevant (fresh install).
145+
* Performs the initial install of version 1.
146+
* Contains all 20 nfApps in an enabled state.
147+
* In CGV1: `nfRollbackEnabled: true`.
148+
* On a first install, a failure deletes the charts instead of performing a rollback.
152149
* NFDV2:
153-
* Contains all 20 charts but the five Helm charts without rollback support, upgraded to v2.0.
150+
* Performs the first step of the upgrade to version 2.
151+
* Contains all 20 nfApps but enables only the five nfApps without rollback support.
154152
* In CGV2:
155-
* Use skipUpgrade: true for the remaining 15 charts.
156-
* Set nfRollbackEnabled: false.
157-
* Result:
158-
* Success: Only five charts upgrade
159-
* Failure:
160-
* No rollback if upgrade fails.
161-
* Due to chart limitations, the workload is left in a nondeterministic state. No rollback is possible. To recover, there are two options:
162-
* Upgrade with a working NFDV2
163-
* Upgrade with NFDV1 and skipUpgrade disabled for every nfApplication
153+
* Use `skipUpgrade: true` for the 15 nfApps with rollback support.
154+
* Set `nfRollbackEnabled: false`.
155+
* On success, only five nfApps are upgraded.
156+
* On failure, no rollback is performed.
157+
* Due to chart limitations, the workload is left in a nondeterministic state. No rollback is possible. To recover, there are two options:
158+
* Fix NFDV2 and try the upgrade again.
159+
* Downgrade to NFDV1 with `skipUpgrade: false` for every nfApp.
164160
* NFDV3:
165-
* Contains all charts but the 15 rollback-compatible charts upgraded to v2.0.
161+
* Performs the second step of the upgrade to version 2.
162+
* Contains all 20 nfApps but enables only the 15 nfApps with rollback support.
166163
* In CGV3:
167-
* Use skipUpgrade: true for the 5 charts already handled in NFDV2.
168-
* Set nfRollbackEnabled: true.
169-
* Result: Remaining 15 charts upgrade; rollback occurs on failure.
164+
* Use `skipUpgrade: true` for the five nfApps previously upgraded via NFDV2.
165+
* Set `nfRollbackEnabled: true`.
166+
* On success, the remaining 15 nfApps are upgraded.
167+
* On failure, a rollback occurs to restore the starting state.
170168

171169
> [!NOTE]
172170
> * The five rollback-incompatible charts must not have runtime upgrade dependencies on charts in NFDV3.
173171
> * AOSM's rollback design assumes that rollback restores the workload state to the previous NFDV state.
174172
175173
This approach provides cleaner separation and manageability of applications that don't support standard Helm operations. It maintains the operation’s idempotency, and the state on the cluster reflects the last operation. NFDV2 and NFDV3 can also be used directly for install operations (installation of the previous version isn't needed) without any difference in goal state. Overall upgrade time and deployment reliability remain the same.
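
The 20-nfApp example above can be sketched as a small planning helper. This is only an illustration of how the two upgrade steps partition the nfApps; the `UpgradeStep` class, the `plan_incremental_upgrade` function, and the nfApp names are hypothetical and don't correspond to any Azure Operator Service Manager API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UpgradeStep:
    """One upgrade operation: which nfApps upgrade and which are skipped."""
    nfdv: str
    upgraded: frozenset
    skip_upgrade: frozenset
    nf_rollback_enabled: bool


def plan_incremental_upgrade(all_apps, no_rollback_apps):
    """Split an upgrade into two steps based on rollback compatibility."""
    all_set = frozenset(all_apps)
    no_rollback = frozenset(no_rollback_apps)
    unknown = no_rollback - all_set
    if unknown:
        raise ValueError(f"unknown nfApps: {sorted(unknown)}")
    with_rollback = all_set - no_rollback
    return [
        # Step 1: upgrade only the rollback-incompatible nfApps, rollback off.
        UpgradeStep("NFDV2", no_rollback, with_rollback, False),
        # Step 2: upgrade the remaining nfApps, rollback on.
        UpgradeStep("NFDV3", with_rollback, no_rollback, True),
    ]


# Hypothetical names: 20 nfApps, the first five without rollback support.
apps = [f"nfApp{i:02d}" for i in range(1, 21)]
steps = plan_incremental_upgrade(apps, apps[:5])

for step in steps:
    print(
        f"{step.nfdv}: upgrade {len(step.upgraded)}, "
        f"skipUpgrade {len(step.skip_upgrade)}, "
        f"nfRollbackEnabled={step.nf_rollback_enabled}"
    )
```

Each step upgrades one set and skips the other, so every nfApp is upgraded exactly once across NFDV2 and NFDV3.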
176174

177-
## How to troubleshoot rollback on failure
175+
## Troubleshoot rollback on failure
178176
### Understand pod states
179177
Understanding the different pod states is crucial for effective troubleshooting. The following are the most common pod states:
180178
* Pending: Pod scheduling is in progress by Kubernetes.
