This article provides Azure Operator Service Manager guidelines to optimize the design of configuration group schemas (CGSs) and the operation of configuration group values (CGVs). Network function (NF) vendors, telco operators, and their partners should keep these practices in mind when onboarding and deploying NFs.
14
14
15
-
## Configuring group resource approach
15
+
## Configuration group approach
16
16
17
17
Consider the following meta-schema guidelines when you're designing configuration resources:
18
18
@@ -91,6 +91,29 @@ This example shows the resulting CGV resource that Azure Operator Service Manage
91
91
}
92
92
```
93
93
94
+
## CGS with secrets
95
+
Other than separating secrets into a dedicated CGS, no special CGS requirements exist for secret support.
96
+
97
+
## CGV with secrets
98
+
Consider the following configuration requirements to properly obscure secret values:
99
+
* Use `configurationType: 'Secret'` in the resource properties.
100
+
* After a CGV is deployed, this setting prevents the display of the resource in most Azure methods.
101
+
* Use a reference to Azure Key Vault (AKV) in place of the plain-text secret.
102
+
* This reference obscures the display of the secret in the CGV deployment template.
103
+
104
+
The following example shows how to include an AKV reference in an ARM template:
To further secure resources, restrict access to the following RBAC action: `Microsoft.Resources/deployments/exportTemplate/action`
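As a rough illustration of the AKV reference described in this section, the following Python sketch assembles an ARM deployment parameter entry that points at a secret in Azure Key Vault instead of embedding the plain-text value. The `reference` / `keyVault` / `secretName` shape follows the ARM parameter-file convention. The helper name `key_vault_reference`, the parameter name `cgvSecret`, and the placeholder identifiers are hypothetical, and how the CGV resource consumes the parameter is not shown here.

```python
import json


def key_vault_reference(vault_resource_id: str, secret_name: str) -> dict:
    """Return an ARM parameter value that references a Key Vault secret."""
    return {
        "reference": {
            "keyVault": {"id": vault_resource_id},
            "secretName": secret_name,
        }
    }


# Placeholder values (assumptions); substitute real identifiers before use.
VAULT_ID = (
    "/subscriptions/<subscription-id>/resourceGroups/<resource-group>"
    "/providers/Microsoft.KeyVault/vaults/<vault-name>"
)

# "cgvSecret" is a hypothetical parameter name chosen for this sketch.
parameters_file = {
    "$schema": "https://schema.management.azure.com/schemas/2019-04-01/deploymentParameters.json#",
    "contentVersion": "1.0.0.0",
    "parameters": {
        "cgvSecret": key_vault_reference(VAULT_ID, "<secret-name>"),
    },
}

if __name__ == "__main__":
    # Print the parameter file so it can be saved and passed to a deployment.
    print(json.dumps(parameters_file, indent=2))
```

Because the parameter carries only a pointer to the vault and the secret name, the secret value itself never appears in the generated file.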
116
+
94
117
## Overview of JSON Schema
95
118
96
119
JSON Schema is an Internet Engineering Task Force (IETF) standard that provides a format for what JSON data is required for an application and how to interact with it. Applying such standards for a JSON document helps you enforce consistency and data validity across JSON data.
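To make the idea of enforcing consistency concrete, here is a minimal Python sketch that checks a JSON object against a very small subset of JSON Schema (`type`, `required`, and per-property `type`). It is a hand-rolled illustration, not a conforming JSON Schema validator, and the schema and property names (`name`, `replicas`) are hypothetical rather than taken from a real CGS.

```python
from typing import Any, List

# Hypothetical schema for illustration only; not a real CGS.
SCHEMA = {
    "type": "object",
    "required": ["name", "replicas"],
    "properties": {
        "name": {"type": "string"},
        "replicas": {"type": "integer"},
    },
}

# Mapping from JSON Schema type names to Python types (subset).
_TYPES = {"string": str, "integer": int, "boolean": bool, "object": dict, "array": list}


def validate(instance: Any, schema: dict) -> List[str]:
    """Return a list of error messages; an empty list means the instance passed."""
    if not isinstance(instance, dict):
        return ["instance is not an object"]
    errors = []
    for key in schema.get("required", []):
        if key not in instance:
            errors.append(f"missing required property: {key}")
    for key, sub in schema.get("properties", {}).items():
        if key not in instance:
            continue
        expected = _TYPES.get(sub.get("type"))
        value = instance[key]
        # bool is a subclass of int in Python, but JSON Schema treats them as distinct.
        bad_bool = expected is int and isinstance(value, bool)
        if expected and (not isinstance(value, expected) or bad_bool):
            errors.append(f"{key}: expected {sub['type']}")
    return errors
```

For example, `validate({"name": "nf-a"}, SCHEMA)` reports the missing `replicas` property, while a complete and correctly typed object yields no errors.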
articles/operator-service-manager/safe-upgrades-nf-level-rollback.md
25 additions & 27 deletions
@@ -3,7 +3,7 @@ title: Control upgrade failure behavior with Azure Operator Service Manager
3
3
description: Learn about recovery behaviors including pause on failure and rollback on failure.
4
4
author: msftadam
5
5
ms.author: adamdor
6
-
ms.date: 03/06/2026
6
+
ms.date: 03/10/2026
7
7
ms.topic: upgrade-and-migration-article
8
8
ms.service: azure-operator-service-manager
9
9
---
@@ -130,51 +130,49 @@ example:
130
130
> * If multiple entries of `nfConfiguration` are found in the `roleOverrideValues`, then the NF reput is returned as a bad request.
131
131
132
132
## Manage nfApps that don't support rollback
133
-
Almost all publishers report some nfApps that aren't compatible with helm rollback operations. These nfApps maybe sourced from third-parties who don't common support such strict resiliency requirements. These nfApps maybe related to database applications with complicated schema management requirements. In these cases, special consideration should be taken to deal with nfApps that don't support rollback.
134
-
135
-
* The strong preference is to push publishers to support helm rollback for all nfApps.
133
+
Almost all publishers have some nfApps that aren't compatible with helm rollback. These nfApps may be sourced from third parties who don't commonly support strict resiliency requirements, or they may be database applications with complicated schema management requirements. Consider the following restrictions when onboarding services with nfApps that don't support rollback:
136
134
* nfApps that don't support rollback can't be skipped.
137
135
* nfApp rollback order can't change.
138
136
* Incremental-NFDV approach must be used in these situations.
139
137
140
138
### Selective rollback using incremental NFDVs
141
-
A network function’s composition often includes one, or more, nfApplications that can't support a helm rollback operation, such as Elastic or VoltDb. If a rollback is attempted on one of these nfApplications, the resulting nfApplication is broken. Pursuing publisher enhancements, to make these nfApplications rollback complaint is the best solution. Recognizing the potential for long publisher enhancement lead times, a method to prevent execution of rollback on selective nfApplications is needed. Selectively skipping rollback requires thorough testing with the network function owners as it resulting in transiet condition where multiple version permutation exist.
142
-
143
-
#### Problem Statement
144
-
At the network function level, when nfRollbackEnabled is true, and a failure occurs during an upgrade or install, a rollback is executed across all nfApps which proceed the failure. This may include those which are rollback noncompliant. A selective rollback parameter is not supported. It introduces risk of an operational state that doesn't correspond to a defined NFDV. This state mismatch results in nondeterministic behavior, increases the testing surface significantly, and undermines the reliability guarantees of deployment processes. Instead we rely on NFDVs to ensure deterministic workload states that map to well-defined and tested deployment configurations.
139
+
A network function's composition may include nfApps that don't support a helm rollback. Known examples are Elastic and VoltDB. An attempt to roll back one of these nfApps breaks the nfApp. Pursuing publisher enhancements to make these nfApps rollback compliant is the best solution. A parameter to skip rollback isn't supported, because it introduces the risk of a deployed state that isn't defined in an NFDV. Such nondeterministic behavior increases the testing surface area significantly and undermines the reliability guarantees of deployments. Instead, the incremental NFDV method enables selective rollback execution while ensuring deterministic deployment states.
145
140
146
-
#### Proposed Solution
147
-
AOSM proposes that publishers should use a combination of skipUpgrade and nfRollbackEnabled configurations in CGVs, along with multiple NFDVs, to logically segment nfApplications based on rollback compatibility. This multi-NFDV strategy allows customers to bypass rollback for select charts while preserving safety for the rest. This approach is production-safe and aligns with existing AOSM mechanisms. This staged approach effectively simulates per-chart rollback behavior using NFDV-level constructs. Consider the following example where a network function is composed of 20 nfApps with five nfApps that don't support rollback.
141
+
#### Incremental NFDV approach
142
+
It's recommended that publishers use a combination of `applicationEnablement`, `skipUpgrade`, and `nfRollbackEnabled` configurations in CGVs, along with multiple NFDVs, to logically segment nfApps into sets based on rollback compatibility. This incremental NFDV strategy allows operators to break deployments down into multiple operations, bypassing rollback for select charts while preserving rollback for the rest. This approach effectively simulates per-chart rollback behavior using NFDV-level constructs. Consider the following example where a network function is composed of 20 nfApps, five of which don't support rollback.
148
143
149
144
* NFDV1
150
-
* Performs initial install of all 20 charts with version v1.0.
151
-
* In CGV1: rollbackEnabled: irrelevant (fresh install).
145
+
* Performs the initial version 1 install.
146
+
* Contains all 20 nfApps in an enabled state.
147
+
* In CGV1: `nfRollbackEnabled: true`.
148
+
* On the first install, a failure deletes charts and does not use rollback.
152
149
* NFDV2:
153
-
* Contains all 20 charts but the five Helm charts without rollback support, upgraded to v2.0.
150
+
* Performs first step upgrade to version 2.
151
+
* Contains all 20 nfApps but enables only the five nfApps without rollback support.
154
152
* In CGV2:
155
-
* Use skipUpgrade: true for the remaining 15 charts.
156
-
* Set nfRollbackEnabled: false.
157
-
* Result:
158
-
* Success: Only five charts upgrade
159
-
* Failure:
160
-
* No rollback if upgrade fails.
161
-
* Due to chart limitations, the workload is left in a nondeterministic state. No rollback is possible. To recover, there are two options:
162
-
* Upgrade with a working NFDV2
163
-
* Upgrade with NFDV1 and skipUpgrade disabled for every nfApplication
153
+
* Use `skipUpgrade: true` for the 15 nfApps with rollback support.
154
+
* Set `nfRollbackEnabled: false`.
155
+
* On success, only five nfApps are upgraded.
156
+
* On failure, no rollback is performed.
157
+
* Due to chart limitations, the workload is left in a nondeterministic state. No rollback is possible. To recover, there are two options:
158
+
* Fix NFDV2 and try the upgrade again.
159
+
* Downgrade to NFDV1 with `skipUpgrade: false`.
164
160
* NFDV3:
165
-
* Contains all charts but the 15 rollback-compatible charts upgraded to v2.0.
161
+
* Performs second step upgrade to version 2.
162
+
* Contains all 20 nfApps but enables only the 15 nfApps with rollback support.
166
163
* In CGV3:
167
-
* Use skipUpgrade: true for the 5 charts already handled in NFDV2.
168
-
* Set nfRollbackEnabled: true.
169
-
* Result: Remaining 15 charts upgrade; rollback occurs on failure.
164
+
* Use `skipUpgrade: true` for the five nfApps previously upgraded via NFDV2.
165
+
* Set `nfRollbackEnabled: true`.
166
+
* On success, the remaining 15 nfApps are upgraded.
167
+
* On failure, a rollback occurs to restore the starting state.
170
168
171
169
> [!NOTE]
172
170
> * The five rollback-incompatible charts must not have runtime upgrade dependencies on charts in NFDV3.
173
171
> * AOSM's rollback design assumes that rollback restores the workload state to the previous NFDV state.
174
172
175
173
This approach provides cleaner separation and manageability of applications that don't support standard helm operations. It maintains the operation's idempotency, and the state on the cluster reflects the last operation. NFDV2 and NFDV3 can also be used directly for install operations (installation of the previous version isn't needed) with any difference in goal state. Overall upgrade time and deployment reliability remain the same.
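The two upgrade steps in the example can be sketched in Python as a small planning helper. This is only a model of the settings named in the text (`skipUpgrade` per nfApp and `nfRollbackEnabled` per step). It doesn't call any AOSM API, and the class, function, and nfApp names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass
class UpgradeStep:
    nfdv: str                      # NFDV used for this step
    nf_rollback_enabled: bool      # models nfRollbackEnabled in the CGV
    skip_upgrade: Dict[str, bool]  # nfApp name -> models skipUpgrade in the CGV


def plan_incremental_upgrade(all_apps: List[str], no_rollback: Set[str]) -> List[UpgradeStep]:
    """Split one upgrade into two steps based on rollback compatibility."""
    unknown = no_rollback - set(all_apps)
    if unknown:
        raise ValueError(f"unknown nfApps: {sorted(unknown)}")
    # NFDV2: upgrade only the nfApps that can't roll back, with rollback disabled.
    step_one = UpgradeStep(
        nfdv="NFDV2",
        nf_rollback_enabled=False,
        skip_upgrade={app: app not in no_rollback for app in all_apps},
    )
    # NFDV3: upgrade the remaining nfApps, with rollback enabled.
    step_two = UpgradeStep(
        nfdv="NFDV3",
        nf_rollback_enabled=True,
        skip_upgrade={app: app in no_rollback for app in all_apps},
    )
    return [step_one, step_two]


# Hypothetical nfApp names: 20 nfApps, the first five without rollback support.
APPS = [f"nfapp-{i:02d}" for i in range(1, 21)]
NO_ROLLBACK = set(APPS[:5])
PLAN = plan_incremental_upgrade(APPS, NO_ROLLBACK)

if __name__ == "__main__":
    for step in PLAN:
        upgraded = [app for app, skip in step.skip_upgrade.items() if not skip]
        print(step.nfdv, "rollback:", step.nf_rollback_enabled, "upgrades:", len(upgraded))
```

Each nfApp is upgraded in exactly one of the two steps, so the state after the last step corresponds to a single defined NFDV.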
176
174
177
-
## How to troubleshoot rollback on failure
175
+
## Troubleshoot rollback on failure
178
176
### Understand pod states
179
177
Understanding the different pod states is crucial for effective troubleshooting. The following are the most common pod states:
180
178
* Pending: Pod scheduling is in progress by Kubernetes.