Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
116 changes: 116 additions & 0 deletions docs/design/hostedcluster-reconcile-segregation-analysis.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,116 @@
# HostedCluster Controller Reconciliation Loop — Segregation Analysis

## Problem Statement

The `reconcile()` function in `hypershift-operator/controllers/hostedcluster/hostedcluster_controller.go`
executes ~50 distinct operations in a strictly sequential chain where **every error
causes an early return**, short-circuiting the entire remaining loop. This means an unrelated failure
(e.g., missing SSH key secret) prevents critical operations like deploying the control plane operator
or reconciling the HostedControlPlane object.

## Implemented Solution

### Error Categorization

Operations are classified into two categories using a `reconcileReport` struct
(`reconcile_report.go`):

- **critical**: Failures block all downstream Phase 8 component operations.
Examples: PullSecretSync, PlatformCredentials, SecretEncryptionSync, CoreHCPChain.
- **nonCritical**: Failures are collected but never block other work.
Examples: SSHKeySync, AuditWebhookSync, MonitoringAndCLISecrets.

### Phase Structure

```
Phase 0-4: Initialization, status conditions, deletion handling — unchanged
Phase 5: Prerequisites — finalizers, pause, validation gates — unchanged (fail-fast)

Phase 6a: Critical sync (error-collecting within phase)
- PlatformCredentials
- PullSecretSync
- SecretEncryptionSync

Phase 6b: Non-critical sync (error-collecting, never blocked)
- RestoredFromBackup (TODO: move to Phase 4)
- AuditWebhookSync
- SSHKeySync
- AdditionalTrustBundle
- ServiceAccountSigningKey
- UnmanagedEtcdMTLS
- ETCDMemberRecovery
- GlobalConfigSync

Phase 7: Core HCP chain — sequential (HCP → InfraCR → CAPI Cluster)
Always runs regardless of Phase 6a failures.

Phase 8: Component groups — blocked if any critical operation failed.
Uses executeOrBlock/executeOrBlockMulti to automatically skip
when hasCriticalFailure() is true:
- KubeconfigAndPasswordSync
- OperatorDeployments (CPO, CAPI Manager, CAPI Provider, Karpenter)
- RBACAndPolicies (Prometheus RBAC, PKI RBAC, Network Policies)
- PlatformOIDCAndCSI (KubeVirt CSI / AWS OIDC / Azure SecretProviderClass)
- MonitoringAndCLISecrets (CLI secrets, dashboard, SRE metrics, trusted CAs)
```
Comment thread
muraee marked this conversation as resolved.

### Blocking Rules

- **Phase 6a critical failure** → Phase 8 blocked (Phase 7 still runs)
- **Phase 7 failure / nil HCP** → Phase 8 blocked
- **Phase 6b non-critical failure** → never blocks anything

### Error Aggregation

When critical failures exist, `aggregate()` returns only critical errors plus a summary of
blocked operations — non-critical errors are suppressed since the user should fix the
critical issue first:

```
critical error: failed to get pull secret...; blocked operations: [KubeconfigAndPasswordSync, OperatorDeployments, RBACAndPolicies, PlatformOIDCAndCSI, MonitoringAndCLISecrets]
```

When no critical failures exist, all errors are returned as-is via `utilerrors.NewAggregate`.

### Key Types

```go
// reconcile_report.go

type operationCategory int
const (
critical operationCategory = iota
nonCritical
)

type reconcileReport struct {
results []operationResult
requeueAfter *time.Duration
}
```

Key methods:
- `recordOp(name, category, err)` — record a single operation result
- `recordBlocked(name, category)` — record a skipped operation
- `recordErrors(name, category, []error)` — record a group operation
- `executeOrBlock(name, category, fn)` — run fn or record blocked if critical failure exists
- `executeOrBlockMulti(name, category, fn)` — same for multi-error functions
- `hasCriticalFailure()` — any critical op failed?
- `aggregate()` — final error for reconcile return value
- `conditionMessage()` — structured summary for logging

## Impact Summary

**Before**: An error in any of ~50 operations blocks all subsequent operations. Examples:
- SSH key secret missing → CPO never deployed → control plane never starts
- Monitoring dashboard error → karpenter operator never reconciled
- AWS resource tags error → HCP never updated

**After**: Only genuinely dependent operations block each other. Examples:
- SSH key secret missing → SSH key not synced (reported), but HCP object still created
(Phase 7), CPO still deployed (Phase 8), CAPI still works
- Pull secret sync failure (critical) → Phase 8 components blocked with clear reporting
of what failed and what was skipped, but Phase 7 still runs
- Platform credential error (critical) → same blocking behavior, user sees exactly which
operations were skipped
- Audit webhook failure (non-critical) → reported, but everything else proceeds normally
51 changes: 51 additions & 0 deletions docs/design/hostedcluster-reconcile-segregation-plan.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,51 @@
# HostedCluster Reconciliation Segregation — Implementation Plan

Companion to: [hostedcluster-reconcile-segregation-analysis.md](./hostedcluster-reconcile-segregation-analysis.md)

## Status: Implemented

The reconcile loop has been restructured with categorized error handling. See the analysis
doc for the full design.

## Files Changed

| File | Action |
|------|--------|
| `hypershift-operator/controllers/hostedcluster/reconcile_report.go` | **Created** — report types and methods |
| `hypershift-operator/controllers/hostedcluster/reconcile_report_test.go` | **Created** — unit tests for report |
| `hypershift-operator/controllers/hostedcluster/hostedcluster_controller.go` | **Modified** — replaced sequential error chain with report-based flow |
| `hypershift-operator/controllers/hostedcluster/hostedcluster_controller_test.go` | **Modified** — added non-blocking behavior tests |

## Operation Classification

| Operation | Category | Rationale |
|-----------|----------|-----------|
| PlatformCredentials | critical | Platform infra operations fail without credentials |
| PullSecretSync | critical | Image pulls fail, pods crash-loop |
| SecretEncryptionSync | critical | Etcd data encryption fails |
| CoreHCPChain | critical | Produces HCP object needed by Phase 8 |
| RestoredFromBackup | nonCritical | Status-only (TODO: move to Phase 4) |
| AuditWebhookSync | nonCritical | Observability, not functional |
| SSHKeySync | nonCritical | Debug access, not operational |
| AdditionalTrustBundle | nonCritical | Custom CAs, not blocking |
| ServiceAccountSigningKey | nonCritical | Conditional, specific feature |
| UnmanagedEtcdMTLS | nonCritical | Only for unmanaged etcd |
| ETCDMemberRecovery | nonCritical | Recovery mechanism, retries naturally |
| GlobalConfigSync | nonCritical | Config propagation, eventual consistency |
| KubeconfigAndPasswordSync | nonCritical | Consumer convenience |
| OperatorDeployments | nonCritical | CPO/CAPI/Karpenter deployments |
| RBACAndPolicies | nonCritical | RBAC/network policies |
| PlatformOIDCAndCSI | nonCritical | Platform-specific (OIDC, CSI, SecretProviderClass) |
| MonitoringAndCLISecrets | nonCritical | Monitoring, CLI secrets, SRE config, trusted CAs |

## Design Principles

1. **Categorize, don't short-circuit** — Operations are classified as critical or non-critical.
Critical failures block downstream work; non-critical failures are collected and reported.
2. **Minimal structural change** — Extracted blocks into private methods on
`HostedClusterReconciler`. Individual operation semantics are preserved.
3. **Preserve legitimate gates** — Deletion handling, pause, and config validation gates still
halt the entire loop. These are intentional.
4. **Structured error reporting** — When critical failures exist, only critical errors are
surfaced with a list of blocked operations. Non-critical noise is suppressed.
5. **Preserve requeueAfter** — If any operation sets a requeueAfter, it is preserved.
Loading