86 changes: 68 additions & 18 deletions alerts/openshift-virtualization-operator/VirtControllerDown.md
@@ -1,45 +1,95 @@
# VirtControllerDown

## Meaning

No running `virt-controller` pod has been detected for 10 minutes.

The alert expression evaluates
`cluster:kubevirt_virt_controller_pods_running:count == 0` with a
`for` duration of 10 minutes. The recording rule counts pods in
`Running` phase matching `virt-controller-.*`.
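As a rough cross-check of what the recording rule measures, the count can be approximated from `oc` output. This is a minimal sketch, not the rule itself: the helper name `count_running_virt_controllers` is hypothetical, and it assumes the `virt-controller-.*` pattern is matched against the pod name.

```bash
# Hypothetical helper: reads "<pod-name> <phase>" pairs from stdin, one per
# line, and prints how many virt-controller pods are in the Running phase.
# Assumption: the recording rule's pattern applies to the pod name.
count_running_virt_controllers() {
  awk '$1 ~ /^virt-controller-/ && $2 == "Running" { n++ } END { print n + 0 }'
}

# Assumed usage against a live cluster:
#   oc get pods -A --no-headers \
#     -o custom-columns=NAME:.metadata.name,PHASE:.status.phase \
#     | count_running_virt_controllers
```

A result of `0` that persists for 10 minutes corresponds to the condition the alert describes.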

In newer versions of KubeVirt, the alert expression has been reworked
to surface additional diagnostic labels (`pod`, `reason`) when a
container waiting reason is available. If your alert includes these
labels, see step 1 of the diagnosis below.

## Impact
Any actions related to virtual machine (VM) lifecycle management fail. This
notably includes launching a new virtual machine instance (VMI) or shutting down
an existing VMI.

Any actions related to virtual machine (VM) lifecycle management
fail. This notably includes launching a new virtual machine instance
(VMI) or shutting down an existing VMI.

## Diagnosis

1. Set the `NAMESPACE` environment variable:
1. **Check the alert labels**:

If the alert includes a `reason` label (for example,
`CrashLoopBackOff`, `ErrImagePull`, `ImagePullBackOff`), it
directly identifies why `virt-controller` is down. The `pod`
label identifies the affected pod. Skip to
[Mitigation](#mitigation) for the matching root cause. If these
labels are not present, continue with the steps below.

2. Set the `NAMESPACE` environment variable:

```bash
$ export NAMESPACE="$(oc get kubevirt -A -o custom-columns="":.metadata.namespace)"
$ export NAMESPACE="$(oc get kubevirt -A \
-o custom-columns="":.metadata.namespace)"
```

2. Check the status of the `virt-controller` deployment:
3. Check the status of the `virt-controller` deployment:

```bash
$ oc get deployment -n $NAMESPACE virt-controller -o yaml
$ oc -n $NAMESPACE get deploy virt-controller -o yaml
```

3. Review the logs of the `virt-controller` pod:
4. Check the `virt-controller` deployment details for issues such
as crashing pods or image pull failures:

```bash
$ oc get logs <virt-controller>
$ oc -n $NAMESPACE describe deploy virt-controller
```

## Mitigation
5. Check the status of the `virt-controller` pods:

```bash
$ oc -n $NAMESPACE get pods \
-l kubevirt.io=virt-controller
```

This alert can have a variety of causes, including the following:
6. Review the logs of the `virt-controller` pods:

```bash
$ oc -n $NAMESPACE logs -l kubevirt.io=virt-controller \
--previous
$ oc -n $NAMESPACE logs -l kubevirt.io=virt-controller
```

7. Check for issues such as nodes in a `NotReady` state:

```bash
$ oc get nodes
```
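If the alert does not carry a `reason` label, the container waiting reason can usually be read from the pod status instead. The following is a sketch under stated assumptions: the helper name `pods_with_waiting_reason` is hypothetical, and the input is assumed to be tab-separated `<pod>` and `<reason>` fields.

```bash
# Hypothetical helper: reads "<pod>\t<waiting-reason>" lines from stdin and
# prints only the pods that report a non-empty waiting reason.
pods_with_waiting_reason() {
  awk -F'\t' 'NF >= 2 && $2 != "" { print $1 ": " $2 }'
}

# Assumed usage against a live cluster (NAMESPACE set as in the steps above):
#   oc -n "$NAMESPACE" get pods -l kubevirt.io=virt-controller \
#     -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].state.waiting.reason}{"\n"}{end}' \
#     | pods_with_waiting_reason
```

Pods that are running normally have no waiting state, so they are filtered out.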

## Mitigation

- Node resource exhaustion
- Not enough memory on the cluster
- Nodes are down
- The API server is overloaded. For example, the scheduler might be under a
heavy load and therefore not completely available.
- Networking issues
Try to identify the root cause and resolve the issue. Common
causes include:

Identify the root cause and fix it, if possible.
- **CrashLoopBackOff**: The `virt-controller` container is
crashing repeatedly. Check the pod logs for the root cause
(panic, OOM, misconfiguration).
- **ErrImagePull / ImagePullBackOff**: The container image cannot
be pulled. Verify the image reference, registry availability,
and pull secrets.
- **Pods absent**: No `virt-controller` pods exist. Check whether
the deployment has been scaled to zero, deleted, or blocked by
resource constraints.
- **Node resource exhaustion**: Not enough memory or CPU on the
cluster to schedule the pods.
- **Node issues**: Nodes may be in `NotReady` state or under
resource pressure.
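The mapping from a `reason` value to a first mitigation step can be sketched as a small lookup. The function name `suggest_mitigation` is hypothetical and the hints only restate the list above; it is an illustration, not part of the runbook tooling.

```bash
# Hypothetical helper: prints a first mitigation hint for the alert's
# `reason` label value passed as $1 (empty when the label is absent).
suggest_mitigation() {
  case "$1" in
    CrashLoopBackOff)
      echo "Check the pod logs for a panic, OOM, or misconfiguration." ;;
    ErrImagePull|ImagePullBackOff)
      echo "Verify the image reference, registry availability, and pull secrets." ;;
    "")
      echo "No reason label: check the deployment, resource constraints, and node status." ;;
    *)
      echo "Unrecognized reason: continue with the diagnosis steps." ;;
  esac
}

# Example:
#   suggest_mitigation ImagePullBackOff
```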

If you cannot resolve the issue, log in to the
[Customer Portal](https://access.redhat.com) and open a support case,