diff --git a/alerts/openshift-virtualization-operator/VirtControllerDown.md b/alerts/openshift-virtualization-operator/VirtControllerDown.md index ad85a827..c0754e5d 100644 --- a/alerts/openshift-virtualization-operator/VirtControllerDown.md +++ b/alerts/openshift-virtualization-operator/VirtControllerDown.md @@ -1,45 +1,95 @@ # VirtControllerDown ## Meaning + No running `virt-controller` pod has been detected for 10 minutes. +The alert expression evaluates +`cluster:kubevirt_virt_controller_pods_running:count == 0` with a +`for` duration of 10 minutes. The recording rule counts pods in +`Running` phase matching `virt-controller-.*`. + +In newer versions of KubeVirt, the alert expression is reworked to +surface additional diagnostic labels (`pod`, `reason`) when a +container waiting reason is available. If your alert includes these +labels, see step 1 of the diagnosis below. + ## Impact -Any actions related to virtual machine (VM) lifecycle management fail. This -notably includes launching a new virtual machine instance (VMI) or shutting down -an existing VMI. + +Any actions related to virtual machine (VM) lifecycle management +fail. This notably includes launching a new virtual machine instance +(VMI) or shutting down an existing VMI. ## Diagnosis -1. Set the `NAMESPACE` environment variable: +1. **Check the alert labels**: + + If the alert includes a `reason` label (for example, + `CrashLoopBackOff`, `ErrImagePull`, `ImagePullBackOff`), it + directly identifies why `virt-controller` is down. The `pod` + label identifies the affected pod. Skip to + [Mitigation](#mitigation) for the matching root cause. If these + labels are not present, continue with the steps below. + +2. Set the `NAMESPACE` environment variable: ```bash - $ export NAMESPACE="$(oc get kubevirt -A -o custom-columns="":.metadata.namespace)" + $ export NAMESPACE="$(oc get kubevirt -A \ + -o custom-columns="":.metadata.namespace)" ``` -2. Check the status of the `virt-controller` deployment: +3. Check the status of the `virt-controller` deployment: ```bash - $ oc get deployment -n $NAMESPACE virt-controller -o yaml + $ oc -n $NAMESPACE get deploy virt-controller -o yaml ``` -3. Review the logs of the `virt-controller` pod: +4. Check the `virt-controller` deployment details for issues such + as crashing pods or image pull failures: ```bash - $ oc get logs + $ oc -n $NAMESPACE describe deploy virt-controller ``` -## Mitigation +5. Check the status of the `virt-controller` pods: + + ```bash + $ oc -n $NAMESPACE get pods \ + -l kubevirt.io=virt-controller + ``` -This alert can have a variety of causes, including the following: +6. Review the logs of the `virt-controller` pods: + + ```bash + $ oc -n $NAMESPACE logs -l kubevirt.io=virt-controller \ + --previous + $ oc -n $NAMESPACE logs -l kubevirt.io=virt-controller + ``` + +7. Check for issues such as nodes in a `NotReady` state: + + ```bash + $ oc get nodes + ``` + +## Mitigation -- Node resource exhaustion -- Not enough memory on the cluster -- Nodes are down -- The API server is overloaded. For example, the scheduler might be under a -heavy load and therefore not completely available. -- Networking issues +Try to identify the root cause and resolve the issue. Common +causes include: -Identify the root cause and fix it, if possible. +- **CrashLoopBackOff**: The `virt-controller` container is + crashing repeatedly. Check the pod logs for the root cause + (panic, OOM, misconfiguration). +- **ErrImagePull / ImagePullBackOff**: The container image cannot + be pulled. Verify the image reference, registry availability, + and pull secrets. +- **Pods absent**: No `virt-controller` pods exist. Check whether + the deployment has been scaled to zero, deleted, or blocked by + resource constraints. +- **Node resource exhaustion**: Not enough memory or CPU on the + cluster to schedule the pods. +- **Node issues**: Nodes may be in `NotReady` state or under + resource pressure. If you cannot resolve the issue, log in to the [Customer Portal](https://access.redhat.com) and open a support case,