Platform Troubleshooting
How to diagnose and fix issues with kMetal platform components.
Helm Release Not Healthy
helm status kmetal -n kmetal-flux
helm history kmetal -n kmetal-flux
# Look at what the chart actually rendered
helm get manifest kmetal -n kmetal-flux | less
If a previous upgrade left bad state, roll back:
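The rollback command itself is not shown above. A minimal sketch, assuming the release name `kmetal` and namespace `kmetal-flux` used in the preceding commands; `rollback_kmetal` is a hypothetical helper name, and the revision number comes from the `helm history` output:

```shell
# Hypothetical helper: roll the kmetal release back to a known-good revision.
# The revision number is taken from the REVISION column of `helm history`.
rollback_kmetal() {
  revision="$1"
  if [ -z "$revision" ]; then
    echo "usage: rollback_kmetal <revision>" >&2
    return 1
  fi
  # --wait blocks until the rolled-back resources report ready (or Helm times out).
  helm rollback kmetal "$revision" -n kmetal-flux --wait
}
```

For example, `rollback_kmetal 3` would return the release to revision 3 (the number here is only illustrative). Run `helm status kmetal -n kmetal-flux` afterwards to confirm the release state.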
OCI Registry Pull Failures
# Test login
helm registry login ghcr.io -u <username> -p <token>
# Re-create the pull secret if pods report ImagePullBackOff
kubectl delete secret clastix-ghcr -n kmetal-flux
kubectl create secret docker-registry clastix-ghcr \
  --docker-server=ghcr.io \
  --docker-username=<username> \
  --docker-password=<token> \
  -n kmetal-flux
helm upgrade kmetal oci://ghcr.io/clastix/oci/kmetal \
  --namespace kmetal-flux \
  --values kmetal-values.yaml --wait
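To confirm that the re-created secret really targets `ghcr.io`, the stored Docker config can be decoded. This is a sketch that assumes the secret name and namespace from the commands above; `pull_secret_config` is a hypothetical helper name. Note that the decoded output contains the registry token in clear text, so avoid pasting it into tickets or logs.

```shell
# Hypothetical helper: decode the Docker config stored in the pull secret.
# WARNING: the output includes the registry credentials in clear text.
pull_secret_config() {
  kubectl get secret clastix-ghcr -n kmetal-flux \
    -o jsonpath='{.data.\.dockerconfigjson}' | base64 -d
}
```

The decoded JSON should contain an `auths` entry keyed by `ghcr.io`; if the key is missing or names another host, the secret was created with the wrong `--docker-server` value.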
Kamaji Not Working
Tenant control planes fail to reconcile or remain not ready.
kubectl get pods -n kmetal-kamaji
kubectl logs -n kmetal-kamaji -l app.kubernetes.io/name=kamaji --tail=200
# Per-tenant datastore (each TenantControlPlane references its own DataStore CR)
kubectl get datastores
kubectl get tenantcontrolplane -A
# Look at the TCP's pod
kubectl get pods -n kmetal-kamaji -l kamaji.clastix.io/name=<tcp-name>
kubectl logs -n kmetal-kamaji <tcp-pod> -c kube-apiserver --tail=200
Restart the controller if needed:
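The restart command is not shown above. One possible form, assuming the controller Deployment carries the same `app.kubernetes.io/name=kamaji` label as its pods (an assumption, not confirmed by this page); `restart_kamaji` is a hypothetical helper name:

```shell
# Hypothetical helper: restart only the Kamaji controller Deployment.
# Assumption: the Deployment is labelled app.kubernetes.io/name=kamaji,
# like the pods queried in the log command above.
restart_kamaji() {
  kubectl rollout restart deployment -n kmetal-kamaji -l app.kubernetes.io/name=kamaji
}
```

Selecting by label rather than restarting every Deployment in the namespace matters here, because the tenant control plane pods also run in `kmetal-kamaji` and should not be bounced as a side effect. If the label does not match, list the Deployments with `kubectl get deployments -n kmetal-kamaji` and restart the controller by name.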
Cert-Manager Issues
kubectl get pods -n kmetal-cert-manager
kubectl logs -n kmetal-cert-manager -l app.kubernetes.io/name=cert-manager --tail=200
kubectl get certificates -A
kubectl describe certificate <cert> -n <namespace>
Restart cert-manager if its controllers are stuck:
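The restart command is not shown above. A sketch, assuming the `kmetal-cert-manager` namespace holds only cert-manager's own Deployments, so restarting all of them is safe; `restart_cert_manager` is a hypothetical helper name:

```shell
# Hypothetical helper: restart every Deployment in the cert-manager namespace,
# then list the pods so the new replicas can be checked.
# Assumption: the namespace contains only cert-manager workloads.
restart_cert_manager() {
  kubectl rollout restart deployment -n kmetal-cert-manager &&
  kubectl get pods -n kmetal-cert-manager
}
```

After the pods return to Running, re-check the affected resources with `kubectl get certificates -A`.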
CAPI Provider Errors
The Cluster API providers all run in the kmetal-capi-providers namespace: Core, Bootstrap (kubeadm), Infrastructure (KubeVirt, CAPK), and ControlPlane (Kamaji, CACPK).
kubectl get pods -n kmetal-capi-providers
kubectl logs -n kmetal-capi-providers -l control-plane=controller-manager --tail=200
kubectl get coreproviders,bootstrapproviders,infrastructureproviders,controlplaneproviders -n kmetal-capi-providers
Restart a provider controller:
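The restart command is not shown above. Because this page does not list the provider Deployment names, the sketch below takes the name as an argument and lists the candidates when none is given; `restart_capi_provider` is a hypothetical helper name:

```shell
# Hypothetical helper: restart one provider controller by Deployment name.
# With no argument it lists the Deployments in the namespace so a name can be picked.
restart_capi_provider() {
  deployment="$1"
  if [ -z "$deployment" ]; then
    kubectl get deployments -n kmetal-capi-providers
    return 1
  fi
  kubectl rollout restart "deployment/$deployment" -n kmetal-capi-providers
}
```

Restarting a single provider keeps the other controllers running, which makes it easier to tell from the logs whether the restart fixed the failing one.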
KubeVirt / CDI Not Ready
kubectl get kubevirt -n system-kubevirt
kubectl describe kubevirt -n system-kubevirt # .status.phase should be Deployed
kubectl get pods -n system-kubevirt
kubectl get pods -n system-cdi
If pods are crashing, check the operator logs:
kubectl logs -n system-kubevirt -l kubevirt.io=virt-operator --tail=200
kubectl logs -n system-cdi -l cdi.kubevirt.io=cdi-operator --tail=200
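The `.status.phase` check noted above can also be scripted instead of read from `kubectl describe` output. A sketch, assuming a single KubeVirt custom resource exists in `system-kubevirt`; `kubevirt_phase` and `kubevirt_ready` are hypothetical helper names:

```shell
# Hypothetical helpers: read the KubeVirt CR phase and test it against "Deployed".
# Assumption: exactly one KubeVirt CR exists in the system-kubevirt namespace.
kubevirt_phase() {
  kubectl get kubevirt -n system-kubevirt -o jsonpath='{.items[*].status.phase}'
}

kubevirt_ready() {
  [ "$(kubevirt_phase)" = "Deployed" ]
}
```

A call such as `kubevirt_ready || echo "KubeVirt phase: $(kubevirt_phase)"` prints the current phase only when the deployment has not finished.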
Collect Diagnostics
helm status kmetal -n kmetal-flux > helm-status.txt
helm get manifest kmetal -n kmetal-flux > rendered.yaml
kubectl get pods -A > all-pods.txt
kubectl get events -A --sort-by='.lastTimestamp' > events.txt
kubectl logs -n kmetal-kamaji -l app.kubernetes.io/name=kamaji --tail=500 > kamaji.log
kubectl logs -n kmetal-capi-providers -l control-plane=controller-manager --tail=500 > capi.log
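When the output needs to be attached to a support request, the same commands can write into one directory that is then archived. This is a sketch only; `collect_diagnostics` and the default directory name `kmetal-diagnostics` are hypothetical, while the commands and file names are the ones listed above:

```shell
# Hypothetical helper: run the diagnostic commands above into one directory
# and pack it into <dir>.tar.gz. Individual command failures do not stop the run.
collect_diagnostics() {
  out_dir="${1:-kmetal-diagnostics}"
  mkdir -p "$out_dir" || return 1
  helm status kmetal -n kmetal-flux > "$out_dir/helm-status.txt"
  helm get manifest kmetal -n kmetal-flux > "$out_dir/rendered.yaml"
  kubectl get pods -A > "$out_dir/all-pods.txt"
  kubectl get events -A --sort-by='.lastTimestamp' > "$out_dir/events.txt"
  kubectl logs -n kmetal-kamaji -l app.kubernetes.io/name=kamaji --tail=500 > "$out_dir/kamaji.log"
  kubectl logs -n kmetal-capi-providers -l control-plane=controller-manager --tail=500 > "$out_dir/capi.log"
  # Archive relative to the parent directory so the tarball has no absolute paths.
  tar -czf "$out_dir.tar.gz" -C "$(dirname "$out_dir")" "$(basename "$out_dir")"
}
```

Review `rendered.yaml` and the logs before sharing the archive, since rendered manifests and controller logs can include environment-specific values.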