Cluster Troubleshooting

Common issues with tenant cluster operations.

Cluster Creation Fails

The cluster stays stuck in the provisioning phase.

# Check cluster and control plane
kubectl get cluster,tenantcontrolplane <name> -n <namespace>
kubectl describe tenantcontrolplane <name> -n <namespace>

# Check control plane pods
kubectl get pods -n kmetal-kamaji -l kamaji.clastix.io/name=<name>
kubectl logs -n kmetal-kamaji <control-plane-pod> --all-containers

# Check events
kubectl get events -n <namespace> --sort-by='.lastTimestamp'
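
When the event list is long, narrowing it to warnings can make the failure easier to spot. The following is a minimal sketch: the helper name `warning_events` is hypothetical, while `--field-selector type=Warning` is a standard kubectl filter.

```shell
# Hypothetical helper: show only Warning events in a namespace, oldest first.
warning_events() {
  kubectl get events -n "$1" --field-selector type=Warning --sort-by='.lastTimestamp'
}
```

Call it as `warning_events <namespace>`.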

Control Plane Not Ready

The TenantControlPlane reports a not-ready status.

# Check control plane status
kubectl get tenantcontrolplane <name> -n <namespace> -o yaml

# Check datastore
kubectl get pods -n kmetal-kamaji -l app.kubernetes.io/name=etcd

# Restart control plane pod
kubectl delete pod -n kmetal-kamaji -l kamaji.clastix.io/name=<name>
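
After a restart it is useful to confirm that the replacement pods actually become Ready. The sketch below combines the delete with `kubectl wait`. The helper name and the 180-second timeout are assumptions, and it relies on the pods being recreated by their controller as soon as they are deleted.

```shell
# Hypothetical helper: delete the control plane pods of one tenant, then
# wait for the replacement pods to report Ready.
# The 180s timeout is an arbitrary choice; adjust it to your environment.
restart_control_plane() {
  name="$1"
  kubectl delete pod -n kmetal-kamaji -l "kamaji.clastix.io/name=${name}" &&
    kubectl wait pod -n kmetal-kamaji -l "kamaji.clastix.io/name=${name}" \
      --for=condition=Ready --timeout=180s
}
```

Call it as `restart_control_plane <name>`. A non-zero exit status means the pods did not become Ready within the timeout.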

Worker Nodes Not Joining

Machines are created, but the corresponding nodes never appear in the tenant cluster.

# Check machines
kubectl get machines -n <namespace>
kubectl describe machine <machine-name> -n <namespace>

# Check control plane endpoint
kubectl get tenantcontrolplane <name> -n <namespace> -o jsonpath='{.status.controlPlaneEndpoint.host}:{.status.controlPlaneEndpoint.port}'
# Use the host:port printed above as-is; it already includes the port
curl -k https://<host>:<port>/healthz

# Check infrastructure provider logs
kubectl logs -n kmetal-capi-providers -l control-plane=controller-manager --tail=100
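
The endpoint lookup and the health probe can be chained so that the port is never typed by hand. This is a sketch only: the helper name `check_endpoint` and the 5-second timeout are assumptions.

```shell
# Hypothetical helper: read host:port from the TenantControlPlane status
# and probe the API server's /healthz path with a 5s timeout.
check_endpoint() {
  name="$1"; namespace="$2"
  endpoint="$(kubectl get tenantcontrolplane "$name" -n "$namespace" \
    -o jsonpath='{.status.controlPlaneEndpoint.host}:{.status.controlPlaneEndpoint.port}')"
  echo "Probing https://${endpoint}/healthz"
  # -k skips certificate verification, as in the manual check above
  curl -k --silent --show-error --max-time 5 "https://${endpoint}/healthz"
}
```

Call it as `check_endpoint <name> <namespace>`. A healthy API server answers `ok`; a timeout or connection error points to a networking or load balancer problem between the workers and the control plane.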

Kubeconfig Not Working

kubectl cannot connect to the tenant cluster with the downloaded kubeconfig.

# Re-download kubeconfig
kubectl get secret <cluster>-admin-kubeconfig -n <namespace> \
  -o jsonpath='{.data.admin\.conf}' | base64 -d > cluster.kubeconfig

# Check endpoint and DNS
kubectl get tenantcontrolplane <name> -n <namespace> -o jsonpath='{.status.controlPlaneEndpoint.host}:{.status.controlPlaneEndpoint.port}'
# Look up only the host part of the endpoint, without the :port suffix
nslookup <host>

# Test connection
export KUBECONFIG=cluster.kubeconfig
kubectl cluster-info
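
A stale kubeconfig may still point at an old endpoint, so it can help to compare the server URL stored in the file with the endpoint reported by the TenantControlPlane. The helper names below are hypothetical, and the host extraction is a simple sketch that does not handle IPv6 literals.

```shell
# Hypothetical helper: print the API server URL recorded in a kubeconfig file.
kubeconfig_server() {
  kubectl config view --kubeconfig "$1" -o jsonpath='{.clusters[0].cluster.server}'
}

# Hypothetical helper: reduce a URL such as https://host:6443 to the bare
# host, which is the form that nslookup expects.
server_host() {
  url="${1#*://}"             # drop the scheme, e.g. "https://"
  url="${url%%/*}"            # drop any trailing path
  printf '%s\n' "${url%:*}"   # drop the ":port" suffix if present
}
```

For example, `nslookup "$(server_host "$(kubeconfig_server cluster.kubeconfig)")"` resolves the host the kubeconfig actually uses.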

Cluster Scaling Stuck

The machine count does not change after a scale request.

# Check MachineDeployment
kubectl get machinedeployment -n <namespace>
kubectl describe machinedeployment <name> -n <namespace>

# Check machines
kubectl get machines -n <namespace>

# Check infrastructure provider
kubectl logs -n kmetal-capi-providers -l control-plane=controller-manager --tail=100
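
To see at a glance whether a scale operation is still pending, the desired and ready replica counts can be compared. This sketch assumes the MachineDeployment exposes `.spec.replicas` and `.status.readyReplicas`, as Cluster API MachineDeployments normally do. Both helper names are hypothetical.

```shell
# Hypothetical helper: print "<desired> <ready>" for a MachineDeployment.
md_replicas() {
  kubectl get machinedeployment "$1" -n "$2" \
    -o jsonpath='{.spec.replicas} {.status.readyReplicas}'
}

# Hypothetical helper: summarize the gap between desired and ready replicas.
# A missing ready count (status field not set yet) is treated as 0.
replica_gap() {
  desired="$1"; ready="${2:-0}"
  if [ "$desired" -eq "$ready" ]; then
    echo "in sync: ${ready}/${desired} ready"
  else
    echo "pending: ${ready}/${desired} ready"
  fi
}
```

Call it as `replica_gap $(md_replicas <name> <namespace>)`, leaving the substitution unquoted so the two numbers are passed as separate arguments.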

Cluster Not Deleting

Cluster deletion hangs and the Cluster resource is never removed.

# Check finalizers
kubectl get cluster <name> -n <namespace> -o yaml | grep finalizers -A 5

# Delete this cluster's machines first (the label limits the deletion to one cluster)
kubectl delete machines -n <namespace> -l cluster.x-k8s.io/cluster-name=<name> --wait=false

# Remove finalizers only as a last resort: this bypasses controller cleanup and can leave orphaned machines or infrastructure
kubectl patch cluster <name> -n <namespace> -p '{"metadata":{"finalizers":[]}}' --type=merge
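
Before patching finalizers away, it is worth checking what is still blocking the deletion. The sketch below prints the deletion timestamp, the remaining finalizers, and any machines that still belong to the cluster. The helper name is hypothetical, and the `cluster.x-k8s.io/cluster-name` label is the one Cluster API normally sets on Machines.

```shell
# Hypothetical helper: show why a Cluster deletion may be hanging.
deletion_blockers() {
  name="$1"; namespace="$2"
  echo "Deletion timestamp and finalizers:"
  kubectl get cluster "$name" -n "$namespace" \
    -o jsonpath='{.metadata.deletionTimestamp} {.metadata.finalizers}{"\n"}'
  echo "Machines still present:"
  kubectl get machines -n "$namespace" -l "cluster.x-k8s.io/cluster-name=${name}"
}
```

Call it as `deletion_blockers <name> <namespace>`. If machines are still listed, the cluster finalizer is most likely waiting on them rather than being stuck.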

Collect Diagnostics

# Management cluster view
kubectl get cluster,tenantcontrolplane,machines -n <namespace> -o yaml > cluster-resources.yaml
kubectl get events -n <namespace> --sort-by='.lastTimestamp' > events.txt
kubectl logs -n kmetal-kamaji -l kamaji.clastix.io/name=<name> --all-containers --tail=500 > control-plane.log

# Tenant cluster view (if accessible)
kubectl --kubeconfig=cluster.kubeconfig get nodes,pods -A -o yaml > tenant-resources.yaml
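
The management-cluster commands can be wrapped so that all output lands in one directory and is packed for sharing. This is a sketch under assumptions: the helper name, the directory naming, and the tarball step are illustrative choices, not part of the platform tooling.

```shell
# Hypothetical helper: gather management-cluster diagnostics for one tenant
# cluster into a directory and pack it as a tarball.
collect_diagnostics() {
  name="$1"; namespace="$2"
  outdir="${3:-diagnostics-${name}}"   # default directory name is arbitrary
  mkdir -p "$outdir"
  kubectl get cluster,tenantcontrolplane,machines -n "$namespace" -o yaml \
    > "$outdir/cluster-resources.yaml"
  kubectl get events -n "$namespace" --sort-by='.lastTimestamp' \
    > "$outdir/events.txt"
  # --all-containers captures every container in the control plane pods
  kubectl logs -n kmetal-kamaji -l "kamaji.clastix.io/name=${name}" \
    --all-containers --tail=500 > "$outdir/control-plane.log"
  tar -czf "${outdir}.tar.gz" "$outdir"
  echo "${outdir}.tar.gz"
}
```

Call it as `collect_diagnostics <name> <namespace>`. It prints the path of the archive it created, and a failing kubectl call leaves an empty file rather than stopping the collection.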