Disaster Recovery

Procedures for recovering kMetal from catastrophic failures.

DR Strategy

kMetal disaster recovery covers four layers:

  1. Under cluster — restore Kubernetes control plane and workers.
  2. Platform components — re-install the umbrella chart with the values overlay.
  3. Tenant control planes — reapply tenant Cluster / TenantControlPlane definitions.
  4. Persistent data — restore PV data from your backup tooling.

Recovery Time Objectives

The targets below are starting points; tune for your environment.

Layer                   RTO target   RPO target
Under cluster           4 h          1 h
Platform components     2 h          15 min
Tenant control planes   1 h          5 min
Persistent data         2 h          1 h
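
Whether an RPO target is actually being met can be checked mechanically by comparing the age of the newest backup artifact with the target. The following is a minimal sketch, not something kMetal ships: `backup_within_rpo` is a hypothetical helper, the snapshot path is the one used in the etcd restore step on this page, and the check assumes the file's modification time reflects when the backup was taken.

```shell
# Hypothetical helper (not shipped with kMetal).
# Returns 0 if FILE was modified less than MAX_MIN minutes ago,
# 1 if it is older, and 2 if it does not exist.
backup_within_rpo() {
  local file="$1" max_min="$2"
  [ -f "$file" ] || return 2
  # find prints the path only when the mtime is newer than MAX_MIN minutes
  [ -n "$(find "$file" -mmin "-${max_min}" -print 2>/dev/null)" ]
}

# Example: the under-cluster RPO target is 1 h (60 minutes)
if backup_within_rpo /backup/etcd-snapshot.db 60; then
  echo "etcd snapshot is within RPO"
else
  echo "etcd snapshot is missing or older than RPO" >&2
fi
```

Running a check like this from a scheduler turns the RPO column into something that alerts, rather than a number that is only revisited during a drill.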

Under Cluster Recovery

Provision new infrastructure

Detailed steps for this section are t.b.d. In the meantime, use your standard provisioning process (PXE, IPMI, vendor tooling) to bring up replacement bare-metal nodes that match the original under-cluster spec. See Under Cluster Setup for hardware, network, and storage requirements.

Bootstrap a new under-cluster Kubernetes

# First control-plane node
sudo kubeadm init --config=/etc/kubernetes/kubeadm-config.yaml --upload-certs

# Additional control-plane nodes
sudo kubeadm join <cp-endpoint>:6443 \
  --token <token> --discovery-token-ca-cert-hash sha256:<hash> \
  --control-plane --certificate-key <key>

# Worker nodes
sudo kubeadm join <cp-endpoint>:6443 \
  --token <token> --discovery-token-ca-cert-hash sha256:<hash>
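
The `<token>`, `<hash>`, and `<key>` placeholders above are short-lived or cluster-specific, so during a recovery they usually have to be regenerated on the first control-plane node. The sketch below shows one way to do that; `ca_cert_hash` is a hypothetical wrapper name around the openssl pipeline described in the kubeadm documentation, and it assumes the CA uses an RSA key (the kubeadm default) stored at the default path.

```shell
# Compute the value for --discovery-token-ca-cert-hash from a CA certificate.
# Assumes an RSA CA key, which is the kubeadm default.
ca_cert_hash() {
  local ca="${1:-/etc/kubernetes/pki/ca.crt}"
  openssl x509 -pubkey -noout -in "$ca" \
    | openssl rsa -pubin -outform der 2>/dev/null \
    | openssl dgst -sha256 -hex \
    | sed 's/^.* //'
}

# Run on the first control-plane node (commented out so the sketch has no side effects):
# sudo kubeadm token create --print-join-command        # worker join command with a fresh token
# sudo kubeadm init phase upload-certs --upload-certs   # prints a fresh --certificate-key
# echo "sha256:$(ca_cert_hash)"
```

The output of `kubeadm token create --print-join-command` already contains the token and hash for workers; appending `--control-plane --certificate-key <key>` with the key from the second command gives the control-plane variant.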

Restore etcd from snapshot

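# Stop etcd. This assumes etcd runs as a systemd unit; on a kubeadm stacked-etcd node it is a
# static pod, so move /etc/kubernetes/manifests/etcd.yaml out of that directory instead.
sudo systemctl stop etcd

# The restore refuses to write into an existing, populated data directory, so move it aside
sudo mv /var/lib/etcd /var/lib/etcd.bak

sudo etcdctl snapshot restore /backup/etcd-snapshot.db \
  --name <node-name> \
  --initial-cluster <node-name>=https://<node-ip>:2380 \
  --initial-advertise-peer-urls https://<node-ip>:2380 \
  --data-dir /var/lib/etcd

sudo systemctl start etcd

# HTTPS endpoints need TLS material; these paths are kubeadm defaults, adjust for your setup
sudo etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key=/etc/kubernetes/pki/etcd/healthcheck-client.key endpoint health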
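
Before running the restore, it can be worth confirming that the snapshot file is intact, since a truncated or corrupted snapshot is much cheaper to discover before etcd has been stopped. The following is a sketch only: `verify_snapshot` is a hypothetical helper, and the `<snapshot>.sha256` sidecar file is an assumed convention that your backup job may or may not follow.

```shell
# Hypothetical pre-restore check (not part of etcd or kMetal).
# Fails if the snapshot is missing or empty, or if a "<snapshot>.sha256"
# sidecar file exists (an assumed convention) and does not match.
verify_snapshot() {
  local snap="$1" want got
  [ -s "$snap" ] || { echo "snapshot missing or empty: $snap" >&2; return 1; }
  if [ -f "${snap}.sha256" ]; then
    # The sidecar is assumed to carry the hex digest as its first field,
    # which is the format written by sha256sum
    want="$(cut -d' ' -f1 < "${snap}.sha256")"
    got="$(sha256sum "$snap" | cut -d' ' -f1)"
    [ "$want" = "$got" ] || { echo "checksum mismatch: $snap" >&2; return 1; }
  fi
}
```

A typical call would be `verify_snapshot /backup/etcd-snapshot.db`, optionally followed by `etcdctl snapshot status /backup/etcd-snapshot.db` (or `etcdutl snapshot status` on etcd 3.5 and later) to inspect the revision and key count recorded in the snapshot.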

Platform Component Recovery

Reinstall the kMetal umbrella chart with the same values overlay used for the original install:

helm install kmetal oci://ghcr.io/clastix/oci/kmetal \
  --namespace kmetal-flux --create-namespace \
  --values kmetal-values.yaml \
  --wait --timeout=20m

helm status kmetal -n kmetal-flux
kubectl get pods -A | grep -v Running | grep -v Completed
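
If a recovery attempt is interrupted and has to be re-run, `helm install` fails because the release name is already in use. One possible variant is `helm upgrade --install`, which installs the release when it is absent and upgrades it otherwise. The wrapper below is a sketch: `reinstall_kmetal` and the `HELM` override variable are hypothetical names introduced here, while the chart reference, namespace, and flags are the ones from the command above.

```shell
# Hypothetical wrapper: re-runnable install of the kMetal umbrella chart.
# Set HELM=echo to print the command instead of executing it.
reinstall_kmetal() {
  local values="${1:-kmetal-values.yaml}"
  [ -r "$values" ] || { echo "values overlay not found: $values" >&2; return 1; }
  ${HELM:-helm} upgrade --install kmetal oci://ghcr.io/clastix/oci/kmetal \
    --namespace kmetal-flux --create-namespace \
    --values "$values" \
    --wait --timeout=20m
}
```

Checking for the values overlay first avoids starting a 20-minute install that would otherwise come up with chart defaults or fail late.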

Restore any external secrets your environment uses. kMetal does not ship a specific secrets tool, so use whichever one you already operate.

Tenant Control Plane Recovery

After the chart is healthy:

# Reapply tenant Cluster / TenantControlPlane / DataStore manifests
kubectl apply -f /backup/tenants/

# Verify
kubectl get tenantcontrolplanes -A
kubectl get datastores
kubectl get clusters -A
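
The `kubectl get` commands above show what exists after the apply, but not whether everything in the backup was recreated. A rough cross-check is to count the objects of each kind in the backup directory and compare that with the live count. This is a sketch under stated assumptions: `count_kind` is a hypothetical helper, and it assumes the manifests are plain YAML with one unindented `kind:` line per object (it does not understand JSON or `List` wrappers).

```shell
# Hypothetical helper: count manifests of a given kind under a directory.
# Assumes plain YAML with one top-level "kind: <Kind>" line per object.
count_kind() {
  local dir="$1" kind="$2"
  { grep -rhE "^kind:[[:space:]]*${kind}[[:space:]]*\$" "$dir" 2>/dev/null || true; } \
    | wc -l | tr -d ' '
}

# Compare backup contents against the live cluster (requires kubectl access):
# expected="$(count_kind /backup/tenants TenantControlPlane)"
# actual="$(kubectl get tenantcontrolplanes -A --no-headers 2>/dev/null | wc -l | tr -d ' ')"
# [ "$expected" = "$actual" ] || echo "expected $expected TenantControlPlanes, found $actual" >&2
```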

Tenant etcd data restoration depends on which datastore backup tool the operator uses. A worked example flow for this section is t.b.d.

Persistent Data Recovery

Tenant PVC data is restored using whatever backup tool the operator has deployed; the under cluster itself does not ship one. Worked restore flows per tool (Velero, Kasten, vendor CSI snapshot tooling) are t.b.d.
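
Whichever tool performs the restore, a tool-independent sanity check afterwards is that every PersistentVolumeClaim has returned to the `Bound` state. The filter below is a minimal sketch; `unbound_pvcs` is a hypothetical helper that relies on the column order of `kubectl get pvc -A --no-headers` (namespace, name, status).

```shell
# Hypothetical helper: read `kubectl get pvc -A --no-headers` output on stdin
# and print "namespace/name status" for every claim that is not Bound.
unbound_pvcs() {
  awk 'NF >= 3 && $3 != "Bound" { print $1 "/" $2, $3 }'
}

# Usage (requires kubectl access to the cluster whose volumes were restored):
# kubectl get pvc -A --no-headers | unbound_pvcs
```

An empty result only shows that the claims are bound again; it says nothing about the data inside them, which still needs an application-level check.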

DR Testing

Run a full drill in a non-production environment at least quarterly:

  1. Document the current state (kubectl get nodes, kubectl get tenantcontrolplanes -A).
  2. Take a fresh backup of every backed-up layer.
  3. Simulate the failure (tear down the test under cluster).
  4. Execute the recovery procedure end-to-end.
  5. Compare restored state to documented state.
  6. Update the runbook based on what you learned.
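
Steps 1 and 5 are easier to carry out consistently when the state is captured to files and compared with `diff` rather than by eye. The helpers below are a sketch: `capture_state` and `compare_state` are hypothetical names, and the set of resources captured is only the two commands mentioned in step 1, which you would likely extend.

```shell
# Hypothetical helpers for steps 1 and 5 of the drill.
# capture_state writes sorted inventories into DIR (requires kubectl access).
capture_state() {
  local dir="$1"
  mkdir -p "$dir"
  kubectl get nodes -o name | sort > "$dir/nodes.txt"
  kubectl get tenantcontrolplanes -A --no-headers \
    -o custom-columns=NS:.metadata.namespace,NAME:.metadata.name \
    | sort > "$dir/tenantcontrolplanes.txt"
}

# compare_state returns 0 when both captures are identical;
# otherwise it prints the differences and returns non-zero.
compare_state() {
  diff -r "$1" "$2"
}
```

For example, run `capture_state dr-before` ahead of the teardown and `capture_state dr-after` once recovery is complete, then `compare_state dr-before dr-after`. Keep the capture directories off the cluster being destroyed, and expect node names to differ if the replacement hardware is named differently.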

Recovery Monitoring

# Watch component pods come up
watch 'kubectl get pods -A | grep -v Running | grep -v Completed'

# Watch tenant control planes
watch 'kubectl get tenantcontrolplanes -A'

# Watch the chart release status
helm status kmetal -n kmetal-flux
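
The `watch` commands above need someone looking at a terminal. For scripted recoveries or drills, the same filter can be turned into a wait loop with a timeout. This is a sketch, not a kMetal feature: `count_unhealthy` and `wait_for_pods` are hypothetical helpers, the defaults (20 minutes, 10-second polling) are assumptions, and a pod in `Running` status is not necessarily ready.

```shell
# Hypothetical helper: count pods whose STATUS is neither Running nor Completed.
# Reads `kubectl get pods -A --no-headers` output on stdin (STATUS is column 4).
count_unhealthy() {
  awk 'NF >= 4 && $4 != "Running" && $4 != "Completed" { n++ } END { print n + 0 }'
}

# Poll until no unhealthy pods remain (returns 0) or TIMEOUT seconds pass (returns 1).
wait_for_pods() {
  local timeout="${1:-1200}" interval="${2:-10}" waited=0 out
  while [ "$waited" -le "$timeout" ]; do
    # A failed kubectl call (API server not up yet) counts as "not ready"
    if out="$(kubectl get pods -A --no-headers 2>/dev/null)" \
       && [ "$(printf '%s\n' "$out" | count_unhealthy)" -eq 0 ]; then
      return 0
    fi
    sleep "$interval"
    waited=$((waited + interval))
  done
  return 1
}
```

Calling `wait_for_pods 1200 10 && helm status kmetal -n kmetal-flux` gives a single command whose exit code reflects whether the platform came up in time.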

Runbook Maintenance

Review the DR runbook monthly:

  • Are the RTO/RPO targets still achievable?
  • Has the recovery procedure been tested end-to-end in the last quarter?
  • Are the contact lists and credentials current?
  • Have any platform changes (chart version, sub-chart toggles, namespaces) invalidated steps?

Keep the runbook in version control alongside the values overlay.