Disaster Recovery¶
Procedures for recovering kMetal from catastrophic failures.
DR Strategy¶
kMetal disaster recovery covers four layers:
- Under cluster: restore the Kubernetes control plane and workers.
- Platform components: re-install the umbrella chart with the values overlay.
- Tenant control planes: reapply tenant Cluster/TenantControlPlane definitions.
- Persistent data: restore PV data from your backup tooling.
Recovery Time Objectives¶
The targets below are starting points; tune for your environment.
| Layer | RTO target | RPO target |
|---|---|---|
| Under cluster | 4 h | 1 h |
| Platform components | 2 h | 15 min |
| Tenant control planes | 1 h | 5 min |
| Persistent data | 2 h | 1 h |
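One way to make the RPO column actionable is to check the age of the newest backup against the target. The sketch below is an illustration, not part of kMetal: the function name `backup_within_rpo` is hypothetical, and it assumes a `find` that supports `-mmin` (GNU, BSD, and BusyBox do).

```shell
#!/bin/sh
# backup_within_rpo FILE MINUTES   (hypothetical helper, not shipped with kMetal)
# Succeeds when FILE exists and was modified less than MINUTES minutes ago.
backup_within_rpo() (
    file=$1
    rpo_minutes=$2
    [ -f "$file" ] || exit 1
    # find prints the path only if its mtime is newer than the threshold.
    [ -n "$(find "$file" -mmin "-$rpo_minutes")" ]
)
```

For example, `backup_within_rpo /backup/etcd-snapshot.db 60 || echo "snapshot is older than the 1 h RPO target"` could run from a scheduled job and feed an alert.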
Under Cluster Recovery¶
Provision new infrastructure¶
t.b.d. In the meantime, use your standard provisioning process (PXE, IPMI, vendor tooling) to bring up replacement bare-metal nodes that match the original under-cluster spec. See Under Cluster Setup for hardware, network, and storage requirements.
Bootstrap a new under-cluster Kubernetes¶
# First control-plane node
sudo kubeadm init --config=/etc/kubernetes/kubeadm-config.yaml --upload-certs
# Additional control-plane nodes
sudo kubeadm join <cp-endpoint>:6443 \
--token <token> --discovery-token-ca-cert-hash sha256:<hash> \
--control-plane --certificate-key <key>
# Worker nodes
sudo kubeadm join <cp-endpoint>:6443 \
--token <token> --discovery-token-ca-cert-hash sha256:<hash>
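The `<token>`, `<hash>`, and `<key>` placeholders above come from the `kubeadm init` output, but bootstrap tokens and uploaded certificate keys expire. A minimal sketch for regenerating them on an already initialized control-plane node follows. The function name is hypothetical, and if the cluster was initialized with a custom kubeadm config, the `upload-certs` phase may also need `--config`.

```shell
#!/bin/sh
# control_plane_join_command   (hypothetical helper)
# Prints a control-plane join command built from a fresh bootstrap token and a
# freshly uploaded certificate key. Run on a healthy control-plane node.
control_plane_join_command() (
    # Prints: kubeadm join <endpoint> --token ... --discovery-token-ca-cert-hash ...
    join_cmd=$(sudo kubeadm token create --print-join-command) || exit 1
    # Re-uploads the control-plane certificates; the key is the last output line.
    cert_key=$(sudo kubeadm init phase upload-certs --upload-certs | tail -n 1) || exit 1
    [ -n "$join_cmd" ] && [ -n "$cert_key" ] || exit 1
    printf '%s --control-plane --certificate-key %s\n' "$join_cmd" "$cert_key"
)
```

For worker nodes, the output of `sudo kubeadm token create --print-join-command` can be used as-is.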
Restore etcd from snapshot¶
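# Stop etcd before touching its data directory (assumes a systemd-managed etcd)
sudo systemctl stop etcd
# The restore fails if the target data directory already holds data; move the old directory aside first
sudo etcdctl snapshot restore /backup/etcd-snapshot.db \
--name <node-name> \
--initial-cluster <node-name>=https://<node-ip>:2380 \
--initial-advertise-peer-urls https://<node-ip>:2380 \
--data-dir /var/lib/etcd
sudo systemctl start etcd
# TLS endpoints need client certificates; these paths are the kubeadm defaults, adjust as needed
sudo etcdctl --endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
--key=/etc/kubernetes/pki/etcd/healthcheck-client.key \
endpoint health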
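Before restoring, it can help to confirm which snapshot is actually the newest one. The helper below is a sketch under assumptions: the name `latest_snapshot` is hypothetical, snapshots are assumed to be `*.db` files in a single directory, and file names are assumed not to contain newlines.

```shell
#!/bin/sh
# latest_snapshot DIR   (hypothetical helper)
# Prints the most recently modified *.db file in DIR, or fails if none exists.
latest_snapshot() (
    dir=$1
    # ls -t sorts newest first; the error from an unmatched glob is silenced.
    newest=$(ls -t "$dir"/*.db 2>/dev/null | head -n 1)
    [ -n "$newest" ] || exit 1
    printf '%s\n' "$newest"
)
```

A possible use is `snap=$(latest_snapshot /backup) && sudo etcdctl snapshot status "$snap" --write-out=table`, which prints the snapshot's hash, revision, and size so a truncated file is caught before the restore. Newer etcd releases move this subcommand to `etcdutl`.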
Platform Component Recovery¶
Reinstall the kMetal umbrella chart with the same values overlay used for the original install:
helm install kmetal oci://ghcr.io/clastix/oci/kmetal \
--namespace kmetal-flux --create-namespace \
--values kmetal-values.yaml \
--wait --timeout=20m
helm status kmetal -n kmetal-flux
kubectl get pods -A | grep -v Running | grep -v Completed
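The `grep -v` filter above also prints the header row and hides pods that are Running but not yet ready. A slightly stricter filter can be sketched with `awk`. The function name `unhealthy_pods` is hypothetical, and the column positions assume the default `kubectl get pods -A --no-headers` layout.

```shell
#!/bin/sh
# unhealthy_pods   (hypothetical helper)
# Reads `kubectl get pods -A --no-headers` output on stdin and prints
# "namespace/name ready status" for pods that need attention.
unhealthy_pods() {
    # Columns with -A: NAMESPACE NAME READY STATUS RESTARTS AGE
    awk '
        { split($3, r, "/") }                 # r[1] = ready containers, r[2] = total
        $4 == "Completed" { next }            # finished jobs are fine
        $4 != "Running" || r[1] != r[2] { print $1 "/" $2, $3, $4 }
    '
}
```

Typical use would be `kubectl get pods -A --no-headers | unhealthy_pods`; empty output means every pod is either Completed or Running with all containers ready.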
Restore any external secrets your environment uses. The choice of secrets tooling is left to the operator; kMetal does not ship one.
Tenant Control Plane Recovery¶
After the chart is healthy:
# Reapply tenant Cluster / TenantControlPlane / DataStore manifests
kubectl apply -f /backup/tenants/
# Verify
kubectl get tenantcontrolplanes -A
kubectl get datastores
kubectl get clusters -A
Tenant etcd data restoration depends on which datastore backup tool the operator uses. A worked example flow for this section is t.b.d.
Persistent Data Recovery¶
Tenant PVC data is restored with whatever backup tool the operator has deployed; the under cluster itself does not ship one. Worked restore flows per tool (Velero, Kasten, vendor CSI snapshot tooling) are t.b.d.
DR Testing¶
Run a full drill in a non-production environment at least quarterly:
- Document the current state (kubectl get nodes, kubectl get tenantcontrolplanes -A).
- Take a fresh backup of every backed-up layer.
- Simulate the failure (tear down the test under cluster).
- Execute the recovery procedure end-to-end.
- Compare restored state to documented state.
- Update the runbook based on what you learned.
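The "document" and "compare" steps of the drill can be scripted so the comparison is mechanical. This is a sketch only: `capture_state` and `compare_state` are hypothetical names, and comparing node names assumes the replacement nodes reuse the original hostnames.

```shell
#!/bin/sh
# capture_state OUTDIR   (hypothetical helper)
# Saves sorted object listings so that restored state can be diffed later.
capture_state() (
    out=$1
    mkdir -p "$out" || exit 1
    kubectl get nodes -o name | sort > "$out/nodes.txt"
    kubectl get tenantcontrolplanes -A --no-headers \
        -o custom-columns=NS:.metadata.namespace,NAME:.metadata.name \
        | sort > "$out/tenantcontrolplanes.txt"
)

# compare_state BEFORE_DIR AFTER_DIR   (hypothetical helper)
# Succeeds only if both captures list the same objects; prints a diff otherwise.
compare_state() (
    diff -r "$1" "$2"
)
```

Run `capture_state drill/before` ahead of the simulated failure, `capture_state drill/after` once recovery finishes, then `compare_state drill/before drill/after`.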
Recovery Monitoring¶
# Watch component pods come up
watch 'kubectl get pods -A | grep -v Running | grep -v Completed'
# Watch tenant control planes
watch 'kubectl get tenantcontrolplanes -A'
# Watch the chart release status
helm status kmetal -n kmetal-flux
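`watch` needs someone at the terminal. For scripted recovery, a polling helper with a timeout is a common alternative. The sketch below is illustrative only: `wait_until` is a hypothetical name, and the elapsed time counts only the sleep intervals, so the effective timeout is approximate.

```shell
#!/bin/sh
# wait_until TIMEOUT_SECONDS INTERVAL_SECONDS COMMAND [ARGS...]   (hypothetical helper)
# Re-runs COMMAND until it succeeds, or fails once TIMEOUT_SECONDS of
# sleeping have accumulated.
wait_until() (
    timeout=$1
    interval=$2
    shift 2
    elapsed=0
    until "$@"; do
        [ "$elapsed" -lt "$timeout" ] || exit 1
        sleep "$interval"
        elapsed=$((elapsed + interval))
    done
)
```

For example, `wait_until 600 5 kubectl get --raw=/readyz` blocks until the API server reports ready or roughly ten minutes pass, which also gives a convenient point to record the elapsed time against the RTO target.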
Runbook Maintenance¶
Review the DR runbook monthly:
- Are the RTO/RPO targets still achievable?
- Has the recovery procedure been tested end-to-end in the last quarter?
- Are the contact lists and credentials current?
- Have any platform changes (chart version, sub-chart toggles, namespaces) invalidated steps?
Keep the runbook in version control alongside the values overlay.