Backup and Recovery¶
This guide describes backup and recovery procedures for the kMetal platform, covering platform data, tenant control planes, and disaster recovery scenarios.
Backup Strategy Overview¶
kMetal requires a multi-layered backup strategy to protect different data types and ensure business continuity:
Backup Components¶
- Platform Data: Under cluster ETCD, platform configurations, secrets
- Tenant Data: Tenant control plane configurations, tenant ETCD snapshots
- Infrastructure Data: Persistent volumes, network policies, node configurations
- Application Data: Tenant cluster application data and configurations
Platform Backup¶
Under Cluster ETCD Backup¶
The under cluster ETCD contains critical platform state and must be backed up regularly:
# etcd-backup-cronjob.yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: etcd-backup
  namespace: kube-system
spec:
  schedule: "0 2 * * *" # Daily at 2 AM
  concurrencyPolicy: Forbid
  successfulJobsHistoryLimit: 7
  failedJobsHistoryLimit: 3
  jobTemplate:
    spec:
      template:
        spec:
          # Share the node's network namespace so that 127.0.0.1:2379 reaches
          # the node's etcd rather than the pod's own loopback interface.
          hostNetwork: true
          containers:
            - name: etcd-backup
              # This image provides etcdctl but not the AWS CLI; the optional
              # S3 upload below needs an image that includes both.
              image: registry.k8s.io/etcd:3.5.10-0
              command:
                - /bin/sh
                - -c
                - |
                  set -e
                  BACKUP_DIR="/backup/$(date +%Y%m%d-%H%M%S)"
                  mkdir -p "$BACKUP_DIR"
                  # Create ETCD snapshot
                  etcdctl --endpoints=https://127.0.0.1:2379 \
                    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
                    --cert=/etc/kubernetes/pki/etcd/server.crt \
                    --key=/etc/kubernetes/pki/etcd/server.key \
                    snapshot save "$BACKUP_DIR/etcd-snapshot.db"
                  # Verify snapshot
                  etcdctl --write-out=table snapshot status "$BACKUP_DIR/etcd-snapshot.db"
                  # Compress backup
                  tar -czf "$BACKUP_DIR.tar.gz" -C "$BACKUP_DIR" .
                  rm -rf "$BACKUP_DIR"
                  # Upload to S3 (optional)
                  aws s3 cp "$BACKUP_DIR.tar.gz" s3://kmetal-backups/etcd/
                  # Cleanup old backups (keep 30 days)
                  find /backup -name "*.tar.gz" -mtime +30 -delete
              env:
                - name: ETCDCTL_API
                  value: "3"
                - name: AWS_ACCESS_KEY_ID
                  valueFrom:
                    secretKeyRef:
                      name: backup-credentials
                      key: access-key
                - name: AWS_SECRET_ACCESS_KEY
                  valueFrom:
                    secretKeyRef:
                      name: backup-credentials
                      key: secret-key
                - name: AWS_DEFAULT_REGION
                  value: "us-west-2"
              volumeMounts:
                - name: etcd-certs
                  mountPath: /etc/kubernetes/pki/etcd
                  readOnly: true
                - name: backup-storage
                  mountPath: /backup
          volumes:
            - name: etcd-certs
              hostPath:
                path: /etc/kubernetes/pki/etcd
            - name: backup-storage
              persistentVolumeClaim:
                claimName: etcd-backup-pvc
          restartPolicy: OnFailure
          nodeSelector:
            node-role.kubernetes.io/control-plane: ""
          tolerations:
            - key: node-role.kubernetes.io/control-plane
              operator: Exists
              effect: NoSchedule
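The CronJob verifies the snapshot before compressing it, but nothing confirms that the archive itself survives upload and download intact. One way to cover that gap is a checksum stored next to the archive. The following is a minimal sketch, not part of kMetal: the function names `write_checksum` and `verify_checksum` are hypothetical, and it assumes `sha256sum` (GNU coreutils or BusyBox) is available in the image that runs it.

```shell
#!/bin/sh
# Hypothetical helpers (not part of kMetal): record and check a SHA-256
# checksum stored alongside a backup archive.

# write_checksum ARCHIVE
# Creates ARCHIVE.sha256 in the same directory as ARCHIVE.
write_checksum() {
  archive=$1
  dir=$(dirname "$archive")
  base=$(basename "$archive")
  # Record the bare file name so the checksum stays valid after the pair
  # is copied to another directory or downloaded from object storage.
  (cd "$dir" && sha256sum "$base" > "$base.sha256")
}

# verify_checksum ARCHIVE
# Returns 0 if ARCHIVE matches ARCHIVE.sha256, non-zero otherwise.
verify_checksum() {
  archive=$1
  dir=$(dirname "$archive")
  base=$(basename "$archive")
  (cd "$dir" && sha256sum -c "$base.sha256" >/dev/null 2>&1)
}
```

In the backup job, `write_checksum "$BACKUP_DIR.tar.gz"` would run right after the `tar` step, with both files uploaded together; a restore would call `verify_checksum` on the downloaded archive before extracting it.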
Platform Configuration Backup¶
Backup platform configurations and secrets:
#!/bin/bash
# platform-config-backup.sh
set -e
BACKUP_DIR="/backup/config/$(date +%Y%m%d-%H%M%S)"
mkdir -p $BACKUP_DIR
echo "Starting platform configuration backup..."
# Backup the kMetal Helm release (values + manifest + history)
helm get values kmetal -n kmetal-flux > $BACKUP_DIR/kmetal-values.yaml
helm get manifest kmetal -n kmetal-flux > $BACKUP_DIR/kmetal-manifest.yaml
helm history kmetal -n kmetal-flux > $BACKUP_DIR/kmetal-history.txt
# Backup platform secrets
kubectl get secret -n kmetal-flux -o yaml > $BACKUP_DIR/platform-secrets.yaml
kubectl get secret -n kmetal-kamaji -o yaml > $BACKUP_DIR/kamaji-secrets.yaml
# Backup certificates
kubectl get certificates -A -o yaml > $BACKUP_DIR/certificates.yaml
kubectl get clusterissuers -o yaml > $BACKUP_DIR/clusterissuers.yaml
# Backup custom resources
kubectl get tenantcontrolplane -A -o yaml > $BACKUP_DIR/tenantcontrolplanes.yaml
kubectl get clusters -A -o yaml > $BACKUP_DIR/clusters.yaml
kubectl get machines -A -o yaml > $BACKUP_DIR/machines.yaml
# Backup network policies
kubectl get networkpolicies -A -o yaml > $BACKUP_DIR/networkpolicies.yaml
# Backup RBAC
kubectl get clusterroles -o yaml > $BACKUP_DIR/clusterroles.yaml
kubectl get clusterrolebindings -o yaml > $BACKUP_DIR/clusterrolebindings.yaml
# Backup storage classes and PVCs
kubectl get storageclass -o yaml > $BACKUP_DIR/storageclasses.yaml
kubectl get pvc -A -o yaml > $BACKUP_DIR/persistentvolumeclaims.yaml
# Compress backup
tar -czf $BACKUP_DIR.tar.gz -C $BACKUP_DIR .
rm -rf $BACKUP_DIR
echo "Platform configuration backup completed: $BACKUP_DIR.tar.gz"
# Upload to S3
aws s3 cp $BACKUP_DIR.tar.gz s3://kmetal-backups/config/
# Cleanup old backups
find /backup/config -name "*.tar.gz" -mtime +30 -delete
Automated Platform Backup¶
# platform-backup-cronjob.yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: platform-backup
  namespace: kube-system
spec:
  schedule: "0 3 * * *" # Daily at 3 AM
  concurrencyPolicy: Forbid
  successfulJobsHistoryLimit: 7
  failedJobsHistoryLimit: 3
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: platform-backup
          containers:
            - name: platform-backup
              # The backup script also calls helm, and the install step below
              # needs curl and unzip. If the image lacks any of these, use an
              # image that bundles kubectl, helm, and the AWS CLI instead.
              image: bitnami/kubectl:latest
              command:
                - /bin/bash
                - -c
                - |
                  set -e
                  # Install AWS CLI into a user-writable location, since the
                  # container may run as a non-root user
                  cd /tmp
                  curl -fsSL "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
                  unzip -q awscliv2.zip
                  ./aws/install -i /tmp/aws-cli -b /tmp/bin
                  export PATH="/tmp/bin:$PATH"
                  # Run backup script
                  /scripts/platform-config-backup.sh
              env:
                - name: AWS_ACCESS_KEY_ID
                  valueFrom:
                    secretKeyRef:
                      name: backup-credentials
                      key: access-key
                - name: AWS_SECRET_ACCESS_KEY
                  valueFrom:
                    secretKeyRef:
                      name: backup-credentials
                      key: secret-key
                - name: AWS_DEFAULT_REGION
                  value: "us-west-2"
              volumeMounts:
                - name: backup-scripts
                  mountPath: /scripts
                - name: backup-storage
                  mountPath: /backup
          volumes:
            - name: backup-scripts
              configMap:
                name: backup-scripts
                defaultMode: 0755
            - name: backup-storage
              persistentVolumeClaim:
                claimName: platform-backup-pvc
          restartPolicy: OnFailure
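The CronJob above references a `platform-backup` ServiceAccount and a `backup-scripts` ConfigMap that this guide does not define. The sketch below shows one way they could be created. It is an assumption, not the kMetal-provided setup: the ClusterRole name `platform-backup-reader` is hypothetical, and the resource list is derived from the `kubectl get` calls in platform-config-backup.sh, so it should be adjusted to whatever the script actually reads (the `helm` calls, for example, also read release Secrets).

```shell
#!/bin/bash
# Sketch only: object names mirror the CronJob above; the ClusterRole name
# and its resource list are assumptions.
KUBECTL="${KUBECTL:-kubectl}"
NAMESPACE="kube-system"

bootstrap_platform_backup() {
  # ServiceAccount referenced by spec.serviceAccountName in the CronJob
  "$KUBECTL" create serviceaccount platform-backup -n "$NAMESPACE"

  # ConfigMap mounted at /scripts; expects the script in the current directory
  "$KUBECTL" create configmap backup-scripts -n "$NAMESPACE" \
    --from-file=platform-config-backup.sh

  # Read-only access to the resource types the backup script exports
  "$KUBECTL" create clusterrole platform-backup-reader \
    --verb=get,list \
    --resource=secrets,persistentvolumeclaims,networkpolicies.networking.k8s.io,storageclasses.storage.k8s.io,clusterroles.rbac.authorization.k8s.io,clusterrolebindings.rbac.authorization.k8s.io,certificates.cert-manager.io,clusterissuers.cert-manager.io,tenantcontrolplanes.kamaji.clastix.io,clusters.cluster.x-k8s.io,machines.cluster.x-k8s.io

  "$KUBECTL" create clusterrolebinding platform-backup-reader \
    --clusterrole=platform-backup-reader \
    --serviceaccount="$NAMESPACE:platform-backup"
}
```

Running `bootstrap_platform_backup` once against the management cluster creates all four objects. Setting `KUBECTL=echo` first prints the commands without executing them, which is a cheap way to review the requested permissions.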
Tenant Cluster Backup¶
User-Facing Backup Instructions
For end-user instructions on backing up individual tenant clusters, see User Guide: Backup & Restore.
This section covers platform-admin automation for backing up all tenant control planes.
Tenant Control Plane Backup¶
Backup tenant control plane configurations:
# tenant-backup-cronjob.yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: tenant-backup
  namespace: kmetal-kamaji
spec:
  schedule: "0 4 * * *" # Daily at 4 AM
  concurrencyPolicy: Forbid
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: tenant-backup
          containers:
            - name: tenant-backup
              # The upload step needs the AWS CLI, which this image does not
              # include; use an image that provides both kubectl and aws.
              image: bitnami/kubectl:latest
              command:
                - /bin/bash
                - -c
                - |
                  set -e
                  BACKUP_DIR="/backup/tenants/$(date +%Y%m%d-%H%M%S)"
                  mkdir -p "$BACKUP_DIR"
                  # Get all tenant control planes
                  kubectl get tenantcontrolplane -A -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.metadata.namespace}{"\n"}{end}' | while read -r name namespace; do
                    echo "Backing up tenant: $name in namespace: $namespace"
                    # Create tenant-specific backup directory
                    TENANT_DIR="$BACKUP_DIR/$name"
                    mkdir -p "$TENANT_DIR"
                    # Backup tenant control plane configuration
                    kubectl get tenantcontrolplane "$name" -n "$namespace" -o yaml > "$TENANT_DIR/tenantcontrolplane.yaml"
                    # Backup tenant secrets
                    kubectl get secret -n "$namespace" -l "kamaji.clastix.io/name=$name" -o yaml > "$TENANT_DIR/secrets.yaml"
                    # Backup tenant certificates
                    kubectl get secret -n "$namespace" -l "kamaji.clastix.io/name=$name,cert-manager.io/certificate-name" -o yaml > "$TENANT_DIR/certificates.yaml"
                    # Find the tenant kubeconfig secret (first match only)
                    KUBECONFIG_SECRET=$(kubectl get secret -n "$namespace" -l "kamaji.clastix.io/name=$name" -o jsonpath='{range .items[?(@.type=="cluster.x-k8s.io/secret")]}{.metadata.name}{"\n"}{end}' | head -1)
                    if [ -n "$KUBECONFIG_SECRET" ]; then
                      # Extract kubeconfig
                      kubectl get secret "$KUBECONFIG_SECRET" -n "$namespace" -o jsonpath='{.data.value}' | base64 -d > "$TENANT_DIR/kubeconfig"
                      # Export tenant cluster resources through the tenant API
                      # (a resource dump, not an ETCD snapshot). The kubeconfig
                      # is passed per command: exporting KUBECONFIG here would
                      # point later iterations at the tenant cluster instead of
                      # the management cluster.
                      TENANT_KUBECONFIG="$TENANT_DIR/kubeconfig"
                      if kubectl --kubeconfig "$TENANT_KUBECONFIG" get nodes &>/dev/null; then
                        kubectl --kubeconfig "$TENANT_KUBECONFIG" get all --all-namespaces -o yaml > "$TENANT_DIR/tenant-resources.yaml"
                        kubectl --kubeconfig "$TENANT_KUBECONFIG" get pvc --all-namespaces -o yaml > "$TENANT_DIR/tenant-pvcs.yaml"
                        kubectl --kubeconfig "$TENANT_KUBECONFIG" get configmap --all-namespaces -o yaml > "$TENANT_DIR/tenant-configmaps.yaml"
                        kubectl --kubeconfig "$TENANT_KUBECONFIG" get secret --all-namespaces -o yaml > "$TENANT_DIR/tenant-secrets.yaml"
                      fi
                    fi
                  done
                  # Compress backup
                  tar -czf "$BACKUP_DIR.tar.gz" -C "$BACKUP_DIR" .
                  rm -rf "$BACKUP_DIR"
                  # Upload to S3
                  aws s3 cp "$BACKUP_DIR.tar.gz" s3://kmetal-backups/tenants/
                  # Cleanup old backups
                  find /backup/tenants -name "*.tar.gz" -mtime +30 -delete
              env:
                - name: AWS_ACCESS_KEY_ID
                  valueFrom:
                    secretKeyRef:
                      name: backup-credentials
                      key: access-key
                - name: AWS_SECRET_ACCESS_KEY
                  valueFrom:
                    secretKeyRef:
                      name: backup-credentials
                      key: secret-key
                - name: AWS_DEFAULT_REGION
                  value: "us-west-2"
              volumeMounts:
                - name: backup-storage
                  mountPath: /backup
          volumes:
            - name: backup-storage
              persistentVolumeClaim:
                claimName: tenant-backup-pvc
          restartPolicy: OnFailure
Tenant ETCD Backup¶
Each tenant control plane references its own DataStore CR via spec.dataStoreName. The DataStore CR specifies the backing engine (etcd or PostgreSQL/CNPG) and its endpoints. The platform does not maintain a single shared tenant-etcd endpoint, so backing up per-tenant datastores requires resolving each tenant's DataStore and taking a snapshot against its declared endpoints. A worked CronJob example for this section is t.b.d.
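Until a full example exists, the resolution step on its own can be sketched as follows. This is a hedged illustration, not a backup procedure: the function name `list_tenant_datastores` is hypothetical, the `spec.dataStoreName` path is taken from the paragraph above, and the DataStore fields `spec.driver` and `spec.endpoints` are assumed to match the Kamaji DataStore CRD installed on the cluster (check with `kubectl explain datastore.spec`).

```shell
#!/bin/bash
# Sketch only: prints, per tenant control plane, the DataStore it references
# together with that DataStore's driver and endpoints.
KUBECTL="${KUBECTL:-kubectl}"

list_tenant_datastores() {
  "$KUBECTL" get tenantcontrolplane -A \
    -o jsonpath='{range .items[*]}{.metadata.namespace}{" "}{.metadata.name}{" "}{.spec.dataStoreName}{"\n"}{end}' |
  while read -r namespace name datastore; do
    # Skip tenants that do not declare a DataStore at the assumed path
    [ -n "$datastore" ] || { echo "skip $namespace/$name: no dataStoreName" >&2; continue; }
    # Assumed field names on the DataStore CR
    driver=$("$KUBECTL" get datastore "$datastore" -o jsonpath='{.spec.driver}')
    endpoints=$("$KUBECTL" get datastore "$datastore" -o jsonpath='{.spec.endpoints[*]}')
    echo "$namespace/$name datastore=$datastore driver=$driver endpoints=$endpoints"
  done
}
```

The output gives a per-tenant backup job what it needs to choose a tool: an etcd snapshot for etcd-backed DataStores, or a database dump for PostgreSQL-backed ones.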
Persistent Volume Backup¶
Velero Installation¶
Install Velero for persistent volume backup:
# Install Velero CLI
curl -fsSL -o velero-v1.12.0-linux-amd64.tar.gz https://github.com/vmware-tanzu/velero/releases/download/v1.12.0/velero-v1.12.0-linux-amd64.tar.gz
tar -xzf velero-v1.12.0-linux-amd64.tar.gz
sudo mv velero-v1.12.0-linux-amd64/velero /usr/local/bin/
# Install Velero server
velero install \
--provider aws \
--plugins velero/velero-plugin-for-aws:v1.8.0 \
--bucket kmetal-velero-backups \
--secret-file ./aws-credentials \
--backup-location-config region=us-west-2 \
--snapshot-location-config region=us-west-2
Backup Schedules¶
# platform-pv-backup-schedule.yaml
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: platform-pv-backup
  namespace: velero
spec:
  schedule: "0 1 * * *" # Daily at 1 AM
  template:
    includedNamespaces:
      - kmetal-flux
      - kmetal-kamaji
      - kmetal-cert-manager
      - kmetal-metallb
      - system-kubevirt
      - system-cdi
      - kmetal-capi-providers
    includedResources:
      - persistentvolumeclaims
      - persistentvolumes
      - secrets
      - configmaps
    excludedResources:
      - events
      - events.events.k8s.io
    snapshotVolumes: true
    includeClusterResources: true
    ttl: 720h # 30 days
    metadata:
      labels:
        backup-type: platform-pv
---
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: tenant-data-backup
  namespace: velero
spec:
  schedule: "0 2 * * *" # Daily at 2 AM
  template:
    labelSelector:
      matchLabels:
        backup.velero.io/tenant-data: "true"
    snapshotVolumes: true
    includeClusterResources: false
    ttl: 168h # 7 days
    metadata:
      labels:
        backup-type: tenant-data
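The tenant-data-backup schedule selects objects by label, and Velero applies a labelSelector to individual resources rather than to namespaces. Nothing is backed up by that schedule until the resources themselves carry `backup.velero.io/tenant-data: "true"`. A minimal sketch of labeling them follows; the function name `mark_tenant_data` and the default of labeling only PersistentVolumeClaims are assumptions made for illustration.

```shell
#!/bin/bash
# Sketch only: label resources in one namespace so that the
# tenant-data-backup schedule picks them up.
KUBECTL="${KUBECTL:-kubectl}"

# mark_tenant_data NAMESPACE [KINDS]
# KINDS is a comma-separated list of resource kinds
# (default: persistentvolumeclaims).
mark_tenant_data() {
  ns=$1
  kinds=${2:-persistentvolumeclaims}
  "$KUBECTL" label "$kinds" --all -n "$ns" \
    backup.velero.io/tenant-data=true --overwrite
}
```

For example, `mark_tenant_data tenant-a persistentvolumeclaims,deployments` labels every PVC and Deployment in the hypothetical `tenant-a` namespace. Resources created later need the label too, so applying it in the workload manifests is usually more durable than labeling after the fact.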
Disaster Recovery¶
Platform Recovery Procedures¶
Complete Platform Recovery¶
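#!/bin/bash
# platform-recovery.sh
set -e
BACKUP_DATE=${1:-"latest"}
BACKUP_LOCATION=${2:-"s3://kmetal-backups"}
echo "Starting platform recovery from backup: $BACKUP_DATE"
# The backup jobs upload archives named <YYYYMMDD-HHMMSS>.tar.gz. "latest"
# selects the newest archive under a prefix; any other value is matched as a
# name prefix (for example 20240101), because the ETCD and configuration
# backups run at different times.
resolve_backup() {
  local prefix=$1 pattern=""
  if [ "$BACKUP_DATE" != "latest" ]; then
    pattern="$BACKUP_DATE"
  fi
  aws s3 ls "$BACKUP_LOCATION/$prefix/" | awk '{print $4}' | grep "^${pattern}" | grep '\.tar\.gz$' | sort | tail -1
}
# Step 1: Restore management cluster ETCD
echo "Restoring management cluster ETCD..."
ETCD_ARCHIVE=$(resolve_backup etcd)
[ -n "$ETCD_ARCHIVE" ] || { echo "No ETCD backup found for: $BACKUP_DATE" >&2; exit 1; }
aws s3 cp "$BACKUP_LOCATION/etcd/$ETCD_ARCHIVE" ./etcd-backup.tar.gz
tar -xzf etcd-backup.tar.gz
# The commands below assume etcd runs as a systemd service. Where etcd runs
# as a static pod, stop it by moving its manifest out of
# /etc/kubernetes/manifests and move it back after the restore.
sudo systemctl stop etcd
sudo rm -rf /var/lib/etcd
sudo etcdctl snapshot restore etcd-snapshot.db --data-dir /var/lib/etcd
sudo systemctl start etcd
# Step 2: Wait for cluster to be ready
echo "Waiting for cluster to be ready..."
while ! kubectl get nodes &>/dev/null; do
  echo "Waiting for cluster..."
  sleep 10
done
# Step 3: Restore platform configuration
echo "Restoring platform configuration..."
CONFIG_ARCHIVE=$(resolve_backup config)
[ -n "$CONFIG_ARCHIVE" ] || { echo "No configuration backup found for: $BACKUP_DATE" >&2; exit 1; }
aws s3 cp "$BACKUP_LOCATION/config/$CONFIG_ARCHIVE" ./config-backup.tar.gz
tar -xzf config-backup.tar.gz
# Apply a file only if the archive contains it. platform-config-backup.sh
# does not export the flux-*.yaml files, so they are skipped unless your
# backup adds them.
apply_if_present() {
  if [ -f "$1" ]; then
    kubectl apply -f "$1"
  else
    echo "Skipping $1 (not in backup)"
  fi
}
# Restore secrets first
apply_if_present platform-secrets.yaml
apply_if_present flux-secrets.yaml
apply_if_present kamaji-secrets.yaml
# Restore cert-manager resources
apply_if_present certificates.yaml
apply_if_present clusterissuers.yaml
# Restore Flux components
apply_if_present flux-helmreleases.yaml
apply_if_present flux-gitrepositories.yaml
apply_if_present flux-ocirepositories.yaml
# Step 4: Restore tenant configurations
echo "Restoring tenant configurations..."
apply_if_present tenantcontrolplanes.yaml
apply_if_present clusters.yaml
apply_if_present machines.yaml
# Step 5: Verify platform health
echo "Verifying platform health..."
kubectl get pods -A
helm status kmetal -n kmetal-flux
kubectl get tenantcontrolplane -A
echo "Platform recovery completed successfully!"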
Tenant Control Plane Recovery¶
#!/bin/bash
# tenant-recovery.sh
set -e
TENANT_NAME=${1:-""}
BACKUP_DATE=${2:-"latest"}
BACKUP_LOCATION=${3:-"s3://kmetal-backups"}
TENANT_NAMESPACE=${4:-"kmetal-kamaji"}
if [ -z "$TENANT_NAME" ]; then
  echo "Usage: $0 <tenant-name> [backup-date] [backup-location] [tenant-namespace]"
  exit 1
fi
echo "Restoring tenant control plane: $TENANT_NAME"
# Download tenant backup. The tenant-backup CronJob uploads archives named
# <YYYYMMDD-HHMMSS>.tar.gz, so pass the backup date in that form or use
# "latest" to pick the newest archive.
if [ "$BACKUP_DATE" = "latest" ]; then
  ARCHIVE=$(aws s3 ls "$BACKUP_LOCATION/tenants/" | awk '{print $4}' | sort | tail -1)
else
  ARCHIVE="$BACKUP_DATE.tar.gz"
fi
[ -n "$ARCHIVE" ] || { echo "No tenant backup found in $BACKUP_LOCATION/tenants/" >&2; exit 1; }
aws s3 cp "$BACKUP_LOCATION/tenants/$ARCHIVE" ./tenant-backup.tar.gz
tar -xzf tenant-backup.tar.gz
# Restore tenant control plane
kubectl apply -f "$TENANT_NAME/tenantcontrolplane.yaml"
kubectl apply -f "$TENANT_NAME/secrets.yaml"
kubectl apply -f "$TENANT_NAME/certificates.yaml"
# Wait for control plane to be ready
echo "Waiting for tenant control plane to be ready..."
kubectl wait --for=condition=Ready "tenantcontrolplane/$TENANT_NAME" -n "$TENANT_NAMESPACE" --timeout=300s
# Restore tenant cluster resources (if kubeconfig exists)
if [ -f "$TENANT_NAME/kubeconfig" ]; then
  echo "Restoring tenant cluster resources..."
  export KUBECONFIG="$TENANT_NAME/kubeconfig"
  kubectl apply -f "$TENANT_NAME/tenant-resources.yaml"
  kubectl apply -f "$TENANT_NAME/tenant-pvcs.yaml"
  kubectl apply -f "$TENANT_NAME/tenant-configmaps.yaml"
  kubectl apply -f "$TENANT_NAME/tenant-secrets.yaml"
fi
echo "Tenant recovery completed for: $TENANT_NAME"
Velero Restore Procedures¶
# Restore persistent volumes
velero restore create platform-pv-restore \
--from-backup platform-pv-backup-20240101010000
# Restore specific tenant data
velero restore create tenant-data-restore \
--from-backup tenant-data-backup-20240101020000 \
--include-namespaces tenant-namespace
# Monitor restore progress
velero restore describe platform-pv-restore
velero restore logs platform-pv-restore
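The restore commands above need the exact name of a backup, which Velero derives from the schedule name plus a timestamp. A small helper can look up the newest completed backup for a schedule instead of typing the name by hand. This is a sketch under assumptions: the function name `latest_backup_for_schedule` is hypothetical, and it relies on Velero labeling scheduled backups with `velero.io/schedule-name` and on the Backup objects living in the `velero` namespace as installed earlier.

```shell
#!/bin/bash
# Sketch only: print the name of the most recent Completed backup that a
# given Velero schedule produced.
KUBECTL="${KUBECTL:-kubectl}"

# latest_backup_for_schedule SCHEDULE_NAME
latest_backup_for_schedule() {
  "$KUBECTL" get backups.velero.io -n velero \
    -l "velero.io/schedule-name=$1" \
    --sort-by=.metadata.creationTimestamp \
    -o jsonpath='{range .items[?(@.status.phase=="Completed")]}{.metadata.name}{"\n"}{end}' |
  tail -n 1
}
```

Usage would look like `velero restore create platform-pv-restore --from-backup "$(latest_backup_for_schedule platform-pv-backup)"`. Velero's own `--from-schedule` flag on `velero restore create` covers the same case without a helper; the sketch is mainly useful when the backup name is also needed for logging or validation.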
Backup Validation¶
Automated Backup Testing¶
# backup-validation-cronjob.yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: backup-validation
  namespace: kube-system
spec:
  schedule: "0 6 * * 0" # Weekly on Sunday at 6 AM
  concurrencyPolicy: Forbid
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: backup-validation
              # The script needs aws, etcdctl, tar, and mail; use an image
              # that provides them, since bitnami/kubectl alone does not
              # include the AWS CLI or etcdctl.
              image: bitnami/kubectl:latest
              command:
                - /bin/bash
                - -c
                - |
                  set -e
                  echo "Starting backup validation..."
                  # Work in a scratch directory and extract each archive into
                  # its own subdirectory so the checks do not interfere
                  cd "$(mktemp -d)"
                  # Test ETCD backup integrity
                  LATEST_ETCD_BACKUP=$(aws s3 ls s3://kmetal-backups/etcd/ | sort | tail -1 | awk '{print $4}')
                  aws s3 cp "s3://kmetal-backups/etcd/$LATEST_ETCD_BACKUP" ./etcd-test.tar.gz
                  mkdir etcd-test
                  tar -xzf etcd-test.tar.gz -C etcd-test
                  # Verify ETCD snapshot
                  etcdctl --write-out=table snapshot status etcd-test/etcd-snapshot.db
                  # Test configuration backup completeness
                  LATEST_CONFIG_BACKUP=$(aws s3 ls s3://kmetal-backups/config/ | sort | tail -1 | awk '{print $4}')
                  aws s3 cp "s3://kmetal-backups/config/$LATEST_CONFIG_BACKUP" ./config-test.tar.gz
                  mkdir config-test
                  tar -xzf config-test.tar.gz -C config-test
                  # Verify required files exist (names as written by
                  # platform-config-backup.sh)
                  for file in kmetal-values.yaml kmetal-manifest.yaml platform-secrets.yaml tenantcontrolplanes.yaml; do
                    if [ ! -f "config-test/$file" ]; then
                      echo "ERROR: Required backup file missing: $file"
                      exit 1
                    fi
                  done
                  # Test tenant backup
                  LATEST_TENANT_BACKUP=$(aws s3 ls s3://kmetal-backups/tenants/ | sort | tail -1 | awk '{print $4}')
                  aws s3 cp "s3://kmetal-backups/tenants/$LATEST_TENANT_BACKUP" ./tenant-test.tar.gz
                  mkdir tenant-test
                  tar -xzf tenant-test.tar.gz -C tenant-test
                  # Verify tenant data structure: at least one tenant
                  # directory containing its control plane manifest
                  if ! ls tenant-test/*/tenantcontrolplane.yaml >/dev/null 2>&1; then
                    echo "ERROR: Tenant backup structure invalid"
                    exit 1
                  fi
                  # Send validation report
                  echo "Backup validation completed successfully at $(date)" | mail -s "kmetal Backup Validation Report" ops@company.com
                  echo "Backup validation completed successfully"
              env:
                - name: ETCDCTL_API
                  value: "3"
                - name: AWS_ACCESS_KEY_ID
                  valueFrom:
                    secretKeyRef:
                      name: backup-credentials
                      key: access-key
                - name: AWS_SECRET_ACCESS_KEY
                  valueFrom:
                    secretKeyRef:
                      name: backup-credentials
                      key: secret-key
                - name: AWS_DEFAULT_REGION
                  value: "us-west-2"
          restartPolicy: OnFailure
Backup Monitoring¶
Backup Health Alerts¶
# backup-alerts.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: backup-alerts
  namespace: monitoring
spec:
  groups:
    - name: backup
      rules:
        - alert: BackupJobFailed
          expr: kube_job_status_failed{job_name=~".*backup.*"} > 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Backup job {{ $labels.job_name }} failed"
            description: "Backup job {{ $labels.job_name }} has failed"
        - alert: BackupJobMissing
          # Job names change on every run and old Jobs are retained, so this
          # rule tracks the owning CronJob. The weekly backup-validation
          # CronJob is excluded from the 24-hour check.
          expr: time() - kube_cronjob_status_last_schedule_time{cronjob=~".*backup.*", cronjob!="backup-validation"} > 86400
          for: 1h
          labels:
            severity: warning
          annotations:
            summary: "Backup CronJob {{ $labels.cronjob }} hasn't run"
            description: "Backup CronJob {{ $labels.cronjob }} hasn't run in the last 24 hours"
        - alert: VeleroBackupFailed
          # The failure counter never resets, so alert on recent increases
          # rather than on the absolute value.
          expr: increase(velero_backup_failure_total[24h]) > 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Velero backup failed"
            description: "Velero backups for schedule {{ $labels.schedule }} have failed in the last 24 hours"
Together, these backups, recovery scripts, validation jobs, and alerts protect the kMetal platform against data loss and shorten recovery after a disaster. Test and validate the procedures regularly: a backup that has never been restored cannot be relied on when it is needed.