Backup and Recovery

This guide provides backup and recovery procedures for kMetal, covering platform data, tenant control planes, and disaster recovery scenarios.

Backup Strategy Overview

kMetal requires a multi-layered backup strategy to protect different data types and ensure business continuity:

Backup Components

  1. Platform Data: Under cluster ETCD, platform configurations, secrets
  2. Tenant Data: Tenant control plane configurations, tenant ETCD snapshots
  3. Infrastructure Data: Persistent volumes, network policies, node configurations
  4. Application Data: Tenant cluster application data and configurations

Platform Backup

Under Cluster ETCD Backup

The under cluster ETCD contains critical platform state and must be backed up regularly:

# etcd-backup-cronjob.yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: etcd-backup
  namespace: kube-system
spec:
  schedule: "0 2 * * *"  # Daily at 2 AM
  concurrencyPolicy: Forbid
  successfulJobsHistoryLimit: 7
  failedJobsHistoryLimit: 3
  jobTemplate:
    spec:
      template:
        spec:
          containers:

          - name: etcd-backup
            image: registry.k8s.io/etcd:3.5.10-0
            command:

            - /bin/sh
            - -c
            - |
              set -e
              BACKUP_DIR="/backup/$(date +%Y%m%d-%H%M%S)"
              mkdir -p $BACKUP_DIR

              # Create ETCD snapshot
              etcdctl --endpoints=https://127.0.0.1:2379 \
                --cacert=/etc/kubernetes/pki/etcd/ca.crt \
                --cert=/etc/kubernetes/pki/etcd/server.crt \
                --key=/etc/kubernetes/pki/etcd/server.key \
                snapshot save $BACKUP_DIR/etcd-snapshot.db

              # Verify snapshot
              etcdctl --write-out=table snapshot status $BACKUP_DIR/etcd-snapshot.db

              # Compress backup
              tar -czf $BACKUP_DIR.tar.gz -C $BACKUP_DIR .
              rm -rf $BACKUP_DIR

              # Upload to S3 (optional: skipped when the aws CLI is not in the image)
              if command -v aws >/dev/null 2>&1; then
                aws s3 cp $BACKUP_DIR.tar.gz s3://kmetal-backups/etcd/
              fi

              # Cleanup old backups (keep 30 days)
              find /backup -name "*.tar.gz" -mtime +30 -delete
            env:

            - name: ETCDCTL_API
              value: "3"

            - name: AWS_ACCESS_KEY_ID
              valueFrom:
                secretKeyRef:
                  name: backup-credentials
                  key: access-key

            - name: AWS_SECRET_ACCESS_KEY
              valueFrom:
                secretKeyRef:
                  name: backup-credentials
                  key: secret-key

            - name: AWS_DEFAULT_REGION
              value: "us-west-2"
            volumeMounts:

            - name: etcd-certs
              mountPath: /etc/kubernetes/pki/etcd
              readOnly: true

            - name: backup-storage
              mountPath: /backup
          volumes:

          - name: etcd-certs
            hostPath:
              path: /etc/kubernetes/pki/etcd

          - name: backup-storage
            persistentVolumeClaim:
              claimName: etcd-backup-pvc
          restartPolicy: OnFailure
          hostNetwork: true  # required so 127.0.0.1:2379 reaches etcd on the control-plane node
          nodeSelector:
            node-role.kubernetes.io/control-plane: ""
          tolerations:

          - key: node-role.kubernetes.io/control-plane
            operator: Exists
            effect: NoSchedule
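
Note that the `find ... -delete` step in the job above prunes only the local backup volume; archives already copied to S3 stay there unless something else expires them. As a minimal sketch (not part of kMetal), the hypothetical helper `expired_archives` below picks out archive names older than a cutoff date, assuming the `YYYYMMDD-HHMMSS.tar.gz` naming that the job produces.

```shell
#!/bin/bash
# expired_archives CUTOFF
# Reads archive names (one per line) on stdin and prints those whose leading
# YYYYMMDD stamp is older than CUTOFF (also YYYYMMDD). Hypothetical helper.
expired_archives() {
  local cutoff=$1 name stamp
  while IFS= read -r name; do
    stamp=${name%%-*}                        # text before the first "-"
    [[ $stamp =~ ^[0-9]{8}$ ]] || continue   # skip names outside the naming scheme
    if (( 10#$stamp < 10#$cutoff )); then
      printf '%s\n' "$name"
    fi
  done
}

# Example (assumes GNU date and the bucket layout used in this guide):
#   cutoff=$(date -d '30 days ago' +%Y%m%d)
#   aws s3 ls s3://kmetal-backups/etcd/ | awk '{print $4}' | expired_archives "$cutoff" |
#     while IFS= read -r key; do aws s3 rm "s3://kmetal-backups/etcd/$key"; done
```

The helper only selects names, so the output can be reviewed before anything is deleted. A lifecycle rule on the bucket may be a simpler alternative where one is available.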

Platform Configuration Backup

Back up platform configurations and secrets:

#!/bin/bash
# platform-config-backup.sh

set -e

BACKUP_DIR="/backup/config/$(date +%Y%m%d-%H%M%S)"
mkdir -p $BACKUP_DIR

echo "Starting platform configuration backup..."

# Backup the kMetal Helm release (values + manifest + history)
helm get values   kmetal -n kmetal-flux > $BACKUP_DIR/kmetal-values.yaml
helm get manifest kmetal -n kmetal-flux > $BACKUP_DIR/kmetal-manifest.yaml
helm history      kmetal -n kmetal-flux > $BACKUP_DIR/kmetal-history.txt

# Backup platform secrets
kubectl get secret -n kmetal-flux -o yaml > $BACKUP_DIR/platform-secrets.yaml
kubectl get secret -n kmetal-kamaji -o yaml > $BACKUP_DIR/kamaji-secrets.yaml

# Backup certificates
kubectl get certificates -A -o yaml > $BACKUP_DIR/certificates.yaml
kubectl get clusterissuers -o yaml > $BACKUP_DIR/clusterissuers.yaml

# Backup custom resources
kubectl get tenantcontrolplane -A -o yaml > $BACKUP_DIR/tenantcontrolplanes.yaml
kubectl get clusters -A -o yaml > $BACKUP_DIR/clusters.yaml
kubectl get machines -A -o yaml > $BACKUP_DIR/machines.yaml

# Backup network policies
kubectl get networkpolicies -A -o yaml > $BACKUP_DIR/networkpolicies.yaml

# Backup RBAC
kubectl get clusterroles -o yaml > $BACKUP_DIR/clusterroles.yaml
kubectl get clusterrolebindings -o yaml > $BACKUP_DIR/clusterrolebindings.yaml

# Backup storage classes and PVCs
kubectl get storageclass -o yaml > $BACKUP_DIR/storageclasses.yaml
kubectl get pvc -A -o yaml > $BACKUP_DIR/persistentvolumeclaims.yaml

# Compress backup
tar -czf $BACKUP_DIR.tar.gz -C $BACKUP_DIR .
rm -rf $BACKUP_DIR

echo "Platform configuration backup completed: $BACKUP_DIR.tar.gz"

# Upload to S3
aws s3 cp $BACKUP_DIR.tar.gz s3://kmetal-backups/config/

# Cleanup old backups
find /backup/config -name "*.tar.gz" -mtime +30 -delete

Automated Platform Backup

# platform-backup-cronjob.yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: platform-backup
  namespace: kube-system
spec:
  schedule: "0 3 * * *"  # Daily at 3 AM
  concurrencyPolicy: Forbid
  successfulJobsHistoryLimit: 7
  failedJobsHistoryLimit: 3
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: platform-backup
          containers:

          - name: platform-backup
            image: bitnami/kubectl:latest
            command:

            - /bin/bash
            - -c
            - |
              set -e

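              # Install AWS CLI (needs curl and unzip in the image, and write
              # access to /usr/local for the installer's default paths)
              curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
              unzip awscliv2.zip
              ./aws/install

              # Run backup script (it also calls helm, which must be on the PATH)
              /scripts/platform-config-backup.sh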
            env:

            - name: AWS_ACCESS_KEY_ID
              valueFrom:
                secretKeyRef:
                  name: backup-credentials
                  key: access-key

            - name: AWS_SECRET_ACCESS_KEY
              valueFrom:
                secretKeyRef:
                  name: backup-credentials
                  key: secret-key

            - name: AWS_DEFAULT_REGION
              value: "us-west-2"
            volumeMounts:

            - name: backup-scripts
              mountPath: /scripts

            - name: backup-storage
              mountPath: /backup
          volumes:

          - name: backup-scripts
            configMap:
              name: backup-scripts
              defaultMode: 0755

          - name: backup-storage
            persistentVolumeClaim:
              claimName: platform-backup-pvc
          restartPolicy: OnFailure

Tenant Cluster Backup

User-Facing Backup Instructions

For end-user instructions on backing up individual tenant clusters, see User Guide: Backup & Restore.

This section covers platform-admin automation for backing up all tenant control planes.

Tenant Control Plane Backup

Back up tenant control plane configurations:

# tenant-backup-cronjob.yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: tenant-backup
  namespace: kmetal-kamaji
spec:
  schedule: "0 4 * * *"  # Daily at 4 AM
  concurrencyPolicy: Forbid
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: tenant-backup
          containers:

          - name: tenant-backup
            image: bitnami/kubectl:latest
            command:

            - /bin/bash
            - -c
            - |
              set -e

              BACKUP_DIR="/backup/tenants/$(date +%Y%m%d-%H%M%S)"
              mkdir -p $BACKUP_DIR

              # Get all tenant control planes
              kubectl get tenantcontrolplane -A -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.metadata.namespace}{"\n"}{end}' | while read name namespace; do
                echo "Backing up tenant: $name in namespace: $namespace"

                # Create tenant-specific backup directory
                TENANT_DIR="$BACKUP_DIR/$name"
                mkdir -p $TENANT_DIR

                # Backup tenant control plane configuration
                kubectl get tenantcontrolplane $name -n $namespace -o yaml > $TENANT_DIR/tenantcontrolplane.yaml

                # Backup tenant secrets
                kubectl get secret -n $namespace -l kamaji.clastix.io/name=$name -o yaml > $TENANT_DIR/secrets.yaml

                # Backup tenant certificates
                kubectl get secret -n $namespace -l kamaji.clastix.io/name=$name,cert-manager.io/certificate-name -o yaml > $TENANT_DIR/certificates.yaml

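                # Find the tenant kubeconfig secret (first match if several exist;
                # jsonpath prints all matches on one line, separated by spaces)
                KUBECONFIG_SECRET=$(kubectl get secret -n $namespace -l kamaji.clastix.io/name=$name -o jsonpath='{.items[?(@.type=="cluster.x-k8s.io/secret")].metadata.name}' | cut -d' ' -f1)

                if [ -n "$KUBECONFIG_SECRET" ]; then
                  # Extract kubeconfig
                  kubectl get secret $KUBECONFIG_SECRET -n $namespace -o jsonpath='{.data.value}' | base64 -d > $TENANT_DIR/kubeconfig

                  # Export tenant cluster resources (if the tenant API server is reachable).
                  # Pass --kubeconfig per command instead of exporting KUBECONFIG, so that
                  # later loop iterations still talk to the management cluster.
                  TENANT_KUBECONFIG=$TENANT_DIR/kubeconfig
                  if kubectl --kubeconfig $TENANT_KUBECONFIG get nodes &>/dev/null; then
                    kubectl --kubeconfig $TENANT_KUBECONFIG get all --all-namespaces -o yaml > $TENANT_DIR/tenant-resources.yaml
                    kubectl --kubeconfig $TENANT_KUBECONFIG get pvc --all-namespaces -o yaml > $TENANT_DIR/tenant-pvcs.yaml
                    kubectl --kubeconfig $TENANT_KUBECONFIG get configmap --all-namespaces -o yaml > $TENANT_DIR/tenant-configmaps.yaml
                    kubectl --kubeconfig $TENANT_KUBECONFIG get secret --all-namespaces -o yaml > $TENANT_DIR/tenant-secrets.yaml
                  fi
                fi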
              done

              # Compress backup
              tar -czf $BACKUP_DIR.tar.gz -C $BACKUP_DIR .
              rm -rf $BACKUP_DIR

              # Upload to S3 (the image must provide the aws CLI; this job does not install it)
              aws s3 cp $BACKUP_DIR.tar.gz s3://kmetal-backups/tenants/

              # Cleanup old backups
              find /backup/tenants -name "*.tar.gz" -mtime +30 -delete
            env:

            - name: AWS_ACCESS_KEY_ID
              valueFrom:
                secretKeyRef:
                  name: backup-credentials
                  key: access-key

            - name: AWS_SECRET_ACCESS_KEY
              valueFrom:
                secretKeyRef:
                  name: backup-credentials
                  key: secret-key

            - name: AWS_DEFAULT_REGION
              value: "us-west-2"
            volumeMounts:

            - name: backup-storage
              mountPath: /backup
          volumes:

          - name: backup-storage
            persistentVolumeClaim:
              claimName: tenant-backup-pvc
          restartPolicy: OnFailure

Tenant ETCD Backup

Tenant control planes reference their own DataStore CR via spec.dataStoreName. The DataStore CR specifies the backing engine (etcd or PostgreSQL/CNPG) and its endpoints. The platform does not maintain a single shared tenant-etcd endpoint, so backing up per-tenant datastores requires resolving each tenant's DataStore and snapshotting against its declared endpoints. A worked CronJob example for this section is still to be written.
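
One way to approach the per-tenant lookup, sketched under assumptions: resolve each tenant's DataStore, then choose a snapshot command by engine. The helper name `datastore_backup_plan` is hypothetical, the `.spec.driver` and `.spec.endpoints` field paths are assumptions to check against the installed DataStore CRD, and the etcd command omits the TLS flags a real datastore would need.

```shell
#!/bin/bash
# datastore_backup_plan DRIVER ENDPOINT OUTFILE
# Prints (does not run) a snapshot command for the given datastore engine.
# Hypothetical helper; only etcd is handled in this sketch.
datastore_backup_plan() {
  local driver=${1,,} endpoint=$2 outfile=$3   # ${1,,} lowercases the driver name
  case "$driver" in
    etcd)
      # TLS flags (--cacert, --cert, --key) are intentionally left out here
      echo "etcdctl --endpoints=https://$endpoint snapshot save $outfile"
      ;;
    *)
      echo "no snapshot command defined for driver: $1" >&2
      return 1
      ;;
  esac
}

# Example against a live cluster (field paths are assumptions, see above):
#   ds=$(kubectl get tenantcontrolplane "$name" -n "$namespace" -o jsonpath='{.spec.dataStoreName}')
#   driver=$(kubectl get datastore "$ds" -o jsonpath='{.spec.driver}')
#   endpoint=$(kubectl get datastore "$ds" -o jsonpath='{.spec.endpoints[0]}')
#   datastore_backup_plan "$driver" "$endpoint" "/backup/tenants/$name.db"
```

Printing the command rather than running it keeps the lookup logic easy to review; PostgreSQL-backed datastores would need their own branch.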

Persistent Volume Backup

Velero Installation

Install Velero for persistent volume backup:

# Install Velero CLI
curl -fsSL -o velero-v1.12.0-linux-amd64.tar.gz https://github.com/vmware-tanzu/velero/releases/download/v1.12.0/velero-v1.12.0-linux-amd64.tar.gz
tar -xzf velero-v1.12.0-linux-amd64.tar.gz
sudo mv velero-v1.12.0-linux-amd64/velero /usr/local/bin/

# Install Velero server
velero install \
    --provider aws \
    --plugins velero/velero-plugin-for-aws:v1.8.0 \
    --bucket kmetal-velero-backups \
    --secret-file ./aws-credentials \
    --backup-location-config region=us-west-2 \
    --snapshot-location-config region=us-west-2

Backup Schedules

# platform-pv-backup-schedule.yaml
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: platform-pv-backup
  namespace: velero
spec:
  schedule: "0 1 * * *"  # Daily at 1 AM
  template:
    includedNamespaces:

    - kmetal-flux
    - kmetal-kamaji
    - kmetal-cert-manager
    - kmetal-metallb
    - system-kubevirt
    - system-cdi
    - kmetal-capi-providers

    includedResources:

    - persistentvolumeclaims
    - persistentvolumes
    - secrets
    - configmaps

    excludedResources:

    - events
    - events.events.k8s.io

    snapshotVolumes: true
    includeClusterResources: true

    ttl: 720h  # 30 days

    metadata:
      labels:
        backup-type: platform-pv
---
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: tenant-data-backup
  namespace: velero
spec:
  schedule: "0 2 * * *"  # Daily at 2 AM
  template:
    labelSelector:
      matchLabels:
        backup.velero.io/tenant-data: "true"

    snapshotVolumes: true
    includeClusterResources: false

    ttl: 168h  # 7 days

    metadata:
      labels:
        backup-type: tenant-data

Disaster Recovery

Platform Recovery Procedures

Complete Platform Recovery

#!/bin/bash
# platform-recovery.sh

set -e

BACKUP_DATE=${1:-"latest"}
BACKUP_LOCATION=${2:-"s3://kmetal-backups"}

echo "Starting platform recovery from backup: $BACKUP_DATE"

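# Step 1: Restore management cluster ETCD
# Archives are uploaded as <timestamp>.tar.gz. "latest" selects the newest one;
# any other value is used as a timestamp prefix (for example 20240101).
echo "Restoring management cluster ETCD..."
PREFIX=""
if [ "$BACKUP_DATE" != "latest" ]; then PREFIX="$BACKUP_DATE"; fi
ETCD_BACKUP=$(aws s3 ls "$BACKUP_LOCATION/etcd/$PREFIX" | sort | tail -1 | awk '{print $4}')
aws s3 cp "$BACKUP_LOCATION/etcd/$ETCD_BACKUP" ./etcd-backup.tar.gz
tar -xzf etcd-backup.tar.gz
# Assumes etcd runs as a systemd unit. If it runs as a static pod, stop it by
# moving its manifest out of /etc/kubernetes/manifests instead.
sudo systemctl stop etcd
sudo rm -rf /var/lib/etcd
sudo etcdctl snapshot restore etcd-snapshot.db --data-dir /var/lib/etcd
sudo systemctl start etcd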

# Step 2: Wait for cluster to be ready
echo "Waiting for cluster to be ready..."
while ! kubectl get nodes &>/dev/null; do
    echo "Waiting for cluster..."
    sleep 10
done

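# Step 3: Restore platform configuration
# Config archives are uploaded as <timestamp>.tar.gz, selected like the ETCD archive
echo "Restoring platform configuration..."
CONFIG_PREFIX=""
if [ "$BACKUP_DATE" != "latest" ]; then CONFIG_PREFIX="$BACKUP_DATE"; fi
CONFIG_BACKUP=$(aws s3 ls "$BACKUP_LOCATION/config/$CONFIG_PREFIX" | sort | tail -1 | awk '{print $4}')
aws s3 cp "$BACKUP_LOCATION/config/$CONFIG_BACKUP" ./config-backup.tar.gz
tar -xzf config-backup.tar.gz

# Restore secrets first
kubectl apply -f platform-secrets.yaml
kubectl apply -f kamaji-secrets.yaml

# Restore cert-manager resources (the cert-manager CRDs must already be installed)
kubectl apply -f clusterissuers.yaml
kubectl apply -f certificates.yaml

# Restore Flux resources if the archive contains them
# (platform-config-backup.sh does not export these files)
for f in flux-secrets.yaml flux-helmreleases.yaml flux-gitrepositories.yaml flux-ocirepositories.yaml; do
  if [ -f "$f" ]; then kubectl apply -f "$f"; fi
done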

# Step 4: Restore tenant configurations
echo "Restoring tenant configurations..."
kubectl apply -f tenantcontrolplanes.yaml
kubectl apply -f clusters.yaml
kubectl apply -f machines.yaml

# Step 5: Verify platform health
echo "Verifying platform health..."
kubectl get pods -A
helm status kmetal -n kmetal-flux
kubectl get tenantcontrolplane -A

echo "Platform recovery completed successfully!"
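
The readiness loop in Step 2 of the script above waits indefinitely if the API server never comes back. A bounded wait could look like the sketch below; `wait_until` is a hypothetical helper name, and the timeout and interval values in the example are illustrative rather than recommended.

```shell
#!/bin/bash
# wait_until TIMEOUT INTERVAL COMMAND [ARGS...]
# Re-runs COMMAND every INTERVAL seconds until it succeeds.
# Returns 0 on success, 1 once TIMEOUT seconds have elapsed without success.
wait_until() {
  local timeout=$1 interval=$2
  shift 2
  local deadline=$(( SECONDS + timeout ))   # SECONDS is bash's elapsed-time counter
  until "$@" &>/dev/null; do
    if (( SECONDS >= deadline )); then
      return 1
    fi
    sleep "$interval"
  done
}

# Example: give the cluster up to 10 minutes, polling every 10 seconds
#   if ! wait_until 600 10 kubectl get nodes; then
#     echo "Cluster did not become ready in time" >&2
#     exit 1
#   fi
```

Failing fast here makes it clearer that the ETCD restore itself needs attention, instead of leaving the recovery script hanging.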

Tenant Control Plane Recovery

#!/bin/bash
# tenant-recovery.sh

TENANT_NAME=${1:-""}
BACKUP_DATE=${2:-"latest"}
BACKUP_LOCATION=${3:-"s3://kmetal-backups"}

if [ -z "$TENANT_NAME" ]; then
    echo "Usage: $0 <tenant-name> [backup-date] [backup-location]"
    exit 1
fi

echo "Restoring tenant control plane: $TENANT_NAME"

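# Download tenant backup (archives are uploaded as <timestamp>.tar.gz;
# "latest" selects the newest, any other value is used as a timestamp prefix)
PREFIX=""
if [ "$BACKUP_DATE" != "latest" ]; then PREFIX="$BACKUP_DATE"; fi
TENANT_BACKUP=$(aws s3 ls "$BACKUP_LOCATION/tenants/$PREFIX" | sort | tail -1 | awk '{print $4}')
aws s3 cp "$BACKUP_LOCATION/tenants/$TENANT_BACKUP" ./tenant-backup.tar.gz
tar -xzf tenant-backup.tar.gz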

# Restore tenant control plane
kubectl apply -f $TENANT_NAME/tenantcontrolplane.yaml
kubectl apply -f $TENANT_NAME/secrets.yaml
kubectl apply -f $TENANT_NAME/certificates.yaml

# Wait for control plane to be ready
echo "Waiting for tenant control plane to be ready..."
kubectl wait --for=condition=Ready tenantcontrolplane/$TENANT_NAME -n kmetal-kamaji --timeout=300s

# Restore tenant cluster resources (if kubeconfig exists)
if [ -f "$TENANT_NAME/kubeconfig" ]; then
    echo "Restoring tenant cluster resources..."
    export KUBECONFIG=$TENANT_NAME/kubeconfig
    kubectl apply -f $TENANT_NAME/tenant-resources.yaml
    kubectl apply -f $TENANT_NAME/tenant-pvcs.yaml
    kubectl apply -f $TENANT_NAME/tenant-configmaps.yaml
    kubectl apply -f $TENANT_NAME/tenant-secrets.yaml
fi

echo "Tenant recovery completed for: $TENANT_NAME"
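
The script above unpacks the whole tenant archive, including the secrets and kubeconfigs of every other tenant. As a sketch of a narrower approach, the hypothetical helper `extract_tenant` below pulls out a single tenant's directory, assuming the archive layout produced by the tenant backup job (created with `tar -C DIR .`, so members are stored as `./<tenant>/...`).

```shell
#!/bin/bash
# extract_tenant ARCHIVE TENANT DEST
# Extracts only ./TENANT/ from ARCHIVE into DEST. Hypothetical helper.
extract_tenant() {
  local archive=$1 tenant=$2 dest=$3
  mkdir -p "$dest"
  # Naming a directory member makes tar extract that directory recursively
  tar -xzf "$archive" -C "$dest" "./$tenant"
}

# Example:
#   extract_tenant ./tenant-backup.tar.gz "$TENANT_NAME" ./restore
#   kubectl apply -f "./restore/$TENANT_NAME/tenantcontrolplane.yaml"
```

tar exits non-zero when the named tenant is not in the archive, which also gives an early check that the tenant name is spelled correctly.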

Velero Restore Procedures

# Restore persistent volumes
# (backups created by a Schedule are named <schedule-name>-<YYYYMMDDhhmmss>)
velero restore create platform-pv-restore \
    --from-backup platform-pv-backup-20240101010000

# Restore specific tenant data
velero restore create tenant-data-restore \
    --from-backup tenant-data-backup-20240101020000 \
    --include-namespaces tenant-namespace

# Monitor restore progress
velero restore describe platform-pv-restore
velero restore logs platform-pv-restore

Backup Validation

Automated Backup Testing

# backup-validation-cronjob.yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: backup-validation
  namespace: kube-system
spec:
  schedule: "0 6 * * 0"  # Weekly on Sunday at 6 AM
  concurrencyPolicy: Forbid
  jobTemplate:
    spec:
      template:
        spec:
          containers:

          - name: backup-validation
            image: bitnami/kubectl:latest
            command:

            - /bin/bash
            - -c
            - |
              set -e

              echo "Starting backup validation..."

              # Test ETCD backup integrity
              LATEST_ETCD_BACKUP=$(aws s3 ls s3://kmetal-backups/etcd/ | sort | tail -1 | awk '{print $4}')
              aws s3 cp s3://kmetal-backups/etcd/$LATEST_ETCD_BACKUP ./etcd-test.tar.gz
              tar -xzf etcd-test.tar.gz

              # Verify ETCD snapshot
              etcdctl --write-out=table snapshot status etcd-snapshot.db

              # Test configuration backup completeness
              LATEST_CONFIG_BACKUP=$(aws s3 ls s3://kmetal-backups/config/ | sort | tail -1 | awk '{print $4}')
              aws s3 cp s3://kmetal-backups/config/$LATEST_CONFIG_BACKUP ./config-test.tar.gz
              tar -xzf config-test.tar.gz

              # Verify required files exist (names as written by platform-config-backup.sh)
              for file in kmetal-values.yaml kmetal-manifest.yaml platform-secrets.yaml tenantcontrolplanes.yaml; do
                if [ ! -f "$file" ]; then
                  echo "ERROR: Required backup file missing: $file"
                  exit 1
                fi
              done

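              # Test tenant backup (extract into its own directory so the check
              # is not confused by files from the previous steps)
              LATEST_TENANT_BACKUP=$(aws s3 ls s3://kmetal-backups/tenants/ | sort | tail -1 | awk '{print $4}')
              aws s3 cp s3://kmetal-backups/tenants/$LATEST_TENANT_BACKUP ./tenant-test.tar.gz
              mkdir -p tenant-test
              tar -xzf tenant-test.tar.gz -C tenant-test

              # Verify tenant data structure: at least one <tenant>/tenantcontrolplane.yaml
              if ! ls tenant-test/*/tenantcontrolplane.yaml &>/dev/null; then
                echo "ERROR: Tenant backup structure invalid"
                exit 1
              fi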

              # Send validation report
              echo "Backup validation completed successfully at $(date)" | mail -s "kmetal Backup Validation Report" ops@company.com

              echo "Backup validation completed successfully"
            env:

            - name: ETCDCTL_API
              value: "3"

            - name: AWS_ACCESS_KEY_ID
              valueFrom:
                secretKeyRef:
                  name: backup-credentials
                  key: access-key

            - name: AWS_SECRET_ACCESS_KEY
              valueFrom:
                secretKeyRef:
                  name: backup-credentials
                  key: secret-key

            - name: AWS_DEFAULT_REGION
              value: "us-west-2"
          restartPolicy: OnFailure
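
The completeness check in the job above extracts each archive before looking for files. As an alternative sketch, the archive listing can be inspected without unpacking anything; `archive_missing_files` is a hypothetical helper name, and it assumes archives created with `tar -C DIR .` as in the backup scripts in this guide, so members are listed as `./name`.

```shell
#!/bin/bash
# archive_missing_files ARCHIVE FILE [FILE...]
# Prints "missing: FILE" for each FILE absent from ARCHIVE.
# Returns 0 if all files are present, 1 if any is missing, 2 if ARCHIVE is unreadable.
archive_missing_files() {
  local archive=$1
  shift
  local listing file missing=0
  listing=$(tar -tzf "$archive") || return 2
  for file in "$@"; do
    # -x matches the whole line, -F treats the name as a literal string
    if ! grep -qxF "./$file" <<<"$listing"; then
      echo "missing: $file"
      missing=1
    fi
  done
  return $missing
}

# Example:
#   archive_missing_files ./config-test.tar.gz platform-secrets.yaml tenantcontrolplanes.yaml
```

Reading the listing also confirms the gzip stream is intact, since `tar -tzf` has to decompress the archive to enumerate it.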

Backup Monitoring

Backup Health Alerts

# backup-alerts.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: backup-alerts
  namespace: monitoring
spec:
  groups:

  - name: backup
    rules:

    - alert: BackupJobFailed
      expr: kube_job_status_failed{job_name=~".*backup.*"} > 0
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "Backup job {{ $labels.job_name }} failed"
        description: "Backup job {{ $labels.job_name }} has failed"

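    # Uses the CronJob's last schedule time: retained Job objects keep their old
    # start times, so a per-Job expression would fire for every Job older than a day.
    # The pattern matches the daily *-backup CronJobs, not the weekly backup-validation.
    - alert: BackupJobMissing
      expr: time() - kube_cronjob_status_last_schedule_time{cronjob=~".*-backup"} > 86400
      for: 1h
      labels:
        severity: warning
      annotations:
        summary: "Backup CronJob {{ $labels.cronjob }} hasn't run"
        description: "Backup CronJob {{ $labels.cronjob }} hasn't been scheduled in the last 24 hours"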

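    # velero_backup_failure_total is a counter, so alert on recent increases
    # rather than on its absolute value
    - alert: VeleroBackupFailed
      expr: increase(velero_backup_failure_total[24h]) > 0
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "Velero backup failed"
        description: "Velero backups have failed in the last 24 hours"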

The procedures in this guide protect the kMetal platform against data loss only if the backups can actually be restored, so test and validate them regularly.