Platform Maintenance

Routine maintenance tasks for keeping a kMetal under cluster healthy.

Maintenance categories

  1. Preventive — health checks, certificate review, backup verification.
  2. Corrective — investigating failures, restoring components, applying patches.
  3. Adaptive — version upgrades and capacity changes.

Task                                          Cadence
Pod / node health spot-check                  Daily
Log review for the chart's namespaces         Daily
Certificate review (expiring within 30 days)  Weekly
Backup verification                           Weekly
Capacity review                               Monthly
Component upgrade window                      Quarterly or per-release
Disaster-recovery drill                       Quarterly

Daily health checks

A short manual spot-check covers most of what daily automation would need to find:

# Pods not Running / Completed, across all namespaces (-A), which includes the chart's namespaces
kubectl get pods -A | grep -v Running | grep -v Completed

# Certificates not ready
kubectl get certificates -A | grep -v True

# Pending LoadBalancer services
kubectl get svc -A | grep LoadBalancer | grep -i pending

# Recent warning events
kubectl get events -A --field-selector type=Warning --sort-by='.lastTimestamp' | tail -20

# Resource pressure
kubectl top nodes
kubectl top pods -A --sort-by=cpu | head -10
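
The grep filters above match anywhere on the line and let the header row through. A minimal sketch of a stricter pod filter follows, assuming the default column layout of kubectl get pods -A (STATUS in the fourth column). The function name unhealthy_pods is hypothetical and not part of kMetal.

```shell
# Hypothetical helper: reads `kubectl get pods -A` output on stdin and prints
# only pods whose STATUS is neither Running nor Completed.
# Assumes the default column order: NAMESPACE NAME READY STATUS RESTARTS AGE.
# Usage: kubectl get pods -A | unhealthy_pods
unhealthy_pods() {
  awk 'NR > 1 && $4 != "Running" && $4 != "Completed" { print $1 "/" $2 ": " $4 }'
}
```

Empty output means every pod reports Running or Completed. A Running pod can still be failing its readiness probe, so the READY column remains worth a glance.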

t.b.d.: a reusable, CronJob-packaged health check has not yet been documented in this section.

Certificate management

cert-manager renews kMetal's TLS certificates automatically before expiry. To review the current state:

# All certificates with readiness status
kubectl get certificates -A

# Certificates not currently Ready
kubectl get certificates -A | grep -v True

To force an early renewal of a specific certificate, see the upstream cmctl renew tool.
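
The weekly review in the cadence table looks for certificates expiring within 30 days, which the readiness filters above do not show directly. One possible sketch follows, assuming that cert-manager reports expiry in the Certificate's .status.notAfter field as an RFC 3339 UTC timestamp. The function name certs_expiring_before is hypothetical, and the date -d syntax in the usage comment is specific to GNU date.

```shell
# Hypothetical helper: reads "NAMESPACE NAME NOT_AFTER" rows on stdin and prints
# the rows whose expiry falls before the cutoff timestamp given as $1.
# Assumes RFC 3339 UTC timestamps (e.g. 2025-01-31T00:00:00Z), which order
# correctly when compared as plain strings.
#
# Usage (GNU date assumed for the "+30 days" arithmetic):
#   kubectl get certificates -A --no-headers \
#     -o custom-columns=NS:.metadata.namespace,NAME:.metadata.name,NOT_AFTER:.status.notAfter \
#     | certs_expiring_before "$(date -u -d '+30 days' +%Y-%m-%dT%H:%M:%SZ)"
certs_expiring_before() {
  awk -v cutoff="$1" \
    '$3 != "<none>" && ($3 "") < (cutoff "") { print $1 "/" $2 " expires " $3 }'
}
```

Certificates that have not been issued yet print <none> in the expiry column and are skipped here; the readiness check above already surfaces them.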

Backup verification

kMetal doesn't ship a specific backup product, so the backup tooling is the operator's choice. Whatever you've adopted, the verification pattern is the same: list recent backups, confirm that the latest one is fresh, and periodically restore a sample to a non-production target.

t.b.d.: a worked verification procedure depends on which backup tooling your environment uses. See Backup & Recovery for the general approach.
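
The freshness step is independent of the backup tool and can be sketched on its own. The helper below is hypothetical: it assumes you can obtain the completion time of the latest backup from your tooling as seconds since the Unix epoch, and the 24-hour limit in the usage comment is only an illustrative value.

```shell
# Hypothetical helper: succeeds (exit status 0) when a backup is recent enough.
#   $1 = backup completion time, in seconds since the Unix epoch
#   $2 = maximum acceptable age, in seconds
#   $3 = current time in seconds since the epoch (optional, defaults to now)
#
# Usage (86400 s = 24 h, an illustrative limit):
#   backup_is_fresh "$latest_backup_epoch" 86400 || echo "latest backup is stale"
backup_is_fresh() {
  backup_epoch=$1
  max_age=$2
  now=${3:-$(date +%s)}
  [ $((now - backup_epoch)) -le "$max_age" ]
}
```

A fresh timestamp only shows that a backup ran. The periodic sample restore is still what confirms the backup is usable.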

Upgrades

See Upgrades for the full procedure. Briefly:

helm upgrade kmetal oci://ghcr.io/clastix/oci/kmetal \
  --namespace kmetal-flux \
  --values kmetal-values.yaml \
  --wait --timeout=15m

helm history kmetal -n kmetal-flux

Before any production upgrade:

  • Run the upgrade on a non-production environment first.
  • Take a backup.
  • Render the new chart and diff against the running release:

# Render with the release namespace so namespaced fields match the running release
helm template kmetal oci://ghcr.io/clastix/oci/kmetal \
  --namespace kmetal-flux \
  --values kmetal-values.yaml > rendered-new.yaml
helm get manifest kmetal -n kmetal-flux > rendered-current.yaml
diff -u rendered-current.yaml rendered-new.yaml

Capacity review

Monthly, sample under-cluster resource usage and compare to allocated requests/limits:

kubectl top nodes
kubectl describe nodes | grep -A 5 'Allocated resources'

# Tenant footprint
kubectl get tenantcontrolplane -A
kubectl top pods -n kmetal-kamaji

Plan additional worker nodes (or vertical resource increases) before utilization stays consistently above 70-80%.
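
The threshold check can be sketched as a filter over kubectl top nodes output. The function name nodes_over is hypothetical, and the sketch assumes the default column order (NAME, CPU cores, CPU percentage, memory bytes, memory percentage).

```shell
# Hypothetical helper: reads `kubectl top nodes` output on stdin and prints
# nodes whose CPU% or MEMORY% is at or above the threshold given as $1.
# Assumes the default columns: NAME CPU(cores) CPU% MEMORY(bytes) MEMORY%.
# Usage: kubectl top nodes | nodes_over 80
nodes_over() {
  awk -v limit="$1" 'NR > 1 {
    cpu = $3; mem = $5
    sub(/%/, "", cpu); sub(/%/, "", mem)   # strip the trailing percent sign
    if (cpu + 0 >= limit + 0 || mem + 0 >= limit + 0)
      print $1 " cpu=" cpu "% mem=" mem "%"
  }'
}
```

A single reading is only a snapshot. To judge whether utilization stays consistently high, compare several samples taken over the review period.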

Disaster-recovery drills

See Disaster Recovery. Run the drill in a non-production environment at least once per quarter.


For full upgrade procedures see Upgrades; for scaling see Scaling.