Platform Maintenance¶
Routine maintenance tasks for keeping a kMetal under cluster healthy.
Maintenance categories¶
- Preventive — health checks, certificate review, backup verification.
- Corrective — investigating failures, restoring components, applying patches.
- Adaptive — version upgrades and capacity changes.
Recommended cadence¶
| Task | Cadence |
|---|---|
| Pod / node health spot-check | Daily |
| Log review for the chart's namespaces | Daily |
| Certificate review (expiring within 30 days) | Weekly |
| Backup verification | Weekly |
| Capacity review | Monthly |
| Component upgrade window | Quarterly or per-release |
| Disaster-recovery drill | Quarterly |
Daily health checks¶
A short manual spot-check covers most of what daily automation would need to find:
# Pods not Running / Completed across the chart's namespaces
kubectl get pods -A | grep -v Running | grep -v Completed
# Certificates not ready
kubectl get certificates -A | grep -v True
# Pending LoadBalancer services
kubectl get svc -A | grep LoadBalancer | grep -i pending
# Recent warning events
kubectl get events -A --field-selector type=Warning --sort-by='.lastTimestamp' | tail -20
# Resource pressure
kubectl top nodes
kubectl top pods -A --sort-by=cpu | head -10
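The `grep -v` filters above match the words "Running" and "Completed" anywhere on a line, so the header row always survives and the match is not tied to the STATUS column. A stricter filter could look like the sketch below. It assumes bash and the default `kubectl get pods -A` column order; the function name `unhealthy_pods` is hypothetical and not part of kMetal.

```shell
# unhealthy_pods: read `kubectl get pods -A` output on stdin and print only
# rows whose STATUS column (4th field) is neither Running nor Completed.
# Assumes the default column order: NAMESPACE NAME READY STATUS RESTARTS AGE.
unhealthy_pods() {
  awk 'NR > 1 && $4 != "Running" && $4 != "Completed" { print $1 "/" $2 ": " $4 }'
}
```

Usage: `kubectl get pods -A | unhealthy_pods`. Empty output means every pod reports Running or Completed. A pod that is Running but not Ready is not flagged by this sketch.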
t.b.d.: a reusable health check packaged as a CronJob is not yet documented in this section.
Certificate management¶
cert-manager renews kMetal's TLS certificates automatically before expiry. To review the current state:
# All certificates with readiness status
kubectl get certificates -A
# Certificates not currently Ready
kubectl get certificates -A | grep -v True
To force early renewal of a specific certificate, use the upstream cmctl renew command.
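The weekly review looks for certificates expiring within 30 days, which the readiness column alone does not show. The following is a minimal sketch of that check, assuming bash, GNU `date`, and input lines of the form `namespace<TAB>name<TAB>notAfter`. The function name `expiring_certs` and its arguments are hypothetical.

```shell
# expiring_certs [MAX_DAYS] [NOW_EPOCH]
# Read "namespace<TAB>name<TAB>notAfter" lines on stdin and print the
# certificates with fewer than MAX_DAYS (default 30) whole days remaining.
# NOW_EPOCH defaults to the current time; pass it to get reproducible results.
# Requires GNU date for the -d option.
expiring_certs() {
  local max_days="${1:-30}" now="${2:-$(date -u +%s)}"
  local ns name not_after expiry days
  while IFS=$'\t' read -r ns name not_after; do
    [ -n "$not_after" ] || continue              # not issued yet: no expiry to check
    expiry=$(date -u -d "$not_after" +%s) || continue
    days=$(( (expiry - now) / 86400 ))
    if [ "$days" -lt "$max_days" ]; then
      printf '%s/%s expires in %d days\n' "$ns" "$name" "$days"
    fi
  done
}
```

The input can be produced from cert-manager's `status.notAfter` field with `kubectl get certificates -A -o jsonpath='{range .items[*]}{.metadata.namespace}{"\t"}{.metadata.name}{"\t"}{.status.notAfter}{"\n"}{end}' | expiring_certs 30`. A certificate that shows up here despite automatic renewal is a candidate for `cmctl renew <name> -n <namespace>`, where both values are placeholders.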
Backup verification¶
kMetal doesn't ship a specific backup product, so the tooling is the operator's choice. Whatever you've adopted, the verification pattern is the same: list recent backups, confirm that the latest one is fresh, and periodically run a sample restore on a non-production target.
t.b.d.: a worked verification procedure depends on which backup tooling your environment uses. See Backup & Recovery for the general approach.
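The freshness step can be kept independent of the backup tool. The sketch below assumes bash, GNU `date`, and that your tooling can report the completion time of the latest backup as an ISO 8601 timestamp. How you obtain that timestamp is tool-specific and not shown; the function name `backup_is_fresh` is hypothetical.

```shell
# backup_is_fresh TIMESTAMP MAX_AGE_HOURS [NOW_EPOCH]
# Exit 0 if TIMESTAMP is at most MAX_AGE_HOURS old, 1 if it is older,
# 2 if the timestamp cannot be parsed. Requires GNU date for -d.
backup_is_fresh() {
  local timestamp="$1" max_age_hours="$2" now="${3:-$(date -u +%s)}"
  local taken
  taken=$(date -u -d "$timestamp" +%s) || return 2
  [ $(( now - taken )) -le $(( max_age_hours * 3600 )) ]
}
```

For example, `backup_is_fresh "$latest" 24 || echo "latest backup is stale"`, where `$latest` holds the timestamp reported by your tool and 24 hours is an illustrative threshold, not a kMetal recommendation.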
Upgrades¶
See Upgrades for the full procedure. Briefly:
helm upgrade kmetal oci://ghcr.io/clastix/oci/kmetal \
--namespace kmetal-flux \
--values kmetal-values.yaml \
--wait --timeout=15m
helm history kmetal -n kmetal-flux
Before any production upgrade:
- Run the upgrade on a non-production environment first.
- Take a backup.
- Render the new chart and diff against the running release.
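# Render with the release namespace so the output is comparable to the live manifest
helm template kmetal oci://ghcr.io/clastix/oci/kmetal \
--namespace kmetal-flux \
--values kmetal-values.yaml > rendered-new.yaml
helm get manifest kmetal -n kmetal-flux > rendered-current.yaml
diff -u rendered-current.yaml rendered-new.yaml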
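When the diff step above runs inside a script, note that `diff` exits with 0 for identical files, 1 for differing files, and a higher code on error, so a plain `set -e` script would stop on any change. The sketch below wraps it so that "changes present" is the success case. It assumes bash; the function name `manifests_changed` is hypothetical.

```shell
# manifests_changed CURRENT NEW
# Print a unified diff of two rendered manifest files.
# Returns 0 when they differ, 1 when they are identical,
# 2 when diff itself failed (for example, a missing file).
manifests_changed() {
  local current="$1" new="$2" rc=0
  diff -u "$current" "$new" || rc=$?
  case "$rc" in
    0) return 1 ;;   # identical: nothing to review
    1) return 0 ;;   # differences were printed above
    *) return 2 ;;   # diff reported an error
  esac
}
```

For example, `manifests_changed rendered-current.yaml rendered-new.yaml && echo "review the changes before upgrading"`.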
Capacity review¶
Monthly, sample under-cluster resource usage and compare to allocated requests/limits:
kubectl top nodes
kubectl describe nodes | grep -A 5 'Allocated resources'
# Tenant footprint
kubectl get tenantcontrolplane -A
kubectl top pods -n kmetal-kamaji
Plan additional worker nodes (or vertical resource increases) before utilization stays consistently above 70-80%.
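To spot nodes near that threshold, the `kubectl top nodes` output can be filtered by percentage. This is a minimal sketch, assuming bash and the default column order (NAME, CPU(cores), CPU%, MEMORY(bytes), MEMORY%). The function name `nodes_over_threshold` is hypothetical.

```shell
# nodes_over_threshold [THRESHOLD]
# Read `kubectl top nodes` output on stdin and print nodes whose CPU% or
# MEMORY% is at or above THRESHOLD (default 70).
# Assumes default columns: NAME CPU(cores) CPU% MEMORY(bytes) MEMORY%.
nodes_over_threshold() {
  awk -v limit="${1:-70}" '
    NR > 1 {
      cpu = $3; mem = $5
      sub(/%/, "", cpu); sub(/%/, "", mem)      # strip the trailing percent sign
      if (cpu + 0 >= limit + 0 || mem + 0 >= limit + 0)
        printf "%s cpu=%s%% mem=%s%%\n", $1, cpu, mem
    }'
}
```

Usage: `kubectl top nodes | nodes_over_threshold 70`. A single run is only a snapshot; judging whether utilization stays consistently high needs repeated samples over the review period.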
Disaster-recovery drills¶
See Disaster Recovery. Run the drill in a non-production environment at least once per quarter.
For full upgrade procedures see Upgrades; for scaling see Scaling.