Scaling the kMetal¶
This guide describes strategies for scaling the kMetal platform as workloads, tenant cluster counts, and performance requirements grow.
Scaling Overview¶
Scaling Dimensions¶
- Under Cluster: Node scaling, resource scaling
- Platform Components: Horizontal scaling, vertical scaling
- Tenant Clusters: Control plane scaling, worker node scaling
Scaling Strategies¶
| Strategy | Use Case | Benefits | Considerations |
|---|---|---|---|
| Horizontal Scaling | Increased workload | Better fault tolerance | Network complexity |
| Vertical Scaling | Resource-intensive workloads | Simplified architecture | Single point of failure |
| Cluster Scaling | More tenant clusters | Workload isolation | Management overhead |
| Component Scaling | Specific bottlenecks | Targeted optimization | Dependency management |
Under Cluster Scaling¶
Node Scaling¶
Adding Under Cluster Nodes
# Check current node capacity
kubectl get nodes -o wide
kubectl top nodes
# Add new nodes to cluster
# Example for kubeadm clusters
kubeadm token create --print-join-command
# On new node
kubeadm join <master-ip>:6443 --token <token> --discovery-token-ca-cert-hash <hash>
# Label nodes for specific workloads
kubectl label nodes <node-name> node-role.kubernetes.io/platform=true
kubectl label nodes <node-name> workload-type=control-plane
Node Affinity for Platform Components
# Platform component node affinity
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: node-role.kubernetes.io/platform
            operator: In
            values: ["true"]
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        preference:
          matchExpressions:
          - key: workload-type
            operator: In
            values: ["control-plane"]
Resource Planning¶
Capacity Planning Formula
Total Resource Requirement = Base Platform + (Tenant Clusters * Per-Cluster Overhead)
Base Platform Resources:
- CPU: 4-8 cores
- Memory: 8-16 GB
- Storage: 100-500 GB
Per-Tenant Cluster Overhead:
- Control Plane: 1-2 cores, 2-4 GB RAM
- Monitoring: 0.5 cores, 1 GB RAM
- Network: 0.1 cores, 0.5 GB RAM
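As a rough worked example of the formula above, the sketch below adds up the per-tenant overhead at its upper bounds (2 + 0.5 + 0.1 = 2.6 cores and 4 + 1 + 0.5 = 5.5 GB RAM per tenant cluster) on top of the upper-bound base platform (8 cores, 16 GB). The helper name `estimate_capacity` and the choice of upper-bound figures are assumptions made for illustration; replace them with numbers measured in your own environment.

```shell
#!/usr/bin/env bash
# Hypothetical helper: estimate under-cluster CPU and memory for N tenant clusters.
# Figures are the upper bounds from the lists above, not measured values.
estimate_capacity() {
  local tenants="$1"
  awk -v n="$tenants" 'BEGIN {
    base_cpu = 8; base_mem = 16     # base platform (upper bound)
    per_cpu = 2 + 0.5 + 0.1         # control plane + monitoring + network (cores)
    per_mem = 4 + 1 + 0.5           # control plane + monitoring + network (GB)
    printf "cpu_cores=%.1f memory_gb=%.1f\n", base_cpu + n * per_cpu, base_mem + n * per_mem
  }'
}

estimate_capacity 10   # prints: cpu_cores=34.0 memory_gb=71.0
```

Storage is left out because the source gives no per-tenant storage figure.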
Resource Monitoring
# Monitor resource usage
kubectl top nodes
kubectl top pods -A --sort-by=cpu
kubectl top pods -A --sort-by=memory
# Check resource requests vs limits
kubectl describe nodes | grep -A 5 "Allocated resources"
# Monitor storage usage
df -h /var/lib/kubelet
kubectl get pvc -A
Platform Component Scaling¶
Each component in the chart accepts an operator-facing replicas value and a full sub-chart override block (resources, anti-affinity, PDB, and so on); see Helm Values Reference. Typical production tuning:
- Kamaji: kamaji.replicas: 2 or higher, with sub-chart overrides for resource requests/limits and pod anti-affinity.
- cert-manager, MetalLB, Kube-OVN, KubeVirt: chart defaults are reasonable; adjust via the corresponding sub-chart override block if you measure resource pressure.
t.b.d. — A worked production-sizing example with concrete numbers is t.b.d. in this section.
Tenant Cluster Scaling¶
User Guide Content
Tenant cluster scaling operations (control plane, worker nodes, auto-scaling) are documented in the User Guide: Scale Clusters.
This section focuses on platform-level scaling. For tenant-specific scaling, refer to the user guide.
Data Layer Scaling¶
Each tenant cluster references its own Kamaji DataStore CR. The chart does not deploy a shared platform datastore — datastore choice (etcd vs CNPG, replica count, sizing) is per-tenant. See Hosted Control Plane for the model.
t.b.d. — Worked datastore sizing guidance per tenant tier is t.b.d. in this section.
Monitoring and Observability Scaling¶
The kMetal umbrella chart does not ship a monitoring stack. If you operate Prometheus / Alertmanager separately, scale it according to that stack's own guidance.
Auto-scaling Implementation¶
Scale the Kamaji controller by changing kamaji.replicas in the chart values and re-running helm upgrade. Running an HPA or VPA against the Kamaji Deployment is not the recommended pattern: once sized, the controller is steady-state, and the driver for resizing is tenant-count growth, not CPU or memory utilization.
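A minimal sketch of that replica change follows. The function name `scale_kamaji` and the release, chart, and namespace arguments are placeholders (assumptions); take the real values from your installation. `--reuse-values` keeps the values from the previous release so only kamaji.replicas changes.

```shell
#!/usr/bin/env bash
# Sketch: bump kamaji.replicas through Helm.
# Release, chart reference, and namespace are placeholders, not kMetal defaults.
scale_kamaji() {
  local release="$1" chart="$2" namespace="$3" replicas="$4"
  helm upgrade "$release" "$chart" \
    --namespace "$namespace" \
    --reuse-values \
    --set "kamaji.replicas=${replicas}"
}

# Example invocation (all names hypothetical):
# scale_kamaji my-release ./path/to/chart my-namespace 3
```

If you keep chart values in a versioned values file, editing that file and passing it with `-f` is usually the more reproducible choice than `--set`.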
t.b.d. — If your environment needs HPA-style autoscaling on any platform Deployment, the exact Deployment name and container name vary with chart rendering; resolve them via kubectl get deploy -n <namespace> before authoring the HPA/VPA.
Scaling Monitoring¶
If you have a Prometheus stack deployed (operator's choice), useful scrape signals include:
- Per-node CPU/memory utilization (node_cpu_seconds_total, node_memory_*).
- Per-tenant control plane pod resource usage (container_memory_usage_bytes filtered by namespace="kmetal-kamaji").
- Tenant count growth (count of tenantcontrolplanes.kamaji.clastix.io resources over time).
Exact metric names depend on the Kamaji and exporter versions you've deployed; verify against your live stack before wiring alerts.
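If no exporter publishes the tenant count yet, it can be sampled directly from the API. The sketch below counts TenantControlPlane resources across all namespaces; the function names are hypothetical, and how you store the samples (file, cron job, push gateway) is left open.

```shell
#!/usr/bin/env bash
# Sketch: count TenantControlPlane resources in the under cluster.
count_tenants() {
  kubectl get tenantcontrolplanes.kamaji.clastix.io -A --no-headers 2>/dev/null \
    | wc -l | tr -d ' '
}

# Emit one timestamped sample per call, for appending to a log or time series.
sample_tenant_count() {
  printf '%s tenants=%s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$(count_tenants)"
}
```

Running `sample_tenant_count` on a schedule gives a simple growth curve to compare against the capacity plan.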
t.b.d. — A canonical PromQL ruleset for scaling-related alerts is t.b.d. in this section.
Scaling Alerts¶
Scaling Alerts Configuration
# Scaling alerts
# NOTE: the metric names in the expr fields may not match your stack; verify
# them against the series your exporters actually expose before applying.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: scaling-alerts
spec:
  groups:
  - name: scaling
    rules:
    - alert: HighNodeUtilization
      expr: node_cpu_utilization > 80
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "Node {{ $labels.instance }} CPU utilization is high"
        description: "Node {{ $labels.instance }} has CPU utilization above 80%"
    - alert: HighTenantDensity
      expr: tenant_clusters_per_node > 20
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "High tenant density on node {{ $labels.node }}"
        description: "Node {{ $labels.node }} has more than 20 tenant clusters"
    - alert: KamajiControllerMemoryHigh
      expr: kamaji_controller_memory_usage > 6000000000  # 6 GB
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "Kamaji controller memory usage is high"
        description: "Kamaji controller memory usage is above 6 GB"
Tenant Cluster Autoscaling¶
Worker scaling inside a tenant cluster is driven by a per-tenant Cluster Autoscaler running in the under cluster, in the tenant's namespace. There is one autoscaler instance per tenant cluster, and it connects to that tenant's API server using the tenant's kubeconfig.
The flow:
- A workload in the tenant cluster goes Pending because no node has room.
- The autoscaler, watching the tenant's api-server, sees the unschedulable pod.
- It scales the matching CAPI MachineDeployment (in the under cluster) up by one replica.
- CAPK provisions a new KubeVirt VM as a worker node.
- The new worker joins the tenant cluster; the Pending pod schedules onto it.
Scale-down works in reverse: when nodes are underutilized for the configured period, the autoscaler scales the MachineDeployment down and the corresponding worker VM is decommissioned.
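To follow a scale-up or scale-down from the under cluster, you can compare the desired and ready replica counts on the MachineDeployment. This is a sketch only: the function name `md_status` is hypothetical, and it assumes the standard Cluster API fields spec.replicas and status.readyReplicas are populated in your CAPI version.

```shell
#!/usr/bin/env bash
# Sketch: print desired vs. ready replicas for one MachineDeployment.
md_status() {
  local namespace="$1" name="$2" out desired ready
  out="$(kubectl get machinedeployments.cluster.x-k8s.io "$name" -n "$namespace" \
    -o jsonpath='{.spec.replicas} {.status.readyReplicas}')"
  desired="${out%% *}"   # text before the first space
  ready="${out#* }"      # text after the first space (empty if the field is unset)
  printf 'desired=%s ready=%s\n' "${desired:-0}" "${ready:-0}"
}

# Example (namespace and name are placeholders):
# md_status <tenant-namespace> <machinedeployment-name>
```

While a new worker VM is still provisioning, desired is higher than ready; the two converge once the node has joined the tenant cluster.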
Multiple node sizes per tenant¶
A tenant can expose multiple worker shapes by giving each one its own MachineDeployment with a distinct nodeSelector / taints profile. The autoscaler will scale each MachineDeployment independently based on which one matches the pending workload's scheduling constraints. Typical use: a small pool for control-plane-light workloads, a large pool for memory-heavy work, an optional gpu pool when GPU passthrough is in play.
Scale-from-zero¶
MachineDeployments can be configured with min: 0, so a tenant pool sits at zero VMs when no workload needs it and provisions on demand. This is the right setting for niche pools (rarely-used GPU pool, on-demand batch pool) — it eliminates the cost of idle VMs.
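How the minimum is set depends on how the autoscaler is wired. The sketch below assumes the per-tenant autoscaler is the upstream Kubernetes Cluster Autoscaler with its Cluster API provider, which reads node-group bounds from annotations on the MachineDeployment. Whether kMetal sets these annotations for you, or exposes the bounds through its own values, is not stated here, so treat the function `set_autoscaler_bounds` and its direct use of annotations as an assumption.

```shell
#!/usr/bin/env bash
# Sketch: set autoscaler min/max bounds on a MachineDeployment via the
# annotations read by the Cluster Autoscaler's Cluster API provider.
set_autoscaler_bounds() {
  local namespace="$1" name="$2" min="$3" max="$4"
  kubectl annotate machinedeployments.cluster.x-k8s.io "$name" \
    -n "$namespace" --overwrite \
    "cluster.x-k8s.io/cluster-api-autoscaler-node-group-min-size=${min}" \
    "cluster.x-k8s.io/cluster-api-autoscaler-node-group-max-size=${max}"
}

# Example: a rarely used pool that may shrink to zero VMs (names are placeholders).
# set_autoscaler_bounds <tenant-namespace> <gpu-pool-machinedeployment> 0 4
```

With a minimum of zero the autoscaler has no running node to inspect, so it may also need to be told the node shape (CPU, memory, labels) of the pool; check the autoscaler provider documentation for how your version obtains that information.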