# Kubernetes in Production: Lessons from 50+ Cluster Deployments
We have deployed and operated over 50 Kubernetes clusters across managed services (EKS, AKS, GKE), self-managed installations on bare metal and Hetzner, and hybrid setups that span multiple providers. Along the way, we have collected a set of lessons that we wish someone had told us at the start.
This is not a Kubernetes tutorial. This is a field guide for teams moving Kubernetes into production or struggling with clusters that are already there.
```mermaid
graph TD
    subgraph ControlPlane["Control Plane"]
        API[API Server]
        ETCD[etcd]
        SCHED[Scheduler]
        CM[Controller Manager]
    end
    subgraph WorkerNodes["Worker Nodes"]
        subgraph SystemPool["System Pool"]
            S1[system-1]
            S2[system-2]
        end
        subgraph AppPool["App Pool"]
            A1[app-1]
            A2[app-2]
            A3[app-3]
        end
        subgraph DBPool["DB Pool"]
            D1[db-1]
            D2[db-2]
        end
    end
    API --> S1
    API --> A1
    API --> D1
```
## Lesson 1: Right-Size Your Clusters from Day One
The single most common mistake we see is starting with nodes that are too small. Teams pick the cheapest instance type, pack it with pods, and then wonder why everything is unstable.
**The problem:** Kubernetes itself consumes resources. The kubelet, kube-proxy, CNI plugin, CoreDNS, and your monitoring stack (Prometheus, node-exporter, etc.) all need CPU and memory. On a 2-vCPU / 4GB node, system components can consume 40-50% of available resources before a single application pod is scheduled.
**Our recommendation:** Start with nodes that have at least 4 vCPUs and 8GB RAM. For production workloads, 8 vCPUs / 16GB or larger is the sweet spot. Fewer, larger nodes are almost always better than many small ones — they reduce scheduling fragmentation and give pods room to burst.
```yaml
# Good: explicit resource requests and limits
resources:
  requests:
    cpu: 250m
    memory: 256Mi
  limits:
    cpu: 500m
    memory: 512Mi
```
Set resource requests on every pod. Without them, the scheduler is guessing, and it guesses poorly.
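To keep unconfigured pods from slipping through, a namespace-level `LimitRange` can apply defaults automatically. A minimal sketch, assuming a per-team namespace (the namespace name and numbers are illustrative, not from the text above):

```yaml
# Apply default requests/limits to any container that omits them.
# Namespace name and values are illustrative.
apiVersion: v1
kind: LimitRange
metadata:
  name: default-resources
  namespace: team-a
spec:
  limits:
    - type: Container
      defaultRequest:  # used when a container sets no requests
        cpu: 250m
        memory: 256Mi
      default:         # used when a container sets no limits
        cpu: 500m
        memory: 512Mi
```

This does not replace explicit requests in your manifests, but it keeps the scheduler from flying completely blind when someone forgets them.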
## Lesson 2: Use Dedicated Node Pools
Not all workloads belong on the same nodes. We consistently recommend at least three node pools:
- **System pool:** For cluster infrastructure — ingress controllers, monitoring, cert-manager, ArgoCD. Tainted to prevent application pods from being scheduled here.
- **General pool:** For stateless application workloads. This is your auto-scaling pool.
- **Stateful pool:** For databases, message queues, and anything with persistent storage. Larger nodes, local SSDs if needed.
```yaml
# Taint the system node pool
apiVersion: v1
kind: Node
metadata:
  labels:
    node-pool: system
spec:
  taints:
    - key: dedicated
      value: system
      effect: NoSchedule
---
# Toleration for system components (pod spec fragment)
tolerations:
  - key: dedicated
    operator: Equal
    value: system
    effect: NoSchedule
nodeSelector:
  node-pool: system
```
This separation prevents a misbehaving application from starving your ingress controller or monitoring stack — a scenario we have seen cause extended outages.
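The same taint-and-tolerate pattern extends to the stateful pool. A sketch, assuming the db nodes carry a `dedicated=db:NoSchedule` taint and a `node-pool: db` label (names are illustrative):

```yaml
# Pod spec fragment for a database StatefulSet.
# Assumes db nodes are tainted dedicated=db:NoSchedule
# and labeled node-pool: db (illustrative names).
tolerations:
  - key: dedicated
    operator: Equal
    value: db
    effect: NoSchedule
nodeSelector:
  node-pool: db
```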
## Lesson 3: Networking — Cilium Has Won (Mostly)
We have deployed clusters with Flannel, Calico, and Cilium. Our current default for new deployments is Cilium, and here is why:
- eBPF-based dataplane eliminates the need for kube-proxy and iptables, which dramatically improves performance and reduces complexity at scale. Clusters with 5,000+ services see measurable latency improvements.
- Network policies that actually work. Calico’s policy engine is solid, but Cilium’s L7-aware policies (filter by HTTP path, gRPC method) are a tier above.
- Built-in observability via Hubble gives you network flow visibility without deploying additional tools.
- Service mesh capabilities without the sidecar overhead of Istio.
That said, Calico is still a fine choice if your team already knows it or if you are on a managed platform where the CNI choice is constrained. The weakest choice is sticking with Flannel (the default in some self-managed distributions, such as k3s) — it lacks NetworkPolicy support entirely.
```yaml
# Cilium Helm values (simplified)
# helm install cilium cilium/cilium -f values.yaml
kubeProxyReplacement: true
k8sServiceHost: "api.k8s.example.com"
k8sServicePort: "6443"
hubble:
  enabled: true
  relay:
    enabled: true
  ui:
    enabled: true
ipam:
  mode: kubernetes
bpf:
  masquerade: true
```
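The L7-aware policies mentioned above look like this in practice. A hedged sketch of a `CiliumNetworkPolicy` that only allows GET requests to a health endpoint (the labels, port, and path are illustrative):

```yaml
# Allow only GET /healthz on port 8080, and only from pods
# labeled app=probe (all names here are illustrative).
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: allow-healthz-only
spec:
  endpointSelector:
    matchLabels:
      app: api
  ingress:
    - fromEndpoints:
        - matchLabels:
            app: probe
      toPorts:
        - ports:
            - port: "8080"
              protocol: TCP
          rules:
            http:
              - method: GET
                path: "/healthz"
```

A plain Kubernetes NetworkPolicy stops at the port level; this is the extra tier Cilium adds.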
## Lesson 4: Managed vs. Self-Managed — It Depends (But Probably Managed)
We get asked this constantly. Here is our decision framework:
**Choose managed (EKS/AKS/GKE) when:**
- Your team is small (fewer than 3 people managing infrastructure)
- You need to move fast and do not want to maintain etcd, the API server, or handle control plane upgrades
- You are already invested in a cloud provider’s ecosystem
- Compliance requirements favor a certified, vendor-supported platform
**Choose self-managed when:**
- You need to run on bare metal or specific hosting (Hetzner, OVH) for cost or data sovereignty reasons
- You have deep Kubernetes expertise and want full control
- Your workload profile is unusual (GPU clusters, edge deployments, air-gapped environments)
For self-managed clusters, we use k3s for lightweight/edge deployments and kubeadm or Cluster API for full-featured production clusters. RKE2 is a solid middle ground that we have deployed successfully in security-conscious environments.
**Cost note:** Managed Kubernetes is not free. EKS charges $0.10/hour for the control plane alone (~$73/month). AKS makes the control plane free but charges for other management features. GKE Autopilot bundles the cost into node pricing. Factor this into your budget, but remember that the alternative is paying your engineers to babysit etcd.
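For budgeting, the per-cluster arithmetic is simple enough to script; a small sketch using the EKS rate quoted above (the helper name is ours, not from any SDK):

```python
# Back-of-the-envelope managed control plane cost, per cluster.
# Uses the $0.10/hour EKS figure from the text; 730 = average hours/month.
HOURS_PER_MONTH = 8760 / 12  # 730, averaged over a year

def monthly_control_plane_cost(hourly_rate: float) -> float:
    """Return the monthly cost of one managed control plane in dollars."""
    return round(hourly_rate * HOURS_PER_MONTH, 2)

print(monthly_control_plane_cost(0.10))  # EKS: 73.0 dollars/month
```

Multiply by your cluster count — ten EKS clusters is roughly $730/month before a single node is billed.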
## Lesson 5: Do Not Skip the Platform Layer
A bare Kubernetes cluster is not production-ready. You need a platform layer on top. Here is what we deploy on every cluster before any application workloads:
| Component | Our Default Choice |
|---|---|
| Ingress controller | ingress-nginx or Cilium Gateway API |
| Certificate management | cert-manager with Let’s Encrypt |
| External DNS | external-dns |
| Monitoring | Prometheus + Grafana (via kube-prometheus-stack) |
| Log aggregation | Loki + Promtail |
| Secret management | External Secrets Operator + Vault |
| GitOps | ArgoCD |
| Policy enforcement | Kyverno |
We manage this entire platform layer with ArgoCD using the App of Apps pattern, which we describe in detail in our later post about building a production ArgoCD setup.
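As a rough sketch of that pattern, the root Application simply points ArgoCD at a directory whose contents are themselves Application manifests (the repo URL and path below are placeholders, not our actual layout):

```yaml
# Root "App of Apps": ArgoCD syncs a git directory containing
# child Application manifests. Repo URL and path are placeholders.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: platform-root
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/platform.git
    targetRevision: main
    path: apps
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
```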
## Lesson 6: Persistent Storage Is Still the Hardest Part
Stateful workloads on Kubernetes are significantly more complex than stateless ones. Our advice:
- **Use managed databases when possible.** Running PostgreSQL on Kubernetes is possible (and we do it with CloudNativePG), but if a managed RDS or Cloud SQL instance meets your requirements, it will save you operational headaches.
- **Choose your CSI driver carefully.** On cloud, use the provider’s CSI driver (EBS CSI, Azure Disk CSI). On bare metal, Longhorn is our go-to — it is simple to operate and provides replication.
- **Test your backup and restore process.** Velero is our standard for cluster-level backup. Test the restore procedure quarterly. A backup you have never tested restoring is not a backup.
```yaml
# Longhorn storage class for self-managed clusters
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: longhorn-replicated
provisioner: driver.longhorn.io
parameters:
  numberOfReplicas: "3"
  staleReplicaTimeout: "2880"
  dataLocality: "best-effort"
reclaimPolicy: Retain
allowVolumeExpansion: true
```
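To make the quarterly restore test something you can actually honor, backups should run unattended. A hedged sketch of a Velero `Schedule` (the namespace list and retention are illustrative):

```yaml
# Nightly Velero backup of selected namespaces.
# Namespace names and TTL are illustrative.
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: nightly-backup
  namespace: velero
spec:
  schedule: "0 2 * * *"  # 02:00 daily, standard cron syntax
  template:
    includedNamespaces:
      - production
    ttl: 720h0m0s  # keep backups for 30 days
```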
## Lesson 7: Upgrade Early and Often
Kubernetes releases a new minor version roughly every 4 months, and each version is supported for approximately 14 months. Falling behind on upgrades is one of the most dangerous things you can do — it compounds. Going from 1.27 to 1.28 is straightforward. Going from 1.25 to 1.30 is a project.
Our upgrade process:
1. Read the changelog and migration guide for the target version.
2. Upgrade the development cluster first. Run the full test suite.
3. Upgrade staging. Let it soak for at least a week with production-like traffic.
4. Upgrade production during a maintenance window.
5. Upgrade one node pool at a time, watching metrics between each.
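Per node, the rollout step usually looks like the following `kubectl` sequence (the node name is a placeholder); cordon and drain before the node is upgraded or replaced, then uncordon and verify:

```shell
# Drain one node at a time during an upgrade (node name is a placeholder).
kubectl cordon app-pool-node-1
kubectl drain app-pool-node-1 --ignore-daemonsets --delete-emptydir-data --timeout=5m
# ...upgrade or replace the node here...
kubectl uncordon app-pool-node-1
# Check for stuck workloads before moving to the next node
kubectl get pods --all-namespaces --field-selector=status.phase=Pending
```

PodDisruptionBudgets on your critical workloads are what make this drain safe; without them, a drain can evict every replica of a service at once.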
On managed services, enable automatic minor version upgrades for non-production clusters to keep them current.
## Lesson 8: Invest in Developer Experience
The most successful Kubernetes deployments we have seen are the ones where application developers do not need to think about Kubernetes. They push code, and things happen. This means:
- Standardized Helm charts or Kustomize bases that teams customize with values files.
- A clear, documented path from “code committed” to “running in production.”
- Self-service namespace creation with guardrails (resource quotas, network policies, Kyverno policies).
- Golden paths, not golden cages. Give teams a paved road but do not prevent them from going off-road when they have a good reason.
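Most of those guardrails are plain Kubernetes objects. A hedged sketch of a per-namespace `ResourceQuota` (the numbers are illustrative starting points, not recommendations for your workload):

```yaml
# Per-team namespace quota (all limits are illustrative).
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota
  namespace: team-a
spec:
  hard:
    requests.cpu: "10"
    requests.memory: 20Gi
    limits.cpu: "20"
    limits.memory: 40Gi
    pods: "50"
```

Pair this with a default-deny NetworkPolicy and a handful of Kyverno rules, and self-service namespaces stop being scary.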
In our previous post about Terraform and Ansible, we discussed how those tools provision and configure the underlying infrastructure. Kubernetes adds a layer on top, but the principle is the same: automate the repeatable, codify the decisions, and let your team focus on delivering value.
## Final Thoughts
Kubernetes is not magic. It is a powerful, complex system that rewards careful planning and punishes shortcuts. The patterns above are not theoretical — they come from real clusters running real workloads, where we learned many of these lessons the hard way.
If you are planning a Kubernetes deployment or struggling with one that has grown beyond your team’s capacity to manage, these are the areas to focus on first. And in our upcoming posts about GitOps and ArgoCD, we will show how to manage all of this declaratively through git.
At robto, we design, deploy, and operate Kubernetes platforms — from single-cluster setups to multi-cluster, multi-cloud architectures. We bring the experience of 50+ deployments to every engagement.