Kubernetes has matured significantly, and so have the patterns for running it effectively. As Rishikesh Baidya, our CTO, who manages our infrastructure, puts it: the basics are now well established, so it's time to get them right.
Platform Considerations
Managed vs. Self-Managed
| Factor | Managed (EKS/GKE/AKS) | Self-Managed |
|---|---|---|
| Operational Burden | Low - automated upgrades | High - manual management |
| Cost at Scale | Higher per-cluster fee | Lower with expertise |
| Customization | Limited | Full control |
| Best For | Most teams (recommended) | Compliance/special needs |
Multi-Cluster Strategy
Resource Management
Right-Sizing
```yaml
resources:
  requests:
    memory: "256Mi"  # Guaranteed minimum
    cpu: "100m"      # 10% of a core
  limits:
    memory: "512Mi"  # Hard cap
    cpu: "500m"      # Burst to 50%
```
Autoscaling
- KEDA for event-driven scaling:
  - Scale based on queue depth (SQS, Kafka)
  - Custom metrics from any source
  - Scale to zero for cost savings
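
As a rough illustration of the KEDA points above, here is a minimal `ScaledObject` sketch that scales a consumer on Kafka lag and allows scale to zero. The Deployment name, namespace, broker address, consumer group, topic, and all numeric values are assumptions chosen for the example, not values from a real cluster.

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: order-consumer-scaler        # hypothetical name
  namespace: team-a                  # hypothetical namespace
spec:
  scaleTargetRef:
    name: order-consumer             # hypothetical Deployment to scale
  minReplicaCount: 0                 # scale to zero when the topic is idle
  maxReplicaCount: 20                # illustrative upper bound
  pollingInterval: 30                # seconds between lag checks
  cooldownPeriod: 300                # seconds of inactivity before scaling to zero
  triggers:
    - type: kafka
      metadata:
        bootstrapServers: kafka.team-a.svc:9092   # hypothetical broker address
        consumerGroup: order-consumers            # hypothetical consumer group
        topic: orders                             # hypothetical topic
        lagThreshold: "50"                        # target lag per replica
```

With the Kafka scaler's default settings, the replica count is typically capped at the number of topic partitions, so `maxReplicaCount` only matters up to that point.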
Security
Pod Security Standards
```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: my-namespace  # placeholder: replace with your namespace name
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/warn: restricted
```
Network Policies
```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
```
Secrets Management
- Never store secrets in Git (even encrypted)
- Use External Secrets Operator + Vault/AWS Secrets Manager
- Rotate secrets automatically
- Audit secret access
- Use short-lived credentials where possible
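
To make the External Secrets Operator recommendation more concrete, the sketch below syncs one value from an external store into a Kubernetes Secret. The store name, namespace, remote key, and property are hypothetical, and the `apiVersion` depends on the operator release you run (newer releases also serve `external-secrets.io/v1`).

```yaml
apiVersion: external-secrets.io/v1beta1   # adjust to the version your operator serves
kind: ExternalSecret
metadata:
  name: db-credentials                 # hypothetical name
  namespace: team-a                    # hypothetical namespace
spec:
  refreshInterval: 1h                  # re-sync hourly so upstream rotations propagate
  secretStoreRef:
    name: aws-secrets-manager          # hypothetical ClusterSecretStore, defined separately
    kind: ClusterSecretStore
  target:
    name: db-credentials               # Kubernetes Secret the operator creates
    creationPolicy: Owner              # Secret is removed along with the ExternalSecret
  data:
    - secretKey: password              # key inside the generated Secret
      remoteRef:
        key: prod/team-a/db            # hypothetical entry in the external store
        property: password             # field within that entry
```

Only this reference lives in Git; the secret value itself stays in the external store.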
Deployment Patterns
GitOps (Standard Approach)
Progressive Delivery
- Use Argo Rollouts for:
  - Canary deployments: Roll out to 5% → 25% → 50% → 100%
  - Blue-green deployments: Instant switch with instant rollback
  - Automatic rollbacks: Based on analysis runs (error rates, latency)
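
The canary progression and analysis-driven rollback described above might look roughly like this in Argo Rollouts. The application name, image, pause durations, Prometheus address, and the `http_requests_total` metric with its `app` and `code` labels are all assumptions for illustration.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: web                            # hypothetical application
spec:
  replicas: 20
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: web
          image: registry.example.com/web:1.2.3   # hypothetical image
          ports:
            - containerPort: 8080
  strategy:
    canary:
      analysis:                        # background analysis; a failed run aborts the rollout
        templates:
          - templateName: error-rate
      steps:
        - setWeight: 5
        - pause: {duration: 10m}       # illustrative soak time
        - setWeight: 25
        - pause: {duration: 10m}
        - setWeight: 50
        - pause: {duration: 10m}
        # after the last step the new version is promoted to 100%
---
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: error-rate
spec:
  metrics:
    - name: error-rate
      interval: 1m                     # take a measurement every minute
      failureLimit: 3                  # tolerate up to 3 failed measurements
      successCondition: result[0] < 0.001   # under 0.1% 5xx responses
      provider:
        prometheus:
          address: http://prometheus.monitoring.svc:9090   # hypothetical address
          query: |
            sum(rate(http_requests_total{app="web",code=~"5.."}[5m]))
            /
            sum(rate(http_requests_total{app="web"}[5m]))
```

Without a traffic router (service mesh or ingress integration), `setWeight` is approximated by the ratio of canary to stable replicas, so small percentages need enough replicas to be meaningful. A query that returns no data may be reported as a measurement error rather than a success, so it is worth testing the query against real traffic first.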
Observability
The Three Pillars
SLO-Based Monitoring
- Define clear objectives:
  - Availability: 99.9% uptime = 8.76 hours downtime/year
  - Latency: p99 < 200ms
  - Error rate: < 0.1% 5xx errors
- Error budgets: Alert when burning budget too fast
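
One common way to alert on "burning budget too fast" is a multi-window burn-rate rule. The Prometheus rule file below is a sketch under stated assumptions: a 99.9% SLO measured over 30 days, and a request counter named `http_requests_total` with a `code` label, which your services may not expose under that name.

```yaml
groups:
  - name: slo-burn-rate
    rules:
      - alert: HighErrorBudgetBurn
        # Error budget for 99.9% is 0.1% (0.001) of requests, or about
        # 43.2 minutes of full downtime per 30 days.
        # A burn rate of 14.4 spends 2% of the 30-day budget in one hour
        # (0.02 * 720 h / 1 h = 14.4).
        # The 5m window stops the alert firing long after an incident ends.
        expr: |
          (
            sum(rate(http_requests_total{code=~"5.."}[1h]))
            / sum(rate(http_requests_total[1h]))
          ) > (14.4 * 0.001)
          and
          (
            sum(rate(http_requests_total{code=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m]))
          ) > (14.4 * 0.001)
        for: 2m
        labels:
          severity: page
        annotations:
          summary: "Error budget is burning at more than 14.4x the sustainable rate"
```

A slower companion rule (for example a lower burn rate over a 6h window) is often paired with this one to catch gradual budget drain at ticket severity.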
Cost Optimization
- Right-size workloads based on actual usage
- Use spot/preemptible instances for fault-tolerant workloads
- Implement autoscaling to match demand
- Clean up unused PVCs, load balancers, and images
- Use reserved capacity for baseline predictable workloads
- Monitor costs with Kubecost or OpenCost
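
As a sketch of steering a fault-tolerant workload toward spot capacity, the Deployment below prefers spot nodes but can still fall back to on-demand. It assumes EKS managed node groups, which label nodes with `eks.amazonaws.com/capacityType`; other platforms use different labels (for example `cloud.google.com/gke-spot` on GKE or `karpenter.sh/capacity-type` with Karpenter). The workload name and image are hypothetical.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: batch-worker                   # hypothetical fault-tolerant workload
spec:
  replicas: 6
  selector:
    matchLabels:
      app: batch-worker
  template:
    metadata:
      labels:
        app: batch-worker
    spec:
      affinity:
        nodeAffinity:
          # "preferred" rather than "required": pods still schedule
          # on on-demand nodes when no spot capacity is available
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              preference:
                matchExpressions:
                  - key: eks.amazonaws.com/capacityType   # assumes EKS managed node groups
                    operator: In
                    values: ["SPOT"]
      terminationGracePeriodSeconds: 60   # time to finish work after an interruption notice
      containers:
        - name: worker
          image: registry.example.com/batch-worker:1.0.0   # hypothetical image
          resources:
            requests:
              memory: "256Mi"
              cpu: "100m"
            limits:
              memory: "512Mi"
              cpu: "500m"
```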
Common Pitfalls to Avoid
| Pitfall | Impact | Fix |
|---|---|---|
| No resource limits | Noisy neighbors, OOM kills | Always set requests AND limits |
| No PodDisruptionBudget (PDB) | All replicas can be evicted at once during node drains and upgrades | Set minAvailable or maxUnavailable |
| No network policies | Lateral movement possible | Default deny + explicit allow |
| No health checks | Traffic to unhealthy pods | Configure liveness + readiness |
| No resource quotas | Runaway costs, cluster exhaustion | Set namespace quotas |
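
Several of the fixes in the table can be sketched as small manifests. The example below shows a PodDisruptionBudget, an explicit allow policy to pair with a default deny, a namespace quota, and a Pod with readiness and liveness probes. All names, labels, ports, probe paths, and quota values are assumptions for illustration and should be sized from your own workloads.

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb
  namespace: team-a                    # hypothetical namespace
spec:
  maxUnavailable: 1                    # evict at most one pod at a time during drains
  selector:
    matchLabels:
      app: web
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-web
  namespace: team-a
spec:
  podSelector:
    matchLabels:
      app: web                         # policy applies to the web pods
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: frontend            # only frontend pods in this namespace may connect
      ports:
        - protocol: TCP
          port: 8080
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-quota
  namespace: team-a
spec:
  hard:
    requests.cpu: "10"
    requests.memory: 20Gi
    limits.cpu: "20"
    limits.memory: 40Gi
    pods: "50"
---
apiVersion: v1
kind: Pod
metadata:
  name: web-probe-example
  namespace: team-a
  labels:
    app: web
spec:
  containers:
    - name: web
      image: registry.example.com/web:1.2.3   # hypothetical image
      ports:
        - containerPort: 8080
      resources:                       # required once the quota above is in place
        requests:
          memory: "256Mi"
          cpu: "100m"
        limits:
          memory: "512Mi"
          cpu: "500m"
      readinessProbe:                  # failing pods are removed from Service endpoints
        httpGet:
          path: /ready                 # hypothetical endpoint
          port: 8080
        periodSeconds: 5
        failureThreshold: 3
      livenessProbe:                   # failing containers are restarted
        httpGet:
          path: /healthz               # hypothetical endpoint
          port: 8080
        initialDelaySeconds: 10
        periodSeconds: 10
        failureThreshold: 3
```

If the namespace also has a default-deny egress policy, pods will need a matching egress allow rule (including DNS) before they can reach anything.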