Kubernetes has matured significantly, and so have the patterns for running it effectively. As Hrishikesh Baidya, our CTO, who manages our infrastructure, puts it: the basics are now well established; it's time to get them right.

- 40%: Cost Savings Possible
- GitOps: Deployment Standard
## Platform Considerations
### Managed vs. Self-Managed
| Factor | Managed (EKS/GKE/AKS) | Self-Managed |
| --- | --- | --- |
| Operational Burden | Low - automated upgrades | High - manual management |
| Cost at Scale | Higher per-cluster fee | Lower with expertise |
| Customization | Limited | Full control |
| Best For | Most teams (recommended) | Compliance/special needs |

Our Recommendation: Use managed Kubernetes unless you have specific compliance needs or operate at massive scale (500+ nodes). We deploy all client projects, such as Radiant Finance, on managed platforms.
### Multi-Cluster Strategy
- Environment Isolation: Separate prod/staging/dev for security and stability
- Regional Deployment: Low latency for global users, data residency compliance
- Blast Radius Limitation: Issues in one cluster don't affect others
- Team Separation: Different teams manage their own clusters
## Resource Management
### Right-Sizing
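```yaml
resources:
  requests:
    memory: "256Mi"  # Guaranteed minimum, reserved by the scheduler
    cpu: "100m"      # 10% of a core
  limits:
    memory: "512Mi"  # Hard cap; the container is OOM-killed above this
    cpu: "500m"      # Burst up to 50% of a core, throttled beyond that
```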
1. Start Conservative: Begin with modest requests/limits based on expected usage.
2. Monitor Actual Usage: Use Prometheus metrics to see real resource consumption over time.
3. Apply VPA Recommendations: Vertical Pod Autoscaler can suggest optimal values based on history.
4. Regular Reviews: Re-evaluate quarterly as workloads change.
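
One way to get VPA suggestions without letting it restart pods is to run it in recommendation-only mode. The sketch below assumes the Vertical Pod Autoscaler components are installed in the cluster and that the workload is a Deployment named `web` (a hypothetical name used only for illustration).

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: web-vpa              # hypothetical name
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web                # hypothetical Deployment to observe
  updatePolicy:
    updateMode: "Off"        # only compute recommendations, never evict pods
```

The recommended values then appear in the object's status (for example via `kubectl describe vpa web-vpa`) and can be copied into the manifest during a quarterly review.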
### Autoscaling
Cluster Autoscaler adds or removes nodes as pending pods and idle capacity change.

KEDA for event-driven scaling:
- Scale based on queue depth (SQS, Kafka)
- Custom metrics from any source
- Scale to zero for cost savings
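
As a rough illustration of queue-based scaling with scale to zero, a KEDA `ScaledObject` for an SQS queue might look like the following. The Deployment name, queue URL, region, and the `aws-credentials` TriggerAuthentication are all placeholders, and the thresholds are arbitrary starting points rather than recommendations.

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: worker-scaler                  # hypothetical name
spec:
  scaleTargetRef:
    name: worker                       # hypothetical Deployment consuming the queue
  minReplicaCount: 0                   # scale to zero when the queue is empty
  maxReplicaCount: 20
  triggers:
    - type: aws-sqs-queue
      metadata:
        queueURL: https://sqs.us-east-1.amazonaws.com/123456789012/jobs  # placeholder
        queueLength: "5"               # target messages per replica
        awsRegion: us-east-1
      authenticationRef:
        name: aws-credentials          # hypothetical TriggerAuthentication
```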
## Security
Security First: Kubernetes provides strong security features, but many of them must be explicitly enabled, and default configurations are often permissive. See our Secure Software Development guide for more.
### Pod Security Standards
```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: my-namespace   # placeholder; a Namespace needs a name
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/warn: restricted
```
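
Once a namespace enforces the `restricted` profile, pods need a matching security context to be admitted. The following is a minimal sketch of such a pod; the name, image, and UID are placeholders, and the image is assumed to work as a non-root user.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: restricted-example                   # hypothetical name
spec:
  securityContext:
    runAsNonRoot: true
    runAsUser: 10001                         # any non-zero UID the image supports
    seccompProfile:
      type: RuntimeDefault
  containers:
    - name: app
      image: registry.example.com/app:1.0.0  # placeholder image
      securityContext:
        allowPrivilegeEscalation: false
        capabilities:
          drop: ["ALL"]
```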
### Network Policies
Key Practice: Default deny all traffic, then explicitly allow what's needed. This prevents lateral movement in case of a breach.
```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
```
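
With default deny in place, each permitted flow needs its own policy. The sketch below shows two typical allowances: ingress from a `frontend` pod to an `api` pod, and DNS egress for every pod. The labels and port are hypothetical, and the DNS rule assumes the cluster DNS runs in `kube-system`.

```yaml
# Allow frontend pods to reach api pods on TCP 8080
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-api
spec:
  podSelector:
    matchLabels:
      app: api                 # hypothetical label
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: frontend    # hypothetical label
      ports:
        - protocol: TCP
          port: 8080
---
# Allow all pods in this namespace to resolve DNS
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns-egress
spec:
  podSelector: {}
  policyTypes:
    - Egress
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
```

Without the DNS rule, a deny-all egress policy usually breaks service discovery, which is easy to mistake for an application bug.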
### Secrets Management
- Never store secrets in Git (even encrypted)
- Use External Secrets Operator + Vault/AWS Secrets Manager
- Rotate secrets automatically
- Audit secret access
- Use short-lived credentials where possible
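
As an illustration of the External Secrets Operator approach, an `ExternalSecret` can sync a value from an external store into a regular Kubernetes Secret. This sketch assumes the operator is installed with the `v1beta1` API (newer releases may serve a different version) and that a `ClusterSecretStore` named `aws-secrets-manager` already exists; the store name, secret names, and remote key are all hypothetical.

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: db-credentials           # hypothetical name
spec:
  refreshInterval: 1h            # re-read the source hourly, picking up rotations
  secretStoreRef:
    name: aws-secrets-manager    # hypothetical ClusterSecretStore
    kind: ClusterSecretStore
  target:
    name: db-credentials         # Kubernetes Secret to create
  data:
    - secretKey: password
      remoteRef:
        key: prod/db             # hypothetical secret in the external store
        property: password
```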
## Deployment Patterns
### GitOps (Standard Approach)
"GitOps isn't just about deployments; it's about making your entire infrastructure auditable, reproducible, and recoverable. If it's not in Git, it doesn't exist."
Hrishikesh Baidya, CTO, Softechinfra

- Declarative Configs: All manifests stored in Git as the single source of truth
- ArgoCD or Flux: Continuous reconciliation between Git and cluster state
- Drift Detection: Automatic detection and correction of manual changes
- Easy Rollbacks: Revert to any previous state with a git revert
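
To make the reconciliation and drift-correction points concrete, here is a minimal sketch of an Argo CD `Application` with automated sync. The repository URL, path, and names are placeholders; it assumes Argo CD is installed in the `argocd` namespace.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: web                      # hypothetical name
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://example.com/org/platform-config.git  # placeholder repo
    targetRevision: main
    path: apps/web/overlays/prod                           # placeholder path
  destination:
    server: https://kubernetes.default.svc
    namespace: web
  syncPolicy:
    automated:
      prune: true                # delete resources removed from Git
      selfHeal: true             # revert manual changes made in the cluster
    syncOptions:
      - CreateNamespace=true
```

With `selfHeal` enabled, a manual `kubectl edit` is reverted on the next reconciliation, so a rollback becomes a `git revert` on the config repository.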
### Progressive Delivery
Use Argo Rollouts for:
- Canary deployments: Roll out to 5% → 25% → 50% → 100%
- Blue-green deployments: Instant switch with instant rollback
- Automatic rollbacks: Based on analysis runs (error rates, latency)
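
The canary progression above could be expressed as an Argo Rollouts `Rollout` roughly as follows. The names, image, replica count, and pause durations are illustrative assumptions; analysis-based rollback would additionally need an `AnalysisTemplate`, which is omitted here.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: web                                    # hypothetical name
spec:
  replicas: 10
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: web
          image: registry.example.com/web:1.2.3  # placeholder image
  strategy:
    canary:
      steps:
        - setWeight: 5
        - pause: {duration: 10m}
        - setWeight: 25
        - pause: {duration: 10m}
        - setWeight: 50
        - pause: {duration: 10m}
        # after the final step the rollout is promoted to 100%
```

Without a traffic router (service mesh or ingress integration), the weights are approximated by the ratio of canary to stable replicas, so small percentages are only coarse.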
## Observability
### The Three Pillars
- Metrics: Prometheus + Grafana for dashboards and alerting
- Logs: Structured JSON logging with central aggregation
- Traces: OpenTelemetry for distributed tracing across services
### SLO-Based Monitoring
Define clear objectives:
- Availability: 99.9% uptime = 8.76 hours downtime/year
- Latency: p99 < 200ms
- Error rate: < 0.1% 5xx errors
- Error budgets: Alert when burning budget too fast
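
A burn-rate alert for the 0.1% error-rate objective might be sketched as a Prometheus rule like the one below. The metric name `http_requests_total` and its `code` label are assumptions about the application's instrumentation. The factor 14.4 is the burn rate that would consume 2% of a 30-day budget in one hour (14.4 × 1 h / 720 h = 0.02), a commonly used threshold rather than a requirement.

```yaml
groups:
  - name: slo-burn-rate
    rules:
      - alert: ErrorBudgetFastBurn
        # Fire only when both the 1h and 5m windows exceed 14.4x the 0.1% budget
        expr: |
          (
            sum(rate(http_requests_total{code=~"5.."}[1h]))
            /
            sum(rate(http_requests_total[1h]))
          ) > (14.4 * 0.001)
          and
          (
            sum(rate(http_requests_total{code=~"5.."}[5m]))
            /
            sum(rate(http_requests_total[5m]))
          ) > (14.4 * 0.001)
        for: 2m
        labels:
          severity: page
        annotations:
          summary: Error budget is burning at more than 14.4x the sustainable rate
```

Requiring the short window as well keeps the alert from firing long after a brief incident has already ended.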
## Cost Optimization
Typical savings: 30-40% from spot instances, 20-30% from right-sizing, and 15-25% from reserved capacity.

- Right-size workloads based on actual usage
- Use spot/preemptible instances for fault-tolerant workloads
- Implement autoscaling to match demand
- Clean up unused PVCs, load balancers, and images
- Use reserved capacity for baseline predictable workloads
- Monitor costs with Kubecost or OpenCost
## Common Pitfalls to Avoid
Top 5 Mistakes:

| Pitfall | Impact | Fix |
| --- | --- | --- |
| No resource limits | Noisy neighbors, OOM kills | Always set requests AND limits |
| No PDB | All pods killed during upgrades | Set minAvailable or maxUnavailable |
| No network policies | Lateral movement possible | Default deny + explicit allow |
| No health checks | Traffic to unhealthy pods | Configure liveness + readiness |
| No resource quotas | Runaway costs, cluster exhaustion | Set namespace quotas |

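
Three of the fixes in the table can be sketched in a few manifests: a PodDisruptionBudget, a namespace ResourceQuota, and liveness/readiness probes. All names, labels, paths, ports, and quota values below are hypothetical and would need to be adapted to the actual workload.

```yaml
# Keep at least two "web" pods running during voluntary disruptions such as node drains
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: web
---
# Cap the total resources a namespace may request
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota
  namespace: team-a              # hypothetical namespace
spec:
  hard:
    requests.cpu: "10"
    requests.memory: 20Gi
    limits.cpu: "20"
    limits.memory: 40Gi
    pods: "50"
---
# Readiness gates traffic; liveness restarts a stuck container
apiVersion: v1
kind: Pod
metadata:
  name: web-probe-example
  labels:
    app: web
spec:
  containers:
    - name: web
      image: registry.example.com/web:1.2.3   # placeholder image
      ports:
        - containerPort: 8080
      readinessProbe:
        httpGet:
          path: /ready           # hypothetical endpoint
          port: 8080
        periodSeconds: 5
      livenessProbe:
        httpGet:
          path: /healthz         # hypothetical endpoint
          port: 8080
        initialDelaySeconds: 10
        periodSeconds: 10
```

Note that once a quota on CPU and memory is active in a namespace, pods there are rejected unless they declare requests and limits, which also enforces the first fix in the table.
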
Our Approach: All Softechinfra projects follow these best practices. For projects like ChipMaker Hub and TalkDrill, we've achieved 99.95% uptime with these patterns.
## Related Resources
- SaaS Architecture Patterns - Application design for K8s
- Secure Software Development - Security practices
- AI Operations & MLOps - Running ML workloads on K8s