What I Learned Running 5,000 TalkDrill Users on a ₹38k/Month Server Bill
Founder essay. The infra-cost decisions that kept TalkDrill — our in-house voice-AI English app with 5,000+ active users — at a healthy margin. ECS Fargate vs EC2, S3 lifecycle, RDS-to-Aurora, and the bills that surprised us.
Vivek Kumar
November 16, 2025 · 14 min read
TalkDrill — our in-house voice-AI English speaking app for Indian adults — runs on a ₹38,000/month AWS bill at 5,000+ monthly active users. That's roughly ₹7.6 per user per month. When we started in 2023, the same workload was costing us ₹74,000. This post is the founder essay version of how we cut it nearly in half — what we changed, what we tried that did not work, and the specific dollar bills (because AWS bills you in dollars and the bank converts at end-of-month spot) that made me change my mind.
- Monthly active users: 5,000+
- Current AWS bill: ₹38k/mo
- Cost reduction over 18 months: ₹74k → ₹38k
- Effective per-user infra cost: ₹7.6/user
## The Answer in 60 Words
We moved from ECS Fargate to ECS-on-EC2 with Savings Plans (saved ~38% of compute), put audio recordings on S3 Intelligent-Tiering with a lifecycle policy that drops to Glacier after 60 days (saved ₹4,200/month), migrated from RDS Postgres to Aurora Postgres (saved 22% with read-replica autoscaling instead of manual provisioning), and aggressively cached static voice prompts behind CloudFront. ₹38,000 total.
## Why This Matters For Indian Founders
The unit-economics conversation for Indian SaaS startups has shifted. In 2023, we were chasing growth and the AWS bill was a rounding error in our pitch deck. By Q2 2025, with [Indian SaaS funding down meaningfully year-over-year](https://www.bain.com/insights/indian-software-and-saas-report-2025/), every margin point matters. A ₹74,000 monthly AWS bill on 5,000 paying users is ₹14.8/user — a 13% margin hit at our ₹110/month average revenue per user. ₹38,000 is ₹7.6/user — under 7% of ARPU. That is the difference between "we are profitable" and "we are nearly profitable."
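The per-user arithmetic above is simple enough to keep in a script and rerun whenever the bill or the user count moves. A minimal sketch in Python, using only the figures quoted in this section (the function name is illustrative):

```python
def infra_cost_per_user(monthly_bill_inr: float, users: int, arpu_inr: float) -> tuple[float, float]:
    """Return (cost per user in INR, that cost as a share of ARPU)."""
    per_user = monthly_bill_inr / users
    return per_user, per_user / arpu_inr


for label, bill in [("2023 bill", 74_000), ("Nov 2025 bill", 38_000)]:
    per_user, share = infra_cost_per_user(bill, users=5_000, arpu_inr=110)
    print(f"{label}: ₹{per_user:.1f}/user, {share:.1%} of ARPU")
```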
I am writing this in November 2025. The infra decisions below are the ones I would have made faster if I had started over.
## The Stack (As Of November 2025)
| Layer | Choice | Monthly Cost (Nov 2025) | Notes |
|---|---|---|---|
| Compute | ECS-on-EC2 (4× t4g.large with Savings Plan) | ₹11,400 | Was ₹19,800 on Fargate. Migration took 6 days of engineering. |
| Database | Aurora Postgres (db.t4g.medium) + 1 read replica autoscale | ₹7,800 | Was ₹10,000 on RDS Postgres with manual replicas. |
| Storage | S3 with Intelligent-Tiering, 14 TB total | ₹2,100 | 60-day lifecycle to Glacier on completed-session audio. |
| CDN | CloudFront, ~480 GB/month transfer | ₹4,400 | Voice prompts (static), static assets, signed URLs for user audio playback. |
| AI APIs | Claude Sonnet 4.7 + Whisper (self-hosted on g5.xlarge) | ₹6,300 | Self-hosting Whisper saved us OpenAI's ₹0.50/minute pricing. ROI at our volume: 4.2 months. |
| Realtime + WebSocket | API Gateway WebSocket + Lambda | ₹2,800 | Per-message billing surprised me at first. Now stable. |
| Monitoring + Logging | CloudWatch + Better Stack | ₹1,400 | We pulled most logs into Better Stack to escape CloudWatch ingestion costs. |
| Misc (DNS, SES, Route53, KMS) | — | ₹1,800 | The "long tail" line items. |
| Total | | ₹38,000 | |
## The Big Decisions (And The One I Got Wrong First)
### Decision 1: Move from Fargate to ECS-on-EC2
I was a Fargate evangelist for 18 months. No box to manage, scale-to-zero in some configurations, simpler IAM. The trade-off, [as AWS itself documents](https://aws.amazon.com/blogs/containers/theoretical-cost-optimization-by-amazon-ecs-launch-type-fargate-vs-ec2/), is roughly a 20–30% premium over equivalent EC2.
At ~30K vCPU-hours/month on Fargate, the premium was costing us ₹6,400/month in raw compute, plus the harder-to-quantify cost of zero room for spot instances. We migrated to 4× t4g.large EC2 instances behind ECS in a 6-day sprint. The migration itself was straightforward — the same task definitions work — but the surprises were:
1. Spot instances introduced 0.4% nightly task evictions. We solved this with longer drain intervals and a "warm pool" of 1 always-on instance.
2. The Savings Plan commitment was scary. We committed to 1-year reserved capacity for 50% of our baseline. AWS gave us 38% off the list price. In 18 months, we have not under-utilised the commitment for a single hour.
3. Patching is back on our plate. We use [Bottlerocket OS](https://aws.amazon.com/bottlerocket/) which auto-updates. Total ops time: roughly 2 hours/month.
For an Indian startup under 30 engineers, my framing is now: Fargate is correct until your monthly bill is over ₹15,000. Above that, the EC2-with-Savings-Plan math wins on every dimension except wall-clock engineering time, provided you have a senior engineer who can absorb 6 days of migration work for a 38% saving.
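One way to sanity-check the "6 days for the saving" trade is a payback calculation. The sketch below uses the compute figures from the stack table (₹19,800 on Fargate, ₹11,400 on EC2). The engineer day rate is a placeholder assumption, not a number from our books, so plug in your own.

```python
def payback_months(engineer_days: float, day_rate_inr: float, monthly_saving_inr: float) -> float:
    """Months until a one-off migration effort is repaid by the monthly saving."""
    if monthly_saving_inr <= 0:
        raise ValueError("monthly saving must be positive")
    return engineer_days * day_rate_inr / monthly_saving_inr


# Compute row of the stack table: ₹19,800 on Fargate, ₹11,400 on ECS-on-EC2
monthly_saving = 19_800 - 11_400

# ASSUMPTION: ₹8,000 per senior-engineer day is a hypothetical placeholder
print(round(payback_months(6, 8_000, monthly_saving), 1))  # 5.7
```

At a higher day rate the payback stretches proportionally, which is why the threshold depends on who is available to do the work.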
### Decision 2: S3 Intelligent-Tiering with Glacier Lifecycle
TalkDrill stores every user's spoken practice session as an audio file (FLAC, ~800 KB/minute compressed). Across 5,000 users averaging 22 minutes/week of practice, that is roughly 380 GB/month of fresh audio, or about 4.5 TB a year.
In 2023, we paid S3 Standard rates on the lot: ₹6,300/month for storage alone. The fix was a short lifecycle configuration on the bucket.
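A minimal sketch of such a lifecycle rule, written in Python in the shape boto3's `put_bucket_lifecycle_configuration` expects. It is an illustration rather than our exact production config: the `sessions/completed/` prefix and the bucket name in the comment are hypothetical placeholders. The 60-day and 365-day thresholds are the ones described in this section.

```python
def audio_lifecycle_rules(prefix: str = "sessions/completed/") -> dict:
    """Build an S3 lifecycle configuration for completed-session audio.

    The prefix is a hypothetical placeholder; adjust it to your key layout.
    """
    return {
        "Rules": [
            {
                "ID": "archive-completed-session-audio",
                "Status": "Enabled",
                "Filter": {"Prefix": prefix},
                "Transitions": [
                    # Cheaper storage, still fast enough for in-app replay
                    {"Days": 60, "StorageClass": "GLACIER_IR"},
                    # Restores take hours, so the app needs a "request" flow
                    {"Days": 365, "StorageClass": "DEEP_ARCHIVE"},
                ],
            }
        ]
    }


# Applying it needs boto3 and AWS credentials; the bucket name is hypothetical:
#   import boto3
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="talkdrill-session-audio",
#       LifecycleConfiguration=audio_lifecycle_rules(),
#   )
```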
The friction: users want to replay sessions older than 60 days roughly 1.4% of the time. Glacier Instant Retrieval returns objects in milliseconds, so the user does not feel a lag. Deep Archive (after 365 days) can take up to 12 hours to restore, so we expose these as "older sessions" with a "request" button that triggers a Lambda. We see roughly 18 such requests/month.
Net storage cost dropped from ₹6,300 to ₹2,100/month. Engineering time: 4 hours.
### Decision 3: Migrate RDS Postgres to Aurora Postgres
This one I got wrong first. I assumed Aurora was strictly more expensive (it is, per-vCPU). What I missed was the autoscaling read replica. With RDS, we had 2 read replicas always running for peak hours (8 pm–11 pm IST). With Aurora Serverless v2 plus read replica autoscaling, we provision 0.5 ACUs at 4 am and scale to 4 ACUs at 9 pm based on actual load.
Net database cost dropped from ₹10,000 to ₹7,800/month. The migration window was 11 minutes (Aurora's snapshot-based clone of RDS). The unexpected win: write performance improved meaningfully. My best explanation is Aurora's commit-once-replicate-elsewhere model, which pays off most with multi-region replicas but seems to help even though we run single-region today.
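To make the autoscaling argument concrete, here is a back-of-envelope sketch comparing capacity-hours per day for capacity fixed at peak against capacity that follows load. The 0.5 and 4 ACU bounds and the 8 pm to 11 pm peak come from the text above. The shoulder hours and their 2 ACU level are invented purely for illustration, so treat the ratio as indicative only.

```python
def daily_acu_hours(profile: dict[int, float], default_acus: float) -> float:
    """Sum ACU-hours over 24 hours; hours missing from `profile` use `default_acus`."""
    return sum(profile.get(hour, default_acus) for hour in range(24))


# Peak 20:00-23:00 at 4 ACUs (from the text); shoulder hours are HYPOTHETICAL
profile = {18: 2.0, 19: 2.0, 20: 4.0, 21: 4.0, 22: 4.0, 23: 2.0}

autoscaled = daily_acu_hours(profile, default_acus=0.5)
fixed_at_peak = daily_acu_hours({}, default_acus=4.0)

print(f"autoscaled:    {autoscaled} ACU-hours/day")
print(f"fixed at peak: {fixed_at_peak} ACU-hours/day")
print(f"ratio:         {autoscaled / fixed_at_peak:.0%}")
```

Capacity-hours are not rupees. Aurora costs more per unit of capacity, which is one reason the real saving was 22% rather than anything close to this ratio.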
### Decision 4: Self-host Whisper for Speech-to-Text
OpenAI's Whisper API was costing us ₹0.50/minute of audio. At 22 minutes/week × 5,000 users × 4.3 weeks/month, we were doing ~470,000 minutes/month. Bill: ₹2.35 lakh/month and rising fast.
We rented a g5.xlarge GPU instance and self-hosted Whisper-large-v3 with [faster-whisper](https://github.com/SYSTRAN/faster-whisper). Throughput on a single GPU: ~24x realtime. Cost: ₹6,300/month flat (with Savings Plan), regardless of volume. Break-even vs OpenAI Whisper: 12,600 minutes/month. We are at roughly 37x that.
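The break-even arithmetic is worth keeping as a small script, because it changes every time usage or API pricing moves. A sketch using only the figures quoted above (function names are illustrative):

```python
def monthly_minutes(users: int, minutes_per_week: float, weeks_per_month: float = 4.3) -> float:
    """Total audio minutes transcribed per month."""
    return users * minutes_per_week * weeks_per_month


def break_even_minutes(fixed_monthly_cost_inr: float, api_price_per_minute_inr: float) -> float:
    """Monthly volume above which a flat-cost GPU beats per-minute API pricing."""
    return fixed_monthly_cost_inr / api_price_per_minute_inr


minutes = monthly_minutes(users=5_000, minutes_per_week=22)
threshold = break_even_minutes(6_300, 0.50)

print(f"audio per month:        {minutes:,.0f} minutes")
print(f"API bill at ₹0.50/min:  ₹{minutes * 0.50:,.0f}")
print(f"break-even volume:      {threshold:,.0f} minutes")
print(f"multiple of break-even: {minutes / threshold:.1f}x")
```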
Engineering cost: 11 days to ship the self-hosted pipeline, including Kubernetes-style scheduling so we do not run 1 idle GPU at 4 am.
### Decision 5: CloudFront With Aggressive Static Caching
This is the boring decision that quietly saved ₹3,800/month. Voice prompts (the recorded examples that users hear) are 100% static. We started out serving them directly from S3 with no CDN. Putting CloudFront in front cut S3 origin pulls by 96% and the data transfer cost to the user (which is the expensive line item) by 78%. Setup time: 2 hours.
## What I Tried That Did Not Work
1. Cloudflare R2 instead of S3. R2 has zero egress fees and was cheaper on paper. We migrated 600 GB as a test. Two pain points: (a) the lifecycle policies are less mature than S3's Glacier tiering, and the Glacier tiering was where the real saving came from, and (b) the Aurora-to-storage path stays AWS-internal and free, while Aurora-to-R2 went through public internet. We rolled back after 14 days.
2. Lambda for the entire backend. I read [a great Reddit thread on r/aws](https://www.reddit.com/r/aws/) about a startup running 100% Lambda on a tiny budget. Tried it for a 2-week experiment. Lambda's cold-start latency on the speech path was unacceptable for our use case (users notice 800 ms cold starts during a live conversation). EC2 stays.
3. DynamoDB for user data. We have a relational schema (users → courses → sessions → utterances). Forcing this into DynamoDB single-table design ate 3 weeks of engineering for no measurable cost saving over Aurora Serverless v2. Postgres remains correct for relational data.
4. Reserved Instances over Savings Plans. RIs are cheaper on paper but lock you to instance family. Savings Plans flex across t4g, m7g, c7g — we have moved instance type three times in 18 months. The flexibility was worth the 4–6% premium.
## The Cost Audit Process We Run Monthly
1. Pull the AWS Cost Explorer CSV (1st of the month)
2. Compare to last month, line by line
3. Flag any line with an increase of 8% or more
4. Investigate the top 3 flagged items
5. Action or accept by week 2
The audit takes ~90 minutes including investigation. In 18 months, the worst surprise was a developer leaving an EC2 g5.xlarge running over a long weekend (₹4,800 wasted). We added a per-tag budget alert ($50 weekly) so it never happens again.
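Steps 2 to 4 of the audit are mechanical enough to script. A minimal sketch, assuming the two months have already been parsed from the Cost Explorer CSV into dictionaries of line item to cost. Ranking the flagged items by absolute rupee increase is my assumption about what "top 3" should mean, and the sample figures are made up.

```python
def flag_increases(previous: dict[str, float], current: dict[str, float],
                   threshold: float = 0.08, top_n: int = 3) -> list[tuple[str, float]]:
    """Return up to `top_n` (line item, increase) pairs whose cost rose by at
    least `threshold`, largest absolute increase first.
    Line items that did not exist last month are always flagged."""
    flagged = []
    for item, cost in current.items():
        before = previous.get(item, 0.0)
        increase = cost - before
        if increase <= 0:
            continue
        if before == 0 or increase / before >= threshold:
            flagged.append((item, increase))
    flagged.sort(key=lambda pair: pair[1], reverse=True)
    return flagged[:top_n]


# HYPOTHETICAL sample data in INR, for illustration only
last_month = {"EC2": 11_000, "S3": 2_100, "CloudWatch": 400, "Lambda": 2_800}
this_month = {"EC2": 11_400, "S3": 2_100, "CloudWatch": 3_200, "Lambda": 3_100, "NAT Gateway": 900}

for item, increase in flag_increases(last_month, this_month):
    print(f"{item}: +₹{increase:,.0f}")
```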
## A Detail That Surprised Me
The single biggest cost I underestimated was CloudWatch Logs ingestion. At one point, our verbose-logged Lambdas were costing ₹3,200/month in CloudWatch alone. The fix: switch the application logger to Better Stack, set CloudWatch retention to 3 days for Lambda logs, and route critical alerts via SNS instead of grep'ing CloudWatch. Cost dropped to ₹400 for CloudWatch plus ₹1,000 for Better Stack: a net ₹1,800/month saving with arguably better observability.
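The retention half of that fix can be scripted. A rough sketch, assuming a boto3 CloudWatch Logs client is passed in and that the Lambda log groups live under the default `/aws/lambda/` prefix. The function name is illustrative, and this is not the exact script we ran.

```python
def shorten_log_retention(logs_client, prefix: str = "/aws/lambda/", days: int = 3) -> list[str]:
    """Set retention to `days` on every log group under `prefix`.

    `logs_client` is expected to behave like boto3.client("logs").
    Returns the names of the log groups that were changed.
    """
    changed = []
    paginator = logs_client.get_paginator("describe_log_groups")
    for page in paginator.paginate(logGroupNamePrefix=prefix):
        for group in page["logGroups"]:
            # Groups that never expire have no "retentionInDays" key at all
            if group.get("retentionInDays") != days:
                logs_client.put_retention_policy(
                    logGroupName=group["logGroupName"],
                    retentionInDays=days,
                )
                changed.append(group["logGroupName"])
    return changed


# Usage (needs boto3 and AWS credentials):
#   import boto3
#   print(shorten_log_retention(boto3.client("logs")))
```

Passing the client in rather than creating it inside the function makes it easy to dry-run against a stub before touching a real account.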
## Where This Sits In Our Wider Cost-Engineering Practice
This is the founder-essay companion to our engineering team's deeper writeups:
- The architecture behind TalkDrill as a project page.
- A separate post on WhatsApp Business API pricing surprises that hit us during marketing campaign work.
- Our other in-house product, PenLeap, runs on a similar stack at a different scale curve — see the PenLeap rubric scoring engine writeup for the contrast.
If you are running an early-stage Indian SaaS on AWS and spending more than ₹50,000/month, the chances are very good that ₹15,000–25,000 of it is recoverable in a 2-week sprint. We have run cost audits for clients in fintech, edtech, and logistics where the savings paid for the audit in week 1.
## The Pre-Cost-Audit Checklist
- At minimum, tag every resource with a service-name, environment, and owner
- Set per-service budget alerts in AWS Budgets (₹2,000 thresholds work for us)
- Run Trusted Advisor's cost-optimisation report monthly
- Move CloudWatch logs to short retention (3 days) and ship critical logs to a cheaper tier (Better Stack, Logtail, Loki)
- Audit S3 buckets for lifecycle policies; the default is "no policy" and that costs money
- Check Reserved Instance / Savings Plan utilisation weekly
- Kill any EC2 instance idle for 7+ days (we have a Lambda that does this for non-prod)
- Run `aws ce get-rightsizing-recommendation` quarterly; we usually find 1-2 instances to downsize
- Review CloudFront cache-hit ratio and tune cache headers if below 92%
- Check the data-transfer-out tab; this is the silent killer in most bills
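The tagging item is the one everything else in the checklist depends on, and it is easy to check automatically. A small sketch, assuming you have already exported resource IDs and their tags into a dictionary. The exact tag key spellings are my assumption, and the sample resources are made up.

```python
REQUIRED_TAGS = ("service-name", "environment", "owner")


def missing_tags(resources: dict[str, dict[str, str]],
                 required: tuple[str, ...] = REQUIRED_TAGS) -> dict[str, list[str]]:
    """Map resource ID to the required tag keys that are absent or blank."""
    report = {}
    for resource_id, tags in resources.items():
        missing = [key for key in required if not tags.get(key, "").strip()]
        if missing:
            report[resource_id] = missing
    return report


# HYPOTHETICAL inventory, for illustration only
inventory = {
    "i-0aaa": {"service-name": "api", "environment": "prod", "owner": "backend"},
    "i-0bbb": {"service-name": "whisper", "owner": ""},
}
print(missing_tags(inventory))
```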
## When Not to Optimise
If your bill is under ₹10,000/month, do not optimise. Spend that engineering time shipping features. The cost-engineering ROI curve is genuinely flat below this threshold — your time is worth more than the AWS savings.
If your bill spiked because you launched something new and revenue is climbing in lockstep, do not optimise. Get to 90 days of stable load first; otherwise you optimise the wrong thing.
If you are running on a single machine and dreaming of "going serverless to save costs," skip it. Serverless is great for variable load. A fixed-load workload on a single VPS at ₹740/month is genuinely cheaper than Lambda's per-invocation pricing.
## FAQ
### How long does an AWS cost audit usually take?
For a startup with bills under ₹2 lakh/month, 5–7 working days for the audit and a written recommendation. Implementation depends on what we find — usually 2–4 weeks of engineering for a 30–40% reduction.
### What is the single biggest savings opportunity for most Indian startups?
Data transfer out and cross-AZ traffic. We routinely find 15–25% of the bill on data movement that did not need to happen, usually because the application is spread across availability zones for vanity HA reasons.
### Should I move from RDS to Aurora?
If your peak hours are 3x your trough hours, yes. Aurora Serverless v2 saves real money. If your load is flat, RDS Reserved is cheaper.
### What about other clouds — GCP or Azure?
We have audited clients on both. GCP's pricing is similar to AWS at our scale; the GCP discount programs (Committed Use Discounts) are slightly more flexible but slightly less aggressive. Azure has better discounts for Microsoft-shop customers (Office 365, Active Directory). For a green-field Indian startup with no Microsoft commitment, AWS or GCP is the call.
### Why not move off-cloud entirely to bare metal?
We considered Hetzner. The ₹38,000 bill would drop to maybe ₹14,000. The catch: we would need a part-time DevOps person at ₹1.4 lakh/month minimum to run it well. The math does not work yet. Revisit at ₹2 lakh+ monthly cloud spend.
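The "revisit at ₹2 lakh+" threshold can be derived rather than guessed. The sketch below assumes, and this is a strong assumption, that bare-metal cost stays the same fraction of cloud cost as spend grows, and that the ops cost stays fixed at ₹1.4 lakh/month.

```python
def bare_metal_break_even(cloud_bill: float, bare_metal_bill: float, ops_cost: float) -> float:
    """Cloud spend at which bare metal plus a fixed ops cost equals the cloud bill.

    ASSUMES bare-metal cost remains the same fraction of cloud cost at any scale.
    """
    ratio = bare_metal_bill / cloud_bill
    if ratio >= 1:
        raise ValueError("bare metal is not cheaper than cloud at this ratio")
    return ops_cost / (1 - ratio)


today = 14_000 + 140_000  # bare metal plus part-time DevOps, per month
print(f"bare-metal total today: ₹{today:,} vs ₹38,000 on AWS")
print(f"break-even cloud spend: ₹{bare_metal_break_even(38_000, 14_000, 140_000):,.0f}")
```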
### Are there Indian SaaS tools that do this cost auditing?
[Vantage](https://vantage.sh/) and [CloudZero](https://www.cloudzero.com/) are global tools we have used. For Indian-cloud-spend-specific advice, Cloudonaut and our own internal audits are what we trust. AWS's own cost intelligence dashboards have improved meaningfully in 2025.
### Did self-hosting Whisper hurt latency?
No. We get sub-700 ms for a 5-second audio clip on g5.xlarge. OpenAI Whisper API was averaging 1.4 seconds. Self-hosting won on latency too, not just cost.
Need a cost audit on your AWS / GCP / Azure bill?
We run cloud cost audits for Indian SaaS and SMB workloads spending ₹50,000–₹5 lakh/month. Fixed-fee engagement, 5–7 working days, written report with prioritised changes and expected savings. Typical audit pays back in week 1. The first call is with the engineer who would lead your audit.