AWS US-EAST-1 Outage Today Took Down DynamoDB for 15 Hours: A Diwali-Eve Reality Check for Indian SaaS
AWS US-EAST-1 had a 15-hour DynamoDB outage starting at 11:48 PM PDT Oct 19 — and it cascaded across 70+ services. On Diwali eve. Region-redundancy patterns that are NOT multi-region overkill.
Hrishikesh Baidya
October 20, 2025 · 14 min read
At 11:48 PM PDT on October 19, 2025, AWS US-EAST-1 began a 15-hour outage. Root cause: a race condition in DynamoDB's automated DNS management produced an empty DNS record for dynamodb.us-east-1.amazonaws.com that the system could not self-heal. The cascade hit 70+ AWS services, including new EC2 instance launches, Lambda invocations, Fargate task launches, and Route 53 — taking down Slack, Snapchat, Atlassian, and many others. AWS recovered the region by 2:20 PM PDT October 20 (ThousandEyes outage analysis, The Register on the AWS post-mortem, InfoQ post-mortem coverage).
For Indian SMBs and SaaS firms running on US-EAST-1, the timing was particularly cruel: 11:48 PM PDT on October 19 is 12:18 PM IST on October 20, which was Diwali day in India and Diwali eve on the US calendar. This article covers which region-redundancy patterns Indian SaaS should adopt, without falling into the trap of multi-region overkill that costs 3x and protects nothing.
- 15 hrs: Total US-EAST-1 disruption (11:48 PM PDT Oct 19 - 2:20 PM PDT Oct 20)
- 70+: AWS services affected by the cascade
- DNS: Race condition root cause in DynamoDB DNS management
- 12:18 PM IST: Outage started, in the middle of Diwali day in India
## The 60-second answer
If your SaaS runs in US-EAST-1, you have three pragmatic resilience options at increasing cost: (1) move primary infrastructure to ap-south-1 Mumbai or ap-south-2 Hyderabad, which are closer to Indian users and less affected by US-EAST-1 incidents anyway; (2) add a secondary AWS region with replicated databases and active-passive failover (roughly 1.4-1.8x the single-region cost); (3) full multi-region active-active (roughly 2-4x the single-region cost, justified for high-revenue SaaS only). Most Indian SaaS should be at option 1 already and nowhere near option 3. The middle option is for firms where 4-hour downtime during a US-EAST-1 incident becomes a board-level conversation.
## Why this matters now
US-EAST-1 (N. Virginia) is AWS's oldest and largest region. It runs services and control planes that other regions depend on — IAM updates often originate there, certain global services route through there. When US-EAST-1 fails, the impact is disproportionate to "just one region." This is well-known to AWS architects and yet, more than a decade in, US-EAST-1 incidents still take down a meaningful slice of the internet.
For Indian SaaS, the fix is partly architectural and partly geographic. Mumbai (ap-south-1) and Hyderabad (ap-south-2) are AWS regions that exist precisely because Indian customers and Indian data residency requirements call for them. Yet a startling number of Indian SaaS startups still default to US-EAST-1 because the original tutorials they followed were US-EAST-1 examples. That choice was free 3 years ago; it costs you customer trust today.
The Diwali-eve detail. The outage started at 12:18 PM IST on Diwali day. A typical Indian SaaS team would have minimum staffing — Diwali is a major cultural holiday, on-call rotations are often skeleton, family obligations come first. The combination of "third-party AWS outage" and "skeleton internal staffing" is the worst-case scenario for incident response. Plan staffing accordingly for festival days; assume external incidents will land at the worst possible time.
## What actually happened (technical timeline)
Oct 19, 11:48 PM PDT (12:18 PM IST Oct 20): Two DynamoDB DNS Enactor processes ran concurrently for the dynamodb.us-east-1.amazonaws.com endpoint. A stale-plan check failed; an old plan overwrote a newer one; cleanup automation deleted the resulting "empty" plan, wiping all DNS records for the regional DynamoDB endpoint.
Oct 19, 11:50 PM PDT: DynamoDB queries from across US-EAST-1 began failing with DNS resolution errors. Cascade began.
Oct 20, 12:30 AM PDT (1:00 PM IST): AWS engineers identified the empty DNS record. Manual restoration began.
Oct 20, 2:00 AM PDT (2:30 PM IST): DNS partially restored. DynamoDB began responding. But the cascade had taken down EC2 launch, Lambda, Fargate, IAM, Route 53 control plane.
Oct 20, 6:00 AM-12:00 PM PDT (6:30 PM-12:30 AM IST Oct 21): Recovery of dependent services. Most user-facing SaaS reported intermittent functionality through this window.
Oct 20, 2:20 PM PDT (2:50 AM IST Oct 21): AWS declared the incident resolved. Total duration: ~15 hours. Indian businesses operating on US-EAST-1 effectively lost the entire Diwali working window.
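The PDT-to-IST conversions in this timeline are easy to get wrong by a day, so here is a minimal sketch that re-derives the start, the end, and the total duration. It assumes fixed offsets (PDT is UTC-7, IST is UTC+5:30), which hold for this window; the variable names are illustrative.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets: PDT is UTC-7 and IST is UTC+5:30 (no DST change in this window).
PDT = timezone(timedelta(hours=-7), "PDT")
IST = timezone(timedelta(hours=5, minutes=30), "IST")

start = datetime(2025, 10, 19, 23, 48, tzinfo=PDT)  # first DNS failures
end = datetime(2025, 10, 20, 14, 20, tzinfo=PDT)    # incident declared resolved


def to_ist(moment: datetime) -> str:
    """Render a timezone-aware datetime as an IST wall-clock string."""
    return moment.astimezone(IST).strftime("%b %d, %I:%M %p IST")


start_ist = to_ist(start)
end_ist = to_ist(end)
duration_hours = (end - start).total_seconds() / 3600

print(start_ist)                       # Oct 20, 12:18 PM IST
print(end_ist)                         # Oct 21, 02:50 AM IST
print(f"{duration_hours:.1f} hours")   # 14.5 hours
```

The exact span is 14 hours 32 minutes, which is where the "~15 hours" figure comes from.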
## The 3 region-redundancy patterns (actual, not theoretical)
1. Mumbai-first architecture: Make ap-south-1 (Mumbai) your primary region. US-EAST-1 incidents largely do not affect you; the exceptions are the few global services with US-EAST-1 dependencies. Latency to Indian users improves dramatically. DPDP-compliance posture improves. Cost: roughly the same as US-EAST-1; sometimes slightly higher for niche services. ~80% of Indian SaaS should be doing this already.
2. Active-passive secondary region: Primary in Mumbai (ap-south-1). Database replication to Hyderabad (ap-south-2) or Singapore (ap-southeast-1). Failover via Route 53 health checks; RTO 15-60 minutes. Cost uplift: 1.4-1.8x. Justifiable for SaaS where 1 hour of downtime costs more than the secondary region per month.
3. Multi-region active-active: Live traffic to multiple regions simultaneously. Global load balancing. Multi-region writes (Aurora Global Database, DynamoDB Global Tables). Cost uplift: 2-4x. RTO under 5 minutes. Justifiable only for SaaS with >₹50 cr ARR or hard regulatory uptime requirements.

Multi-cloud (anti-pattern for most): Primary on AWS, secondary on GCP or Azure. Sounds resilient on paper. In practice: 3-5x operational complexity, no real availability gain (each cloud has multi-region options), forces lowest-common-denominator design. Skip unless you have very specific regulatory or vendor-lock-in concerns.
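For the active-passive pattern, the Route 53 side comes down to a pair of failover records, with a health check attached to the primary. The sketch below only builds the change batch as a plain dictionary; the hostnames, set identifiers, and health-check ID are hypothetical placeholders, and the helper `failover_record` is illustrative rather than part of any library.

```python
def failover_record(name, target, role, set_id, health_check_id=None, ttl=60):
    """Build one UPSERT change for a Route 53 failover CNAME record."""
    if role not in ("PRIMARY", "SECONDARY"):
        raise ValueError("role must be PRIMARY or SECONDARY")
    record = {
        "Name": name,
        "Type": "CNAME",
        "SetIdentifier": set_id,   # distinguishes records sharing one name
        "Failover": role,
        "TTL": ttl,                # short TTL so resolvers pick up a failover quickly
        "ResourceRecords": [{"Value": target}],
    }
    if health_check_id is not None:
        record["HealthCheckId"] = health_check_id
    return {"Action": "UPSERT", "ResourceRecordSet": record}


# All names and IDs below are hypothetical placeholders.
change_batch = {
    "Comment": "Active-passive: Mumbai primary, Hyderabad secondary",
    "Changes": [
        failover_record(
            "api.example.com.", "api-mumbai.example.com", "PRIMARY",
            set_id="mumbai", health_check_id="hypothetical-health-check-id",
        ),
        failover_record(
            "api.example.com.", "api-hyderabad.example.com", "SECONDARY",
            set_id="hyderabad",
        ),
    ],
}
```

With boto3, a batch shaped like this would be passed as `ChangeBatch` to `change_resource_record_sets` along with your `HostedZoneId`. DNS is only half the failover: the replica in the secondary region still has to be promoted, which is where most of the 15-60 minute RTO goes.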
## Cost comparison — what each pattern actually costs for a typical SaaS
Reference workload: 8 EC2 t3.large instances, 2 RDS db.m6g.large multi-AZ, 200 GB DynamoDB, 4 TB S3, ~1 TB egress/month. Pricing as of October 2025 in INR.
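To make the multipliers concrete, here is a rough sketch of the arithmetic. The baseline bill is a hypothetical round number, not a price for the reference workload, and the sketch assumes the "cost uplift" figures quoted for each pattern are multipliers on the total single-region bill.

```python
# Hypothetical single-region monthly bill in INR; an assumption for
# illustration, not a price quoted for the reference workload.
BASELINE_MONTHLY_INR = 300_000

# Total-cost multipliers (low, high) relative to the single-region bill,
# taken from the pattern descriptions in this article.
MULTIPLIERS = {
    "mumbai_first": (1.0, 1.0),
    "active_passive": (1.4, 1.8),
    "active_active": (2.0, 4.0),
}


def monthly_range(pattern, baseline=BASELINE_MONTHLY_INR):
    """Return the (low, high) monthly bill in INR for a pattern."""
    low, high = MULTIPLIERS[pattern]
    return round(baseline * low), round(baseline * high)


def secondary_pays_off(downtime_cost_per_hour, baseline=BASELINE_MONTHLY_INR):
    """Rule of thumb: an active-passive secondary is justified when one hour
    of downtime costs more than the secondary region adds per month.
    Uses the high end of the range to stay conservative."""
    _, high = monthly_range("active_passive", baseline)
    return downtime_cost_per_hour > high - baseline


for name in MULTIPLIERS:
    low, high = monthly_range(name)
    print(f"{name}: INR {low:,} to INR {high:,} per month")
```

Swap in your own bill for `BASELINE_MONTHLY_INR`; the point is the shape of the comparison, not the specific rupee amounts.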
## The 3 metrics every Indian SaaS should alert on
These are the metrics that catch a region-level cloud incident before your customers do.
| Metric | Why | Threshold for alert |
| --- | --- | --- |
| Synthetic uptime check from a third party (UptimeRobot, Pingdom, BetterStack) | Internal monitoring fails when the cloud fails; you need an external observer | Any 2 consecutive 5-minute failures |
| p95 API latency from at least 3 geographic test points | Cascading cloud incidents often start as latency spikes 10-30 minutes before full failure | >2x baseline for 10 minutes |
| Database write success rate | The most reliable predictor of "your stack is about to fail"; databases degrade before the app does | < 99.5% for 5 minutes |
The synthetic third-party check is the most under-utilised. Your CloudWatch dashboards may show "all green" while your customers cannot reach you, because CloudWatch itself depends on the same AWS region. UptimeRobot at ₹0-₹2,500/month will tell you the truth.
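The three thresholds translate into fairly small rules. The sketch below is one possible way to evaluate them, assuming the uptime results arrive as one boolean per 5-minute check and the latency and write-rate series arrive as one sample per minute; the function names and input shapes are illustrative, not any monitoring vendor's API.

```python
def consecutive_failures(check_results, needed=2):
    """True if the most recent `needed` synthetic checks all failed.

    check_results: list of bools, oldest first, one per 5-minute check
    (True means the check succeeded).
    """
    recent = check_results[-needed:]
    return len(recent) == needed and not any(recent)


def sustained(samples, predicate, minutes):
    """True if `predicate` holds for each of the last `minutes` samples
    (one sample per minute, oldest first)."""
    recent = samples[-minutes:]
    return len(recent) == minutes and all(predicate(s) for s in recent)


def evaluate(uptime_checks, p95_latency_ms, baseline_p95_ms, write_success_rate):
    """Return the names of the alerts that should fire right now."""
    alerts = []
    if consecutive_failures(uptime_checks, needed=2):
        alerts.append("synthetic_uptime")
    if sustained(p95_latency_ms, lambda ms: ms > 2 * baseline_p95_ms, minutes=10):
        alerts.append("p95_latency")
    if sustained(write_success_rate, lambda rate: rate < 0.995, minutes=5):
        alerts.append("db_write_success")
    return alerts


# Example: two failed uptime checks, latency at 2.5x baseline, writes at 99%.
print(evaluate([True, False, False], [500] * 10, 200, [0.99] * 5))
```

Whatever evaluates these rules should run outside the region it watches, for the same reason the synthetic check does.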
## The 5-step incident-response runbook for cloud-region failures
When the third-party check fires, you have 5-15 minutes to engage. The pattern below works whether you have failover infrastructure or not.
1. Minute 0-3: Confirm and triage. Check the AWS Service Health Dashboard (status.aws.amazon.com; note that this URL itself sometimes goes down with US-EAST-1, so know where to look). Check DownDetector for your category. Check your own monitoring. Confirm: is this AWS region-wide, AWS service-specific, or local to your stack?
2. Minute 3-8: Communicate to customers. Post a status-page update. The discipline: "we are aware, we are investigating, next update in 15 minutes." Many SaaS firms wait until they have full information before posting; that erodes trust. The first post buys you time.
3. Minute 8-15: Engage failover (if you have one). If you run active-passive, this is when you trigger the documented failover runbook. If you do not, this is when you start preparing customers for the possibility of an extended outage. It is better to set expectations now than to scramble at hour 4.
4. Minute 15+: Update every 30 minutes. Customers do not need a fix every 30 minutes; they need to know you are still working on it. The discipline of regular communication is how SaaS firms come out of regional incidents with their NPS intact.
5. Post-incident: Write the post-mortem within 48 hours. Internal post-mortem with timeline, root cause as known, customer impact, action items. Public post-mortem (lighter version) on your blog within 7 days. The transparency post is one of the highest-trust assets your SaaS can publish.
## When NOT to go multi-region
If your monthly recurring revenue is under ₹10 lakh, multi-region is the wrong investment. Spend the engineering time on growth, product, customer-success. Add a third-party synthetic check; document the failover plan you would execute if you had to migrate manually; then move on. Most outages are short enough that the manual-recovery cost is acceptable.
If your application is fundamentally non-real-time (batch processing, scheduled reports, async workflows), multi-region adds complexity without proportional benefit. Tolerate the downtime; communicate well with customers.
If your customers are entirely Indian and your data is required to live in India by your contracts or DPDP requirements, full multi-region across non-Indian regions may not even be permitted. Check before you build.
The DR-test trap. If you have a documented failover plan but have never executed it, you do not really have a failover plan. The Asahi ransomware story (covered in our Sora 2 + Asahi piece from September 30) is the most expensive lesson on this — they reverted to phone and fax because their digital recovery had not been tested at scale. Test your DR runbook quarterly; a runbook never executed is fiction.
## A real example — a 25-person Bangalore B2B SaaS
A Bangalore-based B2B analytics SaaS (₹14 cr ARR, 25 engineers, 800 paying customers mostly in India) had infrastructure on US-EAST-1 because the original founders set it up there in 2020 from a tutorial. They were affected by the October 19-20 outage — partial functionality from 12:18 PM IST Oct 20 until ~11 PM IST. About 65 of their 800 customers raised tickets. Two threatened cancellation.
We worked with them through November-December 2025 on the migration to ap-south-1 Mumbai. Plan: 6 weeks, dual-write phase, traffic shift gradually, decommission US-EAST-1 last. Cost of migration: ₹3.8 lakh in engineering time. Cost of staying: roughly the same monthly bill, but exposure to every future US-EAST-1 incident.
Outcome: completed mid-December 2025. Latency to Indian customers dropped from p95 230ms to 38ms. The CTO sent a customer-comms email titled "we moved to Mumbai" — got 40+ positive replies. The sales team now uses "infrastructure in India" as a credible differentiator vs. Western SaaS competitors targeting the Indian market.
For background on the broader pattern, see our Cloudflare outage runbook for India SaaS — same RTO/RPO discipline applied to a different category of dependency. Both are part of the operational resilience cluster.
## A founder note from our team
Hrishikesh, our CTO, has been pushing the Mumbai-first default for Indian SaaS clients for the last 3 years. The shortest version of his argument: every quarter, an incident shows that "default to US-EAST-1" was the wrong call. The cost of moving is finite and one-time. The cost of staying compounds with each incident.
The broader resilience point: most outages are not perfectly preventable — even AWS, with the deepest engineering bench in the industry, ships incidents like the Oct 20 one. The discipline is not "achieve zero outages" — it is "keep the impact small, communicate well, recover fast, learn." Indian SaaS that build this muscle out-compete those that pretend incidents will not happen.
For complementary technical reading on the database side, see our March 2026 piece on a 2.4M-row MySQL to PostgreSQL migration with zero downtime — same dual-write discipline applies to region migration.
## The Reddit pulse
The r/aws subreddit through October 20-22, 2025 was dominated by the US-EAST-1 outage. The most common takes: stop making US-EAST-1 the default; AWS itself recommends regions closer to your users; DR runbooks should be exercised quarterly, not annually; the post-mortem shows AWS is taking the right systemic actions, but architects must still design for region failure.
Indian engineers in the threads consistently called out that ap-south-1 (Mumbai) is operationally mature and that the Mumbai-first migration pays for itself within 12 months for most SaaS workloads through reduced data egress and faster response times for Indian customers.
The HN discussion on the post-mortem was technical and surfaced two lessons: (1) DNS automation needs idempotency and reconciliation, not just speed; (2) cross-service dependencies on a single region are how local incidents become global ones. Both are lessons every architect can take into their next design review.
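The first lesson can be illustrated with a toy model. This is not AWS's implementation; `DnsRecordStore` and its methods are hypothetical names, and the sketch only shows how a version guard plus an idempotent apply stops a delayed worker from overwriting a newer plan or publishing an empty record set.

```python
class DnsRecordStore:
    """Toy stand-in for a DNS endpoint whose records are set by versioned plans."""

    def __init__(self):
        self.applied_version = 0
        self.records = []

    def apply_unguarded(self, version, records):
        """Last writer wins: a delayed worker can clobber a newer plan."""
        self.applied_version = version
        self.records = list(records)

    def apply_guarded(self, version, records):
        """Apply a plan only if it is newer and non-empty.

        Returns True when the store ends up reflecting this plan.
        """
        if version == self.applied_version and list(records) == self.records:
            return True   # idempotent: re-applying the current plan is a no-op
        if version <= self.applied_version:
            return False  # stale plan: never roll back to older state
        if not records:
            return False  # refuse to publish an empty record set
        self.applied_version = version
        self.records = list(records)
        return True


# Without the guard, the delayed worker holding plan 1 wins.
unguarded = DnsRecordStore()
unguarded.apply_unguarded(2, ["10.0.0.2"])
unguarded.apply_unguarded(1, ["10.0.0.1"])

# With the guard, the stale plan and the empty plan are both rejected.
guarded = DnsRecordStore()
guarded.apply_guarded(2, ["10.0.0.2"])
stale_accepted = guarded.apply_guarded(1, ["10.0.0.1"])
empty_accepted = guarded.apply_guarded(3, [])

print(unguarded.records, guarded.records, stale_accepted, empty_accepted)
```

A real system would also need the check and the write to be atomic (a conditional write or a lock), plus a reconciliation loop that compares the published records against the desired state; the sketch leaves both out.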
## FAQ
### Does this outage affect us if we use AWS Mumbai?
Largely no. ap-south-1 (Mumbai) is its own isolated region with its own DNS infrastructure. The race condition was specific to the dynamodb.us-east-1 endpoint. Some global services (IAM control plane, CloudFront, Route 53 health-check evaluators) had US-EAST-1 dependencies that caused degraded performance for some Mumbai customers, but core Mumbai workloads stayed up.
### How do we move our infrastructure to Mumbai?
Standard pattern: provision parallel infrastructure in ap-south-1, set up data replication (RDS read-replica or DMS for active databases, S3 cross-region replication for objects, ECR replication for container images), shift traffic gradually with weighted routing, decommission US-EAST-1 last. Typical timeline: 4-12 weeks for an SMB SaaS, depending on data volume.
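The "shift traffic gradually" step can be planned as a simple schedule of weighted-routing weights. The step percentages below are hypothetical, not a recommendation; Route 53 weighted records split traffic in proportion to each record's weight, so a pair that sums to 100 reads directly as percentages.

```python
def shift_schedule(percent_steps=(5, 25, 50, 100)):
    """Return (us_east_1_weight, ap_south_1_weight) pairs, one per step.

    percent_steps: strictly increasing share of traffic sent to ap-south-1.
    """
    schedule = []
    previous = 0
    for pct in percent_steps:
        if not previous < pct <= 100:
            raise ValueError("steps must increase and stay within 1-100")
        schedule.append((100 - pct, pct))
        previous = pct
    return schedule


for old_weight, new_weight in shift_schedule():
    print(f"us-east-1 weight {old_weight:>3} | ap-south-1 weight {new_weight:>3}")
```

Hold each step long enough to compare error rates and latency between the two regions before moving on, and keep the US-EAST-1 stack intact until the final step has been stable for a while.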
### What is "cell-based architecture" and should we adopt it?
Cell-based architecture (popularised by AWS internal engineering) divides your infrastructure into multiple isolated "cells," each serving a subset of users with no shared resources between cells. A failure in one cell affects only its users. AWS uses this internally to limit blast radius. For most SMB SaaS, this is overkill — but worth understanding as the next-generation pattern beyond multi-region.
### Should we use Aurora Global Database for cross-region failover?
Aurora Global Database replicates across regions with typical replica lag under 1 second, which puts RPO at roughly a second and RTO at around a minute once a secondary region is promoted. Cost uplift: roughly 1.5-2x your single-region Aurora bill. Worth it if your RPO requirement is under 5 minutes; otherwise, RDS read-replicas with manual promotion are cheaper and adequate.
### What is "DNS as a single point of failure" — what should we do about it?
The Oct 20 outage demonstrated that DNS at scale is itself a system that can fail. Defences: use multiple DNS providers (Route 53 + Cloudflare DNS as a secondary, configured via NS delegation), keep TTLs reasonable (300-3600 seconds, not 86400) so cache flushes happen quickly, monitor DNS resolution from external probes (UptimeRobot, ThousandEyes).
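An external DNS probe does not need to be elaborate. The sketch below uses only the Python standard library: `resolves` reports whether a hostname returns at least one address (an empty record, like the one in this outage, shows up as a resolution failure), and `ttl_is_reasonable` encodes the 300-3600 second guidance. Both helper names are illustrative, and the probe is only useful when it runs outside the region it is watching.

```python
import socket


def resolves(hostname, port=443):
    """Return True if the hostname resolves to at least one address."""
    try:
        return len(socket.getaddrinfo(hostname, port)) > 0
    except socket.gaierror:
        # Covers NXDOMAIN, empty answers, and resolver failures alike.
        return False


def ttl_is_reasonable(ttl_seconds, low=300, high=3600):
    """True if a record's TTL falls inside the suggested 300-3600 s band."""
    return low <= ttl_seconds <= high


if __name__ == "__main__":
    for host in ("dynamodb.us-east-1.amazonaws.com", "dynamodb.ap-south-1.amazonaws.com"):
        print(host, "resolves" if resolves(host) else "DOES NOT RESOLVE")
```

Run it on a schedule from a machine or provider that does not share your primary region, and alert when a hostname you depend on stops resolving.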
### How do we communicate with customers during an extended outage?
Use a status page on a different infrastructure than your primary stack — Statuspage.io, Cachet, BetterStack are independent. Update every 30 minutes during an active incident. Be specific about what is and is not working. Post a transparent post-mortem within 7 days of resolution. Indian customers reward transparent communication, and word travels fast on X/LinkedIn.
### Are we required to host Indian user data in India?
Under the current DPDP Act framework, the requirement is that the Government may restrict transfers to specifically notified countries — not a general requirement to host all Indian data in India. However, several sectoral regulators (RBI for payments, IRDAI for insurance) do require data localisation. For pure SaaS, hosting in Mumbai is recommended (latency, compliance hedging, customer-perception) but not legally mandatory in most categories.
### What is the real-world RTO/RPO our SMB SaaS should target?
For most B2B SaaS at ₹5-50 cr ARR: RTO 4 hours, RPO 1 hour. For B2C consumer SaaS at the same scale: RTO 1 hour, RPO 15 minutes (consumers are less patient). For regulated fintech or health: RTO 15 minutes, RPO 1 minute (regulatory pressure). The numbers should drive your architecture, not the other way around.
Want a multi-region resilience audit on your AWS stack?
We run a 1-week AWS resilience audit for Indian SaaS teams (₹2-50 cr ARR) for ₹65,000 fixed price. You leave with: full architecture review, region-redundancy gap analysis, cost-benefit comparison of 3 patterns, RTO/RPO documented per business application, and a 90-day migration plan for moving to Mumbai-first. First call is with the engineer who would lead the audit.