Halloween Edition: 5 Spooky Bugs PenLeap Hit at 2,400 Concurrent Writers (and the Patch Notes That Saved Us)
A Halloween-themed honest postmortem. The 5 production bugs our in-house edtech product PenLeap hit when concurrent writer load spiked to 2,400 — connection pool starvation, prompt-cache leak, and the 3 graphs we now alert on.
Hrishikesh Baidya
October 31, 2025 · 16 min read
On October 17, 2025, at 4:42 pm IST, PenLeap, our in-house edtech product, hit a peak of 2,400 concurrent student writers for the first time. Within 90 seconds, the AI feedback engine started returning timeouts. Within 6 minutes, 18% of submissions were failing. By 5:11 pm, we had identified and patched the worst bug; full recovery took until 6:30 pm. The post-mortem found 5 distinct production bugs, each invisible at our normal load (~600 concurrent writers). Halloween is the right week for this story: these are the spooky bugs that hide in your stack until concurrency quadruples. Names changed where necessary; technical details are exact. If you operate any AI-backed product, treat this as a checklist of things to test before you hit your own peak.
- 2,400: peak concurrent writers (4× normal)
- 5: distinct production bugs surfaced
- 18%: submission failure rate at peak
- 108 min: time from spike to full recovery
## The Answer in 60 Words
The 5 bugs: pgvector connection pool starvation under burst writes, Claude prompt-cache memory leak in our long-running Node worker, S3 multipart upload race on duplicate filenames, Redis-backed session lock deadlock, and silent NULL coalescing in our mastery-update query. Total fix time: 17 hours of engineering across 3 days. We now alert on 3 graphs that would have caught all 5 before users felt them.
## Why This Story Matters
Most engineering blog posts present the after — the polished architecture diagram, the 99.95% SLA. The truth of running a production AI product is that you discover whole categories of bugs only when load patterns change. Our normal peak — 600-800 concurrent writers at school timing — exposed none of these. The 4x spike on October 17 (a state-board notification went out about an upcoming pre-board, driving a one-day surge) exposed five. Be honest about the bugs. That is the credibility hook. If you have not hit these yet, you will.
## The Setup (Specific Stack)
- Database: PostgreSQL 16 + pgvector 0.7. Single primary, one read replica. PgBouncer in front in transaction-pooling mode.
- AI Layer: Claude Sonnet 4.5 via Anthropic SDK. Prompt caching enabled. Per-worker singleton client.
- Storage + Cache: S3-compatible storage (Backblaze B2). Redis 7 for session locks and rate-limiting. Sentry + Grafana for observability.
## Bug 1 — pgvector Connection Pool Starvation (The One That Hurt Most)
Symptom: At ~1,800 concurrent users, every read query against pgvector started timing out at 30 seconds. Application logs showed "remaining connection slots are reserved for roles with the SUPERUSER attribute."
Root cause. Our pgvector retrieval used a separate Postgres client per request — and we had not capped the per-worker pool size. Under burst load, each Node worker tried to open up to 50 connections; 4 boxes × 6 workers × 50 = 1,200 attempted connections against a primary configured for max_connections = 200. PgBouncer was NOT in the pgvector path because of an early architectural mistake (we wrongly thought transaction-pool mode was incompatible with the vector extension).
The fix. Routed pgvector queries through PgBouncer in transaction-pooling mode, capped the per-worker pool at 8 connections, and added maxUses=1000 on the underlying pg.Pool to recycle connections regularly. There is a closely related n8n issue from late 2024 describing the same pattern; that issue's discussion is required reading if you run pgvector at scale.
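The connection arithmetic behind this bug is easy to check before a spike does it for you. The sketch below is illustrative only: `FleetShape`, `worstCaseConnections`, and `fitsWithin` are hypothetical names, and `poolOptions` simply uses the `max` and `maxUses` option names that node-postgres' pool accepts.

```typescript
// Hypothetical description of the worker fleet; the numbers come from the text.
interface FleetShape {
  boxes: number;
  workersPerBox: number;
  poolMaxPerWorker: number;
}

// Worst case: every worker on every box fills its pool at the same time.
function worstCaseConnections(fleet: FleetShape): number {
  return fleet.boxes * fleet.workersPerBox * fleet.poolMaxPerWorker;
}

function fitsWithin(fleet: FleetShape, maxConnections: number): boolean {
  return worstCaseConnections(fleet) <= maxConnections;
}

const uncappedFleet: FleetShape = { boxes: 4, workersPerBox: 6, poolMaxPerWorker: 50 };
const cappedFleet: FleetShape = { boxes: 4, workersPerBox: 6, poolMaxPerWorker: 8 };

// Assumed per-worker pool options after the fix (option names as in node-postgres).
const poolOptions = { max: cappedFleet.poolMaxPerWorker, maxUses: 1000 };

console.log(worstCaseConnections(uncappedFleet), fitsWithin(uncappedFleet, 200)); // 1200 false
console.log(worstCaseConnections(cappedFleet), fitsWithin(cappedFleet, 200)); // 192 true
console.log(poolOptions);
```

With the cap at 8, the worst case (192) fits under max_connections = 200 even before PgBouncer multiplexes anything, which is a useful sanity margin.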
Time to identify. 22 minutes. Time to deploy fix. 80 minutes (had to coordinate PgBouncer config + code change + rolling restart).
## Bug 2 — Claude Prompt-Cache Memory Leak
Symptom: One Node worker per box started consuming > 6 GB RAM (normal: 800 MB). After 90 minutes of peak load, OOM kills triggered.
Root cause. We were storing the cached prompt prefix in a per-worker in-memory map keyed by user_id. The intent was "fast lookup for prefix to send to Anthropic." The reality: we never evicted entries, and at 2,400 concurrent users with diverse prefixes, the map grew unboundedly. Worse, each cached prefix was the full system prompt — about 4 KB per entry. 2,400 entries × 4 KB = 9.6 MB, but we held the full Claude SDK request payload too, including the streaming response chunks until the response completed. That ballooned to ~6 GB.
The fix. Removed the per-user cache map (Anthropic's prompt caching is server-side; our client-side map was redundant and harmful). For the rare entries we actually need to keep client-side, we added an LRU cache with a 200-entry cap and a 5-minute TTL. Memory dropped to 1.1 GB peak.
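A bounded cache of the kind described here can be quite small. The following is a minimal sketch, not the code PenLeap runs: `BoundedTtlCache` is a hypothetical class, the TTL is measured from the last write, and the clock is injectable only so the behaviour can be tested.

```typescript
// Minimal bounded cache: least-recently-used eviction plus a per-entry TTL.
class BoundedTtlCache<V> {
  private entries = new Map<string, { value: V; expiresAt: number }>();
  private maxEntries: number;
  private ttlMs: number;
  private now: () => number;

  constructor(maxEntries: number, ttlMs: number, now: () => number = () => Date.now()) {
    this.maxEntries = maxEntries;
    this.ttlMs = ttlMs;
    this.now = now;
  }

  get(key: string): V | undefined {
    const hit = this.entries.get(key);
    if (hit === undefined) return undefined;
    if (hit.expiresAt <= this.now()) {
      this.entries.delete(key); // expired entries are dropped on read
      return undefined;
    }
    // Re-insert so this key becomes the most recently used one.
    this.entries.delete(key);
    this.entries.set(key, hit);
    return hit.value;
  }

  set(key: string, value: V): void {
    this.entries.delete(key);
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
    if (this.entries.size > this.maxEntries) {
      // A Map iterates in insertion order, so the first key is the least recently used.
      const oldest = this.entries.keys().next().value;
      if (oldest !== undefined) this.entries.delete(oldest);
    }
  }

  get size(): number {
    return this.entries.size;
  }
}

// 200-entry cap and 5-minute TTL, as in the fix described above.
const prefixCache = new BoundedTtlCache<string>(200, 5 * 60 * 1000);
for (let i = 0; i < 2400; i++) {
  prefixCache.set(`user-${i}`, "prefix");
}
console.log(prefixCache.size); // 200
```

Even with 2,400 distinct keys written, the map never holds more than 200 entries, which is the property the original per-user map lacked.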
Time to identify. 38 minutes (memory profiling under load is painful). Time to deploy. 35 minutes.
## Bug 3 — S3 Multipart Upload Race On Duplicate Filenames
Symptom: About 0.4% of student image submissions were saving as 0-byte files. Affected only handwritten image uploads, not text submissions.
Root cause. Two students in the same class submitted images at exactly the same second (within milliseconds — a coordinated "upload now" moment from their teacher). Our upload key was derived from {{class_id}}/{{student_id}}/{{timestamp_seconds}}.jpg. Collision required same class, different students, same second — which under normal load almost never happened. Under burst load it happened multiple times. The S3 multipart upload from one student aborted the in-progress upload from another, leaving the destination file at 0 bytes.
The fix. Added a UUID suffix to the upload key: {{class_id}}/{{student_id}}/{{timestamp_ms}}_{{uuid}}.jpg. Trivial change, 4 minutes of code. Required a migration of in-flight upload routing during deploy.
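As a rough illustration of the new key format, here is a minimal sketch. `buildUploadKey` and its parameter names are hypothetical, the identifiers and timestamp are made up, and the injectable `newId` exists only to make the function easy to test.

```typescript
import { randomUUID } from "node:crypto";

// Hypothetical key builder following the {class_id}/{student_id}/{timestamp_ms}_{uuid}.jpg shape.
function buildUploadKey(
  classId: string,
  studentId: string,
  timestampMs: number,
  newId: () => string = randomUUID,
): string {
  return `${classId}/${studentId}/${timestampMs}_${newId()}.jpg`;
}

// Two uploads that share every input, down to the millisecond (arbitrary timestamp).
const sameMillisecond = 1700000000000;
const firstKey = buildUploadKey("class-9b", "student-17", sameMillisecond);
const secondKey = buildUploadKey("class-9b", "student-17", sameMillisecond);

console.log(firstKey);
console.log(secondKey);
console.log(firstKey !== secondKey);
```

Because the random suffix does not depend on the clock, the key stays unique however tightly uploads cluster in time.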
Time to identify. 70 minutes (the 0-byte signal was hidden in our normal "upload failed" metric). Time to deploy. 18 minutes.
## Bug 4 — Redis Session Lock Deadlock
Symptom: Some users saw "your previous session is still active" error and could not start a new feedback session even after force-quitting the app.
Root cause. We use Redis to hold a per-user session lock for the duration of an AI grading run (~12 seconds). The lock has a 30-second TTL safety net. Under burst load, our Sonnet calls started taking 18-25 seconds (Anthropic API was rate-limiting us subtly). When a user retried via app force-quit, the new request found an existing lock, returned an error, and crucially never released the original lock. The lock would only free at the 30-second TTL — by which point the user had retried 3 times.
The fix. Three changes. Reduced the TTL to 18 seconds (matches actual peak AI-call time + 6s buffer). Added an explicit "release on error" handler that calls Redis DEL on any exception path. Most important: added a "force release after authentication" endpoint that the app calls on force-quit recovery.
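The release-on-error part of the fix is essentially a try/finally around the grading call. The sketch below makes several assumptions: `LockStore`, `InMemoryLockStore`, and `withSessionLock` are hypothetical names, the in-memory store stands in for Redis so the example runs on its own, and the work function is synchronous to keep it short (an async version would need to await the work inside the try block so the lock is not released early).

```typescript
// Minimal lock-store contract. In production this would be backed by Redis:
// an atomic set-if-absent with a TTL to acquire, and a delete to release.
interface LockStore {
  acquire(key: string, token: string, ttlMs: number): boolean;
  release(key: string, token: string): boolean;
}

// In-memory stand-in for Redis (assumption, for illustration only).
class InMemoryLockStore implements LockStore {
  private locks = new Map<string, { token: string; expiresAt: number }>();

  acquire(key: string, token: string, ttlMs: number): boolean {
    const held = this.locks.get(key);
    if (held !== undefined && held.expiresAt > Date.now()) return false; // still locked
    this.locks.set(key, { token, expiresAt: Date.now() + ttlMs });
    return true;
  }

  release(key: string, token: string): boolean {
    const held = this.locks.get(key);
    if (held === undefined || held.token !== token) return false; // not our lock
    this.locks.delete(key);
    return true;
  }
}

const LOCK_TTL_MS = 18000; // the 18-second TTL from the fix above

function withSessionLock<T>(store: LockStore, userId: string, token: string, work: () => T): T {
  const key = `session-lock:${userId}`;
  if (!store.acquire(key, token, LOCK_TTL_MS)) {
    throw new Error("your previous session is still active");
  }
  try {
    return work();
  } finally {
    // Runs on success and on every exception path.
    store.release(key, token);
  }
}

const lockStore = new InMemoryLockStore();
try {
  withSessionLock(lockStore, "user-1", "request-1", () => {
    throw new Error("AI call timed out");
  });
} catch {
  // The failed run has already released its lock at this point.
}
const retryResult = withSessionLock(lockStore, "user-1", "request-2", () => "graded");
console.log(retryResult); // graded
```

Checking the token before deleting is a small extra safeguard, so that a slow request cannot release a lock that has since been re-acquired by a newer one.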
Time to identify. 95 minutes (the symptom looked like a UI bug for the first hour). Time to deploy. 45 minutes.
## Bug 5 — Silent NULL Coalescing In Mastery Update
Symptom: Three days after the spike, we noticed that some students' mastery scores had reset. Affected ~140 students.
Root cause. Our mastery update query was UPDATE user_mastery SET score = COALESCE($1, score) + 1 WHERE user_id = $2 AND concept = $3. Under normal load, $1 was always the new score. Under burst load, our application's score-computation function occasionally returned undefined (a race in our concept-lookup cache). The Postgres driver (node-postgres) serialised undefined to NULL. COALESCE happily fell back to the existing score, then added 1. Visible symptom: scores looked normal for a day, then started displaying anomalies as the cache evicted.

The fix. Mandatory NOT-NULL check on the input parameter, throw an exception instead of silently coalescing. Backfilled the affected 140 mastery rows from our event log (we log every score-change event as an immutable row in a separate table, which saved us).
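A guard of this kind can sit directly in front of the query call. This is a minimal sketch under assumptions: `requireFiniteScore` and `masteryUpdateValues` are hypothetical helpers, and the returned tuple is simply the parameter list for $1, $2, and $3.

```typescript
// Reject anything that is not a real number before it reaches the database driver.
function requireFiniteScore(value: unknown): number {
  if (typeof value !== "number" || !Number.isFinite(value)) {
    throw new TypeError(`mastery score must be a finite number, got ${String(value)}`);
  }
  return value;
}

// Builds the positional parameters ($1, $2, $3) for the mastery update.
function masteryUpdateValues(score: unknown, userId: string, concept: string): [number, string, string] {
  return [requireFiniteScore(score), userId, concept];
}

console.log(masteryUpdateValues(3, "user-42", "fractions"));

try {
  // Simulates the race: the score computation returned undefined.
  masteryUpdateValues(undefined, "user-42", "fractions");
} catch (err) {
  console.log((err as Error).message);
}
```

Failing loudly here turns a silent data problem into an error that shows up in Sentry at the moment it happens.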
Time to identify. 3 days post-spike (this one got missed during the live incident). Time to deploy. 22 minutes.
## The 3 Graphs We Now Alert On
All 5 bugs would have been caught earlier with the right monitoring. The 3 graphs we added to our Grafana stack:
1. Graph 1: Active pgvector connections vs max_connections. Alert when usage exceeds 70% of max for 60 seconds. Catches pool starvation BEFORE Postgres starts rejecting. Single most useful alert we added.
2. Graph 2: Per-worker RSS memory delta over a 5-minute window. Alert when any worker grows by > 500 MB in 5 minutes without a corresponding load spike. Catches memory leaks (including the prompt-cache one) within 10 minutes instead of 90.
3. Graph 3: Redis lock-held duration p95. Alert when p95 lock-held duration exceeds 15 seconds. Catches deadlocks AND slow-AI-call cascades. Correlate with Anthropic API latency for the full picture.
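The three thresholds above can be expressed as plain predicates, which makes them easy to unit-test before wiring them into Grafana. This is a minimal sketch with hypothetical function names; it uses a nearest-rank percentile, treats "for 60 seconds" as "every sample in the window", and leaves out the load-spike exclusion from Graph 2.

```typescript
const CONNECTION_USAGE_LIMIT = 0.7; // 70% of max_connections
const RSS_GROWTH_LIMIT_MB = 500; // growth allowed within a 5-minute window
const LOCK_P95_LIMIT_SECONDS = 15;

// Nearest-rank percentile over a non-empty list of samples.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

// Graph 1: every sample in the 60-second window is above 70% of max_connections.
function connectionAlert(activeSamples: number[], maxConnections: number): boolean {
  return (
    activeSamples.length > 0 &&
    activeSamples.every((active) => active / maxConnections > CONNECTION_USAGE_LIMIT)
  );
}

// Graph 2: a worker's RSS grew by more than 500 MB across the window.
function memoryAlert(rssStartMb: number, rssEndMb: number): boolean {
  return rssEndMb - rssStartMb > RSS_GROWTH_LIMIT_MB;
}

// Graph 3: p95 of lock-held durations is above 15 seconds.
function lockAlert(lockHeldSeconds: number[]): boolean {
  return percentile(lockHeldSeconds, 95) > LOCK_P95_LIMIT_SECONDS;
}

console.log(connectionAlert([150, 160, 170], 200)); // true
console.log(connectionAlert([150, 130, 170], 200)); // false
console.log(memoryAlert(800, 1400)); // true
console.log(lockAlert([...new Array<number>(19).fill(12), 25])); // false
console.log(lockAlert([...new Array<number>(18).fill(12), 20, 25])); // true
```

The last two lines show why p95 is a reasonable choice: one slow run in twenty does not page anyone, but two do.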
## What Did NOT Break (Worth Noting)
Three components held perfectly under 4x load. Worth recording because we considered changing them and the data says do not.
- The PgBouncer for our application Postgres connections (non-vector). Held at 95% pool utilisation, 0 timeouts. Our chosen pool size was right.
- The Claude Sonnet 4.5 API at the Anthropic end. Other than rate-limiting our specific account (which was a feature, not a bug), the underlying API stayed at p95 latency around 4.2 seconds — consistent with normal load.
- Our front-end Next.js + Vercel edge caching. Static assets and API routes that did not hit the AI engine stayed fast. The single-page-app shell loaded normally throughout.
## What This Cost Us (Real Numbers)
Compared to the alternative — losing student trust during pre-board prep season — this is inexpensive. The product manager's framing: "we paid ₹55K for a real load test we could not have afforded to schedule."
## The Pre-Spike Checklist (We Run This Quarterly Now)
- Postgres max_connections set with explicit headroom (we use 300 against an application need of 200)
- PgBouncer in transaction-pooling mode in front of all Postgres connections (including pgvector)
- Per-worker pg.Pool capped at < 10 connections with maxUses recycling
- No in-memory caches without an explicit eviction policy AND a size cap
- Redis lock TTL set to (p99 hold time + buffer), with explicit release-on-error
- S3 upload keys include enough entropy to avoid sub-second collisions (UUID, not just timestamp)
- Mastery / score updates use parameterised queries with NOT NULL checks, never silent COALESCE on user-derived input
- Event log table for any score change, which gives you backfill recovery
- Memory growth alert per worker (RSS delta over 5 min)
- Synthetic load test at 4x current peak in staging, run quarterly
## Common Mistakes (Each One Hurts)
- Symptom: "Postgres connection rejected with 'remaining slots reserved for SUPERUSER.'" Cause: per-worker pool too large or PgBouncer not in path. Fix: as above.
- Symptom: "Worker memory grows linearly with concurrent users." Cause: unbounded in-memory cache. Fix: LRU + TTL + size cap.
- Symptom: "Some user submissions saved as 0 bytes." Cause: filename collision under burst write. Fix: add entropy to the key.
- Symptom: "Users see 'previous session active' even after force-quit." Cause: Redis lock not released on error path. Fix: try/finally with explicit DEL.
- Symptom: "Mastery scores look normal for a day, then anomalous." Cause: silent NULL coalescing. Fix: NOT NULL parameter validation, never trust application-layer scores.
## When To Run The Quarterly 4x Load Test
Mandatory if (a) you have crossed 1,000 active concurrent users, (b) your stack includes any AI / LLM call in the request path, (c) your business has natural traffic spikes (exam season, sale day, festival). Skip if you are still under 200 concurrent — your bottlenecks are different.
## A Detail That Saved Us On Day Of Recovery
At 6:18 pm, twelve minutes before full recovery, our CTO Hrishikesh noticed that one of our Hetzner boxes was running 30% slower than the others on the same workload. Investigation: a months-old kernel update on that box had silently rolled back a TCP buffer tuning. We bumped the box, the cluster rebalanced, full recovery achieved at 6:30 instead of 7:15. The lesson: during incidents, look at the boxes that are NOT failing — sometimes the box that is "doing fine" is actually the one limiting recovery.
## How This Connects To The Wider PenLeap Stack
The vocabulary drill engine from our 800-drills-per-student post uses the same pgvector + pool + cache architecture. The fixes here apply to that engine as well. The CBSE paper generator from our paper generator post hits the prompt-cache code path heavily and benefited directly from the leak fix. We have applied similar load-test discipline to TalkDrill (our voice product) — different stack, same principles. Honest postmortems are how we keep both products stable as load grows.
## FAQ
### Why didn't your existing observability catch these?
Pre-incident we tracked aggregate request latency and error rate. We did NOT track per-worker memory growth, pgvector connection utilisation, or Redis lock-hold duration. The aggregate metrics looked fine until they did not. Specific component metrics matter more.
### How often should we run the 4x load test?
Quarterly is enough. Monthly is overkill unless you are growing > 30% MoM. We run our 4x load test on the second Saturday of every quarter, in staging with production-like data.
### Is Anthropic's prompt caching reliable?
Mostly yes. The default cache TTL is 5 minutes, with a 1-hour TTL available as an option, and the short default surprises some teams. We monitor cache hit rate as part of our standard dashboard now.
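One way to compute a cache hit rate is from the token-usage counters that come back with each Messages API response. The sketch below assumes the three usage fields named in the interface (with the cache fields possibly null) and defines hit rate as cached-read tokens over all input tokens; `UsageSample` and `cacheHitRate` are hypothetical names, and the sample numbers are made up.

```typescript
// Mirrors the input-token counters reported in a response's usage object.
interface UsageSample {
  input_tokens: number;
  cache_creation_input_tokens: number | null;
  cache_read_input_tokens: number | null;
}

// Share of input tokens that were served from the prompt cache.
function cacheHitRate(samples: UsageSample[]): number {
  let readTokens = 0;
  let totalTokens = 0;
  for (const usage of samples) {
    const created = usage.cache_creation_input_tokens ?? 0;
    const read = usage.cache_read_input_tokens ?? 0;
    readTokens += read;
    totalTokens += usage.input_tokens + created + read;
  }
  return totalTokens === 0 ? 0 : readTokens / totalTokens;
}

const usageWindow: UsageSample[] = [
  // First request writes the shared prefix into the cache.
  { input_tokens: 200, cache_creation_input_tokens: 1000, cache_read_input_tokens: 0 },
  // Second request reads the same prefix back from the cache.
  { input_tokens: 200, cache_creation_input_tokens: 0, cache_read_input_tokens: 1000 },
];

console.log(cacheHitRate(usageWindow).toFixed(3)); // 0.417
```

A sudden drop in this ratio at steady traffic is a cheap early signal that prefixes have stopped matching or that entries are expiring between requests.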
### Why Hetzner instead of AWS?
Cost-per-vCPU is roughly 4x cheaper at our scale, networking quality between European Hetzner and Indian users is acceptable (p50 95 ms), and the operational simplicity of fixed boxes vs auto-scaling matched our team's experience. AWS would let us auto-scale for the spike automatically, at 4x the steady-state cost.
### Did you lose any data permanently?
No. The event log table for score changes meant the 140 affected mastery rows could be reconstructed from immutable change events. Permanent data loss is the failure mode that makes a postmortem unwriteable; logging every state change as an event is the cheapest insurance.
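Rebuilding state from an append-only log usually amounts to replaying events in order and keeping the last value per key. The sketch below is an assumption-heavy illustration: the `ScoreEvent` shape, the `rebuildScores` helper, and the idea of filtering out events written during the incident window are all hypothetical, not a description of PenLeap's schema.

```typescript
// Hypothetical shape of one immutable score-change event.
interface ScoreEvent {
  userId: string;
  concept: string;
  newScore: number;
  recordedAt: number; // epoch milliseconds
}

// Replays events oldest-first; the last kept event per (user, concept) wins.
function rebuildScores(
  events: ScoreEvent[],
  keep: (event: ScoreEvent) => boolean = () => true,
): Map<string, number> {
  const ordered = [...events].sort((a, b) => a.recordedAt - b.recordedAt);
  const scores = new Map<string, number>();
  for (const event of ordered) {
    if (!keep(event)) continue;
    scores.set(`${event.userId}:${event.concept}`, event.newScore);
  }
  return scores;
}

const incidentStart = 3000; // made-up timestamp marking the start of the bad writes
const eventLog: ScoreEvent[] = [
  { userId: "u1", concept: "fractions", newScore: 4, recordedAt: 2000 },
  { userId: "u1", concept: "fractions", newScore: 3, recordedAt: 1000 },
  { userId: "u1", concept: "fractions", newScore: 0, recordedAt: 3500 }, // written during the incident
  { userId: "u2", concept: "algebra", newScore: 7, recordedAt: 1500 },
];

const restored = rebuildScores(eventLog, (event) => event.recordedAt < incidentStart);
console.log(restored.get("u1:fractions"), restored.get("u2:algebra")); // 4 7
```

The replay only works because events are never updated in place, which is the property the answer above is pointing at.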
### Did students complain publicly?
Three Twitter posts and two Instagram stories. We responded to each within 90 minutes, explained what happened in plain terms, and gave each affected student 30 days of free Pro. All of them accepted graciously. Honesty was cheaper than spin.
### What about Apple App Store / Google Play crash reports?
PenLeap is web-first (PWA), no app store presence. Our crash reporting is via Sentry directly. Sentry caught the worker OOMs immediately — but did not surface them as user-facing because the workers restarted under PM2's cluster mode.
### How is the engineering team structured to handle these incidents?
Two backend engineers on rotation for production support, plus the CTO Hrishikesh on call for severity-1. Average resolution time across all 5 bugs was 75 minutes from identification. Our internal SLO is 90 minutes. We hit it on this incident.