Scaling PenLeap: 60 to 600 Concurrent Writers, Same Number of Servers
Streaming-LLM connection pooling, Redis-backed rate limits, queue-based grading, and the cost curve we cut by 71% — the engineering rebuild that 10x'd capacity.
Hrishikesh Baidya
May 1, 2026 · 16 min read
In January 2026, [PenLeap](https://penleap.com) — our in-house AI writing coach — could hold roughly 60 simultaneous students writing essays with real-time feedback. By April, that number was 600. We did not add servers. The cost curve actually bent down. This post is the engineering teardown of the rebuild: streaming-LLM connection pooling, Redis-backed rate limits, a queue-based grading pipeline, and the four decisions that mattered most.
- 10x: concurrent writer capacity gain (60 → 600)
- -71%: cost per concurrent user per hour
- 0: additional application servers provisioned
- 320ms: p95 time-to-first-token after rebuild
## TL;DR
We replaced per-request LLM HTTP calls with a streaming connection pool against Claude and GPT, moved rate limiting from in-process counters to atomic operations on a Redis cluster, and moved essay grading from synchronous in-request calls to an SQS-backed queue with priority lanes. The bottleneck on the Node.js layer, which turned out to be thread starvation rather than CPU, went away; the new ceiling is Anthropic's tokens-per-minute quota. We bought a higher quota tier and shipped.
## Why this matters now
The cost per concurrent user is what gates pricing for any AI-feedback product. In late 2025, our infra cost per concurrent writer was ~₹38/hour. After the rebuild that dropped to ~₹11/hour. That's the difference between a £8/month plan that bleeds money on heavy users and a £8/month plan that compounds. The same engineering pattern — pool the streaming connections, move state to Redis, queue the heavy work — applies to any real-time LLM product. We've now reused it for two client projects.
## The cost curve
The "before" curve trends up because we were paying overhead per concurrent user (one HTTP socket per active conversation, lots of TCP setup, no batching). The "after" curve is flat because the streaming connection pool and Redis state mean marginal cost per user is essentially the marginal LLM token spend.
## The four decisions that mattered
- 🔌 Streaming connection pool: one persistent HTTP/2 multiplexed connection per worker, fanned out to N concurrent streaming requests. Eliminated TCP setup tax.
- ⚡ Redis-backed rate limits: per-user and per-org TPM/RPM tracked via Redis atomic INCR. Replaced per-process counters that didn't scale across pods.
- 📬 Queue-based grading: essay grading (a 6-second LLM call) moved to SQS with priority lanes. Frees the request thread instantly.
- 📊 Per-endpoint backpressure: each LLM endpoint has a per-second token budget. When breached, requests queue rather than fail. Smooths burst load.
## Decision 1: streaming connection pool
The old code did this, naively:
```javascript
// BEFORE: one HTTP request per LLM call
// async generator (function*), since the body yields chunks as they arrive
async function* streamFeedback(essay, ctx) {
  const resp = await fetch(claudeURL, {
    method: 'POST',
    body: buildPayload(essay, ctx),
    // implicit: new TCP, new TLS, new HTTP/1.1 request
  })
  for await (const chunk of resp.body) yield decode(chunk)
}
```
Under 60 concurrent users this looked fine. At 200 we saw socket exhaustion on the Node process. The fix: a single HTTP/2 multiplexed connection per worker, with explicit stream IDs per active conversation. Anthropic's API supports HTTP/2 streaming; we'd just been using a default Node HTTP client that fell back to HTTP/1.1 per request.
Effect: median TCP+TLS setup time dropped from 80ms to 0ms (amortized). Worker socket pressure dropped 10x. We could now hold 300 concurrent streams per Node process instead of 30.
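A minimal sketch of what a pooled client can look like, assuming Node's built-in `node:http2` module. The class name `Http2StreamPool`, the injectable `connect` hook, and the method names are illustrative rather than PenLeap's actual code; auth headers, retries, and reconnect policy are left out.

```javascript
const http2 = require('node:http2')

// Hypothetical helper: one long-lived HTTP/2 session per worker,
// with each LLM call opened as its own stream on that session.
class Http2StreamPool {
  constructor(origin, { connect = http2.connect } = {}) {
    this.origin = origin
    this.connect = connect // injectable so the pool can be exercised without a network
    this.session = null
  }

  // Reuse the live session; dial a new one only if none exists or it has gone away.
  getSession() {
    const current = this.session
    if (current && !current.closed && !current.destroyed) return current
    const session = this.connect(this.origin)
    const drop = () => {
      if (this.session === session) this.session = null
    }
    session.on('error', drop)
    session.on('close', drop)
    this.session = session
    return session
  }

  // Open one multiplexed stream; the returned stream is readable (async-iterable).
  openStream(path, payload, headers = {}) {
    const stream = this.getSession().request({
      ':method': 'POST',
      ':path': path,
      'content-type': 'application/json',
      ...headers,
    })
    stream.end(JSON.stringify(payload))
    return stream
  }
}
```

Every `openStream` call after the first rides the existing session, so it skips TCP and TLS setup. A caller would consume the result with `for await (const chunk of stream)`.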
## Decision 2: Redis-backed rate limits
We had per-process counters tracking per-user TPM. Worked fine on a single instance. On three pods behind a load balancer, the counters disagreed — a heavy user got 3x their fair quota because each pod tracked them independently. We moved counters to Redis with INCRBY operations and TTL expiry.
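```javascript
// Rate-limit check before any LLM call
// Assumed helper: index of the current one-minute window
const minute = () => Math.floor(Date.now() / 60_000)

async function canSpend(userId, tokensRequested) {
  const key = `rl:tpm:${userId}:${minute()}`
  // INCRBY is atomic, so every pod sees the same running total
  const used = await redis.incrby(key, tokensRequested)
  // The first write in a window sets a TTL slightly longer than the window
  if (used === tokensRequested) await redis.expire(key, 70)
  return used <= USER_TPM_QUOTA
}
```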
Atomic, cluster-wide, expires automatically. Adapted from [Redis's own LLM gateway scaling guide](https://redis.io/blog/scale-your-llm-gateway/). Per-org limits stack on top of per-user limits.
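To see why per-process counters over-admit, here is a small in-memory simulation (no Redis involved). The same request stream is checked once against three independent pod-local counters and once against a single shared counter. The quota, request size, and round-robin routing are made-up values for illustration.

```javascript
// Count how many requests are admitted under a per-user token quota.
// counters: one Map per pod for local tracking, or a single Map for shared tracking.
function admitted(counters, requests, quota) {
  let ok = 0
  requests.forEach((tokens, i) => {
    const counter = counters[i % counters.length] // round-robin load balancer
    const used = (counter.get('user-1') || 0) + tokens
    counter.set('user-1', used) // mirror INCRBY: count first, then compare
    if (used <= quota) ok++
  })
  return ok
}

const requests = Array(30).fill(100) // 30 requests of 100 tokens inside one window
const QUOTA = 1000

const perPod = admitted([new Map(), new Map(), new Map()], requests, QUOTA)
const shared = admitted([new Map()], requests, QUOTA)
console.log({ perPod, shared }) // { perPod: 30, shared: 10 }
```

With three pods, each local counter only sees a third of the traffic, so the user gets three times the intended quota. The shared counter admits exactly the quota's worth.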
## Decision 3: queue-based grading
Synchronous grading was the worst offender. A user finished an essay; the app held an HTTP connection open for 6 seconds while Claude graded; the worker thread was useless during that window. We moved grading to SQS with three priority lanes.
1. Lane 1: interactive (live feedback streaming). Real-time streaming feedback while the student writes. This stays synchronous; it has to. p95 first-token under 350ms.
2. Lane 2: high-priority grading (essay submission). Student submits, sees a "grading…" spinner, gets the score within 8 seconds. SQS message lands, worker picks up within 200ms, returns via WebSocket push.
3. Lane 3: background (re-grade, audit, recalibration). Bulk recalibration runs on the multi-judge ensemble. Hours acceptable. Cheap spot GPU workers consume this lane overnight.
The grading queue depth tells us if we're saturated faster than any other metric. We alert at depth > 200 in the high-priority lane.
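The lane routing and the saturation alert can be sketched as follows. This is an in-memory stand-in for illustration: in production each queued lane would be its own SQS queue, and the job `kind` values, lane names, and class name here are hypothetical. Only the depth threshold of 200 comes from the text above.

```javascript
// Lanes 2 and 3 are queued; lane 1 (interactive) is handled inline and never queues.
const QUEUED_LANES = ['grading', 'background'] // listed in priority order
const ALERT_DEPTH = 200 // alert threshold for the high-priority grading lane

// Map a job to its lane (the job kinds are illustrative).
function laneFor(job) {
  if (job.kind === 'live-feedback') return 'interactive'
  if (job.kind === 'essay-submission') return 'grading'
  return 'background' // re-grade, audit, recalibration
}

class GradingLanes {
  constructor() {
    this.queues = new Map(QUEUED_LANES.map((lane) => [lane, []]))
  }

  // Returns false for interactive work, which the caller streams synchronously.
  submit(job) {
    const lane = laneFor(job)
    if (lane === 'interactive') return false
    this.queues.get(lane).push(job)
    return true
  }

  // Always drain the highest-priority non-empty lane first.
  next() {
    for (const lane of QUEUED_LANES) {
      const queue = this.queues.get(lane)
      if (queue.length > 0) return queue.shift()
    }
    return undefined
  }

  // Saturation signal: depth of the high-priority lane above the alert threshold.
  saturated() {
    return this.queues.get('grading').length > ALERT_DEPTH
  }
}
```

Strict priority like this can starve the background lane under sustained load, which is acceptable here because that lane tolerates hours of delay.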
## Decision 4: per-endpoint backpressure
We hit Anthropic's TPM quota three times in week one. Each time, requests started failing with 429. The fix wasn't just bigger quota; it was being a polite consumer.
We track our own rolling-window token spend per endpoint per second. If we're within 80% of the quota in the current 60-second window, we start queuing new requests instead of firing them. Users see a "queued, position 4" indicator rather than a failure.
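```javascript
async function spendTokens(provider, tokens) {
  while (true) {
    // Recompute the key on every pass: the window can roll over while we wait
    const key = `spend:${provider}:${minute()}`
    // Reserve atomically. A separate GET then INCRBY lets two workers both pass
    // the check, and GET returns a string, so `usage + tokens` would concatenate.
    const used = await redis.incrby(key, tokens)
    if (used === tokens) await redis.expire(key, 70) // first write sets the TTL
    if (used < QUOTA[provider] * 0.95) return
    await redis.decrby(key, tokens) // over budget: release the reservation and wait
    await sleep(200 + jitter())
    // queue-front telemetry emits position to client
  }
}
```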
Effect: zero 429s from Anthropic since week 3. Tail latency increased modestly during burst periods (p99 went from 4s to 6s) but no requests fail outright.
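The rolling-window bookkeeping described in this section can be sketched in-process with per-second buckets. This is an illustration only: a multi-pod deployment would keep the buckets in Redis, and the class name, method names, and the `threshold` parameter are hypothetical. The threshold is passed in rather than hard-coded so the sketch does not assume a particular cutoff.

```javascript
class RollingSpend {
  constructor(windowSeconds = 60) {
    this.windowSeconds = windowSeconds
    this.buckets = new Map() // second index -> tokens spent in that second
  }

  // Record a spend in the bucket for the current second.
  record(tokens, nowMs = Date.now()) {
    const sec = Math.floor(nowMs / 1000)
    this.buckets.set(sec, (this.buckets.get(sec) || 0) + tokens)
  }

  // Sum the buckets still inside the window, discarding older ones.
  total(nowMs = Date.now()) {
    const oldest = Math.floor(nowMs / 1000) - this.windowSeconds + 1
    let sum = 0
    for (const [sec, tokens] of this.buckets) {
      if (sec < oldest) this.buckets.delete(sec)
      else sum += tokens
    }
    return sum
  }

  // True when spending `tokens` more would reach the soft threshold,
  // meaning the request should queue instead of firing.
  shouldQueue(tokens, quota, threshold, nowMs = Date.now()) {
    return this.total(nowMs) + tokens >= quota * threshold
  }
}
```

Unlike a fixed per-minute key, a rolling window does not reset all at once at the minute boundary, so a burst just before the boundary still counts against requests just after it.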
## DIY: scale your AI app similarly
1. Profile before you rebuild. Run a load test that ramps from 1 to 500 concurrent users. Watch what breaks first: CPU, file descriptors, LLM quota, or database connections. The bottleneck is rarely what you assumed.
2. Switch to a persistent HTTP/2 client for your LLM provider. Both Anthropic and OpenAI support HTTP/2. Node's default http library does not multiplex. Use the http2 module or a library like undici with explicit HTTP/2 support.
3. Move rate-limit counters to Redis. If you have more than one application pod, your in-process counters are wrong. Redis INCR + TTL is the canonical pattern. 5 lines of code.
4. Move anything >2s out of the request path. A 6-second grade in-request holds a worker thread for 6 seconds. Push it to SQS/Kafka/RabbitMQ and return immediately. Notify the client via WebSocket or polling when the result is ready.
5. Buy a higher quota tier before you launch a marketing push. The biggest single failure mode is not infra; it's hitting the LLM provider's TPM ceiling at 2pm on a Tuesday when your tweet goes viral. Pre-buy the quota one month before you need it; providers don't always grant upgrades instantly.
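For step 1, a ramp plan can be generated up front and fed to whatever load-testing tool you use. This sketch assumes a linear ramp; the function name and the stage shape (`users`, `holdSeconds`) are hypothetical and not tied to any specific tool.

```javascript
// Build a linear ramp from `from` to `to` concurrent users in `stages` steps.
function rampPlan(from, to, stages, holdSeconds) {
  if (stages < 2) throw new RangeError('need at least 2 stages')
  const plan = []
  for (let i = 0; i < stages; i++) {
    const users = Math.round(from + ((to - from) * i) / (stages - 1))
    plan.push({ users, holdSeconds })
  }
  return plan
}

// Six stages from 1 to 500 users, holding each for two minutes
console.log(rampPlan(1, 500, 6, 120).map((stage) => stage.users))
// [ 1, 101, 201, 300, 400, 500 ]
```

Holding each stage for a fixed time makes it easier to attribute a failure to a specific concurrency level rather than to the ramp itself.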
## Pre-flight checklist before you 10x your AI app
- Load-test from 1 to N+ concurrent users; identify the first thing that breaks
- Switch to a persistent HTTP/2 multiplexed client for your LLM provider
- Move rate-limit counters to Redis with atomic INCR + TTL
- Push anything >2s of LLM work out of the request path into a queue
- Implement per-endpoint backpressure with token-budget queueing
- Pre-buy a higher TPM tier from the LLM provider before launch
- Raise `ulimit -n` above 65,536 if you're running WebSockets at scale
- Add a periodic no-op ping to keep prefix caches warm
## What we got wrong on the way
Spent two weeks optimizing in-process queueing. A custom in-process scheduler felt elegant. It didn't survive a pod restart. We swapped to SQS and threw the in-process code away. Use battle-tested infrastructure for state that must survive a deploy.
Forgot about WebSocket connection limits. Each concurrent user holds a WebSocket open to the server. The file-descriptor ceiling (set by the OS via `ulimit -n`, not by Node itself) was ~10,000 by default. At 600 concurrent users plus internal connections plus database connections, we hit the ceiling and didn't notice until we got vague EMFILE errors. `ulimit -n` is your friend.
Underestimated the cost of WebSocket reconnects on mobile. Mobile networks drop connections constantly. Each reconnect re-authenticates and replays the conversation context. We added an opaque resume token; clients reconnect to the same session without paying the full handshake cost again.
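One way to build a resume token like this, sketched with Node's built-in `node:crypto`: sign the session ID and an expiry with an HMAC, so the server can validate a reconnect without repeating the full auth handshake. The token layout, function names, and secret handling are assumptions for illustration, not PenLeap's actual format.

```javascript
const crypto = require('node:crypto')

const sign = (secret, payload) =>
  crypto.createHmac('sha256', secret).update(payload).digest('base64url')

// Issue a token binding a session ID to an expiry time (ms since epoch).
function issueResumeToken(secret, sessionId, expiresAtMs) {
  const payload = Buffer.from(JSON.stringify({ sessionId, expiresAtMs })).toString('base64url')
  return `${payload}.${sign(secret, payload)}`
}

// Return the session ID if the token is authentic and unexpired, else null.
function verifyResumeToken(secret, token, nowMs = Date.now()) {
  const [payload, mac] = String(token).split('.')
  if (!payload || !mac) return null
  const expected = Buffer.from(sign(secret, payload))
  const given = Buffer.from(mac)
  // timingSafeEqual throws on length mismatch, so compare lengths first
  if (given.length !== expected.length || !crypto.timingSafeEqual(given, expected)) return null
  // Safe to parse: the payload has been authenticated above
  const { sessionId, expiresAtMs } = JSON.parse(Buffer.from(payload, 'base64url').toString())
  return nowMs < expiresAtMs ? sessionId : null
}
```

A stateless token like this lets any pod accept the reconnect. The conversation context itself would still live server-side, keyed by the session ID.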
Production gotcha: Anthropic's prefix cache TTL. The default cache lifetime is 5 minutes. If a conversation pauses for 6 minutes, which happens often when a student stops to think, the cache expires and the next turn pays full input-token price. We keep the cache warm during active sessions with periodic no-op pings, at a small cost.
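The keep-warm decision reduces to a small pure function. This sketch uses the 5-minute lifetime mentioned above and assumes that any request hitting the cached prefix restarts that timer. The safety margin and the "active session" cutoff are illustrative values, and the ping itself (whatever minimal request re-reads the cached prefix) is left out.

```javascript
const CACHE_TTL_MS = 5 * 60_000 // cache lifetime from the text above
const SAFETY_MARGIN_MS = 30_000 // assumed: ping this long before expiry
const ACTIVE_CUTOFF_MS = 20 * 60_000 // assumed: stop pinging after 20 idle minutes

// lastCacheTouchMs: last request that used the cached prefix (a turn or a ping)
// lastUserActivityMs: last keystroke or turn from the student
function shouldPing(lastCacheTouchMs, lastUserActivityMs, nowMs) {
  const sessionActive = nowMs - lastUserActivityMs < ACTIVE_CUTOFF_MS
  const nearExpiry = nowMs - lastCacheTouchMs >= CACHE_TTL_MS - SAFETY_MARGIN_MS
  return sessionActive && nearExpiry
}
```

The idle cutoff matters for cost: without it, an abandoned tab would keep generating pings indefinitely.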
## When NOT to do this rebuild
Under 50 concurrent users. The complexity isn't worth it. A single fat box with sync grading is fine and easier to debug.
Bursty rather than sustained load. If your load is "1 user 99% of the time, 200 users at exam season for 4 days," buy reserved capacity from the LLM provider for those days rather than rebuilding the architecture. We made that mistake first.
You haven't measured. "We need to scale" without "the bottleneck is X at concurrent user Y" is premature optimization. Always profile first.
## Real cost numbers (May 2026)
For a single PenLeap concurrent user during active writing (no idling), here's the per-hour cost split:
| Cost component | Before (₹/user/hr) | After (₹/user/hr) |
| --- | --- | --- |
| Claude streaming feedback | 22 | 6.4 |
| Multi-judge grading (amortized) | 9 | 3.1 |
| Application servers (EC2) | 4 | 0.9 |
| Redis cluster + SQS | 0.5 | 0.4 |
| Misc (bandwidth, monitoring) | 2.5 | 0.2 |
| Total ₹/concurrent user/hr | 38 | 11 |
The biggest single contributor to the savings was the Claude streaming line, where prefix caching (plus connection pooling) cut input token cost. The saving that surprised us most was a small one: bandwidth. Streaming responses use less bandwidth than we expected because HTTP/2 header compression is genuinely efficient.
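The totals and the headline -71% can be re-derived from the per-component figures. A short script using only the numbers listed above:

```javascript
// [component, before, after] in ₹ per concurrent user per hour
const rows = [
  ['Claude streaming feedback', 22, 6.4],
  ['Multi-judge grading (amortized)', 9, 3.1],
  ['Application servers (EC2)', 4, 0.9],
  ['Redis cluster + SQS', 0.5, 0.4],
  ['Misc (bandwidth, monitoring)', 2.5, 0.2],
]
const round1 = (x) => Math.round(x * 10) / 10 // tame floating-point noise

const before = round1(rows.reduce((sum, row) => sum + row[1], 0))
const after = round1(rows.reduce((sum, row) => sum + row[2], 0))
const savedPct = Math.round(((before - after) / before) * 100)

// Share of the total saving contributed by each component, in percent
const shares = rows.map(([name, b, a]) => [name, Math.round(((b - a) / (before - after)) * 100)])

console.log({ before, after, savedPct }) // { before: 38, after: 11, savedPct: 71 }
```

By this arithmetic the Claude streaming line accounts for roughly 58% of the ₹27 saving, with grading at about 22%.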
## Why this scaling work matters
[PenLeap](https://penleap.com) is our in-house product. We had a choice in February: either raise prices to cover the cost-per-user gap or rebuild the architecture. Rebuilding meant 6 engineering weeks. Raising prices meant immediately losing 30% of users. We rebuilt. The same pattern — streaming pool, Redis state, queue-based heavy work, backpressure — is what we now bring to client projects when they need to scale an LLM-powered product without a 10x infra budget increase.
For a live discussion of similar engineering at hyperscale, the [Hacker News thread on OpenAI's low-latency voice scaling](https://news.ycombinator.com/item?id=48013919) is the best read.
## FAQ
### Why HTTP/2 multiplexing instead of WebSockets to the LLM provider?
Anthropic and OpenAI both expose HTTP/2 streaming endpoints, not WebSocket endpoints. HTTP/2 multiplexing gives us the same multi-stream-over-one-connection benefit. WebSockets would be a different transport entirely.
### Did you consider running an open-source LLM in-house instead?
Yes. The math didn't work for our scale. At 600 concurrent users and Claude Sonnet 4.5 quality requirements, the GPU cost to run a comparable open model 24/7 exceeds our Anthropic API spend by ~40%, and we'd own all the ops. We'll reconsider as open models close the quality gap.
### How do you handle Anthropic outages?
We failover to GPT-5.4 on the feedback path within 200ms. The grading multi-judge layer already uses both providers, so it degrades to a two-judge ensemble during single-provider outages. We accept the QWK drop (0.84 → 0.81) for those windows.
### What metric do you watch most closely?
p95 time-to-first-token on the streaming feedback path. If it crosses 600ms, user retention starts measurably dropping the same day.
### Is 600 concurrent the ceiling now?
No. The current architecture should hold up to ~2,000 concurrent on the same infrastructure footprint, gated by Anthropic's quota. We'd need to negotiate a higher TPM tier before testing beyond that.
### What did the rebuild cost in engineering time?
Six weeks for two engineers, plus one week of careful production rollout with feature flags and gradual percentage cutover. Roughly ₹16 lakhs in loaded engineer cost. The cost savings paid this back inside three months.
### Should I rebuild before or after product-market fit?
After. The architecture choices we made are right for our shape of demand (sustained concurrent users with streaming feedback). A different product — say, batch grading once a day — would warrant a different architecture entirely. Don't pre-build for scale you haven't proven you need.
Want Your AI App Stress-Tested and Tuned?
We've scaled PenLeap from 60 to 600 concurrent writers using the patterns above, and we've now applied the same playbook to two client projects. Typical engagement: 4-8 weeks of profiling, architecture rework, and load testing. First call is technical, with the engineer who'll lead your project.