Inside PenLeap: 800 Personalized Vocabulary Drills Per Class 9 Student in Under 4 Seconds
How our in-house edtech product PenLeap pulls 800 vocabulary drills from a 42,000-word bank in 3.6s using pgvector retrieval, a Bayesian mastery model, and a per-student difficulty tuner. The exact pipeline.
Khushi Singh
September 8, 2025 · 14 min read
A Class 9 ICSE student in Indore opens her tablet at 7:14 pm and taps "vocabulary drill" inside PenLeap, our in-house edtech product. 3.6 seconds later she sees a 30-question set: every word picked just for her, calibrated to a difficulty she has a 72% chance of getting right, and tagged to the SPaG concepts she has been weakest on for the past nine days. Behind that wait sits a four-stage pipeline: a pgvector retrieval over 42,000 words, a Bayesian mastery model maintained per student, a difficulty tuner that runs a one-shot LLM call for personalization context, and a deduper that keeps the same word from appearing twice in 14 days. This post covers the architecture, the numbers, and the three things we got wrong in the first build.
| Metric | Value |
|---|---|
| p50 drill generation time (4G, India) | 3.6s |
| Drills per Class 9 student per month | 800 |
| Words in the source bank (CEFR A1 to C1) | 42,000 |
| Target first-attempt correctness band | 72% |
## The Answer in 60 Words
We embed every word in a 42,000-entry bank with a 384-dim sentence-transformer, store the vectors in pgvector, and at request time pull the 200 nearest neighbours of the student's current weak-concept centroid. A Bayesian mastery model picks 30 of those 200, sized to a 72% target correctness. Total cost per drill set: ₹0.31. Total wall-clock: 3.6 seconds at p50.
## Why Personalized Vocabulary Drills Matter For Class 9
Class 9 is the year a CBSE or ICSE student stops reading textbooks and starts reading exam papers. The vocabulary jump from Class 8 to Class 9 English is roughly 2,400 new lexical items across the syllabus, and the [research on adaptive spacing for vocabulary learning is now decisive](https://www.arxiv.org/pdf/2508.03275): LECTOR (the LLM-enhanced concept-based test-oriented repetition system) hit a 90.2% mastery success rate vs 88.4% for the best non-LLM baseline (SSP-MMC). For an Indian Class 9 student preparing for boards in 16 months, "do 30 random drills from the dictionary" wastes their time. "Do 30 drills you have a 72% chance of getting right, weighted to your weakest 4 concepts" is the difference between staying engaged and quitting in week 3.
## What "Personalized" Actually Means Here
We are not selecting words at random. Each drill set is built from four signals:
- **Per-Student Mastery Vector.** A 32-dimensional vector with a beta-distribution mastery score per concept (synonyms, antonyms, idioms, root-word morphology, collocation, register, false-friends, and 25 more). Updated after every answer.
- **Target Difficulty Band.** We aim for 72% first-attempt correctness, the sweet spot from Bjork's "desirable difficulty" research. Below 60%, students disengage; above 85%, no learning happens.
- **Spaced-Recall Slot.** Six of the 30 drills are scheduled re-tests of words the student missed 3, 7, or 21 days ago. The intervals come from a per-student forgetting-curve fit, not a fixed schedule.
- **Syllabus Anchor.** Every word is tagged with the chapter and unit it appears in across CBSE / ICSE / Cambridge Lower Secondary so a drill set always feels like it belongs to "this week's English class".
## The Pipeline (4 Stages, 3.6 Seconds)
### Stage 1: Build the weak-concept query vector (180 ms)
Pull the student's mastery vector. Identify the 4 lowest-scoring concepts. Average their concept centroids (precomputed at training time). The result is a 384-dim query vector pointed at the student's current weakness.

### Stage 2: pgvector ANN search over 42K words (220 ms)
A single SQL query: `SELECT word_id FROM word_vectors ORDER BY embedding <=> $1 LIMIT 200`. We use an HNSW index with m=16, ef_construction=64. Recall at 200 neighbours: 98.4% vs brute force.

### Stage 3: Score, filter, and pick 30
Score the 200 candidates with a logistic difficulty model that predicts P(correct | this student, this word). Drop everything outside 0.55-0.85. Pick the 30 that maximize concept-coverage diversity. One Claude Haiku call (cached prefix) re-orders them for narrative flow.

### Stage 4: Render + push to client (1.8 s)
For each chosen word, attach: 4 distractors (also pgvector lookups, but cached), an example sentence (pre-generated), and an audio pronunciation URL (pre-rendered to S3). The final payload is 38 KB, pushed over a single HTTP/2 connection.
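Stage 1 can be sketched in a few lines. This is a minimal illustration rather than PenLeap's actual code: the function names, the dictionary layout, and the toy 2-dimensional centroids are assumptions (real centroids would be 384-dim).

```python
from typing import Dict, List


def weakest_concepts(mastery: Dict[str, float], k: int = 4) -> List[str]:
    """Return the k concepts with the lowest expected mastery."""
    return sorted(mastery, key=mastery.get)[:k]


def mean_vector(vectors: List[List[float]]) -> List[float]:
    """Element-wise average of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def build_query_vector(
    mastery: Dict[str, float],
    centroids: Dict[str, List[float]],
    k: int = 4,
) -> List[float]:
    """Average the centroids of the k weakest concepts into one query vector."""
    weak = weakest_concepts(mastery, k)
    return mean_vector([centroids[name] for name in weak])


if __name__ == "__main__":
    # Toy data: expected mastery per concept and 2-dim stand-in centroids.
    mastery = {"synonyms": 0.8, "idioms": 0.3, "collocation": 0.4,
               "register": 0.9, "antonyms": 0.5, "false_friends": 0.45}
    centroids = {"synonyms": [5.0, 5.0], "idioms": [1.0, 0.0],
                 "collocation": [0.0, 1.0], "register": [5.0, 5.0],
                 "antonyms": [0.0, 0.0], "false_friends": [1.0, 1.0]}
    print(weakest_concepts(mastery))
    print(build_query_vector(mastery, centroids))
```

Because the search orders by cosine distance, the averaged vector does not need to be re-normalised before it is used as the query.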
## Why pgvector Instead of Pinecone or Weaviate
We evaluated three vector stores in February 2025: Pinecone (managed), Weaviate (self-host), pgvector (Postgres extension). The pgvector decision came down to four numbers.
| Criterion | Pinecone Standard | Weaviate self-host | pgvector on PG 16 |
|---|---|---|---|
| 42K vectors @ 384 dim, p50 ANN search | 38 ms | 22 ms | 12 ms |
| Monthly cost (our QPS profile) | ₹6,800 | ₹3,200 (Hetzner) | ₹740 (same Postgres box as the rest of the app) |
| Operational complexity | Zero | Medium (separate cluster) | Trivial (already running PG) |
| Joining vector results to relational tables | API round-trip + in-app join | API round-trip + in-app join | One SQL query |
The last point matters most. Every drill needs to join a vector match to word metadata, syllabus tags, the student's mastery row, and the spaced-recall queue. With pgvector, that is one query. With Pinecone, it is four. The 26 ms saving per request matters at 800 drills × 18,000 active students.
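To make the "one SQL query" point concrete, here is a hedged sketch of what such a join could look like. Only `word_vectors`, `word_id`, and `embedding` come from the query quoted earlier; the other tables and columns (`words`, `syllabus_tags`, `student_mastery`, `recall_queue`) are hypothetical names chosen for illustration, and the parameter style assumes a psycopg-like driver.

```python
from typing import Dict, Sequence, Tuple

# Hypothetical schema: only word_vectors(word_id, embedding) appears in the post.
CANDIDATE_SQL = """
SELECT c.word_id, w.cefr_level, t.chapter, m.alpha, m.beta, q.due_on
FROM (
    -- ANN search first, so the HNSW index can serve ORDER BY ... LIMIT
    SELECT word_id
    FROM word_vectors
    ORDER BY embedding <=> %(query)s::vector
    LIMIT %(limit)s
) AS c
JOIN words w            ON w.word_id = c.word_id
JOIN syllabus_tags t    ON t.word_id = c.word_id
JOIN student_mastery m  ON m.student_id = %(student_id)s
                       AND m.concept = w.concept
LEFT JOIN recall_queue q ON q.student_id = %(student_id)s
                        AND q.word_id = c.word_id
"""


def to_vector_literal(vec: Sequence[float]) -> str:
    """Format a Python sequence as a pgvector text literal, e.g. '[0.5,1.0]'."""
    return "[" + ",".join(repr(float(x)) for x in vec) + "]"


def candidate_query(
    student_id: int, query_vec: Sequence[float], limit: int = 200
) -> Tuple[str, Dict[str, object]]:
    """Return the SQL text and named parameters for one candidate fetch."""
    params = {
        "query": to_vector_literal(query_vec),
        "student_id": student_id,
        "limit": limit,
    }
    return CANDIDATE_SQL, params


if __name__ == "__main__":
    sql, params = candidate_query(7, [0.5, 1, -0.25])
    print(params["query"])
```

With a driver such as psycopg, the pair would be passed as `cursor.execute(sql, params)`. The nearest-neighbour subquery runs first and the relational joins only touch the 200 rows it returns.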
## The Bayesian Mastery Model (Where The Real IP Sits)
The vector retrieval is the easy part. The hard part is deciding which 30 of the 200 candidates the student should actually see. We track, per student, per concept, a beta distribution of mastery, parameterised as Beta(α, β) where α is "correct attempts + 1" and β is "incorrect attempts + 1". The expected mastery is α / (α + β), and the variance gives us a confidence interval.
When the variance is wide (the student has only seen this concept twice), we deliberately surface more questions in that concept band: exploration. When the variance is tight and mastery is below 0.7, we drill in: exploitation. This is the same multi-armed-bandit framing we wrote about for question selection, applied at the concept level.
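The explore/exploit rule can be illustrated with a small sketch. The mean and variance formulas are the standard ones for a Beta distribution and the 0.7 mastery threshold comes from the text; the variance cut-off of 0.02, the class name, and the "maintain" label are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class ConceptMastery:
    alpha: float = 1.0  # correct attempts + 1
    beta: float = 1.0   # incorrect attempts + 1

    @property
    def mean(self) -> float:
        """Expected mastery: alpha / (alpha + beta)."""
        return self.alpha / (self.alpha + self.beta)

    @property
    def variance(self) -> float:
        """Variance of Beta(alpha, beta)."""
        n = self.alpha + self.beta
        return (self.alpha * self.beta) / (n * n * (n + 1))


def selection_mode(
    m: ConceptMastery,
    wide_variance: float = 0.02,   # assumed cut-off, not from the post
    mastery_goal: float = 0.7,     # threshold quoted in the post
) -> str:
    """Classify a concept as explore, exploit, or maintain."""
    if m.variance >= wide_variance:
        return "explore"    # too few observations to trust the mean
    if m.mean < mastery_goal:
        return "exploit"    # confident estimate, and the student is weak here
    return "maintain"       # confident estimate, and the student is doing fine


if __name__ == "__main__":
    for m in (ConceptMastery(3, 1), ConceptMastery(13, 9), ConceptMastery(19, 3)):
        print(m, round(m.mean, 3), round(m.variance, 4), selection_mode(m))
```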
Two practical gotchas surfaced in the first three weeks of production:
Gotcha 1: The "all easy" trap. A new Class 9 student starts with uniform priors, Beta(1,1) for every concept. The model predicts P(correct) ≈ 0.5 for every word. Without a hard floor, every first session would be all easy words. Fix: for the first 3 sessions of a new student, we add a difficulty floor that grows with attempts.
Gotcha 2: Beta priors collapse on streaks. If a student gets 18 in a row right, the beta tightens fast and the model thinks the concept is mastered. But the words in that streak might all be CEFR A2. Fix: we weight updates by the word's CEFR level, so getting a C1 word right contributes 2.4× more α than an A1 word.
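A rough sketch of the CEFR-weighted update from Gotcha 2 follows. Only the two endpoints are given in the text (A1 as the baseline, C1 at 2.4×); the intermediate weights below are a linear interpolation assumed for illustration, and leaving incorrect answers unweighted is also an assumption.

```python
from typing import Tuple

# A1 = 1.0 and C1 = 2.4 come from the post; A2, B1, B2 are assumed
# (linearly interpolated) and would need tuning on real data.
CEFR_WEIGHT = {"A1": 1.0, "A2": 1.35, "B1": 1.7, "B2": 2.05, "C1": 2.4}


def update_mastery(
    alpha: float, beta: float, correct: bool, cefr_level: str
) -> Tuple[float, float]:
    """Return the updated (alpha, beta) after one answered question."""
    if correct:
        # Harder words add more evidence of mastery.
        return alpha + CEFR_WEIGHT[cefr_level], beta
    # Assumption: a miss always counts as one unit, whatever the level.
    return alpha, beta + 1.0


if __name__ == "__main__":
    state = (1.0, 1.0)  # uniform prior Beta(1, 1)
    for correct, level in [(True, "A1"), (True, "C1"), (False, "B1")]:
        state = update_mastery(*state, correct, level)
        print(level, correct, state)
```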
## The Difficulty Tuner (Logistic Regression Per Student Per Concept)
Every candidate word gets scored with a logistic model:
`P(correct) = sigmoid(student_mastery_for_concept * 1.4 - word_difficulty * 1.1 + days_since_last_seen * 0.03)`
We trained the coefficients on 1.2 million answered questions across our pilot cohort: PenLeap users from 38 schools across Mumbai, Delhi, Bangalore, and Indore. The cross-validated MAE on predicted P(correct) vs observed: 0.071. That is good enough that the 72% target band is hit within ±5 percentage points on the actual session.
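The scoring step can be sketched as below, using the coefficients from the formula above and the 0.55 to 0.85 band from the pipeline description. The input scales (mastery as the Beta mean in [0, 1], difficulty on a normalised scale) and the candidate dictionary layout are assumptions, since the post does not specify them.

```python
import math
from typing import Dict, List, Tuple


def p_correct(mastery: float, difficulty: float, days_since_last_seen: float) -> float:
    """Logistic estimate of first-attempt correctness."""
    z = 1.4 * mastery - 1.1 * difficulty + 0.03 * days_since_last_seen
    return 1.0 / (1.0 + math.exp(-z))


def filter_to_band(
    candidates: List[Dict[str, object]],
    mastery_by_concept: Dict[str, float],
    low: float = 0.55,
    high: float = 0.85,
) -> List[Tuple[str, float]]:
    """Keep (word, probability) pairs whose predicted correctness is in band."""
    kept = []
    for cand in candidates:
        p = p_correct(
            mastery_by_concept[cand["concept"]],
            cand["difficulty"],
            cand["days_since_last_seen"],
        )
        if low <= p <= high:
            kept.append((cand["word"], p))
    return kept


if __name__ == "__main__":
    mastery = {"synonyms": 0.8, "idioms": 0.2}
    candidates = [
        {"word": "ubiquitous", "concept": "synonyms",
         "difficulty": 0.5, "days_since_last_seen": 0},
        {"word": "red herring", "concept": "idioms",
         "difficulty": 1.0, "days_since_last_seen": 0},
    ]
    for word, p in filter_to_band(candidates, mastery):
        print(word, round(p, 3))
```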
## The Deduper (Stops The Same Word Showing Twice In 14 Days)
An obvious pitfall: if "ubiquitous" was the right answer in yesterday's drill, it should not be the right answer today. Our first deduper just checked the last 7 days. We learned the hard way that a 14-day window cuts repeat-fatigue complaints by 64%. The implementation is a Redis sorted set per student, with word_id as the member and the last-shown timestamp as the score. A drill candidate gets dropped if its word appeared in the last 14 days.
Cost of the dedupe: 1 ms per Redis ZRANGEBYSCORE call. Memory: ~14 KB per active student per fortnight. We hold 18,000 students in a 256 MB Redis instance with room to spare.
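A minimal sketch of the dedupe check, written against a redis-py style client that is passed in so the logic stays testable. The key naming (`seen:<student_id>`) and the function names are hypothetical; `zadd` and `zrangebyscore` are the standard sorted-set calls.

```python
import time
from typing import Iterable, List, Optional

WINDOW_SECONDS = 14 * 24 * 3600  # the 14-day window from the post


def seen_key(student_id: int) -> str:
    """Hypothetical key layout: one sorted set per student."""
    return f"seen:{student_id}"


def record_shown(client, student_id: int, word_ids: Iterable[int],
                 now: Optional[float] = None) -> None:
    """Store each shown word with the current timestamp as its score."""
    now = time.time() if now is None else now
    mapping = {str(w): now for w in word_ids}
    if mapping:
        client.zadd(seen_key(student_id), mapping)


def drop_recent(client, student_id: int, candidates: Iterable[int],
                now: Optional[float] = None) -> List[int]:
    """Remove candidates whose word was shown inside the window."""
    now = time.time() if now is None else now
    recent = client.zrangebyscore(
        seen_key(student_id), now - WINDOW_SECONDS, "+inf"
    )
    # redis-py returns bytes unless decode_responses=True is set.
    recent_ids = {r.decode() if isinstance(r, bytes) else str(r) for r in recent}
    return [w for w in candidates if str(w) not in recent_ids]
```

A production version would presumably also trim entries older than the window (for example with `ZREMRANGEBYSCORE`) so each set stays small.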
## A Cost Breakdown We Ship To Investors
At 800 drill-sets per student per month and 18,000 active students, that is ₹4.46 lakh in compute, or roughly ₹24.8 per student per month. Per-paying-user revenue at our ₹449/month tier is ₹4,490 / 10 = ₹449. Gross margin per user on the drill engine alone: roughly 94% (1 - 24.8 / 449). That is the kind of unit economics that lets us keep adding features.
## What We Got Wrong In The First Build
The first version of this engine shipped in March 2025. We had to rebuild three things in the first 60 days.
Wrong #1: We started with brute-force cosine over 42K vectors. It was simple, it worked, and it took 280 ms at p50. Acceptable on paper. Unacceptable when concurrent QPS hit 60: Postgres CPU pinned at 100%, queries queued. Switching to HNSW dropped p50 to 12 ms and freed the CPU for everything else.
Wrong #2: The mastery model updated after every answer. This caused a write storm: 30 questions × 18,000 students = 540K row updates an hour during peak study time. Postgres handled it, but replication lag spiked. Fix: batch updates per session, write once at session end.
Wrong #3: Distractors were generated on demand. Every drill needed 4 wrong-answer choices that were "plausibly close." Our first version called Claude Haiku per word. At 30 words × ~700 ms per call, the drill set took 21 seconds to generate. We pre-generated distractors for the entire 42K-word bank in a one-time batch run (cost: ₹2,800 in API spend, took 4 hours), stored them in a JSONB column, and removed the on-demand call entirely.
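The fix for Wrong #2 amounts to buffering answers in memory and writing once per concept at session end. The sketch below shows that shape; the class name and the row format are assumptions, and it ignores the CEFR weighting for brevity.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


class SessionBuffer:
    """Accumulates per-concept results so they can be written in one batch."""

    def __init__(self) -> None:
        # concept -> [correct_count, incorrect_count]
        self._deltas: Dict[str, List[int]] = defaultdict(lambda: [0, 0])

    def record(self, concept: str, correct: bool) -> None:
        """Buffer one answer; nothing is written to the database here."""
        self._deltas[concept][0 if correct else 1] += 1

    def flush(self) -> List[Tuple[str, int, int]]:
        """Return one (concept, alpha_delta, beta_delta) row per concept."""
        rows = [(c, d[0], d[1]) for c, d in sorted(self._deltas.items())]
        self._deltas.clear()
        return rows


if __name__ == "__main__":
    buf = SessionBuffer()
    answers = [("idioms", True), ("idioms", False), ("synonyms", True),
               ("idioms", True), ("synonyms", True)]
    for concept, correct in answers:
        buf.record(concept, correct)
    # Each row would become a single UPDATE that adds the deltas to alpha and beta.
    print(buf.flush())
```

Five answers collapse into two rows here; a 30-question session touches at most as many rows as it has distinct concepts.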
## The Pre-Ship Checklist (Vector Retrieval In Production)
- HNSW index built with m=16, ef_construction=64, and `hnsw.ef_search` raised to at least the query's LIMIT (pgvector's default of 40 returns fewer than 200 rows and hurts recall)
- Recall@k measured against brute force on a 1,000-query holdout: must be ≥98%
- p99 query latency under load: must be under 100 ms with 60 QPS
- Mastery-model updates batched per session, not per question
- Distractors pre-generated and stored, never on-demand
- Dedupe window is at least 14 days (we tried 7, complaints rose)
- New-student difficulty floor for the first 3 sessions
- CEFR-level weighting baked into the mastery update
- One Postgres connection per request, max, not one per vector call
- Logging: per-session correctness band actually achieved vs target band
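The recall check in the list above is easy to script. This sketch assumes you already have, for each holdout query, the ID list returned by the HNSW index and the ID list returned by an exact (brute-force) scan; the function names are illustrative.

```python
from typing import List, Sequence, Tuple


def recall_at_k(ann_ids: Sequence[int], exact_ids: Sequence[int], k: int) -> float:
    """Fraction of the exact top-k that the approximate search also returned."""
    truth = set(exact_ids[:k])
    if not truth:
        return 1.0
    return len(truth & set(ann_ids[:k])) / len(truth)


def mean_recall(pairs: List[Tuple[Sequence[int], Sequence[int]]], k: int) -> float:
    """Average recall@k over (ann_ids, exact_ids) pairs from a holdout set."""
    scores = [recall_at_k(ann, exact, k) for ann, exact in pairs]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    holdout = [
        ([1, 2, 3, 5], [1, 2, 3, 4]),   # one miss out of four
        ([7, 8, 9, 10], [7, 8, 9, 10]),  # perfect match
    ]
    print(mean_recall(holdout, k=4))
```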
## When NOT To Build This Yourself
Skip the custom mastery model if (a) you have under 500 active students, where a fixed difficulty curve will work fine, (b) your content is fewer than 2,000 items, where pgvector is overkill and you can just rank by concept tag, or (c) your team has no production Postgres experience, because running pgvector in production needs the same operational maturity as running any heavy-write Postgres workload. Our promptfoo pipeline post describes the same eval discipline we used to validate the mastery model. Borrow the framework, and skip the custom build until you actually have the data.
## Real Example β The Same Pipeline For 11+ Creative Writing
We built this engine for vocabulary first because vocabulary is the cleanest possible item: one word, one correct answer, one mastery dimension. The same architecture now runs PenLeap's 11+ creative-writing drill bank for UK CSSE / FSCE candidates. The only changes: 24 concepts instead of 32 (creative writing has a different rubric), CEFR mapping replaced with the UK 11+ band system, and the difficulty band is 65% instead of 72% (creative writing is more subjective, so we want students to feel slightly more challenged). The engine itself runs unchanged.
## A Detail That Saved Us On Day 47
On day 47 of the production rollout, a student in Pune complained: "every drill is the same 5 words." Our retrieval was working, but her mastery vector had collapsed onto one concept (idioms) because she had been answering only idiom drills. The bandit had decided idioms were her weakest concept and was hammering her with idioms. Fix: a "concept diversity floor". Every drill set must touch at least 3 different concepts, even if mastery says "this student should only see idioms." We tuned the floor to 3 after testing 2 and 4 with a 600-student A/B. Three concepts hit the right balance between focus and breadth.
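One simple way to enforce a concept diversity floor is a two-pass greedy pick: first take the best candidate from each new concept until the floor is met, then fill the remaining slots by score. This is an illustrative sketch under that assumption, not necessarily how PenLeap implements it; the tuple layout and function name are made up for the example.

```python
from typing import List, Tuple

Candidate = Tuple[str, str, float]  # (word, concept, priority score)


def pick_with_diversity(
    candidates: List[Candidate], n: int = 30, min_concepts: int = 3
) -> List[Candidate]:
    """Pick up to n candidates that cover at least min_concepts concepts."""
    ranked = sorted(candidates, key=lambda c: c[2], reverse=True)
    chosen_idx: List[int] = []
    seen_concepts = set()

    # Pass 1: best-scoring candidate from each new concept, until the floor is met.
    for i, (_, concept, _) in enumerate(ranked):
        if len(seen_concepts) >= min_concepts or len(chosen_idx) >= n:
            break
        if concept not in seen_concepts:
            chosen_idx.append(i)
            seen_concepts.add(concept)

    # Pass 2: fill the remaining slots purely by score.
    taken = set(chosen_idx)
    for i in range(len(ranked)):
        if len(chosen_idx) >= n:
            break
        if i not in taken:
            chosen_idx.append(i)
            taken.add(i)

    return [ranked[i] for i in chosen_idx]


if __name__ == "__main__":
    pool = [("w1", "idioms", 0.9), ("w2", "idioms", 0.8), ("w3", "idioms", 0.7),
            ("w4", "idioms", 0.6), ("w5", "synonyms", 0.2), ("w6", "register", 0.1)]
    for word, concept, score in pick_with_diversity(pool, n=4, min_concepts=3):
        print(word, concept, score)
```

If the candidate pool contains fewer than `min_concepts` concepts, the function returns what it can rather than failing.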
## What This Pipeline Does NOT Do
Three things people ask about that we deliberately skip. It does not generate the words from scratch: the 42K-word bank is curated, not LLM-output. We tried LLM generation in v0; the words were grammatically correct but semantically generic. It does not adapt the difficulty mid-session: once a session is generated, it is fixed. We tried adaptive within-session and student feedback was negative ("the questions get harder when I get them right, which feels like punishment"). It does not score open-ended writing: that is a separate engine described in our rubric scoring post.
## FAQ
### What sentence-transformer model do you use?
sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2. 384 dimensions, 12-layer model, runs on CPU at ~30 ms per word. We considered the v3 768-dim models but the recall improvement on our specific task was marginal (98.4% to 98.9%) and the storage / compute doubled.
### How do you build the concept centroids?
Hand-curated seed words per concept (~40 words per concept, 32 concepts), embedded with the same model, averaged. We refresh them quarterly when the curriculum team reviews the mastery model.
### Why Beta distributions for mastery?
The Beta distribution is the conjugate prior to the binomial. Each answer is one Bernoulli trial, so the posterior update is closed-form: increment α on a correct answer and β on an incorrect one. That is cheaper than running a Bayesian network or a neural net per student.
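For readers who want the step spelled out, the closed-form update follows directly from multiplying the prior by the likelihood. Here $\theta$ is the student's true probability of answering a concept correctly and $x \in \{0, 1\}$ is one observed answer (1 for correct); both symbols are introduced only for this derivation.

$$
\begin{aligned}
p(\theta) &\propto \theta^{\alpha-1}(1-\theta)^{\beta-1} && \text{prior, } \mathrm{Beta}(\alpha, \beta) \\
p(x \mid \theta) &= \theta^{x}(1-\theta)^{1-x} && \text{one Bernoulli trial} \\
p(\theta \mid x) &\propto \theta^{\alpha+x-1}(1-\theta)^{\beta+(1-x)-1} && \text{posterior, } \mathrm{Beta}(\alpha + x,\ \beta + 1 - x)
\end{aligned}
$$

As a worked example, starting from the uniform prior Beta(1, 1), three correct answers and one incorrect answer give Beta(4, 2), with expected mastery 4 / (4 + 2) ≈ 0.67.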
### What happens if pgvector goes down?
We fall back to tag-based retrieval: we pick words randomly from the student's weakest concepts using the syllabus tag index. Quality drops noticeably (the difficulty tuner is starved of candidates) but the app stays functional. We have triggered this fallback twice in 6 months, both times during Postgres maintenance windows.
### How do you handle a student who has mastered everything?
We have not hit this problem yet: the long tail of CEFR B2/C1 words is large enough that no Class 9 student has reached >85% mastery on all 32 concepts. When we eventually do, the plan is to surface words from the next exam-board level (CBSE Class 10 / ICSE Class 10).
### Can the same engine handle Hindi vocabulary?
Not without retraining. The current sentence-transformer is multilingual but the concept centroids are English-anchored. We are running a pilot for Hindi vocabulary on our Hindi grading engine, but Hindi morphology (sandhi, samas, anuswar) needs different concept tags than English.
### Why 800 drills per month?
Empirical. Below 600/month, the spaced-recall queue starves and forgetting curves dominate. Above 1,000/month, the mastery improvement curves flatten and students plateau. 800 is the sweet spot from our 6-month cohort data, validated against the [LECTOR paper's findings on optimal review frequency](https://www.arxiv.org/pdf/2508.03275).
### Did you consider FSRS instead of Beta-binomial?
We did. [FSRS (the spaced-repetition algorithm built into Anki since version 23.10)](https://github.com/open-spaced-repetition/fsrs4anki) is more sophisticated for pure recall. We chose Beta-binomial because our drills are not pure recall; they are concept-mastery checks. FSRS would over-fit to "did you remember the answer" instead of "do you understand the concept." We use a simplified FSRS only for the spaced-recall slot inside each drill set.