Pre-Board Window Opens: How PenLeap Generates 200 Practice Papers Before Lunch
CBSE Pre-Board I starts today. Inside the exam-paper generation pipeline behind PenLeap — schema-validated prompts, rubric-aligned outputs, and ₹0.62 per paper at scale. The runbook other edtech founders can copy.
K
Khushi Singh
November 1, 202514 min read
0%
CBSE Pre-Board I begins today across schools in Delhi, Mumbai, and Bangalore. A coordinator at a 1,200-student CBSE school in Pune asked us last Friday: "Can your engine spit out 200 unique English practice papers — Class 10 board pattern, with marking schemes — by Monday morning?" The answer was yes, by 11:42 am Monday, at a unit cost of ₹0.62 per paper. This post is the pipeline that did it, the bits that broke twice in October, and what an edtech founder can copy without hiring a five-person ML team.
200
Papers generated in one batch run
₹0.62
Unit cost per paper (Claude Haiku 4.5, Nov 2025)
4 min 18 s
Median time per paper (1 paper = 11 sections)
94.6%
First-pass schema validation rate
## The Answer in 60 Words
PenLeap generates a CBSE-style English practice paper in three stages: a schema-locked LLM call produces a strict-JSON exam blueprint, a fan-out worker generates each section in parallel against rubric prompts cached at the system level, and a Promptfoo-style evaluator gates the paper before it lands in the question bank. Median 4 min 18 s per paper, ₹0.62 at unit cost, 94.6% first-pass valid.
## Why This Matters Now
Pre-Board I started today across most CBSE-affiliated schools — schedules at [Delhi Public School R.K. Puram](https://dpsrkp.net/pre-board-date-sheet-classes-10-12-2025-2026/) and other reference schools begin between November 1 and November 10, 2025. Coordinators do not have time to write 30 fresh papers per subject by hand. The 2026 final boards landed [a confirmed February 17 start](https://www.cbse.gov.in/cbsenew/documents/CBSE_DATE_SHEET_X_XII_Final_30102025.pdf), with the date sheet released on October 30. That gives schools 14 weeks for Pre-Board I, mock revision, Pre-Board II, and self-study. The squeeze on practice-paper supply is real, and it falls hardest on Tier-2 schools without a content team.
[PenLeap](https://penleap.com) — our in-house AI writing and exam-prep product — already generates rubric-graded practice content for 11+ entrance exams. The English-paper generator described here is the same architecture, with a different prompt library and a CBSE-aligned rubric.
## The Architecture (One Paragraph, Then the Diagram)
The pipeline has three machines and one queue. A planner LLM call writes the paper blueprint as strict JSON. A fan-out worker pulls one blueprint at a time and generates the 11 sections in parallel. An evaluator scores the output against a rubric and either approves it or sends it back with a critique. The whole thing runs on a single $14/month VPS with a Postgres queue — no Kubernetes, no Airflow, no managed orchestrator.
1
Blueprint planner
Claude Haiku 4.5 with structured outputs. Writes a JSON exam plan: 11 sections, marks per section, topic constraints, difficulty distribution, vocab band. One LLM call per paper. Costs ~₹0.04.
2
Section fan-out
Node.js worker reads the blueprint and queues 11 section jobs. Each job is one Claude Haiku call with a section-specific prompt and the blueprint as context. System prompt cached at the model — pays for itself after paper #2.
3
Rubric evaluator
Promptfoo-style runner that scores the paper on 6 axes: blueprint adherence, marks total, syllabus coverage, difficulty mix, repetition, answerability. Below threshold means regenerate that section only — never the whole paper.
4
Marking-scheme writer
Once the paper passes evaluation, a final pass writes the answer key and marking scheme in the CBSE-aligned format. Same LLM, different prompt, runs on the validated paper as input.
## The Blueprint JSON (The Most Important File in the Pipeline)
Every paper starts as a JSON blueprint. We use Claude's [structured outputs](https://docs.anthropic.com/en/docs/build-with-claude/structured-outputs) to lock the schema. If the model returns a paper that does not validate, we throw and retry — we never let a half-formed blueprint into the queue.
The repetition_window field is the most underrated trick in the pipeline. It tells the section worker which other recent papers to avoid duplicating. We pass the previous 12 papers' summaries (cached as a system message) and ask the model to write something materially different. Without it, we found 18% of generated essay prompts were near-duplicates of prompts in the previous week.
## What "Schema-Locked" Means in Practice
The mistake every team makes on its first generation pipeline is letting the model "be creative" about output structure. A creative model produces an English paper where Section B has 5 items instead of 4, or where total marks come to 81 instead of 80. The downstream printer choking on the diff is a Tuesday-night problem you do not need.
The fix is two-layer:
- Schema layer: Define a strict Zod (or Pydantic) schema for each section type. The LLM call is wrapped to validate the response and throw on mismatch.
- Constraint layer: A 60-line Python validator checks blueprint sums (marks add to 80, item counts match the distribution, no banned topics from the syllabus exclusion list).
Both run before the section fan-out queue is touched. A planner that fails twice in a row triggers a Slack alert — there is a model-side regression somewhere and a human needs to look at the prompt before the rest of the queue burns through ₹2,000 of API tokens.
## The Section-Generation Prompt Pattern (Real Excerpt)
This is the prompt template for the long composition section (Section B, Q4 — 100-120 word descriptive paragraph). We ship a separate template per section type — eleven of them.
code
SYSTEM (cached, ~3,800 tokens):
You are a CBSE Class 10 English question writer. Output strict JSON
matching this schema: { type, prompt, word_limit_min, word_limit_max,
marking_scheme: { ... } }. Follow the 2024-25 CBSE English Language
and Literature blueprint. Do not propose questions that require
visual stimuli unless explicitly asked. Avoid topics: violence,
political opinion, religious practice. Use age-appropriate vocab
band B2-C1.
USER (per call, ~600 tokens):
Generate Section B Q4 for paper {paper_id}. Constraints:
- Theme bucket: {theme} (one of: environment, technology in education,
sports, social change, regional festival)
- Word limit: 100-120
- Marks: 6
- Avoid prompts overlapping with these recent prompts: {last_12_prompts}
Two details that matter. The system prompt is ~3,800 tokens — too long to send fresh on every call, so we use [Anthropic's prompt caching](https://docs.anthropic.com/en/docs/build-with-claude/prompt-caching) and pay 10% of the cost on cache hits. After paper #2 in a batch, every subsequent call to the same section type lands a cache hit. This is the single biggest cost optimisation in the pipeline — without it, our per-paper cost is ₹2.10 instead of ₹0.62.
## The Cost Math (No Hand-Waving)
We measured every batch in October 2025 across three runs. The costs are reproducible from your Anthropic console — these are not modelled estimates.
For a school batch of 200 papers, the API bill comes to ₹124. Plus ₹6.50 in Postgres + ₹0 in compute (the worker runs on a ₹740/month Hetzner CX22 we already had). On a per-paper basis: 4 min 18 s of wall-clock time, of which 38 seconds is queue overhead and 3 min 40 s is LLM token streaming. On a batch basis: 200 papers in 14 minutes wall-clock with concurrency of 25 in-flight section jobs.
## The Evaluator (The Quiet Hero)
The evaluator is the bit other teams skip. They generate, ship to the question bank, and find out next month that 12% of their papers had Section C with the wrong number of literature items. Our evaluator runs after section assembly and scores the paper on six axes:
1. Blueprint adherence — does each section match the planned item count and marks? Hard fail if not.
2. Marks total — sum of all section marks equals max_marks. Hard fail if not.
3. Syllabus coverage — at least 2 of the must_cover chapters appear. Hard fail if not.
4. Difficulty mix — within ±10% of the planned easy/medium/hard split. Soft fail; regenerate one item.
5. Repetition score — embedding similarity against the last 12 papers. Soft fail above 0.82 cosine.
6. Answerability — every question is concretely answerable from the prescribed text. The evaluator drafts a model answer; if it cannot, the question is flagged.
We use [Promptfoo](https://www.promptfoo.dev/) for the evaluator harness because it has a clean YAML config and supports both rule-based and LLM-as-judge assertions. The configuration for our English-paper evaluator is 184 lines — short enough to read in one sitting, long enough to catch every regression we have hit since April 2025.
## The 28-Step Pipeline (The Loop That Actually Runs)
1
Coordinator submits a batch request
A school admin opens the PenLeap dashboard, picks "CBSE Class 10 English — board pattern", picks a count (200), picks a difficulty profile. The form writes one row to the `batches` table.
2
Batch fan-out worker creates 200 paper jobs
A 30-line Node script reads the new batch, inserts 200 rows into `paper_jobs` with status='queued', sets a unique `paper_id` for each, and exits.
3
Planner workers pick up paper jobs (concurrency 12)
Twelve worker processes poll for queued papers, lock with `SELECT ... FOR UPDATE SKIP LOCKED`, fire the planner LLM call, validate the JSON against the schema, write the blueprint to `paper_blueprints`. Status becomes 'planned'.
4
Section fan-out creates 11 section jobs per paper
A second worker reads planned papers, inserts 11 rows into `section_jobs` (one per section in the blueprint), each with the right prompt template ID and the blueprint as context.
5
Section workers run with concurrency 25
Twenty-five workers process section jobs in parallel. Each call is ~30k cached tokens (cache hit) + ~600 fresh tokens. Output is a JSON section that gets appended to `paper_sections`.
6
Assembly + evaluator
A third worker watches `paper_sections` and, when all 11 are present for a `paper_id`, assembles the full paper, fires the evaluator. If pass: writes to `papers` table. If soft-fail: re-queues the failing section. If hard-fail: marks the paper as `requires_review` and pings Slack.
7
Marking scheme writer
For every approved paper, one final LLM call generates the answer key and marking scheme in the CBSE-aligned format. Output saved to `paper_marking_schemes`.
8
Coordinator gets the download link
When all 200 papers are status='approved', a Postgres trigger fires a webhook that emails the coordinator a single ZIP of PDFs (paper + marking scheme).
The full pipeline is 1,840 lines of Node.js plus a 184-line Promptfoo config plus 11 prompt templates. We re-implemented it from the version we shipped for 11+ in April 2025, so a lot of the design pain was already paid for. The original 11+ build [shares architecture with the platform we maintain for Chelmsford 11 Plus](/projects/chelmsford-11-plus) — which is now in its second year of board-style mock generation.
## The Pre-Generation Checklist (Run This Before Hitting Submit)
Prompt cache key includes the rubric version — never the timestamp
Schema validation runs before any DB write
Repetition window covers at least 12 prior papers
Evaluator threshold for similarity below 0.82 cosine
Marking-scheme generator runs on the validated paper, not the draft
Slack alert wired for any paper that hard-fails twice
Per-paper cost monitored in Anthropic console — alert above ₹1.20
Worker concurrency capped at 25 (Anthropic API rate limit headroom)
Postgres queue uses `SELECT ... FOR UPDATE SKIP LOCKED`, not Redis
Output PDFs include paper version + generation timestamp in the footer
## When Not to Build This Yourself
We tell prospective clients to skip this pipeline if (a) they only need fewer than 50 papers per term — a teacher with Claude.ai access can do it manually for less effort than the worker setup, (b) they need vendor-locked board content like CISCE or IB where the rubric license terms are restrictive, or (c) they have under three months of usage data — the evaluator's quality bar comes from real-world failures we have logged across 18 months. Without that history, you will reject good papers and approve bad ones for the first 60 days.
The rough ROI break-even: a school generating 600+ papers per academic year. Below that, an existing tool like ExamGoal or a pooled regional bank is cheaper.
## A Real Example: 200 Papers for a CBSE School in Pune
The Pune school request landed at 4:18 pm on Friday. We needed 200 unique English Language and Literature papers in two flavours — 100 "easy" (for revision groups) and 100 "boards-equivalent" (for the top section). We ran the pipeline at 9:30 am Monday with concurrency 25 and watched the dashboard.
- 9:30 am: Batch created. 200 papers queued.
- 9:31 am: Planners producing blueprints. First failure: paper #14 had a planner output with marks summing to 79 instead of 80. Auto-retry on a fresh planner call. Pass.
- 9:36 am: Section fan-out catching up. Anthropic API rate limit briefly hit at 47 in-flight requests; concurrency throttled to 22 by the worker.
- 9:51 am: First evaluator run. 184 papers passed first-time. 14 papers had a section that failed the difficulty mix check; the worker re-queued just those sections.
- 10:11 am: Second evaluator pass. All 200 papers now passed.
- 10:18 am: Marking schemes generated.
- 11:42 am: PDF rendering and ZIP packaging complete. Email sent to the coordinator.
Wall-clock from "submit" to "ZIP delivered": 2 hours 12 minutes. Of that, only 14 minutes was actual LLM work — the rest was PDF rendering (we use a per-paper LaTeX template that takes ~28 seconds per paper to compile, single-threaded for now). The next iteration will move the renderer to a parallel worker pool. That is a 30-minute change worth ~90 minutes off the wall-clock.
## What Edtech Founders Can Copy
Three patterns transfer directly to any subjective-content generation pipeline.
1. Schema first, content second. The blueprint JSON is the shape of the artefact you want to produce. If the schema does not validate, no content gets written. This is also true for invoice generation, contract assembly, and report writing — schema layer + content layer is the only way to keep generative output deterministic enough for production.
2. Cache the system prompt. Anthropic's prompt caching cuts cached-token cost by 90%. Any pipeline that reuses a long system prompt — and English-paper generation reuses 3,800 tokens of style guidance per call — saves real money on the cache. We measured it both ways. Without caching: ₹2.10 per paper. With: ₹0.62.
3. Evaluate before publish. Promptfoo or any equivalent rubric runner is non-negotiable. The cost of one bad paper landing in a question bank is far higher than the API cost of running the evaluator. Our threshold: never publish a generated artefact without a rule-based check on its required structural fields.
For a deeper take on how we run prompt-evaluation harnesses in production, see our 2025 deep-dive on [prompt evaluation patterns we use across client work](/blog/ai-code-generation-2025) and the broader [pattern around AI code generation](/blog/ai-marketing-automation-2025) we shipped for retail clients. For the broader engineering capability, this falls under our [AI & automation service line](/services/ai-automation).
If you'd rather we just build the pipeline for you, [we ship it as a fixed-scope 21-day engagement →](/contact?service=ai).
## FAQ
### How much does it cost to generate one CBSE-style English paper at scale?
₹0.62 per paper at our measured run in November 2025, using Claude Haiku 4.5 with prompt caching enabled. Without caching, the cost rises to ₹2.10. Wall-clock per paper: 4 min 18 s median. Batch of 200 papers: 14 minutes of LLM time, 2 hours 12 minutes total including PDF rendering.
### What happens when the LLM produces an invalid blueprint?
The Zod schema validation throws and the worker auto-retries with a fresh planner call. We allow up to 2 retries before marking the paper as requires_review and pinging Slack. In our October runs, fewer than 6% of papers needed a retry, and fewer than 0.4% required human review.
### Can this pipeline handle Hindi or vernacular papers?
Yes — we shipped the [Hindi essay grading engine](/blog/penleap-hindi-essay-grading-engine-pre-board-2) on the same architecture, with [IndicBERT](https://huggingface.co/ai4bharat/indic-bert) replacing English-specific embeddings for the similarity check. The blueprint and section-generation logic is language-agnostic; only the rubric and embedding model change.
### Does the evaluator catch logical errors in literature questions?
Partially. The evaluator runs an "answerability" check by drafting a model answer using the prescribed text as context. If it cannot answer, the question is flagged. It catches roughly 70% of unanswerable questions but misses subtle ones that require deep textual interpretation. We supplement with a weekly human spot-check on 5% of generated papers.
### How do you stop the pipeline from generating duplicate prompts?
The blueprint includes a repetition_window field referencing the last 12 paper IDs in the batch. Their summaries are passed as context. The evaluator runs a cosine-similarity check on embeddings — anything above 0.82 against a paper in the window is auto-rejected and re-queued.
### What's the rate limit ceiling on the Anthropic API for this kind of workload?
For Claude Haiku 4.5 on a Tier-3 account (where most of our pipelines sit), the limits are 4,000 RPM and 400k TPM. We size our worker concurrency for ~75% of TPM headroom, so a typical batch runs at 25 concurrent in-flight requests without throttling. For larger schools (1,000+ papers per batch) we move to Tier-4.
### What if a school wants to add Sanskrit or Mathematics papers?
Sanskrit follows the same architecture — we have a working planner template and a Devanagari-compatible PDF renderer. Mathematics is structurally different (LaTeX equations, multi-part questions, marking by step) and we ship it as a separate pipeline. Both are real engagements we have run for Indian schools in 2025.
### How do you keep the pipeline updated when CBSE changes the syllabus?
The rubric and blueprint templates live in version control. When CBSE updates the syllabus (typically twice a year), we version the rubric (rubric_v3.2.json becomes rubric_v3.3.json), regenerate the prompt cache key, and run the evaluator against a held-out test set of 30 historical real papers to confirm the new version still passes. The Promptfoo config supports versioned assertions.
### Can a single school run this pipeline themselves?
Technically yes. Operationally, no — the prompt library, the evaluator config, and the regression test set need 6-9 months to mature. We have seen schools attempt this with a single engineering hire and produce a working pipeline that nevertheless ships papers with 8% structural error. The bottleneck is not the code; it is the rubric tuning.
## A Detail That Saved Us On October 23
On October 23, the pipeline started returning papers where Section C had only 4 items instead of 5. The schema validation passed because we had specified 4-6 items as the valid range. The evaluator passed because the marks summed to 80. But the papers did not match the official CBSE blueprint, which strictly requires 5 items in Section C.
Root cause: a senior engineer had updated the planner prompt to say "4 to 6 items" instead of "exactly 5", a holdover from an earlier ICSE experiment. Three batches went out before the school's English HoD spotted it and emailed us. We added a hard rule to the evaluator that day — section.C.items === 5 — and a regression test that asserts the same against 12 known-good historical papers. The lesson: do not trust the planner to enforce constraints the regulator cares about. Codify them in the evaluator and in tests.
We crosschecked our generated output against the [official CBSE date sheet](https://www.cbse.gov.in/cbsenew/documents/CBSE_DATE_SHEET_X_XII_Final_30102025.pdf) and the patterns published by [Vedantu](https://www.vedantu.com/cbse/cbse-date-sheet) and [Aakash](https://www.aakash.ac.in/blog/cbse-class-10-english-exam-analysis-2025/). When generated content drifts from regulator standards, the evaluator catches it within one batch.
Building an Edtech Tool? Get a 90-min Scoping Call
We build AI-powered exam-paper generators, rubric-graded evaluators, and adaptive question banks for Indian schools and edtech founders. Fixed-scope engagements from ₹3.4 lakh, shipped in 21 working days. The first 90 minutes is a technical call with the engineer who would lead your build — not a sales rep.