Inside PenLeap's CBSE Class 10 Pre-Board Question Generator: 200 Papers Before Lunch, Zero Repeats

A coordinator at a 1,400-student CBSE school in Indore asked us last week: "Can your engine spit out 200 unique English Class 10 pre-board papers — official blueprint, marking schemes, zero repeats — by 11:30 am Friday?" The answer was yes, by 11:18 am, at a unit cost of ₹0.78 per paper. This post is the engine inside PenLeap — our in-house edtech product — that did it. Three components matter: a blueprint-conformance validator that rejects 12-14% of generated papers automatically, a MinHash-based deduplication layer that guarantees no question repeats across 200 papers, and a rubric the human reviewers actually trust. If you are building any AI grading or paper-generation tool, this is the system architecture.

- 200: unique papers generated in one morning batch
- ₹0.78: unit cost per generated paper (incl. validation)
- 0: repeat questions across the 200-paper run
- 94%: first-pass blueprint conformance (after 3 iterations)

## The Answer in 60 Words

We split paper generation into three discrete stages: (1) a blueprint-conformance generator using Claude Sonnet 4.5 with a JSON schema constraint, (2) a MinHash + Jaccard-similarity deduplication layer that rejects any question with > 0.65 overlap with anything generated in the last 30 days, and (3) a rubric-aligned reviewer prompt that produces the marking scheme. Total wall-clock per paper: 47 seconds. Manual review time per paper: 90 seconds.

## Why CBSE Pre-Board Generation Matters For Schools

CBSE Pre-Board I happens in November-December. Schools generate 4-8 mock papers per subject. At 6 subjects × 6 sections × 8 papers, that is 288 unique papers per school per cycle. Hand-generating these eats 2-3 senior teachers for two weeks.

The market response so far has been three options: buy from sample-paper publishers (₹250-450 per paper, license-restricted), use shared papers from teacher WhatsApp groups (rampant repetition, often last year's leaked papers), or pay a content firm (₹120-180 per paper, 3-5 day turnaround). None of these solves the "0 repeats" problem. Our engine does.

## What Makes A Pre-Board Paper "Good" (The Rubric)

A coordinator does not just want 200 papers. She wants 200 papers that pass internal review, match the CBSE blueprint exactly, and do not repeat any question from the school's previous mocks. Define "good" precisely:

📐

Blueprint Conformance

CBSE publishes a section-by-section, mark-by-mark blueprint per subject per year. Pre-board papers must match — Section A 20 marks of 1-mark questions, Section B 24 marks of 2-mark questions, etc. Off-blueprint = unusable.

🔄

Internal Non-Repetition

Within a 200-paper batch, no question text should repeat. Across the school's previous mocks, no question should overlap by more than ~65% Jaccard.

📚

Syllabus Anchoring

Every question maps to a specific NCERT chapter, theme, or learning outcome. The reviewer can audit "this question tests chapter 4 simile recognition." Untraceable questions get rejected.

✅

Marking Scheme Trustability

Each question ships with a marking scheme that a moderately experienced teacher would accept without correction. This is the rubric humans trust — measured by reviewer change-rate (target < 8%).

## The 3-Stage Pipeline

Stage 1 — Blueprint-conformance generation (38s per paper)

Claude Sonnet 4.5 call with a structured JSON output constraint. The system prompt holds the CBSE blueprint as a typed schema. Claude must emit a JSON list of question objects matching that schema. We use prefix caching for the blueprint — saves ~32% on cost per call.

Stage 2 — MinHash deduplication (1.4s per paper)

Each generated question gets a MinHash signature (200 hashes, shingled at 3-grams). A pre-loaded LSH index of the school's last 30 days of questions returns nearest neighbours. Any question with Jaccard > 0.65 is rejected and re-generated.

Stage 3 — Marking scheme + final validate (8s per paper)

A second Sonnet call (with the question paper as input) generates marking schemes per question. A schema validator checks that mark allocations match the question's stated marks. Failures route back to Stage 1 (rare — about 4%).
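The routing between the three stages can be sketched as a bounded retry loop. This is a minimal illustration, not the production orchestration: `run_pipeline`, its three callables, and `max_attempts` are hypothetical names, and the sketch rejects a whole paper on a duplicate, whereas the pipeline described above re-generates individual questions.

```python
from typing import Callable, Optional


def run_pipeline(
    generate: Callable[[], dict],
    is_duplicate: Callable[[dict], bool],
    add_marking_scheme: Callable[[dict], Optional[dict]],
    max_attempts: int = 3,
) -> Optional[dict]:
    """Run the three stages, looping back to generation on any failure."""
    for _ in range(max_attempts):
        paper = generate()                  # Stage 1: blueprint-conformant draft
        if is_duplicate(paper):             # Stage 2: dedupe check failed, start over
            continue
        marked = add_marking_scheme(paper)  # Stage 3: None signals a validation failure
        if marked is not None:
            return marked
    return None  # caller decides on a fallback, e.g. a curated paper bank
```

Passing the stages in as callables keeps the retry policy testable without calling a model.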

## The Blueprint As A JSON Schema (Where We Got It Right)

CBSE blueprints are PDFs with tables. We hand-translated each one into a JSON schema. For Class 10 English Communicative 2025-26:

```json
{
  "subject": "English Communicative",
  "class": 10,
  "total_marks": 80,
  "duration_min": 180,
  "sections": [
    {
      "section": "A",
      "name": "Reading",
      "total_marks": 20,
      "questions": [
        { "marks": 1, "count": 10, "type": "MCQ", "skill": "literal-comprehension" },
        { "marks": 2, "count": 5, "type": "short-answer", "skill": "inference" }
      ]
    },
    {
      "section": "B",
      "name": "Writing & Grammar",
      "total_marks": 20,
      "questions": [
        { "marks": 5, "count": 1, "type": "letter-formal", "skill": "formal-writing" },
        { "marks": 5, "count": 1, "type": "essay", "skill": "argumentative-writing" },
        { "marks": 1, "count": 10, "type": "MCQ", "skill": "grammar" }
      ]
    },
    { "section": "C", "name": "Literature", "total_marks": 40, "questions": [/* ... */] }
  ]
}
```
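A blueprint in this shape can be checked mechanically. The sketch below is one way such a structural check might look, not our production validator: `conformance_errors` is a hypothetical name, and it assumes the generated paper is a flat list of question objects carrying `section`, `type`, and `marks` keys. Only Section A is copied in, to keep the example short.

```python
from collections import Counter

# Trimmed copy of the blueprint structure (Section A only).
BLUEPRINT = {
    "sections": [
        {
            "section": "A",
            "total_marks": 20,
            "questions": [
                {"marks": 1, "count": 10, "type": "MCQ"},
                {"marks": 2, "count": 5, "type": "short-answer"},
            ],
        }
    ]
}


def conformance_errors(blueprint: dict, paper: list[dict]) -> list[str]:
    """Return human-readable mismatches; an empty list means the paper conforms."""
    errors = []
    for spec in blueprint["sections"]:
        name = spec["section"]
        questions = [q for q in paper if q["section"] == name]
        # Count questions per (type, marks) pair and compare with the blueprint.
        got = Counter((q["type"], q["marks"]) for q in questions)
        want = Counter({(s["type"], s["marks"]): s["count"] for s in spec["questions"]})
        for key in sorted(set(got) | set(want)):
            if got[key] != want[key]:
                errors.append(f"Section {name}: expected {want[key]} x {key}, got {got[key]}")
        total = sum(q["marks"] for q in questions)
        if total != spec["total_marks"]:
            errors.append(f"Section {name}: marks total {total}, expected {spec['total_marks']}")
    return errors


good = [{"section": "A", "type": "MCQ", "marks": 1}] * 10 \
     + [{"section": "A", "type": "short-answer", "marks": 2}] * 5
overfilled = good + [{"section": "A", "type": "MCQ", "marks": 1}]

print(conformance_errors(BLUEPRINT, good))        # no mismatches
print(conformance_errors(BLUEPRINT, overfilled))  # count and total both off
```

The `overfilled` case mirrors the most common failure we see, where a section gets one question too many.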

Encoding the blueprint as a schema enables two things. First, Claude has an unambiguous target: generate exactly this many questions of exactly these types. Second, our validator can check structural conformance with a tool, not a human. About 12-14% of first-pass generations fail validation (usually because Claude over-fills a section). The validator routes these back for re-generation automatically.

## The MinHash Deduper (How We Guarantee Zero Repeats)

MinHash is a 1990s technique for estimating the Jaccard similarity between sets from short fixed-size signatures instead of comparing the full sets. For our use case, each question becomes a set of 3-character shingles, and we compute one 200-hash MinHash signature per question. To find near-duplicates quickly, we use Locality-Sensitive Hashing (LSH): we band the 200 hashes into 50 bands of 4 and treat any indexed question that collides with the new question in at least one band as a candidate.
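To make the shingle, signature, and banding steps concrete, here is a minimal standard-library sketch. It is an illustration under stated assumptions rather than our production code: the hash family (`COEFFS`, `PRIME`), the text normalisation in `shingles`, and the `LSHIndex` class are all hypothetical choices. Only the 3-character shingles, 200 hashes, 50 × 4 banding, and 0.65 threshold come from the description above.

```python
import random
import zlib
from collections import defaultdict

NUM_HASHES, BANDS, ROWS = 200, 50, 4  # 50 bands x 4 rows = 200 hashes
PRIME = (1 << 61) - 1                 # large prime for the (a*x + b) mod p hash family

_rng = random.Random(42)              # fixed seed so signatures are reproducible
COEFFS = [(_rng.randrange(1, PRIME), _rng.randrange(PRIME)) for _ in range(NUM_HASHES)]


def shingles(text: str, k: int = 3) -> set[int]:
    """Lower-case, collapse whitespace, and hash every k-character window."""
    norm = " ".join(text.lower().split())
    return {zlib.crc32(norm[i:i + k].encode("utf-8")) for i in range(max(1, len(norm) - k + 1))}


def signature(text: str) -> list[int]:
    """MinHash signature: the minimum of each hash function over the shingle set."""
    sh = shingles(text)
    return [min((a * x + b) % PRIME for x in sh) for a, b in COEFFS]


def estimated_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """The fraction of agreeing positions estimates the Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / NUM_HASHES


def band_keys(sig: list[int]) -> list[tuple]:
    """Split a signature into BANDS bucket keys of ROWS hashes each."""
    return [(b, tuple(sig[b * ROWS:(b + 1) * ROWS])) for b in range(BANDS)]


class LSHIndex:
    def __init__(self) -> None:
        self.buckets = defaultdict(set)  # band key -> ids of questions in that bucket
        self.signatures = {}

    def add(self, qid: str, text: str) -> None:
        sig = signature(text)
        self.signatures[qid] = sig
        for key in band_keys(sig):
            self.buckets[key].add(qid)

    def near_duplicates(self, text: str, threshold: float = 0.65) -> list[str]:
        """Collect band collisions, then confirm each candidate against the threshold."""
        sig = signature(text)
        candidates = set()
        for key in band_keys(sig):
            candidates |= self.buckets.get(key, set())
        return sorted(q for q in candidates if estimated_jaccard(sig, self.signatures[q]) > threshold)
```

A question that differs only in case or spacing normalises to the same shingle set, so it collides in every band and is reported. An unrelated question shares few shingles and is not.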

| Approach | Index build time (10K questions) | Query time per question | False-negative rate at 0.65 Jaccard |
|---|---|---|---|
| Brute force pairwise | N/A | 480 ms | 0% |
| Embedding cosine + ANN | 380 s | 28 ms | 4-7% (semantic vs surface) |
| MinHash + LSH (50 bands × 4 rows) | 24 s | 2 ms | 0.8% |
| Exact-match hash | 0.4 s | 0.1 ms | Catches only verbatim repeats |

We use MinHash + LSH because (a) 0.8% false-negative is acceptable for our use case, (b) 2 ms per query lets us check the entire 200-paper batch in under 5 seconds, (c) build time is fast enough for an incremental index. The sentence-embedding ANN approach is more semantically accurate but flags too many "different question, same topic" pairs as duplicates, and that is wrong for a paper generator. We want surface-form distinctness, not topic distinctness.

## Where The "Trust" In The Rubric Comes From

The rubric the human reviewer trusts is the marking scheme. If the reviewer has to change marking decisions for >15% of questions, our engine has failed: the human spent more time fixing than reviewing. Three things drove rubric trust:

1. Granular mark allocation per response feature. A 5-mark essay marking scheme says "1 mark for thesis statement, 2 marks for at least 3 supporting points, 1 mark for vocabulary range, 1 mark for grammar." Not "5 marks for a good essay."

2. Worked solutions for short-answer and MCQ. Every MCQ marking scheme includes the correct answer AND a one-line "why this is correct, why others are not" — saves the reviewer the time of working it out.

3. Edge-case notes. "If student writes 'rebellion' instead of 'revolt' — accept; if student writes 'fight' — half-mark only." These are the calls reviewers waste time on. Pre-empting them shows the rubric was written by something thinking like a teacher, not a checklist.
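The first point, granular allocation, is easy to enforce in code. The following is a hedged sketch of how a marking scheme could be represented and checked. `Criterion`, `MarkingScheme`, and `problems` are illustrative names rather than our actual schema, and the rule that a multi-mark question needs at least two criteria is an assumption made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Criterion:
    description: str
    marks: float  # float so half-marks can be expressed


@dataclass
class MarkingScheme:
    question_marks: float
    criteria: list[Criterion]
    edge_case_notes: list[str] = field(default_factory=list)

    def problems(self) -> list[str]:
        """Return reasons to reject the scheme; an empty list means it passes."""
        issues = []
        allocated = sum(c.marks for c in self.criteria)
        if allocated != self.question_marks:
            issues.append(f"criteria add up to {allocated}, question is worth {self.question_marks}")
        if self.question_marks > 1 and len(self.criteria) < 2:
            issues.append("multi-mark question needs per-feature criteria, not one lump sum")
        return issues


# The 5-mark essay allocation from point 1.
essay = MarkingScheme(
    question_marks=5,
    criteria=[
        Criterion("thesis statement", 1),
        Criterion("at least 3 supporting points", 2),
        Criterion("vocabulary range", 1),
        Criterion("grammar", 1),
    ],
)

# The anti-pattern: "5 marks for a good essay".
vague = MarkingScheme(question_marks=5, criteria=[Criterion("a good essay", 5)])

print(essay.problems())  # passes
print(vague.problems())  # flagged as a lump-sum allocation
```

A check like this catches schemes whose allocations do not add up before a reviewer ever sees them.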

## The Cost Breakdown (Real Numbers)

At ₹0.78 per paper, a 200-paper batch costs ₹156 in total. The school pays ₹14,000 for the batch (₹70 per paper retail). Margin is generous because it has to absorb the "what about quality" guarantees we make. We refund any paper rejected by the school's reviewer at no cost.

## What We Got Wrong (Worth Documenting)

Wrong #1: Chunked generation. The first version generated each section in a separate Claude call to keep context windows small. Section-to-section consistency suffered (Section C literature questions sometimes referenced characters that the Section B grammar passages did not). Single-call generation with the full blueprint as cached prefix fixed it.

Wrong #2: Embedding-based dedupe. We started with sentence embeddings for dedup. The false-positive rate was 18%: different questions on similar topics got flagged as duplicates. MinHash on shingles compares at the right level for "is this question text different."

Wrong #3: Bigger model = better paper. We tested Opus 4.5 vs Sonnet 4.5. Opus produced subtly better wording (~6% reviewer satisfaction improvement) but cost 5x more. Sonnet won on price-per-quality.

## The Pre-Ship Checklist (Question-Generation Engine)

- Blueprint encoded as a typed JSON schema, not free text
- JSON-schema validator runs after every generation
- Failed validations route back to generation, not to humans
- MinHash + LSH dedupe with band/row tuning for your similarity threshold
- Dedupe index rebuilt nightly from the last 30-90 days of generated questions
- Marking scheme generation is a separate call, not bundled with question generation
- Reviewer change-rate tracked weekly (red-flag at > 15%)
- Cost per paper measured and reported per batch
- Fallback: human-curated paper bank if generation fails for > 2 minutes
- PDF rendering separated from generation logic, so render failures do not lose questions
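For the band/row tuning item, the standard LSH banding formula is a useful starting point. With b bands of r rows, two questions with Jaccard similarity s share at least one band with probability 1 - (1 - s^r)^b. The sketch below evaluates that curve for the 50 × 4 configuration; the function names are illustrative. It models only candidate selection, not the estimation error of the signatures themselves, so measured false-negative rates can be higher than the curve suggests.

```python
def candidate_probability(jaccard: float, bands: int = 50, rows: int = 4) -> float:
    """Chance that two questions with this Jaccard similarity share at least one band."""
    return 1.0 - (1.0 - jaccard ** rows) ** bands


def approximate_threshold(bands: int = 50, rows: int = 4) -> float:
    """Similarity around which the S-curve rises most steeply, roughly (1/b)^(1/r)."""
    return (1.0 / bands) ** (1.0 / rows)


for s in (0.2, 0.4, 0.65, 0.8):
    print(f"Jaccard {s:.2f}: candidate probability {candidate_probability(s):.4f}")
print(f"approximate threshold: {approximate_threshold():.3f}")
```

More rows per band push the curve to the right (fewer candidates, more misses). More bands push it to the left (more candidates, more confirmation work).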

## Common Mistakes (Each One Hurts)

- Symptom: "Reviewer change-rate is 30%." Cause: marking scheme is too vague. Fix: rewrite the rubric with granular per-feature mark allocation.
- Symptom: "Same question types keep appearing." Cause: generation prompt does not enumerate variety constraints. Fix: include "vary the question stem across the batch" instructions and a counter-example list of recently generated stems in the cached prefix.
- Symptom: "MinHash flags as duplicate questions that are clearly different." Cause: Jaccard threshold too low. Fix: raise it from 0.55-0.6 to 0.7 and re-test.
- Symptom: "Generation takes too long during peak." Cause: serial generation. Fix: parallelise per-paper at the worker level. We run 8 generation workers per region with rate-limiting.
- Symptom: "Some papers do not match blueprint." Cause: validator skipped or fallback to human. Fix: validator is mandatory; never bypass.

## When NOT To Build A Custom Engine

Skip the build if (a) you generate fewer than 50 papers per cycle, in which case buying from a publisher is cheaper, (b) your school does not have a CBSE-style explicit blueprint, since the schema discipline is what makes this work, or (c) your reviewers are happy with sample-paper aggregator quality. Our engine is for schools that have already tried aggregators and were unhappy with the repetition rate.

## Real Example: A 1,400-Student School In Indore

Specific details, for the 1,400-student CBSE school in Indore: 6 subjects × 6 sections × 8 papers = 288 papers needed for Pre-Board I. We delivered all 288 papers across 5 days (English on Friday morning was the burst test for our 200-in-a-morning capacity). Reviewer change-rate: 7.4% across all 288. Coordinator quote, used with her permission: "We saved two senior teachers a fortnight of work. The rubrics are tighter than what our own moderators wrote last year."
## A Detail That Saved Us On Day 19

On day 19 of the rollout, a school complained that two papers in their batch had "almost identical" essay prompts, both about climate change in Indian cities. Our MinHash dedupe had passed both: their textual overlap was below 0.65 (different vocabulary, similar topic). The fix was not to lower the Jaccard threshold (that would over-flag legitimately distinct questions). Instead, we added a second-pass topical-clustering step on the final 200-paper output: cluster by sentence-embedding cosine similarity, ensure no cluster has more than 4 papers. This tops up the surface-level MinHash dedupe with a topic-level guarantee. Both checks now run.

## How This Connects To Adaptive Vocabulary And Hindi Grading

This engine is one stage of the broader PenLeap AI pipeline. The same blueprint-conformance discipline runs our adaptive vocabulary drills (covered in our 800-drills-per-student post), and an extended version handles Hindi essay grading. The shared infrastructure is: typed schemas, prefix-cached prompts, JSON validation, MinHash-style dedup, and reviewer-change-rate as the trust metric.

## FAQ

### Why Claude Sonnet 4.5 instead of GPT or Gemini?

We tested all three on a 100-paper benchmark in July 2025. Sonnet 4.5 had the best blueprint conformance (94% first-pass vs 86% GPT-4o and 88% Gemini 2.5 Pro) and the lowest hallucination rate on syllabus references. Cost was within 8% across all three.

### How do you keep the engine in sync with CBSE blueprint changes?

Manual update by our content lead. CBSE publishes blueprints once a year, sometimes with a mid-year clarification. We have an automated alert on the CBSE blueprint pages: when the page hash changes, we get an email.

### Can the engine generate ICSE / IB / state-board papers?

Yes. Different blueprint, same engine. We have shipped ICSE Class 10 papers and IB Middle Years Programme assessments. Each new board takes 2-3 weeks of blueprint encoding and validator testing.
### What about diagrams and figures?

Diagrams are referenced by NCERT figure number; the generated paper includes the figure caption and a placeholder. Our PDF renderer pulls the official figure from a pre-licensed bank. No new figure generation: Indian copyright on textbook figures is strict and the risk is not worth it.

### How do you handle MCQs with mathematically-correct distractors?

For maths and science MCQs, the generation prompt includes a step-by-step "show me how you derived the distractors" instruction, and the validator runs a numeric sanity check. Wrong-but-plausible distractors are the hardest part of MCQ design.

### What if a school wants the paper in Hindi?

We use the Sonnet model with a Hindi blueprint and Hindi seed examples. WER and reviewer-change-rate are similar to English. The Hindi rubric encoding effort is roughly 1.5× English because of the wider variation in correct-answer phrasings.

### How do you prove "zero repeats" within a batch?

Pairwise MinHash check on the final 200-paper output. Wall-clock check time: under 4 seconds. Any flagged pair triggers a re-generation of the second paper.

### Is this paper generation legally allowed?

Yes. Our generated papers are original. We do not reproduce CBSE's official sample papers. The blueprint itself is published by CBSE and not copyrighted (it is a structural specification, not creative content).
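As a companion to the batch-level checks, the second-pass topical check from the Day 19 section can be sketched as follows. This is a simplified illustration, not the production step. It assumes each paper's essay prompt has already been embedded by some sentence-embedding model (not shown), uses a greedy single-pass clustering, and the 0.85 cosine cut-off is a made-up value. Only the cap of 4 papers per cluster comes from the text.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def topical_clusters(embeddings: dict[str, list[float]], min_similarity: float = 0.85) -> list[list[str]]:
    """Greedy clustering: each paper joins the first cluster whose seed is similar enough."""
    clusters: list[list[str]] = []
    for paper_id, vec in embeddings.items():
        for cluster in clusters:
            if cosine(vec, embeddings[cluster[0]]) >= min_similarity:
                cluster.append(paper_id)
                break
        else:
            clusters.append([paper_id])  # no similar seed found, start a new topic
    return clusters


def oversized(clusters: list[list[str]], max_size: int = 4) -> list[list[str]]:
    """Clusters that break the 'no more than 4 papers per topic' rule."""
    return [c for c in clusters if len(c) > max_size]


# Toy 2-D "embeddings": five prompts on nearly the same topic, one on a different topic.
emb = {f"p{i}": [1.0, 0.01 * i] for i in range(5)}
emb["p5"] = [0.0, 1.0]

clusters = topical_clusters(emb)
print(oversized(clusters))  # the five-paper cluster is flagged for re-generation
```

Papers in an oversized cluster would be sent back for re-generation with a different topic hint, leaving the surface-level MinHash check unchanged.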


Tags: EdTech, PenLeap, CBSE, Question Generation, MinHash, LLM Pipeline, Case Study

Khushi Singh

UI/UX Designer at Softechinfra focused on crafting intuitive, user-friendly digital experiences.
