# Inside PenLeap: Grading 12,000 Essays a Day Without Hallucinated Marks

On a typical school day, [PenLeap](https://penleap.com), our in-house AI writing coach for 11+ students, grades roughly 12,000 essays. Each one gets four rubric scores, two paragraphs of feedback, and a rewrite suggestion. A parent in Reading or Hyderabad doesn't think about how the marks were produced; they just trust the number. This post explains how we earn that trust at scale: the prompt structure, the multi-model voting layer, the calibration math, and the 4% human-in-the-loop sample rate that keeps the engine honest.

- 12,000: essays graded daily (May 2026 average)
- 0.84: quadratic-weighted kappa (QWK) vs. human markers
- 4%: human review sample rate
- 3: independent models in the voting layer

## TL;DR

PenLeap grades student essays against four rubric criteria (Content, Structure, Language, SPaG) using three independent LLM judges (Claude Sonnet 4.5, GPT-5.4, Gemini 2.5 Pro). Each judge produces a score plus an evidence quote from the essay. A calibrator combines them with quadratic-weighted-kappa-aware reweighting. Disagreements above 2 band points trigger human review. Result: 0.84 QWK against human markers, hallucination rate under 0.6% on the audit set.

## Why this matters now

The recent [Rulers paper](https://arxiv.org/html/2601.08654) and [LLM-Rubric framework](https://arxiv.org/html/2501.00274v1) both confirm what we learned the hard way: a single LLM judge with a free-form rubric will hallucinate scores at a measurable rate (somewhere between 4% and 11% in published evaluations). That number is fine for a research demo and unacceptable for a paid product where parents see the marks on their child's essay. The fix is architectural: multi-model voting, evidence-anchored scoring, and a tight human audit loop.

## The four-criterion rubric (and why it lives in the prompt)

11+ creative writing is graded against four criteria, each on a banded scale. We mirror that structure exactly in the prompt, with no creative reinterpretation. Below is the literal opening section of our scoring prompt (one of three; the other two are paraphrased to break model-specific biases).

You are a marker for the 11+ creative-writing paper.
  You grade four criteria, each on a 1-10 band.
  
  CONTENT (originality, story arc, character):
    9-10: Vivid original idea; full arc with twist or layered character
    7-8:  Clear original idea; arc resolved; one character motivation
    5-6:  Conventional idea; arc present but flat; character static
    3-4:  Generic idea; partial arc; no character interiority
    1-2:  Incoherent or off-prompt
  
  STRUCTURE (paragraphing, openings, pacing): [...four bands...]
  LANGUAGE  (vocabulary range, figurative): [...four bands...]
  SPAG     (spelling, punctuation, grammar): [...four bands...]
  
  For each criterion you MUST:
    1. Quote one exact span from the essay that anchors your score
    2. Justify in <= 25 words referencing the band descriptor
    3. Return JSON: {criterion, band, quote, justification}
  
  You may not invent text the essay does not contain.
  If you cannot find a quote, the band is at most 4.
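
The quote rule at the end of the prompt can also be enforced in code rather than trusted to the model. The sketch below is a minimal illustration, not our production validator: `check_judgement`, its whitespace and curly-quote normalisation, and the choice to raise an error so the caller can retry are all assumptions layered on the JSON shape shown in the prompt.

```python
import json
import re


def _normalize(text: str) -> str:
    """Unify curly quotes and collapse whitespace so formatting noise does not fail the match."""
    text = text.replace("\u2018", "'").replace("\u2019", "'")
    text = text.replace("\u201c", '"').replace("\u201d", '"')
    return re.sub(r"\s+", " ", text).strip()


def check_judgement(raw_json: str, essay: str, no_quote_cap: int = 4) -> dict:
    """Parse one judge response and enforce the evidence-quote rule.

    - No quote supplied: the band is capped at `no_quote_cap` (4 in the prompt above).
    - Quote supplied but absent from the essay: raise ValueError so the caller can retry.
    """
    item = json.loads(raw_json)
    band = int(item["band"])
    if not 1 <= band <= 10:
        raise ValueError(f"band {band} is outside the 1-10 scale")

    quote = _normalize(item.get("quote") or "")
    if not quote:
        item["band"] = min(band, no_quote_cap)
        item["quote_ok"] = False
        return item

    if quote not in _normalize(essay):
        raise ValueError("quote not found in essay; treat as fabricated and retry")

    item["band"] = band
    item["quote_ok"] = True
    return item
```

A caller would typically wrap this in a bounded retry loop and queue the essay for human review if the judge keeps returning quotes that are not in the text.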

The "you may not invent text" line is doing real work. The [Rulers paper](https://arxiv.org/html/2601.08654) calls this evidence-anchored scoring; in our internal logs, adding the quote requirement cut hallucination from 4.1% to 0.6% on the audit set. The model can't fabricate "vivid figurative language" if it has to point to the figurative phrase.

## The voting layer

- Judge 1: Claude Sonnet 4.5. Strongest on Content and Structure. Slightly generous on Language. Median latency 1.4s per essay.
- Judge 2: GPT-5.4. Strongest on SPaG. Tighter on Language. Lower variance across re-runs.
- Judge 3: Gemini 2.5 Pro. Strongest on long essays (>300 words). Its failure modes are independent of the other two; uncorrelated errors are the whole point.
- Calibrator: per-criterion linear reweight. Learned weights map raw judge scores to teacher-aligned bands. Refit weekly on the 600-essay human audit set.

## The calibration math (no hand-waving)

The calibrator solves a small linear-regression problem per criterion. Each week we take the 600 most recently human-reviewed essays. For each, we have three judge scores and one ground-truth teacher score. We fit:

teacher_score = w_claude * claude_band
                + w_gpt    * gpt_band
                + w_gemini * gemini_band
                + b
  subject to: sum(weights) = 1, weights >= 0
  loss: quadratic-weighted-kappa-aware MSE
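
To make the constrained fit concrete, here is a deliberately simple sketch for one criterion. It is not the production fitter: it swaps the kappa-aware loss for plain MSE (the exact loss is not spelled out here) and replaces a proper constrained solver with a coarse grid search over non-negative weights that sum to 1. The name `fit_calibrator` and the `steps` parameter are hypothetical.

```python
from itertools import product


def fit_calibrator(judge_bands, teacher_bands, steps=20):
    """Grid-search weights on the simplex (w >= 0, sum(w) == 1) plus a bias term.

    judge_bands:   list of (claude_band, gpt_band, gemini_band) tuples
    teacher_bands: list of human scores, same length
    Returns (weights, bias, mse) for the grid point with the lowest plain MSE.
    """
    n = len(teacher_bands)
    best = None
    for i, j in product(range(steps + 1), repeat=2):
        k = steps - i - j
        if k < 0:
            continue  # third weight would be negative, so skip this grid point
        w = (i / steps, j / steps, k / steps)
        blended = [sum(wi * s for wi, s in zip(w, row)) for row in judge_bands]
        # For fixed weights, the MSE-optimal bias is the mean residual.
        bias = sum(t - p for t, p in zip(teacher_bands, blended)) / n
        mse = sum((t - (p + bias)) ** 2 for t, p in zip(teacher_bands, blended)) / n
        if best is None or mse < best[2]:
            best = (w, bias, mse)
    return best
```

With `steps=20` the weights are resolved to the nearest 0.05; a real refit would more likely use a quadratic-programming or SLSQP-style solver and the kappa-aware loss.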

The kappa-aware loss penalizes large band misses (predicting 5 when the teacher said 8) more than small ones (predicting 7 when the teacher said 8). This is what [Autorubric's calibration paper](https://arxiv.org/html/2603.00077) calls "psychometric reliability": we optimize for the same agreement measure, QWK, that we report against the human markers.

Current weight table (recalibrated 2026-05-04):

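| Criterion | Claude w | GPT w | Gemini w | Bias b | QWK |
|-----------|----------|-------|----------|--------|------|
| Content   | 0.46     | 0.30  | 0.24     | -0.3   | 0.86 |
| Structure | 0.42     | 0.34  | 0.24     | 0.0    | 0.81 |
| Language  | 0.28     | 0.40  | 0.32     | -0.5   | 0.79 |
| SPaG      | 0.20     | 0.52  | 0.28     | 0.0    | 0.88 |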
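
As an illustration of how one row of this table turns three judge bands into a single mark, here is a small sketch. The weights and biases are copied from the table; the function name `calibrated_band` and the final round-and-clamp to the 1-10 scale are assumptions, since the post does not say how the blended value is turned back into a band.

```python
# Weights and biases from the table above (recalibrated 2026-05-04).
WEIGHTS = {
    "Content":   {"claude": 0.46, "gpt": 0.30, "gemini": 0.24, "bias": -0.3},
    "Structure": {"claude": 0.42, "gpt": 0.34, "gemini": 0.24, "bias": 0.0},
    "Language":  {"claude": 0.28, "gpt": 0.40, "gemini": 0.32, "bias": -0.5},
    "SPaG":      {"claude": 0.20, "gpt": 0.52, "gemini": 0.28, "bias": 0.0},
}


def calibrated_band(criterion: str, claude: int, gpt: int, gemini: int):
    """Blend three judge bands with the per-criterion weights and bias.

    Returns (raw_value, band). Rounding and clamping to 1-10 is an
    assumption made for this sketch.
    """
    w = WEIGHTS[criterion]
    raw = w["claude"] * claude + w["gpt"] * gpt + w["gemini"] * gemini + w["bias"]
    band = max(1, min(10, round(raw)))
    return raw, band


raw, band = calibrated_band("Language", claude=8, gpt=7, gemini=8)
print(f"Language: raw={raw:.2f}, band={band}")
```

For the Language example, the three judges blend to 7.6 before the bias and 7.1 after it.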

The b column is small but matters: it captures systematic over- or under-grading we can't remove with reweighting. For Language, all three judges run ~0.5 bands generous versus the teacher panel; we subtract that.

## The human-in-the-loop sample rate

Not every essay goes to a human. That would defeat the point. The trick is sampling the right ones.

Random base rate: 2%

A blind random sample goes to human markers regardless of judge agreement. This is the unbiased measurement floor — without it, we'd only ever review essays we already suspect are wrong.

Disagreement-triggered: +1.5%

If any two of the three judges disagree by more than 2 bands on any criterion, the essay is queued for human review. This catches the hard cases where the models are split.

Edge-band-triggered: +0.4%

Any essay scoring 1-2 or 9-10 on any criterion goes for human review. The tails matter most — a parent is most likely to challenge an extreme mark.

Parent-flagged: +0.1%

If a parent or student disputes a score in the app, the essay jumps the queue. ~50-60 essays a week, reviewed within 24 hours.
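
The four triggers above can be sketched as a single routing function. This is an illustrative outline only: `review_reasons`, its argument shapes, and the decision to apply the edge-band check to the final calibrated bands (rather than to raw judge bands) are assumptions.

```python
import random


def review_reasons(bands_by_judge, final_bands, parent_flagged=False,
                   rng=random, base_rate=0.02, max_gap=2):
    """Return the list of reasons an essay should go to a human marker.

    bands_by_judge: {criterion: (claude_band, gpt_band, gemini_band)}
    final_bands:    {criterion: calibrated_band}
    rng:            anything with a .random() method (injectable for tests)
    An empty list means the essay is not queued for review.
    """
    reasons = []
    if parent_flagged:
        reasons.append("parent_flagged")
    # Some pair of judges differs by more than max_gap exactly when max - min does.
    if any(max(scores) - min(scores) > max_gap for scores in bands_by_judge.values()):
        reasons.append("disagreement")
    # Tails: any criterion scored 1-2 or 9-10.
    if any(band <= 2 or band >= 9 for band in final_bands.values()):
        reasons.append("edge_band")
    # Blind random sample, drawn regardless of the other triggers.
    if rng.random() < base_rate:
        reasons.append("random_sample")
    return reasons
```

Recording every reason, not just the first, keeps the random sample identifiable as the unbiased measurement floor even when an essay also trips another trigger.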

Total review rate ≈ 4%. At 12,000 essays a day, that's 480 essays. At an average of 7 minutes per essay that is 56 marker-hours a day, or 7 hours each for our 8 human markers. That is sustainable for a small marking team in Lucknow that we hired and trained ourselves.

## DIY: build a smaller version of this engine

You don't need 12,000 essays/day to need rubric-grade scoring. The minimal viable version below grades hundreds of essays a day with the same core ideas.

Write the rubric inline in the prompt

Do not link to a PDF. Do not paraphrase. Paste the literal band descriptors into the system prompt. Models score better when the criteria are concrete and visible.

Require evidence quotes

Make the JSON schema demand a verbatim quote. Validate that the quote actually appears in the input. Reject the response and retry if it doesn't — this catches most fabrications.

Use two judges, not three (to start)

Run Claude Sonnet 4.5 and GPT-5.4 in parallel. If they agree within 1 band, use the average. If they disagree, ask a third model or queue for human. You'll get most of the benefit at 60% of the cost.

Build a 200-essay golden set

Have a human grade 200 essays. Measure QWK between your model pipeline and the human scores. Anything above 0.75 is publishable; below 0.7 needs more work on the rubric prompt.

Recalibrate weekly

Models drift. Your judge mix on May 1 may be miscalibrated by June 1 because GPT-5.4 silently changed under you. Recompute the per-criterion weights once a week — it's a 10-minute job.
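
Two of the steps above are small enough to sketch directly: the two-judge agreement rule and the QWK measurement against a golden set. Both functions are illustrative; the names `combine_two_judges` and `quadratic_weighted_kappa` are ours for this example, and the kappa code follows the standard textbook definition rather than any particular library.

```python
def combine_two_judges(band_a: int, band_b: int, tolerance: int = 1):
    """Average two judge bands if they agree within `tolerance`.

    Returns None when they disagree, signalling that the essay should go
    to a third model or a human marker.
    """
    if abs(band_a - band_b) <= tolerance:
        return (band_a + band_b) / 2
    return None


def quadratic_weighted_kappa(human, model, min_band=1, max_band=10):
    """Standard QWK between two equal-length lists of integer bands.

    Undefined (division by zero) if both raters use a single identical
    band for every essay; a golden set should not look like that.
    """
    n = max_band - min_band + 1
    total = len(human)
    observed = [[0] * n for _ in range(n)]
    for h, m in zip(human, model):
        observed[h - min_band][m - min_band] += 1

    hist_h = [sum(row) for row in observed]
    hist_m = [sum(observed[i][j] for i in range(n)) for j in range(n)]

    num = 0.0  # weighted observed disagreement
    den = 0.0  # weighted disagreement expected by chance
    for i in range(n):
        for j in range(n):
            weight = (i - j) ** 2 / (n - 1) ** 2
            num += weight * observed[i][j]
            den += weight * hist_h[i] * hist_m[j] / total
    return 1.0 - num / den
```

Run `quadratic_weighted_kappa` per criterion on the golden set, and only on essays that were not used to fit any weights.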

The biggest mistake we made early on: we reported QWK on the same essays we had fit the calibrator on. Result: an inflated QWK that collapsed in production. The fix is a rigid train/test split: the audit set you fit weights on must be different from the audit set you report QWK on. We learned this in week 3, painfully.

## Pre-flight checklist before you trust any rubric grader

- Rubric written down in 200 words per criterion, in plain text
- Two or more independent LLM judges with documented bias profiles
- Evidence-quote requirement validated against the input text
- Per-criterion calibration weights, refit at least monthly
- QWK against human markers measured on a held-out test set (not training)
- Sampled human review at minimum 2% random + disagreement-triggered queue
- A path for parents/users to dispute a score within 24 hours
- Versioned prompts in git, regression-tested via Promptfoo or equivalent

## Common failure modes (and what we do about them)

Failure 1: the model praises the student's intent rather than the writing. Claude has a tendency to score generously on Content when the topic is emotional. We added a sentence to the prompt: "You are grading the writing, not the importance of the topic." It moved the average Content band down by 0.4, which is measurable on the audit set.

Failure 2: the model gives full marks for technical correctness when the essay is dull. GPT-5.4 is the worst offender on SPaG. We hold the SPaG band cap at 8 unless a separate "lexical density" check passes (vocabulary above the 11+ word list).

Failure 3: the model spots one good phrase and lets it carry the whole score. Halo effect. The evidence-quote requirement helps, but we also force three independent quotes per criterion above band 7. If you can't find three, you can't score above 7.

Failure 4: a parent uploads a photo of their child's handwritten essay and the OCR is poor. We pre-flight every image through a confidence threshold; below 92% character confidence, we ask for a retake before grading. Saves us from grading "wsa walking too the shup" as bad spelling.

## When NOT to use this architecture

You only have a few hundred essays to grade total. Hire a human. The engineering investment doesn't pay back below ~50 essays a day at sustained volume.

Your rubric is fuzzy or unwritten. The whole architecture depends on a rubric you can write down in 200 words per criterion. If the rubric is "the marker uses their professional judgement," LLM grading will give you the appearance of consistency without the substance.

The stakes are exam-grade and binding. We do not grade real 11+ exams. PenLeap grades practice writing for skill development. The same engine for a real entrance exam would need a 100% human-review floor, not 4%.

## Why we built this in-house

[PenLeap](https://penleap.com) is Softechinfra's in-house edtech product.
We didn't build a rubric-grading engine because we wanted to write a blog post. We built it because no off-the-shelf tool we evaluated in 2024 could grade an 11+ essay against four criteria with under 1% hallucination at our price point. The same architectural pattern (rubric-as-prompt, multi-judge voting, calibrator, sampled human review) works for any domain-specific evaluation task. We've now reused it for two client projects: scoring B2B sales-call transcripts against a coaching rubric, and scoring student code submissions against a CS-1 rubric.

For the live conversation, [the Reddit thread on LLM essay grading](https://www.reddit.com/r/MachineLearning/) (search "rubric grading") is patchy, but the [arxiv preprint on LLM Essay Scoring Bias](https://arxiv.org/html/2604.00259) is the single most useful 2026 read we know of.

## FAQ

### How does multi-model voting actually reduce hallucination?

Each model has different failure modes. Claude tends to be generous on Content; GPT is generous on SPaG; Gemini is generous on long essays. Their errors are partially uncorrelated. By weighted-averaging across three judges, we average down the per-model bias. The Rulers paper measured a ~3x reduction in single-judge hallucination using exactly this pattern.

### Why don't you just fine-tune one model on graded essays?

We tried it. A Mistral-7B fine-tuned on 8,000 essays got us to QWK 0.71, worse than the multi-judge ensemble at QWK 0.84, and the fine-tune drifted within months as we collected more essays. The vote-and-calibrate approach lets us swap in new frontier models as they ship without retraining anything.

### What are the unit economics of one essay grade?

At our current judge mix, one essay costs us roughly ₹3.40 in API spend: Claude ₹1.60, GPT ₹1.20, Gemini ₹0.60. Human review adds ~₹14 per reviewed essay, which is about ₹0.56 per essay once spread across the 4% sample. Total marginal cost per essay: ~₹4. We charge £8/month on Pro for unlimited submissions; the average paid student submits ~22 essays/month.
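
The arithmetic behind the "~₹4" figure can be checked in a few lines. This sketch reads the ₹14 as the cost of one human review, incurred on the 4% of essays that are sampled; that reading, and the function name `marginal_cost_per_essay`, are assumptions. No exchange rate is applied, so the result is not compared against the £8 subscription price.

```python
API_COST_INR = {"claude": 1.60, "gpt": 1.20, "gemini": 0.60}  # per essay, from the text
HUMAN_REVIEW_COST_INR = 14.0   # per reviewed essay (assumed reading of "~₹14")
REVIEW_RATE = 0.04             # share of essays sent to a human
ESSAYS_PER_STUDENT_MONTH = 22  # average paid student


def marginal_cost_per_essay() -> float:
    """API spend for all three judges plus the amortised human-review cost."""
    api = sum(API_COST_INR.values())
    review = HUMAN_REVIEW_COST_INR * REVIEW_RATE
    return api + review


per_essay = marginal_cost_per_essay()
per_student_month = per_essay * ESSAYS_PER_STUDENT_MONTH
print(f"per essay: ₹{per_essay:.2f}, per student per month: ₹{per_student_month:.2f}")
```

Under these assumptions the marginal cost is about ₹3.96 per essay, or roughly ₹87 a month for a student submitting 22 essays.
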
### How do you handle a parent disputing a mark?

The essay re-enters human review within 24 hours. The marker can re-grade or confirm the AI score. ~31% of disputes result in a band adjustment; the remaining 69% confirm the original mark. We surface both the human and AI score back to the parent transparently.

### Could this engine grade SAT or IELTS essays?

Conceptually yes; practically, only with substantial work. SAT and IELTS rubrics are public, so prompting is straightforward, but the calibration set has to be built from real graded essays, which are hard to source legally. A client asked us about TOEFL writing in 2025; we estimated 6 weeks of work to do it credibly.

### Where does the engine break?

Three places. Highly stylized writing (deliberately fragmented sentences, modernist prose) confuses the SPaG judges. Code-switched English (Hindi-English mix common in Indian student work) reduces accuracy by ~6 QWK points on Language. And very short essays (<80 words) have too little signal, so we report a confidence flag rather than a clean band.

Need a Domain-Specific Grading or Evaluation Engine?

We've built rubric-grade evaluation pipelines for student writing (PenLeap), sales-call coaching, and code review. Typical build: 8-12 weeks from rubric definition to production. The first call is with the engineer who'd lead your project — we'll tell you honestly whether your rubric is gradeable by an ensemble or whether you should hire markers.

Tags: LLM Evaluation, Rubric Grading, PenLeap, Edtech, Claude, Multi-Model, Calibration

Hrishikesh Baidya

CTO at Softechinfra specializing in Python, system architecture, and building secure, scalable software solutions.
