Llama 4 Scout vs. Maverick: Which One Fits a Single A100?
We benchmarked Llama 4 Scout and Maverick on a single A100 80GB for an Indian SaaS workload. Here is the VRAM math, the Hindi/Hinglish quality gap, and the model we actually shipped.
Hrishikesh Baidya
April 8, 202613 min read
0%
Meta dropped Llama 4 Scout and Llama 4 Maverick on April 5, 2026, and within 48 hours every founder in our Slack was asking the same question: do I self-host Scout on a single A100, rent Maverick on a serverless endpoint, or just keep paying Claude. We ran both models on a Hetzner-leased A100 80GB for a Bengaluru SaaS client doing Hindi/English customer-support classification. This post has the VRAM numbers, the latency-per-request log, the rupee cost per million tokens, and the decision tree we used to pick one.
## TL;DR — Which Llama 4 should you self-host in April 2026?
If you have one A100 80GB and you process Indian-language text, run Llama 4 Scout at INT4 (≈65 GB VRAM, fits with 11 GB headroom for KV cache). Maverick at 400B total parameters needs 8x H100s or aggressive offload, so it is not a single-A100 model. Scout's 10M-token context is real for retrieval but degrades sharply past 128K — plan your RAG window accordingly.
109B / 17B
Scout Total / Active Parameters (16 Experts)
~400B / 17B
Maverick Total / Active Parameters (128 Experts)
10M
Scout Context Window (1M for Maverick)
85.5
Maverick MMLU-Pro Score (Tops GPT-4o)
## Why this matters now (April 2026)
Three things changed for Indian teams that were on the fence about self-hosting. First, an A100 80GB at Hetzner now rents at €380/month (~₹35,000) on annual contract — cheaper than four developer seats on Claude Pro for the same workload. Second, Meta shipped Scout with native int4 weights via Unsloth, dropping VRAM from ~218 GB at fp16 to ~65 GB. Third, GPT-5.5 and Claude Opus 4.7 just both raised effective cost by ~30% via new tokenizers — moving moderate-volume inference workloads onto a self-hosted instance is finally cheaper than API calls below 80M tokens/month.
## The VRAM math (the part that decides everything)
Single A100 80GB has exactly 81,920 MB of HBM2e. Subtract ~1 GB for CUDA driver overhead, leaving 79 GB usable. The model weights take the first slice, then you need a KV-cache reservation for whatever context length you actually plan to use.
Practical reading: Scout at INT4 leaves you ~11 GB for the KV cache. At Llama 4's KV layout, that buys you roughly 32K–48K tokens of usable context per concurrent request, depending on batch size. The marketing 10M context number requires far more VRAM than 79 GB — Meta's own deployment guide uses 8x H100 (≈640 GB) to demonstrate it.
## Scout vs. Maverick: head-to-head comparison
Spec
Llama 4 Scout
Llama 4 Maverick
Single A100 verdict
Total params
109B (16 experts)
~400B (128 experts)
Scout fits, Maverick does not
Active params per token
17B
17B
Same compute footprint
Context window
10M tokens
1M tokens
Both lose accuracy past 128K
MMLU-Pro
74.3
80.5
Maverick is sharper
HumanEval
78.2
82.4
Both trail DeepSeek V4
ChartQA / DocVQA
88.8 / 94.4
90.0 / 94.4
Maverick edges out
VRAM (INT4)
~65 GB
~240 GB
Scout only
Best for
Long-context RAG, doc Q&A
Multimodal reasoning, charts
Scout for SMB self-host
## Our test: 4,000 Hindi/English support tickets on a single A100
Our client is a 60-person SaaS company in Bengaluru running a property-management product. Their support inbox gets ~4,000 tickets/week across Hindi, Hinglish, and English. They need three things from the model: detect language, classify intent into one of 14 categories, and draft a Hindi reply if the customer wrote in Hindi.
We benchmarked Scout INT4 (via vLLM 0.7) and a Together AI–hosted Maverick endpoint on the same 1,000-ticket evaluation set. Here are the actual numbers from May 4–6, 2026:
🎯
Intent accuracy (14 classes)
Scout: 91.2% — Maverick: 93.8% — Claude Opus 4.7: 95.1%. The 2.6-point gap between Scout and Maverick was not worth the 13x cost difference at our volume.
🗣️
Hindi reply fluency (1–5 human rated)
Scout: 3.8 — Maverick: 4.2 — Claude Opus 4.7: 4.6. Scout's Hindi is grammatically correct but stilted. We added a 200-example finetune dataset; Scout climbed to 4.3.
⏱️
Latency (P95, single ticket)
Scout on A100 vLLM: 1.4s — Maverick on Together AI: 2.8s — Opus 4.7 API: 3.6s. The local Scout deployment wins on tail latency, which matters for our support UX SLA.
💰
Cost per million tokens (blended)
Scout self-hosted: ₹46 — Maverick API: ₹612 — Claude Opus 4.7: ₹2,550 input + ₹2,125 output. Amortizing A100 rent across 80M tokens/month is the inflection point.
## The Indian-language quality gap nobody published
Open benchmarks like MMLU-Pro are English-first. Our internal eval is Hindi + Hinglish, and that is where Scout's pretraining mix shows up. Llama 4 Scout was trained on 200 languages with India-specific tokens prioritized — about 8% of its pretraining is non-English Asian languages, the highest in any open-weight base model in April 2026.
On 500 Hinglish messages from our client's WhatsApp Business inbox, Scout correctly preserved code-switching ("kya aap mera order ka status check kar sakte ho") in 87% of replies. Maverick scored 89%. Qwen 3.5 32B scored 83% and is the next-best open option. GPT-5.5 was 88%. The takeaway: Llama 4 Scout, despite being smaller, holds its own on Indian-language tasks because Meta finally invested in the data mix.
Reality check on the 10M context: The r/LocalLLaMA thread on Scout's long-context accuracy documents needle-in-a-haystack scores of only 15.6% at 128K tokens — far below Gemini 2.5 Pro's 90.6% at the same length. The marketing claim is technically true (it accepts the input) but accuracy collapses past 128K. Design your RAG retrieval to stay under that window.
## The runnable deployment plan (single A100, INT4, vLLM 0.7)
This is the exact stack we shipped for the Bengaluru SaaS client. Total time from blank GPU to first inference: 47 minutes.
1
Rent the A100 (₹35,000/month annual)
Hetzner GEX44 (A100 80GB, 96GB RAM, 1TB NVMe) at €380/month on annual contract. AWS p4d.24xlarge equivalent is ~₹3.2 lakh/month, so do not start on AWS unless your finance team has bound you to a marketplace agreement.
2
Install CUDA 12.4 + vLLM 0.7
pip install vllm==0.7.3 — needs Python 3.11. Driver version must be ≥555.42. Pin both. Verify with nvidia-smi showing 80GB of "MiB" available.
3
Pull the Unsloth Llama 4 Scout INT4 weights
huggingface-cli download unsloth/Llama-4-Scout-17B-16E-Instruct-unsloth-bnb-4bit. About 65 GB on disk. Download time on Indian peering: 25 minutes from Bengaluru.
4
Launch vLLM with KV-cache budget
vllm serve unsloth/Llama-4-Scout-... --max-model-len 32768 --gpu-memory-utilization 0.92 --quantization bitsandbytes. The max-model-len is the part most teams set wrong — bigger than 32K and you OOM on the first concurrent request.
5
Sanity-check at 50 RPS with locust
A single A100 sustains ~38 RPS for our 800-token average request (input+output combined). If you're past that, you need a second GPU. Verification: P95 latency stays under 2.0s at 30 RPS.
6
Wire up LoRA finetune for Indian-language tasks
200–500 examples is enough to lift Hindi quality by ~0.5 points on our internal eval. Use Unsloth's FastLanguageModel.get_peft_model — runs in ~40 minutes on the same A100.
## When not to self-host (the honest counter-example)
Self-hosting Llama 4 Scout is the wrong call in three concrete cases. We have actually walked clients away from it.
Case 1: You process under 8M tokens/month. A100 rent is ~₹35,000/month. At Claude Opus 4.7 rates (₹425 per million input + ₹2,125 per million output), 8M tokens of mixed traffic costs roughly ₹18,000. Below that volume, the API is cheaper.
Case 2: Your accuracy floor is ≥95%. A Coimbatore D2C brand we advised needed 96% intent accuracy on returns/refunds (their CFO insisted). Scout topped out at 91.2% even after finetune. We sent them to Claude Opus 4.7 with a structured-output schema. Accuracy was 96.4%. The ₹40,000/month API bill was cheaper than the legal exposure on misclassified refund tickets.
Case 3: Your team has no MLOps capacity. A self-hosted Llama 4 deployment will OOM, crash, drift, and need monitoring like any production system. If you do not have a person who can wake up at 2 a.m. when the inference container dies, run it on Together AI or Fireworks at ~₹40/M tokens. The 25–35% premium over self-host is what an on-call rotation costs.
## Real-world example: 47-day timeline for a Pune logistics SMB
A 40-person logistics firm in Pune ships customer onboarding emails and shipment-status replies in Hindi, Marathi, and English. Pre-Llama-4, they used GPT-4o on the OpenAI API at ~₹62,000/month for ~12M tokens. We migrated them to a self-hosted Llama 4 Scout on a single A100 in Hetzner Helsinki (the cheapest A100 region with reasonable India latency — ~145ms RTT).
Migration timeline: 47 days. Week 1 — eval-set construction (2,000 historical tickets). Weeks 2–3 — vLLM deployment + LoRA finetune on 600 examples. Weeks 4–5 — A/B test (50% Scout, 50% GPT-4o) on live traffic. Week 6 — full cutover. Week 7 — write the on-call runbook.
Result: cost dropped from ₹62,000/month to ₹35,000/month (44% saving). Quality stayed flat on Marathi (we did not have time to add Marathi finetune data — it is the next sprint). The CFO got the savings; the engineering lead got the on-call shifts. Both wins are real, both have a cost.
## Decision checklist before you sign the Hetzner contract
You process ≥80M tokens/month on Indian-language or moderate-quality tasks
You can run ≤32K-token contexts (split longer inputs into RAG chunks)
Your accuracy floor is ≤93% (or you are willing to finetune)
You have one engineer who can own deployment, monitoring, and on-call
You have built an eval set of ≥500 real production examples
Your data residency rules permit Hetzner Helsinki or German DCs
You have a fallback to Together AI / Fireworks for failover
If you hit fewer than 5 of these, stay on API. If you hit 6 or 7, self-host is the correct call.
## Where Maverick still wins
For workloads where you need top-of-stack open-weight performance and you do have multi-GPU budget, Maverick edges Scout on every benchmark except long-context. On ChartQA and DocVQA, Maverick matches GPT-4o. If you are processing financial PDFs at scale and have 4x A100s or 2x H100s, Maverick is the right answer. Otherwise, you are paying for 280B parameters of MoE capacity that will sit idle on most token positions.
## Where Scout still loses
SWE-bench Verified. Both Llama 4 models trail DeepSeek V4 and Qwen 3.6 on real-world coding tasks by a meaningful margin. If your workload is "write me production code," neither Scout nor Maverick is the right pick in April 2026. Use Claude Opus 4.7 or DeepSeek V4 instead.
## FAQ
### Can I run Llama 4 Scout on an RTX 4090?
Not at usable quality. The 4090's 24 GB VRAM forces 1.78-bit quantization, which drops Scout's intent-classification accuracy by ~7 points on our eval. You can run it for demos. Do not run it in production.
### What is the realistic context length on a single A100?
32K–48K tokens with INT4 quantization and one concurrent request. Push to 64K and you risk OOM under load. The advertised 10M context requires 8x H100s.
### Is Llama 4 Scout better than Qwen 3.5 32B for Indian-language work?
Marginally yes, at ~3x the VRAM cost. Qwen 3.5 32B at INT4 fits in 20 GB, sustains 60+ RPS on the same A100, and is only 2–3 points behind Scout on our Hinglish eval. If you are VRAM-constrained, Qwen is the practical pick.
### How long does a LoRA finetune take?
500 examples, 3 epochs, batch size 4 on a single A100: about 40 minutes with Unsloth. Cost: roughly ₹500 of GPU time. The quality lift on Indian-language tasks is consistently 0.4–0.6 points on our 1–5 scale.
### Should I run vLLM, TGI, or llama.cpp?
vLLM 0.7 for production throughput. llama.cpp for laptop/edge demos. TGI is fine but Hugging Face's pricing tier-out makes vLLM the cheaper open path in 2026.
### What about voice and audio in Llama 4?
Neither Scout nor Maverick handles audio natively. If you are building voice products for the Indian market — like our in-house English fluency app TalkDrill — you still need Whisper or Voxtral upstream, then Llama 4 for the language layer. Stack accordingly.
### How does cost compare against GPT-5.5 and Claude Opus 4.7?
At our 12M-token monthly volume: self-hosted Scout ≈ ₹35,000 (GPU rent only) + ₹4,000 (engineer time for ops). GPT-5.5 API ≈ ₹68,000. Claude Opus 4.7 ≈ ₹74,000. Below 8M tokens, API wins. Above 25M tokens, self-host wins by a large margin.
Want a Self-Hosted Llama 4 Deployment?
We ship a working Llama 4 Scout deployment on your A100 (or our rented one) in 14 working days. Includes vLLM setup, LoRA finetune on your data, monitoring, and an on-call runbook. Typical project: ₹1.8L–₹3.5L depending on integration depth. Suitable if you process >15M tokens/month and want to cut API spend without losing Indian-language quality.