Which LLM Extracts Your Documents Best?
Speed. Accuracy. Cost. The answer depends on your documents — and it changes every time a provider ships a new model. ARENA benchmarks every major LLM on your actual files so you can stop guessing and start deciding.
Three Problems. Every Team. No Good Answer.
Every team using LLMs to extract data from documents hits the same wall:
Speed
That model works in a demo. Will it hold up when you're processing 10,000 invoices? Extraction latency becomes a pipeline bottleneck at scale, and the fastest model isn't always the most accurate.
Accuracy
GPT-4o nails your invoices but struggles with contracts. Claude is perfect for identity documents but expensive. Gemini Flash is cheap and fast but misses nested line items. The "best" model depends entirely on what you're extracting.
Cost
LLM API costs scale linearly with document volume. The most accurate provider may be 10x the price of one that's "good enough" for half your document types. Without data, you're either overpaying or underperforming.
The compounding problem: The LLM landscape doesn't stand still. New models drop monthly. Pricing changes without warning. The provider you picked six months ago may not be the right choice today — but you'd never know without re-testing.
One Platform. Every Provider. Real Numbers.
Upload
PDFs, Office docs, images. Organise into datasets by document type. Define JSON schemas or let the LLM suggest structure.
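A schema-bounded dataset might be defined with a minimal JSON Schema like the sketch below (the field names are illustrative for an invoice, not a required ARENA format):

```json
{
  "type": "object",
  "properties": {
    "invoice_number": { "type": "string" },
    "issue_date": { "type": "string", "format": "date" },
    "total": { "type": "number" },
    "line_items": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "description": { "type": "string" },
          "quantity": { "type": "number" },
          "unit_price": { "type": "number" }
        }
      }
    }
  },
  "required": ["invoice_number", "total"]
}
```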
Benchmark
Run extractions across 7 providers and dozens of models. Text mode and vision mode. Precision-timed to the millisecond.
Compare
Field-by-field accuracy against ground truth. Cost per extraction. Speed distribution. Discover which LLM is optimal for each dataset.
How ARENA Works
Your documents flow through every major LLM — real results in minutes
Speed. Accuracy. Cost. Measured, Not Guessed.
Real benchmark results from real documents. Every extraction timed, scored, and priced.
Speed
Precision timing in milliseconds. Clock starts at request deserialisation, stops at JSON parse. No I/O noise, no infrastructure overhead.
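The timing boundary described above can be sketched as follows; `call_llm` is a stand-in for any provider SDK call, an assumption for illustration rather than ARENA's actual internals:

```python
import json
import time


def timed_extraction(call_llm, prompt: str) -> tuple[dict, float]:
    """Time only the model round-trip and JSON parse, excluding upload I/O.

    `call_llm` is any callable that takes a prompt and returns a raw
    JSON string (a placeholder for a provider SDK call).
    """
    start = time.perf_counter()  # clock starts: request already deserialised
    raw = call_llm(prompt)       # provider round-trip
    result = json.loads(raw)     # clock stops after the JSON parse
    elapsed_ms = (time.perf_counter() - start) * 1000
    return result, elapsed_ms
```

Using a monotonic clock (`time.perf_counter`) rather than wall-clock time keeps the measurement immune to system clock adjustments.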
Accuracy
Field-level scoring against your ground truth. Quantitative accuracy from 0.0 to 1.0, per field, per document, per provider.
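Field-level scoring of this kind can be sketched as below; a real scorer would likely add fuzzy matching for strings and tolerances for numbers, so treat exact matching as a simplifying assumption:

```python
def field_accuracy(extracted: dict, ground_truth: dict) -> dict:
    """Score each ground-truth field: 1.0 for an exact match, else 0.0."""
    return {
        field: 1.0 if extracted.get(field) == expected else 0.0
        for field, expected in ground_truth.items()
    }


def document_accuracy(extracted: dict, ground_truth: dict) -> float:
    """Average the per-field scores into one 0.0-1.0 document score."""
    scores = field_accuracy(extracted, ground_truth)
    return sum(scores.values()) / len(scores) if scores else 0.0
```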
Cost
Estimated USD per extraction based on token usage and provider pricing. See the cost-accuracy trade-off for every provider.
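The cost estimate reduces to token counts times per-token prices, as in this sketch (the price table is illustrative; real provider pricing varies by model and date):

```python
# Illustrative USD prices per million tokens -- an assumption, not live pricing.
PRICING_USD_PER_MTOK = {
    "gpt-4o": {"input": 2.50, "output": 10.00},
    "gemini-2.0-flash": {"input": 0.10, "output": 0.40},
}


def extraction_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate the USD cost of one extraction from its token usage."""
    p = PRICING_USD_PER_MTOK[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000
```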
Different Documents Need Different LLMs
Your invoices might extract best with GPT-4o in vision mode. Your contracts might need Claude in text mode. Your identity documents might work fine with Gemini Flash at a fraction of the cost.
Provider Accuracy Comparison
📊 Invoice Dataset · 200 documents
7 Providers. Dozens of Models. One Test.
Every major LLM provider, unified under one benchmarking interface. Same input, same output format — real comparison.
OpenAI
Anthropic
Azure OpenAI
Google
Mistral
Cohere
AWS Bedrock
Three extraction strategies: free-wheel (LLM suggests structure), schema-bounded (your JSON schema), and two-pass (categorise then extract). Mix providers across passes.
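The two-pass strategy can be sketched as follows; the `extract` helper and the category schemas are illustrative assumptions, not ARENA's API, and the two passes may be served by different providers:

```python
def two_pass_extract(document_text: str, extract, schemas: dict) -> dict:
    """Pass 1: categorise the document. Pass 2: extract with that
    category's JSON schema.

    `extract(prompt, schema)` stands in for any schema-bounded LLM call.
    """
    # Pass 1: constrain the model to the known category names.
    category = extract(
        f"Classify this document as one of {list(schemas)}:\n{document_text}",
        schema={"type": "string", "enum": list(schemas)},
    )
    # Pass 2: extract fields using the schema for that category.
    return extract(
        f"Extract the fields for a {category}:\n{document_text}",
        schema=schemas[category],
    )
```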
LLMs Change. Your Benchmarks Should Keep Up.
DBL (Document Benchmark Language) — a declarative scripting language for document extraction benchmarks. Write a script once, re-run it every time a provider ships a new model.
documents invoices {
  dataset "invoice-dataset"
  where type == "invoice"
}

engine gpt4o {
  provider "openai"
  model "gpt-4o"
  mode vision
}

engine claude {
  provider "anthropic"
  model "claude-sonnet-4-20250514"
  mode text
}

engine gemini_flash {
  provider "google"
  model "gemini-2.0-flash"
  mode text
}

run benchmark {
  documents invoices
  engines [gpt4o, claude, gemini_flash]
  repeat 3
  parallel true
}

analysis trilemma {
  source benchmark
  metrics [accuracy, duration_ms, cost_usd]
  group_by engine
}

{
  "benchmark_id": "bench_7x9k2m",
  "dataset": "invoices-q4",
  "documents": 200,
  "results": [
    {
      "engine": "gpt-4o",
      "accuracy": 0.971,
      "avg_ms": 1243,
      "cost": "$0.032"
    },
    {
      "engine": "claude-sonnet-4",
      "accuracy": 0.954,
      "avg_ms": 892,
      "cost": "$0.028"
    },
    {
      "engine": "gemini-flash",
      "accuracy": 0.912,
      "avg_ms": 456,
      "cost": "$0.008"
    }
  ],
  "winner": {
    "speed": "gemini-flash",
    "accuracy": "gpt-4o",
    "cost": "gemini-flash",
    "balanced": "claude-sonnet-4"
  }
}

Built for Teams That Process Documents with LLMs
Choosing a Provider
We're building invoice processing and need to pick between OpenAI and Anthropic. ARENA tested both on our 200 invoices — GPT-4o was 3% more accurate but 4x slower. We went with Claude for production.
Tracking Model Changes
GPT-4o just shipped an update. We re-ran our DBL script and compared results to last month's baseline. Accuracy improved 3% on contracts, dropped 1% on receipts.
Optimising Cost at Scale
We discovered that Gemini Flash gets 94% accuracy on our identity docs at 1/5 the cost of GPT-4o. For 50,000 docs/month, that's the difference between viable and prohibitive.
Start Benchmarking. Free.
100 credits on the free tier. No credit card required. Upload your documents, pick your providers, run your first benchmark in minutes.