7 providers · Real documents · Real numbers

Which LLM Extracts Your Documents Best?

Speed. Accuracy. Cost. The answer depends on your documents — and it changes every time a provider ships a new model. ARENA benchmarks every major LLM on your actual files so you can stop guessing and start deciding.

7 Providers
2 Extraction Modes
<1ms Timing Precision
benchmark.dbl — running 6 extractions (Live)

invoice_042.pdf    GPT-4o            97.1%   1,243 ms
invoice_042.pdf    Claude Sonnet 4   95.4%     892 ms
invoice_042.pdf    Gemini Flash      91.2%     456 ms
contract_017.pdf   GPT-4o            89.3%   2,340 ms
contract_017.pdf   Claude Sonnet 4   94.1%   1,567 ms
contract_017.pdf   Gemini Flash      82.7%     890 ms
Dozens of LLM models · Avg. benchmark time under 2 minutes
01

Three Problems. Every Team. No Good Answer.

Every team using LLMs to extract data from documents hits the same wall:

⚡

Speed

That model works in a demo. Will it hold when you're processing 10,000 invoices? Extraction latency becomes a pipeline bottleneck at scale, and the fastest model isn't always the most accurate.

🎯

Accuracy

GPT-4o nails your invoices but struggles with contracts. Claude is perfect for identity documents but expensive. Gemini Flash is cheap and fast but misses nested line items. The "best" model depends entirely on what you're extracting.

💰

Cost

LLM API costs scale linearly. The most accurate provider may be 10x the price of one that's "good enough" for half your document types. Without data, you're either over-paying or under-performing.

The compounding problem: The LLM landscape doesn't stand still. New models drop monthly. Pricing changes without warning. The provider you picked six months ago may not be the right choice today — but you'd never know without re-testing.

02

One Platform. Every Provider. Real Numbers.

01

Upload

PDFs, Office docs, images. Organise into datasets by document type. Define JSON schemas or let the LLM suggest structure.

02

Benchmark

Run extractions across 7 providers and dozens of models. Text mode and vision mode. Precision-timed to the millisecond.

03

Compare

Field-by-field accuracy against ground truth. Cost per extraction. Speed distribution. Discover which LLM is optimal for each dataset.

How ARENA Works

Your documents flow through every major LLM — real results in minutes

📄 Your Documents (PDF, DOCX, images) → ARENA (OpenAI · Claude · Gemini · Mistral · Azure · Cohere · AWS) → 📊 Real Numbers (Speed · Accuracy · Cost)
03

Speed. Accuracy. Cost. Measured, Not Guessed.

Real benchmark results from real documents. Every extraction timed, scored, and priced.

Benchmark Results (Live)

Document              Mode     Provider          Accuracy   Time       Cost
📄 invoice_042.pdf    vision   GPT-4o            97.1%      1,243 ms   $0.032
📄 invoice_042.pdf    text     Claude Sonnet 4   95.4%        892 ms   $0.028
📄 invoice_042.pdf    text     Gemini Flash      91.2%        456 ms   $0.008
📄 contract_017.pdf   vision   GPT-4o            89.3%      2,340 ms   $0.045
📄 contract_017.pdf   text     Claude Sonnet 4   94.1%      1,567 ms   $0.038
📄 contract_017.pdf   text     Gemini Flash      82.7%        890 ms   $0.012
🖼️ id_card_091.jpg    vision   GPT-4o            96.4%      1,890 ms   $0.041
🖼️ id_card_091.jpg    vision   Claude Sonnet 4   93.2%      1,234 ms   $0.035
<1ms Precision

Speed

Precision timing in milliseconds. Clock starts at request deserialisation, stops at JSON parse. No I/O noise, no infrastructure overhead.

0.0→1.0 Per Field

Accuracy

Field-level scoring against your ground truth. Quantitative accuracy from 0.0 to 1.0, per field, per document, per provider.

USD Per Extraction

Cost

Estimated USD per extraction based on token usage and provider pricing. See the cost-accuracy trade-off for every provider.
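ARENA's exact scoring logic isn't shown on this page, but the idea of per-field accuracy against ground truth can be sketched in a few lines of Python. The function name and the exact-match rule are illustrative assumptions; a production scorer would likely add fuzzy matching for strings and tolerances for numbers.

```python
def field_accuracy(extracted: dict, ground_truth: dict) -> dict:
    """Score each expected field 1.0 for an exact match, 0.0 otherwise,
    and return per-field scores plus the document-level mean.
    (Illustrative sketch only, not ARENA's actual scorer.)"""
    scores = {}
    for field, expected in ground_truth.items():
        actual = extracted.get(field)
        scores[field] = 1.0 if actual == expected else 0.0
    overall = sum(scores.values()) / len(scores) if scores else 0.0
    return {"fields": scores, "overall": overall}

result = field_accuracy(
    extracted={"invoice_no": "042", "total": "118.50", "currency": "EUR"},
    ground_truth={"invoice_no": "042", "total": "118.50", "currency": "USD"},
)
# Two of three fields match, so the overall score is 2/3.
```

Averaging those per-document scores across a dataset, per provider, is what makes "97.1% on invoices" a comparable number rather than an impression.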

04

Different Documents Need Different LLMs

Your invoices might extract best with GPT-4o in vision mode. Your contracts might need Claude in text mode. Your identity documents might work fine with Gemini Flash at a fraction of the cost.

🧾 Invoices · Best: GPT-4o (97.1%) · 3x cheaper with Gemini
📑 Contracts · Best: Claude Sonnet 4 (94.1%) · 40% faster with Claude
🪪 ID Documents · Best: GPT-4o, vision (96.4%) · Vision mode +8% accuracy
🧾 Receipts · Best: Gemini Flash (94.8%) · 5x cheaper, same accuracy

Provider Accuracy Comparison

📊 Invoice Dataset · 200 documents
GPT-4o (Most Accurate): 97.1%
Claude Sonnet 4 (Best Balance): 95.4%
Gemini Flash (Fastest): 91.2%
Mistral Large: 88.5%
Command R+: 86.3%

>95% Excellent · 90-95% Good · <90% Fair
05

7 Providers. Dozens of Models. One Test.

Every major LLM provider, unified under one benchmarking interface. Same input, same output format — real comparison.

OpenAI: GPT-4o, GPT-4-turbo (✓ Text · ✓ Vision)

Anthropic: Claude Sonnet 4, Claude Opus 4 (✓ Text · ✓ Vision)

Google: Gemini 2.0 Flash, Gemini Pro (✓ Text · ✓ Vision)

Azure OpenAI: GPT-4o (Azure) (✓ Text · ✓ Vision)

Mistral: Mistral Large, Mistral Medium (✓ Text · — Vision)

Cohere: Command R+, Command R (✓ Text · — Vision)

AWS Bedrock: Claude via Bedrock, Titan (✓ Text · ✓ Vision)

Three extraction strategies: free-wheel (LLM suggests structure), schema-bounded (your JSON schema), and two-pass (categorise then extract). Mix providers across passes.
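This page doesn't show how those strategies are expressed in DBL. As a loose sketch in the style of the benchmark.dbl example in the next section, a schema-bounded engine might look like this; the `strategy` and `schema` keys are assumptions for illustration, not documented syntax:

```dbl
engine gpt4o_schema {
  provider "openai"
  model "gpt-4o"
  mode text
  strategy schema_bounded        // assumed key: bound output to a JSON schema
  schema "invoice-schema.json"   // assumed key: path to your schema file
}
```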

06

LLMs Change. Your Benchmarks Should Keep Up.

DBL (Document Benchmark Language) — a declarative scripting language for document extraction benchmarks. Write a script once, re-run it every time a provider ships a new model.

DBL · benchmark.dbl
documents invoices {
  dataset "invoice-dataset"
  where type == "invoice"
}

engine gpt4o {
  provider "openai"
  model "gpt-4o"
  mode vision
}

engine claude {
  provider "anthropic"
  model "claude-sonnet-4-20250514"
  mode text
}

engine gemini_flash {
  provider "google"
  model "gemini-2.0-flash"
  mode text
}

run benchmark {
  documents invoices
  engines [gpt4o, claude, gemini_flash]
  repeat 3
  parallel true
}

analysis trilemma {
  source benchmark
  metrics [accuracy, duration_ms, cost_usd]
  group_by engine
}
JSON · output.json
{
  "benchmark_id": "bench_7x9k2m",
  "dataset": "invoices-q4",
  "documents": 200,
  "results": [
    {
      "engine": "gpt-4o",
      "accuracy": 0.971,
      "avg_ms": 1243,
      "cost": "$0.032"
    },
    {
      "engine": "claude-sonnet-4",
      "accuracy": 0.954,
      "avg_ms": 892,
      "cost": "$0.028"
    },
    {
      "engine": "gemini-flash",
      "accuracy": 0.912,
      "avg_ms": 456,
      "cost": "$0.008"
    }
  ],
  "winner": {
    "speed": "gemini-flash",
    "accuracy": "gpt-4o",
    "cost": "gemini-flash",
    "balanced": "claude-sonnet-4"
  }
}
07

Built for Teams That Process Documents with LLMs

Choosing a Provider

We're building invoice processing and need to pick between OpenAI and Anthropic. ARENA tested both on our 200 invoices — GPT-4o was 3% more accurate but 4x slower. We went with Claude for production.

Tracking Model Changes

GPT-4o just shipped an update. We re-ran our DBL script and compared results to last month's baseline. Accuracy improved 3% on contracts, dropped 1% on receipts.

Optimising Cost at Scale

We discovered that Gemini Flash gets 94% accuracy on our identity docs at 1/5 the cost of GPT-4o. For 50,000 docs/month, that's the difference between viable and prohibitive.
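The arithmetic behind that claim is simple to check. Using the GPT-4o figure from the ID-document row of the results table above, and Gemini Flash's invoice-row price as a stand-in for "1/5 the cost":

```python
# Monthly cost comparison at the stated volume of 50,000 documents.
docs_per_month = 50_000
gpt4o_per_doc = 0.041    # USD per extraction (GPT-4o vision, id_card row)
gemini_per_doc = 0.008   # USD per extraction (roughly 1/5 the price)

gpt4o_monthly = docs_per_month * gpt4o_per_doc    # ≈ $2,050/month
gemini_monthly = docs_per_month * gemini_per_doc  # ≈ $400/month
savings = gpt4o_monthly - gemini_monthly          # ≈ $1,650/month
```

At that volume a few cents per extraction compounds into thousands of dollars per month, which is why the cost axis matters as much as accuracy.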

Start Benchmarking. Free.

100 credits on the free tier. No credit card required. Upload your documents, pick your providers, run your first benchmark in minutes.