Everything You Need to Benchmark Document Extraction

Upload documents. Define schemas. Run extractions across 7 LLM providers. Compare accuracy, speed, and cost. All in one platform.

01

Three Ways to Extract

🌀

Free-Wheel Extraction

The LLM receives your document with no constraints. It suggests a category, proposes a schema, and extracts data. Useful for exploration: find out what the model "sees" before you impose structure.

  • No schema required
  • LLM suggests document type and structure
  • Great for initial dataset exploration
📝

Schema-Bounded Extraction

You provide one or more JSON schemas. The LLM must return data conforming to your structure. The real-world mode: when you know what you want and need the LLM to comply.

  • Send one or multiple schemas per extraction
  • LLM output validated against schema
  • Production-realistic testing
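To make the idea concrete, schema-bounded extraction boils down to rejecting any model output that does not conform to your schema. A minimal sketch in Python — the invoice schema, field names, and the hand-rolled validator are all invented for illustration, not ARENA's internal validation:

```python
import json

# A hypothetical invoice schema in JSON Schema style (illustrative only).
INVOICE_SCHEMA = {
    "type": "object",
    "required": ["invoice_number", "total", "currency"],
    "properties": {
        "invoice_number": {"type": "string"},
        "total": {"type": "number"},
        "currency": {"type": "string"},
    },
}

def validate(output, schema):
    """Return a list of violations (empty list means the output conforms)."""
    errors = []
    type_map = {"string": str, "number": (int, float), "object": dict}
    for field in schema.get("required", []):
        if field not in output:
            errors.append(f"missing required field: {field}")
    for field, spec in schema.get("properties", {}).items():
        if field in output and not isinstance(output[field], type_map[spec["type"]]):
            errors.append(f"wrong type for {field}: expected {spec['type']}")
    return errors

llm_output = json.loads('{"invoice_number": "INV-42", "total": 119.5, "currency": "EUR"}')
print(validate(llm_output, INVOICE_SCHEMA))  # []
```

A production validator would use a full JSON Schema library; the point is that the schema, not the model, defines what counts as a valid extraction.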
🔄

Two-Pass Extraction

Pass 1: a cheap, fast model categorises the document. Pass 2: a more capable model extracts using the correct schema. Mix providers across passes.

  • Separate models per pass
  • Cross-provider pass configuration
  • Optimise cost without sacrificing accuracy
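The two-pass flow is simple to picture in code. In this sketch both model calls are stubs, and the function names and schema registry are invented for the example — in ARENA each pass can be routed to a different provider:

```python
# Hypothetical schema registry keyed by document type.
SCHEMAS = {
    "invoice": {"required": ["invoice_number", "total"]},
    "receipt": {"required": ["merchant", "amount"]},
}

def categorise(document):
    """Pass 1: a cheap, fast model guesses the document type (stubbed here)."""
    return "invoice" if "Invoice" in document else "receipt"

def extract(document, schema):
    """Pass 2: a more capable model extracts against the chosen schema (stubbed)."""
    return {field: f"<{field}>" for field in schema["required"]}

def two_pass(document):
    doc_type = categorise(document)              # pass 1: classification
    return extract(document, SCHEMAS[doc_type])  # pass 2: schema-bounded extraction

print(two_pass("Invoice #42 ..."))
```

The cost argument: classification is a short prompt a small model handles well, so the expensive model only ever runs once, with the right schema already selected.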
02

7 Providers. Dozens of Models. Text and Vision.

ARENA abstracts away provider-specific API differences. Every extraction goes through a unified interface: same input, same output format, regardless of provider.

Provider        Text  Vision  Models
OpenAI          ✓     ✓       GPT-4o, GPT-4-turbo
Anthropic       ✓     ✓       Claude Sonnet 4, Opus 4
Google          ✓     ✓       Gemini 2.0 Flash, Pro
Azure OpenAI    ✓     ✓       GPT-4o (Azure-hosted)
Mistral         ✓     ✗       Large, Medium
Cohere          ✓     ✗       Command R+, Command R
AWS Bedrock     ✓     ✓       Claude (Bedrock), Titan

  • Text mode: document text is extracted and sent to the LLM as text.
  • Vision mode: pages are rendered as images and sent to the LLM's vision endpoint.
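The difference between the two modes is just the shape of the payload sent to the provider. A hedged sketch — the function and the payload fields are invented for illustration, since real provider payloads differ (which is exactly what the unified interface hides):

```python
import base64

def build_payload(mode, text=None, page_images=None):
    """Shape the request by extraction mode (illustrative, not a real provider API)."""
    if mode == "text":
        # Text mode: extracted document text goes in as-is.
        return {"type": "text", "content": text}
    if mode == "vision":
        # Vision mode: each rendered page image is base64-encoded.
        return {
            "type": "image",
            "pages": [base64.b64encode(img).decode("ascii") for img in page_images],
        }
    raise ValueError(f"unknown mode: {mode}")

print(build_payload("text", text="Invoice #42"))
```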

03

Precision Timing. No Noise.

ARENA's clock starts when the extraction request is deserialised and stops when the final JSON is parsed. You're measuring the LLM, not the infrastructure.

01

Millisecond precision

Extraction time measured without I/O overhead

02

Batch execution

Run hundreds of documents through multiple providers in one operation

03

Parallel & sequential modes

Control concurrency per benchmark

04

Repetition

Run the same extraction N times to measure variance

05

Rate limiting

Built-in per-provider throttling to respect API limits
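The timing discipline above — a narrow measurement window plus N repetitions to expose variance — can be sketched in a few lines. The `call_llm` callable stands in for the provider call; everything outside the window (upload, persistence, rate limiting) is deliberately excluded, mirroring the request-deserialised-to-JSON-parsed window described above:

```python
import statistics
import time

def timed_extraction(call_llm, payload, repetitions=5):
    """Time only the LLM call, repeated N times to measure variance (illustrative)."""
    durations_ms = []
    for _ in range(repetitions):
        start = time.perf_counter()    # high-resolution monotonic clock
        call_llm(payload)              # the only work inside the timing window
        durations_ms.append((time.perf_counter() - start) * 1000)
    return {
        "mean_ms": statistics.mean(durations_ms),
        "stdev_ms": statistics.stdev(durations_ms) if repetitions > 1 else 0.0,
    }
```

`time.perf_counter` is the right clock for this: monotonic and sub-millisecond, unaffected by wall-clock adjustments.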

04

Field-Level Accuracy. Not Just "It Worked."

Upload a golden JSON for each document. ARENA compares every LLM's output against ground truth at the field level.

  • Exact match: field value matches ground truth exactly
  • Fuzzy match: value is semantically equivalent (date formats, whitespace)
  • Missing fields: fields the LLM failed to extract
  • Extra fields: fields the LLM hallucinated
  • Score (0.0 → 1.0): per document, aggregated by provider, model, and document type
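A minimal sketch of this field-level comparison — here "fuzzy" is just whitespace and case normalisation, a stand-in for ARENA's richer semantic matching (date formats and the like):

```python
def score_extraction(predicted, golden):
    """Compare an LLM's output dict against a golden JSON, field by field."""
    def norm(v):
        return " ".join(str(v).split()).lower()

    exact = [k for k in golden if k in predicted and predicted[k] == golden[k]]
    fuzzy = [k for k in golden
             if k in predicted and predicted[k] != golden[k]
             and norm(predicted[k]) == norm(golden[k])]
    missing = [k for k in golden if k not in predicted]
    extra = [k for k in predicted if k not in golden]   # hallucinated fields
    score = (len(exact) + len(fuzzy)) / len(golden) if golden else 1.0
    return {"exact": exact, "fuzzy": fuzzy, "missing": missing,
            "extra": extra, "score": round(score, 3)}

golden = {"total": "119.50", "currency": "EUR", "date": "2024-01-05"}
pred = {"total": "119.50", "currency": " eur ", "vendor": "ACME"}
print(score_extraction(pred, golden))
# exact=['total'], fuzzy=['currency'], missing=['date'], extra=['vendor'], score=0.667
```

Both exact and fuzzy matches count toward the score, while hallucinated extras are surfaced separately rather than silently ignored.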

Accuracy Comparison (sample benchmark)

GPT-4o
97.1%
Claude Sonnet 4
95.4%
Gemini Flash
91.2%
Mistral Large
88.5%

Cross-LLM comparison

  • Side-by-side JSON diff between any two extractions
  • Provider × document-type accuracy heatmap
  • Speed vs. cost scatter plots
05

Benchmarks as Code

DBL is a declarative scripting language built into ARENA. Define what you want to benchmark, not how to execute it.

Reproducible

Same script, same benchmark, every time

Version-controlled

Track changes in git alongside your code

Shareable

Share benchmark configurations with your team

Composable

Include snippets, use templates, build libraries
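DBL's actual syntax is not documented here, so purely as an illustration of "declare what, not how", a benchmark script in that spirit might read (hypothetical syntax, hypothetical names — not real DBL):

```
# Hypothetical sketch only; not real DBL syntax.
benchmark "invoices-q1" {
  dataset   "invoices-2024"
  schema    "invoice-v2"
  providers [openai/gpt-4o, anthropic/claude-sonnet-4]
  mode      schema-bounded
  repeat    5
  parallel  true
}
```

The essence is that the script names the dataset, schema, providers, and run parameters, and ARENA works out the execution plan.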

06

Dashboards That Answer Questions

Filter by provider, model, extraction mode, document type, or date range. Spot trends. Find regressions. Export data.

Accuracy

Per-provider, per-model accuracy over time

Speed

Extraction duration distribution, outlier detection

Cost

Cost per extraction by provider, projected monthly spend

Trends

Track how providers improve or regress across model versions

07

Your Documents. As They Are.

📄 PDF · 📝 DOCX · 📊 XLSX · 📊 PPTX · 🖼️ PNG · 🖼️ JPG · 🖼️ TIFF

Group documents into datasets for batch testing. Tag and filter by type, source, or custom metadata. Upload via UI or API.

08

Built for Teams

Role-based access. Invite team members, share datasets, collaborate on DBL scripts.

Owner

Full access, billing management

Admin

Manage members, run benchmarks

Benchmarker

Run benchmarks, view results

Reporter

View-only access to results and analytics