Everything You Need to Benchmark Document Extraction

Upload documents. Define schemas. Run extractions across 7 LLM providers. Compare accuracy, speed, and cost. All in one platform.

01

Three Ways to Extract

🌀

Free-Wheel Extraction

The LLM receives your document with no constraints. It suggests a category, proposes a schema, and extracts data. Useful for exploration: find out what the model "sees" before you impose structure.

  • No schema required
  • LLM suggests document type and structure
  • Great for initial dataset exploration
📝

Schema-Bounded Extraction

You provide one or more JSON schemas. The LLM must return data conforming to your structure. The real-world mode: when you know what you want and need the LLM to comply.

  • Send one or multiple schemas per extraction
  • LLM output validated against schema
  • Production-realistic testing
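To make the idea concrete, schema-bounded extraction boils down to rejecting any model output that does not conform to your schema. A minimal sketch in Python — the invoice schema, field names, and the hand-rolled validator are all invented for illustration, not ARENA's internal validation:

```python
import json

# A hypothetical invoice schema in JSON Schema style (illustrative only).
INVOICE_SCHEMA = {
    "type": "object",
    "required": ["invoice_number", "total", "currency"],
    "properties": {
        "invoice_number": {"type": "string"},
        "total": {"type": "number"},
        "currency": {"type": "string"},
    },
}

def validate(output, schema):
    """Return a list of violations (empty list means the output conforms)."""
    errors = []
    type_map = {"string": str, "number": (int, float), "object": dict}
    for field in schema.get("required", []):
        if field not in output:
            errors.append(f"missing required field: {field}")
    for field, spec in schema.get("properties", {}).items():
        if field in output and not isinstance(output[field], type_map[spec["type"]]):
            errors.append(f"wrong type for {field}: expected {spec['type']}")
    return errors

llm_output = json.loads('{"invoice_number": "INV-42", "total": 119.5, "currency": "EUR"}')
print(validate(llm_output, INVOICE_SCHEMA))  # []
```

A production validator would use a full JSON Schema library; the point is that the schema, not the model, defines what counts as a valid extraction.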
🔄

Two-Pass Extraction

Pass 1: a cheap, fast model categorises the document. Pass 2: a more capable model extracts using the correct schema. Mix providers across passes.

  • Separate models per pass
  • Cross-provider pass configuration
  • Optimise cost without sacrificing accuracy
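The two-pass flow is simple to picture in code. In this sketch both model calls are stubs, and the function names and schema registry are invented for the example — in ARENA each pass can be routed to a different provider:

```python
# Hypothetical schema registry keyed by document type.
SCHEMAS = {
    "invoice": {"required": ["invoice_number", "total"]},
    "receipt": {"required": ["merchant", "amount"]},
}

def categorise(document):
    """Pass 1: a cheap, fast model guesses the document type (stubbed here)."""
    return "invoice" if "Invoice" in document else "receipt"

def extract(document, schema):
    """Pass 2: a more capable model extracts against the chosen schema (stubbed)."""
    return {field: f"<{field}>" for field in schema["required"]}

def two_pass(document):
    doc_type = categorise(document)              # pass 1: classification
    return extract(document, SCHEMAS[doc_type])  # pass 2: schema-bounded extraction

print(two_pass("Invoice #42 ..."))
```

The cost argument: classification is a short prompt a small model handles well, so the expensive model only ever runs once, with the right schema already selected.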
02

7 Providers. Dozens of Models. Text and Vision.

ARENA abstracts away provider-specific API differences. Every extraction goes through a unified interface: same input, same output format, regardless of provider.

Provider        Text  Vision  Models
OpenAI          ✓     ✓       GPT-4o, GPT-4-turbo
Anthropic       ✓     ✓       Claude Sonnet 4, Opus 4
Google          ✓     ✓       Gemini 2.0 Flash, Pro
Azure OpenAI    ✓     ✓       GPT-4o (Azure-hosted)
Mistral         ✓     ✗       Large, Medium
Cohere          ✓     ✗       Command R+, Command R
AWS Bedrock     ✓     ✓       Claude (Bedrock), Titan

  • Text mode: document text is extracted and sent to the LLM as text.
  • Vision mode: pages are rendered as images and sent to the LLM's vision endpoint.
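The difference between the two modes is just the shape of the payload sent to the provider. A hedged sketch — the function and the payload fields are invented for illustration, since real provider payloads differ (which is exactly what the unified interface hides):

```python
import base64

def build_payload(mode, text=None, page_images=None):
    """Shape the request by extraction mode (illustrative, not a real provider API)."""
    if mode == "text":
        # Text mode: extracted document text goes in as-is.
        return {"type": "text", "content": text}
    if mode == "vision":
        # Vision mode: each rendered page image is base64-encoded.
        return {
            "type": "image",
            "pages": [base64.b64encode(img).decode("ascii") for img in page_images],
        }
    raise ValueError(f"unknown mode: {mode}")

print(build_payload("text", text="Invoice #42"))
```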

03

Precision Timing. No Noise.

ARENA's clock starts when the extraction request is deserialised and stops when the final JSON is parsed. You're measuring the LLM, not the infrastructure.

01

Millisecond precision

Extraction time measured without I/O overhead

02

Batch execution

Run hundreds of documents through multiple providers in one operation

03

Parallel & sequential modes

Control concurrency per benchmark

04

Repetition

Run the same extraction N times to measure variance

05

Rate limiting

Built-in per-provider throttling to respect API limits
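The timing discipline above — a narrow measurement window plus N repetitions to expose variance — can be sketched in a few lines. The `call_llm` callable stands in for the provider call; everything outside the window (upload, persistence, rate limiting) is deliberately excluded, mirroring the request-deserialised-to-JSON-parsed window described above:

```python
import statistics
import time

def timed_extraction(call_llm, payload, repetitions=5):
    """Time only the LLM call, repeated N times to measure variance (illustrative)."""
    durations_ms = []
    for _ in range(repetitions):
        start = time.perf_counter()    # high-resolution monotonic clock
        call_llm(payload)              # the only work inside the timing window
        durations_ms.append((time.perf_counter() - start) * 1000)
    return {
        "mean_ms": statistics.mean(durations_ms),
        "stdev_ms": statistics.stdev(durations_ms) if repetitions > 1 else 0.0,
    }
```

`time.perf_counter` is the right clock for this: monotonic and sub-millisecond, unaffected by wall-clock adjustments.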

04

Field-Level Accuracy. Not Just "It Worked."

Upload a golden JSON for each document. ARENA compares every LLM's output against ground truth at the field level.

  • Exact match: field value matches ground truth exactly
  • Fuzzy match: value is semantically equivalent (date formats, whitespace)
  • Missing fields: fields the LLM failed to extract
  • Extra fields: fields the LLM hallucinated
  • Score (0.0 → 1.0): per document, aggregated by provider, model, and document type
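A minimal sketch of this field-level comparison — here "fuzzy" is just whitespace and case normalisation, a stand-in for ARENA's richer semantic matching (date formats and the like):

```python
def score_extraction(predicted, golden):
    """Compare an LLM's output dict against a golden JSON, field by field."""
    def norm(v):
        return " ".join(str(v).split()).lower()

    exact = [k for k in golden if k in predicted and predicted[k] == golden[k]]
    fuzzy = [k for k in golden
             if k in predicted and predicted[k] != golden[k]
             and norm(predicted[k]) == norm(golden[k])]
    missing = [k for k in golden if k not in predicted]
    extra = [k for k in predicted if k not in golden]   # hallucinated fields
    score = (len(exact) + len(fuzzy)) / len(golden) if golden else 1.0
    return {"exact": exact, "fuzzy": fuzzy, "missing": missing,
            "extra": extra, "score": round(score, 3)}

golden = {"total": "119.50", "currency": "EUR", "date": "2024-01-05"}
pred = {"total": "119.50", "currency": " eur ", "vendor": "ACME"}
print(score_extraction(pred, golden))
# exact=['total'], fuzzy=['currency'], missing=['date'], extra=['vendor'], score=0.667
```

Both exact and fuzzy matches count toward the score, while hallucinated extras are surfaced separately rather than silently ignored.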

Accuracy Comparison (sample benchmark)

GPT-4o
97.1%
Claude Sonnet 4
95.4%
Gemini Flash
91.2%
Mistral Large
88.5%

Cross-LLM comparison

  • Side-by-side JSON diff between any two extractions
  • Provider × document-type accuracy heatmap
  • Speed vs. cost scatter plots
05

Benchmarks as Code

DBL is a declarative scripting language built into ARENA. Define what you want to benchmark, not how to execute it.

Reproducible

Same script, same benchmark, every time

Version-controlled

Track changes in git alongside your code

Shareable

Share benchmark configurations with your team

Composable

Include snippets, use templates, build libraries
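DBL's actual syntax is not documented here, so purely as an illustration of "declare what, not how", a benchmark script in that spirit might read (hypothetical syntax, hypothetical names — not real DBL):

```
# Hypothetical sketch only; not real DBL syntax.
benchmark "invoices-q1" {
  dataset   "invoices-2024"
  schema    "invoice-v2"
  providers [openai/gpt-4o, anthropic/claude-sonnet-4]
  mode      schema-bounded
  repeat    5
  parallel  true
}
```

The essence is that the script names the dataset, schema, providers, and run parameters, and ARENA works out the execution plan.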

06

Dashboards That Answer Questions

Filter by provider, model, extraction mode, document type, or date range. Spot trends. Find regressions. Export data.

Accuracy

Per-provider, per-model accuracy over time

Speed

Extraction duration distribution, outlier detection

Cost

Cost per extraction by provider, projected monthly spend

Trends

Track how providers improve or regress across model versions

07

Your Documents. As They Are.

📄 PDF · 📝 DOCX · 📊 XLSX · 📊 PPTX · 🖼️ PNG · 🖼️ JPG · 🖼️ TIFF

Group documents into datasets for batch testing. Tag and filter by type, source, or custom metadata. Upload via UI or API.

08

Built for Teams

Role-based access. Invite team members, share datasets, collaborate on DBL scripts.

Owner

Full access, billing management

Admin

Manage members, run benchmarks

Benchmarker

Run benchmarks, view results

Reporter

View-only access to results and analytics