Everything You Need to Benchmark Document Extraction
Upload documents. Define schemas. Run extractions across 7 LLM providers. Compare accuracy, speed, and cost. All in one platform.
Three Ways to Extract
Free-Wheel Extraction
The LLM receives your document with no constraints. It suggests a category, proposes a schema, and extracts data. Useful for exploration: find out what the model "sees" before you impose structure.
- No schema required
- LLM suggests document type and structure
- Great for initial dataset exploration
Schema-Bounded Extraction
You provide one or more JSON schemas. The LLM must return data conforming to your structure. The real-world mode, for when you know what you want and need the LLM to comply.
- Send one or multiple schemas per extraction
- LLM output validated against schema
- Production-realistic testing
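Schema-bounded extraction stands or falls on validating the model's JSON against your schema. A minimal sketch in Python (the invoice schema and field names here are illustrative, and the hand-rolled check covers only required keys and primitive types; a production setup would use a full JSON Schema validator such as the `jsonschema` package):

```python
import json

# Illustrative invoice schema in JSON Schema style; real schemas can
# be arbitrarily nested.
INVOICE_SCHEMA = {
    "type": "object",
    "required": ["invoice_number", "total", "currency"],
    "properties": {
        "invoice_number": {"type": "string"},
        "total": {"type": "number"},
        "currency": {"type": "string"},
    },
}

TYPE_MAP = {"string": str, "number": (int, float), "object": dict}

def conforms(output: dict, schema: dict) -> bool:
    """Check required keys and primitive field types against the schema."""
    for key in schema.get("required", []):
        if key not in output:
            return False
    for key, rule in schema.get("properties", {}).items():
        if key in output and not isinstance(output[key], TYPE_MAP[rule["type"]]):
            return False
    return True

llm_output = json.loads('{"invoice_number": "INV-001", "total": 149.5, "currency": "EUR"}')
print(conforms(llm_output, INVOICE_SCHEMA))  # True
```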
Two-Pass Extraction
Pass 1: a cheap, fast model categorises the document. Pass 2: a more capable model extracts using the correct schema. Mix providers across passes.
- Separate models per pass
- Cross-provider pass configuration
- Optimise cost without sacrificing accuracy
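The control flow of a two-pass run can be sketched as below. Everything here is hypothetical scaffolding, not ARENA's internals: `call_llm` stands in for whichever provider SDK each pass is configured to use, and the categories and schemas are made up.

```python
# Schemas keyed by document category (illustrative).
SCHEMAS = {
    "invoice": {"required": ["invoice_number", "total"]},
    "receipt": {"required": ["merchant", "amount"]},
}

def call_llm(model: str, prompt: str) -> str:
    # Placeholder for a real provider call; returns canned answers so
    # the sketch is runnable.
    return "invoice" if "Classify" in prompt else '{"invoice_number": "INV-7", "total": 99.0}'

def two_pass_extract(document_text: str) -> str:
    # Pass 1: a cheap model picks the category.
    category = call_llm("cheap-fast-model", f"Classify this document: {document_text}")
    schema = SCHEMAS[category]
    # Pass 2: a more capable model extracts against the matching schema.
    return call_llm("capable-model", f"Extract {schema['required']} from: {document_text}")

print(two_pass_extract("ACME Corp invoice ..."))
```

The point of the split is that categorisation is a short, easy prompt, so it can run on the cheapest model available, while the expensive model only ever sees the schema it actually needs.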
7 Providers. Dozens of Models. Text and Vision.
ARENA abstracts away provider-specific API differences. Every extraction goes through a unified interface: same input, same output format, regardless of provider.
Text mode: Document text is extracted and sent to the LLM as text. Vision mode: Pages are rendered as images and sent to the LLM's vision endpoint.
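One way to picture the unified interface is as a small protocol every provider adapter implements, with one entry point per mode. The names below are illustrative, not ARENA's actual API, and `FakeProvider` is a stub that exists only to show the shape:

```python
from dataclasses import dataclass
from typing import Protocol

@dataclass
class ExtractionResult:
    provider: str
    data: dict
    duration_ms: float

class ExtractionProvider(Protocol):
    """Every adapter exposes the same two entry points."""
    def extract_text(self, text: str, schema: dict) -> ExtractionResult: ...
    def extract_vision(self, page_images: list, schema: dict) -> ExtractionResult: ...

class FakeProvider:
    # Stub adapter: text mode takes extracted document text,
    # vision mode takes rendered page images.
    def extract_text(self, text, schema):
        return ExtractionResult("fake", {"snippet": text[:10]}, 12.0)
    def extract_vision(self, page_images, schema):
        return ExtractionResult("fake", {"pages": len(page_images)}, 30.0)

result = FakeProvider().extract_text("Invoice #42 ...", {})
print(result.provider, result.data)
```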
Precision Timing. No Noise.
ARENA's clock starts when the extraction request is deserialised and stops when the final JSON is parsed. You're measuring the LLM, not the infrastructure.
Millisecond precision
Extraction time measured without I/O overhead
Batch execution
Run hundreds of documents through multiple providers in one operation
Parallel & sequential modes
Control concurrency per benchmark
Repetition
Run the same extraction N times to measure variance
Rate limiting
Built-in per-provider throttling to respect API limits
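Repetition exists to measure variance, which boils down to simple summary statistics over the per-run timings. A sketch with simulated numbers (in practice each value would come from one repetition of the same extraction):

```python
import statistics

# Simulated per-run durations in milliseconds for N = 5 repetitions.
durations_ms = [812.4, 799.1, 845.0, 803.7, 820.9]

mean = statistics.mean(durations_ms)
stdev = statistics.stdev(durations_ms)
print(f"mean={mean:.1f} ms, stdev={stdev:.1f} ms")
```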
Field-Level Accuracy. Not Just "It Worked."
Upload a golden JSON for each document. ARENA compares every LLM's output against ground truth at the field level.
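A field-level comparison walks the golden JSON and scores each leaf against the extraction. A minimal sketch, assuming a simple exact-match rule per field (the invoice fields are illustrative; a real comparison would also handle arrays and fuzzy matches):

```python
def field_accuracy(golden: dict, extracted: dict) -> tuple:
    """Return (correct, total) leaf-field counts, recursing into nested objects."""
    correct = total = 0
    for key, expected in golden.items():
        actual = extracted.get(key) if isinstance(extracted, dict) else None
        if isinstance(expected, dict):
            c, t = field_accuracy(expected, actual or {})
            correct, total = correct + c, total + t
        else:
            total += 1
            correct += int(actual == expected)
    return correct, total

golden = {"invoice_number": "INV-1", "vendor": {"name": "ACME", "vat": "DE123"}}
output = {"invoice_number": "INV-1", "vendor": {"name": "ACME", "vat": None}}
c, t = field_accuracy(golden, output)
print(f"{c}/{t} fields correct")  # 2/3 fields correct
```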
Accuracy Comparison (sample benchmark)
Cross-LLM comparison
- Side-by-side JSON diff between any two extractions
- Provider × document-type accuracy heatmap
- Speed vs. cost scatter plots
Benchmarks as Code
DBL is a declarative scripting language built into ARENA. Define what you want to benchmark, not how to execute it.
Reproducible
Same script, same benchmark, every time
Version-controlled
Track changes in git alongside your code
Shareable
Share benchmark configurations with your team
Composable
Include snippets, use templates, build libraries
Dashboards That Answer Questions
Filter by provider, model, extraction mode, document type, or date range. Spot trends. Find regressions. Export data.
Accuracy
Per-provider, per-model accuracy over time
Speed
Extraction duration distribution, outlier detection
Cost
Cost per extraction by provider, projected monthly spend
Trends
Track how providers improve or regress across model versions
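The projected-spend number on the cost dashboard is simple arithmetic over benchmark averages: mean cost per extraction times expected monthly volume. A back-of-envelope sketch with made-up figures:

```python
# Average cost per extraction from a benchmark run (figures invented
# for illustration), projected to a monthly volume.
avg_cost_per_extraction = {"provider_a": 0.0042, "provider_b": 0.0111}
extractions_per_month = 50_000

for provider, cost in avg_cost_per_extraction.items():
    print(f"{provider}: ${cost * extractions_per_month:,.2f}/month")
```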
Your Documents. As They Are.
Group documents into datasets for batch testing. Tag and filter by type, source, or custom metadata. Upload via UI or API.
Built for Teams
Role-based access. Invite team members, share datasets, collaborate on DBL scripts.
Owner
Full access, billing management
Admin
Manage members, run benchmarks
Benchmarker
Run benchmarks, view results
Reporter
View-only access to results and analytics