From Upload to Decision in Three Steps
No framework to build. No scripts to maintain. Upload your documents, configure your benchmark, and let ARENA run the numbers.
Upload Your Documents
Drag and drop your PDFs, Office documents, or images. Organise them into datasets — group invoices together, contracts together, identity documents together. Each dataset becomes a test suite you can benchmark repeatedly.
- Supported: PDF, DOCX, XLSX, PPTX, PNG, JPG, TIFF
- Create datasets by document type, source, or any grouping
- Optionally upload ground truth JSON for accuracy scoring
- Define JSON schemas for schema-bounded extraction, or let the LLM free-wheel
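To make the last two bullets concrete, here is a minimal sketch of a schema and a matching ground-truth record for an invoice dataset. The field names, the values, and the two helper functions are illustrative assumptions, not ARENA's actual upload format. The schema uses standard JSON Schema keywords (`type`, `properties`, `required`).

```python
import json

# Hypothetical schema for an invoice dataset; field names are illustrative only.
invoice_schema = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "issue_date": {"type": "string"},
        "total": {"type": "number"},
    },
    "required": ["invoice_number", "total"],
}

# Hypothetical ground-truth record for one document in the dataset.
ground_truth = {
    "invoice_number": "INV-0042",
    "issue_date": "2024-03-01",
    "total": 1250.0,
}


def missing_required(schema, record):
    """Return required schema fields that the record does not contain."""
    return [key for key in schema.get("required", []) if key not in record]


def unknown_fields(schema, record):
    """Return record fields that the schema does not declare."""
    return [key for key in record if key not in schema.get("properties", {})]


# Serialise the record as it might be stored alongside the document.
print(json.dumps(ground_truth, indent=2))
print("missing:", missing_required(invoice_schema, ground_truth))
print("unknown:", unknown_fields(invoice_schema, ground_truth))
```

Checking ground truth against the schema before uploading is a cheap way to catch misspelt field names that would otherwise show up later as accuracy misses.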
Pick Your Providers. Run the Benchmark.
Select which LLM providers and models to test. Choose your extraction mode: free-wheel, schema-bounded, or two-pass. Set concurrency, repetition count, and any provider-specific parameters. Hit run.
Or write a DBL script and let the language handle it:
documents contracts {
    dataset "legal-contracts"
}

engine gpt4o_vision {
    provider "openai"
    model "gpt-4o"
    mode vision
}

engine claude_text {
    provider "anthropic"
    model "claude-sonnet-4-20250514"
    mode text
}

engine gemini_flash {
    provider "google"
    model "gemini-2.0-flash"
    mode text
}

run benchmark {
    documents contracts
    engines [gpt4o_vision, claude_text, gemini_flash]
    repeat 5
    parallel true
}
}
- UI-based configuration for quick tests
- DBL scripts for reproducible, complex benchmarks
- Batch execution: test an entire dataset across all providers in one run
- Real-time progress monitoring
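As a rough way to size a batch run, the sketch below assumes that a run covers every document with every engine, repeated the configured number of times. That expansion rule, the `plan_jobs` helper, and the file names are assumptions for illustration; how ARENA actually schedules work is not described here.

```python
from itertools import product

# Hypothetical inputs: two documents in the dataset, three engines, five repeats.
documents = ["contract_001.pdf", "contract_002.pdf"]
engines = ["gpt4o_vision", "claude_text", "gemini_flash"]
repeat = 5


def plan_jobs(documents, engines, repeat):
    """Expand a run into one (document, engine, run_number) job per extraction."""
    return [
        (doc, engine, run_number)
        for doc, engine, run_number in product(documents, engines, range(1, repeat + 1))
    ]


jobs = plan_jobs(documents, engines, repeat)
print(len(jobs))  # 2 documents x 3 engines x 5 repeats = 30
```

Under this assumption the number of extractions grows multiplicatively, so it is worth estimating it before raising the repetition count on a large dataset.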
Get the Numbers. Make the Call.
ARENA scores every extraction against your ground truth at the field level. See which provider got the invoice number right, which one hallucinated a date, and which one missed the line items entirely.
- Accuracy scores: 0.0–1.0 per document, per provider, per field
- Timing data: extraction duration in milliseconds, excluding I/O overhead
- Cost breakdown: estimated USD per extraction by provider and model
- JSON diff: side-by-side comparison of any two extractions
- Visualisations: bar charts, heatmaps, scatter plots, trend lines
- Exportable reports: download data for internal review or stakeholder presentations
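Field-level scoring can be pictured with a short sketch. It assumes the simplest possible rule, exact match per field averaged into a document score. ARENA's real scoring method may differ (for example, partial credit or fuzzy matching), and the function names and sample values are hypothetical.

```python
def field_scores(truth, extracted):
    """Score each ground-truth field 1.0 on exact match, otherwise 0.0."""
    return {
        field: 1.0 if field in extracted and extracted[field] == expected else 0.0
        for field, expected in truth.items()
    }


def document_score(truth, extracted):
    """Average the per-field scores into a single 0.0-1.0 document score."""
    scores = field_scores(truth, extracted)
    return sum(scores.values()) / len(scores) if scores else 0.0


# Hypothetical ground truth and one provider's extraction of the same invoice.
truth = {
    "invoice_number": "INV-0042",
    "issue_date": "2024-03-01",
    "total": 1250.0,
    "currency": "EUR",
}
extracted = {
    "invoice_number": "INV-0042",  # correct
    "issue_date": "2024-01-03",    # wrong date
    "total": 1250.0,               # correct
    # "currency" is missing from the extraction entirely
}

print(field_scores(truth, extracted))
print(document_score(truth, extracted))  # 2 of 4 fields match: 0.5
```

Keeping the per-field scores alongside the document score is what makes it possible to tell a wrong date apart from a missing field.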
Data-Driven Provider Decisions
After running your benchmark, you'll know:
- Which provider extracts your specific document types most accurately
- Whether vision mode or text mode performs better for your documents
- The cost-accuracy trade-off — when a cheaper model is "good enough"
- How two-pass extraction compares to single-pass on your data
- Whether results are consistent (variance across repeated runs)
No more guessing. No more "it felt right." You have the numbers.
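If you export the results, the consistency and cost questions above reduce to a few lines of arithmetic. The sketch below uses made-up placeholder scores and costs for two unnamed providers, not real benchmark results, and the `summarise` helper and its output keys are assumptions for illustration.

```python
from statistics import mean, pstdev

# Placeholder data only: five repeated-run accuracy scores and an estimated
# cost per extraction for two hypothetical providers.
runs = {
    "provider_a": {"scores": [0.90, 0.92, 0.88, 0.90, 0.90], "cost_usd": 0.010},
    "provider_b": {"scores": [0.95, 0.70, 0.99, 0.80, 0.96], "cost_usd": 0.004},
}


def summarise(runs):
    """Compute mean accuracy, run-to-run spread, and cost relative to accuracy."""
    summary = {}
    for name, data in runs.items():
        average = mean(data["scores"])
        summary[name] = {
            "mean": average,
            # Population standard deviation across the repeated runs.
            "spread": pstdev(data["scores"]),
            # Cost divided by mean accuracy; lower is better.
            "usd_per_accuracy_point": (
                data["cost_usd"] / average if average else float("inf")
            ),
        }
    return summary


summary = summarise(runs)
for name, stats in summary.items():
    print(name, stats)
```

In this placeholder data the cheaper provider has a similar mean but a much larger spread, which is exactly the kind of trade-off a single run would hide.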
Ready to Benchmark?
Free tier. 100 credits. No credit card. Run a benchmark in under 10 minutes.