From Upload to Decision in Three Steps

No framework to build. No scripts to maintain. Upload your documents, configure your benchmark, and let ARENA run the numbers.

Documents → LLM Providers (OpenAI, Claude, Gemini) → Results
01

Upload Your Documents

Drag and drop your PDFs, Office documents, or images. Organise them into datasets — group invoices together, contracts together, identity documents together. Each dataset becomes a test suite you can benchmark repeatedly.

  • Supported: PDF, DOCX, XLSX, PPTX, PNG, JPG, TIFF
  • Create datasets by document type, source, or any grouping
  • Optionally upload ground truth JSON for accuracy scoring (sketched after this list)
  • Define JSON schemas for schema-bounded extraction, or let the LLM free-wheel
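
A ground truth file is simply the JSON you expect a perfect extraction to return; ARENA compares each provider's output against it field by field. A minimal sketch for an invoice dataset (the field names here are hypothetical — use whatever fields your schema defines):

invoice-0042.truth.json
{
  "invoice_number": "INV-2024-0042",
  "issue_date": "2024-03-15",
  "currency": "EUR",
  "total": 1249.50,
  "line_items": [
    { "description": "Consulting services", "quantity": 10, "amount": 1249.50 }
  ]
}

A JSON schema for schema-bounded extraction would describe the same fields with types instead of values.
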
02

Pick Your Providers. Run the Benchmark.

Select which LLM providers and models to test. Choose your extraction mode: free-wheel, schema-bounded, or two-pass. Set concurrency, repetition count, and any provider-specific parameters. Hit run.

Provider        Accuracy   Time   Cost
GPT-4o          96.3%      1.2s   $0.024
Claude Sonnet   94.1%      0.9s   $0.018
Gemini Flash    91.7%      0.6s   $0.008
Mistral Large   88.4%      1.4s   $0.014
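
Using the sample numbers above: at 10,000 documents a month, GPT-4o runs about $240 against Gemini Flash's $80. The question becomes whether 4.6 accuracy points are worth the extra $160.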

Or write a DBL script and let the language handle it:

benchmark.dbl
documents contracts {
  dataset "legal-contracts"
}

engine gpt4o_vision {
  provider "openai"
  model "gpt-4o"
  mode vision
}

engine claude_text {
  provider "anthropic"
  model "claude-sonnet-4-20250514"
  mode text
}

engine gemini_flash {
  provider "google"
  model "gemini-2.0-flash"
  mode text
}

run benchmark {
  documents contracts
  engines [gpt4o_vision, claude_text, gemini_flash]
  repeat 5
  parallel true
}
  • UI-based configuration for quick tests
  • DBL scripts for reproducible, complex benchmarks
  • Batch execution: test an entire dataset across all providers in one run
  • Real-time progress monitoring
03

Get the Numbers. Make the Call.

ARENA scores every extraction against your ground truth at the field level. See which provider got the invoice number right, which one hallucinated a date, and which one missed the line items entirely.
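
To make that concrete, here is a hypothetical field-level result for one document and one engine (the shape is illustrative, not ARENA's exact report format):

{
  "document": "invoice-0042.pdf",
  "engine": "gpt4o_vision",
  "fields": {
    "invoice_number": { "expected": "INV-2024-0042", "extracted": "INV-2024-0042", "score": 1.0 },
    "issue_date":     { "expected": "2024-03-15",    "extracted": "2024-05-13",    "score": 0.0 },
    "total":          { "expected": 1249.50,         "extracted": 1249.50,         "score": 1.0 }
  }
}

A hallucinated issue_date scores 0.0 while the rest of the extraction still earns credit, which is exactly the distinction a single headline accuracy number hides.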

  • GPT-4o Vision: 96.3%
  • Claude Sonnet: 94.1%
  • Gemini Flash: 88.7%

  • Accuracy scores: 0.0–1.0 per document, per provider, per field
  • Timing data: extraction duration in milliseconds, excluding I/O overhead
  • Cost breakdown: estimated USD per extraction by provider and model
  • JSON diff: side-by-side comparison of any two extractions (pictured after this list)
  • Visualisations: bar charts, heatmaps, scatter plots, trend lines
  • Exportable reports: download data for internal review or stakeholder presentations
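
To picture the JSON diff, imagine two engines' extractions of the same contract lined up field by field (hypothetical values, reusing the engine names from the script above):

{
  "party_name":     { "gpt4o_vision": "Acme GmbH",  "claude_text": "Acme GmbH" },
  "effective_date": { "gpt4o_vision": "2024-03-15", "claude_text": "2024-05-13" },
  "term_months":    { "gpt4o_vision": 24,           "claude_text": null }
}

Agreement on party_name, a date conflict, and a missed field, each visible at a glance.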

Data-Driven Provider Decisions

After running your benchmark, you'll know:

  • Which provider extracts your specific document types most accurately
  • Whether vision mode or text mode performs better for your documents
  • The cost-accuracy trade-off — when a cheaper model is "good enough"
  • How two-pass extraction compares to single-pass on your data
  • Whether results are consistent (variance across repeated runs)

No more guessing. No more "it felt right." You have the numbers.

Ready to Benchmark?

Free tier. 100 credits. No credit card. Run a benchmark in under 10 minutes.