From Upload to Decision in Three Steps

No framework to build. No scripts to maintain. Upload your documents, configure your benchmark, and let ARENA run the numbers.

Documents → LLM Providers (OpenAI, Claude, Gemini) → Results
01

Upload Your Documents

Drag and drop your PDFs, Office documents, or images. Organise them into datasets — group invoices together, contracts together, identity documents together. Each dataset becomes a test suite you can benchmark repeatedly.

  • Supported: PDF, DOCX, XLSX, PPTX, PNG, JPG, TIFF
  • Create datasets by document type, source, or any grouping
  • Optionally upload ground truth JSON for accuracy scoring (sketched after this list)
  • Define JSON schemas for schema-bounded extraction, or let the LLM free-wheel
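
A ground truth file is simply the JSON you expect a perfect extraction to return; ARENA compares each provider's output against it field by field. A minimal sketch for an invoice dataset (the field names here are hypothetical — use whatever fields your schema defines):

invoice-0042.truth.json
{
  "invoice_number": "INV-2024-0042",
  "issue_date": "2024-03-15",
  "currency": "EUR",
  "total": 1249.50,
  "line_items": [
    { "description": "Consulting services", "quantity": 10, "amount": 1249.50 }
  ]
}

A JSON schema for schema-bounded extraction would describe the same fields with types instead of values.
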
02

Pick Your Providers. Run the Benchmark.

Select which LLM providers and models to test. Choose your extraction mode: free-wheel, schema-bounded, or two-pass. Set concurrency, repetition count, and any provider-specific parameters. Hit run.

Provider        Accuracy   Time   Cost
GPT-4o          96.3%      1.2s   $0.024
Claude Sonnet   94.1%      0.9s   $0.018
Gemini Flash    91.7%      0.6s   $0.008
Mistral Large   88.4%      1.4s   $0.014
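
Using the sample numbers above: at 10,000 documents a month, GPT-4o runs about $240 against Gemini Flash's $80. The question becomes whether 4.6 accuracy points are worth the extra $160.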

Or write a DBL script and let the language handle it:

benchmark.dbl
documents contracts {
  dataset "legal-contracts"
}

engine gpt4o_vision {
  provider "openai"
  model "gpt-4o"
  mode vision
}

engine claude_text {
  provider "anthropic"
  model "claude-sonnet-4-20250514"
  mode text
}

engine gemini_flash {
  provider "google"
  model "gemini-2.0-flash"
  mode text
}

run benchmark {
  documents contracts
  engines [gpt4o_vision, claude_text, gemini_flash]
  repeat 5
  parallel true
}
  • UI-based configuration for quick tests
  • DBL scripts for reproducible, complex benchmarks
  • Batch execution: test an entire dataset across all providers in one run
  • Real-time progress monitoring
03

Get the Numbers. Make the Call.

ARENA scores every extraction against your ground truth at the field level. See which provider got the invoice number right, which one hallucinated a date, and which one missed the line items entirely.
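
To make that concrete, here is a hypothetical field-level result for one document and one engine (the shape is illustrative, not ARENA's exact report format):

{
  "document": "invoice-0042.pdf",
  "engine": "gpt4o_vision",
  "fields": {
    "invoice_number": { "expected": "INV-2024-0042", "extracted": "INV-2024-0042", "score": 1.0 },
    "issue_date":     { "expected": "2024-03-15",    "extracted": "2024-05-13",    "score": 0.0 },
    "total":          { "expected": 1249.50,         "extracted": 1249.50,         "score": 1.0 }
  }
}

A hallucinated issue_date scores 0.0 while the rest of the extraction still earns credit, which is exactly the distinction a single headline accuracy number hides.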

  • GPT-4o Vision: 96.3%
  • Claude Sonnet: 94.1%
  • Gemini Flash: 88.7%

  • Accuracy scores: 0.0–1.0 per document, per provider, per field
  • Timing data: extraction duration in milliseconds, excluding I/O overhead
  • Cost breakdown: estimated USD per extraction by provider and model
  • JSON diff: side-by-side comparison of any two extractions (pictured after this list)
  • Visualisations: bar charts, heatmaps, scatter plots, trend lines
  • Exportable reports: download data for internal review or stakeholder presentations
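
To picture the JSON diff, imagine two engines' extractions of the same contract lined up field by field (hypothetical values, reusing the engine names from the script above):

{
  "party_name":     { "gpt4o_vision": "Acme GmbH",  "claude_text": "Acme GmbH" },
  "effective_date": { "gpt4o_vision": "2024-03-15", "claude_text": "2024-05-13" },
  "term_months":    { "gpt4o_vision": 24,           "claude_text": null }
}

Agreement on party_name, a date conflict, and a missed field, each visible at a glance.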

Data-Driven Provider Decisions

After running your benchmark, you'll know:

  • Which provider extracts your specific document types most accurately
  • Whether vision mode or text mode performs better for your documents
  • The cost-accuracy trade-off — when a cheaper model is "good enough"
  • How two-pass extraction compares to single-pass on your data
  • Whether results are consistent (variance across repeated runs)

No more guessing. No more "it felt right." You have the numbers.

Ready to Benchmark?

Free tier. 100 credits. No credit card. Run a benchmark in under 10 minutes.