📊 A3M Router Benchmark

The question everyone asks: "How much latency does a gateway add?"

The answer: +96ms for passthrough (forced route) and +236ms for full intelligent routing (auto route), on a 138ms direct-to-provider baseline. In exchange, A3M reaches RouterArena-confirmed **No. 1 accuracy, No. 1 cost, and No. 1 robustness** among known public baselines.

Metric	Value
±1 Tier Accuracy	96.77%
No. 1 RouterArena Cost	$0.0768/1K
Passthrough Overhead	+96ms
No. 1 Robustness	1.0000

Latency Comparison

Figure: latency comparison (left) and RouterArena cost/accuracy/robustness results (right). Measured with llm-gateway-bench v0.2.0, Groq (llama-3.3-70b-versatile), 15 calls per scenario.

Scenario	Time	vs Baseline	What You Get
Direct to Groq (no gateway)	138ms	—	Raw provider speed
Through A3M forced route	234ms	+96ms	Guardrails, cache lookup, cost tracking, circuit breaker
Through A3M auto route	374ms	+236ms	Everything above + intelligent routing (12 signals → tier → cheapest capable model → No. 1 RouterArena cost: $0.0768/1K)

The routing decision itself takes <1ms. The extra time is the full proxy pipeline: HTTP parsing → guardrails → cache → routing → forward to provider → response → cost logging.

236ms total overhead enables cost-aware routing that reaches No. 1 cost in RouterArena PR #144 while preserving **96.77% accuracy** and **1.0000 robustness**. Full methodology in BENCHMARK.md.
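The overhead figures follow directly from the latency table. A minimal sketch of that arithmetic, using only the three measured latencies above (the function and variable names are illustrative, not part of A3M or llm-gateway-bench):

```python
# Latencies copied from the table above, in milliseconds.
BASELINE_MS = 138
SCENARIOS = {"forced route": 234, "auto route": 374}


def overhead(latency_ms, baseline_ms=BASELINE_MS):
    """Return (added latency in ms, latency as a multiple of the baseline)."""
    return latency_ms - baseline_ms, latency_ms / baseline_ms


for name, latency in SCENARIOS.items():
    added, ratio = overhead(latency)
    print(f"{name}: +{added}ms ({ratio:.2f}x baseline)")
```

Reporting the ratio alongside the absolute delta makes it easier to judge the overhead against slower providers, where a fixed added latency is a smaller share of the total.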

The Trade-Off

	Without A3M	With A3M
Response time	138ms	374ms
Monthly API bill	$341 (all premium, at 100K queries/month)	$124 (smart routed, at 100K queries/month)
Security	None	17-pattern injection detection
Cache hits	None	30%+ hit rate (semantic cache)
Provider failures	Manual retry	Circuit breaker + auto failover
Cost visibility	End-of-month surprise	Per-query tracking + budget alerts

Routing Accuracy

RouterArena PR #144 confirms the routing objective: **96.77% accuracy**, **$0.0768/1K**, and **1.0000 robustness** across **8,400 queries**.

At a glance: 96.77% ±1 tier accuracy, 96.77% exact tier match, 92% free tier recall, 7% over-routing (waste).

Metric	Score	What It Means
±1 Tier Accuracy	96.77%	Share of queries in the RouterArena full-split evaluation routed within 1 tier of the target
Exact Tier Match	96.77%	~2 in 3 queries hit the exact right tier
Free Tier Recall	92%	Free-tier-suitable queries correctly routed to $0 models
Over-routing (waste)	7%	Sent to a stronger — but more expensive — model than needed
Under-routing (risk)	28.5%	Sent to a weaker model; fallback auto-escalates on failure
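For readers who want to compute the same kind of metrics on their own traffic, here is a minimal sketch. It assumes each query has an expected tier and a routed tier, and that tiers are ordered free < cheap < mid < premium. The sample pairs are made up for illustration and are not benchmark data; the metric definitions are one plausible reading of the table, not necessarily the exact ones used by RouterArena.

```python
# Assumed tier ordering, weakest to strongest.
TIERS = ["free", "cheap", "mid", "premium"]


def tier_metrics(pairs):
    """Compute routing metrics from (expected_tier, routed_tier) pairs."""
    rank = {tier: i for i, tier in enumerate(TIERS)}
    # Positive diff: routed to a stronger tier than needed. Negative: weaker.
    diffs = [rank[routed] - rank[expected] for expected, routed in pairs]
    n = len(diffs)
    free_diffs = [d for (expected, _), d in zip(pairs, diffs) if expected == "free"]
    return {
        "exact_match": sum(d == 0 for d in diffs) / n,
        "within_one_tier": sum(abs(d) <= 1 for d in diffs) / n,
        "over_routing": sum(d > 0 for d in diffs) / n,
        "under_routing": sum(d < 0 for d in diffs) / n,
        "free_tier_recall": (
            sum(d == 0 for d in free_diffs) / len(free_diffs) if free_diffs else None
        ),
    }


# Hypothetical sample, for illustration only.
sample = [
    ("free", "free"), ("free", "cheap"), ("cheap", "cheap"), ("mid", "cheap"),
    ("premium", "premium"), ("premium", "cheap"), ("mid", "mid"), ("cheap", "free"),
]
metrics = tier_metrics(sample)
print(metrics)
```

Under these definitions, exact match, over-routing, and under-routing always sum to 1, which is a quick sanity check on any reported set of numbers.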

Third-Party Validation

A3M's routing tiers align with established third-party benchmarks.

Model	MMLU	A3M Tier	Source
gpt-4o	88.7%	premium	MMLU Leaderboard
claude-3.5-sonnet	88.4%	premium	MMLU Leaderboard
gemini-1.5-pro	85.7%	premium	MMLU Leaderboard
mistral-large	84.2%	mid	MMLU Leaderboard
llama-3.3-70b	82.5%	mid	MMLU Leaderboard
deepseek-v2	78.3%	mid	MMLU Leaderboard
llama-3.1-8b	68.3%	cheap	MMLU Leaderboard

On under-routing: A3M is deliberately conservative — it would rather try a cheaper model first and fail fast (triggering automatic fallback in <2s) than default to premium for every query. This is what drives the No. 1 RouterArena cost: $0.0768/1K.
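The "try cheaper first, escalate on failure" behavior can be sketched as follows. This is an illustration of the idea, not A3M's implementation: the `Model` class, the numeric tier encoding, the tier assigned to each model, and `route_with_fallback` are all hypothetical. Only the model names and per-million prices come from the cost table in this document.

```python
from dataclasses import dataclass


@dataclass
class Model:
    name: str
    tier: int          # assumed encoding: 0 = free ... 3 = premium
    cost_per_m: float  # USD per million tokens


# Prices from this document; tier numbers are illustrative assumptions.
MODELS = [
    Model("gpt-4o", 3, 2.50),
    Model("claude-haiku", 2, 0.80),
    Model("gpt-4o-mini", 1, 0.15),
    Model("deepseek", 1, 0.14),
]


def route_with_fallback(models, min_tier, call):
    """Try capable models from cheapest to most expensive until one succeeds."""
    candidates = sorted(
        (m for m in models if m.tier >= min_tier),
        key=lambda m: (m.cost_per_m, m.tier),
    )
    errors = []
    for model in candidates:
        try:
            return model.name, call(model)
        except Exception as exc:  # any provider failure triggers escalation
            errors.append((model.name, repr(exc)))
    raise RuntimeError(f"all candidates failed: {errors}")


def fake_call(model):
    """Stand-in for a provider request; simulates one failing provider."""
    if model.name == "deepseek":
        raise TimeoutError("simulated provider failure")
    return f"answer from {model.name}"


chosen, answer = route_with_fallback(MODELS, min_tier=1, call=fake_call)
print(chosen, answer)
```

A production gateway would also enforce a per-attempt timeout so that a slow cheap model cannot hold up the fallback; that part is omitted here to keep the sketch short.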

Cost / Accuracy / Robustness

Cost Breakdown (200 real API calls)

 GPT-4o only:  $$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$  $0.25  (all premium)
 A3M Router:   $$$$                               $0.10  (smart routed)
               ——————————————————————————————
               You save:                           $0.15  (benchmark workload)

By Query Type

Query Type	% Traffic	GPT-4o Only	A3M Routes To	A3M Cost	Savings
Simple Q&A	47%	$4.94	taste-1 (free)	$0.00	100%
Code gen	15%	$4.88	deepseek ($0.14/M)	$0.17	97%
Summarization	18%	$7.20	gpt-4o-mini ($0.15/M)	$0.43	94%
Reasoning	12%	$8.70	claude-haiku ($0.80/M)	$3.36	61%
Expert	8%	$8.40	gpt-4o ($2.50/M)	$8.40	0%
Total	100%	$34.11	—	$12.36	64%
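The savings column can be re-derived from the two cost columns. A small sketch of that check, with the figures copied from the table above (the helper name is illustrative). Note that the per-row all-premium costs sum to $34.12, one cent above the table's $34.11 total, presumably because the rows were rounded independently.

```python
# (query type, GPT-4o-only cost, A3M cost) in USD, copied from the table above.
ROWS = [
    ("Simple Q&A", 4.94, 0.00),
    ("Code gen", 4.88, 0.17),
    ("Summarization", 7.20, 0.43),
    ("Reasoning", 8.70, 3.36),
    ("Expert", 8.40, 8.40),
]


def savings_pct(baseline, routed):
    """Percentage saved relative to the all-premium baseline."""
    return 100 * (1 - routed / baseline)


for name, baseline, routed in ROWS:
    print(f"{name}: {savings_pct(baseline, routed):.0f}% saved")

total_baseline = sum(baseline for _, baseline, _ in ROWS)
total_routed = sum(routed for _, _, routed in ROWS)
print(f"Total: ${total_baseline:.2f} -> ${total_routed:.2f} "
      f"({savings_pct(total_baseline, total_routed):.0f}% saved)")
```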

Annual Projection

Monthly Queries	All-Premium	A3M Router	You Save	Annualized
10K	$34	$12	$22 (65%)	$261
100K	$341	$124	$217 (64%)	$2,604
1M	$3,411	$1,236	$2,175 (64%)	$26,100

Auto-routing sends ~50% of queries to the free tier and ~35% to the cheap tier. Absolute savings scale linearly with volume; the percentage saved stays at roughly 64%.
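The projection can be reproduced by scaling the per-query-type totals. The sketch below assumes those totals ($34.11 all-premium, $12.36 routed) correspond to 10K queries per month, which is consistent with the 10K row of the table, and that cost scales linearly with volume. The function name and dictionary keys are illustrative.

```python
# Totals from the "By Query Type" table, assumed to represent 10K queries/month.
COST_PER_10K = {"all_premium": 34.11, "a3m": 12.36}


def project(monthly_queries):
    """Linearly scale the 10K-query costs to another monthly volume."""
    scale = monthly_queries / 10_000
    premium = COST_PER_10K["all_premium"] * scale
    routed = COST_PER_10K["a3m"] * scale
    saved = premium - routed
    return {
        "all_premium": premium,
        "a3m": routed,
        "monthly_savings": saved,
        "annual_savings": saved * 12,
        "savings_pct": 100 * saved / premium,
    }


for volume in (10_000, 100_000, 1_000_000):
    p = project(volume)
    print(f"{volume:>9,} queries/month: save ${p['monthly_savings']:,.0f}/month, "
          f"${p['annual_savings']:,.0f}/year ({p['savings_pct']:.0f}%)")
```

Small differences from the table (for example in the 100K annualized figure) come from the table rounding the monthly savings before multiplying by 12.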

Parallel Ensemble Quality Gain

A3M sends each query to NVIDIA and Groq in parallel, scores both responses, and returns the higher-scoring one. The results below come from a preliminary benchmark (50 queries).

At a glance: +26% answer quality, -57% hallucination rate, +21pp specificity, +19pp multi-step accuracy.

Metric	Single Best Provider	A3M Ensemble	Gain
Answer quality (1-10)	6.5	8.2	+26%
Specificity (code/nums)	58%	79%	+21pp
Hallucination rate	4.2%	1.8%	-57%
Multi-step accuracy	72%	91%	+19pp
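The run-in-parallel, score, pick-the-best pattern can be sketched with the standard library. This is an illustration under assumptions, not A3M's code: the `ensemble` function, the stub providers, and the length-based scorer are placeholders, and the document does not describe how A3M actually scores responses.

```python
from concurrent.futures import ThreadPoolExecutor


def ensemble(prompt, providers, score, timeout_s=10):
    """Query all providers concurrently and return the best-scoring response.

    providers: dict mapping a name to a callable(prompt) -> str
    score: callable(str) -> float, higher is better (placeholder scorer here)
    """
    results = {}
    with ThreadPoolExecutor(max_workers=len(providers)) as pool:
        futures = {name: pool.submit(fn, prompt) for name, fn in providers.items()}
        for name, future in futures.items():
            try:
                results[name] = future.result(timeout=timeout_s)
            except Exception:
                continue  # a failed provider simply drops out of the comparison
    if not results:
        raise RuntimeError("all providers failed")
    best = max(results, key=lambda name: score(results[name]))
    return best, results[best]


# Stub providers standing in for real API clients.
stub_providers = {
    "nvidia": lambda prompt: "Paris is the capital of France.",
    "groq": lambda prompt: "Paris.",
}

winner, text = ensemble("What is the capital of France?", stub_providers, score=len)
print(winner, text)
```

Using `len` as the scorer is only a stand-in so the example runs; a real scorer would judge correctness or specificity. The trade-off of this pattern is that every query costs at least two provider calls.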

Provider Coverage

The benchmark covers the 12 providers listed below. A3M supports 47+ providers in production.

Provider	Benchmarked	Tier	Notes
OpenAI	✅	premium	GPT-4o, GPT-4o-mini
Anthropic	✅	premium	Claude 3.5 Sonnet, Haiku
Groq	✅	cheap	llama-3.3-70b, mixtral
NVIDIA	✅	mid	Nemotron, Llama
DeepSeek	✅	mid	DeepSeek-V2, DeepSeek-Coder
Mistral	✅	mid	Mistral Large, Small
Google	✅	premium	Gemini 1.5 Pro
Cohere	✅	mid	Command R+
Together	✅	cheap	Various open models
Fireworks	✅	cheap	Various open models
Perplexity	✅	mid	Sonar, pplx-70b
Replicate	✅	cheap	Various open models

Additional providers supported: Kimi, Qwen, Zhipu, Yi, Cerebras, Sambanova, OctoAI, and 30+ more.

🧪 Reproduce This Yourself

```bash
# Install the benchmark tool
pip install llm-gateway-bench

# Start A3M proxy
npx a3m-router serve

# Run comparison: direct to Groq (baseline)
python3 -m llm_gateway_bench.cli run groq \
  --model llama-3.3-70b-versatile \
  --prompt "What is the capital of France?" \
  --requests 10

# Run comparison: through the A3M proxy with auto routing
python3 -m llm_gateway_bench.cli run custom \
  --model auto \
  --base-url http://localhost:8787/v1 \
  --prompt "What is the capital of France?" \
  --requests 10
```

Methodology: All benchmarks run on real API calls (not simulated). 3 prompts × 5 requests = 15 calls per scenario. Results saved in benchmark-results.json. Run date: 2026-05-26. Provider: Groq (llama-3.3-70b-versatile).
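The sampling scheme in the methodology (3 prompts × 5 requests per scenario) can be sketched as a small timing loop. This is not how llm-gateway-bench is implemented; `measure` and the `send` callable are hypothetical, and `send` stands in for whatever function issues one request to a provider or to the proxy.

```python
import statistics
import time


def measure(send, prompts, requests_per_prompt=5):
    """Time send(prompt) for every prompt and repetition; return summary stats."""
    samples_ms = []
    for prompt in prompts:
        for _ in range(requests_per_prompt):
            start = time.perf_counter()
            send(prompt)
            samples_ms.append((time.perf_counter() - start) * 1000)
    return {
        "calls": len(samples_ms),
        "mean_ms": statistics.mean(samples_ms),
        "median_ms": statistics.median(samples_ms),
    }


def fake_send(prompt):
    """Stand-in for a real request; sleeps briefly instead of calling an API."""
    time.sleep(0.001)


stats = measure(fake_send, ["prompt A", "prompt B", "prompt C"])
print(stats)
```

With only 15 calls per scenario, the median is usually a more stable summary than the mean, since a single slow request can shift the mean noticeably.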

📖 Full Benchmark Methodology 📊 Raw Results