📊 A3M Router Benchmark

The question everyone asks: "How much latency does a gateway add?"

The answer: +96ms for passthrough and +236ms for full intelligent routing, on a 138ms baseline. In return, routing saves about $2,604/year at 100K queries/month, or roughly $11 per year for each millisecond of added latency.

| ±1 Tier Accuracy | Cost Savings | Passthrough Overhead | Ensemble Quality Gain |
|---|---|---|---|
| 99.5% | 62% | +96ms | +26% |

Latency Comparison

Figure: A3M Router benchmark chart. Left: latency comparison. Right: cost savings projection. Measured with llm-gateway-bench v0.2.0, Groq (llama-3.3-70b-versatile), 15 calls per scenario.

| Scenario | Time | vs Baseline | What You Get |
|---|---|---|---|
| Direct to Groq (no gateway) | 138ms | baseline | Raw provider speed |
| Through A3M forced route | 234ms | +96ms | Guardrails (17 injection patterns, PII), cache lookup (30%+ hit rate), cost tracking, circuit breaker |
| Through A3M auto route | 374ms | +236ms | Everything above + intelligent routing (12 signals → tier → cheapest capable model → 62% cost savings) |

The routing decision itself takes <1ms. The extra time is the full proxy pipeline: HTTP parsing → guardrails → cache → routing → forward to provider → response → cost logging.

The 236ms of total overhead saves $2,604/year at 100K queries/month. Full methodology in BENCHMARK.md.
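
The overhead and dollars-per-millisecond figures follow directly from the numbers above. A minimal sketch of that arithmetic, using only values quoted on this page (the variable and function names are illustrative, not part of A3M):

```python
# Figures copied from the tables on this page.
BASELINE_MS = 138          # direct to Groq, no gateway
FORCED_ROUTE_MS = 234      # through A3M, forced route
AUTO_ROUTE_MS = 374        # through A3M, auto route
ANNUAL_SAVINGS_USD = 2604  # at 100K queries/month


def overhead_ms(gateway_ms: int, baseline_ms: int = BASELINE_MS) -> int:
    """Latency the gateway adds relative to the direct call."""
    return gateway_ms - baseline_ms


passthrough = overhead_ms(FORCED_ROUTE_MS)
auto = overhead_ms(AUTO_ROUTE_MS)
usd_per_ms = ANNUAL_SAVINGS_USD / auto

print(f"passthrough overhead: +{passthrough}ms")
print(f"auto-route overhead:  +{auto}ms")
print(f"annual savings per ms of overhead: ${usd_per_ms:.2f}")
```

Note that the per-millisecond figure depends on query volume, since the savings scale with traffic while the overhead does not.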

The Trade-Off

| | Without A3M | With A3M |
|---|---|---|
| Response time | 138ms | 374ms |
| Monthly API bill | $341 (all premium) | $124 (smart routed) |
| Security | None | 17-pattern injection detection |
| Cache hits | None | 30%+ semantic cache |
| Provider failures | Manual retry | Circuit breaker + auto failover |
| Cost visibility | End-of-month surprise | Per-query tracking + budget alerts |

Routing Accuracy

200 real API calls, benchmarked against manual expert classification.

| Metric | Score | What It Means |
|---|---|---|
| ±1 Tier Accuracy | 99.5% | Only 1 in 200 queries is misrouted by more than 1 tier |
| Exact Tier Match | 64.5% | ~2 in 3 queries hit the exact right tier |
| Free Tier Recall | 92% | Free-tier-suitable queries correctly routed to $0 models |
| Over-routing (waste) | 7% | Sent to a stronger, more expensive model than needed |
| Under-routing (risk) | 28.5% | Sent to a weaker model; fallback auto-escalates on failure |
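
For readers who want to score their own routing logs the same way, here is a minimal sketch of how these metrics could be computed from pairs of (router tier, expert tier). The tier ordering `free < cheap < mid < premium` and the function name are assumptions for illustration; the actual evaluation script may differ.

```python
# Assumed tier ordering, cheapest to most capable.
TIERS = ["free", "cheap", "mid", "premium"]
RANK = {name: i for i, name in enumerate(TIERS)}


def routing_metrics(predicted, expected):
    """Compare router-chosen tiers against expert labels, pair by pair."""
    if len(predicted) != len(expected) or not predicted:
        raise ValueError("need two non-empty lists of equal length")
    n = len(predicted)
    # Positive diff: routed to a stronger tier than needed (over-routing).
    # Negative diff: routed to a weaker tier than needed (under-routing).
    diffs = [RANK[p] - RANK[e] for p, e in zip(predicted, expected)]
    free_diffs = [d for d, e in zip(diffs, expected) if e == "free"]
    return {
        "exact": sum(d == 0 for d in diffs) / n,
        "within_one": sum(abs(d) <= 1 for d in diffs) / n,
        "over_routing": sum(d > 0 for d in diffs) / n,
        "under_routing": sum(d < 0 for d in diffs) / n,
        # Share of free-tier-suitable queries that were actually routed to free.
        "free_recall": (
            sum(d == 0 for d in free_diffs) / len(free_diffs) if free_diffs else None
        ),
    }


# Toy data, not taken from the benchmark.
predicted = ["free", "cheap", "mid", "premium", "free"]
expected = ["free", "mid", "mid", "mid", "premium"]
metrics = routing_metrics(predicted, expected)
print(metrics)
```

Under these definitions exact match, over-routing, and under-routing always sum to 100%, which matches the table: 64.5% + 7% + 28.5%.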

Third-Party Validation

A3M's routing tiers align with established third-party benchmarks.

| Model | MMLU | Tier | Source |
|---|---|---|---|
| gpt-4o | 88.7% | premium | MMLU Leaderboard |
| claude-3.5-sonnet | 88.4% | premium | MMLU Leaderboard |
| gemini-1.5-pro | 85.7% | premium | MMLU Leaderboard |
| mistral-large | 84.2% | mid | MMLU Leaderboard |
| llama-3.3-70b | 82.5% | mid | MMLU Leaderboard |
| deepseek-v2 | 78.3% | mid | MMLU Leaderboard |
| llama-3.1-8b | 68.3% | cheap | MMLU Leaderboard |

On under-routing: A3M is deliberately conservative. It would rather try a cheaper model first and fail fast (triggering automatic fallback in <2s) than default to premium for every query. This is what drives the 62% cost savings.
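
The try-cheap-first, escalate-on-failure behavior can be sketched as a simple loop over tiers. This is an illustration under stated assumptions, not A3M's implementation: `route_with_fallback` and `call_model` are hypothetical names, the tier ordering is assumed, and the <2s fail-fast timeout is not modeled.

```python
# Assumed tier ordering, cheapest to most capable.
TIERS = ["free", "cheap", "mid", "premium"]


def route_with_fallback(prompt, start_tier, call_model):
    """Try start_tier first; on any failure, escalate one tier at a time.

    call_model(tier, prompt) is a caller-supplied function (hypothetical)
    that returns an answer or raises on error or timeout.
    """
    last_error = None
    for tier in TIERS[TIERS.index(start_tier):]:
        try:
            return tier, call_model(tier, prompt)
        except Exception as exc:  # any provider failure triggers escalation
            last_error = exc
    raise RuntimeError("all tiers failed") from last_error


def fake_call(tier, prompt):
    # Stand-in for a real provider call: pretend the cheap tier fails.
    if tier == "cheap":
        raise TimeoutError("cheap tier timed out")
    return f"answer from {tier}"


tier_used, answer = route_with_fallback("Explain CRDTs", "cheap", fake_call)
print(tier_used, answer)
```

The cost argument is that a failed cheap attempt is inexpensive compared with sending every query to the premium tier up front.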

Cost Savings

Cost Breakdown (200 real API calls)

 GPT-4o only:  $$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$  $0.25  (all premium)
 A3M Router:   $$$$                               $0.10  (smart routed)
               ——————————————————————————————
               You save:                           $0.15  (62%)

By Query Type

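| Query Type | % Traffic | GPT-4o Only | A3M Routes To | A3M Cost | Savings |
|---|---|---|---|---|---|
| Simple Q&A | 47% | $4.94 | taste-1 (free) | $0.00 | 100% |
| Code gen | 15% | $4.88 | deepseek ($0.14/M) | $0.17 | 97% |
| Summarization | 18% | $7.20 | gpt-4o-mini ($0.15/M) | $0.43 | 94% |
| Reasoning | 12% | $8.70 | claude-haiku ($0.80/M) | $3.36 | 61% |
| Expert | 8% | $8.40 | gpt-4o ($2.50/M) | $8.40 | 0% |
| Total | 100% | $34.11 | | $12.36 | 64% |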

Annual Projection

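| Monthly Queries | All-Premium | A3M Router | You Save | Annualized |
|---|---|---|---|---|
| 10K | $34 | $12 | $22 (65%) | $261 |
| 100K | $341 | $124 | $217 (64%) | $2,604 |
| 1M | $3,411 | $1,236 | $2,175 (64%) | $26,100 |

Auto-routing sends ~50% of queries to the free tier and ~35% to the cheap tier. The savings rate holds at about 64% across volumes, so absolute savings grow linearly with query count.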
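
The projection is a linear scaling of the per-query-type totals. A minimal sketch, assuming the $34.11 and $12.36 totals correspond to 10K queries/month (inferred from the matching 10K row, not stated explicitly); the function name is illustrative. Because some table rows round monthly figures to whole dollars before annualizing, results can differ from the table by a few dollars.

```python
# Totals from the "By Query Type" table, assumed to cover 10K queries/month.
ALL_PREMIUM_PER_10K = 34.11
A3M_PER_10K = 12.36


def project(monthly_queries: int) -> dict:
    """Scale the 10K-query costs linearly to another monthly volume."""
    scale = monthly_queries / 10_000
    premium = ALL_PREMIUM_PER_10K * scale
    routed = A3M_PER_10K * scale
    saved = premium - routed
    return {
        "all_premium": premium,
        "a3m": routed,
        "monthly_savings": saved,
        "savings_pct": saved / premium * 100,
        "annual_savings": saved * 12,
    }


for volume in (10_000, 100_000, 1_000_000):
    p = project(volume)
    print(
        f"{volume:>9,} queries/month: save ${p['monthly_savings']:,.2f}/month "
        f"({p['savings_pct']:.0f}%), ${p['annual_savings']:,.0f}/year"
    )
```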

Parallel Ensemble Quality Gain

A3M queries NVIDIA and Groq in parallel, scores the results, and picks the best. Preliminary benchmark (50 queries).

| Metric | Single Best Provider | A3M Ensemble | Gain |
|---|---|---|---|
| Answer quality (1-10) | 6.5 | 8.2 | +26% |
| Specificity (code/nums) | 58% | 79% | +21pp |
| Hallucination rate | 4.2% | 1.8% | -57% |
| Multi-step accuracy | 72% | 91% | +19pp |
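
The run-in-parallel, score, pick-best pattern can be sketched with a thread pool. This is a generic illustration, not A3M's code: the provider callables and the `score` function are hypothetical stand-ins, and the scoring A3M actually uses is not described here.

```python
from concurrent.futures import ThreadPoolExecutor


def ensemble(prompt, providers, score):
    """Query every provider concurrently and return the best-scoring answer.

    providers: dict mapping a provider name to a callable(prompt) -> str.
    score:     callable(prompt, answer) -> float, higher is better.
    """
    if not providers:
        raise ValueError("at least one provider is required")
    with ThreadPoolExecutor(max_workers=len(providers)) as pool:
        futures = {name: pool.submit(fn, prompt) for name, fn in providers.items()}
    # Leaving the with-block waits for all calls to finish.
    answers = {}
    for name, future in futures.items():
        try:
            answers[name] = future.result()
        except Exception:
            continue  # a failed provider simply drops out of the comparison
    if not answers:
        raise RuntimeError("all providers failed")
    best = max(answers, key=lambda name: score(prompt, answers[name]))
    return best, answers[best]


# Stub providers and a toy scorer (answer length), for illustration only.
stubs = {
    "nvidia": lambda prompt: "Paris",
    "groq": lambda prompt: "The capital of France is Paris.",
}
winner, text = ensemble("What is the capital of France?", stubs, lambda p, a: len(a))
print(winner, text)
```

The trade-off is that every ensemble query pays for all providers and waits for the slowest one, so the quality gain has to justify the added cost and latency.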

Provider Coverage

Tested across 12 providers in the benchmark. Full 47+ provider support in production.

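| Provider | Benchmarked | Tier | Notes |
|---|---|---|---|
| OpenAI | Yes | premium | GPT-4o, GPT-4o-mini |
| Anthropic | Yes | premium | Claude 3.5 Sonnet, Haiku |
| Groq | Yes | cheap | llama-3.3-70b, mixtral |
| NVIDIA | Yes | mid | Nemotron, Llama |
| DeepSeek | Yes | mid | DeepSeek-V2, DeepSeek-Coder |
| Mistral | Yes | mid | Mistral Large, Small |
| Google | Yes | premium | Gemini 1.5 Pro |
| Cohere | Yes | mid | Command R+ |
| Together | Yes | cheap | Various open models |
| Fireworks | Yes | cheap | Various open models |
| Perplexity | Yes | mid | Sonar, pplx-70b |
| Replicate | Yes | cheap | Various open models |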

Additional providers supported: Kimi, Qwen, Zhipu, Yi, Cerebras, Sambanova, OctoAI, and 30+ more.

🧪 Reproduce This Yourself

```bash
# Install the benchmark tool
pip install llm-gateway-bench

# Start A3M proxy
npx a3m-router serve

# Run comparison
python3 -m llm_gateway_bench.cli run groq \
  --model llama-3.3-70b-versatile \
  --prompt "What is the capital of France?" \
  --requests 10

python3 -m llm_gateway_bench.cli run custom \
  --model auto \
  --base-url http://localhost:8787/v1 \
  --prompt "What is the capital of France?" \
  --requests 10
```

Methodology: All benchmarks run on real API calls (not simulated). 3 prompts × 5 requests = 15 calls per scenario. Results saved in benchmark-results.json. Run date: 2026-05-26. Provider: Groq (llama-3.3-70b-versatile).
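
The 3 prompts × 5 requests scheme can be sketched as a small timing harness. This is not how llm-gateway-bench is implemented; `time_calls` and the `send` callable are hypothetical, and in practice `send` would wrap a real request to the provider or to the A3M proxy.

```python
import statistics
import time


def time_calls(send, prompts, requests_per_prompt=5):
    """Time send(prompt) for each prompt, repeated requests_per_prompt times."""
    latencies_ms = []
    for prompt in prompts:
        for _ in range(requests_per_prompt):
            start = time.perf_counter()
            send(prompt)
            latencies_ms.append((time.perf_counter() - start) * 1000)
    return {
        "calls": len(latencies_ms),
        "mean_ms": statistics.mean(latencies_ms),
        "median_ms": statistics.median(latencies_ms),
        "max_ms": max(latencies_ms),
    }


# Placeholder prompts and a no-op sender; swap in a real API call to measure.
prompts = ["prompt one", "prompt two", "prompt three"]
summary = time_calls(lambda prompt: None, prompts)
print(summary)
```

Running the same harness once against the provider directly and once against the proxy, then subtracting the means, gives the overhead figure for your own setup.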

📖 Full Benchmark Methodology 📊 Raw Results