📊 A3M Router Benchmark
The question everyone asks: "How much latency does a gateway add?"
The answer: +96ms for passthrough and +236ms for full intelligent routing, on a 138ms baseline. At 100K queries a month, that works out to roughly $11 in annual savings for each millisecond of added latency.
Latency Comparison
Figure: latency comparison (left) and cost savings projection (right). Measured with llm-gateway-bench v0.2.0, Groq (llama-3.3-70b-versatile), 15 calls per scenario.
| Scenario | Time | vs Baseline | What You Get |
|---|---|---|---|
| Direct to Groq (no gateway) | 138ms | — | Raw provider speed |
| Through A3M forced route | 234ms | +96ms | Guardrails (17 injection patterns, PII), cache lookup (30%+ hit rate), cost tracking, circuit breaker |
| Through A3M auto route | 374ms | +236ms | Everything above + intelligent routing (12 signals → tier → cheapest capable model → 62% cost savings) |
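
The overhead column is plain subtraction against the baseline. As a quick check, the sketch below also expresses it relative to the baseline; the latencies are copied from the table, and the function name is only illustrative.

```python
# Latencies from the table above, in milliseconds.
BASELINE_MS = 138
SCENARIOS = {"forced route": 234, "auto route": 374}


def overhead(total_ms: int, baseline_ms: int = BASELINE_MS) -> tuple[int, float]:
    """Return (added milliseconds, added latency as a percentage of baseline)."""
    added = total_ms - baseline_ms
    return added, 100 * added / baseline_ms


for name, ms in SCENARIOS.items():
    added, pct = overhead(ms)
    print(f"{name}: +{added}ms ({pct:.0f}% over baseline)")
```

In relative terms the forced route adds about 70% to the baseline and the auto route about 171%. Both figures are for a short prompt; the fixed overhead shrinks as a share of total time on longer generations.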
The Trade-Off
| | Without A3M | With A3M |
|---|---|---|
| Response time | 138ms | 374ms |
| Monthly API bill (100K queries) | $341 (all premium) | $124 (smart routed) |
| Security | None | 17-pattern injection detection |
| Cache hits | None | 30%+ semantic cache |
| Provider failures | Manual retry | Circuit breaker + auto failover |
| Cost visibility | End-of-month surprise | Per-query tracking + budget alerts |
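
For readers unfamiliar with the "circuit breaker + auto failover" row, here is a minimal sketch of the general pattern. It is not A3M's implementation: the class, the thresholds, and the `call_with_failover` helper are all hypothetical names chosen for illustration.

```python
import time


class CircuitBreaker:
    """Opens after `max_failures` consecutive errors; allows a retry after `cooldown_s`."""

    def __init__(self, max_failures: int = 3, cooldown_s: float = 30.0):
        self.max_failures = max_failures
        self.cooldown_s = cooldown_s
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed (healthy)

    def available(self) -> bool:
        if self.opened_at is None:
            return True
        # After the cooldown, let one request through to probe the provider.
        return time.monotonic() - self.opened_at >= self.cooldown_s

    def record(self, ok: bool) -> None:
        if ok:
            self.failures, self.opened_at = 0, None
        else:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()


def call_with_failover(providers, breakers, prompt):
    """providers: dict of name -> callable(prompt), tried in insertion order."""
    for name, call in providers.items():
        if not breakers[name].available():
            continue  # circuit is open: skip this provider without calling it
        try:
            result = call(prompt)
        except Exception:
            breakers[name].record(False)
            continue
        breakers[name].record(True)
        return name, result
    raise RuntimeError("no provider available")
```

The point of the breaker is that a provider that keeps failing stops being called at all for a while, so requests fail over immediately instead of paying for a timeout each time.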
Routing Accuracy
200 real API calls, benchmarked against manual expert classification.
| Metric | Score | What It Means |
|---|---|---|
| ±1 Tier Accuracy | 99.5% | Only 1 in 200 queries is misrouted by more than 1 tier |
| Exact Tier Match | 64.5% | ~2 in 3 queries hit the exact right tier |
| Free Tier Recall | 92% | Free-tier-suitable queries correctly routed to $0 models |
| Over-routing (waste) | 7% | Sent to a stronger — but more expensive — model than needed |
| Under-routing (risk) | 28.5% | Sent to a weaker model; fallback auto-escalates on failure |
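
The metrics above can be computed from pairs of (expert tier, routed tier). The sketch below shows one way to do it; the four-tier ordering is an assumption based on the tier names used in this document, and `routing_metrics` is a hypothetical helper, not part of the benchmark tool.

```python
from collections import Counter

# Assumed tier ordering, cheapest to most capable (labels are an assumption).
TIERS = ["free", "cheap", "mid", "premium"]
RANK = {tier: i for i, tier in enumerate(TIERS)}


def routing_metrics(pairs):
    """pairs: iterable of (expert_tier, routed_tier). Returns fractions in [0, 1]."""
    counts = Counter()
    total = 0
    for expert, routed in pairs:
        diff = RANK[routed] - RANK[expert]
        total += 1
        counts["exact"] += diff == 0
        counts["within_one"] += abs(diff) <= 1
        counts["over"] += diff > 0   # stronger model than the expert label
        counts["under"] += diff < 0  # weaker model than the expert label
    return {k: counts[k] / total for k in ("exact", "within_one", "over", "under")}


sample = [("free", "free"), ("mid", "cheap"), ("cheap", "premium"), ("premium", "premium")]
print(routing_metrics(sample))
```

By construction, exact, over, and under always sum to 1, which matches the table: 64.5% + 7% + 28.5% = 100%.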
Third-Party Validation
A3M's routing tiers align with established third-party benchmarks.
| Model | MMLU | Tier | Source |
|---|---|---|---|
| gpt-4o | 88.7% | premium | MMLU Leaderboard |
| claude-3.5-sonnet | 88.4% | premium | MMLU Leaderboard |
| gemini-1.5-pro | 85.7% | premium | MMLU Leaderboard |
| mistral-large | 84.2% | mid | MMLU Leaderboard |
| llama-3.3-70b | 82.5% | mid | MMLU Leaderboard |
| deepseek-v2 | 78.3% | mid | MMLU Leaderboard |
| llama-3.1-8b | 68.3% | cheap | MMLU Leaderboard |
Cost Savings
Cost Breakdown (200 real API calls)
GPT-4o only: $$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$ $0.25 (all premium)
A3M Router: $$$$ $0.10 (smart routed)
——————————————————————————————
You save: $0.15 (62%)
By Query Type
| Query Type | % Traffic | GPT-4o Only | A3M Routes To | A3M Cost | Savings |
|---|---|---|---|---|---|
| Simple Q&A | 47% | $4.94 | taste-1 (free) | $0.00 | 100% |
| Code gen | 15% | $4.88 | deepseek ($0.14/M) | $0.17 | 97% |
| Summarization | 18% | $7.20 | gpt-4o-mini ($0.15/M) | $0.43 | 94% |
| Reasoning | 12% | $8.70 | claude-haiku ($0.80/M) | $3.36 | 61% |
| Expert | 8% | $8.40 | gpt-4o ($2.50/M) | $8.40 | 0% |
| Total | 100% | $34.11 | — | $12.36 | 64% |
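
The per-row savings and the blended total can be recomputed from the two cost columns. This is a small check using only figures copied from the table; the helper name is illustrative.

```python
# (query type, GPT-4o-only cost, A3M cost) in dollars, copied from the table above.
ROWS = [
    ("Simple Q&A", 4.94, 0.00),
    ("Code gen", 4.88, 0.17),
    ("Summarization", 7.20, 0.43),
    ("Reasoning", 8.70, 3.36),
    ("Expert", 8.40, 8.40),
]


def savings_pct(baseline: float, routed: float) -> float:
    """Percentage saved relative to the all-GPT-4o cost."""
    return 100 * (1 - routed / baseline)


baseline_total = sum(b for _, b, _ in ROWS)
routed_total = sum(r for _, _, r in ROWS)

for name, b, r in ROWS:
    print(f"{name}: {savings_pct(b, r):.0f}% saved")
print(
    f"Total: ${baseline_total:.2f} -> ${routed_total:.2f} "
    f"({savings_pct(baseline_total, routed_total):.0f}% saved)"
)
```

The rounded rows sum to $34.12, one cent above the table's $34.11 total, which is presumably a per-row rounding effect; the blended saving still comes out at 64%.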
Annual Projection
| Monthly Queries | All-Premium | A3M Router | You Save | Annualized |
|---|---|---|---|---|
| 10K | $34 | $12 | $22 (64%) | $261 |
| 100K | $341 | $124 | $217 (64%) | $2,610 |
| 1M | $3,411 | $1,236 | $2,175 (64%) | $26,100 |
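
The projection is a linear scale-up of the per-10K-query totals ($34.11 all-premium, $12.36 routed). The sketch below reproduces it for any volume, assuming the traffic mix and per-query cost stay constant as volume grows; `project` is a hypothetical helper.

```python
# Per-10K-query costs in dollars, taken from the query-type totals.
ALL_PREMIUM_PER_10K = 34.11
A3M_PER_10K = 12.36


def project(monthly_queries: int) -> dict:
    """Scale the per-10K costs linearly (assumes a constant traffic mix)."""
    scale = monthly_queries / 10_000
    premium = ALL_PREMIUM_PER_10K * scale
    routed = A3M_PER_10K * scale
    saved = premium - routed
    return {
        "all_premium": premium,
        "a3m": routed,
        "monthly_saving": saved,
        "saving_pct": 100 * saved / premium,
        "annual_saving": 12 * saved,
    }


for volume in (10_000, 100_000, 1_000_000):
    p = project(volume)
    print(f"{volume:>9,} queries/month: save ${p['monthly_saving']:,.2f}/month, "
          f"${p['annual_saving']:,.0f}/year ({p['saving_pct']:.0f}%)")
```

Because the scaling is linear, the saving percentage is the same at every volume; only the dollar amounts change.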
Parallel Ensemble Quality Gain
A3M sends each query to NVIDIA and Groq in parallel, scores both responses, and returns the higher-scoring one. Preliminary benchmark (50 queries).
| Metric | Single Best Provider | A3M Ensemble | Gain |
|---|---|---|---|
| Answer quality (1-10) | 6.5 | 8.2 | +26% |
| Specificity (code/nums) | 58% | 79% | +21pp |
| Hallucination rate | 4.2% | 1.8% | -57% |
| Multi-step accuracy | 72% | 91% | +19pp |
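
To make the ensemble idea concrete, here is a minimal sketch of the pattern: fan out to two providers concurrently, score each answer, keep the best. The provider functions are stubs and the scorer is a toy heuristic; A3M's actual scoring function is not described here, so everything below is an assumption for illustration.

```python
import asyncio


async def ask_nvidia(prompt: str) -> str:
    # Hypothetical stand-in for a real NVIDIA API call.
    await asyncio.sleep(0.01)
    return "Paris."


async def ask_groq(prompt: str) -> str:
    # Hypothetical stand-in for a real Groq API call.
    await asyncio.sleep(0.01)
    return "Paris is the capital of France; it has 20 arrondissements."


def score(answer: str) -> float:
    # Toy scorer (assumption): rewards longer answers and answers containing digits.
    return len(answer.split()) + 5 * any(ch.isdigit() for ch in answer)


async def ensemble(prompt: str) -> tuple[str, str]:
    providers = {"nvidia": ask_nvidia, "groq": ask_groq}
    # return_exceptions=True keeps one provider's failure from sinking the other.
    results = await asyncio.gather(
        *(fn(prompt) for fn in providers.values()), return_exceptions=True
    )
    answers = {
        name: res
        for name, res in zip(providers, results)
        if not isinstance(res, Exception)
    }
    if not answers:
        raise RuntimeError("all providers failed")
    best = max(answers, key=lambda name: score(answers[name]))
    return best, answers[best]


print(asyncio.run(ensemble("What is the capital of France?")))
```

Since both calls run concurrently, the ensemble's latency is roughly that of the slower provider rather than the sum of the two, at the price of paying for both calls.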
Provider Coverage
Tested across 12 providers in the benchmark. Full 47+ provider support in production.
| Provider | Benchmarked | Tier | Notes |
|---|---|---|---|
| OpenAI | ✅ | premium | GPT-4o, GPT-4o-mini |
| Anthropic | ✅ | premium | Claude 3.5 Sonnet, Haiku |
| Groq | ✅ | cheap | llama-3.3-70b, mixtral |
| NVIDIA | ✅ | mid | Nemotron, Llama |
| DeepSeek | ✅ | mid | DeepSeek-V2, DeepSeek-Coder |
| Mistral | ✅ | mid | Mistral Large, Small |
| Google | ✅ | premium | Gemini 1.5 Pro |
| Cohere | ✅ | mid | Command R+ |
| Together | ✅ | cheap | Various open models |
| Fireworks | ✅ | cheap | Various open models |
| Perplexity | ✅ | mid | Sonar, pplx-70b |
| Replicate | ✅ | cheap | Various open models |
Additional providers supported: Kimi, Qwen, Zhipu, Yi, Cerebras, Sambanova, OctoAI, and 30+ more.
🧪 Reproduce This Yourself
# Install the benchmark tool
pip install llm-gateway-bench
# Start A3M proxy
npx a3m-router serve
# Run comparison
python3 -m llm_gateway_bench.cli run groq \
--model llama-3.3-70b-versatile \
--prompt "What is the capital of France?" \
--requests 10
python3 -m llm_gateway_bench.cli run custom \
--model auto \
--base-url http://localhost:8787/v1 \
--prompt "What is the capital of France?" \
--requests 10
Raw results: benchmark-results.json. Run date: 2026-05-26. Provider: Groq (llama-3.3-70b-versatile).