Cut your AI bill before a token is ever billed

OptScale AI compresses everything your models read, aligns prompt caches so provider discounts actually hit, trims system-prompt overhead, and routes each request to the best-value model — one gateway, no code changes.

Start your 14-day free trial

Up to 60% lower AI spend
Up to 97% fewer tokens billed
<10 ms added latency
No code changes

Your input tokens are the invoice

On the workloads enterprises actually run (RAG and agents), input tokens outnumber output tokens by 5–30×. A cheaper model lowers the price per token, but most of the spend still comes from everything you send: retrieved chunks, tool outputs, logs, and re-sent history.

Workload            Input:output ratio   Note
RAG / retrieval     10–30:1              Retrieved context dwarfs the answer.
Agentic tool use    5–15:1               Tool outputs and history stack up each turn.
Chat                ~3:1                 Balanced; output cost still leads here.

Output token price: ~4× the input price per token, so both sides matter.

Ratios compare input tokens (what you send) with output tokens (what the model writes).
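To see why the ratio matters, here is a minimal sketch that turns an input:output token ratio into a share of spend. The function name is hypothetical, and the default 4× output price multiple is taken from the figure above; real price multiples vary by provider and model.

```python
def input_cost_share(input_to_output_ratio: float,
                     output_price_multiple: float = 4.0) -> float:
    """Fraction of total token spend that comes from input tokens.

    input_to_output_ratio: input tokens per output token (10 means 10:1).
    output_price_multiple: output price per token divided by input price
        per token (assumed ~4x here, as quoted on this page).
    """
    input_cost = input_to_output_ratio * 1.0   # input price normalised to 1
    output_cost = 1.0 * output_price_multiple  # one output token at the multiple
    return input_cost / (input_cost + output_cost)


if __name__ == "__main__":
    for label, ratio in [("Chat", 3), ("Agentic, low end", 5), ("RAG, high end", 30)]:
        print(f"{label:>18}: {input_cost_share(ratio):.0%} of spend is input")
```

Under these assumptions, chat at 3:1 puts about 43% of spend on input, so output still leads, while RAG at 30:1 puts about 88% of spend on input.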

Four levers, stacked on the gateway

Each request passes through four cost controls before it reaches a provider. They attack different parts of the bill, which is why they compound.

System-prompt & schema trimming

Strips fixed instruction and tool-schema overhead re-sent on every single call.

Input tokens ↓

OptScale compression engine

OptScale's compression engine shrinks retrieved context, tool outputs, logs and history inline — reversibly, before they reach a provider.

Input tokens ↓↓↓

LLM cache alignment

Stabilizes prompt prefixes so Anthropic and OpenAI prompt caches actually hit, with repeated context billed at a discounted rate, as low as ~10% of the normal input price.

Repeat tokens ↓

Smart model selection

Routes what remains to the best-value model that still meets quality. This lever lowers cost per token rather than token count.

Cost / token ↓
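Of the four levers, cache alignment is the easiest to illustrate. Provider prompt caches match on an exact prefix, so the idea is to keep everything that rarely changes byte-identical and at the front of the prompt, and push everything volatile to the end. The sketch below is an assumption about how that could be done, not OptScale's implementation; `build_prompt` and its parameters are hypothetical.

```python
import json


def build_prompt(system: str, tools: list[dict], history: list[str],
                 user_msg: str) -> tuple[str, str]:
    """Return (stable_prefix, volatile_suffix) for one request.

    Hypothetical helper: the stable prefix holds content that should be
    identical across requests, so a prefix-matching cache can reuse it.
    """
    # Sort tools by name and serialise with sorted keys and fixed separators,
    # so the same logical schema always produces the same bytes.
    canonical_tools = json.dumps(
        sorted(tools, key=lambda tool: tool["name"]),
        sort_keys=True,
        separators=(",", ":"),
    )
    stable_prefix = f"{system.strip()}\n{canonical_tools}\n"
    # History and the new user message change every turn, so they go last.
    volatile_suffix = "\n".join(history + [user_msg])
    return stable_prefix, volatile_suffix
```

Two requests that list the same tools in a different order, or with keys in a different order, end up with the same prefix, which is the condition a prefix cache needs in order to hit.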

≈97%

The compression engine alone strips 60–95% of context tokens on compressible payloads. Stacked with cache alignment, system-prompt trimming and smart routing, total tokens billed at full rate can drop by up to 97% on long-context, high-repetition workloads — which is what pulls blended cost down by up to 60%.*
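A rough way to read the stacking claim: the levers multiply rather than add. The sketch below shows the arithmetic with illustrative inputs (90% compression, 70% of the remaining tokens served from cache). These inputs are assumptions chosen to show how a figure near 97% can arise; they are not measured OptScale results, and the function name is hypothetical.

```python
def full_rate_fraction(compression: float, cache_hit_share: float) -> float:
    """Fraction of the original input tokens still billed at the full rate.

    compression: share of tokens removed by compression (0.90 means 90%).
    cache_hit_share: share of the remaining tokens that hit the provider cache.
    """
    remaining = 1.0 - compression                 # tokens left after compression
    return remaining * (1.0 - cache_hit_share)    # of those, the cache misses


if __name__ == "__main__":
    left = full_rate_fraction(compression=0.90, cache_hit_share=0.70)
    print(f"Billed at full rate: {left:.1%} of original tokens")
    print(f"Reduction in full-rate tokens: {1 - left:.1%}")
```

With these inputs, about 3% of the original tokens are still billed at the full rate. Cached tokens are still billed, only at a lower rate, which is one reason the blended cost reduction is smaller than the token reduction.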

Read the docs →

How the compression layer works

Content is detected and routed to the right compressor, and the original is cached locally so nothing is lost. The model can pull an original back on demand.

01

Content router

Detects whether a payload is JSON, code, or prose and picks the right compressor.

02

Type-aware compressors

Separate JSON, AST-aware code, and text compressors — each shrinks its own content type.

03

Cache aligner

Stabilizes prefixes so provider KV caches keep hitting after compression.

04

Reversible retrieval

Originals stay cached locally; the model calls one back only if it needs it.
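The steps above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the OptScale engine: `detect_kind`, `ReversibleStore`, `compress_json`, and `shrink` are hypothetical names, the code check only recognises Python, and the only compressor shown is a stand-in that removes JSON whitespace.

```python
import ast
import hashlib
import json


def detect_kind(payload: str) -> str:
    """Classify a payload as 'json', 'code', or 'prose' (Python-only code check)."""
    text = payload.strip()
    if text[:1] in ("{", "["):
        try:
            json.loads(text)
            return "json"
        except json.JSONDecodeError:
            pass
    try:
        tree = ast.parse(text)
    except (SyntaxError, ValueError):
        return "prose"
    # Count it as code only if it contains statements prose would not parse to.
    code_nodes = (ast.FunctionDef, ast.ClassDef, ast.Import, ast.ImportFrom, ast.Assign)
    if any(isinstance(node, code_nodes) for node in tree.body):
        return "code"
    return "prose"


class ReversibleStore:
    """Keeps originals locally, keyed by content hash, so they can be fetched back."""

    def __init__(self) -> None:
        self._originals: dict[str, str] = {}

    def put(self, original: str) -> str:
        key = hashlib.sha256(original.encode("utf-8")).hexdigest()[:16]
        self._originals[key] = original
        return key

    def get(self, key: str) -> str:
        return self._originals[key]


def compress_json(payload: str) -> str:
    """Stand-in compressor: re-serialise JSON without whitespace."""
    return json.dumps(json.loads(payload), separators=(",", ":"))


def shrink(payload: str, store: ReversibleStore) -> tuple[str, str, str]:
    """Return (kind, compressed_text, retrieval_key) for one payload."""
    kind = detect_kind(payload)
    key = store.put(payload)  # the original is always recoverable by key
    compressed = compress_json(payload) if kind == "json" else payload
    return kind, compressed, key
```

In this sketch the retrieval key would travel with the compressed text, so a tool call from the model could ask `store.get(key)` for the full original when the compressed version is not enough.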

Real workloads. Same answers.

How much you save depends on the payload. Here's what typically compresses on real agent workloads — with answers held flat.

What compresses                       Typical reduction
Tool outputs, logs & API responses    up to ~90%
Code & search results                 up to ~90%
RAG chunks & long documents           40–75%
Conversation history                  50–80%

Illustrative: a 60,000-token incident log shrinks to roughly 5,000 tokens before the model ever reads it — and the answer stays the same. Task accuracy holds flat on standard math and factual benchmarks after compression.

Measured, not guessed

Output-side savings are counterfactual — we never see what the model would have written. So OptScale reports an honest estimate with a confidence range rather than a round marketing number:

Reduction: 31.7%  (95% CI 27.7% … 35.7%)

Want a measured figure instead? Hold out a slice of traffic as an unshaped control group, and the dashboard shows tokens saved side-by-side, labelled measured vs estimated.
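For the measured path, one simple way to put a confidence range on a control-group comparison is the delta method for a ratio of two sample means. The sketch below is an assumption about how such a figure could be computed, not OptScale's estimator; `reduction_with_ci` is a hypothetical name, and it treats per-request token counts in the two groups as independent samples.

```python
import math
import statistics


def reduction_with_ci(control: list[float], treated: list[float],
                      z: float = 1.96) -> tuple[float, float, float]:
    """Token reduction of treated vs control traffic, with an approximate 95% CI.

    control: tokens billed per request on unshaped (held-out) traffic.
    treated: tokens billed per request on traffic that went through shaping.
    Returns (point_estimate, lower_bound, upper_bound) as fractions.
    """
    mean_c = statistics.mean(control)
    mean_t = statistics.mean(treated)
    ratio = mean_t / mean_c
    # Relative variance of each sample mean (sample variance / n / mean^2).
    rel_var_c = statistics.variance(control) / (len(control) * mean_c ** 2)
    rel_var_t = statistics.variance(treated) / (len(treated) * mean_t ** 2)
    # Delta-method standard error for the ratio of two independent means.
    se = ratio * math.sqrt(rel_var_c + rel_var_t)
    point = 1.0 - ratio
    return point, point - z * se, point + z * se


if __name__ == "__main__":
    point, low, high = reduction_with_ci([100, 110, 90, 100], [70, 60, 80, 70])
    print(f"Reduction: {point:.1%}  (95% CI {low:.1%} to {high:.1%})")
```

With only a handful of requests the interval is wide; it narrows as the held-out slice grows, which is the trade-off when deciding how much traffic to leave unshaped.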

This is the same discipline behind our analytics pillar: quality and savings are measured after the fact with an audit trail, not predicted before it.

Where It Fits

Drops into the stack you already run

Compression lives on the gateway, so every team and agent inherits it without touching application code.

🔌

One endpoint

Point your apps and agents at the OptScale gateway. Compression, caching and routing apply to every request automatically.

🏢

SaaS or on-prem

Same platform, your terms. On-prem keeps every prompt and every token inside your network — a real edge in regulated industries.

🧾

Net of caching

Savings are modeled on top of provider prompt caching, not double-counted against it — so the number you see is the number you get.
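The difference between net and double-counted savings is easiest to see with numbers. The sketch below compares the two baselines under illustrative assumptions: half of input tokens hit the provider cache, cached tokens are billed at 10% of the full price, and the gateway halves the token count. None of these are measured figures, and the function names are hypothetical.

```python
def input_cost(tokens: int, cached_share: float, price_per_token: float,
               cached_rate: float = 0.10) -> float:
    """Input cost when `cached_share` of tokens is billed at `cached_rate` of full price."""
    full = tokens * (1.0 - cached_share)
    cached = tokens * cached_share * cached_rate
    return (full + cached) * price_per_token


def savings(before_tokens: int, after_tokens: int, cached_share: float,
            price_per_token: float) -> tuple[float, float]:
    """Return (net_of_caching, double_counted) savings as fractions."""
    uncached_baseline = before_tokens * price_per_token
    cached_baseline = input_cost(before_tokens, cached_share, price_per_token)
    after = input_cost(after_tokens, cached_share, price_per_token)
    net = 1.0 - after / cached_baseline        # baseline already includes caching
    gross = 1.0 - after / uncached_baseline    # baseline ignores caching
    return net, gross


if __name__ == "__main__":
    net, gross = savings(100_000, 50_000, cached_share=0.5, price_per_token=3e-6)
    print(f"Net of caching: {net:.1%}   Double-counted: {gross:.1%}")
```

In this example the net figure is 50% while the double-counted figure is 72.5%, because the second one takes credit for a discount the provider cache would have delivered anyway.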

See your own savings, live

Run a real workload through the gateway and watch tokens — and cost — drop in the dashboard.