Feature
Feature Added
✨ New Feature: Context Compression
Context Compression automatically summarizes conversation history using a smaller, cost-efficient model before sending requests to your primary model. This significantly reduces input token costs while preserving conversation context.
Highlights
- Up to 78% token savings on long multi-turn conversations
- Three compression strategies to balance quality vs. savings:
conservative— compresses after 8+ turns (minimal context loss)on— compresses after 6+ turns (balanced)aggressive— compresses after 3+ turns (maximum savings)- All endpoints supported:
POST /v1/chat/completionsPOST /v1/messagesPOST /v1/responses
How to Enable
Option 1: API Key Dashboard (Zero Code Changes)
Go to API Key Management → Edit your API key → Enable Context Compression and select your preferred strategy. All requests using that key will automatically apply compression.
Option 2: Per-Request via Request Body
{
"model": "gpt-4.1",
"messages": [...],
"compression": {
"enabled": true,
"strategy": "on",
"model": "gpt-4.1-mini"
}
}Option 3: Per-Request via HTTP Headers
X-Context-Compression: on X-Compression-Model: gpt-4.1-mini
SDK Support
Compression examples are now available for all supported SDKs:
- Python SDK (OpenAI, Anthropic, Responses API)
- TypeScript / Vercel AI SDK (@apertis/ai-sdk-provider)
- LangChain (via default_headers)
- LlamaIndex (via additional_kwargs)
- LiteLLM (via extra_headers)
Priority
Request body params > HTTP headers > API key defaults. Per-request settings always override key-level defaults.
See more on **Documentation**