Feature

2.2.18Feature Added

✨ New Feature: Context Compression

Context Compression automatically summarizes conversation history using a smaller, cost-efficient model before sending requests to your primary model. This significantly reduces input token costs while preserving conversation context.

Highlights

  • Up to 78% token savings on long multi-turn conversations
  • Three compression strategies to balance quality vs. savings:
  • conservative — compresses after 8+ turns (minimal context loss)
  • on — compresses after 6+ turns (balanced)
  • aggressive — compresses after 3+ turns (maximum savings)
  • All endpoints supported:
  • POST /v1/chat/completions
  • POST /v1/messages
  • POST /v1/responses

How to Enable

Option 1: API Key Dashboard (Zero Code Changes)

Go to API Key Management → Edit your API key → Enable Context Compression and select your preferred strategy. All requests using that key will automatically apply compression.

Option 2: Per-Request via Request Body

  {
    "model": "gpt-4.1",
    "messages": [...],
    "compression": {
      "enabled": true,
      "strategy": "on",
      "model": "gpt-4.1-mini"
    }
  }

Option 3: Per-Request via HTTP Headers

X-Context-Compression: on X-Compression-Model: gpt-4.1-mini

SDK Support

Compression examples are now available for all supported SDKs:

  • Python SDK (OpenAI, Anthropic, Responses API)
  • TypeScript / Vercel AI SDK (@apertis/ai-sdk-provider)
  • LangChain (via default_headers)
  • LlamaIndex (via additional_kwargs)
  • LiteLLM (via extra_headers)

Priority

Request body params > HTTP headers > API key defaults. Per-request settings always override key-level defaults.

See more on **Documentation**