Changelog

Type

March 2026

Feature

Feature Added

✨ New Feature: Context Compression

Context Compression automatically summarizes conversation history using a smaller, cost-efficient model before sending requests to your primary model. This significantly reduces input token costs while preserving conversation context.

Highlights

Up to 78% token savings on long multi-turn conversations
Three compression strategies to balance quality vs. savings:
conservative — compresses after 8+ turns (minimal context loss)
on — compresses after 6+ turns (balanced)
aggressive — compresses after 3+ turns (maximum savings)
All endpoints supported:
POST /v1/chat/completions
POST /v1/messages
POST /v1/responses

How to Enable

Option 1: API Key Dashboard (Zero Code Changes)

Go to API Key Management → Edit your API key → Enable Context Compression and select your preferred strategy. All requests using that key will automatically apply compression.

Option 2: Per-Request via Request Body

  {
    "model": "gpt-4.1",
    "messages": [...],
    "compression": {
      "enabled": true,
      "strategy": "on",
      "model": "gpt-4.1-mini"
    }
  }

Option 3: Per-Request via HTTP Headers

X-Context-Compression: on X-Compression-Model: gpt-4.1-mini

SDK Support

Compression examples are now available for all supported SDKs:

Python SDK (OpenAI, Anthropic, Responses API)
TypeScript / Vercel AI SDK (@apertis/ai-sdk-provider)
LangChain (via default_headers)
LlamaIndex (via additional_kwargs)
LiteLLM (via extra_headers)

Priority

Request body params > HTTP headers > API key defaults. Per-request settings always override key-level defaults.

See more on **Documentation**

February 2026

Feature

Models Added

Add Grok 4.2

Grok 4.2

Grok 4.2 is the next major iteration of xAI's Grok series, advancing the model's reasoning, coding, and multimodal capabilities with architectural improvements over Grok 4 and 4.1. It is positioned as a more powerful and general-purpose frontier AI model in the Grok family with stronger deep reasoning and real-world task performance.

Feature

Models Added

Add Qwen 3.5 Full Series & Seed-2.0-Mini

Seed-2.0-MiniQwen3.5 Plus 2026-02-15Qwen3.5 397B A17BQwen3.5-FlashQwen3.5-122B-A10BQwen3.5-27B

The full Qwen 3.5 series is provided at **Apertis Coding Plan** as well, Enjoy it.

Feature

Models Added

Add Nano Banana 2 (Gemini 3.1 Flash Image Preview)

Nano Banana 2 (Gemini 3.1 Flash Image Preview)

Gemini 3.1 Flash Image Preview (also known as "Nano Banana 2") is Google's latest state-of-the-art image generation and editing model, delivering Pro-level visual quality at Flash-level speed. It combines strong contextual understanding with fast, cost-efficient inference, enabling high-quality image generation and seamless iterative editing. Optimized for both performance and accessibility, it makes advanced visual creation workflows faster and more scalable.

Feature

System Update

Cached responses now support streaming (SSE) delivery, covering ~80% of API traffic that uses stream: true.

New feature: Cached responses now support streaming (SSE) delivery, covering ~80% of API traffic that uses stream: true.

On cache hit, the system emits synthetic SSE chunks from the stored response — no upstream API call needed
Content is split on rune boundaries (50 runes/chunk, 10ms intervals) to preserve multi-byte characters
Proper X-Cache-Hit, X-Cached-Tokens, and X-Actual-Model headers on streaming cache hits
Non-streaming cache hits continue to work as before (direct JSON response)

Cache Correctness Hardening

Temperature guard: Only caches requests where temperature: 0 is explicitly present in the raw JSON body. Omitted temperature (Go zero value 0.0) is no longer falsely treated as

cacheable — providers default to ~1.0 for omitted values

SSE error safety: If synthetic SSE emission fails mid-stream, the handler returns immediately instead of falling through to normal processing, preventing HTTP double-write

corruption

Tool call exclusion: Responses containing tool_calls are excluded from cache storage since the SSE emitter only supports text content replay

Cache TTL & Infrastructure

Default prompt cache TTL extended from 10 → 30 minutes

Enjoy it.