Caching behavior

Lumen caches non-streaming responses in Redis for 1 hour (configurable). Cache key = SHA-256 of (messages, model, temperature, tools, response_format).

What gets cached

Non-streaming POST /v1/chat/completions responses (HTTP 200)
Across customers — if two customers send identical prompts to the same tier, both get cache hits (but each sees their own audit hash)

What does NOT get cached

stream: true requests
Tool-use / function-call responses where the model picked a tool (we re-execute)
Requests with high temperature deltas (different temperature = different key)

Cost of a cache hit

Cache hits are billed at $0 cost-per-token, but still count toward your request quota.

Detecting cache hits

resp = client.chat.completions.create(...)
if resp.lumen["cache_hit"]:
    print("served from cache, free")

Bypassing the cache

Add a unique nonce to your prompt or vary temperature slightly:

client.chat.completions.create(
    ..., temperature=0.7001  # tiny variation forces a fresh request
)

Auditing

Cache hits get their own audit entries with cache_hit: true so the chain stays complete.