Caching behavior
Lumen caches non-streaming responses in Redis for 1 hour (configurable). Cache key = SHA-256 of (messages, model, temperature, tools, response_format).
What gets cached
- Non-streaming
POST /v1/chat/completionsresponses (HTTP 200) - Across customers — if two customers send identical prompts to the same tier, both get cache hits (but each sees their own audit hash)
What does NOT get cached
stream: truerequests- Tool-use / function-call responses where the model picked a tool (we re-execute)
- Requests with high temperature deltas (different temperature = different key)
Cost of a cache hit
Cache hits are billed at $0 cost-per-token, but still count toward your request quota.
Detecting cache hits
resp = client.chat.completions.create(...)
if resp.lumen["cache_hit"]:
print("served from cache, free")
Bypassing the cache
Add a unique nonce to your prompt or vary temperature slightly:
client.chat.completions.create(
..., temperature=0.7001 # tiny variation forces a fresh request
)
Auditing
Cache hits get their own audit entries with cache_hit: true so the chain stays complete.