Bedrock Prompt Caching Guide¶
Overview¶
AWS Bedrock's prompt caching feature can reduce costs by up to 90% and latency by up to 85% for agents with static instructions and dynamic context.
Status: Experimental (v1) - Temporary implementation until Agno adds native support
How It Works¶
When you enable caching, DCAF places a "cache checkpoint" in your system prompt. Everything before the checkpoint is cached by Bedrock for 5 minutes (TTL resets on each use).
[Static Instructions] ← CACHED
↓
[Cache Checkpoint]
↓
[Dynamic Context] ← NOT cached (fresh each time)
Quick Start¶
Basic Usage¶
from dcaf.core import Agent
agent = Agent(
system_prompt="You are a Kubernetes expert... [long instructions]",
tools=[list_pods, delete_pod],
model_config={
"cache_system_prompt": True # Enable caching
}
)
Separating Static and Dynamic Content¶
This is the key pattern for effective caching:
from dcaf.core import Agent
agent = Agent(
# Static - cached
system_prompt="You are a Kubernetes expert. Your job is to help users manage their clusters...",
# Dynamic - NOT cached
system_context=lambda ctx: f"""
Current tenant: {ctx['tenant_name']}
Namespace: {ctx['k8s_namespace']}
User: {ctx['user_email']}
""",
model_config={
"cache_system_prompt": True
}
)
Requirements¶
Minimum Token Count¶
Your static system prompt must be at least: - Claude 3.7 Sonnet: 1024 tokens - Claude 3.5 Haiku: 2048 tokens
If below threshold, caching is automatically disabled with a warning log.
Rule of thumb: ~4 characters = 1 token, so aim for 4000+ character prompts.
Best Practices¶
1. Put Static Content First¶
✅ Good: Static instructions, then dynamic context
system_prompt="You are a K8s expert. [lengthy guidelines]"
system_context="Current tenant: acme-corp"
❌ Bad: Mixing static and dynamic
2. Make Static Content Detailed¶
The more static content you cache, the bigger the savings:
✅ Good: Detailed instructions (1500+ tokens)
system_prompt="""
You are a Kubernetes expert assistant.
Guidelines:
- Always verify namespace before operations
- Explain commands clearly
- Ask for confirmation on destructive operations
- Follow kubectl best practices
- Provide helpful error messages
[More detailed guidelines...]
"""
❌ Bad: Brief instructions (50 tokens)
3. Use Callable for Dynamic Context¶
For runtime data, use a lambda or function:
system_context=lambda ctx: f"""
Tenant: {ctx.get('tenant_name')}
Namespace: {ctx.get('k8s_namespace')}
User: {ctx.get('user_email')}
"""
Monitoring¶
Cache Performance Logs¶
DCAF logs cache performance when available:
INFO: ✅ Cache HIT: 950 tokens reused (~90% cost reduction)
INFO: 📝 Cache MISS: 950 tokens cached for next request (cache created)
First Request is Always a MISS¶
The first request creates the cache (MISS). Subsequent requests within 5 minutes are HITs:
Request 1: MISS (creates cache) → Full cost
Request 2: HIT (uses cache) → 10% cost
Request 3: HIT (uses cache) → 10% cost
...
Request N: MISS (cache expired) → Full cost
Troubleshooting¶
No Cache Metrics in Logs¶
Symptom: Caching enabled but no "Cache HIT/MISS" logs
Possible causes: 1. System prompt below minimum token threshold 2. Model doesn't support caching 3. Bedrock API change
Solution: Check logs for warnings about token threshold
Caching Not Enabled¶
Symptom: No cache-related logs at all
Checklist:
- [ ] model_config={"cache_system_prompt": True} set?
- [ ] Using Bedrock provider (not OpenAI)?
- [ ] Using supported model (Claude 3.5/3.7)?
Cost Comparison¶
Example Scenario¶
- Static prompt: 1500 tokens
- Dynamic context: 100 tokens
- 100 requests
Without caching: - Per request: 1600 tokens - Total: 160,000 tokens - Cost: ~$0.48
With caching: - First request: 1600 tokens (MISS) - Next 99 requests: 100 tokens each (HIT) - Total: 11,500 tokens - Cost: ~$0.035
Savings: ~93% 💰
Supported Models¶
Prompt caching is supported on:
- anthropic.claude-3-7-sonnet-20250219-v1:0
- anthropic.claude-3-5-sonnet-20241022-v2:0
- anthropic.claude-3-5-haiku-20241022-v1:0
When NOT to Use Caching¶
❌ Don't use caching when: - System prompt changes frequently - System prompt is very short (<1024 tokens) - Low request volume (cache expires between requests) - Using non-Bedrock providers
✅ DO use caching when: - Long, static system prompts - High request volume (multiple requests per 5 minutes) - Multi-tenant scenarios (static instructions, dynamic tenant context)
Future Plans¶
This is a temporary implementation. Once Agno adds native prompt caching support, we'll migrate to their implementation and remove the custom code.