Prompt Caching
Cache system prompts to reduce processing time and costs with Anthropic models.
Prompt caching can help reduce processing time and costs. Consider it if you reuse the same prompt multiple times in a flow.
You can read more about prompt caching with Anthropic models in Anthropic's documentation.
Usage
To use prompt caching in your Kern setup, pass the cache_system_prompt argument when initializing the Claude model:
```python
from kern.agent import Agent
from kern.models.anthropic import Claude

agent = Agent(
    model=Claude(
        id="claude-3-5-sonnet-20241022",
        cache_system_prompt=True,
    ),
)
```

Note that for prompt caching to work, the prompt must meet a minimum length. You can read more about this in Anthropic's docs.
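As a rough pre-check, the sketch below estimates whether a system prompt is likely long enough to be cached. The 4-characters-per-token ratio is a common heuristic rather than Anthropic's tokenizer, and the 1024-token default is an assumption based on the minimum Anthropic documents for Sonnet-class models; confirm the threshold for your model in Anthropic's docs. The `likely_cacheable` helper is hypothetical and not part of Kern.

```python
def likely_cacheable(prompt: str, min_tokens: int = 1024, chars_per_token: float = 4.0) -> bool:
    """Roughly estimate whether a prompt meets a minimum cacheable length.

    Both defaults are assumptions: check the real minimum for your model.
    """
    estimated_tokens = len(prompt) / chars_per_token
    return estimated_tokens >= min_tokens


short_prompt = "You are a helpful assistant."
long_prompt = "You are a helpful assistant. " * 200  # 5800 characters

print(likely_cacheable(short_prompt))  # False
print(likely_cacheable(long_prompt))   # True
```

If the check fails, the request should still succeed; the prompt is simply processed without a cache entry.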
Extended cache
You can also use Anthropic's extended cache beta feature, which extends the cache duration from 5 minutes to 1 hour. To activate it, pass the extended_cache_time argument together with the following beta header:
```python
from kern.agent import Agent
from kern.models.anthropic import Claude

agent = Agent(
    model=Claude(
        id="claude-3-5-sonnet-20241022",
        betas=["extended-cache-ttl-2025-04-11"],
        cache_system_prompt=True,
        extended_cache_time=True,
    ),
)
```

Multi-block caching with per-block TTL
Split the system prompt into independently-cacheable blocks with system_prompt_blocks. Each SystemPromptBlock controls its own cache flag and ttl. This lets you cache static instructions while leaving dynamic per-request content uncached.
```python
from datetime import datetime
from kern.agent import Agent
from kern.models.anthropic import Claude, SystemPromptBlock

blocks = [
    # Static instructions, cached for 1 hour
    SystemPromptBlock(
        text="You are a senior software architect. Give concise, opinionated advice.",
        cache=True,
        ttl="1h",
    ),
    # Dynamic per-request context, never cached
    SystemPromptBlock(
        text=f"Current time: {datetime.now().isoformat()}",
        cache=False,
    ),
]

agent = Agent(
    model=Claude(
        id="claude-sonnet-4-5-20250929",
        cache_system_prompt=True,
        extended_cache_time=True,
        system_prompt_blocks=blocks,
    ),
    markdown=True,
)
```

Blocks are appended after the agent-built system message in the Anthropic system array. system_prompt_blocks may also be a zero-arg callable that returns the list, evaluated on every request, which is how you inject dynamic content into a cached prompt without reinstantiating the model.
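The callable form can be sketched as follows. To keep the snippet runnable without Kern installed, it defines a stand-in `SystemPromptBlock` dataclass that mirrors the documented fields; in real code you would import the class from kern.models.anthropic instead. The `build_blocks` name and its injectable clock are illustrative assumptions, not part of Kern.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Callable, List, Optional


@dataclass
class SystemPromptBlock:
    """Stand-in for kern.models.anthropic.SystemPromptBlock (assumed shape)."""
    text: str
    cache: bool = True
    ttl: Optional[str] = None


def build_blocks(now: Callable[[], datetime] = datetime.now) -> List[SystemPromptBlock]:
    """Rebuild the block list; callable with zero arguments thanks to the default clock."""
    return [
        # Static instructions: identical on every call, so the cached prefix stays valid
        SystemPromptBlock(text="You are a senior software architect.", cache=True, ttl="1h"),
        # Dynamic context: recomputed on every call and left uncached
        SystemPromptBlock(text=f"Current time: {now().isoformat()}", cache=False),
    ]


blocks = build_blocks()
print([(b.cache, b.ttl) for b in blocks])  # [(True, '1h'), (False, None)]
```

With Kern, you would pass the function itself (system_prompt_blocks=build_blocks) rather than its result, so that the list is rebuilt on every request.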
| SystemPromptBlock field | Type | Default | Description |
|---|---|---|---|
| text | str | required | The block content. |
| cache | bool | True | Add cache_control to this block. Independent of cache_system_prompt. |
| ttl | Optional["5m" \| "1h"] | None | Per-block TTL. Overrides the model-level extended_cache_time for this block. |
Anthropic requires any 1h cached block to appear before any 5m block in the request. Since the agent-built block comes first and inherits the model-level TTL, set extended_cache_time=True whenever any SystemPromptBlock uses ttl="1h". Kern validates this ordering at assembly time and raises a clear error if it is violated.
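The ordering rule can be illustrated with a small standalone check. This is a sketch of the constraint only, not Kern's actual validation code; the `validate_ttl_order` name and the error message are hypothetical. Each list entry stands for one block's TTL in request order, with None meaning the block is not cached.

```python
from typing import List, Optional


def validate_ttl_order(ttls: List[Optional[str]]) -> None:
    """Raise ValueError if a 1h block appears after a 5m block.

    None means the block is not cached and does not affect ordering.
    """
    seen_5m = False
    for index, ttl in enumerate(ttls):
        if ttl == "5m":
            seen_5m = True
        elif ttl == "1h" and seen_5m:
            raise ValueError(f"Block {index} uses ttl='1h' but follows a '5m' block")


validate_ttl_order(["1h", "1h", "5m", None])  # valid: passes silently

try:
    validate_ttl_order(["5m", "1h"])  # invalid: a 1h block after a 5m block
except ValueError as exc:
    print(exc)  # Block 1 uses ttl='1h' but follows a '5m' block
```

If the first entry is read as the agent-built block, the failing case `["5m", "1h"]` corresponds to leaving extended_cache_time unset while a SystemPromptBlock uses ttl="1h".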
Tool caching
Set cache_tools=True to cache tool definitions. Anthropic caches all tools as a prefix when cache_control is on the last tool.
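To make the prefix behavior concrete, the sketch below shows the shape of a tools array in an Anthropic request with cache_control placed on the last tool. It is presumably close to what cache_tools=True produces, but that is an assumption about Kern's internals. The `mark_last_tool_cacheable` helper and the two tool definitions are hypothetical.

```python
import copy
from typing import Any, Dict, List


def mark_last_tool_cacheable(tools: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Return a copy of the tool list with cache_control on the final tool."""
    marked = copy.deepcopy(tools)  # leave the caller's list untouched
    if marked:
        marked[-1]["cache_control"] = {"type": "ephemeral"}
    return marked


# Hypothetical tool definitions in Anthropic's tool format
tools = [
    {
        "name": "get_weather",
        "description": "Look up the weather.",
        "input_schema": {"type": "object", "properties": {}},
    },
    {
        "name": "get_time",
        "description": "Look up the current time.",
        "input_schema": {"type": "object", "properties": {}},
    },
]

for tool in mark_last_tool_cacheable(tools):
    print(tool["name"], "cache_control" in tool)
# get_weather False
# get_time True
```

Because the cache covers everything up to the marked tool, changing or reordering any tool definition invalidates the cached prefix.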
```python
agent = Agent(
    model=Claude(
        id="claude-sonnet-4-5-20250929",
        cache_tools=True,
    ),
    tools=[...],
)
```

Working example
```python
from pathlib import Path
from kern.agent import Agent
from kern.models.anthropic import Claude
from kern.utils.media import download_file

# Load an example large system message from S3. A large prompt like this would benefit from caching.
txt_path = Path(__file__).parent.joinpath("system_prompt.txt")
download_file(
    "https://kern-public.s3.amazonaws.com/prompts/system_promt.txt",
    str(txt_path),
)
system_message = txt_path.read_text()

agent = Agent(
    model=Claude(
        id="claude-sonnet-4-20250514",
        cache_system_prompt=True,  # Activate prompt caching for Anthropic to cache the system prompt
    ),
    system_message=system_message,
    markdown=True,
)

# First run - this will create the cache
response = agent.run(
    "Explain the difference between REST and GraphQL APIs with examples"
)
if response and response.metrics:
    print(f"First run cache write tokens = {response.metrics.cache_write_tokens}")

# Second run - this will use the cached system prompt
response = agent.run(
    "What are the key principles of clean code and how do I apply them in Python?"
)
if response and response.metrics:
    print(f"Second run cache read tokens = {response.metrics.cache_read_tokens}")
```
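To see how the cache write and cache read token counts translate into cost, the sketch below expresses input cost in units of one uncached input token. The multipliers are assumptions taken from Anthropic's published pricing for the default 5-minute cache (writes at 1.25x the base input price, reads at 0.1x); check current pricing before relying on them. The token counts and the `relative_input_cost` helper are hypothetical.

```python
def relative_input_cost(
    uncached: int,
    cache_write: int,
    cache_read: int,
    write_multiplier: float = 1.25,  # assumed 5-minute cache write price
    read_multiplier: float = 0.1,    # assumed cache read price
) -> float:
    """Input cost measured in units of one uncached input token."""
    return uncached + cache_write * write_multiplier + cache_read * read_multiplier


# Hypothetical numbers: a 2000-token system prompt and a 50-token user message
first_run = relative_input_cost(uncached=50, cache_write=2000, cache_read=0)
second_run = relative_input_cost(uncached=50, cache_write=0, cache_read=2000)
no_cache = relative_input_cost(uncached=2050, cache_write=0, cache_read=0)

print(f"{first_run:.0f} {second_run:.0f} {no_cache:.0f}")  # 2550 250 2050
```

Under these assumptions the first run costs slightly more than an uncached request, and every later run that hits the cache costs far less, so caching pays off from the second request onward.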