Prompt Caching

Cache system prompts to reduce processing time and costs with Anthropic models.

Prompt caching can help reduce processing time and costs. Consider it whenever the same prompt is sent multiple times in a flow.

You can read more about prompt caching with Anthropic models in Anthropic's documentation.

Usage

To use prompt caching in your Kern setup, pass the cache_system_prompt argument when initializing the Claude model:

from kern.agent import Agent
from kern.models.anthropic import Claude

agent = Agent(
    model=Claude(
        id="claude-3-5-sonnet-20241022",
        cache_system_prompt=True,
    ),
)

Note that prompt caching only takes effect when the prompt meets a minimum length, which depends on the model; shorter prompts are processed without caching. You can read more about this in Anthropic's docs.
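As a rough way to reason about the length requirement, the sketch below estimates whether a prompt is likely long enough to be cached. The helper name likely_cacheable is hypothetical and not part of Kern. The 1024-token default and the 4-characters-per-token ratio are assumptions for illustration only; check Anthropic's docs for the actual minimum for your model, and use a real token counter when precision matters.

```python
def likely_cacheable(
    prompt: str,
    min_tokens: int = 1024,        # assumed threshold, varies by model
    chars_per_token: float = 4.0,  # crude approximation for English text
) -> bool:
    """Rough pre-check: is the prompt probably long enough to be cached?"""
    estimated_tokens = len(prompt) / chars_per_token
    return estimated_tokens >= min_tokens


short_prompt = "You are a helpful assistant."
long_prompt = "Follow the style guide below. " * 200  # 6,000 characters

print(likely_cacheable(short_prompt))
print(likely_cacheable(long_prompt))
```

If the check fails for a prompt you expected to be cached, the usual fix is to move more of the static instructions into the system prompt rather than into individual user messages.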

Extended cache

You can also use Anthropic's extended cache beta feature, which extends the cache duration from 5 minutes to 1 hour. To activate it, pass the extended_cache_time argument together with the following beta header:

from kern.agent import Agent
from kern.models.anthropic import Claude

agent = Agent(
    model=Claude(
        id="claude-3-5-sonnet-20241022",
        betas=["extended-cache-ttl-2025-04-11"],
        cache_system_prompt=True,
        extended_cache_time=True,
    ),
)
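Whether the 1-hour cache pays off depends on how far apart your requests are. The sketch below compares the relative cost of the cached prefix under both durations. The multipliers (1.25x base input price for a 5-minute cache write, 2x for a 1-hour write, 0.1x for a cache read) reflect Anthropic's published pricing at the time of writing and should be treated as assumptions; the function name is hypothetical.

```python
def cached_prefix_cost(writes: int, reads: int, write_multiplier: float) -> float:
    """Cost of a cached prefix, in units of one uncached send of that prefix."""
    READ_MULTIPLIER = 0.1  # assumed price of a cache read relative to base input
    return writes * write_multiplier + reads * READ_MULTIPLIER


# Scenario: 6 requests spaced 10 minutes apart.
# With a 5-minute cache, every request finds the cache expired and rewrites it.
cost_5m = cached_prefix_cost(writes=6, reads=0, write_multiplier=1.25)
# With a 1-hour cache, the first request writes and the next five read.
cost_1h = cached_prefix_cost(writes=1, reads=5, write_multiplier=2.0)
cost_uncached = 6.0  # six full-price sends

print(f"5m cache: {cost_5m:.2f}  1h cache: {cost_1h:.2f}  uncached: {cost_uncached:.2f}")
```

For requests that arrive within 5 minutes of each other, the default duration is cheaper because its write multiplier is lower and each read refreshes the cache.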

Multi-block caching with per-block TTL

Split the system prompt into independently-cacheable blocks with system_prompt_blocks. Each SystemPromptBlock controls its own cache flag and ttl. This lets you cache static instructions while leaving dynamic per-request content uncached.

from datetime import datetime
from kern.agent import Agent
from kern.models.anthropic import Claude, SystemPromptBlock

blocks = [
    # Static instructions, cached for 1 hour
    SystemPromptBlock(
        text="You are a senior software architect. Give concise, opinionated advice.",
        cache=True,
        ttl="1h",
    ),
    # Dynamic per-request context, never cached
    SystemPromptBlock(
        text=f"Current time: {datetime.now().isoformat()}",
        cache=False,
    ),
]

agent = Agent(
    model=Claude(
        id="claude-sonnet-4-5-20250929",
        cache_system_prompt=True,
        extended_cache_time=True,
        system_prompt_blocks=blocks,
    ),
    markdown=True,
)

Blocks are appended after the agent-built system message in the Anthropic system array. system_prompt_blocks may also be a zero-arg callable that returns the list. The callable is evaluated on every request, which lets you inject dynamic content into a cached prompt without reinstantiating the model.
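A list built once at startup freezes any dynamic values (such as a timestamp) at construction time, so the callable form is what keeps them fresh. The sketch below shows the shape of such a callable. To stay runnable on its own it defines a stand-in dataclass with the three documented fields; in real code you would import SystemPromptBlock from kern.models.anthropic instead. The function name build_blocks is hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


# Stand-in mirroring the documented fields, so this sketch runs without Kern.
# In a real project: from kern.models.anthropic import SystemPromptBlock
@dataclass
class SystemPromptBlock:
    text: str
    cache: bool = True
    ttl: Optional[str] = None


def build_blocks() -> List[SystemPromptBlock]:
    """Zero-arg callable: the list is rebuilt each time it is called."""
    return [
        # Static instructions: identical on every call, so the cache can be reused
        SystemPromptBlock(
            text="You are a senior software architect. Give concise, opinionated advice.",
            cache=True,
            ttl="1h",
        ),
        # Dynamic context: recomputed on every call and left uncached
        SystemPromptBlock(
            text=f"Current time: {datetime.now().isoformat()}",
            cache=False,
        ),
    ]


# With Kern, pass the function itself rather than its result:
#   Claude(..., system_prompt_blocks=build_blocks)
blocks = build_blocks()
print(blocks[1].text)
```

Keeping the static block byte-for-byte identical between calls matters, because any change to cached content invalidates the cache entry.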

SystemPromptBlock field  Type                   Default   Description
text                     str                    required  The block content.
cache                    bool                   True      Add cache_control to this block. Independent of cache_system_prompt.
ttl                      Optional["5m" | "1h"]  None      Per-block TTL. Overrides the model-level extended_cache_time for this block.

Warning

Anthropic requires any 1h cached block to appear before any 5m block in the request. Since the agent-built block comes first and inherits the model-level TTL, set extended_cache_time=True whenever any SystemPromptBlock uses ttl="1h". Kern validates this ordering at assembly time and raises a clear error if it is violated.
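To make the ordering rule concrete, here is a minimal sketch of such a check over a list of system blocks in the Anthropic request format. It is an illustration of the rule, not Kern's actual validation code; the function name check_ttl_order is hypothetical, and the sketch assumes that a cache_control entry without an explicit ttl uses the 5-minute default.

```python
from typing import Any, Dict, List


def check_ttl_order(system_blocks: List[Dict[str, Any]]) -> None:
    """Raise ValueError if a 1h cached block appears after a 5m cached block."""
    seen_5m = False
    for index, block in enumerate(system_blocks):
        cache_control = block.get("cache_control")
        if cache_control is None:
            continue  # uncached blocks do not affect the ordering rule
        ttl = cache_control.get("ttl", "5m")  # assumed default when ttl is omitted
        if ttl == "5m":
            seen_5m = True
        elif ttl == "1h" and seen_5m:
            raise ValueError(f"block {index} uses ttl='1h' but follows a 5m block")


valid = [
    {"type": "text", "text": "agent-built", "cache_control": {"type": "ephemeral", "ttl": "1h"}},
    {"type": "text", "text": "static rules", "cache_control": {"type": "ephemeral", "ttl": "1h"}},
    {"type": "text", "text": "per-request context"},
]
invalid = [
    {"type": "text", "text": "agent-built", "cache_control": {"type": "ephemeral"}},
    {"type": "text", "text": "static rules", "cache_control": {"type": "ephemeral", "ttl": "1h"}},
]

check_ttl_order(valid)  # passes silently
try:
    check_ttl_order(invalid)
except ValueError as exc:
    print(f"rejected: {exc}")
```

The invalid list models the situation described in the warning: the agent-built block falls back to the 5-minute duration while a later block asks for 1 hour.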

Tool caching

Set cache_tools=True to cache tool definitions. Anthropic caches all tools as a prefix when cache_control is on the last tool.

from kern.agent import Agent
from kern.models.anthropic import Claude

agent = Agent(
    model=Claude(
        id="claude-sonnet-4-5-20250929",
        cache_tools=True,
    ),
    tools=[...],
)
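For intuition about what "cache_control on the last tool" means at the request level, the sketch below marks the final entry of a tools array in the Anthropic request format. This is an illustration under assumptions, not Kern's internal implementation; the helper name with_cached_tools and the two example tools are hypothetical.

```python
import copy
from typing import Any, Dict, List


def with_cached_tools(tools: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Return a copy of `tools` with cache_control set on the last definition."""
    if not tools:
        return []
    marked = copy.deepcopy(tools)  # leave the caller's list untouched
    marked[-1]["cache_control"] = {"type": "ephemeral"}
    return marked


# Hypothetical tool definitions, for illustration only
tools = [
    {
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "input_schema": {"type": "object", "properties": {"city": {"type": "string"}}},
    },
    {
        "name": "get_time",
        "description": "Return the current time in a timezone.",
        "input_schema": {"type": "object", "properties": {"tz": {"type": "string"}}},
    },
]

cached = with_cached_tools(tools)
print([tool.get("cache_control") for tool in cached])
```

Because the whole tool list is cached as one prefix, adding, removing, or reordering tools between requests produces a different prefix and therefore a cache miss.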

Working example

from pathlib import Path
from kern.agent import Agent
from kern.models.anthropic import Claude
from kern.utils.media import download_file

# Load an example large system message from S3. A large prompt like this would benefit from caching.
txt_path = Path(__file__).parent.joinpath("system_prompt.txt")
download_file(
    "https://kern-public.s3.amazonaws.com/prompts/system_promt.txt",
    str(txt_path),
)
system_message = txt_path.read_text()

agent = Agent(
    model=Claude(
        id="claude-sonnet-4-20250514",
        cache_system_prompt=True,  # Activate prompt caching for Anthropic to cache the system prompt
    ),
    system_message=system_message,
    markdown=True,
)

# First run - this will create the cache
response = agent.run(
    "Explain the difference between REST and GraphQL APIs with examples"
)
if response and response.metrics:
    print(f"First run cache write tokens = {response.metrics.cache_write_tokens}")

# Second run - this will use the cached system prompt
response = agent.run(
    "What are the key principles of clean code and how do I apply them in Python?"
)
if response and response.metrics:
    print(f"Second run cache read tokens = {response.metrics.cache_read_tokens}")