Speech-to-Text Agent

Build an Agent that transcribes audio files into structured data with speaker identification and conversation metadata. It relies on a multimodal model to process the audio directly.

What You'll Learn

By building this agent, you'll understand:

How to use Pydantic schemas for structured transcription output
How to configure a multimodal Agent for audio processing
How to identify speakers in audio conversations

Use Cases

Transcribe meeting recordings with speaker identification, convert podcast episodes into searchable text, create subtitles for video content, or build voice note analyzers with structured transcription data.

How It Works

The Agent uses multimodal capabilities to process audio directly and output structured transcription data:

Input: Accepts audio files in WAV format from URLs (or local files)
Process: A multimodal model analyzes the audio content and identifies speakers
Structure: Output is validated against a Pydantic schema
Output: Returns typed data with a description and an ordered list of utterances, each with a speaker and text

The structured output makes transcriptions immediately usable in downstream applications without additional parsing.
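To make the validation step concrete, here is a minimal sketch of how a raw JSON payload can be checked against a Pydantic schema. It assumes Pydantic v2 and mirrors the Transcription schema used in the examples on this page. The two sample payloads are invented for illustration, and this is not the Agent's internal code, only an equivalent standalone check.

```python
from pydantic import BaseModel, Field, ValidationError


class Utterance(BaseModel):
    speaker: str = Field(..., description="Name or identifier of the speaker")
    text: str = Field(..., description="What was said by the speaker")


class Transcription(BaseModel):
    description: str = Field(..., description="A description of the audio conversation")
    utterances: list[Utterance] = Field(
        ..., description="Sequential list of utterances in conversation order"
    )


# Hypothetical model output, invented for illustration only.
raw_ok = """
{
  "description": "Two people greet each other.",
  "utterances": [
    {"speaker": "Speaker 1", "text": "Hi, how are you?"},
    {"speaker": "Speaker 2", "text": "Um, good, thanks."}
  ]
}
"""

# Same shape, but the second utterance is missing its required "text" field.
raw_bad = """
{
  "description": "Two people greet each other.",
  "utterances": [
    {"speaker": "Speaker 1", "text": "Hi, how are you?"},
    {"speaker": "Speaker 2"}
  ]
}
"""

# A valid payload becomes a typed object with attribute access.
transcription = Transcription.model_validate_json(raw_ok)
first = transcription.utterances[0]
print(f"{first.speaker}: {first.text}")

# An invalid payload raises ValidationError instead of passing through silently.
try:
    Transcription.model_validate_json(raw_bad)
    validation_failed = False
except ValidationError as exc:
    validation_failed = True
    print(f"Rejected payload with {len(exc.errors())} error(s)")
```

Once validation succeeds, downstream code can rely on every utterance having both a speaker and a text field.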

STT example using Gemini 3 Flash Preview

import httpx
from kern.agent import Agent
from kern.media import Audio
from kern.models.google import Gemini
from pydantic import BaseModel, Field

INSTRUCTIONS = """
Transcribe the audio accurately and completely.

Speaker identification:
- Use the speaker's name if mentioned in the conversation
- Otherwise use 'Speaker 1', 'Speaker 2', etc. consistently

Non-speech audio:
- Note significant non-speech elements (e.g., [long pause], [music], [background noise]) only when relevant to understanding the conversation
- Ignore brief natural pauses

Include everything spoken, even false starts and filler words (um, uh, etc.).
"""


class Utterance(BaseModel):
    speaker: str = Field(..., description="Name or identifier of the speaker")
    text: str = Field(..., description="What was said by the speaker")


class Transcription(BaseModel):
    description: str = Field(..., description="A description of the audio conversation")
    utterances: list[Utterance] = Field(
        ..., description="Sequential list of utterances in conversation order"
    )


# Fetch the audio file as raw bytes
# Simple audio file with a single speaker
# url = "https://openaiassets.blob.core.windows.net/$web/API/docs/audio/alloy.wav"
# Audio file with multiple speakers
url = "https://kern-public.s3.us-east-1.amazonaws.com/demo_data/sample_audio.wav"

try:
    response = httpx.get(url)
    response.raise_for_status()
    wav_data = response.content
except httpx.HTTPError as e:
    # httpx.HTTPError covers both network failures and non-2xx responses
    raise ValueError(f"Error fetching audio file: {url}") from e

# Provide the agent with the audio file and get the result as a structured Transcription
agent = Agent(
    model=Gemini(id="gemini-3-flash-preview"),
    markdown=True,
    instructions=INSTRUCTIONS,
    output_schema=Transcription,
)

agent.print_response(
    "Give a transcript of the audio conversation",
    audio=[Audio(content=wav_data)],
)
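The example above downloads the audio from a URL. For a local file, the bytes can be read from disk and passed to Audio(content=...) in the same way. The sketch below uses only the standard library. `load_wav_bytes` is a hypothetical helper name, and the `wave` check is an optional sanity test, not something the Agent is known to require.

```python
import wave
from pathlib import Path


def load_wav_bytes(path: str) -> bytes:
    """Read a local WAV file and return its raw bytes.

    Hypothetical helper: the file is first opened with the stdlib `wave`
    module so that a non-WAV or empty file fails early with a clear error.
    """
    file_path = Path(path)
    try:
        with wave.open(str(file_path), "rb") as wav_file:
            if wav_file.getnframes() == 0:
                raise ValueError(f"Audio file contains no frames: {path}")
    except (wave.Error, EOFError) as e:
        # wave.Error: bad header or unsupported encoding; EOFError: truncated file
        raise ValueError(f"Not a readable WAV file: {path}") from e
    return file_path.read_bytes()
```

With this helper, `wav_data = load_wav_bytes("meeting.wav")` could replace the httpx download, where "meeting.wav" is a placeholder path. Note that the `wave` module only understands uncompressed PCM, so the check would need to be dropped for other WAV encodings.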

STT example using OpenAI gpt-audio

import httpx
from kern.agent import Agent
from kern.media import Audio
from kern.models.openai import OpenAIResponses
from pydantic import BaseModel, Field

INSTRUCTIONS = """
Transcribe the audio accurately and completely.

Speaker identification:
- Use the speaker's name if mentioned in the conversation
- Otherwise use 'Speaker 1', 'Speaker 2', etc. consistently

Non-speech audio:
- Note significant non-speech elements (e.g., [long pause], [music], [background noise]) only when relevant to understanding the conversation
- Ignore brief natural pauses

Include everything spoken, even false starts and filler words (um, uh, etc.).
"""


class Utterance(BaseModel):
    speaker: str = Field(..., description="Name or identifier of the speaker")
    text: str = Field(..., description="What was said by the speaker")


class Transcription(BaseModel):
    description: str = Field(..., description="A description of the audio conversation")
    utterances: list[Utterance] = Field(
        ..., description="Sequential list of utterances in conversation order"
    )


# Fetch the audio file as raw bytes
# Simple audio file with a single speaker
# url = "https://openaiassets.blob.core.windows.net/$web/API/docs/audio/alloy.wav"
# Audio file with multiple speakers
url = "https://kern-public.s3.us-east-1.amazonaws.com/demo_data/sample_audio.wav"

try:
    response = httpx.get(url)
    response.raise_for_status()
    wav_data = response.content
except httpx.HTTPError as e:
    # httpx.HTTPError covers both network failures and non-2xx responses
    raise ValueError(f"Error fetching audio file: {url}") from e

# Provide the agent with the audio file and get the result as a structured Transcription
agent = Agent(
    model=OpenAIResponses(id="gpt-audio-2025-08-28", modalities=["text"]),
    markdown=True,
    instructions=INSTRUCTIONS,
    output_schema=Transcription,
    # We use a parser model here as gpt-audio-2025-08-28 cannot return structured output by itself
    parser_model=OpenAIResponses(id="gpt-5.2"),
)

agent.print_response(
    "Give a transcript of the audio conversation",
    audio=[Audio(content=wav_data, format="wav")],
)

What to Expect

The agent processes audio files and returns a structured Transcription object containing:

description: A summary describing what the audio is about
utterances: Sequential list of utterances, each attributed to a speaker (by name if mentioned, otherwise "Speaker 1", "Speaker 2", etc.)

Utterances appear in conversation order, and each one contains:

speaker: Name or identifier of the speaker
text: What was said by the speaker

Processing time depends on audio length; it is typically 10-30 seconds for files under 5 minutes.
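Because the output is already structured, downstream code can work with its fields directly. As a small sketch, the helpers below take the dictionary form of a transcription (the shape Pydantic's `model_dump()` would produce for the schema above) and render a readable transcript plus per-speaker word counts. The function names and the sample data are hypothetical.

```python
from collections import Counter


def format_transcript(transcription: dict) -> str:
    """Render utterances as 'Speaker: text' lines in conversation order."""
    lines = [f"{u['speaker']}: {u['text']}" for u in transcription["utterances"]]
    return "\n".join(lines)


def words_per_speaker(transcription: dict) -> Counter:
    """Count whitespace-separated words spoken by each speaker."""
    counts: Counter = Counter()
    for u in transcription["utterances"]:
        counts[u["speaker"]] += len(u["text"].split())
    return counts


# Invented sample data in the shape of the Transcription schema.
sample = {
    "description": "A short exchange about a meeting time.",
    "utterances": [
        {"speaker": "Alice", "text": "Can we meet at three?"},
        {"speaker": "Speaker 2", "text": "Uh, sure."},
        {"speaker": "Alice", "text": "Great."},
    ],
}

print(format_transcript(sample))
print(dict(words_per_speaker(sample)))
```

Working from plain dictionaries keeps the sketch free of dependencies; the same logic applies to attribute access on the typed objects.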

Usage

Set up your virtual environment

macOS/Linux:

uv venv --python 3.12
source .venv/bin/activate

Windows:

uv venv --python 3.12
.venv\Scripts\activate

Set your API key

export GOOGLE_API_KEY=xxx

For the OpenAI example, set OPENAI_API_KEY instead:

export OPENAI_API_KEY=xxx

Install dependencies

uv pip install -U kern-ai google-genai httpx

The OpenAI example additionally needs the openai package:

uv pip install -U openai

Run Agent

python speech_to_text_agent.py

Next Steps

Remove the structured output and use the text output instead if your use case does not require structured outputs
Extend the Transcription schema with additional fields like sentiment or topics
Try processing different audio formats (MP3, WAV, M4A)
Combine with other tools for enhanced analysis
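As one way to approach the schema extension suggested above, the sketch below adds a per-utterance sentiment label and a list of topics. It assumes Pydantic v2. The field names, the allowed sentiment labels, and the class name `AnnotatedTranscription` are assumptions chosen for illustration, and the field descriptions would likely need tuning for whichever model is used.

```python
from typing import Literal, Optional

from pydantic import BaseModel, Field, ValidationError


class Utterance(BaseModel):
    speaker: str = Field(..., description="Name or identifier of the speaker")
    text: str = Field(..., description="What was said by the speaker")
    # Hypothetical extra field: optional, so plain transcripts still validate.
    sentiment: Optional[Literal["positive", "neutral", "negative"]] = Field(
        None, description="Overall sentiment of this utterance, if discernible"
    )


class AnnotatedTranscription(BaseModel):
    description: str = Field(..., description="A description of the audio conversation")
    utterances: list[Utterance] = Field(
        ..., description="Sequential list of utterances in conversation order"
    )
    # Hypothetical extra field: defaults to an empty list when omitted.
    topics: list[str] = Field(
        default_factory=list, description="Main topics discussed in the conversation"
    )


# Invented sample data for illustration only.
result = AnnotatedTranscription.model_validate(
    {
        "description": "A customer thanks a support agent.",
        "utterances": [
            {"speaker": "Speaker 1", "text": "Thanks so much!", "sentiment": "positive"},
            {"speaker": "Speaker 2", "text": "You're welcome."},
        ],
    }
)
print(result.utterances[0].sentiment, result.topics)

# Labels outside the allowed set are rejected at validation time.
try:
    Utterance(speaker="Speaker 1", text="Hmm.", sentiment="confused")
    bad_label_rejected = False
except ValidationError:
    bad_label_rejected = True
```

The extended class would then be passed as output_schema in place of Transcription. Keeping the new fields optional means a response that omits them still validates.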