Speech-to-Text Agent
Build an Agent that transcribes audio files into structured data: a description of the conversation and a sequence of utterances, each attributed to a speaker. The Agent passes the audio directly to a multimodal model, which returns the transcription in a typed schema.
What You'll Learn
By building this agent, you'll understand:
- How to use Pydantic schemas for structured transcription output
- How to configure a multimodal Agent for audio processing
- How to identify speakers in audio conversations
Use Cases
Transcribe meeting recordings with speaker identification, convert podcast episodes into searchable text, create subtitles for video content, or build voice note analyzers with structured transcription data.
How It Works
The Agent uses multimodal capabilities to process audio directly and output structured transcription data:
- Input: Accepts audio files in WAV format from URLs (or local files)
- Process: Multimodal model analyzes audio content and identifies speakers
- Structure: Output is validated against a Pydantic schema
- Output: Returns typed data with transcript, description, and speaker list
The structured output makes transcriptions immediately usable in downstream applications without additional parsing.
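For the local-file case mentioned in the Input step, the audio bytes can be read from disk instead of fetched over HTTP. The following is a minimal sketch using only the Python standard library. The helper name load_wav_bytes and the five-minute default limit are illustrative assumptions, not part of the library.

```python
import io
import wave
from pathlib import Path


def load_wav_bytes(path: str, max_seconds: float = 300.0) -> tuple[bytes, float]:
    """Read a local WAV file and return (raw bytes, duration in seconds).

    Hypothetical helper: the name and the max_seconds limit are assumptions
    made for this sketch.
    """
    data = Path(path).read_bytes()
    # Parse the header with the standard-library wave module to confirm the
    # file is a readable WAV and to compute its duration.
    with wave.open(io.BytesIO(data), "rb") as wav:
        duration = wav.getnframes() / wav.getframerate()
    if duration > max_seconds:
        raise ValueError(
            f"Audio is {duration:.1f}s long; limit is {max_seconds:.0f}s"
        )
    return data, duration
```

The returned bytes could then be passed as Audio(content=...) in the same way as the downloaded data in the examples that follow.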
STT example using Gemini 3 Flash Preview
```python
import httpx
from kern.agent import Agent, RunOutput  # noqa
from kern.media import Audio
from kern.models.google import Gemini
from pydantic import BaseModel, Field

INSTRUCTIONS = """
Transcribe the audio accurately and completely.

Speaker identification:
- Use the speaker's name if mentioned in the conversation
- Otherwise use 'Speaker 1', 'Speaker 2', etc. consistently

Non-speech audio:
- Note significant non-speech elements (e.g., [long pause], [music], [background noise]) only when relevant to understanding the conversation
- Ignore brief natural pauses

Include everything spoken, even false starts and filler words (um, uh, etc.).
"""


class Utterance(BaseModel):
    speaker: str = Field(..., description="Name or identifier of the speaker")
    text: str = Field(..., description="What was said by the speaker")


class Transcription(BaseModel):
    description: str = Field(..., description="A description of the audio conversation")
    utterances: list[Utterance] = Field(
        ..., description="Sequential list of utterances in conversation order"
    )


# Fetch the audio file as raw bytes
# Simple audio file with a single speaker
# url = "https://openaiassets.blob.core.windows.net/$web/API/docs/audio/alloy.wav"
# Audio file with multiple speakers
url = "https://kern-public.s3.us-east-1.amazonaws.com/demo_data/sample_audio.wav"

try:
    response = httpx.get(url)
    response.raise_for_status()
    wav_data = response.content
except httpx.HTTPStatusError as e:
    raise ValueError(f"Error fetching audio file: {url}") from e

# Provide the agent with the audio file and get the result as a Transcription
agent = Agent(
    model=Gemini(id="gemini-3-flash-preview"),
    markdown=True,
    instructions=INSTRUCTIONS,
    output_schema=Transcription,
)

agent.print_response(
    "Give a transcript of the audio conversation",
    audio=[Audio(content=wav_data)],
)
```
STT example using OpenAI gpt-audio
```python
import httpx
from kern.agent import Agent, RunOutput  # noqa
from kern.media import Audio
from kern.models.openai import OpenAIResponses
from pydantic import BaseModel, Field

INSTRUCTIONS = """
Transcribe the audio accurately and completely.

Speaker identification:
- Use the speaker's name if mentioned in the conversation
- Otherwise use 'Speaker 1', 'Speaker 2', etc. consistently

Non-speech audio:
- Note significant non-speech elements (e.g., [long pause], [music], [background noise]) only when relevant to understanding the conversation
- Ignore brief natural pauses

Include everything spoken, even false starts and filler words (um, uh, etc.).
"""


class Utterance(BaseModel):
    speaker: str = Field(..., description="Name or identifier of the speaker")
    text: str = Field(..., description="What was said by the speaker")


class Transcription(BaseModel):
    description: str = Field(..., description="A description of the audio conversation")
    utterances: list[Utterance] = Field(
        ..., description="Sequential list of utterances in conversation order"
    )


# Fetch the audio file as raw bytes
# Simple audio file with a single speaker
# url = "https://openaiassets.blob.core.windows.net/$web/API/docs/audio/alloy.wav"
# Audio file with multiple speakers
url = "https://kern-public.s3.us-east-1.amazonaws.com/demo_data/sample_audio.wav"

try:
    response = httpx.get(url)
    response.raise_for_status()
    wav_data = response.content
except httpx.HTTPStatusError as e:
    raise ValueError(f"Error fetching audio file: {url}") from e

# Provide the agent with the audio file and get the result as a Transcription
agent = Agent(
    model=OpenAIResponses(id="gpt-audio-2025-08-28", modalities=["text"]),
    markdown=True,
    instructions=INSTRUCTIONS,
    output_schema=Transcription,
    # We use a parser model here as gpt-audio-2025-08-28 cannot return structured output by itself
    parser_model=OpenAIResponses(id="gpt-5.2"),
)

agent.print_response(
    "Give a transcript of the audio conversation",
    audio=[Audio(content=wav_data, format="wav")],
)
```
What to Expect
The agent processes audio files and returns a structured Transcription object containing:
- description: A summary describing what the audio is about
- utterances: Sequential list of utterances, each attributed to a speaker (names if mentioned, otherwise "Speaker 1", "Speaker 2", etc.)
Utterances appear in the order they occur in the audio, and each one contains:
- speaker: Name or identifier of the speaker
- text: What was said by the speaker
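Because the result is a typed object, downstream code can read attributes rather than parse free text. The sketch below assumes Pydantic v2 and a JSON payload that matches the schema. The sample payload is made up for illustration, and render_transcript is a hypothetical helper, not something the library provides.

```python
from pydantic import BaseModel, Field, ValidationError


class Utterance(BaseModel):
    speaker: str = Field(..., description="Name or identifier of the speaker")
    text: str = Field(..., description="What was said by the speaker")


class Transcription(BaseModel):
    description: str = Field(..., description="A description of the audio conversation")
    utterances: list[Utterance] = Field(
        ..., description="Sequential list of utterances in conversation order"
    )


def render_transcript(transcription: Transcription) -> str:
    """Format utterances as 'Speaker: text' lines (hypothetical helper)."""
    return "\n".join(f"{u.speaker}: {u.text}" for u in transcription.utterances)


# Made-up payload standing in for a model response.
raw = """
{
  "description": "Two people greet each other.",
  "utterances": [
    {"speaker": "Speaker 1", "text": "Hi, um, how are you?"},
    {"speaker": "Speaker 2", "text": "Good, thanks."}
  ]
}
"""

transcription = Transcription.model_validate_json(raw)
print(render_transcript(transcription))

# A payload missing a required field is rejected instead of passed along.
try:
    Transcription.model_validate_json('{"description": "no utterances"}')
except ValidationError as exc:
    print(f"Invalid transcription: {exc.error_count()} error(s)")
```

Validating at this boundary means a malformed response fails early, before it reaches code that assumes every utterance has a speaker and text.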
Processing time depends on audio length, typically 10-30 seconds for files under 5 minutes.
Usage
Set up your virtual environment
On macOS or Linux:
```bash
uv venv --python 3.12
source .venv/bin/activate
```
On Windows:
```bash
uv venv --python 3.12
.venv\Scripts\activate
```
Set your API key
```bash
export GOOGLE_API_KEY=xxx
```
Install dependencies
```bash
uv pip install -U kern-ai google-genai httpx
```
Run Agent
```bash
python speech_to_text_agent.py
```
Next Steps
- Remove the structured output and use the text output instead if your use case does not require structured outputs
- Extend the Transcription schema with additional fields like sentiment or topics
- Try processing different audio formats (MP3, WAV, M4A)
- Combine with other tools for enhanced analysis
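As one way to approach the schema extension suggested in the list, new fields can be added in a subclass with defaults, so that payloads containing only the original fields still validate. This is a sketch assuming Pydantic v2. The field names sentiment and topics come from the suggestion itself, while their types, allowed values, and the class name ExtendedTranscription are assumptions.

```python
from typing import Literal, Optional

from pydantic import BaseModel, Field


class Utterance(BaseModel):
    speaker: str = Field(..., description="Name or identifier of the speaker")
    text: str = Field(..., description="What was said by the speaker")


class Transcription(BaseModel):
    description: str = Field(..., description="A description of the audio conversation")
    utterances: list[Utterance] = Field(
        ..., description="Sequential list of utterances in conversation order"
    )


class ExtendedTranscription(Transcription):
    # Assumed label set; adjust to your use case.
    sentiment: Optional[Literal["positive", "neutral", "negative"]] = Field(
        None, description="Overall sentiment of the conversation"
    )
    # Assumed to be free-form strings chosen by the model.
    topics: list[str] = Field(
        default_factory=list, description="Main topics discussed"
    )
```

The extended class could then be supplied as output_schema in place of Transcription. Whether a given model provider accepts optional fields in a structured-output schema is worth checking for the model you use.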