vLLM

vLLM is a fast and easy-to-use library for LLM inference and serving, designed for high-throughput and memory-efficient LLM serving.

Prerequisites

Install vLLM and start serving a model:

1uv pip install vllm

1vllm serve Qwen/Qwen2.5-7B-Instruct \
2    --enable-auto-tool-choice \
3    --tool-call-parser hermes \
4    --dtype float16 \
5    --max-model-len 8192 \
6    --gpu-memory-utilization 0.9

This spins up the vLLM server with an OpenAI-compatible API.

NoteThe default vLLM server URL is http://localhost:8000/

Example

Basic Agent

1from kern.agent import Agent
2from kern.models.vllm import VLLM
3
4agent = Agent(
5    model=VLLM(
6        id="meta-llama/Llama-3.1-8B-Instruct",
7        base_url="http://localhost:8000/",
8    ),
9    markdown=True
10)
11
12agent.print_response("Share a 2 sentence horror story.")

Advanced Usage

With Tools

vLLM models work seamlessly with Kern tools:

1from kern.agent import Agent
2from kern.models.vllm import VLLM
3from kern.tools.hackernews import HackerNewsTools
4
5agent = Agent(
6    model=VLLM(id="meta-llama/Llama-3.1-8B-Instruct"),
7    tools=[HackerNewsTools()],
8    markdown=True
9)
10
11agent.print_response("What's the latest news about AI?")

Note View more examples here.

For the full list of supported models, see the vLLM documentation.

Params

Parameter	Type	Default	Description
`id`	`str`	`"microsoft/DialoGPT-medium"`	The id of the model to use with vLLM
`name`	`str`	`"vLLM"`	The name of the model
`provider`	`str`	`"vLLM"`	The provider of the model
`api_key`	`Optional[str]`	`None`	The API key (usually not needed for local vLLM)
`base_url`	`str`	`"http://localhost:8000/v1"`	The base URL for the vLLM server

VLLM is a subclass of the Model class and has access to the same params.