vLLM

vLLM is a fast and easy-to-use library for LLM inference and serving, designed for high-throughput and memory-efficient LLM serving.

Prerequisites

Install vLLM and start serving a model:

1uv pip install vllm
1vllm serve Qwen/Qwen2.5-7B-Instruct \
2 --enable-auto-tool-choice \
3 --tool-call-parser hermes \
4 --dtype float16 \
5 --max-model-len 8192 \
6 --gpu-memory-utilization 0.9

This spins up the vLLM server with an OpenAI-compatible API.

NoteThe default vLLM server URL is http://localhost:8000/

Example

Basic Agent

1from kern.agent import Agent
2from kern.models.vllm import VLLM
3
4agent = Agent(
5 model=VLLM(
6 id="meta-llama/Llama-3.1-8B-Instruct",
7 base_url="http://localhost:8000/",
8 ),
9 markdown=True
10)
11
12agent.print_response("Share a 2 sentence horror story.")

Advanced Usage

With Tools

vLLM models work seamlessly with Kern tools:

1from kern.agent import Agent
2from kern.models.vllm import VLLM
3from kern.tools.hackernews import HackerNewsTools
4
5agent = Agent(
6 model=VLLM(id="meta-llama/Llama-3.1-8B-Instruct"),
7 tools=[HackerNewsTools()],
8 markdown=True
9)
10
11agent.print_response("What's the latest news about AI?")
Note View more examples here.

For the full list of supported models, see the vLLM documentation.

Params

ParameterTypeDefaultDescription
idstr"microsoft/DialoGPT-medium"The id of the model to use with vLLM
namestr"vLLM"The name of the model
providerstr"vLLM"The provider of the model
api_keyOptional[str]NoneThe API key (usually not needed for local vLLM)
base_urlstr"http://localhost:8000/v1"The base URL for the vLLM server

VLLM is a subclass of the Model class and has access to the same params.