vLLM
vLLM is a fast and easy-to-use library for LLM inference and serving, designed for high-throughput and memory-efficient LLM serving.
Prerequisites
Install vLLM and start serving a model:
1uv pip install vllm1vllm serve Qwen/Qwen2.5-7B-Instruct \2 --enable-auto-tool-choice \3 --tool-call-parser hermes \4 --dtype float16 \5 --max-model-len 8192 \6 --gpu-memory-utilization 0.9This spins up the vLLM server with an OpenAI-compatible API.
NoteThe default vLLM server URL is
http://localhost:8000/Example
Basic Agent
1from kern.agent import Agent2from kern.models.vllm import VLLM34agent = Agent(5 model=VLLM(6 id="meta-llama/Llama-3.1-8B-Instruct",7 base_url="http://localhost:8000/",8 ),9 markdown=True10)1112agent.print_response("Share a 2 sentence horror story.")Advanced Usage
With Tools
vLLM models work seamlessly with Kern tools:
1from kern.agent import Agent2from kern.models.vllm import VLLM3from kern.tools.hackernews import HackerNewsTools45agent = Agent(6 model=VLLM(id="meta-llama/Llama-3.1-8B-Instruct"),7 tools=[HackerNewsTools()],8 markdown=True9)1011agent.print_response("What's the latest news about AI?")Note View more examples here.
For the full list of supported models, see the vLLM documentation.
Params
| Parameter | Type | Default | Description |
|---|---|---|---|
id | str | "microsoft/DialoGPT-medium" | The id of the model to use with vLLM |
name | str | "vLLM" | The name of the model |
provider | str | "vLLM" | The provider of the model |
api_key | Optional[str] | None | The API key (usually not needed for local vLLM) |
base_url | str | "http://localhost:8000/v1" | The base URL for the vLLM server |
VLLM is a subclass of the Model class and has access to the same params.