Purpose
Deploy and optimize LLM inference servers using vLLM, TGI, or Ollama with proper batching, quantization, and resource management.
When to use this skill
- deploying a model with vLLM or TGI for production inference
- configuring continuous batching and max concurrent requests
- fitting models into GPU memory with quantization (AWQ, GPTQ, GGUF)
- benchmarking throughput (tokens/sec) and latency (TTFT, TPS)
Do not use this skill when
- choosing which model to use — prefer
model-selection - running on CPU only — prefer
offline-cpu-inference - building agent memory or RAG — prefer
agent-memoryorembeddings-indexing
Procedure
- Choose serving framework — vLLM for high-throughput GPU serving, TGI for HuggingFace models, Ollama for local dev.
- Select quantization — AWQ/GPTQ for GPU (4-bit, minimal quality loss), GGUF for llama.cpp/CPU. Match quant to VRAM budget.
- Launch server —
vllm serve <model> --tensor-parallel-size <gpus> --max-model-len <ctx> --quantization awq. - Configure batching — vLLM uses continuous batching by default. Set
--max-num-seqsto control concurrency. - Set memory limits —
--gpu-memory-utilization 0.9for vLLM. Reserve 10% for KV cache overhead. - Add OpenAI-compatible API — vLLM and TGI both serve
/v1/chat/completions. Point clients athttp://localhost:8000. - Benchmark —
python -m vllm.entrypoints.openai.api_server --benchmarkor uselocust/heyfor load testing. - Monitor — track GPU utilization (
nvidia-smi), request queue depth, TTFT (time to first token), and throughput.
vLLM deployment
# Basic serving
pip install vllm
vllm serve meta-llama/Llama-3.1-8B-Instruct \
--tensor-parallel-size 1 \
--max-model-len 8192 \
--gpu-memory-utilization 0.9
# With quantization (fits larger models in less VRAM)
vllm serve TheBloke/Llama-3.1-70B-AWQ \
--quantization awq \
--tensor-parallel-size 2 \
--max-model-len 4096
TGI deployment
docker run --gpus all -p 8080:80 \
-v /data:/data \
ghcr.io/huggingface/text-generation-inference:latest \
--model-id meta-llama/Llama-3.1-8B-Instruct \
--max-batch-total-tokens 32768 \
--max-input-tokens 4096
GPU memory estimation
Model params * bytes per param = VRAM needed
7B * 2 bytes (fp16) = ~14 GB
7B * 0.5 bytes (4bit) = ~3.5 GB + KV cache
70B * 2 bytes (fp16) = ~140 GB (2x A100 80GB)
70B * 0.5 bytes (4bit) = ~35 GB + KV cache
Decision rules
- Use vLLM for production GPU serving — best throughput via PagedAttention.
- Use Ollama for local development and testing — simplest setup.
- AWQ quantization for GPU, GGUF for CPU — do not mix formats.
- Set
max-model-lento actual max needed, not model max — saves KV cache memory. - Always benchmark with realistic request patterns before deploying.
References
Related skills
model-selection— choosing the right modelllama-cpp— CPU/hybrid inference with llama.cppoffline-cpu-inference— CPU-only optimization
