Purpose

Deploy and optimize LLM inference servers using vLLM, TGI, or Ollama with proper batching, quantization, and resource management.

When to use this skill

deploying a model with vLLM or TGI for production inference
configuring continuous batching and max concurrent requests
fitting models into GPU memory with quantization (AWQ, GPTQ, GGUF)
benchmarking throughput (tokens/sec) and latency (TTFT, TPS)

Do not use this skill when

choosing which model to use — prefer model-selection
running on CPU only — prefer offline-cpu-inference
building agent memory or RAG — prefer agent-memory or embeddings-indexing

Procedure

Choose serving framework — vLLM for high-throughput GPU serving, TGI for HuggingFace models, Ollama for local dev.
Select quantization — AWQ/GPTQ for GPU (4-bit, minimal quality loss), GGUF for llama.cpp/CPU. Match quant to VRAM budget.
Launch server — vllm serve <model> --tensor-parallel-size <gpus> --max-model-len <ctx> --quantization awq.
Configure batching — vLLM uses continuous batching by default. Set --max-num-seqs to control concurrency.
Set memory limits — --gpu-memory-utilization 0.9 for vLLM. Reserve 10% for KV cache overhead.
Add OpenAI-compatible API — vLLM and TGI both serve /v1/chat/completions. Point clients at http://localhost:8000.
Benchmark — python -m vllm.entrypoints.openai.api_server --benchmark or use locust/hey for load testing.
Monitor — track GPU utilization (nvidia-smi), request queue depth, TTFT (time to first token), and throughput.

vLLM deployment

# Basic serving
pip install vllm
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --tensor-parallel-size 1 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.9

# With quantization (fits larger models in less VRAM)
vllm serve TheBloke/Llama-3.1-70B-AWQ \
  --quantization awq \
  --tensor-parallel-size 2 \
  --max-model-len 4096

TGI deployment

docker run --gpus all -p 8080:80 \
  -v /data:/data \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id meta-llama/Llama-3.1-8B-Instruct \
  --max-batch-total-tokens 32768 \
  --max-input-tokens 4096

GPU memory estimation

Model params * bytes per param = VRAM needed
  7B  * 2 bytes (fp16)  = ~14 GB
  7B  * 0.5 bytes (4bit) = ~3.5 GB + KV cache
  70B * 2 bytes (fp16)  = ~140 GB (2x A100 80GB)
  70B * 0.5 bytes (4bit) = ~35 GB + KV cache

Decision rules

Use vLLM for production GPU serving — best throughput via PagedAttention.
Use Ollama for local development and testing — simplest setup.
AWQ quantization for GPU, GGUF for CPU — do not mix formats.
Set max-model-len to actual max needed, not model max — saves KV cache memory.
Always benchmark with realistic request patterns before deploying.

References

Related skills

model-selection — choosing the right model
llama-cpp — CPU/hybrid inference with llama.cpp
offline-cpu-inference — CPU-only optimization

inference-servingSafety 90Repository

Package Files

Purpose

When to use this skill

Do not use this skill when

Procedure

vLLM deployment

TGI deployment

GPU memory estimation

Decision rules

References

Related skills

Install

AI Quality Score

Metadata

Tags

inference-servingSafety 90Repository ShareFavorite skill

Package Files

Purpose

When to use this skill

Do not use this skill when

Procedure

vLLM deployment

TGI deployment

GPU memory estimation

Decision rules

References

Related skills

Install

AI Quality Score

Metadata

Tags

inference-servingSafety 90Repository