Triton Deployment
NVIDIA Triton Inference Server management.
Quick Start
# Pull Triton image
docker pull nvcr.io/nvidia/tritonserver:24.01-py3
# Run server
docker run --gpus all -p 8000:8000 -p 8001:8001 -p 8002:8002 \
-v $(pwd)/models:/models \
nvcr.io/nvidia/tritonserver:24.01-py3 \
tritonserver --model-repository=/models
Model Repository Structure
models/
└── my_model/
├── config.pbtxt
└── 1/
└── model.onnx
Config Template
# config.pbtxt
name: "my_model"
platform: "onnxruntime_onnx"
max_batch_size: 8
input [
{ name: "input", data_type: TYPE_FP32, dims: [3, 224, 224] }
]
output [
{ name: "output", data_type: TYPE_FP32, dims: [1000] }
]
Health Check
# Ready check
curl localhost:8000/v2/health/ready
# Model status
curl localhost:8000/v2/models/my_model
Inference
# HTTP
curl -X POST localhost:8000/v2/models/my_model/infer \
-H "Content-Type: application/json" \
-d '{"inputs": [{"name": "input", "shape": [1,3,224,224], "datatype": "FP32", "data": [...]}]}'
# gRPC
grpcurl -d '...' localhost:8001 inference.GRPCInferenceService/ModelInfer
Best Practices
- Use dynamic batching for throughput
- Enable model warmup
- Monitor with Prometheus metrics (:8002)
- Use model versioning (1/, 2/, etc.)
