Ollama
Expert guidance for running local LLMs with Ollama.
Triggers
Use this skill when:
- Running LLMs locally for privacy or cost savings
- Setting up offline AI inference
- Managing local model deployments
- Working with open-source models (Llama, Mistral, etc.)
- Developing AI applications without cloud API costs
- Keywords: ollama, local llm, offline, self-hosted, llama, mistral, local model
Installation
# macOS
brew install ollama
# Linux
curl -fsSL https://ollama.com/install.sh | sh
# Windows
# Download from https://ollama.com/download
Start Server
# Start Ollama service
ollama serve
# Runs on http://localhost:11434
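Before sending requests, it can help to confirm that the server is actually listening. The following is a minimal sketch using only the standard library; `is_server_up` is a hypothetical helper name, and the check assumes the root URL answers with HTTP 200 when Ollama is running.

```python
import urllib.request


def is_server_up(base_url: str = "http://localhost:11434", timeout: float = 2.0) -> bool:
    """Return True if an HTTP server answers at base_url with status 200."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # URLError, HTTPError, refused connections and timeouts are all OSError subclasses
        return False
```

Calling `is_server_up()` with no arguments probes the default address shown above; a `False` result usually means `ollama serve` is not running or is bound to a different host or port.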
Model Management
# Pull models
ollama pull llama3.1
ollama pull llama3.1:70b
ollama pull mistral
ollama pull codellama
ollama pull phi3
ollama pull gemma2
# List models
ollama list
# Show model info
ollama show llama3.1
# Remove model
ollama rm llama3.1
# Copy model
ollama cp llama3.1 my-llama
# Run model interactively
ollama run llama3.1
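The same listing that `ollama list` prints is also available over HTTP at `/api/tags`, which is convenient when a script needs to check whether a model has been pulled. This is a sketch under assumptions: `model_names` and `fetch_tags` are hypothetical helper names, and `sample` is made-up data shaped like an `/api/tags` reply.

```python
import json
import urllib.request


def model_names(tags: dict) -> list[str]:
    """Extract model names from a parsed /api/tags response."""
    return sorted(m["name"] for m in tags.get("models", []))


def fetch_tags(base_url: str = "http://localhost:11434") -> dict:
    """GET /api/tags from a running Ollama server (requires the server to be up)."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return json.load(resp)


# Hypothetical sample payload, shaped like an /api/tags response
sample = {"models": [{"name": "mistral:latest"}, {"name": "llama3.1:latest"}]}
print(model_names(sample))  # ['llama3.1:latest', 'mistral:latest']
```

With a server running, `model_names(fetch_tags())` returns the locally available models.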
API Usage
Python
import requests
response = requests.post(
"http://localhost:11434/api/generate",
json={
"model": "llama3.1",
"prompt": "What is Python?",
"stream": False
}
)
print(response.json()["response"])
Python with OpenAI SDK
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:11434/v1",
api_key="ollama" # Required but unused
)
response = client.chat.completions.create(
model="llama3.1",
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is Python?"}
]
)
print(response.choices[0].message.content)
Streaming
import json

import requests

# Request a streamed reply; the server sends one JSON object per line
response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3.1",
        "prompt": "Write a poem",
        "stream": True
    },
    stream=True
)
for line in response.iter_lines():
    if line:
        data = json.loads(line)
        print(data.get("response", ""), end="", flush=True)
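When the full text is needed rather than incremental printing, the streamed lines can be accumulated into one string. The sketch below assumes each line is a JSON object with a `response` fragment and that the last object carries `"done": true`; `collect_stream` is a hypothetical helper, and `fake_lines` stands in for `response.iter_lines()`.

```python
import json
from typing import Iterable


def collect_stream(lines: Iterable[bytes]) -> str:
    """Join the "response" fragments of a streamed /api/generate reply."""
    parts = []
    for line in lines:
        if not line:
            continue  # iter_lines() can yield empty keep-alive lines
        chunk = json.loads(line)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break  # the final object carries done=true
    return "".join(parts)


# Hypothetical chunks standing in for response.iter_lines()
fake_lines = [
    b'{"response": "Roses ", "done": false}',
    b"",
    b'{"response": "are red", "done": false}',
    b'{"response": "", "done": true}',
]
print(collect_stream(fake_lines))  # Roses are red
```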
Chat API
import requests
response = requests.post(
"http://localhost:11434/api/chat",
json={
"model": "llama3.1",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Hello!"}
],
"stream": False  # /api/chat streams by default; .json() needs a single JSON object
}
)
print(response.json()["message"]["content"])
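The chat endpoint is stateless, so a multi-turn conversation has to resend the whole message history on each request. One way to manage that is sketched below; `ChatSession` and `echo_backend` are hypothetical names, and the backend is a stand-in so the example runs without a server.

```python
from typing import Callable


class ChatSession:
    """Keeps the message history that /api/chat expects on every request."""

    def __init__(self, send: Callable[[list[dict]], str], system: str = ""):
        self.send = send  # callable that posts the messages and returns the reply text
        self.messages: list[dict] = []
        if system:
            self.messages.append({"role": "system", "content": system})

    def ask(self, text: str) -> str:
        self.messages.append({"role": "user", "content": text})
        reply = self.send(self.messages)
        # Store the assistant turn so the next request carries the full context
        self.messages.append({"role": "assistant", "content": reply})
        return reply


def echo_backend(messages: list[dict]) -> str:
    """Stand-in for a real call to /api/chat; echoes the last user message."""
    return "echo: " + messages[-1]["content"]


session = ChatSession(echo_backend, system="You are a helpful assistant.")
session.ask("Hello!")
session.ask("How are you?")
print(len(session.messages))  # 5: one system message plus two user/assistant pairs
```

To talk to a real model, replace `echo_backend` with a function that posts the messages to `/api/chat` with `"stream": False` and returns the `message.content` field of the reply.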
Embeddings
import requests
response = requests.post(
"http://localhost:11434/api/embeddings",
json={
"model": "llama3.1",
"prompt": "Hello world"
}
)
embedding = response.json()["embedding"]
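Embedding vectors are typically compared with cosine similarity, for example to rank documents against a query. A minimal pure-Python sketch follows; `cosine_similarity` is a hypothetical helper and the vectors in the usage note are toy values, not real model output.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)
```

For instance, `cosine_similarity([1.0, 0.0], [1.0, 0.0])` gives 1.0 for identical directions, while orthogonal vectors such as `[1.0, 0.0]` and `[0.0, 1.0]` give 0.0.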
Custom Models (Modelfile)
Create Custom Model
# Modelfile
FROM llama3.1
# Set parameters
PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER num_ctx 4096
# Set system prompt
SYSTEM """You are a helpful coding assistant specializing in Python.
Always provide code examples and explain your reasoning."""
# Set template (optional; it must match the base model's prompt format, and the tags below are illustrative)
TEMPLATE """{{ if .System }}<|system|>
{{ .System }}<|end|>
{{ end }}{{ if .Prompt }}<|user|>
{{ .Prompt }}<|end|>
{{ end }}<|assistant|>
{{ .Response }}<|end|>"""
# Create model
ollama create my-coder -f Modelfile
# Run custom model
ollama run my-coder
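If several model variants are needed, the Modelfile text can be generated from Python instead of edited by hand. This is only a sketch: `render_modelfile` is a hypothetical helper that covers the `FROM`, `PARAMETER`, and `SYSTEM` instructions used above and nothing else.

```python
def render_modelfile(base: str, parameters: dict, system: str = "") -> str:
    """Build Modelfile text from a base model, parameter values, and a system prompt."""
    lines = [f"FROM {base}"]
    for name, value in parameters.items():
        lines.append(f"PARAMETER {name} {value}")
    if system:
        lines.append(f'SYSTEM """{system}"""')
    return "\n".join(lines) + "\n"


text = render_modelfile(
    "llama3.1",
    {"temperature": 0.7, "num_ctx": 4096},
    system="You are a helpful coding assistant.",
)
print(text)
```

Writing `text` to a file named `Modelfile` lets it be passed to `ollama create my-coder -f Modelfile` as shown above.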
Import GGUF Models
# Modelfile
FROM ./mistral-7b-instruct-v0.2.Q4_K_M.gguf
PARAMETER temperature 0.7
PARAMETER num_ctx 8192
TEMPLATE """[INST] {{ .Prompt }} [/INST]
{{ .Response }}"""
Generation Parameters
import requests
response = requests.post(
"http://localhost:11434/api/generate",
json={
"model": "llama3.1",
"prompt": "Hello",
"options": {
"temperature": 0.7,
"top_p": 0.9,
"top_k": 40,
"num_predict": 256,
"num_ctx": 4096,
"repeat_penalty": 1.1,
"seed": 42
}
}
)
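In an application it is often easier to keep one set of default options and override them per request. The sketch below assumes that approach; `DEFAULT_OPTIONS` and `build_generate_payload` are hypothetical names, and the defaults simply reuse values from the example above. Fixing `seed` together with a temperature of 0 is a common way to aim for repeatable output on the same setup.

```python
DEFAULT_OPTIONS = {"temperature": 0.7, "top_p": 0.9, "num_ctx": 4096}


def build_generate_payload(model: str, prompt: str, **overrides) -> dict:
    """Merge per-request option overrides into the defaults for /api/generate."""
    options = {**DEFAULT_OPTIONS, **overrides}  # overrides win; defaults are not mutated
    return {"model": model, "prompt": prompt, "stream": False, "options": options}


# Deterministic-leaning settings: no sampling randomness, fixed seed
payload = build_generate_payload("llama3.1", "Hello", temperature=0.0, seed=42)
print(payload["options"])
```

The resulting `payload` can be passed as the `json=` argument of the request shown above.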
Vision Models
import base64
import requests
# Encode image as base64 text
with open("image.jpg", "rb") as f:
    image_data = base64.b64encode(f.read()).decode()
response = requests.post(
"http://localhost:11434/api/generate",
json={
"model": "llava",
"prompt": "What's in this image?",
"images": [image_data]
}
)
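Because `images` is a list, more than one image can be sent in a single request. The following sketch builds such a payload from raw bytes; `build_vision_payload` is a hypothetical helper, and the byte strings are placeholders rather than valid image files.

```python
import base64


def build_vision_payload(model: str, prompt: str, image_bytes_list: list[bytes]) -> dict:
    """Base64-encode each image and assemble an /api/generate request body."""
    images = [base64.b64encode(raw).decode("ascii") for raw in image_bytes_list]
    return {"model": model, "prompt": prompt, "images": images, "stream": False}


# Placeholder bytes; in practice read these from image files opened in "rb" mode
payload = build_vision_payload("llava", "Compare these images.", [b"\x89PNG", b"\xff\xd8"])
print(len(payload["images"]))  # 2
```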
LangChain Integration
from langchain_community.llms import Ollama
from langchain_community.chat_models import ChatOllama
# LLM
llm = Ollama(model="llama3.1")
response = llm.invoke("What is Python?")
# Chat model
chat = ChatOllama(model="llama3.1")
response = chat.invoke([
("system", "You are helpful."),
("human", "Hello!")
])
LlamaIndex Integration
from llama_index.llms.ollama import Ollama
from llama_index.embeddings.ollama import OllamaEmbedding
from llama_index.core import Settings
Settings.llm = Ollama(model="llama3.1", request_timeout=120.0)
Settings.embed_model = OllamaEmbedding(model_name="llama3.1")
Docker Deployment
# docker-compose.yml
services:
  ollama:
    image: ollama/ollama:latest
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
volumes:
  ollama_data:
# Pull a model inside the running service container
docker compose exec ollama ollama pull llama3.1
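A container may take a moment to accept connections after it starts, so scripts that pull models or send prompts right away can fail. One way to handle this is to poll until a readiness check passes. The sketch below is generic: `wait_until_ready` is a hypothetical helper, and `probe` is any zero-argument callable returning a boolean (for example, an HTTP GET against port 11434).

```python
import time
from typing import Callable


def wait_until_ready(probe: Callable[[], bool], attempts: int = 30, delay: float = 1.0) -> bool:
    """Call probe() until it returns True or the attempts run out."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            time.sleep(delay)  # no sleep after the final failed attempt
    return False


# Fake probe for illustration: reports ready on the third call
calls = {"count": 0}


def fake_probe() -> bool:
    calls["count"] += 1
    return calls["count"] >= 3


print(wait_until_ready(fake_probe, attempts=5, delay=0.0))  # True
```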
Environment Variables
# Model storage location
OLLAMA_MODELS=/path/to/models
# Server bind address and port (0.0.0.0 exposes the API on all network interfaces)
OLLAMA_HOST=0.0.0.0:11434
# GPU settings
OLLAMA_NUM_GPU=1
CUDA_VISIBLE_DEVICES=0
# Memory settings
OLLAMA_MAX_LOADED_MODELS=2
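Client scripts can reuse `OLLAMA_HOST` so the server address is configured in one place. The sketch below is a simplified, hypothetical helper (`base_url_from_env`), not Ollama's own parsing logic: it handles `host`, `host:port`, and full URLs, maps the bind-all address `0.0.0.0` to loopback, and does not handle IPv6 literals.

```python
from typing import Mapping


def base_url_from_env(env: Mapping[str, str]) -> str:
    """Derive a client base URL from an OLLAMA_HOST-style value (simplified rules)."""
    value = env.get("OLLAMA_HOST", "").strip()
    if not value:
        return "http://localhost:11434"
    if "://" in value:
        return value.rstrip("/")  # already a full URL; use it as given
    host, _, port = value.partition(":")
    if host in ("", "0.0.0.0"):
        host = "127.0.0.1"  # a bind-all address is not a destination; use loopback
    return f"http://{host}:{port or '11434'}"


print(base_url_from_env({"OLLAMA_HOST": "0.0.0.0:11434"}))  # http://127.0.0.1:11434
```

In a real script, pass `os.environ` as the argument.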
Popular Models
| Model | Size | Use Case |
|---|---|---|
| llama3.1 | 8B | General purpose |
| llama3.1:70b | 70B | Complex reasoning |
| mistral | 7B | Fast, efficient |
| codellama | 7B-34B | Code generation |
| phi3 | 3.8B | Small but capable |
| gemma2 | 9B | Google's model |
| llava | 7B | Vision + language |
| nomic-embed-text | - | Embeddings |
