Ollama

Expert guidance for running local LLMs with Ollama.

Triggers

Use this skill when:

- Running LLMs locally for privacy or cost savings
- Setting up offline AI inference
- Managing local model deployments
- Working with open-source models (Llama, Mistral, etc.)
- Developing AI applications without cloud API costs

Keywords: ollama, local llm, offline, self-hosted, llama, mistral, local model

Installation

# macOS
brew install ollama

# Linux
curl -fsSL https://ollama.com/install.sh | sh

# Windows
# Download from https://ollama.com/download

Start Server

# Start Ollama service
ollama serve

# Runs on http://localhost:11434
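
Before sending prompts, it can help to confirm the server is reachable. The sketch below is one possible check, using only the standard library: it queries the `/api/tags` endpoint (the HTTP counterpart of `ollama list`) and returns `None` when nothing answers. The function name `list_local_models` and the 2-second timeout are illustrative choices, not part of Ollama.

```python
import json
import urllib.error
import urllib.request


def list_local_models(base_url="http://localhost:11434", timeout=2.0):
    """Return the names of locally available models, or None if the server is unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            payload = json.load(resp)
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, or a non-JSON reply: treat as "not running"
        return None
    return [m["name"] for m in payload.get("models", [])]


if __name__ == "__main__":
    models = list_local_models()
    if models is None:
        print("Ollama is not reachable; is `ollama serve` running?")
    else:
        print("Ollama is up. Local models:", models)
```

An empty list means the server is running but no models have been pulled yet.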

Model Management

# Pull models
ollama pull llama3.1
ollama pull llama3.1:70b
ollama pull mistral
ollama pull codellama
ollama pull phi3
ollama pull gemma2

# List models
ollama list

# Show model info
ollama show llama3.1

# Remove model
ollama rm llama3.1

# Copy model
ollama cp llama3.1 my-llama

# Run model interactively
ollama run llama3.1

API Usage

Python

import requests

response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3.1",
        "prompt": "What is Python?",
        "stream": False
    }
)
print(response.json()["response"])

Python with OpenAI SDK

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama"  # Required but unused
)

response = client.chat.completions.create(
    model="llama3.1",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is Python?"}
    ]
)
print(response.choices[0].message.content)

Streaming

import json
import requests

response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3.1",
        "prompt": "Write a poem",
        "stream": True
    },
    stream=True
)

# Each non-empty line is one JSON object; print fragments as they arrive
for line in response.iter_lines():
    if line:
        data = json.loads(line)
        print(data.get("response", ""), end="", flush=True)
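
The streamed body is newline-delimited JSON: each line carries a text fragment in `response`, and the final line has `"done": true`. A minimal sketch of collecting the fragments into one string follows. The helper name `collect_stream` and the sample lines are made up for illustration, and the `error` check assumes a failure arrives as a JSON object with an `error` key.

```python
import json


def collect_stream(lines):
    """Concatenate the "response" fragments from an iterable of NDJSON lines."""
    parts = []
    for raw in lines:
        if not raw:
            continue  # iter_lines() can yield empty keep-alive lines
        chunk = json.loads(raw)
        if "error" in chunk:
            # Assumption: failures are reported as {"error": "..."}
            raise RuntimeError(chunk["error"])
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break  # final chunk; nothing more to read
    return "".join(parts)


# Hand-written sample lines standing in for response.iter_lines()
sample = [
    b'{"response": "Roses ", "done": false}',
    b"",
    b'{"response": "are red", "done": false}',
    b'{"response": "", "done": true}',
]
print(collect_stream(sample))
```

With a live server, `collect_stream(response.iter_lines())` would replace the sample list.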

Chat API

import requests

response = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "llama3.1",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Hello!"}
        ],
        # The endpoint streams by default; disable it to get a single JSON object
        "stream": False
    }
)
print(response.json()["message"]["content"])
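
The chat endpoint keeps no conversation state between requests, so a multi-turn exchange means resending the earlier messages each time. The sketch below shows one way to assemble such a request body. `build_chat_payload` is a hypothetical helper, and the assistant reply in `history` is invented sample text.

```python
def build_chat_payload(model, history, user_message, system=None):
    """Return a /api/chat request body that replays the prior turns."""
    messages = []
    if system:
        messages.append({"role": "system", "content": system})
    messages.extend(history)  # earlier user/assistant turns, oldest first
    messages.append({"role": "user", "content": user_message})
    return {"model": model, "messages": messages, "stream": False}


# Invented sample history for illustration
history = [
    {"role": "user", "content": "Hello!"},
    {"role": "assistant", "content": "Hi! How can I help?"},
]
payload = build_chat_payload(
    "llama3.1", history, "What is Python?", system="You are a helpful assistant."
)
print([m["role"] for m in payload["messages"]])
```

After each reply, appending both the user message and the returned `message` object to `history` keeps the next request in context.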

Embeddings

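import requests

response = requests.post(
    "http://localhost:11434/api/embeddings",
    json={
        # A dedicated embedding model; run `ollama pull nomic-embed-text` first
        "model": "nomic-embed-text",
        "prompt": "Hello world"
    }
)
embedding = response.json()["embedding"]  # list of floats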
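
Embeddings are usually compared rather than read directly. As a rough illustration, cosine similarity between two vectors can be computed with the standard library alone. The function below is a generic sketch, not an Ollama API, and the short vectors are toy values rather than real model output.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # similarity is undefined for a zero vector; report 0
    return dot / (norm_a * norm_b)


# Toy vectors standing in for two `embedding` results
print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
```

Vectors are only comparable when they come from the same embedding model.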

Custom Models (Modelfile)

Create Custom Model

# Modelfile
FROM llama3.1

# Set parameters
PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER num_ctx 4096

# Set system prompt
SYSTEM """You are a helpful coding assistant specializing in Python.
Always provide code examples and explain your reasoning."""

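# Set template (optional). Omit this to inherit the base model's template;
# if you override it, the special tokens must match the base model's prompt format.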
TEMPLATE """{{ if .System }}<|system|>
{{ .System }}<|end|>
{{ end }}{{ if .Prompt }}<|user|>
{{ .Prompt }}<|end|>
{{ end }}<|assistant|>
{{ .Response }}<|end|>"""

# Create model
ollama create my-coder -f Modelfile

# Run custom model
ollama run my-coder

Import GGUF Models

# Modelfile
FROM ./mistral-7b-instruct-v0.2.Q4_K_M.gguf

PARAMETER temperature 0.7
PARAMETER num_ctx 8192

TEMPLATE """[INST] {{ .Prompt }} [/INST]
{{ .Response }}"""

Generation Parameters

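import requests

response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3.1",
        "prompt": "Hello",
        "stream": False,
        "options": {
            "temperature": 0.7,     # sampling randomness
            "top_p": 0.9,           # nucleus sampling cutoff
            "top_k": 40,            # sample from the 40 most likely tokens
            "num_predict": 256,     # maximum tokens to generate
            "num_ctx": 4096,        # context window size in tokens
            "repeat_penalty": 1.1,  # discourage repetition
            "seed": 42              # fixed seed for repeatable sampling
        }
    }
)
print(response.json()["response"])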

Vision Models

import base64
import requests

# Encode image as base64 text
with open("image.jpg", "rb") as f:
    image_data = base64.b64encode(f.read()).decode()

response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llava",
        "prompt": "What's in this image?",
        "images": [image_data],
        "stream": False
    }
)
print(response.json()["response"])

LangChain Integration

# Requires: pip install langchain-ollama
# (the langchain_community Ollama classes are deprecated in favor of this package)
from langchain_ollama import ChatOllama, OllamaLLM

# LLM
llm = OllamaLLM(model="llama3.1")
response = llm.invoke("What is Python?")

# Chat model
chat = ChatOllama(model="llama3.1")
response = chat.invoke([
    ("system", "You are helpful."),
    ("human", "Hello!")
])

LlamaIndex Integration

from llama_index.llms.ollama import Ollama
from llama_index.embeddings.ollama import OllamaEmbedding
from llama_index.core import Settings

Settings.llm = Ollama(model="llama3.1", request_timeout=120.0)
# Use a dedicated embedding model; run `ollama pull nomic-embed-text` first
Settings.embed_model = OllamaEmbedding(model_name="nomic-embed-text")

Docker Deployment

# docker-compose.yml
services:
  ollama:
    image: ollama/ollama:latest
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

volumes:
  ollama_data:

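# Pull model in container (addresses the Compose service by name;
# plain `docker exec ollama ...` only works if the container is named "ollama")
docker compose exec ollama ollama pull llama3.1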

Environment Variables

# Model storage location
OLLAMA_MODELS=/path/to/models

# Server host/port
OLLAMA_HOST=0.0.0.0:11434

# GPU settings
OLLAMA_NUM_GPU=1
CUDA_VISIBLE_DEVICES=0

# Memory settings
OLLAMA_MAX_LOADED_MODELS=2
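
Client code can honor the same `OLLAMA_HOST` setting instead of hard-coding `localhost:11434`. The sketch below is a simplified take on that idea: `ollama_base_url` is a hypothetical helper, it assumes the value is either `host`, `host:port`, or a full URL, and it does not handle IPv6 literals. A bind address of `0.0.0.0` is mapped to `localhost`, since it is not a usable destination for a client.

```python
import os

DEFAULT_PORT = "11434"


def ollama_base_url(env=None):
    """Derive a client base URL from OLLAMA_HOST, falling back to the default."""
    env = os.environ if env is None else env
    host = env.get("OLLAMA_HOST", "").strip()
    if not host:
        return f"http://localhost:{DEFAULT_PORT}"
    if "://" not in host:
        host = "http://" + host  # bare host[:port] implies plain HTTP
    scheme, rest = host.split("://", 1)
    rest = rest.rstrip("/")
    if ":" not in rest:
        rest += ":" + DEFAULT_PORT  # simplification: always append the default port
    if rest.startswith("0.0.0.0:"):
        # 0.0.0.0 means "listen on all interfaces"; connect via localhost instead
        rest = "localhost:" + rest.split(":", 1)[1]
    return f"{scheme}://{rest}"


print(ollama_base_url({"OLLAMA_HOST": "0.0.0.0:11434"}))
```

The result can be joined with an endpoint path, for example `ollama_base_url() + "/api/generate"`.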

Popular Models

| Model | Size | Use Case |
|-------|------|----------|
| `llama3.1` | 8B | General purpose |
| `llama3.1:70b` | 70B | Complex reasoning |
| `mistral` | 7B | Fast, efficient |
| `codellama` | 7B-34B | Code generation |
| `phi3` | 3.8B | Small but capable |
| `gemma2` | 9B | Google's model |
| `llava` | 7B | Vision + language |
| `nomic-embed-text` | - | Embeddings |
