llama-cpp


Secondary local LLM inference engine via llama.cpp. This skill should be used when running GGUF models directly, loading LoRA adapters for Kothar, benchmarking inference speed, or serving models via llama-server. Complements Ollama (which remains primary for RLAMA and general use).

Updated 3/15/2026

SKILL.md

llama.cpp - Secondary Inference Engine

Direct access to llama.cpp for faster inference, LoRA adapter loading, and benchmarking on Apple Silicon. Ollama remains primary for RLAMA and general use; llama.cpp is the power tool.

Prerequisites

brew install llama.cpp

Binaries: llama-cli, llama-server, llama-embedding, llama-quantize

Quick Reference

Resolve Ollama Model to GGUF Path

To avoid duplicating model files, resolve an Ollama model name to its GGUF blob path:

~/.claude/skills/llama-cpp/scripts/ollama_model_path.sh qwen2.5:7b
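The script itself is not shown here, so the following is only a sketch of how such a lookup could work. It assumes Ollama's usual on-disk layout (a JSON manifest per tag under `manifests/registry.ollama.ai/library/`, with the weights layer tagged `application/vnd.ollama.image.model`) and handles official library models only. The function name `ollama_blob_path` is hypothetical.

```python
import json
from pathlib import Path


def ollama_blob_path(model: str, models_dir: Path) -> Path:
    """Resolve an Ollama model name such as 'qwen2.5:7b' to its GGUF blob path."""
    name, _, tag = model.partition(":")
    tag = tag or "latest"
    # Assumption: official library models keep their manifest at this location.
    manifest = models_dir / "manifests" / "registry.ollama.ai" / "library" / name / tag
    layers = json.loads(manifest.read_text())["layers"]
    for layer in layers:
        # Assumption: the weights layer carries this media type.
        if layer["mediaType"] == "application/vnd.ollama.image.model":
            # A digest 'sha256:abc' is stored on disk as the file 'sha256-abc'.
            return models_dir / "blobs" / layer["digest"].replace(":", "-")
    raise FileNotFoundError(f"no model layer in manifest for {model}")
```

A call such as `ollama_blob_path("qwen2.5:7b", Path.home() / ".ollama" / "models")` would return a path inside the shared blobs directory, so llama.cpp reads the same file Ollama already downloaded.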

Run Inference

GGUF=$(~/.claude/skills/llama-cpp/scripts/ollama_model_path.sh qwen2.5:7b)
llama-cli -m "$GGUF" -p "Your prompt here" -n 128 --n-gpu-layers all --single-turn --simple-io --no-display-prompt

Start API Server

To start an OpenAI-compatible server on port 8081 (chosen to avoid Ollama's port 11434):

~/.claude/skills/llama-cpp/scripts/llama_serve.sh <model.gguf>

# Or with options:
PORT=8082 CTX=8192 ~/.claude/skills/llama-cpp/scripts/llama_serve.sh <model.gguf>
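How the wrapper maps `PORT` and `CTX` onto `llama-server` is not shown, so this is a rough sketch of one plausible mapping. The function name is hypothetical, the `4096` context default is an assumption (only the 8081 port default is stated above), and `--n-gpu-layers all` is assumed to be accepted by `llama-server` the same way it is by `llama-cli`.

```python
import os


def llama_server_command(model_path: str, env=None) -> list[str]:
    """Build a llama-server argv from PORT/CTX style environment overrides."""
    env = os.environ if env is None else env
    port = env.get("PORT", "8081")  # default port stated in this document
    ctx = env.get("CTX", "4096")    # assumed default; the script's real default is not shown
    return [
        "llama-server",
        "-m", model_path,
        "--port", port,
        "--ctx-size", ctx,
        "--n-gpu-layers", "all",    # assumed to match the llama-cli flag
    ]
```

Passing the environment in as a plain mapping keeps the function easy to check without touching the real process environment.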

Test the server:

curl http://localhost:8081/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"default","messages":[{"role":"user","content":"Hello"}]}'
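The same request can be sent from a script. The sketch below mirrors the curl call using only the standard library; the helper names `chat_request` and `chat` are hypothetical, and the response is assumed to follow the OpenAI chat-completions shape (`choices[0].message.content`).

```python
import json
import urllib.request


def chat_request(prompt: str, base_url: str = "http://localhost:8081") -> urllib.request.Request:
    """Build the POST request equivalent to the curl example."""
    body = {"model": "default", "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str, base_url: str = "http://localhost:8081") -> str:
    """Send the prompt to a running server and return the reply text."""
    with urllib.request.urlopen(chat_request(prompt, base_url), timeout=120) as resp:
        reply = json.load(resp)
    # Assumption: OpenAI-style responses put the text here.
    return reply["choices"][0]["message"]["content"]
```

With the server running, `chat("Hello")` returns the model's reply as a string.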

Benchmark (llama.cpp vs Ollama)

~/.claude/skills/llama-cpp/scripts/llama_bench.sh qwen2.5:7b

Reports prompt processing and generation tok/s for both engines side by side.
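To compare runs programmatically, the timing line can be parsed out of captured output. This sketch assumes the `[ Prompt: X t/s | Generation: Y t/s ]` format described under Subprocess Best Practices; `parse_timings` and `percent_faster` are hypothetical helper names, not part of `llama_bench.sh`.

```python
import re

# Matches a line such as "[ Prompt: 512.3 t/s | Generation: 48.7 t/s ]".
TIMING_RE = re.compile(
    r"\[\s*Prompt:\s*([\d.]+)\s*t/s\s*\|\s*Generation:\s*([\d.]+)\s*t/s\s*\]"
)


def parse_timings(output: str):
    """Return (prompt_tps, generation_tps), or None if no timing line is present."""
    match = TIMING_RE.search(output)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))


def percent_faster(candidate_tps: float, baseline_tps: float) -> float:
    """Relative speedup of candidate over baseline, in percent."""
    return 100.0 * (candidate_tps - baseline_tps) / baseline_tps
```

Feeding both engines' generation rates to `percent_faster` gives a figure that can be checked against the speedup quoted later in this document.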

LoRA Adapter Inference

Load a LoRA adapter dynamically on top of a base GGUF model (no merge required):

~/.claude/skills/llama-cpp/scripts/llama_lora.sh <base.gguf> <lora.gguf> "Your prompt"

This is the key advantage over Ollama: hot-swap LoRA adapters without rebuilding models.
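As an illustration of hot-swapping, the sketch below builds one `llama-cli` invocation per adapter against a single base model, using the `--lora` flag. The function name `lora_commands` is hypothetical, and the actual arguments used by `llama_lora.sh` are not shown here, so treat the flag set as an assumption.

```python
def lora_commands(base_gguf: str, adapters: list[str], prompt: str) -> dict[str, list[str]]:
    """Map each adapter path to a llama-cli argv that loads it on the same base model."""
    return {
        adapter: [
            "llama-cli",
            "-m", base_gguf,
            "--lora", adapter,  # adapter is applied at load time; the base file is untouched
            "-p", prompt,
            "--single-turn", "--simple-io", "--no-display-prompt",
        ]
        for adapter in adapters
    }
```

Because only the `--lora` argument changes between commands, comparing adapters needs no rebuilt or re-imported model.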

Convert Kothar LoRA to GGUF

Convert HuggingFace LoRA adapters from the Kothar training pipeline into a merged GGUF model:

python3 ~/.claude/skills/llama-cpp/scripts/convert_lora_to_gguf.py \
  --base NousResearch/Hermes-2-Mistral-7B-DPO \
  --lora <path-or-hf-id> \
  --output kothar-q4_k_m.gguf \
  --quantize q4_k_m
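The internals of `convert_lora_to_gguf.py` are not shown. A common way to get from a merged Hugging Face model directory to a quantized GGUF is llama.cpp's `convert_hf_to_gguf.py` followed by `llama-quantize`, and the sketch below lists those two commands. It assumes the LoRA has already been merged into the base model (for example with PEFT's `merge_and_unload`) and that a llama.cpp checkout lives at the hypothetical `llama_cpp_dir`; `conversion_commands` is likewise a hypothetical name.

```python
def conversion_commands(merged_dir: str, output: str, quantize: str = "q4_k_m",
                        llama_cpp_dir: str = "llama.cpp") -> list[list[str]]:
    """Commands that turn an already-merged HF model directory into a quantized GGUF."""
    # Intermediate full-precision file, written next to the final output.
    f16 = output.removesuffix(".gguf") + ".f16.gguf"
    return [
        # Step 1: export the merged model to an unquantized (f16) GGUF.
        ["python3", f"{llama_cpp_dir}/convert_hf_to_gguf.py", merged_dir,
         "--outfile", f16, "--outtype", "f16"],
        # Step 2: quantize the f16 GGUF to the requested type.
        ["llama-quantize", f16, output, quantize.upper()],
    ]
```

Each inner list can be passed to `subprocess.run(..., check=True)` in order; the intermediate f16 file can be deleted afterwards.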

When to Use llama.cpp vs Ollama

Task                      Use
RLAMA queries             Ollama (native integration)
Quick model chat          Ollama (ollama run)
LoRA adapter testing      llama.cpp (llama_lora.sh)
Benchmarking tok/s        llama.cpp (llama_bench.sh)
Maximum inference speed   llama.cpp (10-20% faster)
Custom server config      llama.cpp (llama_serve.sh)
Embedding generation      Either (Ollama simpler, llama-embedding more control)
Kothar GGUF conversion    llama.cpp (convert_lora_to_gguf.py)

Architecture

Ollama (primary, port 11434)          llama.cpp (secondary, port 8081)
├── RLAMA RAG queries                 ├── LoRA adapter hot-loading
├── Model management (pull/list)      ├── Benchmarking
├── General chat                      ├── Custom server configs
└── Embeddings (nomic-embed-text)     └── Kothar GGUF conversion

Both share the same GGUF model files (~/.ollama/models/blobs/)

Subprocess Best Practices (Build 7940+)

When calling llama-cli from scripts or subprocesses:

  • Always use --single-turn: it generates one response and then exits, which prevents an interactive chat mode hang.
  • Always use --simple-io: it suppresses the ANSI spinner that otherwise floods redirected output.
  • Always use --no-display-prompt: it suppresses the prompt echo.
  • Use --n-gpu-layers all instead of the legacy -ngl 999.
  • Use --flash-attn on, not a bare --flash-attn: the flag now takes an argument.
  • Timing stats appear in stdout as [ Prompt: X t/s | Generation: Y t/s ] (controlled by --show-timings, which is on by default).
  • Redirect stderr to a file rather than capturing it in a variable: spinner output can overflow bash variables.
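The practices above can be sketched from Python as follows. Only the flags named in the list are used; the helper names `llama_cli_command` and `run_captured`, the `-n 128` default, and the 300-second timeout are illustrative assumptions.

```python
import subprocess
from pathlib import Path

# Flags recommended above for non-interactive use.
SUBPROCESS_FLAGS = ["--single-turn", "--simple-io", "--no-display-prompt"]


def llama_cli_command(model_path: str, prompt: str, n_predict: int = 128) -> list[str]:
    """Build a llama-cli argv that is safe to run without a terminal."""
    return [
        "llama-cli", "-m", model_path, "-p", prompt, "-n", str(n_predict),
        "--n-gpu-layers", "all",
        "--flash-attn", "on",  # explicit value, not a bare flag
        *SUBPROCESS_FLAGS,
    ]


def run_captured(cmd: list[str], stderr_log: Path, timeout: float = 300.0) -> str:
    """Run cmd, returning stdout as text while stderr goes to a file."""
    with stderr_log.open("wb") as err:
        done = subprocess.run(
            cmd,
            stdin=subprocess.DEVNULL,  # never wait on interactive input
            stdout=subprocess.PIPE,
            stderr=err,                # spinner/log output lands in the file
            timeout=timeout,
            check=True,                # raise if the process exits non-zero
        )
    return done.stdout.decode("utf-8", errors="replace")
```

A typical call is `run_captured(llama_cli_command(gguf, "Your prompt here"), Path("llama.stderr.log"))`; the returned text holds the generation and, per the list above, the timing line.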

Install

Requires askill CLI v1.0+

AI Quality Score

87/100 (analyzed 2/24/2026)

A high-quality technical skill document for the llama.cpp inference engine. It provides comprehensive coverage of GGUF model handling, LoRA adapter loading, benchmarking, and API serving. It is well structured, with clear comparisons between llama.cpp and Ollama, specific subprocess best practices, and useful scripts. Minor deductions for some project-specific terminology (Kothar, RLAMA), but overall excellent actionability and completeness.


Metadata

License: unknown
Version: -
Updated: 3/15/2026
Publisher: tdimino

Tags

api, ci-cd, llm, prompting, testing