This skill provides semantic search capabilities using embedding-based similarity matching for code and text. Enables meaning-based search beyond keyword matching, with optional document parsing (PDF, DOCX, PPTX) support.
semtools follows the SKILL.md standard. Use the install command to add it to your agent stack.
---
name: semtools
description: This skill provides semantic search capabilities using embedding-based similarity matching for code and text. Enables meaning-based search beyond keyword matching, with optional document parsing (PDF, DOCX, PPTX) support.
license: MIT
---
# Semtools: Semantic Search
Perform semantic (meaning-based) search across code and documents using embedding-based similarity matching.
## Purpose
The semtools skill provides access to Semtools, a high-performance Rust-based CLI for semantic search and document processing. Unlike traditional text search (ripgrep) which matches exact strings, or structural search (ast-grep) which matches syntax patterns, semtools understands **semantic meaning** through embeddings.
**Key capabilities:**
1. **Semantic Search**: Find code/text by meaning, not just keywords
2. **Workspace Management**: Index large codebases for fast repeated searches
3. **Document Parsing**: Convert PDFs, DOCX, PPTX to searchable text (requires API key)
Semtools excels at **discovery** - finding relevant code when you don't know the exact keywords, function names, or syntax patterns.
## When to Use This Skill
Use the semtools skill when you need meaning-based search:
**Semantic Code Discovery:**
- Finding code that implements a concept ("error handling", "data validation")
- Discovering similar functionality across different modules
- Locating examples of a pattern when you don't know exact names
- Understanding what code does without reading everything
**Documentation & Knowledge:**
- Searching documentation by concept, not keywords
- Finding related discussions in comments or docs
- Discovering similar issues or solutions
- Analyzing technical documents (PDFs, reports)
**Use Cases:**
- "Find all authentication-related code" (without knowing function names)
- "Show me error handling patterns" (regardless of specific error types)
- "Find code similar to this implementation" (semantic similarity)
- "Search research papers for 'distributed consensus'" (document search)
**Choose semtools over file-search (ripgrep/ast-grep) when:**
- You know the **concept** but not the **keywords**
- Exact string matching misses relevant results
- You want semantically similar code, not exact matches
- Searching across languages or mixed content
**Still use file-search when:**
- You know exact keywords, function names, or patterns
- You need structural code matching (ast-grep)
- Speed is critical (ripgrep is faster for exact matches)
- You're searching for specific symbols or references
## Available Commands
Semtools provides three CLI commands you can use via `execute_command`:
- **`search`** - Semantic search across code and text files
- **`workspace`** - Manage workspaces for caching embeddings
- **`parse`** - Convert documents (PDF, DOCX, PPTX) to searchable text
**All commands work out-of-the-box** in your execution environment. Document parsing requires the LLAMA_CLOUD_API_KEY environment variable to be set.
## Core Operations
### 1. Semantic Search (`search`)
Find files and code sections by semantic meaning:
```bash
# Basic semantic search
search "authentication logic" src/
# Search with more context (5 lines before/after)
search "error handling" --n-lines 5 src/
# Get more results (default: 3)
search "database queries" --top-k 10 src/
# Control similarity threshold (0.0-1.0, lower = more lenient)
search "API endpoints" --max-distance 0.4 src/
```
**Parameters:**
- `--n-lines N`: Show N lines of context around matches (default: 3)
- `--top-k K`: Return top K most similar matches (default: 3)
- `--max-distance D`: Maximum embedding distance (0.0-1.0, default: 0.3)
- `-i`: Case-insensitive matching
**Output format:**
```
Match 1 (similarity: 0.12)
File: src/auth/handlers.py
Lines: 42-47
----
def authenticate_user(username: str, password: str) -> Optional[User]:
"""Authenticate user credentials against database."""
user = get_user_by_username(username)
if user and verify_password(password, user.password_hash):
return user
return None
----
Match 2 (similarity: 0.18)
File: src/middleware/auth.py
...
```
### 2. Workspace Management (`workspace`)
For large codebases, create workspaces to cache embeddings and enable fast repeated searches:
```bash
# Create/activate workspace
workspace use my-project
# Set workspace via environment variable
export SEMTOOLS_WORKSPACE=my-project
# Index files in workspace (workspace auto-detected from env var)
search "query" src/
# Check workspace status
workspace status
# Clean up old workspaces
workspace prune
```
**Benefits:**
- **Fast repeated searches**: Embeddings cached, no re-computation
- **Large codebases**: IVF_PQ indexing for scalability
- **Session persistence**: Maintain context across multiple searches
**When to use workspaces:**
- Searching the same codebase multiple times
- Very large projects (1000+ files)
- Interactive exploration sessions
- CI/CD pipelines with repeated searches
### 3. Document Parsing (`parse`) ⚠️ Requires API Key
Convert documents to searchable markdown (requires LlamaParse API key):
```bash
# Parse PDFs to markdown
parse research_papers/*.pdf
# Parse Word documents
parse reports/*.docx
# Parse presentations
parse slides/*.pptx
# Parse and pipe to search
parse docs/*.pdf | xargs search "neural networks"
```
**Supported formats:**
- PDF (.pdf)
- Word (.docx)
- PowerPoint (.pptx)
**Configuration:**
```bash
# Via environment variable
export LLAMA_CLOUD_API_KEY="llx-..."
# Via config file
cat > ~/.parse_config.json << EOF
{
"api_key": "llx-...",
"max_concurrent_requests": 10,
"timeout_seconds": 3600
}
EOF
```
**Important:** Document parsing is **optional**. Semantic search works without it.
## Workflow Patterns
### Pattern 1: Concept Discovery
When you know what you're looking for conceptually but not by name:
```bash
# Step 1: Broad semantic search
search "rate limiting implementation" src/
# Step 2: Review results, refine query
search "throttle requests per user" src/ --top-k 10
# Step 3: Use ripgrep for exact follow-up
rg "RateLimiter" --type py src/
```
### Pattern 2: Similar Code Finder
When you want to find code similar to a reference implementation:
```bash
# Step 1: Extract key concepts from reference code
# [Read example_auth.py and identify key concepts]
# Step 2: Search for similar implementations
search "user authentication with JWT tokens" src/
# Step 3: Compare implementations
# [Review semantic matches to find similar approaches]
```
### Pattern 3: Documentation Search
When researching concepts in documentation or comments:
```bash
# Search code comments semantically
search "thread safety guarantees" src/ --n-lines 10
# Search markdown documentation
search "deployment best practices" docs/
# Combined search
search "performance optimization" --top-k 20
```
### Pattern 4: Cross-Language Search
When searching for concepts across different languages:
```bash
# Semantic search works across languages
search "connection pooling" src/
# May find:
# - Java: "ConnectionPool manager"
# - Python: "database connection reuse"
# - Go: "pool of persistent connections"
# All semantically related despite different terminology
```
### Pattern 5: Document Analysis (with API key)
When analyzing PDFs or documents:
```bash
# Step 1: Parse documents to markdown
parse research/*.pdf > papers.md
# Step 2: Search converted content
search "transformer architecture" papers.md
# Step 3: Combine with code search
search "attention mechanism implementation" src/
```
## Integration with file-search
Semtools and file-search (ripgrep/ast-grep) are **complementary tools**. Use them together for comprehensive search:
### Search Strategy Matrix
| You Know | Use First | Then Use | Why |
|----------|-----------|----------|-----|
| Exact keywords | ripgrep | search | Fast exact match, then find similar |
| Concept only | search | ripgrep | Find relevant code, then search specifics |
| Function name | ripgrep | search | Find definition, then find similar usage |
| Code pattern | ast-grep | search | Find structure, then find similar logic |
| Approximate idea | search | ripgrep + ast-grep | Discover, then drill down |
### Layered Search Approach
```bash
# Layer 1: Semantic discovery (what's related?)
search "user session management" --top-k 10
# Layer 2: Exact text search (what's the implementation?)
rg "SessionManager|session_store" --type py
# Layer 3: Structural search (how is it used?)
sg --pattern 'session.$METHOD($$$)' --lang python
# Layer 4: Reference tracking (where is it called?)
# [Use serena skill for symbol-level tracking]
```
## Best Practices
### 1. Start Broad, Then Narrow
Use semantic search for discovery, then narrow with exact search:
```bash
# GOOD: Broad semantic discovery first
search "authentication" src/ --top-k 10
# [Review results to learn terminology]
rg "authenticate|verify_credentials" --type py src/
# AVOID: Starting too narrow and missing variations
rg "authenticate" --type py # Misses "verify_credentials", "check_auth", etc.
```
### 2. Adjust Similarity Threshold
Tune `--max-distance` based on results:
```bash
# Too many irrelevant results? Decrease distance (more strict)
search "query" --max-distance 0.2
# Missing relevant results? Increase distance (more lenient)
search "query" --max-distance 0.5
# Default (0.3) works well for most cases
search "query"
```
### 3. Use Workspaces for Repeated Searches
For interactive exploration, always use workspaces:
```bash
# GOOD: Create workspace once, search many times
export SEMTOOLS_WORKSPACE=my-analysis
search "concept1" src/
search "concept2" src/
search "concept3" src/
# INEFFICIENT: Re-compute embeddings every time
search "concept1" src/
search "concept2" src/
```
### 4. Combine with Context Tools
Get more context around semantic matches:
```bash
# Find semantically similar code
search "retry logic" src/ --n-lines 2
# Get more context with ripgrep
rg -C 10 "retry" src/specific_file.py
# Or read the full file
cat src/specific_file.py
```
### 5. Phrase Queries Conceptually
Write queries as concepts, not exact keywords:
```bash
# GOOD: Conceptual queries
search "handling network timeouts"
search "user input validation"
search "concurrent data access"
# LESS EFFECTIVE: Exact keyword queries (use ripgrep instead)
search "timeout" # Use: rg "timeout"
search "validate" # Use: rg "validate"
```
## Understanding Semantic Distance
Semtools uses embedding vectors to measure semantic similarity:
- **Distance 0.0**: Identical meaning
- **Distance 0.1-0.2**: Very similar (synonyms, paraphrases)
- **Distance 0.2-0.3**: Related concepts (default threshold)
- **Distance 0.3-0.4**: Loosely related
- **Distance 0.5+**: Weakly related or unrelated
**Practical guidelines:**
```bash
# Strict matching (only close matches)
--max-distance 0.2
# Balanced matching (default, recommended)
--max-distance 0.3
# Lenient matching (exploratory search)
--max-distance 0.4
# Very lenient (may include false positives)
--max-distance 0.5
```
## Local vs. Cloud Embeddings
**Semantic Search (Local):**
- Uses local embeddings (model2vec, potion-multilingual-128M)
- **No API calls** or cloud dependencies
- Fast, private, no cost
- Works offline
**Document Parsing (Cloud):**
- Uses LlamaParse API (cloud-based)
- Requires API key and internet connection
- Processes PDFs, DOCX, PPTX
- Usage-based pricing (check LlamaIndex pricing)
**Privacy consideration:** Semantic search is 100% local. Only document parsing sends data to LlamaParse API.
## Performance Considerations
### Speed Characteristics
**Without workspace:**
- First search: ~2-5 seconds (embedding computation)
- Subsequent searches: ~2-5 seconds each (re-compute embeddings)
**With workspace (cached embeddings):**
- First search: ~2-5 seconds (builds index)
- Subsequent searches: ~0.1-0.5 seconds (cached)
- Large codebases: IVF_PQ indexing for scalability
**Comparison:**
- ripgrep: 0.01-0.1 seconds (fastest, exact match)
- ast-grep: 0.1-0.5 seconds (fast, structural)
- semtools (cached): 0.1-0.5 seconds (fast, semantic)
- semtools (uncached): 2-5 seconds (slower, semantic)
### Optimization Tips
```bash
# 1. Use workspaces for repeated searches
export SEMTOOLS_WORKSPACE=my-project
# 2. Limit search scope to relevant directories
search "query" src/ --not tests/
# 3. Use --top-k to control result count
search "query" --top-k 5
# 4. Pipe to head for quick preview
search "query" | head -50
```
## Unix Pipeline Integration
Semtools is designed for Unix-style composition:
```bash
# Find and parse PDFs, then search
find docs/ -name "*.pdf" | xargs parse | xargs search "topic"
# Search and filter with grep
search "authentication" src/ | grep -i "jwt"
# Count matches
search "error handling" src/ | grep "Match" | wc -l
# Combine with other tools
search "API" src/ | xargs -I {} rg -l "REST" {}
```
## Limitations
### When NOT to Use Semtools
1. **Exact keyword search**: Use ripgrep for known keywords
```bash
# WRONG TOOL: Semantic search for exact function name
search "authenticate_user"
# RIGHT TOOL: Use ripgrep for exact matches
rg "authenticate_user" --type py
```
2. **Structural code patterns**: Use ast-grep for syntax matching
```bash
# WRONG TOOL: Semantic search for code structure
search "class with constructor"
# RIGHT TOOL: Use ast-grep for structure
sg --pattern 'class $NAME { constructor($$$) { $$$ } }'
```
3. **Symbol references**: Use serena for LSP-based tracking
```bash
# WRONG TOOL: Semantic search for all usages
search "MyClass usage"
# RIGHT TOOL: Use serena for precise references
serena find_referencing_symbols --name 'MyClass'
```
4. **Small codebases**: Overhead not worth it for <100 files
- ripgrep is faster and simpler for small projects
### Known Edge Cases
- **Ambiguous queries**: Vague concepts return broad results
- **Technical jargon**: Domain-specific terms may have lower accuracy
- **Short code snippets**: Limited context reduces embedding quality
- **Mixed languages**: Embeddings tuned for English (multilingual model used)
- **Generated code**: Repetitive patterns may cluster together
## Troubleshooting
### No Semantic Matches Found
If semantic search returns zero results:
1. **Verify files exist**: Use ripgrep to confirm content
```bash
rg "concept" src/
```
2. **Increase similarity threshold**: Be more lenient
```bash
search "query" --max-distance 0.5
```
3. **Rephrase query**: Try different terminology
```bash
search "user authentication"
search "verify user credentials"
search "login validation"
```
4. **Check file types**: Ensure searching correct extensions
```bash
search "query" src/*.py # Target specific types
```
### Too Many Irrelevant Results
If semantic search returns too much noise:
1. **Decrease similarity threshold**: Be more strict
```bash
search "query" --max-distance 0.2
```
2. **Limit result count**: Review top matches only
```bash
search "query" --top-k 3
```
3. **Narrow directory scope**: Search specific paths
```bash
search "query" src/specific_module/
```
4. **Refine query**: Add more specific concepts
```bash
# Vague
search "data"
# Specific
search "data validation with regex patterns"
```
### Document Parsing Fails
If `parse` fails:
1. **Verify API key is set**:
```bash
echo $LLAMA_CLOUD_API_KEY
```
2. **Check file format**: Ensure supported format (PDF, DOCX, PPTX)
```bash
file document.pdf # Verify file type
```
3. **Check file size**: Large files may timeout
```bash
du -h document.pdf # Check size
```
4. **Review parse config**: Adjust timeouts if needed
```bash
cat ~/.parse_config.json
```
### Workspace Issues
If workspace commands fail:
```bash
# Check workspace status
workspace status
# Prune corrupted workspaces
workspace prune
# Recreate workspace
rm -rf ~/.semtools/workspaces/my-workspace
export SEMTOOLS_WORKSPACE=my-workspace
```
## Resources
- [Semtools GitHub](https://github.com/run-llama/semtools)
- [LlamaParse Documentation](https://docs.llamaindex.ai/en/stable/llama_cloud/llama_parse/)
- [model2vec Embeddings](https://github.com/MinishLab/model2vec)
- [Semantic Search Concepts](https://en.wikipedia.org/wiki/Semantic_search)