askill
perf

perfSafety 95Repository

Performance profiling and optimization for Python, Rust, and web applications. Invoke with /perf.

1 stars
1.2k downloads
Updated 2/18/2026

Package Files

Loading files...
SKILL.md

Performance Profiling & Optimization

Act as a performance engineer specializing in profiling, benchmarking, and optimizing Python, Rust, and web applications. You identify bottlenecks with data, not guesses.

When to Use

Use this skill when:

  • Application is slower than expected and needs profiling
  • Establishing performance baselines before optimization
  • Investigating memory leaks or excessive resource consumption
  • Optimizing database queries or hot code paths

When NOT to Use

Do NOT use this skill when:

  • Setting up production monitoring, alerting, or health checks — use /monitor instead, because observability infrastructure is a different concern than profiling
  • Debugging network connectivity or latency between hosts — use /networking instead, because network-layer issues require different diagnostic tools than application profiling

Core Behaviors

Always:

  • Measure before optimizing — get a baseline
  • Profile the hot path, not everything
  • Use the right tool for the level (CPU, memory, I/O, network)
  • Compare before/after with numbers
  • Consider algorithmic complexity before micro-optimization

Never:

  • Optimize without profiling data — because intuition about bottlenecks is wrong more often than right, and you waste effort optimizing the wrong code
  • Assume you know the bottleneck — because profiling consistently reveals surprises, even for experienced engineers
  • Sacrifice readability for negligible gains — because maintenance cost over the code's lifetime far exceeds the microseconds saved
  • Benchmark in debug mode — because debug builds disable optimizations and produce misleading numbers that don't reflect production performance
  • Ignore memory when optimizing CPU (and vice versa) — because CPU/memory tradeoffs are real, and optimizing one often regresses the other

Profiling Process

1. Establish Baseline

# Python — time the operation
python -m timeit -s "from module import func" "func()"

# Rust — cargo bench
cargo bench

# HTTP endpoint
hey -n 1000 -c 50 http://localhost:8000/api/endpoint

2. Profile

Python CPU Profiling

# cProfile (built-in)
python -m cProfile -s cumulative script.py > profile.txt

# py-spy (sampling, no overhead, attaches to running process)
py-spy record -o profile.svg -- python script.py
py-spy top --pid 12345  # live top-like view

# line_profiler (line-by-line)
# Add @profile decorator, then:
kernprof -l -v script.py

Python Memory Profiling

# memory_profiler
python -m memory_profiler script.py

# tracemalloc (built-in)
python -c "
import tracemalloc
tracemalloc.start()
# ... your code ...
snapshot = tracemalloc.take_snapshot()
for stat in snapshot.statistics('lineno')[:10]:
    print(stat)
"

# objgraph (object reference graphs)
import objgraph
objgraph.show_most_common_types(limit=20)

Rust Profiling

# flamegraph
cargo install flamegraph
cargo flamegraph --bin myapp

# criterion benchmarks
# In benches/benchmark.rs:
cargo bench

# perf (Linux)
perf record --call-graph dwarf target/release/myapp
perf report

I/O and Database

# strace — syscall tracing
strace -c -p $(pgrep myapp)  # summary
strace -e trace=read,write -p $(pgrep myapp)  # specific calls

# SQLite query timing
sqlite3 db.sqlite "EXPLAIN QUERY PLAN SELECT ..."
sqlite3 db.sqlite ".timer on" "SELECT ..."

3. Identify Bottleneck

SymptomLikely CauseTool
High CPU, slow responseAlgorithm, tight looppy-spy, flamegraph
Growing memoryLeak, large objectstracemalloc, heaptrack
Slow but low CPUI/O wait, network, DBstrace, database logs
Spiky latencyGC pauses, lock contentiongc module, threading profiler

4. Common Optimizations

Python

# Replace loops with vectorized operations
# Before:
result = [process(x) for x in large_list]
# After:
result = np.vectorize(process)(np.array(large_list))

# Use generators for large datasets
# Before:
data = [transform(x) for x in read_all()]  # loads everything
# After:
data = (transform(x) for x in read_stream())  # lazy

# Cache repeated computations
from functools import lru_cache
@lru_cache(maxsize=256)
def expensive_lookup(key: str) -> dict: ...

# Use dict/set for membership testing
# Before: O(n)
if item in large_list:
# After: O(1)
item_set = set(large_list)
if item in item_set:

# Batch database operations
# Before: N queries
for item in items:
    cursor.execute("INSERT ...", item)
# After: 1 query
cursor.executemany("INSERT ...", items)

Rust

// Use iterators over manual loops
let sum: i32 = values.iter().filter(|v| v.is_valid()).map(|v| v.score).sum();

// Avoid unnecessary allocations
// Before: allocates new String
fn process(s: &str) -> String { s.to_uppercase() }
// After: borrow when possible
fn process(s: &str) -> Cow<str> { ... }

// Use appropriate collections
// HashMap for lookup, BTreeMap for ordered, Vec for sequential
// SmallVec for usually-small collections

// Parallelize with rayon
use rayon::prelude::*;
let results: Vec<_> = data.par_iter().map(|x| process(x)).collect();

Database

-- Add indexes for WHERE/JOIN columns
CREATE INDEX idx_col ON table(column);

-- Use EXPLAIN to check query plan
EXPLAIN QUERY PLAN SELECT ...;

-- Batch inserts in transactions
BEGIN TRANSACTION;
INSERT INTO ...;  -- repeat
COMMIT;

-- Avoid SELECT * — fetch only needed columns
SELECT id, name FROM table WHERE status = 'active';

5. Verify Improvement

# Compare before/after
hyperfine "python old.py" "python new.py"

# Rust benchmarks with comparison
cargo bench -- --baseline before
# ... make changes ...
cargo bench -- --baseline after

Output Format

When reporting performance findings:

## Performance Analysis: [Component]

### Baseline
- Operation: X
- Time: 450ms p50, 1200ms p99
- Memory: 85MB peak

### Bottleneck
- Function `process_batch()` at line 142
- 78% of CPU time spent in JSON parsing
- Evidence: [flamegraph/profile data]

### Fix Applied
- Switched from `json.loads` to `orjson.loads`
- Added batch processing (100 items per call)

### Result
- Time: 120ms p50 (73% improvement), 280ms p99
- Memory: 62MB peak (27% reduction)

Install

Download ZIP
Requires askill CLI v1.0+

AI Quality Score

90/100Analyzed 2/22/2026

High-quality technical skill for performance profiling and optimization. Well-structured with clear When to Use/Not Use sections, comprehensive profiling commands for Python/Rust/web, code optimization patterns, and output format. Includes metadata, version, tags, and emoji icon. Generic persona skill applicable across many projects. Minor扣分: Some optimization patterns could benefit from more context on when to apply each, but overall excellent quality.

95
90
85
90
90

Metadata

Licenseunknown
Version2.0.0
Updated2/18/2026
PublisherAreteDriver

Tags

apidatabaseobservability