askill
intel-vtune-amd-uprof

intel-vtune-amd-uprofSafety 95Repository

Intel VTune and AMD uProf profiling skill for microarchitecture analysis. Use when analyzing hotspots, microarchitecture bottlenecks, memory access patterns, pipeline stalls, or using the roofline model. Covers VTune Community Edition (free) and AMD uProf as a free alternative. Activates on queries about VTune, uProf, microarchitecture analysis, pipeline stalls, memory bandwidth, roofline model, or hardware performance analysis.

30 stars
1.2k downloads
Updated 3/4/2026

Package Files

Loading files...
SKILL.md

Intel VTune & AMD uProf

Purpose

Guide agents through CPU microarchitecture profiling with Intel VTune Profiler (free Community Edition) and AMD uProf: hotspot identification, microarchitecture analysis, memory access pattern optimization, pipeline stall diagnosis, and roofline model analysis.

Triggers

  • "How do I use Intel VTune to profile my code?"
  • "What are pipeline stalls and how do I reduce them?"
  • "How do I analyze memory bandwidth with VTune?"
  • "What is the roofline model and how do I use it?"
  • "How do I use AMD uProf as a free alternative to VTune?"
  • "My code has good cache hit rates but is still slow"

Workflow

1. VTune setup (free Community Edition)

# Download Intel VTune Profiler (Community Edition — free)
# https://www.intel.com/content/www/us/en/developer/tools/oneapi/vtune-profiler.html

# Install on Linux
source /opt/intel/oneapi/vtune/latest/env/vars.sh

# CLI usage
vtune -collect hotspots ./prog
vtune -collect microarchitecture-exploration ./prog
vtune -collect memory-access ./prog

# View results in GUI
vtune-gui &
# File → Open Result → select .vtune directory

# Or use amplxe-cl (legacy CLI)
amplxe-cl -collect hotspots ./prog
amplxe-cl -report hotspots -r result/

2. Analysis types

AnalysisWhat it findsWhen to use
HotspotsCPU-bound functionsFirst step — find where time is spent
Microarchitecture ExplorationIPC, pipeline stalls, retired instructionsAfter hotspot — why is the hotspot slow?
Memory AccessCache misses, DRAM bandwidth, NUMAMemory-bound code
ThreadingLock contention, parallel efficiencyMultithreaded code
HPC PerformanceVectorization, memory, rooflineHPC / scientific code
I/ODisk and network bottlenecksI/O-bound code

3. Hotspot analysis

# Collect and report hotspots
vtune -collect hotspots -result-dir hotspots_result ./prog

# Report top functions by CPU time
vtune -report hotspots -r hotspots_result -format csv | head -20

# CLI output example:
# Function       CPU Time  Module
# compute_fft    4.532s    libfft.so
# matrix_mult    2.108s    prog
# parse_input    0.234s    prog

Build with debug info for meaningful symbols:

gcc -O2 -g ./prog.c -o prog     # symbols visible in VTune
gcc -O2 -g -gsplit-dwarf -fno-omit-frame-pointer ./prog.c -o prog  # better stacks

4. Microarchitecture exploration — pipeline stalls

vtune -collect microarchitecture-exploration -r micro_result ./prog
vtune -report summary -r micro_result

Key metrics to examine:

MetricMeaningGood value
IPC (Instructions Per Clock)How many instructions retire per cyclex86: aim for > 2.0
CPI (Clocks Per Instruction)Inverse of IPCLower is better
Bad SpeculationBranch mispredictions< 5%
Front-End BoundInstruction decode bottleneck< 15%
Back-End BoundExecution unit or memory stall< 30%
RetiringUseful work fraction> 70% ideal
Memory Bound% cycles waiting for memory< 20%
Pipeline Analysis (Top-Down Methodology):
├── Retiring (good, useful work)
├── Bad Speculation (branch mispredictions)
├── Front-End Bound
│   ├── Fetch Latency (I-cache misses, branch mispredicts)
│   └── Fetch Bandwidth
└── Back-End Bound
    ├── Memory Bound
    │   ├── L1 Bound → L1 cache misses
    │   ├── L2 Bound → L2 cache misses
    │   ├── L3 Bound → L3 cache misses
    │   └── DRAM Bound → main memory bandwidth limited
    └── Core Bound → ALU/compute bound

5. Memory access analysis

# Collect memory access profile
vtune -collect memory-access -r mem_result ./prog

# Key output sections:
# - Memory Bound: % time waiting for memory
# - LLC (Last Level Cache) Miss Rate
# - DRAM Bandwidth: GB/s achieved vs theoretical peak
# - NUMA: cross-socket accesses (for multi-socket systems)

Reading DRAM bandwidth:

DRAM Bandwidth: 18.4 GB/s
Peak Theoretical: 51.2 GB/s
Utilization: 36% — likely not DRAM-bound

If DRAM-bound: optimize data layout (AoS → SoA), reduce working set, improve spatial locality.

6. AMD uProf — free alternative for AMD CPUs

# Download AMD uProf
# https://www.amd.com/en/developer/uprof.html

# CLI profiling
AMDuProfCLI collect --config tbp ./prog          # time-based profiling
AMDuProfCLI collect --config assess ./prog       # microarchitecture assessment
AMDuProfCLI collect --config memory ./prog       # memory access

# Generate report
AMDuProfCLI report -i /tmp/uprof_result/ -o report.html

# Open GUI
AMDuProf &

AMD uProf metrics map to VTune equivalents:

  • Retired Instructions → IPC analysis
  • Branch Mispredictions → Bad Speculation
  • L1/L2/L3 Cache Misses → Memory Bound levels
  • Data Cache Accesses → Cache efficiency

7. Roofline model

The roofline model shows whether code is compute-bound or memory-bound by comparing achieved performance against hardware limits:

Performance (GFLOPS/s)
     |                    _______________
Peak |                 /
Perf |              /  compute bound
     |           /
     |        /
     |     /  memory bandwidth bound
     |  /
     +------------------------------→
        Arithmetic Intensity (FLOPS/Byte)
# VTune roofline collection
vtune -collect hpc-performance -r roofline_result ./prog
# Then: VTune GUI → Roofline view

# For manual calculation:
# Arithmetic Intensity = FLOPS / memory_bytes_accessed
# Peak FLOPS = CPUs × cores × freq × FLOPS_per_cycle_per_core
# Peak BW = from hardware spec (e.g., 51.2 GB/s for DDR4-3200 dual channel)

# likwid-perfctr for manual roofline data (Linux)
likwid-perfctr -C 0 -g FLOPS_DP ./prog          # double-precision FLOPS
likwid-perfctr -C 0 -g MEM ./prog               # memory bandwidth

Related skills

  • Use skills/profilers/hardware-counters for raw PMU event collection with perf stat
  • Use skills/profilers/linux-perf for perf-based profiling on Linux
  • Use skills/low-level-programming/cpu-cache-opt for memory access pattern optimization
  • Use skills/low-level-programming/simd-intrinsics for vectorization to increase FLOPS

Install

Download ZIP
Requires askill CLI v1.0+

AI Quality Score

78/100Analyzed 3/28/2026

High-quality technical reference skill for CPU microarchitecture profiling with Intel VTune and AMD uProf. Well-structured with clear triggers, detailed CLI commands, analysis type comparisons, and metric explanations. Includes AMD alternative coverage and roofline model guidance. Slight gaps in troubleshooting and platform-specific details, but highly actionable and reusable for general profiling tasks.

95
84
82
75
72

Metadata

Licenseunknown
Version-
Updated3/4/2026
Publishermohitmishra786

Tags

ci-cdgithub-actionsobservabilityprompting