Linex: Source-Level GPU Performance Profiling
Map GPU performance metrics to your source code lines. Get cycle-level timing, stall analysis, and instruction-level metrics for each line of source code.
When to Use
- User asks to profile a GPU application at source-line granularity
- Need to identify which specific lines of code are performance bottlenecks
- Analyzing stall patterns and execution bottlenecks at the source level
- Understanding cycle-level timing for each line of code
- Instruction-level analysis mapped to source lines
Instructions
- Ensure the target runs on AMD ROCm 7.0+ with rocprofv3 available.
- Kernels must be compiled with -g (debug symbols) for source mapping.
- Choose execution path:
  - If a Linex MCP server is available, use its MCP tools:
    - profile_application to run and profile a target application with the options below.
    - analyze_instruction_hotspots to perform instruction-level hotspot analysis on collected profiles.
  - Otherwise use the Python API from the environment where Linex is installed.
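The rocprofv3 prerequisite can be checked before any profiling is attempted. The following is a minimal sketch that assumes rocprofv3 is expected to be on PATH; the helper name tool_available is hypothetical and not part of Linex, and the check says nothing about the ROCm version or about whether kernels were built with -g.

```python
import shutil


def tool_available(name: str = "rocprofv3") -> bool:
    """Return True if an executable called `name` is found on PATH.

    Hypothetical helper, not part of Linex. It only performs a PATH
    lookup; it does not verify the ROCm version or debug symbols.
    """
    return shutil.which(name) is not None


# Report the result so a missing profiler is noticed before profiling starts.
if tool_available():
    print("rocprofv3 found on PATH")
else:
    print("rocprofv3 not found on PATH; check the ROCm installation")
```

Running this first turns a missing profiler into a clear message rather than a failure partway through a profiling run.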
Python API
from linex import Linex

profiler = Linex(
    target_cu=0,                      # Target compute unit
    shader_engine_mask="0xFFFFFFFF",  # All shader engines
    activity=10,                      # Activity counter polling
)
profiler.profile("./my_app", kernel_filter="my_kernel")

# Show hotspots (source_lines is sorted by total_cycles, descending)
for line in profiler.source_lines[:5]:
    print(f"{line.file}:{line.line_number}")
    print(f"  {line.total_cycles:,} cycles ({line.stall_percent:.1f}% stalled)")
    print(f"  Executed {line.execution_count} times")

# Find heavily stalled lines (likely memory-bound or dependency-bound)
memory_bound = [
    l for l in profiler.source_lines
    if l.stall_percent > 50
]

# Instruction-level analysis of the hottest line
for line in profiler.source_lines[:1]:
    for inst in line.instructions:
        print(f"{inst.isa}: {inst.latency_cycles} cycles")
SourceLine Properties
- file - Source file path
- line_number - Line number
- total_cycles - Sum of all instruction cycles
- stall_cycles - Cycles spent waiting
- idle_cycles - Cycles slot was idle
- execution_count - Total executions
- instructions - List of ISA instructions
- stall_percent - Convenience: stall_cycles / total_cycles * 100
InstructionData Properties
- isa - ISA instruction text
- latency_cycles - Total cycles for this instruction
- stall_cycles - Cycles spent waiting
- idle_cycles - Cycles slot was idle
- execution_count - How many times it ran
- instruction_address - Virtual address in GPU memory
- file - Parsed from source_location
- line - Parsed from source_location
- stall_percent - Convenience: stall_cycles / latency_cycles * 100
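To make the relationship between the two property lists concrete, here is a small sketch of how per-instruction metrics could roll up into per-line metrics. The classes InstructionSketch and SourceLineSketch are hypothetical stand-ins, not the real Linex classes. Summing stall_cycles across instructions and returning 0.0 when the denominator is zero are assumptions; the lists above only state that total_cycles is the sum of instruction cycles and give the stall_percent formulas. The ISA strings and cycle counts are made up.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class InstructionSketch:
    """Stand-in for InstructionData (illustration only, not the Linex class)."""
    isa: str
    latency_cycles: int
    stall_cycles: int

    @property
    def stall_percent(self) -> float:
        # stall_cycles / latency_cycles * 100, as listed above.
        # The zero guard is an assumption about edge-case handling.
        if self.latency_cycles == 0:
            return 0.0
        return self.stall_cycles / self.latency_cycles * 100


@dataclass
class SourceLineSketch:
    """Stand-in for SourceLine (illustration only, not the Linex class)."""
    instructions: List[InstructionSketch] = field(default_factory=list)

    @property
    def total_cycles(self) -> int:
        # "Sum of all instruction cycles"
        return sum(i.latency_cycles for i in self.instructions)

    @property
    def stall_cycles(self) -> int:
        # Assumption: line-level stalls are the sum of instruction stalls.
        return sum(i.stall_cycles for i in self.instructions)

    @property
    def stall_percent(self) -> float:
        # stall_cycles / total_cycles * 100, with the same zero guard.
        if self.total_cycles == 0:
            return 0.0
        return self.stall_cycles / self.total_cycles * 100


# Made-up numbers: one slow load and one cheap add on the same source line.
line = SourceLineSketch(instructions=[
    InstructionSketch("global_load_dword v1, v[2:3], off", 400, 300),
    InstructionSketch("v_add_f32 v0, v0, v1", 100, 0),
])
print(line.total_cycles, line.stall_cycles, f"{line.stall_percent:.1f}%")
```

In this example the load dominates both the cycle count and the stalls, which is the kind of pattern the instruction-level drill-down is meant to expose.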
Workflow
- Ensure the target binary is built with -g (debug symbols) for source mapping.
- Create a Linex() profiler; optionally set target_cu, shader_engine_mask, or activity.
- Call profiler.profile(command, kernel_filter=...) to run profiling.
- Access profiler.source_lines (sorted by total_cycles) to find hotspots.
- Use line.stall_percent to identify memory-bound or dependency-bound lines.
- Drill down into line.instructions for instruction-level analysis.
- Use relative paths for the target binary so the skill is portable.
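The hotspot and stall steps of this workflow can be wrapped in one small helper. This is only a sketch: stalled_hotspots is a hypothetical function, not part of Linex, and it relies only on the total_cycles and stall_percent attributes documented above. The default threshold of 50% mirrors the Python API example and is not a recommended value. The sample data is made up so the snippet runs without a GPU.

```python
from types import SimpleNamespace


def stalled_hotspots(source_lines, min_stall_percent=50.0, limit=5):
    """Return up to `limit` lines above the stall threshold, hottest first.

    Hypothetical helper, not part of Linex. Works on any objects that
    expose `total_cycles` and `stall_percent`.
    """
    ranked = sorted(source_lines, key=lambda l: l.total_cycles, reverse=True)
    return [l for l in ranked if l.stall_percent > min_stall_percent][:limit]


# Made-up stand-in for profiler.source_lines.
fake_lines = [
    SimpleNamespace(file="kernel.hip", line_number=42, total_cycles=9000, stall_percent=72.5),
    SimpleNamespace(file="kernel.hip", line_number=17, total_cycles=12000, stall_percent=12.0),
    SimpleNamespace(file="kernel.hip", line_number=58, total_cycles=3000, stall_percent=55.0),
]

hot = stalled_hotspots(fake_lines)
for l in hot:
    print(f"{l.file}:{l.line_number} {l.total_cycles:,} cycles ({l.stall_percent:.1f}% stalled)")
```

With a real profile, passing profiler.source_lines in place of fake_lines should give the same kind of shortlist. Line 17 is excluded here even though it has the most cycles, because it is busy rather than stalled.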
Notes
- Requires ROCm 7.0+ with rocprofv3 support.
- Source mapping requires kernels compiled with -g (debug symbols).
- source_lines are automatically sorted by total_cycles (descending).
- Use kernel_filter to profile specific kernels by name (regex pattern).
- For Triton or other frameworks, ensure debug symbols are available in the compiled output.
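Because kernel_filter is a regex pattern, it can help to try a pattern against known kernel names before a profiling run. The sketch below uses Python's re module with search semantics (match anywhere in the name). Whether Linex matches anywhere or requires a full match is not stated here, so treat that as an assumption and anchor patterns with ^ and $ when an exact name is intended. The kernel names and the select_kernels helper are hypothetical.

```python
import re


def select_kernels(pattern, names):
    """Return the names that `pattern` matches anywhere (re.search semantics).

    Hypothetical helper for trying out a kernel_filter pattern;
    Linex's actual matching rules may differ.
    """
    rx = re.compile(pattern)
    return [n for n in names if rx.search(n)]


# Made-up kernel names for illustration.
kernel_names = ["matmul_kernel", "matmul_kernel_v2", "softmax_fwd", "softmax_bwd"]

print(select_kernels("matmul", kernel_names))              # unanchored substring pattern
print(select_kernels(r"^matmul_kernel$", kernel_names))    # anchored exact name
print(select_kernels(r"^softmax_(fwd|bwd)$", kernel_names))
```

An unanchored pattern such as matmul selects every kernel whose name contains that text, which may profile more kernels than intended.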
