computation-analysis

Analyze computation-intensive operators and performance for Ascend NPU. Use when examining model operations, performance bottlenecks, and CANN operator library support.

Updated 1/29/2026

Computation Analysis for Ascend NPU

You are analyzing computation patterns for Ascend NPU performance. This skill helps identify:

  1. Computation-intensive operators and their locations
  2. CANN operator library support status
  3. CPU fallback operations (performance impact)
  4. Optimization opportunities with torch_npu
  5. Performance profiling approach

When to Use

Invoke this skill when:

  • User asks about performance or computation
  • Analyzing model operators and operations
  • Looking for performance bottlenecks
  • Planning optimization strategies

Analysis Approach

1. Identify Heavy Operators

Search for compute-intensive patterns:

# Matrix operations
grep -rn "torch\.matmul\|torch\.mm\|torch\.bmm\| @ " <repo_path>
grep -rn "nn\.Linear" <repo_path>

# Convolutions
grep -rn "nn\.Conv" <repo_path>

# Attention mechanisms
grep -rn "attention\|scaled_dot_product" <repo_path>
grep -rn "F\.scaled_dot_product_attention" <repo_path>

# Normalization
grep -rn "LayerNorm\|BatchNorm" <repo_path>

# Activation functions
grep -rni "relu\|gelu\|silu\|softmax" <repo_path>
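The searches above can also be run in a single pass that records where each match occurs. The following is a minimal sketch, assuming a Python repository; `OPERATOR_PATTERNS`, `scan_source`, and `scan_repo` are hypothetical names, and the regular expressions are illustrative rather than exhaustive.

```python
import re
from collections import defaultdict
from pathlib import Path

# Hypothetical pattern table; extend it to match the operators your model uses.
OPERATOR_PATTERNS = {
    "matmul": re.compile(r"torch\.(?:matmul|mm|bmm)\b|\s@\s"),
    "linear": re.compile(r"nn\.Linear\b"),
    "conv": re.compile(r"nn\.Conv[123]d\b"),
    "attention": re.compile(r"scaled_dot_product_attention"),
    "norm": re.compile(r"(?:LayerNorm|BatchNorm[123]d)\b"),
    "activation": re.compile(r"\b(?:relu|gelu|silu|softmax)\b", re.IGNORECASE),
}


def scan_source(text, filename="<memory>"):
    """Return {operator_type: ["file:line", ...]} for one source string."""
    hits = defaultdict(list)
    for lineno, line in enumerate(text.splitlines(), start=1):
        for name, pattern in OPERATOR_PATTERNS.items():
            if pattern.search(line):
                hits[name].append(f"{filename}:{lineno}")
    return dict(hits)


def scan_repo(repo_path):
    """Scan every .py file under repo_path and merge the per-file hits."""
    merged = defaultdict(list)
    for path in sorted(Path(repo_path).rglob("*.py")):
        text = path.read_text(encoding="utf-8", errors="ignore")
        for name, locations in scan_source(text, str(path)).items():
            merged[name].extend(locations)
    return dict(merged)
```

Calling `scan_repo("<repo_path>")` returns a dictionary whose list lengths give the per-operator counts. Text matching still produces false positives (comments, strings), so treat the result as a starting point for manual review.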

2. CANN Operator Support

CANN (Compute Architecture for Neural Networks) provides:

  • Automatic acceleration for standard PyTorch ops
  • TBE (Tensor Boost Engine) operator fusion
  • AI Core acceleration for matrix ops

Natively Supported (High Performance):

  • Matrix multiplication (MatMul, GEMM)
  • Convolutions (Conv1d, Conv2d, Conv3d)
  • Standard activations (ReLU, GELU, SiLU)
  • LayerNorm, BatchNorm
  • Standard attention (scaled_dot_product_attention)

CPU Fallback (Low Performance):

  • Custom CUDA kernels (not portable to the NPU as written; need a CPU path or a rewrite)
  • Third-party library operations
  • Unsupported fusion operations

Check CANN documentation for operator support status.
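Before consulting the operator reference, it can help to flag code that is unlikely to be accelerated as written. The sketch below is an assumption-laden heuristic: `FALLBACK_MARKERS` and `find_fallback_risks` are hypothetical names, the library list (`flash_attn`, `xformers`) is only an illustrative sample of CUDA-oriented packages, and a match means "check this", not "this will fall back".

```python
import re

# Hypothetical grouping of markers for code that targets CUDA directly.
# The matched identifiers are real PyTorch/CUDA names; the categories are an assumption.
FALLBACK_MARKERS = {
    "custom CUDA extension": re.compile(
        r"\b(?:CUDAExtension|load_inline)\b|cpp_extension"
    ),
    "raw CUDA kernel": re.compile(r"__global__|<<<"),
    "third-party CUDA library": re.compile(
        r"^\s*(?:import|from)\s+(?:flash_attn|xformers)\b", re.MULTILINE
    ),
}


def find_fallback_risks(text):
    """Return the sorted names of all marker categories found in the source text."""
    return sorted(
        name for name, pattern in FALLBACK_MARKERS.items() if pattern.search(text)
    )
```

Each flagged category should then be looked up in the CANN documentation to confirm whether a supported equivalent exists.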

3. Graph Optimization Opportunities

Operator Fusion:

  • Combine multiple operations into single kernel
  • Reduces memory transfers
  • Ascend TBE compiler does automatic fusion

Identify opportunities:

  • Sequential linear + activation
  • Conv + batch norm + activation
  • Multiple element-wise operations
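The first two patterns in this list can be detected mechanically from an ordered list of layer type names. The following is a rough sketch; `FUSION_PATTERNS` and `find_fusion_candidates` are hypothetical names, and the sets of layer names are an illustrative selection. It only reports candidates and does not tell you whether the compiler actually fuses them.

```python
ACTIVATIONS = {"ReLU", "GELU", "SiLU"}
CONVS = {"Conv1d", "Conv2d", "Conv3d"}
NORMS = {"BatchNorm1d", "BatchNorm2d", "BatchNorm3d"}

# Hypothetical pattern table: each pattern is a sequence of allowed layer-type sets.
FUSION_PATTERNS = {
    "linear+activation": [{"Linear"}, ACTIVATIONS],
    "conv+bn+activation": [CONVS, NORMS, ACTIVATIONS],
}


def find_fusion_candidates(layer_types):
    """Return (start_index, pattern_name) for every window that matches a pattern."""
    found = []
    for start in range(len(layer_types)):
        for name, pattern in FUSION_PATTERNS.items():
            window = layer_types[start:start + len(pattern)]
            if len(window) == len(pattern) and all(
                layer in allowed for layer, allowed in zip(window, pattern)
            ):
                found.append((start, name))
    return found
```

For a sequential PyTorch model, the input list could be built with `[type(m).__name__ for m in model.children()]`; models with branching control flow need a traced graph instead.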

4. Automatic Mixed Precision (AMP)

Performance Benefits:

  • Up to 2-4x speedup on supported operations (workload dependent; confirm by profiling)
  • Lower memory bandwidth requirements
  • Better AI Core utilization

Check for AMP usage:

# Existing
torch.cuda.amp.autocast  # → torch.npu.amp.autocast

# Opportunities
# - FP32 models that can use FP16
# - Operations supporting FP16 acceleration
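The check for existing AMP usage can be scripted so that each CUDA reference is reported with its NPU counterpart. This is a small sketch with hypothetical helper names (`CUDA_TO_NPU_AMP`, `suggest_amp_migration`, `uses_amp`). The `autocast` mapping comes from the text above; the `GradScaler` mapping is an assumption that should be confirmed against the installed torch_npu version.

```python
import re

# autocast mapping: from the text above.
# GradScaler mapping: an assumption, verify against your torch_npu release.
CUDA_TO_NPU_AMP = {
    "torch.cuda.amp.autocast": "torch.npu.amp.autocast",
    "torch.cuda.amp.GradScaler": "torch.npu.amp.GradScaler",
}


def suggest_amp_migration(text):
    """Return (line_number, cuda_name, npu_name) for each CUDA AMP reference."""
    suggestions = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for cuda_name, npu_name in CUDA_TO_NPU_AMP.items():
            if cuda_name in line:
                suggestions.append((lineno, cuda_name, npu_name))
    return suggestions


def uses_amp(text):
    """Return True if the source mentions autocast or GradScaler at all."""
    return re.search(r"\b(?:autocast|GradScaler)\b", text) is not None
```

A file where `uses_amp` returns False and the model runs in FP32 is a candidate for the "Opportunities" list.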

5. Distributed Training

Communication Operations:

grep -rn "all_reduce\|broadcast\|gather" <repo_path>

HCCL (Huawei Collective Communication Library):

  • Replaces NCCL for Ascend
  • Used for multi-NPU training
  • Backend change required: backend="nccl" in init_process_group becomes backend="hccl"
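One way to keep a training script portable is to derive the backend name from the device type instead of hard-coding it. The helper below is a minimal sketch; `select_backend` is a hypothetical name, and the `"gloo"` choice for CPU is an assumption for illustration.

```python
def select_backend(device_type):
    """Map a device type string to a torch.distributed backend name."""
    backends = {
        "npu": "hccl",   # Ascend NPUs
        "cuda": "nccl",  # NVIDIA GPUs
        "cpu": "gloo",   # assumption: CPU-only runs use gloo
    }
    try:
        return backends[device_type]
    except KeyError:
        raise ValueError(f"no known backend for device type {device_type!r}")


# Intended use (not executed here; requires torch and torch_npu):
#   import torch.distributed as dist
#   dist.init_process_group(backend=select_backend("npu"))
```

When reviewing a repository, any literal `"nccl"` passed to `init_process_group` marks a line that needs this change.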

Output Format

Computation-Intensive Operators

List heavy operators with locations:

Operator Type   Count   Locations               Complexity
nn.Linear       50      model.py:23,45,67...    O(n²)
nn.Conv2d       20      model.py:10-30...       O(k²n²)
MatMul          30      attention.py:55...      O(n³)
Attention       5       attention.py:40-80      O(n²)
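A table in this shape can be produced from collected results with a few lines of formatting code. This is a sketch only; `format_operator_table` is a hypothetical helper, and the row values are whatever the analysis produced.

```python
def format_operator_table(rows):
    """Format (operator, count, locations, complexity) rows as aligned plain text."""
    headers = ("Operator Type", "Count", "Locations", "Complexity")
    table = [headers] + [tuple(str(cell) for cell in row) for row in rows]
    # Each column is as wide as its longest cell.
    widths = [max(len(row[i]) for row in table) for i in range(len(headers))]
    lines = [
        "  ".join(cell.ljust(width) for cell, width in zip(row, widths)).rstrip()
        for row in table
    ]
    return "\n".join(lines)
```

Passing a list such as `[("nn.Linear", 50, "model.py:23,45,67...", "O(n²)")]` yields a header line followed by one aligned row per operator.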

CANN Support Analysis

Natively Supported (High Performance):

  • List operators with CANN acceleration
  • Expected speedup vs CPU

CPU Fallback (Performance Risk):

  • List operators requiring CPU execution
  • Performance impact assessment
  • Suggested workarounds

Optimization Recommendations

torch_npu AMP:

  • Enable automatic mixed precision
  • Expected speedup: up to 2-4x, to be confirmed by profiling
  • Operations supporting FP16

Graph Optimization:

  • Operator fusion opportunities
  • Expected performance gain
  • TBE compiler optimization

Distributed Training:

  • HCCL communication optimization
  • Gradient compression opportunities
  • Overlap compute and communication

Performance Profiling

Recommended Tools:

# NPU monitoring
npu-smi info  # Real-time NPU status
npu-smi info -t usages  # Memory and utilization

# Profiling
torch_npu.profiler.profile  # Profile NPU operations (older releases: torch_npu.npu.profile)
msprof  # CANN profiling tool

# Python profiling
python -m torch_npu.testing  # Benchmark utilities

Profiling Approach:

  1. Run model with small batch
  2. Use npu-smi to monitor utilization
  3. Profile individual operations
  4. Identify bottlenecks
  5. Optimize hotspots
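Steps 3 to 5 can be approximated with a plain timing harness before reaching for a full profiler. The sketch below uses only the standard library; `time_operations` is a hypothetical helper. It assumes each callable blocks until its work is finished, which on an NPU means the callable should end with a device synchronization (for example `torch.npu.synchronize()`), because kernel launches are asynchronous.

```python
import time


def time_operations(operations, repeats=5):
    """Time each callable and return [(name, best_seconds)], slowest first.

    `operations` maps a label to a zero-argument callable. For device code,
    the callable is assumed to synchronize before returning.
    """
    results = []
    for name, fn in operations.items():
        fn()  # warm-up run, excluded from the measurement
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            fn()
            best = min(best, time.perf_counter() - start)
        results.append((name, best))
    return sorted(results, key=lambda item: item[1], reverse=True)
```

The first entry of the returned list is the hotspot to examine first; wall-clock timing is coarse, so confirm the ranking with `msprof` or the torch_npu profiler before optimizing.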

Expected Performance on Ascend

Based on analysis:

  • Bottleneck identification: What limits performance
  • Optimization priority: Rank optimizations by impact
  • Expected speedup: vs GPU baseline
  • Performance model: Operations/sec, memory bandwidth
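A simple roofline estimate is one way to express the performance model: compare an operator's arithmetic intensity with the ratio of peak compute to peak memory bandwidth. The sketch below does this for a matrix multiplication. `matmul_roofline` is a hypothetical helper, it assumes each matrix is read or written exactly once, and the peak figures passed in are placeholders to be replaced with the values for your device, not Ascend specifications.

```python
def matmul_roofline(m, n, k, bytes_per_element, peak_flops, peak_bandwidth):
    """Estimate whether an (m x k) @ (k x n) matmul is compute or memory bound.

    peak_flops is in FLOP/s and peak_bandwidth in bytes/s; both are supplied
    by the caller. Assumes A, B, and the output are each moved exactly once.
    """
    flops = 2 * m * n * k  # one multiply and one add per inner-product term
    bytes_moved = bytes_per_element * (m * k + k * n + m * n)
    intensity = flops / bytes_moved            # FLOP per byte
    ridge = peak_flops / peak_bandwidth        # intensity where the two limits meet
    attainable = min(peak_flops, intensity * peak_bandwidth)
    return {
        "intensity": intensity,
        "bound": "compute" if intensity >= ridge else "memory",
        "attainable_flops": attainable,
    }
```

With FP16 (`bytes_per_element=2`) the same shapes move half the bytes of FP32, which doubles the arithmetic intensity; this is the bandwidth benefit that mixed precision provides for memory-bound operators.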

Tools to Use

Documentation First:

Computation Analysis:

  • Use Grep to search for computation patterns
  • Reference project PDFs:
    • knowledge/CANN商用版 8.5.0 算子库接口参考 01.pdf

Key Considerations

  • CANN automatically accelerates standard PyTorch ops
  • Custom ops require special handling
  • Profiling essential for optimization
  • AMP provides easy performance wins
  • HCCL required for multi-NPU training
  • Graph fusion provides additional speedup

Metadata

License: unknown
Version: -
Updated: 1/29/2026
Publisher: FeRhodium

Tags

api, observability