Computation Analysis for Ascend NPU

You are analyzing computation patterns for Ascend NPU performance. This skill helps identify:

Computation-intensive operators and their locations
CANN operator library support status
CPU fallback operations (performance impact)
Optimization opportunities with torch_npu
Performance profiling approach

When to Use

Invoke this skill when:

User asks about performance or computation
Analyzing model operators and operations
Looking for performance bottlenecks
Planning optimization strategies

Analysis Approach

1. Identify Heavy Operators

Search for compute-intensive patterns:

# Matrix operations
grep -rn "torch\.matmul\|torch\.mm\|torch\.bmm\| @ " <repo_path>
grep -rn "nn\.Linear" <repo_path>

# Convolutions
grep -rn "nn\.Conv" <repo_path>

# Attention mechanisms
grep -rn "attention\|scaled_dot_product" <repo_path>
grep -rn "F\.scaled_dot_product_attention" <repo_path>

# Normalization
grep -rn "LayerNorm\|BatchNorm" <repo_path>

# Activation functions
grep -rn "relu\|gelu\|silu\|softmax" <repo_path>
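
When the grep output is too noisy to tally by hand, the same search can be sketched in Python. This is a minimal sketch, not a replacement for the commands above: the PATTERNS table, scan_source, and scan_repo are hypothetical names, and the regexes are slightly stricter than the grep patterns (for example, " @ " must be surrounded by whitespace so decorators are skipped).

```python
import re
from collections import Counter
from pathlib import Path

# Hypothetical pattern table loosely mirroring the grep commands above.
PATTERNS = {
    "matmul": re.compile(r"torch\.(?:matmul|mm|bmm)\b|\s@\s"),
    "linear": re.compile(r"nn\.Linear\b"),
    "conv": re.compile(r"nn\.Conv[123]d\b"),
    "attention": re.compile(r"attention", re.IGNORECASE),
    "norm": re.compile(r"LayerNorm|BatchNorm"),
    "activation": re.compile(r"relu|gelu|silu|softmax", re.IGNORECASE),
}


def scan_source(text):
    """Return {category: [1-based line numbers]} for one file's text."""
    hits = {name: [] for name in PATTERNS}
    for lineno, line in enumerate(text.splitlines(), start=1):
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                hits[name].append(lineno)
    return hits


def scan_repo(repo_path):
    """Count matching lines per category across all .py files under repo_path."""
    totals = Counter()
    for path in Path(repo_path).rglob("*.py"):
        text = path.read_text(encoding="utf-8", errors="ignore")
        for name, lines in scan_source(text).items():
            totals[name] += len(lines)
    return totals
```

The per-file line numbers from scan_source can feed the Locations column of the output table; scan_repo gives the Count column.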

2. CANN Operator Support

CANN (Compute Architecture for Neural Networks) provides:

Automatic acceleration for standard PyTorch ops
TBE (Tensor Boost Engine) operator fusion
AI Core acceleration for matrix ops

Natively Supported (High Performance):

Matrix multiplication (MatMul, GEMM)
Convolutions (Conv1d, Conv2d, Conv3d)
Standard activations (ReLU, GELU, SiLU)
LayerNorm, BatchNorm
Standard attention (scaled_dot_product_attention)

CPU Fallback (Low Performance):

Custom CUDA kernels (cannot run on the NPU as-is; need an NPU implementation or a slower fallback path)
Third-party library operations
Unsupported fusion operations

Check CANN documentation for operator support status.
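
Before consulting the operator list, it can help to flag source lines that are likely to fall outside native support. The sketch below is illustrative only: CUDA_MARKERS and find_porting_risks are hypothetical names, the marker list is an assumption rather than an official CANN compatibility check, and a match only means "check this line against the documentation".

```python
import re

# Markers that commonly indicate CUDA-specific code paths.
# Illustrative assumption only; extend or trim for the repository at hand.
CUDA_MARKERS = {
    "custom CUDA extension": re.compile(
        r"\b(?:CUDAExtension|load_inline)\b|cpp_extension"
    ),
    "hard-coded CUDA device": re.compile(r"\.cuda\(|[\"']cuda(?::\d+)?[\"']"),
    "third-party CUDA library": re.compile(
        r"\b(?:import|from)\s+(?:flash_attn|xformers|apex)\b"
    ),
}


def find_porting_risks(source):
    """Return (line number, marker label) pairs in line order for one file's text."""
    risks = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for label, pattern in CUDA_MARKERS.items():
            if pattern.search(line):
                risks.append((lineno, label))
    return risks
```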

3. Graph Optimization Opportunities

Operator Fusion:

Combines multiple operations into a single kernel
Reduces memory transfers between operations
Applied automatically by the Ascend TBE compiler

Identify opportunities:

Sequential linear + activation
Conv + batch norm + activation
Multiple element-wise operations
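
The three patterns above can be located mechanically once a model's layers are written out as an ordered list of operator names. This is a rough sketch under that assumption: classify, find_fusion_candidates, and the ACTIVATIONS and ELEMENTWISE sets are hypothetical, and a reported candidate does not guarantee that the compiler fuses it.

```python
ACTIVATIONS = {"ReLU", "GELU", "SiLU"}
ELEMENTWISE = {"Add", "Mul", "Sub", "Div"}


def classify(op):
    """Map an operator name to a coarse kind used for pattern matching."""
    if op.startswith("Conv"):
        return "conv"
    if op.startswith("BatchNorm"):
        return "bn"
    if op == "Linear":
        return "linear"
    if op in ACTIVATIONS:
        return "act"
    if op in ELEMENTWISE:
        return "elementwise"
    return "other"


def find_fusion_candidates(ops):
    """Return (start index, pattern name, length) for each candidate run."""
    kinds = [classify(op) for op in ops]
    found = []
    i = 0
    while i < len(kinds):
        if kinds[i:i + 3] == ["conv", "bn", "act"]:
            found.append((i, "conv+bn+act", 3))
            i += 3
        elif kinds[i:i + 2] == ["linear", "act"]:
            found.append((i, "linear+act", 2))
            i += 2
        elif kinds[i] == "elementwise":
            j = i
            while j < len(kinds) and kinds[j] == "elementwise":
                j += 1
            if j - i >= 2:  # a single element-wise op has nothing to fuse with
                found.append((i, "elementwise chain", j - i))
            i = j
        else:
            i += 1
    return found
```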

4. Automatic Mixed Precision (AMP)

Performance Benefits:

Up to 2-4x speedup on supported operations (workload-dependent; confirm by profiling)
Lower memory bandwidth requirements
Better AI Core utilization

Check for AMP usage:

# Existing CUDA usage → NPU equivalent (torch.npu is available after `import torch_npu`)
torch.cuda.amp.autocast  # → torch.npu.amp.autocast

# Opportunities
# - FP32 models that can use FP16
# - Operations supporting FP16 acceleration
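
Existing CUDA AMP call sites can be listed together with a suggested NPU rewrite. A minimal sketch: AMP_REPLACEMENTS and suggest_amp_migration are hypothetical names, the autocast mapping comes from the note above, and the GradScaler mapping is an assumption to confirm against the installed torch_npu version.

```python
# CUDA AMP names and their assumed NPU counterparts.
# The GradScaler entry is an assumption; verify it before relying on it.
AMP_REPLACEMENTS = {
    "torch.cuda.amp.autocast": "torch.npu.amp.autocast",
    "torch.cuda.amp.GradScaler": "torch.npu.amp.GradScaler",
}


def suggest_amp_migration(source):
    """Return (line number, rewritten line) for each line that uses CUDA AMP."""
    suggestions = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        rewritten = line
        for old, new in AMP_REPLACEMENTS.items():
            rewritten = rewritten.replace(old, new)
        if rewritten != line:
            suggestions.append((lineno, rewritten.strip()))
    return suggestions
```

The output is a list of suggestions for review, not an in-place rewrite, so each change can be checked before it is applied.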

5. Distributed Training

Communication Operations:

grep -rn "all_reduce\|broadcast\|gather" <repo_path>

HCCL (Huawei Collective Communication Library):

Replaces NCCL for Ascend
Used for multi-NPU training
Backend change required ("nccl" → "hccl" when initializing the process group)
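
The backend switch can be kept in one place so the same training script runs on either device type. A small sketch, assuming a hypothetical helper named pick_backend; the backend strings are the standard torch.distributed names, with "hccl" expected to be registered once torch_npu is imported.

```python
def pick_backend(device_type):
    """Map a device type string to a torch.distributed backend name."""
    backends = {"npu": "hccl", "cuda": "nccl", "cpu": "gloo"}
    try:
        return backends[device_type]
    except KeyError:
        raise ValueError(f"unsupported device type: {device_type!r}") from None
```

The result would typically be passed as torch.distributed.init_process_group(backend=pick_backend("npu")).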

Output Format

Computation-Intensive Operators

List heavy operators with locations:

Operator Type	Count	Locations	Complexity
nn.Linear	50	model.py:23,45,67...	O(n²)
nn.Conv2d	20	model.py:10-30...	O(k²n²)
MatMul	30	attention.py:55...	O(n³)
Attention	5	attention.py:40-80	O(n²)
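
The Complexity column can be backed by approximate FLOP counts when concrete shapes are known. The sketch below uses the usual convention of 2 FLOPs per multiply-add and ignores bias, softmax, and normalization costs; the function names are hypothetical.

```python
def linear_flops(batch, in_features, out_features):
    """FLOPs for one nn.Linear forward pass (bias ignored)."""
    return 2 * batch * in_features * out_features


def conv2d_flops(batch, in_ch, out_ch, kernel, out_h, out_w):
    """FLOPs for one nn.Conv2d forward pass with a square kernel (bias ignored)."""
    return 2 * batch * out_ch * out_h * out_w * in_ch * kernel * kernel


def attention_flops(batch, heads, seq_len, head_dim):
    """FLOPs for the Q @ K^T and scores @ V matmuls (softmax ignored)."""
    return 4 * batch * heads * seq_len * seq_len * head_dim
```

For example, attention_flops grows with the square of seq_len, which matches the O(n²) entry for Attention in the sample table.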

CANN Support Analysis

Natively Supported (High Performance):

List operators with CANN acceleration
Expected speedup vs CPU

CPU Fallback (Performance Risk):

List operators requiring CPU execution
Performance impact assessment
Suggested workarounds

Optimization Recommendations

torch_npu AMP:

Enable automatic mixed precision
Expected speedup: up to 2-4x, to be confirmed by profiling
Operations supporting FP16

Graph Optimization:

Operator fusion opportunities
Expected performance gain
TBE compiler optimization

Distributed Training:

HCCL communication optimization
Gradient compression opportunities
Overlap compute and communication

Performance Profiling

Recommended Tools:

# NPU monitoring
npu-smi info  # Real-time NPU status
npu-smi info -t usages -i <npu_id>  # Memory and utilization for one NPU

# Profiling
torch_npu.npu.profile  # Profile NPU operations
msprof  # CANN profiling tool

# Python test utilities
python -m torch_npu.testing  # Test helpers shipped with torch_npu; confirm this entry point exists in your version

Profiling Approach:

1. Run the model with a small batch
2. Use npu-smi to monitor utilization
3. Profile individual operations
4. Identify bottlenecks
5. Optimize hotspots
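
For the "profile individual operations" step, a plain timing harness is often enough to rank candidates before reaching for msprof. This is a minimal sketch: benchmark is a hypothetical helper, and the sync argument is an assumption standing in for a device synchronization call (on NPU this would likely be torch.npu.synchronize), needed because device work is queued asynchronously.

```python
import time


def benchmark(fn, warmup=3, iters=10, sync=None):
    """Return mean seconds per call of fn(), measured after warm-up runs."""
    for _ in range(warmup):
        fn()  # warm-up: excludes one-time compilation and caching costs
    if sync is not None:
        sync()  # make sure warm-up work has finished before timing
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    if sync is not None:
        sync()  # wait for queued device work before stopping the clock
    return (time.perf_counter() - start) / iters
```

Comparing the mean time of two variants of the same operation (for example FP32 versus autocast) gives a first estimate of whether an optimization is worth pursuing.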

Expected Performance on Ascend

Based on analysis:

Bottleneck identification: What limits performance
Optimization priority: Rank optimizations by impact
Expected speedup: vs GPU baseline
Performance model: Operations/sec, memory bandwidth
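
One simple way to express the performance model is a roofline estimate: attainable throughput is the smaller of peak compute and memory bandwidth times arithmetic intensity. The sketch below assumes hypothetical function names, and the peak and bandwidth values must come from the target device's specifications, not from this code.

```python
def attainable_flops(peak_flops, mem_bandwidth, flops, bytes_moved):
    """Roofline estimate: min(peak compute, bandwidth * arithmetic intensity)."""
    intensity = flops / bytes_moved  # FLOPs per byte of memory traffic
    return min(peak_flops, mem_bandwidth * intensity)


def bound_by(peak_flops, mem_bandwidth, flops, bytes_moved):
    """Report whether an operator is limited by compute or by memory bandwidth."""
    ridge = peak_flops / mem_bandwidth  # intensity where the two limits meet
    return "compute" if flops / bytes_moved >= ridge else "memory"
```

Operators reported as "memory" bound are the ones most likely to benefit from fusion and reduced precision; "compute" bound operators depend mainly on AI Core utilization.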

Tools to Use

Documentation First:

Read official Ascend documentation before analysis:
- https://www.hiascend.com/doc_center/source/zh/Pytorch/730/ptmoddevg/trainingmigrguide/PT_LMTMOG_0002.html
- https://www.hiascend.com/doc_center/source/zh/canncommercial/850/API/aolapi/operatorlist_00001.html

Computation Analysis:

Use Grep to search for computation patterns
Reference project PDFs:
- knowledge/CANN商用版 8.5.0 算子库接口参考 01.pdf

Key Considerations

CANN automatically accelerates standard PyTorch ops
Custom ops require special handling
Profiling is essential for optimization
AMP provides easy performance wins
HCCL is required for multi-NPU training
Graph fusion provides additional speedup
