Computation Analysis for Ascend NPU
You are analyzing computation patterns for Ascend NPU performance. This skill helps identify:
- Computation-intensive operators and their locations
- CANN operator library support status
- CPU fallback operations (performance impact)
- Optimization opportunities with torch_npu
- Performance profiling approach
When to Use
Invoke this skill when:
- User asks about performance or computation
- Analyzing model operators and operations
- Looking for performance bottlenecks
- Planning optimization strategies
Analysis Approach
1. Identify Heavy Operators
Search for compute-intensive patterns:
# Matrix operations
grep -rn "torch\.matmul\|torch\.mm\|torch\.bmm\| @ " <repo_path>
grep -rn "nn\.Linear" <repo_path>
# Convolutions
grep -rn "nn\.Conv" <repo_path>
# Attention mechanisms
grep -rn "attention\|scaled_dot_product" <repo_path>
grep -rn "F\.scaled_dot_product_attention" <repo_path>
# Normalization
grep -rn "LayerNorm\|BatchNorm" <repo_path>
# Activation functions
grep -rn "relu\|gelu\|silu\|softmax" <repo_path>
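When the grep output gets noisy, the same search can be sketched in Python so that hits are grouped by category and reported as file:line locations. This is a minimal illustration, not part of any Ascend tooling: `PATTERNS` and `scan_source` are hypothetical names, and the regexes are assumptions that mirror the grep patterns above with word boundaries added.

```python
import re
from collections import defaultdict

# Hypothetical categories; the regexes mirror the grep patterns above
# but use word boundaries to cut down on false positives.
PATTERNS = {
    "matmul": re.compile(r"torch\.(?:matmul|mm|bmm)\b|\s@\s"),
    "linear": re.compile(r"nn\.Linear\b"),
    "conv": re.compile(r"nn\.Conv[123]d\b"),
    "attention": re.compile(r"scaled_dot_product_attention"),
    "norm": re.compile(r"\b(?:LayerNorm|BatchNorm[123]d)\b"),
    "activation": re.compile(r"\b(?:relu|gelu|silu|softmax)\b", re.IGNORECASE),
}


def scan_source(text, filename="<memory>"):
    """Return {category: ["file:line", ...]} for every matching line."""
    hits = defaultdict(list)
    for lineno, line in enumerate(text.splitlines(), start=1):
        for category, pattern in PATTERNS.items():
            if pattern.search(line):
                hits[category].append(f"{filename}:{lineno}")
    return dict(hits)


sample = """\
import torch.nn as nn
@torch.no_grad()
def f(x, w):
    y = x @ w
    return nn.Linear(4, 4)(torch.matmul(y, w))
"""
report = scan_source(sample, "model.py")
print(report)
```

To cover a repository, the function could be called once per file found with `pathlib.Path(repo).rglob("*.py")`. Note that the decorator on the second sample line is not counted as a matrix multiplication, because the `@` pattern requires whitespace on both sides.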
2. CANN Operator Support
CANN (Compute Architecture for Neural Networks) provides:
- Automatic acceleration for standard PyTorch ops
- TBE (Tensor Boost Engine) operator fusion
- AI Core acceleration for matrix ops
Natively Supported (High Performance):
- Matrix multiplication (MatMul, GEMM)
- Convolutions (Conv1d, Conv2d, Conv3d)
- Standard activations (ReLU, GELU, SiLU)
- LayerNorm, BatchNorm
- Standard attention (scaled_dot_product_attention)
CPU Fallback (Low Performance):
- Custom CUDA kernels (cannot run on the NPU as-is; they need a port or a replacement operator)
- Third-party library operations
- Unsupported fusion operations
Check CANN documentation for operator support status.
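A first-pass triage of the operators found in a repository could look like the sketch below. The `NATIVE` set only restates the list above and the `RISK_MARKERS` substrings are hypothetical naming conventions; neither is an authoritative CANN support matrix, so every result still needs to be confirmed in the CANN documentation.

```python
# Illustrative only: this set restates the "natively supported" list above.
# It is NOT an authoritative CANN support matrix.
NATIVE = {
    "matmul", "gemm",
    "conv1d", "conv2d", "conv3d",
    "relu", "gelu", "silu",
    "layernorm", "batchnorm",
    "scaled_dot_product_attention",
}

# Hypothetical substrings that hint at custom or third-party operators.
RISK_MARKERS = ("cuda_kernel", "custom", "third_party")


def classify(op_name):
    """Return 'native', 'fallback-risk', or 'unknown' for one operator name."""
    name = op_name.lower()
    if name in NATIVE:
        return "native"
    if any(marker in name for marker in RISK_MARKERS):
        return "fallback-risk"
    return "unknown"


def triage(op_names):
    """Group operator names into the three buckets, preserving input order."""
    buckets = {"native": [], "fallback-risk": [], "unknown": []}
    for op in op_names:
        buckets[classify(op)].append(op)
    return buckets


result = triage(["MatMul", "GELU", "custom_rope_cuda_kernel", "einsum"])
print(result)
```

Anything that lands in the "unknown" bucket is exactly what should be looked up in the operator reference before it is reported as supported or as a fallback.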
3. Graph Optimization Opportunities
Operator Fusion:
- Combine multiple operations into single kernel
- Reduces memory transfers
- Ascend TBE compiler does automatic fusion
Identify opportunities:
- Sequential linear + activation
- Conv + batch norm + activation
- Multiple element-wise operations
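The first two patterns in this list can be spotted mechanically once the model's layers are available as an ordered list of type names. The sketch below is a simplified illustration: `find_fusion_candidates` and the pattern table are hypothetical, it only looks at strictly consecutive layers, and it says nothing about whether the compiler will actually fuse a given sequence.

```python
ACTIVATIONS = {"ReLU", "GELU", "SiLU"}
CONVS = {"Conv1d", "Conv2d", "Conv3d"}
BATCH_NORMS = {"BatchNorm1d", "BatchNorm2d", "BatchNorm3d"}

# Each pattern is a list of stages; a stage is the set of layer types
# accepted at that position in the sequence.
FUSION_PATTERNS = {
    "linear+activation": [{"Linear"}, ACTIVATIONS],
    "conv+bn+activation": [CONVS, BATCH_NORMS, ACTIVATIONS],
}


def find_fusion_candidates(layers):
    """Return (start_index, pattern_name) for each consecutive match."""
    found = []
    for start in range(len(layers)):
        for name, stages in FUSION_PATTERNS.items():
            window = layers[start:start + len(stages)]
            if len(window) == len(stages) and all(
                layer in stage for layer, stage in zip(window, stages)
            ):
                found.append((start, name))
    return found


layers = ["Conv2d", "BatchNorm2d", "ReLU", "Linear", "GELU", "Linear"]
candidates = find_fusion_candidates(layers)
print(candidates)
```

With PyTorch, a list like `layers` could be built from `type(m).__name__` for each leaf module, though module registration order does not always match execution order.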
4. Automatic Mixed Precision (AMP)
Performance Benefits:
- Up to 2-4x speedup on supported operations (workload-dependent; confirm by profiling)
- Lower memory bandwidth requirements
- Better AI Core utilization
Check for AMP usage:
# Existing
torch.cuda.amp.autocast # → torch.npu.amp.autocast (available after "import torch_npu")
# Opportunities
# - FP32 models that can use FP16
# - Operations supporting FP16 acceleration
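Existing CUDA AMP call sites can be located and turned into suggested NPU equivalents with a small text scan. This is a sketch under assumptions: `suggest_amp_migration` is a hypothetical helper, the `autocast` mapping comes from the note above, and the `GradScaler` mapping is assumed to follow the same naming and should be verified against the torch_npu documentation.

```python
import re

# CUDA AMP names and their assumed torch_npu counterparts.
# The torch.npu.* names only resolve after `import torch_npu` has run.
AMP_REWRITES = [
    (re.compile(r"\btorch\.cuda\.amp\.autocast\b"), "torch.npu.amp.autocast"),
    # Assumption: GradScaler follows the same naming; verify before use.
    (re.compile(r"\btorch\.cuda\.amp\.GradScaler\b"), "torch.npu.amp.GradScaler"),
]


def suggest_amp_migration(text):
    """Return (line_number, suggested_line) for each CUDA AMP usage."""
    suggestions = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for pattern, replacement in AMP_REWRITES:
            if pattern.search(line):
                suggestions.append((lineno, pattern.sub(replacement, line).strip()))
    return suggestions


sample = """\
scaler = torch.cuda.amp.GradScaler()
with torch.cuda.amp.autocast():
    loss = model(x).sum()
"""
suggestions = suggest_amp_migration(sample)
for lineno, suggestion in suggestions:
    print(f"line {lineno}: {suggestion}")
```

The helper only reports suggestions and never edits files, which keeps the migration reviewable line by line.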
5. Distributed Training
Communication Operations:
grep -rn "all_reduce\|broadcast\|gather" <repo_path>
HCCL (Huawei Collective Communication Library):
- Replaces NCCL for Ascend
- Used for multi-NPU training
- Backend change required: pass backend="hccl" instead of backend="nccl" to torch.distributed.init_process_group
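One way to keep a training script portable across devices is to derive the collective backend from the device type instead of hard-coding it. The helper below is a hypothetical sketch; `pick_backend` is not part of torch or torch_npu, and the "gloo" entry for CPU is included only for completeness.

```python
def pick_backend(device_type):
    """Map a device type string to a torch.distributed backend name."""
    backends = {
        "npu": "hccl",   # Ascend NPUs
        "cuda": "nccl",  # NVIDIA GPUs
        "cpu": "gloo",   # CPU-only runs
    }
    try:
        return backends[device_type]
    except KeyError:
        raise ValueError(f"no known backend for device type {device_type!r}") from None


# Intended use inside a training script (assumes torch_npu is imported first
# so that the "hccl" backend is registered):
#   torch.distributed.init_process_group(backend=pick_backend("npu"))
backend = pick_backend("npu")
print(backend)
```

Raising on an unrecognized device type, rather than silently defaulting, makes a misconfigured launch fail before any communication is attempted.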
Output Format
Computation-Intensive Operators
List heavy operators with locations:
| Operator Type | Count | Locations | Complexity |
|---|---|---|---|
| nn.Linear | 50 | model.py:23,45,67... | O(n²) |
| nn.Conv2d | 20 | model.py:10-30... | O(k²n²) |
| MatMul | 30 | attention.py:55... | O(n³) |
| Attention | 5 | attention.py:40-80 | O(n²) |
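The Complexity column can be backed with rough forward-pass FLOP counts so that operators are ranked by actual cost rather than by count alone. The formulas below are the usual back-of-the-envelope estimates (one multiply-add counted as 2 FLOPs, bias and softmax ignored); the function names and the example shapes are illustrative assumptions, not measurements.

```python
def linear_flops(batch, in_features, out_features):
    """FLOPs for y = x @ W^T with x of shape (batch, in_features)."""
    return 2 * batch * in_features * out_features


def conv2d_flops(batch, c_in, c_out, kernel, h_out, w_out):
    """FLOPs for a square-kernel Conv2d producing (c_out, h_out, w_out)."""
    return 2 * batch * c_out * h_out * w_out * c_in * kernel * kernel


def attention_flops(batch, heads, seq_len, head_dim):
    """FLOPs for QK^T plus attention-weights @ V (two matmuls per head)."""
    per_matmul = 2 * seq_len * seq_len * head_dim
    return 2 * per_matmul * batch * heads


# Example shapes chosen only for illustration.
print(linear_flops(1, 1024, 1024))           # 2097152
print(conv2d_flops(1, 3, 64, 3, 112, 112))   # 43352064
print(attention_flops(1, 8, 128, 64))        # 33554432
```

Multiplying each estimate by the operator's count from the table gives a first ranking of where the compute goes; it does not account for memory traffic.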
CANN Support Analysis
Natively Supported (High Performance):
- List operators with CANN acceleration
- Expected speedup vs CPU
CPU Fallback (Performance Risk):
- List operators requiring CPU execution
- Performance impact assessment
- Suggested workarounds
Optimization Recommendations
torch_npu AMP:
- Enable automatic mixed precision
- Expected speedup: up to 2-4x, to be confirmed by profiling
- Operations supporting FP16
Graph Optimization:
- Operator fusion opportunities
- Expected performance gain
- TBE compiler optimization
Distributed Training:
- HCCL communication optimization
- Gradient compression opportunities
- Overlap compute and communication
Performance Profiling
Recommended Tools:
# NPU monitoring
npu-smi info # Real-time NPU status
npu-smi info -t usages -i <device_id> # Memory and utilization for one device
# Profiling
torch_npu.npu.profile # Profile NPU operations
msprof # CANN profiling tool
# Python test utilities
torch_npu.testing # Test helpers imported from Python (e.g. torch_npu.testing.testcase); not a standalone benchmark command
Profiling Approach:
- Run model with small batch
- Use npu-smi to monitor utilization
- Profile individual operations
- Identify bottlenecks
- Optimize hotspots
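The "profile individual operations" and "identify bottlenecks" steps can be approximated with a small wall-clock harness before reaching for a full profiler. This is a generic sketch: `time_op` and `rank_hotspots` are hypothetical helpers, and the `sync` hook is an assumption for asynchronous devices, where something like `torch.npu.synchronize` would need to be passed so the timer does not stop before the device has finished.

```python
import time


def time_op(fn, warmup=2, repeats=5, sync=None):
    """Return the mean wall-clock seconds per call of fn."""
    for _ in range(warmup):
        fn()
    if sync is not None:
        sync()  # drain pending device work before starting the timer
    start = time.perf_counter()
    for _ in range(repeats):
        fn()
    if sync is not None:
        sync()  # wait for the timed calls to actually finish
    return (time.perf_counter() - start) / repeats


def rank_hotspots(ops, **kwargs):
    """Time each named callable and return (name, seconds), slowest first."""
    timings = {name: time_op(fn, **kwargs) for name, fn in ops.items()}
    return sorted(timings.items(), key=lambda item: item[1], reverse=True)


# Stand-in workloads; real use would wrap individual model operations.
ops = {
    "fast": lambda: None,
    "slow": lambda: time.sleep(0.005),
}
ranked = rank_hotspots(ops)
for name, seconds in ranked:
    print(f"{name}: {seconds * 1e3:.3f} ms")
```

Wall-clock timing like this only shows which operation is slow, not why; msprof or the torch_npu profiler is still needed to see kernel-level detail.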
Expected Performance on Ascend
Based on analysis:
- Bottleneck identification: What limits performance
- Optimization priority: Rank optimizations by impact
- Expected speedup: relative to the GPU baseline
- Performance model: Operations/sec, memory bandwidth
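A simple way to express the performance model is a roofline estimate: attainable throughput is the smaller of peak compute and memory bandwidth times arithmetic intensity (FLOPs per byte moved). The sketch below uses placeholder hardware numbers that are NOT Ascend specifications; substitute the figures for the actual device.

```python
def attainable_flops(peak_flops, mem_bandwidth, arithmetic_intensity):
    """Roofline bound: min(peak compute, bandwidth * FLOPs-per-byte)."""
    return min(peak_flops, mem_bandwidth * arithmetic_intensity)


def bound_by(peak_flops, mem_bandwidth, arithmetic_intensity):
    """Report whether an operation is compute-bound or memory-bound."""
    ridge_point = peak_flops / mem_bandwidth  # intensity where the two limits meet
    return "compute" if arithmetic_intensity >= ridge_point else "memory"


# Placeholder hardware numbers for illustration only.
PEAK_FLOPS = 100 * 10**12      # 100 TFLOP/s (hypothetical)
MEM_BANDWIDTH = 10**12         # 1 TB/s (hypothetical)

# An element-wise op moves many bytes per FLOP; a large matmul does the opposite.
low_intensity = 10             # FLOPs per byte (hypothetical)
high_intensity = 200           # FLOPs per byte (hypothetical)

print(attainable_flops(PEAK_FLOPS, MEM_BANDWIDTH, low_intensity),
      bound_by(PEAK_FLOPS, MEM_BANDWIDTH, low_intensity))
print(attainable_flops(PEAK_FLOPS, MEM_BANDWIDTH, high_intensity),
      bound_by(PEAK_FLOPS, MEM_BANDWIDTH, high_intensity))
```

Memory-bound operations are the ones where fusion and reduced precision tend to help most, since both reduce bytes moved per FLOP.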
Tools to Use
Documentation First:
- Read the official Ascend documentation before analysis.
Computation Analysis:
- Use Grep to search for computation patterns
- Reference project PDFs: knowledge/CANN商用版 8.5.0 算子库接口参考 01.pdf (CANN Commercial Edition 8.5.0 Operator Library API Reference 01)
Key Considerations
- CANN automatically accelerates standard PyTorch ops
- Custom ops require special handling
- Profiling essential for optimization
- AMP provides easy performance wins
- HCCL required for multi-NPU training
- Graph fusion provides additional speedup
