Memory Analysis for Ascend NPU
You are analyzing memory patterns for Ascend NPU optimization. This skill helps identify:
- Data loading patterns and optimization opportunities
- Host-device transfers (CPU ↔ NPU)
- Automatic data migration effectiveness
- Mixed precision opportunities (FP16/BF16)
- Memory efficiency improvements
When to Use
Invoke this skill when:
- User asks about memory optimization for NPU
- Examining data loading and training pipelines
- Looking for memory inefficiencies
- Planning mixed precision training strategy
Analysis Approach
1. Data Loading Analysis
Examine:
- DataLoader configuration
- Batch size settings
- Number of workers
- Pin memory usage (pin_memory=True)
- Prefetching strategies
Search Patterns:
grep -rn "DataLoader" <repo_path>
grep -rn "num_workers" <repo_path>
grep -rn "pin_memory" <repo_path>
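The grep searches above locate DataLoader usage but do not show which tuning options each call actually sets. A minimal sketch of that next step, using Python's standard `ast` module, follows. The helper name `find_dataloader_calls` and the `TUNING_KWARGS` tuple are hypothetical, introduced only for illustration.

```python
import ast

# Keyword arguments this skill cares about when reviewing a DataLoader.
TUNING_KWARGS = ("num_workers", "pin_memory", "prefetch_factor", "persistent_workers")

def find_dataloader_calls(source: str):
    """Return one dict per DataLoader(...) call: line number, tuning kwargs set, kwargs missing."""
    results = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        # Match both `DataLoader(...)` and `torch.utils.data.DataLoader(...)`.
        name = func.id if isinstance(func, ast.Name) else getattr(func, "attr", None)
        if name != "DataLoader":
            continue
        given = {kw.arg: ast.unparse(kw.value) for kw in node.keywords if kw.arg in TUNING_KWARGS}
        missing = [k for k in TUNING_KWARGS if k not in given]
        results.append({"line": node.lineno, "set": given, "missing": missing})
    return results

sample = '''
from torch.utils.data import DataLoader
loader = DataLoader(dataset, batch_size=32, num_workers=4)
'''
report = find_dataloader_calls(sample)
print(report)
```

The source is only parsed, never executed, so this can run on a repository without PyTorch installed. Calls that pass options through `**kwargs` will show every option as missing and need manual review.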
2. Host-Device Transfer Patterns
Identify data movement:
# Explicit transfers
tensor = tensor.to('cuda') # → .to('npu') or automatic
model = model.cuda() # → .npu() or automatic
# Check for inefficient patterns
# - Redundant transfers
# - Transferring large unused data
# - Frequent CPU↔GPU bouncing
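One way to surface the transfer patterns listed above is a line-by-line regex scan. The sketch below is a rough heuristic, not a complete analysis: the pattern table and the `find_transfers` helper are hypothetical names, and the comment stripping is naive (a `#` inside a string literal would be misread).

```python
import re

# Patterns for explicit device moves; extend as needed for a given codebase.
TRANSFER_PATTERNS = {
    "to_device": re.compile(r"\.to\(\s*['\"](cuda|npu|cpu)(:\d+)?['\"]"),
    "cuda_call": re.compile(r"\.cuda\(\s*\)"),
    "npu_call": re.compile(r"\.npu\(\s*\)"),
    "cpu_call": re.compile(r"\.cpu\(\s*\)"),
}

def find_transfers(source: str):
    """Return (line_number, pattern_name) for every explicit transfer found."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        code = line.split("#", 1)[0]  # drop trailing comments (naive)
        for kind, pattern in TRANSFER_PATTERNS.items():
            if pattern.search(code):
                hits.append((lineno, kind))
    return hits

sample = """x = x.to('cuda')
y = model(x).cpu()
x = x.cuda()  # moved again: candidate redundant transfer
"""
hits = find_transfers(sample)
print(hits)
```

A device move followed closely by a `.cpu()` call, or the same variable moved twice, is a reasonable starting point for spotting bouncing and redundant transfers by hand.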
3. Automatic Data Migration
torch_npu provides automatic data migration:
- Tensors automatically move to NPU when needed
- Reduces explicit .to('npu') calls
- But may not cover all cases
Analyze:
- Will automatic migration work for this codebase?
- Are there cases preventing automatic migration?
- Performance impact of automatic vs explicit
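As a starting point for these questions, automatic migration can be switched on behind an import guard so the same script still runs on hosts without an NPU stack. This sketch assumes the `torch_npu.contrib.transfer_to_npu` module shipped with torch_npu; the `AUTO_MIGRATION` flag is a hypothetical name used only here.

```python
# Sketch: enable automatic migration when torch_npu is installed, otherwise fall back.
try:
    import torch  # noqa: F401
    import torch_npu  # noqa: F401
    # Assumption: importing this module patches CUDA-style calls to target the NPU.
    from torch_npu.contrib import transfer_to_npu  # noqa: F401
    AUTO_MIGRATION = True
except (ImportError, OSError):
    # torch or torch_npu is missing, or its shared libraries could not be loaded.
    AUTO_MIGRATION = False

print(f"automatic migration enabled: {AUTO_MIGRATION}")
```

Code paths that rely on custom CUDA extensions are worth checking separately, since an import-time patch is unlikely to cover them.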
4. Mixed Precision Training
FP16/BF16 Benefits on Ascend:
- Up to 50% memory reduction for tensors stored in half precision
- 2-4x speedup on supported operations
- Better NPU utilization
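The memory figure follows from element size: FP32 uses 4 bytes per element, while FP16 and BF16 use 2. A small worked example, with a hypothetical parameter count chosen only for illustration:

```python
# Bytes per element for the precisions discussed above.
BYTES_PER_ELEMENT = {"fp32": 4, "fp16": 2, "bf16": 2}

def tensor_bytes(num_elements: int, dtype: str) -> int:
    """Storage needed for a dense tensor of the given element count and dtype."""
    return num_elements * BYTES_PER_ELEMENT[dtype]

num_params = 100_000_000  # hypothetical model size
fp32_mib = tensor_bytes(num_params, "fp32") / 2**20
fp16_mib = tensor_bytes(num_params, "fp16") / 2**20
saving = 1 - fp16_mib / fp32_mib

print(f"FP32: {fp32_mib:.1f} MiB, FP16: {fp16_mib:.1f} MiB, saving: {saving:.0%}")
```

This is an upper bound per tensor. Under autocast-style AMP the weights usually stay in FP32, so the realized saving comes mainly from activations and is typically below 50% of total memory.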
Check for:
# Existing AMP usage
torch.cuda.amp.autocast # → torch.npu.amp.autocast
torch.cuda.amp.GradScaler # → torch.npu.amp.GradScaler
# Opportunities for AMP
# - Float32 models that can use FP16
# - Loss scaling requirements
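The two mappings above are mechanical enough to preview with a regex substitution. The sketch below only handles fully qualified references; `from torch.cuda.amp import autocast` style imports are not covered. The helper name `rewrite_amp_references` is hypothetical.

```python
import re

# Matches the two fully qualified AMP entry points listed above.
AMP_PATTERN = re.compile(r"\btorch\.cuda\.amp\.(autocast|GradScaler)\b")

def rewrite_amp_references(source: str):
    """Swap torch.cuda.amp.{autocast,GradScaler} for torch.npu.amp; return (new_source, count)."""
    return AMP_PATTERN.subn(r"torch.npu.amp.\1", source)

sample = "scaler = torch.cuda.amp.GradScaler()\nwith torch.cuda.amp.autocast():\n    pass\n"
converted, count = rewrite_amp_references(sample)
print(converted)
print(f"{count} reference(s) rewritten")
```

A count of zero on a training script suggests either that AMP is not used yet (an opportunity) or that it is imported under another name.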
5. Memory Efficiency Techniques
Identify opportunities:
- Gradient checkpointing: Trade compute for memory
- Gradient accumulation: Simulate larger batch sizes
- Memory pool reuse: NPU-specific memory optimization
- Tensor lifecycle: Proper cleanup to avoid leaks
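To make the gradient accumulation item concrete, the framework-free sketch below checks that accumulating scaled micro-batch gradients reproduces the full-batch gradient for a one-parameter least-squares model. The model, data, and names are illustrative assumptions, not taken from any particular codebase.

```python
import math

def grad_mse(w, xs, ys):
    """Gradient of mean((w*x - y)^2) with respect to w over one batch."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
w = 0.5

# Gradient computed on the whole batch at once (needs all samples in memory).
full_batch_grad = grad_mse(w, xs, ys)

# Same gradient built from micro-batches of size 2, each scaled by 1/accum_steps.
micro = 2
accum_steps = len(xs) // micro
accumulated = 0.0
for i in range(accum_steps):
    sl = slice(i * micro, (i + 1) * micro)
    accumulated += grad_mse(w, xs[sl], ys[sl]) / accum_steps

print(full_batch_grad, accumulated)
```

The equality holds here because the micro-batches are equal in size. In a training loop the same effect comes from dividing each micro-batch loss by the number of accumulation steps and stepping the optimizer once per group.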
Output Format
Data Loading Analysis
- DataLoader configuration summary
- Optimization opportunities:
- Increase workers
- Enable pin_memory (for NPU)
- Adjust prefetch_factor
- Use persistent_workers
Host-Device Transfer
- Current transfer patterns
- Redundant or inefficient transfers
- Automatic migration compatibility
- Recommendations:
- Use automatic data migration where possible
- Minimize explicit transfers
- Keep frequently accessed data on NPU
Mixed Precision Opportunities
- Current precision usage (FP32/FP16)
- FP16/BF16 compatibility with torch.npu.amp
- Expected memory savings (up to 50%)
- Expected speedup (2-4x on supported ops)
- Implementation:
# Enable torch_npu AMP
from torch_npu.npu import amp  # same module as torch.npu.amp
scaler = amp.GradScaler()
with amp.autocast():
    output = model(input)
Memory Efficiency
- Gradient checkpointing opportunities
- Gradient accumulation for large effective batch sizes
- Memory pool optimization strategies
- Potential memory leaks or improper cleanup
Specific torch_npu APIs
Recommend specific APIs:
# Memory management
torch.npu.empty_cache() # Clear unused memory
torch.npu.set_memory_strategy() # Memory allocation strategy (confirm it exists in your torch_npu version)
torch.npu.memory_allocated() # Current memory usage
torch.npu.max_memory_allocated() # Peak memory
# AMP
from torch_npu.npu import amp # same module as torch.npu.amp
amp.autocast() # Automatic mixed precision
amp.GradScaler() # Loss scaling
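The two memory counters above return byte counts, which are easier to compare across runs once formatted. A minimal sketch follows; the `format_memory_report` helper is hypothetical and takes plain integers so it can be tried without an NPU.

```python
def format_memory_report(allocated_bytes: int, peak_bytes: int) -> str:
    """Render current and peak allocation in MiB."""
    mib = 2**20
    return f"allocated={allocated_bytes / mib:.1f} MiB, peak={peak_bytes / mib:.1f} MiB"

# On an NPU host the inputs would come from the calls listed above, e.g.:
#   format_memory_report(torch.npu.memory_allocated(), torch.npu.max_memory_allocated())
line = format_memory_report(512 * 2**20, 768 * 2**20)
print(line)
```

Logging this line before and after a training step gives a quick view of how far peak usage sits above steady-state usage.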
Tools to Use
Documentation First:
- Read official Ascend documentation before analysis
Memory Analysis:
- Use Grep to search for memory-related patterns
- Use Read to examine data loading code
Notes
- Ascend NPU has a different memory hierarchy than GPU
- HBM (High Bandwidth Memory) is a precious resource
- Automatic data migration reduces code changes
- Mixed precision training is highly recommended for NPU
- Profile actual memory usage with npu-smi
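For the npu-smi note, a snapshot can be captured from Python so it sits next to training logs. This sketch assumes the `npu-smi info` subcommand is available on the host and returns None when it is not; the `npu_smi_info` wrapper is a hypothetical helper.

```python
import shutil
import subprocess

def npu_smi_info(timeout_s: float = 10.0):
    """Return the text output of `npu-smi info`, or None if the tool is missing or fails."""
    if shutil.which("npu-smi") is None:
        return None
    try:
        done = subprocess.run(
            ["npu-smi", "info"], capture_output=True, text=True, timeout=timeout_s
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    return done.stdout if done.returncode == 0 else None

snapshot = npu_smi_info()
print(snapshot if snapshot is not None else "npu-smi not available on this host")
```

The output is returned unparsed, since its layout may differ between driver versions.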
