System Design Analysis
Analyze distributed system designs for scalability, reliability, performance, and security. Produce structured review documents with gaps and actionable recommendations.
Core Principle
System design is about trade-offs, not perfect answers. Every recommendation must consider context: access patterns, scale requirements, consistency needs, and operational constraints.
Workflow
Phase 1: Information Gathering
Ask focused questions to understand the system. Prioritize these areas:
- Functional scope: What does the system do? Core operations?
- Scale: Expected QPS, data volume, user count?
- Access patterns: Read-heavy vs write-heavy? Hot spots?
- Consistency requirements: Strong vs eventual? Where?
- Availability targets: SLA requirements? Acceptable downtime?
- Current architecture: Existing components, databases, services?
- Known pain points: What's broken or struggling today?
Limit each message to 3-5 questions. When information is missing, make reasonable assumptions and document them explicitly.
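When the user gives an availability target but no acceptable-downtime figure, the two can be converted with simple arithmetic. The following is a minimal sketch; the function name `downtime_budget_minutes` and the 30-day default period are illustrative assumptions, not part of any referenced file.

```python
def downtime_budget_minutes(availability_pct: float, period_days: float = 30.0) -> float:
    """Return the downtime (in minutes) permitted by an availability target.

    availability_pct: target such as 99.9 (percent).
    period_days: measurement window; 30 days is an assumed default.
    """
    if not 0.0 <= availability_pct <= 100.0:
        raise ValueError("availability_pct must be between 0 and 100")
    total_minutes = period_days * 24 * 60
    return total_minutes * (1.0 - availability_pct / 100.0)


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99):
        budget = downtime_budget_minutes(target)
        print(f"{target}% over 30 days allows about {budget:.1f} minutes of downtime")
```

Stating the budget in minutes often makes it easier for the user to confirm or correct an assumed SLA.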
Phase 2: Topic Analysis
Based on gathered information, analyze relevant system design topics. Load reference files as needed:
| Topic | Reference File | When to Load |
|---|---|---|
| Load balancing | load-balancing.md | Traffic distribution, L4/L7 decisions |
| Caching | caching.md | Latency optimization, read scaling |
| Databases | databases.md | Data modeling, SQL vs NoSQL choices |
| CAP & Consistency | cap-consistency.md | Consistency model decisions |
| Sharding | sharding-partitioning.md | Write/storage scaling |
| Replication | replication.md | Availability, read scaling |
| Message queues | message-queues.md | Async processing, decoupling |
| Rate limiting | rate-limiting.md | Traffic protection, abuse prevention |
| Auth | auth.md | Security, identity management |
| Resilience | resilience-patterns.md | Failure handling, fault tolerance |
| Monitoring | monitoring-observability.md | Observability, debugging |
Load only topics relevant to the specific system under review.
Phase 3: Document Generation
Produce a structured analysis document with these sections:
# System Design Analysis: [System Name]
## 1. Abstract
Brief summary of the system and analysis scope (2-3 paragraphs).
## 2. Requirements
### 2.1 Stated Requirements
Requirements explicitly provided by the user.
### 2.2 Assumed Requirements
Reasonable assumptions with rationale. Format:
- **Assumption**: [what was assumed]
- **Rationale**: [why this is reasonable]
## 3. Current System Review
Analysis of existing architecture against requirements. Organize by topic area.
## 4. Gaps
Identified issues, risks, or missing capabilities. Prioritize by impact:
- **Critical**: System failures, data loss risks
- **High**: Performance bottlenecks, scalability limits
- **Medium**: Operational inefficiencies, maintainability issues
- **Low**: Nice-to-have improvements
## 5. Recommendations
Actionable improvements with:
- **Problem addressed**: Which gap(s) this solves
- **Recommendation**: Specific technical approach
- **Example**: Concrete implementation guidance
- **Trade-offs**: What you gain vs what you sacrifice
- **Impact**: Expected improvement if implemented
Analysis Checklist
For each relevant topic, evaluate:
Load Balancing
- Algorithm appropriate for workload (round robin, least connections, consistent hashing)?
- L4 vs L7 appropriate for use case?
- LB itself highly available?
- Health checks configured?
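For the consistent hashing option in the checklist above, a recommendation's Example section might include something like the sketch below. It is a simplified illustration, not a production balancer: the class name `HashRing`, the choice of SHA-256, and the default of 100 virtual nodes per backend are all assumptions made for this example.

```python
import bisect
import hashlib


class HashRing:
    """Minimal consistent-hash ring with virtual nodes (illustrative only)."""

    def __init__(self, nodes, vnodes=100):
        ring = []
        for node in nodes:
            for i in range(vnodes):
                # Each backend gets several points on the ring to smooth the load.
                ring.append((self._hash(f"{node}#{i}"), node))
        ring.sort()
        self._ring = ring
        self._hashes = [h for h, _ in ring]

    @staticmethod
    def _hash(key: str) -> int:
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def node_for(self, key: str) -> str:
        if not self._ring:
            raise ValueError("ring has no nodes")
        # First ring point clockwise from the key's hash, wrapping around.
        idx = bisect.bisect_right(self._hashes, self._hash(key)) % len(self._ring)
        return self._ring[idx][1]


if __name__ == "__main__":
    ring = HashRing(["backend-a", "backend-b", "backend-c"])
    print(ring.node_for("session-42"))
```

The property worth calling out in a review is that removing one backend only remaps the keys that backend owned; keys on the remaining backends stay where they were.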
Caching
- Cache strategy defined (cache-aside, write-through)?
- Eviction policy appropriate (LRU, TTL)?
- Cache invalidation strategy?
- Hot key and cache stampede handling?
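The cache-aside, TTL, invalidation, and stampede items above can be illustrated together. The sketch below is an in-process stand-in for a real cache; the class name `CacheAside`, the per-key lock approach, and the injectable `clock` parameter are assumptions chosen to keep the example self-contained.

```python
import threading
import time


class CacheAside:
    """In-memory cache-aside with TTL expiry and per-key locking (illustrative)."""

    def __init__(self, loader, ttl_seconds=60.0, clock=time.monotonic):
        self._loader = loader      # function that reads from the source of truth
        self._ttl = ttl_seconds
        self._clock = clock
        self._data = {}            # key -> (value, expires_at)
        self._locks = {}           # key -> lock guarding reloads of that key
        self._guard = threading.Lock()

    def _fresh(self, key):
        entry = self._data.get(key)
        if entry is not None and entry[1] > self._clock():
            return entry
        return None

    def get(self, key):
        entry = self._fresh(key)
        if entry is not None:
            return entry[0]
        with self._guard:
            lock = self._locks.setdefault(key, threading.Lock())
        with lock:
            # Re-check: another thread may have reloaded while we waited,
            # so only one caller per key hits the backing store.
            entry = self._fresh(key)
            if entry is not None:
                return entry[0]
            value = self._loader(key)
            self._data[key] = (value, self._clock() + self._ttl)
            return value

    def invalidate(self, key):
        """Call after writing to the source of truth."""
        self._data.pop(key, None)


if __name__ == "__main__":
    cache = CacheAside(lambda k: f"row-for-{k}", ttl_seconds=5)
    print(cache.get("user:1"))
```

In a distributed deployment the same idea typically needs a shared lock or request coalescing at the cache tier; this sketch only covers a single process.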
Databases
- Data model matches access patterns?
- Indexes support critical queries?
- Read-heavy vs write-heavy considered?
- Appropriate SQL vs NoSQL choice?
CAP & Consistency
- Consistency model matches business requirements?
- Trade-offs between C and A explicit?
- Read-your-writes where needed?
Sharding
- Shard key distributes load evenly?
- Hot partitions addressed?
- Cross-shard operations minimized?
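One way to sanity-check a candidate shard key is to hash a sample of real key values and measure how unevenly they land. The sketch below assumes simple hash-modulo placement; `shard_for` and `skew_ratio` are hypothetical helper names, and the sample data is invented for illustration.

```python
import hashlib
from collections import Counter


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard using a stable hash (hash-modulo placement)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def skew_ratio(keys, num_shards: int) -> float:
    """Return busiest-shard load divided by the mean load (1.0 is perfectly even)."""
    keys = list(keys)
    if not keys:
        raise ValueError("need at least one key to measure skew")
    counts = Counter(shard_for(k, num_shards) for k in keys)
    mean = len(keys) / num_shards
    return max(counts.values()) / mean


if __name__ == "__main__":
    # Hypothetical write sample: one tenant dominates traffic.
    by_tenant = ["tenant-1"] * 900 + [f"tenant-{i}" for i in range(2, 102)]
    by_order = [f"order-{i}" for i in range(1000)]
    print("shard by tenant:", round(skew_ratio(by_tenant, 8), 2))
    print("shard by order id:", round(skew_ratio(by_order, 8), 2))
```

A high ratio for one candidate key and a ratio near 1.0 for another gives concrete evidence for the hot-partition discussion in the Gaps section.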
Replication
- Sync vs async replication appropriate?
- Replica lag acceptable?
- Leader election mechanism defined?
- Split-brain prevention?
Message Queues
- Delivery guarantees appropriate?
- Consumer idempotency?
- Dead-letter queue for failures?
- Backpressure handling?
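Consumer idempotency and dead-lettering from the checklist above can be sketched as follows. This is a simplified in-memory illustration, assuming at-least-once delivery and a unique message ID supplied by the producer; `IdempotentConsumer` and its return strings are hypothetical names, and a real system would persist the processed-ID set.

```python
class IdempotentConsumer:
    """Deduplicates by message ID and dead-letters after repeated failures."""

    def __init__(self, handler, max_attempts=3):
        self._handler = handler
        self._max_attempts = max_attempts
        self._done = set()        # IDs that must not be handled again
        self._attempts = {}       # message ID -> failed attempt count
        self.dead_letters = []    # stand-in for a dead-letter queue

    def consume(self, message_id, payload):
        if message_id in self._done:
            return "duplicate"    # redelivery: skip side effects
        try:
            self._handler(payload)
        except Exception:
            attempts = self._attempts.get(message_id, 0) + 1
            self._attempts[message_id] = attempts
            if attempts >= self._max_attempts:
                # Park the poison message so it stops blocking the queue.
                self.dead_letters.append((message_id, payload))
                self._done.add(message_id)
                return "dead-lettered"
            return "retry"
        self._done.add(message_id)
        return "processed"


if __name__ == "__main__":
    charged = []
    consumer = IdempotentConsumer(charged.append)
    print(consumer.consume("msg-1", {"amount": 10}))
    print(consumer.consume("msg-1", {"amount": 10}))  # redelivered copy
    print(charged)
```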
Rate Limiting
- Algorithm chosen (token bucket recommended)?
- Limits appropriate for different tiers?
- Distributed enforcement for multi-node?
- Graceful handling of limit breaches?
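Since the checklist names token bucket, a recommendation can show its core logic in a few lines. The sketch below is a single-node version only; the class name `TokenBucket`, its parameters, and the injectable `clock` are assumptions for illustration, and distributed enforcement would need shared state that is not shown here.

```python
import time


class TokenBucket:
    """Single-node token bucket: refills at a steady rate, allows short bursts."""

    def __init__(self, rate_per_sec, capacity, clock=time.monotonic):
        self._rate = float(rate_per_sec)   # tokens added per second
        self._capacity = float(capacity)   # maximum burst size
        self._tokens = float(capacity)     # start full
        self._clock = clock
        self._last = clock()

    def allow(self, cost=1.0):
        """Return True and spend tokens if the request fits, else False."""
        now = self._clock()
        elapsed = now - self._last
        self._last = now
        # Refill for the time that has passed, never beyond capacity.
        self._tokens = min(self._capacity, self._tokens + elapsed * self._rate)
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False


if __name__ == "__main__":
    bucket = TokenBucket(rate_per_sec=5, capacity=10)
    allowed = sum(bucket.allow() for _ in range(20))
    print(f"{allowed} of 20 burst requests allowed")
```

A rejected request would typically map to a response such as HTTP 429 with a retry hint, which covers the graceful-handling item above.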
Authentication & Authorization
- AuthN mechanism appropriate (JWT, sessions)?
- Token lifecycle managed (expiry, refresh)?
- AuthZ model defined (RBAC, ABAC)?
- Service-to-service auth?
Resilience
- Timeouts on all external calls?
- Retry strategy with backoff?
- Circuit breakers for unstable dependencies?
- Graceful degradation paths?
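The retry and circuit-breaker items above are often easier to review with a concrete shape in mind. The following is a minimal sketch under stated assumptions: `retry_with_backoff`, `CircuitBreaker`, and `CircuitOpenError` are hypothetical names, the backoff uses full jitter, and the default thresholds are placeholders rather than recommended values.

```python
import random
import time


def retry_with_backoff(fn, max_attempts=4, base_delay=0.1, max_delay=2.0,
                       sleep=time.sleep, rng=random.random):
    """Call fn, retrying failures with exponential backoff and full jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(rng() * cap)  # random delay in [0, cap) spreads out retries


class CircuitOpenError(RuntimeError):
    """Raised when the breaker is open and calls are rejected immediately."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self._threshold = failure_threshold
        self._reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, fn):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_timeout:
                raise CircuitOpenError("circuit open; failing fast")
            # Timeout elapsed: fall through and allow one trial call (half-open).
        try:
            result = fn()
        except Exception:
            self._failures += 1
            if self._failures >= self._threshold:
                self._opened_at = self._clock()  # open, or re-open after a failed trial
            raise
        self._failures = 0
        self._opened_at = None
        return result
```

In practice retries should only wrap idempotent operations, and the caller can catch `CircuitOpenError` to serve a degraded response such as cached or default data.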
Monitoring
- Golden signals tracked (latency, traffic, errors, saturation)?
- Distributed tracing for request flows?
- Structured logging?
- Alerts tied to SLOs, not raw metrics?
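To make "alerts tied to SLOs" concrete, a recommendation can express alerting in terms of error-budget burn rate instead of raw error counts. The sketch below is illustrative: `burn_rate` and `should_page` are hypothetical helpers, and the default threshold of 14.4 is an assumed example value that should be tuned per service.

```python
def burn_rate(error_count: int, request_count: int, slo_target: float) -> float:
    """Return how fast the error budget is being spent.

    1.0 means errors arrive exactly at the rate the SLO allows;
    higher values exhaust the budget proportionally sooner.
    slo_target is a fraction such as 0.999.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if request_count == 0:
        return 0.0
    error_budget = 1.0 - slo_target
    return (error_count / request_count) / error_budget


def should_page(short_window_rate: float, long_window_rate: float,
                threshold: float = 14.4) -> bool:
    """Page only when both a short and a long window burn fast.

    Requiring both windows filters out brief blips (short only) and
    already-recovered incidents (long only).
    """
    return short_window_rate > threshold and long_window_rate > threshold


if __name__ == "__main__":
    short = burn_rate(error_count=30, request_count=1_000, slo_target=0.999)
    long_ = burn_rate(error_count=200, request_count=12_000, slo_target=0.999)
    print(f"short={short:.1f} long={long_:.1f} page={should_page(short, long_)}")
```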
Common Anti-Patterns to Flag
- No caching strategy: "Just add Redis" without invalidation plan
- Wrong database choice: Forcing a relational database onto graph-shaped data, or a NoSQL store without transactional guarantees onto transactional workloads
- Ignoring partition tolerance: Designing as if the network never fails
- Naive sharding: Choosing shard key without considering access patterns
- Synchronous everything: No async processing for non-critical paths
- Alert fatigue: Alerting on every error instead of on user impact
- Missing rate limiting: No protection against traffic spikes
- Stateless assumption violations: In-process session state that forces sticky sessions and undermines horizontal scaling
