System Design Analysis

Analyze distributed system designs for scalability, reliability, performance, and security. Produce structured review documents with gaps and actionable recommendations.

Core Principle

System design is about trade-offs, not perfect answers. Every recommendation must consider context: access patterns, scale requirements, consistency needs, and operational constraints.

Workflow

Phase 1: Information Gathering

Ask focused questions to understand the system. Prioritize these areas:

Functional scope: What does the system do? Core operations?
Scale: Expected QPS, data volume, user count?
Access patterns: Read-heavy vs write-heavy? Hot spots?
Consistency requirements: Strong vs eventual? Where?
Availability targets: SLA requirements? Acceptable downtime?
Current architecture: Existing components, databases, services?
Known pain points: What's broken or struggling today?

Limit each message to 3-5 questions. When information is missing, make reasonable assumptions and document them explicitly.
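Answers about scale and availability are easier to reason about once converted into concrete numbers. The following is a minimal sketch, not part of the original workflow; the function names, the 30-day period, and the peak-to-average factor are illustrative assumptions.

```python
def downtime_budget_minutes(availability_pct: float, period_days: float = 30.0) -> float:
    """Minutes of downtime allowed per period for a given availability target."""
    if not 0.0 <= availability_pct <= 100.0:
        raise ValueError("availability_pct must be between 0 and 100")
    total_minutes = period_days * 24 * 60
    return total_minutes * (1.0 - availability_pct / 100.0)


def peak_qps(daily_requests: int, peak_factor: float = 3.0) -> float:
    """Average QPS scaled by an assumed peak-to-average ratio."""
    seconds_per_day = 86_400
    return daily_requests / seconds_per_day * peak_factor


# A 99.9% target over 30 days leaves roughly 43 minutes of downtime.
budget = downtime_budget_minutes(99.9)
```

If the user gives only a daily request count, state the assumed peak factor under Assumed Requirements rather than treating the derived QPS as a stated requirement.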

Phase 2: Topic Analysis

Based on gathered information, analyze relevant system design topics. Load reference files as needed:

| Topic | Reference File | When to Load |
|---|---|---|
| Load balancing | load-balancing.md | Traffic distribution, L4/L7 decisions |
| Caching | caching.md | Latency optimization, read scaling |
| Databases | databases.md | Data modeling, SQL vs NoSQL choices |
| CAP & Consistency | cap-consistency.md | Consistency model decisions |
| Sharding | sharding-partitioning.md | Write/storage scaling |
| Replication | replication.md | Availability, read scaling |
| Message queues | message-queues.md | Async processing, decoupling |
| Rate limiting | rate-limiting.md | Traffic protection, abuse prevention |
| Auth | auth.md | Security, identity management |
| Resilience | resilience-patterns.md | Failure handling, fault tolerance |
| Monitoring | monitoring-observability.md | Observability, debugging |

Load only topics relevant to the specific system under review.

Phase 3: Document Generation

Produce a structured analysis document with these sections:

# System Design Analysis: [System Name]

## 1. Abstract
Brief summary of the system and analysis scope (2-3 paragraphs).

## 2. Requirements

### 2.1 Stated Requirements
Requirements explicitly provided by the user.

### 2.2 Assumed Requirements
Reasonable assumptions with rationale. Format:
- **Assumption**: [what was assumed]
- **Rationale**: [why this is reasonable]

## 3. Current System Review
Analysis of existing architecture against requirements. Organize by topic area.

## 4. Gaps
Identified issues, risks, or missing capabilities. Prioritize by impact:
- **Critical**: System failures, data loss risks
- **High**: Performance bottlenecks, scalability limits
- **Medium**: Operational inefficiencies, maintainability issues
- **Low**: Nice-to-have improvements

## 5. Recommendations
Actionable improvements with:
- **Problem addressed**: Which gap(s) this solves
- **Recommendation**: Specific technical approach
- **Example**: Concrete implementation guidance
- **Trade-offs**: What you gain vs what you sacrifice
- **Impact**: Expected improvement if implemented
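
Outside the template itself, the gap priorities from section 4 can be kept as structured data so the Gaps section is always emitted in impact order. This is a minimal sketch; the `Severity` and `Gap` types and the `prioritize` helper are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    """Lower value means higher priority, matching the order in section 4."""
    CRITICAL = 0
    HIGH = 1
    MEDIUM = 2
    LOW = 3


@dataclass(frozen=True)
class Gap:
    title: str
    severity: Severity


def prioritize(gaps):
    """Return gaps ordered Critical first; ties keep their input order."""
    return sorted(gaps, key=lambda gap: gap.severity)


gaps = [
    Gap("No dashboards for queue depth", Severity.MEDIUM),
    Gap("Single database with no backups", Severity.CRITICAL),
    Gap("Unindexed lookup on hot path", Severity.HIGH),
]
ordered = prioritize(gaps)
```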

Analysis Checklist

For each relevant topic, evaluate:

Load Balancing

Algorithm appropriate for workload (round robin, least connections, consistent hashing)?
L4 vs L7 appropriate for use case?
LB itself highly available?
Health checks configured?
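
When the review needs to explain why consistent hashing suits sticky or cache-affine workloads, a small ring makes the behavior concrete. This is a minimal sketch under stated assumptions: the `HashRing` class, the virtual node count, and the use of SHA-256 are illustrative choices, not a reference to any particular load balancer.

```python
import bisect
import hashlib


class HashRing:
    """Consistent hash ring with virtual nodes (illustrative sketch)."""

    def __init__(self, nodes, vnodes=100):
        ring = []
        for node in nodes:
            for i in range(vnodes):
                ring.append((self._hash(f"{node}#{i}"), node))
        ring.sort()
        self._ring = ring
        self._points = [point for point, _ in ring]

    @staticmethod
    def _hash(value: str) -> int:
        digest = hashlib.sha256(value.encode()).digest()
        return int.from_bytes(digest[:8], "big")

    def node_for(self, key: str) -> str:
        if not self._ring:
            raise ValueError("ring has no nodes")
        # First ring point clockwise from the key's hash, wrapping around.
        idx = bisect.bisect_right(self._points, self._hash(key)) % len(self._ring)
        return self._ring[idx][1]


ring = HashRing(["lb-a", "lb-b", "lb-c"])
target = ring.node_for("session-1234")
```

The property worth checking in a design is that removing one node only remaps the keys that node owned, which is what distinguishes this from `hash(key) % node_count`.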

Caching

Cache strategy defined (cache-aside, write-through)?
Eviction policy appropriate (LRU, TTL)?
Cache invalidation strategy?
Hot key and cache stampede handling?
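
The caching questions above can be illustrated with a small cache-aside wrapper that combines TTL expiry, explicit invalidation, and a per-key lock so only one caller reloads an expired key. This is a sketch, assuming an in-process dictionary stands in for the cache; `CacheAside` and its `loader` callback are hypothetical names.

```python
import threading
import time


class CacheAside:
    """Cache-aside with TTL and per-key locking against stampedes (sketch)."""

    def __init__(self, loader, ttl_seconds=60.0, clock=time.monotonic):
        self._loader = loader          # fetches from the source of truth
        self._ttl = ttl_seconds
        self._clock = clock
        self._data = {}                # key -> (value, expires_at)
        self._locks = {}               # key -> lock guarding reloads
        self._guard = threading.Lock()

    def _lock_for(self, key):
        with self._guard:
            return self._locks.setdefault(key, threading.Lock())

    def _fresh(self, key):
        entry = self._data.get(key)
        if entry is not None and entry[1] > self._clock():
            return entry
        return None

    def get(self, key):
        entry = self._fresh(key)
        if entry is not None:
            return entry[0]
        with self._lock_for(key):
            # Re-check: another thread may have reloaded while we waited.
            entry = self._fresh(key)
            if entry is not None:
                return entry[0]
            value = self._loader(key)
            self._data[key] = (value, self._clock() + self._ttl)
            return value

    def invalidate(self, key):
        """Call after writing to the source of truth."""
        self._data.pop(key, None)
```

A distributed cache needs an equivalent of the per-key lock (for example a short-lived lock key or request coalescing); the in-process lock here only protects a single node.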

Databases

Data model matches access patterns?
Indexes support critical queries?
Read-heavy vs write-heavy considered?
Appropriate SQL vs NoSQL choice?
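
To check whether indexes support a critical query, inspect the query plan rather than guessing. The sketch below uses SQLite's `EXPLAIN QUERY PLAN` through the standard `sqlite3` module purely as an illustration; the `orders` table, the index name, and the `query_plan` helper are hypothetical, and other databases expose plans through their own commands.

```python
import sqlite3


def query_plan(conn, sql, params=()):
    """Return the human-readable detail column of SQLite's query plan."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return [row[3] for row in rows]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, total REAL)"
)
critical_query = "SELECT total FROM orders WHERE user_id = ?"

# Without an index on user_id the plan is a full table scan.
plan_before = query_plan(conn, critical_query, (42,))

conn.execute("CREATE INDEX idx_orders_user ON orders (user_id)")

# With the index the plan becomes an index search.
plan_after = query_plan(conn, critical_query, (42,))
```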

CAP & Consistency

Consistency model matches business requirements?
Trade-offs between consistency and availability explicit?
Read-your-writes where needed?
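
For quorum-replicated stores, one quick check is whether the read and write quorums overlap. The sketch below assumes a Dynamo-style configuration with `n` replicas, read quorum `r`, and write quorum `w`; the function name is illustrative. Overlap means every read contacts at least one replica that acknowledged the latest write; it is a necessary ingredient for read-your-writes, not a complete consistency guarantee.

```python
def quorums_overlap(n: int, r: int, w: int) -> bool:
    """True when any read quorum must intersect any write quorum."""
    if not (1 <= r <= n and 1 <= w <= n):
        raise ValueError("r and w must be between 1 and n")
    return r + w > n


# n=3 with r=2, w=2 overlaps; r=1, w=1 favors latency and availability instead.
strong_leaning = quorums_overlap(3, 2, 2)
fast_leaning = quorums_overlap(3, 1, 1)
```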

Sharding

Shard key distributes load evenly?
Hot partitions addressed?
Cross-shard operations minimized?
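
A candidate shard key can be sanity-checked against a sample of real keys by measuring how uneven the resulting load is. This is a minimal sketch assuming hash-based sharding; `shard_for` and `skew_ratio` are hypothetical helpers, and the sample keys are made up.

```python
import hashlib
from collections import Counter


def shard_for(key: str, shard_count: int) -> int:
    """Map a key to a shard with a stable hash (not Python's salted hash())."""
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


def skew_ratio(keys, shard_count: int) -> float:
    """Busiest shard's load divided by the mean load; 1.0 is perfectly even."""
    counts = Counter(shard_for(key, shard_count) for key in keys)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return max(counts.values()) / (total / shard_count)


# Sharding by tenant when one tenant dominates traffic creates a hot partition.
by_tenant = skew_ratio(["tenant-1"] * 100, shard_count=4)
by_user = skew_ratio([f"user-{i}" for i in range(10_000)], shard_count=4)
```

Feed the function one key per request (not one per distinct entity) so that frequently accessed keys weigh more, which is what exposes hot partitions.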

Replication

Sync vs async replication appropriate?
Replica lag acceptable?
Leader election mechanism defined?
Split-brain prevention?
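
Split-brain prevention usually comes down to requiring a strict majority before any side of a partition may elect a leader. The following sketch shows the arithmetic only; the function names are illustrative and no specific consensus protocol is implied.

```python
def majority(cluster_size: int) -> int:
    """Smallest number of nodes that forms a strict majority."""
    if cluster_size < 1:
        raise ValueError("cluster_size must be at least 1")
    return cluster_size // 2 + 1


def can_elect_leader(reachable_nodes: int, cluster_size: int) -> bool:
    """Only the side of a partition holding a majority may elect a leader."""
    return reachable_nodes >= majority(cluster_size)


# A 5-node cluster split 3/2: only the 3-node side can elect.
# A 4-node cluster split 2/2: neither side can, so the cluster stalls
# instead of ending up with two leaders.
```

This is also why even-sized clusters add cost without adding fault tolerance: 4 nodes tolerate one failure, the same as 3.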

Message Queues

Delivery guarantees appropriate?
Consumer idempotency?
Dead-letter queue for failures?
Backpressure handling?
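
With at-least-once delivery, consumers see duplicates, so idempotency and a dead-letter path go together. The sketch below is an in-memory illustration; `Message`, `Consumer`, and the returned status strings are hypothetical, and a real system would persist processed IDs durably.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Message:
    id: str
    body: str
    attempts: int = 0


@dataclass
class Consumer:
    handler: Callable[[str], None]
    max_attempts: int = 3
    processed_ids: set = field(default_factory=set)
    dead_letters: list = field(default_factory=list)

    def consume(self, msg: Message) -> str:
        # Idempotency: a redelivered message is acknowledged without rerunning.
        if msg.id in self.processed_ids:
            return "duplicate"
        msg.attempts += 1
        try:
            self.handler(msg.body)
        except Exception:
            # After max_attempts, park the message instead of retrying forever.
            if msg.attempts >= self.max_attempts:
                self.dead_letters.append(msg)
                return "dead-lettered"
            return "retry"
        self.processed_ids.add(msg.id)
        return "processed"
```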

Rate Limiting

Algorithm chosen (token bucket recommended)?
Limits appropriate for different tiers?
Distributed enforcement for multi-node?
Graceful handling of limit breaches?
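
A token bucket allows short bursts up to its capacity while enforcing an average rate. This is a single-process sketch; the `TokenBucket` class and its injectable `clock` parameter are illustrative, and multi-node enforcement would need shared state (for example an atomic counter in a shared store).

```python
import time


class TokenBucket:
    """Single-process token bucket (sketch)."""

    def __init__(self, rate_per_sec: float, capacity: float, clock=time.monotonic):
        self.rate = rate_per_sec
        self.capacity = capacity
        self._clock = clock
        self._tokens = float(capacity)   # start full so initial bursts pass
        self._last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        now = self._clock()
        # Refill according to elapsed time, capped at capacity.
        self._tokens = min(self.capacity, self._tokens + (now - self._last) * self.rate)
        self._last = now
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False
```

When `allow` returns `False`, graceful handling typically means responding with HTTP 429 and a `Retry-After` header rather than dropping the connection.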

Authentication & Authorization

AuthN mechanism appropriate (JWT, sessions)?
Token lifecycle managed (expiry, refresh)?
AuthZ model defined (RBAC, ABAC)?
Service-to-service auth?

Resilience

Timeouts on all external calls?
Retry strategy with backoff?
Circuit breakers for unstable dependencies?
Graceful degradation paths?
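
Retries and circuit breakers work together: retries absorb transient faults, and the breaker stops retry traffic from piling onto a dependency that is already failing. This is a minimal sketch; `retry_with_backoff`, `CircuitBreaker`, and their default thresholds are illustrative assumptions, and the breaker is not thread-safe.

```python
import random
import time


def retry_with_backoff(call, max_attempts=4, base_delay=0.1, max_delay=2.0,
                       sleep=time.sleep):
    """Retry with exponential backoff and full jitter; re-raise on final failure."""
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except Exception:
            if attempt == max_attempts:
                raise
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, ceiling))


class CircuitBreaker:
    """Opens after consecutive failures; allows a trial call after a cooldown."""

    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self._threshold = failure_threshold
        self._reset_after = reset_after
        self._clock = clock
        self._failures = 0
        self._opened_at = None

    def call(self, fn):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_after:
                raise RuntimeError("circuit open")  # fail fast, skip the call
            # Cooldown elapsed: fall through and let one trial call run.
        try:
            result = fn()
        except Exception:
            self._failures += 1
            if self._failures >= self._threshold:
                self._opened_at = self._clock()
            raise
        self._failures = 0
        self._opened_at = None
        return result
```

Only retry operations that are idempotent, and pair the "circuit open" error with a degradation path such as a cached or default response.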

Monitoring

Golden signals tracked (latency, traffic, errors, saturation)?
Distributed tracing for request flows?
Structured logging?
Alerts tied to SLOs, not raw metrics?
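
Tying alerts to SLOs usually means alerting on how fast the error budget is being consumed rather than on the raw error count. The sketch below computes a burn rate and pages only when both a short and a long window exceed a threshold. The function names are illustrative, and the default threshold of 14.4 is an assumption (a commonly cited value for a one-hour window against a 30-day budget); tune it to the system under review.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' errors are occurring."""
    error_budget = 1.0 - slo_target
    if error_budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_ratio / error_budget


def should_page(short_window_ratio: float, long_window_ratio: float,
                slo_target: float, threshold: float = 14.4) -> bool:
    """Page only if both windows burn fast, filtering out brief blips."""
    return (burn_rate(short_window_ratio, slo_target) >= threshold
            and burn_rate(long_window_ratio, slo_target) >= threshold)


# With a 99.9% SLO, a sustained 2% error ratio burns budget about 20x too fast.
page = should_page(0.02, 0.02, slo_target=0.999)
```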

Common Anti-Patterns to Flag

No caching strategy: "Just add Redis" without invalidation plan
Wrong database choice: Forcing SQL for graph data, or NoSQL for workloads that need multi-record transactions
Ignoring partition tolerance: Designing as if the network never fails
Naive sharding: Choosing shard key without considering access patterns
Synchronous everything: No async processing for non-critical paths
Alert fatigue: Alerting on every error instead of user impact
Missing rate limiting: No protection against traffic spikes
Stateless assumption violations: Session stickiness breaking horizontal scaling
