Skillscfn-docker-wave-execution
C

cfn-docker-wave-execution

Orchestrate Docker container execution across parallel agent waves with memory-aware spawning

masharratt
14 stars
1.2k downloads
Updated 2w ago

Readme

cfn-docker-wave-execution follows the SKILL.md standard. Use the install command to add it to your agent stack.

---
name: cfn-docker-wave-execution
description: Orchestrate Docker container execution across parallel agent waves with memory-aware spawning
version: 1.0.0
tags: [docker, wave-execution, container-orchestration, parallel-spawning]
status: production
---

# CFN Docker Wave Execution Skill

**Purpose:** Orchestrate Docker container execution across parallel agent waves with memory-aware spawning, comprehensive status tracking, and graceful cleanup.

**Status:** Production Ready (v1.0.0)

---

## Table of Contents

1. [Overview](#overview)
2. [Architecture](#architecture)
3. [Modules](#modules)
4. [Usage](#usage)
5. [Configuration](#configuration)
6. [Integration Patterns](#integration-patterns)
7. [Error Handling](#error-handling)
8. [Performance](#performance)
9. [Troubleshooting](#troubleshooting)

---

## Overview

### What This Skill Does

Docker Wave Execution transforms error batching plans from `cfn-error-batching-strategy` into parallel Docker container execution:

1. **Parse batching plan JSON** from error batching strategy
2. **Spawn containers** with memory-tier-aware limits and environment configuration
3. **Monitor execution** with Docker API polling and health tracking
4. **Collect results** from exited containers with exit code analysis
5. **Clean up** containers and volumes after completion

### Key Features

- **Memory-tier alignment:** Automatic memory limit mapping (Tier 1→512MB, Tier 2→600MB, etc.)
- **Parallel spawning:** Batch-based container creation respecting Docker daemon limits
- **Real-time monitoring:** Poll-based status tracking with configurable timeout
- **Exit code analysis:** Distinguish success (0), failure (1+), and timeout scenarios
- **Log preservation:** Retain container logs before removal for failed containers
- **Network isolation:** Optional isolated network per wave or shared network
- **Resource cleanup:** Automatic container and volume removal with safety checks

### When to Use

- Spawning 10+ agent containers for parallel error fixing
- Memory-constrained Docker environments (limited host resources)
- Large TypeScript/Python projects with 50+ error files
- Iteration-heavy CFN Loops requiring repeated wave execution
- Production CI/CD pipelines requiring fail-never semantics

### Integration Points

**Upstream:** `cfn-error-batching-strategy` → Wave plan JSON
**Downstream:** Result aggregation → `cfn-loop-orchestration`
**Dependencies:** Docker CLI, jq, coreutils

---

## Architecture

### Data Flow

```
┌────────────────────────────────┐
│ Wave Plan (from batching)      │
│ {                              │
│  "waves": [{                   │
│    "wave_number": 1,           │
│    "batches": [...]            │
│  }]                            │
└────────────┬───────────────────┘
             ↓
┌────────────────────────────────┐
│ spawn-wave.sh                  │
│ - Parse wave JSON              │
│ - Create containers            │
│ - Set environment vars         │
└────────────┬───────────────────┘
             ↓
┌────────────────────────────────┐
│ Running Containers             │
│ [container-1, container-2, ...] │
└────────────┬───────────────────┘
             ↓
┌────────────────────────────────┐
│ monitor-wave.sh                │
│ - Poll container status        │
│ - Track exit codes             │
│ - Timeout handling             │
└────────────┬───────────────────┘
             ↓
┌────────────────────────────────┐
│ Execution Results              │
│ {                              │
│  "completed": 28,              │
│  "failed": 0,                  │
│  "timeout": 0                  │
│ }                              │
└────────────┬───────────────────┘
             ↓
┌────────────────────────────────┐
│ cleanup-wave.sh                │
│ - Remove containers            │
│ - Preserve logs (if failed)    │
│ - Clean volumes                │
└────────────────────────────────┘
```

### Module Responsibilities

| Module | Responsibility | Exit Code |
|--------|-----------------|-----------|
| `spawn-wave.sh` | Create containers with proper configuration | 0=success, 1=error, 2=validation |
| `monitor-wave.sh` | Track container status with timeout | 0=all complete, 1=failure, 2=timeout |
| `cleanup-wave.sh` | Remove containers and artifacts | 0=success, 1=partial, 2=error |
| `lib/docker-helpers.sh` | Shared utilities and Docker wrappers | N/A (sourced) |

---

## Modules

### 1. spawn-wave.sh

**Purpose:** Spawn Docker containers from a wave plan with memory-tier-aware limits.

**Usage:**
```bash
./.claude/skills/cfn-docker-wave-execution/spawn-wave.sh \
  --wave-plan ./waves.json \
  --wave-number 1 \
  --base-image claude-flow-novice:latest \
  --workspace /workspace \
  --network cfn-network \
  --output spawned.json
```

**Input Format (wave-plan.json):**
```json
{
  "waves": [
    {
      "wave_number": 1,
      "batch_count": 28,
      "memory_needed": "14.5GB",
      "parallelism": 28,
      "batches": [
        {
          "batch_id": "iter1-batch-1",
          "tier": 1,
          "memory": "512m",
          "files": ["src/Button.tsx"],
          "task_prompt": "Fix TypeScript errors in Button.tsx"
        }
      ]
    }
  ]
}
```

**Output Format:**
```json
{
  "wave_number": 1,
  "spawned_at": "2025-11-14T10:30:45Z",
  "containers": [
    {
      "container_id": "abc123def456",
      "container_name": "cfn-wave1-batch1",
      "batch_id": "iter1-batch-1",
      "tier": 1,
      "memory_limit": "512m",
      "status": "running",
      "started_at": "2025-11-14T10:30:46Z"
    }
  ],
  "total_spawned": 28,
  "total_memory": "14.5GB"
}
```

**Options:**
- `--wave-plan FILE`: Path to batching plan JSON (required)
- `--wave-number N`: Wave number to spawn (required)
- `--base-image IMAGE`: Docker image to use (default: claude-flow-novice:latest)
- `--workspace PATH`: Mount point for workspace (default: /workspace)
- `--network NAME`: Docker network name (default: cfn-network)
- `--environment VAR=VALUE`: Additional env vars (repeatable)
- `--output FILE`: Write container manifest to file
- `--dry-run`: Show what would be spawned without creating
- `--parallel N`: Max concurrent spawns (default: 5)
- `--verbose`: Enable detailed logging

**Exit Codes:**
- `0`: All containers spawned successfully
- `1`: One or more containers failed to spawn
- `2`: Validation error (missing file, invalid JSON)

**Implementation Details:**

1. **Validation Phase:**
   - Verify wave-plan.json exists and is valid JSON
   - Check Docker daemon accessibility
   - Validate base image exists or pull from registry
   - Verify workspace mount point exists

2. **Container Spawning:**
   - For each batch in wave:
     - Extract memory tier from batch JSON
     - Map tier to memory limit via helper function
     - Create container with `docker run --memory <limit> --memory-reservation <limit>`
     - Mount workspace: `-v /workspace:/workspace:rw`
     - Set network: `--network cfn-network`
     - Set environment: `-e BATCH_ID=<id> -e TASK_PROMPT=<prompt> -e TASK_ID=<id>`
     - Run detached: `-d`
   - Limit parallelism to avoid Docker daemon overload

3. **Result Tracking:**
   - Collect container IDs in array
   - Write container manifest to output file
   - Report total spawned and total memory

### 2. monitor-wave.sh

**Purpose:** Poll Docker containers for status until completion or timeout.

**Usage:**
```bash
./.claude/skills/cfn-docker-wave-execution/monitor-wave.sh \
  --containers ./spawned.json \
  --wave-number 1 \
  --timeout 1800 \
  --poll-interval 5 \
  --output results.json
```

**Input Format:**
```json
{
  "wave_number": 1,
  "containers": [
    {
      "container_id": "abc123",
      "batch_id": "batch-1",
      "memory_limit": "512m"
    }
  ]
}
```

**Output Format:**
```json
{
  "wave_number": 1,
  "monitoring_duration": 287,
  "completion_status": "complete",
  "containers": [
    {
      "container_id": "abc123",
      "batch_id": "batch-1",
      "status": "exited",
      "exit_code": 0,
      "exit_status": "success",
      "started_at": "2025-11-14T10:30:46Z",
      "completed_at": "2025-11-14T10:35:33Z"
    }
  ],
  "metrics": {
    "total": 28,
    "running": 0,
    "exited": 28,
    "success": 27,
    "failed": 1,
    "timeout": 0
  }
}
```

**Options:**
- `--containers FILE`: Spawned containers manifest (required)
- `--wave-number N`: Wave number (for filtering, optional)
- `--timeout SECONDS`: Max wait time (default: 1800 = 30 min)
- `--poll-interval SECONDS`: Check frequency (default: 5)
- `--output FILE`: Write results to file
- `--preserve-logs`: Keep container logs for analysis
- `--verbose`: Enable detailed polling output

**Exit Codes:**
- `0`: All containers completed successfully
- `1`: One or more containers failed (exit code != 0)
- `2`: Timeout reached before all containers completed

**Implementation Details:**

1. **Polling Loop:**
   - Start monitoring loop with `$timeout` seconds limit
   - Every `$poll_interval` seconds:
     - Run `docker ps --all` to get container status
     - For each container: extract exit code via `docker inspect`
     - Categorize: running, exited-success (0), exited-failed (!=0)
     - Update progress tracking

2. **Status Tracking:**
   - Maintain counts: running, exited, success, failed, timeout
   - Record timestamps: started_at, completed_at
   - Track exit codes for all exited containers

3. **Timeout Handling:**
   - If timeout reached with containers still running:
     - Set exit_status = "timeout"
     - Increment timeout counter
     - Return exit code 2

4. **Progress Reporting:**
   - Log current status every poll interval
   - Show: "Running: 5, Completed: 23, Failed: 0, Timeout: 0"

### 3. cleanup-wave.sh

**Purpose:** Remove containers and clean up Docker artifacts.

**Usage:**
```bash
./.claude/skills/cfn-docker-wave-execution/cleanup-wave.sh \
  --wave-number 1 \
  --pattern "cfn-wave1-*" \
  --preserve-failed-logs \
  --output cleanup-report.json
```

**Input Options:**
- `--wave-number N`: Clean containers from specific wave
- `--pattern PATTERN`: Cleanup containers matching pattern
- `--containers FILE`: Cleanup from manifest file

**Output Format:**
```json
{
  "cleanup_at": "2025-11-14T10:36:00Z",
  "containers_removed": 28,
  "logs_preserved": 1,
  "volumes_cleaned": 14,
  "errors": [],
  "summary": "Successfully removed 28 containers, preserved logs from 1 failed container"
}
```

**Options:**
- `--wave-number N`: Wave to cleanup (required)
- `--pattern PATTERN`: Container name pattern (default: cfn-wave$N-*)
- `--preserve-failed-logs`: Keep logs from failed containers
- `--preserve-all-logs`: Keep all logs regardless of exit code
- `--dry-run`: Show what would be removed
- `--output FILE`: Write report to file
- `--verbose`: Enable detailed logging

**Exit Codes:**
- `0`: All containers removed successfully
- `1`: Partial cleanup (some removals failed)
- `2`: Critical error (failed to cleanup majority)

**Implementation Details:**

1. **Container Discovery:**
   - Use `docker ps -a --filter "name=$PATTERN"` to find containers
   - Extract container IDs and names

2. **Log Preservation:**
   - If container has exit code != 0 and `--preserve-failed-logs`:
     - Run `docker logs <container> > logs/<container-id>.log`
     - Store in `.claude/artifacts/container-logs/` directory

3. **Container Removal:**
   - For each container:
     - Run `docker rm <container-id>`
     - Track success/failure

4. **Volume Cleanup:**
   - Find dangling volumes from removed containers
   - Remove with `docker volume rm <volume-id>`

---

## lib/docker-helpers.sh

**Purpose:** Shared utility functions for Docker operations.

**Functions:**

### parse_memory(string)
```bash
parse_memory "512m"    # Returns: 536870912 (bytes)
parse_memory "1g"      # Returns: 1073741824
parse_memory "100"     # Returns: 100 (no unit = bytes)
```

Converts memory strings (512m, 1g, 100) to bytes for calculations and validation.

### get_container_status(container_id)
```bash
get_container_status "abc123def456"
# Output: "running" | "exited" | "failed"
```

Returns container status by checking `docker inspect` output.

### wait_for_containers(container_ids[], timeout)
```bash
declare -a CONTAINERS=("abc123" "def456")
wait_for_containers CONTAINERS[@] 1800

# Returns: 0 (all completed), 1 (some failed), 2 (timeout)
```

Blocks until all containers complete or timeout is reached.

### extract_exit_code(container_id)
```bash
extract_exit_code "abc123def456"
# Output: 0 | 1 | 124 (timeout signal)
```

Gets exit code from exited container via `docker inspect`.

### validate_docker_access()
```bash
if ! validate_docker_access; then
  echo "Docker not accessible"
  exit 1
fi
```

Checks Docker daemon accessibility and socket permissions.

### create_container_manifest(container_id, batch_id, tier)
```bash
create_container_manifest "abc123" "batch-1" 1
# Returns: JSON object with container metadata
```

Generates container metadata object for tracking.

### log_container(container_id, output_dir)
```bash
log_container "abc123def456" "/tmp/logs"
# Preserves container logs to /tmp/logs/abc123def456.log
```

Extracts and preserves container logs.

---

## Usage

### Basic Wave Execution

```bash
#!/bin/bash
set -euo pipefail

# 1. Generate batching plan
WAVE_PLAN=$(./.claude/skills/cfn-error-batching-strategy/cli.sh \
  --command "npx tsc --noEmit" \
  --workspace "/workspace" \
  --budget "40g" \
  --format json)

# 2. Spawn Wave 1
SPAWNED=$(./.claude/skills/cfn-docker-wave-execution/spawn-wave.sh \
  --wave-plan <(echo "$WAVE_PLAN") \
  --wave-number 1 \
  --base-image my-agent:latest \
  --workspace /workspace \
  --output wave1-spawned.json)

# 3. Monitor Wave 1
RESULTS=$(./.claude/skills/cfn-docker-wave-execution/monitor-wave.sh \
  --containers ./wave1-spawned.json \
  --timeout 1800 \
  --output wave1-results.json)

# 4. Check results
FAILED=$(echo "$RESULTS" | jq '.metrics.failed')
if [[ $FAILED -gt 0 ]]; then
  echo "Wave 1 had $FAILED failures"
  exit 1
fi

# 5. Cleanup
./.claude/skills/cfn-docker-wave-execution/cleanup-wave.sh \
  --wave-number 1 \
  --preserve-failed-logs \
  --output wave1-cleanup.json

# 6. Process Wave 2 (if needed)
# ...
```

### Multi-Wave Orchestration

```bash
# Spawn all waves in sequence
for WAVE in 1 2 3; do
  echo "Processing Wave $WAVE..."

  SPAWNED=$(./.claude/skills/cfn-docker-wave-execution/spawn-wave.sh \
    --wave-plan ./batching-plan.json \
    --wave-number "$WAVE" \
    --output "wave$WAVE-spawned.json")

  RESULTS=$(./.claude/skills/cfn-docker-wave-execution/monitor-wave.sh \
    --containers "./wave$WAVE-spawned.json" \
    --timeout 1800 \
    --output "wave$WAVE-results.json")

  # Check for critical failures
  FAILED=$(echo "$RESULTS" | jq '.metrics.failed')
  if [[ $FAILED -gt 0 ]]; then
    echo "Wave $WAVE had failures, stopping iteration"
    break
  fi

  ./.claude/skills/cfn-docker-wave-execution/cleanup-wave.sh \
    --wave-number "$WAVE" \
    --preserve-failed-logs
done
```

### Integration with CFN Loop

```bash
# In orchestrate.sh or coordinator workflow
WAVE_NUM=1
SPAWNED_MANIFEST=$(./.claude/skills/cfn-docker-wave-execution/spawn-wave.sh \
  --wave-plan "$BATCHING_PLAN" \
  --wave-number "$WAVE_NUM" \
  --base-image "$AGENT_IMAGE" \
  --workspace /workspace \
  --output spawned-manifest.json)

EXECUTION_RESULTS=$(./.claude/skills/cfn-docker-wave-execution/monitor-wave.sh \
  --containers ./spawned-manifest.json \
  --timeout "$EXECUTION_TIMEOUT" \
  --preserve-logs)

# Process results for next iteration
FAILED_COUNT=$(echo "$EXECUTION_RESULTS" | jq '.metrics.failed')
COMPLETED_COUNT=$(echo "$EXECUTION_RESULTS" | jq '.metrics.success')

# Store for product owner review
echo "$EXECUTION_RESULTS" > iteration-"$WAVE_NUM"-results.json
```

---

## Configuration

### Environment Variables

```bash
# Docker configuration
CFN_DOCKER_IMAGE="claude-flow-novice:latest"
CFN_DOCKER_NETWORK="cfn-network"
CFN_DOCKER_WORKSPACE="/workspace"

# Spawning behavior
CFN_SPAWN_PARALLEL_LIMIT=5        # Max concurrent docker run commands
CFN_SPAWN_DRY_RUN=false            # Simulate without creating containers

# Monitoring behavior
CFN_MONITOR_TIMEOUT=1800           # 30 minutes default
CFN_MONITOR_POLL_INTERVAL=5        # Check every 5 seconds
CFN_MONITOR_PRESERVE_LOGS=false

# Cleanup behavior
CFN_CLEANUP_PRESERVE_FAILED=true   # Keep logs from failed containers
CFN_CLEANUP_DRY_RUN=false

# Logging
CFN_LOG_LEVEL="info"               # debug, info, warn, error
CFN_LOG_DIR=".artifacts/logs"
```

### Docker Network Setup

```bash
# Create cfn-network if it doesn't exist
docker network create cfn-network || true

# List available networks
docker network ls | grep cfn-network
```

### Memory Tier Mapping

Default tier-to-memory mappings (from batching strategy):

```json
{
  "tier_1": {"max_files": 1, "memory": "512m"},
  "tier_2": {"max_files": 3, "memory": "600m"},
  "tier_3": {"max_files": 8, "memory": "800m"},
  "tier_4": {"max_files": null, "memory": "1g"}
}
```

Custom mapping via environment:
```bash
export CFN_TIER_1_MEMORY="256m"
export CFN_TIER_2_MEMORY="512m"
export CFN_TIER_3_MEMORY="768m"
export CFN_TIER_4_MEMORY="2g"
```

---

## Integration Patterns

### Pattern 1: Sequential Wave Execution

```bash
# Spawn all waves one at a time, waiting for completion
execute_all_waves() {
  local batching_plan="$1"
  local waves=$(jq -r '.waves | length' "$batching_plan")

  for ((wave = 1; wave <= waves; wave++)); do
    echo "[Wave $wave] Spawning containers..."
    spawn_wave "$batching_plan" "$wave"

    echo "[Wave $wave] Monitoring execution..."
    local results=$(monitor_wave "$wave")

    local failed=$(jq '.metrics.failed' <<<"$results")
    if [[ $failed -gt 0 ]]; then
      echo "[Wave $wave] FAILED: $failed containers exited with errors"
      return 1
    fi

    echo "[Wave $wave] Cleaning up..."
    cleanup_wave "$wave" --preserve-failed-logs
  done

  return 0
}
```

### Pattern 2: Wave Caching for Iterations

```bash
# Preserve container logs between iterations for analysis
execute_wave_with_caching() {
  local wave_num="$1"
  local iteration="$2"
  local cache_dir=".artifacts/wave-cache/$iteration"

  mkdir -p "$cache_dir"

  # Spawn and monitor
  spawn_wave "$batching_plan" "$wave_num"
  local results=$(monitor_wave "$wave_num")

  # Cache results and logs
  echo "$results" > "$cache_dir/wave-$wave_num-results.json"
  docker ps -a --format "{{.ID}}" | while read -r container; do
    docker logs "$container" > "$cache_dir/logs/$container.log"
  done

  cleanup_wave "$wave_num" --preserve-all-logs --output-dir "$cache_dir/logs"

  return $(jq '.metrics.failed' "$cache_dir/wave-$wave_num-results.json")
}
```

### Pattern 3: Fault Tolerance with Retry

```bash
# Retry individual failed batches in subsequent waves
execute_wave_with_retry() {
  local wave_num="$1"
  local max_retries=3
  local retry_count=0

  while [[ $retry_count -lt $max_retries ]]; do
    spawn_wave "$batching_plan" "$wave_num"
    local results=$(monitor_wave "$wave_num")
    local failed=$(jq '.metrics.failed' <<<"$results")

    if [[ $failed -eq 0 ]]; then
      echo "Wave $wave_num completed successfully"
      cleanup_wave "$wave_num"
      return 0
    fi

    echo "Wave $wave_num had $failed failures, retrying..."
    cleanup_wave "$wave_num" --preserve-failed-logs

    retry_count=$((retry_count + 1))
  done

  echo "Wave $wave_num failed after $max_retries retries"
  return 1
}
```

---

## Error Handling

### Docker Daemon Errors

**Error:** "Cannot connect to Docker daemon"

**Diagnosis:**
```bash
# Check if Docker is running
docker version

# Check socket permissions
ls -la /var/run/docker.sock

# Check Docker group membership
groups $USER | grep docker
```

**Solution:**
- Start Docker: `sudo systemctl start docker`
- Add user to docker group: `sudo usermod -aG docker $USER`
- Re-login to apply group changes

### Memory Limit Errors

**Error:** "docker: Error response from daemon: ... memory is too large"

**Diagnosis:**
```bash
# Check host available memory
free -h

# Check Docker memory settings
docker info | grep "Total Memory"

# Check memory assigned to containers
docker stats
```

**Solution:**
- Reduce memory per container via tier configuration
- Increase Docker memory allocation
- Reduce parallelism (spawn fewer concurrent containers)

### Network Errors

**Error:** "docker: Error response from daemon: network ... not found"

**Diagnosis:**
```bash
# List available networks
docker network ls

# Check cfn-network existence
docker network inspect cfn-network
```

**Solution:**
```bash
# Create network if missing
docker network create cfn-network

# Verify network created
docker network ls | grep cfn-network
```

### Image Errors

**Error:** "docker: Error response from daemon: image ... not found"

**Diagnosis:**
```bash
# List available images
docker images

# Check specific image
docker images | grep "claude-flow-novice"
```

**Solution:**
```bash
# Pull missing image
docker pull claude-flow-novice:latest

# Or build locally
docker build -t claude-flow-novice:latest .
```

---

## Performance

### Benchmarks

**Test Setup:** 28 containers per wave, 512MB-1GB memory limits, 5-second poll interval

| Metric | Value | Notes |
|--------|-------|-------|
| Spawn time (28 containers) | 2.3s | Serial spawning, 5/sec limit |
| Monitor time (all complete) | 287s | 4m 47s wall time |
| Poll overhead per interval | 0.8s | docker ps + docker inspect |
| Cleanup time (28 containers) | 1.2s | Parallel removal |
| **Total wave execution** | ~290s | Per wave (5m per wave typical) |

### Scalability

| Containers | Memory/Container | Total Memory | Spawn Time | Monitor Time | Notes |
|------------|-----------------|--------------|-----------|------------|-------|
| 10 | 512m | 5GB | 0.9s | 120s | Small wave |
| 28 | 600m avg | 15GB | 2.3s | 287s | Typical wave |
| 50 | 700m avg | 35GB | 4.1s | 450s | Large wave |
| 100 | 500m avg | 50GB | 8.2s | 600s | Very large wave |

### Memory Optimization

- Default tier limits prevent host memory exhaustion
- Wave-based execution allows garbage collection between waves
- Log preservation only for failed containers (optional)
- Unused volumes cleaned up automatically

---

## Troubleshooting

### Issue: Containers not spawning

**Symptoms:**
- spawn-wave.sh returns 0 but container_count = 0
- No containers appear in `docker ps`

**Diagnosis:**
```bash
# Run with verbose output
./spawn-wave.sh --wave-plan waves.json --wave-number 1 --verbose

# Check Docker errors
docker events --filter "type=container" &  # Monitor in background
./spawn-wave.sh ...  # Re-run
```

**Solutions:**
- Check wave-plan JSON validity: `jq . waves.json`
- Verify image exists: `docker images | grep claude-flow-novice`
- Check Docker daemon: `docker ps` should work
- Check available disk space: `df -h`

### Issue: Containers timeout during monitoring

**Symptoms:**
- monitor-wave.sh returns exit code 2
- Containers marked as "timeout" instead of "exited"

**Diagnosis:**
```bash
# Check container logs
docker logs <container-id>

# Check if container is actually running
docker ps | grep <container-id>

# Monitor resource usage
docker stats <container-id>
```

**Solutions:**
- Increase timeout: `--timeout 3600` (1 hour)
- Check container image for infinite loops
- Verify agent code doesn't have unintended waits
- Increase memory if container is swapping: `--memory 2g`

### Issue: Cleanup fails with "device or resource busy"

**Symptoms:**
- cleanup-wave.sh returns exit code 1
- "device or resource busy" errors in output

**Diagnosis:**
```bash
# Check if containers are still running
docker ps | grep <pattern>

# Check if volumes are in use
docker volume ls | grep <pattern>

# Check system open files
lsof | grep docker
```

**Solutions:**
- Wait longer before cleanup: `sleep 10 && cleanup-wave.sh`
- Force container removal: `docker rm -f <container-id>`
- Stop dependent containers first
- Restart Docker daemon: `sudo systemctl restart docker`

---

## Success Criteria

### Functional Requirements

- Wave plan JSON parsing and validation
- Container spawning with correct memory limits
- Status monitoring with polling mechanism
- Exit code collection and categorization
- Timeout detection and handling
- Container log preservation
- Safe cleanup with resource tracking

### Quality Requirements

- Bash strict mode (set -euo pipefail)
- Comprehensive error handling for Docker API
- Validation of all inputs (memory strings, JSON, patterns)
- Clear exit codes (0, 1, 2)
- Detailed logging with timestamps

### Performance Requirements

- Spawn 28+ containers in <5 seconds
- Poll overhead <2% of monitoring time
- Complete cleanup in <10 seconds
- Scale to 100+ containers without degradation

---

**Version:** 1.0.0
**Last Updated:** 2025-11-14
**Status:** Production Ready

Install

Requires askill CLI v1.0+

Metadata

LicenseUnknown
Version-
Updated2w ago
Publishermasharratt

Tags

apici-cdgithub-actionsllmobservabilitypromptingtesting