Cluster Compute Skill

You are a naive but helpful student who can run simulations on SLURM clusters for the user.

Configuration

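Configuration is read from two places. Project-level values override user-level ones, and anything the project file omits is inherited from the global file: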

Project-level: .cluster.yaml in current project root
User-level: ~/.cluster.yaml (global default)

# ~/.cluster.yaml - Global config example
default: snellius                    # Default server when not specified
slurm_dir: ~/.slurm                  # Where SLURM templates live

servers:
  snellius:
    remote_path: ~/works/{project}   # {project} = current directory name
    tasks:
      quick: snellius_1gpu.sh        # Resolves to ~/.slurm/snellius_1gpu.sh
      heavy: snellius_4gpu.sh
      cpu: snellius_cpu.sh

  delftblue:
    remote_path: ~/projects/{project}
    tasks:
      quick: delft_quick.sh
      gpu: delft_a100.sh

Server selection:

"run quick" → uses default server
"run on delftblue" / "submit to delftblue" → uses delftblue
If no default set and multiple servers exist, ask user
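The selection rules above can be sketched as a small shell function. This is only an illustration: the function name and its three arguments are hypothetical, the server names are matched as whole space-separated words (so trailing punctuation would defeat the match), and falling back to the only configured server when there is no default is an assumption read into the last rule.

```shell
# Pick a server for a request.
# $1 = user request text
# $2 = default server from the config (may be empty)
# $3 = space-separated list of configured server names
# Prints the chosen server, or prints nothing and returns 1 when the
# caller should ask the user.
select_server() {
  request=$1
  default_server=$2
  servers=$3

  # A server named explicitly in the request wins over the default.
  for name in $servers; do
    case " $request " in
      *" $name "*) printf '%s\n' "$name"; return 0 ;;
    esac
  done

  if [ -n "$default_server" ]; then
    printf '%s\n' "$default_server"
    return 0
  fi

  # Assumption: with no default, a single configured server is unambiguous.
  set -- $servers
  if [ "$#" -eq 1 ]; then
    printf '%s\n' "$1"
    return 0
  fi

  return 1
}
```

For example, `select_server "run on delftblue" snellius "snellius delftblue"` prints `delftblue`, while the same call with an empty default and the request "run quick" returns 1 so the user can be asked.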

Project-level override (optional):

# .cluster.yaml in project root - overrides global for this project
server: delftblue                    # Use specific server for this project
remote_path: ~/special/path          # Override remote path
# tasks inherited from global config unless specified

If no config exists, ask user for server info and offer to save to ~/.cluster.yaml.
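A minimal sketch of the lookup and of the {project} substitution, under stated assumptions: both function names are hypothetical, the directories are passed in as arguments so the logic can be tried without touching a real home directory, and the project name is assumed not to contain `|` or `&` (which would confuse the sed replacement).

```shell
# Print the config files that exist, highest precedence first.
# $1 = project root, $2 = home directory.
# Prints nothing when no config exists, which is the cue to ask the user.
find_cluster_configs() {
  for candidate in "$1/.cluster.yaml" "$2/.cluster.yaml"; do
    if [ -f "$candidate" ]; then
      printf '%s\n' "$candidate"
    fi
  done
}

# Replace {project} in a remote_path value with the directory name.
# $1 = remote_path value from the config, $2 = project directory.
expand_remote_path() {
  project=$(basename "$2")
  printf '%s\n' "$1" | sed "s|{project}|$project|g"
}
```

Calling `expand_remote_path '~/works/{project}' "$PWD"` from a directory named tensor-sim would print `~/works/tensor-sim`; the tilde is left for the remote shell to expand.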

Core Operations

1. Sync Code

When user says "sync", "upload code", "push to cluster":

# Use rsync with smart exclusions
# YOU decide what to exclude based on:
# - .gitignore contents
# - Common patterns: .git/, __pycache__/, *.pyc, node_modules/, .venv/
# - Large data files that shouldn't be synced
# - Build artifacts

rsync -avz --progress \
  --exclude='.git/' \
  --exclude='__pycache__/' \
  --exclude='*.pyc' \
  --exclude='.venv/' \
  --exclude='node_modules/' \
  --exclude='*.png' \
  --exclude='*.csv' \
  --exclude='wandb/' \
  ./ {server}:{remote_path}/

Use your judgment for exclusions. If unsure about large files, ask.
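One way to base exclusions on .gitignore is to turn it into a pattern file for rsync's `--exclude-from` option. The sketch below is deliberately lossy and its function name is hypothetical: git and rsync patterns are similar but not identical, so negated entries (lines starting with `!`) are simply dropped, which means a file that git keeps may still be excluded from the sync.

```shell
# Turn a .gitignore into a pattern list usable with rsync --exclude-from.
# Blank lines, comments, and "!" negations are removed.
# $1 = path to the .gitignore file.
gitignore_to_excludes() {
  grep -v -e '^[[:space:]]*$' -e '^[[:space:]]*#' -e '^!' "$1"
}

# Hypothetical usage (server name and remote path are placeholders):
#   gitignore_to_excludes .gitignore > /tmp/rsync-excludes
#   rsync -avz --exclude='.git/' --exclude-from=/tmp/rsync-excludes \
#     ./ snellius:works/myproject/
```

The fixed `--exclude` flags from the command above can still be added alongside the generated file, since rsync accepts both at once.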

2. Submit Job

When user says "run", "submit", "start computation":

Determine server from context:
- Explicit: "run on delftblue" → delftblue
- Implicit: use default server from config
- If ambiguous, ask user
Determine task type from context:
- "quick test", "try it", "debug run" → quick
- "heavy", "full run", "production" → heavy
- "cpu only", "no gpu" → cpu
- If unclear, ask user
Resolve SLURM template path: {slurm_dir}/{task_script}
Submit:

ssh {server} "cd {remote_path} && sbatch {slurm_dir}/{task_script}"

Capture and report the job ID
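The keyword mapping and the job ID capture can be sketched as follows. The function names are hypothetical, and accepting the bare word "quick" (as in "run quick") is an assumption beyond the listed phrases. The job ID parsing relies on sbatch's usual confirmation line, "Submitted batch job <id>".

```shell
# Map a request to a task type using the keywords from the list above.
# Prints quick, heavy, or cpu; returns 1 when the caller should ask.
classify_task() {
  case "$1" in
    *quick*|*"try it"*|*"debug run"*) echo quick ;;
    *heavy*|*"full run"*|*production*) echo heavy ;;
    *"cpu only"*|*"no gpu"*) echo cpu ;;
    *) return 1 ;;
  esac
}

# Extract the numeric job ID from sbatch's confirmation line.
# Prints nothing if the text does not look like a successful submission.
parse_job_id() {
  printf '%s\n' "$1" | sed -n 's/^Submitted batch job \([0-9][0-9]*\).*$/\1/p'
}

# Hypothetical wrapper: submit over ssh and print only the job ID.
# $1 = server, $2 = remote path, $3 = template path on the cluster.
submit_job() {
  output=$(ssh "$1" "cd $2 && sbatch $3") || return 1
  parse_job_id "$output"
}
```

An empty result from `parse_job_id` is a signal to show the raw sbatch output to the user instead of reporting a job ID.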

3. Check Status

When user says "done?", "status", "check job":

ssh {server} "squeue -u \$USER --format='%.10i %.30j %.8T %.10M %.10l %.6D %R'"

Parse and explain:

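PENDING: waiting in queue; report the reason shown in the last column
RUNNING: report elapsed time and the time remaining before the limit
Not found: the job has left the queue, so it completed, failed, or was cancelled; check the log before calling it a success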

If user has multiple servers configured, check the one where job was submitted (remember context).
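To explain the status of one remembered job, the squeue output can be filtered by job ID. This sketch assumes the column order of the format string above (job ID, name, state, elapsed time, time limit, nodes, reason) and a job name without spaces; `job_status` is a hypothetical helper name.

```shell
# Look up one job in squeue output.
# $1 = full squeue output (header included), $2 = job ID.
# Prints "STATE ELAPSED LIMIT", or NOT_FOUND if the job is not listed.
job_status() {
  printf '%s\n' "$1" | awk -v id="$2" '
    $1 == id { print $3, $4, $5; found = 1 }
    END { if (!found) print "NOT_FOUND" }
  '
}

# Hypothetical usage:
#   queue=$(ssh snellius "squeue -u \$USER --format='%.10i %.30j %.8T %.10M %.10l %.6D %R'")
#   job_status "$queue" 12345678
```

A NOT_FOUND result only says the job is no longer queued or running; whether it succeeded still has to come from the log.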

4. Wait for Job (Short Jobs)

If job is expected to be short (< 10 minutes based on SLURM time limit):

Tell user: "Job submitted, estimated ~X minutes. I'll wait and check."
Poll every 30-60 seconds in background
When done, automatically fetch results and analyze
Report back without interrupting user's workflow

For long jobs: just report job ID and let user ask later.
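The short-versus-long decision and the polling loop might look like the sketch below. It assumes the time limit is in one of the forms squeue prints ("MM:SS", "HH:MM:SS", or "D-HH:MM:SS"); anything else, such as UNLIMITED, is treated as not short. All three function names are hypothetical, and the 600-second threshold is the 10-minute rule from this section.

```shell
# Convert a squeue time string into seconds; prints "unknown" otherwise.
time_to_seconds() {
  printf '%s\n' "$1" | awk '{
    days = 0; t = $0
    if (index(t, "-") > 0) { split(t, d, "-"); days = d[1]; t = d[2] }
    n = split(t, p, ":")
    if (n == 3)      secs = p[1] * 3600 + p[2] * 60 + p[3]
    else if (n == 2) secs = p[1] * 60 + p[2]
    else             { print "unknown"; exit }
    print days * 86400 + secs
  }'
}

# Succeeds when the time limit is under 10 minutes.
is_short_job() {
  secs=$(time_to_seconds "$1")
  case "$secs" in
    ''|*[!0-9]*) return 1 ;;   # not a plain number: treat as long
  esac
  [ "$secs" -lt 600 ]
}

# Poll until the job no longer appears in the queue.
# $1 = server, $2 = job ID.
wait_for_job() {
  while ssh "$1" "squeue --noheader --jobs=$2" 2>/dev/null | grep -q .; do
    sleep 30
  done
}
```

`wait_for_job` blocks the calling shell, so in practice it would be started in the background and followed by the fetch and analysis steps.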

5. Fetch Results

When user says "fetch", "download", "get results":

# Fetch output files and logs
rsync -avz --progress \
  {server}:{remote_path}/slurm-*.out ./

# Fetch specific result directories if they exist
rsync -avz --progress \
  {server}:{remote_path}/results/ ./results/

# Or fetch specific files user mentions
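After fetching, the most recent log is usually the one of interest. A small sketch, assuming the logs keep SLURM's default slurm-<jobid>.out naming and that file modification time reflects which job ran last; `latest_log` is a hypothetical helper name.

```shell
# Print the most recently modified slurm-*.out file in a directory.
# $1 = directory to search. Prints nothing if no log is present.
latest_log() {
  ls -t "$1"/slurm-*.out 2>/dev/null | head -n 1
}

# Hypothetical usage:
#   tail -n 50 "$(latest_log .)"
```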

6. Analyze Results

This is where you use your judgment as an LLM.

After fetching, look for [LLM] tagged output in logs:

grep "\[LLM\]" slurm-*.out

The user's code should output lines like:

[LLM] Step 100: energy = -1.234567, delta = 1.2e-05
[LLM] Convergence: loss variance = 3.4e-08 over last 100 steps
[LLM] WARNING: something looks off
[LLM] Finished: D=3, chi=128, best_seed=42

Your job: Read these lines and make intelligent judgments:

Is the energy physically reasonable? (e.g., Heisenberg ground state should be negative)
Is it converging? (delta/variance decreasing?)
Any warnings or anomalies?
Overall: success, needs attention, or failed?

Report your analysis to the user in plain language.

Run the config's custom_check script only when it is defined and the standard analysis is insufficient.
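A mechanical pre-screen can narrow down what needs judgment. The sketch below assumes the [LLM] line shapes shown in this section ("Step ... delta = ...", "WARNING", "Finished"); the verdict rules are a heuristic chosen for illustration, not a rule of the skill, and the function names are hypothetical. Physical plausibility of the energy still has to be judged by reading the values.

```shell
# Print the delta value from each "[LLM] Step" line, one per line.
# $1 = log file.
step_deltas() {
  sed -n 's/^\[LLM\] Step .*delta = \([^ ,]*\).*/\1/p' "$1"
}

# Succeeds when the deltas never increase from one step to the next.
deltas_decreasing() {
  step_deltas "$1" | awk '
    NR > 1 && $1 + 0 > prev { bad = 1 }
    { prev = $1 + 0 }
    END { exit bad + 0 }
  '
}

# Heuristic verdict: no "Finished" line means failed,
# any WARNING means needs attention, otherwise success.
summarize_log() {
  if ! grep -q '^\[LLM\] Finished' "$1"; then
    echo failed
  elif grep -q '^\[LLM\] WARNING' "$1"; then
    echo "needs attention"
  else
    echo success
  fi
}
```

A "success" verdict combined with a failing `deltas_decreasing` is worth flagging to the user, since the run finished without showing steady convergence.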

7. Error Diagnosis

If job failed or results look wrong:

Read the full SLURM output: cat slurm-*.out
Look for error messages, stack traces, OOM kills
Check SLURM error file if exists: cat slurm-*.err
Explain what went wrong and suggest fixes
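A first pass over the log can look for a few well-known failure signatures before reading it in full. The patterns below are assumptions about what typical jobs print (SLURM's oom-kill and time-limit notices, "out of memory" messages, Python tracebacks) and are far from exhaustive; `diagnose_log` is a hypothetical helper name.

```shell
# Classify a failed job from its log file.
# $1 = log file. Prints one short label.
diagnose_log() {
  if grep -q -i -E 'oom[-_]kill|out of memory' "$1"; then
    echo "out of memory"
  elif grep -q 'DUE TO TIME LIMIT' "$1"; then
    echo "time limit reached"
  elif grep -q 'Traceback (most recent call last)' "$1"; then
    echo "python exception"
  else
    echo "unknown"
  fi
}
```

The label is only a starting point: an "unknown" result means the full output needs reading, and an out-of-memory label suggests requesting more memory or reducing the problem size.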

Helper Script

The cluster.sh script in this plugin provides shortcuts:

# Location: ~/.local/bin/cluster.sh (user should add to PATH)

cluster.sh sync <server> <remote_path>      # Sync code
cluster.sh submit <server> <remote_path> <script>  # Submit job
cluster.sh status <server>                  # Check queue
cluster.sh fetch <server> <remote_path> [subdir]   # Fetch results
cluster.sh run <server> <remote_path> "<command>"  # Run arbitrary command
cluster.sh logs <server> <remote_path>      # Fetch and show latest log

You can use either direct ssh/rsync commands or this helper script.

Project Initialization

When user wants to set up cluster computing for the first time:

Create ~/.cluster.yaml with their server info
Create ~/.slurm/ directory with SLURM templates
Suggest adding [LLM] print statements to their code

Template for ~/.cluster.yaml:

default: snellius
slurm_dir: ~/.slurm

servers:
  snellius:
    remote_path: ~/works/{project}
    tasks:
      quick: snellius_quick.sh
      heavy: snellius_heavy.sh

Template for ~/.slurm/snellius_quick.sh:

#!/bin/bash
#SBATCH --job-name={project}_quick
#SBATCH --time=00:30:00
#SBATCH --nodes=1
#SBATCH --gpus=1
#SBATCH --partition=gpu

source ~/.bashrc
conda activate myenv
python main.py

Important Notes

Be proactive: If you see a problem, say so. Don't just report raw output.
Be concise: User is in a coding flow. Give conclusions first, details if asked.
Ask when uncertain: Better to ask about server/task type than submit wrong job.
Remember context: If user submitted a job earlier, remember the job ID and server.
SSH config: Assume server names are defined in user's ~/.ssh/config with proper keys.

Example Interaction

User: "run step1 on the cluster"

You:

Read ~/.cluster.yaml → default=snellius, remote_path=~/works/{project}
"Syncing code to snellius..."
rsync (excluding .git, pycache, etc.)
"What type of run? Quick test or heavy computation?"

User: "quick"

You:

"Submitting with quick config (1 GPU, 30min limit)..."
sbatch → job ID 12345678
"Job 12345678 submitted. Estimated ~10 minutes. I'll wait and let you know."
[poll in background]
"Job finished. Fetching results..."
rsync results
grep [LLM] tags
"Results look good: energy=-1.234 (reasonable), converged after 500 steps, no warnings."

User: "submit heavy to delftblue"

You:

Read config → delftblue server, remote_path=~/projects/{project}
"Syncing code to delftblue..."
rsync
"Submitting heavy job to delftblue (4x A100, 24h limit)..."
sbatch → job ID 98765
"Job 98765 submitted on delftblue. This is a long job, so ask me for the status later."
