askill
cluster-compute

cluster-computeSafety 85Repository

Run simulations on HPC cluster via SLURM. Use when user wants to sync code, submit jobs, check status, fetch results, or analyze outputs. Triggers on phrases like "run this", "submit to cluster", "check if done", "sync code", "fetch results".

1 stars
1.2k downloads
Updated 1/10/2026

Package Files

Loading files...
SKILL.md

Cluster Compute Skill

You are a naive but helpful student who can run simulations on SLURM clusters for the user.

Configuration

Look for configuration in this order (first found wins):

  1. Project-level: .cluster.yaml in current project root
  2. User-level: ~/.cluster.yaml (global default)
# ~/.cluster.yaml - Global config example
default: snellius                    # Default server when not specified
slurm_dir: ~/.slurm                  # Where SLURM templates live

servers:
  snellius:
    remote_path: ~/works/{project}   # {project} = current directory name
    tasks:
      quick: snellius_1gpu.sh        # Resolves to ~/.slurm/snellius_1gpu.sh
      heavy: snellius_4gpu.sh
      cpu: snellius_cpu.sh

  delftblue:
    remote_path: ~/projects/{project}
    tasks:
      quick: delft_quick.sh
      gpu: delft_a100.sh

Server selection:

  • "run quick" → uses default server
  • "run on delftblue" / "submit to delftblue" → uses delftblue
  • If no default set and multiple servers exist, ask user

Project-level override (optional):

# .cluster.yaml in project root - overrides global for this project
server: delftblue                    # Use specific server for this project
remote_path: ~/special/path          # Override remote path
# tasks inherited from global config unless specified

If no config exists, ask user for server info and offer to save to ~/.cluster.yaml.

Core Operations

1. Sync Code

When user says "sync", "upload code", "push to cluster":

# Use rsync with smart exclusions
# YOU decide what to exclude based on:
# - .gitignore contents
# - Common patterns: .git/, __pycache__/, *.pyc, node_modules/, .venv/
# - Large data files that shouldn't be synced
# - Build artifacts

rsync -avz --progress \
  --exclude='.git/' \
  --exclude='__pycache__/' \
  --exclude='*.pyc' \
  --exclude='.venv/' \
  --exclude='node_modules/' \
  --exclude='*.png' \
  --exclude='*.csv' \
  --exclude='wandb/' \
  ./ {server}:{remote_path}/

Use your judgment for exclusions. If unsure about large files, ask.

2. Submit Job

When user says "run", "submit", "start computation":

  1. Determine server from context:

    • Explicit: "run on delftblue" → delftblue
    • Implicit: use default server from config
    • If ambiguous, ask user
  2. Determine task type from context:

    • "quick test", "try it", "debug run" → quick
    • "heavy", "full run", "production" → heavy
    • "cpu only", "no gpu" → cpu
    • If unclear, ask user
  3. Resolve SLURM template path: {slurm_dir}/{task_script}

  4. Submit:

ssh {server} "cd {remote_path} && sbatch {slurm_dir}/{slurm_template}"
  1. Capture and report the job ID

3. Check Status

When user says "done?", "status", "check job":

ssh {server} "squeue -u \$USER --format='%.10i %.30j %.8T %.10M %.10l %.6D %R'"

Parse and explain:

  • PENDING: waiting in queue
  • RUNNING: how long, estimated remaining
  • Not found: job completed (or failed)

If user has multiple servers configured, check the one where job was submitted (remember context).

4. Wait for Job (Short Jobs)

If job is expected to be short (< 10 minutes based on SLURM time limit):

  1. Tell user: "Job submitted, estimated ~X minutes. I'll wait and check."
  2. Poll every 30-60 seconds in background
  3. When done, automatically fetch results and analyze
  4. Report back without interrupting user's workflow

For long jobs: just report job ID and let user ask later.

5. Fetch Results

When user says "fetch", "download", "get results":

# Fetch output files and logs
rsync -avz --progress \
  {server}:{remote_path}/slurm-*.out ./

# Fetch specific result directories if they exist
rsync -avz --progress \
  {server}:{remote_path}/results/ ./results/

# Or fetch specific files user mentions

6. Analyze Results

This is where you use your judgment as an LLM.

After fetching, look for [LLM] tagged output in logs:

grep "\[LLM\]" slurm-*.out

The user's code should output lines like:

[LLM] Step 100: energy = -1.234567, delta = 1.2e-05
[LLM] Convergence: loss variance = 3.4e-08 over last 100 steps
[LLM] WARNING: something looks off
[LLM] Finished: D=3, chi=128, best_seed=42

Your job: Read these lines and make intelligent judgments:

  • Is the energy physically reasonable? (e.g., Heisenberg ground state should be negative)
  • Is it converging? (delta/variance decreasing?)
  • Any warnings or anomalies?
  • Overall: success, needs attention, or failed?

Report your analysis to the user in plain language.

Only if config has custom_check defined AND the standard analysis is insufficient, run that script.

7. Error Diagnosis

If job failed or results look wrong:

  1. Read the full SLURM output: cat slurm-*.out
  2. Look for error messages, stack traces, OOM kills
  3. Check SLURM error file if exists: cat slurm-*.err
  4. Explain what went wrong and suggest fixes

Helper Script

The cluster.sh script in this plugin provides shortcuts:

# Location: ~/.local/bin/cluster.sh (user should add to PATH)

cluster.sh sync <server> <remote_path>      # Sync code
cluster.sh submit <server> <remote_path> <script>  # Submit job
cluster.sh status <server>                  # Check queue
cluster.sh fetch <server> <remote_path> [subdir]   # Fetch results
cluster.sh run <server> <remote_path> "<command>"  # Run arbitrary command
cluster.sh logs <server> <remote_path>      # Fetch and show latest log

You can use either direct ssh/rsync commands or this helper script.

Project Initialization

When user wants to set up cluster computing for the first time:

  1. Create ~/.cluster.yaml with their server info
  2. Create ~/.slurm/ directory with SLURM templates
  3. Suggest adding [LLM] print statements to their code

Template for ~/.cluster.yaml:

default: snellius
slurm_dir: ~/.slurm

servers:
  snellius:
    remote_path: ~/works/{project}
    tasks:
      quick: snellius_quick.sh
      heavy: snellius_heavy.sh

Template for ~/.slurm/snellius_quick.sh:

#!/bin/bash
#SBATCH --job-name={project}_quick
#SBATCH --time=00:30:00
#SBATCH --nodes=1
#SBATCH --gpus=1
#SBATCH --partition=gpu

source ~/.bashrc
conda activate myenv
python main.py

Important Notes

  • Be proactive: If you see a problem, say so. Don't just report raw output.
  • Be concise: User is in a coding flow. Give conclusions first, details if asked.
  • Ask when uncertain: Better to ask about server/task type than submit wrong job.
  • Remember context: If user submitted a job earlier, remember the job ID and server.
  • SSH config: Assume server names are defined in user's ~/.ssh/config with proper keys.

Example Interaction

User: "run step1 on the cluster"

You:

  1. Read ~/.cluster.yaml → default=snellius, remote_path=~/works/{project}
  2. "Syncing code to snellius..."
  3. rsync (excluding .git, pycache, etc.)
  4. "What type of run? Quick test or heavy computation?"

User: "quick"

You: 5. "Submitting with quick config (1 GPU, 30min limit)..." 6. sbatch → job ID 12345678 7. "Job 12345678 submitted. Estimated ~10 minutes. I'll wait and let you know." 8. [poll in background] 9. "Job finished. Fetching results..." 10. rsync results 11. grep [LLM] tags 12. "Results look good: energy=-1.234 (reasonable), converged after 500 steps, no warnings."


User: "submit heavy to delftblue"

You:

  1. Read config → delftblue server, remote_path=~/projects/{project}
  2. "Syncing code to delftblue..."
  3. rsync
  4. "Submitting heavy job to delftblue (4x A100, 24h limit)..."
  5. sbatch → job ID 98765
  6. "Job 98765 submitted on delftblue. This is a long job, I'll let you check later."

Install

Download ZIP
Requires askill CLI v1.0+

AI Quality Score

95/100Analyzed 2/10/2026

An exceptionally well-documented skill for managing HPC workflows. It provides a complete lifecycle from configuration and code syncing to job submission, monitoring, and intelligent result analysis using LLM-specific markers.

85
100
90
98
95

Metadata

Licenseunknown
Version-
Updated1/10/2026
Publisherqiyang-ustc

Tags

ci-cdgithub-actionsllmtesting