Kubernetes Troubleshooting

Expert debugging and diagnostics for Kubernetes clusters using kubectl-mcp-server tools.

When to Apply

Use this skill when:

User mentions: "debug", "troubleshoot", "diagnose", "failing", "crash", "not starting", "broken"
Pod states: Pending, CrashLoopBackOff, ImagePullBackOff, OOMKilled, Error, Unknown
Node issues: NotReady, MemoryPressure, DiskPressure, NetworkUnavailable, PIDPressure
Keywords: "logs", "events", "describe", "why isn't it working", "stuck", "not responding"

Priority Rules

Priority	Rule	Impact	Tools
1	Check pod status first	CRITICAL	`get_pods`, `describe_pod`
2	View recent events	CRITICAL	`get_events`
3	Inspect logs (including previous)	HIGH	`get_pod_logs`
4	Check resource metrics	HIGH	`get_pod_metrics`
5	Verify endpoints	MEDIUM	`get_endpoints`
6	Review network policies	MEDIUM	`get_network_policies`
7	Examine node status	LOW	`get_nodes`, `describe_node`

Quick Reference

Symptom	First Tool	Next Steps
Pod Pending	`describe_pod`	Check events, node capacity, resource requests
CrashLoopBackOff	`get_pod_logs(previous=True)`	Check exit code, resources, liveness probes
ImagePullBackOff	`describe_pod`	Verify image name, registry auth, network
OOMKilled	`get_pod_metrics`	Increase memory limits, check for memory leaks
ContainerCreating	`describe_pod`	Check PVC binding, secrets, configmaps
Terminating (stuck)	`describe_pod`	Check finalizers, PDBs, preStop hooks

Diagnostic Workflows

Pod Not Starting

1. get_pods(namespace, label_selector) - Get pod status
2. describe_pod(name, namespace) - See events and conditions
3. get_events(namespace, field_selector="involvedObject.name=<pod>") - Check events
4. get_pod_logs(name, namespace, previous=True) - For crash loops
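
The four steps above can be written down as an ordered call plan. This is a minimal sketch: the tool names and arguments are taken from the list, but the helper `pod_not_starting_plan` and the (tool, kwargs) layout are hypothetical, and nothing here calls kubectl-mcp-server itself.

```python
def pod_not_starting_plan(pod, namespace, label_selector=None):
    """Return the workflow steps as (tool_name, kwargs) pairs, in order.

    Hypothetical helper: it only builds the arguments described in the
    steps above and does not invoke any tool.
    """
    return [
        ("get_pods", {"namespace": namespace, "label_selector": label_selector}),
        ("describe_pod", {"name": pod, "namespace": namespace}),
        ("get_events", {
            "namespace": namespace,
            # Narrow the event list to the one pod under investigation.
            "field_selector": f"involvedObject.name={pod}",
        }),
        # previous=True asks for logs of the last terminated container.
        ("get_pod_logs", {"name": pod, "namespace": namespace, "previous": True}),
    ]


for tool, kwargs in pod_not_starting_plan("myapp-xyz", "prod"):
    print(tool, kwargs)
```

Keeping the plan as data makes it easy to log which step produced which finding before moving on to the next one.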

Common Pod States

State	Likely Cause	Tools to Use
Pending	Scheduling issues	`describe_pod`, `get_nodes`, `get_events`
ImagePullBackOff	Registry/auth	`describe_pod`, check image name
CrashLoopBackOff	App crash	`get_pod_logs(previous=True)`
OOMKilled	Memory limit	`get_pod_metrics`, adjust limits
ContainerCreating	Volume/network	`describe_pod`, `get_pvc`
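
To pick a row from the table, the state first has to be read out of the pod object. The sketch below assumes the pod is available as a dict shaped like the Kubernetes Pod API JSON (`status.phase`, `status.containerStatuses[].state`, `lastState`); whether the tools return exactly this shape is an assumption, and `pod_symptom` is a hypothetical helper.

```python
from typing import Optional


def pod_symptom(pod: dict) -> Optional[str]:
    """Return the most specific problem state found in a Pod status, or None."""
    status = pod.get("status", {})
    for cs in status.get("containerStatuses", []):
        state = cs.get("state", {})
        last = cs.get("lastState", {}).get("terminated") or {}
        current = state.get("terminated") or {}
        # An OOM kill is more specific than the CrashLoopBackOff it causes.
        if "OOMKilled" in (last.get("reason"), current.get("reason")):
            return "OOMKilled"
        waiting = state.get("waiting") or {}
        if waiting.get("reason"):
            # e.g. CrashLoopBackOff, ImagePullBackOff, ContainerCreating
            return waiting["reason"]
    if status.get("phase") == "Pending":
        return "Pending"
    return None


example = {
    "status": {
        "phase": "Running",
        "containerStatuses": [{
            "state": {"waiting": {"reason": "CrashLoopBackOff"}},
            "lastState": {"terminated": {"reason": "OOMKilled", "exitCode": 137}},
        }],
    }
}
print(pod_symptom(example))  # OOMKilled
```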

Node Issues

1. get_nodes() - List nodes and status
2. describe_node(name) - See conditions and capacity
3. Check conditions: Ready, MemoryPressure, DiskPressure, PIDPressure, NetworkUnavailable
4. node_logs_tool(name, "kubelet") - Kubelet logs
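
The condition check in step 3 can be sketched as follows. It assumes the node is a dict in the Kubernetes Node API shape, where each entry in `status.conditions` has a `type` and a string `status` of "True", "False", or "Unknown". The helper name `unhealthy_conditions` is hypothetical.

```python
# Healthy value for each node condition: Ready should be "True",
# every pressure/unavailable condition should be "False".
EXPECTED = {
    "Ready": "True",
    "MemoryPressure": "False",
    "DiskPressure": "False",
    "PIDPressure": "False",
    "NetworkUnavailable": "False",
}


def unhealthy_conditions(node: dict) -> list:
    """List every known condition whose status differs from the healthy value."""
    problems = []
    for cond in node.get("status", {}).get("conditions", []):
        expected = EXPECTED.get(cond.get("type"))
        if expected is not None and cond.get("status") != expected:
            problems.append(f"{cond['type']}={cond.get('status')}")
    return problems


node = {"status": {"conditions": [
    {"type": "Ready", "status": "False"},
    {"type": "MemoryPressure", "status": "True"},
    {"type": "DiskPressure", "status": "False"},
]}}
print(unhealthy_conditions(node))  # ['Ready=False', 'MemoryPressure=True']
```

A status of "Unknown" is also reported, which usually means the kubelet has stopped posting updates and is a good reason to move on to the kubelet logs.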

Deep Debugging Workflows

CrashLoopBackOff Investigation

1. get_pod_logs(name, namespace, previous=True) - See why it crashed
2. describe_pod(name, namespace) - Check resource limits, probes
3. get_pod_metrics(name, namespace) - Current memory/CPU usage (metrics describe the running container, not the moment of the crash)
4. If OOM: compare requests/limits to actual usage
5. If app error: check logs for stack trace
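
The branch between steps 4 and 5 often comes down to the container's exit code, which `describe_pod` shows under the last terminated state. The sketch below relies on the common convention that an exit code above 128 means the process was killed by signal `code - 128`. The helper `explain_exit_code` and its hint texts are illustrative, not part of any tool.

```python
# A few signals that show up often in container exit codes.
COMMON_SIGNALS = {6: "SIGABRT", 9: "SIGKILL", 11: "SIGSEGV", 15: "SIGTERM"}


def explain_exit_code(code: int) -> str:
    """Give a first hint about a container exit code."""
    if code == 0:
        return "exited cleanly; check why the main process stops"
    if code > 128:
        signum = code - 128
        name = COMMON_SIGNALS.get(signum, f"signal {signum}")
        if signum == 9:
            # SIGKILL is what the kernel OOM killer sends; confirm the
            # OOMKilled reason in describe_pod before raising limits.
            return f"killed by {name}; possible OOM kill"
        return f"killed by {name}"
    return "application error; look for a stack trace in the previous logs"


print(explain_exit_code(137))  # killed by SIGKILL; possible OOM kill
print(explain_exit_code(1))    # application error; look for a stack trace in the previous logs
```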

Networking Issues

1. get_services(namespace) - Verify service exists
2. get_endpoints(namespace) - Check endpoint backends
3. If endpoints are empty: no Ready pods match the service selector (check for label mismatches and failing readiness probes)
4. get_network_policies(namespace) - Check traffic rules
5. For Cilium: cilium_endpoints_list_tool(), hubble_flows_query_tool()
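
The selector check in step 3 is plain equality matching: a Service selects a pod only if every key/value pair in its selector appears in the pod's labels. A minimal sketch, assuming selectors and labels are already available as dicts (the helper names and the sample data are hypothetical):

```python
def matches_selector(selector: dict, labels: dict) -> bool:
    """True if every selector key/value pair is present in the labels."""
    return all(labels.get(key) == value for key, value in selector.items())


def selected_pods(selector: dict, pods: list) -> list:
    """Names of the pods a Service with this selector would pick up."""
    if not selector:
        # A Service without a selector gets no automatically managed endpoints.
        return []
    return [p["name"] for p in pods if matches_selector(selector, p.get("labels", {}))]


service_selector = {"app": "web", "tier": "frontend"}
pods = [
    {"name": "web-1", "labels": {"app": "web", "tier": "frontend"}},
    {"name": "web-2", "labels": {"app": "web", "tier": "front-end"}},  # typo in value
]
print(selected_pods(service_selector, pods))  # ['web-1']
```

A pod that matches the selector can still be missing from the endpoints if it is not Ready, so an empty result here points to labels, while a non-empty result points to readiness.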

Storage Problems

1. get_pvc(namespace) - Check PVC status
2. describe_pvc(name, namespace) - See binding issues
3. get_storage_classes() - Verify provisioner exists
4. If Pending: check storage class, access modes
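
The Pending check in step 4 can be sketched against the Kubernetes API shapes for PersistentVolumeClaim and StorageClass. The field names and the default-class annotation are standard Kubernetes. The helper `pending_pvc_hints` is hypothetical, and whether the tools return these exact dicts is an assumption.

```python
DEFAULT_ANNOTATION = "storageclass.kubernetes.io/is-default-class"


def pending_pvc_hints(pvc: dict, storage_classes: list) -> list:
    """Return likely reasons why a PVC is stuck in Pending."""
    if pvc.get("status", {}).get("phase") != "Pending":
        return []
    hints = []
    by_name = {sc["metadata"]["name"]: sc for sc in storage_classes}
    wanted = pvc.get("spec", {}).get("storageClassName")
    if wanted is None:
        # No class requested: the cluster default class is used, if one exists.
        has_default = any(
            sc["metadata"].get("annotations", {}).get(DEFAULT_ANNOTATION) == "true"
            for sc in storage_classes
        )
        if not has_default:
            hints.append("no storageClassName set and no default StorageClass exists")
    elif wanted == "":
        hints.append('storageClassName is "": only a pre-created matching PV can bind')
    elif wanted not in by_name:
        hints.append(f"StorageClass {wanted!r} not found")
    elif by_name[wanted].get("volumeBindingMode") == "WaitForFirstConsumer":
        hints.append("binding waits until a pod using the PVC is scheduled")
    modes = pvc.get("spec", {}).get("accessModes", [])
    if modes:
        hints.append(f"confirm the provisioner supports access modes {modes}")
    return hints


pvc = {"spec": {"storageClassName": "fast-ssd", "accessModes": ["ReadWriteMany"]},
       "status": {"phase": "Pending"}}
classes = [{"metadata": {"name": "standard"}, "provisioner": "example.com/provisioner"}]
for hint in pending_pvc_hints(pvc, classes):
    print(hint)
```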

DNS Resolution

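1. kubectl_exec(pod, namespace, "nslookup kubernetes.default") - Test DNS from inside a pod
2. If the lookup fails: check the CoreDNS pods in kube-system
3. get_pods(namespace="kube-system", label_selector="k8s-app=kube-dns") - List the CoreDNS pods
4. get_pod_logs(name=<coredns-pod>, namespace="kube-system") - Read logs for each pod name returned in step 3 (a "coredns-*" wildcard is unlikely to match a pod name)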
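
It helps to know which names the resolver actually tries when the test in step 1 looks up a short name such as `kubernetes.default`. The sketch below reproduces the usual pod `resolv.conf` behaviour: three search domains and `ndots:5`. It assumes the default cluster domain `cluster.local` and leaves out any search domains inherited from the node. `dns_candidates` is a hypothetical helper.

```python
def dns_candidates(name, namespace, cluster_domain="cluster.local", ndots=5):
    """Return the names a pod's resolver tries, in order, for `name`."""
    if name.endswith("."):
        # A trailing dot marks the name as fully qualified: no search list.
        return [name.rstrip(".")]
    search = [
        f"{namespace}.svc.{cluster_domain}",
        f"svc.{cluster_domain}",
        cluster_domain,
    ]
    expanded = [f"{name}.{domain}" for domain in search]
    if name.count(".") >= ndots:
        # Enough dots: try the name as given first, then the search list.
        return [name] + expanded
    return expanded + [name]


for candidate in dns_candidates("kubernetes.default", "prod"):
    print(candidate)
# kubernetes.default.prod.svc.cluster.local
# kubernetes.default.svc.cluster.local
# kubernetes.default.cluster.local
# kubernetes.default
```

The second candidate is the one that should resolve in a healthy cluster, so a failure on all of them points at CoreDNS or the path to it rather than at the name.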

Multi-Cluster Debugging

All tools accept a context parameter for targeting a specific cluster:

get_pods(namespace="kube-system", context="production-cluster")
get_events(namespace="default", context="staging-cluster")
describe_pod(name="myapp-xyz", namespace="prod", context="prod-east")
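
Running the same check against several clusters is a loop over contexts. In this sketch, `run_across_contexts` is a hypothetical helper and `fake_get_pods` is a stand-in stub so the example runs on its own. The only thing taken from the text above is that a tool accepts a `context` keyword.

```python
def run_across_contexts(tool, contexts, **kwargs):
    """Call `tool` once per context and collect results keyed by context."""
    results = {}
    for ctx in contexts:
        try:
            results[ctx] = tool(context=ctx, **kwargs)
        except Exception as exc:
            # One unreachable cluster should not hide results from the others.
            results[ctx] = f"error: {exc}"
    return results


def fake_get_pods(namespace, context):
    """Stub standing in for a real tool call; returns canned data."""
    if context == "staging-cluster":
        raise ConnectionError("cluster unreachable")
    return [f"{context}/{namespace}/coredns-abc"]


report = run_across_contexts(
    fake_get_pods,
    ["production-cluster", "staging-cluster"],
    namespace="kube-system",
)
for ctx, result in report.items():
    print(ctx, "->", result)
```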

Diagnostic Scripts

For comprehensive diagnostics, use the bundled scripts:

scripts/diagnose-pod.py - Automated pod analysis
scripts/health-check.sh - Cluster health checks

Decision Tree

See references/DECISION-TREE.md for visual troubleshooting flowcharts.

Common Errors Reference

See references/COMMON-ERRORS.md for error message explanations and fixes.

Related Tools

Core Diagnostics

get_pods, describe_pod, get_pod_logs, get_pod_metrics
get_events, get_nodes, describe_node
get_resource_usage, compare_namespaces

Advanced (Ecosystem)

Cilium: cilium_endpoints_list_tool, hubble_flows_query_tool
Istio: istio_proxy_status_tool, istio_analyze_tool

Related Skills

k8s-diagnostics - Metrics and health checks
k8s-incident - Emergency runbooks
k8s-networking - Network troubleshooting
