NVIDIA CUDA parallel computing platform — use when writing .cu kernels, using cuBLAS/cuDNN/cuFFT/cuSPARSE/cuRAND/cuSolver, Thrust, or Cooperative Groups for GPU-accelerated computing

Updated 2/21/2026

CUDA

Overview

CUDA is NVIDIA's parallel computing platform and programming model for GPU-accelerated applications. It provides direct access to the GPU's virtual instruction set and parallel compute elements for executing kernels in C, C++, and Fortran.

  • cuda-samples version: v13.1 (CUDA Toolkit 13.1)
  • CUDALibrarySamples: main (Feb 2025)
  • Language: C/C++ (.cu files)
  • Licenses: BSD-3-Clause (cuda-samples), Apache-2.0 (CUDALibrarySamples)

Quick Start

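// Minimal kernel + launch (error checking omitted for brevity)
#include <cuda_runtime.h>
#include <vector>

// Each thread adds one element; the bounds check guards the last partial block.
__global__ void addVectors(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    std::vector<float> h_a(n, 1.0f), h_b(n, 2.0f), h_c(n);

    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_c, bytes);

    // Copy the inputs to the device before launching the kernel.
    cudaMemcpy(d_a, h_a.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b.data(), bytes, cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;  // round up so every element is covered
    addVectors<<<blocks, threads>>>(d_a, d_b, d_c, n);
    cudaDeviceSynchronize();

    // Copy the result back to the host.
    cudaMemcpy(h_c.data(), d_c, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}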

Core Concepts

  • Kernel: __global__ function executed on GPU by many parallel threads
  • Grid/Block/Thread: Launch hierarchy — <<<gridDim, blockDim>>> configures parallelism
  • Device memory: Must be explicitly allocated with cudaMalloc and freed with cudaFree
  • Streams: Async execution queues; kernel launches return to the host immediately, and the legacy default (null) stream implicitly synchronizes with other blocking streams
  • Unified Memory (cudaMallocManaged): Automatically migrates data between CPU and GPU
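
As a rough illustration of the Unified Memory bullet above, the sketch below allocates one managed buffer, writes it on the host, scales it on the GPU, and reads it back on the host without any cudaMemcpy. The names scaleInPlace and scaledSumManaged are hypothetical helpers chosen for this example; they do not come from the skill's reference files, and the sketch assumes n > 0 and a device that supports managed memory.

```cuda
#include <cuda_runtime.h>

// Scale every element in place; one thread per element.
__global__ void scaleInPlace(float *data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Hypothetical helper: fills a managed buffer with 1.0f on the host,
// scales it on the GPU, and returns the host-side sum (or -1.0 on error).
double scaledSumManaged(int n, float factor) {
    float *data = nullptr;
    if (cudaMallocManaged(&data, n * sizeof(float)) != cudaSuccess) return -1.0;

    // The host writes directly through the managed pointer.
    for (int i = 0; i < n; ++i) data[i] = 1.0f;

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    scaleInPlace<<<blocks, threads>>>(data, factor, n);

    // The host must not touch managed data until the kernel has finished.
    bool ok = (cudaDeviceSynchronize() == cudaSuccess);

    // The host reads the result through the same pointer.
    double sum = 0.0;
    for (int i = 0; ok && i < n; ++i) sum += data[i];

    cudaFree(data);
    return ok ? sum : -1.0;
}
```

Calling scaledSumManaged(1024, 2.0f) from main should yield 2048.0 when the device work succeeds. The explicit cudaDeviceSynchronize() is what makes the host-side read safe.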

API Reference

| Domain | File | Description |
|---|---|---|
| CUDA Runtime | api-runtime.md | Device mgmt, memory, streams, events, kernel launch |
| cuBLAS | api-cublas.md | Dense linear algebra: GEMM, GEMV, TRSM, batched ops |
| cuFFT | api-cufft.md | 1D/2D/3D FFT and batched transforms |
| cuSPARSE | api-cusparse.md | Sparse matrix ops: SpMM, SpMV, format conversions |
| cuRAND | api-curand.md | Random number generation on GPU |
| cuSolver | api-cusolver.md | Dense/sparse solvers: QR, LU, eigenvalue, SVD |
| Thrust | api-thrust.md | STL-like GPU algorithms: sort, reduce, transform, scan |
| Cooperative Groups | api-cooperative-groups.md | Flexible thread synchronization beyond blocks |
| Workflows | workflows.md | Complete working examples |

Common Workflows

See references/workflows.md for complete examples.

Quick reference:

  • Matrix multiply: see workflows.md — cuBLAS GEMM section
  • FFT: see workflows.md — cuFFT 1D section
  • Custom kernel: see workflows.md — Custom CUDA Kernel section
  • Unified memory: see workflows.md — Unified Memory section
  • Error-check macros: see workflows.md — Error-Checked CUDA Boilerplate

Key Considerations

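  • Error checking: Check the return code of every runtime call, for example with a CUDA_CHECK(err) macro; kernel launches return nothing, so call cudaGetLastError() right after the launch
  • Synchronization: Call cudaDeviceSynchronize() or cudaStreamSynchronize() before reading results on the host; a blocking cudaMemcpy in the default stream also waits for preceding kernels in that stream
  • Memory coalescing: Have consecutive threads in a warp access consecutive addresses; a warp reading 32 consecutive 4-byte words that start on a 128-byte boundary touches a single 128-byte line. cudaMalloc allocations are aligned to at least 256 bytes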
  • Occupancy: Use cudaOccupancyMaxPotentialBlockSize to tune block dimensions
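  • Tensor Cores: Available on Volta and later (sm_70+); cuBLAS selects Tensor Core kernels automatically when the GEMM data types and compute type support them (for example FP16 inputs)
  • Column-major: cuBLAS and cuSolver use Fortran (column-major) layout. A row-major matrix read as column-major is its transpose, so for row-major C arrays either transpose explicitly or swap the operands and dimensions (compute C^T = B^T·A^T to obtain row-major C = A·B)
  • cuFFT normalization: cuFFT does NOT normalize its transforms; a forward transform followed by an inverse returns the input scaled by N (the transform size), so divide by N manually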
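  • Streams: Create non-default streams with cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking) when they must not synchronize implicitly with the legacy null stream; streams created with plain cudaStreamCreate are blocking with respect to it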
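
The error-checking, occupancy, and stream points above can be sketched together in one small helper. This is a minimal sketch under stated assumptions: the CUDA_CHECK definition shown here is one common form of the pattern and is not necessarily the macro in workflows.md, and fillValue and fillOnStream are hypothetical names introduced only for this example. It assumes n > 0.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// One possible CUDA_CHECK definition: report the failing call site and exit.
#define CUDA_CHECK(call)                                                   \
    do {                                                                   \
        cudaError_t err_ = (call);                                         \
        if (err_ != cudaSuccess) {                                         \
            std::fprintf(stderr, "CUDA error %s at %s:%d\n",               \
                         cudaGetErrorString(err_), __FILE__, __LINE__);    \
            std::exit(EXIT_FAILURE);                                       \
        }                                                                  \
    } while (0)

// Write `value` into every element; one thread per element.
__global__ void fillValue(float *out, float value, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = value;
}

// Hypothetical helper: fills n floats with `value` on the GPU and returns
// the last element as read back on the host.
float fillOnStream(int n, float value) {
    // Ask the runtime for a block size that maximizes occupancy for this kernel.
    int minGridSize = 0, blockSize = 0;
    CUDA_CHECK(cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize,
                                                  fillValue, 0, 0));
    int gridSize = (n + blockSize - 1) / blockSize;

    // A non-blocking stream does not implicitly synchronize with the null stream.
    cudaStream_t stream;
    CUDA_CHECK(cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking));

    float *d_out = nullptr;
    CUDA_CHECK(cudaMalloc(&d_out, n * sizeof(float)));

    fillValue<<<gridSize, blockSize, 0, stream>>>(d_out, value, n);
    CUDA_CHECK(cudaGetLastError());  // catches launch-configuration errors

    float last = 0.0f;
    CUDA_CHECK(cudaMemcpyAsync(&last, d_out + (n - 1), sizeof(float),
                               cudaMemcpyDeviceToHost, stream));
    // Wait for the kernel and the copy before reading `last` on the host.
    CUDA_CHECK(cudaStreamSynchronize(stream));

    CUDA_CHECK(cudaFree(d_out));
    CUDA_CHECK(cudaStreamDestroy(stream));
    return last;
}
```

Calling fillOnStream(1000, 3.5f) from main should return 3.5f. Exiting on the first error keeps the sketch short; a library would more likely propagate the cudaError_t to its caller.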

Install

Requires askill CLI v1.0+

AI Quality Score

78/100 (analyzed 2/23/2026)

High-quality CUDA reference skill with clear description, structured API table, core concepts, and important safety considerations. The skill effectively covers when to use CUDA and provides quick start code plus key best practices. Being reference-style rather than step-by-step limits actionability slightly, but it's well-organized and highly reusable for developers working with NVIDIA GPU computing.

Metadata

License: unknown
Version: -
Updated: 2/21/2026
Publisher: datathings

Tags

api, prompting