Use when "CLIP", "Whisper", "Stable Diffusion", "SDXL", "speech-to-text", "text-to-image", "image generation", "transcription", "zero-shot classification", "image-text similarity", "inpainting", "ControlNet"

Updated 1/15/2026

Multimodal Models

Pre-trained models for vision, audio, and cross-modal tasks.


Model Overview

| Model | Modality | Task |
|---|---|---|
| CLIP | Image + Text | Zero-shot classification, similarity |
| Whisper | Audio → Text | Transcription, translation |
| Stable Diffusion | Text → Image | Image generation, editing |

CLIP (Vision-Language)

Zero-shot image classification without training on specific labels.

CLIP Use Cases

| Task | How |
|---|---|
| Zero-shot classification | Compare image to text label embeddings |
| Image search | Find images matching text query |
| Content moderation | Classify against safety categories |
| Image similarity | Compare image embeddings |

CLIP Models

| Model | Parameters | Trade-off |
|---|---|---|
| ViT-B/32 | 151M | Recommended balance |
| ViT-L/14 | 428M | Best quality, slower |
| RN50 | 102M | Fastest, lower quality |

CLIP Concepts

| Concept | Description |
|---|---|
| Dual encoder | Separate encoders for image and text |
| Contrastive learning | Trained to match image-text pairs |
| Normalization | Always normalize embeddings before similarity |
| Descriptive labels | Better labels = better zero-shot accuracy |

Key concept: CLIP embeds images and text in the same vector space, so classification reduces to finding the text embedding nearest to the image embedding.
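
The nearest-text-embedding idea can be sketched without loading a model. The example below is a minimal illustration, not CLIP itself: the 3-dimensional vectors are hypothetical stand-ins for real CLIP embeddings, and the `temperature` of 100 is an assumption chosen to resemble the logit scale typically reported for CLIP.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def zero_shot_classify(image_emb, label_embs, temperature=100.0):
    """Return {label: probability} from cosine similarity to each label embedding.

    `temperature` is an assumed scaling factor, not a value read from a model.
    """
    img = normalize(image_emb)
    sims = {
        label: sum(a * b for a, b in zip(img, normalize(emb)))
        for label, emb in label_embs.items()
    }
    # Softmax over scaled similarities; subtracting the max keeps exp() stable.
    top = max(sims.values())
    exps = {label: math.exp(temperature * (s - top)) for label, s in sims.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}


# Hypothetical toy vectors; real ones would come from CLIP's text and image encoders.
label_embs = {
    "a photo of a cat": [0.9, 0.1, 0.0],
    "a photo of a dog": [0.1, 0.9, 0.0],
}
probs = zero_shot_classify([0.8, 0.2, 0.1], label_embs)
best = max(probs, key=probs.get)
```

Writing labels as short descriptive phrases ("a photo of a cat" rather than "cat") follows the "Descriptive labels" row above.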

CLIP Limitations

  • Not for fine-grained classification
  • No spatial understanding (whole image only)
  • May reflect training data biases

Whisper (Speech Recognition)

Robust multilingual transcription supporting 99 languages.

Whisper Use Cases

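| Task | Configuration |
|---|---|
| Transcription | Default `transcribe` task |
| Translation to English | `task="translate"` (the turbo model is not trained for translation; use medium or large) |
| Subtitles | Output format SRT/VTT |
| Word timestamps | `word_timestamps=True` |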
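
For the subtitles row, the sketch below shows how segment timestamps map onto the SRT format. It assumes segments shaped like the dictionaries in the `segments` list returned by openai-whisper's `transcribe()` (`start` and `end` in seconds, plus `text`); the helper names are hypothetical.

```python
def srt_timestamp(seconds):
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments):
    """Render segments as numbered SRT cues separated by blank lines."""
    cues = []
    for index, seg in enumerate(segments, start=1):
        start = srt_timestamp(seg["start"])
        end = srt_timestamp(seg["end"])
        cues.append(f"{index}\n{start} --> {end}\n{seg['text'].strip()}\n")
    return "\n".join(cues)


srt_text = segments_to_srt([
    {"start": 0.0, "end": 2.5, "text": " Hello."},
    {"start": 2.5, "end": 5.0, "text": " Welcome back."},
])
```

The Whisper command-line tool can write SRT or VTT files directly; a helper like this is mainly useful when segments are post-processed in Python first.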

Whisper Models

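| Model | Parameters | Speed | Recommendation |
|---|---|---|---|
| turbo | 809M | Fast | Recommended |
| large | 1550M | Slow | Maximum quality |
| small | 244M | Medium | Good balance |
| base | 74M | Fast | Quick tests |
| tiny | 39M | Fastest | Prototyping only |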

Whisper Concepts

| Concept | Description |
|---|---|
| Language detection | Auto-detects, or specify for speed |
| Initial prompt | Improves accuracy on technical terms |
| Timestamps | Segment-level or word-level |
| faster-whisper | Alternative implementation built on CTranslate2, up to 4× faster |

Key concept: Specify language when known—auto-detection adds latency.
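
A minimal sketch of passing these options through the openai-whisper Python package. `build_transcribe_options` and `transcribe` are hypothetical helper names; the keyword arguments (`language`, `task`, `initial_prompt`, `word_timestamps`) are the ones this section refers to. The import is deferred so that the option builder can be used without the package installed.

```python
def build_transcribe_options(language=None, translate=False,
                             initial_prompt=None, word_timestamps=False):
    """Collect keyword arguments for a Whisper transcribe call."""
    options = {
        "task": "translate" if translate else "transcribe",
        "word_timestamps": word_timestamps,
    }
    if language is not None:
        # Supplying the language skips the auto-detection pass.
        options["language"] = language
    if initial_prompt:
        # Domain vocabulary here can help with technical terms.
        options["initial_prompt"] = initial_prompt
    return options


def transcribe(audio_path, model_name="turbo", **kwargs):
    """Load a model and transcribe one file (requires `pip install openai-whisper`)."""
    import whisper  # deferred: only needed when this function is called

    model = whisper.load_model(model_name)
    return model.transcribe(audio_path, **build_transcribe_options(**kwargs))


options = build_transcribe_options(language="en", initial_prompt="Kubernetes, gRPC")
```

A call might then look like `transcribe("meeting.mp3", language="en")`, where the file name is a placeholder.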

Whisper Limitations

  • May hallucinate on silence/noise
  • No speaker diarization (who said what)
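  • Accuracy can degrade on long recordings (>30 min); the model works in 30-second windows, so errors can carry over from one window to the next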
  • Not suitable for real-time captioning
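
One way to reduce hallucinated text on silence is to filter segments after transcription. The sketch below assumes segment dictionaries carrying the `no_speech_prob` and `avg_logprob` fields found in openai-whisper results; the function name and both thresholds are hypothetical and would need tuning on real audio.

```python
def drop_probable_silence(segments, no_speech_threshold=0.5, logprob_threshold=-0.8):
    """Keep segments unless they look like text invented over silence.

    A segment is dropped only when the model both reports a high chance of
    no speech and has low confidence in the decoded text. Both thresholds
    are illustrative values, not recommended defaults.
    """
    kept = []
    for seg in segments:
        likely_silent = seg.get("no_speech_prob", 0.0) > no_speech_threshold
        low_confidence = seg.get("avg_logprob", 0.0) < logprob_threshold
        if likely_silent and low_confidence:
            continue
        kept.append(seg)
    return kept


sample = [
    {"text": "Let's begin.", "no_speech_prob": 0.05, "avg_logprob": -0.2},
    {"text": "Thanks for watching!", "no_speech_prob": 0.92, "avg_logprob": -1.4},
]
cleaned = drop_probable_silence(sample)
```

Whisper applies a similar check internally while decoding, so this post-filter is only useful when stricter settings are wanted for a particular recording.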

Stable Diffusion (Image Generation)

Text-to-image generation with various control methods.

SD Use Cases

| Task | Pipeline |
|---|---|
| Text-to-image | DiffusionPipeline |
| Style transfer | Image2Image |
| Fill regions | Inpainting |
| Guided generation | ControlNet |
| Custom styles | LoRA adapters |

SD Models

| Model | Resolution | Quality |
|---|---|---|
| SDXL | 1024×1024 | Best |
| SD 1.5 | 512×512 | Good, faster |
| SD 2.1 | 768×768 | Middle ground |

Key Parameters

| Parameter | Effect | Typical Value |
|---|---|---|
| num_inference_steps | Quality vs speed | 20-50 |
| guidance_scale | Prompt adherence | 7-12 |
| negative_prompt | Avoid artifacts | "blurry, low quality" |
| strength (img2img) | How much to change | 0.5-0.8 |
| seed | Reproducibility | Fixed number |
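
A minimal sketch of how these parameters fit together in a Hugging Face diffusers call. The helper names, the default values, and the SDXL model ID are assumptions for illustration. In diffusers the seed is supplied through a `torch.Generator` rather than as a plain integer argument. Imports are deferred so the argument builder can run without a GPU.

```python
def build_generation_kwargs(prompt, negative_prompt="blurry, low quality",
                            steps=30, guidance_scale=7.5):
    """Assemble pipeline keyword arguments from the parameters in the table."""
    if steps < 1:
        raise ValueError("steps must be at least 1")
    return {
        "prompt": prompt,
        "negative_prompt": negative_prompt,
        "num_inference_steps": steps,
        "guidance_scale": guidance_scale,
    }


def generate(prompt, seed=42,
             model_id="stabilityai/stable-diffusion-xl-base-1.0", **overrides):
    """Generate one image on a CUDA GPU (requires torch and diffusers)."""
    import torch  # deferred: heavy dependency
    from diffusers import DiffusionPipeline

    pipe = DiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
    pipe = pipe.to("cuda")
    # A seeded generator makes the sampled noise, and so the image, repeatable.
    generator = torch.Generator(device="cpu").manual_seed(seed)
    kwargs = build_generation_kwargs(prompt, **overrides)
    return pipe(generator=generator, **kwargs).images[0]


kwargs = build_generation_kwargs("a lighthouse at dusk, oil painting")
```

Calling `generate("a lighthouse at dusk", seed=7, steps=25)` twice with the same seed should give matching images on the same hardware and library versions.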

Control Methods

| Method | Input | Use Case |
|---|---|---|
| ControlNet | Edge/depth/pose | Structural guidance |
| LoRA | Trained weights | Custom styles |
| Img2Img | Source image | Style transfer |
| Inpainting | Image + mask | Fill regions |
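
For img2img, `strength` also affects how many denoising steps actually run. The sketch below reflects the behaviour of the diffusers img2img pipelines as far as it is understood here: the source image is noised part of the way, and only about `num_inference_steps * strength` steps are executed. The function name is hypothetical.

```python
def img2img_effective_steps(num_inference_steps, strength):
    """Approximate number of denoising steps an img2img call will run."""
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be between 0 and 1")
    return min(int(num_inference_steps * strength), num_inference_steps)


# With 40 requested steps, strength 0.5 runs about half of them.
half = img2img_effective_steps(40, 0.5)
# Strength 1.0 discards the source image entirely and runs every step.
full = img2img_effective_steps(40, 1.0)
```

A low `strength` therefore both preserves more of the source image and finishes sooner, which is worth remembering when comparing timings against text-to-image runs.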

Memory Optimization

| Technique | Effect |
|---|---|
| CPU offload | Reduces VRAM usage |
| Attention slicing | Trades speed for memory |
| VAE tiling | Large image support |
| xFormers | Faster attention |
| DPM scheduler | Fewer steps needed |
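
Several of these techniques are single method calls on a diffusers pipeline. The helper below is a hypothetical convenience wrapper: it assumes `pipe` exposes `enable_model_cpu_offload`, `enable_attention_slicing`, and `enable_vae_tiling`, as Stable Diffusion pipelines in diffusers do, and it returns the list of options it turned on.

```python
def apply_memory_optimizations(pipe, cpu_offload=True,
                               attention_slicing=True, vae_tiling=False):
    """Enable memory-saving options on a diffusers-style pipeline object."""
    applied = []
    if cpu_offload:
        # Moves sub-models to the GPU only while they run. When this is on,
        # do not also call pipe.to("cuda").
        pipe.enable_model_cpu_offload()
        applied.append("cpu_offload")
    if attention_slicing:
        # Computes attention in chunks: lower peak memory, somewhat slower.
        pipe.enable_attention_slicing()
        applied.append("attention_slicing")
    if vae_tiling:
        # Decodes the image tile by tile, which helps at large resolutions.
        pipe.enable_vae_tiling()
        applied.append("vae_tiling")
    return applied
```

On a card with plenty of free VRAM it is usually better to leave these off, since most of them trade speed for memory.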

Key concept: Use SDXL for quality, SD 1.5 for speed. Always use negative prompts.

SD Limitations

  • GPU strongly recommended (CPU very slow)
  • Large VRAM requirements for SDXL
  • May generate anatomical errors
  • Prompt engineering matters

Common Patterns

Embedding and Similarity

All three models use embeddings:

  • CLIP: Image/text embeddings for similarity
  • Whisper: Audio embeddings for transcription
  • SD: Text embeddings for image conditioning

GPU Acceleration

| Model | Approx. VRAM Needed |
|---|---|
| CLIP ViT-B/32 | ~2 GB |
| Whisper turbo | ~6 GB |
| SD 1.5 | ~6 GB |
| SDXL | ~10 GB |
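
As a small worked example, the table can be turned into a lookup that lists which models fit on a given GPU. The figures are the approximate values from the table; the function name and the 1 GB headroom are assumptions.

```python
# Approximate VRAM needs in GB, copied from the table above.
VRAM_GB = {
    "CLIP ViT-B/32": 2,
    "Whisper turbo": 6,
    "SD 1.5": 6,
    "SDXL": 10,
}


def models_that_fit(available_gb, headroom_gb=1.0):
    """Return the models whose approximate need plus headroom fits in available_gb."""
    return [
        name for name, needed in VRAM_GB.items()
        if needed + headroom_gb <= available_gb
    ]


fits_8gb = models_that_fit(8)
```

On an 8 GB card this leaves SDXL out, which is where the memory optimizations listed earlier become relevant.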

Best Practices

| Practice | Why |
|---|---|
| Use recommended model sizes | Best quality/speed balance |
| Cache embeddings (CLIP) | Expensive to recompute |
| Specify language (Whisper) | Faster than auto-detect |
| Use negative prompts (SD) | Avoid common artifacts |
| Set seeds for reproducibility | Consistent results |
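
The embedding-caching practice can be sketched as a small in-memory cache keyed by a hash of the input bytes. `EmbeddingCache` is a hypothetical class, and `embed_fn` stands in for whatever function produces a CLIP embedding from image bytes.

```python
import hashlib


class EmbeddingCache:
    """Memoize embeddings by the SHA-256 of the input bytes."""

    def __init__(self, embed_fn):
        self._embed_fn = embed_fn
        self._store = {}
        self.misses = 0  # how many times embed_fn was actually called

    def get(self, data):
        key = hashlib.sha256(data).hexdigest()
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._embed_fn(data)
        return self._store[key]


# Stand-in embedder for demonstration; a real one would run the CLIP image encoder.
cache = EmbeddingCache(lambda data: [float(len(data))])
first = cache.get(b"image-bytes")
second = cache.get(b"image-bytes")  # served from the cache, no recomputation
```

For an image search index, the same idea extends to persisting the dictionary on disk so embeddings survive restarts.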

Metadata

License: unknown
Version: 1.0.0
Updated: 1/15/2026
Publisher: eyadsibai
