plugin: video-editing updated: 2026-01-20

Transcription with Whisper

Production-ready patterns for audio/video transcription using OpenAI Whisper.

System Requirements

Installation Options

Option 1: OpenAI Whisper (Python)

# macOS/Linux/Windows (ffmpeg must also be installed and on PATH)
pip install openai-whisper

# Verify
whisper --help

Option 2: whisper.cpp (C++ - faster)

# macOS
brew install whisper-cpp

# Linux - build from source
git clone https://github.com/ggerganov/whisper.cpp
cd whisper.cpp && make

# Windows - use pre-built binaries or build with cmake

Option 3: Insanely Fast Whisper (GPU accelerated)

pip install insanely-fast-whisper

Model Selection

Model	Parameters	VRAM	Accuracy	Speed	Use Case
tiny	39M	~1GB	Low	Fastest	Quick previews
base	74M	~1GB	Medium	Fast	Draft transcripts
small	244M	~2GB	Good	Medium	General use
medium	769M	~5GB	Better	Slow	Quality transcripts
large-v3	1550M	~10GB	Best	Slowest	Final production

Recommendation: Start with small for speed/quality balance. Use large-v3 for final delivery.
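The table can also be read as a lookup. A minimal sketch, assuming the approximate VRAM figures above; `pick_model` and `MODEL_VRAM_GB` are hypothetical names for illustration and are not part of Whisper:

```python
# Approximate VRAM needs in GB, copied from the table above (largest model first).
MODEL_VRAM_GB = [
    ("large-v3", 10),
    ("medium", 5),
    ("small", 2),
    ("base", 1),
    ("tiny", 1),
]

def pick_model(available_vram_gb):
    """Return the largest model that fits the budget, or None if nothing fits."""
    for name, need in MODEL_VRAM_GB:
        if available_vram_gb >= need:
            return name
    return None

print(pick_model(6))  # medium
```

The figures are rough, so leaving some headroom below the real VRAM size is probably wise.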

Basic Transcription

Using OpenAI Whisper

# Basic transcription (auto-detect language)
whisper audio.mp3 --model small

# Specify language and output format
whisper audio.mp3 --model medium --language en --output_format srt

# Multiple output formats
whisper audio.mp3 --model small --output_format all

# Word-level timestamps (written to the JSON output)
whisper audio.mp3 --model small --word_timestamps True

Using whisper.cpp

# Download model first
./models/download-ggml-model.sh base.en

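# Transcribe a 16 kHz WAV to SRT (recent builds name the binary whisper-cli instead of main)
./main -m models/ggml-base.en.bin -f audio.wav -osrt

# Write segments with start/end times as CSV
./main -m models/ggml-base.en.bin -f audio.wav -ocsv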

Output Formats

SRT (SubRip Subtitle)

1
00:00:01,000 --> 00:00:04,500
Hello and welcome to this video.

2
00:00:05,000 --> 00:00:08,200
Today we'll discuss video editing.

VTT (WebVTT)

WEBVTT

00:00:01.000 --> 00:00:04.500
Hello and welcome to this video.

00:00:05.000 --> 00:00:08.200
Today we'll discuss video editing.

JSON (with word-level timing)

{
  "text": "Hello and welcome to this video.",
  "segments": [
    {
      "id": 0,
      "start": 1.0,
      "end": 4.5,
      "text": " Hello and welcome to this video.",
      "words": [
        {"word": "Hello", "start": 1.0, "end": 1.3},
        {"word": "and", "start": 1.4, "end": 1.5},
        {"word": "welcome", "start": 1.6, "end": 2.0},
        {"word": "to", "start": 2.1, "end": 2.2},
        {"word": "this", "start": 2.3, "end": 2.5},
        {"word": "video", "start": 2.6, "end": 3.0}
      ]
    }
  ]
}
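The three formats carry the same segment data, so one can be derived from another. A minimal sketch that renders JSON-style segments as SRT; `srt_timestamp` and `segments_to_srt` are hypothetical helper names, not Whisper functions:

```python
def srt_timestamp(seconds):
    """Format seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"

def segments_to_srt(segments):
    """Render Whisper-style segments (dicts with start, end, text) as SRT text."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n"
            f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"{seg['text'].strip()}\n"
        )
    # Cues are separated by one blank line.
    return "\n".join(blocks)

segments = [
    {"start": 1.0, "end": 4.5, "text": " Hello and welcome to this video."},
    {"start": 5.0, "end": 8.2, "text": " Today we'll discuss video editing."},
]
print(segments_to_srt(segments))
```

For VTT, the same approach applies with a `WEBVTT` header line, a period instead of a comma before the milliseconds, and no cue numbers.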

Audio Extraction for Transcription

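Before transcribing video, extract the audio track. Whisper resamples its input to 16 kHz mono internally, so extracting in that format keeps files small without costing accuracy:

# Extract audio as WAV (16kHz, mono - optimal for Whisper)
ffmpeg -i video.mp4 -vn -ar 16000 -ac 1 -c:a pcm_s16le audio.wav

# Extract as full-rate WAV for archival (keeps the source sample rate and channels)
ffmpeg -i video.mp4 -vn -c:a pcm_s16le audio_archive.wav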

# Extract as compressed MP3 (smaller, still works)
ffmpeg -i video.mp4 -vn -c:a libmp3lame -q:a 2 audio.mp3

Timing Synchronization

Convert Whisper JSON to FCP Timing

import json

def whisper_to_fcp_timing(whisper_json_path, fps=24):
    """Convert Whisper JSON output to FCP-compatible timing."""
    with open(whisper_json_path) as f:
        data = json.load(f)

    segments = []
    for seg in data.get("segments", []):
        segments.append({
            "start_time": seg["start"],
            "end_time": seg["end"],
            # Round to the nearest frame; int() would truncate toward zero
            "start_frame": round(seg["start"] * fps),
            "end_frame": round(seg["end"] * fps),
            "text": seg["text"].strip(),
            "words": seg.get("words", [])
        })

    return segments
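Multiplying seconds by an integer `fps` works for 24 or 25 fps, but NTSC-style rates such as 24000/1001 are not integers. A minimal sketch of exact frame math with rationals, plus a non-drop-frame timecode formatter; `seconds_to_frame` and `frame_to_timecode` are hypothetical helpers, and drop-frame timecode is deliberately not handled:

```python
from fractions import Fraction

def seconds_to_frame(seconds, fps):
    """Nearest frame index for a timestamp; fps may be a Fraction such as 24000/1001."""
    return round(Fraction(str(seconds)) * Fraction(fps))

def frame_to_timecode(frame, nominal_fps):
    """Non-drop-frame HH:MM:SS:FF timecode for an integer nominal rate (24, 25, 30)."""
    seconds, ff = divmod(frame, nominal_fps)
    minutes, ss = divmod(seconds, 60)
    hh, mm = divmod(minutes, 60)
    return f"{hh:02d}:{mm:02d}:{ss:02d}:{ff:02d}"

frame = seconds_to_frame(4.5, 24)
print(frame, frame_to_timecode(frame, 24))  # 108 00:00:04:12
```

For 23.976 fps material, pass `Fraction(24000, 1001)` as `fps` and 24 as `nominal_fps`.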

Frame-Accurate Timing

# Get exact frame count and duration
ffprobe -v error -count_frames -select_streams v:0 \
  -show_entries stream=nb_read_frames,duration,r_frame_rate \
  -of json video.mp4
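The JSON that ffprobe prints can be parsed into an exact frame rate for use in frame calculations. A minimal sketch, assuming the output has the usual `streams` array with string-valued fields; `parse_stream_timing` is a hypothetical helper and the sample values are made up for illustration:

```python
import json
from fractions import Fraction

def parse_stream_timing(ffprobe_json):
    """Extract frame rate, frame count, and duration from ffprobe's JSON text."""
    stream = json.loads(ffprobe_json)["streams"][0]
    return {
        "fps": Fraction(stream["r_frame_rate"]),  # e.g. "24000/1001"
        "frames": int(stream["nb_read_frames"]),
        "duration": float(stream["duration"]),
    }

# Illustrative sample shaped like ffprobe output; not taken from a real file.
sample = (
    '{"streams": [{"r_frame_rate": "24000/1001", '
    '"duration": "10.010000", "nb_read_frames": "240"}]}'
)
info = parse_stream_timing(sample)
print(info["fps"], info["frames"], info["duration"])
```

Some containers do not report a stream-level `duration`, so real code may need a fallback for a missing key.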

Speaker Diarization

For multi-speaker content, use pyannote.audio:

pip install pyannote.audio

from pyannote.audio import Pipeline

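# The pretrained pipeline is gated on Hugging Face: accept its terms there, then pass your access token
pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization@2.1", use_auth_token="YOUR_HF_TOKEN")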
diarization = pipeline("audio.wav")

for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{turn.start:.1f}s - {turn.end:.1f}s: {speaker}")
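Diarization yields speaker turns, not text, so the turns still have to be matched against Whisper's segments. A minimal sketch of one common approach, labeling each segment with the speaker whose turn overlaps it most; `assign_speakers` is a hypothetical helper, and the turns are assumed to have been collected as plain `(start, end, speaker)` tuples:

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the overlap between two intervals (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))

def assign_speakers(segments, turns):
    """Label each segment with the speaker whose turn overlaps it most.

    segments: dicts with start, end, text (Whisper-style).
    turns: (start, end, speaker) tuples.
    """
    labeled = []
    for seg in segments:
        best = max(
            turns,
            key=lambda t: overlap(seg["start"], seg["end"], t[0], t[1]),
            default=None,
        )
        has_overlap = (
            best is not None
            and overlap(seg["start"], seg["end"], best[0], best[1]) > 0
        )
        labeled.append({**seg, "speaker": best[2] if has_overlap else None})
    return labeled

turns = [(0.0, 4.8, "SPEAKER_00"), (4.8, 9.0, "SPEAKER_01")]
segments = [
    {"start": 1.0, "end": 4.5, "text": "Hello and welcome to this video."},
    {"start": 5.0, "end": 8.2, "text": "Today we'll discuss video editing."},
]
for seg in assign_speakers(segments, turns):
    print(seg["speaker"], seg["text"])
```

Segments that span a speaker change get a single label this way; matching on word-level timestamps instead would be finer-grained.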

Batch Processing

#!/bin/bash
# Transcribe all videos in directory

MODEL="small"
OUTPUT_DIR="transcripts"
mkdir -p "$OUTPUT_DIR"

for video in *.mp4 *.mov *.avi; do
  [[ -f "$video" ]] || continue

  base="${video%.*}"

  # Extract audio (skip this file if extraction fails)
  ffmpeg -nostdin -i "$video" -vn -ar 16000 -ac 1 -c:a pcm_s16le "/tmp/${base}.wav" -y || continue

  # Transcribe
  whisper "/tmp/${base}.wav" --model "$MODEL" \
    --output_format all \
    --output_dir "$OUTPUT_DIR"

  # Cleanup temp audio
  rm "/tmp/${base}.wav"

  echo "Transcribed: $video"
done

Quality Optimization

Improve Accuracy

Noise reduction before transcription:

ffmpeg -i noisy_audio.wav -af "highpass=f=200,lowpass=f=3000,afftdn=nf=-25" clean_audio.wav

Use language hint:

whisper audio.mp3 --language en --model medium

Provide initial prompt for context:

whisper audio.mp3 --initial_prompt "Technical discussion about video editing software."

Performance Tips

GPU acceleration (if available):

whisper audio.mp3 --model large-v3 --device cuda

Process in chunks for long videos:

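# Split audio into 10-minute (600 s) chunks
ffmpeg -i audio.wav -f segment -segment_time 600 -c copy chunk_%03d.wav
# Transcribe each chunk to JSON
for f in chunk_*.wav; do whisper "$f" --model small --output_format json; done
# Merge results, adding each chunk's start offset (chunk index * 600 s) to its timestamps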
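The merge step amounts to shifting every timestamp by the start time of the chunk it came from. A minimal sketch, assuming fixed-length chunks and per-chunk segment lists already loaded from Whisper's JSON output; `merge_chunks` is a hypothetical helper:

```python
def merge_chunks(chunk_results, chunk_seconds=600):
    """Concatenate per-chunk segment lists, shifting times by each chunk's offset.

    chunk_results: list (in chunk order) of Whisper-style segment lists.
    Assumes every chunk except the last is exactly chunk_seconds long.
    """
    merged = []
    for index, segments in enumerate(chunk_results):
        offset = index * chunk_seconds
        for seg in segments:
            merged.append({
                **seg,
                "id": len(merged),  # renumber across the whole file
                "start": seg["start"] + offset,
                "end": seg["end"] + offset,
            })
    return merged

chunks = [
    [{"id": 0, "start": 1.0, "end": 4.5, "text": "First chunk."}],
    [{"id": 0, "start": 2.0, "end": 3.0, "text": "Second chunk."}],
]
for seg in merge_chunks(chunks):
    print(seg["id"], seg["start"], seg["end"], seg["text"])
```

Fixed-length splitting can cut a word at a chunk boundary, so splitting at silences is likely to give cleaner results.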

Error Handling

# Validate audio file before transcription
validate_audio() {
  local file="$1"
  if ffprobe -v error -select_streams a:0 -show_entries stream=codec_type -of csv=p=0 "$file" 2>/dev/null | grep -q "audio"; then
    return 0
  else
    echo "Error: No audio stream found in $file" >&2
    return 1
  fi
}

# Check Whisper installation
check_whisper() {
  if command -v whisper &> /dev/null; then
    echo "Whisper available"
    return 0
  else
    echo "Error: Whisper not installed. Run: pip install openai-whisper" >&2
    return 1
  fi
}

Related Skills

ffmpeg-core - Audio extraction and preprocessing
final-cut-pro - Import transcripts as titles/markers
