piper-tts-training

Train custom TTS voices for Piper (ONNX format) using fine-tuning or from-scratch approaches. Use when creating new synthetic voices, fine-tuning existing Piper checkpoints, preparing audio datasets for TTS training, or deploying voice models to devices like Raspberry Pi or Home Assistant. Covers dataset preparation, Whisper-based validation, training configuration, and ONNX export.

SKILL.md

Piper TTS Voice Training

Train custom text-to-speech voices compatible with Piper's lightweight ONNX runtime.

Overview

Piper produces fast, offline TTS suitable for embedded devices. Training involves:

  1. Corpus preparation (text covering phonetic range)
  2. Audio generation or recording
  3. Quality validation via Whisper transcription
  4. Fine-tuning from existing checkpoint (recommended) or training from scratch
  5. ONNX export for deployment

Fine-tuning vs from-scratch:

  • Fine-tuning: ~1,300 phrases + 1,000 epochs (days on a modest GPU)
  • From scratch: ~13,000+ phrases + 2,000+ epochs (weeks to months)

Workflow

1. Corpus Preparation

Gather 1,300-1,500+ phrases covering a broad phonetic range:

  • Use piper-recording-studio corpus as base
  • Add domain-specific phrases for your use case
  • Include varied sentence structures and lengths

Critical for non-US English: Ensure the corpus uses correct regional spelling. See the Localisation section below.
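
Before committing to audio generation, a quick sanity check helps confirm the corpus is large and varied enough. A minimal stdlib-only sketch (the corpus.txt filename is an assumption; adapt to your corpus layout):

import collections

# Hypothetical corpus file: one phrase per line
with open("corpus.txt", encoding="utf-8") as f:
    phrases = [line.strip() for line in f if line.strip()]

words = [w.strip(".,!?;:\"'").lower() for p in phrases for w in p.split()]
lengths = [len(p.split()) for p in phrases]

print(f"{len(phrases)} phrases, {len(set(words))} distinct words")
print(f"length: min={min(lengths)}, max={max(lengths)}, "
      f"mean={sum(lengths) / len(lengths):.1f} words")

# Heavily repeated words can bias the voice toward a few contexts
for word, count in collections.Counter(words).most_common(10):
    print(f"  {word}: {count}")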

2. Audio Generation

Generate or record training audio as 22050Hz mono WAV.

If using voice cloning (e.g., Chatterbox TTS):

  • Generate at source sample rate (often 24kHz)
  • Convert to 22050Hz: sox -v 0.95 input.wav -r 22050 -t wav output.wav
  • The -v 0.95 prevents clipping during resampling
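
For a directory of cloned audio, the conversion can be batched; a sketch assuming sox is on PATH (directory names are hypothetical):

import pathlib
import subprocess

src = pathlib.Path("raw_24khz")      # hypothetical source directory
dst = pathlib.Path("wavs_22050")
dst.mkdir(exist_ok=True)

for wav in sorted(src.glob("*.wav")):
    # -v 0.95 scales the signal down slightly to prevent clipping
    # introduced by the resampling filter
    subprocess.run(
        ["sox", "-v", "0.95", str(wav), "-r", "22050", str(dst / wav.name)],
        check=True,
    )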

Recording requirements:

  • Consistent microphone position and room acoustics
  • Minimal background noise
  • Natural speaking pace (not reading voice)

3. Quality Validation with Whisper

Automate quality checks rather than manual listening:

import whisper
from piper_phonemize import phonemize_text

# Load the model once, outside the validation loop
model = whisper.load_model("base")

def validate_sample(audio_path, expected_text):
    """Return True if Whisper's transcription matches the expected text phonemically."""
    result = model.transcribe(audio_path)
    transcribed = result["text"].strip()

    # Compare phonemically so spelling, casing, and punctuation
    # differences don't register as failures
    expected_phonemes = phonemize_text(expected_text, "en-gb")
    transcribed_phonemes = phonemize_text(transcribed, "en-gb")

    return expected_phonemes == transcribed_phonemes

Retry failed samples up to 3 times. Target 95%+ dataset coverage.
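
A driver loop over the whole dataset might look like the sketch below, reusing validate_sample from above; regenerate_sample is a hypothetical stand-in for whatever regeneration or re-recording step you use:

def regenerate_sample(sample_id, text):
    """Hypothetical hook: re-synthesise or re-record the failed sample."""
    ...

def validate_dataset(samples, max_retries=3):
    """samples: list of (sample_id, text) pairs."""
    failed = []
    for sample_id, text in samples:
        path = f"dataset/wavs/{sample_id}.wav"
        for _ in range(max_retries):
            if validate_sample(path, text):
                break
            regenerate_sample(sample_id, text)
        else:
            failed.append(sample_id)
    coverage = 1 - len(failed) / len(samples)
    print(f"coverage: {coverage:.1%} (target: 95%+)")
    return failed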

4. Dataset Format (LJSpeech)

Structure your dataset:

dataset/
├── metadata.csv
└── wavs/
    ├── sample_0001.wav
    ├── sample_0002.wav
    └── ...

metadata.csv format: {id}|{text} (pipe-separated, no headers)

sample_0001|The quick brown fox jumps over the lazy dog.
sample_0002|Pack my box with five dozen liquor jugs.
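
If you assemble the dataset programmatically, writing metadata.csv is a one-liner per sample; this sketch assumes phrases is a list of (id, text) pairs:

phrases = [
    ("sample_0001", "The quick brown fox jumps over the lazy dog."),
    ("sample_0002", "Pack my box with five dozen liquor jugs."),
]

with open("dataset/metadata.csv", "w", encoding="utf-8") as f:
    for sample_id, text in phrases:
        f.write(f"{sample_id}|{text}\n")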

5. Preprocessing

Convert to PyTorch tensors:

python3 -m piper_train.preprocess \
    --language en-gb \
    --input-dir dataset/ \
    --output-dir piper_training_dir/ \
    --dataset-format ljspeech \
    --single-speaker \
    --sample-rate 22050

Use en-gb for Australian/NZ/UK voices (espeak-ng phoneme set).
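
Before preprocessing, it is worth verifying that every metadata entry has a matching WAV in the expected format; a stdlib-only sketch:

import pathlib
import wave

dataset = pathlib.Path("dataset")

with open(dataset / "metadata.csv", encoding="utf-8") as f:
    ids = [line.split("|", 1)[0] for line in f if line.strip()]

for sample_id in ids:
    path = dataset / "wavs" / f"{sample_id}.wav"
    if not path.exists():
        print(f"missing: {path}")
        continue
    with wave.open(str(path)) as w:
        if w.getframerate() != 22050 or w.getnchannels() != 1:
            print(f"bad format: {path} "
                  f"({w.getframerate()}Hz, {w.getnchannels()}ch)")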

6. Training

Fine-tuning (recommended):

python3 -m piper_train \
    --dataset-dir piper_training_dir/ \
    --accelerator gpu \
    --devices 1 \
    --batch-size 12 \
    --max_epochs 3000 \
    --resume_from_checkpoint ljspeech-2000.ckpt \
    --checkpoint-epochs 100 \
    --quality high \
    --precision 32

Key parameters:

  • --batch-size: Reduce if VRAM is limited (12 works on 8GB)
  • --max_epochs: Counted from the checkpoint's epoch, so 3000 with a 2,000-epoch checkpoint gives ~1,000 epochs of fine-tuning
  • --resume_from_checkpoint: Start from the LJSpeech high-quality checkpoint
  • --precision 32: More stable than mixed precision
  • --validation-split 0.0 --num-test-examples 0: Skip validation for small datasets

Monitor with TensorBoard: watch loss_disc_all for convergence.
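
To check convergence outside the TensorBoard UI, the event files can be read directly. A sketch assuming the default Lightning log layout under the training directory (the exact version_N path will vary):

from tensorboard.backend.event_processing.event_accumulator import EventAccumulator

# Adjust to the actual lightning_logs/version_N directory
acc = EventAccumulator("piper_training_dir/lightning_logs/version_0")
acc.Reload()

# Print the most recent discriminator loss values
for event in acc.Scalars("loss_disc_all")[-5:]:
    print(f"step {event.step}: {event.value:.4f}")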

7. ONNX Export

python3 -m piper_train.export_onnx checkpoint.ckpt output.onnx.unoptimized
onnxsim output.onnx.unoptimized output.onnx

Copy the training directory's config.json alongside the model as output.onnx.json; Piper reads this metadata file from next to the .onnx at load time.
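
A quick load test catches export problems before deployment; this sketch only inspects the graph's inputs and outputs rather than assuming their exact names:

import onnxruntime

session = onnxruntime.InferenceSession(
    "output.onnx", providers=["CPUExecutionProvider"]
)

# A healthy export loads without errors and exposes the phoneme-id
# and scale inputs expected by the Piper runtime
for inp in session.get_inputs():
    print("input:", inp.name, inp.shape, inp.type)
for out in session.get_outputs():
    print("output:", out.name, out.shape)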

Localisation for Australian, New Zealand and UK English

Piper uses espeak-ng for phonemisation. American pronunciations in training data cause accent drift.

Corpus preparation:

  • Run scripts/convert_spelling.py on corpus text before training
  • Use en-gb or en-au espeak-ng voice for phonemisation
  • Review generated phonemes for Americanisms

Common spelling conversions:

  American    Australian/UK
  -ize        -ise
  -or         -our
  -er         -re
  -og         -ogue
  -ense       -ence
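
scripts/convert_spelling.py ships with the skill; for illustration, a minimal word-map version might look like the sketch below, with the mapping abbreviated to a tiny hypothetical subset:

import re

# Tiny hypothetical subset; the real script uses a full word list
AMERICAN_TO_UK = {
    "color": "colour",
    "center": "centre",
    "organize": "organise",
    "catalog": "catalogue",
    "defense": "defence",
}

def convert_spelling(text):
    def fix(match):
        word = match.group(0)
        replacement = AMERICAN_TO_UK.get(word.lower(), word)
        # Preserve leading capitalisation
        return replacement.capitalize() if word[0].isupper() else replacement
    return re.sub(r"[A-Za-z]+", fix, text)

print(convert_spelling("Organize the color catalog."))
# Organise the colour catalogue.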

Phoneme considerations:

  • /r/ linking and intrusion patterns differ
  • Vowel sounds in words like "dance", "bath", "castle"
  • Final -ile pronunciation (hostile, missile)

For complete word lists and phonetic details, see references/localisation.md.

Validation: Use Whisper with language="en" and verify transcriptions match expected regional forms.
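
The same word map can drive that check: American spellings in Whisper's transcript of the generated audio suggest drift toward American forms. A sketch reusing the model and AMERICAN_TO_UK names from the earlier sketches:

def check_regional_forms(audio_path):
    # model and AMERICAN_TO_UK are defined in the earlier sketches
    result = model.transcribe(audio_path, language="en")
    words = [w.strip(".,!?;:").lower() for w in result["text"].split()]
    # Any American spelling in the transcript is worth a manual listen
    return [w for w in words if w in AMERICAN_TO_UK]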

Dependencies

Pin versions to avoid API breakage:

pytorch-lightning==1.9.3
torch<2.6.0
piper-phonemize
onnxruntime-gpu
onnxsim

Docker containerisation is recommended for reproducibility.

Hardware Requirements

Minimum (fine-tuning):

  • 8GB VRAM GPU (Pascal or newer)
  • 8GB system RAM
  • ~5 days for 1,000 epochs on Tesla P4

From scratch: expect roughly an order of magnitude more time (weeks to months rather than days).

Troubleshooting

  Issue                    Solution
  CUDA OOM                 Reduce batch-size (try 8 or 4)
  Checkpoint won't load    Check that the pytorch-lightning version matches the checkpoint
  Garbled output           Insufficient training epochs or dataset too small
  Wrong accent             Check the espeak-ng language code and corpus spelling
