Qwen ASR Skill - Local Speech-to-Text

Transcribe audio files to text locally using the Qwen3-ASR-1.7B model.

Setup

Install dependencies into a Python 3.10+ virtual environment:

python -m venv .venv
source .venv/bin/activate
pip install qwen-asr soundfile silero-vad

For GPUs with compute capability < 7.0 (e.g. GTX 1060), install PyTorch 2.4.x with CUDA 11.8:

pip install torch==2.4.1 torchaudio==2.4.1 --index-url https://download.pytorch.org/whl/cu118

Usage

Run the transcription script with the path to an audio file:

python scripts/transcribe.py <audio_path>

| Parameter | Description | Default |
| --- | --- | --- |
| `audio_path` | Absolute path to the audio file (required) | - |
| `--language` | Force the transcription language (e.g. Chinese, English) | Auto-detect |
| `--device` | Inference device: `auto`, `cuda`, or `cpu` | `auto` |
| `--model-path` | Local model path or Hugging Face ID | `~/models/Qwen3-ASR-1.7B` |
| `--max-chunk-sec` | Maximum chunk duration in seconds for VAD splitting; longer audio is split at silence boundaries | 90 |
| `--max-new-tokens` | Maximum number of tokens to generate; increase for long audio | 2048 |
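
To make the defaults concrete, here is a minimal sketch of how a command line with these parameters could be declared using `argparse`. It is not the source of `scripts/transcribe.py`; the function name `build_parser`, the help strings, and the choice of `int` for `--max-chunk-sec` are assumptions made for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the parameter table (not the script's real code)."""
    parser = argparse.ArgumentParser(description="Transcribe an audio file locally.")
    parser.add_argument("audio_path", help="Absolute path to the audio file")
    # None stands for "auto-detect the language" in this sketch.
    parser.add_argument("--language", default=None, help="Force a language, e.g. Chinese")
    parser.add_argument("--device", choices=["auto", "cuda", "cpu"], default="auto")
    parser.add_argument("--model-path", default="~/models/Qwen3-ASR-1.7B")
    parser.add_argument("--max-chunk-sec", type=int, default=90)
    parser.add_argument("--max-new-tokens", type=int, default=2048)
    return parser


# Parse an example command line instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(["/path/to/audio.wav", "--device", "cpu"])
print(args.device, args.max_chunk_sec, args.max_new_tokens)
```

Omitted options fall back to the defaults in the table, so only `audio_path` has to be supplied.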

Basic transcription:

python scripts/transcribe.py /path/to/audio.wav

Force language:

python scripts/transcribe.py /path/to/audio.mp3 --language Chinese

Force CPU inference:

python scripts/transcribe.py /path/to/audio.flac --device cpu
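
When `--device` is left at `auto`, the script is described as trying the GPU in float16 first and falling back to the CPU in float32. The sketch below shows one way such a fallback could be structured. The names `load_with_fallback` and `fake_loader` are hypothetical, the loader is passed in as a callable so the example runs without PyTorch, and using float16 for a forced `cuda` device is an assumption. It relies on PyTorch reporting CUDA out-of-memory as a subclass of `RuntimeError`.

```python
from typing import Callable, Tuple, TypeVar

Model = TypeVar("Model")


def load_with_fallback(
    load_model: Callable[[str, str], Model], device: str = "auto"
) -> Tuple[Model, str]:
    """Return (model, device_used); load_model(device, dtype) is supplied by the caller."""
    if device != "auto":
        # An explicit device is honored as-is, with no fallback.
        dtype = "float16" if device == "cuda" else "float32"
        return load_model(device, dtype), device
    try:
        return load_model("cuda", "float16"), "cuda"
    except RuntimeError:
        # Covers CUDA out-of-memory; retry on the CPU in full precision.
        return load_model("cpu", "float32"), "cpu"


def fake_loader(device: str, dtype: str) -> str:
    """Stand-in loader that simulates a GPU without enough VRAM."""
    if device == "cuda":
        raise RuntimeError("CUDA out of memory (simulated)")
    return f"model on {device} ({dtype})"


model, used = load_with_fallback(fake_loader)
print(used, model)
```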

The script writes the result as JSON to stdout and status messages to stderr:

{"language": "Chinese", "text": "Transcribed text content"}

On error:

{"error": "Error description"}
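
Because the result and the error share one JSON channel, a caller has to check for the `error` key before reading `text`. The following is a minimal sketch of a wrapper that does so. The names `parse_result`, `transcribe`, and `TranscriptionError` are hypothetical, and the sketch assumes that stdout carries exactly one JSON object.

```python
import json
import subprocess
import sys


class TranscriptionError(RuntimeError):
    """Raised when the script reports {"error": ...}."""


def parse_result(stdout_text: str) -> dict:
    """Decode the script's stdout and turn an error payload into an exception."""
    payload = json.loads(stdout_text)
    if "error" in payload:
        raise TranscriptionError(payload["error"])
    return payload


def transcribe(audio_path: str, script: str = "scripts/transcribe.py") -> dict:
    """Run the script with the current interpreter and return the parsed result."""
    proc = subprocess.run(
        [sys.executable, script, audio_path], capture_output=True, text=True
    )
    # Status messages go to proc.stderr; only stdout holds the JSON document.
    return parse_result(proc.stdout)


# Parsing the documented success payload, without invoking the script:
result = parse_result('{"language": "Chinese", "text": "Transcribed text content"}')
print(result["language"], result["text"])
```

If the script exits before printing anything, `json.loads` raises `json.JSONDecodeError`, which callers may want to handle separately from `TranscriptionError`.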

Notes

The first run downloads the model (~4.7 GB), which is cached for subsequent runs.
Auto mode tries the GPU (float16) first and falls back to the CPU (float32) if VRAM is insufficient.
Supported formats: WAV, MP3, FLAC, M4A, OGG, and other common audio formats.
Supports 52 languages, including Chinese, English, Japanese, Korean, French, and German.
Supports 22 Chinese dialects.
Long audio: audio longer than 90 s (the `--max-chunk-sec` default) is automatically split at silence boundaries using silero-vad, transcribed chunk by chunk, and then concatenated. This prevents out-of-memory (OOM) errors on GPUs with limited VRAM.
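
The splitting of long audio can be pictured as greedy packing. Given the speech regions that a VAD such as silero-vad reports, consecutive regions are merged into one chunk until adding the next would exceed the limit, and the cut then falls in the silence between them. The sketch below illustrates that idea only. The script's actual splitting logic is not shown in this document, `pack_segments` is a hypothetical name, and hard-splitting a single region longer than the limit is an assumption.

```python
from typing import List, Optional, Tuple

Segment = Tuple[float, float]  # (start_sec, end_sec) of one speech region


def pack_segments(segments: List[Segment], max_chunk_sec: float = 90.0) -> List[Segment]:
    """Greedily merge consecutive speech regions into chunks of at most max_chunk_sec."""
    chunks: List[Segment] = []
    current: Optional[List[float]] = None  # chunk being built, as [start, end]
    for start, end in segments:
        if current is not None and end - current[0] <= max_chunk_sec:
            current[1] = end  # still fits: extend across the silence gap
            continue
        if current is not None:
            # Close the chunk at the silence that precedes this region.
            chunks.append((current[0], current[1]))
            current = None
        # A single region longer than the limit has no silence to cut at: hard-split it.
        while end - start > max_chunk_sec:
            chunks.append((start, start + max_chunk_sec))
            start += max_chunk_sec
        current = [start, end]
    if current is not None:
        chunks.append((current[0], current[1]))
    return chunks


speech = [(0.0, 40.0), (42.0, 85.0), (88.0, 130.0), (131.0, 140.0)]
print(pack_segments(speech))  # [(0.0, 85.0), (88.0, 140.0)]
```

Each resulting chunk would then be transcribed on its own and the texts joined in order, so peak memory depends on the chunk length and not on the length of the whole file.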