qwen-asr-skill


Local speech-to-text transcription using Qwen3-ASR-1.7B. Use this skill when a user provides an audio file path and needs speech recognition, transcription, or ASR. Supports 52 languages and dialects, and common audio formats including WAV, MP3, FLAC, M4A, and OGG.

Updated 2/15/2026


Qwen ASR Skill - Local Speech-to-Text

Transcribe audio files to text locally using the Qwen3-ASR-1.7B model.

Setup

Install dependencies into a Python 3.10+ virtual environment:

python -m venv .venv
source .venv/bin/activate
pip install qwen-asr soundfile silero-vad

For GPUs with compute capability < 7.0 (e.g. GTX 1060), install PyTorch 2.4.x with CUDA 11.8:

pip install torch==2.4.1 torchaudio==2.4.1 --index-url https://download.pytorch.org/whl/cu118

Usage

Run the transcription script with the path to an audio file:

python scripts/transcribe.py <audio_path>

Parameters

| Parameter | Description | Default |
|---|---|---|
| audio_path | Absolute path to audio file (required) | - |
| --language | Force language (e.g. Chinese, English). Auto-detect if omitted | Auto-detect |
| --device | Inference device: auto / cuda / cpu | auto |
| --model-path | Model path or HuggingFace ID | ~/models/Qwen3-ASR-1.7B |
| --max-chunk-sec | Max chunk duration (seconds) for VAD splitting. Long audio is split at silence boundaries | 90 |
| --max-new-tokens | Max tokens to generate. Increase for long audio | 2048 |
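
To make the table concrete, here is a minimal sketch of how a command-line interface with these parameters could be declared using Python's standard `argparse` module. It is an illustration only: the function name `build_parser` is hypothetical, the choice of `float` for `--max-chunk-sec` is an assumption, and the actual `scripts/transcribe.py` may define its arguments differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the parameter table above."""
    parser = argparse.ArgumentParser(
        description="Transcribe an audio file locally with Qwen3-ASR-1.7B"
    )
    # Required positional argument.
    parser.add_argument("audio_path", help="Absolute path to the audio file")
    # None stands for "auto-detect the language".
    parser.add_argument("--language", default=None,
                        help="Force a language, e.g. Chinese or English")
    parser.add_argument("--device", choices=["auto", "cuda", "cpu"],
                        default="auto", help="Inference device")
    parser.add_argument("--model-path", default="~/models/Qwen3-ASR-1.7B",
                        help="Local model path or HuggingFace ID")
    # Assumption: seconds as a float; the real script may use an int.
    parser.add_argument("--max-chunk-sec", type=float, default=90.0,
                        help="Max chunk duration for VAD splitting")
    parser.add_argument("--max-new-tokens", type=int, default=2048,
                        help="Max tokens to generate")
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args(["/path/to/audio.mp3", "--language", "Chinese"])
print(args.audio_path, args.language, args.device)
```

Note that `argparse` exposes `--model-path` as `args.model_path` and `--max-chunk-sec` as `args.max_chunk_sec`.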

Examples

Basic transcription:

python scripts/transcribe.py /path/to/audio.wav

Force language:

python scripts/transcribe.py /path/to/audio.mp3 --language Chinese

Force CPU inference:

python scripts/transcribe.py /path/to/audio.flac --device cpu

Output Format

The script outputs JSON to stdout and status info to stderr:

{"language": "Chinese", "text": "Transcribed text content"}

On error:

{"error": "Error description"}

Notes

  • First run downloads the model (~4.7GB), cached for subsequent runs
  • Auto mode: tries GPU (float16) first, falls back to CPU (float32) if VRAM is insufficient
  • Supports: WAV, MP3, FLAC, M4A, OGG and other common audio formats
  • 52 languages and dialects in total, including Chinese, English, Japanese, Korean, French, and German
  • The total includes 22 Chinese dialects
  • Long audio: audio longer than --max-chunk-sec (90 s by default) is automatically split at silence boundaries using silero-vad, transcribed chunk by chunk, and the chunk transcripts are concatenated. This prevents out-of-memory (OOM) errors on GPUs with limited VRAM.
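
The long-audio splitting described in the last note can be sketched as a greedy packing step: given speech segments found by a voice activity detector, merge consecutive segments into a chunk until adding the next one would exceed the limit, so every cut falls in a silence gap. The function `pack_segments` below is a hypothetical illustration, not the script's actual code. It assumes the segments are `(start, end)` pairs in seconds, sorted by time, such as those derived from silero-vad's `get_speech_timestamps(..., return_seconds=True)`.

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (start_sec, end_sec)


def pack_segments(segments: List[Segment],
                  max_chunk_sec: float = 90.0) -> List[Segment]:
    """Greedily merge sorted speech segments into chunks of at most
    max_chunk_sec, cutting only in the silence between segments."""
    chunks: List[Segment] = []
    for start, end in segments:
        if chunks and end - chunks[-1][0] <= max_chunk_sec:
            # The segment still fits: extend the current chunk to its end.
            chunks[-1] = (chunks[-1][0], end)
        else:
            # Start a new chunk at this segment; the cut lands in silence.
            chunks.append((start, end))
    return chunks


speech = [(0.0, 40.0), (42.0, 85.0), (88.0, 130.0), (131.0, 170.0)]
print(pack_segments(speech))  # [(0.0, 85.0), (88.0, 170.0)]
```

One limitation of this sketch: a single speech segment longer than `max_chunk_sec` is kept as one oversized chunk rather than being split further.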

Install

Requires askill CLI v1.0+

AI Quality Score

84/100 (analyzed 3/2/2026)

Well-structured skill for local ASR transcription with clear setup, usage instructions, parameters table, and examples. Covers 52 languages and handles long audio via VAD splitting. Minor deduction for missing tags and no dedicated skills folder structure.


Metadata

License: unknown
Version: -
Updated: 2/15/2026
Publisher: brisyramshere
