Benchmark

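Speed and accuracy measurements for long-form ASR workloads. The headline result: FunASR CPU inference can be faster than Whisper GPU inference for production transcription pipelines. On CPU, SenseVoice-Small (17.2x realtime) and Paraformer-Large (15.6x) both outpace OpenAI Whisper-large-v3 on an H100 GPU (13.4x), although Whisper-large-v3-turbo on GPU (46.1x) remains faster than either CPU run.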

Summary

| Metric | Result |
| --- | --- |
| Dataset | 184 long-form Chinese audio files, 11,539 s total (192.3 min). |
| GPU | NVIDIA H100 80GB HBM3. |
| Best GPU speed | SenseVoice-Small: 169.6x realtime in the full benchmark, 211.8x in the initial run. |
| Best CPU speed | SenseVoice-Small: 17.2x realtime; Paraformer-Large: 15.6x realtime. |
| Baseline | OpenAI Whisper-large-v3: 13.4x realtime on GPU. |

Results

| Model | Device | RTF | Speed | CER | Notes |
| --- | --- | --- | --- | --- | --- |
| SenseVoice-Small | GPU | 0.005896 | 169.6x | 8.92% | ASR + language / emotion / event tags; CER after tag stripping. |
| Paraformer-Large | GPU | 0.008359 | 119.6x | 12.71% | Fast non-autoregressive Chinese ASR with VAD/punctuation pipeline. |
| Fun-ASR-Nano | GPU | 0.058803 | 17.0x | 10.56% | LLM-based 31-language ASR with timestamps and hotwords. |
| GLM-ASR-Nano | GPU | 0.026974 | 37.1x | 31.07% | LLM-based multilingual ASR. |
| Whisper-large-v3-turbo (OpenAI) | GPU | 0.021708 | 46.1x | 21.71% | OpenAI Whisper implementation. |
| Whisper-large-v3 (OpenAI) | GPU | 0.074694 | 13.4x | 20.02% | Baseline for large Whisper quality. |
| SenseVoice-Small | CPU | 0.057988 | 17.2x | 5.14% | CPU run from the remaining benchmark script. |
| Paraformer-Large | CPU | 0.064056 | 15.6x | 9.30% | CPU viable for batch jobs. |
| Fun-ASR-Nano | CPU | 0.274318 | 3.6x | 7.60% | LLM-based model is heavier but still above realtime. |

Methodology

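Measurements were collected with the benchmark scripts in the workspace on the 184 audio files. RTF (real-time factor) is total inference time divided by total audio duration, so lower is better; speed is 1 / RTF, expressed as a multiple of realtime. CER (character error rate) is computed after model-specific text cleanup, most notably stripping the language / emotion / event tags that SenseVoice emits.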
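The RTF and speed definitions above can be sketched in a few lines. This is only an illustration of the arithmetic, not the benchmark scripts themselves; the helper names (`rtf`, `speed`, `wall_clock_s`) are hypothetical, and the RTF values are copied from the results table.

```python
def rtf(total_inference_s: float, total_audio_s: float) -> float:
    """Real-time factor: seconds of compute per second of audio."""
    if total_audio_s <= 0:
        raise ValueError("total audio duration must be positive")
    return total_inference_s / total_audio_s


def speed(rtf_value: float) -> float:
    """Speed as a multiple of realtime, the reciprocal of RTF."""
    return 1.0 / rtf_value


def wall_clock_s(rtf_value: float, audio_s: float) -> float:
    """Estimated processing time for a given amount of audio."""
    return rtf_value * audio_s


TOTAL_AUDIO_S = 11_539  # dataset duration from the summary

# RTF values copied from the results table.
for name, r in [
    ("SenseVoice-Small CPU", 0.057988),
    ("Whisper-large-v3 GPU", 0.074694),
]:
    minutes = wall_clock_s(r, TOTAL_AUDIO_S) / 60
    print(f"{name}: {speed(r):.1f}x realtime, about {minutes:.1f} min for the full set")
```

Multiplying RTF by the audio duration gives a quick capacity estimate: it turns a table entry into the wall-clock time a batch job would need on comparable hardware.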

python benchmark/run_full_benchmark.py
python benchmark/run_remaining.py
python benchmark/fix_sensevoice_cer.py

Use these numbers as practical guidance, not a universal leaderboard: hardware, batch size, audio length, decoding options, and text normalization all affect results.
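To make the effect of text normalization concrete, here is a minimal CER sketch. It is not the cleanup used by the benchmark scripts: the tag pattern assumes SenseVoice-style markers of the form `<|...|>`, the normalization rule (drop punctuation and whitespace, lowercase) is an assumption, and the function names are hypothetical.

```python
import re
import unicodedata

# Assumed tag shape, e.g. <|zh|><|NEUTRAL|><|Speech|>
TAG_RE = re.compile(r"<\|[^|]*\|>")


def normalize(text: str) -> str:
    """Strip tags, drop punctuation/whitespace/control characters, lowercase."""
    text = TAG_RE.sub("", text)
    kept = [
        ch for ch in text
        if not unicodedata.category(ch).startswith(("P", "Z", "C"))
    ]
    return "".join(kept).lower()


def edit_distance(ref: str, hyp: str) -> int:
    """Character-level Levenshtein distance using two rolling rows."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = cur
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate on normalized text."""
    ref_n, hyp_n = normalize(ref), normalize(hyp)
    if not ref_n:
        raise ValueError("reference is empty after normalization")
    return edit_distance(ref_n, hyp_n) / len(ref_n)


reference = "今天天气很好。"
hypothesis = "<|zh|><|NEUTRAL|><|Speech|>今天天汽很好"

print(f"CER with cleanup:    {cer(reference, hypothesis):.3f}")
print(f"CER without cleanup: {edit_distance(reference, hypothesis) / len(reference):.3f}")
```

Scoring the raw hypothesis counts every tag character as an insertion, which is why tag stripping has to happen before CER is compared across models.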

How to Choose

| Need | Recommended model |
| --- | --- |
| Fastest production transcription | SenseVoice-Small or Paraformer-Large. |
| CPU batch transcription | SenseVoice-Small first; Paraformer-Large for Chinese production pipelines. |
| Multilingual LLM-style recognition with timestamps | Fun-ASR-Nano, and use vLLM for higher LLM decoding throughput. |
| OpenAI-compatible local endpoint | funasr-server with model alias sensevoice, paraformer, or fun-asr-nano. |
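For the OpenAI-compatible option, a client request might look like the sketch below. It rests on several assumptions that this page does not confirm: that the server exposes the standard OpenAI transcription path (`/v1/audio/transcriptions`), that it accepts a multipart form with `model` and `file` fields, and that it listens on `localhost:8000`, which is a placeholder. The helper name is hypothetical. Only the model aliases come from the table above.

```python
import urllib.request
import uuid

BASE_URL = "http://localhost:8000/v1"  # hypothetical host and port


def build_transcription_request(audio_bytes: bytes, filename: str,
                                model: str = "sensevoice") -> urllib.request.Request:
    """Build a multipart POST in the shape of OpenAI's audio transcription API."""
    boundary = uuid.uuid4().hex
    parts = [
        # Form field carrying the model alias.
        (f"--{boundary}\r\n"
         'Content-Disposition: form-data; name="model"\r\n\r\n'
         f"{model}\r\n").encode(),
        # Header of the file part, followed by the raw audio bytes.
        (f"--{boundary}\r\n"
         f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
         "Content-Type: application/octet-stream\r\n\r\n").encode(),
        audio_bytes,
        f"\r\n--{boundary}--\r\n".encode(),
    ]
    return urllib.request.Request(
        f"{BASE_URL}/audio/transcriptions",
        data=b"".join(parts),
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )


# Placeholder bytes stand in for a real audio file here.
req = build_transcription_request(b"RIFF....", "meeting.wav", model="paraformer")
print(req.get_method(), req.full_url)

# To send it against a running server:
# with urllib.request.urlopen(req) as resp:
#     print(resp.read().decode())
```

Building the request separately from sending it keeps the sketch runnable without a server; switching models is then just a matter of passing a different alias.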