Benchmark
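Speed and accuracy measurements for long-form ASR workloads. The headline result: on this dataset, FunASR models running on CPU (SenseVoice-Small at 17.2x realtime, Paraformer-Large at 15.6x) are faster than OpenAI Whisper-large-v3 running on GPU (13.4x), which makes CPU inference a realistic option for production transcription pipelines.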
Summary
| Metric | Result |
|---|---|
| Dataset | 184 long-form Chinese audio files, 11,539 s total (192.3 min). |
| GPU | NVIDIA H100 80GB HBM3. |
| Best GPU speed | SenseVoice-Small: 169.6x realtime in the full benchmark, 211.8x in the initial run. |
| Best CPU speed | SenseVoice-Small: 17.2x realtime; Paraformer-Large: 15.6x realtime. |
| Baseline | OpenAI Whisper-large-v3: 13.4x realtime on GPU. |
Results
| Model | Device | RTF | Speed | CER | Notes |
|---|---|---|---|---|---|
| SenseVoice-Small | GPU | 0.005896 | 169.6x | 8.92% | ASR + language / emotion / event tags; CER after tag stripping. |
| Paraformer-Large | GPU | 0.008359 | 119.6x | 12.71% | Fast non-autoregressive Chinese ASR with VAD/punctuation pipeline. |
| Fun-ASR-Nano | GPU | 0.058803 | 17.0x | 10.56% | LLM-based 31-language ASR with timestamps and hotwords. |
| GLM-ASR-Nano | GPU | 0.026974 | 37.1x | 31.07% | LLM-based multilingual ASR. |
| Whisper-large-v3-turbo (OpenAI) | GPU | 0.021708 | 46.1x | 21.71% | OpenAI Whisper implementation. |
| Whisper-large-v3 (OpenAI) | GPU | 0.074694 | 13.4x | 20.02% | Baseline for large Whisper quality. |
| SenseVoice-Small | CPU | 0.057988 | 17.2x | 5.14% | CPU run from `benchmark/run_remaining.py`. |
| Paraformer-Large | CPU | 0.064056 | 15.6x | 9.30% | CPU viable for batch jobs. |
| Fun-ASR-Nano | CPU | 0.274318 | 3.6x | 7.60% | LLM-based model is heavier but still above realtime. |
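The RTF and Speed columns carry the same information: speed is the reciprocal of RTF, and multiplying RTF by the dataset's 11,539 s of audio gives the implied wall-clock inference time. The sketch below reproduces that arithmetic for a few rows of the table. The helper names (`speed_from_rtf`, `wall_clock_seconds`) are illustrative and are not part of the benchmark scripts.

```python
# Total audio duration of the benchmark set, taken from the Summary table.
TOTAL_AUDIO_SECONDS = 11_539

# (model, device) -> RTF, copied from the Results table.
RTF = {
    ("SenseVoice-Small", "GPU"): 0.005896,
    ("Whisper-large-v3 (OpenAI)", "GPU"): 0.074694,
    ("SenseVoice-Small", "CPU"): 0.057988,
    ("Paraformer-Large", "CPU"): 0.064056,
}


def speed_from_rtf(rtf: float) -> float:
    """Realtime multiple: seconds of audio processed per second of compute."""
    return 1.0 / rtf


def wall_clock_seconds(rtf: float, audio_seconds: float = TOTAL_AUDIO_SECONDS) -> float:
    """Inference time implied by an RTF for a given amount of audio."""
    return rtf * audio_seconds


if __name__ == "__main__":
    for (model, device), rtf in RTF.items():
        print(
            f"{model:28s} {device}  "
            f"{speed_from_rtf(rtf):6.1f}x  {wall_clock_seconds(rtf):7.1f} s"
        )
```

Read this way, SenseVoice-Small on GPU needs roughly 68 s for the whole 192-minute set, while Whisper-large-v3 on GPU needs a little over 14 minutes, which is longer than either FunASR CPU run listed here.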
Methodology
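Measurements were collected with the benchmark scripts in the workspace on all 184 audio files. RTF (real-time factor) is total inference time divided by total audio duration, so lower is better; speed is 1 / RTF, the realtime multiple. CER (character error rate) is computed after model-specific text cleanup, most notably stripping the language, emotion, and event tags that SenseVoice emits.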
```bash
python benchmark/run_full_benchmark.py
python benchmark/run_remaining.py
python benchmark/fix_sensevoice_cer.py
```
Use these numbers as practical guidance, not a universal leaderboard: hardware, batch size, audio length, decoding options, and text normalization all affect results.
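Because text normalization has such a large effect on CER, it helps to see what "CER after tag stripping" can look like in practice. The following is a minimal sketch, not the code in `benchmark/fix_sensevoice_cer.py`. It assumes SenseVoice tags have the form `<|zh|>`, `<|NEUTRAL|>`, `<|Speech|>`, and it assumes a cleanup step that drops punctuation and whitespace before computing the character-level edit distance. All function names are hypothetical.

```python
import re

# Assumption: SenseVoice tags look like <|zh|>, <|NEUTRAL|>, <|Speech|>.
TAG_PATTERN = re.compile(r"<\|[^|]*\|>")


def strip_tags(text: str) -> str:
    """Remove <|...|> style tags from a transcript."""
    return TAG_PATTERN.sub("", text)


def normalize(text: str) -> str:
    """Illustrative cleanup: drop tags, then punctuation and whitespace."""
    return "".join(ch for ch in strip_tags(text) if ch.isalnum())


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, using two DP rows."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(
                min(
                    prev[j] + 1,             # deletion
                    curr[j - 1] + 1,         # insertion
                    prev[j - 1] + (r != h),  # substitution or match
                )
            )
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance / reference length, after cleanup."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("reference is empty after normalization")
    return edit_distance(ref, hyp) / len(ref)
```

With this cleanup, a hypothesis such as `<|zh|><|NEUTRAL|><|Speech|>今天天气很好。` scores a CER of 0 against the reference `今天天气很好`. Without tag stripping, every tag character would count as an insertion error. A corpus-level figure would normally sum edit distances and reference lengths across all files before dividing.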
How to Choose
| Need | Recommended model |
|---|---|
| Fastest production transcription | SenseVoice-Small or Paraformer-Large. |
| CPU batch transcription | SenseVoice-Small first; Paraformer-Large for Chinese production pipelines. |
| Multilingual LLM-style recognition with timestamps | Fun-ASR-Nano; use vLLM for higher LLM decoding throughput. |
| OpenAI-compatible local endpoint | `funasr-server` with model alias `sensevoice`, `paraformer`, or `fun-asr-nano`. |