Real-time Voice Dialogue Service
A WebSocket-based real-time speech-to-text service powered by the SenseVoice model, supporting multilingual recognition and Cantonese (Hong Kong Traditional Chinese) post-processing.
https://sts-vert-one.vercel.app
Project Overview
SenseVoice Real-time Speech-to-Text Service
One-Sentence Overview
A real-time speech-to-text service built on the SenseVoice model using FastAPI + WebSocket. It enables streaming multilingual speech recognition and provides dedicated post-processing for Cantonese in Hong Kong Traditional Chinese.
Project Scope
This project targets business scenarios that need real-time speech-to-text and provides a complete STT (Speech-to-Text) solution. It runs a locally deployed SenseVoice ONNX model together with VAD (Voice Activity Detection) and a post-processing pipeline, aiming for low-latency, accurate transcription.
Key Features
1. Real-time Streaming Speech Recognition
Audio is streamed over WebSocket and transcribed as it arrives, so text appears while the user is still speaking rather than after the full recording has been uploaded.
Core capabilities:
- Dual-trigger recognition: interim inference runs every 0.5 seconds, and a more precise segment-level pass runs when VAD detects the end of speech
- VAD (Voice Activity Detection): Intelligently identifies speech onset and offset
- Automatic reconnection: Seamless recovery after network interruptions
- Audio quality monitoring: Real-time volume analysis and quality feedback
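The dual-trigger behaviour listed above can be sketched as follows. This is a minimal illustration, not the project's actual implementation: the `DualTriggerSession` class, the `recognize` callback, the 16 kHz sample rate, and the `speech_ended` flag (standing in for the VAD decision) are all assumptions made for the example. Only the 0.5-second interim interval comes from the feature list.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

SAMPLE_RATE = 16_000        # assumed input rate in samples per second
INTERIM_INTERVAL_S = 0.5    # interim inference period from the feature list


@dataclass
class DualTriggerSession:
    """Buffers audio and decides when to run interim or final recognition."""

    recognize: Callable[[List[float]], str]   # hypothetical ASR callback
    buffer: List[float] = field(default_factory=list)
    samples_since_interim: int = 0

    def feed(self, chunk: List[float], speech_ended: bool) -> List[Tuple[str, str]]:
        """Add one audio chunk and return the (kind, text) events it produced."""
        events: List[Tuple[str, str]] = []
        self.buffer.extend(chunk)
        self.samples_since_interim += len(chunk)

        if speech_ended:
            # Trigger 2: VAD reported end of speech, so decode the whole segment.
            events.append(("final", self.recognize(self.buffer)))
            self.buffer = []
            self.samples_since_interim = 0
        elif self.samples_since_interim >= int(SAMPLE_RATE * INTERIM_INTERVAL_S):
            # Trigger 1: 0.5 s of new audio has arrived, so emit an interim result.
            events.append(("interim", self.recognize(self.buffer)))
            self.samples_since_interim = 0
        return events
```

With a stand-in recognizer such as `lambda audio: f"{len(audio)} samples"` and 0.25-second chunks, every second call to `feed` yields an interim event, and the call that carries `speech_ended=True` yields a single final event and clears the buffer.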
2. Multilingual Support
Supports recognition across multiple languages and dialects to serve diverse user bases.
Supported languages:
- Mandarin Chinese
- English
- Japanese
- Korean
- Cantonese (with Hong Kong Traditional Chinese post-processing)
3. Cantonese (Hong Kong Traditional Chinese) Post-processing
When Cantonese is selected, the server automatically applies a comprehensive post-processing pipeline to ensure output conforms to Hong Kong’s orthographic conventions.
Processing workflow:
- OpenCC s2hk conversion: Simplified Chinese → Hong Kong Traditional Chinese
- Cantonese proper-noun dictionary overrides (multiple dictionaries can be stacked in layers)
- Lightweight conversion applied to real-time chunks; full post-processing applied to final segment output
- Inverse Text Normalization (ITN): Converts spoken-form numbers, dates, times, and other structured expressions into their standard written form
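A rough sketch of how layered dictionaries and the two processing depths might fit together is shown below. Everything here is illustrative: `apply_dictionary`, `postprocess`, `toy_convert`, and the dictionary entries are hypothetical names and data, and the tiny character table only stands in for the OpenCC s2hk step so that the example has no external dependencies. It also assumes that the lightweight path for real-time chunks means script conversion only, which the source does not spell out.

```python
from collections import ChainMap
from typing import Callable, Mapping, Sequence


def apply_dictionary(text: str, terms: Mapping[str, str]) -> str:
    """Replace dictionary terms, preferring the longest match at each position."""
    if not terms:
        return text
    max_len = max(len(key) for key in terms)
    out, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in terms:
                out.append(terms[piece])
                i += size
                break
        else:
            # No term starts here, so copy the character through unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)


def postprocess(text: str, convert: Callable[[str], str],
                layers: Sequence[Mapping[str, str]], final: bool) -> str:
    """Interim chunks get conversion only; final segments also get dictionaries."""
    converted = convert(text)
    if not final:
        return converted
    # ChainMap looks layers up left to right, so earlier layers override later ones.
    return apply_dictionary(converted, ChainMap(*layers))


# Toy stand-in for the Simplified -> Hong Kong Traditional step (assumption:
# a real deployment would run OpenCC with its s2hk configuration here).
_TOY_TABLE = str.maketrans({"买": "買", "软": "軟"})


def toy_convert(text: str) -> str:
    return text.translate(_TOY_TABLE)


BASE_TERMS = {"尖沙嘴": "尖沙咀"}       # hypothetical base-layer entry
CUSTOM_TERMS = {"軟件": "應用程式"}     # hypothetical project-specific override

interim = postprocess("我在尖沙嘴买软件", toy_convert, [CUSTOM_TERMS, BASE_TERMS], final=False)
full = postprocess("我在尖沙嘴买软件", toy_convert, [CUSTOM_TERMS, BASE_TERMS], final=True)
print(interim)  # 我在尖沙嘴買軟件
print(full)     # 我在尖沙咀買應用程式
```

Putting the custom layer first in the list is the design choice that makes stacking useful: a deployment can ship a shared base dictionary and still override individual entries without editing it.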
4. MiniMax TTS Proxy
The service also exposes a WebSocket proxy endpoint for MiniMax’s text-to-speech API, so speech recognition and synthesis can be served from a single deployment.
Proxy features:
- Local WebSocket endpoint that transparently forwards requests to MiniMax’s API
- Supports local injection of API keys
- Upstream events passed through unchanged (e.g., audio chunks, status events)
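The two ideas behind the proxy, key injection and unchanged pass-through, can be sketched with the standard library alone. This is a hedged illustration rather than the service's code: the function names, the `MINIMAX_API_KEY` environment variable, and the `Authorization: Bearer` header format are assumptions, and the fake upstream generator replaces a real WebSocket connection.

```python
import asyncio
import os
from typing import AsyncIterator, Awaitable, Callable, Dict, List, Optional


def upstream_headers(client_key: Optional[str],
                     env_var: str = "MINIMAX_API_KEY") -> Dict[str, str]:
    """Use the client's key if one was sent, otherwise inject the local one."""
    key = client_key or os.environ.get(env_var)
    if not key:
        raise ValueError("no API key supplied by the client or configured locally")
    # Assumption: the upstream API accepts a bearer token in this header.
    return {"Authorization": f"Bearer {key}"}


async def relay(source: AsyncIterator[bytes],
                send: Callable[[bytes], Awaitable[None]]) -> int:
    """Forward every message unchanged and return how many were relayed."""
    count = 0
    async for message in source:
        await send(message)
        count += 1
    return count


async def demo() -> List[bytes]:
    """Relay three fake upstream events into a list standing in for the client."""

    async def fake_upstream() -> AsyncIterator[bytes]:
        for event in (b"status:started", b"audio-chunk-1", b"audio-chunk-2"):
            yield event

    received: List[bytes] = []

    async def send_to_client(message: bytes) -> None:
        received.append(message)

    await relay(fake_upstream(), send_to_client)
    return received


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

A full proxy would run two such relays concurrently, one for each direction, for example with `asyncio.gather`, and close both sockets when either side disconnects.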
Technical Architecture
| Component | Technology | Description |
|---|---|---|
| Backend Framework | FastAPI | High-performance asynchronous Python web framework |
| Real-time Communication | WebSocket | Bidirectional streaming audio transport |
| Speech Model | SenseVoice ONNX | High-accuracy multilingual ASR model |
| VAD Model | Silero VAD | Lightweight, robust voice activity detection |
| Traditional Chinese Conversion | OpenCC | Open-source Simplified/Traditional Chinese conversion library (s2hk configuration) |
| Dictionary System | CC-Canto + Custom Dictionaries | Comprehensive Cantonese proper noun lexicon |
Use Cases
Real-time Meeting Transcription
Provides live transcription for meetings, training sessions, lectures, and similar collaborative settings, including mixed-language speech.
Cantonese Customer Support Systems
Serves Hong Kong–based customer service applications by converting spoken Cantonese into standardized Hong Kong Traditional Chinese text.
Voice-Interactive Applications
Acts as the underlying STT engine for voice assistants, voice input tools, and other conversational AI products.
Accessibility Assistive Tools
Delivers real-time speech-to-text for deaf and hard-of-hearing users, improving access to spoken information.
Product Value
- On-premises speech recognition: the SenseVoice model runs locally, so audio sent for transcription stays on your infrastructure, which supports privacy and regulatory compliance (the optional MiniMax TTS proxy does forward synthesis requests to MiniMax’s API)
- Low-latency streaming recognition: text appears while the user is still speaking
- Dedicated Cantonese (Hong Kong Traditional Chinese) post-processing, addressing an unmet market need
- Dual-trigger recognition balances responsiveness (interim results every 0.5 seconds) with accuracy (segment-level results at the end of speech)
- Modular, well-documented architecture designed for integration into existing systems