Real-time Voice Dialogue Service

A WebSocket-based real-time speech-to-text service powered by the SenseVoice model, supporting multilingual recognition and Cantonese (Hong Kong Traditional Chinese) post-processing.

https://sts-vert-one.vercel.app

Speech

WebSocket

Real-time

Project Overview

SenseVoice Real-time Speech-to-Text Service

One-Sentence Overview

A real-time speech-to-text service built on the SenseVoice model using FastAPI + WebSocket. It enables streaming multilingual speech recognition and provides dedicated post-processing for Cantonese in Hong Kong Traditional Chinese.

Project Scope

This project targets business scenarios that need real-time speech-to-text (STT). It runs a locally deployed SenseVoice ONNX model together with VAD (Voice Activity Detection) and a post-processing pipeline, aiming for low-latency, accurate transcription.

Key Features

1. Real-time Streaming Speech Recognition

Audio is streamed over WebSocket and transcribed as it arrives: text appears while the user is still speaking, with no need to wait for the full recording.

Core capabilities:

Dual-trigger recognition: Interim inference every 0.5 seconds, plus a more precise segment-level pass when VAD detects the end of speech
VAD (Voice Activity Detection): Detects speech onset and offset
Automatic reconnection: The connection is re-established after network interruptions
Audio quality monitoring: Real-time volume analysis and quality feedback
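
The dual-trigger behaviour described above can be sketched as a small state machine. This is a minimal illustration, not the service's actual code: the class name DualTrigger, the 16 kHz sample rate, and the event names "interim" and "final" are assumptions made for the example. Only the 0.5-second interval and the VAD-triggered final pass come from the feature list.

```python
from dataclasses import dataclass
from typing import List

SAMPLE_RATE = 16_000        # assumed input sample rate in Hz (hypothetical)
INTERIM_INTERVAL_S = 0.5    # interim inference interval from the feature list


@dataclass
class DualTrigger:
    """Decides when to run interim and final recognition on a speech segment."""

    samples_since_interim: int = 0
    in_speech: bool = False

    def feed(self, n_samples: int, is_speech: bool) -> List[str]:
        """Consume one audio chunk and its VAD decision; return triggered events."""
        events: List[str] = []
        if is_speech:
            self.in_speech = True
            self.samples_since_interim += n_samples
            # Trigger 1: enough new audio has accumulated for an interim pass.
            if self.samples_since_interim >= INTERIM_INTERVAL_S * SAMPLE_RATE:
                events.append("interim")
                self.samples_since_interim = 0
        elif self.in_speech:
            # Trigger 2: VAD reports the end of speech, so finalize the segment.
            events.append("final")
            self.in_speech = False
            self.samples_since_interim = 0
        return events
```

Feeding five 100 ms speech chunks yields one "interim" event on the fifth chunk, and the first non-speech chunk afterwards yields "final". A production VAD would normally add a hangover period before declaring the end of speech; that is omitted here for brevity.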

2. Multilingual Support

Supports recognition across multiple languages and dialects to serve diverse user bases.

Supported languages:

Mandarin Chinese
English
Japanese
Korean
Cantonese (with Hong Kong Traditional Chinese post-processing)

3. Cantonese (Hong Kong Traditional Chinese) Post-processing

When Cantonese is selected, the server automatically applies a comprehensive post-processing pipeline to ensure output conforms to Hong Kong’s orthographic conventions.

Processing workflow:

OpenCC s2hk conversion: Simplified Chinese → Hong Kong Traditional Chinese
Cantonese proper noun dictionary coverage (supports multi-layer dictionary stacking)
Lightweight conversion applied to real-time chunks; full post-processing applied to final segment output
Inverse Text Normalization (ITN): Converts spoken-form numbers, dates, times, and other structured expressions into their standard written form
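
A rough sketch of how dictionary stacking and the lightweight/full split might fit together is shown below. All function names are hypothetical, and the conversion step is passed in as a plain callable so the example stays dependency-free; in a real deployment that callable could be the convert method of an OpenCC converter configured with s2hk. The greedy longest-match replacement is one common approach and is an assumption here, not a description of the project's implementation.

```python
from typing import Callable, Dict, Iterable


def stack_dictionaries(layers: Iterable[Dict[str, str]]) -> Dict[str, str]:
    """Merge dictionary layers; entries in later layers override earlier ones."""
    merged: Dict[str, str] = {}
    for layer in layers:
        merged.update(layer)
    return merged


def apply_dictionary(text: str, mapping: Dict[str, str]) -> str:
    """Replace terms by scanning left to right and preferring the longest match."""
    if not mapping:
        return text
    max_len = max(len(key) for key in mapping)
    out = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in mapping:
                out.append(mapping[piece])
                i += size
                break
        else:
            # No dictionary entry starts here; keep the character as-is.
            out.append(text[i])
            i += 1
    return "".join(out)


def postprocess(
    text: str,
    convert: Callable[[str], str],
    mapping: Dict[str, str],
    final: bool,
) -> str:
    """Lightweight path: conversion only. Full path: conversion plus dictionary."""
    converted = convert(text)
    return apply_dictionary(converted, mapping) if final else converted


# Toy ASCII entries stand in for real Cantonese dictionary data.
base_layer = {"mtr": "MTR", "hk": "HK"}
custom_layer = {"hk": "Hong Kong", "hkust": "HKUST"}
lexicon = stack_dictionaries([base_layer, custom_layer])
```

With the toy data above, a final segment gets the full dictionary pass while a real-time chunk only goes through the conversion callable, which keeps the per-chunk cost low.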

4. MiniMax TTS Proxy

The service integrates a WebSocket proxy endpoint for MiniMax’s text-to-speech API, so speech recognition and synthesis can run within a single deployment.

Proxy features:

Local WebSocket endpoint that transparently forwards requests to MiniMax’s API
Supports local injection of API keys
Upstream events passed through unchanged (e.g., audio chunks, status events)
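
The proxy behaviour listed above can be illustrated with two small helpers. This is a sketch under assumptions: the function names are hypothetical, the Bearer-style Authorization header is assumed rather than taken from the project, and the rule that a client-supplied key takes precedence over the locally configured one is a design choice made for the example. The relay is written against generic async callables instead of a specific WebSocket library.

```python
import asyncio
from typing import AsyncIterator, Awaitable, Callable, Dict, Optional


def inject_api_key(headers: Dict[str, str], api_key: Optional[str]) -> Dict[str, str]:
    """Return a copy of the headers with a locally configured key added if absent."""
    out = dict(headers)
    if api_key and "Authorization" not in out:
        # Assumed header format; check the upstream API documentation.
        out["Authorization"] = f"Bearer {api_key}"
    return out


async def relay(
    source: AsyncIterator[bytes],
    send: Callable[[bytes], Awaitable[None]],
) -> int:
    """Forward every upstream message unchanged and return how many were sent."""
    count = 0
    async for message in source:
        await send(message)
        count += 1
    return count
```

In a full proxy, two such relays would typically run concurrently, one per direction, and both would be cancelled when either side closes the connection.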

Technical Architecture

Component	Technology	Description
Backend Framework	FastAPI	High-performance asynchronous Python web framework
Real-time Communication	WebSocket	Bidirectional streaming audio transport
Speech Model	SenseVoice ONNX	High-accuracy multilingual ASR model
VAD Model	Silero VAD	Lightweight, robust voice activity detection
Traditional Chinese Conversion	OpenCC	Industry-standard simplified–traditional conversion engine
Dictionary System	CC-Canto + Custom Dictionaries	Comprehensive Cantonese proper noun lexicon

Use Cases

Real-time Meeting Transcription

Provides live transcription for meetings, trainings, lectures, and other collaborative settings, including mixed-language speech.

Cantonese Customer Support Systems

Serves Hong Kong–based customer service applications by converting spoken Cantonese into standardized Hong Kong Traditional Chinese text.

Voice-Interactive Applications

Acts as the underlying STT engine for voice assistants, voice input tools, and other conversational AI products.

Accessibility Assistive Tools

Delivers real-time speech-to-text support for deaf and hard-of-hearing users, improving their access to spoken information.

Product Value

On-premises recognition: The SenseVoice model runs locally, so audio sent for transcription stays on your infrastructure, which helps with privacy and compliance requirements
Low-latency streaming recognition: Interim results every 0.5 seconds keep the interface responsive
Dedicated Cantonese (Hong Kong Traditional Chinese) post-processing: Output follows Hong Kong orthographic conventions
Dual-trigger recognition: Fast interim results for responsiveness, segment-level final results for accuracy
Modular, documented architecture: Designed for integration into existing systems
