
Real-time Voice Dialogue Service

A WebSocket-based real-time speech-to-text service powered by the SenseVoice model, supporting multilingual recognition and Cantonese (Hong Kong Traditional Chinese) post-processing.

https://sts-vert-one.vercel.app

AI
Speech
WebSocket
Real-time


Project Overview

SenseVoice Real-time Speech-to-Text Service

One-Sentence Overview

A real-time speech-to-text service built on the SenseVoice model using FastAPI + WebSocket. It enables streaming multilingual speech recognition and provides dedicated post-processing for Cantonese in Hong Kong Traditional Chinese.

Project Scope

This project targets business scenarios requiring real-time speech-to-text capabilities, delivering a complete STT (Speech-to-Text) solution. It leverages a locally deployed SenseVoice ONNX model, integrated with VAD (Voice Activity Detection) and an intelligent post-processing pipeline to deliver low-latency, high-accuracy transcription.

Key Features

1. Real-time Streaming Speech Recognition

Audio is streamed over WebSocket and transcribed in real time: text appears while the user is still speaking, with no need to wait for a complete recording.

Core capabilities:

  • Dual-trigger recognition: Real-time inference every 0.5 seconds + precise segment-level recognition upon VAD-detected speech end
  • VAD (Voice Activity Detection): Intelligently identifies speech onset and offset
  • Automatic reconnection: Seamless recovery after network interruptions
  • Audio quality monitoring: Real-time volume analysis and quality feedback
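The dual-trigger idea can be sketched in a few lines. This is a minimal illustration, not the project's implementation: the class name `DualTriggerSegmenter`, the `transcribe` callable standing in for the speech model, and the 16 kHz sample rate are all assumptions made for the example. Only the 0.5-second interval and the "final result when VAD reports end of speech" rule come from the description above.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

SAMPLE_RATE = 16_000        # assumed input sample rate (samples per second)
INTERIM_INTERVAL_S = 0.5    # from the text: interim inference every 0.5 s


@dataclass
class DualTriggerSegmenter:
    """Buffers one utterance and decides when recognition should run.

    `transcribe` is a hypothetical stand-in for the ASR model call.
    """
    transcribe: Callable[[List[float]], str]
    buffer: List[float] = field(default_factory=list)
    samples_since_interim: int = 0

    def feed(self, chunk: List[float], speech_ended: bool) -> List[Tuple[str, str]]:
        """Add a chunk and return ("interim" | "final", text) events."""
        events: List[Tuple[str, str]] = []
        self.buffer.extend(chunk)
        self.samples_since_interim += len(chunk)

        if speech_ended:
            # Trigger 2: VAD reported end of speech, so recognise the whole segment.
            if self.buffer:
                events.append(("final", self.transcribe(self.buffer)))
            self.buffer = []
            self.samples_since_interim = 0
        elif self.samples_since_interim >= int(SAMPLE_RATE * INTERIM_INTERVAL_S):
            # Trigger 1: 0.5 s of new audio has arrived, so emit an interim result.
            events.append(("interim", self.transcribe(self.buffer)))
            self.samples_since_interim = 0
        return events


# Usage with a fake model that just reports how much audio it received.
seg = DualTriggerSegmenter(transcribe=lambda audio: f"{len(audio)} samples")
quarter_second = [0.0] * 4000
print(seg.feed(quarter_second, speech_ended=False))  # []
print(seg.feed(quarter_second, speech_ended=False))  # [('interim', '8000 samples')]
print(seg.feed(quarter_second, speech_ended=True))   # [('final', '12000 samples')]
```

Interim results re-run recognition on everything buffered so far, which keeps the display responsive; the final pass sees the complete segment and can replace the interim text.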

2. Multilingual Support

Supports recognition across multiple languages and dialects to serve diverse user bases.

Supported languages:

  • Mandarin Chinese
  • English
  • Japanese
  • Korean
  • Cantonese (with Hong Kong Traditional Chinese post-processing)

3. Cantonese (Hong Kong Traditional Chinese) Post-processing

When Cantonese is selected, the server automatically applies a comprehensive post-processing pipeline to ensure output conforms to Hong Kong’s orthographic conventions.

Processing workflow:

  • OpenCC s2hk conversion: Simplified Chinese → Hong Kong Traditional Chinese
  • Cantonese proper-noun dictionary overrides (multiple dictionary layers can be stacked)
  • Lightweight conversion applied to real-time chunks; full post-processing applied to final segment output
  • Inverse Text Normalization (ITN): Converts spoken-form numbers, dates, times, and other structured expressions into their standard written form
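The layering and the lightweight/full split can be illustrated with a short sketch. Everything named here is hypothetical: `merge_layers`, `apply_dictionary`, `postprocess`, the toy converter, and the dictionary entries are made up for the example. It also assumes that "lightweight" means script conversion only and that later dictionary layers take precedence over earlier ones. In a real deployment `convert` would presumably wrap OpenCC's s2hk configuration; a tiny character table stands in for it here so the example runs without extra packages.

```python
import re
from typing import Callable, Dict, Iterable

Converter = Callable[[str], str]


def merge_layers(layers: Iterable[Dict[str, str]]) -> Dict[str, str]:
    """Stack dictionaries; on a key clash the later layer wins (assumption)."""
    merged: Dict[str, str] = {}
    for layer in layers:
        merged.update(layer)
    return merged


def apply_dictionary(text: str, mapping: Dict[str, str]) -> str:
    """Replace terms in one pass, preferring longer terms over their substrings."""
    terms = sorted((t for t in mapping if t), key=len, reverse=True)
    if not terms:
        return text
    pattern = re.compile("|".join(re.escape(t) for t in terms))
    return pattern.sub(lambda m: mapping[m.group(0)], text)


def postprocess(text: str, convert: Converter, mapping: Dict[str, str], final: bool) -> str:
    """Real-time chunks get conversion only; final segments also get the dictionary."""
    text = convert(text)
    return apply_dictionary(text, mapping) if final else text


# Toy stand-in for a Simplified -> Hong Kong Traditional converter.
_TOY_TABLE = str.maketrans({"讲": "講", "广": "廣", "东": "東", "话": "話"})


def toy_convert(text: str) -> str:
    return text.translate(_TOY_TABLE)


# Hypothetical dictionary layers: a base lexicon plus a project-specific layer.
mapping = merge_layers([{"尖沙嘴": "尖沙咀"}, {"廣東話": "粵語"}])

print(postprocess("去尖沙嘴讲广东话", toy_convert, mapping, final=False))  # 去尖沙嘴講廣東話
print(postprocess("去尖沙嘴讲广东话", toy_convert, mapping, final=True))   # 去尖沙咀講粵語
```

Matching all terms in a single regular-expression pass avoids one replacement accidentally re-triggering another, which can happen with chained `str.replace` calls.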

4. MiniMax TTS Proxy

The service also exposes a WebSocket proxy endpoint for MiniMax's text-to-speech API, so speech recognition and synthesis can run from a single deployment.

Proxy features:

  • Local WebSocket endpoint that transparently forwards requests to MiniMax’s API
  • Supports local injection of API keys
  • Upstream events passed through unchanged (e.g., audio chunks, status events)

Technical Architecture

Component                      | Technology                    | Description
-------------------------------|-------------------------------|----------------------------------------------------------
Backend Framework              | FastAPI                       | High-performance asynchronous Python web framework
Real-time Communication        | WebSocket                     | Bidirectional streaming audio transport
Speech Model                   | SenseVoice ONNX               | High-accuracy multilingual ASR model
VAD Model                      | Silero VAD                    | Lightweight, robust voice activity detection
Traditional Chinese Conversion | OpenCC                        | Industry-standard simplified–traditional conversion engine
Dictionary System              | CC-Canto + Custom Dictionaries | Comprehensive Cantonese proper noun lexicon

Use Cases

Real-time Meeting Transcription

Provides live transcription for meetings, trainings, lectures, and other collaborative settings, including mixed-language speech.

Cantonese Customer Support Systems

Serves Hong Kong–based customer service applications by converting spoken Cantonese into standardized Hong Kong Traditional Chinese text.

Voice-Interactive Applications

Acts as the underlying STT engine for voice assistants, voice input tools, and other conversational AI products.

Accessibility Assistive Tools

Delivers real-time speech-to-text support for deaf and hard-of-hearing users, improving their access to spoken information.

Product Value

  • On-premises speech recognition: the SenseVoice model runs locally, so audio sent for transcription stays on your infrastructure, which supports privacy and compliance requirements
  • Low-latency streaming recognition: interim results roughly every 0.5 seconds keep the experience responsive
  • Dedicated Cantonese (Hong Kong Traditional Chinese) post-processing, aimed at a need the project considers underserved
  • Dual-trigger recognition balances responsiveness (interim results) and accuracy (segment-level final results)
  • Modular, well-documented architecture designed for integration into existing systems