WebSocket Server — Voice AI with Local Models (Chapter 11)

Delivering streaming audio over the network calls for a persistent, bidirectional connection. Plain HTTP request-response is a poor fit: every request carries its own header overhead, and the server cannot push audio back unless the client asks first. A WebSocket keeps a single connection open in both directions, which suits a continuous audio flow.

WebSocket server implementation using asyncio:

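import asyncio
import websockets
import numpy as np

# VoicePipeline is not defined in this section; it is assumed to be
# available from the pipeline code and importable here.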

async def voice_session(websocket):
    """Handle a single voice conversation session."""
    pipeline = VoicePipeline()
    audio_buffer = []
    
    async for message in websocket:
        if isinstance(message, bytes):
            # Audio data received
            audio_chunk = np.frombuffer(message, dtype=np.int16)
            audio_chunk = audio_chunk.astype(np.float32) / 32768.0
            audio_buffer.append(audio_chunk)
            
        elif isinstance(message, str):
            if message == "end_stream":
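                # Process accumulated audio; skip if nothing was received,
                # because np.concatenate raises on an empty list
                if audio_buffer:
                    full_audio = np.concatenate(audio_buffer)

                    # process_audio is a synchronous call, so run it in a
                    # worker thread to keep the event loop responsive
                    result = await asyncio.to_thread(
                        pipeline.process_audio, full_audio
                    )

                    if result is not None:
                        # Clip before scaling so full-scale samples
                        # do not overflow int16
                        pcm = np.clip(result, -1.0, 1.0) * 32767.0
                        await websocket.send(pcm.astype(np.int16).tobytes())

                audio_buffer = []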
                
            elif message == "ping":
                await websocket.send("pong")
    
    await websocket.close()

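async def main():
    # The context manager closes the server cleanly when main() exits
    async with websockets.serve(voice_session, "localhost", 8765):
        print("Voice WebSocket server running on ws://localhost:8765")
        await asyncio.Future()  # run until cancelled

if __name__ == "__main__":
    asyncio.run(main())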

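Client-side streaming requires careful chunk size management. Chunks that are too large add buffering latency, because the client must record the whole chunk before it can send it. Chunks that are too small spend a larger share of bandwidth and CPU on per-message overhead. Voice applications typically send 20-100 ms of audio per message to balance the two.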
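To make the trade-off concrete, the sketch below computes the message size for a given chunk duration and streams a PCM buffer in fixed-size chunks, followed by the "end_stream" control message the server above expects. The 16 kHz, 16-bit mono format, the helper names chunk_bytes, split_pcm, and stream_pcm, and the send callable are assumptions for illustration. In practice send would be the send method of a WebSocket client connection.

```python
import asyncio
from typing import Awaitable, Callable, Iterator, Union

SAMPLE_RATE = 16_000   # assumed: 16 kHz mono
BYTES_PER_SAMPLE = 2   # 16-bit PCM, matching the server's np.int16 decoding


def chunk_bytes(chunk_ms: int, sample_rate: int = SAMPLE_RATE) -> int:
    """Number of bytes in one chunk of `chunk_ms` milliseconds."""
    return sample_rate * chunk_ms // 1000 * BYTES_PER_SAMPLE


def split_pcm(pcm: bytes, chunk_ms: int = 20) -> Iterator[bytes]:
    """Yield consecutive chunks; the last one may be shorter."""
    size = chunk_bytes(chunk_ms)
    for start in range(0, len(pcm), size):
        yield pcm[start:start + size]


async def stream_pcm(
    pcm: bytes,
    send: Callable[[Union[bytes, str]], Awaitable[None]],
    chunk_ms: int = 20,
    realtime: bool = False,
) -> int:
    """Send audio chunks, then the "end_stream" control message.

    Returns the number of binary messages sent.
    """
    count = 0
    for chunk in split_pcm(pcm, chunk_ms):
        await send(chunk)
        count += 1
        if realtime:
            # Pace like a live microphone instead of sending at full speed
            await asyncio.sleep(chunk_ms / 1000)
    await send("end_stream")
    return count
```

Under these assumed settings a 20 ms chunk is 640 bytes and a 100 ms chunk is 3,200 bytes, so the per-message framing overhead is small in either case and the choice is mostly about latency.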

Connection state management:

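import json

import numpy as np

class SessionState:
    """Per-connection state for one voice session."""

    def __init__(self, websocket):
        self.websocket = websocket
        self.audio_buffer = []
        self.transcription_history = []
        self.is_active = False

    async def send_audio(self, audio_chunk):
        # Clip to [-1, 1] before scaling so full-scale samples
        # do not overflow int16
        pcm = np.clip(audio_chunk, -1.0, 1.0) * 32767.0
        await self.websocket.send(pcm.astype(np.int16).tobytes())

    async def send_transcript(self, text):
        # Transcripts go out as JSON text frames, separate from binary audio
        await self.websocket.send(json.dumps({"type": "transcript", "text": text}))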

Protocol design should include control messages for session management, ping/pong for connection liveness, and error reporting. Separate binary (audio) from text (control) message types.
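One way to structure such a protocol is to keep binary frames for raw audio and encode every control message as a small JSON object with a type field. The sketch below is illustrative only: the message types, the field names, and the helpers encode_control and classify are assumptions, not a standard, and they differ from the bare strings ("end_stream", "ping") used by the server listing above.

```python
import json
from typing import Any, Dict, Union

# Hypothetical set of control message types for this sketch
CONTROL_TYPES = {"start", "end_stream", "ping", "pong", "error"}


def encode_control(msg_type: str, **fields: Any) -> str:
    """Serialize a control message as a JSON text frame."""
    if msg_type not in CONTROL_TYPES:
        raise ValueError(f"unknown control type: {msg_type}")
    return json.dumps({"type": msg_type, **fields})


def classify(message: Union[bytes, str]) -> Dict[str, Any]:
    """Map an incoming frame to a dict the session loop can dispatch on.

    Binary frames are treated as audio; text frames must be control JSON.
    Malformed input yields an "error" dict instead of raising.
    """
    if isinstance(message, bytes):
        return {"type": "audio", "data": message}
    try:
        parsed = json.loads(message)
    except json.JSONDecodeError:
        return {"type": "error", "reason": "invalid JSON"}
    if not isinstance(parsed, dict):
        return {"type": "error", "reason": "unknown message type"}
    msg_type = parsed.get("type")
    if not isinstance(msg_type, str) or msg_type not in CONTROL_TYPES:
        return {"type": "error", "reason": "unknown message type"}
    return parsed
```

A session loop could call classify on every incoming message and branch on the returned type, replying with encode_control("error", reason=...) when the client sends something it does not understand.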

Common failure modes:

Connection drops mid-session: implement reconnection with cached context
Client/server audio format mismatch: validate sample rate, sample width, and channel count at session start
Slow clients causing buffer overflow: implement backpressure handling
Memory leaks from unclosed sessions: track abandoned connections and close them after a timeout
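
Two of these mitigations can be sketched with the standard library alone: a bounded buffer that caps per-session memory, and an idle timeout that ends abandoned sessions. The names BoundedAudioBuffer, receive_with_timeout, and MAX_BUFFER_BYTES, along with the limits chosen, are assumptions for illustration. The recv callable stands in for a WebSocket connection's receive coroutine.

```python
import asyncio
from typing import Awaitable, Callable, List, Optional, Union

# Assumed cap: 30 seconds of 16 kHz, 16-bit mono audio
MAX_BUFFER_BYTES = 16_000 * 2 * 30


class BoundedAudioBuffer:
    """Accumulates audio chunks but refuses to grow past a byte limit."""

    def __init__(self, max_bytes: int = MAX_BUFFER_BYTES) -> None:
        self.max_bytes = max_bytes
        self.chunks: List[bytes] = []
        self.size = 0

    def append(self, chunk: bytes) -> bool:
        """Store the chunk; return False if it would exceed the limit."""
        if self.size + len(chunk) > self.max_bytes:
            return False
        self.chunks.append(chunk)
        self.size += len(chunk)
        return True

    def drain(self) -> bytes:
        """Return all buffered audio and reset the buffer."""
        data = b"".join(self.chunks)
        self.chunks, self.size = [], 0
        return data


async def receive_with_timeout(
    recv: Callable[[], Awaitable[Union[bytes, str]]],
    idle_seconds: float,
) -> Optional[Union[bytes, str]]:
    """Wait for the next message; return None if the client stays silent."""
    try:
        return await asyncio.wait_for(recv(), timeout=idle_seconds)
    except asyncio.TimeoutError:
        return None
```

When append returns False, the session can report an error to the client and stop accepting audio rather than growing without bound. When receive_with_timeout returns None, the server can close the connection so that its buffers and pipeline state are released.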