RUNLOCALAI · v38
Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
© 2026 runlocalai.co · Independently operated
Voice AI with Local Models

11. WebSocket Server

Chapter 11 of 22 · 15 min
KEY INSIGHT

WebSocket architecture shifts complexity from request handling to connection lifecycle and state management.

Delivering streaming audio over the network calls for a persistent, bidirectional connection. Plain HTTP request-response is a poor fit: each request carries its own overhead, and the server cannot push audio back while the client is still sending. A WebSocket keeps one full-duplex connection open for the whole session, which suits continuous audio flow.

WebSocket server implementation using asyncio:

import asyncio
import websockets
import numpy as np

# VoicePipeline is not defined in this chapter; import it from your own
# pipeline module before running this server.

async def voice_session(websocket):
    """Handle a single voice conversation session."""
    pipeline = VoicePipeline()
    audio_buffer = []
    
    async for message in websocket:
        if isinstance(message, bytes):
            # Audio data received
            audio_chunk = np.frombuffer(message, dtype=np.int16)
            audio_chunk = audio_chunk.astype(np.float32) / 32768.0
            audio_buffer.append(audio_chunk)
            
        elif isinstance(message, str):
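            if message == "end_stream":
                # Nothing to process if no audio arrived before end_stream
                if not audio_buffer:
                    continue

                # Process accumulated audio, then reset the buffer
                full_audio = np.concatenate(audio_buffer)
                audio_buffer = []

                # Run the blocking pipeline call in a worker thread so the
                # event loop keeps serving other connections (Python 3.9+)
                result = await asyncio.to_thread(pipeline.process_audio, full_audio)

                if result is not None:
                    # Clip before scaling so full-scale samples do not wrap in int16
                    pcm = np.clip(result, -1.0, 1.0) * 32767
                    audio_bytes = pcm.astype(np.int16).tobytes()
                    await websocket.send(audio_bytes)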
                
            elif message == "ping":
                await websocket.send("pong")
    
    await websocket.close()

async def main():
    # The context manager closes the server cleanly when main() exits
    async with websockets.serve(voice_session, "localhost", 8765) as server:
        print("Voice WebSocket server running on ws://localhost:8765")
        await server.serve_forever()

if __name__ == "__main__":
    asyncio.run(main())

Client-side streaming requires careful chunk size management. Chunks that are too large add buffering latency, because the client must record the whole chunk before sending it. Chunks that are too small multiply per-message overhead. Typical voice chunks carry 20-100 ms of audio per message to balance the two.
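To make the trade-off concrete, here is a minimal sketch of the chunk arithmetic. It assumes 16 kHz mono audio (a common rate for speech models, not one stated in this chapter) and 16-bit samples, which matches the `np.int16` decoding in the server above. The helper names `chunk_size_bytes` and `iter_chunks` are illustrative only.

```python
SAMPLE_RATE = 16_000   # Hz; assumed, adjust to whatever your pipeline expects
BYTES_PER_SAMPLE = 2   # 16-bit PCM, matching the server's np.int16 decoding


def chunk_size_bytes(chunk_ms: int, sample_rate: int = SAMPLE_RATE) -> int:
    """Return the byte length of one mono int16 chunk lasting chunk_ms."""
    samples = (sample_rate * chunk_ms) // 1000
    return samples * BYTES_PER_SAMPLE


def iter_chunks(pcm: bytes, chunk_ms: int = 20, sample_rate: int = SAMPLE_RATE):
    """Yield fixed-duration slices of raw PCM; the final slice may be shorter."""
    size = chunk_size_bytes(chunk_ms, sample_rate)
    for start in range(0, len(pcm), size):
        yield pcm[start:start + size]


# Compare message size and message rate across the typical range
for ms in (20, 50, 100):
    print(f"{ms:>3} ms -> {chunk_size_bytes(ms)} bytes, {1000 // ms} messages/s")
```

A client would send each slice from `iter_chunks` as one binary message. Smaller values of `chunk_ms` mean more messages per second, larger values mean the server waits longer before it sees any audio.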

Connection state management:

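import json
import numpy as np

class SessionState:
    """Per-connection state for one voice session."""

    def __init__(self, websocket):
        self.websocket = websocket
        self.audio_buffer = []            # float32 chunks awaiting processing
        self.transcription_history = []   # transcripts from earlier turns
        self.is_active = False

    async def send_audio(self, audio_chunk):
        # Clip before scaling so full-scale samples do not wrap in int16
        pcm = np.clip(audio_chunk, -1.0, 1.0) * 32767
        await self.websocket.send(pcm.astype(np.int16).tobytes())

    async def send_transcript(self, text):
        # Control messages go out as JSON text frames
        await self.websocket.send(json.dumps({"type": "transcript", "text": text}))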

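The protocol should define control messages for session management, ping/pong for connection liveness, and error reporting. Keep the two message kinds separate: binary frames carry audio, text frames carry control messages. Pick one encoding for the text frames and use it everywhere; the server above matches bare strings such as "end_stream", while SessionState sends JSON.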
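One way to keep the binary/text split explicit is to classify every incoming message before acting on it. The sketch below assumes JSON-encoded control messages with a `type` field; the set of control types, the error codes, and the function names `classify` and `make_error` are hypothetical and not part of the chapter's server.

```python
import json

# Hypothetical control vocabulary; extend it to match your own protocol
CONTROL_TYPES = {"start", "end_stream", "ping"}


def make_error(code: str, detail: str) -> str:
    """Build a JSON error report to send back as a text frame."""
    return json.dumps({"type": "error", "code": code, "detail": detail})


def classify(message):
    """Return ("audio", bytes), ("control", dict), or ("error", json_text)."""
    if isinstance(message, (bytes, bytearray)):
        # int16 PCM must contain a whole number of 2-byte samples
        if len(message) % 2:
            return "error", make_error("bad_audio", "odd byte count for int16 PCM")
        return "audio", bytes(message)
    try:
        payload = json.loads(message)
    except json.JSONDecodeError:
        return "error", make_error("bad_json", "control message is not valid JSON")
    if not isinstance(payload, dict) or payload.get("type") not in CONTROL_TYPES:
        return "error", make_error("unknown_type", "unsupported control message")
    return "control", payload
```

Inside a handler, the first element of the returned tuple decides the branch: append audio to the buffer, dispatch on `payload["type"]`, or send the error text back to the client.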

Common failure modes:

  1. Connection drops mid-session: implement reconnection with cached context
  2. Client/server audio format mismatch: agree on sample rate, sample width, and channel count, and validate them at session start
  3. Slow clients causing buffer overflow: implement backpressure handling
  4. Memory leaks from unclosed sessions: track abandoned connections and time them out
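Failure modes 3 and 4 both come down to bounding resources per session. The sketch below shows two small helpers under stated assumptions: a hard cap on buffered audio (a cruder stand-in for true backpressure, since it rejects data instead of slowing the sender) and a registry that reports sessions idle past a timeout. Both class names and the `clock` parameter are hypothetical.

```python
import time


class BoundedAudioBuffer:
    """Accumulates PCM bytes but refuses to grow past max_bytes."""

    def __init__(self, max_bytes: int):
        self.max_bytes = max_bytes
        self.chunks = []
        self.size = 0

    def append(self, chunk: bytes) -> bool:
        """Store the chunk; return False and store nothing if it would overflow."""
        if self.size + len(chunk) > self.max_bytes:
            return False
        self.chunks.append(chunk)
        self.size += len(chunk)
        return True

    def drain(self) -> bytes:
        """Return everything buffered so far and reset the buffer."""
        data = b"".join(self.chunks)
        self.chunks.clear()
        self.size = 0
        return data


class SessionRegistry:
    """Tracks last activity per session so abandoned ones can be closed."""

    def __init__(self, idle_timeout_s: float, clock=time.monotonic):
        self.idle_timeout_s = idle_timeout_s
        self.clock = clock          # injectable so the timeout logic is testable
        self.last_seen = {}

    def touch(self, session_id) -> None:
        """Record activity for a session (call on every received message)."""
        self.last_seen[session_id] = self.clock()

    def remove(self, session_id) -> None:
        """Forget a session that closed normally."""
        self.last_seen.pop(session_id, None)

    def expired(self) -> list:
        """Return ids of sessions idle for longer than the timeout."""
        now = self.clock()
        return [sid for sid, seen in self.last_seen.items()
                if now - seen > self.idle_timeout_s]
```

When `append` returns False, a handler could send an error message and close the connection. A periodic task could call `expired()` and close each returned session before calling `remove`.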
EXERCISE

Implement a WebSocket server handling voice input. Test with a client that sends audio chunks and receives transcribed results or synthesized replies. Measure end-to-end latency. (15 minutes)
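For the latency measurement, one possible approach is to start a timer when the client sends "end_stream" and stop it when the first reply arrives. The recorder below is only a sketch of that bookkeeping; `LatencyRecorder` and its methods are hypothetical names, and where you call `start` and `stop` depends on your own client.

```python
import statistics
import time


class LatencyRecorder:
    """Collects round-trip timings in milliseconds."""

    def __init__(self, clock=time.perf_counter):
        self.clock = clock          # injectable clock, useful for testing
        self.samples_ms = []
        self._started = None

    def start(self) -> None:
        """Mark the moment the request was sent."""
        self._started = self.clock()

    def stop(self) -> float:
        """Mark the moment the reply arrived and record the elapsed time."""
        if self._started is None:
            raise RuntimeError("stop() called before start()")
        elapsed_ms = (self.clock() - self._started) * 1000.0
        self.samples_ms.append(elapsed_ms)
        self._started = None
        return elapsed_ms

    def summary(self) -> dict:
        """Summarize recorded samples; requires at least one sample."""
        return {
            "count": len(self.samples_ms),
            "median_ms": statistics.median(self.samples_ms),
            "max_ms": max(self.samples_ms),
        }
```

Recording several turns and reporting the median and maximum gives a steadier picture than a single measurement, since the first request often includes model warm-up.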
