11. WebSocket Server
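Delivering streaming audio over the network calls for a persistent, bidirectional connection. Plain HTTP request-response is a poor fit: every exchange is initiated by the client and carries per-request overhead, so continuous audio would require polling or some other workaround. A WebSocket keeps a single full-duplex connection open, which suits continuous audio flow in both directions.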
WebSocket server implementation using asyncio:
import asyncio

import numpy as np
import websockets

# VoicePipeline is not defined in this section; it is assumed to be
# importable from wherever the rest of the pipeline lives.


async def voice_session(websocket):
    """Handle a single voice conversation session."""
    pipeline = VoicePipeline()
    audio_buffer = []

    async for message in websocket:
        if isinstance(message, bytes):
            # Binary frame: 16-bit PCM audio, scaled to float32 in [-1.0, 1.0)
            audio_chunk = np.frombuffer(message, dtype=np.int16)
            audio_buffer.append(audio_chunk.astype(np.float32) / 32768.0)
        elif message == "end_stream":
            # Text frame marking the end of an utterance
            if not audio_buffer:
                continue  # np.concatenate raises on an empty list
            full_audio = np.concatenate(audio_buffer)
            audio_buffer = []
            # Run the (presumably blocking) pipeline in a worker thread so the
            # event loop keeps serving other sessions (Python 3.9+)
            result = await asyncio.to_thread(pipeline.process_audio, full_audio)
            if result is not None:
                # Clip first: a sample of exactly 1.0 scales to 32768,
                # which does not fit in int16
                pcm = np.clip(result * 32768.0, -32768, 32767).astype(np.int16)
                await websocket.send(pcm.tobytes())
        elif message == "ping":
            await websocket.send("pong")

    await websocket.close()


async def main():
    server = await websockets.serve(voice_session, "localhost", 8765)
    print("Voice WebSocket server running on ws://localhost:8765")
    await server.serve_forever()


if __name__ == "__main__":
    asyncio.run(main())
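Client-side streaming requires careful chunk size management. Chunks that are too large add latency, because the sender has to wait for the whole chunk to fill before transmitting it; chunks that are too small raise the per-message overhead. Typical voice chunks carry 20-100 ms of audio per message, which balances overhead against latency.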
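The chunk size arithmetic can be sketched as follows. This is a minimal illustration, assuming 16 kHz, 16-bit mono PCM (the section does not fix a sample rate); chunk_size_bytes and iter_chunks are hypothetical helper names, not part of any library.

```python
def chunk_size_bytes(sample_rate, chunk_ms, sample_width=2, channels=1):
    """Bytes needed to hold chunk_ms milliseconds of PCM audio."""
    samples = sample_rate * chunk_ms // 1000
    return samples * sample_width * channels


def iter_chunks(pcm, sample_rate=16000, chunk_ms=20):
    """Yield consecutive chunks of raw PCM bytes; the last one may be shorter."""
    size = chunk_size_bytes(sample_rate, chunk_ms)
    for start in range(0, len(pcm), size):
        yield pcm[start:start + size]


if __name__ == "__main__":
    one_second = bytes(32000)  # 1 s of silence: 16000 samples * 2 bytes
    chunks = list(iter_chunks(one_second))
    print(len(chunks), "chunks of", len(chunks[0]), "bytes")
```

Under these assumptions a 20 ms chunk is 320 samples, or 640 bytes, so one second of audio becomes 50 messages. Each chunk would be sent as one binary WebSocket message.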
Connection state management:
import json

import numpy as np


class SessionState:
    """Per-connection state for one voice session."""

    def __init__(self, websocket):
        self.websocket = websocket
        self.audio_buffer = []
        self.transcription_history = []
        self.is_active = False

    async def send_audio(self, audio_chunk):
        # Clip first: a sample of exactly 1.0 scales to 32768,
        # which does not fit in int16
        pcm = np.clip(audio_chunk * 32768.0, -32768, 32767).astype(np.int16)
        await self.websocket.send(pcm.tobytes())

    async def send_transcript(self, text):
        # Control data goes out as a JSON text frame
        await self.websocket.send(json.dumps({"type": "transcript", "text": text}))
Protocol design should include control messages for session management, ping/pong for connection liveness, and error reporting. Send audio as binary frames and control messages as text frames, so the receiver can tell them apart by frame type alone.
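One way to sketch such a protocol is shown below. It assumes every control message is a JSON object with a "type" field, which differs from the bare "end_stream" and "ping" strings used by the server above; the set of message types and the names encode_control and classify are illustrative assumptions, not an established format.

```python
import json

# Hypothetical set of control message types for this sketch
CONTROL_TYPES = {"start", "end_stream", "ping", "pong", "transcript", "error"}


def encode_control(msg_type, **fields):
    """Build a JSON text frame for a control message."""
    if msg_type not in CONTROL_TYPES:
        raise ValueError(f"unknown control type: {msg_type}")
    return json.dumps({"type": msg_type, **fields})


def classify(message):
    """Return ("audio", bytes), ("control", dict), or ("error", dict)."""
    if isinstance(message, (bytes, bytearray)):
        return "audio", bytes(message)  # binary frame: raw audio
    try:
        payload = json.loads(message)
    except json.JSONDecodeError:
        return "error", {"type": "error", "reason": "invalid JSON"}
    if not isinstance(payload, dict) or payload.get("type") not in CONTROL_TYPES:
        return "error", {"type": "error", "reason": "unknown message type"}
    return "control", payload


if __name__ == "__main__":
    print(classify(b"\x00\x01"))
    print(classify(encode_control("transcript", text="hello")))
    print(classify("not json"))
```

A receive loop could call classify on each incoming message and send the returned error payload back to the peer instead of silently dropping malformed input.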
Common failure modes:
- Connection drops mid-session: implement reconnection with cached context
- Client/server audio format mismatch (sample rate, sample width, channel count): validate the format at session start
- Slow clients causing buffer overflow: implement backpressure handling
- Memory leaks from unclosed sessions: track and time out abandoned connections
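Backpressure, from the list above, can be illustrated with a bounded asyncio.Queue between the code that receives chunks and the code that processes them. This is a sketch only: producer stands in for a WebSocket receive loop, consumer for the audio pipeline, and both names and the queue size are assumptions. When the queue is full, put() waits, so the receiver stops reading and memory use stays bounded.

```python
import asyncio


async def producer(queue, chunks):
    """Stand-in for the receive loop: put() waits while the queue is full."""
    for chunk in chunks:
        await queue.put(chunk)
    await queue.put(None)  # sentinel marking end of stream


async def consumer(queue, processed, depths):
    """Stand-in for slow audio processing."""
    while True:
        depths.append(queue.qsize())  # record how much is buffered
        chunk = await queue.get()
        if chunk is None:
            break
        await asyncio.sleep(0)  # yield to the event loop, simulating work
        processed.append(chunk)


async def run_demo(num_chunks=10, maxsize=3):
    queue = asyncio.Queue(maxsize=maxsize)
    processed, depths = [], []
    chunks = [bytes([i]) * 4 for i in range(num_chunks)]
    await asyncio.gather(
        producer(queue, chunks),
        consumer(queue, processed, depths),
    )
    return processed, max(depths)


if __name__ == "__main__":
    processed, max_depth = asyncio.run(run_demo())
    print(len(processed), "chunks processed, max queue depth", max_depth)
```

In a real server, pausing reads from the socket lets TCP flow control slow the sender down. The same bounded-queue idea applies in the other direction when a slow client cannot keep up with outgoing audio.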
Implement a WebSocket server handling voice input. Test with a client that sends audio chunks and receives transcribed results or synthesized replies. Measure end-to-end latency. (15 minutes)