Overview
The Voice Chat API is powered by the Lokutor Orchestrator, a central engine that coordinates the entire conversational pipeline: Voice Activity Detection (VAD), Speech-to-Text (STT), Large Language Models (LLM), and Text-to-Speech (TTS). It is designed for full-duplex interactions, meaning both the user and the agent can speak at the same time, with support for intelligent interruptions (barge-in).
Interactive Playground
Test the full-duplex conversational capabilities of the Lokutor Orchestrator.
Audio Specifications
To ensure optimal performance and compatibility with the orchestrator, your audio stream must adhere to the following specifications:

| Property | Value |
|---|---|
| Sample Rate | 44,100 Hz (44.1kHz) |
| Channels | 1 (Mono) |
| Format | Signed 16-bit PCM (S16LE) |
| Bytes per Sample | 2 |
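The values in the table fix how many bytes a given duration of audio occupies. The sketch below works that out and splits a PCM buffer into fixed-duration chunks. The 20 ms chunk length and the helper names are illustrative assumptions, not values required by the API.

```python
SAMPLE_RATE_HZ = 44_100   # from the table above
CHANNELS = 1              # mono
BYTES_PER_SAMPLE = 2      # signed 16-bit PCM


def bytes_for_duration(ms: int) -> int:
    """Number of PCM bytes that cover `ms` milliseconds of audio."""
    samples = SAMPLE_RATE_HZ * ms // 1000
    return samples * CHANNELS * BYTES_PER_SAMPLE


def split_into_chunks(pcm: bytes, chunk_ms: int = 20):
    """Yield consecutive chunks of `chunk_ms` each; the last one may be shorter."""
    size = bytes_for_duration(chunk_ms)  # 20 ms is an assumed chunk length
    for start in range(0, len(pcm), size):
        yield pcm[start:start + size]
```

At these settings one second of audio is 88,200 bytes, and a 20 ms chunk is 1,764 bytes.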
WebSocket API
The agent endpoint is a full-duplex WebSocket connection that handles both binary audio data and JSON control/event messages.
- Endpoint: wss://api.lokutor.com/ws/agent
- Authentication: Include your API key as a query parameter: ?api_key=your-api-key.
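As a small sketch of the authentication rule above, the connection URL can be built with the standard library so that the key is URL-encoded. The function name `build_agent_url` is a hypothetical helper, not part of any Lokutor SDK.

```python
from urllib.parse import urlencode

AGENT_ENDPOINT = "wss://api.lokutor.com/ws/agent"


def build_agent_url(api_key: str) -> str:
    """Append the API key as the `api_key` query parameter."""
    return f"{AGENT_ENDPOINT}?{urlencode({'api_key': api_key})}"
```

The resulting string can be passed to whichever WebSocket client library you use.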
1. Initialization
Once connected, send your configuration parameters as individual JSON messages.
2. Client to Server (User Input)
Stream raw audio bytes as Binary WebSocket Messages. The orchestrator uses a high-performance VAD to detect speech and trigger the processing pipeline.
3. Server to Client (Events & Audio)
The server emits events as JSON messages and audio as binary chunks.
Event Reference

| Event Type | Data Field | Description |
|---|---|---|
| status | listening | VAD detected the user has started talking or is waiting for input. |
| status | thinking | The LLM is generating a response. |
| status | speaking | TTS has started generating audio. |
| status | interrupted | User spoke while the agent was talking (Barge-in). |
| transcript | data (string) | Transcribed text. Role is indicated in the role field. |
| error | data (string) | An error occurred in the pipeline. |
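A client typically dispatches on the event type. The sketch below assumes each event is a JSON object whose event type sits under a `type` key; that key name is an assumption, while `data` and `role` follow the table above. Check the exact message shape against real server output before relying on it.

```python
import json


def describe_event(raw: str) -> str:
    """Turn one JSON text message into a short human-readable line."""
    event = json.loads(raw)
    kind = event.get("type")   # assumed key name for the event type
    data = event.get("data")
    if kind == "status":
        # One of: listening, thinking, speaking, interrupted
        return f"status -> {data}"
    if kind == "transcript":
        role = event.get("role", "unknown")
        return f"{role}: {data}"
    if kind == "error":
        return f"error: {data}"
    return f"unhandled event: {kind}"
```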
Binary Data (Audio)
Raw PCM audio chunks (S16LE, 44.1kHz) are sent as binary messages during the speaking status.
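If you need sample values rather than raw bytes (for metering or resampling, say), a binary chunk can be decoded with the standard library. This is a minimal sketch; it assumes every chunk contains whole 16-bit samples, which the documentation does not explicitly guarantee.

```python
import array
import sys


def decode_pcm(chunk: bytes) -> array.array:
    """Decode S16LE bytes into signed 16-bit integer samples."""
    if len(chunk) % 2:
        raise ValueError("chunk does not contain whole 16-bit samples")
    samples = array.array("h")      # 'h' = signed short (2 bytes on common platforms)
    samples.frombytes(chunk)
    if sys.byteorder == "big":
        samples.byteswap()          # wire data is little-endian
    return samples
```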
Core Capabilities
Intelligent Barge-In
The orchestrator handles barge-in automatically. When a user starts speaking while the bot is outputting audio, the server sends a status: interrupted event.
Note: Upon receiving an interrupted status, the client should immediately clear its local audio playback buffer and stop audio output.
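The client-side half of barge-in can be sketched as a playback queue that is emptied when the interrupted status arrives. `PlaybackBuffer` is a hypothetical helper; wiring it to an audio device and to the WebSocket is left out.

```python
from collections import deque
from typing import Deque, Optional


class PlaybackBuffer:
    """Queue of PCM chunks waiting to be played (hypothetical client helper)."""

    def __init__(self) -> None:
        self._chunks: Deque[bytes] = deque()

    def on_audio(self, chunk: bytes) -> None:
        """Queue a binary audio message received from the server."""
        self._chunks.append(chunk)

    def on_status(self, status: str) -> None:
        """Drop all queued audio as soon as the server reports a barge-in."""
        if status == "interrupted":
            self._chunks.clear()

    def next_chunk(self) -> Optional[bytes]:
        """Return the next chunk to play, or None when nothing is queued."""
        return self._chunks.popleft() if self._chunks else None
```

A real client would also stop or flush the audio device itself, since audio already handed to the output is outside this queue.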
Echo Guard & Pre-roll
- Echo Guard: Intelligently ignores bot audio picked up by the microphone to prevent feedback loops.
- Pre-roll: Maintains a small buffer of audio before speech starts to ensure no initial phonemes are clipped.
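To illustrate the pre-roll idea only, the sketch below keeps the most recent chunks in a bounded buffer and prepends them once speech starts. It is not the orchestrator's implementation; the class name and the default of 10 chunks are assumptions chosen for the example.

```python
from collections import deque
from typing import Deque, List


class PreRollBuffer:
    """Keeps the last few pre-speech chunks so the onset of speech is not lost."""

    def __init__(self, max_chunks: int = 10) -> None:
        # deque(maxlen=...) silently discards the oldest chunk when full
        self._recent: Deque[bytes] = deque(maxlen=max_chunks)

    def push_silence(self, chunk: bytes) -> None:
        """Remember a chunk that was classified as non-speech."""
        self._recent.append(chunk)

    def start_of_speech(self, first_chunk: bytes) -> List[bytes]:
        """Return the buffered pre-roll followed by the first speech chunk."""
        out = list(self._recent) + [first_chunk]
        self._recent.clear()
        return out
```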
Session Management
The orchestrator maintains the dialogue state (Conversation History). You can adjust the max_context_messages parameter to manage the context window size.
Best Practices
- Persistent Connection: Keep the WebSocket open for the duration of the user’s session.
- Handle Interruption: Responsive agents require the client to stop playback immediately on INTERRUPTED.
- Silence suppression: While the VAD is efficient, avoid sending pure noise/silence to the server to optimize bandwidth.
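One simple way to approximate silence suppression on the client is an energy gate: compute the RMS level of each chunk and skip those below a threshold. This is a sketch under assumptions; the threshold of 200 is an arbitrary illustrative value and should be tuned for your microphone and environment.

```python
import array
import math
import sys


def rms(chunk: bytes) -> float:
    """Root-mean-square level of an S16LE PCM chunk (0.0 for an empty chunk)."""
    samples = array.array("h")
    samples.frombytes(chunk)        # raises ValueError on a partial sample
    if sys.byteorder == "big":
        samples.byteswap()          # PCM data is little-endian
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def should_send(chunk: bytes, threshold: float = 200.0) -> bool:
    """True when the chunk is loud enough to be worth streaming."""
    return rms(chunk) >= threshold  # threshold is an assumed, tunable value
```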