Overview

The Voice Chat API is powered by the Lokutor Orchestrator, a central engine that coordinates the entire conversational pipeline: Voice Activity Detection (VAD), Speech-to-Text (STT), a Large Language Model (LLM), and Text-to-Speech (TTS). It is designed for full-duplex interactions, meaning both the user and the agent can speak at the same time, with support for intelligent interruptions (barge-in).

Audio Specifications

To ensure optimal performance and compatibility with the orchestrator, your audio stream must adhere to the following specifications:
Property          Value
----------------  -------------------------
Sample Rate       44,100 Hz (44.1 kHz)
Channels          1 (Mono)
Format            Signed 16-bit PCM (S16LE)
Bytes per Sample  2

WebSocket API

The agent endpoint is a full-duplex WebSocket connection that handles both binary audio data and JSON control/event messages.
  • Endpoint: wss://api.lokutor.com/ws/agent
  • Authentication: Include your API key as a query parameter ?api_key=your-api-key.

1. Initialization

Once connected, send your configuration parameters as individual JSON messages:
{ "type": "prompt", "data": "You are a helpful travel assistant. Be concise." }
{ "type": "voice", "data": "F1" }
{ "type": "language", "data": "en" }
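As a sketch, these initialization messages could be assembled client-side like this (the helper name and defaults are illustrative, not part of the API; each string is sent as a text WebSocket frame after connecting):

```python
import json

def build_init_messages(prompt, voice="F1", language="en"):
    """Build the three initialization messages sent after connecting.

    Helper name and defaults are illustrative, not part of the API.
    """
    return [
        json.dumps({"type": "prompt", "data": prompt}),
        json.dumps({"type": "voice", "data": voice}),
        json.dumps({"type": "language", "data": language}),
    ]

msgs = build_init_messages("You are a helpful travel assistant. Be concise.")
```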

2. Client to Server (User Input)

Stream raw audio bytes as Binary WebSocket Messages. The orchestrator uses a high-performance VAD to detect speech and trigger the processing pipeline.
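If your capture pipeline produces float samples, they must be converted to the S16LE format above before streaming. A minimal conversion sketch (the function name is illustrative):

```python
import struct

SAMPLE_RATE = 44_100  # per the audio specifications above

def floats_to_s16le(samples):
    """Convert float samples in [-1.0, 1.0] to signed 16-bit little-endian PCM."""
    clamped = [max(-1.0, min(1.0, s)) for s in samples]
    return struct.pack(f"<{len(clamped)}h", *(int(s * 32767) for s in clamped))

chunk = floats_to_s16le([0.0, 0.5, -0.5])
# 3 samples * 2 bytes per sample = 6 bytes
```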

3. Server to Client (Events & Audio)

The server emits events as JSON messages and audio as binary chunks.

Event Reference

Event Type   Data Field       Description
-----------  ---------------  -----------------------------------------------------------------
status       listening        VAD is waiting for input or has detected that the user started speaking.
status       thinking         The LLM is generating a response.
status       speaking         TTS has started generating audio.
status       interrupted      The user spoke while the agent was talking (barge-in).
transcript   data (string)    Transcribed text; the speaker's role is given in the role field.
error        data (string)    An error occurred in the pipeline.
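A minimal dispatcher for these events might look like the following. This sketch assumes events arrive as `{"type": ..., "data": ...}` objects, matching the shape of the initialization messages; the `(kind, payload)` return convention is just for illustration:

```python
import json

def handle_server_message(raw):
    """Route a JSON event from the server; returns a (kind, payload) tuple.

    Event shapes follow the Event Reference table above; the return
    convention is only for this sketch.
    """
    event = json.loads(raw)
    etype = event.get("type")
    if etype == "status":
        return ("status", event.get("data"))
    if etype == "transcript":
        # The speaker's role is indicated in the role field.
        return ("transcript", (event.get("role"), event.get("data")))
    if etype == "error":
        return ("error", event.get("data"))
    return ("unknown", event)
```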

Binary Data (Audio)

Raw PCM audio chunks (S16LE, 44.1kHz) are sent as binary messages during the speaking status.
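Because the stream is a known fixed format, the playback duration of each binary chunk follows directly from its byte count:

```python
SAMPLE_RATE = 44_100   # per the audio specifications above
BYTES_PER_SAMPLE = 2   # S16LE, mono

def chunk_duration_seconds(chunk: bytes) -> float:
    """Playback duration of a raw PCM chunk under the spec above."""
    return len(chunk) / (SAMPLE_RATE * BYTES_PER_SAMPLE)

# One second of audio is 44,100 samples * 2 bytes = 88,200 bytes.
```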

Core Capabilities

Intelligent Barge-In

The orchestrator handles barge-in automatically. When a user starts speaking while the bot is outputting audio, the server sends a status: interrupted event.
Note: Upon receiving an interrupted status, the client should immediately clear its local audio playback buffer and stop audio output.
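The client-side handling described in the note can be sketched as a small playback queue (a stand-in for a real audio sink; class and method names are illustrative):

```python
from collections import deque

class PlaybackBuffer:
    """Minimal client-side playback queue; a sketch, not a real audio sink."""

    def __init__(self):
        self._chunks = deque()

    def enqueue(self, chunk: bytes):
        """Queue a binary audio chunk received during the speaking status."""
        self._chunks.append(chunk)

    def on_status(self, status: str):
        # Per the note above: clear queued audio immediately on barge-in.
        if status == "interrupted":
            self._chunks.clear()

    def pending_bytes(self) -> int:
        return sum(len(c) for c in self._chunks)
```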

Echo Guard & Pre-roll

  • Echo Guard: Intelligently ignores bot audio picked up by the microphone to prevent feedback loops.
  • Pre-roll: Maintains a small buffer of audio before speech starts to ensure no initial phonemes are clipped.
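The pre-roll idea can be illustrated with a small ring buffer that always holds the most recent few hundred milliseconds of audio. This mirrors the server-side behavior described above as a client-side sketch; the buffer length and chunk size are assumptions, not documented values:

```python
from collections import deque

PREROLL_MS = 200  # illustrative; the actual pre-roll length is server-side

class PreRollBuffer:
    """Keep the most recent PREROLL_MS of audio so speech onsets aren't clipped."""

    def __init__(self, chunk_ms=20):
        # Number of fixed-size chunks covering the pre-roll window.
        chunks = max(1, PREROLL_MS // chunk_ms)
        self._ring = deque(maxlen=chunks)

    def push(self, chunk: bytes):
        """Append a chunk; the oldest chunk falls off once the window is full."""
        self._ring.append(chunk)

    def drain(self):
        """Return buffered audio (oldest first) and reset the window."""
        data = b"".join(self._ring)
        self._ring.clear()
        return data
```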

Session Management

The orchestrator maintains the dialogue state (Conversation History). You can adjust the max_context_messages to manage the context window size.
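The orchestrator manages this state server-side; as a conceptual illustration only, `max_context_messages` bounds the history like this:

```python
def trim_history(history, max_context_messages):
    """Keep only the most recent messages (conceptual illustration of what
    max_context_messages does; the real trimming happens server-side)."""
    if max_context_messages is None or len(history) <= max_context_messages:
        return history
    return history[-max_context_messages:]
```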

Best Practices

  1. Persistent Connection: Keep the WebSocket open for the duration of the user’s session.
  2. Handle Interruption: Responsive agents require the client to stop playback immediately on the interrupted status.
  3. Silence Suppression: While the VAD is efficient, avoid sending pure noise/silence to the server to optimize bandwidth.
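A crude client-side gate for the silence-suppression practice above might skip chunks whose peak amplitude falls below a threshold (the threshold is illustrative; tune it for your microphone, or use a proper client-side VAD):

```python
import struct

def is_probably_silence(chunk: bytes, threshold=500):
    """True if the S16LE chunk's peak amplitude is below the threshold.

    Illustrative sketch only; a real client would use an energy- or
    VAD-based gate with hysteresis.
    """
    if not chunk:
        return True
    samples = struct.unpack(f"<{len(chunk) // 2}h", chunk)
    return max(abs(s) for s in samples) < threshold
```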