Overview

The Voice Chat API is powered by the Lokutor Orchestrator, a central engine that coordinates the entire conversational pipeline: Voice Activity Detection (VAD), Speech-to-Text (STT), a Large Language Model (LLM), and Text-to-Speech (TTS). It is designed for full-duplex interactions, meaning both the user and the agent can speak at the same time, with support for intelligent interruptions (barge-in).

Audio Specifications

To ensure optimal performance and compatibility with the orchestrator, your audio stream must adhere to the following specifications:
Property          Value
----------------  -------------------------
Sample Rate       44,100 Hz (44.1 kHz)
Channels          1 (Mono)
Format            Signed 16-bit PCM (S16LE)
Bytes per Sample  2

WebSocket API

The agent endpoint is a full-duplex WebSocket connection that handles both binary audio data and JSON control/event messages.
  • Endpoint: wss://api.lokutor.com/ws/agent
  • Authentication: Include your API key as a query parameter ?api_key=your-api-key.

1. Initialization

Once connected, send your configuration parameters as individual JSON messages:
{ "type": "prompt", "data": "You are a helpful travel assistant. Be concise." }
{ "type": "voice", "data": "F1" }
{ "type": "language", "data": "en" }
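As a sketch, these initialization messages could be assembled client-side like this (the helper name and defaults are illustrative, not part of the API; each string is sent as a text WebSocket frame after connecting):

```python
import json

def build_init_messages(prompt, voice="F1", language="en"):
    """Build the three initialization messages sent after connecting.

    Helper name and defaults are illustrative, not part of the API.
    """
    return [
        json.dumps({"type": "prompt", "data": prompt}),
        json.dumps({"type": "voice", "data": voice}),
        json.dumps({"type": "language", "data": language}),
    ]

msgs = build_init_messages("You are a helpful travel assistant. Be concise.")
```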

2. Client to Server (User Input)

Stream raw audio bytes as Binary WebSocket Messages. The orchestrator uses a high-performance VAD to detect speech and trigger the processing pipeline.
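If your capture pipeline produces float samples, they must be converted to the S16LE format above before streaming. A minimal conversion sketch (the function name is illustrative):

```python
import struct

SAMPLE_RATE = 44_100  # per the audio specifications above

def floats_to_s16le(samples):
    """Convert float samples in [-1.0, 1.0] to signed 16-bit little-endian PCM."""
    clamped = [max(-1.0, min(1.0, s)) for s in samples]
    return struct.pack(f"<{len(clamped)}h", *(int(s * 32767) for s in clamped))

chunk = floats_to_s16le([0.0, 0.5, -0.5])
# 3 samples * 2 bytes per sample = 6 bytes
```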

3. Server to Client (Events & Audio)

The server emits events as JSON messages and audio as binary chunks.

Event Reference

Event Type   Data Field       Description
-----------  ---------------  -----------------------------------------------------------------
status       listening        VAD is waiting for input or has detected that the user started speaking.
status       thinking         The LLM is generating a response.
status       speaking         TTS has started generating audio.
status       interrupted      The user spoke while the agent was talking (barge-in).
transcript   data (string)    Transcribed text; the speaker's role is given in the role field.
error        data (string)    An error occurred in the pipeline.
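A minimal dispatcher for these events might look like the following. This sketch assumes events arrive as `{"type": ..., "data": ...}` objects, matching the shape of the initialization messages; the `(kind, payload)` return convention is just for illustration:

```python
import json

def handle_server_message(raw):
    """Route a JSON event from the server; returns a (kind, payload) tuple.

    Event shapes follow the Event Reference table above; the return
    convention is only for this sketch.
    """
    event = json.loads(raw)
    etype = event.get("type")
    if etype == "status":
        return ("status", event.get("data"))
    if etype == "transcript":
        # The speaker's role is indicated in the role field.
        return ("transcript", (event.get("role"), event.get("data")))
    if etype == "error":
        return ("error", event.get("data"))
    return ("unknown", event)
```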

Binary Data (Audio)

Raw PCM audio chunks (S16LE, 44.1kHz) are sent as binary messages during the speaking status.
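Because the stream is a known fixed format, the playback duration of each binary chunk follows directly from its byte count:

```python
SAMPLE_RATE = 44_100   # per the audio specifications above
BYTES_PER_SAMPLE = 2   # S16LE, mono

def chunk_duration_seconds(chunk: bytes) -> float:
    """Playback duration of a raw PCM chunk under the spec above."""
    return len(chunk) / (SAMPLE_RATE * BYTES_PER_SAMPLE)

# One second of audio is 44,100 samples * 2 bytes = 88,200 bytes.
```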

Core Capabilities

Intelligent Barge-In

The orchestrator handles barge-in automatically. When a user starts speaking while the bot is outputting audio, the server sends a status: interrupted event.
Note: Upon receiving an interrupted status, the client should immediately clear its local audio playback buffer and stop audio output.
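The client-side handling described in the note can be sketched as a small playback queue (a stand-in for a real audio sink; class and method names are illustrative):

```python
from collections import deque

class PlaybackBuffer:
    """Minimal client-side playback queue; a sketch, not a real audio sink."""

    def __init__(self):
        self._chunks = deque()

    def enqueue(self, chunk: bytes):
        """Queue a binary audio chunk received during the speaking status."""
        self._chunks.append(chunk)

    def on_status(self, status: str):
        # Per the note above: clear queued audio immediately on barge-in.
        if status == "interrupted":
            self._chunks.clear()

    def pending_bytes(self) -> int:
        return sum(len(c) for c in self._chunks)
```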

Echo Guard & Pre-roll

  • Echo Guard: Intelligently ignores bot audio picked up by the microphone to prevent feedback loops.
  • Pre-roll: Maintains a small buffer of audio before speech starts to ensure no initial phonemes are clipped.
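The pre-roll idea can be illustrated with a small ring buffer that always holds the most recent few hundred milliseconds of audio. This mirrors the server-side behavior described above as a client-side sketch; the buffer length and chunk size are assumptions, not documented values:

```python
from collections import deque

PREROLL_MS = 200  # illustrative; the actual pre-roll length is server-side

class PreRollBuffer:
    """Keep the most recent PREROLL_MS of audio so speech onsets aren't clipped."""

    def __init__(self, chunk_ms=20):
        # Number of fixed-size chunks covering the pre-roll window.
        chunks = max(1, PREROLL_MS // chunk_ms)
        self._ring = deque(maxlen=chunks)

    def push(self, chunk: bytes):
        """Append a chunk; the oldest chunk falls off once the window is full."""
        self._ring.append(chunk)

    def drain(self):
        """Return buffered audio (oldest first) and reset the window."""
        data = b"".join(self._ring)
        self._ring.clear()
        return data
```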

Session Management

The orchestrator maintains the dialogue state (Conversation History). You can adjust the max_context_messages to manage the context window size.
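The orchestrator manages this state server-side; as a conceptual illustration only, `max_context_messages` bounds the history like this:

```python
def trim_history(history, max_context_messages):
    """Keep only the most recent messages (conceptual illustration of what
    max_context_messages does; the real trimming happens server-side)."""
    if max_context_messages is None or len(history) <= max_context_messages:
        return history
    return history[-max_context_messages:]
```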

Best Practices

  1. Persistent Connection: Keep the WebSocket open for the duration of the user’s session.
  2. Handle Interruption: Responsive agents require the client to stop playback immediately on the interrupted status.
  3. Silence Suppression: While the VAD is efficient, avoid sending pure noise/silence to the server to optimize bandwidth.
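A crude client-side gate for the silence-suppression practice above might skip chunks whose peak amplitude falls below a threshold (the threshold is illustrative; tune it for your microphone, or use a proper client-side VAD):

```python
import struct

def is_probably_silence(chunk: bytes, threshold=500):
    """True if the S16LE chunk's peak amplitude is below the threshold.

    Illustrative sketch only; a real client would use an energy- or
    VAD-based gate with hysteresis.
    """
    if not chunk:
        return True
    samples = struct.unpack(f"<{len(chunk) // 2}h", chunk)
    return max(abs(s) for s in samples) < threshold
```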