Overview
Real-time streaming is built for applications that require immediate responsiveness, such as conversational AI, voice assistants, and interactive gaming. Our WebSocket-based API delivers audio with sub-50ms latency, streaming raw PCM audio chunks as they’re generated.
Interactive Playground
Test the real-time speed of the Lokutor engine. The TTFB (time to first byte) measurement shows exactly how fast we start delivering audio.
WebSocket API Reference
Connection
Initiate a WebSocket connection to `wss://api.lokutor.com/ws`.
Include your API key as a query parameter in the WebSocket URL:
- URL: `wss://api.lokutor.com/ws?api_key=your-api-key`
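A minimal sketch of assembling the connection URL with Python's standard library, so that any special characters in the key are percent-encoded. The helper name `build_ws_url` is illustrative and not part of the API.

```python
from urllib.parse import urlencode

# Endpoint as given in the Connection section.
WS_ENDPOINT = "wss://api.lokutor.com/ws"


def build_ws_url(api_key: str, endpoint: str = WS_ENDPOINT) -> str:
    """Return the WebSocket URL with the API key as a query parameter."""
    if not api_key:
        raise ValueError("api_key must be a non-empty string")
    # urlencode escapes characters that are unsafe in a query string.
    return f"{endpoint}?{urlencode({'api_key': api_key})}"


print(build_ws_url("your-api-key"))
# wss://api.lokutor.com/ws?api_key=your-api-key
```

Because the key travels in the URL, it is worth keeping it out of logs and client-side bundles.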
Synthesis Request (Client to Server)
Send a JSON-formatted string to start synthesis.
Request Schema
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| `text` | string | Yes | - | The text to be synthesized into speech. |
| `voice` | string | No | `M1` | Voice ID. See Available Voices. |
| `lang` | string | No | `en` | ISO language code. Options: `en`, `es`, `ko`, `pt`, `fr`. |
| `speed` | float | No | 1.05 | Synthesis speed multiplier (0.5 to 2.0 recommended). |
| `steps` | int | No | 5 | Denoising steps. Higher = higher quality, higher latency. |
| `version` | string | No | `versa-1.0` | Model version to use. |
| `visemes` | boolean | No | false | Enable/Disable high-fidelity lipsync data generation. |
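The schema above can be turned into a small request builder. This is a sketch, not an official client: the function name `build_request` is hypothetical, and it assumes that omitting an optional field lets the documented server-side default apply.

```python
import json
from typing import Optional

# Language codes listed in the request schema.
SUPPORTED_LANGS = {"en", "es", "ko", "pt", "fr"}


def build_request(
    text: str,
    voice: Optional[str] = None,
    lang: Optional[str] = None,
    speed: Optional[float] = None,
    steps: Optional[int] = None,
    version: Optional[str] = None,
    visemes: Optional[bool] = None,
) -> str:
    """Serialize a synthesis request, leaving out unset optional fields."""
    if not text:
        raise ValueError("text is required")
    if lang is not None and lang not in SUPPORTED_LANGS:
        raise ValueError(f"unsupported lang: {lang!r}")
    optional = {
        "voice": voice,
        "lang": lang,
        "speed": speed,
        "steps": steps,
        "version": version,
        "visemes": visemes,
    }
    payload = {"text": text}
    # Only send what the caller set explicitly (assumption: the server
    # fills in the defaults from the table for missing keys).
    payload.update({k: v for k, v in optional.items() if v is not None})
    return json.dumps(payload)


print(build_request("Hello, world!", voice="F1", steps=3))
# {"text": "Hello, world!", "voice": "F1", "steps": 3}
```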
Synthesis Response (Server to Client)
The server responds with a stream of messages.
- Binary Messages (Audio): Multiple binary chunks containing raw PCM audio.
  - Format: Signed 16-bit PCM (S16LE).
  - Sample Rate: 44,100 Hz.
  - Channels: 1 (Mono).
  - Byte Order: Little Endian.
- Text Message (JSON Visemes): If `visemes: true` is requested, the server sends JSON arrays containing lipsync metadata synchronized with the audio stream.
- Text Message “EOS”: Sent when the synthesis for the current request is complete.
- Text Message “ERR: <message>”: Sent if an error occurs during processing.
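One way to handle this mixed stream is to classify each incoming message before acting on it. The sketch below assumes that any text message other than "EOS" or an "ERR:" line is a JSON viseme array; the function name `classify_message` and the returned labels are illustrative.

```python
import json
from typing import Any, Tuple, Union


def classify_message(message: Union[bytes, str]) -> Tuple[str, Any]:
    """Map one server message to a (kind, payload) pair."""
    if isinstance(message, (bytes, bytearray)):
        return "audio", bytes(message)  # raw S16LE PCM chunk
    if message == "EOS":
        return "eos", None  # current request is complete
    if message.startswith("ERR:"):
        return "error", message[len("ERR:"):].strip()
    # Assumption: remaining text frames carry JSON viseme metadata.
    return "visemes", json.loads(message)


print(classify_message("ERR: unknown voice"))
# ('error', 'unknown voice')
```

A receive loop can then append "audio" payloads to the player, stop on "eos", and raise on "error".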
Viseme Data Format
When `visemes` is enabled, the server will yield metadata in the following format:
Available Voices
| ID | Gender | Description |
|---|---|---|
| F1 - F5 | Female | Various feminine tones from professional to casual. |
| M1 - M5 | Male | Various masculine tones from deep baritone to energetic. |
[!TIP] Use `F1` or `M1` for the best general-purpose performance and naturalness.
Best Practices for Low Latency
- Persistent Connections: Keep the WebSocket connection open for multiple requests to avoid handshake latency.
- Buffer Management: Since we stream audio in small chunks (~20-40ms), ensure your client-side player has a small buffer (e.g., 100-200ms) to handle network jitter without gaps.
- Optimized Steps: Use `steps: 3` for extreme low latency (real-time agents) or `steps: 10` for high-quality content generation.
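The buffer sizes above translate directly into byte counts: 16-bit mono PCM at 44,100 Hz is 88,200 bytes per second, so a 20-40ms chunk is roughly 1,764-3,528 bytes. The following sketch holds back playback until a chosen amount of audio has arrived; the class name `PrebufferedStream` and the 150ms default are assumptions for illustration.

```python
SAMPLE_RATE_HZ = 44_100
BYTES_PER_SAMPLE = 2  # signed 16-bit
CHANNELS = 1          # mono
BYTES_PER_SECOND = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS  # 88,200


def ms_to_bytes(ms: int) -> int:
    """PCM bytes covering `ms` milliseconds, aligned to whole samples."""
    frame = BYTES_PER_SAMPLE * CHANNELS
    return (BYTES_PER_SECOND * ms // 1000) // frame * frame


class PrebufferedStream:
    """Release audio to the player only after `prebuffer_ms` has arrived."""

    def __init__(self, prebuffer_ms: int = 150) -> None:
        self._threshold = ms_to_bytes(prebuffer_ms)
        self._pending = bytearray()
        self._started = False

    def push(self, chunk: bytes) -> bytes:
        """Add a chunk; return the bytes that can be played now."""
        self._pending.extend(chunk)
        if not self._started and len(self._pending) < self._threshold:
            return b""  # still filling the initial buffer
        self._started = True
        out = bytes(self._pending)
        self._pending.clear()
        return out

    def flush(self) -> bytes:
        """Return whatever is left, e.g. when "EOS" arrives early."""
        out = bytes(self._pending)
        self._pending.clear()
        return out


print(ms_to_bytes(150))
# 13230
```

Calling `flush()` on "EOS" matters for short utterances that never reach the prebuffer threshold.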
Error Codes
- 401 Unauthorized: Missing or invalid `X-API-Key`.
- 429 Too Many Requests: Rate limit exceeded.
- 503 Service Unavailable: Server overload or maintenance.
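A client can use these codes to decide whether reconnecting is worthwhile. The sketch below is one possible policy, not documented behavior: it treats 429 and 503 as retryable with exponential backoff and jitter, and 401 as fatal. The function name `retry_delay` and the `base` and `cap` values are assumptions.

```python
import random
from typing import Optional

# Assumption: rate limiting and overload are transient; bad credentials are not.
RETRYABLE_STATUSES = {429, 503}


def retry_delay(
    status: int, attempt: int, base: float = 0.5, cap: float = 30.0
) -> Optional[float]:
    """Seconds to wait before reconnecting, or None if retrying will not help."""
    if status not in RETRYABLE_STATUSES:
        return None  # e.g. 401: fix the API key instead of retrying
    # Exponential backoff with full jitter, capped at `cap` seconds.
    return random.uniform(0, min(cap, base * 2 ** attempt))


for attempt in range(3):
    delay = retry_delay(429, attempt)
    print(f"attempt {attempt}: wait up to {0.5 * 2 ** attempt:.1f}s, chose {delay:.2f}s")
```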
Implementation Examples
- JavaScript
- Python
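As a hedged Python sketch of the full flow described above: send a request on an open connection, collect binary PCM chunks until "EOS", and wrap the result in a WAV container. It assumes the third-party `websockets` package for the connection; the names `synthesize`, `pcm_to_wav`, and `main` are illustrative, and viseme messages are ignored.

```python
import asyncio
import io
import json
import wave


async def synthesize(ws, text: str, **options) -> bytes:
    """Send one request on an open connection and collect PCM until "EOS"."""
    await ws.send(json.dumps({"text": text, **options}))
    pcm = bytearray()
    while True:
        message = await ws.recv()
        if isinstance(message, (bytes, bytearray)):
            pcm.extend(message)  # raw S16LE, 44,100 Hz, mono
        elif message == "EOS":
            return bytes(pcm)
        elif message.startswith("ERR:"):
            raise RuntimeError(message[len("ERR:"):].strip())
        # Other text frames (viseme JSON) are skipped in this sketch.


def pcm_to_wav(pcm: bytes) -> bytes:
    """Wrap raw PCM in a WAV header matching the documented audio format."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(44_100)
        wav.writeframes(pcm)
    return buf.getvalue()


async def main(api_key: str) -> None:
    import websockets  # third-party: pip install websockets

    url = f"wss://api.lokutor.com/ws?api_key={api_key}"
    async with websockets.connect(url) as ws:
        # One persistent connection serves several requests.
        for i, text in enumerate(["Hello!", "How can I help?"]):
            pcm = await synthesize(ws, text, voice="F1", steps=3)
            with open(f"out_{i}.wav", "wb") as f:
                f.write(pcm_to_wav(pcm))


# To run against the live service:
# asyncio.run(main("your-api-key"))
```

Collecting the whole utterance before writing is the simplest approach; a real-time agent would instead hand each chunk to an audio player as it arrives.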