Overview

Real-time streaming is built for applications that require immediate responsiveness, such as conversational AI, voice assistants, and interactive gaming. Our WebSocket-based API delivers audio with sub-50ms latency, streaming raw PCM audio chunks as they’re generated.

WebSocket API Reference

Connection

Initiate a WebSocket connection to wss://api.lokutor.com/ws. Include your API key as a query parameter in the WebSocket URL:
  • URL: wss://api.lokutor.com/ws?api_key=your-api-key
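
As a minimal sketch, the URL can be assembled with a small helper (`buildWsUrl` is a hypothetical name, not part of the API) so the key is always URL-encoded:

```javascript
// Hypothetical helper: builds the connection URL with the api_key query parameter.
function buildWsUrl(apiKey) {
  return `wss://api.lokutor.com/ws?api_key=${encodeURIComponent(apiKey)}`;
}

// In a browser (or Node 22+, which ships a global WebSocket):
// const ws = new WebSocket(buildWsUrl(myApiKey));
```

Encoding the key guards against characters that are not URL-safe.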

Synthesis Request (Client to Server)

Send a JSON-formatted string to start synthesis.

Request Schema

{
  "text": "Hello, how can I help you today?",
  "voice": "M1",
  "lang": "en",
  "speed": 1.05,
  "steps": 5,
  "version": "versa-1.0",
  "visemes": true
}
| Parameter | Type    | Required | Default   | Description |
| --------- | ------- | -------- | --------- | ----------- |
| text      | string  | Yes      | -         | The text to be synthesized into speech. |
| voice     | string  | No       | M1        | Voice ID. See Available Voices. |
| lang      | string  | No       | en        | ISO language code. Options: en, es, ko, pt, fr. |
| speed     | float   | No       | 1.05      | Synthesis speed multiplier (0.5 to 2.0 recommended). |
| steps     | int     | No       | 5         | Denoising steps. Higher = higher quality, higher latency. |
| version   | string  | No       | versa-1.0 | Model version to use. |
| visemes   | boolean | No       | false     | Enable/disable high-fidelity lipsync data generation. |
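
A request payload can be built and validated client-side before sending. The helper below is an illustrative sketch (`buildSynthesisRequest` is not part of the API); its defaults and ranges mirror the table above:

```javascript
// Hypothetical helper: validates parameters against the documented schema and
// returns the JSON string to send over the socket. Defaults mirror the docs.
function buildSynthesisRequest({ text, voice = "M1", lang = "en", speed = 1.05,
                                 steps = 5, version = "versa-1.0", visemes = false }) {
  if (!text) throw new Error("text is required");
  const langs = ["en", "es", "ko", "pt", "fr"];
  if (!langs.includes(lang)) throw new Error(`unsupported lang: ${lang}`);
  if (speed < 0.5 || speed > 2.0) throw new Error("speed outside recommended 0.5-2.0 range");
  return JSON.stringify({ text, voice, lang, speed, steps, version, visemes });
}
```

Validating locally surfaces bad parameters before they cost a round trip.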

Synthesis Response (Server to Client)

The server responds with a stream of messages.
  • Binary Messages (Audio): Multiple binary chunks containing raw PCM audio.
    • Format: Signed 16-bit PCM (S16LE).
    • Sample Rate: 44,100 Hz.
    • Channels: 1 (Mono).
    • Byte Order: Little Endian.
  • Text Message (JSON Visemes): If visemes: true is requested, the server sends JSON arrays containing lipsync metadata synchronized with the audio stream.
  • Text Message “EOS”: Sent when the synthesis for the current request is complete.
  • Text Message “ERR: <message>”: Sent if an error occurs during processing.
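
The three text-message shapes can be told apart with a small dispatcher. This is a sketch, assuming viseme payloads are the only JSON text messages (per the list above); `classifyTextMessage` is a placeholder name:

```javascript
// Sketch: classify an incoming WebSocket text message into one of the three
// documented shapes: "EOS", "ERR: <message>", or a JSON viseme array.
function classifyTextMessage(data) {
  if (data === "EOS") return { kind: "eos" };
  if (data.startsWith("ERR:")) return { kind: "error", message: data.slice(4).trim() };
  return { kind: "visemes", visemes: JSON.parse(data) };
}
```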

Viseme Data Format

When visemes is enabled, the server will yield metadata in the following format:
[
  {
    "v": 10,       // Character index in the input text
    "c": "w",      // Character being spoken
    "t": 0.418     // Timestamp in seconds, relative to the start of the utterance
  }
]
These messages are sent as WebSocket Text Messages and are interleaved with the binary audio chunks. They always represent the alignment for the immediately following audio data.
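
Because `t` is relative to the start of the utterance, driving an animation requires converting each viseme into an absolute playback time. A minimal sketch, assuming you record the clock time (e.g. `AudioContext.currentTime`) at which the utterance's audio begins (`scheduleVisemes` is a hypothetical helper):

```javascript
// Hypothetical helper: converts relative viseme timestamps into absolute
// playback times, given the clock time at which the utterance's audio started.
function scheduleVisemes(visemes, utteranceStartTime) {
  return visemes.map(({ v, c, t }) => ({
    charIndex: v,
    char: c,
    fireAt: utteranceStartTime + t, // e.g. an AudioContext.currentTime value
  }));
}
```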

Available Voices

| ID      | Gender | Description |
| ------- | ------ | ----------- |
| F1 - F5 | Female | Various feminine tones, from professional to casual. |
| M1 - M5 | Male   | Various masculine tones, from deep baritone to energetic. |
> [!TIP]
> Use F1 or M1 for the best general-purpose performance and naturalness.

Best Practices for Low Latency

  1. Persistent Connections: Keep the WebSocket connection open for multiple requests to avoid handshake latency.
  2. Buffer Management: Since we stream audio in small chunks (~20-40ms), ensure your client-side player has a small buffer (e.g., 100-200ms) to handle network jitter without gaps.
  3. Optimized Steps: Use steps: 3 for extreme low latency (Real-time agents) or steps: 10 for high-quality content generation.
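
The buffering guidance in point 2 can be sketched as a small jitter buffer that holds S16LE mono chunks until a target amount of audio is queued. This is an illustrative structure, not a required API; the 150 ms target sits inside the 100-200 ms range suggested above:

```javascript
// Sketch of a minimal jitter buffer for the raw S16LE mono stream: hold
// chunks until ~150 ms of audio is queued before starting playback.
const SAMPLE_RATE = 44100;       // Hz, per the audio format above
const BYTES_PER_SAMPLE = 2;      // signed 16-bit mono

class JitterBuffer {
  constructor(targetMs = 150) {
    this.targetMs = targetMs;
    this.chunks = [];
    this.bufferedBytes = 0;
  }
  push(chunk) {                  // chunk: ArrayBuffer of S16LE samples
    this.chunks.push(chunk);
    this.bufferedBytes += chunk.byteLength;
  }
  bufferedMs() {                 // duration of queued audio in milliseconds
    return (this.bufferedBytes / BYTES_PER_SAMPLE / SAMPLE_RATE) * 1000;
  }
  ready() {                      // enough buffered to absorb network jitter?
    return this.bufferedMs() >= this.targetMs;
  }
}
```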

Error Codes

  • 401 Unauthorized: Missing or invalid api_key query parameter.
  • 429 Too Many Requests: Rate limit exceeded.
  • 503 Service Unavailable: Server overload or maintenance.
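
For 429 and 503 responses, a common client-side pattern is to reconnect with capped exponential backoff. The delays below are illustrative defaults, not values specified by the API:

```javascript
// Sketch: capped exponential backoff for retrying after 429/503.
// baseMs and capMs are illustrative, not values from the API docs.
function backoffMs(attempt, baseMs = 250, capMs = 8000) {
  return Math.min(capMs, baseMs * 2 ** attempt);
}
```

In production you would typically add random jitter to each delay so many clients do not retry in lockstep.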

Implementation Examples

A minimal browser client that handles all of the documented message types:

// Assumes playAudio(buffer) is your own function that queues raw PCM for playback.
const ws = new WebSocket(`wss://api.lokutor.com/ws?api_key=your-api-key`);
ws.binaryType = 'arraybuffer'; // receive audio as ArrayBuffer instead of Blob

ws.onopen = () => {
  ws.send(JSON.stringify({
    text: "Hello, world!",
    voice: "M1"
  }));
};

ws.onmessage = (event) => {
  if (typeof event.data === 'string') {
    if (event.data === 'EOS') {
      console.log('Done');
    } else if (event.data.startsWith('ERR:')) {
      console.error('Synthesis error:', event.data.slice(4).trim());
    } else {
      // Viseme metadata (JSON array), sent only when visemes: true
      const visemes = JSON.parse(event.data);
      console.log('Visemes:', visemes);
    }
  } else {
    // Binary message: raw S16LE PCM audio
    playAudio(event.data);
  }
};