Latency is the single most important factor in creating an immersive voice experience. In human conversation, the typical response gap is about 200ms. To match this, your entire stack (STT + LLM + TTS) needs to be optimized.

Understanding the Latency Chain

End-to-end latency is the sum of five stages (a worked budget follows the list):
  1. User Audio to Server: Network latency for streaming the user’s speech.
  2. STT (Transcription): Time to turn audio into text.
  3. LLM (Thinking): Time for the model to generate the first token.
  4. TTS (Lokutor): Time to generate the first audio chunk (time to first byte, TTFB).
  5. Playback Buffer: The safety margin your player keeps to prevent stuttering.
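For a sense of scale, here is a purely illustrative budget (the numbers are placeholders, not measurements): 50ms network + 150ms STT + 200ms LLM time-to-first-token + 100ms TTS TTFB + 100ms playback buffer = 600ms, three times the human conversational gap. This is why the stages below must overlap via streaming rather than run strictly in sequence.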

Lokutor Optimizations

1. Adjusting Denoising Steps

The steps parameter in our API directly controls inference compute time: fewer denoising steps mean faster synthesis at some cost in quality. A request sketch follows the list.
  • Use steps: 3 for ultra-low latency.
  • Use steps: 5 as the sweet spot between quality and speed.
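For example (a minimal sketch: steps is the real parameter described above, while the synthesize method and the text field are assumptions, so check the API reference for the exact request shape):

// Only `steps` is the documented knob here; the method name and
// other fields are placeholders.
await client.synthesize({
  text: 'Hello there!',
  steps: 3, // fastest; use 5 for the quality/speed sweet spot
});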

2. Stream Audio Chunks

Never wait for the full response before starting playback. Lokutor streams binary chunks as small as 20ms of audio; play them as they arrive (streaming playback), as in the browser sketch below.
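This sketch assumes the chunks arrive over a WebSocket as 16-bit mono PCM at 24kHz; the URL and audio format are placeholders, so match them to your actual stream:

const ctx = new AudioContext();
let playhead = ctx.currentTime; // where the next chunk should start

const ws = new WebSocket('wss://edge.example.com/tts'); // placeholder URL
ws.binaryType = 'arraybuffer';

ws.onmessage = (event) => {
  // Convert 16-bit PCM into the float samples Web Audio expects.
  const pcm = new Int16Array(event.data as ArrayBuffer);
  const floats = Float32Array.from(pcm, (s) => s / 32768);

  const buffer = ctx.createBuffer(1, floats.length, 24000);
  buffer.copyToChannel(floats, 0);

  const source = ctx.createBufferSource();
  source.buffer = buffer;
  source.connect(ctx.destination);

  // Schedule each chunk immediately after the previous one,
  // so playback starts with the first chunk, not the last.
  playhead = Math.max(playhead, ctx.currentTime);
  source.start(playhead);
  playhead += buffer.duration;
};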

3. Regional Connectivity

Ensure your client connects to the nearest Lokutor edge node. Our global infrastructure is designed to minimize the speed-of-light delay between your user and our inference engines. If you need to pick a region yourself, a simple probe like the one below can help.
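This sketch is illustrative only: the hostnames and the /health path are assumptions, not real Lokutor endpoints, and a production probe should also handle failures and CORS:

// Measure a rough round trip to each candidate region and pick the fastest.
const regions = [
  'https://us-east.api.example.com',
  'https://eu-west.api.example.com',
  'https://ap-south.api.example.com',
];

async function fastestRegion(): Promise<string> {
  const probes = regions.map(async (url) => {
    const start = performance.now();
    await fetch(url + '/health', { method: 'HEAD' }); // placeholder path
    return { url, rtt: performance.now() - start };
  });
  const results = await Promise.all(probes);
  return results.sort((a, b) => a.rtt - b.rtt)[0].url;
}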

Client-Side Tips

Adaptive Jitter Buffers

Instead of a fixed 500ms buffer, use an adaptive one: start at 100ms and grow it only when you detect network dropouts (playback underruns). One possible policy is sketched below.
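The step sizes and cap here are illustrative tuning values, not Lokutor recommendations:

class AdaptiveJitterBuffer {
  private targetMs = 100;        // start small, per the advice above
  private readonly maxMs = 500;  // never exceed the old fixed buffer
  private readonly minMs = 100;

  // Call when playback stalls because the buffer ran dry.
  onUnderrun(): void {
    this.targetMs = Math.min(this.targetMs + 50, this.maxMs); // grow gently
  }

  // Optionally call during long stretches of stable playback
  // to claw the added latency back.
  onStable(): void {
    this.targetMs = Math.max(this.targetMs - 10, this.minMs);
  }

  get target(): number {
    return this.targetMs;
  }
}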

Pre-warming Connections

Keep the WebSocket connection open. The TCP and TLS handshakes plus the WebSocket upgrade can add 200-500ms of delay if you open a new connection for every request.
// Good: reuse connection
const client = new VoiceAgentClient({ apiKey: '...' });
await client.connect(); // Do this when the app starts

// Later...
client.sendAudio(data);

LLM Token Streaming

If you are using Lokutor’s standalone TTS with your own LLM, stream tokens from the LLM to Lokutor as they are generated. Waiting for the LLM to finish a full sentence stalls synthesis and adds that entire generation time to your latency. The pattern looks like the sketch below.
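This assumes an LLM client that exposes tokens as an async iterable and a hypothetical sendText/flush interface on the TTS side; the real Lokutor method names may differ:

// Forward each token the moment it arrives; never batch a full sentence.
async function speakAsGenerated(
  llmTokens: AsyncIterable<string>,
  tts: { sendText(t: string): void; flush(): void }, // hypothetical interface
): Promise<void> {
  for await (const token of llmTokens) {
    tts.sendText(token);
  }
  tts.flush(); // signal end of input so the final audio is emitted
}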