Lokutor provides frame-accurate lipsync metadata alongside our streaming audio. This metadata, known as Visemes, allows you to animate 2D or 3D characters in sync with the AI’s speech.

How it Works

When you request a synthesis with visemes: true, the server interleaves binary audio chunks with JSON text messages containing timing and character data.
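
For instance, the opening request might look like the following sketch. Only the visemes flag is documented here; the other field name is an illustrative assumption, not the documented schema:
{
  "text": "Hello from Lokutor!",  // hypothetical field name
  "visemes": true                 // enables interleaved viseme metadata
}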

Viseme Message Format

Each message is an array of objects representing the alignment for the immediately following audio data.
[
  {
    "v": 12,       // Index of the character in the source text
    "c": "o",      // The character/phoneme being spoken
    "t": 0.418     // Offset in seconds from the start of the audio stream
  }
]

Mapping Characters to Visuals

While the API provides raw character data, you should map these characters to specific mouth shapes (Visemes). Below is a suggested mapping for our included viseme assets:
Character/Group    Viseme Shape
a, e, i            Open Medium
o, u, w            Open Round
m, p, b            Lips Together
f, v               Lip Bite
t, d, s, z         Teeth Close
l                  Tongue Up
(Silence)          Closed
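
As a sketch, the table above can be expressed as a simple lookup. The object name, the shape strings, and the fallback to Closed are our own choices, not part of the API:
// Suggested character-to-viseme lookup based on the table above.
const VISEME_MAP = {
  a: 'Open Medium', e: 'Open Medium', i: 'Open Medium',
  o: 'Open Round',  u: 'Open Round',  w: 'Open Round',
  m: 'Lips Together', p: 'Lips Together', b: 'Lips Together',
  f: 'Lip Bite', v: 'Lip Bite',
  t: 'Teeth Close', d: 'Teeth Close', s: 'Teeth Close', z: 'Teeth Close',
  l: 'Tongue Up',
};

function shapeForCharacter(c) {
  // Unmapped characters (including silence) fall back to the Closed shape.
  return VISEME_MAP[c?.toLowerCase()] ?? 'Closed';
}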

Implementation Strategy

Syncing visemes requires matching the metadata timestamps (t) with your audio player’s current playback time.

1. Store Metadata

Maintain a queue or list of incoming visemes decoded from the WebSocket text messages.
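
A minimal sketch of this step, assuming the stream arrives over a standard browser WebSocket (socket, visemesData, and playAudioChunk are illustrative names, not part of the documented API):
// Route interleaved messages: text frames carry viseme JSON arrays,
// binary frames carry the audio chunks they align with.
const visemesData = [];

socket.binaryType = 'arraybuffer';
socket.onmessage = (event) => {
  if (typeof event.data === 'string') {
    // Metadata for the immediately following audio data.
    visemesData.push(...JSON.parse(event.data));
  } else {
    playAudioChunk(event.data); // hypothetical playback helper
  }
};
The visemesData queue then feeds the synchronization loop below.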

2. Synchronization Loop

Use requestAnimationFrame to check the current time of your AudioContext or audio player and switch the visual state when the current time exceeds the viseme’s timestamp.
function updateAnimation() {
  const currentTime = audioContext.currentTime;
  
  // Find the viseme that matches the current audio time
  const activeViseme = visemesData.find((v, i) => {
    const next = visemesData[i+1];
    return currentTime >= v.t && (!next || currentTime < next.t);
  });

  if (activeViseme) {
    updateMouthShape(activeViseme.c);
  }
  
  requestAnimationFrame(updateAnimation);
}

High-Quality Animation Tips

  • Interpolation: For 3D characters, don’t just snap between shapes. Use linear interpolation (lerp) over 20-50ms to create smooth transitions between blendshapes; a sketch follows this list.
  • Pre-loading: Pre-load your character assets or SVGs before starting the stream to avoid flickering on the first response.
  • Latency Buffering: Viseme messages are sent slightly before the audio they represent. This gives your frontend a tiny window to prepare the next animation frame.
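
A rough illustration of the interpolation tip, assuming a 40ms transition window and a hypothetical setBlendshapeWeight engine call:
// Blend blendshape weights toward the target viseme over a short
// window (within the suggested 20-50ms) instead of snapping.
const TRANSITION_MS = 40;
let fromWeights = {};
let toWeights = {};
let transitionStart = 0;

function setTargetViseme(shape, now) {
  fromWeights = { ...toWeights };
  toWeights = { [shape]: 1 }; // full weight on the new shape
  transitionStart = now;
}

function applyBlend(now) {
  const t = Math.min((now - transitionStart) / TRANSITION_MS, 1);
  const shapes = new Set([...Object.keys(fromWeights), ...Object.keys(toWeights)]);
  for (const shape of shapes) {
    const from = fromWeights[shape] ?? 0;
    const to = toWeights[shape] ?? 0;
    setBlendshapeWeight(shape, from + (to - from) * t); // hypothetical engine call
  }
}
You would call setTargetViseme in place of an instant updateMouthShape, then call applyBlend once per requestAnimationFrame tick.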