How it Works
When you request a synthesis with `visemes: true`, the server interleaves binary audio chunks with JSON text messages containing timing and character data.
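A minimal sketch of routing the two frame types on the client, assuming a browser WebSocket; the endpoint URL and the `t`/`char` field names here are placeholders, not the exact wire format:

```typescript
// Sketch: route interleaved frames from an open synthesis stream.
// STREAM_URL is a placeholder, not a real endpoint.
const STREAM_URL = "wss://example.invalid/synthesize";
const socket = new WebSocket(STREAM_URL);
socket.binaryType = "arraybuffer"; // receive binary audio frames as ArrayBuffers

const audioChunks: ArrayBuffer[] = [];
const visemeQueue: Array<{ t: number; char: string }> = []; // field names assumed

socket.onmessage = (event: MessageEvent) => {
  if (event.data instanceof ArrayBuffer) {
    // Binary frame: a chunk of synthesized audio.
    audioChunks.push(event.data);
  } else {
    // Text frame: a JSON array of viseme alignment objects.
    visemeQueue.push(...JSON.parse(event.data as string));
  }
};
```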
Viseme Message Format
Each message is an array of objects describing the alignment for the audio data that immediately follows.
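As an illustration, a single decoded text message might look like the following; the timestamp field `t` is referenced later in this guide, while the other field name and the unit (seconds) are assumptions:

```typescript
// Illustrative payload from one WebSocket text frame (not the exact wire format).
const visemeMessage: Array<{ t: number; char: string }> = [
  { t: 0.0, char: "h" }, // t: offset into the following audio, assumed to be in seconds
  { t: 0.08, char: "e" },
  { t: 0.15, char: "l" },
  { t: 0.22, char: "o" },
];
```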
Mapping Characters to Visuals
While the API provides raw character data, you should map these characters to specific mouth shapes (visemes). Below is a suggested mapping for our included viseme assets:

| Character/Group | Viseme Shape | Preview |
|---|---|---|
| a, e, i | Open Medium | |
| o, u, w | Open Round | |
| m, p, b | Lips Together | |
| f, v | Lip Bite | |
| t, d, s, z | Teeth Close | |
| l | Tongue Up | |
| (Silence) | Closed | |
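In code, the table above can be captured as a simple lookup map with a closed-mouth fallback for silence and unmapped characters; the shape names mirror the table and are otherwise arbitrary:

```typescript
type VisemeShape =
  | "open-medium"
  | "open-round"
  | "lips-together"
  | "lip-bite"
  | "teeth-close"
  | "tongue-up"
  | "closed";

// Character-to-shape lookup built from the mapping table above.
const VISEME_MAP: Record<string, VisemeShape> = {
  a: "open-medium", e: "open-medium", i: "open-medium",
  o: "open-round", u: "open-round", w: "open-round",
  m: "lips-together", p: "lips-together", b: "lips-together",
  f: "lip-bite", v: "lip-bite",
  t: "teeth-close", d: "teeth-close", s: "teeth-close", z: "teeth-close",
  l: "tongue-up",
};

// Silence and any unmapped character fall back to the closed mouth.
function shapeFor(char: string): VisemeShape {
  return VISEME_MAP[char.toLowerCase()] ?? "closed";
}
```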
Implementation Strategy
Syncing visemes requires matching the metadata timestamps (`t`) with your audio player's current playback time.
1. Store Metadata
Maintain a queue or list of incoming visemes decoded from the WebSocket text messages.
2. Synchronization Loop
Use `requestAnimationFrame` to check the current time of your `AudioContext` or audio player and switch the visual state when the current time exceeds the viseme's timestamp.
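A minimal sketch of that loop, reusing the `visemeQueue` and `shapeFor` helpers from the earlier examples; `setMouthShape` is a hypothetical callback into your renderer, and `playbackStartTime` is assumed to be captured when audio playback begins:

```typescript
declare function setMouthShape(shape: VisemeShape): void; // provided by your renderer (hypothetical)

const audioContext = new AudioContext();
let playbackStartTime = 0; // set to audioContext.currentTime when playback starts
let nextIndex = 0;

function syncLoop(): void {
  const elapsed = audioContext.currentTime - playbackStartTime;

  // Apply every queued viseme whose timestamp has been reached.
  while (nextIndex < visemeQueue.length && visemeQueue[nextIndex].t <= elapsed) {
    const viseme = visemeQueue[nextIndex++];
    setMouthShape(shapeFor(viseme.char));
  }

  requestAnimationFrame(syncLoop);
}

requestAnimationFrame(syncLoop);
```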
High-Quality Animation Tips
- Interpolation: For 3D characters, don't just snap between shapes. Apply linear interpolation (LERP) over 20-50 ms to create smooth transitions between blendshapes (see the sketch after this list).
- Pre-loading: Pre-load your character assets or SVGs before starting the stream to avoid flickering on the first response.
- Latency Buffering: Viseme messages are sent slightly before the audio they represent. This gives your frontend a tiny window to prepare the next animation frame.
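A rough sketch of the interpolation tip, assuming blendshape weights in the 0-1 range keyed by shape name; the function and constant names are placeholders:

```typescript
const BLEND_DURATION_MS = 40; // within the suggested 20-50 ms window

function lerp(from: number, to: number, alpha: number): number {
  return from + (to - from) * alpha;
}

// Blend from the previous viseme's weights toward the target's, given the
// time elapsed since the target viseme became active.
function blendWeights(
  previous: Record<string, number>,
  target: Record<string, number>,
  elapsedMs: number,
): Record<string, number> {
  const alpha = Math.min(elapsedMs / BLEND_DURATION_MS, 1);
  const result: Record<string, number> = {};
  const shapes = Object.keys({ ...previous, ...target });
  for (const shape of shapes) {
    result[shape] = lerp(previous[shape] ?? 0, target[shape] ?? 0, alpha);
  }
  return result;
}
```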