How it Works
When you request a synthesis with `visemes: true`, the server interleaves binary audio chunks with JSON text messages containing timing and character data.
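A minimal sketch of routing the two frame types on the client, assuming a browser WebSocket; the endpoint URL and the `t`/`char` field names here are placeholders, not the exact wire format:

```typescript
// Sketch: route interleaved frames from an open synthesis stream.
// STREAM_URL is a placeholder, not a real endpoint.
const STREAM_URL = "wss://example.invalid/synthesize";
const socket = new WebSocket(STREAM_URL);
socket.binaryType = "arraybuffer"; // receive binary audio frames as ArrayBuffers

const audioChunks: ArrayBuffer[] = [];
const visemeQueue: Array<{ t: number; char: string }> = []; // field names assumed

socket.onmessage = (event: MessageEvent) => {
  if (event.data instanceof ArrayBuffer) {
    // Binary frame: a chunk of synthesized audio.
    audioChunks.push(event.data);
  } else {
    // Text frame: a JSON array of viseme alignment objects.
    visemeQueue.push(...JSON.parse(event.data as string));
  }
};
```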
Viseme Message Format
Each message is an array of objects describing the alignment for the audio data that immediately follows.
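As an illustration, a single decoded text message might look like the following; the timestamp field `t` is referenced later in this guide, while the other field name and the unit (seconds) are assumptions:

```typescript
// Illustrative payload from one WebSocket text frame (not the exact wire format).
const visemeMessage: Array<{ t: number; char: string }> = [
  { t: 0.0, char: "h" }, // t: offset into the following audio, assumed to be in seconds
  { t: 0.08, char: "e" },
  { t: 0.15, char: "l" },
  { t: 0.22, char: "o" },
];
```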
Mapping Characters to Visuals
While the API provides raw character data, you should map these characters to specific mouth shapes (visemes). Below is a suggested mapping for our included viseme assets:

| Character/Group | Viseme Shape | Preview |
|---|---|---|
| a, e, i | Open Medium | |
| o, u, w | Open Round | |
| m, p, b | Lips Together | |
| f, v | Lip Bite | |
| t, d, s, z | Teeth Close | |
| l | Tongue Up | |
| (Silence) | Closed | |
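In code, the table above can be captured as a simple lookup map with a closed-mouth fallback for silence and unmapped characters; the shape names mirror the table and are otherwise arbitrary:

```typescript
type VisemeShape =
  | "open-medium"
  | "open-round"
  | "lips-together"
  | "lip-bite"
  | "teeth-close"
  | "tongue-up"
  | "closed";

// Character-to-shape lookup built from the mapping table above.
const VISEME_MAP: Record<string, VisemeShape> = {
  a: "open-medium", e: "open-medium", i: "open-medium",
  o: "open-round", u: "open-round", w: "open-round",
  m: "lips-together", p: "lips-together", b: "lips-together",
  f: "lip-bite", v: "lip-bite",
  t: "teeth-close", d: "teeth-close", s: "teeth-close", z: "teeth-close",
  l: "tongue-up",
};

// Silence and any unmapped character fall back to the closed mouth.
function shapeFor(char: string): VisemeShape {
  return VISEME_MAP[char.toLowerCase()] ?? "closed";
}
```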
Implementation Strategy
Syncing visemes requires matching the metadata timestamps (`t`) with your audio player's current playback time.
1. Store Metadata
Maintain a queue or list of incoming visemes decoded from the WebSocket text messages.
2. Synchronization Loop
Use `requestAnimationFrame` to check the current time of your `AudioContext` or audio player and switch the visual state when the current time exceeds the viseme's timestamp.
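A minimal sketch of that loop, reusing the `visemeQueue` and `shapeFor` helpers from the earlier examples; `setMouthShape` is a hypothetical callback into your renderer, and `playbackStartTime` is assumed to be captured when audio playback begins:

```typescript
declare function setMouthShape(shape: VisemeShape): void; // provided by your renderer (hypothetical)

const audioContext = new AudioContext();
let playbackStartTime = 0; // set to audioContext.currentTime when playback starts
let nextIndex = 0;

function syncLoop(): void {
  const elapsed = audioContext.currentTime - playbackStartTime;

  // Apply every queued viseme whose timestamp has been reached.
  while (nextIndex < visemeQueue.length && visemeQueue[nextIndex].t <= elapsed) {
    const viseme = visemeQueue[nextIndex++];
    setMouthShape(shapeFor(viseme.char));
  }

  requestAnimationFrame(syncLoop);
}

requestAnimationFrame(syncLoop);
```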
High-Quality Animation Tips
- Interpolation: For 3D characters, don't just snap between shapes. Apply linear interpolation (LERP) over 20-50 ms to create smooth transitions between blendshapes (see the sketch after this list).
- Pre-loading: Pre-load your character assets or SVGs before starting the stream to avoid flickering on the first response.
- Latency Buffering: Viseme messages are sent slightly before the audio they represent. This gives your frontend a tiny window to prepare the next animation frame.
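A rough sketch of the interpolation tip, assuming blendshape weights in the 0-1 range keyed by shape name; the function and constant names are placeholders:

```typescript
const BLEND_DURATION_MS = 40; // within the suggested 20-50 ms window

function lerp(from: number, to: number, alpha: number): number {
  return from + (to - from) * alpha;
}

// Blend from the previous viseme's weights toward the target's, given the
// time elapsed since the target viseme became active.
function blendWeights(
  previous: Record<string, number>,
  target: Record<string, number>,
  elapsedMs: number,
): Record<string, number> {
  const alpha = Math.min(elapsedMs / BLEND_DURATION_MS, 1);
  const result: Record<string, number> = {};
  const shapes = Object.keys({ ...previous, ...target });
  for (const shape of shapes) {
    result[shape] = lerp(previous[shape] ?? 0, target[shape] ?? 0, alpha);
  }
  return result;
}
```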