Mist v3 JSON WebSocket (/ws3): structured events with base64 audio chunks and word-level timestamps.
., ?, !. This is most pertinent for the initial messages sent to the API, as synthesis won’t begin until there are sufficient tokens to generate audio with natural prosody. After the first synthesis of any given utterance, typically enough time has elapsed that subsequent audio contains multiple clauses, and the buffering becomes largely invisible.
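The punctuation rule above can be mirrored on the client to estimate how much of the text sent so far is eligible for synthesis. The following is a minimal sketch, not the server's actual buffering logic: the function name `split_ready_text` is hypothetical, and it only checks for the three terminators listed above, ignoring the separate requirement that enough tokens be available.

```python
# Sentence-ending punctuation named in the text above.
TERMINATORS = ".?!"


def split_ready_text(buffer: str) -> tuple[str, str]:
    """Split buffered text at the last terminator.

    Returns (ready, pending): `ready` ends with the last terminator found,
    `pending` is whatever follows it and would still be waiting.
    Hypothetical client-side helper, not part of the API.
    """
    last = max(buffer.rfind(ch) for ch in TERMINATORS)
    if last == -1:
        # No terminator yet, so nothing is known to be ready.
        return "", buffer
    return buffer[: last + 1], buffer[last + 1 :]
```

A client that streams text token by token could use a check like this to decide when to expect the first audio, or to append a terminator when the final message lacks one.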
The words, start, and end arrays in word_timestamps are the same length and index-aligned: for a given index i, words[i] is spoken from start[i] to end[i]. Times are in seconds, measured from the beginning of the audio for the current synthesis. If a context id was attached to the text that produced this audio, it is included on the event.
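Because the three arrays are index-aligned, they can be zipped into per-word records. The sketch below assumes the `word_timestamps` object has already been parsed from JSON into a dict with `words`, `start`, and `end` keys, as described above. The `WordTiming` type and the function name are hypothetical.

```python
from typing import NamedTuple


class WordTiming(NamedTuple):
    word: str
    start: float  # seconds from the beginning of the current synthesis
    end: float    # seconds from the beginning of the current synthesis


def align_word_timestamps(word_timestamps: dict) -> list[WordTiming]:
    """Zip the index-aligned arrays into one record per word."""
    words = word_timestamps["words"]
    starts = word_timestamps["start"]
    ends = word_timestamps["end"]
    # The arrays are documented as equal length; fail loudly if they are not.
    if not (len(words) == len(starts) == len(ends)):
        raise ValueError("words, start, and end must have the same length")
    return [WordTiming(w, s, e) for w, s, e in zip(words, starts, ends)]
```

The records are convenient for driving captions or highlighting. Since the times are relative to the current synthesis, a client that plays several syntheses back to back would need to add its own running offset.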
Example payload:
done event. This signals that the current synthesis is fully complete. If the client sends more text and triggers further synthesis, another done will follow.
When done fires depends on the segment setting. See Segmentation & Behavior Settings for full details.
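A client typically accumulates audio until it sees done, then treats the synthesis as finished while staying ready for another one. The following is a minimal sketch of that loop body. The field names `type` and `data` and the event name `chunk` are assumptions made for illustration and should be checked against the actual event reference. Only base64-encoded audio and the done event are taken from the text above.

```python
import base64
import json


def handle_message(raw: str, audio: bytearray) -> bool:
    """Process one JSON message; return True when the synthesis is complete.

    Assumed message shapes (hypothetical):
      {"type": "chunk", "data": "<base64 audio>"}
      {"type": "done"}
    """
    event = json.loads(raw)
    kind = event.get("type")
    if kind == "chunk":
        # Audio arrives base64-encoded; decode and append the raw bytes.
        audio.extend(base64.b64decode(event["data"]))
    elif kind == "done":
        # The current synthesis is fully complete. Sending more text later
        # can start a new synthesis, which ends with another done.
        return True
    return False
```

Returning a flag rather than closing the connection keeps the handler reusable across syntheses, which matches the note that another done follows if the client sends more text.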
mistv3.
pcm, mulaw, or mp3
Hi. <200> I'd love to have a conversation with you. adds a 200ms pause. Learn more about custom pauses.