Mist v2 JSON WebSocket (/ws2): structured events with base64 audio chunks and word-level timestamps.
., ?, !. This is most pertinent for the initial messages sent to the API, as synthesis won’t begin until there are sufficient tokens to generate audio with natural prosody. After the first synthesis of any given utterance, typically enough time has elapsed that subsequent audio contains multiple clauses, and the buffering becomes largely invisible.
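The buffering described above can be pictured with a small client-side model. This is only an illustration of the idea that text is held until a sentence terminator (`.`, `?`, `!`) arrives. It is not Rime's actual segmenter, and `SentenceBuffer` is a hypothetical name introduced here.

```python
import re

# Sentence terminators named in the text above; runs such as "?!" count once.
_TERMINATORS = re.compile(r"[.?!]+")


class SentenceBuffer:
    """Hypothetical model: hold streamed text until a terminator arrives."""

    def __init__(self) -> None:
        self._pending = ""

    def push(self, text: str) -> list[str]:
        """Add streamed text and return any segments that are now complete."""
        self._pending += text
        ready = []
        while True:
            match = _TERMINATORS.search(self._pending)
            if match is None:
                break  # no terminator yet, keep buffering
            segment = self._pending[: match.end()].strip()
            if segment:
                ready.append(segment)
            self._pending = self._pending[match.end():]
        return ready
```

In this model, a first message without a terminator yields nothing, which mirrors why the earliest messages sent to the API are where the buffering is most noticeable.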
contextId: null, and the audio for the second will be tagged with its UUID.
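Because audio is tagged with the context id of the text that produced it, a client can group decoded audio per context. The sketch below assumes audio events carry `"type": "chunk"` and a base64 string under `"data"`. Both field names are assumptions, not taken from this page, and `collect_audio` is a hypothetical helper.

```python
import base64
import json
from collections import defaultdict


def collect_audio(raw_messages):
    """Decode base64 audio from chunk events, grouped by context id."""
    audio = defaultdict(bytearray)
    for raw in raw_messages:
        event = json.loads(raw)
        # Assumed event type name; other event kinds are skipped.
        if event.get("type") != "chunk":
            continue
        # contextId is None when the source text carried no context id.
        context_id = event.get("contextId")
        # "data" is an assumed field name for the base64 payload.
        audio[context_id].extend(base64.b64decode(event["data"]))
    return {key: bytes(value) for key, value in audio.items()}
```

Audio for text sent without a context id ends up under the `None` key, and audio for each tagged message ends up under its own id.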
word_timestamps are the same length and index-aligned: for a given index i, words[i] is spoken from start[i] to end[i]. Times are in seconds, measured from the beginning of the audio for the current synthesis. If a context id was attached to the text that produced this audio, it is included on the event.
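As a minimal sketch of consuming the index-aligned arrays, the helper below zips `words`, `start`, and `end` into per-word tuples. The function name `word_timings` is hypothetical, and the sample values in the usage note are made up for illustration.

```python
def word_timings(word_timestamps):
    """Return (word, start_seconds, end_seconds) tuples from aligned arrays."""
    words = word_timestamps["words"]
    starts = word_timestamps["start"]
    ends = word_timestamps["end"]
    # The arrays are documented as equal length; fail loudly if they are not.
    if not (len(words) == len(starts) == len(ends)):
        raise ValueError("words, start, and end must have the same length")
    return list(zip(words, starts, ends))
```

For example, `word_timings({"words": ["hi", "there"], "start": [0.0, 0.25], "end": [0.25, 0.5]})` pairs `"hi"` with the interval 0.0 to 0.25 seconds and `"there"` with 0.25 to 0.5 seconds.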
Example payload:
mistv2.
mp3, mulaw, or pcm
immediate=true in query params is equivalent to segment=immediate. If a null value is provided, it will default to “bySentence”.
true, Rime will save any out-of-vocabulary (OOV) words encountered in the text for the User or Team to review on the Speech QA dashboard. Note: It may take up to 15 minutes for OOV words to appear on your dashboard.
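The relationship between `immediate=true`, `segment=immediate`, and the “bySentence” default can be summarized in a few lines. This is a client-side sketch of the rule as stated here, not server code, and `resolve_segment` is a hypothetical helper name.

```python
from urllib.parse import urlencode


def resolve_segment(segment=None, immediate=False):
    """Return the effective segment mode for the given query parameters."""
    if immediate:
        # immediate=true is described as equivalent to segment=immediate.
        return "immediate"
    if segment is None:
        # A null value falls back to the documented default.
        return "bySentence"
    return segment


# Build only the segment portion of a query string.
query = urlencode({"segment": resolve_segment()})
```

Normalizing to a single `segment` value on the client avoids sending both spellings of the same setting.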