Websockets JSON
Arcana JSON WebSocket (/ws3): structured events with base64 audio chunks and word-level timestamps.
Overview
In addition to a plaintext websocket implementation, Rime also has an implementation that sends and receives events as JSON objects. Like the other implementation, all synthesis arguments are provided as query parameters when establishing the connection.
All requests require authentication with a bearer token in the Authorization header: Authorization: Bearer YOUR_API_KEY. See API authentication for how to create a key.
The websocket API buffers inputs up to one of the following punctuation characters: ., ?, !. This is most pertinent for the initial messages sent to the API, as synthesis won’t begin until there are sufficient tokens to generate audio with natural prosody. After the first synthesis of any given utterance, typically enough time has elapsed that subsequent audio contains multiple clauses, and the buffering becomes largely invisible.
Messages
Send
Text
This is the most common message, which contains text for synthesis.
Context IDs can be provided, which will be attached to subsequent messages that the server sends back to the client. Rime will not maintain multiple simultaneous context IDs. The events will contain the most recent context ID at the time that audio was requested. For example, if the client sends a first sentence without a context ID and then a second sentence with a UUID as its context ID, then even if both messages are received by the server before it sends any audio, the audio response for the first sentence will be tagged with contextId: null, and the audio for the second will be tagged with its UUID.
Clear
Your client can clear out the accumulated buffer, which is useful in the case of interruptions.
Flush
This forces whatever buffer exists, if any, to be synthesized, and the generated audio to be sent over.
EOS
At times, your client will want to generate audio for whatever remains in the buffer and then have the connection closed immediately.
Receive
Chunk
The most common event will be the audio chunk.
The audio will be a base64 encoded chunk of audio bytes in the audio format specified when the connection was established. If you provided a context ID when sending the relevant text, it’ll be included here.
Timestamps
Word-level timestamps are emitted alongside the audio chunks so the client can tell exactly which words have been spoken at any point. This is especially useful for handling interruptions: when the user starts talking over the output, you can map the playback position back to the last word that was actually heard.
The three arrays inside word_timestamps are the same length and index-aligned: for a given index i, words[i] is spoken from start[i] to end[i]. Times are in seconds, measured from the beginning of the audio for the current synthesis. If a context ID was attached to the text that produced this audio, it is included on the event.
Example payload:
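A timestamps event of this shape can be mapped back to a playback position roughly as follows. This is a minimal sketch, not Rime's client code: the sample event is made up for illustration, the helper name last_word_heard is hypothetical, and it assumes the end times are non-decreasing and that a word counts as heard only once its end time has passed.

```python
from bisect import bisect_right

# Illustrative event only: the words and times below are invented, not
# captured from the service. The field layout (word_timestamps with
# index-aligned words/start/end arrays) follows the description above.
sample_event = {
    "word_timestamps": {
        "words": ["Hello", "there", "friend"],
        "start": [0.0, 0.42, 0.80],
        "end": [0.40, 0.78, 1.25],
    },
    "contextId": None,
}


def last_word_heard(word_timestamps, playback_seconds):
    """Return the last word fully spoken by playback_seconds, or None.

    Assumes the end times are sorted in non-decreasing order.
    """
    words = word_timestamps["words"]
    ends = word_timestamps["end"]
    # Number of words whose end time is <= the playback position.
    finished = bisect_right(ends, playback_seconds)
    return words[finished - 1] if finished > 0 else None


# Playback was interrupted 0.9 s into the audio for this synthesis.
interrupted_at = last_word_heard(sample_event["word_timestamps"], 0.9)
```

Counting only fully spoken words is a design choice; a client that prefers to include a partially spoken word could compare against the start array instead.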
Done
After the last audio chunk for a synthesis batch has been sent, the server emits a done event. This signals that the current synthesis is fully complete. If the client sends more text and triggers further synthesis, another done will follow.
When exactly done fires depends on the segment setting. See Segmentation & Behavior Settings for full details.
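The chunk and done events described above could be consumed with a small handler along these lines. This is a sketch under stated assumptions: the type discriminator field, its values "chunk" and "done", and the data field holding the base64 audio are assumed names that are not confirmed by the text here; only contextId is. The class name SynthesisCollector is hypothetical.

```python
import base64
import json


class SynthesisCollector:
    """Accumulates decoded audio and counts completed synthesis batches."""

    def __init__(self):
        self.audio = bytearray()   # decoded audio bytes, in arrival order
        self.context_ids = []      # context ID attached to each chunk
        self.done_count = 0        # a done event may arrive more than once

    def handle(self, raw_message):
        """Process one JSON text frame and return its event type."""
        event = json.loads(raw_message)
        kind = event.get("type")   # assumed discriminator field
        if kind == "chunk":
            # Assumed field name "data" for the base64-encoded audio.
            self.audio.extend(base64.b64decode(event["data"]))
            self.context_ids.append(event.get("contextId"))
        elif kind == "done":
            self.done_count += 1
        return kind
```

Because another done follows each later synthesis, the sketch counts done events rather than treating the first one as the end of the session.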
Error
In the event of a malformed or unexpected input, the server will immediately respond with an error message. The server will not close the connection, and will still accept subsequent well-formed messages. It’s up to the client to decide if it wants to close upon receiving an error.
Variable Parameters
The speaker must be one of the voices listed in our documentation for arcana.
The text you’d like spoken. Character limit per request is 500 via the API and 1,000 in the dashboard UI.
The model. This value must be set to arcana; otherwise the websockets server will default to mistv2 for speech synthesis.
The audio format. One of mp3, mulaw, or pcm.
The language. If provided, it must match the language spoken by the provided speaker. This can be checked in our voices documentation.
The sampling rate (Hz).
- On-cloud: Accepted values: 8000, 16000, 22050, 24000, 44100, 48000, 96000. Anything above 24000 is upsampled.
- On-prem: Any value is accepted.
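A client may want to check the sampling rate before opening a connection. The sketch below encodes the two rules above; the function name check_sampling_rate and the on_prem flag are hypothetical, and the server remains the authority on what it accepts.

```python
# Accepted on-cloud values, copied from the list above.
ON_CLOUD_RATES = (8000, 16000, 22050, 24000, 44100, 48000, 96000)


def check_sampling_rate(rate_hz, on_prem=False):
    """Return (accepted, upsampled) for a requested sampling rate in Hz.

    upsampled is None when it cannot be determined from the rules above
    (on-prem deployments, or a rejected on-cloud value).
    """
    if on_prem:
        return True, None            # on-prem: any value is accepted
    if rate_hz not in ON_CLOUD_RATES:
        return False, None           # not in the on-cloud list
    return True, rate_hz > 24000     # above 24000 Hz is upsampling
```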
Controls how text is segmented for synthesis. Available options:
- “immediate” - Synthesizes text immediately without waiting for complete sentences
- “never” - Never segments the text, waits for explicit flush or EOS
- “bySentence” (default) - Waits for complete sentences before synthesis
immediate=true in query params is equivalent to segment=immediate. If a null value is provided, it will default to “bySentence”.
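The defaulting rules for segment can be mirrored on the client side when assembling query parameters. This is a rough sketch: resolve_segment and build_query are hypothetical helpers, and letting an explicit segment value take precedence over immediate=true is an assumption, since the text does not say what happens when both are supplied.

```python
from urllib.parse import urlencode

VALID_SEGMENTS = ("immediate", "never", "bySentence")


def resolve_segment(params):
    """Return the effective segment setting for a dict of query params."""
    segment = params.get("segment")
    if segment is None:
        # immediate=true is equivalent to segment=immediate.
        if str(params.get("immediate", "")).lower() == "true":
            return "immediate"
        return "bySentence"          # default when nothing is provided
    if segment not in VALID_SEGMENTS:
        raise ValueError(f"unknown segment value: {segment!r}")
    # Assumption: an explicit segment wins over immediate=true.
    return segment


def build_query(params):
    """Encode params as a query string with a normalized segment value."""
    cleaned = {
        key: value
        for key, value in params.items()
        if value is not None and key != "immediate"
    }
    cleaned["segment"] = resolve_segment(params)
    return urlencode(cleaned)
```

For instance, build_query({"immediate": "true"}) yields segment=immediate, while an empty dict yields segment=bySentence.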
