Sub-200ms latency is standard; contact us to optimize further.
On-premise deployments are available and have sub-100ms latency.
The response time from any text-to-speech API depends on a variety of factors including network latency, the length of the text input, any text preprocessing needed prior to model inference, and payload size.
Rime’s TTS API has been designed from the ground up to be the fastest to respond. The sections below cover the factors that contribute to response latency and offer recommendations for reducing it.
Assuming the client can stream audio, whether on-device or through a telephony provider or system, streaming requests to our API will always be faster than non-streaming requests. The reason is simple: a great deal of end-to-end response time comes down to the client. By reducing the size of the initial payload chunk, we reduce time to first byte (TTFB), the moment at which audio can begin playing.
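As a rough sketch of what this looks like in practice, the snippet below issues a streaming request and records both TTFB and total response time. The endpoint URL, header names, and body fields here are illustrative placeholders, not the documented API surface; consult the API reference for the real values.

```python
import time
import requests

# Illustrative values only -- see the API reference for the real endpoint,
# authentication header, and request fields.
URL = "https://users.rime.ai/v1/rime-tts"
HEADERS = {"Authorization": "Bearer YOUR_API_KEY", "Accept": "audio/pcm"}
BODY = {"speaker": "example_speaker", "text": "Hello! Thanks for calling."}

start = time.monotonic()
ttfb = None
audio = bytearray()

with requests.post(URL, json=BODY, headers=HEADERS, stream=True) as resp:
    resp.raise_for_status()
    for chunk in resp.iter_content(chunk_size=4096):
        if ttfb is None:
            # Time to first byte: the moment playback could begin.
            ttfb = time.monotonic() - start
        # In a real client you would feed each chunk straight to the audio
        # device or telephony stream instead of buffering the whole file.
        audio.extend(chunk)

total = time.monotonic() - start
print(f"TTFB: {ttfb:.3f}s, total response time: {total:.3f}s")
```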
The image above contains a plot of both response time and TTFB for the following request body to our streaming PCM endpoint:
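The exact request body used for that benchmark isn't reproduced here; a representative body might look like the sketch below, where `modelId` and `samplingRate` are parameters discussed on this page and the remaining field names and values are illustrative.

```python
# Representative streaming PCM request body (illustrative values).
request_body = {
    "speaker": "example_speaker",   # illustrative voice ID
    "text": "Hi! How can I help you today?",
    "modelId": "mistv2",            # model selection, discussed below
    "samplingRate": 22050,          # example rate, discussed below
}
```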
Our `mistv2` and `mist` models are faster across the board than our v1 models (which were deprecated in February 2025). You can select a Mist model for TTS inference using the `modelId` parameter.

You can also set the `reduceLatency` parameter, which turns off text normalization, to reduce the amount of computation needed to prepare input text for TTS inference. This can safely be used when the input contains no digits, abbreviations, or tricky punctuation, e.g. "Yes, I grew up on one twenty-three Main Street in Oakland, California." instead of "Yes, I grew up on 123 Main St. in Oakland, CA."
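A minimal sketch of those two parameters together; aside from `modelId` and `reduceLatency`, the field names and values are illustrative:

```python
# Select a Mist model and skip text normalization for pre-normalized input.
request_body = {
    "speaker": "example_speaker",
    "text": "Yes, I grew up on one twenty-three Main Street in Oakland, California.",
    "modelId": "mistv2",     # Mist models respond faster than the deprecated v1 models
    "reduceLatency": True,   # safe here: no digits, abbreviations, or tricky punctuation
}
```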
Finally, you can reduce payload size by requesting audio at a lower sampling rate via the `samplingRate` parameter.
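For instance, a telephony integration that only needs narrowband audio could request a lower rate to shrink the audio payload. The value shown is only an example; check the API reference for supported rates.

```python
# Smaller audio payloads stream sooner; 8000 Hz is typical for telephony.
request_body = {
    "speaker": "example_speaker",
    "text": "A lower sampling rate means a smaller payload.",
    "samplingRate": 8000,   # example value; verify supported rates in the API reference
}
```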