Overview
When we talk about performance for a real-time audio streaming service, we typically mean a combination of the following metrics:
- Initial latency, or time-to-first-frame/byte (TTFF/TTFB), defined as the time elapsed between sending the request and receiving the first frame. Lower is better.
- Real-time factor (RTF), defined as the ratio of the time spent on processing to the duration of the stream. A value ≤ 1 is required to stream in real time. Lower is better.
- Concurrency, i.e. the number of requests a service can handle at the same time. Higher is better.
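As a worked illustration of the first two definitions, here is a minimal sketch that computes them from raw timings. The function and parameter names are hypothetical and are not part of any Rime API.

```python
def initial_latency_seconds(request_sent_at: float, first_frame_at: float) -> float:
    """Time elapsed between sending the request and receiving the first frame."""
    return first_frame_at - request_sent_at


def real_time_factor(processing_seconds: float, stream_seconds: float) -> float:
    """Ratio of processing time to stream duration; <= 1 means real-time capable."""
    if stream_seconds <= 0:
        raise ValueError("stream_seconds must be positive")
    return processing_seconds / stream_seconds


# Example: 2 s of processing for a 5 s stream gives an RTF of 0.4.
rtf = real_time_factor(processing_seconds=2.0, stream_seconds=5.0)
can_stream_in_real_time = rtf <= 1.0
```

An RTF of 0.4 means the service produces audio 2.5 times faster than it is played back, which leaves headroom for additional concurrent requests.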
Rime does not provide rate-limiting or queuing support, as it is highly
dependent on the exact deployment being used.
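Since rate limiting is left to the deployment, one option is to cap the number of in-flight requests on the caller side. The following is only a sketch of that idea using a standard asyncio semaphore. The `limited` helper and the `make_request` callable are hypothetical, and the right limit depends on your own benchmark results.

```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


async def limited(
    semaphore: asyncio.Semaphore,
    make_request: Callable[[], Awaitable[T]],
) -> T:
    """Run make_request() while holding a slot in the semaphore.

    Callers beyond the semaphore's limit wait here until a slot frees up,
    which acts as a simple in-process queue.
    """
    async with semaphore:
        return await make_request()
```

Create the semaphore inside the running event loop, for example `asyncio.Semaphore(32)`, and share it across all callers that target the same server.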
Metrics
Rime provides some off-the-shelf metrics for reference only. Your mileage may vary depending on the hardware/software setup, latency constraints, and the actual traffic.
Methodology
We use the armchair tool to benchmark a system and report the following:
- Initial latency: measured when there is 1 concurrent request (-c 1).
- Max concurrency: measured at a performance target of 100% success rate, 99th percentile of latency ≤ 1s, and 99th percentile of RTF ≤ 1 (--target=success:1.00,ttfb:p99@1s,rtf:p99@1.00).
| Model | Hardware | Initial latency | Max concurrency |
|---|---|---|---|
| Arcana v2 | H100 | μ=400ms | 32 |
armchair is run with the default arguments, and from the same machine that
serves the image to eliminate network latency.
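To make the max-concurrency target concrete, the sketch below checks a set of per-request measurements against the same three conditions: 100% success, p99 latency ≤ 1s, and p99 RTF ≤ 1. It is not how armchair is implemented. The function names are hypothetical, and the nearest-rank percentile method is an assumption, so results may differ slightly from the tool's own report.

```python
import math
from typing import Sequence


def percentile(values: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile (an assumed method, not necessarily armchair's)."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def meets_target(
    ttfb_seconds: Sequence[float],
    rtfs: Sequence[float],
    succeeded: int,
    total: int,
) -> bool:
    """True if success rate is 100%, p99 TTFB <= 1 s, and p99 RTF <= 1."""
    return (
        succeeded == total
        and percentile(ttfb_seconds, 99) <= 1.0
        and percentile(rtfs, 99) <= 1.0
    )
```

Note that a p99 target tolerates rare outliers: with 100 requests, a single slow response does not fail the check, but two do.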
Performance tuning
Performance tuning is only available for Arcana model images tagged with 20251027 or later.
Environment variables
The model image exposes a set of environment variables that you can tune to improve concurrency under a given performance constraint:
- DECODER_MAX_BATCH, defaults to 32
- DECODER_NUM_SESSIONS, defaults to 6
- GENERATOR_MAX_BATCH, defaults to 32
- GENERATOR_GPU_MEMORY_UTILIZATION, defaults to 0.8
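One way to keep overrides explicit is to render them as KEY=VALUE lines, a format many container runtimes accept as an env file. The sketch below uses the variable names and defaults listed above. The `render_env_file` helper is hypothetical, and how the variables are passed to the image depends on your deployment.

```python
from typing import Mapping, Optional, Union

# Defaults as documented for the model image.
DEFAULTS = {
    "DECODER_MAX_BATCH": "32",
    "DECODER_NUM_SESSIONS": "6",
    "GENERATOR_MAX_BATCH": "32",
    "GENERATOR_GPU_MEMORY_UTILIZATION": "0.8",
}


def render_env_file(
    overrides: Optional[Mapping[str, Union[int, float, str]]] = None,
) -> str:
    """Merge overrides into the defaults and render KEY=VALUE lines."""
    overrides = overrides or {}
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        # Catch typos early instead of silently passing an ignored variable.
        raise KeyError(f"unknown variables: {sorted(unknown)}")
    merged = {**DEFAULTS, **{key: str(value) for key, value in overrides.items()}}
    return "\n".join(f"{key}={value}" for key, value in merged.items()) + "\n"


env_text = render_env_file({"DECODER_MAX_BATCH": 48, "GENERATOR_MAX_BATCH": 48})
```

Rejecting unknown names is a deliberate choice here: a misspelled variable would otherwise leave the default in place and make a tuning run look ineffective.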
Tuning
You can use armchair to tune the
environment variables with the following workflow:
- Get a base performance report by running armchair with specific performance constraints, without specifying the concurrency level (-c).
- If RTF is significantly lower than 1, increase both DECODER_MAX_BATCH and GENERATOR_MAX_BATCH.
- If the server fails to start with an OOM error, decrease GENERATOR_GPU_MEMORY_UTILIZATION.
- Repeat the process until the benchmarked concurrency level converges with DECODER_MAX_BATCH and GENERATOR_MAX_BATCH. This is the maximum concurrency that the system can accept.
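The workflow above can be sketched as a loop. This is an illustration under stated assumptions, not an official procedure. The `benchmark` callable is hypothetical and stands in for restarting the server with a given configuration and running armchair against it. The step sizes, the 0.8 threshold used for "significantly lower than 1", and the stopping rule (stop when concurrency no longer improves) are all assumptions to adjust.

```python
from dataclasses import dataclass, replace
from typing import Callable, Optional, Tuple


@dataclass(frozen=True)
class Config:
    decoder_max_batch: int = 32             # DECODER_MAX_BATCH
    generator_max_batch: int = 32           # GENERATOR_MAX_BATCH
    gpu_memory_utilization: float = 0.8     # GENERATOR_GPU_MEMORY_UTILIZATION


@dataclass(frozen=True)
class Report:
    started: bool          # False if the server failed to start (e.g. OOM)
    max_concurrency: int   # benchmarked concurrency under the target
    rtf_p99: float         # 99th percentile RTF at that concurrency


def tune(
    benchmark: Callable[[Config], Report],  # hypothetical: restart + benchmark
    config: Config = Config(),
    rtf_threshold: float = 0.8,   # assumed cut-off for "significantly below 1"
    batch_step: int = 8,          # assumed increment for both batch sizes
    memory_step: float = 0.05,    # assumed decrement on OOM
    max_rounds: int = 20,
) -> Tuple[Optional[Config], int]:
    best_config, best_concurrency = None, 0
    for _ in range(max_rounds):
        report = benchmark(config)
        if not report.started:
            # Server failed to start: lower GPU memory utilization and retry.
            lowered = round(config.gpu_memory_utilization - memory_step, 2)
            if lowered <= 0:
                break
            config = replace(config, gpu_memory_utilization=lowered)
            continue
        if report.max_concurrency <= best_concurrency:
            break  # no further gain: treat the result as converged
        best_config, best_concurrency = config, report.max_concurrency
        if report.rtf_p99 >= rtf_threshold:
            break  # RTF is no longer significantly below 1
        # Headroom left: raise both batch sizes together.
        config = replace(
            config,
            decoder_max_batch=config.decoder_max_batch + batch_step,
            generator_max_batch=config.generator_max_batch + batch_step,
        )
    return best_config, best_concurrency
```

Passing the benchmark in as a callable keeps the loop independent of how the server is restarted and how the report is parsed, both of which vary by deployment.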

