
Overview

When we talk about performance for a real-time audio streaming service, we typically mean a combination of the following metrics:
  • Initial latency, or time to first frame/byte (TTFF/TTFB): the time elapsed between sending the request and receiving the first frame. Lower is better.
  • Real-time factor (RTF): the ratio of the time spent on processing to the duration of the stream. A value ≤ 1 is required to stream in real time. Lower is better.
  • Concurrency: the number of requests a service can handle at the same time. Higher is better.
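As a rough illustration of the first two definitions, the sketch below derives TTFB and RTF from client-side timestamps. The `StreamTiming` fields are hypothetical names, and processing time is approximated here as the wall-clock time from sending the request to receiving the last frame, which is an assumption rather than how any particular tool measures it.

```python
from dataclasses import dataclass


@dataclass
class StreamTiming:
    """Client-side timestamps for one streamed request (hypothetical fields)."""
    request_sent: float    # seconds on a monotonic clock
    first_frame_at: float  # when the first audio frame arrived
    last_frame_at: float   # when the last audio frame arrived
    audio_seconds: float   # duration of the audio that was streamed


def ttfb(t: StreamTiming) -> float:
    # Initial latency: request sent -> first frame received.
    return t.first_frame_at - t.request_sent


def rtf(t: StreamTiming) -> float:
    # Real-time factor: processing time divided by stream duration.
    # Assumption: processing time ~ wall-clock time until the last frame.
    return (t.last_frame_at - t.request_sent) / t.audio_seconds


timing = StreamTiming(request_sent=0.0, first_frame_at=0.4,
                      last_frame_at=2.0, audio_seconds=4.0)
print(f"TTFB={ttfb(timing):.2f}s RTF={rtf(timing):.2f}")
```

In this example, 4 seconds of audio arrive within 2 seconds, so the RTF is 0.5 and the stream keeps up with real-time playback.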
Initial latency and RTF trade off against concurrency: both typically rise as the concurrency level rises. To keep a real-time streaming service performant under high load, we recommend capping the number of concurrent requests a system handles and queuing or rejecting requests beyond that capacity.
Rime does not provide rate-limiting or queuing support, because the right approach depends heavily on the exact deployment being used.
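Because capping is left to the deployment, one possible shape for it is an in-process gate in front of the streaming handler. The sketch below uses `asyncio.Semaphore` to admit a fixed number of requests, queue a few more, and reject the rest. `ConcurrencyGate`, `CapacityExceeded`, and the limits shown are hypothetical and not part of any Rime API.

```python
import asyncio


class CapacityExceeded(Exception):
    """Raised when a request is rejected because the service is at capacity."""


class ConcurrencyGate:
    """Admit up to max_concurrent requests, queue up to max_queued, reject the rest."""

    def __init__(self, max_concurrent: int, max_queued: int = 0):
        self._slots = asyncio.Semaphore(max_concurrent)
        self._max_queued = max_queued
        self._waiting = 0

    async def run(self, handler):
        # Reject right away if every slot is busy and the queue is full.
        if self._slots.locked() and self._waiting >= self._max_queued:
            raise CapacityExceeded()
        self._waiting += 1
        try:
            await self._slots.acquire()  # waits here while queued
        finally:
            self._waiting -= 1
        try:
            return await handler()
        finally:
            self._slots.release()


async def main():
    gate = ConcurrencyGate(max_concurrent=2, max_queued=1)

    async def fake_stream():
        await asyncio.sleep(0.01)  # stands in for a streaming request
        return "ok"

    results = await asyncio.gather(
        *(gate.run(fake_stream) for _ in range(4)), return_exceptions=True
    )
    return ["ok" if r == "ok" else "rejected" for r in results]


outcomes = asyncio.run(main())
print(outcomes)
```

With two slots and one queue position, four simultaneous requests yield three successes and one rejection. A multi-process or multi-host deployment would need the same limit enforced at a load balancer or a shared queue instead.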

Metrics

Rime provides some off-the-shelf metrics for reference only. Your mileage may vary depending on the hardware/software setup, latency constraints, and the actual traffic.

Methodology

We use the armchair tool for benchmarking a system and report the following:
  • Initial latency: measured when there is 1 concurrent request (-c 1).
  • Max concurrency: measured at a performance target of 100% success rate, 99th-percentile initial latency ≤ 1 s, and 99th-percentile RTF ≤ 1 (--target=success:1.00,ttfb:p99@1s,rtf:p99@1.00).
Model     | Hardware | Initial latency | Max concurrency
Arcana v2 | H100     | μ = 400 ms      | 32
armchair is run with the default arguments, and from the same machine that serves the image to eliminate network latency.
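To make the performance target concrete, the sketch below checks a set of per-request results against the same three conditions: 100% success, p99 initial latency ≤ 1 s, and p99 RTF ≤ 1. It is an independent illustration and not armchair's implementation. The sample format and the nearest-rank percentile method are assumptions.

```python
def percentile(values, pct):
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    ordered = sorted(values)
    rank = max(1, -(-len(ordered) * pct // 100))  # ceiling division
    return ordered[rank - 1]


def meets_target(samples, ttfb_limit_s=1.0, rtf_limit=1.0):
    """samples: list of (succeeded, ttfb_seconds, rtf) tuples, one per request."""
    if not samples or not all(ok for ok, _, _ in samples):
        return False  # the target requires a 100% success rate
    ttfbs = [t for _, t, _ in samples]
    rtfs = [r for _, _, r in samples]
    return (percentile(ttfbs, 99) <= ttfb_limit_s
            and percentile(rtfs, 99) <= rtf_limit)


# 100 requests: a single slow outlier still satisfies p99, two do not.
fast = [(True, 0.4, 0.6)] * 99
slow = (True, 1.5, 0.6)
print(meets_target(fast + [slow]))
print(meets_target(fast[:98] + [slow, slow]))
```

Under the nearest-rank method, p99 over 100 samples is the 99th smallest value, so one outlier in a hundred is tolerated while two are not.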

Performance tuning

Performance tuning is only available for Arcana model images tagged with 20251027 or later.

Environment variables

The model image exposes a set of environment variables that you can tune to improve concurrency under a given performance constraint:
  • DECODER_MAX_BATCH, defaults to 32
  • DECODER_NUM_SESSIONS, defaults to 6
  • GENERATOR_MAX_BATCH, defaults to 32
  • GENERATOR_GPU_MEMORY_UTILIZATION, defaults to 0.8
The defaults for these variables are set to accommodate the lowest hardware spec that Rime supports, so we recommend tuning them with a benchmark-driven approach.
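As a small illustration, the sketch below merges overrides with the documented defaults and renders them as `-e NAME=value` pairs, the form accepted by `docker run`. The `env_flags` helper and the override values are hypothetical. How you launch the image and pass its environment depends on your deployment.

```python
# Documented defaults for the tunable variables.
DEFAULTS = {
    "DECODER_MAX_BATCH": 32,
    "DECODER_NUM_SESSIONS": 6,
    "GENERATOR_MAX_BATCH": 32,
    "GENERATOR_GPU_MEMORY_UTILIZATION": 0.8,
}


def env_flags(**overrides):
    """Return a flat list of '-e', 'NAME=value' items (hypothetical helper)."""
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise KeyError(f"unknown variables: {sorted(unknown)}")
    merged = {**DEFAULTS, **overrides}
    flags = []
    for name, value in merged.items():
        flags += ["-e", f"{name}={value}"]
    return flags


# Example values only; pick real ones from benchmark results.
flags = env_flags(DECODER_MAX_BATCH=48, GENERATOR_MAX_BATCH=48)
print(" ".join(flags))
```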

Tuning

You can use armchair to tune the environment variables with the following workflow:
  1. Get a baseline performance report by running armchair with specific performance constraints, without specifying the concurrency level (-c).
  2. If RTF is significantly lower than 1, increase both DECODER_MAX_BATCH and GENERATOR_MAX_BATCH.
  3. If the server fails to start with an OOM error, decrease GENERATOR_GPU_MEMORY_UTILIZATION.
  4. Repeat the process until the benchmarked concurrency level converges with DECODER_MAX_BATCH and GENERATOR_MAX_BATCH. This is the maximum concurrency that the system can accept.
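The workflow above can be sketched as a loop. Here `run_benchmark` is a hypothetical callable that restarts the server with the given settings, runs the benchmark, and returns the measured maximum concurrency and p99 RTF. It raises `MemoryError` to stand in for an OOM failure at startup. The step size, the 0.8 RTF headroom threshold, and the simulated benchmark are all assumptions chosen for illustration.

```python
def tune(run_benchmark, batch=32, gpu_mem=0.8, step=8, headroom=0.8, max_rounds=20):
    """Raise both max-batch settings while RTF has headroom; back off memory on OOM."""
    for _ in range(max_rounds):
        try:
            concurrency, rtf_p99 = run_benchmark(batch, gpu_mem)
        except MemoryError:
            # Step 3: the server failed to start, so lower the
            # GENERATOR_GPU_MEMORY_UTILIZATION value and retry.
            gpu_mem = round(gpu_mem - 0.05, 2)
            continue
        if rtf_p99 < headroom:
            # Step 2: RTF is well below 1, so raise DECODER_MAX_BATCH
            # and GENERATOR_MAX_BATCH together.
            batch += step
            continue
        # Step 4: no headroom left; treat this as the converged result.
        return {"batch": batch, "gpu_mem": gpu_mem, "max_concurrency": concurrency}
    raise RuntimeError("tuning did not converge")


def fake_benchmark(batch, gpu_mem):
    """Simulated system, for illustration only."""
    if batch > 32 and gpu_mem > 0.7:
        raise MemoryError("simulated OOM at startup")
    concurrency = min(batch, 48)
    return concurrency, concurrency / 48


result = tune(fake_benchmark)
print(result)
```

In this simulation the loop raises the batch size once, lowers the memory setting twice after simulated OOM errors, and stops when the measured concurrency matches the batch size with no RTF headroom left.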