To ensure real-time streaming, it is often necessary to limit the number of inference requests that a model is processing concurrently. To support this, Rime's Arcana model images return an Open Request Cost Aggregation (ORCA) header that informs a load balancer of the number of requests each replica is currently handling.
The full set of ORCA headers has been returned by Arcana model containers since the 20260115 release.

HTTP ORCA header

The ORCA header is included in every HTTP response from an Arcana container, so a load balancer can read the current load of each replica.

Max concurrency

The application_utilization metric is calculated by dividing the number of
concurrent inference requests by a preconfigured max concurrency.
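The division above can be sketched as a few lines of Python. This is an illustrative model, not the container's actual code; the function name and the default capacity of 4 are assumptions, while INFERENCE_CONCURRENCY_CAPACITY is the variable documented below.

```python
import os

def application_utilization(concurrent_requests: int) -> float:
    """Sketch of the ORCA application_utilization calculation.

    Divides the number of in-flight inference requests by the configured
    max concurrency. The fallback of 4 is illustrative; in practice the
    default comes from parameter tuning.
    """
    capacity = int(os.environ.get("INFERENCE_CONCURRENCY_CAPACITY", "4"))
    return concurrent_requests / capacity

# With INFERENCE_CONCURRENCY_CAPACITY=8, two in-flight requests
# report a utilization of 0.25.
os.environ["INFERENCE_CONCURRENCY_CAPACITY"] = "8"
print(application_utilization(2))  # -> 0.25
```

Note that a utilization above 1.0 is possible if more requests are in flight than the configured capacity, since nothing in the calculation rejects overflow.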
You can override the max concurrency after parameter tuning by setting the
INFERENCE_CONCURRENCY_CAPACITY environment variable to the desired max concurrency.
The INFERENCE_CONCURRENCY_CAPACITY variable only affects the utilization value
reported to the load balancer. Setting it does not cause the container to reject
or queue overflowing requests.
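For reference, a response header carrying this metric might look like the fragment below. This follows the xDS ORCA TEXT encoding used by proxies such as Envoy (header name `endpoint-load-metrics`); the exact header name, format, and value returned by a given Arcana release may differ.

```http
HTTP/1.1 200 OK
content-type: audio/pcm
endpoint-load-metrics: TEXT application_utilization=0.25
```

A load balancer that understands ORCA can use the reported utilization to steer new requests toward less-loaded replicas.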
