To ensure real-time streaming, it is often necessary to limit the number of inference requests that a model is processing concurrently. To support this, Rime's Arcana model images return an Open Request Cost Aggregation (ORCA) header that informs a load balancer of the number of requests each replica is currently handling.
The full set of ORCA headers has been returned by Arcana model containers since the 20260115 release.

HTTP ORCA header

The ORCA header is included in every HTTP response from an Arcana container, so a load balancer can read the current load of each replica.

Max concurrency

The application_utilization metric is calculated by dividing the number of
concurrent inference requests by a preconfigured max concurrency.
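The division above can be sketched as a few lines of Python. This is an illustrative model, not the container's actual code; the function name and the default capacity of 4 are assumptions, while INFERENCE_CONCURRENCY_CAPACITY is the variable documented below.

```python
import os

def application_utilization(concurrent_requests: int) -> float:
    """Sketch of the ORCA application_utilization calculation.

    Divides the number of in-flight inference requests by the configured
    max concurrency. The fallback of 4 is illustrative; in practice the
    default comes from parameter tuning.
    """
    capacity = int(os.environ.get("INFERENCE_CONCURRENCY_CAPACITY", "4"))
    return concurrent_requests / capacity

# With INFERENCE_CONCURRENCY_CAPACITY=8, two in-flight requests
# report a utilization of 0.25.
os.environ["INFERENCE_CONCURRENCY_CAPACITY"] = "8"
print(application_utilization(2))  # -> 0.25
```

Note that a utilization above 1.0 is possible if more requests are in flight than the configured capacity, since nothing in the calculation rejects overflow.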
You can override the max concurrency after parameter tuning by setting the
INFERENCE_CONCURRENCY_CAPACITY environment variable to the desired max concurrency.
The INFERENCE_CONCURRENCY_CAPACITY variable only affects the utilization value
reported to the load balancer. Setting it does not cause the container to reject
or queue overflowing requests.
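For reference, a response header carrying this metric might look like the fragment below. This follows the xDS ORCA TEXT encoding used by proxies such as Envoy (header name `endpoint-load-metrics`); the exact header name, format, and value returned by a given Arcana release may differ.

```http
HTTP/1.1 200 OK
content-type: audio/pcm
endpoint-load-metrics: TEXT application_utilization=0.25
```

A load balancer that understands ORCA can use the reported utilization to steer new requests toward less-loaded replicas.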
