Rime On-Premises Deployment Quickstart

On-prem is in public beta. For access to Docker images and pricing information, reach out to help@rime.ai.

Introduction

Why On-Premises?

Deploying on-premises offers several advantages over using cloud APIs over a public network. One of the main benefits is speed; by hosting the services locally, you can significantly reduce network latency, resulting in faster system responses and data processing.

Security

With an on-premises deployment, all sensitive data remains within your corporate network, ensuring enhanced security as it is not transmitted over the Internet. This setup helps in complying with strict data privacy and protection regulations.

Performance

Latency

  • Mist: Our tests show a median latency of around 80 ms with randomly generated sentences of 40 to 50 characters.
  • Arcana: Expect a time-to-first-frame latency of around 400 ms on an H100, and a real-time factor (RTF) below 1.
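A real-time factor below 1 means the model produces audio faster than the audio takes to play back. The sketch below just illustrates the definition; the numbers in it are made up, not measured:

```python
# Real-time factor (RTF) = time spent synthesizing / duration of audio produced.
# RTF < 1 means synthesis runs faster than real time.

def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    return synthesis_seconds / audio_seconds

# Illustrative numbers only: 1.2 s of compute to synthesize 3.0 s of audio.
rtf = real_time_factor(1.2, 3.0)
print(round(rtf, 2))  # 0.4
```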

Components

On-Premise Components

Prerequisites

Hardware Requirements

  • GPU
    • For Mist
      • NVIDIA T4, L4, A10G, or any equivalent or higher-performing GPU
    • For Arcana
      • NVIDIA H100, or any equivalent or higher-performing GPU
  • Storage
    • 50 GB storage
  • CPU
    • 8 vCPUs
  • Memory
    • 32 GiB

Software Requirements

  • Supported Linux Distributions
    • Debian 12 (bookworm), x86_64
    • Ubuntu Server 24.04 (noble), x86_64
  • NVIDIA drivers
    • 570 or higher (for CUDA 12.8)
  • Docker
  • NVIDIA Container Toolkit

Installations

NVIDIA Drivers
Follow https://www.nvidia.com/en-us/drivers to install the latest NVIDIA drivers, or use the following instructions on Debian-based systems:
NVIDIA Driver Installation (Debian-based)
# Update packages
sudo apt-get update

# Install basic toolchain and kernel headers
sudo apt-get install -y gcc make wget linux-headers-$(uname -r)

# Download and install the NVIDIA driver.
NVIDIA_DRIVER_VERSION=570.103.01
NVIDIA_DRIVER_PATH=/opt/NVIDIA-Linux-x86_64-${NVIDIA_DRIVER_VERSION}.run
sudo rm -f "${NVIDIA_DRIVER_PATH}"
sudo wget "https://us.download.nvidia.com/tesla/${NVIDIA_DRIVER_VERSION}/NVIDIA-Linux-x86_64-${NVIDIA_DRIVER_VERSION}.run" -O "${NVIDIA_DRIVER_PATH}"
sudo chmod +x "${NVIDIA_DRIVER_PATH}"
sudo "${NVIDIA_DRIVER_PATH}" --silent --no-questions
Docker
Follow https://docs.docker.com/engine/install to install Docker on your system.
NVIDIA Container Toolkit
Follow https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html to install the NVIDIA Container Toolkit.
Verification
To verify that you have all the prerequisites installed, run the following command:
Verify Prerequisites
docker run --rm --gpus all nvidia/cuda:12.8.1-base-ubi9 nvidia-smi
You should see your GPU listed in the output, alongside the driver version and CUDA version.

Firewall Requirements

The Rime API instance listens on port 8000 for HTTP traffic and on port 8001 for WebSocket traffic. You will also need to allow the following outbound traffic in your firewall rules:
  • http://optimize.rime.ai/usage: registers on-prem usage with our servers.
  • http://optimize.rime.ai/license: verifies that your on-prem license is active.
  • us-docker.pkg.dev on port 443: container image registry.

Self-Service Licensing & Credentials

API Key Generation

Refer to our user interface dashboard to generate the necessary keys and credentials for authenticating and authorizing the deployment and use of our services.

Deployment

The deployment consists of two services, each powered by a container image:
  • API service: responsible for handling the HTTP and WebSocket requests, and for verifying the license. It serves as a proxy to the TTS service.
  • TTS service: responsible for model inference.
There is a 1:1 relationship between the API service and the TTS service: for each TTS model, you will need a corresponding API service. Multiple pairs of API and TTS services can be deployed on the same machine.

Artifact Registry Login

A key file will be provided by Rime.
Login to Artifact Registry
cat KEY-FILE | docker login -u _json_key --password-stdin https://us-docker.pkg.dev

Container Images

TTS Service

Currently the latest image versions are:
  • us-docker.pkg.dev/rime-labs/arcana/v2/de:20250823
  • us-docker.pkg.dev/rime-labs/arcana/v2/en:20250823
  • us-docker.pkg.dev/rime-labs/arcana/v2/es:20250823
  • us-docker.pkg.dev/rime-labs/arcana/v2/fr:20250823
  • us-docker.pkg.dev/rime-labs/mist/v2/en:20250814

API Service

The latest image version is:
  • us-docker.pkg.dev/rime-labs/api/service:20250731

Docker Compose Configuration

A simple way of deploying on a machine is to use Docker Compose. Create a docker-compose.yml file with your editor of choice to define the services and their configurations:
docker-compose.yml
version: '3.8'
services:
  api:
    image: <image_id>
    depends_on:
      - model
    ports:
      - "8000:8000"  # HTTP API
      - "8001:8001"  # WebSocket API
    restart: unless-stopped
    environment:
      - MODEL_URL=http://model:8080/invocations

  model:
    image: <image_id>
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              capabilities: [gpu]
              count: all
    ports:
      - "8080:8080"
    restart: unless-stopped
When running on Kubernetes, ensure that MODEL_URL points to http://0.0.0.0:8080/invocations instead of the Docker Compose service name.
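Since each TTS model needs its own API service, running a second model on the same machine means adding another API/TTS pair under the existing services: key, mapped to different host ports. A sketch of such a second pair; the service names and the 8002/8003 host ports are illustrative choices, not prescribed values:

```yaml
  # Illustrative second API/TTS pair; names and host ports are arbitrary.
  api-arcana:
    image: <image_id>        # API service image
    depends_on:
      - model-arcana
    ports:
      - "8002:8000"          # HTTP API for the second pair
      - "8003:8001"          # WebSocket API for the second pair
    restart: unless-stopped
    environment:
      - MODEL_URL=http://model-arcana:8080/invocations

  model-arcana:
    image: <image_id>        # TTS model image
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              capabilities: [gpu]
              count: all
    restart: unless-stopped
```

Note that only the host side of each port mapping changes; inside the containers the services still listen on 8000, 8001, and 8080.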

Start Docker Compose

Start Docker Compose
docker compose up -d

Deployment Steps

  1. Environment Setup: Provision a machine that meets the hardware and software requirements above.
  2. Service Deployment: Using Docker, deploy the images on your server.
  3. Networking Setup: Configure the network settings, including gateway and port rules, to ensure connectivity and security.
  4. Licensing and Authentication: Generate and apply the necessary API key via our dashboard to start using the services.
Note: Once the containers are started, allow about 5 minutes of warm-up before sending the first TTS requests.

Additional Information

  • Troubleshooting Guide: A troubleshooting guide will be provided to help resolve common issues during deployment.
  • Available voices/models: all voices are currently available.

Requests and Response Formats

HTTP Requests

Request:
Health Check
curl http://localhost:8000/health
which should return
{
  "apiStatus": "ok",
  "timestamp": <timestamp>,
  "licenseStatus": "valid" | "expired-or-not-set",
  "modelReachable": true | false
}
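A readiness check can key off those fields. A minimal sketch in Python; the sample payload below is fabricated to match the format above, not a real server response:

```python
import json

def is_ready(health_body: str) -> bool:
    """Return True when the API reports a valid license and a reachable model."""
    health = json.loads(health_body)
    return (
        health.get("apiStatus") == "ok"
        and health.get("licenseStatus") == "valid"
        and health.get("modelReachable") is True
    )

# Fabricated example payload in the shape returned by /health.
sample = '{"apiStatus": "ok", "timestamp": 1735689600, "licenseStatus": "valid", "modelReachable": true}'
print(is_ready(sample))  # True
```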
Request Example
curl -X POST "http://localhost:8000" -H "Authorization: Bearer <API KEY>" -H "Content-Type: application/json" -d '{
  "text": "I would love to have a conversation with you. The new model is out.",
  "speaker": "joy",
  "modelId": "mist"
}' -o result_mist.txt
Response:
Response Format
{"audioContent":{"model_output":"<base64>"}}
Sample response file: result.txt
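The base64 payload under audioContent.model_output can be decoded back into raw audio bytes. A minimal sketch, assuming only the JSON shape shown above; the payload in this example is a stand-in, not actual model output:

```python
import base64
import json

def extract_audio(response_body: str) -> bytes:
    """Decode the base64 audio payload from the JSON response shown above."""
    body = json.loads(response_body)
    return base64.b64decode(body["audioContent"]["model_output"])

# Stand-in payload: base64 of arbitrary bytes, not real audio.
fake = json.dumps({"audioContent": {"model_output": base64.b64encode(b"RIFF....").decode()}})
audio = extract_audio(fake)
print(len(audio))  # 8
```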

Receiving a response in mp3 format

Request:
Request Example
curl -X POST "http://localhost:8000" -H "Authorization: Bearer <API KEY>" -H "Content-Type: application/json" -H "Accept: audio/mp3" -d '{
  "text": "I would love to have a conversation with you.",
  "speaker": "joy",
  "modelId": "mist"
}' -o result.mp3
Response: Sample response file: result.mp3

Receiving a response in pcm (raw) format

Request:
Request Example
curl -X POST "http://localhost:8000" -H "Authorization: Bearer <API KEY>" -H "Content-Type: application/json" -H "Accept: audio/pcm" -d '{
  "text": "I would love to have a conversation with you.",
  "speaker": "joy",
  "modelId": "mist"
}' -o result.pcm
Response: Sample response file: result.pcm
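Raw PCM carries no header, so a player needs the format parameters supplied separately. The sketch below wraps PCM bytes in a WAV container using Python's wave module; the 16 kHz, mono, 16-bit parameters are assumptions for illustration only, so match them to whatever your deployment actually emits:

```python
import wave

def pcm_to_wav(pcm_bytes: bytes, out_path: str,
               sample_rate: int = 16000, channels: int = 1, sample_width: int = 2) -> None:
    """Wrap raw PCM samples in a WAV container.

    The default format parameters are assumptions; set them to match
    the actual output of your TTS deployment.
    """
    with wave.open(out_path, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(sample_width)
        wav.setframerate(sample_rate)
        wav.writeframes(pcm_bytes)

# Example with zeros (silence) standing in for the bytes of result.pcm.
pcm_to_wav(b"\x00" * 3200, "result.wav")  # 0.1 s of 16 kHz mono silence
```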

Websockets Endpoints

The JSON WebSocket endpoint is served on port 8001 (for example, ws://localhost:8001) and is equivalent to our cloud websockets-json API.