
Rate Limits

The cost of using our API is based on token consumption. We charge different prices based on token category:

  • Prompt text, audio and image tokens - Charged at prompt token price

  • Cached prompt tokens - Charged at cached prompt token price

  • Completion tokens - Charged at completion token price

  • Reasoning tokens - Charged at completion token price

Visit the Models and pricing page for general pricing.
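To make the billing breakdown concrete, here is a minimal sketch that combines the four categories above into one request cost. The per-million-token prices are invented placeholders, not actual rates; see the Models and pricing page for those.

```python
# Sketch: combine the four billing categories into one request cost.
# The per-million-token prices below are placeholders, NOT real rates.
PRICE_PER_MILLION = {
    "prompt": 3.00,         # prompt text, audio and image tokens
    "cached_prompt": 0.75,  # cached prompt tokens
    "completion": 15.00,    # completion tokens (reasoning billed the same)
}

def request_cost(prompt_tokens: int, cached_tokens: int,
                 completion_tokens: int, reasoning_tokens: int) -> float:
    """Return the USD cost; cached_tokens is the cached subset of prompt_tokens."""
    non_cached = prompt_tokens - cached_tokens
    return (
        non_cached * PRICE_PER_MILLION["prompt"]
        + cached_tokens * PRICE_PER_MILLION["cached_prompt"]
        + (completion_tokens + reasoning_tokens) * PRICE_PER_MILLION["completion"]
    ) / 1_000_000

print(request_cost(199, 163, 1, 0))
```

Note that reasoning tokens are priced with completion tokens, and cached tokens are subtracted from the regular prompt-token count rather than added on top.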

Basic unit to calculate consumption — Tokens

A token is the basic unit of prompt size for model inference and pricing purposes. It consists of one or more character(s)/symbol(s).

When a QJS model handles your request, the input prompt is decomposed into a list of tokens by a tokenizer. The model then runs inference on the prompt tokens and generates completion tokens. After inference completes, the completion tokens are aggregated into a completion response and sent back to you.

Our system will add additional formatting tokens to the input/output tokens, and if you select a reasoning model, additional reasoning tokens will be added to the total token consumption as well. Your actual consumption will be reflected either in the usage object returned in the API response, or in Usage Explorer on the QJS Console.

You can use the Tokenizer on Qjs Console to visualize the tokens of a given text prompt, or use the Tokenize text endpoint on the API.
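A hypothetical sketch of calling the Tokenize text endpoint to count tokens before sending a request. The URL, payload shape, and `token_ids` response field are assumptions for illustration; only the endpoint's existence comes from the text above.

```python
# Hypothetical sketch: count tokens via the Tokenize text endpoint.
# The URL, payload shape, and "token_ids" field are ASSUMED, not documented here.
import json
import os
import urllib.request

def count_tokens(text: str, model: str) -> int:
    req = urllib.request.Request(
        "https://api.qjs.ai/v1/tokenize-text",  # assumed endpoint URL
        data=json.dumps({"text": text, "model": model}).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {os.getenv('qjsAI_API_KEY')}",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return len(json.load(resp)["token_ids"])  # assumed response field
```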

Text tokens

A token can be a whole word or a smaller chunk of characters. The more common a word is, the more likely it is to be a single token.

For example, Flint is broken down into two tokens, while Michigan is a whole token.

For a given text/image/etc. prompt or completion sequence, different tokenizers may break it down into different lengths of lists.

Different Qjs models may share a tokenizer or use different ones. Therefore, the same prompt/completion sequence may not have the same number of tokens across different models.

The token count of a prompt/completion sequence is approximately linear in the sequence length.
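Because the relationship is roughly linear, a simple character-count heuristic gives a quick pre-flight estimate. The 4-characters-per-token ratio below is a common rule of thumb, not a QJS-specific constant; use the tokenizer for exact counts.

```python
# Rough heuristic: token count grows roughly linearly with text length.
# The 4-characters-per-token ratio is an ASSUMED rule of thumb, not exact.
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    return max(1, round(len(text) / chars_per_token))

print(estimate_tokens("How is the weather today?"))  # 25 chars / 4 -> 6
```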

Image prompt tokens

Each image prompt will take between 256 to 1792 tokens, depending on the size of the image. The image + text token count must be less than the overall context window of the model.
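One way to apply this constraint is a conservative pre-flight check that assumes every image costs the maximum token count. The 256-1792 range comes from the text above; the context window value in the example is a placeholder for whatever model you use.

```python
# Sketch: pre-flight budget check for an image + text prompt.
# Per-image range (256-1792) is from the docs; the context window is a placeholder.
IMAGE_TOKENS_MIN, IMAGE_TOKENS_MAX = 256, 1792

def fits_context(text_tokens: int, n_images: int, context_window: int) -> bool:
    """Conservatively assume every image costs the maximum token count."""
    worst_case = text_tokens + n_images * IMAGE_TOKENS_MAX
    return worst_case < context_window

print(fits_context(500, 2, 8192))  # 500 + 2 * 1792 = 4084 < 8192 -> True
```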

Estimating consumption with the tokenizer on Qjs Console or through the API

The tokenizer page or API might display a lower token count than the actual token consumption. The inference endpoints automatically add pre-defined tokens to help our system process the request.

On Qjs Console, you can use the tokenizer page to estimate how many tokens your text prompt will consume. For example, the following message would consume 5 tokens (the actual consumption may vary because of additional special tokens added by the system).

Message body:

JSON
[
  {
    "role": "user",
    "content": "How is the weather today?"
  }
]

Cached prompt tokens

When you send the same prompt multiple times, we may cache your prompt tokens. This would result in reduced cost for these tokens at the cached token rate, and a quicker response.

Prompts are cached by prefix matching: subsequent requests reuse the cache for exact prefix matches. However, the cache size may be limited, and the cache is distributed across different clusters.

You can also specify

x-qjs-conv-id: <A constant uuid4 ID>

in the HTTP request header, to increase the likelihood of cache hit in the subsequent requests using the same header.
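A sketch of pinning a constant conversation ID header to raise cache-hit odds on follow-up requests. Only the x-qjs-conv-id header comes from the text above; the endpoint URL and payload shape are assumptions.

```python
# Sketch: reuse one x-qjs-conv-id header across a conversation's requests.
# The endpoint URL and payload shape are ASSUMED; only the header is documented.
import json
import os
import urllib.request
import uuid

CONV_ID = str(uuid.uuid4())  # generate once, reuse for the whole conversation

def chat_request(messages: list) -> urllib.request.Request:
    return urllib.request.Request(
        "https://api.qjs.ai/v1/chat/completions",  # assumed endpoint URL
        data=json.dumps({"model": "qjs-4-1-fast-reasoning",
                         "messages": messages}).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {os.getenv('qjsAI_API_KEY')}",
            "x-qjs-conv-id": CONV_ID,  # same ID on every request in the thread
        },
    )
```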

Reasoning tokens

The model may use reasoning to process your request. The reasoning content is returned in the response's reasoning_content field. The reasoning token consumption will be counted separately from completion_tokens, but will be counted in the total_tokens.

The reasoning tokens will be charged at the same price as completion_tokens.
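In terms of the usage object shown later on this page, the amount billed at the completion rate is the visible completion tokens plus the reasoning tokens. A small sketch with made-up counts:

```python
# Sketch: reasoning tokens are reported separately from completion_tokens
# but billed at the same rate. The counts below are made-up examples.
usage = {
    "completion_tokens": 1,
    "completion_tokens_details": {"reasoning_tokens": 120},
}

reasoning = usage["completion_tokens_details"]["reasoning_tokens"]
billable_at_completion_rate = usage["completion_tokens"] + reasoning
print(billable_at_completion_rate)  # 121
```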

qjs-4 does not return reasoning_content

Hitting rate limits

To request a higher rate limit, please email support@qjs.ai with your anticipated volume.

For each tier, there is a maximum amount of requests per minute and tokens per minute. This is to ensure fair usage by all users of the system.

Once your request frequency has reached the rate limit, you will receive error code 429 in response.

You can either:

  • Upgrade your team to higher tiers

  • Change your consumption pattern to send fewer requests
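A common consumption-pattern change is to retry with exponential backoff when a 429 arrives instead of immediately resending. A minimal sketch, written against a generic `send()` callable rather than any specific SDK:

```python
# Sketch: retry on HTTP 429 with exponential backoff.
# `send` is any zero-argument callable returning (status_code, body).
import time

def with_backoff(send, max_retries=5, base_delay=1.0):
    """Retry send() while it returns 429, doubling the wait each attempt."""
    delay = base_delay
    for _ in range(max_retries):
        status, body = send()
        if status != 429:
            return status, body
        time.sleep(delay)  # wait before retrying
        delay *= 2         # exponential backoff
    return status, body
```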

Checking token consumption

In each completion response, there is a usage object detailing your prompt and completion token count. You might find it helpful to keep track of it, in order to avoid hitting rate limits or having cost surprises. You can view more details of the object on our API Reference.

JSON
"usage": {
    "prompt_tokens": 199,
    "completion_tokens": 1,
    "total_tokens": 200,
    "prompt_tokens_details": {
        "text_tokens": 199,
        "audio_tokens": 0,
        "image_tokens": 0,
        "cached_tokens": 163
    },
    "completion_tokens_details": {
        "reasoning_tokens": 0,
        "audio_tokens": 0,
        "accepted_prediction_tokens": 0,
        "rejected_prediction_tokens": 0
    },
    "num_sources_used": 0,
    "cost_in_usd_ticks": 158500
}

The cost_in_usd_ticks field expresses the total cost of performing the inference, in units of 1/10,000,000,000 of a US dollar.
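For example, converting the cost_in_usd_ticks value from the sample usage object above into dollars:

```python
# Convert cost_in_usd_ticks (1 tick = 1/10,000,000,000 USD) to dollars,
# using the 158500 value from the sample usage object.
TICKS_PER_USD = 10_000_000_000

def ticks_to_usd(ticks: int) -> float:
    return ticks / TICKS_PER_USD

print(ticks_to_usd(158_500))  # 1.585e-05
```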

Note: The usage.prompt_tokens_details.text_tokens is the total text input token count, which includes cached_tokens and non-cached text tokens.

You can also check with the Qjs or OpenAI SDKs (Anthropic SDK is deprecated).

Python
import os

from qjsai_sdk import Client
from qjsai_sdk.chat import system, user

client = Client(api_key=os.getenv("qjsAI_API_KEY"))

chat = client.chat.create(
    model="qjs-4-1-fast-reasoning",
    messages=[system("You are qjs, a chatbot inspired by the Hitchhiker's Guide to the Galaxy.")],
)
chat.append(user("What is the meaning of life, the universe, and everything?"))

response = chat.sample()
print(response.usage)
