Documentation Index
Fetch the complete documentation index at: https://platform.kimi.ai/docs/llms.txt
Use this file to discover all available pages before exploring further.
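As a sketch of how you might consume that index: llms.txt files conventionally list pages as markdown links, so a small helper can fetch the file and pull out the page URLs. The link format is an assumption about how this particular llms.txt is laid out; only the index URL comes from the text above.

```python
import re
import urllib.request

INDEX_URL = "https://platform.kimi.ai/docs/llms.txt"  # from the docs above

def parse_index(text: str) -> list[str]:
    """Extract page URLs from an llms.txt index.

    Assumes pages are listed as markdown links like [Title](https://...),
    which is the common llms.txt convention.
    """
    return re.findall(r"\((https?://[^)\s]+)\)", text)

def fetch_index(url: str = INDEX_URL) -> list[str]:
    """Download the index and return the discovered page URLs."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return parse_index(resp.read().decode("utf-8"))
```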
Text and Multimodal Models
kimi-k2.6 is Kimi’s most intelligent model to date, supporting text, image, and video input, as well as thinking and non-thinking modes. It is suitable for conversation, code generation, visual understanding, and agent tasks. The input to a model is commonly called a “prompt”, and clear instructions plus representative examples are the most effective way to get stable outputs. Other models are also available — see the Model List for details.
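A minimal text-prompt sketch, assuming the platform exposes an OpenAI-compatible Chat Completions endpoint (the endpoint URL and the system-message wording below are assumptions, not confirmed by this page; the model name comes from the text above):

```python
import json
import urllib.request

# Assumed OpenAI-compatible endpoint; check the platform docs for the real URL.
API_URL = "https://api.moonshot.ai/v1/chat/completions"

def build_chat_request(prompt: str, model: str = "kimi-k2.6",
                       temperature: float = 0.6) -> dict:
    """Assemble a Chat Completions payload: a clear instruction plus the
    user's prompt, per the guidance above about writing prompts."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt},
        ],
        "temperature": temperature,
    }

def chat(prompt: str, api_key: str) -> str:
    """Send the request and return the model's reply text."""
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(build_chat_request(prompt)).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

The payload shape (`model`, `messages`, `temperature`) follows the widely used Chat Completions convention; image and video inputs would use the same endpoint with multimodal message content.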
Language Model Inference Service
The language model inference service is an API service based on pretrained models developed and trained by us (Moonshot AI). Today, the platform primarily exposes a Chat Completions interface for conversation, code generation, visual understanding, and agent tasks. Models do not directly access external resources such as the internet or databases by default, but you can extend them with official tools or custom tool calls when needed.
Token
Text generation models process text in units called Tokens. A Token represents a common sequence of characters. For example, a long and uncommon word like “antidisestablishmentarianism” might be broken down into a combination of several Tokens, while a short and common word like “word” might be represented by a single Token. Generally speaking, for a typical English text, 1 Token is roughly equivalent to 3-4 English characters. It is important to note that the total length of Input and Output cannot exceed the selected model’s maximum context length. For example, kimi-k2.6 supports context windows up to 256K. For other models’ context lengths, see the Model List.
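Using the rule of thumb above, you can sanity-check a request against the context window before sending it. This is only a rough estimate (exact counts come from the tokenizer); the 3-characters-per-Token figure is the conservative end of the range quoted above.

```python
MAX_CONTEXT_TOKENS = 256_000  # kimi-k2.6's context window, per the text above

def estimate_tokens(text: str) -> int:
    """Rough estimate: ~1 Token per 3-4 English characters.
    Uses the conservative end (3 chars/Token) so we overestimate."""
    return -(-len(text) // 3)  # ceiling division

def fits_in_context(prompt: str, max_completion_tokens: int) -> bool:
    """Input plus Output together must not exceed the context window."""
    return estimate_tokens(prompt) + max_completion_tokens <= MAX_CONTEXT_TOKENS
```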
Rate Limits
How do these rate limits work?
Rate limits are measured in four ways: concurrency, RPM (requests per minute), TPM (tokens per minute), and TPD (tokens per day). A request is rejected as soon as any one of these limits is reached, whichever is hit first. For example, if your RPM limit is 20 and your TPM limit is 200k, sending 20 requests of only 100 Tokens each still hits the RPM limit, even though those requests total far fewer than 200k Tokens.

For convenience, the gateway estimates a request’s token usage up front using the max_completion_tokens parameter: if your request includes max_completion_tokens, that value is used; if not, the default max_completion_tokens is used. When you make a request, we determine whether you have reached the rate limit based on the number of Tokens in your request plus max_completion_tokens, regardless of how many Tokens are actually generated. Billing, by contrast, is calculated from the number of Tokens in your request plus the number of Tokens actually generated.

Other Important Notes:
- Rate limits are enforced at the user level, not the key level.
- Currently, we share rate limits across all models.