Benchmarking is an engineering task that demands stability and reproducibility. You'll be calling the model thousands of times; even tiny drifts in system setup or network latency can compromise the accuracy of your results. Here's what we've learned about keeping results reproducible and trustworthy.

Quick notes
- For any unlisted or closed-source benchmark: set temperature = 1.0, stream = true, top_p = 0.95
- Reasoning benchmarks: max_tokens = 128k, and run at least 500–1000 samples to keep variance low (e.g. AIME 2025: 30 questions × 32 runs = 960 samples)
- Coding benchmarks: max_tokens = 256k
- Agentic task benchmarks:
  - For multi-hop search: max_tokens = 256k + context management
  - Others: max_tokens ≥ 16k–64k
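As a rough sketch, the quick notes above can be encoded as per-category request parameters. The category labels and the `build_params` helper below are our own illustration, not part of the API:

```python
# Sketch: map a benchmark category to the recommended request parameters.
# Category labels and this helper are illustrative, not part of the API.

DEFAULTS = {"temperature": 1.0, "top_p": 0.95, "stream": True}

# Recommended max_tokens per benchmark category (from the quick notes above).
MAX_TOKENS = {
    "reasoning": 128 * 1024,
    "coding": 256 * 1024,
    "agentic_multi_hop_search": 256 * 1024,  # plus context management
    "agentic_other": 64 * 1024,              # anywhere in the 16k-64k range
}

def build_params(category: str) -> dict:
    """Return keyword arguments for a chat.completions.create call."""
    if category not in MAX_TOKENS:
        raise ValueError(f"unknown benchmark category: {category}")
    return {**DEFAULTS, "max_tokens": MAX_TOKENS[category]}

params = build_params("reasoning")
```

The returned dict can be splatted directly into an OpenAI-compatible client call, e.g. `client.chat.completions.create(model=..., messages=..., **params)`.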
K2.6 Models Benchmark Recommended Settings
| Benchmark Category | Benchmark | Temperature | Recommended max tokens | Recommended runs | Top-p | Others (e.g. test log) |
|---|---|---|---|---|---|---|
| Multi-modal | MMMU-Pro | 1.0 | max tokens = 96k | 3 | top_p=0.95 | thinking= |
| | MMMU-Pro w/ python | 1.0 | per step tokens = 64k; total max tokens = 256k | 3 | top_p=0.95 | Recommended max steps = 50; thinking= |
| | CharXiv (RQ) | 1.0 | max tokens = 96k | 3 | top_p=0.95 | thinking= |
| | CharXiv (RQ) w/ python | 1.0 | per step tokens = 64k; total max tokens = 256k | 3 | top_p=0.95 | Recommended max steps = 50; thinking= |
| | MathVision | 1.0 | max tokens = 96k | 3 | top_p=0.95 | thinking= |
| | MathVision w/ python | 1.0 | per step tokens = 64k; total max tokens = 256k | 3 | top_p=0.95 | Recommended max steps = 50; thinking= |
| | V* w/ python | 1.0 | per step tokens = 64k; total max tokens = 256k | 3 | top_p=0.95 | Recommended max steps = 50; thinking= |
| Agent | HLE-Full w/ tools | 1.0 | per step tokens = 48k; total max tokens = 256k | 1 | top_p=0.95 | Recommended max steps = 300; thinking= |
| | BrowseComp | 1.0 | per step tokens = 48k; total max tokens = 256k | 1 | top_p=0.95 | Recommended max steps = 300; thinking= |
| | DeepSearchQA | 1.0 | per step tokens = 48k; total max tokens = 256k | 1 | top_p=0.95 | Recommended max steps = 300; thinking= |
| | WideSearch | 1.0 | per step tokens = 48k; total max tokens = 256k | 4 | top_p=0.95 | Recommended max steps = 300; thinking= |
| | Toolathlon | 1.0 | per step tokens = 48k; total max tokens = 256k | 4 | top_p=0.95 | Recommended max steps = 300; thinking= |
| | MCPMark | 1.0 | per step tokens = 48k; total max tokens = 256k | 4 | top_p=0.95 | Recommended max steps = 300; thinking= |
| | Claw Eval | 1.0 | per step tokens = 48k; total max tokens = 256k | 4 | top_p=0.95 | Recommended max steps = 300; thinking= |
| | APEX-Agents | 1.0 | per step tokens = 48k; total max tokens = 256k | 4 | top_p=0.95 | Recommended max steps = 300; thinking= |
| Coding | Terminal-Bench 2.0 (Terminus-2) | 1.0 | max tokens = 256k | 3 | top_p=0.95 | thinking= |
| | SWE-Bench Pro | 1.0 | per step tokens = 32k; total max tokens = 256k | 5 | top_p=0.95 | Recommended max steps = 300; thinking= |
| | SWE-Bench Multilingual | 1.0 | per step tokens = 32k; total max tokens = 256k | 5 | top_p=0.95 | Recommended max steps = 300; thinking= |
| | SWE-Bench Verified | 1.0 | per step tokens = 32k; total max tokens = 256k | 5 | top_p=0.95 | Recommended max steps = 300; thinking= |
| | SciCode | 1.0 | max tokens = 96k | 4 | top_p=0.95 | thinking= |
| | OJBench (python) | 1.0 | max tokens = 96k | 8 | top_p=0.95 | thinking= |
| | LiveCodeBench (v6) | 1.0 | max tokens = 96k | 1 | top_p=0.95 | thinking= |
| Math | AIME 2026 | 1.0 | max tokens = 96k | 32 | top_p=0.95 | thinking= |
| | HMMT 2026 (Feb) | 1.0 | max tokens = 96k | 32 | top_p=0.95 | thinking= |
| | IMO-AnswerBench | 1.0 | max tokens = 96k | 4 | top_p=0.95 | thinking= |
| Knowledge | HLE-Full | 1.0 | max tokens = 96k | 1 | top_p=0.95 | thinking= |
| | GPQA-Diamond | 1.0 | max tokens = 96k | 8 | top_p=0.95 | thinking= |
K2.5 Models Benchmark Recommended Settings
| Benchmark Category | Benchmark | Temperature | Recommended max tokens | Recommended runs | Top-p | Others (e.g. test log) |
|---|---|---|---|---|---|---|
| Multi-modal | MMMU-Pro | 1.0 | max tokens = 64k | 3 | top_p=0.95 | thinking= |
| | CharXiv (RQ) | 1.0 | max tokens = 64k | 3 | top_p=0.95 | thinking= |
| | MathVision | 1.0 | max tokens = 64k | 3 | top_p=0.95 | thinking= |
| | MathVista | 1.0 | max tokens = 64k | 3 | top_p=0.95 | thinking= |
| | OCRBench | 1.0 | max tokens = 64k | 3 | top_p=0.95 | thinking= |
| | ZeroBench | 1.0 | max tokens = 64k | 3 | top_p=0.95 | thinking= |
| | WorldVQA | 1.0 | max tokens = 64k | 3 | top_p=0.95 | thinking= |
| | InfoVQA (val) | 1.0 | max tokens = 64k | 3 | top_p=0.95 | thinking= |
| | SimpleVQA | 1.0 | max tokens = 64k | 3 | top_p=0.95 | thinking= |
| | ZeroBench w/ tools | 1.0 | max tokens = 64k | 3 | top_p=0.95 | Recommended max steps = 30; thinking= |
| Code | SWE Series | 1.0 | per step tokens = 16k; total max tokens = 256k | 5 | top_p=0.95 | thinking= |
| | Lcb + OJBench | 1.0 | max tokens = 128k | 1 | top_p=0.95 | thinking= |
| | TerminalBench | 1.0 | max tokens = 128k | 3 | top_p=0.95 | thinking= |
| Reasoning | AIME2025 no tools | 1.0 | total max tokens = 96k | 32 | top_p=0.95 | thinking= |
| | AIME2025 w/ tools | 1.0 | per turn tokens = 96k; total max tokens = 96k | 32 | top_p=0.95 | thinking=; Recommended max steps = 120 |
| | HLE no tools | 1.0 | max tokens = 96k | 1 | top_p=0.95 | thinking= |
| | HLE w/ tools | 1.0 | total max tokens = 128k; per step tokens = 48k | 1 | top_p=0.95 | thinking=; Recommended max steps = 120 |
| | HLE heavy | 1.0 | total max tokens = 128k; per step tokens = 48k | 1 | top_p=0.95 | thinking=; Recommended max steps = 200; parallel n=8 |
| | HMMT2025 no tools | 1.0 | max tokens = 96k | 32 | top_p=0.95 | thinking= |
| | HMMT2025 w/ tools | 1.0 | per step tokens = 96k; total tokens = 96k | 32 | top_p=0.95 | thinking=; Recommended max steps = 120 |
| | IMO-AnswerBench | 1.0 | max tokens = 96k | 3 | top_p=0.95 | thinking= |
| | GPQA-Diamond | 1.0 | max tokens = 96k | 8 | top_p=0.95 | thinking= |
| Agentic Search Task | BrowseComp / BrowseComp-ZH / Seal-0 / Frames | 1.0 | per step tokens = 24k; total max tokens = 256k | 4 | top_p=0.95 | thinking=; Recommended max steps = 250. Recommend a context management mechanism to prevent overly long context and ensure enough tool calls. Include today's date in the system prompt and let the model search when it is uncertain |
| Agentic Task | Tau | 1.0 | >=16k | 4 | top_p=0.95 | thinking=; Recommended max steps = 100 |
When thinking mode is enabled (thinking = {"type": "enabled"}), please note the following constraints to ensure model performance:
- tool_choice can only be set to "auto" or "none" (the default is "auto"), to avoid conflicts between reasoning content and a specified tool_choice. Any other value will result in an error.
- During multi-step tool calling, you must keep the reasoning_content from the assistant message of the current turn's tool call in the context, otherwise an error will be thrown.
- The official builtin $web_search tool is temporarily incompatible with Kimi K2.5/K2.6 thinking mode; to use $web_search, disable thinking mode first.
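A minimal sketch of the multi-step tool-calling constraint above: when echoing the assistant's tool-call turn back into the context, carry its reasoning_content along. The message shapes follow the OpenAI-compatible chat format; the `calc` tool and all payloads below are made up for illustration:

```python
# Sketch: preserve reasoning_content when replaying a tool-calling turn.
# Message shapes follow the OpenAI-compatible chat format; the tool name
# and contents are hypothetical.

def extend_context(messages, assistant_msg, tool_results):
    """Append the assistant tool-call turn and its tool results to the context."""
    turn = {
        "role": "assistant",
        "content": assistant_msg.get("content"),
        "tool_calls": assistant_msg["tool_calls"],
    }
    # Required in thinking mode: dropping reasoning_content raises an API error.
    if "reasoning_content" in assistant_msg:
        turn["reasoning_content"] = assistant_msg["reasoning_content"]
    messages.append(turn)
    for r in tool_results:
        messages.append(
            {"role": "tool", "tool_call_id": r["id"], "content": r["content"]}
        )
    return messages

history = extend_context(
    [{"role": "user", "content": "What's 2+2?"}],
    {
        "content": "",
        "reasoning_content": "I should call the calculator.",
        "tool_calls": [{"id": "call_1", "type": "function",
                        "function": {"name": "calc",
                                     "arguments": '{"expr": "2+2"}'}}],
    },
    [{"id": "call_1", "content": "4"}],
)
```

The resulting `history` is what you would pass back as `messages` on the next step of the loop.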
K2-Thinking Series Models Benchmark Recommended Settings
| Category | Benchmark | Temperature | Max tokens | Suggested runs | Notes |
|---|---|---|---|---|---|
| Code | SWE | 0.7 (recommended), 1.0 (ok) | per step tokens = 16k; total max tokens = 256k | 5 | |
| | Lcb + OJBench | 1.0 | max tokens = 128k | 1 | |
| | TerminalBench | 1.0 | max tokens = 128k | 3 | |
| Reasoning | AIME2025 no tools | 1.0 | total max tokens = 96k | 32 | |
| | AIME2025 w/ tools | 1.0 | per step tokens = 48k; total max tokens = 128k | 16 | max steps = 120 |
| | HLE no tools | 1.0 | max tokens = 96k | 1 | |
| | HLE w/ tools | 1.0 | total max tokens = 128k; per step tokens = 48k | 1 | max steps = 120 |
| | HLE heavy | 1.0 | total max tokens = 128k; per step tokens = 48k | 1 | max steps = 200; parallel n=8 |
| | HMMT2025 no tools | 1.0 | max tokens = 96k | 32 | |
| | HMMT2025 w/ tools | 1.0 | per step tokens = 96k; total tokens = 96k | 32 | max steps = 120 |
| | IMO-AnswerBench | 1.0 | max tokens = 96k | 3 | |
| | GPQA-Diamond | 1.0 | max tokens = 96k | 8 | |
| Agentic Search Task | BrowseComp / BrowseComp-ZH / Seal-0 / Frames | 1.0 | per step tokens = 24k; total max tokens = 256k | 4 | max steps = 250. Enable context management to prevent context overflow and ensure enough tool calls. Include today's date in the system prompt, and tell the model to search when unsure. |
| Agentic Task | Tau | 0.0 | >=16k | 4 | max steps = 100 |
API Recommendations & Notes
- Use the official API: some third-party endpoints show noticeable accuracy drift.
- Use the recommended models for testing:
  - For K2.6: use kimi-k2.6
  - For K2.5: use kimi-k2.5
  - For K2 series: use kimi-k2-thinking-turbo for faster inference
- Must set stream = true: non-streaming mode can lead to random mid-connection interruptions that are hard to control.
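A sketch of consuming a streamed response, assuming the OpenAI-compatible SDK chunk shape (`choices[0].delta.content`); the fake chunks below stand in for a real `client.chat.completions.create(..., stream=True)` iterator:

```python
# Sketch: accumulate a streamed chat completion into one string.
# Chunk shape assumes the OpenAI-compatible SDK (choices[0].delta.content);
# the fake chunks stand in for a real streamed response.
from types import SimpleNamespace

def collect_stream(stream) -> str:
    """Concatenate the content deltas of a streamed completion."""
    parts = []
    for chunk in stream:
        delta = chunk.choices[0].delta
        if getattr(delta, "content", None):
            parts.append(delta.content)
    return "".join(parts)

def fake_chunk(text):
    # Mimic one server-sent chunk of a streamed completion.
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])

answer = collect_stream(
    [fake_chunk("The answer "), fake_chunk(None), fake_chunk("is 42.")]
)
```

With a real client, pass the stream object returned by the SDK in place of the fake-chunk list.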
- Current API default settings:
  - Kimi K2.6:
    - default max_tokens = 32768
    - default thinking = {"type": "enabled", "keep": null}
    - default temperature = 1.0
    - default top_p = 0.95
    - default n = 1
    - default presence_penalty = 0.0
    - default frequency_penalty = 0.0
  - Kimi K2 Thinking:
    - default temperature = 1.0
    - default max_tokens = 64000
  - Kimi K2.5:
    - default max_tokens = 32768
    - default thinking = {"type": "enabled"}
    - default temperature = 1.0
    - default top_p = 0.95
    - default n = 1
    - default presence_penalty = 0.0
    - default frequency_penalty = 0.0
- Timeouts:
  - With stream = false, the api.moonshot.ai timeout is 2 hours, but some ISPs may terminate the connection earlier.
  - So again, we recommend setting stream = true.
- Concurrency:
  - Keep concurrency low to avoid rate limiting.
- Retry logic is not optional:
  - handle "overloaded" errors
  - handle unexpected finish reasons caused by random server issues
  - handle errors caused by complicated network issues
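The retry requirement can be sketched as a simple exponential-backoff wrapper. Which exceptions count as retryable depends on your SDK; here any exception is retried, purely for illustration:

```python
# Sketch: retry with exponential backoff around a single API call.
# In real code, restrict the except clause to your SDK's retryable errors
# (overloaded, rate limit, connection reset, unexpected finish reason, ...).
import time

def call_with_retry(fn, max_attempts=5, base_delay=1.0):
    """Call fn(), retrying on failure with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error
            time.sleep(base_delay * (2 ** attempt))

# Demo with a function that fails twice before succeeding.
attempts = {"n": 0}
def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RuntimeError("overloaded")
    return "ok"

result = call_with_retry(flaky, base_delay=0.0)
```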
FAQ

Q1. Is the temperature setting consistent across models?
A. No. Different model families use different recommended temperatures:
- k2.6 model: temperature = 1.0
- k2.5 model: temperature = 1.0
- k2-thinking series: temperature = 1.0
- k2 other series: temperature = 0.6
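The per-family temperatures above can be kept in one lookup table so harness code never hard-codes them; the family keys here are our own labels, not API model identifiers:

```python
# Sketch: recommended sampling temperature per model family, as listed
# in the FAQ. Keys are illustrative labels, not API model identifiers.
RECOMMENDED_TEMPERATURE = {
    "k2.6": 1.0,
    "k2.5": 1.0,
    "k2-thinking": 1.0,
    "k2-other": 0.6,
}

def recommended_temperature(family: str) -> float:
    """Return the recommended temperature, failing loudly on unknown families."""
    return RECOMMENDED_TEMPERATURE[family]
```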