Benchmarking is an engineering task that demands stability and reproducibility. You'll be calling the model thousands of times; even tiny drifts in system setup or network latency can compromise the accuracy of your results. Here's what we've learned about keeping results reproducible and trustworthy.

Quick notes
- For any unlisted or closed-source benchmark: set temperature = 1.0, stream = true, top_p = 0.95
- Reasoning benchmarks: max_tokens = 128k, and run at least 500–1000 samples to keep variance low (e.g. AIME 2025: 30 questions × 32 runs = 960 samples)
- Coding benchmarks: max_tokens = 256k
- Agentic task benchmarks:
  - For multi-hop search: max_tokens = 256k + context management
  - Others: max_tokens ≥ 16k–64k
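As a rough sketch, the quick notes above can be encoded as per-category request parameters. The category labels and the `build_params` helper below are our own illustration, not part of the API:

```python
# Sketch: map a benchmark category to the recommended request parameters.
# Category labels and this helper are illustrative, not part of the API.

DEFAULTS = {"temperature": 1.0, "top_p": 0.95, "stream": True}

# Recommended max_tokens per benchmark category (from the quick notes above).
MAX_TOKENS = {
    "reasoning": 128 * 1024,
    "coding": 256 * 1024,
    "agentic_multi_hop_search": 256 * 1024,  # plus context management
    "agentic_other": 64 * 1024,              # anywhere in the 16k-64k range
}

def build_params(category: str) -> dict:
    """Return keyword arguments for a chat.completions.create call."""
    if category not in MAX_TOKENS:
        raise ValueError(f"unknown benchmark category: {category}")
    return {**DEFAULTS, "max_tokens": MAX_TOKENS[category]}

params = build_params("reasoning")
```

The returned dict can be splatted directly into an OpenAI-compatible client call, e.g. `client.chat.completions.create(model=..., messages=..., **params)`.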
K2.6 Models Benchmark Recommended Settings
| Benchmark Category | Benchmark | Temperature | Recommended max tokens | Recommended runs | Top-p | Others (e.g. test log) |
|---|---|---|---|---|---|---|
| Multi-modal | MMMU-Pro | 1.0 | max tokens = 96k | 3 | top_p=0.95 | thinking= |
| | MMMU-Pro w/ python | 1.0 | per step tokens = 64k; total max tokens = 256k | 3 | top_p=0.95 | Recommended max steps = 50; thinking= |
| | CharXiv (RQ) | 1.0 | max tokens = 96k | 3 | top_p=0.95 | thinking= |
| | CharXiv (RQ) w/ python | 1.0 | per step tokens = 64k; total max tokens = 256k | 3 | top_p=0.95 | Recommended max steps = 50; thinking= |
| | MathVision | 1.0 | max tokens = 96k | 3 | top_p=0.95 | thinking= |
| | MathVision w/ python | 1.0 | per step tokens = 64k; total max tokens = 256k | 3 | top_p=0.95 | Recommended max steps = 50; thinking= |
| | V* w/ python | 1.0 | per step tokens = 64k; total max tokens = 256k | 3 | top_p=0.95 | Recommended max steps = 50; thinking= |
| Agent | HLE-Full w/ tools | 1.0 | per step tokens = 48k; total max tokens = 256k | 1 | top_p=0.95 | Recommended max steps = 300; thinking= |
| | BrowseComp | 1.0 | per step tokens = 48k; total max tokens = 256k | 1 | top_p=0.95 | Recommended max steps = 300; thinking= |
| | DeepSearchQA | 1.0 | per step tokens = 48k; total max tokens = 256k | 1 | top_p=0.95 | Recommended max steps = 300; thinking= |
| | WideSearch | 1.0 | per step tokens = 48k; total max tokens = 256k | 4 | top_p=0.95 | Recommended max steps = 300; thinking= |
| | Toolathlon | 1.0 | per step tokens = 48k; total max tokens = 256k | 4 | top_p=0.95 | Recommended max steps = 300; thinking= |
| | MCPMark | 1.0 | per step tokens = 48k; total max tokens = 256k | 4 | top_p=0.95 | Recommended max steps = 300; thinking= |
| | Claw Eval | 1.0 | per step tokens = 48k; total max tokens = 256k | 4 | top_p=0.95 | Recommended max steps = 300; thinking= |
| | APEX-Agents | 1.0 | per step tokens = 48k; total max tokens = 256k | 4 | top_p=0.95 | Recommended max steps = 300; thinking= |
| Coding | Terminal-Bench 2.0 (Terminus-2) | 1.0 | max tokens = 256k | 3 | top_p=0.95 | thinking= |
| | SWE-Bench Pro | 1.0 | per step tokens = 32k; total max tokens = 256k | 5 | top_p=0.95 | Recommended max steps = 300; thinking= |
| | SWE-Bench Multilingual | 1.0 | per step tokens = 32k; total max tokens = 256k | 5 | top_p=0.95 | Recommended max steps = 300; thinking= |
| | SWE-Bench Verified | 1.0 | per step tokens = 32k; total max tokens = 256k | 5 | top_p=0.95 | Recommended max steps = 300; thinking= |
| | SciCode | 1.0 | max tokens = 96k | 4 | top_p=0.95 | thinking= |
| | OJBench (python) | 1.0 | max tokens = 96k | 8 | top_p=0.95 | thinking= |
| | LiveCodeBench (v6) | 1.0 | max tokens = 96k | 1 | top_p=0.95 | thinking= |
| Math | AIME 2026 | 1.0 | max tokens = 96k | 32 | top_p=0.95 | thinking= |
| | HMMT 2026 (Feb) | 1.0 | max tokens = 96k | 32 | top_p=0.95 | thinking= |
| | IMO-AnswerBench | 1.0 | max tokens = 96k | 4 | top_p=0.95 | thinking= |
| Knowledge | HLE-Full | 1.0 | max tokens = 96k | 1 | top_p=0.95 | thinking= |
| | GPQA-Diamond | 1.0 | max tokens = 96k | 8 | top_p=0.95 | thinking= |
K2.5 Models Benchmark Recommended Settings
| Benchmark Category | Benchmark | Temperature | Recommended max tokens | Recommended runs | Top-p | Others (e.g. test log) |
|---|---|---|---|---|---|---|
| Multi-modal | MMMU-Pro | 1.0 | max tokens = 64k | 3 | top_p=0.95 | thinking= |
| | CharXiv (RQ) | 1.0 | max tokens = 64k | 3 | top_p=0.95 | thinking= |
| | MathVision | 1.0 | max tokens = 64k | 3 | top_p=0.95 | thinking= |
| | MathVista | 1.0 | max tokens = 64k | 3 | top_p=0.95 | thinking= |
| | OCRBench | 1.0 | max tokens = 64k | 3 | top_p=0.95 | thinking= |
| | ZeroBench | 1.0 | max tokens = 64k | 3 | top_p=0.95 | thinking= |
| | WorldVQA | 1.0 | max tokens = 64k | 3 | top_p=0.95 | thinking= |
| | InfoVQA (val) | 1.0 | max tokens = 64k | 3 | top_p=0.95 | thinking= |
| | SimpleVQA | 1.0 | max tokens = 64k | 3 | top_p=0.95 | thinking= |
| | ZeroBench w/ tools | 1.0 | max tokens = 64k | 3 | top_p=0.95 | Recommended max steps = 30; thinking= |
| Code | SWE Series | 1.0 | per step tokens = 16k; total max tokens = 256k | 5 | top_p=0.95 | thinking= |
| | Lcb + OJBench | 1.0 | max tokens = 128k | 1 | top_p=0.95 | thinking= |
| | TerminalBench | 1.0 | max tokens = 128k | 3 | top_p=0.95 | thinking= |
| Reasoning | AIME2025 no tools | 1.0 | total max tokens = 96k | 32 | top_p=0.95 | thinking= |
| | AIME2025 w/ tools | 1.0 | per turn tokens = 96k; total max tokens = 96k | 32 | top_p=0.95 | thinking=; Recommended max steps = 120 |
| | HLE no tools | 1.0 | max tokens = 96k | 1 | top_p=0.95 | thinking= |
| | HLE w/ tools | 1.0 | total max tokens = 128k; per step tokens = 48k | 1 | top_p=0.95 | thinking=; Recommended max steps = 120 |
| | HLE heavy | 1.0 | total max tokens = 128k; per step tokens = 48k | 1 | top_p=0.95 | thinking=; Recommended max steps = 200; parallel n=8 |
| | HMMT2025 no tools | 1.0 | max tokens = 96k | 32 | top_p=0.95 | thinking= |
| | HMMT2025 w/ tools | 1.0 | per step tokens = 96k; total tokens = 96k | 32 | top_p=0.95 | thinking=; Recommended max steps = 120 |
| | IMO-AnswerBench | 1.0 | max tokens = 96k | 3 | top_p=0.95 | thinking= |
| | GPQA-Diamond | 1.0 | max tokens = 96k | 8 | top_p=0.95 | thinking= |
| Agentic Search Task | BrowseComp / BrowseComp-ZH / Seal-0 / Frames | 1.0 | per step tokens = 24k; total max tokens = 256k | 4 | top_p=0.95 | thinking=; Recommended max steps = 250. Recommend a context management mechanism to prevent overly long context and ensure enough tool calls. Include today's date in the system prompt and let the model search when it is uncertain |
| Agentic Task | Tau | 1.0 | >=16k | 4 | top_p=0.95 | thinking=; Recommended max steps = 100 |
When thinking mode is enabled (thinking = {"type": "enabled"}), please note the following constraints to ensure model performance:
- tool_choice can only be set to "auto" or "none" (the default is "auto"), to avoid conflicts between reasoning content and a specified tool_choice. Any other value will result in an error.
- During multi-step tool calling, you must keep the reasoning_content from the assistant message of the current turn's tool call in the context, otherwise an error will be thrown.
- The official builtin $web_search tool is temporarily incompatible with Kimi K2.5/K2.6 thinking mode; to use $web_search, disable thinking mode first.
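A minimal sketch of the multi-step tool-calling constraint above: when echoing the assistant's tool-call turn back into the context, carry its reasoning_content along. The message shapes follow the OpenAI-compatible chat format; the `calc` tool and all payloads below are made up for illustration:

```python
# Sketch: preserve reasoning_content when replaying a tool-calling turn.
# Message shapes follow the OpenAI-compatible chat format; the tool name
# and contents are hypothetical.

def extend_context(messages, assistant_msg, tool_results):
    """Append the assistant tool-call turn and its tool results to the context."""
    turn = {
        "role": "assistant",
        "content": assistant_msg.get("content"),
        "tool_calls": assistant_msg["tool_calls"],
    }
    # Required in thinking mode: dropping reasoning_content raises an API error.
    if "reasoning_content" in assistant_msg:
        turn["reasoning_content"] = assistant_msg["reasoning_content"]
    messages.append(turn)
    for r in tool_results:
        messages.append(
            {"role": "tool", "tool_call_id": r["id"], "content": r["content"]}
        )
    return messages

history = extend_context(
    [{"role": "user", "content": "What's 2+2?"}],
    {
        "content": "",
        "reasoning_content": "I should call the calculator.",
        "tool_calls": [{"id": "call_1", "type": "function",
                        "function": {"name": "calc",
                                     "arguments": '{"expr": "2+2"}'}}],
    },
    [{"id": "call_1", "content": "4"}],
)
```

The resulting `history` is what you would pass back as `messages` on the next step of the loop.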
K2-Thinking Series Models Benchmark Recommended Settings
| Category | Benchmark | Temperature | Max tokens | Suggested runs | Notes |
|---|---|---|---|---|---|
| Code | SWE | 0.7 (recommended), 1.0 (ok) | per step tokens = 16k; total max tokens = 256k | 5 | |
| | Lcb + OJBench | 1.0 | max tokens = 128k | 1 | |
| | TerminalBench | 1.0 | max tokens = 128k | 3 | |
| Reasoning | AIME2025 no tools | 1.0 | total max tokens = 96k | 32 | |
| | AIME2025 w/ tools | 1.0 | per step tokens = 48k; total max tokens = 128k | 16 | max steps = 120 |
| | HLE no tools | 1.0 | max tokens = 96k | 1 | |
| | HLE w/ tools | 1.0 | total max tokens = 128k; per step tokens = 48k | 1 | max steps = 120 |
| | HLE heavy | 1.0 | total max tokens = 128k; per step tokens = 48k | 1 | max steps = 200; parallel n=8 |
| | HMMT2025 no tools | 1.0 | max tokens = 96k | 32 | |
| | HMMT2025 w/ tools | 1.0 | per step tokens = 96k; total tokens = 96k | 32 | max steps = 120 |
| | IMO-AnswerBench | 1.0 | max tokens = 96k | 3 | |
| | GPQA-Diamond | 1.0 | max tokens = 96k | 8 | |
| Agentic Search Task | BrowseComp / BrowseComp-ZH / Seal-0 / Frames | 1.0 | per step tokens = 24k; total max tokens = 256k | 4 | max steps = 250. Enable context management to prevent context overflow and ensure enough tool calls. Include today's date in the system prompt, and tell the model to search when unsure. |
| Agentic Task | Tau | 0.0 | >=16k | 4 | max steps = 100 |
API Recommendations & Notes
- Use the official API: some third-party endpoints show noticeable accuracy drift.
- Use the recommended models for testing:
  - For K2.6: use kimi-k2.6
  - For K2.5: use kimi-k2.5
  - For K2 series: use kimi-k2-thinking-turbo for faster inference
- Must set stream = true: non-streaming mode can lead to random mid-connection interruptions that are hard to control.
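A sketch of consuming a streamed response, assuming the OpenAI-compatible SDK chunk shape (`choices[0].delta.content`); the fake chunks below stand in for a real `client.chat.completions.create(..., stream=True)` iterator:

```python
# Sketch: accumulate a streamed chat completion into one string.
# Chunk shape assumes the OpenAI-compatible SDK (choices[0].delta.content);
# the fake chunks stand in for a real streamed response.
from types import SimpleNamespace

def collect_stream(stream) -> str:
    """Concatenate the content deltas of a streamed completion."""
    parts = []
    for chunk in stream:
        delta = chunk.choices[0].delta
        if getattr(delta, "content", None):
            parts.append(delta.content)
    return "".join(parts)

def fake_chunk(text):
    # Mimic one server-sent chunk of a streamed completion.
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])

answer = collect_stream(
    [fake_chunk("The answer "), fake_chunk(None), fake_chunk("is 42.")]
)
```

With a real client, pass the stream object returned by the SDK in place of the fake-chunk list.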
- Current API default settings:
  - Kimi K2.6:
    - default max_tokens = 32768
    - default thinking = {"type": "enabled", "keep": null}
    - default temperature = 1.0
    - default top_p = 0.95
    - default n = 1
    - default presence_penalty = 0.0
    - default frequency_penalty = 0.0
  - Kimi K2 Thinking:
    - default temperature = 1.0
    - default max_tokens = 64000
  - Kimi K2.5:
    - default max_tokens = 32768
    - default thinking = {"type": "enabled"}
    - default temperature = 1.0
    - default top_p = 0.95
    - default n = 1
    - default presence_penalty = 0.0
    - default frequency_penalty = 0.0
- Timeouts:
  - With stream = false, the api.moonshot.ai timeout is 2 hours, but some ISPs may terminate the connection earlier.
  - So again, we recommend setting stream = true.
- Concurrency:
  - Keep concurrency low to avoid rate limiting.
- Retry logic is not optional:
  - handle "overloaded" errors
  - handle unexpected finish reasons caused by random server issues
  - handle errors caused by complicated network issues
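The retry requirement can be sketched as a simple exponential-backoff wrapper. Which exceptions count as retryable depends on your SDK; here any exception is retried, purely for illustration:

```python
# Sketch: retry with exponential backoff around a single API call.
# In real code, restrict the except clause to your SDK's retryable errors
# (overloaded, rate limit, connection reset, unexpected finish reason, ...).
import time

def call_with_retry(fn, max_attempts=5, base_delay=1.0):
    """Call fn(), retrying on failure with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error
            time.sleep(base_delay * (2 ** attempt))

# Demo with a function that fails twice before succeeding.
attempts = {"n": 0}
def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RuntimeError("overloaded")
    return "ok"

result = call_with_retry(flaky, base_delay=0.0)
```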
FAQ

Q1. Is the temperature setting consistent across models?
A. No. Different model families use different recommended temperatures:
- k2.6 model: temperature = 1.0
- k2.5 model: temperature = 1.0
- k2-thinking series: temperature = 1.0
- k2 other series: temperature = 0.6
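The per-family temperatures above can be kept in one lookup table so harness code never hard-codes them; the family keys here are our own labels, not API model identifiers:

```python
# Sketch: recommended sampling temperature per model family, as listed
# in the FAQ. Keys are illustrative labels, not API model identifiers.
RECOMMENDED_TEMPERATURE = {
    "k2.6": 1.0,
    "k2.5": 1.0,
    "k2-thinking": 1.0,
    "k2-other": 0.6,
}

def recommended_temperature(family: str) -> float:
    """Return the recommended temperature, failing loudly on unknown families."""
    return RECOMMENDED_TEMPERATURE[family]
```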