MiniMax M2: Developer guide to the 230B open-source coding model
Updated on April 13, 2026
[Image: MiniMax M2 230B open-source coding model architecture visualization]
MiniMax released M2 in late 2025 as an open-weight model targeting serious coding and agentic work. At 230B total parameters with only 10B active via a Mixture-of-Experts architecture, it punches well above its inference cost. The model weights are MIT-licensed, the API is free for a limited time, and the community is already deploying it on commodity GPU clusters.
This guide covers everything a developer needs to actually use it: the API, local deployment with vLLM, tool calling, and the one gotcha that quietly kills performance if you miss it.
What is MiniMax M2?
MiniMax M2 is a 230B-parameter MoE language model built for coding and agentic tasks. Because only 10B parameters activate per forward pass, it runs at roughly the cost of a 10B dense model while drawing on the capacity of a much larger one. The tradeoff is memory at rest — you still need to hold all the weights in VRAM.
Key facts at a glance:
| Spec | Value |
|---|---|
| Total parameters | 230B |
| Active parameters (MoE) | 10B |
| Context window | 200K tokens |
| Max output | 128K tokens (incl. CoT) |
| Precision | FP8 |
| License | MIT |
The model supports function calling, JSON mode, streaming, and real-time reasoning via interleaved <think> blocks.
M2 vs M2.1: which version to use
MiniMax shipped M2.1 as a follow-up focused on real-world engineering tasks. The headline improvements are:
- Multi-language programming: Rust, Java, Go, C++, Kotlin, TypeScript, Objective-C
- Native Android and iOS app development
- More concise outputs with lower token consumption
- Office automation and composite instruction handling
For most developers, M2.1 is the version to run today. Third-party providers (DeepInfra, Together AI) already redirect M2 requests to M2.1 automatically. The local deployment steps below use M2.1 explicitly.
Benchmark results
MiniMax M2 was competitive with frontier closed models at release. M2.1 pushed further:
| Benchmark | MiniMax M2 | MiniMax M2.1 |
|---|---|---|
| SWE-bench Verified | 69.4% | Higher (exceeds Claude Sonnet 4.5 on multilingual) |
| Terminal-Bench | 46.3% | — |
| BrowseComp | 44% | — |
| GAIA (text only) | 75.7% | — |
| VIBE average | — | 88.6 |
| VIBE-Web | — | 91.5 |
| VIBE-Android | — | 89.7 |
| SWE-bench Multilingual | 56.5% | 72.5% |
For context, GPT-5 scores 74.9% on SWE-bench Verified. M2 at 69.4% is close, and M2.1’s multilingual score of 72.5% already clears Claude Sonnet 4.5’s 68% on the same benchmark.
If you’re building AI agent systems that need a capable open-weight backbone, these numbers make M2.1 worth a serious look.
Cloud API quickstart
MiniMax Open Platform
The fastest path is the official MiniMax API at platform.minimax.io. It’s free for a limited time and uses an OpenAI-compatible interface:
```python
from openai import OpenAI

client = OpenAI(
    base_url="https://api.minimax.io/v1",
    api_key="YOUR_MINIMAX_API_KEY",
)

response = client.chat.completions.create(
    model="MiniMax-M2",  # the platform's model ID for M2/M2.1
    messages=[
        {"role": "user", "content": "Write a Rust function to parse a CSV with error handling."}
    ],
    temperature=1.0,
    top_p=0.95,
)

print(response.choices[0].message.content)
```
Recommended inference parameters for M2:
- `temperature`: 1.0
- `top_p`: 0.95
- `top_k`: 40
These are MiniMax’s own recommendations. Lowering temperature tends to make the thinking more deterministic but can reduce creativity on open-ended tasks.
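To keep every call site consistent, it helps to bundle these settings in one place. The helper below is a hypothetical convenience function, not part of any SDK; note that `top_k` is not a first-class argument in the OpenAI Python SDK, so it has to travel via `extra_body` (vLLM-backed OpenAI-compatible servers forward unknown fields this way; whether the MiniMax cloud endpoint honors it is an assumption).

```python
def minimax_sampling_kwargs(deterministic: bool = False) -> dict:
    """MiniMax's recommended M2 sampling settings as create() kwargs.

    deterministic=True drops the temperature for more repeatable runs,
    at the cost of creativity on open-ended tasks.
    """
    return {
        "temperature": 0.1 if deterministic else 1.0,
        "top_p": 0.95,
        # top_k is not in the OpenAI SDK signature; extra_body fields are
        # forwarded as-is in the request JSON.
        "extra_body": {"top_k": 40},
    }

# Usage:
# client.chat.completions.create(model="MiniMax-M2", messages=msgs,
#                                **minimax_sampling_kwargs())
```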
Third-party providers
If you want a single API key that spans multiple models, both DeepInfra and Together AI host M2/M2.1 with OpenAI-compatible endpoints:
```shell
curl "https://api.deepinfra.com/v1/openai/chat/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $DEEPINFRA_TOKEN" \
  -d '{
        "model": "MiniMaxAI/MiniMax-M2",
        "messages": [{"role": "user", "content": "Explain MoE routing in one paragraph."}],
        "stream": true
      }'
```
DeepInfra redirects M2 requests to M2.1 automatically, so you don’t need to change your model string.
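With `"stream": true`, these endpoints return OpenAI-style chunks where each `choices[0].delta.content` carries the next text fragment (and is `None` for role or tool-call deltas). A small reassembly helper, sketched against that chunk shape:

```python
def collect_stream(chunks) -> str:
    """Concatenate text deltas from an OpenAI-compatible streaming response."""
    parts = []
    for chunk in chunks:
        delta = chunk.choices[0].delta
        # Role announcements and tool-call deltas have content=None; skip them.
        if delta.content:
            parts.append(delta.content)
    return "".join(parts)

# Usage with the OpenAI SDK:
# stream = client.chat.completions.create(model="MiniMaxAI/MiniMax-M2",
#                                         messages=msgs, stream=True)
# print(collect_stream(stream))
```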
Local deployment with vLLM
Running M2.1 locally gives you full control over context, throughput, and cost. The model is open-weight on HuggingFace at MiniMaxAI/MiniMax-M2.1.
Hardware requirements
| Use case | GPU recommendation |
|---|---|
| Standard agentic / coding (200K ctx) | 4× A100/A800 80GB or 4× H200 |
| Extended context (multi-million tokens) | 8× H200 141GB |
The FP8 weights weigh in at around 220GB. Budget another ~240GB of KV cache per million context tokens on top of that.
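A back-of-envelope check using those two figures shows why 4× 80GB cards clear the 200K-context case. This is a rough sketch only; real deployments also need activation memory and per-GPU overhead.

```python
def vram_per_gpu_gb(num_gpus: int, context_tokens: int,
                    weights_gb: float = 220.0,
                    kv_gb_per_mtok: float = 240.0) -> float:
    """Approximate per-GPU VRAM: FP8 weights plus KV cache, sharded evenly."""
    kv_gb = kv_gb_per_mtok * context_tokens / 1_000_000
    return (weights_gb + kv_gb) / num_gpus

# 4 GPUs at the full 200K context:
print(round(vram_per_gpu_gb(4, 200_000), 1))  # 67.0 GB per GPU, fits an 80GB A100
```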
Installation
Use the nightly vLLM build — the stable release doesn’t include the MiniMax tool call parser yet:
```shell
uv venv
source .venv/bin/activate
uv pip install -U vllm --extra-index-url https://wheels.vllm.ai/nightly
```
Launching the server
4-GPU setup (standard workloads):
```shell
SAFETENSORS_FAST_GPU=1 vllm serve MiniMaxAI/MiniMax-M2.1 \
  --trust-remote-code \
  --tensor-parallel-size 4 \
  --enable-auto-tool-choice \
  --tool-call-parser minimax_m2 \
  --reasoning-parser minimax_m2_append_think
```
8-GPU setup (data parallelism + expert parallelism):
```shell
SAFETENSORS_FAST_GPU=1 vllm serve MiniMaxAI/MiniMax-M2.1 \
  --trust-remote-code \
  --data-parallel-size 8 \
  --enable-expert-parallel \
  --enable-auto-tool-choice \
  --tool-call-parser minimax_m2 \
  --reasoning-parser minimax_m2_append_think
```
Key flags:
- `SAFETENSORS_FAST_GPU=1`: loads weights directly to GPU, skipping CPU staging for a faster startup
- `--tool-call-parser minimax_m2`: converts MiniMax’s XML tool format to OpenAI-compatible `tool_calls`
- `--reasoning-parser minimax_m2_append_think`: extracts `<think>` blocks into a dedicated reasoning field in the response
Once running, the server exposes a standard /v1/chat/completions endpoint on port 8000.
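A quick stdlib-only smoke test confirms the endpoint is up. `build_payload` and `chat` are hypothetical helper names for this sketch; the URL and model string match the `vllm serve` command above.

```python
import json
import urllib.request


def build_payload(prompt: str) -> dict:
    """Request body for the local vLLM OpenAI-compatible endpoint."""
    return {
        "model": "MiniMaxAI/MiniMax-M2.1",
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 1.0,
        "top_p": 0.95,
    }


def chat(prompt: str,
         url: str = "http://localhost:8000/v1/chat/completions") -> str:
    """POST one prompt and return the assistant's text."""
    req = urllib.request.Request(
        url,
        data=json.dumps(build_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]


# chat("Say hello in one word.")  # requires the server to be running
```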
Troubleshooting common errors
CUDA memory access error:
```shell
# Add this flag to the vllm serve command
--compilation-config '{"cudagraph_mode": "PIECEWISE"}'
```
PyTorch 2.9+ matmul precision error:
```shell
VLLM_FLOAT32_MATMUL_PRECISION="tf32" SAFETENSORS_FAST_GPU=1 vllm serve MiniMaxAI/MiniMax-M2.1 \
  --trust-remote-code \
  --tensor-parallel-size 4 \
  --enable-auto-tool-choice \
  --tool-call-parser minimax_m2 \
  --reasoning-parser minimax_m2_append_think
```
Tool calling and agentic workflows
Preserving <think> blocks
This is the single most important thing to get right. MiniMax M2 is an interleaved thinking model: it wraps its reasoning in <think>...</think> tags before responding. If you strip those tags from conversation history, benchmark performance can drop by up to 40% on some evals.
Always pass the full assistant response including <think> blocks back in multi-turn conversations:
```python
# CORRECT: preserve the full response, including <think> blocks
messages = [
    {"role": "user", "content": "Fix this Python bug: ..."},
    {"role": "assistant", "content": full_response_with_think_tags},
    {"role": "user", "content": "Now add unit tests for that fix."},
]

# WRONG: stripping the thinking content
messages = [
    {"role": "user", "content": "Fix this Python bug: ..."},
    {"role": "assistant", "content": visible_text_only},  # degrades performance
    {"role": "user", "content": "Now add unit tests for that fix."},
]
```
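One wrinkle when self-hosting: with `--reasoning-parser minimax_m2_append_think`, vLLM returns the reasoning in a separate `reasoning_content` field on the message, with the visible answer in `content`. When replaying history, stitch the two back together so the model sees its own `<think>` block. The field name follows vLLM's reasoning-parser output; treat it as an assumption for other providers. `assistant_turn` is a hypothetical helper name.

```python
def assistant_turn(message) -> dict:
    """Build a history entry that preserves the model's reasoning.

    Works with vLLM message objects that may carry a separate
    reasoning_content field alongside the visible content.
    """
    reasoning = getattr(message, "reasoning_content", None)
    content = message.content or ""
    if reasoning:
        # Re-wrap the extracted reasoning so the replayed turn matches
        # what the model originally generated.
        content = f"<think>{reasoning}</think>{content}"
    return {"role": "assistant", "content": content}

# messages.append(assistant_turn(response.choices[0].message))
```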
Function calling example
After launching the local server, tool calling works exactly like OpenAI’s API:
```python
from openai import OpenAI
import json

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

tools = [{
    "type": "function",
    "function": {
        "name": "run_shell_command",
        "description": "Execute a shell command and return stdout/stderr",
        "parameters": {
            "type": "object",
            "properties": {
                "command": {"type": "string", "description": "The shell command to run"},
                "cwd": {"type": "string", "description": "Working directory"}
            },
            "required": ["command"]
        }
    }
}]

response = client.chat.completions.create(
    model="MiniMaxAI/MiniMax-M2.1",
    messages=[
        {"role": "user", "content": "List all Python files in the current directory and count lines of code in each."}
    ],
    tools=tools,
    tool_choice="auto",
    temperature=1.0,
    top_p=0.95,
)

choice = response.choices[0]
if choice.finish_reason == "tool_calls":
    tool_call = choice.message.tool_calls[0].function
    print(f"Calling: {tool_call.name}")
    print(f"Args: {json.loads(tool_call.arguments)}")
For multi-agent pipelines, M2.1 works well as a backbone in frameworks like CrewAI or LangGraph. The model’s strong tool use and long context window (200K tokens) make it well-suited to the kind of long-horizon agentic processes where smaller models lose coherence.
Performance at scale
Running on 4× H200 GPUs, vLLM with M2.1 delivers:
| Scenario | Output tokens/sec | Time to first token |
|---|---|---|
| Max throughput | 6,624 | 2,890ms |
| Production (16 concurrent) | 919 | 57ms |
| Per-output token latency | — | 17ms (16 concurrent) |
For most agentic workloads, the 16-concurrent-request configuration is the practical operating point — low latency, reasonable throughput, and headroom for bursty traffic.
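The table's rows are mutually consistent, which is a useful sanity check when you benchmark your own deployment: dividing the aggregate throughput by the concurrency gives the per-stream rate, whose inverse matches the stated per-output-token latency.

```python
# Figures from the 16-concurrent row above.
aggregate_tps = 919   # total output tokens/sec across all requests
concurrency = 16

per_stream_tps = aggregate_tps / concurrency   # tokens/sec seen by one request
inter_token_ms = 1000 / per_stream_tps         # gap between output tokens

print(round(per_stream_tps, 1), round(inter_token_ms, 1))  # 57.4 17.4, i.e. ~17ms
```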
To benchmark your own deployment:
```shell
uv pip install "vllm[bench]"

vllm bench serve \
  --backend openai-chat \
  --model MiniMaxAI/MiniMax-M2.1 \
  --endpoint /v1/chat/completions \
  --dataset-name hf \
  --dataset-path likaixin/InstructCoder \
  --num-prompts 1000 \
  --max-concurrency 16 \
  --temperature=1.0 \
  --top-p=0.95
```
Summary
MiniMax M2 is one of the more capable open-weight coding models available right now. Here’s what to take away:
- Use M2.1, not the original M2: the multilingual and agentic improvements are significant
- Start with the cloud API at platform.minimax.io while it’s free, then self-host if costs or latency become a concern
- 4× A100 80GB is the minimum practical local setup for the 200K context window
- Never strip `<think>` blocks from conversation history; preserve the full assistant response
- Use the nightly vLLM build; the stable release lacks the MiniMax tool call parser
For teams evaluating open-weight models against closed APIs, M2.1’s SWE-bench Multilingual score of 72.5% (vs GPT-5’s 74.9%) makes it worth a serious eval, especially if data privacy or cost at scale is a concern. Compare it against the latest GPT models before committing to a provider.
→ MiniMax M2 GitHub repo (weights, quickstart, architecture notes)