MiniMax M2: Developer guide to the 230B open-source coding model
Updated on April 13, 2026
[Image: MiniMax M2 230B open-source coding model architecture visualization]
MiniMax released M2 in late 2025 as an open-weight model targeting serious coding and agentic work. At 230B total parameters with only 10B active via a Mixture-of-Experts architecture, it punches well above its inference cost. The model weights are MIT-licensed, the API is free for a limited time, and the community is already deploying it on commodity GPU clusters.
This guide covers everything a developer needs to actually use it: the API, local deployment with vLLM, tool calling, and the one gotcha that quietly kills performance if you miss it.
What is MiniMax M2?
MiniMax M2 is a 230B-parameter MoE language model built for coding and agentic tasks. Because only 10B parameters activate per forward pass, it runs at roughly the cost of a 10B dense model while drawing on the capacity of a much larger one. The tradeoff is memory at rest — you still need to hold all the weights in VRAM.
Key facts at a glance:
| Spec | Value |
|---|---|
| Total parameters | 230B |
| Active parameters (MoE) | 10B |
| Context window | 200K tokens |
| Max output | 128K tokens (incl. CoT) |
| Precision | FP8 |
| License | MIT |
The model supports function calling, JSON mode, streaming, and real-time reasoning via interleaved <think> blocks.
M2 vs M2.1: which version to use
MiniMax shipped M2.1 as a follow-up focused on real-world engineering tasks. The headline improvements are:
- Multi-language programming: Rust, Java, Go, C++, Kotlin, TypeScript, Objective-C
- Native Android and iOS app development
- More concise outputs with lower token consumption
- Office automation and composite instruction handling
For most developers, M2.1 is the version to run today. Third-party providers (DeepInfra, Together AI) already redirect M2 requests to M2.1 automatically. The local deployment steps below use M2.1 explicitly.
Benchmark results
MiniMax M2 was competitive with frontier closed models at release. M2.1 pushed further:
| Benchmark | MiniMax M2 | MiniMax M2.1 |
|---|---|---|
| SWE-bench Verified | 69.4% | Higher (exceeds Claude Sonnet 4.5 on multilingual) |
| Terminal-Bench | 46.3% | — |
| BrowseComp | 44% | — |
| GAIA (text only) | 75.7% | — |
| VIBE average | — | 88.6 |
| VIBE-Web | — | 91.5 |
| VIBE-Android | — | 89.7 |
| SWE-bench Multilingual | 56.5% | 72.5% |
For context, GPT-5 scores 74.9% on SWE-bench Verified. M2 at 69.4% is close, and M2.1’s multilingual score of 72.5% already clears Claude Sonnet 4.5’s 68% on the same benchmark.
If you’re building AI agent systems that need a capable open-weight backbone, these numbers make M2.1 worth a serious look.
Cloud API quickstart
MiniMax Open Platform
The fastest path is the official MiniMax API at platform.minimax.io. It’s free for a limited time and uses an OpenAI-compatible interface:
```python
from openai import OpenAI

client = OpenAI(
    base_url="https://api.minimax.io/v1",
    api_key="YOUR_MINIMAX_API_KEY",
)

response = client.chat.completions.create(
    model="MiniMax-M2",  # the platform's model ID for M2/M2.1
    messages=[
        {"role": "user", "content": "Write a Rust function to parse a CSV with error handling."}
    ],
    temperature=1.0,
    top_p=0.95,
)

print(response.choices[0].message.content)
```
Recommended inference parameters for M2:
- `temperature`: 1.0
- `top_p`: 0.95
- `top_k`: 40
These are MiniMax’s own recommendations. Lowering temperature tends to make the thinking more deterministic but can reduce creativity on open-ended tasks.
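To keep every call site consistent, it helps to bundle these settings in one place. The helper below is a hypothetical convenience function, not part of any SDK; note that `top_k` is not a first-class argument in the OpenAI Python SDK, so it has to travel via `extra_body` (vLLM-backed OpenAI-compatible servers forward unknown fields this way; whether the MiniMax cloud endpoint honors it is an assumption).

```python
def minimax_sampling_kwargs(deterministic: bool = False) -> dict:
    """MiniMax's recommended M2 sampling settings as create() kwargs.

    deterministic=True drops the temperature for more repeatable runs,
    at the cost of creativity on open-ended tasks.
    """
    return {
        "temperature": 0.1 if deterministic else 1.0,
        "top_p": 0.95,
        # top_k is not in the OpenAI SDK signature; extra_body fields are
        # forwarded as-is in the request JSON.
        "extra_body": {"top_k": 40},
    }

# Usage:
# client.chat.completions.create(model="MiniMax-M2", messages=msgs,
#                                **minimax_sampling_kwargs())
```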
Third-party providers
If you want a single API key that spans multiple models, both DeepInfra and Together AI host M2/M2.1 with OpenAI-compatible endpoints:
```shell
curl "https://api.deepinfra.com/v1/openai/chat/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $DEEPINFRA_TOKEN" \
  -d '{
        "model": "MiniMaxAI/MiniMax-M2",
        "messages": [{"role": "user", "content": "Explain MoE routing in one paragraph."}],
        "stream": true
      }'
```
DeepInfra redirects M2 requests to M2.1 automatically, so you don’t need to change your model string.
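With `"stream": true`, these endpoints return OpenAI-style chunks where each `choices[0].delta.content` carries the next text fragment (and is `None` for role or tool-call deltas). A small reassembly helper, sketched against that chunk shape:

```python
def collect_stream(chunks) -> str:
    """Concatenate text deltas from an OpenAI-compatible streaming response."""
    parts = []
    for chunk in chunks:
        delta = chunk.choices[0].delta
        # Role announcements and tool-call deltas have content=None; skip them.
        if delta.content:
            parts.append(delta.content)
    return "".join(parts)

# Usage with the OpenAI SDK:
# stream = client.chat.completions.create(model="MiniMaxAI/MiniMax-M2",
#                                         messages=msgs, stream=True)
# print(collect_stream(stream))
```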
Local deployment with vLLM
Running M2.1 locally gives you full control over context, throughput, and cost. The model is open-weight on HuggingFace at MiniMaxAI/MiniMax-M2.1.
Hardware requirements
| Use case | GPU recommendation |
|---|---|
| Standard agentic / coding (200K ctx) | 4× A100/A800 80GB or 4× H200 |
| Extended context (multi-million tokens) | 8× H200 141GB |
The FP8 weights weigh in at around 220GB. Budget another ~240GB of KV cache per million context tokens on top of that.
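A back-of-envelope check using those two figures shows why 4× 80GB cards clear the 200K-context case. This is a rough sketch only; real deployments also need activation memory and per-GPU overhead.

```python
def vram_per_gpu_gb(num_gpus: int, context_tokens: int,
                    weights_gb: float = 220.0,
                    kv_gb_per_mtok: float = 240.0) -> float:
    """Approximate per-GPU VRAM: FP8 weights plus KV cache, sharded evenly."""
    kv_gb = kv_gb_per_mtok * context_tokens / 1_000_000
    return (weights_gb + kv_gb) / num_gpus

# 4 GPUs at the full 200K context:
print(round(vram_per_gpu_gb(4, 200_000), 1))  # 67.0 GB per GPU, fits an 80GB A100
```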
Installation
Use the nightly vLLM build — the stable release doesn’t include the MiniMax tool call parser yet:
```shell
uv venv
source .venv/bin/activate
uv pip install -U vllm --extra-index-url https://wheels.vllm.ai/nightly
```
Launching the server
4-GPU setup (standard workloads):
```shell
SAFETENSORS_FAST_GPU=1 vllm serve MiniMaxAI/MiniMax-M2.1 \
  --trust-remote-code \
  --tensor-parallel-size 4 \
  --enable-auto-tool-choice \
  --tool-call-parser minimax_m2 \
  --reasoning-parser minimax_m2_append_think
```
8-GPU setup (data parallelism + expert parallelism):
```shell
SAFETENSORS_FAST_GPU=1 vllm serve MiniMaxAI/MiniMax-M2.1 \
  --trust-remote-code \
  --data-parallel-size 8 \
  --enable-expert-parallel \
  --enable-auto-tool-choice \
  --tool-call-parser minimax_m2 \
  --reasoning-parser minimax_m2_append_think
```
Key flags:
- `SAFETENSORS_FAST_GPU=1`: loads weights directly to GPU, skipping CPU staging for a faster startup
- `--tool-call-parser minimax_m2`: converts MiniMax’s XML tool format to OpenAI-compatible `tool_calls`
- `--reasoning-parser minimax_m2_append_think`: extracts `<think>` blocks into a dedicated reasoning field in the response
Once running, the server exposes a standard /v1/chat/completions endpoint on port 8000.
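A quick stdlib-only smoke test confirms the endpoint is up. `build_payload` and `chat` are hypothetical helper names for this sketch; the URL and model string match the `vllm serve` command above.

```python
import json
import urllib.request


def build_payload(prompt: str) -> dict:
    """Request body for the local vLLM OpenAI-compatible endpoint."""
    return {
        "model": "MiniMaxAI/MiniMax-M2.1",
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 1.0,
        "top_p": 0.95,
    }


def chat(prompt: str,
         url: str = "http://localhost:8000/v1/chat/completions") -> str:
    """POST one prompt and return the assistant's text."""
    req = urllib.request.Request(
        url,
        data=json.dumps(build_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]


# chat("Say hello in one word.")  # requires the server to be running
```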
Troubleshooting common errors
CUDA memory access error:
```shell
# Add this flag to the vllm serve command
--compilation-config '{"cudagraph_mode": "PIECEWISE"}'
```
PyTorch 2.9+ matmul precision error:
```shell
VLLM_FLOAT32_MATMUL_PRECISION="tf32" SAFETENSORS_FAST_GPU=1 vllm serve MiniMaxAI/MiniMax-M2.1 \
  --trust-remote-code \
  --tensor-parallel-size 4 \
  --enable-auto-tool-choice \
  --tool-call-parser minimax_m2 \
  --reasoning-parser minimax_m2_append_think
```
Tool calling and agentic workflows
Preserving <think> blocks
This is the single most important thing to get right. MiniMax M2 is an interleaved thinking model: it wraps its reasoning in <think>...</think> tags before responding. If you strip those tags from conversation history, benchmark performance can drop by up to 40% on some evals.
Always pass the full assistant response including <think> blocks back in multi-turn conversations:
```python
# CORRECT: preserve the full response, including <think> blocks
messages = [
    {"role": "user", "content": "Fix this Python bug: ..."},
    {"role": "assistant", "content": full_response_with_think_tags},
    {"role": "user", "content": "Now add unit tests for that fix."},
]

# WRONG: stripping the thinking content
messages = [
    {"role": "user", "content": "Fix this Python bug: ..."},
    {"role": "assistant", "content": visible_text_only},  # degrades performance
    {"role": "user", "content": "Now add unit tests for that fix."},
]
```
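One wrinkle when self-hosting: with `--reasoning-parser minimax_m2_append_think`, vLLM returns the reasoning in a separate `reasoning_content` field on the message, with the visible answer in `content`. When replaying history, stitch the two back together so the model sees its own `<think>` block. The field name follows vLLM's reasoning-parser output; treat it as an assumption for other providers. `assistant_turn` is a hypothetical helper name.

```python
def assistant_turn(message) -> dict:
    """Build a history entry that preserves the model's reasoning.

    Works with vLLM message objects that may carry a separate
    reasoning_content field alongside the visible content.
    """
    reasoning = getattr(message, "reasoning_content", None)
    content = message.content or ""
    if reasoning:
        # Re-wrap the extracted reasoning so the replayed turn matches
        # what the model originally generated.
        content = f"<think>{reasoning}</think>{content}"
    return {"role": "assistant", "content": content}

# messages.append(assistant_turn(response.choices[0].message))
```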
Function calling example
After launching the local server, tool calling works exactly like OpenAI’s API:
```python
from openai import OpenAI
import json

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

tools = [{
    "type": "function",
    "function": {
        "name": "run_shell_command",
        "description": "Execute a shell command and return stdout/stderr",
        "parameters": {
            "type": "object",
            "properties": {
                "command": {"type": "string", "description": "The shell command to run"},
                "cwd": {"type": "string", "description": "Working directory"}
            },
            "required": ["command"]
        }
    }
}]

response = client.chat.completions.create(
    model="MiniMaxAI/MiniMax-M2.1",
    messages=[
        {"role": "user", "content": "List all Python files in the current directory and count lines of code in each."}
    ],
    tools=tools,
    tool_choice="auto",
    temperature=1.0,
    top_p=0.95,
)

choice = response.choices[0]
if choice.finish_reason == "tool_calls":
    tool_call = choice.message.tool_calls[0].function
    print(f"Calling: {tool_call.name}")
    print(f"Args: {json.loads(tool_call.arguments)}")
For multi-agent pipelines, M2.1 works well as a backbone in frameworks like CrewAI or LangGraph. The model’s strong tool use and long context window (200K tokens) make it well-suited to the kind of long-horizon agentic processes where smaller models lose coherence.
Performance at scale
Running on 4× H200 GPUs, vLLM with M2.1 delivers:
| Scenario | Output tokens/sec | Time to first token |
|---|---|---|
| Max throughput | 6,624 | 2,890ms |
| Production (16 concurrent) | 919 | 57ms |
| Per-output token latency | — | 17ms (16 concurrent) |
For most agentic workloads, the 16-concurrent-request configuration is the practical operating point — low latency, reasonable throughput, and headroom for bursty traffic.
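The table's rows are mutually consistent, which is a useful sanity check when you benchmark your own deployment: dividing the aggregate throughput by the concurrency gives the per-stream rate, whose inverse matches the stated per-output-token latency.

```python
# Figures from the 16-concurrent row above.
aggregate_tps = 919   # total output tokens/sec across all requests
concurrency = 16

per_stream_tps = aggregate_tps / concurrency   # tokens/sec seen by one request
inter_token_ms = 1000 / per_stream_tps         # gap between output tokens

print(round(per_stream_tps, 1), round(inter_token_ms, 1))  # 57.4 17.4, i.e. ~17ms
```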
To benchmark your own deployment:
```shell
uv pip install "vllm[bench]"

vllm bench serve \
  --backend openai-chat \
  --model MiniMaxAI/MiniMax-M2.1 \
  --endpoint /v1/chat/completions \
  --dataset-name hf \
  --dataset-path likaixin/InstructCoder \
  --num-prompts 1000 \
  --max-concurrency 16 \
  --temperature=1.0 \
  --top-p=0.95
```
Summary
MiniMax M2 is one of the more capable open-weight coding models available right now. Here’s what to take away:
- Use M2.1, not the original M2: the multilingual and agentic improvements are significant
- Start with the cloud API at platform.minimax.io while it’s free, then self-host if costs or latency become a concern
- 4× A100 80GB is the minimum practical local setup for the 200K context window
- Never strip `<think>` blocks from conversation history; preserve the full assistant response
- Use the nightly vLLM build; the stable release lacks the MiniMax tool call parser
For teams evaluating open-weight models against closed APIs, M2.1’s SWE-bench Multilingual score of 72.5% (vs GPT-5’s 74.9%) makes it worth a serious eval, especially if data privacy or cost at scale is a concern. Compare it against the latest GPT models before committing to a provider.
→ MiniMax M2 GitHub repo (weights, quickstart, architecture notes)