"No Free Lunch": Deconstruct Efficient Attention with MiniMax M2
by: MiniMax LLM Team together with Xinyuan Tong, Kangyan Zhou, Mingyi Lu, and Chenyang Zhao, Nov 04, 2025
We are excited to announce day-one support for the new flagship model, MiniMax M2, on SGLang. The MiniMax M2 redefines efficiency for agents: it is a compact, fast, and cost-effective Mixture of Experts (MoE) model (230 billion total parameters, 10 billion active) built for elite performance in coding and agentic tasks, all while maintaining powerful general intelligence. With only 10B activated parameters, M2 delivers the sophisticated, end-to-end tool-use performance expected from leading models, but in a streamlined form factor that makes deployment and scaling easier than ever. Launching M2 on SGLang takes a single command:
python -m sglang.launch_server \
--model-path MiniMaxAI/MiniMax-M2 \
--tp-size 8 \
--ep-size 8 \
--tool-call-parser minimax-m2 \
--trust-remote-code \
--host 0.0.0.0 \
--reasoning-parser minimax-append-think \
--port 8000 \
--mem-fraction-static 0.85
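Once the server is up, SGLang exposes an OpenAI-compatible API, so any standard client works. A minimal sanity check (assuming the server above is reachable on localhost:8000; the API key and prompt are placeholders, and the server was not configured to require a key):

import openai

# Point a standard OpenAI client at the local SGLang server.
client = openai.OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="MiniMaxAI/MiniMax-M2",
    messages=[{"role": "user", "content": "Write a Python function that merges two sorted lists."}],
    max_tokens=512,
)
print(response.choices[0].message.content)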
This release marks a major collaboration between the SGLang and MiniMax teams. While SGLang provided rapid and efficient support for the new model, we invited the MiniMax team to lay out the trade-offs they weighed and their reflections on efficient attention algorithms. From M1 to M2, the MiniMax team has been at the forefront of exploring these algorithms, and in this post they share the empirical insights behind why MiniMax M2 ultimately reverted to full attention.
The Evaluation Challenge: Benchmarks vs. Reality
In the evolution of Large Language Model (LLM) architecture, the computational complexity of the attention mechanism remains a central challenge. Linear or sparse attention mechanisms, such as Lightning Attention in MiniMax-01, aimed to solve the quadratic computational bottleneck of full attention. However, the MiniMax M2 model has reverted to full attention, a decision that provides critical empirical insights into the production-readiness of efficient attention alternatives.
The MiniMax team reports that despite their theoretical appeal, no efficient attention variant has yet demonstrated stable, superior performance over full attention in real-world industrial deployments. For LLMs deployed in open scenarios, model quality remains the paramount priority, rendering an efficient-but-subpar model of little practical value. Achieving competitive quality introduces severe system-level and methodological challenges.
Benchmarks as a "Leaky Abstraction"
LLM benchmarks (e.g., MMLU, BBH, LongBench) are essential tools, but they are inherently "lossy" abstractions of true capability. MiniMax's experience showed that in small-scale experiments, hybrid attention models (e.g., Lightning Attention + Full Attention) performed on par with pure full attention models on these standard leaderboards.
However, this superficial parity concealed deep capability deficits. As the models were scaled up, these hybrid attention models demonstrated clear shortcomings in complex, multi-hop reasoning tasks.
The High Cost of Validation
This limitation of benchmarks creates a vicious cycle: once a specific flaw (like multi-hop reasoning) is identified, researchers develop new proxy metrics to optimize for it. But there is no guarantee that this new proxy metric still correlates with real-world downstream performance at an even larger scale, nor can it exhaustively cover other hidden weaknesses.
Ironically, while efficient attention aims to save compute, the experimental compute required just to get a statistically significant signal on these harder validation metrics grows astronomically. Discovering the real problems is often far more difficult than solving them.
Infrastructure and System Co-Design Hurdles
The theoretical advantages of efficient attention must be realized through mature training and inference infrastructure. However, the current hardware and software ecosystem is increasingly optimized for full attention, creating significant barriers to entry for new architectures.
Mismatch in Compute and Memory Bottlenecks
Take linear attention as an example. Its compute cost scales linearly with sequence length and its state memory is constant, so in theory the efficiency crossover with full attention should appear at just a few thousand tokens.
In practice, however, many linear attention architectures are memory-bound, even during training. This means that without extreme IO optimization, the system fails to utilize the GPU's available FLOPs, leaving vast amounts of compute on the table and nullifying the theoretical gains.
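A back-of-envelope roofline check makes this concrete (our own illustrative numbers, not MiniMax's measurements): if a chunk-wise linear-attention kernel spills its per-head d_k x d_v state to HBM between chunks, the FLOPs of the state update are small relative to the bytes moved, and the kernel sits far below the compute/bandwidth ratio of a modern accelerator.

# Back-of-envelope roofline check for a chunk-wise linear-attention state update.
# All numbers below are illustrative assumptions, not measurements of any kernel.

d_k = d_v = 128            # per-head key/value dims (assumed)
chunk = 64                 # tokens processed per chunk (assumed)
bytes_per_el = 2           # bf16

# Per chunk and per head: state update S += K^T V and read-out O = Q S
flops = 2 * chunk * d_k * d_v          # K^T V accumulation
flops += 2 * chunk * d_k * d_v         # Q S read-out

# If the d_k x d_v state is written back to HBM every chunk, it is read and
# written once per chunk, on top of the chunk's Q, K, V, O tiles.
state_bytes = 2 * d_k * d_v * bytes_per_el
tile_bytes = 4 * chunk * d_k * bytes_per_el      # Q, K, V, O (d_k == d_v here)
intensity = flops / (state_bytes + tile_bytes)   # FLOPs per byte of HBM traffic

# A modern accelerator needs roughly 300 FLOPs per byte of HBM traffic to stay
# compute-bound (on the order of 1000 TFLOPs bf16 against ~3 TB/s of bandwidth).
print(f"arithmetic intensity ~ {intensity:.0f} FLOPs/byte; compute-bound needs ~300")

Only by keeping the state resident in on-chip memory and fusing the chunk loop does the arithmetic intensity rise enough to use the available FLOPs, which is exactly the kind of extreme IO optimization referred to above.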
Inference System Integration Challenges
In a production inference setting, any new attention mechanism must co-exist with critical systems like prefix caching and speculative decoding. The MiniMax report highlights several key engineering problems:
- Low-Precision State Storage: Linear attention is currently far more sensitive to numerical precision than full attention, posing a severe challenge for the low-precision KV cache and state storage commonly used in inference (see the sketch after this list).
- Prefix Caching: In real-world applications like dialogue, the cache-hit rate is very high. The new architecture must handle this frequent cache-hit scenario gracefully.
- Speculative Decoding: How to deeply optimize speculative decoding mechanisms with an efficient attention backbone remains an open and unsolved problem.
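The precision point is easy to reproduce in a toy setting. The sketch below (plain NumPy, not any production kernel) accumulates a linear-attention-style state S += k_t v_t^T over a long sequence in float16 and compares the read-out against a float64 reference; rounding error compounds in the recurrent state, whereas a full-attention KV cache stores raw keys and values and has no such accumulation across time. Dimensions and scales are arbitrary.

import numpy as np

rng = np.random.default_rng(0)
d, T = 64, 16384                    # head dim and sequence length (assumed)

keys = rng.standard_normal((T, d)) * 0.1
values = rng.standard_normal((T, d)) * 0.1
query = rng.standard_normal(d)

def readout(dtype):
    # Recurrent linear-attention state: S_t = S_{t-1} + k_t v_t^T, output = q @ S_T
    S = np.zeros((d, d), dtype=dtype)
    for k, v in zip(keys.astype(dtype), values.astype(dtype)):
        S += np.outer(k, v)
    return (query.astype(dtype) @ S).astype(np.float64)

ref = readout(np.float64)
low = readout(np.float16)
print("relative error of the fp16 state read-out:",
      np.linalg.norm(low - ref) / np.linalg.norm(ref))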
Empirical Case Study
To further explore this, the MiniMax team attempted to implement a Hybrid Sliding Window Attention (SWA) model during M2's training, but the experiment was unsuccessful.
Motivation: System Load Balancing
The team attempted to build an intra-layer hybrid SWA model. The system-level motivation was that mixing SWA and full attention within the same layer could ensure consistent computational intensity. This, in turn, would reduce load imbalance issues in Pipeline Parallelism and across Attention Data Parallel groups. SWA was also chosen for its significantly lower engineering complexity compared to other efficient attention methods.
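The arithmetic behind the load-balancing argument is straightforward. The sketch below (our own assumed head counts, window size, and dimensions, not the M2 configuration) compares an inter-layer hybrid, where whole layers are either SWA or full attention and therefore differ sharply in cost, with an intra-layer hybrid, where every layer carries the same head mix and thus the same cost regardless of how layers are sharded across pipeline stages.

# Per-layer causal-attention FLOPs under assumed shapes; all numbers illustrative.

def full_head_flops(n, d):
    return 2 * n * n * d                 # ~2 * n^2 * d per full-attention head (causal)

def swa_head_flops(n, d, w):
    return 4 * n * min(w, n) * d         # each query attends to at most w keys

n, d, heads, w = 131_072, 128, 32, 4_096

# Inter-layer hybrid: some layers are all-SWA, others all-full-attention.
swa_layer = heads * swa_head_flops(n, d, w)
full_layer = heads * full_head_flops(n, d)

# Intra-layer hybrid: every layer mixes 8 full-attention heads with 24 SWA heads.
mixed_layer = 8 * full_head_flops(n, d) + 24 * swa_head_flops(n, d, w)

print(f"inter-layer: cheapest / most expensive layer = {swa_layer / full_layer:.3f}")
print(f"intra-layer: every layer costs {mixed_layer / full_layer:.3f} of a full-attention layer")

With whole layers differing by an order of magnitude in cost, pipeline stages holding mostly SWA layers would idle while stages holding full-attention layers become the bottleneck; the intra-layer mix removes that variance by construction.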
Results: Consistent Failure Across Dimensions
Despite numerous configurations and continued pre-training on hundreds of billions, even trillions, of tokens, the results were poor: every variant, without exception, performed extremely poorly on agent tasks and complex long-context evaluations.
This held true across multiple experimental dimensions, including:
- Adjusting the ratio between SWA and full attention.
- Independently modifying RoPE settings for SWA and full attention (even replacing RoPE with NoPE in some layers).
- Exploring both intra-layer and inter-layer hybrid designs.
- Conducting post-hoc analysis of global attention patterns (like induction heads) to tune the SWA.
- Using a sink token in SWA (sketched below).
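For reference, the sink-token variant in the last item keeps a few of the earliest positions visible to every query on top of the sliding window, in the spirit of attention-sink designs. A minimal mask-construction sketch (window size and sink count are arbitrary):

import numpy as np

def swa_mask_with_sinks(seq_len, window, num_sinks):
    # Boolean [seq_len, seq_len] mask: True where query i may attend to key j.
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    causal = j <= i                  # never attend to the future
    in_band = (i - j) < window       # sliding window over the last `window` keys
    is_sink = j < num_sinks          # the first tokens stay globally visible
    return causal & (in_band | is_sink)

print(swa_mask_with_sinks(seq_len=10, window=4, num_sinks=1).astype(int))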
Conclusion and Outlook
MiniMax M2's return to full attention is not a rejection of the efficient attention direction, but rather a pragmatic choice based on the engineering realities of industrial-grade LLM systems today.
This case study clearly demonstrates that the success of an efficient attention architecture depends not only on the algorithm itself, but on the co-maturity of three pillars: evaluation, data, and infrastructure.
As GPU compute growth slows and context lengths continue to increase, the benefits of linear and sparse attention will eventually emerge. However, to cross the chasm from theory to production, the community must continue to invest in building more informative evaluation systems, more mature training and inference infrastructure, and higher-quality, information-rich long-context data.