Introducing Miles — RL Framework To Fire Up Large-Scale MoE Training

by: RadixArk Team, Nov 19, 2025


A journey of a thousand miles is made one small step at a time.

Today, we are releasing Miles, an enterprise-grade reinforcement learning framework tailored for large-scale MoE training and production workloads.

Miles is built on top of slime, the lightweight RL framework that has quietly powered many of today’s post-training pipelines and large MoE runs (including GLM-4.6). While slime proved that lightweight design works, Miles takes the next step: delivering the reliability, scale, and control needed for real-world enterprise deployments.

GitHub: radixark/miles.

Why Miles?

Every mile of progress begins with one well-placed step, and for us that step is slime. As a lightweight, highly customizable RL framework, slime has been gaining popularity across the community. It has also been battle-tested in large-scale MoE training, where it was used to train GLM-4.6. slime comes with a few elegant design principles:

Open-to-Use Performance

We provide native, structured support for SGLang and Megatron's full optimization stack, keeping pace with the rapid evolution of inference and training frameworks.

Modular Design

Key components—Algorithm, Data, Rollout, and Eval—are fully decoupled. You can plug in new agent types, reward functions, or sampling strategies with minimal code changes.
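
As a rough illustration of what that decoupling enables, a new reward can be expressed as a single standalone function and wired into the Data/Rollout path. The snippet below is a hypothetical sketch; the function name and signature are invented for illustration and are not slime's or Miles's actual plug-in interface.

```python
# Hypothetical reward plug-in. The signature is illustrative only,
# not slime's or Miles's actual API.
def exact_match_with_length_penalty(prompt: str, completion: str, reference: str) -> float:
    """Reward exact-match answers, lightly penalizing overly long completions."""
    correct = float(completion.strip() == reference.strip())
    length_penalty = 0.001 * max(0, len(completion) - len(reference))
    return correct - length_penalty
```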

Built for Researchers

Every abstraction is readable and hackable. Algorithm researchers can modify importance sampling, rollout logic, or loss dynamics without digging into low-level code. We also provide inference-only and training-only debugging modes for fast diagnosis.
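
To make that concrete, the kind of logic a researcher would typically touch is a small, self-contained loss function. The snippet below is a generic PPO-style clipped objective written purely for illustration; it is not slime's actual implementation, only the sort of function where one would swap in a different importance-sampling ratio or clipping rule.

```python
import torch

def clipped_policy_gradient_loss(logprobs_new: torch.Tensor,
                                 logprobs_old: torch.Tensor,
                                 advantages: torch.Tensor,
                                 clip_eps: float = 0.2) -> torch.Tensor:
    # Importance-sampling ratio between the current policy and the rollout policy.
    ratio = torch.exp(logprobs_new - logprobs_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Pessimistic (clipped) surrogate objective, negated for gradient descent.
    return -torch.min(unclipped, clipped).mean()
```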

Community-Driven

slime evolved through real-world feedback from the LMSYS and SGLang communities, embodying what open collaboration across research and engineering can achieve.

What's New?

Miles builds on slime but focuses on new hardware (e.g., GB300), large-scale MoE RL, and production-grade stability. Recent additions include (most of which we've also upstreamed to slime):

True On-Policy

Beyond deterministic inference (bitwise-identical results), we now support true on-policy training through an infrastructure-level approach.

  • We've eliminated the numerical mismatch between training and inference, bringing the KL divergence between them to exactly 0 (a minimal check is sketched after this list).
  • This builds on Flash Attention 3, DeepGEMM, batch-invariant kernels from Thinking Machines Lab, and torch.compile, together with aligned numeric operations between the training and inference paths.
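
As a minimal sketch of what "exactly 0 KL divergence" means operationally (the tensor names here are illustrative, not Miles's internals): the log-probabilities recomputed by the training backend should match the ones recorded by the inference engine bit for bit, so the sampled KL estimate collapses to zero.

```python
import torch

def assert_true_on_policy(rollout_logprobs: torch.Tensor,
                          train_logprobs: torch.Tensor) -> float:
    """Check that training-side log-probs match the rollout engine's exactly.

    With bitwise-identical numerics, the per-token difference is exactly zero,
    so the Monte Carlo estimate of KL(pi_rollout || pi_train) is exactly 0.
    """
    assert torch.equal(rollout_logprobs, train_logprobs), "training/inference mismatch"
    kl_estimate = (rollout_logprobs - train_logprobs).mean().item()
    assert kl_estimate == 0.0
    return kl_estimate
```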

Memory Improvements

To maximize performance without hitting OOM errors, we've made several updates:

  • Added error propagation to avoid crashes on benign OOMs.
  • Implemented memory margins to fix NCCL-related OOMs.
  • Fixed excessive memory usage in FSDP.
  • Added support for move-based and partial offloading, plus host peak memory savings (see the sketch below).
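
To make the offloading idea concrete, here is a rough sketch of move-based partial offloading for a plain PyTorch module. It is illustrative only and not Miles's actual offloading code.

```python
import torch

def partial_offload(model: torch.nn.Module, offload_fraction: float = 0.5) -> None:
    """Move a fraction of the parameters to host memory before rollout, freeing
    GPU memory for the inference engine while bounding host peak memory."""
    params = list(model.parameters())
    n_offload = int(len(params) * offload_fraction)
    for p in params[:n_offload]:
        p.data = p.data.to("cpu")
    torch.cuda.empty_cache()  # return the freed blocks to the CUDA allocator

def reload_params(model: torch.nn.Module, device: str = "cuda") -> None:
    """Bring any offloaded parameters back before the next training step."""
    for p in model.parameters():
        if p.data.device.type == "cpu":
            p.data = p.data.to(device)
```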

Speculative Decoding with Online Draft Model Training

Freezing the draft model during RL prevents it from tracking the target model's evolving policy, which shrinks the acceptance length and thus the speedup. We now perform online SFT on the draft model throughout RL training (a minimal sketch follows the list below).

  • Achieves 25%+ rollout speedup vs. a frozen MTP, especially in late training stages.
  • Supports MTP with sequence packing + context parallelism (CP), loss masks with edge-case handling, LM head/embedding gradient isolation, and Megatron↔SGLang weight syncing.
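
Below is a minimal sketch of an online-SFT step for the draft model, interleaved with RL updates of the target model. The forward signature and the parameter-name checks are assumptions made for illustration; this is not Miles's actual MTP training code.

```python
import torch

def online_draft_sft_step(draft_model, optimizer, rollout_tokens,
                          target_hidden_states, loss_mask):
    """One illustrative online-SFT step that keeps the draft (MTP) model
    aligned with the current policy using the latest rollout data."""
    # Keep embedding / LM-head weights shared with the target model frozen,
    # so the draft update cannot perturb the RL policy (gradient isolation).
    for name, p in draft_model.named_parameters():
        if "embed" in name or "lm_head" in name:
            p.requires_grad_(False)

    # Hypothetical forward: the MTP head conditions on the target model's hidden states.
    logits = draft_model(rollout_tokens, target_hidden_states)

    # Next-token cross-entropy, masked to response tokens only.
    logprobs = torch.log_softmax(logits[:, :-1].float(), dim=-1)
    labels = rollout_tokens[:, 1:]
    token_logprobs = logprobs.gather(-1, labels.unsqueeze(-1)).squeeze(-1)
    mask = loss_mask[:, 1:]
    loss = -(token_logprobs * mask).sum() / mask.sum().clamp(min=1)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.detach()
```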

Other Improvements

We've enhanced the FSDP training backend, allowed independent deployment of the rollout subsystem, and added more debug utilities (metrics, post-hoc analyzers, better profilers). We also included a formal mathematics (Lean) example with SFT/RL scripts.

Roadmap

We are committed to supporting enterprise-grade RL training. Upcoming efforts include:

  • Large-scale MoE RL examples on new hardware (e.g., GB300).
  • Multi-modal training support.
  • Rollout accelerations:
    • Compatibility with SGLang spec v2.
    • Advanced speculative decoding (e.g., EAGLE3, multi-spec layer).
  • Better resource allocation for balanced training & serving in large-scale async training.
  • Elasticity against GPU failures.

Acknowledgment

Miles wouldn't exist without the slime authors and the broader SGLang RL community.

We invite researchers, startups, and enterprise teams to explore slime and Miles—pick the one that fits your needs—and join us in making reinforcement learning efficient and reliable. We're listening to the community and actively working on Miles to build a production-ready training environment.