SGLang for gpt-oss: From Day 0 Support to Enhanced Performance
by: Liangsheng Yin, Ke Bao, Aug 27, 2025
We are excited to announce a major update for SGLang, focusing on deep performance optimizations and new features for the recently released openai/gpt-oss-120b model. While we supported the model from day zero, we have spent the past few weeks enhancing our engine to ensure you get the best possible performance.
This post highlights our latest achievements: a significant performance improvement for gpt-oss with up to 2.1x higher throughput on prefill and 2.25x higher throughput on decode, out-of-the-box support for NVIDIA Blackwell & Hopper and AMD MI350 GPUs, speculative decoding support, and enhanced APIs to power complex agentic applications—all while maintaining the model's high accuracy.
All changes are now available in our main branch.
Get Started with SGLang
pip install "sglang[all]>=0.5.1.post3"
python3 -m sglang.launch_server --model-path openai/gpt-oss-120b --tp 4
For detailed instructions on environment setup and how to get the best performance, please see our guide in awesome-sglang.
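Once the server is up, you can send a quick request through the OpenAI-compatible API to confirm everything works. A minimal sanity check (assuming the server's default port, 30000):
import openai

# Point the standard OpenAI client at the local SGLang server
# (sglang.launch_server listens on port 30000 by default).
client = openai.OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="openai/gpt-oss-120b",
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(resp.choices[0].message.content)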
By the Numbers: Comprehensive Benchmark Results 📊
To show the impact of our optimizations, we benchmarked SGLang across a range of hardware configurations. The commands to reproduce all results can be found here.
Low-Latency Performance (Batch Size = 1)
For latency-sensitive applications, we measured single-batch decode throughput across B200 and H100 GPUs, showcasing excellent performance.
| Precision | NVIDIA B200 | NVIDIA H100 |
|---|---|---|
| MXFP4 | 416.02 tok/s | 318.53 tok/s |
| BF16 | 315.63 tok/s | 293.12 tok/s |
High-Throughput Performance (Batch Size = 32)
For high-throughput applications, SGLang delivers significant performance gains over our initial Day 0 support, with strong prefill and decode performance across different hardware.
The AMD MI350 results were obtained with the Triton backend, which is not fully optimized yet; further optimizations with AMD AITER will be released soon.
Performance Deep Dive 🚀
Our performance gains come from several key optimizations at the kernel level:
- FlashInfer Kernels for Blackwell: To unlock peak performance for gpt-oss on Blackwell GPUs, we integrated highly optimized kernels from FlashInfer. This accelerates core components, including multi-head attention and Mixture of Experts (MoE) layers, on the new hardware.
- FlashAttention-3 for Hopper: We modified the FlashAttention-3 kernels to support attention sinks, providing a significant speedup for inference on Hopper GPUs.
- Kernel Fusion and Reduction: We performed several low-level fusions to reduce overhead, including fusing the RMS norm with all-reduce, merging the set-KV-buffer operation into RoPE, and fusing hidden-states padding into quantization. We also removed unnecessary kernels, enabled PDL (programmatic dependent launch) for some kernels, and reduced CPU overhead for greater efficiency. A conceptual sketch of the fusion idea follows this list.
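To make the fusion idea concrete, here is a conceptual PyTorch-level sketch of the RMS norm + all-reduce pattern. It illustrates why fusion helps; SGLang's real implementation consists of custom CUDA kernels, not this Python code.
import torch
import torch.distributed as dist

def rmsnorm(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # RMSNorm: scale x by the reciprocal root-mean-square over the last dim.
    return x * torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps) * weight

def unfused_allreduce_rmsnorm(x, weight):
    # Assumes an initialized process group (tensor-parallel ranks).
    dist.all_reduce(x)         # kernel 1: sum partial results, writes x to memory
    return rmsnorm(x, weight)  # kernels 2+: re-read x from global memory

# A fused kernel does the reduction and normalization in a single launch, so
# the intermediate tensor never round-trips through global memory and the
# kernel-launch overhead is paid once instead of several times.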
Accuracy Alignment with Official Report 🎯
We validated our optimized gpt-oss implementation on the GPQA benchmark and confirmed that our results align closely with the official model card, showing that these speedups do not compromise the model's reasoning capabilities.
| Reasoning Effort | SGLang | vLLM | Official |
|---|---|---|---|
| Low | 65.6 | 65.3 | 67.1 |
| Medium | 72.1 | 72.4 | 73.1 |
| High | 79.8 | 79.4 | 80.1 |
Speculative Decoding Support 🦅
Speculative Decoding is a key technique for improving LLM inference performance. EAGLE3 is the current state-of-the-art speculative decoding method, and SGLang was the first framework to support it, thanks to close collaboration with the EAGLE team.
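For intuition, here is a toy, greedy-acceptance sketch of chain speculative decoding (topk = 1). It is illustrative only, not SGLang's or EAGLE3's actual implementation; draft_next_token and target_logits_fn are hypothetical stand-ins for the draft head and the target model.
def speculative_step(target_logits_fn, draft_next_token, prefix, num_steps=3):
    # 1) Draft phase: autoregressively propose num_steps cheap candidate tokens.
    ctx, proposals = list(prefix), []
    for _ in range(num_steps):
        token = draft_next_token(ctx)
        proposals.append(token)
        ctx.append(token)

    # 2) Verify phase: one target forward pass over prefix + proposals returns
    #    per-position logits; logits[j] predicts the token at position j + 1.
    logits = target_logits_fn(prefix + proposals)

    accepted = []
    for i, token in enumerate(proposals):
        target_token = int(logits[len(prefix) + i - 1].argmax())
        if target_token == token:
            accepted.append(token)         # draft and target agree: keep it
        else:
            accepted.append(target_token)  # disagree: take the target's token
            break                          # and discard the rest of the chain
    else:
        # All proposals accepted: the same verify pass yields one bonus token.
        accepted.append(int(logits[-1].argmax()))
    return accepted  # 1 to num_steps + 1 tokens per target forward pass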
In SGLang, you can easily launch the gpt-oss model with EAGLE3 speculative decoding:
# On Hopper:
# - Tree decoding (topk > 1) and chain decoding (topk = 1) are supported on both FA3 and Triton backends.
python3 -m sglang.launch_server --model-path openai/gpt-oss-120b --speculative-algorithm EAGLE3 --speculative-draft-model-path lmsys/EAGLE3-gpt-oss-120b-bf16 --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 --tp 4
python3 -m sglang.launch_server --model-path openai/gpt-oss-120b --speculative-algorithm EAGLE3 --speculative-draft-model-path lmsys/EAGLE3-gpt-oss-120b-bf16 --speculative-num-steps 5 --speculative-eagle-topk 4 --speculative-num-draft-tokens 8 --tp 4
# On Blackwell:
# - Chain decoding (topk = 1) is supported on TRTLLM-MHA backend. Tree decoding (topk > 1) is in progress, stay tuned!
# - Both tree decoding (topk > 1) and chain decoding (topk = 1) are supported on the Triton backend.
python3 -m sglang.launch_server --model-path openai/gpt-oss-120b --speculative-algorithm EAGLE3 --speculative-draft-model-path lmsys/EAGLE3-gpt-oss-120b-bf16 --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 --tp 4
python3 -m sglang.launch_server --model-path openai/gpt-oss-120b --speculative-algorithm EAGLE3 --speculative-draft-model-path lmsys/EAGLE3-gpt-oss-120b-bf16 --speculative-num-steps 5 --speculative-eagle-topk 4 --speculative-num-draft-tokens 8 --attention-backend triton --tp 4
For the openai/gpt-oss-120b model, we trained an EAGLE3 draft model, lmsys/EAGLE3-gpt-oss-120b-bf16, with SpecForge, an efficient framework for speculative draft-model training. Our draft model achieves a higher average acceptance length than NVIDIA's gpt-oss draft model.
We also benchmarked openai/gpt-oss-120b with EAGLE3 on H200 TP4 and observed promising results across several standard benchmark datasets:
- 1.39x speedup with the steps=3, topk=1, num_draft_tokens=4 setting.
- 1.52x speedup with the steps=5, topk=4, num_draft_tokens=8 setting.
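As a rough mental model for where these speedups come from: if the target model accepts an average of α tokens per verify pass and the draft steps cost a fraction c of one target forward pass, decode speedup is roughly α / (1 + c). The numbers below are hypothetical, chosen only to land near our measured ~1.5x:
def spec_speedup(avg_accept_len: float, draft_overhead: float) -> float:
    # One round = several cheap draft steps plus one target verify pass,
    # costing (1 + draft_overhead) target-passes and yielding avg_accept_len
    # tokens, versus 1 token per target pass without speculation.
    return avg_accept_len / (1.0 + draft_overhead)

print(f"{spec_speedup(avg_accept_len=2.3, draft_overhead=0.5):.2f}x")  # ~1.53x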
Powering Agentic Applications 🤖
To better enable agentic workflows, SGLang offers OpenAI Responses API support and native chat-completion support. Here is an example of how to build a simple web search agent with SGLang (Python 3.12 and the gpt-oss package are required for the built-in tools; more setup details can be found here).
Launch the server:
export EXA_API_KEY=YOUR_EXA_KEY
python3 -m sglang.launch_server --port 30000 --model-path openai/gpt-oss-120b --tp 4 --tool-server demo
Use the Responses API to build a web search agent:
import openai

# Point the OpenAI client at the local SGLang server.
client = openai.OpenAI(
    base_url="http://localhost:30000/v1",
    api_key="EMPTY"
)

# The web_search_preview tool lets the model issue live web searches
# through the demo tool server launched above.
response = client.responses.create(
    model="openai/gpt-oss-120b",
    tools=[{"type": "web_search_preview"}],
    input="What's new in SGLang today?"
)
print(response.output_text)
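The Responses API also supports multi-turn chaining via previous_response_id. Whether your SGLang version implements chaining is an assumption on our part; if it does not, simply resend the prior conversation as the input instead.
# Hypothetical follow-up turn, continuing the snippet above.
follow_up = client.responses.create(
    model="openai/gpt-oss-120b",
    previous_response_id=response.id,  # chain onto the previous response
    input="Summarize that in one sentence.",
)
print(follow_up.output_text)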
What's Next? 🔮
None of the Day-0 support or the subsequent optimizations would have been possible without the collective effort of the SGLang community. Shout-out to the SGLang, SpecForge, and FlashInfer teams, and to the teams at Oracle, Eigen AI, NVIDIA, and AMD, for pushing this forward together!
We will continue pushing the boundaries of LLM inference. Our roadmap includes further SWA (Sliding Window Attention) optimizations, AMD AITER integration, and new advances in speculative decoding, to deliver even greater performance gains.
We invite you to try the latest version of SGLang and share your feedback. Thank you for being an essential part of this journey!