BLOG

Latest updates and releases from LMSYS Org are announced through our blog post series.

From Live Data to High-Quality Benchmarks: The Arena-Hard Pipeline

by: Tianle Li*, Wei-Lin Chiang*, Evan Frick, Lisa Dunlap, Banghua Zhu, Joseph E. Gonzalez, Ion Stoica, April 19, 2024


Building an affordable and reliable benchmark for LLM chatbots has become a critical challenge. A high-quality benchmark should 1) robustly separate model capability, 2) reflect human preference in real-world use cases, and 3) frequently update to avoid over-fitting or test set leakage. Traditional ...

LMSYS Chatbot Arena: Live and Community-Driven LLM Evaluation

by: LMSYS Arena Team, Mar 1, 2024


Our Mission: Chatbot Arena (chat.lmsys.org) is an open-source project developed by members from LMSYS and UC Berkeley SkyLab. Our mission is to advance LLM development and understanding through live, open, and community-driven evaluations. We launch the evaluation platform for any user to rate LLMs v...

Fast JSON Decoding for Local LLMs with Compressed Finite State Machine

by: Liangsheng Yin, Ying Sheng, Lianmin Zheng, Feb 5, 2024


Constraining an LLM to consistently generate valid JSON or YAML that adheres to a specific schema is a critical feature for many applications. In this blog post, we introduce an optimization that significantly accelerates this type of constrained decoding. Our approach utilizes a compressed finite s...
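The intuition behind the compression can be sketched in a few lines (illustrative only, not SGLang's implementation): when an FSM state has exactly one outgoing transition, the next characters are fully determined by the schema, so the entire forced run can be emitted in a single step instead of one decoding step per character.

```python
# Illustrative sketch of a "compressed" FSM walk: follow chains of
# single-transition states and emit all forced characters at once.
# The FSM below is a hypothetical fragment for the fixed JSON prefix
# '{"name": "'; state names and structure are invented for this example.

def forced_run(fsm, state):
    """Follow single-transition states, returning the forced string
    and the first state where the model actually has a choice."""
    out = []
    while len(fsm[state]) == 1:
        ch, nxt = next(iter(fsm[state].items()))
        out.append(ch)
        state = nxt
    return "".join(out), state

prefix = '{"name": "'
fsm = {i: {c: i + 1} for i, c in enumerate(prefix)}
fsm[len(prefix)] = {}  # free-form state: the model chooses the next token

forced, state = forced_run(fsm, 0)
print(forced)  # the whole '{"name": "' prefix is produced in one step
```

A real implementation operates on the tokenizer's vocabulary rather than characters, which is where most of the engineering complexity lies.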

Fast and Expressive LLM Inference with RadixAttention and SGLang

by: Lianmin Zheng*, Liangsheng Yin, Zhiqiang Xie, Jeff Huang, Chuyue Sun, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E. Gonzalez, Clark Barrett, Ying Sheng*, Jan 17, 2024


Large Language Models (LLMs) are increasingly utilized for complex tasks that require multiple chained generation calls, advanced prompting techniques, control flow, and interaction with external environments. However, there is a notable deficiency in efficient systems for programming and executing ...
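The prefix-reuse idea behind RadixAttention can be illustrated with a toy trie (a sketch only, not SGLang's actual data structure): completed prompts are inserted keyed by their tokens, and a new request reuses the KV cache for the longest cached prefix instead of recomputing it.

```python
# Toy prefix cache in the spirit of RadixAttention (illustrative only).
# Tokens are plain integers here; a real system stores KV-cache handles
# at the nodes and evicts with an LRU-style policy.

class TrieNode:
    def __init__(self):
        self.children = {}

class PrefixCache:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, tokens):
        """Record a finished request's token sequence."""
        node = self.root
        for t in tokens:
            node = node.children.setdefault(t, TrieNode())

    def longest_prefix(self, tokens):
        """Number of leading tokens whose computation can be reused."""
        node, n = self.root, 0
        for t in tokens:
            if t not in node.children:
                break
            node = node.children[t]
            n += 1
        return n

cache = PrefixCache()
cache.insert([1, 2, 3, 4])                 # first request's prompt
reused = cache.longest_prefix([1, 2, 3, 9])  # 3 leading tokens shared
```

Chained generation calls and few-shot prompts share long prefixes, which is why this kind of reuse pays off.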

Chatbot Arena: New models & Elo system update

by: Wei-Lin Chiang, Tim Li, Joseph E. Gonzalez, Ion Stoica, Dec 7, 2023


Welcome to our latest update on the Chatbot Arena, our open evaluation platform to test the most advanced LLMs. We're excited to share that over 130,000 votes have now been collected to rank the 40+ most capable models! In this blog post, we'll cover the results of several new models: Tulu-2-DPO-70B...

Break the Sequential Dependency of LLM Inference Using Lookahead Decoding

by: Yichao Fu, Peter Bailis, Ion Stoica, Hao Zhang, November 21, 2023


TL;DR: We introduce lookahead decoding, a new, exact, and parallel decoding algorithm to accelerate LLM inference. Lookahead decoding breaks the sequential dependency in autoregressive decoding by concurrently extracting and verifying n-grams directly with the LLM, utilizing the Jacobi iteration me...
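The verification step that keeps the output exact can be sketched as follows (a toy, with a deterministic oracle standing in for the LLM's greedy decoding step): a candidate n-gram is accepted only up to the first token that disagrees with what the model would have produced anyway.

```python
# Toy sketch of exact n-gram verification as used in speculative-style
# decoding. next_token() is a hypothetical stand-in for one greedy LLM
# decoding step; in lookahead decoding the candidates and verification
# both run on the LLM itself, in parallel.

def next_token(prefix):
    # Hypothetical deterministic oracle: sum of the prefix modulo 10.
    return sum(prefix) % 10

def verify_ngram(prefix, candidate):
    """Return the longest prefix of `candidate` the oracle agrees with."""
    accepted = []
    ctx = list(prefix)
    for tok in candidate:
        if next_token(ctx) != tok:
            break
        accepted.append(tok)
        ctx.append(tok)
    return accepted

# With prefix [1, 2] the oracle emits 3, then (1+2+3) % 10 = 6.
print(verify_ngram([1, 2], [3, 6, 9]))
```

Because rejected tokens are simply discarded, the decoded sequence is identical to ordinary autoregressive decoding; the speedup comes from accepting several tokens per step when candidates are good.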

Recipe for Serving Thousands of Concurrent LoRA Adapters

by: Ying Sheng*, Shiyi Cao*, Dacheng Li, Coleman Hooper, Nicholas Lee, Shuo Yang, Christopher Chou, Banghua Zhu, Lianmin Zheng, Kurt Keutzer, Joseph E. Gonzalez, Ion Stoica, November 15, 2023


In this blog post, we introduce S-LoRA (code), a system designed for the scalable serving of many LoRA adapters. S-LoRA adopts the idea of Unified Paging for KV cache and adapter weights to reduce memory fragmentation. Heterogeneous Batching of LoRA computation with different ranks leveraging optim...

Catch me if you can! How to beat GPT-4 with a 13B model

by: Shuo Yang*, Wei-Lin Chiang*, Lianmin Zheng*, Joseph E. Gonzalez, Ion Stoica, Nov 14, 2023


Announcing Llama-rephraser: 13B models reaching GPT-4 performance in major benchmarks (MMLU/GSM-8K/HumanEval)! To ensure result validity, we followed OpenAI's decontamination method and found no evidence of data contamination...

ToxicChat: A Benchmark for Content Moderation in Real-world User-AI Interactions

by: Zi Lin*, Zihan Wang*, Yongqi Tong, Yangkun Wang, Yuxin Guo, Yujia Wang, Jingbo Shang, October 30, 2023


In this blog post, we introduce ToxicChat, a benchmark consisting of 10K high-quality examples for content moderation in real-world user-AI interactions. Evaluation results show that fine-tuning on this benchmark notably improves a baseline model’s ability to detect toxic queries in user-AI interactions....

Chatbot Arena Conversation Dataset Release

by: LMSYS Org, July 20, 2023


Since its launch three months ago, Chatbot Arena has become a widely cited LLM evaluation platform that emphasizes large-scale, community-based, and interactive human evaluation. In that short time span, we collected around 53K votes from 19K unique IP addresses for 22 models. In this blog post, we ...

How Long Can Open-Source LLMs Truly Promise on Context Length?

by: The LongChat Team, June 29, 2023


In this blog post, we introduce our latest series of chatbot models, LongChat-7B and LongChat-13B, featuring a new level of extended context length up to 16K tokens. Evaluation results show that the long-range retrieval accuracy of LongChat-13B is up to 2x higher than other long-context open models s...

Chatbot Arena Leaderboard Week 8: Introducing MT-Bench and Vicuna-33B

by: Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Hao Zhang, June 22, 2023


In this blog post, we share the latest update on the Chatbot Arena leaderboard, which now includes more open models and three metrics: Chatbot Arena Elo, based on 42K anonymous votes from Chatbot Arena using the Elo rating system; MT-Bench score, based on a challenging multi-turn benchmark and GPT-4 gr...

Building a Truly "Open" OpenAI API Server with Open Models Locally

by: Shuo Yang and Siyuan Zhuang, June 9, 2023


Many applications have been built on closed-source OpenAI APIs, but now you can effortlessly port them to use open-source alternatives without modifying the code. FastChat's OpenAI-compatible API server enables this seamless transition. In this blog post, we show how you can do this and use LangChai...

Chatbot Arena Leaderboard Updates (Week 4)

by: LMSYS Org, May 25, 2023


In this update, we are excited to welcome the following models joining the Chatbot Arena: Google PaLM 2 (chat-tuned, with the code name chat-bison@001 on Google Cloud Vertex AI), Anthropic Claude-instant-v1, MosaicML MPT-7B-chat, and Vicuna-7B. A new Elo rating leaderboard based on the 27K anonymous voting ...

Chatbot Arena Leaderboard Updates (Week 2)

by: LMSYS Org, May 10, 2023


We release an updated leaderboard with more models and new data we collected last week, after the announcement of the anonymous Chatbot Arena. We are actively iterating on the design of the arena and leaderboard scores. In this update, we have added four new, strong players to the Arena, including...

Chatbot Arena: Benchmarking LLMs in the Wild with Elo Ratings

by: Lianmin Zheng*, Ying Sheng*, Wei-Lin Chiang, Hao Zhang, Joseph E. Gonzalez, Ion Stoica, May 3, 2023


We present Chatbot Arena, a benchmark platform for large language models (LLMs) that features anonymous, randomized battles in a crowdsourced manner. In this blog post, we are releasing our initial results and a leaderboard based on the Elo rating system, which is a widely-used rating system in ches...
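The core Elo update after one pairwise "battle" is simple enough to sketch in a few lines (a minimal illustration using the standard formula; the K-factor and starting ratings here are illustrative, not the exact parameters used by Chatbot Arena).

```python
# Minimal Elo update for one head-to-head battle between two models.
# score_a is 1.0 if model A wins, 0.0 if it loses, 0.5 for a tie.

def elo_update(rating_a: float, rating_b: float,
               score_a: float, k: float = 32.0):
    """Return (new_rating_a, new_rating_b) after one battle."""
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b

# Two models start at 1000; model A wins one battle.
a, b = elo_update(1000.0, 1000.0, 1.0)
print(a, b)  # A gains exactly what B loses
```

Ratings are zero-sum per battle, and an upset against a much higher-rated opponent moves both ratings further than an expected result does.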

Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality

by: The Vicuna Team, March 30, 2023


We introduce Vicuna-13B, an open-source chatbot trained by fine-tuning LLaMA on user-shared conversations collected from ShareGPT. Preliminary evaluation using GPT-4 as a judge shows Vicuna-13B achieves more than 90%* quality of OpenAI ChatGPT and Google Bard while outperforming other models like LL...