BLOG

Latest updates and releases from LMSYS Org are announced through our blog post series.

From Live Data to High-Quality Benchmarks: The Arena-Hard Pipeline

by: Tianle Li*, Wei-Lin Chiang*, Evan Frick, Lisa Dunlap, Banghua Zhu, Joseph E. Gonzalez, Ion Stoica, April 19, 2024


Building an affordable and reliable benchmark for LLM chatbots has become a critical challenge. A high-quality benchmark should 1) robustly separate model capability, 2) reflect human preference in real-world use cases, and 3) frequently update to avoid over-fitting or test set leakage. Traditional ...

LMSYS Chatbot Arena: Live and Community-Driven LLM Evaluation

by: LMSYS Arena Team, Mar 1, 2024


Our Mission: Chatbot Arena (chat.lmsys.org) is an open-source project developed by members from LMSYS and UC Berkeley SkyLab. Our mission is to advance LLM development and understanding through live, open, and community-driven evaluations. We launch the evaluation platform for any user to rate LLMs v...

Fast JSON Decoding for Local LLMs with Compressed Finite State Machine

by: Liangsheng Yin, Ying Sheng, Lianmin Zheng, Feb 5, 2024


Constraining an LLM to consistently generate valid JSON or YAML that adheres to a specific schema is a critical feature for many applications. In this blog post, we introduce an optimization that significantly accelerates this type of constrained decoding. Our approach utilizes a compressed finite s...
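The intuition behind the compression can be sketched in a few lines (illustrative only, not SGLang's implementation): when an FSM state has exactly one outgoing transition, the next characters are fully determined by the schema, so the entire forced run can be emitted in a single step instead of one decoding step per character.

```python
# Illustrative sketch of a "compressed" FSM walk: follow chains of
# single-transition states and emit all forced characters at once.
# The FSM below is a hypothetical fragment for the fixed JSON prefix
# '{"name": "'; state names and structure are invented for this example.

def forced_run(fsm, state):
    """Follow single-transition states, returning the forced string
    and the first state where the model actually has a choice."""
    out = []
    while len(fsm[state]) == 1:
        ch, nxt = next(iter(fsm[state].items()))
        out.append(ch)
        state = nxt
    return "".join(out), state

prefix = '{"name": "'
fsm = {i: {c: i + 1} for i, c in enumerate(prefix)}
fsm[len(prefix)] = {}  # free-form state: the model chooses the next token

forced, state = forced_run(fsm, 0)
print(forced)  # the whole '{"name": "' prefix is produced in one step
```

A real implementation operates on the tokenizer's vocabulary rather than characters, which is where most of the engineering complexity lies.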

Fast and Expressive LLM Inference with RadixAttention and SGLang

by: Lianmin Zheng*, Liangsheng Yin, Zhiqiang Xie, Jeff Huang, Chuyue Sun, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E. Gonzalez, Clark Barrett, Ying Sheng*, Jan 17, 2024


Large Language Models (LLMs) are increasingly utilized for complex tasks that require multiple chained generation calls, advanced prompting techniques, control flow, and interaction with external environments. However, there is a notable deficiency in efficient systems for programming and executing ...
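The prefix-reuse idea behind RadixAttention can be illustrated with a toy trie (a sketch only, not SGLang's actual data structure): completed prompts are inserted keyed by their tokens, and a new request reuses the KV cache for the longest cached prefix instead of recomputing it.

```python
# Toy prefix cache in the spirit of RadixAttention (illustrative only).
# Tokens are plain integers here; a real system stores KV-cache handles
# at the nodes and evicts with an LRU-style policy.

class TrieNode:
    def __init__(self):
        self.children = {}

class PrefixCache:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, tokens):
        """Record a finished request's token sequence."""
        node = self.root
        for t in tokens:
            node = node.children.setdefault(t, TrieNode())

    def longest_prefix(self, tokens):
        """Number of leading tokens whose computation can be reused."""
        node, n = self.root, 0
        for t in tokens:
            if t not in node.children:
                break
            node = node.children[t]
            n += 1
        return n

cache = PrefixCache()
cache.insert([1, 2, 3, 4])                 # first request's prompt
reused = cache.longest_prefix([1, 2, 3, 9])  # 3 leading tokens shared
```

Chained generation calls and few-shot prompts share long prefixes, which is why this kind of reuse pays off.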

Chatbot Arena: New models & Elo system update

by: Wei-Lin Chiang, Tim Li, Joseph E. Gonzalez, Ion Stoica, Dec 7, 2023


Welcome to our latest update on the Chatbot Arena, our open evaluation platform to test the most advanced LLMs. We're excited to share that over 130,000 votes have now been collected to rank the 40+ most capable models! In this blog post, we'll cover the results of several new models: Tulu-2-DPO-70B...

Break the Sequential Dependency of LLM Inference Using Lookahead Decoding

by: Yichao Fu, Peter Bailis, Ion Stoica, Hao Zhang, November 21, 2023


TL;DR: We introduce lookahead decoding, a new, exact, and parallel decoding algorithm to accelerate LLM inference. Lookahead decoding breaks the sequential dependency in autoregressive decoding by concurrently extracting and verifying n-grams directly with the LLM, utilizing the Jacobi iteration me...
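The verification step that keeps the output exact can be sketched as follows (a toy, with a deterministic oracle standing in for the LLM's greedy decoding step): a candidate n-gram is accepted only up to the first token that disagrees with what the model would have produced anyway.

```python
# Toy sketch of exact n-gram verification as used in speculative-style
# decoding. next_token() is a hypothetical stand-in for one greedy LLM
# decoding step; in lookahead decoding the candidates and verification
# both run on the LLM itself, in parallel.

def next_token(prefix):
    # Hypothetical deterministic oracle: sum of the prefix modulo 10.
    return sum(prefix) % 10

def verify_ngram(prefix, candidate):
    """Return the longest prefix of `candidate` the oracle agrees with."""
    accepted = []
    ctx = list(prefix)
    for tok in candidate:
        if next_token(ctx) != tok:
            break
        accepted.append(tok)
        ctx.append(tok)
    return accepted

# With prefix [1, 2] the oracle emits 3, then (1+2+3) % 10 = 6.
print(verify_ngram([1, 2], [3, 6, 9]))
```

Because rejected tokens are simply discarded, the decoded sequence is identical to ordinary autoregressive decoding; the speedup comes from accepting several tokens per step when candidates are good.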

Recipe for Serving Thousands of Concurrent LoRA Adapters

by: Ying Sheng*, Shiyi Cao*, Dacheng Li, Coleman Hooper, Nicholas Lee, Shuo Yang, Christopher Chou, Banghua Zhu, Lianmin Zheng, Kurt Keutzer, Joseph E. Gonzalez, Ion Stoica, November 15, 2023


In this blog post, we introduce S-LoRA (code), a system designed for the scalable serving of many LoRA adapters. S-LoRA adopts the idea of Unified Paging for KV cache and adapter weights to reduce memory fragmentation. Heterogeneous Batching of LoRA computation with different ranks leveraging optim...

Catch me if you can! How to beat GPT-4 with a 13B model

by: Shuo Yang*, Wei-Lin Chiang*, Lianmin Zheng*, Joseph E. Gonzalez, Ion Stoica, Nov 14, 2023


Announcing Llama-rephraser: 13B models reaching GPT-4 performance in major benchmarks (MMLU/GSM-8K/HumanEval)! To ensure result validity, we followed OpenAI's decontamination method and found no evidence of data contamination...

ToxicChat: A Benchmark for Content Moderation in Real-world User-AI Interactions

by: Zi Lin*, Zihan Wang*, Yongqi Tong, Yangkun Wang, Yuxin Guo, Yujia Wang, Jingbo Shang, October 30, 2023


In this blog post, we introduce ToxicChat, a benchmark consisting of 10K high-quality examples for content moderation in real-world user-AI interactions. Evaluation results show that fine-tuning on this benchmark notably improves a baseline model’s ability to detect toxic queries in user-AI interactions....

Chatbot Arena Conversation Dataset Release

by: LMSYS Org, July 20, 2023


Since its launch three months ago, Chatbot Arena has become a widely cited LLM evaluation platform that emphasizes large-scale, community-based, and interactive human evaluation. In that short time span, we collected around 53K votes from 19K unique IP addresses for 22 models. In this blog post, we ...

How Long Can Open-Source LLMs Truly Promise on Context Length?

by: The LongChat Team, June 29, 2023


In this blog post, we introduce our latest series of chatbot models, LongChat-7B and LongChat-13B, featuring a new level of extended context length up to 16K tokens. Evaluation results show that the long-range retrieval accuracy of LongChat-13B is up to 2x higher than other long-context open models s...

Chatbot Arena Leaderboard Week 8: Introducing MT-Bench and Vicuna-33B

by: Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Hao Zhang, June 22, 2023


In this blog post, we share the latest update on the Chatbot Arena leaderboard, which now includes more open models and three metrics: Chatbot Arena Elo, based on 42K anonymous votes from Chatbot Arena using the Elo rating system; MT-Bench score, based on a challenging multi-turn benchmark and GPT-4 gr...

Building a Truly "Open" OpenAI API Server with Open Models Locally

by: Shuo Yang and Siyuan Zhuang, June 9, 2023


Many applications have been built on closed-source OpenAI APIs, but now you can effortlessly port them to use open-source alternatives without modifying the code. FastChat's OpenAI-compatible API server enables this seamless transition. In this blog post, we show how you can do this and use LangChai...

Chatbot Arena Leaderboard Updates (Week 4)

by: LMSYS Org, May 25, 2023


In this update, we are excited to welcome the following models joining the Chatbot Arena: Google PaLM 2 (chat-tuned, with the code name chat-bison@001 on Google Cloud Vertex AI), Anthropic Claude-instant-v1, MosaicML MPT-7B-chat, and Vicuna-7B. A new Elo rating leaderboard based on the 27K anonymous voting ...

Chatbot Arena Leaderboard Updates (Week 2)

by: LMSYS Org, May 10, 2023


We release an updated leaderboard with more models and new data we collected last week, after the announcement of the anonymous Chatbot Arena. We are actively iterating on the design of the arena and leaderboard scores. In this update, we have added four new, strong players to the Arena, including...

Chatbot Arena: Benchmarking LLMs in the Wild with Elo Ratings

by: Lianmin Zheng*, Ying Sheng*, Wei-Lin Chiang, Hao Zhang, Joseph E. Gonzalez, Ion Stoica, May 3, 2023


We present Chatbot Arena, a benchmark platform for large language models (LLMs) that features anonymous, randomized battles in a crowdsourced manner. In this blog post, we are releasing our initial results and a leaderboard based on the Elo rating system, which is a widely-used rating system in ches...
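The core Elo update after one pairwise "battle" is simple enough to sketch in a few lines (a minimal illustration using the standard formula; the K-factor and starting ratings here are illustrative, not the exact parameters used by Chatbot Arena).

```python
# Minimal Elo update for one head-to-head battle between two models.
# score_a is 1.0 if model A wins, 0.0 if it loses, 0.5 for a tie.

def elo_update(rating_a: float, rating_b: float,
               score_a: float, k: float = 32.0):
    """Return (new_rating_a, new_rating_b) after one battle."""
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b

# Two models start at 1000; model A wins one battle.
a, b = elo_update(1000.0, 1000.0, 1.0)
print(a, b)  # A gains exactly what B loses
```

Ratings are zero-sum per battle, and an upset against a much higher-rated opponent moves both ratings further than an expected result does.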

Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality

by: The Vicuna Team, March 30, 2023


We introduce Vicuna-13B, an open-source chatbot trained by fine-tuning LLaMA on user-shared conversations collected from ShareGPT. Preliminary evaluation using GPT-4 as a judge shows Vicuna-13B achieves more than 90%* quality of OpenAI ChatGPT and Google Bard while outperforming other models like LL...