LMSYS Chatbot Arena: Live and Community-Driven LLM Evaluation

by: LMSYS Arena Team, Mar 01, 2024


Our Mission

Chatbot Arena (chat.lmsys.org) is an open-source project developed by members from LMSYS and UC Berkeley SkyLab. Our mission is to advance LLM development and understanding through live, open, and community-driven evaluations. We maintain the evaluation platform so that any user can rate LLMs via pairwise comparisons under real-world use cases, and we publish a leaderboard periodically.

Our Progress

Chatbot Arena was first launched in May 2023 and has emerged as a critical platform for live, community-driven LLM evaluation, attracting millions of participants and collecting over 300,000 votes across 10 million prompts. This extensive engagement has enabled the evaluation of more than 60 LLMs, such as GPT-4, Gemini/Bard, Llama, and Mistral, significantly enhancing understanding of their capabilities and limitations.

Our periodic leaderboard and blog post updates have become a valuable resource for the community, offering critical insights into model performance that guide the ongoing development of LLMs. Our commitment to open science is further demonstrated through the sharing of user preference data and one million user prompts, supporting research and model improvement.

The platform's infrastructure (FastChat) and evaluation tools, available on GitHub, emphasize our dedication to transparency and community engagement in the evaluation process. This approach not only enhances the reliability of our findings but also fosters a collaborative environment for advancing LLMs.

In our ongoing efforts, we feel obligated to establish policies that guarantee evaluation transparency and trustworthiness. Moreover, we actively involve the community in shaping any modifications to the evaluation process, reinforcing our commitment to openness and collaborative progress.

Our Policy

Last Updated: April 11, 2024

Open source: The platform (FastChat), including the UI frontend, model serving backend, and evaluation tools, is fully open source on GitHub. This means that anyone can clone, audit, or run another instance of Chatbot Arena to produce a similar leaderboard.

Transparent: The evaluation process, including rating computation, identification of anomalous users, and LLM selection, is made publicly available so that others can reproduce our analysis and fully understand how the data is collected. Furthermore, we will involve the community in deciding any changes to the evaluation process.

Listing models on the leaderboard: The leaderboard will only include models that are accessible to third parties. In particular, the leaderboard will only include models that are (1) open weights and/or (2) publicly available through APIs (e.g., gpt-4-0613, gemini-pro) or services (e.g., Bard, GPT-4+browsing).

Once the model is on the leaderboard, the model will remain accessible at chat.lmsys.org for at least two weeks for the community to evaluate it.

Before a model is published on the leaderboard, we need to accumulate enough votes to compute its rating. We host the model in the blind test mode. This is called the initial-rating phase. If the model provider decides to pull out and not show the model on the leaderboard, we will allow it, but we might still share with the community the data generated by the model during the initial-rating phase under the "anonymous" label. Note that this only applies to proprietary models under private APIs or pre-release open models. It does not apply to models offered via public APIs or open weight models which can be evaluated by anyone.

To ensure the leaderboard correctly reflects model rankings over time, we rely on live comparisons between models. We may retire models from the leaderboard that are no longer online after a certain time period.

Sharing data with the community: We will periodically share 20% of the arena vote data we have collected, including the prompts, the answers, the identity of the model providing each answer (if the model is or has been on the leaderboard), and the votes. For models that we have collected votes for but that have never been on the leaderboard, we will still release the data, but we will label the model as "anonymous".

Sharing data with the model providers: Upon request, we will offer early data access to model providers who wish to improve their models. However, this data will be a subset of the data that we periodically share with the community. In particular, we will share with a model provider the data that includes their model's answers. For battles, we will not identify the opponent; we will label the opponent as "anonymous". This data will later be shared with the community during the periodic releases. If the model is not on the leaderboard at the time of sharing, the model’s answers will be labeled as "anonymous".
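To make the provider-sharing rule concrete, here is a minimal sketch of how battle records might be filtered and masked before being handed to a model provider. It is an illustration only, not the Arena's actual pipeline: the `Battle` record, its field names, and the `provider_view` function are hypothetical, and the sketch assumes that the opponent's answer text is kept while only the opponent's identity is hidden.

```python
from dataclasses import dataclass, replace
from typing import List

ANONYMOUS = "anonymous"


@dataclass(frozen=True)
class Battle:
    """Hypothetical record of one pairwise comparison (field names are assumptions)."""
    prompt: str
    model_a: str
    model_b: str
    answer_a: str
    answer_b: str
    vote: str  # e.g. "model_a", "model_b", or "tie"


def provider_view(battles: List[Battle], provider_model: str) -> List[Battle]:
    """Return only the battles that include provider_model, with the opponent masked."""
    shared = []
    for battle in battles:
        if battle.model_a == provider_model:
            # Provider's model is on side A: hide the identity of side B.
            shared.append(replace(battle, model_b=ANONYMOUS))
        elif battle.model_b == provider_model:
            # Provider's model is on side B: hide the identity of side A.
            shared.append(replace(battle, model_a=ANONYMOUS))
        # Battles that do not involve the provider's model are not shared.
    return shared
```

Calling `provider_view(all_battles, "some-model")` would return a new list and leave the original records untouched, since the records are immutable and `replace` creates copies.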

FAQ

Why another eval?

Most LLM benchmarks are static, which makes them prone to contamination, as these LLMs are trained on most available data on the Internet. Chatbot Arena aims to alleviate this problem by providing live evaluation with a continuous stream of new prompts from real people. We also believe that the open nature of the platform will attract users that accurately reflect the broader set of LLM users and real use cases.

What model to evaluate? Why not all?

We will continuously add new models and retire old ones. It is not feasible to add every possible model due to the cost and the scalability of our evaluation process, i.e., it might take too long to accumulate enough votes to accurately rate each model. Today, the decision to add new models is rather ad hoc: we add models based on the community’s perceived interest. We intend to formalize this process in the near future.

Why should the community trust our eval?

We seek to provide transparency by making all of our tools, as well as the platform itself, open source. We invite the community to use our platform and tools to statistically reproduce our results.

Why do you only share 20% of data, not all?

Arena's mission is to ensure trustworthy evaluation. We share only a portion of the data, and do so periodically, to mitigate the potential risk of overfitting to certain user distributions or preference biases in Arena. We will actively review this policy based on the community's feedback.

Who will fund this effort? Any conflicts of interest?

Chatbot Arena is funded only by gifts, in the form of money, cloud credits, or API credits. The gifts have no strings attached.

Any feedback?

Feel free to send us an email or leave feedback on GitHub!