What’s up with Llama 3? Arena data analysis

by: Lisa Dunlap, Evan Frick, Tianle Li, Isaac Ong, Joseph E. Gonzalez, Wei-Lin Chiang, May 08, 2024


On April 18th, Meta released Llama 3, their newest open-weight large language model. Since then, Llama 3-70B has quickly risen to the top of the English Chatbot Arena leaderboard with over 50,000 battles. This remarkable achievement by Meta is excellent news for the open-source community. In this blog post, we aim to provide more insight into why users rank Llama 3-70b on par with top-ranked models like GPT-4-Turbo, Gemini 1.5 Pro, and Claude 3 Opus.


We investigate the following:

  1. What types of prompts are users asking? Do users prefer Llama 3 on certain types of prompts?
  2. How challenging are these prompts? Does the ranking change if the prompts are easier/harder?
  3. Are certain users or prompts overrepresented? Do duplicate prompts or rankings from a small number of users affect the win rate?
  4. Does Llama 3 have qualitative differences which make users like it more?

We focus on battles consisting of Llama 3-70b against 5 top-ranked models (claude-3-opus-20240229, gpt-4-0125-preview, gpt-4-1106-preview, gpt-4-turbo-2024-04-09, gemini-1.5-pro-0409-preview) and reach the following conclusions:

  1. Llama 3 beats other top-ranking models on open-ended writing and creative problems but loses on more close-ended math and coding problems.
  2. As prompts get harder, Llama 3’s win rate against top-tier models drops significantly.
  3. Deduplication or outliers do not significantly affect the win rate.
  4. Qualitatively, Llama 3’s outputs are friendlier and more conversational than other models, and these traits appear more often in battles that Llama 3 wins.

Figure 1. Llama 3-70b's win rate (excluding ties) against the top 5 models across prompt topics. * denotes that the category contains fewer than 50 battles.

Analyzing win rate across different types of prompts

Topic Analysis. We use an LLM labeler (Llama 3-70b) to categorize user prompts into a pre-established taxonomy of topics (from Reka's paper) and visualize Llama 3-70b's win rate against the other top models in Figure 1. We see that Llama 3's win rate is highest for open-ended and creative tasks like brainstorming and writing, and lowest for more closed-ended technical tasks like math and translation. Interestingly, Llama 3 achieves its single highest win rate on data processing tasks, which mainly consist of parsing and dataframe operations, but since this category contains only 19 examples, the result remains inconclusive.
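
To make the labeling step concrete, here is a minimal sketch of prompting an LLM through an OpenAI-compatible chat API to assign each prompt one topic from a fixed taxonomy. The client configuration, model identifier, and topic list below are illustrative placeholders rather than the exact setup used for Figure 1.

```python
from openai import OpenAI

TOPICS = [
    "writing", "brainstorming", "coding", "math",
    "translation", "data processing", "other",
]

client = OpenAI()  # assumes base_url/api_key point at an endpoint serving Llama 3

def label_topic(prompt: str) -> str:
    """Ask the labeler model to pick exactly one topic from the taxonomy."""
    instruction = (
        "Classify the user prompt below into exactly one of these topics: "
        + ", ".join(TOPICS)
        + ". Reply with the topic name only.\n\nPrompt:\n"
        + prompt
    )
    resp = client.chat.completions.create(
        model="llama-3-70b-instruct",  # placeholder model identifier
        messages=[{"role": "user", "content": instruction}],
        temperature=0,
    )
    answer = resp.choices[0].message.content.strip().lower()
    return answer if answer in TOPICS else "other"
```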

Win Rate versus Prompt Difficulty. We employ our recently released pipeline, which scores prompt difficulty, to determine how Llama 3 compares to the other top models as prompts get harder. We define a set of "hardness" criteria and use GPT-4-Turbo to annotate each prompt with a score from 0 to 7 indicating how many of these criteria it satisfies (a higher score indicates a harder prompt); a sketch of this scoring step follows the list. Our 7 criteria are:

1. Specificity: Does the prompt ask for a specific output?
2. Domain Knowledge: Does the prompt cover one or more specific domains?
3. Complexity: Does the prompt have multiple levels of reasoning, components, or variables?
4. Problem-Solving: Does the prompt require the AI to demonstrate active problem-solving skills?
5. Creativity: Does the prompt involve a level of creativity in approaching the problem?
6. Technical Accuracy: Does the prompt require technical accuracy in the response?
7. Real-world Application: Does the prompt relate to real-world applications?
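
As a rough sketch of how this annotation could be implemented, the snippet below asks GPT-4-Turbo to mark each criterion true or false and returns the count of satisfied criteria. The judge instruction wording and JSON output format are assumptions for illustration, not the exact prompts used by our pipeline.

```python
import json
from openai import OpenAI

CRITERIA = [
    "Specificity", "Domain Knowledge", "Complexity", "Problem-Solving",
    "Creativity", "Technical Accuracy", "Real-world Application",
]

client = OpenAI()

def hardness_score(prompt: str) -> int:
    """Count how many of the 7 criteria the judge says the prompt satisfies."""
    instruction = (
        "For the user prompt below, decide true or false for each criterion "
        "and reply with a JSON object mapping each criterion name to a boolean.\n"
        "Criteria: " + ", ".join(CRITERIA) + "\n\nPrompt:\n" + prompt
    )
    resp = client.chat.completions.create(
        model="gpt-4-turbo-2024-04-09",
        messages=[{"role": "user", "content": instruction}],
        temperature=0,
        response_format={"type": "json_object"},
    )
    labels = json.loads(resp.choices[0].message.content)
    return sum(bool(labels.get(c, False)) for c in CRITERIA)
```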

We score 1000 battles against the top 3 models on the leaderboard and plot their win rates versus prompt score in Figure 2. We observe a significant drop in Llama 3's performance relative to the other top models, from a high of 50% to a low of 40%. We conclude that as more of these "hardness" criteria are met, Llama 3's win rate drops rapidly compared to the other models. Note that these criteria may not be exhaustive; see the blog post introducing the pipeline for further discussion.
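
For reference, the aggregation behind a plot like Figure 2 can be sketched as follows, assuming a battle log with a precomputed hardness score and a winner column (the file name and column names are hypothetical):

```python
import pandas as pd

# One row per battle; `hardness` is the 0-7 criteria count, `winner` holds the
# winning model's name or "tie". File and column names are hypothetical.
battles = pd.read_json("scored_battles.jsonl", lines=True)
no_ties = battles[battles["winner"] != "tie"]

win_rate_by_hardness = (
    (no_ties["winner"] == "llama-3-70b-instruct")  # True where Llama 3 won
    .groupby(no_ties["hardness"])
    .mean()
)
print(win_rate_by_hardness)  # win rate for each hardness bucket, 0 through 7
```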

Figure 2. Win rates of several top models against the strongest six models, over intervals of the number of key criteria satisfied. *English battles between the strongest models: llama-3-70b-chat, claude-3-opus-20240229, gpt-4-0125-preview, gpt-4-1106-preview, gpt-4-turbo-2024-04-09, gemini-1.5-pro-api-0409-preview.

Figure 3. The distribution of the number of hardness criteria met in a 3.5K sample of Arena battles. We observe that a significant portion of the battles (~27%) are classified as hard.

We can further analyze which types of prompts affect win rate by fitting a decision tree on 7 binary columns indicating whether a given prompt satisfies each of the criteria above. From this decision tree, we can segment prompts into criteria subsets on which Llama 3-70B-Instruct performs either very well or very poorly. The tree in Figure 4 shows which subsets shift the model's win rate the most when conditioned on.
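
A minimal sketch of this segmentation is shown below, assuming the battles live in a dataframe with one 0/1 column per criterion and a winner column (the file and column names are placeholders). We use a regression tree on win/loss outcomes so that each leaf value reads directly as a conditional win rate.

```python
import pandas as pd
from sklearn.tree import DecisionTreeRegressor, export_text

CRITERIA_COLS = [
    "specificity", "domain_knowledge", "complexity", "problem_solving",
    "creativity", "technical_accuracy", "real_world",
]

battles = pd.read_json("scored_battles.jsonl", lines=True)  # hypothetical file
X = battles[CRITERIA_COLS].astype(int)                       # 0/1 criterion labels
y = (battles["winner"] == "llama-3-70b-instruct").astype(int)

tree = DecisionTreeRegressor(max_depth=3, min_samples_leaf=50, random_state=0)
tree.fit(X, y)

# Each leaf's value is the mean outcome, i.e. Llama 3's win rate conditioned
# on the criteria path leading to that leaf.
print(export_text(tree, feature_names=CRITERIA_COLS))
```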

Figure 4. Llama 3-70b-Instruct's win rate conditioned on hierarchical prompt criteria subsets as fitted using a standard decision tree algorithm.

The first thing to notice is that “Specificity” is the root node of the tree, suggesting that this criterion most immediately divides Llama 3-70B-Instruct’s performance into its strengths and weaknesses. This supports our initial finding above that Llama 3-70B-Instruct is stronger on open-ended tasks than on more closed-ended ones. Traversing further down the tree, we see that Llama 3-70B-Instruct is quite strong on open-ended creative questions (see the blue path), reaching around a 60% win rate against these top models. Empirically, these are often writing and brainstorming style questions. For example, two prompts where Llama-3-70B-Instruct won are: "Write the first chapter of a novel." and "Could you provide two story suggestions for children that promote altruism?"

On the other hand, following the orange path, we notice that Llama 3-70B-Instruct has a lower win rate against top models when answering closed-ended, non-real-world, reasoning-based questions. These questions are often logic puzzles and math word problems. Two examples of such prompts are: "123x = -4x * 2 - 65" and "There are two ducks in front of a duck, two ducks behind a duck and a duck in the middle. How many ducks are there?"

The effect of overrepresented prompts and judges

Effect of duplicate prompts. Using fuzzy string matching, we find that ~9% of the user prompts in battles between Llama 3 and the other top models are duplicates (deduplication reduces the 7327 prompts to 6658), and Table 1 shows that deduplication does not significantly affect Llama 3's win rate.
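
A simple sketch of such fuzzy deduplication is shown below; the use of rapidfuzz and the 95-point similarity threshold are illustrative choices, not necessarily those used in our analysis.

```python
from rapidfuzz import fuzz

def deduplicate(prompts: list[str], threshold: float = 95.0) -> list[str]:
    """Keep a prompt only if it is not near-identical to an already-kept one."""
    kept: list[str] = []
    for prompt in prompts:
        if all(fuzz.ratio(prompt, seen) < threshold for seen in kept):
            kept.append(prompt)
    return kept
```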


Table 1: Llama 3-70b battle stats.

| Model | # battles | # battles (no tie) | # battles (dedup, no tie) | Llama 3 win rate | Llama 3 win rate (dedup, no tie) |
|---|---|---|---|---|---|
| Claude 3 Opus | 1959 | 1328 | 1171 | 51.28% | 51.58% |
| Gemini 1.5 | 2413 | 1620 | 1437 | 50.06% | 49.48% |
| GPT-4 0125 | 1271 | 881 | 779 | 48.58% | 49.04% |
| GPT-4 1106 | 526 | 349 | 307 | 50.72% | 52.12% |
| GPT-4-Turbo | 2097 | 1437 | 1287 | 47.74% | 47.73% |

User analysis. First, we consider some basic user statistics in Table 2 to check whether judging behavior is similar between Claude-3-Opus-20240229 and Llama 3-70B-Instruct.


Table 2. Detailed Engagement Metrics for LLMs (Timeframe: April 24 - May 1, 2024). The latest and detailed version here.

| Model | Battles | Unique Judges | Mean Votes per Judge | Median Votes per Judge | Max Votes per Judge |
|---|---|---|---|---|---|
| Llama 3-70B-Instruct | 12,719 | 7,591 | 1.68 | 1 | 65 |
| Claude-3-Opus-20240229 | 68,656 | 48,570 | 1.41 | 1 | 73 |
| All Models (All Time) | 749,205 | 316,372 | 2.37 | 1 | 591 |
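
These engagement metrics can be computed with a straightforward groupby over the battle log; the sketch below assumes hypothetical model_a, model_b, and judge_id columns rather than the log's actual schema.

```python
import pandas as pd

def engagement_metrics(battles: pd.DataFrame, model: str) -> dict:
    """Battle count and per-judge vote statistics for one model."""
    subset = battles[(battles["model_a"] == model) | (battles["model_b"] == model)]
    votes_per_judge = subset.groupby("judge_id").size()
    return {
        "battles": len(subset),
        "unique_judges": votes_per_judge.shape[0],
        "mean_votes_per_judge": votes_per_judge.mean(),
        "median_votes_per_judge": votes_per_judge.median(),
        "max_votes_per_judge": votes_per_judge.max(),
    }
```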

To limit the impact of users who vote many times, we can take the mean of each judge’s win rate, bounding the influence of any individual judge (a sketch of this computation follows Table 3). We find that the stratified win rate shown in Table 3 is still very similar to the original win rate, suggesting that very active judges are not skewing the result.


Table 3. Model Win Rates (Timeframe: April 24 - May 1, 2024). The latest and detailed version here. Note that ties are counted as 0.5, with wins and losses as 1 and 0, respectively.

| Model | Win Rate | Stratified Win Rate |
|---|---|---|
| Llama 3-70B-Instruct | 0.541 | 0.543 |
| Claude-3-Opus-20240229 | 0.619 | 0.621 |
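
Under the same assumed schema as above, a minimal sketch of the stratified win rate: score each battle 1, 0.5, or 0 (win, tie, loss) for the model, average within each judge, then average across judges so that every judge carries equal weight.

```python
import pandas as pd

def stratified_win_rate(battles: pd.DataFrame, model: str) -> float:
    """Average each judge's personal win rate so every judge counts equally."""
    subset = battles[(battles["model_a"] == model) | (battles["model_b"] == model)]
    scores = subset["winner"].map(
        lambda w: 1.0 if w == model else 0.5 if w == "tie" else 0.0
    )
    per_judge = scores.groupby(subset["judge_id"]).mean()  # one win rate per judge
    return float(per_judge.mean())
```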

Qualitative differences between Llama 3's outputs and other models' outputs. From a qualitative analysis of outputs from Llama 3 and the other models, we observe that Llama 3's outputs are often more excited, positive, conversational, and friendly than those of other models.

Measuring sentiment. To measure excitement, we assign a binary label to each output based on the presence of an exclamation point. For positivity, friendliness, and conversationality, we use GPT-3.5 as a judge to rate each output on a scale of 1-5. In a given battle, Llama 3's outputs are labeled as more excited, positive, conversational, or friendly if their score is higher than the opponent's. Figure 5 displays the distribution of these qualities across models, revealing that Llama 3's outputs generally exhibit higher levels of excitement, positivity, friendliness, and conversationality as compared to their opponents.
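
A hedged sketch of this labeling is below: a binary excitement flag from exclamation points plus a 1-5 judge rating from GPT-3.5. The judge prompt wording is an assumption for illustration, not the exact instruction we used.

```python
import re
from openai import OpenAI

client = OpenAI()

def is_excited(text: str) -> bool:
    """Binary excitement label: does the output contain an exclamation point?"""
    return "!" in text

def rate_quality(text: str, quality: str) -> int:
    """Ask GPT-3.5 to rate a quality (e.g. 'friendliness') on a 1-5 scale."""
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{
            "role": "user",
            "content": (
                f"Rate the {quality} of the following response on a scale of "
                f"1 to 5. Reply with a single integer.\n\n{text}"
            ),
        }],
        temperature=0,
    )
    match = re.search(r"[1-5]", resp.choices[0].message.content)
    return int(match.group()) if match else 3  # fall back to the midpoint
```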

Figure 5: Proportion of arena prompts where Llama 3 is more positive/friendly/conversational/exclamatory than its opponent.

Is sentiment related to win rate? Figure 6 compares the sentiment qualities of Llama 3's outputs in battles it wins versus those it loses. We see that all traits appear more in winning battles and less in losing battles, but this difference is relatively small, especially for positivity and friendliness. This suggests that while these traits might play a role in competitive success, their influence requires further exploration for more definitive insights.

Figure 6: Llama 3's sentiment versus its win rate when Llama 3 is more positive/friendly/conversational/exclamatory than its opponent.

Conclusion

From the beginning, our mission has been to advance LLM development and understanding. While in the past we have focused on high-level ranking and benchmark design, moving forward, we hope to extend the analysis here and conduct more in-depth analysis into changes in human preference as well as model behavior.

Acknowledgment

We thank Matei Zaharia and Ion Stoica for their valuable feedback.

Citation

@misc{llama3arena2024,
    title = {What’s up with Llama 3? Arena data analysis},
    url = {https://lmsys.org/blog/2024-05-08-llama3/},
    author = {Lisa Dunlap and Evan Frick and Tianle Li and Isaac Ong and Joseph E. Gonzalez and Wei-Lin Chiang},
    month = {May},
    year = {2024}
}

Qualitative examples

Prompt: What is entropy? Explain using only words that start by a or t

Llama 3-70b-Instruct: What an intriguing inquiry!

Alright, Adept Thinker, here's an attempt to articulate the abstract abstraction of entropy using only words that start with A or T:

"Astonishingly, the tendency to array things in an orderly arrangement tends to topple, allowing anomalous atoms to transition towards tumultuous, aimless arrangements. This transformation tends to accelerate, augmenting the amount of apparent arbitrariness. As a result, the aggregate arrangement assumes an attitude of advanced ambiguity, making it arduous to accurately anticipate the trajectory of attending atoms."

There you have it, an all-A-and-T explanation of entropy!

GPT-4-Turbo: Entropy, abstractly articulated, always alludes to a thermodynamic theory, attributing attributes typically tied to the transformation, transfer, or turmoil within thermodynamic systems. Technically, this term tackles the tendency to transition from an arranged state to a tangled, tumultuous state. Theoretically, entropy assesses the amount attainable arrangements accessible at a thermodynamic level. Thus, through time, systems tend toward these tangled states, termed "thermodynamic equilibrium." This transformation typifies the arrow of time, tracing transitions toward total disorder.

Winner: Llama 3


Prompt: What is this called you are right until you are proven wrong

Llama 3-70b-Instruct: You're referring to a common phenomenon!

What you're describing is often called "Presumption of innocence until proven guilty" or simply "Presumption of innocence." This concept is a fundamental principle in many legal systems, (including criminal law and ethics) that assumes a person is innocent until they are proven guilty beyond a reasonable doubt.

However, if we broaden the context to include everyday conversations and debates, this phenomenon can also be referred to as:

  1. Appeal to ignorance: The assumption that a statement is true simply because it has not been proven false.
  2. Burden of proof: The idea that someone making a claim must provide evidence to support it, and until they do, the claim is not accepted as true.
  3. Skepticism: A healthy dose of doubt or prompting of a claim until sufficient evidence is provided to support it.

In general, it's essential to approach claims and arguments with a critical and open-minded perspective, recognizing that assumptions can be wrong until proven right, and that evidence and reasoning are necessary to support a claim.

Claude-3-Opus: The concept you are referring to is known as the "null hypothesis" or "presumption of innocence."

In scientific research, the null hypothesis is the default assumption that there is no relationship or effect between the variables being studied. Researchers aim to reject the null hypothesis by providing evidence supporting their alternative hypothesis. Until sufficient evidence is presented, the null hypothesis is presumed to be true.

In the legal context, the presumption of innocence is a fundamental principle stating that a person accused of a crime is considered innocent until proven guilty in a court of law. The burden of proof lies with the prosecution to establish guilt beyond a reasonable doubt.

Both concepts share the idea that a claim or accusation is initially assumed to be false or unsupported until sufficient evidence is provided to prove otherwise. This approach helps to prevent unjustified assertions and encourages rigorous evidence-based reasoning.

Winner: Claude-3-Opus