Federico Ramallo
Aug 5, 2024
Qwen by Alibaba Achieves First Place on Hugging Face's LLM Rankings
Here is the ranking of the top 10 models on the updated version of the Open LLM Leaderboard:
Qwen/Qwen2-72B-Instruct
meta-llama/Meta-Llama-3-70B-Instruct
microsoft/Phi-3-medium-4k-instruct
01-ai/Yi-1.5-34B-Chat
CohereForAI/c4ai-command-r-plus
abacusai/Smaug-72B-v0.1
Qwen/Qwen1.5-110B
Qwen/Qwen1.5-110B-Chat
microsoft/Phi-3-small-128k-instruct
01-ai/Yi-1.5-9B-Chat
Evaluating large language models consistently is difficult, since results depend heavily on the evaluation setup each team uses. To address this, the Open LLM Leaderboard was created to provide a standardized evaluation setup for reference models, ensuring fair and comparable results. Over the past year, the leaderboard has become a widely used resource in the machine learning community, drawing millions of visitors and active users who collaborate on submissions and discussions.
However, the leaderboard's success and the steady rise in model performance have created several challenges. The original benchmarks became too easy for current models, leading to saturation, where models reach baseline human performance; some newer models also showed signs of contamination, having been trained on or exposed to benchmark data. Additionally, some benchmarks contained errors that affected the accuracy of evaluations.
To address these issues, the Open LLM Leaderboard has been upgraded to version 2, featuring new benchmarks with uncontaminated, high-quality datasets and reliable metrics. The new benchmarks cover general tasks such as knowledge testing, reasoning on short and long contexts, complex mathematical abilities, and tasks correlated with human preference, like instruction following. The selected benchmarks include:
MMLU-Pro: An improved version of the MMLU dataset, presenting more choices, requiring more reasoning, and having higher quality through expert reviews.
GPQA: A difficult knowledge dataset with questions designed by domain experts to be challenging for laypersons but easier for experts.
MuSR: A dataset with algorithmically generated complex problems, such as murder mysteries and team allocation optimizations, requiring reasoning and long-context parsing.
MATH: A compilation of high-school-level competition problems formatted consistently, focusing on the hardest questions.
IFEval: A dataset testing models' ability to follow explicit instructions, using rigorous metrics for evaluation.
BBH: A subset of challenging tasks from the BigBench dataset, focusing on multistep arithmetic, algorithmic reasoning, language understanding, and world knowledge.
The criteria for selecting these benchmarks included evaluation quality, reliability and fairness of metrics, absence of contamination, and relevance to the community.
Another significant change in the leaderboard is the use of normalized scores for ranking models. Instead of summing raw benchmark scores, scores are now normalized between a random baseline (0 points) and the maximal possible score (100 points). This approach ensures a fairer comparison of models' performance across different benchmarks.
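To make the normalization concrete, here is a minimal Python sketch of the rescaling described above. The baseline and score values are illustrative placeholders, not figures from the leaderboard, and the clamping of sub-baseline scores to zero is an assumption for the sketch.

```python
def normalize(raw_score: float, random_baseline: float) -> float:
    """Rescale a raw benchmark score so the random baseline maps to 0
    and a perfect score maps to 100.

    Scores at or below the baseline are clamped to 0 (an assumption
    for this sketch)."""
    if raw_score <= random_baseline:
        return 0.0
    return (raw_score - random_baseline) / (100.0 - random_baseline) * 100.0


# Illustrative example: a 4-choice multiple-choice benchmark has a random
# baseline of 25%, so a raw accuracy of 50% is worth ~33.3 normalized points.
print(normalize(50.0, 25.0))   # ~33.3
print(normalize(25.0, 25.0))   # 0.0  (random guessing earns no credit)
print(normalize(100.0, 25.0))  # 100.0
```

Because each benchmark has a different random baseline, this rescaling keeps a lucky-guess score from inflating a model's average the way summing raw scores would.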
The evaluation suite has also been updated to improve reproducibility. The Open LLM Leaderboard now uses an updated version of the lm-eval harness from EleutherAI, ensuring consistent evaluations. New features include support for delta weights, a logging system compatible with the leaderboard, and the use of chat templates for evaluation.
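For readers who want to try a comparable evaluation locally, the sketch below uses the Python API of EleutherAI's lm-eval harness. It is only an illustration under assumptions: the task name, the `apply_chat_template` argument, and the model chosen here may differ from the exact configuration the leaderboard runs.

```python
# Minimal sketch of running one instruction-following evaluation with
# lm-eval (pip install lm-eval). Task name and arguments are assumptions
# and may not match the leaderboard's own setup.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                  # Hugging Face transformers backend
    model_args="pretrained=Qwen/Qwen2-72B-Instruct,dtype=bfloat16",
    tasks=["ifeval"],                            # instruction-following benchmark
    apply_chat_template=True,                    # evaluate with the model's chat template
    batch_size="auto",
)

# Per-task metrics are keyed by task name in the results dictionary.
print(results["results"])
```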
The leaderboard also introduces the "maintainer’s choice" category to highlight high-quality models and prioritize evaluations for the most relevant models to the community. A voting system has been implemented to allow the community to prioritize which models should be evaluated first.
To improve user experience, the leaderboard interface has been enhanced for faster and simpler navigation. The FAQ and About sections have been moved to a dedicated documentation page, and the frontend performance has been improved thanks to contributions from the Gradio team.
The upgraded leaderboard aims to push the boundaries of open and reproducible model evaluations, encouraging the development of state-of-the-art models. The community can continue to find previous results archived in the Open LLM Leaderboard Archive. The future looks promising with trends showing improvements in smaller, more efficient models. This upgraded version sets a new standard for evaluating and comparing LLMs, fostering progress in the field of deep learning.
Guadalajara
Werkshop - Av. Acueducto 6050, Lomas del bosque, Plaza Acueducto. 45116,
Zapopan, Jalisco. México.
Texas
5700 Granite Parkway, Suite 200, Plano, Texas 75024.
© Density Labs. All Rights reserved. Privacy policy and Terms of Use.