<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>BaseBench</title>
    <script src="https://cdn.jsdelivr.net/npm/marked/marked.min.js"></script>
    <style>
        body {
            font-family: Arial, sans-serif;
            max-width: 800px;
            margin: 0 auto;
            padding: 20px;
            transition: background-color 0.3s, color 0.3s;
        }
        body.dark-mode {
            background-color: #1a1a1a;
            color: #f0f0f0;
        }
        h1 {
            text-align: center;
        }
        #content {
            margin-top: 20px;
        }
        #theme-toggle {
            position: absolute;
            top: 10px;
            right: 10px;
            padding: 5px 10px;
            background-color: #4CAF50;
            color: white;
            border: none;
            cursor: pointer;
        }
        table {
            border-collapse: collapse;
            width: 100%;
        }
        th, td {
            border: 1px solid #ddd;
            padding: 8px;
            text-align: left;
        }
        .dark-mode th, .dark-mode td {
            border-color: #444;
        }
    </style>
</head>
<body>
    <h1>BaseBench</h1>
    <button id="theme-toggle">Dark/Light Theme</button>
    <div id="content"></div>

    <script>
        const markdown = `
BaseBench: A Foundational Language Model Evaluation Framework

**Description**:
BaseBench is a targeted evaluation framework designed to assess the fundamental capabilities of large language models across a spectrum of basic yet crucial tasks. This suite focuses on core competencies that serve as building blocks for more complex language understanding and generation.

**Features**:

1. Encoding/Decoding Proficiency: Tests the model's ability to work with common encoding schemes like Base64 and ROT13, evaluating its understanding of data representation and transformation (see the sketch after this list).

2. Basic Mathematical Reasoning: Assesses the model's capacity to perform simple arithmetic operations and mathematical problem-solving, gauging its numerical processing capabilities.

3. Linguistic Analysis: Examines the model's grasp of fundamental language properties such as character counting and frequency analysis, probing its understanding of word structure and composition.

4. Error Detection and Correction: Challenges the model to identify and rectify typographical errors, testing its language pattern recognition and error-handling abilities (a task that also stresses tokenization).
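
For illustration, here is a minimal sketch in JavaScript of the transforms the encoding/decoding and linguistic-analysis tasks exercise. The helper names are hypothetical and the benchmark's actual prompts and scoring are not shown; btoa is the browser's built-in Base64 encoder.

    // ROT13: rotate each letter 13 places, preserving case (hypothetical helper)
    function rot13(s) {
        return s.replace(/[a-z]/gi, c => {
            const base = c <= 'Z' ? 65 : 97;
            return String.fromCharCode((c.charCodeAt(0) - base + 13) % 26 + base);
        });
    }

    // Character frequency analysis (hypothetical helper)
    function charFrequency(s) {
        const freq = {};
        for (const ch of s) freq[ch] = (freq[ch] || 0) + 1;
        return freq;
    }

    rot13('BaseBench');        // -> "OnfrOrapu"
    btoa('BaseBench');         // -> "QmFzZUJlbmNo" (Base64, built into browsers)
    charFrequency('banana');   // -> { b: 1, a: 3, n: 2 }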

**Purpose**:
BaseBench aims to provide a clear, quantifiable measure of a language model's proficiency in these foundational areas. By focusing on these essential skills, the benchmark offers:

1. A standardized baseline for comparing different models or versions.
2. Insight into a model's fundamental processing capabilities.
3. A tool for identifying potential gaps in basic language and data handling skills.
4. A means to track incremental improvements in core model competencies.
5. Enough difficulty to avoid benchmark saturation.

| Rank | Model                              | Accuracy           | Time (mm:ss) | Speed     |
|------|------------------------------------|--------------------|--------------|-----------|
| 1    | openai/gpt-4o                      | 59.00% (1475/2500) | 03:17 | 12.66it/s |
| 2    | anthropic/claude-3.5-sonnet:beta   | 52.56% (1314/2500) | 14:44 | 2.83it/s  |
| 3    | mistralai/mistral-large-2407       | 37.20% (930/2500)  | 05:13 | 7.96it/s  |
| 4    | openai/gpt-4o-mini                 | 36.92% (923/2500)  | 08:28 | 4.91it/s  |
| 5    | anthropic/claude-3-haiku:beta      | 36.72% (918/2500)  | 06:20 | 6.57it/s  |
| 6    | google/gemini-pro-1.5              | 26.92% (673/2500)  | 03:05 | 13.51it/s |
| 7    | google/gemma-2-27b-it              | 25.24% (631/2500)  | 05:52 | 7.08it/s  |
| 8    | meta-llama/llama-3.1-405b-instruct | 24.24% (606/2500)  | 07:19 | 5.69it/s  |
| 9    | 01-ai/yi-large                     | 20.68% (517/2500)  | 02:37 | 15.83it/s |
| 10   | mistralai/mixtral-8x22b-instruct   | 19.60% (490/2500)  | 04:32 | 9.18it/s  |
| 11   | meta-llama/llama-3.1-70b-instruct  | 19.04% (476/2500)  | 18:01 | 2.31it/s  |

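Accuracy is correct answers over total items, and Speed is items processed per second (presumably a tqdm-style it/s readout; that reading is an assumption). A quick sanity check in JavaScript on the GPT-4o row:

    // Accuracy = correct / total
    const accuracy = 1475 / 2500;        // 0.59 -> 59.00%
    // Speed = total items / elapsed seconds (03:17 = 197 s)
    const speed = 2500 / (3 * 60 + 17);  // ~12.69 it/s; the table shows 12.66,
                                         // the gap comes from the time being
                                         // rounded to whole seconds
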
**Insights**:
- OpenAI's GPT models lead; among the rest, only Anthropic's flagship (Claude 3.5 Sonnet) beats GPT-4o-mini decisively
- Mistral Large is an outlier: it edges out GPT-4o-mini (37.20% vs. 36.92%), consistent with its MMLU-Pro score
- Llama models score near the bottom of the table
- Proprietary models tend to score better (e.g., Mistral Large), perhaps due to training differences
- Gemini Pro 1.5 is fast (13.51 it/s), but its accuracy is comparable to Gemma 2 27B's
        `;

        document.addEventListener('DOMContentLoaded', function() {
            // Render the embedded markdown into the page once the DOM is ready
            const content = document.getElementById('content');
            content.innerHTML = marked.parse(markdown);

            // Flip the dark-mode class on <body>; the CSS transition handles the fade
            const themeToggle = document.getElementById('theme-toggle');
            themeToggle.addEventListener('click', function() {
                document.body.classList.toggle('dark-mode');
            });
        });
    </script>
</body>
</html>