from fasthtml_hf import setup_hf_backup
from fasthtml.common import *
from fasthtml.components import *
from fasthtml.components import D_title, D_article, D_front_matter, D_contents, D_byline


app, rt = fast_app()


@rt("/")
def get():
    return Html(
        Head(
            Meta(charset="UTF-8"),
            Meta(name="viewport", content="width=device-width, initial-scale=1.0"),
            Link(rel="stylesheet", href="style.css"),
            Script(src="https://distill.pub/template.v2.js"),
        ),
        Body(
            D_title(
                H1(
                    "TxT360: fully open and transparent fusion of web and curated corpora for pre-training large language models",
                    cls="l-page",
                    style="text-align: center;",
                )
            ),
            D_article(
                D_contents(
                    Nav(
                        H3("Table of Contents"),
                        Div(A("TxT360")),
                        Ul(
                            Li(A("Introduction", href="#section1")),
                            Li(A("Background", href="#section2")),
                            Li(A("Main Content", href="#section3")),
                            Li(A("Conclusion", href="#section4")),
                        ),
                        Div(A("Web Data", href="#section5")),
                        Div(A("Curated Sources", href="#section3")),
                        Div(A("Common Steps", href="#section4")),
                        Div(A("TxT360 Results", href="#section4")),
                        role="navigation",
                        cls="l-text figcaption",
                    ),
                ),
                Div(
                    Section(
                        H2("Introduction"),
                        P("""We are excited to introduce TxT360, a
                            large-scale, comprehensive, and fully transparent
                            dataset designed for Large Language Model (LLM)
                            pre-training. TxT360 is engineered to strike a
                            balance between the quantity and quality of
                            pre-training data, pushing the limit on both
                            fronts. This comprehensive dataset encompasses both
                            expansive web-based data and highly curated data
                            sources, making it one of the most robust LLM
                            pre-training corpora available today.  Our web data
                            component includes 99 snapshots from Common Crawl,
                            amassing 5.7 trillion tokens and occupying 11 TB of
                            disk space in jsonl.gz format. On the curated side,
                            TxT360 integrates one of the most extensive
                            collections of high-quality sources, 14 curated
                            sources spanning 10 domains, ensuring diverse and
                            rich content. To maintain the highest quality, we
                            meticulously pre-processed the web data to filter
                            out low-quality content and conducted thorough
                            reviews of the curated sources. This process not
                            only unified their formats but also identified and
                            rectified any anomalies. We open-source all of our
                            processing scripts and also release the details of
                            our data reviews, revealing the decision-making
                            processes behind data selection and quality
                            assurance. This level of transparency
                            allows researchers and practitioners to fully
                            understand the dataset’s composition and make
                            informed decisions when using TxT360 for training.
                            Additionally, TxT360 includes detailed
                            documentation and analysis of the data, covering
                            distribution statistics, domain coverage, and the
                            processing pipeline, which helps users navigate and
                            utilize the dataset effectively.  Overall, TxT360
                            represents a significant step forward in the
                            availability and transparency of large-scale
                            training data for language models, setting a new
                            standard for dataset quality and openness."""),
                        id="section1",
                    ),
                    Section(
                        H2("Background"),
                        P(
                            """ The quality and size of a pre-training dataset
                            play a crucial role in the performance of large
                            language models (LLMs). The community has
                            introduced a variety of datasets for this purpose,
                            including purely web-based datasets like RefinedWeb
                            [1], RedPajama-Data-V2 [2], DCLM [3], and
                            FineWeb [4], as well as comprehensive datasets
                            derived from multiple highly-curated data sources
                            such as The Pile [5], RedPajama-Data-V1 [6], and
                            Dolma [7]. It is well known that web-based
                            datasets provide a vast quantity of data, while
                            highly-curated multi-source datasets consistently
                            deliver high quality and diversity, both critical
                            for effective LLM pre-training. However, despite
                            advances in both, each type of dataset has its
                            limitations. For instance, the
                            processing scripts for the web dataset, RefinedWeb,
                            known for its high quality, are not public, and
                            only about 10% of the entire dataset has been
                            disclosed. Conversely, the web component of
                            existing highly-curated multi-source datasets is
                            relatively small compared to purely web-based
                            datasets, limiting their coverage and diversity
                            relative to the scale of information available on
                            the internet.  By integrating the extensive reach of
                            web data with the exceptional quality of curated
                            sources, TxT360 is crafted to meet and surpass the
                            rigorous standards required for state-of-the-art
                            LLM pre-training. """
                        ),
                        id="section2",
                    ),
                    Section(
                        H2("Main Content"),
                        P(
                            """The performance of a large language model (LLM)
                            depends heavily on the quality and size of its
                            pretraining dataset. However, the pretraining
                            datasets for state-of-the-art open LLMs like Llama
                            3 and Mixtral are not publicly available and very
                            little is known about how they were created.
                            Recently, we released 🍷 FineWeb, a new,
                            large-scale (15 trillion tokens, 44 TB disk space)
                            dataset for LLM pretraining. FineWeb is derived
                            from 96 CommonCrawl snapshots and produces
                            better-performing LLMs than other open pretraining
                            datasets. To bring more clarity to machine learning
                            and advance the open understanding of how to train
                            high-quality large language models, we carefully
                            documented and ablated all of the design choices
                            used in FineWeb, including in-depth investigations
                            of deduplication and filtering strategies. The
                            present long-form report is a deep dive into how to
                            create a large and high-quality web-scale dataset
                            for LLM pretraining. The dataset itself, 🍷
                            FineWeb, is available here.  We are extremely
                            thankful to the whole distill.pub team (Christopher
                            Olah, Shan Carter, Ludwig Schubert in particular)
                            for creating the template on which we based this
                            blog post. Thanks also for inspiring us with
                            exquisitely crafted articles and blog posts.  In
                            this report we also introduce 📚 FineWeb-Edu, a
                            subset of FineWeb constructed using scalable
                            automated high-quality annotations for educational
                            value, and which outperforms all openly accessible
                            web-datasets on a number of educational benchmarks
                            such as MMLU, ARC, and OpenBookQA. 📚 FineWeb-Edu
                            is available in two sizes/filtering levels: 1.3
                            trillion (very high educational content) and 5.4
                            trillion (high educational content) tokens (all
                            tokens are measured with the GPT-2 tokenizer). You can
                            download it here.  Both datasets are released under
                            the permissive ODC-By 1.0 license. TL;DR: This blog
                            covers a discussion on processing and evaluating
                            data quality at scale, the 🍷 FineWeb recipe
                            (listing and explaining all of our design choices),
                            and the process followed to create its 📚
                            FineWeb-Edu subset."""
                        ),
                        id="section3",
                    ),
                    Section(
                        H2("Conclusion"),
                        P("""This is the conclusion section where we
                            summarize the key points discussed in the blog post
                            and provide final thoughts.
                          """),
                        id="section4",
                    ),
                ),
                cls="container",
            ),
        ),
        lang="en",
    )


setup_hf_backup(app)
serve()