<b>Trained by <a href="https://jina.ai/"><b>Jina AI</b></a>.</b>
</p>

[Blog](https://jina.ai/news/readerlm-v2-frontier-small-language-model-for-markdown-and-json) | [Colab](https://colab.research.google.com/drive/1FfPjZwkMSocOLsEYH45B3B4NxDryKLGI?usp=sharing) | [AWS](https://aws.amazon.com/marketplace/pp/prodview-jwfct4j4rvxk2?sr=0-21&ref_=beagle&applicationId=AWSMPContessa) | [arXiv (soon!)]

# ReaderLM-v2

`ReaderLM-v2` is a 1.5B-parameter language model that converts raw HTML into beautifully formatted markdown or JSON with superior accuracy and improved handling of longer contexts. Supporting 29 languages, `ReaderLM-v2` is specialized for tasks involving HTML parsing, transformation, and text extraction.

## What's New in `ReaderLM-v2`

`ReaderLM-v2` represents a significant leap forward from its predecessor, with several key improvements:

- **Better Markdown Generation**: Thanks to its new training paradigm and higher-quality training data, the model excels at generating complex elements like code fences, nested lists, tables, and LaTeX equations.
- **JSON Output**: Introduces direct HTML-to-JSON generation using predefined schemas, eliminating the need for intermediate markdown conversion.
- **Longer Context Handling**: Handles a combined input and output length of up to 512K tokens, with improved performance on long-form content.
- **Multilingual Support**: Comprehensive support across 29 languages for broader applications.
- **Enhanced Stability**: Contrastive loss during training greatly alleviates the degeneration issues that used to appear after generating long sequences.

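The JSON mode works by stating the target schema up front in the prompt. A minimal sketch of assembling such a request — the instruction wording and the `build_json_prompt` helper are illustrative assumptions, not the official template (see the Colab notebook for the recommended one):

```python
import json


def build_json_prompt(html: str, schema: dict) -> str:
    """Assemble an HTML-to-JSON extraction request.

    The instruction wording here is illustrative, not the official template.
    """
    return (
        "Extract the specified information from the HTML below and return "
        "valid JSON matching this schema:\n"
        f"{json.dumps(schema, indent=2)}\n\nHTML:\n{html}"
    )


# Hypothetical schema for a simple page-extraction task.
schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "links": {"type": "array", "items": {"type": "string"}},
    },
}

prompt = build_json_prompt("<h1>Hello</h1>", schema)
```

Since the completion is expected to be machine-parseable, a `json.loads` call on the model output (with a retry or a fallback to markdown mode on failure) is the natural post-processing step.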
## Model Overview

- **Model Type**: Autoregressive, decoder-only transformer
- **Intermediate Size**: 8960
- **Supported Languages**: English, Chinese, Japanese, Korean, French, Spanish, Portuguese, German, Italian, Russian, Vietnamese, Thai, Arabic, and more (29 total)

---

# Usage

## On Google Colab

You can try `ReaderLM-v2` via our [Colab notebook](https://colab.research.google.com/drive/1FfPjZwkMSocOLsEYH45B3B4NxDryKLGI?usp=sharing), which demonstrates HTML-to-markdown conversion, JSON extraction, and instruction following, using the HackerNews frontpage as an example. The notebook is optimized for Colab's free T4 GPU tier and requires `vllm` and `triton` for accelerated inference.

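Whichever backend you run it on, trimming non-content markup before inference stretches the context budget. A minimal preprocessing sketch — the regex patterns and the `clean_html` helper are illustrative assumptions, not the notebook's exact helpers:

```python
import re

# Drop scripts, styles, and comments so more of the 512K-token budget
# goes to actual page content. (Illustrative, not official code.)
_PATTERNS = [
    re.compile(r"<script[^>]*>.*?</script>", re.IGNORECASE | re.DOTALL),
    re.compile(r"<style[^>]*>.*?</style>", re.IGNORECASE | re.DOTALL),
    re.compile(r"<!--.*?-->", re.DOTALL),
]


def clean_html(html: str) -> str:
    for pattern in _PATTERNS:
        html = pattern.sub("", html)
    return html


cleaned = clean_html("<h1>Hi</h1><script>track();</script><!-- ad slot -->")
```
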
Note that the free T4 GPU has limitations: it supports neither bfloat16 nor flash attention 2, leading to higher memory usage and slower processing of longer inputs. Nevertheless, ReaderLM-v2 successfully processes large documents under these constraints, achieving processing speeds of 67 tokens/s on input and 36 tokens/s on output. For production use, we recommend an RTX 3090/4090 for optimal performance.
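Because bfloat16 availability varies by GPU generation (the T4 caveat above), a small guard can pick a safe dtype at load time. `pick_dtype` is a hypothetical helper, not part of this card's official code:

```python
import torch


def pick_dtype() -> torch.dtype:
    # T4-class GPUs lack bfloat16 support, so fall back to float16 there;
    # on CPU-only hosts float32 is the safe default.
    if torch.cuda.is_available():
        return torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16
    return torch.float32


dtype = pick_dtype()
```

The result can then be passed as `torch_dtype=pick_dtype()` to `AutoModelForCausalLM.from_pretrained` when loading the model.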