numb3r3 committed (verified) · Commit 14fbb0c · 1 parent: a6c0d30

Update README.md

Files changed (1):
  1. README.md (+118 −127)

README.md CHANGED
@@ -19,164 +19,155 @@ library_name: transformers
 
  [Blog](https://jina.ai/news/readerlm-v2-frontier-small-language-model-for-markdown-and-json) | [Colab](https://colab.research.google.com/drive/1FfPjZwkMSocOLsEYH45B3B4NxDryKLGI?usp=sharing)
 
- # Intro
 
- Jina `ReaderLM-v2` is the second generation of Jina ReaderLM, a **1.5B** parameter language model that converts raw HTML into beautifully formatted markdown or JSON with superior accuracy and improved longer context handling.
 
- `ReaderLM-v2` features several significant improvements:
 
- - **Better Markdown Generation**: `ReaderLM-v2` generates markdown with improved formatting and readability.
- - **JSON Output**: `ReaderLM-v2` can output JSON format, which is useful for downstream processing.
- - **Longer Context Handling**: `ReaderLM-v2` can handle up to 512K tokens of combined input and output length.
- - **Multilingual Support**: `ReaderLM-v2` supports 29 languages, including English, Chinese, Japanese, Korean, French, Spanish, Portuguese, German, Italian, Russian, Vietnamese, Thai, Arabic, and more.
 
- # Get Started
 
- ## On Google Colab
- The easiest way to experience reader-lm is by running [our Colab notebook](https://colab.research.google.com/drive/1FfPjZwkMSocOLsEYH45B3B4NxDryKLGI?usp=sharing),
- which demonstrates HTML-to-markdown conversion, JSON extraction, and instruction-following using the HackerNews frontpage as an example.
- The notebook is optimized for Colab's free T4 GPU tier and requires vllm and triton for acceleration and running.
- Feel free to test it with any website.
- For HTML-to-markdown tasks, simply input the raw HTML without any prefix instructions.
- However, JSON output and instruction-based extraction require specific prompt formatting as shown in the examples.
 
- ## Local
 
- To use this model, you need to install `transformers`:
 
- ```bash
- pip install transformers
- ```
 
- ### HTML to Markdown Conversion
-
- Then, you can use the model to convert HTML to Markdown as follows:
 
  ```python
- # pip install transformers
- from transformers import AutoModelForCausalLM, AutoTokenizer
- import re
-
- # (REMOVE <SCRIPT> to </script> and variations)
- SCRIPT_PATTERN = r'<[ ]*script.*?\/[ ]*script[ ]*>'  # mach any char zero or more times
-
- # (REMOVE HTML <STYLE> to </style> and variations)
- STYLE_PATTERN = r'<[ ]*style.*?\/[ ]*style[ ]*>'  # mach any char zero or more times
-
- # (REMOVE HTML <META> to </meta> and variations)
- META_PATTERN = r'<[ ]*meta.*?>'  # mach any char zero or more times
-
- # (REMOVE HTML COMMENTS <!-- to --> and variations)
- COMMENT_PATTERN = r'<[ ]*!--.*?--[ ]*>'  # mach any char zero or more times
-
- # (REMOVE HTML LINK <LINK> to </link> and variations)
- LINK_PATTERN = r'<[ ]*link.*?>'  # mach any char zero or more times
-
- # (REPLACE base64 images)
- BASE64_IMG_PATTERN = r'<img[^>]+src="data:image/[^;]+;base64,[^"]+"[^>]*>'
-
- # (REPLACE <svg> to </svg> and variations)
- SVG_PATTERN = r'(<svg[^>]*>)(.*?)(<\/svg>)'
-
- def replace_svg(html: str, new_content: str = "this is a placeholder") -> str:
-     return re.sub(
-         SVG_PATTERN,
-         lambda match: f"{match.group(1)}{new_content}{match.group(3)}",
-         html,
-         flags=re.DOTALL,
-     )
-
- def replace_base64_images(html: str, new_image_src: str = "#") -> str:
-     return re.sub(BASE64_IMG_PATTERN, f'<img src="{new_image_src}"/>', html)
-
- def clean_html(html: str, clean_svg: bool = False, clean_base64: bool = False):
-     html = re.sub(SCRIPT_PATTERN, '', html, flags=(re.IGNORECASE | re.MULTILINE | re.DOTALL))
-     html = re.sub(STYLE_PATTERN, '', html, flags=(re.IGNORECASE | re.MULTILINE | re.DOTALL))
-     html = re.sub(META_PATTERN, '', html, flags=(re.IGNORECASE | re.MULTILINE | re.DOTALL))
-     html = re.sub(COMMENT_PATTERN, '', html, flags=(re.IGNORECASE | re.MULTILINE | re.DOTALL))
-     html = re.sub(LINK_PATTERN, '', html, flags=(re.IGNORECASE | re.MULTILINE | re.DOTALL))
-
-     if clean_svg:
-         html = replace_svg(html)
-
-     if clean_base64:
-         html = replace_base64_images(html)
-
-     return html
-
-
- device = "cuda"  # for GPU usage or "cpu" for CPU usage
- tokenizer = AutoTokenizer.from_pretrained("jinaai/ReaderLM-v2")
- model = AutoModelForCausalLM.from_pretrained("jinaai/ReaderLM-v2").to(device)
-
- def create_prompt(text: str, tokenizer=None, instruction: str = None, schema: str = None) -> str:
-     """
-     Create a prompt for the model with optional instruction and JSON schema.
-
-     Args:
-         text (str): The input HTML text
-         tokenizer: The tokenizer to use
-         instruction (str, optional): Custom instruction for the model
-         schema (str, optional): JSON schema for structured extraction
-
-     Returns:
-         str: The formatted prompt
-     """
-
-     if not instruction:
-         instruction = "Extract the main content from the given HTML and convert it to Markdown format."
-
-     if schema:
-         instruction = 'Extract the specified information from a list of news threads and present it in a structured JSON format.'
-         prompt = f"{instruction}\n```html\n{text}\n```\nThe JSON schema is as follows:```json{schema}```"
-     else:
-         prompt = f"{instruction}\n```html\n{text}\n```"
-
-     messages = [
-         {
-             "role": "user",
-             "content": prompt,
-         }
-     ]
-
-     return tokenizer.apply_chat_template(
-         messages, tokenize=False, add_generation_prompt=True
-     )
-
- # example html content
  html = "<html><body><h1>Hello, world!</h1></body></html>"
 
- # clean the html content, remove scripts, styles, comments, etc.
  html = clean_html(html)
 
  input_prompt = create_prompt(html)
-
- print(input_prompt)
-
  inputs = tokenizer.encode(input_prompt, return_tensors="pt").to(device)
  outputs = model.generate(inputs, max_new_tokens=1024, temperature=0, do_sample=False, repetition_penalty=1.08)
 
  print(tokenizer.decode(outputs[0]))
  ```
 
- You can also specify the content you want to extract from the HTML by providing a custom instruction.
- For example, if you want to extract the menu items from the HTML content, you can create a prompt like this:
 
  ```python
  instruction = "Extract the menu items from the given HTML and convert it to Markdown format."
  input_prompt = create_prompt(html, instruction=instruction)
-
  inputs = tokenizer.encode(input_prompt, return_tensors="pt").to(device)
  outputs = model.generate(inputs, max_new_tokens=1024, temperature=0, do_sample=False, repetition_penalty=1.08)
 
  print(tokenizer.decode(outputs[0]))
  ```
 
- ### HTML to JSON Conversion
-
- To extract structured information from HTML content and convert it to JSON, you can create a prompt with a JSON schema.
 
  ```python
  schema = """
@@ -200,6 +191,7 @@ schema = """
  }
  """
 
  input_prompt = create_prompt(html, schema=schema)
 
  inputs = tokenizer.encode(input_prompt, return_tensors="pt").to(device)
@@ -208,7 +200,6 @@ outputs = model.generate(inputs, max_new_tokens=1024, temperature=0, do_sample=F
  print(tokenizer.decode(outputs[0]))
  ```
 
- ## AWS Sagemaker & Azure Marketplace
-
- TBD
 
 
  [Blog](https://jina.ai/news/readerlm-v2-frontier-small-language-model-for-markdown-and-json) | [Colab](https://colab.research.google.com/drive/1FfPjZwkMSocOLsEYH45B3B4NxDryKLGI?usp=sharing)
 
+ # ReaderLM-v2
 
+ `ReaderLM-v2` is the second generation of Jina ReaderLM, a **1.5B** parameter language model that converts raw HTML into beautifully formatted markdown or JSON with superior accuracy and improved longer context handling.
+ It supports 29 languages and is specialized for tasks involving HTML parsing, transformation, and text extraction.
 
+ ## Model Overview
 
+ - **Model Type**: Autoregressive, decoder-only transformer
+ - **Parameter Count**: ~1.5B
+ - **Context Window**: Up to 512K tokens (combined input and output)
+ - **Supported Languages**: English, Chinese, Japanese, Korean, French, Spanish, Portuguese, German, Italian, Russian, Vietnamese, Thai, Arabic, and more (29 total)
 
+ ## What's New in `ReaderLM-v2`
 
+ `ReaderLM-v2` features several significant improvements over its predecessor:
 
+ - **Better Markdown Generation**: Generates cleaner, more readable Markdown output.
+ - **JSON Output**: Can produce JSON-formatted text, enabling structured extraction for further downstream processing.
+ - **Longer Context Handling**: Can handle up to 512K tokens, which is beneficial for large HTML documents or combined transformations.
+ - **Multilingual Support**: Covers 29 languages for broader application across international web data.
 
+ ---
 
+ # Usage
 
+ Below are instructions and examples for running `ReaderLM-v2` locally with the Hugging Face Transformers library.
+ For a more hands-on experience in a hosted environment, see the [Google Colab Notebook](https://colab.research.google.com/drive/1FfPjZwkMSocOLsEYH45B3B4NxDryKLGI?usp=sharing).
 
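As a rough back-of-envelope sizing check (our estimate, not an official figure from this model card): ~1.5B parameters stored in bf16/fp16 occupy about 3 GB of weights, which is consistent with the model fitting on a free-tier T4 GPU.

```python
params = 1.5e9          # approximate parameter count
bytes_per_param = 2     # bf16 / fp16
weight_gb = params * bytes_per_param / 1e9
print(f"{weight_gb:.1f} GB of weights")  # 3.0 GB of weights
```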
+ ## On Google Colab
 
+ The easiest way to experience `ReaderLM-v2` is to run our [Colab notebook](https://colab.research.google.com/drive/1FfPjZwkMSocOLsEYH45B3B4NxDryKLGI?usp=sharing),
+ which demonstrates HTML-to-Markdown conversion, JSON extraction, and instruction-following using the HackerNews front page as an example.
+ The notebook runs on Colab's free T4 GPU tier and uses vLLM and Triton for faster inference. You can feed any website's HTML directly into the model.
+
+ - For simple HTML-to-Markdown tasks, you only need to provide the raw HTML (no special instructions).
+ - For JSON output and instruction-based extraction, use the prompt formatting guidelines in the notebook.
+
+ ## Local Usage
+
+ To use `ReaderLM-v2` locally:
+
+ 1. Install the necessary dependencies:
+
+ ```bash
+ pip install transformers
+ ```
+
+ 2. Load and run the model:
+
+ ```python
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+ import re
+
+ device = "cuda"  # or "cpu"
+ tokenizer = AutoTokenizer.from_pretrained("jinaai/ReaderLM-v2")
+ model = AutoModelForCausalLM.from_pretrained("jinaai/ReaderLM-v2").to(device)
+ ```
+
+ 3. (Optional) Pre-clean your HTML to remove scripts, styles, and comments; this trims noise and input length, which also reduces GPU VRAM pressure:
+
+ ```python
+ # Patterns for removing non-content HTML elements
+ SCRIPT_PATTERN = r'<[ ]*script.*?\/[ ]*script[ ]*>'
+ STYLE_PATTERN = r'<[ ]*style.*?\/[ ]*style[ ]*>'
+ META_PATTERN = r'<[ ]*meta.*?>'
+ COMMENT_PATTERN = r'<[ ]*!--.*?--[ ]*>'
+ LINK_PATTERN = r'<[ ]*link.*?>'
+ BASE64_IMG_PATTERN = r'<img[^>]+src="data:image/[^;]+;base64,[^"]+"[^>]*>'
+ SVG_PATTERN = r'(<svg[^>]*>)(.*?)(<\/svg>)'
+
+ def replace_svg(html: str, new_content: str = "this is a placeholder") -> str:
+     # Replace the body of each <svg> element with a short placeholder
+     return re.sub(
+         SVG_PATTERN,
+         lambda match: f"{match.group(1)}{new_content}{match.group(3)}",
+         html,
+         flags=re.DOTALL,
+     )
+
+ def replace_base64_images(html: str, new_image_src: str = "#") -> str:
+     # Swap heavy inline base64 images for a plain placeholder src
+     return re.sub(BASE64_IMG_PATTERN, f'<img src="{new_image_src}"/>', html)
+
+ def clean_html(html: str, clean_svg: bool = False, clean_base64: bool = False):
+     html = re.sub(SCRIPT_PATTERN, '', html, flags=re.IGNORECASE | re.MULTILINE | re.DOTALL)
+     html = re.sub(STYLE_PATTERN, '', html, flags=re.IGNORECASE | re.MULTILINE | re.DOTALL)
+     html = re.sub(META_PATTERN, '', html, flags=re.IGNORECASE | re.MULTILINE | re.DOTALL)
+     html = re.sub(COMMENT_PATTERN, '', html, flags=re.IGNORECASE | re.MULTILINE | re.DOTALL)
+     html = re.sub(LINK_PATTERN, '', html, flags=re.IGNORECASE | re.MULTILINE | re.DOTALL)
+
+     if clean_svg:
+         html = replace_svg(html)
+     if clean_base64:
+         html = replace_base64_images(html)
+     return html
+ ```
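As a quick sanity check of the cleaning step, here is a standalone sketch (it repeats the two patterns it needs so it runs on its own, independent of the block above):

```python
import re

SCRIPT_PATTERN = r'<[ ]*script.*?\/[ ]*script[ ]*>'
STYLE_PATTERN = r'<[ ]*style.*?\/[ ]*style[ ]*>'

raw = ('<html><head><style>body{color:red}</style></head>'
       '<body><script>alert(1)</script><h1>Hello</h1></body></html>')

# Apply the same flags the README's clean_html uses
flags = re.IGNORECASE | re.MULTILINE | re.DOTALL
cleaned = re.sub(SCRIPT_PATTERN, '', raw, flags=flags)
cleaned = re.sub(STYLE_PATTERN, '', cleaned, flags=flags)

print(cleaned)  # <html><head></head><body><h1>Hello</h1></body></html>
```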
+
+ 4. Create a prompt for the model:
+
+ ```python
+ def create_prompt(text: str, tokenizer=None, instruction: str = None, schema: str = None) -> str:
+     """Create a prompt for the model with optional instruction and JSON schema."""
+     if not instruction:
+         instruction = "Extract the main content from the given HTML and convert it to Markdown format."
+     if schema:
+         # Example instruction for JSON output
+         instruction = "Extract the specified information from a list of news threads and present it in a structured JSON format."
+         prompt = f"{instruction}\n```html\n{text}\n```\nThe JSON schema is as follows:```json{schema}```"
+     else:
+         prompt = f"{instruction}\n```html\n{text}\n```"
+
+     messages = [
+         {
+             "role": "user",
+             "content": prompt,
+         }
+     ]
+
+     return tokenizer.apply_chat_template(
+         messages, tokenize=False, add_generation_prompt=True
+     )
+ ```
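To preview the prompt text without loading the tokenizer or model, the string-building branch of `create_prompt` can be reworked into a tiny standalone helper (`build_prompt_text` is our own naming for illustration, not part of the repo):

```python
def build_prompt_text(text: str, instruction: str = None, schema: str = None) -> str:
    # Mirrors create_prompt's string building, minus the chat-template step
    if not instruction:
        instruction = "Extract the main content from the given HTML and convert it to Markdown format."
    if schema:
        return f"{instruction}\n```html\n{text}\n```\nThe JSON schema is as follows:```json{schema}```"
    return f"{instruction}\n```html\n{text}\n```"

print(build_prompt_text("<h1>Hi</h1>"))
```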
+
+ ### HTML to Markdown Example
 
  ```python
+ # Example HTML
  html = "<html><body><h1>Hello, world!</h1></body></html>"
 
  html = clean_html(html)
 
  input_prompt = create_prompt(html)
 
  inputs = tokenizer.encode(input_prompt, return_tensors="pt").to(device)
  outputs = model.generate(inputs, max_new_tokens=1024, temperature=0, do_sample=False, repetition_penalty=1.08)
 
  print(tokenizer.decode(outputs[0]))
  ```
 
+ ### Instruction-Focused Extraction
 
  ```python
  instruction = "Extract the menu items from the given HTML and convert it to Markdown format."
  input_prompt = create_prompt(html, instruction=instruction)
 
  inputs = tokenizer.encode(input_prompt, return_tensors="pt").to(device)
  outputs = model.generate(inputs, max_new_tokens=1024, temperature=0, do_sample=False, repetition_penalty=1.08)
 
  print(tokenizer.decode(outputs[0]))
  ```
 
+ ### HTML to JSON Example
 
  ```python
  schema = """
 
  }
  """
 
+ html = clean_html(html)
  input_prompt = create_prompt(html, schema=schema)
 
  inputs = tokenizer.encode(input_prompt, return_tensors="pt").to(device)
  outputs = model.generate(inputs, max_new_tokens=1024, temperature=0, do_sample=False, repetition_penalty=1.08)
 
  print(tokenizer.decode(outputs[0]))
  ```
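The decoded output typically wraps the JSON in markdown fences; here is a hedged sketch of a post-processing helper (`extract_json` is our own illustrative naming, not part of the model or repo):

```python
import json
import re

def extract_json(decoded: str) -> dict:
    # Pull the first ```json ... ``` block out of the decoded model output;
    # fall back to parsing the whole string if no fence is found.
    match = re.search(r"```json\s*(.*?)\s*```", decoded, flags=re.DOTALL)
    return json.loads(match.group(1) if match else decoded)

sample = 'Result:\n```json\n{"title": "Hello, world!"}\n```'
print(extract_json(sample))  # {'title': 'Hello, world!'}
```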
 
+ ## AWS SageMaker, Azure Marketplace & Google Cloud Platform
 
+ Coming soon.