Update README.md
[Blog](https://jina.ai/news/readerlm-v2-frontier-small-language-model-for-markdown-and-json) | [Colab](https://colab.research.google.com/drive/1FfPjZwkMSocOLsEYH45B3B4NxDryKLGI?usp=sharing)

# ReaderLM-v2

`ReaderLM-v2` is the second generation of Jina ReaderLM, a **1.5B** parameter language model that converts raw HTML into beautifully formatted Markdown or JSON with superior accuracy and improved long-context handling.
It supports multiple languages (29 in total) and is specialized for tasks involving HTML parsing, transformation, and text extraction.

## Model Overview

- **Model Type**: Autoregressive, decoder-only transformer
- **Parameter Count**: ~1.5B
- **Context Window**: Up to 512K tokens (combined input and output; see the token-count sketch below)
- **Supported Languages**: English, Chinese, Japanese, Korean, French, Spanish, Portuguese, German, Italian, Russian, Vietnamese, Thai, Arabic, and more (29 total)
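
Because the 512K budget covers the prompt and the generated output together, it can help to check how many tokens a page will consume before generating. A minimal sketch (the file name is hypothetical; the counting itself is plain Hugging Face tokenizer usage):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("jinaai/ReaderLM-v2")

html = open("page.html", encoding="utf-8").read()  # hypothetical local file
input_tokens = len(tokenizer.encode(html))
print(f"{input_tokens} input tokens; the rest of the 512K budget is left for the output")
```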

## What's New in `ReaderLM-v2`

`ReaderLM-v2` features several significant improvements over its predecessor:

- **Better Markdown Generation**: Generates cleaner, more readable Markdown output.
- **JSON Output**: Can produce JSON-formatted text, enabling structured extraction for further downstream processing.
- **Longer Context Handling**: Can handle up to 512K tokens, which is beneficial for large HTML documents or combined transformations.
- **Multilingual Support**: Covers 29 languages for broader application across international web data.

---

# Usage

Below, you will find instructions and examples for using `ReaderLM-v2` locally with the Hugging Face Transformers library.
For a more hands-on experience in a hosted environment, see the [Google Colab Notebook](https://colab.research.google.com/drive/1FfPjZwkMSocOLsEYH45B3B4NxDryKLGI?usp=sharing).

## On Google Colab

The easiest way to experience `ReaderLM-v2` is to run our [Colab notebook](https://colab.research.google.com/drive/1FfPjZwkMSocOLsEYH45B3B4NxDryKLGI?usp=sharing).
The notebook runs on the free T4 GPU tier and uses vLLM and Triton for faster inference; a minimal local vLLM sketch follows the list below. You can feed any website's HTML directly into the model.

- For simple HTML-to-Markdown tasks, you only need to provide the raw HTML (no special instructions).
- For JSON output and instruction-based extraction, use the prompt formatting guidelines in the notebook.
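
If you want similar vLLM-backed speed outside Colab, here is a minimal local sketch. It is an assumption, not the notebook's exact code: it presumes `pip install vllm`, builds the prompt with the tokenizer's chat template, and mirrors the generation settings used in the Transformers examples further down.

```python
# Hedged sketch: local inference with vLLM (assumes `pip install vllm`).
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

tokenizer = AutoTokenizer.from_pretrained("jinaai/ReaderLM-v2")
llm = LLM(model="jinaai/ReaderLM-v2")

html = "<html><body><h1>Hello, world!</h1></body></html>"
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": f"Extract the main content from the given HTML and convert it to Markdown format.\n```html\n{html}\n```"}],
    tokenize=False,
    add_generation_prompt=True,
)

outputs = llm.generate([prompt], SamplingParams(temperature=0, repetition_penalty=1.08, max_tokens=1024))
print(outputs[0].outputs[0].text)
```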

## Local Usage

To use `ReaderLM-v2` locally:

1. Install the necessary dependencies:

```bash
pip install transformers
```

2. Load and run the model:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import re

device = "cuda"  # or "cpu"
tokenizer = AutoTokenizer.from_pretrained("jinaai/ReaderLM-v2")
model = AutoModelForCausalLM.from_pretrained("jinaai/ReaderLM-v2").to(device)
```

3. (Optional) Pre-clean your HTML to strip scripts, styles, comments, and other noise. This shortens the input, which also keeps GPU memory usage down:

```python
# Regex patterns for common HTML noise
SCRIPT_PATTERN = r'<[ ]*script.*?\/[ ]*script[ ]*>'
STYLE_PATTERN = r'<[ ]*style.*?\/[ ]*style[ ]*>'
META_PATTERN = r'<[ ]*meta.*?>'
COMMENT_PATTERN = r'<[ ]*!--.*?--[ ]*>'
LINK_PATTERN = r'<[ ]*link.*?>'
BASE64_IMG_PATTERN = r'<img[^>]+src="data:image/[^;]+;base64,[^"]+"[^>]*>'
SVG_PATTERN = r'(<svg[^>]*>)(.*?)(<\/svg>)'

def replace_svg(html: str, new_content: str = "this is a placeholder") -> str:
    """Replace the body of every <svg> element with a short placeholder."""
    return re.sub(
        SVG_PATTERN,
        lambda match: f"{match.group(1)}{new_content}{match.group(3)}",
        html,
        flags=re.DOTALL,
    )

def replace_base64_images(html: str, new_image_src: str = "#") -> str:
    """Replace inline base64-encoded images with a lightweight placeholder tag."""
    return re.sub(BASE64_IMG_PATTERN, f'<img src="{new_image_src}"/>', html)

def clean_html(html: str, clean_svg: bool = False, clean_base64: bool = False):
    """Strip scripts, styles, meta tags, comments, and link tags from HTML."""
    html = re.sub(SCRIPT_PATTERN, '', html, flags=re.IGNORECASE | re.MULTILINE | re.DOTALL)
    html = re.sub(STYLE_PATTERN, '', html, flags=re.IGNORECASE | re.MULTILINE | re.DOTALL)
    html = re.sub(META_PATTERN, '', html, flags=re.IGNORECASE | re.MULTILINE | re.DOTALL)
    html = re.sub(COMMENT_PATTERN, '', html, flags=re.IGNORECASE | re.MULTILINE | re.DOTALL)
    html = re.sub(LINK_PATTERN, '', html, flags=re.IGNORECASE | re.MULTILINE | re.DOTALL)

    if clean_svg:
        html = replace_svg(html)
    if clean_base64:
        html = replace_base64_images(html)
    return html
```
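
For example, to clean the HTML of a live page before prompting the model (a hypothetical sketch: it assumes the `requests` package is installed and uses an arbitrary example URL):

```python
import requests  # hypothetical extra dependency, not required by ReaderLM-v2

raw_html = requests.get("https://example.com/").text  # any page works
html = clean_html(raw_html, clean_svg=True, clean_base64=True)
print(f"{len(raw_html)} characters before cleaning, {len(html)} after")
```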

4. Create a prompt for the model:

```python
def create_prompt(text: str, tokenizer=None, instruction: str = None, schema: str = None) -> str:
    """
    Create a prompt for the model with optional instruction and JSON schema.
    """
    if tokenizer is None:
        # Load the tokenizer if one was not passed in (cached after step 2)
        tokenizer = AutoTokenizer.from_pretrained("jinaai/ReaderLM-v2")

    if not instruction:
        instruction = "Extract the main content from the given HTML and convert it to Markdown format."

    if schema:
        # Example instruction for JSON output
        instruction = "Extract the specified information from a list of news threads and present it in a structured JSON format."
        prompt = f"{instruction}\n```html\n{text}\n```\nThe JSON schema is as follows:```json{schema}```"
    else:
        prompt = f"{instruction}\n```html\n{text}\n```"

    messages = [
        {
            "role": "user",
            "content": prompt,
        }
    ]

    return tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
```

### HTML to Markdown Example

```python
# Example HTML
html = "<html><body><h1>Hello, world!</h1></body></html>"

# Remove scripts, styles, comments, etc.
html = clean_html(html)

input_prompt = create_prompt(html)

inputs = tokenizer.encode(input_prompt, return_tensors="pt").to(device)
outputs = model.generate(inputs, max_new_tokens=1024, temperature=0, do_sample=False, repetition_penalty=1.08)

print(tokenizer.decode(outputs[0]))
```
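
Note that `tokenizer.decode(outputs[0])` prints the prompt and chat-template tokens along with the answer. If you only want the generated Markdown, one option (plain Transformers token slicing, not anything specific to ReaderLM-v2) is to decode just the newly generated tokens:

```python
# Decode only the tokens generated after the prompt
markdown = tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True)
print(markdown)
```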

### Instruction-Focused Extraction

For example, to pull only the menu items out of the page, pass a custom instruction:

```python
instruction = "Extract the menu items from the given HTML and convert it to Markdown format."
input_prompt = create_prompt(html, instruction=instruction)

inputs = tokenizer.encode(input_prompt, return_tensors="pt").to(device)
outputs = model.generate(inputs, max_new_tokens=1024, temperature=0, do_sample=False, repetition_penalty=1.08)

print(tokenizer.decode(outputs[0]))
```

### HTML to JSON Example

To extract structured information from HTML content and convert it to JSON, create a prompt with a JSON schema describing the fields you want:

```python
schema = """
{
  ...
}
"""

html = clean_html(html)
input_prompt = create_prompt(html, schema=schema)

inputs = tokenizer.encode(input_prompt, return_tensors="pt").to(device)
outputs = model.generate(inputs, max_new_tokens=1024, temperature=0, do_sample=False, repetition_penalty=1.08)

print(tokenizer.decode(outputs[0]))
```
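
The decoded output wraps the generated JSON in the chat template (and often in a fenced block), so a small post-processing step helps before using it downstream. A hypothetical sketch:

```python
import json

# Decode only the generated tokens, then pull out the outermost JSON object
response = tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True)
match = re.search(r"\{.*\}", response, flags=re.DOTALL)
data = json.loads(match.group(0)) if match else None
print(data)
```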

## AWS SageMaker, Azure Marketplace & Google Cloud Platform

Coming soon.