---
pipeline_tag: text-generation
language:
  - multilingual
inference: false
license: cc-by-nc-4.0
library_name: transformers
---

Jina AI: Your Search Foundation, Supercharged!

Trained by Jina AI.

[Blog](https://jina.ai/news/readerlm-v2-frontier-small-language-model-for-markdown-and-json) | [Colab](https://colab.research.google.com/drive/1FfPjZwkMSocOLsEYH45B3B4NxDryKLGI?usp=sharing)

# ReaderLM-v2

`ReaderLM-v2` is the second generation of [ReaderLM-v1](https://huggingface.co/jinaai/reader-lm-1.5b), a **1.5B**-parameter language model that converts raw HTML into well-formatted Markdown or structured JSON, with improved accuracy and better support for longer contexts. Supporting 29 languages in total, `ReaderLM-v2` is specialized for tasks involving HTML parsing, transformation, and text extraction.

## Model Overview

- **Model Type**: Autoregressive, decoder-only transformer
- **Parameter Count**: ~1.5B
- **Context Window**: Up to 512K tokens (combined input and output)
- **Supported Languages**: English, Chinese, Japanese, Korean, French, Spanish, Portuguese, German, Italian, Russian, Vietnamese, Thai, Arabic, and more (29 total)

## What's New in `ReaderLM-v2`

`ReaderLM-v2` features several improvements over [ReaderLM-v1](https://huggingface.co/jinaai/reader-lm-1.5b):

- **Better Markdown Generation**: Generates cleaner, more readable Markdown output.
- **JSON Output**: Produces structured JSON-formatted text, enabling structured extraction for further downstream processing.
- **Longer Context Handling**: Handles up to 512K tokens, which is beneficial for large HTML documents.
- **Multilingual Support**: Covers 29 languages for broader applications across international web data.

---

# Usage

Below you will find instructions and examples for using `ReaderLM-v2` locally with the Hugging Face Transformers library. For a more hands-on experience in a hosted environment, see the [Google Colab Notebook](https://colab.research.google.com/drive/1FfPjZwkMSocOLsEYH45B3B4NxDryKLGI?usp=sharing).

## On Google Colab

The easiest way to experience `ReaderLM-v2` is to run our [Colab notebook](https://colab.research.google.com/drive/1FfPjZwkMSocOLsEYH45B3B4NxDryKLGI?usp=sharing), which demonstrates HTML-to-Markdown conversion, JSON extraction, and instruction-following, using the HackerNews front page as an example. The notebook is optimized for Colab's free T4 GPU tier and requires `vllm` and `triton` for accelerated inference. Feel free to test it with any website.

For HTML-to-Markdown tasks, simply input the raw HTML without any prefix instructions. However, JSON output and instruction-based extraction require the specific prompt formatting shown in the examples below.
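If you want a similarly accelerated setup outside Colab, here is a minimal `vllm` sketch. The sampling values mirror the `transformers` examples later in this card and are assumptions, not the notebook's exact configuration:

```python
# Minimal vllm sketch (assumed configuration; the Colab notebook is the
# reference setup for accelerated inference).
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

tokenizer = AutoTokenizer.from_pretrained("jinaai/ReaderLM-v2")
llm = LLM(model="jinaai/ReaderLM-v2")

# Greedy decoding with a mild repetition penalty, mirroring the examples below.
sampling_params = SamplingParams(temperature=0, max_tokens=1024, repetition_penalty=1.08)

html = "<html><body><h1>Hello, world!</h1></body></html>"
messages = [{
    "role": "user",
    "content": f"Extract the main content from the given HTML and convert it to Markdown format.\n```html\n{html}\n```",
}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

outputs = llm.generate([prompt], sampling_params)
print(outputs[0].outputs[0].text)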
## Local Usage

To use `ReaderLM-v2` locally:

1. Install the necessary dependencies:

   ```bash
   pip install transformers
   ```

2. Load and run the model:

   ```python
   from transformers import AutoModelForCausalLM, AutoTokenizer
   import re

   device = "cuda"  # or "cpu"
   tokenizer = AutoTokenizer.from_pretrained("jinaai/ReaderLM-v2")
   model = AutoModelForCausalLM.from_pretrained("jinaai/ReaderLM-v2").to(device)
   ```

3. (Optional) Pre-clean your HTML to remove scripts, styles, comments, and other noise, which shortens the input (i.e. makes it friendlier for GPU VRAM):

   ```python
   # Patterns for stripping noisy, non-content HTML elements
   SCRIPT_PATTERN = r'<[ ]*script.*?\/[ ]*script[ ]*>'
   STYLE_PATTERN = r'<[ ]*style.*?\/[ ]*style[ ]*>'
   META_PATTERN = r'<[ ]*meta.*?>'
   COMMENT_PATTERN = r'<[ ]*!--.*?--[ ]*>'
   LINK_PATTERN = r'<[ ]*link.*?>'
   BASE64_IMG_PATTERN = r'<img[^>]+src="data:image/[^;]+;base64,[^"]+"[^>]*>'
   SVG_PATTERN = r'(<svg[^>]*>)(.*?)(<\/svg>)'


   def replace_svg(html: str, new_content: str = "this is a placeholder") -> str:
       # Replace the body of each inline SVG with a short placeholder.
       return re.sub(
           SVG_PATTERN,
           lambda match: f"{match.group(1)}{new_content}{match.group(3)}",
           html,
           flags=re.DOTALL,
       )


   def replace_base64_images(html: str, new_image_src: str = "#") -> str:
       # Swap heavy base64-encoded images for a lightweight <img> tag.
       return re.sub(BASE64_IMG_PATTERN, f'<img src="{new_image_src}"/>', html)


   def clean_html(html: str, clean_svg: bool = False, clean_base64: bool = False):
       flags = re.IGNORECASE | re.MULTILINE | re.DOTALL
       html = re.sub(SCRIPT_PATTERN, '', html, flags=flags)
       html = re.sub(STYLE_PATTERN, '', html, flags=flags)
       html = re.sub(META_PATTERN, '', html, flags=flags)
       html = re.sub(COMMENT_PATTERN, '', html, flags=flags)
       html = re.sub(LINK_PATTERN, '', html, flags=flags)
       if clean_svg:
           html = replace_svg(html)
       if clean_base64:
           html = replace_base64_images(html)
       return html
   ```

4. Create a prompt for the model:

   ```python
   def create_prompt(text: str, tokenizer=None, instruction: str = None, schema: str = None) -> str:
       """
       Create a prompt for the model with optional instruction and JSON schema.
       """
       if not instruction:
           instruction = "Extract the main content from the given HTML and convert it to Markdown format."
       if schema:
           # This is an example instruction for JSON output
           instruction = "Extract the specified information from a list of news threads and present it in a structured JSON format."
           prompt = f"{instruction}\n```html\n{text}\n```\nThe JSON schema is as follows:```json{schema}```"
       else:
           prompt = f"{instruction}\n```html\n{text}\n```"

       messages = [
           {
               "role": "user",
               "content": prompt,
           }
       ]

       return tokenizer.apply_chat_template(
           messages, tokenize=False, add_generation_prompt=True
       )
   ```

### HTML to Markdown Example

```python
# Example HTML
html = "<html><body><h1>Hello, world!</h1></body></html>"

html = clean_html(html)
input_prompt = create_prompt(html, tokenizer=tokenizer)
inputs = tokenizer.encode(input_prompt, return_tensors="pt").to(device)
outputs = model.generate(inputs, max_new_tokens=1024, temperature=0, do_sample=False, repetition_penalty=1.08)
print(tokenizer.decode(outputs[0]))
```
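Note that `tokenizer.decode(outputs[0])` returns the full chat-templated prompt plus the completion. To keep only the generated Markdown, you can slice off the prompt tokens; this is standard `transformers` usage rather than anything specific to `ReaderLM-v2`:

```python
# Keep only the newly generated tokens and drop special tokens;
# `inputs` and `outputs` are the tensors from the example above.
generated_tokens = outputs[0][inputs.shape[1]:]
markdown = tokenizer.decode(generated_tokens, skip_special_tokens=True)
print(markdown)
```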
### Instruction-Focused Extraction

```python
instruction = "Extract the menu items from the given HTML and convert it to Markdown format."
input_prompt = create_prompt(html, tokenizer=tokenizer, instruction=instruction)
inputs = tokenizer.encode(input_prompt, return_tensors="pt").to(device)
outputs = model.generate(inputs, max_new_tokens=1024, temperature=0, do_sample=False, repetition_penalty=1.08)
print(tokenizer.decode(outputs[0]))
```

### HTML to JSON Example

```python
schema = """
{
  "type": "object",
  "properties": {
    "title": {"type": "string"},
    "author": {"type": "string"},
    "date": {"type": "string"},
    "content": {"type": "string"}
  },
  "required": ["title", "author", "date", "content"]
}
"""

html = clean_html(html)
input_prompt = create_prompt(html, tokenizer=tokenizer, schema=schema)
inputs = tokenizer.encode(input_prompt, return_tensors="pt").to(device)
outputs = model.generate(inputs, max_new_tokens=1024, temperature=0, do_sample=False, repetition_penalty=1.08)
print(tokenizer.decode(outputs[0]))
```
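The JSON arrives as text inside the model's chat response, typically wrapped in a fenced code block. Here is a small sketch of extracting and parsing it; the fence-stripping regex is an assumption about the typical output shape, not a documented guarantee:

```python
import json
import re

# Decode only the generated tokens, then strip a ```json ... ``` fence
# if one is present (an assumption about typical output, not a guarantee).
raw = tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True)
match = re.search(r"```(?:json)?\s*(.*?)\s*```", raw, flags=re.DOTALL)
payload = match.group(1) if match else raw

data = json.loads(payload)
print(data["title"], data["author"])
```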
" html = clean_html(html) input_prompt = create_prompt(html) inputs = tokenizer.encode(input_prompt, return_tensors="pt").to(device) outputs = model.generate(inputs, max_new_tokens=1024, temperature=0, do_sample=False, repetition_penalty=1.08) print(tokenizer.decode(outputs[0])) ``` ### Instruction-Focused Extraction ```python instruction = "Extract the menu items from the given HTML and convert it to Markdown format." input_prompt = create_prompt(html, instruction=instruction) inputs = tokenizer.encode(input_prompt, return_tensors="pt").to(device) outputs = model.generate(inputs, max_new_tokens=1024, temperature=0, do_sample=False, repetition_penalty=1.08) print(tokenizer.decode(outputs[0])) ``` ### HTML to JSON Example ```python schema = """ { "type": "object", "properties": { "title": { "type": "string" }, "author": { "type": "string" }, "date": { "type": "string" }, "content": { "type": "string" } }, "required": ["title", "author", "date", "content"] } """ html = clean_html(html) input_prompt = create_prompt(html, schema=schema) inputs = tokenizer.encode(input_prompt, return_tensors="pt").to(device) outputs = model.generate(inputs, max_new_tokens=1024, temperature=0, do_sample=False, repetition_penalty=1.08) print(tokenizer.decode(outputs[0])) ``` ## AWS Sagemaker & Azure Marketplace & Google Cloud Platform Coming soon.