---
pipeline_tag: text-generation
language:
- multilingual
inference: false
license: cc-by-nc-4.0
library_name: transformers
---
Trained by Jina AI.
[Blog](https://jina.ai/news/readerlm-v2-frontier-small-language-model-for-markdown-and-json) | [Colab](https://colab.research.google.com/drive/1FfPjZwkMSocOLsEYH45B3B4NxDryKLGI?usp=sharing)

# ReaderLM-v2

`ReaderLM-v2` is the second generation of [ReaderLM-v1](https://huggingface.co/jinaai/reader-lm-1.5b), a **1.5B**-parameter language model that converts raw HTML into formatted Markdown or structured JSON, with improved accuracy and better support for longer contexts. Supporting multiple languages (29 in total), `ReaderLM-v2` is specialized for tasks involving HTML parsing, transformation, and text extraction.

## Model Overview

- **Model Type**: Autoregressive, decoder-only transformer
- **Parameter Count**: ~1.5B
- **Context Window**: Up to 512K tokens (combined input and output)
- **Supported Languages**: English, Chinese, Japanese, Korean, French, Spanish, Portuguese, German, Italian, Russian, Vietnamese, Thai, Arabic, and more (29 in total)

## What's New in `ReaderLM-v2`

`ReaderLM-v2` features several improvements over [ReaderLM-v1](https://huggingface.co/jinaai/reader-lm-1.5b):

- **Better Markdown Generation**: Generates cleaner, more readable Markdown output.
- **JSON Output**: Produces structured JSON-formatted text, enabling structured extraction for further downstream processing.
- **Longer Context Handling**: Handles up to 512K tokens, which is beneficial for large HTML documents.
- **Multilingual Support**: Covers 29 languages for broader applications across international web data.

---

# Usage

Below you will find instructions and examples for using `ReaderLM-v2` locally with the Hugging Face Transformers library. For a more hands-on experience in a hosted environment, see the [Google Colab Notebook](https://colab.research.google.com/drive/1FfPjZwkMSocOLsEYH45B3B4NxDryKLGI?usp=sharing).

## On Google Colab

The easiest way to experience `ReaderLM-v2` is to run our [Colab notebook](https://colab.research.google.com/drive/1FfPjZwkMSocOLsEYH45B3B4NxDryKLGI?usp=sharing), which demonstrates HTML-to-Markdown conversion, JSON extraction, and instruction-following, using the HackerNews front page as an example. The notebook is optimized for Colab's free T4 GPU tier and requires `vllm` and `triton` for accelerated inference. Feel free to test it with any website.

For HTML-to-Markdown tasks, simply input the raw HTML without any prefix instructions. JSON output and instruction-based extraction, however, require the specific prompt formatting shown in the examples.

## Local Usage

To use `ReaderLM-v2` locally:

1. Install the necessary dependencies:

   ```bash
   pip install transformers
   ```

2. Load and run the model:

   ```python
   from transformers import AutoModelForCausalLM, AutoTokenizer
   import re

   device = "cuda"  # or "cpu"
   tokenizer = AutoTokenizer.from_pretrained("jinaai/ReaderLM-v2")
   model = AutoModelForCausalLM.from_pretrained("jinaai/ReaderLM-v2").to(device)
   ```

3. (Optional) Pre-clean your HTML to remove scripts, styles, and comments. This reduces the noise and length of the input, making it friendlier to GPU VRAM:

   ```python
   # Patterns matching noisy, non-content HTML elements
   SCRIPT_PATTERN = r'<[ ]*script.*?\/[ ]*script[ ]*>'
   STYLE_PATTERN = r'<[ ]*style.*?\/[ ]*style[ ]*>'
   META_PATTERN = r'<[ ]*meta.*?>'
   COMMENT_PATTERN = r'<[ ]*!--.*?--[ ]*>'
   LINK_PATTERN = r'<[ ]*link.*?>'
   BASE64_IMG_PATTERN = r'<img[^>]+src="data:image/[^;]+;base64,[^"]+"[^>]*>'
   SVG_PATTERN = r'(<svg[^>]*>)(.*?)(<\/svg>)'
   ```
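   A minimal sketch of applying these patterns follows; the helper name `clean_html` and the regex flag choices are illustrative assumptions, not a fixed API:

   ```python
   def clean_html(html: str) -> str:
       """Strip scripts, styles, metadata, comments, link tags, and embedded images/SVGs."""
       flags = re.IGNORECASE | re.MULTILINE | re.DOTALL
       for pattern in (
           SCRIPT_PATTERN,
           STYLE_PATTERN,
           META_PATTERN,
           COMMENT_PATTERN,
           LINK_PATTERN,
           BASE64_IMG_PATTERN,
           SVG_PATTERN,
       ):
           # Each match is removed outright; noisy elements carry no content for the model.
           html = re.sub(pattern, '', html, flags=flags)
       return html
   ```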
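4. Convert HTML to Markdown. As noted above, the raw (or cleaned) HTML goes in as the user message with no prefix instruction. The chat-template call and generation parameters below are a sketch based on standard Transformers usage, and `page.html` is a placeholder input file; see the Colab notebook for the reference prompts:

   ```python
   html = clean_html(open("page.html", encoding="utf-8").read())

   # For Markdown output, the HTML itself is the entire user message.
   messages = [{"role": "user", "content": html}]
   input_text = tokenizer.apply_chat_template(
       messages, tokenize=False, add_generation_prompt=True
   )

   inputs = tokenizer(input_text, return_tensors="pt").to(device)
   outputs = model.generate(
       **inputs,
       max_new_tokens=4096,      # illustrative cap; raise for long pages
       do_sample=False,          # deterministic decoding
       repetition_penalty=1.08,  # assumption: mild penalty to discourage loops
   )

   # Decode only the newly generated tokens, skipping the echoed prompt.
   markdown = tokenizer.decode(
       outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
   )
   print(markdown)
   ```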
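5. (Optional) Extract structured JSON. JSON extraction requires an explicit instruction and schema in the prompt; the instruction wording and schema-embedding format below are assumptions modeled on the Colab examples, so treat them as a starting point rather than the canonical format:

   ```python
   import json

   # Hypothetical schema describing the fields to extract.
   schema = {
       "type": "object",
       "properties": {
           "title": {"type": "string"},
           "author": {"type": "string"},
           "date": {"type": "string"},
       },
   }

   instruction = "Extract the specified information from the HTML and present it in JSON format."
   prompt = (
       f"{instruction}\n"
       f"```html\n{html}\n```\n"
       f"The JSON schema is as follows:\n```json\n{json.dumps(schema, indent=2)}\n```"
   )

   messages = [{"role": "user", "content": prompt}]
   input_text = tokenizer.apply_chat_template(
       messages, tokenize=False, add_generation_prompt=True
   )
   inputs = tokenizer(input_text, return_tensors="pt").to(device)
   outputs = model.generate(**inputs, max_new_tokens=2048, do_sample=False)
   print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
   ```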