---
license: mit
---

# EndpointHandler

`EndpointHandler` is a Python class that processes image and text data to generate embeddings and similarity scores using the ColQwen2 model, a visual retriever based on Qwen2-VL-2B-Instruct that uses the ColBERT strategy. The handler is designed for retrieving documents and visual information based on their visual and textual features.

## Overview

- **Efficient Document Retrieval**: Uses the ColQwen2 model to produce embeddings for images and text for accurate document retrieval.
- **Multi-vector Representation**: Generates ColBERT-style multi-vector embeddings for improved similarity search.
- **Flexible Image Resolution**: Supports dynamic image resolution without altering the aspect ratio, capped at 768 patches for memory efficiency.
- **Device Compatibility**: Automatically utilizes available CUDA devices or defaults to CPU.

## Model Details

The **ColQwen2** model extends Qwen2-VL-2B with a focus on vision-language tasks, making it suitable for content indexing and retrieval. Key features include:
- **Training**: Pre-trained with a batch size of 256 over 5 epochs, with a modified pad token.
- **Input Flexibility**: Handles various image resolutions without resizing, ensuring accurate multi-vector representation.
- **Similarity Scoring**: Utilizes a ColBERT-style scoring approach for efficient retrieval across image and text modalities.

This base version is untrained, providing deterministic initialization of the projection layer for further customization.
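The ColBERT-style score mentioned above is a late-interaction "MaxSim": each query token embedding is matched against its best document token embedding, and the per-token maxima are summed. A minimal pure-Python sketch of the idea (the deployed handler computes this with the model's own scoring utilities, not this function):

```python
def maxsim_score(query_vecs, doc_vecs):
    """ColBERT-style late interaction: for each query token vector, take
    the maximum dot product over all document token vectors, then sum."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)

# Toy 2-D embeddings: two query tokens matched against two document tokens
score = maxsim_score([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 2.0]])
print(score)  # 3.0
```

Because each query token independently picks its best match, MaxSim rewards documents that cover all parts of the query, which is what makes the multi-vector representation effective for retrieval.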

## How to Use

The following example demonstrates how to use `EndpointHandler` for processing PDF documents and text. PDF pages are converted to base64 images, which are then passed as input alongside text data to the handler.

### Example Script

```python
import base64
from io import BytesIO

import requests
from pdf2image import convert_from_path  # requires the poppler utilities


def pil_image_to_base64(image):
    """Converts a PIL Image to a base64-encoded PNG string."""
    buffer = BytesIO()
    image.save(buffer, format="PNG")
    return base64.b64encode(buffer.getvalue()).decode()


def convert_pdf_to_base64_images(pdf_path):
    """Converts each PDF page to a base64-encoded image."""
    pages = convert_from_path(pdf_path)
    return [pil_image_to_base64(page) for page in pages]


def query_api(payload, api_url, headers):
    """Sends a POST request to the API and returns the JSON response."""
    response = requests.post(api_url, headers=headers, json=payload)
    response.raise_for_status()
    return response.json()


if __name__ == "__main__":
    # Convert PDF pages to base64-encoded images
    encoded_images = convert_pdf_to_base64_images("document.pdf")

    # Prepare payload
    payload = {
        "inputs": [],
        "image": encoded_images,
        "text": ["example query text"],
    }

    # API configuration
    API_URL = "https://your-api-url"
    headers = {
        "Accept": "application/json",
        "Authorization": "Bearer your_access_token",
        "Content-Type": "application/json",
    }

    # Query the API and print the output
    output = query_api(payload=payload, api_url=API_URL, headers=headers)
    print(output)
```
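On the server side, the handler decodes each base64 string back into image bytes. The round-trip used by the client code above can be checked with the standard library alone:

```python
import base64

raw = b"\x89PNG\r\n\x1a\n fake image bytes"      # stand-in for PNG data
encoded = base64.b64encode(raw).decode()          # what the client sends
decoded = base64.b64decode(encoded)               # what the handler recovers
print(decoded == raw)  # True
```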

## Inputs and Outputs

### Input Format
The `EndpointHandler` expects a dictionary containing:
- **image**: A list of base64-encoded strings for images (e.g., PDF pages converted to images).
- **text**: A list of text strings representing queries or document contents.
- **batch_size** (optional): The batch size for processing images and text. Defaults to `4`.



Example payload:

```json
{
    "image": ["base64_image_string_1", "base64_image_string_2"],
    "text": ["sample text 1", "sample text 2"],
    "batch_size": 4
}
```



### Output Format

The handler returns a dictionary with the following keys:

- **image**: List of embeddings for each image.
- **text**: List of embeddings for each text entry.
- **scores**: List of similarity scores between the image and text embeddings.

Example output:

```json
{
    "image": [[0.12, 0.34, ...], [0.56, 0.78, ...]],
    "text": [[0.11, 0.22, ...], [0.33, 0.44, ...]],
    "scores": [[0.87, 0.45], [0.23, 0.67]]
}
```
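Assuming each row of `scores` holds one text query's similarities against all images (the row/column orientation should be verified against the deployed handler), the best-matching page per query can be picked with a simple argmax:

```python
scores = [[0.87, 0.45], [0.23, 0.67]]  # example values from above

# Index of the highest-scoring image for each text query
best_image_per_query = [row.index(max(row)) for row in scores]
print(best_image_per_query)  # [0, 1]
```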



### Error Handling

If an error occurs during processing (e.g., while decoding images or running model inference), the handler logs it and returns an error message in the output dictionary.
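On the client side, it is worth checking the response for that error message before using the embeddings. A small defensive wrapper, assuming the message is reported under an `"error"` key (the exact key name depends on the handler implementation):

```python
def unwrap_response(output):
    """Raises if the endpoint reported an error, otherwise returns the output."""
    if isinstance(output, dict) and "error" in output:
        raise RuntimeError(f"Endpoint error: {output['error']}")
    return output

# Successful responses pass through unchanged
result = unwrap_response({"scores": [[0.87, 0.45]]})
print(result["scores"])  # [[0.87, 0.45]]
```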