---
license: mit
---

# EndpointHandler

`EndpointHandler` is a Python class that processes image and text data to generate embeddings and similarity scores using the ColQwen2 model, a visual retriever based on Qwen2-VL-2B-Instruct that follows the ColBERT strategy. The handler is optimized for retrieving documents and visual information based on their visual and textual features.

## Overview

- **Efficient Document Retrieval**: Uses the ColQwen2 model to produce embeddings for images and text, enabling accurate document retrieval.
- **Multi-vector Representation**: Generates ColBERT-style multi-vector embeddings for improved similarity search.
- **Flexible Image Resolution**: Supports dynamic image resolution without altering the aspect ratio, capped at 768 patches for memory efficiency.
- **Device Compatibility**: Automatically uses an available CUDA device and falls back to the CPU otherwise (see the sketch after this list).

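The device selection described in the last bullet follows the standard PyTorch pattern. The snippet below is a minimal sketch of that check, not the handler's exact code:

```python
import torch

# Use the first available CUDA device, otherwise fall back to the CPU.
# This mirrors the behaviour described above; it is not the handler's exact code.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Running on {device}")
```
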
## Model Details

The **ColQwen2** model extends Qwen2-VL-2B with a focus on vision-language tasks, making it suitable for content indexing and retrieval. Key features include:

- **Training**: Pre-trained with a batch size of 256 over 5 epochs, with a modified pad token.
- **Input Flexibility**: Handles various image resolutions without resizing, ensuring accurate multi-vector representation.
- **Similarity Scoring**: Uses a ColBERT-style late-interaction scoring approach for efficient retrieval across image and text modalities (a sketch follows below).

This base version is untrained, providing deterministic initialization of the projection layer for further customization.

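In ColBERT-style late interaction, every query token embedding is compared against every image patch embedding, the best match per query token is kept, and the maxima are summed. The snippet below is a minimal sketch of that MaxSim computation under assumed tensor shapes; it is illustrative rather than the handler's actual implementation.

```python
import torch

def colbert_score(query_emb: torch.Tensor, doc_emb: torch.Tensor) -> torch.Tensor:
    """MaxSim score between one multi-vector query and one multi-vector document.

    query_emb: (num_query_tokens, dim) query token embeddings.
    doc_emb:   (num_doc_patches, dim) document/image patch embeddings.
    """
    # Pairwise similarities between every query token and every document patch.
    sim = query_emb @ doc_emb.T  # (num_query_tokens, num_doc_patches)
    # Keep each query token's best-matching patch, then sum over query tokens.
    return sim.max(dim=1).values.sum()

# Toy example with random embeddings; the shapes and dimension are illustrative.
query = torch.randn(12, 128)
page = torch.randn(700, 128)
print(colbert_score(query, page))
```
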
## How to Use

The following example demonstrates how to use `EndpointHandler` to process a PDF document together with a text query. Each PDF page is converted to a base64-encoded image, and the images are passed to the handler alongside the text data.

### Example Script

```python
import base64
from io import BytesIO

import requests
from pdf2image import convert_from_path


def pil_image_to_base64(image):
    """Converts a PIL Image to a base64-encoded PNG string."""
    buffer = BytesIO()
    image.save(buffer, format="PNG")
    return base64.b64encode(buffer.getvalue()).decode()


def convert_pdf_to_base64_images(pdf_path):
    """Converts each page of a PDF to a base64-encoded image."""
    pages = convert_from_path(pdf_path)
    return [pil_image_to_base64(page) for page in pages]


def query_api(payload, api_url, headers):
    """Sends a POST request to the API and returns the JSON response."""
    response = requests.post(api_url, headers=headers, json=payload)
    return response.json()


if __name__ == "__main__":
    # Convert PDF pages to base64-encoded images
    encoded_images = convert_pdf_to_base64_images("document.pdf")

    # Prepare the payload
    payload = {
        "inputs": [],
        "image": encoded_images,
        "text": ["example query text"],
    }

    # API configuration
    API_URL = "https://your-api-url"
    headers = {
        "Accept": "application/json",
        "Authorization": "Bearer your_access_token",
        "Content-Type": "application/json",
    }

    # Query the API and print the output
    output = query_api(payload=payload, api_url=API_URL, headers=headers)
    print(output)
```

## Inputs and Outputs

### Input Format

The `EndpointHandler` expects a dictionary containing:

- **image**: A list of base64-encoded strings for images (e.g., PDF pages converted to images).
- **text**: A list of text strings representing queries or document contents.
- **batch_size** (optional): The batch size for processing images and text. Defaults to `4`.

Example payload:

```json
{
  "image": ["base64_image_string_1", "base64_image_string_2"],
  "text": ["sample text 1", "sample text 2"],
  "batch_size": 4
}
```

### Output Format

The handler returns a dictionary with the following keys:

- **image**: List of embeddings for each image.
- **text**: List of embeddings for each text entry.
- **scores**: List of similarity scores between the image and text embeddings.

Example output:

```json
{
  "image": [[0.12, 0.34, ...], [0.56, 0.78, ...]],
  "text": [[0.11, 0.22, ...], [0.33, 0.44, ...]],
  "scores": [[0.87, 0.45], [0.23, 0.67]]
}
```

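A client typically uses the `scores` matrix to rank pages for each query. The snippet below is a minimal sketch of that ranking; the response layout is copied from the example above, and the orientation of the matrix (one row per text query, one column per image) is an assumption.

```python
def rank_pages(scores_row):
    """Returns image indices sorted from most to least similar."""
    return sorted(range(len(scores_row)), key=lambda i: scores_row[i], reverse=True)

# Example response layout copied from above; the row/column orientation
# (queries as rows, images as columns) is an assumption.
response = {"scores": [[0.87, 0.45], [0.23, 0.67]]}

for query_idx, row in enumerate(response["scores"]):
    print(f"query {query_idx}: images ranked {rank_pages(row)}")
```
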
### Error Handling

If any issue occurs during processing (for example, while decoding an image or running model inference), the handler logs the error and returns an error message in the output dictionary.

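Because errors are reported inside the output dictionary rather than raised to the client, it is worth checking for them before reading the embeddings. The snippet below is a minimal sketch of such a check; the `"error"` key name is an assumption, since the exact field returned by the handler is not documented here.

```python
def check_response(output: dict) -> dict:
    """Raises if the handler reported an error; otherwise returns the output."""
    # The "error" key name is an assumption; adjust it to match the handler.
    if "error" in output:
        raise RuntimeError(f"Endpoint returned an error: {output['error']}")
    return output

# A well-formed response passes through unchanged.
response = {"image": [[0.1, 0.2]], "text": [[0.3, 0.4]], "scores": [[0.9]]}
print(check_response(response)["scores"])
```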