This application demonstrates a Multimodal Retrieval-Augmented Generation (RAG) system using the Qwen2-VL model and a custom RAG implementation. It allows users to upload images and ask questions about them, combining visual and textual information to generate responses.

It is deployed on HuggingFace Spaces: https://huggingface.co/spaces/clayton07/qwen2-colpali-ocr
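The app's internals aren't shown in this README, but a typical byaldi + Qwen2-VL pipeline looks roughly like the sketch below: index images with the colpali retriever, fetch the most relevant page for a query, then hand that image and the question to Qwen2-VL. The document folder, index name, file layout, and question are placeholders, not the app's actual configuration.

```
from PIL import Image
from byaldi import RAGMultiModalModel
from qwen_vl_utils import process_vision_info
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

# 1. Index document images with the colpali retriever (wrapped by byaldi).
rag = RAGMultiModalModel.from_pretrained("vidore/colpali")
rag.index(input_path="docs/", index_name="demo_index",
          store_collection_with_index=False, overwrite=True)

# 2. Retrieve the page most relevant to the question.
question = "What is the invoice total?"
best = rag.search(question, k=1)[0]
image = Image.open(f"docs/page_{best.page_num}.png")  # placeholder file layout

# 3. Ask Qwen2-VL about the retrieved image.
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-2B-Instruct", torch_dtype="auto", device_map="auto")
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-2B-Instruct")

messages = [{"role": "user", "content": [
    {"type": "image", "image": image},
    {"type": "text", "text": question},
]}]
prompt = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(text=[prompt], images=image_inputs, videos=video_inputs,
                   padding=True, return_tensors="pt").to(model.device)

# max_new_tokens plays the role of the app's response-length slider.
out = model.generate(**inputs, max_new_tokens=100)
answer = processor.batch_decode(
    out[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)[0]
print(answer)
```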
## Prerequisites

- Python 3.8+
- PyTorch 2.4.1
- Torchvision 0.19.1
- Qwen V1
- Byaldi
- CUDA-compatible GPU (recommended for optimal performance)
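A quick way to confirm the pinned versions and GPU visibility before launching the app (a sanity-check sketch, not part of the app itself):

```
import torch
import torchvision

# Expect the pinned versions listed above.
print(torch.__version__)          # e.g. 2.4.1 (possibly with a +cu suffix)
print(torchvision.__version__)    # e.g. 0.19.1
print(torch.cuda.is_available())  # True if a CUDA-compatible GPU is visible
```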
## Installation

1. Clone the repository:
```
git clone https://github.com/Claytonn7/qwen2-colpali-ocr.git
cd qwen2-colpali-ocr
```

3. Open a web browser and navigate to the URL provided by Streamlit (usually `http://localhost:8501`).
## Usage

1. Choose to upload an image or use the example image.
2. If uploading, select an image file (PNG, JPG, or JPEG).
3. Enter a single keyword in the provided input field.
4. Adjust the maximum number of tokens for the response using the slider.
5. View the extracted text from the image, with the searched keyword highlighted (a sketch of this step appears after the note below). Example screenshot [here](https://github.com/Claytonn7/qwen2-colpali-ocr/blob/main/examples-app/6-keyword-highlight2.jpg).

NB: Check the examples-app directory in this repo for more example screenshots.
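The keyword highlighting in step 5 could be implemented with a small helper like this (a hypothetical sketch using Streamlit's HTML rendering, not the app's actual code):

```
import re
import streamlit as st

def highlight_keyword(text, keyword):
    # Wrap case-insensitive matches of the keyword in <mark> tags.
    if not keyword:
        return text
    pattern = re.compile(re.escape(keyword), re.IGNORECASE)
    return pattern.sub(lambda m: f"<mark>{m.group(0)}</mark>", text)

extracted = "Total amount due: 42.00 USD"  # stand-in for Qwen2-VL output
st.markdown(highlight_keyword(extracted, "total"), unsafe_allow_html=True)
```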
## Disclaimer

The app utilizes the free tier of HuggingFace Spaces, which only supports CPU, resulting in very slow processing times. For optimal performance, it's recommended to run the app locally on a machine with GPU support.
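For local runs, one option is an explicit device fallback when loading the model, so the same code works on both GPU and CPU machines (a sketch using the standard `transformers` loading pattern, not the app's actual code):

```
import torch
from transformers import Qwen2VLForConditionalGeneration

# Use the GPU (with half precision) when available, otherwise fall back to CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-2B-Instruct",
    torch_dtype=torch.float16 if device == "cuda" else torch.float32,
).to(device)
```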
## Acknowledgments

- This project uses the [Qwen2-VL model](https://huggingface.co/Qwen/Qwen2-VL-2B-Instruct) from Hugging Face.
- The custom RAG implementation uses [byaldi](https://github.com/AnswerDotAI/byaldi), a wrapper around the [colpali model](https://huggingface.co/vidore/colpali).
## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

## License

This project is licensed under the GPL-2.0 License.