Spaces:
Sleeping
Sleeping
title: Thesizer | |
emoji: 🖥️ | |
colorFrom: purple | |
colorTo: blue | |
sdk: gradio | |
app_file: app.py | |
pinned: true | |
# HAMK Thesis Assistant (Thesizer) | |
AI Project - 10.2024 | |
``` | |
.----. | |
.---------. | == | | |
|.-"""""-.| |----| | |
|| THE || | == | | |
|| SIZER || |----| | |
|'-.....-'| |::::| | |
`"")---(""` |___.| | |
/:::::::::::\" _ " | |
/:::=======:::\`\`\ | |
`"""""""""""""` '-' | |
``` | |
**Thesizer** is a fine-tuned LLM that is trained to assist with authoring Thesis'. | |
It is specifically trained and tuned to guide the user using HAMK's thesis | |
standards and guidelines [Thesis - HAMK](https://www.hamk.fi/en/student-pages/planning-your-studies/thesis/). | |
The goal of this is to make the process of writing a thesis easier, by helping | |
the user using spoken language, so that it will be easier for the user to find | |
help on the more technical aspects of writing a thesis. This way the user can | |
focus on what is the most important. The actual content of the Thesis. | |
This project is part of HAMK's `Development of AI Applications` -course. The | |
idea for the project spawned from the overwhelming feeling that all of the | |
different guidelines and learning documents there are for writing a thesis using | |
the standards of HAMK. Thesizer takes those documents and fine-tunes itself so | |
that it will be able to provide useful information and help the user. It is kind | |
of like a spoken language search engine for thesis writing technicalities. | |
## Table of contents | |
1. [Running the model](#Running-the-model) | |
2. [Documentation](#Documentation) | |
1. [The development process](#The-development-process) | |
- [Planning](#Planning) | |
- [Creating the model](#Creating-the-model) | |
2. [Tools used](#Tools-used) | |
3. [Challenges](#Challenges) | |
3. [Dependencies](#Dependencies) | |
3. [Helpful links](#Helpful-links) | |
### Running the model | |
1. **Install all of the dependencies** | |
The guide to this is in the [Dependencies](#Dependencies)-section. | |
2. **Run the model** | |
```bash | |
python3 thesizer_rag.py | |
# give a prompt | |
python3 thesizer_rag.py "mikä on abstract sivu?" | |
``` | |
![thesizer web interface](./readme_images/helpWithThesis2.png) | |
## Documentation | |
### 1. The development process | |
#### Planning | |
The development process started after we had come to an agreement on the project. | |
We then tasked everyone to go and research what Hugging Face transformers would | |
be suitable as a base for our Fine-Tuned model. | |
Some of the models considered: | |
_General LLMs_ | |
- [google-bert/bert-base-multilingual-cased](https://huggingface.co/google-bert/bert-base-multilingual-cased) | |
- [meta-llama/Llama-3.2-11B-Vision-Instruct](https://huggingface.co/meta-llama/Llama-3.2-11B-Vision-Instruct) | |
- [google/gemma-2-2b](https://huggingface.co/google/gemma-2-2b) | |
_Translation models_ | |
- [facebook/mbart-large-50-one-to-many-mmt](https://huggingface.co/facebook/mbart-large-50-one-to-many-mmt) | |
#### Creating the model | |
As a base for the RAG model, we used this learning material from the Hugging | |
Face documentation: [Simple RAG for GitHub issues using Hugging Face Zephyr and LangChain](https://huggingface.co/learn/cookbook/rag_zephyr_langchain) | |
### 2. Tools used | |
**LangChain** | |
Thesizer uses [LangChain](https://www.langchain.com/langchain) as the base | |
framework for our RAG application. It provided easy and straight forward | |
way for us to give a pre-trained llm model context awareness. | |
The documents that were used for the context awareness are all located in the | |
[learning_material](./learning_material/) -folder. They are mostly in finnish, | |
with some being in english. You can clone this repository and use your own | |
documents instead if you would like to see how it works and adapts to the | |
contents of the folder. | |
**Hugging Face** | |
The models used by thesizer are fetched from [HuggingFace](https://huggingface.co/models). | |
They are then used with `HuggingFacePipeline` package, which provides very easy | |
interaction with the models. | |
**FAISS** | |
FAISS is a highly efficient vector database, that also provides fast similiarity | |
searching of the data inside of it. FAISS supports CUDA and is written in C++. | |
Because of this, it also needs to be downloaded specifically to the user's | |
hardware, in the same way that pytorch needs to be. | |
[FAISS - ai.meta.com](https://ai.meta.com/tools/faiss/) | |
Thesizer uses FAISS for managing all of the documentation. It also uses | |
similiarity searching to find content from these files that match the user's query. | |
All of the FAISS processing is done asynchronously, so that theoretically it could | |
be passed more documentation during runtime without affecting the other users | |
processing time. | |
### 3. Challenges | |
## Dependencies | |
If the requirements.txt does not work, you can try installing the dependcies | |
using the following commands. | |
0. **Create a virtual environment** | |
```bash | |
# linux | |
python3 -m venv venv | |
source venv/bin/activate | |
# windows | |
python -m venv venv | |
.\venv\Scripts\Activate.ps1 | |
``` | |
1. **Install Pytorch that matches your machine** | |
Get the installation command from this website: [Start Locally | PyTorch](https://pytorch.org/get-started/locally/) | |
2. **Install FAISS** | |
```bash | |
# if you have a gpu | |
pip install faiss-gpu-cu12 # CUDA 12.x | |
pip install faiss-gpu-cu11 # CUDA 11.x | |
# if you only have a cpu | |
pip install faiss-cpu | |
``` | |
3. **Install universal pip packages** | |
```bash | |
pip install transformers accelerate bitsandbytes sentence-transformers langchain \ | |
langchain-community langchain-huggingface pypdf bs4 lxml nltk gradio "unstructured[md]" | |
``` | |
4. **Log into your Hugging Face account** | |
If you are prompted to log in, follow the instructions of the prompt. | |
You can figure it out. | |
> [!NOTE] | |
> If there are problems or the process feels complicated, add your own | |
> guide right here and replace this note. | |
## Helpful links | |
Are you interested in finding out more and digging deeper? Here are some of the | |
sources that we used when creating this project. | |
Some links can also be found within the code comments, so that you can find them | |
near where they are used. Locality of behaviour babyyy. | |
**RAG & LangChain** | |
- Loading PDF files with LangChain and PyPDF | [How to load PDFs - LangChain](https://python.langchain.com/docs/how_to/document_loader_pdf/) | |
- Loading HTML files with LangChain and BeautifulSoup4 | [How to load HTML - LangChain](https://python.langchain.com/docs/how_to/document_loader_html/) | |
- Loading Markdown files with LangChain and unstructured | [How to load Markdown](https://python.langchain.com/docs/how_to/document_loader_markdown/) | |
- Asynchronous FAISS | [Faiss \(async\) - LangChain](https://python.langchain.com/docs/integrations/vectorstores/faiss_async/) | |
- Retriving data from a vectorstore | [How to use a vectorstore as a retriever - LangChain](https://python.langchain.com/docs/how_to/vectorstore_retriever/) | |