Spaces: seanpedrickcase (Running)
Commit 22ca76e · Parent: 3c1c3de
Allowed the app running on AWS to use a smaller embedding model and not to load the representation LLM (due to size restrictions).
Files changed:
- Dockerfile (+13 -9)
- app.py (+1 -1)
- download_model.py (+16 -0)
- funcs/representation_model.py (+13 -0)
- funcs/topic_core_funcs.py (+8 -5)
- requirements.txt (+2 -2)
- requirements_gpu.txt (+3 -2)
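In outline, the commit gates two choices on a RUNNING_ON_AWS environment variable: which sentence-embedding model is loaded, and whether the LLM-based topic representation is built at all. A minimal sketch of that pattern (plain Python for illustration, not the app's actual helper code, which reads the flag through its own get_or_create_env_var function):

import os

# Sketch of the gating logic this commit introduces.
RUNNING_ON_AWS = os.environ.get("RUNNING_ON_AWS", "0")

if RUNNING_ON_AWS == "0":
    embeddings_name = "mixedbread-ai/mxbai-embed-large-v1"        # larger, higher-quality embeddings
else:
    embeddings_name = "sentence-transformers/all-MiniLM-L6-v2"    # smaller model that fits AWS size limits

use_llm_representation = (RUNNING_ON_AWS != "1")                  # LLM topic labels are skipped on AWS
print(embeddings_name, use_llm_representation)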
Dockerfile
CHANGED
@@ -35,15 +35,19 @@ RUN mkdir -p /home/user/.cache/matplotlib && chown -R user:user /home/user/.cache/matplotlib
 RUN mkdir -p /home/user/app/model/rep && chown -R user:user /home/user/app/model/rep
 RUN mkdir -p /home/user/app/model/embed && chown -R user:user /home/user/app/model/embed
 
-# Download the quantised phi model directly with curl
-RUN curl -L -o /home/user/app/model/rep/Phi-3-mini-128k-instruct
-
-# Download the Mixed bread embedding model during the build process
-RUN curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.deb.sh | bash
-RUN apt-get install git-lfs -y
-RUN git lfs install
-RUN git clone https://huggingface.co/mixedbread-ai/mxbai-embed-large-v1 /home/user/app/model/embed
-RUN rm -rf /home/user/app/model/embed/.git
+# Download the quantised phi model directly with curl. Changed as it is so big - not loaded
+#RUN curl -L -o /home/user/app/model/rep/Phi-3.1-mini-128k-instruct-Q4_K_M.gguf https://huggingface.co/bartowski/Phi-3.1-mini-128k-instruct-GGUF/tree/main/Phi-3.1-mini-128k-instruct-Q4_K_M.gguf
+
+# Download the Mixed bread embedding model during the build process - changed as it is too big for AWS. Not loaded.
+#RUN curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.deb.sh | bash
+#RUN apt-get install git-lfs -y
+#RUN git lfs install
+#RUN git clone https://huggingface.co/mixedbread-ai/mxbai-embed-large-v1 /home/user/app/model/embed
+#RUN rm -rf /home/user/app/model/embed/.git
+
+# Download the BGE embedding model during the build process. Create a directory for the model and download specific files using huggingface_hub
+COPY download_model.py /src/download_model.py
+RUN python /src/download_model.py
 
 # Switch to the "user" user
 USER user
app.py
CHANGED
@@ -39,7 +39,7 @@ with block:
 # Topic modeller
 Generate topics from open text in tabular data, based on [BERTopic](https://maartengr.github.io/BERTopic/). Upload a data file (csv, xlsx, or parquet), then specify the open text column that you want to use to generate topics. Click 'Extract topics' after you have selected the minimum similar documents per topic and maximum total topics. Duplicate this space, or clone to your computer to avoid queues here!
 
-Uses fast TF-IDF-based embeddings by default, which are fast but do not lead to high-quality clustering. Change to the higher quality [mxbai-embed-large-v1](https://huggingface.co/mixedbread-ai/mxbai-embed-large-v1) model embeddings (512 dimensions) for better results but slower processing time. If you have an embeddings .npz file previously made using this model, you can load this in at the same time to skip the first modelling step. If you have a pre-defined list of topics for zero-shot modelling, you can upload this as a csv file under 'I have my own list of topics...'. Further configuration options are available, such as maximum topics allowed, minimum documents per topic etc. Topic representation with LLMs currently based on [Phi-3-mini-128k-instruct-GGUF](https://huggingface.co/
+Uses fast TF-IDF-based embeddings by default, which are fast but do not lead to high-quality clustering. Change to the higher quality [mxbai-embed-large-v1](https://huggingface.co/mixedbread-ai/mxbai-embed-large-v1) model embeddings (512 dimensions) for better results but slower processing time. If you have an embeddings .npz file previously made using this model, you can load this in at the same time to skip the first modelling step. If you have a pre-defined list of topics for zero-shot modelling, you can upload this as a csv file under 'I have my own list of topics...'. Further configuration options are available, such as maximum topics allowed, minimum documents per topic etc. Topic representation with LLMs currently based on [Phi-3.1-mini-128k-instruct-GGUF](https://huggingface.co/bartowski/Phi-3.1-mini-128k-instruct-GGUF), which is quite slow on CPU, so use a GPU-enabled computer if possible, building from the requirements_gpu.txt file in the base folder.
 
 For small datasets, consider breaking up your text into sentences under 'Clean data' -> 'Split open text...' before topic modelling.
 
download_model.py
ADDED
@@ -0,0 +1,16 @@
+from huggingface_hub import hf_hub_download
+
+# Define the repository and files to download
+repo_id = "sentence-transformers/all-MiniLM-L6-v2" #"BAAI/bge-small-en-v1.5"
+files_to_download = [
+    "config.json",
+    "config_sentence_transformers.json",
+    "model.safetensors",
+    "tokenizer_config.json",
+    "vocab.txt"
+]
+
+# Download each file and save it to the local embedding model directory
+for file_name in files_to_download:
+    print("Checking for file", file_name)
+    hf_hub_download(repo_id=repo_id, filename=file_name, local_dir="/model/embed") #"/model/bge"
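The script only fetches the raw model files at build time; loading them is left to the app. A minimal usage sketch, assuming the local directory is later passed to sentence-transformers (this loading step is not part of the diff, and the directory may need extra files such as modules.json for a fully configured sentence-transformers model):

from sentence_transformers import SentenceTransformer

# Assumed usage: point sentence-transformers at the directory populated by download_model.py.
embedding_model = SentenceTransformer("/model/embed")
vectors = embedding_model.encode(["An example document about local topics."])
print(vectors.shape)  # all-MiniLM-L6-v2 produces 384-dimensional embeddings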
funcs/representation_model.py
CHANGED
@@ -4,16 +4,21 @@ from llama_cpp import Llama
 from pydantic import BaseModel
 import torch.cuda
 from huggingface_hub import hf_hub_download
+from gradio import Warning
 
 from bertopic.representation import KeyBERTInspired, MaximalMarginalRelevance, BaseRepresentation
 from funcs.embeddings import torch_device
 from funcs.prompts import phi3_prompt, phi3_start
+from funcs.helper_functions import get_or_create_env_var
 
 chosen_prompt = phi3_prompt #open_hermes_prompt # stablelm_prompt
 chosen_start_tag = phi3_start #open_hermes_start # stablelm_start
 
 random_seed = 42
 
+RUNNING_ON_AWS = get_or_create_env_var('RUNNING_ON_AWS', '0')
+print(f'The value of RUNNING_ON_AWS is {RUNNING_ON_AWS}')
+
 # Currently set n_gpu_layers to 0 even with cuda due to persistent bugs in implementation with cuda
 print("torch device for representation functions:", torch_device)
 if torch_device == "gpu":
@@ -140,6 +145,14 @@ def create_representation_model(representation_type: str, llm_config: dict, hf_m
     """
 
     if representation_type == "LLM":
+        print("RUNNING_ON_AWS:", RUNNING_ON_AWS)
+        if RUNNING_ON_AWS=="1":
+            error_message = "LLM representation not available on AWS due to model size restrictions. Returning base representation"
+            Warning(error_message, duration=5)
+            print(error_message)
+            representation_model = {"LLM":base_rep}
+            return representation_model
+
         print("Generating LLM representation")
         # Use llama.cpp to load in model
 
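The new code imports get_or_create_env_var from funcs/helper_functions, which is not touched by this commit. A plausible minimal sketch of what such a helper does, offered only as an assumption about its behaviour:

import os

# Hypothetical stand-in for funcs.helper_functions.get_or_create_env_var:
# return the environment variable if it exists, otherwise set it to the default and return that.
def get_or_create_env_var(var_name: str, default_value: str) -> str:
    value = os.environ.get(var_name)
    if value is None:
        os.environ[var_name] = default_value
        value = default_value
    return value

print(get_or_create_env_var("RUNNING_ON_AWS", "0"))  # "0" unless the variable was already set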
funcs/topic_core_funcs.py
CHANGED
@@ -13,10 +13,10 @@ PandasDataFrame = Type[pd.DataFrame]
 
 from funcs.clean_funcs import initial_clean, regex_clean
 from funcs.anonymiser import expand_sentences_spacy
-from funcs.helper_functions import read_file, zip_folder, delete_files_in_folder, save_topic_outputs, output_folder
+from funcs.helper_functions import read_file, zip_folder, delete_files_in_folder, save_topic_outputs, output_folder, get_or_create_env_var
 from funcs.embeddings import make_or_load_embeddings, torch_device
 from funcs.bertopic_vis_documents import visualize_documents_custom, visualize_hierarchical_documents_custom, hierarchical_topics_custom, visualize_hierarchy_custom
-from funcs.representation_model import create_representation_model, llm_config, chosen_start_tag, random_seed
+from funcs.representation_model import create_representation_model, llm_config, chosen_start_tag, random_seed, RUNNING_ON_AWS
 
 from sklearn.feature_extraction.text import CountVectorizer
 
@@ -36,11 +36,14 @@ today = datetime.now().strftime("%d%m%Y")
 today_rev = datetime.now().strftime("%Y%m%d")
 
 # Load embeddings
-
+if RUNNING_ON_AWS=="0":
+    embeddings_name = "mixedbread-ai/mxbai-embed-large-v1" #"BAAI/large-small-en-v1.5" #"jinaai/jina-embeddings-v2-base-en"
+else:
+    embeddings_name = "sentence-transformers/all-MiniLM-L6-v2"
 
 # LLM model used for representing topics
-hf_model_name = "
-hf_model_file = "Phi-3-mini-128k-instruct
+hf_model_name = "bartowski/Phi-3.1-mini-128k-instruct-GGUF" #'second-state/stablelm-2-zephyr-1.6b-GGUF' #'TheBloke/phi-2-orange-GGUF' #'NousResearch/Nous-Capybara-7B-V1.9-GGUF'
+hf_model_file = "Phi-3.1-mini-128k-instruct-Q4_K_M.gguf" #'stablelm-2-zephyr-1_6b-Q5_K_M.gguf' # 'phi-2-orange.Q5_K_M.gguf' #'Capybara-7B-V1.9-Q5_K_M.gguf'
 
 # When topic modelling column is chosen, change the default visualisation column to the same
 def change_default_vis_col(in_colnames:List[str]):
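Because RUNNING_ON_AWS is resolved at import time, the flag has to be set in the environment before funcs.representation_model (or anything that imports it) is loaded. A minimal sketch of exercising the switch locally, assuming the package imports cleanly outside the container:

import os

# Set the flag before importing the app modules; they read it when first imported.
os.environ["RUNNING_ON_AWS"] = "1"   # "1" selects the small embedding model and skips the LLM

from funcs.topic_core_funcs import embeddings_name  # module-level variable defined in this diff
print(embeddings_name)  # expected: "sentence-transformers/all-MiniLM-L6-v2"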
requirements.txt
CHANGED
@@ -2,7 +2,7 @@ gradio # Not specified version due to interaction with spacy - reinstall latest
 boto3
 transformers==4.41.2
 accelerate==0.26.1
-torch==2.
+torch==2.4.0
 bertopic==0.16.2
 spacy==3.7.4
 en_core_web_sm @ https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1.tar.gz
@@ -14,5 +14,5 @@ presidio_anonymizer==2.2.354
 scipy==1.11.4
 polars==0.20.6
 sentence-transformers==3.0.1
-llama-cpp-python==0.2.
+llama-cpp-python==0.2.87 --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cpu
 numpy==1.26.4
requirements_gpu.txt
CHANGED
@@ -12,7 +12,8 @@ presidio_analyzer==2.2.354
 presidio_anonymizer==2.2.354
 scipy==1.11.4
 polars==0.20.6
+llama-cpp-python==0.2.87 --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu121
 torch --index-url https://download.pytorch.org/whl/cu121
-llama-cpp-python==0.2.77 --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu121
-numpy==1.26.4
 sentence-transformers==3.0.1
+numpy==1.26.4
+