seanpedrickcase committed on
Commit 22ca76e · 1 Parent(s): 3c1c3de

Allowed the app running on AWS to use a smaller embedding model and not to load the representation LLM (due to size restrictions).

Dockerfile CHANGED
@@ -35,15 +35,19 @@ RUN mkdir -p /home/user/.cache/matplotlib && chown -R user:user /home/user/.cach
 RUN mkdir -p /home/user/app/model/rep && chown -R user:user /home/user/app/model/rep
 RUN mkdir -p /home/user/app/model/embed && chown -R user:user /home/user/app/model/embed
 
-# Download the quantised phi model directly with curl
-RUN curl -L -o /home/user/app/model/rep/Phi-3-mini-128k-instruct.Q4_K_M.gguf https://huggingface.co/QuantFactory/Phi-3-mini-128k-instruct-GGUF/tree/main/Phi-3-mini-128k-instruct.Q4_K_M.gguf
-
-# Download the Mixed bread embedding model during the build process
-RUN curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.deb.sh | bash
-RUN apt-get install git-lfs -y
-RUN git lfs install
-RUN git clone https://huggingface.co/mixedbread-ai/mxbai-embed-large-v1 /home/user/app/model/embed
-RUN rm -rf /home/user/app/model/embed/.git
+# Download the quantised phi model directly with curl. Commented out as the model is too big - not loaded.
+#RUN curl -L -o /home/user/app/model/rep/Phi-3.1-mini-128k-instruct-Q4_K_M.gguf https://huggingface.co/bartowski/Phi-3.1-mini-128k-instruct-GGUF/tree/main/Phi-3.1-mini-128k-instruct-Q4_K_M.gguf
+
+# Download the Mixedbread embedding model during the build process. Commented out as it is too big for AWS - not loaded.
+#RUN curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.deb.sh | bash
+#RUN apt-get install git-lfs -y
+#RUN git lfs install
+#RUN git clone https://huggingface.co/mixedbread-ai/mxbai-embed-large-v1 /home/user/app/model/embed
+#RUN rm -rf /home/user/app/model/embed/.git
+
+# Download the embedding model during the build process: fetch the specific model files into the model directory using huggingface_hub
+COPY download_model.py /src/download_model.py
+RUN python /src/download_model.py
 
 # Switch to the "user" user
 USER user
app.py CHANGED
@@ -39,7 +39,7 @@ with block:
 # Topic modeller
 Generate topics from open text in tabular data, based on [BERTopic](https://maartengr.github.io/BERTopic/). Upload a data file (csv, xlsx, or parquet), then specify the open text column that you want to use to generate topics. Click 'Extract topics' after you have selected the minimum similar documents per topic and maximum total topics. Duplicate this space, or clone to your computer to avoid queues here!
 
-Uses fast TF-IDF-based embeddings by default, which are fast but does not lead to high quality clusering. Change to higher quality [mxbai-embed-large-v1](https://huggingface.co/mixedbread-ai/mxbai-embed-large-v1) model embeddings (512 dimensions) for better results but slower processing time. If you have an embeddings .npz file previously made using this model, you can load this in at the same time to skip the first modelling step. If you have a pre-defined list of topics for zero-shot modelling, you can upload this as a csv file under 'I have my own list of topics...'. Further configuration options are available such as maximum topics allowed, minimum documents per topic etc.. Topic representation with LLMs currently based on [Phi-3-mini-128k-instruct-GGUF](https://huggingface.co/QuantFactory/Phi-3-mini-128k-instruct-GGUF), which is quite slow on CPU, so use a GPU-enabled computer if possible, building from the requirements_gpu.txt file in the base folder.
+Uses TF-IDF-based embeddings by default, which are fast but do not lead to high-quality clustering. Change to the higher quality [mxbai-embed-large-v1](https://huggingface.co/mixedbread-ai/mxbai-embed-large-v1) model embeddings (512 dimensions) for better results but slower processing time. If you have an embeddings .npz file previously made using this model, you can load it in at the same time to skip the first modelling step. If you have a pre-defined list of topics for zero-shot modelling, you can upload this as a csv file under 'I have my own list of topics...'. Further configuration options are available, such as maximum topics allowed, minimum documents per topic etc. Topic representation with LLMs is currently based on [Phi-3.1-mini-128k-instruct-GGUF](https://huggingface.co/bartowski/Phi-3.1-mini-128k-instruct-GGUF), which is quite slow on CPU, so use a GPU-enabled computer if possible, building from the requirements_gpu.txt file in the base folder.
 
 For small datasets, consider breaking up your text into sentences under 'Clean data' -> 'Split open text...' before topic modelling.
 
download_model.py ADDED
@@ -0,0 +1,16 @@
+from huggingface_hub import hf_hub_download
+
+# Define the repository and files to download
+repo_id = "sentence-transformers/all-MiniLM-L6-v2" #"BAAI/bge-small-en-v1.5"
+files_to_download = [
+    "config.json",
+    "config_sentence_transformers.json",
+    "model.safetensors",
+    "tokenizer_config.json",
+    "vocab.txt"
+]
+
+# Download each file and save it to the /model/embed directory
+for file_name in files_to_download:
+    print("Checking for file", file_name)
+    hf_hub_download(repo_id=repo_id, filename=file_name, local_dir="/model/embed") #"/model/bge"
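
As a quick sanity check of the downloaded files, the local directory can be loaded directly with sentence-transformers. This is a minimal sketch, not part of this commit, and it assumes the files listed above are sufficient for SentenceTransformer to rebuild the model from /model/embed:

from sentence_transformers import SentenceTransformer

# Load the model from the locally downloaded files instead of the Hugging Face Hub
model = SentenceTransformer("/model/embed")

# Embed two short documents; all-MiniLM-L6-v2 produces 384-dimensional vectors
embeddings = model.encode(["first document", "second document"])
print(embeddings.shape)  # expected: (2, 384)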
funcs/representation_model.py CHANGED
@@ -4,16 +4,21 @@ from llama_cpp import Llama
 from pydantic import BaseModel
 import torch.cuda
 from huggingface_hub import hf_hub_download
+from gradio import Warning
 
 from bertopic.representation import KeyBERTInspired, MaximalMarginalRelevance, BaseRepresentation
 from funcs.embeddings import torch_device
 from funcs.prompts import phi3_prompt, phi3_start
+from funcs.helper_functions import get_or_create_env_var
 
 chosen_prompt = phi3_prompt #open_hermes_prompt # stablelm_prompt
 chosen_start_tag = phi3_start #open_hermes_start # stablelm_start
 
 random_seed = 42
 
+RUNNING_ON_AWS = get_or_create_env_var('RUNNING_ON_AWS', '0')
+print(f'The value of RUNNING_ON_AWS is {RUNNING_ON_AWS}')
+
 # Currently set n_gpu_layers to 0 even with cuda due to persistent bugs in implementation with cuda
 print("torch device for representation functions:", torch_device)
 if torch_device == "gpu":
@@ -140,6 +145,14 @@ def create_representation_model(representation_type: str, llm_config: dict, hf_m
     """
 
     if representation_type == "LLM":
+        print("RUNNING_ON_AWS:", RUNNING_ON_AWS)
+        if RUNNING_ON_AWS=="1":
+            error_message = "LLM representation not available on AWS due to model size restrictions. Returning base representation"
+            Warning(error_message, duration=5)
+            print(error_message)
+            representation_model = {"LLM":base_rep}
+            return representation_model
+
         print("Generating LLM representation")
         # Use llama.cpp to load in model
 
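The get_or_create_env_var helper imported above is not included in this diff; the following is a minimal sketch of the assumed behaviour (read an environment variable, falling back to a default), for illustration only:

import os

def get_or_create_env_var(var_name: str, default_value: str) -> str:
    # Return the environment variable if set; otherwise set it to the default and return that
    value = os.environ.get(var_name)
    if value is None:
        os.environ[var_name] = default_value
        value = default_value
    return value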
 
funcs/topic_core_funcs.py CHANGED
@@ -13,10 +13,10 @@ PandasDataFrame = Type[pd.DataFrame]
 
 from funcs.clean_funcs import initial_clean, regex_clean
 from funcs.anonymiser import expand_sentences_spacy
-from funcs.helper_functions import read_file, zip_folder, delete_files_in_folder, save_topic_outputs, output_folder
+from funcs.helper_functions import read_file, zip_folder, delete_files_in_folder, save_topic_outputs, output_folder, get_or_create_env_var
 from funcs.embeddings import make_or_load_embeddings, torch_device
 from funcs.bertopic_vis_documents import visualize_documents_custom, visualize_hierarchical_documents_custom, hierarchical_topics_custom, visualize_hierarchy_custom
-from funcs.representation_model import create_representation_model, llm_config, chosen_start_tag, random_seed
+from funcs.representation_model import create_representation_model, llm_config, chosen_start_tag, random_seed, RUNNING_ON_AWS
 
 from sklearn.feature_extraction.text import CountVectorizer
 
@@ -36,11 +36,14 @@ today = datetime.now().strftime("%d%m%Y")
 today_rev = datetime.now().strftime("%Y%m%d")
 
 # Load embeddings
-embeddings_name = "mixedbread-ai/mxbai-embed-large-v1" #"BAAI/large-small-en-v1.5" #"jinaai/jina-embeddings-v2-base-en"
+if RUNNING_ON_AWS=="0":
+    embeddings_name = "mixedbread-ai/mxbai-embed-large-v1" #"BAAI/large-small-en-v1.5" #"jinaai/jina-embeddings-v2-base-en"
+else:
+    embeddings_name = "sentence-transformers/all-MiniLM-L6-v2"
 
 # LLM model used for representing topics
-hf_model_name = "QuantFactory/Phi-3-mini-128k-instruct-GGUF"#'second-state/stablelm-2-zephyr-1.6b-GGUF' #'TheBloke/phi-2-orange-GGUF' #'NousResearch/Nous-Capybara-7B-V1.9-GGUF'
-hf_model_file = "Phi-3-mini-128k-instruct.Q4_K_M.gguf"#'stablelm-2-zephyr-1_6b-Q5_K_M.gguf' # 'phi-2-orange.Q5_K_M.gguf' #'Capybara-7B-V1.9-Q5_K_M.gguf'
+hf_model_name = "bartowski/Phi-3.1-mini-128k-instruct-GGUF"#'second-state/stablelm-2-zephyr-1.6b-GGUF' #'TheBloke/phi-2-orange-GGUF' #'NousResearch/Nous-Capybara-7B-V1.9-GGUF'
+hf_model_file = "Phi-3.1-mini-128k-instruct-Q4_K_M.gguf"#'stablelm-2-zephyr-1_6b-Q5_K_M.gguf' # 'phi-2-orange.Q5_K_M.gguf' #'Capybara-7B-V1.9-Q5_K_M.gguf'
 
 # When topic modelling column is chosen, change the default visualisation column to the same
 def change_default_vis_col(in_colnames:List[str]):
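
For context, a minimal sketch (not part of this commit) of how the hf_model_name / hf_model_file pair defined above is typically downloaded and loaded with llama-cpp-python; the context size is illustrative and this assumes the usual hf_hub_download + Llama pattern used in representation_model.py:

from huggingface_hub import hf_hub_download
from llama_cpp import Llama

# Fetch the quantised GGUF file from the Hugging Face Hub (cached after the first call)
model_path = hf_hub_download(repo_id=hf_model_name, filename=hf_model_file)

# Load the model on CPU; n_gpu_layers stays at 0, as noted in representation_model.py
llm = Llama(model_path=model_path, n_ctx=4096, n_gpu_layers=0, seed=random_seed)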
requirements.txt CHANGED
@@ -2,7 +2,7 @@ gradio # Not specified version due to interaction with spacy - reinstall latest
 boto3
 transformers==4.41.2
 accelerate==0.26.1
-torch==2.3.1
+torch==2.4.0
 bertopic==0.16.2
 spacy==3.7.4
 en_core_web_sm @ https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1.tar.gz
@@ -14,5 +14,5 @@ presidio_anonymizer==2.2.354
 scipy==1.11.4
 polars==0.20.6
 sentence-transformers==3.0.1
-llama-cpp-python==0.2.79 --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cpu
+llama-cpp-python==0.2.87 --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cpu
 numpy==1.26.4
requirements_gpu.txt CHANGED
@@ -12,7 +12,8 @@ presidio_analyzer==2.2.354
 presidio_anonymizer==2.2.354
 scipy==1.11.4
 polars==0.20.6
+llama-cpp-python==0.2.87 --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu121
 torch --index-url https://download.pytorch.org/whl/cu121
-llama-cpp-python==0.2.77 --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu121
-numpy==1.26.4
 sentence-transformers==3.0.1
+numpy==1.26.4
+