	\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\
	\\----------- Resume Parser ----------\\
	\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\

	# Overview:
	This project is a resume parsing tool written in Python.
	It uses the Mistral-Nemo-Instruct-2407 model for primary parsing.
	If Mistral fails or encounters issues,
	the system falls back to a custom-trained spaCy model so that parsing can continue.
	The tool is exposed through a Flask API and has a user interface built with HTML and CSS.


	# Installation Guide:

	1. Create and Activate a Virtual Environment
	python -m venv venv
	source venv/bin/activate # For Linux/Mac
	# or
	venv\Scripts\activate # For Windows

	# NOTE: If the virtual environment (venv) is already created, you can skip the creation step and just activate.
	- For Linux/Mac:
	source venv/bin/activate
	- For Windows:
	venv\Scripts\activate

	2. Install Required Libraries
	pip install -r requirements.txt

	# Ensure the following dependencies are included:
	- Flask
	- spaCy
	- huggingface_hub
	- PyMuPDF
	- python-docx
	- odfpy (for ODT files)
	- Tesseract-OCR (for image-based parsing; installed on the system, not through pip)

	# NOTE: If any model or library is not installed, you can install it using:
	pip install <model_name>
	_Replace <model_name> with the specific model or library you need to install_

	3. Set up Hugging Face Token
	- Add your Hugging Face token to the .env file as:
	HF_TOKEN=<your_huggingface_token>
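
One way to read the token at startup is sketched below. It parses `KEY=VALUE` lines from the `.env` file using only the standard library. The helper names `load_env_file` and `get_hf_token` are hypothetical, and the project may load the file differently (for example through a dedicated package), which this README does not state.

```python
import os
from pathlib import Path


def load_env_file(path=".env"):
    """Read KEY=VALUE pairs from a .env file into a dict (hypothetical helper)."""
    values = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for line in env_path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        # Skip blank lines, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def get_hf_token(path=".env"):
    """Prefer a token already in the environment, then fall back to the .env file."""
    token = os.environ.get("HF_TOKEN") or load_env_file(path).get("HF_TOKEN")
    if not token:
        raise RuntimeError("HF_TOKEN is not set; add it to the .env file.")
    return token
```

Failing early with a clear message when the token is missing tends to be easier to debug than a rejected API call later on.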


	# File Structure Overview:
	Mistral_With_Spacy/
	│
	├── Spacy_Models/
	│   └── ner_model_05_3 # Pretrained spaCy model directory for resume parsing
	│
	├── templates/
	│   ├── index.html # UI for file upload
	│   └── result.html # Display parsed results in structured JSON
	│
	├── uploads/ # Directory for uploaded resume files
	│
	├── utils/
	│   ├── mistral.py # Code for calling Mistral API and handling responses
	│   ├── spacy.py # spaCy fallback model for parsing resumes
	│   ├── error.py # Error handling utilities
	│   └── fileTotext.py # Functions to extract text from different file formats (PDF, DOCX, etc.)
	│
	├── venv/ # Virtual environment
	│
	├── .env # Environment variables file (contains Hugging Face token)
	│
	├── main.py # Flask app handling API routes for uploading and processing resumes
	│
	└── requirements.txt # Dependencies required for the project


	# Program Overview:

	# Mistral Integration (utils/mistral.py)
	- Mistral API Calls: Uses Hugging Face's Mistral-Nemo-Instruct-2407 model to parse resumes.
	- Personal and Professional Extraction: Two functions extract personal and professional information in structured JSON format.
	- Fallback Mechanism: If Mistral fails, spaCy's NER model is used as a fallback.
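
The fallback behaviour described above can be sketched as a small wrapper. This is an illustration under assumptions, not the code in `utils/mistral.py`: the function name `parse_with_fallback`, the `primary` and `fallback` callables, and the shape of the returned dictionary are all hypothetical.

```python
import logging

logger = logging.getLogger(__name__)


def parse_with_fallback(text, primary, fallback):
    """Try the primary parser; use the fallback if it raises or returns nothing.

    `primary` and `fallback` are hypothetical callables that take resume text
    and return a dict of extracted fields.
    """
    try:
        result = primary(text)
        if result:
            return {"parser": "mistral", "data": result}
        logger.warning("Primary parser returned no data; using fallback.")
    except Exception as exc:  # broad on purpose: any failure triggers the fallback
        logger.warning("Primary parser failed (%s); using fallback.", exc)
    return {"parser": "spacy", "data": fallback(text)}
```

Recording which parser produced the result makes it possible to tell afterwards how often the fallback path is taken.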

	# SpaCy Integration (utils/spacy.py)
	- Custom Trained Model: Uses a spaCy model (ner_model_05_3) trained specifically for resume parsing.
	- Named Entity Recognition: Extracts key information like Name, Email, Contact, Location, Skills, Experience, etc., from resumes.
	- Validation: Includes validation for extracted emails and contacts.
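
A minimal sketch of the validation step follows. The regular expressions, the digit limits, and the entity labels `EMAIL` and `CONTACT` are assumptions chosen for illustration; the rules actually used in `utils/spacy.py` are not shown in this README.

```python
import re

# Deliberately simple pattern; real-world validation may need to be stricter.
EMAIL_RE = re.compile(r"^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$")


def is_valid_email(value):
    """Return True if the value looks like an email address."""
    return bool(EMAIL_RE.match(value.strip()))


def is_valid_contact(value, min_digits=7, max_digits=15):
    """Accept digits with an optional leading +, spaces, dashes, dots, parentheses."""
    cleaned = value.strip()
    if not re.fullmatch(r"\+?[\d\s().-]+", cleaned):
        return False
    digits = re.sub(r"\D", "", cleaned)
    return min_digits <= len(digits) <= max_digits


def filter_entities(entities):
    """Drop EMAIL/CONTACT entities that fail validation; keep all other labels."""
    checks = {"EMAIL": is_valid_email, "CONTACT": is_valid_contact}
    return [
        (label, text)
        for label, text in entities
        if checks.get(label, lambda _: True)(text)
    ]
```

Here `entities` stands for a list of `(label, text)` pairs, such as those that could be built from the `ents` of a spaCy `Doc`.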

	# File Conversion (utils/fileTotext.py)
	- Text Extraction: Handles different resume formats (PDF, DOCX, ODT, RTF, and images like PNG, JPG, JPEG) and extracts text for further processing.
	- PDF Files: Uses PyMuPDF to extract text and, if necessary, Tesseract-OCR for image-based PDF content.
	- DOCX Files: Uses `python-docx` to extract structured text from Word documents.
	- ODT Files: Uses `odfpy` to extract text from ODT (OpenDocument) files.
	- RTF Files: Reads plain text from RTF files.
	- Images (PNG, JPG, JPEG): Uses Tesseract-OCR to extract text from image-based resumes.
	Note: For Tesseract-OCR, install it locally by following the [installation guide](https://github.com/UB-Mannheim/tesseract/wiki).
	- Hyperlink Extraction: Extracts hyperlinks from PDF files, capturing any embedded URLs during the parsing process.
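
The format handling above can be sketched as a dispatch table keyed on file extension. This is not the code in `utils/fileTotext.py`: the function names and the `(text, links)` return shape are hypothetical, reaching Tesseract through `pytesseract` and Pillow is an assumption, and the remaining formats and the OCR path for image-based PDFs are left out for brevity.

```python
from pathlib import Path


def extract_pdf(path):
    import fitz  # PyMuPDF; imported lazily so other formats work without it

    text_parts, links = [], []
    with fitz.open(path) as doc:
        for page in doc:
            text_parts.append(page.get_text())
            # get_links() returns dicts; external links carry a "uri" key.
            links.extend(link["uri"] for link in page.get_links() if "uri" in link)
    return "\n".join(text_parts), links


def extract_docx(path):
    import docx  # python-docx

    document = docx.Document(path)
    return "\n".join(p.text for p in document.paragraphs), []


def extract_image(path):
    import pytesseract  # assumption: Tesseract is called through pytesseract
    from PIL import Image

    return pytesseract.image_to_string(Image.open(path)), []


EXTRACTORS = {
    ".pdf": extract_pdf,
    ".docx": extract_docx,
    ".png": extract_image,
    ".jpg": extract_image,
    ".jpeg": extract_image,
}


def file_to_text(path):
    """Return (text, links) for a supported file, or raise ValueError."""
    suffix = Path(path).suffix.lower()
    try:
        extractor = EXTRACTORS[suffix]
    except KeyError:
        raise ValueError(f"Unsupported file type: {suffix or 'none'}") from None
    return extractor(path)
```

Keeping the mapping in one dictionary means a new format only needs one extractor function and one entry.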


	# Error Handling (utils/error.py)
	- Manages API response errors and file format issues, and ensures that fallbacks happen without crashing the app.

	# Flask API (main.py)
	- Endpoints: /upload for uploading resumes.
	- Results: Displays parsed results in JSON format on the results page.
	- UI: Simple interface for uploading resumes and viewing the parsing results.
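
A minimal sketch of an `/upload` route is shown below. It is an assumption-laden illustration rather than the contents of `main.py`: the form field name `file`, the allowed-extension set, and the JSON response are hypothetical, and the real app renders its results through a template instead of returning JSON directly.

```python
import os

from flask import Flask, jsonify, request
from werkzeug.utils import secure_filename

app = Flask(__name__)
app.config["UPLOAD_FOLDER"] = "uploads"
# Assumed set of accepted extensions, based on the formats listed in this README.
ALLOWED_EXTENSIONS = {"pdf", "docx", "odt", "rtf", "png", "jpg", "jpeg"}


def allowed_file(filename):
    """Check the extension against the accepted set, ignoring case."""
    return "." in filename and filename.rsplit(".", 1)[1].lower() in ALLOWED_EXTENSIONS


@app.route("/upload", methods=["POST"])
def upload():
    file = request.files.get("file")
    if file is None or file.filename == "":
        return jsonify(error="No file provided"), 400
    if not allowed_file(file.filename):
        return jsonify(error="Unsupported file type"), 400

    os.makedirs(app.config["UPLOAD_FOLDER"], exist_ok=True)
    # secure_filename strips path separators and other unsafe characters.
    filename = secure_filename(file.filename)
    file.save(os.path.join(app.config["UPLOAD_FOLDER"], filename))
    # Parsing would happen here; this sketch only echoes the stored filename.
    return jsonify(filename=filename, status="uploaded")
```

Passing the name through `secure_filename` before saving keeps an upload from writing outside the `uploads/` directory.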


	# Tree map of program:

	main.py
	├── Handles API side
	├── File upload/remove
	├── Process resumes
	└── Show result
	utils
	├── fileTotext.py
	│   └── Converts files to text
	│       ├── PDF
	│       ├── DOCX
	│       ├── RTF
	│       ├── ODT
	│       ├── PNG
	│       ├── JPG
	│       └── JPEG
	├── mistral.py
	│   ├── Mistral API Calls
	│   │   └── Uses Mistral-Nemo-Instruct-2407 model
	│   ├── Personal and Professional Extraction
	│   │   ├── Extracts personal information
	│   │   └── Extracts professional information
	│   └── Fallback Mechanism
	│       └── Uses spaCy NER model if Mistral fails
	└── spacy.py
	    ├── Custom Trained Model
	    │   └── Uses spaCy model (ner_model_05_3)
	    ├── Named Entity Recognition
	    │   └── Extracts key information (Name, Email, Contact, etc.)
	    └── Validation
	        └── Validates emails and contacts


	# References:

	- [Flask Documentation](https://flask.palletsprojects.com/)
	- [spaCy Documentation](https://spacy.io/usage)
	- [Mistral Documentation](https://docs.mistral.ai/)
	- [Hugging Face Hub API](https://huggingface.co/docs/huggingface_hub/index)
	- [PyMuPDF (MuPDF) Documentation](https://pymupdf.readthedocs.io/en/latest/)
	- [python-docx Documentation](https://python-docx.readthedocs.io/en/latest/)
	- [Tesseract OCR Documentation](https://github.com/UB-Mannheim/tesseract/wiki)
	- [Virtual Environments in Python](https://docs.python.org/3/tutorial/venv.html)