\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\
\\----------- **Resume Parser** ----------\\
\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\
# Overview:
This project is a resume parsing tool built in Python.
It uses the Mistral-Nemo-Instruct-2407 model for primary parsing.
If the Mistral call fails or runs into problems,
the system falls back to a custom-trained spaCy model so that parsing still completes.
The tool is exposed through a Flask API and has a user interface built with HTML and CSS.
# Installation Guide:
1. Create and Activate a Virtual Environment
python -m venv venv
source venv/bin/activate # For Linux/Mac
# or
venv\Scripts\activate # For Windows
# NOTE: If the virtual environment (venv) already exists, skip the creation step and only activate it:
- For Linux/Mac:
source venv/bin/activate
- For Windows:
venv\Scripts\activate
2. Install Required Libraries
pip install -r requirements.txt
# Ensure the following dependencies are included:
- Flask
- spaCy
- huggingface_hub
- PyMuPDF
- python-docx
- Tesseract-OCR (for image-based parsing; installed separately from pip)
# NOTE: If any library is missing, install it with:
pip install <package_name>
_Replace <package_name> with the library you need to install._
3. Set Up Hugging Face Token
- Add your Hugging Face token to the .env file as:
HF_TOKEN=<your_huggingface_token>
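As a rough illustration of how the token could be read at startup, here is a minimal sketch that uses only the standard library. The helper names `load_env_file` and `get_hf_token` are hypothetical and are not taken from this project; the actual code may rely on a library such as python-dotenv instead.

```python
import os
from pathlib import Path


def load_env_file(path=".env"):
    """Read KEY=VALUE lines from a .env-style file into a dict."""
    values = {}
    env_path = Path(path)
    if not env_path.is_file():
        return values
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and anything without a KEY=VALUE shape.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("'\"")
    return values


def get_hf_token(path=".env"):
    """Prefer a real environment variable, then fall back to the .env file."""
    token = os.environ.get("HF_TOKEN") or load_env_file(path).get("HF_TOKEN")
    if not token:
        raise RuntimeError("HF_TOKEN is not set; add it to the .env file")
    return token
```

Checking the process environment first means a token exported in the shell or set by a hosting platform takes precedence over the file.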
# File Structure Overview:
Mistral_With_Spacy/
│
├── Spacy_Models/
│   └── ner_model_05_3 # Pretrained spaCy model directory for resume parsing
│
├── templates/
│   ├── index.html # UI for file upload
│   └── result.html # Display parsed results in structured JSON
│
├── uploads/ # Directory for uploaded resume files
│
├── utils/
│   ├── mistral.py # Code for calling Mistral API and handling responses
│   ├── spacy.py # spaCy fallback model for parsing resumes
│   ├── error.py # Error handling utilities
│   └── fileTotext.py # Functions to extract text from different file formats (PDF, DOCX, etc.)
│
├── venv/ # Virtual environment
│
├── .env # Environment variables file (contains Hugging Face token)
│
├── main.py # Flask app handling API routes for uploading and processing resumes
│
└── requirements.txt # Dependencies required for the project
# Program Overview:
# Mistral Integration (utils/mistral.py)
- Mistral API Calls: Uses the Mistral-Nemo-Instruct-2407 model, accessed through Hugging Face, to parse resumes.
- Personal and Professional Extraction: Two functions extract personal and professional information, each in structured JSON format.
- Fallback Mechanism: If Mistral fails, spaCy's NER model is used as a fallback.
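The primary-then-fallback flow described above could look roughly like the sketch below. It is an illustration under assumptions, not the code in utils/mistral.py: `extract_json`, `parse_resume`, and the `primary` / `fallback` callables are hypothetical names, and the real model call is left out so the example runs offline.

```python
import json


def extract_json(reply):
    """Parse the JSON object in a model reply (first '{' to last '}')."""
    start = reply.find("{")
    end = reply.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in model reply")
    return json.loads(reply[start:end + 1])


def parse_resume(text, primary, fallback):
    """Try the primary parser; on any failure, use the fallback parser.

    `primary` returns the raw model reply (a string);
    `fallback` returns an already structured dict.
    """
    try:
        return {"source": "mistral", "data": extract_json(primary(text))}
    except Exception as exc:  # API errors, timeouts, malformed JSON, ...
        return {"source": "spacy", "data": fallback(text), "error": str(exc)}
```

Recording which parser produced the result, plus the error that triggered the fallback, makes it easier to see how often the primary path fails.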
# spaCy Integration (utils/spacy.py)
- Custom Trained Model: Uses a spaCy model (ner_model_05_3) trained specifically for resume parsing.
- Named Entity Recognition: Extracts key information like Name, Email, Contact, Location, Skills, Experience, etc., from resumes.
- Validation: Includes validation for extracted emails and contacts.
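To make the validation step concrete, here is a minimal sketch of what checks on extracted emails and contact numbers might look like. The patterns, digit limits, and function names are assumptions chosen for illustration; the rules in utils/spacy.py may be stricter or looser.

```python
import re

# Deliberately simple pattern: local part, "@", domain, and a dotted TLD.
EMAIL_RE = re.compile(r"^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$")


def is_valid_email(value):
    """Return True if the value looks like a plausible email address."""
    return bool(EMAIL_RE.match(value.strip()))


def is_valid_contact(value, min_digits=7, max_digits=15):
    """Accept digits plus common separators, with an optional leading '+'."""
    candidate = value.strip()
    if not re.fullmatch(r"\+?[\d\s().-]+", candidate):
        return False
    # Count only the digits; separators and the '+' are ignored.
    digits = re.sub(r"\D", "", candidate)
    return min_digits <= len(digits) <= max_digits
```

A filter like this is useful after NER because entity spans can pick up stray tokens that only resemble an email or phone number.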
# File Conversion (utils/fileTotext.py)
- Text Extraction: Handles different resume formats (PDF, DOCX, ODT, RTF, and images such as PNG, JPG, JPEG) and extracts text for further processing.
- PDF Files: Uses PyMuPDF to extract text and, if necessary, Tesseract-OCR for image-based PDF content.
- DOCX Files: Uses `python-docx` to extract structured text from Word documents.
- ODT Files: Uses `odfpy` to extract text from ODT (OpenDocument) files.
- RTF Files: Reads plain text from RTF files.
- Images (PNG, JPG, JPEG): Uses Tesseract-OCR to extract text from image-based resumes.
Note: For Tesseract-OCR, install it locally by following the [installation guide](https://github.com/UB-Mannheim/tesseract/wiki).
- Hyperlink Extraction: Extracts hyperlinks from PDF files, capturing any embedded URLs during the parsing process.
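One way to route each upload to the right extractor is a lookup keyed on the file extension. The sketch below only shows that routing step; `BACKENDS` and `pick_backend` are hypothetical names, the `.rtf` extension follows the tree map further down, and the real utils/fileTotext.py may be organized differently.

```python
from pathlib import Path

# Hypothetical mapping from file extension to the backend listed above.
BACKENDS = {
    ".pdf": "pymupdf",
    ".docx": "python-docx",
    ".odt": "odfpy",
    ".rtf": "plain-text",
    ".png": "tesseract",
    ".jpg": "tesseract",
    ".jpeg": "tesseract",
}


def pick_backend(filename):
    """Return the name of the extraction backend for a given file name."""
    suffix = Path(filename).suffix.lower()
    try:
        return BACKENDS[suffix]
    except KeyError:
        # Reject unknown formats early, before any parsing is attempted.
        raise ValueError(f"unsupported file type: {suffix or '(none)'}") from None
```

Lowercasing the suffix means `resume.PDF` and `resume.pdf` are treated the same, and an unsupported format raises an error the caller can report.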
# Error Handling (utils/error.py)
- Handles API response errors and file format issues, and ensures fallbacks run smoothly without crashing the app.
# Flask API (main.py)
- Endpoints: /upload accepts resume uploads; parsed results are displayed in JSON format on the results page.
- UI: Simple interface for uploading resumes and viewing the parsing results.
# Tree map of program:
main.py
├── Handles API side
├── File upload/remove
├── Process resumes
└── Show result
utils
├── fileTotext.py
│   └── Converts files to text
│       ├── PDF
│       ├── DOCX
│       ├── RTF
│       ├── ODT
│       ├── PNG
│       ├── JPG
│       └── JPEG
├── mistral.py
│   ├── Mistral API Calls
│   │   └── Uses Mistral-Nemo-Instruct-2407 model
│   ├── Personal and Professional Extraction
│   │   ├── Extracts personal information
│   │   └── Extracts professional information
│   └── Fallback Mechanism
│       └── Uses spaCy NER model if Mistral fails
└── spacy.py
    ├── Custom Trained Model
    │   └── Uses spaCy model (ner_model_05_3)
    ├── Named Entity Recognition
    │   └── Extracts key information (Name, Email, Contact, etc.)
    └── Validation
        └── Validates emails and contacts
# References:
- [Flask Documentation](https://flask.palletsprojects.com/)
- [spaCy Documentation](https://spacy.io/usage)
- [Mistral Documentation](https://docs.mistral.ai/)
- [Hugging Face Hub API](https://huggingface.co/docs/huggingface_hub/index)
- [PyMuPDF (MuPDF) Documentation](https://pymupdf.readthedocs.io/en/latest/)
- [python-docx Documentation](https://python-docx.readthedocs.io/en/latest/)
- [Tesseract OCR Documentation](https://github.com/UB-Mannheim/tesseract/wiki)
- [Virtual Environments in Python](https://docs.python.org/3/tutorial/venv.html)