\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\
\\----------- **Resume Parser** ----------\\
\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\
# Overview:
This project is a comprehensive Resume Parsing tool built using Python,
integrating the Mistral-Nemo-Instruct-2407 model for primary parsing.
If the Mistral call fails,
the system falls back to a custom-trained spaCy model so that parsing can continue.
The tool is wrapped with a Flask API and has a user interface built using HTML and CSS.
# Installation Guide:
1. Create and Activate a Virtual Environment
python -m venv venv
source venv/bin/activate # For Linux/Mac
# or
venv\Scripts\activate # For Windows
# NOTE: If the virtual environment (venv) already exists, skip `python -m venv venv` and run only the activation command for your platform.
2. Install Required Libraries
pip install -r requirements.txt
# Ensure the following dependencies are included:
- Flask
- spaCy
- huggingface_hub
- PyMuPDF
- python-docx
- odfpy (for ODT files)
- Tesseract-OCR (for image-based parsing; a system program that is installed separately, not through pip)
# NOTE: If any model or library is not installed, you can install it using:
pip install <model_name>
_Replace <model_name> with the specific model or library you need to install._
3. Set up Hugging Face Token
- Add your Hugging Face token to the .env file as:
HF_TOKEN=<your_huggingface_token>
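How the application reads the token at runtime is not shown here. The following is a minimal sketch, assuming a plain `KEY=value` .env file and using only the standard library; `load_env_file` and `get_hf_token` are hypothetical names, and the project may instead rely on a helper package such as python-dotenv.

```python
import os
from pathlib import Path


def load_env_file(path=".env"):
    """Read KEY=value pairs from a .env file into os.environ (hypothetical helper)."""
    env_path = Path(path)
    if not env_path.exists():
        return
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and anything that is not KEY=value.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        # Values already set in the real environment take precedence.
        os.environ.setdefault(key.strip(), value.strip().strip('"').strip("'"))


def get_hf_token():
    """Return the Hugging Face token, or fail early with a clear message."""
    token = os.environ.get("HF_TOKEN")
    if not token:
        raise RuntimeError("HF_TOKEN is not set; add it to the .env file")
    return token
```

Failing early when the token is missing makes a configuration problem visible at startup rather than on the first API call.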
# File Structure Overview:
Mistral_With_Spacy/
│
├── Spacy_Models/
│   └── ner_model_05_3       # Pretrained spaCy model directory for resume parsing
│
├── templates/
│   ├── index.html           # UI for file upload
│   └── result.html          # Display parsed results in structured JSON
│
├── uploads/                 # Directory for uploaded resume files
│
├── utils/
│   ├── mistral.py           # Code for calling Mistral API and handling responses
│   ├── spacy.py             # spaCy fallback model for parsing resumes
│   ├── error.py             # Error handling utilities
│   └── fileTotext.py        # Functions to extract text from different file formats (PDF, DOCX, etc.)
│
├── venv/                    # Virtual environment
│
├── .env                     # Environment variables file (contains Hugging Face token)
│
├── main.py                  # Flask app handling API routes for uploading and processing resumes
│
└── requirements.txt         # Dependencies required for the project
# Program Overview:
# Mistral Integration (utils/mistral.py)
- Mistral API Calls: Uses Hugging Face's Mistral-Nemo-Instruct-2407 model to parse resumes.
- Personal and Professional Extraction: Two functions extract personal and professional information in structured JSON format.
- Fallback Mechanism: If Mistral fails, spaCy's NER model is used as a fallback.
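The fallback behaviour described above can be sketched as follows. This is an illustration rather than the code in utils/mistral.py: `parse_with_fallback`, `extract_json_object`, and the two parser callables are hypothetical names, and the assumption is that each parser takes the resume text and returns a dict.

```python
import json
import logging

logger = logging.getLogger(__name__)


def parse_with_fallback(resume_text, primary_parser, fallback_parser):
    """Try the primary (LLM) parser; use the fallback (NER) parser if it fails.

    Both parsers are hypothetical callables: resume text in, dict out.
    """
    try:
        result = primary_parser(resume_text)
        if not isinstance(result, dict) or not result:
            raise ValueError("primary parser returned no structured data")
        return {"source": "mistral", "data": result}
    except Exception as exc:  # API errors, timeouts, malformed JSON, ...
        logger.warning("Primary parser failed (%s); using spaCy fallback", exc)
        return {"source": "spacy", "data": fallback_parser(resume_text)}


def extract_json_object(model_reply):
    """Parse the text between the first '{' and the last '}' of a model reply."""
    start, end = model_reply.find("{"), model_reply.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in model reply")
    return json.loads(model_reply[start:end + 1])
```

Recording which parser produced the result (the `source` key here) makes it easier to see how often the fallback is actually used.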
# SpaCy Integration (utils/spacy.py)
- Custom Trained Model: Uses a spaCy model (ner_model_05_3) trained specifically for resume parsing.
- Named Entity Recognition: Extracts key information like Name, Email, Contact, Location, Skills, Experience, etc., from resumes.
- Validation: Includes validation for extracted emails and contacts.
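The validation rules themselves are not listed here. As a rough sketch of what validating extracted emails and contacts could look like, assuming simple regular-expression checks and hypothetical entity labels `Email` and `Contact`:

```python
import re

# Deliberately simple patterns; real-world email and phone formats vary widely.
EMAIL_RE = re.compile(r"^[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(\.[A-Za-z0-9-]+)+$")
PHONE_RE = re.compile(r"^\+?\d{7,15}$")


def is_valid_email(value):
    return bool(EMAIL_RE.match(value.strip()))


def is_valid_contact(value):
    """Accept 7 to 15 digits, optionally prefixed with '+', ignoring separators."""
    digits = re.sub(r"[\s().-]", "", value)
    return bool(PHONE_RE.match(digits))


def validate_entities(entities):
    """Drop Email/Contact values that fail validation.

    `entities` maps (hypothetical) NER labels to lists of extracted strings.
    """
    cleaned = dict(entities)
    cleaned["Email"] = [e for e in entities.get("Email", []) if is_valid_email(e)]
    cleaned["Contact"] = [c for c in entities.get("Contact", []) if is_valid_contact(c)]
    return cleaned
```

A post-processing step like this helps because an NER model can label a span as an email or phone number even when the text is only partially captured.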
# File Conversion (utils/fileTotext.py)
- Text Extraction: Handles several resume formats (PDF, DOCX, ODT, RTF, and images such as PNG, JPG, and JPEG) and extracts their text for further processing.
- PDF Files: Uses PyMuPDF to extract text and, if necessary, Tesseract-OCR for image-based PDF content.
- DOCX Files: Uses `python-docx` to extract structured text from Word documents.
- ODT Files: Uses `odfpy` to extract text from ODT (OpenDocument) files.
- RTF Files: Reads plain text from RTF files.
- Images (PNG, JPG, JPEG): Uses Tesseract-OCR to extract text from image-based resumes.
  Note: Tesseract-OCR must be installed locally; see the [installation guide](https://github.com/UB-Mannheim/tesseract/wiki).
- Hyperlink Extraction: Extracts hyperlinks embedded in PDF files during parsing.
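One common way to organise this kind of per-format extraction is a dispatch table keyed on file extension. The sketch below covers only a subset of the formats and is not the code in utils/fileTotext.py: every function name is hypothetical, and the use of `pytesseract` with Pillow as the Python bridge to Tesseract-OCR is an assumption.

```python
from pathlib import Path


def extract_pdf(path):
    import fitz  # PyMuPDF; imported lazily so other formats work without it

    with fitz.open(path) as doc:
        return "\n".join(page.get_text() for page in doc)


def extract_pdf_links(path):
    import fitz

    with fitz.open(path) as doc:
        # get_links() returns dicts; external hyperlinks carry a "uri" key.
        return [link["uri"] for page in doc for link in page.get_links() if "uri" in link]


def extract_docx(path):
    import docx  # python-docx

    return "\n".join(p.text for p in docx.Document(path).paragraphs)


def extract_image(path):
    # Assumption: pytesseract and Pillow wrap the local Tesseract-OCR install.
    import pytesseract
    from PIL import Image

    return pytesseract.image_to_string(Image.open(path))


# Map lower-case file extensions to the function that handles them.
EXTRACTORS = {
    ".pdf": extract_pdf,
    ".docx": extract_docx,
    ".png": extract_image,
    ".jpg": extract_image,
    ".jpeg": extract_image,
}


def get_extractor(path):
    suffix = Path(path).suffix.lower()
    try:
        return EXTRACTORS[suffix]
    except KeyError:
        raise ValueError(f"Unsupported file type: {suffix or '(none)'}") from None


def file_to_text(path):
    return get_extractor(path)(path)
```

With a table like this, supporting another format means adding one function and one dictionary entry, and unsupported files fail with a clear error before any parsing starts.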
# Error Handling (utils/error.py)
- Handles API response errors and file format issues, and ensures that fallbacks run without crashing the app.
# Flask API (main.py)
- Endpoint: /upload for uploading resumes.
- Results: Displays parsed results in JSON format on the results page.
- UI: Simple interface for uploading resumes and viewing the parsing results.
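A minimal sketch of what an /upload route could look like in Flask is given below. It is an assumption-laden illustration, not the contents of main.py: the form field name `file`, the `parse_resume` placeholder, the extension list (a subset), and returning JSON directly instead of rendering result.html are all choices made for the example.

```python
import os

from flask import Flask, jsonify, request
from werkzeug.utils import secure_filename

# Subset of the supported formats, used here only for illustration.
ALLOWED_EXTENSIONS = {".pdf", ".docx", ".odt", ".png", ".jpg", ".jpeg"}

app = Flask(__name__)
app.config["UPLOAD_FOLDER"] = "uploads"


def parse_resume(path):
    """Placeholder for the real pipeline (text extraction, Mistral, spaCy fallback)."""
    return {"file": os.path.basename(path)}


@app.route("/upload", methods=["POST"])
def upload():
    uploaded = request.files.get("file")
    if uploaded is None or uploaded.filename == "":
        return jsonify(error="No file provided"), 400

    # secure_filename strips path separators and other unsafe characters.
    filename = secure_filename(uploaded.filename)
    if os.path.splitext(filename)[1].lower() not in ALLOWED_EXTENSIONS:
        return jsonify(error="Unsupported file type"), 400

    os.makedirs(app.config["UPLOAD_FOLDER"], exist_ok=True)
    saved_path = os.path.join(app.config["UPLOAD_FOLDER"], filename)
    uploaded.save(saved_path)
    return jsonify(parse_resume(saved_path))
```

Checking the extension before saving keeps unsupported files out of the uploads directory, and sanitising the filename guards against path traversal in user-supplied names.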
# Tree map of program:
main.py
├── Handles API side
├── File upload/remove
├── Process resumes
└── Show result
utils
├── fileTotext.py
│   └── Converts files to text
│       ├── PDF
│       ├── DOCX
│       ├── RTF
│       ├── ODT
│       ├── PNG
│       ├── JPG
│       └── JPEG
├── mistral.py
│   ├── Mistral API Calls
│   │   └── Uses Mistral-Nemo-Instruct-2407 model
│   ├── Personal and Professional Extraction
│   │   ├── Extracts personal information
│   │   └── Extracts professional information
│   └── Fallback Mechanism
│       └── Uses spaCy NER model if Mistral fails
└── spacy.py
    ├── Custom Trained Model
    │   └── Uses spaCy model (ner_model_05_3)
    ├── Named Entity Recognition
    │   └── Extracts key information (Name, Email, Contact, etc.)
    └── Validation
        └── Validates emails and contacts
# References:
- [Flask Documentation](https://flask.palletsprojects.com/)
- [spaCy Documentation](https://spacy.io/usage)
- [Mistral Documentation](https://docs.mistral.ai/)
- [Hugging Face Hub API](https://huggingface.co/docs/huggingface_hub/index)
- [PyMuPDF (MuPDF) Documentation](https://pymupdf.readthedocs.io/en/latest/)
- [python-docx Documentation](https://python-docx.readthedocs.io/en/latest/)
- [Tesseract OCR Documentation](https://github.com/UB-Mannheim/tesseract/wiki)
- [Virtual Environments in Python](https://docs.python.org/3/tutorial/venv.html)