----------- Resume Parser -----------
Overview:
This project is a comprehensive Resume Parsing tool built using Python, integrating the Mistral-Nemo-Instruct-2407 model for primary parsing. If Mistral fails or encounters issues, the system falls back to a custom-trained spaCy model to ensure continued functionality. The tool is wrapped with a Flask API and has a user interface built using HTML and CSS.
Installation Guide:
Create and Activate a Virtual Environment
python -m venv venv
source venv/bin/activate # For Linux/Mac
or
venv\Scripts\activate # For Windows
NOTE: If the virtual environment (venv) is already created, you can skip the creation step and just activate.
- For Linux/Mac: source venv/bin/activate
- For Windows: venv\Scripts\activate
Install Required Libraries
pip install -r requirements.txt
Ensure the following dependencies are included:
- Flask
- spaCy
- huggingface_hub
- PyMuPDF
- python-docx
- odfpy (for ODT files)
- Tesseract-OCR (for image-based parsing; installed locally as a separate program)
NOTE: If any model or library is not installed, you can install it using: pip install <package-name>, replacing <package-name> with the specific model or library you need to install.
Set up Hugging Face Token
- Add your Hugging Face token to the .env file as: HF_TOKEN=
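As an illustration of how the token might be read at startup, here is a minimal standard-library sketch. The helper names `read_env_file` and `get_hf_token` are hypothetical and not taken from the project, which may well use a package such as python-dotenv instead.

```python
from pathlib import Path
from typing import Dict


def read_env_file(path: str) -> Dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values: Dict[str, str] = {}
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def get_hf_token(path: str = ".env") -> str:
    """Return HF_TOKEN from the file, failing early if it is missing or empty."""
    token = read_env_file(path).get("HF_TOKEN", "")
    if not token:
        raise RuntimeError("HF_TOKEN is missing or empty in " + path)
    return token
```

Failing early on an empty token gives a clearer message than an authentication error surfacing later from the model call.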
File Structure Overview:
Mistral_With_Spacy/
│
├── Spacy_Models/
│   └── ner_model_05_3 # Pretrained spaCy model directory for resume parsing
│
├── templates/
│   ├── index.html # UI for file upload
│   └── result.html # Display parsed results in structured JSON
│
├── uploads/ # Directory for uploaded resume files
│
├── utils/
│   ├── mistral.py # Code for calling Mistral API and handling responses
│   ├── spacy.py # spaCy fallback model for parsing resumes
│   ├── error.py # Error handling utilities
│   └── fileTotext.py # Functions to extract text from different file formats (PDF, DOCX, etc.)
│
├── venv/ # Virtual environment
│
├── .env # Environment variables file (contains Hugging Face token)
│
├── main.py # Flask app handling API routes for uploading and processing resumes
│
└── requirements.txt # Dependencies required for the project
Program Overview:
# Mistral Integration (utils/mistral.py)
- Mistral API Calls: Uses Hugging Face's Mistral-Nemo-Instruct-2407 model to parse resumes.
- Personal and Professional Extraction: Two functions extract personal and professional information in structured JSON format.
- Fallback Mechanism: If Mistral fails, spaCy's NER model is used as a fallback.
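The fallback behaviour described above could be sketched roughly as follows. This is not the project's actual code: `parse_resume`, `primary`, and `fallback` are hypothetical names, and the sketch assumes the primary parser returns a JSON string while the fallback returns a dictionary.

```python
import json
from typing import Any, Callable, Dict


def parse_resume(
    text: str,
    primary: Callable[[str], str],
    fallback: Callable[[str], Dict[str, Any]],
) -> Dict[str, Any]:
    """Try the primary (LLM) parser first; use the fallback on any failure."""
    try:
        raw = primary(text)          # may raise on network or API errors
        data = json.loads(raw)       # may raise if the model output is not JSON
        if not isinstance(data, dict):
            raise ValueError("expected a JSON object")
        return {"source": "primary", "data": data}
    except Exception as exc:
        # Broad catch is deliberate here: any primary failure triggers the fallback.
        return {"source": "fallback", "data": fallback(text), "reason": str(exc)}
```

Recording which parser produced the result, and why the fallback ran, makes it easier to spot how often the primary model is failing.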
# spaCy Integration (utils/spacy.py)
- Custom Trained Model: Uses a spaCy model (ner_model_05_3) trained specifically for resume parsing.
- Named Entity Recognition: Extracts key information like Name, Email, Contact, Location, Skills, Experience, etc., from resumes.
- Validation: Includes validation for extracted emails and contacts.
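To illustrate the kind of validation meant here, the sketch below checks an email with a regular expression and normalizes a contact number. The pattern, the 7 to 15 digit range, and the function names are assumptions made for this example; the project's own rules may differ.

```python
import re
from typing import Optional

# Deliberately permissive pattern: something@domain.tld
EMAIL_RE = re.compile(r"^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$")


def valid_email(value: str) -> bool:
    """Return True if the value looks like an email address."""
    return EMAIL_RE.match(value.strip()) is not None


def normalize_phone(value: str) -> Optional[str]:
    """Keep digits (and a leading +); return None if the digit count is implausible."""
    stripped = value.strip()
    digits = re.sub(r"\D", "", stripped)
    if not 7 <= len(digits) <= 15:
        return None
    return ("+" if stripped.startswith("+") else "") + digits
```

Entities that fail these checks can be dropped or flagged rather than passed through to the final JSON.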
# File Conversion (utils/fileTotext.py)
- Text Extraction: Handles different resume formats (PDF, DOCX, ODT, RTF, and images such as PNG, JPG, JPEG) and extracts text for further processing.
- PDF Files: Uses PyMuPDF to extract text and, if necessary, Tesseract-OCR for image-based PDF content.
- DOCX Files: Uses `python-docx` to extract structured text from Word documents.
- ODT Files: Uses `odfpy` to extract text from ODT (OpenDocument) files.
- RTF Files: Reads plain text from RTF files.
- Images (PNG, JPG, JPEG): Uses Tesseract-OCR to extract text from image-based resumes.
Note: For Tesseract-OCR, install it locally by following the [installation guide](https://github.com/UB-Mannheim/tesseract/wiki).
- Hyperlink Extraction: Extracts hyperlinks from PDF files, capturing any embedded URLs during the parsing process.
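One way to route each upload to the right extractor is to dispatch on the file extension. The sketch below shows only that routing step; the `EXTENSION_KINDS` table and `detect_kind` are hypothetical names, and the actual extractors (PyMuPDF, python-docx, odfpy, Tesseract-OCR) are left out.

```python
from pathlib import Path

# Assumed mapping from file extension to extractor family.
EXTENSION_KINDS = {
    ".pdf": "pdf",
    ".docx": "docx",
    ".odt": "odt",
    ".rtf": "rtf",
    ".png": "image",
    ".jpg": "image",
    ".jpeg": "image",
}


def detect_kind(filename: str) -> str:
    """Return the extractor family for a file name, or raise ValueError."""
    suffix = Path(filename).suffix.lower()
    try:
        return EXTENSION_KINDS[suffix]
    except KeyError:
        raise ValueError(f"unsupported file type: {suffix or '(none)'}") from None
```

Raising a `ValueError` for unknown extensions lets the caller turn unsupported uploads into a clean error response instead of a crash.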
# Error Handling (utils/error.py)
- Handles API response errors and file format issues, and ensures that fallbacks run smoothly without crashing the app.
# Flask API (main.py)
- Endpoints: /upload accepts resume uploads; the parsed results are displayed in JSON format on the results page.
- UI: Simple interface for uploading resumes and viewing the parsing results.
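A minimal Flask sketch of an upload endpoint is shown below. It is an illustration under stated assumptions, not the contents of main.py: the form field name `file`, the allowed-extension set, and the JSON response are all assumed, and the real app renders result.html rather than returning JSON directly.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

# Assumed set of accepted extensions.
ALLOWED_EXTENSIONS = {"pdf", "docx", "odt", "rtf", "png", "jpg", "jpeg"}


def allowed_file(filename: str) -> bool:
    """Check the extension of the uploaded file name."""
    return "." in filename and filename.rsplit(".", 1)[1].lower() in ALLOWED_EXTENSIONS


@app.route("/upload", methods=["POST"])
def upload():
    uploaded = request.files.get("file")  # "file" is an assumed form field name
    if uploaded is None or uploaded.filename == "":
        return jsonify({"error": "no file provided"}), 400
    if not allowed_file(uploaded.filename):
        return jsonify({"error": "unsupported file type"}), 400
    # A real handler would save the file, extract text, and parse it here.
    return jsonify({"filename": uploaded.filename, "status": "received"}), 200
```

Rejecting unsupported files before any parsing starts keeps bad uploads from reaching the model or the OCR step.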
Tree map of program:
main.py
├── Handles API side
├── File upload/remove
├── Process resumes
└── Show result
utils
├── fileTotext.py
│   └── Converts files to text
│       ├── PDF
│       ├── DOCX
│       ├── RTF
│       ├── ODT
│       ├── PNG
│       ├── JPG
│       └── JPEG
├── mistral.py
│   ├── Mistral API Calls
│   │   └── Uses Mistral-Nemo-Instruct-2407 model
│   ├── Personal and Professional Extraction
│   │   ├── Extracts personal information
│   │   └── Extracts professional information
│   └── Fallback Mechanism
│       └── Uses spaCy NER model if Mistral fails
└── spacy.py
    ├── Custom Trained Model
    │   └── Uses spaCy model (ner_model_05_3)
    ├── Named Entity Recognition
    │   └── Extracts key information (Name, Email, Contact, etc.)
    └── Validation
        └── Validates emails and contacts