\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\
\\----------- **Resume Parser** ----------\\
\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\

# Overview:

This project is a resume parsing tool built in Python. It uses the Mistral-Nemo-Instruct-2407 model for primary parsing. If the Mistral call fails or returns an error, the system falls back to a custom-trained spaCy model so that parsing continues. The tool is served through a Flask API and has a user interface built with HTML and CSS.

# Installation Guide:

1. Create and Activate a Virtual Environment

        python -m venv venv
        source venv/bin/activate   # For Linux/Mac
        # or
        venv\Scripts\activate      # For Windows

   NOTE: If the virtual environment (venv) already exists, skip the creation step and just activate it.
   - For Linux/Mac: source venv/bin/activate
   - For Windows: venv\Scripts\activate

2. Install Required Libraries

        pip install -r requirements.txt

   Ensure the following dependencies are included:
   - Flask
   - spaCy
   - huggingface_hub
   - PyMuPDF
   - python-docx
   - Tesseract-OCR (for image-based parsing)

   NOTE: If any model or library is not installed, you can install it with the command below, replacing `<package-name>` with the specific model or library you need:

        pip install <package-name>

3. Set up Hugging Face Token
   - Add your Hugging Face token to the .env file as:

        HF_TOKEN=

# File Structure Overview:

    Mistral_With_Spacy/
    │
    ├── Spacy_Models/
    │   └── ner_model_05_3      # Pretrained spaCy model directory for resume parsing
    │
    ├── templates/
    │   ├── index.html          # UI for file upload
    │   └── result.html         # Display parsed results in structured JSON
    │
    ├── uploads/                # Directory for uploaded resume files
    │
    ├── utils/
    │   ├── mistral.py          # Code for calling Mistral API and handling responses
    │   ├── spacy.py            # spaCy fallback model for parsing resumes
    │   ├── error.py            # Error handling utilities
    │   └── fileTotext.py       # Functions to extract text from different file formats (PDF, DOCX, etc.)
    │
    ├── venv/                   # Virtual environment
    │
    ├── .env                    # Environment variables file (contains Hugging Face token)
    │
    ├── main.py                 # Flask app handling API routes for uploading and processing resumes
    │
    └── requirements.txt        # Dependencies required for the project

# Program Overview:

# Mistral Integration (utils/mistral.py)
- Mistral API Calls: Uses Hugging Face's Mistral-Nemo-Instruct-2407 model to parse resumes.
- Personal and Professional Extraction: Two functions extract personal and professional information in structured JSON format.
- Fallback Mechanism: If Mistral fails, spaCy's NER model is used as a fallback.

# SpaCy Integration (utils/spacy.py)
- Custom Trained Model: Uses a spaCy model (ner_model_05_3) trained specifically for resume parsing.
- Named Entity Recognition: Extracts key information such as Name, Email, Contact, Location, Skills, and Experience from resumes.
- Validation: Validates extracted emails and contact numbers.

# File Conversion (utils/fileTotext.py)
- Text Extraction: Handles different resume formats (PDF, DOCX, ODT, RTF, and images such as PNG, JPG, JPEG) and extracts text for further processing.
  - PDF Files: Uses PyMuPDF to extract text and, if necessary, Tesseract-OCR for image-based PDF content.
  - DOCX Files: Uses `python-docx` to extract structured text from Word documents.
  - ODT Files: Uses `odfpy` to extract text from ODT (OpenDocument) files.
  - RTF Files: Reads plain text from RTF files.
  - Images (PNG, JPG, JPEG): Uses Tesseract-OCR to extract text from image-based resumes.

  Note: Tesseract-OCR must be installed locally; follow the [installation guide](https://github.com/UB-Mannheim/tesseract/wiki).
- Hyperlink Extraction: Extracts hyperlinks from PDF files, capturing any embedded URLs during parsing.

# Error Handling (utils/error.py)
- Manages API response errors and file format issues, and ensures fallbacks happen without crashing the app.

# Flask API (main.py)
Endpoints:
- `/upload` for uploading resumes.
- Displays parsed results in JSON format on the results page.
- UI: Simple interface for uploading resumes and viewing the parsing results.

# Tree map of program:

    main.py
    ├── Handles API side
    ├── File upload/remove
    ├── Process resumes
    └── Show result

    utils
    ├── fileTotext.py
    │   └── Converts files to text
    │       ├── PDF
    │       ├── DOCX
    │       ├── RTF
    │       ├── ODT
    │       ├── PNG
    │       ├── JPG
    │       └── JPEG
    ├── mistral.py
    │   ├── Mistral API Calls
    │   │   └── Uses Mistral-Nemo-Instruct-2407 model
    │   ├── Personal and Professional Extraction
    │   │   ├── Extracts personal information
    │   │   └── Extracts professional information
    │   └── Fallback Mechanism
    │       └── Uses spaCy NER model if Mistral fails
    └── spacy.py
        ├── Custom Trained Model
        │   └── Uses spaCy model (ner_model_05_3)
        ├── Named Entity Recognition
        │   └── Extracts key information (Name, Email, Contact, etc.)
        └── Validation
            └── Validates emails and contacts

# References:
- [Flask Documentation](https://flask.palletsprojects.com/)
- [spaCy Documentation](https://spacy.io/usage)
- [Mistral Documentation](https://docs.mistral.ai/)
- [Hugging Face Hub API](https://huggingface.co/docs/huggingface_hub/index)
- [PyMuPDF (MuPDF) Documentation](https://pymupdf.readthedocs.io/en/latest/)
- [python-docx Documentation](https://python-docx.readthedocs.io/en/latest/)
- [Tesseract OCR Documentation](https://github.com/UB-Mannheim/tesseract/wiki)
- [Virtual Environments in Python](https://docs.python.org/3/tutorial/venv.html)
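
# Appendix: Fallback Sketch

The fallback behaviour described in this README (try Mistral first, use spaCy if it fails) could be sketched roughly as below. This is a minimal illustration, not the code in utils/mistral.py: the names `parse_with_fallback`, `failing_mistral`, and `fake_spacy` are hypothetical, and the two stand-in parsers only simulate an API failure and a model result so the example runs without a token or a trained model.

```python
from typing import Callable, Optional

# Hypothetical type: a parser takes resume text and returns a dict of fields.
Parser = Callable[[str], Optional[dict]]


def parse_with_fallback(text: str, primary: Parser, fallback: Parser) -> dict:
    """Try the primary parser; use the fallback if it raises or returns nothing."""
    try:
        result = primary(text)
        if result:  # None or an empty dict counts as a failure
            return {"source": "primary", "data": result}
    except Exception as exc:  # broad on purpose: any primary failure triggers the fallback
        print(f"Primary parser failed: {exc}")
    return {"source": "fallback", "data": fallback(text) or {}}


# Stand-in for the Mistral call (assumption: it may raise on API errors).
def failing_mistral(text: str) -> dict:
    raise RuntimeError("simulated API error")


# Stand-in for the spaCy model (assumption: it returns a dict of entities).
def fake_spacy(text: str) -> dict:
    return {"Name": text.split()[0]}


if __name__ == "__main__":
    print(parse_with_fallback("Jane Doe, Python developer", failing_mistral, fake_spacy))
```

Passing the two parsers in as arguments keeps the fallback logic independent of any particular model, which makes it easy to test without network access.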