\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\
\\----------- **Resume Parser** ----------\\
\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\

# Overview:
This project is a resume parsing tool built in Python.
It uses the Mistral-Nemo-Instruct-2407 model for primary parsing.
If the Mistral call fails or returns an error,
the system falls back to a custom-trained spaCy model so that parsing can continue.
The tool is exposed through a Flask API and has a user interface built with HTML and CSS.


# Installation Guide:

1. Create and Activate a Virtual Environment
    python -m venv venv
    source venv/bin/activate  # For Linux/Mac
    # or
    venv\Scripts\activate  # For Windows

    # NOTE: If the virtual environment (venv) already exists, skip `python -m venv venv` and run only the activation command for your platform.

2. Install Required Libraries
    pip install -r requirements.txt

    # Ensure the following dependencies are included:
    - Flask
    - spaCy
    - huggingface_hub
    - PyMuPDF
    - python-docx
    - odfpy (for ODT files)
    - Tesseract-OCR (for image-based parsing; a system program installed separately, not through pip)

    # NOTE: If any model or library is not installed, install it with:
    pip install <model_name>
    _Replace <model_name> with the specific model or library you need to install._

3. Set up Hugging Face Token
    - Add your Hugging Face token to the .env file as:
    HF_TOKEN=<your_huggingface_token>
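
The token then has to be read at startup. The sketch below shows one way to do that with the standard library only; `load_env_file` and `get_hf_token` are hypothetical helper names, not functions from this project, and the project may well use a package such as python-dotenv instead.

```python
import os
from pathlib import Path


def load_env_file(path=".env"):
    """Read KEY=VALUE lines from a .env file into os.environ.

    Values already present in the environment are not overwritten.
    Returns the key/value pairs found in the file.
    """
    env_path = Path(path)
    if not env_path.is_file():
        return {}
    loaded = {}
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and anything that is not KEY=VALUE.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key = key.strip()
        value = value.strip().strip('"').strip("'")
        loaded[key] = value
        os.environ.setdefault(key, value)
    return loaded


def get_hf_token():
    """Return HF_TOKEN or fail early with a readable message."""
    token = os.environ.get("HF_TOKEN")
    if not token:
        raise RuntimeError("HF_TOKEN is not set; add it to .env or the environment")
    return token
```

Calling `load_env_file()` once before the first model request, then `get_hf_token()` wherever the token is needed, keeps the secret out of the source code.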


# File Structure Overview:
    Mistral_With_Spacy/
    │
    ├── Spacy_Models/
    │   └── ner_model_05_3  # Pretrained spaCy model directory for resume parsing
    │
    ├── templates/
    │   ├── index.html  # UI for file upload
    │   └── result.html  # Display parsed results in structured JSON
    │
    ├── uploads/  # Directory for uploaded resume files
    │
    ├── utils/
    │   ├── mistral.py  # Code for calling Mistral API and handling responses
    │   ├── spacy.py  # spaCy fallback model for parsing resumes
    │   ├── error.py  # Error handling utilities
    │   └── fileTotext.py  # Functions to extract text from different file formats (PDF, DOCX, etc.)
    │
    ├── venv/  # Virtual environment
    │
    ├── .env  # Environment variables file (contains Hugging Face token)
    │
    ├── main.py  # Flask app handling API routes for uploading and processing resumes
    │
    └── requirements.txt  # Dependencies required for the project


# Program Overview:

    # Mistral Integration (utils/mistral.py)
        - Mistral API Calls: Uses Hugging Face's Mistral-Nemo-Instruct-2407 model to parse resumes.
        - Personal and Professional Extraction: Two functions extract personal and professional information in structured JSON format.
        - Fallback Mechanism: If Mistral fails, spaCy's NER model is used as a fallback.
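
The fallback behaviour can be sketched as a small wrapper. This is an illustration under assumptions, not the code in utils/mistral.py: `parse_with_fallback` is a hypothetical name, and `primary` and `fallback` stand in for whatever callables wrap the Mistral request and the spaCy model.

```python
import logging

logger = logging.getLogger(__name__)


def parse_with_fallback(text, primary, fallback):
    """Try the primary parser first; use the fallback if it raises or returns nothing."""
    try:
        result = primary(text)
        if result:  # None or an empty dict counts as a failed parse
            return {"parser": "mistral", "data": result}
        logger.warning("Primary parser returned no data; using fallback")
    except Exception as exc:  # broad on purpose: any API error triggers the fallback
        logger.warning("Primary parser failed (%s); using fallback", exc)
    return {"parser": "spacy", "data": fallback(text)}
```

Recording which parser produced the result makes it easier to see, in the output, how often the fallback is being used.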

    # spaCy Integration (utils/spacy.py)
        - Custom Trained Model: Uses a spaCy model (ner_model_05_3) trained specifically for resume parsing.
        - Named Entity Recognition: Extracts key information like Name, Email, Contact, Location, Skills, Experience, etc., from resumes.
        - Validation: Includes validation for extracted emails and contacts.
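
As an illustration of the validation step, the sketch below checks extracted emails and contact numbers with regular expressions. The patterns and the 7 to 15 digit range are assumptions chosen for the example; the actual rules in utils/spacy.py may differ.

```python
import re

# Deliberately simple pattern: local part, "@", domain, and a TLD of 2+ letters.
EMAIL_RE = re.compile(r"^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$")


def is_valid_email(value):
    """Return True if the value looks like an email address."""
    return bool(EMAIL_RE.match(value.strip()))


def is_valid_contact(value, min_digits=7, max_digits=15):
    """Return True if the value looks like a phone number.

    Only digits, spaces, parentheses, '+', '.', and '-' are allowed,
    and the number of digits must fall within the given range.
    """
    cleaned = value.strip()
    if not re.fullmatch(r"[\d\s()+.-]+", cleaned):
        return False
    digit_count = len(re.sub(r"\D", "", cleaned))
    return min_digits <= digit_count <= max_digits
```

Entities that fail these checks can be dropped from the NER output so that malformed values do not reach the final JSON.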

    # File Conversion (utils/fileTotext.py)
        - Text Extraction: Handles different resume formats (PDF, DOCX, ODT, RTF, and images like PNG, JPG, JPEG) and extracts text for further processing.
          - PDF Files: Uses PyMuPDF to extract text and, if necessary, Tesseract-OCR for image-based PDF content.
          - DOCX Files: Uses `python-docx` to extract structured text from Word documents.
          - ODT Files: Uses `odfpy` to extract text from ODT (OpenDocument) files.
          - RTF Files: Reads plain text from RTF files.
          - Images (PNG, JPG, JPEG): Uses Tesseract-OCR to extract text from image-based resumes.
            Note: For Tesseract-OCR, install it locally by following the [installation guide](https://github.com/UB-Mannheim/tesseract/wiki).
          - Hyperlink Extraction: Extracts hyperlinks from PDF files, capturing any embedded URLs during the parsing process.
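
Routing a file to the right extractor can be sketched as a lookup on the file extension. The names below (`FORMAT_BY_EXTENSION`, `detect_format`, `extract_text`) are hypothetical and not taken from utils/fileTotext.py; the extension table follows the formats named in this README, with `.rtf` as the rich-text extension, and the per-format extractors are passed in rather than implemented here.

```python
from pathlib import Path

# Extension -> format family (assumed mapping for illustration).
FORMAT_BY_EXTENSION = {
    ".pdf": "pdf",
    ".docx": "docx",
    ".odt": "odt",
    ".rtf": "rtf",
    ".png": "image",
    ".jpg": "image",
    ".jpeg": "image",
}


def detect_format(path):
    """Map a file path to a format family, ignoring extension case."""
    suffix = Path(path).suffix.lower()
    try:
        return FORMAT_BY_EXTENSION[suffix]
    except KeyError:
        raise ValueError(
            f"Unsupported file format: {suffix or 'no extension'}"
        ) from None


def extract_text(path, extractors):
    """Route a file to the extractor registered for its format family.

    `extractors` is a dict such as {"pdf": pdf_func, "image": ocr_func},
    where each function takes a path and returns the extracted text.
    """
    return extractors[detect_format(path)](path)
```

Keeping the table separate from the extractors means a new format only needs one new entry and one new function.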


    # Error Handling (utils/error.py)
        - Manages API response errors and file format issues, and ensures the app falls back cleanly instead of crashing.

    # Flask API (main.py)
        - Endpoints: /upload for uploading resumes.
        - Results: Displays parsed results in JSON format on the results page.
        - UI: Simple interface for uploading resumes and viewing the parsing results.
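
A minimal sketch of what an `/upload` route could look like in Flask follows. It is not the code in main.py: the `file` form field name, the allowed-extension set, and the JSON response are assumptions made for the example, and the real app renders its results through the HTML templates.

```python
import os

from flask import Flask, jsonify, request
from werkzeug.utils import secure_filename

app = Flask(__name__)
app.config["UPLOAD_FOLDER"] = "uploads"  # the uploads/ directory from the file structure

# Assumed whitelist, based on the formats this README lists.
ALLOWED_EXTENSIONS = {".pdf", ".docx", ".odt", ".rtf", ".png", ".jpg", ".jpeg"}


@app.route("/upload", methods=["POST"])
def upload_resume():
    uploaded = request.files.get("file")  # "file" field name is an assumption
    if uploaded is None or not uploaded.filename:
        return jsonify(error="No file provided"), 400

    # Strip path components and unsafe characters from the client-supplied name.
    filename = secure_filename(uploaded.filename)
    extension = os.path.splitext(filename)[1].lower()
    if extension not in ALLOWED_EXTENSIONS:
        return jsonify(error=f"Unsupported file type: {extension}"), 400

    os.makedirs(app.config["UPLOAD_FOLDER"], exist_ok=True)
    uploaded.save(os.path.join(app.config["UPLOAD_FOLDER"], filename))

    # Placeholder: the real app would extract text and run the parsers here.
    return jsonify(filename=filename, status="uploaded"), 200
```

Validating the extension before saving keeps unsupported files out of the uploads directory and gives the user an immediate, specific error.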


# Tree map of program:

    main.py
    ├── Handles API side
    ├── File upload/remove
    ├── Process resumes
    └── Show result
    utils
    ├── fileTotext.py
    │   └── Converts files to text
    │       ├── PDF
    │       ├── DOCX
    │       ├── RTF
    │       ├── ODT
    │       ├── PNG
    │       ├── JPG
    │       └── JPEG
    ├── mistral.py
    │   ├── Mistral API Calls
    │   │   └── Uses Mistral-Nemo-Instruct-2407 model
    │   ├── Personal and Professional Extraction
    │   │   ├── Extracts personal information
    │   │   └── Extracts professional information
    │   └── Fallback Mechanism
    │       └── Uses spaCy NER model if Mistral fails
    └── spacy.py
        ├── Custom Trained Model
        │   └── Uses spaCy model (ner_model_05_3)
        ├── Named Entity Recognition
        │   └── Extracts key information (Name, Email, Contact, etc.)
        └── Validation
            └── Validates emails and contacts


# References:

- [Flask Documentation](https://flask.palletsprojects.com/)
- [spaCy Documentation](https://spacy.io/usage)
- [Mistral Documentation](https://docs.mistral.ai/)
- [Hugging Face Hub API](https://huggingface.co/docs/huggingface_hub/index)
- [PyMuPDF (MuPDF) Documentation](https://pymupdf.readthedocs.io/en/latest/)
- [python-docx Documentation](https://python-docx.readthedocs.io/en/latest/)
- [Tesseract OCR Documentation](https://github.com/UB-Mannheim/tesseract/wiki)
- [Virtual Environments in Python](https://docs.python.org/3/tutorial/venv.html)