----------- Resume Parser -----------

Overview:

This project is a resume parsing tool written in Python. It uses the Mistral-Nemo-Instruct-2407 model for primary parsing; if Mistral fails or returns an error, the system falls back to a custom-trained spaCy model so that parsing can continue. The tool is exposed through a Flask API and has a user interface built with HTML and CSS.

Installation Guide:

  1. Create and Activate a Virtual Environment

     python -m venv venv
     source venv/bin/activate   # For Linux/Mac

    or

     venv\Scripts\activate      # For Windows

    NOTE: If the virtual environment (venv) is already created, you can skip the creation step and just activate.

     - For Linux/Mac:
         source venv/bin/activate
     - For Windows:
         venv\Scripts\activate
    
  2. Install Required Libraries

     pip install -r requirements.txt

    Ensure the following dependencies are included:

    • Flask
    • spaCy
    • huggingface_hub
    • PyMuPDF
    • python-docx
    • Tesseract-OCR (for image-based parsing)

    NOTE: If any model or library is not installed, you can install it using: pip install <name>, replacing <name> with the specific model or library you need to install.

  3. Set up Hugging Face Token
    • Add your Hugging Face token to the .env file as: HF_TOKEN=
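
A minimal sketch of how the token might be read at startup, assuming the .env file contains plain KEY=VALUE lines. The helper names `load_env_file` and `get_hf_token` are hypothetical and not taken from the project, which may well use a library such as python-dotenv instead.

```python
import os
from pathlib import Path


def load_env_file(path=".env"):
    """Parse simple KEY=VALUE lines into a dict (hypothetical helper).

    Blank lines, '#' comments, and lines without '=' are skipped.
    """
    values = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for line in env_path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def get_hf_token(path=".env"):
    """Prefer a real environment variable, then fall back to the .env file."""
    token = os.environ.get("HF_TOKEN") or load_env_file(path).get("HF_TOKEN")
    if not token:
        raise RuntimeError("HF_TOKEN is not set; add it to the .env file")
    return token
```

Failing early with a clear message when the token is missing tends to be easier to debug than an authentication error raised later by the API call.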

File Structure Overview:

Mistral_With_Spacy/
│
├── Spacy_Models/
│   └── ner_model_05_3  # Pretrained spaCy model directory for resume parsing
│
├── templates/
│   ├── index.html  # UI for file upload
│   └── result.html  # Display parsed results in structured JSON
│
├── uploads/  # Directory for uploaded resume files
│
├── utils/
│   ├── mistral.py  # Code for calling Mistral API and handling responses
│   ├── spacy.py  # spaCy fallback model for parsing resumes
│   ├── error.py  # Error handling utilities
│   └── fileTotext.py  # Functions to extract text from different file formats (PDF, DOCX, etc.)
│
├── venv/  # Virtual environment
│
├── .env  # Environment variables file (contains Hugging Face token)
│
├── main.py  # Flask app handling API routes for uploading and processing resumes
│
└── requirements.txt  # Dependencies required for the project

Program Overview:

# Mistral Integration (utils/mistral.py)
    - Mistral API Calls: Uses Hugging Face's Mistral-Nemo-Instruct-2407 model to parse resumes.
    - Personal and Professional Extraction: Two functions extract personal and professional information in structured JSON format.
    - Fallback Mechanism: If Mistral fails, spaCy's NER model is used as a fallback.
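
The fallback mechanism can be sketched roughly as follows. This is an illustration only: `parse_with_fallback` and its `primary` / `fallback` arguments are hypothetical stand-ins for the Mistral and spaCy parsers, not the actual functions in utils/mistral.py.

```python
import logging

logger = logging.getLogger(__name__)


def parse_with_fallback(text, primary, fallback):
    """Try the primary parser; use the fallback if it raises or returns nothing.

    `primary` and `fallback` are any callables that take resume text and
    return a dict (hypothetical stand-ins for the Mistral and spaCy parsers).
    """
    try:
        result = primary(text)
        if result:
            return {"source": "primary", "data": result}
        logger.warning("Primary parser returned no data; falling back")
    except Exception as exc:  # broad on purpose: any failure triggers the fallback
        logger.warning("Primary parser failed (%s); falling back", exc)
    return {"source": "fallback", "data": fallback(text)}
```

Recording which parser produced the result makes it possible to see, per resume, whether the fallback path was taken.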

# spaCy Integration (utils/spacy.py)
    - Custom Trained Model: Uses a spaCy model (ner_model_05_3) trained specifically for resume parsing.
    - Named Entity Recognition: Extracts key information like Name, Email, Contact, Location, Skills, Experience, etc., from resumes.
    - Validation: Includes validation for extracted emails and contacts.
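
As an illustration of the validation step, the sketch below checks extracted emails and contact numbers with simple regular expressions. The patterns, function names, and digit limits are assumptions chosen for the example; the project's own rules may differ.

```python
import re

# Deliberately simple patterns; real-world validation may need to be stricter.
EMAIL_RE = re.compile(r"^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$")
PHONE_CHARS_RE = re.compile(r"^\+?[\d\s().-]+$")


def is_valid_email(value):
    """Return True if the value looks like user@domain.tld."""
    return bool(EMAIL_RE.match(value.strip()))


def is_valid_contact(value, min_digits=7, max_digits=15):
    """Accept digits plus common separators, with a plausible digit count."""
    value = value.strip()
    if not PHONE_CHARS_RE.match(value):
        return False
    digits = re.sub(r"\D", "", value)
    return min_digits <= len(digits) <= max_digits


def filter_entities(entities):
    """Drop EMAIL/CONTACT entities that fail validation; keep everything else.

    `entities` is a list of (label, text) pairs, e.g. taken from a spaCy doc.
    """
    checks = {"EMAIL": is_valid_email, "CONTACT": is_valid_contact}
    return [(label, text) for label, text in entities
            if checks.get(label, lambda _: True)(text)]
```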

# File Conversion (utils/fileTotext.py)
    - Text Extraction: Handles different resume formats (PDF, DOCX, ODT, RTF, and images like PNG, JPG, JPEG) and extracts text for further processing.
      - PDF Files: Uses PyMuPDF to extract text and, if necessary, Tesseract-OCR for image-based PDF content.
      - DOCX Files: Uses `python-docx` to extract structured text from Word documents.
      - ODT Files: Uses `odfpy` to extract text from ODT (OpenDocument) files.
      - RTF Files: Reads plain text from RTF files.
      - Images (PNG, JPG, JPEG): Uses Tesseract-OCR to extract text from image-based resumes.
        Note: For Tesseract-OCR, install it locally by following the [installation guide](https://github.com/UB-Mannheim/tesseract/wiki).
      - Hyperlink Extraction: Extracts hyperlinks from PDF files, capturing any embedded URLs during the parsing process.


# Error Handling (utils/error.py)
    - Manages API response errors, file format issues, and ensures smooth fallbacks without crashing the app.

# Flask API (main.py)
    - Endpoints: /upload for uploading resumes.
    - Results: Displays parsed results in JSON format on the results page.
    - UI: Simple interface for uploading resumes and viewing the parsing results.

Tree map of program:

main.py
├── Handles API side
├── File upload/remove
├── Process resumes
└── Show result
utils
├── fileTotext.py
│   └── Converts files to text
│       ├── PDF
│       ├── DOCX
│       ├── RTF
│       ├── ODT
│       ├── PNG
│       ├── JPG
│       └── JPEG
├── mistral.py
│   ├── Mistral API Calls
│   │   └── Uses Mistral-Nemo-Instruct-2407 model
│   ├── Personal and Professional Extraction
│   │   ├── Extracts personal information
│   │   └── Extracts professional information
│   └── Fallback Mechanism
│       └── Uses spaCy NER model if Mistral fails
└── spacy.py
    β”œβ”€β”€ Custom Trained Model
    β”‚   └── Uses spaCy model (ner_model_05_3)
    β”œβ”€β”€ Named Entity Recognition
    β”‚   └── Extracts key information (Name, Email, Contact, etc.)
    └── Validation
        └── Validates emails and contacts
