mlkorra committed on
Commit 9df3fc4 · verified · 1 Parent(s): 389e9dd

Update pages/Project_Wiki.py

Files changed (1)
  1. pages/Project_Wiki.py +45 -11
pages/Project_Wiki.py CHANGED
@@ -39,31 +39,65 @@ def main():
  """, unsafe_allow_html=True)
 
  # Q2: Solution Explanation
+ # Q2: Solution Explanation
  st.markdown("""
  <div class="question-card">
  <div class="question">🔍 Q2: Can you explain your solution approach?</div>
  <div class="answer">
  The solution implements a multi-stage document classification pipeline:
  <br><br>
- <b>1. Direct URL Text Approach:</b>
+ <b>1. Data Collection & Processing:</b>
  <ul>
- <li>Initially considered direct URL text extraction</li>
- <li>Found limitations in accuracy and reliability</li>
+ <li>Dataset: 2500+ training URLs and 250+ test URLs</li>
+ <li>Implemented ThreadPooling with 20 workers for parallel processing</li>
+ <li>Reduced download time to ~40 minutes (vs. 3+ hours sequential)</li>
+ <li>Used PDFPlumber for robust text extraction</li>
  </ul>
  <br>
- <b>2. Baseline Approach (ML Model):</b>
+ <b>2. Model Development Pipeline:</b>
  <ul>
- <li>Implemented TF-IDF vectorization</li>
- <li>Used Logistic Regression for classification</li>
- <li>Provided quick and efficient results</li>
+ <li><i>Baseline Approach:</i>
+ <ul>
+ <li>TF-IDF vectorization for text representation</li>
+ <li>Logistic Regression for initial classification</li>
+ <li>Quick inference and resource-efficient</li>
+ </ul>
+ </li>
+ <br>
+ <li><i>Advanced Approach:</i>
+ <ul>
+ <li>BERT-based architecture for deep learning</li>
+ <li>Fine-tuned on construction document dataset</li>
+ <li>Superior context understanding and accuracy</li>
+ </ul>
+ </li>
  </ul>
  <br>
- <b>3. (DL Model):</b>
+ <b>3. Evaluation Strategy:</b>
  <ul>
- <li>Utilized BERT-based model architecture</li>
- <li>Fine-tuned on construction document dataset</li>
- <li>Achieved superior accuracy and context understanding</li>
+ <li>Comprehensive metric suite (Precision, Recall, F1)</li>
+ <li>Special consideration for class imbalance</li>
+ <li>Comparative analysis between baseline and BERT</li>
  </ul>
+ <br>
+ <b>4. Deployment & Demo:</b>
+ <ul>
+ <li>Streamlit-based interactive web interface</li>
+ <li>Real-time document classification</li>
+ <li>Comprehensive project documentation</li>
+ <li>Performance visualization and analytics</li>
+ </ul>
+ <br>
+ <div style='
+ background-color: #e8f4f8;
+ padding: 15px;
+ border-radius: 5px;
+ border-left: 4px solid #1f77b4;
+ '>
+ <b>💡 Key implementation:</b> The parallel processing implementation significantly reduced data preparation time,
+ allowing for faster iteration and model experimentation. This, combined with the dual-model approach,
+ provides both efficiency and accuracy in document classification.
+ </div>
  </div>
  </div>
  """, unsafe_allow_html=True)
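
The new answer text mentions a thread pool with 20 workers for downloading the document URLs, but the commit does not include that code. A minimal sketch of how such a step might look is below, using only the Python standard library. The names `fetch_bytes` and `download_all`, the 30-second timeout, and the error-handling policy are assumptions for illustration and are not taken from the project; text extraction (the page mentions PDFPlumber) would happen after this step and is not shown.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from urllib.request import urlopen


def fetch_bytes(url, timeout=30):
    """Download one URL and return its raw bytes (hypothetical helper)."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def download_all(urls, fetch=fetch_bytes, max_workers=20):
    """Fetch every URL in a thread pool.

    Returns two dicts keyed by URL: successful results and raised exceptions.
    max_workers=20 mirrors the worker count quoted in the page text.
    """
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit all downloads up front, then collect them as they finish.
        future_to_url = {pool.submit(fetch, url): url for url in urls}
        for future in as_completed(future_to_url):
            url = future_to_url[future]
            try:
                results[url] = future.result()
            except Exception as exc:  # one bad URL should not stop the batch
                errors[url] = exc
    return results, errors
```

Called as `results, errors = download_all(train_urls)`, where `train_urls` is a hypothetical list of URL strings, this returns the downloaded bytes per URL plus any failures. Threads suit this workload because downloads are I/O-bound; the `fetch` parameter is injectable so the pool logic can be exercised without network access.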