Update pages/Project_Wiki.py
pages/Project_Wiki.py +45 -11
pages/Project_Wiki.py
CHANGED
@@ -39,31 +39,65 @@ def main():
|
39 |
""", unsafe_allow_html=True)
|
40 |
|
41 |
# Q2: Solution Explanation
|
42 |
st.markdown("""
|
43 |
<div class="question-card">
|
44 |
<div class="question">Q2: Can you explain your solution approach?</div>
|
45 |
<div class="answer">
|
46 |
The solution implements a multi-stage document classification pipeline:
|
47 |
<br><br>
|
48 |
-
<b>1.
|
49 |
<ul>
|
50 |
-
<li>
|
51 |
-
<li>
|
52 |
</ul>
|
53 |
<br>
|
54 |
-
<b>2.
|
55 |
<ul>
|
56 |
-
<li>
|
57 |
-
|
58 |
-
|
59 |
</ul>
|
60 |
<br>
|
61 |
-
<b>3.
|
62 |
<ul>
|
63 |
-
<li>
|
64 |
-
<li>
|
65 |
-
<li>
|
66 |
</ul>
|
67 |
</div>
|
68 |
</div>
|
69 |
""", unsafe_allow_html=True)
|
39 |
""", unsafe_allow_html=True)
|
40 |
|
41 |
# Q2: Solution Explanation
|
42 |
+
# Q2: Solution Explanation
|
43 |
st.markdown("""
|
44 |
<div class="question-card">
|
45 |
<div class="question">Q2: Can you explain your solution approach?</div>
|
46 |
<div class="answer">
|
47 |
The solution implements a multi-stage document classification pipeline:
|
48 |
<br><br>
|
49 |
+
<b>1. Data Collection & Processing:</b>
|
50 |
<ul>
|
51 |
+
<li>Dataset: 2500+ training URLs and 250+ test URLs</li>
|
52 |
+
<li>Implemented ThreadPooling with 20 workers for parallel processing</li>
|
53 |
+
<li>Reduced download time to ~40 minutes (vs. 3+ hours sequential)</li>
|
54 |
+
<li>Used PDFPlumber for robust text extraction</li>
|
55 |
</ul>
|
56 |
<br>
|
57 |
+
<b>2. Model Development Pipeline:</b>
|
58 |
<ul>
|
59 |
+
<li><i>Baseline Approach:</i>
|
60 |
+
<ul>
|
61 |
+
<li>TF-IDF vectorization for text representation</li>
|
62 |
+
<li>Logistic Regression for initial classification</li>
|
63 |
+
<li>Quick inference and resource-efficient</li>
|
64 |
+
</ul>
|
65 |
+
</li>
|
66 |
+
<br>
|
67 |
+
<li><i>Advanced Approach:</i>
|
68 |
+
<ul>
|
69 |
+
<li>BERT-based architecture for deep learning</li>
|
70 |
+
<li>Fine-tuned on construction document dataset</li>
|
71 |
+
<li>Superior context understanding and accuracy</li>
|
72 |
+
</ul>
|
73 |
+
</li>
|
74 |
</ul>
|
75 |
<br>
|
76 |
+
<b>3. Evaluation Strategy:</b>
|
77 |
<ul>
|
78 |
+
<li>Comprehensive metric suite (Precision, Recall, F1)</li>
|
79 |
+
<li>Special consideration for class imbalance</li>
|
80 |
+
<li>Comparative analysis between baseline and BERT</li>
|
81 |
</ul>
|
82 |
+
<br>
|
83 |
+
<b>4. Deployment & Demo:</b>
|
84 |
+
<ul>
|
85 |
+
<li>Streamlit-based interactive web interface</li>
|
86 |
+
<li>Real-time document classification</li>
|
87 |
+
<li>Comprehensive project documentation</li>
|
88 |
+
<li>Performance visualization and analytics</li>
|
89 |
+
</ul>
|
90 |
+
<br>
|
91 |
+
<div style='
|
92 |
+
background-color: #e8f4f8;
|
93 |
+
padding: 15px;
|
94 |
+
border-radius: 5px;
|
95 |
+
border-left: 4px solid #1f77b4;
|
96 |
+
'>
|
97 |
+
<b>💡 Key implementation:</b> The parallel processing implementation significantly reduced data preparation time,
|
98 |
+
allowing for faster iteration and model experimentation. This, combined with the dual-model approach,
|
99 |
+
provides both efficiency and accuracy in document classification.
|
100 |
+
</div>
|
101 |
</div>
|
102 |
</div>
|
103 |
""", unsafe_allow_html=True)
|
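Review note on item 1 of the added text ("ThreadPooling with 20 workers"): the diff does not include the download code itself, so the following is only a minimal sketch of how such a step might look with the standard library's `concurrent.futures.ThreadPoolExecutor`. The names `download_all` and `fetch` are hypothetical and not taken from the repository; only the worker count of 20 comes from the diff.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def download_all(urls, fetch, max_workers=20):
    """Run fetch(url) for every URL on a thread pool.

    `fetch` is a caller-supplied function (hypothetical here) that downloads
    one document. Returns (results, errors): `results` maps each URL to the
    value fetch returned, `errors` maps each failed URL to its exception.
    """
    results = {}
    errors = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit every download up front and remember which future
        # belongs to which URL.
        future_to_url = {pool.submit(fetch, url): url for url in urls}
        # Collect results as they finish, in completion order.
        for future in as_completed(future_to_url):
            url = future_to_url[future]
            try:
                results[url] = future.result()
            except Exception as exc:
                # One failed URL should not abort the rest of the batch.
                errors[url] = exc
    return results, errors
```

Passing the fetch function in keeps the sketch testable without network access; a real run would supply a function that downloads one PDF and returns its bytes or extracted text. Threads suit this step because the work is dominated by waiting on the network rather than by CPU time.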