{ "cells": [ { "cell_type": "markdown", "id": "023d00e4", "metadata": {}, "source": [ "# This is a sentiment analysis task that involves designing and optimizing the model, and could potentially use advanced NLP techniques such as fine-tuning pre-trained models like BERT. Working in a local Python environment with TensorFlow and Keras, I took an optimized approach using a more complex LSTM model, incorporating some advanced techniques that help improve accuracy." ] }, { "cell_type": "markdown", "id": "cb52f311", "metadata": {}, "source": [ "## Import Libraries:\n", "\n", "#### numpy and tensorflow are foundational libraries for numerical operations and machine learning.\n", "#### Classes imported from tensorflow.keras are used to build neural network models:\n", "- Sequential for linear stacking of layers.\n", "- Embedding, LSTM, Dense, Dropout, and Bidirectional for different types of neural network layers.\n", "- Tokenizer and pad_sequences are utilities for text processing.\n", "- train_test_split from sklearn is used for splitting data into training and validation sets." 
] }, { "cell_type": "code", "execution_count": 1, "id": "214a844b", "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import tensorflow as tf\n", "from tensorflow.keras.models import Sequential\n", "from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout, Bidirectional\n", "from tensorflow.keras.preprocessing.text import Tokenizer\n", "from tensorflow.keras.preprocessing.sequence import pad_sequences\n", "from tensorflow.keras.initializers import Constant\n", "from sklearn.model_selection import train_test_split" ] }, { "cell_type": "code", "execution_count": 2, "id": "c01d5192", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2.10.0\n" ] } ], "source": [ "print(tf.__version__)" ] }, { "cell_type": "markdown", "id": "c7ad5ab0", "metadata": {}, "source": [ "## Load and Preprocess Data:\n", "\n", "### Download the Amazon Reviews sentiment dataset from Kaggle\n", "#### https://www.kaggle.com/datasets/bittlingmayer/amazonreviews\n", "\n", "- Reads each line, splits it at the first space to separate the label from the review, assigns binary labels (1 for positive, 0 for negative), and stores the results in lists.\n", "\n", "- Converts the label list to a NumPy array for further processing.\n" ] }, { "cell_type": "code", "execution_count": 3, "id": "992745e5", "metadata": {}, "outputs": [], "source": [ "# Load the data (example path, replace with your actual path)\n", "file_path = 'Dataset/train.txt'\n", "\n", "# Preprocess the data\n", "with open(file_path, 'r', encoding='utf-8') as file:\n", "    lines = file.readlines()\n", "\n", "labels = []\n", "reviews = []\n", "for line in lines:\n", "    # Split on the first space only: the label comes first, then the review text\n", "    split_line = line.strip().split(' ', 1)\n", "    label = 1 if split_line[0] == '__label__2' else 0\n", "    review = split_line[1]\n", "    labels.append(label)\n", "    reviews.append(review)\n", "\n", "labels = np.array(labels)\n" ] }, { "cell_type": "markdown", "id": "f2a2b587", "metadata": {}, "source": [ "## Tokenization and Sequence 
Padding:\n", "\n", "- Initializes a Tokenizer object, specifying a maximum vocabulary size of 10,000 words and an out-of-vocabulary (OOV) token.\n", "- Fits the tokenizer on the collected reviews, creating an index of all unique words.\n", "- Converts the reviews into lists of integers based on the tokenizer's word index.\n", "- Pads or truncates these sequences to a fixed length of 250, so that all input data has the consistent dimensions needed for training neural networks." ] }, { "cell_type": "code", "execution_count": 4, "id": "edfcccbf", "metadata": {}, "outputs": [], "source": [ "# Tokenizing and padding text\n", "tokenizer = Tokenizer(num_words=10000, oov_token=\"\")\n", "tokenizer.fit_on_texts(reviews)\n", "sequences = tokenizer.texts_to_sequences(reviews)\n", "padded_sequences = pad_sequences(sequences, padding='post', maxlen=250)" ] }, { "cell_type": "markdown", "id": "62aea0f6", "metadata": {}, "source": [ "## Loading Pre-trained GloVe Embeddings\n", "\n", "### Download the GloVe embeddings from Kaggle\n", "\n", "#### https://www.kaggle.com/datasets/danielwillgeorge/glove6b100dtxt\n", "\n", "- Defines a function to load the GloVe (Global Vectors for Word Representation) embeddings.\n", "- Reads the GloVe file, parsing each line to extract the word and its corresponding coefficient vector.\n", "- Creates an embedding matrix that maps each word in the tokenizer's index to its GloVe vector, if available. 
Words not in GloVe will have a vector of zeros.\n" ] }, { "cell_type": "code", "execution_count": 5, "id": "a44dac7b", "metadata": {}, "outputs": [], "source": [ "# Load pre-trained GloVe embeddings\n", "\n", "def load_glove_embeddings(path, word_index, embedding_dim):\n", "    embeddings_index = {}\n", "    # Open the file with UTF-8 encoding specified\n", "    with open(path, 'r', encoding='utf-8') as f:\n", "        for line in f:\n", "            word, coefs = line.split(maxsplit=1)\n", "            coefs = np.fromstring(coefs, 'f', sep=' ')\n", "            embeddings_index[word] = coefs\n", "\n", "    # Prepare embedding matrix\n", "    num_words = min(len(word_index) + 1, 10000)\n", "    embedding_matrix = np.zeros((num_words, embedding_dim))\n", "    for word, i in word_index.items():\n", "        if i >= num_words:\n", "            continue\n", "        embedding_vector = embeddings_index.get(word)\n", "        if embedding_vector is not None:\n", "            # Use the embedding vector for words found in the GloVe index\n", "            embedding_matrix[i] = embedding_vector  # words not found will be all-zeros.\n", "\n", "    return embedding_matrix\n", "\n" ] }, { "cell_type": "code", "execution_count": 6, "id": "8800fc79", "metadata": {}, "outputs": [], "source": [ "embedding_dim = 100  # Size of the GloVe vectors you're using\n", "glove_path = 'Embedding/glove.6B.100d/glove.6B.100d.txt'\n", "embedding_matrix = load_glove_embeddings(glove_path, tokenizer.word_index, embedding_dim)\n" ] }, { "cell_type": "markdown", "id": "5ed572bf", "metadata": {}, "source": [ "## Model Definition and Compilation:\n", "\n", "- Defines a sequential model for sentiment analysis.\n", "- Adds an Embedding layer, initialized with the GloVe matrix and frozen (trainable=False), to transform indices into dense vectors of fixed size.\n", "- Utilizes Bidirectional layers with LSTM units to capture patterns from both forward and backward states of the input sequence.\n", "- Adds a Dropout layer (rate 0.5) after each Bidirectional LSTM layer to reduce overfitting.\n", "- The final output layer uses a sigmoid 
activation function for binary classification.\n", "- Compiles the model with the Adam optimizer and binary cross-entropy loss function, tracking accuracy as a metric." ] }, { "cell_type": "code", "execution_count": 7, "id": "2d8b02df", "metadata": {}, "outputs": [], "source": [ "# Build the model\n", "model = Sequential([\n", " Embedding(10000, embedding_dim, embeddings_initializer=Constant(embedding_matrix), \n", " input_length=250, trainable=False),\n", " Bidirectional(LSTM(128, return_sequences=True)),\n", " Dropout(0.5),\n", " Bidirectional(LSTM(128)),\n", " Dropout(0.5),\n", " Dense(1, activation='sigmoid')\n", "])" ] }, { "cell_type": "code", "execution_count": 8, "id": "ce4fa260", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Model: \"sequential\"\n", "_________________________________________________________________\n", " Layer (type) Output Shape Param # \n", "=================================================================\n", " embedding (Embedding) (None, 250, 100) 1000000 \n", " \n", " bidirectional (Bidirectiona (None, 250, 256) 234496 \n", " l) \n", " \n", " dropout (Dropout) (None, 250, 256) 0 \n", " \n", " bidirectional_1 (Bidirectio (None, 256) 394240 \n", " nal) \n", " \n", " dropout_1 (Dropout) (None, 256) 0 \n", " \n", " dense (Dense) (None, 1) 257 \n", " \n", "=================================================================\n", "Total params: 1,628,993\n", "Trainable params: 628,993\n", "Non-trainable params: 1,000,000\n", "_________________________________________________________________\n" ] } ], "source": [ "model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])\n", "model.summary()" ] }, { "cell_type": "code", "execution_count": 9, "id": "a3c4e6a3", "metadata": {}, "outputs": [], "source": [ "# Splitting the data\n", "X_train, X_val, y_train, y_val = train_test_split(padded_sequences, labels, test_size=0.2, random_state=42)\n" ] }, { "cell_type": "markdown", "id": 
"ddafc948", "metadata": {}, "source": [ "## Model Training:\n", "\n", "- Trains the model on the padded text sequences and labels.\n", "- Runs for 5 epochs with a batch size of 64, using the held-out 20% of the data as a validation set to monitor performance and detect overfitting." ] }, { "cell_type": "code", "execution_count": 10, "id": "596a0174", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Epoch 1/5\n", "1093/1093 [==============================] - 1457s 1s/step - loss: 0.3991 - accuracy: 0.8189 - val_loss: 0.3092 - val_accuracy: 0.8722\n", "Epoch 2/5\n", "1093/1093 [==============================] - 1371s 1s/step - loss: 0.2911 - accuracy: 0.8783 - val_loss: 0.2605 - val_accuracy: 0.8944\n", "Epoch 3/5\n", "1093/1093 [==============================] - 1641s 2s/step - loss: 0.2511 - accuracy: 0.8970 - val_loss: 0.2432 - val_accuracy: 0.9026\n", "Epoch 4/5\n", "1093/1093 [==============================] - 1400s 1s/step - loss: 0.2308 - accuracy: 0.9075 - val_loss: 0.2350 - val_accuracy: 0.9058\n", "Epoch 5/5\n", "1093/1093 [==============================] - 1400s 1s/step - loss: 0.2066 - accuracy: 0.9193 - val_loss: 0.2306 - val_accuracy: 0.9133\n" ] } ], "source": [ "# Training the model\n", "history = model.fit(X_train, y_train, epochs=5, batch_size=64, validation_data=(X_val, y_val), verbose=1)\n" ] }, { "cell_type": "markdown", "id": "039f21c0", "metadata": {}, "source": [ "## Save the Trained Model" ] }, { "cell_type": "code", "execution_count": 11, "id": "e4693576", "metadata": {}, "outputs": [], "source": [ "# Save the model\n", "model.save('sentiment_model.h5')" ] }, { "cell_type": "code", "execution_count": 12, "id": "2e1b099f", "metadata": {}, "outputs": [], "source": [ "# Save the tokenizer\n", "import pickle\n", "with open('tokenizer.pkl', 'wb') as handle:\n", "    pickle.dump(tokenizer, handle, protocol=pickle.HIGHEST_PROTOCOL)\n" ] }, { "cell_type": "markdown", "id": "66daf062", "metadata": {}, "source": [ "## Evaluate\n", "\n", "- Evaluate the 
model's validation loss\n", "- Evaluate the model's validation accuracy\n" ] }, { "cell_type": "code", "execution_count": 13, "id": "b64ef635", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "547/547 [==============================] - 191s 349ms/step - loss: 0.2306 - accuracy: 0.9133\n", "Validation loss: 0.23057065904140472\n", "Validation accuracy: 0.9132547974586487\n" ] } ], "source": [ "# Evaluate the model\n", "loss, accuracy = model.evaluate(X_val, y_val)\n", "print(\"Validation loss:\", loss)\n", "print(\"Validation accuracy:\", accuracy)" ] }, { "cell_type": "code", "execution_count": 14, "id": "54a7aec7", "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Plot the training history to visualize learning over epochs\n", "import matplotlib.pyplot as plt\n", "\n", "plt.plot(history.history['accuracy'], label='Accuracy')\n", "plt.plot(history.history['val_accuracy'], label='Validation Accuracy')\n", "plt.title('Model Accuracy')\n", "plt.ylabel('Accuracy')\n", "plt.xlabel('Epoch')\n", "plt.legend()\n", "plt.show()" ] }, { "cell_type": "markdown", "id": "f1590396", "metadata": {}, "source": [ "## Model Load\n", "\n", "- Load the sentiment_model.h5 model from your directory." ] }, { "cell_type": "code", "execution_count": 15, "id": "57021426", "metadata": {}, "outputs": [], "source": [ "from tensorflow.keras.models import load_model\n", "\n", "# Load the model\n", "model = load_model('sentiment_model.h5')" ] }, { "cell_type": "code", "execution_count": 16, "id": "9acdbe1f", "metadata": {}, "outputs": [], "source": [ "import pickle\n", "\n", "# Load the tokenizer\n", "with open('tokenizer.pkl', 'rb') as f:\n", "    tokenizer = pickle.load(f)" ] }, { "cell_type": "code", "execution_count": 17, "id": "b44fd582", "metadata": {}, "outputs": [], "source": [ "# from tensorflow.keras.preprocessing.text import Tokenizer\n", "from tensorflow.keras.preprocessing.sequence import pad_sequences" ] }, { "cell_type": "markdown", "id": "bbf51b83", "metadata": {}, "source": [ "## Utility Functions for Text Preprocessing and Sentiment Prediction:\n", "\n", "- preprocess_text converts input texts into padded sequences suitable for model input, using the previously defined tokenizer.\n", "- predict_sentiment processes texts, makes predictions with the trained model, and labels each result 'Negative' if the prediction score is below 0.5 and 'Positive' otherwise." 
] }, { "cell_type": "code", "execution_count": 18, "id": "de6eac0b", "metadata": {}, "outputs": [], "source": [ "def preprocess_text(texts, tokenizer, max_length=250):\n", "    # Convert a list of texts to integer sequences, then pad/truncate to max_length\n", "    sequences = tokenizer.texts_to_sequences(texts)\n", "    padded_sequences = pad_sequences(sequences, maxlen=max_length, padding='post')\n", "    return padded_sequences\n", "\n", "def predict_sentiment(texts, model, tokenizer):\n", "    preprocessed_texts = preprocess_text(texts, tokenizer)\n", "    predictions = model.predict(preprocessed_texts)\n", "    print(\"predictions\", predictions)\n", "    # Scores below 0.5 are labeled Negative; all others are labeled Positive\n", "    sentiment_labels = ['Negative' if pred < 0.5 else 'Positive' for pred in predictions.flatten()]\n", "    return sentiment_labels" ] }, { "cell_type": "markdown", "id": "a58a1314", "metadata": {}, "source": [ "## Sentiment Prediction on New Reviews:\n", "\n", "- Lists new review texts to test the model.\n", "- Calls predict_sentiment to determine the sentiment of each review.\n", "- Prints out each review with its predicted sentiment, providing a practical demonstration of the model in action." ] }, { "cell_type": "code", "execution_count": 20, "id": "d95588b0", "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1/1 [==============================] - 0s 245ms/step\n", "predictions [[0.9961173 ]\n", " [0.00163375]\n", " [0.5040343 ]\n", " [0.9762217 ]\n", " [0.00119857]\n", " [0.15386383]\n", " [0.98825365]]\n", "Review: I absolutely loved this product, it worked wonders for me!\n", "Sentiment: Positive\n", "\n", "Review: Horrible experience, it broke down the first time I used it.\n", "Sentiment: Negative\n", "\n", "Review: Okay product, but I expected something better.\n", "Sentiment: Positive\n", "\n", "Review: Perfect, just as described! 
Would buy again!\n", "Sentiment: Positive\n", "\n", "Review: Not worth the money, very disappointing.\n", "Sentiment: Negative\n", "\n", "Review: If you want to listen to El Duke , then it is better if you have access to his shower,this is not him, it is a gimmick,very well orchestrated.\n", "Sentiment: Negative\n", "\n", "Review: Review of Kelly Club for Toddlers: For the price of 7.99, this PC game is WELL worth it, great graphics, colorful and lots to do! My four year old daughter is in love with the many tasks to complete in this game, including dressing and grooming wide variety of pets and decoration of numerous floats to show in your little one's very own parade.\n", "Sentiment: Positive\n", "\n" ] } ], "source": [ "# Example reviews\n", "new_reviews = [\n", " \"I absolutely loved this product, it worked wonders for me!\",\n", " \"Horrible experience, it broke down the first time I used it.\",\n", " \"Okay product, but I expected something better.\",\n", " \"Perfect, just as described! Would buy again!\",\n", " \"Not worth the money, very disappointing.\",\n", " \"If you want to listen to El Duke , then it is better if you have access to his shower,this is not him, it is a gimmick,very well orchestrated.\",\n", " \"Review of Kelly Club for Toddlers: For the price of 7.99, this PC game is WELL worth it, great graphics, colorful and lots to do! 
My four year old daughter is in love with the many tasks to complete in this game, including dressing and grooming wide variety of pets and decoration of numerous floats to show in your little one's very own parade.\"\n", "]\n", "\n", "# Predict sentiments\n", "sentiments = predict_sentiment(new_reviews, model, tokenizer)\n", "# print(sentiments)\n", "for review, sentiment in zip(new_reviews, sentiments):\n", "    print(f\"Review: {review}\\nSentiment: {sentiment}\\n\")" ] }, { "cell_type": "code", "execution_count": 23, "id": "6a3b88ac", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1/1 [==============================] - 0s 218ms/step\n", "Prediction: [0.00290678]\n", "Review: very bad product\n", "Sentiment: Negative\n", "\n" ] } ], "source": [ "# Single-review variants: these redefine the list-based helpers above\n", "def preprocess_text(text, tokenizer, max_length=250):\n", "    sequence = tokenizer.texts_to_sequences([text])\n", "    padded_sequence = pad_sequences(sequence, maxlen=max_length, padding='post')\n", "    return padded_sequence\n", "\n", "def predict_sentiment(text, model, tokenizer):\n", "    preprocessed_text = preprocess_text(text, tokenizer)\n", "    prediction = model.predict(preprocessed_text)[0]\n", "    print(\"Prediction:\", prediction)\n", "    sentiment_label = 'Negative' if prediction < 0.5 else 'Positive'\n", "    return sentiment_label\n", "\n", "# Example review\n", "new_review = \"very bad product\"\n", "\n", "# Predict sentiment\n", "sentiment = predict_sentiment(new_review, model, tokenizer)\n", "print(f\"Review: {new_review}\\nSentiment: {sentiment}\\n\")" ] }, { "cell_type": "code", "execution_count": null, "id": "f35e0602", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", 
"version": "3.10.13" } }, "nbformat": 4, "nbformat_minor": 5 }