Facebook Comment Sentiment Classifier with BERT
Introduction
In this project, I will design a text classification model that distinguishes between three types of Facebook comments:
- Appreciation
- Complaints
- Feedback
| | message | Appreciation | Complaint | Feedback |
|---|---|---|---|---|
| 30 | ugh | 0 | 1 | 0 |
| 931 | What are you lining your cans with? | 0 | 0 | 1 |
| 596 | me with my hershey cake :) | 1 | 0 | 0 |
| 434 | I can has free 8x10? | 0 | 0 | 1 |
The model will be trained on a dataset of 7,961 comments. The classifier will use the BERT architecture, which is a state-of-the-art Natural Language Processing model. The model will be trained on a GPU using the PyTorch framework.
The goal of this project is to build a classifier that achieves the highest predictive performance on a hidden dataset of 2,039 comments.
The model will be evaluated on the following metrics:
- Precision
- Recall
- F1 Score
The tools I will be using in this project are:
- Python
- PyTorch
  - Machine learning framework for training the model
- Transformers library
  - Hugging Face's library for using pretrained BERT models
- Google Colab
  - A Google-hosted cloud platform for running Jupyter notebooks. Useful when coding away from home, or when a high-end GPU is needed for deep learning
The steps I will be taking in this project are:
- Import the data from the local environment into Google Colab
- Preprocess the data for PyTorch
- Set up and train the BERT pipeline
- Evaluate the classifier on the validation set and generate predictions for the test set
Importing the Data
Here is an overview of the libraries needed for the project:
import random  # needed later for random.seed
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score  # needed later for the weighted F1 score
import torch
from tqdm.notebook import tqdm
from transformers import BertTokenizer, AdamW, BertForSequenceClassification, get_linear_schedule_with_warmup
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler
from google.colab import files
Before we can begin, we need to upload the data to the Google Colab environment. The data is stored in tab-separated text files, so we will use the files module from google.colab to manually upload the labeled and unlabeled datasets.
# running this code in Google Colab will prompt you to upload the data
uploaded = files.upload()
# read labeled data into a pandas dataframe
df = pd.read_csv('FB_posts_labeled.txt', sep='\t')
# read unlabeled data into a pandas dataframe
df_test = pd.read_csv('FB_posts_unlabeled.txt', sep='\t')
Data Preparation
In order to train the model, we will need to:
- encode the output label columns into one integer value column (0, 1, 2).
- split the data into training and validation sets.
  - We will use the train_test_split function from the sklearn library to do this.
- tokenize the text data
  - We will use the BertTokenizer class from the transformers library to do this.
- convert the data into PyTorch tensors
  - this is the data type that the model will be trained on.
# create labels for classification
label_dict = {'Appreciation': 0, 'Complaint': 1, 'Feedback': 2}
# create label column that encodes labels based on dict values
conds = [df.Appreciation == 1, df.Complaint == 1, df.Feedback == 1]
choices = [0, 1, 2]
df['label'] = np.select(conds, choices, default=0)
# drop old columns
df.drop(columns=['Appreciation', 'Complaint', 'Feedback', 'postId'], inplace=True)
df
| | message | label |
|---|---|---|
| 2563 | Along with so many others, it’s “Goodbye Macy’s” for me until they give Trump the axe. | 1 |
| 2601 | My aunt loaded all of her Easter photos onto the CVS photo website. Is there a way to output them back to the computer? | 2 |
| 4778 | i have over 7000 fb friends and followers on twitter. i will not stop until they’ve all heard about my poor and inefficient treatment. | 1 |
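One detail of the encoding step is worth checking: np.select returns the first matching choice, and default=0 maps a row with no flag set to Appreciation. Whether the dataset contains such rows is not stated, so the rows below are hypothetical. This is a minimal sketch of a stricter encoder that flags rows without exactly one label instead of silently assigning a class.

```python
# Hypothetical one-hot rows in (Appreciation, Complaint, Feedback) order.
rows = [(0, 1, 0), (0, 0, 1), (1, 0, 0), (0, 0, 0), (1, 1, 0)]


def encode(row):
    """Return the class index for a row with exactly one flag set, else None."""
    return row.index(1) if sum(row) == 1 else None


encoded = [encode(r) for r in rows]
# Positions of rows that have no label or more than one label.
ambiguous = [i for i, label in enumerate(encoded) if label is None]

print(encoded)
print(ambiguous)
```

If the list of ambiguous rows is empty for the real data, the np.select encoding above is safe as written.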
# split data into training and validation sets, with stratification
X_train, X_val, y_train, y_val = train_test_split(df.index.values, df.label.values, test_size=0.15, random_state=12, stratify=df.label.values, shuffle=True)
print(f"length of training data: {len(X_train)}")
print(f"length of validation data: {len(X_val)}")
length of training data: 6766
length of validation data: 1195
# verify that the training and validation sets are stratified
pd.Series(y_train).value_counts(normalize=True)
pd.Series(y_val).value_counts(normalize=True)
53% of the comments are complaints; Appreciation and Feedback are the minority classes.
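To make the idea of stratification concrete, here is a small pure-Python sketch that splits each class separately so both parts keep roughly the same class shares. It is an illustration only, not scikit-learn's actual algorithm, and the label list is made up to mirror the 53% majority class.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_size, seed):
    """Split indices so each class keeps roughly the same share in both parts."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx, val_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_val = round(len(indices) * test_size)  # per-class validation count
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return train_idx, val_idx


# Hypothetical labels: 53 Complaints (1), 27 Appreciation (0), 20 Feedback (2).
toy_labels = [1] * 53 + [0] * 27 + [2] * 20
train_idx, val_idx = stratified_split(toy_labels, test_size=0.2, seed=12)
val_counts = Counter(toy_labels[i] for i in val_idx)

print(len(train_idx), len(val_idx), dict(val_counts))
```

Because the split is done per class, the majority class cannot crowd the minority classes out of the validation set.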
Creating Embeddings
The BERT model requires the input text to be tokenized and converted into integer token IDs; the model's own embedding layer then maps those IDs to embeddings. The BertTokenizer class from the transformers library will be used to do this. The BertTokenizer class will:
- tokenize the text data
- convert the tokens into integer token IDs
- add special tokens to the beginning and end of the sequence
- pad the sequences to a maximum length
- create attention masks that tell the model to ignore the padded tokens
We will use the cased version of BERT, in case capitalization carries information about sentiment (e.g. anger).
# Download the pretrained BERT tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-cased',
                                          do_lower_case=False)
The longest text in our dataset is over 512 tokens long. Since 512 tokens is the limit for BERT, we will set the maximum length to 512 tokens and enable truncation.
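The encoding steps described above (special tokens, truncation, padding, attention mask) can be sketched without any library. This is a toy illustration: it splits on whitespace and uses a made-up vocabulary with made-up IDs, not BERT's WordPiece tokenizer or its real token IDs.

```python
# Made-up vocabulary for illustration only.
toy_vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
             "free": 4, "cake": 5, "ugh": 6}


def toy_encode(text, max_length):
    """Whitespace-tokenize, add special tokens, truncate, pad, and build the mask."""
    tokens = text.split()[: max_length - 2]  # leave room for [CLS] and [SEP]
    ids = [toy_vocab["[CLS]"]]
    ids += [toy_vocab.get(tok, toy_vocab["[UNK]"]) for tok in tokens]
    ids.append(toy_vocab["[SEP]"])
    attention_mask = [1] * len(ids)          # 1 marks a real token
    padding = max_length - len(ids)
    ids += [toy_vocab["[PAD]"]] * padding
    attention_mask += [0] * padding          # 0 marks padding to be ignored
    return ids, attention_mask


ids, mask = toy_encode("ugh free cake", max_length=8)
short_ids, short_mask = toy_encode("ugh free cake ugh free", max_length=4)

print(ids, mask)
print(short_ids, short_mask)
```

The second call shows truncation: only as many tokens are kept as fit between the two special tokens, so no padding is needed.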
# set max length to 512 tokens
max_len = 512
# tokenize and encode sequences in the training set
encoded_data_train = tokenizer.batch_encode_plus(
    df[df.index.isin(X_train)].message.values,
    add_special_tokens=True,
    return_attention_mask=True,
    padding='max_length',  # pad every sequence to max_len
    truncation=True,       # cut sequences longer than max_len
    max_length=max_len,
    return_tensors='pt'
)
# tokenize and encode sequences in the validation set
encoded_data_val = tokenizer.batch_encode_plus(
    df[df.index.isin(X_val)].message.values,
    add_special_tokens=True,
    return_attention_mask=True,
    padding='max_length',
    truncation=True,
    max_length=max_len,
    return_tensors='pt'
)
# tokenize and encode sequences in the test set
encoded_data_test = tokenizer.batch_encode_plus(
    df_test.message.values,
    add_special_tokens=True,
    return_attention_mask=True,
    padding='max_length',
    truncation=True,
    max_length=max_len,
    return_tensors='pt'
)
Creating PyTorch Data Tensors
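We will use the TensorDataset class from the torch.utils.data module to create the PyTorch datasets. The tokenizer already returns the input IDs and attention masks as PyTorch tensors (return_tensors='pt'), so we only need to:
- convert the labels into a PyTorch tensor
- wrap the input IDs, attention masks, and labels in a TensorDataset, so that each index returns one aligned (input IDs, attention mask, label) tuple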
# convert the training set into a TensorDataset
input_ids_train = encoded_data_train['input_ids']
attention_masks_train = encoded_data_train['attention_mask']
labels_train = torch.tensor(df[df.index.isin(X_train)].label.values)
dataset_train = TensorDataset(input_ids_train, attention_masks_train, labels_train)
# convert the validation set into a TensorDataset
input_ids_val = encoded_data_val['input_ids']
attention_masks_val = encoded_data_val['attention_mask']
labels_val = torch.tensor(df[df.index.isin(X_val)].label.values)
dataset_val = TensorDataset(input_ids_val, attention_masks_val, labels_val)
# convert the test set into a TensorDataset
input_ids_test = encoded_data_test['input_ids']
attention_masks_test = encoded_data_test['attention_mask']
dataset_test = TensorDataset(input_ids_test, attention_masks_test)
Creating Data Loaders
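We will use the DataLoader class from the torch.utils.data module to create the data loaders. The DataLoader class will:
- shuffle the data (when shuffle=True)
- create batches of data
The DataLoader does not move data to the GPU; each batch is moved to the device later, inside the training and evaluation loops.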
# create the DataLoaders for our training, validation, and test sets
# set batch size to 8
batch_size = 8
dataloader_train = DataLoader(dataset_train,
                              shuffle=True,
                              batch_size=batch_size)
# shuffling is not needed for evaluation
dataloader_validation = DataLoader(dataset_val,
                                   shuffle=False,
                                   batch_size=batch_size)
# keep the test set in its original order so predictions line up with df_test
dataloader_test = DataLoader(dataset_test,
                             shuffle=False,
                             batch_size=batch_size)
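What shuffling and batching amount to can be sketched in plain Python. The helper below is a hypothetical stand-in for illustration, not PyTorch's implementation; it uses the 6,766 training examples and batch size of 8 reported above, and keeps the final partial batch, as DataLoader does by default.

```python
import random


def iterate_batches(indices, batch_size, shuffle, seed=0):
    """Yield lists of indices, optionally shuffled, batch_size at a time."""
    order = list(indices)
    if shuffle:
        random.Random(seed).shuffle(order)
    for start in range(0, len(order), batch_size):
        yield order[start:start + batch_size]


batches = list(iterate_batches(range(6766), batch_size=8, shuffle=True))

print(len(batches))      # number of batches per epoch
print(len(batches[-1]))  # size of the final, partial batch
```

The number of batches per epoch matters later, because the learning rate scheduler needs the total number of optimizer steps.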
Creating the BERT Classifier
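We will use the BertForSequenceClassification class from the transformers library to create the BERT classifier. The BertForSequenceClassification class will:
- load the pretrained BERT model
- add a dropout layer
- add a dense (linear) layer with one output per class
The model outputs raw logits; it does not add a softmax layer. When labels are passed in, the softmax is applied inside the cross-entropy loss.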
# load the pretrained BERT model
bert = BertForSequenceClassification.from_pretrained('bert-base-cased',
                                                     num_labels=len(label_dict),
                                                     output_attentions=False,
                                                     output_hidden_states=False)
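The classification head returns one raw score (logit) per class. A softmax turns those scores into probabilities, and cross-entropy is the negative log-probability of the true class. The sketch below shows both in plain Python with made-up logits; it illustrates the math, not the library's internals.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, label):
    """Negative log-probability assigned to the true class."""
    return -math.log(softmax(logits)[label])


# Hypothetical logits in (Appreciation, Complaint, Feedback) order.
toy_logits = [0.2, 2.5, -1.0]
probs = softmax(toy_logits)
predicted = max(range(len(toy_logits)), key=lambda i: toy_logits[i])

print(probs)
print(predicted, cross_entropy(toy_logits, 1), cross_entropy(toy_logits, 0))
```

Because softmax preserves the ordering of the scores, taking the argmax of the logits gives the same predicted class as taking the argmax of the probabilities.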
Creating the Optimizer
We will use the AdamW class from the transformers library to create the optimizer. The AdamW class will:
- create the Adam optimizer
- apply weight decay to the parameters
# create the optimizer
optimizer = AdamW(bert.parameters(),
                  lr=1e-5,
                  eps=1e-8)
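For intuition, here is a sketch of a single AdamW update on one scalar parameter, in its textbook form. Library implementations differ in small details, and the weight_decay value of 0.01 is an illustrative assumption, not a setting taken from the code above.

```python
import math


def adamw_step(param, grad, m, v, t, lr=1e-5, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One textbook AdamW update; returns the new (param, m, v)."""
    m = beta1 * m + (1 - beta1) * grad          # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad * grad   # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)                # bias correction for step t
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * weight_decay * param   # decoupled weight decay
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


plain, _, _ = adamw_step(1.0, 0.5, 0.0, 0.0, t=1, weight_decay=0.0)
decayed, _, _ = adamw_step(1.0, 0.5, 0.0, 0.0, t=1, weight_decay=0.01)

print(plain, decayed)
```

On the first step the bias-corrected update has magnitude close to lr regardless of the gradient's size, and the decay term shrinks the parameter separately from the gradient-based update.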
Creating the Learning Rate Scheduler
We will use the get_linear_schedule_with_warmup function from the transformers library to create the learning rate scheduler. The resulting scheduler will:
- increase the learning rate linearly from 0 to the optimizer's initial learning rate over the warmup steps
- decrease the learning rate linearly to 0 over the remaining training steps
With num_warmup_steps=0, as used here, the learning rate starts at its initial value and decays linearly over the whole run.
# get the number of training epochs
epochs = 4
# total number of training steps is number of batches * number of epochs
total_steps = len(dataloader_train) * epochs
# create the learning rate scheduler
scheduler = get_linear_schedule_with_warmup(optimizer,
                                            num_warmup_steps=0,
                                            num_training_steps=total_steps)
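The shape of a linear warmup and decay schedule can be sketched as a multiplier applied to the base learning rate. The function name is hypothetical; the step count assumes 846 batches per epoch (6,766 training comments in batches of 8) and the 4 epochs set above.

```python
def linear_warmup_factor(step, num_warmup_steps, num_training_steps):
    """Multiplier for the base learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        return step / max(1, num_warmup_steps)  # ramp up from 0 to 1
    remaining = num_training_steps - step
    decay_span = max(1, num_training_steps - num_warmup_steps)
    return max(0.0, remaining / decay_span)     # ramp down from 1 to 0


total_steps = 846 * 4  # batches per epoch * epochs
base_lr = 1e-5

start_factor = linear_warmup_factor(0, 0, total_steps)
mid_factor = linear_warmup_factor(total_steps // 2, 0, total_steps)
end_factor = linear_warmup_factor(total_steps, 0, total_steps)

for step in (0, total_steps // 2, total_steps):
    print(step, base_lr * linear_warmup_factor(step, 0, total_steps))
```

With no warmup, the learning rate is at its full value on the first step, half of it midway through training, and zero at the end.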
Defining the Performance Metrics
We will define a helper function that reports the accuracy for each class. We will also use the f1_score function from the sklearn.metrics module to calculate the weighted F1 score.
# function to print the per-class accuracy of our predictions vs labels
def accuracy_per_class(preds, labels):
    label_dict_inverse = {v: k for k, v in label_dict.items()}
    preds_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    for label in np.unique(labels_flat):
        y_preds = preds_flat[labels_flat==label]
        y_true = labels_flat[labels_flat==label]
        print(f'Class: {label_dict_inverse[label]}')
        print(f'Accuracy: {len(y_preds[y_preds==label])}/{len(y_true)}\n')

# function to calculate the F1 score of our predictions vs labels
def f1_score_func(preds, labels):
    pred_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    return f1_score(labels_flat, pred_flat, average='weighted')
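To show what the precision, recall, and weighted F1 metrics measure, the sketch below computes them by hand for a tiny made-up example. The weighted average multiplies each class's F1 by the number of true examples in that class, so the majority class counts for more than it does in a plain (macro) average. The function names and labels are illustrative assumptions.

```python
def per_class_scores(y_true, y_pred, cls):
    """Return (precision, recall, f1) for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == cls and p == cls)
    fp = sum(1 for t, p in pairs if t != cls and p == cls)
    fn = sum(1 for t, p in pairs if t == cls and p != cls)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def averaged_f1(y_true, y_pred):
    """Return (macro, weighted) F1 over the classes present in y_true."""
    classes = sorted(set(y_true))
    f1s = [per_class_scores(y_true, y_pred, c)[2] for c in classes]
    supports = [y_true.count(c) for c in classes]
    macro = sum(f1s) / len(f1s)
    weighted = sum(f * n for f, n in zip(f1s, supports)) / len(y_true)
    return macro, weighted


# Hypothetical labels: 0 = Appreciation, 1 = Complaint, 2 = Feedback.
toy_true = [1, 1, 1, 1, 0, 2]
toy_pred = [1, 1, 1, 0, 0, 2]
complaint = per_class_scores(toy_true, toy_pred, cls=1)
macro, weighted = averaged_f1(toy_true, toy_pred)

print(complaint)
print(macro, weighted)
```

In this example the weighted score is higher than the macro score because the largest class also has an above-average F1.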
Evaluating the Model
We will define an evaluate function to evaluate the model on the validation set. The evaluate function will:
- put the model into evaluation mode
- initialize the total loss for this epoch
- iterate over the batches of the validation set
  - load the batch into the GPU
  - perform a forward pass without tracking gradients
  - accumulate the loss and collect the logits and true labels
- return the average loss, the predictions, and the true labels
# function to evaluate the model
def evaluate(dataloader_val):
    model.eval()
    loss_val_total = 0
    # empty lists to save model predictions and true labels
    predictions, true_vals = [], []
    # iterate over batches
    for batch in dataloader_val:
        # load batch to GPU
        batch = tuple(b.to(device) for b in batch)
        # unpack the inputs from our dataloader
        inputs = {'input_ids': batch[0],
                  'attention_mask': batch[1],
                  'labels': batch[2],
                  }
        # get model predictions for the current batch
        with torch.no_grad():
            outputs = model(**inputs)
        # get the loss and logits for the current batch
        loss = outputs[0]
        logits = outputs[1]
        # add on to the total loss
        loss_val_total += loss.item()
        # model predictions are stored on GPU. So, push them to CPU
        logits = logits.detach().cpu().numpy()
        label_ids = inputs['labels'].cpu().numpy()
        # append the model predictions and true values
        predictions.append(logits)
        true_vals.append(label_ids)
    # compute the average validation loss of the epoch
    loss_val_avg = loss_val_total/len(dataloader_val)
    predictions = np.concatenate(predictions, axis=0)
    true_vals = np.concatenate(true_vals, axis=0)
    return loss_val_avg, predictions, true_vals
Before we train the model, we will set seeds for reproducibility and verify that the GPU is available.
# set the seed for reproducibility
seed_val = 17
random.seed(seed_val)
np.random.seed(seed_val)
torch.manual_seed(seed_val)
torch.cuda.manual_seed_all(seed_val)
# check if GPU is available
if torch.cuda.is_available():
    # tell PyTorch to use the GPU
    device = torch.device('cuda')
    print('There are %d GPU(s) available.' % torch.cuda.device_count())
    print('We will use the GPU:', torch.cuda.get_device_name(0))
else:
    print('No GPU available, using the CPU instead.')
    device = torch.device('cpu')
# push the model to the selected device; the code below refers to it as model
model = bert.to(device)
Defining the Training Loop
We will now define the training loop. For each epoch, the loop will:
- put the model into training mode
- initialize the loss for this epoch
- iterate over the batches of the training set
  - clear the gradients
  - load the batch into the GPU
  - perform a forward pass
  - calculate the loss
  - perform a backward pass
  - clip the gradient norm to 1.0
  - update the weights
  - update the learning rate
- save the model weights
- report the average training loss, the validation loss, and the weighted F1 score
# training loop (epochs are numbered from 1 to match the saved file names)
for epoch in tqdm(range(1, epochs + 1)):
    # perform one full pass over the training set
    model.train()
    # initialize the loss for this epoch
    loss_train_total = 0
    progress_bar = tqdm(dataloader_train,
                        desc='Epoch {:1d}'.format(epoch),
                        leave=False,
                        disable=False)
    # iterate over batches
    for batch in progress_bar:
        # clear previously calculated gradients
        model.zero_grad()
        # load batch to GPU
        batch = tuple(b.to(device) for b in batch)
        # unpack the inputs from our dataloader
        inputs = {'input_ids': batch[0],
                  'attention_mask': batch[1],
                  'labels': batch[2],
                  }
        # get model predictions for the current batch
        outputs = model(**inputs)
        # get the loss of the model predictions for the current batch
        loss = outputs[0]
        # add on to the total loss
        loss_train_total += loss.item()
        # backward pass to calculate the gradients
        loss.backward()
        # clip the gradient norm to 1.0. It helps prevent the exploding gradient problem
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        # update parameters
        optimizer.step()
        # update the learning rate
        scheduler.step()
        # update progress bar (the loss is already averaged over the batch)
        progress_bar.set_postfix({'training_loss': '{:.3f}'.format(loss.item())})
    # save the model at each epoch
    torch.save(model.state_dict(), f'finetuned_BERT_epoch_{epoch}.model')
    # compute the training loss of the epoch
    loss_train_avg = loss_train_total/len(dataloader_train)
    # print epoch # and training loss
    tqdm.write(f'\nEpoch {epoch}')
    tqdm.write(f'Training loss: {loss_train_avg}')
    # compute the validation loss & F1 score of the epoch
    val_loss, predictions, true_vals = evaluate(dataloader_validation)
    val_f1 = f1_score_func(predictions, true_vals)
    # print validation loss & F1 score
    tqdm.write(f'Validation loss: {val_loss}')
    tqdm.write(f'F1 Score (Weighted): {val_f1}')
Epoch 1
Training loss: 0.5482631898960384
Validation loss: 0.5348240941578235
F1 Score (Weighted): 0.8805902422136439
Epoch 2
Training loss: 0.3173633142948431
Validation loss: 0.5934633307977647
F1 Score (Weighted): 0.885306253267006
Epoch 3
Training loss: 0.16964389977592292
Validation loss: 0.69995356636486
F1 Score (Weighted): 0.8886812897554138
Epoch 4
Training loss: 0.09559309570186539
Validation loss: 0.7222233983969576
F1 Score (Weighted): 0.8836152107516756
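The gradient clipping step in the training loop rescales all gradients together when their combined norm exceeds the limit, which keeps their direction and only shrinks their length. Below is a plain-Python sketch of that idea on a made-up two-element gradient; the helper name is hypothetical and the small constant in the denominator is there to avoid division by zero.

```python
import math


def clip_grad_norm(grads, max_norm):
    """Scale grads so their L2 norm is at most max_norm; return (grads, old_norm)."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / (total_norm + 1e-6))
    return [g * scale for g in grads], total_norm


clipped, norm_before = clip_grad_norm([3.0, 4.0], max_norm=1.0)
norm_after = math.sqrt(sum(g * g for g in clipped))
untouched, small_norm = clip_grad_norm([0.3, 0.4], max_norm=1.0)

print(norm_before, norm_after, clipped)
print(small_norm, untouched)
```

Gradients whose norm is already below the limit pass through unchanged, so clipping only affects unusually large updates.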
Validation on Test Data
Before validating on our unlabeled test data, we need to import the best model weights. We will use the model weights from epoch 3 since it achieved the best F1 score.
# Load in the model weights from epoch 3
model.load_state_dict(torch.load('finetuned_BERT_epoch_3.model',
                                 map_location=device))
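The checkpoint choice can also be made programmatically from the logged validation metrics. The sketch below copies the numbers from the training log above into a dictionary (the variable names are illustrative) and picks the best epoch under two different criteria.

```python
# Validation metrics copied from the training log above.
val_log = {
    1: {"loss": 0.5348240941578235, "f1": 0.8805902422136439},
    2: {"loss": 0.5934633307977647, "f1": 0.885306253267006},
    3: {"loss": 0.69995356636486, "f1": 0.8886812897554138},
    4: {"loss": 0.7222233983969576, "f1": 0.8836152107516756},
}

best_by_f1 = max(val_log, key=lambda epoch: val_log[epoch]["f1"])
best_by_loss = min(val_log, key=lambda epoch: val_log[epoch]["loss"])

print(best_by_f1, best_by_loss)
```

The two criteria disagree here: the weighted F1 peaks at epoch 3, while the validation loss is lowest at epoch 1. Since the project is scored on F1, selecting by F1 is consistent with the goal stated in the introduction.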
Next, we can use the saved model to get the predictions on the test set.
# get predictions from the saved model on the test dataset
model.eval()
predictions = []
for batch in dataloader_test:
    batch = tuple(b.to(device) for b in batch)
    inputs = {'input_ids': batch[0],
              'attention_mask': batch[1],
              }
    with torch.no_grad():
        outputs = model(**inputs)
    # without labels, the first output is the logits
    logits = outputs[0]
    logits = logits.detach().cpu().numpy()
    predictions.append(logits)
predictions = np.concatenate(predictions, axis=0)
# convert logits to class labels and flatten the list
predictions = np.argmax(predictions, axis=1).flatten()
Finally, we can create a submission file with the predictions.
# append label predictions to test dataframe
df_test['pred'] = predictions
# convert label predictions to original labeled columns
df_test = pd.get_dummies(df_test, columns=['pred'])
df_test.columns = ['postId', 'message', 'Appreciation_pred', 'Complaint_pred', 'Feedback_pred']
del df_test['message']
# save predictions to csv file
df_test.to_csv('submission.csv', index=False)
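One thing to watch in the submission step: pd.get_dummies only creates columns for values that actually occur, so if the model never predicts one of the classes, the fixed five-name column assignment would fail with a length mismatch. A sketch of building the one-hot columns from a fixed class list, which avoids that case, is shown below. The helper name and the predictions are hypothetical.

```python
class_names = ["Appreciation", "Complaint", "Feedback"]


def one_hot_columns(pred_ids, names):
    """Build one 0/1 column per class, even for classes that never occur."""
    return {
        f"{name}_pred": [int(p == i) for p in pred_ids]
        for i, name in enumerate(names)
    }


# Hypothetical predictions in which class 0 (Appreciation) never appears.
columns = one_hot_columns([1, 1, 2, 1], class_names)

print(columns)
```

The resulting dictionary can be passed to pd.DataFrame alongside the postId column to produce the same layout as the submission file.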
Conclusion
Without any hyperparameter tuning, we were able to achieve a weighted F1 score of 0.88 on the validation set; the test set is unlabeled, so its score cannot be computed here. The validation loss rises after the first epoch while the training loss keeps falling, which suggests that the model starts to overfit early. We can further improve the model by using more training data and tuning the hyperparameters. We can also try other models such as XLNet and RoBERTa.