Introduction

In this project, I will design a text classification model to distinguish between three types of Facebook comments:

  • Appreciation
  • Complaint
  • Feedback
Here are a few sample rows from the labeled dataset:

        message                               Appreciation  Complaint  Feedback
30      ugh                                   0             1          0
931     What are you lining your cans with?   0             0          1
596     me with my hershey cake :)            1             0          0
434     I can has free 8x10?                  0             0          1

The model will be trained on a labeled dataset of 7,961 comments. The classifier will be built on BERT, a state-of-the-art Natural Language Processing architecture, and fine-tuned on a GPU using the PyTorch framework.

The goal of this project is to build a classifier that achieves the highest predictive performance on a hidden dataset of 2,039 comments.

The model will be evaluated on the following metrics:

  • Precision
  • Recall
  • F1 Score
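
For reference, the weighted F1 score averages the per-class F1 scores, weighting each class by its number of true instances. Here is a quick illustration with sklearn on toy labels (not project data):

from sklearn.metrics import f1_score

# toy example: 4 comments, true labels vs. predicted labels
y_true = [0, 1, 1, 2]
y_pred = [0, 1, 2, 2]
print(f1_score(y_true, y_pred, average='weighted'))  # 0.75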

The tools I will be using in this project are:

  • Python
  • PyTorch
    • Machine Learning framework for training the model
  • Transformers Library
    • HuggingFace’s library for using pretrained BERT models
  • Google Colab
    • A Google-hosted cloud platform for running Jupyter notebooks. Useful when coding away from home, or when a high-end GPU is needed for deep learning

The steps I will be taking in this project are:

  1. Import the data from the local environment into the Google Colab environment
  2. Preprocess the data for PyTorch
  3. Set up and train the BERT pipeline
  4. Evaluate the classifier on the test set

Importing the Data

Here is an overview of the libraries needed for the project:

import random

import pandas as pd
import numpy as np

import torch
from torch.utils.data import TensorDataset, DataLoader

from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

from tqdm.notebook import tqdm
from transformers import BertTokenizer, AdamW, BertForSequenceClassification, get_linear_schedule_with_warmup

from google.colab import files

Before we can begin, we need to upload the data to the Google Colab environment. The data is stored in tab-separated text files, so we will use the files module to manually upload the labeled and unlabeled datasets, then read them with pandas.

# running this code in Google Colab will prompt you to upload the data
uploaded = files.upload()

# read labeled data into a pandas dataframe
df = pd.read_csv('FB_posts_labeled.txt', sep='\t')

# read unlabeled data into a pandas dataframe
df_test = pd.read_csv('FB_posts_unlabeled.txt', sep='\t')

Data Preparation

In order to train the model, we will need to:

  • encode the three one-hot label columns (Appreciation, Complaint, Feedback) into a single integer label column (0, 1, 2).
  • split the data into training and validation sets.
    • We will use the train_test_split function from the sklearn library to do this.
  • tokenize the text data
    • We will use the BertTokenizer class from the transformers library to do this.
  • convert the data into PyTorch tensors
    • this is the data type that the model will be trained on.
# create labels for classification
label_dict = {'Appreciation': 0, 'Complaint': 1, 'Feedback': 2}

# create label column that encodes labels based on dict values
conds = [df.Appreciation == 1, df.Complaint == 1, df.Feedback == 1]
choices = [0, 1, 2]
df['label'] = np.select(conds, choices, default=0)

# drop old columns
df.drop(columns=['Appreciation', 'Complaint', 'Feedback', 'postId'], inplace=True)
df
        message                                                                                                                                  label
2563    Along with so many others, it’s “Goodbye Macy’s” for me until they give Trump the axe.                                                       1
2601    My aunt loaded all of her Easter photos onto the CVS photo website. Is there a way to output them back to the computer?                      2
4778    i have over 7000 fb friends and followers on twitter. i will not stop until they’ve all heard about my poor and inefficient treatment.       1
# split data into training and validation sets, with stratification
X_train, X_val, y_train, y_val = train_test_split(df.index.values, df.label.values, test_size=0.15, random_state=12, stratify=df.label.values, shuffle=True)

print(f"length of training data: {len(X_train)}")
print(f"length of testing data: {len(X_val)}")

length of training data: 6766

length of validation data: 1195

# verify that the training and validation sets are stratified
# (print both, since a notebook cell only displays its last expression)
print(pd.Series(y_train).value_counts(normalize=True))
print(pd.Series(y_val).value_counts(normalize=True))

About 53% of the comments are Complaints; Appreciation and Feedback are the minority classes.

Tokenizing and Encoding the Text

The BERT model requires the input text to be tokenized and encoded as integer IDs before the model’s embedding layer can look them up. The BertTokenizer class from the transformers library will be used to do this. The BertTokenizer class will:

  • tokenize the text data
  • convert the tokens into integer input IDs
  • add special tokens ([CLS], [SEP]) to the beginning and end of each sequence
  • pad the sequences to a maximum length
  • create attention masks that distinguish real tokens from padding

We will use the ‘cased’ version of BERT, since capitalization may carry sentiment information (e.g. an all-caps comment expressing anger).

# Download pretrained BERT tokenizer for embeddings
tokenizer = BertTokenizer.from_pretrained('bert-base-cased',
                                          do_lower_case=False)
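
To make the tokenizer’s behavior concrete, here is a quick sketch on a made-up comment (the sample text and printed values are illustrative only):

# sketch: tokenize a single made-up comment
sample = tokenizer.encode_plus('Love this product!',
                               add_special_tokens=True,
                               return_attention_mask=True)
print(sample['input_ids'])                                   # integer IDs, starting with [CLS] (101) and ending with [SEP] (102)
print(tokenizer.convert_ids_to_tokens(sample['input_ids']))  # e.g. ['[CLS]', 'Love', 'this', 'product', '!', '[SEP]']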

The longest text in our dataset is over 512 tokens long. We will set the maximum length to 512 tokens, since this is the limit for BERT, and enable truncation.
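
To verify this claim yourself, token lengths can be computed directly with the tokenizer we just downloaded (a quick sketch; tokenizer.encode will warn about sequences longer than 512, which is expected here):

# sketch: inspect the distribution of token lengths in the labeled data
lengths = [len(tokenizer.encode(msg, add_special_tokens=True))
           for msg in df.message.values]
print(f"max token length: {max(lengths)}")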

# set max length to 512 tokens
max_len = 512

# tokenize and encode sequences in the training set
encoded_data_train = tokenizer.batch_encode_plus(
    df[df.index.isin(X_train)].message.values,
    add_special_tokens=True,
    return_attention_mask=True,
    padding='max_length',
    truncation=True,
    max_length=max_len,
    return_tensors='pt'
)

# tokenize and encode sequences in the validation set
encoded_data_val = tokenizer.batch_encode_plus(
    df[df.index.isin(X_val)].message.values,
    add_special_tokens=True,
    return_attention_mask=True,
    padding='max_length',
    truncation=True,
    max_length=max_len,
    return_tensors='pt'
)

# tokenize and encode sequences in the test set
encoded_data_test = tokenizer.batch_encode_plus(
    df_test.message.values,
    add_special_tokens=True,
    return_attention_mask=True,
    padding='max_length',
    truncation=True,
    max_length=max_len,
    return_tensors='pt'
)

Creating PyTorch Data Tensors

We will use the TensorDataset class from the torch.utils.data library to package the encoded data. The TensorDataset class will:

  • wrap the input IDs, attention masks, and labels (each a PyTorch tensor) together
  • expose them as one indexable dataset, where each item is an aligned (input_ids, attention_mask, label) tuple
# convert the training set into a TensorDataset
input_ids_train = encoded_data_train['input_ids']
attention_masks_train = encoded_data_train['attention_mask']
labels_train = torch.tensor(df[df.index.isin(X_train)].label.values)

dataset_train = TensorDataset(input_ids_train, attention_masks_train, labels_train)

# convert the validation set into a TensorDataset
input_ids_val = encoded_data_val['input_ids']
attention_masks_val = encoded_data_val['attention_mask']
labels_val = torch.tensor(df[df.index.isin(X_val)].label.values)

dataset_val = TensorDataset(input_ids_val, attention_masks_val, labels_val)

# convert the test set into a TensorDataset
input_ids_test = encoded_data_test['input_ids']
attention_masks_test = encoded_data_test['attention_mask']

dataset_test = TensorDataset(input_ids_test, attention_masks_test)
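
As a sanity check (a minimal sketch), each item of a TensorDataset is a tuple with one slice from each wrapped tensor:

# sanity check: each dataset item is an (input_ids, attention_mask, label) tuple
sample_ids, sample_mask, sample_label = dataset_train[0]
print(sample_ids.shape)    # torch.Size([512]) — one padded sequence
print(sample_mask.shape)   # torch.Size([512]) — matching attention mask
print(sample_label)        # tensor(0), tensor(1), or tensor(2)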

Creating Data Loaders

We will use the DataLoader class from the torch.utils.data library to create the data loaders. The DataLoader class will:

  • shuffle the training data each epoch
  • create batches of data

Each batch will be moved onto the GPU inside the training and evaluation loops.
# create the DataLoaders for our training and validation sets
# set batch size to 8
batch_size = 8

dataloader_train = DataLoader(dataset_train,
                              shuffle=True,
                              batch_size=batch_size)

dataloader_validation = DataLoader(dataset_val,
                                   shuffle=False,  # no need to shuffle for evaluation
                                   batch_size=batch_size)

dataloader_test = DataLoader(dataset_test,
                              shuffle=False,
                              batch_size=batch_size)
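
To confirm the loaders produce what the model expects, we can peek at one batch (a quick sketch; the shapes shown assume batch_size=8 and max_len=512):

# sketch: inspect the shapes of a single training batch
batch = next(iter(dataloader_train))
print(batch[0].shape)   # torch.Size([8, 512]) — input IDs
print(batch[1].shape)   # torch.Size([8, 512]) — attention masks
print(batch[2].shape)   # torch.Size([8])      — labels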

Creating the BERT Classifier

We will use the BertForSequenceClassification class from the transformers library to create the BERT classifier. The BertForSequenceClassification class will:

  • load the pretrained BERT model
  • add a dropout layer
  • add a dense classification layer on top of the pooled output

The head outputs raw logits; when labels are passed in, the model computes the cross-entropy loss internally.
# load the pretrained BERT model
model = BertForSequenceClassification.from_pretrained('bert-base-cased',
                                                      num_labels=len(label_dict),
                                                      output_attentions=False,
                                                      output_hidden_states=False)
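
Since the head returns raw logits, a softmax must be applied manually whenever class probabilities are needed. A minimal sketch with hypothetical logits (not model output):

import torch.nn.functional as F

# hypothetical logits for one comment: [Appreciation, Complaint, Feedback]
logits = torch.tensor([[2.1, -0.3, 0.5]])
probs = F.softmax(logits, dim=1)   # probabilities sum to 1 across the 3 classes
print(probs)                       # ≈ tensor([[0.774, 0.070, 0.156]])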

Creating the Optimizer

We will use the AdamW class from the transformers library to create the optimizer. The AdamW class will:

  • implement the Adam optimizer
  • apply decoupled weight decay directly to the parameters
# create the optimizer
optimizer = AdamW(model.parameters(),
                  lr=1e-5,
                  eps=1e-8)
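
For intuition, decoupled weight decay shrinks the weights directly at each step rather than folding the penalty into the gradient. A toy illustration on a single parameter (not part of the training pipeline):

# toy sketch: one AdamW step on a single parameter
w = torch.tensor([1.0], requires_grad=True)
toy_opt = AdamW([w], lr=1e-5, eps=1e-8, weight_decay=0.01)
(w ** 2).sum().backward()   # toy loss
toy_opt.step()              # Adam update plus decay applied directly to w
print(w)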

Creating the Learning Rate Scheduler

We will use the get_linear_schedule_with_warmup function from the transformers library to create the learning rate scheduler. The scheduler will:

  • linearly increase the learning rate from 0 to the initial value over the warmup steps
  • then linearly decay it to 0 over the remaining training steps

Since num_warmup_steps is set to 0 below, training starts at the full learning rate and decays from there.
# set the number of training epochs
epochs = 4

# total number of training steps is number of batches * number of epochs
total_steps = len(dataloader_train) * epochs

# create the learning rate scheduler
scheduler = get_linear_schedule_with_warmup(optimizer,
                                            num_warmup_steps=0,
                                            num_training_steps=total_steps)
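
To see the decay in action without touching the real optimizer, we can step a throwaway copy through the schedule (a minimal sketch):

# sketch: trace the linear schedule with a throwaway optimizer/scheduler pair
tmp_opt = AdamW([torch.nn.Parameter(torch.zeros(1))], lr=1e-5)
tmp_sched = get_linear_schedule_with_warmup(tmp_opt,
                                            num_warmup_steps=0,
                                            num_training_steps=total_steps)
lrs = []
for _ in range(total_steps):
    tmp_opt.step()
    tmp_sched.step()
    lrs.append(tmp_sched.get_last_lr()[0])
print(lrs[0], lrs[len(lrs) // 2], lrs[-1])   # ~1e-5 at the start, ~5e-6 halfway, 0 at the end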

Defining the Performance Metrics

We will define an accuracy_per_class function to report how many comments in each class are classified correctly, and an f1_score_func wrapper around sklearn’s f1_score to compute the weighted F1 score.

# function to calculate the accuracy of our predictions vs labels
def accuracy_per_class(preds, labels):
    label_dict_inverse = {v: k for k, v in label_dict.items()}

    preds_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()

    for label in np.unique(labels_flat):
        y_preds = preds_flat[labels_flat==label]
        y_true = labels_flat[labels_flat==label]
        print(f'Class: {label_dict_inverse[label]}')
        print(f'Accuracy: {len(y_preds[y_preds==label])}/{len(y_true)}\n')

# function to calculate the F1 score of our predictions vs labels
def f1_score_func(preds, labels):
    pred_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    return f1_score(labels_flat, pred_flat, average='weighted')
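
A quick usage sketch on made-up arrays (3 comments, 3 classes), just to show the expected input shapes:

# hypothetical logits and labels for three comments
dummy_preds = np.array([[2.0, 0.1, 0.3],
                        [0.2, 1.5, 0.1],
                        [0.3, 0.2, 0.9]])
dummy_labels = np.array([0, 1, 2])

print(f1_score_func(dummy_preds, dummy_labels))   # 1.0 — all three correct
accuracy_per_class(dummy_preds, dummy_labels)     # prints 1/1 for each class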

Evaluating the Model

We will define an evaluate function to evaluate the model on the validation set. The evaluate function will:

  • put the model into evaluation mode
  • initialize the total validation loss
  • iterate over the batches of the validation set
  • load each batch onto the GPU
  • perform a forward pass without computing gradients
  • accumulate the loss and collect the predictions and true labels
# function to evaluate the model on the validation set
def evaluate(dataloader_val):
    model.eval()

    loss_val_total = 0
    # empty lists to collect model predictions and true labels
    predictions, true_vals = [], []

    # iterate over batches
    for batch in dataloader_val:
        # load batch to GPU
        batch = tuple(b.to(device) for b in batch)
        # unpack the inputs from our dataloader
        inputs = {'input_ids':      batch[0],
                  'attention_mask': batch[1],
                  'labels':         batch[2],
                  }
        # get model predictions for the current batch, without gradients
        with torch.no_grad():
            outputs = model(**inputs)
        # unpack the loss and logits for the current batch
        loss = outputs[0]
        logits = outputs[1]
        # add on to the total loss
        loss_val_total += loss.item()
        # model outputs live on the GPU, so push them to the CPU
        logits = logits.detach().cpu().numpy()
        label_ids = inputs['labels'].cpu().numpy()
        # collect the model predictions and true values
        predictions.append(logits)
        true_vals.append(label_ids)

    # compute the average validation loss
    loss_val_avg = loss_val_total / len(dataloader_val)

    predictions = np.concatenate(predictions, axis=0)
    true_vals = np.concatenate(true_vals, axis=0)

    return loss_val_avg, predictions, true_vals

Before we train the model, we will set seeds for reproducibility and verify that the GPU is available.

# set the seed for reproducibility
seed_val = 17
random.seed(seed_val)
np.random.seed(seed_val)
torch.manual_seed(seed_val)
torch.cuda.manual_seed_all(seed_val)
# check if GPU is available
if torch.cuda.is_available():
    # tell PyTorch to use the GPU
    device = torch.device('cuda')
    print('There are %d GPU(s) available.' % torch.cuda.device_count())
    print('We will use the GPU:', torch.cuda.get_device_name(0))
else:
    print('No GPU available, using the CPU instead.')
    device = torch.device('cpu')

# push the model to GPU
model.to(device)

Defining the Training Loop

We will now write the training loop directly. For each epoch, the loop will:

  • put the model into training mode
  • initialize the total training loss for the epoch
  • iterate over the batches of the training set
  • load each batch onto the GPU
  • clear the gradients
  • perform a forward pass
  • calculate the loss
  • perform a backward pass
  • clip the gradient norm to 1.0
  • update the weights
  • update the learning rate
  • accumulate the loss

At the end of each epoch, the loop saves a checkpoint and reports the training loss, validation loss, and weighted validation F1 score.
# training loop

# epochs are numbered from 1 so that the printed epoch numbers and saved
# checkpoint filenames match the results shown below
for epoch in tqdm(range(1, epochs + 1)):
    # perform one full pass over the training set
    model.train()
    # initialize the loss and accuracy for this epoch
    loss_train_total = 0

    progress_bar = tqdm(dataloader_train,
                        desc='Epoch {:1d}'.format(epoch),
                        leave=False,
                        disable=False)

    # iterate over batches
    for batch in progress_bar:
        # clear previously calculated gradients
        model.zero_grad()
        # load batch to GPU
        batch = tuple(b.to(device) for b in batch)
        # unpack the inputs from our dataloader
        inputs = {'input_ids':      batch[0],
                  'attention_mask': batch[1],
                  'labels':         batch[2],
                 }
        # get model predictions for the current batch
        outputs = model(**inputs)
        # get the loss of the model predictions for the current batch
        loss = outputs[0]
        # add on to the total loss
        loss_train_total += loss.item()
        # backward pass to calculate the gradients
        loss.backward()
        # clip the the gradients to 1.0. It helps in preventing the exploding gradient problem
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        # update parameters
        optimizer.step()
        # update the learning rate
        scheduler.step()
        # update progress bar
        progress_bar.set_postfix({'training_loss': '{:.3f}'.format(loss.item())})

    # save the model at each epoch
    torch.save(model.state_dict(), f'finetuned_BERT_epoch_{epoch}.model')

    # compute the training loss of the epoch
    loss_train_avg = loss_train_total/len(dataloader_train)

    # print epoch # and training loss
    tqdm.write(f'\nEpoch {epoch}')
    tqdm.write(f'Training loss: {loss_train_avg}')

    # compute the validation loss & accuracy of the epoch
    val_loss, predictions, true_vals = evaluate(dataloader_validation)
    val_f1 = f1_score_func(predictions, true_vals)

    # print validation loss & accuracy
    tqdm.write(f'Validation loss: {val_loss}')
    tqdm.write(f'F1 Score (Weighted): {val_f1}')

Epoch   Training loss   Validation loss   F1 Score (Weighted)
1       0.5483          0.5348            0.8806
2       0.3174          0.5935            0.8853
3       0.1696          0.7000            0.8887
4       0.0956          0.7222            0.8836

Predicting on the Test Data

Before predicting on our unlabeled test data, we need to load the best model weights. We will use the weights saved after epoch 3, since that epoch achieved the best validation F1 score.

# load the best model weights (epoch 3)
model.load_state_dict(torch.load('finetuned_BERT_epoch_3.model',
                                 map_location=device))

Next, we can use the saved model to get the predictions on the test set.

# evaluate saved model on testing dataset
model.eval()
predictions = []

for batch in dataloader_test:
    batch = tuple(b.to(device) for b in batch)
    inputs = {'input_ids':      batch[0],
              'attention_mask': batch[1],
              }
    with torch.no_grad():
        outputs = model(**inputs)
    logits = outputs[0]
    logits = logits.detach().cpu().numpy()
    predictions.append(logits)

predictions = np.concatenate(predictions, axis=0)

# convert probabilities to class labels and flatten the list
predictions = np.argmax(predictions, axis=1).flatten()

Finally, we can create a submission file with the predictions.

# append label predictions to test dataframe
df_test['pred'] = predictions
# convert label predictions to original labeled columns
df_test = pd.get_dummies(df_test, columns=['pred'])
df_test.columns = ['postId', 'message', 'Appreciation_pred', 'Complaint_pred', 'Feedback_pred']
del df_test['message']

# save predictions to csv file
df_test.to_csv('submission.csv', index=False)
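
One caveat with pd.get_dummies: it only creates columns for classes that actually appear in the predictions, so the column rename above would fail if the model never predicted one of the three classes. A more defensive alternative (a sketch that would replace the get_dummies step, building the one-hot columns directly):

# defensive alternative: construct the one-hot prediction columns explicitly
submission = pd.DataFrame({'postId': df_test['postId']})
for class_name, class_id in label_dict.items():
    submission[f'{class_name}_pred'] = (predictions == class_id).astype(int)
submission.to_csv('submission.csv', index=False)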

Conclusion

Without any hyperparameter tuning, we achieved a weighted F1 score of roughly 0.89 on the validation set, which is a strong baseline. We could improve the model further by tuning the hyperparameters (learning rate, batch size, number of epochs), training on more data, or trying other pretrained transformers such as XLNet and RoBERTa.