Spam and Ham Detection for YouTube Comments, Part 1 of 3

Introduction:

This tutorial assumes you know the basics of machine learning. Its purpose is to show how to use scikit-learn (sklearn) and pandas for text classification. We will use the YouTube Spam Collection data set to classify comments as spam or ham; the goal is to build a basic spam detector.


To get the data set used in this tutorial, please visit the UCI Machine Learning Repository's YouTube Spam Collection Data Set page. This data set is nicely formatted and easy to understand.

This tutorial is broken up into three parts:

  1. A simple MultinomialNB example using CountVectorizer
  2. A pipeline with TfidfVectorizer, comparing Logistic Regression, XGBoost, and MultinomialNB
  3. Finally, adding features to our model pipeline, such as comment length, to see if they improve our models

Import Libraries and Load Data

In these examples I would like to stress the use of sklearn, pandas, and NumPy. While the example below does not use NumPy, it is an alternative option. In this step, we import the libraries we will need for this project. Since our data is in CSV format, importing and manipulating it is simple.
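As a rough sketch, the imports used across this part look like the following (NumPy is included only as the alternative option mentioned above):

```python
# Libraries used throughout this part of the tutorial
import numpy as np  # alternative array option; not used in the first example
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
```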

Loading a CSV in Python is straightforward, but deciding which method to use is up to the user and the use case. For most CSV files, pandas can load the file for you and let you process it easily. Two methods will be shown: a custom one and one using pandas.

Not using a Library - Custom Implementation

In the custom example we can edit every field as we parse the file. This can be good for very custom features, but it can get messy quickly. For our basic example, let's see how much cleaner it is using pandas.
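A minimal sketch of a custom loader using only the standard library's csv module. The CONTENT and CLASS column names follow the UCI data set's format; the inline sample stands in for one of its files, and the filename mentioned in the comment is an assumption:

```python
import csv
import io

def load_comments(lines):
    """Parse the CONTENT and CLASS fields from the UCI YouTube Spam CSV format."""
    comments, labels = [], []
    for row in csv.DictReader(lines):
        comments.append(row["CONTENT"])   # the comment text
        labels.append(int(row["CLASS"]))  # 1 = spam, 0 = ham
    return comments, labels

# A tiny inline sample standing in for one of the data set's files,
# which you would normally open with open("Youtube01-Psy.csv", encoding="utf-8")
sample = io.StringIO(
    "COMMENT_ID,AUTHOR,DATE,CONTENT,CLASS\n"
    "a1,alice,2014-01-01,Check out my channel!!!,1\n"
    "b2,bob,2014-01-02,Great song,0\n"
)
comments, labels = load_comments(sample)
print(comments, labels)
```

Every field passes through our own code, so we could clean or transform values here, at the cost of more boilerplate.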



Using Pandas Library - Pandas Implementation

This is a pretty straightforward implementation.
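A minimal pandas sketch; in practice you would pass the CSV's path (e.g. "Youtube01-Psy.csv", an assumed filename) straight to pd.read_csv, while a StringIO buffer stands in here so the snippet is self-contained:

```python
import io
import pandas as pd

# In practice: df = pd.read_csv("Youtube01-Psy.csv") -- filename is an assumption
sample = io.StringIO(
    "COMMENT_ID,AUTHOR,DATE,CONTENT,CLASS\n"
    "a1,alice,2014-01-01,Check out my channel!!!,1\n"
    "b2,bob,2014-01-02,Great song,0\n"
)
df = pd.read_csv(sample)
comments = df["CONTENT"]  # a pandas Series of comment text
labels = df["CLASS"]      # a pandas Series: 1 = spam, 0 = ham
print(type(comments))     # <class 'pandas.core.series.Series'>
```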


Comparison

From what we see, both accomplish the same task, but one is much cleaner. The KEY difference is the data type. In the first example, we created a list. In the second example, the data type is pandas.core.series.Series. For our purposes, we treat both as array-like structures. To learn more about pandas Series, read here. It's worth noting that a Series differs from a single-column DataFrame. Read about that here. Another good example of the difference is here.
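To see that last distinction concretely: single-bracket indexing on a DataFrame returns a 1-D Series, while double-bracket indexing returns a 2-D single-column DataFrame. A small illustrative sketch:

```python
import pandas as pd

df = pd.DataFrame({"CLASS": [1, 0, 1]})
as_series = df["CLASS"]    # single brackets -> 1-D Series
as_frame = df[["CLASS"]]   # double brackets -> 2-D single-column DataFrame

print(type(as_series).__name__, as_series.ndim)  # Series 1
print(type(as_frame).__name__, as_frame.ndim)    # DataFrame 2
```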

Print out our labels

It is normally a good idea to look for a class imbalance.
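One quick way to check the balance is pandas' value_counts; a sketch with a small stand-in Series in place of the real CLASS column:

```python
import pandas as pd

# Stand-in for the data set's CLASS column (1 = spam, 0 = ham)
labels = pd.Series([1, 0, 1, 0, 1, 0])
print(labels.value_counts())  # number of examples per class
```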


Label Balance

From our example we have a good balance of ham and spam examples to train and test on.


Train Test Split

Let's do an 80/20 train/test split. This way we can train on 80% of our data and test on the remaining 20%.
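With sklearn this is one call to train_test_split; a sketch with stand-in data in place of the real comments and labels (random_state is fixed only for reproducibility):

```python
from sklearn.model_selection import train_test_split

# Stand-ins for the real comment texts and their spam/ham labels
comments = ["Check out my channel!!!", "Great song", "SUBSCRIBE here",
            "love this video", "free gift cards!!"]
labels = [1, 0, 1, 0, 1]

X_train, X_test, y_train, y_test = train_test_split(
    comments, labels, test_size=0.2, random_state=42
)
print(len(X_train), len(X_test))  # 80% / 20% of the rows
```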


Model

The CountVectorizer counts the words in each document and builds a sparse matrix for each entry. If you are not sure what this means, I suggest heading over to sklearn's documentation pages for a few basic examples: sklearn's CountVectorizer and the text feature extraction examples.

Next is sklearn's MultinomialNB implementation. This is a Naive Bayes classifier for multinomial models and a basic go-to classifier for spam/ham models. Read more about sklearn's MultinomialNB implementation here.
Lastly, we must call transform (not fit_transform) on our test set. This keeps the vocabulary learned from the training data, so any word not seen during training is simply ignored in the test set.
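Putting the three steps together -- fit the vectorizer on the training text, transform the test text, and train MultinomialNB -- a sketch with a tiny stand-in corpus in place of the real split:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

# Tiny stand-in corpus; in the tutorial these come from train_test_split
X_train = ["Check out my channel!!!", "Great song", "SUBSCRIBE to win", "love this video"]
y_train = [1, 0, 1, 0]
X_test = ["SUBSCRIBE to my channel", "such a great video"]
y_test = [1, 0]

vectorizer = CountVectorizer()
X_train_counts = vectorizer.fit_transform(X_train)  # learn vocabulary from training data
X_test_counts = vectorizer.transform(X_test)        # transform only: unseen words are ignored

model = MultinomialNB()
model.fit(X_train_counts, y_train)
predictions = model.predict(X_test_counts)
print(accuracy_score(y_test, predictions))
```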

Results

And here are the printed results. An accuracy of 92.6% isn't that bad, but we can do better.

Tutorial Parts

  1. Coming Soon, Part 1
  2. Coming Soon, Part 2
  3. Coming Soon, Part 3