This is the MNIST (Modified National Institute of Standards and Technology) data set that comes with Keras. It has 60,000 training images and 10,000 test images. This data set is sometimes referred to as the "hello world" of deep learning.
We will step through the data process.
Data Process Steps
Goal or Hypothesis
Data Retrieval
Data Processing
Data Exploration
Model Data
Present Results
The goal for this project is to build a machine learning model in Keras that predicts handwritten digits.
In this example, Keras already comes with the MNIST data set for handwritten digit recognition. The data just needs to be loaded. It comes already split into training and test sets, which we will use as they are.
from keras.datasets import mnist #Loads the Data Set
#load the data
(x_train, y_train), (x_test, y_test) = mnist.load_data()
After loading the data, we need to process it into a format that we can use. In our goal we stated we want to use Keras, so this gives us an idea of the data format we need. By default, the data we loaded is in grayscale format, with pixel values in the range 0 to 255.
The data processing point is a good place to decide whether we will use a validation set. We can split this off from the test set or the training set. In this example, I will not split off a separate validation set at this stage.
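If a hand-made validation set were wanted, one way to carve it out of the training data is sketched below. This is a minimal NumPy sketch under stated assumptions, not what this notebook does: the helper name `holdout_split`, the 15% fraction, and the small synthetic stand-in arrays are all made up for illustration.

```python
import numpy as np

def holdout_split(x, y, val_fraction=0.15, seed=0):
    """Shuffle x and y together, then hold out a fraction for validation."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(x))
    n_val = int(round(len(x) * val_fraction))
    val_idx, train_idx = order[:n_val], order[n_val:]
    return (x[train_idx], y[train_idx]), (x[val_idx], y[val_idx])

# Synthetic stand-in for image data: 100 tiny 2x2 "images" with labels 0-9
x_demo = np.arange(100 * 4).reshape(100, 2, 2)
y_demo = np.arange(100) % 10

(x_tr, y_tr), (x_val, y_val) = holdout_split(x_demo, y_demo)
print(x_tr.shape, x_val.shape)  # (85, 2, 2) (15, 2, 2)
```

Shuffling before the split keeps the held-out part from depending on the order of the samples, and indexing `x` and `y` with the same index array keeps images and labels aligned.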
Let's quickly look at a single image.
#Before data processing or data exploration, let's see what an image looks like
import matplotlib.pyplot as plt #Used to Display data and results
#To display an example image, here is one digit
digit = x_train[0]
plt.imshow(digit, cmap=plt.cm.binary)
plt.show()
Processing the data
We can see what our data looks like, but we need to put it into a format that Keras can use. First we reshape each image into a single input dimension, which is just a flattened array. Then we convert the uint8 values (0 to 255 range) to float32 (0 to 1 range), which is Keras's default float type. Below is how we do this.
Next we need to change our labels to categorical format. This turns each label value 0-9 into an array of binary values. For example, the label 1 turns into [0,1,0,0,0,0,0,0,0,0].
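As a quick check of the reshaping and scaling step described above, the same two operations can be tried on a small synthetic batch before touching the real data. This is only a sketch: the random `images` array is an assumed stand-in for MNIST, chosen so the example runs without downloading anything.

```python
import numpy as np

# Synthetic stand-in for a batch of three 28x28 grayscale images (uint8, 0-255)
rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(3, 28, 28), dtype=np.uint8)

# Flatten each image to a 784-value row, then scale to the 0-1 range
flat = images.reshape(images.shape[0], 28 * 28)
scaled = flat.astype('float32') / 255

print(images.shape, images.dtype)   # (3, 28, 28) uint8
print(scaled.shape, scaled.dtype)   # (3, 784) float32
print(scaled.min() >= 0.0, scaled.max() <= 1.0)  # True True
```

Reshaping only changes how the values are laid out, so the flattened rows can be reshaped back to 28x28 to recover the original images.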
from keras.utils import to_categorical
#Flatten each 28x28 image into a single 784-value input vector
#Change values from uint8 (0 to 255) to float32 (0 to 1)
x_train = x_train.reshape(x_train.shape[0],28*28)
x_train = x_train.astype('float32')/255
x_test = x_test.reshape(x_test.shape[0], 28*28)
x_test = x_test.astype('float32')/255
#Change labels to categorical format
y_test = to_categorical(y_test)
y_train = to_categorical(y_train)
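To make the categorical format concrete, the idea behind one-hot encoding can be sketched in plain NumPy. The helper `one_hot` below is a hypothetical illustration of the concept, not Keras's own implementation of `to_categorical`.

```python
import numpy as np

def one_hot(labels, num_classes=10):
    """Turn integer labels into rows with a single 1 at the label's index."""
    labels = np.asarray(labels)
    encoded = np.zeros((labels.size, num_classes), dtype='float32')
    encoded[np.arange(labels.size), labels] = 1.0
    return encoded

encoded = one_hot([1, 0, 9])
print(encoded[0])              # label 1 -> [0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
print(encoded.argmax(axis=1))  # recovers the original labels: [1 0 9]
```

Taking the `argmax` of each row goes back from the categorical format to the original digit, which is handy when reading model predictions later.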
Now that our data is in a format Keras can use, let's look at it more closely. That means checking how balanced the data is. For starters, we will count the images in the training and test sets, and then look at how balanced the class labels are.
#Each one-hot row sums to 1, so the total is the number of images
train_count = int(y_train.sum())
test_count = int(y_test.sum())
print("Number of Training images " + str(train_count))
print("Number of Testing images " + str(test_count))
This is roughly an 85/15 split (more precisely, about 86/14). A more common approach is an 80/20 train/test split, but since the data already comes split this way, we will keep this format.
This is an important point, because later on in the Keras model, when fit is used, there is an option to do a validation split. Since our test data is roughly 15% of the total, we will use the same percentage for the validation split.
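The split percentages can be worked out directly from the image counts given at the top of this notebook. The short sketch below does that arithmetic and also shows roughly how many training images a 15% validation split would set aside.

```python
# Image counts stated at the top of this notebook
n_train = 60_000
n_test = 10_000
total = n_train + n_test

train_share = n_train / total
test_share = n_test / total
print(f"train: {train_share:.1%}, test: {test_share:.1%}")  # train: 85.7%, test: 14.3%

# A 15% validation split of the training data sets aside about this many images
n_val = round(n_train * 0.15)
print(n_val, n_train - n_val)  # 9000 51000
```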
#Let's explore our training data. These are the total y labels for the training and validation data
import numpy as np
plt.clf()
values1 = (y_train).sum(axis=0)
display = [0,1,2,3,4,5,6,7,8,9]
plt.bar(display, values1, color = 'b', align='center', label='Train')
plt.xticks(display,display)
plt.ylim(bottom=5000)
plt.yticks(np.arange(5000, 7001, 200))
plt.ylabel('Frequency')
plt.xlabel('Number')
plt.title('Data Distribution')
plt.legend()
plt.show()
plt.clf()
values1 = (y_test).sum(axis=0)
display = [0,1,2,3,4,5,6,7,8,9]
plt.bar(display, values1, color = 'b', align='center', label='Test')
plt.xticks(display,display)
plt.ylim(bottom=500)
plt.yticks(np.arange(500, 1201, 100))
plt.ylabel('Frequency')
plt.xlabel('Number')
plt.title('Data Distribution')
plt.legend()
plt.show()
From the data distributions seen, there isn't a big difference between the two sets. In both cases there are fewer 5s than any other digit, and clearly more 1s. But the imbalance isn't drastic.
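One way to put a number on "not drastic" is the ratio between the most and least frequent class. The sketch below assumes one-hot labels like the ones built earlier; the helper name `balance_summary` and the tiny label array are hypothetical and are not the real MNIST counts.

```python
import numpy as np

def balance_summary(one_hot_labels):
    """Return per-class counts and the ratio of the largest to the smallest class."""
    counts = one_hot_labels.sum(axis=0).astype(int)
    return counts, counts.max() / counts.min()

# Hypothetical labels for illustration only (4 classes, not the real MNIST counts)
labels = np.array([0, 0, 1, 1, 1, 1, 2, 2, 2, 3])
one_hot_labels = np.eye(4)[labels]  # rows of the identity matrix act as one-hot vectors

counts, ratio = balance_summary(one_hot_labels)
print(counts)  # [2 4 3 1]
print(ratio)   # 4.0
```

Calling `balance_summary(y_train)` and `balance_summary(y_test)` would give the same summary for the real labels. A ratio close to 1 means the classes are nearly balanced.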
The next step is to apply a model to our data. For this example we will use Keras. Using a basic sequential model, we experiment with different layers. Without getting into cross validation, we explore a few common, basic optimizations.
#Create the model. These are two NN implementations.
#This is used as a baseline, while the CNN is the real implementation
# The first step in our process is to import the libraries we will need
from keras import models #Loads models
from keras import layers #Loads Layers
def my_model():
    '''
    #Baseline 1 - First attempt
    #Train - On epoch 6 - loss: 0.1367 - acc: 0.9605 - val_loss: 0.1389 - val_acc: 0.9611
    # Test - 6 epochs - Loss 0.13756338410377503 Accuracy 0.9608
    model = models.Sequential()
    model.add(layers.Dense(32, input_shape=(28*28,), activation='relu'))
    '''
    #Baseline 2
    # Train - On epoch 6 - loss: 0.1305 - acc: 0.9614 - val_loss: 0.1392 - val_acc: 0.9613
    # Test - Loss 0.13497577327471227 Accuracy 0.9604
    model = models.Sequential()
    model.add(layers.Dense(32, activation = 'relu', input_shape = (28*28,)))
    model.add(layers.Dense(16, activation = 'relu'))
    '''
    #Baseline 3 - Best results
    # Train - On epoch 3 - loss: 0.0654 - acc: 0.9808 - val_loss: 0.0751 - val_acc: 0.9786
    # Test - 3 epochs - Loss 0.07999640309328679 Accuracy 0.9767
    model = models.Sequential()
    model.add(layers.Dense(512, activation = 'relu', input_shape = (28*28,)))
    '''
    '''
    #Baseline 4
    # Train - on epoch 3 loss: 0.0839 - acc: 0.9753 - val_loss: 0.0853 - val_acc: 0.9764
    # Test - Loss 0.08825657425741665 Accuracy 0.9729
    model = models.Sequential()
    model.add(layers.Dense(256, activation = 'relu', input_shape = (28*28,)))
    '''
    '''
    #Baseline 5
    # Train - on epoch 3 - loss: 0.0890 - acc: 0.9746 - val_loss: 0.0983 - val_acc: 0.9742
    # Test - 3 epochs - Loss 0.07707974565780605 Accuracy 0.9788
    model = models.Sequential()
    model.add(layers.Dense(512, activation = 'relu', input_shape = (28*28,)))
    model.add(layers.Dense(256, activation = 'relu'))
    '''
    '''
    #Baseline 6
    # Train - on epoch 20 - loss: 0.1173 - acc: 0.9784 - val_loss: 0.1158 - val_acc: 0.9812
    # Test - 20 epochs - Loss 0.1360144746005074 Accuracy 0.9808
    model = models.Sequential()
    model.add(layers.Dense(512, activation = 'relu', input_shape = (28*28,)))
    model.add(layers.Dropout(0.5))
    model.add(layers.Dense(256, activation = 'relu'))
    model.add(layers.Dropout(0.5))
    '''
    #This is the output layer; it needs 10 units for the 10 classes and returns an array of 10 probabilities
    model.add(layers.Dense(10,activation='softmax'))
    #Used to compile the model
    model.compile(optimizer="RMSprop", loss="categorical_crossentropy", metrics=["accuracy"])
    return model
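To get a feel for how big the uncommented baseline (Baseline 2) is, its trainable parameters can be counted by hand. This is a small sketch; the helper name `dense_params` is made up here, and the totals should be comparable to what `model.summary()` reports for the same layers.

```python
def dense_params(n_inputs, n_units):
    """Weights plus one bias per unit for a fully connected layer."""
    return n_inputs * n_units + n_units

# Layer widths of Baseline 2 plus the output layer: 784 inputs -> 32 -> 16 -> 10
widths = [28 * 28, 32, 16, 10]
per_layer = [dense_params(a, b) for a, b in zip(widths, widths[1:])]

print(per_layer)       # [25120, 528, 170]
print(sum(per_layer))  # 25818
```

Almost all of the parameters sit in the first layer, because it connects every one of the 784 pixels to every unit. That is also why the 512-unit baselines are so much larger.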
#Start with 20 epochs and a batch size of 64
epochs = 20
batch_size = 64
model = my_model()
history = model.fit(np.array(x_train), np.array(y_train), epochs=epochs, batch_size=batch_size, validation_split=0.15)
#Now let's evaluate the results we have seen
history_dict = history.history
loss_values = history_dict['loss']
loss_values = np.array(loss_values)
val_loss_values = history_dict['val_loss']
val_loss_values = np.array(val_loss_values)
epochs = range(1, len(loss_values) + 1)
plt.xticks(epochs)
plt.plot(epochs, loss_values, label='Training loss')
plt.plot(epochs, val_loss_values, label='Validation loss')
plt.title('Training and validation loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.show()
plt.clf()
plt.xticks(epochs)
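#Older Keras versions store accuracy under 'acc'; newer ones use 'accuracy'
acc_values = history_dict.get('acc', history_dict.get('accuracy'))
val_acc_values = history_dict.get('val_acc', history_dict.get('val_accuracy'))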
plt.plot(epochs, acc_values, label='Training acc')
plt.plot(epochs, val_acc_values, label='Validation acc')
plt.title('Training and validation accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()
plt.show()
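Besides reading the plots by eye, the epoch to stop at can be picked from the history values directly. The sketch below uses the lowest validation loss as the criterion, which is one common choice and an assumption here. The numbers in `history_demo` are hypothetical and are not results from this notebook.

```python
import numpy as np

# Hypothetical history values for illustration only (not results from this notebook)
history_demo = {
    'loss':     [0.40, 0.25, 0.18, 0.14, 0.11, 0.09],
    'val_loss': [0.30, 0.22, 0.19, 0.18, 0.20, 0.23],
}

val_loss = np.array(history_demo['val_loss'])
best_epoch = int(val_loss.argmin()) + 1  # epochs are counted from 1
print(best_epoch)  # 4
```

In this made-up example the training loss keeps falling after epoch 4 while the validation loss rises again, which is the usual sign that the model has started to overfit.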
#Fit/Train the model based on the evaluation above
#Train for the number of epochs where accuracy looks best in the plots (6 here)
model = my_model()
model.fit(x_train, y_train, epochs = 6, batch_size=64)
#See how accurate the trained model is against the test set
loss, accuracy = model.evaluate(x_test,y_test)
print("Loss " + str(loss))
print("Accuracy " + str(accuracy))
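The accuracy reported by evaluate can be understood as the fraction of images whose most probable predicted class matches the label. Below is a minimal sketch of that calculation. The helper `accuracy_from_probs` and the 3-class arrays are hypothetical stand-ins for the one-hot test labels and the model's softmax output.

```python
import numpy as np

def accuracy_from_probs(y_true_onehot, y_prob):
    """Fraction of rows where the highest-probability class matches the label."""
    true_class = y_true_onehot.argmax(axis=1)
    predicted_class = y_prob.argmax(axis=1)
    return float((true_class == predicted_class).mean())

# Hypothetical 3-class example: four samples with true classes 0, 1, 2, 2
y_true_demo = np.eye(3)[[0, 1, 2, 2]]
y_prob_demo = np.array([[0.8, 0.1, 0.1],
                        [0.2, 0.7, 0.1],
                        [0.1, 0.6, 0.3],   # predicted 1, but the label is 2
                        [0.1, 0.1, 0.8]])

print(accuracy_from_probs(y_true_demo, y_prob_demo))  # 0.75
```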
Results
Our goal was to get good accuracy, and we did. The initial attempt reached about 96% accuracy, and with some slight tuning we were able to get to about 98%.
Further Exploration:
For a better result we could use cross validation to find the best parameters. We did not, however, use cross validation to tune parameters here. Adding and tuning multiple layers might also give us better results.
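For reference, the index bookkeeping behind k-fold cross validation can be sketched in a few lines of NumPy. This is an assumed, simplified version with a hypothetical helper name `k_fold_indices`. In practice each pair of index arrays would be used to train a fresh model on one part and validate on the other.

```python
import numpy as np

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, val_idx) pairs; each sample is used for validation exactly once."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_samples), k)
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train_idx, val_idx

splits = list(k_fold_indices(10, k=5))
for train_idx, val_idx in splits:
    print(len(train_idx), len(val_idx))  # 8 2 for each of the 5 folds
```

Averaging the validation score over the folds gives a steadier estimate for comparing parameter choices than a single split, at the cost of training the model k times.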
Areas to explore:
Optimisers: I did not change the optimiser because I know that RMSprop does fine for this example. An optimiser like Adam might do better.
Loss: We used this to see whether the model is improving or getting worse in relation to the accuracy. Categorical crossentropy is the standard choice for a multi-class problem with one-hot labels like this one.
Accuracy: Since our goal was to get the best accuracy, we used this as a metric.
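To show what the loss item above measures, here is a small NumPy sketch of softmax followed by categorical crossentropy. It illustrates the formulas only and is not Keras's implementation. The function names, the clipping constant `eps`, and the example values are assumptions.

```python
import numpy as np

def softmax(logits):
    """Convert raw scores to probabilities that sum to 1 in each row."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def categorical_crossentropy(y_true_onehot, y_prob, eps=1e-7):
    """Mean negative log-probability assigned to the true class."""
    y_prob = np.clip(y_prob, eps, 1.0)  # avoid log(0)
    return float(-(y_true_onehot * np.log(y_prob)).sum(axis=1).mean())

# Hypothetical scores for two samples and three classes
logits = np.array([[2.0, 0.5, 0.1],
                   [0.1, 0.2, 3.0]])
y_true_demo = np.array([[1.0, 0.0, 0.0],
                        [0.0, 0.0, 1.0]])

probs = softmax(logits)
print(probs.sum(axis=1))
print(categorical_crossentropy(y_true_demo, probs))
```

The loss shrinks toward 0 as the probability assigned to the correct class approaches 1. This is why it can keep improving even when accuracy has stopped changing.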
Next Goal: Let's explore how a CNN performs on the same data.