Supervised Learning: Multiclass Classification

  • prepare train data
  • define train variables
  • define step/update function
    • define loss function
  • train

Importing Dependencies

import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf

from tensorflow.keras import models, layers, optimizers, losses, metrics
from tensorflow.keras.datasets import reuters

Importing Dataset

LIMIT_WORD = 10000

(train_data, train_labels), (test_data, test_labels) = reuters.load_data(num_words=LIMIT_WORD)
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/reuters.npz
2113536/2110848 [==============================] - 2s 1us/step - ETA: 0s


D:\Anaconda3\envs\tf_env\lib\site-packages\tensorflow_core\python\keras\datasets\reuters.py:113: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray
  x_train, y_train = np.array(xs[:idx]), np.array(labels[:idx])
D:\Anaconda3\envs\tf_env\lib\site-packages\tensorflow_core\python\keras\datasets\reuters.py:114: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray
  x_test, y_test = np.array(xs[idx:]), np.array(labels[idx:])
print(train_data.shape)
print(test_data.shape)
(8982,)
(2246,)

def convert_to_english(sequence):
    # word_index is a dictionary mapping words to an integer index
    word_index = reuters.get_word_index()

    # We reverse it, mapping integer indices to words
    reverse_word_index = dict(
        [(value, key) for (key, value) in word_index.items()]
    )

    # We decode the newswire; note that our indices are offset by 3
    # because 0, 1 and 2 are reserved indices for "padding", "start of sequence", and "unknown".
    decoded_review = " ".join(
        [reverse_word_index.get(i - 3, '?') for i in sequence]  # if an index is not found, replace it with '?'
    )
    return decoded_review

print(convert_to_english(train_data[0]))
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/reuters_word_index.json
557056/550378 [==============================] - 1s 1us/step
? ? ? said as a result of its december acquisition of space co it expects earnings per share in 1987 of 1 15 to 1 30 dlrs per share up from 70 cts in 1986 the company said pretax net should rise to nine to 10 mln dlrs from six mln dlrs in 1986 and rental operation revenues to 19 to 22 mln dlrs from 12 5 mln dlrs it said cash flow per share this year should be 2 50 to three dlrs reuter 3

Preparing the data

We cannot feed lists of integers into a neural network. We have to turn our lists into tensors. There are two ways we could do that:

  • We could pad our lists so that they all have the same length, turn them into an integer tensor of shape (samples, word_indices), and then use as the first layer in our network a layer capable of handling such integer tensors (the Embedding layer, which we will cover in detail later in the book). A short sketch of this option follows the list.
  • We could one-hot-encode our lists to turn them into vectors of 0s and 1s. Concretely, this would mean for instance turning the sequence [3, 5] into a 10,000-dimensional vector that would be all-zeros except for indices 3 and 5, which would be ones. Then we could use as first layer in our network a Dense layer, capable of handling floating point vector data.
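
For reference, a rough sketch of the first option (which we will not use in this post) could look like the following; pad_sequences and the Embedding layer are standard tf.keras components, while maxlen=200 and output_dim=32 are arbitrary illustrative values:

from tensorflow.keras.preprocessing.sequence import pad_sequences

# pad/truncate every newswire to a fixed length -> integer tensor of shape (samples, 200)
padded_train = pad_sequences(train_data, maxlen=200)

# an Embedding first layer then maps each integer word index to a dense 32-dimensional vector
embedding_model = models.Sequential()
embedding_model.add(layers.Embedding(input_dim=LIMIT_WORD, output_dim=32, input_length=200))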

We will go with the latter solution. Let’s vectorize our data, which we will do manually for maximum clarity:

# one hot encoding for data
def vectorize_sequences(sequences, dimension=LIMIT_WORD):
    results = np.zeros((len(sequences), dimension))  # create an all-zero matrix of shape (len(sequences), dimension)
    for i, sequence in enumerate(sequences):
        results[i, sequence] = 1  # set specific indices of results[i] to 1s
    return results

# vectorize train test data
x_train = vectorize_sequences(train_data)
x_test = vectorize_sequences(test_data)
print(x_train[0])
[0. 1. 1. ... 0. 0. 0.]

# one hot encoding for labels
def to_one_hot(labels, dimension=46):
    results = np.zeros((len(labels), dimension))  # create an all-zero matrix of shape (len(labels), dimension)
    for i, label in enumerate(labels):
        results[i, label] = 1  # set specific indices of results[i] to 1s
    return results

# vectorize train/test labels
y_train = to_one_hot(train_labels)
y_test = to_one_hot(test_labels)

# can also use the built-in Keras function:
# from tensorflow.keras.utils import to_categorical
#
# y_train = to_categorical(train_labels)
# y_test = to_categorical(test_labels)

print(y_train[0])
[0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
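
As a quick sanity check (not part of the original notebook), the argmax of a one-hot label vector should recover the original integer label; for the first sample both values are 3, consistent with the vector shown above:

print(np.argmax(y_train[0]), train_labels[0])  # the one-hot vector above has its single 1 at index 3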

Building the neural network

Architecture

  • Intermediate Dense layers: 2, with 64 hidden units each, activation: ReLU
  • Output layer: 46 units, activation: softmax

The number of hidden units in the intermediate layers should be large enough for the network to learn to separate 46 different classes; a layer that is too small can create an information bottleneck, permanently dropping relevant information.

As this is a multiclass classification problem, we will use the softmax activation function in the output layer. The network will then output a probability distribution over the 46 output classes, and the 46 probabilities will sum to 1.
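
To see the bottleneck effect, one could replace the second intermediate layer with a much smaller one. The following is a hypothetical variant (get_bottleneck_model is not part of the original code); trained the same way as the model below, it would typically show a clear drop in validation accuracy:

def get_bottleneck_model():
    model = models.Sequential()
    model.add(layers.Dense(64, activation='relu', input_shape=(LIMIT_WORD,)))
    model.add(layers.Dense(4, activation='relu'))   # only 4 units: an information bottleneck for 46 classes
    model.add(layers.Dense(46, activation='softmax'))
    return model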

Compile

Lastly, we need to pick a loss function and an optimizer.
The best loss function to use in this case is categorical_crossentropy.
It measures the distance between two probability distributions:
here, between the probability distribution output by the network and the true distribution of the labels.
By minimizing the distance between these two distributions,
you train the network to output something as close as possible to the true labels.

  • Loss function: categorical_crossentropy
  • Optimizer: rmsprop
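
To make "distance between two probability distributions" concrete, here is a tiny numpy illustration with made-up numbers: for a single sample, categorical crossentropy is -sum(y_true * log(y_pred)), the negative log of the probability the network assigns to the true class.

# hypothetical 4-class example where the true class is index 2
y_true_example = np.array([0., 0., 1., 0.])
y_pred_example = np.array([0.1, 0.1, 0.7, 0.1])

loss_example = -np.sum(y_true_example * np.log(y_pred_example))
print(loss_example)  # -log(0.7) ≈ 0.357; the loss shrinks as the probability of the true class grows
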
def get_model():
    model = models.Sequential()
    model.add(layers.Dense(64, activation='relu', input_shape=(LIMIT_WORD,)))
    model.add(layers.Dense(64, activation='relu'))
    model.add(layers.Dense(46, activation='softmax'))

    model.compile(
        optimizer=optimizers.RMSprop(lr=0.001),
        loss=losses.categorical_crossentropy,
        metrics=[metrics.categorical_accuracy]
    )

    return model

Validation

  • Hold-out validation
print(x_train.shape[0])
8982
num_of_validation_sample = int(x_train.shape[0]*0.2)
print(num_of_validation_sample)
1796

# split with a simple slice rather than a random selection; the data and labels live in
# separate arrays, so any shuffling would have to be applied identically to both to keep them aligned

x_val = x_train[:num_of_validation_sample]
partial_x_train = x_train[num_of_validation_sample:]

y_val = y_train[:num_of_validation_sample]
partial_y_train = y_train[num_of_validation_sample:]

Train / Fit

model = get_model()

history = model.fit(
    partial_x_train,
    partial_y_train,
    epochs=20,
    batch_size=512,
    validation_data=(x_val, y_val)
)
Train on 7186 samples, validate on 1796 samples
Epoch 1/20
7186/7186 [==============================] - 2s 263us/sample - loss: 2.7741 - categorical_accuracy: 0.4880 - val_loss: 1.9426 - val_categorical_accuracy: 0.6063
Epoch 2/20
7186/7186 [==============================] - 1s 118us/sample - loss: 1.5439 - categorical_accuracy: 0.6948 - val_loss: 1.5580 - val_categorical_accuracy: 0.6314
Epoch 3/20
7186/7186 [==============================] - 1s 113us/sample - loss: 1.1424 - categorical_accuracy: 0.7599 - val_loss: 1.2634 - val_categorical_accuracy: 0.7261
Epoch 4/20
7186/7186 [==============================] - 1s 118us/sample - loss: 0.8985 - categorical_accuracy: 0.8103 - val_loss: 1.1652 - val_categorical_accuracy: 0.7522
Epoch 5/20
7186/7186 [==============================] - 1s 109us/sample - loss: 0.7243 - categorical_accuracy: 0.8519 - val_loss: 1.2074 - val_categorical_accuracy: 0.7183
Epoch 6/20
7186/7186 [==============================] - 1s 110us/sample - loss: 0.5923 - categorical_accuracy: 0.8814 - val_loss: 1.0435 - val_categorical_accuracy: 0.7673
Epoch 7/20
7186/7186 [==============================] - 1s 107us/sample - loss: 0.4817 - categorical_accuracy: 0.9044 - val_loss: 1.0104 - val_categorical_accuracy: 0.7851
Epoch 8/20
7186/7186 [==============================] - 1s 102us/sample - loss: 0.3897 - categorical_accuracy: 0.9233 - val_loss: 1.0097 - val_categorical_accuracy: 0.7745
Epoch 9/20
7186/7186 [==============================] - 1s 105us/sample - loss: 0.3223 - categorical_accuracy: 0.9368 - val_loss: 1.0310 - val_categorical_accuracy: 0.7823
Epoch 10/20
7186/7186 [==============================] - 1s 101us/sample - loss: 0.2704 - categorical_accuracy: 0.9431 - val_loss: 0.9731 - val_categorical_accuracy: 0.7984
Epoch 11/20
7186/7186 [==============================] - 1s 100us/sample - loss: 0.2286 - categorical_accuracy: 0.9489 - val_loss: 1.0508 - val_categorical_accuracy: 0.7795
Epoch 12/20
7186/7186 [==============================] - 1s 99us/sample - loss: 0.2015 - categorical_accuracy: 0.9532 - val_loss: 0.9798 - val_categorical_accuracy: 0.8073
Epoch 13/20
7186/7186 [==============================] - 1s 100us/sample - loss: 0.1754 - categorical_accuracy: 0.9545 - val_loss: 1.0130 - val_categorical_accuracy: 0.8018
Epoch 14/20
7186/7186 [==============================] - 1s 103us/sample - loss: 0.1593 - categorical_accuracy: 0.9573 - val_loss: 1.0304 - val_categorical_accuracy: 0.8001
Epoch 15/20
7186/7186 [==============================] - 1s 103us/sample - loss: 0.1451 - categorical_accuracy: 0.9576 - val_loss: 1.1990 - val_categorical_accuracy: 0.7678
Epoch 16/20
7186/7186 [==============================] - 1s 105us/sample - loss: 0.1344 - categorical_accuracy: 0.9585 - val_loss: 1.2040 - val_categorical_accuracy: 0.7728
Epoch 17/20
7186/7186 [==============================] - 1s 102us/sample - loss: 0.1258 - categorical_accuracy: 0.9605 - val_loss: 1.1260 - val_categorical_accuracy: 0.7973
Epoch 18/20
7186/7186 [==============================] - 1s 106us/sample - loss: 0.1171 - categorical_accuracy: 0.9585 - val_loss: 1.3081 - val_categorical_accuracy: 0.7611
Epoch 19/20
7186/7186 [==============================] - 1s 110us/sample - loss: 0.1156 - categorical_accuracy: 0.9571 - val_loss: 1.1250 - val_categorical_accuracy: 0.7929
Epoch 20/20
7186/7186 [==============================] - 1s 101us/sample - loss: 0.1065 - categorical_accuracy: 0.9612 - val_loss: 1.1712 - val_categorical_accuracy: 0.7862

Note that the call to model.fit() returns a history object.
This object has a member history, which is a dictionary containing data about everything
that happened during training. Let’s take a look at it:

history_dict = history.history
history_dict.keys()
dict_keys(['loss', 'categorical_accuracy', 'val_loss', 'val_categorical_accuracy'])
import matplotlib.pyplot as plt

acc = history.history['categorical_accuracy']
val_acc = history.history['val_categorical_accuracy']
loss = history.history['loss']
val_loss = history.history['val_loss']
epochs = range(1, len(acc) + 1)

# "bo" is for "blue dot"
plt.plot(epochs, loss, 'bo', label='Training loss')
# b is for "solid blue line"
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()

plt.show()

(Figure: training and validation loss)

plt.clf()   # clear figure

plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()

plt.show()

(Figure: training and validation accuracy)

As you can see, the training loss decreases with every epoch and the training accuracy increases with every epoch. That's what you would expect when running gradient descent optimization: the quantity you are trying to minimize should get lower with every iteration.

But that isn't the case for the validation loss and accuracy: they seem to peak around the 10th epoch. This is an example of what we were warning against earlier: a model that performs better on the training data isn't necessarily a model that will do better on data it has never seen before.

In precise terms, what you are seeing is “overfitting”: after the 10th epoch, we are over-optimizing on the training data, and we ended up learning representations that are specific to the training data and do not generalize to data outside of the training set.

In this case, to prevent overfitting, we could simply stop training after 10 epochs.
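
Alternatively, instead of hard-coding the epoch count, we could let Keras stop training automatically with an EarlyStopping callback. This is a sketch of the usual pattern rather than part of the original notebook; the monitor and patience values are just illustrative:

early_stopping = tf.keras.callbacks.EarlyStopping(
    monitor='val_loss',        # watch the validation loss
    patience=2,                # stop after 2 epochs without improvement
    restore_best_weights=True  # roll back to the weights of the best epoch
)

model = get_model()
model.fit(
    partial_x_train,
    partial_y_train,
    epochs=20,
    batch_size=512,
    validation_data=(x_val, y_val),
    callbacks=[early_stopping]
)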

Evaluate

results = model.evaluate(x_test, y_test)
2246/2246 [==============================] - 0s 115us/sample - loss: 1.2289 - categorical_accuracy: 0.7796
print(results)

Prediction

predictions  = model.predict(x_test)
print(predictions.shape)
(2246, 46)
print(predictions[0])
[4.46490594e-06 5.76023513e-06 2.44091467e-08 9.56246376e-01
 4.17100601e-02 5.68068231e-07 1.80576393e-07 1.13307908e-06
 4.55904490e-04 1.93153497e-07 1.25025309e-07 2.28998324e-04
 1.34078937e-06 7.84989752e-05 1.70830447e-06 5.53114390e-08
 1.09190032e-05 4.19231537e-06 1.03650223e-07 2.35304658e-04
 9.40605300e-04 2.34770396e-05 2.54490487e-06 8.72557706e-08
 9.19277113e-07 1.09992165e-07 9.71686553e-09 1.61543612e-09
 1.58277669e-06 9.61224600e-07 1.44538478e-06 3.38638664e-08
 1.63851297e-07 7.28492239e-07 6.30199884e-06 9.01029978e-07
 2.80448749e-05 2.52350905e-08 9.86374857e-07 1.83881369e-07
 9.36134313e-07 1.66743348e-06 1.62385675e-06 1.71055007e-07
 7.81268525e-08 4.72410619e-07]
np.sum(predictions[0])
0.99999994
np.argmax(predictions[0])
3
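
As a sanity check (not in the original notebook), taking the argmax of every prediction and comparing against the raw integer test labels should reproduce the accuracy reported by model.evaluate:

predicted_classes = np.argmax(predictions, axis=-1)  # most probable class for each test newswire
manual_accuracy = np.mean(predicted_classes == test_labels)
print(manual_accuracy)  # should be close to the categorical_accuracy reported above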

Analysis & Overfitting

We can see that our model is overfitting after the 10th epoch.

# now train on the full training data, for only 10 epochs
model = get_model()
model.fit(
    x_train,
    y_train,
    epochs=10,
    batch_size=512
)
results = model.evaluate(x_test, y_test)
print(results)
Train on 8982 samples
Epoch 1/10
8982/8982 [==============================] - 1s 161us/sample - loss: 2.5005 - categorical_accuracy: 0.5281
Epoch 2/10
8982/8982 [==============================] - 1s 83us/sample - loss: 1.3553 - categorical_accuracy: 0.7122
Epoch 3/10
8982/8982 [==============================] - 1s 87us/sample - loss: 1.0068 - categorical_accuracy: 0.7878
Epoch 4/10
8982/8982 [==============================] - 1s 100us/sample - loss: 0.7865 - categorical_accuracy: 0.8368
Epoch 5/10
8982/8982 [==============================] - 1s 86us/sample - loss: 0.6255 - categorical_accuracy: 0.8688
Epoch 6/10
8982/8982 [==============================] - 1s 86us/sample - loss: 0.4986 - categorical_accuracy: 0.8949
Epoch 7/10
8982/8982 [==============================] - 1s 91us/sample - loss: 0.3998 - categorical_accuracy: 0.9156
Epoch 8/10
8982/8982 [==============================] - 1s 91us/sample - loss: 0.3309 - categorical_accuracy: 0.9273
Epoch 9/10
8982/8982 [==============================] - 1s 88us/sample - loss: 0.2720 - categorical_accuracy: 0.9383
Epoch 10/10
8982/8982 [==============================] - 1s 77us/sample - loss: 0.2369 - categorical_accuracy: 0.9444
2246/2246 [==============================] - 0s 138us/sample - loss: 0.9619 - categorical_accuracy: 0.7965
[0.9618503038944778, 0.79652715]

Compared with the earlier run (test accuracy of about 78%), this small change brings test accuracy up to about 79.7%, an improvement of almost 2 percentage points.

A different way to handle the labels and the loss

We can also cast the labels as an integer tensor:

y_train = np.array(train_labels)
y_test = np.array(test_labels)

The only thing it would change is the choice of the loss function. Our previous loss, categorical_crossentropy, expects the labels to follow a categorical encoding.
With integer labels, we should use sparse_categorical_crossentropy.

This new loss function is still mathematically the same as categorical_crossentropy; it just has a different interface.

model.compile(optimizer='rmsprop', loss='sparse_categorical_crossentropy', metrics=['acc'])
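
A quick way to convince yourself of that equivalence is to evaluate both functional losses on the same made-up sample (an illustrative check, assuming eager execution as in TF 2.x):

# hypothetical single sample with 3 classes; the true class is index 2
probs = np.array([[0.1, 0.2, 0.7]], dtype='float32')

one_hot_loss = losses.categorical_crossentropy(np.array([[0., 0., 1.]], dtype='float32'), probs)
sparse_loss = losses.sparse_categorical_crossentropy(np.array([2]), probs)

print(one_hot_loss.numpy(), sparse_loss.numpy())  # both equal -log(0.7) ≈ 0.357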