Build Neural Network from Scratch

This post will introduce how to build a neural network from scratch:

  1. the forward propagation: from the input data to the activations on each layer
  2. the backpropagation: using the chain rule to calculate the derivatives of the cost function with respect to the weights (parameters) on each layer
  3. minimizing the cost function by SGD, using the derivatives from step 2

Example 1.

First, I will introduce a very simple example. Suppose we have data like the following: \(X\) is a 4x3 array and \(Y\) is a 4x1 vector. The task is to predict the output \(Y\) from the input \(X\). We will show how to build a neural network to do this job.

Inputs: X    Output: Y
0 0 1        0
0 1 1        1
1 0 1        1
1 1 1        0

Our NN will have 2 layers. The first layer maps the input \(X\) (which we will call \(a_1\)) to the activations \(a_2\) through the sigmoid activation function. The weights of the first layer, \(\Theta_1\), have shape 4x3 (so \(\Theta_1^{T}\) is 3x4), and the resulting activations \(a_2\) have shape 4x4.

The first layer's activations \(a_2\) are then the input to the second layer, which produces \(a_3\) using the weights \(\Theta_2\) and the sigmoid activation function. The shape of \(\Theta_2^{T}\) is 4x1: the input to layer 2 has shape 4x4 and \(Y\) has shape 4x1, and these two determine the shape of \(\Theta_2\) because layer 2 is the output layer here.
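A quick numpy check of these shapes (a sketch with placeholder zero weights; the sigmoid is omitted because it does not change any shapes):

import numpy as np

X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]])   # a1, shape (4, 3)
theta1 = np.zeros((4, 3))    # Theta_1, so Theta_1.T has shape (3, 4)
theta2 = np.zeros((1, 4))    # Theta_2, so Theta_2.T has shape (4, 1)

z1 = X.dot(theta1.T)         # (4, 3) . (3, 4) -> (4, 4), becomes a2 after the sigmoid
z2 = z1.dot(theta2.T)        # (4, 4) . (4, 1) -> (4, 1), becomes a3 after the sigmoid
print(z1.shape, z2.shape)    # (4, 4) (4, 1)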

Preparation

1.1. Model Structure

Our prediction neural network will have two layers: the first layer is a linear combination of the input \(X\) and the parameters \(\Theta_1\), passed through the sigmoid function; the second layer is a linear combination of the resulting activations \(a_2\) and the parameters \(\Theta_2\), again passed through the sigmoid function.

\begin{aligned} a_1 &= X \\ z_1 &= a_1 \cdot \Theta_1^{T} \\ a_2 &= \mbox{sigmoid}(z_1) = \frac{1}{1 + e^{- z_1}} \qquad \mbox{the first activation layer} \\ z_2 &= a_2 \cdot \Theta_2^{T} \\ a_3 &= \mbox{sigmoid}(z_2) = \frac{1}{1 + e^{- z_2}} \qquad \mbox{the second activation layer} \end{aligned}

The symbol \(\cdot\) above means the dot product of vectors or matrices. We also need the [element-wise product](https://en.wikipedia.org/wiki/Hadamard_product_(matrices)), denoted as \(\odot\).
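In numpy, these two products correspond to .dot (dot product) and * (element-wise product):

import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])

print(A.dot(B))   # dot (matrix) product, the "\cdot" above
print(A * B)      # element-wise (Hadamard) product, the "\odot" above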

1.2. Cost Function

A cost function generally measures the distance between the predictions and the true values.

  1. Cross Entropy Loss function:

    \begin{aligned} C = -\frac{1}{n}\sum_{x}\left[y \log(a_3) + (1 - y)\log(1 - a_3) \right] \end{aligned}

  2. L2 cost function:

    \begin{aligned} C = \frac{1}{2}||y - a_3||^2 \end{aligned}
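Both cost functions are one-line numpy expressions. A minimal sketch (the function names are mine):

import numpy as np

def cross_entropy_cost(y, a3):
    # C = -1/n * sum( y*log(a3) + (1 - y)*log(1 - a3) )
    n = y.shape[0]
    return -np.sum(y * np.log(a3) + (1 - y) * np.log(1 - a3)) / n

def l2_cost(y, a3):
    # C = 1/2 * ||y - a3||^2
    return 0.5 * np.sum((y - a3) ** 2)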

1.3. Derivatives / Gradient

Suppose the loss function is

\begin{aligned} C = \frac{1}{2} ||y - a_3||^2 \end{aligned}

First, let's denote

\begin{aligned} d_3 &= (a_3 - y) \odot (a_3 \odot (1 - a_3)) ,\; \; \mbox{ or call it } a_3\_delta \\ d_2 &= d_3 \cdot \Theta_2 \odot (a_2 \odot (1 - a_2)) \end{aligned}

The partial derivative of the loss function with respect to the parameter \(\Theta_2\), using the chain rule, is:

\begin{aligned} \frac{\partial {C}}{\partial {\Theta_2}} & = \frac{\partial{C}}{\partial{a_3}} \frac{\partial{a_3}}{\partial{z_2}} \frac{\partial{z_2}}{\partial{\Theta_2}} \\ & = (a_3 - y) \left(a_3 (1 - a_3)\right) a_2 \; \; (\mbox{if all are scalars}) \\ & = \left[(a_3 - y) \odot \left(a_3 \odot (1 - a_3)\right)\right]^{T} \cdot a_2 \\ & \triangleq d_3^{T} \cdot a_2 \end{aligned}

where \(d_3 \triangleq (a_3 - y) \odot \left(a_3 \odot (1 - a_3)\right)\). Here \(\odot\) means the element-wise product and \(\cdot\) means the dot product.

The partial derivative of the loss function with respect to the parameter \(\Theta_1\), using the chain rule, is:

\begin{aligned} \frac{\partial {C}}{\partial {\Theta_1}} & = \frac{\partial{C}}{\partial{a_3}} \frac{\partial{a_3}}{\partial{z_2}} \frac{\partial{z_2}}{\partial{a_2}} \frac{\partial{a_2}}{\partial{z_1}} \frac{\partial{z_1}}{\partial{\Theta_1}} \\ & \propto (a_3 - y) \left(a_3 (1 - a_3)\right) \Theta_2 (a_2 (1 - a_2)) a_1 \\ & \propto \left[(a_3 - y) \odot \left(a_3 \odot (1 - a_3)\right) \cdot \Theta_2 \odot (a_2 \odot (1 - a_2))\right]^{T} \cdot a_1 \\ & = d_2^{T} \cdot a_1 \end{aligned}

where \(d_2 \triangleq d_3 \cdot \Theta_2 \odot \left(a_2 \odot (1 - a_2) \right)\) .
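Before plugging these gradients into gradient descent, it is worth checking them numerically. Below is a small self-contained sketch (the function names are mine) that compares the analytic \(\partial C / \partial \Theta_2\) with a central finite-difference approximation; the two should agree up to numerical error.

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def cost(theta1, theta2, x, y):
    a2 = sigmoid(x.dot(theta1.T))
    a3 = sigmoid(a2.dot(theta2.T))
    return 0.5 * np.sum((y - a3) ** 2)

def analytic_grads(theta1, theta2, x, y):
    a2 = sigmoid(x.dot(theta1.T))
    a3 = sigmoid(a2.dot(theta2.T))
    d3 = (a3 - y) * a3 * (1 - a3)            # d_3
    d2 = d3.dot(theta2) * a2 * (1 - a2)      # d_2
    return d2.T.dot(x), d3.T.dot(a2)         # dC/dTheta_1, dC/dTheta_2

np.random.seed(0)
x = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype = float)
y = np.array([[0.], [1.], [1.], [0.]])
theta1 = np.random.randn(4, 3)
theta2 = np.random.randn(1, 4)

_, g2 = analytic_grads(theta1, theta2, x, y)
eps = 1e-6
num_g2 = np.zeros_like(theta2)
for j in range(theta2.shape[1]):
    t_plus, t_minus = theta2.copy(), theta2.copy()
    t_plus[0, j] += eps
    t_minus[0, j] -= eps
    num_g2[0, j] = (cost(theta1, t_plus, x, y) - cost(theta1, t_minus, x, y)) / (2 * eps)

print(np.max(np.abs(g2 - num_g2)))   # should be tiny (finite-difference error only)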

To find the values of \(\Theta_1\) and \(\Theta_2\), we iteratively adjust them until their gradients are close to 0. That is, we iterate:

\begin{aligned} \Theta_2^{i + 1} &= \Theta_2^{i} - \lambda \frac{\partial {C}}{\partial {\Theta_2^{i}}} \\ \Theta_1^{i + 1} &= \Theta_1^{i} - \lambda \frac{\partial {C}}{\partial {\Theta_1^{i}}} \end{aligned}

(In the code below, the error is computed as \(y - a_3\) rather than \(a_3 - y\), which flips the sign of the computed gradients, so the update there adds them instead of subtracting.)

1.4. Code

import numpy as np
np.random.seed(10)
x = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]])
y = np.array([0, 1, 1, 0]).reshape(4, -1)

a1 = x
theta1 = np.random.random(12).reshape(4, 3) * 2 - 1
theta2 = np.random.random(4).reshape(1, 4) * 2 - 1

learning_rate = 1

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

for i in range(100000):
    # forward propagation: including the first layer and the second layer
    z1 = a1.dot(theta1.T)
    a2 = sigmoid(z1) 
    z2 = a2.dot(theta2.T)
    a3 = sigmoid(z2) 

    # backpropagation: chain rule for the partial derivatives
    a3_error = y - a3    # note: y - a3 is the negative of (a3 - y) used in the math above
    theta2_grad = np.dot(np.transpose(a3_error * (a3 * (1 - a3))), a2)    # = -d3^T . a2
    theta1_grad = np.dot(np.transpose(np.dot(a3_error * (a3 * (1 - a3)),  theta2) * (a2 * (1 - a2))), a1)    # = -d2^T . a1

    # update the parameters until the gradients converge to 0
    # (we add the grads because they already carry the flipped sign)
    theta2 += learning_rate * theta2_grad
    theta1 += learning_rate * theta1_grad
theta2
array([[ -3.16174322,  10.29352444,  11.40818558,  -4.45945196]])
theta1
array([[ 3.35916278, -3.32786847,  1.87094082],
       [ 5.84130712, -5.98365309, -3.07635286],
       [-6.45870855,  6.34717945, -3.30437575],
       [-3.91182875,  3.95761648,  2.17740367]])
a3
array([[ 0.00276657],
       [ 0.99710763],
       [ 0.99718615],
       [ 0.00243296]])
# The prediction is
a3 > 0.5
array([[False],
       [ True],
       [ True],
       [False]], dtype=bool)

The above is a simple example that introduces the inner workings of a neural network: how to compute the forward propagation from the input data to the predicted output and the cost function, how to compute the backpropagation of the partial derivatives with the chain rule, and how to update the parameters until the gradients converge to zero. In fact, a neural network is overkill for this example: there are only 4 data points but 16 parameters (in \(\Theta_1\) and \(\Theta_2\)).

Example 2

The following example is a more realistic data problem: I collected 60 verification-code images, each containing 4 characters. The user needs to recognize all 4 characters correctly to pass the verification.

An example image looks like this:

[example verification-code image]

The corresponding characters are: 'k' 'n' 'n' 'p'.

2.1. Model

We will build a three-layer neural network.

\begin{aligned} a_1 &= X \\ z_2 &= a_1 \cdot \Theta_1^{T} \\ a_2 &= \mbox{sigmoid}(z_2) = \frac{1}{1 + e^{- z_2}} \qquad \mbox{the first activation layer} \\ z_3 &= a_2 \cdot \Theta_2^{T} \\ a_3 &= \mbox{sigmoid}(z_3) = \frac{1}{1 + e^{- z_3}} \qquad \mbox{the second activation layer} \\ z_4 &= a_3 \cdot \Theta_3^{T} \\ a_4 &= \mbox{sigmoid}(z_4) = \frac{1}{1 + e^{- z_4}} \qquad \mbox{the third activation layer} \end{aligned}

2.2. Derivatives

Suppose the loss function is the cross entropy:

\begin{aligned} C = -\frac{1}{n}\sum_{x}\left[y \log(a_4) + (1 - y)\log(1 - a_4) \right] \end{aligned}

First, let's denote

\begin{aligned} d_4 &= a_4 - y ,\; \; \mbox{ or call it } a_4\_delta \\ d_3 &= d_4 \cdot \Theta_3 \odot (a_3 \odot (1 - a_3)) ,\; \; \mbox{ or call it } a_3\_delta \\ d_2 &= d_3 \cdot \Theta_2 \odot (a_2 \odot (1 - a_2)) \end{aligned}

The partial derivative of the loss function with respect to the parameter \(\Theta_3\), using the chain rule, is:

\begin{aligned} \frac{\partial {C}}{\partial {\Theta_3}} & = \frac{\partial{C}}{\partial{a_4}} \frac{\partial{a_4}}{\partial{z_4}} \frac{\partial{z_4}}{\partial{\Theta_3}} \\ & = - \left[ \frac{y}{a_4} - \frac{1 - y}{1 - a_4} \right] \left(a_4 (1 - a_4)\right) a_3 = ( a_4 - y) a_3 \; \; (\mbox{if all are scalars}) \\ & = \left[(a_4 - y)\right]^{T} \cdot a_3 \\ & \triangleq d_4^{T} \cdot a_3 \end{aligned}

The partial derivative of the loss function with respect to the parameter \(\Theta_2\), using the chain rule, is:

\begin{aligned} \frac{\partial {C}}{\partial {\Theta_2}} & =\frac{\partial{C}}{\partial{a_4}} \frac{\partial{a_4}}{\partial{z_4}} \frac{\partial{z_4}}{\partial{a_3}} \frac{\partial a_3}{\partial z_3} \frac{\partial z_3}{\partial \Theta_2} \\ & \propto - \left[ \frac{y}{a_4} - \frac{1 - y}{1 - a_4} \right] \left(a_4 (1 - a_4)\right) \Theta_3 (a_3(1-a_3)) a_2 \; \; (\mbox{if all are scalars}) \\ & = \left[d_4 \cdot \Theta_3 \odot \left(a_3 \odot (1 - a_3)\right)\right]^{T} \cdot a_2 \\ & \triangleq d_3^{T} \cdot a_2 \end{aligned}

The partial derivative of the loss function with respect to the parameter \(\Theta_1\), using the chain rule, is:

\begin{aligned} \frac{\partial {C}}{\partial {\Theta_1}} & = \frac{\partial{C}}{\partial{a_4}} \frac{\partial{a_4}}{\partial{z_4}} \frac{\partial{z_4}}{\partial{a_3}} \frac{\partial a_3}{\partial z_3} \frac{\partial z_3}{\partial a_2} \frac{\partial{a_2}}{\partial{z_2}} \frac{\partial{z_2}}{\partial{\Theta_1}} \\ & \propto - \left[ \frac{y}{a_4} - \frac{1 - y}{1 - a_4} \right] \left(a_4 (1 - a_4)\right) \Theta_3 (a_3(1-a_3)) \Theta_2 (a_2(1-a_2)) \cdot a_1 \; \; (\mbox{if all are scalars}) \\ & = \left[d_3 \cdot \Theta_2 \odot \left(a_2 \odot (1 - a_2)\right)\right]^{T} \cdot a_1 \\ & = d_2^{T} \cdot a_1 \end{aligned}

2.3. Details

The overall steps will include these:

  1. from pic to data (a hypothetical sketch of this step follows this list)
  2. set up the activation function (sigmoid)
  3. calculate the gradient function
  4. calculate the loss function
  5. predict
  6. put it all together
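Step 1 (from pic to data) is not shown in the code below, which loads a pre-built train_data.pkl. Here is a hypothetical sketch of that step, assuming each image is converted to grayscale, split into 4 equal-width character patches, and each patch is resized to 20x13 = 260 pixels (the patch size is my assumption, chosen only to match the 260 input features used later):

import numpy as np
from PIL import Image

def image_to_samples(path, n_chars = 4, patch_size = (13, 20)):
    # hypothetical preprocessing: grayscale -> split into n_chars patches -> resize -> flatten
    img = Image.open(path).convert('L')
    w, h = img.size
    samples = []
    for k in range(n_chars):
        patch = img.crop((k * w // n_chars, 0, (k + 1) * w // n_chars, h))
        patch = patch.resize(patch_size)                         # PIL uses (width, height)
        samples.append(np.asarray(patch).reshape(-1) / 255.0)    # 13 * 20 = 260 features
    return np.vstack(samples)                                    # shape (4, 260), one row per character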

Shape of the data

In the implementation below, a column of ones (the bias unit) is appended to \(a_1\), \(a_2\) and \(a_3\), which is where the extra "+1" column in the shapes comes from.

Theta1.T: 261 x 200
Theta2.T: 201 x 200
Theta3.T: 201 x 36

X:  240 x 260
a1: 240 x 261
z2: 240 x 200
a2: 240 x 200 -> 240 x 201
z3: 240 x 200
a3: 240 x 200 -> 240 x 201
z4: 240 x 36
a4: 240 x 36

y_matrix: 240 x 36

import scipy.io
import os
import scipy.signal
import pandas as pd
import numpy as np
from PIL import Image
import scipy.optimize as opt
word = ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l',\
        'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']

mypath = r'data/yanzhengma/'
train_data = pd.read_pickle(mypath + 'train_data.pkl')
def sigmoid(x, derivative = False):
    # when derivative=True, x is assumed to already be sigmoid(z),
    # so sigmoid'(z) = sigmoid(z) * (1 - sigmoid(z)) = x * (1 - x)
    if derivative == True:
        return x * (1 - x)
    else:
        return 1 / (1 + np.exp(-x))

def rand_init_weights(layer_in, layer_out):
    epsilon_init = 0.1
    return np.random.rand(layer_out, 1 + layer_in) * 2 * epsilon_init - epsilon_init

def nn_gradient(nn_params, input_layer_size, hidden_layer_size, num_labels, X, y, lambda_param):
    '''
    theta1: 200x261
    theta2: 200x201
    theta3: 36x201
    a1: 240x261
    z2: 240x200
    a2: 240x200 -> 240x201
    z3: 240x200
    a3: 240x200 -> 240x201
    z4: 240x36
    a4: 240x36
    d4: 240x36
    d3: 240x201 -> 240x200
    d2: 240x201 -> 240x200
    theta1_grad: 200x261
    theta2_grad: 200x201
    theta3_grad: 36x201
    final output: 200x261 + 200x201 + 36x201 = 99636
    '''
    theta1 = nn_params[0:hidden_layer_size * (input_layer_size + 1)].reshape(hidden_layer_size, input_layer_size + 1)
    theta2 = nn_params[(hidden_layer_size * (input_layer_size + 1)) : \
                       (hidden_layer_size * (input_layer_size + 1) + \
                        hidden_layer_size * (hidden_layer_size + 1))].reshape(hidden_layer_size, hidden_layer_size + 1)
    theta3 = nn_params[(hidden_layer_size * (input_layer_size + 1) + \
                        hidden_layer_size * (hidden_layer_size + 1)):].reshape(num_labels, hidden_layer_size + 1)
    n = np.shape(X)[0]
    a1 = np.append(np.ones((n, 1)), X, axis = 1)
    z2 = a1.dot(theta1.T)
    a2 = sigmoid(z2)
    a2 = np.append(np.ones((n, 1)), a2, axis = 1)
    z3 = a2.dot(theta2.T)
    a3 = sigmoid(z3)
    a3 = np.append(np.ones((n, 1)), a3, axis = 1)
    z4 = a3.dot(theta3.T)
    a4 = sigmoid(z4)

    y_matrix = np.zeros((n, num_labels))
    for i in range(num_labels):
        y_matrix[:, i] = (y == (i + 1)).reshape(-1)

    d4 = a4 - y_matrix
    d3 = d4.dot(theta3) * a3 * (1 - a3)   
    d3 = d3[:, 1:]
    d2 = d3.dot(theta2) * a2 * (1 - a2)  
    d2 = d2[:, 1:]
    # note: these gradients do not include the regularization term
    theta1_grad = 1.0 / n * (d2.T.dot(a1))
    theta2_grad = 1.0 / n * (d3.T.dot(a2))
    theta3_grad = 1.0 / n * (d4.T.dot(a3))
    return np.append(np.append(theta1_grad.flatten(), theta2_grad.flatten(), axis = 0), theta3_grad.flatten(), axis = 0)

def nn_cost(nn_params, input_layer_size, hidden_layer_size, num_labels, X, y, lambda_param):
    theta1 = nn_params[0:(hidden_layer_size * (input_layer_size + 1))].reshape(hidden_layer_size, input_layer_size + 1)
    theta2 = nn_params[(hidden_layer_size * (input_layer_size + 1)) : \
                       (hidden_layer_size * (input_layer_size + 1) + \
                        hidden_layer_size * (hidden_layer_size + 1))].reshape(hidden_layer_size, hidden_layer_size + 1)
    theta3 = nn_params[(hidden_layer_size * (input_layer_size + 1) + \
                        hidden_layer_size * (hidden_layer_size + 1)):].reshape(num_labels, hidden_layer_size + 1)
    n = np.shape(X)[0]
    a1 = np.append(np.ones((n, 1)), X, axis = 1)
    z2 = a1.dot(theta1.T)
    a2 = sigmoid(z2)
    a2 = np.append(np.ones((n, 1)), a2, axis = 1)
    z3 = a2.dot(theta2.T)
    a3 = sigmoid(z3)
    a3 = np.append(np.ones((n, 1)), a3, axis = 1)
    z4 = a3.dot(theta3.T)
    a4 = sigmoid(z4)

    y_matrix = np.zeros((n, num_labels))
    for i in range(num_labels):
        y_matrix[:, i] = (y == (i + 1)).reshape(-1)

    return -(1.0 / n) * (np.sum(np.sum(y_matrix * np.log(a4))) + np.sum(np.sum((1 - y_matrix) * np.log(1 - a4)))) + \
        lambda_param / (2.0 * n) * (np.sum(np.sum(theta1[:, 1:]**2)) + np.sum(np.sum(theta2[:, 1:]**2)) + \
                                    np.sum(np.sum(theta3[:, 1:]**2)))
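
# optional sanity check (my addition, not part of the original post): compare nn_gradient
# with nn_cost via scipy.optimize.check_grad on a tiny synthetic problem. lambda_param is
# set to 0 because nn_gradient above does not include the regularization term.
def check_gradient(input_layer_size = 4, hidden_layer_size = 3, num_labels = 2, n = 5):
    X_small = np.random.rand(n, input_layer_size)
    y_small = np.random.randint(1, num_labels + 1, size = (n, 1))
    params = np.append(np.append(rand_init_weights(input_layer_size, hidden_layer_size).flatten(),
                                 rand_init_weights(hidden_layer_size, hidden_layer_size).flatten()),
                       rand_init_weights(hidden_layer_size, num_labels).flatten())
    err = opt.check_grad(nn_cost, nn_gradient, params,
                         input_layer_size, hidden_layer_size, num_labels, X_small, y_small, 0.0)
    print(err)   # should be small, roughly 1e-5 or less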

def predict(theta1, theta2, theta3, X):
    n = np.shape(X)[0]
    p = np.zeros((n, 1))
    h1 = sigmoid(np.append(np.ones((n, 1)), X, 1).dot(theta1.T))
    h2 = sigmoid(np.append(np.ones((n, 1)), h1, 1).dot(theta2.T))
    h3 = sigmoid(np.append(np.ones((n, 1)), h2, 1).dot(theta3.T))
    p = np.ndarray.argmax(h3, axis = 1).reshape([p.shape[0], 1])
    return p + 1
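
# hypothetical helper (not in the original post): map the 1-indexed predicted labels
# back to characters using the `word` list defined above
def labels_to_chars(pred):
    return [word[int(i) - 1] for i in pred.reshape(-1)]

# e.g. labels_to_chars(np.array([21, 24, 24, 26])) returns ['k', 'n', 'n', 'p']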

def test(train_data = train_data):
    data = train_data
    X = train_data['X']
    y = train_data['y']
    input_layer_size = 260
    hidden_layer_size = 200
    num_labels = 36

    initial_theta1 = rand_init_weights(input_layer_size, hidden_layer_size)
    initial_theta2 = rand_init_weights(hidden_layer_size, hidden_layer_size)
    initial_theta3 = rand_init_weights(hidden_layer_size, num_labels)

    initial_nn_params = np.append(np.append(initial_theta1.flatten(), initial_theta2.flatten()), initial_theta3.flatten())
    lambda_param = 0.1
    result = opt.minimize(fun = nn_cost, x0 = initial_nn_params, \
                          args = (input_layer_size, hidden_layer_size, num_labels, X, y, lambda_param), \
                          method = "CG", jac = nn_gradient, options = {"maxiter": 100})
    nn_params = result.x
    theta1 = nn_params[0:hidden_layer_size*(input_layer_size + 1)].reshape(hidden_layer_size, input_layer_size + 1)
    theta2 = nn_params[hidden_layer_size*(input_layer_size + 1) : \
                       (hidden_layer_size*(input_layer_size + 1) + \
                        hidden_layer_size * (hidden_layer_size + 1))].reshape(hidden_layer_size, hidden_layer_size + 1)
    theta3 = nn_params[(hidden_layer_size * (input_layer_size + 1) + \
                        hidden_layer_size * (hidden_layer_size + 1)):].reshape(num_labels, hidden_layer_size + 1)
    pred = predict(theta1, theta2, theta3, X)
    print("Training Accuracy: ", np.mean(pred == y) * 100, "%")
    return

test()
('Training Accuracy: ', 100.0000000, '%')