

Build Neural Network from Scratch


This post will introduce how to build a neural network from scratch:

  1. the forward propagation: from the input data to the activations of each layer
  2. the backpropagation: using the chain rule to calculate the derivatives of the cost function with respect to the weights (parameters) of each layer
  3. minimizing the cost function by gradient descent, using the derivatives from step 2.

Example 1.

First I will introduce a very simple example. Suppose we have the data below: X is a 4x3 array and Y is a 4x1 vector. The task is to predict the output Y from the input X. We will show how to build a neural network to do this job.

| Inputs: X | Output: Y |
| --------- | --------- |
| 0 0 1     | 0         |
| 0 1 1     | 1         |
| 1 0 1     | 1         |
| 1 1 1     | 0         |

Our NN will have 2 layers. The first layer maps the input X (we will call it a1) to the activations a2 with the sigmoid activation function. The weight matrix for the first layer, $\Theta_1$, is 4x3 (so $\Theta_1^T$ is 3x4), and the shape of the first activation layer a2 is therefore 4x4.

The first activation layer (a2) will be the input for the second layer (a3), with the weight parameter $\Theta_2$ and the sigmoid activation function. The shape of $\Theta_2^T$ is 4x1 (the input to layer 2 is 4x4 and the shape of Y is 4x1; these two determine the shape of $\Theta_2$, because layer 2 is the output layer here).
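
A quick shape check with numpy (a minimal sketch; the zero arrays are just placeholders for the shapes above, and the sigmoid is omitted because it does not change shapes):

import numpy as np

X = np.zeros((4, 3))          # 4 samples, 3 features
theta1 = np.zeros((4, 3))     # Theta1, so theta1.T is 3x4
theta2 = np.zeros((1, 4))     # Theta2, so theta2.T is 4x1

z1 = X.dot(theta1.T)          # (4x3) . (3x4) -> 4x4, the first layer
z2 = z1.dot(theta2.T)         # (4x4) . (4x1) -> 4x1, matching the shape of Y
print(z1.shape, z2.shape)     # (4, 4) (4, 1)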

Preparation

1.1. Model Structure

Our prediction neural network will have two layers: each layer takes a linear combination of its input and its parameter $\Theta$, and passes the result through the sigmoid function.

$$
\begin{aligned}
a_1 &= X \\
z_1 &= a_1 \cdot \Theta_1^T \\
a_2 &= \mathrm{sigmoid}(z_1) = \frac{1}{1+e^{-z_1}} && \text{the first activation layer} \\
z_2 &= a_2 \cdot \Theta_2^T \\
a_3 &= \mathrm{sigmoid}(z_2) = \frac{1}{1+e^{-z_2}} && \text{the second activation layer}
\end{aligned}
$$

The symbol $\cdot$ above means the dot product of vectors or matrices. We also need the [elementwise product](https://en.wikipedia.org/wiki/Hadamard_product_(matrices)), denoted as $*$.
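
As a quick illustration of the two products (a toy sketch with 2x2 arrays):

import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])

print(A.dot(B))   # dot product: [[19 22] [43 50]]
print(A * B)      # elementwise (Hadamard) product: [[ 5 12] [21 32]]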

1.2. Cost Function

The cost function measures the distance between the predictions and the true values. Two common choices are listed below (both are sketched in numpy after the list):

  1. Cross entropy loss function:

     $$C = -\frac{1}{n}\sum_x \left[\, y \log(a_3) + (1-y)\log(1-a_3) \,\right]$$

  2. L2 cost function:

     $$C = \frac{1}{2}\lVert y - a_3 \rVert^2$$
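
Both cost functions are one-liners in numpy (a minimal sketch; y and a3 here are small placeholder arrays and n is the number of samples):

import numpy as np

y  = np.array([[0.], [1.], [1.], [0.]])
a3 = np.array([[0.1], [0.9], [0.8], [0.2]])
n  = y.shape[0]

cross_entropy = -1 / n * np.sum(y * np.log(a3) + (1 - y) * np.log(1 - a3))
l2_cost = 0.5 * np.sum((y - a3) ** 2)
print(cross_entropy, l2_cost)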

1.3. Derivatives / Gradient

Suppose the loss function is

$$C = \frac{1}{2}\lVert y - a_3 \rVert^2$$

First, let's denote

$$
\begin{aligned}
d_3 &= (a_3 - y) * (a_3 * (1 - a_3)), \ \text{or call it a3\_delta} \\
d_2 &= d_3 \cdot \Theta_2 * (a_2 * (1 - a_2))
\end{aligned}
$$

The partial derivative of the loss function with respect to the parameter $\Theta_2$, using the chain rule, is:

$$
\begin{aligned}
\frac{\partial C}{\partial \Theta_2} &= \frac{\partial C}{\partial a_3} \cdot \frac{\partial a_3}{\partial z_2} \cdot \frac{\partial z_2}{\partial \Theta_2} \\
&= (a_3 - y)(a_3(1 - a_3))\, a_2 \quad \text{(if all are scalars)} \\
&= \left[(a_3 - y) * (a_3 * (1 - a_3))\right]^T \cdot a_2 = d_3^T \cdot a_2
\end{aligned}
$$

where $d_3 \equiv (a_3 - y) * (a_3 * (1 - a_3))$. Here $*$ means element-wise product and $\cdot$ means dot product.

The partial derivative of the loss function with respect to the parameter $\Theta_1$, using the chain rule, is:

$$
\begin{aligned}
\frac{\partial C}{\partial \Theta_1} &= \frac{\partial C}{\partial a_3} \cdot \frac{\partial a_3}{\partial z_2} \cdot \frac{\partial z_2}{\partial a_2} \cdot \frac{\partial a_2}{\partial z_1} \cdot \frac{\partial z_1}{\partial \Theta_1} \\
&= (a_3 - y)(a_3(1 - a_3))\, \Theta_2\, (a_2(1 - a_2))\, a_1 \quad \text{(if all are scalars)} \\
&= \left[(a_3 - y) * (a_3 * (1 - a_3)) \cdot \Theta_2 * (a_2 * (1 - a_2))\right]^T \cdot a_1 = d_2^T \cdot a_1
\end{aligned}
$$

where $d_2 \equiv d_3 \cdot \Theta_2 * (a_2 * (1 - a_2))$.

To find the values of $\Theta_1$ and $\Theta_2$, we iteratively adjust them until their gradients are close to 0. That is, we iterate the gradient descent updates below ($\lambda$ is the learning rate; the code that follows folds the minus sign into the error term by using `y - a3`, which is why its updates use `+=`):

$$
\begin{aligned}
\Theta_2^{(i+1)} &= \Theta_2^{(i)} - \lambda \frac{\partial C}{\partial \Theta_2^{(i)}} \\
\Theta_1^{(i+1)} &= \Theta_1^{(i)} - \lambda \frac{\partial C}{\partial \Theta_1^{(i)}}
\end{aligned}
$$

1.4. Code

import numpy as np
np.random.seed(10)
x = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]])
y = np.array([0, 1, 1, 0]).reshape(4, -1)

a1 = x
theta1 = np.random.random(12).reshape(4, 3) * 2 - 1   # random init in [-1, 1)
theta2 = np.random.random(4).reshape(1, 4) * 2 - 1    # random init in [-1, 1)

learning_rate = 1

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

for i in range(100000):
    # forward propagation: including the first layer and the second layer
    z1 = a1.dot(theta1.T)
    a2 = sigmoid(z1) 
    z2 = a2.dot(theta2.T)
    a3 = sigmoid(z2) 

    # backpropagation: chain rules of the partial derivatives
    a3_error = y - a3
    theta2_grad = np.dot(np.transpose(a3_error * (a3 * (1 - a3))), a2) 
    theta1_grad = np.dot(np.transpose(np.dot(a3_error * (a3 * (1 - a3)),  theta2) * (a2 * (1 - a2))), a1)

    # update the parameters until the gradient converges to 0
    theta2 += learning_rate * theta2_grad
    theta1 += learning_rate * theta1_grad
theta2
array([[ -3.16174322,  10.29352444,  11.40818558,  -4.45945196]])
theta1
array([[ 3.35916278, -3.32786847,  1.87094082],
       [ 5.84130712, -5.98365309, -3.07635286],
       [-6.45870855,  6.34717945, -3.30437575],
       [-3.91182875,  3.95761648,  2.17740367]])
a3
array([[ 0.00276657],
       [ 0.99710763],
       [ 0.99718615],
       [ 0.00243296]])
# The prediction is
a3 > 0.5
array([[False],
       [ True],
       [ True],
       [False]], dtype=bool)

The above is a simple example to introduce the inner workings of a neural network: how to compute the forward propagation from the input data to the predicted output and the cost function, how to compute the backpropagation of the partial derivatives with the chain rule, and how to update the parameters until the gradients converge to zero. In fact, a neural network is not really necessary for this simple example, since there are only 4 data points but we have introduced 16 parameters (the parameters in $\Theta_1$ and $\Theta_2$).

Example 2

The following example is a more realistic problem: I collected 60 verification code (CAPTCHA) images. Each image contains 4 characters, and the user needs to recognize all 4 characters correctly to pass the verification.

An example image looks like this:

(example verification code image)

The corresponding characters are: 'k' 'n' 'n' 'p'.

2.1. Model

We will build a three-layer neural network (the code below also prepends a bias column of ones to each layer's input):

$$
\begin{aligned}
a_1 &= X \\
z_2 &= a_1 \cdot \Theta_1^T \\
a_2 &= \mathrm{sigmoid}(z_2) = \frac{1}{1+e^{-z_2}} && \text{the first activation layer} \\
z_3 &= a_2 \cdot \Theta_2^T \\
a_3 &= \mathrm{sigmoid}(z_3) = \frac{1}{1+e^{-z_3}} && \text{the second activation layer} \\
z_4 &= a_3 \cdot \Theta_3^T \\
a_4 &= \mathrm{sigmoid}(z_4) = \frac{1}{1+e^{-z_4}} && \text{the third activation layer}
\end{aligned}
$$

2.2. Derivatives

Suppose the loss function is

$$C = -\frac{1}{n}\sum_x \left[\, y \log(a_4) + (1-y)\log(1-a_4) \,\right]$$

First, let's denote

$$
\begin{aligned}
d_4 &= a_4 - y, \ \text{or call it a4\_delta} \\
d_3 &= d_4 \cdot \Theta_3 * (a_3 * (1 - a_3)), \ \text{or call it a3\_delta} \\
d_2 &= d_3 \cdot \Theta_2 * (a_2 * (1 - a_2))
\end{aligned}
$$

The partial derivative of the loss function with respect to the parameter $\Theta_3$, using the chain rule, is:

$$
\begin{aligned}
\frac{\partial C}{\partial \Theta_3} &= \frac{\partial C}{\partial a_4} \cdot \frac{\partial a_4}{\partial z_4} \cdot \frac{\partial z_4}{\partial \Theta_3} \\
&= -\left[\frac{y}{a_4} - \frac{1-y}{1-a_4}\right](a_4(1 - a_4))\, a_3 \quad \text{(dropping the constant factor } \tfrac{1}{n}\text{)} \\
&= (a_4 - y)\, a_3 \quad \text{(if all are scalars)} \\
&= \left[(a_4 - y)\right]^T \cdot a_3 = d_4^T \cdot a_3
\end{aligned}
$$

The partial derivative of the loss function with respect to the parameter $\Theta_2$, using the chain rule, is:

$$
\begin{aligned}
\frac{\partial C}{\partial \Theta_2} &= \frac{\partial C}{\partial a_4} \cdot \frac{\partial a_4}{\partial z_4} \cdot \frac{\partial z_4}{\partial a_3} \cdot \frac{\partial a_3}{\partial z_3} \cdot \frac{\partial z_3}{\partial \Theta_2} \\
&= -\left[\frac{y}{a_4} - \frac{1-y}{1-a_4}\right](a_4(1 - a_4))\, \Theta_3\, (a_3(1 - a_3))\, a_2 \quad \text{(if all are scalars)} \\
&= \left[d_4 \cdot \Theta_3 * (a_3 * (1 - a_3))\right]^T \cdot a_2 = d_3^T \cdot a_2
\end{aligned}
$$

The partial derivative of the loss function with respect to the parameter $\Theta_1$, using the chain rule, is:

$$
\begin{aligned}
\frac{\partial C}{\partial \Theta_1} &= \frac{\partial C}{\partial a_4} \cdot \frac{\partial a_4}{\partial z_4} \cdot \frac{\partial z_4}{\partial a_3} \cdot \frac{\partial a_3}{\partial z_3} \cdot \frac{\partial z_3}{\partial a_2} \cdot \frac{\partial a_2}{\partial z_2} \cdot \frac{\partial z_2}{\partial \Theta_1} \\
&= -\left[\frac{y}{a_4} - \frac{1-y}{1-a_4}\right](a_4(1 - a_4))\, \Theta_3\, (a_3(1 - a_3))\, \Theta_2\, (a_2(1 - a_2))\, a_1 \quad \text{(if all are scalars)} \\
&= \left[d_3 \cdot \Theta_2 * (a_2 * (1 - a_2))\right]^T \cdot a_1 = d_2^T \cdot a_1
\end{aligned}
$$
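
The three deltas are computed backwards, each reusing the previous one. Below is a shape-only sketch with random placeholder arrays (the bias columns are dropped here for simplicity; the full version, including the bias handling, is in the code further down):

import numpy as np

n = 240
a2, a3, a4 = np.random.rand(n, 200), np.random.rand(n, 200), np.random.rand(n, 36)
y_matrix = np.zeros((n, 36))
theta2 = np.random.rand(200, 200)     # the real code uses 201 columns (bias included)
theta3 = np.random.rand(36, 200)

d4 = a4 - y_matrix                     # 240 x 36
d3 = d4.dot(theta3) * (a3 * (1 - a3))  # 240 x 200
d2 = d3.dot(theta2) * (a2 * (1 - a2))  # 240 x 200
print(d4.shape, d3.shape, d2.shape)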

2.3. Details

The overall steps will include these:

  1. from picture to data (a sketch of one possible approach follows this list)
  2. set up the activation function (sigmoid)
  3. compute the gradients
  4. compute the loss function
  5. predict
  6. put it all together
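
For step 1, one possible way to turn a picture of a single character into one feature row (a sketch only, not necessarily the exact preprocessing used for this dataset; the file name, the 13x20 size, and the binarization threshold of 128 are assumptions):

import numpy as np
from PIL import Image

def pic_to_row(path, size = (13, 20), threshold = 128):
    # grayscale, resize to a fixed size, binarize, then flatten into one row of 260 features
    img = Image.open(path).convert('L').resize(size)
    arr = np.asarray(img)
    return (arr < threshold).astype(float).reshape(1, -1)

# x_row = pic_to_row('data/yanzhengma/example_char.png')   # hypothetical file name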

Shape of the data

Theta1.T: 261 x 200
Theta2.T: 201 x 200
Theta3.T: 201 x 36

X:  240 x 260
a1: 240 x 261
z2: 240 x 200
a2: 240 x 200 -> 240 x 201
z3: 240 x 200
a3: 240 x 200 -> 240 x 201
z4: 240 x 36
a4: 240 x 36

y_matrix: 240 x 36
import scipy.io
import os
import scipy.signal
import pandas as pd
import numpy as np
from PIL import Image
import scipy.optimize as opt
word = ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l',\
        'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']

mypath = r'data/yanzhengma/'
train_data = pd.read_pickle(mypath + 'train_data.pkl')


def sigmoid(x, derivative = False):
    # with derivative=True, x is assumed to already be the sigmoid output a, so this returns a * (1 - a)
    if derivative:
        return x * (1 - x)
    else:
        return 1 / (1 + np.exp(-x))

def rand_init_weights(layer_in, layer_out):
    # uniform init in [-epsilon_init, epsilon_init); the extra column is for the bias unit
    epsilon_init = 0.1
    return np.random.rand(layer_out, 1 + layer_in) * 2 * epsilon_init - epsilon_init

def nn_gradient(nn_params, input_layer_size, hidden_layer_size, num_labels, X, y, lambda_param):
    '''
    theta1: 200x261
    theta2: 200x201
    theta3: 36x201
    a1: 240x261
    z2: 240x200
    a2: 240x200 -> 240x201
    z3: 240x200
    a3: 240x200 -> 240x201
    z4: 240x36
    a4: 240x36
    d4: 240x36
    d3: 240x201 -> 240x200
    d2: 240x201 -> 240x200
    theta1_grad: 200x261
    theta2_grad: 200x201
    theta3_grad: 36x201
    final output: 200x261 + 200x201 + 36x201 = 99636
    '''
    theta1 = nn_params[0:hidden_layer_size * (input_layer_size + 1)].reshape(hidden_layer_size, input_layer_size + 1)
    theta2 = nn_params[(hidden_layer_size * (input_layer_size + 1)) : \
                       (hidden_layer_size * (input_layer_size + 1) + \
                        hidden_layer_size * (hidden_layer_size + 1))].reshape(hidden_layer_size, hidden_layer_size + 1)
    theta3 = nn_params[(hidden_layer_size * (input_layer_size + 1) + \
                        hidden_layer_size * (hidden_layer_size + 1)):].reshape(num_labels, hidden_layer_size + 1)
    n = np.shape(X)[0]
    a1 = np.append(np.ones((n, 1)), X, axis = 1)
    z2 = a1.dot(theta1.T)
    a2 = sigmoid(z2)
    a2 = np.append(np.ones((n, 1)), a2, axis = 1)
    z3 = a2.dot(theta2.T)
    a3 = sigmoid(z3)
    a3 = np.append(np.ones((n, 1)), a3, axis = 1)
    z4 = a3.dot(theta3.T)
    a4 = sigmoid(z4)

    y_matrix = np.zeros((n, num_labels))
    for i in range(num_labels):
        y_matrix[:, i] = (y == (i + 1)).reshape(-1)

    d4 = a4 - y_matrix
    d3 = d4.dot(theta3) * a3 * (1 - a3)   
    d3 = d3[:, 1:]
    d2 = d3.dot(theta2) * a2 * (1 - a2)  
    d2 = d2[:, 1:]
    # average over the n samples; note the regularization term is not added to these gradients
    theta1_grad = 1 / n * (d2.T.dot(a1))
    theta2_grad = 1 / n * (d3.T.dot(a2))
    theta3_grad = 1 / n * (d4.T.dot(a3))
    return np.append(np.append(theta1_grad.flatten(), theta2_grad.flatten(), axis = 0), theta3_grad.flatten(), axis = 0)

def nn_cost(nn_params, input_layer_size, hidden_layer_size, num_labels, X, y, lambda_param):
    theta1 = nn_params[0:(hidden_layer_size * (input_layer_size + 1))].reshape(hidden_layer_size, input_layer_size + 1)
    theta2 = nn_params[(hidden_layer_size * (input_layer_size + 1)) : \
                       (hidden_layer_size * (input_layer_size + 1) + \
                        hidden_layer_size * (hidden_layer_size + 1))].reshape(hidden_layer_size, hidden_layer_size + 1)
    theta3 = nn_params[(hidden_layer_size * (input_layer_size + 1) + \
                        hidden_layer_size * (hidden_layer_size + 1)):].reshape(num_labels, hidden_layer_size + 1)
    n = np.shape(X)[0]
    a1 = np.append(np.ones((n, 1)), X, axis = 1)
    z2 = a1.dot(theta1.T)
    a2 = sigmoid(z2)
    a2 = np.append(np.ones((n, 1)), a2, axis = 1)
    z3 = a2.dot(theta2.T)
    a3 = sigmoid(z3)
    a3 = np.append(np.ones((n, 1)), a3, axis = 1)
    z4 = a3.dot(theta3.T)
    a4 = sigmoid(z4)

    y_matrix = np.zeros((n, num_labels))
    for i in range(num_labels):
        y_matrix[:, i] = (y == (i + 1)).reshape(-1)

    # cross entropy cost plus L2 regularization on the non-bias weights
    return -(1 / n) * (np.sum(y_matrix * np.log(a4)) + np.sum((1 - y_matrix) * np.log(1 - a4))) + \
        lambda_param / (2 * n) * (np.sum(theta1[:, 1:]**2) + np.sum(theta2[:, 1:]**2) + \
                                  np.sum(theta3[:, 1:]**2))

def predict(theta1, theta2, theta3, X):
    n = np.shape(X)[0]
    h1 = sigmoid(np.append(np.ones((n, 1)), X, 1).dot(theta1.T))
    h2 = sigmoid(np.append(np.ones((n, 1)), h1, 1).dot(theta2.T))
    h3 = sigmoid(np.append(np.ones((n, 1)), h2, 1).dot(theta3.T))
    p = np.argmax(h3, axis = 1).reshape(n, 1)
    return p + 1   # labels are 1-indexed (1..36)

def test(train_data = train_data):
    X = train_data['X']
    y = train_data['y']
    input_layer_size = 260
    hidden_layer_size = 200
    num_labels = 36

    initial_theta1 = rand_init_weights(input_layer_size, hidden_layer_size)
    initial_theta2 = rand_init_weights(hidden_layer_size, hidden_layer_size)
    initial_theta3 = rand_init_weights(hidden_layer_size, num_labels)

    initial_nn_params = np.append(np.append(initial_theta1.flatten(), initial_theta2.flatten()), initial_theta3.flatten())
    lambda_param = 0.1
    result = opt.minimize(fun = nn_cost, x0 = initial_nn_params, \
                          args = (input_layer_size, hidden_layer_size, num_labels, X, y, lambda_param), \
                          method = "CG", jac = nn_gradient, options = {"maxiter": 100})
    nn_params = result.x
    theta1 = nn_params[0:hidden_layer_size*(input_layer_size + 1)].reshape(hidden_layer_size, input_layer_size + 1)
    theta2 = nn_params[hidden_layer_size*(input_layer_size + 1) : \
                       (hidden_layer_size*(input_layer_size + 1) + \
                        hidden_layer_size * (hidden_layer_size + 1))].reshape(hidden_layer_size, hidden_layer_size + 1)
    theta3 = nn_params[(hidden_layer_size * (input_layer_size + 1) + \
                        hidden_layer_size * (hidden_layer_size + 1)):].reshape(num_labels, hidden_layer_size + 1)
    pred = predict(theta1, theta2, theta3, X)
    print("Training Accuracy: ", np.mean(pred == y) * 100, "%")
    return

test()

('Training Accuracy: ', 100.0000000, '%')
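
A simple way to double check nn_gradient against nn_cost is a numerical gradient check (a sketch with tiny random data and layer sizes; lambda_param is set to 0 because the analytic gradient above does not include the regularization term):

import numpy as np

def gradient_check(eps = 1e-4, n_checks = 5):
    rng = np.random.RandomState(0)
    in_size, hid_size, labels, n = 4, 5, 3, 10
    X = rng.rand(n, in_size)
    y = rng.randint(1, labels + 1, size = (n, 1))
    params = np.append(np.append(rand_init_weights(in_size, hid_size).flatten(),
                                 rand_init_weights(hid_size, hid_size).flatten()),
                       rand_init_weights(hid_size, labels).flatten())
    args = (in_size, hid_size, labels, X, y, 0.0)
    grad = nn_gradient(params, *args)
    for idx in rng.choice(len(params), n_checks, replace = False):
        p_plus, p_minus = params.copy(), params.copy()
        p_plus[idx] += eps
        p_minus[idx] -= eps
        numeric = (nn_cost(p_plus, *args) - nn_cost(p_minus, *args)) / (2 * eps)
        print(idx, grad[idx], numeric)   # the analytic and numerical values should be very close

gradient_check()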