How to Build a Neural Network from Scratch
This post will introduce how to build a neural network from scratch:
- forward propagation: from the input data to the activations on each layer
- backpropagation: using the chain rule to calculate the derivatives of the cost function with respect to the weights (parameters) on each layer
- minimizing the cost function by gradient descent, using the derivatives from the previous step
Example 1.
First, I will introduce a very simple example. Suppose we have the data below: \(X\) is a 4x3 array and \(Y\) is a 4x1 vector. The task is to predict the output \(Y\) from the input \(X\). We will show how to build a neural network to do this job.
| Inputs: \(X\) | | | Output: \(Y\) |
|---|---|---|---|
| 0 | 0 | 1 | 0 |
| 0 | 1 | 1 | 1 |
| 1 | 0 | 1 | 1 |
| 1 | 1 | 1 | 0 |
Our NN will have 2 layers. The first layer maps the input \(X\) (which we will call \(a_1\)) to the activations \(a_2\) through a sigmoid activation function. The weight matrix of the first layer, \(\Theta_1\), has shape 4x3 (so \(\Theta_1^{T}\) is 3x4), and therefore the activations of layer 1, \(a_2 = \sigma(a_1 \cdot \Theta_1^{T})\), have shape 4x4.
The layer-1 activations \(a_2\) are then the input to the second layer, which produces the output \(a_3\) with weight matrix \(\Theta_2\) and a sigmoid activation function. The shape of \(\Theta_2^{T}\) is 4x1: the input to layer 2 has shape 4x4 and \(Y\) has shape 4x1, and these two determine the shape of \(\Theta_2\), because layer 2 is the output layer here.
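A quick way to confirm these shapes with NumPy (a throwaway sanity check, separate from the real training code below; the arrays here are just placeholders):

```python
import numpy as np

# Illustrative shape check only; zeros stand in for the real data.
a1 = np.zeros((4, 3))        # input X: 4 samples, 3 features
theta1 = np.zeros((4, 3))    # Theta_1 (so Theta_1^T is 3 x 4)
theta2 = np.zeros((1, 4))    # Theta_2 (so Theta_2^T is 4 x 1)

z2 = a1.dot(theta1.T)        # (4, 3) @ (3, 4) -> (4, 4), the shape of a_2
z3 = z2.dot(theta2.T)        # (4, 4) @ (4, 1) -> (4, 1), the shape of a_3 and Y
print(z2.shape, z3.shape)    # (4, 4) (4, 1)
```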
Preparation
1.1. Model Structure
Our prediction neural network has two layers. The first layer is a linear combination of the input \(X\) and the parameters \(\Theta_1\), passed through a sigmoid function; its activations are then linearly combined with the parameters \(\Theta_2\) and passed through a sigmoid again to produce the prediction:
\begin{aligned} a_1 &= X \\ z_2 &= a_1 \cdot \Theta_1^{T}, \quad a_2 = \sigma(z_2) \\ z_3 &= a_2 \cdot \Theta_2^{T}, \quad a_3 = \sigma(z_3) \end{aligned}
where \(\sigma(z) = \frac{1}{1 + e^{-z}}\) is the sigmoid function.
The symbol \(\cdot\) above denotes the dot product of vectors or matrices. We will also need the [element-wise product](https://en.wikipedia.org/wiki/Hadamard_product_(matrices)), denoted \(\odot\).
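For example, the two products differ as follows (a minimal NumPy illustration, not part of the model code):

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])

print(A.dot(B))   # dot product (matrix multiplication)
print(A * B)      # element-wise (Hadamard) product
```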
1.2. Cost Function
The cost function measures the distance between the predictions and the true values.
- Cross-entropy loss function:
\begin{aligned} C = -\frac{1}{n}\sum_{x}\left[y \log(a_3) + (1 - y)\log(1 - a_3) \right] \end{aligned}
- L2 (quadratic) cost function:
\begin{aligned} C = \frac{1}{2n}\sum_{x}\left(y - a_3\right)^2 \end{aligned}
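As a rough illustration, both costs can be computed in a couple of lines of NumPy (the values of `y` and `a3` below are made up for the illustration):

```python
import numpy as np

y = np.array([0., 1., 1., 0.])        # true labels
a3 = np.array([0.1, 0.9, 0.8, 0.2])   # example output-layer activations

cross_entropy = -np.mean(y * np.log(a3) + (1 - y) * np.log(1 - a3))
l2_cost = 0.5 * np.mean((a3 - y) ** 2)

print(cross_entropy, l2_cost)
```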
1.3. Derivatives / Gradient
Suppose the loss function is the L2 cost above (omitting the constant \(\frac{1}{n}\) factor, which does not change the minimizer):
\begin{aligned} C = \frac{1}{2}\sum_{x}\left(a_3 - y\right)^2 \end{aligned}
First, let's denote the derivative of the sigmoid function, which appears at every layer of the chain rule:
\begin{aligned} \sigma'(z) = \sigma(z)\left(1 - \sigma(z)\right), \qquad \text{so} \qquad \frac{\partial a_3}{\partial z_3} = a_3 \odot (1 - a_3), \qquad \frac{\partial a_2}{\partial z_2} = a_2 \odot (1 - a_2) \end{aligned}
The partial derivative of the loss function with respect to the parameter \(\Theta_2\), using the chain rule, is:
\begin{aligned} \frac{\partial C}{\partial \Theta_2} = \left[\frac{\partial C}{\partial a_3} \odot \frac{\partial a_3}{\partial z_3}\right]^{T} \cdot \frac{\partial z_3}{\partial \Theta_2} = d_3^{T} \cdot a_2 \end{aligned}
where \(d_3 \triangleq (a_3 - y) \odot \left(a_3 \odot (1 - a_3)\right)\). Here \(\odot\) means element-wise product and \(\cdot\) means dot product.
The partial derivative of the loss function with respect to the parameter \(\Theta_1\), using the chain rule, is:
\begin{aligned} \frac{\partial C}{\partial \Theta_1} = d_2^{T} \cdot a_1 \end{aligned}
where \(d_2 \triangleq \left(d_3 \cdot \Theta_2\right) \odot \left(a_2 \odot (1 - a_2) \right)\).
To find the values of \(\Theta_1\) and \(\Theta_2\), we iteratively adjust them until their gradients are close to 0. That is, we iterate:
\begin{aligned} \Theta_j \leftarrow \Theta_j - \eta\,\frac{\partial C}{\partial \Theta_j}, \qquad j = 1, 2 \end{aligned}
where \(\eta\) is the learning rate.
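Before trusting the backpropagation formulas, it can help to compare the analytic gradient of \(\Theta_2\) with a finite-difference approximation of the same L2 cost. The sketch below is a standalone check with its own throwaway variables, not part of the training code that follows:

```python
import numpy as np

# Finite-difference check of dC/dTheta_2 for C = 1/2 * sum((a3 - y)^2).
np.random.seed(0)
X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)
theta1 = np.random.randn(4, 3)
theta2 = np.random.randn(1, 4)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def cost(theta2):
    a2 = sigmoid(X.dot(theta1.T))
    a3 = sigmoid(a2.dot(theta2.T))
    return 0.5 * np.sum((a3 - y) ** 2)

# analytic gradient: d3 = (a3 - y) * a3 * (1 - a3);  dC/dTheta2 = d3^T . a2
a2 = sigmoid(X.dot(theta1.T))
a3 = sigmoid(a2.dot(theta2.T))
d3 = (a3 - y) * a3 * (1 - a3)
analytic = d3.T.dot(a2)

# numerical gradient by central differences
eps = 1e-6
numeric = np.zeros_like(theta2)
for i in range(theta2.shape[1]):
    e = np.zeros_like(theta2)
    e[0, i] = eps
    numeric[0, i] = (cost(theta2 + e) - cost(theta2 - e)) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))  # should be very small (~1e-9)
```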
1.4. Code
```python
import numpy as np

np.random.seed(10)
x = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]])
y = np.array([0, 1, 1, 0]).reshape(4, -1)
a1 = x
theta1 = np.random.random(12).reshape(4, 3) * 2 - 1
theta2 = np.random.random(4).reshape(1, 4) * 2 - 1
learning_rate = 1


def sigmoid(x):
    return 1 / (1 + np.exp(-x))


for i in range(100000):
    # forward propagation: including the first layer and the second layer
    z1 = a1.dot(theta1.T)
    a2 = sigmoid(z1)
    z2 = a2.dot(theta2.T)
    a3 = sigmoid(z2)
    # backpropagation: chain rules of the partial derivatives
    a3_error = y - a3
    theta2_grad = np.dot(np.transpose(a3_error * (a3 * (1 - a3))), a2)
    theta1_grad = np.dot(np.transpose(np.dot(a3_error * (a3 * (1 - a3)), theta2) * (a2 * (1 - a2))), a1)
    # update the parameters until the gradient converges to 0
    theta2 += learning_rate * theta2_grad
    theta1 += learning_rate * theta1_grad
```
```python
>>> theta2
array([[ -3.16174322,  10.29352444,  11.40818558,  -4.45945196]])
>>> theta1
array([[ 3.35916278, -3.32786847,  1.87094082],
       [ 5.84130712, -5.98365309, -3.07635286],
       [-6.45870855,  6.34717945, -3.30437575],
       [-3.91182875,  3.95761648,  2.17740367]])
>>> a3
array([[ 0.00276657],
       [ 0.99710763],
       [ 0.99718615],
       [ 0.00243296]])
>>> # The prediction is
>>> a3 > 0.5
array([[False],
       [ True],
       [ True],
       [False]], dtype=bool)
```
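To confirm that the network has learned the mapping, the thresholded output can be compared with \(y\) (a quick check using the arrays above):

```python
print((a3 > 0.5).astype(int).ravel())  # [0 1 1 0]
print(y.ravel())                       # [0 1 1 0]
```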
The above is a simple example to introduce the inside of a neural network: how to compute the forward propagation from the input data to the prediction and the cost function, how to compute the backpropagation of the partial derivatives with the chain rule, and how to update the parameters until the gradients converge to zero. In fact, a neural network is not necessary for this simple example: there are only 4 data points, but we have introduced 16 parameters (12 in \(\Theta_1\) and 4 in \(\Theta_2\)).
Example 2
The following example is a more realistic problem: I collected 60 verification-code (CAPTCHA) images. Each image contains 4 characters, and a user must recognize these 4 characters correctly to pass the verification.
An example image looks like this:
The corresponding characters are: 'k' 'n' 'n' 'p'.
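The training labels are stored as integers from 1 to 36 (this is how `y_matrix` is built in the code below). Assuming the labels follow the order of the `word` list defined there, a character such as 'k' maps to its 1-based position in that list:

```python
word = ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', 'a', 'b', 'c', 'd', 'e', 'f',
        'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v',
        'w', 'x', 'y', 'z']

label = word.index('k') + 1  # assumed 1-based label for 'k'
print(label)                 # 21
```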
2.1. Model
We will build a three-layer neural network: the input layer has 260 units (each character image is flattened into a 260-dimensional vector), the two hidden layers have 200 units each, and the output layer has 36 units, one for each possible character (0-9 and a-z).
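Written out, and matching the shape bookkeeping in section 2.3 and the code below (a bias column of ones is appended to the input of every layer), the forward pass is:
\begin{aligned} a_1 &= \left[\,1 \;\; X\,\right] \\ z_2 &= a_1 \cdot \Theta_1^{T}, \quad a_2 = \left[\,1 \;\; \sigma(z_2)\,\right] \\ z_3 &= a_2 \cdot \Theta_2^{T}, \quad a_3 = \left[\,1 \;\; \sigma(z_3)\,\right] \\ z_4 &= a_3 \cdot \Theta_3^{T}, \quad a_4 = \sigma(z_4) \end{aligned}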
2.2. Derivatives
Suppose the loss function is the cross-entropy cost
\begin{aligned} C = -\frac{1}{n}\sum_{x}\left[y \log(a_4) + (1 - y)\log(1 - a_4) \right] \end{aligned}
First, let's denote (dropping the bias column where appropriate)
\begin{aligned} d_4 &\triangleq a_4 - y \\ d_3 &\triangleq \left(d_4 \cdot \Theta_3\right) \odot \left(a_3 \odot (1 - a_3)\right) \\ d_2 &\triangleq \left(d_3 \cdot \Theta_2\right) \odot \left(a_2 \odot (1 - a_2)\right) \end{aligned}
The partial derivative of the loss function with respect to the parameter \(\Theta_3\), using the chain rule, is:
\begin{aligned} \frac{\partial C}{\partial \Theta_3} = \frac{1}{n}\, d_4^{T} \cdot a_3 \end{aligned}
The partial derivative with respect to the parameter \(\Theta_2\) is:
\begin{aligned} \frac{\partial C}{\partial \Theta_2} = \frac{1}{n}\, d_3^{T} \cdot a_2 \end{aligned}
The partial derivative with respect to the parameter \(\Theta_1\) is:
\begin{aligned} \frac{\partial C}{\partial \Theta_1} = \frac{1}{n}\, d_2^{T} \cdot a_1 \end{aligned}
2.3. Details
The overall steps are:
- convert the pictures to data (see the sketch after this list)
- set up the activation function (sigmoid)
- calculate the gradient
- calculate the loss function
- predict
- put it all together
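A minimal sketch of the "convert the pictures to data" step, assuming each verification-code image is cut into 4 equal-width character crops and each crop is resized to 13x20 = 260 grayscale pixels; the actual preprocessing that produced `train_data.pkl` may differ, and the path and crop size here are only placeholders:

```python
import numpy as np
from PIL import Image


def image_to_rows(path, n_chars=4, char_size=(13, 20)):
    """Split one verification-code image into n_chars crops and flatten each
    crop into a 260-dimensional feature row (char_size is an assumption)."""
    img = Image.open(path).convert('L')  # grayscale
    w, h = img.size
    step = w // n_chars
    rows = []
    for i in range(n_chars):
        crop = img.crop((i * step, 0, (i + 1) * step, h)).resize(char_size)
        rows.append(np.asarray(crop, dtype=float).ravel() / 255.0)
    return np.vstack(rows)  # shape: (n_chars, 260)
```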
Shape of the data (`->` marks the bias column of ones being appended):

```
Theta1.T:  261 x 200
Theta2.T:  201 x 200
Theta3.T:  201 x 36
X:         240 x 260
a1:        240 x 261
z2:        240 x 200
a2:        240 x 200 -> 240 x 201
z3:        240 x 200
a3:        240 x 200 -> 240 x 201
z4:        240 x 36
a4:        240 x 36
y_matrix:  240 x 36
```
```python
import scipy.io
import os
import scipy.signal
import pandas as pd
import numpy as np
from PIL import Image
import scipy.optimize as opt

word = ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', 'a', 'b', 'c', 'd', 'e', 'f',
        'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v',
        'w', 'x', 'y', 'z']

mypath = r'data/yanzhengma/'
train_data = pd.read_pickle(mypath + 'train_data.pkl')


def sigmoid(x, derivative=False):
    if derivative:
        # here x is assumed to already be a sigmoid activation
        return x * (1 - x)
    return 1 / (1 + np.exp(-x))


def rand_init_weights(layer_in, layer_out):
    epsilon_init = 0.1
    return np.random.rand(layer_out, 1 + layer_in) * 2 * epsilon_init - epsilon_init
```
```python
def nn_gradient(nn_params, input_layer_size, hidden_layer_size, num_labels, X, y, lambda_param):
    '''
    theta1: 200x261
    theta2: 200x201
    theta3: 36x201
    a1: 240x261
    z2: 240x200
    a2: 240x200 -> 240x201
    z3: 240x200
    a3: 240x200 -> 240x201
    z4: 240x36
    a4: 240x36
    d4: 240x36
    d3: 240x201 -> 240x200
    d2: 240x201 -> 240x200
    theta1_grad: 200x261
    theta2_grad: 200x201
    theta3_grad: 36x201
    final output: 200x261 + 200x201 + 36x201 = 99636
    '''
    theta1 = nn_params[0:hidden_layer_size * (input_layer_size + 1)].reshape(hidden_layer_size, input_layer_size + 1)
    theta2 = nn_params[(hidden_layer_size * (input_layer_size + 1)):
                       (hidden_layer_size * (input_layer_size + 1) +
                        hidden_layer_size * (hidden_layer_size + 1))].reshape(hidden_layer_size, hidden_layer_size + 1)
    theta3 = nn_params[(hidden_layer_size * (input_layer_size + 1) +
                        hidden_layer_size * (hidden_layer_size + 1)):].reshape(num_labels, hidden_layer_size + 1)
    n = np.shape(X)[0]
    # forward propagation, appending a bias column of ones at each layer
    a1 = np.append(np.ones((n, 1)), X, axis=1)
    z2 = a1.dot(theta1.T)
    a2 = sigmoid(z2)
    a2 = np.append(np.ones((n, 1)), a2, axis=1)
    z3 = a2.dot(theta2.T)
    a3 = sigmoid(z3)
    a3 = np.append(np.ones((n, 1)), a3, axis=1)
    z4 = a3.dot(theta3.T)
    a4 = sigmoid(z4)
    # one-hot encode the labels (labels run from 1 to num_labels)
    y_matrix = np.zeros((n, num_labels))
    for i in range(num_labels):
        y_matrix[:, i] = (y == (i + 1)).reshape(-1)
    # backpropagation
    d4 = a4 - y_matrix
    d3 = d4.dot(theta3) * a3 * (1 - a3)
    d3 = d3[:, 1:]
    d2 = d3.dot(theta2) * a2 * (1 - a2)
    d2 = d2[:, 1:]
    theta1_grad = 1 / n * (d2.T.dot(a1))
    theta2_grad = 1 / n * (d3.T.dot(a2))
    theta3_grad = 1 / n * (d4.T.dot(a3))
    # regularization term (the bias column is not regularized), matching nn_cost
    theta1_grad[:, 1:] += lambda_param / n * theta1[:, 1:]
    theta2_grad[:, 1:] += lambda_param / n * theta2[:, 1:]
    theta3_grad[:, 1:] += lambda_param / n * theta3[:, 1:]
    return np.append(np.append(theta1_grad.flatten(), theta2_grad.flatten(), axis=0), theta3_grad.flatten(), axis=0)
```
```python
def nn_cost(nn_params, input_layer_size, hidden_layer_size, num_labels, X, y, lambda_param):
    theta1 = nn_params[0:(hidden_layer_size * (input_layer_size + 1))].reshape(hidden_layer_size, input_layer_size + 1)
    theta2 = nn_params[(hidden_layer_size * (input_layer_size + 1)):
                       (hidden_layer_size * (input_layer_size + 1) +
                        hidden_layer_size * (hidden_layer_size + 1))].reshape(hidden_layer_size, hidden_layer_size + 1)
    theta3 = nn_params[(hidden_layer_size * (input_layer_size + 1) +
                        hidden_layer_size * (hidden_layer_size + 1)):].reshape(num_labels, hidden_layer_size + 1)
    n = np.shape(X)[0]
    # forward propagation, appending a bias column of ones at each layer
    a1 = np.append(np.ones((n, 1)), X, axis=1)
    z2 = a1.dot(theta1.T)
    a2 = sigmoid(z2)
    a2 = np.append(np.ones((n, 1)), a2, axis=1)
    z3 = a2.dot(theta2.T)
    a3 = sigmoid(z3)
    a3 = np.append(np.ones((n, 1)), a3, axis=1)
    z4 = a3.dot(theta3.T)
    a4 = sigmoid(z4)
    # one-hot encode the labels (labels run from 1 to num_labels)
    y_matrix = np.zeros((n, num_labels))
    for i in range(num_labels):
        y_matrix[:, i] = (y == (i + 1)).reshape(-1)
    # cross-entropy cost plus L2 regularization (the bias column is not regularized)
    cost = -(1 / n) * (np.sum(y_matrix * np.log(a4)) + np.sum((1 - y_matrix) * np.log(1 - a4)))
    reg = lambda_param / (2 * n) * (np.sum(theta1[:, 1:] ** 2) +
                                    np.sum(theta2[:, 1:] ** 2) +
                                    np.sum(theta3[:, 1:] ** 2))
    return cost + reg
```
```python
def predict(theta1, theta2, theta3, X):
    n = np.shape(X)[0]
    h1 = sigmoid(np.append(np.ones((n, 1)), X, 1).dot(theta1.T))
    h2 = sigmoid(np.append(np.ones((n, 1)), h1, 1).dot(theta2.T))
    h3 = sigmoid(np.append(np.ones((n, 1)), h2, 1).dot(theta3.T))
    p = np.argmax(h3, axis=1).reshape((n, 1))
    return p + 1
```
```python
def test(train_data=train_data):
    X = train_data['X']
    y = train_data['y']
    input_layer_size = 260
    hidden_layer_size = 200
    num_labels = 36
    initial_theta1 = rand_init_weights(input_layer_size, hidden_layer_size)
    initial_theta2 = rand_init_weights(hidden_layer_size, hidden_layer_size)
    initial_theta3 = rand_init_weights(hidden_layer_size, num_labels)
    initial_nn_params = np.append(np.append(initial_theta1.flatten(), initial_theta2.flatten()),
                                  initial_theta3.flatten())
    lambda_param = 0.1
    result = opt.minimize(fun=nn_cost, x0=initial_nn_params,
                          args=(input_layer_size, hidden_layer_size, num_labels, X, y, lambda_param),
                          method="CG", jac=nn_gradient, options={"maxiter": 100})
    nn_params = result.x
    theta1 = nn_params[0:hidden_layer_size * (input_layer_size + 1)].reshape(hidden_layer_size, input_layer_size + 1)
    theta2 = nn_params[hidden_layer_size * (input_layer_size + 1):
                       (hidden_layer_size * (input_layer_size + 1) +
                        hidden_layer_size * (hidden_layer_size + 1))].reshape(hidden_layer_size, hidden_layer_size + 1)
    theta3 = nn_params[(hidden_layer_size * (input_layer_size + 1) +
                        hidden_layer_size * (hidden_layer_size + 1)):].reshape(num_labels, hidden_layer_size + 1)
    pred = predict(theta1, theta2, theta3, X)
    print("Training Accuracy: ", np.mean(pred == y) * 100, "%")


test()
```
```
Training Accuracy:  100.0 %
```
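Since `predict` returns integer labels from 1 to 36, they can be mapped back to characters through the `word` list (assuming, as above, that the labels follow the order of that list). A small helper like the one below, which is not part of the original code, would do it:

```python
def decode(pred, word=word):
    # map numeric labels 1..36 back to characters, assuming they index the word list
    return [word[int(p) - 1] for p in np.ravel(pred)]

# e.g. inside test(), after pred is computed:
# print(decode(pred)[:4])
```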