How to Build a Neural Network from Scratch
This post introduces how to build a neural network from scratch:
- forward propagation: from the input data to the activations on each layer
- backpropagation: using the chain rule to calculate the derivatives of the cost function with respect to the weights (parameters) on each layer
- minimizing the cost function by gradient descent (SGD) with the derivatives from step 2; the update rule is written out below.
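For step 3, Example 1 below uses plain (batch) gradient descent: writing the learning rate as $\eta$ (the code calls it learning_rate), every weight matrix $\Theta$ is repeatedly updated as
$$\Theta \leftarrow \Theta - \eta\,\frac{\partial C}{\partial \Theta}$$
until the gradients are close to zero. Example 2 instead hands the cost and the analytic gradient to scipy.optimize.minimize.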
Example 1.
First, I will introduce a very simple example. Suppose we have data like the following:
| Inputs: X | | | Output: Y |
| --- | --- | --- | --- |
| 0 | 0 | 1 | 0 |
| 0 | 1 | 1 | 1 |
| 1 | 0 | 1 | 1 |
| 1 | 1 | 1 | 0 |
Our NN will have 2 layers. The first layer maps the input $a_1 = X$ to the hidden activations $a_2 = \sigma(a_1\Theta_1^T)$, and the second layer maps $a_2$ to the output $a_3 = \sigma(a_2\Theta_2^T)$, which is the prediction of $y$. Note that $y$ is 1 exactly when the first two input columns differ (an XOR pattern), so a single linear layer could not fit it.
Preparation
1.1. Model Structure
Our prediction neural network will have two layers: the first layer is a linear combination of the input $a_1 = X$ with the weights $\Theta_1$, followed by a sigmoid, i.e. $z_1 = a_1\Theta_1^T$ and $a_2 = \sigma(z_1)$; the second layer is a linear combination of $a_2$ with the weights $\Theta_2$, again followed by a sigmoid, i.e. $z_2 = a_2\Theta_2^T$ and $a_3 = \sigma(z_2)$.
The symbol $\sigma$ denotes the sigmoid activation function $\sigma(z) = \frac{1}{1 + e^{-z}}$, and $\Theta_1$ (4 x 3) and $\Theta_2$ (1 x 4) are the parameters we want to learn.
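As a preview of the code in section 1.4, here is a minimal sketch of this forward pass in NumPy; the weights are just random placeholders of the same shapes:

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

a1 = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]])  # input X, shape (4, 3)
theta1 = np.random.random((4, 3)) * 2 - 1   # layer-1 weights, shape (4, 3)
theta2 = np.random.random((1, 4)) * 2 - 1   # layer-2 weights, shape (1, 4)

a2 = sigmoid(a1.dot(theta1.T))   # hidden activations, shape (4, 4)
a3 = sigmoid(a2.dot(theta2.T))   # predictions, shape (4, 1)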
1.2. Cost Function
The cost function measures the distance between the predictions and the true values. Two common choices are:
- Cross-entropy loss function: $C = -\frac{1}{n}\sum_x \left[\, y\log(a_3) + (1 - y)\log(1 - a_3) \,\right]$
- L2 cost function: $C = \frac{1}{2n}\sum_x (y - a_3)^2$ (this is the one used in the code of Example 1)
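As a quick numeric illustration (my addition, with made-up values for y_true and y_pred), both costs can be computed directly in NumPy:

import numpy as np

y_true = np.array([0.0, 1.0, 1.0, 0.0])
y_pred = np.array([0.1, 0.8, 0.9, 0.2])   # hypothetical predictions a3
n = len(y_true)

cross_entropy = -1 / n * np.sum(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
l2_cost = 1 / (2 * n) * np.sum((y_true - y_pred) ** 2)
print(cross_entropy, l2_cost)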
1.3. Derivatives / Gradient
Suppose the loss function is the L2 cost $C = \frac{1}{2n}\sum_x (y - a_3)^2$ from above.
First, let's denote $z_1 = a_1\Theta_1^T$, $a_2 = \sigma(z_1)$, $z_2 = a_2\Theta_2^T$, $a_3 = \sigma(z_2)$, and recall that the sigmoid satisfies $\sigma'(z) = \sigma(z)(1 - \sigma(z))$.
The partial derivative of the loss function with respect to parameter $\Theta_2$ is
$$\frac{\partial C}{\partial \Theta_2} = \frac{1}{n}\,\delta_3^T a_2,$$
where $\delta_3 = (a_3 - y) \odot a_3(1 - a_3)$.
The partial derivative of the loss function with respect to parameter $\Theta_1$ is
$$\frac{\partial C}{\partial \Theta_1} = \frac{1}{n}\,\delta_2^T a_1,$$
where $\delta_2 = (\delta_3\,\Theta_2) \odot a_2(1 - a_2)$.
To find the values of $\Theta_1$ and $\Theta_2$ that minimize $C$, we repeatedly step the parameters in the direction opposite to these gradients until the gradients converge to zero; a numerical check of these formulas is sketched below. (In the code of section 1.4 the sign is absorbed into a3_error = y - a3, and the constant $\frac{1}{n}$ is folded into the learning rate.)
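A simple way to sanity-check these formulas is a finite-difference gradient check. The sketch below is my addition (not part of the original post); it compares the analytic gradient of the L2 cost with a numerical estimate on a tiny random network, and all names in it are hypothetical:

import numpy as np
np.random.seed(0)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

x = np.random.rand(4, 3)   # tiny random inputs
y = np.random.rand(4, 1)   # tiny random targets
theta1 = np.random.rand(4, 3) * 2 - 1
theta2 = np.random.rand(1, 4) * 2 - 1

def l2_cost(theta1, theta2):
    a2 = sigmoid(x.dot(theta1.T))
    a3 = sigmoid(a2.dot(theta2.T))
    return 0.5 / len(x) * np.sum((y - a3) ** 2)

# analytic gradient for theta2, using the formula above
a2 = sigmoid(x.dot(theta1.T))
a3 = sigmoid(a2.dot(theta2.T))
delta3 = (a3 - y) * a3 * (1 - a3)
theta2_grad = delta3.T.dot(a2) / len(x)

# numerical gradient via central differences
eps = 1e-5
num_grad = np.zeros_like(theta2)
for i in range(theta2.shape[0]):
    for j in range(theta2.shape[1]):
        t_plus, t_minus = theta2.copy(), theta2.copy()
        t_plus[i, j] += eps
        t_minus[i, j] -= eps
        num_grad[i, j] = (l2_cost(theta1, t_plus) - l2_cost(theta1, t_minus)) / (2 * eps)

print(np.max(np.abs(num_grad - theta2_grad)))   # should be very small, on the order of 1e-10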
1.4. Code
import numpy as np
np.random.seed(10)
x = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]])
y = np.array([0, 1, 1, 0]).reshape(4, -1)
a1 = x                                                # layer-1 activations are just the input
theta1 = np.random.random(12).reshape(4, 3) * 2 - 1   # layer-1 weights, initialized in [-1, 1)
theta2 = np.random.random(4).reshape(1, 4) * 2 - 1    # layer-2 weights, initialized in [-1, 1)
learning_rate = 1
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

for i in range(100000):
    # forward propagation: the first layer and then the second layer
    z1 = a1.dot(theta1.T)
    a2 = sigmoid(z1)
    z2 = a2.dot(theta2.T)
    a3 = sigmoid(z2)
    # backpropagation: chain rule for the partial derivatives
    # (the sign of the gradient step is absorbed in a3_error = y - a3, hence '+=' below)
    a3_error = y - a3
    theta2_grad = np.dot(np.transpose(a3_error * (a3 * (1 - a3))), a2)
    theta1_grad = np.dot(np.transpose(np.dot(a3_error * (a3 * (1 - a3)), theta2) * (a2 * (1 - a2))), a1)
    # update the parameters; repeat until the gradients converge to 0
    theta2 += learning_rate * theta2_grad
    theta1 += learning_rate * theta1_grad
theta2
array([[ -3.16174322, 10.29352444, 11.40818558, -4.45945196]])
theta1
array([[ 3.35916278, -3.32786847, 1.87094082],
[ 5.84130712, -5.98365309, -3.07635286],
[-6.45870855, 6.34717945, -3.30437575],
[-3.91182875, 3.95761648, 2.17740367]])
a3
array([[ 0.00276657],
[ 0.99710763],
[ 0.99718615],
[ 0.00243296]])
# The prediction is
a3 > 0.5
array([[False],
[ True],
[ True],
[False]], dtype=bool)
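After training, the learned weights can also be applied to a single input row; here is a minimal sketch that reuses sigmoid, theta1 and theta2 from the loop above (the input row is one of the training patterns):

x_new = np.array([[1, 0, 1]])              # one input pattern; the third column is the constant 1
a2_new = sigmoid(x_new.dot(theta1.T))      # hidden activations
a3_new = sigmoid(a2_new.dot(theta2.T))     # predicted probability
print(a3_new > 0.5)                        # expected: [[ True]]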
The above is a simple example to introduce the insides of a neural network: how to calculate the forward propagation from the input data to the predicted output and the cost function, how to calculate the backpropagation of the partial derivatives with the chain rule, and how to update the parameters until the gradients converge to zero. In fact, a neural network is not necessary for this simple example: there are only 4 data points, yet we have introduced 16 parameters (12 in $\Theta_1$ and 4 in $\Theta_2$).
Example 2
The following example uses more realistic data: I collected 60 verification-code (CAPTCHA) images. Each image contains 4 characters, and a user has to recognize all 4 characters correctly to pass the verification. An example image looks like this:
The corresponding characters are: 'k' 'n' 'n' 'p'.
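With 60 images and 4 characters each, there are 60 x 4 = 240 training samples, which is where the 240 rows in the shape list below come from. The code in this post loads the preprocessed data from train_data.pkl and does not show the preprocessing itself; purely as an illustration, a single character crop could be turned into a 260-dimensional feature row along these lines (the 20 x 13 = 260 crop size and the file name are my assumptions, not the actual preprocessing):

from PIL import Image
import numpy as np

img = Image.open('data/yanzhengma/example.png').convert('L')   # hypothetical file, converted to grayscale
char_crop = img.crop((0, 0, 13, 20))                           # hypothetical single-character region (13 wide, 20 tall)
x_row = (np.asarray(char_crop, dtype = float) / 255.0).reshape(1, -1)   # shape (1, 260)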
2.1. Model
We will build a three-layer neural network: the input layer has 260 units (the pixel features of one character), the two hidden layers have 200 units each, and the output layer has 36 units (one per character in 0-9 and a-z).
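Written out to match the code below (a bias unit of 1 is prepended to each layer's activations), the forward pass is
$$a_1 = [1,\ X],\quad a_2 = [1,\ \sigma(a_1\Theta_1^T)],\quad a_3 = [1,\ \sigma(a_2\Theta_2^T)],\quad a_4 = \sigma(a_3\Theta_3^T),$$
and $a_4$ (240 x 36) holds the predicted probability of each of the 36 characters for every sample.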
2.2. Derivatives
Suppose the loss function is the regularized cross-entropy cost
$$C = -\frac{1}{n}\sum \left[\, Y\log(a_4) + (1 - Y)\log(1 - a_4) \,\right] + \frac{\lambda}{2n}\sum \Theta^2,$$
where $Y$ is the one-hot encoded label matrix and the regularization sum runs over all non-bias weights.
First, let's denote $\delta_4 = a_4 - Y$, then $\delta_3 = (\delta_4\,\Theta_3) \odot a_3(1 - a_3)$ and $\delta_2 = (\delta_3\,\Theta_2) \odot a_2(1 - a_2)$, dropping the bias column of $\delta_3$ and $\delta_2$ after each step.
The partial derivative of the loss function with respect to parameter $\Theta_3$ is $\frac{\partial C}{\partial \Theta_3} = \frac{1}{n}\,\delta_4^T a_3$.
The partial derivative of the loss function with respect to parameter $\Theta_2$ is $\frac{\partial C}{\partial \Theta_2} = \frac{1}{n}\,\delta_3^T a_2$.
The partial derivative of the loss function with respect to parameter $\Theta_1$ is $\frac{\partial C}{\partial \Theta_1} = \frac{1}{n}\,\delta_2^T a_1$.
In addition, each non-bias weight gets the regularization term $\frac{\lambda}{n}\Theta$ added to its gradient.
2.3. Details
The overall steps are:
- convert the pictures to data (X and y)
- set up the activation function (sigmoid)
- compute the gradient of the cost (backpropagation)
- compute the loss function
- predict
- put it all together
Shape of the data:
Theta1.T: 261 x 200
Theta2.T: 201 x 200
Theta3.T: 201 x 36
X: 240 x 260
a1: 240 x 261
z2: 240 x 200
a2: 240 x 200 -> 240 x 201
z3: 240 x 200
a3: 240 x 200 -> 240 x 201
z4: 240 x 36
a4: 240 x 36
y_matrix: 240 x 36
import scipy.io
import os
import scipy.signal
import pandas as pd
import numpy as np
from PIL import Image
import scipy.optimize as opt
word = ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l',\
'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
mypath = r'data/yanzhengma/'
train_data = pd.read_pickle(mypath + 'train_data.pkl')
def sigmoid(x, derivative = False):
    # when derivative=True, x is expected to already be the sigmoid activation a = sigmoid(z)
    if derivative:
        return x * (1 - x)
    else:
        return 1 / (1 + np.exp(-x))
def rand_init_weights(layer_in, layer_out):
    # small random values in [-epsilon_init, epsilon_init) to break symmetry
    epsilon_init = 0.1
    return np.random.rand(layer_out, 1 + layer_in) * 2 * epsilon_init - epsilon_init
def nn_gradient(nn_params, input_layer_size, hidden_layer_size, num_labels, X, y, lambda_param):
    '''
    theta1: 200x261
    theta2: 200x201
    theta3: 36x201
    a1: 240x261
    z2: 240x200
    a2: 240x200 -> 240x201
    z3: 240x200
    a3: 240x200 -> 240x201
    z4: 240x36
    a4: 240x36
    d4: 240x36
    d3: 240x201 -> 240x200
    d2: 240x201 -> 240x200
    theta1_grad: 200x261
    theta2_grad: 200x201
    theta3_grad: 36x201
    final output: 200x261 + 200x201 + 36x201 = 99636
    '''
    # unroll the flat parameter vector into the three weight matrices
    theta1 = nn_params[0:hidden_layer_size * (input_layer_size + 1)].reshape(hidden_layer_size, input_layer_size + 1)
    theta2 = nn_params[(hidden_layer_size * (input_layer_size + 1)) : \
                       (hidden_layer_size * (input_layer_size + 1) + \
                        hidden_layer_size * (hidden_layer_size + 1))].reshape(hidden_layer_size, hidden_layer_size + 1)
    theta3 = nn_params[(hidden_layer_size * (input_layer_size + 1) + \
                        hidden_layer_size * (hidden_layer_size + 1)):].reshape(num_labels, hidden_layer_size + 1)
    n = np.shape(X)[0]
    # forward propagation (a bias column of ones is prepended to each layer)
    a1 = np.append(np.ones((n, 1)), X, axis = 1)
    z2 = a1.dot(theta1.T)
    a2 = sigmoid(z2)
    a2 = np.append(np.ones((n, 1)), a2, axis = 1)
    z3 = a2.dot(theta2.T)
    a3 = sigmoid(z3)
    a3 = np.append(np.ones((n, 1)), a3, axis = 1)
    z4 = a3.dot(theta3.T)
    a4 = sigmoid(z4)
    # one-hot encode the labels (labels are 1-indexed)
    y_matrix = np.zeros((n, num_labels))
    for i in range(num_labels):
        y_matrix[:, i] = (y == (i + 1)).reshape(-1)
    # backpropagation
    d4 = a4 - y_matrix
    d3 = d4.dot(theta3) * a3 * (1 - a3)
    d3 = d3[:, 1:]
    d2 = d3.dot(theta2) * a2 * (1 - a2)
    d2 = d2[:, 1:]
    theta1_grad = 1 / n * (d2.T.dot(a1))
    theta2_grad = 1 / n * (d3.T.dot(a2))
    theta3_grad = 1 / n * (d4.T.dot(a3))
    # regularization term, matching the regularized cost (bias columns excluded)
    theta1_grad[:, 1:] += lambda_param / n * theta1[:, 1:]
    theta2_grad[:, 1:] += lambda_param / n * theta2[:, 1:]
    theta3_grad[:, 1:] += lambda_param / n * theta3[:, 1:]
    return np.append(np.append(theta1_grad.flatten(), theta2_grad.flatten(), axis = 0), theta3_grad.flatten(), axis = 0)
def nn_cost(nn_params, input_layer_size, hidden_layer_size, num_labels, X, y, lambda_param):
    # unroll the flat parameter vector into the three weight matrices
    theta1 = nn_params[0:(hidden_layer_size * (input_layer_size + 1))].reshape(hidden_layer_size, input_layer_size + 1)
    theta2 = nn_params[(hidden_layer_size * (input_layer_size + 1)) : \
                       (hidden_layer_size * (input_layer_size + 1) + \
                        hidden_layer_size * (hidden_layer_size + 1))].reshape(hidden_layer_size, hidden_layer_size + 1)
    theta3 = nn_params[(hidden_layer_size * (input_layer_size + 1) + \
                        hidden_layer_size * (hidden_layer_size + 1)):].reshape(num_labels, hidden_layer_size + 1)
    n = np.shape(X)[0]
    # forward propagation
    a1 = np.append(np.ones((n, 1)), X, axis = 1)
    z2 = a1.dot(theta1.T)
    a2 = sigmoid(z2)
    a2 = np.append(np.ones((n, 1)), a2, axis = 1)
    z3 = a2.dot(theta2.T)
    a3 = sigmoid(z3)
    a3 = np.append(np.ones((n, 1)), a3, axis = 1)
    z4 = a3.dot(theta3.T)
    a4 = sigmoid(z4)
    # one-hot encode the labels (labels are 1-indexed)
    y_matrix = np.zeros((n, num_labels))
    for i in range(num_labels):
        y_matrix[:, i] = (y == (i + 1)).reshape(-1)
    # cross-entropy cost plus L2 regularization on the non-bias weights
    return -(1 / n) * (np.sum(y_matrix * np.log(a4)) + np.sum((1 - y_matrix) * np.log(1 - a4))) + \
           lambda_param / (2 * n) * (np.sum(theta1[:, 1:]**2) + np.sum(theta2[:, 1:]**2) + \
                                     np.sum(theta3[:, 1:]**2))
def predict(theta1, theta2, theta3, X):
    n = np.shape(X)[0]
    h1 = sigmoid(np.append(np.ones((n, 1)), X, 1).dot(theta1.T))
    h2 = sigmoid(np.append(np.ones((n, 1)), h1, 1).dot(theta2.T))
    h3 = sigmoid(np.append(np.ones((n, 1)), h2, 1).dot(theta3.T))
    # pick the most probable class; labels are 1-indexed, hence the +1
    p = np.argmax(h3, axis = 1).reshape(n, 1)
    return p + 1
def test(train_data = train_data):
    X = train_data['X']
    y = train_data['y']
    input_layer_size = 260
    hidden_layer_size = 200
    num_labels = 36
    # random initialization of the three weight matrices, unrolled into one vector
    initial_theta1 = rand_init_weights(input_layer_size, hidden_layer_size)
    initial_theta2 = rand_init_weights(hidden_layer_size, hidden_layer_size)
    initial_theta3 = rand_init_weights(hidden_layer_size, num_labels)
    initial_nn_params = np.append(np.append(initial_theta1.flatten(), initial_theta2.flatten()), initial_theta3.flatten())
    lambda_param = 0.1
    # minimize the cost with conjugate gradient, using the analytic gradient from nn_gradient
    result = opt.minimize(fun = nn_cost, x0 = initial_nn_params, \
                          args = (input_layer_size, hidden_layer_size, num_labels, X, y, lambda_param), \
                          method = "CG", jac = nn_gradient, options = {"maxiter": 100})
    nn_params = result.x
    theta1 = nn_params[0:hidden_layer_size * (input_layer_size + 1)].reshape(hidden_layer_size, input_layer_size + 1)
    theta2 = nn_params[hidden_layer_size * (input_layer_size + 1) : \
                       (hidden_layer_size * (input_layer_size + 1) + \
                        hidden_layer_size * (hidden_layer_size + 1))].reshape(hidden_layer_size, hidden_layer_size + 1)
    theta3 = nn_params[(hidden_layer_size * (input_layer_size + 1) + \
                        hidden_layer_size * (hidden_layer_size + 1)):].reshape(num_labels, hidden_layer_size + 1)
    pred = predict(theta1, theta2, theta3, X)
    print("Training Accuracy: ", np.mean(pred == y) * 100, "%")
    return
test()
Training Accuracy:  100.0 %
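To gain confidence in nn_gradient, one can compare it with a finite-difference estimate of nn_cost on a tiny network. The sketch below is my addition (it is not in the original post); it reuses the functions defined above and calls scipy.optimize.check_grad with made-up layer sizes, random data, and lambda_param set to 0:

def gradient_check():
    # tiny network: 5 inputs, 4 hidden units per layer, 3 labels, 10 samples
    input_layer_size, hidden_layer_size, num_labels, n = 5, 4, 3, 10
    X_small = np.random.rand(n, input_layer_size)
    y_small = np.random.randint(1, num_labels + 1, size = (n, 1))
    params = np.append(np.append(rand_init_weights(input_layer_size, hidden_layer_size).flatten(),
                                 rand_init_weights(hidden_layer_size, hidden_layer_size).flatten()),
                       rand_init_weights(hidden_layer_size, num_labels).flatten())
    diff = opt.check_grad(nn_cost, nn_gradient, params,
                          input_layer_size, hidden_layer_size, num_labels, X_small, y_small, 0.0)
    print("gradient check difference:", diff)   # should be a small number, roughly 1e-6 or less

gradient_check()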