# The Back-propagation Algorithm
The aim of this exercise is to implement the back-propagation algorithm. There are many ready-made tools that do this already, but here we aim to implement the algorithm using only the linear algebra and other mathematics tools available in numpy and scipy. This way you get an insight into how the algorithm works before you move on to use it in larger toolboxes.
We will restrict ourselves to fully-connected feed-forward neural networks with one hidden layer (plus an input and an output layer).
## Section 1
### Section 1.1 The Sigmoid function
We will use the following nonlinear activation function:
$$\sigma(a)=\frac{1}{1+e^{-a}}$$
We will also need the derivative of this function:
$$\frac{d}{da}\sigma(a) = \frac{e^{-a}}{(1+e^{-a})^2} = \sigma(a) (1-\sigma(a))$$
Create these two functions:
1. The sigmoid function: `sigmoid(x)`
2. The derivative of the sigmoid function: `d_sigmoid(x)`
**Note**: To avoid overflows, make sure that inside `sigmoid(x)` you check if `x<-100` and return `0.0` in that case.
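A minimal sketch of both functions, assuming scalar inputs (the later sections apply them element-wise) and including the overflow guard from the note above:
```
import numpy as np

def sigmoid(x):
    # Guard against overflow in np.exp for very negative inputs
    if x < -100:
        return 0.0
    return 1.0 / (1.0 + np.exp(-x))

def d_sigmoid(x):
    # Derivative expressed through the sigmoid itself: sigma(x) * (1 - sigma(x))
    s = sigmoid(x)
    return s * (1.0 - s)
```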
Example inputs and outputs:
* `sigmoid(0.5)` -> ` XXXXXXXXXX`
* `d_sigmoid(0.2)` -> ` XXXXXXXXXX`
### Section 1.2 The Perceptron Function
A perceptron takes in an array of inputs $X$ and an array of the corresponding weights $W$ and returns the weighted sum of $X$ and $W$, as well as the result of applying the activation function (i.e. the sigmoid) to this weighted sum. *(see Eq. 5.48 & 5.62 in Bishop)*.
Implement the function `perceptron(x, w)` that returns the weighted sum and the output of the activation function.
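A minimal sketch, assuming `x` and `w` are one-dimensional numpy arrays of equal length and reusing `sigmoid` from Section 1.1:
```
def perceptron(x, w):
    # Weighted sum of inputs and weights, followed by the sigmoid of that sum
    weighted_sum = np.dot(x, w)
    return weighted_sum, sigmoid(weighted_sum)
```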
Example inputs and outputs:
* `perceptron(np.array([1.0, 2.3, 1.9]), np.array([0.2, 0.3, 0.1]))` -> `( XXXXXXXXXX, XXXXXXXXXX)`
* `perceptron(np.array([0.2, 0.4]), np.array([0.1, 0.4]))` -> `( XXXXXXXXXX, XXXXXXXXXX)`
### Section 1.3 Forward Propagation
When we have the sigmoid and the perceptron function, we can start to implement the neural network.
Implement a function `ffnn` which computes the output and hidden layer variables for a single-hidden-layer feed-forward neural network. If the number of inputs is $D$, the number of hidden layer neurons is $M$ and the number of output neurons is $K$, then the matrix $W_1$ of size $[(D+1)\times M]$ and the matrix $W_2$ of size $[(M+1)\times K]$ represent the linear transforms from the input layer to the hidden layer and from the hidden layer to the output layer, respectively.
Write a function `y, z0, z1, a1, a2 = ffnn(x, M, K, W1, W2)` (see the sketch after this list) where:
* `x` is the input pattern of $(1\times D)$ dimensions (a line vector)
* `W1` is a $((D+1)\times M)$ matrix and `W2` is a $((M+1)\times K)$ matrix (the `+1`s are for the bias weights)
* `a1` is the input vector of the hidden layer of size $(1\times M)$ (needed for backprop).
* `a2` is the input vector of the output layer of size $(1\times K)$ (needed for backprop).
* `z0` is the input pattern of size $(1\times (D+1))$ (this is just `x` with `1.0` inserted at the beginning to match the bias weight).
* `z1` is the output vector of the hidden layer of size $(1\times (M+1))$ (needed for backprop).
* `y` is the output of the neural network of size $(1\times K)$.
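A possible sketch of `ffnn` under the shape conventions above; since the scalar `sigmoid` from Section 1.1 is not vectorized, it is applied element-wise here:
```
def ffnn(x, M, K, W1, W2):
    # Prepend the bias input 1.0 to the pattern
    z0 = np.concatenate(([1.0], x))
    # Hidden layer: weighted sums a1, then sigmoid outputs z1 with a bias unit prepended
    a1 = z0 @ W1                                            # shape (M,)
    z1 = np.concatenate(([1.0], [sigmoid(a) for a in a1]))  # shape (M+1,)
    # Output layer: weighted sums a2, then sigmoid outputs y
    a2 = z1 @ W2                                            # shape (K,)
    y = np.array([sigmoid(a) for a in a2])                  # shape (K,)
    return y, z0, z1, a1, a2
```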
Example inputs and outputs:
*First load the iris data:*
```
features, targets, classes = load_iris()
(train_features, train_targets), (test_features, test_targets) = \
split_train_test(features, targets)
```
*Then call the function*
```
# Take one point:
x = train_features[0, :]
D = train_features.shape[1]  # number of input features
K = 3  # number of classes
M = 10
# Initialize two random weight matrices
W1 = 2 * np.random.rand(D + 1, M) - 1
W2 = 2 * np.random.rand(M + 1, K) - 1
y, z0, z1, a1, a2 = ffnn(x, M, K, W1, W2)
```
*Outputs*:
* `y` : `[ XXXXXXXXXX XXXXXXXXXX]`
* `z0`: `[ XXXXXXXXXX]`
* `z1`: `[ XXXXXXXXXX XXXXXXXXXX XXXXXXXXXX XXXXXXXXXX XXXXXXXXXX XXXXXXXXXX XXXXXXXXXX XXXXXXXXXX]`
* `a1`: `[ XXXXXXXXXX XXXXXXXXXX XXXXXXXXXX XXXXXXXXXX XXXXXXXXXX XXXXXXXXXX XXXXXXXXXX]`
* `a2`: `[ XXXXXXXXXX XXXXXXXXXX]`
### Section 1.4 Backward Propagation
We will now implement the back-propagation algorithm to evaluate the gradient of the error function, $\nabla E_n$.
Create a function `y, dE1, dE2 = backprop(x, target_y, M, K, W1, W2)` where:
* `x, M, K, W1` and `W2` are the same as for the `ffnn` function
* `target_y` is the target vector. In our case (i.e. for the classification of Iris) this will be a vector with 3 elements, with one element equal to 1.0 and the others equal to 0.0. (*).
* `y` is the output of the output layer (vector with 3 elements)
* `dE1` and `dE2` are the gradient error matrices that contain $\frac{\partial E_n}{\partial w_{ji}}$ for the first and second layers.
Assume sigmoid hidden and output activation functions and a cross-entropy error function (for classification). Notice that $E_n(\mathbf{w})$ is defined as the error function for a single pattern $\mathbf{x}_n$. *The algorithm is described on page 244 in Bishop*.
The inner workings of your `backprop` function should follow this order of actions (see the sketch after this list):
1. Run `ffnn` on the input.
2. Calculate $\delta_k = y_k - target\_y_k$.
3. Calculate $\delta_j = \left(\frac{d}{da}\sigma(a1_j)\right) \sum_{k} w_{k,j+1}\delta_k$ (the `+1` is because of the bias weights).
4. Initialize `dE1` and `dE2` as zero matrices with the same shape as `W1` and `W2`.
5. Calculate $dE1_{i,j} = \delta_j z0_i$ and $dE2_{j,k} = \delta_k z1_j$.
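A possible sketch of `backprop` following the steps above; the gradient matrices are filled with outer products, which produces the same entries $dE1_{i,j} = \delta_j z0_i$ and $dE2_{j,k} = \delta_k z1_j$ as an explicit double loop:
```
def backprop(x, target_y, M, K, W1, W2):
    # 1. Forward pass
    y, z0, z1, a1, a2 = ffnn(x, M, K, W1, W2)
    # 2. Output-layer deltas for the cross-entropy error
    delta_k = y - target_y                                   # shape (K,)
    # 3. Hidden-layer deltas; W2[j + 1, :] skips the bias row of W2
    delta_j = np.array([d_sigmoid(a1[j]) * np.dot(W2[j + 1, :], delta_k)
                        for j in range(M)])                  # shape (M,)
    # 4. & 5. Gradients as outer products of layer inputs and deltas
    dE1 = np.outer(z0, delta_j)                              # same shape as W1
    dE2 = np.outer(z1, delta_k)                              # same shape as W2
    return y, dE1, dE2
```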
Example inputs and outputs:
*Call the function*
```
K = 3 # number of classes
M = 6
D = train_features.shape[1]
x = features[0, :]
# create one-hot target for the feature
target_y = np.zeros(K)
target_y[targets[0]] = 1.0
np.random.seed(42)
# Initialize two random weight matrices
W1 = 2 * np.random.rand(D + 1, M) - 1
W2 = 2 * np.random.rand(M + 1, K) - 1
y, dE1, dE2 = backprop(x, target_y, M, K, W1, W2)
```
*Output*
* `y`: `[ XXXXXXXXXX XXXXXXXXXX]`
* `dE1`:
```
[[-3.17372897e-03 3.13040504e-02 -6.72419861e-03 7.39219402e-02
-1.16539047e-04 9.29566482e-03]
[-1.61860177e-02 1.59650657e-01 -3.42934129e-02 3.77001895e-01
-5.94349138e-04 4.74078906e-02]
[-1.11080514e-02 1.09564176e-01 -2.35346951e-02 2.58726791e-01
-4.07886663e-04 3.25348269e-02]
[-4.44322055e-03 4.38256706e-02 -9.41387805e-03 1.03490716e-01
-1.63154665e-04 1.30139307e-02]
[-6.34745793e-04 6.26081008e-03 -1.34483972e-03 1.47843880e-02
-2.33078093e-05 1.85913296e-03]]
```
* `dE2`:
```
[[-5.73709549e-01 1.21816299e-01 5.68407958e-01]
[-3.82317044e-02 8.11777445e-03 3.78784091e-02]
[-5.13977514e-01 1.09133338e-01 5.09227901e-01]
[-2.11392026e-01 4.48850716e-02 2.09438574e-01]
[-1.65803375e-01 3.52051896e-02 1.64271203e-01]
[-3.19254175e-04 6.77875452e-05 3.16303980e-04]
[-5.60171752e-01 1.18941805e-01 5.54995262e-01]]
```
(*): *This is referred to as [one-hot encoding](https://en.wikipedia.org/wiki/One-hot). If we have e.g. 3 possible classes: `yellow`, `green` and `blue`, we could assign the following label mapping: `yellow: 0`, `green: 1` and `blue: 2`. Then we could encode these labels as:*
$$
\text{yellow} = \begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix}, \quad \text{green} = \begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix}, \quad \text{blue} = \begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix}
$$
*But why would we choose to do this instead of just using $0, 1, 2$? The reason is simple: using ordinal categorical labels injects assumptions into the network that we want to avoid. The network might assume that `yellow: 0` is more different from `blue: 2` than from `green: 1` because the difference in the labels is greater. We want our neural networks to output* **probability distributions over classes**, *meaning that the output of the network might look something like:*
$$
\text{NN}(x) = \begin{bmatrix} XXXXXXXXXX \end{bmatrix}
$$
*From this we can directly make a prediction: `0.65` is highest, so the model is most confident that the input feature corresponds to the `blue` label.*
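As a small illustration (the numeric output values below are hypothetical), a one-hot target can be built with `np.zeros` and the predicted class read back with `np.argmax`:
```
K = 3                                     # yellow, green, blue
label = 2                                 # "blue" under the mapping above
target_y = np.zeros(K)
target_y[label] = 1.0                     # -> [0. 0. 1.]

nn_output = np.array([0.15, 0.20, 0.65])  # hypothetical network output
prediction = np.argmax(nn_output)         # -> 2, i.e. "blue"
```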
## Section 2 - Training the Network
We are now ready to train the network. Training consists of:
1. forward propagating an input feature through the network,
2. calculating the error between the prediction the network made and the actual target,
3. back-propagating the error through the network to adjust the weights.
### Section 2.1
Write a function called `W1tr, W2tr, E_total, misclassification_rate, guesses = train_nn(X_train, t_train, M, K, W1, W2, iterations, eta)` where
Inputs:
* `X_train` and `t_train` are the training data and the target values
* `M, K, W1, W2` are defined as above
* `iterations` is the number of iterations the training should take, i.e. how often we should update the weights
* `eta` is the learning rate.
Outputs:
* `W1tr`, `W2tr` are the updated weight matrices
* `E_total` is an array that contains the error after each iteration.
* `misclassification_rate` is an array that contains the misclassification rate after each iteration
* `guesses` is the result from the last iteration, i.e. what the network is guessing for the input dataset `X_train`.
The inner workings of your `train_nn` function should follow this order of actions (a sketch follows the list):
1. Initialize the necessary variables.
2. Run a loop for `iterations` iterations.
3. In each iteration we will collect the gradient error matrices for each data point. Start by initializing `dE1_total` and `dE2_total` as zero matrices with the same shape as `W1` and `W2` respectively.
4. Run a loop over all the data points in `X_train`. In each iteration we call `backprop` to get the gradient error matrices and the output values.
5. Once we have collected the error gradient matrices for all the data points, we adjust the weights in `W1` and `W2` using `W1 = W1 - eta * dE1_total / N`, where `N` is the number of data points in `X_train` (and similarly for `W2`).
6. For the error estimation we'll use the cross-entropy error function *(Eq. 4.90 in Bishop)*.
7. When the outer loop finishes, we return from the function.
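A possible sketch of the training loop under the steps above; the elementwise form used for the cross-entropy error and the lack of normalisation of `E_total` are assumptions here, since the outputs are sigmoids rather than a softmax:
```
def train_nn(X_train, t_train, M, K, W1, W2, iterations, eta):
    N = X_train.shape[0]
    E_total = np.zeros(iterations)
    misclassification_rate = np.zeros(iterations)
    guesses = np.zeros(N)
    for it in range(iterations):
        # 3. Accumulators for the gradient error matrices
        dE1_total = np.zeros(W1.shape)
        dE2_total = np.zeros(W2.shape)
        error = 0.0
        # 4. Collect gradients over all data points
        for n in range(N):
            target_y = np.zeros(K)
            target_y[t_train[n]] = 1.0      # one-hot target
            y, dE1, dE2 = backprop(X_train[n, :], target_y, M, K, W1, W2)
            dE1_total += dE1
            dE2_total += dE2
            guesses[n] = np.argmax(y)
            # 6. Cross-entropy error for this pattern (elementwise sigmoid form, an assumption)
            error -= np.sum(target_y * np.log(y) + (1 - target_y) * np.log(1 - y))
        # 5. Gradient-descent step with the averaged gradients
        W1 = W1 - eta * dE1_total / N
        W2 = W2 - eta * dE2_total / N
        E_total[it] = error
        misclassification_rate[it] = np.mean(guesses != t_train)
    return W1, W2, E_total, misclassification_rate, guesses
```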
Example inputs and outputs:
*Call the function*:
```
K = 3 # number of classes
M = 6
D = train_features.shape[1]
np.random.seed(42)
# Initialize two random weight matrices
W1 = 2 * np.random.rand(D + 1, M) - 1
W2 = 2 * np.random.rand(M + 1, K) - 1
W1tr, W2tr, Etotal, misclassification_rate, last_guesses = train_nn(
train_features[:20, :], train_targets[:20], M, K, W1, W2, 500, 0.3)
```
*Output*
* `W1tr`:
```
[[ XXXXXXXXXX XXXXXXXXXX XXXXXXXXXX XXXXXXXXXX ]
...
[ XXXXXXXXXX XXXXXXXXXX XXXXXXXXXX XXXXXXXXXX]]
```