Intro to deep learning

Deep learning is an artificial intelligence function that imitates the workings of the human brain in processing data and finding patterns for use in decision making. Deep learning is a subset of machine learning in artificial intelligence (AI) whose networks are capable of learning, without supervision, from data that is unstructured or unlabeled. It is also known as deep neural learning or a deep neural network.

One of the key differences between machine learning and deep learning is feature extraction. In machine learning, feature extraction is done by a human, whereas in deep learning the model does its own feature extraction.

Deep learning models tend to perform well with large amounts of data, whereas older machine learning models tend to stop improving after a saturation point and their learning curve flattens.

Neurons and Layers

Most neural network structures use some type of neuron. Many different kinds of neural networks exist, and programmers introduce experimental neural network structures all the time. An algorithm that is called a neural network will typically be composed of individual, interconnected units even though these units may or may not be called neurons. The name for a neural network processing unit varies among the literature sources. It could be called a node, neuron, cell, or unit.

The artificial neuron receives input from one or more sources, which may be other neurons or data fed into the network. This input is usually floating-point or binary. Binary input is often encoded to floating-point by representing true or false as $1$ or $0$. Sometimes the program instead depicts binary input with a bipolar system, with true as $1$ and false as $-1$.

An artificial neuron multiplies each of these inputs by a weight. Then it adds these multiplications and passes this sum to an activation function. Some neural networks do not use an activation function. The following equation summarizes the calculated output of a neuron:

$$ f(x, \theta) = \phi\left(\sum_{i} \theta_{i} \cdot x_{i}\right) $$

In the above equation, the variables $x$ and $\theta$ represent the input and weights of the neuron. The variable $i$ indexes the inputs and weights; you must always have the same number of weights as inputs. The neural network multiplies each weight by its respective input, sums these products, and feeds the sum into an activation function, denoted by the Greek letter $\phi$ (phi). This process results in a single output from the neuron.
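
As a concrete illustration, here is a minimal NumPy sketch of this equation (the input values, the weights, and the choice of a sigmoid as $\phi$ are example assumptions, not values from the text):

import numpy as np

def sigmoid(x):
    # One possible activation function phi
    return 1 / (1 + np.exp(-x))

x = np.array([1.0, 0.0, 1.0])        # inputs to the neuron
theta = np.array([0.5, -0.2, 0.8])   # one weight per input

# Multiply each input by its weight, sum the products, apply the activation function
output = sigmoid(np.dot(theta, x))
print(output)   # approximately 0.786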

Types of Neurons

Not every neural network will use every kind of neuron. It is also possible for a single neuron to fill the role of several different neuron types.

There are usually five types of neurons in a neural network:

  • Input Neurons - Each input neuron is mapped to one element in the feature vector.
  • Hidden Neurons - Hidden neurons allow the neural network to be abstract and process the input into the output.
  • Output Neurons - Each output neuron calculates one part of the output.
  • Context Neurons - Hold state between calls to the neural network, so that prior inputs can influence the prediction.
  • Bias Neurons - Work similarly to the y-intercept of a linear equation.

Context Neurons

Recurrent neural networks make use of context neurons to hold state. This type of neuron allows the neural network to maintain state. As a result, a given input may not always produce the same output. This inconsistency is similar to the workings of biological brains. Consider how context factors in your response when you hear a loud horn. If you hear the noise while you are crossing the street, you might startle, stop walking, and look in the direction of the horn. If you hear the horn while you are watching an action-adventure film in a movie theatre, you don't respond in the same way. Therefore, prior inputs give you the context for processing the audio input of a horn.

Time series is one application of context neurons. You might need to train a neural network to learn input signals to perform speech recognition or to predict trends in security prices. Context neurons are one way for neural networks to deal with time-series data. Figure 3.CTX shows how a neural network might arrange context neurons:

Figure 3.CTX: Neural Network with Context Neurons

Weights

When input data comes into a neuron, it gets multiplied by a weight value that is assigned to that particular input. These weights start out as random values, and as the neural network learns more about the input data, the network adjusts the weights based on any errors in categorization that the previous weights produced. This process is called training the neural network.

Bias Neurons

Bias is just like an intercept added in a linear equation. It is an additional parameter in the neural network that is used to adjust the output along with the weighted sum of the inputs to the neuron. Moreover, the bias value allows you to shift the activation function to the right or left, which may be critical for successful learning.

Consider this 1-input, 1-output network that has no bias:

The output of the network is computed by multiplying the input (x) by the weight (w0) and passing the result through some kind of activation function (e.g., a sigmoid function).

Here is the function that this network computes, for various values of w0:

Changing the weight w0 essentially changes the "steepness" of the sigmoid. That's useful, but what if you wanted the network to output 0 when x is 2? Just changing the steepness of the sigmoid won't really work -- you want to be able to shift the entire curve to the right.

That's exactly what the bias allows you to do. If we add a bias to that network, like so:

Now the output of the network becomes sig(w0*x + w1*1.0). Here is what the output of the network looks like for various values of w1:

Having a weight of -5 for w1 shifts the curve to the right, which allows us to have a network that outputs a value close to 0 when x is 2.
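
A minimal sketch of this effect, assuming a sigmoid activation and the example values w0 = 1 and w1 = -5 (these numbers are illustrative):

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

w0, w1 = 1.0, -5.0   # weight on the input and weight on the bias neuron
x = 2.0

# Without the bias, the sigmoid is centered at x = 0; the bias term w1*1.0 shifts it.
print(sigmoid(w0 * x))             # approximately 0.88, far from 0
print(sigmoid(w0 * x + w1 * 1.0))  # approximately 0.05, close to 0 when x is 2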

What Is An Input Layer?

The input layer is responsible for receiving the inputs. These inputs can be loaded from an external source such as a web service or a CSV file. There must always be one input layer in a neural network. The input layer takes in the inputs, performs the calculations via its neurons, and then transmits the output to the subsequent layers.

What Is A Hidden Layer?

Hidden layers reside in between the input and output layers, which is the primary reason they are referred to as hidden. The word “hidden” implies that they are not visible to external systems and are “private” to the neural network. There can be zero or more hidden layers in a neural network. The larger the number of hidden layers, the longer it takes the network to produce the output, but the more complex the problems the network can solve.

What Is An Output Layer?

The output layer is responsible for producing the final result. There must always be one output layer in a neural network. The output layer takes in the inputs passed from the layers before it, performs the calculations via its neurons, and then computes the output.

Activation Functions

Activation functions are functions that decide, given the inputs into a node, what the node’s output should be. The activation function determines the output of the neural network, such as yes or no, and maps the resulting values into a range such as 0 to 1 or -1 to 1 (depending on the function).

Primary activation functions used

  • Rectified Linear Unit (ReLU): Used for the output of hidden layers. [Cite:glorot2011deep]
  • Softmax: Used for the output of classification neural networks.
  • Linear: Used for the output of regression neural networks (or 2-class classification).

Rectified Linear Units (ReLU)

Introduced in 2000 by Teh & Hinton, the rectified linear unit (ReLU) has seen very rapid adoption over the past few years. Before the ReLU activation function, programmers generally regarded the hyperbolic tangent as the activation function of choice. Most current research now recommends the ReLU due to superior training results. As a result, most neural networks should utilize the ReLU on hidden layers and either softmax or linear on the output layer. The following equation shows the straightforward ReLU function:

$$ \phi(x) = \max(0, x) $$

We will now examine why ReLU typically performs better than other activation functions for hidden layers. Part of the increased performance is because the ReLU activation function is a piecewise linear, non-saturating function. Unlike the sigmoid/logistic or the hyperbolic tangent activation functions, the ReLU does not saturate to $-1$, $0$, or $1$. A saturating activation function approaches and levels off at a fixed value. The hyperbolic tangent function, for example, saturates to $-1$ as $x$ decreases and $1$ as $x$ increases.

Most current research states that the hidden layers of your neural network should use the ReLU activation.
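
As a small illustration of the difference, here is a sketch of ReLU next to the saturating hyperbolic tangent (the sample inputs are arbitrary):

import numpy as np

def relu(x):
    # phi(x) = max(0, x), applied element-wise
    return np.maximum(0, x)

x = np.array([-10.0, -1.0, 0.0, 1.0, 10.0, 100.0])
print(relu(x))     # 0, 0, 0, 1, 10, 100 -- keeps growing, no saturation
print(np.tanh(x))  # values level off near -1 and 1 -- saturation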

How Are Neural Networks Used for Deep Learning?

For deep learning, several neural network layers are connected so that they can pass information to each other, either in a feedforward or a feedback style. Passing the inputs through the layers to produce an output is called forward propagation.

  • Feedforward: This is the simplest type of ANN. The connections do not form a cycle, so the network has no loops. The input is fed directly toward the output (in a single direction) through a series of weights. Feedforward networks are extensively used in pattern recognition. This type of organization is also referred to as a bottom-up or top-down approach.

  • Multiply-add process
  • Dot product
  • Forward propagation for one data point at a time
  • Output is the prediction for that data point

$$ y = \sum_{i} (\text{input}_i \cdot \text{weight}_i) + \text{bias} $$

Feed forward example

In [1]:
import numpy as np

# One data point with two input features
input_data = np.array([2, 3])

print('input_data', input_data)

# Weights for the two hidden nodes and the output node
weights = {'node_0': np.array([1, 1]),
           'node_1': np.array([-1, 1]),
           'output': np.array([2, -1])}

# Multiply-add (dot product) of the inputs with each hidden node's weights
node_0_value = (input_data * weights['node_0']).sum()
node_1_value = (input_data * weights['node_1']).sum()
input_data [2 3]
In [2]:
# Collect the hidden node values and combine them with the output weights
hidden_layer_values = np.array([node_0_value, node_1_value])

print('hidden layer values', hidden_layer_values)

# The prediction for this data point
output = (hidden_layer_values * weights['output']).sum()

print('\noutput', output)
hidden layer values [5 1]

output 9

Sigmoid Activation Function Example

In [3]:
def sigmoid(X):
    # Sigmoid activation: squashes any input into the range (0, 1)
    return 1/(1+np.exp(-X))

input_data = np.array([-1, 2])

weights = {'node_0': np.array([3, 3]),
           'node_1': np.array([1, 5]),
           'output': np.array([2, -1])}

# Weighted sum into each hidden node, passed through the activation function
node_0_input = (input_data * weights['node_0']).sum()
node_0_output = sigmoid(node_0_input)

node_1_input = (input_data * weights['node_1']).sum()
node_1_output = sigmoid(node_1_input)
    
hidden_layer_output = np.array([node_0_output, node_1_output])
output = (hidden_layer_output * weights['output']).sum()

print('\nOutput',output)
Output 0.905271648220853

Representation learning

  • Deep networks internally build representations of patterns in the data
  • Partially replace the need for feature engineering
  • Subsequent layers build increasingly sophisticated representations of raw data
  • Modeler doesn’t need to specify the interactions
  • When you train the model, the neural network learns weights that capture the relevant patterns to make better predictions

The need for optimization

Predictions with multiple points

  • Making accurate predictions gets harder with more points
  • At any set of weights, there are many values of the error
  • ... corresponding to the many points we make predictions for

Loss function

A loss function or cost function is a function that maps an event or the values of one or more variables onto a real number intuitively representing some cost associated with the event. An optimization problem seeks to minimize a loss function. An objective function is either a loss function or its negative, in which case it is to be maximized. The loss function computes the error for a single training example, while the cost function is the average of the loss functions over all the training examples.

  • Aggregates errors in predictions from many data points into single number
  • Measure of model’s predictive performance
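
For example, a mean squared error cost over a handful of data points can be computed like this (the prediction and target values are made up for illustration):

import numpy as np

predictions = np.array([2.5, 0.0, 2.0, 8.0])
targets     = np.array([3.0, -0.5, 2.0, 7.0])

# Aggregate the per-point errors into one number that summarizes performance
mse = ((predictions - targets) ** 2).mean()
print(mse)   # 0.375 -- lower is better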

Lower loss function value means a better model

  • Goal: Find the weights that give the lowest value for the loss function
  • Solution: Gradient descent

Gradient descent

Gradient descent is a first-order iterative optimization algorithm for finding a local minimum of a loss or cost function. To find a local minimum of a function using gradient descent, we take steps proportional to the negative of the gradient (or approximate gradient) of the function at the current point.

  • $X$ = parameter to optimize
  • $y$ = loss or cost function

The goal is to minimize the function: we need to find the value of $X$ that produces the lowest value of $y$. The derivative, calculated using the power or chain rule, is the slope of the cost function parabola at a particular point. The slope is described by drawing a tangent line to the graph at that point; this tangent line gives the direction to move to reach the minimum. When there are two or more parameters, partial derivatives are calculated, and the vector of two or more partial derivatives of the same function is the gradient.

  • The size of the steps taken to reach the minimum is called the learning rate.
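
As a concrete reasoning step, consider a single data point with prediction $\hat{y} = w \cdot x$ and squared-error loss; this is the derivative that the slope calculation below relies on:

$$ L(w) = (w \cdot x - y)^2, \qquad \frac{\partial L}{\partial w} = 2x\,(w \cdot x - y) = 2 \cdot x \cdot \text{error} $$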

Slope calculation example

To calculate the slope for a weight, we need to multiply:

  • Slope of the loss function with respect to the value at the node we feed into
  • The value of the node that feeds into our weight
  • Slope of the activation function with respect to the value we feed into

Here we calculate slopes and update weights

In [4]:
weights = np.array([1, 2])
input_data = np.array([3, 4])

target = 6
learning_rate = 0.01

# Forward propagation: prediction is the weighted sum of the inputs
preds = (weights * input_data).sum()
error = preds - target
print(error)
5
In [5]:
# Slope of the squared-error loss with respect to each weight
gradient = 2 * input_data * error
print(gradient)
[30 40]
In [6]:
# Take a step against the gradient, then check that the error shrinks
weights_updated = weights - learning_rate * gradient
preds_updated = (weights_updated * input_data).sum()
error_updated = preds_updated - target
print(error_updated)
2.5

Backpropagation

Back-propagation is the essence of neural net training. It is the practice of fine-tuning the weights of a neural net based on the error rate (i.e. loss) obtained in the previous epoch (i.e. iteration). Proper tuning of the weights ensures lower error rates, making the model reliable by increasing its generalization.

  • Allows gradient descent to update all weights in neural network (by getting gradients for all weights)
  • Comes from chain rule of calculus

Backpropagation process

  • Start at some random set of weights
  • Use forward propagation to make a prediction
  • Use backward propagation to calculate the slope of the loss function with respect to each weight
  • Multiply that slope by the learning rate, and subtract from the current weights
  • Keep repeating the process until we get to a flat part (see the sketch below).
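
The loop below sketches these steps for the same single-layer, single-data-point setup used in the slope calculation example above; it is an illustration of the update rule, not a full multi-layer backpropagation implementation:

import numpy as np

weights = np.array([1.0, 2.0])     # start at some (here fixed, usually random) weights
input_data = np.array([3.0, 4.0])
target = 6
learning_rate = 0.01

for i in range(20):
    preds = (weights * input_data).sum()          # forward propagation
    error = preds - target
    gradient = 2 * input_data * error             # slope of the loss for each weight
    weights = weights - learning_rate * gradient  # step against the slope
    print(i, round(error, 4))
# The printed error shrinks toward 0 as the updates repeat.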

Stochastic gradient descent

  • It is common to calculate slopes on only a subset of the data (‘batch’)
  • Use a different batch of data to calculate the next update
  • Start over from the beginning once all data is used
  • Each time through the training data is called an epoch
  • When slopes are calculated on one batch at a time, this is called stochastic gradient descent (see the sketch below)
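
A minimal sketch of this batching loop, reusing the single-layer setup from earlier (the data, batch size, and made-up linear targets are illustrative assumptions):

import numpy as np

np.random.seed(0)
X = np.random.rand(100, 2)            # 100 data points with 2 features
y = X @ np.array([3.0, 4.0])          # made-up targets from a linear rule
weights = np.array([1.0, 2.0])
learning_rate, batch_size = 0.01, 10

for epoch in range(5):                          # one epoch = one pass over all the data
    for start in range(0, len(X), batch_size):
        X_batch = X[start:start + batch_size]   # slopes come from this batch only
        y_batch = y[start:start + batch_size]
        preds = X_batch @ weights
        errors = preds - y_batch
        gradient = 2 * X_batch.T @ errors / len(X_batch)
        weights = weights - learning_rate * gradient
    print(epoch, weights)
# The weights drift toward the true values [3, 4] as the epochs repeat.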