NLP Zero to One: Deep Learning Theory Basics (Part 3/30)

Perceptron, Neural-Nets, Activation Functions

Kowshik chilamkurthy
Nerd For Tech


Generated by author

Introduction..

The idea of neural networks draws its inspiration from the biological neurons of the human brain. A neural network is a network of small computing units, each of which takes a vector of input values X and outputs a single value y. Neural nets are often referred to as deep learning, because these networks have many layers of small computing units. In this blog we will introduce the basic neural network concepts every NLP practitioner should know.

Basic computing unit: Perceptron Algo..

Fig-1: Perceptron algorithm in a single neuron (the basic computational unit in neural nets)

Deep learning combines artificial neurons into a network in which information is passed between neurons. Each of these neurons learns a different function of its input vector and outputs a single value.

Before we discuss the concepts of deep learning, let’s try to understand the perceptron algorithm, which runs inside each neuron (the basic computing unit). In its basic form, the perceptron algorithm uses the same mathematics as logistic regression.

“b” is the learnable bias term. We can treat b as an additional weight w₀.
Each individually learned weight wᵢ is multiplied with the corresponding input xᵢ of X, the weighted inputs are summed together with b, and the result is passed to a function f(.) to get the output y: y = f(Σᵢ wᵢxᵢ + b). f(.) is called the activation function. It is important to note that the bias term (or w₀) is also a learnable parameter. The bias term allows the model to move the decision boundary away from the origin at the perceptron level.

Generated by Google Jam-board

As you can clearly see, the perceptron assumes a linear relationship between the input variables X and the output y. In real-world problems, this linear assumption often fails.
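To make the computation concrete, here is a minimal sketch of a single perceptron, y = f(w·X + b), written in NumPy with a simple step activation; the variable names and values are illustrative, not taken from the figures above.

```python
import numpy as np

def perceptron(x, w, b):
    """Single perceptron: weighted sum of inputs plus bias, passed through a step activation f(.)."""
    z = np.dot(w, x) + b           # linear combination: sum_i w_i * x_i + b
    return 1.0 if z > 0 else 0.0   # step activation

# Illustrative 3-dimensional input X, learned weights w and bias b
x = np.array([0.5, -1.2, 3.0])
w = np.array([0.4, 0.3, -0.1])
b = 0.2                            # equivalent to an extra weight w0 on a constant input 1
print(perceptron(x, w, b))         # 0.0 here, since w·x + b is negative
```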

Feed-Forward, Fully Connected NN..

A neural network overcomes this limitation in two ways:

  1. It links these neurons together into different layers of a network.
  2. It uses a differentiable, non-linear activation function in each neuron.

The NN is composed of interconnected neurons, and the data flows in only one direction, so it is called a feed-forward neural network. A layer is defined as a set of neurons. These layers are “fully connected”, meaning that each neuron in a layer takes as input the outputs of all the neurons in the previous layer, so there is a link between every pair of neurons in two adjacent layers. A NN must contain an input layer, an output layer, and at least one hidden layer.

Illustration of a single-layer NN, generated by author

h1 and h2 are the nodes in the hidden layer. Observe that each node in the hidden layer is fully connected to the input X. A non-linear activation function is applied at the end of each neuron, which allows the output value to be a non-linear, weighted combination of its inputs, thereby creating non-linear features used by the next layer.
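As a rough sketch of the forward pass through such a single-hidden-layer, fully connected network (two hidden nodes h1 and h2 and one output node), assuming a sigmoid activation; the weights are random and purely illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x  = np.array([0.5, -1.2, 3.0])   # input vector X with 3 features
W1 = np.random.randn(2, 3)        # hidden layer: each of h1, h2 is connected to all 3 inputs
b1 = np.zeros(2)
W2 = np.random.randn(1, 2)        # output layer: takes h1 and h2 as inputs
b2 = np.zeros(1)

h = sigmoid(W1 @ x + b1)          # non-linear features produced by the hidden layer
y = sigmoid(W2 @ h + b2)          # final output
print(h, y)
```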

Non-Linear Activation Functions

In a node, the basic computation block in the neural-net, an activation function is used to make the output value a non-linear, weighted combination of its inputs. Non-linear activation functions play a very important role in improving the representational power of the neural-net. We’ll discuss three popular non-linear functions f(.).

Source [1]

Sigmoid

With range (0, 1), this function acts as a continuous squashing function. It also has a continuous derivative, which is ideal for gradient-descent methods.
Drawbacks: Internal covariate shift is introduced because the outputs of the sigmoid are not centered around 0 but around 0.5; this creates a discrepancy between layers because their outputs are not in a consistent range.
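A small sketch of the sigmoid and its derivative (the derivative σ(z)(1 − σ(z)) is the standard form, not specific to this post), showing the (0, 1) squashing and the shrinking gradients at the extremes:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)       # continuous derivative, convenient for gradient descent

z = np.array([-5.0, 0.0, 5.0])
print(sigmoid(z))              # values squashed into (0, 1), centered around 0.5
print(sigmoid_grad(z))         # gradients approach 0 at the extremes
```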

Tanh

With range (−1, 1), this is a zero-centered function, which solves one of the issues with the sigmoid activation function.
Drawbacks: the gradient saturates at the extremes of the function, which causes the gradients to be very close to 0.
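The same kind of sketch for tanh, using the standard derivative 1 − tanh²(z), illustrates both the zero-centered output and the saturating gradient:

```python
import numpy as np

def tanh_grad(z):
    return 1.0 - np.tanh(z) ** 2   # derivative of tanh

z = np.array([-5.0, 0.0, 5.0])
print(np.tanh(z))                  # outputs in (-1, 1), centered around 0
print(tanh_grad(z))                # nearly 0 at the extremes: gradient saturation
```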

ReLU

With range (0, ∞), this is a simple, fast activation function typically found in computer vision. The function is linear if the input is greater than 0, and outputs 0 otherwise.
The ReLU function is computationally much faster, as the sigmoid and tanh functions require an exponential operation. It promises better convergence because of its non-saturating gradient in one direction.
Drawbacks: ReLU must be used very carefully if large gradients are involved; a large gradient update can prevent the neuron from ever updating again. So the learning rate must be kept lower when dealing with ReLU.
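A sketch of ReLU and its gradient; note that the gradient is exactly 0 for negative inputs, which is why a large update that pushes a neuron's inputs negative can stop it from ever updating again:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)        # linear for z > 0, zero otherwise

def relu_grad(z):
    return (z > 0).astype(float)     # 1 for positive inputs, 0 otherwise

z = np.array([-2.0, 0.0, 3.0])
print(relu(z))                       # [0. 0. 3.]
print(relu_grad(z))                  # [0. 0. 1.] -- no gradient flows for negative inputs
```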

Note:

Leaky ReLU function, from Ref. 1

Leaky ReLU: Leaky ReLU introduces an α parameter that allows small gradients to be back-propagated even when the input is negative.
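A minimal sketch of Leaky ReLU with an illustrative α = 0.01 (the exact value of α is a design choice, not stated in the post):

```python
import numpy as np

def leaky_relu(z, alpha=0.01):
    return np.where(z > 0, z, alpha * z)   # small slope alpha for negative inputs

z = np.array([-2.0, 0.0, 3.0])
print(leaky_relu(z))                       # [-0.02  0.    3.  ] -- negative inputs keep a small gradient
```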

Softmax function, from Ref. 1

Softmax: The softmax function allows us to output a categorical probability distribution over K classes. We can use the softmax to produce a vector of probabilities from the outputs of the final layer.
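A sketch of a numerically stable softmax over K raw output scores:

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)                # subtract the max for numerical stability
    exp_z = np.exp(z)
    return exp_z / np.sum(exp_z)     # probabilities that sum to 1

scores = np.array([2.0, 1.0, 0.1])   # raw outputs for K = 3 classes
print(softmax(scores))               # approx. [0.659 0.242 0.099], a categorical distribution
```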

Generated by author

Previous: NLP Zero to One: Sparse Document Representations (Part 2/30)
Next: NLP Zero to One: Deep Learning Training Procedure (Part 4/30)
