Why do we need normalization?
In today’s post, I will try to explain what data normalization is and why we need it. Happy reading!
Data normalization is commonly used in machine learning, but many users don’t know why they apply it; they have just heard that they should.
Let’s be direct: most of the time, data normalization is nothing more than a linear transformation applied to the data. Say we have a numerical variable X for each row of our dataset; this variable can be whatever you want, a price, a weight, …
The normalization of the variable X is the result of a linear function f applied to each of the values of X, i.e.:
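$$ f(x) = a\,x + b, \qquad x \in E_X $$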
where E_X is the set of the values that X takes in our dataset.
Of course we won’t use just any a and b; they are carefully chosen. In practice, two normalization methods are the most widely used.
1. Standard deviation normalization
The linear transformation is the following:
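$$ f(x) = \frac{x - \mu_X}{\sigma_X} $$

where μ_X is the mean of the values of X and σ_X their standard deviation.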
Is this a linear transformation?
Yes, look at this:
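$$ f(x) = \frac{x - \mu_X}{\sigma_X} = \underbrace{\frac{1}{\sigma_X}}_{a}\,x + \underbrace{\left(-\frac{\mu_X}{\sigma_X}\right)}_{b} $$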
It’s true that the parameters a and b depend on the values of X, but once they are computed they stay fixed while we apply the function to the values of X.
But what’s the use of subtracting the mean and dividing by sigma?
Let’s look at an example showing the distribution of X before and after normalization.
Before normalization, the range of values that X takes is large, between 0 and 50 (look at the x-axis); after normalization, the values of X are concentrated around zero.
This normalization does two things:
- First, it centers our data, meaning that the mean of the new values of X is zero; we get both positive and negative values.
- Second, it sets the variance to 1, which in this example brings the values closer to each other.
Here are 10 values of X before and after normalization.
The values become smaller and are distributed around zero. Also, since this is a linear transformation with a positive slope, if a row (individual) had the largest value of X, it will also have the largest value after the transformation. In other words, normalization preserves the order that existed between the values (this aspect is very important).
Also, with normalization we don’t lose any information: we can go back from the normalized data to the original data by applying the inverse of the transformation f, as the short sketch below illustrates.
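Here is a minimal sketch with NumPy, using made-up values of X (the numbers are only for illustration), that checks these properties: mean zero, variance one, preserved order, and the possibility to recover the original data.

```python
import numpy as np

# Made-up values of X, e.g. ten prices (illustration only).
x = np.array([3.0, 47.0, 12.5, 30.0, 8.0, 22.0, 41.0, 5.5, 18.0, 36.0])

mu, sigma = x.mean(), x.std()
z = (x - mu) / sigma                                  # standard deviation normalization

print(round(z.mean(), 10))                            # ~0: the data is centered
print(round(z.std(), 10))                             # ~1: the variance is set to 1
print(np.array_equal(np.argsort(x), np.argsort(z)))   # True: the order is preserved

# No information is lost: invert f to go back to the original data.
x_back = z * sigma + mu
print(np.allclose(x_back, x))                         # True
```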
Okay, cool, but why do we need to do this in machine learning?
Don’t worry, I will answer this question just after introducing the second normalization method.
2. Min-max normalization
Like the first method, this transforms a variable with a large scale into a small one. The transformation is the following:
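$$ f(x) = \frac{x - \min(E_X)}{\max(E_X) - \min(E_X)} $$

where min(E_X) and max(E_X) are the smallest and largest values of X in the dataset. It comes with the following properties: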
- All normalized values lie in the interval [0, 1]; as a consequence, the mean also lies in [0, 1].
- The standard deviation will be lower than 1 because all the values are in [0, 1]; the normalized values end up closer together than with standard deviation normalization.
- Like standard deviation normalization, it preserves the order: if a row had the largest value of X, it will also have the largest value after the transformation (see the sketch below).
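And a similar sketch for min-max normalization, with the same made-up values:

```python
import numpy as np

# Same made-up values of X as before (illustration only).
x = np.array([3.0, 47.0, 12.5, 30.0, 8.0, 22.0, 41.0, 5.5, 18.0, 36.0])

m = (x - x.min()) / (x.max() - x.min())               # min-max normalization

print(m.min(), m.max())                               # 0.0 1.0: everything lands in [0, 1]
print(round(m.mean(), 3))                             # ~0.439: the mean is also inside [0, 1]
print(round(m.std(), 3))                              # ~0.336: standard deviation below 1
print(np.array_equal(np.argsort(x), np.argsort(m)))   # True: the order is preserved
```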
Now let’s move to the interesting part: why do we normalize?
The aim of normalizing data
The main aim of normalization is to reduce the scale of the data and make variables comparable with each other, since they become unitless.
I have four example applications of normalization in mind, but I will only talk about the neural network case, because the other three (Regularization, Principal Component Analysis, K-NN) each deserve a whole post.
Normalization for Neural Networks
You should know something: when training a model with a loss function, normalizing or not does not change the loss function or create better local or global minima. In other words, the set of minima the model can reach is the same with or without normalization.
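One way to see it, sketched here under an assumption of mine (made only for illustration) that the first layer of the model is linear: the normalization z = (x − μ)/σ can be absorbed into that layer’s weights and bias, so the model can represent exactly the same functions with or without it,

$$ w^\top x + b \;=\; (\sigma \odot w)^\top z + \big(b + w^\top \mu\big), \qquad z = \frac{x - \mu}{\sigma} \ \text{(element-wise)}. $$

Every loss value reachable on the raw data is therefore also reachable on the normalized data, and vice versa.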
Theoretically, normalization will not improve our model.
I’m a little bit confused…
I said theoretically, and we all know there is always a difference between theory and practice.
Training a neural network is done by minimizing the loss function, and we do so using the gradient of that function. But sometimes our model stops learning; more precisely, some neurons get trapped. This happens when some components of the gradient become zero (this is called the vanishing gradient problem).
But when does the gradient become zero?
- The first case is when we are at a global/local minimum; there, the whole gradient is zero.
- The second case is when some of its components are zero because of the activation function.
By the chain rule, the gradient is a product of derivatives, and one of those derivatives is the activation function’s derivative. So what happens if that one is null (equal to zero)?
Yes, the gradient also becomes null, and the problem is that the derivative of some activation functions converges to zero very quickly as the input grows.
In the figure above we have plotted the sigmoid activation function and its derivative. You can see that the derivative is noticeably different from zero only in a small interval centered at zero; if, by some misfortune, the neuron’s input does not fall in that green interval, then the derivative is essentially zero and that neuron stops learning.
So I will let you imagine what happens if our variables take huge values: yes, we can be pretty sure that we won’t fall in the green part…
Normalization is a way to prevent this, because the values become smaller. I say “prevent” because it reduces the probability of landing outside the green interval, but it doesn’t bring that probability down to zero.
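A quick numerical check of this (the input values are made up, just to illustrate the scale effect):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def sigmoid_derivative(t):
    s = sigmoid(t)
    return s * (1.0 - s)

raw_input = 50.0          # e.g. an unnormalized value fed to a neuron
normalized_input = 0.8    # the same kind of value after normalization

print(sigmoid_derivative(raw_input))         # ~1.9e-22: essentially zero, the neuron stops learning
print(sigmoid_derivative(normalized_input))  # ~0.21: a usable gradient
```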
Is there any activation function whose derivative doesn’t converge quickly to zero, so that we can avoid vanishing gradients?
PLOT TWIST: yes, there is one whose derivative does not fade toward zero as the input grows.
This activation function is popular mainly because it avoids the vanishing gradient problem, and it is the most widely used activation function in neural networks.
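The function in question is presumably ReLU (the rectified linear unit): its derivative is exactly 1 for every positive input, however large, so it never fades the way the sigmoid’s does (it is, however, 0 for negative inputs). A quick check:

```python
import numpy as np

def relu_derivative(t):
    # ReLU(t) = max(0, t), so its derivative is 1 for t > 0 and 0 for t <= 0.
    return (t > 0).astype(float)

inputs = np.array([0.5, 5.0, 50.0, 500.0])
print(relu_derivative(inputs))   # [1. 1. 1. 1.]: the gradient does not fade for large positive inputs
```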
I hope this was clear and that you now know a bit more about normalization and why it’s important to apply it in some cases. If you have any questions or anything is unclear, leave a comment.