Why neural networks work with nonlinear problems

raian lefgoum
9 min read · Jun 14, 2019

Have you ever wondered why deep learning works so well and is able to find patterns in data? In this article I will try to explain it step by step, so keep reading until the end.

In machine learning, we often deal with linear problems. A linear problem is when a variable Y is linearly dependent on another variable X; this dependency can be perfectly linear or not.

On the left, the dependency is perfect, meaning that Y = aX + b. On the right it's not perfect, because we don't have a straight line, but a line somewhat suits the points, so what we want to do is draw the line nearest to all of them. This is what we call linear regression.

The purpose of doing this is that we can predict Y for a new, unseen input X using this line's equation. But as you can see, this only suits linear problems; the example below, for instance, can't be seen as a linear problem.

I can draw the line closest to my points, but the nonlinearity of the problem means we can't approximate it with just a line.

In fact, linear regression doesn't only work with one variable X; it can take as many variables as we want, and it finds the best linear combination that approximates Y by Y′:

Y′ = a₁X₁ + a₂X₂ + … + aₙXₙ + b

but as always, we must be in the presence of a linear problem.
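To make this concrete, here is a minimal sketch of a multivariate linear regression using scikit-learn; the data and coefficients are made up for illustration, not taken from the article:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data: Y depends linearly on two inputs X1 and X2 (plus a little noise)
rng = np.random.default_rng(0)
X = rng.uniform(-5, 5, size=(100, 2))            # columns are X1 and X2
y = 3 * X[:, 0] - 2 * X[:, 1] + 1 + rng.normal(0, 0.1, 100)

model = LinearRegression()
model.fit(X, y)                                  # finds the best a1, a2 and b

print(model.coef_, model.intercept_)             # close to [3, -2] and 1
print(model.predict([[1.0, 2.0]]))               # predict Y for an unseen X
```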

Okay, but more concretely, what does a linear problem mean?

We say that Y is linear in X if, when X increases, Y globally increases or decreases linearly with X, but not both at the same time.

On the left, Y is linearly dependent on X because it fits what I said before.

In the right picture, while X is increasing, Y both decreases and increases; we are not in the presence of a linear problem.

If we trained a linear model between Y and X for the points in the right image, it would give really bad results. But there is a way to remedy this: add new variables that break the nonlinearity and train the linear model with those variables. Sounds a bit complicated? Let's see!

First, let's define the shape of the new variables we will create: a variable that takes the value 1 when a condition on the inputs holds, and 0 otherwise.

I will name these variables conditional variables (remember that name, because I will use it a lot). Here the condition can only be a linear combination of the input variables. For example, a condition of the form aX + b ≥ 0 is OK, and any linear condition can be rewritten in that form. But a condition involving X², such as X² ≥ c, is not OK, since it is quadratic in X.
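In code, a conditional variable of this shape could look like the following sketch (the coefficients and threshold here are illustrative, not the article's exact values):

```python
import numpy as np

def conditional(x, a, b):
    """Conditional variable: 1 where the linear condition a*x + b >= 0 holds, else 0."""
    return (a * x + b >= 0).astype(float)

x = np.array([-4.0, -1.0, 0.0, 2.0, 5.0])
print(conditional(x, 1.0, -3.0))   # 1 where x >= 3 -> [0. 0. 0. 0. 1.]
```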

Now that we have declared their shape, let's continue with our problem. We notice that we can divide our graph like this:

The graph has three parts: two parts (the two at the extremes) where Y takes high values, and the part in the middle where Y takes low values. So the idea here is to define a new variable that takes the value 1 where Y takes high values (where X is in one of the two extreme parts) and 0 where Y takes low values (where X is in the middle), like this:

To represent this mathematically, we need at least three conditional variables (with some threshold c):

V = 1 if −X ≥ c, else 0 — this variable takes the value 1 when X is in the extreme left part.

W = 1 if X ≥ c, else 0 — this variable takes the value 1 when X is in the extreme right part.

And finally:

Z = 1 if V + W ≥ 1, else 0. This variable takes the value 1 if V + W ≥ 1, which is true if either V or W equals 1 (so we are in one of the extreme parts), and 0 if we are in the middle.
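As a sketch, here is how V, W, and Z could be built; the threshold c = 2 is illustrative, since the exact values from the original figures are not shown:

```python
import numpy as np

x = np.linspace(-5, 5, 11)

c = 2.0                                   # illustrative threshold
V = (-x >= c).astype(float)               # 1 in the extreme left part (x <= -2)
W = (x >= c).astype(float)                # 1 in the extreme right part (x >= 2)
Z = (V + W >= 1).astype(float)            # 1 in either extreme part, 0 in the middle

print(Z)  # [1. 1. 1. 1. 0. 0. 0. 1. 1. 1. 1.]
```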

We can see that Z is roughly linear with Y (even though it only takes two values). Unfortunately, Z is not enough to fit my points correctly: it only defines which part I am in, not the intensity of Y within that part.

What I want to do is create another variable J that takes the value 1 if Y has a high value compared to the other points in the same part. For example, if Y is in the middle part (so it has a low value) but has a high value compared to the other points in the middle part (let's say 15), then J will be 1.

To represent this mathematically, I need four more conditional variables: A, B, H, and G (yeah, I know it's a lot; take your time to understand them).

J will be 1 if A, B, H, or G is 1, and each of them represents one of the four sub-parts described above. Now we train a linear regression using only the variables J and Z, and here is the result for some values of X:

It gives better results compared to a linear regression on the raw X. We could improve the results further with additional conditional variables, but I'll stop here. The important thing is that you understand the main idea: we can overcome the nonlinearity of the problem by creating variables of this shape (1 when a linear condition on the inputs holds, 0 otherwise),

and training a linear model with those variables. The more variables you add, the better you can fit your points.
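Here is a minimal end-to-end sketch of the whole idea, with illustrative features and thresholds (not the article's exact ones): a linear regression on the raw X fails on a Y ≈ X² problem, while the same model on hand-built conditional variables does much better:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = rng.uniform(-5, 5, 200)
y = x ** 2 + rng.normal(0, 0.5, 200)       # roughly the Y = X^2 problem

# A linear regression on the raw X can't fit a parabola
raw = LinearRegression().fit(x.reshape(-1, 1), y)

def conditional_features(x):
    # Each column is a conditional variable with a linear condition on X
    # (thresholds are illustrative)
    return np.column_stack([
        (-x >= 1).astype(float),   # left part
        (x >= 1).astype(float),    # right part
        (-x >= 3).astype(float),   # far-left part (high Y within its side)
        (x >= 3).astype(float),    # far-right part
    ])

feat = LinearRegression().fit(conditional_features(x), y)

print("R^2 on raw X:               ", raw.score(x.reshape(-1, 1), y))
print("R^2 on conditional features:", feat.score(conditional_features(x), y))
```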

But the problem here is that we created those variables manually. Imagine that you don't have one input X but many others; the best condition may be hard to find, and it may not involve just one variable. The problem above was very simple compared to problems encountered in real life, so how will we be able to find those variables?

I guess you have some idea about it... YES, the artificial neural network is a model that FINDS and CREATES those variables on its own!

ARTIFICIAL NEURAL NETWORK

A neural network is composed of layers, and each layer is composed of neurons. A neuron is a linear combination of the previous layer's neurons, to which we apply the sigmoid function. This is the expression of the j-th neuron in the L-th layer:

a_j^(L) = σ( Σ_i w_ij^(L) · a_i^(L−1) + b_j^(L) )

The sigmoid function has this expression:

σ(x) = 1 / (1 + e^(−x))
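As a quick sketch, the sigmoid and the value of one neuron can be computed like this (the weights, bias, and previous-layer values are made up):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Value of neuron j in layer L: the sigmoid of a linear combination
# of the previous layer's neurons (made-up weights and bias)
prev_layer = np.array([0.2, 0.9, 0.4])    # neurons of layer L-1
w = np.array([1.5, -2.0, 0.7])            # weights into neuron j
b = 0.1                                   # bias

neuron_j = sigmoid(np.dot(w, prev_layer) + b)
print(neuron_j)
```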

You still don't see the relation with the previous section? Hmm, let's continue.

This is the shape of the sigmoid function:

Most of the time, this function takes values very close to 1 or 0 (not exactly 1 or 0), and it takes values in between only over a small interval that is almost negligible. So we could in fact say that σ(x) ≈ 1 when x is clearly positive, and σ(x) ≈ 0 when x is clearly negative.

This is why we call it an activation function.

The neuron is the sigmoid applied to the linear combination, so in fact the value of the neuron is approximately 1 if the linear combination is above 0, and approximately 0 otherwise.
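We can check this saturation numerically; away from 0 the sigmoid is essentially a 0/1 switch (a small sketch with made-up inputs):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for v in [-10.0, -5.0, 0.0, 5.0, 10.0]:
    print(v, sigmoid(v))
# -10 -> ~0.00005, -5 -> ~0.0067, 0 -> 0.5, 5 -> ~0.9933, 10 -> ~0.99995
```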

And that is exactly the shape of the conditional variables I created before. In reality, each neuron of the second layer is a conditional variable whose condition is a linear combination of the first layer's neurons; the third layer's neurons are conditional variables of that shape too, whose conditions are linear combinations of the second layer's neurons, and so on...

The more neurons you add to a layer, the more new variables you create that can overcome nonlinearity; the more layers you add, the more those variables are combined to create more sophisticated variables.

In the first example we created the variables V, W, and Z. If we train a neural network, V and W will be created in the first hidden layer (the second layer), since they only need X, which is in the first layer (the input). But the variable Z can't be created in the first hidden layer, because it needs V and W, which are themselves in the first hidden layer; it will be created in the second hidden layer (so here we need at least two hidden layers).

The final layer has only one neuron, which is our target Y, and it is a linear combination of the neurons of the previous layer only (we don't apply the sigmoid function to the neuron of the last layer). We can see this last neuron as a linear regression over the conditional variables of the last hidden layer (even though it's not exactly that), exactly like what I did before.
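Putting it all together, here is a hedged sketch using scikit-learn's MLPRegressor: two hidden layers of sigmoid ("logistic") neurons, as discussed above, with a linear output neuron (which is MLPRegressor's default for regression). The hyperparameters are illustrative:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-5, 5, size=(500, 1))
y = X[:, 0] ** 2                          # the Y = X^2 problem from the article

# Two hidden layers of sigmoid neurons; the output neuron is a plain
# linear combination, exactly as described above.
net = MLPRegressor(hidden_layer_sizes=(16, 16), activation="logistic",
                   max_iter=5000, random_state=0)
net.fit(X, y)

print(net.predict([[3.0], [-3.0], [0.0]]))   # should be close to [9, 9, 0]
```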

The power of the artificial neural network is that it creates and finds the best conditional variables by itself (if it's well trained), because the shape of those variables is built into its model. They are found with the stochastic gradient descent algorithm in most cases, which needs some assumptions such as the differentiability of the activation function (in our example, the sigmoid); I will not cover that topic here.

Before I end, I have some important remarks to make:

  • The problem we solved is approximately Y = X², so you might ask why we did all this when we could simply have approximated it by regressing Y on X². The answer is that we want a generic model that approximates a problem without specifying the shape of the relationship manually; with a neural network you don't have to do that, it will figure it out on its own.
  • The decomposition and the shape of the conditional variables that I proposed are suggestions and not necessarily the best; it depends on which activation function you use (there are many others besides the sigmoid).
  • The forms of the conditions I used are simple; I used, for example, W + V ≥ 1 just to keep things from getting too complicated, but the neural network could have found, for example, 0.6W + 0.3V ≥ 0.834.

I hope I made it clear why neural networks work with nonlinear problems. If you have any questions or remarks, leave a comment :D


raian lefgoum

Computer science engineer, passionate about mathematics, interested in optimization and machine learning.