Greetings! This blog covers the math behind Linear Regression. It contains:
- Flowchart of Linear Regression
- Loss
- Optimizer
- Explanation
In the previous blogs, we showed how we can predict y with respect to X if X and y are correlated with each other. In this blog, we will discuss how the ML model decides the weights.
Flowchart of Linear Regression :
The main fundamental equation that linear regression uses is :
`y = m\times X + c`.
Reframing the fundamental equation in terms of weights:
`y = W0 \times X + W1` (where W0 and W1 are the weights)
X and y :
First, X and y are fed to the machine learning model, where X is the feature and y is our target.
Weights :
For the very first iteration, the weights are initialized automatically. They should be close to 0 and continuous (the weights are nothing but m and c, i.e. the slope and the constant in our case). In general, weights are the values combined with X to get the value of y.
Predict y :
After the weights are initialized, y is predicted with respect to them, and then the predicted y is compared with the true y to measure accuracy.
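The prediction step above can be sketched in a few lines of Python. This is a minimal illustration, not a library implementation; the names `predict_y`, `W0`, and `W1` are chosen here for clarity and match the reframed equation `y = W0 * X + W1`.

```python
import numpy as np

def predict_y(X, W0, W1):
    """Predict y = W0 * X + W1 for the current weights (illustrative sketch)."""
    return W0 * X + W1

X = np.array([1.0, 2.0, 3.0, 4.0])
# Weights start near 0 on the first iteration, then get updated by the optimizer.
y_pred = predict_y(X, W0=0.5, W1=0.1)
```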
Loss ( Error, Residual ) :
There are various loss functions available: mean squared error (MSE), root mean squared error (RMSE), mean absolute error (MAE), etc. Linear regression commonly uses the halved mean squared error (all loss functions will be explained in further blogs). The formula for this loss is:
`loss = \frac{\sum(y - \hat{y})^{2}}{2m}`.
Where,
- y is the true value of y
- `\hat{y}` is the predicted value of y
- m is the number of data points (rows or observations) in the dataset.
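The loss formula above translates directly into code. This is a small sketch; `compute_loss` is an illustrative name, not a function from any particular library.

```python
import numpy as np

def compute_loss(y_true, y_pred):
    """Halved mean squared error: sum((y - y_hat)^2) / (2m)."""
    m = len(y_true)                          # number of observations
    return np.sum((y_true - y_pred) ** 2) / (2 * m)

y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.5, 2.0, 2.5])
loss = compute_loss(y_true, y_pred)          # (0.25 + 0 + 0.25) / 6
```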
Optimizer :
After calculating the loss, we have to make sure it is at its minimum. If the loss is not minimal, we have to update the weights so that the loss is lower next time. The process of finding the weights that give the lowest possible loss is known as optimization. The optimizer that linear regression uses is gradient descent. The equation of gradient descent is:
Iterate this equation until it converges (until we reach the global optimum, i.e. the lowest possible loss):
Wnew `= Wold - \eta \frac{\partial loss}{\partial Wold} ` (where `\eta` is learning rate )
As presented in fig 1, this process repeats for a specific number of iterations (the loop count), which the user is supposed to define.
Detailed Explanation:
Multiple lines can be drawn through any particular dataset, but only one can be the best-fit line (the line that gives the least possible error). Hence, the main goal is to find the weights of the line which gives the least possible error. Let's consider c = 0 for simplicity. Refer to the visualization.
| Figs 2, 3, 4, 5. Candidate lines with different weights |
As you can see in figs 2, 3, 4 and 5, each line gives us a different loss, but we want the weights that give the lowest loss. Hence, let's plot the graph of Loss vs weight. This graph will show us the weight that gives the lowest possible loss, and the line with that weight is the best-fit line. The graph between weights and loss is parabolic because the loss function contains a power of 2.
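We can see this parabolic shape numerically. The sketch below (with illustrative names, and assuming the toy relationship y = 2X with c = 0) evaluates the loss at a range of weights; the loss is zero at the true weight and grows symmetrically on both sides.

```python
import numpy as np

X = np.array([1.0, 2.0, 3.0])
y = 2.0 * X                                  # true relationship: m = 2, c = 0

def loss_for_weight(m):
    """Halved MSE of the line y = m * X against the data."""
    y_pred = m * X
    return np.sum((y - y_pred) ** 2) / (2 * len(X))

weights = np.linspace(0.0, 4.0, 9)
losses = [loss_for_weight(m) for m in weights]
# Smallest loss at m = 2; it grows on both sides, tracing a parabola.
```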
| Fig 6. Weights vs Loss |
There can be some confusion about this graph. If you focus on the axes, you will see that each of the previous graphs (figs 2, 3, 4, 5) contains one line, and that line's weight is mapped to a point in the weights vs loss graph. Multiple lines can be drawn through the data points; each line has its own weight and gives some loss with respect to the data (high or low). Hence this graph contains all those lines (weights, i.e. m) and their losses. Refer to the line colors: the same color is given to a line and to its point in the weights vs loss graph. As we can observe, the green point is close to the optimum and the green line fits the data very well. At the same time, the blue point is near the maximum, and you can see the blue line does not fit the data (it has a huge amount of error).

This concludes that we need to find the optimum, i.e. the lowest loss value in the weights vs loss graph, to get the weights of our best-fit line. Finding the optimum value of such a convex function is done with the help of an optimizer. Linear regression uses gradient descent as its optimizer. The main aim is to find the global optimum; the global optimum means the lowest loss.
Detailed Explanation of Gradient descent :
The main aim of gradient descent is to provide the weights which give the lowest loss. This is possible if we reach the global optimum, i.e. the weights with the lowest loss. As concluded in fig 6, we will get a parabolic convex function every time. Gradient descent is iterative, meaning we have to repeat its update equation:
while (converge)
{
Wnew `= Wold - \eta \frac{\partial loss}{\partial Wold} `
}
where,
- ` \eta ` is the learning rate
- `\frac{\partial loss}{\partial Wold}` is the step value; it is also known as the gradient, i.e. the slope.
- Wold is the previous weights.
- Wnew is the updated weights.
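A single update of this equation can be sketched as follows. This assumes the toy data y = 2X with c = 0 and the halved-MSE loss from earlier; `gradient_step` is an illustrative name, and the gradient is the analytic derivative of that loss with respect to the weight.

```python
import numpy as np

X = np.array([1.0, 2.0, 3.0])
y = 2.0 * X                                   # true weight is 2

def gradient_step(w_old, eta):
    """One gradient-descent update: Wnew = Wold - eta * d(loss)/dWold."""
    y_pred = w_old * X
    # Derivative of sum((y - y_pred)^2) / (2m) with respect to w
    grad = np.sum((y_pred - y) * X) / len(X)
    return w_old - eta * grad

w = 0.0
w = gradient_step(w, eta=0.1)                 # w moves from 0 toward 2
```

Repeating this step is exactly the `while (converge)` loop above: each iteration nudges the weight toward the bottom of the parabola.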
Why are we not checking all the points of the weights vs loss graph for optimization?
This could be done by evaluating every candidate weight and selecting the one with the minimum loss. But this is not the best possible solution because it takes far too much time, and hence the check-every-point technique is discarded.
Why are we not taking constant steps for optimization?
We cannot select constant steps because then we have only 2 options: either small steps or big steps.
| Figs 7, 8. Vanishing and exploding gradient problems |
If we take small steps, it will take a very long time to converge, i.e. to reach the weights with the lowest possible loss; this problem is known as the vanishing gradient problem. If we take big steps, the global optimum can be skipped and the loss at our pointer (the current weight value) will shoot up, eventually giving us weights with high loss; this problem is known as the exploding gradient problem. Refer to figs 7 and 8 to visualize what the vanishing and exploding gradient problems look like.
Hence, with constant steps, optimization is either time-consuming or it can skip the global optimum. Gradient descent provides a better solution: if the loss is high, the step size should be big, and if the loss is small, the step size should be small, so that the global optimum can be achieved. This is how the movement of the pointer (the current weight value) in gradient descent behaves.

Hence we can say that the step size should be proportional to the loss. A first draft of the equation would be:
Wnew `= Wold - \frac{loss}{Wold} `
Where,
`\frac{loss}{Wold}` is the step-size value.
Role of Direction
Up till now, we have seen how to take a step, but the direction of that step is still missing. Direction plays an important role because if the global optimum is behind the pointer, it would otherwise be ignored. The pointer should be able to move in both directions, so that if the global optimum is skipped for any reason, the pointer can come back to reach it. This is resolved by replacing the step size with a derivative. After this change, the step acts like a gradient, i.e. a slope, which can be positive or negative. Hence, the updated equation will be:
Wnew `= Wold - ( \frac{\partial loss}{\partial Wold}) `
| Fig 10. Gradient |
Suppose our pointer is on the left side of the global optimum. The gradient, i.e. the slope, will be negative, thus the equation becomes:
Wnew `= Wold - ( - \frac{\partial loss}{\partial Wold}) ` (the gradient is negative)
Wnew `= Wold + \frac{\partial loss}{\partial Wold} ` (the value of the weights increases)
and the new weight moves closer to the global optimum; refer to fig 10 for visualization.
Similarly, if our pointer is on the right side of the global optimum, the gradient, i.e. the slope, will be positive. Thus the equation becomes:
Wnew `= Wold - ( + \frac{\partial loss}{\partial Wold}) ` (the gradient is positive)
Wnew `= Wold - \frac{\partial loss}{\partial Wold} ` (the value of the weights decreases)
and the new weights move closer to the global optimum; refer to fig 10 for visualization.
We have to calculate the slope, and this is possible only if we differentiate the loss with respect to the weights. Hence the equation will be:
Wnew `= Wold - \frac{\partial loss}{\partial Wold} `
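The two cases above can be checked on a toy parabola. This sketch assumes the illustrative loss (w - 2)^2 with its optimum at w = 2 and a fixed scaling factor of 0.25 (names are hypothetical, chosen just for this demonstration).

```python
def gradient(w):
    """Derivative of the toy loss (w - 2)^2, whose optimum is at w = 2."""
    return 2.0 * (w - 2.0)

eta = 0.25

# Left of the optimum: gradient is negative, so subtracting it increases w.
w = 1.0
w = w - eta * gradient(w)      # 1.0 - 0.25 * (-2.0) = 1.5, closer to 2

# Right of the optimum: gradient is positive, so subtracting it decreases w.
v = 3.0
v = v - eta * gradient(v)      # 3.0 - 0.25 * (+2.0) = 2.5, closer to 2
```

In both cases the sign of the gradient automatically points the step toward the global optimum, which is exactly the "role of direction" described above.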
Learning rate as our hyperparameter :
If the loss scale is too high for the dataset, the step will be so huge that it skips the global optimum; if the loss is too low, it can take a very long time to converge, i.e. the exploding and vanishing gradient problems. Hence, to avoid such cases, we need to scale the step size, and the learning rate is used for this scaling. The learning rate is decided at runtime by the developer or the user because it can be different for different datasets; that is why the learning rate is known as a hyperparameter. The learning rate multiplies our step value. Thus the main equation, which handles all the cases and exceptions and gives the global optimum, will be:
Wnew `= Wold - \eta \frac{\partial loss}{\partial Wold} `
We have to iterate this equation multiple times to reach the global optimum.
while(converge)
{
Wnew `= Wold - \eta \frac{\partial loss}{\partial Wold} `
}
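Putting everything together, the full loop can be sketched for the line y = W0*X + W1 with the halved-MSE loss from earlier. All names here (`fit`, `eta`, `n_iters`) are illustrative, and a fixed iteration count stands in for the convergence check.

```python
import numpy as np

def fit(X, y, eta=0.05, n_iters=2000):
    """Gradient descent for y = W0*X + W1 with halved-MSE loss (sketch)."""
    W0, W1 = 0.0, 0.0                        # weights start near 0
    m = len(X)
    for _ in range(n_iters):                 # stands in for while (converge)
        y_pred = W0 * X + W1
        grad_W0 = np.sum((y_pred - y) * X) / m   # d(loss)/dW0
        grad_W1 = np.sum(y_pred - y) / m         # d(loss)/dW1
        W0 -= eta * grad_W0                  # Wnew = Wold - eta * gradient
        W1 -= eta * grad_W1
    return W0, W1

X = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 3.0 * X + 1.0                            # true weights: W0 = 3, W1 = 1
W0, W1 = fit(X, y)                           # converges near (3, 1)
```

With a reasonable learning rate, the recovered weights land very close to the true slope and intercept; a much larger `eta` would make the loss shoot up instead, as described in the exploding-gradient case.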
In the next blog, we will walk through one example in which all of these steps are covered, and you will visualize every step at runtime.
Summary :
- `y = m\times X + c` is the equation of line
- `loss = \frac{\sum(y - \hat{y})^{2}}{2m}`
- Wnew `= Wold - \eta \frac{\partial loss}{\partial Wold} ` is the equation for gradient descent i.e optimizer.
