Greetings! In the previous blog, we discussed various advanced linear regression approaches that can be used to solve real-life problems. This blog contains the required information regarding:
- Underfitting and overfitting
- Ridge, LASSO, and Elastic Net regression.
Drawbacks of Multiple or Polynomial regression :
The equations we derived in the previous blogs are:
- multiple linear regression equation `y = W^\mathsf{T} \times X`
- Polynomial linear regression equation `y = W_1 X + W_2 X^2 + \dots + W_n X^n`
- multiple polynomial linear regression (multiple + polynomial linear regression) equation is
`y = W_{1,1} X_1 + W_{1,2} X_1^2 + \dots + W_{1,n} X_1^n`
`+ W_{2,1} X_2 + W_{2,2} X_2^2 + \dots + W_{2,n} X_2^n`
`+ \dots`
`+ W_{m,1} X_m + W_{m,2} X_m^2 + \dots + W_{m,n} X_m^n`
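To make this combined equation concrete, here is a minimal NumPy sketch (the function name and array shapes are illustrative, not taken from any library) that evaluates the multiple polynomial hypothesis for a single sample:

```python
import numpy as np

def multiple_polynomial_predict(W, x, n):
    """Evaluate y = sum over features j and powers k of W[j, k] * x[j]**(k+1).

    W : (m, n) weights, one row per feature, one column per power 1..n
    x : (m,)   feature values of a single sample
    n : highest power applied to every feature
    """
    powers = np.arange(1, n + 1)        # exponents 1 .. n
    X_poly = x[:, None] ** powers       # shape (m, n): every feature raised to every power
    return float(np.sum(W * X_poly))    # weighted sum over all features and powers

# Tiny illustration: m = 2 features, powers up to n = 3
W = np.array([[0.5, 0.10, 0.01],
              [1.0, -0.20, 0.03]])
x = np.array([2.0, 3.0])
print(multiple_polynomial_predict(W, x, n=3))
```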
These equations are used to solve real-life problems, but at the same time they come with drawbacks as well.
The main problems are overfitting and underfitting, and their effect is different in the two algorithms:
- Polynomial regression
- Multiple regression
Underfitting :
Underfitting means too little training. In simple terms, the training of the model has not been completed properly. Underfitting occurs when the model is trained with too few loops (too few iterations of the gradient descent equation). With respect to human intelligence, i.e. the school example, if a student still finds it difficult to solve a problem, the corresponding model is considered underfitted. Underfitting always occurs at the beginning of training. In an underfitted model we can observe that the hypothesis (line) does not fit the data properly and the loss value is high; it can be recognized when the model accuracy is too low and the loss is too high. It is also considered bad practice to use an underfitted model.
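As a rough illustration (the data and iteration count below are made up, not taken from the figures), the sketch stops gradient descent after only a few iterations, so the weight is still far from its true value and the loss stays high:

```python
import numpy as np

# Synthetic data following y = 3x + noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=100)
y = 3 * X + rng.normal(0, 1, size=100)

w, lr = 0.0, 0.001
for _ in range(3):                       # far too few iterations -> underfitting
    grad = np.mean((w * X - y) * X)      # gradient of the (1/2m) * sum((y_hat - y)^2) loss
    w -= lr * grad

loss = np.mean((y - w * X) ** 2) / 2
print(f"weight = {w:.3f}, loss = {loss:.2f}")   # weight is still far from 3, loss is high
```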
Expected model :
The best-fit line in multiple linear regression should be a straight line, and for polynomial regression it should be a smooth polynomial curve.
[Figure: expected best-fit lines for multiple linear regression and polynomial regression]
The line should not pass through every single data point; a little loss is fine. The hypothesis should simply follow a proper linear or polynomial function and, with respect to that, fit the data.
Overfitting:
Overfitting, in simple terms, means that your model gives very good results on the training dataset but fails drastically on the test data. Overfitting occurs if we run more loops (gradient descent iterations) than required for training. With respect to human intelligence, it is like memorizing something without understanding the core concept: at test time the model expects the exact values from the training dataset in order to give proper results.
Polynomial Regression :
In polynomial regression, this type of hypothesis (line) is possible because the equation contains powered terms (refer to the polynomial and multiple polynomial equations above). In fig 1, the zig-zag hypothesis (line) will not be able to predict correctly like a normal polynomial regression: the overfitted line jumps from one training point to another, but it does not know where to "jump" for a test case, because in testing we only have the values of X and not y. Hence, if the training accuracy is much higher than the testing accuracy, or the training accuracy is 100%, the model is considered overfitted. It is bad practice to use an overfitted model. The simple fact is: if the data points do not lie perfectly on a line or curve, then a loss of 0 can only mean the model is overfitted.
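A rough scikit-learn sketch of this behaviour (assuming scikit-learn is available; the degree and data are made up for illustration) fits a very high-degree polynomial and compares the training and test scores:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(30, 1))
y = X[:, 0] ** 2 + rng.normal(0, 1, size=30)   # true relation is quadratic plus noise

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

# Degree 15 is far more flexible than the data requires
model = make_pipeline(PolynomialFeatures(degree=15), LinearRegression())
model.fit(X_tr, y_tr)

print("train R^2:", model.score(X_tr, y_tr))   # close to 1.0 (the model memorizes the training points)
print("test  R^2:", model.score(X_te, y_te))   # much lower -> overfitting
```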
Multiple Linear Regression :
Overfitting in multiple linear regression is a little different because its equation does not contain powered terms. Overfitting in multiple linear regression, or even in linear regression with one feature, is a concern with the value of the weights. If the value of a weight is too big, the model is considered overfitted. The reason is that as the weight grows, a small change in X brings a huge change in y, so the error rate of an overfitted model on new data goes high.
[Figures 9 and 10: two best-fit lines (green and red) with similar loss but very different weight values]
With reference to figures 9 and 10: the loss of the green line and the red line is not very different, but the value of the weights is a lot higher for one of them. The reason higher weights are undesirable is that a small change in X will cause a large change in y. The visualization is given below:
[Figures 11 and 12: the same change in X producing different changes in y for small and large weights]
In figs 11 and 12, the change in X is the same, but the change in y is larger because of the larger weights. Hence it is always recommended to keep the weights small, because it keeps the model stable and the loss on unseen data low. Overfitting in a linear equation means a high value of the weights due to overtraining of the model.
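A tiny numeric check of this point (the numbers are arbitrary) shows how the same change in X is amplified by a larger weight:

```python
delta_x = 0.1                      # the same small change in the input
for w in (2.0, 50.0):              # small weight vs. large weight
    delta_y = w * delta_x          # in y = w * x, any change in x is scaled by w
    print(f"w = {w}: change in y = {delta_y}")
# w = 2.0: change in y = 0.2
# w = 50.0: change in y = 5.0
```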
For multiple polynomial regression, both the multiple regression and the polynomial regression forms of overfitting apply.
Solutions to Overfitting :
Polynomial Regression :
For polynomial regression, to avoid overfitting, treat the power (degree) of the polynomial as a hyper-parameter and find its value by trial and error.
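One possible way to implement this trial-and-error search with scikit-learn (a sketch under the assumption that a cross-validated score is used to compare degrees; the data here is synthetic):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)
X = rng.uniform(-3, 3, size=(60, 1))
y = X[:, 0] ** 3 - 2 * X[:, 0] + rng.normal(0, 1, size=60)   # cubic relation plus noise

best_degree, best_score = None, -np.inf
for degree in range(1, 11):                                  # trial and error over the power
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    score = cross_val_score(model, X, y, cv=5).mean()        # validation score, not training score
    if score > best_score:
        best_degree, best_score = degree, score

print("chosen degree:", best_degree)
```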
Multiple Regression :
For multiple regression, regularization is used to avoid overfitting.
Regularization :
Regularization is the concept that makes sure your model does not overfit due to overtraining. It penalizes the value of the weights. In simple terms, even if the actual (data) loss is low, the resulting loss will be high if the weights are high; similarly, if the actual loss is a little higher but the weights are small, the resulting loss will be smaller. There are three special types of regression that use the regularization technique:
- Ridge Regression
- LASSO Regression (Least Absolute Shrinkage and Selection Operator)
- Elastic Net Regression
The entire linear regression process stays the same, but the loss equation is modified. In simple terms, the loss equation is changed so that the weights stay small for the best-fit line.
Ridge Regression :
`Loss = (`normal loss function`) + (`regularizer`)`
Normal loss function ` = \frac{\sum (y - \hat{y})^{2}}{2m}`
Regularizer ` = \sum W^{2}`
`Loss = \frac{\sum (y - \hat{y})^{2}}{2m} + \lambda \sum W^{2}`
Where,
`\lambda` is the hyper-parameter.
We can control how strongly the weights are penalized with the help of this hyper-parameter. If the weights are high, the squares of those weights are added to the loss function, so the loss goes higher and gradient descent (the optimizer) will not return those weights as an output (because the loss is high). This algorithm penalizes the weights strongly because the regularizer is a square function of the weights.
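A minimal NumPy sketch of this loss (matching the equation above; the numbers are made up) shows how the same predictions produce a much higher loss when the weights are large:

```python
import numpy as np

def ridge_loss(y, y_hat, W, lam):
    """Data loss (sum of squared errors over 2m) plus lambda times the sum of squared weights."""
    m = len(y)
    data_loss = np.sum((y - y_hat) ** 2) / (2 * m)
    return data_loss + lam * np.sum(W ** 2)

y       = np.array([1.0, 2.0, 3.0])
y_hat   = np.array([1.1, 1.9, 3.2])          # identical predictions in both cases
small_W = np.array([0.5, 1.0])
large_W = np.array([5.0, 10.0])

print(ridge_loss(y, y_hat, small_W, lam=0.1))   # small penalty added
print(ridge_loss(y, y_hat, large_W, lam=0.1))   # much larger loss purely because of the big weights
```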
LASSO Regression (Least Absolute Shrinkage and Selection Operator) :
`Loss = (`normal loss function`) + (`regularizer`)`
Normal loss function ` = \frac{\sum (y - \hat{y})^{2}}{2m}`
Regularizer ` = \sum |W|`
`Loss = \frac{\sum (y - \hat{y})^{2}}{2m} + \lambda \sum |W|`
Where,
`\lambda` is the hyper-parameter.
We can control how strongly the weights are penalized with the help of this hyper-parameter. If the weights are high, the absolute values of those weights are added to the loss function, so the loss goes higher and gradient descent (the optimizer) will not return those weights as an output (because the loss is high). This algorithm penalizes the weights a little less than Ridge regression because the power of the regularizer is lower (absolute value instead of square).
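For comparison, a short scikit-learn sketch (assuming scikit-learn is available; its `alpha` argument plays the role of `\lambda` here) shows how LASSO pushes the weights of useless features exactly to zero while Ridge only shrinks them:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 5))
y = 3 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 0.1, size=100)   # only the first 2 features matter

lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=0.1).fit(X, y)

print("LASSO weights:", np.round(lasso.coef_, 3))   # weights of useless features become exactly 0
print("Ridge weights:", np.round(ridge.coef_, 3))   # weights shrink but usually stay non-zero
```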
Elastic Net Regression :
Elastic net regression is a combination of Ridge and LASSO regression.
Loss = Normal loss + LASSO regularizer + Ridge regularizer
`Loss = \frac{\sum (y - \hat{y})^{2}}{2m} + \alpha \lambda \sum |W| + \frac{1-\alpha}{2} \lambda \sum W^{2}`
where,
`\alpha` is a hyper-parameter that gives an option to set proportions to both ridge and LASSO.
`\lambda` is a hyper-parameter, discussed in ridge and LASSO regression.
If we want Ridge regression to play a bigger role in elastic net, the value of `\alpha` should be smaller; if we want LASSO regression to play a bigger role, `\alpha` should be larger. The ridge regularizer is divided by 2 because the power of ridge regularization is higher, so the division by 2 helps keep the two regularizers balanced.
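A short scikit-learn sketch (assuming scikit-learn is available; note that its `alpha` argument corresponds roughly to `\lambda` and `l1_ratio` to the `\alpha` used above, so the naming differs from the equation here):

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 5))
y = 3 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 0.1, size=100)

# l1_ratio = 1.0 behaves like pure LASSO, l1_ratio = 0.0 behaves like pure Ridge
model = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
print(np.round(model.coef_, 3))
```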
Summary :
- Underfitting means your model requires more training.
- Overfitting means your model is overtrained.
- To avoid overfitting, we use regularization.
- Linear regression with regularizers gives Ridge, LASSO, and Elastic Net regression.