Linear Regression (VI) ML - 8

        This blog contains the information required to understand some important assumptions, evaluation metrics, and other important details regarding linear regression. In the previous blogs, the types of linear regression and the working of linear regression were discussed. This blog focuses on:

  1. Evaluation metrics. 
  2. Assumptions.
  3. Outliers

Evaluation metrics : 

        Evaluation metrics are used to validate AI models. Continuing the student example, they are like exams: to test the AI model we need unseen questions and answers, i.e. X test and y test. The AI model takes X test as input and gives y predict as output (the working is discussed in the previous blogs). By comparing y predict (the predicted answers) with y test (the true answers), we can judge whether the AI model is properly trained or not. This comparison between the predicted results and the true results is done with various evaluation metrics.
        The evaluation metrics used to evaluate linear regression models are:
  1. Mean Squared Error (MSE).
  2. Mean Absolute Error (MAE).
  3. Root Mean Squared Error (RMSE).
  4. `R^{2}`.
  5. Adjusted `R^{2}`.
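
As an illustration of this process, here is a minimal scikit-learn sketch (not from the original post; the synthetic dataset and variable names are assumptions made for the demo): a model is trained, y predict is obtained for X test, and the metrics listed above are computed against y test.

```python
# Minimal illustrative sketch: train a linear regression model, predict on a
# held-out test set, and compare y predict with y test using common metrics.
# The synthetic dataset below is an assumption made only for this demo.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))              # one illustrative feature
y = 3.0 * X[:, 0] + 2.0 + rng.normal(0, 1.0, 100)  # linear target with noise

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)                     # "y predict" from the blog

mse = mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mse)                                # RMSE = square root of MSE
r2 = r2_score(y_test, y_pred)
print(f"MSE={mse:.3f}  MAE={mae:.3f}  RMSE={rmse:.3f}  R^2={r2:.3f}")
```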

Mean Squared Error (MSE):

`MSE = \frac{\sum (y - \hat{y})^2}{n}`

where,
y = True value or Actual value of y.
ŷ = Predicted value of y.
n = number of the data samples.
y - ŷ = residual.

        The main logic of this equation is to take the mean of all the residuals. But a residual can be negative if ŷ is greater than y. If some residuals are negative, the summation of the residuals decreases and the error appears smaller than it really is. To avoid this, the equation squares the residuals, which removes the negative sign and gives the actual error.

 Example : 

y - ŷ = 2, 3, -1, 5

n = 4, because there are 4 data points.

`MSE = \frac{\sum (y - \hat{y})^2}{n}`

`MSE = \frac{2^{2} + 3^{2} + (-1)^{2} + 5^{2}}{4}`

`MSE = \frac{4 + 9 + 1 + 25}{4}`

`MSE = \frac{39}{4}`

`MSE = 9.75`
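
The same arithmetic can be verified with a tiny NumPy sketch (illustrative only; the residuals are the ones from the example above):

```python
# Illustrative check of the MSE arithmetic above, using the example residuals.
import numpy as np

residuals = np.array([2, 3, -1, 5])   # y - y_hat from the example
mse = np.mean(residuals ** 2)         # (4 + 9 + 1 + 25) / 4
print(mse)                            # 9.75
```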

Mean Absolute Error (MAE):


`MAE = \frac{\sum |y - \hat{y}|}{n}`

where,
y = True value or Actual value of y.
ŷ = Predicted value of y.
n = number of the data samples.
y - ŷ = residual.

        The logic of MSE and MAE is the same. The main difference between MSE and MAE is that MAE eliminates the negative sign by taking the absolute value, while MSE eliminates it by squaring. MSE is more sensitive to large residuals than MAE (a small increase in a residual can cause a large increase in MSE).

Example : 

y - ŷ = 2, 3, -1, 5

n = 4, because there are 4 data points.

`MAE = \frac{\sum |y - \hat{y}|}{n}`

`MAE = \frac{|2| + |3| + |-1| + |5|}{4}`

`MAE = \frac{2 + 3 + 1 + 5}{4}`

`MAE = \frac{11}{4}`

`MAE = 2.75`
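
Again, a tiny NumPy sketch (illustrative only) confirms the arithmetic:

```python
# Illustrative check of the MAE arithmetic above, using the example residuals.
import numpy as np

residuals = np.array([2, 3, -1, 5])   # y - y_hat from the example
mae = np.mean(np.abs(residuals))      # (2 + 3 + 1 + 5) / 4
print(mae)                            # 2.75
```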

`R^{2}` Statistics :


`R^2 = \frac{TSS - RSS}{TSS}`

`R^2 = 1 - \frac{RSS}{TSS}`

Where, 

TSS = Total Sum of Squares = `\sum (y - \bar{y})^2`, where ȳ is the mean of y.
RSS = Residual Sum of Squares = `\sum (y - \hat{y})^2`

`R^2` tells how much of the variation in y is explained by the model, i.e. how well X explains y.
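
As a hedged illustration, the formula above can be computed directly from TSS and RSS; the y values below are made up for the demo:

```python
# Illustrative sketch: computing R^2 from TSS and RSS as in the formula above.
# The true and predicted values are made-up example numbers.
import numpy as np

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.8, 5.3, 6.9, 9.4])

tss = np.sum((y_true - y_true.mean()) ** 2)   # Total Sum of Squares
rss = np.sum((y_true - y_pred) ** 2)          # Residual Sum of Squares
r2 = 1 - rss / tss
print(r2)
```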


Adjusted `R^{2}` Statistics :

There can be some features that do not contribute to forming the line. Even then, `R^2` will be slightly higher when there are more features. This is an issue, because the value of `R^2` increases just by adding features, even if those features are not contributing. The proof of this statement is below.

Suppose there are 2 equations, one with fewer features and one with more features.

Equation 1 : `y = w_0 + w_1 \times x_1`
Equation 2 : `y = w_0 + w_1 \times x_1 + w_2 \times x_2 + w_3 \times x_3`

RSS1 : `\sum (y - (w_0 + w_1 \times x_1))^2`
RSS2 : `\sum (y - (w_0 + w_1 \times x_1 + w_2 \times x_2 + w_3 \times x_3))^2`

Because of the greater number of features, the residuals of equation 2 will be a little smaller than those of equation 1. Due to the square, this difference becomes considerable. Hence, the RSS value of equation 2 will be less than that of equation 1.

RSS1 > RSS2. 

`R^2  = 1 - \frac{RSS}{TSS}`

`R^2_1 = 1 - \frac{RSS_1}{TSS}`

`R^2_2 = 1 - \frac{RSS_2}{TSS}`

(TSS depends only on y, so it is the same for both equations.)

RSS1 > RSS2. 

Therefore, `R^2_2` > `R^2_1`.

To overcome this issue, models with more features should be penalized for the extra features. Hence, adjusted `R^2` came into the picture. The equation of adjusted `R^2` is:

Adjusted  `R^2  = 1 - \frac{(1-R^2)(N-1)}{N-P-1}`

Where, 

`R^2  = 1 - \frac{RSS}{TSS}`

N = Number of data points
P = Number of features. 
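
To see the penalty in action, here is an illustrative sketch (the synthetic data, the noise features, and the helper function are assumptions for the demo, not the blog's data). One model uses only the useful feature x1, the other adds two noise features x2 and x3; `R^2` creeps up with the extra features while adjusted `R^2` penalizes them:

```python
# Illustrative sketch: R^2 vs adjusted R^2 when irrelevant features are added.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
n = 100
x1 = rng.uniform(0, 10, n)
y = 4.0 * x1 + 1.0 + rng.normal(0, 2.0, n)   # y depends only on x1
noise = rng.normal(size=(n, 2))              # x2, x3 carry no real signal

def r2_and_adjusted(X, y):
    """Fit OLS and return (R^2, adjusted R^2) using the formula above."""
    model = LinearRegression().fit(X, y)
    r2 = model.score(X, y)                   # R^2 on the same data
    N, P = X.shape                           # number of data points, features
    adj = 1 - (1 - r2) * (N - 1) / (N - P - 1)
    return r2, adj

X_small = x1.reshape(-1, 1)                  # equation 1: x1 only
X_big = np.column_stack([x1, noise])         # equation 2: x1, x2, x3

print(r2_and_adjusted(X_small, y))           # baseline
print(r2_and_adjusted(X_big, y))             # R^2 slightly higher, adjusted R^2 penalized
```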


Assumptions of Linear Regression : 

        There are some assumptions that a data scientist expects to hold before applying linear regression to any data. The assumptions of linear regression are:

  1. In linear regression, the line, i.e. the hypothesis, depends on the weights and residuals.
  2. Residuals are not correlated with each other.
  3. Independent variables are not correlated with the residuals. This is known as exogeneity.
  4. Residual has a constant variance. This is known as homoscedasticity.
  5. The summation of all residuals is zero. 
  6. No Multi-collinearity.
  7. Weights are normally distributed. 

Line depends on weights and residual.

        After the model is trained, proper weights with low loss are set by the gradient descent optimizer. Gradient descent sets the weights with respect to the loss, i.e. the residuals. Hence, the line is formed with respect to those weights, and it is said that the line depends on the weights and residuals.
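
As a hedged illustration of this idea (the data, learning rate, and iteration count are assumptions for the demo), here is a tiny gradient descent loop where the residuals directly drive the updates to the weights that form the line:

```python
# Illustrative sketch: gradient descent adjusting weights w0, w1 to reduce the
# squared residuals, showing how the final line depends on the residuals.
import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(0, 10, 100)
y = 3.0 * x + 2.0 + rng.normal(0, 1.0, 100)

w0, w1, lr = 0.0, 0.0, 0.01
for _ in range(2000):
    y_hat = w0 + w1 * x
    residuals = y - y_hat                  # residuals drive the updates
    w0 += lr * residuals.mean()            # gradient step for the intercept
    w1 += lr * (residuals * x).mean()      # gradient step for the slope
print(w0, w1)                              # close to the true 2.0 and 3.0
```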

Residuals are not correlated with each other. 

        The residuals generated during model training are not correlated with each other.
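
One rough, illustrative way to check this assumption (the residual values below are made up) is the lag-1 correlation of the residuals; a value near zero suggests they are not correlated with each other:

```python
# Illustrative sketch: lag-1 autocorrelation check on a made-up residual series.
import numpy as np

residuals = np.array([0.5, -1.2, 0.8, -0.3, 1.1, -0.9, 0.2, -0.4])
lag1 = np.corrcoef(residuals[:-1], residuals[1:])[0, 1]
print(lag1)   # close to 0 -> residuals look uncorrelated
```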

Independent variables are not correlated with the residuals.

        The values of the features, i.e. the independent variables, should not be correlated with the residuals. The features should be related to the target variable, while the residual, the part of the target left unexplained, should not carry any information about the features. The independent variables themselves can take any values depending on the dataset.

Residual has a constant variance. 


        The residuals generated by the model should have a constant variance. This means the predictions deviate from the actual data points by roughly the same magnitude everywhere; only the sign of the residual changes, being either positive or negative. This is because the line is the best-fit line with respect to the dataset.
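
A simple, illustrative check for this (the arrays below are made-up examples) is to sort the residuals by the predicted value and compare their spread in different regions; roughly equal spread suggests homoscedasticity:

```python
# Illustrative sketch: compare residual spread across two halves of the
# predictions. Similar standard deviations suggest constant variance.
import numpy as np

y_pred = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])          # made up
residuals = np.array([0.3, -0.4, 0.2, -0.1, 0.5, -0.3, 0.4, -0.2])   # made up

order = np.argsort(y_pred)                 # sort residuals by prediction
low, high = np.array_split(residuals[order], 2)
print(low.std(), high.std())               # similar spread -> homoscedastic
```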

The summation of all residuals is zero. 

           Some residuals are positive and some are negative. If the line is the best-fit line, then the summation of all the residuals will be approximately zero, because the positive and negative residuals cancel each other out.
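
As a quick illustration (the synthetic data is an assumption for the demo), fitting an ordinary least squares line with an intercept and summing its training residuals gives a value that is approximately zero:

```python
# Illustrative sketch: with an intercept, OLS residuals sum to roughly zero.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
X = rng.uniform(0, 10, size=(50, 1))
y = 2.0 * X[:, 0] + 5.0 + rng.normal(0, 1.0, 50)

model = LinearRegression().fit(X, y)
residuals = y - model.predict(X)
print(residuals.sum())                     # ~0 up to floating-point error
```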

No Multi-collinearity. 

            Features, i.e. independent variables, should not give the same information; in other words, there should not be strong correlation between the independent features. If the independent features are correlated with each other, the features are said to have multicollinearity.
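
A common, simple check, sketched below with made-up data, is to look at the correlation matrix of the features; here x2 is deliberately built from x1 to show what multicollinearity looks like:

```python
# Illustrative sketch: detecting multicollinearity via the feature correlation
# matrix. x2 is intentionally almost a copy of x1; x3 is independent.
import numpy as np

rng = np.random.default_rng(3)
x1 = rng.uniform(0, 10, 100)
x2 = 2.0 * x1 + rng.normal(0, 0.1, 100)    # nearly a duplicate of x1
x3 = rng.uniform(0, 10, 100)               # unrelated feature

X = np.column_stack([x1, x2, x3])
print(np.corrcoef(X, rowvar=False).round(2))   # |corr(x1, x2)| is close to 1
```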

Weights are normally distributed.

            If we plot the distribution of the weights, we find that the weights are normally distributed, which means the plot follows a bell curve. The technical details will be discussed in further blogs.


                                                                                                     -Santosh Saxena       