PolarSPARC

Machine Learning - Linear Regression - Part 1


Bhaskar S 03/19/2022


Overview

Regression is a statistical process that attempts to determine the relationship between a dependent outcome (or response) variable, typically denoted by $y$, and one or more independent feature variables, typically denoted by $x_1, x_2, ..., x_n$.

Linear Regression is applicable when the Dependent Outcome (sometimes called a Response or a Target) variable is a CONTINUOUS variable and has a linear relationship with one or more Independent Feature (or Predictor) variable(s).

In general, one could express the relationship between the dependent outcome variable and the independent feature variables in mathematical terms as $y = \beta_0 + \beta_1.x_1 + ... + \beta_n.x_n$, where coefficients $\beta_1, \beta_2, ..., \beta_n$ are referred to as the Regression Coefficients (or the weights or the parameters) associated with the corresponding independent variables $x_1, x_2, ..., x_n$ and $\beta_0$ is a constant.

In other words, $y = \beta_0 + \sum_{i=1}^n \beta_i.x_i$

The most common method for determining the coefficients (or the weights or the parameters) of the linear regression equation is referred to as Ordinary Least Squares (or OLS).

Simple Linear Regression

In the case of Simple Linear Regression, there is ONE dependent outcome variable, that has a relationship with ONE independent feature variable.

To understand simple linear regression, let us consider the simple case of predicting (or estimating) a dependent outcome variable $y$ using a single independent feature variable $x$. For every value of the independent feature variable $x$, there is a corresponding outcome value $y$. That is, we will have a set of pairs $[(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)]$.

In other words, one could estimate the dependent outcome value using the model $\hat{y} = \beta_0 + \beta_1.x$. Notice that this equation is similar to that of a line $y = m.x + b$, where $m$ is the Slope of the line and $b$ is the Intercept of the line. Hence, the model for simple linear regression is often referred to as the Line of Best Fit.

Consider the following plot that illustrates the line of best fit, the actual outcome values (red dots), and the estimated (or predicted) outcome values (blue dots):

Line of Best Fit
Figure.1

For the line of best fit, the idea is to minimize the sum of all the residual distances (or the errors) between the actual values of the outcome variable (red dots) $y_i$ compared to their predicted values (blue dots) $\hat{y_i}$.

In mathematical terms, the sum of all the residual errors can be represented as $E = \sum_{i=1}^n (y_i - \hat{y_i})$. However, since some residuals will be POSITIVE and others NEGATIVE, they will cancel each other out when summed, and $E$ could be close to zero even for a poorly fitting line. To avoid this, we sum the SQUARE of the residual errors instead. In other words, $E = \sum_{i=1}^n (y_i - \hat{y_i})^2$. This equation is often referred to as the Sum of Squared Errors (or SSE) or the Residual Sum of Squares (or RSS).

For the optimal line of best fit, we need to minimize the Error Function (or the Cost Function) $E = \sum_{i=1}^n (y_i - \hat{y_i})^2 = \sum_{i=1}^n (y_i - (\beta_0 + \beta_1.x_i))^2$.
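
The error function above maps directly to a few lines of Python. The following is a minimal sketch using numpy (the function name sse and its array arguments are purely illustrative, not from the article's notebooks):

import numpy as np

def sse(beta_0, beta_1, x, y):
    # Predicted outcome values on the candidate line
    y_hat = beta_0 + beta_1 * x
    # Squaring the residuals prevents positive and negative errors from cancelling
    return float(np.sum((y - y_hat) ** 2))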

We know the values for $x_i$ and $y_i$, but have two unknown variables $\beta_0$ and $\beta_1$.

In order to MINIMIZE the error $E$, we need to take the partial derivatives of the error function with respect to the two unknown variables $\beta_0$ and $\beta_1$ and set their result to zero.

In other words, we need to solve for $\Large{\frac{\partial{E}}{\partial{\beta_0}}}$ $= 0$ and $\Large{\frac{\partial{E}} {\partial{\beta_1}}}$ $= 0$.

First, $\Large{\frac{\partial{E}}{\partial{\beta_0}}}$ $= \sum_{i=1}^n 2(y_i - (\beta_0 + \beta_1.x_i)) (-1)$

On simplification, we get, $- \sum_{i=1}^n y_i + \beta_0 \sum_{i=1}^n 1 + \beta_1 \sum_{i=1}^n x_i = 0$

Or, $\sum_{i=1}^n y_i = \beta_0 n + \beta_1 \sum_{i=1}^n x_i$ ..... $\color{red} (1)$

Second, $\Large{\frac{\partial{E}}{\partial{\beta_1}}}$ $= \sum_{i=1}^n 2(y_i - (\beta_0 + \beta_1.x_i)) (-x_i)$

On simplification, we get, $- \sum_{i=1}^n x_i.y_i + \beta_0 \sum_{i=1}^n x_i + \beta_1 \sum_{i=1}^n x_i^2 = 0$

Or, $\sum_{i=1}^n x_i.y_i = \beta_0 \sum_{i=1}^n x_i + \beta_1 \sum_{i=1}^n x_i^2$ ..... $\color{red} (2)$

Let $A = \sum_{i=1}^n x_i^2$, $B = \sum_{i=1}^n x_i$, $C = \sum_{i=1}^n x_i.y_i$, and $D = \sum_{i=1}^n y_i$

Therefore, equations $\color{red} (1)$ and $\color{red} (2)$ can be rewritten as follows:

$D = \beta_0 n + \beta_1 B$ ..... $\color{red} (1)$

$C = \beta_0 B + \beta_1 A$ ..... $\color{red} (2)$

Solving equations $\color{red} (1)$ and $\color{red} (2)$, we get the following:

$\beta_0 = \Large{\frac{(CB - DA)}{(B^2 - nA)}}$ $= \Large{\frac{(DA - CB)}{(nA - B^2)}}$ $= \Large{\frac{\sum y_i \sum x_i^2 - \sum x_i \sum x_i.y_i}{n \sum x_i^2 - (\sum x_i)^2}}$ ..... $\color{red} (3)$

$\beta_1 = \Large{\frac{(nC - DB)}{(nA - B^2)}}$ $= \Large{\frac{n \sum x_i.y_i - \sum x_i \sum y_i}{n \sum x_i^2 - (\sum x_i)^2}}$ ..... $\color{red} (4)$

We know the following:

$\bar{x} =$ $\Large{\frac{1}{n}}$ $\sum_{i=1}^n x_i$

$\bar{y} =$ $\Large{\frac{1}{n}}$ $\sum_{i=1}^n y_i$

$\overline{xy} =$ $\Large{\frac{1}{n}}$ $\sum_{i=1}^n x_i.y_i$

$\bar{x^2} =$ $\Large{\frac{1}{n}}$ $\sum_{i=1}^n x_i^2$

Dividing both the equations $\color{red} (3)$ and $\color{red} (4)$ by $n^2$ and simplifying, we will arrive at the following results:

$\color{red} \boldsymbol{\beta_0}$ $= \bbox[pink,2pt]{\Large{\frac{\bar{y}.\bar{x^2} - \bar{x}.\overline{xy}}{\bar{x^2} - \bar{x}^2}}}$

$\color{red} \boldsymbol{\beta_1}$ $= \bbox[pink,2pt]{\Large{\frac{\overline{xy} - \bar{x}.\bar{y}}{\bar{x^2} - \bar{x}^2}}}$
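
As a sanity check of the two boxed formulas, the following is a minimal numpy sketch that computes $\beta_0$ and $\beta_1$ from the sample means on a small made-up dataset (the sample values are purely illustrative):

import numpy as np

# Small made-up sample (for illustration only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

x_bar = x.mean()           # mean of x
y_bar = y.mean()           # mean of y
xy_bar = (x * y).mean()    # mean of x * y
x2_bar = (x ** 2).mean()   # mean of x^2

beta_1 = (xy_bar - x_bar * y_bar) / (x2_bar - x_bar ** 2)
beta_0 = (y_bar * x2_bar - x_bar * xy_bar) / (x2_bar - x_bar ** 2)

# np.polyfit(x, y, deg=1) returns [slope, intercept] and can be used as a cross-check
print(beta_0, beta_1)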

Let us use the Auto MPG dataset to demonstrate the relationship between the Horsepower and the MPG.

The following illustration shows the first 5 rows of the auto mpg dataset:

First 5 Rows of Auto Dataset
Figure.2

The following illustration shows the summary statistics of the auto horsepower feature variable:

Horsepower Summary Statistics
Figure.3

The following illustration shows the values of $\beta_0$ and $\beta_1$ computed using the equations derived above:

Two Coefficients
Figure.4

The following illustration shows the scatter plot between the auto horsepower feature variable and the auto mpg outcome variable, along with the line of best fit using the derivation for $\beta_0$ and $\beta_1$:

Horsepower vs MPG Plot
Figure.5
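
One possible way to reproduce the computation behind Figures 3 through 5 is sketched below. Note that the file path ./data/auto-mpg.csv and the column names horsepower and mpg are assumptions about how the dataset has been saved locally; adjust them to match your copy of the Auto MPG dataset:

import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical local copy of the Auto MPG dataset; treat '?' entries (if any) as missing
df = pd.read_csv('./data/auto-mpg.csv', na_values='?').dropna(subset=['horsepower', 'mpg'])

x = df['horsepower'].astype(float)
y = df['mpg'].astype(float)

# Closed-form coefficients from the derivation above
x_bar, y_bar = x.mean(), y.mean()
xy_bar, x2_bar = (x * y).mean(), (x ** 2).mean()
beta_1 = (xy_bar - x_bar * y_bar) / (x2_bar - x_bar ** 2)
beta_0 = (y_bar * x2_bar - x_bar * xy_bar) / (x2_bar - x_bar ** 2)

# Scatter of the actual values with the line of best fit overlaid
xs = x.sort_values()
plt.scatter(x, y, color='red', s=10, label='actual mpg')
plt.plot(xs, beta_0 + beta_1 * xs, color='blue', label='line of best fit')
plt.xlabel('horsepower')
plt.ylabel('mpg')
plt.legend()
plt.show()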

Multiple Linear Regression

In the case of Multiple Linear Regression, there is ONE dependent outcome variable that has a relationship with MORE than ONE independent feature variable.

To understand multiple linear regression, let us consider the case of predicting (or estimating) a dependent variable $y$ using $n$ independent feature variables $x_1, x_2, ..., x_n$. In mathematical terms, one could estimate the dependent outcome value using the linear model $\hat{y} = \beta_0 + \beta_1.x_1 + \beta_2.x_2 + ... + \beta_n.x_n$.

Using the matrix notation, one could write the above linear model as: $\hat{y} = \beta_0 + \begin{bmatrix} \beta_1 & \beta_2 & ... & \beta_n \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ ... \\ x_n \end{bmatrix}$

That is, $\hat{y} = \begin{bmatrix} \beta_0 & \beta_1 & ... & \beta_n \end{bmatrix} \begin{bmatrix} 1 \\ x_1 \\ x_2 \\ ... \\ x_n \end{bmatrix}$.

Now, what about extending the above linear model to predict $m$ dependent output values ???

In other words, $\hat{y} = \begin{bmatrix} \hat{y_1} \\ \hat{y_2} \\ ... \\ \hat{y_m} \end{bmatrix}$

Then, $\begin{bmatrix} \hat{y_1} \\ \hat{y_2} \\ ... \\ \hat{y_m} \end{bmatrix} = \begin{bmatrix} 1 & x_{1,1} & x_{1,2} & ... & x_{1,n} \\ 1 & x_{2,1} & x_{2,2} & ... & x_{2,n} \\ ... & ... & ... & ... & ... \\ 1 & x_{m,1} & x_{m,2} & ... & x_{m,n} \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \\ ... \\ \beta_n \end{bmatrix}$

Once again, using the matrix notation, we arrive at $\hat{y} = X\beta$, where $X$ is a matrix. Note that $\beta$ has to be to the right of $X$ in order to apply the coefficients (or weights) appropriately to each row (one set of feature values) of $X$.
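
The following is a minimal numpy sketch of how the matrix $X$ could be assembled, with a leading column of ones so that $\beta_0$ is applied to every row (the feature and coefficient values are made-up and purely illustrative):

import numpy as np

# Made-up feature values: m = 4 observations, n = 2 features
features = np.array([[2.0, 3.0],
                     [1.0, 5.0],
                     [4.0, 2.0],
                     [3.0, 3.0]])

# Prepend a column of ones so that beta_0 is applied to every row
X = np.column_stack([np.ones(features.shape[0]), features])

beta = np.array([0.5, 1.2, -0.7])   # example [beta_0, beta_1, beta_2]
y_hat = X @ beta                    # one predicted value per row of X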

Given that we desire to predict $m$ dependent output values, for the optimal line of best fit, we need to minimize $E = \sum_{i=1}^m (y_i - \hat{y_i})^2$.

Using the matrix notation, $E = \begin{bmatrix} (y_1-\hat{y_1}) & (y_2-\hat{y_2}) & ... & (y_m-\hat{y_m}) \end{bmatrix} \begin{bmatrix} (y_1-\hat{y_1}) \\ (y_2-\hat{y_2}) \\ ... \\ (y_m-\hat{y_m}) \end{bmatrix}$

If $y = \begin{bmatrix} y_1 \\ y_2 \\ ... \\ y_m \end{bmatrix}$ and $\hat{y} = \begin{bmatrix} \hat{y_1} \\ \hat{y_2} \\ ... \\ \hat{y_m} \end{bmatrix}$, then $\begin{bmatrix} (y_1-\hat{y_1}) \\ (y_2-\hat{y_2}) \\ ... \\ (y_m-\hat{y_m}) \end{bmatrix} = \begin{bmatrix} y_1 \\ y_2 \\ ... \\ y_m \end{bmatrix} - \begin{bmatrix} \hat{y_1} \\ \hat{y_2} \\ ... \\ \hat{y_m} \end{bmatrix} = y - \hat{y}$

Therefore, $E = (y - \hat{y})^T(y - \hat{y}) = (y^T - \hat{y}^T)(y - \hat{y}) = (y^T - (X\beta)^T)(y - X\beta) = (y^T - \beta^TX^T)(y - X\beta)$.

That is, $E = y^Ty - \beta^TX^Ty - y^TX\beta + \beta^TX^TX\beta$.


*** Matrix Transpose Rule ***

  $(X\beta)^T = \beta^TX^T$


We know $(y^TX\beta)^T = \beta^TX^Ty$. Since $y^TX\beta$ is a scalar, and the transpose of a scalar is the scalar itself, $y^TX\beta$ can be substituted with $\beta^TX^Ty$.

Therefore, $E = y^Ty - 2\beta^TX^Ty + \beta^TX^TX\beta$.

In order to MINIMIZE the error $E$, we need to take the partial derivatives of the error function with respect to $\beta$ and set the result to zero.

In other words, we need to solve for $\Large{\frac{\partial{E}}{\partial{\beta}}}$ $= 0$.


*** Matrix Calculus Rules ***

  1. $\Large{\frac{\partial}{\partial{\beta}}}$ $y^Ty = 0$

  2. $\Large{\frac{\partial}{\partial{\beta}}}$ $\beta^TX^Ty = X^Ty$

  3. $\Large{\frac{\partial}{\partial{\beta}}}$ $y^TX\beta = y^TX$

  4. $\Large{\frac{\partial}{\partial{\beta}}}$ $\beta^TX^TX\beta = 2X^TX\beta$


Therefore, $\Large{\frac{\partial{E}}{\partial{\beta}}}$ $= 0 - 2X^Ty + 2X^TX\beta$

That is, $\Large{\frac{\partial{E}}{\partial{\beta}}}$ $= - 2X^Ty + 2X^TX\beta$

To minimize the error, $\Large{\frac{\partial{E}}{\partial{\beta}}}$ $= 0$

That is, $- 2X^Ty + 2X^TX\beta = 0$

Or, $X^TX\beta = X^Ty$

To solve for $\beta$, we pre-multiply both sides by $(X^TX)^{-1}$ (assuming $X^TX$ is invertible): $(X^TX)^{-1}X^TX\beta = (X^TX)^{-1}X^Ty$

Simplifying, we get the following:

$\color{red} \boldsymbol{\beta}$ $= \bbox[pink,2pt]{(X^TX)^{-1}X^Ty}$
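
The boxed result translates directly into a few lines of numpy. The following is a minimal sketch (the synthetic data is purely illustrative); note that solving the linear system $X^TX\beta = X^Ty$ is numerically preferable to explicitly computing the inverse of $X^TX$:

import numpy as np

def fit_ols(X, y):
    # Solve the normal equation (X^T X) beta = X^T y for beta;
    # X is expected to already contain the leading column of ones
    return np.linalg.solve(X.T @ X, X.T @ y)

# Made-up example: m = 100 observations, n = 3 features
rng = np.random.default_rng(42)
features = rng.normal(size=(100, 3))
X = np.column_stack([np.ones(100), features])
true_beta = np.array([1.0, 2.0, -3.0, 0.5])
y = X @ true_beta + rng.normal(scale=0.1, size=100)

print(fit_ols(X, y))   # should be close to true_beta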

Once again, we use the Auto MPG dataset to demonstrate the relationship between the dependent response variable MPG and the three feature variables - Horsepower, Displacement, and Weight.

The following illustration shows the first 5 rows of the feature variable matrix:

First 5 Rows of Matrix
Figure.6

The following illustration shows the $\beta$ values computed using the equation derived above:

Beta Coefficients
Figure.7

Model Evaluation Metrics

In the following paragraphs we will look at the different metrics that help us evaluate the effectiveness of the linear regression model.

Mean Squared Error

Mean Squared Error (or MSE) is a metric that measures the average squared error of the regression model's predictions.

In mathematical terms,

$\color{red} \boldsymbol{MSE}$ $= \bbox[pink,2pt]{\Large{\frac{SSE}{n}}}$

where

$SSE = \sum_{i=1}^n (y_i - \hat{y_i})^2$ is called the Sum of Squared Errors and represents the variability of the actual values around the predicted values.
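
A minimal Python sketch of this metric (the function name mse and its arguments are illustrative):

import numpy as np

def mse(y, y_hat):
    # Mean Squared Error: the Sum of Squared Errors averaged over n observations
    sse = np.sum((y - y_hat) ** 2)
    return float(sse / len(y))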

Root Mean Squared Error

Root Mean Squared Error (or RMSE) is a metric that represents the standard deviation of residuals from the regression model.

In mathematical terms,

$\color{red} \boldsymbol{RMSE}$ $= \bbox[pink,2pt]{\sqrt{MSE}}$
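
Similarly, a minimal sketch for RMSE (again, the function name and arguments are illustrative):

import numpy as np

def rmse(y, y_hat):
    # Root Mean Squared Error: the square root of the MSE, expressed in the
    # same units as the outcome variable
    return float(np.sqrt(np.mean((y - y_hat) ** 2)))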

R-Squared

R-Squared is a metric (or measure) that evaluates the goodness of fit (or accuracy of prediction) of a regression model. It is often referred to as the Coefficient of Determination and is denoted using the symbol $R^2$.

In other words, $R^2$ represents the proportion of variance of the dependent outcome variable that is influenced (or explained) by the independent feature variable(s) in the regression model.

The $R^2$ value ranges from $0$ to $1$. For example, if $R^2$ is $0.85$, then it indicates that $85\%$ of the variation in the response variable is explained by the predictor variables.

In mathematical terms,

$\color{red} \boldsymbol{R^2}$ $= \bbox[pink,2pt]{1 - \Large{\frac{SSE}{SST}}}$

where

$SST = \sum_{i=1}^n (y_i - \bar{y})^2$ is called the Total Sum of Squares and represents the total variability of the actual values around their mean

$SSE = \sum_{i=1}^n (y_i - \hat{y_i})^2$ is called the Sum of Squared Errors and represents the variability of the actual values around the predicted values.

For a perfectly fit linear regression model, $SSE = 0$ and $R^2 = 1$.
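
A minimal Python sketch of this metric (the function name r_squared and its arguments are illustrative):

import numpy as np

def r_squared(y, y_hat):
    sse = np.sum((y - y_hat) ** 2)        # variability left unexplained by the model
    sst = np.sum((y - np.mean(y)) ** 2)   # total variability around the mean
    return float(1.0 - sse / sst)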

Adjusted R-Squared

When additional independent feature (or predictor) variables that have no influence on the dependent outcome (or response) are added to the regression model, the value of $R^2$ tends to increase, which appears to indicate the model is a better fit when in reality it is NOT, and is therefore misleading.

The Adjusted R-Square, denoted by $\bar{R}^2$ (or $R_{adj}^2$), addresses the issue by taking into account the number of independent feature variables in the model.

If $n$ is the number of observations and $p$ is the number of independent feature variables, then:

$\color{red} \boldsymbol{\bar{R}^2}$ $= \bbox[pink,2pt]{1 - \Large{\frac{(1 - R^2)(n - 1)}{n - p - 1}}}$
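
A minimal Python sketch of this metric (the function name adjusted_r_squared and its arguments are illustrative):

def adjusted_r_squared(r2, n, p):
    # Penalize R^2 for the number of feature variables p, given n observations
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)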

Hands-on Demo

The following are the links to the Jupyter Notebooks that provide a hands-on demo for this article:


References

Introduction to Linear Algebra - Part 2

Introduction to Calculus - Part 2


© PolarSPARC