Chapter 3: Linear Regression
Linear regression is a statistical method used to model the relationship between a dependent variable and one or more independent variables by fitting a linear equation to observed data. The most basic form, simple linear regression, involves one dependent variable and one independent variable. Multiple linear regression involves one dependent variable and multiple independent variables.
Simple Linear Regression
In simple linear regression, the relationship between the dependent variable and the independent variable is modeled with the equation:

$$y = \beta_0 + \beta_1 x + \varepsilon$$

where:
- $y$ is the dependent variable.
- $x$ is the independent variable.
- $\beta_0$ is the y-intercept (the value of $y$ when $x = 0$).
- $\beta_1$ is the slope of the line (the change in $y$ for a one-unit change in $x$).
- $\varepsilon$ is the error term (the difference between the observed and predicted values of $y$).
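As a quick illustration, the model above can be fitted to hypothetical synthetic data with NumPy's `polyfit` (a least squares polynomial fit; degree 1 gives a straight line). The data below is an assumption chosen so the fit is exact:

```python
import numpy as np

# Hypothetical noiseless data generated from y = 2 + 3x,
# so the fitted line should recover the intercept 2 and slope 3.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 + 3.0 * x

# np.polyfit returns coefficients highest degree first: [slope, intercept]
slope, intercept = np.polyfit(x, y, 1)
```

With noisy data the recovered coefficients would only approximate the true values, which is exactly what the error term $\varepsilon$ accounts for.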
Multiple Linear Regression
In multiple linear regression, the relationship involves more than one independent variable:

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \varepsilon$$

where:
- $y$ is the dependent variable.
- $x_1, x_2, \ldots, x_p$ are the independent variables.
- $\beta_0$ is the y-intercept.
- $\beta_1, \beta_2, \ldots, \beta_p$ are the coefficients of the independent variables.
- $\varepsilon$ is the error term.
Objective
The objective of linear regression is to find the best-fitting line (or hyperplane in the case of multiple regression) that minimizes the sum of the squared differences between the observed values and the values predicted by the linear model. This is known as the least squares criterion.
Finding the Coefficients
The coefficients $\beta_0$ and $\beta_1$ (in simple linear regression) or $\beta_0, \beta_1, \ldots, \beta_p$ (in multiple linear regression) are determined using the method of least squares. The formulas for these coefficients minimize the sum of squared residuals (the differences between observed and predicted values).
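For simple linear regression, the least squares estimates have the well-known closed form $\beta_1 = \sum (x_i - \bar{x})(y_i - \bar{y}) / \sum (x_i - \bar{x})^2$ and $\beta_0 = \bar{y} - \beta_1 \bar{x}$. A minimal sketch of these formulas (the function name and data are hypothetical):

```python
import numpy as np

def simple_ols(x, y):
    """Closed-form least squares estimates for simple linear regression."""
    x_bar, y_bar = x.mean(), y.mean()
    # Slope: covariance of x and y divided by variance of x
    beta1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
    # Intercept: forces the fitted line through (x_bar, y_bar)
    beta0 = y_bar - beta1 * x_bar
    return beta0, beta1

# Example: exact data from y = 1 + 2x, so the estimates are recovered exactly
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 7.0, 9.0])
b0, b1 = simple_ols(x, y)
```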
In matrix notation for multiple linear regression, the equation is written as:

$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$$

where:
- $\mathbf{y}$ is an $n \times 1$ vector of the dependent variable.
- $\mathbf{X}$ is an $n \times (p + 1)$ matrix of the independent variables (including a column of ones for the intercept).
- $\boldsymbol{\beta}$ is a $(p + 1) \times 1$ vector of the coefficients.
- $\boldsymbol{\varepsilon}$ is an $n \times 1$ vector of the error terms.

The least squares solution is given by:

$$\hat{\boldsymbol{\beta}} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}$$
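The normal-equations solution above can be sketched directly in NumPy. The synthetic design matrix and true coefficients below are assumptions for illustration; solving $\mathbf{X}^\top \mathbf{X} \hat{\boldsymbol{\beta}} = \mathbf{X}^\top \mathbf{y}$ with a linear solver is preferred over forming the explicit inverse:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: two predictors, plus a column of ones for the intercept
X_raw = rng.normal(size=(100, 2))
X = np.column_stack([np.ones(len(X_raw)), X_raw])

# True coefficients: intercept 1.0, slopes 2.0 and -0.5 (noiseless, so
# the least squares solution recovers them exactly)
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true

# Solve the normal equations (X^T X) beta_hat = X^T y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
```

In practice, `np.linalg.lstsq(X, y, rcond=None)` is more numerically robust when $\mathbf{X}^\top \mathbf{X}$ is ill-conditioned.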
Assumptions of Linear Regression
- Linearity: The relationship between the dependent and independent variables is linear.
- Independence: Observations are independent of each other.
- Homoscedasticity: The variance of the residuals is constant across all levels of the independent variable(s).
- Normality: The residuals (errors) are normally distributed.
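A crude, hedged way to eyeball some of these assumptions is to inspect the residuals of a fitted model. The sketch below (synthetic data; the split-half variance comparison is only a rough homoscedasticity check, not a formal test such as Breusch–Pagan):

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data with constant-variance noise: y = 1 + 0.5x + N(0, 1)
x = np.linspace(0.0, 10.0, 200)
y = 1.0 + 0.5 * x + rng.normal(scale=1.0, size=x.size)

slope, intercept = np.polyfit(x, y, 1)
residuals = y - (intercept + slope * x)

# With an intercept in the model, OLS residuals sum to (numerically) zero
mean_resid = residuals.mean()

# Rough homoscedasticity check: compare residual variance across the
# low-x half and the high-x half of the data
var_first = residuals[:100].var()
var_second = residuals[100:].var()
```

For serious diagnostics, residual-vs-fitted plots and Q-Q plots (for normality) are the standard tools.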
Evaluating the Model
To evaluate the performance of a linear regression model, several metrics are commonly used:
- R-squared: Indicates the proportion of the variance in the dependent variable that is predictable from the independent variable(s).
- Adjusted R-squared: Adjusts the R-squared value based on the number of predictors in the model.
- Mean Squared Error (MSE): The average of the squares of the residuals.
- Root Mean Squared Error (RMSE): The square root of the MSE, providing a measure of the average magnitude of the errors.
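These four metrics follow directly from their definitions. A self-contained sketch (the function name is hypothetical; `n_predictors` is the number of independent variables used in the adjusted R-squared correction):

```python
import numpy as np

def regression_metrics(y_true, y_pred, n_predictors):
    """Compute R-squared, adjusted R-squared, MSE, and RMSE."""
    resid = y_true - y_pred
    mse = np.mean(resid ** 2)
    rmse = np.sqrt(mse)
    ss_res = np.sum(resid ** 2)                       # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    n = len(y_true)
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - n_predictors - 1)
    return {"r2": r2, "adj_r2": adj_r2, "mse": mse, "rmse": rmse}

# Perfect predictions give R-squared of 1 and zero error
m = regression_metrics(
    np.array([1.0, 2.0, 3.0, 4.0, 5.0]),
    np.array([1.0, 2.0, 3.0, 4.0, 5.0]),
    n_predictors=1,
)
```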
Conclusion
Linear regression is a fundamental technique in statistics and machine learning for modeling and analyzing relationships between variables. Its simplicity and interpretability make it a widely used method for predictive modeling and data analysis.