Module 2: Multiple Regression
Module Overview
In this module, you will learn about multiple regression. You'll explore how to model relationships with multiple predictor variables, conduct t-tests to determine variable significance, and compare model fit using adjusted R-squared.
Learning Objectives
- Model the Relationship of Multiple Predictor Variables to a Single Outcome
- Conduct a t-test to Determine the Significance of Individual Variables in the Model
- Compare Model Fit Using Adjusted R-squared
Objective 01 - Model the Relationship of Multiple Predictor Variables to a Single Outcome
Overview
The previous module fit a linear regression model to two variables in our car crash data set: total accidents and alcohol impairment. We found a significant relationship between the two variables and could reject the null hypothesis.
In this module, we will look at how adding multiple predictor variables to a linear regression affects the outcome. Can we improve the linear regression model by adding in more predictor variables? First, let's load in the data, fit the model, and look at the results.
Follow Along
For this module, we'll look at the whole data set again, instead of just focusing on two variables.
import seaborn as sns
# Load the car crash dataset
crashes = sns.load_dataset("car_crashes")
crashes.head()
| | total | speeding | alcohol | not_distracted | no_previous | ins_premium | ins_losses | abbrev |
|---|---|---|---|---|---|---|---|---|
| 0 | 18.8 | 7.332 | 5.640 | 18.048 | 15.040 | 784.55 | 145.08 | AL |
| 1 | 18.1 | 7.421 | 4.525 | 16.290 | 17.014 | 1053.48 | 133.93 | AK |
| 2 | 18.6 | 6.510 | 5.208 | 15.624 | 17.856 | 899.47 | 110.35 | AZ |
| 3 | 22.4 | 4.032 | 5.824 | 21.056 | 21.280 | 827.34 | 142.39 | AR |
| 4 | 12.0 | 4.200 | 3.360 | 10.920 | 10.680 | 878.41 | 165.63 | CA |
We'll fit our model using alcohol as the independent variable and total as the dependent variable.
import seaborn as sns
from statsmodels.formula.api import ols
crashes = sns.load_dataset("car_crashes")
# Set-up and fit the model in one step
model = ols("total ~ alcohol", data=crashes).fit()
print(model.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                  total   R-squared:                       0.727
Model:                            OLS   Adj. R-squared:                  0.721
Method:                 Least Squares   F-statistic:                     130.5
Date:                Wed, 21 Apr 2021   Prob (F-statistic):           2.04e-15
Time:                        13:48:00   Log-Likelihood:                -110.99
No. Observations:                  51   AIC:                             226.0
Df Residuals:                      49   BIC:                             229.8
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      5.8578      0.921      6.357      0.000       4.006       7.709
alcohol        2.0325      0.178     11.422      0.000       1.675       2.390
==============================================================================
Omnibus:                        1.922   Durbin-Watson:                   1.776
Prob(Omnibus):                  0.382   Jarque-Bera (JB):                1.705
Skew:                           0.439   Prob(JB):                        0.426
Kurtosis:                       2.824   Cond. No.                         16.2
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
R-squared
Now we're going to look at a new result in our model summary: R-squared. This term is a statistical measure representing the proportion of the variance for a dependent variable explained by an independent variable (or variables) in a regression model. For our data, the R-squared value is the proportion of the variance for our variable 'total', explained by our independent variable, 'alcohol'.
Reading from the table, we have an R-squared value of 0.727, or about 73% (the proportion expressed as a percentage). So 73% of the variance in total accidents is explained by alcohol impairment, but what about the other 27%? Looking at the data we loaded, we can see other variables, including speeding, not_distracted, and ins_premium. So let's add one of these variables and see how it impacts the model and R-squared.
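Before we do, it helps to see where R-squared comes from. The following is a minimal sketch (not part of the original lesson code) that recomputes R-squared from the model's residuals and compares it to the rsquared attribute statsmodels exposes:

```python
import seaborn as sns
from statsmodels.formula.api import ols

crashes = sns.load_dataset("car_crashes")
model = ols("total ~ alcohol", data=crashes).fit()

# R-squared = 1 - (residual sum of squares / total sum of squares)
ss_res = (model.resid ** 2).sum()
ss_tot = ((crashes["total"] - crashes["total"].mean()) ** 2).sum()

print(1 - ss_res / ss_tot)  # ~0.727, computed by hand
print(model.rsquared)       # same value from statsmodels
```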
Multiple Linear Regression
For a single-variable linear regression, the equation was:

Single variable regression model: y = b_0 + b_1 * x

To add other variables, we add an additional term for each one:

Multiple variable regression model: y = b_0 + b_1 * x_1 + b_2 * x_2 + b_3 * x_3 + ...
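In the statsmodels formula interface, each term you add on the right-hand side of the ~ becomes one of the x variables above, and the intercept (b_0) is included automatically. Here's a minimal sketch of that mapping (we'll fit and interpret this exact model properly below):

```python
import seaborn as sns
from statsmodels.formula.api import ols

crashes = sns.load_dataset("car_crashes")

# "total ~ alcohol + speeding" means:
# total = b_0 + b_1 * alcohol + b_2 * speeding
model = ols("total ~ alcohol + speeding", data=crashes).fit()

# The fitted coefficients: Intercept (b_0), alcohol (b_1), speeding (b_2)
print(model.params)
```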
Let's look at a scatter plot where we visualize another variable. For this data, it makes sense to also look at the ins_premium variable, which is the car insurance premium amount paid by drivers. If a driver has a lot of accidents, we would expect an increase in insurance premiums.
import seaborn as sns
import matplotlib.pyplot as plt
crashes = sns.load_dataset("car_crashes")
fig, [ax1, ax2] = plt.subplots(1, 2, figsize=(14,6))
# Compare the two independent variables to each other - are they related?
sns.scatterplot(x='alcohol', y='ins_premium', data=crashes, s=50, ax=ax1)
# The color now represents the insurance premium
sns.scatterplot(x='alcohol', y='total', hue='ins_premium', data=crashes, s=50, palette='magma', ax=ax2)
plt.show()
In the plot on the left, we can see that there isn't much of a relationship between our two independent variables: alcohol impairment and insurance premiums don't seem to have a strong correlation. On the plot on the right, we have our independent variable (alcohol) on the x-axis and the dependent variable (total) on the y-axis. We've chosen to plot the insurance premium variable (ins_premium) on the same axes but color-coded to visualize any correlations.
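To back up the eyeball test on the left-hand plot, we can also compute the correlation directly. This quick check isn't part of the original lesson code, but it uses only standard pandas functionality:

```python
import seaborn as sns

crashes = sns.load_dataset("car_crashes")

# Pearson correlation between the two candidate predictors
print(crashes["alcohol"].corr(crashes["ins_premium"]))
```

A value near zero supports the impression that the two independent variables aren't strongly related. That's useful to know: predictors that are highly correlated with each other (multicollinearity) can make a regression's coefficient estimates unstable.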
Now, let's fit a model with two independent variables. We'll start by adding speeding alongside alcohol; you can try ins_premium yourself in the challenge at the end of this objective.
import seaborn as sns
from statsmodels.formula.api import ols
crashes = sns.load_dataset("car_crashes")
model = ols('total ~ alcohol + speeding', data=crashes).fit()
print(model.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                  total   R-squared:                       0.730
Model:                            OLS   Adj. R-squared:                  0.719
Method:                 Least Squares   F-statistic:                     64.87
Date:                Wed, 21 Apr 2021   Prob (F-statistic):           2.27e-14
Time:                        13:48:00   Log-Likelihood:                -110.71
No. Observations:                  51   AIC:                             227.4
Df Residuals:                      48   BIC:                             233.2
Df Model:                           2
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      5.6807      0.957      5.934      0.000       3.756       7.606
alcohol        1.9152      0.241      7.954      0.000       1.431       2.399
speeding       0.1502      0.206      0.728      0.470      -0.265       0.565
==============================================================================
Omnibus:                        2.495   Durbin-Watson:                   1.809
Prob(Omnibus):                  0.287   Jarque-Bera (JB):                2.045
Skew:                           0.490   Prob(JB):                        0.360
Kurtosis:                       2.978   Cond. No.                         23.5
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Now that we've added another variable, we have an additional line in our model for speeding, which includes the value of the coefficient.
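To connect the summary table back to the regression equation, here's a small sketch (not part of the original lesson) that plugs the fitted coefficients into y = b_0 + b_1 * x_1 + b_2 * x_2 by hand and checks the result against the fitted values statsmodels computes:

```python
import seaborn as sns
from statsmodels.formula.api import ols

crashes = sns.load_dataset("car_crashes")
model = ols("total ~ alcohol + speeding", data=crashes).fit()

# Plug the fitted coefficients into y = b_0 + b_1*x_1 + b_2*x_2 by hand
b = model.params
y_hat = (b["Intercept"]
         + b["alcohol"] * crashes["alcohol"]
         + b["speeding"] * crashes["speeding"])

# The hand-computed predictions match statsmodels' fitted values
print(y_hat.head())
print(model.fittedvalues.head())
```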
Challenge
More variables could be added to this model! We still haven't explored the no_previous, not_distracted, and ins_premium variables. Try adding a different variable in place of speeding and then look at the R-squared value. How does it change? In the next objective in this module, we'll look more closely at the p-value and t-value.
Additional Resources
Objective 02 - Conduct a t-test to Determine the Significance of Individual Variables in the Model
Overview
Let's review:
In the previous objective we:

- added another variable to our linear regression
- restated the null hypothesis: the slope coefficients (Beta_1 and Beta_2) are both zero
- looked at the R-squared value
We have a good understanding of how to add another variable to a linear regression. But we still have a few more things to look at in this analysis. Namely, we're ready to perform a t-test on the additional variable in our regression model. So, how do we interpret the t-statistic and the resulting p-value for this new variable?
Again, we'll import our car crash data and fit a multiple linear regression model.
import pandas as pd
import seaborn as sns
# Load the car crash dataset
crashes = sns.load_dataset("car_crashes")
crashes.head()
| | total | speeding | alcohol | not_distracted | no_previous | ins_premium | ins_losses | abbrev |
|---|---|---|---|---|---|---|---|---|
| 0 | 18.8 | 7.332 | 5.640 | 18.048 | 15.040 | 784.55 | 145.08 | AL |
| 1 | 18.1 | 7.421 | 4.525 | 16.290 | 17.014 | 1053.48 | 133.93 | AK |
| 2 | 18.6 | 6.510 | 5.208 | 15.624 | 17.856 | 899.47 | 110.35 | AZ |
| 3 | 22.4 | 4.032 | 5.824 | 21.056 | 21.280 | 827.34 | 142.39 | AR |
| 4 | 12.0 | 4.200 | 3.360 | 10.920 | 10.680 | 878.41 | 165.63 | CA |
# Import the OLS model from statsmodels
from statsmodels.formula.api import ols
# Set-up and fit the model in one step
# (format Y ~ X1 + X2)
model = ols('total ~ alcohol + speeding', data=crashes).fit()
# Print the model summary
print(model.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                  total   R-squared:                       0.730
Model:                            OLS   Adj. R-squared:                  0.719
Method:                 Least Squares   F-statistic:                     64.87
Date:                Sat, 10 Oct 2020   Prob (F-statistic):           2.27e-14
Time:                        14:32:50   Log-Likelihood:                -110.71
No. Observations:                  51   AIC:                             227.4
Df Residuals:                      48   BIC:                             233.2
Df Model:                           2
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      5.6807      0.957      5.934      0.000       3.756       7.606
alcohol        1.9152      0.241      7.954      0.000       1.431       2.399
speeding       0.1502      0.206      0.728      0.470      -0.265       0.565
==============================================================================
Omnibus:                        2.495   Durbin-Watson:                   1.809
Prob(Omnibus):                  0.287   Jarque-Bera (JB):                2.045
Skew:                           0.490   Prob(JB):                        0.360
Kurtosis:                       2.978   Cond. No.                        23.5
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Remember, in the previous objective, we considered that speeding might be a factor in the total number of accidents in which a driver is involved. So we added this variable to our regression model to test this out.
Well, our model says otherwise! If we look at the speeding variable, we can see that it has a p-value of 0.470. Recall that if the p-value is greater than our significance level (alpha, commonly 0.05), we fail to reject the null hypothesis. In other words, we can't reject the statement that there is NO relationship between total accidents and speeding. So it seems that speeding is not as big a factor in accidents as we might have thought!
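Rather than reading the numbers off the printed table, you can also pull them out of the fitted model directly. A brief sketch using attributes statsmodels exposes (params, bse, tvalues, pvalues):

```python
import seaborn as sns
from statsmodels.formula.api import ols

crashes = sns.load_dataset("car_crashes")
model = ols("total ~ alcohol + speeding", data=crashes).fit()

# The t-statistic is just the coefficient divided by its standard error
print(model.params["speeding"] / model.bse["speeding"])  # ~0.728
print(model.tvalues["speeding"])                         # same value

# Compare each p-value to a significance level of 0.05
alpha = 0.05
print(model.pvalues < alpha)
```

The comparison comes back False for speeding: we fail to reject the null hypothesis for that coefficient.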
Let's look at the R-squared value: it was 0.727 for one variable (alcohol), and when we added a second variable (speeding), the adjusted R-squared went down from 0.721 to 0.719. These results suggest that the speeding variable didn't help explain any additional variation in the target variable (total), which is consistent with our interpretation of the p-value and the null hypothesis.
We're going to look more closely at R-squared and adjusted R-squared in the next objective.
Challenge
As in the previous objective, we only looked at one additional variable. So go ahead and repeat the analysis above but try one of the other variables.
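If you'd like to try them all at once, here's one way to loop over the remaining variables. This is just a sketch, assuming the column names shown in crashes.head() above:

```python
import seaborn as sns
from statsmodels.formula.api import ols

crashes = sns.load_dataset("car_crashes")

# Fit "total ~ alcohol + <variable>" for each remaining candidate
for var in ["not_distracted", "no_previous", "ins_premium"]:
    model = ols(f"total ~ alcohol + {var}", data=crashes).fit()
    print(f"{var}: p-value = {model.pvalues[var]:.3f}")
```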
Additional Resources
Objective 03 - Compare Model Fit Using Adjusted R-squared
Overview
In the last two objectives, we learned how to add an additional variable to a linear regression model, perform a t-test, and interpret the p-value for the new variable.
Finally, we're going to look more closely at the R-squared and adjusted R-squared values. Specifically, we'll look at how they change from having one variable in our regression model and adding an additional variable.
As usual, we'll import the data and fit the model: first with just one variable (alcohol) and again with a second variable (speeding).
import pandas as pd
import seaborn as sns
# Load the car crash dataset
crashes = sns.load_dataset("car_crashes")
crashes.head()
| | total | speeding | alcohol | not_distracted | no_previous | ins_premium | ins_losses | abbrev |
|---|---|---|---|---|---|---|---|---|
| 0 | 18.8 | 7.332 | 5.640 | 18.048 | 15.040 | 784.55 | 145.08 | AL |
| 1 | 18.1 | 7.421 | 4.525 | 16.290 | 17.014 | 1053.48 | 133.93 | AK |
| 2 | 18.6 | 6.510 | 5.208 | 15.624 | 17.856 | 899.47 | 110.35 | AZ |
| 3 | 22.4 | 4.032 | 5.824 | 21.056 | 21.280 | 827.34 | 142.39 | AR |
| 4 | 12.0 | 4.200 | 3.360 | 10.920 | 10.680 | 878.41 | 165.63 | CA |
# Import the OLS model from statsmodels
from statsmodels.formula.api import ols
# Set-up and fit the model in one step
# (format Y ~ X)
model = ols('total ~ alcohol', data=crashes).fit()
# Print the model summary
print(model.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                  total   R-squared:                       0.727
Model:                            OLS   Adj. R-squared:                  0.721
Method:                 Least Squares   F-statistic:                     130.5
Date:                Sat, 10 Oct 2020   Prob (F-statistic):           2.04e-15
Time:                        14:33:40   Log-Likelihood:                -110.99
No. Observations:                  51   AIC:                             226.0
Df Residuals:                      49   BIC:                             229.8
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      5.8578      0.921      6.357      0.000       4.006       7.709
alcohol        2.0325      0.178     11.422      0.000       1.675       2.390
==============================================================================
Omnibus:                        1.922   Durbin-Watson:                   1.776
Prob(Omnibus):                  0.382   Jarque-Bera (JB):                1.705
Skew:                           0.439   Prob(JB):                        0.426
Kurtosis:                       2.824   Cond. No.                         16.2
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
# Set-up and fit the multiple regression model
# (format Y ~ X1 + X2)
model = ols('total ~ alcohol + speeding', data=crashes).fit()
# Print the model summary
print(model.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                  total   R-squared:                       0.730
Model:                            OLS   Adj. R-squared:                  0.719
Method:                 Least Squares   F-statistic:                     64.87
Date:                Sat, 10 Oct 2020   Prob (F-statistic):           2.27e-14
Time:                        14:33:40   Log-Likelihood:                -110.71
No. Observations:                  51   AIC:                             227.4
Df Residuals:                      48   BIC:                             233.2
Df Model:                           2
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      5.6807      0.957      5.934      0.000       3.756       7.606
alcohol        1.9152      0.241      7.954      0.000       1.431       2.399
speeding       0.1502      0.206      0.728      0.470      -0.265       0.565
==============================================================================
Omnibus:                        2.495   Durbin-Watson:                   1.809
Prob(Omnibus):                  0.287   Jarque-Bera (JB):                2.045
Skew:                           0.490   Prob(JB):                        0.360
Kurtosis:                       2.978   Cond. No.                        23.5
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Looking at the R-squared value for the model with one variable, we can see that alcohol explains 72.7% (0.727) of the variance in total accidents. When we add in speeding, the R-squared value increases slightly to 0.730, but the adjusted R-squared value decreases from 0.721 to 0.719. R-squared tends to increase with the number of variables added to a model, even if those variables don't explain the target any better. Adjusted R-squared, on the other hand, includes a penalty term for each additional independent variable. So, if R-squared doesn't increase enough when we add a new variable, the adjusted R-squared value will actually decrease.
That's exactly what happened here! We added speeding, and the adjusted R-squared decreased. The result implies that speeding doesn't help explain the variance in the target variable (total).
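For reference, adjusted R-squared applies a penalty based on the number of observations n and the number of predictors p: adjusted R-squared = 1 - (1 - R-squared) * (n - 1) / (n - p - 1). A quick sketch verifying this formula against the value statsmodels reports:

```python
import seaborn as sns
from statsmodels.formula.api import ols

crashes = sns.load_dataset("car_crashes")
model = ols("total ~ alcohol + speeding", data=crashes).fit()

# Adjusted R-squared penalizes each extra predictor:
# adj_R2 = 1 - (1 - R2) * (n - 1) / (n - p - 1)
n = model.nobs        # number of observations (51)
p = model.df_model    # number of predictors (2)
adj_r2 = 1 - (1 - model.rsquared) * (n - 1) / (n - p - 1)

print(adj_r2)              # ~0.719, computed by hand
print(model.rsquared_adj)  # same value from statsmodels
```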
Challenge
This is your last chance to experiment with this multiple regression model (okay, not really - you have all the code and can run this model a million times if you like). Now try adding different variables to the regression model and see whether the R-squared and adjusted R-squared values increase.
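As a starting point, this sketch fits a model for each candidate variable and prints both metrics side by side (again assuming the column names from crashes.head()):

```python
import seaborn as sns
from statsmodels.formula.api import ols

crashes = sns.load_dataset("car_crashes")

for var in ["speeding", "not_distracted", "no_previous", "ins_premium"]:
    model = ols(f"total ~ alcohol + {var}", data=crashes).fit()
    print(f"{var}: R-squared = {model.rsquared:.3f}, "
          f"adjusted R-squared = {model.rsquared_adj:.3f}")
```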
Additional Resources
Guided Project
Open DS_132_Multiple_Regression.ipynb in the GitHub repository below to follow along with the guided project:
Guided Project Video
Module Assignment
Complete the Module 2 assignment to practice the multiple regression techniques you've learned.