Module 1: Inference for Linear Regression
Module Overview
In this module, we will learn how to set up a hypothesis test for the slope term of a linear regression model and see how this test lets us determine whether there is a statistically significant relationship between two variables. Finally, we'll discuss sampling variability and the limits of the conclusions we can draw from our data.
Learning Objectives
- Identify the Appropriate Hypotheses to Test for a Statistically Significant Association Between Two Quantitative Variables
- Conduct and Interpret a t-test for the Slope Parameter and Make the Connection Between the t-test for a Population Mean and Slope Coefficient
- Identify the Appropriate Parts of the Output of a Linear Regression Model and Use Them to Build a Confidence Interval for the Slope Term
- Identify Violations of the Assumptions for Linear Regression and Articulate Why Correlation Does Not Imply Causation
Objective 01 - Identify the Appropriate Hypotheses to Test for a Statistically Significant Association Between Two Quantitative Variables
Overview
The previous module focused on describing the relationship between two variables: calculating a correlation coefficient and fitting a linear regression model to the data. The correlation coefficient and the residuals tell us about the strength of the relationship between the two variables, but we haven't yet looked at whether that relationship is statistically significant.
The focus of this module will be on this topic, and we'll start with hypothesis testing!
What are we testing?
We have fit a linear regression to two variables and plotted the line over our data. The slope parameter represents how the two variables are related. If the slope were zero, we would say there is no relationship between the variables: as one increases, the other stays at a constant value.
We're going to introduce a new dataset for this module: the car crashes dataset, originally compiled by FiveThirtyEight (538) and bundled with seaborn. The variables that we'll consider are the total number of drivers involved in fatal collisions and the factors of speeding and alcohol impairment.
Column descriptions:
- total (Number of Drivers Involved in Fatal Collisions Per Billion Miles)
- speeding (Number of Drivers Per Billion Miles Involved in Fatal Collisions Who Were Speeding)
- alcohol (Number of Drivers Per Billion Miles Involved in Fatal Collisions Who Were Alcohol-Impaired)
Let's load and view this data first and then discuss sample and population statistics.
import seaborn as sns
# Load the car crash dataset
crashes = sns.load_dataset("car_crashes")
# Limit the data to only the columns we need
crashes = crashes[["total", "speeding", "alcohol"]]
# Display the first few rows
crashes.head()
|   | total | speeding | alcohol |
|---|-------|----------|---------|
| 0 | 18.8 | 7.332 | 5.640 |
| 1 | 18.1 | 7.421 | 4.525 |
| 2 | 18.6 | 6.510 | 5.208 |
| 3 | 22.4 | 4.032 | 5.824 |
| 4 | 12.0 | 4.200 | 3.360 |
To start, we're going to compare the variables `total` and `alcohol`. We'll first implement a linear regression (find the best-fit line) and then look at how we should interpret the results using inferential statistics.
import seaborn as sns
import matplotlib.pyplot as plt
crashes = sns.load_dataset("car_crashes")
crashes = crashes[["total", "speeding", "alcohol"]]
# Make the scatter plot
sns.lmplot(data=crashes, x="alcohol", y="total", ci=None)
# Display the scatter plot
plt.show()

Remember the equation for a line: $y = b_0 + b_1 x$. For our data, we fit the intercept ($b_0$) and the slope ($b_1$). Now we will make a statistical statement about the slope parameter.
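Before we formalize anything, we can peek at concrete values for $b_0$ and $b_1$. Here's a minimal sketch using numpy's polyfit, one of several ways to compute a best-fit line:

import numpy as np
import seaborn as sns

crashes = sns.load_dataset("car_crashes")

# Fit a first-degree polynomial; polyfit returns the slope first
b1, b0 = np.polyfit(crashes['alcohol'], crashes['total'], deg=1)
print(b0, b1)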
But first, we need to set up our hypothesis and to do that we'll review the terms population and sample.
- population: the entire group that we would like to draw conclusions about
- sample: a subset of the population
In our dataset above, we're looking at just a sample of drivers; we don't have all the data for all drivers and all of their accident records.
Hypothesis Testing: sample and population means
With hypothesis testing, we compare a sample mean to some hypothesized population mean and then evaluate, with some level of certainty, whether we can reject the null hypothesis. In other words, we're testing a hypothesis about the population mean (to reject the null hypothesis or not) and using the sample to do the testing.
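For example, a one-sample t-test compares a sample mean to a hypothesized population mean. Here's a minimal sketch using scipy; the hypothesized mean of 15 is a made-up value, used only for illustration:

import seaborn as sns
from scipy import stats

crashes = sns.load_dataset("car_crashes")

# Compare the sample mean of 'total' to a hypothesized population mean
# (popmean=15 is an illustrative value, not from this lesson)
t_stat, p_value = stats.ttest_1samp(crashes['total'], popmean=15)
print(t_stat, p_value)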
But we aren't looking at the sample mean here; we're looking at the slope of the relationship between our variables. So, if these two variables (`total` and `alcohol`) are entirely unrelated, the slope would be 0 (a flat line). We can then write our null hypothesis as:
Null hypothesis: the slope is equal to zero.
The alternative hypothesis in this case would be that the two variables are related:
Alternative hypothesis: the slope is not equal to zero.
In the Guided Project, you will learn how to express the null and alternative hypotheses mathematically.
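As a quick preview, writing $\beta_1$ for the population slope, these hypotheses are commonly expressed as:

$$H_0: \beta_1 = 0 \qquad H_a: \beta_1 \neq 0$$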
Challenge
Think about a different dataset (you don't need to actually import it) and write out a null hypothesis and alternative hypothesis.
Additional Resources
Objective 02 - Conduct and Interpret a t-test for the Slope Parameter and Make the Connection Between the t-test for a Population Mean and Slope Coefficient
Overview
In the last objective, we looked at two variables from our car crash dataset. We can see that there is a relationship between the variables. We stated our null hypothesis and alternative hypothesis. So, now let's perform the hypothesis test.
Follow Along
In the previous module, we used the scikit-learn LinearRegression() class to fit our model. While this helped provide a preview of predictive modeling, we need more statistical information about our linear models.
So instead, we'll use the statsmodels library and its ordinary least squares (OLS) model.
We'll load our data, fit the model, and then look at the output.
import seaborn as sns

# Load the car crash dataset
crashes = sns.load_dataset("car_crashes")

# Keep only the columns we need
mycols = ['total', 'speeding', 'alcohol']
crashes = crashes[mycols]

crashes.head()
|   | total | speeding | alcohol |
|---|-------|----------|---------|
| 0 | 18.8 | 7.332 | 5.640 |
| 1 | 18.1 | 7.421 | 4.525 |
| 2 | 18.6 | 6.510 | 5.208 |
| 3 | 22.4 | 4.032 | 5.824 |
| 4 | 12.0 | 4.200 | 3.360 |
import warnings
warnings.filterwarnings('ignore')

import statsmodels.api as sm

Y = crashes['total']
X = crashes['alcohol']

# Add an intercept column; the design matrix columns are now [const, alcohol]
X = sm.add_constant(X)
model = sm.OLS(Y, X)

# Fit the model
results = model.fit()

# Test the slope: the constraint [0, 1] selects the second
# coefficient (alcohol) and tests whether it equals zero
print(results.t_test([0, 1]))
Test for Constraints
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
c0             5.8578      0.921      6.357      0.000       4.006       7.709
==============================================================================
Yay! We fit our linear model, and we have the coefficient estimate (coef in the table), the standard error (std err), and the t-value (t). Remember from the previous sprint how the t-statistic is calculated:

$$t = \frac{\bar{x} - \mu_0}{SE}$$

where $\bar{x}$ is the sample mean, $\mu_0$ is the mean under the null hypothesis, and $SE$ is the standard error. Because we are performing the hypothesis test on the slope parameter, we adjust the t-statistic:

$$t = \frac{b_1 - \beta_1}{SE}$$

where $b_1$ is the slope in the sample, $\beta_1$ is the slope under the null hypothesis, and $SE$ is the standard error.

Remember that we stated the null hypothesis as there is no relationship between the variables, so the slope ($\beta_1$) is equal to zero. Let's plug in the numbers and compare what we calculate as the t-statistic to the output in the table above.
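Here's a quick sketch that plugs the values printed in the table above (coef and std err) into the slope version of the formula:

# t-statistic for a coefficient: t = (estimate - null value) / SE
b1 = 5.8578        # coefficient estimate from the table
beta1_null = 0     # slope under the null hypothesis
se = 0.921         # standard error from the table

t_stat = (b1 - beta1_null) / se
print(t_stat)      # approximately 6.36, matching the t column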
It worked! The t-statistic we just calculated agrees with the value of t in the table.
We also don't need to calculate the p-value, as it's given in the table. Because the p-value is essentially zero (it rounds to zero in the table), we can reject the null hypothesis. In other words, we reject the statement that there is no relationship between the variables.
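One more note: t_test reports a single requested test. To see the full regression table, with a row for each coefficient plus overall fit statistics, you can print the model summary:

# Full regression output: coefficient table, R-squared, and other diagnostics
print(results.summary())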
Additional Resources
Objective 03 - Identify the Appropriate Parts of the Output of a Linear Regression Model and Use Them to Build a Confidence Interval for the Slope Term
Overview
We're going to continue with our car crash dataset. Remember that we have now fit a model and performed a t-test on the slope parameter. The slope measures the relationship between the total number of drivers involved in fatal collisions and alcohol impairment. So that's it, right? Not exactly. We also want to know how precisely we have estimated the slope, or how much confidence we have in its value. If we took a different sample of car crash data, we would likely get a different slope value.
Now is when confidence intervals come into play!
Confidence Intervals
import seaborn as sns
# Load the car crash dataset
crashes = sns.load_dataset("car_crashes")
crashes = crashes[['total', 'speeding', 'alcohol']]
crashes.head()
|   | total | speeding | alcohol |
|---|-------|----------|---------|
| 0 | 18.8 | 7.332 | 5.640 |
| 1 | 18.1 | 7.421 | 4.525 |
| 2 | 18.6 | 6.510 | 5.208 |
| 3 | 22.4 | 4.032 | 5.824 |
| 4 | 12.0 | 4.200 | 3.360 |
import seaborn as sns
import statsmodels.api as sm

crashes = sns.load_dataset("car_crashes")
crashes = crashes[['total', 'speeding', 'alcohol']]

Y = crashes['total']
X = crashes['alcohol']

# Add an intercept column to the design matrix
X = sm.add_constant(X)
model = sm.OLS(Y, X)

# Fit the model
results = model.fit()

# Test the slope; the output includes the 95% confidence interval
print(results.t_test([0, 1]))
Test for Constraints
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
c0 5.8578 0.921 6.357 0.000 4.006 7.709
==============================================================================
On the far right of this table, we see the bounds of a 95% confidence interval (the [0.025 and 0.975] columns). We would interpret this as: we are 95% confident that the true value of the slope lies between 4.006 and 7.709.
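As a check, here's a minimal sketch that reproduces those bounds from the coefficient and standard error in the table. It assumes 49 degrees of freedom (n - 2, with n = 51 rows in the dataset):

from scipy import stats

coef = 5.8578   # coefficient estimate from the table
se = 0.921      # standard error from the table
df = 49         # degrees of freedom: n - 2, with n = 51

# Critical t-value for a 95% confidence interval
t_crit = stats.t.ppf(0.975, df)

print(coef - t_crit * se, coef + t_crit * se)  # approximately 4.007 and 7.709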
Plotting Confidence Intervals
We can use the seaborn plotting library to draw the confidence interval on our plot. We specify the confidence level to display with the `ci` argument.
import seaborn as sns
import matplotlib.pyplot as plt
crashes = sns.load_dataset("car_crashes")
crashes = crashes[['total', 'speeding', 'alcohol']]
# Create the scatter plot with the confidence interval
sns.lmplot(data=crashes, x='alcohol', y='total', ci=95)
plt.show()
The "true" slope of the line is somewhere within the shaded region.
Challenge
Using the same dataset, try changing the confidence interval on the last plot. If you make the confidence level larger (e.g., 99%), will the shaded region increase or decrease? How will reducing the confidence level affect the size of the shaded region?
Additional Resources
Objective 04 - Identify Violations of the Assumptions for Linear Regression and Articulate Why Correlation Does Not Imply Causation
Overview
Now that we have sufficiently covered how to fit a linear regression model, it's essential to know when not to use one! The following list describes a few cases when linear regression models should be avoided.
Linear Regression: When not to use it
- Categorical data (as in a classification problem)
- Modeling non-linear relationships
- Outliers (without adjusting for them)
Let's go over a few of these examples now, with real datasets, and see what happens when we fit a linear regression model.
Follow Along
Categorical Variables
Not all data is suitable for fitting a linear model. If we have data where we're trying to predict to which class an observation belongs, we should fit a classification model (we'll learn about these in future Sprints). The following dataset and plot are examples of data better suited to a classification model.
import matplotlib.pyplot as plt
import seaborn as sns

# Load the example dots dataset
dots = sns.load_dataset("dots")

# Try to fit a line to a variable that takes only a few discrete values
sns.lmplot(data=dots, x="time", y="coherence", ci=95)
plt.show()
These two variables are not well suited for fitting a linear model: coherence takes only a handful of discrete values, so the points fall into horizontal bands rather than following a linear trend. These variables could instead be used in a classification model, with coherence as one of the features. Always plot your data before you fit a model!
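You can verify this directly by listing the distinct values in the column:

import seaborn as sns

dots = sns.load_dataset("dots")

# coherence takes only a few distinct values
print(dots['coherence'].unique())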
Non-linear Data
Let's look at another example. The hourly temperature in most temperate climates varies from an overnight low to a daily high. While we could model this type of data with other techniques, a linear regression would not be a good choice.
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Load hourly temperature data (assumes a local CSV with
# 'time' and 'temperature' columns)
temps = pd.read_csv("denver_temps.csv")

# A straight line cannot capture the daily temperature cycle
sns.lmplot(data=temps, x="time", y="temperature", ci=95)
plt.show()
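Outliers
The list above also mentioned outliers. Here's a minimal sketch with made-up data showing how a single extreme point can pull the fitted line away from the trend of the rest of the data:

import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

# Hypothetical data: a clean linear trend plus one extreme outlier
rng = np.random.default_rng(42)
x = np.linspace(0, 10, 30)
y = 2 * x + 1 + rng.normal(0, 1, size=30)
x = np.append(x, 10)
y = np.append(y, 60)  # a single point far above the trend

# The fitted line is pulled toward the outlier
sns.regplot(x=x, y=y, ci=95)
plt.show()

Try removing the last point and refitting: the slope and the width of the confidence band both change noticeably.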
Additional Resources
Guided Project
Open DS_131_Inference_For_Regression.ipynb in the GitHub repository below to follow along with the guided project:
Guided Project Video
Module Assignment
Complete the Module 1 assignment to practice inference for linear regression techniques you've learned.