Module 2: Hypothesis Testing (chi-square tests)

Module Overview

In Module 1, we focused on hypothesis testing with t-tests, which are used for continuous data. In this module, we'll explore hypothesis testing with chi-square tests, which are used for categorical data.

Chi-square tests allow us to determine whether there is a significant association between categorical variables or whether observed categorical data matches what we would expect under a certain hypothesis. These tests are essential when analyzing survey responses, demographic information, or any data where variables are measured in categories rather than continuous values.

Learning Objectives

Objective 01 - Explain the Purpose of a Chi-square Test and Identify Applications

Overview

In this section, we're going to discuss a new statistical test called the chi-square test. It's sometimes written using the Greek letter chi, which looks like a wavy capital X: χ. You will often see the test written as χ².

So why do we need yet another statistical test? Well, we can't apply a t-test to all situations. In some cases, we need to compare populations in different ways to determine how they are or are not related.

For example, we might have two or more populations for which we would like to compare two or more response categories. Say we are looking at the proportion of men and women who say their Facebook viewing time increases during specific months of the year. Here we would not be comparing sample means; instead, we would be asking whether the proportion in one group differs significantly from the proportion in the other.

Another application of the chi-square test is to determine if two categorical variables are independent. An example might be to look at the association between texting while driving and car accidents. How can we determine if these two variables are related to each other?

This second application is called the chi-square test of independence: it asks whether the two variables being tested are independent or associated. Let's move right into an example!

Follow Along

Chi-square Statistic

To complete a chi-square test on our sample populations, we need to set up our variables in a "contingency table." It's called this because we're testing to see if the number of cases in one of our categories is contingent upon (dependent on/independent of) the other variable.

Contingency Table

For this example, we'll look at some made-up data about cats and dogs and whether they prefer treats or toys. Then, based on our chi-square analysis, we'll be able to make a statistically supported statement about the animals' preferences.

              Cats  Dogs  Row total
Treats         200   290        490
Toys           400   910       1310
Column total   600  1200       1800

Based on these numbers, we can calculate the expected value for each cell: multiply the cell's column total by its row total, then divide by the grand total.

Expected Value

        Cats                           Dogs
Treats  (600 x 490) / 1800 = 163.33    (1200 x 490) / 1800 = 326.67
Toys    (600 x 1310) / 1800 = 436.67   (1200 x 1310) / 1800 = 873.33
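
The same arithmetic can be sketched in NumPy. This is a minimal illustration, assuming the table is stored as a 2x2 array with rows for treats/toys and columns for cats/dogs; the variable names are our own choice, not part of the lesson.

```python
import numpy as np

# Observed counts: rows = (treats, toys), columns = (cats, dogs)
observed = np.array([[200, 290],
                     [400, 910]])

row_totals = observed.sum(axis=1)   # totals for treats and toys
col_totals = observed.sum(axis=0)   # totals for cats and dogs
grand_total = observed.sum()        # all animals in the sample

# Expected count for each cell = (row total x column total) / grand total
expected = np.outer(row_totals, col_totals) / grand_total
print(expected.round(2))
```

Using np.outer builds every row total x column total product at once, so the same lines work for tables larger than 2x2.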

We now have a contingency table and the expected values for cats' and dogs' preferences for treats or toys. But how do we know whether the difference between the observed and expected counts is statistically significant? This is where the chi-square statistic comes in. The following formula calculates it:

chi-square = sum((observed - expected)^2 / expected)

Taking this formula, we'll calculate the chi-square statistic for our pet data.

Chi-square statistic: calculation

        Cats                                  Dogs
Treats  (200 - 163.33)^2 / 163.33 = 8.23      (290 - 326.67)^2 / 326.67 = 4.12
Toys    (400 - 436.67)^2 / 436.67 = 3.08      (910 - 873.33)^2 / 873.33 = 1.54

And after summing up the value in each cell, we'll have the chi-square statistic: 8.23 + 3.08 + 4.12 + 1.54 = 16.97. 16.97 is our observed chi-square value. The final step is to compare the observed value we calculated to the critical chi-square value. The critical chi-square value depends on the degrees of freedom in your data set and determines if your results are statistically significant.
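
As a cross-check on the hand calculation, here is a short NumPy sketch of the formula. It assumes the same 2x2 layout as the contingency table above; the names contributions and chi_square are illustrative only.

```python
import numpy as np

observed = np.array([[200, 290],
                     [400, 910]])

# Expected counts: (row total x column total) / grand total
expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()

# Contribution of each cell: (observed - expected)^2 / expected
contributions = (observed - expected) ** 2 / expected

# The chi-square statistic is the sum over all cells
chi_square = contributions.sum()

print(contributions.round(2))
print(round(chi_square, 2))
```

Because the code keeps full precision for the expected values, its total can differ slightly in the last digit from a sum of individually rounded cells.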

A 2x2 table has (2 - 1) x (2 - 1) = 1 degree of freedom. For one degree of freedom and an alpha level of 0.05, a table of critical chi-square values gives 3.84. Our calculated chi-square of 16.97 is greater than 3.84, so we reject the null hypothesis that pet type and preference are independent; a difference this large is unlikely to be due to chance alone. In this sample, cats chose treats at a significantly higher rate than dogs (about 33% versus 24%). (Remember this is manufactured data; your dog or cat may not fit into the above category.)
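
If a printed table is not at hand, the critical value can also be computed. The sketch below assumes SciPy is installed and uses scipy.stats.chi2, the chi-square distribution object; the variable names are our own.

```python
from scipy.stats import chi2

alpha = 0.05
n_rows, n_cols = 2, 2
dof = (n_rows - 1) * (n_cols - 1)   # degrees of freedom for a 2x2 table

# Critical value: the point that leaves probability alpha in the upper tail
critical_value = chi2.ppf(1 - alpha, dof)

# p-value: upper-tail probability beyond the observed statistic
observed_stat = 16.97
p_value = chi2.sf(observed_stat, dof)

print(f"Critical value: {critical_value:.2f}")
print(f"Reject the null hypothesis: {observed_stat > critical_value}")
```

Comparing the statistic to the critical value and comparing the p-value to alpha always lead to the same decision; they are two views of the same test.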

Challenge

Now it's your turn to practice calculating a chi-square statistic. Using the above examples, create your contingency table using some data that interests you. You can search for a "contingency table" and see if some small example tables have data in a suitable format. Follow the same steps we used above: compute the expected value for each cell, calculate each cell's contribution to the chi-square statistic, sum the contributions, and compare the result to the critical value.

Additional Resources

Objective 02 - Set Up a Chi-square Test for Independence on Two Categorical Variables

Overview

In the previous objective, we learned about the chi-square statistic. We worked out the chi-square value by hand using a contingency table. For this next objective, we're going to use the magic of SciPy and the scipy.stats module to compute the chi-square statistic.

Follow Along

We'll look at our previous contingency table example so that we can compare our scipy.stats results to our manual calculation.

Cats, Dogs, and Treats

Remember our contingency table from earlier in the module?

Contingency Table: Cats & Dogs

              Cats  Dogs  Row total
Treats         200   290        490
Toys           400   910       1310
Column total   600  1200       1800

Using these values, we calculated a chi-square statistic of 16.97. Next, we'll put these same values into the SciPy stats chi2_contingency function, which will perform a chi-square test of the independence of the variables in the given contingency table.

# Import the libraries
import numpy as np
from scipy.stats import chi2_contingency

# Create the table as a NumPy array
table = np.array([[200, 290], [400, 910]])

# Print out the table to double-check
print('Contingency table: \n', table)

# Perform the chi-square test
stat, p, dof, expected = chi2_contingency(table, correction=False)

# Print out the stats in a nice format
print('Expected values: \n ', expected.round(2))
print(f'The chi-square statistic is: {stat:.3f}')
print(f'The p value is: {p:.6f}')

Contingency table: 
 [[200 290]
 [400 910]]
Expected values: 
  [[163.33 326.67]
 [436.67 873.33]]
The chi-square statistic is: 16.965
The p value is: 0.000038
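
One detail worth noting in the call above is correction=False. For a table with one degree of freedom, chi2_contingency applies Yates' continuity correction by default, which gives a slightly smaller statistic than our hand calculation. The following sketch compares the two settings on the same table; the variable names are illustrative.

```python
import numpy as np
from scipy.stats import chi2_contingency

table = np.array([[200, 290],
                  [400, 910]])

# Without the continuity correction: matches the manual calculation
stat_plain, p_plain, _, _ = chi2_contingency(table, correction=False)

# With the default setting (correction=True), Yates' correction is applied
# because this 2x2 table has one degree of freedom
stat_yates, p_yates, _, _ = chi2_contingency(table)

print(f'Uncorrected statistic: {stat_plain:.3f}')
print(f'Yates-corrected statistic: {stat_yates:.3f}')
```

With counts this large the correction makes little practical difference, but it explains why results may not match a manual calculation exactly if the argument is left out.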
            

Challenge

Using the above example as a guide, choose one of the example table data sets from this website and re-create it in Python. It would help if you tried to do the following for this table:

Additional Resources

Objective 03 - Use a Chi-square Test p-value to Draw the Correct Conclusion About the Null and Alternative Hypothesis

Overview

We've already covered p-values and how we apply them to the null and alternative hypotheses. But let's go over a quick review.

When we perform a hypothesis test, we calculate a p-value. Using the significance level we decided on before performing our test, we then have enough information to either 1) reject or 2) fail to reject the null hypothesis.

  1. p-value < alpha: reject the null hypothesis
  2. p-value > alpha: fail to reject the null hypothesis
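
The decision rule is simple enough to express as a small helper. This is only a sketch; the function name decide is hypothetical, and the two example p-values are the ones that appear in this module's worked examples.

```python
def decide(p_value, alpha=0.05):
    """Return the decision for a hypothesis test at significance level alpha."""
    if p_value < alpha:
        return "reject the null hypothesis"
    return "fail to reject the null hypothesis"


print(decide(0.000038))   # p-value from the cats and dogs table
print(decide(0.34))       # a p-value larger than alpha
```

Note that alpha is an argument with a default: the significance level is chosen before running the test, not adjusted afterward to fit the result.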

Example: Dice Roll

We can use a chi-square test on a collection of dice rolls to determine if the dice are fair or if the random number generator we are using is random (well, as far as we can detect).

Using dice roll statistics as our data set, we're going to work through the whole process of stating the null hypothesis, performing a chi-square test, deciding on the significance level, determining the p-value, and then making a decision on the null hypothesis.

Follow Along

We already know the expected value of each number when we roll a die. For a six-sided die, each number should occur 1/6 or about 16.67% of the time. We can also estimate the expected frequency for each value by using a random number generator.

Let's decide on the null hypothesis and the significance level.

Null Hypothesis

For this situation, it would make sense to choose the null hypothesis to simply be: "the dice are fair".

Generated Dice Rolls

We used the random number generator in Python to simulate the dice rolling results. We "rolled" five dice, each a total of 50 times. Here are the results, along with the total for each value from 1 to 6.

Value   A   B   C   D   E  Total
1      13   7  10   5  13     48
2       5   7   4  12   9     37
3       5   9  14   0  10     38
4      12  13   8   7   7     47
5       7  10   9  13   6     45
6       8   4   5  13   5     35

Each value should come up 1/6 of the time; the total number of rolls is 250, and 250/6=41.67. So we can see that the results are pretty close to that number for most of the values except for one (a little high) and six (a little low).
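
The lesson does not show how the rolls were generated, so the following is only one possible way to simulate them, assuming NumPy's default_rng generator. The seed is arbitrary, and the counts it produces will not match the table above.

```python
import numpy as np

rng = np.random.default_rng(seed=42)  # arbitrary seed, for repeatability only

n_dice, n_rolls = 5, 50

# rolls[i, j] is the j-th roll of die i, a value from 1 to 6
rolls = rng.integers(low=1, high=7, size=(n_dice, n_rolls))

# Tally each die's rolls; bincount index 0 is unused, so drop it.
# Transposing gives one row per face value and one column per die.
counts = np.array([np.bincount(die, minlength=7)[1:] for die in rolls]).T

print(counts)
print('Totals per value:', counts.sum(axis=1))
```

The resulting array has the same (6, 5) shape as the table, so it can be passed directly to the chi-square test used in this section.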

Let's put the data in NumPy arrays and run a chi-square test on them.

import numpy as np

# Create the array for each die value
a1 = [13, 7, 10, 5, 13]
a2 = [5, 7, 4, 12, 9]
a3 = [5, 9, 14, 0, 10]
a4 = [12, 13, 8, 7, 7]
a5 = [7, 10, 9, 13, 6]
a6 = [8, 4, 5, 13, 5]

# Combine them into a (6,5) array
dice = np.array([a1, a2, a3, a4, a5, a6])
# Import the stats module
from scipy.stats import chi2_contingency

# Perform the chi-square test
stat, p, dof, expected = chi2_contingency(dice, correction=False)

# Print out the stats in a nice format
print('Expected values: \n ', expected.round(2))
print('The degrees of freedom: ', dof)
print(f'The chi-square statistic is: {stat:.3f}')
print(f'The p value is: {p:.6f}')

Expected values: 
  [[9.6 9.6 9.6 9.6 9.6]
 [7.4 7.4 7.4 7.4 7.4]
 [7.6 7.6 7.6 7.6 7.6]
 [9.4 9.4 9.4 9.4 9.4]
 [9.  9.  9.  9.  9. ]
 [7.  7.  7.  7.  7. ]]
The degrees of freedom:  20
The chi-square statistic is: 40.375
The p value is: 0.004477
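
The 20 degrees of freedom reported above follow from the shape of the table: (rows - 1) x (columns - 1). As a small sketch, assuming scipy.stats.chi2 is available, the degrees of freedom and the matching critical value at an alpha of 0.05 can be computed like this:

```python
from scipy.stats import chi2

# The dice table has 6 rows (values) and 5 columns (dice)
n_rows, n_cols = 6, 5
dof = (n_rows - 1) * (n_cols - 1)

# Critical value for a 0.05 significance level
critical_value = chi2.ppf(0.95, dof)

print('Degrees of freedom:', dof)
print(f'Critical value: {critical_value:.3f}')
```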

Interpret the result - computer generated

Now we need a table of chi-square probabilities and a significance level to interpret our result.

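Let's choose an alpha level of 0.05. With 20 degrees of freedom, the critical value is 31.410, and our calculated chi-square of 40.375 is greater than that. Our calculated p-value is 0.0045, which is less than 0.05. We reject our null hypothesis that the dice are fair. As far as this test can tell, the computer is using a "rigged" die, although a low p-value is evidence against the null hypothesis rather than proof.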
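
Because chi2_contingency treats the rolls as a 6x5 contingency table, strictly speaking it checks whether the five dice share the same distribution of values. A complementary check, which is not part of the original walkthrough, is a goodness-of-fit test on the row totals using scipy.stats.chisquare. When no expected frequencies are passed, chisquare assumes all categories are equally likely, which is exactly what a fair die predicts.

```python
from scipy.stats import chisquare

# Total count for each value 1-6 across all five simulated dice
totals = [48, 37, 38, 47, 45, 35]

# Goodness-of-fit test against equal expected frequencies (250 / 6 each)
gof_stat, gof_p = chisquare(totals)

print(f'Goodness-of-fit statistic: {gof_stat:.3f}')
print(f'Goodness-of-fit p value: {gof_p:.3f}')
```

Here the pooled totals are close to 41.67 each, so this test does not reject fairness. The two tests answer different questions: one asks whether the dice differ from each other, the other asks whether the combined rolls differ from a uniform distribution.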

Physical Dice

Let's look at the rolls from a random assortment of actual, physical dice. We set up the number of rolls and dice the same way as for the random number generator. Here are the results of rolling five dice 50 times each.

Value   A   B   C   D   E  Total
1       4   3   5  11   4     27
2       9  15  10   4  11     49
3       7  10   8   6   8     39
4      13   6   8   9  12     48
5       9   9   7  11   6     42
6       8   7  12   9   9     45

# Create the array for each die value
a1 = [4, 3, 5, 11, 4]
a2 = [9, 15, 10, 4, 11]
a3 = [7, 10, 8, 6, 8]
a4 = [13, 6, 8, 9, 12]
a5 = [9, 9, 7, 11, 6]
a6 = [8, 7, 12, 9, 9]

# Combine them into a (6,5) array
dice = np.array([a1, a2, a3, a4, a5, a6])
# Perform the chi-square test
stat, p, dof, expected = chi2_contingency(dice, correction=False)

# Print out the stats in a nice format
print('Expected values: \n ', expected.round(2))
print(f'The chi-square statistic is: {stat:.3f}')
print(f'The p value is: {p:.6f}')

Expected values: 
  [[5.4 5.4 5.4 5.4 5.4]
 [9.8 9.8 9.8 9.8 9.8]
 [7.8 7.8 7.8 7.8 7.8]
 [9.6 9.6 9.6 9.6 9.6]
 [8.4 8.4 8.4 8.4 8.4]
 [9.  9.  9.  9.  9. ]]
The chi-square statistic is: 21.989
The p value is: 0.341086

Interpret the result - human generated

Again, we'll use a table of chi-square probabilities and a significance level to interpret our result.

For this trial, we'll use an alpha level of 0.05. Our calculated chi-square of 21.989 is less than 31.410. As with the example above, we can also use the calculated p-value. In this case, our p-value of 0.34 is greater than our alpha of 0.05, and we fail to reject the null hypothesis.

We can conclude that our results are what we would expect if the physical dice used were fair.

Keep in mind that either test could return a different result with a different set of rolls, since the outcome depends on the particular values sampled.
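
To get a feel for how much the outcome depends on the sample, we can repeat the whole experiment many times with simulated fair dice and count how often the test rejects the null hypothesis. This is an illustrative sketch only: the seed and the number of repetitions are arbitrary choices, and the exact count depends on them.

```python
import numpy as np
from scipy.stats import chi2_contingency

rng = np.random.default_rng(seed=0)  # arbitrary seed, for repeatability only
n_experiments, alpha = 200, 0.05
rejections = 0

for _ in range(n_experiments):
    # Five fair dice, 50 rolls each, tallied into a (6, 5) table
    rolls = rng.integers(1, 7, size=(5, 50))
    table = np.array([np.bincount(die, minlength=7)[1:] for die in rolls]).T

    # Same test as above, run on this simulated table
    _, p, _, _ = chi2_contingency(table, correction=False)
    if p < alpha:
        rejections += 1

print(f'Rejected the null in {rejections} of {n_experiments} experiments')
```

Even when every die is fair, we should expect roughly a fraction alpha of the experiments to reject the null hypothesis by chance, which is why a single significant result is not conclusive on its own.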

Challenge

You may take the opportunity to generate your own dice-rolling data and see how your results compare to the computer-generated ones. You can use fewer dice (and roll more than one at a time) to collect your sample. Once you have some data, construct a contingency table and calculate your chi-square statistic. Then compare your results using your preferred significance level. Are your dice fair?

Additional Resources

Guided Project

Open DC_122_Chi2_Tests.ipynb in the GitHub repository below to follow along with the guided project:

Guided Project Video

Module Assignment

Complete the Module 2 assignment to practice chi-square testing techniques you've learned.

Assignment Solution Video

Resources

Chi-Square Test Resources

Advanced Chi-Square Analysis