Module 3: Ridge Regression

Module Overview

In this module, you will build on your regression knowledge with ridge regression. You'll learn about one-hot encoding, feature selection, and how regularization can improve model performance. These techniques will help you handle categorical variables and build more effective models with many features.

Learning Objectives

Objective 01 - Implement one-hot encoding of categorical features

Overview

In the previous two modules, we focused only on numerical features. The penguin data set, however, also contains several other features such as biological sex, penguin species, and island of origin. We did not use these non-numeric columns - but what if we had wanted to add those features while building the model?

This type of data, known as categorical data, is composed of specific values that belong to a finite set of categories or classes. For example, the column labeled "sex" has two values: male and female. This is a specific type of data called a binary variable. It would be a simple process to convert the "male" values to 0 and the "female" values to 1. We would then have a numeric vector that can be used as input for a model.

What if we had a column with more than two categories? We could use something called "ordinal encoding", where we assign an integer to each class. For example, the "sex" column might also contain values indicating an indeterminate or unknown sex. Our ordinal encoding could then be 1, 2, 3 (female, male, unknown/other).
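
As a quick illustration, here is a minimal sketch of ordinal encoding using pandas; the example values and the integer mapping are arbitrary choices for demonstration.

import pandas as pd

# A small hypothetical "sex" column with three categories
sex = pd.Series(['female', 'male', 'unknown', 'male', 'female'])

# Ordinal encoding: map each class to an integer (the choice of integers is arbitrary)
ordinal = sex.map({'female': 1, 'male': 2, 'unknown': 3})
print(ordinal.tolist())
[1, 2, 3, 2, 1]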

There is a problem with this method: it implies a ranking of the categories. The way we assigned the ordinal values was arbitrary: we could just as easily have assigned 'unknown' = 1, 'female' = 2, and 'male' = 3. The ordering of the classes doesn't carry any meaning, but a model would treat 3 as though it were "greater than" 1.

One-hot encoding

To get around this "ranking" issue, we can use one-hot encoding. Instead of assigning ordinal integers, we create a new feature column for each category and fill it with 0s and 1s. Using the above example, we could take the single "sex" column that contains three categories and convert it to three columns, with values of '0' or '1'. The following tables show this more clearly.

Original feature column

sex
male
male
female
other
male
female

One-hot encoded columns

male  female  other
1     0       0
1     0       0
0     1       0
0     0       1
1     0       0
0     1       0

We now have three feature vectors, each with a '1' where that class is present. Each observation can belong to only one class at a time, so in each row exactly one cell is "on" (has a '1') - this is the "hot" cell.
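
If you're working directly in pandas, a quick way to produce the same kind of table is pd.get_dummies, which creates one indicator column per category. This is just a sketch of the idea; in the next section we'll use scikit-learn's encoder, which fits more naturally into a modeling workflow.

import pandas as pd

# The example column from the table above
df = pd.DataFrame({'sex': ['male', 'male', 'female', 'other', 'male', 'female']})

# One 0/1 indicator column per category
print(pd.get_dummies(df['sex'], dtype=int))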

Follow Along

We're going to work through an example using scikit-learn utilities to one-hot encode the "species" column in the penguin data set. Let's load the data, and look at the data frame and the number of classes in the species column.

import seaborn as sns

# Bring in the penguins!
penguins = sns.load_dataset("penguins")
display(penguins.head())

# Find the number of classes in species
penguins['species'].unique()
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
0 Adelie Torgersen 39.1 18.7 181.0 3750.0 Male
1 Adelie Torgersen 39.5 17.4 186.0 3800.0 Female
2 Adelie Torgersen 40.3 18.0 195.0 3250.0 Female
3 Adelie Torgersen NaN NaN NaN NaN NaN
4 Adelie Torgersen 36.7 19.3 193.0 3450.0 Female
array(['Adelie', 'Chinstrap', 'Gentoo'], dtype=object)

We have three species listed, which is a good number of categories to demonstrate with. The species values are string objects, which are accepted as input to the sklearn.preprocessing.OneHotEncoder() transformer. We just need to reshape the array so that it's 2D, using np.newaxis as before.

# Imports
import numpy as np

# Select the species column and reshape it into a 2D array
species = penguins['species'].values[:, np.newaxis]

# Import the encoder
from sklearn.preprocessing import OneHotEncoder

# Instantiate the encoder as an object
# (sparse_output=False returns a dense array; on scikit-learn versions
# older than 1.2, use sparse=False instead)
enc = OneHotEncoder(sparse_output=False)

# Use the fit_transform method (2 steps in 1)
onehot = enc.fit_transform(species)

# Display every 25th row
onehot[::25]
array([[1., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.],
       [0., 1., 0.],
       [0., 1., 0.],
       [0., 0., 1.],
       [0., 0., 1.],
       [0., 0., 1.],
       [0., 0., 1.],
       [0., 0., 1.]])

The species column, which contains three classes, has been converted into an array with three columns. In each row there is only one "hot" cell ('1') and the rest are zeros.
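
To check which output column corresponds to which species, you can inspect the categories the encoder learned during fitting; they are stored in the same order as the output columns, which here matches the alphabetical order we saw with unique().

# Categories learned by the encoder, in the same order as the output columns
enc.categories_
[array(['Adelie', 'Chinstrap', 'Gentoo'], dtype=object)]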

Challenge

In the penguins data set there is another column ("island") that can be one-hot encoded. Following the same process as above, one-hot encode this column. As a stretch goal, you can concatenate the one-hot encoded array with the original DataFrame. This will be useful when we want to use the data set in a model, where we need to provide a single feature matrix; each one-hot encoded column will be treated as a feature.

Additional Resources

Objective 02 - Implement a univariate feature selection process

Overview

We've discussed the concepts of feature engineering and feature selection in the earlier Sprints. Now that we are becoming more familiar with a variety of machine learning topics, it's time to come back to this important concept and learn how to select the features we would like to include in our models.

In the previous module we discussed the concepts of underfitting and overfitting and their relationship to the bias-variance trade-off. We can expand on this and focus on how the number of features we use in a model impacts the results. For example, consider the case where we are using a complicated model with a lot of features. When we look at how the model performs on the testing set, we might find that it doesn't generalize well and decide to make the model simpler.

One method to reduce model complexity is by using fewer features. This decreases the number of parameters that need to be fit by the model and increases the chance that the model will generalize well. So, how do we determine which features, and how many, to keep in our model?

One of the more popular approaches to feature selection is the univariate feature selection process. The name originates from the process of looking at each feature individually and measuring the strength of its relationship with the target feature. For a linear regression, we would take each independent feature and separately measure the strength of the linear correlation with the dependent variable.
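To make the "univariate" part concrete, here is a minimal sketch using a tiny made-up data set: we look at each candidate feature on its own and measure its linear correlation with the target. The feature names and values are hypothetical; scikit-learn's utility below automates this kind of per-feature scoring.

import pandas as pd

# Tiny hypothetical data set: two candidate features and one target
df = pd.DataFrame({
    'feature_a': [1, 2, 3, 4, 5],
    'feature_b': [2, 1, 4, 3, 5],
    'target':    [2, 4, 6, 8, 10],
})

# Score each feature individually by its correlation with the target
for col in ['feature_a', 'feature_b']:
    r = df[col].corr(df['target'])
    print(f"{col}: r = {r:.2f}")
feature_a: r = 1.00
feature_b: r = 0.80
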

We'll work through an example with the penguin data set (yes, again - it's a nice data set to work on!).

Follow Along

Recall that the penguin data set has six features and one target. So far we have been using the numeric features to predict the body mass of the penguins, and we'll continue with the numeric features in this next example.

# Import pandas and seaborn
import pandas as pd
import numpy as np
import seaborn as sns

# Load the data into a DataFrame
penguins = sns.load_dataset("penguins")

# Drop NaNs
penguins.dropna(inplace=True)

# Create the features matrix
features = ['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm']
X = penguins[features]

# Create the target array
y = penguins['body_mass_g']

# Import the train_test_split utility
from sklearn.model_selection import train_test_split

# Create the training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

Using SelectKBest

The utility we're using to select the best features is the SelectKBest transformer from the sklearn.feature_selection module. The name makes it easy to remember what it does: select the k best features, where k is an integer.

# Import the feature selector utility
from sklearn.feature_selection import SelectKBest, f_regression

# Create the selector object to select the single best feature (k=1)
selector = SelectKBest(score_func=f_regression, k=1)

# Run the selector on the training data
X_train_selected = selector.fit_transform(X_train, y_train)

# Find the feature that was selected
selected_mask = selector.get_support()
all_features = X_train.columns
selected_feature = all_features[selected_mask]

print('The selected feature: ', selected_feature[0])
The selected feature:  flipper_length_mm

The flipper length seems to be the best feature to include when trying to predict the penguin's body mass.
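If you want to see how the candidate features compare, the fitted selector keeps the univariate scores it computed (F-statistics from f_regression in this case); a higher score indicates a stronger linear relationship with the target.

# Univariate score for each candidate feature
for name, score in zip(all_features, selector.scores_):
    print(name, round(score, 1))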

Challenge

We only selected the single best feature; try the same process as above but set k=2 and see which are the two best features. You can also use the one-hot encoded feature columns from the previous objective as additional features. Are any of those columns a good feature when trying to predict body mass?

Additional Resources

Objective 03 - Express and explain the intuition and interpretation of ridge regression

Overview

When we are training a model, simple is usually better. With fewer parameters to determine, the model is easier to interpret. Also, with too many parameters there is a chance the model will fit the training data well but won't generalize to new data. This objective will focus on a specific type of regularization called ridge regression.

Regularization

The idea of regularization is to reduce variance in a model with a "shrinkage" penalty. A more formal definition of regularization is adding information or bias to prevent overfitting. In regression, there are two common types of regularization: ridge regression and lasso. Ridge regression uses a penalty in which a tuning parameter is multiplied by the sum of the squared coefficients (an L2 penalty). Lasso regression is similar, but the penalty is the tuning parameter multiplied by the sum of the absolute values of the coefficients (an L1 penalty).

Ridge Regression

A regression model determines the parameters that minimize the residual sum of squares (RSS). Ridge regression instead minimizes the RSS plus a penalty: a tuning parameter multiplied by the sum of the squares of the coefficients. The larger the tuning parameter, the more the coefficients are penalized, especially large ones. This tuning parameter is usually written as lambda in statistics, but lambda is a reserved word in Python, so scikit-learn (and this discussion) calls it alpha.
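
Written out, the quantity ridge regression minimizes is the RSS plus the penalty term; lasso would replace the squared coefficients with their absolute values:

$$\sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2 + \alpha \sum_{j=1}^{p} \beta_j^2$$

where $\hat{y}_i$ is the model's prediction for observation $i$, the $\beta_j$ are the model coefficients, and $\alpha \ge 0$ controls the strength of the penalty.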

Note: the term "ridge" refers to the shape of the error surface we are trying to minimize. Without the penalty, especially when features are correlated, there can be a long, nearly flat "ridge" of coefficient values that fit the training data almost equally well; adding the penalty reshapes that surface so there is a single well-defined minimum.

Standardization

Another consideration when using any type of regularization technique is standardization. If the features are on different scales, their coefficients will be on different scales too, and the penalty will shrink them unevenly. For example, if we were predicting weight from height, the coefficient for height measured in meters would be about 100 times larger than the coefficient for height measured in centimeters, so the meter-based coefficient would be penalized much more heavily. Standardizing the features (zero mean, unit variance) before fitting puts all coefficients on a comparable scale.
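
In scikit-learn, standardization is usually handled with StandardScaler, often placed in a pipeline right before the regularized model (as we'll do in the next objective). A minimal sketch, using a made-up height column:

import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical feature on a large scale: height in centimeters
heights_cm = np.array([[150.0], [165.0], [170.0], [180.0], [195.0]])

# Rescale to zero mean and unit variance so the penalty treats all features evenly
scaled = StandardScaler().fit_transform(heights_cm)
print(scaled.mean(), scaled.std())   # approximately 0.0 and 1.0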

Follow Along

Now that we have gone over the concepts of regularization and ridge regression, let's look at the math illustrated with a few plots to help us understand the concept better.

Sum of Squares

When we fit a linear regression, we want to find the parameters (slope and intercept) that minimize the sum of the squared distances from each data point to the best-fit line. In the plot below on the left, the error is shown by the solid magenta line. Now let's take two of these data points and call them our "training data" (plot on the right). With only two points, we can fit a line through them perfectly, so the training error is zero. But the error on the rest of the data has increased, as indicated by the gray "model error" line in the plot on the right.

[Figure: left, best-fit line through all the data with the error shown in magenta; right, a line fit perfectly through two "training" points, with the increased model error shown in gray]

With ridge regression, we introduce a small amount of bias into the fitted line. In other words, we don't let the line fit these two training points perfectly, so there is now a small amount of training error (red lines). In exchange for that bias, the RSS for the rest of the data has decreased.

[Figure: the slightly biased ridge regression line, with small training errors shown in red but a lower error on the rest of the data]

Additional Resources

Objective 04 - Use sklearn to fit and interpret ridge regression models

Overview

In the previous objective, we looked at a simple example of how the introduction of bias reduces the variance in a model. Now, we're going to implement three different models: a linear regression, a linear regression with a number of polynomial features, and a ridge regression.

Follow Along

Let's look at our examples. First, we'll generate a data set to practice with. Let's pretend this data represents the number of ice cream cones sold as a function of the hours after the shop opened. We would like to fit this data to predict how much ice cream will be sold later in the day.

# Generate the practice data set
import numpy as np

np.random.seed(15)
x = 5 * np.random.rand(50)
y = abs(0.5*np.sin(x) + 0.5 * np.random.rand(50))*10

# Create the feature matrix 
X = x[:, np.newaxis]

# Fit a linear regression model
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X, y)

# Create the data for the model (best-fit line)
xfit = np.linspace(0, 5, 1000)
Xfit = xfit[:, np.newaxis]
yfit = model.predict(Xfit)

# Fit a linear regression with polynomial features
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline

poly_model = make_pipeline(PolynomialFeatures(15),
                           LinearRegression())

# Create the data for the model (best-fit line)
poly_model.fit(x[:, np.newaxis], y)
yfit_poly = poly_model.predict(xfit[:, np.newaxis])

# Fit a ridge regression with polynomial features
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

ridge_model = make_pipeline(PolynomialFeatures(15),
                            StandardScaler(),
                            Ridge(alpha = 0.05))

ridge_model.fit(x[:, np.newaxis], y)

# Create the data for the model (best-fit line)
yfit_ridge = ridge_model.predict(xfit[:, np.newaxis])

Now we'll create a plot with our practice data and all three models.

import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(10,5))
ax.scatter(x,y)

ax.plot(xfit, yfit, linestyle='--', label='linear regression')
ax.plot(xfit, yfit_poly, linestyle = '-.', label='linear regression (polynomial)')
ax.plot(xfit, yfit_ridge, label='ridge regression')

ax.set_ylim([0, 10])

ax.set_xlabel('Hours after opening')
ax.set_ylabel('Ice cream sales')
ax.legend()

plt.show()

[Figure: scatter plot of the ice cream data with the linear regression (dashed), polynomial regression (dash-dot), and ridge regression (solid) fits]

In the above plot, we can see that the linear regression with polynomial features (degree 15) tries to fit all the little increases and decreases in the data set. We wouldn't want to predict future ice cream sales with the orange dash-dot line. The model fit with ridge regression (solid green line) follows the data more sensibly and still captures the decrease in sales 4-5 hours after opening. Even though the ridge regression model uses the same degree polynomial as the orange dash-dot line, its coefficients are shrunk by the ridge penalty, which is controlled by the alpha parameter.

Challenge

With the code example above, try selecting a different value for alpha. The larger the value, the more the coefficients are shrunk toward zero and the less they affect the model. You can also experiment with setting alpha=0 and see what the resulting model looks like; with no penalty, ridge regression reduces to the ordinary polynomial regression.
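
One way to explore this is to refit the pipeline with several alpha values and compare how large the fitted coefficients are. Here is a sketch building on the objects defined above (make_pipeline names each step after its lowercased class name, which is how the 'ridge' step is looked up here):

# Refit the ridge pipeline at a few penalty strengths and compare coefficient sizes
for alpha in [0.05, 1.0, 10.0]:
    model = make_pipeline(PolynomialFeatures(15),
                          StandardScaler(),
                          Ridge(alpha=alpha))
    model.fit(x[:, np.newaxis], y)
    coefs = model.named_steps['ridge'].coef_
    print(f"alpha={alpha}: largest coefficient magnitude = {np.abs(coefs).max():.2f}")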

Additional Resources

Guided Project

Open JDS_SHR_213_guided_project_notes.ipynb in the GitHub repository below to follow along with the guided project:

Guided Project Video

Module Assignment

Complete the Module 3 assignment to practice the ridge regression techniques you've learned.

Assignment Solution Video

Resources

Documentation and Tutorials

Articles and Readings