DS Unit 2 - Sprint 6: Tree-Based Models

Welcome to Sprint 6!

In this Sprint, we'll continue our study of predictive modeling with tree-based models such as decision trees and random forests. We'll also learn how to clean data with outliers, impute missing values, encode categoricals, and engineer new features. In addition, we'll learn how to implement a pipeline to make it easier to process and fit a model.

In this Sprint, you will submit your Module Projects to the in-class Kaggle competition! For each project, you will refine your submission and try to increase your model's accuracy.

Sprint Overview

Module 1

Decision Trees

This module will introduce some basic but important concepts: cleaning data to account for outliers and implementing a pipeline using some of the linear regression models from previous sprints. In this module, we'll also introduce decision trees and see how they are used for both classification and regression tasks.
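
As a preview, here is a minimal sketch of a pipeline that imputes missing values and then fits a decision tree, once for classification and once for regression. The data, the median imputation strategy, and the max_depth setting are made up for illustration and are not taken from the module itself.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Tiny synthetic dataset (illustrative only); np.nan marks a missing value.
X = np.array([
    [1.0, 10.0],
    [2.0, np.nan],
    [3.0, 30.0],
    [8.0, 80.0],
    [9.0, 90.0],
    [10.0, np.nan],
])
y_class = np.array([0, 0, 0, 1, 1, 1])             # labels for classification
y_reg = np.array([1.5, 2.5, 3.5, 8.5, 9.5, 10.5])  # targets for regression

# Each pipeline fills missing values with the column median, then fits a tree.
clf = make_pipeline(
    SimpleImputer(strategy="median"),
    DecisionTreeClassifier(max_depth=2, random_state=0),
)
clf.fit(X, y_class)
class_preds = clf.predict(X)

reg = make_pipeline(
    SimpleImputer(strategy="median"),
    DecisionTreeRegressor(max_depth=2, random_state=0),
)
reg.fit(X, y_reg)
reg_preds = reg.predict(X)
```

Because the imputer lives inside the pipeline, the same preprocessing is applied automatically whenever predict is called on new data.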

Module 2

Random Forests

We'll continue learning about tree-based models and implement a random forest model in scikit-learn. This module will also cover how to encode data and how those encodings affect tree-based models differently compared to linear models.
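
To make the encoding point concrete, the sketch below fits the same random forest on one categorical column encoded two ways. The color data and the n_estimators value are invented for illustration. A tree can split an ordinal-encoded column at a threshold, so arbitrary integer codes are often workable for forests, whereas a linear model would read those codes as magnitudes.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Made-up single categorical feature; the label is 1 only for "red".
X = np.array([["red"], ["green"], ["blue"], ["red"],
              ["green"], ["blue"], ["red"], ["blue"]])
y = np.array([1, 0, 0, 1, 0, 0, 1, 0])

# Ordinal encoding: one integer code per category (assigned in sorted order).
encoded = OrdinalEncoder().fit_transform(X)

# One-hot encoding: one 0/1 column per category.
onehot = OneHotEncoder().fit_transform(X).toarray()

# The same model wrapped with each encoder.
ordinal_forest = make_pipeline(
    OrdinalEncoder(),
    RandomForestClassifier(n_estimators=25, random_state=0),
)
onehot_forest = make_pipeline(
    OneHotEncoder(handle_unknown="ignore"),
    RandomForestClassifier(n_estimators=25, random_state=0),
)

ordinal_preds = ordinal_forest.fit(X, y).predict(X)
onehot_preds = onehot_forest.fit(X, y).predict(X)
```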

Module 3

Cross-Validation and Grid Search

In the previous sprint, we covered how to implement a train-test split. Now we'll introduce cross-validation and the use of an independent test set, which you'll rely on when you submit to the in-class Kaggle competition. We'll also learn how to tune our models' hyperparameters to improve their accuracy.
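
The workflow described above might look roughly like the following sketch: hold out a test set, estimate performance with 5-fold cross-validation on the training portion, then search a small hyperparameter grid. The synthetic dataset, the decision tree model, and the grid values are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data stands in for a real dataset (illustrative only).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Set aside an independent test set that is never used during tuning.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# 5-fold cross-validation on the training data gives five accuracy scores.
cv_scores = cross_val_score(
    DecisionTreeClassifier(max_depth=3, random_state=0),
    X_train, y_train, cv=5,
)

# Grid search cross-validates every combination in param_grid.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [2, 3, 5], "min_samples_leaf": [1, 5]},
    cv=5,
    scoring="accuracy",
)
search.fit(X_train, y_train)

# Evaluate the best model once on the held-out test set.
test_accuracy = search.score(X_test, y_test)
print(search.best_params_, round(cv_scores.mean(), 3), round(test_accuracy, 3))
```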

Module 4

Classification Metrics

This last module will cover evaluation metrics for classification models. We'll introduce the confusion matrix along with how to interpret precision and recall. Additionally, we'll learn about the receiver operating characteristic (ROC) curve and how we can use it to interpret a classifier.
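
For a feel of how these metrics are computed, here is a small hand-rolled sketch in plain Python. The labels, scores, and thresholds are invented, and confusion_counts is a hypothetical helper written for this example rather than a library function. Each threshold yields one (false positive rate, true positive rate) point on the ROC curve.

```python
# Made-up true labels and predicted probabilities for the positive class.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_score = [0.9, 0.4, 0.65, 0.3, 0.2, 0.7, 0.8, 0.1]


def confusion_counts(labels, scores, threshold):
    """Return (tp, fp, tn, fn) when predicting 1 for scores >= threshold."""
    tp = fp = tn = fn = 0
    for label, score in zip(labels, scores):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 1 and label == 0:
            fp += 1
        elif predicted == 0 and label == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


# The four cells of the confusion matrix at a 0.5 threshold.
tp, fp, tn, fn = confusion_counts(y_true, y_score, 0.5)

precision = tp / (tp + fp)  # of the predicted positives, how many were right
recall = tp / (tp + fn)     # of the actual positives, how many were found

# Sweep thresholds to collect (false positive rate, true positive rate) pairs.
roc_points = []
for threshold in [0.25, 0.5, 0.75]:
    t_tp, t_fp, t_tn, t_fn = confusion_counts(y_true, y_score, threshold)
    roc_points.append((t_fp / (t_fp + t_tn), t_tp / (t_tp + t_fn)))
```

Lowering the threshold raises recall at the cost of more false positives, which is the trade-off the ROC curve visualizes.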

Sprint Resources