DS Unit 2 - Sprint 6: Tree-Based Models
Welcome to Sprint 6!
In this sprint, we'll continue our study of predictive modeling with tree-based models such as decision trees and random forests. We'll also learn how to clean data that contains outliers, impute missing values, encode categorical features, and engineer new features. In addition, we'll learn how to implement a pipeline to streamline preprocessing and model fitting.
In this sprint, your Module Projects will be submitted to the in-class Kaggle competition! For each project, you will refine your submission and try to increase your model accuracy.
Sprint Overview
Module 1
Decision Trees
This module will introduce some basic but important concepts: cleaning data to account for outliers and implementing a pipeline using some of the linear regression models from previous sprints. We'll also introduce decision trees and see how they are used for both classification and regression tasks.
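To make the pipeline idea concrete, here's a minimal sketch of a scikit-learn pipeline that chains an imputation step with a decision tree classifier. The synthetic dataset and the specific hyperparameters are illustrative, not the Module Project's actual data or settings:

```python
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; the Module Project uses the Kaggle dataset instead
X, y = make_classification(n_samples=200, n_features=5, random_state=42)

pipeline = make_pipeline(
    SimpleImputer(strategy="median"),                  # impute missing values
    DecisionTreeClassifier(max_depth=3, random_state=42),  # then fit the tree
)
pipeline.fit(X, y)
print(pipeline.score(X, y))  # accuracy on the training data
```

The pipeline applies each step in order, so the same preprocessing happens automatically during both fitting and prediction.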
Module 2
Random Forests
We'll continue learning about tree-based models and implement a random forest model in scikit-learn. This module will also cover how to encode data and how those encodings affect tree-based models differently compared to linear models.
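As a rough sketch of the encoding point: tree-based models split on thresholds, so they can work directly with ordinal codes, whereas linear models would wrongly treat those codes as ordered quantities. The toy "city" feature below is hypothetical:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical categorical feature and labels
X = np.array([["NYC"], ["LA"], ["NYC"], ["SF"], ["LA"], ["SF"]])
y = np.array([1, 0, 1, 0, 0, 1])

model = make_pipeline(
    OrdinalEncoder(),  # maps each category to an integer code
    RandomForestClassifier(n_estimators=100, random_state=42),
)
model.fit(X, y)
print(model.predict([["NYC"]]))
```

For a linear model, a one-hot encoding would usually be preferred over ordinal codes; for a random forest, either can work.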
Module 3
Cross-Validation and Grid Search
In the previous sprint, we covered how to implement a train-test split. Now we'll introduce cross-validation and the use of an independent test set, which you'll rely on when you submit to the in-class Kaggle competition. We'll also learn how to tune the hyperparameters in our models to improve accuracy.
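A short sketch of both ideas, again on synthetic data: k-fold cross-validation averages scores over several splits, and a grid search tries hyperparameter combinations and keeps the best one by cross-validated score. The parameter grid here is illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=42)

# 5-fold cross-validation: a more stable estimate than a single split
scores = cross_val_score(DecisionTreeClassifier(random_state=42), X, y, cv=5)
print(scores.mean())

# Grid search: fit every parameter combination, keep the best by CV score
search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid={"max_depth": [3, 5, None], "min_samples_leaf": [1, 5]},
    cv=5,
)
search.fit(X, y)
print(search.best_params_)
```

After the search, `search.best_estimator_` is refit on all the data and ready for predictions.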
Module 4
Classification Metrics
This last module will cover evaluation metrics for classification models. We'll introduce the confusion matrix and how to interpret precision and recall. Additionally, we'll learn about the receiver operating characteristic (ROC) curve and how to use it to interpret a classifier.
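These metrics can be sketched on a handful of hand-made labels. The labels and probabilities below are invented for illustration:

```python
from sklearn.metrics import (confusion_matrix, precision_score,
                             recall_score, roc_auc_score)

y_true = [0, 0, 1, 1, 1, 0, 1, 0]                       # actual classes
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]                       # predicted classes
y_prob = [0.1, 0.6, 0.8, 0.9, 0.4, 0.2, 0.7, 0.3]       # predicted probabilities

# Rows are true classes, columns are predicted classes
print(confusion_matrix(y_true, y_pred))  # [[3 1]
                                         #  [1 3]]
print(precision_score(y_true, y_pred))   # TP / (TP + FP) = 3/4 = 0.75
print(recall_score(y_true, y_pred))      # TP / (TP + FN) = 3/4 = 0.75
print(roc_auc_score(y_true, y_prob))     # area under the ROC curve
```

Note that ROC AUC is computed from the probabilities, not the hard predictions, since the ROC curve sweeps over every possible classification threshold.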