Module 2: Exploratory Data Analysis & Feature Engineering

Module Overview

In this module, you'll learn how to work with data using Google Colab and Pandas. You'll discover how to read and load datasets, explore your data using Pandas' powerful analysis tools, perform feature engineering to transform your data, and master string functions in Pandas for text data manipulation.

Learning Objectives

By the end of this module, you should be able to:

Objective 01 - Load a CSV Dataset From a URL Using Pandas read_csv
Objective 02 - Load a CSV Dataset From a Local File Using Pandas read_csv
Objective 03 - Use Basic Pandas Functions for Exploratory Data Analysis (EDA)
Objective 04 - Understand the Purpose of Feature Engineering
Objective 05 - Demonstrate How to Work With Strings in Pandas

Each objective is covered in detail in the sections below.

Objective 01 - Load a CSV Dataset From a URL Using Pandas read_csv

Overview

We will be working with data in many different forms throughout this course. But before we can start to do anything with that data, we need to load it into our workspace. So we'll be working in Google Colab notebooks, focusing on loading data into that environment. Eventually, you'll be working with a local Jupyter environment, but these instructions will work for that, too.

Pandas

You are likely already familiar with the Python data analysis library pandas. We'll provide a quick overview here and then work through some examples in the next section.

The pandas library provides extensive data analysis and data manipulation tools. It also includes data structures (Series, DataFrames) that work well with various data types, including tabular data, time-series data, and arbitrary matrix data (for example, columns with different data types).

Reading files using read_csv

To start, we'll learn how to load data with one of the most common pandas methods: read_csv. This method reads data in the comma-separated value or CSV format: the values in each row are separated by a comma, and new lines (rows) begin on the following line. A CSV file can be read from a locally saved file on your computer or loaded from a URL. We'll practice both these techniques in this module, starting with data that is stored online.
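
To make the format concrete, here is a minimal sketch that reads a tiny, made-up CSV string through io.StringIO, which stands in for a file:

# A tiny, made-up CSV: a header row, then one record per line,
# with values separated by commas
import io
import pandas as pd

csv_text = "name,age\nAda,36\nGrace,45"

# read_csv accepts any file-like object, including io.StringIO
df = pd.read_csv(io.StringIO(csv_text))
df.shape  # (2, 2): two rows, two columns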

As with many pandas methods, there are several options to use with read_csv. To learn about these, an excellent place to begin is with some of the official documentation.

For this first exercise, we will use the default read_csv parameters. Time to read in some data!

Follow Along

We will practice using the read_csv method to load a data set from a URL. The UCI Machine Learning Repository has a great selection of data sets, and many are in a good format for practicing.

The Tic-Tac-Toe Endgame data set is a manageable size (~900 rows) and has ten columns; this size makes it easy to examine once we read it in. To retrieve the URL, click on the above link and click on the Download: Data Folder link. You should see a list; right-click on tic-tac-toe data and select Copy link address.

In your notebook, you can use the following code to read in the data set:

# Import pandas with the standard alias
import pandas as pd

# Set a variable to the URL you copied above
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/tic-tac-toe/tic-tac-toe.data'

# Read or load the data
df = pd.read_csv(url)

Great! If you copied the link correctly, you should have successfully loaded the data set. A simple check is to type the variable name and run the cell:

# Look at the data (df)
df.head()
   x x.1 x.2 x.3  o o.1 x.4 o.2 o.3  positive
0  x   x   x   x  o   o   o   x   o  positive
1  x   x   x   x  o   o   o   o   x  positive
2  x   x   x   x  o   o   o   b   b  positive
3  x   x   x   x  o   o   b   o   b  positive
4  x   x   x   x  o   o   b   b   o  positive

We can see that there isn't a header row (a row with labels for each column). To fix this, look up the information for this data set and manually create a list of column names; this list can then be passed to read_csv as the header.
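
For example, here is a sketch of how a header could be supplied while reading (reusing the url variable from above); the column names below are made-up placeholders, so substitute the ones documented for this data set:

# Placeholder column names - look up the real ones in the
# data set's documentation
col_names = ['square_1', 'square_2', 'square_3',
             'square_4', 'square_5', 'square_6',
             'square_7', 'square_8', 'square_9', 'class']

# header=None tells pandas the file has no header row;
# names supplies the labels to use instead
df = pd.read_csv(url, header=None, names=col_names)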

Challenge

Since the data set we read above is missing a header row, this would be an excellent time to practice adding one. For the UCI Repository specifically, the information about the data sets is often located in the same “Data Folder” from where you downloaded it. Look up the column descriptions there, build your own list of names, and pass it to read_csv as shown above.

Additional Resources

Objective 02 - Load a CSV Dataset From a Local File Using Pandas read_csv

Overview

When working with data sets, you'll find many of them conveniently stored online at various locations (UCI Repository, Kaggle, etc.). But you'll often want to download a data set to store it on your local computer. Storing the data set makes it easy to use tools on your computer to view and edit the file, in addition to having a copy stored on your hard drive.

The pandas read_csv() method can also read in locally saved files. Instead of providing the URL, you will use the path to the file. Think of the path like the address of the file: it's a list of directories leading to the file location. An example path might look like /Users/myname/Documents/data_science/tic-tac-toe.data where the last part of the path is the file name. To read in a local file with pandas, you can use the following code (assuming pandas has been installed on your computer):

# Read in an example file (from an example user's Downloads folder)
import pandas as pd
df = pd.read_csv('/home/username/Downloads/tic-tac-toe.data')

Done! Common issues here are having an incorrect path listed for the file or an incorrect file name. You also need to make sure you include quotes (either single or double) around your path.

Since we are currently working in notebooks on Google Colab, we need to know how to get data loaded into that environment.

Google Colab

There are a few different ways to load files into your Colab notebook. The first one we will cover here is uploading a file from your computer. You need to have already downloaded and saved the file. The next step is to use the files module from the google.colab package:

# Upload a file to Colab from your computer
from google.colab import files
uploaded = files.upload()

The system will prompt you to select a file.

Select the tic-tac-toe.data file you downloaded. It should be in your Downloads folder.

Once you select the file, Colab uploads it and prints a short confirmation showing the file name and size.

Now, we can read our file into a DataFrame:

# Store your uploaded file in a pandas DataFrame 
df = pd.read_csv('tic-tac-toe.data')

Follow Along

Let's work through another example, where we read in a file from a notebook running locally and from a notebook running in Colab. We'll download another file from the UCI Machine Learning Repository: the Auto MPG Data Set.

Local Jupyter Notebook

If you are following along on your notebook running locally (on your computer, not on Google Colab), change the path to correspond to where you have saved the file on your computer.

# Load a file saved LOCALLY
import pandas as pd
df = pd.read_csv('/home/username/Downloads/auto-mpg.data')

Google Colab notebook

Now in a Colab notebook, we'll follow the same steps from the Overview to upload this new data set and read it into a DataFrame:

# Upload the file from your computer (a prompt will ask you to choose it)
from google.colab import files
uploaded = files.upload()

# Store your uploaded file in a pandas DataFrame
import io
import pandas as pd
df = pd.read_csv(io.BytesIO(uploaded['auto-mpg.data']))

In this case, we used the Python package io, which stands for input/output. The io.BytesIO is reading data stored as bytes in an in-memory buffer (in this case in Colab).
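
If you want to see what io.BytesIO does outside of Colab, here is a minimal sketch using a few made-up bytes:

# io.BytesIO wraps raw bytes in a file-like object that read_csv
# can consume, just like an open file
import io
import pandas as pd

raw_bytes = b"a,b\n1,2\n3,4"
df = pd.read_csv(io.BytesIO(raw_bytes))
df.shape  # (2, 2)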

You now have a new data set to explore!

Challenge

For more practice, look through the UCI Machine Learning Repository and find a data set to download to your computer. Make sure it is a CSV file and not in a different format. Upload it into Google Colab and then read it into a DataFrame.

Additional Resources

Objective 03 - Use Basic Pandas Functions for Exploratory Data Analysis (EDA)

Overview

Exploratory data analysis (EDA) is an essential part of learning to be a data scientist, and something that experienced data scientists do regularly.

We’ll be using some of the numerous tools available in the pandas library. Earlier in the module, we learned how to load datasets into notebooks. So now that we have all this data, what do we do with it?

Basic Information

We can use a few methods to quickly look at a DataFrame and get an idea of what's inside. Here are some of the most common methods and what each one does:

Method         Description
df.shape       Display the size as (rows, columns)
df.head()      Display the first 5 rows (pass a number to display the first n rows)
df.tail()      Display the last 5 rows (pass a number to display the last n rows)
df.describe()  Display statistics for the numeric columns
df.info()      Display the number of entries (rows), the number of columns, and the data types

Column-specific

Sometimes we don’t want to look at the entire DataFrame and instead focus on a single column or a few columns. There are a few ways to select a column, but we’ll mainly use the column name. For example, if we have a DataFrame called df and a column named “column_1,” we could select just a single column by using df["column_1"]. Once we have a single column chosen, we can use some of the following methods to get more information.

Method                            Description
df.columns                        Print a list of the columns
df['column_name']                 Select a single column (returns a Series)
df['column_name'].value_counts()  Count the occurrences of each unique value (useful for object and boolean columns)
df.sort_values(by='column_name')  Sort the rows by the values in the given column
df.drop()                         Remove rows or columns by specifying the label or index
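
Here is a quick sketch of these methods in action on a small, made-up DataFrame:

# A small, made-up DataFrame for illustration
import pandas as pd

df = pd.DataFrame({'color': ['red', 'blue', 'red'],
                   'mass': [1.2, 0.9, 1.5]})

df['color'].value_counts()  # red: 2, blue: 1
df.sort_values(by='mass')   # rows ordered by mass, smallest first
df.drop(columns='mass')     # returns a copy without the 'mass' column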

Missing Values

With a lot of data comes the unavoidable fact that some of it will be messy. Messy data means that there will be missing values, “not-a-number” (NaN) occurrences, and problems with zeros not being zero. Fortunately, several pandas methods make dealing with the mess a little easier.

Method             Description
df.isnull().sum()  Count the null values (NaN or None) in each column
df.fillna()        Fill NaN values in a variety of ways
df.dropna()        Remove NaN or None values; by default, drops every row that contains a NaN
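
And a short sketch of the missing-value methods, again on a small, made-up DataFrame:

# A made-up DataFrame with one missing value in each column
import numpy as np
import pandas as pd

df = pd.DataFrame({'mass': [1.2, np.nan, 1.5],
                   'color': ['red', 'blue', None]})

df.isnull().sum()       # mass: 1, color: 1
df.fillna({'mass': 0})  # fill the NaN in 'mass' with 0
df.dropna()             # keep only rows with no missing values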

Follow Along

The above methods cover a lot of ground, so let's work through examples using them on a data set. First, we need some data. We'll use the M&Ms data set because it's small and contains both numeric and object (string) data types.

# Import pandas
import pandas as pd

# Read the data from the website
url_mms = 'https://tinyurl.com/mms-statistics'
df = pd.read_csv(url_mms)

# Look at the shape
df.shape  # returns (816, 4)

# Look at the head of the file
df.head()
            type   color  diameter  mass
0  peanut butter    blue     16.20  2.18
1  peanut butter   brown     16.50  2.01
2  peanut butter  orange     15.48  1.78
3  peanut butter   brown     16.32  1.98
4  peanut butter  yellow     15.59  1.62

Let's look at the information using df.info():

# DataFrame information
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 816 entries, 0 to 815
Data columns (total 4 columns):
type        816 non-null object
color       816 non-null object
diameter    816 non-null float64
mass        816 non-null float64
dtypes: float64(2), object(2)
memory usage: 25.6+ KB

From this, we can see the columns “type” and “color” are object data types, and “diameter” and “mass” are numeric. We can print out the statistics for the numeric columns using df.describe():

# DataFrame describe
df.describe()
         diameter        mass
count  816.000000  816.000000
mean    14.171912    1.419632
std      1.220001    0.714765
min     11.230000    0.720000
25%     13.220000    0.860000
50%     13.600000    0.920000
75%     15.300000    1.930000
max     17.880000    3.620000

DataFrame Columns

We can print out the columns of this data set with df.columns, which returns an Index:

# DataFrame columns
df.columns
Index(['type', 'color', 'diameter', 'mass'], dtype='object')

To drop one of the columns, for example "mass", we'll use df.drop() with the column specified:

# Drop the mass column
df.drop(columns='mass').head()
            type   color  diameter
0  peanut butter    blue     16.20
1  peanut butter   brown     16.50
2  peanut butter  orange     15.48
3  peanut butter   brown     16.32
4  peanut butter  yellow     15.59

We can also look at the number of different values in one of the columns. Let's see how many different values we have in the "type" column:

# Count the values in the 'type' column
df['type'].value_counts()
plain            462
peanut butter    201
peanut           153
Name: type, dtype: int64

This gives a clear count of how many plain, peanut butter, and peanut M&M types are present in the DataFrame.

Challenge

Now, while using the M&Ms data set, practice loading and exploring the DataFrame; specifically, find out if there are any null or NaN values in this data set and remove them if there are. You can also practice using the df.sort_values() method on one of the numeric columns (‘mass’ or ‘diameter’).

Additional Resources

Objective 04 - Understand the Purpose of Feature Engineering

Overview

Why do we care about feature engineering?

When we begin working on a new project and a new data set, we have to consider how to make the best use of the information contained in the data set. We want to get the most out of the data without leaving out useful information. It is also essential to ensure that the features we are using are best suited for the type of modeling we would like to perform. In other words, what is the best representation of the data so that a model can learn from it? Choosing good features might allow you to use a less complex model, saving time on both implementation and interpretation.

Many of these considerations fall under the process of "feature engineering." So before we dig too deep, let's take a closer look at what features are.

What are features?

A feature is a measurable property or attribute, often something that is measured or observed. It can be numeric (e.g., temperature, weight, or time) or categorical (e.g., sex, color, or group). We usually work with data in the form of a table, also called tabular data. The data is composed of rows (also called observations, instances, or samples) and columns (also called variables or attributes). Some of the columns in a data set contain features, such as a column measuring the temperature of something over time (a measurable observation). But other columns, like the index, would not be considered a feature.
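
As a small illustration, here is a made-up table where each row is an observation and each column is a feature:

# 'temperature' is a numeric feature and 'color' is a categorical
# feature; the index (0, 1, 2) is not a feature
import pandas as pd

observations = pd.DataFrame({
    'temperature': [21.5, 19.8, 23.1],
    'color': ['red', 'green', 'blue']
})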

Feature importance and selection

Not all features are equal when it comes to using them in a model. Depending on the problem you are solving, some features may be more beneficial than others. Features can therefore be ranked by some criterion, and the analysis can then use only the top-ranked features.
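
As one concrete possibility, scikit-learn's SelectKBest scores every feature and keeps only the top k. Here is a minimal sketch, assuming scikit-learn is installed and using its built-in iris data:

# Rank the four iris features and keep the two highest-scoring ones
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)

selector = SelectKBest(f_classif, k=2)
X_top = selector.fit_transform(X, y)
X_top.shape  # (150, 2)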

Feature extraction

Feature extraction doesn't exactly sound like a process you would apply to data, but it's an integral part of feature engineering. The process can be complex, but essentially it takes high-dimensional data and reduces its dimensionality while preserving as much of the information as possible. Some standard methods of feature extraction include Principal Component Analysis (PCA) and unsupervised clustering methods.
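
Here is a minimal PCA sketch, assuming scikit-learn is installed and using randomly generated, made-up data:

# Project 10 made-up features down to 3 principal components
import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(100, 10)  # 100 samples, 10 original features

pca = PCA(n_components=3)
X_reduced = pca.fit_transform(X)
X_reduced.shape  # (100, 3)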

Feature construction

The process of feature construction is a little more straightforward compared to feature extraction. By combining, splitting apart, or otherwise modifying existing features, you can create new features. For example, when using data that includes a calendar date, we could break the date variable into years, months, and days. A new feature could then be given as the year or the month by itself.
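
Here is a sketch of that date example, using a few made-up dates:

# Split a made-up date column into year, month, and day features
import pandas as pd

df = pd.DataFrame({'date': ['2020-01-15', '2020-06-30', '2021-03-01']})
df['date'] = pd.to_datetime(df['date'])

df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day'] = df['date'].dt.day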

Feature engineering process

Here we describe a basic feature engineering process. We will cover the individual steps of the process in more detail as the course progresses.

Follow Along

Now that we know what features are and some methods used to select, extract, and construct them, let's practice on a data set. We'll use the M&Ms data set to make some new features from the columns that already exist.

# Import our libraries
import pandas as pd

# Read in the M&Ms data into a DataFrame
mms = pd.read_csv('https://tinyurl.com/mms-statistics')

# Display the data
display(mms.head())
            type   color  diameter  mass
0  peanut butter    blue     16.20  2.18
1  peanut butter   brown     16.50  2.01
2  peanut butter  orange     15.48  1.78
3  peanut butter   brown     16.32  1.98
4  peanut butter  yellow     15.59  1.62

We don't have a lot of features to work with because this data set is small. But we do have two physical measurements: diameter and mass. If you recall from your high-school science classes, we could use these two properties to create a new property: density.

First, we'll assume that an M&M candy is spherical. The volume of a sphere is V = (4/3) × π × r³, and the density is the mass divided by the volume. We need to create a new column for the volume and another column to hold the density values.

# Create the volume column (in cubic cm)

# Divide the diameter by 10 to convert mm to cm
mms['diam_in_cm'] = mms['diameter']/10

# Volume of a sphere: (4/3) * pi * r^3, using 3.14 to approximate
# pi and half the diameter as the radius
mms['volume'] = (4/3)*(3.14)*(mms['diam_in_cm']/2)**3

# Create the density column (in grams/cubic cm)
mms['density'] = mms['mass']/mms['volume']

# Take a look at the new columns
display(mms.head())
            type   color  diameter  mass  diam_in_cm    volume   density
0  peanut butter    blue     16.20  2.18       1.620  2.224966  0.979790
1  peanut butter   brown     16.50  2.01       1.650  2.350879  0.854999
2  peanut butter  orange     15.48  1.78       1.548  1.941294  0.916914
3  peanut butter   brown     16.32  1.98       1.632  2.274777  0.870415
4  peanut butter  yellow     15.59  1.62       1.559  1.982973  0.816955

Challenge

Now it's your turn: can you think of any additional features to create from the above data set? Try reading in this data and making some new columns, possibly using different units for the volume and density. The part to practice here is simply creating a new column; it doesn't matter whether the property you create is meaningful or not (for now).

Additional Resources

Objective 05 - Demonstrate How to Work With Strings in Pandas

Overview

When we work with data, we're going to encounter many different data types: strings, integers, decimals, dates, times, null values, and occasionally some weird types that only show up once in a while.

Text Data in Pandas

There are two ways to store text data in pandas: as an object-dtype NumPy array or as a StringDtype extension type. The object dtype is the default that pandas infers for a list of strings.

For example, we can create a pandas Series with single characters and look at the dtype. It will show the default object dtype.

# Import pandas
import pandas as pd

# Create a Series with single characters
pd.Series(['a', 'b', 'c'])
0    a
1    b
2    c
dtype: object
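
If you want the StringDtype extension type instead, you can request it explicitly with dtype="string" (available in pandas 1.0 and later):

# Import pandas
import pandas as pd

# Request the StringDtype extension type explicitly
pd.Series(['a', 'b', 'c'], dtype="string")
0    a
1    b
2    c
dtype: string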

String methods

There are many different string methods to format and clean up strings. Here are a few of the more common ways (they are generally similar to the built-in Python string methods):

Method            Description
s.str.lower()     Convert characters to lower case
s.str.upper()     Convert characters to upper case
s.str.len()       Return the length of each item
s.str.strip()     Remove leading and trailing whitespace
s.str.split('_')  Split each string on the given character
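
Here is a quick sketch of these methods on a small, made-up Series:

# A made-up Series with extra whitespace and underscores
import pandas as pd

s = pd.Series(['  Apple_Pie  ', '  Banana_Split  '])

s.str.strip()     # remove the surrounding whitespace
s.str.lower()     # lower-case each string
s.str.len()       # length of each string, including the spaces
s.str.split('_')  # ['  Apple', 'Pie  '] and ['  Banana', 'Split  ']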

Follow Along

Using another practice data set (yes, we like practice data sets), we'll apply some of the string methods to clean up the text data. To try out something different, we will use only a subset of the UFO data set (the full data set is available here).

# Read in the locally saved sample data
ufo_df = pd.read_csv('scrubbed.csv')
ufo_df.head()
datetime city state country shape duration (seconds) comments latitude longitude
0 10/10/1970 16:00 bellmore ny us disk 1800 silver disc seen by family and neighbors 40.668611 -73.527500
1 10/10/1971 21:00 lexington nc us oval 30 green oval shaped light over my local church p... 35.823889 -80.253611
2 10/10/1974 23:00 hudson ks us light 1200 The light chased us. 38.105556 -98.659722
3 10/10/1976 20:30 washougal wa us oval 60 Three extremely large lights hanging above nea... 45.582778 -122.352222
4 10/10/1980 23:30 manchester nh us light 300 A red glowing sphere stopped and watched me. 42.995556 -71.455278

It's also a good idea to get into the habit of looking at the data types in your DataFrame, which is easy using ufo_df.info().

# Display the DataFrame information
ufo_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11 entries, 0 to 10
Data columns (total 9 columns):
datetime              11 non-null object
city                  11 non-null object
state                 11 non-null object
country               11 non-null object
shape                 11 non-null object
duration (seconds)    11 non-null int64
comments              11 non-null object
latitude              11 non-null float64
longitude             11 non-null float64
dtypes: float64(2), int64(1), object(6)
memory usage: 920.0+ bytes

Text Cleaning

With the above information, we can see a mix of object dtypes and numeric dtypes (int64, float64). In the ufo_df DataFrame, some text columns need formatting help. Let's format the 'city' column as title case and the 'state' and 'country' columns as upper case.

# Each column of a DataFrame is a Series, so the .str methods apply

# Convert the city names to title case (capitalize each word)
ufo_df['city'] = ufo_df['city'].str.title()

# Convert the state and country abbreviations to upper case
ufo_df['state'] = ufo_df['state'].str.upper()
ufo_df['country'] = ufo_df['country'].str.upper()

# Display the corrected DataFrame
display(ufo_df.head())
datetime city state country shape duration (seconds) comments latitude longitude
0 10/10/1970 16:00 Bellmore NY US disk 1800 silver disc seen by family and neighbors 40.668611 -73.527500
1 10/10/1971 21:00 Lexington NC US oval 30 green oval shaped light over my local church p... 35.823889 -80.253611
2 10/10/1974 23:00 Hudson KS US light 1200 The light chased us. 38.105556 -98.659722
3 10/10/1976 20:30 Washougal WA US oval 60 Three extremely large lights hanging above nea... 45.582778 -122.352222
4 10/10/1980 23:30 Manchester NH US light 300 A red glowing sphere stopped and watched me. 42.995556 -71.455278

Challenge

Clean up any additional columns using the complete UFO data set linked above (and in the Resources). You should download and look at the data first, either in a spreadsheet or in pandas. It's also an option to truncate or only use some of the rows to keep things a little more straightforward.

Additional Resources

Guided Project

Open DS_112_EDA_Features.ipynb in the GitHub repository below to follow along with the guided project:

Guided Project Video

Module Assignment

Complete the Module 2 assignment to practice the exploratory data analysis and feature engineering techniques you've learned.

Assignment Solution Video

Resources