Module 3: Document Classification

Module Overview

In this module, we'll explore document classification, a fundamental NLP task that involves categorizing text documents into predefined classes. We'll learn how to extract features from text data, implement classification pipelines, apply dimensionality reduction techniques like Latent Semantic Indexing (LSI), and benchmark different vectorization methods to optimize classification performance. These skills are essential for applications such as sentiment analysis, spam detection, and topic categorization.
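As a rough illustration of how these pieces fit together, the sketch below chains TF-IDF feature extraction, an LSI step, and a classifier into one pipeline. It assumes scikit-learn, where LSI is commonly approximated with TruncatedSVD applied to a TF-IDF matrix. The toy corpus, labels, and parameter values are hypothetical and are not taken from the guided project.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus for a spam-detection style task (illustration only).
docs = [
    "win a free prize now",
    "free cash offer click now",
    "claim your free prize today",
    "limited offer win cash today",
    "meeting agenda for monday",
    "please review the project report",
    "notes from the team meeting",
    "project schedule and report draft",
]
labels = ["spam"] * 4 + ["ham"] * 4

# Each step feeds the next: raw text -> TF-IDF matrix -> 2 latent topics -> classifier.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("lsi", TruncatedSVD(n_components=2, random_state=0)),  # assumed tiny topic count
    ("clf", LogisticRegression()),
])

pipeline.fit(docs, labels)
predictions = pipeline.predict(["free prize offer", "team project meeting"])
print(predictions)
```

On a real dataset the number of LSI components would typically be much larger than 2 and is worth tuning alongside the classifier's settings.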

Learning Objectives

Extract text features and use them in classification pipelines
Apply Latent Semantic Indexing (LSI) to a document classification problem
Benchmark different vectorization methods in document classification tasks
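
One possible way to approach the benchmarking objective is to hold the classifier fixed and swap only the vectorization step, scoring each variant with cross-validation. The sketch below assumes scikit-learn. The toy reviews, the candidate names, and the fold count are hypothetical choices made for illustration, and the scores from such a small sample say nothing about real performance.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Hypothetical toy reviews for a sentiment-style task (illustration only).
docs = [
    "smooth rich and wonderful finish",
    "wonderful aroma and smooth taste",
    "rich flavor with a lovely finish",
    "lovely smooth and rich taste",
    "harsh bitter and unpleasant finish",
    "unpleasant aroma and harsh taste",
    "bitter flavor with a thin finish",
    "thin harsh and bitter taste",
]
labels = ["good"] * 4 + ["bad"] * 4

# Same classifier in every candidate; only the vectorization differs.
candidates = {
    "counts": make_pipeline(CountVectorizer(), LogisticRegression()),
    "tfidf": make_pipeline(TfidfVectorizer(), LogisticRegression()),
    "tfidf+lsi": make_pipeline(
        TfidfVectorizer(),
        TruncatedSVD(n_components=2, random_state=0),
        LogisticRegression(),
    ),
}

# Mean accuracy over 2 stratified folds (a small fold count to suit the tiny sample).
scores = {}
for name, model in candidates.items():
    fold_scores = cross_val_score(model, docs, labels, cv=2, scoring="accuracy")
    scores[name] = fold_scores.mean()

for name, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {score:.2f}")
```

Because each vectorizer is fitted inside the pipeline, it only sees the training folds, which keeps the comparison free of leakage from the held-out data.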

Guided Project

Open DS_413_Document_Classification_Lecture_GP.ipynb in the GitHub repository to follow along with the guided project.

GitHub Repo
Slides
Guided Project Solution

Module Assignment

Participate in a Kaggle competition to classify whisky reviews using different NLP techniques. Apply text feature extraction, LSI, and word embeddings to optimize classification performance and achieve at least 80% accuracy.

Assignment Solution Video

Additional Resources

Text Classification and Feature Extraction

Dimensionality Reduction and LSI

Kaggle Competitions and Benchmarking