Module 3: Document Classification

Module Overview

In this module, we'll explore document classification, a fundamental NLP task that assigns text documents to predefined classes. We'll learn how to extract features from text data, build classification pipelines, reduce dimensionality with Latent Semantic Indexing (LSI), and benchmark different vectorization methods to improve classification performance. These skills are essential for applications such as sentiment analysis, spam detection, and topic categorization.

Learning Objectives

  • Extract text features and use them in classification pipelines
  • Apply Latent Semantic Indexing (LSI) to a document classification problem
  • Benchmark different vectorization methods in document classification tasks
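As a rough illustration of the first two objectives, the sketch below chains text feature extraction, LSI, and a classifier into one pipeline. It assumes scikit-learn, where LSI is commonly approximated by applying TruncatedSVD to a TF-IDF matrix. The toy corpus, the labels, and the names lsi_clf and doc_topics are hypothetical and are not taken from the guided project notebook.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus: 1 = positive review, 0 = negative review.
docs = [
    "smooth and rich with a sweet finish",
    "rich sweet caramel and smooth oak",
    "wonderful smooth sweet dram",
    "rich oak and sweet vanilla, wonderful",
    "harsh and bitter with a thin finish",
    "thin harsh alcohol burn, bitter",
    "bitter, harsh and unpleasant",
    "unpleasant thin and harsh burn",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Text features -> LSI (truncated SVD of the TF-IDF matrix) -> classifier.
lsi_clf = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    ("lsi", TruncatedSVD(n_components=2, random_state=42)),
    ("clf", LogisticRegression()),
])
lsi_clf.fit(docs, labels)

# Run only the vectorizer and LSI steps to inspect the reduced features.
doc_topics = lsi_clf[:-1].transform(docs)
print(doc_topics.shape)  # one row per document, one column per LSI component

# Classify unseen text with the full pipeline.
predictions = lsi_clf.predict(["sweet and smooth", "harsh bitter burn"])
print(predictions)
```

Keeping the vectorizer and the SVD step inside the pipeline means both are fitted only on the training data, which avoids leaking information from held-out documents during evaluation.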

Guided Project

Open DS_413_Document_Classification_Lecture_GP.ipynb in the GitHub repository to follow along with the guided project.

Module Assignment

Participate in a Kaggle competition to classify whisky reviews using different NLP techniques. Apply text feature extraction, LSI, and word embeddings to optimize classification performance and achieve at least 80% accuracy.
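One possible way to benchmark vectorization methods for the assignment is to score each candidate pipeline with cross-validation, as sketched below. This assumes scikit-learn. The small corpus stands in for the whisky reviews, whose actual files and column names are not specified here, and the names candidates and results are hypothetical. Word embeddings are left out of this sketch.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Hypothetical stand-in data; replace with the competition's reviews and labels.
reviews = [
    "sweet honey and vanilla with soft oak",
    "honey, vanilla and a sweet creamy finish",
    "soft fruit, sweet malt and honey",
    "creamy vanilla, sweet fruit and malt",
    "sweet oak with honey and soft fruit",
    "malt, vanilla and a soft sweet finish",
    "heavy peat smoke and salty brine",
    "smoke, peat and a long salty finish",
    "salty brine with heavy smoke and tar",
    "tar, peat smoke and iodine",
    "iodine and salty peat with smoke",
    "heavy tar, brine and peat",
]
categories = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]

# Each candidate differs only in how text is turned into features.
candidates = {
    "counts": make_pipeline(CountVectorizer(), LogisticRegression()),
    "tfidf": make_pipeline(TfidfVectorizer(), LogisticRegression()),
    "tfidf_lsi": make_pipeline(
        TfidfVectorizer(),
        TruncatedSVD(n_components=2, random_state=42),
        LogisticRegression(),
    ),
}

# Mean accuracy over 3 stratified folds for each vectorization method.
results = {}
for name, model in candidates.items():
    scores = cross_val_score(model, reviews, categories, cv=3, scoring="accuracy")
    results[name] = scores.mean()

for name, accuracy in results.items():
    print(f"{name}: {accuracy:.2f}")
```

Holding the classifier fixed while swapping only the feature step keeps the comparison focused on vectorization. Scores on a toy corpus like this one say nothing about performance on the real competition data.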
