Welcome to Sprint 13

A common form of unstructured data is the sort of information you are consuming right now: natural language, in written or spoken form.

Human language is a fascinating phenomenon and a powerful expressive tool. Still, despite its many grammar rules, language is not a fully defined, deterministic system in the way that programming languages (like Python) are. Language is considered semi-structured, but even its structure (nouns, adjectives, verbs, etc.) can be challenging to recognize. Most humans are fluent in one or more languages, yet even that fluency doesn't mean they can explicitly list or consciously articulate the "rules" they are following.

Nonetheless, human language is the main form of content on the internet (and beyond), and the ability to process it computationally at scale can lead to many compelling products. For example, a brand may want to track users' sentiment towards it on social media before and after an advertising campaign, or a news service may want to recognize the critical entities in a story in order to generate a high-quality automated summary. But text is not numbers - and even representing it as, e.g., ASCII/Unicode values doesn't capture the meaning, just the abstract labeling of symbols. So how can we hope to achieve these sorts of tasks?
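To see why raw character codes don't get us far, consider this small illustration (the example words are arbitrary): two words with closely related meanings share nothing at the level of their Unicode code points.

```python
# Character codes label symbols, not meaning: "dog" and "puppy"
# are semantically related, yet their code-point sequences share nothing.
print([ord(c) for c in "dog"])    # → [100, 111, 103]
print([ord(c) for c in "puppy"])  # → [112, 117, 112, 112, 121]
```

Any notion of similarity between the two words has to come from somewhere other than the raw encoding - which is exactly the gap the techniques in this sprint aim to fill.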

In this sprint, we will learn assorted NLP (Natural Language Processing) techniques for cleaning and preprocessing text, allowing us to feed the data into more traditional statistical models. We will also address some more advanced, specialized models that are particularly well suited to NLP.

Sprint Modules

Module 1

Natural Language Processing - Introduction

Learn foundational NLP concepts including tokenization, stop word removal, and text normalization. Human languages are far less structured than computer languages, which makes it challenging for machines - lacking lived experience for context - to understand sarcasm, irony, synonyms, and nuance.
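As a taste of these preprocessing steps, here is a minimal sketch using only the standard library; the stop word list is a tiny hand-picked illustration (NLP libraries ship far larger ones), and the regex-based tokenizer is deliberately simplistic.

```python
import re

# A tiny, hand-picked stop word list for illustration only
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "and", "to", "in"}

def preprocess(text):
    # Normalize: lowercase the text
    text = text.lower()
    # Tokenize: split into alphabetic tokens, dropping punctuation
    tokens = re.findall(r"[a-z']+", text)
    # Filter: remove stop words
    return [t for t in tokens if t not in STOP_WORDS]

print(preprocess("The cats ARE sitting in the garden!"))
# → ['cats', 'sitting', 'garden']
```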

View Module

Module 2

Vector Representations

Transform text into numerical formats for machine learning. Explore Bag of Words, TF-IDF vectorization, and word embedding models. Learn to represent documents as vectors for search, visualization, and classification tasks.
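The core idea behind TF-IDF can be sketched in a few lines of plain Python; the toy corpus is made up for illustration, and this uses raw term frequency with an unsmoothed inverse document frequency (real vectorizers typically add smoothing and normalization).

```python
import math
from collections import Counter

# Hypothetical toy corpus
docs = ["the cat sat on the mat",
        "the dog chased the cat",
        "dogs and cats make good pets"]
tokenized = [d.split() for d in docs]
vocab = sorted({w for d in tokenized for w in d})

N = len(tokenized)
# Document frequency: in how many documents does each word appear?
df = {w: sum(1 for d in tokenized if w in d) for w in vocab}

def tfidf(doc):
    tf = Counter(doc)
    # Term frequency times inverse document frequency, over the whole vocabulary
    return [tf[w] * math.log(N / df[w]) for w in vocab]

# Each document becomes a fixed-length numeric vector
vectors = [tfidf(d) for d in tokenized]
```

Words appearing in every document (here, none do except near-ubiquitous ones) get an IDF of zero, so they contribute nothing - which is exactly the intuition behind down-weighting common terms.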

View Module

Module 3

Document Classification

Link text feature extraction with classification techniques for document classification problems. Apply machine learning algorithms to categorize text documents using vectorized representations.
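One classic way to link word counts to classification is Naive Bayes; below is a minimal from-scratch sketch with add-one (Laplace) smoothing, using a made-up spam/ham corpus purely for illustration.

```python
import math
from collections import Counter, defaultdict

# Hypothetical labelled training corpus
train = [("free money win prize", "spam"),
         ("meeting agenda attached", "ham"),
         ("win cash now", "spam"),
         ("project report due", "ham")]

# Per-class word counts and class frequencies
word_counts = defaultdict(Counter)
class_counts = Counter()
for text, label in train:
    class_counts[label] += 1
    word_counts[label].update(text.split())

vocab = {w for counts in word_counts.values() for w in counts}

def predict(text):
    scores = {}
    for label in class_counts:
        # Log prior for the class
        score = math.log(class_counts[label] / sum(class_counts.values()))
        total = sum(word_counts[label].values())
        for w in text.split():
            # Log likelihood with add-one smoothing over the vocabulary
            score += math.log((word_counts[label][w] + 1) / (total + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

print(predict("win free cash"))  # → spam
```

In practice you would pair a vectorizer with a library classifier rather than hand-rolling this, but the arithmetic is the same.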

View Module

Module 4

Topic Modeling

Discover hidden topics in text using unsupervised learning. Master Latent Dirichlet Allocation (LDA) for text mining, dimensionality reduction, information retrieval, and understanding document structure.
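To give a flavor of how LDA is typically fit, here is a compact collapsed Gibbs sampling sketch on a made-up four-document corpus; the hyperparameters, corpus, and topic count are all arbitrary choices for illustration, not a production implementation.

```python
import random
from collections import defaultdict

# Hypothetical toy corpus: two latent topics (pets vs. finance)
docs = [["cat", "dog", "cat"], ["stock", "market", "stock"],
        ["dog", "cat", "pet"], ["market", "trade", "stock"]]
K = 2                     # number of topics
alpha, beta = 0.1, 0.01   # Dirichlet hyperparameters (arbitrary)
vocab = sorted({w for d in docs for w in d})
V = len(vocab)
random.seed(0)

# Randomly initialize a topic for every word occurrence
z = [[random.randrange(K) for _ in d] for d in docs]
ndk = [[0] * K for _ in docs]               # doc-topic counts
nkw = [defaultdict(int) for _ in range(K)]  # topic-word counts
nk = [0] * K                                # words per topic
for d, doc in enumerate(docs):
    for i, w in enumerate(doc):
        t = z[d][i]
        ndk[d][t] += 1; nkw[t][w] += 1; nk[t] += 1

# Collapsed Gibbs sampling: resample each word's topic in turn
for _ in range(200):
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            t = z[d][i]
            ndk[d][t] -= 1; nkw[t][w] -= 1; nk[t] -= 1
            # P(topic k) ∝ (doc-topic count + alpha) * smoothed word prob
            weights = [(ndk[d][k] + alpha) * (nkw[k][w] + beta) / (nk[k] + V * beta)
                       for k in range(K)]
            t = random.choices(range(K), weights)[0]
            z[d][i] = t
            ndk[d][t] += 1; nkw[t][w] += 1; nk[t] += 1
```

After sampling, `ndk` estimates each document's topic mixture and `nkw` each topic's word distribution; library implementations (e.g., in scikit-learn or gensim) wrap this machinery behind a fit/transform interface.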

View Module

Sprint Resources