Ramsha Munawar
Skillsoft-issued completion badges are earned by viewing the required percentage of a course or by receiving a passing score when an assessment is required.

Tokenization, stemming, and lemmatization are essential natural language processing (NLP) tasks. Tokenization breaks text into units (tokens), such as words or phrases, to facilitate analysis. Stemming reduces words to a common base form by stripping prefixes or suffixes, favoring simplicity of representation. In contrast, lemmatization uses grammatical context to map words to their base, or dictionary, form.

You will begin this course by tokenizing text with the Natural Language Toolkit (NLTK) and SpaCy, splitting a large block of text into smaller units called tokens, usually words or sentences. You will then remove stopwords, common words such as "a" and "the" that add little meaning to text. Next, you'll explore the WordNet lexical database, which encodes semantic relationships between words: you'll use Synsets to view similar words and explore hypernyms, hyponyms, meronyms, and holonyms. Finally, you'll compare stemming and lemmatization, exploring both processes with NLTK and performing lemmatization with SpaCy. The sketches below illustrate each of these steps.
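A minimal tokenization sketch with NLTK and SpaCy might look like the following. It assumes the NLTK "punkt" tokenizer data and the SpaCy "en_core_web_sm" pipeline are installed; the sample text is illustrative, not from the course.

```python
import nltk
import spacy

nltk.download("punkt", quiet=True)  # fetch tokenizer models if missing

text = "NLP breaks text into tokens. Tokens are usually words or sentences."

# NLTK: split the text into sentences and into words
print(nltk.sent_tokenize(text))
print(nltk.word_tokenize(text))

# SpaCy: tokenization happens when the pipeline processes the text
nlp = spacy.load("en_core_web_sm")
doc = nlp(text)
print([token.text for token in doc])      # word-level tokens
print([sent.text for sent in doc.sents])  # sentence spans
```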
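Stopword removal can then be sketched with NLTK's bundled English stopword list (again, the sample sentence is illustrative):

```python
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download("stopwords", quiet=True)  # fetch the stopword list if missing

stop_words = set(stopwords.words("english"))
tokens = word_tokenize("The cat sat on a mat and stared at the dog.")

# Keep only tokens that are not in the stopword list
filtered = [t for t in tokens if t.lower() not in stop_words]
print(filtered)  # e.g. ['cat', 'sat', 'mat', 'stared', 'dog', '.']
```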
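A brief WordNet sketch, using the first sense of "tree" as an arbitrary example, shows Synsets and the four semantic relations named above:

```python
import nltk
from nltk.corpus import wordnet

nltk.download("wordnet", quiet=True)  # fetch the WordNet database if missing

# A Synset groups words that share one sense; "tree" has several senses.
tree = wordnet.synsets("tree")[0]  # first sense, Synset('tree.n.01')
print(tree.definition())
print([lemma.name() for lemma in tree.lemmas()])  # similar words in this sense

print(tree.hypernyms())        # more general concepts (e.g., woody plant)
print(tree.hyponyms())         # more specific kinds of tree
print(tree.part_meronyms())    # parts of a tree (e.g., trunk, limb)
print(tree.member_holonyms())  # wholes a tree belongs to (e.g., forest)
```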
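Finally, a sketch contrasting the two normalization approaches, using NLTK's PorterStemmer and WordNetLemmatizer alongside SpaCy's per-token lemma (the word list is illustrative):

```python
import nltk
import spacy
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("wordnet", quiet=True)  # the lemmatizer relies on WordNet

words = ["running", "studies", "geese"]
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Stemming chops affixes; the result need not be a dictionary word.
print([stemmer.stem(w) for w in words])                   # ['run', 'studi', 'gees']
# Lemmatization maps to dictionary forms, guided by part of speech.
print([lemmatizer.lemmatize(w, pos="n") for w in words])  # ['running', 'study', 'goose']

# SpaCy lemmatizes every token as part of its pipeline.
nlp = spacy.load("en_core_web_sm")
doc = nlp("The geese were running across the field.")
print([(t.text, t.lemma_) for t in doc])
```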
Issued on: February 6, 2025
Expires on: Does not expire