
Fundamentals of NLP: Representing Text as Numeric Features

Skillsoft-issued completion badges are earned by viewing the required percentage of the course or by receiving a passing score when an assessment is required.

When performing sentiment classification with machine learning, text must be encoded into a numeric format because machine learning models can only process numbers, not raw text. Common encoding techniques for text data include one-hot encoding, count vector encoding, and word embeddings. In this course, you will learn how to use one-hot encoding, a simple technique that builds a vocabulary from all the words in your text corpus. Next, you will move on to count vector encoding, which tracks word frequency in each document, and explore term frequency–inverse document frequency (TF-IDF) encoding, which also builds vocabularies and document vectors but represents each word with its TF-IDF score. Finally, you will perform sentiment analysis on encoded text: you will encode your input data with a count vector, set up a Gaussian Naïve Bayes model, train it, and evaluate its metrics. You will also explore how to improve model performance by stemming words, removing stopwords, and using N-grams.
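As an illustration of the count vector plus Gaussian Naïve Bayes workflow the description outlines, here is a minimal sketch using scikit-learn. The toy reviews, labels, and parameter choices (such as English stopword removal and the unigram/bigram range) are assumptions for demonstration only and are not taken from the course material.

```python
# Minimal sketch: count vector encoding + Gaussian Naive Bayes sentiment model.
# The corpus below is a hypothetical toy example.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# Hypothetical toy corpus: 1 = positive sentiment, 0 = negative.
reviews = [
    "great movie loved it",
    "terrible plot and bad acting",
    "wonderful performance great story",
    "bad movie waste of time",
]
labels = [1, 0, 1, 0]

# Count vector encoding: build a vocabulary and count word frequencies
# per document. stop_words and ngram_range show how stopword removal
# and N-grams can be added, as the course description mentions.
vectorizer = CountVectorizer(stop_words="english", ngram_range=(1, 2))
X = vectorizer.fit_transform(reviews).toarray()  # GaussianNB needs dense input

# Train a Gaussian Naive Bayes classifier on the encoded text.
model = GaussianNB()
model.fit(X, labels)

# Evaluate on the training data (a real workflow would use a held-out test set).
predictions = model.predict(X)
print("Accuracy:", accuracy_score(labels, predictions))

# Classify a new, unseen review.
new_review = vectorizer.transform(["loved the great acting"]).toarray()
print("Predicted sentiment:", model.predict(new_review)[0])
```

Swapping `CountVectorizer` for `TfidfVectorizer` would give the TF-IDF variant described above, with no other changes to the pipeline.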

Issued on

February 5, 2025

Expires on

Does not expire