Goal of the tutorial: To provide a practical, comprehensive guide to text preprocessing in Python.
What you will learn: By the end of this tutorial, you will know the main text preprocessing techniques (tokenization, stop word removal, stemming, and lemmatization) and how to apply them using Python's Natural Language Toolkit (NLTK).
Prerequisites: Basic knowledge of the Python programming language and a basic understanding of Natural Language Processing (NLP) would be beneficial.
Text preprocessing is a crucial step in any Natural Language Processing task. It helps in cleaning and simplifying text, which may improve your model's performance. Here are the main steps involved in text preprocessing:
Tokenization: This is the process of breaking down the text into individual words or tokens.
Removing Stop words: Stop words are common words that do not contribute much to the content or meaning of a document (e.g., "the", "is", "in"). We remove them to reduce the amount of noise in the text.
Stemming: This process cuts a word down to its root form by stripping suffixes. For instance, "running" and "runs" are both reduced to "run". Stemming is a crude, rule-based heuristic: irregular forms such as "ran" are left unchanged, and the result is not always a real dictionary word (see Example 3 below).
Lemmatization: Similar to stemming, this process reduces words to their dictionary base form (the lemma), but it takes the word's part of speech into account and handles irregular forms. For example, "better" is lemmatized to "good" when treated as an adjective, and "geese" becomes "goose" (see Example 4 below).
Here are some practical examples. We will use NLTK for these operations.
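If you do not have NLTK installed yet, it is available from PyPI and can typically be installed with:
pip install nltk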
Example 1: Tokenization
import nltk
nltk.download('punkt') # Downloading the punkt tokenizer models (newer NLTK releases may need 'punkt_tab' instead)
from nltk.tokenize import word_tokenize
text = "This is an example sentence. We will tokenize this sentence."
tokens = word_tokenize(text) # Split the text into individual tokens
print(tokens)
In this code snippet, we first import the necessary packages. We then use the word_tokenize
function from NLTK to tokenize our example sentence.
Expected Output:
['This', 'is', 'an', 'example', 'sentence', '.', 'We', 'will', 'tokenize', 'this', 'sentence', '.']
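As an aside, NLTK can also split text into sentences rather than words. A minimal sketch, reusing the same text variable:
from nltk.tokenize import sent_tokenize
sentences = sent_tokenize(text) # Split the text into sentences
print(sentences)
This should print ['This is an example sentence.', 'We will tokenize this sentence.'].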
Example 2: Removing Stop words
from nltk.corpus import stopwords
nltk.download('stopwords') # Downloading the stopwords package
stop_words = set(stopwords.words('english')) # The English stop words, all lowercase
filtered_sentence = [word for word in tokens if word not in stop_words]
print(filtered_sentence)
We first download and import the stopwords package, then use a list comprehension to keep only the tokens that are not in the set of English stop words. Note that the comparison is case-sensitive: the stop word list is lowercase, so capitalized tokens such as "This" and "We" survive the filter, as the output below shows.
Expected Output:
['This', 'example', 'sentence', '.', 'We', 'tokenize', 'sentence', '.']
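If you want the filter to also catch capitalized stop words such as "This" and "We", a common variation is to lowercase each token before the membership test (whether you want this depends on your task):
filtered_lower = [word for word in tokens if word.lower() not in stop_words]
print(filtered_lower)
This keeps the original tokens but compares their lowercase forms against the stop word set, so the output reduces to ['example', 'sentence', '.', 'tokenize', 'sentence', '.'].
Example 3: Stemming
The steps above describe stemming conceptually; here is a short example using NLTK's PorterStemmer. The word list is our own illustration, chosen to show that stems are not always real words.
from nltk.stem import PorterStemmer
ps = PorterStemmer() # The classic rule-based suffix stripper
words = ["running", "runs", "easily", "fairly"]
stemmed_words = [ps.stem(word) for word in words]
print(stemmed_words)
Expected Output:
['run', 'run', 'easili', 'fairli']
Note that "easili" and "fairli" are not dictionary words: stemming only applies mechanical suffix rules.
Example 4: Lemmatization
Lemmatization requires the WordNet data, and unlike stemming it can use a part-of-speech hint passed through the pos argument ('v' for verb, 'a' for adjective; the default is noun).
nltk.download('wordnet') # Downloading the WordNet data used by the lemmatizer
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("running", pos="v")) # verb -> run
print(lemmatizer.lemmatize("better", pos="a")) # adjective -> good
print(lemmatizer.lemmatize("geese")) # default part of speech is noun -> goose
Expected Output:
run
good
goose
Unlike the stemmer, the lemmatizer returns real dictionary words, but it depends on the pos hint: without pos="v", "running" would be treated as a noun and returned unchanged.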
In this tutorial, we have covered the basics of text preprocessing in Python using NLTK: tokenization, stop word removal, stemming, and lemmatization.
To continue learning, you can explore other techniques like POS tagging, Named Entity Recognition (NER), and syntactic parsing. For additional resources, you can check out the NLTK documentation and the book "Natural Language Processing with Python".
Exercise 1: Tokenize the following sentence: "NLTK is a leading platform for building Python programs to work with human language data."
Exercise 2: After tokenization, remove stop words from the tokens obtained in Exercise 1.
Exercise 3: Perform stemming on the tokens obtained in Exercise 2.
Solutions:
Exercise 1:
text = "NLTK is a leading platform for building Python programs to work with human language data."
tokens = word_tokenize(text)
print(tokens)
Exercise 2:
filtered_sentence = [word for word in tokens if word not in stop_words] # Reuses stop_words from Example 2
print(filtered_sentence)
Exercise 3:
from nltk.stem import PorterStemmer
ps = PorterStemmer()
stemmed_words = [ps.stem(word) for word in filtered_sentence] # Stem each remaining token
print(stemmed_words)
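For reference, assuming NLTK's default models and the code above, the outputs should look like this (the Porter stemmer lowercases its input, so "NLTK" becomes "nltk"):
Exercise 1: ['NLTK', 'is', 'a', 'leading', 'platform', 'for', 'building', 'Python', 'programs', 'to', 'work', 'with', 'human', 'language', 'data', '.']
Exercise 2: ['NLTK', 'leading', 'platform', 'building', 'Python', 'programs', 'work', 'human', 'language', 'data', '.']
Exercise 3: ['nltk', 'lead', 'platform', 'build', 'python', 'program', 'work', 'human', 'languag', 'data', '.']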
Keep practicing with more complex sentences and larger text data for better understanding and proficiency in text preprocessing.