PolarSPARC

Basics of Natural Language Processing using NLTK


Bhaskar S 01/22/2023


Overview


Natural Language Processing (or NLP for short) deals with the processing and analysis of unstructured language interactions between computer systems and humans, with the intent of building interactive real-world applications that can interpret the context of those interactions and react accordingly.

The following are some of the common use-cases for NLP: sentiment analysis, machine translation, chatbots and virtual assistants, text summarization, and spam detection.

Natural Language Toolkit (or NLTK for short) is one of the popular Python based open source natural language processing libraries for performing various language processing tasks.


Installation and Setup


Installation is assumed to be on a Linux desktop running Ubuntu 22.04 LTS using the user alice. Also, it is assumed that Python 3 is set up in a Virtual Environment.
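If a Virtual Environment has not been set up yet, one could create and activate it by executing the following commands in the Terminal (the directory name $HOME/venv is just an assumption for illustration):

$ python3 -m venv $HOME/venv

$ source $HOME/venv/bin/activate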

Open a Terminal window in which we will execute the various commands.

Let us first create a directory named /home/alice/nltk_data by executing the following command in the Terminal:

$ mkdir -p $HOME/nltk_data

Now, to install the required Python packages - nltk, jupyter, scikit-learn, and wordcloud - execute the following commands in the Terminal:

$ pip install nltk

$ pip install jupyter

$ pip install scikit-learn

$ pip install wordcloud
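To quickly verify that the nltk package installed correctly, one could execute the following command in the Terminal:

$ python -c "import nltk; print(nltk.__version__)"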

Next, we need to download and setup all the NLTK provided data packages.

Execute the following command in the Terminal:

$ python

The prompt will change to >>> indicating the Python REPL mode.

Execute the following commands in the Python REPL:

>>> import nltk

>>> nltk.download()

The following would be the typical output:

Output.1

NLTK Downloader
---------------------------------------------------------------------------
    d) Download   l) List    u) Update   c) Config   h) Help   q) Quit
---------------------------------------------------------------------------

The prompt will change to Downloader> indicating the NLTK REPL mode.

Type the letter c and press ENTER in the NLTK REPL as shown below:

Downloader> c

The following would be the typical output:

Output.2

Data Server:
  - URL: <https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml>
  - 7 Package Collections Available
  - 111 Individual Packages Available

Local Machine:
  - Data directory: /usr/local/share/nltk_data

The prompt will change to Config> indicating the NLTK Config REPL mode.

Type the letter d and press ENTER in the NLTK Config REPL as shown below:

Config> d

The prompt will change to New Directory>, where we will type the NLTK data directory and press ENTER as shown below:

New Directory> /home/alice/nltk_data

The prompt will change to Config> indicating the NLTK Config REPL mode.

Type the letter m and press ENTER in the NLTK Config REPL as shown below:

Config> m

The following would be the typical output:

Output.3

---------------------------------------------------------------------------
    d) Download   l) List    u) Update   c) Config   h) Help   q) Quit
---------------------------------------------------------------------------

The prompt will change to Downloader> indicating the NLTK REPL mode.

Type the letter l and press ENTER in the NLTK REPL as shown below:

Downloader> l

The following would be the typical output:

Output.4

Packages:
  [ ] abc................. Australian Broadcasting Commission 2006
  [ ] alpino.............. Alpino Dutch Treebank
  [ ] averaged_perceptron_tagger Averaged Perceptron Tagger
  [ ] averaged_perceptron_tagger_ru Averaged Perceptron Tagger (Russian)
  [ ] basque_grammars..... Grammars for Basque
  [ ] biocreative_ppi..... BioCreAtIvE (Critical Assessment of Information
                            Extraction Systems in Biology)
  [ ] bllip_wsj_no_aux.... BLLIP Parser: WSJ Model
  [ ] book_grammars....... Grammars from NLTK Book
  [ ] brown............... Brown Corpus
  [ ] brown_tei........... Brown Corpus (TEI XML Version)
  [ ] cess_cat............ CESS-CAT Treebank
  [ ] cess_esp............ CESS-ESP Treebank
  [ ] chat80.............. Chat-80 Data Files
  [ ] city_database....... City Database
  [ ] cmudict............. The Carnegie Mellon Pronouncing Dictionary (0.6)
  [ ] comparative_sentences Comparative Sentence Dataset
  [ ] comtrans............ ComTrans Corpus Sample
  [ ] conll2000........... CONLL 2000 Chunking Corpus
  [ ] conll2002........... CONLL 2002 Named Entity Recognition Corpus
Hit Enter to continue: 

Press ENTER each time to page through the various NLTK packages. In the end, one would end up with an output as shown below:

Output.5

Collections:
  [ ] all-corpora......... All the corpora
Hit Enter to continue: 
  [ ] all-nltk............ All packages available on nltk_data gh-pages
                            branch
  [ ] all................. All packages
  [ ] book................ Everything used in the NLTK Book
  [ ] popular............. Popular packages
  [ ] tests............... Packages for running tests
  [ ] third-party......... Third-party data packages

([*] marks installed packages)

---------------------------------------------------------------------------
    d) Download   l) List    u) Update   c) Config   h) Help   q) Quit
---------------------------------------------------------------------------

Type the letter d and press ENTER in the NLTK REPL as shown below:

Downloader> d

The following would be the typical output:

Output.6

Download which package (l=list; x=cancel)?

The prompt will change to Identifier> indicating the NLTK Download REPL mode.

Type the word all and press ENTER in the NLTK Download REPL as shown below:

Identifier> all

This will download and set up all the NLTK data packages, and at the end the following would be the typical output:

Output.7

Downloading collection 'all'
      | 
      | Downloading package abc to /home/alice/nltk_data...
      |   Unzipping corpora/abc.zip.
      | Downloading package alpino to /home/alice/nltk_data...
      |   Unzipping corpora/alpino.zip.
      | Downloading package averaged_perceptron_tagger to /home/alice/nltk_data...
      |   Unzipping taggers/averaged_perceptron_tagger.zip.
... SNIP ...
      | Downloading package wordnet_ic to /home/alice/nltk_data...
      |   Unzipping corpora/wordnet_ic.zip.
      | Downloading package words to /home/alice/nltk_data...
      |   Unzipping corpora/words.zip.
      | Downloading package ycoe to /home/alice/nltk_data...
      |   Unzipping corpora/ycoe.zip.
      | 
    Done downloading collection all

---------------------------------------------------------------------------
    d) Download   l) List    u) Update   c) Config   h) Help   q) Quit
---------------------------------------------------------------------------

Type the letter q and press ENTER in the NLTK REPL as shown below:

Downloader> q

The prompt will change to >>> indicating the Python REPL mode.

Execute the following command in the Python REPL to exit:

>>> exit()

We are now DONE with the installation and setup process.
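Note that, as an alternative to the interactive downloader walked through above, all the NLTK data packages could also be downloaded non-interactively. The following is a minimal sketch (assuming the same data directory /home/alice/nltk_data) of the commands one would execute in the Python REPL:

>>> import nltk

>>> nltk.download('all', download_dir='/home/alice/nltk_data')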


Hands-on NLTK


All the commands in this hands-on section will be using the Jupyter notebook.

To launch the Jupyter Notebook, execute the following command in the Terminal:

$ jupyter notebook

To load the nltk Python module, run the following statement in the Jupyter cell:

import nltk

To set the correct path to the nltk data packages, run the following statement in the Jupyter cell:

nltk.data.path.append("/home/alice/nltk_data")

To list all the texts from Project Gutenberg included in nltk, run the following statement in the Jupyter cell:

nltk.corpus.gutenberg.fileids()

The following would be a typical output:


NLTK Files
Figure.1

Tokenization is the process of breaking a text into either its sentences or its words. For example, the text Alice in Wonderland is tokenized into a list of words - ['Alice', 'in', 'Wonderland'].
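As a quick illustration, one could run the following statements in a Jupyter cell to tokenize a small sample text into its sentences and its words (the sample text is just an assumption for illustration):

from nltk.tokenize import sent_tokenize, word_tokenize

sample = 'Alice lives in New York. She loves NLP.'

# Split the sample text into sentences
sent_tokenize(sample)

# Split the sample text into words (punctuation becomes separate tokens)
word_tokenize(sample)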

To load and get all the words from the text Alice in Wonderland included in nltk, run the following statement in the Jupyter cell:

words = nltk.corpus.gutenberg.words('carroll-alice.txt')
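To get a quick feel for the result, one could display the total number of words and the first few words by running the following statements in the Jupyter cell:

len(words)

words[:10]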

In this demo, we will use a custom text file charles-christmas_carol.txt for A Christmas Carol from Project Gutenberg.

To load the custom text charles-christmas_carol.txt, convert it to lower case, and tokenize it into words, run the following statements in the Jupyter cell:

from nltk.tokenize import word_tokenize

with open('txt_data/charles-christmas_carol.txt') as fcc:

    all_tokens = word_tokenize(fcc.read().lower())

Stop Words are the commonly used filler words (such as 'a', 'and', 'on', etc) that add little value to the context of the text.

To display all the Stop Words from the English language defined in nltk, run the following statements in the Jupyter cell:

from nltk.corpus import stopwords

stop_words = stopwords.words('english')

stop_words

The following would be a typical output showing an initial subset of entries:


NLTK Stop Words
Figure.2

Normalization is the process of cleaning the text to make it more uniform, such as removing the punctuation and the Stop Words.

To remove all the punctuation from our custom text, run the following statement in the Jupyter cell:

cc_tokens = [word for word in all_tokens if word.isalpha()]

To remove all the Stop Words from our custom text, run the following statement in the Jupyter cell:

all_words = [word for word in cc_tokens if word not in stop_words]
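To observe the effect of the normalization steps, one could compare the token counts before and after by running the following statement in the Jupyter cell:

len(all_tokens), len(cc_tokens), len(all_words)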

Frequency Distribution is the count of the occurrences of each word in a text.

To display the Frequency Distribution of the words from our custom text, run the following statements in the Jupyter cell:

all_fd = nltk.FreqDist(all_words)

all_fd

The following would be a typical output showing a subset of entries:


NLTK Freq Dist
Figure.3

To display the 20 most commonly used words from our custom text, run the following statement in the Jupyter cell:

all_fd.most_common(20)

The following would be a typical output:


NLTK Most Common
Figure.4
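As a side note, the wordcloud package installed earlier could be used to visualize this frequency distribution as a word cloud image. The following is a minimal sketch (the output file name wordcloud.png is just an assumption for illustration):

from wordcloud import WordCloud

# Generate a word cloud image from the frequency distribution (a FreqDist behaves like a dict of word counts)
wc = WordCloud(width=800, height=600, background_color='white').generate_from_frequencies(all_fd)

# Save the generated word cloud image to a file
wc.to_file('wordcloud.png')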

Hapaxes are the unique words that occur only once in the entire text.

To display all the Hapaxes from our custom text, run the following statements in the Jupyter cell:

single_words = all_fd.hapaxes()

single_words

The following would be a typical output showing an initial subset of entries:


NLTK Hapaxes
Figure.5

Stemming is the process of reducing words to their root forms. For example, the words walked and walking will be reduced to the root word walk.

To create and display the root words for all the words from our custom text, run the following statements in the Jupyter cell:

lancaster = nltk.LancasterStemmer()

cc_stems = [lancaster.stem(word) for word in all_words]

cc_stems

The following would be a typical output showing an initial subset of entries:


NLTK Root Words
Figure.6
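Note that nltk provides other stemmers as well, such as the Porter Stemmer, which is generally less aggressive than the Lancaster Stemmer used above. As a sketch, one could generate the root words using the Porter Stemmer and compare the results by running the following statements in the Jupyter cell:

porter = nltk.PorterStemmer()

cc_stems2 = [porter.stem(word) for word in all_words]

cc_stems2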

Parts of Speech are part of a language's grammar and indicate the role of a word in the text.

The English language has 8 parts of speech, which are as follows:

Part of Speech    Role                                             Example
--------------    ----                                             -------
Noun              A person, place, or thing                        Alice, New York, Apple
Pronoun           Replaces a Noun                                  He, She, You
Adjective         Describes a Noun                                 Big, Colorful, Great
Verb              An action or state of being                      Can, Go, Learn
Adverb            Describes a Verb, Adjective, or another Adverb   Always, Silently, Very
Preposition       Links a Noun or Pronoun to another word          At, From, To
Conjunction       Joins words or phrases                           And, But, So
Interjection      An Exclamation                                   Oh, Ouch, Wow

To display all the abbreviations of the different Parts of Speech from the English language defined in nltk, run the following statement in the Jupyter cell:

nltk.help.upenn_tagset()

The following would be a typical output showing a subset of entries:


NLTK Parts of Speech
Figure.7

To determine and display the Parts of Speech for all the words from our custom text, run the following statements in the Jupyter cell:

cc_pos2 = nltk.pos_tag(all_words, tagset='universal')

cc_pos2

The following would be a typical output showing an initial subset of entries:


NLTK POS
Figure.8

Chunking is the process of identifying groups of words from a text that make sense to be together, such as noun phrases, verb phrases, etc. For example, given the text 'Alice likes to travel to New York City', word Tokenization will split 'New', 'York', and 'City' into separate tokens. With Chunking, they will be grouped together as a single phrase, which makes more sense.

Chunking works on the words after they have been tagged with their corresponding Parts of Speech.
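To make the earlier example concrete, the following is a minimal sketch of chunking the sample sentence about 'New York City', using a simple grammar that groups one or more consecutive nouns into a single noun phrase (the sample sentence and the grammar are just for illustration):

sample = 'Alice likes to travel to New York City'

sample_pos = nltk.pos_tag(nltk.word_tokenize(sample), tagset='universal')

# Group one or more consecutive nouns into a NOUN_PHRASE chunk
sample_parser = nltk.RegexpParser(r'NOUN_PHRASE: {<NOUN>+}')

print(sample_parser.parse(sample_pos))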

To determine and display the noun phrases for all the words (with their Parts of Speech tagged) from our custom text, run the following statements in the Jupyter cell:

# Chunk grammar: zero or more adjectives followed by one or more nouns form a noun phrase
parser = nltk.RegexpParser(r'NOUN_PHRASE: {<ADJ>*<NOUN>+}')

cc_chunks = parser.parse(cc_pos2)

print(cc_chunks)

The following would be a typical output showing an initial subset of entries:


NLTK Chunks
Figure.9

Named Entity Recognition is the process of identifying named entities such as people, locations, organizations, etc.

Like Chunking, the Named Entity Recognition works on the words after they have been tagged with their corresponding Parts of Speech.

To determine and display the named entities for a given simple text, run the following statements in the Jupyter cell:

text = 'Alice is the CEO of Acme International and has been tasked with building a Space Ship for Mars'

words = nltk.word_tokenize(text)

tags = nltk.pos_tag(words)

ner = nltk.ne_chunk(tags)

print(ner)

The following would be a typical output showing all of the entries:


NLTK Named Entities
Figure.10
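Note that nltk.ne_chunk() also accepts an optional binary=True argument, in which case the named entities are tagged with the generic label NE instead of the specific categories (such as PERSON, ORGANIZATION, etc.). The following is a quick sketch:

ner2 = nltk.ne_chunk(tags, binary=True)

print(ner2)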

Until now, we have been processing words one at a time (referred to as Unigrams). Sometimes we may have to process words two at a time (referred to as Bigrams) or words three at a time (referred to as Trigrams) to extract the intent of the text. A more generic way of processing n-words at a time is referred to as n-grams.

To find and display the bigrams from our custom text, run the following statements in the Jupyter cell:

from nltk.util import ngrams

all_bigrams = list(ngrams(all_words, 2))

all_bigrams

The following would be a typical output showing an initial subset of entries:


NLTK Bigrams
Figure.11

The second argument to ngrams(all_words, 2) indicates the number of words to consider at a time.
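For example, to find and display the trigrams from our custom text, one could run the following statements in the Jupyter cell:

all_trigrams = list(ngrams(all_words, 3))

all_trigrams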


References


Christmas Carol Text

NLTK Documentation



© PolarSPARC