Understanding Natural Language Processing with Python: Unlocking the Power of NLTK
Introduction
What is NLP?
Natural language processing (NLP) is a branch of artificial intelligence (AI) that enables computers to understand human language and perform useful tasks with it. The input and output of an NLP system can be either speech or written text.
In this post, I will show how to use NLTK, the Natural Language Toolkit, a Python package for NLP.
What are the benefits of learning NLP?
NLTK has been called "a wonderful tool for teaching and working in computational linguistics using Python" and "an amazing library to play with natural language." It is one of the most popular Python libraries for NLP, it has a large community behind it, and it is easy to learn.
The rest of this post uses NLTK for every example.
Install Python’s NLTK
To install the nltk package, open a terminal and run "pip install nltk".
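Several examples in this post depend on data files that pip does not install, such as the Punkt sentence tokenizer models and the WordNet corpus. Here is a minimal sketch of fetching them with nltk.download. The helper name download_resources is made up for illustration, and the resource list is an assumption: recent NLTK releases look for punkt_tab where older ones look for punkt, so the sketch requests both.

```python
import nltk

# Resource names are assumptions based on what the examples in this post use:
# "punkt" / "punkt_tab" for sent_tokenize, "wordnet" for synsets and the lemmatizer.
RESOURCES = ("punkt", "punkt_tab", "wordnet")


def download_resources(names=RESOURCES):
    """Download each NLTK data package and report which downloads succeeded."""
    # nltk.download returns True on success and False on failure.
    return {name: nltk.download(name, quiet=True) for name in names}
```

Calling download_resources() once, with network access, returns a dictionary that maps each resource name to True or False.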
Tokenize Text Using Python
We can use the nltk library to split text into words or sentences. For example, sent_tokenize splits a string into sentences:
from nltk.tokenize import sent_tokenize
mytext = "Hello Ahmad, how are you? I hope everything is going well. Today is a good day, see you dude."
print(sent_tokenize(mytext))
Output:
['Hello Ahmad, how are you?', 'I hope everything is going well.', 'Today is a good day, see you dude.']
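The sentence example above has a word-level counterpart. As a small sketch, wordpunct_tokenize splits text into runs of word characters and runs of punctuation using a regular expression, so it needs no extra data download. The word_tokenize function from the same module is more commonly used, but it relies on the Punkt data. The variable names here are illustrative.

```python
from nltk.tokenize import wordpunct_tokenize

sentence = "Today is a good day, see you dude."

# Split into alphanumeric runs and punctuation runs (regex-based, no models needed).
tokens = wordpunct_tokenize(sentence)
print(tokens)
# ['Today', 'is', 'a', 'good', 'day', ',', 'see', 'you', 'dude', '.']
```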
Get Synonyms from WordNet
WordNet is a lexical database of English that is widely used in natural language processing. It groups words into sets of synonyms (synsets), each with a brief definition and usage examples. You can get the definition and examples for a given word like this:
from nltk.corpus import wordnet
syn = wordnet.synsets("pain")
print(syn[0].definition())
print(syn[0].examples())
Output:
a symptom of some physical hurt or disorder
['the patient developed severe pain and distension']
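The example above prints a definition and usage examples, but not the synonyms themselves. One way to collect them, as a sketch, is to gather the lemma names of every synset for a word. The function name get_synonyms is hypothetical, and the code assumes the WordNet corpus has already been downloaded.

```python
from nltk.corpus import wordnet


def get_synonyms(word):
    """Collect the distinct lemma names from every synset of `word`."""
    synonyms = set()
    for synset in wordnet.synsets(word):
        for lemma in synset.lemmas():
            # Multi-word lemmas are joined with underscores in WordNet.
            synonyms.add(lemma.name().replace("_", " "))
    return sorted(synonyms)
```

Calling print(get_synonyms("pain")) lists every lemma that shares a synset with "pain", including the word itself.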
Get Antonyms from WordNet
You can get antonyms in a similar way: for each lemma, check whether it has antonyms before adding them to the list.
from nltk.corpus import wordnet
antonyms = []
for syn in wordnet.synsets("small"):
    for lemma in syn.lemmas():
        # Only lemmas that have at least one antonym contribute to the list.
        if lemma.antonyms():
            antonyms.append(lemma.antonyms()[0].name())
print(antonyms)
Output:
['large', 'big', 'big']
NLTK Word Stemming
Word stemming means removing affixes from a word and returning its root. For example, the stem of the word "working" is "work".
Search engines use this technique when indexing pages: people write many different forms of the same word, and all of them are reduced to the same stem. There are many stemming algorithms, and one of the most widely used is the Porter stemming algorithm.
NLTK has a class called PorterStemmer that implements the Porter stemming algorithm.
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
print(stemmer.stem('working'))
Output:
work
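To illustrate the indexing use case described above, here is a small sketch that stems several forms of the same word and shows that they collapse to one stem. The word list and variable names are illustrative.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Different surface forms that a search index would want to treat as one term.
variants = ["work", "works", "worked", "working"]
stems = {word: stemmer.stem(word) for word in variants}

print(stems)
# {'work': 'work', 'works': 'work', 'worked': 'work', 'working': 'work'}
```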
Lemmatizing Words Using WordNet
Lemmatizing is similar to stemming, but the result of lemmatizing is always a real word.
A stemmer, by contrast, can produce a string that is not a word:
print(stemmer.stem('increases'))
Output:
increas
If we lemmatize the same word using NLTK's WordNetLemmatizer, the result is a real word:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize('increases'))
Output:
increase
By default, the lemmatizer treats the word as a noun. To lemmatize a word as a verb, pass pos="v":
print(lemmatizer.lemmatize('playing', pos="v"))
print(lemmatizer.lemmatize('adding', pos="v"))
Output:
play
add
Stemming and Lemmatization Difference
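Stemming applies suffix-stripping rules to a word without looking it up in a vocabulary or considering its context, which is why it is faster but less accurate than lemmatization. Lemmatization returns the dictionary form of the word (its lemma), so the result is always a real word, even though it may differ from the word you passed in. If you need real words, lemmatizing is usually the better choice; if speed matters most, stemming may be enough.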
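The difference is easiest to see side by side. The sketch below returns the stem and the lemma for each word in a list. The function name compare is hypothetical, the default pos="v" is an assumption that suits the verb examples in this post, and the lemmatizer needs the WordNet corpus to be downloaded.

```python
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()


def compare(words, pos="v"):
    """Return (word, stem, lemma) triples; `pos` is the WordNet part-of-speech tag."""
    return [(w, stemmer.stem(w), lemmatizer.lemmatize(w, pos=pos)) for w in words]
```

For instance, compare(["increases", "playing", "adding"]) pairs each word with its stem and its lemma, so the stem "increas" appears next to the lemma "increase".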
Conclusion
NLTK is a capable library for teaching and working in computational linguistics with Python. In this post, we tokenized text, looked up definitions and antonyms in WordNet, and compared stemming with lemmatization.