An alternative to NLTK's named entity recognition (NER) classifier is the Stanford NER tagger. This tagger is widely regarded as the standard in named entity recognition, but because it uses a more advanced statistical learning algorithm (a conditional random field) it's also more computationally expensive than the option provided by NLTK.
A big benefit of the Stanford NER tagger is that it provides us with a few different models for pulling out named entities. We can use any of the following:

3 class model for recognizing locations, persons, and organizations
4 class model for recognizing locations, persons, organizations, and miscellaneous entities
7 class model for recognizing locations, persons, organizations, times, money, percents, and dates
In order to move forward we'll need to download the models and a jar file, since the NER classifier is written in Java. These are available for free from the Stanford Natural Language Processing Group. Conveniently for us, NLTK provides a wrapper to the Stanford tagger so we can use it in the best language ever (ahem, Python)!
The parameters passed to the StanfordNERTagger class include:

A path to the classification model we want to use (switching models is just a matter of changing this path, as the sketch below shows)
A path to the stanford-ner.jar file
The character encoding to use when passing text to the tagger (we'll use utf-8)
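For instance, swapping in the 7 class model only means pointing the first argument at a different classifier file. Here's a minimal sketch, assuming the files sit under the same /usr/share/stanford-ner directory used in the full example below; the exact file name can vary between releases of the Stanford NER download:

from nltk.tag import StanfordNERTagger

# Hypothetical paths -- adjust to wherever you unpacked the Stanford NER download.
st_7class = StanfordNERTagger(
    '/usr/share/stanford-ner/classifiers/english.muc.7class.distsim.crf.ser.gz',
    '/usr/share/stanford-ner/stanford-ner.jar',
    encoding='utf-8')

For the rest of this tutorial we'll stick with the 3 class model.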
Here's how we set it up to tag a sentence with the 3 class model:
# -*- coding: utf-8 -*-

from nltk.tag import StanfordNERTagger
from nltk.tokenize import word_tokenize

# Point the tagger at the 3 class model and the Stanford NER jar.
st = StanfordNERTagger('/usr/share/stanford-ner/classifiers/english.all.3class.distsim.crf.ser.gz',
                       '/usr/share/stanford-ner/stanford-ner.jar',
                       encoding='utf-8')

text = 'While in France, Christine Lagarde discussed short-term stimulus efforts in a recent interview with the Wall Street Journal.'

# Split the sentence into word tokens, then tag each token.
tokenized_text = word_tokenize(text)
classified_text = st.tag(tokenized_text)

print(classified_text)
Once we've tokenized the sentence by word and run it through the classifier, we see that the tagger produces a list of tuples:
[('While', 'O'), ('in', 'O'), ('France', 'LOCATION'), (',', 'O'), ('Christine', 'PERSON'), ('Lagarde', 'PERSON'), ('discussed', 'O'), ('short-term', 'O'), ('stimulus', 'O'), ('efforts', 'O'), ('in', 'O'), ('a', 'O'), ('recent', 'O'), ('interview', 'O'), ('with', 'O'), ('the', 'O'), ('Wall', 'ORGANIZATION'), ('Street', 'ORGANIZATION'), ('Journal', 'ORGANIZATION'), ('.', 'O')]
Nice! Each token is tagged (using our 3 class model) with either 'PERSON', 'LOCATION', 'ORGANIZATION', or 'O'. The 'O' simply stands for other, i.e., non-named entities.
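If we'd rather work with whole entities than per-token tags, one simple option is to merge consecutive tokens that share the same tag. This grouping step is our own addition, not something the Stanford tagger does for us; here's a minimal sketch:

from itertools import groupby

def extract_entities(tagged_tokens):
    # Collapse runs of identically tagged tokens into (entity, tag) pairs,
    # dropping the 'O' (other) tokens.
    entities = []
    for tag, group in groupby(tagged_tokens, key=lambda pair: pair[1]):
        if tag != 'O':
            entities.append((' '.join(token for token, _ in group), tag))
    return entities

print(extract_entities(classified_text))
# [('France', 'LOCATION'), ('Christine Lagarde', 'PERSON'), ('Wall Street Journal', 'ORGANIZATION')]

Keep in mind this naive grouping would also merge two different entities of the same type if they happened to appear side by side with no 'O' token between them.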
The list is now ready for testing with annotated data, which we'll cover in the next tutorial.