An alternative to NLTK's named entity recognition (NER) classifier is the Stanford NER tagger. This tagger is widely regarded as the standard in named entity recognition, but because it uses a more advanced statistical learning algorithm (a conditional random field) it's also more computationally expensive than the option provided by NLTK.
A big benefit of the Stanford NER tagger is that it provides us with a few different models for pulling out named entities. We can use any of the following:

3 class model for recognizing locations, persons, and organizations
4 class model for recognizing locations, persons, organizations, and miscellaneous entities
7 class model for recognizing locations, persons, organizations, times, money, percents, and dates
In order to move forward we'll need to download the models and a jar file, since the NER classifier is written in Java. These are available for free from the Stanford Natural Language Processing Group. Conveniently for us, NLTK provides a wrapper to the Stanford tagger so we can use it in the best language ever (ahem, Python)!
The parameters passed to the StanfordNERTagger class include:

A path to the classification model we want to use (switching models is just a matter of changing this path, as the sketch below shows)
A path to the stanford-ner.jar file
The character encoding to use when passing text to the tagger (we'll use utf-8)
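For instance, swapping in the 7 class model only means pointing the first argument at a different classifier file. Here's a minimal sketch, assuming the files sit under the same /usr/share/stanford-ner directory used in the full example below; the exact file name can vary between releases of the Stanford NER download:

from nltk.tag import StanfordNERTagger

# Hypothetical paths -- adjust to wherever you unpacked the Stanford NER download.
st_7class = StanfordNERTagger(
    '/usr/share/stanford-ner/classifiers/english.muc.7class.distsim.crf.ser.gz',
    '/usr/share/stanford-ner/stanford-ner.jar',
    encoding='utf-8')

For the rest of this tutorial we'll stick with the 3 class model.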
Here's how we set it up to tag a sentence with the 3 class model:
# -*- coding: utf-8 -*-

from nltk.tag import StanfordNERTagger
from nltk.tokenize import word_tokenize

# Point the tagger at the 3 class model and the Stanford NER jar.
st = StanfordNERTagger('/usr/share/stanford-ner/classifiers/english.all.3class.distsim.crf.ser.gz',
                       '/usr/share/stanford-ner/stanford-ner.jar',
                       encoding='utf-8')

text = 'While in France, Christine Lagarde discussed short-term stimulus efforts in a recent interview with the Wall Street Journal.'

# Split the sentence into word tokens, then tag each token.
tokenized_text = word_tokenize(text)
classified_text = st.tag(tokenized_text)

print(classified_text)
Once we've tokenized the sentence by word and run it through the classifier, we see that the tagger produces a list of tuples:
[('While', 'O'), ('in', 'O'), ('France', 'LOCATION'), (',', 'O'), ('Christine', 'PERSON'), ('Lagarde', 'PERSON'), ('discussed', 'O'), ('short-term', 'O'), ('stimulus', 'O'), ('efforts', 'O'), ('in', 'O'), ('a', 'O'), ('recent', 'O'), ('interview', 'O'), ('with', 'O'), ('the', 'O'), ('Wall', 'ORGANIZATION'), ('Street', 'ORGANIZATION'), ('Journal', 'ORGANIZATION'), ('.', 'O')]
Nice! Each token is tagged (using our 3 class model) with either 'PERSON', 'LOCATION', 'ORGANIZATION', or 'O'. The 'O' simply stands for other, i.e., non-named entities.
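If we'd rather work with whole entities than per-token tags, one simple option is to merge consecutive tokens that share the same tag. This grouping step is our own addition, not something the Stanford tagger does for us; here's a minimal sketch:

from itertools import groupby

def extract_entities(tagged_tokens):
    # Collapse runs of identically tagged tokens into (entity, tag) pairs,
    # dropping the 'O' (other) tokens.
    entities = []
    for tag, group in groupby(tagged_tokens, key=lambda pair: pair[1]):
        if tag != 'O':
            entities.append((' '.join(token for token, _ in group), tag))
    return entities

print(extract_entities(classified_text))
# [('France', 'LOCATION'), ('Christine Lagarde', 'PERSON'), ('Wall Street Journal', 'ORGANIZATION')]

Keep in mind this naive grouping would also merge two different entities of the same type if they happened to appear side by side with no 'O' token between them.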
The list is now ready for testing with annotated data, which we'll cover in the next tutorial.