We've seen how easy it can be to use classifiers out of the box, and now we want to try some more! The best Python library for this is Scikit-learn (sklearn).
If you would like to learn more about Scikit-learn itself, I have some tutorials on machine learning with Scikit-learn.
Luckily for us, the people behind NLTK foresaw the value of incorporating the sklearn module into the NLTK classifier methodology. As such, they created the SklearnClassifier wrapper. To use it, you just need to import it:
from nltk.classify.scikitlearn import SklearnClassifier
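Under the hood, SklearnClassifier expects the same training data as NLTK's own classifiers: a list of (featureset, label) pairs, where each featureset is a dictionary mapping feature names to values (it converts those dicts to arrays internally). Here's a minimal, self-contained sketch with made-up toy features, borrowing MultinomialNB, which we'll import properly in a moment; in this series, the real training_set comes from the movie_reviews featuresets built earlier:

from nltk.classify.scikitlearn import SklearnClassifier
from sklearn.naive_bayes import MultinomialNB

# Toy data: each example is ({feature_name: value}, label)
toy_set = [
    ({"contains(great)": True, "contains(awful)": False}, "pos"),
    ({"contains(great)": False, "contains(awful)": True}, "neg"),
]

toy_clf = SklearnClassifier(MultinomialNB())  # wrap any sklearn estimator
toy_clf.train(toy_set)                        # same .train() API as NLTK classifiers
print(toy_clf.classify({"contains(great)": True, "contains(awful)": False}))  # -> 'pos'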
From here, you can use just about any of the sklearn classifiers. For example, let's bring in a couple more variations of the Naive Bayes algorithm (MultinomialNB models feature counts, while BernoulliNB models binary presence/absence):
from sklearn.naive_bayes import MultinomialNB, BernoulliNB
With these imported, how might we use them? It turns out to be very simple:
MNB_classifier = SklearnClassifier(MultinomialNB())
MNB_classifier.train(training_set)
print("MultinomialNB accuracy percent:", nltk.classify.accuracy(MNB_classifier, testing_set))

BNB_classifier = SklearnClassifier(BernoulliNB())
BNB_classifier.train(training_set)
print("BernoulliNB accuracy percent:", nltk.classify.accuracy(BNB_classifier, testing_set))
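These wrapped classifiers also support the rest of NLTK's ClassifierI interface, so you can label individual feature sets too. For example, to spot-check a single review from the testing set (recall that each entry is a (featureset, label) pair):

# Spot-check one example: classify() takes a single feature dict
feats, actual = testing_set[0]
print("predicted:", MNB_classifier.classify(feats), "actual:", actual)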
It is as simple as that. Let's bring in some more:
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.svm import SVC, LinearSVC, NuSVC
Now, all of our classifiers should look something like:
print("Original Naive Bayes Algo accuracy percent:", (nltk.classify.accuracy(classifier, testing_set))*100)
classifier.show_most_informative_features(15)

MNB_classifier = SklearnClassifier(MultinomialNB())
MNB_classifier.train(training_set)
print("MNB_classifier accuracy percent:", (nltk.classify.accuracy(MNB_classifier, testing_set))*100)

BernoulliNB_classifier = SklearnClassifier(BernoulliNB())
BernoulliNB_classifier.train(training_set)
print("BernoulliNB_classifier accuracy percent:", (nltk.classify.accuracy(BernoulliNB_classifier, testing_set))*100)

LogisticRegression_classifier = SklearnClassifier(LogisticRegression())
LogisticRegression_classifier.train(training_set)
print("LogisticRegression_classifier accuracy percent:", (nltk.classify.accuracy(LogisticRegression_classifier, testing_set))*100)

SGDClassifier_classifier = SklearnClassifier(SGDClassifier())
SGDClassifier_classifier.train(training_set)
print("SGDClassifier_classifier accuracy percent:", (nltk.classify.accuracy(SGDClassifier_classifier, testing_set))*100)

SVC_classifier = SklearnClassifier(SVC())
SVC_classifier.train(training_set)
print("SVC_classifier accuracy percent:", (nltk.classify.accuracy(SVC_classifier, testing_set))*100)

LinearSVC_classifier = SklearnClassifier(LinearSVC())
LinearSVC_classifier.train(training_set)
print("LinearSVC_classifier accuracy percent:", (nltk.classify.accuracy(LinearSVC_classifier, testing_set))*100)

NuSVC_classifier = SklearnClassifier(NuSVC())
NuSVC_classifier.train(training_set)
print("NuSVC_classifier accuracy percent:", (nltk.classify.accuracy(NuSVC_classifier, testing_set))*100)
The result of running this should give you something along the lines of:
Original Naive Bayes Algo accuracy percent: 63.0
Most Informative Features
     thematic = True    pos : neg = 9.1 : 1.0
     secondly = True    pos : neg = 8.5 : 1.0
     narrates = True    pos : neg = 7.8 : 1.0
      rounded = True    pos : neg = 7.1 : 1.0
      supreme = True    pos : neg = 7.1 : 1.0
      layered = True    pos : neg = 7.1 : 1.0
       crappy = True    neg : pos = 6.9 : 1.0
    uplifting = True    pos : neg = 6.2 : 1.0
          ugh = True    neg : pos = 5.3 : 1.0
        mamet = True    pos : neg = 5.1 : 1.0
      gaining = True    pos : neg = 5.1 : 1.0
        wanda = True    neg : pos = 4.9 : 1.0
        onset = True    neg : pos = 4.9 : 1.0
    fantastic = True    pos : neg = 4.5 : 1.0
     kentucky = True    pos : neg = 4.4 : 1.0
MNB_classifier accuracy percent: 66.0
BernoulliNB_classifier accuracy percent: 72.0
LogisticRegression_classifier accuracy percent: 64.0
SGDClassifier_classifier accuracy percent: 61.0
SVC_classifier accuracy percent: 45.0
LinearSVC_classifier accuracy percent: 68.0
NuSVC_classifier accuracy percent: 59.0
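A quick style note: the training block above repeats the same three lines per algorithm, and nothing in it depends on which estimator is inside the wrapper. So you could equally well write it as a loop; here's one equivalent sketch:

classifiers = [
    ("MultinomialNB", MultinomialNB()),
    ("BernoulliNB", BernoulliNB()),
    ("LogisticRegression", LogisticRegression()),
    ("SGDClassifier", SGDClassifier()),
    ("SVC", SVC()),
    ("LinearSVC", LinearSVC()),
    ("NuSVC", NuSVC()),
]

for name, estimator in classifiers:
    clf = SklearnClassifier(estimator)  # wrap, train, and score each one the same way
    clf.train(training_set)
    print(name, "accuracy percent:", nltk.classify.accuracy(clf, testing_set) * 100)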
Looking back at those numbers, we can see SVC is wrong more often than it is right out of the gate, so we should probably dump that one. But then what? The next thing we can try is to use all of these algorithms at once. An algo of algos! To do this, we can create another classifier whose result is based on what the other algorithms said. It's essentially a voting system, so we'll want an odd number of algorithms to avoid ties between our two labels. That's what we'll be talking about in the next tutorial.
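As a preview of that voting idea, here is a minimal sketch of how it could work. The VoteClassifier name and structure below are my own illustration for this page, not NLTK's API and not necessarily the version we'll build next time:

from statistics import mode

class VoteClassifier:
    """Majority vote over several already-trained NLTK-style classifiers."""
    def __init__(self, *classifiers):
        self._classifiers = classifiers

    def classify(self, features):
        # Each sub-classifier casts a vote; mode() returns the most common label.
        votes = [c.classify(features) for c in self._classifiers]
        return mode(votes)

# With an odd number of voters and two labels, a tie is impossible:
# voted = VoteClassifier(MNB_classifier, BernoulliNB_classifier, LinearSVC_classifier)
# print(voted.classify(testing_set[0][0]))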