Welcome to part six of the Deep Learning with Neural Networks and TensorFlow tutorials. Where we left off, we explained our plan and theory for applying our deep neural network to some sentiment training data, and now we're going to be working on the pre-processing script for that.
To do this, we will use two files: pos.txt and neg.txt. The pos file has ~5,000 positive sentiment statements, and the neg file has ~5,000 negative sentiment statements.
We left off with:
import nltk
from nltk.tokenize import word_tokenize
import numpy as np
import random
import pickle
from collections import Counter
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
hm_lines = 100000
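One quick note before running any of this: word_tokenize and WordNetLemmatizer depend on NLTK corpora that are not installed by default. If you haven't already downloaded them, you would need something like:

import nltk

# one-time downloads for the Punkt tokenizer models and the WordNet data
nltk.download('punkt')
nltk.download('wordnet')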
Now we'll begin to build the lexicon:
def create_lexicon(pos,neg):
    lexicon = []
    with open(pos,'r') as f:
        contents = f.readlines()
        for l in contents[:hm_lines]:
            all_words = word_tokenize(l)
            lexicon += list(all_words)

    with open(neg,'r') as f:
        contents = f.readlines()
        for l in contents[:hm_lines]:
            all_words = word_tokenize(l)
            lexicon += list(all_words)
Here, we've begun the function, which takes a path to the positive file and the negative file. From here, we open the files, read the lines, tokenize the words, and add them to the lexicon.
At this point, our lexicon is just a list of every word in our training data. If you had a huge dataset, too large to fit into memory, then you'd want to adjust the hm_lines value here, so that you only go through the first hm_lines lines of each file. Now we still need to lemmatize and remove duplicates. We also don't really need super common words, nor very uncommon words. For example, words like "a", "and", or "or" aren't going to give us much value in this simple "bag of words" model, so we don't want them. Uncommon words aren't going to be very useful either, since they'd likely be so rare that their very presence would skew the results. We can play with this later to see whether this belief holds.
Continuing along in our create_lexicon function:
    lexicon = [lemmatizer.lemmatize(i) for i in lexicon]
    w_counts = Counter(lexicon)
    l2 = []
    for w in w_counts:
        #print(w_counts[w])
        if 1000 > w_counts[w] > 50:
            l2.append(w)
    print(len(l2))
    return l2
Here, we lemmatize, then count the word occurrences. If a word occurs fewer than 1,000 times but more than 50 times, we want to include it in our lexicon. These two values are definitely something you may want to tweak, and really they ought to be some sort of percentage of the entire dataset. I will just mention here that none of this code is optimized or meant to be used in production. This is just conceptual code, with tons of room for improvement.
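As one hedged sketch of what a percentage-based cut-off could look like (this is not part of the tutorial's code, and the upper_pct and lower_pct values are assumptions you would have to tune for your own dataset):

from collections import Counter

def filter_by_frequency(lexicon, upper_pct=0.001, lower_pct=0.00005):
    # count occurrences of each lemmatized token
    w_counts = Counter(lexicon)
    total = len(lexicon)
    # keep words whose share of all tokens falls between the two cut-offs,
    # instead of the hard-coded 50/1000 absolute counts used above
    return [w for w in w_counts if lower_pct < w_counts[w] / total < upper_pct]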
Great, so we have a lexicon. Now we can take this lexicon and use it as the bag of words that we will look for in a string. Each time we find a lemma from our lexicon in the lemmatized, word-tokenized sample sentence, the index of that lemma in the lexicon is turned "on" in a numpy zeros array that is the same length as the lexicon.
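To make that concrete, here is a tiny hand-made example (the three-word lexicon and sentence are invented purely for illustration):

import numpy as np

lexicon = ['good', 'movie', 'bad']          # toy lexicon, just for illustration
sentence_words = ['good', 'good', 'movie']  # already tokenized and lemmatized

features = np.zeros(len(lexicon))
for word in sentence_words:
    if word in lexicon:
        features[lexicon.index(word)] += 1

print(features)  # [2. 1. 0.] -- 'good' appears twice, 'movie' once, 'bad' not at all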
To do this, we'll build a sample_handling function:
def sample_handling(sample,lexicon,classification):
    featureset = []

    with open(sample,'r') as f:
        contents = f.readlines()
        for l in contents[:hm_lines]:
            current_words = word_tokenize(l.lower())
            current_words = [lemmatizer.lemmatize(i) for i in current_words]
            features = np.zeros(len(lexicon))
            for word in current_words:
                if word.lower() in lexicon:
                    index_value = lexicon.index(word.lower())
                    features[index_value] += 1
            features = list(features)
            featureset.append([features,classification])

    return featureset
This will iterate through the "sample" file that we choose; in our case, that is pos.txt or neg.txt. We also pass the lexicon, and the classification of the file coming through. From here, it tokenizes the sample file by word, then lemmatizes the words. Next, we start with a numpy.zeros array that is the length of the lexicon, and iterate through the lemmatized words, adding 1 to the value in the features array at the index that matches the word's index in the lexicon. From here, we append [features, classification] to our total featureset, and when done, we return the whole thing. This function will be run twice: once for the positives and once for the negatives.
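As a quick illustration of what comes back (the file names here are just placeholders for wherever your data actually lives), each entry in the returned featureset pairs one bag-of-words vector with its one-hot label:

lexicon = create_lexicon('pos.txt', 'neg.txt')          # placeholder paths
pos_features = sample_handling('pos.txt', lexicon, [1, 0])

# each element looks like [bag_of_words_vector, one_hot_label]
print(len(pos_features[0][0]))  # same length as the lexicon
print(pos_features[0][1])       # [1, 0] for a positive sample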
To be well suited for training and testing with our current model, we ideally want a training set of features, another of the associated labels, and then the same thing for the testing data. Let's make a quick function to do that too:
def create_feature_sets_and_labels(pos,neg,test_size = 0.1):
    lexicon = create_lexicon(pos,neg)
    features = []
    # use the paths passed in, with their associated one-hot classifications
    features += sample_handling(pos,lexicon,[1,0])
    features += sample_handling(neg,lexicon,[0,1])
    random.shuffle(features)
    # dtype=object since each row pairs a features list with a label list
    features = np.array(features, dtype=object)

    testing_size = int(test_size*len(features))

    train_x = list(features[:,0][:-testing_size])
    train_y = list(features[:,1][:-testing_size])
    test_x = list(features[:,0][-testing_size:])
    test_y = list(features[:,1][-testing_size:])

    return train_x,train_y,test_x,test_y
The create_feature_sets_and_labels function is where everything comes together. We create the lexicon here based on the raw sample data that we have, then we build the full features based on their associated files, the lexicon, and the classifications.
Next, we want to shuffle this data, convert it to a numpy array, and then build the training and testing sets. From here, we return the data as individual variables.
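If you'd rather not slice by hand, an equivalent split could also be done with scikit-learn's train_test_split. This is an optional alternative, not part of the tutorial's code, and assumes scikit-learn is installed; features here is the shuffled list of [bag_of_words, label] pairs built above:

from sklearn.model_selection import train_test_split

x = [f[0] for f in features]
y = [f[1] for f in features]

# hold out 10% for testing, mirroring test_size=0.1 above
train_x, test_x, train_y, test_y = train_test_split(x, y, test_size=0.1)

Either way, now we are ready to go ahead and try to run this.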
if __name__ == '__main__':
    train_x,train_y,test_x,test_y = create_feature_sets_and_labels('/path/to/pos.txt','/path/to/neg.txt')
    # if you want to pickle this data:
    with open('/path/to/sentiment_set.pickle','wb') as f:
        pickle.dump([train_x,train_y,test_x,test_y],f)
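Later, when we want this data back (for example in the next tutorial's training script), loading the pickle is just the reverse operation; the path here is the same placeholder as above:

import pickle

# load the feature sets and labels we pickled above
with open('/path/to/sentiment_set.pickle','rb') as f:
    train_x, train_y, test_x, test_y = pickle.load(f)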
Full code for create_sentiment_featuresets.py
import nltk
from nltk.tokenize import word_tokenize
import numpy as np
import random
import pickle
from collections import Counter
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
hm_lines = 100000


def create_lexicon(pos,neg):
    lexicon = []
    with open(pos,'r') as f:
        contents = f.readlines()
        for l in contents[:hm_lines]:
            all_words = word_tokenize(l)
            lexicon += list(all_words)

    with open(neg,'r') as f:
        contents = f.readlines()
        for l in contents[:hm_lines]:
            all_words = word_tokenize(l)
            lexicon += list(all_words)

    lexicon = [lemmatizer.lemmatize(i) for i in lexicon]
    w_counts = Counter(lexicon)
    l2 = []
    for w in w_counts:
        #print(w_counts[w])
        if 1000 > w_counts[w] > 50:
            l2.append(w)
    print(len(l2))
    return l2


def sample_handling(sample,lexicon,classification):
    featureset = []

    with open(sample,'r') as f:
        contents = f.readlines()
        for l in contents[:hm_lines]:
            current_words = word_tokenize(l.lower())
            current_words = [lemmatizer.lemmatize(i) for i in current_words]
            features = np.zeros(len(lexicon))
            for word in current_words:
                if word.lower() in lexicon:
                    index_value = lexicon.index(word.lower())
                    features[index_value] += 1
            features = list(features)
            featureset.append([features,classification])

    return featureset


def create_feature_sets_and_labels(pos,neg,test_size = 0.1):
    lexicon = create_lexicon(pos,neg)
    features = []
    # use the paths passed in, with their associated one-hot classifications
    features += sample_handling(pos,lexicon,[1,0])
    features += sample_handling(neg,lexicon,[0,1])
    random.shuffle(features)
    # dtype=object since each row pairs a features list with a label list
    features = np.array(features, dtype=object)

    testing_size = int(test_size*len(features))

    train_x = list(features[:,0][:-testing_size])
    train_y = list(features[:,1][:-testing_size])
    test_x = list(features[:,0][-testing_size:])
    test_y = list(features[:,1][-testing_size:])

    return train_x,train_y,test_x,test_y


if __name__ == '__main__':
    train_x,train_y,test_x,test_y = create_feature_sets_and_labels('/path/to/pos.txt','/path/to/neg.txt')
    # if you want to pickle this data:
    with open('/path/to/sentiment_set.pickle','wb') as f:
        pickle.dump([train_x,train_y,test_x,test_y],f)
Now that we've got our data, in the next tutorial, we're going to feed this new data through to the same model that we recently used to classify hand-written digits.