Welcome to part six of the Deep Learning with Neural Networks and TensorFlow tutorials. Where we left off, we explained our plan and theory for applying our deep neural network to some sentiment training data, and now we're going to be working on the pre-processing script for that.
To do this, we will use two files: pos.txt and neg.txt. The pos file has ~5,000 positive sentiment statements, and the neg file has ~5,000 negative sentiment statements.
We left off with:
import nltk
from nltk.tokenize import word_tokenize
import numpy as np
import random
import pickle
from collections import Counter
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
hm_lines = 100000
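One quick note before running any of this: word_tokenize and WordNetLemmatizer depend on NLTK corpora that are not installed by default. If you haven't already downloaded them, you would need something like:

import nltk

# one-time downloads for the Punkt tokenizer models and the WordNet data
nltk.download('punkt')
nltk.download('wordnet')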
Now we'll begin to build the lexicon:
def create_lexicon(pos,neg):
    lexicon = []
    with open(pos,'r') as f:
        contents = f.readlines()
        for l in contents[:hm_lines]:
            all_words = word_tokenize(l)
            lexicon += list(all_words)

    with open(neg,'r') as f:
        contents = f.readlines()
        for l in contents[:hm_lines]:
            all_words = word_tokenize(l)
            lexicon += list(all_words)
Here, we've begun the function, which takes a path to the positive file and the negative file. From here, we open the files, read the lines, tokenize the words, and add them to the lexicon.
At this point, our lexicon is just a list of every word in our training data. If you had a huge dataset, too large to fit into memory, then you'd want to adjust the hm_lines value here, so that you only go through the first hm_lines lines of each file. Now we still need to lemmatize and remove duplicates. We also don't really need super common words, nor very uncommon words. For example, words like "a", "and", or "or" aren't going to give us much value in this simple "bag of words" model, so we don't want them. Uncommon words aren't going to be very useful either, since they'd likely be so rare that their very presence would skew the results. We can play with this later to see whether this belief holds.
Continuing along in our create_lexicon function:
    lexicon = [lemmatizer.lemmatize(i) for i in lexicon]
    w_counts = Counter(lexicon)
    l2 = []
    for w in w_counts:
        #print(w_counts[w])
        if 1000 > w_counts[w] > 50:
            l2.append(w)
    print(len(l2))
    return l2
Here, we lemmatize, then count the word occurrences. If a word occurs fewer than 1,000 times but more than 50 times, we want to include it in our lexicon. These two values are definitely something you may want to tweak, and really they ought to be some sort of percentage of the entire dataset. I will just mention here that none of this code is optimized or meant to be used in production. This is just conceptual code, with tons of room for improvement.
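As one hedged sketch of what a percentage-based cut-off could look like (this is not part of the tutorial's code, and the upper_pct and lower_pct values are assumptions you would have to tune for your own dataset):

from collections import Counter

def filter_by_frequency(lexicon, upper_pct=0.001, lower_pct=0.00005):
    # count occurrences of each lemmatized token
    w_counts = Counter(lexicon)
    total = len(lexicon)
    # keep words whose share of all tokens falls between the two cut-offs,
    # instead of the hard-coded 50/1000 absolute counts used above
    return [w for w in w_counts if lower_pct < w_counts[w] / total < upper_pct]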
Great, so we have a lexicon. Now we can take this lexicon and use it as the bag of words that we will look for in a string. Each time we find a lemma from our lexicon in the lemmatized, word-tokenized sample sentence, the index of that lemma in the lexicon is turned "on" in a numpy zeros array that is the same length as the lexicon.
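To make that concrete, here is a tiny hand-made example (the three-word lexicon and sentence are invented purely for illustration):

import numpy as np

lexicon = ['good', 'movie', 'bad']          # toy lexicon, just for illustration
sentence_words = ['good', 'good', 'movie']  # already tokenized and lemmatized

features = np.zeros(len(lexicon))
for word in sentence_words:
    if word in lexicon:
        features[lexicon.index(word)] += 1

print(features)  # [2. 1. 0.] -- 'good' appears twice, 'movie' once, 'bad' not at all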
To do this, we'll build a sample_handling function:
def sample_handling(sample,lexicon,classification):
    featureset = []

    with open(sample,'r') as f:
        contents = f.readlines()
        for l in contents[:hm_lines]:
            current_words = word_tokenize(l.lower())
            current_words = [lemmatizer.lemmatize(i) for i in current_words]
            features = np.zeros(len(lexicon))
            for word in current_words:
                if word.lower() in lexicon:
                    index_value = lexicon.index(word.lower())
                    features[index_value] += 1
            features = list(features)
            featureset.append([features,classification])

    return featureset
This will iterate through the "sample" file that we choose; in our case, that is pos.txt or neg.txt. We also pass the lexicon, and the classification of the file coming through. From here, it tokenizes the sample file by word, then lemmatizes the words. Next, we start with a numpy.zeros array that is the length of the lexicon, and iterate through the lemmatized words, adding 1 to the value in the features array at the index that matches the word's index in the lexicon. From here, we append [features, classification] to our total featureset, and when done, we return the whole thing. This function will be run twice: once for the positives and once for the negatives.
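As a quick illustration of what comes back (the file names here are just placeholders for wherever your data actually lives), each entry in the returned featureset pairs one bag-of-words vector with its one-hot label:

lexicon = create_lexicon('pos.txt', 'neg.txt')          # placeholder paths
pos_features = sample_handling('pos.txt', lexicon, [1, 0])

# each element looks like [bag_of_words_vector, one_hot_label]
print(len(pos_features[0][0]))  # same length as the lexicon
print(pos_features[0][1])       # [1, 0] for a positive sample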
To be well suited for training and testing with our current model, we ideally want a training set of features, another of the associated labels, and then the same thing for the testing data. Let's make a quick function to do that too:
def create_feature_sets_and_labels(pos,neg,test_size = 0.1):
    lexicon = create_lexicon(pos,neg)
    features = []
    # use the paths passed in, with their associated one-hot classifications
    features += sample_handling(pos,lexicon,[1,0])
    features += sample_handling(neg,lexicon,[0,1])
    random.shuffle(features)
    # dtype=object since each row pairs a features list with a label list
    features = np.array(features, dtype=object)

    testing_size = int(test_size*len(features))

    train_x = list(features[:,0][:-testing_size])
    train_y = list(features[:,1][:-testing_size])
    test_x = list(features[:,0][-testing_size:])
    test_y = list(features[:,1][-testing_size:])

    return train_x,train_y,test_x,test_y
The create_feature_sets_and_labels function is where everything comes together. We create the lexicon here based on the raw sample data that we have, then we build the full features based on their associated files, the lexicon, and the classifications.
Next, we want to shuffle this data, convert it to a numpy array, and then build the training and testing sets. From here, we return the data as individual variables.
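If you'd rather not slice by hand, an equivalent split could also be done with scikit-learn's train_test_split. This is an optional alternative, not part of the tutorial's code, and assumes scikit-learn is installed; features here is the shuffled list of [bag_of_words, label] pairs built above:

from sklearn.model_selection import train_test_split

x = [f[0] for f in features]
y = [f[1] for f in features]

# hold out 10% for testing, mirroring test_size=0.1 above
train_x, test_x, train_y, test_y = train_test_split(x, y, test_size=0.1)

Either way, now we are ready to go ahead and try to run this.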
if __name__ == '__main__':
    train_x,train_y,test_x,test_y = create_feature_sets_and_labels('/path/to/pos.txt','/path/to/neg.txt')
    # if you want to pickle this data:
    with open('/path/to/sentiment_set.pickle','wb') as f:
        pickle.dump([train_x,train_y,test_x,test_y],f)
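Later, when we want this data back (for example in the next tutorial's training script), loading the pickle is just the reverse operation; the path here is the same placeholder as above:

import pickle

# load the feature sets and labels we pickled above
with open('/path/to/sentiment_set.pickle','rb') as f:
    train_x, train_y, test_x, test_y = pickle.load(f)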
Full code for create_sentiment_featuresets.py
import nltk
from nltk.tokenize import word_tokenize
import numpy as np
import random
import pickle
from collections import Counter
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
hm_lines = 100000


def create_lexicon(pos,neg):
    lexicon = []
    with open(pos,'r') as f:
        contents = f.readlines()
        for l in contents[:hm_lines]:
            all_words = word_tokenize(l)
            lexicon += list(all_words)

    with open(neg,'r') as f:
        contents = f.readlines()
        for l in contents[:hm_lines]:
            all_words = word_tokenize(l)
            lexicon += list(all_words)

    lexicon = [lemmatizer.lemmatize(i) for i in lexicon]
    w_counts = Counter(lexicon)
    l2 = []
    for w in w_counts:
        #print(w_counts[w])
        if 1000 > w_counts[w] > 50:
            l2.append(w)
    print(len(l2))
    return l2


def sample_handling(sample,lexicon,classification):
    featureset = []

    with open(sample,'r') as f:
        contents = f.readlines()
        for l in contents[:hm_lines]:
            current_words = word_tokenize(l.lower())
            current_words = [lemmatizer.lemmatize(i) for i in current_words]
            features = np.zeros(len(lexicon))
            for word in current_words:
                if word.lower() in lexicon:
                    index_value = lexicon.index(word.lower())
                    features[index_value] += 1
            features = list(features)
            featureset.append([features,classification])

    return featureset


def create_feature_sets_and_labels(pos,neg,test_size = 0.1):
    lexicon = create_lexicon(pos,neg)
    features = []
    # use the paths passed in, with their associated one-hot classifications
    features += sample_handling(pos,lexicon,[1,0])
    features += sample_handling(neg,lexicon,[0,1])
    random.shuffle(features)
    # dtype=object since each row pairs a features list with a label list
    features = np.array(features, dtype=object)

    testing_size = int(test_size*len(features))

    train_x = list(features[:,0][:-testing_size])
    train_y = list(features[:,1][:-testing_size])
    test_x = list(features[:,0][-testing_size:])
    test_y = list(features[:,1][-testing_size:])

    return train_x,train_y,test_x,test_y


if __name__ == '__main__':
    train_x,train_y,test_x,test_y = create_feature_sets_and_labels('/path/to/pos.txt','/path/to/neg.txt')
    # if you want to pickle this data:
    with open('/path/to/sentiment_set.pickle','wb') as f:
        pickle.dump([train_x,train_y,test_x,test_y],f)
Now that we've got our data, in the next tutorial, we're going to feed this new data through to the same model that we recently used to classify hand-written digits.