Welcome to part eight of the Deep Learning with Neural Networks and TensorFlow tutorials. In the last tutorial, we applied a deep neural network to our own dataset, but we didn't get very useful results. We're left wondering what might happen if we significantly increase the size of the dataset. Before, we were using ~10,000 samples; how about we try with 1.6 million samples?
The dataset that we will use this time is from Stanford, and contains 1.6 million examples of positive and negative sentiment: the Sentiment140 dataset.
Now, the raw dataset itself probably isn't too large for you to fit into memory, but, once we convert it to the bag-of-words model from before, it definitely will be. So, this time, we have to begin considering what to do when datasets are much larger. When working with large datasets, a few things change: we pre-process the data up front, stream it from disk, train in batches, and save checkpoints as we go so we can resume training.
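To get a feel for the memory problem, here's a rough back-of-the-envelope estimate. It assumes a lexicon of about 2,638 words (which is what we end up with later in this tutorial) and 8-byte float values, so treat the numbers as illustrative:

samples = 1600000          # tweets in the training set
lexicon_size = 2638        # roughly the lexicon size we end up with below
bytes_per_value = 8        # one float64 feature value

total_bytes = samples * lexicon_size * bytes_per_value
print(total_bytes / (1024 ** 3), 'GB')   # about 31 GB of dense vectors, far more than most machines have free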
With this in mind, we need to pre-process our data again, this time for this specific dataset. I am going to assume that you've been following along, so we will run through this a bit quicker than usual. You can call this file whatever you like; we won't be importing it, we'll just be using it to pre-process our data. Something like data_preprocessing.py will suffice:
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
import pickle
import numpy as np
import pandas as pd

lemmatizer = WordNetLemmatizer()

'''
polarity 0 = negative. 2 = neutral. 4 = positive.
id
date
query
user
tweet
'''
These are the packages we'll be using, along with some notes on the dataset. First, we'll convert the sentiment values of the dataset:
def init_process(fin, fout):
    outfile = open(fout, 'a')
    with open(fin, buffering=200000, encoding='latin-1') as f:
        try:
            for line in f:
                line = line.replace('"', '')
                initial_polarity = line.split(',')[0]
                if initial_polarity == '0':
                    initial_polarity = [1, 0]
                elif initial_polarity == '4':
                    initial_polarity = [0, 1]

                tweet = line.split(',')[-1]
                outline = str(initial_polarity) + ':::' + tweet
                outfile.write(outline)
        except Exception as e:
            print(str(e))
    outfile.close()
Here, we simply pass a file in (the original dataset) and write out a new file. We're converting the sentiment label to a one-hot list and cleaning up some of the formatting. You only need to run this function once each for the training and testing data, like so:
init_process('training.1600000.processed.noemoticon.csv', 'train_set.csv')
init_process('testdata.manual.2009.06.14.csv', 'test_set.csv')
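If you want to sanity-check the result, peeking at the first line of the new file should show the one-hot label, the ::: separator, and the raw tweet. This snippet is just for illustration; the exact tweet text depends on the dataset file:

# Illustrative check; assumes train_set.csv was just created by init_process above.
with open('train_set.csv', encoding='latin-1') as f:
    print(f.readline())
# Expect something shaped like: [1, 0]:::<tweet text>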
Next, we create our lexicon. This is very similar to our method from before; the only difference is that the lexicon is built from just one out of every 2,500 samples, which gives us a reasonably random sampling without having to process all 1.6 million tweets:
def create_lexicon(fin):
    lexicon = []
    with open(fin, 'r', buffering=100000, encoding='latin-1') as f:
        try:
            counter = 1
            content = ''
            for line in f:
                counter += 1
                if (counter/2500.0).is_integer():
                    tweet = line.split(':::')[1]
                    content += ' '+tweet
                    words = word_tokenize(content)
                    words = [lemmatizer.lemmatize(i) for i in words]
                    lexicon = list(set(lexicon + words))
                    print(counter, len(lexicon))
        except Exception as e:
            print(str(e))

    with open('lexicon.pickle', 'wb') as f:
        pickle.dump(lexicon, f)
This lexicon creation only needs to be run once as well:
create_lexicon('train_set.csv')
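When it finishes, it's worth checking how big your lexicon actually turned out, because the network script below hard-codes an input width of 2638 (the size of the lexicon I ended up with); if yours differs, adjust that number accordingly. A quick, optional check:

# Optional check; assumes lexicon.pickle was written by create_lexicon above.
import pickle

with open('lexicon.pickle', 'rb') as f:
    lexicon = pickle.load(f)
print(len(lexicon))  # the first weight matrix in the network below assumes 2638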
Now, we can either vectorize the data into the bag-of-words model prior to training the network, or we can do it inline as we train. With this dataset's size, it would still be possible to vectorize the data first, save it somewhere, and then feed it through the network, but that approach becomes impractical once we move to much larger sets.
For our test set, however, we can easily do this, so here's the code for that:
def create_test_data_pickle(fin):
    feature_sets = []
    labels = []
    counter = 0
    with open(fin, buffering=20000) as f:
        for line in f:
            try:
                features = list(eval(line.split('::')[0]))
                label = list(eval(line.split('::')[1]))

                feature_sets.append(features)
                labels.append(label)
                counter += 1
            except:
                pass
    print(counter)
    feature_sets = np.array(feature_sets)
    labels = np.array(labels)

# 'processed-test-set.csv' is produced by the convert_to_vec helper in the full
# pre-processing script at the end of this tutorial. Note that, despite the name,
# this function only loads and counts the vectorized test samples; the testing
# code later reads processed-test-set.csv directly.
create_test_data_pickle('processed-test-set.csv')
Go ahead and run the pre-processing once (the full script, which also includes the test-set vectorization and training-set shuffling helpers, is collected at the end of this tutorial); it will pre-process our dataset and create our lexicon for us. Now we'll head over to our neural network Python script:
import tensorflow as tf
import pickle
import numpy as np
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

n_nodes_hl1 = 500
n_nodes_hl2 = 500

n_classes = 2

batch_size = 32
total_batches = int(1600000/batch_size)
hm_epochs = 10

x = tf.placeholder('float')
y = tf.placeholder('float')

hidden_1_layer = {'f_fum': n_nodes_hl1,
                  # 2638 = length of our lexicon; change this if yours differs
                  'weight': tf.Variable(tf.random_normal([2638, n_nodes_hl1])),
                  'bias': tf.Variable(tf.random_normal([n_nodes_hl1]))}

hidden_2_layer = {'f_fum': n_nodes_hl2,
                  'weight': tf.Variable(tf.random_normal([n_nodes_hl1, n_nodes_hl2])),
                  'bias': tf.Variable(tf.random_normal([n_nodes_hl2]))}

output_layer = {'f_fum': None,
                'weight': tf.Variable(tf.random_normal([n_nodes_hl2, n_classes])),
                'bias': tf.Variable(tf.random_normal([n_classes]))}
This should all look familiar. The only additions here are the batch_size and total_batches values, which we'll use shortly; at a batch size of 32, one pass over the 1.6 million tweets works out to 50,000 batches.
The deep neural network model itself is left unchanged:
def neural_network_model(data):
    l1 = tf.add(tf.matmul(data, hidden_1_layer['weight']), hidden_1_layer['bias'])
    l1 = tf.nn.relu(l1)

    l2 = tf.add(tf.matmul(l1, hidden_2_layer['weight']), hidden_2_layer['bias'])
    l2 = tf.nn.relu(l2)

    output = tf.matmul(l2, output_layer['weight']) + output_layer['bias']
    return output
Now, before we get into the next function, we'll add:
saver = tf.train.Saver()
tf_log = 'tf.log'
This is how we're going to save our model in the form of checkpoints as time goes on. Now we get to the training of the network:
def train_neural_network(x):
    prediction = neural_network_model(x)
    cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(prediction, y))
    optimizer = tf.train.AdamOptimizer(learning_rate=0.001).minimize(cost)

    with tf.Session() as sess:
        sess.run(tf.initialize_all_variables())
        try:
            epoch = int(open(tf_log, 'r').read().split('\n')[-2]) + 1
            print('STARTING:', epoch)
        except:
            epoch = 1

        while epoch <= hm_epochs:
            if epoch != 1:
                saver.restore(sess, "model.ckpt")
            epoch_loss = 0  # accumulator for this epoch's loss
            with open('lexicon.pickle', 'rb') as f:
                lexicon = pickle.load(f)
            with open('train_set_shuffled.csv', buffering=20000, encoding='latin-1') as f:
                batch_x = []
                batch_y = []
                batches_run = 0
                for line in f:
                    label = line.split(':::')[0]
                    tweet = line.split(':::')[1]
                    current_words = word_tokenize(tweet.lower())
                    current_words = [lemmatizer.lemmatize(i) for i in current_words]

                    features = np.zeros(len(lexicon))

                    for word in current_words:
                        if word.lower() in lexicon:
                            index_value = lexicon.index(word.lower())
                            # OR DO +=1, test both
                            features[index_value] += 1
                    line_x = list(features)
                    line_y = eval(label)
                    batch_x.append(line_x)
                    batch_y.append(line_y)
                    if len(batch_x) >= batch_size:
                        _, c = sess.run([optimizer, cost], feed_dict={x: np.array(batch_x),
                                                                      y: np.array(batch_y)})
                        epoch_loss += c
                        batch_x = []
                        batch_y = []
                        batches_run += 1
                        print('Batch run:', batches_run, '/', total_batches, '| Epoch:', epoch, '| Batch Loss:', c,)

            saver.save(sess, "model.ckpt")
            print('Epoch', epoch, 'completed out of', hm_epochs, 'loss:', epoch_loss)
            with open(tf_log, 'a') as f:
                f.write(str(epoch) + '\n')
            epoch += 1

train_neural_network(x)
Much of this should be familiar, but there are a few changes. First, note the:
try:
    epoch = int(open(tf_log, 'r').read().split('\n')[-2]) + 1
    print('STARTING:', epoch)
except:
    epoch = 1
This is our way of tracking which epoch we're on, using a simple log file. I am still pretty new to TensorFlow myself; I would expect there to be a way to save the epoch number within the model itself, but I couldn't get it working.
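For what it's worth, here's one way that could be done. This is just a sketch, not something I've verified end to end with the checkpointing above: keep the epoch counter in a non-trainable tf.Variable, and the Saver will write it into and read it back from the checkpoint along with the weights.

# Sketch: store the epoch counter inside the checkpoint instead of a log file.
# Variable and op names here are illustrative, not part of the tutorial's scripts.
epoch_var = tf.Variable(1, name='epoch', trainable=False)
increment_epoch = tf.assign(epoch_var, epoch_var + 1)
saver = tf.train.Saver()  # create the Saver after epoch_var so it gets saved too

with tf.Session() as sess:
    sess.run(tf.initialize_all_variables())
    try:
        saver.restore(sess, "model.ckpt")   # restores the weights AND the epoch counter
    except Exception:
        pass                                # no checkpoint yet; start at epoch 1
    epoch = sess.run(epoch_var)
    # ... run one epoch of training here ...
    sess.run(increment_epoch)
    saver.save(sess, "model.ckpt")          # epoch_var is written out with the weights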
Next:
while epoch <= hm_epochs:
    if epoch != 1:
        saver.restore(sess, "model.ckpt")
    epoch_loss = 0
    with open('lexicon.pickle', 'rb') as f:
        lexicon = pickle.load(f)
    with open('train_set_shuffled.csv', buffering=20000, encoding='latin-1') as f:
        batch_x = []
        batch_y = []
        batches_run = 0
This is how we keep going until we've completed as many epochs as we want. As each epoch begins, if we're not starting at the first one, we load the latest checkpoint. We then load in the lexicon pickle and begin reading the shuffled training set. From here, each line is vectorized and added to the current batch; once the batch reaches batch_size samples, it's fed through the network.
for line in f:
    label = line.split(':::')[0]
    tweet = line.split(':::')[1]
    current_words = word_tokenize(tweet.lower())
    current_words = [lemmatizer.lemmatize(i) for i in current_words]

    features = np.zeros(len(lexicon))

    for word in current_words:
        if word.lower() in lexicon:
            index_value = lexicon.index(word.lower())
            # OR DO +=1, test both
            features[index_value] += 1
    line_x = list(features)
    line_y = eval(label)
    batch_x.append(line_x)
    batch_y.append(line_y)
    if len(batch_x) >= batch_size:
        _, c = sess.run([optimizer, cost], feed_dict={x: np.array(batch_x),
                                                      y: np.array(batch_y)})
        epoch_loss += c
        batch_x = []
        batch_y = []
        batches_run += 1
        print('Batch run:', batches_run, '/', total_batches, '| Epoch:', epoch, '| Batch Loss:', c,)
At this point we've completed a full epoch, so we save a checkpoint and update our epoch log file:
saver.save(sess, "model.ckpt")
print('Epoch', epoch, 'completed out of', hm_epochs, 'loss:', epoch_loss)
with open(tf_log, 'a') as f:
    f.write(str(epoch) + '\n')
epoch += 1
When done, we can test our accuracy as usual:
correct = tf.equal(tf.argmax(prediction, 1), tf.argmax(y, 1))
accuracy = tf.reduce_mean(tf.cast(correct, 'float'))

feature_sets = []
labels = []
counter = 0
with open('processed-test-set.csv', buffering=20000) as f:
    for line in f:
        try:
            features = list(eval(line.split('::')[0]))
            label = list(eval(line.split('::')[1]))
            feature_sets.append(features)
            labels.append(label)
            counter += 1
        except:
            pass
print('Tested', counter, 'samples.')
test_x = np.array(feature_sets)
test_y = np.array(labels)
print('Accuracy:', accuracy.eval({x: test_x, y: test_y}))
Since we're using checkpoints now, and a larger dataset, we might actually want a separate testing function for accuracy:
def test_neural_network():
    prediction = neural_network_model(x)
    with tf.Session() as sess:
        sess.run(tf.initialize_all_variables())
        for epoch in range(hm_epochs):
            try:
                saver.restore(sess, "model.ckpt")
            except Exception as e:
                print(str(e))
            epoch_loss = 0

        correct = tf.equal(tf.argmax(prediction, 1), tf.argmax(y, 1))
        accuracy = tf.reduce_mean(tf.cast(correct, 'float'))
        feature_sets = []
        labels = []
        counter = 0
        with open('processed-test-set.csv', buffering=20000) as f:
            for line in f:
                try:
                    features = list(eval(line.split('::')[0]))
                    label = list(eval(line.split('::')[1]))
                    feature_sets.append(features)
                    labels.append(label)
                    counter += 1
                except:
                    pass
        print('Tested', counter, 'samples.')
        test_x = np.array(feature_sets)
        test_y = np.array(labels)
        print('Accuracy:', accuracy.eval({x: test_x, y: test_y}))

test_neural_network()
Just in case you missed something, the full pre-processing script:
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
import pickle
import numpy as np
import pandas as pd

lemmatizer = WordNetLemmatizer()

'''
polarity 0 = negative. 2 = neutral. 4 = positive.
id
date
query
user
tweet
'''


def init_process(fin, fout):
    outfile = open(fout, 'a')
    with open(fin, buffering=200000, encoding='latin-1') as f:
        try:
            for line in f:
                line = line.replace('"', '')
                initial_polarity = line.split(',')[0]
                if initial_polarity == '0':
                    initial_polarity = [1, 0]
                elif initial_polarity == '4':
                    initial_polarity = [0, 1]

                tweet = line.split(',')[-1]
                outline = str(initial_polarity) + ':::' + tweet
                outfile.write(outline)
        except Exception as e:
            print(str(e))
    outfile.close()

init_process('training.1600000.processed.noemoticon.csv', 'train_set.csv')
init_process('testdata.manual.2009.06.14.csv', 'test_set.csv')


def create_lexicon(fin):
    lexicon = []
    with open(fin, 'r', buffering=100000, encoding='latin-1') as f:
        try:
            counter = 1
            content = ''
            for line in f:
                counter += 1
                if (counter/2500.0).is_integer():
                    tweet = line.split(':::')[1]
                    content += ' '+tweet
                    words = word_tokenize(content)
                    words = [lemmatizer.lemmatize(i) for i in words]
                    lexicon = list(set(lexicon + words))
                    print(counter, len(lexicon))
        except Exception as e:
            print(str(e))

    # the training and usage scripts load 'lexicon.pickle', so save under that name
    with open('lexicon.pickle', 'wb') as f:
        pickle.dump(lexicon, f)

create_lexicon('train_set.csv')


def convert_to_vec(fin, fout, lexicon_pickle):
    with open(lexicon_pickle, 'rb') as f:
        lexicon = pickle.load(f)
    outfile = open(fout, 'a')
    with open(fin, buffering=20000, encoding='latin-1') as f:
        counter = 0
        for line in f:
            counter += 1
            label = line.split(':::')[0]
            tweet = line.split(':::')[1]
            current_words = word_tokenize(tweet.lower())
            current_words = [lemmatizer.lemmatize(i) for i in current_words]

            features = np.zeros(len(lexicon))

            for word in current_words:
                if word.lower() in lexicon:
                    index_value = lexicon.index(word.lower())
                    # OR DO +=1, test both
                    features[index_value] += 1

            features = list(features)
            outline = str(features) + '::' + str(label) + '\n'
            outfile.write(outline)

        print(counter)

convert_to_vec('test_set.csv', 'processed-test-set.csv', 'lexicon.pickle')


def shuffle_data(fin):
    df = pd.read_csv(fin, error_bad_lines=False)
    df = df.iloc[np.random.permutation(len(df))]
    print(df.head())
    df.to_csv('train_set_shuffled.csv', index=False)

shuffle_data('train_set.csv')


def create_test_data_pickle(fin):
    feature_sets = []
    labels = []
    counter = 0
    with open(fin, buffering=20000) as f:
        for line in f:
            try:
                features = list(eval(line.split('::')[0]))
                label = list(eval(line.split('::')[1]))

                feature_sets.append(features)
                labels.append(label)
                counter += 1
            except:
                pass
    print(counter)
    feature_sets = np.array(feature_sets)
    labels = np.array(labels)

create_test_data_pickle('processed-test-set.csv')
Neural Network Script:
import tensorflow as tf
import pickle
import numpy as np
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

n_nodes_hl1 = 500
n_nodes_hl2 = 500

n_classes = 2

batch_size = 32
total_batches = int(1600000/batch_size)
hm_epochs = 10

x = tf.placeholder('float')
y = tf.placeholder('float')

hidden_1_layer = {'f_fum': n_nodes_hl1,
                  # 2638 = length of our lexicon; change this if yours differs
                  'weight': tf.Variable(tf.random_normal([2638, n_nodes_hl1])),
                  'bias': tf.Variable(tf.random_normal([n_nodes_hl1]))}

hidden_2_layer = {'f_fum': n_nodes_hl2,
                  'weight': tf.Variable(tf.random_normal([n_nodes_hl1, n_nodes_hl2])),
                  'bias': tf.Variable(tf.random_normal([n_nodes_hl2]))}

output_layer = {'f_fum': None,
                'weight': tf.Variable(tf.random_normal([n_nodes_hl2, n_classes])),
                'bias': tf.Variable(tf.random_normal([n_classes]))}


def neural_network_model(data):
    l1 = tf.add(tf.matmul(data, hidden_1_layer['weight']), hidden_1_layer['bias'])
    l1 = tf.nn.relu(l1)

    l2 = tf.add(tf.matmul(l1, hidden_2_layer['weight']), hidden_2_layer['bias'])
    l2 = tf.nn.relu(l2)

    output = tf.matmul(l2, output_layer['weight']) + output_layer['bias']
    return output

saver = tf.train.Saver()
tf_log = 'tf.log'


def train_neural_network(x):
    prediction = neural_network_model(x)
    cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(prediction, y))
    optimizer = tf.train.AdamOptimizer(learning_rate=0.001).minimize(cost)

    with tf.Session() as sess:
        sess.run(tf.initialize_all_variables())
        try:
            epoch = int(open(tf_log, 'r').read().split('\n')[-2]) + 1
            print('STARTING:', epoch)
        except:
            epoch = 1

        while epoch <= hm_epochs:
            if epoch != 1:
                saver.restore(sess, "model.ckpt")
            epoch_loss = 0
            with open('lexicon.pickle', 'rb') as f:
                lexicon = pickle.load(f)
            with open('train_set_shuffled.csv', buffering=20000, encoding='latin-1') as f:
                batch_x = []
                batch_y = []
                batches_run = 0
                for line in f:
                    label = line.split(':::')[0]
                    tweet = line.split(':::')[1]
                    current_words = word_tokenize(tweet.lower())
                    current_words = [lemmatizer.lemmatize(i) for i in current_words]

                    features = np.zeros(len(lexicon))

                    for word in current_words:
                        if word.lower() in lexicon:
                            index_value = lexicon.index(word.lower())
                            # OR DO +=1, test both
                            features[index_value] += 1
                    line_x = list(features)
                    line_y = eval(label)
                    batch_x.append(line_x)
                    batch_y.append(line_y)
                    if len(batch_x) >= batch_size:
                        _, c = sess.run([optimizer, cost], feed_dict={x: np.array(batch_x),
                                                                      y: np.array(batch_y)})
                        epoch_loss += c
                        batch_x = []
                        batch_y = []
                        batches_run += 1
                        print('Batch run:', batches_run, '/', total_batches, '| Epoch:', epoch, '| Batch Loss:', c,)

            saver.save(sess, "model.ckpt")
            print('Epoch', epoch, 'completed out of', hm_epochs, 'loss:', epoch_loss)
            with open(tf_log, 'a') as f:
                f.write(str(epoch) + '\n')
            epoch += 1

train_neural_network(x)


def test_neural_network():
    prediction = neural_network_model(x)
    with tf.Session() as sess:
        sess.run(tf.initialize_all_variables())
        for epoch in range(hm_epochs):
            try:
                saver.restore(sess, "model.ckpt")
            except Exception as e:
                print(str(e))
            epoch_loss = 0

        correct = tf.equal(tf.argmax(prediction, 1), tf.argmax(y, 1))
        accuracy = tf.reduce_mean(tf.cast(correct, 'float'))
        feature_sets = []
        labels = []
        counter = 0
        with open('processed-test-set.csv', buffering=20000) as f:
            for line in f:
                try:
                    features = list(eval(line.split('::')[0]))
                    label = list(eval(line.split('::')[1]))
                    feature_sets.append(features)
                    labels.append(label)
                    counter += 1
                except:
                    pass
        print('Tested', counter, 'samples.')
        test_x = np.array(feature_sets)
        test_y = np.array(labels)
        print('Accuracy:', accuracy.eval({x: test_x, y: test_y}))
test_neural_network()
How'd we do? After about 10 epochs, we're sitting around 74% accuracy. This isn't anything to write home about, but at least it's better than 50%. While we're thinking of it, it might be wise to confirm that our dataset is a perfect 50/50 split. What if our model always predicts positive sentiment, and 74% of our data is positive? We had better check:
import pandas as pd
from collections import Counter

# ':::' is a multi-character separator, so use the python parsing engine
df = pd.read_csv('train_set.csv', names=['sentiment', 'tweet'], delimiter=':::', engine='python')
print(Counter(df['sentiment']))
Counter({'[1, 0]': 800000, '[0, 1]': 800000})
Well, that's good, a perfect 50/50 split.
What if we actually liked the 74% accuracy and wanted to use this model? How might we do that? We already have everything we need at this point. All we have to do is take a string input, vectorize it according to our bag-of-words model, and feed it through the neural network; the output will be [1,0] for positive sentiment or [0,1] for negative sentiment. Here's a quick script that simply uses the network:
import tensorflow as tf
import pickle
import numpy as np
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

n_nodes_hl1 = 500
n_nodes_hl2 = 500

n_classes = 2
hm_data = 2000000

batch_size = 32
hm_epochs = 10

x = tf.placeholder('float')
y = tf.placeholder('float')

current_epoch = tf.Variable(1)

hidden_1_layer = {'f_fum': n_nodes_hl1,
                  'weight': tf.Variable(tf.random_normal([2638, n_nodes_hl1])),
                  'bias': tf.Variable(tf.random_normal([n_nodes_hl1]))}

hidden_2_layer = {'f_fum': n_nodes_hl2,
                  'weight': tf.Variable(tf.random_normal([n_nodes_hl1, n_nodes_hl2])),
                  'bias': tf.Variable(tf.random_normal([n_nodes_hl2]))}

output_layer = {'f_fum': None,
                'weight': tf.Variable(tf.random_normal([n_nodes_hl2, n_classes])),
                'bias': tf.Variable(tf.random_normal([n_classes]))}


def neural_network_model(data):
    l1 = tf.add(tf.matmul(data, hidden_1_layer['weight']), hidden_1_layer['bias'])
    l1 = tf.nn.relu(l1)

    l2 = tf.add(tf.matmul(l1, hidden_2_layer['weight']), hidden_2_layer['bias'])
    l2 = tf.nn.relu(l2)

    output = tf.matmul(l2, output_layer['weight']) + output_layer['bias']
    return output

saver = tf.train.Saver()


def use_neural_network(input_data):
    prediction = neural_network_model(x)
    with open('lexicon.pickle', 'rb') as f:
        lexicon = pickle.load(f)

    with tf.Session() as sess:
        sess.run(tf.initialize_all_variables())
        saver.restore(sess, "model.ckpt")
        current_words = word_tokenize(input_data.lower())
        current_words = [lemmatizer.lemmatize(i) for i in current_words]
        features = np.zeros(len(lexicon))

        for word in current_words:
            if word.lower() in lexicon:
                index_value = lexicon.index(word.lower())
                # OR DO +=1, test both
                features[index_value] += 1

        features = np.array(list(features))
        # pos: [1,0] , argmax: 0
        # neg: [0,1] , argmax: 1
        result = (sess.run(tf.argmax(prediction.eval(feed_dict={x: [features]}), 1)))
        if result[0] == 0:
            print('Positive:', input_data)
        elif result[0] == 1:
            print('Negative:', input_data)

use_neural_network("He's an idiot and a jerk.")
use_neural_network("This was the best store i've ever seen.")
Output here should be:
Negative: He's an idiot and a jerk.
Positive: This was the best store i've ever seen.
I have hosted the checkpoint file and lexicon pickle for anyone who is interested, since running the 10-15 epochs to get a decent model can take a while.
In the next tutorial, we're going to be talking about using CUDA and the GPU version of TensorFlow.