Welcome to part 6 of the chatbot with Python and TensorFlow tutorial series. In this part, we're going to work on creating our training data. There are two different overall models and workflows that I am considering working with in this series: one I know works (shown in the beginning and running live on the Twitch stream), and another that can probably work better, but I am still poking around with it. Either way, our setup for the training data is relatively similar. We need to create paired "parent" and "reply" text files, where each line is one sample: line 15 in the parent file is a parent comment, and line 15 in the reply file is the response to that comment.
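For instance, a hypothetical pair of lines (the comments here are made up purely for illustration) would line up like this:

train.from, line 15:  what is the best way to learn python?
train.to,   line 15:  just build stuff, tutorials only get you so far.

Because the two files stay in lockstep line by line, they can be read side by side as input/target pairs later.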
To create these files, we just need to grab pairs from the database, and append them to the respective training files. Let's begin with:
import sqlite3
import pandas as pd

timeframes = ['2015-05']

for timeframe in timeframes:
For the run-through here, I am just working with a single month, so I only created one database. You might instead want to create a single database with a table per month and year, or create a bunch of SQLite databases with tables like we have, one per month, and then iterate over them to create your files. If you went the multiple-database route, gathering the timeframes could look something like the following sketch (which assumes one .db file per month sits in the working directory):
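from glob import glob

# Hypothetical: build the timeframes list from every SQLite database
# file in the current directory, e.g. '2015-05.db' -> '2015-05'.
timeframes = [fname[:-3] for fname in glob('*.db')]

Anyway, I just have the one, so I will leave timeframes as a single-item list. Let's continue building this loop: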
for timeframe in timeframes:
    connection = sqlite3.connect('{}.db'.format(timeframe))
    c = connection.cursor()
    limit = 5000
    last_unix = 0
    cur_length = limit
    counter = 0
    test_done = False
The first line just establishes our connection, then we define the cursor, and then the limit. The limit is the size of the chunk we're going to pull from the database at a time. Again, we're working with data that is plausibly much larger than the RAM we have. We want to set the limit to 5000 for now, so we can build some testing data; we can raise it later. We'll use last_unix to help us make successive pulls from the database, cur_length will tell us when we're done, counter will let us print some debugging information, and test_done flags when we've finished building the testing data.
    while cur_length == limit:
        df = pd.read_sql("SELECT * FROM parent_reply WHERE unix > {} AND parent NOT NULL AND score > 0 ORDER BY unix ASC LIMIT {}".format(last_unix, limit), connection)
        last_unix = df.tail(1)['unix'].values[0]
        cur_length = len(df)
So long as cur_length is the same as our limit, we've still got more pulling to do. We pull a chunk of data from the database and slap it into a dataframe. We store the last_unix so we know what point in time to draw from on the next pull, and we take note of the length of the return. For now, we're not doing much with the dataframe, but later we could use it to set more restrictions on the types of data we want to consider. For example, a hypothetical extra filter (the 500-character cap is made up purely for illustration, not part of this tutorial's code) might look like this:
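        # Hypothetical: only keep pairs whose reply is under 500 characters,
        # to weed out extremely long comments.
        df = df[df['comment'].astype(str).str.len() < 500]

Now, let's take this data and build our training/testing files. We'll start with the testing data: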
        if not test_done:
            with open('test.from', 'a', encoding='utf8') as f:
                for content in df['parent'].values:
                    f.write(content + '\n')

            with open('test.to', 'a', encoding='utf8') as f:
                for content in df['comment'].values:
                    f.write(str(content) + '\n')

            test_done = True
Now, if you want, you could also raise the limit at this point: after test_done = True, you could re-define limit to something much larger, like 100,000, since the test files only need a small sample. Now, let's do the training code:
        else:
            with open('train.from', 'a', encoding='utf8') as f:
                for content in df['parent'].values:
                    f.write(content + '\n')

            with open('train.to', 'a', encoding='utf8') as f:
                for content in df['comment'].values:
                    f.write(str(content) + '\n')
We could make this code simpler and better by factoring the file-writing into a function, so we're not copying and pasting essentially the same block twice. A quick sketch of that refactor (the helper name write_pairs is my own invention, not part of the tutorial code):
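def write_pairs(from_file, to_file, df):
    # Append one parent/reply pair per line, keeping the two files in lockstep.
    with open(from_file, 'a', encoding='utf8') as f:
        for content in df['parent'].values:
            f.write(str(content) + '\n')
    with open(to_file, 'a', encoding='utf8') as f:
        for content in df['comment'].values:
            f.write(str(content) + '\n')

With that, the if/else above collapses to two one-line calls. But... instead... let's keep this ball rolling: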
        counter += 1
        if counter % 20 == 0:
            print(counter * limit, 'rows completed so far')
Here, we'll see output every 20 steps, so every 100,000 pairs if we keep the limit at 5,000.
Full code up to this point:
import sqlite3
import pandas as pd

timeframes = ['2015-05']

for timeframe in timeframes:
    connection = sqlite3.connect('{}.db'.format(timeframe))
    c = connection.cursor()
    limit = 5000
    last_unix = 0
    cur_length = limit
    counter = 0
    test_done = False

    while cur_length == limit:
        df = pd.read_sql("SELECT * FROM parent_reply WHERE unix > {} AND parent NOT NULL AND score > 0 ORDER BY unix ASC LIMIT {}".format(last_unix, limit), connection)
        last_unix = df.tail(1)['unix'].values[0]
        cur_length = len(df)

        if not test_done:
            with open('test.from', 'a', encoding='utf8') as f:
                for content in df['parent'].values:
                    f.write(content + '\n')

            with open('test.to', 'a', encoding='utf8') as f:
                for content in df['comment'].values:
                    f.write(str(content) + '\n')

            test_done = True

        else:
            with open('train.from', 'a', encoding='utf8') as f:
                for content in df['parent'].values:
                    f.write(content + '\n')

            with open('train.to', 'a', encoding='utf8') as f:
                for content in df['comment'].values:
                    f.write(str(content) + '\n')

        counter += 1
        if counter % 20 == 0:
            print(counter * limit, 'rows completed so far')
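Once it finishes, a quick sanity check is to confirm that each .from file has exactly as many lines as its .to partner; this check is my own addition, not part of the tutorial code:

with open('train.from', encoding='utf8') as f1, open('train.to', encoding='utf8') as f2:
    # The two files should be the same length, one sample per line.
    print(sum(1 for _ in f1), sum(1 for _ in f2))

If the two numbers match, the parent/reply pairs are aligned.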
Alright, run that, and I'll see you when you've got the data ready!