What's going on everyone and welcome to the 2nd part of the chatbot with Python and TensorFlow tutorial series. By now, I am assuming you have the data downloaded, or you're just here to watch. With most machine learning, at some point you need to take the data and define an input and an output. With neural networks, this means input and output layers for the actual neural network. For a chatbot, this means we need to separate things out into a comment and then a reply. The comment is the input, the reply is the desired output. Now, with Reddit, not all comments have replies, and many comments will have many replies! We need to pick just one.
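So, concretely, every training sample we're after is just a comment paired with one chosen reply. Here's a made-up example of the kind of pair we want to end up with (the text is purely illustrative):

# A made-up example of the kind of (input, output) pair we want to extract:
training_pair = {
    'comment': "What's a good first project for learning Python?",              # the parent comment (input)
    'reply': "Build something small you'd actually use, like a to-do script.",  # the chosen reply (output)
}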
The other thing we need to consider is that, as we iterate over this file, we may find a reply, but then find a better reply later. One way to pick between them is to go off of upvotes and keep whichever reply scores higher. We might also only want replies that were upvoted at all, and so on. There are many things we could consider here, so feel free to tweak as you want!
To begin, here's the format of our data if we went the torrent route:
{"author":"Arve","link_id":"t3_5yba3","score":0,"body":"Can we please deprecate the word \"Ajax\" now? \r\n\r\n(But yeah, this _is_ much nicer)","score_hidden":false,"author_flair_text":null,"gilded":0,"subreddit":"reddit.com","edited":false,"author_flair_css_class":null,"retrieved_on":1427426409,"name":"t1_c0299ap","created_utc":"1192450643","parent_id":"t1_c02999p","controversiality":0,"ups":0,"distinguished":null,"id":"c0299ap","subreddit_id":"t5_6","downs":0,"archived":true}
Each line is like the above. We don't really need *all* of this data, but we definitely want to take the body, comment_id, and parent_id. If you downloaded the full torrent, or are using the BigQuery database, there's no shortage of sample data to work with, so I will also be using score. We can set limits for scores. We could also work with specific subreddits in an effort to make an AI that talks like a specific subreddit. For now, I will just have us working with all subreddits.
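To make those field names concrete, here's a minimal sketch (the helper name is just illustrative) of pulling only the keys we care about out of one line of the dump:

import json

def extract_fields(line):
    # Parse one raw line of the dump (a JSON object like the sample above)
    # and keep only the fields we care about.
    row = json.loads(line)
    return {
        'comment_id': row['name'],      # e.g. "t1_c0299ap"
        'parent_id': row['parent_id'],  # e.g. "t1_c02999p"
        'body': row['body'],            # the comment text itself
        'score': row['score'],          # net votes, handy for filtering later
        'subreddit': row['subreddit'],
        'created_utc': row['created_utc'],
    }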
Now, because even just a single month of comments can be more than 32 GB, we can't fit that into RAM, so we need to buffer through the data. My idea here is to buffer through the comment files and store the data we're interested in into an SQLite database. The idea here is that we can insert the comment data into this database. All comments will come chronologically, so every comment will initially be a "parent," with no parent of its own in the database yet. Over time, though, there will be replies; we can then store each "reply," which will have a parent in the database that we can also pull by id, and end up with rows where we have a parent comment and a reply.
Then, as time goes on, we might find replies to the parent comment that are voted higher than the one that is currently in there. When this happens, we can update that row with the new information so we can wind up with replies that are generally the more highly-voted ones.
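To make that replacement rule concrete, here's a rough sketch of how the swap might look (the helper name is hypothetical, and the parent_reply table it queries is created a bit further down):

def maybe_replace_reply(cursor, parent_id, new_comment, new_score):
    # Sketch only: keep just the highest-voted reply per parent.
    cursor.execute("SELECT score FROM parent_reply WHERE parent_id = ?", (parent_id,))
    row = cursor.fetchone()
    if row is not None and new_score > row[0]:
        # A reply is already stored for this parent, but the new one is voted
        # higher, so overwrite the stored reply and its score.
        cursor.execute("UPDATE parent_reply SET comment = ?, score = ? WHERE parent_id = ?",
                       (new_comment, new_score, parent_id))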
Anyway, there are a bunch of ways to do this, let's just get something started! To begin, let's make some imports:
import sqlite3
import json
from datetime import datetime
We will be using sqlite3 for our database, json to load in the lines from the data dump, and then datetime really just for logging. This won't be totally necessary.
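For example, the only thing datetime will really do for us is timestamp the occasional progress message, something like:

from datetime import datetime

# Lightweight progress logging, just a timestamp and a message:
print('{} - starting to buffer through the comment file'.format(str(datetime.now())))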
So the torrent dump came with a bunch of directories by year, which contain the actual JSON data dumps, named by year and month (YYYY-MM). They are compressed as .bz2. Make sure you extract the ones you intend to use. We're not going to write code to do it as part of the project, so make sure you do it!
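That said, if you'd rather not extract by hand, Python's built-in bz2 module can handle it. A quick sketch, assuming the May 2015 file name (adjust the paths to match yours):

import bz2
import shutil

# Example only: decompress one monthly dump to a plain file on disk.
with bz2.open('RC_2015-05.bz2', 'rb') as compressed, open('RC_2015-05', 'wb') as out:
    shutil.copyfileobj(compressed, out)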
Next, we'll start with some variables:
timeframe = '2015-05'
sql_transaction = []

connection = sqlite3.connect('{}.db'.format(timeframe))
c = connection.cursor()
The timeframe value is going to be the year and month of data that we're going to use. You could also make this a list of timeframes and iterate over them if you like. For now, I will just work with the May 2015 file. Next, we have sql_transaction. The "commit" is the more costly action in SQL. If you know you're going to be inserting millions of rows, you should also know you *really* don't want to commit them one by one. Instead, you build the statements up into a single transaction, execute it all, and THEN commit, in bulk groups (a rough sketch of such a helper follows the table-creation code below). Next, we want to make our table. With SQLite, the database is created by the connect call if it doesn't already exist.
def create_table():
    c.execute("CREATE TABLE IF NOT EXISTS parent_reply(parent_id TEXT PRIMARY KEY, comment_id TEXT UNIQUE, parent TEXT, comment TEXT, subreddit TEXT, unix INT, score INT)")
Here, we're preparing to store the parent_id, comment_id, the parent comment, the reply (comment), subreddit, the time, and then finally the score (votes) for the comment.
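As promised above, here's a rough sketch of the kind of bulk-transaction helper this design calls for (the name and the batch size of 1000 are just illustrative; we'll write the real version once we start inserting rows):

def queue_statement(sql):
    # Queue up statements and only hit the database in big batches,
    # committing once per batch instead of once per row.
    global sql_transaction
    sql_transaction.append(sql)
    if len(sql_transaction) > 1000:  # arbitrary batch size
        c.execute('BEGIN TRANSACTION')
        for s in sql_transaction:
            try:
                c.execute(s)
            except Exception:
                pass  # skip a bad statement rather than losing the whole batch
        connection.commit()
        sql_transaction = []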
Next, we can begin our main block of code:
if __name__ == '__main__':
    create_table()
Full code up to this point:
import sqlite3
import json
from datetime import datetime

timeframe = '2015-05'
sql_transaction = []

connection = sqlite3.connect('{}.db'.format(timeframe))
c = connection.cursor()

def create_table():
    c.execute("CREATE TABLE IF NOT EXISTS parent_reply(parent_id TEXT PRIMARY KEY, comment_id TEXT UNIQUE, parent TEXT, comment TEXT, subreddit TEXT, unix INT, score INT)")

if __name__ == '__main__':
    create_table()
Once we have this setup, we're ready to begin iterating over our data file and storing this information. We'll start doing that in the next tutorial!