Hello and welcome to a chatbot with Python tutorial series. In this series, we're going to cover how I created a halfway decent chatbot with Python and TensorFlow. Here are some examples of the chatbot in action:
I use Google and it works.
— Charles the AI (@Charles_the_AI) November 24, 2017
I prefer cheese.
— Charles the AI (@Charles_the_AI) November 24, 2017
The internet
— Charles the AI (@Charles_the_AI) November 24, 2017
I'm not sure . I'm just a little drunk.
— Charles the AI (@Charles_the_AI) November 24, 2017
My goal was to create a chatbot that could talk to people on the Twitch stream in real time, and not sound like a total idiot. In order to create a chatbot, or really do any machine learning task, the first job is to acquire training data. Then you need to structure and prepare it into an "input" and "output" format that a machine learning algorithm can digest. Arguably, this is where all the real work is in just about any machine learning project. Building the model and the training/testing steps are the easy parts!
For getting chat training data, there are quite a few resources you could look into. For example, the Cornell Movie-Dialogs Corpus seems to be one of the most popular. There are many other sources, but I wanted something that was more... raw. Something a little less polished... something with some character to it. Naturally, this took me to Reddit. At first, I thought I would use the Python Reddit API Wrapper, but the limits imposed by Reddit on crawling are not the most friendly. To collect bulk amounts of data, you'd have to break some rules. Instead, I found a data dump of 1.7 billion Reddit comments. Well, that should do it!
Reddit's structure is a tree, unlike a forum where everything is linear. The top-level comments are linear, but the replies to them branch out. Just in case some people aren't familiar, here's the general shape:
-Top level reply 1
--Reply to top level reply 1
--Reply to top level reply 1
---Reply to reply...
-Top level reply 2
--Reply to top level reply 2
-Top level reply 3
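In the dump itself, each comment is one JSON object per line, and the tree is encoded through the parent_id field, which points either at the post (a "t3_..." id) or at another comment's "name" (a "t1_..." id). Here is a small illustrative sketch of that mapping; the field names match the dump's comment objects, but the sample values below are made up:

```python
# Illustrative only: three fabricated comment lines showing how parent_id
# links a reply to its parent's "name", mirroring the tree in the example above.
import json

lines = [
    '{"name": "t1_aaa", "parent_id": "t3_post", "body": "Top level reply 1"}',
    '{"name": "t1_bbb", "parent_id": "t1_aaa", "body": "Reply to top level reply 1"}',
    '{"name": "t1_ccc", "parent_id": "t1_bbb", "body": "Reply to reply..."}',
]

for line in lines:
    c = json.loads(line)
    kind = "top-level (parent is the post)" if c["parent_id"].startswith("t3_") else "reply to another comment"
    print(f'{c["name"]} -> parent {c["parent_id"]} ({kind}): {c["body"]}')
```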
The structure we need for deep learning is input-output, so what we're really after is comment-and-reply pairs. From the example above, we could use the following as comment-reply pairs:
-Top level reply 1 and --Reply to top level reply 1
--Reply to top level reply 1 and ---Reply to reply...
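To make that concrete, here is a minimal, self-contained sketch of turning a handful of comments into such pairs. The field names (name, parent_id, body, score) match the dump's comment objects, but the sample data is fabricated, and keeping the highest-scored reply per parent is just a preview of the "one reply per comment" rule discussed below:

```python
# A sketch of the pairing rule, not the final pipeline: for each parent comment,
# emit exactly one (comment, reply) pair, choosing the highest-scored reply.
from collections import defaultdict

comments = [
    {"name": "t1_a", "parent_id": "t3_post", "body": "Top level reply 1",           "score": 5},
    {"name": "t1_b", "parent_id": "t1_a",    "body": "Reply to top level reply 1",  "score": 2},
    {"name": "t1_c", "parent_id": "t1_a",    "body": "Another, lower-scored reply", "score": 1},
    {"name": "t1_d", "parent_id": "t1_b",    "body": "Reply to reply...",           "score": 3},
]

by_name = {c["name"]: c for c in comments}
replies = defaultdict(list)                      # parent name -> replies to it
for c in comments:
    replies[c["parent_id"]].append(c)

pairs = []
for parent_name, kids in replies.items():
    if parent_name in by_name:                   # only comment -> comment pairs
        best = max(kids, key=lambda r: r["score"])   # top-voted reply wins
        pairs.append((by_name[parent_name]["body"], best["body"]))

print(pairs)
# [('Top level reply 1', 'Reply to top level reply 1'),
#  ('Reply to top level reply 1', 'Reply to reply...')]
```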
So, what we need to do is take this Reddit dump and produce these pairs at scale. The next thing to consider is that we should probably keep only one reply per comment. Even though a single comment may have many replies, we should pick just one: either the first reply, or the top-voted one. More on this later. Our first order of business is to get the data. If you have storage constraints, you can check out a single month of Reddit comments, January 2015. Otherwise, you can get the entire dump:
magnet:?xt=urn:btih:7690f71ea949b868080401c749e878f98de34d3d&dn=reddit%5Fdata&tr=http%3A%2F%2Ftracker.pushshift.io%3A6969%2Fannounce&tr=udp%3A%2F%2Ftracker.openbittorrent.com%3A80
I have only downloaded this torrent twice, but, depending on seeds and peers, your download speeds may vary significantly.
Finally, you can also access the data via Google BigQuery: Google BigQuery of all Reddit comments. The BigQuery tables appear to be updated over time, while the torrent isn't, so this is also a fine option. I am personally going to be using the torrent, because it is totally free. If you want to follow along exactly, you'll need that too, but feel free to adapt things to work with Google BigQuery if you prefer!
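If you do go the BigQuery route, a query from Python might look roughly like the sketch below. The google-cloud-bigquery client calls are standard, but the dataset path `fh-bigquery.reddit_comments.2015_01` is my assumption about where the public Reddit comments tables live (it may have moved), and you'll need a Google Cloud project with credentials and billing set up:

```python
# Hedged sketch: sampling a few comments from BigQuery instead of the torrent.
# Requires `pip install google-cloud-bigquery` and configured credentials.
# The dataset/table name below is an assumption and may have changed.
from google.cloud import bigquery

client = bigquery.Client()

query = """
    SELECT body, parent_id, score
    FROM `fh-bigquery.reddit_comments.2015_01`
    WHERE body != '[deleted]'
    LIMIT 10
"""

for row in client.query(query).result():
    print(row.score, row.parent_id, row.body[:80])
```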
Since the download of the data can take a considerable amount of time, I will break here. Continue in the next tutorial once you have the data downloaded. You can follow along with this entire tutorial series by downloading *just* the 2015-01 file; you do not need the entire 1.7-billion-comment dump. A single month will suffice.
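Once the download finishes, a quick sanity check is to stream a few comments straight out of the compressed file. This assumes the January 2015 file is named RC_2015-01.bz2; adjust the path to wherever yours actually lives:

```python
# Sanity check: read the first few comments from the compressed dump without
# decompressing the whole file. The filename is an assumption; adjust as needed.
import bz2
import json

with bz2.open("RC_2015-01.bz2", "rt", encoding="utf-8") as f:
    for _ in range(3):
        comment = json.loads(f.readline())
        print(comment.get("subreddit"), "|", comment["body"][:100])
```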