Deep learning is hungry for large datasets, which makes anything that can be modeled or simulated a great candidate: we can generate as much training data as we want.
With Python, we can easily create our own environments, but there are also quite a few libraries out there that do this for you. The most popular that I know of is OpenAI's gym environments.
The same goes for domains like mathematics, or even encryption, where we can easily generate hundreds of thousands, or millions, of samples.
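For example, here's a minimal sketch of that idea for a toy arithmetic task (the function name and details are just illustrative, not part of this tutorial's code):

import random

def make_addition_samples(n):
    # each sample pairs two random integers with their sum
    samples = []
    for _ in range(n):
        a, b = random.randint(0, 99), random.randint(0, 99)
        samples.append(([a, b], a + b))
    return samples

samples = make_addition_samples(100000)  # generate as many as we like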
For this tutorial, we're going to use the "CartPole" environment.
To follow along, you'll need the following:
Installing the GPU version of TensorFlow in Ubuntu
Installing the GPU version of TensorFlow on a Windows machine
Using TensorFlow and concept tutorials:
Introduction to deep learning with neural networks
The idea of CartPole is that there is a pole standing upright on top of a cart, and the goal is to keep that pole balanced by wiggling/moving the cart from side to side.
The episode counts as a success if we can keep the pole balanced for 200 frames, and it counts as a failure once the pole tips more than 15 degrees from vertical.
Every frame that the pole stays "balanced" (less than 15 degrees from vertical), our "score" gets +1, and our target is a score of 200.
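If you want to poke at the environment before writing any real logic, Gym exposes the action and observation spaces directly. A quick look (the comments reflect what CartPole-v0 reports, to the best of my knowledge):

import gym

env = gym.make("CartPole-v0")
print(env.action_space)       # Discrete(2): action 0 pushes the cart left, 1 pushes it right
print(env.observation_space)  # Box(4,): cart position, cart velocity, pole angle, pole velocity at tip
observation = env.reset()
print(observation)            # a 4-element vector describing the starting state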
Now, how do we do this? There are endless ways, some very complex, and some very specific. I'd like to solve this very generally, and in a way that we could easily apply this same solution to a wide variety of problems.
This will also give me the ability to illustrate a very interesting property of neural networks. If you've ever taken a statistics course, you might be familiar with the idea that you can have several signals, each with only modest predictive power, and combine them into something with more predictive power than the sum of its parts.
Neural networks are fully capable of doing this entirely on their own.
To illustrate this, we're going to start by creating an agent that simply chooses actions (left or right) at random in this CartPole environment. Recall that our goal is to get a score of 200, but we'll go ahead and learn from any game where we scored at least 50.
From here, the input layer is the observation from the environment, which includes things like the cart position and pole angle. The output layer is just one of two actions: left or right.
Alright, let's get started:
import gym
import random
import numpy as np
import tflearn
from tflearn.layers.core import input_data, dropout, fully_connected
from tflearn.layers.estimator import regression
from statistics import median, mean
from collections import Counter
LR = 1e-3                      # learning rate for the network
env = gym.make("CartPole-v0")
env.reset()
goal_steps = 500               # max frames per game (CartPole-v0 itself stops episodes at 200)
score_requirement = 50         # only learn from games that scored at least this much
initial_games = 10000          # how many random games to play while gathering data
Now, let's just get a quick impression of what a random agent looks like.
def some_random_games_first():
    # Each of these is its own game.
    for episode in range(5):
        env.reset()
        # this is each frame, up to 200...but we won't make it that far.
        for t in range(200):
            # This will display the environment
            # Only display if you really want to see it.
            # Takes much longer to display it.
            env.render()

            # This will just create a sample action in any environment.
            # In this environment, the action can be 0 or 1, which is left or right
            action = env.action_space.sample()

            # this executes the environment with an action,
            # and returns the observation of the environment,
            # the reward, if the env is over, and other info.
            observation, reward, done, info = env.step(action)
            if done:
                break

some_random_games_first()
Each time you see the scene start over, that's because the environment was "done." In our case, we kept losing.
Now that you've seen what random is, can we learn from it? Absolutely.
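Before we learn anything, it helps to quantify just how bad random play is. Here's a quick, optional baseline check (no rendering, so it runs fast); in my experience random play averages somewhere around 20 frames, but measure it yourself:

random_scores = []
for _ in range(1000):
    env.reset()
    score = 0
    done = False
    while not done:
        _, reward, done, _ = env.step(env.action_space.sample())
        score += reward
    random_scores.append(score)

print('Random-play average score:', mean(random_scores))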
def initial_population():
    # [OBS, MOVES]
    training_data = []
    # all scores:
    scores = []
    # just the scores that met our threshold:
    accepted_scores = []
    # iterate through however many games we want:
    for _ in range(initial_games):
        # reset the env at the start of every game
        # (stepping an already-"done" environment is not valid)
        env.reset()
        score = 0
        # moves specifically from this environment:
        game_memory = []
        # previous observation that we saw
        prev_observation = []
        # for each frame in 200
        for _ in range(goal_steps):
            # choose random action (0 or 1)
            action = random.randrange(0, 2)
            # do it!
            observation, reward, done, info = env.step(action)

            # notice that the observation is returned FROM the action
            # so we'll store the previous observation here, pairing
            # the prev observation to the action we'll take.
            if len(prev_observation) > 0:
                game_memory.append([prev_observation, action])
            prev_observation = observation
            score += reward
            if done:
                break

        # IF our score is higher than our threshold, we'd like to save
        # every move we made
        # NOTE the reinforcement methodology here.
        # all we're doing is reinforcing the score, we're not trying
        # to influence the machine in any way as to HOW that score is
        # reached.
        if score >= score_requirement:
            accepted_scores.append(score)
            for data in game_memory:
                # convert to one-hot (this is the output layer for our neural network)
                if data[1] == 1:
                    output = [0, 1]
                elif data[1] == 0:
                    output = [1, 0]
                # saving our training data
                training_data.append([data[0], output])

        # save overall scores
        scores.append(score)

    # just in case you wanted to reference later
    training_data_save = np.array(training_data)
    np.save('saved.npy', training_data_save)

    # some stats here, to further illustrate the neural network magic!
    print('Average accepted score:', mean(accepted_scores))
    print('Median score for accepted scores:', median(accepted_scores))
    print(Counter(accepted_scores))

    return training_data
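One small aside about that np.save call: because each element of training_data mixes an observation array with a Python list, NumPy stores it as an object array. If you come back to saved.npy later with a newer NumPy, you'll most likely need allow_pickle=True to load it:

# reload the saved training data later (allow_pickle is needed for object arrays
# on newer NumPy versions)
training_data = np.load('saved.npy', allow_pickle=True).tolist()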
Now we will make our neural network. We're just going to use a simple multilayer perceptron model.
def neural_network_model(input_size):
    network = input_data(shape=[None, input_size, 1], name='input')

    # note: tflearn's dropout() takes keep_prob, so 0.8 means we KEEP 80% of units
    network = fully_connected(network, 128, activation='relu')
    network = dropout(network, 0.8)

    network = fully_connected(network, 256, activation='relu')
    network = dropout(network, 0.8)

    network = fully_connected(network, 512, activation='relu')
    network = dropout(network, 0.8)

    network = fully_connected(network, 256, activation='relu')
    network = dropout(network, 0.8)

    network = fully_connected(network, 128, activation='relu')
    network = dropout(network, 0.8)

    network = fully_connected(network, 2, activation='softmax')
    network = regression(network, optimizer='adam', learning_rate=LR,
                         loss='categorical_crossentropy', name='targets')

    model = tflearn.DNN(network, tensorboard_dir='log')

    return model
def train_model(training_data, model=False):
    X = np.array([i[0] for i in training_data]).reshape(-1, len(training_data[0][0]), 1)
    y = [i[1] for i in training_data]

    if not model:
        model = neural_network_model(input_size=len(X[0]))

    model.fit({'input': X}, {'targets': y}, n_epoch=5, snapshot_step=500,
              show_metric=True, run_id='openai_learning')
    return model
If you do not understand the neural network code, see the tutorials linked at the beginning of this notebook. I've already covered neural networks extensively, so there's no sense in repeating myself!
Let's produce the training data:
training_data = initial_population()
Take note here that, in my run, the average accepted score was 60, the median was 57, and the HIGHEST example was 111, which was the only one above 100. (Your exact numbers will vary, since the games are random.) Now, let's train our neural network on the data that gave us these scores...
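Before handing this to train_model, a quick optional shape check makes the reshape in that function concrete; assuming CartPole's 4-value observation, you should see something like this:

X = np.array([i[0] for i in training_data]).reshape(-1, len(training_data[0][0]), 1)
print(X.shape)              # (total saved frames, 4, 1) -- one 4x1 column per frame
print(training_data[0][1])  # a one-hot target such as [0, 1]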
model = train_model(training_data)
Now we're going to use code very similar to the initial_population function. The only major difference is that, rather than choosing a random action, we'll generate an action FROM our neural network. We'll go ahead and visualize these games as well, and then save some stats:
scores = []
choices = []
for each_game in range(10):
    score = 0
    game_memory = []
    prev_obs = []
    env.reset()
    for _ in range(goal_steps):
        env.render()

        # on the first frame we have no previous observation, so act randomly
        if len(prev_obs) == 0:
            action = random.randrange(0, 2)
        else:
            action = np.argmax(model.predict(prev_obs.reshape(-1, len(prev_obs), 1))[0])

        choices.append(action)

        new_observation, reward, done, info = env.step(action)
        prev_obs = new_observation
        game_memory.append([new_observation, action])
        score += reward
        if done:
            break

    scores.append(score)

print('Average Score:', sum(scores) / len(scores))
print('choice 1:{}  choice 0:{}'.format(choices.count(1) / len(choices), choices.count(0) / len(choices)))
print('Score Requirement:', score_requirement)
Solved.
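If you want to keep this result around without re-collecting data and retraining, tflearn models can be saved and restored. A sketch (the filename here is just an example):

model.save('cartpole.model')

# later, in a fresh session: rebuild the same network, then load the weights
model = neural_network_model(input_size=4)
model.load('cartpole.model')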
That's all for now, head back home for more tutorials.