The GitHub Copilot you have at home
This model, which I am calling GPyT (Generative Python Transformer), is a small GPT model trained from scratch on ~80GB of pure Python code.
You can see how I built this dataset in the video series here: https://www.youtube.com/playlist?list=PLQVvvaa0QuDdKvPge9PXQtFzvhMRyFPhW
This dataset took a while to compile, and the model took even longer to actually train. I used the DGX Station A100 (4xA100 80GB cards) to train this model, but was only able to get 2 epochs done before the machine was taken away.
But, now, it's time to poke around and see what this model can do! I have hosted this model on HuggingFace.co, which you can find here: https://huggingface.co/Sentdex/GPyT
I created both a TensorFlow and a PyTorch version, so everyone should be happy. This model can also run on a CPU, so if you don't have a GPU, feel free to remove every .to("cuda") that you see.
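If you'd rather not hand-edit every .to("cuda"), here is a minimal sketch of picking the device once and reusing it. The helper names pick_device and load_gpyt are my own for illustration and are not part of the original code. It also uses AutoModelForCausalLM, which is the class the transformers deprecation warning printed later in this post recommends for causal language models; I'm assuming it loads this checkpoint the same way.

```python
def pick_device():
    """Return "cuda" when PyTorch can see a GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        # Without PyTorch there is no GPU to move tensors to.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


def load_gpyt(model_name="Sentdex/GPyT", device=None):
    """Load the tokenizer and model, then move the model to the chosen device.

    Hypothetical helper: downloads the checkpoint on first use.
    """
    # Imported lazily so that pick_device() works without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    device = device or pick_device()
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name).to(device)
    return tokenizer, model, device
```

With this, the tokenized input can be moved with .to(device) instead of .to("cuda"), and the same code runs with or without a GPU.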
This model is meant for research purposes only. You should definitely not use it for commercial purposes, or for much of anything else, since the code it outputs could quite literally be anything.
You can use the following code to get started:
from transformers import AutoTokenizer, AutoModelWithLMHead

tokenizer = AutoTokenizer.from_pretrained("Sentdex/GPyT")
model = AutoModelWithLMHead.from_pretrained("Sentdex/GPyT").to("cuda")

def generate(code, max_length=100):
    '''Takes input code, replaces newline chars with <N>,
    tokenizes, feeds thru model, decodes,
    then reformats the newlines back in'''
    newlinechar = "<N>"
    converted = code.replace("\n", newlinechar)
    tokenized = tokenizer.encode(converted, return_tensors='pt').to("cuda")
    resp = model.generate(tokenized, max_length=max_length).to("cuda")
    decoded = tokenizer.decode(resp[0])
    reformatted = decoded.replace("<N>","\n")
    return reformatted

print(generate("import"))
/home/h/.local/lib/python3.8/site-packages/transformers/models/auto/modeling_auto.py:843: FutureWarning: The class `AutoModelWithLMHead` is deprecated and will be removed in a future version. Please use `AutoModelForCausalLM` for causal language models, `AutoModelForMaskedLM` for masked language models and `AutoModelForSeq2SeqLM` for encoder-decoder models.
  warnings.warn(
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
import numpy as np
import pytest
import pandas as pd
from pandas import DataFrame, Series, date_range
import pandas._testing as tm
class TestDataFrameToDatetime:
    def test_to_json_multiindex(self):
        # GH#17043
        df = DataFrame(
            {
                "a": [1, 2, 3, 4
In this case, we started with only "import" as input, and we can see the model continued by importing numpy as np, pytest, pandas as pd, and so on. We can allow for a longer length to see more:
print(generate("import", max_length=500))
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
import numpy as np import pytest import pandas as pd from pandas import DataFrame, Series, date_range import pandas._testing as tm class TestDataFrameToDatetime: def test_to_json_multiindex(self): # GH#17043 df = DataFrame( { "a": [1, 2, 3, 4], "b": [1.0, 2.0, 3.0, 4.0], "c": [1.0, 2.0, 3.0, 4.0], "d": [1.0, 2.0, 3.0, 4.0], "e": [1.0, 2.0, 3.0, 4.0], "f": [1.0, 2.0, 3.0, 4.0], "g": [1.0, 2.0, 3.0, 4.0], "h": [1.0, 2.0, 3.0, 4.0], "i": [1.0, 2.0, 3.0, 4.0], "j": [1.0, 2.0, 3.0, 4.0], "k": [1.0, 2.0, 3.0, 4.0], } ) <pad> result = df.to_json(orient="records", lines=True) expected = DataFrame( { "a": [1.0, 2.0, 3.0, 4.0], } ) assert result == expected <pad><pad> <pad><pad> = json.loads(result) <pad> = json.loads(result) <pad> = json.loads(result) <pad> = json.loads(result) <pad> = json.loads(result) <pad><pad> = json.loads(result) <pad><pad><pad> = json.loads(result) <pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad>
Not exactly the prettiest (note the leftover <pad> tokens and the repeated json.loads lines near the end), but we didn't start with much either.
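The generate function relies on a simple round trip: newlines become <N> on the way in and get turned back into newlines on the way out. Here is a small sketch of that round trip as standalone helpers, with one extra step that drops leftover <pad> tokens. The helper names are mine, and stripping <pad> is purely cosmetic: it hides the tokens but does nothing to fix whatever the model was trying to emit there.

```python
NEWLINE_TOKEN = "<N>"   # same placeholder the generate function uses
PAD_TOKEN = "<pad>"     # token seen in the longer output above


def encode_newlines(code):
    """Replace real newlines with the placeholder before tokenizing."""
    return code.replace("\n", NEWLINE_TOKEN)


def decode_output(decoded):
    """Drop <pad> tokens, then turn the placeholder back into real newlines."""
    without_pad = decoded.replace(PAD_TOKEN, "")
    return without_pad.replace(NEWLINE_TOKEN, "\n")


print(encode_newlines("import os\nx = 1\n"))
print(decode_output("a = 1<N><pad><pad>b = 2<N><pad>"))
```

tokenizer.decode also accepts skip_special_tokens=True, but whether that would drop <N> as well depends on how this tokenizer was set up, which I have not checked, so the string replacement above is the safer bet.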
We can do much more interesting things from here too. For example:
inp = """import matplotlib.pyplot as plt
x = [1, 2, 3]
y = [5, 6, 2]"""
print(generate(inp))
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
import matplotlib.pyplot as plt
x = [1, 2, 3]
y = [5, 6, 2] # [1, 2, 3]
plt.figure()
plt.plot(x, y)
plt.plot(x, y)
plt.plot(x, y)
plt.plot(x, y)
plt.plot(x, y)
Some repetition at the end there, but it sort of figured things out. We can run that:
import matplotlib.pyplot as plt
x = [1, 2, 3]
y = [5, 6, 2] # [1, 2, 3]
plt.figure()
plt.plot(x, y)
plt.plot(x, y)
plt.plot(x, y)
plt.plot(x, y)
plt.plot(x, y)
[<matplotlib.lines.Line2D at 0x7f31b416f070>]
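The repeated plt.plot lines come from plain greedy decoding. One thing worth trying is the repetition controls that model.generate already accepts: repetition_penalty and no_repeat_ngram_size. Below is a sketch of a variant of the generate function that passes them through. The function name and the default values are my own guesses, not tuned settings, and since real code legitimately repeats itself a lot, I'm not claiming these will make the output better.

```python
NEWLINE_TOKEN = "<N>"


def generate_with_repetition_penalty(model, tokenizer, code, device="cpu",
                                     max_length=100, repetition_penalty=1.2,
                                     no_repeat_ngram_size=0):
    """Same flow as generate(), but with repetition controls passed to the model.

    repetition_penalty=1.0 means no penalty; no_repeat_ngram_size=0 disables
    the n-gram block. The defaults here are illustrative only.
    """
    converted = code.replace("\n", NEWLINE_TOKEN)
    tokenized = tokenizer.encode(converted, return_tensors="pt").to(device)
    resp = model.generate(
        tokenized,
        max_length=max_length,
        repetition_penalty=repetition_penalty,
        no_repeat_ngram_size=no_repeat_ngram_size,
    )
    decoded = tokenizer.decode(resp[0])
    return decoded.replace(NEWLINE_TOKEN, "\n")
```

It takes the model and tokenizer as arguments rather than using globals, so it can be called with whatever objects you already loaded, for example generate_with_repetition_penalty(model, tokenizer, inp, device="cuda").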
We can also nudge the model a bit. What if we wanted a scatterplot, but forgot how to make one?
inp = """import matplotlib.pyplot as plt
x = [1, 2, 3]
y = [5, 6, 2]
# scatterplot
"""
print(generate(inp))
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
import matplotlib.pyplot as plt
x = [1, 2, 3]
y = [5, 6, 2]
# scatterplot
plt.scatter(x, y, c='r', label='x')
plt.scatter(x, y, c='r', label='y')
plt.scatter(x, y, c='r', label='x')
plt.scatter(x,
import matplotlib.pyplot as plt
x = [1, 2, 3]
y = [5, 6, 2]
# scatterplot
plt.scatter(x, y, c='r', label='x')
<matplotlib.collections.PathCollection at 0x7f31b4107760>
Not TERRIBLE. It at least helped us get on the right track. What about a histogram? That one has a few more params you need to remember and handle:
inp = """import matplotlib.pyplot as plt
x = [1, 2, 3]
# y = [5, 6, 2]
# histogram
"""
print(generate(inp))
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
import matplotlib.pyplot as plt
x = [1, 2, 3]
# y = [5, 6, 2]
# histogram
plt.hist(x, bins=5)
plt.title('Histogram')
plt.ylabel('Histogram')
plt.legend()
plt.show()
plt.figure()
plt.plot(x, y)<N
Making a slight mod to the range, we're actually off to an okay start here:
import matplotlib.pyplot as plt
x = [1, 2, 3]
# y = [5, 6, 2]
# histogram
plt.hist(x, bins=5)
plt.title('Histogram')
plt.ylabel('Histogram')
plt.legend()
No handles with labels found to put in legend.
<matplotlib.legend.Legend at 0x7f31b405b6a0>
Okay, so it does some matplotlib, which is neat, and it even added some labels for us. But what else can we do? How about some few-shot learning? Let's limit ourselves to just the next line with a helper function, and see if we can get the right package for a given task:
def next_line_only(original, model_out):
    orig_nl = original.count("\n")
    one_more_lines = [l for l in model_out.splitlines(True)][:orig_nl+1]
    one_more_line = "".join(one_more_lines)
    return one_more_line
inp = """# graphing:
import matplotlib
# web requests:
import requests
# array math:
"""
print(next_line_only(inp, generate(inp)))
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
# graphing:
import matplotlib
# web requests:
import requests
# array math:
import numpy as np
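To see what next_line_only is doing without loading the model, here is a sketch that runs it on a made-up model output. The fake_model_out string is invented for illustration and is not real GPyT output. The helper counts the newlines in the prompt and keeps that many lines plus one, so everything the model wrote past its first new line gets dropped.

```python
def next_line_only(original, model_out):
    """Keep the prompt's lines plus exactly one more line from the model."""
    orig_nl = original.count("\n")
    # splitlines(True) keeps the line endings so the text can be rejoined as-is.
    kept = model_out.splitlines(True)[: orig_nl + 1]
    return "".join(kept)


prompt = """# graphing:
import matplotlib
# web requests:
import requests
# array math:
"""

# Hypothetical output: the prompt echoed back plus two generated lines.
fake_model_out = prompt + "import numpy as np\nimport pandas as pd\n"

print(next_line_only(prompt, fake_model_out))
```

Only the first generated line survives, which is why the output above stops right after the numpy import.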
inp = """# graphing:
import matplotlib
# web requests:
import requests
# neural networks
"""
print(next_line_only(inp, generate(inp)))
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
# graphing:
import matplotlib
# web requests:
import requests
# neural networks
from keras.layers import Dense, Dropout, Flatten, Flatten
inp = """# graphing:
import matplotlib
# web requests:
import requests
# build website
"""
print(next_line_only(inp, generate(inp)))
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
# graphing:
import matplotlib
# web requests:
import requests
# build website
from flask import Flask, request, jsonify, request
This behavior is fairly interesting. Can it be more interesting?
inp = """from flask import Flask, render_template
app = Flask(__name__)
@app.route('/')
"""
#print(next_line_only(inp, generate(inp)))
print(generate(inp))
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
from flask import Flask, render_template
app = Flask(__name__)
@app.route('/')
def index():
    return render_template("index.html", methods=['GET'])
@app.route('/index')
def index():
    return render_template("index.html", methods=['GET'])
@app.route('/index')
def index():
    return render_
It at least recognized this was an index page, and populated quite a bit for us. There's some repetition, but we can handle that with a bit of logic:
def stop_at_repeat(model_out):
    lines = model_out.splitlines(True)
    no_repeat = ""
    for l in lines:
        if no_repeat.count(l) == 0 or l == "\n":
            no_repeat += l
        else:
            return no_repeat
    return no_repeat
inp = """from flask import Flask, render_template
app = Flask(__name__)
@app.route('/')
"""
#print(next_line_only(inp, generate(inp)))
m = generate(inp)
print(stop_at_repeat(m))
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
from flask import Flask, render_template
app = Flask(__name__)
@app.route('/')
def index():
    return render_template("index.html", methods=['GET'])
@app.route('/index')
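One thing to be aware of with stop_at_repeat: no_repeat.count(l) is a substring search over everything kept so far, so a short line like "x = 1" also matches inside an earlier "max = 1" and cuts the output early. Here is a sketch of a variant that compares whole lines using a set instead. The name stop_at_repeat_lines is mine, and unlike the original it lets any whitespace-only line through, not just a bare newline.

```python
def stop_at_repeat_lines(model_out):
    """Return the output up to (not including) the first repeated whole line."""
    seen = set()
    kept = []
    for line in model_out.splitlines(True):
        # Blank or whitespace-only lines are allowed to repeat.
        if line.strip() and line in seen:
            break
        seen.add(line)
        kept.append(line)
    return "".join(kept)


# Made-up output for illustration: "x = 1" is a substring of "max = 1",
# but it only counts as a repeat the second time the full line appears.
sample = "max = 1\nx = 1\ny = 2\nx = 1\nz = 3\n"
print(stop_at_repeat_lines(sample))
```

On the Flask output above, both versions should stop at the same place, since the first repeated line there is a full "def index():" line.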
Next, let's copy and paste some tutorial code from https://pythonprogramming.net/convolutional-neural-network-deep-learning-python-tensorflow-keras/, then just add a comment asking for a conv layer:
inp = """import tensorflow as tf
from tensorflow.keras.datasets import cifar10
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Activation, Flatten
from tensorflow.keras.layers import Conv2D, MaxPooling2D
import pickle
pickle_in = open("X.pickle","rb")
X = pickle.load(pickle_in)
pickle_in = open("y.pickle","rb")
y = pickle.load(pickle_in)
X = X/255.0
model = Sequential()
# Conv:
"""
m = generate(inp, max_length=500)
print(stop_at_repeat(m))
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
import tensorflow as tf
from tensorflow.keras.datasets import cifar10
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Activation, Flatten
from tensorflow.keras.layers import Conv2D, MaxPooling2D
import pickle
pickle_in = open("X.pickle","rb")
X = pickle.load(pickle_in)
pickle_in = open("y.pickle","rb")
y = pickle.load(pickle_in)
X = X/255.0
model = Sequential()
# Conv:
model.add(Conv2D(64, (3, 3), strides=(1, 1), padding="same"))
model.add(MaxPooling2D((2, 2), strides=(1, 1), padding="same"))
Cool, we got a convolutional layer, along with pooling. Next, a flatten layer:
inp = """import tensorflow as tf
from tensorflow.keras.datasets import cifar10
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Activation, Flatten
from tensorflow.keras.layers import Conv2D, MaxPooling2D
import pickle
pickle_in = open("X.pickle","rb")
X = pickle.load(pickle_in)
pickle_in = open("y.pickle","rb")
y = pickle.load(pickle_in)
X = X/255.0
model = Sequential()
# Conv:
model.add(Conv2D(64, (3, 3), strides=(1, 1), padding="same"))
model.add(MaxPooling2D((2, 2), strides=(1, 1), padding="same"))
# Flatten:
"""
m = generate(inp, max_length=500)
print(stop_at_repeat(m))
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
import tensorflow as tf
from tensorflow.keras.datasets import cifar10
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Activation, Flatten
from tensorflow.keras.layers import Conv2D, MaxPooling2D
import pickle
pickle_in = open("X.pickle","rb")
X = pickle.load(pickle_in)
pickle_in = open("y.pickle","rb")
y = pickle.load(pickle_in)
X = X/255.0
model = Sequential()
# Conv:
model.add(Conv2D(64, (3, 3), strides=(1, 1), padding="same"))
model.add(MaxPooling2D((2, 2), strides=(1, 1), padding="same"))
# Flatten:
model.add(Flatten())
inp = """import tensorflow as tf
from tensorflow.keras.datasets import cifar10
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Activation, Flatten
from tensorflow.keras.layers import Conv2D, MaxPooling2D
import pickle
pickle_in = open("X.pickle","rb")
X = pickle.load(pickle_in)
pickle_in = open("y.pickle","rb")
y = pickle.load(pickle_in)
X = X/255.0
model = Sequential()
# Conv:
model.add(Conv2D(64, (3, 3), strides=(1, 1), padding="same"))
model.add(MaxPooling2D((2, 2), strides=(1, 1), padding="same"))
# Flatten:
model.add(Flatten())
# Output 10 classes:
"""
m = generate(inp, max_length=500)
print(stop_at_repeat(m))
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
import tensorflow as tf
from tensorflow.keras.datasets import cifar10
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Activation, Flatten
from tensorflow.keras.layers import Conv2D, MaxPooling2D
import pickle
pickle_in = open("X.pickle","rb")
X = pickle.load(pickle_in)
pickle_in = open("y.pickle","rb")
y = pickle.load(pickle_in)
X = X/255.0
model = Sequential()
# Conv:
model.add(Conv2D(64, (3, 3), strides=(1, 1), padding="same"))
model.add(MaxPooling2D((2, 2), strides=(1, 1), padding="same"))
# Flatten:
model.add(Flatten())
# Output 10 classes:
model.add(Dense(10))
We could also try some PyTorch, taking some code from the docs and changing some aspects:
https://pytorch.org/tutorials/beginner/basics/buildmodel_tutorial.html
inp = """class NeuralNetwork(nn.Module):
    def __init__(self):
        super(NeuralNetwork, self).__init__()
        self.flatten = nn.Flatten()
        self.linear_relu_stack = nn.Sequential(
            nn.Linear(28*28, 128),
            nn.ReLU(),
            nn.Linear(256, 256),
            nn.ReLU(),
            nn.Linear(128, 10),
            nn.ReLU()
        )
    def forward(self, x):
"""
m = generate(inp, max_length=200)
print(m)
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
class NeuralNetwork(nn.Module):
    def __init__(self):
        super(NeuralNetwork, self).__init__()
        self.flatten = nn.Flatten()
        self.linear_relu_stack = nn.Sequential(
            nn.Linear(28*28, 128),
            nn.ReLU(),
            nn.Linear(256, 256),
            nn.ReLU(),
            nn.Linear(128, 10),
            nn.ReLU()
        )
    def forward(self, x):
        x = self.flatten(x)
        x = self.linear_relu_stack(x)
        x = self.linear_relu_stack(x)
        return x
class NeuralNetwork(nn.Module):
    def __init__(self):
Looks like it's starting to make another NN class, and the forward method isn't quite perfect, but it is not outrageous.
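Since the model often gets cut off mid-statement like this, a cheap sanity check is to ask Python whether the text even parses. Here is a sketch using the standard library's ast module. The helper names are mine, and passing this check only means the text is syntactically valid Python. It says nothing about whether the code is correct or safe to run.

```python
import ast


def is_valid_python(source):
    """True if the text parses as Python. This checks syntax only."""
    try:
        ast.parse(source)
    except (SyntaxError, ValueError):
        # IndentationError is a subclass of SyntaxError;
        # ValueError covers source containing null bytes.
        return False
    return True


def longest_valid_prefix(source):
    """Drop trailing lines until what is left parses, or return ''."""
    lines = source.splitlines(True)
    for end in range(len(lines), 0, -1):
        candidate = "".join(lines[:end])
        if is_valid_python(candidate):
            return candidate
    return ""


# Made-up truncated output for illustration: the list is never closed.
truncated = "x = 1\ny = [1, 2\n"
print(is_valid_python(truncated))
print(longest_valid_prefix(truncated))
```

This only trims from the end, so it helps with truncation but not with a broken line in the middle of the output.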
So, this clearly isn't anything we can take to market or that is going to steal your job, nor can it compete with something like GitHub's Copilot (probably, anyway; from what I've seen, very few people have even been able to tinker with it). But this is a massive improvement over the best example I could come up with just 3 years ago.
There's definitely some basic level of understanding here, and there's probably a whole lot more that this model can do that I just don't know about yet, so I encourage you all to play with the model and share what you learn!
Keep in mind that this model is capable of generating just about anything, since it took code from all of GitHub. This means it could produce nefarious code, copyrighted/licensed code, and much more that I haven't even considered. This model is intended to be used at your own risk and for educational and research purposes only! I mostly embarked on this project just to see where we were with automated code-writing. I started working on it before I even knew Copilot was a thing, and it does look like GPT-3 can produce even better results. I just have absolutely no way to create a GPT-3 variant, so GPT-2 will have to do :D.
Have fun.