In this Data Analysis with Python and Pandas tutorial, we're going to cover some of the Pandas basics. Data prior to being loaded into a Pandas dataframe can take multiple forms, but generally it needs to be a dataset that can be arranged into rows and columns. So maybe a dictionary like this:
web_stats = {'Day':[1,2,3,4,5,6], 'Visitors':[43,34,65,56,29,76], 'Bounce Rate':[65,67,78,65,45,52]}
We can turn this dictionary into a dataframe by doing the following:
import pandas as pd

web_stats = {'Day':[1,2,3,4,5,6],
             'Visitors':[43,34,65,56,29,76],
             'Bounce Rate':[65,67,78,65,45,52]}

df = pd.DataFrame(web_stats)
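As a quick sanity check, here is a minimal sketch of what that constructor call gives you. The variable names n_rows, n_cols, column_names, and ragged_ok are just illustrative. Note that recent versions of Pandas keep the columns in the dictionary's insertion order, while older versions sorted them alphabetically, so the column order in your printed output may differ from the output shown in this tutorial.

```python
import pandas as pd

web_stats = {'Day': [1, 2, 3, 4, 5, 6],
             'Visitors': [43, 34, 65, 56, 29, 76],
             'Bounce Rate': [65, 67, 78, 65, 45, 52]}

df = pd.DataFrame(web_stats)

# Each dictionary key becomes a column; each list supplies that column's rows.
n_rows, n_cols = df.shape
column_names = sorted(df.columns)  # sorted so this does not depend on column order

print(n_rows, n_cols)
print(column_names)

# Lists of different lengths cannot form clean rows and columns,
# so Pandas refuses to build the dataframe.
try:
    pd.DataFrame({'Day': [1, 2, 3], 'Visitors': [43, 34]})
    ragged_ok = True
except ValueError:
    ragged_ok = False

print(ragged_ok)
```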
Now what can we do? As seen before, you can call for a quick initial snippet by doing:
print(df.head())
   Bounce Rate  Day  Visitors
0           65    1        43
1           67    2        34
2           78    3        65
3           65    4        56
4           45    5        29
You may also want the last few lines instead. For this, you can do something like:
print(df.tail())
   Bounce Rate  Day  Visitors
1           67    2        34
2           78    3        65
3           65    4        56
4           45    5        29
5           52    6        76
Finally, you can also pass the number of rows you want to head or tail, like so:
print(df.tail(2))
   Bounce Rate  Day  Visitors
4           45    5        29
5           52    6        76
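To make the row counts concrete, here is a small sketch comparing the default call with an explicit number. The variable names are only for illustration.

```python
import pandas as pd

web_stats = {'Day': [1, 2, 3, 4, 5, 6],
             'Visitors': [43, 34, 65, 56, 29, 76],
             'Bounce Rate': [65, 67, 78, 65, 45, 52]}
df = pd.DataFrame(web_stats)

default_head = df.head()   # no argument: the first 5 rows
first_two = df.head(2)     # just the first 2 rows
last_two = df.tail(2)      # just the last 2 rows

print(len(default_head))
print(first_two)
print(last_two)
```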
In the printed output, notice the numbers on the left: 0, 1, 2, 3, 4, 5 and so on, like line numbers. These numbers are actually your "index." The index of a dataframe is what the data is related by, ordered by, and so on. Generally, it is going to be the variable that connects all of the data. In this case, we never defined anything for this purpose, and it would be a challenge for Pandas to just somehow "know" what that variable was. Thus, when you do not define an index, Pandas will just make one for you like this. Looking at the data set right now, do you see a column that connects the others?
The "Day" column fits that bill! Generally, if you have any dated data, the date will be the "index" as this is how all of the data points relate. There are many ways to identify the index, change the index, and so on. We'll cover a couple here. First, on any existing dataframe, we can set a new index like so:
df.set_index('Day', inplace=True)
Output:
     Bounce Rate  Visitors
Day
1             65        43
2             67        34
3             78        65
4             65        56
5             45        29
Now you can see that those line numbers are gone. Also notice how "Day" sits lower than the other column headers; this is done to denote the index. One thing to note is the use of inplace=True. This lets us modify the dataframe "in place," which means we change the dataframe that df refers to rather than getting back a new one. Without inplace=True, we would need to do something like:
df = df.set_index('Day')
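Here is a minimal sketch of the difference between the two approaches, along with a look at the default index Pandas creates when we don't specify one. The names default_index, indexed, and result are illustrative only.

```python
import pandas as pd

web_stats = {'Day': [1, 2, 3, 4, 5, 6],
             'Visitors': [43, 34, 65, 56, 29, 76],
             'Bounce Rate': [65, 67, 78, 65, 45, 52]}
df = pd.DataFrame(web_stats)

# With no index specified, Pandas numbers the rows 0 through 5.
default_index = list(df.index)

# Without inplace=True, set_index returns a new dataframe
# and leaves the original one untouched.
indexed = df.set_index('Day')
original_still_has_day_column = 'Day' in df.columns

# With inplace=True, df itself is changed and the call returns None.
result = df.set_index('Day', inplace=True)

print(default_index)
print(original_still_has_day_column)
print(result)
print(df.index.name)
```

A common slip is writing df = df.set_index('Day', inplace=True), which replaces df with None, so pick one style or the other.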
You can also set multiple indexes, but that's a more advanced topic for maybe a later date. It is easy to do, but the use cases for it are fairly niche.
Once you have a reasonable index that is either a datetime or a number like we have, it will work as the X axis. If the other columns are also numeric, then you can plot easily. Like we did before, go ahead and do:
import matplotlib.pyplot as plt
from matplotlib import style
style.use('fivethirtyeight')
Then, at the bottom of the script, you can plot. Remember earlier when we referenced a specific column? Maybe you noticed, but we can reference specific items in a dataframe like this:
print(df['Visitors'])
Day
1    43
2    34
3    65
4    56
5    29
6    76
Name: Visitors, dtype: int64
You can also reference a column as an attribute of the dataframe, so long as the column name doesn't contain any spaces, so you can do something like this:
print(df.Visitors)
Day
1    43
2    34
3    65
4    56
5    29
6    76
Name: Visitors, dtype: int64
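A quick sketch to show that both styles give the same column, and why "Bounce Rate" needs the bracket form. The variable names here are just for illustration.

```python
import pandas as pd

web_stats = {'Day': [1, 2, 3, 4, 5, 6],
             'Visitors': [43, 34, 65, 56, 29, 76],
             'Bounce Rate': [65, 67, 78, 65, 45, 52]}
df = pd.DataFrame(web_stats).set_index('Day')

# Attribute access and bracket access return the same column.
same_column = df.Visitors.equals(df['Visitors'])

# "Bounce Rate" has a space in it, so df.Bounce Rate is not valid Python.
# Brackets work with any column name.
bounce = df['Bounce Rate']

print(same_column)
print(bounce)
```

Attribute access also falls apart when a column name matches an existing dataframe method (a column called "count", for example), so the bracket form is the safer habit.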
So we can plot a single column like this:
df['Visitors'].plot()
plt.show()
We can also plot the entire dataframe. So long as the data is normalized or on the same scale, this will work just fine. Here's an example:
df.plot()
plt.show()
Notice how a legend is just automatically added? Another neat feature you might appreciate is that the legend also automatically moves out of the way of the actual plot lines. If you're new to Python and Matplotlib, this may not seem like a big deal, but with plain Matplotlib you have to ask for a legend yourself.
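If you want to poke at what df.plot() actually drew, here is a rough sketch. It assumes a non-interactive "Agg" backend so it runs without opening a window; that backend choice and the variable names are my own additions, not part of the steps above.

```python
import matplotlib
matplotlib.use('Agg')  # assumption: headless backend, so no window is needed
import matplotlib.pyplot as plt
import pandas as pd

web_stats = {'Day': [1, 2, 3, 4, 5, 6],
             'Visitors': [43, 34, 65, 56, 29, 76],
             'Bounce Rate': [65, 67, 78, 65, 45, 52]}
df = pd.DataFrame(web_stats).set_index('Day')

# df.plot() returns the Matplotlib Axes that the lines were drawn on.
ax = df.plot()

# One line per column, each labeled in the legend with its column name.
line_count = len(ax.lines)
legend_labels = [text.get_text() for text in ax.get_legend().get_texts()]

# The index ("Day") supplies the X values for every line.
x_values = list(ax.lines[0].get_xdata())

print(line_count)
print(legend_labels)
print(x_values)

plt.close('all')
```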
Finally, before we leave, you can also reference multiple columns at a time, like so (we only have 2 columns, but the same works with however many you start with):
print(df[['Visitors','Bounce Rate']])
So that's a list of column headers, held by brackets, within the brackets used to index the dataframe. You can plot this too.
These are some of the ways you can directly interact with your dataframe: referencing various parts of it, and graphing those specific parts.
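The single versus double brackets matter more than they look. As a small sketch (variable names are illustrative), one column name gives you back a Series, while a list of names gives you back a dataframe, even if the list only has one item:

```python
import pandas as pd

web_stats = {'Day': [1, 2, 3, 4, 5, 6],
             'Visitors': [43, 34, 65, 56, 29, 76],
             'Bounce Rate': [65, 67, 78, 65, 45, 52]}
df = pd.DataFrame(web_stats).set_index('Day')

single = df['Visitors']                     # one name -> Series
subset = df[['Visitors', 'Bounce Rate']]    # list of names -> dataframe
single_as_frame = df[['Visitors']]          # one-item list -> still a dataframe

print(type(single).__name__)
print(type(subset).__name__)
print(single_as_frame.shape)
```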