Assuming you have the machine learning data file downloaded, you're ready to learn how to parse the data.
The data set mimics having visited the web pages at the time, except that we don't actually need to visit them. We have the full HTML source code, so it is just like parsing the website, minus the bandwidth use.
First, we want to know which date each piece of data corresponds to; then we're going to pull the actual data.
To start:
import pandas as pd
import os
import time
from datetime import datetime

path = "X:/Backups/intraQuarter"
Above, we're importing pandas (as pd), os so that we can interact with directories, and time and datetime for managing time and date information.
Finally, we define "path," which is the path to the intraQuarter folder (you need to unzip the original zip file you downloaded from this website).
def Key_Stats(gather="Total Debt/Equity (mrq)"):
    statspath = path+'/_KeyStats'
    stock_list = [x[0] for x in os.walk(statspath)]
    #print(stock_list)
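Here, we begin our function. The gather parameter defaults to "Total Debt/Equity (mrq)", which is the value we're going to try to collect.
statspath is the path to the _KeyStats directory, which holds the stats.
stock_list is a quick one-line list comprehension that uses os.walk to visit statspath and every directory below it. os.walk yields one tuple per directory, and x[0] is that directory's path, so stock_list holds the path of statspath itself followed by the path of each stock's directory.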
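To see what that comprehension returns, here is a minimal sketch run against a throwaway directory tree rather than the real intraQuarter data. The helper name list_walk_dirs and the ticker folder names are made up for illustration; only os.walk and its tuple layout come from the standard library.

```python
import os
import tempfile

def list_walk_dirs(root):
    """Return the directory path (first tuple element) for every step of os.walk."""
    return [entry[0] for entry in os.walk(root)]

# Build a temporary tree that imitates _KeyStats; the ticker names are hypothetical.
with tempfile.TemporaryDirectory() as root:
    for ticker in ("aapl", "msft"):
        os.mkdir(os.path.join(root, ticker))

    dirs = list_walk_dirs(root)
    print(dirs[0] == root)  # the root directory itself is the first entry
    print(sorted(os.path.basename(d) for d in dirs[1:]))  # the ticker folders follow
```

Because the root comes first, any loop over the stock directories has to skip the first element of the list.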
Next:
    for each_dir in stock_list[1:]:
        each_file = os.listdir(each_dir)
        if len(each_file) > 0:
Above, we're cycling through every directory, one per stock ticker. We slice with stock_list[1:] because the first entry that os.walk returns is the _KeyStats directory itself. Then we define each_file, which is a list of all of the files in that stock's directory. If its length is greater than 0, we proceed; some stocks have no files/data.
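The empty-directory check can be tried in isolation. The sketch below is not part of the tutorial's function: files_by_directory is a hypothetical helper, and the folder and file names are invented so the example runs anywhere.

```python
import os
import tempfile

def files_by_directory(directories):
    """Map each directory to its file listing, dropping directories that are empty."""
    listing = {}
    for each_dir in directories:
        each_file = os.listdir(each_dir)
        if len(each_file) > 0:
            listing[each_dir] = each_file
    return listing

with tempfile.TemporaryDirectory() as root:
    # One ticker folder with a snapshot file, one with nothing in it (names are made up).
    full = os.path.join(root, "aapl")
    empty = os.path.join(root, "msft")
    os.mkdir(full)
    os.mkdir(empty)
    open(os.path.join(full, "20040130190102.html"), "w").close()

    result = files_by_directory([full, empty])
    print({os.path.basename(d): files for d, files in result.items()})
```

Only the folder that actually contains a file survives, which is the behavior the len(each_file) > 0 test is after.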
            for file in each_file:
                date_stamp = datetime.strptime(file, '%Y%m%d%H%M%S.html')
                unix_time = time.mktime(date_stamp.timetuple())
                print(date_stamp, unix_time)
                #time.sleep(15)

Key_Stats()
Finally, we run a for loop that pulls the date_stamp from each file. Our files are stored under their ticker, and each file name is the exact date and time at which the information was pulled.
From there, we tell datetime.strptime what the format of our date stamp is, then we convert the result to a unix time stamp with time.mktime.
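The date conversion can be checked on a single file name without touching the data set. This is a small sketch: parse_file_stamp is a hypothetical wrapper around the same two calls, and the sample file name is invented to match the format string.

```python
import time
from datetime import datetime

def parse_file_stamp(file_name):
    """Turn a name like '20040130190102.html' into (datetime, unix time stamp)."""
    date_stamp = datetime.strptime(file_name, '%Y%m%d%H%M%S.html')
    # time.mktime treats the time tuple as local time.
    unix_time = time.mktime(date_stamp.timetuple())
    return date_stamp, unix_time

date_stamp, unix_time = parse_file_stamp("20040130190102.html")  # made-up file name
print(date_stamp)  # 2004-01-30 19:01:02
print(unix_time)   # numeric value depends on the machine's time zone
print(datetime.fromtimestamp(unix_time) == date_stamp)  # round trip back to the datetime
```

Since time.mktime works in local time, the same file name produces different unix time stamps on machines set to different time zones. A file name that does not match the format makes datetime.strptime raise a ValueError.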