About the Data

A scientist is only as good as his or her data. Unfortunately for me, as a one-man operation, I don't have the same resources as established outfits like Advanced NFL Stats, Pro Football Focus, or ESPN. Fortunately, there's still plenty one can do with publicly available statistics and the ability to ask good questions.

The first place to look is, naturally, the NFL's own website. The site does indeed have a vast trove of statistics, including game logs and seasonal information. Sadly, however, this information is not available in an easily downloadable format, and there is little flexibility in how you can search through the data.

Despite these setbacks, it is still possible to use a web scraping script to automatically convert pages of the NFL website into a format that enables further analysis. I wrote a Python script to perform this task, but the process is difficult and time-consuming. Efforts to download data from NFL.com are further hampered because the league seems to change its webpage format on a fairly regular basis, requiring frequent tweaks and rewrites to my program. As a result, it is often necessary to find a better data set in order to perform more advanced analyses.
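
To give a rough idea of what such a conversion involves, here is a minimal sketch, not the actual script. It uses only the Python standard library to pull the rows out of an HTML table and write them out as CSV. The names TableParser, table_to_csv, and SAMPLE_HTML are hypothetical, and the sample table is made-up placeholder data, not real NFL.com markup or statistics.

```python
from html.parser import HTMLParser
import csv
import io


class TableParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # completed rows, each a list of cell strings
        self._row = None   # row currently being read, or None
        self._cell = None  # text fragments of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:
                self.rows.append(self._row)
            self._row = None
            self._cell = None


def table_to_csv(html):
    """Return the table rows found in `html` as CSV text."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(parser.rows)
    return out.getvalue()


# Made-up placeholder table; a real page would be fetched and passed in instead.
SAMPLE_HTML = """
<table>
  <tr><th>Player</th><th>Yards</th></tr>
  <tr><td>Player A</td><td>312</td></tr>
  <tr><td>Player B</td><td>287</td></tr>
</table>
"""

if __name__ == "__main__":
    print(table_to_csv(SAMPLE_HTML))
```

A parser like this depends entirely on the page's markup, which is exactly why a change in the site's layout can break a scraper overnight.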

There are already some good resources which aim to provide detailed statistics in a more analytics-friendly way. One example is Pro-Football-Reference.com, which has 3rd-party databases with a public-facing query system. While a dramatic improvement over the official NFL website, it still doesn't allow the total freedom a raw dataset provides.

The best solution I've found so far is from the Armchair Analysis website. While clearly designed to cater to the needs of sports bettors, the site provides access to its databases for researchers at a significant discount (currently $25, though it was free when I first came across the site). These files contain a huge amount of well-organized and well-maintained data, from play-by-play details to full player and team statistics.

While I currently favor the statistics from Armchair Analysis, I fully subscribe to the theory of using whichever resources are most efficient, and I will use the data I was able to download from the NFL if it better suits the topic at hand. It's also possible I will find new sources of data in the future. I will therefore make it clear in each post exactly where the data is coming from (a nice side effect of my chosen format).
