There is a community of gamers who spend their leisure time finding every nook, cranny, glitch, and hack in order to complete video games at breakneck speed. They are speedrunners. The community is quite popular because watching players break these games is entertaining going beyond just watching a game. There is precise execution; there is deep analysis; there is a little luck. The community decided to put together an event for charity because, hey, people like watching players break their favorite games.
Games Done Quick was born. In January 2010, Classic Games Done Quick kicked off a “franchise” that has grown to raise millions of dollars per event. They typically do two events per year, winter and summer. However, there are a few special events – Harvey Relief Done Quick, for example, raised over $225,000 in just a few days for the Houston Food Bank.
You can go to their website to learn more about them, how to watch, and so forth. Their next event is Awesome Games Done Quick 2018, the winter version, from January 7th-14th in Herndon, VA.
The rest of this will be about the data wrangling. Oh my. The data wrangling. You hear in data analysis all the time: you spend about 90% of your time wrangling the data and only about 10% actually analyzing it. That was absolutely true.
You see, Games Done Quick hosts data on their tracking website. The games that were run, when they were run, start times, end times, who the player was, etc. They also have records of every donation they’ve received. Great! Except the datasets are not merged, and the over 296,000 donations are not in a database. In fact, those 296,000 donations are spread across 5,928 URLs, about 50 to a page. Luckily, the list of runs was on one page and was easy just to copy/paste to Excel.
Given the data challenges, I thought this was a fantastic opportunity to write some Python code to accomplish the following:
- Scrape the donation data from the HTML using the Beautiful Soup package
- Create a dataset of every donation in a .csv file
- Merge the “run” and “donation” datasets into a “master” dataset that hooks up donations with the game during which they were given (if there was one)
The thing with that plan was that I had never written any Python code before, but I’ve always wanted to learn. My coding experience was limited to SQL, SAS, and some R. Good thing Google and Stack Overflow exist. I’m sure there’s a better way than how I did it. Here’s a link to my code in GitHub. It’s honestly pretty messy right now, as I frequently had to revise things and add things as I cleaned data. The file names right now won’t match between the two functions I wrote due to some manual Excel work I did between running the two functions.
I wrote a function to loop through every URL, parse the HTML, slam the results into a data frame with the Pandas package, and put out a .csv. I wrote a second function to convert datetimes into standard formats, merge the two files, and put out a second .csv. The merging part was the hardest of all this.
Typically with two Excel files, you could just join them in Tableau on a common key. There wasn’t a common key here; what I wanted was to do a left join with my donations dataset as the parent. I wanted the join to be where the donation time was between the start and end times of the video game run. Tableau does not support this join functionality with text/Excel files. Ugh.
I went to Stack Overflow to solve this problem. It was relatively common, and there were several solutions. Since I have SQL experience, I was looking for a solution in that mindset. Luckily, there is a Pandasql package that allows you to write SQL that gets executed by Python. So, I ended up writing SQL to merge the files together.
That felt a little…counterproductive? My goal was to learn Python. At the same time, the right tool for the right job, you know? SQL made it relatively easy for me to merge the files, and this isn’t a repeated job where performance is a concern. I just needed to run this once.
Going beyond the logistics of merging the files, there were other data cleanup issues. I wanted to analyze the games, but game names were not consistent from event to event. For example, there were “Legend of Zelda” and “The Legend of Zelda” rows of data. It’s the same game, but those are treated as different games when aggregating data.
I created a reference file that has a row for each unique game name and then made two additional columns – a standardized “formatted” game name and a “series” name to aggregate all the Zelda games, all the Sonic games, etc. The series names and which games fit under them, to be honest, are a little subjective on my part. For example, Mariokart games are their own series, not under Mario Brothers games. There are lots of other little quirks with assigning a series to a game and with assigning a “formatted” name. Do I include fan-made games based on other games in the same series? Do I separate the “formatted” names of original and remastered versions of the same game? Again, I made some subjective choices.
Additional data problems included: missing or duplicate run times (like assigning a large group of games to a start time of January 7th, 2014 6:00 AM), overlapping game times (which often ended up being an AM/PM switch), and no actual start/end times (some times on the tracking website were the scheduled times from before the event began, not the actual start/end times recorded during the event). Many of these issues were for earlier events, where the dollar amounts were smaller, so hopefully the games analysis isn’t thrown off too much because of these issues. I corrected what issues I could using GDQ Vods and YouTube/archive.org to a small extent. Not everything is 100% correct. Some things couldn’t be solved.
To associate a donation/run with a specific event, I wrote a calculation manually in Tableau. Lastly, donations might still be trickling in for Harvey Relief Done Quick, so my results are slightly out of date.
Ok. That’s enough wrangling. Click here to view the interactive version of the dashboard. It’s really meant to be interactive. I’ve included screenshots of the main two dashboards below with some analysis, but seriously. Go explore the data, click on stuff, have fun.
GDQ has really taken off in the last few years. The first decent-sized bump comes at AGDQ 2014, which was the first event to raise over one million dollars. AGDQ 2017, just three years later, raised more than double that amount (over $2.2 million). The event has come a long way from its $10,531.64 total raised in its first event, Classic Games Done Quick. The average donation has more than doubled since the event’s inception. Also note the seasonality – more people are inside watching games in the winter than in the summer.
Another data quirk, for which I attempted to account but likely did not fully eliminate, was that sponsors have made big, data-skewing donations. Those are included in the above dashboard, but I tried to throw the sponsor donations out when analyzing individual games. I manually categorized sponsors vs. non-sponsors. I didn’t want to skew the data because a sponsor made a $10,000 donation (go go sponsors!). The games analysis is below.
Mario Brothers is unsurprisingly the top game series by number of total runs, but Super Metroid appears to be the most popular game. It has the highest number of runs. Over its lifespan, it doesn’t beat out the “Tasbot plays” games* in dollars per minute, but keep in mind the difference between my custom “formatted” game name and individual runs. “Game” refers to the aggregate, formatted game name. Individual run is…an individual run. Not aggregate.
*”Tasbot plays” games are where a human has programmed a computer to play the game, which is pretty cool.
In the boxplot graph at the bottom, that dot at the far right in the “Longer Run” row with $2,708 dollars per minute, is a Super Metroid run from AGDQ 2017. In just a little over two hours, AGDQ gathered 5,913 donations and $346,633.74 (excluding sponsors). No individual run has gathered more dollars than that Super Metroid run at AGDQ 2017. A different Super Metroid run from AGDQ 2012 appears in the top 10 of individual runs as well – number four overall with $185,965.84 (excluding sponsors).
In terms of total dollars, Super Metroid has gathered the most dollars of any game, and it’s not close. It appears to be the most popular game.
A note – I chose dollars per minute for the dashboard instead of raw totals because there can be very short games and very long games, and some short games can be popular. I figured dollars/minute to be a proxy for engagement/popularity that was properly adjusting for the length of the run. Of course longer runs will get more raw dollars.
I’ll be honest; this was a lot of work. My original goal was to learn some Python and tell a neat story about a great event. I think I accomplished those things, but it took way more data cleaning than I anticipated. That’s OK. It was fun. GDQ is fun.