OK, I have a confession to make: I’m not a baseball fan. I apologise to all the hardcore fans out there; I realize that to many people baseball is life (for me the ultimate sport is soccer, but that’s a discussion for another place!). I do, however, love data, stats, and all things related. So when I decided to start writing a blog about stats, I realized that the sheer amount and variety of baseball data offered me the easiest route to what I wanted. No, this is not a sabermetrics blog (I’ll leave that to www.sabr.org and other fantastic sites); what I want is more of an adventure in data, where I just happen to be using baseball. My aim is to show how different analyses can be done in different statistical software (I have access to SAS 9.3, JMP 10, R 3.0.0, WEKA 3.6.6, and a couple of others that I’ll use for specific topics); any errors in interpreting the data are completely my own. I may not use every package every time, but I will use at least two, showing where one makes something more complicated, where one does something that is tricky but worthwhile, or where, despite all attempts and following the documentation, I just cannot make something work.
I have downloaded the Lahman 2012 Baseball database (from http://www.seanlahman.com/baseball-archive/statistics/) and have copied the Stadium information from the MLB site (www.mlb.mlb.com/team). Given the nature of the Stadium data, I had to do some cleanup in Excel just to make it easier to use.
I welcome questions, comments and ideas. If you’d like to see something specific using the baseball data, please post a comment and I’ll see what I can do!
So, let’s get going with our first project, which actually uses a piece of software I didn’t mention above. It comes from the Centers for Disease Control and Prevention (CDC) and is called Epi Info (wwwn.cdc.gov/epiinfo), a free GUI tool with a Microsoft Access back end, originally intended for public health analysis. What I really like about it, though, is that it makes mapping so easy!
I’ve taken the Addresses of the stadiums throughout both leagues, and using Latitudes and Longitudes from Geonames (http://download.geonames.org/export/zip/), I came up with a spreadsheet that looks like this:
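For anyone who would rather script this lookup than build it by hand, here is a minimal Python sketch. It assumes the match is made on postal code, and that the Geonames file is the tab-separated, header-less postal code dump with its twelve documented columns. The stadium field names (`country_code`, `postal_code`, `Lat`, `Long`) and the sample rows are made up for illustration; they are not the real stadium data.

```python
import csv
import io

# Column order of the Geonames postal code dump (tab-separated, no header row).
GEONAMES_COLUMNS = [
    "country_code", "postal_code", "place_name",
    "admin_name1", "admin_code1", "admin_name2", "admin_code2",
    "admin_name3", "admin_code3", "latitude", "longitude", "accuracy",
]


def load_postal_coords(geonames_file):
    """Return a dict mapping (country_code, postal_code) -> (lat, long)."""
    reader = csv.DictReader(
        geonames_file,
        fieldnames=GEONAMES_COLUMNS,
        delimiter="\t",
        quoting=csv.QUOTE_NONE,  # treat quote characters as plain text
    )
    coords = {}
    for row in reader:
        key = (row["country_code"], row["postal_code"])
        coords[key] = (float(row["latitude"]), float(row["longitude"]))
    return coords


def add_coords(stadiums, coords):
    """Return copies of the stadium rows with Lat and Long columns added.

    Rows whose postal code is not found get None for both columns.
    """
    located = []
    for stadium in stadiums:
        key = (stadium["country_code"], stadium["postal_code"])
        lat, lon = coords.get(key, (None, None))
        located.append({**stadium, "Lat": lat, "Long": lon})
    return located


# Placeholder data only: one fake Geonames line and two fake stadiums.
SAMPLE_GEONAMES = (
    "XX\t00001\tExample Town\tExample State\tES\t\t\t\t\t10.5\t-20.25\t4\n"
)
SAMPLE_STADIUMS = [
    {"team": "Example Team", "stadium": "Example Park",
     "country_code": "XX", "postal_code": "00001"},
    {"team": "Other Team", "stadium": "Other Park",
     "country_code": "XX", "postal_code": "99999"},
]

coords = load_postal_coords(io.StringIO(SAMPLE_GEONAMES))
located = add_coords(SAMPLE_STADIUMS, coords)
```

With a real download you would pass an open file handle instead of the `io.StringIO` sample. Because a postal code covers an area rather than a single address, the resulting point will usually sit near the stadium rather than exactly on it.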
When I open Epi Info, this is the screen:
Clicking on “Create Maps” brings up a Bing Map of the world; you can use it just as you would Google Earth, zooming in and out etc. You can also switch from Satellite to basic Street maps, depending on what you want to see.
To add our data to the map, click Add Data Layer and then Case Cluster (remember, this is intended for public health so “Case Cluster” is the logical label for it).
You’ll navigate to your file, and then once you click Open, you’ll get a screen like this one:
Using the drop-down boxes, select the columns that contain the Lat and Long fields, and then change the colour and apply any filters you want.
Here’s the map of the Baseball Stadiums, in just a few easy steps. The numbers in the circles indicate how many dots are in the vicinity; zoom in to see them split apart.
To show you how close I was able to get, here’s the home of the Toronto Blue Jays, with the dot to the left of it; not bad!
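If you want to put a number on "not bad", one option is the great-circle distance between the mapped dot and the stadium's true coordinates. The sketch below uses the standard haversine formula with a mean Earth radius; it is not something Epi Info does for you, and the function name is my own. The example call uses two arbitrary points on the equator rather than real stadium coordinates.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius; an approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/long points."""
    phi1 = math.radians(lat1)
    phi2 = math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Arbitrary check: one degree of longitude along the equator.
one_degree_km = haversine_km(0.0, 0.0, 0.0, 1.0)
```

To use it, pass the dot's Lat and Long first and the stadium's known coordinates second.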
OK, so this is all cool, but I want to take it to the next level: I want to see where the American League stadiums are compared to the National League ones. Going back to my spreadsheet, I simply create two tabs in the same workbook, one “AL” and one “NL”, with the data split accordingly. I then go through the same steps as above for each tab, selecting blue for the National League on the same screen where I pick the Lat / Long fields, and I get this map (zoomed in for emphasis):
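The league split can also be scripted instead of copying rows between tabs. This is a small Python sketch that groups rows by a league column and produces one CSV text per league. Writing CSV rather than workbook tabs is my own simplification, and the `League` column name and the sample rows are placeholders rather than the real spreadsheet.

```python
import csv
import io
from collections import defaultdict


def split_by_league(rows, league_field="League"):
    """Group rows into a dict keyed by the value in the league column."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[league_field]].append(row)
    return dict(groups)


def rows_to_csv_text(rows):
    """Serialise a non-empty list of dict rows to CSV text with a header."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Placeholder rows only, not the real stadium list.
SAMPLE_ROWS = [
    {"Stadium": "Example Park A", "League": "AL", "Lat": 10.0, "Long": -20.0},
    {"Stadium": "Example Park B", "League": "NL", "Lat": 11.0, "Long": -21.0},
    {"Stadium": "Example Park C", "League": "AL", "Lat": 12.0, "Long": -22.0},
]

by_league = split_by_league(SAMPLE_ROWS)
csv_by_league = {league: rows_to_csv_text(rows)
                 for league, rows in by_league.items()}
```

Each entry in `csv_by_league` can be saved to its own file and added as a separate data layer with its own colour.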
I hope that this has given you a taste of what I’m aiming to accomplish, and that at some point you’ll need to map some data and be able to get it finished with minimal frustration!
Until next time, the aim of my blog is nicely summarized by this quote: “In baseball you have terrific data and you can be a lot more creative with it.” (Nate Silver, taken from http://www.brainyquote.com/quotes/keywords/data.html).