Design Challenge 1: Data Sets

by Mike Gleicher on February 13, 2017

These are the “approved” data sets for Design Challenge 1. Remember, you must use one of these approved data sets. If you want to use a different data set, you must get it approved (and we’ll put it on this list).

This list is in no particular order.  The datasets are available in this Box folder.

Metropolitan Area Population Change

Note: this data set is small / easy. If you pick this one, the expectations for what you will need to do with it are much higher. I really dislike the vis on the Census Bureau website; you should do better (from the visualization, you can link to the data table). But the data is too small, and I’m not sure how many rich stories are to be found in it.

White House Budget Data

The data used in developing the budgets (back in 2016 and 2017). From the White House GitHub. I recommend going to the 2017 branch and selecting “download ZIP” (look for the green “clone or download” button). There is good documentation, and the data is quite rich – giving historical spending in a lot of categories.

In the past, we considered the “receipts” data as small, and the “budgets and outlays” as harder data sets. Here we’re grouping them together.

Airline On-Time Performance

The Bureau of Transportation Statistics lets you download a lot of data, one month at a time, from this page. We’ve downloaded a few months for you – but even if you download our versions, you might want to refer to this page for explanations of all the fields, and for the lookup tables (files that say what the codes mean).

For this data set, you may choose to use the months we downloaded, or download your own (please specify what data you use). You can choose to use just 1 month, or you can pick multiple months to compare (if you want a real challenge).

Nationwide Crime Data

One of the functions of the Federal Bureau of Investigation (FBI) is to compile crime statistics within the US and use this information to help local law enforcement curtail crime. Every year, the FBI releases this data along with recommendations for communities to stem violent crime. We have downloaded the 2014 dataset (as well as 2015) of types of crime by area, available on Box.

If you use this dataset, we ask that you resist ranking cities/states or their law enforcement capabilities by their crime numbers; the FBI requests that the data not be used this way. Showing trends and patterns should be your goal here.

Census Data By County

You can get census data in all kinds of forms. This page has 4 spreadsheets. Any one of them could tell an interesting story – but you probably want to put together multiple files. The complication is that it’s a long list of counties (you might just pick some, or try to give a sense of the range of what is going on, or identify unusual things, or …). The files are also in the Box.

The files are:

  • Population Estimates – has data 2010-2015 (per year) with inflows and outflows. There is a separate sheet in the Excel file that explains the columns.
  • Education – has data from multiple years (1970, 1980, 1990, 2000, 2015) for different levels of educational attainment.
  • Unemployment – has data from many different years
  • Poverty Estimates – mainly 2015 data, explanations for the columns in a separate sheet.
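If you do want to put several of these files together, the usual approach is to join them on a shared county identifier. The following is only a minimal sketch of that idea using the Python standard library: the column names (`fips`, `pop_2015`, `poverty_pct_2015`) and the sample rows are made up for illustration and will not match the real spreadsheets, which you would first need to export to CSV and inspect.

```python
import csv
import io

# Hypothetical stand-ins for two of the files, exported to CSV.
# Column names and values are invented for illustration only.
POPULATION_CSV = """fips,county,pop_2015
00001,County A,1000
00002,County B,2500
00003,County C,400
"""
POVERTY_CSV = """fips,poverty_pct_2015
00001,12.5
00003,20.0
"""


def index_by_county(csv_file, key="fips"):
    """Read a CSV file and index its rows by the county identifier."""
    return {row[key]: row for row in csv.DictReader(csv_file)}


def join_counties(*tables):
    """Inner join: keep only counties present in every table."""
    shared = set(tables[0])
    for table in tables[1:]:
        shared &= set(table)
    joined = {}
    for code in sorted(shared):
        combined = {}
        for table in tables:
            combined.update(table[code])
        joined[code] = combined
    return joined


population = index_by_county(io.StringIO(POPULATION_CSV))
poverty = index_by_county(io.StringIO(POVERTY_CSV))
counties = join_counties(population, poverty)

for code, row in counties.items():
    print(code, row["county"], row["pop_2015"], row["poverty_pct_2015"])
```

An inner join silently drops counties that are missing from any one file (County B in the made-up data above), so it is worth checking how many counties survive the join before drawing conclusions.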

Time Use Survey

The American Time Use Survey (ATUS) tracks how people spend their time. There are corresponding international versions. There are actually lots of different surveys with interesting data available from the IPUMS website.

Getting a data set requires picking from all the options. And you can probably pull together an interesting data set in many ways. I grabbed one from the site. I also checked that, despite the scary agreements I had to agree to, sharing it with a class is legal (see this), so I put a grab of how Americans’ time use has changed over the years into the DataSets Box folder.

You can find out what the “time use codes” mean on this page.

Interpreting the other codes requires some digging, unfortunately. Some are self-explanatory, but others… I tracked down the “FAMINCOME” columns: explanation here. The state codes are here.


Student Contributed Data Sets

Beijing Air Quality Data

Two data sets about air quality in Beijing, joined into a single cohesive table.

From the contributor:

The data comes from two sources:

  1. Air quality data: http://www.stateair.net/web/historical/1/1.html (need to download each .csv separately)
  2. Weather data: https://www.wunderground.com/history/airport/ZBAA (here’s a link to a .csv for 2011)

I first pulled the air quality data (where measurements are taken multiple times a day), and aggregated to be at the daily level. Then I merged the weather data to the air quality data. I have a GitHub repository with the data and R and Python code.

Note: the GitHub repo has not only the documentation for the data and the data conveniently processed into a CSV file, but also code for some basic visualizations. I can’t stop you from looking at the code. But, if you are not the author, you cannot turn in these visualizations.
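To make the contributor's processing description concrete, here is a minimal sketch of the two steps (aggregate sub-daily air quality readings to one value per day, then merge in the daily weather). This is not the contributor's code. The column names (`timestamp`, `pm25`, `date`, `mean_temp_c`), the sample values, and the choice of a simple mean as the daily aggregate are all assumptions for illustration; the real files are laid out differently.

```python
import csv
import io
from collections import defaultdict
from statistics import mean

# Hypothetical sample data; real column names and values differ.
AIR_CSV = """timestamp,pm25
2011-01-01 00:00,120
2011-01-01 12:00,80
2011-01-02 00:00,40
"""
WEATHER_CSV = """date,mean_temp_c
2011-01-01,-3
2011-01-02,-1
"""


def daily_mean_pm25(air_file):
    """Average the sub-daily readings for each calendar day."""
    readings = defaultdict(list)
    for row in csv.DictReader(air_file):
        day = row["timestamp"].split(" ")[0]  # keep only the date part
        readings[day].append(float(row["pm25"]))
    return {day: mean(values) for day, values in readings.items()}


def merge_with_weather(daily, weather_file):
    """Attach each day's weather to the matching daily air quality value."""
    merged = []
    for row in csv.DictReader(weather_file):
        if row["date"] in daily:
            merged.append({
                "date": row["date"],
                "pm25_mean": daily[row["date"]],
                "mean_temp_c": float(row["mean_temp_c"]),
            })
    return merged


daily = daily_mean_pm25(io.StringIO(AIR_CSV))
table = merge_with_weather(daily, io.StringIO(WEATHER_CSV))

for record in table:
    print(record)
```

Whether a mean, a maximum, or some other summary is the right daily aggregate depends on the story you want to tell, so check the repository's documentation to see what was actually used.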

UN Refugee Data

UN-Link: http://popstats.unhcr.org/en/asylum_seekers_monthly

 
