Design Challenge 1: Data Sets

by Mike Gleicher on September 19, 2017

Addition: September 22, 2017: New data sets were added at the bottom (see Student suggested Data Sets from 2017 Fall). These data sets seem really interesting – but they may be more challenging.

These are the “approved” data sets for Design Challenge 1. Remember, you must use one of these approved data sets. If you want to use a different data set, you must get it approved (and we’ll put it on this list).

This list is in no particular order.

Data Sets from Old Classes

The datasets are available in this Box folder (except for ones you need to grab yourself – and even then a copy might be available in the folder).

Everyone who is registered for class should have access to the Box folder. If you are having a problem, let me know – Box doesn’t always cooperate.

White House Budget Data

The data used in developing the budgets (back in 2016 and 2017), from the White House GitHub. I recommend going to the 2017 branch and selecting “download ZIP” (look for the green “clone or download” button). There is good documentation, and the data is quite rich – giving historical spending in a lot of categories.

In the past, we considered the “receipts” data as small, and the “budgets and outlays” as harder data sets. Here we’re grouping them together.

Airline On-Time Performance

The Bureau of Transportation Statistics lets you download a lot of data, one month at a time, from this page. We’ve downloaded a few months for you – but even if you download our versions, you might want to refer to this page for explanations of all the fields, and for lookup tables (files that say what the codes mean).

For this data set, you may choose to use the months we downloaded, or download your own (please specify what data you use). You can choose to use just 1 month, or you can pick multiple months to compare (if you want a real challenge).
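If you have not worked with lookup tables before, here is a minimal sketch of decoding a coded column with one, using only the Python standard library. The column names (`CARRIER`, `DEP_DELAY`), the two-column lookup layout, and the sample values are made up for illustration; check the field explanations on the download page for the real ones.

```python
import csv
import io

def load_lookup(lines):
    """Read a two-column lookup file (code, description) into a dict."""
    reader = csv.reader(lines)
    next(reader)  # skip the header row
    return {row[0]: row[1] for row in reader}

def decode(rows, column, lookup):
    """Yield each row with an extra '<column>_NAME' field holding the decoded value.

    Codes missing from the lookup are passed through unchanged.
    """
    for row in rows:
        yield {**row, column + "_NAME": lookup.get(row[column], row[column])}

# Made-up sample data; the real files have many more columns.
lookup_csv = io.StringIO("Code,Description\nAA,Example Air\nBB,Sample Airways\n")
flights_csv = io.StringIO("CARRIER,DEP_DELAY\nAA,5\nZZ,12\n")

lookup = load_lookup(lookup_csv)
decoded = list(decode(csv.DictReader(flights_csv), "CARRIER", lookup))
```

Tools like Tableau and Excel can do the same join without any code; the sketch is only meant to show what the lookup files are for.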

Nationwide Crime Data

One of the functions of the Federal Bureau of Investigation (FBI) is to compile crime statistics within the US and use this information to help local law enforcement curtail crime. Every year, the FBI releases this data along with recommendations for communities to stem violent crime. We have downloaded the 2014 dataset (as well as 2015) of types of crime by area, available on Box.

If you use this dataset, we ask that you resist ranking cities/states or their law enforcement capabilities by their crime, as requested by the FBI. Showing trends and patterns should be your goal here.

Census Data By County

Note: this is aggregated census data – which is much less interesting than the IPUMS “raw” (or sampled) data.

You can get census data in all kinds of forms. This page has 4 spreadsheets. Any one of them could tell an interesting story – but you probably want to put together multiple files. The complication is that it’s a long list of counties (you might just pick some, or try to give a sense of the range of what is going on, or identify unusual things, or …). The files are also in the Box.

The files are:

  • Population Estimates – has data 2010-2015 (per year) with inflows and outflows. There is a separate sheet in the Excel file that explains the columns.
  • Education – has data from multiple years (1970, 1980, 1990, 2000, 2015) for different levels of educational attainment.
  • Unemployment – has data from many different years.
  • Poverty Estimates – mainly 2015 data; explanations for the columns are in a separate sheet.

Time Usage Survey

The American Time Use Survey (ATUS) tracks how people spend their time. There are corresponding international versions. There are actually lots of different surveys with interesting data available from the IPUMS website.

Getting a data set requires picking from all the options, and you can probably pull together an interesting data set in many ways. I grabbed one from the site. I also checked that, despite the scary agreements I had to agree to, sharing it with a class is legal (see this), so I put a grab of how Americans’ time usage has changed over the years into the DataSets Box folder.

You can find out what the “time use codes” mean on this page.

Interpreting the other codes requires some digging, unfortunately. Some are self-explanatory, but others… I tracked down the “FAMINCOME” columns: explanation here. The state codes are here.

Detailed Census Data

You can get detailed census data (as in samples of specific people) from the IPUMS website. This data gets huge very fast (you can get millions of people) and requires aggregation and clever ways to handle it efficiently (Tableau does surprisingly well).

We will probably use this data for another challenge, but it’s so big and interesting (both in terms of the number of individuals and the number of variables about everyone) that a little redundancy is not bad.

When you create a data set, you have to pick which census to sample (e.g., which years), and which variables you want. The tool will create huge CSV files (gigabytes). It also creates documentation files.

In the Box folder, I have a big data grab I got (past 15 years, many variables) – there’s the CSV file and the documentation file. There is also a “reduced file” that I created with a processing script – I decoded some of the columns, and selected a subset of the years. Even this small set is millions of people!
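If you want to aggregate a file this size yourself, one approach is to stream it a row at a time rather than loading it all into memory. The sketch below is a minimal illustration using only the Python standard library; it is not the processing script mentioned above, and the column names (`YEAR`, `AGE`, `SEX`) and sample rows are placeholders, so check the documentation file for the variables in your extract.

```python
import csv
import io
from collections import Counter

def count_by(lines, columns):
    """Stream rows one at a time and count how many fall into each group.

    `lines` can be any iterable of text lines, such as an open file,
    so the whole CSV never has to fit in memory.
    """
    counts = Counter()
    for row in csv.DictReader(lines):
        key = tuple(row[c] for c in columns)
        counts[key] += 1
    return counts

# Made-up sample standing in for a multi-gigabyte extract.
sample = io.StringIO(
    "YEAR,AGE,SEX\n"
    "2010,34,1\n"
    "2010,51,2\n"
    "2015,29,2\n"
)
counts = count_by(sample, ["YEAR"])
```

With a real file you would pass an open file handle instead of the sample, for example `with open("extract.csv", newline="") as f: count_by(f, ["YEAR", "SEX"])`, where `extract.csv` is a hypothetical file name.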

Basketball Players

This dataset is relatively small, but should be big enough to be interesting. It was used in the past for Alper (who was the TA) to demonstrate how to use Tableau and Excel for doing class projects. It’s in the Box.


Student Contributed Data Sets (from 2017 Spring)

Beijing Air Quality Data

Two data sets about air quality in Beijing, joined into a single cohesive table.

From the contributor:

The data comes from two sources:

  1. Air quality data: http://www.stateair.net/web/historical/1/1.html (need to download each .csv separately)
  2. Weather data: https://www.wunderground.com/history/airport/ZBAA (here’s a link to a .csv for 2011)

I first pulled the air quality data (where measurements are taken multiple times a day), and aggregated to be at the daily level. Then I merged the weather data to the air quality data. I have a GitHub repository with the data and R and Python code.
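For anyone who wants to do a similar aggregate-then-merge on other data, here is a minimal sketch of the two steps using only the Python standard library. It is not the contributor’s code: the column names (`date`, `pm25`, `temp_c`), the use of a simple mean, and the sample values are all assumptions for illustration, and real files will likely have missing readings that need filtering first.

```python
import csv
import io
from collections import defaultdict
from statistics import mean

def daily_mean(rows, date_key, value_key):
    """Collapse several readings per day into one mean value per date."""
    by_day = defaultdict(list)
    for row in rows:
        by_day[row[date_key]].append(float(row[value_key]))
    return {day: mean(values) for day, values in by_day.items()}

def merge_on_date(daily_air, weather_rows, date_key):
    """Attach the daily air-quality mean to each weather row with a matching date."""
    merged = []
    for row in weather_rows:
        day = row[date_key]
        if day in daily_air:
            merged.append({**row, "pm25_mean": daily_air[day]})
    return merged

# Made-up sample data: two readings on the first day, one on the second.
air_csv = io.StringIO(
    "date,hour,pm25\n"
    "2011-01-01,0,100\n"
    "2011-01-01,12,200\n"
    "2011-01-02,0,50\n"
)
weather_csv = io.StringIO(
    "date,temp_c\n"
    "2011-01-01,-3\n"
    "2011-01-02,1\n"
)

daily_air = daily_mean(csv.DictReader(air_csv), "date", "pm25")
merged = merge_on_date(daily_air, list(csv.DictReader(weather_csv)), "date")
```

Weather rows with no matching air-quality date are dropped here; whether to keep them with a blank value is a design choice that depends on the story you want to tell.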

Note: the GitHub repo not only has the documentation for the data, and the data conveniently processed into a CSV file, but it also has code for some basic visualizations. I can’t stop you from looking at the code. But, if you are not the author, you cannot turn in these visualizations.

UN Refugee Data

UN-Link: http://popstats.unhcr.org/en/asylum_seekers_monthly


Student suggested Data Sets from 2017 Fall

These data sets were approved in class. They all seem pretty interesting. They may require you to sign up for an account.


Old Datasets that you CANNOT USE

These data sets were suggested in old editions of the class (when we had undergrads as well). They are too simple/small to be interesting, but you can use them for practice.

Metropolitan Area Population Change

Note: this data set is small / easy. If you pick this one, the expectations for what you will need to do with it are much higher. I really dislike the vis on the Census Bureau website; you should do better (from the visualization, you can link to the data table). But the data is too small, and I’m not sure how many rich stories are to be found in it.
