DC1 Data Sets

Page content

This page lists the data sets that you may use for Design Challenge 1 (DC1): One dataset / Four Stories.

You must pick one of these data sets for your assignment.

For Design Challenge 1, we are providing 4 different data sets. Variants of these data sets have been used successfully by students in the past. All of these data sets are sufficiently large and rich to contain many interesting stories. We may provide more choices, or better versions of the datasets listed. We will not remove things from this list (so if you pick something, it will continue to be OK).

Box Folder with All Data Sets

There are four choices (some with multiple variants):

  1. Census Data
  2. Airline On-Time Performance
  3. Time Usage (how people spend their time)
  4. IMDB movie data

Some of the data sets get quite big. However, Tableau is remarkably efficient at working with 100MB+ data sets - it uses fancy database technologies to perform aggregations quickly.

In each case, you may access the original data from its source. However, we are providing versions where we have done a little of the data wrangling work for you. Thank Young, the TA.

In your assignment, be sure to explain which data set you are using (including which variant). If you got your own version of the data, please say so.

Note: we are working on being able to provide these as part of our Tableau site so you can work with them using Tableau online without having to download/upload them yourself. Stay tuned for an announcement!

Census Data by County

This data set aggregates many different quantities of interest over the counties of the US. In past assignments, we’ve provided even more detailed data sets which required even more aggregation.

The USDA provides this data as 4 separate sheets (on This page). Any one of them could tell an interesting story but together, they provide a very rich and complex data set full of stories.

We have joined the 4 spreadsheets together (joining by the “FIPS Code” column) creating a single file. The four files are also combined (joined by the “FIPS Code” column) into one file, and put in Box. The rows for the states (not counties) are also removed. The data is downloaded and processed on September 13, 2020 from USDA/ERS.

  • CSV File on Box (~4MB, 3196 columns and 339 rows)

The variable descriptions can be found on here, and some of them are replicated in the following table:

Variable Descriptions
Column name Description
Births_2019 Births in period 7/1/2018 to 6/30/2019
CENSUS_2010_POP 4/1/2010 resident Census 2010 population
CI90LB017_2018 90% confidence interval lower bound of estimate of people age 0-17 in poverty 2018
CI90LB017P_2018 90% confidence interval lower bound of estimate of percent of people age 0-17 in poverty 2018
CI90LBINC_2018 90% confidence interval lower bound of estimate of median household income 2018
CI90UB017_2018 90% confidence interval upper bound of estimate of people age 0-17 in poverty 2018
CI90UB017P_2018 90% confidence interval upper bound of estimate of percent of people age 0-17 in poverty 2018
CI90UBINC_2018 90% confidence interval upper bound of estimate of median household income 2018
Civilian_labor_force_2018 Civilian labor force annual average, 2018
Deaths_2019 Deaths in period 7/1/2018 to 6/30/2019
DOMESTIC_MIG_2019 Net domestic migration in period 7/1/2018 to 6/30/2019
Economic_typology_2015 County economic types, 2015 edition
Employed_2019 Number employed annual average, 2019
ESTIMATES_BASE_2010 4/1/2010 resident total population estimates base
FIPS_Code State-County FIPS Code
GQ_ESTIMATES_2019 7/1/2019 Group Quarters total population estimate
GQ_ESTIMATES_BASE_2010 4/1/2010 Group Quarters total population estimates base
INTERNATIONAL_MIG_2019 Net international migration in period 7/1/2018 to 6/30/2019
Med_HH_Income_Percent_of_State_Total_2019 County Household Median Income as a percent of the State Total Median Household Income, 2019
MEDHHINC_2018 Estimate of median household income 2018
Median_Household_Income_2019 Estimate of Median household Income, 2019
Metro_2013 Metro nonmetro dummy 0=Nonmetro 1=Metro (Based on 2013 OMB Metropolitan Area delineation)
N_POP_CHG_2019 Numeric Change in resident total population 7/1/2018 to 7/1/2019
NATURAL_INC_2019 Natural increase in period 7/1/2018 to 6/30/2019
NET_MIG_2019 Net migration in period 7/1/2018 to 6/30/2019
PCTPOV017_2018 Estimated percent of people age 0-17 in poverty 2018
POP_ESTIMATE_2019 7/1/2019 resident total population estimate
POV017_2018 Estimate of people age 0-17 in poverty 2018
R_death_2019 Death rate in period 7/1/2018 to 6/30/2019
R_DOMESTIC_MIG_2019 Net domestic migration rate in period 7/1/2018 to 6/30/2019
R_INTERNATIONAL_MIG_2019 Net international migration rate in period 7/1/2018 to 6/30/2019
R_NATURAL_INC_2019 Natural increase rate in period 7/1/2018 to 6/30/2019
R_NET_MIG_2019 Net migration rate in period 7/1/2018 to 6/30/2019
RESIDUAL_2019 Residual for period 7/1/2018 to 6/30/2019
Rural-urban_Continuum_Code_2013 Rural-urban Continuum Code, 2013
State State Abbreviation
Unemployed_2019 Number unemployed annual average, 2019
Unemployment_rate_2019 Unemployment rate, 2019
Urban_Influence_Code_2013 Urban Influence Code, 2013

Airline On-Time Performance

The Bureau of Transportation Statistics lets you download a lot of data, one month at a time from this page. You can download many different attributes of every flight.

We’ve chosen a set of attributes for each flight, which leads to huge files (250MB/month), and downloaded several months (one per season for 3 years).

We’ve downloaded a few months for you and assembled them into easier to use files. We’ve put multiple months together and down-sampled the data (using random sampling) to reduce the data set size. While Tableau seems very able to process even the largest of these files, they do get a little unweildy.

We chose 2019 (not 2020) since it has more “usual” patterns (although, comaparing pre- and post-COVID would be interesting).

The “small” datasets were downsampled to produce file sizes of about 5MB. The January file was downsampled to 2% (that is, each flight has a 2% chance of being included). The 4 month data set and 4 month*3 year data set was downsampled to have a similar total number of flights (e.g., the 4 month set has 2%/4).

The “medium” datasets were downsampled to produce file sizes of about 25MB each. The 4 month data set and 4 month*3 year data set was downsampled to have a similar total number of flights (e.g., the 4 month set has 10%/4).

The “large” data sets include 10% of all the months included. You can get a sense of how big they are from this picture (12 months, 40-60K flights a month)

airlines-count.png

For this data set, you may choose to use the months we downloaded, or download your own (please specify what data you use). You can choose to use just 1 month (version 1), you can pick multiple months to compare (version 2), or multiple years to compare if you want a real challenge (version 3). Note that you may filter by geography (select a particular state or all states), year, and monthly periods. Files are available as .csv.

You may want information on the airports (for example, to get location coordinates for each airport code). This data is available from Federal Aviation Administration. The Airline codes list is useful - especially for Madison where you see airlines like “F9” and “OO” (these are regional commuter companies that fly under another airline’s name).

You might want to refer to Bureau of Transportation Statistics for explanations of all the fields, and look up tables (files that say what the codes mean). Fields include date, carrier, origin/destination information, departure/arrival performance, gate/airport information, flight summaries, cause of delay, and some of them are replicated in the following table:

Variable Descriptions
Column name Description
ActualElapsedTime Elapsed Time of Flight, in Minutes
AirTime Flight Time, in Minutes
ArrDel15 Arrival Delay Indicator, 15 Minutes or More (1=Yes)
ArrDelay Difference in minutes between scheduled and actual arrival time. Early arrivals show negative numbers.
ArrDelayMinutes Difference in minutes between scheduled and actual arrival time. Early arrivals set to 0.
ArrivalDelayGroups Arrival Delay intervals, every (15-minutes from <-15 to >180)
ArrTime Actual Arrival Time (local time: hhmm)
ArrTimeBlk CRS Arrival Time Block, Hourly Intervals
CancellationCode Specifies The Reason For Cancellation
Cancelled Cancelled Flight Indicator (1=Yes)
CarrierDelay Carrier Delay, in Minutes
CRSArrTime CRS Arrival Time (local time: hhmm)
CRSDepTime CRS Departure Time (local time: hhmm)
CRSElapsedTime CRS Elapsed Time of Flight, in Minutes
DayofMonth Day of Month
DayOfWeek Day of Week
DepartureDelayGroups Departure Delay intervals, every (15 minutes from <-15 to >180)
DepDel15 Departure Delay Indicator, 15 Minutes or More (1=Yes)
DepDelay Difference in minutes between scheduled and actual departure time. Early departures show negative numbers.
DepDelayMinutes Difference in minutes between scheduled and actual departure time. Early departures set to 0.
DepTime Actual Departure Time (local time: hhmm)
DepTimeBlk CRS Departure Time Block, Hourly Intervals
Dest Destination Airport
DestAirportID Destination Airport, Airport ID. An identification number assigned by US DOT to identify a unique airport. Use this field for airport analysis across a range of years because an airport can change its airport code and airport codes can be reused.
DestAirportSeqID Destination Airport, Airport Sequence ID. An identification number assigned by US DOT to identify a unique airport at a given point of time. Airport attributes, such as airport name or coordinates, may change over time.
DestCityMarketID Destination Airport, City Market ID. City Market ID is an identification number assigned by US DOT to identify a city market. Use this field to consolidate airports serving the same city market.
DestCityName Destination Airport, City Name
DestState Destination Airport, State Code
DestStateFips Destination Airport, State Fips
DestStateName Destination Airport, State Name
DestWac Destination Airport, World Area Code
Distance Distance between airports (miles)
DistanceGroup Distance Intervals, every 250 Miles, for Flight Segment
DivActualElapsedTime Elapsed Time of Diverted Flight Reaching Scheduled Destination, in Minutes. The ActualElapsedTime column remains NULL for all diverted flights.
DivAirportLandings Number of Diverted Airport Landings
DivArrDelay Difference in minutes between scheduled and actual arrival time for a diverted flight reaching scheduled destination. The ArrDelay column remains NULL for all diverted flights.
DivDistance Distance between scheduled destination and final diverted airport (miles). Value will be 0 for diverted flight reaching scheduled destination.
Diverted Diverted Flight Indicator (1=Yes)
DivReachedDest Diverted Flight Reaching Scheduled Destination Indicator (1=Yes)
DOT_ID_Reporting_Airline An identification number assigned by US DOT to identify a unique airline (carrier). A unique airline (carrier) is defined as one holding and reporting under the same DOT certificate regardless of its Code, Name, or holding company/corporation.
FirstDepTime First Gate Departure Time at Origin Airport
Flight_Number_Reporting_Airline Flight Number
FlightDate Flight Date (yyyymmdd)
Flights Number of Flights
IATA_CODE_Reporting_Airline Code assigned by IATA and commonly used to identify a carrier. As the same code may have been assigned to different carriers over time, the code is not always unique. For analysis, use the Unique Carrier Code.
LateAircraftDelay Late Aircraft Delay, in Minutes
LongestAddGTime Longest Time Away from Gate for Gate Return or Cancelled Flight
Month Month
NASDelay National Air System Delay, in Minutes
Origin Origin Airport
OriginAirportID Origin Airport, Airport ID. An identification number assigned by US DOT to identify a unique airport. Use this field for airport analysis across a range of years because an airport can change its airport code and airport codes can be reused.
OriginAirportSeqID Origin Airport, Airport Sequence ID. An identification number assigned by US DOT to identify a unique airport at a given point of time. Airport attributes, such as airport name or coordinates, may change over time.
OriginCityMarketID Origin Airport, City Market ID. City Market ID is an identification number assigned by US DOT to identify a city market. Use this field to consolidate airports serving the same city market.
OriginCityName Origin Airport, City Name
OriginState Origin Airport, State Code
OriginStateFips Origin Airport, State Fips
OriginStateName Origin Airport, State Name
OriginWac Origin Airport, World Area Code
Quarter Quarter (1-4)
Reporting_Airline Unique Carrier Code. When the same code has been used by multiple carriers, a numeric suffix is used for earlier users, for example, PA, PA(1), PA(2). Use this field for analysis across a range of years.
SecurityDelay Security Delay, in Minutes
Tail_Number Tail Number
TaxiIn Taxi In Time, in Minutes
TaxiOut Taxi Out Time, in Minutes
TotalAddGTime Total Ground Time Away from Gate for Gate Return or Cancelled Flight
WeatherDelay Weather Delay, in Minutes
WheelsOff Wheels Off Time (local time: hhmm)
WheelsOn Wheels On Time (local time: hhmm)
Year Year

Time Usage Survey

2003-2019 all samples, Respondents

The American Time Usage Survey (ATUS) tracks how people spend their time. There are corresponding international versions. There are actually lots of different surveys with interesting data available from the IPUMS website.

We have done a data pull for you. We are allowed to share a data pull (see this). We chose to collect data from all of the years (2003-2019), and selected a wide range of different attributes. The data was downloaded on September 13, 2020 from ATUS.

You may create your own data pull if you’d like to try this with different columns (but be sure to document it in your writeup). Getting a data set requires picking from all the options - there are so many options that picking a good set is pretty time consuming. It is actually an interesting exercise to see how they document their data - they are very careful in documenting everything.

You can find out what the “time use codes” mean on this page.

Interpreting the other codes requires some digging, unfortunately. Some are self-explanatory, but others… I tracked down the “FAMINCOME” columns: explanation here. The state codes are here. Here are some of them are replicated in the following table:

Variable Descriptions
Column name Description
RECTYPE Record type
1 Household
2 Person
3 Activity
4 Who
5 Eldercare
REGION Region
1 Northeast
2 Midwest
3 South
4 West
STATEFIP FIPS State Code
01 Alabama
02 Alaska
04 Arizona
05 Arkansas
06 California
08 Colorado
09 Connecticut
10 Delaware
11 District of Columbia
12 Florida
13 Georgia
15 Hawaii
16 Idaho
17 Illinois
18 Indiana
19 Iowa
20 Kansas
21 Kentucky
22 Louisiana
23 Maine
24 Maryland
25 Massachusetts
26 Michigan
27 Minnesota
28 Mississippi
29 Missouri
30 Montana
31 Nebraska
32 Nevada
33 New Hampshire
34 New Jersey
35 New Mexico
36 New York
37 North Carolina
38 North Dakota
39 Ohio
40 Oklahoma
41 Oregon
42 Pennsylvania
44 Rhode Island
45 South Carolina
46 South Dakota
47 Tennessee
48 Texas
49 Utah
50 Vermont
51 Virginia
53 Washington
54 West Virginia
55 Wisconsin
56 Wyoming
METRO Metropolitan/central city status
01 Metropolitan, central city
02 Metropolitan, balance of MSA
03 Metropolitan, not identified
04 Nonmetropolitan
05 Not identified
FAMINCOME Family income
001 Less than $5,000
002 $5,000 to $7,499
003 $7,500 to $9,999
004 $10,000 to $12,499
005 $12,500 to $14,999
006 $15,000 to $19,999
007 $20,000 to $24,999
008 $25,000 to $29,999
009 $30,000 to $34,999
010 $35,000 to $39,999
011 $40,000 to $49,999
012 $50,000 to $59,999
013 $60,000 to $74,999
014 $75,000 to $99,999
015 $100,000 to $149,999
016 $150,000 and over
996 Refused
997 Don’t know
998 Blank
HH_SIZE Number of people in household
001 1
002 2
003 3
004 4
005 5
006 6
007 7
008 8
009 9
010 10
011 11
012 12
013 13
014 14
015 15
016 16
999 NIU (Not in universe)
PERNUM Person number (general)
01 1
02 2
03 3
04 4
05 5
06 6
07 7
08 8
09 9
10 10
11 11
12 12
13 13
14 14
15 15
16 16
LINENO Person line number
001 1
002 2
003 3
004 4
005 5
006 6
007 7
008 8
009 9
010 10
011 11
012 12
013 13
014 14
015 15
016 16
017 17
018 18
019 19
999 NIU (Not in universe)
SEX Sex
01 Male
02 Female
99 NIU (Not in universe)
RACE Race
0100 White only
0110 Black only
0120 American Indian, Alaskan Native
0130 Asian or Pacific Islander
0131 Asian only
0132 Hawaiian Pacific Islander only
0200 White-Black
0201 White-American Indian
0202 White-Asian
0203 White-Hawaiian
0210 Black-American Indian
0211 Black-Asian
0212 Black-Hawaiian
0220 American Indian-Asian
0221 American Indian-Hawaiian
0230 Asian-Hawaiian
0300 White-Black-American Indian
0301 White-Black-Asian
0302 White-Black-Hawaiian
0310 White-American Indian-Asian
0311 White-American Indian-Hawaiian
0320 White-Asian-Hawaiian
0330 Black-American Indian-Asian
0331 Black-American Indian-Hawaiian
0340 Black-Asian-Hawaiian
0350 American Indian-Asian-Hawaiian
0398 Other 3 race combinations
0399 2 or 3 races, unspecified
0400 White-Black-American Indian-Asian
0401 White-Black-American Indian-Hawaiian
0402 White-Black-Asian-Hawaiian
0403 Black-American Indian-Asian-Hawaiian
0404 White-American Indian-Asian-Hawaiian
0500 White-Black-American Indian-Asian-Hawaiian
0599 4 or 5 races, unspecified
9999 NIU (Not in universe)
EDUC Highest level of school completed
- Less than HS diploma
010 Less than 1st grade
011 1st, 2nd, 3rd, or 4th grade
012 5th or 6th grade
013 7th or 8th grade
014 9th grade
015 10th grade
016 11th grade
017 12th grade - no diploma
- HS diploma, no college
020 High school graduate - GED
021 High school graduate - diploma
- Some college
030 Some college but no degree
031 Associate degree - occupational vocational
032 Associate degree - academic program
- College degree +
040 Bachelor’s degree (BA, AB, BS, etc.)
041 Master’s degree (MA, MS, MEng, MEd, MSW, etc.)
042 Professional school degree (MD, DDS, DVM, etc.)
043 Doctoral degree (PhD, EdD, etc.)
999 NIU (Not in universe)
EDUCYRS Years of education
100 Grades 1-12
101 Less than first grade
102 First through fourth grade
105 Fifth through sixth grade
107 Seventh through eighth grade
109 Ninth grade
110 Tenth grade
111 Eleventh grade
112 Twelfth grade
200 College
213 College–one year
214 College–two years
215 College–three years
216 College–four years
217 Bachelor’s degree
300 Advanced degree
316 Master’s degree
317 Master’s degree–one year program
318 Master’s degree–two year program
319 Master’s degree–three+ year program
320 Professional degree
321 Doctoral degree
999 NIU (Not in universe)
EMPSTAT Labor force status
01 Employed - at work
02 Employed - absent
03 Unemployed - on layoff
04 Unemployed - looking
05 Not in labor force
99 NIU (Not in universe)

IMDB Movie Data

This is data on approximately 5000 movies with columns including director, genre, plot keywords, and box office statistics. The data comes from Kaggle, and may be noisy.

This data set is small for the design challenge – but we’re still allowing it. The genre field contains texts separated by vertical bars, and we converted them into separate indicator columns and appended to the end of the table.

  • CSV file on Box (~1MB, 5044 columns and 54 rows)

The data was downloaded and processed on September 13, 2020 from Kaggle.

The list of genres are summarized in the following table:

Genre Counts
Genre Count
Action 1153
Adventure 923
Fantasy 610
Sci-Fi 616
Thriller 1411
Documentary 121
Romance 1107
Animation 242
Comedy 1872
Family 546
Musical 132
Mystery 500
Western 97
Drama 2594
History 207
Sport 182
Crime 889
Horror 565
War 213
Biography 293
Music 214
Game-Show 1
Reality-TV 2
News 3
Short 5
Film-Noir 6