Explainers Supplementary Material

Figure 1: American-ness

This is Figure 1 from the paper. It illustrates an explainer for the property of "American-ness", defined by selecting the cities that (I thought) were in the USA. Unfortunately, this is a difficult example from which to learn to read these kinds of diagrams. I recommend first looking at Figure 0, which explains how to read the diagram using an easier example, and then returning to this one.

This example is based on a data set that scores 140 cities around the world on 40 "liveability" criteria. The data set came from a contest on Buzzdata.com. To demonstrate explainers, I have tried to show that other properties of cities can be connected to these liveability criteria.

For a first example, I annotated 18 cities that (I thought) were in the USA as being "American", and the rest as not. I had the system build explainers that distinguish the two classes - that is, simple functions that rank the cities in the positive class higher than those in the negative class. The diagram below is one of the explainers generated by this process. (To get a sense of the range of choices, and why this one was chosen, see Figure 1s, which has the "scagnostics" plot for this explainer, as well as some comparisons.)
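To make "rank the positive class higher than the negative class" concrete, here is a minimal sketch of one way to measure it. It assumes, purely for illustration, that the simple function is a weighted sum of criteria with small integer weights; the function names, the weights, and the toy data are all made up and are not taken from the actual system.

```python
from itertools import product

def linear_score(features, weights):
    """Score one city as a weighted sum of its criterion values."""
    return sum(w * x for w, x in zip(weights, features))

def pairwise_ranking_accuracy(scores, labels):
    """Fraction of (positive, negative) pairs in which the positive city
    scores higher than the negative one; a tied pair counts as half."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one city in each class")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p, n in product(pos, neg)
    )
    return wins / (len(pos) * len(neg))

# Made-up toy data: four cities, three criteria on a 1-5 scale.
cities = [[5, 4, 2], [4, 5, 1], [2, 3, 4], [1, 2, 5]]
is_american = [True, True, False, False]
weights = [1, 1, -1]  # hypothetical weights, chosen by hand

scores = [linear_score(c, weights) for c in cities]
accuracy = pairwise_ranking_accuracy(scores, is_american)
```

An accuracy of 1.0 would mean every positive city outranks every negative one; a simple function on the real data will usually fall short of that.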

The SVG files work in any browser with good SVG support. However, the HTML pages with embedded SVG seem to render incorrectly in browsers other than Chrome. For some reason, the text gets messed up (probably a CSS issue).

Cities in blue are those annotated as being in the US; cities in green are those annotated as not.

The grid of boxes with city names on the left-hand side of the diagram is a list of all the cities in rank order. Making a vertical list would have led to a diagram that would be way too tall to read. The list is in reading order: the top left, New York, is the highest ranked, and you then read left to right and top to bottom, so that Suzhou is the 2nd lowest scoring and Kathmandu is the lowest.
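The reading-order layout can be sketched in a few lines. This is only an illustration of the idea, assuming five cities per row as in the diagram; the function names and placeholder city names are hypothetical.

```python
def reading_order_grid(ranked_names, per_row=5):
    """Split a best-to-worst ranked list into rows, read left to right."""
    return [
        ranked_names[i:i + per_row]
        for i in range(0, len(ranked_names), per_row)
    ]

def grid_position(rank, per_row=5):
    """Map a 0-based rank to its (row, column) cell in the grid."""
    return divmod(rank, per_row)

# Placeholder names standing in for the 140 ranked cities.
ranked = [f"city_{i}" for i in range(140)]
grid = reading_order_grid(ranked)

top_ranked = grid[0][0]           # top-left cell: the highest-ranked city
lowest_ranked = grid[-1][-1]      # bottom-right cell: the lowest-ranked city
second_lowest = grid[-1][-2]      # the cell just before it
```

With 140 cities and five per row, the grid has 28 rows instead of a single 140-entry column.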

The curves to the right of the boxes connect the 5 cities in each row to their positions on the value axis. Because the data is highly quantized (all the features are on a 1-5 scale), there are lots of ties. Indeed, the top 13 cities are all tied for first place (note how many of the curves connect to the same place).

The histogram (to the right of the curves) has odd gaps because of this quantization. It is actually 20 evenly spaced bins from min value to max value, but only a few of them are non-empty because of the quantization issues.
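The gaps are easy to reproduce. The sketch below bins a handful of made-up integer scores into 20 evenly spaced bins between the minimum and maximum value; it is an illustration of the effect, not the code used to draw the figure.

```python
def histogram(values, n_bins=20):
    """Count values in n_bins evenly spaced bins from min(values) to max(values)."""
    lo, hi = min(values), max(values)
    counts = [0] * n_bins
    if hi == lo:
        # Degenerate case: every value is identical.
        counts[0] = len(values)
        return counts
    for v in values:
        # The maximum value would land at index n_bins, so clamp it into the last bin.
        idx = min(int((v - lo) * n_bins / (hi - lo)), n_bins - 1)
        counts[idx] += 1
    return counts

# Made-up quantized scores: only four distinct values occur.
toy_scores = [1, 1, 2, 3, 3, 3, 5]
counts = histogram(toy_scores)
non_empty_bins = sum(1 for c in counts if c > 0)
```

Only four of the 20 bins receive any counts here, because the scores take only four distinct values.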

To the right of the histogram are 3 modified boxplots. The leftmost (gray) gives the distribution of the entire data set. The middle one (green) gives the distribution of the non-US-marked cities, and the right one (blue) gives the distribution of the US-marked cities. From these boxplots, you can see that this explainer does a pretty good job of separating the classes.
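As a rough sketch of what the three boxplots summarize, the code below computes a standard five-number summary for all scores and for each class. The scores are made up, and the particular modifications used in the figure's boxplots are not reproduced here; the quartile method is an assumption.

```python
from statistics import quantiles

def five_number_summary(values):
    """Minimum, quartiles, and maximum of a list of scores."""
    # "inclusive" treats the data as the whole population (an assumption).
    q1, med, q3 = quantiles(values, n=4, method="inclusive")
    return {"min": min(values), "q1": q1, "median": med, "q3": q3, "max": max(values)}

# Made-up explainer scores; the first five cities are marked as US.
toy_scores = [8, 7, 7, 6, 1, 3, 2, 2, 1, 0]
marked_us = [True] * 5 + [False] * 5

us_scores = [s for s, y in zip(toy_scores, marked_us) if y]
other_scores = [s for s, y in zip(toy_scores, marked_us) if not y]

summary_all = five_number_summary(toy_scores)      # gray boxplot
summary_other = five_number_summary(other_scores)  # green boxplot
summary_us = five_number_summary(us_scores)        # blue boxplot
```

In this toy data the US median sits well above the non-US maximum, while the US minimum is a low outlier that falls inside the non-US range.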

Notice that there are two really glaring errors: two cities marked as being in the US that score very low on the "American-ness" score. This is despite the fact that training tried to make them be American. However, because we have chosen a simple function, it is unable to correctly capture the answers - we have made a tradeoff between correctness and simplicity.

These two negative outliers are worth examining. One is San Juan, which is in a US territory (and by some definitions is not part of the country). The other is San Jose, which might be the city in California, or might be the capital of Costa Rica (the data source does not specify). Both of these cities consistently score low on various explainers of American-ness (see Figure 2 in the paper).