Figure 1s (not in paper): Scagnostics for Figure 1

In Figure 1 (and Figure 2) I show explainers for the concept of "USA-ness" (or American-ness) from city livability data. Here, I give a little detail about where those explainers were chosen from.

To create the explainers, the system generated 3991 different explainers with various tradeoffs by using the greedy algorithm in the paper (for up to 4 variables), and then trying various levels of quantization.

To get a sense of this collection of explainers, I use a "scagnostics" style plot. The scare quotes are there because it's in the spirit of scagnostics, not the actual metrics proposed by Tukey (or Wilkinson and colleagues). In this kind of display, each explainer function is a point in a high-dimensional space, where the axes are different metrics we might apply. Here, I am using just 5 metrics (since it's hard to view more than 5 dimensions): 3 for correctness (mcc, nth, and margin), and 2 for simplicity (number of variables and quantization level).

Showing this in a scatterplot matrix leads to a bit of a mess (since the web-based scatterplot drawing chokes with so many points). We have 3991 points (each point is a function, and it appears in each of the graphs), so there is horrendous overdraw. The color of the points represents the number of variables: red is 1, purple 2, blue 3, cyan 4. Points with fewer variables are always drawn on top of points with more variables (so if you see a blue dot, there might be cyan dots hiding underneath it).
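To make the draw order concrete, here is a minimal sketch of the layering rule in Python. This is not the code behind the actual web display; the names COLORS, draw_order, and the "n_vars" field are hypothetical and chosen only for illustration. The idea is just to sort the points so that explainers with more variables are painted first and end up underneath.

```python
# Color coding as described in the text: number of variables -> color.
COLORS = {1: "red", 2: "purple", 3: "blue", 4: "cyan"}

def draw_order(points):
    """Return points in painting order: most variables first (bottom layer),
    fewest variables last (top layer).

    Each point is assumed to be a dict with an "n_vars" key (hypothetical
    field name). Python's sort is stable, so points with the same number
    of variables keep their original relative order.
    """
    return sorted(points, key=lambda p: p["n_vars"], reverse=True)

if __name__ == "__main__":
    pts = [
        {"id": "a", "n_vars": 1},
        {"id": "b", "n_vars": 4},
        {"id": "c", "n_vars": 3},
    ]
    for p in draw_order(pts):
        print(p["id"], COLORS[p["n_vars"]])
```

With this ordering, a 1 variable (red) point is painted last, so it is never hidden by a point with more variables.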

Looking at this, you can see that the best 4 variable explainers do better (in terms of performance) than the best 3 variable ones. The best mcc (Matthews correlation coefficient) score for a 3 variable explainer is .725, whereas the best for a 4 variable explainer is .795 (the corresponding nth scores are .92 and .96). If all we cared about was the machine-learning notion of correctness, the 4th variable would be a noticeable improvement.
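For readers who have not run into mcc before, here is a minimal sketch of the standard Matthews correlation coefficient computed from the four cells of a binary confusion matrix. This is the textbook definition, not necessarily the exact code the system uses; the function name and the choice to return 0 when the denominator is zero are my assumptions.

```python
import math

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts.

    tp, tn, fp, fn: counts of true positives, true negatives,
    false positives, and false negatives.
    Ranges from -1 (always wrong) through 0 (no better than chance)
    to +1 (perfect).
    """
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumption: treat the degenerate case (an empty row or column
    # of the confusion matrix) as 0, a common convention.
    if denominator == 0:
        return 0.0
    return numerator / denominator

if __name__ == "__main__":
    print(mcc(50, 50, 0, 0))    # perfect classifier
    print(mcc(25, 25, 25, 25))  # chance-level classifier
```

Because mcc uses all four cells, it stays informative when the classes are unbalanced, which is useful when only a fraction of the cities in a data set belong to the concept being explained.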

For explanations, is the improvement in correctness worth the extra cost in terms of complexity? It's hard to say - I would like to do some actual cognitive science to understand the comprehensibility of the models.

To help you decide for yourself, here are the 2 best (in terms of nth scores) 4 variable explainers and the 2 best 3 variable ones. In all cases, the maximum quantization level is 5.