Figure 6: Multi-Dimensional Scaling and Distance Metrics

Note: this is not exactly the same as the figure in the paper, for reasons described below.

This paper figure has two scatterplots that take the 180 acts of Shakespeare (36 plays, each with 5 acts) and use multi-dimensional scaling to project them down to the plane in a way that preserves the distances (from the original 115-dimensional Docuscope space) as well as possible. The idea is that similar things should be close. These are really simple scatterplots (made with D3). The legends are in the paper (green comedy, yellow history, purple tragedy, red late plays). The biggest advantage of having them online is that by hovering over a point, you can get a tooltip saying what it is.
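For readers who want to see the projection step concretely, here is a minimal sketch of classical (Torgerson) multi-dimensional scaling in Python with NumPy. This is only one common variant of MDS, and it is an assumption that it resembles what was used for the figure. The function names and the random placeholder data are made up for illustration; the real input would be the 180 x 115 matrix of Docuscope values.

```python
import numpy as np

def pairwise_euclidean(X):
    """Return the n x n matrix of Euclidean distances between rows of X."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def classical_mds(dist, k=2):
    """Embed n points in k dimensions from an n x n distance matrix."""
    n = dist.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n        # centering matrix
    B = -0.5 * J @ (dist ** 2) @ J             # double-centered Gram matrix
    vals, vecs = np.linalg.eigh(B)             # eigenvalues in ascending order
    top = np.argsort(vals)[::-1][:k]           # indices of the k largest
    scale = np.sqrt(np.clip(vals[top], 0.0, None))  # guard against tiny negatives
    return vecs[:, top] * scale

# Placeholder data standing in for 180 acts x 115 variables (not the real data).
rng = np.random.default_rng(0)
acts = rng.normal(size=(180, 115))
coords = classical_mds(pairwise_euclidean(acts), k=2)  # one (x, y) per act
```

Each row of `coords` would then be one point in the scatterplot. When the input points already lie in a plane, this method reproduces their distances exactly; for 115-dimensional data the planar distances are only an approximation.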

You can see that even with the simplest distance metric, some patterns are starting to emerge. Acts from the same genre do tend to be similar to one another.

However, to get a more meaningful distance metric, we need to determine how to properly weight the different variables (and their cross terms). Since we have no a priori way to know what this weighting should be, we can try to determine it based on some facts that we do know. Effectively, we are learning a distance metric (i.e., a function that computes a distance between two objects). This problem is called Distance Metric Learning. In the paper, there are some references, including a recent VAST paper (by Brown et al.) that applies it in a Visual Analytics setting.
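To make "weights and cross terms" concrete: one common way to write such a distance is as a quadratic form, d(x, y) = sqrt((x - y)^T M (x - y)), where the diagonal of the matrix M holds the per-variable weights and the off-diagonal entries hold the cross terms. The sketch below assumes that form; it is not necessarily the formulation used in the paper or in the cited references, and the function name and the example matrices are invented for illustration.

```python
import numpy as np

def quadratic_distance(x, y, M):
    """Distance sqrt((x - y)^T M (x - y)) for a positive semidefinite M."""
    d = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.sqrt(d @ M @ d))

x = np.array([1.0, 2.0, 0.0])
y = np.array([0.0, 0.0, 0.0])

identity = np.eye(3)                    # plain Euclidean distance
weighted = np.diag([4.0, 1.0, 1.0])     # first variable counts more
crossed = np.array([[1.0, 0.5, 0.0],    # off-diagonal 0.5 couples
                    [0.5, 1.0, 0.0],    # variables 0 and 1
                    [0.0, 0.0, 1.0]])

d_plain = quadratic_distance(x, y, identity)
d_weighted = quadratic_distance(x, y, weighted)
d_crossed = quadratic_distance(x, y, crossed)
```

Learning a distance metric, in this view, amounts to choosing the entries of M from the facts we do know rather than fixing M to the identity.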

Here, we use explainer functions to create four new dimensions - one for each of the genres. Then we use Euclidean distance in this new 4-dimensional space. This has a number of advantages: it reduces the number of variables (Euclidean metrics are problematic in high dimensions), it uses the hints we have about what is important to set the weightings, and the new dimensions have similar scales.
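As a rough sketch of this step, assume each explainer is a linear function of a couple of the original variables with small integer coefficients. The variable indices and coefficients below are made up for illustration (they are not the explainers used for the figure), and the data is a random placeholder.

```python
import numpy as np

# Hypothetical explainers: genre -> {variable index: integer coefficient}.
EXPLAINERS = {
    "comedy":  {3: 2, 17: -1},
    "history": {8: 1, 42: 1},
    "tragedy": {5: -1, 60: 3},
    "late":    {12: 1, 99: -2},
}

def explainer_space(X, explainers=EXPLAINERS):
    """Map n x d data to an n x (number of explainers) matrix of scores."""
    cols = []
    for weights in explainers.values():
        # Each new dimension is a weighted sum of a few original columns.
        cols.append(sum(c * X[:, j] for j, c in weights.items()))
    return np.column_stack(cols)

def euclidean(a, b):
    """Euclidean distance between two score vectors."""
    return float(np.linalg.norm(a - b))

# Placeholder data standing in for 180 acts x 115 variables (not the real data).
rng = np.random.default_rng(0)
acts = rng.normal(size=(180, 115))
scores = explainer_space(acts)           # 180 x 4
d01 = euclidean(scores[0], scores[1])    # distance between the first two acts
```

The pairwise distances computed in this 4-dimensional space are what would be handed to the multi-dimensional scaling step in place of the distances from the original 115-dimensional space.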

You will notice that the points cluster by genre much better than they did before. This shouldn't be surprising, since we gave it some hints to put these points together. However, because it used this information to create the distance function, the distances are hopefully more meaningful. The fact that the genres separate nicely is just a sanity check: we would hope that points close to each other really are more similar to one another. You can hover over each point to see what act it is.

If you're wondering what 4 explainer functions I used to make this picture, here they are. For each genre, I took the best (in terms of nth score) 2-variable explainer with small integer coefficients. In the figure in the paper, a different set of functions was used (3 variables).