Interactive Analysis of Word Vector Embeddings

Florian Heimerl, Michael Gleicher


Abstract

Word vector embeddings are an emerging tool for natural language processing. They have proven beneficial for a wide variety of language processing tasks. Their utility stems from the ability to encode word relationships within the vector space. Applications range from components in natural language processing systems to tools for linguistic analysis in the study of language and literature. In many of these applications, interpreting embeddings and understanding the encoded grammatical and semantic relations between words is useful, but challenging. Visualization can aid in such interpretation of embeddings. In this paper, we examine the role for visualization in working with word vector embeddings. We provide a literature survey to catalogue the range of tasks where the embeddings are employed across a broad range of applications. Based on this survey, we identify key tasks and their characteristics. Then, we present visual interactive designs that address many of these tasks. The designs integrate into an exploration and analysis environment for embeddings. Finally, we provide example use cases for them and discuss domain user feedback.

Teaser

Videos

Before trying the demo, it may help to look at some examples. The following videos introduce the visualizations and interactions and show example scenarios.

Concept projections

Reconstructed co-occurrences in one embedding

Reconstructed co-occurrences across multiple embeddings

Exploring and comparing neighborhoods of word vectors

Exploring and comparing neighborhoods of arithmetic structures

Exploring the changing meaning of broadcast by comparing neighborhoods

Demo

You can access a running version of our implementation here.

Software

The source code of our implementation is available on GitHub.

In addition, we provide a dockerized version of our implementation that bundles all necessary libraries here. The image can be deployed on your own machine or on cloud platforms such as DigitalOcean or Amazon AWS. To run it, install Docker on your machine and pull the image using:

  docker pull fheimerl/viswordembeddings

Then set it up following the manual here.

Embeddings

We provide 22 example embeddings in the right format, ready to use and explore with our implementations. Those are:

Data Format

To use your own embeddings, store them in a simple text format, with one vector per line. The word is the first token in each line, followed by the vector, with a value for each dimension, separated by spaces. The resulting file looks like this (for three-dimensional word vectors):

    house 0.21231 0.23423 0.23232
    frog 0.13423 0.32123 0.49632
    wall 0.43289 0.16987 0.24687
    .
    .
    .
  
In addition, the file may contain biases and context vectors, concatenated with the word vectors:
    house 0.21231 0.23423 0.23232 0.57684 0.13457 0.46513 0.32846 0.32146
    frog 0.13423 0.32123 0.49632 0.46158 0.53264 0.21955 0.31642 0.20134
    wall 0.43289 0.16987 0.24687 0.31256 0.19465 0.31648 0.21536 0.06431
    .
    .
    .
  
This format can be produced and processed by all popular implementations of word vector embedding methods (e.g., GloVe, word2vec, Gensim). viswordembeddings expects embeddings as NumPy matrices; to convert the text format, you can use the Python script available here.
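To make the text format concrete, here is a minimal sketch of how such a file could be read into a word list and a NumPy matrix. It is not the conversion script mentioned above; the function name `parse_embedding_text` is hypothetical, and the sketch assumes whitespace-separated values and the same number of values on every line.

```python
import io

import numpy as np


def parse_embedding_text(lines):
    """Parse 'word v1 v2 ... vn' lines into (words, matrix).

    Hypothetical helper for illustration only; it is not part of
    viswordembeddings.
    """
    words, rows = [], []
    for line in lines:
        tokens = line.split()
        if len(tokens) < 2:
            continue  # skip blank lines and lines without any values
        values = [float(t) for t in tokens[1:]]
        if rows and len(values) != len(rows[0]):
            raise ValueError(f"inconsistent dimensionality for {tokens[0]!r}")
        words.append(tokens[0])
        rows.append(values)
    return words, np.array(rows, dtype=np.float64)


# The three-dimensional example from the text above.
sample = io.StringIO(
    "house 0.21231 0.23423 0.23232\n"
    "frog 0.13423 0.32123 0.49632\n"
    "wall 0.43289 0.16987 0.24687\n"
)
words, vectors = parse_embedding_text(sample)
print(words)
print(vectors.shape)
```

Row i of the matrix holds the values for `words[i]`. For files that also contain biases and context vectors, the rows are simply longer; how to split them depends on the layout your training tool writes, which this sketch does not assume.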

Alternatively, you can store embeddings directly in a set of NumPy arrays, with the following naming conventions:

Replace [embedding_name] with an arbitrary string that denotes the embedding. Bold entries in the above list must exist for viswordembeddings to be able to load the embedding.

viswordembeddings expects all embeddings to be stored as NumPy arrays, following the above file naming conventions, in the /embeddings subfolder of a /data folder located next to the /app folder. The /data folder can also contain a stopwords.txt file that lists stop words, one per line. A /wordlists subfolder can contain text files with the word lists for the scatter plot design. Each file can contain an arbitrary number of words, one per line.
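As an illustration of this folder layout, the following sketch builds a small /data folder in a temporary directory and reads the stop words and word lists back. The function names are hypothetical, the `.txt` extension for word list files is an assumption, and this is not how viswordembeddings itself loads these files.

```python
import tempfile
from pathlib import Path


def read_word_file(path):
    """Return the non-empty lines of a one-word-per-line text file."""
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


def load_word_lists(data_dir):
    """Read stopwords.txt and all word lists below a /data folder.

    Hypothetical helper; assumes word list files end in .txt.
    """
    data_dir = Path(data_dir)
    stop_path = data_dir / "stopwords.txt"
    stopwords = set(read_word_file(stop_path)) if stop_path.exists() else set()
    wordlists = {}
    list_dir = data_dir / "wordlists"
    if list_dir.is_dir():
        for path in sorted(list_dir.glob("*.txt")):
            wordlists[path.stem] = read_word_file(path)
    return stopwords, wordlists


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "data"
    (root / "embeddings").mkdir(parents=True)  # NumPy arrays would go here
    (root / "wordlists").mkdir()
    (root / "stopwords.txt").write_text("the\nof\n", encoding="utf-8")
    (root / "wordlists" / "animals.txt").write_text("frog\ncat\n", encoding="utf-8")
    stopwords, wordlists = load_word_lists(root)

print(sorted(stopwords))
print(wordlists)
```

A check like this can be useful for verifying a hand-built /data folder before starting the application.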

You can download an example /data folder with all embeddings listed above, an English stop word list, and example word lists here.

Paper

Our paper about this project was presented at EuroVis 2018 and published in Computer Graphics Forum. It features a comprehensive task analysis for embeddings based on the domain literature, presents designs for a subset of these tasks, and discusses the rationales behind them. A preprint of the paper is available here (supplemental material). Please cite this project using the following citation:

  @Article{HG18,
    author       = {Heimerl, Florian and Gleicher, Michael},
    title        = {Interactive Analysis of Word Vector Embeddings},
    journal      = {Computer Graphics Forum},
    number       = {3},
    volume       = {37},
    month        = {jun},
    year         = {2018},
    projecturl   = {http://graphics.cs.wisc.edu/Vis/EmbVis},
    url          = {http://graphics.cs.wisc.edu/Papers/2018/HG18}
  }

last updated 4 June 2018