Interactive Analysis of Word Vector Embeddings
Abstract
Word vector embeddings are an emerging tool for natural language processing. They have proven beneficial for a wide variety of language processing tasks. Their utility stems from the ability to encode word relationships within the vector space. Applications range from components in natural language processing systems to tools for linguistic analysis in the study of language and literature. In many of these applications, interpreting embeddings and understanding the encoded grammatical and semantic relations between words is useful, but challenging. Visualization can aid in such interpretation of embeddings. In this paper, we examine the role of visualization in working with word vector embeddings. We provide a literature survey to catalogue the tasks for which embeddings are employed across a broad range of applications. Based on this survey, we identify key tasks and their characteristics. Then, we present visual interactive designs that address many of these tasks. The designs integrate into an exploration and analysis environment for embeddings. Finally, we provide example use cases for them and discuss feedback from domain users.
Videos
Before playing around with the demos, it may help to check out some examples. The following videos introduce the visualizations and interactions and show example scenarios.
- Concept projections
- Reconstructed co-occurrences in one embedding
- Reconstructed co-occurrences across multiple embeddings
- Exploring and comparing neighborhoods of word vectors
- Exploring and comparing neighborhoods of arithmetic structures
- Exploring the changing meaning of "broadcast" by comparing neighborhoods
Demo
You can access a running version of our implementation here.
Software
The source code of our implementation is available on GitHub.
In addition, we provide a Dockerized version of our implementation that bundles all necessary libraries here. The image can easily be deployed on your own machine or on cloud platforms such as DigitalOcean or Amazon AWS. To run it, install Docker on your machine and pull the image using:
docker pull fheimerl/viswordembeddings

Then set it up following the manual here.
Embeddings
We provide 26 example embeddings in the right format, ready to use and explore with our implementation. Those are:
- 1800-1990: The HistWords embeddings created by Hamilton et al. We use their "All English" data set. Those embeddings are trained on Google n-grams for each decade from 1800-1990, resulting in 20 embeddings (download).
- EEBO_TCP: A word embedding trained with GloVe on the historic EEBO-TCP corpus. This embedding is trained with a window size of 15 and a minimum word count of 5 (download). The dimensionality of this embedding is 50, which we have chosen to reduce the memory requirements of our online demo. While such an embedding works reasonably well for demonstration purposes, high-quality embeddings used in production environments usually have between 200 and 300 dimensions.
- wiki1-5: Word embeddings trained with GloVe on the entire English Wikipedia. All five embeddings are trained with a window size of 15 and a minimum word count of 5 (download). The dimensionality of these embeddings is 50, which we have chosen to reduce the memory requirements of our online demo. While these embeddings work reasonably well for demonstration purposes, high-quality embeddings used in production environments usually have between 200 and 300 dimensions.
Data Format
To use your own embeddings, store them in a simple text format, with one vector per line. The word is the first token in each line, followed by the vector, with a value for each dimension, separated by spaces. The resulting file looks like this (for three-dimensional word vectors):
house 0.21231 0.23423 0.23232
frog 0.13423 0.32123 0.49632
wall 0.43289 0.16987 0.24687
...

In addition, the file may contain biases and context vectors, concatenated with the word vectors:
house 0.21231 0.23423 0.23232 0.57684 0.13457 0.46513 0.32846 0.32146
frog 0.13423 0.32123 0.49632 0.46158 0.53264 0.21955 0.31642 0.20134
wall 0.43289 0.16987 0.24687 0.31256 0.19465 0.31648 0.21536 0.06431
...

This format can be produced and processed by all popular implementations of word vector embedding methods (e.g., GloVe, word2vec, gensim). To convert this text format into numpy matrices, which viswordembeddings expects, you can use the python script available here.
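For illustration, the following minimal Python sketch shows one way such a conversion can work. It is not the project's conversion script linked above, and for brevity it handles only plain word vectors (no biases or context vectors, and no stats file); the function name and output paths are hypothetical. It writes the arrays using the file naming conventions described in the next section.

# Minimal sketch, not the official conversion script linked above.
# Reads the plain text format (word followed by one value per dimension,
# space separated) and writes <name>_vecs.npy and <name>_terms.npy.
import numpy as np

def convert_text_embedding(txt_path, embedding_name, out_dir="."):
    terms, vectors = [], []
    with open(txt_path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            if len(parts) < 2:
                continue
            terms.append(parts[0])                         # first token: the word
            vectors.append([float(x) for x in parts[1:]])  # remaining tokens: vector values
    vecs = np.asarray(vectors, dtype=np.float32)
    np.save(f"{out_dir}/{embedding_name}_vecs.npy", vecs)
    np.save(f"{out_dir}/{embedding_name}_terms.npy", np.asarray(terms))
    return vecs.shape

# Hypothetical usage:
# convert_text_embedding("my_embedding.txt", "my_embedding", out_dir="data/embeddings")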
Alternatively, you can store embeddings directly in a set of numpy arrays, with the following naming conventions:
- [embedding_name]_vecs.npy
Stores the embedding word vectors in a numpy matrix.
- [embedding_name]_terms.npy
Stores the words as strings in a numpy vector. Vectors in the matrix are mapped to their words through their index.
- [embedding_name]_stats.npy
Stores statistics about the vectors (currently about neighborhood sizes). This file can be generated from the matrix file using this script.
- [embedding_name]_cvecs.npy
Stores the context vectors in the same matrix format that is used for the word vectors.
- [embedding_name]_biases.npy
Stores the biases for the word vectors.
- [embedding_name]_cbiases.npy
Stores the biases for the context vectors.
viswordembeddings expects all embeddings to be stored as numpy arrays, following the above file naming conventions, in the /embeddings subfolder of a /data folder next to the /app folder. The /data folder can also contain stopwords.txt, a stop word list with one word per line. A /wordlists subfolder can contain text files with the word lists for the scatter plot design; each file can contain an arbitrary number of words, one per line.
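As an illustration of this layout, here is a small Python sketch that loads an embedding from a /data folder structured as described above. It is not part of viswordembeddings, and the embedding name used in the usage note is a hypothetical example.

# Sketch of loading an embedding from the expected /data layout:
#   data/embeddings/<name>_vecs.npy, data/embeddings/<name>_terms.npy, ...
#   data/stopwords.txt (one stop word per line)
#   data/wordlists/*.txt (word lists for the scatter plot design)
from pathlib import Path
import numpy as np

def load_embedding(data_dir, name):
    emb_dir = Path(data_dir) / "embeddings"
    vecs = np.load(emb_dir / f"{name}_vecs.npy")    # word vector matrix
    terms = np.load(emb_dir / f"{name}_terms.npy")  # words, aligned with the matrix by index
    assert len(terms) == vecs.shape[0]
    return terms, vecs

# Hypothetical usage:
# terms, vecs = load_embedding("data", "my_embedding")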
You can download an example /data folder with all embeddings listed above, an English stop word list, and example word lists here.
Paper
Our paper about this project was presented at EuroVis 2018 and published in Computer Graphics Forum. It features a comprehensive analysis of embedding tasks grounded in a survey of the domain literature, presents effective designs for a subset of these tasks, and discusses the rationales behind the designs. A preprint of the paper is available here (supplemental material). Please cite this project using the following citation:
@Article{HG18,
  author     = {Heimerl, Florian and Gleicher, Michael},
  title      = {Interactive Analysis of Word Vector Embeddings},
  journal    = {Computer Graphics Forum},
  number     = {3},
  volume     = {37},
  month      = {jun},
  year       = {2018},
  projecturl = {http://graphics.cs.wisc.edu/Vis/EmbVis},
  url        = {http://graphics.cs.wisc.edu/Papers/2018/HG18}
}
last updated 4 June 2018