Metadata

This page describes the metadata VEP holds for its curated corpora, along with its supplemental metadata for the TCP. Most of the metadata described below is available through the Metadata Builder, which lets you download spreadsheets of data customized to your interests.

Corpora Metadata

VEP typically offers the following metadata for each of its corpora:

  • Master spreadsheet: a spreadsheet that catalogs the corpus and provides selected information about its files (e.g., genre, subject headings, translator)
  • Ubiqu+Ity spreadsheet: a spreadsheet of statistical information about the corpus’s texts, generated by Ubiqu+Ity
  • List of 1-Grams*: a spreadsheet that ranks every word in the corpus from most frequent to least frequent

* Corpus 1-grams cannot be accessed through the Metadata Builder.
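For readers who want to build a comparable ranking from their own copy of a corpus, the idea behind a 1-gram list can be sketched in a few lines of Python. This is only an illustration: the function name `rank_one_grams` and the tokenization rule (lowercasing and splitting on runs of letters and apostrophes) are assumptions made here, not a description of how VEP produced its spreadsheets.

```python
import re
from collections import Counter


def rank_one_grams(texts):
    """Return (word, count) pairs sorted from most to least frequent.

    `texts` is any iterable of plain-text strings, one per file.
    """
    counts = Counter()
    for text in texts:
        # Assumption: a "word" is a lowercase run of letters or apostrophes.
        counts.update(re.findall(r"[a-z']+", text.lower()))
    # Sort by descending count, breaking ties alphabetically.
    return sorted(counts.items(), key=lambda pair: (-pair[1], pair[0]))


if __name__ == "__main__":
    sample = ["The king and the queen", "the queen spoke"]
    for word, count in rank_one_grams(sample):
        print(word, count)
```

Early modern spelling variation means that the choice of tokenization and any spelling standardization will change the ranking, so results from a sketch like this may not match the published lists.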

TCP Supplemental Metadata

VEP offers supplemental metadata for TCP texts. Note that the supplemental TCP metadata is not meant to be all-encompassing. We offer it as a starting point to help users sift through the 61,000 texts. For example, the language metadata does not account for every text in the TCP, only those that have the fewest recognizable English words.

  • TCP Master spreadsheet: a spreadsheet of metadata provided about the TCP; the spelling of the text titles has been standardized
  • Derived Dates: a spreadsheet that provides derived dates for all texts in the TCP
  • Derived ESTC IDs: a spreadsheet that matches TCP texts to ESTC IDs beyond those provided in the TCP Master spreadsheet; provides about 1,600 more matches
  • Non-English Language Information: a spreadsheet that lists primary and secondary languages for TCP texts; these are the TCP texts with the fewest recognizable English words
  • Number of Figures per TCP text: a spreadsheet that records the number of <FIGURE> tags in each TCP XML file; this spreadsheet is helpful for those who wish to find texts that contain images and tables of information
  • TCP Top 100 1-Grams Frequency per Text: a spreadsheet that lists the frequency per TCP text of the top 100 words in the entire TCP
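As an illustration of what the Number of Figures spreadsheet records, the following Python sketch counts opening <FIGURE> tags in the text of a single XML file. It is a minimal sketch under stated assumptions: the function name `count_figures` is hypothetical, the tag is assumed to appear in uppercase as written above, and a regular expression is used instead of a full XML parser so that the count does not depend on the file parsing cleanly.

```python
import re

# Matches an opening or self-closing <FIGURE ...> tag, but not </FIGURE>.
# Assumption: the tag name is uppercase, as in the description above.
FIGURE_OPEN = re.compile(r"<FIGURE\b[^>]*>")


def count_figures(xml_text):
    """Return the number of <FIGURE> elements opened in `xml_text`."""
    return len(FIGURE_OPEN.findall(xml_text))


if __name__ == "__main__":
    sample_xml = (
        "<TEXT><P>Intro</P>"
        "<FIGURE><HEAD>Map</HEAD></FIGURE>"
        "<FIGURE/></TEXT>"
    )
    print(count_figures(sample_xml))
```

To cover a whole directory, the same function could be applied to the contents of each file and the results written out one row per text.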