Do you work with the TEI P4 versions of TCP XML files and wonder what all those tags mean? After surveying XML tags in TCP corpora, I made a spreadsheet that lists all of the tags, defines them, and mentions where you may find said tags within the XML documents.
Surveying and examining the files made it obvious that different TCP corpora have different levels of curation. The EEBO-TCP corpus has the most fleshed-out metadata. Not all tags are used across all corpora. According to my survey, Evans-TCP doesn’t use <FILEDESC> tags the way EEBO-TCP and ECCO-TCP do. Also, EEBO-TCP has the following tags that ECCO-TCP and Evans-TCP don’t: <AB>, <DEL>, <FW>, and <SUBST>.
Knowing all of the tags and what they are used for has been important for VEP’s methods, since we’re writing a script that grants users flexibility for text extraction from TCP TEI P4 XML files. It uses a configuration file that indicates text to extract and ignore between XML tags. For example, if you want to extract plays without their stage directions, the configuration file will allow you to ignore the <STAGE> tags that contain them.
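To make the idea concrete, here is a minimal sketch of tag-based extraction in Python. It is not VEP's actual script: the `CONFIG` dictionary stands in for the configuration file (whose real format isn't described here), the function name `extract_text` is hypothetical, and the sample XML is an invented fragment rather than text from a TCP file. It also assumes the input parses as well-formed XML.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the configuration file: tags whose contents
# should be left out of the extracted text.
CONFIG = {"ignore_tags": {"STAGE"}}

# Invented sample fragment (not taken from a TCP file).
SAMPLE = (
    "<SP><SPEAKER>First.</SPEAKER> <L>Who goes there?</L> "
    "<STAGE>Enter a second speaker.</STAGE> <L>A friend.</L></SP>"
)


def extract_text(element, ignore_tags):
    """Return the text under `element`, skipping the contents of ignored tags."""
    if element.tag in ignore_tags:
        return ""
    parts = [element.text or ""]
    for child in element:
        parts.append(extract_text(child, ignore_tags))
        # Tail text follows the child's closing tag and belongs to the
        # parent, so it is kept even when the child itself is ignored.
        parts.append(child.tail or "")
    return "".join(parts)


root = ET.fromstring(SAMPLE)
raw = extract_text(root, CONFIG["ignore_tags"])
# Collapse the whitespace left behind by skipped elements.
print(" ".join(raw.split()))
```

With an empty set of ignored tags, the same function returns the stage direction along with everything else, so changing the configuration is enough to switch between the two outputs.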
For the spreadsheet, I took the tag definitions from TEI: Text Encoding Initiative.