Finding and Preparing Text

How do you find an electronic text of a work you want to study? How should you prepare it for study with a tool like Voyant? This appendix provides a short guide to finding and preparing an electronic text.

1.0 Finding an Electronic Text

If you want to study a work that is out of copyright, there is a chance that someone has already prepared an electronic text. Some good places to look include:

Major Online Text Collections by Language (Columbia University Libraries) <http://library.columbia.edu/locations/dhc/off-campus/language.html>
Gallica has a large collection of French texts <http://gallica.bnf.fr/>
DH Toychest: Data Collections and Datasets (curated by Alan Liu) <http://dhresourcesforprojectbuilding.pbworks.com/w/page/69244469/Data%20Collections%20and%20Datasets>
Project Gutenberg <http://www.gutenberg.org/>
The Internet Archive Text Archive <http://archive.org/details/texts>
Digital Public Library of America <http://dp.la/>
Europeana Newspaper Collections <http://research.europeana.eu/itemtype/newspapers>
The University of Oxford Text Archives <http://ota.ahds.ac.uk/>

Once you have found a text you can either use the URL in Voyant or download the text, edit it, and then upload to Voyant.

If the text is usable as is on the web, then you can just enter the URL into Voyant. For example, Gutenberg has a Plain text (UTF-8) version of Plato’s Phaedrus at <http://www.gutenberg.org/cache/epub/1636/pg1636.txt>. You can enter that URL into the Add Texts text box. You can also add multiple URLs, one per line.
In many cases you may want to download, proof, and edit the text to improve the accuracy of the analysis. For example, the Gutenberg version of the Phaedrus has metadata at the beginning and a long Gutenberg license at the end. Both will throw off word statistics.
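Trimming the Gutenberg metadata and license can be scripted. The following is a minimal Python sketch, not an official Gutenberg or Voyant tool. It assumes the file contains marker lines of the form "*** START OF THIS PROJECT GUTENBERG EBOOK ..." and "*** END OF THIS PROJECT GUTENBERG EBOOK ..."; the exact wording has varied between files, so check your copy. The function name strip_gutenberg_boilerplate is made up for this example.

```python
import re

# Assumed marker shapes; the wording differs between Gutenberg files,
# so open your file and confirm before relying on these patterns.
START_RE = re.compile(
    r"^\*\*\*\s*START OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$",
    re.MULTILINE | re.IGNORECASE,
)
END_RE = re.compile(
    r"^\*\*\*\s*END OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$",
    re.MULTILINE | re.IGNORECASE,
)


def strip_gutenberg_boilerplate(text):
    """Return the text between the START and END marker lines.

    If a marker is missing, that side of the text is left untouched.
    """
    start = START_RE.search(text)
    end = END_RE.search(text)
    begin = start.end() if start else 0
    finish = end.start() if end else len(text)
    if finish < begin:
        # Markers in an unexpected order: leave the text alone.
        return text.strip()
    return text[begin:finish].strip()
```

To use it, read the downloaded file as UTF-8, pass its contents to the function, and save the result as a new .txt file to upload. Proofread the result anyway, since translator notes and introductions usually sit inside the markers.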

Go here for help on Loading Texts into Voyant.

2.0 Preparing an Electronic Text

In some cases you will need to create your own electronic text from a print original. It is best to create these texts as plain text files in Unicode (UTF-8).

You can type the text into a text editor or use a scanner with OCR (Optical Character Recognition). Scanners are now cheap and often available at libraries or computing services. With a scanner you can scan page after page and then run OCR software on your computer that will recognize the text. Many smartphones now have cameras good enough that an app can photograph one or more pages and run OCR on them. Scanner Pro for iOS is one example, but others are available.

Here are some general guidelines for preparing texts for Voyant.

Plain text files. Save the text as a plain text file (.txt). If you are using a word processor like Microsoft Word you will need to Save As and pick the File Format: Plain Text (.txt). Simply changing the extension of a file from .docx to .txt does not make it a text file. For that matter a text file can have a different extension like .xml and still be a text file.
Check with a text editor. You can check that a file is a plain text file by opening it in a text editor like Notepad (PC) or TextEdit (Mac). If it opens and doesn’t have garbage characters (that look like boxes) then you probably succeeded. In a text editor you can always save again to make sure. TextEdit has a command to Make Plain Text (under the Format menu) if the file is not plain text. This will convert the file to plain text and then you can save it.
Don’t use a word processor; get a text editor. It is generally not a good idea to use a word processor even if you are used to them because they save files in proprietary formats and it is easy to forget that. It is worth getting a good text editor other than the ones that come free on your computer if you will be doing a lot of text analysis. TextWrangler for Mac is an example. In TextWrangler you can Save As and make sure the character encoding is Unicode (UTF-8).
Using XML. Voyant can use XML markup in useful ways to create collections or to extract only a subset of a text. This is an advanced topic; see Voyant’s Creating a Corpus page.
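The plain-text and encoding checks above can also be done with a short script. This is a rough Python sketch under stated assumptions: the function names are invented for this example, and the default source encoding (cp1252, common for older Windows files) is only a guess that you should replace with the real encoding of your file.

```python
from pathlib import Path


def is_utf8_plain_text(path):
    """Return True if the file decodes cleanly as UTF-8 and has no NUL bytes."""
    data = Path(path).read_bytes()
    if b"\x00" in data:
        # NUL bytes strongly suggest a binary format (for example a
        # word-processor file that was merely renamed to .txt).
        return False
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def convert_to_utf8(src, dest, source_encoding="cp1252"):
    """Re-save a text file as UTF-8.

    source_encoding is an assumption; a wrong value produces wrong
    characters rather than an error, so inspect the output.
    """
    text = Path(src).read_text(encoding=source_encoding)
    Path(dest).write_text(text, encoding="utf-8")
```

A file that passes is_utf8_plain_text is not necessarily clean text, so it is still worth opening it in a text editor and reading a few pages.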

Here are some links to guidelines on how to prepare a scholarly digital text.

The NINCH Guide to Good Practice in the Digital Representation and Management of Cultural Heritage Materials <http://www.nyu.edu/its/humanities/ninchguide/>
Text Encoding Initiative Guidelines <http://www.tei-c.org/index.xml>

3.0 Assembling a Research Corpus

For some research questions there is no given text, and you will need to think about how to assemble a corpus of texts that can be analyzed. For example, you might want to study how high-performance computing (HPC) has been discussed on the web by assembling popular web pages that define HPC. Here are some hints on how to assemble a corpus.

Keep track of items. It is easy to lose track of what you gathered for your corpus and where it came from. Keep track in a notebook or spreadsheet.
Zip your files. If you have a collection of files that you want to analyze together it is easiest to Zip them into one archive and upload that. You can upload individually, but that takes longer.
Give your files short clear names. The names of files are used in Voyant when it shows you the items in a collection. For this reason it makes sense to use names that can be read in Voyant.
Name files in chronological order. If your items (files) have an order that you want preserved by Voyant in tools like Trends, then you should name your files in a way that will sort properly. For example, you might name your files by the year they represent, as in 1984.txt, 1985.txt and so on.
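If you prefer to script the zipping step, the following Python sketch shows one way to do it. The function name zip_corpus and the choice to include only .txt files are assumptions made for this example; adjust the pattern if your corpus contains other file types.

```python
import zipfile
from pathlib import Path


def zip_corpus(folder, archive_path):
    """Zip every .txt file in folder into archive_path.

    Files are added in name order, so names like 1984.txt and 1985.txt
    go into the archive chronologically. Returns the stored names.
    """
    files = sorted(Path(folder).glob("*.txt"))
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for f in files:
            # arcname drops the folder path so only the short file name is stored.
            archive.write(f, arcname=f.name)
    return [f.name for f in files]
```

The returned list is a quick way to confirm that the names sort the way you expect before you upload the archive.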

Here is a simple strategy for building a corpus of web pages around a subject.

Search for the subject you want to study on Google. It will give you the most popular pages according to its algorithm.
Go through each link and see if it is what you want. If it is then copy the URL into a spreadsheet and save the web page (HTML Only) to a folder using a suitable name.
- You should add the name of the file saved to the folder to the spreadsheet so you can associate URLs to file names.
- You can also add metadata to the spreadsheet for each page in other columns. This will let you sort by subsets.
- You may find that you need to grab more than one page from a site. Just make sure you keep track of URLs and the filenames you save them under.
Once you have collected a sample of pages and documented them in a spreadsheet you can copy all the URLs from the spreadsheet and paste them into the Add Text box of Voyant.
- You can use metadata in columns to sort the spreadsheet if you want to select subsets of URLs. If you had a column with the language of the web page you could sort and choose only the French ones.
Inevitably the links will break or pages change so you may want to use the saved files. At this point you might want to Zip up your folder of files and submit that to Voyant.
- Alternatively, you could go through each file and extract the relevant text so that you don’t have all the navigation text. You can then create a new file with all the texts extracted. That file can then be uploaded and analyzed.
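Extracting the relevant text from saved pages can be partly automated. Here is a minimal Python sketch that uses only the standard library. It assumes that navigation and other boilerplate sit inside elements such as nav, header, footer and aside, which is not true of every site, so treat the output as a first pass to check by hand. The names SKIP_TAGS, TextExtractor and extract_text are invented for this example.

```python
from html.parser import HTMLParser

# Elements whose contents are assumed to be boilerplate, not body text.
SKIP_TAGS = {"script", "style", "nav", "header", "footer", "aside"}


class TextExtractor(HTMLParser):
    """Collect visible text, ignoring anything inside SKIP_TAGS elements."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # how many skipped elements we are currently inside
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_text(html):
    """Return the text of an HTML page, one chunk per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)
```

You could run extract_text over each saved HTML file and write the result to a .txt file with the same base name, so the file names recorded in your spreadsheet still line up.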

4.0 Useful Links

There are a number of other sites that deal with access to electronic texts.

OpenEditions <http://www.openedition.org/>
University of Michigan: Finding Electronic Texts <http://www-personal.umich.edu/~esrabkin/FindingElectronicTexts.htm>
The European Library <http://www.theeuropeanlibrary.org/>
HathiTrust <http://hathitrust.org/> (Downloading books requires institutional login.)

The Resources page of this site has other links.