My Issue with Humanities Data

Updated: Oct 17, 2023

Each of my two doctoral theses presents a different computational approach to the study of texts. Both were written between 1999 and 2003 and were heavily influenced by my collaborators, a group led by Christopher Howe, head of the biochemistry lab at the University of Cambridge. As I worked as part of the STEMMA (Studies of Textual Evolution of Manuscripts through Mathematical Analysis) project, surrounded by scientists, I might have accidentally adopted their terminology. I am certain I was not trying to make a point about data or whether I, as a humanist, could have any such thing. The materials I was working with, including full-text transcriptions and unregularized, regularized, and lineated collations, were meant to be ingested and processed by computers.

There is no point in either of my theses in which I define humanities data. I just had data. They were processable and countable, and they represented the repository of the information I required to make judgements about the textual affiliations of the source of the corrections in Caxton’s second edition of the Canterbury Tales or the transmission of the extra-textual features represented by the tale-order in the manuscript tradition.

The process of obtaining the data for each thesis was fundamentally different. I used collations for my De Montfort University work (the complete appendixes, included on a CD-ROM, amounted to more than 10,000 pages of printed material). For my NYU dissertation, I used a distance-coded version of a table containing the order of the tales as found in all complete, or almost complete, manuscripts. For comparison purposes, this was recoded with a method devised by Li-San Wang and Tandy Warnow, IEDB.

In my NYU thesis, I distinguish word-data from tale-order data as concepts but always assume that they require no definition from a humanities perspective. From a text-critical perspective, one of them is strictly textual (word-data), while the other is “extra-textual” (tale-order). Beyond this, I didn’t make any other distinctions.

One specific comment I recall from my January 2003 defence (DMU) was that the committee members were impressed by the objectivity with which the research was presented. In general, I used the word “variant” for words appearing in a single place of variation. Again, I didn’t do this deliberately to make a point; the word allowed me to refer to any of those words alike, without making a judgement about the direction of variation (although occasionally, I could tell exactly how the text had developed).

As we completed the submission of our SSHRC Insight last week, I realized what my problem is with the “debates” surrounding humanities data: some scholars are treating data as if humanists had just discovered the concept, while I have been using the term since before anyone had thought of calling what we do “Digital Humanities.” No wonder it is difficult for me to see why so many people have taken issue with this matter.

