Do most faculty in the humanities see collaborative research and engagement with large datasets in their future?
Two projects led by Johns Hopkins University that come from apparently opposite ends of the spectrum, the Virtual Observatory (VO) and the Roman de la Rose Digital Library, reveal some unexpected relationships and bring to light a shift in scholarly practices that cyberinfrastructure may foster.
The Virtual Observatory represents one of the quintessential cyberinfrastructure projects: large, complex datasets shared, visualized, and analyzed by a distributed group of astronomers. The Rose project features a digital library (along with related services) comprising digital surrogates of Old French manuscripts copied from the late-13th to the mid-16th centuries, many of which are richly illuminated. How could the Rose project offer any insight into data-driven scientific investigation? Even a completely digitized corpus of all extant Rose manuscripts (some 250 are known worldwide) would not approach the scale of the VO datasets. But, upon reflection, we suggest there may be an important relationship between “data mining” and collaborative scholarly practices in the sciences and humanities.
Greater collaboration among humanities scholars in the discovery and production of knowledge is cited by the ACLS cyberinfrastructure report, Our Cultural Commonwealth, as one of the goals and characteristics of an effectively implemented cyberinfrastructure for the humanities and social sciences. But it is posed in the report against the entrenched, traditional academic culture of the “individual genius” working in isolation. Referencing a recent study on American literary scholarship online, the ACLS report commented that:
Despite the demonstrated value of collaboration in the sciences, there are relatively few formal digital communities and relatively few institutional platforms for online collaboration in the humanities. In these disciplines, single-author work continues to dominate. Lone scholars, the report remarked, are working in relative isolation, building their own content and tools, struggling with their own intellectual property issues, and creating their own archiving solutions.
While this may be true, we have reason to believe that a change is at hand; what is more, when one considers the evolving historical relationship between the humanities and sciences, the picture becomes more complex.
Rudolphine Tables = Open Content Alliance?
Scientists were not always good collaborators. In discussing the Rudolphine Tables (Johannes Kepler’s 1627 star catalog and planetary tables that radically improved the ability to calculate planetary positions), computer scientist Michael Nelson made the startling suggestion that the Tables might be considered on a par with today’s Google Book Search or the Open Content Alliance, in their power to inspire a new generation of scholarship. But he continues by noting that there were a host of issues standing in the way of the Tables’ publication, including “significant infrastructure costs (in the form of purpose-built observatories), professional jealousy, intellectual property restrictions, and political and religious instability.” This suggests that at the time, astronomy was a discipline defined by lone practitioners who would guard their data with great secrecy; in the “data-poor” environments of the early scientific era, scientists did not readily share data or collaborate.
In contrast, by 1627, when the Rudolphine Tables were published, the Roman de la Rose had been written, re-written, re-purposed, recast, illuminated, and shared many times over. In that era, before the development of scientific instrumentation, when “data” consisted of the spoken word, the written word, and illuminations, this body of manuscripts represents a “data-rich” environment where humanists did collaborate in the creation of new knowledge.
Perhaps it is not a set of inherent characteristics within specific disciplines that defines their mode of scholarship or communication, but rather the relative ease or difficulty with which practitioners of those disciplines can generate, acquire or process data. While many may think that humanities materials are comparatively data-poor, we suggest they can be data-rich in numerous ways. A single Rose manuscript, for example, contains a tremendous amount of textual, visual, and semantic content that is sometimes difficult to extract in meaningful ways, and nearly impossible to represent adequately in a printed edition. As our ability to move these data into digital formats improves, we believe that humanists will be drawn into new forms of collaboration that will inspire new kinds of scholarship: large-scale digitization might bring the humanities into a new age of “data-driven scholarship,” much as the Rudolphine Tables inspired astronomers.
The NSF’s 2007 report, Cyberinfrastructure Vision for 21st Century Discovery, cites 27 recent cyberinfrastructure studies and reports from across the sciences, engineering, social sciences, and humanities. This surely represents an unprecedented convergence of interest across C.P. Snow’s “Two Cultures” in the promise of cyberinfrastructure and of data-driven research. There is no doubt that the sciences and engineering are leading the way for data-driven scholarship in our current environment, but many areas of humanities research are increasingly data-driven as well. As our digital library group at Johns Hopkins has learned more about the data curation needs of projects from a variety of disciplines, we have realized that we are facing a data deluge, relating not only to the Virtual Observatory but also to the ever-increasing size and number of data files that humanities projects such as the Roman de la Rose Digital Library are now generating.
Manuscripts, so evidently data-rich in the era in which they were created, today retain their former value and meaning while they inspire a new generation of humanists to create new sets of data. This includes the metadata needed to encode, organize, and understand the texts, annotations, and the visual art embodied in the manuscripts. Not only does this demonstrate the parallel need for data curation and preservation in the humanities and the sciences (for at the level of storage infrastructure, a byte is a byte and a terabyte a terabyte) but it underscores the fact that there is an increasing convergence of what it is that is analyzed by humanities scholars and scientists: data. In addition, there is an increasing overlap between the two communities in the tools needed for storing, accessing, and manipulating this data. Let us propose, then, that putting aside obvious aesthetic differences, scientific datasets are a modern “equivalent” of medieval manuscripts.
In fact, one could argue that manuscripts such as The Rose represented the richest sets of data/information available in their day and were stored for subsequent examination, analysis and repurposing. Additionally, they contained multiple types of data such as integrated texts and images, user annotations, and intertextual allusions and references. These intertextual references frequently pointed the reader to other texts available in the same monastic or university libraries. Thus the early codex, situated in a library of other codices to which it was linked in a semantic web of intertextuality, was a collection of active links, hyperlinks if you will, that simultaneously informed the reader how to navigate the text at hand and pointed outward to other relevant documents. The library was the “web” before the Web existed.
Digital tools are allowing us to capture, manipulate, and examine books and their data in ways that are revolutionizing the humanities. Entire libraries are now being digitized, linking their components in unforeseen ways. Libraries that have been dispersed by auction, theft, or the vagaries of time may be virtually reassembled. And new libraries, whether a collection of all extant Rose manuscripts (which of course has never been, and could never have been, assembled) or something on the immense scale of the Google Books project, are emerging, bringing with them powerful tools and possibilities for research that have barely been realized. Finding themselves in new kinds of data-rich, multimedia environments, created by mass digitization projects as well as by the continuing efforts of libraries, museums, and archives to digitize their special collections and their image, moving-image, and sound files, humanists are increasingly considering the potential for cyberinfrastructure-related research and teaching.
Digital media provide an opportunity to reflect more accurately forms of medieval textuality and transmission that disappeared during the print era. The routine integration of text and image on computer screens, the recombinant nature of electronic texts, and the idea that anyone can copy, alter, edit, and retransmit a document (much to the chagrin of those with the most to lose from the potential collapse of traditional copyright laws), all have strong parallels in medieval texts and acts of textual transmission.
Print culture played a formative role in creating the notion of a single, authoritative text, as well as the expectation of an individual genius working alone. Technology in the form of the printing press shifted the scholarly landscape. Old models of collaboration, as well as the attendant mechanisms of creating, publishing, and transmitting works of scholarship, were replaced by a new world of large-scale publishing whose aim was the production of multiple, identical copies of a single authoritative text. The mechanization of production, copying, and transmission led to the virtual extinction of a scribal culture that produced unique versions of texts in which the roles of author, scribe, editor, and publisher were inextricably blurred.
Technology, both in its processes and tools, always will influence and shape a culture. But how do we ensure that the evolving cyberinfrastructure supports but doesn’t overly define the new forms of emerging data-driven scholarship? One of the imperatives for the humanities community is to define its own needs on a continuous basis and from that to create the specifications for and build many of its own tools. At the same time, it will be worthwhile to discover whether new cyberinfrastructure-related tools, services, and systems from one discipline can support scientists, engineers, social scientists, and humanists in others. NSF (perhaps in collaboration with the NEH and IMLS) might help track the portability of such resources.
Finally, we want to point out that we can apply a historical lens to this issue today only because of earlier commitments to the preservation of our heritage. However, as highly coveted manuscripts and other valuable physical objects are digitized, the resultant datasets are often not as highly regarded by libraries. We believe this represents a shortcoming of vision. For while the curation of physical codices will remain an essential role for libraries, the collection and curation of digital objects will assume greater importance for libraries of the future, and the infrastructure, budgetary priorities, and strategic plans of library organizations would do well to account for this sooner rather than later. In the digital age, data can become at risk in as short a period as five years, and we have already irrevocably lost important datasets. The importance of curating datasets to ensure long-term, persistent access cannot be overstated. Imagine the loss to science and scholarship if we had not preserved the Rudolphine Tables or the Roman de la Rose manuscripts.
American Council of Learned Societies’ Commission on Cyberinfrastructure for Humanities and Social Sciences, Our Cultural Commonwealth (2006), p. 28. See also p. 48 on how “traditional scholarly work, in the form of a single-authored, printed book or article published by a university press or scholarly society, is the currency of tenure and promotion, and work online or in new media, especially work involving collaboration, is not encouraged.” http://www.acls.org/cyberinfrastructure/acls.ci.report.pdf. Retrieved September 2, 2007.
Michael L. Nelson, “I Don’t Know and I Don’t Care,” NSF/JISC Repositories Workshop, April 2, 2007. http://www.sis.pitt.edu/~repwkshop/papers/nelson.html. Retrieved September 2, 2007.
National Science Foundation Office of Cyberinfrastructure, Cyberinfrastructure Vision for 21st Century Discovery, 3:46 (2007): Appendix B, “Representative Reports and Workshops.” http://www.nsf.gov/od/oci/CI_Vision_March07.pdf. Retrieved August 8, 2007.
The term, coined by British scientist and novelist C.P. Snow in his 1959 Rede Lecture “The Two Cultures and the Scientific Revolution,” became a shorthand for the rift between the sciences and humanities in approaches to problems. See C.P. Snow, The Two Cultures (Cambridge University Press, 1959; reprinted 1993).
See, for example, the effort to articulate this in the 2005 report from the University of Virginia’s Institute for Advanced Technology in the Humanities, Summit on Digital Tools for the Humanities. http://www.iath.virginia.edu/dtsummit/SummitText.pdf. Retrieved October 6, 2007.