Session Summary: Big Data and the Reinvention of the Humanities

Greg Crane, Professor and Chair of Classics at Tufts University, began his presentation by telling us that Big Data is both a curse and part of the solution for the reinvention of the Humanities. He outlined some of the current projects examining data in the humanities, including Corpus and Computational Linguistics, Cyberinfrastructure for Historical Languages, Mining A Million Books and Digging into Data.

To give us an idea of scale, particularly the size of the Mining A Million Books project, Crane reminded us that if we read a book a day we could get through 36,000 books in a lifetime, so a million books represents a huge amount of data. However, Google has digitised around 13 million books, so a million books is really a starting point.

Big data on this scale requires automated analysis of everything in order to hunt through what Crane described as a “vast primal soup” of information. You need text mining, visualisation and various other technologies, but you also need targeted human analysis. In the past some have felt there is a tension between traditional humanistic analysis – involving close reading and careful thought – and automated methods. Crane argued that there should be no such contrast: automated methods should be used to help identify the data points that you can then look at and think about, so they become a tool to aid traditional methods of analysis. There are some analytical tasks that machines do not do very well, including categorisation and classification, so you need people to come in and adjudicate. Crane also noted the need for research into the results of any automated analysis and publication.

Crane then focused on the main issue that this connects to for him, quoting from an article which observed that the number of liberal arts students at Harvard has declined by 27% in favour of science and engineering subjects in the last five years. He suggested that this is partly because engineering gives students practical knowledge that they feel they can actually use, but also because it makes them partners in the learning. Engineering undergraduates are very much involved in serious research, whereas this is a totally alien mindset in the humanities. Crane feels that humanities undergraduates need to be engaged in serious research too, particularly at a time when there is so much to be done. Digital technology is enabling new research by humanities undergraduates through the re-emergence of editing as a binary activity, and through the commented edition and translation as the basis for an undergraduate thesis. Given the sheer amount of material that remains untranslated and unworkable, this represents really useful work that could be undertaken by undergraduates and published.

To illustrate the changes in scale that have come about as a result of digital technologies, Crane used the example of Latin. The universe of accessible Latin in the pre-digital, print-dominated world was really quite small, as large amounts of material were inaccessible and stored away in print form. The arrival of digital technology has allowed the extraction of text labelled as Latin from 12,000 books so far, and has therefore expanded the workable amount of Latin from less than 10 million words accessible for study to at least 1 billion. Text has yet to be extracted from a further 15,000 books. There are not enough people to analyse and categorise this material without the work of undergraduates. Crane emphasised again that doing practical work strikes a deep chord with undergraduates.

Crane then moved on to discuss machine-actionable interpretation by looking at competing analyses of sentences, which can be compared in the context of other data to help assess which interpretation is more likely. He explained how these interpretations are presented in a diagrammatic, more mathematical format which can be analysed and stored in tree banks. Crane views these tree banks as being the most important development in the study of historical languages for 150 years. He emphasised that they have 4,000 years of historical linguistic data waiting to be analysed, which they don’t yet know what to do with, as Humanities researchers do not currently think in terms of such large amounts of data.

Crane highlighted language as the biggest barrier to studying this huge amount of data, a barrier that developments in technology are now helping to break down. In the past you needed to read Chinese well to be able to study a Chinese document, but now a little knowledge of the language, a computer translation tool and a dictionary can enable you to make much more progress and work with more linguistic material than you could before.

To conclude, Crane discussed what the Humanities need. He said they need data – they have a lot of unstructured data available with no metadata. They need open content – historical sources as shared data. Crane asserted that there is very little published historical data, if you consider publication to be something you can annotate and analyse. He does not consider something made available in print under subscription, where you cannot re-use and interact with the data, to be published: it is an archival object, not part of the public sphere, and does not enable global participation. They need to extract scientific corpora from vast, lightly structured collections – this is what the Digging into Data project is about. They need new intellectual configurations and new humanists – including computer scientists, computational linguists, cognitive scientists and humanists with deep, cross-cultural training, including in new tools and technologies.