Session Summary: Open Data Policies, Heather Joseph

Heather Joseph, Executive Director of the Scholarly Publishing and Academic Resources Coalition (SPARC), gave us an overview of how SPARC’s work has been focusing on making journal articles more openly accessible and raising awareness at all levels – including the grass roots in libraries on campuses, and the grass tops with the policy makers. SPARC want to advance the accessibility of academic articles coupled with explicit, liberal sharing and re-use rights. They also recognise that the articles are linked with the corresponding research data – and are in fact just one reduction of that data – so they have also done work on effective open data policies, enabling readers to move seamlessly from the digital articles to the data sets.

To put this work into context, Joseph provided an overview of the policy environment in the US and described how the current situation provides what she considers to be the best shot to date for doing some serious work on open access for data.

She began with copyright policy, which lays out very clear guidelines and gives a framework for what citizens can expect in terms of accessing and re-using data from government funded research. She also highlighted strong Freedom of Information Acts, which give good guidelines for accessibility. Finally she highlighted OMB Circular A-130, which lays out a very clear approach recommending “open and unrestricted access to public information at no more than the cost of dissemination”. Access to information policies are being developed around this, with taxpayer access as a central aspect in creating open access policies that are now extending into the data arena.

Joseph outlined the goals of the US public access policies:

  • expand access to the results of taxpayer-funded research
  • accelerate the pace of discovery and innovation
  • create a permanently accessible archive for public use
  • enhance accountability and transparency of federal agencies

Joseph then talked through the thinking behind these goals and the benefits that the policy makers are aiming for.

These goals have been widely used in the conversations around articles for some time, but are only just being transferred into the conversations around data. Joseph discussed why these conversations about data have been slower, highlighting the important drivers for the development of open data in the different areas. She noted that policy makers had consciously decided to separate out data and articles and keep them on different trajectories, which has contributed to the different rates of development for those open access conversations.

To chart the difference in more detail, Joseph observed that in the article movement there is a reasonably engaged grass roots in the scientific community calling for open access. The libraries, consumer groups and advocates are very engaged, and key leaders at the funding level have been calling for significant change. This has all resulted in some positive top-down mandates from Congress.

In the data area, however, whilst there is engagement from the scientific community, the library community is not leading the way this time. Instead, there is a group drawn from the general public, including people who want to remix data and create applications and visualisations. There is not yet much interest in open data within the various agencies, but there is huge interest at the executive branch level.

Joseph stressed that for policy change to happen, all three layers – top, middle and bottom – need to be moving together. She supported this with a quote from Sir Tim Berners-Lee, who said: “It has to start at the top, it has to start in the middle, and it has to start at the bottom.”

She emphasised that strong, clear open data policies have emerged in certain areas, but they tend to be community specific rather than overarching. These policies are important, but Joseph wanted to focus on the movement towards coalescence.

She referred to an open government directive issued by President Obama, with the goal of making the output of federal government more transparent, so that the outputs can be used and interacted with by the general public. The result of this has been the Data.gov site, which, although progressive in concept, has been criticised in terms of the value of the data that has been added. Joseph noted that the data sets available are not generally serious science research material. However, the site is only a year old, and so far 97m people have visited. Pew has found that 40% of visitors have downloaded data from the site, which shows that people are engaging with the data. Joseph views this as a great cause for optimism. She also described how the site is not only being used to disseminate data, but also to collect it from the general public to further research.

Joseph moved on to discuss the significant development by the NSF, whereby they are calling for a data management plan to be included in all funding applications, addressing what is going to happen to the research data and how it is going to be made accessible for the future. The guidelines are not prescriptive, reflecting what Joseph described as a recognition that, at this transitional stage, we need a holistic approach.

Finally, Joseph discussed some of the emerging themes resulting from the current policy environment. There is a recognition that maximising access maximises benefits, which is driving a shift towards setting the default for scholarly publishing to Open. However, in the data world there are shades of Open, so there is an increasing recognition in the conversations that exceptions will be the rule, given legitimate concerns about confidentiality and privacy. Joseph highlighted the need for a community-driven approach to identify where the exceptions should be and how the policies are developed and implemented. She explicitly recognised the need for partnerships and a culture change to incentivise sharing data. Connecting networks of knowledge to let people get to information in new ways to solve problems is what this is all for, so the aim is not just to get everyone on board, it is to get them to recognise why we need open data.

Session Summary: Preserving Library Collections

Roger Schonfeld, Manager of Research at ITHAKA S+R, linked his presentation about Preserving Library Collections with the discussions around Chris Cobb’s preceding presentation, noting that the intersection between quality and costing is particularly pertinent to the issues he intended to discuss.

He began by introducing the work of ITHAKA, a not-for-profit organisation that helps universities to work with new technologies. They have a strong focus on digital preservation and have been doing work to support libraries with the transfer from print to digital preservation from a policy perspective.

Schonfeld framed some of the issues around digital preservation by describing the results of a survey they conducted last autumn, which questioned faculty members about their feelings regarding hard copy journals compared with electronic copies. They used the strongly worded statement: “If my library cancelled the print version of a journal, but continued to make it available electronically, that would be fine with me”. The study shows that nearly 75% of faculty members agreed strongly with this statement, with this pattern clearly evident across humanities, social sciences and sciences.

Schonfeld discussed how these local questions about what “my library” does connect with the more system-wide questions about the implications of all libraries cancelling print versions of journals. The natural outcome of every library cancelling its print subscription would inevitably be that publishers would stop producing print editions and move to a digital-only model. However, when the same faculty members were questioned about publishers ceasing to produce print copies, the level of agreement was much lower. There is therefore something different between the local level and what happens at the system level, even though these things are deeply connected, and this has a significant impact when looking at collaboration and shared services.

They also asked a strongly worded question about university journal back files: “Assuming that electronic versions of journals are proven to work well and are readily accessible, I would be happy to see hard copy collections discarded and replaced with electronic collections,” as they recognised that not getting something registers very differently on an emotional level from getting something in the past and then throwing it out. Around 40% of faculty members strongly agreed with this statement, an increase on the previous study, where only 20% of faculty members supported this idea.

In summarising the rest of the study, Schonfeld noted that the number of faculty members who support their library, or even some libraries, maintaining hard copy records of print journals is steadily eroding. This has important implications for whether or how print versions of digitised materials will be preserved.

These trends have led to an increased interest in building shared print repositories. Schonfeld showed a map of some of the shared print repositories that have been coming into existence, noting that although there is no obvious logic to where these have sprung up, a number are appearing.

One of the benefits he identified of creating these shared collections was a move towards looking not just at what the libraries have, but at what might be missing from these collections. These developments have also led the library community in the US to start thinking about ways of shaving costs off the process of preservation following the digitisation of materials.

That said, the way the system has been evolving has led to all sorts of materials not being brought into a print preservation infrastructure. The shared collections are not communicating with each other to establish what might be preserved in 20 institutions and what might be preserved in only one, or not at all, so the system does need to be improved. Understanding these issues has become of particular interest to Schonfeld and his colleagues.

Schonfeld observed that there are two approaches to creating a more robust print preservation:

  • Centralize the preserved copies and develop a business model to provide access to them
  • Share information about the preservation status of individual copies – this model would be expected to evolve around groups of libraries self organising, driven not so much by the preservation needs of their documents, but by cost and space considerations connected with storage of preserved materials.

These two models can sound very similar, but the question is about whether libraries are motivated by the need for long term access to print, which is little used following its digitisation, or whether they are motivated by space saving and cost avoidance, and what the implications are of that answer.

What Schonfeld and one of his colleagues have done is to come up with a framework for thinking about preservation needs which could be applied to both models. This focussed on identifying what the preservation needs actually are, including:

  • identifying how many copies of print materials are needed following the preservation (which has both access and preservation implications)
  • finding out about the current community preservation activity – identifying whether there is a surplus or a gap.

They also identified some questions about the quality and characteristics of the digital version which they recommend be considered before the print copies can be confidently withdrawn. Part of this framework was also the recommendation that print material be retained for 20 years if all of the above conditions were satisfied, after which there could be a reassessment of the continuing need to preserve print.

They have used this framework to develop a proof of concept tool to help libraries to make retention and withdrawal decisions, with preservation at the core. This is not designed to tell libraries when to throw a copy out, but merely to tell them whether their copy is actionable, i.e. whether its preservation needs are met elsewhere and they can be confident of this when considering what to do with their own copy.
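To make the shape of such a tool concrete, here is a minimal sketch in Python of an “is this copy actionable?” check. It is not ITHAKA’s actual tool; the field names, the target copy count and the retention threshold are illustrative assumptions drawn from the framework described above.

```python
from dataclasses import dataclass

@dataclass
class TitleStatus:
    """Community preservation status of one journal title (illustrative fields)."""
    copies_preserved_elsewhere: int   # validated print copies held by other repositories
    copies_required: int              # number of copies the community wants retained
    digital_version_verified: bool    # digital surrogate checked for completeness/quality
    years_retention_committed: int    # longest retention commitment among holding repositories

def copy_is_actionable(status: TitleStatus, min_retention_years: int = 20) -> bool:
    """Return True if a library could confidently act on its own copy,
    because the title's preservation needs are already met elsewhere."""
    enough_copies = status.copies_preserved_elsewhere >= status.copies_required
    long_enough = status.years_retention_committed >= min_retention_years
    return enough_copies and status.digital_version_verified and long_enough

# Example: three validated copies elsewhere, digital version checked, 25-year commitment.
print(copy_is_actionable(TitleStatus(3, 2, True, 25)))  # True
```

A library would run a check like this per title against shared community data; a True result only says the preservation case is covered elsewhere, not that the copy should be discarded.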

Going forward, they want to refine this framework and expand the tool. Their key idea is to create a decision-making framework that can operate in a decentralised environment. There does not need to be a brain in the centre of the collaboration to ensure that preservation takes place. Schonfeld’s belief is that, whatever the nodes in the preservation system may be, libraries would ultimately not want to draw down holdings beyond a very minimum preservation threshold. A central model to ensure this would be difficult to create and fund, so his model is driven by information sharing as the most sustainable option for the future.

Digital Content and Institutional Planning: Interview with Meg Bellinger

Providing a US perspective on Digital Content and Institutional Planning, Meg Bellinger, Director of the Office of Digital Assets and Infrastructure, Yale University, gave a presentation entitled: “Yale Digital Commons”. In this video interview she explains what that work has entailed and the issues that they have faced.

If you are unable to see this video, please click here.

Session Summary: Shared Services

Chris Cobb, Pro Vice Chancellor of Roehampton University, began by defining “shared services”, which he described as a seductive, altruistic term, which belies what others would call outsourcing. He highlighted that it can mean centralisation within one organisation, sharing between institutions or outsourcing.

He outlined the origins of the term in the UK following the Gershon Review (2004/2005), which was followed by a Cabinet Office Transformational Government technology review in 2005 – one of the results of which was the NHS shared data system linking 150 health care trusts, which is now fully operational and saving the government £225 million.

Following on from this, HEFCE (the Higher Education Funding Council for England) commissioned KPMG to write a report into shared services in HE, which identified the barriers that still exist. Cobb highlighted some of these barriers, including the additional costs of shared services, which are subject to irrecoverable VAT. However, the more overarching barrier was the need for HEIs to maintain their independence: given that they are in competition with one another, any move away from academic or operational independence, or towards shared student services, would be viewed as a threat to the individual institution’s competitive advantage.

The KPMG report did identify some very good examples of shared services that already work well. UCAS, JANET and the Student Loans Company are just some examples of services which are shared across the sector. There were also examples of institutions supporting each other on a smaller scale. Some of these came out of economies of scale and some provide critical mass, bringing additional skills and security to the service.

HEFCE has also been sponsoring feasibility pilots and examining different institutional experiences, but many of these have not yet really got off the ground due to various factors. Despite goodwill at a high level, the devil often proved to be in the detail of the business plans, and many of these service-sharing opportunities have not flowed through as was hoped.

Cobb went on to cite other work in this area, conducted by the University of Liverpool and Liverpool John Moores University. These two institutions commissioned a study looking for areas of commonality which could be exploited with shared services. However, the results identified only a few areas with strong synergies, generally in finance and HR.

Cobb then moved on to libraries and library management systems, where work is being conducted to examine the potential of modularising and disaggregating library management systems into different component parts which can then be put into the cloud, alongside developing standards and shared licensing agreements with publishers.

JISC has been supportive of shared services through two different routes. Cobb described how they have commissioned reports of their own which go further than the KPMG report. These were conducted by Duke & Jordan, who looked at the appetite and barriers for shared services in ICT in universities. Amongst their key findings, they identified the dominance of a small number of suppliers, many bought over a decade ago in response to the Millennium bug, and stagnation in the market. Again, JISC’s approach has been to look at disaggregating the various elements of these services and creating modules using cloud computing to create a different architecture. They recognise the need for commercial suppliers, and this approach is designed to push the suppliers in a different direction and avoid complacency.

When examining the drivers for using shared services, Duke & Jordan found that the continuity and resilience argument was the most compelling from an ICT perspective. The efficiency arguments were also significant, but Cobb found it particularly interesting that the critical mass argument was stronger than the economies of scale argument.

JISC has also been sponsoring the Flexible Service Delivery Programme, which comprises a number of pilot projects; the core goal behind all of them is the disaggregation of systems, sourced through suppliers or open source.

In conclusion, Cobb described the current situation as disappointing, but noted that the fiscal crisis is making people more receptive to the idea of outsourcing and shared services, so there are lots of opportunities. The new government is pushing for efficiency and has provided a £20m stimulus package to fund shared data centres and e-procurement and to extend funding for the Flexible Service Delivery Programme, so there will be further work in this area.

Digital Content and Institutional Planning: Interview with Jessica Gardner

Jessica Gardner, Assistant Director of Library and Research Support, University of Exeter, describes her presentation and the questions it raised from the audience in this short video interview immediately after her talk.

If you are unable to see this video, please click here.

Session Summary: First, Capture Your Data

Peter Ainsworth of the University of Sheffield opened by describing what he views as some of the most exciting work in the Humanities in the UK at the moment, which is to do with the linking together of humanities and scientific disciplines.

He went on to introduce the Digging into Image Data project by showing us some of the data. The project brought together 10 books produced within the same square mile of Paris – copies of the Chronicles of Jean Froissart by two artists and two sets of scribes – which are now distributed across the world in different collections. To make these books accessible for study, the project has created digital versions of each text, constituting about 10 terabytes of data, so the texts are no longer locked in vaults but freely available in the cloud. Ainsworth demonstrated that the quality of the images allows you to zoom in extremely closely, compare manuscripts and measure details very precisely.

One of the major problems they faced in this enterprise was convincing the libraries and collectors who hold the original copies of the benefits of producing these electronic editions. The owners and curators have different attitudes to open access, with some more than happy for the materials to be made openly available as part of preserving them and bringing them to new audiences, whilst others require large payments in return for any use. His main challenge, therefore, was to make the technology do the talking to convince the libraries of the huge value of this sort of comparative work.

The Digging into Data challenge helped to connect this project with others working with similar types of data, including the University of Illinois at Urbana-Champaign, Michigan State University and the Alliance for American Quilts, North Carolina. These institutions are also developing technologies to study historical maps and quilts – both of which involve analysing large amounts of image data. This joint project looks at the computational scalability of adaptive image analyses and how this can help to identify the authors of the works within their collections of maps, quilts and manuscripts.

Ainsworth explained how the mixture of specialist academics and technicians enables them to look at the images in a variety of different ways. He also discussed their methodology, including how they ensure reliability and comparability in terms of the quality of their data sets – including photographing by hand and using the same metadata, rather than digitising automatically.

Despite differences in the types of data, the research methodology across the three collections is the same. They hope that the final output of the work will include data about the salient characteristics of an artist with respect to other artists, and software for extracting those salient characteristics which could be applied to identifying fakes. This includes the challenge of identifying whether there were many people or just one individual behind the “master” titles used by art historians. Ainsworth showed us a series of image features from various illustrations to explain how certain image elements can be compared and analysed to these ends.

Ainsworth’s own work at the University of Sheffield involves looking more at the scribes than the artists of the manuscripts, attempting to get around the more subjective methods that have been used to distinguish between different scribes. To conclude, he took us through the technicalities with a series of slides demonstrating how they are using vectors to analyse the handwriting in more objective, scientific and unprecedented ways – using e-science to conduct analyses that humanists have not been able to do before, because they did not have the tools.
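As a rough illustration of the kind of comparison this enables, the sketch below compares hypothetical per-page handwriting feature vectors using cosine similarity. The features and values are invented for illustration; the project’s actual descriptors and analysis pipeline are not specified here.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two feature vectors (1.0 = identical direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical per-page feature vectors (e.g. stroke-width, slant, letter-spacing statistics).
page_from_manuscript_a = np.array([0.82, 0.31, 0.55, 0.12])
page_from_manuscript_b = np.array([0.80, 0.29, 0.57, 0.15])
page_from_manuscript_c = np.array([0.20, 0.75, 0.10, 0.60])

print(cosine_similarity(page_from_manuscript_a, page_from_manuscript_b))  # high: plausibly the same hand
print(cosine_similarity(page_from_manuscript_a, page_from_manuscript_c))  # low: plausibly different hands
```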

Mobile Devices: Interview with Joan Lippincott

Joan Lippincott, Associate Executive Director of CNI, gave a presentation in the session addressing Mobile Devices entitled: “An Integrated Strategy for Mobile Devices”. In this video interview she provides us with an overview of her presentation and the questions it generated.

If you are unable to see this video, please click here.

Session Summary: Big Data and the Reinvention of the Humanities

Greg Crane, Professor and Chair of Classics, Tufts University, began his presentation by telling us that Big Data is both a curse and part of the solution for the reinvention of the Humanities. He outlined some of the current projects examining data in the humanities, including Corpus and Computational Linguistics, Cyberinfrastructure for Historical Languages, Mining A Million Books and Digging into Data.

To give us an idea of scale, particularly the size of the Mining A Million Books project, Crane reminded us that if we read a book a day we could get through 36,000 books in a lifetime, so a million books represents a huge amount of data. However, Google has digitised around 13 million books, so a million books is really a starting point.

Big data on this scale requires automated analysis of everything in order to hunt through what Crane described as a “vast primal soup” of information. You need text mining, visualisation and various other technologies, but you also need targeted human analysis. In the past some have felt there is a tension between traditional humanistic analysis – involving close reading and careful thought – and automated methods. Crane argued that there should not be this contrast, as the automated methods should be used to help identify the data points that you can then look at and think about, so they become a tool to aid traditional analysis methods. There are some analytical tasks that machines do not do very well and that therefore need a human to assess, including categorisation and classification tasks, so you need people to come in and adjudicate. Crane also noted the need for research into, and publication of, the results of any automated analysis.
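A minimal sketch of this division of labour between machine and human might look like the following, where automated classifications below a confidence threshold are routed to a scholar for adjudication. The threshold, item identifiers and labels are illustrative assumptions, not details from the talk.

```python
def triage_for_review(predictions, confidence_threshold=0.9):
    """Split automated classifications into accepted labels and items needing human adjudication.

    `predictions` is a list of (item_id, label, confidence) tuples from any classifier;
    the threshold is an illustrative choice, not a recommendation from the presentation.
    """
    accepted, needs_human_review = [], []
    for item_id, label, confidence in predictions:
        if confidence >= confidence_threshold:
            accepted.append((item_id, label))
        else:
            needs_human_review.append((item_id, label, confidence))
    return accepted, needs_human_review

# Example: two confident machine labels are kept, one borderline case goes to a scholar.
accepted, review = triage_for_review([
    ("doc-001", "epic poetry", 0.97),
    ("doc-002", "legal text", 0.95),
    ("doc-003", "epic poetry", 0.62),
])
print(accepted)  # [('doc-001', 'epic poetry'), ('doc-002', 'legal text')]
print(review)    # [('doc-003', 'epic poetry', 0.62)]
```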

Crane focused in on the main issue that this connects to, for him, by quoting from an article which observed that the number of liberal arts students at Harvard has declined by 27% in favour of science and engineering subjects in the last five years. He suggested that this is partly because engineering gives practical knowledge that students feel they can actually use, but also because it makes them partners in the learning. Engineering undergraduates are very much involved in serious research, whereas this is a totally alien mindset in the humanities. Crane feels that undergraduates need to be engaged in serious research, particularly at a time when there is so much to be done. Digital technology is enabling new research by humanities undergraduates through the re-emergence of editing as a binary activity and the commented edition and translation as a basis for an undergraduate thesis. Given the sheer amounts of material that remain untranslated and unworked on, this represents really useful work that could be undertaken by undergraduates and published.

To illustrate the changes in scale that have come about as a result of digital technologies, Crane used the example of Latin. The universe of accessible Latin in the pre-digital, print-dominated world was really quite small, as large amounts of material were inaccessible, stored away in print form. The onset of digital has allowed the extraction of text labelled as Latin from 12,000 books so far, expanding the workable amount of Latin from fewer than 10 million words accessible for study to at least 1 billion. There are a further 15,000 books yet to extract. There are not enough people to analyse and categorise this without the work of undergraduates. Crane emphasised again that doing practical work strikes a deep chord with undergraduates.

Crane then moved on to discuss machine-actionable interpretation, looking at competing analyses of sentences which can then be compared in the context of other data to help assess which interpretation is more likely. He explained how these interpretations are presented in a diagrammatic, more mathematical format which can be analysed and stored in treebanks. Crane views these treebanks as the most important development in the study of historical languages for 150 years. He emphasised that 4,000 years of historical linguistic data are waiting to be analysed, and that we don’t yet know what to do with them, as Humanities researchers do not currently think in terms of such large amounts of data.
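To show roughly what a treebank entry looks like, here is a small illustrative example: each word in a sentence is stored with the word it depends on and the grammatical relation, so competing analyses can be kept side by side and compared programmatically. The sentence, annotation scheme and agreement measure are simplified assumptions, not drawn from an actual treebank.

```python
# A minimal dependency-treebank record: each token stores its head (0 = root)
# and the syntactic relation to that head. Annotations here are illustrative only.
sentence = [
    # (index, form,    lemma,    head, relation)
    (1, "Gallia", "Gallia", 4, "subject"),
    (2, "est",    "sum",    4, "copula"),
    (3, "omnis",  "omnis",  1, "attribute"),
    (4, "divisa", "divido", 0, "root"),
]

def children_of(tokens, head_index):
    """Return the tokens whose head is the given index."""
    return [t for t in tokens if t[3] == head_index]

def agreement(analysis_a, analysis_b):
    """Fraction of tokens whose (head, relation) assignments match between two analyses."""
    shared = sum(1 for a, b in zip(analysis_a, analysis_b) if a[3:] == b[3:])
    return shared / len(analysis_a)

print(children_of(sentence, 4))        # tokens attached directly to the root
print(agreement(sentence, sentence))   # 1.0 for identical analyses
```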

Crane highlighted language as the biggest barrier to studying this huge amount of data that developments in technology are now helping to break down. In the past you needed to read Chinese well to be able to study a Chinese document, but now a little knowledge of the language, a computer translation tool and a dictionary can enable you to make much more progress and work with more linguistic material than you could before.

To conclude, Crane discussed what the Humanities need. He said they need data – they have a lot of unstructured data available with no metadata. They need open content – historical sources as shared data. Crane asserted that there is very little published historical data, if you consider publication to be something you can annotate and analyse. If something is made available in print under subscription, where you cannot re-use and interact with the data, Crane does not consider it to be published: it is an archival object, not part of the public sphere, and does not enable global participation. They need to extract scientific corpora from vast, lightly structured collections – this is what the Digging into Data project is about. They need new intellectual configurations and new humanists – including computer scientists, computational linguists, cognitive scientists and humanists with deep, cross-cultural training including in new tools and technologies.

Innovation in Learning and Teaching: Interview with Jackie Carter

Jackie Carter, Senior Manager of Learning, Teaching and Social Science Data Services, Mimas, University of Manchester, gave a UK perspective in the session on Innovation in Learning and Teaching with her presentation: “Benefits of Using Real World Data”. In this video interview, she gives us a summary of her talk and the discussion that took place around her themes.

If you are unable to see this video, please click here.

Session Summary: Cloud Computing and Advanced Networking – Developments in HE

Greg Jackson, Vice President for Policy and Analysis at EDUCAUSE, began his presentation by providing some background to EDUCAUSE, which is the largest and oldest group examining the use of information technology to enhance higher education and one of the sponsors of CNI. They do a lot of work for the profession, including professional development and conferences; actively look at teaching and learning and how to integrate the use of technology in those areas; support a Centre for Applied Research, which applies research from other contexts; and examine policies to make sure that they don’t accidentally have a negative effect on education.

Jackson introduced an image of clouds, observing that we often think of clouds at dawn, bringing in a new day and a lot of excitement, but that clouds can also come at the end of the day, indicating that things are about to become very dark. Jackson used this analogy to illustrate that we are at a point where cloud infrastructure is caught between these two states. There is great enthusiasm for what is possible, but also great worry about the complexity of the problems involved. The second problem he identified was that very often what we do and what we say we need to do in order to achieve things can be very different.

He questioned why we are suddenly having conversations where we take it as a given that cloud services are something we should be thinking about, whereas five years ago people were not having that conversation. One of the reasons he identified was the growth of networking capability to the point where cloud services become practical. Jackson described how global networking has advanced so that it is now very difficult to find places in the world where you cannot get the bandwidth you need. However, this is only useful if you have national infrastructure, so he also described the evolution of optical systems in the US, where a national backbone network is being created that connects to regional and local networks. He noted that several other countries have been ahead of the US in this respect, and that this trend is certainly being replicated elsewhere in the world. High speed networking is therefore moving rapidly towards the point of reaching individual homes, not just institutions, which means that we can stop worrying about where something is and start thinking about what it is.

He then gave some background to the evolution of commercial cloud services built on excess capacity, including players like Microsoft Cloud Services, Google Code and Amazon Web Services. Despite reservations that these services do not do what an institution would do if it were developing them itself, to meet the detailed list of very specialised requirements its users say they need when asked, Jackson observed that, empirically, once people are offered something reasonably good but free, they will go for it. These commercial cases have shown higher education service developers that a low pricing point will encourage people to adopt.

The development of networking capabilities and the evolution of commercial cloud services, together with intellectual authorities (including Pew) pronouncing that the future is in the cloud, acted as Jackson’s prologue, bringing us to the observation that we are at a moment when things are changing not just because cloud computing is a buzz term, but because it represents a fundamental change in the way we are thinking.

Next, Jackson highlighted some of the challenges for cloud computing within HE. He used a video clip of Jack Nicholson in Five Easy Pieces, trying to circumvent a waitress when he wanted to order something that was not on the menu. This illustrated that what is being offered and what one wants usually don’t match. We need to aggregate demand and work together to procure from the vendor community. He broke this down into categories of activities that you might expect institutions to club together to do, and highlighted the different possible approaches to aggregating their needs to move forward, depending on the situation. These included creating collective entities from groups of institutions, using brokered procurement models and developing umbrella terms to negotiate better deals with vendors. He noted that there are already success stories, particularly collectives purchasing networking capacity, but there are still areas where institutions could work together more effectively with respect to cloud services.

Another challenge related to authentication and authorisation, which can represent particular problems for cloud services. It is difficult to know that a person or corporation exists when providing cloud services, and if they exist, it is difficult to know that the person you are dealing with is the person you think you are dealing with. There are also more subtle issues that come into play, particularly when the same person can be identified in different ways, or there can be confusion with identities between individuals with similar names within institutions.

Jackson also cited privacy and confidentiality as areas of concern for institutions, with fears of the technology “letting things out”. Cloud technologies can make it easier to get to data, but there is a worry that if we put things in a space that we don’t control, the security of the information could be compromised. He noted that profiling by services like Google (for advertising purposes) can lead to breaches of confidentiality, because when we start sharing more information about ourselves, we start sharing things we don’t mean to share.

Finally, physical location is a challenge for cloud services, as you don’t necessarily know where the information is being held. If it is outside of the country, this can present particular legal problems for certain types of data.

Jackson then moved on to discuss some of the possible solutions that are emerging, including work by EDUCAUSE to examine these areas. He discussed the idea of federated identity, whereby the parent organisation issues credentials that we can trust, helping to solve the authentication problem. However, there are issues in establishing standards for verification and a mechanism for issuing the required information and no more. This is moving forward quickly, but the question is: who issues the first credential for the individual user? The college? Jackson observed that by the time most people get to college they already have an online identity – usually a Google ID, except that Google does not verify identity.
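As a rough sketch of the federated idea (not Shibboleth, SAML or any specific product), the example below has a home institution sign a small assertion containing only the attributes a cloud service needs, and the service accepts it only if the issuer is on its trust list. The keys, issuer name and attributes are invented for illustration.

```python
import hmac, hashlib, json

# Illustrative trust list: the cloud service knows which home institutions it trusts.
TRUSTED_ISSUERS = {"university.example.edu": b"shared-secret-known-to-the-service"}

def issue_assertion(issuer: str, key: bytes, attributes: dict) -> dict:
    """Home institution releases only the attributes the service needs, and signs them."""
    payload = json.dumps({"issuer": issuer, "attributes": attributes}, sort_keys=True)
    signature = hmac.new(key, payload.encode(), hashlib.sha256).hexdigest()
    return {"payload": payload, "signature": signature}

def verify_assertion(assertion: dict):
    """Cloud service accepts the assertion only if it comes from a trusted issuer."""
    payload = json.loads(assertion["payload"])
    key = TRUSTED_ISSUERS.get(payload["issuer"])
    if key is None:
        return None
    expected = hmac.new(key, assertion["payload"].encode(), hashlib.sha256).hexdigest()
    return payload["attributes"] if hmac.compare_digest(expected, assertion["signature"]) else None

token = issue_assertion("university.example.edu",
                        TRUSTED_ISSUERS["university.example.edu"],
                        {"affiliation": "student"})   # only what is needed, no more
print(verify_assertion(token))  # {'affiliation': 'student'}
```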

He concluded by discussing attitudes to the risks presented by cloud networking. The frequent question is: “Isn’t it riskier to go to the cloud rather than do things myself?” Jackson argued that there is simply a change in the risk portfolio of what you do. We need a mechanism to manage the risk, and we need to think about how we talk about cloud computing so it remains the cloud of the optimistic dawn, not the sunset.