JISC/CNI Meeting Reports

This blog represents a record of the recent JISC/CNI Meeting 2010 Managing Data in Difficult Times: policies, strategies, technologies and infrastructure to manage research and teaching data in a fast changing technological and economic environment.

To read summaries or to hear interviews with the speakers, please select a session from the list below:

Opening Plenary: The Data Conservancy and the US NSF DataNet Initiative
Sayeed Choudhury, Johns Hopkins University
[interview] [session summary]

Cloud Computing and Advanced Networking: Developments in HE
Greg Jackson, EDUCAUSE
[session summary]

Undergraduate Learning, Small Colleges and Digital Gaming: Collaboration and the State of Play
Bryan Alexander, NITLE

Innovation in Learning and Teaching: Benefits of Using Real World Data
Jackie Carter, Mimas, University of Manchester

An Integrated Strategy for Mobile Devices
Joan Lippincott, CNI

Big Data and the Reinvention of the Humanities
Greg Crane, Tufts University
[session summary]

First Capture Your Data
Peter Ainsworth, University of Sheffield
[session summary]

Shared Services
Chris Cobb, Roehampton University
[session summary]

Preserving Library Collections
Roger Schonfeld, ITHAKA
[session summary]

Yale Digital Commons
Meg Bellinger, Yale University

Enhanced Learning: Institutional Strategies for Digitisation and Education
Jessica Gardner, University of Exeter

Open Data Policies
Heather Joseph, SPARC
[session summary]

Open Data Policies
Liz Lyon, UKOLN
[session summary]

Special Collections Transformed by Technology
Cliff Lynch, CNI

Moving Forward with Digital Preservation at the Library of Congress
Laura Campbell, NDIIPP, Library of Congress
[session summary]

Sustainable Digital Preservation and Access
Chris Rusbridge, Chris Rusbridge Consulting
[session summary]

Repositories and Cloud Services for Data Infrastructure
Sandy Payette, DuraSpace

Repositories Update UK
Peter Burnhill, EDINA

Memento: Versioning Resources to Support HTTP-Based Discovery
Herbert Van de Sompel, Los Alamos National Laboratory
[session summary]

Supporting Technical Innovation in the UK: Repositories UK
Paul Walk, UKOLN
[session summary]

Closing Plenary: The International e-Science Movement: Status and Future
Professor Dan Atkins, University of Michigan
[session summary]

The speakers’ slides can be found at the event home page here.

Session Summary: The International e-Science Movement – Status and Future

The closing plenary was presented by Dan Atkins, W.K. Kellogg Professor for Community Information and Associate Vice-President for Research Cyberinfrastructure, University of Michigan. His topic was the status of the e-science movement in both the UK and the US. In introducing this theme, he explained that there needs to be a matrix of relationships, approaches and sets of goals, so he would be weaving between three different perspectives: the funders, researchers and universities.

Atkins observed that e-scientists are starting to address computational discovery and computational thinking for all at various service levels, which includes modelling, simulation, prediction and extracting knowledge from data of all kinds. He felt that the UK has been ahead in this area. However, he was keen to pick up on a comment from one of the earlier speakers, who observed that “no good crisis should go unused”. Infrastructure is generally taken for granted except in times of crisis, and currently many US institutions are having to look very hard at the amount of money they are spending on cyberinfrastructure. Some are just dictating cuts, whilst others are not only looking for efficiencies but are also trying to look at the mission effectiveness of ICT, consolidation of demand and sourcing of services. He observed that the ICT-enabled collaborations that are now possible will only work if the end user is engaged, and people are increasingly looking for this outcome. He emphasised that the ultimate goal is no longer just high performance computing, but also high performance collaboration.

To focus in on this, Atkins quoted the UK definition of e-science by Sir John Taylor, which highlights collaboration and information utility. He noted that this contrasts with the evolution of the e-science programme in the US.

Atkins continued by discussing the experience of chairing an NSF Blue Ribbon advisory panel, which recommended that the definition should be approached in a more agnostic way with respect to technology and should not be so tightly coupled to the notion of the grid. He assured us that the e-science movement in the UK had been very influential in the US and was one of the factors which led to the call for the Blue Ribbon review that he chaired. Their resulting report led the NSF to create a new office and a reflective report that embraced the recommendations that they made.

He went on to discuss the resonance between the push of technological capabilities and the pull of scientific need, observing that e-science, or science enabled by cyberinfrastructure and ICT, is increasingly essential for meeting the grand challenges of 21st century science. This is because of the inherent complexity and the multi-scale, multi-science nature of today's frontiers, the increased scale and value of data, and the demand for semantic federation, active curation and long-term preservation of access. In the US, and maybe less so in the UK, these investments are critical to education, so the same investments could create opportunities for students and teachers to engage in more high quality, authentic, passion-building science.

One of the results of the report was the reflective report from the NSF called Cyberinfrastructures for the 21st Century. This was, for the time, quite revolutionary as it had more in it than high performance computing. The report called for a multi-component approach that involves high performance computing, but also recognised the need for cyberinfrastructure as an object of study and a mechanism for enabling learning.

Atkins went on to strongly recommend the book The Fourth Paradigm: Data-Intensive Scientific Discovery. This points out that we have had science for thousands of years based on empirical observation, we recently added a theoretical branch, then a computational branch, then latterly a data exploration branch, which is the fourth paradigm.

Atkins observed that increasingly people are understanding that the e-science movement is real and growing and critical. It is now being recognised as relevant to the humanities and arts, and is being coupled with e-learning – not just putting textbooks up, but taking advantage of the whole participatory element. The term e-development has also arisen. Atkins emphasised that the whole future of the research university will be influenced by this movement, and they should pay attention to the e-science movement because it is effectively a pilot of the techniques which will be relevant to the whole of a research university.

He then took us through some of the findings and recommendations of the UK e-science review, which included:

  • The feeling that the UK is already in a world-leading position
  • That the investments so far have already empowered significant contributions in the UK and beyond
  • That the UK is at a crossroads and must decide whether to create the necessary combination of financial, organisational and policy commitments to capitalise on its prior investments
  • That e-science is an organic, emerging process, requiring ongoing, co-ordinated investments

The key recommendation was that the UK should continue to nurture a robust infrastructure.

Atkins took us through the many strengths of the current programme and what it had already achieved, but also highlighted the weaknesses, which included the too rapid and early reduction of core funding and high level leadership. This led to a sense of abandonment. He also noted that there is a lack of applicable models for sustaining and maintaining software/infrastructure operations – often due to funding restrictions. He used Moore's revised technology adoption life cycle diagram to identify that there is a chasm between the momentum of the early adopters and the early majority, left by the withdrawal of funding support.

Atkins moved on to discuss what universities are doing about cyberinfrastructure, citing the recommendations outlined in a report entitled: Research Cyberinfrastructure Strategy for the CIC: Advice to the Provosts from the Chief Operating Officers. Based on this, he observed that universities should all currently be investing in:

  • preparing for federated identity management
  • maintaining state of the art communication networks
  • providing institutional stewardship of research data
  • consolidating computing resources, whilst maintaining diverse architectures

To illustrate this, he discussed the activities of the University of Michigan, which include developing an IT visioning, planning and governance model. This features mission stewards and domain stewards who take responsibility for understanding the needs of the institution, together with faculty and student representation. He observed that there are often fundamental issues of governance and who decides about cyberinfrastructure, and many institutions do not have the kind of model that clearly identifies roles and jurisdictions.

Atkins then introduced the University of Michigan’s CIRRUS project, which focusses on exploring the use of pod optimised performance data centres, which have much better energy efficiencies. The current situation in research computing involves facilities over the whole university, and researchers who have their own machines. This project is trying to migrate more to their FLUX machine and cloud capabilities. Part of the CIRRUS project is to understand the options and which options fit best from a research and a finance position.

Finally, Atkins discussed the issues facing the NSF and the main challenges that they are currently working on. Key to this was an understanding of the need to approach all cyberinfrastructure as part of an ecosystem rather than a set of several entities, which linked in neatly with the main themes of the conference as a whole and helped set the agenda for the future work of the NSF in the e-science arena.

Further details and references from Professor Atkins’ talk can be found in his slides.

Session Summary: Supporting Technical Innovation in the UK – Repositories UK

Paul Walk, Deputy Director of UKOLN, used his presentation to discuss Repositories UK, a project that is really about innovation support. UKOLN is now one of two JISC-funded Innovation Centres, a role that is still being worked out together with JISC. However, close to Walk’s heart is their work to support the developer community, so he would be discussing the project from this angle.

Walk outlined some of the lessons they had learnt from the history of the project, including the lesson that aggregation, as an activity, has general potential value and that a search service is only one realisation of that value. They were very keen that no particular service should dictate what should be done with this, as they became more aware that they were dealing with something that could become an infrastructure component.

Walk then took us through a diagram which described the pattern he had seen repeated in service development based on public funding over the years. This showed a collection of data in the centre which is of broad interest and general usefulness. In order to respond to this, the funders would want to see some kind of end-user interface to interact with that data. This would be the main focus of the funding, but there would also be the promise that the developers would put standard machine-based APIs on to that data so that it could potentially be used in the future for other purposes. Walk observed that you would often see this type of project pattern repeated by other funded projects in the JISC information environment. However, the APIs are there only notionally: they are not supported and are often either unused or, even worse, useless. He referred to this as a service anti-pattern: a design pattern for solving problems that is generalised, plausible and attractive, but has some kind of hidden problem.

In assessing why this pattern reoccurs, Walk noted that it is often caused by funding, which follows happy users. Trying to get something funded that does not have a clear end user, but would rather enable further, as yet not fully understood development that may in the future help end users is a really hard sell. Funders also want something they can showcase, whilst infrastructure is near enough invisible – particularly if it is good infrastructure – and it is hard to ascertain direct impact from infrastructure for users. This leads to a clear motivation to develop a user-facing service and not really develop the backend.

Walk went on to suggest a slightly better pattern, which involves using the API to build the user-facing application, so there is less uncertainty about the status of the API. This means that the API is no longer tacked on, which keeps them honest and means that they know that it works. The result is that there is more likelihood that another developer will build on the API to reach another end user.
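A minimal sketch of the pattern Walk suggests, assuming a small in-memory record store: the user-facing view calls only the public API function, so the API is exercised every time a page is rendered. The names `search_records` and `render_results_page`, and the record fields, are hypothetical and chosen purely for illustration; they are not taken from RepositoriesUK.

```python
from typing import Dict, List

# Hypothetical cached metadata records; field names are assumptions for illustration.
RECORDS: List[Dict[str, str]] = [
    {"id": "r1", "title": "Digital preservation costs", "institution": "Bath"},
    {"id": "r2", "title": "Grid computing in HE", "institution": "Edinburgh"},
    {"id": "r3", "title": "Preservation of research data", "institution": "Bath"},
]


def search_records(query: str) -> List[Dict[str, str]]:
    """Public API: return records whose title contains the query (case-insensitive)."""
    needle = query.lower()
    return [r for r in RECORDS if needle in r["title"].lower()]


def render_results_page(query: str) -> str:
    """User-facing view: built only on search_records, never on RECORDS directly."""
    hits = search_records(query)
    lines = [f"{len(hits)} result(s) for '{query}':"]
    lines += [f"- {r['title']} ({r['institution']})" for r in hits]
    return "\n".join(lines)
```

Because the view has no other route to the data, a broken or incomplete API shows up immediately in the end-user service, which is the "keeps them honest" effect described above.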

Walk then provided a description of RepositoriesUK, which is a managed aggregation of repository metadata from UK HE institutions. They deliberately do not normalise the records, they just provide a cache, focussed on academic papers, from which others can then normalise the data – so they are effectively solving a network latency problem. The goal is to support innovation, and in order to develop using the pattern Walk described above, they needed a use-case, so they aim to develop some business intelligence, which in this case is a series of visualisations of that metadata to show patterns of research represented in the repositories across the UK, to help an end-user to do business analysis. The final aim is to develop an infrastructure component for services.

They introduced some design principles, which included a tiered service model, where they build a core service, then allow other people to work with them to develop further services, allowing that infrastructure to emerge and enable others. He hopes that they can develop patterns that they can then generalise and feed back into the infrastructure.

Walk then talked us through a diagram of how RepUK works (see picture below) and some of the technicalities, including how the data can be cleaned up for various uses, and what the licensing restrictions are for certain uses (despite taking data from open access repositories).

Their current progress includes 0.75 million records, with 6 projects consuming this data. They are also getting a lot of love from developers. However, they have identified issues with linking – do they want people to link to the records they are holding in the cache? Would they be undermining the source repositories if RepUK was really successful with its SEO? Walk identified these as areas for further discussion. He also identified state management as the real challenge, with maintaining the state of the records and federating one aggregation to the next, and the next, being really difficult. They have not yet tried to solve this, but have flagged it up as an issue, and Walk emphasised that he would be interested in talking with anyone with a view in this area.

Finally, he reported on the new lessons they have learnt, including the lesson that developers need infrastructure too. He quoted Tom Coates, who said “you need to develop for users, machines and developers”, which is what they are trying to do with RepUK. Walk concluded by emphasising that there needs to be a leap of faith and a focus on creating the right environmental conditions for things to evolve.

UK Repositories Update: Interview with Peter Burnhill

The Director of EDINA National Data Centre, Peter Burnhill, provided a UK repositories update in his presentation to the conference. Although the impending lunch break prevented an extensive question and answer session afterwards, Peter kindly gives us a summary of his talk in this video interview.

If you cannot see this video, please click here.

Session Summary: Memento – Versioning Resources to Support HTTP-based Discovery

Herbert Van de Sompel, Staff Scientist at Los Alamos National Laboratory, gave us an enthusiastic introduction to Memento, which is an NDIIPP funded project examining the concept of time travel for the web. The aim of Memento is to make it easier to navigate the web of the past. There are web resources that have representations that change over time, but fortunately there are archive versions of prior versions of those pages. However, the archive version will be at a different URL to the original content. In Wikipedia there is a general URI for the current version of the page which persists, then a different URI for each of the archive versions. That these records exist is a great source of optimism for Van de Sompel, but he observed that finding and navigating these resources can be very difficult and is a matter of search.

He described how to find the archive pages for CNN.com and Wikipedia, which you then have to browse. There could be huge numbers of entries to navigate to get to what you want. Once you have found the page you require, navigating is more difficult again. He illustrated this by looking at an archive page from Wikipedia, where the links go to the current version of the page, rather than the contemporary version, so you are not really navigating in time. There are also things missing, so your navigation experience of the past is not complete. Archives often re-write the links, but resources that are archived outside will not necessarily be seen correctly.

Van de Sompel explained how Memento is a framework which leverages the original URI of a resource to do a search in time. To get a prior version of a resource you currently have to use a different URI, but Memento wants to change that by introducing the possibility of going to the current resource, but asking for it in a previous form.

He then got technical by explaining how HTTP “get” works and the preferences that this expresses using headers to help work out if you want to access a page at a particular compression, in a certain language and so on. The Memento team want to introduce a new preference to the current list: datetime. This will introduce content negotiation using time, leveraging an upcoming standard for HTTP links. They have created a browser plugin that works with Wikipedia which uses a time slider to navigate seamlessly. This is currently available for download at the project website. They feel that the idea has a real chance of wide adoption with some real traction.
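As an illustration of the datetime preference described above, the sketch below builds (but does not send) an HTTP GET request that carries a datetime alongside the usual negotiation headers. The header name `Accept-Datetime` is the one defined in the later Memento specification (RFC 7089); the helper name `memento_request` and the example URI are assumptions made for illustration only.

```python
from datetime import datetime, timezone
from email.utils import format_datetime
from urllib.request import Request


def memento_request(uri: str, when: datetime) -> Request:
    """Build a GET request asking for `uri` as it existed around `when`."""
    # HTTP dates are expressed in GMT, e.g. "Thu, 01 Apr 2010 12:00:00 GMT".
    http_date = format_datetime(when.astimezone(timezone.utc), usegmt=True)
    headers = {
        "Accept-Language": "en",       # existing preference: language
        "Accept-Encoding": "gzip",     # existing preference: compression
        "Accept-Datetime": http_date,  # preference in the time dimension
    }
    return Request(uri, headers=headers, method="GET")


req = memento_request(
    "http://example.org/page",  # placeholder URI
    datetime(2010, 4, 1, 12, 0, tzinfo=timezone.utc),
)
```

The point of the design is that the client keeps using the original URI and only adds one header; a server or archive that understands the preference can then respond with, or redirect to, the version closest to the requested time.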

He went on to talk about versioning, which is particularly significant in research. He discussed time-generic resources, which deliver a current representation when accessed. There are then time-specific resources which are effectively snapshots of a current state at a particular point in time. When this is coupled with Memento, this becomes a powerful way to navigate purely using HTTP across versions of resources. He took us through a practical example which uses pictures of Van de Sompel taken over a series of days that were published at the same URI, but use this system of time-specific resources. He also showed an example using DBpedia, navigating historical data using HTTP when you only need the generic URI.
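One way to picture the relationship between a time-generic resource and its time-specific snapshots is a lookup that, given a requested datetime, picks the latest snapshot taken at or before that moment. This is only a sketch under assumed data: the snapshot URIs and the function name `select_snapshot` are hypothetical, and a real archive may apply a different selection rule (for example, the snapshot closest in time in either direction).

```python
from bisect import bisect_right
from datetime import datetime
from typing import List, Optional, Tuple

# Hypothetical time-specific resources for one generic URI, sorted by capture time.
SNAPSHOTS: List[Tuple[datetime, str]] = [
    (datetime(2010, 3, 1), "http://example.org/photo/20100301"),
    (datetime(2010, 3, 2), "http://example.org/photo/20100302"),
    (datetime(2010, 3, 5), "http://example.org/photo/20100305"),
]


def select_snapshot(requested: datetime) -> Optional[str]:
    """Return the URI of the latest snapshot at or before `requested`, if any."""
    times = [t for t, _ in SNAPSHOTS]
    index = bisect_right(times, requested)  # number of snapshots <= requested
    if index == 0:
        return None  # requested time predates the first snapshot
    return SNAPSHOTS[index - 1][1]
```

For instance, a request for 4 March 2010 would resolve to the 2 March snapshot here, since no snapshot was taken on 3 or 4 March.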

Van de Sompel then took us through the potential application of Memento for scholarship, specifically looking at annotations, which are expected to become more web-centric, rather than desktop-centric. However, if you attach annotations to a URI and the resource changes, your annotation can lose its context and become irrelevant. In fact, for certain regularly updating resources, the annotation is only relevant at that specific moment in time. Currently, the architecture of the web does not address this, but Memento allows annotations to become persistent, and in turn supports sharing of annotations. They have done experiments with online annotation tools, and if the archive copies were there, the annotations would be associated with the right content, even if the current content on the page had changed. Van de Sompel used this as a convincing use case to show that the combination of Memento with an annotation model that supports datetime will be very useful.

He concluded by emphasising the value of the URI for discovery. The linked data concept means we now have page and machine readable data linked with the same URI, which has increased the value of the URI. Adding Memento will increase the value again, as the one URI would also give the archive of that page as well.

Session Summary: Sustainable Digital Preservation and Access

Chris Rusbridge gave a presentation reporting on his involvement with the Blue Ribbon Task Force for Sustainable Digital Preservation and Access.

Rusbridge emphasised that we are in a world that increasingly relies on digital information in so many different ways, but this is very much at risk. He quoted a slide from one of the Task Force’s co-chairs which asserted: “Access to information tomorrow requires preservation today”. However, he observed that this sets quite a low bar to preservation. The statement does not place lots of restrictions or grand standards for formats – it just says keep it! Rusbridge referred to this as the “zero-floor” of digital preservation. In order to have access to a resource, you have to have that resource in the first place.

This led on to a summary of the key issues implied by this statement: what should we keep, who is responsible for it and who is going to pay.

In response to this last point, Rusbridge took us through some of the ways in which we currently pay for access to digital information, including government grants, adverts, subscriptions, donations, pay for service. However, most of the digital preservation work so far has been funded predominantly through government grants, so getting to a situation where there is some sustainability is the really tough call. This is because those who pay, those who provide, those who benefit are not the same.

The Task Force was charged to take a longer term focus. Rusbridge noted amongst the participants from different fields the involvement of world class economists, who don’t usually work in this type of area. Rusbridge found it interesting that after the Task Force had done its initial analysis, they understood that a market solution for digital preservation is not going to happen, as there are enough breaks in the incentives chain that they felt no obvious, simple market solution could overcome them.

Rusbridge emphasised that the economics of digital preservation are not just about money. There needs to be a value proposition that is persuasive, clear incentives and roles and responsibilities that work towards the aim. It is therefore a much wider problem than simply finding more money.

He observed that in the past, all of the work on long term preservation has been funded by short term resource allocation. It is only recently that organisations like the British Library and the Library of Congress have begun to bring the preservation process into their core budgetary frameworks. He noted that it is very hard to see the value in digital preservation, so it is difficult to see how you might get a return on the investment. As a result, there has been little co-ordination between incentives and a general attitude of “someone else will do it”.

Rusbridge observed that, so far, most of the discourse has been in the technical area, but the Task Force’s discussion was deliberately about the economics and the allocation of resources, which represents the first time that digital preservation has been seriously discussed as an economic issue.

One of the best phrases to result from the Task Force’s report, for Rusbridge, was that “the case for preservation is the case for use”. You cannot make a sustainable case for indefinite dark archive preservation. The benefits should emphasise the outcomes and there needs to be an element of self-interest in the case.

The economics discussion had some very different, useful language compared to that which is usually used in conversations about digital preservation. The economic framework is based on core attributes and choice variables. Rusbridge explained that one of the core attributes is that digital preservation is a derived demand: no one wants digital preservation. They want the outcome of preservation – the access to the resource.

Other terminology included “non-rival in consumption”, which Rusbridge described as the curse of the internet at large. If you try to make a business case for a sustainable flow of funding to support digital preservation in an environment where, by definition, people can take resources and use them, there is no incentive for them to give you money to keep those resources open. This is the free rider problem, which is inherent in the internet. The economists also described the digital preservation process as “temporally dynamic and path-dependent”. By this they mean that preservation is a sequence of steps, so getting one of those wrong will eliminate all future possibilities. It is dependent on things you have done in the past.

Once you reach the individual cases, “choice variables” come in, including who owns, who benefits, who selects, who preserves, who pays. These choice variables differ greatly in different cases and can have a huge impact on what you do.

Returning to his description of the Blue Ribbon Task Force’s remit, Rusbridge explained that there were four domain areas under examination, which were: scholarly discourse, research data, commercially owned cultural content and collectively produced web content. He admitted this was an ambitious set of domains to think about.

Rusbridge first discussed the issue of research data, the sustainability challenges and the actions that result. He highlighted the problem that there is a vast amount of data, some worthless, some enormously valuable – so how do you choose what to preserve? He also observed that the incentives to do preservation decrease as you get closer to the researcher – society may say that the data is valuable, the PI may be more possessive. This is an odd sort of process where those who control the data are quite possessive over it, so Rusbridge felt that mandates were useful to help address this problem.

He went on to discuss commercially owned cultural content, which presents an enormous set of challenges. The demand for access to a cultural record is clear. However, the private and public incentives are at odds, with the public incentive to preserve for the public, and the private incentive to preserve for future revenue generation or destroy so that it does not compete with another revenue generating project. There is also no handover mechanism for assets to be taken into public ownership and made available should the media company go out of business.

In contrast, the future demand for collectively produced web content is not yet clear. However, Rusbridge observed that people are beginning to care, although the incentives to preserve are still weak and difficult to articulate, whilst there are also problems with unclear ownership. Rusbridge used the example of Twitter to illustrate how there could be future uses and benefits to preserving the record.

To conclude, the Task Force has produced an agenda for further action which involves developing public private partnerships, with the Library of Congress taking on the Twitter archive providing a good example of how this may work. They also recommended further action to seek economies of scale and scope, and securing chains of stewardship by creating mechanisms for hand over.

Finally, Rusbridge highlighted how we can act on one of the recommendations individually by putting Creative Commons licences on our blogs so that they can be preserved, without someone having to decide whether it is a risk to preserve them.

Repositories and Cloud Services for Cyberinfrastructure: Interview with Sandy Payette

Sandy Payette, Chief Executive Officer of DuraSpace, provided a US repositories update in her presentation: “Repositories and Cloud Services for Data Cyberinfrastructure”. In this video interview she outlines the main topics within her presentation and some of the issues that arose in the brief questions session that followed.

If you are unable to see this video, please click here.

Session Summary: Moving Forward with Digital Preservation at the Library of Congress

On the occasion of the 10th anniversary of the National Digital Information Infrastructure and Preservation Program (NDIIPP), its leader Laura Campbell from the Library of Congress outlined the lessons that have been learned over the course of the project, observing that strategy is one thing, execution is another.

NDIIPP’s mission is to develop a national strategy to collect, preserve and make accessible significant digital content, especially information that is created in digital form only, for current and future generations. Their current network consists of 170 partners, with an architecture for preservation that provides for preservation and access, and a national collection of digital content that runs to around 300 terabytes.

Their initial approach was to focus on natural networks, which grew up in local pockets, and to learn about the tools and services that were needed to support these partners. They saw roles and responsibilities evolve, found that resources were scarce and that these resources have been under increasing pressure over the last couple of years due to the fiscal crisis. Campbell highlighted that these local networks do not yet share and make all of their information available, but there is currently a major tide shift in that direction. Barriers still remain connected with a lack of public policy incentives and there is a critical need for professional development, so NDIIPP are working with others to develop courses to roll out to the wider community in response to this need.

Campbell discussed the mixture of public and private partners in the network, which means that there are people working with others who they would never normally work with and new relationships developing. She gave the example of 40 national libraries, all working to develop tools and technologies to preserve websites in their countries, which are now working together to share experiences. NDIIPP worked with NSF to launch the first ever NSF sponsored digital preservation programme. Campbell observed that these are major sea changes. They are working at a macro level to change the way people think and hopefully reduce costs, whilst leveraging what they could do alone and learning from one another.

She went on to outline the types of collections on which they focus. This covers a wide range of materials, including social science data sets. They also have material for Congress, including election sites and websites from the end of the Bush term, as well as e-journals, geospatial data, audio visual data, websites and cultural heritage data. Some of these collections are huge and all represent amazing learning.

Access remains a huge challenge and remains their main area of focus, but they have learnt that one approach does not fit all. To illustrate this focus, Campbell outlined NDIIPP’s goals for 2010 which include the launch of an NDIIPP portal, giving full coverage for all partner collections, tools for finding, access and sharing.

Campbell described how they have worked on creating standards and federal agencies digitization guidelines, including standards for still images and audio visual, to help establish common practices. She also outlined their work with the creative industries to “preserve creative America” by creating standards for archiving, standardising metadata for stock photos, recommended work flow practices and bringing commercial groups together in a way that would not normally be possible in commercial practice.

NDIIPP continues to work on an architecture that creates a shared environment for validation and storage, delivering content, and produces more discovery tools. Campbell admitted that there are still outstanding issues on copyright exceptions which affect preservation, but described how they have fought for rights to retain public interest in private records.

She outlined the communication and outreach that their work has involved, including a newsletter with over 12,000 subscribers and a series of videos on iTunes.

Campbell went on to describe NDIIPP’s current transition from a programme of modelling and testing to an ongoing stewardship alliance – the National Digital Stewardship Alliance. They have a 10 year content collection plan and will be focusing on content communities including public policy on the web, digital news and geospatial information. She emphasised that they would welcome data from those areas and that they want to enhance their focus on educational outreach for those outside the network. They are also interested in working internationally, so opportunities exist to join the network.

Campbell concluded by reaffirming that no execution is perfect. There is lots of room for iterations and improvements, so NDIIPP is looking for more people to join and work with them.

Special Collections Transformed by Technology: Interview with Cliff Lynch

CNI’s Executive Director, Cliff Lynch, gave a presentation entitled: “Special Collections Transformed by Technology” as part of the conference focus session on US developments. In this video interview, Cliff explains the areas he touched upon in this presentation and highlights the areas of interest raised by the audience in the following question and answer session.

Session Summary: Open Data Policies, Liz Lyon

Liz Lyon, the Director of UKOLN, built on Heather Joseph’s talk by contrasting the US policy environment with UK data policy, both today and in the future, and examined some of the challenges.

In a snapshot look at the present situation in UK HE institutions, Lyon quoted from a forthcoming report, which included comments by faculty members along the lines of: “I just back everything up onto data sticks, I didn’t even know you could back up to servers”. This was used to demonstrate that researchers have different levels of knowledge and practice about backing up and many have experienced catastrophic data loss.

The report also identified other issues on the ground relating to data management and open access. Giving away data is one big barrier for researchers, who liken it to giving away their baby. Doubts about the methods of other researchers often stop them wanting to share their data or use the data shared by others for further research. The report found that many researchers consider the policy documents wordy and difficult to understand: they want guidance, but view policies as hollow mandates. Lyon explained how the findings of this report help to give an idea of how a data policy should be packaged if it is to be useful.

Lyon moved on to look at the future data landscape, using genomics as an example case. Sequencing genomes generates huge amounts of data, and is getting faster and cheaper, ramping up the data deluge. This presents a challenge to researchers and they need advice and guidance on data storage, particularly given that their aim is to: “analyse an entire human genome in a single day sitting in with a laptop in Starbucks”. Lyon noted that several pharmacological companies are moving towards putting data into the cloud, but our university researchers are lacking guidance about the most appropriate ways to store their data and make it accessible.

Accessibility is a particular issue for this type of data, but there are changes in attitude taking place. Lyon drew our attention to the 24 human genomes now openly published on the web, among them those of some quite high-profile people, such as Desmond Tutu. She referenced web sites and related communities that allow you to pay for a kit to send off a sample and get your genome sequenced. There is a community around this product consisting of people sharing their genomic information on the web. She also highlighted a new method of anonymising medical records for genomics research at Vanderbilt University, which will make it easier to make data available for mining.

Elaborating on this theme, Lyon quoted from a Wall Street Journal article, “My Data, Your Data, Our Data”, which predicted that in the future people may choose to make their genomic data available on Facebook. She commented that if patients demand that their data be shared, researchers have no choice, and cited the example of one patient who is trying to get data about bone cancer unlocked so researchers can find a cure for his disease faster. Lyon felt that this demonstrates there is a growing role for university ethics committees to get involved in this space.

Lyon went on to discuss the BBC’s Lab UK projects, which are designed to use public participation and citizen science. These experiments are developed by scientists and are leading to real, useful, peer-reviewed and published science. This example raises questions about the attitudes of researchers in our faculties towards working with the public, and whether they have the necessary guidance for working with the public in a similar way to engage more people and gather larger data sets for their research.

Lyon also talked about current work to develop new metrics for data citation, which can be very complex, ranging from citing the journal at the macro level to citing the visualisations and data sets at the micro level. Again, she felt that there needs to be more policy guidance to help our researchers understand these new levels of granularity resulting from open data.

Finally, Lyon discussed the activities of the DCC in this area. They scanned the policy documents of UK institutions to see what guidance is already available and are subsequently developing a data management planning tool. This includes a checklist using the information in the various funding council policies, which was put out for comment and received useful feedback. They are now developing an online tool, which they hope will be available in the summer at http://www.dcc.ac.uk/dmponline, to give researchers some prompts to help them develop a specific data management plan.

She emphasised that we need to get to a position where a data management plan is the norm, embedded in the policies and in the research lifecycle for the researcher. Lyon also wants clearer guidance for all of the major players, and for the data management plan to be part of the review processes, so the DCC are developing tools to help reviewers. She admitted that there are then questions, and some thinking that needs to be done, around compliance.

Lyon also observed that it would be good to share these plans, so there needs to be some infrastructure for this. Assuming that a data management plan is going to lead to better data, she also identified the need for some careful cost-benefit analysis.

To conclude, Lyon observed that practice is currently disconnected from policy. She noted some policy gaps and the amount of work to be done with the funders to make the data management plans actually work.