Session Summary: The Data Conservancy and the US NSF DataNet Initiative

Sayeed Choudhury of Johns Hopkins University chose to focus on the infrastructure element of the expanded conference title, putting the technology aspect of the Data Conservancy project aside for his opening plenary in order to discuss the issues they are exploring around infrastructure design.

He began by citing the report Understanding Infrastructure: Dynamics, Tensions and Design by Edwards et al. (2007), which calls not for a rigid road map but for principles of navigation. For Choudhury, this idea relates to the Data Conservancy project by reinforcing that technical fixes are not the most important aspect: he noted that if we believe there is a particular technical pathway, this can create a dependence from which it can be difficult to extract ourselves.

His key point was that infrastructure development occurs because of a fast-changing technological and economic environment. When things are going well you should not notice the infrastructure, and you certainly do not usually try to overhaul it. He illustrated this by describing his faith in the infrastructure of the ATM system to get his foreign currency when travelling abroad. Even when his faith in this infrastructure was challenged by an error message in a foreign language, the consistent nature of the infrastructure enabled him to move through the system and achieve his goal of withdrawing currency.

Connected to this example of a global infrastructure, Choudhury noted that in more modern times we have a more global perspective. There are periods of time when people have particular needs in a local area, but at some point a larger, overarching viewpoint needs to come into the mix to review the wider infrastructure and how these local systems can be connected and used more effectively. He illustrated this by describing the standardisation of the railroad system in the US, which now connects across the whole country, having started as a series of local systems.

This illustration connected to the DataNet project, which observes that science and engineering research and education are increasingly digital and data-intensive, so there will come a point when the amount of data becomes too much for the local scientists involved to handle. It is at this point that they need help. The volume of data needed to reach this point may vary: astronomers may be able to handle more data than social scientists. Part of what the Data Conservancy project is designed to look at is large-scale infrastructure, to see how data can be managed more efficiently both locally and as part of a larger framework to address grand research challenges.

This focus on data management by the NSF, which funds the Data Conservancy under the DataNet project, has led to their requirement for all grant proposals to include a data management plan to be examined in the review criteria. The Data Conservancy are now reflecting on what this might mean for them. They are also following and reflecting on a feasibility study by Johns Hopkins University looking at an open access repository. Choudhury feels that this open access repository is another lens on infrastructure, with many motivations. He observed that the previously closed infrastructure model had unintended consequences that made it very difficult for us, and the open access movement is an implicit recognition that we need to address these issues.

He moved on to describe the Data Conservancy’s shared vision: “data curation is a means to collect, organize, validate and preserve data so that scientists can find new ways to address the grand research challenges that face society.” He noted that preserving the data is in that definition, but preservation needs to be built in a manner that supports current and future research, as many of the great research challenges involve looking at issues through different lenses. The goal is to support new forms of inquiry and learning that address grand research challenges, not just to preserve the data. Their strategy is comprehensive, bringing together many elements. They are studying existing systems through user-centred design and research, and building on existing exemplar scientific communities and virtual organisations that have deep engagement with citizen scientists and extensive experience with large-scale, distributed system development. He also explained why they had felt that it was important to include libraries as core to this work, given the existing systems and connections. There is currently no theoretical framework for building digital libraries, but this is what they hope to develop through the Data Conservancy, so the involvement and experience of librarians needs to inform their research.

Choudhury went on to explain some of the background to the Data Conservancy’s partner institutions and why they have been brought into the mix to look at different elements. Bringing together these different perspectives has had great benefits, but also some tensions. One of their critical issues is making sure that the framework does not stifle the research.

When discussing the objectives of the project, Choudhury focussed on the requirement for the outcome to be sustainable. He observed that going back to DataNet at the end of the project and asking for more funding is not a sustainable response, so they are working to get libraries invested in the project so it can continue beyond the funding.

In describing the practical approach of the Data Conservancy project, he explained that they are starting with the astronomy research community as an exemplar community, as they already have some good practices which have evolved from their own work in managing the large amounts of data produced in their experiments. The researchers will be mining the astronomers’ practices for key lessons, then trying to extrapolate those into other disciplines. One of the results Choudhury predicted was that instruments may prove to be an important element, as they will affect what kind of data is collected and how it should be curated. The practices of the astronomy community are strongly connected with the nature of their instruments and the types of data they produce, so the researchers will have to examine how this translates across other scientific areas.

One of the other challenges Choudhury highlighted was discoverability, which is being addressed through a data framework that requires a high-level view of the data, without getting caught up in the jargon of particular disciplines. Identifying appropriate descriptive terms, avoiding jargon but not undermining the depth of meaning, is a very challenging problem. Choudhury sees this as their grand information research challenge. The common conceptualization that they are using is the concept of the observation, which spans the sciences and already has some common attributes in different scientific disciplines. Some scientific communities are already working in this area, so there are existing concepts to build upon.

Choudhury stressed that what they are not trying to do is build infrastructure and impose it on people. He referred to a book that inspired him, Emergence: The Connected Lives of Ants, Brains, Cities and Software by Steven Johnson, which conceptualises the movement from low-level rules to higher-level sophistication and describes this as emergence. He felt this was a compelling idea and asked: “What is an emergent model for data conservation?” and: “How can this idea of emergence be worked into the infrastructure so it is evolving and responding to the needs of the community, rather than prescribed?”

In an emergent system, not everyone needs to develop a huge infrastructure – we just need to connect local data repositories. We need simple ways to deposit data into different trusted locations and ways to automatically discern what other data it may be connected to, whether those connections are expected or not. Whatever services could be applied to compare this data should be readily available within this system. Choudhury believes that if they can achieve this then there would be much greater participation from scientists themselves. He made the key observation that infrastructure is not about system building, but rather about the rich, comprehensive set of human and technology interactions.

To conclude, Choudhury advised that we have to embrace the diversity of cultures. People come from different contexts, so any framework should not become a straitjacket. We need to embrace the chaos before imposing the order, and put up with some anxiety now to develop a system that is emergent and sustainable in the longer term.