University of Glasgow

University of Leicester


The Problem

The Virtual Observatory (VO) project within astronomy has two core problems:

  1. How to find data from scattered, and often under-resourced, archives, and
  2. Once it is found, how to make use of it, given that different archives will generally have significantly different models of how their data is structured.

The High-Energy Physics (HEP) community doesn't really have the first problem -- there are rather fewer important accelerators and detectors, and so fewer data sources -- but does have the second, since different facilities have different, and necessarily inflexible, ideas about how to structure the data they produce.

One approach is to convert data into and out of a consensus model, but it is proving unexpectedly difficult in practice for the VO community (incorporated in the International VO Alliance, or IVOA) to agree on such a model which is simultaneously rich enough to be useful, and simple enough that consensus is possible. The HEP community has only recently started to work on this problem, which in that context is referred to as the problem of finding common metadata (data about data) covering the datasets which are available. On top of the problems of reaching consensus, the VO has discovered that it actually needs more than one data model, to deal with the separate cases of searching for data ("find me all the datasets which have such-and-such a type of data within them"), and driving the processing of that data once it is found ("here is some data; do the right thing with it").

Proposed Solution

This project aims to use a radically different and adventurous approach, making it possible for software to extract from a given dataset the information which is relevant to that software. This approach avoids relying on a difficult consensus, and recognises that partial and indirect understanding of a dataset's structure can, in important cases, be enough for the software to do its work.

Rather than aiming for a possibly illusory consensus, we hope to make it easy for both data centres and the developers of data processing applications to articulate precisely the data model which they themselves have, and then to produce a formal description of how it relates to other data models which they know about. At this point it becomes possible for an application to say "I know about data models A and B, but I don't know about this new one. However I see that someone has described how this new model is related, at least partially, to A's and B's models, so I can use that knowledge to process the new data." The relationship of the new data model to the older ones might have been described by the maintainers of models A or B, or by the providers of the new data, or even by a third party who has had to work it out either from published sources or by conversation with the new data's providers. Of course, there are issues about the authority of these descriptions and the handling of conflicting information, but these are interesting challenges.
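As a rough illustration of the reasoning an application might perform, the sketch below treats the published relationships as a small graph of term-to-term assertions and follows them, directly or indirectly, until it reaches a term from a model the application already understands. Everything in it is an assumption made for illustration: the model prefixes (A, B, C, X), the term names, the MAPPINGS list and the functions reachable_terms and translate are all hypothetical, and a real system would more plausibly express the mappings with Semantic Web technologies and use a reasoner rather than a hand-written search. It also ignores the questions of authority and conflicting information mentioned above.

```python
from collections import deque

# Hypothetical mapping assertions: (term in one model, term in another model).
# Each pair says "a value for the first term may be read as a value for the second".
# Model prefixes and term names are invented for this sketch.
MAPPINGS = [
    ("C:rightAscension", "A:ra"),
    ("C:declination", "A:dec"),
    ("C:bandpass", "X:filter"),   # indirect route: C -> X -> B
    ("X:filter", "B:band"),
]


def reachable_terms(term, mappings):
    """Return every term reachable from `term` by following mapping assertions."""
    graph = {}
    for source, target in mappings:
        graph.setdefault(source, []).append(target)
    seen, queue = {term}, deque([term])
    while queue:  # breadth-first search over the mapping graph
        current = queue.popleft()
        for nxt in graph.get(current, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    seen.discard(term)  # the starting term is not a translation of itself
    return seen


def translate(record, known_models, mappings):
    """Re-express the fields of `record` using terms from models the application knows."""
    understood = {}
    for term, value in record.items():
        for candidate in sorted(reachable_terms(term, mappings)):
            model = candidate.split(":", 1)[0]
            if model in known_models:
                understood[candidate] = value
    return understood


# A record described in the new, unfamiliar model C.
new_record = {
    "C:rightAscension": 83.63,
    "C:declination": 22.01,
    "C:bandpass": "V",
    "C:observerNotes": "thin cloud",  # no mapping exists, so this field is left out
}

# The application only knows models A and B.
result = translate(new_record, {"A", "B"}, MAPPINGS)
print(result)
```

The field with no mapping is simply dropped, which reflects the point that a partial understanding of the new model can still be enough for the application to do useful work.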

This approach builds both on well-established Artificial Intelligence (AI) and Knowledge Engineering (KE) work and on the emerging technologies from the Semantic Web community. It separately builds on work from the Information Retrieval (IR) community on how to work with distributed, heterogeneous and uncertain data.