The MIRADOR project

Metadata Improved Relevance Assessment through Descriptions of Online Resources


Sheila Rock, Alison Cawsey, Patrick McAndrew and Diana Bental
Department of Computing and Electrical Engineering
Heriot-Watt University
Edinburgh EH14 4AS, UK
mirador@cee.hw.ac.uk

Abstract

MIRADOR (Metadata Improved Relevance Assessment through Descriptions of Online Resources) is concerned with using metadata to produce tailored descriptions of online resources that will help people evaluate their relevance and suitability. In this document, we describe the background to MIRADOR, and then discuss some early evaluative work that we have begun.

1. Introduction

There is a great diversity of multimedia resources available on the Internet. Such diversity and volume are very useful in most contexts, and educational contexts, which are the focus of our investigation, are no exception. However, as the volume of resources increases, it becomes difficult to track down material relevant to an individual, and expensive to download a resource just to see if it is relevant.

The main response to this problem has been in the form of better search and retrieval tools. However, we believe it is also beneficial to approach this issue from the user end (in the educational context, users may be teachers, learners, or anyone making use of a resource in any way), by providing mechanisms whereby the user is more able to assess the potential relevance of a resource of interest. In particular, providing the user with better descriptions of the resources will aid this assessment.

Our aim is therefore to support the user in searching for multimedia resources on the web by:

  • providing tailored descriptions of existing networked resources
  • taking into account user profiles and richer resource descriptions
  • using metadata to generate these tailored descriptions.

Figure 1 is an overview of our approach. We distinguish between local information, such as user profiles and the particular query that a user might have, and web-based information, which includes the web resources themselves and the metadata associated with them. In a conventional search, the query together with data about the web resources results in a document set that matches the query. Some search processes might use metadata to drive or enhance the search. What MIRADOR aims to provide is an ability to generate tailored descriptions of the document set identified in the search process, using metadata and using user profiles for the tailoring.

Figure 1 An overview of the MIRADOR approach
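
The flow in Figure 1 can be sketched in code. The following is a minimal Python sketch under stated assumptions, not the MIRADOR implementation: the class names, the metadata keys (title, type, time_to_complete, prerequisites), the keyword match that stands in for the search step, and the two tailoring rules are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class UserProfile:
    """Local information about the user (hypothetical structure)."""
    role: str  # e.g. "teacher" or "learner"


@dataclass
class Resource:
    """A web resource together with its metadata record."""
    url: str
    metadata: dict  # metadata field name -> value (keys are assumptions)


def search(query, resources):
    """Stand-in for a conventional search: keep resources whose
    metadata values mention the query string."""
    q = query.lower()
    return [r for r in resources
            if any(q in str(v).lower() for v in r.metadata.values())]


def describe(resource, profile):
    """Build a one-sentence description, choosing which metadata
    to mention according to the user's role."""
    md = resource.metadata
    parts = [f"{md.get('title', resource.url)} is a {md.get('type', 'resource')}"]
    if profile.role == "teacher" and "time_to_complete" in md:
        parts.append(f"that takes about {md['time_to_complete']} to complete")
    if profile.role == "learner" and "prerequisites" in md:
        parts.append(f"that assumes {md['prerequisites']}")
    return " ".join(parts) + "."


def tailored_descriptions(query, profile, resources):
    """Search first, then describe each matching resource for this user."""
    return [describe(r, profile) for r in search(query, resources)]
```

The point of the sketch is the separation of concerns: the search step is unchanged, and the tailoring happens afterwards, using only the metadata and the profile.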

2. Background

There have been a number of recent developments in providing web resources that have contributed to our approach.

The concept of metadata is increasingly gaining currency as a mechanism for enhancing the usefulness of internet resources, and there have been a number of initiatives around formalising and standardising frameworks and architectures for its use. The Dublin Core is one example of such an initiative, an attempt to identify a metadata set for describing electronic resources, and there have been others. Thiele [1] provides a good overview of the literature around the Dublin Core Workshop series.

In the wake of these initiatives, there has been an increase in the number of web resources that include metadata. This is not without its own problems, however. We have noticed, for example, that the inclusion of metadata in web resources is often patchy, and occasionally we might find a document whose metadata bears little relation to the resource it describes. Like comments in a program, because the metadata is not as visible as the resource itself, it can often be forgotten when a resource is changed or upgraded. In the most extreme case an author might use a metadata template, perhaps copied from another resource, and never get round to completing or changing the metadata for the new resource.

There has been progress towards metadata standards for describing educational resources. In particular, the IMS Project (Instructional Management Systems), a cooperative of academic, commercial and government organizations looking at internet architecture for learning, is investing some energy in metadata and, together with ARIADNE (Alliance of Remote Instructional Authoring and Distribution Networks for Europe), has come up with a schema for educational metadata.

Briefly, the ARIADNE proposal describes a schema of information, in 6 mandatory categories. These (currently) are:

  • general information on the resource itself
  • semantics of the resource
  • pedagogical attributes
  • technical characteristics
  • conditions for use
  • meta-metadata information

Within these, various descriptors are proposed, not all of them mandatory. There is currently a total of about 25 descriptors, and within these, the IMS proposal describes over 80 elements, in a hierarchy.
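
To make the shape of such a record concrete, the following is a minimal Python sketch of a container with one slot per category. Only the six category names come from the list above; the attribute names, the use of plain dictionaries for descriptors, and the missing_categories helper are assumptions made for illustration and do not reproduce the actual ARIADNE or IMS descriptors.

```python
from dataclasses import dataclass, field

# One attribute per mandatory category listed above.
# The attribute spellings are illustrative assumptions.
CATEGORIES = (
    "general",
    "semantics",
    "pedagogical",
    "technical",
    "conditions_for_use",
    "meta_metadata",
)


@dataclass
class MetadataRecord:
    """A metadata record with one descriptor dictionary per category."""
    general: dict = field(default_factory=dict)
    semantics: dict = field(default_factory=dict)
    pedagogical: dict = field(default_factory=dict)
    technical: dict = field(default_factory=dict)
    conditions_for_use: dict = field(default_factory=dict)
    meta_metadata: dict = field(default_factory=dict)

    def missing_categories(self):
        """Return the categories that carry no descriptors at all."""
        return [name for name in CATEGORIES if not getattr(self, name)]
```

A check of this kind is one simple way to detect the patchy or template-only metadata mentioned earlier before attempting to generate a description from it.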

In the computational linguistics arena, natural language generation techniques are being used to generate summaries of documents, for information retrieval [2]. In contrast to this, our approach is to use metadata as the underlying knowledge base, and apply natural language generation techniques in generating tailored descriptions of resources. One important advantage that metadata gives over using document content as the source data is that metadata is available for non-text resources, such as those containing video, audio, or graphics. The approach taken by Amitay [3] recognises these limitations with non-text documents. Instead of using the resource of interest itself, she looks for descriptions found in textual resources that point to the resource of interest, and aims to determine which such description is the most appropriate. Our approach is instead to use the metadata found with the resource to generate a coherent description of a number of resources that is tailored to the context of use.

So, combining relevant aspects of these recent developments, we propose to address the need for better mechanisms for finding web resources, by generating natural language descriptions of existing multimedia resources from metadata about the resources, and tailoring these descriptions to take into account the educational context and user need. We have constructed the example in Figure 2 to demonstrate this, using categories based on the ARIADNE-IMS educational metadata recommendations.

Figure 2 An example of the kind of output we aim for MIRADOR to produce


3. A preliminary investigation

As a prelude to establishing what kinds of tailored descriptions would be useful to users of web resources, we have done a study of some existing descriptions. Our aims in this study are to:

1. determine the structure of human-authored document descriptions, which will inform the design of the descriptions we might aim to generate
2. establish how the content of some human-authored descriptions matches the information in a recognised metadata schema (in particular, the ARIADNE-IMS master schema, which has been proposed for educational resources).

EEVL, the Edinburgh Engineering Virtual Library, maintains a searchable catalogue of reviews and links to engineering-related web sites. Described by Moffat [4], it was established as a project under the Electronic Libraries Programme. The EEVL database contains descriptions of nearly 4000 engineering-related web sites, and we have taken a sample of 22 of these (descriptions of tutorials, about computing topics) for our analysis. These are descriptions of single documents, and our aim is ultimately to provide descriptions of multiple resources, including some contrast and comparison. However, the single document descriptions are a useful starting point for identifying the kinds of information such descriptions contain.

The resources being described are generally text documents, but some of them include video clips, audio, images, etc. Two example descriptions are shown in Figure 3. We have identified (so far) some 20 different information types, which together cover the content of the descriptions. These include things like Written_for, Written_by, Written_why, Keywords, Aimed_at, Time_to_complete, etc. Any one description may have text that pertains to any of these 20 information types.

The IMS metadata master schema is based on the IEEE LOM V2.2 Working document, jointly authored by IMS and ARIADNE. Some, but not all, of our information types have an obvious mapping to this schema. The IMS schema, in contrast to the Dublin Core, has a hierarchy of sets of metadata, containing a total in excess of 80 fields.
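
The kind of mapping exercise described here can be sketched as follows. This is a minimal Python sketch under stated assumptions: the information type names are taken from the text above, but the particular category assigned to each type, and the choice of which type is left unmapped, are illustrative assumptions rather than results of the study.

```python
from collections import defaultdict

# Hypothetical mapping from information type to schema category.
# None marks a type assumed, for illustration, to have no obvious counterpart.
TYPE_TO_CATEGORY = {
    "Written_by": "General",
    "Keywords": "Semantics",
    "Aimed_at": "Pedagogical",
    "Time_to_complete": "Pedagogical",
    "Written_why": None,
}


def group_by_category(fragments):
    """Group (information_type, text) pairs by schema category.

    Returns a pair (mapped, unmapped): a dict from category to the list
    of fragment texts assigned to it, and a list of the pairs whose
    information type has no category in TYPE_TO_CATEGORY.
    """
    mapped = defaultdict(list)
    unmapped = []
    for info_type, text in fragments:
        category = TYPE_TO_CATEGORY.get(info_type)
        if category is None:
            unmapped.append((info_type, text))
        else:
            mapped[category].append(text)
    return dict(mapped), unmapped
```

Applied to a description that has been annotated by hand, the first result shows which categories receive several fragments, and the second collects the fragments that the schema does not obviously cater for.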


6. PLC Tutor

The PLC Tutor provides a complete non-vendor-specific guide to programmable logic controllers. It offers an introduction, and sections on basic programming, advanced programming, wiring, and links to manufacturers. New chapters are introduced to the tutorial on a regular basis. The site can also be downloaded, if required.



13. Transmath - a CBL mathematics tutor

The Transmaths project aims to address some of the problems experienced by increasing numbers of first year undergraduate scientists and engineers, who arrive at University with inadequate mathematical knowledge and skills, by providing them with a self-paced, user friendly computerised mathematics tutor. The Mathematics departments of Imperial College and the University of Leeds were awarded a grant under the Teaching and Learning Technology Programme (TLTP) to produce CBL material for the remedial teaching of mathematics.

The Transmath Web page provides further information about the project, a list of available modules, an ftp server for downloading Transmath, articles written by the Transmath team, and an evaluation of TMP and Transmath software. Other sites of interest are also available.

Figure 3 Example descriptions, taken from the EEVL database

Initial findings

Most of the information we have in the descriptions can be directly connected with fields within the ARIADNE categories General, Technical, Pedagogical, and Semantics. The areas where this is not the case are minor:

We have noticed that many descriptions will themselves have hyperlinks to resources other than the one they are describing. This is information that is not easy to cater for in the metadata schema. The closest we can find is the information in the Relation category, which is described as 'characteristics of the resource in relationship to other resources', but this is more general than we would like.

In the IMS schema, keywords are a sub-level within the Semantics category, pertaining to the Concept or the Discipline. This does not always fit with the use of keywords in our sample descriptions, which might sometimes pertain to the educational goal, or the nature of the resource, for example.

Many of the information categories will have contributions from more than one language fragment, while others will have only one. For example, there may be a number of educational concepts mentioned in the description; there may be a number of prerequisites for learners wishing to use the resource. Though usually such collections of information are close to each other in the description text, there is no requirement that this is the case. In particular, hyperlinks to other resources may be distributed through various parts of the description.

We also note that an important property of text descriptions is that they provide a coherent organisation of information, which itself conveys some semantic content. An analysis of this organisation takes us beyond the flat representation that is obtainable from metadata. In particular, the Dublin Core metadata has no hierarchical structure at all; the ARIADNE-IMS schema has a hierarchy which is inflexible. This suggests that in generating descriptions we must consider this indirect semantic content and underlying communicative goals.

Natural language prose is felt to be more useful than, say, simply listing the metadata, for a number of reasons. It can focus the information in a way that is tailored to the user's needs, and it allows certain information to be emphasised. In addition, a text description is an appropriate way in which to provide the kind of comparison we envisage. A small study is planned to verify this, comparing the usefulness of text versus tables in this context.

4. Concluding remarks

The MIRADOR project aims to bring together recent developments in metadata, educational resources, and natural language generation in a novel way, to help users of networked educational resources deal with the volume of resources available.

We aim to develop a system to provide better descriptions of resources, and take as our starting point an analysis of professional, human-authored descriptions. This analysis suggests that the metadata we plan to use as our source data is, if complete, a mostly adequate resource, but we have identified some limitations. Our analysis also suggests ways of structuring descriptions.

We now plan to work on implementation issues and move on to consider how descriptions can be tailored to user and query.

Bibliography

1. Thiele, Harold, The Dublin Core and Warwick Framework. D-Lib Magazine, January 1998, ISSN 1082-9873. Available: http://www.dlib.org/dlib/january98/01thiele.html [Accessed February 1999].

2. McKeown, Kathleen R., Jordan, Desmond A. and Hatzivassiloglou, Vasileios, Generating Patient Specific Summaries of Online Literature. AAAI Spring Symposium on Intelligent Text Summarisation, pp34-43, Stanford, 1998.

3. Amitay, Einat, Using common hypertext links to identify the best phrasal description of target web documents. Proceedings of the SIGIR'98 Post-Conference Workshop on Hypertext Information Retrieval for the Web, 1998. Available: http://www.mri.mq.edu.au/~einat/sigir/ [Accessed April 1999].

4. Moffat, Malcolm, An EEVL solution to engineering information on the Internet. Aslib Electronics Group 38th Annual Conference, 15-17 May 1996. Available: http://www.eevl.ac.uk/paper1.html [Accessed February 1999].