When Virtual Documents Meet the Real World

Stephen J. Green1, Maria Milosavljevic2, Robert Dale1, Cecile Paris2
1Language Technology Group
Microsoft Research Institute
Macquarie University
North Ryde NSW 2113
2Intelligent Interactive Technologies Group
CSIRO Mathematical and Information Sciences
Locked Bag 17
Sydney NSW 1670

1. Introduction

We have been involved in an ongoing project on the creation of virtual documents using techniques drawn from the area of Natural Language Generation (NLG; see Reiter and Dale (1999) for an up-to-date overview of this field). In the past year, we have taken this work to a new level by attempting to scale up the techniques we originally developed on small hand-constructed knowledge bases (KBs) to a stage where they can be applied to a large KB that was automatically built from a database of museum objects.

This experience has illustrated some important principles to keep in mind when virtual documents on the Web meet the difficulties and inadequacies inherent in real-world data. In addition, we have gained some insight into how generation techniques can be used to produce virtual documents in multiple languages.

2. Virtual documents and NLG

The forerunner of most of the systems that we have built is the Peba-II system (Milosavljevic et al., 1996), which automatically generates descriptions and comparisons of animals. These descriptions and comparisons are based on a model of the user, so that they can include references to animals that the user has already seen described.

As an experiment in seeing how domain-dependent these techniques were, we took the Peba-II system as a base and built a new database containing the kinds of objects that one would expect to see in a museum of computing. This system, Power, could be used to produce exactly the same kinds of descriptions and comparisons as Peba-II, with only a new KB and the addition of some new lexical items.

The problem with the approach taken in these two systems was that the KBs had to be constructed by hand, a time-consuming and difficult task at best. As a result, the KBs were small, although of quite high quality. Even so, the systems have been quite successful and have shown the feasibility of using NLG techniques in a near real-time environment like the Web (see Dale et al. (1998) for more information).

3. Automatically acquiring a knowledge base

Our success in porting Peba-II to a new domain led us to consider how we could automatically acquire a much larger KB of objects to describe. Our interest in the museum domain is not unique: the Intelligent Labelling Explorer project (Hitzeman et al., 1997) has also focussed on generating descriptions of museum gallery objects. The Powerhouse Museum was willing to provide access to the database of their Collection Information System (CIS). This database consists of approximately 300,000 objects, most of which are in storage at the museum's various facilities because of a lack of floor space in the museum's exhibition halls.

      <rec num=12798 id="H4448-513">
      OID: H4448-513                
      INT: Part
      LOC: TH2.STEP.6A
      LOD: 27/11/1997
      OBN: Boots
      OBS: Balmoral boots, elastic sided, pair, women's,
           patent/kid/leather/elasticised fabric/wood,/brass prize
           work, [Gundry & Sons], England, c.1851; 1862-1869.
      DES: Balmoral boots, elastic sided, pair, women's,
           patent/kid/leather/ elasticised fabric / wood /brass, prize
           work, [Gundry & Sons], England, c.1851; 1862-1869.  Pair of
           women's elastic sided boots (Balmoral), with wooden filler,
           of welted construction with rounded toes featuring peaked
           caps and stacked heels.  The uppersconsist of a patent
           golosh, seamed at the back, glace kid leg, seamed at front
           and back, and elastic sides extending to the golosh. The
           uppers are decorated with oval stitching at the edge of
           caps and scallops at the throat of golosh.  The leather
           heel is fine wheeled, featuring a top piece with brass
           nailed edge.  The black leather sole features a sueded
           forepart with brass nails, as well as an internal clump and
           brass hinged section for extra strength and a brown
           polished ridged waist with black edge.  Reputed to have
           been made by Gundry & Sons.  (See object file for
           specialist report by June Swann)
      MDE: Gundry & Sons; London, England
      MDN: 1965 list says "made by Gundry & Sons, Soho Square."  Swann
           says hinged device to increase flexibility is
           unusual. Similar screws on H4448-515.  Note hinged sole in
           1862 exhibition.  She finds no information about Box in
           information she has about the 1851 exhibition, though
           William Walsh is mentioned in connection with a pair of
           shoes.  Patent 558, 5 March 1861, granted to J.M. Carter, a
           similar sole with 2 cuts across the tread and 4 rows of
           screws "for soldiers, riflemen, sportsmen.  The inner sole
           is whole and contains pitch."  It is not possible to
           confirm whether these boots contain pitch.
      DAT: c 1851 - 1869
      MAR: Interior obscured by last, no marks on exterior
      DIM: Length 248 mm Height 31 mm Overall Height 160 mm Width 58 mm 

Figure 1: A well-populated database record

From the database, we selected only those objects on display in the museum. We also included in this set all those objects in the database that were a part of something on display, even if these particular objects were not on display. This subset amounted to approximately 15,000 objects. Figures 1 and 2 demonstrate the range of data quality in the database. The entry in Figure 1 provides a rather complete description of the Balmoral boots, including a discussion of the construction and history of the boots. The entry in Figure 2, on the other hand, provides no information beyond that required for a database record to actually exist! It is worth noting that this database was intended for internal use by curators only, and not intended to be used to generate object descriptions for public consumption.

      <rec num=15463 id="PROP/EXP/82">
      OID: PROP/EXP/82              
      INT: PROP
      LOC: EH2.EXP.3H
      LOD: 16/02/1996
      OBS: Aerosol deodorants/colognes (2), "Lynx" & "Australis".
      DES: Aerosol deodorants/colognes (2), "Lynx" & "Australis". 

Figure 2: A sparsely populated database record

In addition to the database, we were provided with a hierarchical thesaurus of objects that had been prepared by the museum. With these two information sources, and with lists of countries and materials extracted from the Macquarie Thesaurus (Bernard, 1990), we could begin the process of extracting a KB from the database. The extraction process is broken down into five basic steps:

  1. Normalisation of the database: This is to ensure that each record is surrounded by an SGML-style rec tag, and that each field of an entry is on a single line. This simplifies processing at the later stages.
  2. Extraction of dimensions: In this step, a Perl script extracts the dimensions of the objects. This information resides in easily identifiable fields (e.g., the DIM field in Figure 1) and the information in that field is structured and can be decomposed into its subfields (e.g., length and height).
  3. Extraction of thesaurus categories: This step involves trying to identify the thesaurus category that applies to each of the objects in the database. This is normally found in the OBN (Object Name) field and corresponds to an entry in the Powerhouse's thesaurus.
  4. Extraction of names, materials, makers, locations, and dates of construction: This involves extracting information from the textual information contained in the database records. Most of our work here so far has focussed on the OBS (Object Statement) field. This field is supposed to include information encoded in a standardised and rigorous way. In practice, however, not all the information that is supposed to be included is present, or it is present in a different order or format from the norm. Nevertheless, with the help of information from the Powerhouse's thesaurus and the lists of materials and countries, we were able to identify information such as date of manufacture or purchase, materials and location.
  5. Extraction of PART-OF and A-KIND-OF information: We used the OID (Object ID) field to determine the PART-OF hierarchy for the database. For example, in the database record shown in Figure 1, the OID H4448-513 indicates that this object is the 513th part of the object with OID H4448 (in this case the Balmoral boots are part of a large collection of footwear). According to the database specifications, an object may have parts, sub-parts, and sub-sub-parts.
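Steps 2 and 5 above can be illustrated with a minimal sketch. The original extraction used a Perl script that is not reproduced in this paper; the Python below is only an approximation of the idea. The function names, the regular expression, and the assumption that a part number always follows the last hyphen of an OID are ours, not part of the original system.

```python
import re

# A normalised record fragment (one field per line), using values from Figure 1.
RECORD = """OID: H4448-513
OBN: Boots
DIM: Length 248 mm Height 31 mm Overall Height 160 mm Width 58 mm"""


def parse_fields(record):
    """Split 'TAG: value' lines into a dict keyed by the field tag."""
    fields = {}
    for line in record.splitlines():
        tag, _, value = line.partition(":")
        fields[tag.strip()] = value.strip()
    return fields


# Assumed shape of a dimension: a label of capitalised words, a number, a unit.
DIM_PATTERN = re.compile(
    r"([A-Z][a-z]+(?: [A-Z][a-z]+)*) (\d+(?:\.\d+)?) (mm|cm|m)\b"
)


def extract_dimensions(dim_field):
    """Decompose a DIM field into {label: (value, unit)} subfields."""
    return {
        label: (float(number), unit)
        for label, number, unit in DIM_PATTERN.findall(dim_field)
    }


def parent_oid(oid):
    """Return the OID of the containing object, or None if there is no part suffix.

    Assumption: the part number is whatever follows the last hyphen.
    """
    head, sep, _ = oid.rpartition("-")
    return head if sep else None


fields = parse_fields(RECORD)
dimensions = extract_dimensions(fields["DIM"])
parent = parent_oid(fields["OID"])
```

Under these assumptions, the record in Figure 1 yields four dimensions and the parent OID H4448, while an OID without a hyphen, such as the one in Figure 2, has no parent.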

Description of the Balmoral Boots

Figure 3: A description of the Balmoral boots produced by the PowerTNG system

Clearly, the quality of the KB entries created will depend greatly on the database they are extracted from. For example, of the 15,483 records that we received in the database dump, only 9,887 (around 64% of the total) actually have an OBN field. Furthermore, of the 9,887 objects that have OBN fields, only 7,751 have valid object names, that is, names that appear in the museum thesaurus. Thus, about 50% of the database entries do not provide any usable information about the types of the objects.
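The proportions quoted above follow directly from the three record counts. The short calculation below simply reproduces that arithmetic; the variable names are ours.

```python
total_records = 15483   # records in the database dump
with_obn = 9887         # records that have an OBN field
valid_obn = 7751        # OBN values that appear in the museum thesaurus

share_with_obn = with_obn / total_records        # about 0.64
share_typed = valid_obn / total_records          # about 0.50
share_untyped = 1 - share_typed                  # about 0.50
share_valid_given_obn = valid_obn / with_obn     # about 0.78

print(f"{share_with_obn:.1%} of records have an OBN field")
print(f"{share_untyped:.1%} of records carry no usable type information")
```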

Even a high-quality entry, such as that in Figure 1, is problematic, as much of the information that we would like to put in a description is "locked up" in text. In Figure 1, we see this in the description of the construction of the boot. Although much recent work in Natural Language Processing is oriented towards extracting useful information from texts that have some similar characteristics, current techniques are simply not up to the task of performing this extraction reliably in such a way that the results are reusable for our purposes. We cannot simply use the text directly, as we cannot guarantee that the text will be well structured or even relevant. In addition, the text may contain information that the public viewing these descriptions is not meant to know. A more structured database would make this task significantly easier. Figure 3 shows the generated text for the database entry in Figure 1.

4. Navigating through a virtual document space

In the Peba-II and Power systems, the task of navigating the virtual document space was accomplished by selecting items of interest from a list and then following hypertext links around the object hierarchy. This is sufficient when there is only a small number of objects and a user can easily look through a list of them. When we move to a KB consisting of thousands of objects, however, we need to rethink our navigation strategy. It is, to say the least, infeasible (and perhaps impossible in current browsers) to have a user select an item of interest from a list of 15,000.

The obvious choice for a navigation strategy is to use the object hierarchy provided by the Powerhouse. This has the advantage that we can use a structure that was built by humans and which therefore should represent the actual relationships between the objects in the database. Given the thesaurus and the generated KB, we can produce a hierarchy that will allow a user to navigate from the top level to a specific object. Currently, this navigation hierarchy is a static set of pages that is generated when the KB is regenerated, but there is no reason why it cannot be dynamic too.
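As a rough illustration of how a thesaurus and a KB can be combined into a navigation hierarchy, the following sketch walks from an object up to the top-level term and renders one static page per term. The thesaurus fragment, the page layout, and all function names are hypothetical; the actual Powerhouse thesaurus and page generator are not shown in this paper.

```python
from collections import defaultdict

# Hypothetical thesaurus fragment: term -> broader term (None for a top-level term).
BROADER = {"Clothing": None, "Footwear": "Clothing", "Boots": "Footwear"}

# Hypothetical KB fragment: object ID -> thesaurus category from its OBN field.
CATEGORY = {"H4448-513": "Boots"}


def children_of(broader):
    """Invert the broader-term map so each term lists its narrower terms."""
    children = defaultdict(list)
    for term, parent in broader.items():
        if parent is not None:
            children[parent].append(term)
    return {parent: sorted(terms) for parent, terms in children.items()}


def path_to_object(oid, category, broader):
    """Return the chain a user follows from the top level down to the object."""
    path = [oid]
    term = category[oid]
    while term is not None:
        path.append(term)
        term = broader[term]
    return path[::-1]


def static_page(term, children, category):
    """Render one plain-text navigation page: narrower terms, then objects."""
    lines = [term]
    lines += [f"  class: {child}" for child in children.get(term, [])]
    lines += [f"  object: {oid}" for oid, cat in sorted(category.items()) if cat == term]
    return "\n".join(lines)


children = children_of(BROADER)
path = path_to_object("H4448-513", CATEGORY, BROADER)
```

Generating the pages dynamically would amount to calling a function like static_page on request rather than once per KB rebuild.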

As the location of the objects in the museum is also encoded in the database entries, we built a navigation hierarchy based on location. The location information includes the exhibit that an object is part of as well as the case in which it appears. We found, however, that such a hierarchy offered little help to users, since they had no idea what was in the exhibits, let alone which objects were gathered in something called "Case 6A".

While using the thesaurus-based navigation hierarchy allows a user to navigate to an object by traversing through classes and subclasses, it will not allow them to navigate directly from one object to another. Our aim in building the generation component was to make sure that any systematic relationships between objects (e.g., that one is a part of another, or that they were manufactured by the same artisan) are expressed in the virtual document as hypertext links. This is implemented simply as a set of inverted files to which the generation component has access.
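The idea of an inverted file can be sketched as a mapping from each (attribute, value) pair to the set of objects that share it. The KB fragment, object identifiers, and function names below are invented for illustration; the paper does not describe the actual file format.

```python
from collections import defaultdict

# Hypothetical KB fragment: object ID -> attribute -> list of values.
KB = {
    "obj-1": {"maker": ["Gundry & Sons"], "material": ["leather", "wood"]},
    "obj-2": {"maker": ["Gundry & Sons"], "material": ["leather"]},
    "obj-3": {"material": ["wood"]},
}


def build_inverted_index(kb):
    """Map every (attribute, value) pair to the set of objects that have it."""
    index = defaultdict(set)
    for oid, features in kb.items():
        for attribute, values in features.items():
            for value in values:
                index[(attribute, value)].add(oid)
    return index


def related_objects(index, kb, oid):
    """For one object, list the other objects sharing each of its feature values."""
    related = {}
    for attribute, values in kb[oid].items():
        for value in values:
            others = index[(attribute, value)] - {oid}
            if others:
                related[(attribute, value)] = sorted(others)
    return related


index = build_inverted_index(KB)
links = related_objects(index, KB, "obj-1")
```

Each key of the resulting dictionary corresponds to one candidate hypertext link from the description of the object to a list of related objects.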

Our initial approach was to inline all of these links into the text, but this led to some confusion about the discourse intention of following the links (see Milosavljevic and Oberlander (1998) for a discussion of dynamic hypertext as discourse). In the original Peba and Power systems, only the names of objects were linked. The discourse intention of following one of these links was clear: "describe this animal" or "describe this computer". However, when more than one kind of entity is linked, the intention is less clear. For example, if the user follows a link whose anchor is the word "leather", will he or she be taken to a description of the material leather or to a listing of the objects in the database that are made of leather?

Our solution was simply to make the discourse intentions for links other than links to named entities very explicit, leading to constructions such as "See other objects made of leather" appearing at the end of the descriptions of objects.

5. Changes in functionality

Although we are claiming to be porting the Peba-II/Power system to an automatically generated KB, in fact, we have completely re-written the system to take advantage of some of the lessons learned during the development of the previous systems. As with any re-write, there have been some changes in functionality.

5.1 Comparisons

Unlike the ancestor systems, the current system will not generate comparisons. There is no fundamental reason why this cannot be done, but there is no obvious way to introduce a comparison to the user. In the original systems, comparisons were generated either by selecting two objects from two lists of all the objects known to the system or by selecting another object when looking at a description of some object. Needless to say, this is ineffective when there are 15,000 objects to choose from.

Another factor limiting the scope for providing comparisons is that, in the hand-crafted KB used previously, the KB author was able to specify objects that could be confused with the given object, along with other useful properties that might play a role in comparison. These potential confusors are not available in the automatically generated KB. We are still exploring the appropriate use of comparisons in this context, but see Milosavljevic and Oberlander (1998) for more information.

5.2 Multilinguality

A major change for the better is that the current system is fully multilingual, with descriptions available in English, French, Spanish, Dutch, and Chinese. The main difficulty with the multilinguality is that one of the major components of the database is the names of the objects themselves. Translating these names to each of the target languages is a daunting (but not impossible) task. Currently, the English names of items are always shown. Figure 4 shows the text of the description shown in Figure 3 in French and Spanish.

Ces objects sont des `Balmoral boots'. Ils ont ete fabriques en entre 1870 et 1875. Ils ont ete fabriques a `London'. Ils sont en `leather', patent leather, glace kid, `linen' et `wood'. Les `Balmoral boots' sont 45 mm de hauteur, 255 mm de longeur, 55 mm de largeur total, 150 mm de hauteur total et 30 mm de largeur.
Estos son `Balmoral boots'. Fueron hechos entre los años 1870 y 1875. Fue producido en `London', Inglaterra. Están hechos de cuero, patent leather, glace kid, lino y madera. Los `Balmoral boots' tienen 45 mm de altura, 255 mm de largo, 55 mm de anchura total, 150 mm de altura total y 30 mm de ancho.

Figure 4: French and Spanish descriptions of the Balmoral boots

For some of the smaller sets of lexical items (e.g., materials and places), translations have been made and continue to be made. These are presented to the user in the target language. Adding a new language is straightforward, requiring the translation of about 400 materials and countries and the translation of the template sentences used by the generator.
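A minimal sketch of template-based multilingual generation, under our own assumptions, is given below. The French and Spanish sentence frames and the Spanish material names are taken from Figure 4; the English template, the data structures, and the function names are hypothetical and do not reproduce the actual generator. Untranslated items fall back to the quoted English name, as in Figure 4.

```python
# Sentence templates per language. The English one is a hypothetical stand-in.
TEMPLATES = {
    "en": "They are made of {materials}.",
    "fr": "Ils sont en {materials}.",
    "es": "Están hechos de {materials}.",
}

CONJUNCTION = {"en": "and", "fr": "et", "es": "y"}

# Partial lexicon: English term -> {language: translation}.
LEXICON = {
    "leather": {"es": "cuero"},
    "linen": {"es": "lino"},
    "wood": {"es": "madera"},
}


def translate(term, lang):
    """Return the target-language term, or the quoted English term if none exists."""
    if lang == "en":
        return term
    translated = LEXICON.get(term, {}).get(lang)
    return translated if translated else f"`{term}'"


def join_list(items, lang):
    """Join a non-empty list with commas and a language-specific conjunction."""
    if len(items) == 1:
        return items[0]
    return ", ".join(items[:-1]) + f" {CONJUNCTION[lang]} " + items[-1]


def describe_materials(materials, lang):
    """Fill the materials template for one language."""
    words = [translate(m, lang) for m in materials]
    return TEMPLATES[lang].format(materials=join_list(words, lang))


spanish = describe_materials(["leather", "linen", "wood"], "es")
french = describe_materials(["leather", "linen", "wood"], "fr")
```

In a design like this, adding a language means adding one entry per template and conjunction plus the lexicon translations, which is consistent with the effort described above.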

6. What's missing?

Clearly, we could improve the quality of the texts generated if we could do a better job of extracting information from the database of objects. Unfortunately, this is currently out of our hands. The experience would have been very different if the database had been more consistently marked up with XML or another representation language.

The lack of typed links on the Web leads us to produce texts that are probably less fluent than they could be. If it were possible to specify exactly what discourse intention would be understood by the user following a particular link, then we could do away with the lists of links at the bottom of each text that we produce. This would appear to be possible under the current extended linking proposal for XLink.

One of the advantages of a hand-crafted KB is that it is a relatively straightforward task to specify possible links between objects while building it. Although we could use a similarity metric to propose links between items based on the KB features that they have in common, the sparseness of our current data would mean that most such comparisons would be based on size. It may be that we can resort to using IR-like techniques (see Green, 1997) to automatically determine which objects are related, but this will most likely only work for the textually rich objects. By the same token, this would allow us to make use of data that we are currently ignoring.
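To make the notion of a feature-based similarity metric concrete, the sketch below uses the Jaccard coefficient over sets of KB features. This particular metric, the threshold, and the example objects are our assumptions; we have not settled on a metric, and the sketch is only meant to show the kind of computation involved.

```python
# Hypothetical KB fragment: object ID -> set of (attribute, value) features.
FEATURES = {
    "obj-a": {("material", "leather"), ("maker", "Gundry & Sons"), ("place", "England")},
    "obj-b": {("material", "leather"), ("place", "England")},
    "obj-c": {("material", "wood")},
}


def jaccard(a, b):
    """Shared features divided by all features; 0.0 when both sets are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def propose_links(oid, features, threshold=0.5):
    """Return (other object, score) pairs at or above the threshold, best first."""
    scored = [
        (other, jaccard(features[oid], feats))
        for other, feats in features.items()
        if other != oid
    ]
    kept = [(other, score) for other, score in scored if score >= threshold]
    return sorted(kept, key=lambda pair: (-pair[1], pair[0]))


proposals = propose_links("obj-a", FEATURES)
```

With sparse records, the feature sets would often contain little beyond dimensions, which is why such a metric would tend to propose links based mostly on size.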

Finally, there is no straightforward (or perhaps integrated) way to get an overview of the information space underlying the system. There is currently no way for a user to get some idea of how many things there are to look at or how to get to them.


References

(Bernard, 1990) J. Bernard, editor. 1990. The Macquarie Thesaurus. Macquarie Library, North Ryde NSW.

(Dale et al., 1998) Robert Dale, Jon Oberlander, Maria Milosavljevic and Alistair Knott. 1998. Integrating natural language generation and hypertext to produce dynamic documents. Interacting with Computers, 11(2). 15 December 1998.

(Green, 1997) Stephen J. Green. Automated link generation: Can we do better than term repetition? In Proceedings of the Seventh International World Wide Web Conference, Brisbane, Australia, April 1998, pp. 75-84.

(Hitzeman et al., 1997) J. Hitzeman, C. Mellish and J. Oberlander. 1997. Dynamic generation of museum web pages: The intelligent labelling explorer. Archives and Museum Informatics, 11:105-112.

(Milosavljevic et al., 1996) Maria Milosavljevic, Adrian Tulloch and Robert Dale. 1996. Text Generation in a Dynamic Hypertext Environment. In Proceedings of the Nineteenth Australasian Computer Science Conference (ACSC'96), Melbourne, Australia, 31 January - 2 February 1996, pp. 417-426.

(Milosavljevic and Oberlander, 1998) Maria Milosavljevic and Jon Oberlander. 1998. Dynamic Hypertext Catalogues: Helping Users to Help Themselves. In Proceedings of the 9th ACM Conference on Hypertext and Hypermedia, Pittsburgh, PA, USA, 20-24 June 1998.

(Reiter and Dale, 1999) Ehud Reiter and Robert Dale. 1999. Building Natural Language Generation Systems. Cambridge University Press.