A Key for Enhanced Hypertext Functionality
and Virtual Documents: Knowledge

Philippe Martin and Peter Eklund
Griffith University, School of Information Technology, PMB 50 Gold Coast MC, QLD 9726 Australia
Tel: +61 7 5594 8271; Fax: +61 7 5594 8066; E-mail: {philippe.martin,p.eklund}@gu.edu.au

Position paper for the workshop "Virtual Documents, Hypertext Functionality and the Web"
at the 8th International World Wide Web Conference.



Table of contents

  1. What the users want: precise and organized information, not documents
  2. Easing knowledge representation
  3. Storing knowledge and commands in documents
  4. Storing knowledge and commands in distributed scalable knowledge servers
  5. Conclusion



1. What the users want: precise and organized information, not documents

Web search engines - such as Altavista [1] or Infoseek [2] - retrieve entire documents based on the keywords they include. They exploit undirected Web robots to periodically traverse and index Internet/intranet documents. Directed Web robots - such as Harvest [3], WebSQL [4] and WebLog [5] - apply string-matching and structure-matching commands (e.g. hypertext path expressions) to explore an intranet or a small subset of the Internet and retrieve entire documents or parts of them. However, people are often not looking for lists of documents but either for a precise answer to a precise query, or for a structured presentation of information related to a certain object such as a particular event, technique, software type, idea or person. For example, someone looking for "large-scale deductive database systems" does not want a giant list of references to conferences, articles and courses on database systems, or home pages and user manuals of specific database systems. S/he first wants a classification of the features that such systems may have, and may then ask for a classification of existing tools according to some of these features, e.g. the kinds of query language, exploited techniques, API, memory and performance characteristics, multi-user support, reliability and license.

Though such precise information and comparisons are important for each person interested in using deductive database systems, it is a long and difficult task for that person to collect the information just by reading documents. However, it is not necessarily difficult for each provider of information on an object to represent this information in a document or a shared knowledge repository so that it can be retrieved - and, to a certain extent, merged or composed - via conceptual commands. As opposed to string-matching and structure-matching commands, conceptual commands rely on logical inferences (e.g. exploitation of subsumption relations between terms in the knowledge statements) and improve both precision and recall in information retrieval. They may also be combined with other commands within scripts or usual documents to create virtual documents.
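
The difference between string matching and subsumption-based retrieval can be illustrated with a minimal Python sketch. It is not WebKB code: the small type hierarchy, the index and the tool names "ExampleDatalogTool" and "ExampleSqlTool" are hypothetical, and only the classification of Aditi comes from the example given later in this paper.

```python
# Hypothetical toy ontology: each term maps to its direct supertypes.
SUPERTYPES = {
    "large-scale deductive database system": ["deductive database system"],
    "Datalog engine": ["deductive database system"],
    "deductive database system": ["database system"],
    "relational database system": ["database system"],
    "database system": [],
}

# Hypothetical index: object name -> the term it is declared an instance of.
INDEX = {
    "Aditi": "large-scale deductive database system",
    "ExampleDatalogTool": "Datalog engine",
    "ExampleSqlTool": "relational database system",
}


def is_subsumed_by(term, query_term):
    """True if query_term is term itself or one of its transitive supertypes."""
    stack, seen = [term], set()
    while stack:
        current = stack.pop()
        if current == query_term:
            return True
        if current not in seen:
            seen.add(current)
            stack.extend(SUPERTYPES.get(current, []))
    return False


def keyword_search(query):
    """String matching: keep objects whose indexing term contains the query."""
    return sorted(name for name, term in INDEX.items() if query in term)


def conceptual_search(query_term):
    """Conceptual retrieval: keep objects whose term is subsumed by the query."""
    return sorted(name for name, term in INDEX.items()
                  if is_subsumed_by(term, query_term))


if __name__ == "__main__":
    print(keyword_search("deductive database system"))
    print(conceptual_search("deductive database system"))
```

In this toy setting, the keyword search misses the object indexed with "Datalog engine" because the query string does not occur in that term, whereas the conceptual search finds it through the subsumption relation.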


2. Easing knowledge representation

The easiest way to express information is in natural languages. However, outside limited domains, these languages are too ambiguous for the semantic content of sentences to be automatically extracted. We argue in our article for the WWW8 conference [6] that general and intuitive knowledge representation languages or derived simpler notations (e.g. a "controlled language", i.e. a subset of natural language that eliminates sources of ambiguity) are preferable to metadata languages [7] based on XML [8] (e.g. RDF [9] and OML [10]) for indexing Web documents and representing knowledge within them. Indeed, the retrieval of precise information is eased by a language designed to represent semantic content and support logical inference, and the readability of such a language eases its exploitation, presentation and direct insertion within a document (thus also avoiding information duplication).

XML is mainly meant to be generated and read by machines, not people: it is a machine-readable rather than human-readable language. XML-based metadata languages inherit this poor readability and most of them (e.g. RDF) do not specify how to represent logical operators or quantifiers. As an alternative, WebKB proposes to use expressive but intuitive knowledge representation languages to represent (or index) information in documents and mix knowledge statements with other textual elements (e.g. sentences, sections or references to images). To allow this, the knowledge (or commands exploiting it) must be enclosed within the HTML tags "<KR>" and "</KR>" or the strings "$(" and ")$". The knowledge representation language used in each chunk must be specified at its beginning, e.g.: "<KR language="CG">". (Lexical/structural/procedural commands may be used whichever language is specified.) Thus, there is no need to separate knowledge from its documentation or to duplicate it in an external knowledge base.
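
To make the delimiter convention concrete, here is a minimal Python sketch of how a tool might locate such chunks in a document. It is an illustration under stated assumptions, not the WebKB parser: the function name and regular expressions are hypothetical, and since this paper does not show how a language is declared for the "$(" ... ")$" form, the sketch simply reports no language for those chunks.

```python
import re

# Chunks delimited by <KR language="..."> ... </KR>
KR_TAG = re.compile(r'<KR\s+language="([^"]+)"\s*>(.*?)</KR>',
                    re.DOTALL | re.IGNORECASE)
# Chunks delimited by $( ... )$
KR_DOLLAR = re.compile(r'\$\((.*?)\)\$', re.DOTALL)


def extract_kr_chunks(document):
    """Return (language, body) pairs for each delimited chunk, in document order.

    The language is None for "$(" ... ")$" chunks (assumption of this sketch).
    """
    found = []
    for match in KR_TAG.finditer(document):
        found.append((match.start(), match.group(1), match.group(2).strip()))
    for match in KR_DOLLAR.finditer(document):
        found.append((match.start(), None, match.group(1).strip()))
    found.sort(key=lambda item: item[0])  # restore document order
    return [(language, body) for _, language, body in found]


if __name__ == "__main__":
    doc = ('Some ordinary sentence. <KR language="CG">Age < Property;</KR> '
           'More text $(Cousin(Person,Person) {Relation type Cousin};)$ the end.')
    for language, body in extract_kr_chunks(doc):
        print(language, "|", body)
```

Everything outside the returned chunks is ordinary document text, which is what lets knowledge statements and their documentation live in the same file.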

At present, WebKB only exploits the CG (Conceptual Graph) formalism. However, the exploitation of wrappers (e.g. KIF to CGs) or other inference engines would allow WebKB to accept other knowledge representation languages. To compare the alternatives, here is an example showing how a simple sentence may currently be represented in WebKB, how it could be represented in KIF, and what its RDF representation is. The sentence is: "John believes that Mary has a cousin who has the same age as her".

<KR language="CG">
load "http://www.bar.com/topLevelOntology";  //Import this ontology
Age < Property;  //Declare Age as a subtype of Property
Cousin(Person,Person) {Relation type Cousin};  //Declare relation Cousin with its signature
[Statement: [Person: "Mary"]-
              { ->(Chrc)->[Age: *a];
                ->(Cousin)->[Person]->(Chrc)->[*a];
              }
]->(Believer)->[Person: "John"];
</KR>

<KR language="KIF">
load "http://www.bar.com/topLevelOntology";  //the WebKB command for file interpretation
(Define-Ontology Example (Slot-Constraint-Sugar topLevelOntology))
(Define-Class Age (?X) :Def (Property ?X))
(Define-Relation Cousin(?s ?p) "Relation type Cousin"
   :Def (And (Person ?s) (Person ?p)))
(Exists ((?j Person))
  (And (Name ?j John)
       (Believer ?j '(Exists ((?m Person) (?p Person) (?a Age))
                        (And (Name ?m Mary) (Chrc ?m ?a)
                             (Cousin ?m ?p) (Chrc ?p ?a) )) )))
</KR>

<!-- RDF notation (with allowed abbreviations); this file is named "example" -->
<RDF xmlns:rdf="http://www.w3.org/TR/WD-rdf-syntax#"
     xmlns:t="http://www.bar.com/topLevelOntology">
  <Class ID="Age">
    <subClassOf resource="t#Property"/>
  </Class>
  <PropertyType ID="Cousin">
    <comment>Relation type Cousin</comment>
    <range resource="t#Person"/>
    <domain resource="t#Person"/>
  </PropertyType>
</RDF>
<RDF xmlns="http://www.w3.org/TR/WD-rdf-syntax#"
     xmlns:t="http://www.bar.com/topLevelOntology"
     xmlns:x="http://www.bar.com/example">  <!-- x refers to this file -->
  <Description aboutEach="#Statement_01">
    <t#Believer>John</t#Believer>
  </Description>
  <t#Person bagID="Statement_01">
    <t#Name>Mary</t#Name>
    <x#Chrc><x#Age ID="age"></x#Age></x#Chrc>
    <x#Cousin><t#Person><x#Chrc resource="#age"/></t#Person></x#Cousin>
  </t#Person>
</RDF>

The CG representation (top) seems simpler than the others. The semantic network structure of CGs (i.e. concepts connected by relations) has three advantages:

    (i) it restricts the formulation of knowledge without compromising expressivity and this tends to ease knowledge comparison from a computational viewpoint;
    (ii) it encourages the users to express relations between concepts (as opposed, for instance, to languages where "slots" of frames or objects can be used);
    (iii) it permits a better visualization of relations between concepts.
We advocate the use of Conceptual Graphs (CGs) [11] and simpler notational variants that enhance knowledge readability (e.g. we have also developed a formalized English and a structured text notation). To further ease the representation process, we propose (i) a technique allowing users to leave some knowledge terms undeclared, and (ii) a top-level ontology of 400 concept and relation types. We have implemented a knowledge-based directed Web robot named WebKB to parse and execute our notations, knowledge handling and retrieval commands, Web document handling commands, and script language (to combine groups of commands). This tool is accessible as a CGI server. The WebKB site [12] provides HTML+Javascript interfaces.

Various kinds of applications of knowledge representation, indexation and queries are illustrated by examples on the WebKB site. Here is how some information on the Aditi database system could be represented in one of the structured text notations accepted by WebKB (the information is extracted from the "Catalog of free database systems" [13]). Relations between each term used in this knowledge statement and other terms may be similarly defined elsewhere (in other documents or shared knowledge repositories) by one or several other users. Then, for example, subsumption relations between terms may be exploited for conceptual retrieval.

[Aditi.
  isa: large-scale deductive database system;
  user interface: NU-Prolog, graphical interface (implemented with: Motif);
  index method: B-trees, multi-level signature files;
  ports: SunOS, IRIX;
](representation date: 1992/12/17; representation author: aditi@cs.mu.oz.au).
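
To suggest how such a statement becomes machine-exploitable, the following Python sketch turns the example above into (subject, relation, value) triples. It is a deliberately simplified reader written for this single example, not the WebKB parser: the function name is hypothetical, the nested qualifier "(implemented with: Motif)" is kept as part of its value, and the trailing representation date and author are ignored.

```python
ADITI_STATEMENT = """[Aditi.
  isa: large-scale deductive database system;
  user interface: NU-Prolog, graphical interface (implemented with: Motif);
  index method: B-trees, multi-level signature files;
  ports: SunOS, IRIX;
](representation date: 1992/12/17; representation author: aditi@cs.mu.oz.au)."""


def parse_statement(text):
    """Return (subject, relation, value) triples from one bracketed statement.

    Simplifying assumptions: one statement per text, the subject ends at the
    first ".", clauses end with ";", and values are separated by ",".
    """
    inner = text[text.index("[") + 1:text.rindex("]")]
    subject, _, body = inner.partition(".")
    triples = []
    for clause in body.split(";"):
        if ":" not in clause:
            continue  # skip the empty remainder after the last ";"
        relation, _, values = clause.partition(":")
        for value in values.split(","):
            triples.append((subject.strip(), relation.strip(), value.strip()))
    return triples


if __name__ == "__main__":
    for triple in parse_statement(ADITI_STATEMENT):
        print(triple)
```

Once in this form, the "isa" triple is the kind of term that subsumption relations defined elsewhere could exploit for conceptual retrieval.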



3. Storing knowledge and commands in documents

It is handy for an information provider to store and structure knowledge inside Web documents, especially if the duplication of information into machine-readable statements and human-only readable statements can be avoided (e.g. by using a controlled language [14] for sentences and a visual language [15] for graphics) or at least reduced by the possibility of mixing and linking the two kinds of statements. To allow this, WebKB exploits the convention that each group of knowledge statements or commands in a document must be delimited by the two special HTML tags "<KR>" and "</KR>" or the strings "$(" and ")$". The knowledge representation language used in each group must be specified at its beginning, e.g.: "<KR language="CG">". Each group is visible unless the document's author hides it with HTML comment tags. Furthermore, various notations allow people to use knowledge statements for indexing any part of any Web document (not just parts which can be referred to by URLs). Thus, knowledge statements may be retrieved and handled via document-based commands, and conversely indexed parts of documents may be retrieved and handled via knowledge-based commands.

When a command sent to the WebKB CGI server requires it to "run" a Web document (referred to by a URL), the server retrieves the document and executes the knowledge statements and commands within it (some commands may be to run other Web documents). The results are sent back to the client and constitute a generated document (hence, a virtual document). Depending on a parameter, the WebKB server may or may not send back the human-only readable statements along with the results (if it does, the generated document is a copy of the original document with the query results in place of the commands). Similarly, the WebKB server may be used to exploit other CGI servers. Within HTML documents, dynamic linking may be achieved by using Javascript [16] to associate a command with a hypertext link in such a way that the command is sent to the WebKB server when the link is activated.
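
The generation of a virtual document by "running" a document can be sketched in Python as follows. This is only an illustration of the substitution step, under stated assumptions: `run_document`, its `keep_text` parameter and the `fake_execute` stand-in for the knowledge engine are hypothetical names, only the "<KR>" delimiter form is handled, and no document is actually fetched from a URL.

```python
import re

KR_CHUNK = re.compile(r'<KR\s+language="([^"]+)"\s*>(.*?)</KR>', re.DOTALL)


def run_document(document, execute, keep_text=True):
    """Build a virtual document from a document containing KR chunks.

    execute(language, body) -> str stands in for the knowledge engine.
    With keep_text=True the result is a copy of the document with each chunk
    replaced by its result; with keep_text=False only the results are kept.
    """
    if keep_text:
        return KR_CHUNK.sub(
            lambda m: execute(m.group(1), m.group(2).strip()), document)
    return "\n".join(execute(m.group(1), m.group(2).strip())
                     for m in KR_CHUNK.finditer(document))


def fake_execute(language, body):
    """Hypothetical executor: echoes what a real engine would have evaluated."""
    return "[" + language + " result for: " + body + "]"


if __name__ == "__main__":
    doc = 'Overview. <KR language="CG">Age < Property;</KR> End.'
    print(run_document(doc, fake_execute, keep_text=True))
    print(run_document(doc, fake_execute, keep_text=False))
```

The `keep_text` switch mirrors the server parameter described above: the human-only readable statements are either returned around the results or left out.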

As with any other directed Web robot, the scalability and efficiency of the current WebKB is limited by the fact that (i) the users must know which documents contain (or may contain) the knowledge to exploit, and (ii) these documents must be accessed and parsed each time their content has to be exploited. Pieces of knowledge, like Web documents, may be provided by all Web users, and need to be inter-related or integrated, to allow each user to benefit from the knowledge of users they do not know. For this purpose a cooperatively built knowledge repository is necessary.



4. Storing knowledge and commands in distributed scalable knowledge servers

Some Web servers, called ontology servers, support shared knowledge repositories, e.g. the Ontolingua ontology server [17] and Ontosaurus [18]. However, they are not usable for managing large quantities of knowledge and, apart from AI-Trader [19], do not allow the indexation and retrieval of parts of documents. Finally, support of cooperation between the users is essentially limited to consistency enforcement, annotations and structured dialogues, as in APECKS [20], Co4 [21] and Tadzebao [22].

We are extending WebKB to handle a cooperatively built knowledge repository which addresses scalability via the following five points [23]:

    (i) a scalable multi-user persistent object repository to support the storage and exploitation of knowledge structures (we have chosen the Shore [24] system);
    (ii) algorithms allowing the efficient exploitation of large-scale dynamic taxonomies (we have chosen Fall's algorithms [25]);
    (iii) visualization techniques (mainly the handling of aliases for terms and the generation of views) to avoid lexical conflicts and enable users to focus on certain kinds of knowledge;
    (iv) protocols to allow users to solve semantic conflicts via the insertion of new terms and relations in the common ontology and, in some cases, in the knowledge of other users;
    (v) conventions for representing knowledge to improve the automatic comparison of knowledge from different users and hence their consistency and retrieval.

Though these five points permit the exploitation of a large knowledge repository (essential for efficiency reasons and practical use), it is also clear that, for efficiency and reliability reasons, a single server cannot handle a universal knowledge repository used by all Web users. Knowledge has to be distributed and mirrored on various knowledge servers. However, since there are no static conceptual schemas in knowledge bases, the techniques of distributed database systems - such as AlephWeb [26], Hermes [27], Infomaster [28] and TSIMMIS [29] - cannot all be reused.

A first step to the distribution of a knowledge repository is to duplicate it on several servers, with updates made on a server automatically duplicated in other servers. Some servers may be dedicated to searches and others to updates.
A second step is to have general servers and specialized servers. A specialized server would store the same knowledge as general servers plus knowledge related to a well-defined set of objects, e.g. knowledge expressed with the subtypes of certain types. Since these sets of objects are well-defined (extensionally or via definitions), a general server would store the URLs of these servers and, when answering a query, delegate the query to the relevant servers if more precision is required. These sets of objects might be determined by the managers of specialized servers, or according to the frequency of accesses to objects in knowledge repositories. Whichever specialized server a user updates, if the knowledge s/he enters is relevant to other servers (e.g. if the knowledge is expressed with general terms), it should be automatically duplicated in these servers. The rationale of all this duplication is to speed up searches and simplify the query mechanisms by avoiding, whenever possible, parallel searches in various servers followed by the composition of the results.
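
The two steps above can be sketched in Python as follows. This is an in-memory illustration under stated assumptions, not a design for the actual servers: the class and function names, the representation of a server's scope as a set of type names, the `general_types` set and the sample statements are all hypothetical, and network access via URLs is left out.

```python
class KnowledgeServer:
    """A server holding (type, statement) pairs; an empty scope means 'general'."""

    def __init__(self, name, scope=()):
        self.name = name
        self.scope = frozenset(scope)   # types this server specializes in
        self.statements = []

    def is_relevant(self, type_name, general_types):
        # General knowledge is relevant to every server; specialized
        # knowledge only to the servers whose scope includes its type.
        return type_name in general_types or type_name in self.scope


def propagate_update(servers, type_name, statement, general_types):
    """Duplicate an update on every server for which it is relevant."""
    stored_on = []
    for server in servers:
        if server.is_relevant(type_name, general_types):
            server.statements.append((type_name, statement))
            stored_on.append(server.name)
    return stored_on


def answer(general, specialized_servers, type_name, more_precision=True):
    """Answer from one server only, delegating when more precision is required."""
    target = general
    if more_precision:
        for server in specialized_servers:
            if type_name in server.scope:
                target = server
                break
    return target.name, [s for t, s in target.statements if t == type_name]


if __name__ == "__main__":
    general = KnowledgeServer("general")
    db = KnowledgeServer("db-specialized", scope={"deductive database system"})
    general_types = {"software"}
    propagate_update([general, db], "software", "Aditi is software", general_types)
    propagate_update([general, db], "deductive database system",
                     "Aditi uses B-trees", general_types)
    print(answer(general, [db], "deductive database system"))
    print(answer(general, [db], "software"))
```

Because general knowledge is duplicated on the specialized server, each query in this sketch is answered by a single server, with no parallel search or composition of results.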

Other steps may be necessary, but what should be avoided in this knowledge-based (hence precision-oriented) approach is to let the specialized servers develop independently of one another instead of being part of a single consistent virtual knowledge repository. Otherwise, conceptual queries and cooperation across the repositories are no longer possible and, as in current traders, the most relevant repository to answer a query has to be automatically "guessed".

Finally, knowledge servers should not be limited to storing knowledge statements: they should also allow a storage and handling of knowledge-based and document-based commands similar to the storage and handling we described for documents.



5. Conclusion

The more precisely a piece of information is represented, the more adequately it can be retrieved and exploited. General and intuitive knowledge representation languages seem best adapted to this end. WebKB permits the use of Conceptual Graphs as well as simpler notations when less expressivity or precision is required. Ambiguities due to undeclared terms are partially resolved according to the constraints in the ontologies used.

Storing knowledge within documents is useful but the scalability of this approach is limited. Ultimately, we believe a knowledge-based Web relies on scalable, distributed, cooperatively built knowledge repositories and automated knowledge acquisition techniques. We have proposed (and are working on) some directions towards this goal. In this view, knowledge-annotated documents are used as isolated modules of knowledge on which a user can work before submitting content to a knowledge server for integration. A document including commands can also be sent to a knowledge server as a template for generating virtual documents. Of course, scripts of commands could also be stored in a repository handled by a knowledge server and referred to from a document. We are currently extending WebKB to allow for these combinations of features.

In the same way we register a Web site today, we will probably register knowledge representations (or documents including knowledge representations) and complement or refine one another's knowledge.




References

1. Altavista: http://www.altavista.digital.com/
2. Infoseek: http://www.infoseek.com
3. Harvest: http://harvest.transarc.com/
4. WebSQL: http://www.cs.toronto.edu/~websql/
5. WebLog: http://www.cs.concordia.ca/~special/bibdb/weblog.html
6. WWW8 article: http://meganesia.int.gu.edu.au/~phmartin/WebKB/doc/papers/www8/www8.ps
7. Metadata languages: http://www.w3.org/Metadata/
8. XML: http://www.w3.org/XML/
9. RDF: http://www.w3.org/RDF/
10. OML: http://wave.eecs.wsu.edu/CKRMI/OML.html
11. CGs: http://meganesia.int.gu.edu.au/~phmartin/WebKB/doc/CGs.html
12. WebKB: http://meganesia.int.gu.edu.au/~phmartin/WebKB/
13. Databases: http://www.cis.ohio-state.edu/hypertext/faq/usenet/databases/free-databases/faq.html
14. Controlled languages: http://www-uilots.let.uu.nl/Controlled-languages/
15. Visual languages: http://www.cpsc.ucalgary.ca/~kremer/home.html#visualLanguages
16. Javascript: http://developer.netscape.com/docs/manuals/communicator/jsref/index.htm
17. Ontolingua ontology server: http://WWW-KSL-SVC.stanford.edu:5915/
18. Ontosaurus: http://www.isi.edu/isd/ontosaurus.html
19. AI-Trader: http://www.vsb.informatik.uni-frankfurt.de/projects/aitrader/intro.html
20. APECKS: http://www.psychology.nottingham.ac.uk/staff/Jenifer.Tennison/APECKS/
21. Co4: http://ksi.cpsc.ucalgary.ca/KAW/KAW96/euzenat/euzenat96b.html
22. Tadzebao: http://ksi.cpsc.ucalgary.ca:80/KAW/KAW98/domingue/
23. Repository ideas: http://meganesia.int.gu.edu.au/~phmartin/WebKB/doc/coopKBbuilding.html
24. Shore: http://www.cs.wisc.edu/shore/
25. Fall's algorithms: http://www.cs.sfu.ca/cs/people/GradStudents/fall/personal/index.html
26. AlephWeb: http://www.pangea.org/alephweb.aleph/paper.html
27. Hermes: http://www.cs.umd.edu/projects/hermes/
28. Infomaster: http://infomaster.stanford.edu/infomaster-info.html
29. TSIMMIS: http://www-db.stanford.edu/tsimmis/tsimmis.html