Ontologies for Sharing, Ontologies for Use
David Shotton §, Chris Catton and Graham Klyne
Image Bioinformatics Research Group, Department of Zoology, University of Oxford, South Parks Road, Oxford OX1 3PS, UK
§ Corresponding author: e-mail: firstname.lastname@example.org
“An ontology is a formal, explicit specification of a shared conceptualisation.”
Although ontologies have been created to cover a large number of domains, many of them have characteristics that make them difficult to reuse in other contexts. If ontologies are to be reused easily by others, they should be constructed as sets of small modules, each of which is a simple well-defined subsumption hierarchy that is not over-burdened by domain and range constraints. This will make it easy for others to re-use the ontology without incurring unintended logical entailments. In contrast, when using ontologies for particular applications such as the support of knowledge management systems, they may need to be extended and constrained. This should be done in a manner that does not obscure the interpretation of core metadata items expressed using the extended ontology, when these are read by applications that have no knowledge of those extensions.
Current ontology development practice
Development of the semantic web (http://www.scientificamerican.com/article.cfm?id=the-semantic-web; http://doi.ieeecomputersociety.org/10.1109/MIS.2006.62) involves the creation of vocabularies and ontologies designed to describe particular domains of knowledge. It is desirable that such an ontology, ideally encapsulating a consensus view among domain experts, can be used as part of a practical information management application, and can also be easily shared with the wider community. This paper examines the conflicting design requirements for these two functions, and summarizes a design approach.
Gruber (http://tomgruber.org/writing/onto-design.htm) stated:
‘An ontology should require the minimal ontological commitment sufficient to support the intended knowledge sharing activities … An ontology serves a different purpose than a knowledge base, and therefore a different notion of representational adequacy [McCarthy and Hayes, 1969] applies’.
However, in the current wave of interest in the semantic web and ontologies, this distinction has been blurred. Some ontologies available on the web are too tightly specified to be easily shared, and are closer in spirit to knowledge bases than to Gruber’s idea of an ontology.
We believe this is the result of two forces:
a) The current state of tools and standards for ontology development means that we have only just reached the point where different ontologies can be imported and made to work together easily and reliably.
b) In developing the semantic web, much effort has been expended on producing stand-alone ontologies for use in specific systems, but less on designing them for general applicability so that they may be widely and easily shared.
If the semantic web is to become more than a collection of isolated knowledge-base systems, we need to be able to re-use and combine ontologies more easily. There are several reasons for using ontologies created by others rather than writing them ourselves: it saves time and money, it permits each domain to be described accurately by appropriate domain specialists, and most importantly perhaps, it permits larger communities to come to a common agreement on the meaning of specific terms.
However, many ontologies are difficult or impossible to re-use, because they are too complex or over-committed to a particular application, contain inadequate or incomplete class annotations, are poorly structured, express a philosophy that is at odds with that of the importing ontology, or are over-restricted.
a) Ontology too complex for easy re-use. Many ontologies are very large. For example, CYC (http://www.cyc.com/cyc/technology/whatiscyc) contains over a million assertions, the Gene Ontology (GO) has almost 30,000 terms, while the National Cancer Institute Thesaurus (http://www.mindswap.org/2003/CancerOntology/; http://ncimeta.nci.nih.gov) in 2003 contained about 26,000 concepts and about 71,000 terms (http://www.mindswap.org/papers/WebSemantics-NCI.pdf). Large size is appropriate when the ontology is an application ontology used to describe a large domain (e.g. GO): such ontologies benefit from having simple structures. Possibilities for re-use are limited if the ontology in question is too complex or unwieldy. If the need is just to be able to re-use a few classes, importing a large existing ontology bodily into a new ontology is problematic, since it obscures structure and takes up resources.
Our experience has been that existing published ontologies may also suffer from trivial faults:
- they may contain a rich mixture of dissimilar class types that are not clearly differentiated;
- they may contain class duplications and logical inconsistencies that have gone unrecognized because of the ontology’s large size; and
- they may contain classes that properly belong in separate third-party ontologies, not in the domain-specific ontology under consideration.
In contrast, a number of vocabularies, including Dublin Core (http://dublincore.org/documents/dcmi-terms/), FOAF (http://xmlns.com/foaf/spec/) and SIOC (http://sioc-project.org/ontology), which might be described as generic rather than domain-specific, have found widespread use by virtue of their simplicity. The distinction between application ontologies and reference ontologies is clarified in a related article, Reference and Application Ontologies, by James Malone and Helen Parkinson.
b) Ontology annotations incomplete or inadequate. Ontology class annotations provide an important link between human- and machine-processable knowledge, and for this reason are important components of any ontology. In some ontologies, the vagueness or inappropriate implied meaning of some of the class names used, coupled with inadequate annotation, renders them confusing. Other ontologies, while having class names that might seem appropriate for re-use, fail to provide annotations for their classes and properties, or use annotations that are inadequate. For example, the ka (Knowledge Acquisition) ontology (http://www.cs.man.ac.uk/~horrocks/OWL/Ontologies/ka.owl) contains no human-readable descriptions of the classes and properties, making it difficult to use without extensive cross-referencing to other documents, while in BIBO, the bibliographic ontology (http://bibliontology.com/), the class Standard is inadequately annotated as “A document describing a standard”. Many other annotations are taken directly from Wikipedia, and are not always appropriate. CIDOC CRM (http://cidoc.ics.forth.gr/) is an example of an ontology with generally good annotations.
c) Ontology poorly structured. For example, in the Bibtex ontology (http://oaei.ontologymatching.org/2004/Contest/301/onto.html) the hasAuthor property is a datatype property with a range of String, which effectively prevents one from identifying an author as a member of the class Person. Furthermore, it cannot be used to describe an ordered list of authors in an easily machine-processable fashion.
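The difference can be sketched in Python. This is an illustration only, not the Bibtex ontology's own model: the Person and Document classes, their fields and the example names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Person:
    """Hypothetical stand-in for a member of a Person class."""
    family_name: str
    given_name: str


@dataclass
class Document:
    title: str
    # String-valued authorship, as with a datatype property whose range is String.
    author_string: str = ""
    # Object-valued authorship: an ordered list of Person instances.
    authors: list = field(default_factory=list)


# With a plain string, a program has to guess how to split and order the names.
flat = Document(title="Example paper",
                author_string="A. Author, B. Writer and C. Scribe")

# With Person objects, both identity and order are explicit.
structured = Document(
    title="Example paper",
    authors=[Person("Author", "A."), Person("Writer", "B."), Person("Scribe", "C.")],
)

first_author = structured.authors[0]
print(first_author.family_name)
```

In the structured form the first author can be retrieved by position and compared with other Person records, neither of which is reliable when authorship is a single free-text string.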
d) Ontology expresses an alternative philosophy. Occasionally, different ontologies express alternative views about a domain, making it difficult to use elements of one within the other. For example, CiTO, the Citation Typing Ontology (http://purl.org/net/cito/), adopts the Works, Expressions, Manifestations hierarchy of the FRBR (Functional Requirements for Bibliographic Records; http://www.ifla.org/VII/s13/frbr/frbr1.htm) classification model, developed by the International Federation of Library Associations and Institutions (IFLA), to characterize different aspects of a publication, whereas BIBO does not. While CiTO has Work: cito:ResearchPaper; Expression: cito:JournalArticle, BIBO has bibo:AcademicArticle, which conflates these two concepts. This makes it difficult to re-use BIBO classes within CiTO.
e) Ontology too committed. Perhaps the main problem is that many ontologies commit secondary users to modelling patterns that may be inappropriate to their needs. For example, there is no consensus among bibliographic ontologies as to whether a document is authored by a set of people, an ordered list of people, or an ‘agent’ that may be a corporate body, situations brought about by varying restrictions on the range of the property hasAuthor. This point relates to the previous one, since over-commitment becomes a problem in the face of alternative philosophies.
Thus, although different ontological representations now abound, many pose problems when it comes to re-using them in other ontologies. The alternative approach of defining equivalences between terms in different ontologies suffers from some of the same problems, since use of owl:equivalentClass is logically strict. Strict equivalence is inappropriate if the definitions of the classes within the two ontologies differ significantly. For example, in FRBR a Work is a distinct intellectual or artistic creation, an abstract concept recognised through its various expressions. However, in CiTO the definition of cito:Work is restricted to works that cite or may be cited, primarily works of scholarship that contain bibliographic references, rather than artistic works such as plays or photographs that do not. Thus cito:Work is classified as a subclass of frbr:Work, not an equivalent class. Such subclassing should be undertaken with caution, since it implies logical entailment of any domain and range restrictions of the superclass. An alternative is just to indicate that some sort of relationship exists between classes in two ontologies, for example by use of rdfs:seeAlso (http://www.w3.org/TR/rdf-schema/).
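How such entailments arise can be illustrated with a deliberately simplified sketch in Python. It implements only three RDFS inference rules (domain, range and subclass membership) over a set of string triples, and is not a substitute for a real reasoner. The ex: names, including ex:hasAuthor and its range of ex:Person, are hypothetical.

```python
TYPE = "rdf:type"
SUBCLASS = "rdfs:subClassOf"
DOMAIN = "rdfs:domain"
RANGE = "rdfs:range"


def rdfs_closure(triples):
    """Apply three RDFS rules repeatedly until no new triple is derived."""
    facts = set(triples)
    while True:
        new = set()
        for s, p, o in facts:
            for s2, p2, o2 in facts:
                if p2 == DOMAIN and s2 == p:
                    new.add((s, TYPE, o2))  # subject falls in the property's domain
                if p2 == RANGE and s2 == p:
                    new.add((o, TYPE, o2))  # object falls in the property's range
                if p == TYPE and p2 == SUBCLASS and s2 == o:
                    new.add((s, TYPE, o2))  # instance of a subclass
        if new <= facts:
            return facts
        facts |= new


closure = rdfs_closure({
    ("ex:hasAuthor", RANGE, "ex:Person"),       # hypothetical range constraint
    ("cito:Work", SUBCLASS, "frbr:Work"),
    ("ex:paper1", TYPE, "cito:Work"),
    ("ex:paper1", "ex:hasAuthor", "ex:AcmeCorp"),
})

# The same data, with the two classes related only by rdfs:seeAlso.
linked_only = rdfs_closure({
    ("cito:Work", "rdfs:seeAlso", "frbr:Work"),
    ("ex:paper1", TYPE, "cito:Work"),
})

print(("ex:AcmeCorp", TYPE, "ex:Person") in closure)
print(("ex:paper1", TYPE, "frbr:Work") in linked_only)
```

In this sketch the corporate author is inferred to be a Person purely because of the range constraint, and the paper becomes an frbr:Work through the subclass link. When the classes are related only by rdfs:seeAlso, no membership is inferred.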
Ontologies for sharing, ontologies for use
To avoid these potential problems, we propose the following principles of ontology design to maximize the reusability of ontologies by third parties. Modifying existing ontologies so that their public shared versions conform to these requirements is best undertaken collaboratively by a group of domain experts.
- Ontologies designed for sharing within a community should be kept small, or have a simple structure that uses just a few properties and is not deeply nested. It is easier to achieve consensus about a small ontology on a single topic that is small enough to be comprehended in its entirety, just because there are fewer assertions to disagree about. Small and/or simple ontologies are easier to incorporate by people wishing to use them for third-party applications.
- Classes designed for sharing should form simple subsumption (is_a) hierarchies in which sibling classes are disjoint from each other, and where each class represents a single notion. This makes the ontologies easier to manage and easier to validate by domain experts.
- It follows from this that in any collection of related small ontology modules, covering a particular domain of knowledge, each class must be defined in only one ontology module. If a class appears in two modules, it probably conflates two notions which need to be distinguished and separated.
- All classes and properties should be annotated with clear and informative human-readable definitions, containing links to real-world examples where appropriate.
- Ontology building is an empirical activity – while each ontology must be fit for purpose, clear and unambiguous, it need not capture the totality of applicable knowledge.
- When writing an ontology that covers a small, specialist area, one should not assert anything that is not essential to the intended meaning. In particular, one should think carefully before specifying the domain and range of a property, since this may cause problems for others extending from the ontology.
- Third-party ontologies should be used wherever they are available and suitable, since there is no point in re-inventing wheels.
- Ontologies should be written in a standard ontology language such as OWL, and the ontology modules should be validated using an appropriate reasoner, for example FaCT++ (http://owl.man.ac.uk/factplusplus/).
- More complex ontologies for specific applications (“knowledge bases” in the sense used by Gruber) can be built by combining smaller ontologies, and then by adding restrictions to enrich their meaning.
This statement of principles reflects the insights on ontology normalization first made by the Manchester ontologist Professor Alan Rector (http://www.cs.man.ac.uk/~rector/papers/rector-modularisation-kcap-2003-distrib.pdf).
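The principle that each class should be defined in only one module lends itself to a simple automated check. The following Python sketch assumes that each module has already been reduced to a set of class names. The module names, class names and the function itself are hypothetical.

```python
from collections import defaultdict


def classes_defined_in_multiple_modules(modules):
    """Return {class name: sorted module names} for classes found in more than one module."""
    owners = defaultdict(set)
    for module_name, class_names in modules.items():
        for class_name in class_names:
            owners[class_name].add(module_name)
    return {name: sorted(mods) for name, mods in owners.items() if len(mods) > 1}


# Hypothetical modules; "Article" is deliberately defined twice.
modules = {
    "documents": {"Document", "Article", "Book"},
    "agents": {"Person", "Organisation"},
    "events": {"Conference", "Article"},
}

duplicates = classes_defined_in_multiple_modules(modules)
print(duplicates)
```

A class reported by such a check is a candidate for splitting into two more precisely named notions, or for removal from one of the modules.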
To summarize, we propose the following distinctions:
A public shared ontology, as far as possible:
- should act primarily as a structured defined vocabulary;
- should define a limited domain;
- should be a simple subsumption hierarchy with disjoint sibling classes;
- should be sparing in its use of domain and range constraints;
- should be written in a standard ontology language such as OWL; and
- should have detailed human-readable annotations of the intended meaning of each term.
An application-level ontology, as far as possible:
- should be based upon or import one or more public ontologies describing particular domains;
- should restrict and/or cross-relate the public ontologies, thereby enabling more powerful reasoning to be used within a particular application;
- should extend the public ontology with new classes carefully, enabling more specific descriptions to be made without compromising the ability of third party applications that are not aware of these extensions to make sense of metadata thus encoded; and
- should ideally also be expressed in OWL-DL, to permit use of a Description Logic reasoner both for validation and for inference of additional relationships defined by the restrictions.
Once ontologies have been made suitable for public sharing and re-use, they should be published on an open access web server, or in an appropriate ontology warehouse such as the Open Biomedical Ontologies or SchemaWeb. Application-level ontologies are still ‘shared conceptualizations’, but are now shared implicitly, by the users of the application.
Combining modular ontologies
One of the main advantages of producing modular ontologies is that they can not only be extended and constrained for a new purpose, but can also be combined easily to produce new or more comprehensive ontologies. This process is made considerably easier when the modules are built with the same design principles and related to a common upper level ontology. The benefits of common design principles are emphasised by Open Biomedical Ontologies (http://www.obofoundry.org/). Upper level ontologies are described in the accompanying paper, Upper Level Ontologies, by Frank Gibson.
Pitfalls remain, however, when attempting to combine existing ontologies. To take a trivial example, semantic integration within the bibliographic community could be achieved with respect to the identification of bibliographic entities themselves by common usage of the Dublin Core property dc:title, which is well defined. However, FOAF uses the property foaf:title to denote the prefix to a person’s name used to signify veneration, an official position, or a professional or academic qualification (Reverend, Dame, President, Dr, Professor, etc.). While dc:title and foaf:title are logically distinct, we need to be careful to avoid human misinterpretation of ‘title’, if both are used in the same ontology.
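The logical distinctness of the two terms follows from their namespaces, as the short Python sketch below shows. It assumes that the dc prefix is bound to the Dublin Core elements 1.1 namespace and foaf to the FOAF 0.1 namespace. The expand function and the display labels are hypothetical.

```python
# Assumed prefix bindings.
PREFIXES = {
    "dc": "http://purl.org/dc/elements/1.1/",
    "foaf": "http://xmlns.com/foaf/0.1/",
}

# Hypothetical display labels that keep the two meanings apart for human readers.
DISPLAY_LABELS = {
    "dc:title": "Title of work",
    "foaf:title": "Personal title (Dr, Professor, ...)",
}


def expand(prefixed_name):
    """Expand a prefixed name such as 'dc:title' into a full IRI."""
    prefix, local_name = prefixed_name.split(":", 1)
    return PREFIXES[prefix] + local_name


for name in ("dc:title", "foaf:title"):
    print(expand(name), "->", DISPLAY_LABELS[name])
```

A program never confuses the two, because it compares full IRIs. The risk lies in user interfaces and documentation that show only the local name ‘title’, which is why distinct display labels are worth providing.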
At a deeper level, we need to avoid assumptions that lead to semantic misalignment of terms. For example, it would be incorrect in many cultures to equate “family name” with “last name”. A biological example is given by our work with Drosophila genomics data. The FlyTED Database contains in situ hybridization images of gene expression in Drosophila testis. In it, each tagged sequence probe used to identify the location of messenger RNA in the specimens was, for convenience, originally described using the name of the gene from which the mRNA was transcribed. However, because of subsequent gene re-assignments within FlyBase, the genomic database for Drosophila, mismatches developed between the “genes” in FlyTED and the corresponding gene identifiers in FlyBase. This caused subsequent inconsistencies when combining results from the two databases as part of our OpenFlyData Services, which were resolved by recognising the “gene” names in FlyTED as being distinct from the FlyBase IDs, rather than synonymous, and then by introducing a mapping between them.
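The general shape of such a mapping can be sketched in Python. This is not the OpenFlyData implementation. The record layout, probe names and identifiers below are placeholders invented for illustration, and none of them is real FlyTED or FlyBase data.

```python
# Placeholder image records keyed by the probe name used in the image database.
image_records = [
    {"image": "img-001", "probe_name": "probeGeneA"},
    {"image": "img-002", "probe_name": "probeGeneB"},
    {"image": "img-003", "probe_name": "probeGeneC"},
]

# Explicit mapping from probe name to current gene identifiers.
# Names and identifiers are kept as distinct things, so the mapping can be
# revised after a re-assignment without touching the image records.
probe_to_gene_ids = {
    "probeGeneA": ["GENE-ID-PLACEHOLDER-1"],
    "probeGeneB": ["GENE-ID-PLACEHOLDER-2"],
    # "probeGeneC" has no entry: it is left unmapped rather than guessed.
}


def gene_ids_for(record, mapping):
    """Look up the gene identifiers for an image record; empty list if unmapped."""
    return mapping.get(record["probe_name"], [])


resolved = {r["image"]: gene_ids_for(r, probe_to_gene_ids) for r in image_records}
unmapped = [image for image, ids in resolved.items() if not ids]
print(resolved)
print(unmapped)
```

Keeping the mapping as a separate, explicit table means that an unmapped or re-assigned name shows up as a visible gap rather than as a silent mismatch between the two databases.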
There are other problems to be resolved when one has to choose between a number of overlapping ontologies that could be used within an integration, but between which there are significant differences. How is one to choose between them, or reconcile their differences? These are issues for which there are no clear prescriptive answers, and for which best practice is being worked out within the ontology community. Related issues of semantic data integration are described in the article Semantic Integration in the Life Sciences by Allyson Lister.
Extending shared ontologies
If extension of a ‘public’ ontology is required for a particular application, it is advisable to extend it only by adding new sub-classes at the edges, rather than by modifying the meaning of more central ontological entities. If this is done, third party applications, able to understand the public ontology but having no knowledge of these extensions, will still be able to understand correctly the core metadata created using the extended ontology. We exemplify this from our recent work with the CIDOC Conceptual Reference Model (CIDOC CRM; http://cidoc.ics.forth.gr/), an ontology developed for the museum community specifically to describe cultural heritage information.
In the CLAROS Project (http://www.clarosnet.org/; http://imageweb.zoo.ox.ac.uk/pub/2009/publications/Kurtz_Parker_Shotton_et_al-IEEE_CLAROS_paper.pdf), we have used the CIDOC CRM to provide a common framework onto which to map the data models of a number of academic resources describing classical art objects, and have then created a data web integrating information from these resources into a single user interface. For this, we have employed the RDF version of CIDOC CRM from Erlangen University (http://purl.org/NET/crm-owl).
Dating in antiquity is not an exact science, and we needed to capture this uncertainty in our CLAROS metadata. For this, we extended CIDOC CRM to permit us to record the estimated bounds of the inaccuracy relating to dates of creation of classical art objects. We could have done this by introducing new properties to relate existing CIDOC CRM classes. However, the problem with this approach is that baseline applications (without appropriate inference support) won’t know about these new properties, so the nature of the relationships that they might understand is lost, and sections of the knowledge graph might become disconnected for such applications. The alternative approach we chose to adopt was just to add new nodes at the edge of the CRM graph, and then add new properties relating these to instances of existing classes. This left the original classes connected only by original properties. Furthermore, we supplemented our new date range metadata with more general statements that do not depend on knowledge of these extensions, accessible to ‘standard’ CIDOC CRM applications.
Specifically, we created the properties claros:not_before and claros:not_after, which have the CRM class E61.Time_Primitive as their domain, and used these together with a ‘label’ time range statement, for display by applications that understand only the non-extended CRM terms, thus (in RDF N3 notation):
. . .
# xsd:gYear literals need at least four year digits, hence -0525 rather than -525
[ rdf:type crm:E61.Time_Primitive ;
claros:not_before "-0525"^^xsd:gYear ;
claros:not_after "-0475"^^xsd:gYear ;
rdfs:label "about 500 BC" ] .
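The effect of this design on two kinds of consuming application can be sketched in Python. The sketch models the time node as plain tuples rather than using an RDF library. The function names and the simplification that a baseline application recognises predicates by prefix are assumptions made for illustration.

```python
# Prefixes that a baseline CIDOC CRM application is assumed to understand.
BASELINE_PREFIXES = ("rdf:", "rdfs:", "crm:")

time_node = [
    ("_:t", "rdf:type", "crm:E61.Time_Primitive"),
    ("_:t", "claros:not_before", -525),
    ("_:t", "claros:not_after", -475),
    ("_:t", "rdfs:label", "about 500 BC"),
]


def baseline_view(triples):
    """Keep only statements whose predicate a baseline application understands."""
    return [t for t in triples if t[1].startswith(BASELINE_PREFIXES)]


def describe(triples, understands_claros):
    """Describe the date using the bounds if understood, else the label."""
    props = {p: o for _, p, o in triples}
    has_bounds = "claros:not_before" in props and "claros:not_after" in props
    if understands_claros and has_bounds:
        return f"between {props['claros:not_before']} and {props['claros:not_after']}"
    return props.get("rdfs:label", "unknown date")


print(describe(baseline_view(time_node), understands_claros=False))
print(describe(time_node, understands_claros=True))
```

The baseline application loses the numeric bounds but still has a typed node and a usable label, while the extension-aware application gains the extra precision. Neither sees a disconnected graph.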
The need for better tools
The ontology writing and editing tools Protégé version 4 (http://protege.stanford.edu/) and OBO-Edit version 2 (http://oboedit.org/) are powerful and sophisticated – see articles on Protégé & Protégé-OWL by Alan Rector and on OBO Format by David Osumi-Sutherland. However, there is still a need for good tools in four areas to assist those creating, integrating and using ontologies:
- Tools to assist in the early stages of ontology development, using text mining and natural language processing to extract relevant terms from the scientific literature that can then be organized into ontology modules.
- Tools that allow vocabulary designers to capture, refine and ultimately formalize their intuitions without being forced to deal with distracting logical details early in the design process.
- Tools to permit the comparison of different ontologies, and of different versions of a single ontology under development, and to visualize these differences in a readily comprehensible manner.
- Tools to assist in the location of relevant third-party ontologies for integration and use. The National Center for Biomedical Ontology’s BioPortal and the European Bioinformatics Institute’s Ontology Lookup Service work well for the biological domain, but there are no equivalents for more generic ontologies such as the Information Artifact Ontology, the Citation Typing Ontology, FRBR, FOAF and CIDOC CRM. One just has to learn about these by reading, attending meetings, word of mouth and experience. This makes entry into this field extremely difficult for newcomers.
Applying the principles outlined here should make it easier to create an ontology de novo, or to take an existing ‘tangled’ ontology, normalize it, and ‘bring it to market’. Such public ontologies, being simpler and more loosely defined, should gain a far wider degree of consensus and obtain greater usage. The mechanics of this ontology normalization process are described in a separate article entitled Automatic Maintenance of Multiple Inheritance Ontologies by Mikel Egaña Aranguren, while the construction of application ontologies is described in the related article Community Driven Ontology Development by James Malone.
This paper is an open access work distributed under the terms of the Creative Commons Attribution License 3.0 (http://creativecommons.org/licenses/by/3.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided that the original author and source are attributed.
The paper and its publication environment form part of the work of the Ontogenesis Network, supported by EPSRC grant EP/E021352/1.