Ontogenesis: An Ontology Tutorial
http://ontogenesis.knowledgeblog.org

Friends and Family: Exploring Transitivity and Subproperties
http://ontogenesis.knowledgeblog.org/1376
Thu, 08 Aug 2013

Summary

An exploration of the relationship between subproperties and property characteristics, in particular transitivity.

Author

Sean Bechhofer
Information Management Groups
School of Computer Science
University of Manchester
Oxford Road
Manchester
United Kingdom
M13 9PL
sean.bechhofer@manchester.ac.uk

Property Characteristics and Subproperties

Transitive properties can be very useful in ontologies. Recall that a property P is transitive if and only if the following is true:

* For all x, y, and z: P(x,y) and P(y,z) => P(x,z)

An example of a transitive property is “ancestor”. Any ancestor of an ancestor of mine is also an ancestor of mine. OWL provides us with an axiom for stating that a particular property is transitive.

ObjectProperty: ancestor
  Characteristics: Transitive

The notion of a subproperty is also useful. A property R is a subproperty of a property Q if and only if

* For all x, y: R(x,y) => Q(x,y)

An example of a subproperty relationship is “hasParent” and “hasFather”. Any two individuals that are related via the father relationship must be related via the parent relationship.

ObjectProperty: hasParent

ObjectProperty: hasFather
  SubPropertyOf: hasParent

Sometimes there is confusion over the way in which characteristics like transitivity interact with the sub/super property hierarchy. As far as transitivity is concerned, the characteristic is not “inherited”: we cannot infer that a subproperty of a transitive property is transitive. The same holds for superproperties.

To illustrate this, consider the following example. We have three (object) properties: knows, hasFriend, isMarriedTo. One of these (hasFriend) is transitive (now you might question this as a piece of modelling, but please just go with my rosy world-view that all the friends of my friends are also friends), and the properties are arranged in a hierarchy. In Manchester syntax we would have:

ObjectProperty: knows

ObjectProperty: hasFriend
  Characteristics: Transitive
  SubPropertyOf: knows

ObjectProperty: isMarriedTo
  SubPropertyOf: hasFriend

And yes I know that expecting marriage to imply friendship is again hopelessly optimistic, but I’m a hopeless optimist.

Now, consider a domain with four elements, Arthur, Betty, Charlie and Daphne. They are related as follows:

* Arthur knows Betty.

* Betty knows Charlie and Daphne.
* Betty hasFriend Charlie and Daphne.
* Betty isMarriedTo Charlie.

* Charlie knows Daphne.
* Charlie hasFriend Daphne.
* Charlie isMarriedTo Daphne.

The situation is as pictured below.

[Figure: the four individuals and their knows, hasFriend and isMarriedTo relationships]

If we look at the ontology presented above, we can see that all the axioms hold — the subproperty axioms are being respected, as is the transitivity of hasFriend. Thus this situation is a model of the ontology.

Now if we consider isMarriedTo, we can see that our conditions for transitivity do not hold. There are three elements with isMarriedTo(Betty,Charlie) and isMarriedTo(Charlie,Daphne), but we do not have isMarriedTo(Betty,Daphne). So we cannot infer that isMarriedTo is transitive from the axioms. Similarly, there are three elements where knows(Arthur,Betty) and knows(Betty,Charlie) but we don’t have knows(Arthur,Charlie).

Recall that the inferences we can make from an ontology or collection of axioms are those things that necessarily hold in all models of the ontology. This little sample model provides us a “witness” for the fact that we cannot infer that knows is transitive from the axioms. Similarly, we cannot infer that isMarriedTo is transitive.
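This check can be mechanised. Below is a minimal Python sketch (no OWL library; the property and individual names simply mirror the example above) that represents each property as a set of pairs and tests the subproperty and transitivity conditions directly:

```python
# Each property is a set of (subject, object) pairs over the domain
# {Arthur, Betty, Charlie, Daphne}, exactly as in the example above.
knows = {("Arthur", "Betty"), ("Betty", "Charlie"), ("Betty", "Daphne"),
         ("Charlie", "Daphne")}
has_friend = {("Betty", "Charlie"), ("Betty", "Daphne"), ("Charlie", "Daphne")}
is_married_to = {("Betty", "Charlie"), ("Charlie", "Daphne")}

def is_transitive(prop):
    """P is transitive iff P(x,y) and P(y,z) imply P(x,z)."""
    return all((x, z) in prop
               for (x, y) in prop
               for (y2, z) in prop if y == y2)

def is_subproperty(sub, sup):
    """R is a subproperty of Q iff every R pair is also a Q pair."""
    return sub <= sup

# The subproperty axioms hold in this model...
assert is_subproperty(has_friend, knows)
assert is_subproperty(is_married_to, has_friend)

# ...and hasFriend is transitive, but neither its superproperty nor its
# subproperty is: knows(Arthur,Charlie) and isMarriedTo(Betty,Daphne)
# are missing.
assert is_transitive(has_friend)
assert not is_transitive(knows)
assert not is_transitive(is_married_to)
```

All the assertions pass: the model satisfies every axiom while leaving knows and isMarriedTo non-transitive, which is exactly what makes it a witness.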

Of course, this is just saying that we can’t in general make such an inference. We are not saying that superproperties cannot (sometimes) be transitive. If we add to our interpretation the fact that Arthur knows Charlie and Daphne, then in this interpretation, knows is indeed transitive. And if we allow Betty to marry Daphne — hey, it’s 2013! — then we have a transitive subproperty (in this interpretation).

On the topic of transitivity and sub properties, the thesaurus representation SKOS uses a common modelling pattern, where a non-transitive property (skos:broader) has a transitive superproperty (skos:broaderTransitive) defined. The superproperty is not intended to be used for asserting relationships, but can be used to query for transitive chains of skos:broader relationships (assuming our query engine is performing inference). As we now know, this doesn’t mean that skos:broader is necessarily transitive.

This pattern is also often used for representing partonomy. Here we would use a (non-transitive) hasDirectPart for asserting parts of a whole, with a transitive superproperty hasPart allowing us to query the transitive closure. We can use counting with hasDirectPart — for example min or max cardinality restrictions — which we would not be able to do in OWL DL if hasDirectPart was transitive, due to restrictions relating to simple properties (see the OWL2 Structural Specification).
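The partonomy pattern can be illustrated outside OWL too. The sketch below (plain Python, with hypothetical anatomy-style part names) asserts only hasDirectPart pairs and then computes the transitive closure, i.e. the set of pairs a reasoner would return for the transitive superproperty hasPart:

```python
# Asserted hasDirectPart pairs (hypothetical example).
has_direct_part = {("body", "arm"), ("arm", "hand"), ("hand", "finger")}

def transitive_closure(prop):
    """Smallest transitive superset of prop -- what a reasoner would
    infer for a transitive superproperty such as hasPart."""
    closure = set(prop)
    while True:
        extra = {(x, z) for (x, y) in closure
                        for (y2, z) in closure if y == y2} - closure
        if not extra:
            return closure
        closure |= extra

has_part = transitive_closure(has_direct_part)

# Only direct parts were asserted, but hasPart answers chain queries:
assert ("body", "finger") in has_part
assert ("body", "finger") not in has_direct_part
```

Counting stays well-defined on has_direct_part (a body has exactly one arm here) even though has_part contains the whole chain.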

For other property characteristics the situation is different. For example, a subproperty of a functional property must be functional. Why? We’ll leave that as an exercise for the interested reader…

Common reasons for ontology inconsistency
http://ontogenesis.knowledgeblog.org/1343
Wed, 12 Jun 2013

Summary

Following on from the previous Ontogenesis article “(I can’t get no) satisfiability” [1], this post explores common reasons for the inconsistency of an ontology. Inconsistency is a severe error which implies that none of the classes in the ontology can have instances (OWL individuals), and (under standard semantics) no useful knowledge can be inferred from the ontology.

Introduction

In the previous Ontogenesis article “(I can’t get no) satisfiability” [1], the authors discussed the notions of “unsatisfiability”, “incoherence”, and “inconsistency”. We recall that a class is “unsatisfiable” if there is a contradiction in the ontology that implies that the class cannot have any instances (OWL individuals); an ontology is “incoherent” if it contains at least one unsatisfiable class. An ontology is “inconsistent” if there is no way of interpreting its axioms without contradiction; the ontology has no model at all.

While incoherent OWL ontologies can be (and are) published and used in applications, inconsistency is generally regarded as a severe error: most OWL reasoners cannot infer any useful information from an inconsistent ontology. When faced with an inconsistent ontology, they simply report that the ontology is inconsistent and then abort the classification process, as shown in the Protégé screenshot below. Thus, when building an OWL ontology, inconsistency (and some of the typical patterns that often lead to inconsistency) needs to be avoided.

[Screenshot: Protégé reporting an inconsistent ontology]

In what follows, we will outline and explain common reasons for the inconsistency of an OWL ontology which we separate into errors caused by axioms on the class level (TBox), on the instance level (ABox), and by a combination of class- and instance-related axioms. Note that the examples are simplified versions which represent, in as few axioms as possible, the effects multiple axioms in combination can have on an ontology.

Instantiating an unsatisfiable class (TBox + ABox)

Instantiating an unsatisfiable class is commonly regarded as the most typical cause of inconsistency. The pattern is fairly simple – we assign the type of an unsatisfiable class to an individual:

Individual: Dora
  Types: MadCow

where MadCow is an unsatisfiable class. The actual reason for the unsatisfiability does not matter; the contradiction here is caused by the fact that we require a class that cannot have any instances (MadCow) to have an instance named Dora. Clearly, there is no interpretation in which the individual Dora can fulfil this requirement; we say that the ontology has no model. Therefore, the ontology is inconsistent. This example shows that, while incoherence is not a severe error as such, it can quickly lead to inconsistency, and should therefore be avoided.

Instantiating disjoint classes (TBox + ABox)

Another fairly straightforward cause of inconsistency is the instantiation of two classes which were asserted to be disjoint:

Individual: Dora
  Types: Vegetarian, Carnivore

DisjointClasses: Vegetarian, Carnivore

What we state here is that the individual Dora is an instance of both the class Vegetarian and the class Carnivore. However, we also say that Vegetarian and Carnivore are disjoint classes, which means that no individual can be both a Vegetarian and a Carnivore. Again, there is no interpretation of the ontology in which the individual Dora can fulfil both requirements; therefore, the ontology has no models and we call it inconsistent.
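We can make “the ontology has no models” concrete with a brute-force search. The sketch below (plain Python, not an OWL reasoner) enumerates every possible interpretation of Dora’s membership in the two classes and checks the requirements:

```python
from itertools import product

def satisfies(in_vegetarian, in_carnivore):
    """Check both requirements for the individual Dora."""
    type_assertions = in_vegetarian and in_carnivore    # Types: Vegetarian, Carnivore
    disjointness = not (in_vegetarian and in_carnivore)  # DisjointClasses
    return type_assertions and disjointness

# Try every interpretation: Dora is either in or out of each class.
models = [combo for combo in product([True, False], repeat=2)
          if satisfies(*combo)]

# No interpretation satisfies all the axioms: the ontology is inconsistent.
assert models == []
```

Four candidate interpretations exist and every one of them violates either the type assertions or the disjointness axiom.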

Conflicting assertions (ABox)

This error pattern is very similar to the previous one, but all assertions now happen in the ABox, that is, on the instance level of the ontology:

Individual: Dora
  Types: Vegetarian, not Vegetarian

Here, the contradiction is quite obvious: we require the individual Dora to be a member of the class Vegetarian and at the same time to not be a member of Vegetarian.

Conflicting axioms with nominals (all TBox)

Nominals (oneOf in OWL lingo) allow the use of individuals in TBox statements about classes; this merging of individuals and classes can lead to inconsistency. The following example, based on an example in [2], is slightly more complex than the previous ones:

Class: MyFavouriteCow
  EquivalentTo: {Dora}

Class: AllMyCows
  EquivalentTo: {Dora, Daisy, Patty}

DisjointClasses: MyFavouriteCow, AllMyCows

The first axiom in this example requires that every instance of the class MyFavouriteCow must be the individual Dora. In a similar way, the second axiom states that any instance of AllMyCows must be one of the individuals Dora, Daisy, or Patty. However, we then go on to say that MyFavouriteCow and AllMyCows are disjoint; that is, no member of the class MyFavouriteCow can be a member of AllMyCows. Since we already stated that Dora is a member of both MyFavouriteCow and AllMyCows, the final disjointness axiom causes a contradiction: there cannot be any interpretation of the axioms that fulfils all three requirements. Therefore, the ontology is inconsistent.

No instantiation possible (all TBox)

The following example demonstrates an error which may not occur in a single axiom as shown here (simply because it is unlikely that a user would write down a statement which is so obviously conflicted), but could be the result of several axioms which, when taken together, have the same effect as the axiom below. It is also non-trivial to express the axiom in Manchester syntax (the OWL syntax chosen for these examples) since it contains a General Concept Inclusion (GCI) [3], so we will bend the syntax slightly to illustrate the point.

Vegetarian or not Vegetarian
  SubClassOf: Cow and not Cow

Let’s unravel this axiom. First, in order for any individual to satisfy the left-hand side of the axiom, it has to be either a member of Vegetarian or not a member of Vegetarian. Clearly, since something is either a member of a class or it is not (there are no values “in between”), the statement holds for all individuals in the ontology. The right-hand side (or, second line) of the axiom then requires all individuals to be a member of the class Cow and not Cow at the same time; again, this falls into the same category as the examples above, which means that no individual can meet this requirement. Due to this contradiction, there is no way to interpret the axiom so as to satisfy it, which renders the ontology inconsistent.
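The same unravelling can be done mechanically. The sketch below (plain Python) enumerates every membership combination of an arbitrary individual in Vegetarian and Cow and keeps only the combinations that do not violate the GCI:

```python
from itertools import product

# For an arbitrary individual, enumerate every combination of
# membership in Vegetarian and in Cow.
possible = []
for vegetarian, cow in product([True, False], repeat=2):
    lhs = vegetarian or not vegetarian   # always True: everything matches
    rhs = cow and not cow                # always False: nothing satisfies it
    # The GCI requires lhs -> rhs; the individual is possible only
    # if it does not violate the axiom.
    if (not lhs) or rhs:
        possible.append((vegetarian, cow))

# No individual can exist at all; since an OWL interpretation must have
# a non-empty domain, the ontology is inconsistent.
assert possible == []
```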

Conclusion

In this post, we have discussed some of the most common reasons for inconsistency of an OWL ontology by showing simplified examples of the error patterns. While some of these, such as instantiation of an unsatisfiable class, can be identified fairly easily, others, such as conflicting axioms involving nominals, can be more subtle.

References

  1. U. Sattler, R. Stevens, and P. Lord, "(I can’t get no) satisfiability", Ontogenesis, 2013. http://ontogenesis.knowledgeblog.org/1329
  2. B. Parsia, E. Sirin, and A. Kalyanpur, "Debugging OWL ontologies", Proceedings of the 14th international conference on World Wide Web - WWW '05, 2005. http://dx.doi.org/10.1145/1060745.1060837
  3. U. Sattler, and R. Stevens, "Being complex on the left-hand-side: General Concept Inclusions", Ontogenesis, 2012. http://ontogenesis.knowledgeblog.org/1288
Review for What is an upper level ontology?
http://ontogenesis.knowledgeblog.org/853
Wed, 12 May 2010

This is a review of What is an upper level ontology?

In this article, Robert nicely covers the different aspects that upper level ontologies need to consider to prescribe a coherent view of the world for their adopters.

Here are some specific comments that need to be addressed:

1. Instead of using electron as an example, I would use something more concrete, like a car.

2. Explain the notation “)1,2)”.

3. Elaborate on abstract entities – why are these important, and give some examples.

Separating Concepts from Labels
http://ontogenesis.knowledgeblog.org/818
Tue, 20 Apr 2010

When creating ontologies it is good practice to separate the concept from the label or term used to refer to that concept. Take the category or class to which the object that sits on top of your neck belongs. The words "Head", "Tête", "Kopf" and "Cabeza" are the terms used in English, French, German and Spanish that all refer to the same category of objects. The category is the same, but the label is different. We can change the label or term without changing our notion of the category to which it refers. Managing synonymy and polysemy is a strong reason for separating the symbol for a category in the ontology from its label.

Both OWL and the OBO format allow this separation quite easily. The rdfs:label annotation in OWL can be used (with language tags as well) to supply names that do not necessarily correspond to the URI: I can have just a number for the URI and a "proper term" for the label. The OBO format similarly allows such a distinction (the OBO guidelines insist on such a separation and that the ID is semantic-free). James J. Cimino, in his Desiderata for Controlled Medical Vocabularies, also espouses this separation [1].

Sometimes terminologies or vocabularies are called ontologies. Rather, an ontology, via its labels or terms, can deliver a vocabulary, but the ontology itself is not a vocabulary. The difference is that the concept becomes the first-class citizen, not the words used to describe the concept.

Many ontologies formalise this distinction by using semantic-free identifiers for their concepts. It is this identifier that is used, for example, as the means of annotation. The Gene Ontology has a set of rules for change, summarised as: a change to a definition changes the "nature" of the concept; it is now a different category and thus requires another identifier, as it has become a different entity. If the same identifier were kept, the meaning of the annotation would change. This is why the GO has obsolete terms. If, however, just the term or label changes, then the underlying concept has not changed (the label changes from "head" to "Kopf", but the concept is still the same), and the annotation of some data item in a database means the same as it did before the label change. Thus annotations are made with the ID or URI, not the label (this is why OBO IDs should always be quoted in papers along with the meaningful label or term).

Many ontologies also have rules or naming conventions for their terms or labels. This is simply a matter of consistency and explicitness in the labelling, such that the meaning is, as much as possible, apparent from the "presentation" of the concept. The assignment of the semantic-free identifier usually becomes part of the ontology authoring process: such IDs are usually digits, or combinations of letters and digits, that are automatically generated. Only new IDs are given (IDs are never issued twice), and all identifiers have the same number of digits, padded to the left with zeros.
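These conventions can be sketched in a few lines of Python. Everything below is illustrative: the class name, the "EX" prefix, and the ID width are invented for the example (loosely in the style of GO-like identifiers), not any particular ontology's actual scheme.

```python
class ConceptStore:
    """Concepts keyed by semantic-free, zero-padded, never-reused IDs."""

    def __init__(self, prefix="EX", width=7):
        self.prefix, self.width = prefix, width
        self.next_number = 1          # IDs are never issued twice
        self.concepts = {}            # id -> {"label": ..., "definition": ...}

    def mint(self, label, definition):
        new_id = f"{self.prefix}:{self.next_number:0{self.width}d}"
        self.next_number += 1
        self.concepts[new_id] = {"label": label, "definition": definition}
        return new_id

    def relabel(self, concept_id, new_label):
        # A label change does not change the concept: the ID stays the
        # same, so existing annotations keep their meaning.
        self.concepts[concept_id]["label"] = new_label

    def redefine(self, concept_id, new_definition, new_label=None):
        # A definition change makes it a different category: the old ID
        # is marked obsolete and a fresh ID is minted.
        old = self.concepts[concept_id]
        old["obsolete"] = True
        return self.mint(new_label or old["label"], new_definition)

store = ConceptStore()
head = store.mint("head", "anatomical structure atop the neck")
store.relabel(head, "Kopf")           # same concept, new label, same ID
new_head = store.redefine(head, "uppermost anatomical structure of the body")
assert new_head != head                       # fresh ID minted
assert store.concepts[head].get("obsolete")   # old ID kept, marked obsolete
```

The ID itself carries no meaning ("EX:0000001" says nothing about heads), which is precisely the point: data annotated with it survives any number of label changes.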

In summary, this separation of meaningless ID/URI and meaningful label is a "best practice" that aids both ontology and data management.

Authors

Robert Stevens
School of Computer Science,
The University of Manchester,
Oxford Road,
Manchester,
UK

Duncan Hull
EMBL Outstation – Hinxton,
European Bioinformatics Institute,
Wellcome Trust Genome Campus,
Hinxton,
Cambridge,
CB10 1SD,
UK

References

[1] Cimino JJ. Desiderata for controlled medical vocabularies in the Twenty-First Century. Methods Inf Med 1998;37(4-5):394-403.

Review of What is an ontology?
http://ontogenesis.knowledgeblog.org/511
Fri, 22 Jan 2010

This is a review of What is an ontology?

This well-written article spans both logical and philosophical considerations in ontology, providing insight into the kinds of entities that are believed to exist, how we might formally represent them, and the basic relations that may exist between them.

The discussion on relations relating to identity (transformedInto, derivedFrom) necessitates further explanation. Are the criteria for identity embedded in physical continuity or in the conscious self? Indeed, when we observe that a frog develops from a tadpole, the idea lies in the material *largely* persisting spatiotemporally, and in the gain and loss of parts (and the corresponding qualities) being gradual and acceptably identity-preserving. Yet we might wonder whether the addition of even a single atom to a molecule through some chemical reaction maintains identity. To what extent does the gain or loss of parts become sufficiently important that it requires the distinction of forming a new entity? Perhaps more challenging: if we were to replace a person’s brain with another, we might perceive them to be the same individual throughout the operation, but would this criterion for identity change if consciousness followed the brain? Then what might we say of identity? These are important questions indeed for formal ontology and the representation of biological knowledge.

Ontologies for Sharing, Ontologies for Use
http://ontogenesis.knowledgeblog.org/312
Fri, 22 Jan 2010

David Shotton §, Chris Catton and Graham Klyne

Image Bioinformatics Research Group, Department of Zoology, University of Oxford, South Parks Road, Oxford OX1 3PS, UK

§ Corresponding author: e-mail: david.shotton@zoo.ox.ac.uk

An ontology is a formal, explicit specification of a shared conceptualisation.

Abstract

Although ontologies have been created to cover a large number of domains, many of them have characteristics that make them difficult to reuse in other contexts.  If ontologies are to be reused easily by others, they should be constructed as sets of small modules, each of which is a simple well-defined subsumption hierarchy that is not over-burdened by domain and range constraints.  This will make it easy for others to re-use the ontology without incurring unintended logical entailments.  In contrast, when using ontologies for particular applications such as the support of knowledge management systems, they may need to be extended and constrained.  This should be done in a manner that does not obscure the interpretation of core metadata items expressed using the extended ontology, when these are read by applications that have no knowledge of those extensions.

Current ontology development practice

Development of the semantic web (http://www.scientificamerican.com/article.cfm?id=the-semantic-web http://doi.ieeecomputersociety.org/10.1109/MIS.2006.62) involves the creation of vocabularies and ontologies designed to describe particular domains of knowledge.  It is desirable that such an ontology, ideally encapsulating a consensus view among domain experts, can be used as part of a practical information management application, and can also be easily shared with the wider community.  This paper examines the conflicting design requirements for these two functions, and summarizes a design approach.

Gruber (http://tomgruber.org/writing/onto-design.htm) stated:

‘An ontology should require the minimal ontological commitment sufficient to support the intended knowledge sharing activities … An ontology serves a different purpose than a knowledge base, and therefore a different notion of representational adequacy [McCarthy and Hayes, 1969] applies’.

However, in the current wave of interest in the semantic web and ontologies, this distinction has been blurred.  Some ontologies available on the web are too tightly specified to be easily shared, and are closer in spirit to knowledge bases than to Gruber’s idea of an ontology.

We believe this is the result of two forces:

a)      The current state of tools and standards for ontology development means that we have only just reached the point where different ontologies can be imported and made to work together easily and reliably.

b)      In developing the semantic web, much effort has been expended on producing stand-alone ontologies for use in specific systems, but less on designing them for general applicability so that they may be widely and easily shared.

If the semantic web is to become more than a collection of isolated knowledge-base systems, we need to be able to re-use and combine ontologies more easily.  There are several reasons for using ontologies created by others rather than writing them ourselves: it saves time and money, it permits each domain to be described accurately by appropriate domain specialists, and most importantly perhaps, it permits larger communities to come to a common agreement on the meaning of specific terms.

However, many ontologies are difficult or impossible to re-use, because they are too complex or over-committed to a particular application, contain inadequate or incomplete class annotations, are poorly structured, express a philosophy that is at odds with that of the importing ontology, or are over-restricted.

a)      Ontology too complex for easy re-use: Many ontologies are very large.  For example, CYC (http://www.cyc.com/cyc/technology/whatiscyc) contains over a million assertions, the Gene Ontology (GO) has almost 30,000 terms, while the National Cancer Institute Thesaurus (http://www.mindswap.org/2003/CancerOntology/; http://ncimeta.nci.nih.gov) in 2003 contained about 26,000 concepts and about 71,000 terms (http://www.mindswap.org/papers/WebSemantics-NCI.pdf).  Large size is appropriate when the ontology is an application ontology used to describe a large domain (e.g. GO): such ontologies benefit from having simple structures.  Possibilities for re-use are limited if the ontology in question is too complex or unwieldy.  If the need is just to be able to re-use a few classes, importing a large existing ontology bodily into a new ontology is problematic, since it obscures structure and takes up resources.

Our experience has been that existing published ontologies may also suffer from trivial faults:

  • they may contain a rich mixture of dissimilar class types that are not clearly differentiated;
  • they may contain class duplications and logical inconsistencies that have gone unrecognized because of their large size; and
  • they may contain classes that properly belong in separate third-party ontologies, not in the domain-specific ontology under consideration.

In contrast, a number of vocabularies, including Dublin Core (http://dublincore.org/documents/dcmi-terms/), FOAF (http://xmlns.com/foaf/spec/) and SIOC (http://sioc-project.org/ontology), which might be described as generic rather than domain-specific, have found widespread use by virtue of their simplicity. The distinction between application ontologies and reference ontologies is clarified in a related article, Reference and Application Ontologies, by James Malone and Helen Parkinson.

b)      Ontology annotations incomplete or inadequate: Ontology class annotations provide an important link between human- and machine-processable knowledge, and for this reason are important components of any ontology.  In some ontologies, the vagueness or inappropriate implied meaning of some of the class names used, coupled with inadequate annotation, renders them confusing.  Other ontologies, while having class names that might seem appropriate for re-use, fail to provide annotations for their classes and properties, or use annotations that are inadequate.  For example, the ka (Knowledge Acquisition) ontology (http://www.cs.man.ac.uk/~horrocks/OWL/Ontologies/ka.owl) contains no human-readable descriptions of the classes and properties, making it difficult to use without extensive cross-referencing to other documents, while in BIBO, the bibliographic ontology (http://bibliontology.com/), the class Standard is inadequately annotated as “A document describing a standard”.  Many other annotations are taken directly from Wikipedia, and are not always appropriate.  CIDOC CRM (http://cidoc.ics.forth.gr/) is an example of an ontology with generally good annotations.

c)      Ontology poorly structured: For example, in the Bibtex ontology (http://oaei.ontologymatching.org/2004/Contest/301/onto.html) the hasAuthor property is a datatype property with a range of String, which effectively prevents one from identifying an author as a member of the class Person.  Furthermore, it cannot be used to describe an ordered list of authors in an easily machine-processable fashion.

d)      Ontology expresses an alternative philosophy: Occasionally, different ontologies express alternative views about a domain, making it difficult to use elements of one easily within the other.  For example, while CiTO, the Citation Typing Ontology (http://purl.org/net/cito/) adopts the Works, Expressions, Manifestations hierarchy of the FRBR (Functional Requirements for Bibliographic Records; http://www.ifla.org/VII/s13/frbr/frbr1.htm) classification model developed by IFLA to characterize different aspects of a publication, BIBO does not.  While CiTO has Work: cito:ResearchPaper; Expression: cito:JournalArticle, BIBO has bibo:AcademicArticle, which conflates these two concepts.  This makes it difficult to re-use BIBO classes within CiTO.

e)      Ontology too committed: Perhaps the main problem is that many ontologies commit secondary users to modelling patterns that may be inappropriate to their needs.  For example, there is no consensus between bibliographic ontologies as to whether a document is authored by a set of people, an ordered list of people, or an ‘agent’ that may be a corporate body, situations brought about by varying restrictions on the range of the property hasAuthor.  This point relates to the previous one, since over-commitment becomes a problem in the face of alternative philosophies.

Thus, although different ontological representations now abound, many pose problems when it comes to re-using them in other ontologies.  The alternative approach of defining equivalences between terms in different ontologies suffers from some of the same problems, since use of owl:equivalentClass is logically strict.  Strict equivalence is inappropriate if the definitions of the classes within the two ontologies differ significantly.  For example, in FRBR a Work is a distinct intellectual or artistic creation, an abstract concept recognised through its various expressions.  However, in CiTO the definition of cito:Work is restricted to works that cite or may be cited, primarily works of scholarship that contain bibliographic references, rather than artistic works such as plays or photographs that do not.  Thus cito:Work is classified as a subclass of frbr:Work, not an equivalent class.  Such subclassing should be undertaken with caution, since it implies logical entailment of any domain and range restrictions of the superclass.  An alternative is just to indicate that some sort of relationship exists between classes in two ontologies, for example by use of rdfs:seeAlso (http://www.w3.org/TR/rdf-schema/).

Ontologies for sharing, ontologies for use

To avoid these potential problems, we propose the following principles of ontology design to maximize the reusability of ontologies by third parties.  Modifying existing ontologies so that their public shared versions conform to these requirements is best undertaken collaboratively by a group of domain experts.

  1. Ontologies designed for sharing within a community should be kept small, or have a simple structure that uses just a few properties and is not deeply nested.  It is easier to achieve consensus about an ontology on a single topic that is small enough to be comprehended in its entirety, simply because there are fewer assertions to disagree about.  Small and/or simple ontologies are easier to incorporate by people wishing to use them for third-party applications.
  2. Classes designed for sharing should form simple subsumption (is_a) hierarchies in which sibling classes are disjoint from each other, and where each class represents a single notion.  This makes the ontologies easier to manage and easier to validate by domain experts.
  3. It follows from this that in any collection of related small ontology modules, covering a particular domain of knowledge, each class must be defined in only one ontology module.  If a class appears in two modules, it probably conflates two notions which need to be distinguished and separated.
  4. All classes and properties should be annotated with clear and informative human-readable definitions, containing links to real-world examples where appropriate.
  5. Ontology building is an empirical activity – while each ontology must be fit for purpose, clear and unambiguous, it need not capture the totality of applicable knowledge.
  6. When writing an ontology that covers a small, specialist area, one should not assert anything that is not essential to the intended meaning.  In particular, one should think carefully before specifying the domain and range of a property, since this may cause problems for others extending from the ontology.
  7. Third-party ontologies should be used wherever they are available and suitable, since there is no point in re-inventing wheels.
  8. Ontologies should be written in a standard ontology language such as OWL, and the ontology modules validated using an appropriate reasoner, for example FaCT++ (http://owl.man.ac.uk/factplusplus/).
  9. More complex ontologies for specific applications (“knowledge bases” in the sense used by Gruber) can be built by combining smaller ontologies, and then by adding restrictions to enrich their meaning.

This statement of principles reflects the insights on ontology normalization first made by the Manchester ontologist Professor Alan Rector (http://www.cs.man.ac.uk/~rector/papers/rector-modularisation-kcap-2003-distrib.pdf).

To summarize, we propose the following distinctions:

A public shared ontology, as far as possible:

  • should act primarily as a structured defined vocabulary;
  • should define a limited domain;
  • should be a simple subsumption hierarchy with disjoint sibling classes;
  • should be sparing in its use of other ontologies;
  • should be written in a standard ontology language such as OWL; and
  • should have detailed human-readable annotations of the intended meaning of each term.

An application-level ontology, as far as possible:

  • should be based upon or import one or more public ontologies describing particular domains;
  • should restrict and/or cross-relate the public ontologies, thereby enabling more powerful reasoning to be used within a particular application;
  • should extend the public ontology with new classes carefully, enabling more specific descriptions to be made without compromising the ability of third party applications that are not aware of these extensions to make sense of metadata thus encoded; and
  • should ideally also be expressed in OWL-DL, to permit use of a Description Logic reasoner both for validation and for inference of additional relationships defined by the restrictions.

Once ontologies have been made suitable for public sharing and re-use, they should be published on an open access web server, or in an appropriate ontology warehouse such as the Open Biomedical Ontologies or SchemaWeb.  Application-level ontologies are still ‘shared conceptualizations’, but are now shared implicitly, by the users of the application.

Combining modular ontologies

One of the main advantages of producing modular ontologies is that not only can they be extended and constrained for a new purpose, but they can also be combined easily to produce new or more comprehensive ontologies.  This process is made considerably easier when the modules are built with the same design principles and related to a common upper level ontology. The benefits of common design principles are emphasised by Open Biomedical Ontologies (http://www.obofoundry.org/).  Upper level ontologies are described in the accompanying paper, Upper Level Ontologies, by Frank Gibson.

Pitfalls remain, however, when attempting to combine existing ontologies.  To take a trivial example, semantic integration within the bibliographic community could be achieved with respect to the identification of bibliographic entities themselves by common usage of the Dublin Core property dc:title, which is well defined.  However, FOAF uses the property foaf:title to denote the prefix to a person’s name used to signify veneration, an official position, or a professional or academic qualification (Reverend, Dame, President, Dr, Professor, etc.).  While dc:title and foaf:title are logically distinct, we need to be careful to avoid human misinterpretation of ‘title’ if both are used in the same ontology.

At a deeper level, we need to avoid assumptions that lead to semantic misalignment of terms.  For example, it would be incorrect in many cultures to equate “family name” with “last name”.  A biological example is given by our work with Drosophila genomics data.  The FlyTED Database contains in situ hybridization images of gene expression in Drosophila testis.  In it, each tagged sequence probe used to identify the location of messenger RNA in the specimens was, for convenience, originally described using the name of the gene from which the mRNA was transcribed.  However, because of subsequent gene re-assignments within FlyBase, the genomic database for Drosophila, mismatches developed between the “genes” in FlyTED and the corresponding gene identifiers in FlyBase.  This caused subsequent inconsistencies when combining results from the two databases as part of our OpenFlyData Services, which were resolved by recognising the “gene” names in FlyTED as being distinct from the FlyBase IDs, rather than synonymous, and then by introducing a mapping between them.
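The shape of that resolution can be sketched in a few lines of Python. All gene names and identifiers below are invented; the point is that the link between the two databases is made through an explicit, curated mapping rather than by treating names and identifiers as synonymous:

```python
def reconcile(records, name_to_flybase_id):
    """Link FlyTED-style records to FlyBase-style identifiers through an
    explicit, curated mapping, rather than assuming that a probe's "gene"
    name and the FlyBase ID are the same thing.  Unmapped names are
    reported, not guessed.
    """
    resolved, unresolved = [], []
    for record in records:
        fb_id = name_to_flybase_id.get(record["gene_name"])
        if fb_id is None:
            unresolved.append(record["gene_name"])
        else:
            resolved.append({**record, "flybase_id": fb_id})
    return resolved, unresolved

# Invented names and identifiers, purely for illustration:
mapping = {"geneA": "FBgn:0000001"}
records = [{"gene_name": "geneA"}, {"gene_name": "geneB-renamed"}]
resolved, unresolved = reconcile(records, mapping)
print(resolved)    # [{'gene_name': 'geneA', 'flybase_id': 'FBgn:0000001'}]
print(unresolved)  # ['geneB-renamed']
```

Keeping the unresolved names visible is the important design choice: a silent best-guess match is exactly what caused the original inconsistencies.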

There are other problems to be resolved when one has to choose between a number of overlapping ontologies that could be used within an integration, but between which there are significant differences.  How is one to choose between them, or reconcile their differences?  These are issues for which there are no clear prescriptive answers, and for which best practice is still being worked out within the ontology community.  Related issues of semantic data integration are described in the article Semantic Integration in the Life Sciences by Allyson Lister.

Extending shared ontologies

If extension of a ‘public’ ontology is required for a particular application, it is advisable to extend it only by adding new sub-classes at the edges,  rather than by modifying the meaning of more central ontological entities.  If this is done, third party applications, able to understand the public ontology but having no knowledge of these extensions, will still be able to understand correctly the core metadata created using the extended ontology.  We exemplify this from our recent work with the CIDOC Conceptual Reference Model (CIDOC CRM; http://cidoc.ics.forth.gr/), an ontology developed for the museum community specifically to describe cultural heritage information.

In the CLAROS Project (http://www.clarosnet.org/; http://imageweb.zoo.ox.ac.uk/pub/2009/publications/Kurtz_Parker_Shotton_et_al-IEEE_CLAROS_paper.pdf), we have used the CIDOC CRM to provide a common framework onto which to map the data models of a number of academic resources describing classical art objects, and have then created a data web integrating information from these resources into a single user interface.  For this, we have employed the RDF version of CIDOC CRM from Erlangen University (http://purl.org/NET/crm-owl).

Dating in antiquity is not an exact science, and we needed to capture this uncertainty in our CLAROS metadata.  For this, we extended CIDOC CRM to permit us to record the estimated bounds of the inaccuracy relating to dates of creation of classical art objects.  We could have done this by introducing new properties to relate existing CIDOC CRM classes.  However, the problem with this approach is that baseline applications (without appropriate inference support) won’t know about these new properties, so the nature of the relationships that they might understand is lost, and sections of the knowledge graph might become disconnected for such applications. The alternative approach we chose to adopt was just to add new nodes at the edge of the CRM graph, and then add new properties relating these to instances of existing classes.  This left the original classes connected only by original properties. Furthermore, we supplemented our new date range metadata with more general statements that do not depend on knowledge of these extensions, accessible to ‘standard’ CIDOC CRM applications.

Specifically, we created the properties claros:not_before and claros:not_after, which have the CRM class E61.Time_Primitive as their domain, and used these, together with a ‘label’ time range statement for display by applications that understand only the non-extended CRM terms, thus (in RDF N3 notation):

. . .

[ rdf:type crm:E61.Time_Primitive ;
  claros:not_before "-525"^^xsd:gYear ;
  claros:not_after "-475"^^xsd:gYear ;
  rdfs:label "about 500 BC" ] .

The need for better tools

The ontology writing and editing tools Protégé version 4 (http://protege.stanford.edu/) and OBOedit version 2 (http://oboedit.org/) are powerful and sophisticated – see articles on Protégé & Protégé-OWL by Alan Rector and on OBO Format by David Osumi-Sutherland.  However, there is still a need for good tools in four areas to assist those creating, integrating and using ontologies:

  • Tools to assist in the early stages of ontology development, using text mining and natural language processing to extract relevant terms from the scientific literature that can then be organized into ontology modules.
  • Tools that allow vocabulary designers to capture, refine and ultimately formalize their intuitions without being forced to deal with distracting logical details early in the design process.
  • Tools to permit the comparison of different ontologies, and of different versions of a single ontology under development, and to visualize these differences in a readily comprehensible manner.
  • Tools to assist in the location of relevant third-party ontologies for integration and use.  The National Center for Biomedical Ontology’s BioPortal and the European Bioinformatics Institute’s Ontology Lookup Service work well for the biological domain, but there are no equivalents for more generic ontologies such as the Information Artifact Ontology, the Citation Typing Ontology, FRBR, FOAF and CIDOC CRM.  One just has to learn about these by reading, attending meetings, word of mouth and experience.  This makes entry into this field extremely difficult for newcomers.

Conclusion

Applying the principles outlined here should make it easier to create an ontology de novo, or to take an existing ‘tangled’ ontology, normalize it, and ‘bring it to market’.  Such public ontologies, being simpler and more loosely defined, should gain a far wider degree of consensus and obtain greater usage.  The mechanics of this ontology normalization process are described in a separate article entitled Automatic Maintenance of Multiple Inheritance Ontologies by Mikel Egaña Aranguren, while the construction of application ontologies is described in the related article Community Driven Ontology Development by James Malone.

Acknowledgements

This paper is a restatement and expansion of ideas first employed during the Second Animal Behavior Metadata Workshop held at Cornell University in September 2005, posted here.

This paper is an open access work distributed under the terms of the Creative Commons Attribution License 3.0 (http://creativecommons.org/licenses/by/3.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided that the original author and source are attributed.

The paper and its publication environment form part of the work of the Ontogenesis Network, supported by EPSRC grant EP/E021352/1.

David Shotton §, Chris Catton and Graham Klyne

]]>
What is an ontology? http://ontogenesis.knowledgeblog.org/66 http://ontogenesis.knowledgeblog.org/66#comments Fri, 22 Jan 2010 10:26:48 +0000 http://ontogenesis.knowledgeblog.org/?p=66

Robert Stevens*, Alan Rector* and Duncan Hull†

* School of Computer Science, The University of Manchester, Oxford Road, Manchester, UK
† EMBL Outstation – Hinxton, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK

Defining ontology

In OntoGenesis we intend to provide biologists and bioinformaticians with the means to understand ontologies within the context of biological data: their nature; their use; how they are built; and some of the bio-ontologies that exist. All of this is based upon knowing what an ontology is. This can then lead on to the motivation for their use in biology, how they are used, and so on. The definition of ontology is disputed, and this is confounded by computer scientists having re-used and re-defined a discipline of philosophy. The definition here will not suit a lot of people and will upset many (especially in its use of the word “concept”); we make no apology for this, noting only that the argument can take up resources better used in helping biologists describe and use their data more effectively.

In informatics and computer science, an ontology is a representation of the shared background knowledge for a community. Very broadly, it is a model of the common entities that need to be understood in order for some group of software systems and their users to function and communicate at the level required for a set of tasks. In doing so, an ontology provides the intended meaning of a formal vocabulary used to describe a certain conceptualisation of objects in a domain of interest. An ontology describes the categories of objects described in a body of data, the relationships between those objects, and the relationships between those categories. In doing so, an ontology describes those objects and sometimes defines what is needed to be known in order to recognise one of those objects. An ontology should be distinguished from thesauri, classification schemes and other simple knowledge organisation systems. By controlling the labels given to the categories in an ontology, a controlled vocabulary can be delivered, though an ontology is not a controlled vocabulary. When represented as a set of logical axioms with a strict semantics, an ontology can be used to make inferences about the objects that it describes, and consequently provides a means to symbolically manipulate knowledge.

In philosophy, ontology is a term with its origins with Aristotle in his writings on Metaphysics, IV,1 from around 350 BCE. In very general terms, it is a branch of philosophy concerned with that which exists; that is, a description of the things in the world. Philosophers in this field tend to be concerned with understanding what it means to be a particular thing in the world; that is, the nature of the entity. The goal is to achieve a complete and true account of reality. Computer scientists have taken the term and somewhat re-defined it, removing the more philosophical aspects and concentrating upon the notion of a shared understanding or specification of the concepts of interest in a domain of information that can be used by both computers and humans to describe and process that information. The goal with a computer science ontology is to make knowledge of a domain computationally useful. There is less concern with a true account of reality, as it is information that is being processed, not reality. The definition used here (and any other definition, for that matter) is contentious and many will disagree with it. Within the bio-ontology community there are those who take a much more philosophical stance on ontology; the OBO Foundry, for instance, takes such a view.

Putting the string “define:ontology” into the Google search engine finds some twenty or so definitions of ontology. They all cluster around either a philosophical or a computer science definition of ontology. This is presumably the root of the jibe that ontology is all about definitions, but there is no definition of ontology. So, we should really distinguish between philosophical ontology and computer science ontology and remove some of the dispute. Tom Gruber has one of the most widely cited definitions of ontology in computer science, though conceptual models of various types have been built within computer science for decades. Tom Gruber’s definition is:

“In the context of knowledge sharing, the term ontology means a specification of a conceptualisation. That is, an ontology is a description (like a formal specification of a program) of the concepts and relationships that can exist for an agent or a community of agents. This definition is consistent with the usage of ontology as set-of-concept-definitions, but more general. And it is certainly a different sense of the word than its use in philosophy.” DOI:10.1006/knac.1993.1008 DOI:10.1006/ijhc.1995.1081

The most noteworthy point is that Gruber states that his definition of ontology is not “ontology in the philosophical sense”. Nevertheless, computer science ontology is still informed by the philosophical, but the goals for their creation and use are different.

An important part of any ontology is the individuals or objects. There are trees, flowers, the sky, stones, animals, etc. As well as these material objects, there are also immaterial objects, such as ideas, spaces, representations of real things, etc. In the world of molecular biology and beyond, we wish to understand the nature of, distinctions between, and interactions of objects such as: small molecules and macromolecules; their functionalities; the cells in which they are made and work; together with the pieces of those cells; the tissues these cells aggregate to form; and so on. We do this through data collected about these phenomena, and consequently we wish to describe the objects described in those data.

As human beings, we put these objects into categories or classes. These categories are a description of that which is described in a body of data. The categories themselves are a human conception. We live in a world of objects, but the categories into which humans put them are merely a way of describing the world; they do not themselves exist; they are a conceptualisation. The categories in an ontology are a representation of these concepts. The drive to categorise is not restricted to scientists; all human beings seem to indulge in the activity. If a community agrees upon which categories of objects exist in the world, then a shared understanding has been created.

In order to communicate about these categories, as we have already seen, we need to give them labels. A collection of labels for the categories of interest forms a vocabulary or lexicon. Human beings can give multiple labels to each of these categories. This habit of giving multiple labels to the same category (synonymy), and the converse of giving the same label to different categories (polysemy), leads to grave problems when trying to use the descriptions of objects in biological data resources. This issue is one of the most powerful motivations for the use of ontologies within bioinformatics.
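A small sketch makes the problem concrete. Using invented labels and category identifiers, a lexicon that admits synonymy and polysemy can return several candidate categories for a single label, an ambiguity that a program cannot resolve without further context:

```python
def resolve_label(label, lexicon):
    """Resolve a free-text label to candidate category identifiers.

    `lexicon` maps each label to the set of category IDs it can name.
    Synonymy (several labels, one ID) is harmless here; polysemy
    (one label, several IDs) leaves an ambiguity that cannot be
    resolved without extra context.
    """
    return sorted(lexicon.get(label, set()))

# Invented labels and identifiers, purely for illustration:
lexicon = {
    "cell":         {"EX:biological_cell", "EX:battery_cell"},  # polysemous
    "battery cell": {"EX:battery_cell"},                        # a synonym
}
print(resolve_label("cell", lexicon))          # ['EX:battery_cell', 'EX:biological_cell']
print(resolve_label("battery cell", lexicon))  # ['EX:battery_cell']
```

Agreeing on one identifier per category, and committing to it, is precisely what a shared ontology provides.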

As well as agreeing on the categories in which we will place the objects of interest described in our data, we can also agree upon what the labels are for these categories. This has obvious advantages for communications – knowing to which category of objects a particular label has been given. This is an essential part of the shared understanding. By agreeing upon these labels and committing to their use, a community creates a controlled vocabulary.

The objects of these categories can be related to each other. When each and every member of one category or class is also a member of another category or class, then the former is subsumed by the latter, or forms a subclass of the superclass. This subclass/superclass relationship is variously known as the “is-a” (DOI:10.1109/MC.1983.1654194), subsumption or taxonomic relationship. There can be more than one subclass for any given class. If every single kind of subclass is known, then the description is exhaustive or covered. Also, any pair of subclasses may overlap in their extent, that is, share some objects, or they may be mutually exclusive, in which case they are said to be disjoint. Both philosophical and ontology engineering best practice often advocate keeping sibling classes pairwise disjoint.
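These notions are simple enough to sketch directly. The following Python fragment, using an invented toy hierarchy, computes the transitive closure of asserted is-a links and checks a simple disjointness constraint:

```python
def superclasses(cls, is_a):
    """All classes subsuming `cls`, following direct is-a links transitively.
    `is_a` maps each class to the set of its asserted direct superclasses."""
    found, stack = set(), [cls]
    while stack:
        for parent in is_a.get(stack.pop(), set()):
            if parent not in found:
                found.add(parent)
                stack.append(parent)
    return found

def disjointness_violated(asserted_classes, disjoint_pairs):
    """An individual asserted into two disjoint classes is inconsistent."""
    return any({a, b} <= set(asserted_classes) for a, b in disjoint_pairs)

# Invented toy hierarchy, purely for illustration:
is_a = {"Alcohol": {"OrganicMolecule"}, "OrganicMolecule": {"Molecule"}}
print(sorted(superclasses("Alcohol", is_a)))  # ['Molecule', 'OrganicMolecule']
print(disjointness_violated({"OrganicMolecule", "InorganicMolecule"},
                            [("OrganicMolecule", "InorganicMolecule")]))  # True
```

A Description Logic reasoner does far more than this, of course; the sketch only shows why subsumption is transitive by construction.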

As well as the is-a relationship, objects can be related to each other by many other kinds of relationship DOI:10.1186/gb-2005-6-5-r46. One of the most frequently used is the partOf relationship, which is used to describe how objects are parts of, components of, regions of, etc. of other objects. Other relationships will describe how one object developsInTo or is transformed into another object, whilst retaining its identity (such as tadpole to frog). The deriveFrom relationship describes how one object changes into another object with a change of identity. Another relationship describes how a discrete object can ParticipateIn a process object.

These relationships, particularly the is-a relationship give structure to a description of a world of objects. The relationships, like the categories whose instances they relate, also have labels. Relationship labels are another part of a vocabulary. The structured description of objects also gives a structured controlled vocabulary.

So far, we have only described relationships that make some statement about the objects being described. It is also possible to make statements about the categories or classes. When describing the elemental form of an atom, for example, ‘Helium’, statements about the discovery date or industrial uses are about the category or class, not about the objects in the class. Each instance of a ‘Helium’ object was not discovered in 1903; most helium atoms existed prior to that date, but humans discovered and labelled that category at that date.

Ideally, we wish to know how to recognise members of these categories. That is, we define what it is to be a member of a category. When describing the relationships held by an object in a category, we put inclusion conditions upon those instances or category membership criteria. We divide these conditions into two sorts:

  1. Necessary Conditions: These are conditions that an object must fulfil, but fulfilling that condition is not enough to recognise an object as being a member of a particular category.
  2. Necessary and Sufficient Conditions: These are conditions that an object must fulfil and are also sufficient to recognise an object to be a member of a particular category.

For example, the ontology of small molecules Chemical Entities of Biological Interest (ChEBI) DOI:10.1093/nar/gkm791 has a definition of alcohol, and there are several ways of defining what this means. Each and every organic molecule of an alcohol must have a hydroxyl group. That an organic molecule has a hydroxyl substituent is not, however, enough to make that molecule an alcohol. If, on the other hand, an organic molecule has a saturated backbone and a hydroxyl substituent on that backbone, that is enough to recognise an alcohol (at least according to the IUPAC “Gold Book”).
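The distinction can be illustrated with a toy encoding of the alcohol example. The dictionary representation of a molecule below is invented purely for illustration; the logic mirrors the necessary versus necessary-and-sufficient conditions above:

```python
def is_alcohol(mol):
    """Necessary-and-sufficient condition from the text: an organic molecule
    with a saturated backbone bearing a hydroxyl substituent.  The dictionary
    representation of a molecule is invented for illustration only."""
    has_hydroxyl = mol["has_hydroxyl"]  # necessary, but not sufficient on its own
    return mol["organic"] and mol["saturated_backbone"] and has_hydroxyl

ethanol = {"organic": True, "saturated_backbone": True,  "has_hydroxyl": True}
phenol  = {"organic": True, "saturated_backbone": False, "has_hydroxyl": True}
print(is_alcohol(ethanol))  # True
print(is_alcohol(phenol))   # False: the hydroxyl group alone is not enough
```

Phenol satisfies the necessary condition (it bears a hydroxyl group) but fails the sufficient one, which is exactly why the two kinds of condition must be kept apart.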

In making such definitions, an ontology makes distinctions. A formal ontology makes these distinctions rigorously. Broad ontological distinctions would include those between Continuant and Occurrent; that is, between entities (things we can put in our hands) and processes. Continuants take part in processes, and processes have participants that are continuants. Another distinction would be between Dependent and Independent objects. The existence of some objects depends on the existence of another object to bear them; for example, a blue colour depends upon the car that bears it, while the car is independent of its colour. Continuants can be sub-categorised into material and immaterial continuants, such as the skull and the cavity in the skull. Making such ontological distinctions primarily helps in choosing the relationships between the objects being described, and brings some level of consistency.

Capturing such descriptions, including the definitions forms an ontology. Representing these descriptions as a set of logical axioms with a strict semantics enables those descriptions to be reliably interpreted by both humans and computers. Forming a consensus on which categories should be used to describe a domain and agreeing on the definitions by which objects in those categories are recognised enables that knowledge to be shared.

The life sciences, unlike physics, have not yet reduced their laws and principles to mathematical formulae. It is not yet possible, as it is with physical observations, to take a biological observation, apply some equations, and determine the nature of that observation, make predictions, etc. Biologists record many facts about entities and from those facts make inferences. These facts are the knowledge about the domain of biology. This knowledge is held in the many databases and literature resources used in biology.

Due to human nature, the autonomous way in which these resources develop, the time span in which they develop, etc., the categories into which biologists put their objects and the labels used to describe those categories are highly heterogeneous. This heterogeneity makes the knowledge component of biological resources very difficult to use. Deep knowledge is required by human users and the scale and complexity of these data makes that task difficult. In addition, the computational use of this knowledge component is even more difficult, exacerbated by the overwhelmingly natural language representation of these knowledge facts.

In molecular biology, we are used to having nucleic acid and protein sequence data that are computationally amenable. There are good tools that inform a biologist when two sequences are similar. Any evolutionary inference based on that similarity is, however, based upon knowledge about the characterised sequence. Use of this knowledge has depended on humans, and reconciliation of all the differing labels and conceptualisations used in representing that knowledge is necessary. For example, in post-genomic biology, it is possible to compare the sequences of the genome and the proteins it encodes, but not to compare the functionality of those gene products.

There is, therefore, a need to have a common understanding of the categories of objects described in life sciences data and the labels used for those categories. In response to this need biologists have begun to create ontologies that describe the biological world. The initial move came from computer scientists who used ontologies to create knowledge bases that described the domain with high fidelity; an example is EcoCyc http://view.ncbi.nlm.nih.gov/pubmed/8594595. Ontologies were also used in projects such as TAMBIS DOI:10.1147/sj.402.0532 to describe molecular biology and bioinformatics, to reconcile diverse information sources and allow creation of rich queries over those resources. The explosion in activity came, however, in the post-genomic era with the advent of the Gene Ontology (GO) DOI:10.1038/75556. The GO describes the major functional attributes of gene products – molecular function, biological process and cellular component. Now some forty-plus genomic resources use GO to describe these aspects of the gene products of their respective organisms. Similarly, the Sequence Ontology describes sequence features; PATO (the Phenotype And Trait Ontology) describes the qualities necessary to describe an organism’s phenotype. All these and more are part of the Open Biomedical Ontologies project (OBO) DOI:10.1038/nbt1346.

Conclusion

In conclusion, we can say that there is a need to describe the entities existing within data generated by biologists so that they know what they are dealing with. This entails being able to define the categories of biological entities represented within those data. As well as describing the biological entities, we also need to describe the science by which they have been produced. This has become a large effort within the bioinformatics community. It has also been found to be a difficult task and much effort can be used in attempting to find the true nature of entities in biology and science. It should be remembered, however, that the goal of the bio-ontology effort is to allow biologists to use and analyse their data; building an ontology is not a goal in itself.

References

This text is adapted and updated from Ontologies in Biology by Robert Stevens. A numbered list of references will be generated from the DOIs above in later drafts of this article, after peer review.

Acknowledgements

This paper is an open access work distributed under the terms of the Creative Commons Attribution License 3.0, which permits unrestricted use, distribution, and reproduction in any medium, provided that the original author and source are attributed.

The paper and its publication environment form part of the work of the Ontogenesis Network, supported by EPSRC grant EP/E021352/1.

]]>
Semantic Integration in the Life Sciences http://ontogenesis.knowledgeblog.org/126 http://ontogenesis.knowledgeblog.org/126#comments Thu, 21 Jan 2010 15:20:03 +0000 http://ontogenesis.knowledgeblog.org/?p=126

There are a number of limitations in data integration: data sets are often noisy, incomplete, of varying levels of granularity and highly changeable. Every time one of the underlying databases changes, the integrated database needs to be updated, and if there are any format changes, the parsers that convert to the unified format need to be modified as well. This “database churn” was identified by Stein to be a major limiting factor in establishing a successful data warehouse (Stein 2003).

Ruttenberg et al. see the Semantic Web, of which both OWL and RDF are components, as having the potential to aid translational and systems biology research; indeed, any life science field where there are large amounts of data in distributed, disparate formats should benefit from Semantic Web technologies (Ruttenberg et al. 2007).

Semantic Integration

Integrated data sources, whether distributed or centralised, allow querying of multiple data sources in a single search. Traditional methods of data integration map at least two data models to a single, unified, model. Such methods tend to resolve syntactic differences between models, but do not address possible inconsistencies in the concepts defined in those models. Semantic integration resolves the syntactic heterogeneity present in multiple data models as well as the semantic heterogeneity among similar concepts across those data models. Often, ontologies or other semantic web tools such as RDF are used to perform the integration.

Addressing Semantic Heterogeneity

Semantic heterogeneity describes the difference in meaning of data among different data sources. A high level of semantic heterogeneity makes direct mapping difficult, often requiring further information to ensure a successful mapping. Such heterogeneity is not resolved in more traditional syntactic data integration methods. For instance, in data warehousing or data federation, multiple source schemas (e.g. database schemas) are converted to a single target schema. In data warehousing, the data stored in the source models is copied to the target, while in federated databases the data remains in the source models and is queried remotely via the target schema.

However, the schema reconciliation in non-semantic approaches tends to be hard-coded for the task at hand, and is not easily used for other projects.  Often, data is aligned by linking structural units such as XSD components or table and row names.  Further, concepts between the source and target schema are often linked based on syntactic similarity, which does not necessarily account for possible differences in the meanings of those concepts.  For instance, a protein in BioPAX is strictly defined as having only one polypeptide chain, while a protein in UniProtKB (The UniProt Consortium 2008) can consist of multiple chains.  Semantic data integration is intended to resolve both syntactic and semantic heterogeneity and can allow a richer description of the domain of interest than is possible with syntactic methods.  By using ontologies, kinds of entities, including relations, can be integrated across domains based on their meaning.  However, application of such techniques in bioinformatics is difficult, partly due to the bespoke nature of the majority of available tools.

The protein example can be further extended to illustrate the practical differences between traditional data integration and semantic integration. In traditional data integration methods, two database schemas may contain a “Protein” table, but if what the developers mean by “Protein” is different, there is little way of determining this difference programmatically. An integration project using these two schemas as data sources may erroneously mark them as equivalent tables. In semantic integration, if the two data sources had modelled Protein correctly, the differences in their meaning would be clear both programmatically and to a human looking at the axioms for Protein in two data sources’ ontologies. In such cases, once the semantic differences are identified they can then be resolved. One possibility would be the creation—by the person creating the integrated ontology and data set—of a Protein superclass that describes a Protein in a generic way. The two source definitions could then be modelled as children of that Protein superclass.
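That resolution strategy can be sketched in ordinary Python, standing in for the ontology axioms. Both class definitions below are caricatures of the BioPAX and UniProtKB senses, invented for illustration; what matters is that a query against the generic superclass retrieves instances of both:

```python
class Protein:
    """Generic superclass introduced by the integrator."""

class BioPAXProtein(Protein):
    """BioPAX sense: exactly one polypeptide chain."""
    def __init__(self):
        self.chains = 1

class UniProtProtein(Protein):
    """UniProtKB sense: one or more chains."""
    def __init__(self, chains):
        self.chains = chains

def proteins_of(entities):
    """A query phrased against the generic superclass retrieves both senses."""
    return [e for e in entities if isinstance(e, Protein)]

entities = [BioPAXProtein(), UniProtProtein(chains=3), "not a protein"]
print(len(proteins_of(entities)))  # 2
```

In OWL the same move is made with two subclass axioms against a new generic Protein class, rather than with an owl:EquivalentClass assertion that would paper over the semantic difference.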

Ontology-based Integration

Integration methods based on ontologies can be more generic, re-usable and independent of the integrative applications they were created for, when compared with traditional approaches which resolve only syntactic heterogeneity (Cheung et al. 2007). Mappings between schemas in non-semantic approaches are specific to those schemas, and cannot be applied to other data sources; in contrast, mappings between ontologies (and therefore to the data sources that utilise those ontologies) can be used by any resource making use of those ontologies, and not just the originally intended data sources. Two concepts may have different names, but if they reference the same ontology term, then it may be sensible to mark them as semantically equivalent. However, this method brings its own challenges, as described in the Ontogenesis article Ontologies for Sharing, Ontologies for Use:

“The alternative approach of defining equivalences between terms in different ontologies suffers from some of the same problems, since use of owl:EquivalentClass is logically strict. Strict equivalence is inappropriate if the definitions of the classes within the two ontologies differ significantly. […] An alternative is just to indicate that some sort of relationship exists between classes between two ontologies by use of skos:related (http://www.w3.org/TR/skos-primer/).”

Ontology mapping, also known as class rewriting, is a well-studied methodology that allows the mapping of a source class to a target class from a different ontology. As primitive classes are used in DL to characterise defined classes (pg. 52, Baader et al. 2003), such rewriting also allows the linking of relationships (also known as properties) between the two ontologies. Mapping can be used to automatically generate queries over the data source ontologies via a core ontology, using views over those source ontologies. Additionally, mapping can be applied more generally to rewrite the required features of data source ontologies as a function of a core ontology, as described in Rousset et al. for two existing data integration systems, PICSEL and Xyleme (Rousset et al. 2004).

In the life sciences, the most common formats for ontologies are OWL and OBO. More complex semantic integration tasks can involve more than two ontologies, and often employ a mediator, or core, ontology which is used in concert with more than one source ontology.

Mapping Strategies

Often, the data sources to be integrated cover very different domains, and one or even two ontologies are not sufficient to describe all of the sources under study. In such cases, there are a variety of methodologies for mapping more than two ontologies together. Most ontology integration techniques involving more than two ontologies can be classified according to two broad mapping strategies: global-as-view, where the core ontology is created as a view of the source ontologies, and local-as-view, where the reverse is true. Global-as-view mapping defines the core ontology as a function of the source ontologies rather than as a semantically-rich description of the research domain in its own right, though the level of dependence of the core ontology can vary (Wache et al. 2001, Rousset et al. 2004, Gu et al. 2008). With local-as-view, the core ontology is independent of the source ontologies, and the source ontologies themselves are described as views of the core ontology.
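The direction of the view definitions can be sketched in Manchester syntax as follows (the core:, src1: and src2: prefixes and class names are hypothetical). The first axiom is global-as-view, defining a core class in terms of the source classes; the second is local-as-view, describing a source class against an independently modelled core ontology:

```
Class: core:Protein
    EquivalentTo: src1:Protein or src2:Protein

Class: src1:Protein
    SubClassOf: core:Protein
```

In the global-as-view style, any change to a source ontology may force a change to the core definition; in the local-as-view style, new sources can be added by writing new mapping axioms without touching the core ontology.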

Hybrid approaches (Lister et al. 2009, Xu et al. 2004) also generate mappings between source ontologies and the core ontology. However, unlike traditional approaches, the core ontology is completely independent of any of the source ontologies. Such approaches allow both the straightforward addition of new source ontologies and the maintenance of the core ontology as an independent entity.

Current Semantic Integration Efforts

RDF databases are generally accessed and queried via SPARQL. Life science RDF databases include Data Web projects such as OpenFlyData (Miles et al., submitted), Neurocommons (Ruttenberg et al. 2009), BioGateway (Antezana et al. 2009) and S3DB (Deus et al. 2008). Many others are listed in Table 1 of Antezana et al. (2009). Some databases use only RDF, while others also make use of OWL.

Databases such as RDF triple stores provide data sets in a syntactically similar way, but the semantic heterogeneity is not necessarily resolved. For instance, while Bio2RDF stores millions of RDF triples, queries must still trace a path against existing resources rather than have those resources linked via a shared ontology or ontologies (Belleau et al. 2008). Shared vocabularies (e.g. OBO Foundry ontologies) can be used to build connections between RDF data files, which would provide existing connections among data sets that could be leveraged by integration projects.

Semantic integration projects can make use of expressive logic-based ontologies to aid integration. Work on ontology mapping and other semantic data integration methodologies in the life sciences includes the RDF approaches mentioned above as well as the TAMBIS ontology-based query system (Stevens et al. 2000); mapping the Gene Ontology to UMLS (Lomax et al. 2004); the integration of Entrez Gene/HomoloGene with BioPAX via the EKoM (Sahoo et al. 2008); the database integration system OntoFusion (Alonso-Calvo et al. 2007); the SWRL mappings used in rule-based mediation to annotate systems biology models (Lister et al. 2009); and the pharmacogenomics of depression project (Dumontier and Villanueva-Rosales, 2009).

Even with improved methods in data integration, problems of data churn remain. Some projects, such as that by Zhao et al., have proposed the use of Named Graphs to track provenance and churn of bioinformatics data, such as gene name changes (Zhao et al. 2009). Ultimately, it is not just the syntax and semantics of the data sources which must be resolved, but also the challenges associated with ensuring that data is up to date, complete and correctly traced and labelled.

Acknowledgements

This paper is an open access work distributed under the terms of the Creative Commons Attribution License 3.0 (http://creativecommons.org/licenses/by/3.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided that the original author and source are attributed.

The paper and its publication environment form part of the work of the Ontogenesis Network, supported by EPSRC grant EP/E021352/1.

Ontological Annotation of Data http://ontogenesis.knowledgeblog.org/50 Thu, 21 Jan 2010

Introduction

This is a blogged article on how scientific data, specifically biomedical data, is annotated with ontologies. It introduces the kinds of data that are annotated, the people who perform annotation, the ontologies used for annotation, and resources such as databases which make that annotation available to users. It is not intended to be a comprehensive guide to all ontologies or databases which use ontologies.

Author Profile

Helen Parkinson is a geneticist who was seduced to the dark side (bioinformatics) 10 years ago. She manages and annotates high-throughput functional genomics data for the ArrayExpress database and the Atlas of Gene Expression hosted at the European Bioinformatics Institute. She also builds ontologies such as EFO and OBI to annotate these data.


1. What does ‘Ontological Annotation of Data’ mean?

Good question; let's start with what we mean by data in this context. There are articles describing 58 new databases and 73 updated databases in the 2010 NAR database issue. These databases are necessary because the scientific technology we use in biomedicine now produces huge amounts of data. For example, a PhD student in 1990 might routinely sequence 1 kilobase of DNA using 35S sequencing technology (subject to possessing the technical skills to do the experiment). Such volumes of data can be stored easily in a FASTA format file. The same PhD student in 2010 could sequence several human genomes (subject to funding, access to a sequencing facility to perform the experiment, and ethical approval).

This presents a data and knowledge management problem. The raw data generated by the sequencer can be stored in the same file formats as were used in 1990; however, the information about the genes present in the genome, their position, their function and whether they are expressed in the individual being assayed is usually stored in a database. When we consider the phenotype of the human from which samples were taken, the purpose of the study, and the results generated by the study, there are two axes of annotation to consider: that which relates to what is being assayed (the genetic content of the individual, where the genes are, and what they may do), and the metadata about the individual: age, sex, physical characteristics, diseases they may have, and what was actually sampled, e.g. diseased or normal tissue, or peripheral blood.

2. Who does the annotation?

In our example the PhD student may have done the annotation of the 1 kilobase of DNA themselves in 1990, and PhDs were awarded for finding genes, sequencing parts of the genome and functional analysis. In 2010 the functions of many genes are known, and this information is reported in the scientific literature as free text. Free text can be searched effectively, but the information on gene function is more useful when it is organised and the knowledge linked to the gene information. The most commonly used ontology in biomedicine is the Gene Ontology, or GO, which has the “aim of standardizing the representation of gene and gene product attributes across species and databases”. The Gene Ontology is built by a team of specialist bioinformaticians who structure the ontology, provide definitions and generally ensure that it is fit for purpose. GO is used by curators of model organism databases like ZFIN or domain-specific databases like UniProt to annotate genes.

3. Why do they do it?

GO is used to describe gene products in a formal and structured way. As gene products have common functions across species, proteins from more than 20 species are annotated to the GO term ‘transcription factor binding’ in UniProt (a database of proteins). Transcription factor binding is a high-level term: it has 9 direct child terms, each of which also has child terms linked by is-a relationships. The structure of the molecular function hierarchy of the GO allows subsumption queries which traverse these relationships and represent more or less specific knowledge about biological processes as the hierarchy is traversed. The GO enforces an ‘all paths to root must be true’ rule, so the terms and their relationships represent a statement of biological truth based on available knowledge. E.g.

transcription factor binding is-a protein binding is-a binding is-a molecular function
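The is-a chain above can be sketched as Manchester syntax class axioms (the identifiers are illustrative renderings of the GO term labels, not the GO's actual numeric IDs):

```
Class: 'transcription factor binding'
    SubClassOf: 'protein binding'

Class: 'protein binding'
    SubClassOf: 'binding'

Class: 'binding'
    SubClassOf: 'molecular function'
```

A subsumption query for 'protein binding' then also returns gene products annotated to the more specific term 'transcription factor binding'.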

More or less specific annotation can be specified by a curator selecting a term from lower or higher up the hierarchy. Annotations are made to GO based on scientific literature, automated analyses based on sequence homology, and assertions made by expert curators. Annotations change over time on the basis of emerging biological knowledge, and the content of the GO also changes as terms are added or removed; annotations are therefore updated periodically.

4. Sample annotation vs. gene annotation

In the example above we considered gene-specific annotation and explored the use of the GO in the context of protein databases. Now let us suppose our PhD student has several human cell lines and is sequencing these to compare differences in the expression of genes in these samples. We saw that GO provides annotation on processes, functions and cellular components, so what sort of annotation about these cell lines is important, and why?

Cell lines can be immortalized; in this case they are derived from diseased tissue in a human and are used as a model system for investigating the disease process. Cell lines are commercially available from centres such as ATCC, who provide rich information about the cell type, growth conditions and disease of the donor. This information is expressed as free text in ATCC, and some of this text has been structured into an application ontology called EFO. This allows us to identify all cell lines which are derived from cancer samples, if the EFO terms are mapped into available data sets. The relationships between concepts relating to cell lines are represented below in Manchester syntax.

‘cell line’ and derives_from some ‘organism part’

‘cell line’ and bearer_of some ‘disease state’

‘cell line’ and derives_from some ‘species’

‘cell line’ and derives_from some ‘cell type’

Once we have this information for our cell lines of interest, and these are mapped into an appropriate dataset, we can combine this information with the gene annotation using GO and expression data, and perform complex queries. For example: which human genes annotated with the GO process ‘cell adhesion’ are over-expressed in cell lines derived from cancer cells?
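As a hedged sketch using the EFO-style relations above, the cell lines of interest in such a query might be picked out by a Manchester syntax class expression like the following, assuming a ‘cancer’ class modelled as a subclass of ‘disease state’:

```
'cell line' and bearer_of some 'cancer'
```

The expression describes the set of cell line individuals bearing some cancer; the gene side of the query would be answered separately against the GO annotations and the expression data.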

This type of query thus requires multiple ontologies mapped into two different datasets, and either a GUI to visualize the result of the query or some programmatic access. In this example the annotations were mapped to sample data (provided by biologists like our PhD student) by the ArrayExpress curators, the gene annotations were provided by the GOA curators at the EBI, and the data is visualized by the Atlas of Gene Expression at the EBI. Ontologies can therefore be made directly interoperable via application ontologies or via data.

5. Tools for applying ontologies to data

We have already discussed two user groups: specialist curators who build and annotate to GO and an application ontology, and our PhD student, who is annotating their own data and consuming existing GO annotations. This suggests we need different types of tools for these two types of users, who have different skill sets.

Where can I get GO annotations?

GO annotations are available from many different resources; a complete list of tools that search the GO is maintained by the Gene Ontology Consortium, and many of these tools also provide links to proteins annotated to GO terms.

What tools can I use to annotate my samples?

In our example of samples annotated with cell lines, the annotation is made in the context of a submission to a database, and is performed by curators who use lexical matching tools combined with manual curation. There are also data submission and annotation tools such as Annotare.

How can I search ontologies?

The BioPortal and Ontology Lookup Service (OLS) search tools provide access to multiple ontologies, which can be searched singly or in combination for common concepts such as ‘fibroblast’.

Can annotation be automated?

Human curators are expensive, highly skilled individuals, and the volume of data is growing beyond the ability of existing curators to annotate it. There are a number of attempts to automate annotation using text mining tools such as Textpresso and Whatizit; curator support tools also use this technology.

6. Conclusion

Data is annotated with ontologies by both biologists and specialist curators who both use and create ontologies for this purpose. Annotation is made available by databases which offer GUIs for searching and programmatic access via APIs. Some data is automatically annotated using text mining tools.

Acknowledgements

This paper is an open access work distributed under the terms of the Creative Commons Attribution License 2.5 (http://creativecommons.org/licenses/by/2.5/), which permits unrestricted use, distribution, and reproduction in any medium, provided that the original author and source are attributed.

The paper and its publication environment form part of the work of the Ontogenesis Network, supported by EPSRC grant EP/E021352/1.
