Reviewed – Ontogenesis http://ontogenesis.knowledgeblog.org An Ontology Tutorial Thu, 05 Apr 2012 10:28:19 +0000 en-US hourly 1 https://wordpress.org/?v=5.2 Is a class the same as its extent? http://ontogenesis.knowledgeblog.org/1004 http://ontogenesis.knowledgeblog.org/1004#comments Fri, 14 Jan 2011 16:25:34 +0000 http://ontogenesis.knowledgeblog.org/?p=1004

Abstract

While it is true that an ontology class defines a set of individuals, its extent, the reverse does not hold: the class is not the same as its extent. In this article, we consider why this is so, both when modelling the real world and within the context of the formal semantics of OWL.


Author

Phillip Lord,
School of Computing Science,
Newcastle University

Uli Sattler
School of Computer Science
Manchester University


Introduction

An ontology consists of a number of classes; these logically divide the world up into sets of individuals. This set of individuals is known as the extent of the class. In this article, we consider whether the reverse is true: does this set of individuals define the class? We consider this in two ways. First, we investigate the semantics of OWL, the Web Ontology Language. Second, we use a biological example to clarify why there needs to be a separation between the class and its extent.


OWL

One of the key features of OWL is that it has a well-defined semantics. In general, this formal semantics is most important when building computational tools to reason over OWL ontologies; the precisely defined meaning of statements makes it possible for independent tools to come to identical and clearly defined conclusions. In this article, we consider this semantics informally, and the implications that it has for the meaning of a class in OWL.

Different users of OWL tend to think of the statements made within the ontology in different ways. Consider a very simple OWL statement such as:

Class: B
    SubClassOf: A

The simplest interpretation of these statements is “if you are a B then you are also an A“. A slightly more formal logical interpretation is “given B subclass of A, any instance b of B is also an instance of A”. An alternative, set-theoretic interpretation is that “B subclass of A implies that the set of instances of B is a subset of the set of instances of A”. These interpretations are all equivalent. We can use the set-theoretic semantics to define two further parts of OWL. owl:Thing can be interpreted as a set containing all possible instances, while a class that is equivalent to the empty or null set, which contains no instances, is “unsatisfiable”; such classes are often called “inconsistent”, although strictly this is a property of an ontology with at least one unsatisfiable class.
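The set-theoretic reading above can be sketched in code. This is an illustrative sketch only, not OWL tooling: one particular universe is modelled as a Python set, classes as subsets of it, and all individual names are invented.

```python
# One possible universe (interpretation) for the axiom "B SubClassOf A".
# Classes are modelled as plain sets of individuals; names are invented.

universe = {"i1", "i2", "i3", "i4", "i5"}   # owl:Thing: all individuals
A = {"i1", "i2", "i3", "i4"}                # the extent of A here
B = {"i1", "i2"}                            # the extent of B here

# "B SubClassOf A" holds in this universe exactly when B's extent is a
# subset of A's extent.
assert B <= A

# The empty set is a subset of every set, so a class with an empty extent
# trivially satisfies any subclass axiom in this universe.
empty = set()
assert empty <= B
```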

At first sight, therefore, this set-theoretic interpretation appears to imply that OWL classes are extensional; that is, that they are defined purely by their membership. However, there is an added complication which alters this conclusion.

The set-theoretic interpretation of OWL is made with respect to a mathematical universe; this is not the real universe of things around us, but a collection of all the mathematical individuals that we wish to consider. For a given ontology, there are many different potential mathematical universes; again, this simply means that we can consider different sets of individuals, all of which obey the statements in the ontology.

In one universe, A might have 10 individuals and B might have 4. In another universe, both A and B might be empty, having no individuals. In another, both A and B might contain all individuals. An OWL ontology allows us to distinguish between those universes where all its statements are true, or satisfied, and those where this is not the case. The former universes are known as models. There may be many, perhaps infinitely many, models. Conclusions, or implications, must be true in every model. That a model exists where neither A nor B has any instances does not make these classes unsatisfiable; for a class to be unsatisfiable, there must be no model in which it has individuals.
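The distinction between universes and models can be made concrete. In this sketch, with invented extents, we enumerate a few candidate universes for the single axiom “B SubClassOf A” and keep only those that satisfy it:

```python
# Candidate universes for "B SubClassOf A": each assigns an extent to A and B.
candidates = [
    {"A": set(range(10)), "B": set(range(4))},  # A has 10 individuals, B has 4
    {"A": set(),          "B": set()},          # both classes empty
    {"A": {1, 2},         "B": {1, 2, 3}},      # B is not a subset of A
]

# A universe is a model when every axiom (here, just the one) is satisfied.
models = [u for u in candidates if u["B"] <= u["A"]]
assert len(models) == 2          # the third candidate is not a model

# The all-empty model does not make B unsatisfiable: B is satisfiable
# because at least one model gives it a non-empty extent.
assert any(u["B"] for u in models)
```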

The motivation for this form of interpretation lies in OWL’s “open world assumption”: things which are not stated are considered unknown. For a given ontology, it would be a mistake to interpret classes or subclass relations on the basis of their stated individuals. After all, these are only the individuals that we know about, and there could be others.

It is possible, within OWL, to define a class extensionally using owl:oneOf; this lists all the individuals that are in the set of this class in any universe. This form of definition is very much the exception within an OWL ontology rather than the rule.
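A sketch of this extensional style of definition, reusing the Beatles example that appears later in this collection; the `one_of` helper is invented for illustration:

```python
# owl:oneOf fixes a class's extent to an enumerated set of individuals,
# the same in every universe. Sketch only; the helper is invented.

def one_of(*individuals):
    """Return the (immutable) extent of an enumerated class."""
    return frozenset(individuals)

BeatlesMember = one_of("John", "Paul", "George", "Ringo")

assert "Ringo" in BeatlesMember
assert "Mick" not in BeatlesMember
assert len(BeatlesMember) == 4     # no universe can add or remove members
```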


Biology

At first sight, it might appear that the formal semantics of OWL is not relevant to modelling biology ontologically, especially if a technology other than OWL is being used. In this case, there is only a single universe with a specific set of individuals, so the idea of “in all possible universes” does not seem to make sense. However, here we argue that the idea of potential individuals is still useful, and that it clearly demonstrates the distinction between a class and its extent.

Consider three terms: ReceptorProtein, PhotoreceptorProtein and 7-transmembraneProtein. It seems fairly straightforward to deduce that PhotoreceptorProtein is a subclass of ReceptorProtein: any protein which operates as a photoreceptor must, by definition, also operate as a receptor. Alternatively, in terms of the set-theoretic semantics, the set of photoreceptors is a subset of the set of receptors.

Most biologists will know of the relationship between the 7-transmembrane proteins and photoreceptors: the best known photoreceptor family is the extensive Opsin/Rhodopsin family, whose members are found in organisms as disparate as humans and bacteria. Our quick survey of five biologists showed that all of them were aware of this. However, without recourse to external resources, none were able to say whether there were any photoreceptors which are not also 7-transmembrane proteins. Even with recourse to external resources, they could only answer questions about proteins currently known, not about all proteins.

Consider next the ontology of these classes. If classes were defined simply by their extension, then we would need an answer to this question. If there are no photoreceptors which are not also 7-transmembrane proteins, then the set of photoreceptors is a subset of the set of 7-transmembrane proteins; under these circumstances, PhotoreceptorProtein would be a subclass of 7-transmembraneProtein. Unfortunately, it is difficult or impossible to answer these questions.

However, if we think of classes in the same way as OWL, then instead we ask whether it is possible that a protein might operate as a photoreceptor without being a 7-transmembrane protein. Most biologists would suggest that it is both reasonable and plausible that this kind of protein might exist. This suggests that PhotoreceptorProtein should not be a subclass of 7-transmembraneProtein.
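The closed-world mistake that this reasoning avoids can be shown with a toy extent check. The protein names below are illustrative only, not a curated dataset:

```python
# Suppose these are the only photoreceptors and 7-transmembrane (7TM)
# proteins we currently know about (illustrative data only):
known_photoreceptors = {"rhodopsin", "melanopsin"}
known_7tm = {"rhodopsin", "melanopsin", "beta2_adrenergic_receptor"}

# Extensionally, every known photoreceptor is 7TM...
assert known_photoreceptors <= known_7tm

# ...but under the open-world view this does not license the axiom
# "PhotoreceptorProtein SubClassOf 7-transmembraneProtein": a possible,
# as-yet-unknown photoreceptor might fall outside the known superset.
possible = known_photoreceptors | {"hypothetical_non_7tm_photoreceptor"}
assert not possible <= known_7tm
```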

By contrast, consider a roughly analogous example with the two classes TransmembraneProtein and IonTransporterProtein. In this case, there are good grounds for suggesting that an ion transporter protein is also going to be a transmembrane protein: ions cannot directly pass through a lipid bilayer, and it is hard to see how a protein could enable this without having access to both sides of a plasma membrane. There seems to be no possible protein which could be an ion transporter without also being a transmembrane protein. In this case, within an ontology, it would seem reasonable that IonTransporterProtein should be either asserted or inferred to be a subclass of TransmembraneProtein.

The consideration of individuals which are possible, which might exist, is therefore useful when building an ontology. This is not to suggest that we need to take into account all possibilities when building an ontology. For instance, within most biomedical ontologies, we are not considering the real universe of all individuals, but a subset of it, which we describe as “non-pathological”. For example, an anatomical ontology might state that the set of individuals of the class Toes is a subset of the class part_of Foot. It can make this statement despite the fact that in the real world there are individuals without toes, or with toes which are not part of the foot, because such individuals are not being considered in this case. By analogy with the OWL semantics, they are outside the “universe”. We might also choose to restrict our universe to a single species, another common assumption in medical ontologies, which excludes many complex possibilities from consideration.

Finally, we can exclude possibilities that are simply implausible; for example, when building a biomedical ontology, we do not need to worry about the possible existence of silicon-based life, or other forms of alien existence. If, at some point in the future, either of these is discovered, we will need to change our ontologies; this is likely to be one of the smaller changes caused by such a discovery.


Conclusion

Although OWL uses a set-theoretic semantics, it is easy to misinterpret this as meaning that a class is defined by its extent. At first sight, the understanding that classes are defined by any possible extent within a given universe may seem overly complex, but it actually mirrors a process that can be used to determine subclass relations when modelling biological individuals: while a class will contain a specific set of individuals, its extent, we rarely know what all of these individuals are, and they may change in the future. Instead, we need to decide which possible individuals a class might contain; it is this set that defines the meaning of the class.

What is an upper level ontology? http://ontogenesis.knowledgeblog.org/740 http://ontogenesis.knowledgeblog.org/740#comments Tue, 13 Apr 2010 13:09:25 +0000 http://ontogenesis.knowledgeblog.org/?p=740

Abstract: Upper level ontologies are used to facilitate the semantic integration of domain ontologies and guide the development of new ontologies. For this purpose, they contain general categories that are applicable across multiple domains. Upper level ontologies usually provide rich definitions and axioms for their categories. Different upper level ontologies provide different distinctions based on the kinds of entities they include, their theories of space and time, as well as the relation of individuals to space and time.

Author: Robert Hoehndorf, European Bioinformatics Institute

What is an upper level ontology?

An ontology is a shared conceptualization of a domain. Ontologies are used to specify the meaning of the terms in a vocabulary that is used within some domain. To represent the meaning of terms, ontologies contain categories. These are organized in an is-a hierarchy, the taxonomy. Some categories are more general than others with respect to the is-a hierarchy. When defining these general categories within a domain, it is often possible to introduce more general categories. For example, in an ontology of cell components, the most general category may be Cell component. When we want to define Cell component, we can introduce another, more general category, and provide distinguishing properties. For example, we can define Cell component as an Object which is part of some Cell. Object is the more general category here, and Cell component will be a sub-category of Object. We can then further define Object as an Entity which has spatial extension and is wholly present at one point of time.

The assumption behind upper-level ontologies is that, when this generalization is performed in ontologies of multiple domains, we will come up with a small set of categories that is the same in all these domains. Most domains will deal with objects, processes, properties, relations, space, time, roles, functions, categories, individuals or similar. An upper-level ontology is an ontology that defines and axiomatizes these most general categories.

There is considerable disagreement about what these general categories that are relevant in any domain are. There is even more debate about what the properties of these categories are. For example, what are the properties of time? Are there only time points, with a time interval defined as the set of all time points between two distinct time points? Or are time intervals primitive, with time points derived from infinite sets of time intervals that meet? Are there atomic time intervals, or is time continuous?

Many of these questions have been debated in philosophy for thousands of years. As a result, upper level ontologies often rely much more on philosophical theories and commitments to particular philosophical views than domain ontologies. Upper level ontologies also rely much more on axioms than on formal or natural language definitions, because it is often hard to define an upper level category using other categories that are uniformly understood. Instead, the way these categories interact with other categories becomes more important. Therefore, although it is often sufficient for many domain ontologies to define categories in natural language or through explicit definitions, rich axiom systems are necessary to establish the meaning of upper level categories.

A foundation of a domain ontology in an upper ontology consists, at a minimum, of the assignment of super-categories from the upper level ontology to all the categories of the domain ontology. Because the domain ontology will usually be structured in the form of a taxonomy, only a few categories will have to be explicitly assigned a super-category from the upper level ontology. A more expressive method of foundation is the ontological reduction of a domain category to an upper level ontology, where domain categories are explicitly or implicitly defined using the categories of the upper level ontology.

Why use upper level ontologies?

The main application of upper level ontologies is to provide semantic interoperability of ontologies across multiple domains. Because upper level ontologies provide general concepts which are common to all domains, they can provide a common ontological foundation for domain ontologies.

For example, consider an ontology for physics with a category Electron, and another ontology in the manufacturing domain with a category Transporter. Both Electron and Transporter may be defined as sub-categories of Object. Yet Object in an ontology of physics and Object in an ontology of manufacturing may have different properties; they may in fact be completely different things. For example, instances of Object in physics may always have a temporal extension, or their location may not be determinable at the same time as their momentum. In the manufacturing domain, objects may always have a price, they may always have two or more components, and so on. An upper level ontology provides well-defined primitives to make these conflicts explicit, and to provide a common foundation for both. Electron could be classified as a sub-category of Process, while Transporter becomes a sub-category of Endurant. Therefore, upper level ontologies help to make the ontological commitment of a vocabulary explicit.

Upper level ontologies provide restrictions on the categories they provide through axioms. These restrictions are inherited by the domain ontologies which are founded in the upper level ontologies. Consequently, upper level ontologies provide a means to verify domain ontologies with respect to a particular foundation in an upper level ontology. This is particularly useful when a new ontology is developed with the intention to semantically interoperate with an already existing ontology.

When applied in the ontology development process, upper level ontologies provide a means to verify basic ontological constraints. They can also be used to verify the compatibility of the developed ontology with other ontologies that are founded in the same upper level ontology. Consequently, they can provide a high-level compatibility and plausibility check for domain ontologies and their semantic integration.

Categories and individuals

A fundamental upper level distinction is that between individuals (or particulars) and ontological categories. Although there is considerable discussion about the nature of individuals in philosophy, the common definition states that an individual is an entity that cannot be instantiated, whereas a category can be instantiated. The relation between a category and its instances is the instance-of relation.

Most upper level ontologies focus on the kinds of individuals that are present in some domain. However, there are other entities that are relevant in several domains. In any knowledge representation task, and in the process of ontology engineering, we use categories. Categories have definitions, a history, an intension; there are axioms pertaining to categories, authors and creators of categories; categories may be consistently or inconsistently defined; and so on. Based on these properties, there are different kinds of categories. Consequently, there are upper level ontologies which distinguish at a very basic level between categories and individuals. The instances of a Category category will be ontological categories such as Dog, Electron, Red or Species, while the instances of an Individual category will be individuals: my spider Nero, the red of the apple I eat now, the 1999 Berlin Marathon. Some of the instances of Category will be sub-categories (via is-a) of Individual, such as Dog or Red.
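The two-level picture can be sketched with an explicit instance-of relation. The examples follow the paragraph above; the encoding itself is invented:

```python
# instance-of links an entity to a category; categories are themselves
# instances of Category. The pairs below follow the article's examples.
instance_of = {
    ("Nero", "Dog"),          # an individual and its domain category
    ("Nero", "Individual"),   # Nero is an individual
    ("Dog", "Category"),      # Dog, the category, is an instance of Category
    ("Red", "Category"),
}

def instances(category):
    return {x for (x, c) in instance_of if c == category}

assert instances("Category") == {"Dog", "Red"}
assert "Nero" in instances("Individual")
assert "Nero" not in instances("Category")   # Nero cannot be instantiated
```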

Whether the upper level ontology provides general upper level categories for both categories and individuals, or only for individuals, is the first distinguishing feature between upper level ontologies.

Time and space

A fundamental component of most upper level ontologies is a theory of space and time. The basic distinctions are between time points and time intervals, as well as spatial points and spatial regions.

A simple model of time is given by the real numbers (or dense linear orders). The basic entities in an ontology of time based on real numbers are time points, which correspond to real numbers. Time intervals are derived as pairs of real numbers. For example, the real number e can be considered a time point, and [e,10] a time interval. Such an ontology of time has difficulties when time intervals are divided. If we want to divide the interval [0,2] into two intervals of exactly the same length, we have two options: either [0,1] and the half-open interval ]1,2], or the half-open interval [0,1[ and [1,2]. In either case, there is one time interval for which we cannot determine the first or the last time point that belongs to the interval, because one time interval will be half-open. This is often counter-intuitive.
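The division problem can be seen concretely. This is a sketch using exact rationals in place of the reals:

```python
from fractions import Fraction

# Divide [0, 2] into two equal halves. One half must be half-open:
# here we choose [0, 1] (closed) and ]1, 2] (left-open).
left = (Fraction(0), Fraction(1))
right = (Fraction(1), Fraction(2))

def in_closed(t, iv):
    return iv[0] <= t <= iv[1]

def in_left_open(t, iv):
    return iv[0] < t <= iv[1]

# The midpoint belongs to the left interval but not to the right one:
assert in_closed(Fraction(1), left)
assert not in_left_open(Fraction(1), right)

# ]1, 2] has no first time point: below any candidate t > 1 there is a
# smaller member of the interval, e.g. the midpoint between 1 and t.
t = Fraction(3, 2)
mid = (Fraction(1) + t) / 2
assert in_left_open(mid, right) and mid < t
```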

To address this problem, interval-based ontologies of time were proposed. In these ontologies, time intervals are primitive and time points are derived. Time intervals can meet other time intervals: an interval I meets an interval J when I and J do not overlap and there is no interval between I and J, i.e., I ends at the same time that J starts. Time points are derived from sets of intervals that meet. In interval-based ontologies of time, time intervals can be divided into exactly two halves, and for each a start and end point can be constructed.
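The meets relation is the key primitive here. A minimal sketch, with intervals represented (for illustration only) as (start, end) pairs:

```python
# Intervals are (start, end) pairs with start < end; sketch only.

def meets(i, j):
    """I meets J: I ends exactly where J starts."""
    return i[1] == j[0]

def overlaps(i, j):
    """I and J share an interior point."""
    return max(i[0], j[0]) < min(i[1], j[1])

I, J = (0, 1), (1, 2)
assert meets(I, J)
assert not overlaps(I, J)      # meeting intervals share no interior point

# A time point can then be identified with the place where intervals
# meet, e.g. "the point 1" is where I ends and J begins.
```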

In addition to point-based and interval-based ontologies of time, mixed approaches have been developed. The ontology of Brentano-time uses two temporal categories: time intervals and time boundaries. In Brentano-time, time intervals are primitive and each time interval has exactly one left and exactly one right boundary. Time boundaries are dependent on time intervals, and two time boundaries can coincide. When two time boundaries coincide, they are at the same time. When the right boundary of an interval coincides with the left boundary of another interval, these intervals meet. If two left boundaries coincide, the intervals start at the same time and overlap at their beginning. If two right boundaries coincide, the intervals end at the same time and overlap at their end. Dividing a time interval into two parts yields two intervals, both with left and right boundaries. The right boundary of the first interval coincides with the left boundary of the second, yet both boundaries are distinct entities. This allows referencing both the last point of the first interval and the first point of the second, while both intervals are divided into exactly two halves.
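Coincidence of distinct boundaries, the feature that distinguishes Brentano-time from the point-based model, can be sketched as follows (the class names are invented for illustration):

```python
class Boundary:
    """A boundary owned by one interval; distinct boundaries may coincide."""
    def __init__(self, at):
        self.at = at                       # its temporal coordinate

    def coincides(self, other):
        return self.at == other.at

class Interval:
    def __init__(self, start, end):
        self.left = Boundary(start)
        self.right = Boundary(end)

# Divide an interval over [0, 2] at its midpoint:
first, second = Interval(0, 1), Interval(1, 2)

# The right boundary of the first coincides with the left boundary of the
# second, so the intervals meet...
assert first.right.coincides(second.left)
# ...yet the two boundaries remain distinct entities, so each half keeps
# its own last point and first point respectively.
assert first.right is not second.left
```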

The treatment of space is usually similar to that of time. Ontologies that use the real numbers as a model of time use R³ as a model for space. Using time intervals as the basic entities of time goes together with using spatial regions as primitives and deriving planes, lines and points from those. Similar to Brentano-time, Brentano-space treats spatial regions as primitives, and spatial regions have two-, one- and zero-dimensional boundaries which can coincide. As with time, we can ask how to divide a spatial region into exactly two halves, and find similar solutions in the different ontologies of space.

The ontology of space and time in upper level ontologies is our second distinguishing feature.

Objects and Processes

Based on the ontology of space and time, different categories of individuals can be derived. When the ontology of time is based on time points as primitives, three-dimensional objects which are present at points in time will naturally be available in the ontology. Based on the definition of time intervals in such a model, processes can be introduced in which objects may participate. Objects at time points are called endurants or continuants. An endurant is an individual which is wholly present at each time point at which it exists, and it persists through time. Wholly present means to be present with all its parts. In particular, endurants have no temporal parts.

The main problem for endurants is their persistence through time. How, in what sense, is John F. Kennedy as a child the same person as John F. Kennedy before his death? What makes an endurant persist through time, while losing and gaining parts and changing most of its properties? The solution to this problem is to assign identity conditions to an endurant, such that an endurant is considered to be the same endurant as long as it has a property which assigns identity to it. These identity conditions do not have to be intrinsic to the endurant, but can be assigned to it within specific contexts. Therefore, it may be that two objects at two different time points are the same with respect to one identity condition, and distinct with respect to another.

Endurants conflate presence at time points and persistence through time. In particular, there is not an instance of an endurant, but always an instance of an endurant at some time point. Similarly, endurants have parts only at time points and properties only at time points.

An alternative to using endurants in an ontology which uses time points is to separate both aspects: persistence through time and whole presence at time points. In such a setting, two categories must be introduced: one for entities existing at time points, another to provide the identity criterion for persistence through time.

Occurrents, on the other hand, are entities which have temporal parts; they unfold through time. In particular, processes are occurrents. Endurants may participate in occurrents.

Examples of endurants are my spider Nero, the red of the apple in front of me, or the Eiffel Tower. Examples of occurrents are the World War, the 1999 Berlin marathon, or the process of writing this blog post.

Ontologies that employ a theory of time based on time intervals will contain temporally extended objects as primitives, and need to derive objects at time points in some form. Some ontologies get by without temporally non-extended entities, in particular the General Process Theory (GPT). These are strictly four-dimensional ontologies, as all entities in these ontologies are temporally extended. Objects may be very small processes, properties are layers of processes, etc.

Ontologies using Brentano-time and Brentano-space are bi-categorical in a different sense than endurant-based ontologies. Endurants are wholly present at time points. In Brentano-time, some entities are wholly present at time boundaries. These entities are called presentials. Similar to the case for endurants, criteria must be established for persistence through time, using a persistant category which provides identity criteria for persistence through time. Because Brentano-time is based on time intervals, additional constraints are usually established to require that the presentials belonging to one persisting object (persistant) are embedded in a connected process.

One particular feature of ontologies based on Brentano-time is that it is possible to have two distinct presentials at coinciding time boundaries (at the same time) which are identical with respect to some persistant. One application of this feature is to divide a process into two parts and assign properties to its participants. For example, a ball thrown into the air will move upwards for some time, and downwards for another. In Brentano-time it is possible to find the first presential in the downward process and the last presential in the upward process; both exist at coinciding time boundaries, and therefore at the same time.

A further kind of entity included in some ontologies are abstract entities. Abstract entities are independent of space and time. This means that they either exist outside of space and time, or they exist at all times and everywhere.

Further distinctions

Further distinctions drawn by upper level ontologies pertain to existential or ontological dependence. An entity a is existentially dependent on another entity b if, whenever a exists, necessarily, b exists. The important ontological problem with existential dependence is the formalization of “necessarily”. For example, according to the axioms of set theory, whenever a exists, so does the singleton set {a}. Therefore, a is existentially dependent on its singleton, a rather counter-intuitive assertion.

Ontological properties (or qualities) are often considered to be existentially dependent on their bearer: whenever a property exists, necessarily, so does a bearer of the property. Similarly, relations can be dependent on their relata, roles on their players, or processes on their participants.

The major distinctions drawn in most upper level ontologies are therefore: individuals vs. categories, theories of space and time, persistence through time, the relation between objects and processes, and dependent vs. independent entities.

Implemented top-level ontologies

Basic Formal Ontology: The Basic Formal Ontology (BFO) is an ontology of non-abstract individuals that uses real numbers as its model of space and time, and includes the two categories of endurants (called Continuants) and occurrents.

Descriptive Ontology for Linguistic and Cognitive Engineering: The Descriptive Ontology for Linguistic and Cognitive Engineering (DOLCE) is an ontology of individuals, both abstract and concrete. DOLCE uses real numbers as its model of space and time, and includes endurants, occurrents and abstract individuals.

General Formal Ontology: The General Formal Ontology (GFO) is an ontology of categories and individuals. It uses Brentano-time and Brentano-space and is a four-dimensional ontology. It includes processes, presentials and abstract individuals, and additionally contains a classification of ontological categories.

Further upper level ontologies include the Suggested Upper Merged Ontology (SUMO), the KR Ontology or the Cyc upper ontology.

Components of an Ontology http://ontogenesis.knowledgeblog.org/514 http://ontogenesis.knowledgeblog.org/514#comments Fri, 22 Jan 2010 13:09:15 +0000 http://ontogenesis.knowledgeblog.org/?p=514

Abstract

A computational ontology consists of a number of different components, such as Classes, Individuals and Relations, which are described in this article.


Introduction

An Ontology consists of a number of different components. The names of these components differ between ontologies, depending on the ontology language used and the philosophical persuasion or background of the authors. Despite this, the core components are largely shared between different ontologies. These components can be separated into two kinds: those that describe the entities of the domain, here called concepts, individuals and relations; and those which either enable the use of the ontology or describe the ontology itself, covered in the last sections of this article.


Concept

Concepts, also called Classes, Types or Universals, are a core component of most ontologies. A Concept represents a group of different Individuals that share common characteristics, which may be more or less specific.

For example, (most) humans share certain characteristics, such as related DNA, a set of specific body parts, and the ability to speak a complex language. Likewise, all mammals share these characteristics, except for the ability to speak.

Most ontology languages allow the author to define Concepts on the basis of these characteristics; additionally, some languages, such as OWL, also allow Concepts to be defined extensionally, by their membership. For example, the Concept “members of the Beatles” could be defined as the set of “John, Paul, George and Ringo”.

One Concept may be a subconcept (also known as a subclass, or kind-of) of another Concept; this means that if the Concept C' is a subconcept of C, then any individual of type C' will also be an individual of type C. It is possible within an ontology to explicitly state that C' is a subconcept of C; in some languages, including OWL, it is also possible to infer this.
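Inferring subconcept links amounts to taking the reflexive, transitive closure of the asserted links. A sketch with invented concept names:

```python
# Asserted (subconcept, superconcept) pairs; names invented for illustration.
asserted = {("Human", "Mammal"), ("Mammal", "Animal")}

def is_subconcept(sub, sup, axioms):
    """True if sub is entailed to be a subconcept of sup: the reflexive,
    transitive closure of the asserted pairs."""
    seen, frontier = set(), {sub}
    while frontier:
        c = frontier.pop()
        if c == sup:
            return True
        if c in seen:
            continue
        seen.add(c)
        frontier |= {s for (x, s) in axioms if x == c}
    return False

assert is_subconcept("Human", "Animal", asserted)   # inferred, never asserted
assert not is_subconcept("Animal", "Human", asserted)
```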

Concepts may also share relationships with each other; these describe the way individuals of one Concept relate to the individuals of another.


Individual

Individuals, also known as instances or particulars, are the base unit of an ontology; they are the things that the ontology describes or potentially could describe. Individuals may model concrete objects such as people, machines or proteins; they may also model more abstract objects such as this article, a person’s job or a function.

Individuals are a formal part of an ontology and are one way of describing the entities of interest. Perhaps more common within bioinformatics is the development of ontologies consisting only of Concepts which are then used to annotate data records directly.


Relation

Relations in an ontology describe the way in which individuals relate to each other. Relations can normally be expressed directly between individuals (this article has author Phillip Lord) or between Concepts (an article has author a person); in the latter case, this describes a relationship between all individuals of the Concepts.

Although it is dependent on the ontology language, it is often possible to express different categories of relationships between Concepts. Consider, for example, “person has father person”. This is an existentially quantified relationship: it states that every person has a father, and that this individual is also a person. This can be contrasted with “person is father of person”, a universally quantified relationship. It is true that every individual which is the father of a person is, themselves, a person; however, it would be wrong to assert that every person is the father of another.
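The two quantifier patterns can be checked over a toy model. The individuals and the has-father pairs below are invented for illustration:

```python
persons = {"alice", "bob", "carol"}
# has_father as a set of (child, father) pairs; toy data only.
has_father = {("alice", "bob"), ("bob", "carol"), ("carol", "bob")}

# Existential pattern ("person has father person"): every person has SOME
# father, and that father is a person.
assert all(
    any(child == p and father in persons for (child, father) in has_father)
    for p in persons
)

# Universal pattern ("person is father of person"): everyone who IS the
# father of a person is a person...
fathers = {father for (_, father) in has_father}
assert fathers <= persons
# ...but it does not follow that every person is the father of someone:
assert not persons <= fathers     # alice is no one's father
```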


Non-Ontological Aspects

As well as the formal ontological aspects already described, most ontology languages have many other features. These are often critical to the use of the ontology.

Documentation or formal definitions are normally provided for each concept, relation and individual. These provide plain English or formal definitions for the other components of an ontology. They are useful for ontology authors, as they can be used to ensure that the intention behind an entity (given in the documentation) is accurately reflected in the ontology; they are also useful for users of the ontology in understanding these intentions.

Many ontologies also have editor notes or procedural documentation which describe the current state of the entity within the editorial process; for example, whether the term has been deprecated (should no longer be used), under review (likely to change) or released (likely to remain stable).

Ontology metadata provides documentation for the ontology as a whole, describing, for example, the purpose and scope of the ontology, the release date or version number, and the authorship of the ontology.

Depending on the ontology language being used, an ontology may define imports: other ontologies from which entities have been used, and which are required for a full understanding of the domain, as well as for using the ontology computationally.


Other Components

Although this article describes the basic components which are found in most ontologies, there are many other components which are specific to individual technologies or ontology languages such as OBO Format or OWL.


About the Author

Phillip Lord is a lecturer in Computing Science at Newcastle University.


Acknowledgements

This paper is an open access work distributed under the terms of the Creative Commons Attribution License 2.5, which permits unrestricted use, distribution, and reproduction in any medium, provided that the original author and source are attributed.

The paper and its publication environment form part of the work of the Ontogenesis Network, supported by EPSRC grant EP/E021352/1.

Upper Level Ontologies http://ontogenesis.knowledgeblog.org/343 http://ontogenesis.knowledgeblog.org/343#comments Fri, 22 Jan 2010 10:54:40 +0000 http://ontogenesis.knowledgeblog.org/?p=343

As the biomedical domain is highly interconnected, bio-ontologies may overlap with each other. For instance, the Ontology for Biomedical Investigations (OBI) requires definitions for the chemicals used in any investigation. These definitions do not need to be developed within OBI, as there is already a biomedical ontology for the domain of chemicals, ChEBI. Similarly, software making use of an ontology, such as ArrayExpress, may require more than a single domain ontology. Typically, in these scenarios, it is necessary to integrate multiple ontologies into a single coherent whole. In order to integrate or re-use specific domain ontologies following this “building-block” approach, there has to be a high-level structure or common “scaffold” into which different parts of different domain ontologies may be “plugged”. To ensure ease of interoperation and re-use of a domain ontology, well-designed and well-documented ontologies are essential, and upper ontologies are fundamental to this integrative effort.

Upper level ontologies provide a domain-independent conceptual model that aims to be highly re-usable across specific domain applications. One of the primary purposes of upper ontologies is to aid semantic integration across ontologies and to encourage a set of design principles within those ontologies that use them. Upper ontologies usually describe very general or abstract concepts. Most upper ontologies provide a general classification criterion that makes it easier to re-use, extend and maintain the existing ontologies required by a particular application. Therefore, to aid interoperability and re-use, it is essential that ontology development methodologies provide general guidelines for the use of upper level ontologies. These guidelines should cover the documentation of

  1. the design decisions and the justification for choosing one upper-ontology over another,
  2. examples that illustrate how they are used in the conceptualisation of a particular domain.

Examples of upper level ontologies include the Basic Formal Ontology (BFO), DOLCE and GFO. Depending on its representation, or “world view”, an upper ontology will provide a framework for modelling physical objects, processes and information, and will provide constraints on how these classes relate to each other.
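As a sketch of this “plugging in”, a domain ontology might place its classes under upper-level classes along the following lines (the names are illustrative and do not follow any particular upper ontology’s exact terms):

Class: Assay
    SubClassOf: Process

Class: BloodSample
    SubClassOf: MaterialEntity

The upper-level classes Process and MaterialEntity then act as the common scaffold shared between domain ontologies.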

What is an ontology? http://ontogenesis.knowledgeblog.org/66 http://ontogenesis.knowledgeblog.org/66#comments Fri, 22 Jan 2010 10:26:48 +0000 http://ontogenesis.knowledgeblog.org/?p=66

Robert Stevens*, Alan Rector* and Duncan Hull†

* School of Computer Science, The University of Manchester, Oxford Road, Manchester, UK
† EMBL Outstation – Hinxton, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK

Defining ontology

In OntoGenesis we intend to provide biologists and bioinformaticians with the means to understand ontologies within the context of biological data: their nature; their use; how they are built; and some of those bio-ontologies that exist. All of this is based upon knowing what an ontology is. This can then lead on to the motivation for their use in biology; how they are used; and so on. The definition of ontology is disputed, and this is confounded by computer scientists having re-used and re-defined a term from a discipline of philosophy. The definition here will not suit a lot of people and will upset many (especially our use of the word “concept”). We make no apology for this situation, only noting that the argument can take up resources better used in helping biologists describe and use their data more effectively.

In informatics and computer science, an ontology is a representation of the shared background knowledge for a community. Very broadly, it is a model of the common entities that need to be understood in order for some group of software systems and their users to function and communicate at the level required for a set of tasks. In doing so, an ontology provides the intended meaning of a formal vocabulary used to describe a certain conceptualisation of objects in a domain of interest. An ontology describes the categories of objects described in a body of data, the relationships between those objects, and the relationships between those categories. In doing so, an ontology describes those objects and sometimes defines what is needed to be known in order to recognise one of those objects. An ontology should be distinguished from thesauri, classification schemes and other simple knowledge organisation systems. By controlling the labels given to the categories in an ontology, a controlled vocabulary can be delivered, though an ontology is not a controlled vocabulary. When represented as a set of logical axioms with a strict semantics, an ontology can be used to make inferences about the objects that it describes and consequently provides a means to symbolically manipulate knowledge.

In philosophy, ontology is a term with its origins with Aristotle in his writings on Metaphysics, IV,1, from the fourth century BCE. In very general terms, it is a branch of philosophy concerned with that which exists; that is, a description of the things in the world. Philosophers in this field tend to be concerned with understanding what it means to be a particular thing in the world; that is, the nature of the entity. The goal is to achieve a complete and true account of reality. Computer scientists have taken the term and somewhat re-defined it, removing the more philosophical aspects and concentrating upon the notion of a shared understanding or specification of the concepts of interest in a domain of information that can be used by both computers and humans to describe and process that information. The goal of a computer science ontology is to make knowledge of a domain computationally useful. There is less concern with a true account of reality, as it is information that is being processed, not reality. The definition used here (and any other definition, for that matter) is contentious and many will disagree with it. Within the bio-ontology community there are those who take a much more philosophical stance on ontology; the OBO Foundry, for instance, takes a much more philosophical view.

Putting the string “define:ontology” into the Google search engine finds some twenty or so definitions of ontology. They all cluster around either a philosophical or a computer science definition of ontology. This is presumably the root of the jibe that ontology is all about definitions, but that there is no definition of ontology itself. So we should really distinguish between philosophical ontology and computer science ontology, and thereby remove some of the dispute. Tom Gruber has one of the most widely cited definitions of ontology in computer science, though conceptual models of various types have been built within computer science for decades. His definition is:

“In the context of knowledge sharing, the term ontology means a specification of a conceptualisation. That is, an ontology is a description (like a formal specification of a program) of the concepts and relationships that can exist for an agent or a community of agents. This definition is consistent with the usage of ontology as set-of-concept-definitions, but more general. And it is certainly a different sense of the word than its use in philosophy.” DOI:10.1006/knac.1993.1008 DOI:10.1006/ijhc.1995.1081

The most noteworthy point is that Gruber states that his definition of ontology is not “ontology in the philosophical sense”. Nevertheless, computer science ontologies are still informed by philosophical ontology, but the goals of their creation and use are different.

An important part of any ontology is the individuals or objects. There are trees, flowers, the sky, stones, animals, etc. As well as these material objects, there are also immaterial objects, such as ideas, spaces, representations of real things, etc. In the world of molecular biology and beyond, we wish to understand the nature of, distinctions between, and interactions of objects such as: small molecules and macromolecules; their functionalities; the cells in which they are made and work; the pieces of those cells; the tissues these cells aggregate to form; and so on. We do this through data collected about these phenomena, and consequently we wish to describe the objects described in those data.

As human beings, we put these objects into categories or classes. These categories are a description of that which is described in a body of data. The categories themselves are a human conception. We live in a world of objects, but the categories into which humans put them are merely a way of describing the world; they do not themselves exist; they are a conceptualisation. The categories in an ontology are a representation of these concepts. The drive to categorise is not restricted to scientists; all human beings seem to indulge in the activity. If a community agrees upon which categories of objects exist in the world, then a shared understanding has been created.

In order to communicate about these categories, as we have already seen, we need to give them labels. A collection of labels for the categories of interest forms a vocabulary or lexicon. Human beings can give multiple labels to each of these categories. This habit of giving multiple labels to the same category, and the converse of giving the same label to different categories (polysemy), leads to grave problems when trying to use the descriptions of objects in biological data resources. This issue is one of the most powerful motivations for the use of ontologies within bioinformatics.

As well as agreeing on the categories in which we will place the objects of interest described in our data, we can also agree upon what the labels are for these categories. This has obvious advantages for communications – knowing to which category of objects a particular label has been given. This is an essential part of the shared understanding. By agreeing upon these labels and committing to their use, a community creates a controlled vocabulary.

The objects of these categories can be related to each other. When each and every member of one category or class is also a member of another category or class, then the former is subsumed by the latter, or forms a subclass of the superclass. This subclass-superclass relationship is variously known as the “is-a” (DOI:10.1109/MC.1983.1654194), subsumption or taxonomic relationship. There can be more than one subclass for any given class. If every single kind of subclass is known, then the description is exhaustive, or covered. Also, any pair of subclasses may overlap in their extent, that is, share some objects; or they may be mutually exclusive, in which case they are said to be disjoint. Both philosophical and ontology engineering best practice often advocate keeping sibling classes pairwise disjoint.
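In OWL’s Manchester syntax, subsumption and disjointness between sibling classes might be written as follows (the class names are illustrative):

Class: Eukaryote
    SubClassOf: Organism
    DisjointWith: Prokaryote

Class: Prokaryote
    SubClassOf: Organism

The DisjointWith statement asserts that no individual can be both a Eukaryote and a Prokaryote.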

As well as the is-a relationship, objects can be related to each other by many other kinds of relationship DOI:10.1186/gb-2005-6-5-r46. One of the most frequently used is the partOf relationship, which is used to describe how objects are parts of, components of, regions of, etc. of other objects. Other relationships will describe how one object developsInTo or is transformed into another object, whilst retaining its identity (such as tadpole to frog). The deriveFrom relationship describes how one object changes into another object with a change of identity. Another relationship describes how a discrete object can ParticipateIn a process object.

These relationships, particularly the is-a relationship, give structure to a description of a world of objects. The relationships, like the categories whose instances they relate, also have labels. Relationship labels are another part of a vocabulary. The structured description of objects also gives a structured controlled vocabulary.

So far, we have only described relationships that make some statement about the objects being described. It is also possible to make statements about the categories or classes themselves. When describing the elemental form of an atom, for example, `Helium’, statements about the discovery date or industrial uses are about the category or class, not about the objects in the class. Each instance of a `Helium’ object was not discovered in 1903; most helium atoms existed prior to that date, but humans discovered and labelled the category at that date.
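In OWL, such statements about the category itself are usually expressed as annotations, which carry no logical consequences for the instances. A sketch in Manchester syntax, where discoveredIn is an invented annotation property:

Class: Helium
    Annotations: discoveredIn "1903"

The annotation attaches to the class Helium, not to the individual helium atoms that instantiate it.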

Ideally, we wish to know how to recognise members of these categories. That is, we define what it is to be a member of a category. When describing the relationships held by an object in a category, we put inclusion conditions upon those instances or category membership criteria. We divide these conditions into two sorts:

  1. Necessary Conditions: These are conditions that an object must fulfil, but fulfilling that condition is not enough to recognise an object as being a member of a particular category.
  2. Necessary and Sufficient Conditions: These are conditions that an object must fulfil and are also sufficient to recognise an object to be a member of a particular category.

For example, an ontology of small molecules, such as Chemical Entities of Biological Interest (ChEBI) DOI:10.1093/nar/gkm791, has a definition of alcohol, and there are several ways of defining what this means. Each and every organic alcohol molecule must have a hydroxyl group; that an organic molecule has a hydroxyl substituent is not, however, enough to make that molecule an alcohol. If, however, an organic molecule has a saturated backbone with a hydroxyl substituent on that backbone, this is enough to recognise an alcohol (at least according to the IUPAC “Gold Book”).
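This distinction maps onto two axiom types in OWL’s Manchester syntax. A sketch, with invented class and property names and deliberately simplified chemistry: a necessary condition is stated with SubClassOf,

Class: Alcohol
    SubClassOf: hasPart some HydroxylGroup

which says that every alcohol has a hydroxyl group but does not allow a reasoner to recognise one; a necessary and sufficient condition is stated with EquivalentTo,

Class: Alcohol
    EquivalentTo: OrganicMolecule
                  and (hasBackbone some SaturatedCarbonBackbone)
                  and (hasSubstituent some HydroxylGroup)

under which any organic molecule satisfying the right-hand side is classified as an alcohol.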

In making such definitions, an ontology makes distinctions. A formal ontology makes these distinctions rigorously. Broad ontological distinctions include that between Continuant and Occurrent; that is, between entities (things we can put in our hands) and processes. Continuants take part in processes, and processes have participants that are continuants. Another distinction is between Dependent and Independent objects. The existence of some objects depends on the existence of another object to bear them; for example, the blue colour of a car depends upon the car that bears it, while the car is independent of its colour. Continuants can be further sub-categorised into material and immaterial continuants, such as the skull and the cavity in the skull. Making such ontological distinctions primarily helps in choosing the relationships between the objects being described, as well as in maintaining some level of consistency.

Capturing such descriptions, including the definitions forms an ontology. Representing these descriptions as a set of logical axioms with a strict semantics enables those descriptions to be reliably interpreted by both humans and computers. Forming a consensus on which categories should be used to describe a domain and agreeing on the definitions by which objects in those categories are recognised enables that knowledge to be shared.

The life sciences, unlike physics, have not yet reduced their laws and principles to mathematical formulae. It is not yet possible, as it is with physical observations, to take a biological observation, apply some equations, and determine the nature of that observation and make predictions. Biologists record many facts about entities and from those facts make inferences. These facts are the knowledge about the domain of biology. This knowledge is held in the many databases and literature resources used in biology.

Due to human nature, the autonomous way in which these resources develop, the time span in which they develop, etc., the categories into which biologists put their objects and the labels used to describe those categories are highly heterogeneous. This heterogeneity makes the knowledge component of biological resources very difficult to use. Deep knowledge is required by human users and the scale and complexity of these data makes that task difficult. In addition, the computational use of this knowledge component is even more difficult, exacerbated by the overwhelmingly natural language representation of these knowledge facts.

In molecular biology, we are used to having nucleic acid and protein sequence data that are computationally amenable. There are good tools that inform a biologist when two sequences are similar. Any evolutionary inference based on that similarity is, however, based upon knowledge about the characterised sequence. Use of this knowledge has been dependent on humans, and reconciliation of all the differing labels and conceptualisations used in representing that knowledge is necessary. For example, in post-genomic biology, it is possible to compare the sequences of a genome and the proteins it encodes, but not to compare the functionality of those gene products.

There is, therefore, a need for a common understanding of the categories of objects described in life sciences data and the labels used for those categories. In response to this need, biologists have begun to create ontologies that describe the biological world. The initial move came from computer scientists who used ontologies to create knowledge bases that described a domain with high fidelity; an example is EcoCyc http://view.ncbi.nlm.nih.gov/pubmed/8594595. Ontologies were also used in projects such as TAMBIS DOI:10.1147/sj.402.0532 to describe molecular biology and bioinformatics, to reconcile diverse information sources and to allow the creation of rich queries over those resources. The explosion in activity came, however, in the post-genomic era with the advent of the Gene Ontology (GO) DOI:10.1038/75556. The GO describes the major functional attributes of gene products: molecular function, biological process and cellular component. Now some forty-plus genomic resources use GO to describe these aspects of the gene products of their respective organisms. Similarly, the Sequence Ontology describes sequence features, and PATO (the Phenotype And Trait Ontology) describes the qualities necessary to describe an organism’s phenotype. All these and more are part of the Open Biomedical Ontologies (OBO) project DOI:10.1038/nbt1346.

Conclusion

In conclusion, we can say that there is a need to describe the entities existing within data generated by biologists so that they know what they are dealing with. This entails being able to define the categories of biological entities represented within those data. As well as describing the biological entities, we also need to describe the science by which they have been produced. This has become a large effort within the bioinformatics community. It has also been found to be a difficult task, and much effort can be spent in attempting to find the true nature of entities in biology and science. It should be remembered, however, that the goal of the bio-ontology effort is to allow biologists to use and analyse their data; building an ontology is not a goal in itself.

References

This text is adapted and updated from Ontologies in Biology by Robert Stevens. A numbered list of references will be generated from the DOI’s above in later drafts of this article after peer review.

Acknowledgements

This paper is an open access work distributed under the terms of the Creative Commons Attribution License 3.0, which permits unrestricted use, distribution, and reproduction in any medium, provided that the original author and source are attributed.

The paper and its publication environment form part of the work of the Ontogenesis Network, supported by EPSRC grant EP/E021352/1.

OWL, an ontology language http://ontogenesis.knowledgeblog.org/55 http://ontogenesis.knowledgeblog.org/55#comments Thu, 21 Jan 2010 15:33:16 +0000 http://ontogenesis.knowledgeblog.org/?p=55

This article takes the reader on an introductory tour of OWL, with particular attention to the meaning of OWL statements, their entailments, and what reasoners do. Related Knowledge Blog posts include one on ontology components, one on OWL syntaxes, and one on the extent of classes.

There are numerous ontology languages around, most prominently the Web Ontology Language OWL. OWL has been developed based on experiences with its predecessors OIL and DAML+OIL, and its design has been carried out by W3C working groups. OWL 2 is an extension and revision of OWL (published in 2004) and is a W3C recommendation.

OWL and OWL 2 are called Web Ontology Languages because they are based on web standards such as XML, IRIs, and RDF, and because they are designed in such a way that they can be used over the web (for example, one OWL file can import others by their URI). There are numerous usages of OWL and OWL 2, however, that are rather local, for example to a software or information system.

These languages come with a lot of options and choices, which we will only briefly mention here, and only come back to when they are important. OWL comes in three flavours (OWL Full, OWL Lite, and OWL DL), and OWL 2 comes with two semantics (i.e., two ways of determining the meaning of an ontology, the direct and the RDF-based semantics) and three profiles (i.e., fragments or syntactic restrictions, called OWL 2 EL, QL and RL), and you can choose between a number of syntaxes in which to save your ontology. Since the tools, and especially the reasoners, around mostly support OWL 2’s direct semantics and OWL DL, we will concentrate here on those. Also, OWL 2 is backwards compatible with OWL, so we can discuss the advantages and new features of OWL 2 elsewhere, forget the difference for now, and just talk about OWL (meaning both OWL and OWL 2).

Next, we would like to issue a warning: OWL has been designed to be consumed by computers, so in its natural form (especially in certain syntaxes) it is really hard for humans to read or write: e.g., the following snippet of an OWL ontology in the RDF syntax says that

a:Boy owl:equivalentClass _:x .
_:x rdf:type owl:Class .
_:x owl:intersectionOf ( a:Child a:Male ) .

boys are exactly those children who are male. The same example in the OWL functional-style syntax looks more readable,

EquivalentClasses( Boy ObjectIntersectionOf( Child Male ) )

but we can easily imagine a much nicer presentation of this statement, and tool developers have designed useful, goal- or application-oriented tools or visualisations. This is clearly a good thing: it helps the user to interact with an (OWL) ontology, without requiring them to be fluent in the ontology language and while supporting the task at hand.
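In the frame style used by the Protégé editor (the Manchester syntax), the same statement reads:

Class: Boy
    EquivalentTo: Child and Male

Each frame groups everything said about one class, which many modellers find easier to scan than axiom-by-axiom listings.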

Now what is in an OWL ontology? There is some stuff like headers and declarations around an ontology but, in essence, an OWL ontology is a set of axioms, and each of these makes a statement that we think is true about our view of the world. An axiom can say something about classes, individuals, and properties. For example, the following axioms (written here in an informal functional-style notation) talk about two classes, Man and Person, one property, hasChild, and two individuals, Alice and Bob.

SubClassOf( Man Person )

SubClassOf(Person (hasChild only Person))

ClassAssertion(Bob Man)

PropertyAssertion(hasChild Bob Alice)

Roughly speaking, these axioms say something about these classes, properties, and individuals, and this meaning is fixed through their semantics, which allows us to distinguish the interpretations/structures/worlds that satisfy these axioms from those that don’t. For example, a structure where every Man is a Person would satisfy the first axiom, whereas one where we have a Man who is not a Person would not. Rather confusingly for modellers in general, we call those interpretations/structures/worlds that satisfy all axioms of an ontology a model of this ontology. It is worth pointing out that one ontology can have many, many models, of varying size, and even infinite ones. And here we can have a sneak preview of reasoning or inferencing: assume the axioms in our ontology are such that in all its models every GrandParent is a Parent. Then we call this an entailment or a consequence of our ontology, and we expect a reasoner to find this out and let us know (if you are familiar with Protégé, then you might have seen an inferred class hierarchy, which is basically this).

In more detail, this semantics works as follows: first, fix a set — any set of things will do, finite or infinite, as long as it is not empty. Then, take each class name (such as Man) and interpret it as a subset of your set — any subset is fine, it can even be empty. Then, take each property name (such as hasChild) and interpret it as a relation on your set (basically by drawing edges between your elements) — again, you are free to choose whatever relation you like. Then, take each individual name (such as Bob) and interpret it as one of your elements. All together, you now have an interpretation (but remember that one ontology can have many, many interpretations). Now, to check whether your interpretation satisfies your ontology, you can go through your ontology axiom by axiom and check whether your interpretation satisfies each axiom. For example, in order for your interpretation to satisfy

  • the first axiom, SubClassOf( Man Person ), the set that interprets Man has to be a subset of the set that interprets  Person. Since this kind of sentence will soon become horribly contrived, we rather say ‘every instance of Man is also an instance of Person’.
  • the second axiom, SubClassOf(Person (hasChild only Person)), every instance of Person must be related, via the property hasChild, to instances of Person only. I.e., for an instance of Person, if it has an out-going hasChild edge, then this must link it to an instance of Person.
  • the third axiom, ClassAssertion(Bob Man), the element that interprets Bob must be an instance of Man (see, now it becomes quite easy?).
  • the fourth axiom, PropertyAssertion(hasChild Bob Alice), the element that interprets Bob must be related, via the hasChild property, to the element that interprets Alice.

So, in this case, we could in principle construct or invent interpretations and test whether they satisfy our ontology, i.e., whether they are models of it or not. This would, however, hardly enable us to say something about what holds in all models of our ontology because, as mentioned earlier, there can be loads of those, even infinitely many. So we rather leave this to tools called reasoners (and they do this in a more clever way). This whole exercise should, however, help us understand the above-mentioned entailment. Consider the following two axioms:
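To make this concrete, here is one interpretation, invented for illustration, that satisfies all four axioms above:

Domain:   { b, a }
Person   →  { b, a }
Man      →  { b }
hasChild →  { (b, a) }
Bob      →  b
Alice    →  a

Every instance of Man is an instance of Person; the only out-going hasChild edge, from b, leads to an instance of Person; b is an instance of Man; and (b, a) is in hasChild. So this interpretation is a model of the four axioms.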

EquivalentClasses(Parent (Person and (isParentOf some Person)))

EquivalentClasses(GrandParent (Person and (isParentOf some (isParentOf some Person))))

The first axiom says that the instances of Parent are exactly those elements who are related, via isParentOf, to some instance of Person. The second axiom says that the instances of GrandParent are exactly those elements who are related, via isParentOf, to some element who is related, via isParentOf, to an instance of Person. Please note that the GrandParent axiom does not mention Parent. Now you can try to construct an interpretation that satisfies both axioms and in which you have an instance of GrandParent that is not a Parent… and it will be impossible. Then you can think some more and come to the conclusion that these two axioms entail that every GrandParent is a Parent, i.e., that GrandParent is a subclass of Parent!

Coming back to Protégé: if you look at the inferred class hierarchy in Protégé, then you see both the ‘told’ and the entailed subclass relationships. In OWL, we also have two special classes, thing (owl:Thing) and nothing (owl:Nothing), and they are interesting for the following reasons:

  • if thing is a subclass of a user-defined class, say X, then every element in every interpretation is always an instance of X. This is often regarded as problematic, e.g., for reuse reasons.
  • if your class, say Y, is a subclass of nothing, then Y can never have any instance at all, because nothing is, according to the OWL specification, always interpreted as the empty set. In many cases, this thus indicates a modelling error and requires some repair.
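An unsatisfiable class typically arises from a contradictory definition. A minimal sketch in Manchester syntax, with an invented class name:

Class: ImpossibleThing
    EquivalentTo: Person and (not Person)

A reasoner will place ImpossibleThing under nothing, which usually signals a modelling error to be repaired.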

Finally, we can also ask our reasoner to answer a query, e.g., to give us all instances of Person. If you look again at the four axioms above, then we only have that Bob is an instance of Man, so we might be tempted not to return Bob for this query. On the other hand, we also have the axiom that says that every instance of Man is also an instance of Person, so we should return Bob, because our ontology entails that Bob is a Person. Reasoners can be used to answer such queries, and they are not restricted to class names: for example, we could also query for all instances of (Person and (hasChild some Person)). Now, from the four axioms we have, we cannot infer that Bob should be returned for this query because, although we know that Bob is a Person and is hasChild-related to Alice, we don’t know anything about her, and thus we don’t know whether she is a Person or not. Hence Bob can’t be returned for this query. Similarly, if we query for all instances of (Person and (hasChild max 1)), we cannot expect Bob to be in the answer: although we know that Bob is a Person and is hasChild-related to Alice, we don’t know whether he has other children, unbeknownst to us. This kind of behaviour is referred to as OWL’s open world assumption.
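Note how sensitive such answers are to what has been said. If the ontology additionally contained the assertion

ClassAssertion(Alice Person)

then Alice would be known to be a Person, and Bob would then be entailed to be an instance of (Person and (hasChild some Person)) and so returned for that query.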

It is quite common to distinguish class-level ontologies (which only contain axioms about classes, and don’t mention individuals) from instance-level ontologies (i.e., assertions about the types of, and relations between, individuals). We find ontologies that are purely class-level, such as SNOMED CT and the NCI Thesaurus (NCIt), where reasoning is used purely to make sure that the things said about classes, and the resulting entailed class hierarchy, are correct, and that no contradictory things have been said that would lead to subclasses of nothing or to the whole ontology being contradictory. One interesting option is then, e.g., to export the resulting class hierarchy as a SKOS vocabulary to be used for navigation. We also find ontologies with both class- and instance-level axioms, which are used with the above query answering mechanism as a flexible, powerful means of accessing data.

Finally, if you want to use OWL for your application, you will first have to clarify whether this involves a purely class-level ontology, or whether you want to use OWL for accessing data. In the latter case, you have two options: you can leave the data in the databases, files, or formats in which it currently resides, and use existing approaches (e.g., Quonto, OWLGres or Requiem) to map this data to your class-level ontology and thus query it through the OWL ontology. Or you can extract the data and load it into an instance-level ontology and go from there. Both clearly have advantages and disadvantages, a discussion of which (like many other aspects) goes beyond the scope of this article.

So, where to go next if you want to learn more about OWL? First, you could download an OWL editor such as Protégé 4 and follow a tutorial on how to build an OWL ontology (see below for more links). You could also read the substantial OWL Primer (it has a cool feature which lets you decide which syntaxes to show and which to hide!) and take it from there. Or you could read some of the papers on experiences with OWL in modelling biology. Regardless of what you do, building your own OWL ontology and asking reasoners to make entailments salient always seems to be a good plan.

Helpful links:

PS: I need to point out that (i) OWL is heavily influenced by classical first-order predicate logic and by research in description logics (fragments of first-order logic that have been developed in knowledge representation and reasoning since the late 1980s), and that (ii) OWL is much more than what is mentioned here: e.g., we can annotate axioms and classes, import other ontologies, etc., and in addition to the OWL constructors 'and', 'some' and 'only' used here, there are numerous others, far too many to be mentioned here.

Semantic Integration in the Life Sciences http://ontogenesis.knowledgeblog.org/126 http://ontogenesis.knowledgeblog.org/126#comments Thu, 21 Jan 2010 15:20:03 +0000 http://ontogenesis.knowledgeblog.org/?p=126

There are a number of limitations in data integration: data sets are often noisy, incomplete, of varying levels of granularity and highly changeable. Every time one of the underlying databases changes, the integrated database needs to be updated; and if there are any format changes, the parsers that convert to the unified format need to be modified as well. This "database churn" was identified by Stein as a major limiting factor in establishing a successful data warehouse (Stein 2003).

Ruttenberg et al. see the Semantic Web, of which both OWL and RDF are components, as having the potential to aid translational and systems biology research; indeed, any life science field where there are large amounts of data in distributed, disparate formats should benefit from Semantic Web technologies (Ruttenberg et al. 2007).

Semantic Integration

Integrated data sources, whether distributed or centralised, allow querying of multiple data sources in a single search. Traditional methods of data integration map at least two data models to a single, unified, model. Such methods tend to resolve syntactic differences between models, but do not address possible inconsistencies in the concepts defined in those models. Semantic integration resolves the syntactic heterogeneity present in multiple data models as well as the semantic heterogeneity among similar concepts across those data models. Often, ontologies or other Semantic Web technologies such as RDF are used to perform the integration.

Addressing Semantic Heterogeneity

Semantic heterogeneity describes the difference in meaning of data among different data sources. A high level of semantic heterogeneity makes direct mapping difficult, often requiring further information to ensure a successful mapping. Such heterogeneity is not resolved in more traditional syntactic data integration methods. For instance, in data warehousing or data federation, multiple source schemas (e.g. database schemas) are converted to a single target schema. In data warehousing, the data stored in the source models is copied to the target, while in federated databases the data remains in the source models and is queried remotely via the target schema.

However, the schema reconciliation in non-semantic approaches tends to be hard-coded for the task at hand, and is not easily used for other projects. Often, data is aligned by linking structural units such as XSD components or table and row names. Further, concepts between the source and target schema are often linked based on syntactic similarity, which does not necessarily account for possible differences in the meanings of those concepts. For instance, a protein in BioPAX is strictly defined as having only one polypeptide chain, while a protein in UniProtKB (The UniProt Consortium 2008) can consist of multiple chains. Semantic data integration is intended to resolve both syntactic and semantic heterogeneity, and can allow a richer description of the domain of interest than is possible with syntactic methods. By using ontologies, kinds of entities, including relations, can be integrated across domains based on their meaning. However, applying such techniques in bioinformatics is difficult, partly due to the bespoke nature of the majority of available tools.

The protein example can be further extended to illustrate the practical differences between traditional data integration and semantic integration. In traditional data integration methods, two database schemas may contain a "Protein" table, but if what the developers mean by "Protein" differs, there is little way of determining this difference programmatically. An integration project using these two schemas as data sources may erroneously mark them as equivalent tables. In semantic integration, if the two data sources had modelled Protein correctly, the differences in meaning would be clear both programmatically and to a human looking at the axioms for Protein in the two data sources' ontologies. In such cases, once the semantic differences are identified they can then be resolved. One possibility would be for the person creating the integrated ontology and data set to create a Protein superclass that describes a Protein in a generic way; the two source definitions could then be modelled as children of that Protein superclass.
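For illustration, the two source definitions might be axiomatised as follows in Manchester syntax (the class and property names here are hypothetical, not taken from the actual BioPAX or UniProtKB models):

    Class: Protein_generic

    Class: Protein_BioPAX
        SubClassOf: Protein_generic,
            has_polypeptide_chain exactly 1

    Class: Protein_UniProtKB
        SubClassOf: Protein_generic,
            has_polypeptide_chain min 1

The differing cardinality restrictions make the semantic difference between the two sources explicit, while the shared superclass still allows data from both to be queried together.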

Ontology-based Integration

Integration methods based on ontologies can be more generic, re-usable and independent of the integrative applications they were created for, when compared with traditional approaches which resolve only syntactic heterogeneity (Cheung et al. 2007). Mappings between schemas in non-semantic approaches are specific to those schemas, and cannot be applied to other data sources; conversely, mappings between ontologies (and therefore to the data sources that utilise those ontologies) can be used by any resource making use of those ontologies, and not just the original, intended, data sources. Two concepts may have different names, but if they reference the same ontology term, then it may be sensible to mark them as semantically equivalent. However, this method brings its own challenges, as described in the Ontogenesis article Ontologies for Sharing, Ontologies for Use:

“The alternative approach of defining equivalences between terms in different ontologies suffers from some of the same problems, since use of owl:EquivalentClass is logically strict. Strict equivalence is inappropriate if the definitions of the classes within the two ontologies differ significantly. [...] An alternative is just to indicate that some sort of relationship exists between classes between two ontologies by use of skos:related (http://www.w3.org/TR/skos-primer/).”

Ontology mapping, also known as class rewriting, is a well-studied methodology that allows the mapping of a source class to a target class from a different ontology. As primitive classes are used in DL to characterise defined classes (pg. 52, Baader et al. 2003), such rewriting also allows the linking of relationships (also known as properties) between the two ontologies. Mapping can be used to automatically generate queries over the data source ontologies via a core ontology, using views over the data source ontologies. Additionally, mapping can be applied more generally to rewrite the required features of data source ontologies as a function of a core ontology, as described by Rousset et al. for two existing data integration systems, PICSEL and Xyleme (Rousset et al. 2004).

In the life sciences, the most common formats for ontologies are OWL and OBO. More complex semantic integration tasks can be performed using more than two ontologies, and often employ a mediator, or core, ontology which is used in concert with multiple source ontologies.

Mapping Strategies

Often, the data sources to be integrated cover very different domains, and one or even two ontologies are not sufficient to describe all of the sources under study. In such cases, there are a variety of methodologies for mapping more than two ontologies together. Most ontology integration techniques involving more than two ontologies can be classified according to two broad mapping strategies: global-as-view, where the core ontology is created as a view of the source ontologies, and local-as-view, where the reverse is true. Global-as-view mapping defines the core ontology as a function of the source ontologies rather than as a semantically-rich description of the research domain in its own right, though the level of dependence of the core ontology can vary (Wache et al. 2001, Rousset et al. 2004, Gu et al. 2008). With local-as-view, the core ontology is independent of the source ontologies, and the source ontologies themselves are described as views of the core ontology.

Hybrid approaches (Lister et al. 2009, Xu et al. 2004) also generate mappings between source ontologies and the core ontology. However, unlike traditional approaches, the core ontology is completely independent of any of the source ontologies. Such approaches allow both the straightforward addition of new source ontologies and the maintenance of the core ontology as an independent entity.

Current Semantic Integration Efforts

RDF databases are generally accessed and queried via SPARQL. Life science RDF databases include Data Web projects such as OpenFlyData (Miles et al., submitted), Neurocommons (Ruttenberg et al. 2009), BioGateway (Antezana et al. 2009) and S3DB (Deus et al. 2008). Many others are listed in Table 1 of Antezana et al. (Antezana et al. 2009). Some databases use only RDF, while others also make use of OWL.

Databases such as RDF triple stores provide data sets in a syntactically similar way, but the semantic heterogeneity is not necessarily resolved. For instance, while Bio2RDF stores millions of RDF triples, queries must still trace a path against existing resources rather than have those resources linked via a shared ontology or ontologies (Belleau et al. 2008). Shared vocabularies (e.g. OBO Foundry ontologies) can be used to build connections between RDF data files, which would provide existing connections among data sets that could be leveraged by integration projects.

Semantic integration projects can make use of expressive logic-based ontologies to aid integration. Work on ontology mapping and other semantic data integration methodologies in the life sciences includes the RDF approaches mentioned above as well as the TAMBIS ontology-based query system (Stevens et al. 2000); mapping the Gene Ontology to UMLS (Lomax et al. 2004); the integration of Entrez Gene/HomoloGene with BioPAX via the EKoM (Sahoo et al. 2008); the database integration system OntoFusion (Alonso-Calvo et al. 2007); the SWRL mappings used in rule-based mediation to annotate systems biology models (Lister et al. 2009); and the pharmacogenomics of depression project (Dumontier and Villanueva-Rosales, 2009).

Even with improved methods in data integration, problems of data churn remain. Some projects, such as that by Zhao et al., have proposed the use of Named Graphs to track provenance and churn of bioinformatics data, such as gene name changes (Zhao et al. 2009). Ultimately, it is not just the syntax and semantics of the data sources which must be resolved, but also the challenges associated with ensuring that data is up to date, complete and correctly traced and labelled.

Acknowledgements

This paper is an open access work distributed under the terms of the Creative Commons Attribution License 3.0 (http://creativecommons.org/licenses/by/3.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided that the original author and source are attributed.

The paper and its publication environment form part of the work of the Ontogenesis Network, supported by EPSRC grant EP/E021352/1.

Automatic maintenance of multiple inheritance ontologies http://ontogenesis.knowledgeblog.org/49 http://ontogenesis.knowledgeblog.org/49#comments Thu, 21 Jan 2010 16:12:45 +0000 http://ontogenesis.knowledgeblog.org/?p=49
Mikel Egaña Aranguren <mikel.egana.aranguren@gmail.com>
(Technical University of Madrid, Spain)

Introduction

Ontologies with multiple inheritance are difficult to maintain manually. However, provided the correct set of axioms, an automated reasoner can be used to maintain such ontologies. The effort is considerable, requiring a richer axiomatisation, but worthwhile, as the automated reasoner is able to maintain the whole structure, avoiding human errors. The more expressive axiomatisation also enables richer queries, among other advantages.

Multiple inheritance ontologies

In a multiple inheritance ontology, there are classes with more than one superclass, forming a "polyhierarchy". For example, in the Cell Type ontology, a cell can be a subclass of several cell types at the same time: a phagocyte is a defensive cell, a motile cell, a stuff accumulating cell, and an animal cell.
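In Manchester OWL syntax, a manually asserted version of this polyhierarchy would look something like the following (the class names are illustrative renderings of the Cell Type ontology terms):

    Class: Phagocyte
        SubClassOf: DefensiveCell,
            MotileCell,
            StuffAccumulatingCell,
            AnimalCell

Every one of these subsumptions must be asserted, and maintained, by hand; Normalisation, described below, lets a reasoner infer them instead.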

The manual maintenance of such a structure requires the ontologist to assert all the necessary subsumptions (class-superclass relations). The difficulty of manually maintaining polyhierarchies results from the fact that, for example, when adding a new class, all the appropriate subsumptions must be added, and it is likely that the ontologist will miss some. Another problem with a manually maintained polyhierarchy is that the asserted subsumptions are completely opaque to the reasoner: the reasoner does not "know" why such subsumptions have been asserted.

What is Normalisation?

Normalisation is an ontology building technique that relies on using an automated reasoner (e.g. Pellet) to maintain the polyhierarchy, instead of doing it manually. Thus, the reasoner infers all the necessary subsumptions from the class descriptions, building an inferred polyhierarchy, instead of a manually asserted one. However, adequate and precise class descriptions are needed in order for the reasoner to be able to infer the wanted polyhierarchy.

Languages such as OWL provide the necessary expressivity to write class expressions that are rich enough for the reasoner to infer the polyhierarchy: universal restriction (only), existential restriction (some), number restrictions (min, max, exactly), boolean operators (or, and, not), etc. Such constructs can be combined to build rich expressions like part_of some (nucleus and (has_function only photosynthesis)) (part of at least one nucleus whose only function, if it has one, is photosynthesis). More importantly from the perspective of Normalisation, defined or primitive classes can be created using OWL. A defined class has at least one necessary and sufficient condition (e.g. nucleus equivalentTo has_part some nucleolus): that is, having a nucleolus as part is enough to infer that an organelle is a nucleus (the nucleus is the only organelle with a nucleolus as part). A primitive class has only necessary conditions (e.g. nucleus subClassOf part_of some cell): that is, all nuclei are part of a cell, but other organelles are also part of a cell, so if we find an entity that is part of a cell we cannot infer that it is a nucleus.
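Written out in Manchester syntax, the defined and primitive versions of these examples look like this:

    Class: Nucleus
        EquivalentTo: has_part some Nucleolus

    Class: Nucleus
        SubClassOf: part_of some Cell

The first axiom (EquivalentTo) gives a necessary and sufficient condition, so the reasoner can classify anything with a nucleolus as part under Nucleus; the second (SubClassOf) gives only a necessary condition, and licenses no such inference in the other direction.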

In order to use OWL's capabilities, a normalised ontology should be divided into two parts: the primitive axis and the modules. The primitive axis is formed by primitive classes (yellow ovals), pair-wise disjoint, each with only one superclass. The primitive axis has several levels, and contains the bulk of the classes. The modules are defined classes (brown ovals) with no superclasses (apart from owl:Thing or a root class), and are not disjoint.

When reasoning is performed, the reasoner will infer that each module has several subclasses from the primitive axis, creating a polyhierarchy. The key to this inference is that each class from the primitive axis has several necessary conditions, and each of these conditions is also present in one of the modules. When adding a new class, the maintainer adds conditions to it that, when inference is performed, will lead the reasoner to add the needed subsumptions, instead of the maintainer adding those subsumptions manually.

Normalisation fits some ontologies better than others. For example, the Cell Type Ontology (CL) presents a polyhierarchy where the Normalisation structure can be neatly applied, as the classification of cells according to different criteria (ploidy, function, developmental stage, lineage, nucleation, etc.) can be codified as modules: i.e., in a Normalised CL (a version of CL built using Normalisation) there would be a module Haploid Cell (equivalentTo has_ploidy some haploid) that would be inferred as a superclass of all the haploid cells (primitive classes with the condition subClassOf has_ploidy some haploid; e.g. ovum, spermatozoon, etc.).
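As a sketch in Manchester syntax (the names here are illustrative, not actual CL identifiers):

    Class: HaploidCell
        EquivalentTo: has_ploidy some Haploid

    Class: Ovum
        SubClassOf: has_ploidy some Haploid

    Class: Spermatozoon
        SubClassOf: has_ploidy some Haploid

After classification, the reasoner infers that Ovum and Spermatozoon are subclasses of HaploidCell; neither subsumption needs to be asserted by hand.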

Why use Normalisation?

The use of Normalisation has several advantages. The main advantage is the maintenance process: the reasoner infers all the entailed subsumptions, without missing any. That is especially important in big ontologies like the Gene Ontology, as demonstrated in the GONG project, or in ontologies with a high subsumption-per-class ratio.

In a Normalised ontology, there is a set of agreed object properties, and, when adding a new class, the ontologist need only explore these object properties and add the due restrictions to the new class. The process resembles describing an object by filling in a form. The modelling is therefore principled, as every developer "fills in the same form". This principled modelling allows the work to be split between many developers. The modelling process also results in a modular ontology: to extend the ontology with a new module, it is only necessary to add a new defined class. Defined classes, or modules, can be regarded as different "views" upon the same collection of objects (e.g. cell by function, cell by organism, cell by ploidy, cell by nuclear number, etc.).

To enable the inference of the polyhierarchy by the reasoner, many axioms need to be added. Such a rich axiomatisation is beneficial because it makes the modelling explicit: the reasoner and other users know why a class is a subclass of another class, as the relation is the result of both having a common condition, rather than a manually asserted subsumption. For example, if we assert that leptomeningeal cell is a secretory cell, other users, and most importantly the reasoner, do not know why it is a secretory cell (a biologist may deduce the reason from the term names, but term names are completely useless for reasoners). However, if we assert that leptomeningeal cell has_function some ECM_secretion, it is clear why it has been classified as a subclass of secretory cell (which is equivalentTo has_function some secretion, and ECM_secretion is a subClassOf secretion).
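The example can be written out in Manchester syntax as follows (class names rendered illustratively):

    Class: SecretoryCell
        EquivalentTo: has_function some Secretion

    Class: ECM_secretion
        SubClassOf: Secretion

    Class: LeptomeningealCell
        SubClassOf: has_function some ECM_secretion

The reasoner infers LeptomeningealCell SubClassOf SecretoryCell, because has_function some ECM_secretion is subsumed by has_function some Secretion, which is the defining condition of SecretoryCell. The reason for the classification is explicit in the axioms.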

A richer axiomatisation also allows more complex queries to be executed against the ontology, and makes automatic debugging possible (e.g. by using explanations). Having explicit wrong axioms is preferable to having implicit wrong ideas, as the reasoner will suggest a possible path to a solution.

Conclusion

Reasoning can be used to maintain ontologies in different ways. One such way is Normalisation, an ontology building technique that enables the automatic maintenance of polyhierarchies. Normalisation requires the addition of precise axioms for the reasoner to infer the correct subsumptions. It could be argued that Normalisation requires the same, or even more, work than manual maintenance. However, with Normalisation the same amount of work yields several advantages (automatic maintenance and a rich axiomatisation), and, in the long term, manual maintenance requires more work, e.g. to fix missing subsumptions.

Related topics

Normalisation is an Ontology Design Pattern (ODP), and thus a best practice for building efficient and rich bio-ontologies. There are different repositories of ODPs: http://odps.sf.net/, http://ontologydesignpatterns.org.

Some OWL tutorials show how to use Normalisation in practical terms.
