Allyson Lister – Ontogenesis: An Ontology Tutorial
http://ontogenesis.knowledgeblog.org

Peer Review of Ontological Annotation of Data
http://ontogenesis.knowledgeblog.org/563
Fri, 22 Jan 2010 13:57:35 +0000

This is a peer review of Ontological Annotation Of Data.

Overall, this is a very useful resource for people learning about how ontological annotations can be applied to data. There are just a few points below which, when addressed, should improve the overall clarity of the article.

As (strictly) Ontogenesis contains articles rather than blog posts, I suggest modifying the first sentence by removing the word “blogged”. I would also agree with the other reviewer that the links to articles should be replaced by DOIs before final publishing.

At the end of the last paragraph in the first section, you have the following large sentence: “When we consider the phenotype of the human from which samples were taken and the purpose of the study, and results generated by the study there are two axes of annotation to consider – that which relate to what is being assayed – the genetic content of the individual where the genes are, and what they may do, and the meta data about the individual: age, sex, physical characteristics, diseases they may have, and what was actually sampled – e.g. diseased or normal tissue, or peipheral blood.”

This sentence is a little long, and would be clearer with some changes. Perhaps split it into two or three sentences, something like: “There are two axes of annotation when considering assays relating to the phenotype of the human from which samples were taken and the purpose of the study. The first axis is the genetic content of the individual, and what those genes may do. The second axis is the metadata about the individual: age, sex, physical characteristics, diseases they may have, and what was actually sampled – e.g. diseased or normal tissue, or peripheral blood.” This may need further re-wording to capture exactly what was meant.

When you describe GO, you have a note to yourself about a reference to GO: this needs to be added.

In the “Why do we do it” section, I would change “More or less specific annotation…” to “more specific or less specific annotation…”, as the meaning is clearer that way. At the end of the same paragraph is the following sentence: “Annotations change over time  on the basis of emerging biological knowledge, and the content of the GO also changes as terms are added, or removed, annotations are therefore updated periodically.” The last part of this sentence is difficult to understand. Perhaps change it to something like “Annotations change over time on the basis of emerging biological knowledge. As a consequence, the annotation content of the GO is updated periodically as terms are added or removed.”

Finally, to match how other articles are written, I don’t think you need to number your headings, especially as they aren’t all numbered in the first place. However, each heading should be a real HTML heading, e.g. <h2></h2> in order for the automated Table of Contents to be generated, as a TOC would be very useful in this type of article, which covers many topics.

Review of Reference and Application Ontologies
http://ontogenesis.knowledgeblog.org/441
Fri, 22 Jan 2010 11:43:28 +0000

This is a review of Reference and Application Ontologies.

This article describes what reference ontologies and application ontologies are, and how they can be used together. It is a well-written introduction to the subject. What follows are my suggestions for improvement.

In the first paragraph, it says both that application ontologies are for “domain specific use” and also used for modelling “cross-domain experiments”. While not necessarily at odds with each other, these two sentences may confuse readers.

In the last Background section, there is: “Finally reference ontologies do not necessarily contain sufficient combinations of classes (e.g. intersections or unions) to represent experimental data. For example information about a cell line includes a cell type and tissue from which it derives, and information about the individual from which tissue was obtained.” It is unclear why this exemplifies the sentence preceding it (the “Finally…” sentence). An additional sentence here explaining the link between them would be useful.

There is the phrase “(e.g. need an example)” in the second paragraph for the “Motivation…” section. I think the example is not there yet.

In the examples of application ontologies, the NIFSTD section would benefit from further structuring, as it is a little confusing as-is. I would suggest making a bulleted list containing paragraphs 2 and 3, and begin the first paragraph with “The NeuroInformatics Framework – NIF, formerly known as BIRN, is ….”

Review of Semantic Integration in the Life Sciences
http://ontogenesis.knowledgeblog.org/390
Fri, 22 Jan 2010 11:36:24 +0000

This is a review of Semantic Integration in the Life Sciences.

This article discusses semantic data integration to address the overwhelming challenge of integrating and querying across thousands of biological databases, in a way that should be superior to “syntactic” data integration. It discusses, at the highest level, the kinds of data integration strategies: i) local-as-view, where queries are mapped to their local sources (no transformation required), and ii) global-as-view, where source data is transformed into a common schema. While ontology presents a salient opportunity to unify various data sources through a shared conceptualization, at least one example (what “protein” means in BioPAX versus UniProt) demonstrates that this will be a significant and non-trivial challenge. This is a great motivating example: it exemplifies both syntactic (URI differences) and semantic (natural language/axiomatic definitions) heterogeneity, and how to resolve this is a question worthy of careful analysis in this article.

From schemas to ontology

The article needs to be reformulated in such a way that it clearly presents the problem of data integration and clearly defines and contrasts syntactic and semantic approaches, but more importantly, identifies the role of ontology in this process, as opposed to schemas. Ontology, in this sense, refers not only to an enhanced logic-based formalism in which class descriptions can be logically evaluated for equivalence through subsumption/consistency checking, but also to philosophical ontology such that different kinds of entities, including relations, can be integrated across domains.

From RDF to OWL

While the vast majority of RDF-based data integration efforts are semantically fairly simple (though they still require much time and effort), they do not exploit the explicit semantics found in rich ontologies. For instance, Bio2RDF (Belleau F et al., J Biomed Inform 2008) now contains over 5 billion linked data statements using RDF, but with no overarching ontology, queries must be formulated by tracing a path against existing resources. In contrast, data integration projects such as the pharmacogenomics of depression project (Michel Dumontier and Natalia Villanueva-Rosales. Briefings in Bioinformatics. 2009. 10(2):153-163.) use expressive logic-based ontologies that build on foundational philosophical ontology.

Vision

The vision, then, is that data integration occurs across domains through the combination of logic and philosophical ontology.
Ontogenesis: Who’s here?
http://ontogenesis.knowledgeblog.org/283
Fri, 22 Jan 2010 09:04:10 +0000

Who’s here? The following is an alphabetical list of people currently attending the Ontogenesis Blogging a Book Experiment.

  1. Sean Bechhofer, University of Manchester
  2. Michel Dumontier, Carleton University
  3. Mikel Egana-Aranguren
  4. Frank Gibson
  5. Matthew Horridge, University of Manchester
  6. Duncan Hull, EBI
  7. Simon Jupp, University of Manchester
  8. Allyson Lister, Newcastle University
  9. Phillip Lord, Newcastle University
  10. James Malone, EBI
  11. David Osumi-Sutherland, University of Cambridge
  12. Helen Parkinson, EBI
  13. Robert Stevens, University of Manchester
  14. Christopher Brewster, Aston Business School
  15. Alan Rector, University of Manchester
  16. Ulrike Sattler, University of Manchester
  17. David Shotton, University of Oxford
Semantic Integration in the Life Sciences
http://ontogenesis.knowledgeblog.org/126
Thu, 21 Jan 2010 15:20:03 +0000

There are a number of limitations in data integration: data sets are often noisy, incomplete, of varying levels of granularity and highly changeable. Every time one of the underlying databases changes, the integrated database needs to be updated, and if there are any format changes, the parsers that convert to the unified format need to be modified as well. This “database churn” was identified by Stein to be a major limiting factor in establishing a successful data warehouse (Stein 2003).

Ruttenberg et al. see the Semantic Web, of which both OWL and RDF are components, as having the potential to aid translational and systems biology research; indeed, any life science field where there are large amounts of data in distributed, disparate formats should benefit from Semantic Web technologies (Ruttenberg et al. 2007).

Semantic Integration

Integrated data sources, whether distributed or centralised, allow querying of multiple data sources in a single search. Traditional methods of data integration map at least two data models to a single, unified model. Such methods tend to resolve syntactic differences between models, but do not address possible inconsistencies in the concepts defined in those models. Semantic integration resolves the syntactic heterogeneity present in multiple data models as well as the semantic heterogeneity among similar concepts across those data models. Often, ontologies or other Semantic Web tools such as RDF are used to perform the integration.

Addressing Semantic Heterogeneity

Semantic heterogeneity describes the difference in meaning of data among different data sources. A high level of semantic heterogeneity makes direct mapping difficult, often requiring further information to ensure a successful mapping. Such heterogeneity is not resolved in more traditional syntactic data integration methods. For instance, in data warehousing or data federation, multiple source schemas (e.g. database schemas) are converted to a single target schema. In data warehousing, the data stored in the source models is copied to the target, while in federated databases the data remains in the source models and is queried remotely via the target schema.

However, the schema reconciliation in non-semantic approaches tends to be hard-coded for the task at hand, and is not easily used for other projects. Often, data is aligned by linking structural units such as XSD components or table and row names. Further, concepts between the source and target schema are often linked based on syntactic similarity, which does not necessarily account for possible differences in the meanings of those concepts. For instance, a protein in BioPAX is strictly defined as having only one polypeptide chain, while a protein in UniProtKB (The UniProt Consortium 2008) can consist of multiple chains. Semantic data integration is intended to resolve both syntactic and semantic heterogeneity, and can allow a richer description of the domain of interest than is possible with syntactic methods. By using ontologies, kinds of entities, including relations, can be integrated across domains based on their meaning. However, application of such techniques in bioinformatics is difficult, partly due to the bespoke nature of the majority of available tools.

The protein example can be further extended to illustrate the practical differences between traditional data integration and semantic integration. In traditional data integration methods, two database schemas may contain a “Protein” table, but if what the developers mean by “Protein” differs, there is little way of determining this difference programmatically. An integration project using these two schemas as data sources may erroneously mark them as equivalent tables. In semantic integration, if the two data sources had modelled Protein correctly, the differences in their meaning would be clear both programmatically and to a human looking at the axioms for Protein in the two data sources’ ontologies. In such cases, once the semantic differences are identified they can then be resolved. One possibility would be for the person creating the integrated ontology and data set to create a Protein superclass that describes a Protein in a generic way. The two source definitions could then be modelled as children of that Protein superclass.

Ontology-based Integration

Integration methods based on ontologies can be more generic, re-usable and independent of the integrative applications they were created for, when compared with traditional approaches which resolve only syntactic heterogeneity (Cheung et al. 2007). Mappings between schemas in non-semantic approaches are specific to those schemas, and cannot be applied to other data sources; conversely, mappings between ontologies (and therefore to the data sources that utilise those ontologies) can be used by any resource making use of those ontologies, and not just the originally intended data sources. Two concepts may have different names, but if they reference the same ontology term, then it may be sensible to mark them as semantically equivalent. However, this method brings its own challenges, as described in the Ontogenesis article Ontologies for Sharing, Ontologies for Use:

“The alternative approach of defining equivalences between terms in different ontologies suffers from some of the same problems, since use of owl:EquivalentClass is logically strict. Strict equivalence is inappropriate if the definitions of the classes within the two ontologies differ significantly. … An alternative is just to indicate that some sort of relationship exists between classes between two ontologies by use of skos:related (http://www.w3.org/TR/skos-primer/).”

Ontology mapping, also known as class rewriting, is a well-studied methodology that allows the mapping of a source class to a target class from different ontologies. As primitive classes are used in DL to characterise defined classes (pg. 52, Baader et al. 2003), such rewriting also allows the linking of relationships (also known as properties) between the two ontologies. Mapping can be used to automatically generate queries over the data source ontologies via a core ontology, using views over the data source ontologies. Additionally, mapping can be applied more generally to rewrite the required features of data source ontologies as a function of a core ontology, as described in Rousset et al. for two existing data integration systems, PICSEL and Xyleme (Rousset et al. 2004).

In the life sciences, the most common formats for ontologies are OWL and OBO. More complex semantic integration tasks can be performed using more than two ontologies, and often employ a mediator, or core, ontology used in concert with multiple source ontologies.

Mapping Strategies

Often, the data sources to be integrated cover very different domains, and one or even two ontologies are not sufficient to describe all of the sources under study. In such cases, there are a variety of methodologies to map more than two ontologies together. Most ontology integration techniques involving more than two ontologies can be classified according to two broad mapping strategies: global-as-view, where the core ontology is created as a view of the source ontologies, and local-as-view, where the reverse is true. Global-as-view mapping defines the core ontology as a function of the source ontologies rather than as a semantically-rich description of the research domain in its own right, though the level of dependence of the core ontology can vary (Wache et al. 2001, Rousset et al. 2004, Gu et al. 2008). With local-as-view, the core ontology is independent of the source ontologies, and the source ontologies themselves are described as views of the core ontology.

Hybrid approaches (Lister et al. 2009, Xu et al. 2004) also generate mappings between source ontologies and the core ontology. However, unlike traditional approaches, the core ontology is completely independent of any of the source ontologies. Such approaches allow both the straightforward addition of new source ontologies as well as the maintenance of the core ontology as an independent entity.
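The independence property of the hybrid approach can be illustrated with a toy term-rewriting table in Python. The URIs and the `rewrite` helper are hypothetical, not taken from any of the cited systems; they merely show why sources can be added without touching the core.

```python
# Hypothetical core-ontology term; all URIs here are illustrative only.
CORE_PROTEIN = "http://example.org/core#Protein"

# In a hybrid approach each source ontology is mapped to the core
# ontology independently, so adding a new source ontology just means
# adding new entries to the table; the core itself is untouched.
source_to_core = {
    "http://example.org/biopax-like#Protein": CORE_PROTEIN,
    "http://example.org/uniprot-like#Protein": CORE_PROTEIN,
}

def rewrite(term: str) -> str:
    """Rewrite a source-ontology term to its core-ontology counterpart,
    leaving unmapped terms unchanged."""
    return source_to_core.get(term, term)

rewritten = rewrite("http://example.org/uniprot-like#Protein")
```

Because the mappings point from each source into the core, removing or adding a source ontology never forces changes to the core ontology or to the other sources' mappings.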

Current Semantic Integration Efforts

RDF databases are generally accessed and queried via SPARQL. Life science RDF databases include the Data Web projects such as OpenFlyData (Miles et al., submitted); Neurocommons (Ruttenberg et al. 2009), BioGateway (Antezana et al. 2009) and S3DB (Deus et al. 2008). Many others are listed in Table 1 of Antezana (Antezana et al. 2009). Some databases only use RDF, while others make use of OWL.

Databases such as RDF triple stores provide data sets in a syntactically similar way, but the semantic heterogeneity is not necessarily resolved. For instance, while Bio2RDF stores millions of RDF triples, queries must still trace a path against existing resources rather than have those resources linked via a shared ontology or ontologies (Belleau et al. 2008). Shared vocabularies (e.g. OBO Foundry ontologies) can be used to build connections between RDF data files, providing ready-made connections among data sets that integration projects could leverage.

Semantic integration projects can make use of expressive logic-based ontologies to aid integration. Work on ontology mapping and other semantic data integration methodologies in the life sciences includes the RDF approaches mentioned above as well as the TAMBIS ontology-based query system (Stevens et al. 2000); mapping the Gene Ontology to UMLS (Lomax et al. 2004); the integration of Entrez Gene/HomoloGene with BioPAX via the EKoM (Sahoo et al. 2008); the database integration system OntoFusion (Alonso-Calvo et al. 2007); the SWRL mappings used in rule-based mediation to annotate systems biology models (Lister et al. 2009); and the pharmacogenomics of depression project (Dumontier and Villanueva-Rosales, 2009).

Even with improved methods in data integration, problems of database churn remain. Some projects have proposed the use of Named Graphs to track provenance and churn of bioinformatics data, such as gene name changes (Zhao et al. 2009). Ultimately, it is not just the syntax and semantics of the data sources which must be resolved, but also the challenges associated with ensuring that data is up to date, complete, and correctly traced and labelled.

Acknowledgements

This paper is an open access work distributed under the terms of the Creative Commons Attribution License 3.0 (http://creativecommons.org/licenses/by/3.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided that the original author and source are attributed.

The paper and its publication environment form part of the work of the Ontogenesis Network, supported by EPSRC grant EP/E021352/1.
