Reference and Application Ontologies

Ontogenesis: An Ontology Tutorial, http://ontogenesis.knowledgeblog.org/295, Fri, 22 Jan 2010

James Malone and Helen Parkinson

European Bioinformatics Institute, Cambridge, CB10 1SD, UK

Introduction

An application ontology is an ontology engineered for a specific use or application focus, and its scope is specified through testable use cases. The application ontology will often use or reference canonical ontologies to construct ontological classes and relationships between classes. Application ontologies are used when modeling cross-domain experiments in biology, for data annotation or visualization, and for producing data-driven views across reference ontologies for specific user groups.

Author Profiles

Helen Parkinson is a geneticist who was seduced to the dark side (Bioinformatics) 10 years ago. She manages and annotates high throughput functional genomics data for the ArrayExpress database and Atlas of Gene Expression hosted at The European Bioinformatics Institute. She also builds ontologies such as EFO and OBI to annotate these data.

James Malone is a knowledge engineer and computer scientist who builds ontologies and triple stores at the EBI. He is a Newcastle United supporter and therefore often disappointed.

Background

There are many reference or ‘canonical’ ontologies in biomedicine. Organizations such as the OBO Foundry aim to organise these reference ontologies into a collection of non-overlapping or ‘orthogonal’ and interoperable resources. There are challenges in integrating, building and consuming reference ontologies. Current reference ontologies are not fully interoperable as they are constructed in different styles, using different tools, and often do not share a common upper level ontology.
Consequently, importing all or part of most reference ontologies into a single resource is not practical. Furthermore, importing and combining large ontologies like FMA produces very large ontologies which cause scaling problems when performing reasoning using description logics. There is also an issue of coverage: reference ontologies do not necessarily contain sufficient combinations of classes (e.g. intersections or unions) to represent experimental data. For example, information about a cell line includes the cell type and tissue from which it derives, and information about the individual from which the tissue was obtained.

Motivation for developing Application Ontologies

Application ontologies are typically used when crossing domains, e.g. transcriptomics and genomics, or when combining annotation on the sample, gene and experiment dimensions. Let’s consider a gene expression use case: we’d like to make statements about experimental processes, assays, cell types, cell lines, diseases and chemical compounds used to treat cell lines which are experimental models for disease. Performing queries using all these concepts requires that the reference ontologies are fully integrated. An application ontology resolves these issues by importing all or parts of the reference ontologies that are required to support the application use cases and by integrating them along a common axis. The common axis may be an upper level ontology or a structure that best represents the needs of the application, e.g. one driven by the data.

Application ontologies can also offer alternative ‘views’ on the reference ontologies by producing specific user or domain-oriented definitions for ontology classes. This may involve producing a definition that a particular community will relate to given the application focus (e.g. ‘normalization’ may have several meanings depending upon the context), or rendering class labels for a specific user community.

An application ontology should be evaluated against a set of use cases and competency questions which represent the scope and requirements of the particular application. For example, a user query use case may contain the competency question ‘what cancer cell line data is there?’. This requires sufficient ontological coverage to capture the concept of ‘cancer cell line’.
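As a rough illustration of evaluating coverage against competency questions, the sketch below pairs each question with the concepts it needs and reports which are absent from the ontology. The labels, the second question and the question-to-concept mapping are all assumptions made for the example; a real check would read labels from the ontology file.

```python
# Hypothetical class labels; a real check would load these from the ontology.
ontology_labels = {"cell line", "cancer", "cancer cell line", "organism part"}

# Each competency question is paired with the concepts it needs (assumed mapping).
competency_questions = {
    "What cancer cell line data is there?": {"cancer cell line"},
    "Which cell lines derive from liver?": {"cell line", "liver"},
}


def missing_concepts(questions, labels):
    """Return, per question, the required concepts absent from the ontology."""
    gaps = {}
    for question, required in questions.items():
        absent = required - labels
        if absent:
            gaps[question] = absent
    return gaps


gaps = missing_concepts(competency_questions, ontology_labels)
for question, absent in gaps.items():
    print(f"{question} -> missing: {sorted(absent)}")
```

A question with no missing concepts is only covered at the level of vocabulary; whether the query actually returns the right data still has to be tested against the application.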

Examples of Application Ontologies

EFO

The EBI’s Experimental Factor Ontology is used to represent sample variables from gene expression experimental data. EFO imports classes from multiple reference ontologies and produces new classes which add additional knowledge to reference ontology classes in order to meet querying and curation use cases.

NIFSTD

The Neuroscience Information Framework (NIF), formerly known as BIRN, has produced the NIFSTD ontology. NIF is ‘A dynamic inventory of Web-based neuroscience resources: data, materials, and tools accessible via any computer connected to the Internet’. NIF has two application resources:

1. NIFSTD, an ontology with separate modules covering major domains of neuroscience: anatomy, cell, subcellular, molecule, function and dysfunction.

2. NeuroLex has detailed concepts for describing experimental techniques and instruments typically employed to carry out neuroscientific studies, as well as concepts for describing digital resources being created throughout the neuroscience community.

Both NIFSTD and NeuroLex are non-orthogonal to OBO Foundry ontologies and contain cross references to, for example, FMA terms, adding local terms when needed.

Conclusion

Application ontologies are used to meet specific use cases and consume reference ontologies. They have some drawbacks which must be managed if they are to be used successfully.

1. Scaling can be an issue: terms need to be imported and ontologies can become large quickly.

2. Ontologies change rapidly, so importing classes without checking whether they are still current can mean inbuilt obsolescence. Agent technology can be used to manage this.
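The currency check described in the second point can be sketched as a simple audit that compares imported classes with the latest release of the reference ontology. This is a plain comparison, not the agent technology mentioned above, and the identifiers, labels and release snapshot are invented for illustration.

```python
# Hypothetical identifiers: classes the application ontology has imported.
imported = {
    "REF:0001": "fibroblast",
    "REF:0002": "hepatocyte",
    "REF:0003": "old term",
}

# Hypothetical snapshot of the latest reference release:
# identifier -> (current label, is_obsolete)
latest_release = {
    "REF:0001": ("fibroblast", False),
    "REF:0002": ("hepatic cell", False),
    "REF:0003": ("old term", True),
}


def audit_imports(imported_terms, latest):
    """Classify imported terms as obsolete, relabelled or missing upstream."""
    report = {"obsolete": [], "relabelled": [], "missing": []}
    for term_id, label in sorted(imported_terms.items()):
        if term_id not in latest:
            report["missing"].append(term_id)
            continue
        new_label, is_obsolete = latest[term_id]
        if is_obsolete:
            report["obsolete"].append(term_id)
        elif new_label != label:
            report["relabelled"].append(term_id)
    return report


report = audit_imports(imported, latest_release)
print(report)
```

Running such an audit on every new release of a reference ontology gives curators a short list of imported classes to review rather than a full re-import.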

Acknowledgements

This paper is an open access work distributed under the terms of the Creative Commons Attribution License 2.5 (http://creativecommons.org/licenses/by/2.5/), which permits unrestricted use, distribution, and reproduction in any medium, provided that the original author and source are attributed.

The paper and its publication environment form part of the work of the Ontogenesis Network, supported by EPSRC grant EP/E021352/1.

Ontological Annotation of Data

Ontogenesis: An Ontology Tutorial, http://ontogenesis.knowledgeblog.org/50, Thu, 21 Jan 2010

Introduction

This is a blogged article on how scientific data, specifically biomedical data, is annotated with ontologies. It introduces the kinds of data that are annotated, the people who perform annotation, the ontologies used for annotation and resources such as databases which make that annotation available to users. It is not intended to be a comprehensive guide to all ontologies or databases which use ontologies.

Author Profile

Helen Parkinson is a geneticist who was seduced to the dark side (Bioinformatics) 10 years ago. She manages and annotates high throughput functional genomics data for the ArrayExpress database and Atlas of Gene Expression hosted at The European Bioinformatics Institute. She also builds ontologies such as EFO and OBI to annotate these data.


1. What does ‘Ontological Annotation of Data’ mean?

Good question. Let’s start with what we mean by data in this context. There are articles describing 58 new databases and 73 updated databases in the 2010 NAR database issue. These databases are necessary as the scientific technology we use in biomedicine now produces huge amounts of data. For example, a PhD student in 1990 might routinely sequence 1 kilobase of DNA using 35S sequencing technology (subject to possessing the technical skills to do the experiment). Such volumes of data can be stored easily in a FASTA format file. The same PhD student in 2010 could sequence several human genomes (subject to funding, access to a sequencing facility who will perform the experiment, and ethical approval).

This presents a data and knowledge management problem. The raw data generated by the sequencer can be stored in the same file formats as used in 1990; however, the information about the genes present in the genome, their position, their function and whether they are expressed in the individual being assayed is usually stored in a database. When we consider the phenotype of the human from which samples were taken, the purpose of the study and the results it generated, there are two axes of annotation to consider. The first relates to what is being assayed: the genetic content of the individual, where the genes are and what they may do. The second is the metadata about the individual: age, sex, physical characteristics, diseases they may have, and what was actually sampled, e.g. diseased or normal tissue, or peripheral blood.

2. Who does the annotation?

In our example the PhD student may have done the annotation of the 1 kilobase of DNA in 1990, and PhDs were awarded for finding genes, sequencing parts of the genome and functional analysis. In 2010 the functions of many genes are known and this information is reported in the scientific literature as free text. Free text can be searched effectively, but the information on gene function is more useful when it is organised and the knowledge is linked to the gene information. The most commonly used ontology in biomedicine is the Gene Ontology or GO, which has the “aim of standardizing the representation of gene and gene product attributes across species and databases”. The Gene Ontology is built by a team of specialist bioinformaticians who structure the ontology, provide definitions and generally ensure that it is fit for purpose (add a ref to the GO chapter). GO is used by curators of model organism databases like ZFIN or domain specific databases like UniProt to annotate genes.

3. Why do they do it?

GO is used to describe gene products in a formal and structured way. As gene products have common functions across species, many proteins from more than 20 species are annotated to the GO term ‘transcription factor binding’ in UniProt (a database of proteins). Transcription factor binding is a high level term: it has 9 direct child terms, each of which also has child terms linked by is-a relationships. The structure of the molecular function hierarchy of the GO allows subsumption queries which traverse these relationships and represent more or less specific knowledge about biological processes as the hierarchy is traversed. The GO enforces an ‘all paths to root must be true’ rule, so the terms and their relationships represent a statement of biological truth based on available knowledge. For example:

transcription factor binding is-a protein binding is-a binding is-a molecular function
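A subsumption query over this chain can be sketched in a few lines. The is-a edges come from the example above; the protein names and their annotations are invented, and a real query would run against the full GO and a database such as UniProt.

```python
# is-a edges taken from the chain above (child -> list of parents).
IS_A = {
    "transcription factor binding": ["protein binding"],
    "protein binding": ["binding"],
    "binding": ["molecular function"],
    "molecular function": [],
}

# Hypothetical annotations (protein -> GO term); the protein names are invented.
ANNOTATIONS = {
    "protein A": "transcription factor binding",
    "protein B": "protein binding",
    "protein C": "binding",
}


def ancestors(term, is_a):
    """Return every term reachable from `term` by following is-a edges."""
    seen = set()
    stack = list(is_a.get(term, []))
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(is_a.get(parent, []))
    return seen


def annotated_to(query, annotations, is_a):
    """Proteins annotated to `query` itself or to any of its descendants."""
    return sorted(
        protein
        for protein, term in annotations.items()
        if term == query or query in ancestors(term, is_a)
    )


print(annotated_to("protein binding", ANNOTATIONS, IS_A))
```

Because every path to the root must be true, a protein annotated to ‘transcription factor binding’ is also a valid answer to a query for ‘protein binding’, which is what the ancestor lookup captures.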

More or less specific annotation can be specified by a curator selecting a term from lower or higher up the hierarchy. Annotations are made to GO based on scientific literature, automated analyses based on sequence homology, and assertions made by expert curators. Annotations change over time on the basis of emerging biological knowledge, and the content of the GO also changes as terms are added or removed; annotations are therefore updated periodically.

4. Sample annotation vs. gene annotation

In the example above we considered gene specific annotation and explored the use of the GO in the context of protein databases. Now let us suppose our PhD student has several human cell lines and is sequencing these to compare differences in the expression of genes in these samples. We saw that GO provides annotation on processes, function and cellular compartment, so what sort of annotation about these cell lines is important and why?

Cell lines can be immortalized and in this case are derived from diseased tissue in a human and are used as a model system for investigating the disease process. Cell lines are commercially available from centres such as ATCC, who provide rich information about the cell type, growth conditions and disease of the donor. This information is expressed as free text in ATCC, and some of this text has been structured into an application ontology called EFO. This allows us to identify all cell lines which are derived from cancer samples if the EFO terms are mapped into available data sets. The relationships between concepts relating to cell lines are shown in the figure below and are represented in Manchester Syntax.

‘cell line’ and derives_from some ‘organism part’

‘cell line’ and bearer_of some ‘disease state’

‘cell line’ and derives_from only from some ‘species’

‘cell line’ and derives_from some ‘cell type’

Once we have this information for our cell lines of interest and these are mapped into an appropriate dataset, we can combine this information with the gene annotation using GO and expression data and perform complex queries. For example: which human genes annotated as having GO process ‘cell adhesion’ are over-expressed in cell lines derived from cancer cells?

This type of query thus requires multiple ontologies, mapped into two different datasets and a GUI to visualize the result of the query, or some programmatic access. In this example the annotations were mapped to sample data (provided by biologists like our PhD student) by the ArrayExpress curators, and the gene annotations were provided by the GOA curators at the EBI and data is visualized by the Atlas of Gene Expression at the EBI. Ontologies therefore can be made directly interoperable via application ontologies or via data.
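The shape of such a combined query can be sketched with three small lookups, one per source of annotation. All gene names, cell line names and expression calls below are invented, and exact string matching stands in for the ontology-aware matching a real system would use.

```python
# All records below are invented for illustration.
# Gene annotation (in the article's example, supplied by GO curators).
GO_PROCESS = {
    "GENE1": {"cell adhesion"},
    "GENE2": {"cell adhesion"},
    "GENE3": {"cell cycle"},
}

# Sample annotation: cell line -> disease state of the source tissue.
CELL_LINE_DISEASE = {"line X": "cancer", "line Y": "normal"}

# Expression results: (gene, cell line) pairs called as over-expressed.
OVER_EXPRESSED = [("GENE1", "line X"), ("GENE2", "line Y"), ("GENE3", "line X")]


def genes_for(process, disease):
    """Genes with the given GO process over-expressed in lines with the disease."""
    return sorted(
        {
            gene
            for gene, line in OVER_EXPRESSED
            if process in GO_PROCESS.get(gene, set())
            and CELL_LINE_DISEASE.get(line) == disease
        }
    )


print(genes_for("cell adhesion", "cancer"))
```

The join only works because both datasets use shared terms for genes and samples, which is the interoperability via data that the paragraph above describes.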

5. Tools for applying ontologies to data

We have already discussed two user groups: specialist curators who build and annotate to GO and an application ontology, and our PhD student, who is annotating their own data and consuming existing GO annotations. This suggests we need different types of tools for these two types of users, who have different skill sets.

Where can I get GO annotations?

GO annotations are available from many different resources. A complete list of tools that search the GO is maintained by the Gene Ontology Consortium; many of these tools also provide links to proteins annotated to GO terms.

What tools can I use to annotate my samples?

In our example of samples annotated with cell lines, the annotation is made in the context of the submission to a database and is performed by curators who use lexical matching tools combined with manual curation. There are also data submission and annotation tools such as Annotare.
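A minimal sketch of lexical matching is shown below: free-text sample descriptions are normalised and compared with ontology labels and synonyms as whole words. The term identifiers and synonym lists are invented, and this is not a description of how any particular curation tool works.

```python
import re

# Hypothetical term identifiers with their labels and synonyms.
TERMS = {
    "EX:0001": ["fibroblast", "fibroblast cell"],
    "EX:0002": ["HeLa", "HeLa cell line"],
}


def normalise(text):
    """Lower-case the text and collapse punctuation and spacing to single spaces."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def match_terms(free_text, terms):
    """Return ids whose label or synonym occurs as whole words in the text."""
    haystack = f" {normalise(free_text)} "
    hits = []
    for term_id, labels in terms.items():
        if any(f" {normalise(label)} " in haystack for label in labels):
            hits.append(term_id)
    return hits


print(match_terms("Human HeLa cells, cervix", TERMS))
print(match_terms("dermal fibroblasts", TERMS))
```

The second call finds nothing because the plural ‘fibroblasts’ is not an exact match, which is one reason lexical matching is combined with manual curation.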

How can I search ontologies?

The BioPortal and Ontology Look-up Service (OLS) search tools provide access to multiple ontologies, which can be searched singly or in combination for common concepts such as ‘fibroblast’.

Can annotation be automated?

Human curators are expensive, highly skilled individuals and the volume of data is growing beyond the ability of existing curators to annotate it. There are a number of attempts to automate annotation using text mining tools such as Textpresso and Whatizit, and curator support tools also use this technology.

6. Conclusion

Data is annotated with ontologies by both biologists and specialist curators who both use and create ontologies for this purpose. Annotation is made available by databases which offer GUIs for searching and programmatic access via APIs. Some data is automatically annotated using text mining tools.

Acknowledgements

This paper is an open access work distributed under the terms of the Creative Commons Attribution License 2.5 (http://creativecommons.org/licenses/by/2.5/), which permits unrestricted use, distribution, and reproduction in any medium, provided that the original author and source are attributed.

The paper and its publication environment form part of the work of the Ontogenesis Network, supported by EPSRC grant EP/E021352/1.
