Ontological Annotation of Data
This is a blogged article on how scientific data, specifically biomedical data, is annotated with ontologies. It introduces the kinds of data that are annotated, the people who perform annotation, the ontologies used for annotation, and resources such as databases which make that annotation available to users. It is not intended to be a comprehensive guide to all ontologies or databases which use ontologies.
Helen Parkinson is a geneticist who was seduced to the dark side (Bioinformatics) 10 years ago. She manages and annotates high throughput functional genomics data for the ArrayExpress database and Atlas of Gene Expression hosted at The European Bioinformatics Institute. She also builds ontologies such as EFO and OBI to annotate these data.
1. What does ‘Ontological Annotation of Data’ mean?
Good question. Let's start with what we mean by data in this context. There are articles describing 58 new databases and 73 updated databases in the 2010 NAR database issue. These databases are necessary because the scientific technology we use in biomedicine now produces huge amounts of data. For example, a PhD student in 1990 might routinely sequence 1 kilobase of DNA using 35S sequencing technology (subject to possessing the technical skills to do the experiment). Such volumes of data can be stored easily in a FASTA format file. The same PhD student in 2010 could sequence several human genomes (subject to funding, access to a sequencing facility that will perform the experiment, and ethical approval).
This presents a data and knowledge management problem. The raw data generated by the sequencer can be stored in the same file formats as used in 1990; however, the information about the genes present in the genome, their position, their function and whether they are expressed in the individual being assayed is usually stored in a database. When we consider the phenotype of the human from whom samples were taken, the purpose of the study, and the results generated by the study, there are two axes of annotation to consider. The first relates to what is being assayed: the genetic content of the individual, where the genes are, and what they may do. The second is the metadata about the individual: age, sex, physical characteristics, diseases they may have, and what was actually sampled, e.g. diseased or normal tissue, or peripheral blood.
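For readers who have not met FASTA: a record is a header line starting with '>' followed by one or more lines of sequence. A minimal sketch of a reader in Python is given here; the record identifier and the bases are made up for illustration, and real projects would normally use an established library rather than this hand-rolled parser.

```python
from io import StringIO


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text stream."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Hypothetical record: the identifier and the bases are invented.
example = StringIO(">clone_1 hypothetical sequencing read\nATGGCGTACG\nTTAGC\n")
records = list(read_fasta(example))
```

The point of the sketch is how little structure the format carries: a name and a string of bases, with nowhere to record function, expression or sample information.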
2. Who does the annotation?
In our example the PhD student may have done the annotation of the 1 kilobase of DNA in 1990, and PhDs were awarded for finding genes, sequencing parts of the genome and functional analysis. In 2010 the functions of many genes are known and this information is reported in the scientific literature as free text. Free text can be searched effectively, but the information on gene function is more useful when it is organised and the knowledge is linked to the gene information. The most commonly used ontology in Biomedicine is the Gene Ontology or GO, which has the “aim of standardizing the representation of gene and gene product attributes across species and databases”. The Gene Ontology is built by a team of specialist bioinformaticians who structure the ontology, provide definitions and generally ensure that it is fit for purpose (add a ref to the GO chapter). GO is used by curators of model organism databases like Zfin or domain specific databases like Uniprot to annotate genes.
3. Why do they do it?
GO is used to describe gene products in a formal and structured way. As gene products have common functions across species, many proteins in more than 20 species are annotated to the GO term ‘transcription factor binding’ in Uniprot (a database of proteins). Transcription factor binding is a high level term; it has 9 direct child terms, each of which also has child terms linked by is-a relationships. The structure of the molecular function hierarchy of the GO allows subsumption queries which traverse these relationships, so that more or less specific knowledge is represented as the hierarchy is traversed. The GO enforces an ‘all paths to root must be true’ rule, so the terms and their relationships represent a statement of biological truth based on available knowledge.
More or less specific annotation can be specified by a curator selecting a term from lower or higher up the hierarchy. Annotations are made to GO based on scientific literature, automated analyses based on sequence homology and assertions made by expert curators. Annotations change over time on the basis of emerging biological knowledge, and the content of the GO also changes as terms are added or removed. Annotations are therefore updated periodically.
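A subsumption query of the kind described above can be sketched in a few lines of Python. This is an illustration only: the hierarchy is a toy, the child terms and protein names are hypothetical placeholders rather than real GO or Uniprot content, and a real resource would hold the relationships in a database or reasoner rather than a dictionary.

```python
# Toy is-a hierarchy: term -> list of parent terms.
# The "hypothetical ..." terms are placeholders, not real GO terms.
IS_A = {
    "binding": ["molecular function"],
    "protein binding": ["binding"],
    "transcription factor binding": ["protein binding"],
    "hypothetical child term A": ["transcription factor binding"],
    "hypothetical child term B": ["transcription factor binding"],
    "hypothetical grandchild term A1": ["hypothetical child term A"],
}

# Invented annotations made at different levels of specificity.
ANNOTATIONS = {
    "protein_1": {"hypothetical grandchild term A1"},
    "protein_2": {"transcription factor binding"},
    "protein_3": {"protein binding"},
}


def ancestors(term):
    """Return every term reachable from `term` by following is-a links."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in IS_A.get(current, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def annotated_to(term):
    """Subsumption query: proteins annotated to `term` or to any descendant."""
    return sorted(
        protein
        for protein, terms in ANNOTATIONS.items()
        if any(t == term or term in ancestors(t) for t in terms)
    )


tf_binders = annotated_to("transcription factor binding")
all_binders = annotated_to("binding")
```

Because every path to the root must be true, a protein annotated to the specific grandchild term is also returned when we ask for ‘transcription factor binding’, while a protein annotated only to ‘protein binding’ is not.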
4. Sample annotation vs. gene annotation
In the example above we considered gene specific annotation and explored the use of the GO in the context of protein databases. Now let us suppose our PhD student has several human cell lines and is sequencing these to compare differences in expression of genes in these samples. We saw that GO provides annotation on processes, function and cellular compartment, so what sort of annotation about these cell lines is important and why?
Cell lines can be immortalized and in this case are derived from diseased tissue in a human and are used as a model system for investigating the disease process. Cell lines are commercially available from centres such as ATCC, who provide rich information about the cell type, growth conditions and disease of the donor. This information is expressed as free text in ATCC, and some of this text has been structured into an application ontology called EFO. This allows us to identify all cell lines which are derived from cancer samples if the EFO terms are mapped into available data sets. The relationships between concepts relating to cell lines are shown in the figure below and are represented in Manchester Syntax.
‘cell line’ and derives_from some ‘organism part’
‘cell line’ and bearer_of some ‘disease state’
‘cell line’ and derives_from only from some ‘species’
‘cell line’ and derives_from some ‘cell type’
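To make the restrictions above concrete, here is a rough Python sketch that treats each restriction as a field on a cell line record and then asks for all cell lines derived from cancer samples. Everything in it is an assumption for illustration: the cell line names are hypothetical, the small disease hierarchy is a toy and not EFO content, and an ontology would normally be queried with a reasoner rather than with plain data classes.

```python
from dataclasses import dataclass

# Toy disease hierarchy (term -> parent); not taken from EFO.
DISEASE_IS_A = {
    "cancer": "disease",
    "carcinoma": "cancer",
    "leukemia": "cancer",
    "diabetes mellitus": "disease",
}


@dataclass(frozen=True)
class CellLine:
    name: str
    organism_part: str  # derives_from some 'organism part'
    cell_type: str      # derives_from some 'cell type'
    species: str        # the derives_from restriction on 'species'
    disease_state: str  # bearer_of some 'disease state'


def is_a(term, ancestor, parents=DISEASE_IS_A):
    """True if `term` equals `ancestor` or reaches it by following is-a links."""
    while term is not None:
        if term == ancestor:
            return True
        term = parents.get(term)
    return False


# Hypothetical cell lines; names and values are invented.
CELL_LINES = [
    CellLine("hypothetical line A", "liver", "epithelial cell",
             "Homo sapiens", "carcinoma"),
    CellLine("hypothetical line B", "blood", "lymphocyte",
             "Homo sapiens", "leukemia"),
    CellLine("hypothetical line C", "pancreas", "epithelial cell",
             "Homo sapiens", "diabetes mellitus"),
]

cancer_lines = [c.name for c in CELL_LINES if is_a(c.disease_state, "cancer")]
```

The query never mentions ‘carcinoma’ or ‘leukemia’ by name; the hierarchy supplies that knowledge, which is what free text alone cannot do.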
Once we have this information for our cell lines of interest and these are mapped into an appropriate dataset, we can combine this information with the gene annotation using GO and expression data and perform complex queries. For example: which human genes annotated as having GO process ‘cell adhesion’ are over-expressed in cell lines derived from cancer cells?
This type of query thus requires multiple ontologies mapped into two different datasets, and either a GUI to visualize the result of the query or some programmatic access. In this example the annotations were mapped to sample data (provided by biologists like our PhD student) by the ArrayExpress curators, the gene annotations were provided by the GOA curators at the EBI, and the data is visualized by the Atlas of Gene Expression at the EBI. Ontologies therefore can be made directly interoperable via application ontologies or via data.
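The shape of such a combined query can be sketched as a join over two annotated datasets. The Python below is a toy under stated assumptions: the gene names, cell line names and expression calls are invented, the sample annotation is reduced to a set of cancer-derived lines, and it does not reflect how the Atlas of Gene Expression or GOA actually store or serve their data.

```python
# Gene annotation dataset: gene -> GO process terms (invented values).
GO_ANNOTATIONS = {
    "GENE1": {"cell adhesion"},
    "GENE2": {"cell adhesion", "signal transduction"},
    "GENE3": {"signal transduction"},
}

# Sample annotation dataset: cell lines already identified as cancer-derived.
CANCER_DERIVED = {"line A", "line B"}

# Expression results: (gene, cell line, direction of change); invented values.
EXPRESSION = [
    ("GENE1", "line A", "up"),
    ("GENE1", "line C", "up"),
    ("GENE2", "line C", "up"),
    ("GENE2", "line B", "down"),
    ("GENE3", "line B", "up"),
]


def overexpressed_in(process, samples):
    """Genes annotated to `process` that are over-expressed in any of `samples`."""
    hits = {
        gene
        for gene, sample, direction in EXPRESSION
        if direction == "up"
        and sample in samples
        and process in GO_ANNOTATIONS.get(gene, set())
    }
    return sorted(hits)


adhesion_genes = overexpressed_in("cell adhesion", CANCER_DERIVED)
```

Each condition in the comprehension draws on a different source: the expression call from the experiment, the sample filter from the cell line annotation, and the process filter from the gene annotation.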
5. Tools for applying ontologies to data
We have already discussed two user groups: specialist curators, who build and annotate to GO and an application ontology, and our PhD student, who is annotating their own data and consuming existing GO annotations. This suggests we need different types of tools for these two types of users, who have different skill sets.
Where can I get GO annotations?
GO annotations are available from many different resources. A complete list of tools that search the GO is maintained by the Gene Ontology Consortium, and many of these tools also provide links to proteins annotated to GO terms.
What tools can I use to annotate my samples?
In our example of samples annotated with cell lines, the annotation is made in the context of the submission to a database and is performed by curators who use lexical matching tools combined with manual curation. There are also data submission and annotation tools such as Annotare.
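As an idea of what lexical matching means in practice, the sketch below normalizes a submitter's free text and looks it up against ontology term labels and synonyms, leaving anything unmatched for a human curator. It is a minimal illustration and not the curators' actual tooling: the term identifiers, labels and synonyms are hypothetical.

```python
import re

# Hypothetical ontology terms: identifier -> preferred label and synonyms.
TERM_LABELS = {
    "TERM:0001": ["liver", "hepatic tissue"],
    "TERM:0002": ["peripheral blood"],
    "TERM:0003": ["cell line"],
}


def normalize(text):
    """Lower-case, replace punctuation with spaces and collapse whitespace."""
    return " ".join(re.sub(r"[^a-z0-9]+", " ", text.lower()).split())


# Lookup table from every normalized label or synonym to its term identifier.
LOOKUP = {
    normalize(label): term_id
    for term_id, labels in TERM_LABELS.items()
    for label in labels
}


def match(free_text):
    """Return the matching term identifier, or None to flag manual curation."""
    return LOOKUP.get(normalize(free_text))


submitted = ["Peripheral  Blood", "HEPATIC-tissue", "liver (left lobe)"]
matches = {value: match(value) for value in submitted}
```

The first two values match despite differences in case, spacing and punctuation; the third does not match exactly and so is the kind of case that falls to manual curation.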
How can I search ontologies?
Can annotation be automated?
Human curators are expensive, highly skilled individuals, and the volume of data is growing beyond the ability of existing curators to annotate it. There are a number of attempts to automate annotation using text mining tools such as Textpresso and Whatizit, and curator support tools also use this technology.
Data is annotated with ontologies by both biologists and specialist curators who both use and create ontologies for this purpose. Annotation is made available by databases which offer GUIs for searching and programmatic access via APIs. Some data is automatically annotated using text mining tools.
This paper is an open access work distributed under the terms of the Creative Commons Attribution License 2.5 (http://creativecommons.org/licenses/by/2.5/), which permits unrestricted use, distribution, and reproduction in any medium, provided that the original author and source are attributed.
The paper and its publication environment form part of the work of the Ontogenesis Network, supported by EPSRC grant EP/E021352/1.