Introduction: Health informatics terminologies and related resources

Healthcare and medical terminology systems come in a sometimes bewildering variety of flavours and ‘standards’.   Despite the similar names, there have been, and continue to be, a number of fundamentally different sorts of artifact – see “Related things easily confused with ontologies” in What is an ontology.

They have a history going back at least to the mid 19th century and, arguably, to the London Bills of Mortality, published every Thursday from 1603 until the 1830s.

However, the real development started in the mid 19th century.

What follows does not claim to be complete.  (Please use the blog to comment and suggest additions for future versions.)  It aims to be a quick guide to major healthcare terminology resources, particularly for those from outside the medical field.   The organisation is a rough combination of purpose, prominence, and sponsoring organisation. Some of the artifacts described claim to be “ontologies”; others specifically deny being “ontologies”; and even amongst those that use the word “ontology”, the meaning varies.

A detailed report on the current state of terminologies around 2008 prepared for a “roadmap” for EU development in “Semantic Healthcare” is available at http://www.semantichealth.org/DELIVERABLES/SemanticHEALTH_D6_1.pdf

Families of General Health Informatics Terminologies, Classifications, and Ontologies

ICD and Related (International Classification of Diseases)

The ICD is the primary classification for international reporting of causes of death and disease (mortality and morbidity).  It traces its roots to the 1860s, when there was acute competition between a multiaxial French system developed by Marc D’Espine and a uniaxial English system developed by William Farr.  Whatever the merits of the case, after 15 years or more of meetings the English system was chosen and became the basis of what we now know as the International Classification of Diseases, which is used for almost all international recording of causes of death and forms the foundation for most disease classifications and terminologies.  It is now maintained by WHO, is in its 10th edition, and is entering the revision to lead to its 11th edition, planned to appear in 2011.  The official WHO version of ICD is available at http://apps.who.int/classifications/apps/icd/icd10online/.  It is also available for browsing from other sources, including the National Center for Biomedical Ontology’s BioPortal.

The official WHO ICD is organised in 22 chapters of which the first eighteen form the core, briefly: Infectious and parasitic diseases, Neoplasms, Diseases of blood and blood-forming organs, Endocrine, Behavioural, Nervous system, Eye, Ear, Circulatory, Respiratory, Digestive, Skin, Musculoskeletal, etc.

ICD is a “classification” rather than an “ontology”: its organising principle is that, for statistical reporting, every case must be counted once and exactly once. To achieve this, ICD codes form a strict mutually exclusive and exhaustive monohierarchy. Each code must have exactly one parent, and all the children of a parent must completely cover all patients that fall under that heading.

ICD-10 codes represent the path from the root of the hierarchy to the code.  They are of the form Ann.m, where the Ann part represents the heading condition and the number after the full stop indicates the specific detailed type. (In previous versions, the first character indicated the chapter, but some chapters have run over, so the system had to be changed.)  For example, in Chapter IX, “Diseases of the circulatory system” (I00–I99): I21, “Acute myocardial infarction”; I21.0, “Acute transmural myocardial infarction of anterior wall”.
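The strict monohierarchy can be made concrete with a small sketch in RDF Turtle (the notation also used later in this collection).  This is purely illustrative: ICD is not distributed as RDF, and the icd: namespace here is invented for the example:

@prefix icd:  <http://example.org/icd10#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

# Each code has exactly one parent, mirroring the strict monohierarchy.
icd:I21   rdfs:label "Acute myocardial infarction" ;
          rdfs:subClassOf icd:I20-I25 .    # the ischaemic heart diseases block
icd:I21.0 rdfs:label "Acute transmural myocardial infarction of anterior wall" ;
          rdfs:subClassOf icd:I21 .
icd:I21.9 rdfs:label "Acute myocardial infarction, unspecified" ;
          rdfs:subClassOf icd:I21 .        # a residual code keeps the siblings exhaustive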

ICD has many additional features that have evolved over the years.  There is an elaborate mechanism of “excludes” and “includes” annotations for coders.  There are two further volumes besides the primary hierarchy: one of precedence rules for choosing which diagnosis is the primary cause of death, and another that is an index, partly language specific, to help coders locate codes for words or phrases that doctors may have used that do not correspond to the official name.  There are also “residual” categories – i.e. “Not otherwise specified”, “Not elsewhere classified”, and “Other” – required to meet the requirement that at each level the child classes be exhaustive.  There is a cross-referencing mechanism – the “dagger-asterisk” mechanism – that provides a primitive means for looking at diseases alternatively from an anatomical and an aetiological point of view.  The details of these features are beyond the scope of this article, but readers should be aware of their existence.  The overall message is “Tread carefully when making assumptions about ICD.”

ICD, in its various forms, has a legal status in many jurisdictions.  This fact, along with the desire to maintain continuity in international statistics, means that compatibility and stability are prime considerations in revisions. The requirement for continuity inevitably conflicts with demands for revisions to reflect changes in medical knowledge and to correct errors and omissions.

Closely related systems and special chapters

The main content of ICD10 is found in chapters I – XVIII.  The remainder are in some ways special or auxiliary to the main purpose of international statistical reporting.

  • Chapters XIX and XX (“Injury, poisoning and certain other consequences of external causes” and “External causes of morbidity and mortality”) form virtually a separate classification of “external causes”.
  • Chapter XXI (“Factors influencing health status”) forms another virtually separate classification.
  • Chapter XXII (“Codes for special purposes”) is a rag bag of information that must be coded in ICD for various purposes, at least in some jurisdictions, but does not fit elsewhere.

In addition, two other chapters have special characteristics:

  • Chapter V of ICD, on mental illness, is closely aligned to DSM-IV (and the emerging DSM-5), the standard diagnostic system for psychiatry (see below).
  • Chapter XVIII (“Symptoms, signs and abnormal clinical and laboratory findings, not elsewhere classified”) is used primarily in primary care and for auxiliary information rather than for morbidity and mortality recording.

Finally:

  • ICF – the International Classification of Functioning, Disability and Health (formerly ICIDH – the International Classification of Impairments, Disabilities and Handicaps) – provides a listing for determining functional status rather than diagnosis, but is now being developed in conjunction with ICD-11.

Clinical Modifications (CMs)

In addition, ICD has been adapted as a billing coding system in several countries, including the US and Australia.  The ICDxCM versions (“CM” for “clinical modification”; the Australian adaptation is known as ICD-10-AM) effectively add an additional digit for subclasses of existing codes to give sufficient granularity for billing and remuneration purposes.

SNOMED-CT and the Read Codes

SNOMED and the Read Codes come from very different backgrounds but have converged to produce SNOMED-CT, so are treated together.

SNOMED & SNOMED-RT

The ancestor of SNOMED was the Systematized Nomenclature of Pathology (SNOP), developed by Roger Côté and David Rothwell, and eventually managed by the College of American Pathologists.  Unlike ICD, which had a single axis, SNOP organised information in four axes that could be freely combined, one term from each axis: anatomy (“topography”), morphology, aetiology, and function.  This basic structure was taken over when it was extended more broadly to medicine as a whole to become “SNOMED International”, a version of which is still used in France, French-speaking Canada, and parts of Germany. The system was highly flexible but unconstrained – it was possible to code an “inflammation” (morphology) of the “lung” (anatomy) caused by “pneumococcus” (aetiology), but it was also possible to code “Emotional changes” (morphology) of “blood” (anatomy) caused by a “donkey” (aetiology).

In the mid-1990s, the College of American Pathologists was convinced to create a scheme based on the description logic KREP, a variant of what would now be called EL++, to produce a “reference terminology” – SNOMED-RT.  Major funding ($26M) was obtained and a large effort initiated.

The Read Codes / Clinical Terms Versions I, II, and III

In the mid 1980s, Abies Systems in the UK and James Read developed one of the early primary care systems based on recording four-character codes, initially as a means of conserving disk space in an era when a 1MB floppy was “big.”  The Read Codes version I were based roughly on the ICD organisation, with many enhancements to deal with signs, symptoms, and other issues relevant to British general practice.  By the early 1990s, James Read had gained control of the codes and sold them to the UK government, along with an agreement to set up a Centre for Coding and Classification, which he headed.  The Centre embarked on a long effort to develop Read II – or “5 digit Read” – to cover all of medicine.  The effort was ultimately defeated by the combinatorial explosion of terms, but the GP subset remains the standard coding system used by the majority of British GPs.

In the mid 1990s, the effort turned to Read III, which was to be a radically more flexible system including modifiers and greater structure.  Read himself left the project, which then became known officially as “Clinical Terms V3” or just “CTv3”.  Before this effort could be completed, a merger with the SNOMED-RT effort was arranged, under which the two were combined to become SNOMED-CT.  A few British general practices still use CTv3, but the major effort has shifted to SNOMED-CT.

SNOMED-CT

Overview

The result of the collaborative development of CTv3 and SNOMED-RT is SNOMED-Clinical Terms, or SNOMED-CT.   SNOMED-CT kept the basis in description logic from SNOMED-RT and developed and extended the delivery and identifier mechanisms, and much of the content, from CTv3.  It was developed by the College of American Pathologists (CAP) until 2007, when it passed to a new International Health Terminology Standards Development Organisation (IHTSDO).  The IHTSDO is supported by national subscriptions; the system is free for users from subscribing countries, and for academic and other specialised uses world-wide. Detailed information is available at www.ihtsdo.org and www.ihtsdo.org/snomed, where there are also means for obtaining support, reporting errors, participating in forums, etc.

SNOMED-CT currently comprises roughly 450,000 distinct “concepts” and over a million “terms” – i.e. labels for concepts.  Each concept has a single “Fully Specified Name” and a single “Preferred Term”, but there may be many synonym terms.  Both concepts and terms have IDs, which have a rich internal structure allowing namespaces and allocation to subsidiary developers. IDs are never re-used but can be made obsolete.   A standard for use of SNOMED is that the term and ID are always quoted together, e.g. “Essential Hypertension|99042012”.

SNOMED is normally distributed as a set of files suitable for use in standard relational databases. Much of the authoring is done using a description logic equivalent to EL++. The “stated form” in KRSS notation is now distributed along with a Perl script that will convert it to OWL notation for loading into OWL tools.

SNOMED is currently mandated for use in the UK and for some purposes in the US, but the number of installations is still limited.  Almost no one uses it “out of the box”, partly because of its enormous size, and partly because of the limitations in its curation. Most application developers create a “subset” for a particular purpose that they curate carefully.  The full system, as delivered, continues to have numerous errors, ambiguities, easily confused terms and idiosyncrasies that make it unsuitable for direct use for most purposes, except by those intimately familiar with its detailed construction.  However, it undoubtedly provides the broadest coverage of any clinical vocabulary.  Because of the legacy from primary care of the Clinical Terms (Read Codes), it is particularly rich in terms for signs and symptoms. Its anatomy axis is also well selected and contains a useful partonomy that has been taken as a starting point for numerous other systems.  A major project is nearing completion to map SNOMED-CT to ICD-10-CM, so that it can be used directly for recording information for billing in the US.

Pre-coordination, post-coordination,  the problem of equivalence, and “Situations with explicit context”

Because SNOMED allows qualifiers – such as severity and site – it is potentially possible to express a notion such as “Fracture of the Hip” either as a single “pre-coordinated” term with a definition or as a compound “post-coordinated” expression, consisting just of the defining expression without any label.  The “problem of equivalence” is the problem of recognising the equivalence between a “pre-coordinated” term and a “post-coordinated” expression (or between alternative logically equivalent post-coordinated expressions).  The problem is important because, for historical reasons, many notions such as “Family history of Hypertension” have pre-coordinated terms in SNOMED, but such terms exist only for the most common cases.  Hence, software often provides both options, and the possibility of entries occurring in both forms needs to be dealt with.
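As an illustration, the sketch below shows, in OWL (Turtle notation), how a reasoner can recognise a pre-coordinated class and a post-coordinated expression as equivalent.  The ex: namespace and all the names are invented for illustration; they are not actual SNOMED-CT identifiers or attributes:

@prefix ex:  <http://example.org/snomed-sketch#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .

# The pre-coordinated term carries a full definition; the bracketed class
# expression is exactly what a post-coordinated entry would state, so a
# reasoner can infer that the two denote the same class.
ex:FractureOfHip owl:equivalentClass [
    a owl:Class ;
    owl:intersectionOf (
        ex:Fracture
        [ a owl:Restriction ;
          owl:onProperty ex:findingSite ;
          owl:someValuesFrom ex:HipRegion ]
    ) ] .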

A notable feature of SNOMED is that any term that is used in a negation, conjunction or qualified by a notion such as “history of…”, “family history of…” etc. appears under a “Situation with explicit context”.  Therefore, subclasses of a disease must always be checked for in two locations, under the disease itself, and under “Situation with explicit context”.  SNOMED provides no mechanism for doing this automatically, although the description logic could potentially make it easy to construct such queries.

Access to SNOMED

Access to SNOMED depends on what country you are in.  However, for academic purposes, it is usually free to use.  There are a variety of browsers, but the most widely used is CliniClue Xplore, which is kept up to date and allows download of up-to-date data.

Bibliographic systems – MeSH

The “Medical Subject Headings” (MeSH), maintained by the US National Library of Medicine for use in indexing Medline and PubMed (http://www.ncbi.nlm.nih.gov/pmc/), are the standard set of bibliographic headings for medical, and many biomedical, applications.  MeSH is a thesaurus; it is explicitly neither a terminology nor an ontology. The hierarchical relation is “broader-than/narrower-than”.  The numbering system indicates a path-name, but more than one path can terminate on the same term.  There are, however, no identifiers other than path names and text terms.

For almost any application, one requirement will be the ability to cross-reference MeSH headings in order to look up additional resources in PubMed/Medline.

Meta-thesauri – UMLS and the NCI Meta-thesaurus

The UMLS Metathesaurus

The US National Library of Medicine embarked in the early 1990s on the development of a cross-referencing system amongst existing terminologies, now known as the Unified Medical Language System (UMLS) Metathesaurus. The UMLS cross-references concepts from over a hundred classifications, coding systems, and terminologies (http://www.nlm.nih.gov/research/umls/).  Use is free, but users must obtain and agree to licensing terms.  In general, it is forbidden to extract an entire terminology, or a large fraction of a terminology – e.g. SNOMED – from the information in the UMLS, even though it might, in principle, be possible to do so.   All resources are available either via a remote API or to download and mount locally (although to do so is a major task).

The key notions in UMLS are the “Concept Unique Identifier” (CUI) and the “Lexical Unique Identifier” (LUI).  Each term has an LUI (up to certain normalisations for plurals, tenses, etc.), and each LUI is associated with one or more CUIs.  Where more than one LUI is associated with the same CUI, there is a case of synonymy; where more than one CUI is associated with the same LUI, there is a case of polysemy.
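A small sketch in Turtle makes the distinction concrete.  The identifiers and property names below are invented for illustration; the actual UMLS is distributed as relational files, not RDF:

@prefix ex: <http://example.org/umls-sketch#> .

# Synonymy: one concept (CUI) with two lexical forms (LUIs),
# e.g. "hypertension" and "high blood pressure".
ex:C0000001 ex:hasLexicalForm ex:L0000001 , ex:L0000002 .

# Polysemy: one lexical form (LUI) shared by two concepts (CUIs),
# e.g. "cold" as a temperature and "cold" as an illness.
ex:L0000003 ex:expressesConcept ex:C0000002 , ex:C0000003 .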

Most importantly, UMLS maps terms from virtually all other terminologies and ontologies to MeSH, which allows them to be used as entry points to PubMed/Medline.   Any developer of any terminology would be well advised to ensure that there is a route for mapping their terminology to UMLS CUIs and LUIs, to ensure that it can be used to access this critical bibliographic resource.

Other UMLS Knowledge Resources

Access to the UMLS also gives access to a range of other resources, many aimed at language processing.

  • A Semantic Network (UMLS SN) of around 200 notions that provides high level categories for UMLS concepts.  The Semantic Network is particularly adapted to linguistic usage.
  • The “norming” software that is used to convert raw lexical strings to the lexical units that receive “LUIs”.  This deals with a range of issues – tense, case, word order, etc.
  • A range of other lexical tools that are free to download.

The NCI Thesaurus and Metathesaurus

The US National Cancer Institute (part of NIH – the National Institutes of Health) has developed its own “Thesaurus” – the NCI Thesaurus (NCIT) – implemented in OWL, which is in fact closer to an “ontology” as the term is used in this book.  It covers all cancers and related diseases, plus anatomy and various external factors.  It forms the basis for the Enterprise Vocabulary Services (EVS) of the major software infrastructure caBIG being developed and mandated by the NCI.  The terminology in the NCI Thesaurus is used, amongst other things, in the development and annotation of elements for caBIG using the ISO 11179 standard for metadata registries.

As a separate artifact, but part of the same overall effort, the NCI has also developed the NCI Metathesaurus, which is similar in principle to the UMLS Metathesaurus but covers a different, although in places overlapping, set of resources, focuses specifically on cancer, and is linked into the EVS.

The NCI Thesaurus and Metathesaurus are particularly useful for translational medicine applications – i.e. those applications bridging molecular biology and clinical research and practice.

GALEN

GALEN (http://www.opengalen.org) is a generic ontology/terminology of clinical medicine developed in a series of European projects. It is almost certainly the most intricate large ontology of clinical medicine so far developed.  It is no longer being actively maintained, but is available for mining for healthcare purposes and testing as a challenge to description logic reasoners.  It was originally formulated in an early and somewhat idiosyncratic description logic, GRAIL (A. L. Rector, Bechhofer, S., Goble, C. A., Horrocks, I., Nowlan, W. A., and Solomon, W. D. 1997. The GRAIL concept modelling language for medical terminology. Artificial Intelligence in Medicine. 9, 139-171) that is roughly equivalent to EL++ but makes extensive use of role paths in ways that are not conformant to the OWL rules.  It has recently been translated into OWL, although the last details related to role paths have yet to be resolved.

GALEN was used as the basis for the development of the French national classification of procedures and for parts of the maintenance of the Dutch procedure classification.

A major feature of GALEN was its use of an “intermediate representation”.  In order to make authoring easy, definitions were constructed by “dissecting” text descriptions of the procedures into a language that was relatively easy for users but could be translated directly into GRAIL.  The goal was that users could become proficient with no more than three days’ training, and in this the project was largely successful.   (For details see A. L. Rector, Zanstra, P. E., Solomon, W. D., Rogers, J. E., Baud, R., Ceusters, W., W Claassen, Kirby, J., Rodrigues, J.-M., Mori, A. R., Haring, E. J. v. d., and Wagner, J. 1999. Reconciling users’ needs and formal requirements: Issues in developing a re-usable ontology for medicine. IEEE Transactions on Information Technology in BioMedicine. 2:4, 229-242.)

GALEN was based on a well developed upper ontology emphasizing the interaction between causal and part-whole relations and subsumption, and dealt with a wide range of practical issues in reconciling conflicts between common medical usage and the rigorous logic of the description logic. (A. L. Rector and Rogers, J. E. 2006. Ontological and practical issues in using a description logic to represent medical concept systems: Experience from GALEN. In Reasoning Web, P. Barahona, F. Bry, E. Franconi, N. Henze, and U. Sattler, Eds. Springer-Verlag.)

Specialised Terminologies and Classifications

There are myriad specialist terminologies, classifications and ontologies in medical subfields.  Perhaps the best way to get an up-to-date list is to look at the UMLS Knowledge Sources.  Two areas of particular importance are:

  • Nursing – where there are two competing terminologies.
  • Radiology – where there are a number of terminologies; the de facto standard is part of the DICOM image transfer / messaging standard.  In its long history it has previously been known as “ACR-NEMA”, and it now subsumes what was even earlier called the “SNOMED-DICOM Microglossary” (which had little to do with SNOMED despite its name).

Interaction with Electronic Health Records and Messaging

A key function of health informatics terminologies and ontologies is to interact with Electronic Health Records (EHRs).  The one cannot be understood without some knowledge of the other.  The topic of EHR formats is a complicated and confusing area in its own right.   The links below give at least a starting point for further investigation.   The coding system LOINC is included here because of its close association with the HL7 messaging standard.

LOINC

Most laboratory information is exchanged using the LOINC codes – “Logical Observation Identifiers Names and Codes” (http://loinc.org/).  The LOINC codes are an open-source set of codes that developed alongside HL7, the standard data interchange messaging format between clinical systems (see below).   The codes are multipart expressions best understood by consulting the user manual and other information on the LOINC site.

HL7

HL7 – “Health Level 7” (referring to the seventh, application, level of the ISO OSI seven-layer communications model) (http://www.hl7.org) – is the most widely used interchange standard for clinical systems.  It is a messaging system rather than a health record system, but the line is often blurred in practice.  It supports two different interchange schemes:

  • V2 is a bar-delimited format for expressions defined in ASN.1 which, along with the LOINC codes, is almost universal for communication with laboratory equipment in hospitals.
  • V3 is a much more elaborate system, developed on the basis of a “Reference Information Model” (RIM) over the past decade and a half, but implemented to only a limited degree, most widely in the UK as part of the NHS National Programme for IT. V3 is agnostic concerning its choice of coding system, but many parts have been designed with SNOMED-CT in mind.

Although described as “versions”, V2 and V3 have almost nothing in common except their shared aspiration to improve the interworking of clinical information systems.  In addition, the “Clinical Document Architecture” (CDA) section of HL7 V3 is virtually a separate scheme on its own and is often referred to separately (http://www.hl7.org/implement/standards/cda.cfm).  To add to the confusion, CDA also exists in two versions.  Version 2 is nominally compliant with the RIM, but much simpler in practice, and supports coded entries.  Version 1 is largely a mechanism for structuring free text entries.

HL7 maintains its own Vocabulary group that manages the “vocabulary domains” (https://www.hl7.org/library/data-model/RIM/C30202/vocabulary.htm) and the Common Terminology Services and API (http://informatics.mayo.edu/LexGrid/index.php?page=ctsspec).  The Vocabulary group is concerned both with establishing specific “structural” vocabularies needed internally by HL7 and with selecting and establishing the interface to externally defined terminologies such as SNOMED and LOINC.

CEN 13606, OpenEHR, and Archetypes

The most widely used standard for medical records per se, at least in Europe, is CEN 13606 (http://www.en13606.eu/), which is a simplification of the OpenEHR architecture (http://www.openehr.org/home.html). (CEN is the European Standardisation Committee, parallel to ISO internationally.) Both CEN 13606 and OpenEHR use the Archetype formalism (http://www.openehr.org/svn/specification/TRUNK/publishing/architecture/am/archetype_principles.pdf), a data structuring format designed specifically for medical records.  From the point of view of terminology, the most important feature of Archetypes is that they never refer to terminology directly.  Rather, each Archetype has an “Ontology section” that links an internal name to an external reference.  Note that the “ontology section” is not an ontology in itself, but rather a means of redirecting terminology from the Archetype to other resources.  It is therefore a form of “Terminology Binding” (see A. Rector, Qamar, R., and Marley, T. 2009. Binding ontologies and coding systems to electronic health records and messages. Applied Ontology. 4:1, 51-69. pdf).

Ontologies for Sharing, Ontologies for Use

David Shotton §, Chris Catton and Graham Klyne

Image Bioinformatics Research Group, Department of Zoology, University of Oxford, South Parks Road, Oxford OX1 3PS, UK

§ Corresponding author: e-mail: david.shotton@zoo.ox.ac.uk

An ontology is a formal, explicit specification of a shared conceptualisation.

Abstract

Although ontologies have been created to cover a large number of domains, many of them have characteristics that make them difficult to reuse in other contexts.  If ontologies are to be reused easily by others, they should be constructed as sets of small modules, each of which is a simple well-defined subsumption hierarchy that is not over-burdened by domain and range constraints.  This will make it easy for others to re-use the ontology without incurring unintended logical entailments.  In contrast, when using ontologies for particular applications such as the support of knowledge management systems, they may need to be extended and constrained.  This should be done in a manner that does not obscure the interpretation of core metadata items expressed using the extended ontology, when these are read by applications that have no knowledge of those extensions.

Current ontology development practice

Development of the semantic web (http://www.scientificamerican.com/article.cfm?id=the-semantic-web http://doi.ieeecomputersociety.org/10.1109/MIS.2006.62) involves the creation of vocabularies and ontologies designed to describe particular domains of knowledge.  It is desirable that such an ontology, ideally encapsulating a consensus view among domain experts, can be used as part of a practical information management application, and can also be easily shared with the wider community.  This paper examines the conflicting design requirements for these two functions, and summarizes a design approach.

Gruber (http://tomgruber.org/writing/onto-design.htm) stated:

‘An ontology should require the minimal ontological commitment sufficient to support the intended knowledge sharing activities … An ontology serves a different purpose than a knowledge base, and therefore a different notion of representational adequacy [McCarthy and Hayes, 1969] applies’.

However, in the current wave of interest in the semantic web and ontologies, this distinction has been blurred.  Some ontologies available on the web are too tightly specified to be easily shared, and are closer in spirit to knowledge bases than to Gruber’s idea of an ontology.

We believe this is the result of two forces:

a)      The current state of tools and standards for ontology development means that we have only just reached the point where different ontologies can be imported and made to work together easily and reliably.

b)      In developing the semantic web, much effort has been expended on producing stand-alone ontologies for use in specific systems, but less on designing them for general applicability so that they may be widely and easily shared.

If the semantic web is to become more than a collection of isolated knowledge-base systems, we need to be able to re-use and combine ontologies more easily.  There are several reasons for using ontologies created by others rather than writing them ourselves: it saves time and money, it permits each domain to be described accurately by appropriate domain specialists, and most importantly perhaps, it permits larger communities to come to a common agreement on the meaning of specific terms.

However, many ontologies are difficult or impossible to re-use, because they are too complex or over-committed to a particular application, contain inadequate or incomplete class annotations, are poorly structured, express a philosophy that is at odds with that of the importing ontology, or are over-restricted.

a)      Ontology too complex for easy re-use Many ontologies are very large.  For example, CYC (http://www.cyc.com/cyc/technology/whatiscyc) contains over a million assertions, the Gene Ontology (GO) has almost 30,000 terms, while the National Cancer Institute Thesaurus (http://www.mindswap.org/2003/CancerOntology/; http://ncimeta.nci.nih.gov) in 2003 contained about 26,000 concepts and about 71,000 terms (http://www.mindswap.org/papers/WebSemantics-NCI.pdf).  Large size is appropriate when the ontology is an application ontology used to describe a large domain (e.g. GO): such ontologies benefit from having simple structures.  Possibilities for re-use are limited if the ontology in question is too complex or unwieldy.  If the need is just to be able to re-use a few classes, importing a large existing ontology bodily into a new ontology is problematic, since it obscures structure and takes up resources.

Our experience has been that existing published ontologies may also suffer from trivial faults:

  • they may contain a rich mixture of dissimilar class types that are not clearly differentiated;
  • they may contain class duplications and logical inconsistencies that have gone unrecognized because of the ontology’s large size; and
  • they may contain classes that properly belong in separate third-party ontologies, not in the domain-specific ontology under consideration.

In contrast, a number of vocabularies, including Dublin Core (http://dublincore.org/documents/dcmi-terms/), FOAF (http://xmlns.com/foaf/spec/) and SIOC (http://sioc-project.org/ontology), which might be described as generic rather than domain-specific, have found widespread use by virtue of their simplicity. The distinction between application ontologies and reference ontologies is clarified in a related article, Reference and Application Ontologies, by James Malone and Helen Parkinson.

b)      Ontology annotations incomplete or inadequate Ontology class annotations provide an important link between human- and machine-processable knowledge, and for this reason are important components of any ontology.  In some ontologies, the vagueness or inappropriate implied meaning of some of the class names used, coupled with inadequate annotation, renders them confusing.  Other ontologies, while having class names that might seem appropriate for re-use, fail to provide annotations for their classes and properties, or use annotations that are inadequate.  For example, the ka (Knowledge Acquisition) ontology (http://www.cs.man.ac.uk/~horrocks/OWL/Ontologies/ka.owl) contains no human-readable descriptions of the classes and properties, making it difficult to use without extensive cross-referencing to other documents, while in BIBO, the Bibliographic Ontology (http://bibliontology.com/), the class Standard is inadequately annotated as “A document describing a standard”.  Many other annotations are taken directly from Wikipedia, and are not always appropriate.  CIDOC CRM (http://cidoc.ics.forth.gr/) is an example of an ontology with generally good annotations.

c)      Ontology poorly structured For example, in the Bibtex ontology (http://oaei.ontologymatching.org/2004/Contest/301/onto.html) the hasAuthor property is a datatype property with a range of String, which effectively prevents one from identifying an author as a member of the class Person.  Furthermore, it cannot be used to describe an ordered list of authors in an easily machine-processable fashion.

d)      Ontology expresses an alternative philosophy Occasionally, different ontologies express alternative views about a domain, making it difficult to use elements of one easily within the other.  For example, while CiTO, the Citation Typing Ontology (http://purl.org/net/cito/), adopts the Works, Expressions, Manifestations hierarchy of the FRBR (Functional Requirements for Bibliographic Records; http://www.ifla.org/VII/s13/frbr/frbr1.htm) classification model developed by IFLA, the International Federation of Library Associations and Institutions, to characterize different aspects of a publication, BIBO does not.  While CiTO has Work: cito:ResearchPaper; Expression: cito:JournalArticle, BIBO has bibo:AcademicArticle, which conflates these two concepts.  This makes it difficult to re-use BIBO classes within CiTO.

e)      Ontology too committed Perhaps the main problem is that many ontologies commit secondary users to modelling patterns that may be inappropriate to their needs. For example, there is no consensus between bibliographic ontologies as to whether a document is authored by a set of people, an ordered list of people, or an ‘agent’ that may be a corporate body – situations brought about by varying restrictions on the range of the property hasAuthor.  This point relates to the previous one, since over-commitment becomes a problem in the face of alternative philosophies.

Thus, although different ontological representations now abound, many pose problems when it comes to re-using them in other ontologies.  The alternative approach of defining equivalences between terms in different ontologies suffers from some of the same problems, since use of owl:equivalentClass is logically strict.  Strict equivalence is inappropriate if the definitions of the classes within the two ontologies differ significantly.  For example, in FRBR a Work is a distinct intellectual or artistic creation, an abstract concept recognised through its various expressions.  However, in CiTO the definition of cito:Work is restricted to works that cite or may be cited, primarily works of scholarship that contain bibliographic references, rather than artistic works such as plays or photographs that do not.  Thus cito:Work is classified as a subclass of frbr:Work, not an equivalent class.  Such subclassing should be undertaken with caution, since it implies logical entailment of any domain and range restrictions of the superclass.  An alternative is just to indicate that some sort of relationship exists between classes in two ontologies, for example by use of rdfs:seeAlso (http://www.w3.org/TR/rdf-schema/).
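The linking options just described can be contrasted in a few lines of Turtle (the prefix IRIs are indicative only):

@prefix cito: <http://purl.org/net/cito/> .
@prefix frbr: <http://purl.org/vocab/frbr/core#> .
@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

# Too strong here: would assert that the two classes have exactly the same members.
# cito:Work owl:equivalentClass frbr:Work .

# What CiTO actually asserts: every cito:Work is a frbr:Work, and
# cito:Work inherits any domain and range restrictions on frbr:Work.
cito:Work rdfs:subClassOf frbr:Work .

# Weaker still: some relationship exists, with no logical commitment.
cito:Work rdfs:seeAlso frbr:Work .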

Ontologies for sharing, ontologies for use

To avoid these potential problems, we propose the following principles of ontology design to maximize the reusability of ontologies by third parties.  Modifying existing ontologies so that their public shared versions conform to these requirements is best undertaken collaboratively by a group of domain experts.

  1. Ontologies designed for sharing within a community should be kept small, or have a simple structure that uses just a few properties and is not deeply nested. It is easier to achieve consensus about an ontology on a single topic that is small enough to be comprehended in its entirety, just because there are fewer assertions to disagree about.  Small and/or simple ontologies are easier to incorporate by people wishing to use them for third-party applications.
  2. Classes designed for sharing should form simple subsumption (is_a) hierarchies in which sibling classes are disjoint from each other, and where each class represents a single notion.  This makes the ontologies easier to manage and easier to validate by domain experts.
  3. It follows from this that in any collection of related small ontology modules, covering a particular domain of knowledge, each class must be defined in only one ontology module.  If a class appears in two modules, it probably conflates two notions which need to be distinguished and separated.
  4. All classes and properties should be annotated with clear and informative human-readable definitions, containing links to real-world examples where appropriate.
  5. Ontology building is an empirical activity – while each ontology must be fit for purpose, clear and unambiguous, it need not capture the totality of applicable knowledge.
  6. When writing an ontology that covers a small, specialist area, one should not assert anything that is not essential to the intended meaning.  In particular, one should think carefully before specifying the domain and range of a property, since this may cause problems for others extending from the ontology.
  7. Third-party ontologies should be used wherever they are available and suitable, since there is no point in re-inventing wheels.
  8. Ontologies should be written in a standard ontology language such as OWL, and the ontology modules should be validated using an appropriate reasoner, for example FaCT++ (http://owl.man.ac.uk/factplusplus/).
  9. More complex ontologies for specific applications (“knowledge bases” in the sense used by Gruber) can be built by combining smaller ontologies, and then by adding restrictions to enrich their meaning.

This statement of principles reflects the insights on ontology normalization first made by the Manchester ontologist Professor Alan Rector (http://www.cs.man.ac.uk/~rector/papers/rector-modularisation-kcap-2003-distrib.pdf).

To summarize, we propose the following distinctions:

A public shared ontology, as far as possible:

  • should act primarily as a structured defined vocabulary;
  • should define a limited domain;
  • should be a simple subsumption hierarchy with disjoint sibling classes;
  • should be sparing in its use of other ontologies;
  • should be written in a standard ontology language such as OWL; and
  • should have detailed human-readable annotations of the intended meaning of each term.

An application-level ontology, as far as possible:

  • should be based upon or import one or more public ontologies describing particular domains;
  • should restrict and/or cross-relate the public ontologies, thereby enabling more powerful reasoning to be used within a particular application;
  • should extend the public ontology with new classes carefully, enabling more specific descriptions to be made without compromising the ability of third party applications that are not aware of these extensions to make sense of metadata thus encoded; and
  • should ideally also be expressed in OWL-DL, to permit use of a Description Logic reasoner both for validation and for inference of additional relationships defined by the restrictions.

Once ontologies have been made suitable for public sharing and re-use, they should be published on an open access web server, or in an appropriate ontology warehouse such as the Open Biomedical Ontologies or SchemaWeb.  Application-level ontologies are still ‘shared conceptualizations’, but are now shared implicitly, by the users of the application.

Combining modular ontologies

One of the main advantages of producing modular ontologies is that not only can they be extended and constrained for a new purpose, but that they can also be combined easily to produce new or more comprehensive ontologies.  This process is made considerably easier when the modules are built with the same design principles and related to a common upper level ontology. The benefits of common design principles are emphasised by Open Biomedical Ontologies (http://www.obofoundry.org/).  Upper level ontologies are described in the accompanying paper, Upper Level Ontologies, by Frank Gibson.

Pitfalls remain, however, when attempting to combine existing ontologies.  To take a trivial example, semantic integration within the bibliographic community could be achieved with respect to the identification of bibliographic entities themselves by common usage of the Dublin Core property dc:title, which is well defined.  However, FOAF uses the property foaf:title to denote the prefix to a person’s name used to signify veneration, an official position, or a professional or academic qualification (Reverend, Dame, President, Dr, Professor, etc.).  While dc:title and foaf:title are logically distinct, we need to be careful to avoid human misinterpretation of ‘title’, if both are used in the same ontology.
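The contrast is easy to see in Turtle (the ex: instance names are invented for illustration):

@prefix dc:   <http://purl.org/dc/elements/1.1/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix ex:   <http://example.org/demo#> .

# Same local name "title", logically distinct properties:
ex:paper1  dc:title   "Ontologies for Sharing, Ontologies for Use" .
ex:person1 foaf:title "Dr" .   # an honorific, not a document title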

At a deeper level, we need to avoid assumptions that lead to semantic misalignment of terms.  For example, it would be incorrect in many cultures to equate “family name” with “last name”.  A biological example is given by our work with Drosophila genomics data.  The FlyTED Database contains in situ hybridization images of gene expression in Drosophila testis.  In it, each tagged sequence probe used to identify the location of messenger RNA in the specimens was, for convenience, originally described using the name of the gene from which the mRNA was transcribed.  However, because of subsequent gene re-assignments within FlyBase, the genomic database for Drosophila, mismatches developed between the “genes” in FlyTED and the corresponding gene identifiers in FlyBase.  This caused subsequent inconsistencies when combining results from the two databases as part of our OpenFlyData Services, which were resolved by recognising the “gene” names in FlyTED as being distinct from the FlyBase IDs, rather than synonymous, and then by introducing a mapping between them.

There are other problems to be resolved when one has to choose between a number of overlapping ontologies that could be used within an integration, but between which there are significant differences.  How is one to choose between them, or reconcile their differences?  These are issues for which there are no clear prescriptive answers, and for which best practice is still being worked out within the ontology community.  Related issues of semantic data integration are described in the article Semantic Integration in the Life Sciences by Allyson Lister.

Extending shared ontologies

If extension of a ‘public’ ontology is required for a particular application, it is advisable to extend it only by adding new sub-classes at the edges,  rather than by modifying the meaning of more central ontological entities.  If this is done, third party applications, able to understand the public ontology but having no knowledge of these extensions, will still be able to understand correctly the core metadata created using the extended ontology.  We exemplify this from our recent work with the CIDOC Conceptual Reference Model (CIDOC CRM; http://cidoc.ics.forth.gr/), an ontology developed for the museum community specifically to describe cultural heritage information.

In the CLAROS Project (http://www.clarosnet.org/; http://imageweb.zoo.ox.ac.uk/pub/2009/publications/Kurtz_Parker_Shotton_et_al-IEEE_CLAROS_paper.pdf), we have used the CIDOC CRM to provide a common framework onto which to map the data models of a number of academic resources describing classical art objects, and have then created a data web integrating information from these resources into a single user interface.  For this, we have employed the RDF version of CIDOC CRM from Erlangen University (http://purl.org/NET/crm-owl).

Dating in antiquity is not an exact science, and we needed to capture this uncertainty in our CLAROS metadata.  For this, we extended CIDOC CRM to permit us to record the estimated bounds of the inaccuracy relating to dates of creation of classical art objects.  We could have done this by introducing new properties to relate existing CIDOC CRM classes.  However, the problem with this approach is that baseline applications (without appropriate inference support) won’t know about these new properties, so the nature of the relationships that they might understand is lost, and sections of the knowledge graph might become disconnected for such applications. The alternative approach we chose to adopt was just to add new nodes at the edge of the CRM graph, and then add new properties relating these to instances of existing classes.  This left the original classes connected only by original properties. Furthermore, we supplemented our new date range metadata with more general statements that do not depend on knowledge of these extensions, accessible to ‘standard’ CIDOC CRM applications.

Specifically, we created the properties claros:not_before and claros:not_after, which have the CRM class E61.Time_Primitive as their domain, and used these, together with a ‘label’ time range statement for display use by applications that understand only the non-extended CRM terms, thus (in RDF N3 notation):

. . .

[ rdf:type crm:E61.Time_Primitive ;
  claros:not_before "-525"^^xsd:gYear ;
  claros:not_after "-475"^^xsd:gYear ;
  rdfs:label "about 500 BC" ] .

The need for better tools

The ontology writing and editing tools Protégé version 4 (http://protege.stanford.edu/) and OBO-Edit version 2 (http://oboedit.org/) are powerful and sophisticated – see the articles on Protégé & Protégé-OWL by Alan Rector and on OBO Format by David Osumi-Sutherland.  However, there is still a need for good tools in four areas to assist those creating, integrating and using ontologies:

  • Tools to assist in the early stages of ontology development, using text mining and natural language processing to extract relevant terms from the scientific literature that can then be organized into ontology modules.
  • Tools that allow vocabulary designers to capture, refine and ultimately formalize their intuitions without being forced to deal with distracting logical details early in the design process.
  • Tools to permit the comparison of different ontologies, and of different versions of a single ontology under development, and to visualize these differences in a readily comprehensible manner.
  • Tools to assist in the location of relevant third-party ontologies for integration and use.  The National Center for Biomedical Ontology’s BioPortal and the European Bioinformatics Institute’s Ontology Lookup Service work well for the biological domain, but there are no equivalents for more generic ontologies such as the Information Artifact Ontology, the Citation Typing Ontology, FRBR, FOAF and CIDOC CRM.  One just has to learn about these by reading, attending meetings, word of mouth and experience.  This makes entry into this field extremely difficult for newcomers.

Conclusion

Applying the principles outlined here should make it easier to create an ontology de novo, or to take an existing ‘tangled’ ontology, normalize it, and ‘bring it to market’.  Such public ontologies, being simpler and more loosely defined, should gain a far wider degree of consensus and obtain greater usage.  The mechanics of this ontology normalization process are described in a separate article entitled Automatic Maintenance of Multiple Inheritance Ontologies by Mikel Egaña Aranguren, while the construction of application ontologies is described in the related article Community Driven Ontology Development by James Malone.

Acknowledgements

This paper is a restatement and expansion of ideas first employed during the Second Animal Behavior Metadata Workshop held at Cornell University in September 2005, posted here.

This paper is an open access work distributed under the terms of the Creative Commons Attribution License 3.0 (http://creativecommons.org/licenses/by/3.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided that the original author and source are attributed.

The paper and its publication environment form part of the work of the Ontogenesis Network, supported by EPSRC grant EP/E021352/1.


What is an ontology?

Robert Stevens*, Alan Rector* and Duncan Hull†

* School of Computer Science, The University of Manchester, Oxford Road, Manchester, UK
† EMBL Outstation – Hinxton, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK

Defining ontology

In OntoGenesis we intend to provide biologists and bioinformaticians with the means to understand ontologies within the context of biological data: their nature; their use; how they are built; and some of the bio-ontologies that exist. All of this is based upon knowing what an ontology is. This can then lead on to the motivation for their use in biology; how they are used; and so on. The definition of ontology is disputed, and this is confounded by computer scientists having re-used and re-defined a discipline of philosophy. The definition here will not suit a lot of people and will upset many (especially in its use of the word “concept”); we make no apology for this situation, only noting that the argument can take up resources better used in helping biologists describe and use their data more effectively.

In informatics and computer science, an ontology is a representation of the shared background knowledge for a community. Very broadly, it is a model of the common entities that need to be understood in order for some group of software systems and their users to function and communicate at the level required for a set of tasks. In doing so, an ontology provides the intended meaning of a formal vocabulary used to describe a certain conceptualisation of objects in a domain of interest. An ontology describes the categories of objects described in a body of data, the relationships between those objects, and the relationships between those categories. In doing so, an ontology describes those objects and sometimes defines what is needed to be known in order to recognise one of those objects. An ontology should be distinguished from thesauri, classification schemes and other simple knowledge organisation systems. By controlling the labels given to the categories in an ontology, a controlled vocabulary can be delivered, though an ontology is not a controlled vocabulary. When represented as a set of logical axioms with a strict semantics, an ontology can be used to make inferences about the objects that it describes, and consequently provides a means to symbolically manipulate knowledge.

In philosophy, ontology is a term with its origins with Aristotle in his writings on Metaphysics (IV.1). In very general terms, it is a branch of philosophy concerned with that which exists; that is, a description of the things in the world. Philosophers in this field tend to be concerned with understanding what it means to be a particular thing in the world; that is, the nature of the entity. The goal is to achieve a complete and true account of reality. Computer scientists have taken the term and somewhat re-defined it, removing the more philosophical aspects and concentrating upon the notion of a shared understanding or specification of the concepts of interest in a domain of information that can be used by both computers and humans to describe and process that information. The goal with a computer science ontology is to make knowledge of a domain computationally useful. There is less concern with a true account of reality, as it is information that is being processed, not reality. The definition used here (and any other definition, for that matter) is contentious, and many will disagree with it. Within the bio-ontology community there are those that take a much more philosophical stance on ontology. The OBO Foundry, for instance, takes a much more philosophical view.

Putting the string “define:ontology” into the Google search engine finds some twenty or so definitions of ontology. They all cluster around either a philosophical or a computer science definition of ontology. This is presumably the root of the jibe that ontology is all about definitions, but there is no definition of ontology. So, we should really distinguish between philosophical ontology and computer science ontology and remove some of the dispute. Tom Gruber has one of the most widely cited definitions of ontology in computer science, though conceptual models of various types have been built within computer science for decades. Tom Gruber’s definition is:

“In the context of knowledge sharing, the term ontology means a specification of a conceptualisation. That is, an ontology is a description (like a formal specification of a program) of the concepts and relationships that can exist for an agent or a community of agents. This definition is consistent with the usage of ontology as set-of-concept-definitions, but more general. And it is certainly a different sense of the word than its use in philosophy.” DOI:10.1006/knac.1993.1008 DOI:10.1006/ijhc.1995.1081

The most noteworthy point is that Gruber states that his definition of ontology is not “ontology in the philosophical sense”. Nevertheless, computer science ontology is still informed by the philosophical, but the goals for their creation and use are different.

An important part of any ontology is the individuals or objects. There are trees, flowers, the sky, stones, animals, etc. As well as these material objects, there are also immaterial objects, such as ideas, spaces, representations of real things, etc. In the world of molecular biology and beyond, we wish to understand the nature of, distinctions between, and interactions of objects such as: small molecules and macromolecules; their functionalities; the cells in which they are made and work; together with the pieces of those cells; the tissues these cells aggregate to form; etc. We do this through data collected about these phenomena, and consequently we wish to describe the objects described in those data.

As human beings, we put these objects into categories or classes. These categories are a description of that which is described in a body of data. The categories themselves are a human conception. We live in a world of objects, but the categories into which humans put them are merely a way of describing the world; they do not themselves exist; they are a conceptualisation. The categories in an ontology are a representation of these concepts. The drive to categorise is not restricted to scientists; all human beings seem to indulge in the activity. If a community agrees upon which categories of objects exist in the world, then a shared understanding has been created.

In order to communicate about these categories, as we have already seen, we need to give them labels. A collection of labels for the categories of interest forms a vocabulary or lexicon. Human beings can give multiple labels to each of these categories. This habit of giving multiple labels to the same category (synonymy), and the converse of giving the same label to different categories (polysemy), leads to grave problems when trying to use the descriptions of objects in biological data resources. This issue is one of the most powerful motivations for the use of ontologies within bioinformatics.

As well as agreeing on the categories in which we will place the objects of interest described in our data, we can also agree upon what the labels are for these categories. This has obvious advantages for communications – knowing to which category of objects a particular label has been given. This is an essential part of the shared understanding. By agreeing upon these labels and committing to their use, a community creates a controlled vocabulary.

The objects of these categories can be related to each other. When each and every member of one category or class is also a member of another category or class, then the former is subsumed by the latter, or forms a subclass of the superclass. This subclass–superclass relationship between classes is variously known as the “is-a” DOI:10.1109/MC.1983.1654194, subsumption, or taxonomic relationship. There can be more than one subclass for any given class. If every single kind of subclass is known, then the description is exhaustive or covered. Also, any pair of subclasses may overlap in their extent, that is, share some objects, or they may be mutually exclusive, in which case they are said to be disjoint. Both philosophical and ontology-engineering best practice often advocate keeping sibling classes pairwise disjoint.
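These notions are easy to make concrete. The following sketch – a toy hierarchy with invented terms, not drawn from any published ontology – treats is-a as the transitive closure of asserted parent links, and marks one pair of siblings as both disjoint and covering:

```python
# Toy is-a hierarchy: each class names its single asserted parent
# (None marks the root). All terms are invented for illustration.
is_a = {
    "animal": None,
    "vertebrate": "animal",
    "invertebrate": "animal",
    "frog": "vertebrate",
}

def subsumed_by(sub, sup):
    """True if every member of `sub` must also be a member of `sup`,
    i.e. `sup` is reachable from `sub` along the is-a links."""
    while sub is not None:
        if sub == sup:
            return True
        sub = is_a[sub]
    return False

assert subsumed_by("frog", "animal")  # follows frog -> vertebrate -> animal

# "vertebrate" and "invertebrate" are disjoint (no shared members) and
# together cover "animal" (every animal is one or the other), so the
# description of animal's subclasses is exhaustive as well as disjoint.
disjoint_and_covering = ("vertebrate", "invertebrate")
```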

As well as the is-a relationship, objects can be related to each other by many other kinds of relationship DOI:10.1186/gb-2005-6-5-r46. One of the most frequently used is the partOf relationship, which describes how objects are parts of, components of, or regions of other objects. Other relationships describe how one object developsInto, or is transformed into, another object whilst retaining its identity (such as tadpole to frog). The derivesFrom relationship describes how one object changes into another object with a change of identity. Another relationship, participatesIn, describes how a discrete object can take part in a process object.
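One simple way to picture such relationships – purely illustratively, with invented terms and relation names echoing those above – is as a set of (subject, relation, object) triples:

```python
# Illustrative triples; terms and relation names are invented here,
# echoing the relationships described in the text.
triples = [
    ("nucleus", "partOf", "cell"),                    # parthood
    ("tadpole", "developsInto", "frog"),              # identity retained
    ("daughter cell", "derivesFrom", "mother cell"),  # identity changes
    ("enzyme", "participatesIn", "reaction"),         # object in a process
]

def related(subject, relation):
    """All objects that `subject` stands in `relation` to."""
    return [o for s, r, o in triples if s == subject and r == relation]

assert related("tadpole", "developsInto") == ["frog"]
```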

These relationships, particularly the is-a relationship, give structure to a description of a world of objects. The relationships, like the categories whose instances they relate, also have labels. Relationship labels are another part of a vocabulary. The structured description of objects thus gives a structured controlled vocabulary.

So far, we have only described relationships that make some statement about the objects being described. It is also possible to make statements about the categories or classes themselves. When describing an element such as “Helium”, for example, statements about the discovery date or industrial uses are about the category or class, not about the objects in the class. Each instance of a “Helium” object was not discovered in 1903; most helium atoms existed prior to that date, but humans discovered and labelled the category at that date.
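A small sketch can make the distinction concrete (the attribute names are invented, and the date simply follows the example above; Python class attributes are only an approximation of class-level statements):

```python
class Helium:
    # Class-level statements: true of the category, not of any one atom.
    # (The date follows the example in the text above.)
    discovery_year = 1903
    symbol = "He"

    def __init__(self, neutrons):
        # Instance-level state: true of this individual atom.
        self.neutrons = neutrons

atom = Helium(neutrons=2)     # one helium-4 atom
print(Helium.discovery_year)  # a statement about the category
print(atom.neutrons)          # a statement about this particular object
```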

Ideally, we wish to know how to recognise members of these categories. That is, we define what it is to be a member of a category. When describing the relationships held by an object in a category, we place inclusion conditions upon those instances, also called category membership criteria. We divide these conditions into two sorts:

  1. Necessary Conditions: These are conditions that an object must fulfil, but fulfilling that condition is not enough to recognise an object as being a member of a particular category.
  2. Necessary and Sufficient Conditions: These are conditions that an object must fulfil and are also sufficient to recognise an object to be a member of a particular category.

For example, an ontology of small molecules such as Chemical Entities of Biological Interest (ChEBI) DOI:10.1093/nar/gkm791 has a definition of alcohol, and there are several ways of defining what this means. Each and every organic molecule of alcohol must have a hydroxyl group. That an organic molecule has a hydroxyl substituent is not, however, enough to make that molecule an alcohol. If, however, an organic molecule has a saturated backbone and a hydroxyl substituent on that backbone, that is enough to recognise it as an alcohol (at least according to the IUPAC “Gold Book”).
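The two sorts of condition can be sketched directly, representing a molecule – entirely schematically – as a set of feature strings:

```python
# Schematic molecules: sets of invented feature strings.
ethanol = {"saturated backbone", "hydroxyl substituent"}
phenol = {"aromatic ring", "hydroxyl substituent"}

def has_hydroxyl(molecule):
    # Necessary condition: every alcohol bears a hydroxyl group,
    # but bearing one is not on its own enough.
    return "hydroxyl substituent" in molecule

def is_alcohol(molecule):
    # Necessary and sufficient: a hydroxyl substituent on a saturated
    # backbone suffices to recognise an alcohol, on the text's reading
    # of the IUPAC Gold Book.
    return has_hydroxyl(molecule) and "saturated backbone" in molecule

assert is_alcohol(ethanol)
# Phenol meets the necessary condition but is not an alcohol:
assert has_hydroxyl(phenol) and not is_alcohol(phenol)
```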

In making such definitions, an ontology makes distinctions. A formal ontology makes these distinctions rigorously. Broad ontological distinctions include that between Continuant and Occurrent; that is, between entities (things we can put in our hands) and processes. Continuants take part in processes, and processes have participants that are continuants. Another distinction is between Dependent and Independent objects. Some objects depend for their existence on another object that bears them: a car's blue colour, for example, depends on the car that bears it, while the car itself is independent. Continuants can be further sub-categorised into material and immaterial continuants, such as the skull and the cavity in the skull. Making such ontological distinctions primarily helps in choosing the relationships between the objects being described, as well as maintaining some level of consistency.
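As a final sketch, the upper-level distinctions named above can be laid out as a small hierarchy, with plain Python classes standing in for categories (the participation pair at the end is an invented example):

```python
class Entity:
    pass

class Continuant(Entity):    # things: wholly present at any moment
    pass

class Occurrent(Entity):     # processes: unfold over time
    pass

class MaterialContinuant(Continuant):    # e.g. the skull
    pass

class ImmaterialContinuant(Continuant):  # e.g. the cavity in the skull
    pass

# The distinctions guide modelling choices: participation, for instance,
# runs from a continuant to an occurrent, never the other way round.
participates_in = [("enzyme", "catalysis")]  # invented continuant/process pair
```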

Capturing such descriptions, including the definitions, forms an ontology. Representing these descriptions as a set of logical axioms with a strict semantics enables them to be reliably interpreted by both humans and computers. Forming a consensus on which categories should be used to describe a domain, and agreeing on the definitions by which objects in those categories are recognised, enables that knowledge to be shared.

The life sciences, unlike physics, have not yet reduced their laws and principles to mathematical formulae. It is not yet possible, as it is with physical observations, to take a biological observation, apply some equations, and so determine the nature of that observation and make predictions. Biologists record many facts about entities and from those facts make inferences. These facts are the knowledge about the domain of biology, and this knowledge is held in the many databases and literature resources used in biology.

Due to human nature, the autonomous way in which these resources develop, the time span over which they develop, etc., the categories into which biologists put their objects and the labels used to describe those categories are highly heterogeneous. This heterogeneity makes the knowledge component of biological resources very difficult to use: deep knowledge is required of human users, and the scale and complexity of these data make the task difficult. Computational use of this knowledge component is more difficult still, exacerbated by the overwhelmingly natural-language representation of these facts.

In molecular biology, we are used to having nucleic acid and protein sequence data that are computationally amenable. There are good tools that inform a biologist when two sequences are similar. Any evolutionary inference based on that similarity, however, is based upon knowledge about the characterised sequence. Use of this knowledge has depended on humans, and reconciliation of all the differing labels and conceptualisations used in representing that knowledge is necessary. For example, in post-genomic biology, it is possible to compare the sequences of the genome and the proteins it encodes, but not to compare the functionality of those gene products.

There is, therefore, a need for a common understanding of the categories of objects described in life sciences data and the labels used for those categories. In response to this need, biologists have begun to create ontologies that describe the biological world. The initial move came from computer scientists who used ontologies to create knowledge bases that described the domain with high fidelity; an example is EcoCyc http://view.ncbi.nlm.nih.gov/pubmed/8594595. Ontologies were also used in projects such as TAMBIS DOI:10.1147/sj.402.0532 to describe molecular biology and bioinformatics, to reconcile diverse information sources, and to allow the creation of rich queries over those resources. The explosion in activity came, however, in the post-genomic era with the advent of the Gene Ontology (GO) DOI:10.1038/75556. The GO describes the major functional attributes of gene products – molecular function, biological process and cellular component. Some forty-plus genomic resources now use GO to describe these aspects of the gene products of their respective organisms. Similarly, the Sequence Ontology describes sequence features, and PATO (the Phenotype and Trait Ontology) describes the qualities needed to describe an organism's phenotype. All these and more are part of the Open Biomedical Ontologies (OBO) project DOI:10.1038/nbt1346.

Conclusion

In conclusion, we can say that there is a need to describe the entities existing within data generated by biologists so that they know what they are dealing with. This entails being able to define the categories of biological entities represented within those data. As well as describing the biological entities, we also need to describe the science by which they have been produced. This has become a large effort within the bioinformatics community. It has also proved a difficult task, and much effort can be spent in attempting to find the true nature of entities in biology and science. It should be remembered, however, that the goal of the bio-ontology effort is to allow biologists to use and analyse their data; building an ontology is not a goal in itself.

References

This text is adapted and updated from Ontologies in Biology by Robert Stevens. A numbered list of references will be generated from the DOIs above in later drafts of this article, after peer review.

Acknowledgements

This paper is an open access work distributed under the terms of the Creative Commons Attribution License 3.0, which permits unrestricted use, distribution, and reproduction in any medium, provided that the original author and source are attributed.

The paper and its publication environment form part of the work of the Ontogenesis Network, supported by EPSRC grant EP/E021352/1.

Ontogenesis: Who’s here? http://ontogenesis.knowledgeblog.org/283 http://ontogenesis.knowledgeblog.org/283#comments Fri, 22 Jan 2010 09:04:10 +0000 http://ontogenesis.knowledgeblog.org/?p=283

Who’s here? The following is an alphabetical list of people currently attending the Ontogenesis Blogging a Book Experiment.

  1. Sean Bechhofer, University of Manchester
  2. Christopher Brewster, Aston Business School
  3. Michel Dumontier, Carleton University
  4. Mikel Egana-Aranguren
  5. Frank Gibson
  6. Matthew Horridge, University of Manchester
  7. Duncan Hull, EBI
  8. Simon Jupp, University of Manchester
  9. Allyson Lister, Newcastle University
  10. Phillip Lord, Newcastle University
  11. James Malone, EBI
  12. David Osumi-Sutherland, University of Cambridge
  13. Helen Parkinson, EBI
  14. Alan Rector, University of Manchester
  15. Ulrike Sattler, University of Manchester
  16. David Shotton, University of Oxford
  17. Robert Stevens, University of Manchester