Posted on November 19, 2010 in Under Review

Naming conventions for robust term labelling

Abstract

This article presents guidelines for creating human-intelligible yet computationally robust labels for concepts and relations in ontologies.

When a new concept or relation is to be created in an ontology, one first needs to agree on an intelligible name or label for it. Care needs to be taken here, because bad or inconsistent naming conventions can significantly impair readability and navigability within an ontology class hierarchy. When developers with heterogeneous backgrounds create labels according to different naming principles, the recognition velocity of those labels (how quickly a term's meaning can be grasped from its name) decreases.

Heterogeneity in labelling schemes also hinders ontology alignment, matching and integration efforts that rely on lexical, string-based comparison of labels.

Nomenclature rules supported by onomastic principles also provide guidance to ontology creators and help to avoid flaws and inaccuracies when editing and especially when interlinking ontologies. Common naming conventions will ultimately help consumers of ontologies to understand more readily what meanings the authors intended when the ontologies are used in annotated bodies of data and SPARQL queries.

The Problem

To illustrate the problem of heterogeneous labelling, here are possible names for one and the same class: ‘G-Protein coupled receptor’ vs ‘7_transmembrane_domain_receptor’ vs ‘GPC_receptors’ vs ‘GPCR_class’.

Looking only at the resources under the Open Biomedical Ontology (OBO) umbrella, one finds diverse typographical conventions for naming classes, e.g. “MyClass”, “My Class”, “My-Class”, “my class”, “my cls”, “our classes”, “class_set”, “my class type” and “my_very_specific_class_that_was_created_sunday-morning”. Sometimes this heterogeneity can be seen even within a single ontology. The lack of lexical guidance (what convention to use) has engendered a wide variety of incoherent ad hoc conventions, which in turn causes problems for the use and administration of single and interrelated terminological artifacts.
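To make the cost of this heterogeneity concrete, the following minimal Python sketch folds purely typographic variants of a label onto one string. The function name normalize_label and its rules are assumptions made for illustration; they are not taken from any OBO tool.

```python
import re


def normalize_label(label: str) -> str:
    """Reduce typographic variance (illustrative rules only)."""
    # Insert a space at lower-to-upper case boundaries: "MyClass" -> "My Class".
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", label)
    # Treat underscores and hyphens as word separators.
    spaced = re.sub(r"[_\-]+", " ", spaced)
    # Collapse repeated whitespace and lowercase everything.
    return " ".join(spaced.split()).lower()


typographic_variants = ["MyClass", "My Class", "My-Class", "my class"]
print({normalize_label(v) for v in typographic_variants})  # {'my class'}

# Abbreviated or pluralised variants are NOT reconciled by string normalisation.
print(normalize_label("my cls"))       # my cls
print(normalize_label("our classes"))  # our classes
```

Normalisation of this kind only repairs separators and case. Variants such as “my cls” or “our classes” differ in wording, which is why conventions are needed at naming time rather than clean-up afterwards.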

There are some naming conventions in general use within the biomedical domain, e.g. the conventions by ISO, BioPAX, HUGO or IUPAC. Unfortunately, most of these refer to conventions in natural language documents or programming language environments, or are restricted to very specific application domains. Where available for ontology engineering, existing naming conventions

  • address only a subset of the issues to be tackled, e.g. ISO standards that deal with good practices in ontology development but do not address all requirements in a sound manner (the problem of coverage)
  • are often conflicting, because they contain language- and implementation-dependent recommendations (the problem of implementation-specificity)
  • are not backed by the major ontology engineering tools (the problem of support and integration)
  • are not freely available or are commercialised (the problem of accessibility)
  • are dispersed or embedded in documents addressing general design principles, e.g. the BioPAX manual, and are therefore not visible to a majority of potential users (the problem of visibility)
  • are not agreed upon and accepted by a wider community (the problem of community acceptance)

For this reason a body has been founded under the OBO Foundry which collects and publishes naming conventions for ontology engineering (see below).

Different types of names

Representational units like classes and relations can be accompanied by one or more synonymous names (labels) of the following different name categories:

  • Display name: Any type of name that is chosen to be displayed in the hierarchy. This one is called ‘browser key’ in Protégé.
  • Formal name: Where the form of a name is controlled by a set of explicit guidelines and rules, one refers to it as a formal name.
  • Editor-preferred name: A formal name used by the ontology’s developers and adhering to their guidelines and naming conventions. Editor-preferred names are primarily constructed to aid those building and manipulating an ontology and should therefore be specified as the display name during ontology editing. The editor-preferred name for the Foundational Model of Anatomy (FMA) class FMA:3862 is ‘Anterior interventricular branch of left coronary artery’.
  • User-preferred name: An informal name chosen to meet the expectations of an end user community. Usually this would be the name most frequently found in the literature of the relevant domains, which can inter alia serve as an intuitive, queryable attribute for end users searching for data sets in a repository. The user-preferred name from FMA for FMA:3862 is ‘Left anterior descending branch of left coronary artery’.
  • Short name: A very short name that is useful when displaying large, dense graphs (whose nodes are classes and whose edges are relations). A short name from FMA for FMA:3862 is the acronym ‘LAD’.

Further types of names can be distinguished, which are not necessarily disjoint, such as ‘lexical variant’ (including abbreviations and acronyms), ‘phonetic variant’ and ‘foreign language translation’. These must all be exact synonyms.

Which types of names to capture in which representational unit (RU)

The limited granularity in naming types, and the lack of clear guidance on which kind of name the rdf:ID and the rdfs:label (in OWL) and the term name (in OBO) should contain, have contributed significantly to the current heterogeneity in naming schemes between ontologies. The minimum recommendation is to assign an editor-preferred name, to which explicit naming conventions should be applied, and one or more user-preferred names, which are less controlled and chosen to match end user expectations and usage frequency. Having conventions for default names (using the editor-preferred name as display name) does not restrict the use of less formal or colloquial names, which can and should still be captured as synonyms.
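The minimum recommendation can be pictured as a small record per term. The Python sketch below is only an illustration under stated assumptions: the class TermNames and its field and method names are hypothetical and do not correspond to an OBO or OWL data model; the example values are the FMA:3862 names quoted above.

```python
from dataclasses import dataclass, field


@dataclass
class TermNames:
    """Hypothetical container for the name categories of one term."""
    identifier: str
    editor_preferred: str                                      # formal, convention-controlled
    user_preferred: list[str] = field(default_factory=list)   # informal, usage-driven
    short_names: list[str] = field(default_factory=list)      # e.g. acronyms for dense graphs

    def display_name(self) -> str:
        # During editing, the editor-preferred name serves as display name.
        return self.editor_preferred

    def meets_minimum(self) -> bool:
        # Minimum recommendation: one editor-preferred name and
        # at least one user-preferred name.
        return bool(self.editor_preferred.strip()) and len(self.user_preferred) >= 1


lad = TermNames(
    identifier="FMA:3862",
    editor_preferred="Anterior interventricular branch of left coronary artery",
    user_preferred=["Left anterior descending branch of left coronary artery"],
    short_names=["LAD"],
)
print(lad.display_name())
print(lad.meets_minimum())  # True
```

Keeping the categories in separate fields means that colloquial names remain available as synonyms without competing with the controlled editor-preferred name.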

Naming Conventions

Efforts to create explicit typographic, syntactic and semantic concept labelling conventions have been carried out in isolation by most terminology developers. However, where naming conventions have been developed, widespread application has been hampered by several factors, most notably domain specificity, document inaccessibility and format dependency. To overcome these drawbacks, the OBO Foundry naming conventions effort has reviewed and compared existing naming conventions, and has distilled and published universally valid conventions applicable to the OBO and OWL formats. The development of the current set of consensus-based, reusable naming conventions was also informed by a survey of sixty ontology developers; the set can be accessed at
http://obofoundry.org/wiki/index.php/Naming
This is the categorisation of naming conventions as currently applied within the Foundry set:

1. Be clear and unambiguous

1.1 Use explicit and concise names

1.2 Use context independent names

1.3 Avoid confusing and overloaded taboo words

1.4 Avoid encoding administrative metadata in names

2. Be univocous

2.1 Use univocous names and avoid homonyms

2.2 Avoid conjunctions

2.3 Prefer singular nominal form

2.4 Use positive names

2.5 Avoid catch-all terms

3. Reduce string variance

3.1 Recycle strings

3.2 Use genus-differentia style names

3.3 Use space as word separators

3.4 Expand abbreviations and acronyms

3.5 Expand special symbols to words

4. Align Typography

4.1 Prefer lower case beginnings

4.2 Avoid character formatting

For the full set including definitions and examples please refer to the OBO Foundry pages.
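A few of the conventions listed above can be checked mechanically. The following Python sketch is a rough heuristic written for illustration; the function check_label and its rules are assumptions, it covers only conventions 2.2, 3.3, 3.5 and 4.1, and it is not the Foundry's own tooling.

```python
import re


def check_label(label: str) -> list[str]:
    """Return the listed conventions that a label appears to violate."""
    problems = []
    words = re.split(r"[\s_\-]+", label.strip())

    # 2.2 Avoid conjunctions.
    if any(w.lower() in {"and", "or"} for w in words):
        problems.append("2.2 avoid conjunctions")

    # 3.3 Use space as word separator (flags underscores and CamelCase).
    if "_" in label or re.search(r"[a-z][A-Z]", label):
        problems.append("3.3 use space as word separator")

    # 3.5 Expand special symbols to words (small illustrative symbol set).
    if re.search(r"[&%+@]", label):
        problems.append("3.5 expand special symbols to words")

    # 4.1 Prefer lower case beginnings; all-caps first words (acronyms) are let through.
    first = words[0] if words else ""
    if first[:1].isupper() and not first.isupper():
        problems.append("4.1 prefer lower case beginnings")

    return problems


print(check_label("My_Class"))
print(check_label("cells and tissues"))
print(check_label("membrane transport protein"))  # []
```

Semantic conventions such as 1.2 (context independent names) or 2.5 (avoid catch-all terms) cannot be verified by string inspection alone and still require human review.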

Word Separators

The CamelCase and underscore conventions are both often used, but CamelCase cannot capture class names like “Sample_pH”, which would then read “SamplePH”. Furthermore, CamelCase can result in ugly names when uppercase abbreviations are part of the name, e.g. “JEOLNMRInstrument”. For this reason we recommend the underscore convention, which leads to a very readable “JEOL_NMR_Instrument”. Whether a space separator can be used depends on the implementation: e.g. the default Protégé Frames class name field :NAME maps to the rdf:ID field, which does not allow spaces and expects an XML name (NCName), because IDs must be usable as parts of URIs. Which separator convention is used here depends on the way the parser for the representation language copes with strings for the representational unit in question.
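The loss of information under CamelCase can be shown in a few lines of Python. This is a minimal sketch; the helper names to_camel_case and split_camel_case are hypothetical, and the split rule is a deliberately naive one.

```python
import re


def to_camel_case(name: str) -> str:
    """Join underscore-separated words, upper-casing each word's first letter."""
    return "".join(w[:1].upper() + w[1:] for w in name.split("_"))


def split_camel_case(name: str) -> list[str]:
    """Naively split at lower-to-upper case boundaries."""
    return re.sub(r"(?<=[a-z])(?=[A-Z])", " ", name).split()


print(to_camel_case("Sample_pH"))            # SamplePH
print(split_camel_case("SamplePH"))          # ['Sample', 'PH']

print(to_camel_case("JEOL_NMR_Instrument"))  # JEOLNMRInstrument
print(split_camel_case("JEOLNMRInstrument")) # ['JEOLNMRInstrument']

# With underscores, the word boundaries survive a round trip.
print("JEOL_NMR_Instrument".split("_"))      # ['JEOL', 'NMR', 'Instrument']
```

In the first case the casing of “pH” is lost; in the second the three words can no longer be separated without domain knowledge. The underscore form keeps both.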

Exaggerating formality of labels

A convention for a term name should not be a full representation language in itself. Some naming schemes, especially very domain-specific ones, list rather detailed conventions which are highly non-redundant and therefore error prone. The HUGO gene symbol naming conventions state the following rule for the naming of mouse phenotypes:

“If a new allelic mutation of a gene known only by a mutant phenotype is caused by a transgenic insertion, the symbol of this mutation should use the symbol of the transgene as superscript.” This leads to highly complex and unreadable gene symbols: e.g. a mutation of abnormal wobbly gait caused by a transgene, mouse line 832, produced in the laboratory of Paul Wong, results in the gene symbol awgTg(GBtslenv)832Pkw, which is not easy to read and probably takes a majority of addressees longer to decipher than the textual description itself.

There may be valid reasons in particular circumstances to ignore a particular convention, but the full implications must be understood and carefully weighed before choosing a different course. We believe that alternatives can and should be formulated, and we hope these will be part of a ‘minimum information about ontologies’ (MIAO) yet to be established. Only if these conventions are made explicit can they easily be compared with each other.

Benefits and applications

Agreed-upon naming conventions are not simply a matter of aesthetics; they are a necessity in large collaborative efforts, where groups develop orthogonal but necessarily integrable parts of ontologies in a modular manner. For example, the MSI ontology makes use of upper level ontologies, such as BFO and OBI, by importing their classes and then subclassing and constraining them. The resulting set of orthogonal, inter-dependent ontological modules would look much more harmonized if they all adhered to common naming conventions and design principles.

Having a single unified naming convention will lead to a normalized appearance of class names within the representational artefacts and throughout ontology libraries by relying on fairly simple rules. Furthermore, naming conventions would facilitate access to ontologies through meta-tools, e.g. the PROMPT-related ontology comparison, alignment and merging tools currently being developed by the NCBO BioPortal.

The application of common naming guidelines simplifies stand-alone ontology development and helps in subsequent administration tasks. It also simplifies ontology networking (e.g. importing and using classes from external ontologies or imported ontology modules) and increases the accessibility and exportability of terms, facilitating re-use and reducing redundant development. By increasing the robustness of ontology class names, naming conventions

  • support the manual and automated integration (i.e., comparison, orthogonality-checking, alignment and mapping) of terminological artifacts
  • facilitate access to ontologies through meta-tools such as the NCBO BioPortal by reducing the diversity with which these tools have to deal, thus reducing the burden on tool and ontology developers alike
  • increase the robustness of context-based text mining for automatic term recognition and text annotation.

Border cases

Aside from class naming conventions, such guidelines would need to be created for further representational units, such as instances and identifiers. For example, some groups use the identifier convention [group prefix] + [underscore] + [unique number] (e.g., ‘OBI_0000016’), whereas others simply use a ‘meaningful string’ (e.g., ‘IndependentContinuant’). Although semantics-free IDs are good practice, we would not recommend their use as long as SPARQL tools do not support the display of labels other than what is stored in the ID, as such IDs render all source code and SPARQL queries unreadable.
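The two identifier styles can be told apart with a simple pattern. The Python sketch below assumes that a group prefix consists of letters only and that the unique number consists of digits only; the names ID_PATTERN and parse_identifier are hypothetical.

```python
import re

# [group prefix] + [underscore] + [unique number]; prefix format is an assumption.
ID_PATTERN = re.compile(r"^(?P<prefix>[A-Za-z]+)_(?P<number>\d+)$")


def parse_identifier(identifier: str):
    """Return (prefix, number) for a prefix_number ID, or None for other styles."""
    match = ID_PATTERN.match(identifier)
    if match is None:
        return None
    # The number is kept as a string so that zero padding is preserved.
    return match.group("prefix"), match.group("number")


print(parse_identifier("OBI_0000016"))            # ('OBI', '0000016')
print(parse_identifier("IndependentContinuant"))  # None
```

A check like this can be used to report which convention an ontology follows, or to detect files that mix both styles.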

A formalism is needed for naming and marking administrative ‘helper’ classes and metadata bins within ontologies. Until recently, non-ontological classes in OBI, such as ‘unclassified’ (OBI_200067), ‘to_be_fixed’ (OBI_334), ‘ChEBI_objects’ (OBI_336), ‘PATO_quality’ (OBI_302), ‘collected_relations’ (OBI_400132) could be found side-by-side with domain-level classes. These are now marked as helper classes by adding an underscore as prefix.

Branch, module, file and namespace naming conventions should be investigated. This is also indicated by the recurring discussions on ontology naming conflicts on the OBO discussion mailing list.

Last but not least, besides our universal and domain-independent conventions, specialized domain-dependent ones should be created for certain ontological classes of high interest, i.e. named entities.

Author:

Daniel Schober, Institute for Medical Biometry and Medical Informatics, Freiburg University Medical Center, Germany

schober@imbi.uni-freiburg.de
