On April 20, 2010 by Robert Stevens and Duncan Hull in Under Review

There is no “other” category: When to avoid “miscellaneous”

When describing a set of subclasses that you know is not exhaustive, it is often tempting to add a class labelled “Other” or “Miscellaneous”. This is generally considered a bad idea, for several reasons.

Take a class such as “Molecular function” in the Gene Ontology (GO); it must be noted that the GO is not guilty of this ontological crime. We might add a series of subclasses describing, for instance, enzyme functions, motility functions, binding functions, and so on. The temptation is then to add one more class called “Miscellaneous”, “Function unknown” or “Other”, into which everything that does not fit a specific subclass of molecular function is placed.

This is done because an ontology’s authors often know that the set of sibling terms is incomplete and feel that this needs to be accommodated explicitly. It can also seem useful to be able to say explicitly that (in this example) the molecular function is unknown. These are different cases: either I know the function but there is no class with which to specify it, or the function really is unknown.

One problem with “Unknown”, “Other” or “Miscellaneous” is that the types of object in that category keep changing. As the unknown becomes known, subsets of that category disappear. So we have no stable idea of what “unknown” means.
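The drift described above can be sketched in a few lines of Python. This is only an illustration: the gene product names, their functions and the helper `members_of_other` are all hypothetical, and the dictionary stands in for facts about the world that an ontology does not itself contain.

```python
# Hypothetical gene products and the function each one really has.
ACTUAL_FUNCTION = {
    "geneA": "enzyme",
    "geneB": "binding",
    "geneC": "transporter",
    "geneD": "motility",
}


def members_of_other(named_subclasses):
    """Return the gene products that would be filed under "Other",
    i.e. those whose function has no named subclass yet."""
    return {
        gene
        for gene, function in ACTUAL_FUNCTION.items()
        if function not in named_subclasses
    }


# Version 1 of the ontology names two subclasses.
other_v1 = members_of_other({"enzyme", "binding"})

# Version 2 adds "transporter"; nothing about the genes has changed.
other_v2 = members_of_other({"enzyme", "binding", "transporter"})

print(sorted(other_v1))
print(sorted(other_v2))
```

The same label, “Other”, picks out a different set of things in each version, even though nothing about the gene products themselves has changed.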

From the point of view of a realist, it can also be argued that “unknown function” does not exist anyway. Gene products do not have an “unknown function”; they have functions that we do not yet know, which is a different thing altogether. We need not appeal to such a viewpoint, however; using these residual categories is a bad idea on purely practical grounds.

The semantics of languages used for representing ontologies, such as OWL, can easily help us out. If the function we want is not in the ontology underneath “Molecular function”, or we do not know the function, all we have to do is describe our gene product as having a “Molecular function”. Our description says that the gene product has a function, but does not say which one. The semantics of this statement are as clear as saying that a gene product has an unknown function.
Similarly, when the urge is felt for a “Miscellaneous” category, the higher-level category should be used instead, a note made that the specific function is known but not yet available in the ontology, and the ontology changed accordingly.
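As a rough sketch of what annotating at the higher level looks like in practice, the following Python models a tiny hierarchy as a child-to-parent map. The class names follow the examples in this post, but the gene products, the `PARENT` and `ANNOTATIONS` tables and the helper functions are hypothetical and are not taken from the GO or from any OWL tool.

```python
# Child -> parent links for a toy fragment of a function hierarchy.
PARENT = {
    "molecular function": None,
    "enzyme function": "molecular function",
    "binding function": "molecular function",
    "motility function": "molecular function",
}

# Hypothetical annotations. geneX has a function that is either unknown
# or not yet in the ontology, so it is annotated with the parent class.
ANNOTATIONS = {
    "geneA": "enzyme function",
    "geneB": "binding function",
    "geneX": "molecular function",
}


def is_a(cls, ancestor):
    """True if cls is the ancestor itself or one of its descendants."""
    while cls is not None:
        if cls == ancestor:
            return True
        cls = PARENT[cls]
    return False


def products_with(function_class):
    """Gene products whose annotation falls under function_class."""
    return sorted(g for g, c in ANNOTATIONS.items() if is_a(c, function_class))


def needs_refinement(root="molecular function"):
    """Gene products described only at the top level."""
    return sorted(g for g, c in ANNOTATIONS.items() if c == root)


print(products_with("molecular function"))
print(products_with("binding function"))
print(needs_refinement())
```

Here geneX is still found by any query for things with a molecular function, is never wrongly returned for a specific function, and can be listed for later refinement without any “Other” class existing in the hierarchy.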

We can also add the notion of a “covering axiom” to this discussion. By saying that a class is equivalent to the union of its subclasses, one is saying that any instance of that class must be an instance of one of the “covered” subclasses; that is, the whole of the class in question is accounted for, or “covered”, by the subclasses. With a covering axiom, one is saying that there can be no others; without one, one is saying that there may be others.
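The effect of a covering axiom on what can be inferred can be sketched as follows. This is a toy model rather than an OWL reasoner: the `SUBCLASSES` table, the `UNNAMED` placeholder and the function `possible_kinds` are assumptions made for illustration.

```python
# In OWL (Manchester syntax) a covering axiom would look roughly like:
#   Class: MolecularFunction
#     EquivalentTo: EnzymeFunction or BindingFunction or MotilityFunction

# Placeholder for "some subclass that the ontology does not name".
UNNAMED = "(a kind not yet in the ontology)"

SUBCLASSES = {
    "molecular function": {
        "enzyme function",
        "binding function",
        "motility function",
    },
}


def possible_kinds(cls, ruled_out, covered):
    """Kinds an instance of cls could still be, once the subclasses in
    ruled_out have been excluded.

    covered=True  : a covering axiom holds, so only named subclasses remain.
    covered=False : no covering axiom, so an unnamed kind is always possible.
    """
    remaining = SUBCLASSES[cls] - set(ruled_out)
    return remaining if covered else remaining | {UNNAMED}


excluded = ["enzyme function", "binding function"]

# With covering, elimination pins the instance down to one subclass.
print(possible_kinds("molecular function", excluded, covered=True))

# Without covering, the same evidence leaves the question open.
print(possible_kinds("molecular function", excluded, covered=False))
```

Leaving out the covering axiom therefore already expresses “there may be other kinds”, which is the job an “Other” class is usually asked to do.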

In summary, don’t add “miscellaneous” or “unknown”, etc. to your ontology; there is no need. Just as separation of identifier and label is one of Cimino’s desiderata [1], so is the avoidance of “miscellaneous”.

Authors

Robert Stevens
School of Computer Science,
The University of Manchester,
Oxford Road,
Manchester,
UK

Duncan Hull
EMBL Outstation – Hinxton,
European Bioinformatics Institute,
Wellcome Trust Genome Campus,
Hinxton,
Cambridge,
CB10 1SD,
UK

References

[1] Cimino JJ. Desiderata for controlled medical vocabularies in the Twenty-First Century. Methods Inf Med 1998;37(4-5):394-403.
