The Need for Ontology Learning
Ontologies are the fundamental form of knowledge representation in contemporary Artificial Intelligence systems, especially systems used in the Life Sciences (see the What is an Ontology article). The vast majority of currently used ontologies have been built entirely by hand, including the majority of Life Science ontologies such as GO and those in the OBO Foundry. This manual development process represents a major knowledge acquisition bottleneck: sometimes hundreds of hours of effort have been involved, and standing teams of people are needed to keep the ontologies up to date (1). One consequence has been a series of ongoing efforts, largely led by members of the Natural Language Processing (NLP) and Text Mining communities, to automate or semi-automate the ontology construction process. Ontology Learning is motivated by a number of factors:
- The high manual cost of ontology construction
- The continuous change in science and knowledge in general
- The very large amount of existing text with numbers growing exponentially (PubMed has over 2000 papers added to it per day)
- The extensive need for a variety of ontology-type structures, ranging from the relatively informal, such as structured vocabularies, through somewhat more formal taxonomies, to fully rigorous ontologies expressed in OWL.
This article will provide a brief introduction to some of the key concepts, approaches and challenges.
The work undertaken on ontology learning from text depends on a number of core assumptions. All knowledge representation models, including ontologies, are linguistic artefacts, for two reasons (2). Firstly, ontologies function as a means for human beings to interact with machines, even if in a highly structured manner. Secondly, nearly all ontologies use linguistic terms (or quasi-linguistic terms) as labels for concepts. The fundamental assumption made by the NLP community is that a term in an ontology and a term in a text which look the same are identical. This perception is further reinforced by the extensive use of ontologies (especially in the Life Sciences) to annotate data, including textual data.
On the NLP technology side, two other assumptions need to be identified. The distributional hypothesis is a very important working assumption which states that words which appear in similar contexts must be semantically identical or closely related (3), (4). All efforts to organise and cluster terms in texts, and to identify semantic similarity (synonymy), depend on the distributional hypothesis and on the identification of semantic difference or identity.
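The distributional hypothesis can be illustrated with a minimal Python sketch. It assumes a toy three-sentence corpus, a context window of two tokens and cosine similarity over raw co-occurrence counts; the function names `context_vectors` and `cosine` are hypothetical and are not taken from any of the cited systems.

```python
from collections import Counter
import math


def context_vectors(sentences, window=2):
    """Count the words that occur within `window` tokens of each word."""
    vectors = {}
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            start = max(0, i - window)
            context = tokens[start:i] + tokens[i + 1:i + 1 + window]
            vectors.setdefault(word, Counter()).update(context)
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy corpus (an assumption for illustration only).
sentences = [
    "the kinase phosphorylates the target protein",
    "the enzyme phosphorylates the target protein",
    "the patient visited the clinic yesterday",
]
vectors = context_vectors(sentences)
sim_related = cosine(vectors["kinase"], vectors["enzyme"])
sim_unrelated = cosine(vectors["kinase"], vectors["clinic"])
print(sim_related, sim_unrelated)
```

On this toy corpus, "kinase" and "enzyme" occur in identical contexts and score close to 1, whereas "kinase" and "clinic" overlap only through the function word "the". Real systems typically weight or filter such frequent words before comparing contexts.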
A further assumption is that texts actually contain the necessary information to derive ontologically relevant definitions. A common ontology learning model involves taking a set of domain specific texts and deriving (as far as possible) a corresponding ontology. The inadequacies which this approach has (5) have led to attempts to use multiple knowledge sources for the ontology learning process (6).
Ontology Learning remains an active area of research with considerable potential for facilitating the work of ontology engineers. Further progress in Natural Language Processing and the integration of multiple sources of knowledge promise to substantially reduce the burden of formalisation on the ontology engineer. At the same time, the outputs of ontology learning systems are of great utility in less formal contexts where taxonomies or formal vocabularies are required.
The Basics of Ontology Learning
At its simplest, an ontology learning system (or workflow) allows the input of one or more texts and the output of some form of taxonomy. At a more detailed level, the process of ontology learning from text can be analysed as follows (adapted from (7)):
- Text Collection. A corpus of texts has to be identified, collected and preprocessed.
- Term recognition or keyword extraction. Stopwords, named entities and other noise need to be removed and a subset of the vocabulary in the text identified.
- Relation Extraction: Synonymy and Clustering. Terms which are related to one another need to be identified. This step also identifies terms or expressions, such as abbreviations, which are considered to be synonymous.
- Relation Labelling. Relations which have been identified as existing need to be labelled. (Steps 3 and 4 often occur together.)
- Hierarchy Construction. Most ontologies contain subsumption hierarchies of classes; however, it may also be desirable to extract different types of axioms from the text, including disjointness and equivalence.
Each step can be undertaken in a number of different ways, some of which we will briefly consider below.
Step 1 Text Collection: Text collections or corpora are usually “convenience corpora”, i.e. based on a convenient collection of texts readily accessible to the researcher. Some authors have selected texts from the web, or extracted abstracts or complete texts from sources such as PubMed using specific keywords.
Step 2 Term Recognition and keyword extraction: There are a number of standard approaches in the literature including (8), (9), (10), (11), (12). A comparative evaluation and integration of different methodologies relevant to ontology learning is presented in (13).
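As a rough illustration of this step, the following Python sketch filters stopwords and keeps the tokens that recur in the corpus. It is a deliberately naive frequency baseline and not one of the methods cited above; the stopword list, the `min_count` threshold and the function name `candidate_terms` are assumptions made for the example.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; real systems use larger lists).
STOPWORDS = {"the", "a", "an", "of", "in", "and", "is", "are", "to", "by", "that"}


def candidate_terms(texts, min_count=2):
    """Return (term, count) pairs for non-stopword tokens seen at least min_count times."""
    counts = Counter()
    for text in texts:
        # Lowercase alphabetic tokens of two or more characters; hyphens allowed.
        tokens = re.findall(r"[a-z][a-z\-]+", text.lower())
        counts.update(t for t in tokens if t not in STOPWORDS)
    return [(term, n) for term, n in counts.most_common() if n >= min_count]


# Toy corpus (hypothetical sentences, for illustration only).
corpus = [
    "The kinase is activated by phosphorylation of the receptor.",
    "Phosphorylation of the kinase regulates the receptor.",
]
terms = candidate_terms(corpus)
print(terms)
```

A single-token frequency count of this kind ignores multi-word terms and domain specificity, which is precisely what the more sophisticated term recognition methods in the literature set out to address.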
Step 3 Synonymy and Clustering: There is a large literature on different approaches; cf. (14), Chap. 14. Classic systems for the creation of synonym sets include Sextant (15) and the work of the speech technology community on language models (16).
Step 4 Relation Labelling: Automated approaches to ontology learning have tended to restrict themselves mostly to learning ISA hierarchies (hyponymy and hypernymy relations). There have been some limited efforts to learn other types of relations (17). There are two standard approaches to relation labelling:
- String Inclusion: This approach applies to multi-word terms. Given a term with the structure ABC, it assumes ABC ISA BC ISA C. There is a strong argument for interpreting this relation not as a hyponymy relation but (as defined in SKOS) as a “broader than/narrower than” relation. This is a high recall but low precision approach (1).
- Lexico-syntactic Patterns: Ever since Hearst’s work (18), there has been a strong tradition of using lexico-syntactic patterns to identify a variety of ontological/semantic relations between terms. These patterns are usually of the form “NP is a type of NP” or “NP and other NPs”. This approach tends to have high precision but low recall.
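Both approaches can be sketched in a few lines of Python. The sketch below is an illustration under strong simplifying assumptions: noun phrases are approximated by single lowercase words (no parser or part-of-speech tagger is used), only the two pattern shapes quoted above are encoded, and the names `string_inclusion` and `hearst_relations` are hypothetical.

```python
import re


def string_inclusion(term):
    """For a multi-word term 'A B C', return (narrower, broader) pairs: ABC -> BC -> C."""
    words = term.split()
    return [
        (" ".join(words[i:]), " ".join(words[i + 1:]))
        for i in range(len(words) - 1)
    ]


# A "noun phrase" is approximated by one lowercase word (a strong simplification).
NP = r"([a-z]+)"
PATTERNS = [
    re.compile(NP + r" is a type of " + NP),  # "NP is a type of NP"
    re.compile(NP + r" and other " + NP),     # "NP and other NPs"
]


def hearst_relations(text):
    """Return (hyponym, 'ISA', hypernym) triples matched by the patterns above."""
    relations = []
    for pattern in PATTERNS:
        for hyponym, hypernym in pattern.findall(text.lower()):
            relations.append((hyponym, "ISA", hypernym))
    return relations


pairs = string_inclusion("protein tyrosine kinase")
triples = hearst_relations(
    "A kinase is a type of enzyme. Kinases and other enzymes were studied."
)
print(pairs)
print(triples)
```

The string inclusion function fires on every multi-word term, which is consistent with its high recall and low precision; the patterns fire only when a sentence happens to have exactly the expected shape, which is consistent with the opposite trade-off.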
Step 5 Hierarchy Construction: Hierarchy construction usually falls out of the relation labelling step, and the two may be combined. There are other approaches which focus first on hierarchy construction and ignore the relation labelling stage (19).
Other Knowledge Sources
It is common practice in ontology learning to use multiple sources of knowledge in addition to textual ones. Thus a variety of existing knowledge structures have been used to build ontologies semi-automatically. These include WordNet (20), existing thesauri such as Roget's, and, in the Life Science domain, structures such as UMLS. There is an inherent contradiction in using existing structures, since part of the point is to build up-to-date representations of knowledge as reflected in a corpus of texts. However, ever since the work of Agirre et al. (21), there have been many attempts to start from an existing ontology and augment or update it using corpora or the web.
Specific Systems and Tools
- Text2Onto: Combines machine learning approaches with basic linguistic approaches such as tokenisation and part-of-speech (POS) tagging. Text2Onto is built upon the General Architecture for Text Engineering (GATE) that allows for flexible integration of natural language processing components. It also allows Java Annotation Pattern Engine (JAPE) rules to be written that can be used to facilitate recognition of ontological primitives.
- Abraxas: This system is based on an iterative, open-ended approach to the ontology learning process. A seed corpus is used and terms are identified; ontological knowledge in the form of triples is extracted using lexico-syntactic patterns; gaps in ontologically explicit knowledge are identified; and suitable texts to “cover” the gaps are retrieved from an external repository such as the web (22).
- KnowItAll: While not strictly an ontology learning system, KnowItAll is a good example of large scale knowledge acquisition from the web. It uses most of the component techniques widely used in ontology learning, except that it outputs knowledge fragments rather than complete ontologies (23).
- OntoLearn: This system (24) makes extensive use of WordNet as an external resource. Term extraction is applied to a corpus (the OntoLearn team have a high quality term recognition component (25)), and knowledge about the terms from WordNet is then integrated. The OntoLearn team have undertaken the most extensive end user evaluations of their system (26).
Challenges and Future Directions
There are two major challenges for automated ontology learning. Both have practical and philosophical implications.
The Nature of Text
There is a significant gap between the vagueness, fluidity and ambiguity of natural language and the logical rigour of fully formal ontologies. One of the key challenges in the ontology learning field remains the extent to which techniques can be found to bridge this gap. Current approaches tend either to ignore this gap or to make explicit the need for human post-editing of the output of the ontology learning system. Domain specific text collections contain neither sufficient instances of the use of terms nor a sufficient variety of contexts to enable the accurate determination of ontological relations and features (5). In effect, a large proportion of the knowledge that needs to be formalised is not present in most corpora, especially domain specific collections. It is in this light that the wider use of the web as a corpus for knowledge extraction and collection is highly attractive (cf. especially the KnowItAll system (23), (27)), but this will tend to increase the problems of ambiguity and domain specific senses.
The evaluation of the output of ontology learning systems remains a major challenge. The usual approaches to ontology evaluation have largely been based on quality control of the ontology building process and on ensuring that the ontology abides by certain principles (28), (29). From the perspective of ontology learning from text, these approaches are inapplicable. The usual approach is instead to compare the automatically generated output with a Gold Standard (6), sometimes using WordNet as the Gold Standard. The main problem with a Gold Standard is that, if the ontology learning system is supposed to discover new knowledge, then a Gold Standard based evaluation may penalise a system for discovering knowledge absent from the Standard. Also, as systems scale up, it becomes harder and harder to get an overall assessment of the quality of the resulting output in a given case.
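The Gold Standard problem can be made concrete with a small Python sketch. It assumes that both the learned output and the Gold Standard are reduced to sets of (narrower, broader) pairs and scores them with set based precision and recall; the example pairs and the function name `precision_recall` are invented for illustration.

```python
def precision_recall(learned, gold):
    """Set based precision and recall of learned relations against a Gold Standard."""
    learned, gold = set(learned), set(gold)
    correct = learned & gold
    precision = len(correct) / len(learned) if learned else 0.0
    recall = len(correct) / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical Gold Standard relations.
gold = {
    ("kinase", "enzyme"),
    ("enzyme", "protein"),
    ("receptor", "protein"),
}
# Hypothetical system output: two known relations plus one that the
# Gold Standard does not contain.
learned = {
    ("kinase", "enzyme"),
    ("enzyme", "protein"),
    ("aptamer", "oligonucleotide"),
}
precision, recall = precision_recall(learned, gold)
print(precision, recall)
```

Here the pair ("aptamer", "oligonucleotide") may well be correct, but because it is absent from the Gold Standard it lowers precision exactly as a wrong relation would.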
Ontology learning is an active field of research with considerable potential to make the task of ontology engineering substantially easier. Another important role for OL is to allow ontologies to be kept up to date more effectively. Finally, one should mention that in many contexts where less formal ontologies are needed (taxonomies or structured vocabularies), the output of OL systems is very close to that required.
This paper is an open access work distributed under the terms of the Creative Commons Attribution License 3.0 (http://creativecommons.org/licenses/by/3.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided that the original author and source are attributed.
The paper and its publication environment form part of the work of the Ontogenesis Network, supported by EPSRC grant EP/E021352/1.
1. Brewster, C.; Jupp, S.; Luciano, J.; Shotton, D.; Stevens, R. & Zhang, Z. Issues in learning an ontology from text. BMC Bioinformatics, 2009, 10, S1
2. Davis, R.; Shrobe, H. & Szolovits, P. What is a Knowledge Representation? AI Magazine, 1993, 14, 17-33
3. Harris, Z. Distributional structure. Word, 1954, 23, 146-162
4. Firth, J. R. Papers in Linguistics, 1934-1951. Oxford University Press, 1957
5. Brewster, C.; Ciravegna, F. & Wilks, Y. Background and Foreground Knowledge in Dynamic Ontology Construction. Proceedings of the Semantic Web Workshop, Toronto, August 2003, 2003
6. Cimiano, P.; Pivk, A.; Schmidt-Thieme, L. & Staab, S. Learning Taxonomic Relations from Heterogeneous Sources of Evidence. Ontology Learning from Text: Methods, Evaluation and Applications, IOS Press, 2005
7. Brewster, C. Mind the Gap: Bridging from Text to Ontological Knowledge. Department of Computer Science, University of Sheffield, 2008
8. Ahmad, K. & Rogers, M. A. Corpus Linguistics and Terminology Extraction. In: Wright, S.-E. & Budin, G. (ed.) Handbook of Terminology Management (Volume 2), John Benjamins Publishing Company, 2001, 725-760
9. Velardi, P.; Missikoff, M. & Basili, R. Identification of Relevant Terms to Support the Construction of Domain Ontologies. Proceedings of the ACL 2001 Workshop on Human Language Technology and Knowledge Management, 2001
10. Park, Y.; Byrd, R. J. & Boguraev, B. Automatic Glossary Extraction: Beyond Terminology Identification. 19th International Conference on Computational Linguistics – COLING 02, Howard International House and Academia Sinica, 2002
11. Ananiadou, S. & Nenadic, G. Automatic Terminology Management in Biomedicine. In: Ananiadou, S. & McNaught, J. (ed.) Text Mining for Biology and Biomedicine, Artech House Books, 2006, 67-98
12. Moigno, S. L.; Charlet, J.; Bourigault, D. & Jaulent, M.-C. Terminology extraction from text to build an ontology in surgical intensive care. Proceedings of the AMIA 2002 annual symposium, 2002
13. Zhang, Z.; Iria, J.; Brewster, C. & Ciravegna, F. A Comparative Evaluation of Term Recognition Algorithms. Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC08), 2008
14. Manning, C. D. & Schütze, H. Foundations of Statistical Natural Language Processing. MIT Press, 1999
15. Grefenstette, G. Explorations in Automatic Thesaurus Discovery. Kluwer, 1994
16. Brown, P.; Pietra, V. D.; deSouza, P.; Lai, J. & Mercer, R. Class Based n-gram Models of Natural Language. Computational Linguistics, 1992, 18, 467-479
17. Villaverde, J.; Persson, A.; Godoy, D. & Amandi, A. Supporting the discovery and labeling of non-taxonomic relationships in ontology learning. Expert Systems with Applications, 2009, 36, 10288-10294
18. Hearst, M. Automatic Acquisition of Hyponyms from Large Text Corpora. Proceedings of the Fourteenth International Conference on Computational Linguistics (COLING 92), Nantes, France, July 1992, 1992
19. Sanderson, M. & Croft, B. Deriving concept hierarchies from text. Proceedings of the 22nd ACM SIGIR Conference, 1999, 206-213
20. Fellbaum, C. D. WordNet: An Electronic Lexical Database. MIT Press, 1998
21. Agirre, E.; Ansa, O.; Hovy, E. H. & Martínez, D. Enriching very large ontologies using the WWW. In: Staab, S.; Maedche, A.; Nedellec, C. & Wiemer-Hastings, P. (ed.) ECAI Workshop on Ontology Learning, CEUR-WS.org, 2000, 31
22. Wilks, Y. & Brewster, C. Natural Language Processing as a Foundation of the Semantic Web. Foundations and Trends in Web Science, 2009, 1, 199-327
23. Etzioni, O.; Cafarella, M.; Downey, D.; Kok, S.; Popescu, A.-M.; Shaked, T.; Soderland, S.; Weld, D. S. & Yates, A. Web-scale information extraction in KnowItAll (preliminary results). In: Feldman, S. I.; Uretsky, M.; Najork, M. & Wills, C. E. (ed.) Proceedings of the 13th international conference on World Wide Web, WWW 2004, New York, NY, USA, May 17-20, 2004, ACM, 2004, 100-110
25. Sclano, F. & Velardi, P. TermExtractor: a Web Application to Learn the Shared Terminology of Emergent Web Communities. Proceedings of the 3rd International Conference on Interoperability for Enterprise Software and Applications (I-ESA 2007), 2007
26. Velardi, P.; Navigli, R.; Cucchiarelli, A. & Neri, F. Evaluation of OntoLearn, a methodology for automatic learning of domain ontologies. In: Buitelaar, P.; Cimiano, P. & Magnini, B. (ed.) Ontology Learning from Text: Methods, Evaluation and Applications, IOS Press, 2005
27. Etzioni, O.; Cafarella, M. J.; Downey, D.; Popescu, A.-M.; Shaked, T.; Soderland, S.; Weld, D. S. & Yates, A. Unsupervised named-entity extraction from the Web: An experimental study. Artif. Intell., 2005, 165, 91-134
28. Gómez-Pérez, A. Ontology Evaluation. In: Staab, S. & Studer, R. (ed.) Handbook on Ontologies, Springer, 2004, 251-274
29. Oltramari, A.; Gangemi, A.; Guarino, N. & Masolo, C. Restructuring WordNet’s Top-Level: The OntoClean approach. Proceedings of the Workshop OntoLex’2, Ontologies and Lexical Knowledge Bases, 2002