Learning to Learn Concept Descriptions

PhD candidate Giulio Petrucci
21 September 2018

Time: 10:00 am
Location: Room Garda, Polo Ferrari 1 - Via Sommarive 5, Povo (TN)

PhD Candidate

  • Giulio Petrucci

Abstract of Dissertation

The goal of automatically encoding natural language text into some formal representation has long been pursued in the field of Knowledge Engineering to support the construction of Formal Ontologies. Many state-of-the-art methods have been proposed for the automatic extraction of lightweight Ontologies and for populating them, but only a few have tackled the challenge of extracting expressive axioms that formalize the possibly complex semantics of ontological concepts.

In this thesis, we address the problem of encoding a natural language sentence that describes a concept into a corresponding Description Logic axiom. In our approach, the encoding happens through a syntactic transformation, so that all the extralogical symbols in the formula are words actually occurring in the input sentence. We built on recent advances in the field of Deep Learning to design suitable Neural Network architectures capable of learning from examples how to perform this transformation.
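To make the task concrete, consider a hypothetical input/output pair (invented here for illustration; it is not drawn from the thesis datasets). Every extralogical symbol in the axiom is built from words occurring in the sentence:

    Sentence: "a conference paper is a paper that is submitted to a conference"
    Axiom:    ConferencePaper ⊑ Paper ⊓ ∃submittedTo.Conference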

Since no pre-existing dataset was available to adequately train Neural Networks for this task, we designed a data generation pipeline to produce the datasets used to train and evaluate the architectures proposed in this thesis. These datasets therefore provide a first reference corpus for the task of learning concept description axioms from text via Machine Learning techniques, and are now available to the Knowledge Engineering community, filling the pre-existing lack of data.
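As a rough sketch of how such a generation pipeline can be structured (the templates, vocabulary, and axiom shapes below are invented for illustration and do not reproduce the actual thesis pipeline), one can pair sentence patterns with axiom patterns and fill both from a shared vocabulary, so that the extralogical symbols of the axiom always occur in the sentence:

    import random

    # Hypothetical template pairs: a sentence pattern and the Description
    # Logic axiom it should be encoded into. Both sides share the same slots,
    # so every extralogical symbol in the axiom also occurs in the sentence.
    TEMPLATES = [
        ("a {sub} is a {sup} that {rel}s a {obj}",
         "{Sub} ⊑ {Sup} ⊓ ∃{rel}s.{Obj}"),
        ("a {sub} is a {sup} that is not a {obj}",
         "{Sub} ⊑ {Sup} ⊓ ¬{Obj}"),
    ]

    NOUNS = ["paper", "conference", "review", "journal", "article"]
    VERBS = ["describe", "cite", "present"]

    def generate_example(rng: random.Random) -> tuple[str, str]:
        """Sample one (sentence, axiom) training pair from the templates."""
        sentence_tpl, axiom_tpl = rng.choice(TEMPLATES)
        sub, sup, obj = rng.sample(NOUNS, 3)
        rel = rng.choice(VERBS)
        slots = {
            "sub": sub, "sup": sup, "obj": obj, "rel": rel,
            "Sub": sub.capitalize(), "Sup": sup.capitalize(),
            "Obj": obj.capitalize(),
        }
        return sentence_tpl.format(**slots), axiom_tpl.format(**slots)

    if __name__ == "__main__":
        rng = random.Random(42)
        for _ in range(3):
            sentence, axiom = generate_example(rng)
            print(f"{sentence}  ->  {axiom}")

Because sentence and axiom are filled from the same slots, the generated pairs respect by construction the syntactic-transformation property described above.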

During our evaluation, we assessed some key characteristics of the proposed approach. First, we evaluated the capability of the trained models to generalize over the syntactic structures used to express concept descriptions, together with their tolerance to unknown words. These characteristics matter because Machine Learning systems are trained on a statistical sample of the problem space and must generalize over this sample to process new inputs. In our scenario in particular, even an extremely large training set cannot include all the possible ways a human can express the definition of a concept, and part of the human vocabulary is likely to fall outside the training set. Testing these generalization capabilities and the tolerance to unknown words is therefore crucial to evaluate the effectiveness of the model. Second, we evaluated how the performance of the model improves when it is incrementally trained with additional training examples. This is also a pivotal characteristic of our approach, since Machine Learning-based systems are typically expected to evolve and improve in the long term through repeated cycles of training set enlargement and retraining. A valuable model must therefore show improved performance when new examples are added to the training set.
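The incremental-training evaluation can be pictured as a simple learning curve: retrain on growing fractions of the training data and measure exact-match accuracy on a fixed test set. The sketch below is a deliberately naive, runnable illustration (not the thesis setup): the stand-in "model" merely memorizes training pairs, so it scores zero on sentences it has never seen, which is precisely the generalization gap the neural architectures are evaluated on.

    def evaluate_incremental(train_pairs, test_pairs,
                             fractions=(0.25, 0.5, 0.75, 1.0)):
        """Learning-curve evaluation: exact-match accuracy vs. training size."""
        results = {}
        for frac in fractions:
            subset = train_pairs[: int(len(train_pairs) * frac)]
            model = dict(subset)  # stand-in model: memorize sentence -> axiom
            correct = sum(model.get(sent) == axiom for sent, axiom in test_pairs)
            results[frac] = correct / len(test_pairs)
        return results

    if __name__ == "__main__":
        # Toy corpus: 100 synthetic (sentence, axiom) pairs.
        pairs = [(f"a widget{i} is a gadget that has a part{i}",
                  f"Widget{i} ⊑ Gadget ⊓ ∃has.Part{i}") for i in range(100)]
        # The last 20 test pairs are never trained on, so even the full
        # training set cannot push the memorizing baseline to 100% accuracy.
        train, test = pairs[:80], pairs[50:]
        for frac, acc in evaluate_incremental(train, test).items():
            print(f"trained on {frac:.0%} of the data: accuracy {acc:.2f}")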

To the best of our knowledge, this work represents the first assessment of an approach to the problem of encoding expressive concept descriptions from text that is entirely Machine Learning-based and trained in an end-to-end fashion starting from raw text. Specifically, this thesis proposes the first two Neural Network architectures in the literature to solve this problem, evaluates them with respect to the pivotal characteristics above, and provides a first dataset generation pipeline together with the concrete datasets it produced.