Fortuitous data

Improving language technology with fortuitous data

Lecturers: Barbara Plank and Anders Johannsen

Course held at the ESSLLI summer school, August 15-19, Bozen-Bolzano

Abstract

Current successful approaches to natural language processing (NLP) are for the most part based on supervised learning. In turn, supervised learning critically depends on the availability of annotated data. Such data is rarely plentiful, because developing annotated resources takes time and expertise. This is the problem of data sparsity. At the same time, the annotated data that does exist is usually a sample of a particular domain or language. Thus, even when some annotated data is available, it is often not a good fit for the problem at hand. This is the problem of data bias.

In this course, we present approaches to facilitate NLP development when supervision is sparse, or even absent, and the annotated samples of language data that do exist are biased. Using part-of-speech tagging and syntactic dependency parsing as running examples, we outline modern approaches to augmenting supervised techniques for top-level performance. The approaches include semi-supervised and unsupervised techniques, domain adaptation, and cross-lingual learning. We place particular emphasis on leveraging the various sources of fortuitous data that may be available even in the most severely under-resourced domains of natural language. We argue that fortuitous data often provides the "secret sauce" that makes approaches based on limited supervision work.

Schedule and slides

Day 1: Introduction, A typology of data mismatch, Learning in the shire [slides]
Day 2: Structured prediction [slides]
Day 3: Neural networks: graph view, representations, and multi-task learning [slides]
Day 4: Fortuitous recipes + hands-on [exercise 1] [exercise 2]
Day 5: Cross-lingual learning

Lecturers

Anders Johannsen, Apple
Barbara Plank, University of Groningen