Part of my series of notes from NAACL-HLT 2019 in Minneapolis.
This tutorial had a… stressful number of people in it. My notes aren’t very thorough, but fortunately the slides are excellent and they even released code, so go have a look at that instead of reading this. Slides are here, code is there.
Why transfer learning?
- tasks can inform each other
- share common knowledge about language
- annotated data is rare
- empirically, transfer learning kicks ass
- many types of transfer learning; this tutorial focuses on sequential transfer learning
- learn on one task/dataset, transfer to another
- pretraining + adaptation
- goals of this tutorial: overview of techniques, practical advice
- not exhaustive survey! and mostly in English
Themes
- word vectors -> sentence vectors -> word-in-context vectors
- language modeling pretraining – versatile, enough data for high-capacity model
- shallow -> deep models
- pretraining vs. target task – how similar they are affects how well transfer works
Pretraining
- distributional hypothesis, in a variety of forms
- my takeaway: people are obsessed with BERT
- LMs have to compress any possible context into a single vector that generalises over any possible completion (toy sketch at the end of this list)
- this increases my suspicion that language modeling isn’t really doing what my intuition says happens when you just “predict the next word” or whatever – the same goes for the distributional hypothesis and the embeddings learned from it
- pre-training improves sample efficiency – you need fewer labelled target-task examples
- cross-lingual stuff
- then I left because I needed coffee SO BAD OMG
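To make that single-vector point a bit more concrete, here’s a toy next-word-prediction sketch – nothing from the tutorial’s repo, just a minimal PyTorch LSTM LM with made-up sizes:

```python
import torch
from torch import nn

# Toy sizes, purely for illustration
vocab_size, emb_dim, hidden_dim = 10_000, 128, 256

embed = nn.Embedding(vocab_size, emb_dim)
lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
to_vocab = nn.Linear(hidden_dim, vocab_size)    # scores every word in the vocabulary

tokens = torch.randint(0, vocab_size, (1, 12))  # one fake 12-token sentence
hidden, _ = lstm(embed(tokens))                 # (1, 12, hidden_dim)
# hidden[:, t] is the single vector summarising the prefix up to position t;
# it has to be good enough to rank all vocab_size possible next words.
logits = to_vocab(hidden)                       # (1, 12, vocab_size)
loss = nn.functional.cross_entropy(
    logits[:, :-1].reshape(-1, vocab_size),     # prediction for position t+1 ...
    tokens[:, 1:].reshape(-1),                  # ... compared with the actual next token
)
```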
Adaptation
- keep pretrained model unchanged
- (maybe) remove last layer
- add a linear layer on top, or feed its outputs into another model (feature-extraction sketch at the end of this list)
- or, changing model internals
- target task structurally very different
- use pretrained model weights to initialise bits of new model
- e.g. adapters – let you get away with far fewer task-specific parameters (sketch at the end of this list)
- I didn’t know anything about adapters! Papers: Houlsby et al. 2019, Stickland & Murray 2019
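Roughly what the “keep it frozen, add a linear layer” option looks like – a minimal sketch with a stand-in encoder and made-up sizes, not anything from the tutorial:

```python
import torch
from torch import nn

class TinyEncoder(nn.Module):
    """Stand-in for a pretrained encoder (BERT, an LSTM LM, ...)."""
    def __init__(self, vocab_size=10_000, dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.lstm = nn.LSTM(dim, dim, batch_first=True)

    def forward(self, tokens):
        out, _ = self.lstm(self.embed(tokens))
        return out                      # (batch, seq_len, dim) word-in-context vectors

encoder = TinyEncoder()                 # pretend this was pretrained
for p in encoder.parameters():          # keep the pretrained model unchanged
    p.requires_grad = False

classifier = nn.Linear(256, 3)          # new task-specific head, e.g. 3 classes

tokens = torch.randint(0, 10_000, (4, 12))
with torch.no_grad():
    features = encoder(tokens).mean(dim=1)   # pool contextual vectors into one per sentence
logits = classifier(features)                # only this layer gets trained
```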
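And a sketch of an adapter in the Houlsby et al. style – a small bottleneck MLP with a residual connection, dropped in after a frozen pretrained layer so that only a handful of parameters are trained for the target task. Dimensions here are just placeholders:

```python
import torch
from torch import nn

class Adapter(nn.Module):
    """Bottleneck MLP with a residual connection, inserted after a frozen
    pretrained sublayer; only these few parameters get fine-tuned."""
    def __init__(self, dim=768, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, hidden):
        # residual connection keeps the layer close to the identity at initialisation
        return hidden + self.up(torch.relu(self.down(hidden)))

adapter = Adapter()
hidden_states = torch.randn(4, 12, 768)   # output of a frozen pretrained layer
adapted = adapter(hidden_states)          # same shape, nudged toward the target task
# roughly 2 * 768 * 64 trainable parameters per adapter vs. millions in the full layer
```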
Optimisation
- what, when, and how to update
- to fine-tune or not to fine-tune?
- what schedule?
- want to avoid overwriting pretraining info
- avoid catastrophic forgetting
- principle: update top-to-bottom
- cf. greedy layer-wise training
- freeze / unfreeze layers with various schedules (sketch at the end of this list)
- but also train all parameters jointly at the end
- learning rates
- lower to avoid overwriting
- higher when you need to adapt more aggressively to the target task
- regularised to stay close to pretrained weights (assuming the two tasks are relatively similar – second sketch after this list)
- getting more signal by using more tasks / data
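A rough sketch of the discriminative-learning-rates plus gradual-unfreezing recipe. The 2.6 per-layer decay factor is the one ULMFiT suggests; everything else (sizes, schedule) is made up for illustration:

```python
from torch import nn, optim

# A toy stack of "layers" standing in for a pretrained encoder, plus a fresh head.
layers = nn.ModuleList([nn.Linear(64, 64) for _ in range(4)])
head = nn.Linear(64, 3)

# Discriminative learning rates: highest for the new head, then decaying as you
# go down the stack, since lower layers hold the most general knowledge.
base_lr, decay = 1e-3, 2.6
param_groups = [{"params": head.parameters(), "lr": base_lr}]
for depth, layer in enumerate(reversed(layers)):      # top pretrained layer first
    param_groups.append({"params": layer.parameters(), "lr": base_lr / decay ** (depth + 1)})
optimizer = optim.Adam(param_groups)

# Gradual unfreezing: the head is always trainable; unfreeze one pretrained layer
# per epoch from the top down, until eventually everything trains jointly.
for p in layers.parameters():
    p.requires_grad = False
for epoch in range(6):
    if epoch < len(layers):
        for p in layers[len(layers) - 1 - epoch].parameters():
            p.requires_grad = True
    # ... usual training loop over the target-task data goes here ...
```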
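And the “stay close to the pretrained weights” idea as a sketch – an L2 penalty pulling the fine-tuned parameters back toward a frozen copy of the pretrained ones (the model and the 0.01 strength are arbitrary stand-ins):

```python
import copy

import torch
from torch import nn

model = nn.Linear(64, 3)              # stand-in for the whole pretrained-then-fine-tuned model
pretrained = copy.deepcopy(model)     # frozen snapshot of the pretrained weights
for p in pretrained.parameters():
    p.requires_grad = False

def distance_to_pretrained(current, reference, strength=0.01):
    """L2 penalty pulling fine-tuned weights back toward the pretrained ones."""
    return strength * sum((p - p0).pow(2).sum()
                          for p, p0 in zip(current.parameters(), reference.parameters()))

x, y = torch.randn(8, 64), torch.randint(0, 3, (8,))
loss = nn.functional.cross_entropy(model(x), y) + distance_to_pretrained(model, pretrained)
loss.backward()
```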
Some future work
- LMs as “rapid surface learners” – not always ideal
- more diverse self-supervision objectives
- besides just co-occurrence / distributional hypothesis
- different kinds of pre-trained representations
- besides just monolithic vector
- e.g. modular
- need for grounding – can’t just use raw text
- continual learning – no distinction between pretraining and adaptation