Part of my series of notes from NAACL-HLT 2019 in Minneapolis.
This tutorial had a… stressful number of people in it. My notes aren’t very thorough, but fortunately the slides are excellent and they even released code, so go have a look at that instead of reading this. Slides are here, code is there.
Why transfer learning?
- tasks can inform each other
- share common knowledge about language
- annotated data is rare
- empirically, transfer learning kicks ass
- many types of transfer learning; this tutorial focuses on sequential transfer learning
- learn on one task/dataset, transfer to another
- pretraining + adaptation
- goals of this tutorial: overview of techniques, practical advice
- not exhaustive survey! and mostly in English
Themes
- word vectors -> sentence vectors -> word-in-context vectors
- language modeling pretraining – versatile, enough data for high-capacity model
- shallow -> deep models
- pretraining vs. target task – how similar they are affects how well transfer works
Pretraining
- distributional hypothesis, in a variety of forms
- my takeaway: people are obsessed with BERT
- LMs have to compress any possible context into a single vector that generalises over any possible completion (toy sketch at the end of this list)
- this increases my suspicion that language modeling isn’t really doing what my intuition says happens when you just “predict the next word” or whatever – the same goes for the distributional hypothesis and the embeddings learned from it
- pre-training improves sample efficiency – you need fewer labelled target-task examples
- cross-lingual stuff
- then I left because I needed coffee SO BAD OMG
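To make that single-vector point a bit more concrete, here’s a toy next-word-prediction sketch – nothing from the tutorial’s repo, just a minimal PyTorch LSTM LM with made-up sizes:

```python
import torch
from torch import nn

# Toy sizes, purely for illustration
vocab_size, emb_dim, hidden_dim = 10_000, 128, 256

embed = nn.Embedding(vocab_size, emb_dim)
lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
to_vocab = nn.Linear(hidden_dim, vocab_size)    # scores every word in the vocabulary

tokens = torch.randint(0, vocab_size, (1, 12))  # one fake 12-token sentence
hidden, _ = lstm(embed(tokens))                 # (1, 12, hidden_dim)
# hidden[:, t] is the single vector summarising the prefix up to position t;
# it has to be good enough to rank all vocab_size possible next words.
logits = to_vocab(hidden)                       # (1, 12, vocab_size)
loss = nn.functional.cross_entropy(
    logits[:, :-1].reshape(-1, vocab_size),     # prediction for position t+1 ...
    tokens[:, 1:].reshape(-1),                  # ... compared with the actual next token
)
```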
Adaptation
- keep pretrained model unchanged
- (maybe) remove last layer
- add a linear layer on top, or feed its outputs into another model (feature-extraction sketch at the end of this list)
- or, changing model internals
- target task structurally very different
- use pretrained model weights to initialise bits of new model
- e.g. adapters – let you get away with far fewer task-specific parameters (sketch at the end of this list)
- I didn’t know anything about adapters! Papers: Houlsby et al. 2019, Stickland & Murray 2019
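Roughly what the “keep it frozen, add a linear layer” option looks like – a minimal sketch with a stand-in encoder and made-up sizes, not anything from the tutorial:

```python
import torch
from torch import nn

class TinyEncoder(nn.Module):
    """Stand-in for a pretrained encoder (BERT, an LSTM LM, ...)."""
    def __init__(self, vocab_size=10_000, dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.lstm = nn.LSTM(dim, dim, batch_first=True)

    def forward(self, tokens):
        out, _ = self.lstm(self.embed(tokens))
        return out                      # (batch, seq_len, dim) word-in-context vectors

encoder = TinyEncoder()                 # pretend this was pretrained
for p in encoder.parameters():          # keep the pretrained model unchanged
    p.requires_grad = False

classifier = nn.Linear(256, 3)          # new task-specific head, e.g. 3 classes

tokens = torch.randint(0, 10_000, (4, 12))
with torch.no_grad():
    features = encoder(tokens).mean(dim=1)   # pool contextual vectors into one per sentence
logits = classifier(features)                # only this layer gets trained
```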
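And a sketch of an adapter in the Houlsby et al. style – a small bottleneck MLP with a residual connection, dropped in after a frozen pretrained layer so that only a handful of parameters are trained for the target task. Dimensions here are just placeholders:

```python
import torch
from torch import nn

class Adapter(nn.Module):
    """Bottleneck MLP with a residual connection, inserted after a frozen
    pretrained sublayer; only these few parameters get fine-tuned."""
    def __init__(self, dim=768, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, hidden):
        # residual connection keeps the layer close to the identity at initialisation
        return hidden + self.up(torch.relu(self.down(hidden)))

adapter = Adapter()
hidden_states = torch.randn(4, 12, 768)   # output of a frozen pretrained layer
adapted = adapter(hidden_states)          # same shape, nudged toward the target task
# roughly 2 * 768 * 64 trainable parameters per adapter vs. millions in the full layer
```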
Optimisation
- what, when, and how to update
- to fine-tune or not to fine-tune?
- what schedule?
- want to avoid overwriting pretraining info
- avoid catastrophic forgetting
- principle: update top-to-bottom
- cf. greedy layer-wise training
- freeze / unfreeze layers with various schedules (sketch at the end of this list)
- but also train all parameters jointly at the end
- learning rates
- lower to avoid overwriting
- higher when you need to adapt more aggressively to the target task
- regularised to stay close to pretrained weights (assuming the two tasks are relatively similar – second sketch after this list)
- getting more signal by using more tasks / data
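A rough sketch of the discriminative-learning-rates plus gradual-unfreezing recipe. The 2.6 per-layer decay factor is the one ULMFiT suggests; everything else (sizes, schedule) is made up for illustration:

```python
from torch import nn, optim

# A toy stack of "layers" standing in for a pretrained encoder, plus a fresh head.
layers = nn.ModuleList([nn.Linear(64, 64) for _ in range(4)])
head = nn.Linear(64, 3)

# Discriminative learning rates: highest for the new head, then decaying as you
# go down the stack, since lower layers hold the most general knowledge.
base_lr, decay = 1e-3, 2.6
param_groups = [{"params": head.parameters(), "lr": base_lr}]
for depth, layer in enumerate(reversed(layers)):      # top pretrained layer first
    param_groups.append({"params": layer.parameters(), "lr": base_lr / decay ** (depth + 1)})
optimizer = optim.Adam(param_groups)

# Gradual unfreezing: the head is always trainable; unfreeze one pretrained layer
# per epoch from the top down, until eventually everything trains jointly.
for p in layers.parameters():
    p.requires_grad = False
for epoch in range(6):
    if epoch < len(layers):
        for p in layers[len(layers) - 1 - epoch].parameters():
            p.requires_grad = True
    # ... usual training loop over the target-task data goes here ...
```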
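And the “stay close to the pretrained weights” idea as a sketch – an L2 penalty pulling the fine-tuned parameters back toward a frozen copy of the pretrained ones (the model and the 0.01 strength are arbitrary stand-ins):

```python
import copy

import torch
from torch import nn

model = nn.Linear(64, 3)              # stand-in for the whole pretrained-then-fine-tuned model
pretrained = copy.deepcopy(model)     # frozen snapshot of the pretrained weights
for p in pretrained.parameters():
    p.requires_grad = False

def distance_to_pretrained(current, reference, strength=0.01):
    """L2 penalty pulling fine-tuned weights back toward the pretrained ones."""
    return strength * sum((p - p0).pow(2).sum()
                          for p, p0 in zip(current.parameters(), reference.parameters()))

x, y = torch.randn(8, 64), torch.randint(0, 3, (8,))
loss = nn.functional.cross_entropy(model(x), y) + distance_to_pretrained(model, pretrained)
loss.backward()
```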
Some future work
- LMs as “rapid surface learners” – not always ideal
- more diverse self-supervision objectives
- besides just co-occurrence / distributional hypothesis
- different kinds of pre-trained representations
- besides just monolithic vector
- e.g. modular
- need for grounding – can’t just use raw text
- continual learning – no distinction between pretraining and adaptation