Part of my series of notes from NAACL-HLT 2019 in Minneapolis.
The Emergence of Number and Syntax Units in LSTM Language Models
- subject-verb agreement
- grammatical agreement: features of one word are copied onto another (e.g. subject number onto the verb)
- psycholinguistic, neuroscientific etc. research on this
- previous work shows LSTMs can do this
- prior work relies on “behavioural evidence” – treating the neural net as a black box
- this work
- what’s the underlying mechanism?
- is it structure-sensitive?
- design stimuli with different attributes to probe behaviour
- ablate units – only 2 units significantly hurt agreement performance when ablated (see the ablation sketch after this list)
- separate singular and plural units?!
- dynamics of long-range number units
- also evidence of “short-range” number units
- syntactic tree-depth probe – try to predict tree depth from network activations
- after decorrelating word position and tree depth (see the diagnostic-classifier sketch after this list)
- interaction between syntax units & number units
- evidence that LSTMs aren’t just using surface heuristics, but tracking syntactic structure
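
A minimal sketch of the single-unit ablation idea, just to make it concrete. This is a toy PyTorch LSTM LM I made up, not the trained model from the paper, and the ablated unit index is arbitrary:

```python
# Toy single-unit ablation in an LSTM LM (PyTorch); not the paper's model.
import torch
import torch.nn as nn

class TinyLSTMLM(nn.Module):
    def __init__(self, vocab_size, dim=650):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.lstm = nn.LSTM(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, vocab_size)

    def forward(self, tokens, ablate_unit=None):
        # Run one timestep at a time so we can zero out one hidden unit
        # after every step (the ablation).
        x = self.embed(tokens)
        h = torch.zeros(1, tokens.size(0), self.lstm.hidden_size)
        c = torch.zeros_like(h)
        logits = []
        for t in range(x.size(1)):
            out, (h, c) = self.lstm(x[:, t:t+1, :], (h, c))
            if ablate_unit is not None:
                h[..., ablate_unit] = 0.0   # kill the unit's activation
                c[..., ablate_unit] = 0.0   # and its memory cell
            logits.append(self.out(out))
        return torch.cat(logits, dim=1)     # (batch, seq, vocab)

# Compare the model's preference for the correct verb form on an
# agreement stimulus, with and without ablating one unit.
vocab = {"the": 0, "boy": 1, "near": 2, "cars": 3, "is": 4, "are": 5}
model = TinyLSTMLM(len(vocab))  # untrained toy, just to show the mechanics
prefix = torch.tensor([[vocab[w] for w in ["the", "boy", "near", "cars"]]])
with torch.no_grad():
    for unit in (None, 125):    # 125 is illustrative, not a unit from the paper
        logits = model(prefix, ablate_unit=unit)[0, -1]
        pref = (logits[vocab["is"]] - logits[vocab["are"]]).item()
        print(f"ablated unit: {unit}, preference for 'is' over 'are': {pref:+.3f}")
```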
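And a sketch of the tree-depth diagnostic probe. The hidden states and gold depths here are random placeholders, and instead of the paper's decorrelation of position and depth I just compare against a position-only baseline:

```python
# Diagnostic probe for tree depth; activations and depths are random
# placeholders standing in for real LM hidden states and gold parses.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_tokens, dim = 5000, 650
hidden_states = rng.normal(size=(n_tokens, dim))  # per-token LM activations
positions = rng.integers(1, 30, size=n_tokens)    # word position in sentence
tree_depths = rng.integers(1, 10, size=n_tokens)  # gold syntactic depth

# Probe: predict tree depth from the activations alone.
r2_states = cross_val_score(Ridge(alpha=1.0), hidden_states,
                            tree_depths, cv=5).mean()

# Control for the position/depth confound: here just a position-only
# baseline (a stand-in for the decorrelation done in the paper).
r2_position = cross_val_score(Ridge(alpha=1.0), positions.reshape(-1, 1),
                              tree_depths, cv=5).mean()

print(f"R^2 from hidden states: {r2_states:.3f}")
print(f"R^2 from position only: {r2_position:.3f}")
```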
Neural Self-Training through Spaced Repetition
- labeled data is important but hard to get
- self-training – the model labels a sample of unlabeled data and retrains on it, but won’t always explore the entire data space
- pretraining – want to decouple the pretraining & fine-tuning tasks
- data sampling based on Leitner Queues
- inspired by “scheduled learning” in humans
- adaptively learn how to move instances between queues and which queue to sample from (see the sketch after this list)
- some analysis of how well queue instances match the training data, the diversity of instances, and the sampling policy
- this seemed interesting but I got distracted, will read the paper I guess!
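
Here's my best guess at the Leitner-queue scheduling in code. The promotion/demotion rules and the choice of which queue to sample are simplified from what I gathered in the talk, not the paper's exact algorithm:

```python
# Leitner-queue sampling for self-training; promotion/demotion rules and
# the sampling choice here are my simplification, not the paper's algorithm.
class LeitnerQueues:
    def __init__(self, instances, n_queues=5):
        self.n_queues = n_queues
        self.queue_of = {inst: 0 for inst in instances}  # everyone starts low

    def update(self, correct_by_instance):
        """Promote an instance one queue if the current model got it right,
        demote it to queue 0 if it got it wrong (classic Leitner schedule)."""
        for inst, correct in correct_by_instance.items():
            if correct:
                self.queue_of[inst] = min(self.queue_of[inst] + 1,
                                          self.n_queues - 1)
            else:
                self.queue_of[inst] = 0

    def sample(self, queue_index):
        """Instances currently sitting in one queue; per the talk, a learned
        policy decides which queue(s) to draw self-training data from."""
        return [inst for inst, q in self.queue_of.items() if q == queue_index]

# Toy usage with fake instance ids and fake model judgments:
queues = LeitnerQueues(instances=range(10))
queues.update({i: (i % 2 == 0) for i in range(10)})  # model "gets evens right"
print(queues.sample(1))  # instances the model currently handles
print(queues.sample(0))  # instances it still gets wrong
```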
Neural Language Models as Psycholinguistic Subjects
- what information is contained in representations learned from neural LMs?
- what’s a parse tree really?
- not just something that makes linguists happy
- encodes the possible grammatical unfoldings of a sentence in a human-understandable way
- psycholinguistics uses reading time as proxy for difficulty
- infer e.g. what data structures people are using internally
- this provides evidence for these trees existing in the first place!
- do the same for neural models, using the negative log-likelihood (surprisal) of a word given its context (see the sketch at the end of these notes)
- measure penalty for surprising continuations
- construct examples of different kinds of syntactic state
- do NN models use the state?
- what cues do they use?
- two examples: subordinate clauses & garden-path sentences
- can compare models that use grammars vs. those that don’t
- syntactic supervision is worth 100x more data
- hard to predict patterns in what models do and don’t learn
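
A quick sketch of how per-word surprisal is computed, using GPT-2 from Hugging Face transformers as a stand-in LM (not one of the models compared in the talk):

```python
# Per-token surprisal = -log P(word | context). GPT-2 is just a stand-in.
import math
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def surprisals(text):
    enc = tokenizer(text, return_tensors="pt")
    ids = enc["input_ids"][0]
    with torch.no_grad():
        logits = model(**enc).logits[0]               # (seq_len, vocab)
    # Token t's probability comes from position t-1's logits.
    log_probs = torch.log_softmax(logits[:-1], dim=-1)
    nll = -log_probs[torch.arange(len(ids) - 1), ids[1:]]
    bits = (nll / math.log(2)).tolist()               # nats -> bits
    return list(zip(tokenizer.convert_ids_to_tokens(ids[1:].tolist()), bits))

# A garden-path sentence should show a surprisal spike at the
# disambiguating word compared to its unambiguous control.
for sent in ["The horse raced past the barn fell.",
             "The horse that was raced past the barn fell."]:
    print(sent)
    for tok, s in surprisals(sent):
        print(f"  {tok:>10s}  {s:6.2f} bits")
```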