Introduction to Tree-LSTMs

  • 1.

    Improved Semantic Representations From Tree-Structured Long Short-Term Memory Networks, by Kai Sheng Tai, Richard Socher, Christopher D. Manning. Presented by Daniel Perez (tuvistavie), CTO @ Claude Tech, M2 @ The University of Tokyo. October 2, 2017

  • 2.

    Distributed representation of words
    Idea: Encode each word using a vector in R^d, such that words with similar meanings are close in the vector space.
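The "close in vector space" idea can be made concrete with cosine similarity. A minimal numpy sketch with a toy three-word vocabulary (the embedding values are made up for illustration, not trained vectors):

```python
import numpy as np

# Toy d = 3 dimensional embeddings (illustrative values, not trained):
# semantically close words get nearby vectors.
embeddings = {
    "good":  np.array([0.9, 0.8, 0.1]),
    "great": np.array([0.85, 0.75, 0.2]),
    "cat":   np.array([0.1, 0.2, 0.9]),
}

def cosine(u, v):
    """Cosine similarity: close to 1 for vectors pointing the same way."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

sim_good_great = cosine(embeddings["good"], embeddings["great"])
sim_good_cat = cosine(embeddings["good"], embeddings["cat"])
```

With these toy vectors, "good" and "great" come out far more similar to each other than either is to "cat".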

  • 3.
  • 4.
  • 5.

    Basic RNN cell
    In a plain RNN, h_t is computed as follows:
    h_t = tanh(W x_t + U h_{t−1} + b)
    given g(x_t, h_{t−1}) = W x_t + U h_{t−1} + b
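The recurrence h_t = tanh(W x_t + U h_{t−1} + b) is a few lines of numpy. A sketch with illustrative sizes and random weights (no training):

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hid = 4, 5  # illustrative input and hidden sizes

# Parameters of the plain RNN cell: h_t = tanh(W x_t + U h_{t-1} + b)
W = rng.normal(scale=0.1, size=(d_hid, d_in))
U = rng.normal(scale=0.1, size=(d_hid, d_hid))
b = np.zeros(d_hid)

def rnn_cell(x_t, h_prev):
    return np.tanh(W @ x_t + U @ h_prev + b)

# Unroll the same cell over a short random input sequence.
h = np.zeros(d_hid)
for x_t in rng.normal(size=(3, d_in)):
    h = rnn_cell(x_t, h)
```

Note that the same W and U are reused at every time step; the repeated multiplication by U is exactly what causes the vanishing-gradient issue mentioned on the next slide.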

  • 6.

    Basic RNN cell
    In a plain RNN, h_t is computed as follows:
    h_t = tanh(W x_t + U h_{t−1} + b)
    given g(x_t, h_{t−1}) = W x_t + U h_{t−1} + b
    Issue: Because of vanishing gradients, gradients do not propagate well through the network, making it impossible to learn long-term dependencies.

  • 7.

    Long short-term memory (LSTM)
    Goal: Improve the RNN architecture to learn long-term dependencies.
    Main ideas:
    • Add a memory cell which does not suffer from vanishing gradients
    • Use gating to control how information propagates
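The two ideas (memory cell plus gating) can be sketched as a standard LSTM cell in numpy. Sizes and random weights are illustrative only:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
d_in, d_hid = 4, 5  # illustrative sizes

# One (W, U, b) triple per gate / candidate: input i, forget f, output o,
# and cell candidate u.
params = {g: (rng.normal(scale=0.1, size=(d_hid, d_in)),
              rng.normal(scale=0.1, size=(d_hid, d_hid)),
              np.zeros(d_hid)) for g in "ifou"}

def lstm_cell(x_t, h_prev, c_prev):
    gate = lambda g, act: act(params[g][0] @ x_t + params[g][1] @ h_prev + params[g][2])
    i, f, o = gate("i", sigmoid), gate("f", sigmoid), gate("o", sigmoid)
    u = gate("u", np.tanh)
    c = f * c_prev + i * u  # memory cell: additive update, so gradients flow
    h = o * np.tanh(c)      # output gated by o
    return h, c

h, c = lstm_cell(rng.normal(size=d_in), np.zeros(d_hid), np.zeros(d_hid))
```

The key line is the cell update c = f * c_prev + i * u: the previous memory is scaled, not squashed through a nonlinearity, which is what lets gradients survive over long spans.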

  • 8.
  • 9.
  • 10.
  • 11.
  • 12.

    Tree-structured LSTMs
    Goal: Improve the encoding of sentences by using their structure.
    Models:
    • Child-sum tree LSTM: sums over all the children of a node; can be used for any number of children.
    • N-ary tree LSTM: uses different parameters for each child position; better granularity, but the maximum number of children per node must be fixed.

  • 13.

    Child-sum tree LSTM
    Children outputs and memory cells are summed.
    [Figure: child-sum tree LSTM at node j with children k1 and k2]
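A minimal numpy sketch of one child-sum tree-LSTM node, following the paper's recipe (sum the child outputs for the input/output/candidate gates, one forget gate per child with shared weights); sizes and random weights are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(2)
d_in, d = 4, 5  # illustrative input and hidden sizes

# One (W, U, b) triple per gate (input i, forget f, output o, candidate u);
# the same forget-gate weights are shared by every child.
P = {g: (rng.normal(scale=0.1, size=(d, d_in)),
         rng.normal(scale=0.1, size=(d, d)),
         np.zeros(d)) for g in "ifou"}

def child_sum_node(x_j, children):
    """children: a list of (h_k, c_k) pairs; any number, order does not matter."""
    h_tilde = sum((h for h, _ in children), np.zeros(d))  # sum of child outputs
    lin = lambda g, h: P[g][0] @ x_j + P[g][1] @ h + P[g][2]
    i = sigmoid(lin("i", h_tilde))
    o = sigmoid(lin("o", h_tilde))
    u = np.tanh(lin("u", h_tilde))
    # One forget gate per child, computed from that child's own output.
    f = [sigmoid(lin("f", h_k)) for h_k, _ in children]
    c = i * u + sum((f_k * c_k for f_k, (_, c_k) in zip(f, children)), np.zeros(d))
    h = o * np.tanh(c)
    return h, c

# Leaves have no children; an internal node sums over its two leaves.
h1, c1 = child_sum_node(rng.normal(size=d_in), [])
h2, c2 = child_sum_node(rng.normal(size=d_in), [])
x_root = rng.normal(size=d_in)
h_root, c_root = child_sum_node(x_root, [(h1, c1), (h2, c2)])
```

Because everything is a sum over children, swapping the two children leaves the node's output unchanged, which is exactly the order-insensitivity property listed on the next slide.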

  • 14.

    Child-sum tree LSTM
    Properties:
    • Does not take children order into account
    • Works with a variable number of children
    • Shares gate weights (including the forget gate) between children
    Application: Dependency Tree-LSTM, where the number of dependents is variable.

  • 15.

    N-ary tree LSTM
    Given g_k^(n)(x_t, h_{l1}, · · · , h_{lN}) = W^(n) x_t + Σ_{l=1}^{N} U_{kl}^(n) h_{jl} + b^(n)
    [Figure: binary tree LSTM at node j with children k1 and k2]

  • 16.

    N-ary tree LSTM
    Properties:
    • Each node must have at most N children
    • Fine-grained control over how information propagates
    • The forget gate can be parameterized so that siblings affect each other
    Application: Constituency Tree-LSTM, using a binary tree LSTM.
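A numpy sketch of one binary (N = 2) tree-LSTM node with per-position parameters. This is a simplification: each child's forget gate uses only that position's weights, whereas the paper's full parameterization also lets siblings influence each other's forget gates via off-diagonal U_kl terms. Sizes and random weights are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(3)
d_in, d, N = 4, 5, 2  # binary tree LSTM: exactly N = 2 child slots

# Each gate gets one U matrix per child position (shape (N, d, d)), so the
# left and right children are treated with different parameters.
P = {g: (rng.normal(scale=0.1, size=(d, d_in)),
         rng.normal(scale=0.1, size=(N, d, d)),
         np.zeros(d)) for g in "ifou"}

def nary_node(x_j, children):
    """children: exactly N (h_l, c_l) pairs; position matters."""
    assert len(children) == N
    def lin(g):
        W, U, b = P[g]
        return W @ x_j + sum(U[l] @ h for l, (h, _) in enumerate(children)) + b
    i, o = sigmoid(lin("i")), sigmoid(lin("o"))
    u = np.tanh(lin("u"))
    # Simplified forget gates: per-position weights only; the paper's
    # sibling-interaction terms (U_kl with k != l) are omitted here.
    f = [sigmoid(P["f"][0] @ x_j + P["f"][1][l] @ h + P["f"][2])
         for l, (h, _) in enumerate(children)]
    c = i * u + sum(f_l * c_l for f_l, (_, c_l) in zip(f, children))
    h = o * np.tanh(c)
    return h, c

# Leaves are padded with zero child states to keep the arity fixed.
zero = (np.zeros(d), np.zeros(d))
left = nary_node(rng.normal(size=d_in), [zero, zero])
right = nary_node(rng.normal(size=d_in), [zero, zero])
x_root = rng.normal(size=d_in)
h_lr, _ = nary_node(x_root, [left, right])
h_rl, _ = nary_node(x_root, [right, left])
```

Unlike the child-sum variant, swapping the left and right children changes the output, since each position has its own U matrices.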

  • 17.

    Sentiment classification
    Task: Predict the sentiment ŷ_j of node j.
    Sub-tasks:
    • Binary classification
    • Fine-grained classification over 5 classes
    Method:
    • Annotation at the node level
    • Uses the negative log-likelihood error
    p̂_θ(y | {x}_j) = softmax(W^(s) h_j + b^(s))
    ŷ_j = argmax_y p̂_θ(y | {x}_j)
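The classification head is just a softmax layer on the node representation h_j. A numpy sketch with illustrative sizes and random (untrained) weights:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(4)
d, n_classes = 5, 5  # fine-grained sub-task: 5 sentiment classes

W_s = rng.normal(scale=0.1, size=(n_classes, d))  # W^(s)
b_s = np.zeros(n_classes)                         # b^(s)

def predict(h_j):
    """p_hat = softmax(W^(s) h_j + b^(s)); y_hat = argmax over classes."""
    p_hat = softmax(W_s @ h_j + b_s)
    return p_hat, int(np.argmax(p_hat))

def nll_loss(p_hat, y):
    """Negative log-likelihood of the annotated class y at this node."""
    return -np.log(p_hat[y])

p_hat, y_hat = predict(rng.normal(size=d))
```

Because annotation is at the node level, this loss is summed over every annotated node of the tree during training, not just the root.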

  • 18.

    Sentiment classification results
    Constituency Tree-LSTM performs best on the fine-grained sub-task.

    Method                                   Fine-grained  Binary
    CNN-multichannel                         47.4          88.1
    LSTM                                     46.4          84.9
    Bidirectional LSTM                       49.1          87.5
    2-layer Bidirectional LSTM               48.5          87.2
    Dependency Tree-LSTM                     48.4          85.7
    Constituency Tree-LSTM
      - randomly initialized vectors         43.9          82.0
      - Glove vectors, fixed                 49.7          87.5
      - Glove vectors, tuned                 51.0          88.0

  • 19.

    Semantic relatedness
    Task: Predict a similarity score in [1, K] between two sentences.
    Method: Similarity between sentences L and R is annotated with a score ∈ [1, 5].
    • Produce representations h_L and h_R
    • Compute the distance h_+ and angle h_× between h_L and h_R
    • Compute the score using a fully connected NN:
    h_s = σ(W^(×) h_× + W^(+) h_+ + b^(h))
    p̂_θ = softmax(W^(p) h_s + b^(p))
    ŷ = r^T p̂_θ, with r = [1, 2, 3, 4, 5]
    • The error is computed using the KL-divergence
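A numpy sketch of this similarity head, taking h_× as the elementwise product and h_+ as the absolute difference of the two representations, as in the paper; sizes and random weights are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(5)
d, d_s, K = 5, 8, 5  # representation size, hidden size, scores in [1, K]

W_ang = rng.normal(scale=0.1, size=(d_s, d))   # W^(x), angle features
W_dist = rng.normal(scale=0.1, size=(d_s, d))  # W^(+), distance features
b_h = np.zeros(d_s)
W_out = rng.normal(scale=0.1, size=(K, d_s))   # W^(p)
b_out = np.zeros(K)
r = np.arange(1, K + 1)                        # r = [1, 2, 3, 4, 5]

def relatedness_score(h_L, h_R):
    h_angle = h_L * h_R           # elementwise product ("angle" features)
    h_dist = np.abs(h_L - h_R)    # absolute difference ("distance" features)
    h_s = sigmoid(W_ang @ h_angle + W_dist @ h_dist + b_h)
    p_hat = softmax(W_out @ h_s + b_out)
    return float(r @ p_hat)       # expected score under p_hat, in [1, K]

score = relatedness_score(rng.normal(size=d), rng.normal(size=d))
```

Predicting a distribution over the K integer scores and taking the expectation r^T p̂ yields a real-valued score, which is why the training error is a KL-divergence to the target distribution rather than a plain regression loss.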

  • 20.

    Semantic relatedness results
    Dependency Tree-LSTM performs best on all measures.

    Method                        Pearson's r  MSE
    LSTM                          0.8528       0.2831
    Bidirectional LSTM            0.8567       0.2736
    2-layer Bidirectional LSTM    0.8558       0.2762
    Constituency Tree-LSTM        0.8582       0.2734
    Dependency Tree-LSTM          0.8676       0.2532

  • 21.

    Summary
    • Tree-LSTMs allow encoding tree topologies
    • They can be used to encode sentence parse trees
    • They can capture longer and more fine-grained word dependencies

  • 22.

    References
    Christopher Olah. Understanding LSTM Networks, 2015.
    Kai Sheng Tai, Richard Socher, and Christopher D. Manning. Improved Semantic Representations From Tree-Structured Long Short-Term Memory Networks. 2015.