論文メモ Abstractive Text Summarization using Sequence-to-sequence RNNs and Beyond

Ramesh Nallapati et al 2016

autoscale: true slidenumbers: true

Notes about

apply the attentional encoder-decoder RNN to text summarization.
- The model is developed for machine translation.
propose the novel models and show that they provide additional improvement in performance.
- The models use additional linguistic features such as parts-of-speech and so on.

Abstractive text summarization is the task of generating a headline or short summary consisting of a few sentences that captures the salient ideas of an article or a passage.

In summarization, original documents are compressed in lossy manner such that key concepts are preserved.

(Extraction of embedding vector) $$\bar{y}_t = E y_{t-1}$$
(Calculation of hidden layer) $$h_t = \tanh \left( W^{(l)} [\bar{y}_t, h_{t-1}]^t + b^{(l)} \right)$$
(Calculation of output layrer) $$o_t = W^{(o)} h_t + b^{(o)}$$
(Calculation of probability) $$p_t = \text{softmax}(o_t)$$
(Extraction of probability) $$P(y_t| Y_{<t}) = p_t \cdot y_t$$

(Forward calculation of hidden layer)
- $$\overrightarrow{h}_t = \tanh \left( \overrightarrow{W}^{(l)} [\bar{y}_t, \overrightarrow{h}_{t-1}]^t + \overrightarrow{b}^{(l)} \right)$$
(Backward calculation of hidden layer)
- $$\overleftarrow{h}_t = \tanh \left( \overleftarrow{W}^{(l)} [\bar{y}_t, \overleftarrow{h}_{t+1}]^t + \overleftarrow{b}^{(l)} \right)$$
(Calculation of output layrer)
- $$o_t = W^{(o)} [\overrightarrow{h}_t, \overleftarrow{h}_t]^t + b^{(o)}$$

(Input) $$X=(x_i)_ {i=1}^I$$
(Output) $$Y=(y_j)_ {j=0}^{J+1}$$ where $$y_0=BOS$$ and $$y_{J+1}=EOS$$
(Encoder) $$z=\Lambda(X)$$
(Hidden layer) $$h_t = \phi(h_{j-1}, y_{j-1})$$ where $$h_0 = z$$
(Probability)
- $$P(Y|X) = \prod_{j=1}^{J+1} P(y_j|Y_{<j}, X)$$
- $$P(y_j|Y_{<j}, X) = \psi(h_j, y_j)$$

(Encoder) $$h_i^{(s)} = \phi^{(s)}(x_i, h_{i-1}^{(s)})$$
(Decoder) $$h_j^{(t)} = \phi^{(t)}(y_j, h_{j-1}^{(t)})$$
(Attention)
- $$\bar{h} = \sum_{i=1}^I a_ih_i^{(s)}$$ where $$a_i = \frac{\exp(e_i)}{\sum_{\tilde{i}=1}^I \exp(e_{\tilde{i}})}$$ and $$e_i = \Omega(h_i^{(s)}, h_j^{(t)})$$
- $$\hat{h}_j^{(t)} = \tanh(W^{(a)}[\bar{h}, h_j^{(t)}])$$

Large vocabulary problems
- The computation cost of $$\text{softmax}(o_t)$$ is very high, when the vocabulary size is large, because $$\dim(o_t)$$ equals to the vocabulary size.
Approaches
- model-specific approaches
  - noise-contrastive estimation
  - binary hierarchical softmax
- translation-specific approaches

The encoder-decoder RNN with attention is described in preparation slides.
Authors' LVT is that
- the decoder-vocabulary of each mini-batch is restricted to words in the source documents of that batch.
- the most frequent words in the target dictionary are added until the vocabulary reaches a fixes size.

The algorithm is needed to handle unseen words, because the vocabulary is limited at training time.
In general, a most common way is to emit an ‘UNK’ token as a placeholder.
In summarization, an intuitive way is to simply point to their location in the source document.
Authors model this notion using our novel switching decoder/pointer architecture.
- The switch decides between using the generator or a pointer at every time step.

The probability of the switch turning on at the i-th time-step is
- $$P(s_i = 1) = \sigma(v^s \cdot (W_h^s h_i + W_e^s E o_{i-1} + W_c^s c_i + b^s))$$
- where $$E o_{i-1}$$ is the embedding vector of the emission from the previous time step,
- $$c_i$$ is the attention-weighted context vector.
The pointer value at i-th time-step is
- $$p_i = \arg \max_{j} (P^a_i(j))$$ for $$j \in {1, …, N_d}$$
- where $$P^a_i(j) = \exp(v^a \cdot (W_h^a h_{i-1} + W^a_e E o_{i-1} + W_c^a h^d_{j} + b^a))$$

At the training time, the explicit pointer information is provides whenever the summary word does not exist in the target vocabulary.
The loss function is
- $$\log P(y|x) = \sum_{i} (g_i \log {P(y_i|y_{-i}, x) P(s_i)}$$ $$ + (1-g_i) \log {P(p(i)|y_{-i}, x)(1-P(s_i))})$$
$$g_i$$ is an indicator function that is set to 0 whenever the word at position $$i$$ in the summary is OOV with respect to the decoder vocabulary.

In summarization, it is important to identify the key sentences from which the summary can be drawn.
This model aims to capture this notation of two levels of importance using bi-directional RNNs on the source sids, one at the word level and the other at the sentence level.
$$P^a(j) = \frac{P^a_w(j) P^a_s(s(j))}{\sum_{k=1}^{N_d} P^a_w(k) P^a_s(s(k))}$$