# 論文メモ Abstractive Text Summarization using Sequence-to-sequence RNNs and Beyond

Ramesh Nallapati et al 2016

autoscale: true slidenumbers: true

## Introduction

### Authors

• apply the attentional encoder-decoder RNN to text summarization.
• The model is developed for machine translation.
• propose the novel models and show that they provide additional improvement in performance.
• The models use additional linguistic features such as parts-of-speech and so on.

### Text summarization

• Abstractive text summarization is the task of generating a headline or short summary consisting of a few sentences that captures the salient ideas of an article or a passage.

### The difficulties of this kind of problems

• In summarization, original documents are compressed in lossy manner such that key concepts are preserved.

### Models: RNN language model (preparation)

• (Extraction of embedding vector) $$\bar{y}_t = E y_{t-1}$$
• (Calculation of hidden layer) $$h_t = \tanh \left( W^{(l)} [\bar{y}_t, h_{t-1}]^t + b^{(l)} \right)$$
• (Calculation of output layrer) $$o_t = W^{(o)} h_t + b^{(o)}$$
• (Calculation of probability) $$p_t = \text{softmax}(o_t)$$
• (Extraction of probability) $$P(y_t| Y_{<t}) = p_t \cdot y_t$$

### Models: bidirectional RNN language model (preparation)

• (Forward calculation of hidden layer)
• $$\overrightarrow{h}_t = \tanh \left( \overrightarrow{W}^{(l)} [\bar{y}_t, \overrightarrow{h}_{t-1}]^t + \overrightarrow{b}^{(l)} \right)$$
• (Backward calculation of hidden layer)
• $$\overleftarrow{h}_t = \tanh \left( \overleftarrow{W}^{(l)} [\bar{y}_t, \overleftarrow{h}_{t+1}]^t + \overleftarrow{b}^{(l)} \right)$$
• (Calculation of output layrer)
• $$o_t = W^{(o)} [\overrightarrow{h}_t, \overleftarrow{h}_t]^t + b^{(o)}$$

### Models: encoder-decoder RNN language model (preparation)

• (Input) $$X=(x_i)_ {i=1}^I$$
• (Output) $$Y=(y_j)_ {j=0}^{J+1}$$ where $$y_0=BOS$$ and $$y_{J+1}=EOS$$
• (Encoder) $$z=\Lambda(X)$$
• (Hidden layer) $$h_t = \phi(h_{j-1}, y_{j-1})$$ where $$h_0 = z$$
• (Probability)
• $$P(Y|X) = \prod_{j=1}^{J+1} P(y_j|Y_{<j}, X)$$
• $$P(y_j|Y_{<j}, X) = \psi(h_j, y_j)$$

### Models: Soft attention mechanism (preparation)

• (Encoder) $$h_i^{(s)} = \phi^{(s)}(x_i, h_{i-1}^{(s)})$$
• (Decoder) $$h_j^{(t)} = \phi^{(t)}(y_j, h_{j-1}^{(t)})$$
• (Attention)
• $$\bar{h} = \sum_{i=1}^I a_ih_i^{(s)}$$ where $$a_i = \frac{\exp(e_i)}{\sum_{\tilde{i}=1}^I \exp(e_{\tilde{i}})}$$ and $$e_i = \Omega(h_i^{(s)}, h_j^{(t)})$$
• $$\hat{h}_j^{(t)} = \tanh(W^{(a)}[\bar{h}, h_j^{(t)}])$$

### Models: Large vocabulary trick (preparation)

• Large vocabulary problems

• The computation cost of $$\text{softmax}(o_t)$$ is very high, when the vocabulary size is large, because $$\dim(o_t)$$ equals to the vocabulary size.
• Approaches

• model-specific approaches
• noise-contrastive estimation
• binary hierarchical softmax
• translation-specific approaches

### Models: Authors apply encoder-decoder RNN with attention and large vocabulary trick(LVT)

• The encoder-decoder RNN with attention is described in preparation slides.
• Authors' LVT is that
• the decoder-vocabulary of each mini-batch is restricted to words in the source documents of that batch.
• the most frequent words in the target dictionary are added until the vocabulary reaches a fixes size.

### Models: Authors propose the model uses the additional linguistic features.

• Linguistic features
• parts-of-speech tags (POS tag)
• named-entity tags
• MUC: organization, person, location, date, time, money, percent
• TF (Term Frequency)
• $$tf_{i, j} = \frac{n_{i, j}}{\sum_k n_{k, j}}$$
• $$n_{i, j}$$ is the number of times that term $$t_i$$ occurs in document $$d_j$$.
• IDF (Inverse Document Frequency)
• $$idf_{i}=\log \frac{|D|}{{d: t_i \in d}}$$

### Models: Authors propose modeling rare/unseen words using switching generator-pointer.

• The algorithm is needed to handle unseen words, because the vocabulary is limited at training time.
• In general, a most common way is to emit an ‘UNK’ token as a placeholder.
• In summarization, an intuitive way is to simply point to their location in the source document.
• Authors model this notion using our novel switching decoder/pointer architecture.
• The switch decides between using the generator or a pointer at every time step.

### Models: The switch

• The probability of the switch turning on at the i-th time-step is
• $$P(s_i = 1) = \sigma(vs \cdot (W_hs h_i + W_es E o_{i-1} + W_cs c_i + bs))$$
• where $$E o_{i-1}$$ is the embedding vector of the emission from the previous time step,
• $$c_i$$ is the attention-weighted context vector.
• The pointer value at i-th time-step is
• $$p_i = \arg \max_{j} (Pa_i(j))$$ for $$j \in {1, …, N_d}$$
• where $$Pa_i(j) = \exp(va \cdot (W_ha h_{i-1} + Wa_e E o_{i-1} + W_ca hd_{j} + ba))$$

### Models: The training of the switch parameters.

• At the training time, the explicit pointer information is provides whenever the summary word does not exist in the target vocabulary.
• The loss function is
• $$\log P(y|x) = \sum_{i} (g_i \log {P(y_i|y_{-i}, x) P(s_i)}$$ $$+ (1-g_i) \log {P(p(i)|y_{-i}, x)(1-P(s_i))})$$
• $$g_i$$ is an indicator function that is set to 0 whenever the word at position $$i$$ in the summary is OOV with respect to the decoder vocabulary.

### Models: Capturing hierarchical document structure with hierarchical attention

• In summarization, it is important to identify the key sentences from which the summary can be drawn.
• This model aims to capture this notation of two levels of importance using bi-directional RNNs on the source sids, one at the word level and the other at the sentence level.
• $$Pa(j) = \frac{Pa_w(j) Pa_s(s(j))}{\sum_{k=1}^{N_d} Pa_w(k) Pa_s(s(k))}$$