Assignment 4
Handout: CS 224n: Assignment #4
1. Neural Machine Translation with RNNs
Implementation: Assignment 4 Code
In the first task, a sequence-to-sequence (Seq2Seq) network with attention is built as a Neural Machine Translation (NMT) system for Cherokee-to-English translation.

Training Curves: (plots not shown)
(g). generate_sent_masks() function in nmt_model.py
generate_sent_masks() marks the "pad" positions of each source sentence in the batch. The step() function in nmt_model.py then sets \(e_t\) to -inf wherever enc_masks has 1, so the attention weights for the "pad" embeddings become zero after the softmax. Only the weights for real tokens in a sequence are taken into consideration this way.
```python
# Set e_t to -inf where enc_masks has 1
if enc_masks is not None:
    e_t.data.masked_fill_(enc_masks.bool(), -float('inf'))
```
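For reference, here is a minimal standalone sketch of what generate_sent_masks() computes, assuming source_lengths holds the true (unpadded) length of each source sentence in the batch; the argument names and shapes are assumptions, not necessarily the assignment's exact signature:

```python
import torch

def generate_sent_masks(enc_hiddens, source_lengths):
    # enc_hiddens: (b, src_len, h). Returns a (b, src_len) float mask
    # with 1 in every position that is padding.
    enc_masks = torch.zeros(enc_hiddens.size(0), enc_hiddens.size(1),
                            dtype=torch.float)
    for e_id, src_len in enumerate(source_lengths):
        enc_masks[e_id, src_len:] = 1  # everything past the real length is pad
    return enc_masks
```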
(h). Test the model
Running the trained model on the test set gives:
```
Decoding: 100%|██████████| 1000/1000 [00:56<00:00, 17.63it/s]
Corpus BLEU: 12.784050459913443
```
(i). Different Attention Mechanisms
dot product attention: \(\mathbf{e}_{t,i} = \mathbf{s}^{T}_{t}\mathbf{h}_{i}\)
multiplicative attention: \(\mathbf{e}_{t,i} = \mathbf{s}^{T}_{t}\mathbf{W}\mathbf{h}_{i}\)
additive attention: \(\mathbf{e}_{t,i} = \mathbf{v}^{T}\tanh(\mathbf{W}_{1}\mathbf{h}_{i} + \mathbf{W}_{2}\mathbf{s}_{t})\)
1. dot product attention compared to multiplicative attention

| advantage | disadvantage |
|---|---|
| Faster to compute, with no extra parameters to learn | Less expressive: \(\mathbf{s}_{t}\) and \(\mathbf{h}_{i}\) must have the same dimension, and there is no learned notion of similarity |

2. additive attention compared to multiplicative attention

| advantage | disadvantage |
|---|---|
| More versatile: the separate projections \(\mathbf{W}_{1}, \mathbf{W}_{2}\) and the non-linearity make it more expressive, and \(\mathbf{s}_{t}\) and \(\mathbf{h}_{i}\) may have different dimensions | Heavier computation: two weight matrices plus \(\mathbf{v}\) mean more parameters and more operations per score |

A PyTorch sketch of all three scoring functions is given below.
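The following is a minimal sketch of the three scoring functions, assuming \(\mathbf{s}_{t}\) is a (b, h) tensor and the encoder states are stacked into a (b, src_len, h) tensor H; the names, shapes, and module boundaries are illustrative assumptions, not the assignment's code:

```python
import torch
import torch.nn as nn

def dot_product_score(s_t, H):
    # e_{t,i} = s_t^T h_i  ->  (b, src_len)
    return torch.bmm(H, s_t.unsqueeze(2)).squeeze(2)

class MultiplicativeScore(nn.Module):
    def __init__(self, h_dim):
        super().__init__()
        self.W = nn.Linear(h_dim, h_dim, bias=False)

    def forward(self, s_t, H):
        # e_{t,i} = s_t^T W h_i
        return torch.bmm(self.W(H), s_t.unsqueeze(2)).squeeze(2)

class AdditiveScore(nn.Module):
    def __init__(self, s_dim, h_dim, attn_dim):
        super().__init__()
        self.W1 = nn.Linear(h_dim, attn_dim, bias=False)
        self.W2 = nn.Linear(s_dim, attn_dim, bias=False)
        self.v = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, s_t, H):
        # e_{t,i} = v^T tanh(W1 h_i + W2 s_t); works even when s_dim != h_dim
        return self.v(torch.tanh(self.W1(H)
                                 + self.W2(s_t).unsqueeze(1))).squeeze(2)
```

Note that only additive attention accepts decoder and encoder states of different dimensions, which is the versatility the second table refers to.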
2. Analyzing NMT Systems
(a). Cherokee is a polysynthetic language: a single word can be built from many morphemes, so a word-level vocabulary would be very large and most word forms would be rare. Modeling at the subword level keeps the vocabulary manageable and shares statistics across related word forms.
(b). The transliterated Cherokee text might align more closely with morphemic conventions, making it easier to split into meaningful subword units.
(c). Multilingual NMT, with the inductive bias that "the learning signal from one language should benefit the quality of translation to other languages", is a potential remedy for the scarcity of Cherokee training data.
(d).
i. Wrong pronouns: the model's handling of pronouns is insufficient. Adding more layers to the final vocabulary projection might help.
ii. Repeated words: the model places overly high attention on the adjective. Changing the attention mechanism might help.
iii. Less expressive words: the model has low representational capacity. Adding more layers to the projection layers of the encoder and decoder might help.
(f). Computing the BLEU score
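A minimal, self-contained sketch of sentence-level BLEU as defined in [1]: the geometric mean of clipped (modified) n-gram precisions times a brevity penalty. The function names and the max_n parameter are illustrative, not the assignment's code:

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, references, max_n=4):
    # Modified n-gram precisions p_1..p_max_n, with candidate counts clipped
    # by the maximum count of each n-gram over all references
    log_precisions = []
    for n in range(1, max_n + 1):
        cand = ngram_counts(candidate, n)
        max_ref = Counter()
        for ref in references:
            max_ref |= ngram_counts(ref, n)  # elementwise max over references
        clipped = sum(min(count, max_ref[g]) for g, count in cand.items())
        total = max(sum(cand.values()), 1)
        log_precisions.append(math.log(max(clipped, 1e-9) / total))
    # Brevity penalty: compare candidate length c with the reference length r
    # closest to it (ties broken toward the shorter reference)
    c = len(candidate)
    r = min((abs(len(ref) - c), len(ref)) for ref in references)[1]
    bp = 1.0 if c > r else math.exp(1 - r / c)
    # Uniform weights w_n = 1/max_n, as in the standard BLEU definition
    return bp * math.exp(sum(log_precisions) / max_n)

cand = "the cat sat on the mat".split()
refs = ["the cat is on the mat".split()]
print(f"BLEU-2: {bleu(cand, refs, max_n=2):.4f}")  # ~0.7071
```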
References:
[1]. BLEU: a Method for Automatic Evaluation of Machine Translation
[2]. Neural Machine Translation by Jointly Learning to Align and Translate
[3]. Attention and Augmented Recurrent Neural Networks