
Enhancing Non-Autoregressive Neural Machine Translation with Reordering Information
Discover how the ReorderNAT model addresses the limitations of traditional NMT models by incorporating reordering information to guide decoding, improving speed and accuracy in translation tasks. Learn about its methodology, experiments, and the importance of reordering for efficient NAT decoding.
Presentation Transcript
Guiding Non-Autoregressive Neural Machine Translation Decoding with Reordering Information
Qiu Ran, Yankai Lin, Peng Li, Jie Zhou
Pattern Recognition Center, WeChat AI, Tencent Inc., China
Speaker: Yu-Chen Kuan
OUTLINE: Introduction, Methodology, Experiments, Conclusion
Introduction
State-of-the-art NMT models usually suffer from slow inference speed due to their autoregressive property. Non-autoregressive NMT (NAT) models instead decode all target words simultaneously, breaking this bottleneck of autoregressive (AT) models.
Introduction
NAT models still suffer from the multimodality problem because they discard the dependencies among target words. The authors argue that reordering information is essential for NAT models and helps alleviate the multimodality problem. They propose ReorderNAT, which explicitly models the reordering information to guide the decoding of NAT.
NAT
Given a source sentence X = {x1, ..., xn} and a target sentence Y = {y1, ..., ym}, NAT models the translation probability from X to Y as a product of conditionally independent target-word probabilities: P(Y|X) = ∏_{i=1}^{m} P(y_i | X). Instead of utilizing the translation history, NAT models usually copy source word representations as the input of the decoder.
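As a rough illustration of this conditional independence (a toy sketch, not the authors' implementation; the decoder callable and the source representations are placeholder stand-ins), every target position can be predicted in a single parallel step:

```python
import torch

def nat_decode(decoder, src_repr, tgt_len):
    # Toy NAT decoding: all target positions are predicted in parallel,
    # each conditioned only on the source, never on other target words.
    # Uniformly "copy" source word representations as the decoder input.
    idx = torch.linspace(0, src_repr.size(0) - 1, steps=tgt_len).long()
    dec_input = src_repr[idx]              # (tgt_len, d_model)
    logits = decoder(dec_input)            # (tgt_len, vocab_size), in one shot
    return logits.argmax(dim=-1)           # independent choice per position
```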
ReorderNAT
ReorderNAT employs a reordering module to explicitly model the reordering information during decoding. It first reorders the source sentence X into a pseudo-translation Z = {z1, ..., zm}, formed by source words but arranged in the target-language word order, and then generates the target translation Y conditioned on it.
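For intuition, a made-up German-to-English example (illustrative only, not taken from the paper): the pseudo-translation keeps the source words but arranges them in target-language word order.

```python
# Illustrative toy example of a pseudo-translation (not from the paper).
src    = ["den", "Film", "habe", "ich", "gesehen"]   # German source sentence
pseudo = ["ich", "habe", "gesehen", "den", "Film"]   # same words, English word order
tgt    = ["I", "have", "seen", "the", "movie"]       # target translation
```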
ReorderNAT
ReorderNAT models the overall translation probability as P(Y|X) = Σ_Z P(Z|X) · P(Y|Z, X), where P(Z|X) is modeled by the reordering module and P(Y|Z, X) is modeled by the decoder module.
Two Reordering Modules: NAT & AT
NAT Reordering Module: P(z_i | X) is calculated by a single-layer Transformer. A length predictor is used to determine the length of the pseudo-translation.
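The slides do not show how the length predictor works; the sketch below assumes one common NAT-style design (a classifier over possible lengths on top of pooled encoder states), which may differ from the paper's exact implementation.

```python
import torch.nn as nn

class LengthPredictor(nn.Module):
    # Assumed design: predict the pseudo-translation length from the encoder output
    # by classifying over possible lengths. Names and sizes are placeholders.
    def __init__(self, d_model, max_len=200):
        super().__init__()
        self.proj = nn.Linear(d_model, max_len)

    def forward(self, enc_out):              # enc_out: (src_len, d_model)
        pooled = enc_out.mean(dim=0)         # summarize the source sentence
        return self.proj(pooled).argmax(-1)  # predicted length index (0..max_len-1)
```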
Two Reordering Modules: NAT & AT
AT models are more suitable than NAT models for modeling the reordering information. Even a light AT model with a decoding speed similar to that of a large NAT model can achieve better performance.
Two Reordering Modules: NAT & AT
AT Reordering Module: P(Z|X) = ∏_{i=1}^{m} P(z_i | z_<i, X), where z_<i = {z1, ..., z_{i-1}} denotes the pseudo-translation history; P(z_i | z_<i, X) is calculated by a single-layer recurrent neural network.
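A minimal sketch of such a light autoregressive reordering module, assuming a single-layer GRU that emits one pseudo-translation word at a time; the interface, the pooled source summary, and the omission of attention are simplifications rather than the paper's exact architecture.

```python
import torch.nn as nn

class ATReorderingModule(nn.Module):
    # Single-layer GRU that generates the pseudo-translation token by token,
    # conditioning each z_i on the history z_<i and a source summary (sketch only).
    def __init__(self, vocab_size, d_model):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.gru = nn.GRU(d_model, d_model)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, prev_tokens, src_summary):
        # prev_tokens: (i,) previously generated pseudo-translation tokens
        # src_summary: (1, d_model) pooled source representation used as h_0
        emb = self.embed(prev_tokens).unsqueeze(1)   # (i, 1, d_model)
        h0 = src_summary.unsqueeze(0)                # (1, 1, d_model)
        out, _ = self.gru(emb, h0)
        return self.out(out[-1, 0])                  # logits for the next word z_i
```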
Decoder
Decoder Module: the input of the ReorderNAT decoder module is the embeddings of the pseudo-translation instead of the copied embeddings of the source sentence. K and N − K decoder blocks are used for the reordering and decoder modules respectively; K is set to 1 for an AT module, while N − 1 for an NAT module.
Guiding Decoding Strategy
The pseudo-translation is utilized as a bridge to guide the decoding of the target sentence, i.e. to find the Y that maximizes Σ_Z P(Y | Z, X) P(Z | X) (Eq. 6).
Guiding Decoding Strategy
Obtaining an exact solution that maximizes Eq. 6 is intractable due to its high time complexity, so two approximate strategies are proposed: the deterministic guiding decoding (DGD) strategy and the non-deterministic guiding decoding (NDGD) strategy.
DGD strategy
DGD first generates the most probable pseudo-translation of the source sentence, Ẑ = argmax_Z P(Z | X), and then translates the source sentence conditioned on it, Ŷ = argmax_Y P(Y | Ẑ, X). It is simple and effective, but the hard approximation may introduce some noise.
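A minimal sketch of the DGD pipeline; the two modules and their greedy_decode methods are assumed placeholders, not the paper's code.

```python
def dgd_translate(reorder_module, decoder_module, src):
    # Deterministic guiding decoding (DGD), as described above:
    # 1) pick the single most probable pseudo-translation Z_hat,
    # 2) translate conditioned on that hard choice.
    z_hat = reorder_module.greedy_decode(src)          # ~ argmax_Z P(Z | X)
    y_hat = decoder_module.greedy_decode(z_hat, src)   # ~ argmax_Y P(Y | Z_hat, X)
    return y_hat
```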
NDGD strategy
NDGD regards the probability distribution Q over the words generating the most probable pseudo-translation as a latent variable, computed as Q(z_i | X) = softmax(s(z_i | X) / T), where s(·) is a score function over pseudo-translation words (the input of the softmax layer in the decoder) and T is a temperature coefficient.
DGD vs. NDGD
The major difference between the DGD and NDGD strategies is the input of the decoder module: the DGD strategy directly utilizes the word embeddings of the generated pseudo-translation, whereas the NDGD strategy utilizes word embeddings weighted by the word probabilities of the pseudo-translation (see the sketch below).
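A minimal sketch of the two decoder inputs, assuming the reordering module exposes per-position pre-softmax scores over the vocabulary; names, shapes, and the embedding table are placeholders.

```python
import torch

def dgd_input(embed, scores):
    # DGD: hard choice -- embed the argmax pseudo-translation tokens.
    z_hat = scores.argmax(dim=-1)          # (m,)
    return embed(z_hat)                    # (m, d_model)

def ndgd_input(embed, scores, T=0.2):
    # NDGD: soft choice -- expected embedding under Q(z_i | X) = softmax(s / T).
    q = torch.softmax(scores / T, dim=-1)  # (m, vocab_size)
    return q @ embed.weight                # (m, d_model), probability-weighted
```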
Training
ReorderNAT is optimized by maximizing a joint loss in which L_R and L_T denote the reordering and translation losses respectively. For both DGD and NDGD, the reordering loss L_R is defined as the log-likelihood of the reference pseudo-translation.
Training
The DGD and NDGD approaches use different translation losses (written out below). The model trained with the DGD approach is used to initialize the model for the NDGD approach, since if Q is not well trained, L_T converges very slowly.
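Written out (a reconstruction consistent with the slides' wording; the paper's exact notation, and any weighting between the two terms, may differ), with Z* the reference pseudo-translation and Y* the reference translation:

```latex
\mathcal{L} = \mathcal{L}_R + \mathcal{L}_T, \qquad
\mathcal{L}_R = \log P(Z^{*} \mid X), \qquad
\mathcal{L}_T^{\mathrm{DGD}} = \log P(Y^{*} \mid Z^{*}, X), \qquad
\mathcal{L}_T^{\mathrm{NDGD}} = \log P(Y^{*} \mid Q, X)
```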
Datasets
WMT14 En-De, WMT16 En-Ro, IWSLT16 En-De, and Chinese-English translation (LDC corpora).
Experimental Settings
The fast_align tool is used to generate the reference pseudo-translations.
For IWSLT16 En-De: a 5-layer Transformer model with d_model = 278, d_hidden = 507, n_head = 2, p_dropout = 0.1.
For WMT14 En-De, WMT16 En-Ro, and Chinese-English translation: a 6-layer Transformer model with d_model = 512, d_hidden = 512, n_head = 8, p_dropout = 0.1.
Experimental Settings
The GRU reordering module uses the same hidden size as the Transformer model on each dataset. For each dataset, the optimal guiding decoding strategy is selected according to model performance on the validation set. Label smoothing: 0.15. Sequence-level knowledge distillation is utilized. T in Eq. 10: 0.2.
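For reference, the settings listed on these slides gathered into one place (the dictionary keys are chosen here for readability and are not the paper's naming):

```python
# Hyperparameters as listed on the slides; key names are ours.
REORDERNAT_SETTINGS = {
    "IWSLT16 En-De": {"layers": 5, "d_model": 278, "d_hidden": 507, "n_head": 2, "dropout": 0.1},
    "WMT14 En-De / WMT16 En-Ro / LDC Zh-En": {"layers": 6, "d_model": 512, "d_hidden": 512, "n_head": 8, "dropout": 0.1},
    "label_smoothing": 0.15,
    "temperature_T": 0.2,   # T in Eq. 10 (NDGD)
}
```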
Baselines
Transformer_full: the teacher model used in knowledge distillation. Transformer_one: a lighter version of Transformer whose decoder has only one layer. Transformer_gru: replaces the decoder of Transformer_full with a single-layer GRU.
Case Study
Conclusion
To address the multimodality problem in NAT, the authors propose ReorderNAT, which explicitly models the reordering information in the decoding procedure. Deterministic and non-deterministic guiding decoding strategies utilize the reordering information to encourage the decoder to choose words belonging to the same translation. ReorderNAT achieves better performance than most existing NAT models, with a significant speedup.