Kyoto University Asian Translation Workshop Submissions Overview

Explore Kyoto University's submissions to the 3rd Workshop on Asian Translation, including details on Kyoto-EBMT and Kyoto-NMT systems. Learn about Example-Based Machine Translation, Sequence-to-Sequence models, and the KyotoEBMT pipeline in this comprehensive overview.

  • Kyoto University
  • Translation Workshop
  • Machine Translation
  • Asian Languages


Presentation Transcript


  1. Kyoto University Participation to the 3rd Workshop on Asian Translation
  Fabien Cromieres, Chenhui Chu, Toshiaki Nakazawa, Sadao Kurohashi

  2. Overview of our submissions
  Two systems:
  • Kyoto-EBMT: Example-Based Machine Translation. Uses dependency analysis on both the source and the target side. Some small incremental improvements over our last year's participation.
  • Kyoto-NMT: our new implementation of the Neural MT paradigm. Sequence-to-Sequence model with an attention mechanism, as first introduced by (Bahdanau et al., 2015).
  Tasks: ASPEC Ja->En, ASPEC En->Ja, ASPEC Ja->Zh, ASPEC Zh->Ja

  3. KyotoEBMT

  4. KyotoEBMT Overview
  • Example-Based MT paradigm: needs a parallel corpus; few language-specific assumptions (still a few language-specific rules).
  • Tree-to-Tree Machine Translation: maybe the least commonly used variant of x-to-x; sensitive to the parsing quality of both the source and target languages; maximizes the chances of preserving information.
  • Dependency trees: less commonly used than constituent trees; the most natural choice for Japanese; should contain all the important semantic information.

  5. KyotoEBMT pipeline
  A fairly classic pipeline:
  1. Preprocessing of the parallel corpus
  2. Processing of the input sentence
  3. Decoding / Tuning / Reranking
  Tuning and reranking are done with kbMira, which seems to work better than PRO for us.

  6. KyotoNMT

  7. KyotoNMT Overview
  • Uses the sequence-to-sequence with attention model as proposed in (Bahdanau et al., 2015), with other subsequent improvements: UNK-tag replacement (Luong et al., 2015), Adam training, sub-word units.
  • Hopefully we can add more original ideas in the future.
  • Implemented in Python using the Chainer library; a version is open-sourced under the GPL.

  8. Sequence-to-Sequence with Attention (Bahdanau et al., 2015)
  [Architecture diagram: source word embeddings (620 units) feed LSTM encoder layers (1000 units); an attention model combines the encoder states with the previous decoder state into the current context; the decoder LSTM (1000 units), a maxout layer (500 units) and a softmax over the target vocabulary (30,000 words) produce the new word, with the target embedding (620 units) of the previously generated word fed back in. Example source: "I am a Student".]
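
  As a rough illustration of the attention step in this architecture, here is a minimal NumPy sketch (not the actual KyotoNMT/Chainer code): it computes additive, Bahdanau-style attention weights over the encoder states and the resulting context vector. The names and toy dimensions (enc_states, dec_state, the 1000/620-unit sizes) follow the diagram but are otherwise assumptions.

      import numpy as np

      def softmax(x):
          e = np.exp(x - x.max())
          return e / e.sum()

      def additive_attention(enc_states, dec_state, W_enc, W_dec, v):
          # enc_states: (src_len, 2*hidden)  bidirectional encoder states
          # dec_state:  (hidden,)            previous decoder state
          # Bahdanau-style score: v . tanh(W_enc h_j + W_dec s_{i-1})
          scores = np.tanh(enc_states @ W_enc + dec_state @ W_dec) @ v
          weights = softmax(scores)        # attention over source positions
          context = weights @ enc_states   # weighted sum = current context
          return context, weights

      # Toy dimensions loosely matching the slide (hidden = 1000)
      hidden, src_len = 1000, 4
      rng = np.random.default_rng(0)
      enc_states = rng.standard_normal((src_len, 2 * hidden))
      dec_state = rng.standard_normal(hidden)
      W_enc = 0.01 * rng.standard_normal((2 * hidden, hidden))
      W_dec = 0.01 * rng.standard_normal((hidden, hidden))
      v = 0.01 * rng.standard_normal(hidden)
      context, weights = additive_attention(enc_states, dec_state, W_enc, W_dec, v)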

  9. Depending on the experiment:
  • Recurrent unit: GRU, LSTM, or 2-layer LSTM
  • Source vocabulary size: 30,000 or 200,000
  • Target vocabulary size: 30,000 or 50,000
  Other values were the same for all experiments.
  [Same architecture diagram as on the previous slide.]

  10. Regularization
  • Weight decay: choosing a good value seemed quite important; 1e-6 worked noticeably better than 1e-5 or 1e-7.
  • Early stopping: keep the parameters with the best loss on the dev set, or the parameters with the best BLEU on the dev set. Best BLEU works better, but it is even better to ensemble best BLEU and best loss (see the sketch below).
  • Dropout: only used between LSTM layers (when using a multi-layer LSTM); 20% dropout.
  • Noise on target word embeddings (next slide).
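
  A minimal sketch of the early-stopping bookkeeping described above, assuming we snapshot both the best-loss and the best-BLEU parameters on the dev set so that the two can later be ensembled; the train_one_epoch and evaluate_* callbacks are placeholders, not the actual KyotoNMT code.

      import copy

      def train_with_early_stopping(model, n_epochs, train_one_epoch,
                                    evaluate_dev_loss, evaluate_dev_bleu):
          # Track the two kinds of "best" parameters mentioned on the slide.
          best_loss, best_loss_model = float("inf"), None
          best_bleu, best_bleu_model = -1.0, None
          for epoch in range(n_epochs):
              train_one_epoch(model)            # one pass over the training data
              loss = evaluate_dev_loss(model)
              bleu = evaluate_dev_bleu(model)
              if loss < best_loss:
                  best_loss, best_loss_model = loss, copy.deepcopy(model)
              if bleu > best_bleu:
                  best_bleu, best_bleu_model = bleu, copy.deepcopy(model)
          # Best-BLEU alone works better, but ensembling both snapshots is better still.
          return best_bleu_model, best_loss_model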

  11. Noise on target word embeddings
  • At training time we always give the correct previous word, but not at translation time; this part can therefore be a source of cascading errors at translation time.
  • Idea: add random noise at training time to the target embedding of the previously generated word, to force the network not to rely too much on this information (see the sketch below).
  • Seems to work (+1.5 BLEU).
  • But is it actually because the network became less prone to cascading errors, or simply a regularization effect?
  [Same architecture diagram, with the noise injected at the target embedding of the previously generated word.]
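
  A minimal sketch of this trick, assuming Gaussian noise added to the embedding of the previously generated word during training only; the slide does not specify the noise distribution or scale, so those are assumptions.

      import numpy as np

      def embed_previous_word(embedding_matrix, word_id, training, noise_std=0.1):
          # Look up the target embedding of the previously generated word.
          emb = embedding_matrix[word_id]
          if training:
              # Add random noise only at training time, so the decoder cannot rely
              # too heavily on the (always correct) previous word.
              emb = emb + np.random.normal(0.0, noise_std, size=emb.shape)
          return emb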

  12. Translation
  • Translation with beam search, using a large beam (maximum 100). Although other authors mention issues with large beams, it worked for us.
  • Normalization of the score by the length of the sentence: the final n-best candidates are pruned by the average loss per word (see the sketch below).
  • UNK words are replaced with a dictionary, using the attention values. The dictionary is extracted from the aligned training corpus. The attention is not always very precise, but it does help.
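
  Two small sketches of the post-processing described above: length-normalized ranking of the n-best candidates, and UNK replacement using the attention weights and a source-to-target dictionary. The data structures (lists of token strings, one row of attention weights per produced target word) are assumptions, not the actual KyotoNMT interfaces.

      def rank_by_average_loss(nbest):
          # nbest: list of (target_tokens, total_negative_log_prob)
          # Normalize the score by sentence length before pruning/ranking.
          return sorted(nbest, key=lambda cand: cand[1] / max(len(cand[0]), 1))

      def replace_unk(target_tokens, source_tokens, attention, dictionary):
          # attention[i][j]: weight of source position j when producing target token i
          out = []
          for i, tok in enumerate(target_tokens):
              if tok == "<UNK>":
                  j = max(range(len(source_tokens)), key=lambda k: attention[i][k])
                  src_word = source_tokens[j]
                  # Translate the most-attended source word with the dictionary
                  # extracted from the aligned training corpus; fall back to copying it.
                  tok = dictionary.get(src_word, src_word)
              out.append(tok)
          return out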

  13. Ensembling
  Ensembling is known to improve Neural-MT results substantially. We could confirm this, using three types of ensembling:
  • Normal ensembling: train different models and ensemble over them.
  • Self-ensembling: ensemble the parameters taken at different steps of the same training session.
  • Mixed ensembling: train several models, and use several parameter sets for each model.
  Observations:
  • Ensembling does help a lot.
  • Mixed > Normal > Self.
  • Diminishing returns (typically +2-3 BLEU going from one to two models, less than +0.5 going from three to four models).
  • Geometric averaging of the probabilities worked better than arithmetic averaging (see the sketch below).
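
  A minimal sketch of the two ways of combining the per-model next-word distributions at each decoding step; the slide's finding is that the geometric average (equivalent to averaging log-probabilities) worked better than the arithmetic one. Representing each model's output as a NumPy probability vector is an assumption about the interface.

      import numpy as np

      def arithmetic_ensemble(probs):
          # probs: list of (vocab_size,) probability vectors, one per model
          return np.mean(probs, axis=0)

      def geometric_ensemble(probs, eps=1e-12):
          # Average in log space, then renormalize.
          log_avg = np.mean([np.log(p + eps) for p in probs], axis=0)
          p = np.exp(log_avg)
          return p / p.sum()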

  14. The question of segmentation
  Several options for segmentation:
  • Natural (i.e. words for English)
  • Subword units, using e.g. BPE (Sennrich et al., 2015) (see the sketch below)
  • Automatic segmentation tools (JUMAN, SKP)
  -> Trade-off between sentence size, generalization capacity and computational efficiency.
  • English: word units, or subword units with BPE
  • Japanese: JUMAN segmentation, or subword units with BPE
  • Chinese: SKP segmentation, short-unit segmentation, or subword units with BPE
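
  For illustration, a minimal sketch of how BPE-style subword segmentation applies a learned, ordered list of merge operations to one word (following the idea of Sennrich et al., 2015, but simplified: no end-of-word marker, and the merge list in the usage example is a toy assumption).

      def apply_bpe(word, merges):
          # word: string; merges: ordered list of symbol pairs learned on the training corpus
          symbols = list(word)
          for a, b in merges:
              i, merged = 0, []
              while i < len(symbols):
                  if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                      merged.append(a + b)   # apply this merge operation
                      i += 2
                  else:
                      merged.append(symbols[i])
                      i += 1
              symbols = merged
          return symbols

      # e.g. apply_bpe("lower", [("l", "o"), ("lo", "w"), ("e", "r")]) -> ["low", "er"]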

  15. Results

  16. Results for WAT 2016 Ja -> En

                  EBMT     NMT 1        NMT 2
    BLEU          21.22    24.71        26.22
    AM-FM         59.52    56.27        55.85
    Pairwise      -        47.0 (3/9)   44.25 (4/9)
    JPO Adequacy  -        3.89 (1/3)   -

            # layers   Source Vocabulary   Target Vocabulary   Ensembling
    NMT 1   2          200k (JUMAN)        52k (BPE)           -
    NMT 2   1          30k (JUMAN)         30k (words)         x4

  • In terms of BLEU, ensembling 4 simple models (NMT 2) beats the larger NMT system (NMT 1).
  • In terms of human evaluation, the larger NMT model has a slightly better score.
  • AM-FM actually ranks the EBMT system higher.

  17. Results for WAT 2016 En -> Ja

                  EBMT     NMT 1
    BLEU          31.03    36.19
    AM-FM         74.75    73.87
    Pairwise      -        55.25 (1/10)
    JPO Adequacy  -        4.02 (1/4)

            # layers   Source Vocabulary   Target Vocabulary   Ensembling
    NMT 1   2          52k (BPE)           52k (BPE)           -

  18. Results for WAT 2016 Ja -> Zh

                  EBMT          NMT 1
    BLEU          30.27         31.98
    AM-FM         76.42         76.33
    Pairwise      30.75 (3/5)   58.75 (1/5)
    JPO Adequacy  -             3.88 (1/3)

            # layers   Source Vocabulary   Target Vocabulary   Ensembling
    NMT 1   2          30k (JUMAN)         30k (KyotoMorph)    -

  19. Results for WAT 2016 Zh -> Ja

                  EBMT     NMT 1         NMT 2
    BLEU          36.63    46.04         44.29
    AM-FM         76.71    78.59         78.44
    Pairwise      -        63.75 (1/9)   56.00 (2/9)
    JPO Adequacy  -        3.94 (1/3)    -

            # layers   Source Vocabulary    Target Vocabulary   Ensembling
    NMT 1   2          30k (KyotoMorph)     30k (JUMAN)         x2
    NMT 2   2          200k (KyotoMorph)    50k (JUMAN)         -

  20. EBMT vs NMT: an example

    Src:  [Japanese source sentence]
    Ref:  Shown here are type and basic configuration and standards of this flow with some diagrams.
    EBMT: This flow sensor type and the basic composition, standard is illustrated, and introduced.
    NMT:  This paper introduces the type, basic configuration, and standards of this flow sensor.

  NMT vs EBMT:
  • NMT seems more fluent.
  • NMT sometimes adds parts that are not in the source (over-translation).
  • NMT sometimes forgets to translate some parts of the source (under-translation).

  21. Conclusion
  • Neural MT proved to be very effective, especially for Zh -> Ja (almost +10 BLEU compared with EBMT).
  • NMT vs EBMT: NMT output is more fluent and readable; NMT more often has issues of under- or over-translation; NMT takes longer to train but can be faster at translation time.
  • Finding the optimal settings for NMT is very tricky: many hyper-parameters, and each training run takes a long time on a single GPU.
