Chinese Word Segmentation Optimization for Chinese-Japanese Machine Translation Study

Explore how shared Chinese characters can optimize Chinese word segmentation for Chinese-Japanese machine translation. The study discusses common Chinese characters, word segmentation problems, experiments, and related work in this domain.

  • Chinese
  • Japanese
  • Machine Translation
  • Word Segmentation

Presentation Transcript


  1. Exploiting Shared Chinese Characters in Chinese Word Segmentation Optimization for Chinese-Japanese Machine Translation
      Chenhui Chu, Toshiaki Nakazawa, Daisuke Kawahara, Sadao Kurohashi
      Graduate School of Informatics, Kyoto University
      EAMT2012 (2012/05/28)

  2. Outline
      • Word Segmentation Problems
      • Common Chinese Characters
      • Chinese Word Segmentation Optimization
      • Experiments
      • Discussion
      • Related Work
      • Conclusion and Future Work

  4. Word Segmentation for Chinese-Japanese MT
      [Example: a Chinese (Zh) sentence and its Japanese (Ja) translation, unsegmented and segmented; characters not preserved in the transcript]
      Ref: Mr. Kosaka is the founder of The Japan Society for Clinical Anesthesiologists.

  5. Word Segmentation Problems in Chinese-Japanese MT
      [Example: the same segmented Zh and Ja sentence pair; characters not preserved in the transcript]
      Ref: Mr. Kosaka is the founder of The Japan Society for Clinical Anesthesiologists.
      • Unknown words: affect segmentation accuracy and consistency
      • Word segmentation granularity: affects word alignment

  7. Chinese Characters
      • Chinese characters are used in both Chinese (Hanzi) and Japanese (Kanji)
      • Hanzi and Kanji share many common Chinese characters
      • We built a common Chinese character mapping table for 6,355 JIS Kanji (Chu et al., 2012)

  8. Common Chinese Characters Related Studies
      • Automatic sentence alignment (Tan et al., 1995)
      • Dictionary construction (Goh et al., 2005)
      • Investigation of word-level semantic relations (Huang et al., 2008)
      • Phrase alignment (Chu et al., 2011)
      • This study exploits common Chinese characters in Chinese word segmentation optimization for MT

  10. Reason for Chinese Word Segmentation Optimization
      • Segmentation is easier for Japanese than for Chinese, because Japanese uses Kana in addition to Chinese characters
      • The F-score for Japanese segmentation is nearly 99% (Kudo et al., 2004), while that for Chinese is still about 95% (Wang et al., 2011)
      • Therefore, we optimize word segmentation only for Chinese and keep the Japanese segmentation results unchanged

  11. [Flowchart of the proposed approach]
      • Chinese Lexicons Extraction: Parallel Training Corpus + Common Chinese Characters → Chinese Lexicons
      • Chinese Lexicons Incorporation: Chinese Lexicons + System Dictionary of Chinese Segmenter → System Dictionary with Chinese Lexicons
      • Short Unit Transformation: Chinese Lexicons + Chinese Annotated Corpus for Chinese Segmenter → Short Unit Chinese Corpus for Chinese Segmenter
      • Training: System Dictionary with Chinese Lexicons + Short Unit Chinese Corpus for Chinese Segmenter → Optimized Chinese Segmenter

  13. Chinese Lexicons Extraction
      • Step 1: Segment the Chinese and Japanese sentences in the parallel training corpus
      • Step 2: Convert Japanese Kanji tokens into Chinese using our mapping table (Chu et al., 2012)
      • Step 3: Extract a converted token as a Chinese lexicon if it occurs in the corresponding Chinese sentence
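The three extraction steps can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function names, the Kanji test (basic CJK Unified Ideographs block only), the substring check against the raw Chinese sentence, and the toy mapping table and sentence pair are all assumptions made for the example. Segmentation (Step 1) is assumed to have been done already.

```python
from typing import Dict, Iterable, List, Set, Tuple


def is_kanji_token(token: str) -> bool:
    """True if every character lies in the basic CJK Unified Ideographs block (an assumption)."""
    return bool(token) and all("\u4e00" <= ch <= "\u9fff" for ch in token)


def convert_kanji(token: str, kanji_to_hanzi: Dict[str, str]) -> str:
    """Step 2: map each Kanji to its Chinese counterpart; unmapped characters are kept."""
    return "".join(kanji_to_hanzi.get(ch, ch) for ch in token)


def extract_lexicons(
    pairs: Iterable[Tuple[str, List[str]]],
    kanji_to_hanzi: Dict[str, str],
) -> Set[str]:
    """Step 3: keep converted tokens that occur in the corresponding Chinese sentence."""
    lexicons: Set[str] = set()
    for zh_sentence, ja_tokens in pairs:
        for token in ja_tokens:
            if not is_kanji_token(token):
                continue  # skip Kana-only and mixed tokens
            converted = convert_kanji(token, kanji_to_hanzi)
            if converted in zh_sentence:
                lexicons.add(converted)
    return lexicons


# Toy data made up for illustration; not the example shown on the slides.
toy_table = {"臨": "临", "酔": "醉", "創": "创"}
toy_pairs = [
    (
        "小坂先生是日本临床麻醉学会的创始者",
        ["小坂", "氏", "は", "日本", "臨床", "麻酔", "学会", "の", "創始", "者", "である"],
    )
]
print(sorted(extract_lexicons(toy_pairs, toy_table)))
```

In this toy run the token 氏 is dropped because it does not occur in the Chinese sentence, and the Kana tokens are never considered.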

  14. Extraction Example
      [Example: the Kanji tokens of the segmented Ja sentence are converted to Chinese characters and checked against the Zh sentence; the matching tokens are extracted as Chinese lexicons. Characters not preserved in the transcript]
      Ref: Mr. Kosaka is the founder of The Japan Society for Clinical Anesthesiologists.

  16. Chinese Lexicons Incorporation
      • Using a system dictionary is helpful for Chinese word segmentation (Low et al., 2005; Wang et al., 2011)
      • We incorporate the extracted lexicons into the system dictionary of a Chinese segmenter
      • POS tags are assigned by converting the tags given by the Japanese segmenter, using a POS tag mapping table between Japanese and Chinese

  17. POS tag mapping from JUMAN (Kurohashi et al., 1994) to CTB (Penn Chinese Treebank); JUMAN categories are given by their English glosses
      • adverb → AD (adverb)
      • conjunction → CC (coordinating conjunction)
      • noun [numeral noun] → CD (cardinal number)
      • undefined word [alphabet] → FW (foreign words)
      • interjection → IJ (interjection)
      • suffix [measure word suffix] → M (measure word)
      • noun [common noun / sahen noun / formal noun / adverbial noun], suffix [noun suffix / special noun suffix] → NN (common noun)
      • noun [proper noun / place name / person name / organization name] → NR (proper noun)
      • noun [temporal noun] → NT (temporal noun)
      • special word → PU (punctuation)
      • adjective → VA (predicative adjective)
      • verb / noun [sahen noun] → VV (other verb)
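A mapping table like this one could be held as a simple lookup, sketched below in Python. The keys use the English glosses from the slide because the original JUMAN labels are not available here; the `POS_MAP` and `map_pos` names, the category-level fallback, and returning both NN and VV for sahen nouns are assumptions of this sketch rather than details confirmed by the slides.

```python
from typing import Dict, List, Optional, Tuple

# (JUMAN category gloss, subcategory gloss or None) -> candidate CTB tags.
POS_MAP: Dict[Tuple[str, Optional[str]], List[str]] = {
    ("adverb", None): ["AD"],
    ("conjunction", None): ["CC"],
    ("noun", "numeral noun"): ["CD"],
    ("undefined word", "alphabet"): ["FW"],
    ("interjection", None): ["IJ"],
    ("suffix", "measure word suffix"): ["M"],
    ("noun", "common noun"): ["NN"],
    ("noun", "sahen noun"): ["NN", "VV"],  # listed under both NN and VV on the slide
    ("noun", "formal noun"): ["NN"],
    ("noun", "adverbial noun"): ["NN"],
    ("suffix", "noun suffix"): ["NN"],
    ("suffix", "special noun suffix"): ["NN"],
    ("noun", "proper noun"): ["NR"],
    ("noun", "place name"): ["NR"],
    ("noun", "person name"): ["NR"],
    ("noun", "organization name"): ["NR"],
    ("noun", "temporal noun"): ["NT"],
    ("special word", None): ["PU"],
    ("adjective", None): ["VA"],
    ("verb", None): ["VV"],
}


def map_pos(category: str, subcategory: Optional[str] = None) -> List[str]:
    """Return candidate CTB tags; fall back to the category-level entry, else an empty list."""
    exact = POS_MAP.get((category, subcategory))
    if exact is not None:
        return exact
    return POS_MAP.get((category, None), [])


print(map_pos("noun", "sahen noun"))
print(map_pos("verb"))
```

Returning a list keeps the sahen-noun ambiguity visible to the caller instead of silently picking one tag.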

  19. Short Unit Transformation
      • Adjusting Chinese word segmentation so that as many tokens as possible map 1-to-1 between parallel sentences can improve alignment accuracy (Bai et al., 2008)
      • Wang et al. (2010) proposed a short unit standard for Chinese word segmentation, which can reduce the number of 1-to-n alignments and improve MT performance

  20. Our Method
      • We transform the annotated training data of the Chinese segmenter using the extracted lexicons
      • Example (characters not preserved in the transcript): a CTB sentence tagged P / NN / VA / DEC / NN, with extracted lexicons glossed (effective) and (element), becomes the short unit sequence P / NN / NN / VA / DEC / NN / NN
      • Ref: From case element with high effectiveness

  21. Constraints
      • We do not use extracted lexicons that consist of only one Chinese character
      • Example of a split this avoids: long token (praise), extracted lexicon (song), short unit tokens (song)/(praise)
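A short unit transformation with the single-character constraint might look like the Python sketch below. The slides do not spell out the splitting procedure, so the rule used here (pick the longest extracted lexicon contained in the token, leftmost on ties, and let every piece inherit the original POS tag) is an assumption, as are the function name and the illustrative Chinese strings, which were chosen only to match the English glosses on the slides.

```python
from typing import List, Set, Tuple


def to_short_units(token: str, pos: str, lexicons: Set[str]) -> List[Tuple[str, str]]:
    """Split a long token around one extracted lexicon; pieces inherit the token's POS tag."""
    # Constraint: ignore single-character lexicons; the lexicon must be a proper substring.
    candidates = [lex for lex in lexicons if 2 <= len(lex) < len(token) and lex in token]
    if not candidates:
        return [(token, pos)]
    # Assumed tie-breaking: longest lexicon, then leftmost occurrence, then alphabetical.
    best = min(candidates, key=lambda lex: (-len(lex), token.index(lex), lex))
    start = token.index(best)
    end = start + len(best)
    pieces = [token[:start], best, token[end:]]
    return [(piece, pos) for piece in pieces if piece]


# Illustrative strings only (chosen to match the glosses "effective", "element", "song").
lexicons = {"有效", "要素", "歌"}
print(to_short_units("有效性", "NN", lexicons))
print(to_short_units("格要素", "NN", lexicons))
print(to_short_units("歌颂", "VV", lexicons))
```

The sketch splits around a single lexicon and does not recurse into the leftover pieces; a fuller version would need a policy for tokens that contain several overlapping lexicons.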

  23. Two Kinds of Experiments
      • Experiments on Moses
      • Experiments on the Kyoto example-based machine translation (EBMT) system (Nakazawa and Kurohashi, 2011), a dependency tree-based decoder

  24. Experimental Settings on MOSES (1/2)
      • Parallel Training Corpus: Zh-Ja paper abstract corpus (680k sentences)
      • Chinese Annotated Corpus: NICT Chinese Treebank (same domain as the parallel corpus, 9,792 sentences); CTB 7 (31,131 sentences)
      • Chinese Segmenter: in-house Chinese segmenter
      • Japanese Segmenter: JUMAN (Kurohashi et al., 1994)
      • MT system: Moses with default options, except for the distortion limit (6 → 20)

  25. Experimental Settings on MOSES (2/2)
      • Baseline: uses only the lexicons extracted from the Chinese annotated corpus
      • Incorporation: incorporates the extracted Chinese lexicons
      • Short unit: incorporates the extracted Chinese lexicons and trains the Chinese segmenter on the short unit training data

  26. Results of Chinese-to-Japanese Translation Experiments on MOSES
      BLEU            NICT Chinese Treebank   CTB 7
      Baseline        34.92                   36.64
      Incorporation   36.19                   37.39
      Short unit      36.82                   38.59
      • CTB 7 performs better because it is more than three times the size of the NICT Chinese Treebank
      • Lexicons extracted from the paper abstract domain also work well on other domains (i.e., CTB 7)
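The gains over the baseline implied by these Zh-to-Ja scores can be worked out with a few lines of Python. The BLEU values are copied from the slide; the dictionary layout and the `gains_over_baseline` helper are just a convenient, hypothetical way to do the subtraction.

```python
from typing import Dict

# BLEU scores for Chinese-to-Japanese translation on Moses, as reported on the slide.
bleu: Dict[str, Dict[str, float]] = {
    "NICT Chinese Treebank": {"Baseline": 34.92, "Incorporation": 36.19, "Short unit": 36.82},
    "CTB 7": {"Baseline": 36.64, "Incorporation": 37.39, "Short unit": 38.59},
}


def gains_over_baseline(scores: Dict[str, float]) -> Dict[str, float]:
    """BLEU difference of each non-baseline setting relative to the baseline."""
    base = scores["Baseline"]
    return {name: round(value - base, 2) for name, value in scores.items() if name != "Baseline"}


for corpus, scores in bleu.items():
    print(corpus, gains_over_baseline(scores))
```

On both annotated corpora the short unit setting gains close to two BLEU points over the baseline, with lexicon incorporation alone accounting for part of that gain.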

  27. Results of Japanese-to-Chinese Translation Experiments on MOSES
      BLEU            NICT Chinese Treebank   CTB 7
      Baseline        31.42                   31.83
      Incorporation   31.24                   32.34
      Short unit      31.95                   31.95
      • The improvement is not significant compared to Zh-to-Ja, because our approach does not change the segmentation of the input Japanese sentences

  28. Experimental Settings on EBMT (1/2)
      • Parallel Training Corpus: Zh-Ja paper abstract corpus (680k sentences)
      • Chinese Annotated Corpus: CTB 7 (31,131 sentences)
      • Chinese Segmenter: in-house Chinese segmenter
      • Japanese Segmenter: JUMAN (Kurohashi et al., 1994)
      • MT system: Kyoto example-based machine translation (EBMT) system

  29. Experimental Settings on EBMT (2/2)
      • Baseline: uses only the lexicons extracted from the Chinese annotated corpus
      • Short unit: incorporates the extracted Chinese lexicons and trains the Chinese segmenter on the short unit training data

  30. Results of Translation Experiments on EBMT
      BLEU         Zh-to-Ja   Ja-to-Zh
      Baseline     22.84      19.10
      Short unit   23.36      19.15
      • Translation performance is worse than with MOSES, because EBMT suffers from the low accuracy of the Chinese parser
      • The improvement from short units is not significant, because the Chinese parser is not trained on short unit segmented training data

  32. Short Unit Effectiveness on MOSES
      • Baseline (BLEU=49.38): [segmented input and output; characters not preserved in the transcript]
      • Short unit (BLEU=56.33): [segmented input and output; characters not preserved in the transcript]
      • Reference: (In this paper, we propose a basic security design method also consider functional suitability of the existing implementation method for determining countermeasures target.)

  33. Number of Extracted Lexicons
      Source                      Lexicons
      Parallel training corpus    18,584
      Source                      NICT Treebank   CTB 7
      Chinese annotated corpus    13,471          26,202
      Short unit Chinese corpus   12,627          25,490
      • The number of extracted lexicons decreased after short unit transformation because duplicated lexicons increased

  34. Short Unit Transformation Percentage
      • NICT Chinese Treebank: 6,623 of 257,825 tokens (2.57%) were transformed into 13,469 short unit tokens
      • CTB 7: 19,983 of 718,716 tokens (2.78%) were transformed into 41,336 short unit tokens
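The percentages on this slide can be re-derived from the raw counts, as in the short Python check below. The counts come from the slide; the `summarize` helper and the extra "short units per transformed token" ratio are additions made here for illustration.

```python
from typing import Dict, Tuple

# (transformed long tokens, total tokens, resulting short unit tokens), from the slide.
stats: Dict[str, Tuple[int, int, int]] = {
    "NICT Chinese Treebank": (6623, 257825, 13469),
    "CTB 7": (19983, 718716, 41336),
}


def summarize(transformed: int, total: int, short_units: int) -> Tuple[float, float]:
    """Return (percentage of tokens transformed, average short units per transformed token)."""
    percentage = round(100 * transformed / total, 2)
    units_per_token = round(short_units / transformed, 2)
    return percentage, units_per_token


for corpus, (transformed, total, short_units) in stats.items():
    print(corpus, summarize(transformed, total, short_units))
```

The derived ratio suggests that each transformed token is split into roughly two short unit tokens on average in both corpora.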

  35. Short Unit Transformation Problems (1/3)
      • Improper transformation problem
      • Example: long token (sorry), extracted lexicon (favor), short unit tokens (not)/(favor)/(think)

  36. Short Unit Transformation Problems (2/3)
      • Transformation ambiguity problem
      • Example: long token (charger), extracted lexicons (charge) and (electric equipment), candidate short unit tokens (charge)/(device) or (charge)/(electric equipment)

  37. Short Unit Transformation Problems (3/3)
      • POS tag assignment problem
      • Example: long token (test subject)_NN, extracted lexicon (test), short unit tokens (be)_NN/(test)_NN/(person)_NN
      • The correct POS tag for (be) should be LB (used in the long bei-construction)

  39. Bai et al., 2008
      • Proposed a method that learns affix rules from an aligned Chinese-English bilingual terminology bank to adjust Chinese word segmentation in the parallel corpus directly

  40. Wang et al., 2010
      • Proposed a method based on transfer rules and a transfer database
      • The transfer rules are extracted from alignment results of annotated Chinese and segmented Japanese training data
      • The transfer database is constructed using external lexicons and is manually modified

  41. Conclusion
      • We proposed an approach that exploits common Chinese characters in Chinese word segmentation optimization for Chinese-Japanese MT
      • Experimental results of Chinese-Japanese MT on a phrase-based SMT system indicated that our approach can improve MT performance significantly

  42. Future Work
      • Solve the short unit transformation problems
      • Evaluate the proposed approach on parallel corpora of other domains
