NAIST-IS-MT1251045

Post-ordering with Factored Translation Models
for Japanese to English Translation

Kazuya Kobayashi
Kevin Duh

February 6, 2014
Post-ordering with Factored Translation Models for Japanese to English Translation

Kazuya Kobayashi

Abstract

Translation quality in statistical machine translation strongly depends on the language pair. When translating between Japanese and English, long-distance reorderings must be handled, and current statistical machine translation systems do not work well because standard reordering models lack the necessary flexibility and the search is limited by computational complexity. In this thesis, we focus on a method called post-ordering to mitigate the reordering problem in Japanese to English translation, and we propose a method that uses additional information beyond word surface forms. We use factored translation models to incorporate such information, namely POS tags and word classes.

Keywords: Statistical Machine Translation, Post-ordering, Factored Translation Models, Japanese to English Translation

Master's Thesis, Department of Information Science, Graduate School of Information Science, Nara Institute of Science and Technology, NAIST-IS-MT1251045, February 6, 2014.
Acknowledgments

Kevin Duh
Graham Neubig
Contents

1 Introduction
  1.1 Statistical Machine Translation
  1.2 The Reordering Problem
  1.3 Contributions
  1.4 Outline
2 Statistical Machine Translation
  2.1 Language Model
  2.2 Translation Model
    2.2.1 IBM Model 1
    2.2.2 IBM Model 2
    2.2.3 IBM Model 3
    2.2.4 IBM Models 4 and 5
  2.3 Phrase-Based Translation
  2.4 Reordering Model
  2.5 The Cost of Reordering
3 Factored Translation Models
4 Related Work
5 Post-ordering
  5.1 Head Finalization
  5.2 Training the Post-ordering Models
6 Post-ordering with Factored Translation Models
  6.1 Factors for Japanese-to-HFE Translation
  6.2 Factors for HFE-to-English Reordering
7 Experiments
  7.1 Data
  7.2 Experimental Setup
  7.3 Evaluation Metrics
  7.4 Results
    7.4.1 Factors for Japanese-to-HFE Translation
    7.4.2 Factors for HFE-to-English Reordering
    7.4.3 Discussion
8 Conclusion
References
List of Figures

2.1 …
3.1 Factored translation models
4.1 …
5.1 …
5.2 XML output of Enju
5.3 Head Finalization
6.1 Factored Translation Models

List of Tables

7.1 Japanese-to-English translation results
7.2 Japanese-to-HFE translation results
7.3 HFE-to-English reordering results
7.4 Results of combining factors
7.5 …
1 Introduction

1.1 Statistical Machine Translation

Statistical machine translation is often formulated as a noisy channel model. Given a source sentence f = (f_1, f_2, ..., f_m) and a target sentence e = (e_1, e_2, ..., e_n), the posterior P(e|f) is rewritten by Bayes' rule as

  P(e|f) = P(e) P(f|e) / P(f)                                    (1.1)

Since the denominator P(f) in (1.1) does not depend on e, the best translation ê under (1.1) is

  ê = argmax_e P(e) P(f|e)                                       (1.2)

where P(e) is the language model and P(f|e) is the translation model. Och and Ney [14] generalized the noisy channel model to a log-linear model over M feature functions h_m(e, f):

  P(e|f) = exp[Σ_{m=1}^{M} λ_m h_m(e, f)] / Σ_{e'} exp[Σ_{m=1}^{M} λ_m h_m(e', f)]   (1.3)
where the λ_m are feature weights. Since the denominator of (1.3) does not depend on e, the best translation under (1.3) is

  ê = argmax_e P(e|f)                                            (1.4)
    = argmax_e Σ_{m=1}^{M} λ_m h_m(e, f)                         (1.5)

1.2 The Reordering Problem

Word order differs greatly between some language pairs. An English sentence such as "John hit a ball." follows subject-verb-object (SVO) order, while its Japanese counterpart follows subject-object-verb (SOV) order. In principle a sentence of n words admits n! orderings, so reordering cannot be searched exhaustively.
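The decision rule in Eq. (1.5) amounts to scoring each candidate by a weighted sum of feature values and taking the argmax. A minimal sketch follows; the feature functions and weights are invented stand-ins for illustration, not the models used in this thesis:

```python
# A minimal sketch of the log-linear decision rule in Eq. (1.5):
# score(e, f) = sum_m lambda_m * h_m(e, f); decoding picks the argmax.
# Both feature functions are invented stand-ins for illustration.

def h_length_ratio(e, f):
    # penalize length mismatch between candidate e and source f
    return -abs(len(e) - len(f))

def h_overlap(e, f):
    # toy "translation model" feature: number of shared tokens
    return len(set(e) & set(f))

FEATURES = [h_length_ratio, h_overlap]
WEIGHTS = [0.5, 1.0]  # the lambda_m; tuned with MERT in practice

def score(e, f):
    return sum(lam * h(e, f) for lam, h in zip(WEIGHTS, FEATURES))

def decode(candidates, f):
    # argmax over an explicit candidate list; a real decoder searches
    # this space incrementally with pruning
    return max(candidates, key=lambda e: score(e, f))

best = decode([["a", "b", "c"], ["x", "y"]], ["a", "b", "c"])
```

A real system enumerates candidates implicitly during search rather than scoring a fixed list.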
1.3 Contributions

This thesis applies factored translation models to post-ordering, using part-of-speech tags and word classes as additional factors for Japanese to English translation.

1.4 Outline

The remainder of this thesis is organized as follows. Chapter 2 reviews statistical machine translation. Chapter 3 introduces factored translation models. Chapter 4 surveys related work. Chapter 5 describes post-ordering. Chapter 6 presents our method, post-ordering with factored translation models. Chapter 7 reports experiments, and Chapter 8 concludes.
2 Statistical Machine Translation

This chapter reviews statistical machine translation, in particular the phrase-based approach of Koehn et al. [11] and the alignment models of Och et al. [16].

2.1 Language Model

The language model estimates P(e) in (1.1) and serves as a feature in (1.3). By the chain rule, the probability of a word sequence w_1, w_2, ..., w_l is

  P(w_1 w_2 ... w_l) = P(w_1) P(w_2|w_1) ... P(w_l | w_1 w_2 ... w_{l-1})   (2.1)

An n-gram model approximates each factor of (2.1) by conditioning w_i only on the preceding n-1 words. For a 3-gram model with maximum-likelihood estimation,

  P(w_i | w_{i-2} w_{i-1}) = C(w_{i-2} w_{i-1} w_i) / C(w_{i-2} w_{i-1})   (2.2)

where C(x) is the number of occurrences of x in the training data. Because (2.2) assigns probability 0 to unseen 3-grams, smoothing methods such as Kneser-Ney [8] or Witten-Bell [19] are used.
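The maximum-likelihood 3-gram estimate of Eq. (2.2) can be sketched directly from counts; this is unsmoothed, so unseen 3-grams get probability 0, which is exactly why Kneser-Ney or Witten-Bell smoothing is used in practice:

```python
from collections import Counter

# Maximum-likelihood 3-gram estimation as in Eq. (2.2):
# P(w_i | w_{i-2} w_{i-1}) = C(w_{i-2} w_{i-1} w_i) / C(w_{i-2} w_{i-1}).

def train_trigram(sentences):
    tri, bi = Counter(), Counter()
    for s in sentences:
        toks = ["<s>", "<s>"] + s + ["</s>"]
        for i in range(2, len(toks)):
            tri[tuple(toks[i - 2:i + 1])] += 1
            bi[tuple(toks[i - 2:i])] += 1
    return tri, bi

def prob(tri, bi, w2, w1, w):
    denom = bi[(w2, w1)]
    return tri[(w2, w1, w)] / denom if denom else 0.0

# usage: counts from a tiny two-sentence corpus
tri, bi = train_trigram([["the", "cat"], ["the", "dog"]])
```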
2.2 Translation Model

The translation model estimates P(f|e) in (1.1) and serves as a feature in (1.3). The IBM models [2] define it through a word alignment a between e and f:

  P(f|e) = Σ_a P(f, a|e)                                         (2.3)

where f has length m and e has length l. The alignment a = a_1 a_2 ... a_m maps each source position to a target position: a_j = i means that f_j is aligned to e_i, and a_j = 0 means that f_j is aligned to the empty word e_0. The joint probability P(f, a|e) decomposes without loss of generality as

  P(f, a|e) = P(m|e) Π_{j=1}^{m} P(a_j | f_1^{j-1}, a_1^{j-1}, m, e) P(f_j | f_1^{j-1}, a_1^{j}, m, e)   (2.4)

where f_i^j abbreviates f_i ... f_j. There are five IBM models, Models 1 through 5; they are trained in sequence, each model initialized from the previous one.

2.2.1 IBM Model 1

IBM Model 1 makes three simplifying assumptions about (2.4): the length probability P(m|e) is a constant ε; the alignment probability P(a_j | f_1^{j-1}, a_1^{j-1}, m, e) is uniform, 1/(l+1); and the word translation probability depends only on f_j and e_{a_j}:

  t(f_j | e_{a_j}) = P(f_j | f_1^{j-1}, a_1^{j}, m, e)           (2.5)

Under these assumptions, (2.4) becomes

  P(f, a|e) = ε / (l+1)^m · Π_{j=1}^{m} t(f_j | e_{a_j})         (2.6)
Summing (2.6) over all alignments, where each a_j ranges from 0 to l, gives P(f|e):

  P(f|e) = ε / (l+1)^m · Σ_{a_1=0}^{l} ... Σ_{a_m=0}^{l} Π_{j=1}^{m} t(f_j | e_{a_j})
         = ε / (l+1)^m · Π_{j=1}^{m} Σ_{i=0}^{l} t(f_j | e_i)    (2.7)

2.2.2 IBM Model 2

Model 2 replaces Model 1's uniform alignment probability with a distribution that depends on the positions j and a_j and the lengths m and l:

  a(a_j | j, m, l) = P(a_j | f_1^{j-1}, a_1^{j-1}, m, l)         (2.8)

Substituting (2.8) into (2.7) gives

  P(f|e) = ε · Σ_{a_1=0}^{l} ... Σ_{a_m=0}^{l} Π_{j=1}^{m} t(f_j | e_{a_j}) a(a_j | j, m, l)
         = ε · Π_{j=1}^{m} Σ_{i=0}^{l} t(f_j | e_i) a(i | j, m, l)   (2.9)

Model 1 is the special case of Model 2 in which a(i | j, m, l) = 1/(l+1).
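The word translation table t(f|e) is estimated with EM; because Model 1's alignment probability is uniform, the expected counts factor over source positions exactly as in Eq. (2.7). A toy sketch on the classic two-sentence corpus:

```python
from collections import defaultdict

# Toy EM trainer for IBM Model 1 word-translation probabilities
# t(f_j | e_i), following Eq. (2.7). "NULL" plays the role of the
# empty word e_0.

def train_ibm1(bitext, iterations=10):
    f_vocab = {f for (fs, es) in bitext for f in fs}
    t = defaultdict(lambda: 1.0 / len(f_vocab))  # uniform initialization
    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        for fs, es in bitext:
            es = ["NULL"] + es
            for f in fs:
                z = sum(t[(f, e)] for e in es)  # normalizer over e_0..e_l
                for e in es:
                    c = t[(f, e)] / z  # expected alignment count
                    count[(f, e)] += c
                    total[e] += c
        for (f, e) in count:  # M-step: renormalize per target word
            t[(f, e)] = count[(f, e)] / total[e]
    return t

# usage: EM sharpens t so that "maison" prefers "house" over "the"
bitext = [(["la", "maison"], ["the", "house"]),
          (["la", "fleur"], ["the", "flower"])]
t = train_ibm1(bitext)
```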
2.2.3 IBM Model 3

Models 1 and 2 generate exactly one source word per alignment link. Model 3 introduces a fertility model n(φ | e_i), the probability that target word e_i generates φ source words (possibly φ = 0), together with a parameter p_1 for insertion from the empty word. Model 3 also replaces Model 2's alignment probability a(a_j | j, m, l) with a distortion probability d(j | i, l, m). Its parameters are thus the fertility model n(φ | e), the insertion probability p_1, the word translation probabilities t(f_j | e_{a_j}), and the distortion probabilities d(j | i, l, m).

2.2.4 IBM Models 4 and 5

Model 4 replaces Model 3's distortion model with one conditioned on the position of the previously translated word. Models 3 and 4 are deficient, in that they assign probability mass to impossible configurations; Model 5 removes this deficiency.

2.3 Phrase-Based Translation

Phrase-based translation [11] translates contiguous sequences of words (phrases) rather than single words, which captures local context and local reorderings. Phrase pairs are extracted from the word alignments produced by the IBM models.
2.4 Reordering Model

The Moses decoder [10] provides a lexicalized reordering model; Galley and Manning [4] proposed a hierarchical variant. Given source phrases f̄ = (f̄_1, f̄_2, ..., f̄_m), target phrases ē = (ē_1, ē_2, ..., ē_n), an alignment a = (a_1, a_2, ..., a_n), and orientations o = (o_1, o_2, ..., o_n),

  P(o | ē, f̄) = Π_{i=1}^{n} P(o_i | ē_i, f̄_{a_i}, a_{i-1}, a_i)   (2.10)

Each orientation o_i takes one of three values, as illustrated in Figure 2.1: monotone (M) when two adjacent phrases keep their order, swap (S) when they are exchanged, and discontinuous (D) otherwise. The model contributes three feature functions:

  f_m = Σ_{i=1}^{n} log p(o_i = M | ...)
  f_s = Σ_{i=1}^{n} log p(o_i = S | ...)
  f_d = Σ_{i=1}^{n} log p(o_i = D | ...)

2.5 The Cost of Reordering

For a sentence of n words or phrases there are n! possible orderings, so in practice the search over reorderings must be restricted, for example by a distortion limit.
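The three orientation classes can be sketched by comparing the source position of each phrase with that of its predecessor (a simplification; the real model conditions on the word alignments within the phrases):

```python
# The three orientation classes of the lexicalized reordering model,
# simplified to phrase indices: for target phrase i aligned to source
# phrase a_i, the orientation is monotone (M) if a_i = a_{i-1} + 1,
# swap (S) if a_i = a_{i-1} - 1, and discontinuous (D) otherwise.

def orientations(a):
    out, prev = [], 0  # treat the sentence start as source position 0
    for cur in a:
        if cur == prev + 1:
            out.append("M")
        elif cur == prev - 1:
            out.append("S")
        else:
            out.append("D")
        prev = cur
    return out
```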
Figure 2.1: …
3 Factored Translation Models

Koehn and Hoang [9] proposed factored translation models, in which each word is annotated with additional factors such as its lemma and part of speech. For example, the surface forms "house" and "houses" are distinct, but they share the lemma "house"; a model that translates at the lemma level and then generates the inflected surface form can generalize across such variants even when only one of them was observed in training. Figure 3.1 illustrates how translation is decomposed into factor-wise mapping steps.

Figure 3.1: Factored translation models
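The generalization that factors buy can be illustrated with a toy lexicon; every entry below is invented for illustration (and uses an English-German pair purely as an example, whereas the thesis itself deals with Japanese-English):

```python
# Toy illustration of factored generalization: a surface phrase table
# that has only seen "house" cannot translate "houses", but a
# lemma-level translation step plus a generation step can.

surface_table = {"house": "Haus"}           # surface pairs seen in training
lemma_table = {"house": "Haus"}             # lemma-level translation step
generate = {("Haus", "singular"): "Haus",   # target-side generation step
            ("Haus", "plural"): "Häuser"}

def translate_factored(word, lemma, number):
    # use the surface table when possible, otherwise back off to
    # lemma translation followed by morphological generation
    if word in surface_table:
        return surface_table[word]
    return generate[(lemma_table[lemma], number)]
```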
4 Related Work

Collins et al. [3] proposed pre-ordering for German-to-English translation: the source sentence is syntactically restructured into target-like word order before translation, so that the translation step itself can be largely monotone. Katz-Brown and Collins [7] applied syntactic reordering in preprocessing to Japanese-to-English patent translation. Isozaki et al. [6] proposed Head Finalization, a single reordering rule for translating into SOV languages. Sudoh et al. [18] proposed post-ordering, which moves the reordering step after translation, and used the Head Finalization rule of Isozaki et al. [6] to define the intermediate language. Figure 4.1 illustrates these approaches.
Figure 4.1: …
5 Post-ordering

Japanese is a head-final language, while English is head-initial, so translating between them requires long-distance reordering. Sudoh et al. [18] proposed post-ordering, which splits Japanese-to-English translation into two steps, as shown in Figure 5.1. First, Japanese is translated into Head-Final English (HFE), English rearranged into head-final order by the rules of Isozaki et al. [6]; since HFE has almost the same word order as Japanese, this step is nearly monotone. Second, the HFE output is reordered into ordinary English. Because each step involves little reordering on its own, the overall reordering problem is mitigated.

Figure 5.1: …
5.1 Head Finalization

Head Finalization [6] converts English into Head-Final English using the syntactic analysis produced by the HPSG parser Enju¹ [12]. Figure 5.2 shows an example of Enju's XML output, and Figure 5.3 shows an example of Head Finalization. The transformation rewrites the parse tree: syntactic heads are moved after their dependents, giving head-final order; the articles "a", "an", and "the" are removed; and pseudo-particles (va0, va1, va2), which play the role of Japanese case particles, are inserted.

5.2 Training the Post-ordering Models

As in Figure 5.1, two models are trained. The HFE side of the training data is created by applying Head Finalization to the English side of the parallel corpus. The Japanese-to-HFE translation model is then trained on the Japanese sentences paired with their HFE counterparts, and the HFE-to-English reordering model is trained on HFE paired with the original English.

1 http://www.nactem.ac.uk/tsujii/enju/index.html
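The head-movement and article-removal rules can be illustrated on a toy parse of "John hit a ball"; this is a simplification, since the real rules operate on Enju's HPSG output, and pseudo-particle insertion is omitted here:

```python
# Toy Head Finalization: a node is either a word (str) or a pair
# (children, head_index). Heads are moved after their siblings and
# articles are dropped; the full Enju-based rules are not modeled.

ARTICLES = {"a", "an", "the"}

def head_finalize(node):
    if isinstance(node, str):
        return [] if node.lower() in ARTICLES else [node]
    children, head_idx = node
    words = []
    for i, child in enumerate(children):  # non-head children first
        if i != head_idx:
            words += head_finalize(child)
    words += head_finalize(children[head_idx])  # head child last
    return words

# parse of "John hit a ball": S(NP(John), VP(hit, NP(a, ball)))
tree = (["John", (["hit", (["a", "ball"], 1)], 0)], 1)
hfe = head_finalize(tree)
```

The output places the verb last and drops the article, mirroring Japanese word order.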
Figure 5.2: XML output of Enju
Figure 5.3: Head Finalization
6 Post-ordering with Factored Translation Models

In factored translation models [9], the phrase translation probability P(f̄ | ē) is decomposed into translation steps and generation steps: for instance, a surface-level translation step P(f̄_word | ē_word) combined with generation steps over the target factors. With additional factors, the translation step can map onto several output factors at once, for example P(f̄_word, f̄_factor1, f̄_factor2 | ē_word), and language models P(f̄_factor) trained over each factor sequence are combined with P(f̄ | ē).

We use two kinds of factors: part-of-speech tags obtained from Enju, and word classes obtained by Brown clustering [1]. Brown clustering groups words into a fixed number of classes by maximizing the likelihood of a class-based bigram language model; we use two settings, 50 and 1,000 classes. Figure 6.1 shows how these factors are used in Head Finalization-based post-ordering: both the Japanese-to-HFE translation model and the HFE-to-English reordering model annotate the HFE side with the Enju POS tags and the Brown clusters.
Figure 6.1: Factored Translation Models

6.1 Factors for Japanese-to-HFE Translation

In the Japanese-to-HFE model, each HFE word carries its POS tag and word class as factors. In addition to the surface language model P(f̄_word), language models P(f̄_pos) and P(f̄_class) are trained over the factor sequences, and the translation step produces all HFE factors jointly, P(f̄_word, f̄_pos, f̄_class | ē_word).

6.2 Factors for HFE-to-English Reordering

In the HFE-to-English model, the factors annotate the HFE (input) side, so that the reordering step can condition on POS tags and word classes rather than on surface forms alone.
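Moses expects factored corpora in a word|factor1|factor2 format; a sketch of the annotation step, where the POS tags and cluster IDs shown are placeholders for Enju output and Brown-cluster IDs:

```python
# Writing a factored corpus line in the word|factor1|factor2 format
# used by Moses' factored training. Tags and cluster IDs below are
# illustrative placeholders, not real Enju or Brown-clustering output.

def annotate(tokens, pos_tags, clusters):
    return " ".join(
        f"{w}|{p}|{c}" for w, p, c in zip(tokens, pos_tags, clusters)
    )

line = annotate(["John", "ball", "hit"], ["NNP", "NN", "VB"], ["17", "4", "29"])
```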
7 Experiments

7.1 Data

We use a Japanese-English corpus built from Wikipedia: 318,443 sentence pairs for training, 1,166 for development, and 1,160 for testing.

7.2 Experimental Setup

Word alignment is obtained with GIZA++¹ [15], training up to IBM Model 4. Language models are trained with SRILM²: a 5-gram model over surface forms and 7-gram models over the factors. Feature weights are tuned with MERT [13], and decoding is done with Moses³.

1 https://code.google.com/p/giza-pp/
2 http://www.speech.sri.com/projects/srilm/download.html
3 http://www.statmt.org/moses/

7.3 Evaluation Metrics

Translation quality is measured with BLEU [17] and RIBES [5]. BLEU is based on n-gram precision:

  BLEU = BP · exp(Σ_{n=1}^{N} w_n log p_n)                       (7.1)

where BP is the brevity penalty, the w_n are weights on each n-gram order, p_n is the modified n-gram precision, and N = 4. RIBES is based on Kendall's τ computed over unigram correspondences between hypothesis and reference:

  RIBES = (τ + 1)/2 · P^α                                        (7.2)

where τ is Kendall's τ, P is the unigram precision, and α is a hyperparameter.

7.4 Results

Table 7.1 shows BLEU and RIBES for each factor configuration of the end-to-end systems. The best BLEU and the best RIBES are both obtained with the 1,000-class Brown-clustering factors.

7.4.1 Factors for Japanese-to-HFE Translation

We first evaluate the effect of factors on the Japanese-to-HFE translation step, scoring the output against HFE references produced by Head Finalization. Table 7.2 shows the results; a 50-class configuration achieves the highest BLEU.
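For reference, simplified sentence-level versions of the metrics in Eqs. (7.1) and (7.2) can be sketched as follows; real BLEU is corpus-level with a shared brevity penalty, and real RIBES computes Kendall's τ over automatically aligned unigrams rather than the naive position lookup used here:

```python
import math
from collections import Counter

# Simplified sentence-level BLEU (Eq. 7.1) and RIBES (Eq. 7.2).

def bleu(hyp, ref, N=4):
    log_p = 0.0
    for n in range(1, N + 1):
        h = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
        r = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        match = sum(min(c, r[g]) for g, c in h.items())  # clipped counts
        if match == 0:
            return 0.0
        log_p += (1.0 / N) * math.log(match / sum(h.values()))
    bp = min(1.0, math.exp(1 - len(ref) / len(hyp)))  # brevity penalty
    return bp * math.exp(log_p)

def kendall_tau(ranks):
    n = len(ranks)
    pairs = n * (n - 1) // 2
    if pairs == 0:
        return 0.0
    disc = sum(1 for i in range(n) for j in range(i + 1, n)
               if ranks[i] > ranks[j])  # discordant pairs
    return 1.0 - 2.0 * disc / pairs

def ribes(hyp, ref, alpha=0.25):
    # toy unigram "alignment": position of each hypothesis word in the
    # reference, assuming each word occurs at most once
    ranks = [ref.index(w) for w in hyp if w in ref]
    precision = len(ranks) / len(hyp)
    return (kendall_tau(ranks) + 1) / 2 * precision ** alpha
```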
                               BLEU   RIBES
  PBMT                         16.95  65.23
  PBMT + …                     16.68  64.54
  PBMT + … (50)                17.36  65.25
  PBMT + … (1,000)             17.56  65.88
  PBMT + … (50)                17.41  65.23
  PBMT + … (1,000)             17.47  65.50
  Post-ordering                16.22  65.73
  Post-ordering + …            16.22  65.77
  Post-ordering + … (50)       16.69  65.39
  Post-ordering + … (1,000)    16.16  65.89
  Post-ordering + … (50)       16.55  65.45
  Post-ordering + … (1,000)    16.79  65.99

Table 7.1: Japanese-to-English translation results

7.4.2 Factors for HFE-to-English Reordering

We next evaluate the effect of factors on each step of post-ordering in isolation. Table 7.2 shows the Japanese-to-HFE translation step scored against HFE references, and Table 7.3 shows the HFE-to-English reordering step, whose input is HFE obtained by Head Finalization from the reference English. In Table 7.3, the 1,000-class configurations give the highest BLEU and RIBES, and the 50-class configurations also improve BLEU over the baseline. Table 7.4 examines the effect on BLEU of combining factors.

               BLEU   RIBES
  PBMT         15.65  68.35
  + …          16.06  68.62
  + … (50)     16.32  68.36
  + … (1,000)  15.61  68.39
  + … (50)     16.17  68.06
  + … (1,000)  16.09  68.44

Table 7.2: Japanese-to-HFE translation results

               BLEU   RIBES
  PBMT         59.69  82.31
  + …          58.85  81.85
  + … (50)     60.09  82.27
  + … (1,000)  60.74  83.36
  + … (50)     60.08  82.58
  + … (1,000)  60.99  83.16

Table 7.3: HFE-to-English reordering results

                               BLEU   RIBES
  Post-ordering                16.22  65.73
  + …                          16.22  65.77
  + … (50)                     16.69  65.39
  + … (1,000)                  16.16  65.89
  + … (50)                     16.55  65.45
  + … (1,000)                  16.79  65.99
  + … (50) & … (1,000)         17.16  65.69

Table 7.4: Results of combining factors

Combining the 50-class and 1,000-class factors yields the highest BLEU (17.16), improving over either granularity alone, although the 1,000-class factor alone gives the best RIBES (65.99).

7.4.3 Discussion

Table 7.5 compares the post-ordering configurations. With factored translation models, BLEU improves in most configurations, and the 1,000-class configurations give the best RIBES; the gains observed on the individual steps in Tables 7.2 and 7.3 largely carry over to the end-to-end results.

                          BLEU   RIBES
  Post-ordering           16.22  65.73
  + …                     16.22  65.77
  + … (50)                16.69  65.39
  + … (1,000)             16.16  65.89
  + … (50)                16.55  65.45
  + … (1,000)             16.79  65.99
  …                       16.32  64.64
  … + …                   17.16  64.64
  … + … (50)              16.62  63.84
  … + … (1,000)           17.01  65.36
  … + … (50)              16.84  64.39
  … + … (1,000)           17.43  65.25

Table 7.5: …
8 Conclusion

In this thesis we applied factored translation models to post-ordering for Japanese to English translation. We annotated the Head-Final English side with part-of-speech tags and with word classes obtained by Brown clustering (50 and 1,000 classes), and evaluated the systems with BLEU, an n-gram based metric, and RIBES, a rank-correlation based metric. As future work, other sources of word classes, such as representations learned by deep learning, could serve as factors, and further factor combinations for HFE could be explored.
References

[1] Peter F. Brown, Peter V. deSouza, Robert L. Mercer, Vincent J. Della Pietra, and Jenifer C. Lai. Class-based n-gram models of natural language. Computational Linguistics, Vol. 18, No. 4, pp. 467–479, 1992.
[2] Peter F. Brown, Vincent J. Della Pietra, Stephen A. Della Pietra, and Robert L. Mercer. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics, Vol. 19, No. 2, pp. 263–311, 1993.
[3] Michael Collins, Philipp Koehn, and Ivona Kučerová. Clause restructuring for statistical machine translation. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics, pp. 531–540, 2005.
[4] Michel Galley and Christopher D. Manning. A simple and effective hierarchical phrase reordering model. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 848–856, 2008.
[5] Hideki Isozaki, Tsutomu Hirao, Kevin Duh, Katsuhito Sudoh, and Hajime Tsukada. Automatic evaluation of translation quality for distant language pairs. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pp. 944–952, 2010.
[6] Hideki Isozaki, Katsuhito Sudoh, Hajime Tsukada, and Kevin Duh. Head finalization: A simple reordering rule for SOV languages. In Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR, pp. 244–251, 2010.
[7] Jason Katz-Brown and Michael Collins. Syntactic reordering in preprocessing for Japanese to English translation: MIT system description for NTCIR-7 patent translation task. In Proceedings of the NTCIR-7 Workshop Meeting, 2008.
[8] Reinhard Kneser and Hermann Ney. Improved backing-off for m-gram language modeling. In Proceedings of the 1995 International Conference on Acoustics, Speech, and Signal Processing, Vol. 1, pp. 181–184. IEEE, 1995.
[9] Philipp Koehn and Hieu Hoang. Factored translation models. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pp. 868–876, 2007.
[10] Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, et al. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, Demonstration Session, pp. 177–180, 2007.
[11] Philipp Koehn, Franz Josef Och, and Daniel Marcu. Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, Volume 1, pp. 48–54, 2003.
[12] Yusuke Miyao and Jun'ichi Tsujii. Feature forest models for probabilistic HPSG parsing. Computational Linguistics, Vol. 34, No. 1, pp. 35–80, 2008.
[13] Franz Josef Och. Minimum error rate training in statistical machine translation. In Proceedings of the 41st Annual Meeting on Association for Computational Linguistics, Volume 1, pp. 160–167, 2003.
[14] Franz Josef Och and Hermann Ney. Discriminative training and maximum entropy models for statistical machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pp. 295–302, 2002.
[15] Franz Josef Och and Hermann Ney. A systematic comparison of various statistical alignment models. Computational Linguistics, Vol. 29, No. 1, pp. 19–51, 2003.
[16] Franz Josef Och, Christoph Tillmann, Hermann Ney, et al. Improved alignment models for statistical machine translation. In Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, pp. 20–28, 1999.
[17] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pp. 311–318, 2002.
[18] Katsuhito Sudoh, Xianchao Wu, Kevin Duh, Hajime Tsukada, and Masaaki Nagata. Post-ordering in statistical machine translation. In Proceedings of MT Summit XIII, 2011.
[19] Ian H. Witten and Timothy C. Bell. The zero-frequency problem: Estimating the probabilities of novel events in adaptive text compression. IEEE Transactions on Information Theory, Vol. 37, No. 4, pp. 1085–1094, 1991.