2 Sequence Alignment as a Set Partitioning Problem Masaaki Nishino,JunSuzuki, Shunji Umetani, Tsutomu Hirao and Masaaki Nagata Sequence alignment, which involves aligning elements of two given sequences, occurs in many natural language processing (NLP) tasks such as sentence alignment. Previous approaches for solving sequence alignment problems in NLP can be categorized into two groups. The first group assumes monotonicity of alignments; the second group does not assume monotonicity or consider the continuity of alignments. However, for example, in aligning sentences of parallel legal documents, it is desirable to use a sentence alignment method that does not assume monotonicity but can consider continuity. Herein, we present a method to align sequences where block-wise changes in the order of sequence elements exist. Our method formalizes a sequence alignment problem as a set partitioning problem, a type of combinatorial optimization problem, and solves the problem to obtain an alignment. We also propose an efficient algorithm to solve the optimization problem by applying column generation. Key Words: Sequence Alignment, Combinatorial Optimization, Column Generation NTT, NTT Communication Science Laboratories, Nippon Telegraph and Telephone Corporation, Graduate School of Information Science and Technology, Osaka University
Vol. 23 No. 2 March 2016 1 2 DNA RNA (Moore 2002; Braune and Fraser 2010; Quan, Kit, and Song 2013) (Nie, Simard, Isabelle, and Durand 1999) (Qu and Liu 2012; 2015; 2012) F E F i f i E j e j F i +1 E j +1 (Qu and Liu 2012; 2015; 2012) 1 2 176
1 (Quan et al. 2013) Wikipedia f i e j f i+1 e j 1 F f i f i+1 E 2 (Quan et al. 2013) Bilingual Laws Information System (BLIS) 2 1 4 f i e j f i+1 e j f i e j f i+1 e j 2 http://www.legistlation.gov.hk 177
Vol. 23 No. 2 March 2016 2 BLIS F E (Korte and Vygen 2008) F E S S 1,...,S N S D {S 1,...,S N } D S S = Si DS i i j S i,s j D S i S j = 2 F, E S 1,...,S N S 1,...,S N F, E F, E 178
F, E 3 O( F 2 E 2 ) NP (Lübbecke and Desrosiers 2005) 2 (Gale and Church 1993) (Moore 2002; Braune and Fraser 2010) Deng (Deng, Kumar, and Byrne 2007) Deng Quan (Quan et al. 2013) 3 F, E 4 179
Vol. 23 No. 2 March 2016 Quan Quan Quan (Qu and Liu 2012) ( 2015) ( 2012) Brown (Brown, Petra, Pietra, and Mercer 1993) Wu (Wu 1997) (Brown et al. 1993) 3 F, E s F, t E S(s, t) 4 4 180
S(s, t) Moore (Moore 2002) s F s t E t S(s, t) ( S(s, t) = P (m mt )( s,m t ) m s ms ) (m s +1) m tr(t t j s i ) u(s i ) (1) j=1 i=1 m s, m t s, t s i, t j s i t j tr(t j s i ) s i t j u(s i ) s i P (m s,m t ) i=1 P (m s,m t )= exp ( m sr)(m s r) mt m t! (2) r (Brown et al. 1993) 4 (1) (2) (1) (2) 3 3 1 181
Vol. 23 No. 2 March 2016 3 3 2 F, E F E f i F i e k E k F i j f ij F 1 i j F e kl E E k l 1 k l E a ijkl (f ij,e kl ) f(a ijkl )=f ij, e(a ijkl )=e kl f ij e kl seqmatch(f ij,e kl ) f ij e kl X, A ijkl seqmatch(f ij,e kl ) seqmatch(f ij,e kl )= max X A ijkl (s,t) X S(s, t) (3) 4.1 seqmatch(f ij,e kl ) f ij e kl seqmatch(f ij,e kl ) a ijkl M M A M A a, a A f(a) f(a )= e(a) e(a )= a A f(a) =F a A e(a) =E 182
A A Â =argmax{score(a)} (4) A A Â score(a) F E A score(a) =λ K a A seqmatch(f(a),e(a)) (5) K A λ 0 <λ 1 λ =1 λ λ 1 (5) score(a) a (ILP) ILP maximize ijkl +logλ)y ijkl (6) ijkl(w subject to y ijkl =1 x :1 x F (7) i,j:i x j kl ij k,l:k x l y ijkl =1 x :1 x E (8) y ijkl {0, 1} i, j, k, l (9) w ijkl log seqmatch(f ij,e kl ) y ijkl a ijkl y ijkl =1 a ijkl (7) F x a ijkl 1 (8) E 2 F E 1 (f ij,e kl ) 183
Vol. 23 No. 2 March 2016 5 5 (6) (9) F E ( F +1)( E +1)/4 (Lübbecke and Desrosiers 2005) (6) (9) a ijkl {0, 1} 0 a ijkl 1 (Master problem: MP) M MP (Restricted master problem: RMP) RMP M M RMP RMP f n u n, e m v m RMP RMP RMP MP a ijkl M\M a ijkl 5 f 1 -e 4, f 2 -e 3, f 3 -e 2, f 4 -e 1 (3) seqmatch(f ij,e kl ) 184
j l w ijkl = w ijkl +logλ û n ˆv m (10) n=i m=k û n RMP u n ˆv m v m w ijkl RMP RMP MP MP RMP MP F E ( F +1)( E +1)/4 w ijkl 3 (10) û n,ˆv m Smith- Waterman (Smith and Waterman 1981) Smith-Waterman N, M O(NM) 6 q[j, l] f j,e l w ijkl (1 i j, 1 k l) q[j, l] log λ q[j 1,l 1] + S(f j,e l ) û j ˆv l q[j, l] =max (11) q[j 1,l]+S(f j ) û j q[j, l 1] + S(e l ) ˆv l q[0, 0] = log λ S(f j,e l ), S(f j ), S(e l ) (Moore 2002) f j e l f j E 6 185
Vol. 23 No. 2 March 2016 e l F log λ f j+1, e l+1 f j, e l 1 j F, 1 l E q[j, l] q[j, l] (11) q[j, l] x ijkl (11) 4 3 q[j 1,l 1], q[j 1,l], q[j, l 1] log λ q[j,l ] i = j +1 k = l +1 i k q[j, l] O( F E ) O( F + E ) Smith-Waterman 4 RMP M = {x 1 F,1 E } (line 1) x 1 F,1 E RMP RMP 7 (line 3) Smith-Waterman (line 4) (line 5) MP RMP (line 8) RMP MP 4 7 RMP RMP 186
6 6.1 25,000 2,500 K 60 K =1, 3, 6, 12, 20 K =1 60 40 20 K =3, 6, 12 2/3 60 40 20 K =3, 6, 12 1/3 Moore (Moore) (BM) (1) Moore(Moore 2002) (recall) (precision) F (F-measure) K 5 GIZA++(Och and Ney 2003) ILOG CPLEX λ λ =0.1 λ =0.01 2 187
Vol. 23 No. 2 March 2016 6.2 1, 2 3 SP + ILP SP + CG BM Moore Moore (Moore 2002) K =1 K λ =0.1 2 F K =1 Moore F 0.003 λ 1 F K =1 K =3 K =6 F F F SP+ILP (λ =0.1) 0.961 0.866 0.911 0.986 0.917 0.950 1.000 0.822 0.902 SP+CG (λ =0.1) 0.961 0.872 0.914 0.986 0.923 0.954 1.000 0.815 0.898 SP+ILP (λ =0.01) 0.961 0.872 0.914 0.986 0.923 0.954 0.804 0.831 0.817 SP+CG (λ =0.01) 0.961 0.872 0.914 0.928 0.908 0.918 0.833 0.843 0.838 BM 0.955 0.765 0.849 0.986 0.853 0.915 1.000 0.730 0.844 Moore 0.961 0.872 0.914 0.585 0.873 0.701 0.524 0.743 0.614 K =12 K =20 F F SP+ILP (λ =0.1) 0.990 0.764 0.862 0.986 0.731 0.840 SP+CG (λ =0.1) 0.991 0.769 0.866 0.986 0.743 0.847 SP+ILP (λ =0.01) 0.609 0.776 0.682 0.504 0.697 0.585 SP+CG (λ =0.01) 0.536 0.776 0.634 0.441 0.659 0.528 BM 0.990 0.694 0.816 0.986 0.623 0.763 Moore 0.506 0.776 0.613 0.400 0.694 0.508 2 F K =3 K =6 K =12 F F F SP+ILP (λ =0.1) 0.989 0.876 0.929 0.990 0.869 0.926 0.988 0.758 0.858 SP+CG (λ =0.1) 0.989 0.876 0.929 0.990 0.843 0.911 0.988 0.761 0.859 SP+ILP (λ =0.01) 0.989 0.883 0.933 0.879 0.896 0.887 0.721 0.790 0.754 SP+CG (λ =0.01) 0.989 0.899 0.942 0.801 0.872 0.835 0.703 0.806 0.751 BM 0.989 0.772 0.867 0.990 0.794 0.881 0.988 0.674 0.801 Moore 0.638 0.907 0.749 0.707 0.827 0.762 0.486 0.719 0.580 188
3 F K =3 K =6 K =12 F F F SP+ILP (λ =0.1) 1.000 0.810 0.895 0.990 0.760 0.860 0.990 0.799 0.884 SP+CG (λ =0.1) 1.000 0.792 0.884 0.990 0.760 0.860 0.990 0.791 0.879 SP+ILP (λ =0.01) 1.000 0.826 0.905 0.910 0.805 0.854 0.749 0.783 0.766 SP+CG (λ =0.01) 0.987 0.870 0.925 0.748 0.805 0.775 0.762 0.798 0.779 BM 1.000 0.703 0.825 0.978 0.673 0.797 0.990 0.678 0.805 Moore 0.684 0.885 0.772 0.655 0.796 0.719 0.484 0.685 0.567 K =1, 3 λ =0.01 λ =0.1 F K =6, 12, 20 λ =0.1 λ λ K λ =0.01 K λ =0.1 Moore F F 0.08 λ =0.1 0.03 1 F 4 CPLEX 30 400 100 5 10 6 10 3 189
Vol. 23 No. 2 March 2016 4 (a) (λ =0.1) k =1 K =3 K =6 K =12 K =20 SP+ILP 3.60 10 2 5.57 10 2 6.08 10 2 6.41 10 2 4.46 10 2 SP+CG 1.48 2.35 1.32 1.19 0.79 (b) (λ =0.01) k =1 K =3 K =6 K =12 K =20 SP+ILP 3.55 10 2 5.51 10 2 5.91 10 2 5.83 10 2 4.07 10 2 SP+CG 1.63 7.88 3.92 3.49 3.72 (c) (λ =0.1) K =3 K =6 K =12 SP+ILP 8.80 10 9.04 10 1.10 10 3 SP+CG 2.03 1.09 2.31 (e) (λ =0.1) K =3 K =6 K =12 SP+ILP 4.11 10 2 4.26 10 2 4.61 10 2 SP+CG 1.58 1.38 1.09 (d) (λ =0.01) K =3 K =6 K =12 SP+ILP 6.12 10 8.11 10 7.05 10 SP+CG 1.96 1.30 2.24 (f) (λ =0.01) K =3 K =6 K =12 SP+ILP 4.04 10 2 4.04 10 2 4.01 10 2 SP+CG 3.83 3.98 3.99 NP 5 3,600 190
5 (a) (λ =0.1) k =1 K =3 K =6 K =12 K =20 SP+ILP 3.35 10 6 3.35 10 6 3.35 10 6 3.35 10 6 3.35 10 6 SP+CG 9.39 10 2 1.02 10 3 8.31 10 2 7.38 10 2 7.00 10 2 (b) (λ =0.01) k =1 K =3 K =6 K =12 K =20 SP+ILP 3.35 10 6 3.35 10 6 3.35 10 6 3.35 10 6 3.35 10 6 SP+CG 1.02 10 3 1.50 10 3 1.10 10 3 1.12 10 3 1.21 10 3 (c) (λ =0.1) K =3 K =6 K =12 SP+ILP 1.50 10 6 1.50 10 6 1.50 10 6 SP+CG 7.14 10 2 7.18 10 2 5.90 10 2 (e) (λ =0.1) K =3 K =6 K =12 SP+ILP 3.35 10 6 3.35 10 6 3.35 10 6 SP+CG 9.30 10 2 8.89 10 2 7.65 10 2 (d) (λ =0.01) K =3 K =6 K =12 SP+ILP 1.50 10 6 1.50 10 6 1.50 10 6 SP+CG 9.24 10 2 8.35 10 2 9.09 10 2 (f) (λ =0.01) K =3 K =6 K =12 SP+ILP 3.35 10 6 3.35 10 6 3.35 10 6 SP+CG 1.27 10 3 1.17 10 3 1.10 10 3 10,000 1,000 100 10 1 0.1 0.01 0 100 200 300 400 500 600 5 191
Vol. 23 No. 2 March 2016 7 Braune, F. and Fraser, A. (2010). Improved Unsupervised Sentence Alignment for Symmetrical and Asymmetrical Parallel Corpora. In Proceedings of COLING 2010, pp. 81 89. Brown, P. F., Petra, S. A. D., Pietra, V. J. D., and Mercer, R. L. (1993). The Mathematics of Statistical Machine Translation: Parameter Estimation. Computational Linguistics, 19 (2), pp. 263 311. Deng, Y., Kumar, S., and Byrne, W. (2007). Segmentation and Alignment of Parallel Text for Statistical Machine Translation. Natural Language Engineering, 13 (3), pp. 235 260. Gale, W. A. and Church, K. W. (1993). A Program for Aligning Sentences in Bilingual Corpora. Computational Linguistics, 19 (1), pp. 75 102. Korte, B. H. and Vygen, J. (2008). Combinatorial Optimization: Theory and Algorithms. Springer Verlag. Lübbecke, M. E. and Desrosiers, J. (2005). Selected Topics in Column Generation. Operations Research, 53 (6), pp. 1007 1023. Moore, R. C. (2002). Fast and Accurate Sentence Alignment of Bilingual Corpora. In Proceedings of AMTA 02, pp. 135 144. Nie, J.-Y., Simard, M., Isabelle, P., and Durand, R. (1999). Cross-language Information Retrieval Based on Parallel Texts and Automatic Mining of Parallel Texts from the Web. In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 74 81. ACM. Och, F. J. and Ney, H. (2003). A Systematic Comparison of Various Statistical Alignment Models. Computational Linguistics, 29 (1), pp. 19 51. 192
Qu, Z. and Liu, Y. (2012). Sentence Dependency Tagging in Online Question Answering Forums. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 554 562, Jeju Island, Korea. Association for Computational Linguistics. Quan, X., Kit, C., and Song, Y. (2013). Non-Monotonic Sentence Alignment via Semisupervised Learning. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 622 630, Sofia, Bulgaria. Association for Computational Linguistics. Smith, T. F. and Waterman, M. S. (1981). Identification of Common Molecular Subsequences. Journal of Molecular Biology, 147, pp. 195 197. (2012).., 19 (3), pp. 193 212. (2015).., 22 (1), pp. 27 58. Wu, D. (1997). Stochastic Inversion Transduction Grammars and Bilingual Parsing of Parallel Corpora. Computational Linguistics, 23 (3), pp. 377 403. 2008 ACL 1999 2001 2005 2008 2009 MIT CSAIL NTT ACL 1998 2002 INFORMS MOS AAAI 193
Vol. 23 No. 2 March 2016 1995 1997 NTT 2000 NTT ACL 1987 ACL 2015 9 2 2015 11 5 2015 12 3 194