a) wada@sap.ist.i.kyoto-u.ac.jp  b) yoshiaki@sap.ist.i.kyoto-u.ac.jp  c) enakamura@sap.ist.i.kyoto-u.ac.jp  d) itoyama@sap.ist.i.kyoto-u.ac.jp  e) yoshii@sap.ist.i.kyoto-u.ac.jp

1. Introduction

Commercial karaoke systems play back accompaniments from MIDI data that are transcribed manually from CD recordings, which makes the preparation of each song costly. At the same time, a large amount of consumer generated music (CGM) is published on the Web (for example, the songs indexed by the Songrium service [1]), and preparing MIDI data for all of these songs is impractical. This paper therefore proposes an audio-to-audio karaoke accompaniment system that needs no symbolic (MIDI) score: robust nonnegative matrix factorization (RNMF) separates an existing music recording into a singing voice and an accompaniment, online dynamic time warping (DTW) aligns the user's singing with the separated singing voice, and the accompaniment is played back at a tempo that follows the user.

[Fig. 1: concept of the proposed system; the F0 of the user's singing is tracked while the user sings along.]

(c) 2017 Information Processing Society of Japan
[Fig. 2: overview of the proposed system (input singing, source separation, stretch-rate estimation, output accompaniment).]

2. Related Work

2.1 Karaoke systems
Tachibana et al. [3] developed a real-time audio-to-audio karaoke generation system for monaural recordings based on singing voice suppression and key conversion. Inoue et al. [4] proposed a MIDI-based adaptive karaoke system in which the accompaniment follows the user's singing by means of speech recognition.

2.2 Automatic accompaniment
Automatic accompaniment of musical performances has a long history [5-10]. Dannenberg [5] and Vercoe [6] proposed pioneering online algorithms for real-time accompaniment. Raphael [7] segmented acoustic musical signals with hidden Markov models (HMMs), Cont [8] proposed a duration-focused coupled HMM architecture, and Nakamura et al. [9] used a hidden semi-Markov model (HSMM) that can follow performances containing errors and arbitrary repeats and skips. Montecchio et al. [10] unified real-time audio-to-score and audio-to-audio alignment with sequential Monte Carlo inference.

2.3 Lyrics-to-audio alignment
Alignment between lyrics and music signals has also been studied [11-15]. Gong et al. [11] aligned singing voice to scores in real time using melody and lyric information, LyricSynchronizer [12] used an HSMM-based method, Iskandar et al. [13] aligned lyrics to music signals at the syllabic level, and Wang et al. [14] proposed LyricAlly for synchronizing textual lyrics with acoustic signals.
Dzhambazov et al. [15] modeled phoneme durations with mel-frequency cepstral coefficients (MFCCs) and an HMM to align polyphonic audio with lyrics.

2.4 Singing voice separation
Singing voice separation from monaural music signals is an active topic [16-19]. Huang et al. [16] used robust principal component analysis (RPCA), and Ikemiya et al. [17] improved RPCA-based separation by combining it with vocal F0 estimation. Rafii et al. [18] exploited the repeating structure of music through a similarity matrix, Yang [19] formulated Bayesian singing-voice separation, and deep recurrent neural networks have also been applied [20].

3. Proposed Method

3.1 Overview
[Figs. 3 and 4: processing flow of the proposed system.]
The system first separates the target song into a singing voice and an accompaniment, then extracts F0 and MFCC features from both the separated singing voice and the user's singing, aligns the two singing voices by audio-to-audio matching, and finally estimates frame-wise stretch rates for playing back the accompaniment in synchrony with the user. Section 3.3 describes the separation, Section 3.4 the alignment, and Section 3.5 the time-stretching.

3.3 Singing voice separation
To separate the singing voice from the accompaniment we use variational Bayesian robust NMF (VB-RNMF) [2].
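As a rough illustration of the robust NMF idea (the paper's VB-RNMF performs full variational Bayesian inference; the sketch below is a simplified, non-Bayesian multiplicative-update variant, and all function and parameter names are ours), the spectrogram Y is decomposed into a low-rank part WH plus a sparse part S under the KL objective:

```python
import numpy as np

rng = np.random.default_rng(0)

def robust_nmf_kl(Y, K=8, n_iter=200, sparsity=0.5, eps=1e-9):
    """Decompose Y ~ WH + S (low-rank + sparse) under the KL objective.

    Simplified multiplicative-update illustration, NOT the paper's
    variational Bayesian algorithm. `sparsity` penalizes the L1 norm of S.
    """
    F, T = Y.shape
    W = rng.random((F, K)) + eps  # basis spectra
    H = rng.random((K, T)) + eps  # activations
    S = rng.random((F, T)) + eps  # sparse (singing-voice) component
    for _ in range(n_iter):
        V = W @ H + S
        R = Y / (V + eps)
        W *= (R @ H.T) / (H.sum(axis=1)[None, :] + eps)
        V = W @ H + S
        R = Y / (V + eps)
        H *= (W.T @ R) / (W.sum(axis=0)[:, None] + eps)
        V = W @ H + S
        S *= (Y / (V + eps)) / (1.0 + sparsity)  # L1 penalty shrinks S
    return W, H, S
```

In the paper's setting, Y would be the amplitude spectrogram of the song, WH would approximate the accompaniment, and S the singing voice.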
In contrast to RPCA-based methods [16-19], VB-RNMF models the amplitude spectrogram Y = [y_1, ..., y_T] of the input song as the sum of a low-rank accompaniment spectrogram L = [l_1, ..., l_T] and a sparse singing-voice spectrogram S = [s_1, ..., s_T]:

    y_t \approx l_t + s_t.    (1)

The low-rank part L is factorized into K basis spectra W = [w_1, ..., w_K] and their activations H = [h_1, ..., h_T]:

    y_t \approx W h_t + s_t.    (2)

Assuming a Poisson likelihood P (which corresponds to the Kullback-Leibler (KL) divergence as a cost function),

    p(Y | W, H, S) = \prod_{f,t} P( y_{ft} | \sum_k w_{fk} h_{kt} + s_{ft} ),    (3)

gamma priors G are placed on W and H:

    p(W | \alpha^{wh}, \beta^{wh}) = \prod_{f,k} G( w_{fk} | \alpha^{wh}, \beta^{wh} ),    (4)
    p(H | \alpha^{wh}, \beta^{wh}) = \prod_{k,t} G( h_{kt} | \alpha^{wh}, \beta^{wh} ),    (5)

where \alpha^{wh} and \beta^{wh} are hyperparameters. The sparse component S is given a gamma prior whose scale follows a Jeffreys hyperprior:

    p(S | \alpha^s, \beta^s) = \prod_{f,t} G( s_{ft} | \alpha^s, \beta^s_{ft} ),    (6)
    p(\beta^s_{ft}) \propto (\beta^s_{ft})^{-1}.    (7)

Under the model (3)-(7), the posterior distributions of W, H, and S are estimated by variational Bayesian inference, and the separated singing voice is obtained from S.

[Fig. 5: example spectrogram of the separated singing voice.]
[Fig. 6: example of an online DTW alignment path (T = 8, c = 4, MaxRunCount = 4).]

3.4 Audio-to-audio alignment
The user's singing is aligned with the separated singing voice by online DTW [21]. Two features are used: the F0 trajectory and MFCCs. The F0 is estimated by subharmonic summation [22]. For the user's singing X = {x_1, ..., x_T} and the separated singing voice Y = {y_1, ..., y_T}, we extract the F0 sequences f_X = {f^(x)_1, ..., f^(x)_T} and f_Y = {f^(y)_1, ..., f^(y)_T} and the MFCC sequences m_X = {m^(x)_1, ..., m^(x)_T} and m_Y = {m^(y)_1, ..., m^(y)_T}.
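Subharmonic summation [22] sums the spectral magnitudes at integer multiples of each candidate F0 with exponentially decaying weights and picks the candidate with the largest sum. A minimal single-frame sketch (our own simplification; the parameter values are illustrative, not the paper's):

```python
import numpy as np

def shs_f0(frame, sr, fmin=80.0, fmax=500.0, n_harm=5, decay=0.8):
    """Single-frame F0 estimate by subharmonic summation (simplified sketch)."""
    spec = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    n = len(frame)
    candidates = np.arange(fmin, fmax, 1.0)  # 1 Hz grid of F0 candidates
    salience = np.zeros(len(candidates))
    for k, f0 in enumerate(candidates):
        for h in range(1, n_harm + 1):
            b = int(round(h * f0 * n / sr))  # FFT bin of the h-th harmonic
            if b < len(spec):
                salience[k] += (decay ** (h - 1)) * spec[b]
    return float(candidates[np.argmax(salience)])
```

Feeding a harmonic tone at 220 Hz should yield an estimate near 220 Hz; the decaying weights suppress the octave-below (subharmonic) candidate.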
The MFCCs are 12-dimensional, so concatenating the F0 with the MFCCs yields 13-dimensional feature sequences X = {x_i}_{i=1}^T = {(f^(x)_i, m^(x)_i)}_{i=1}^T and Y = {y_j}_{j=1}^T = {(f^(y)_j, m^(y)_j)}_{j=1}^T. These are aligned by the online DTW of Dixon [21], which incrementally evaluates part of the cost matrix D = {d_{i,j}} (i = 1, ..., T; j = 1, ..., T) as new frames arrive (Algorithm 1). Each element is given by

    d_{i,j} = ||x_i - y_j|| + min( d_{i,j-1}, d_{i-1,j}, d_{i-1,j-1} ),    (8)

where ||x_i - y_j|| = \sqrt{ \sum_{k=1}^{13} (x_{ik} - y_{jk})^2 } is the Euclidean distance between the 13-dimensional feature vectors. The function GetInc (Algorithm 2) decides whether to advance along the user's singing (Row), along the reference (Column), or both; only a band of width c around the current point (t, j) is evaluated, and the counter runcount, bounded by MaxRunCount, prevents too many consecutive steps in the same direction. In our system we set c = 4 and MaxRunCount = 3 (with T = 300).

Algorithm 1 Online DTW
  t <- 0, j <- 0; evaluate d_{t,j} at (t, j)
  while t < T and j < T do
    if GetInc(t, j) != Column then
      t <- t + 1
      for k = j - c + 1, ..., j do
        if k > 0 then compute d_{t,k} by Eq. (8)
    if GetInc(t, j) != Row then
      j <- j + 1
      for k = t - c + 1, ..., t do
        if k > 0 then compute d_{k,j} by Eq. (8)
    if GetInc(t, j) == previous then
      runcount <- runcount + 1
    else
      runcount <- 1
    if GetInc(t, j) != Both then
      previous <- GetInc(t, j)
  end while

Algorithm 2 Function GetInc(t, j)
  if t < c then return Both
  if runcount > MaxRunCount then
    if previous == Row then return Column else return Row
  (x, y) <- arg min d(k, l), where k == t or l == j
  if x < t then return Row
  else if y < j then return Column
  else return Both

The alignment yields a path L = {(i_1, j_1), ..., (i_l, j_l)} (0 <= i_k <= i_{k+1} <= T, 0 <= j_k <= j_{k+1} <= T), where a pair (i_k, j_k) indicates that frame x_{i_k} of X is matched with frame y_{j_k} of Y.

3.5 Time-stretching of the accompaniment
From the path L we compute frame-wise stretch rates R = {r_1, ..., r_T}; for frame k,

    r_k = |{i_1, ..., i_l}_k| / |{j_1, ..., j_l}_k|,    (9)

i.e., the ratio between the number of user-side indices and the number of reference-side indices that the path assigns to frame k.
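To make Eqs. (8) and (9) concrete, the following sketch computes the full cost matrix of Eq. (8), reads off the optimal path by backtracking, and derives per-frame stretch rates from the path. This is an offline simplification: the on-line, bounded-window algorithm of Dixon [21] (Algorithms 1 and 2) avoids filling the whole matrix. Function names and the exact reading of Eq. (9) are our own assumptions.

```python
import numpy as np

def dtw_path(X, Y):
    """Offline DTW per Eq. (8): cumulative cost plus backtracked path.

    X, Y: (T, d) and (U, d) feature sequences (in the paper, d = 13:
    F0 plus 12 MFCCs). Returns a list of matched (i, j) frame pairs.
    """
    T, U = len(X), len(Y)
    cost = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=-1)  # Euclidean
    D = np.full((T + 1, U + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, T + 1):
        for j in range(1, U + 1):
            # Eq. (8): local cost plus the best of the three predecessors
            D[i, j] = cost[i - 1, j - 1] + min(D[i, j - 1], D[i - 1, j],
                                               D[i - 1, j - 1])
    path = [(T - 1, U - 1)]
    i, j = T, U
    while (i, j) != (1, 1):
        i, j = min([(i - 1, j - 1), (i - 1, j), (i, j - 1)],
                   key=lambda p: D[p])
        path.append((i - 1, j - 1))
    return path[::-1]

def stretch_rates(path, n_ref):
    """One reading of Eq. (9): r_k = number of user frames the path
    aligns to reference frame k (count ratio per frame)."""
    return [sum(1 for (i, j) in path if j == k) for k in range(n_ref)]
```

Aligning a sequence with itself gives the diagonal path and unit stretch rates, i.e., no tempo change.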
The accompaniment is then time-stretched frame by frame according to the rates R = {r_1, ..., r_T} with a phase vocoder [23] and played back in synchrony with the user's singing.

4. Evaluation
We conducted a subjective evaluation with four songs, including a TV-commercial (CM) song and a song by ASIAN KUNG-FU GENERATION. [Tables 1 and 2: the songs and the four evaluation criteria used in the experiment.]

5. Conclusion
This paper presented an automatic karaoke accompaniment system that combines singing voice separation by VB-RNMF with audio-to-audio alignment of singing voices by online DTW, so that existing music recordings can be used for accompaniment without MIDI data.

Acknowledgments: This study was supported in part by JSPS KAKENHI Grant Numbers 26700020, 24220006, 26280089, 15K16654, 16H01744, and 16J05486, and by JST ACCEL Grant Number JPMJAC1602.

References
[1] Hamasaki, M. et al.: Songrium: Browsing and Listening Environment for Music Content Creation Community, Proc. SMC, pp. 23-30 (2015).
[2] Bando, Y. et al.: Variational Bayesian Multi-channel Robust NMF for Human-voice Enhancement with a Deformable and Partially-occluded Microphone Array, Proc. EUSIPCO, pp. 1018-1022 (2016).
[3] Tachibana, H. et al.: A Real-time Audio-to-audio Karaoke Generation System for Monaural Recordings Based on Singing Voice Suppression and Key Conversion Techniques, J. IPSJ, Vol. 24, No. 3, pp. 470-482 (2016).
[4] Inoue, W. et al.: Adaptive Karaoke System: Human Singing Accompaniment Based on Speech Recognition, Proc. ICMC, pp. 70-77 (1994).
[5] Dannenberg, R. B.: An On-Line Algorithm for Real-Time Accompaniment, Proc. ICMC, pp. 193-198 (1984).
[6] Vercoe, B.: The Synthetic Performer in The Context of Live Performance, Proc. ICMC, pp. 199-200 (1984).
[7] Raphael, C.: Automatic Segmentation of Acoustic Musical Signals Using Hidden Markov Models, IEEE Trans. on PAMI, Vol. 21, No. 4, pp. 360-370 (1999).
[8] Cont, A.: A Coupled Duration-focused Architecture for Realtime Music to Score Alignment, IEEE Trans. on PAMI, Vol. 32, No. 6, pp. 974-987 (2010).
[9] Nakamura, T. et al.: Real-Time Audio-to-Score Alignment of Music Performances Containing Errors and Arbitrary Repeats and Skips, IEEE/ACM TASLP, Vol. 24, No. 2, pp. 329-339 (2016).
[10] Montecchio, N. et al.: A Unified Approach to Real Time Audio-to-Score and Audio-to-Audio Alignment Using Sequential Montecarlo Inference Techniques, Proc. ICASSP (2011).
[11] Gong, R. et al.: Real-time Audio-to-Score Alignment of Singing Voice Based on Melody and Lyric Information, Proc. Interspeech (2015).
[12] Fujihara, H. et al.: LyricSynchronizer: Automatic Synchronization System between Musical Audio Signals and Lyrics, IEEE Journal of Selected Topics in Signal Processing, pp. 1252-1261 (2011).
[13] Iskandar, D. et al.: Syllabic Level Automatic Synchronization of Music Signals and Text Lyrics, Proc. ACMMM, pp. 659-662 (2006).
[14] Wang, Y. et al.: LyricAlly: Automatic Synchronization of Textual Lyrics to Acoustic Music Signals, IEEE TASLP, Vol. 16, No. 2, pp. 338-349 (2008).
[15] Dzhambazov, G. et al.: Modeling of Phoneme Durations for Alignment between Polyphonic Audio and Lyrics, Proc. SMC, pp. 281-286 (2015).
[16] Huang, P.-S. et al.: Singing-Voice Separation from Monaural Recordings Using Robust Principal Component Analysis, Proc. IEEE ICASSP, pp. 57-60 (2012).
[17] Ikemiya, Y. et al.: Singing Voice Separation and Vocal F0 Estimation Based on Mutual Combination of Robust Principal Component Analysis and Subharmonic Summation, IEEE/ACM TASLP, Vol. 24, No. 11, pp. 2084-2095 (2016).
[18] Rafii, Z. et al.: Music/Voice Separation Using The Similarity Matrix, Proc. ISMIR, pp. 583-588 (2012).
[19] Yang, P.-K. et al.: Bayesian Singing-Voice Separation, Proc. ISMIR, pp. 507-512 (2014).
[20] Huang, P.-S. et al.: Singing-Voice Separation from Monaural Recordings Using Deep Recurrent Neural Networks, Proc. ISMIR, pp. 477-482 (2014).
[21] Dixon, S.: An On-Line Time Warping Algorithm for Tracking Musical Performances, Proc. IJCAI, pp. 1727-1728 (2005).
[22] Hermes, D. J.: Measurement of Pitch by Subharmonic Summation, J. ASA, Vol. 83, No. 1, pp. 257-264 (1988).
[23] Flanagan, J. et al.: Phase Vocoder, Bell System Technical Journal, Vol. 45, pp. 1493-1509 (1966).