Lecture 1 Introduction to speech signal processing in STRAIGHT vocoder Hideki Kawahara Emeritus Professor: Wakayama University, Japan Tianjin University, China, 9 December, 2016

Collaborators: Roy D. Patterson, Masanori Morise, Hideki Banno, Toshio Irino, Ryuichi Nisimura, Verena G. Skuk, Stefan Schweinberger, Parham Zolfaghari, Ken-Ichi Sakakibara, Ikuyo Masuda-Katsuse, Alain de Cheveigné, Josh McDermott, Osamu Fujimura, Toru Takahashi, Tomoki Toda, and many others.

Matlab

Link to STRAIGHT resources: TANDEM-STRAIGHT and morphing. Username: Tianjin-Lecture. Password: STRAIGHT (valid on 9 December, 2016). [Link]

Link to STRAIGHT resources: legacy-STRAIGHT. Username: Tianjin-Lecture. Password: STRAIGHT (valid on 9 December, 2016). [Link]

Topics: Application, STRAIGHT, Background

Summary (topics: Application, STRAIGHT, Background)
Interference-free representations play important roles.
Periodic excitation is an efficient and robust strategy for sampling and transmitting relevant information for communications using voice.
STRAIGHT is a collection of functions and applications.
Extended morphing provides a unique research strategy useful for para- and non-linguistic aspects of speech.

Interference
Vocal tract SHAPE information is mixed with interfering structure caused by the repetitive structure of voiced sounds.
Linear predictive analysis still suffers from estimation bias caused by the repetitive structure of voiced sounds.
Interference-free representations

Visualization: spectrogram (wide-band and narrow-band)
$$S(\omega,t) = \left| \int w(\tau - t)\, s(\tau)\, e^{-j\omega\tau}\, d\tau \right|^2$$

Matlab

http://www.wakayama-u.ac.jp/~kawahara/matlabrealtimespeechtools/

In case of speech
[Figure: source-filter diagram. Labels: voicing organ, voice source, source wave, repetition, fundamental frequency; articulator, filter, transfer function; mixing: nightmare of signal processing.]

STRAIGHT is a VOCODER
[Block diagram with three stages: analysis, physical attributes, synthesis. Analysis of input signal-1: spectral envelope analysis, F0 analysis, non-periodicity analysis. Physical attributes: spectral envelope, F0 (with F0 modification), nonperiodicity. Synthesis: periodic pulse generator, non-periodic component generator, shaper and mixer, spectral envelope filter, output signal. Legend: signal, parameter, data, process.]

STRAIGHT: legacy to TANDEM
Spectrum, instantaneous frequency, group delay

Interference-free representation of power spectrum
legacy-STRAIGHT [Kawahara et al. 1999]: complementary set of pitch-synchronized time windows, spline-based spectral smoothing, and inverse filtering. The original idea is by Kawahara (1997).
TANDEM-STRAIGHT [Kawahara et al. 2008]: F0-adaptive set of time windows separated by half a pitch period, and F0-adaptive smoothing followed by digital filter compensation based on consistent sampling.

Interference: power spectrum [DVD demonstration]

Interference-free representation of power spectrum
legacy-STRAIGHT [Kawahara et al. 1999]: complementary set of pitch-synchronized time windows, spline-based spectral smoothing, and inverse filtering.
TANDEM-STRAIGHT [Kawahara et al. 2008]: F0-adaptive set of time windows separated by half a pitch period, and F0-adaptive smoothing followed by digital filter compensation based on consistent sampling. The original time window idea is by Morise et al. (2007).

TANDEM-STRAIGHT: periodic pulse, power spectrum [Movie]

TANDEM: principle
[Equations lost in extraction. Surviving labels: signal model, arbitrary real numbers, windowing function, power spectrum, temporally varying term, fundamental period, TANDEM spectrum.]
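
The TANDEM idea can be checked numerically. The following is a minimal sketch in plain Python, not STRAIGHT code: it assumes a unit pulse train with a period of 40 samples and a Hann window spanning two periods (both choices are illustrative, as are all function names). It compares the temporal variation of a single-window power spectrum with the average of two power spectra whose windows are half a fundamental period apart.

```python
import cmath
import math

T0 = 40                      # fundamental period in samples (assumed, even)
WIN = 2 * T0                 # Hann window spanning two periods (assumed choice)
window = [0.5 - 0.5 * math.cos(2 * math.pi * m / WIN) for m in range(WIN)]

def pulse_train(n):
    """Unit pulse at every multiple of T0."""
    return 1.0 if n % T0 == 0 else 0.0

def power(freq, start):
    """Power at `freq` (cycles/sample) for the window that starts at `start`."""
    acc = sum(window[m] * pulse_train(start + m)
              * cmath.exp(-2j * math.pi * freq * (start + m))
              for m in range(WIN))
    return abs(acc) ** 2

def tandem(freq, start):
    """Average of two power spectra whose windows are T0/2 apart."""
    return 0.5 * (power(freq, start) + power(freq, start + T0 // 2))

freq = 5.5 / T0              # halfway between the 5th and 6th harmonics
single = [power(freq, s) for s in range(T0)]
paired = [tandem(freq, s) for s in range(T0)]
single_range = max(single) - min(single)
paired_range = max(paired) - min(paired)
print(f"single window: {single_range:.3f}, TANDEM: {paired_range:.3e}")
```

For this particular window and signal the temporally varying term cancels exactly; with other windows the cancellation is only approximate, which is why window selection matters.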

TANDEM-STRAIGHT: synthetic vowel /a/, log-power spectrum [Movie]

Selection of window: how to select the windowing function
[Figure: averaged spectrum; temporal variation vs. normalized duration]

[Figure: temporal variation with a single window vs. normalized duration (re. T0)]

Nuttall windows*
[Figure: gain (dB) vs. frequency (Hz) for the Hann, Blackman, and Nuttall windows]
* Nuttall, A. H. (1981). Some windows with very good sidelobe behavior. IEEE Transactions on Acoustics, Speech, and Signal Processing, 29(1), 84-91.

Nuttall #12 in Table II. Note: nuttallwin in Matlab is different.

[Figure: temporal variation with the TANDEM window vs. normalized duration (re. T0)]

Interference-free representation of power spectrum
legacy-STRAIGHT [Kawahara et al. 1999]: complementary set of pitch-synchronized time windows, spline-based spectral smoothing, and inverse filtering.
TANDEM-STRAIGHT [Kawahara et al. 2008, 2011]: F0-adaptive set of time windows separated by half a pitch period, and F0-adaptive smoothing followed by digital filter compensation based on consistent sampling.

Consistent sampling
Sampling theory: whole waveform recovery.
Consistent sampling: recovery only at sampled points; simple filtering [Unser 2000].

Consistent sampling
[Figure labels: recovered spectrum, smoothed spectrum, smoothing function, frequency domain representation]

Consistent sampling
[Figure labels: correlation, filter coefficient] [Kawahara & Morise 2011b]

Implementation
Cepstrum representation of TANDEM spectrum; truncated and adjusted coefficients; lifter form of the rectangular frequency smoother [Kawahara & Morise 2011b]

Test example: STRAIGHT spectrum, TANDEM spectrum [Kawahara & Morise 2011b]

TANDEM-STRAIGHT: natural speech [Movie]

Other cause of interference: phase spectrogram and instantaneous frequency [DVD demonstration]

F0 extractors using instantaneous frequency
Fundamental component selection using a constant-Q filter bank and AM-FM magnitude: legacy STRAIGHT [Kawahara et al. 1999a]
Fixed point of frequency to instantaneous frequency mapping [Kawahara et al. 1999b]
Refinement of initial estimates of F0s using instantaneous frequency
Multi-source F0 extractor with intensive manual optimization of parameters [Kawahara et al. 2005]
XSX: excitation source extractor based on interference-free representation of power spectra [Kawahara et al. 2008] [Fujimura et al. 2009]
YANGsaf [Kawahara et al. 2016]

Interference-free representation of instantaneous frequency
Interferences in instantaneous frequency of periodic signals
Interference-free representation of instantaneous frequency (animation)
Derivation of interference-free representation of instantaneous frequency

[Movie: viewer showing the waveform and time windows, the time and frequency resolution, the phase spectrogram, and the target representation]

Instantaneous frequency: problem
Definition: time derivative of phase. [Equation lost in extraction; surviving annotation: singularity.]

Flanagan's equation (Derivation-1)
No need of inverse function. [Equation lost in extraction.]

Flanagan's equation (Derivation-2)
Simplification and notation. [Equation lost in extraction.]
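
Since the slide equations did not survive, here is a minimal sketch of a Flanagan-style instantaneous frequency estimate in plain Python. It is not the STRAIGHT implementation. It assumes the short-time transform X(ω,t) = Σ w(τ) x(t+τ) e^{−jωτ} with a Hann window; the window length, the test frequencies, and all names are illustrative assumptions. Integration by parts gives ω_i = ω − Im[X_d/X], where X_d uses the derivative of the window, so the phase never has to be computed or unwrapped.

```python
import cmath
import math

HALF = 200  # window half length in samples (assumed)

def hann(tau):
    return 0.5 + 0.5 * math.cos(math.pi * tau / HALF)

def hann_derivative(tau):
    return -0.5 * (math.pi / HALF) * math.sin(math.pi * tau / HALF)

def instantaneous_frequency(signal, center, omega):
    """Flanagan-style estimate (rad/sample) at analysis frequency `omega`."""
    x = 0j
    xd = 0j
    for tau in range(-HALF, HALF + 1):
        carrier = signal(center + tau) * cmath.exp(-1j * omega * tau)
        x += hann(tau) * carrier              # transform with the window
        xd += hann_derivative(tau) * carrier  # transform with its derivative
    # Im[xd / x] written with real and imaginary parts only
    numerator = x.real * xd.imag - x.imag * xd.real
    return omega - numerator / abs(x) ** 2

true_omega = 2 * math.pi * 0.05          # test sinusoid (assumed)
analysis_omega = 2 * math.pi * 0.052     # deliberately off the true frequency
estimate = instantaneous_frequency(lambda n: math.cos(true_omega * n),
                                   1000, analysis_omega)
print(true_omega, estimate)
```

The estimate lands on the frequency of the sinusoid even though the analysis frequency is offset, which is the property the fixed-point F0 extractors listed above rely on.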

Averaged instantaneous frequency (Derivation-3)
Power-weighted average. Note: the denominator is the TANDEM spectrum. [Equation lost in extraction.]

Averaged instantaneous frequency (Derivation-4)
Power-weighted average. Numerator: sum of each numerator. [Equation lost in extraction.]

Numerators (Derivation-5)
Substitution and simplification. Squared terms vanish. [Equation lost in extraction.]

TANDEM trick (Derivation-6)
[Equation lost in extraction. Surviving annotations: "independent of time" on one term; "this term should be eliminated = TANDEM trick" on the other.]

F0 extractors using instantaneous frequency
Fundamental component selection using a constant-Q filter bank and AM-FM magnitude: legacy STRAIGHT [Kawahara et al. 1999a]
Fixed point of frequency to instantaneous frequency mapping [Kawahara et al. 1999b]
Refinement of initial estimates of F0s using instantaneous frequency
NDF: multi-source F0 extractor with intensive manual optimization of parameters [Kawahara et al. 2005]
XSX: excitation source extractor based on interference-free representation of power spectra [Kawahara et al. 2008] [Fujimura et al. 2009]
YANGsaf [Kawahara et al. 2016]

Periodicity detection by spectral division [Movie]

[Diagram: TANDEM, STRAIGHT (F0-adaptive processing), spectral division, shaping]

F0-adaptive processing
Contradiction: no F0 information. Multiple hypotheses and integration.

Multiple hypotheses and integration
[Diagram: the signal feeds detector-1, detector-2, detector-3, detector-4, ..., detector-n; integration yields F0 salience. Plot: response vs. normalized lag in semitones (re. T0); plot title: blackman 2.5 npo:3 std:0.0012663]

Spectral division
TANDEM spectrum: envelope and periodic structure
F0-adaptive smoothed spectrum: envelope
TANDEM spectrum / smoothed spectrum: spectrum with only the periodic component

Selecting base-band
[Diagram: spectral division, shaping]

Integration of individual detectors
[Figure: shaping; individual response, shaped response, integrated response]

Integration of individual detectors
[Figure: shaped response, integrated response]

Integration of individual detectors [Movie]

Alternating amplitude [Movie]

Displacement of pulse timing [Movie]

Analysis of Noh voice
Fujimura, O., Honda, K., Kawahara, H., Konparu, Y., Morise, M., & Williams, J. C. (2009). Noh voice quality. Logopedics Phoniatrics Vocology, 34(4), 157-170.

F0 extractors using instantaneous frequency
Fundamental component selection using a constant-Q filter bank and AM-FM magnitude: legacy STRAIGHT [Kawahara et al. 1999a]
Fixed point of frequency to instantaneous frequency mapping [Kawahara et al. 1999b]
Refinement of initial estimates of F0s using instantaneous frequency
NDF: multi-source F0 extractor with intensive manual optimization of parameters [Kawahara et al. 2005]
XSX: excitation source extractor based on interference-free representation of power spectra [Kawahara et al. 2008] [Fujimura et al. 2009]
YANGsaf [Kawahara et al. 2016], presented at SSW9

Using instantaneous frequency and aperiodicity detection to estimate F0 for high-quality speech synthesis Hideki Kawahara, Yannis Agiomyrgiannakis, Heiga Zen Wakayama University, Japan ISCA SSW9, Sunnyvale, CA, USA, 13-15 September, 2016

Non-periodic component (the least successful component)
Distance between lower and upper power spectrum envelope and calibration based on simulation [Kawahara et al. 2001]
Residuals of pitch-scale linear prediction of frequency sub-bands and sigmoidal spectral modeling [Kawahara et al. 2010a]
Interference-free group delay representation for estimating deviation from periodicity [Kawahara et al. 2014]
Event extraction based on running kurtosis [Kawahara et al. 2010b]

EGG is not almighty
[Figure: EGG and speech waveforms vs. time (s)]

[Figure: glottal closure instant (GCI) markers on the differentiated EGG signal, the speech signal, and the EGG signal (close/open) vs. time (s); a mismatch is marked where no GCI is found. Plot title: gender: f, talker ID: 14, sentence ID: 28, relative discrepancy: 5.8 %]

Mismatch is not rare
840 utterances were tested (30 sentences, 28 speakers); only 77 of the 840 utterances have no mismatch.
[Figure: amount of mismatch (record count) and relative mismatch (log scale, 1% and 10% levels marked) vs. sorted utterance ID]

F0 modulation in Noh voice
[Figure: frequency (Hz) vs. time (ms); the trace is labeled 6*F0]

F0 modulation in Noh voice
[Figure: same plot, zoomed in time; annotations: coupling 1:3, 1:2, 1:3, 1:2; chaos?]

F0 modulation power spectrum
[Figure: relative rms level (dB, semitone) vs. modulation frequency (Hz); annotation: vibrato modulation]

Power spectrum of periodic impulse train using previous windows [MAVEBA 2001]
[Figure annotations: zeroes; noise component develops]

Envelope calculation [MAVEBA 2001]
Cepstrum; original and smoothed spectrum; upper and lower envelope; lifter

Multi-resolution analysis: example [MAVEBA 2001]
[Figure: estimated excitation, speech waveform]

Broad-band colored noise
Turbulence: random. [Equation lost in extraction. Surviving labels: boundary frequency, slope, converted pulse-to-noise ratio, square root of pulse-to-noise ratio.]

Parameter estimation
Logit conversion; weighted least-squares solution; weight update

Fitting example

Event detection: examples
Original natural speech; synthesis without events; synthesis with event detection

Outline
Introduction: TANDEM-STRAIGHT
Non-periodic component in speech sounds: wide-band noise, acoustic events
Discussion

Outline
Introduction: TANDEM-STRAIGHT
Non-periodic component in speech sounds: wide-band noise, acoustic events
Conclusion

Acoustic event detection
Strongly distorted distribution. [Equation lost in extraction. Surviving labels: kurtosis (non-negative), 4th moment, 2nd moment, r = 2, 4.] Implementation as filtering.
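
A minimal sketch of the running-kurtosis idea in plain Python, not the STRAIGHT implementation. It assumes the standard definition of kurtosis as the local 4th moment divided by the squared local 2nd moment, a rectangular running window, and a synthetic test signal (Gaussian noise with one click); these choices and all names are illustrative.

```python
import random

def running_kurtosis(signal, half_width):
    """Local 4th moment over squared local 2nd moment, per window position."""
    out = []
    for center in range(half_width, len(signal) - half_width):
        frame = signal[center - half_width:center + half_width + 1]
        m2 = sum(v ** 2 for v in frame) / len(frame)
        m4 = sum(v ** 4 for v in frame) / len(frame)
        out.append(m4 / (m2 ** 2))
    return out

rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(1000)]
click_at = 500
noise[click_at] += 20.0                  # synthetic acoustic event

HALF_WIDTH = 32
kurt = running_kurtosis(noise, HALF_WIDTH)
# Index of the largest kurtosis, mapped back to a sample position.
peak = max(range(len(kurt)), key=kurt.__getitem__) + HALF_WIDTH
print(peak, max(kurt))
```

With a rectangular window the peak is only localized to within the window width, which is consistent with the slides describing a separate event location adjustment step after peak picking.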

Acoustic event detection
Peak picking (initial estimate): local peak of running kurtosis
Event location adjustment: closest centroid of the 4th power of the windowed wave
[Surviving labels: theoretical solution, practical solution, filtered signal]

Event detection example

Peak kurtosis distribution for 112 utterances: highly non-Gaussian

Other cause of interference: phase spectrogram and group delay [DVD demonstration]

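Group delay: Flanagan-like equation
$$\tau_g = -\frac{d\,\Im[\log(X)]}{d\omega} = -\Im\left[\frac{1}{X}\frac{dX}{d\omega}\right] = -\frac{\Re[X]\,\Im\left[\frac{dX}{d\omega}\right] - \Im[X]\,\Re\left[\frac{dX}{d\omega}\right]}{|X|^2}$$

Flanagan-like equation
$$X(\omega,t) = \int w(\tau)\,x(\tau - t)\,e^{-j\omega\tau}\,d\tau$$
$$X_d(\omega,t) = \frac{dX(\omega,t)}{d\omega} = \int w(\tau)\,x(\tau - t)\,\frac{de^{-j\omega\tau}}{d\omega}\,d\tau = -j\int \tau\,w(\tau)\,x(\tau - t)\,e^{-j\omega\tau}\,d\tau$$
$$\tau_g(\omega,t) = -\frac{\Re[X(\omega,t)]\,\Im[X_d(\omega,t)] - \Im[X(\omega,t)]\,\Re[X_d(\omega,t)]}{|X(\omega,t)|^2}$$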
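
The Flanagan-like equation can be exercised on the simplest possible input. The sketch below, in plain Python, is an illustration rather than STRAIGHT code: it assumes a Hann window, a single unit pulse placed 10 samples after the window center, and an arbitrary analysis frequency. The group delay is obtained from X and X_d (the transform weighted by −jτ) without unwrapping the phase.

```python
import cmath
import math

HALF = 64  # window half length in samples (assumed)

def hann(tau):
    return 0.5 + 0.5 * math.cos(math.pi * tau / HALF)

def group_delay(signal, omega):
    """Group delay (samples) from X and X_d = dX/d(omega)."""
    x = 0j
    xd = 0j
    for tau in range(-HALF, HALF + 1):
        term = hann(tau) * signal(tau) * cmath.exp(-1j * omega * tau)
        x += term
        xd += -1j * tau * term   # derivative of exp(-j*omega*tau) w.r.t. omega
    return -(x.real * xd.imag - x.imag * xd.real) / abs(x) ** 2

pulse_at = 10                    # pulse position relative to the window center
delay = group_delay(lambda n: 1.0 if n == pulse_at else 0.0, omega=0.3)
print(delay)
```

For a single pulse the result equals the pulse position at every frequency; interference appears only once a second pulse enters the window, which is the case the following slides analyze.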

Interference model
$$x(t) = \delta\left(t - \frac{T_0}{2}\right) + \alpha\,\delta\left(t + \frac{T_0}{2}\right)$$
The time window covers temporally repeating events with amplitude modification.

Interference in power spectrum
$$P(\omega,t) = w^2\left(t - \frac{T_0}{2}\right) + \alpha^2 w^2\left(t + \frac{T_0}{2}\right) + 2\alpha\,w\left(t - \frac{T_0}{2}\right) w\left(t + \frac{T_0}{2}\right)\cos\left(\frac{2\pi f}{f_0}\right)$$
The cosine term is a periodic variation on the frequency axis.

The TANDEM trick cancels the interference:
$$P_F(\omega,t) = \frac{P\left(\omega - \frac{\omega_0}{4},t\right) + P\left(\omega + \frac{\omega_0}{4},t\right)}{2}$$

Power-weighted average
$$\tau_{ga}(\omega,t) = \frac{\tau_{g1}(\omega,t)\,|S_1(\omega,t)|^2 + \tau_{g2}(\omega,t)\,|S_2(\omega,t)|^2}{|S_1(\omega,t)|^2 + |S_2(\omega,t)|^2}$$
The slide annotates the denominator with P_F(ω,t). Using T0/2 separation makes this sum interference-free.

Numerators
$$\Re[S(\omega,t)]\,\Im[S_d(\omega,t)] = -w_d(t_1)w(t_1)\cos^2(-\omega t_1) - \alpha^2 w_d(t_2)w(t_2)\cos^2(-\omega t_2) - \alpha\,w_d(t_2)w(t_1)\cos(-\omega t_1)\cos(-\omega t_2) - \alpha\,w_d(t_1)w(t_2)\cos(-\omega t_1)\cos(-\omega t_2)$$
$$\Im[S(\omega,t)]\,\Re[S_d(\omega,t)] = w_d(t_1)w(t_1)\sin^2(-\omega t_1) + \alpha^2 w_d(t_2)w(t_2)\sin^2(-\omega t_2) + \alpha\,w_d(t_2)w(t_1)\sin(-\omega t_1)\sin(-\omega t_2) + \alpha\,w_d(t_1)w(t_2)\sin(-\omega t_1)\sin(-\omega t_2) \qquad (27)$$

Numerators: simplification
Using $\sin^2\theta + \cos^2\theta = 1$ and $\cos A\cos B + \sin A\sin B = \cos(A - B)$:
$$\Re[S(\omega,t)]\,\Im[S_d(\omega,t)] - \Im[S(\omega,t)]\,\Re[S_d(\omega,t)] = -w_d(t_1)w(t_1) - \alpha^2 w_d(t_2)w(t_2) - \alpha\left(w_d(t_1)w(t_2) + w_d(t_2)w(t_1)\right)\cos\left(\frac{2\pi f}{f_0}\right) \qquad (28)$$
The cosine term is the periodic component. [Slide annotation: P_F(ω,t)]

Interference-free group delay
$$\tau_{df}(\omega,t) = -\frac{1}{P_F(\omega,t)}\Big[\Re[S(\omega_1,t)]\,\Im[S_d(\omega_1,t)] - \Im[S(\omega_1,t)]\,\Re[S_d(\omega_1,t)] + \Re[S(\omega_2,t)]\,\Im[S_d(\omega_2,t)] - \Im[S(\omega_2,t)]\,\Re[S_d(\omega_2,t)]\Big]$$
where $\omega_1 = \omega - \frac{\omega_0}{4}$ and $\omega_2 = \omega + \frac{\omega_0}{4}$.
Interference on the frequency axis is removed, but ...

Interference-free group delay on the frequency axis [Figure]

Interference-free group delay both in the time and frequency [Figure]

[Movie: waveform and time windows, modulation window, phase spectrogram, power spectrum, group delay]

Synthesis
Minimum-phase impulse response
Pitch-synchronized overlap and add of periodic and aperiodic components
Mixed-mode excitation and approximate time-varying filter

Synthesis: OLA
[Block diagram with three stages: analysis, physical attributes, synthesis. Analysis of input signal-1: spectral envelope analysis, F0 analysis, non-periodicity analysis. Physical attributes: spectral envelope, F0 (with F0 modification), nonperiodicity. Synthesis: periodic pulse generator, noise generator, filter-1, filter-2, adder (+), output signal. Legend: signal, parameter, data, process.]

Synthesis: approximate TVF
[Block diagram with three stages: analysis, physical attributes, synthesis. Analysis of input signal-1: spectral envelope analysis, F0 analysis, non-periodicity analysis. Physical attributes: spectral envelope, F0 (with F0 modification), nonperiodicity. Synthesis: periodic pulse generator, non-periodic component generator, shaper and mixer, spectral envelope filter, output signal. Legend: signal, parameter, data, process.]

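FFT-based convolution
$$x[n] = \sum_{k} x[k]\,\delta_{0,n-k}$$
$$y[n] = \sum_{k=0}^{M} x[k]\,h[n-k]$$
Cyclic version of the signal with length N, where X[k] is the DFT coefficient:
$$\tilde{x}[n] = \sum_{k=0}^{N-1} X[k]\exp\left(\frac{2\pi jkn}{N}\right)$$
Cyclic convolution:
$$\tilde{y}[n] = \sum_{k=0}^{N-1} h[n-k]\,\tilde{x}[k], \qquad Y[k] = H[k]X[k]$$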
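
A minimal sketch of the relation Y[k] = H[k]X[k] in plain Python. It uses a naive DFT instead of an FFT to stay dependency-free, and it zero-pads both sequences so that the cyclic convolution equals the linear one; the test sequences and function names are illustrative assumptions.

```python
import cmath
import math

def dft(values, inverse=False):
    """Naive O(N^2) DFT; adequate for a short demonstration."""
    n = len(values)
    sign = 1 if inverse else -1
    out = [sum(v * cmath.exp(sign * 2j * math.pi * k * m / n)
               for m, v in enumerate(values)) for k in range(n)]
    return [v / n for v in out] if inverse else out

def linear_convolution(x, h):
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xv in enumerate(x):
        for j, hv in enumerate(h):
            y[i + j] += xv * hv
    return y

def dft_convolution(x, h):
    """Cyclic convolution via Y[k] = H[k] X[k], zero-padded to avoid wrap-around."""
    n = len(x) + len(h) - 1
    xf = dft(x + [0.0] * (n - len(x)))
    hf = dft(h + [0.0] * (n - len(h)))
    return [v.real for v in dft([a * b for a, b in zip(xf, hf)], inverse=True)]

x = [1.0, 2.0, 3.0, 4.0]
h = [1.0, -0.5, 0.25]
direct = linear_convolution(x, h)
via_dft = dft_convolution(x, h)
print(direct)
print([round(v, 9) for v in via_dft])
```

Without the zero padding the tail of the result would wrap around onto its beginning, which is why the FFT length limits the usable segment length.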

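FFT-based convolution: subdivision of the original signal
Original signal as a sum of subdivided signals:
$$x[n] = \sum_{k=-\infty}^{\infty} s^{(k)}[n] = \sum_{k=-\infty}^{\infty} x[n]\,w[n - kL]$$
Constraint on the subdivision window:
$$\sum_{k=-\infty}^{\infty} w[n - kL] = 1$$
Implementation:
$$w_1[n] = 1, \quad n = 0, 1, \ldots, K-1$$
$$w_2[n] = 0.5 - 0.5\cos\left(\frac{2n\pi}{2L}\right), \quad K = 2L + 1$$
[Surviving annotation: FFT length limit]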
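
The constraint on the subdivision window is easy to verify numerically. This sketch in plain Python assumes a hop of L = 8 samples, a raised-cosine window defined on n = 0, ..., 2L, and a rectangular window of length L (the rectangular length is an assumption chosen so that shifted copies tile without overlap). It checks that the shifted windows sum to one and that the subdivided pieces add back to the original signal.

```python
import math

L = 8  # hop between subdivisions in samples (assumed)

def raised_cosine(n):
    """w2[n] = 0.5 - 0.5 cos(2 n pi / (2 L)) on 0..2L, zero elsewhere."""
    if 0 <= n <= 2 * L:
        return 0.5 - 0.5 * math.cos(2 * n * math.pi / (2 * L))
    return 0.0

def rectangular(n):
    """Rectangular subdivision of length L (assumed so that copies tile)."""
    return 1.0 if 0 <= n < L else 0.0

def overlap_sum(window, n):
    """Sum of the shifted windows at sample n."""
    return sum(window(n - k * L) for k in range(-4, 10))

signal = [math.sin(0.3 * n) for n in range(40)]
# Subdivide with the raised cosine, then add the pieces back together.
pieces = [[signal[n] * raised_cosine(n - k * L) for n in range(40)]
          for k in range(-1, 5)]
rebuilt = [sum(p[n] for p in pieces) for n in range(40)]
print(max(abs(a - b) for a, b in zip(signal, rebuilt)))
```

Each piece can then be filtered with its own impulse response and the results added, which is what makes a smoothly time-varying filter possible.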

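Time-varying filter in STRAIGHT
$$y[n] = \sum_{k=0}^{M} x[k]\,h[n-k;\,k]$$
Minimum-phase impulse response:
$$h_{min}(t) = \Re\left[\frac{1}{2\pi}\int H_{min}(\omega)\,e^{j\omega t}\,d\omega\right]$$
$$\Re[\ln(H_{min}(\omega))] + j\,\Im[\ln(H_{min}(\omega))] = \int c_{min}(q)\,e^{-j\omega q}\,dq$$
$$c(q) = \frac{1}{2\pi}\int \ln(P(\omega))\,e^{j\omega q}\,d\omega$$
$$c_{min}(q) = \begin{cases} c(q) & (q > 0) \\ c(0)/2 & (q = 0) \\ 0 & (q < 0) \end{cases}$$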
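
The cepstral folding rule can be tried on a case with a known answer. The sketch below is plain Python with a naive DFT, not STRAIGHT code. It assumes a 64-point grid and starts from the power spectrum of the maximum-phase sequence [0.5, 1.0]; its minimum-phase counterpart with the same power spectrum is [1.0, 0.5]. Halving the Nyquist quefrency term is a discrete-time detail added here and does not appear on the slide.

```python
import cmath
import math

def dft(values, inverse=False):
    """Naive O(N^2) DFT; adequate for a short demonstration."""
    n = len(values)
    sign = 1 if inverse else -1
    out = [sum(v * cmath.exp(sign * 2j * math.pi * k * m / n)
               for m, v in enumerate(values)) for k in range(n)]
    return [v / n for v in out] if inverse else out

def minimum_phase_response(power_spectrum):
    """Fold the cepstrum of ln P onto non-negative quefrencies, then exponentiate."""
    n = len(power_spectrum)
    cep = [v.real for v in dft([math.log(p) for p in power_spectrum], inverse=True)]
    folded = [0.0] * n
    folded[0] = cep[0] / 2            # c(0)/2 at q = 0
    for q in range(1, n // 2):
        folded[q] = cep[q]            # c(q) for q > 0
    folded[n // 2] = cep[n // 2] / 2  # Nyquist quefrency is shared by both sides
    spectrum = [cmath.exp(v) for v in dft(folded)]
    return [v.real for v in dft(spectrum, inverse=True)]

N = 64
maximum_phase = [0.5, 1.0] + [0.0] * (N - 2)   # zero outside the unit circle
power = [abs(v) ** 2 for v in dft(maximum_phase)]
h_min = minimum_phase_response(power)
print([round(v, 6) for v in h_min[:4]])
```

Only the power spectrum enters the computation, so the phase of the original sequence is discarded and replaced by the minimum-phase one.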

Numerical examples

STRAIGHT spectrogram: section [Figure]

Minimum-phase responses [Figure]

Time-invariant and time-varying responses: spectral views
[Figure panels: rectangular subdivision, raised-cosine subdivision; time-invariant filter, time-varying filter]

STRAIGHT is a VOCODER enabling flexible manipulation
[Block diagram with three stages: analysis, physical attributes, synthesis. Analysis of input signal-1: spectral envelope analysis, F0 analysis, non-periodicity analysis. Physical attributes: spectral envelope, F0 (with F0 modification), nonperiodicity. Synthesis: periodic pulse generator, non-periodic component generator, shaper and mixer, spectral envelope filter, output signal. Legend: signal, parameter, data, process.]

STRAIGHT: summary
STRAIGHT decomposes input speech into: interference-free spectrum; fundamental frequency (F0); aperiodic component.
Virtually perfect removal of interferences
Flexible manipulation of speech without introducing quality degradations
STRAIGHT component procedures provide building blocks for various applications such as TTS systems
STRAIGHT serves as a test-bed for new component algorithms

Lecture 2 Hands on tutorial of generalized speech morphing based on STRAIGHT Hideki Kawahara Emeritus Professor: Wakayama University, Japan Tianjin University, China, 9 December, 2016

Application
Speech modification using STRAIGHT: modification using GUIs; modification by function calls
Extended morphing for two voices
Temporally variable multi-aspect arbitrary many voices morphing: formulation; implementation; morphing using GUIs; morphing by scripting
Morphing as a research tool

STRAIGHT GUIs [Matlab demonstration; APSIPA DL talk]

Snapshot: F0 extraction [Matlab screenshot: customizable F0 extractor]

Snapshot: modification [Matlab screenshot; controls: duration, size, F0, amplitude]

Manipulation by function calls
Analysis functions: source analysis (fundamental frequency, aperiodic component); filter analysis
Synthesis function: default OLA synthesis; optional: approximate time-varying filter; optional: sinusoidal synthesis

Fundamental frequency
r = exf0candidateststraightgb(x,fs,paramsin)
    samplingfrequency: 22050
    f0: [159x1 double]
    periodicitylevel: [159x1 double]
    ...
Post processing:
rc = autof0tracking(r,x);
rc.vuv = refinevoicingdecision(x,rc);

Aperiodicity parameter
q = aperiodicityratiosigmoid(x,rc,sidemargin,exponent,displayon)
    samplingfrequency: 22050
    f0: [159x1 double]
    vuv: [159x1 double]
    sigmoidparameter: [2x159 double]
    ...

Filter analysis
exspectrumtstraightgb(x,fs,sourceobj,paramsin)
    ElapsedTimeForSpectrum: 0.1414
    temporalpositions: [1x159 double]
    spectrogramstraight: [1025x159 double]
    samplingfrequency: 22050
    TANDEMSTRAIGHTconditions: [1x1 struct]
    spectrogramtandem: [1025x159 double]
    dateofspectrumestimation: 'DD-MM-2014 01:00:58'

Default OLA synthesis
exgeneralstraightsynthesisr2(sourcestructure,filterstructure)
    synthesisout: [17106x1 double]
    samplingfrequency: 22050
    elapsedtime: 0.1040
Generalized framework:
generalstraightsynthesisframeworkr2(feedinghandle, responsehandle, deterministichandle, randomhandle, shifterhandle, datasubstrate, optionalparameters)

Modification by function calls: fundamental frequency manipulation (example)
Making fundamental frequency 1.2 times higher:
rm = r;
rm.f0 = r.f0*1.2;
s = exgeneralstraightsynthesisr2(rm,f);
Making fundamental frequency 50 Hz higher:
rm = r;
rm.f0 = r.f0+50;
s = exgeneralstraightsynthesisr2(rm,f);

Modification by function calls: speaking rate manipulation (example)
Making total duration 2 times longer:
rm = r;
rm.temporalpositions = r.temporalpositions*2;
s = exgeneralstraightsynthesisr2(rm,f);

Modification by function calls: vocal tract length manipulation (example)
Making vocal tract length 1.2 times longer:
fftl = (size(f.spectrogramstraight,1)-1)*2;
fxoriginal = (0:fftl/2)/fftl*f.samplingfrequency;
fxtarget = fxoriginal*1.2;
fxtarget = min(f.samplingfrequency/2, fxtarget);
fm = f;
fm.spectrogramstraight = interp1(fxoriginal,f.spectrogramstraight,fxtarget);
s = exgeneralstraightsynthesisr2(r,fm);

Modification by function calls: vocal tract length manipulation (example)
Making vocal tract length 0.8 times of the original:
fftl = (size(f.spectrogramstraight,1)-1)*2;
fxoriginal = (0:fftl/2)/fftl*f.samplingfrequency;
fxtarget = fxoriginal*0.8;
fm = f;
fm.spectrogramstraight = interp1(fxoriginal,f.spectrogramstraight,fxtarget);
s = exgeneralstraightsynthesisr2(r,fm);
Nonlinear frequency axis modification is possible by designing fxtarget.
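
The frequency-axis mapping used above can be illustrated without STRAIGHT. The following plain-Python sketch assumes a toy spectral envelope with a single peak at 1200 Hz on a 100 Hz grid; the helper names are hypothetical. It reads the envelope at 1.2 times each frequency, clamped to half the sampling frequency, which moves the peak down to 1000 Hz, as a 1.2 times longer vocal tract would.

```python
def interp_linear(xs, ys, x):
    """Linear interpolation on an increasing grid `xs` (x must lie inside it)."""
    for i in range(len(xs) - 1):
        if xs[i] <= x <= xs[i + 1]:
            frac = (x - xs[i]) / (xs[i + 1] - xs[i])
            return ys[i] + frac * (ys[i + 1] - ys[i])
    raise ValueError("x outside the grid")

def stretch_vocal_tract(spectrum, fs, ratio):
    """Read the envelope at ratio*f, clamped to fs/2, so peaks move to f/ratio."""
    bins = len(spectrum)
    freqs = [k * (fs / 2) / (bins - 1) for k in range(bins)]
    return [interp_linear(freqs, spectrum, min(fs / 2, f * ratio)) for f in freqs]

fs = 16000
bins = 81                                  # 100 Hz spacing up to 8000 Hz
# Toy envelope with one resonance-like peak at 1200 Hz (assumed shape).
envelope = [1.0 / (1.0 + ((k * 100 - 1200) / 100.0) ** 2) for k in range(bins)]
longer = stretch_vocal_tract(envelope, fs, 1.2)
peak_hz = longer.index(max(longer)) * 100
print(peak_hz)
```

Replacing `f * ratio` with any monotonic function of `f` gives the nonlinear frequency axis modification mentioned above.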

http://ml.cs.yamanashi.ac.jp/straight/english/index.html

Matlab

Temporally variable multi-aspect N-way morphing
[Figure axes: voices, attribute]

Temporally variable multi-aspect N-way morphing
[Block diagram. Input signal-1, ..., input signal-k, ..., input signal-n are each analyzed (spectral envelope analysis, F0 analysis, non-periodicity analysis) into physical attributes, together with time axis alignment and frequency axis alignment. Morphing combines the physical attributes (spectral envelope, F0, nonperiodicity, time axis mapping, frequency axis mapping) using a set of indexed weights of physical attributes. Synthesis: periodic pulse generator, non-periodic component generator, shaper and mixer, spectral envelope filter, output signal. Legend: signal, parameter, data, process.]

STRAIGHT
[Same block diagram, with the STRAIGHT analysis and synthesis part indicated.]

Generalized morphing enabling extrapolation
Attribute examples / constraint / morphing rule:
location, speed, ... / no constraint / w.sum(function)
F0, power, ... / positivity / exponent(w.sum(log(function)))
time axis, frequency axis, ... / monotonicity / integration(exponent(w.sum(log(derivative of function))))

What is the problem?
Interpolation

What is the problem?
Interpolation
Extrapolation: breakdown, non-monotonic mapping

Speech parameter constraints
(1) time increases monotonically
(2) frequency increases monotonically
(3) time-frequency spectral representation is positive
(4) fundamental frequency is positive
Morphing entity, with abstract time τ and abstract frequency ν:
$$\Theta^{(k)}(\nu,\tau) = \left\{ f_0^{(k)}\left(t^{(k)}(\tau)\right),\; a^{(k)}\left(t^{(k)}(\tau)\right),\; P^{(k)}\left(f^{(k)}(\nu), t^{(k)}(\tau)\right),\; f^{(k)}(\nu),\; t^{(k)}(\tau) \right\} \qquad (1)$$

No constraint case
Morphed parameter (function) as a weighted sum over N cases, where k is the index of the case, w the weight, and g the speech parameter:
$$g_{m1}(t_{m3}(\tau)) = \sum_{k=1}^{N} w^{(k)}\left(t^{(k)}(\tau)\right) g^{(k)}\left(t^{(k)}(\tau)\right) \qquad (2)$$
$$\sum_{k=1}^{N} w^{(k)}\left(t^{(k)}(\tau)\right) = 1 \quad \text{(not always necessary)}$$

Positivity constraint
$$g_{m2}(t_{m3}(\tau)) = \exp\left(\sum_{k=1}^{N} w^{(k)}\left(t^{(k)}(\tau)\right) \log\left(g^{(k)}\left(t^{(k)}(\tau)\right)\right)\right) = \prod_{k=1}^{N} \left(g^{(k)}\left(t^{(k)}(\tau)\right)\right)^{w^{(k)}\left(t^{(k)}(\tau)\right)} \qquad (4)$$
$$g_{m2}(t_{m3}(\tau)) > 0$$
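
A minimal numeric sketch of why the positivity constraint matters, in plain Python. The F0 values and weights are hypothetical. With extrapolating weights (one above 1, one below 0), the plain weighted sum can go negative, while the log-domain rule stays positive for any real weights.

```python
import math

def morph_linear(values, weights):
    """No-constraint rule: plain weighted sum."""
    return sum(w * v for v, w in zip(values, weights))

def morph_positive(values, weights):
    """exp(sum_k w_k log g_k) = prod_k g_k ** w_k, positive for any real weights."""
    return math.exp(sum(w * math.log(v) for v, w in zip(values, weights)))

f0_pair = [100.0, 400.0]                 # hypothetical F0 values (Hz) of two voices

halfway = morph_positive(f0_pair, [0.5, 0.5])        # geometric mean, about 200 Hz
naive_extrap = morph_linear(f0_pair, [1.5, -0.5])    # -50 Hz: not a valid F0
safe_extrap = morph_positive(f0_pair, [1.5, -0.5])   # about 50 Hz
print(halfway, naive_extrap, safe_extrap)
```

The same rule applies to any positive attribute, such as power or the time-frequency representation.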

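Monotonicity constraint
Morphed attribute (function) over N cases, where k is the index of the case, w the weight, g the speech attribute, and ξ the abstract parameter:
$$g_{m3}(\tau) = \int_0^{\tau} \exp\left(\sum_{k=1}^{N} w^{(k)}(\xi) \log\left(\frac{dg^{(k)}(\xi)}{d\xi}\right)\right) d\xi = \int_0^{\tau} \prod_{k=1}^{N} \left(\frac{dg^{(k)}(\xi)}{d\xi}\right)^{w^{(k)}(\xi)} d\xi \qquad (5)$$
$$\frac{dg_{m3}(\tau)}{d\tau} > 0$$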
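
The monotonicity rule can be sketched on a discrete axis in plain Python. The anchor times and weights below are hypothetical, the abstract axis is sampled at unit steps, and the derivative is replaced by a first difference, so this is an approximation of equation (5) rather than the STRAIGHT implementation.

```python
import math

def morph_linear(maps, weights):
    """Plain weighted sum of the mappings (can become non-monotonic)."""
    return [sum(w * g[n] for g, w in zip(maps, weights))
            for n in range(len(maps[0]))]

def morph_monotonic(maps, weights):
    """Accumulate exp(sum_k w_k log(slope_k)) over unit steps of the abstract axis."""
    morphed = [0.0]
    for n in range(1, len(maps[0])):
        log_slope = sum(w * math.log(g[n] - g[n - 1]) for g, w in zip(maps, weights))
        morphed.append(morphed[-1] + math.exp(log_slope))
    return morphed

def is_increasing(seq):
    return all(b > a for a, b in zip(seq, seq[1:]))

# Hypothetical anchor times (s) of two utterances on a shared abstract time axis.
time_a = [0.0, 0.1, 0.5, 0.6]
time_b = [0.0, 0.4, 0.5, 0.9]
weights = [2.0, -1.0]                    # extrapolation beyond voice A

naive = morph_linear([time_a, time_b], weights)
safe = morph_monotonic([time_a, time_b], weights)
print(naive, is_increasing(naive))
print(safe, is_increasing(safe))
```

Because every slope passes through an exponential, each increment is positive and the morphed axis cannot fold back on itself.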

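Generalized morphing
Morphing entity as a function of the exemplars:
$$\Theta_m(\nu,\tau) = T\left(\Theta^{(1)}(\nu,\tau), \Theta^{(2)}(\nu,\tau), \ldots, \Theta^{(K)}(\nu,\tau); W\right) \qquad (6)$$
$$W = \{w_{F0}(\tau), w_A(\tau), w_P(\tau), w_{Fx}(\tau), w_{Tx}(\tau)\} \qquad (7)$$
$$w_X(\tau) = \left[w_X^{(1)}(\tau), w_X^{(2)}(\tau), \ldots, w_X^{(K)}(\tau)\right]^T, \quad X \in \{F0, A, P, Fx, Tx\}$$
F0: fundamental frequency; A: aperiodicity; P: time-frequency representation; Fx: frequency coordinate; Tx: time coordinate.

Implementation: piece-wise linear function
Time axis of an example, where k is the ID of the example, n the ID of the anchor, τ_n the anchor location, and p the value at an anchor:
$$t^{(k)}(\tau) = \left(p^{(k)}(\tau_{n+1}) - p^{(k)}(\tau_n)\right)(\tau - \tau_n) + p^{(k)}(\tau_n) \qquad (8)$$
Morphed time axis, where p_m is the value at the morphed location:
$$t_{m3}(\tau) = \left(p_m(\tau_{n+1}) - p_m(\tau_n)\right)(\tau - \tau_n) + p_m(\tau_n) \qquad (11)$$
$$p_m(\tau_n) = \sum_{k=1}^{K} \left(p^{(k)}(\tau_n) - p^{(k)}(\tau_{n-1})\right) w_{Tx}^{(k)}(\tau_n) + p_m(\tau_{n-1}) \qquad (12)$$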

Matlab implementation of function inversion
yi = interp1(x,y,xi,'linear','extrap');
xi = interp1(y,x,yi,'linear','extrap');
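
The same inversion trick, sketched in plain Python for readers without Matlab. The helper below is a hypothetical stand-in for interp1 with the 'linear' and 'extrap' options, and the sample mapping is an assumption. Because the mapping is strictly increasing, swapping the roles of x and y yields its inverse.

```python
def interp_extrap(xs, ys, x):
    """Piece-wise linear interpolation with linear extrapolation beyond the ends."""
    i = 0
    while i < len(xs) - 2 and x > xs[i + 1]:
        i += 1
    slope = (ys[i + 1] - ys[i]) / (xs[i + 1] - xs[i])
    return ys[i] + slope * (x - xs[i])

x = [0.0, 1.0, 2.0, 3.0]
y = [0.0, 0.5, 2.0, 2.5]          # strictly increasing, so the inverse exists

yi = interp_extrap(x, y, 1.5)     # forward mapping
xi = interp_extrap(y, x, yi)      # inverse mapping: swap the roles of x and y
outside = interp_extrap(y, x, interp_extrap(x, y, 4.0))  # also holds when extrapolating
print(yi, xi, outside)
```

This is why the monotonicity constraint on time and frequency axes matters: without it the swapped call would not define a function.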

Temporally variable multi-aspect N-way morphing
[Figure axes: voices, attribute]

GUI for generalized morphing preparation [Matlab demonstration]

[Matlab demonstration: November, 2013, APSIPA, Taiwan]

Morphing by scripting
Matlab function for temporally variable multi-aspect arbitrary many voices morphing:
morphedobject = tvariablenwaymorphingraw(objectbundle,contributionstructure,dispon);
synthstructure = generatemorphedsound(morphedobject);
morphedobject =
    morphedtimeanchors: [49x1 double]
    timemorphedframe: [522x8 double]
    morphedtargetf0: 115.1853
    morphedf0: [1x522 double]
    f0listonmorphedtime: [522x8 double]
    frequencymappingatanchor: [1x1 struct]
    frameonmorphing: [522x1 double]
    morphedvuv: [1x522 double]
    contributionstructure: [1x1 struct]
    morphedspectrogram: [2049x522 double]
    morphedaperiodicity: [2x522 double]
    elapsedtime: 1.1410
    cutofflistfix: [5x1 double]
    samplingfrequency: 48000
    procedurename: 'tvariablenwaymorphing'
    tmpobj: [1x1 struct]

Thank you!
Roy D. Patterson, Masanori Morise, Hideki Banno, Toshio Irino, Ryuichi Nisimura, Verena G. Skuk, Stefan Schweinberger, Parham Zolfaghari, Ken-Ichi Sakakibara, Ikuyo Masuda-Katsuse, Alain de Cheveigné, Josh McDermott, Osamu Fujimura, Toru Takahashi, Tomoki Toda, and many others.

References

Reference: STRAIGHT
Kawahara, H., Morise, M., Toda, T., Banno, H., Nisimura, R., & Irino, T. (2014). Excitation source analysis for high-quality speech manipulation systems based on an interference-free representation of group delay with minimum phase response compensation. In Fifteenth Annual Conference of the International Speech Communication Association.
Kawahara, H., Morise, M., & Sakakibara, K. I. (2013d). Temporally fine F0 extractor applied for frequency modulation power spectral analysis of singing voices. Proc. MAVEBA, 125-128.
Kawahara, H., Morise, M., Banno, H., & Skuk, V. G. (2013c). Temporally variable multi-aspect N-way morphing based on interference-free speech representations. In Signal and Information Processing Association Annual Summit and Conference (APSIPA), 2013 Asia-Pacific (pp. 1-10). IEEE.
Kawahara, H., Morise, M., Toda, T., Nisimura, R., & Irino, T. (2013b). Beyond bandlimited sampling of speech spectral envelope imposed by the harmonic structure of voiced sounds. In INTERSPEECH (pp. 34-38).
Kawahara, H., Morise, M., Nisimura, R., & Irino, T. (2013a). Higher order waveform symmetry measure and its application to periodicity detectors for speech and singing with fine temporal resolution. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on (pp. 6797-6801). IEEE.
Kawahara, H., Morise, M., Nisimura, R., & Irino, T. (2012b). Deviation measure of waveform symmetry and its application to high-speed and temporally-fine F0 extraction for vocal sound texture manipulation. In Interspeech.
Kawahara, H., & Morise, M. (2012a). Analysis and synthesis of strong vocal expressions: extension and application of audio texture features to singing voice. In Acoustics, Speech and Signal Processing (ICASSP), 2012 IEEE International Conference on (pp. 5389-5392). IEEE.
Kawahara, H., & Morise, M. (2011b). Technical foundations of TANDEM-STRAIGHT, a speech analysis, modification and synthesis framework. Sadhana, 36(5), 713-727.
Kawahara, H., Irino, T., & Morise, M. (2011a). An interference-free representation of instantaneous frequency of periodic signals and its application to F0 extraction. In Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on (pp. 5420-5423). IEEE.

Kawahara, H., Morise, M., Takahashi, T., Banno, H., Nisimura, R., & Irino, T. (2010b). Kurtosis-based acoustic event detection and its application to speech analysis, modification and synthesis systems. Spring Annual Meeting of the Acoustical Society of Japan, 315-316. [in Japanese]
Kawahara, H., Morise, M., Takahashi, T., Banno, H., Nisimura, R., & Irino, T. (2010a). Simplification and extension of non-periodic excitation source representations for high-quality speech manipulation systems. In Interspeech 2010, 38-41.
Fujimura, O., Honda, K., Kawahara, H., Konparu, Y., Morise, M., & Williams, J. C. (2009). Noh voice quality. Logopedics Phoniatrics Vocology, 34(4), 157-170.
Kawahara, H., Takahashi, T., Morise, M., & Banno, H. (2009b). Development of exploratory research tools based on TANDEM-STRAIGHT. In Proceedings: APSIPA ASC 2009: Asia-Pacific Signal and Information Processing Association, 2009 Annual Summit and Conference (pp. 111-120).
Kawahara, H., Nisimura, R., Irino, T., Morise, M., Takahashi, T., & Banno, H. (2009a). Temporally variable multiaspect auditory morphing enabling extrapolation without objective and perceptual breakdown. In Acoustics, Speech and Signal Processing, 2009. ICASSP 2009. IEEE International Conference on (pp. 3905-3908). IEEE.
Kawahara, H., Morise, M., Takahashi, T., Nisimura, R., Irino, T., & Banno, H. (2008, March). TANDEM-STRAIGHT: A temporally stable power spectral representation for periodic signals and applications to interference-free spectrum, F0, and aperiodicity estimation. In Acoustics, Speech and Signal Processing, 2008. ICASSP 2008. IEEE International Conference on (pp. 3933-3936). IEEE.
Banno, H., Hata, H., Morise, M., Takahashi, T., Irino, T., & Kawahara, H. (2007). Implementation of realtime STRAIGHT speech manipulation system: Report on its first implementation. Acoustical Science and Technology, 28(3), 140-146.
Kawahara, H. (2006). STRAIGHT, exploitation of the other aspect of VOCODER: Perceptually isomorphic decomposition of speech sounds. Acoustical Science and Technology, 27(6), 349-353.

Kawahara, H., de Cheveigné, A., Banno, H., Takahashi, T., & Irino, T. (2005, September). Nearly defect-free F0 trajectory extraction for expressive speech modifications based on STRAIGHT. In Interspeech (pp. 537-540).
Matsui, H., & Kawahara, H. (2003b). Investigation of emotionally morphed speech perception and its structure using a high quality speech manipulation system. In INTERSPEECH.
Kawahara, H., & Matsui, H. (2003a). Auditory morphing based on an elastic perceptual distance metric in an interference-free time-frequency representation. In Acoustics, Speech, and Signal Processing, 2003. Proceedings (ICASSP'03). 2003 IEEE International Conference on (Vol. 1, pp. I-256). IEEE.
Kawahara, H., Estill, J., & Fujimura, O. (2001, September). Aperiodicity extraction and control using mixed mode excitation and group delay manipulation for a high quality speech analysis, modification and synthesis system STRAIGHT. In MAVEBA (pp. 59-64).
Kawahara, H., Atake, Y., & Zolfaghari, P. (2000). Accurate vocal event detection method based on a fixed-point analysis of mapping from time to weighted average group delay. In INTERSPEECH (pp. 664-667).
Kawahara, H., Katayose, H., de Cheveigné, A., & Patterson, R. D. (1999b). Fixed point analysis of frequency to instantaneous frequency mapping for accurate estimation of F0 and periodicity. In EuroSpeech (Vol. 99, No. 6, pp. 2781-2784).
Kawahara, H., Masuda-Katsuse, I., & de Cheveigné, A. (1999a). Restructuring speech representations using a pitch-adaptive time frequency smoothing and an instantaneous-frequency-based F0 extraction: Possible role of a repetitive structure in sounds. Speech Communication, 27(3), 187-207.
Kawahara, H. (1997). Speech representation and transformation using adaptive interpolation of weighted spectrum: vocoder revisited. In Acoustics, Speech, and Signal Processing, 1997. ICASSP-97, 1997 IEEE International Conference on (Vol. 2, pp. 1303-1306). IEEE.

Reference: using STRAIGHT
Assmann, P. F., & Nearey, T. M. (2008). Identification of frequency-shifted vowels. The Journal of the Acoustical Society of America, 124(5), 3203-3212.
Tsanas, A., Zañartu, M., Little, M. A., Fox, C., Ramig, L. O., & Clifford, G. D. (2014). Robust fundamental frequency estimation in sustained vowels: Detailed algorithmic comparisons and information fusion with adaptive Kalman filtering. The Journal of the Acoustical Society of America, 135(5), 2885-2901.
Bruckert, L., Bestelmeyer, P., Latinus, M., Rouger, J., Charest, I., Rousselet, G. A., ... & Belin, P. (2010). Vocal attractiveness increases by averaging. Current Biology, 20(2), 116-120.
d'Alessandro, C., Rilliard, A., & Le Beux, S. (2011). Chironomic stylization of intonation. The Journal of the Acoustical Society of America, 129(3), 1594-1604.
Humes, L. E., Kewley-Port, D., Fogerty, D., & Kinney, D. (2010). Measures of hearing threshold and temporal processing across the adult lifespan. Hearing Research, 264(1), 30-40.
Ives, D. T., Smith, D. R., & Patterson, R. D. (2005). Discrimination of speaker size from syllable phrases. The Journal of the Acoustical Society of America, 118(6), 3816-3822.
Kawahara, H., Kitamura, T., Takemoto, H., Nisimura, R., & Irino, T. (2014). Vocal tract length estimation based on vowels using a database consisting of 385 speakers and a database with MRI-based vocal tract shape information. In Fifteenth Annual Conference of the International Speech Communication Association.
Kawahara, H., Mizobuchi, S., Morise, M., Nisimura, R., & Irino, T. (2014). Realtime conversion of growl-type voice qualities based on modulation and approximate time-varying filtering driven by a non-linear oscillator: Formulation. IPSJ SIG Technical Report, 2014-MUS-102(14), 1-6.
Liu, C., & Kewley-Port, D. (2004). Vowel formant discrimination for high-fidelity speech. The Journal of the Acoustical Society of America, 116(2), 1224-1233.
Nguyen, P. C., Takao, O., & Akagi, M. (2003). Modified restricted temporal decomposition and its application to low rate speech coding. IEICE Transactions on Information and Systems, 86(3), 397-405.

Reference: using STRAIGHT
Saitou, T., Goto, M., Unoki, M., & Akagi, M. (2007, October). Speech-to-singing synthesis: Converting speaking voices to singing voices by controlling acoustic features unique to singing voices. In 2007 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (pp. 215-218). IEEE.
Schweinberger, S. R., Walther, C., Zäske, R., & Kovács, G. (2011). Neural correlates of adaptation to voice identity. British Journal of Psychology, 102(4), 748-764.
Schweinberger, S. R., Casper, C., Hauthal, N., Kaufmann, J. M., Kawahara, H., Kloth, N., ... & Zäske, R. (2008). Auditory adaptation in voice perception. Current Biology, 18(9), 684-688.
Skuk, V. G., & Schweinberger, S. R. (2014). Influences of fundamental frequency, formant frequencies, aperiodicity, and spectrum level on the perception of voice gender. Journal of Speech, Language, and Hearing Research, 57(1), 285-296.
Skuk, V. G., & Schweinberger, S. R. (2013). Adaptation aftereffects in vocal emotion perception elicited by expressive faces and voices. PLoS ONE, 8(11), e81691.
Smith, D. R., Patterson, R. D., Turner, R., Kawahara, H., & Irino, T. (2005). The processing and perception of size information in speech sounds. The Journal of the Acoustical Society of America, 117(1), 305.
Toda, T., & Tokuda, K. (2007). A speech parameter generation algorithm considering global variance for HMM-based speech synthesis. IEICE Transactions on Information and Systems, 90(5), 816-824.
Toda, T., Saruwatari, H., & Shikano, K. (2001). Voice conversion algorithm based on Gaussian mixture model with dynamic frequency warping of STRAIGHT spectrum. In 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing, Proceedings (ICASSP'01) (Vol. 2, pp. 841-844). IEEE.
von Kriegstein, K., Smith, D. R., Patterson, R. D., Kiebel, S. J., & Griffiths, T. D. (2010). How the human brain recognizes speech in the context of changing speakers. The Journal of Neuroscience, 30(2), 629-638.

Reference: using STRAIGHT
von Kriegstein, K., Smith, D. R., Patterson, R. D., Ives, D. T., & Griffiths, T. D. (2007). Neural representation of auditory size in the human voice and in sounds from other resonant sources. Current Biology, 17(13), 1123-1128.
von Kriegstein, K., Warren, J. D., Ives, D. T., Patterson, R. D., & Griffiths, T. D. (2006). Processing the acoustic effect of size in speech sounds. NeuroImage, 32(1), 368-375.
Yonezawa, T., Suzuki, N., Abe, S., Mase, K., & Kogure, K. (2007). Perceptual continuity and naturalness of expressive strength in singing voices based on speech morphing. EURASIP Journal on Audio, Speech, and Music Processing, 2007(3), 2.
Yu, K., & Young, S. (2011). Continuous F0 modeling for HMM based statistical parametric speech synthesis. IEEE Transactions on Audio, Speech, and Language Processing, 19(5), 1071-1079.
Zäske, R., Schweinberger, S. R., & Kawahara, H. (2010). Voice aftereffects of adaptation to speaker identity. Hearing Research, 268(1), 38-45.
Zäske, R., Schweinberger, S. R., Kaufmann, J. M., & Kawahara, H. (2009). In the ear of the beholder: neural correlates of adaptation to voice gender. European Journal of Neuroscience, 30(3), 527-534.
Zen, H., Nose, T., Yamagishi, J., Sako, S., Masuko, T., Black, A. W., & Tokuda, K. (2007a). The HMM-based speech synthesis system (HTS) version 2.0. In SSW (pp. 294-299).
Zen, H., Toda, T., Nakamura, M., & Tokuda, K. (2007b). Details of the Nitech HMM-based speech synthesis system for the Blizzard Challenge 2005. IEICE Transactions on Information and Systems, 90(1), 325-333.
Zen, H., Tokuda, K., & Black, A. W. (2009). Statistical parametric speech synthesis. Speech Communication, 51(11), 1039-1064.