STRAIGHT_Tutorial_Tianjin2016.key


Lecture 1 Introduction to speech signal processing in STRAIGHT vocoder Hideki Kawahara Emeritus Professor: Wakayama University, Japan Tianjin University, China, 9 December, 2016

Collaborators: Roy D. Patterson, Masanori Morise, Hideki Banno, Toshio Irino, Ryuichi Nisimura, Verena G. Skuk, Stefan Schweinberger, Parham Zolfaghari, Ken-Ichi Sakakibara, Ikuyo Masuda-Katsuse, Alain de Cheveigne, Josh McDermott, Osamu Fujimura, Toru Takahashi, Tomoki Toda, and many others.

Matlab

Link to STRAIGHT resources: TANDEM-STRAIGHT and morphing. Username: Tianjin-Lecture. Password: STRAIGHT (valid on 9 December, 2016).

Link to STRAIGHT resources: legacy-STRAIGHT. Username: Tianjin-Lecture. Password: STRAIGHT (valid on 9 December, 2016).


Topic Application STRAIGHT Background 7

Summary (Application, STRAIGHT, Background): Interference-free representations play important roles. Periodic excitation is an efficient and robust strategy for sampling and transmitting the information relevant to communication by voice. STRAIGHT is a collection of functions and applications. Extended morphing provides a unique research strategy that is useful for the para-linguistic and non-linguistic aspects of speech.

Topic Application STRAIGHT Background 9

Interference: Vocal tract shape information is mixed with interfering structure caused by the repetitive structure of voiced sounds. Linear predictive analysis still suffers from estimation bias caused by that repetitive structure. Interference-free representations.

Visualization: spectrogram $S(\omega,t) = \left| \int w(\tau - t)\, s(\tau)\, e^{-j\omega\tau}\, d\tau \right|^2$ (wide-band and narrow-band examples).
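The difference between the wide-band and narrow-band views can be checked numerically. The following is a minimal sketch, not code from the lecture: it assumes an 8 kHz sampling rate, a 100 Hz unit pulse train as a stand-in for voiced excitation, and a Hann window; all names are illustrative.

```python
import cmath
import math

FS = 8000          # sampling rate in Hz (assumed for this sketch)
PERIOD = 80        # pulse spacing in samples, i.e. F0 = 100 Hz


def pulse_train(n):
    """Unit pulse at every multiple of PERIOD (also valid for negative n)."""
    return 1.0 if n % PERIOD == 0 else 0.0


def stft_power(center, half_len, freq_hz):
    """Power of the Hann-windowed segment around `center` at `freq_hz`.

    The phase reference is the window center, which differs from the
    definition above only by a phase factor and leaves the power unchanged.
    """
    acc = 0j
    for n in range(-half_len, half_len + 1):
        w = 0.5 + 0.5 * math.cos(math.pi * n / half_len)
        acc += w * pulse_train(center + n) * cmath.exp(-2j * math.pi * freq_hz * n / FS)
    return abs(acc) ** 2


# Narrow-band: a 40 ms window spans several periods, so harmonics are resolved.
narrow_on = stft_power(0, 160, 200.0)    # on the 2nd harmonic
narrow_off = stft_power(0, 160, 250.0)   # midway between harmonics
# Wide-band: a 5 ms window sees a single pulse, so the spectrum is flat in frequency.
wide_on = stft_power(0, 20, 200.0)
wide_off = stft_power(0, 20, 250.0)
```

With the long window the power collapses between harmonics (interference along frequency); with the short window the power is the same at both frequencies but would vanish when the window is centered between pulses (interference along time).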


Matlab


http://www.wakayama-u.ac.jp/~kawahara/matlabrealtimespeechtools/


In the case of speech: the articulators act as a filter (transfer function) and the voicing organ provides the voice source (fundamental frequency, source wave). Repetition mixes the two, which is a nightmare for signal processing.

Topic Application STRAIGHT Background 21

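STRAIGHT is a VOCODER. Analysis turns the input signal into physical attributes: spectral envelope analysis yields the spectral envelope, F0 analysis yields F0, and non-periodicity analysis yields the non-periodicity. After optional modification of these parameters, synthesis combines a periodic pulse generator and a non-periodic component generator in a shaper and mixer, and a filter set by the spectral envelope produces the output signal. (Diagram legend: signal, parameter, data, process.)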

STRAIGHT: legacy to TANDEM spectrum instantaneous frequency group delay 23


Interference-free representation of power spectrum. legacy-STRAIGHT [Kawahara et al. 1999]: a complementary set of pitch-synchronized time windows, spline-based spectral smoothing and inverse filtering; the original idea is by Kawahara (1997). TANDEM-STRAIGHT [Kawahara et al. 2008]: an F0-adaptive set of time windows separated by half a pitch period, and F0-adaptive smoothing followed by digital filter compensation based on consistent sampling.

Interference: power spectrum DVD 26

Interference-free representation of power spectrum. legacy-STRAIGHT [Kawahara et al. 1999]: a complementary set of pitch-synchronized time windows, spline-based spectral smoothing and inverse filtering. TANDEM-STRAIGHT [Kawahara et al. 2008]: an F0-adaptive set of time windows separated by half a pitch period, and F0-adaptive smoothing followed by digital filter compensation based on consistent sampling; the original time window idea is by Morise et al. (2007).

TANDEM-STRAIGHT: periodic pulse power spectrum Movie 28


TANDEM: principle. Slide labels: signal model; power spectrum; arbitrary real numbers; windowing function; TANDEM spectrum; temporally varying term; fundamental period.
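One way to write the principle out is sketched below. It assumes a signal made of two adjacent harmonics with relative amplitude $\alpha$ and phase $\beta$ (the "arbitrary real numbers"), a symmetric window $w$ whose Fourier transform $H$ is real and narrow enough that only adjacent harmonics overlap, and the fundamental period $T_0 = 2\pi/\omega_0$. The symbols $k$, $\alpha$, $\beta$, $H$ and $P_T$ are assumptions of this sketch rather than notation confirmed by the slide.

```latex
\begin{align}
x(t) &= e^{jk\omega_0 t} + \alpha\, e^{j(k+1)\omega_0 t + j\beta} \\
P(\omega,t) &= H^2(\omega - k\omega_0) + \alpha^2 H^2(\omega - (k+1)\omega_0) \nonumber \\
 &\quad + 2\alpha\, H(\omega - k\omega_0)\, H(\omega - (k+1)\omega_0)\cos(\omega_0 t + \beta) \\
P_T(\omega,t) &= \tfrac{1}{2}\left[ P\!\left(\omega, t - \tfrac{T_0}{4}\right) + P\!\left(\omega, t + \tfrac{T_0}{4}\right) \right] \nonumber \\
 &= H^2(\omega - k\omega_0) + \alpha^2 H^2(\omega - (k+1)\omega_0)
\end{align}
```

The temporally varying term cancels because $\cos(\omega_0 t + \beta - \pi/2) + \cos(\omega_0 t + \beta + \pi/2) = 0$: two windows separated by $T_0/2$ see the cross term with opposite signs.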

log-power spectrum TANDEM-STRAIGHT: synthetic vowel /a/ Movie 33

Selection of window How to select windowing function averaged spectrum temporal variation normalized duration 34

temporal variation single window normalized duration (re. T0)

Nuttall windows: Nuttall, A. H. (1981). Some windows with very good sidelobe behavior. IEEE Transactions on Acoustics, Speech and Signal Processing, 29(1), 84-91. Figure: gain (dB) versus frequency (Hz) for the Hann, Blackman and Nuttall windows.

The Nuttall window used here is item 12 in Table II of Nuttall (1981). Note: nuttallwin in Matlab is a different window.
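Hann, Blackman and Nuttall windows are all members of the cosine-sum family, so one generator covers them. The sketch below is illustrative only: the four coefficients in `NUTTALL_ASSUMED` are an assumption about which table entry is meant and should be checked against Table II of Nuttall (1981) before use.

```python
import math


def cosine_sum_window(num_points, coeffs):
    """w[n] = sum_k (-1)^k a_k cos(2 pi k n / (N - 1)) for n = 0..N-1."""
    window = []
    for n in range(num_points):
        phase = 2.0 * math.pi * n / (num_points - 1)
        window.append(sum(((-1) ** k) * a * math.cos(k * phase)
                          for k, a in enumerate(coeffs)))
    return window


HANN = (0.5, 0.5)
BLACKMAN = (0.42, 0.5, 0.08)
# Assumed coefficients for the four-term Nuttall window; verify against the paper.
NUTTALL_ASSUMED = (0.338946, 0.481973, 0.161054, 0.018027)

nuttall = cosine_sum_window(101, NUTTALL_ASSUMED)
hann = cosine_sum_window(101, HANN)
```

Because the coefficients sum to one and their alternating sum is zero, the window peaks at one in the center and reaches zero at both ends.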

temporal variation TANDEM window normalized duration (re.t0)


Interference-free representation of power spectrum. legacy-STRAIGHT [Kawahara et al. 1999]: a complementary set of pitch-synchronized time windows, spline-based spectral smoothing and inverse filtering. TANDEM-STRAIGHT [Kawahara et al. 2008, 2011]: an F0-adaptive set of time windows separated by half a pitch period, and F0-adaptive smoothing followed by digital filter compensation based on consistent sampling.

Consistent sampling. Sampling theory: recovery of the whole waveform. Consistent sampling [Unser 2000]: recovery only at the sampled points, which needs only simple filtering.

Consistent sampling recovered spectrum smoothed spectrum smoothing function frequency domain representation 42

Consistent sampling 1 correlation 1 filter coefficient [Kawahara & Morise 2011b] 43

Implementation: cepstrum representation of the TANDEM spectrum; truncated and adjusted coefficients; lifter form of the rectangular frequency smoother [Kawahara & Morise 2011b].
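The lifter form of a rectangular frequency smoother can be sketched as follows. A rectangle of width F0 on the frequency axis corresponds to a sinc-shaped gain on the quefrency axis, and a three-tap compensation filter on the frequency axis corresponds to a constant plus a cosine. The parameters `q0` and `q1` are hypothetical placeholders for the compensation coefficients (the defaults 1 and 0 mean no compensation); they are not values taken from the slides.

```python
import math


def smoothing_lifter(quefrency, f0, q0=1.0, q1=0.0):
    """Quefrency-domain gain of rectangular smoothing of width f0 (Hz).

    sin(pi f0 q) / (pi f0 q) is the lifter form of the rectangle;
    q0 + 2 q1 cos(2 pi f0 q) is the lifter form of a symmetric three-tap
    filter with taps spaced f0 apart (q0, q1 are placeholders).
    """
    x = math.pi * f0 * quefrency
    sinc = 1.0 if x == 0.0 else math.sin(x) / x
    return sinc * (q0 + 2.0 * q1 * math.cos(2.0 * math.pi * f0 * quefrency))


f0 = 125.0
t0 = 1.0 / f0
# Gains at quefrencies 0, T0, 2*T0, 3*T0.
gains = [smoothing_lifter(k * t0, f0) for k in range(4)]
```

The gain is zero at every multiple of the fundamental period, which is where the harmonic ripple of a voiced spectrum lives in the cepstrum, so the smoother removes exactly that ripple.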

Test example STRAIGHT spectrum TANDEM spectrum [Kawahara & Morise 2011b] 45

TANDEM-STRAIGHT: natural speech Movie 46


Other cause of interference Phase spectrogram and instantaneous frequency DVD 49

F0 extractors using instantaneous frequency. Fundamental component selection using a constant-Q filter bank and AM-FM magnitude: legacy STRAIGHT [Kawahara et al. 1999a]. Fixed point of the frequency to instantaneous frequency mapping [Kawahara et al. 1999b]. Refinement of initial estimates of F0 using instantaneous frequency. Multi-source F0 extractor with intensive manual optimization of parameters [Kawahara et al. 2005]. XSX: excitation source extractor based on interference-free representation of power spectra [Kawahara et al. 2008] [Fujimura et al. 2009]. YANGsaf [Kawahara et al. 2016].

Interference-free representation of instantaneous frequency Interferences in instantaneous frequency of periodic signals Interference-free representation of instantaneous frequency (animation) Derivation of Interference-free representation of instantaneous frequency 51

Movie 52

waveform and time windows time and frequency resolution Movie phase spectrogram viewer of target representation 53

Movie 54

Movie 55


Movie 57

Movie 58


Instantaneous frequency: problem. Definition: the time derivative of the phase, $\omega_i(t) = d\phi(t)/dt$, where the phase has singularities.

Flanagan's equation (Derivation-1): no need for the inverse function.

Flanagan's equation (Derivation-2): simplification and notation.

Averaged instantaneous frequency (Derivation-3): power-weighted average. Note: the denominator is the TANDEM spectrum.

Averaged instantaneous frequency (Derivation-4): the numerator is the sum of the individual numerators.

Numerators (Derivation-5): substitution and simplification; the squared terms vanish.

TANDEM trick (Derivation-6): one term is independent of time; the other term should be eliminated, which is what the TANDEM trick does.
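Flanagan's equation gives the instantaneous frequency without unwrapping the phase: for $X = a + jb$, $\omega_i = (a\,\dot{b} - b\,\dot{a})/(a^2 + b^2) = \Im[\dot{X}/X]$. The sketch below is an illustration under stated assumptions, not the STRAIGHT implementation: it uses $X(\omega,t) = \sum_\tau w(\tau)\,x(t+\tau)\,e^{-j\omega\tau}$ with a Hann window, obtains the time derivative from a second transform with the derivative window, and analyses a single complex tone. The sampling rate, window length and test signal are all assumed.

```python
import cmath
import math

FS = 8000.0   # sampling rate in Hz (assumed)
HALF = 160    # half window length in samples (40 ms Hann window in total)


def hann(tau, length):
    return 0.5 + 0.5 * math.cos(2.0 * math.pi * tau / length)


def hann_derivative(tau, length):
    return -(math.pi / length) * math.sin(2.0 * math.pi * tau / length)


def instantaneous_frequency(signal, t, analysis_hz):
    """Instantaneous frequency (Hz) seen by the channel centered at analysis_hz."""
    length = 2 * HALF / FS
    omega = 2.0 * math.pi * analysis_hz
    x_w = 0j    # transform with the window itself
    x_dw = 0j   # transform with the derivative of the window
    for n in range(-HALF, HALF + 1):
        tau = n / FS
        carrier = signal(t + tau) * cmath.exp(-1j * omega * tau)
        x_w += hann(tau, length) * carrier
        x_dw += hann_derivative(tau, length) * carrier
    # dX/dt = j*omega*X - X_dw, hence d(phase)/dt = Im[(dX/dt)/X] = omega - Im[X_dw/X].
    return (omega - (x_dw / x_w).imag) / (2.0 * math.pi)


def tone(time_s):
    """Complex exponential at 200 Hz (test signal assumed for this sketch)."""
    return cmath.exp(2j * math.pi * 200.0 * time_s)


estimate_hz = instantaneous_frequency(tone, 0.1, 180.0)
```

Although the analysis channel sits at 180 Hz, the estimate lands on the 200 Hz tone; with a periodic signal, several harmonics inside one channel would make this value oscillate in time, which is the interference the TANDEM trick removes.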

F0 extractors using instantaneous frequency. Fundamental component selection using a constant-Q filter bank and AM-FM magnitude: legacy STRAIGHT [Kawahara et al. 1999a]. Fixed point of the frequency to instantaneous frequency mapping [Kawahara et al. 1999b]. Refinement of initial estimates of F0 using instantaneous frequency. NDF: multi-source F0 extractor with intensive manual optimization of parameters [Kawahara et al. 2005]. XSX: excitation source extractor based on interference-free representation of power spectra [Kawahara et al. 2008] [Fujimura et al. 2009]. YANGsaf [Kawahara et al. 2016].

Periodicity detection by spectral division Movie 68

TANDEM STRAIGHT F0 adaptive processing sp. division shaping

F0 adaptive processing Contradiction: no F0 information multiple hypothesis and integration

Multiple hypothesis and integration: the signal feeds detector-1, detector-2, detector-3, detector-4, ..., detector-n, and their outputs are integrated into the F0 salience. Figure (blackman 2.5, npo: 3, std: 0.0012663): detector response versus normalized lag in semitones (re. T0).

Spectral division. The TANDEM spectrum contains the envelope and the periodic structure; the F0-adaptive smoothed spectrum contains only the envelope. Dividing the TANDEM spectrum by the smoothed spectrum leaves a spectrum with only the periodic component.
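A toy version of the spectral division is sketched below. It assumes a synthetic spectrum on a 1 Hz grid made of a slowly varying envelope multiplied by a harmonic ripple, and uses a plain rectangular average of width F0 as the smoother; the real F0-adaptive smoothing in TANDEM-STRAIGHT is more elaborate, and every name here is illustrative.

```python
import math

F0 = 100   # fundamental frequency in Hz; spectra are sampled on a 1 Hz grid


def envelope(freq_hz):
    """Slowly varying toy envelope (assumed for this sketch)."""
    return 1.0 + freq_hz / 2000.0


# Stand-in for a spectrum that has both envelope and periodic structure.
tandem = [envelope(f) * (1.0 + 0.8 * math.cos(2.0 * math.pi * f / F0))
          for f in range(0, 1001)]


def smooth_rect(spectrum, center, width):
    """Average over one F0-wide interval (trapezoidal rule, end points weighted 1/2)."""
    half = width // 2
    total = 0.5 * (spectrum[center - half] + spectrum[center + half])
    total += sum(spectrum[center - half + 1:center + half])
    return total / width


def periodic_only(center):
    """Spectral division: spectrum divided by its F0-wide smoothed version."""
    return tandem[center] / smooth_rect(tandem, center, F0)


at_harmonic = periodic_only(500)   # on the 5th harmonic
between = periodic_only(550)       # midway between harmonics
```

The ratio no longer depends on the envelope: it is large on harmonics and small between them, which is the periodicity cue that the detectors shape and integrate.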

Selecting base-band sp. division shaping 73

Integration of individual detectors shaping individual response shaped response integrated response 74

Integration of individual detectors shaped response integrated response 75

Integration of individual detectors Movie 76

Alternating amplitude Movie 77

Displacement of pulse timing Movie 78

Analysis of Noh voice Fujimura, O., Honda, K., Kawahara, H., Konparu, Y., Morise, M., & Williams, J. C. (2009). Noh voice quality. Logopedics Phoniatrics Vocology, 34(4), 157-170. 79

F0 extractors using instantaneous frequency. Fundamental component selection using a constant-Q filter bank and AM-FM magnitude: legacy STRAIGHT [Kawahara et al. 1999a]. Fixed point of the frequency to instantaneous frequency mapping [Kawahara et al. 1999b]. Refinement of initial estimates of F0 using instantaneous frequency. NDF: multi-source F0 extractor with intensive manual optimization of parameters [Kawahara et al. 2005]. XSX: excitation source extractor based on interference-free representation of power spectra [Kawahara et al. 2008] [Fujimura et al. 2009]. YANGsaf [Kawahara et al. 2016], presented at SSW9.

Using instantaneous frequency and aperiodicity detection to estimate F0 for high-quality speech synthesis Hideki Kawahara, Yannis Agiomyrgiannakis, Heiga Zen Wakayama University, Japan ISCA SSW9, Sunnyvale, CA, USA, 13-15 September, 2016


Non-periodic component (the least successful component). Distance between lower and upper power spectrum envelopes, with calibration based on simulation [Kawahara et al. 2001]. Residuals of pitch-scale linear prediction of frequency sub-bands and sigmoidal spectral modeling [Kawahara et al. 2010a]. Interference-free group delay representation for estimating deviation from periodicity [Kawahara et al. 2014]. Event extraction based on running kurtosis [Kawahara et al. 2010b].

EGG is not almighty. Figure: EGG and speech waveforms versus time (s).

Glottal closure instant (GCI): example of mismatch, with markers for GCI and no GCI. Figure (gender: f, talker ID: 14, sentence ID: 28, relative discrepancy: 5.8 %): differentiated EGG signal, speech signal and EGG signal (close/open) versus time (s).

Mismatch is not rare. 840 utterances were tested (30 sentences × 28 speakers); only 77 of the 840 utterances have no mismatch. Figure: relative mismatch (log scale, with 1% and 10% marked) versus record count (utterance ID sorted by amount of mismatch).

F0 modulation in Noh voice. Figure: frequency (Hz) versus time (ms), with 6*F0 marked.

F0 modulation in Noh voice: coupling at 1:2 and 1:3; chaos? Figure: frequency (Hz) versus time (ms), with 6*F0 marked.

F0 modulation power spectrum. Figure: relative rms level (dB, semitone) versus modulation frequency (Hz), with the vibrato modulation marked.

Power spectrum of periodic impulse train using previous windows zeroes noise component develops MAVEBA'2001

Envelope calculation cepstrum original and smoothed spectrum upper and lower envelope lifter MAVEBA'2001

Multi resolution analysis: example estimated excitation speech waveform MAVEBA'2001

Non-periodic component (the least successful component; YANGsaf may be the answer). Distance between lower and upper power spectrum envelopes, with calibration based on simulation [Kawahara et al. 2001]. Residuals of pitch-scale linear prediction of frequency sub-bands and sigmoidal spectral modeling [Kawahara et al. 2010a]. Interference-free group delay representation for estimating deviation from periodicity [Kawahara et al. 2014]. Event extraction based on running kurtosis [Kawahara et al. 2010b].

Broad-band colored noise turbulence: random boundary frequency slope converted pulse to noise ratio square root of pulse to noise ratio

Parameter estimation logit conversion weighted least square solution weight update
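The three steps named on this slide can be sketched with a two-parameter sigmoid. The model form below, a logistic function of frequency with a boundary frequency and a transition width, is an assumption made for illustration; the weights use the common choice $(p(1-p))^2$, which is also an assumption rather than the rule used in STRAIGHT.

```python
import math


def logit(p):
    return math.log(p / (1.0 - p))


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def weighted_line_fit(xs, zs, ws):
    """Weighted least squares fit of z = a * x + b."""
    sw = sum(ws)
    mx = sum(w * x for w, x in zip(ws, xs)) / sw
    mz = sum(w * z for w, z in zip(ws, zs)) / sw
    sxx = sum(w * (x - mx) ** 2 for w, x in zip(ws, xs))
    sxz = sum(w * (x - mx) * (z - mz) for w, x, z in zip(ws, xs, zs))
    a = sxz / sxx
    return a, mz - a * mx


def fit_sigmoid(freqs, ratios, iterations=5):
    """Logit conversion, weighted least squares, then weight update."""
    zs = [logit(r) for r in ratios]          # logit conversion
    ws = [1.0] * len(freqs)
    for _ in range(iterations):
        a, b = weighted_line_fit(freqs, zs, ws)            # weighted LS solution
        fitted = [sigmoid(a * f + b) for f in freqs]
        ws = [(p * (1.0 - p)) ** 2 for p in fitted]        # weight update
    return -b / a, 1.0 / a   # boundary frequency (Hz), transition width (Hz)


# Noise-free synthetic band ratios with boundary 3000 Hz and width 800 Hz.
freqs = [500.0 * k for k in range(1, 16)]
ratios = [sigmoid((f - 3000.0) / 800.0) for f in freqs]
boundary_hz, width_hz = fit_sigmoid(freqs, ratios)
```

The weighting matters with real data: bands whose ratio is close to 0 or 1 have very noisy logits, and the update keeps them from dominating the fit.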

Fitting example


Event detection: examples (three slides). Each compares original natural speech, synthesis without events, and synthesis with event detection.

Outline Introduction: TANDEM-STRAIGHT Non-periodic component in speech sounds Wide-band noise Acoustic events Discussion

Outline Introduction: TANDEM-STRAIGHT Non-periodic component in speech sounds Wide-band noise Acoustic events Conclusion

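Acoustic event detection: an event produces a strongly distorted amplitude distribution, which is measured by kurtosis (non-negative), the 4th moment divided by the square of the 2nd moment. The moments (r = 2, 4) are implemented as filtering.

Acoustic event detection, procedure: peak picking on the running kurtosis of the filtered signal gives the initial estimate (local peak of running kurtosis); event location adjustment then moves it to the closest centroid of the 4th power of the windowed wave. Slide labels: practical solution, theoretical solution.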
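The two-stage procedure can be sketched on a synthetic signal. This is an illustration under assumptions, not the published detector: a rectangular moving average stands in for the moment filters, a sinusoid with one added impulse stands in for speech with an event, and the window length is arbitrary.

```python
import math

HALF = 20   # half length of the rectangular averaging window (assumed)


def running_moment(x, r):
    """Moving average of x[n]**r: the r-th moment implemented as FIR filtering."""
    out = {}
    for c in range(HALF, len(x) - HALF):
        segment = x[c - HALF:c + HALF + 1]
        out[c] = sum(v ** r for v in segment) / len(segment)
    return out


def detect_event(x):
    m2 = running_moment(x, 2)
    m4 = running_moment(x, 4)
    kurt = {c: m4[c] / m2[c] ** 2 for c in m2}
    # Step 1: peak picking on the running kurtosis (initial estimate).
    peak = max(kurt, key=kurt.get)
    # Step 2: adjust to the centroid of the 4th power of the windowed wave.
    idx = range(peak - HALF, peak + HALF + 1)
    energy4 = [x[n] ** 4 for n in idx]
    centroid = sum(n * e for n, e in zip(idx, energy4)) / sum(energy4)
    return peak, kurt[peak], round(centroid)


# Test signal: a sinusoid (kurtosis about 1.5) with one impulse at sample 300.
signal = [math.sin(0.37 * n) for n in range(600)]
signal[300] += 10.0
peak_index, peak_kurtosis, event_index = detect_event(signal)
```

The kurtosis peak only says that an event lies somewhere inside the window; the centroid step is what pins the location down to the sample.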

Event detection example

Peak kurtosis distribution for 112 utterances highly non-gaussian


Other cause of interference: phase spectrogram and group delay (demonstration movies).

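Group delay: Flanagan-like equation

$$\tau_g = -\frac{d\,\Im[\log X]}{d\omega} = -\Im\!\left[\frac{1}{X}\frac{dX}{d\omega}\right] = -\frac{\Re[X]\,\Im\!\left[\frac{dX}{d\omega}\right] - \Im[X]\,\Re\!\left[\frac{dX}{d\omega}\right]}{|X|^2}$$

Flanagan-like equation

$$X(\omega,t) = \int w(\tau)\,x(\tau - t)\,e^{-j\omega\tau}\,d\tau$$

$$X_d(\omega,t) = \frac{dX(\omega,t)}{d\omega} = \int w(\tau)\,x(\tau - t)\,\frac{d e^{-j\omega\tau}}{d\omega}\,d\tau = -j\int \tau\,w(\tau)\,x(\tau - t)\,e^{-j\omega\tau}\,d\tau$$

$$\tau_g(\omega,t) = -\frac{\Re[X(\omega,t)]\,\Im[X_d(\omega,t)] - \Im[X(\omega,t)]\,\Re[X_d(\omega,t)]}{|X(\omega,t)|^2}$$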

Interference model: $x(t) = \delta\!\left(t - \frac{T_0}{2}\right) + \alpha\,\delta\!\left(t + \frac{T_0}{2}\right)$. The time window covers temporally repeating events with amplitude modification.

Interference in power spectrum:

$$P(\omega,t) = w^2\!\left(t - \frac{T_0}{2}\right) + \alpha^2 w^2\!\left(t + \frac{T_0}{2}\right) + 2\alpha\,w\!\left(t - \frac{T_0}{2}\right) w\!\left(t + \frac{T_0}{2}\right)\cos\!\left(2\pi\frac{f}{f_0}\right)$$

The cosine term is a periodic variation on the frequency axis.

The TANDEM trick cancels the interference:

$$P_F(\omega,t) = \frac{1}{2}\left[P\!\left(\omega - \frac{\omega_0}{4},\,t\right) + P\!\left(\omega + \frac{\omega_0}{4},\,t\right)\right]$$
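The cancellation can be checked numerically for the two-pulse interference model. The sketch assumes a Gaussian-shaped window and arbitrary values for $T_0$, $\alpha$ and the analysis time; only the structure of $P$ and $P_F$ follows the slide.

```python
import math

T0 = 0.01                        # fundamental period in seconds (assumed)
OMEGA0 = 2.0 * math.pi / T0      # fundamental angular frequency
ALPHA = 0.7                      # relative amplitude of the second pulse (assumed)


def window(t):
    """Any smooth window works here; a Gaussian shape is assumed."""
    return math.exp(-(t / T0) ** 2)


def power(omega, t):
    """Power spectrum of the two-pulse model, with cos(omega*T0) = cos(2*pi*f/f0)."""
    a = window(t - T0 / 2.0)
    b = window(t + T0 / 2.0)
    return a * a + (ALPHA * b) ** 2 + 2.0 * ALPHA * a * b * math.cos(omega * T0)


def tandem_power(omega, t):
    """Average of two spectra shifted by plus and minus omega0/4."""
    return 0.5 * (power(omega - OMEGA0 / 4.0, t) + power(omega + OMEGA0 / 4.0, t))


t = 0.002
grid = [k * OMEGA0 / 16.0 for k in range(32)]
raw = [power(omega, t) for omega in grid]
fixed = [tandem_power(omega, t) for omega in grid]
raw_ripple = max(raw) - min(raw)
tandem_ripple = max(fixed) - min(fixed)
```

The shift of $\omega_0/4$ turns the cosine into plus and minus a sine, so the ripple along frequency cancels exactly while the two squared terms are untouched.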

Power weighted average:

$$\tau_{ga}(\omega,t) = \frac{\tau_{g1}(\omega,t)\,|S_1(\omega,t)|^2 + \tau_{g2}(\omega,t)\,|S_2(\omega,t)|^2}{|S_1(\omega,t)|^2 + |S_2(\omega,t)|^2}$$

The denominator is labeled $P_F(\omega,t)$ on the slide. Using $T_0/2$ separation makes this sum interference-free.

Numerators:

$$\Re[S(\omega,t)]\,\Im[S_d(\omega,t)] = -w_d(t_1)w(t_1)\cos^2(\omega t_1) - \alpha^2 w_d(t_2)w(t_2)\cos^2(\omega t_2) - \alpha\,w_d(t_2)w(t_1)\cos(\omega t_1)\cos(\omega t_2) - \alpha\,w_d(t_1)w(t_2)\cos(\omega t_1)\cos(\omega t_2)$$

$$\Im[S(\omega,t)]\,\Re[S_d(\omega,t)] = w_d(t_1)w(t_1)\sin^2(\omega t_1) + \alpha^2 w_d(t_2)w(t_2)\sin^2(\omega t_2) + \alpha\,w_d(t_2)w(t_1)\sin(\omega t_1)\sin(\omega t_2) + \alpha\,w_d(t_1)w(t_2)\sin(\omega t_1)\sin(\omega t_2)$$

Numerators: simplification. Using $\sin^2\theta + \cos^2\theta = 1$ and $\cos A\cos B + \sin A\sin B = \cos(A - B)$:

$$\Re[S(\omega,t)]\,\Im[S_d(\omega,t)] - \Im[S(\omega,t)]\,\Re[S_d(\omega,t)] = -w_d(t_1)w(t_1) - \alpha^2 w_d(t_2)w(t_2) - \alpha\left(w_d(t_1)w(t_2) + w_d(t_2)w(t_1)\right)\cos\!\left(2\pi\frac{f}{f_0}\right)$$

The cosine term is the periodic component.

Interference-free group delay:

$$\tau_{df}(\omega,t) = -\frac{1}{P_F(\omega,t)}\Big[\Re[S(\omega_1,t)]\,\Im[S_d(\omega_1,t)] - \Im[S(\omega_1,t)]\,\Re[S_d(\omega_1,t)] + \Re[S(\omega_2,t)]\,\Im[S_d(\omega_2,t)] - \Im[S(\omega_2,t)]\,\Re[S_d(\omega_2,t)]\Big]$$

where $\omega_1 = \omega - \frac{\omega_0}{4}$ and $\omega_2 = \omega + \frac{\omega_0}{4}$. Interference on the frequency axis is removed, but interference on the time axis remains.

Interference-free group delay on the frequency axis 124

Interference-free group delay both in the time and frequency 125

waveform and time windows modulation window phase spectrogram power spectrum group delay Movie

Movie

Movie

Movie

Synthesis: minimum phase impulse response; pitch-synchronized overlap and add of periodic and aperiodic components; mixed mode excitation and approximate time varying filter.


Synthesis: OLA. Analysis turns the input signal into physical attributes (spectral envelope, F0 and non-periodicity), which can be modified. In synthesis, the periodic pulse generator drives filter-1 and the noise generator drives filter-2; the two filter outputs are added to form the output signal. (Diagram legend: signal, parameter, data, process.)

Synthesis: approximate TVF analysis physical attributes synthesis spectral envelope analysis spectral envelope filter output signal input signal-1 F0 analysis F0 modification periodic pulse generator shaper and mixer signal parameter non-periodicity analysis nonperiodicity non-periodic component generator data process 133

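FFT-based convolution.

$$x[n] = \sum_{k} x[k]\,\delta_{0,n-k}, \qquad y[n] = \sum_{k=0}^{M} x[k]\,h[n-k]$$

Cyclic version of the signal with length $N$, where $X[k]$ is the DFT coefficient:

$$\tilde{x}[n] = \sum_{k=0}^{N-1} X[k]\exp\!\left(\frac{2\pi jkn}{N}\right)$$

Cyclic convolution:

$$\tilde{y}[n] = \sum_{k=0}^{N-1} \tilde{h}[n-k]\,\tilde{x}[k], \qquad Y[k] = H[k]\,X[k]$$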
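The identity $Y[k] = H[k]X[k]$ can be verified with a small example. The sketch uses a plain $O(N^2)$ DFT instead of an FFT to stay self-contained, and places the $1/N$ factor on the inverse transform, which is a convention chosen here; the sequences are arbitrary.

```python
import cmath


def dft(seq, inverse=False):
    """Plain O(N^2) DFT; the 1/N factor is applied on the inverse transform."""
    n = len(seq)
    sign = 1 if inverse else -1
    out = [sum(seq[m] * cmath.exp(sign * 2j * cmath.pi * k * m / n) for m in range(n))
           for k in range(n)]
    return [v / n for v in out] if inverse else out


def cyclic_convolution_direct(x, h):
    """y[i] = sum_k h[(i - k) mod N] x[k]."""
    n = len(x)
    return [sum(h[(i - k) % n] * x[k] for k in range(n)) for i in range(n)]


def cyclic_convolution_dft(x, h):
    """Multiply the DFT coefficients, then transform back."""
    product = [a * b for a, b in zip(dft(x), dft(h))]
    return [v.real for v in dft(product, inverse=True)]


def linear_convolution(x, h):
    out = [0.0] * (len(x) + len(h) - 1)
    for i, a in enumerate(x):
        for j, b in enumerate(h):
            out[i + j] += a * b
    return out


# Zero padding to N = 8 leaves room for the 5 + 3 - 1 = 7 output samples.
x = [1.0, 2.0, 0.0, -1.0, 0.5, 0.0, 0.0, 0.0]
h = [1.0, 0.5, 0.25, 0.0, 0.0, 0.0, 0.0, 0.0]
direct = cyclic_convolution_direct(x, h)
via_dft = cyclic_convolution_dft(x, h)
linear = linear_convolution(x[:5], h[:3])
```

Because the padded length is at least the linear convolution length, the cyclic result equals the linear one, which is the condition that makes FFT-based filtering usable for synthesis.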

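FFT-based convolution: subdivision. The original signal is a sum of subdivided signals:

$$x[n] = \sum_{k=-\infty}^{\infty} s^{(k)}[n], \qquad s^{(k)}[n] = x[n]\,w[n-kL]$$

Constraint on the subdivision window:

$$\sum_{k=-\infty}^{\infty} w[n-kL] = 1$$

Implementation: $w_1[n] = 1,\ n = 0, 1, \dots, K-1$, or $w_2[n] = 0.5 - 0.5\cos\!\left(\frac{2n\pi}{2L}\right)$ with $K = 2L + 1$. The window length is subject to the FFT length limit.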
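The constraint on the raised-cosine subdivision window can be checked directly. The hop size below is an arbitrary assumption; the window follows the $w_2$ definition with $K = 2L + 1$ points.

```python
import math

L = 16           # hop between successive subdivision windows (assumed)
K = 2 * L + 1    # window length in samples

w2 = [0.5 - 0.5 * math.cos(2.0 * math.pi * n / (2 * L)) for n in range(K)]


def window_sum(n, shifts=range(-2, 8)):
    """Sum over k of w2[n - k*L], taking w2 as zero outside 0..K-1."""
    total = 0.0
    for k in shifts:
        m = n - k * L
        if 0 <= m < K:
            total += w2[m]
    return total


sums = [window_sum(n) for n in range(L, 4 * L)]
```

Each sample is covered by two overlapping windows whose weights are $0.5 - 0.5\cos\theta$ and $0.5 + 0.5\cos\theta$, so the sum is one everywhere and the subdivided signals add back to the original.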

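Time varying filter in STRAIGHT:

$$y[n] = \sum_{k=0}^{M} x[k]\,h[n-k;\,k]$$

Minimum phase impulse response:

$$h_{\min}(t) = \Re\!\left[\frac{1}{2\pi}\int H_{\min}(\omega)\,e^{j\omega t}\,d\omega\right]$$

$$\ln H_{\min}(\omega) = \Re[\ln H_{\min}(\omega)] + j\,\Im[\ln H_{\min}(\omega)] = \int c_{\min}(q)\,e^{-j\omega q}\,dq$$

$$c(q) = \frac{1}{2\pi}\int \ln P(\omega)\,e^{j\omega q}\,d\omega, \qquad c_{\min}(q) = \begin{cases} c(q) & (q > 0) \\ c(0)/2 & (q = 0) \\ 0 & (q < 0) \end{cases}$$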
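The cepstral folding rule for $c_{\min}(q)$ can be tried on a known case. The sketch below is a discrete-time illustration, not the STRAIGHT code: it uses a plain DFT of length 64, treats the Nyquist quefrency bin like the zero bin (a choice made for the discrete version), and tests with the power spectrum of the minimum phase filter $h = [1, 0.5]$, which the procedure should return.

```python
import cmath
import math


def dft(seq, inverse=False):
    """Plain O(N^2) DFT; the 1/N factor is applied on the inverse transform."""
    n = len(seq)
    sign = 1 if inverse else -1
    out = [sum(seq[m] * cmath.exp(sign * 2j * cmath.pi * k * m / n) for m in range(n))
           for k in range(n)]
    return [v / n for v in out] if inverse else out


def minimum_phase_response(power_spectrum):
    """Impulse response whose power spectrum is P and whose phase is minimum."""
    n = len(power_spectrum)
    # c(q): cepstrum of the log power spectrum (real and even).
    c = [v.real for v in dft([math.log(p) for p in power_spectrum], inverse=True)]
    # c_min(q): keep q > 0, halve q = 0 (and the Nyquist bin), drop q < 0.
    c_min = [0.0] * n
    c_min[0] = c[0] / 2.0
    for q in range(1, n // 2):
        c_min[q] = c[q]
    c_min[n // 2] = c[n // 2] / 2.0
    log_h = dft(c_min)                       # ln H_min(omega)
    h_min = dft([cmath.exp(z) for z in log_h], inverse=True)
    return [v.real for v in h_min]


N = 64
true_h = [1.0, 0.5] + [0.0] * (N - 2)
power = [abs(v) ** 2 for v in dft(true_h)]
recovered = minimum_phase_response(power)
```

Only the power spectrum enters the procedure, so the phase of the synthesis filter is fully determined by the spectral envelope.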

Numerical examples 137

STRAIGHT spectrogram: section 138

Minimum phase responses 139

Time invariant and variant responses: spectral views rectangular subdivision raised cosine subdivision time invariant filter time varying filter 140

STRAIGHT is a VOCODER enabling flexible manipulation analysis physical attributes synthesis spectral envelope analysis spectral envelope filter output signal input signal-1 F0 analysis F0 modification periodic pulse generator shaper and mixer signal parameter non-periodicity analysis nonperiodicity non-periodic component generator data process 141

STRAIGHT: summary. STRAIGHT decomposes input speech into an interference-free spectrum, the fundamental frequency (F0) and an aperiodic component. Virtually perfect removal of interferences. Flexible manipulation of speech without introducing quality degradations. STRAIGHT component procedures provide building blocks for various applications such as TTS systems. STRAIGHT serves as a test-bed for new component algorithms.

Topic Application STRAIGHT Background 143

Lecture 2 Hands on tutorial of generalized speech morphing based on STRAIGHT Hideki Kawahara Emeritus Professor: Wakayama University, Japan Tianjin University, China, 9 December, 2016

Topic Application STRAIGHT Background 145

Application. Speech modification using STRAIGHT: modification using GUIs; modification by function calls. Extended morphing for two voices. Temporally variable multi-aspect morphing of arbitrarily many voices: formulation; implementation; morphing using GUIs; morphing by scripting. Morphing as a research tool.



STRAIGHT GUIs Matlab APSIPA DL talk 149

Snapshot: F0 extraction Matlab customizable F0 extractor 150

Snapshot: modification Matlab duration size F0 amplitude 151

Application Speech modification using STRAIGHT Modification using GUIs Modification by function calls Extended morphing for two voices Temporally variable multi-aspect arbitrary many voices morphing Formulation Implementation Morphing using GUIs Morphing by scripting Morphing as a research tool 152

Manipulation by function calls. Analysis functions: source analysis (fundamental frequency, aperiodic component) and filter analysis. Synthesis function: default OLA synthesis; optional approximate time varying filter; optional sinusoidal synthesis.


Fundamental frequency
    r = exf0candidateststraightgb(x,fs,paramsin)
        samplingfrequency: 22050
        f0: [159x1 double]
        periodicitylevel: [159x1 double]
        ..
Post-processing
    rc = autof0tracking(r,x);
    rc.vuv = refinevoicingdecision(x,rc);

Aperiodicity parameter
    q = aperiodicityratiosigmoid(x,rc,sidemargin,exponent,displayon)
        samplingfrequency: 22050
        f0: [159x1 double]
        vuv: [159x1 double]
        sigmoidparameter: [2x159 double]
        .....

Filter analysis
    exspectrumtstraightgb(x,fs,sourceobj,paramsin)
        ElapsedTimeForSpectrum: 0.1414
        temporalpositions: [1x159 double]
        spectrogramstraight: [1025x159 double]
        samplingfrequency: 22050
        TANDEMSTRAIGHTconditions: [1x1 struct]
        spectrogramtandem: [1025x159 double]
        dateofspectrumestimation: 'DD-MM-2014 01:00:58'

Default OLA synthesis
    exgeneralstraightsynthesisr2(sourcestructure,filterstructure)
        synthesisout: [17106x1 double]
        samplingfrequency: 22050
        elapsedtime: 0.1040
Generalized framework
    generalstraightsynthesisframeworkr2(feedinghandle, responsehandle, deterministichandle, randomhandle, shifterhandle, datasubstrate, optionalparameters)

Modification by function calls
Fundamental frequency manipulation (example)
Making the fundamental frequency 1.2 times higher:
    rm = r;
    rm.f0 = r.f0*1.2;
    s = exgeneralstraightsynthesisr2(rm,f);
Making the fundamental frequency 50 Hz higher:
    rm = r;
    rm.f0 = r.f0+50;
    s = exgeneralstraightsynthesisr2(rm,f);

Modification by function calls
Speaking rate manipulation (example)
Making the total duration 2 times longer:
    rm = r;
    rm.temporalpositions = r.temporalpositions*2;
    s = exgeneralstraightsynthesisr2(rm,f);

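Modification by function calls
Vocal tract length manipulation (example)
Making the vocal tract length 1.2 times longer:
    % Field names are case-sensitive; they are written here as in the struct listing of the filter analysis output.
    fftl = (size(f.spectrogramstraight,1)-1)*2;          % FFT length from the number of frequency bins
    fxoriginal = (0:fftl/2)/fftl*f.samplingfrequency;    % original frequency axis (Hz)
    fxtarget = fxoriginal*1.2;                           % read the envelope at 1.2 times each frequency
    fxtarget = min(f.samplingfrequency/2, fxtarget);     % clamp at the Nyquist frequency
    fm = f;
    fm.spectrogramstraight = interp1(fxoriginal,f.spectrogramstraight,fxtarget);
    s = exgeneralstraightsynthesisr2(r,fm);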

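Modification by function calls
Vocal tract length manipulation (example)
Making the vocal tract length 0.8 times the original:
    % Field names are case-sensitive; they are written here as in the struct listing of the filter analysis output.
    fftl = (size(f.spectrogramstraight,1)-1)*2;          % FFT length from the number of frequency bins
    fxoriginal = (0:fftl/2)/fftl*f.samplingfrequency;    % original frequency axis (Hz)
    fxtarget = fxoriginal*0.8;                           % read the envelope at 0.8 times each frequency
    fm = f;
    fm.spectrogramstraight = interp1(fxoriginal,f.spectrogramstraight,fxtarget);
    s = exgeneralstraightsynthesisr2(r,fm);
Nonlinear modification of the frequency axis is possible by designing fxtarget.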
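The frequency-axis mapping used in the vocal tract length examples can be tried without STRAIGHT installed. The following is a minimal sketch on a synthetic one-frame envelope; the envelope shape, the 1000 Hz peak, and the variable names (envelope, warped, peakOrig, peakWarp) are assumptions made for illustration and are not part of the STRAIGHT distribution.

```matlab
% Sketch: warp the frequency axis of a synthetic spectral envelope.
fs = 22050;                        % sampling frequency (Hz), as in the listings
fftl = 2048;                       % FFT length, giving 1025 frequency bins
fx = (0:fftl/2)'/fftl*fs;          % original frequency axis (Hz)

% Hypothetical envelope with a single peak near 1000 Hz.
envelope = exp(-((fx-1000)/200).^2) + 0.01;

% Vocal tract 1.2 times longer: read the original envelope at 1.2*f,
% clamped at the Nyquist frequency so interp1 stays on the grid.
fxTarget = min(fs/2, fx*1.2);
warped = interp1(fx, envelope, fxTarget);

% Locate the peak before and after warping.
[~, iOrig] = max(envelope);
[~, iWarp] = max(warped);
peakOrig = fx(iOrig);
peakWarp = fx(iWarp);
```

The peak moves from about 1000 Hz to about 1000/1.2 Hz, which is the downward shift expected for a longer vocal tract. A nonlinear mapping can be tested the same way by replacing fx*1.2 with any monotonic function of fx.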

http://ml.cs.yamanashi.ac.jp/straight/english/index.html


Temporally variable multi-aspect N-way morphing [Figure: axes labeled "voices" and "attribute"]

Temporally variable multi-aspect N-way morphing
[Block diagram: each input signal (signal-1, ..., signal-k, ..., signal-n) is analyzed (spectral envelope analysis, F0 analysis, non-periodicity analysis) into physical attributes (spectral envelope, F0, nonperiodicity). Time axis alignment and frequency axis alignment provide the time axis mapping and frequency axis mapping. The morphing stage combines the attributes using a set of indexed weights of physical attributes, and the synthesis stage (periodic pulse generator, non-periodic component generator, shaper and mixer, filter) produces the output signal. Legend: signal, parameter, data, process.]


Generalized morphing enabling extrapolation
attribute                        constraint      morphing rule
location, speed, ...             no constraint   w.sum(function)
F0, power, ...                   positivity      exponent(w.sum(log(function)))
time axis, frequency axis, ...   monotonicity    integration(exponent(w.sum(log(derivative of function))))

What is the problem?
[Figure: interpolation compared with extrapolation. Extrapolation breaks down because the mapping becomes non-monotonic.]

Speech parameter constraints
(1) time increases monotonically
(2) frequency increases monotonically
(3) time-frequency spectral representation is positive
(4) fundamental frequency is positive
Morphing entity, with abstract frequency \nu and abstract time \tau:
\[
\Theta^{(k)}(\nu,\tau) = \left\{ f_0^{(k)}\!\left(t^{(k)}(\tau)\right),\; a^{(k)}\!\left(t^{(k)}(\tau)\right),\; P^{(k)}\!\left(f^{(k)}(\nu), t^{(k)}(\tau)\right),\; f^{(k)}(\nu),\; t^{(k)}(\tau) \right\} \tag{1}
\]


No constraint case
Morphed parameter (a function) as a weighted sum over N cases, where g^{(k)} is the speech parameter, w^{(k)} the weight, k the index of the case, and N the number of cases:
\[
g_{m1}\!\left(t_{m3}(\tau)\right) = \sum_{k=1}^{N} w^{(k)}\!\left(t^{(k)}(\tau)\right) g^{(k)}\!\left(t^{(k)}(\tau)\right) \tag{2}
\]
\[
\sum_{k=1}^{N} w^{(k)}\!\left(t^{(k)}(\tau)\right) = 1 \quad \text{(not always necessary)}
\]

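Positivity constraint
\[
g_{m2}\!\left(t_{m3}(\tau)\right) = \exp\!\left( \sum_{k=1}^{N} w^{(k)}\!\left(t^{(k)}(\tau)\right) \log\!\left( g^{(k)}\!\left(t^{(k)}(\tau)\right) \right) \right) = \prod_{k=1}^{N} \left( g^{(k)}\!\left(t^{(k)}(\tau)\right) \right)^{w^{(k)}\left(t^{(k)}(\tau)\right)} \tag{4}
\]
\[
g_{m2}\!\left(t_{m3}(\tau)\right) > 0
\]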
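The difference between the unconstrained weighted sum and the positivity-preserving rule shows up as soon as the weights extrapolate. The following is a minimal sketch for one time frame; the two F0 values and the weights are made-up numbers, and the variable names are illustrative only.

```matlab
% Two hypothetical F0 values (Hz) from two voices at one time frame.
f0 = [100 200];

% Extrapolating weights: they sum to one, but one of them is negative.
w = [2.5 -1.5];

% Weighted sum in the linear domain: 250 - 300 = -50, not a valid F0.
linearMorph = sum(w .* f0);

% Weighted sum in the log domain, then exponentiation:
% equals 100^2.5 * 200^(-1.5), which is positive by construction.
logMorph = exp(sum(w .* log(f0)));

% With interpolating weights the log-domain rule gives the geometric mean.
geoMean = exp(sum([0.5 0.5] .* log(f0)));
```

The linear rule produces a negative frequency under extrapolation, while the log-domain rule stays positive for any real weights.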


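Monotonicity constraint
Morphed attribute (a function) over N cases, where g^{(k)} is the speech attribute, w^{(k)} the weight, k the index of the case, and \xi the abstract parameter:
\[
g_{m3}(\tau) = \int_{0}^{\tau} \exp\!\left( \sum_{k=1}^{N} w^{(k)}(\xi) \log\!\left( \frac{dg^{(k)}(\xi)}{d\xi} \right) \right) d\xi = \int_{0}^{\tau} \prod_{k=1}^{N} \left( \frac{dg^{(k)}(\xi)}{d\xi} \right)^{w^{(k)}(\xi)} d\xi \tag{5}
\]
\[
\frac{dg_{m3}(\tau)}{d\tau} > 0
\]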
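The monotonicity rule can be checked numerically by combining the derivatives in the log domain and integrating the result. The following is a minimal sketch; the two example maps, the weights, and the use of cumtrapz for the integral are choices made for illustration and are not taken from the STRAIGHT code.

```matlab
% Abstract parameter grid.
tau = linspace(0, 1, 101);

% Derivatives of two hypothetical monotonic maps:
% g1(tau) = tau and g2(tau) = tau.^2 + 0.1*tau.
dg1 = ones(size(tau));
dg2 = 2*tau + 0.1;

% Extrapolating weights (constant over tau in this sketch).
w = [2 -1];

% Naive weighted sum of the derivatives: 1.9 - 2*tau, negative near tau = 1,
% so the resulting map would not be monotonic.
dNaive = w(1)*dg1 + w(2)*dg2;

% Log-domain combination of the derivatives is always positive.
dMorph = exp(w(1)*log(dg1) + w(2)*log(dg2));

% Integrate from 0 (trapezoidal rule) to obtain the morphed map.
gMorph = cumtrapz(tau, dMorph);
```

Because the integrand is strictly positive, the morphed map increases monotonically even though the weights lie outside the interpolation range.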

Generalized morphing
The morphed entity is a function of the K exemplar entities and the weight set W:
\[
\Theta_m(\nu,\tau) = T\!\left( \Theta^{(1)}(\nu,\tau), \Theta^{(2)}(\nu,\tau), \ldots, \Theta^{(K)}(\nu,\tau); W \right) \tag{6}
\]
\[
W = \left\{ w_{F0}(\tau),\; w_{A}(\tau),\; w_{P}(\tau),\; w_{Fx}(\tau),\; w_{Tx}(\tau) \right\} \tag{7}
\]
\[
w_X(\tau) = \left[ w_X^{(1)}(\tau), w_X^{(2)}(\tau), \ldots, w_X^{(K)}(\tau) \right]^{T}, \quad X \in \{F0, A, P, Fx, Tx\}
\]
F0: fundamental frequency; A: aperiodicity; P: time-frequency representation; Fx: frequency coordinate; Tx: time coordinate.

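Implementation: piecewise linear function
Time axis of an example, where k is the ID of the example, n the ID of the anchor, \tau_n the anchor location, and p^{(k)}(\tau_n) the value at an anchor:
\[
t^{(k)}(\tau) = \left( p^{(k)}(\tau_{n+1}) - p^{(k)}(\tau_n) \right)(\tau - \tau_n) + p^{(k)}(\tau_n) \tag{8}
\]
Morphed time axis, where p_m(\tau_n) is the value at the morphed location:
\[
t_{m3}(\tau) = \left( p_m(\tau_{n+1}) - p_m(\tau_n) \right)(\tau - \tau_n) + p_m(\tau_n) \tag{11}
\]
\[
p_m(\tau_n) = \prod_{k=1}^{K} \left( p^{(k)}(\tau_n) - p^{(k)}(\tau_{n-1}) \right)^{w_{Tx}^{(k)}(\tau_n)} + p_m(\tau_{n-1}) \tag{12}
\]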

Matlab implementation of function inversion
Forward mapping:
    yi = interp1(x,y,xi,'linear','extrap');
Inverse mapping (valid when y is monotonic in x):
    xi = interp1(y,x,yi,'linear','extrap');
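The inversion works by swapping the roles of the two sample vectors in interp1, which is valid only when the sampled function is strictly monotonic. The following is a minimal round-trip sketch on made-up data; the sample function and target values are illustrative.

```matlab
% Samples of a strictly increasing function on x >= 0.
x = 0:0.5:4;
y = x.^2 + x;

% Target values; the last one lies beyond max(y) = 20 and needs extrapolation.
yi = [0.3 5 17 22];

% Inverse mapping: swap the roles of x and y.
xi = interp1(y, x, yi, 'linear', 'extrap');

% Forward mapping of the result recovers the targets, because both
% directions use the same piecewise linear segments.
yBack = interp1(x, y, xi, 'linear', 'extrap');
```

The round trip is exact up to rounding for the piecewise linear model, which is why monotonic time and frequency mappings are convenient to store as anchor points.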


GUI for generalized morphing preparation Matlab

Matlab November, 2013, APSIPA, Taiwan


Morphing by scripting
Matlab function for temporally variable multi-aspect arbitrary many voices morphing:
    morphedobject = tvariablenwaymorphingraw(objectbundle,contributionstructure,dispon);
    synthstructure = generatemorphedsound(morphedobject);
Inputs:
    objectbundle =
        STRAIGHTobject: {1x8 cell}
    contributionstructure =
        timeaxis: [47x8 double]
        fundamentalfrequency: [47x8 double]
        frequencyaxis: [47x8 double]
        aperiodicity: [47x8 double]
        spectrum: [47x8 double]

Morphing by scripting
Output of tvariablenwaymorphingraw:
    morphedobject =
        morphedtimeanchors: [49x1 double]
        timemorphedframe: [522x8 double]
        morphedtargetf0: 115.1853
        morphedf0: [1x522 double]
        f0listonmorphedtime: [522x8 double]
        frequencymappingatanchor: [1x1 struct]
        frameonmorphing: [522x1 double]
        morphedvuv: [1x522 double]
        contributionstructure: [1x1 struct]
        morphedspectrogram: [2049x522 double]
        morphedaperiodicity: [2x522 double]
        elapsedtime: 1.1410
        cutofflistfix: [5x1 double]
        samplingfrequency: 48000
        procedurename: 'tvariablenwaymorphing'
        tmpobj: [1x1 struct]

Morphing by scripting
Flexible manipulation can be implemented by assigning relevant weights to the fields of contributionstructure (timeaxis, fundamentalfrequency, frequencyaxis, aperiodicity, spectrum), each a [47x8 double] array in this example.
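A weight set of the listed shape can be prepared with ordinary array operations. The following is a minimal sketch that only builds the struct; it assumes that rows index anchors and columns index voices (suggested by the 47x8 fields and the 1x8 cell of objects, but not confirmed here), and the ramp from voice 1 to voice 2 is a made-up scenario. The STRAIGHT call is left as a comment because it requires the toolbox and an analyzed objectbundle.

```matlab
% Sizes taken from the listing: 47 anchors (assumed rows), 8 voices (assumed columns).
nAnchors = 47;
nVoices = 8;

% Weights that move gradually from voice 1 to voice 2 over the anchors.
ramp = linspace(0, 1, nAnchors)';
wRamp = zeros(nAnchors, nVoices);
wRamp(:,1) = 1 - ramp;
wRamp(:,2) = ramp;

% Weights that use voice 1 only at every anchor.
wFixed = zeros(nAnchors, nVoices);
wFixed(:,1) = 1;

% Morph only the fundamental frequency; keep all other aspects from voice 1.
contributionstructure = struct();
contributionstructure.timeaxis = wFixed;
contributionstructure.fundamentalfrequency = wRamp;
contributionstructure.frequencyaxis = wFixed;
contributionstructure.aperiodicity = wFixed;
contributionstructure.spectrum = wFixed;

% Not executed in this sketch (requires STRAIGHT):
% morphedobject = tvariablenwaymorphingraw(objectbundle,contributionstructure,dispon);
```

Each row sums to one in this sketch, which corresponds to interpolation; values outside the range from 0 to 1 would request extrapolation.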


Summary (Application, STRAIGHT, Background)
Interference-free representations play important roles.
Periodic excitation is an efficient and robust strategy for sampling and transmitting relevant information for communications using voice.
STRAIGHT is a collection of functions and applications.
Extended morphing provides a unique research strategy useful for para- and non-linguistic aspects of speech.

Thank you!
Roy D. Patterson, Masanori Morise, Hideki Banno, Toshio Irino, Ryuichi Nisimura, Verena G. Skuk, Stefan Schweinberger, Parham Zolfaghari, Ken-Ichi Sakakibara, Ikuyo Masuda-Katsuse, Alain de Cheveigné, Josh McDermott, Osamu Fujimura, Toru Takahashi, Tomoki Toda, and many others.
