OSS



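8 [Figure: text analysis example. Input text: 次は新金岡新金岡です ("Next is Shin-Kanaoka, Shin-Kanaoka"). Part-of-speech tags: noun, particle, proper noun, proper noun, auxiliary verb. Reading: ツギワ シンカナオカ シンカナオカ デス. Remaining labels: DNN, T frames.]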


10 Let us look at this part.


14 Synthesis filter


17 [Figure: speech frames over T frames, each with spectral features and an F0 value. Example F0 values: unvoiced, unvoiced, 200 Hz.]

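18 [Figure: text analysis example. Input text: 次は新金岡新金岡です ("Next is Shin-Kanaoka, Shin-Kanaoka"). Part-of-speech tags: noun, particle, proper noun, proper noun, auxiliary verb. Reading: ツギワ シンカナオカ シンカナオカ デス. Remaining labels: DNN, T frames.]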

19 Looking at the processing in each frame. Heiga Zen, Andrew Senior, Mike Schuster, "Statistical Parametric Speech Synthesis Using Deep Neural Networks", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP).

20 Processing over the whole utterance (feedforward example). [Figure: F0 output per frame over T frames: unvoiced, unvoiced, 200 Hz, 205 Hz, 210 Hz, 220 Hz. Frame positions: 1st, 2nd, 3rd, 4th, 5th, 6th, 7th.]
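The feedforward case on this slide can be sketched as follows: every frame is mapped from its input features to its acoustic output on its own, with no information passed between frames. This is a minimal illustration, not the presenter's model; the layer sizes, random weights, and the function name `feedforward_frames` are assumptions made for the example.

```python
import numpy as np

def feedforward_frames(linguistic, w1, b1, w2, b2):
    """Map each frame's input features to acoustic features independently."""
    hidden = np.tanh(linguistic @ w1 + b1)   # shape (T, hidden)
    return hidden @ w2 + b2                  # shape (T, outputs)

rng = np.random.default_rng(0)
# 7 frames as on the slide; all other sizes are made up for illustration.
T, d_in, d_hidden, d_out = 7, 10, 16, 2
x = rng.normal(size=(T, d_in))
w1 = rng.normal(size=(d_in, d_hidden)) * 0.1
b1 = np.zeros(d_hidden)
w2 = rng.normal(size=(d_hidden, d_out)) * 0.1
b2 = np.zeros(d_out)

y = feedforward_frames(x, w1, b1, w2, b2)
# Processing frame 3 alone gives the same result as processing all frames,
# because a feedforward network has no connections across time.
y3 = feedforward_frames(x[3:4], w1, b1, w2, b2)
```

Because each row is computed separately, nothing in this structure smooths the output trajectory from one frame to the next.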

21 [Figure: highway block built from feedforward (FF) layers. Labels: FF, Highway block, ×, +, -1.] Xin Wang, Shinji Takaki, Junichi Yamagishi, "Investigating very deep highway networks for parametric speech synthesis", Speech Communication.
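As a rough sketch of what a highway block computes, the standard formulation mixes a transformed signal H(x) with the unchanged input x through a gate T(x): y = T(x) * H(x) + (1 - T(x)) * x. The code below follows that textbook form; the tanh nonlinearity, the weight shapes, and the name `highway_block` are assumptions, and the cited paper's exact configuration may differ.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def highway_block(x, w_h, b_h, w_t, b_t):
    """y = T(x) * H(x) + (1 - T(x)) * x, all element-wise."""
    h = np.tanh(x @ w_h + b_h)      # transformed candidate H(x)
    t = sigmoid(x @ w_t + b_t)      # transform gate T(x), values in (0, 1)
    return t * h + (1.0 - t) * x    # gate blends H(x) with the untouched input

rng = np.random.default_rng(0)
dim = 8                              # hypothetical layer width
x = rng.normal(size=(4, dim))
w_h = rng.normal(size=(dim, dim)) * 0.1
w_t = rng.normal(size=(dim, dim)) * 0.1
b_h = np.zeros(dim)

# A strongly negative gate bias closes the gate: the block passes x through.
y_closed = highway_block(x, w_h, b_h, w_t, np.full(dim, -20.0))
# A strongly positive gate bias opens it: the block behaves like a plain FF layer.
y_open = highway_block(x, w_h, b_h, w_t, np.full(dim, 20.0))
```

The pass-through behavior with a closed gate is what lets many such blocks be stacked without the input signal being lost.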


23 Female voice, male voice. Xin Wang, Shinji Takaki, Junichi Yamagishi, "A Comparative Study of the Performance of HMM, DNN, and RNN based Speech Synthesis Systems Trained on Very Large Speaker-Dependent Corpora", 9th ISCA Workshop on Speech Synthesis.


25 [Slide text partly lost in extraction. Surviving terms: 16 kHz, 16,000; AR; AR: LPC.]
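The slide pairs AR with LPC. In the usual sense, linear predictive coding is a linear autoregressive model: each sample is predicted as a weighted sum of the preceding samples. The sketch below fits such weights by least squares on a synthetic 200 Hz tone sampled at 16 kHz. The tone, the model order, and the name `fit_ar` are assumptions chosen for illustration and do not come from the slide.

```python
import numpy as np

def fit_ar(signal, order):
    """Least-squares fit of x[n] ~ sum_k a[k] * x[n-1-k] for k = 0..order-1."""
    # Each row holds the `order` samples preceding x[n], most recent first.
    past = np.array([signal[n - order:n][::-1] for n in range(order, len(signal))])
    target = signal[order:]
    coeffs, *_ = np.linalg.lstsq(past, target, rcond=None)
    return coeffs

sample_rate = 16000                      # 16 kHz, i.e. 16,000 samples per second
n = np.arange(400)
omega = 2 * np.pi * 200 / sample_rate    # hypothetical 200 Hz test tone
x = np.sin(omega * n)

# A pure sinusoid obeys x[n] = 2*cos(omega)*x[n-1] - x[n-2] exactly,
# so an order-2 fit should recover those two coefficients.
a = fit_ar(x, order=2)
```

Real speech needs a higher order and is usually analyzed frame by frame, but the prediction structure is the same.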


27 Xin Wang, Shinji Takaki, Junichi Yamagishi, "An Autoregressive Recurrent Mixture Density Network for Parametric Speech Synthesis", ICASSP. Xin Wang, Shinji Takaki, Junichi Yamagishi, "An RNN-based Quantized F0 Model with Multi-tier Feedback Links for Text-to-Speech Synthesis", Interspeech.

28 [Figure: WaveNet. Input: one-hot quantized waveform (time shifted). A stack of blocks (Block 1, Block 2, ..., Block 40), each with a dilated 1-D CNN, tanh and sigmoid branches, and 1-D CNNs with additive connections. Output: 1-D CNNs and a softmax over the quantized waveform, at a time resolution of 16 kHz. Conditional parameters: feedforward and Bi-LSTM layers at frame level, time resolution 1/(5 ms) = 200 Hz, followed by feedforward and up-sampling.] van den Oord, Aaron; Dieleman, Sander; Zen, Heiga; Simonyan, Karen; Vinyals, Oriol; Graves, Alex; Kalchbrenner, Nal; Senior, Andrew; Kavukcuoglu, Koray, "WaveNet: A Generative Model for Raw Audio", arXiv.
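Two pieces of this diagram can be sketched in a few lines: the dilated causal convolution with its tanh and sigmoid branches, and the up-sampling of frame-level conditions to the waveform rate. This is a single-channel toy under stated assumptions (kernel size 2, scalar weights, repetition as the up-sampling method, and all function names are hypothetical); an actual WaveNet uses multi-channel convolutions, residual and skip connections, and learned parameters.

```python
import numpy as np

def causal_dilated_conv(x, weights, dilation):
    """Kernel-size-2 causal convolution: y[t] = w[0]*x[t-dilation] + w[1]*x[t]."""
    # Shift x to the right by `dilation` samples, padding the start with zeros.
    past = np.concatenate([np.zeros(dilation), x[:-dilation]])
    return weights[0] * past + weights[1] * x

def gated_unit(x, w_filter, w_gate, dilation):
    """tanh branch multiplied element-wise by a sigmoid branch."""
    f = np.tanh(causal_dilated_conv(x, w_filter, dilation))
    g = 1.0 / (1.0 + np.exp(-causal_dilated_conv(x, w_gate, dilation)))
    return f * g

def upsample_conditions(frames, samples_per_frame):
    """Repeat each frame-level vector so it covers every waveform sample."""
    return np.repeat(frames, samples_per_frame, axis=0)

x = np.sin(np.linspace(0.0, 6.0, 32))    # toy waveform
w_f = np.array([0.5, -0.3])              # made-up filter weights
w_g = np.array([0.2, 0.7])               # made-up gate weights
out = gated_unit(x, w_f, w_g, dilation=4)

# Causality check: changing samples from t = 20 onward leaves earlier outputs intact.
x_future = x.copy()
x_future[20:] += 1.0
out_future = gated_unit(x_future, w_f, w_g, dilation=4)

# A 5 ms frame at 16 kHz spans 0.005 * 16000 = 80 waveform samples.
cond = upsample_conditions(np.zeros((3, 5)), samples_per_frame=80)
```

Stacking such units with growing dilation widens the span of past samples each output can see without increasing the kernel size.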

29 [Figure: two-stage system. Neural waveform generator (16 kHz): waveform (time shifted) and a linear layer feed a stack of dilated 1-D CNNs with tanh and sigmoid gates and additive connections, followed by 1-D CNNs and a softmax that outputs the waveform; F0 and spectral features enter through a bi-directional LSTM and 1-D CNN. Autoregressive acoustic models (200 Hz): linguistic features pass through tanh-feedforward and bi-directional LSTM layers; output components shown are a uni-directional LSTM, linear layers, a hierarchical softmax, and an autoregressive GMM.]


30 Xin Wang, Jaime Lorenzo-Trueba, Shinji Takaki, Lauri Juvela, Junichi Yamagishi, "A comparison of recent waveform generation and acoustic modeling methods for neural-network-based speech synthesis", ICASSP. [Figure: compared systems SAR-Wa, SAR-Pr, SAR-Pm, SAR-Wo, SGA-Wo, RGA-Wo, RNN-Wo. Waveform generators: phase recovery, minimum phase, WaveNet, PML, WORLD. Acoustic models: GAN, DAR, SAR, RNN, producing F0 and MGC from linguistic features. Legend: Reference, 16 kHz, 48 kHz, WaveNet.]

31 TTS. Hieu-Thi Luong, Xin Wang, Junichi Yamagishi, Nobuyuki Nishizawa, "Do prosodic manual annotations matter for Japanese speech synthesis systems with WaveNet vocoder?", submitted to Interspeech.

32 The cat in the hat

33 Sutskever, Ilya, Oriol Vinyals, and Quoc V. Le. "Sequence to sequence learning with neural networks." Advances in neural information processing systems

34 Spectrogram. J. Shen, M. Schuster, N. Jaitly, R. Skerry-Ryan, R. A. Saurous, R. J. Weiss, R. Pang, Y. Agiomyrgiannakis, Y. Wu, Y. Zhang, Y. Wang, Z. Chen, and Z. Yang, "Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions", ICASSP.

35 Deep Voice 3 from Baidu (Tacotron 2 + dot-product attention + speaker embedding). Wei Ping, Kainan Peng, Andrew Gibiansky, Sercan O. Arik, Ajay Kannan, Sharan Narang, Jonathan Raiman, John Miller, "Deep Voice 3: Scaling Text-to-Speech with Convolutional Sequence Learning", ICLR 2018.


38 The cat in the hat

39 [Figure: phone labels sil, a, i, sil along the time axis; frequency axis in kHz.]

40 Acoustic sequence: per-frame probability of each vowel

Frame   /a/    /i/   /u/   /e/    /o/
1       0      0.2   0     0.3    0.5
2       0      0.1   0     0.4    0.5
3       0      0     0     0.5    0.3
4       0.3    0     0     0.5    0.2
5       0.45   0     0     0.35   0.2
6       0.55   0     0     0.3    0.2

41 Acoustic sequence: per-frame probability of each vowel and of the blank symbol / /

Frame   /a/    /i/   /u/   /e/    /o/   / /
1       0      0.2   0     0.3    0.5   0
2       0      0.1   0     0.4    0     0.5
3       0      0     0     0.5    0.3   0
4       0.3    0     0     0      0.2   0.5
5       0.45   0     0     0.35   0.2   0
6       0      0     0     0.3    0.2   0.55
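One common way to turn per-frame probabilities like those on slide 41 into a label sequence is greedy decoding: take the most probable symbol in each frame, merge consecutive repeats, and drop the blank. The sketch below assumes the "/ /" column is a blank symbol in the CTC sense, which the slide does not state explicitly; the names `LABELS` and `greedy_decode` are made up for the example.

```python
import numpy as np

# "_" stands in for the blank "/ /" column (an assumption about its meaning).
LABELS = ["a", "i", "u", "e", "o", "_"]

def greedy_decode(posteriors, labels, blank="_"):
    """Pick the best symbol per frame, merge repeats, then remove blanks."""
    best = [labels[i] for i in np.argmax(posteriors, axis=1)]
    decoded = []
    previous = None
    for symbol in best:
        if symbol != previous and symbol != blank:
            decoded.append(symbol)
        previous = symbol
    return decoded

# Frame-by-frame values copied from slide 41 (columns: a, i, u, e, o, blank).
posteriors = np.array([
    [0.0,  0.2, 0.0, 0.3,  0.5, 0.0],
    [0.0,  0.1, 0.0, 0.4,  0.0, 0.5],
    [0.0,  0.0, 0.0, 0.5,  0.3, 0.0],
    [0.3,  0.0, 0.0, 0.0,  0.2, 0.5],
    [0.45, 0.0, 0.0, 0.35, 0.2, 0.0],
    [0.0,  0.0, 0.0, 0.3,  0.2, 0.55],
])

result = greedy_decode(posteriors, LABELS)
print(result)  # ['o', 'e', 'a']
```

Applied to the blank-free values on slide 40, the same procedure gives the best path o, o, e, e, a, a, which also merges to o, e, a.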


43 [Figure: network diagram with labels Bi-directional RNN, Convolution, Spectrogram.]

44 Word error rate (%): Deep Speech 2 compared with human transcription

Test set                 Deep Speech 2   Human
WSJ eval 92              3.60            5.03
WSJ eval 93              4.98            8.08
LibriSpeech test-clean   5.33            5.83
LibriSpeech test-other   13.25           12. [truncated]

Amodei, Dario, et al. "Deep Speech 2: End-to-end speech recognition in English and Mandarin." arXiv preprint (2015).
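Assuming the figures in this table are word error rates (the metric named on slide 46), the measure itself can be sketched as the word-level edit distance between a reference transcript and a recognizer output, divided by the number of reference words. The function name `word_error_rate` and the example sentences are illustrative, and the sketch assumes a non-empty reference.

```python
def word_error_rate(reference, hypothesis):
    """(substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    # dist[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words.
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i
    for j in range(len(hyp) + 1):
        dist[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # insertion
                dist[i - 1][j - 1] + cost,  # match or substitution
            )
    return dist[len(ref)][len(hyp)] / len(ref)

# One substituted word out of five reference words gives a WER of 0.2 (20%).
wer = word_error_rate("the cat in the hat", "the cat in a hat")
```

Because insertions count as errors, the value can exceed 1.0 when the hypothesis is much longer than the reference.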

45 [Figure: RNN-Transducer with an acoustic model and a language model that takes the previously predicted word; outputs are word pieces. Example probabilities shown: /but/ 0.2, /cat/ 0.5, /hat/ 0.2, /and/ 0.1, / / 0; and /a/ 0, /an/ 0.2, /the/ 0.5, /its/ 0.3, / / 0.] Kanishka Rao, Haşim Sak, Rohit Prabhavalkar, "Exploring Architectures, Data and Units For Streaming End-to-End Speech Recognition with RNN-Transducer", ASRU.

46 DNN breakthrough. [Figure: Switchboard WER; labels: 10, Human.]
