JP2020166299A5 - Speech synthesis methods, speech synthesis systems and programs

Speech synthesis methods, speech synthesis systems and programs

Info

Publication number
JP2020166299A5
JP2020166299A5
Authority
JP
Japan
Prior art keywords
acoustic signal
harmonic component
trained model
frequency spectrum
unvoiced
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
JP2020114265A
Other languages
Japanese (ja)
Other versions
JP2020166299A (en)
JP6977818B2 (en)
Filing date
Publication date
Priority claimed from JP2017229041A
Application filed
Priority to JP2020114265A
Priority claimed from JP2020114265A
Publication of JP2020166299A
Publication of JP2020166299A5
Application granted
Publication of JP6977818B2
Legal status: Active
Anticipated expiration

Description

In the first embodiment, since the fundamental frequency Y1 of the harmonic spectrum H is input to the second trained model M2, the samples of the inharmonic signal A2 vary in conjunction with the fundamental frequency Y1. For example, even when the control data X (and the voicing determination result Y2) is the same, inharmonic signals A2 with different acoustic characteristics may be generated if the fundamental frequency Y1 differs. Likewise, since the voicing determination result Y2 is input to the second trained model M2, the samples of the inharmonic signal A2 vary in conjunction with the voicing determination result Y2. For example, even when the control data X (and the fundamental frequency Y1) is the same, inharmonic signals A2 with different acoustic characteristics may be generated if the voicing determination result Y2 differs. Note that the second trained model M2 may be a model that receives only one of the fundamental frequency Y1 and the voicing determination result Y2 as input, without receiving the other.
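To make the data flow concrete, here is a minimal Python sketch (not taken from the patent; the model body is a placeholder) of how the second trained model M2 can be conditioned on the control data X, the fundamental frequency Y1 of the harmonic spectrum H, and the voicing determination result Y2:

```python
import numpy as np

def m2_step(x_frame, f0, voiced_flag, state=None):
    """Hypothetical stand-in for one step of the second trained model M2.

    x_frame     : control-data vector X for the current frame
    f0          : fundamental frequency Y1 taken from the harmonic spectrum H
    voiced_flag : voicing determination result Y2 (1.0 = voiced, 0.0 = unvoiced)
    """
    # Conditioning: X, Y1 and Y2 are concatenated into a single input vector,
    # so the generated inharmonic sample changes whenever Y1 or Y2 changes,
    # even if X stays the same.
    inp = np.concatenate([np.asarray(x_frame, dtype=float), [f0, voiced_flag]])
    sample = float(np.tanh(inp.sum()))  # placeholder for the trained network
    return sample, state
```

A variant that uses only one of Y1 and Y2, as mentioned at the end of the paragraph above, would simply drop the unused element from the concatenated input.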

(2) In each of the embodiments described above, the auxiliary data Y derived from the processing result of the first trained model M1 is input to the second trained model M2 mainly in order to synchronize the output of the first trained model M1 with the output of the second trained model M2. However, the configuration in which the second trained model M2 uses the auxiliary data Y may be omitted, for example by including data for synchronizing the two models in the control data X. Alternatively, only one of the fundamental frequency Y1 of the harmonic component and the voicing determination result Y2 may be input to the second trained model M2 together with the control data X.
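As a hypothetical sketch of this variant (again not from the patent), the synchronization information can be folded into the control data X itself, so that M2 needs no auxiliary data Y:

```python
import numpy as np

def build_control_frame(base_features, frame_index, total_frames):
    # A normalized frame position carried inside X acts as the synchronization
    # data that keeps the two models aligned without auxiliary data Y.
    position = frame_index / max(total_frames - 1, 1)
    return np.concatenate([np.asarray(base_features, dtype=float), [position]])
```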

Claims (10)

1. A speech synthesis method realized by a computer, the method comprising:
generating, with a first trained model, a time series of frequency spectra of a harmonic component according to control data of a synthesized voice;
generating, with a second trained model, a time-domain acoustic signal representing a waveform of an inharmonic component according to the control data; and
generating a voice signal including the harmonic component and the inharmonic component from the time series of frequency spectra and the acoustic signal,
wherein, in generating the acoustic signal, the acoustic signal is generated in unvoiced sections in which the synthesized voice is unvoiced, and generation of the acoustic signal is stopped in voiced sections in which the synthesized voice is voiced.
2. The speech synthesis method according to claim 1, wherein the time series of frequency spectra is generated in both the unvoiced sections and the voiced sections.
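As an informal illustration of the method of claims 1 and 2 (not part of the claims; the two trained models and the voiced/unvoiced segmentation are represented by placeholder callables, and the spectrum-to-waveform step is a crude stand-in for whatever resynthesis an implementation actually uses), a minimal Python sketch might look like this:

```python
import numpy as np

def spectra_to_waveform(spectra, frame_len):
    # Crude placeholder: a real system would use, e.g., overlap-add inverse STFT.
    return np.concatenate([np.fft.irfft(s, n=frame_len) for s in spectra])

def synthesize(control_frames, voiced_flags, m1, m2, frame_len=256):
    """Sketch: m1 maps a control-data frame to a harmonic frequency spectrum,
    m2 maps a control-data frame to frame_len time-domain inharmonic samples."""
    # First trained model: spectrum time series, generated in all sections (claim 2).
    spectra = [m1(x) for x in control_frames]
    # Second trained model: inharmonic signal, generated only in unvoiced sections (claim 1).
    inharmonic = np.zeros(len(control_frames) * frame_len)
    for i, (x, voiced) in enumerate(zip(control_frames, voiced_flags)):
        if not voiced:
            inharmonic[i * frame_len:(i + 1) * frame_len] = m2(x)
        # In voiced sections, generation of the acoustic signal is stopped.
    # Voice signal containing both components.
    return spectra_to_waveform(spectra, frame_len) + inharmonic
```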
3. A speech synthesis method realized by a computer, the method comprising:
generating, with a first trained model, a time series of frequency spectra of a harmonic component according to control data of a synthesized voice;
generating, with a second trained model, a time-domain acoustic signal representing a waveform of an inharmonic component according to the control data; and
generating a voice signal including the harmonic component and the inharmonic component from the time series of frequency spectra and the acoustic signal,
wherein the number of bits per sample of the acoustic signal in unvoiced sections in which the synthesized voice is unvoiced exceeds the number of bits per sample of the acoustic signal in voiced sections in which the synthesized voice is voiced.
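A minimal sketch of the bit-allocation idea in claim 3 (illustrative only; the 16-bit and 8-bit figures are arbitrary example values, not numbers taken from the patent):

```python
import numpy as np

def quantize(samples, bits):
    # Uniform quantization of samples in [-1, 1] to the given bit depth.
    levels = 2 ** (bits - 1) - 1
    return np.round(np.clip(samples, -1.0, 1.0) * levels) / levels

def quantize_inharmonic_frame(samples, voiced, unvoiced_bits=16, voiced_bits=8):
    # The inharmonic component matters most in unvoiced sections, so its samples
    # get more bits there than in voiced sections (example values only).
    return quantize(samples, unvoiced_bits if not voiced else voiced_bits)
```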
4. The speech synthesis method according to any one of claims 1 to 3, wherein the first trained model is a neural network that outputs a frequency spectrum of the harmonic component every first unit period, and the second trained model is a neural network that outputs samples of the inharmonic component in the time domain every second unit period shorter than the first unit period.
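The two output rates in claim 4 can be pictured with the following sketch (the 5 ms frame period and 48 kHz sampling rate are assumed example values; both models are placeholder callables):

```python
FRAME_PERIOD_S = 0.005   # first unit period: one spectrum per 5 ms (assumed value)
SAMPLE_RATE_HZ = 48_000  # second unit period: one sample per 1/48000 s (assumed value)
SAMPLES_PER_FRAME = int(FRAME_PERIOD_S * SAMPLE_RATE_HZ)  # 240 samples per frame

def run_models(control_frames, m1_step, m2_step):
    # m1_step emits one harmonic spectrum per frame; m2_step emits one
    # time-domain inharmonic sample per (much shorter) sampling period.
    spectra, samples = [], []
    for x in control_frames:
        spectra.append(m1_step(x))
        for _ in range(SAMPLES_PER_FRAME):
            samples.append(m2_step(x))
    return spectra, samples
```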
5. A speech synthesis system comprising:
a first trained model that generates a time series of frequency spectra of a harmonic component according to control data of a synthesized voice;
a second trained model that generates a time-domain acoustic signal representing a waveform of an inharmonic component according to the control data; and
a synthesis processing unit that generates a voice signal including the harmonic component and the inharmonic component from the time series of frequency spectra and the acoustic signal,
wherein the second trained model generates the acoustic signal in unvoiced sections in which the synthesized voice is unvoiced, and stops generating the acoustic signal in voiced sections in which the synthesized voice is voiced.
6. The speech synthesis system according to claim 5, wherein the first trained model generates the time series of frequency spectra in both the unvoiced sections and the voiced sections.
7. A speech synthesis system comprising:
a first trained model that generates a time series of frequency spectra of a harmonic component according to control data of a synthesized voice;
a second trained model that generates a time-domain acoustic signal representing a waveform of an inharmonic component according to the control data; and
a synthesis processing unit that generates a voice signal including the harmonic component and the inharmonic component from the time series of frequency spectra and the acoustic signal,
wherein the number of bits per sample of the acoustic signal in unvoiced sections in which the synthesized voice is unvoiced exceeds the number of bits per sample of the acoustic signal in voiced sections in which the synthesized voice is voiced.
8. The speech synthesis system according to any one of claims 5 to 7, wherein the first trained model is a neural network that outputs a frequency spectrum of the harmonic component every first unit period, and the second trained model is a neural network that outputs samples of the inharmonic component in the time domain every second unit period shorter than the first unit period.
9. A program that causes a computer to execute:
a process of generating, with a first trained model, a time series of frequency spectra of a harmonic component according to control data of a synthesized voice;
a process of generating, with a second trained model, a time-domain acoustic signal representing a waveform of an inharmonic component according to the control data; and
a process of generating a voice signal including the harmonic component and the inharmonic component from the time series of frequency spectra and the acoustic signal,
wherein, in the process of generating the acoustic signal, the acoustic signal is generated in unvoiced sections in which the synthesized voice is unvoiced, and generation of the acoustic signal is stopped in voiced sections in which the synthesized voice is voiced.
10. A program that causes a computer to execute:
a process of generating, with a first trained model, a time series of frequency spectra of a harmonic component according to control data of a synthesized voice;
a process of generating, with a second trained model, a time-domain acoustic signal representing a waveform of an inharmonic component according to the control data; and
a process of generating a voice signal including the harmonic component and the inharmonic component from the time series of frequency spectra and the acoustic signal,
wherein the number of bits per sample of the acoustic signal in unvoiced sections in which the synthesized voice is unvoiced exceeds the number of bits per sample of the acoustic signal in voiced sections in which the synthesized voice is voiced.
JP2020114265A 2017-11-29 2020-07-01 Speech synthesis methods, speech synthesis systems and programs Active JP6977818B2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2020114265A JP6977818B2 (en) 2017-11-29 2020-07-01 Speech synthesis methods, speech synthesis systems and programs

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2017229041A JP6733644B2 (en) 2017-11-29 2017-11-29 Speech synthesis method, speech synthesis system and program
JP2020114265A JP6977818B2 (en) 2017-11-29 2020-07-01 Speech synthesis methods, speech synthesis systems and programs

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
JP2017229041A Division JP6733644B2 (en) 2017-11-29 2017-11-29 Speech synthesis method, speech synthesis system and program

Publications (3)

Publication Number Publication Date
JP2020166299A (en) 2020-10-08
JP2020166299A5 (en) 2021-01-07
JP6977818B2 (en) 2021-12-08

Family

ID=72666035

Family Applications (1)

Application Number Title Priority Date Filing Date
JP2020114265A Active JP6977818B2 (en) 2017-11-29 2020-07-01 Speech synthesis methods, speech synthesis systems and programs

Country Status (1)

Country Link
JP (1) JP6977818B2 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110720122B (en) * 2017-06-28 2023-06-27 Yamaha Corporation Sound generating device and method

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO1995030193A1 (en) * 1994-04-28 1995-11-09 Motorola Inc. A method and apparatus for converting text into audible signals using a neural network
JP4067762B2 (en) * 2000-12-28 2008-03-26 ヤマハ株式会社 Singing synthesis device
JP2002268660A (en) * 2001-03-13 2002-09-20 Japan Science & Technology Corp Method and device for text voice synthesis
JP5102939B2 (en) * 2005-04-08 2012-12-19 ヤマハ株式会社 Speech synthesis apparatus and speech synthesis program
US20120316881A1 (en) * 2010-03-25 2012-12-13 Nec Corporation Speech synthesizer, speech synthesis method, and speech synthesis program
