JP2020166299A5 - Speech synthesis methods, speech synthesis systems and programs - Google Patents
- Publication number
- JP2020166299A5 (application JP2020114265A)
- Authority
- JP
- Japan
- Prior art keywords
- acoustic signal
- harmonic component
- trained model
- frequency spectrum
- unvoiced
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Description
In the first embodiment, since the fundamental frequency Y1 of the harmonic spectrum H is input to the second trained model M2, the samples of the inharmonic signal A2 vary in conjunction with the fundamental frequency Y1. For example, even when the control data X (and the voicedness determination result Y2) are the same, inharmonic signals A2 with different acoustic characteristics may be generated if the fundamental frequencies Y1 differ. Likewise, since the voicedness determination result Y2 is input to the second trained model M2, the samples of the inharmonic signal A2 vary in conjunction with the voicedness determination result Y2. For example, even when the control data X (and the fundamental frequency Y1) are the same, inharmonic signals A2 with different acoustic characteristics may be generated if the voicedness determination results Y2 differ. Note that the second trained model M2 may also be a model that receives only one of the fundamental frequency Y1 and the voicedness determination result Y2 as input, omitting the other.
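As an illustrative sketch (not the patented implementation, and with all variable names and shapes chosen here for illustration only), the conditioning described above can be pictured as concatenating, per unit period, the control data X with the fundamental frequency Y1 and the voicedness determination result Y2 taken from the first model's output:

```python
import numpy as np

def build_conditioning(x, y1, y2):
    """Assemble the second model's hypothetical conditioning input.

    x:  (T, D) control data, one row per unit period
    y1: (T,)   fundamental frequency in Hz (0.0 in unvoiced periods)
    y2: (T,)   voicedness determination result (1.0 = voiced, 0.0 = unvoiced)
    Returns a (T, D + 2) array; each row conditions one unit period.
    """
    return np.concatenate([x, y1[:, None], y2[:, None]], axis=1)

x = np.zeros((4, 3))                       # 4 unit periods, 3 control features
y1 = np.array([220.0, 220.0, 0.0, 0.0])    # F0 drops to 0 in unvoiced periods
y2 = np.array([1.0, 1.0, 0.0, 0.0])        # voicedness determination result
cond = build_conditioning(x, y1, y2)
print(cond.shape)  # (4, 5)
```

Dropping either the `y1` or the `y2` column corresponds to the variant noted above in which only one of the two auxiliary values is input to the second trained model.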
(2) In each of the embodiments described above, the auxiliary data Y derived from the processing result of the first trained model M1 is input to the second trained model M2 mainly in order to synchronize the output of the first trained model M1 with the output of the second trained model M2. However, the configuration in which the second trained model M2 uses the auxiliary data Y may be omitted, for example by including data for synchronizing the two models in the control data X. Alternatively, only one of the fundamental frequency Y1 of the harmonic component and the voicedness determination result Y2 may be input to the second trained model M2 together with the control data X.
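The overall synthesis described in this document — a harmonic component recovered from a frequency-spectrum time series, an inharmonic component generated directly in the time domain, and the two summed, with inharmonic generation gated by voicedness — can be sketched as follows. This is a simplified, hypothetical rendering (no analysis window, a stand-in for the second trained model), not the patented implementation:

```python
import numpy as np

def synthesize(harmonic_spectra, inharmonic_model, voiced, hop=256):
    """Toy combination of a harmonic and an inharmonic component.

    harmonic_spectra: (T, F) complex spectrum per unit period (first model's output)
    inharmonic_model: callable n -> n time-domain samples (stand-in for the
                      second trained model)
    voiced:           (T,) booleans; inharmonic samples are generated only in
                      unvoiced periods and generation stops in voiced periods
    """
    n_frames = harmonic_spectra.shape[0]
    out = np.zeros(n_frames * hop)
    # Harmonic part: inverse FFT of each spectrum frame, overlap-added
    # (simplified: rectangular window, hop-sized segments).
    for t in range(n_frames):
        frame = np.fft.irfft(harmonic_spectra[t])
        n = min(len(frame), hop)
        out[t * hop : t * hop + n] += frame[:n]
    # Inharmonic part: generated only where the synthesized voice is unvoiced.
    for t in range(n_frames):
        if not voiced[t]:
            out[t * hop : (t + 1) * hop] += inharmonic_model(hop)
    return out

spectra = np.zeros((4, 129), dtype=complex)     # 4 silent harmonic frames
voiced = np.array([True, True, False, False])
noise = lambda n: np.full(n, 0.5)               # stand-in inharmonic generator
out = synthesize(spectra, noise, voiced)
print(out.shape)  # (1024,)
```

The gating loop mirrors the claim language below: the acoustic signal of the inharmonic component is produced in unvoiced sections and its generation is stopped in voiced sections.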
Claims (10)
A time series of a frequency spectrum of a harmonic component corresponding to control data of a synthesized voice is generated by a first trained model,
an acoustic signal in the time domain representing a waveform of an inharmonic component corresponding to the control data is generated by a second trained model,
an audio signal including the harmonic component and the inharmonic component is generated from the time series of the frequency spectrum and the acoustic signal, and
in generating the acoustic signal, the acoustic signal is generated in unvoiced sections where the synthesized voice is an unvoiced sound, and generation of the acoustic signal is stopped in voiced sections where the synthesized voice is a voiced sound:
A speech synthesis method realized by a computer.
The speech synthesis method according to claim 1.
A time series of a frequency spectrum of a harmonic component corresponding to control data of a synthesized voice is generated by a first trained model,
an acoustic signal in the time domain representing a waveform of an inharmonic component corresponding to the control data is generated by a second trained model,
an audio signal including the harmonic component and the inharmonic component is generated from the time series of the frequency spectrum and the acoustic signal, and
the number of bits per sample of the acoustic signal in unvoiced sections where the synthesized voice is an unvoiced sound exceeds the number of bits per sample of the acoustic signal in voiced sections where the synthesized voice is a voiced sound:
A speech synthesis method realized by a computer.
The second trained model is a neural network that outputs samples of the inharmonic component in the time domain every second unit period, the second unit period being shorter than the first unit period:
The speech synthesis method according to any one of claims 1 to 3.
a second trained model that generates an acoustic signal in the time domain representing a waveform of an inharmonic component corresponding to the control data; and
a synthesis processing unit that generates an audio signal including the harmonic component and the inharmonic component from the time series of the frequency spectrum and the acoustic signal,
wherein the second trained model generates the acoustic signal in unvoiced sections where the synthesized voice is an unvoiced sound, and stops generating the acoustic signal in voiced sections where the synthesized voice is a voiced sound:
A speech synthesis system.
The speech synthesis system according to claim 5.
a second trained model that generates an acoustic signal in the time domain representing a waveform of an inharmonic component corresponding to the control data; and
a synthesis processing unit that generates an audio signal including the harmonic component and the inharmonic component from the time series of the frequency spectrum and the acoustic signal,
wherein the number of bits per sample of the acoustic signal in unvoiced sections where the synthesized voice is an unvoiced sound exceeds the number of bits per sample of the acoustic signal in voiced sections where the synthesized voice is a voiced sound:
A speech synthesis system.
The second trained model is a neural network that outputs samples of the inharmonic component in the time domain every second unit period, the second unit period being shorter than the first unit period:
The speech synthesis system according to any one of claims 5 to 7.
A program that causes a computer to execute:
a process of generating, by a second trained model, an acoustic signal in the time domain representing a waveform of an inharmonic component corresponding to the control data; and
a process of generating an audio signal including the harmonic component and the inharmonic component from the time series of the frequency spectrum and the acoustic signal,
wherein, in the process of generating the acoustic signal, the acoustic signal is generated in unvoiced sections where the synthesized voice is an unvoiced sound, and generation of the acoustic signal is stopped in voiced sections where the synthesized voice is a voiced sound.
A program that causes a computer to execute:
a process of generating, by a second trained model, an acoustic signal in the time domain representing a waveform of an inharmonic component corresponding to the control data; and
a process of generating an audio signal including the harmonic component and the inharmonic component from the time series of the frequency spectrum and the acoustic signal,
wherein the number of bits per sample of the acoustic signal in unvoiced sections where the synthesized voice is an unvoiced sound exceeds the number of bits per sample of the acoustic signal in voiced sections where the synthesized voice is a voiced sound.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2020114265A JP6977818B2 (en) | 2017-11-29 | 2020-07-01 | Speech synthesis methods, speech synthesis systems and programs |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2017229041A JP6733644B2 (en) | 2017-11-29 | 2017-11-29 | Speech synthesis method, speech synthesis system and program |
JP2020114265A JP6977818B2 (en) | 2017-11-29 | 2020-07-01 | Speech synthesis methods, speech synthesis systems and programs |
Related Parent Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
JP2017229041A Division JP6733644B2 (en) | 2017-11-29 | 2017-11-29 | Speech synthesis method, speech synthesis system and program |
Publications (3)
Publication Number | Publication Date |
---|---|
JP2020166299A (en) | 2020-10-08 |
JP2020166299A5 (en) | 2021-01-07 |
JP6977818B2 (en) | 2021-12-08 |
Family
ID=72666035
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
JP2020114265A Active JP6977818B2 (en) | 2017-11-29 | 2020-07-01 | Speech synthesis methods, speech synthesis systems and programs |
Country Status (1)
Country | Link |
---|---|
JP (1) | JP6977818B2 (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110720122B (en) * | 2017-06-28 | 2023-06-27 | 雅马哈株式会社 | Sound generating device and method |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO1995030193A1 (en) * | 1994-04-28 | 1995-11-09 | Motorola Inc. | A method and apparatus for converting text into audible signals using a neural network |
JP4067762B2 (en) * | 2000-12-28 | 2008-03-26 | ヤマハ株式会社 | Singing synthesis device |
JP2002268660A (en) * | 2001-03-13 | 2002-09-20 | Japan Science & Technology Corp | Method and device for text voice synthesis |
JP5102939B2 (en) * | 2005-04-08 | 2012-12-19 | ヤマハ株式会社 | Speech synthesis apparatus and speech synthesis program |
US20120316881A1 (en) * | 2010-03-25 | 2012-12-13 | Nec Corporation | Speech synthesizer, speech synthesis method, and speech synthesis program |
- 2020-07-01: JP application JP2020114265A filed; granted as patent JP6977818B2 (active)