JP2020166299A5 - Speech synthesis methods, speech synthesis systems and programs

Speech synthesis methods, speech synthesis systems and programs

Info

Publication number
JP2020166299A5
JP2020166299A5
Authority
JP
Japan
Prior art keywords
acoustic signal
harmonic component
trained model
frequency spectrum
unvoiced
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
JP2020114265A
Other languages
Japanese (ja)
Other versions
JP2020166299A (en)
JP6977818B2 (en)
Filing date
Publication date
Priority claimed from JP2017229041A
Application filed
Priority to JP2020114265A
Priority claimed from JP2020114265A
Publication of JP2020166299A
Publication of JP2020166299A5
Application granted
Publication of JP6977818B2
Legal status: Active
Anticipated expiration

Description

In the first embodiment, since the fundamental frequency Y1 of the harmonic spectrum H is input to the second trained model M2, the samples of the inharmonic signal A2 vary in conjunction with the fundamental frequency Y1. For example, even when the control data X (and the voicing determination result Y2) is the same, inharmonic signals A2 with different acoustic characteristics may be generated if the fundamental frequency Y1 differs. Likewise, since the voicing determination result Y2 is input to the second trained model M2, the samples of the inharmonic signal A2 vary in conjunction with the voicing determination result Y2. For example, even when the control data X (and the fundamental frequency Y1) is the same, inharmonic signals A2 with different acoustic characteristics may be generated if the voicing determination result Y2 differs. Note that the second trained model M2 may be a model that receives only one of the fundamental frequency Y1 and the voicing determination result Y2 as input, without receiving the other.
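To make the data flow concrete, here is a minimal Python sketch (not taken from the patent; the model body is a placeholder) of how the second trained model M2 can be conditioned on the control data X, the fundamental frequency Y1 of the harmonic spectrum H, and the voicing determination result Y2:

```python
import numpy as np

def m2_step(x_frame, f0, voiced_flag, state=None):
    """Hypothetical stand-in for one step of the second trained model M2.

    x_frame     : control-data vector X for the current frame
    f0          : fundamental frequency Y1 taken from the harmonic spectrum H
    voiced_flag : voicing determination result Y2 (1.0 = voiced, 0.0 = unvoiced)
    """
    # Conditioning: X, Y1 and Y2 are concatenated into a single input vector,
    # so the generated inharmonic sample changes whenever Y1 or Y2 changes,
    # even if X stays the same.
    inp = np.concatenate([np.asarray(x_frame, dtype=float), [f0, voiced_flag]])
    sample = float(np.tanh(inp.sum()))  # placeholder for the trained network
    return sample, state
```

A variant that uses only one of Y1 and Y2, as mentioned at the end of the paragraph above, would simply drop the unused element from the concatenated input.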

(2) In each of the embodiments described above, the auxiliary data Y derived from the processing result of the first trained model M1 is input to the second trained model M2 mainly in order to synchronize the output of the first trained model M1 with the output of the second trained model M2. However, the configuration in which the second trained model M2 uses the auxiliary data Y may be omitted, for example by including data for synchronizing the two models in the control data X. Alternatively, only one of the fundamental frequency Y1 of the harmonic component and the voicing determination result Y2 may be input to the second trained model M2 together with the control data X.
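As a hypothetical sketch of this variant (again not from the patent), the synchronization information can be folded into the control data X itself, so that M2 needs no auxiliary data Y:

```python
import numpy as np

def build_control_frame(base_features, frame_index, total_frames):
    # A normalized frame position carried inside X acts as the synchronization
    # data that keeps the two models aligned without auxiliary data Y.
    position = frame_index / max(total_frames - 1, 1)
    return np.concatenate([np.asarray(base_features, dtype=float), [position]])
```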

Claims (10)

1. A speech synthesis method realized by a computer, the method comprising:
generating, with a first trained model, a time series of frequency spectra of a harmonic component according to control data of a synthesized voice;
generating, with a second trained model, a time-domain acoustic signal representing a waveform of an inharmonic component according to the control data; and
generating a voice signal including the harmonic component and the inharmonic component from the time series of frequency spectra and the acoustic signal,
wherein, in generating the acoustic signal, the acoustic signal is generated in unvoiced sections in which the synthesized voice is unvoiced, and generation of the acoustic signal is stopped in voiced sections in which the synthesized voice is voiced.
2. The speech synthesis method according to claim 1, wherein the time series of frequency spectra is generated in both the unvoiced sections and the voiced sections.
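As an informal illustration of the method of claims 1 and 2 (not part of the claims; the two trained models and the voiced/unvoiced segmentation are represented by placeholder callables, and the spectrum-to-waveform step is a crude stand-in for whatever resynthesis an implementation actually uses), a minimal Python sketch might look like this:

```python
import numpy as np

def spectra_to_waveform(spectra, frame_len):
    # Crude placeholder: a real system would use, e.g., overlap-add inverse STFT.
    return np.concatenate([np.fft.irfft(s, n=frame_len) for s in spectra])

def synthesize(control_frames, voiced_flags, m1, m2, frame_len=256):
    """Sketch: m1 maps a control-data frame to a harmonic frequency spectrum,
    m2 maps a control-data frame to frame_len time-domain inharmonic samples."""
    # First trained model: spectrum time series, generated in all sections (claim 2).
    spectra = [m1(x) for x in control_frames]
    # Second trained model: inharmonic signal, generated only in unvoiced sections (claim 1).
    inharmonic = np.zeros(len(control_frames) * frame_len)
    for i, (x, voiced) in enumerate(zip(control_frames, voiced_flags)):
        if not voiced:
            inharmonic[i * frame_len:(i + 1) * frame_len] = m2(x)
        # In voiced sections, generation of the acoustic signal is stopped.
    # Voice signal containing both components.
    return spectra_to_waveform(spectra, frame_len) + inharmonic
```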
3. A speech synthesis method realized by a computer, the method comprising:
generating, with a first trained model, a time series of frequency spectra of a harmonic component according to control data of a synthesized voice;
generating, with a second trained model, a time-domain acoustic signal representing a waveform of an inharmonic component according to the control data; and
generating a voice signal including the harmonic component and the inharmonic component from the time series of frequency spectra and the acoustic signal,
wherein the number of bits per sample of the acoustic signal in unvoiced sections in which the synthesized voice is unvoiced exceeds the number of bits per sample of the acoustic signal in voiced sections in which the synthesized voice is voiced.
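A minimal sketch of the bit-allocation idea in claim 3 (illustrative only; the 16-bit and 8-bit figures are arbitrary example values, not numbers taken from the patent):

```python
import numpy as np

def quantize(samples, bits):
    # Uniform quantization of samples in [-1, 1] to the given bit depth.
    levels = 2 ** (bits - 1) - 1
    return np.round(np.clip(samples, -1.0, 1.0) * levels) / levels

def quantize_inharmonic_frame(samples, voiced, unvoiced_bits=16, voiced_bits=8):
    # The inharmonic component matters most in unvoiced sections, so its samples
    # get more bits there than in voiced sections (example values only).
    return quantize(samples, unvoiced_bits if not voiced else voiced_bits)
```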
4. The speech synthesis method according to any one of claims 1 to 3, wherein the first trained model is a neural network that outputs a frequency spectrum of the harmonic component every first unit period, and the second trained model is a neural network that outputs samples of the inharmonic component in the time domain every second unit period shorter than the first unit period.
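The two output rates in claim 4 can be pictured with the following sketch (the 5 ms frame period and 48 kHz sampling rate are assumed example values; both models are placeholder callables):

```python
FRAME_PERIOD_S = 0.005   # first unit period: one spectrum per 5 ms (assumed value)
SAMPLE_RATE_HZ = 48_000  # second unit period: one sample per 1/48000 s (assumed value)
SAMPLES_PER_FRAME = int(FRAME_PERIOD_S * SAMPLE_RATE_HZ)  # 240 samples per frame

def run_models(control_frames, m1_step, m2_step):
    # m1_step emits one harmonic spectrum per frame; m2_step emits one
    # time-domain inharmonic sample per (much shorter) sampling period.
    spectra, samples = [], []
    for x in control_frames:
        spectra.append(m1_step(x))
        for _ in range(SAMPLES_PER_FRAME):
            samples.append(m2_step(x))
    return spectra, samples
```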
5. A speech synthesis system comprising:
a first trained model that generates a time series of frequency spectra of a harmonic component according to control data of a synthesized voice;
a second trained model that generates a time-domain acoustic signal representing a waveform of an inharmonic component according to the control data; and
a synthesis processing unit that generates a voice signal including the harmonic component and the inharmonic component from the time series of frequency spectra and the acoustic signal,
wherein the second trained model generates the acoustic signal in unvoiced sections in which the synthesized voice is unvoiced, and stops generating the acoustic signal in voiced sections in which the synthesized voice is voiced.
6. The speech synthesis system according to claim 5, wherein the first trained model generates the time series of frequency spectra in both the unvoiced sections and the voiced sections.
7. A speech synthesis system comprising:
a first trained model that generates a time series of frequency spectra of a harmonic component according to control data of a synthesized voice;
a second trained model that generates a time-domain acoustic signal representing a waveform of an inharmonic component according to the control data; and
a synthesis processing unit that generates a voice signal including the harmonic component and the inharmonic component from the time series of frequency spectra and the acoustic signal,
wherein the number of bits per sample of the acoustic signal in unvoiced sections in which the synthesized voice is unvoiced exceeds the number of bits per sample of the acoustic signal in voiced sections in which the synthesized voice is voiced.
8. The speech synthesis system according to any one of claims 5 to 7, wherein the first trained model is a neural network that outputs a frequency spectrum of the harmonic component every first unit period, and the second trained model is a neural network that outputs samples of the inharmonic component in the time domain every second unit period shorter than the first unit period.
9. A program that causes a computer to execute:
a process of generating, with a first trained model, a time series of frequency spectra of a harmonic component according to control data of a synthesized voice;
a process of generating, with a second trained model, a time-domain acoustic signal representing a waveform of an inharmonic component according to the control data; and
a process of generating a voice signal including the harmonic component and the inharmonic component from the time series of frequency spectra and the acoustic signal,
wherein, in the process of generating the acoustic signal, the acoustic signal is generated in unvoiced sections in which the synthesized voice is unvoiced, and generation of the acoustic signal is stopped in voiced sections in which the synthesized voice is voiced.
10. A program that causes a computer to execute:
a process of generating, with a first trained model, a time series of frequency spectra of a harmonic component according to control data of a synthesized voice;
a process of generating, with a second trained model, a time-domain acoustic signal representing a waveform of an inharmonic component according to the control data; and
a process of generating a voice signal including the harmonic component and the inharmonic component from the time series of frequency spectra and the acoustic signal,
wherein the number of bits per sample of the acoustic signal in unvoiced sections in which the synthesized voice is unvoiced exceeds the number of bits per sample of the acoustic signal in voiced sections in which the synthesized voice is voiced.
JP2020114265A 2017-11-29 2020-07-01 Speech synthesis methods, speech synthesis systems and programs Active JP6977818B2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2020114265A JP6977818B2 (en) 2017-11-29 2020-07-01 Speech synthesis methods, speech synthesis systems and programs

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2017229041A JP6733644B2 (en) 2017-11-29 2017-11-29 Speech synthesis method, speech synthesis system and program
JP2020114265A JP6977818B2 (en) 2017-11-29 2020-07-01 Speech synthesis methods, speech synthesis systems and programs

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
JP2017229041A Division JP6733644B2 (en) 2017-11-29 2017-11-29 Speech synthesis method, speech synthesis system and program

Publications (3)

Publication Number Publication Date
JP2020166299A (en) 2020-10-08
JP2020166299A5 (en) 2021-01-07
JP6977818B2 (en) 2021-12-08

Family

ID=72666035

Family Applications (1)

Application Number Title Priority Date Filing Date
JP2020114265A Active JP6977818B2 (en) 2017-11-29 2020-07-01 Speech synthesis methods, speech synthesis systems and programs

Country Status (1)

Country Link
JP (1) JP6977818B2 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110720122B (en) * 2017-06-28 2023-06-27 Yamaha Corporation Sound generating device and method

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO1995030193A1 (en) * 1994-04-28 1995-11-09 Motorola Inc. A method and apparatus for converting text into audible signals using a neural network
JP4067762B2 (en) * 2000-12-28 2008-03-26 ヤマハ株式会社 Singing synthesis device
JP2002268660A (en) * 2001-03-13 2002-09-20 Japan Science & Technology Corp Method and device for text voice synthesis
JP5102939B2 (en) * 2005-04-08 2012-12-19 ヤマハ株式会社 Speech synthesis apparatus and speech synthesis program
US20120316881A1 (en) * 2010-03-25 2012-12-13 Nec Corporation Speech synthesizer, speech synthesis method, and speech synthesis program
