WO2021199446A1 - Sound signal conversion model learning device, sound signal conversion device, sound signal conversion model learning method, and program - Google Patents

Sound signal conversion model learning device, sound signal conversion device, sound signal conversion model learning method, and program

Info

Publication number
WO2021199446A1
Authority
WO
WIPO (PCT)
Prior art keywords
signal
voice
unit
conversion
input
Prior art date
Application number
PCT/JP2020/015389
Other languages
French (fr)
Japanese (ja)
Inventor
田中 宏
弘和 亀岡
卓弘 金子
伸克 北条
Original Assignee
日本電信電話株式会社 (Nippon Telegraph and Telephone Corporation)
Application filed by 日本電信電話株式会社 (Nippon Telegraph and Telephone Corporation)
Priority to PCT/JP2020/015389 priority Critical patent/WO2021199446A1/en
Priority to JP2022511494A priority patent/JP7368779B2/en
Publication of WO2021199446A1 publication Critical patent/WO2021199446A1/en

Links

Images

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00: Speech synthesis; Text to speech systems
    • G10L 13/06: Elementary speech units used in speech synthesisers; Concatenation rules
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 21/00: Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/003: Changing voice quality, e.g. pitch or formants
    • G10L 21/007: Changing voice quality, e.g. pitch or formants characterised by the process used

Definitions

  • the present invention relates to a voice signal conversion model learning device, a voice signal conversion device, a voice signal conversion model learning method, and a program.
  • Techniques for generating desired speech from input information, such as speech generation by the parametric vocoder method (see Non-Patent Document 1) and statistical voice quality conversion (see Non-Patent Document 2), are being researched as techniques that have the potential to expand human communication ability and physical function.
  • The parametric vocoder voice generation technology has been widely studied for applications such as assistance for persons with physical disabilities (see Non-Patent Documents 3 and 4), language education support (see Non-Patent Documents 5 and 6), and amusement (see Non-Patent Document 7), because of the ease of system construction and its high versatility.
  • GAN: Generative Adversarial Networks
  • SEGAN: Speech Enhancement Generative Adversarial Network
  • an object of the present invention is to provide a technique for generating a voice closer to the voice emitted by an animal.
  • One aspect of the present invention is a voice signal conversion model learning device including a learning unit that obtains, by a machine learning method, a trained model that converts an input signal, which is an input audio signal, into an audio signal whose degree of natural signal (the degree of similarity to a natural signal actually emitted by an animal) is higher than that of the input signal. The machine learning method uses: a first generator that, by executing a forward conversion process that raises the degree of natural signal of an input voice signal, outputs a forward conversion signal, which is a signal having a higher degree of natural signal than the input; a first identification unit that identifies whether an input signal is a forward conversion signal or a natural signal; a second generator that, by executing an inverse conversion process that lowers the degree of natural signal of an input audio signal, outputs an inverse conversion signal having a lower degree of natural signal than the audio signal; and a second identification unit that identifies whether an input signal is a pre-synthesized signal, which is a synthesized signal prepared in advance, or an inverse conversion signal. The first generator and the second generator learn based on the identification results of the first identification unit and the second identification unit.
  • FIG. 3 is a flowchart showing an example of the flow of the forward conversion signal identification process in the embodiment.
  • FIG. 4 is a first flowchart showing an example of the flow of the forward conversion learning process in the embodiment.
  • FIG. 5 is a second flowchart showing an example of the flow of the forward conversion learning process in the embodiment.
  • FIG. 6 is a flowchart showing an example of the flow of the forward conversion signal identification learning process in the embodiment.
  • FIG. 8 is a flowchart showing an example of the flow of the inverse conversion learning process in the embodiment.
  • FIG. 10 is a flowchart showing an example of the flow of processing executed by the voice signal conversion model learning device 1 in the embodiment.
  • FIG. 15 is a flowchart showing an example of the flow of processing executed by the voice signal conversion device 2 in the embodiment.
  • FIG. 17 is a second diagram showing an example of the experimental results of the first experiment.
  • FIG. 1 is an explanatory diagram illustrating an outline of the audio signal generation system 100 of the embodiment.
  • The audio signal generation system 100 improves the degree of natural signal of an unnaturally synthesized signal, that is, a synthesized audio signal (hereinafter referred to as a “synthetic signal”) whose degree of similarity to a natural signal (hereinafter referred to as the “degree of natural signal”) is low (hereinafter referred to as an “unnaturally synthesized signal”).
  • a natural signal is a voice actually emitted by a human being.
  • the voice signal generation system 100 converts the input unnaturally synthesized signal into a naturally synthesized signal which is a composite signal having a higher degree of natural signal than the input unnaturally synthesized signal. Converting an unnaturally synthesized signal into a naturally synthesized signal is equivalent to generating a naturally synthesized signal based on the unnaturally synthesized signal.
  • the voice signal generation system 100 includes a voice signal conversion model learning device 1 and a voice signal conversion device 2.
  • the voice signal conversion model learning device 1 obtains a trained model (hereinafter referred to as “voice signal conversion model”) that generates a naturally synthesized signal based on an unnaturally synthesized signal by machine learning.
  • performing machine learning is called learning.
  • performing machine learning means appropriately adjusting the values of parameters in the machine learning model.
  • learning to be A means that the value of the parameter in the machine learning model is adjusted to satisfy A.
  • A represents a predetermined condition.
  • The voice signal conversion model learning device 1 receives a natural signal and a synthesized signal as inputs and learns the voice signal conversion model by cycle-consistent adversarial learning (CycleGAN: Cycle Generative Adversarial Networks), using a voice waveform classifier and a voice feature amount classifier as the discriminators.
  • the voice waveform classifier is a discriminator that discriminates whether or not the voice signal is a natural signal based on the waveform of the voice signal used for learning (hereinafter referred to as “voice waveform”).
  • the voice feature amount classifier is a classifier that acquires information satisfying a predetermined condition from a voice signal used for learning as a voice feature amount and discriminates whether or not the voice signal is a natural signal based on the acquired voice feature amount.
  • the CycleGAN using the voice waveform classifier and the voice feature amount classifier will be referred to as a convolutional CycleGAN.
  • the voice feature amount is, for example, a phase spectrum of a voice signal.
  • the natural signal and the combined signal input to the voice signal conversion model learning device 1 may be stored in advance in the storage unit included in the voice signal conversion model learning device 1.
  • The convolutional CycleGAN is a neural network that learns the voice waveform and the voice feature amount of the voice signal used for learning with different classifiers.
  • A neural network that learns each feature of the data used for learning with a different classifier for each feature is here called a convolutional neural network. Therefore, the convolutional CycleGAN is both a neural network obtained by modifying CycleGAN and a neural network obtained by modifying the convolutional neural network.
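  • As an illustration of this two-classifier structure, the following is a minimal PyTorch sketch of one waveform discriminator and one feature-amount discriminator. The patent text does not disclose concrete network architectures, so the layer types, kernel sizes, and the class names WaveformDiscriminator and FeatureDiscriminator are assumptions made only for illustration.

```python
import torch
import torch.nn as nn

class WaveformDiscriminator(nn.Module):
    """Judges real/fake directly from the raw waveform (voice waveform classifier)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, 32, kernel_size=31, stride=4, padding=15), nn.LeakyReLU(0.2),
            nn.Conv1d(32, 64, kernel_size=31, stride=4, padding=15), nn.LeakyReLU(0.2),
            nn.Conv1d(64, 1, kernel_size=3, padding=1),
        )

    def forward(self, wav):            # wav: (batch, 1, samples)
        return self.net(wav)           # patch-wise real/fake scores

class FeatureDiscriminator(nn.Module):
    """Judges real/fake from a derived voice feature amount (e.g. a mel spectrogram)."""
    def __init__(self, n_feature_bins=80):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(n_feature_bins, 128, kernel_size=5, padding=2), nn.LeakyReLU(0.2),
            nn.Conv1d(128, 1, kernel_size=3, padding=1),
        )

    def forward(self, feat):           # feat: (batch, n_feature_bins, frames)
        return self.net(feat)
```

  • In the terminology of the embodiment, such a pair corresponds to the voice waveform identification unit 121 (or 161) and the voice feature amount identification unit 122 (or 162), whose outputs are then combined by the integrated identification unit.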
  • FIG. 2 is an explanatory diagram illustrating an outline of the voice signal conversion model learning device 1 according to the embodiment.
  • the voice signal conversion model learning device 1 includes a first generation unit 110, a first identification unit 120, a first input determination unit 130, a second generation unit 150, a second identification unit 160, and a second input determination unit 170.
  • the first generation unit 110, the first identification unit 120, the second generation unit 150, and the second identification unit 160 are functional units for learning.
  • The first generation unit 110, the first identification unit 120, the first input determination unit 130, the second generation unit 150, the second identification unit 160, and the second input determination unit 170 operate in cooperation to execute CycleGAN.
  • the first generation unit 110 executes forward conversion processing on the input audio signal.
  • the forward conversion process is a process for improving the degree of natural signal of the input audio signal.
  • the first generation unit 110 outputs the audio signal after the forward conversion process as a forward conversion signal.
  • The first generation unit 110 learns based on the identification result of the first identification unit 120, which will be described later in detail.
  • The first generation unit 110 learns so as to further improve the degree of natural signal achieved by the forward conversion process.
  • A specific example of such learning is a process of appropriately adjusting the parameter values so as to reduce the value of a loss function that takes a larger value the lower the probability that the identification result of the first identification unit 120 is incorrect.
  • the first identification unit 120 identifies whether the input audio signal is a natural signal or a forward conversion signal.
  • the first identification unit 120 learns based on the identification result.
  • the first identification unit 120 includes a voice waveform identification unit 121, a voice feature amount identification unit 122, an integrated identification unit 123, and a first determination unit 140.
  • the audio signal input to the first identification unit 120 is input to the audio waveform identification unit 121.
  • the audio signal input to the audio waveform identification unit 121 is an audio signal determined by the first input determination unit 130, which will be described in detail later, and is a natural signal or a forward conversion signal.
  • the voice waveform identification unit 121 identifies whether the voice signal input to the first identification unit 120 is a natural signal or a forward conversion signal based on the voice waveform of the input voice signal.
  • the voice waveform identification unit 121 is an example of a voice waveform classifier.
  • the voice signal input to the first identification unit 120 is input to the voice feature amount identification unit 122. That is, the voice signal input to the voice feature amount identification unit 122 is the same as the voice signal input to the voice waveform identification unit 121.
  • the voice feature amount identification unit 122 acquires the voice feature amount based on the input voice signal.
  • the voice feature amount identification unit 122 identifies whether the voice signal input to the first identification unit 120 is a natural signal or a forward conversion signal based on the acquired voice feature amount.
  • the voice feature amount discriminating unit 122 is an example of a voice feature amount discriminator.
  • The integrated identification unit 123 identifies, based on the identification result of the voice waveform identification unit 121 and the identification result of the voice feature amount identification unit 122, whether the voice signal input to the first identification unit 120 is a natural signal or a forward conversion signal.
  • the identification result of the integrated identification unit 123 is the identification result of the first identification unit 120.
  • the identification result of the integrated identification unit 123 is output to the first determination unit 140.
  • the first determination unit 140 determines whether or not the identification result of the integrated identification unit 123 is correct based on the determination result of the first input determination unit 130.
  • the voice waveform identification unit 121, the voice feature amount identification unit 122, and the integrated identification unit 123 learn based on the determination result of the first determination unit 140.
  • The voice waveform identification unit 121, the voice feature amount identification unit 122, and the integrated identification unit 123 learn so as to further improve the accuracy of identification.
  • A specific example of such learning is a process of appropriately adjusting the parameter values so as to increase the value of a loss function that takes a larger value the lower the probability that the identification result of the integrated identification unit 123 is incorrect.
  • the first input determination unit 130 determines whether the audio signal input to the first identification unit 120 is a forward conversion signal or a natural signal.
  • When the first input determination unit 130 determines a natural signal as the audio signal to be input to the first identification unit 120, one natural signal belonging to the natural signal group shown in the central column of FIG. 2 is input to the first identification unit 120.
  • the natural signal group is a set of natural signals prepared in advance for learning.
  • the composite signal group shown in the central column of FIG. 2 is a set of synthetic signals prepared in advance for learning.
  • the composite signal belonging to the composite signal group is referred to as a pre-synthesized signal.
  • When the first input determination unit 130 determines a forward conversion signal as the audio signal to be input, the forward conversion signal is input to the first identification unit 120.
  • the second generation unit 150 executes an inverse transformation process on the input audio signal.
  • When a forward conversion signal is input as the audio signal, the inverse conversion process is executed on the acquired forward conversion signal.
  • When a natural signal is input as the audio signal, the inverse conversion process is executed on the acquired natural signal.
  • the inverse transformation process is a process of reducing the degree of natural signal of the input audio signal.
  • the second generation unit 150 outputs the audio signal after the inverse transformation processing as the inverse transformation signal.
  • the second generation unit 150 learns based on the identification result of the second identification unit 160, which will be described in detail later.
  • The second generation unit 150 learns so as to further reduce the degree of natural signal achieved by the inverse conversion process.
  • A specific example of such learning is a process of appropriately adjusting the parameter values so as to reduce the value of a loss function that takes a larger value the lower the probability that the identification result of the second identification unit 160 is incorrect.
  • The second identification unit 160 identifies whether the input audio signal is an inverse conversion signal or a pre-synthesized signal. The second identification unit 160 learns based on its own identification result.
  • the second identification unit 160 includes a voice waveform identification unit 161, a voice feature amount identification unit 162, an integrated identification unit 163, and a second determination unit 180.
  • The voice waveform identification unit 161 identifies, based on the voice waveform of the voice signal input to the second identification unit 160, whether the voice signal input to the second identification unit 160 is an inverse conversion signal or a pre-synthesized signal.
  • the voice waveform identification unit 161 is an example of a voice waveform classifier.
  • The voice feature amount identification unit 162 identifies, based on the voice feature amount of the voice signal input to the second identification unit 160, whether the voice signal input to the second identification unit 160 is an inverse conversion signal or a pre-synthesized signal.
  • the voice feature amount identification unit 162 is an example of a voice feature amount classifier.
  • The integrated identification unit 163 identifies, based on the identification result of the voice waveform identification unit 161 and the identification result of the voice feature amount identification unit 162, whether the voice signal input to the second identification unit 160 is an inverse conversion signal or a pre-synthesized signal.
  • the identification result of the integrated identification unit 163 is the identification result of the second identification unit 160.
  • the identification result of the integrated identification unit 163 is output to the second determination unit 180.
  • the second determination unit 180 determines whether or not the identification result of the second identification unit 160 is correct based on the determination result of the second input determination unit 170.
  • the voice waveform identification unit 161, the voice feature amount identification unit 162, and the integrated identification unit 163 learn based on the determination result of the second determination unit 180.
  • The voice waveform identification unit 161, the voice feature amount identification unit 162, and the integrated identification unit 163 learn so as to further improve the accuracy of identification.
  • A specific example of such learning is a process of appropriately adjusting the parameter values so as to increase the value of a loss function that takes a larger value the lower the probability that the identification result of the integrated identification unit 163 is incorrect.
  • The second input determination unit 170 determines whether the audio signal input to the second generation unit 150 is a forward conversion signal or a natural signal. Further, the second input determination unit 170 also determines whether the audio signal input to the second identification unit 160 is an inverse conversion signal or a pre-synthesized signal.
  • The first generation unit 110, the first identification unit 120, the second generation unit 150, and the second identification unit 160 operate in cooperation with each other and learn so as to reduce the objective function L represented by the following equation. That is, the objective function L is the loss function used when the first generation unit 110, the first identification unit 120, the second generation unit 150, and the second identification unit 160 learn.
  • H1 represents self-identical loss. More specifically, H1 is represented by the following equation (18).
  • D_xwave represents a discriminator that identifies what kind of signal the audio signal x is based on the waveform of the audio signal x.
  • D_ywave represents a discriminator that identifies what kind of signal the voice signal y is based on the waveform of the voice signal y.
  • D_xmsp represents a discriminator that identifies what kind of signal the audio signal x is based on the voice feature amount of the audio signal x.
  • D_ymsp represents a discriminator that identifies what kind of signal the voice signal y is based on the voice feature amount of the voice signal y.
  • the classifier is represented by the symbol D.
  • D_msp(A) is a function that outputs the probability that A is the target voice feature amount.
  • log(1 - D_msp(A)) is a term that takes a larger value as the probability that A is the target voice feature amount becomes smaller.
  • F(A) denotes the process of convolving A with a fast Fourier transform matrix windowed by a Hanning window and then applying a mel filter to the absolute value of the result of the convolution.
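  • A minimal NumPy sketch consistent with this description of F(A) (a Hanning-windowed short-time Fourier transform followed by a mel filter applied to the magnitudes) is shown below; the frame length, hop size, and number of mel bands are assumptions, and librosa is used only to build the mel filter matrix.

```python
import numpy as np
import librosa

def mel_feature(wav, sr=22050, n_fft=1024, hop=256, n_mels=80):
    """F(A): Hanning-windowed STFT of the waveform, then a mel filter bank on the magnitudes."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(wav) - n_fft) // hop
    frames = np.stack([wav[i * hop: i * hop + n_fft] * window for i in range(n_frames)])
    spectrum = np.abs(np.fft.rfft(frames, axis=1))             # magnitude spectrum per frame
    mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)
    return spectrum @ mel_fb.T                                  # shape: (frames, n_mels)
```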
  • λ_cyc, which multiplies the L_cyc term in the objective function, represents a weight.
  • λ_cyc is a hyperparameter in learning.
  • G_x→y is a mapping that converts the voice signal x into the voice signal y.
  • the audio signal y is an audio signal having a higher degree of natural signal than the audio signal x.
  • D_y represents an identification function that identifies whether the input audio signal y is a natural signal or a synthesized signal.
  • G_y→x is a mapping that converts the voice signal y into the voice signal x.
  • D_x represents an identification function that identifies whether the input audio signal x is a natural signal or a synthesized signal.
  • L_adv represents an objective function in adversarial learning. That is, L_adv represents an adversarial loss.
  • The adversarial loss is the value represented by the loss function in adversarial learning.
  • L_id represents the identity-mapping term. The identity-mapping term is included in the objective function L so that the mapping G_x→y does not change its input when the input to the mapping G_x→y is the audio signal y instead of the audio signal x.
  • The value of the identity-mapping term L_id represents the identity-mapping loss.
  • L1 represents a loss function in the adversarial learning executed by the first generation unit 110 and the first identification unit 120 in cooperation.
  • L2 represents a loss function in the adversarial learning executed by the second generation unit 150 and the second identification unit 160 in cooperation.
  • L3 is a function representing the cycle-consistency loss in CycleGAN. That is, L3 is a function indicating the degree to which the mapping G_x→y and the mapping G_y→x are in one-to-one correspondence in the CycleGAN executed in cooperation by the first generation unit 110, the first identification unit 120, the second generation unit 150, and the second identification unit 160.
  • The objective function L is expressed by a function representing the adversarial loss, a function representing the cycle-consistency loss, and a function representing the identity-mapping loss.
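  • The equations themselves are not reproduced in this text. The following is the standard CycleGAN-style objective consistent with the description above, with L3 playing the role of the L_cyc term and with D_x, D_y each standing for the combination of the waveform and feature-amount discriminators of the corresponding domain; the exact form used in the patent may differ.

```latex
\begin{aligned}
L_{1} &= \mathbb{E}_{y}\bigl[\log D_{y}(y)\bigr]
       + \mathbb{E}_{x}\bigl[\log\bigl(1 - D_{y}(G_{x\to y}(x))\bigr)\bigr],\\
L_{2} &= \mathbb{E}_{x}\bigl[\log D_{x}(x)\bigr]
       + \mathbb{E}_{y}\bigl[\log\bigl(1 - D_{x}(G_{y\to x}(y))\bigr)\bigr],\\
L_{3} &= \mathbb{E}_{x}\bigl[\lVert G_{y\to x}(G_{x\to y}(x)) - x\rVert_{1}\bigr]
       + \mathbb{E}_{y}\bigl[\lVert G_{x\to y}(G_{y\to x}(y)) - y\rVert_{1}\bigr],\\
L_{id} &= \mathbb{E}_{y}\bigl[\lVert G_{x\to y}(y) - y\rVert_{1}\bigr]
        + \mathbb{E}_{x}\bigl[\lVert G_{y\to x}(x) - x\rVert_{1}\bigr],\\
L &= L_{1} + L_{2} + \lambda_{cyc}\,L_{3} + L_{id}.
\end{aligned}
```

  • Under this form, the first generation unit 110 and the second generation unit 150 are trained to decrease L, and the identification units are trained to increase it, as described in the learning processes below.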
  • The forward conversion signal identification process is a process in which the first identification unit 120 identifies whether the input audio signal is a natural signal or a forward conversion signal.
  • The forward conversion learning process is a process in which the first generation unit 110 learns.
  • The forward conversion signal identification learning process is a process in which the first identification unit 120 learns.
  • The inverse conversion signal identification process is a process in which the second identification unit 160 identifies whether the input audio signal is an inverse conversion signal or a pre-synthesized signal.
  • The inverse conversion learning process is a process in which the second generation unit 150 learns.
  • The inverse conversion signal identification learning process is a process in which the second identification unit 160 learns.
  • FIG. 3 is a flowchart showing an example of the flow of the forward conversion signal identification process in the embodiment.
  • The voice waveform identification unit 121 acquires the voice signal input to the first identification unit 120 and identifies, based on the acquired voice waveform, whether the voice signal input to the first identification unit 120 is a natural signal or a forward conversion signal (step S101).
  • The voice feature amount identification unit 122 acquires the voice feature amount of the voice signal input to the first identification unit 120 and identifies, based on the acquired voice feature amount, whether the voice signal input to the first identification unit 120 is a natural signal or a forward conversion signal (step S102).
  • The integrated identification unit 123 identifies, according to a predetermined rule based on the identification result of the voice waveform identification unit 121 and the identification result of the voice feature amount identification unit 122, whether the voice signal input to the first identification unit 120 is a natural signal or a forward conversion signal (step S103). The identification result of the integrated identification unit 123 in step S103 is output to the first determination unit 140.
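  • The "predetermined rule" by which the integrated identification unit 123 combines the two identification results is not specified in this text; one simple possibility, shown purely as a hypothetical sketch, is to average the two estimated probabilities and apply a threshold.

```python
def integrated_decision(p_waveform: float, p_feature: float, threshold: float = 0.5) -> bool:
    """Hypothetical combination rule: average the probability that the input is a natural
    signal as estimated from the waveform and from the voice feature amount, and decide
    'natural signal' when the average exceeds the threshold."""
    return (p_waveform + p_feature) / 2.0 > threshold
```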
  • FIG. 4 is a first flowchart showing an example of the flow of the forward conversion learning process in the embodiment.
  • the first input determination unit 130 determines the audio signal input to the first identification unit 120 as a forward conversion signal (step S201).
  • the first generation unit 110 acquires one composite signal from the composite signal group and executes a forward conversion process on the acquired composite signal to generate a forward conversion signal (step S202).
  • the first generation unit 110 outputs the generated forward conversion signal to the first identification unit 120 (step S203).
  • the first identification unit 120 executes forward conversion signal identification processing on the acquired voice signal (step S204). That is, the processes of steps S101 to S103 are executed.
  • the first determination unit 140 determines whether or not the identification result of the first identification unit 120 is correct by comparing with the determination result of the first input determination unit 130 (step S205).
  • the first generation unit 110 learns to further improve the natural signal degree by the forward conversion process based on the determination result of the first determination unit 140 (step S206). Specifically, the first generation unit 110 learns to make the objective function L smaller.
  • FIG. 5 is a second flowchart showing an example of the flow of the forward conversion learning process in the embodiment.
  • the same processing as that shown in FIG. 3 or 4 will be designated by the same reference numerals as those in FIG. 3 or 4, and the description thereof will be omitted.
  • the second generation unit 150 outputs an inverse conversion signal (step S301).
  • The first generation unit 110 acquires the inverse conversion signal output by the second generation unit 150 and generates a forward conversion signal by executing the forward conversion process on the acquired inverse conversion signal (step S302).
  • the processes of steps S203 to S206 are executed.
  • FIG. 6 is a flowchart showing an example of the flow of the forward conversion signal identification learning process in the embodiment.
  • the same processing as that shown in FIGS. 3 to 5 will be designated by the same reference numerals as those in FIGS. 3 to 5, and the description thereof will be omitted.
  • the first input determination unit 130 determines whether the audio signal input to the first identification unit 120 is a natural signal or a forward conversion signal (step S401). Next, the processes of steps S204 and S205 are executed. Next, the first identification unit 120 learns to further improve the accuracy of identification (step S402). Specifically, the first identification unit 120 learns to make the objective function L larger. More specifically, the voice waveform identification unit 121 and the voice feature amount identification unit 122 learn so as to make the objective function L larger.
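  • Taken together, the forward conversion learning process (steps S201 to S206) and the forward conversion signal identification learning process (steps S401, S204, S205, and S402) form the usual alternating GAN update: the generators take a gradient step that decreases the objective function L, and the identification units take a step that increases it. A schematic PyTorch-style sketch follows; the function objective_L and the model objects passed in are assumptions, since the text does not provide them.

```python
def train_step(G_xy, G_yx, discriminators, batch_x, batch_y, objective_L, opt_gen, opt_disc):
    """One alternating update: generators decrease the objective L, discriminators increase it."""
    # Generator step (forward/inverse conversion learning, e.g. steps S206 and S606).
    loss_g = objective_L(G_xy, G_yx, discriminators, batch_x, batch_y)
    opt_gen.zero_grad()
    loss_g.backward()
    opt_gen.step()

    # Discriminator step (identification learning, e.g. steps S402 and S702);
    # maximizing L is implemented here as minimizing -L.
    loss_d = -objective_L(G_xy, G_yx, discriminators, batch_x, batch_y)
    opt_disc.zero_grad()
    loss_d.backward()
    opt_disc.step()
    return loss_g.item(), loss_d.item()
```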
  • FIG. 7 is a flowchart showing an example of the flow of the inverse transformation signal identification processing in the embodiment.
  • The voice waveform identification unit 161 acquires the voice waveform of the voice signal input to the second identification unit 160 and identifies, based on the acquired voice waveform, whether the voice signal input to the second identification unit 160 is an inverse conversion signal or a pre-synthesized signal (step S501).
  • The voice feature amount identification unit 162 acquires the voice feature amount of the voice signal input to the second identification unit 160 and identifies, based on the acquired voice feature amount, whether the voice signal input to the second identification unit 160 is an inverse conversion signal or a pre-synthesized signal (step S502).
  • The integrated identification unit 163 identifies, according to a predetermined rule based on the identification result of the voice waveform identification unit 161 and the identification result of the voice feature amount identification unit 162, whether the voice signal input to the second identification unit 160 is an inverse conversion signal or a pre-synthesized signal (step S503).
  • the identification result of the integrated identification unit 163 in step S503 is output to the second determination unit 180.
  • FIG. 8 is a flowchart showing an example of the flow of the inverse transformation learning process in the embodiment.
  • the second input determination unit 170 determines the audio signal input to the second identification unit 160 as an inverse conversion signal (step S601).
  • the second generation unit 150 acquires the forward conversion signal and executes the reverse conversion process on the acquired forward conversion signal to generate the reverse conversion signal (step S602).
  • the second generation unit 150 outputs the generated inverse conversion signal to the second identification unit 160 (step S603).
  • The second identification unit 160 executes the inverse conversion signal identification process on the acquired voice signal (step S604). That is, the processes of steps S501 to S503 are executed.
  • the second determination unit 180 determines whether or not the identification result of the second identification unit 160 is correct by comparing with the determination result of the second input determination unit 170 (step S605).
  • the second generation unit 150 learns to further improve the natural signal degree by the inverse transformation process based on the determination result of the second determination unit 180 (step S606). Specifically, the second generation unit 150 learns to make the objective function L smaller.
  • the processes of steps S602 to S606 are similarly performed.
  • FIG. 9 is a flowchart showing an example of the flow of the inverse transformation signal identification learning process in the embodiment.
  • the same processing as that shown in FIG. 7 or 8 will be designated by the same reference numerals as those in FIG. 7 or 8, and the description thereof will be omitted.
  • the second input determination unit 170 determines whether the audio signal input to the second identification unit 160 is a natural signal or an inverse conversion signal (step S701). Next, the processes of steps S604 and S605 are executed. Next, the second identification unit 160 learns to further improve the accuracy of identification (step S702). Specifically, the second identification unit 160 learns to make the objective function L larger. More specifically, the voice waveform identification unit 161 and the voice feature amount identification unit 162 learn so as to make the objective function L larger.
  • FIG. 10 is a flowchart showing an example of the flow of processing executed by the voice signal conversion model learning device 1 in the embodiment.
  • An example of the subsequent processing flow is described below, taking as an example the case where the process of step S201 is performed.
  • An example of the processing flow is also described taking as an example the case where the process of step S601 is performed.
  • the same processing as that shown in FIGS. 3 to 9 will be described by assigning the same reference numerals as those shown in FIGS. 3 to 9 and omitting description thereof.
  • First, it is determined whether or not the end condition is satisfied (step S801).
  • the end condition is, for example, a condition that the number of times of learning exceeds a predetermined number of times. Whether or not the end condition is satisfied is determined by, for example, the management unit 102 described later.
  • When the end condition is satisfied (step S801: YES), the process ends. On the other hand, if the end condition is not satisfied (step S801: NO), the process of step S301 is executed. Next, the process of step S302 is executed. After step S302, the process returns to step S203.
  • processing in step S206 and the processing in step S402 may be executed in the reverse order.
  • the order in which the processes in step S606 and the processes in step S702 are executed may be reversed.
  • If, instead of the process of step S201, the first input determination unit 130 determines the audio signal input to the first identification unit 120 as a natural signal, the processes of steps S602 to S302 are not executed. In such a case, the process ends after the process of FIG. 6 is executed.
  • When, instead of the process of step S601, the second input determination unit 170 determines the audio signal input to the second identification unit 160 as a natural signal, the processes of steps S602 to S604 and the process of step S606 are not executed.
  • the voice signal conversion model learning device 1 executes the forward conversion signal identification process, the forward conversion learning process, the forward conversion signal identification learning process, the reverse conversion signal identification process, the reverse conversion learning process, and the reverse conversion signal identification learning process.
  • With each learning iteration, a voice signal conversion model that produces a higher degree of natural signal is obtained.
  • FIG. 11 is a diagram showing an example of the hardware configuration of the voice signal conversion model learning device 1 according to the embodiment.
  • the voice signal conversion model learning device 1 includes a control unit 10 including a processor 91 such as a CPU (Central Processing Unit) connected by a bus and a memory 92, and executes a program.
  • the voice signal conversion model learning device 1 functions as a device including a control unit 10, an input unit 11, an interface unit 12, a storage unit 13, and an output unit 14 by executing a program. More specifically, the processor 91 reads out the program stored in the storage unit 13, and stores the read program in the memory 92.
  • By executing the read program, the processor 91 causes the voice signal conversion model learning device 1 to function as a device including the control unit 10, the input unit 11, the interface unit 12, the storage unit 13, and the output unit 14.
  • the control unit 10 controls the operation of various functional units included in the voice signal conversion model learning device 1.
  • the control unit 10 executes, for example, forward conversion signal identification processing, forward conversion learning processing, forward conversion signal identification learning processing, reverse conversion signal identification processing, reverse conversion learning processing, and reverse conversion signal identification learning processing.
  • the input unit 11 includes an input device such as a mouse, a keyboard, and a touch panel.
  • the input unit 11 may be configured as an interface for connecting these input devices to its own device.
  • the input unit 11 receives input of various information to its own device.
  • the input unit 11 receives, for example, an input instructing the start of learning.
  • the input unit 11 receives, for example, an input of a composite signal to be added to the composite signal group.
  • the input unit 11 receives, for example, an input of a natural signal to be added to the natural signal group.
  • the interface unit 12 includes a communication interface for connecting the own device to an external device.
  • the interface unit 12 communicates with an external device via wire or wireless.
  • the external device may be a storage device such as a USB (Universal Serial Bus) memory, for example.
  • the interface unit 12 acquires the composite signal output by the external device by communicating with the external device.
  • the interface unit 12 acquires the natural signal output by the external device by communicating with the external device.
  • the interface unit 12 includes a communication interface for connecting the own device to the voice signal conversion device 2.
  • the interface unit 12 communicates with the voice signal conversion device 2 via wire or wireless.
  • the interface unit 12 outputs a voice signal conversion model to the voice signal conversion device 2 by communicating with the voice signal conversion device 2.
  • The storage unit 13 is configured using a non-transitory computer-readable storage medium such as a magnetic hard disk device or a semiconductor storage device.
  • the storage unit 13 stores various information related to the voice signal conversion model learning device 1.
  • the storage unit 13 stores, for example, a group of natural signals in advance.
  • the storage unit 13 stores, for example, a synthetic signal group in advance.
  • the storage unit 13 stores, for example, a composite signal and a natural signal input via the input unit 11 or the interface unit 12.
  • the storage unit 13 stores, for example, the identification result of the first identification unit 120.
  • the storage unit 13 stores, for example, the identification result of the second identification unit 160.
  • the storage unit 13 stores, for example, the determination result of the first determination unit 140.
  • the storage unit 13 stores, for example, the determination result of the second determination unit 180.
  • the storage unit 13 stores, for example, the determination result of the first input determination unit 130.
  • the storage unit 13 stores, for example, the determination result of the second input determination unit 170.
  • the storage unit 13 stores, for example, an audio signal conversion model.
  • the output unit 14 outputs various information.
  • the output unit 14 includes display devices such as a CRT (Cathode Ray Tube) display, a liquid crystal display, and an organic EL (Electro-Luminescence) display.
  • the output unit 14 may be configured as an interface for connecting these display devices to its own device.
  • the output unit 14 outputs, for example, the information input to the input unit 11.
  • FIG. 12 is a diagram showing an example of the functional configuration of the control unit 10 in the embodiment.
  • the control unit 10 includes a managed unit 101 and a management unit 102.
  • The managed unit 101 includes the first generation unit 110, the first identification unit 120, the first input determination unit 130, the first determination unit 140, the second generation unit 150, the second identification unit 160, the second input determination unit 170, and the second determination unit 180.
  • The managed unit 101 obtains the audio signal conversion model by executing the forward conversion signal identification process, the forward conversion learning process, the forward conversion signal identification learning process, the inverse conversion signal identification process, the inverse conversion learning process, and the inverse conversion signal identification learning process using each voice signal included in the natural signal group and the synthesized signal group. Specifically, the audio signal conversion model is a trained model that represents the forward conversion process by the first generation unit 110.
  • the management unit 102 controls the operation of the managed unit 101.
  • the management unit 102 executes, for example, the forward conversion signal identification process, the forward conversion learning process, the forward conversion signal identification learning process, the reverse conversion signal identification process, the reverse conversion learning process, and the reverse conversion signal identification learning process by the managed unit 101. Control the timing.
  • the management unit 102 controls, for example, the operations of the input unit 11, the interface unit 12, the storage unit 13, and the output unit 14.
  • the management unit 102 reads various information from, for example, the storage unit 13 and outputs it to the managed unit 101.
  • the management unit 102 acquires, for example, the information input to the input unit 11 and outputs it to the managed unit 101.
  • the management unit 102 acquires, for example, the information input to the input unit 11 and records it in the storage unit 13.
  • The management unit 102 acquires, for example, the information input to the interface unit 12 and outputs it to the managed unit 101.
  • The management unit 102 acquires, for example, the information input to the interface unit 12 and records it in the storage unit 13.
  • the management unit 102 causes the output unit 14 to output the information input to the input unit 11, for example.
  • the management unit 102 records, for example, the identification result of the first identification unit 120 in the storage unit 13.
  • the management unit 102 records, for example, the identification result of the second identification unit 160 in the storage unit 13.
  • The management unit 102 records, for example, the determination result of the first determination unit 140 in the storage unit 13.
  • The management unit 102 records, for example, the determination result of the second determination unit 180 in the storage unit 13.
  • The management unit 102 records, for example, the determination result of the first input determination unit 130 in the storage unit 13.
  • The management unit 102 records, for example, the determination result of the second input determination unit 170 in the storage unit 13.
  • FIG. 13 is a diagram showing an example of the hardware configuration of the audio signal conversion device 2 according to the embodiment.
  • the voice signal conversion device 2 includes a control unit 20 including a processor 93 such as a CPU connected by a bus and a memory 94, and executes a program.
  • the voice signal conversion device 2 functions as a device including a control unit 20, an input unit 21, an interface unit 22, a storage unit 23, and an output unit 24 by executing a program. More specifically, the processor 93 reads the program stored in the storage unit 23, and stores the read program in the memory 94.
  • the voice signal conversion device 2 functions as a device including the control unit 20, the input unit 21, the interface unit 22, the storage unit 23, and the output unit 24.
  • the control unit 20 controls the operation of various functional units included in the voice signal conversion device 2.
  • the control unit 20 converts the unnaturally synthesized signal into a naturally synthesized signal by using, for example, the voice signal conversion model obtained by the voice signal conversion model learning device 1.
  • the input unit 21 includes an input device such as a mouse, a keyboard, and a touch panel.
  • the input unit 21 may be configured as an interface for connecting these input devices to its own device.
  • the input unit 21 receives input of various information to its own device.
  • the input unit 21 receives, for example, an input instructing the start of a process of converting an unnaturally synthesized signal into a naturally synthesized signal.
  • the input unit 21 receives, for example, the input of the unnaturally synthesized signal to be converted.
  • the interface unit 22 includes a communication interface for connecting the own device to an external device.
  • the interface unit 22 communicates with an external device via wire or wireless.
  • the external device is, for example, an output destination of a naturally synthesized signal.
  • the interface unit 22 outputs a naturally synthesized signal to the external device by communicating with the external device.
  • the external device for outputting the naturally synthesized signal is, for example, an audio output device such as a speaker.
  • the external device may be, for example, a storage device such as a USB memory that stores the voice signal conversion model.
  • the interface unit 22 acquires the voice signal conversion model by communicating with the external device.
  • the external device is, for example, an output source of an unnaturally synthesized signal.
  • the interface unit 22 acquires an unnaturally synthesized signal from the external device by communicating with the external device.
  • the interface unit 22 includes a communication interface for connecting the own device to the voice signal conversion model learning device 1.
  • the interface unit 22 communicates with the voice signal conversion model learning device 1 via wire or wireless.
  • the interface unit 22 acquires a voice signal conversion model from the voice signal conversion model learning device 1 by communicating with the voice signal conversion model learning device 1.
  • The storage unit 23 is configured using a non-transitory computer-readable storage medium such as a magnetic hard disk device or a semiconductor storage device.
  • the storage unit 23 stores various information related to the voice signal conversion device 2.
  • The storage unit 23 stores, for example, the voice signal conversion model acquired via the interface unit 22.
  • the output unit 24 outputs various information.
  • the output unit 24 includes display devices such as a CRT display, a liquid crystal display, and an organic EL display.
  • the output unit 24 may be configured as an interface for connecting these display devices to its own device.
  • the output unit 24 outputs, for example, the information input to the input unit 21.
  • FIG. 14 is a diagram showing an example of the functional configuration of the control unit 20 in the embodiment.
  • the control unit 20 includes a conversion target acquisition unit 201, a conversion unit 202, and an audio signal output control unit 203.
  • the conversion target acquisition unit 201 acquires the unnatural composite signal to be converted.
  • the conversion target acquisition unit 201 acquires, for example, the unnatural composite signal input to the input unit 21.
  • the conversion target acquisition unit 201 acquires, for example, the unnaturally synthesized signal input to the interface unit 22.
  • the conversion unit 202 converts the conversion target acquired by the conversion target acquisition unit 201 into a naturally synthesized signal using the voice signal conversion model.
  • the naturally synthesized signal is output to the audio signal output control unit 203.
  • the voice signal output control unit 203 controls the operation of the interface unit 22.
  • the audio signal output control unit 203 causes the interface unit 22 to output a naturally synthesized signal by controlling the operation of the interface unit 22.
  • FIG. 15 is a flowchart showing an example of the flow of processing executed by the voice signal conversion device 2 in the embodiment.
  • the control unit 20 acquires the unnaturally synthesized signal input to the interface unit 22 (step S901).
  • the control unit 20 converts the unnaturally synthesized signal into a naturally synthesized signal using the audio signal conversion model stored in the storage unit 23 (step S902).
  • the control unit 20 controls the operation of the interface unit 22 to output the naturally synthesized signal to the output destination (step S903).
  • the output destination is, for example, an external device such as a speaker.
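  • In other words, steps S901 to S903 amount to loading the trained conversion model, passing the unnaturally synthesized waveform through the forward conversion, and sending the result to the output destination. A minimal sketch is shown below; it assumes the trained generator was exported with torch.save and operates directly on waveforms, and the file names are placeholders.

```python
import torch
import soundfile as sf

def convert_file(model_path: str, in_wav: str, out_wav: str) -> None:
    """Steps S901-S903: load the conversion model, convert the input signal, write the result."""
    generator = torch.load(model_path, map_location="cpu")    # trained forward-conversion model
    generator.eval()
    wav, sr = sf.read(in_wav, dtype="float32")                 # unnaturally synthesized signal
    x = torch.from_numpy(wav).view(1, 1, -1)                   # (batch, channel, samples)
    with torch.no_grad():
        y = generator(x)                                        # naturally synthesized signal
    sf.write(out_wav, y.squeeze().numpy(), sr)
```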
  • the first experiment was conducted using 437 sentences included in the Japanese voice data set of a female narrator. Of the 437 sentences in the Japanese voice dataset, 407 sentences (about 1 hour) were used to obtain the voice signal conversion model. Of the 437 sentences in the Japanese voice dataset, 30 sentences (4 minutes) were used to obtain a 5-step MOS (Mean Opinion Score) rating for the naturalness of sound quality. The audio sampling rate was 22.05 kHz. There were 10 subjects. Each subject evaluated 30 and 20 sentences randomly selected for each learning method.
  • FIG. 16 is a first diagram showing an example of the experimental results of the first experiment.
  • FIG. 17 is a second diagram showing an example of the experimental results of the first experiment.
  • the horizontal axis of FIGS. 16 and 17 shows a method for obtaining an audio signal conversion model.
  • the vertical axis of FIGS. 16 and 17 shows a 5-step MOS evaluation regarding the naturalness of sound quality.
  • The horizontal dotted line in FIGS. 16 and 17 represents the evaluation result of the natural voice.
  • SPSS indicates a method of DNN (Deep Neural Network) text-to-speech synthesis (SPSS: Statistical Parametric Speech Synthesis).
  • GANv indicates a correction method on the voice feature amount.
  • V1 indicates a method of using a downsampling module for a convolutional neural network.
  • V2 indicates a method of obtaining a voice signal conversion model by a voice signal generation system 100 having a first simple identification unit in place of the first identification unit 120 and a second simple identification unit in place of the second identification unit 160.
  • The first simple identification unit includes the voice waveform identification unit 121 but does not include the voice feature amount identification unit 122 or the integrated identification unit 123; it is a classifier that identifies, from the waveform of the input voice signal, whether the input voice signal is a natural signal or a forward conversion signal.
  • The second simple identification unit includes the voice waveform identification unit 161 but does not include the voice feature amount identification unit 162 or the integrated identification unit 163; it is a classifier that identifies, from the waveform of the input voice signal, whether the input voice signal is an inverse conversion signal or a pre-synthesized signal.
  • V2msp indicates a method of obtaining a voice signal conversion model by a voice signal generation system 100 having a third simple identification unit in place of the first identification unit 120 and a fourth simple identification unit in place of the second identification unit 160.
  • the third simple identification unit includes a voice waveform identification unit 121, a voice feature amount identification unit 122, and an integrated identification unit 123.
  • the voice feature amount identification unit 122 included in the third simple identification unit uses a mel spectrogram of the voice signal to be identified as the feature amount used for identification.
  • the fourth simple identification unit includes a voice waveform identification unit 161, a voice feature amount identification unit 162, and an integrated identification unit 163.
  • the voice feature amount identification unit 162 included in the fourth simple identification unit uses a mel spectrogram of the voice signal to be identified as the feature amount used for identification.
  • V2ph indicates a method of obtaining a voice signal conversion model by a voice signal generation system 100 having a fifth simple identification unit in place of the first identification unit 120 and a sixth simple identification unit in place of the second identification unit 160.
  • the fifth simple identification unit includes a voice waveform identification unit 121, a voice feature amount identification unit 122, and an integrated identification unit 123.
  • the voice feature amount identification unit 122 included in the fifth simple identification unit uses the phase spectrum of the voice signal to be identified as the feature amount used for identification.
  • the sixth simple identification unit includes a voice waveform identification unit 161, a voice feature amount identification unit 162, and an integrated identification unit 163.
  • the voice feature amount identification unit 162 included in the sixth simple identification unit uses the phase spectrum of the voice signal to be identified as the feature amount used for identification.
  • V2mfcc indicates a method of obtaining a voice signal conversion model by a voice signal generation system 100 having a seventh simple identification unit in place of the first identification unit 120 and an eighth simple identification unit in place of the second identification unit 160.
  • the seventh simple identification unit includes a voice waveform identification unit 121, a voice feature amount identification unit 122, and an integrated identification unit 123.
  • the voice feature amount identification unit 122 included in the seventh simple identification unit uses the mel frequency cepstrum coefficient of the voice signal to be identified as the feature amount used for identification.
  • the eighth simple identification unit includes a voice waveform identification unit 161, a voice feature amount identification unit 162, and an integrated identification unit 163.
  • the voice feature amount identification unit 162 included in the eighth simple identification unit uses the mel frequency cepstrum coefficient of the voice signal to be identified as the feature amount used for identification.
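  • The three feature-amount choices compared here (mel spectrogram for V2msp, phase spectrum for V2ph, mel frequency cepstrum coefficients for V2mfcc) can be extracted, for example, as follows; this librosa sketch uses assumed frame parameters and is not the exact configuration used in the experiments.

```python
import numpy as np
import librosa

def candidate_features(wav, sr=22050, n_fft=1024, hop=256):
    """Feature amounts compared in the first experiment: mel spectrogram, phase spectrum, MFCC."""
    stft = librosa.stft(wav, n_fft=n_fft, hop_length=hop)
    mel = librosa.feature.melspectrogram(y=wav, sr=sr, n_fft=n_fft, hop_length=hop, n_mels=80)
    phase = np.angle(stft)                                     # phase spectrum (V2ph)
    mfcc = librosa.feature.mfcc(y=wav, sr=sr, n_mfcc=24, n_fft=n_fft, hop_length=hop)
    return mel, phase, mfcc
```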
  • FIGS. 16 and 17 show that V1 has significantly improved sound quality as compared with SPSS.
  • the improvement of sound quality means that the degree of natural signal is increased.
  • FIGS. 16 and 17 show that V1 has improved sound quality over GANv.
  • FIGS. 16 and 17 show that V2 has improved sound quality over SPSS.
  • FIGS. 16 and 17 show that V2 does not improve sound quality over V1. This is because V2 produces more noisy sound than V1.
  • FIGS. 16 and 17 show that V2msp and V2mfcc have higher MOS ratings than V1, V2, V2ph, SPSS and GANv.
  • FIGS. 16 and 17 show that the p-value of the two-sided Mann-Whitney test is 0.05 or more for V2msp and V2mfcc. This indicates that there is no statistically significant difference between the audio signals converted by V2msp and V2mfcc and the natural signal.
  • FIGS. 16 and 17 show that V2ph produces noisy voice and has a lower MOS rating than V2. From the results of FIGS. 16 and 17, it is suggested that it is effective to use both the voice waveform classifiers (that is, the voice waveform identification units 121 and 161) and the voice feature amount identification units 122 and 162.
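  • The statistical claim above can be checked with a two-sided Mann-Whitney U test, for example with SciPy as sketched below; the score arrays are hypothetical placeholders, not the actual MOS ratings.

```python
from scipy.stats import mannwhitneyu

# Hypothetical 5-point MOS ratings; the real per-listener scores are not reproduced here.
natural_scores = [5, 4, 5, 4, 4, 5, 5, 4, 4, 5]
converted_scores = [4, 5, 4, 4, 5, 4, 5, 4, 5, 4]

stat, p_value = mannwhitneyu(natural_scores, converted_scores, alternative="two-sided")
print(stat, p_value)   # p_value >= 0.05: no statistically significant difference at the 5% level
```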
  • “V2msp”, “V2ph” and “V2mfcc” are examples of processing for converting audio using the audio signal conversion model obtained by the audio signal generation system 100.
  • FIG. 18 shows the experimental results of a comparison experiment (hereinafter referred to as "second experiment") between the audio signal conversion model obtained by the audio signal generation system 100 and the audio signal conversion model obtained by another learning method.
  • the second experiment was performed using 13100 sentences included in the English voice data set LJSpeech (see Reference 1). Forty of the 13100 sentences in the English speech dataset were used to obtain a five-step MOS rating for the naturalness of sound quality.
  • the audio sampling rate was 22.05 kHz.
  • spectral distortion was also calculated.
  • FIG. 18 is a diagram showing an example of the experimental results of the second experiment.
  • FIG. 18 shows the log-spectral distance (LSD) and the MOS evaluation result for each learning method.
  • WORLD is the method described in Reference 2.
  • Griffin-Lim is the method described in Reference 3.
  • OpenWaveNet is the method described in Reference 4.
  • WaveGlow is the method described in Reference 5.
  • Reference 1: "The LJ Speech Dataset," [online], retrieved March 30, 2020, <https://keithito.com/LJ-Speech-Dataset/>
  • Reference 2: M. Morise, F. Yokomori, and K. Ozawa, "WORLD: a vocoder-based high-quality speech synthesis system for real-time applications," IEICE Transactions on Information and Systems, vol. 99, no. 7, pp. 1877-1884, 2016.
  • Reference 3: D. Griffin and J. Lim, "Signal estimation from modified short-time Fourier transform," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 32, no. 2, pp. 236-243, 1984.
  • FIG. 18 shows that Griffin-Lim has the lowest LSD.
  • FIG. 18 shows that the LSD of WORLD is large, that is, the spectrum is greatly distorted, because WORLD is a parametric vocoder.
  • FIG. 18 shows that there is no difference in MOS evaluation between Griffin-Lim and WORLD.
  • FIG. 18 shows that, comparing WaveGlow and openWaveNet, openWaveNet has the larger LSD, while WaveGlow has the higher MOS evaluation. These results indicate that LSD values around 4 are unlikely to affect the MOS evaluation. FIG. 18 also shows that V2msp has the highest LSD and the highest MOS rating.
  • the audio signal generation system 100 converts the input waveform into a waveform having a higher degree of natural signal. Therefore, even when a voice whose band has been reduced from that of the original voice (a deteriorated voice) is input, the voice signal generation system 100 can, for example, convert the input voice into a voice whose band is restored. In other words, the voice signal generation system 100 performs band extension.
  • FIG. 19 shows the experimental results of a comparison experiment (hereinafter referred to as "third experiment") between the audio signal conversion model obtained by the audio signal generation system 100 and the audio signal conversion model obtained by another learning method.
  • FIG. 19 is a diagram showing an example of the experimental results of the third experiment.
  • the vertical axis of FIG. 19 shows the test results of the MUSHRA test.
  • the horizontal axis of FIG. 19 shows the method to be evaluated.
  • “48” on the horizontal axis of FIG. 19 indicates a natural voice sampled at 48 kHz.
  • “16to48” on the horizontal axis of FIG. 19 indicates a voice whose band has been expanded by the voice signal generation system 100.
  • “8to48” on the horizontal axis in FIG. 19 indicates the voice whose band has been expanded by the voice signal generation system 100.
  • “8to16to48” on the horizontal axis of FIG. 19 indicates a voice whose band has been expanded by the voice signal generation system 100.
  • the differences among "16to48", "8to48", and "8to16to48" are as follows.
  • "16to48" indicates the audio obtained by inputting "16" to the audio signal conversion device 2 as the deteriorated audio, applying the audio signal conversion model obtained by the audio signal generation system 100 to "16", and extending the band of "16" to 48 kHz. Here, "16" indicates a voice sampled at 48 kHz and then downsampled to 16 kHz.
  • “16” on the horizontal axis in FIG. 19 indicates natural voice downsampled to 16 kHz. “4” on the horizontal axis of FIG. 19 indicates a natural sound downsampled to 4 kHz.
  • FIG. 19 shows that "16to48” has a small difference from the original sound.
  • FIG. 19 shows that "8to48" is significantly deteriorated from the original sound. The reason for the deterioration is that, since the information in speech is concentrated at 16 kHz and below, downsampling to 8 kHz greatly reduces the amount of information and learning does not proceed well.
  • FIG. 19 shows that “8to16to48” has higher sound quality than “8to48”.
  • the voice signal generation system 100 of the embodiment configured as described above uses not only one of the voice waveform and the voice feature amount of the voice signal but both, and obtains a voice signal conversion model by executing the forward conversion signal identification processing, the forward conversion learning processing, the forward conversion signal identification learning processing, the inverse conversion signal identification processing, the inverse conversion learning processing, and the inverse conversion signal identification learning processing. Therefore, the audio signal generation system 100 configured in this way can generate an audio signal having a higher degree of natural signal than when an audio signal conversion model is obtained using only the audio waveform. That is, the voice signal generation system 100 configured in this way can generate a voice closer to the voice emitted by humans.
  • a method for obtaining a speech signal conversion model using only a speech waveform is, for example, SEGAN (Speech Enhancement Generative Adversarial Network).
  • the voice signal generation system 100 of the embodiment configured as described above obtains a voice signal conversion model by executing the forward conversion signal identification processing, the forward conversion learning processing, the forward conversion signal identification learning processing, the inverse conversion signal identification processing, the inverse conversion learning processing, and the inverse conversion signal identification learning processing. Therefore, the voice signal generation system 100 can generate a voice closer to the voice emitted by a human being than when the voice signal conversion model is obtained only by a convolutional neural network using the voice waveform and the voice feature amount.
  • the voice signal generation system 100 of the embodiment configured as described above uses the forward conversion signal identification processing, the forward conversion learning processing, the forward conversion signal identification learning processing, the inverse conversion signal identification processing, the inverse conversion learning processing, and the inverse conversion signal identification learning processing. Therefore, even when the alignment of the voice signals used for learning is low, it is possible to generate a voice close to the voice emitted by a human being. The speech signal generation system 100 therefore has the effect that its application scenes are not limited, unlike SEGAN (Speech Enhancement Generative Adversarial Network) (see Reference 7), which is effective only when the alignment is high.
  • an example of a learning audio signal with high alignment is an audio signal obtained by superimposing, on a computer, noise that simulates a noisy environment onto a voice recorded in an ideal environment and then removing the noise.
  • an example of a learning speech signal with low alignment is synthetic speech generated by text-to-speech synthesis or voice conversion. Since the length of such an audio signal also differs from signal to signal, the alignment is low in this respect as well.
  • the method by which the voice signal generation system 100 generates the voice signal conversion model does not necessarily have to be the convolutional CycleGAN.
  • the method for generating the voice signal conversion model by the voice signal generation system 100 (hereinafter referred to as “model generation method”) may be any method as long as it satisfies the following learning method conditions.
  • the learning method conditions include the first condition.
  • the first condition is that the model generation method is a method using a first generator that outputs a forward conversion signal, which is a signal having a higher degree of natural signal than the input audio signal, by executing forward conversion processing, which is a conversion that raises the degree of natural signal of the input audio signal.
  • the learning method condition includes the second condition.
  • the second condition is that the model generation method is a method using a first classifier that discriminates whether the input signal is a forward conversion signal or a natural signal.
  • the learning method condition includes the third condition.
  • the third condition is that the model generation method is a method using a second generator that outputs an inverse conversion signal having a lower degree of natural signal than the input signal by executing inverse conversion processing, which is a conversion that lowers the degree of natural signal of the input signal.
  • the learning method conditions include the fourth condition.
  • the fourth condition is that the model generation method is a method using a second classifier that discriminates whether the input signal is a pre-synthesized signal, which is a signal prepared in advance and synthesized, or an inverse conversion signal.
  • the composite signal read by the second identification unit 160 from the composite signal group is an example of the pre-synthesized signal.
  • the learning method condition includes the fifth condition.
  • the fifth condition is that, in the model generation method, the first generator, the first classifier, the second generator, and the second classifier learn based on the discrimination results of the first classifier and the second classifier.
  • the learning method condition may further include the following weak classifier conditions.
  • the weak classifier condition includes a condition that at least one of the first classifier and the second classifier learns using a voice waveform classifier and a voice feature amount classifier. The model generation method may also be, for example, a method that further uses a third generator different from the first generator and the second generator and a third classifier different from the first classifier and the second classifier. (A minimal structural sketch of these four components is given at the end of this list.)
  • the first identification unit 120 is an example of the first classifier.
  • the second generation unit 150 is an example of the second generator.
  • the second identification unit 160 is an example of the second classifier.
  • even when the alignment of the voice signals used for learning is low, the voice signal generation system 100 can use those voice signals for learning and generate a voice close to the voice emitted by a human being.
  • the voice waveform identification unit 121 and the voice waveform identification unit 161 may identify the voice signal based on the frequency spectrum converted based on the perceptual scale of pitch.
  • the perceptual measure of pitch is, for example, the Mel scale.
  • the frequency spectrum converted based on the perceptual measure of pitch is, for example, a spectrum represented by the mel frequency cepstrum coefficient.
  • the frequency spectrum may be, for example, a phase spectrum.
  • the frequency spectrum may be an amplitude spectrum.
  • the frequency spectrum converted based on the perceptual measure of pitch may be, for example, a mel spectrogram.
  • the audio signal conversion model learning device 1 does not necessarily have to learn a learning model that converts an input audio signal into an audio signal that is close to the audio emitted by a human being.
  • the voice signal conversion model learning device 1 may learn a learning model that converts an input voice signal into a voice signal of a voice close to the voice of an animal other than a human such as a dog or a cat.
  • the voice signal conversion device 2 converts the input voice into a voice signal close to the voice of an animal other than a human.
  • the animals in this embodiment include humans.
  • the unnatural signal and the naturally synthesized signal are audio signals of the same type of animal, but they do not necessarily have to be the same.
  • the managed unit 101 is an example of the learning unit.
  • the unnatural signal is an example of an input signal.
  • the voice signal conversion model learning device 1 may be implemented by using a plurality of information processing devices connected so as to be able to communicate via a network.
  • each functional unit included in the voice signal conversion model learning device 1 may be distributed among and implemented in a plurality of information processing devices.
  • the first generation unit 110, the first identification unit 120, the second generation unit 150, and the second identification unit 160 may be implemented in different information processing devices.
  • the voice signal conversion device 2 may be implemented by using a plurality of information processing devices connected so as to be able to communicate via a network.
  • each functional unit included in the voice signal conversion device 2 may be distributed among and implemented in a plurality of information processing devices.
  • the program may be recorded on a computer-readable recording medium.
  • the computer-readable recording medium is, for example, a flexible disk, a magneto-optical disk, a portable medium such as a ROM or a CD-ROM, or a storage device such as a hard disk built in a computer system.
  • the program may be transmitted over a telecommunication line.
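  • As a supplement to the learning method conditions described above, the following is a minimal, self-contained sketch of the four components those conditions name: two generators and two classifiers. All class names, layer sizes, and other details are illustrative assumptions and are not prescribed by this embodiment; PyTorch is used only for concreteness.

    import torch
    import torch.nn as nn

    class TinyGenerator(nn.Module):
        # 1-D convolutional waveform-to-waveform mapping (stand-in for Gx->y / Gy->x).
        def __init__(self):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv1d(1, 64, kernel_size=15, padding=7), nn.ReLU(),
                nn.Conv1d(64, 1, kernel_size=15, padding=7))

        def forward(self, wav):            # wav: (batch, 1, samples)
            return self.net(wav)

    class TinyClassifier(nn.Module):
        # Stand-in real/fake classifier; the waveform/feature split of the
        # identification units is sketched later in the description.
        def __init__(self):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv1d(1, 32, kernel_size=41, stride=4, padding=20), nn.LeakyReLU(0.2),
                nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(32, 1))

        def forward(self, wav):
            return self.net(wav)           # one logit per input signal

    g_forward = TinyGenerator()   # first condition: raises the degree of natural signal
    d_first   = TinyClassifier()  # second condition: forward conversion signal vs. natural signal
    g_inverse = TinyGenerator()   # third condition: lowers the degree of natural signal
    d_second  = TinyClassifier()  # fourth condition: inverse conversion signal vs. pre-synthesized signal
    # fifth condition: all four modules are trained from the decisions of d_first and d_second.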

Landscapes

  • Engineering & Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Machine Translation (AREA)

Abstract

Provided is a sound signal conversion model learning device equipped with a learning unit which obtains, by a machine learning method, a trained model for converting an input signal that is an inputted sound signal into a sound signal having a higher natural signal degree than the input signal, the natural signal degree indicating the degree of similarity to a natural signal that is a sound actually emitted by an animal. The machine learning method is a method wherein a first generation unit which performs forward conversion processing, which is a conversion for increasing the natural signal degree, on the inputted sound signal to output a forward conversion signal having a higher natural signal degree than the sound signal, a first identification unit which identifies whether an inputted signal is a forward conversion signal or a natural signal, a second generation unit which performs reverse conversion processing, which is a conversion for decreasing the natural signal degree, on an inputted sound signal to output a reverse conversion signal having a lower natural signal degree than the sound signal, and a second identification unit which identifies whether an inputted signal is a preliminary synthesized signal that is a preliminarily prepared signal and a synthesized signal, or a reverse conversion signal, learn on the basis of identification results of the first identification unit and the second identification unit.

Description

Voice signal conversion model learning device, voice signal conversion device, voice signal conversion model learning method, and program
 The present invention relates to a voice signal conversion model learning device, a voice signal conversion device, a voice signal conversion model learning method, and a program.
 Techniques for generating desired speech from input information, such as speech generation by a parametric vocoder method (see Non-Patent Document 1) and statistical voice conversion (see Non-Patent Document 2), have been studied as having the potential to extend human communication ability and physical function. For example, because such a system is easy to build and highly versatile, speech generation by a parametric vocoder method has been widely studied for application to assistance for persons with physical disabilities (see Non-Patent Documents 3 and 4), language education support (see Non-Patent Documents 5 and 6), and amusement (see Non-Patent Document 7).
 However, the voice generated using the above-described conventional techniques differs greatly from the voice actually uttered by a human being. One cause of this difference is that the generated feature amounts are excessively smoothed. GAN (Generative Adversarial Networks), one of the machine learning methods, has been proposed as a method for suppressing such excessive smoothing. As a method using GAN, for example, SEGAN (Speech Enhancement Generative Adversarial Network) has been proposed (see Non-Patent Document 8). However, the methods using SEGAN proposed so far have the problem that learning does not hold when the training data contain target speech waveforms having the same amplitude spectrum but different phase spectra, and it has therefore been difficult to reduce the difference from the voice actually uttered by a human being while using training data with few restrictions. Such a problem applies not only to humans but also to the generation of voices emitted by animals.
 In view of the above circumstances, an object of the present invention is to provide a technique for generating a voice closer to the voice emitted by an animal.
 One aspect of the present invention is a voice signal conversion model learning device including a learning unit that obtains, by a machine learning method, a trained model for converting an input signal, which is an input voice signal, into a voice signal whose degree of natural signal, which indicates the degree of similarity to a natural signal actually emitted by an animal, is higher than that of the input signal. The machine learning method is a method in which a first generation unit that outputs a forward conversion signal having a higher degree of natural signal than an input voice signal by executing forward conversion processing, which is a conversion that raises the degree of natural signal, a first identification unit that identifies whether an input signal is a forward conversion signal or a natural signal, a second generation unit that outputs an inverse conversion signal having a lower degree of natural signal than an input voice signal by executing inverse conversion processing, which is a conversion that lowers the degree of natural signal, and a second identification unit that identifies whether an input signal is a pre-synthesized signal, which is a signal prepared in advance and synthesized, or an inverse conversion signal, learn based on the identification results of the first identification unit and the second identification unit.
 According to the present invention, it is possible to generate a voice closer to the voice emitted by an animal.
 FIG. 1 is an explanatory diagram illustrating an outline of the audio signal generation system 100 of the embodiment.
 FIG. 2 is an explanatory diagram illustrating an outline of the voice signal conversion model learning device 1 in the embodiment.
 FIG. 3 is a flowchart showing an example of the flow of the forward conversion signal identification processing in the embodiment.
 FIG. 4 is a first flowchart showing an example of the flow of the forward conversion learning processing in the embodiment.
 FIG. 5 is a second flowchart showing an example of the flow of the forward conversion learning processing in the embodiment.
 FIG. 6 is a flowchart showing an example of the flow of the forward conversion signal identification learning processing in the embodiment.
 FIG. 7 is a flowchart showing an example of the flow of the inverse conversion signal identification processing in the embodiment.
 FIG. 8 is a flowchart showing an example of the flow of the inverse conversion learning processing in the embodiment.
 FIG. 9 is a flowchart showing an example of the flow of the inverse conversion signal identification learning processing in the embodiment.
 FIG. 10 is a flowchart showing an example of the flow of processing executed by the voice signal conversion model learning device 1 in the embodiment.
 FIG. 11 is a diagram showing an example of the hardware configuration of the voice signal conversion model learning device 1 in the embodiment.
 FIG. 12 is a diagram showing an example of the functional configuration of the control unit 10 in the embodiment.
 FIG. 13 is a diagram showing an example of the hardware configuration of the voice signal conversion device 2 in the embodiment.
 FIG. 14 is a diagram showing an example of the functional configuration of the control unit 20 in the embodiment.
 FIG. 15 is a flowchart showing an example of the flow of processing executed by the voice signal conversion device 2 in the embodiment.
 FIG. 16 is a first diagram showing an example of the experimental results of the first experiment.
 FIG. 17 is a second diagram showing an example of the experimental results of the first experiment.
 FIG. 18 is a diagram showing an example of the experimental results of the second experiment.
 FIG. 19 is a diagram showing an example of the experimental results of the third experiment.
(Embodiment)
 The outline of the audio signal generation system 100 of the embodiment will be described with reference to FIGS. 1 and 2. FIG. 1 is an explanatory diagram illustrating an outline of the audio signal generation system 100 of the embodiment. The audio signal generation system 100 raises the degree of natural signal of an unnaturally synthesized signal, that is, a synthesized audio signal (hereinafter referred to as a "synthesized signal") whose degree of similarity to a natural signal (this degree is hereinafter referred to as the "degree of natural signal") is low (such a signal is hereinafter referred to as an "unnaturally synthesized signal"). A natural signal is a voice actually emitted by a human being. In other words, the audio signal generation system 100 converts an input unnaturally synthesized signal into a naturally synthesized signal, which is a synthesized signal having a higher degree of natural signal than the input unnaturally synthesized signal. Converting an unnaturally synthesized signal into a naturally synthesized signal is equivalent to generating a naturally synthesized signal based on the unnaturally synthesized signal.
 The voice signal generation system 100 includes a voice signal conversion model learning device 1 and a voice signal conversion device 2. The voice signal conversion model learning device 1 obtains, by machine learning, a trained model (hereinafter referred to as the "voice signal conversion model") that generates a naturally synthesized signal based on an unnaturally synthesized signal. For simplicity of the following explanation, performing machine learning is referred to as learning. Performing machine learning means appropriately adjusting the values of the parameters of the machine learning model. In the following description, learning so as to be A means that the parameter values of the machine learning model are adjusted to satisfy A, where A represents a predetermined condition.
 The voice signal conversion model learning device 1 receives natural signals and synthesized signals as inputs and learns the voice signal conversion model by cyclic adversarial learning (CycleGAN: Cycle Generative Adversarial Networks) that uses a voice waveform classifier and a voice feature amount classifier in training the classifiers. The voice waveform classifier is a classifier that identifies whether or not a voice signal is a natural signal based on the waveform of the voice signal used for learning (hereinafter referred to as the "voice waveform"). The voice feature amount classifier is a classifier that acquires, as a voice feature amount, information satisfying a predetermined condition from the voice signal used for learning and identifies whether or not the voice signal is a natural signal based on the acquired voice feature amount. Hereinafter, a CycleGAN that uses a voice waveform classifier and a voice feature amount classifier is referred to as a convolutional CycleGAN. The voice feature amount is, for example, the phase spectrum of the voice signal. As will be described later, the natural signals and synthesized signals input to the voice signal conversion model learning device 1 may be stored in advance in a storage unit included in the voice signal conversion model learning device 1.
 The convolutional CycleGAN is a neural network that learns the voice waveform and the voice feature amount of the voice signal used for learning with separate classifiers. In general, a neural network that learns the feature amounts of the data used for learning with a different classifier for each feature amount is called a convolutional neural network. The convolutional CycleGAN is therefore both a neural network obtained by modifying CycleGAN and a neural network obtained by modifying a convolutional neural network.
 FIG. 2 is an explanatory diagram illustrating an outline of the voice signal conversion model learning device 1 in the embodiment. The voice signal conversion model learning device 1 includes a first generation unit 110, a first identification unit 120, a first input determination unit 130, a second generation unit 150, a second identification unit 160, and a second input determination unit 170. The first generation unit 110, the first identification unit 120, the second generation unit 150, and the second identification unit 160 are functional units that learn. In the voice signal conversion model learning device 1, the first generation unit 110, the first identification unit 120, the first input determination unit 130, the second generation unit 150, the second identification unit 160, and the second input determination unit 170 cooperate to execute the CycleGAN.
 The first generation unit 110 executes the forward conversion processing on the input voice signal. The forward conversion processing is processing that raises the degree of natural signal of the input voice signal. The first generation unit 110 outputs the voice signal after the forward conversion processing as a forward conversion signal. The first generation unit 110 learns based on the identification result of the first identification unit 120, which will be described in detail later. Through learning, the first generation unit 110 learns so that the forward conversion processing further raises the degree of natural signal.
 A specific example of learning that further raises the degree of natural signal by the forward conversion processing is processing that suitably adjusts the parameter values so as to reduce the value of the loss function, which is a function that takes a larger value as the probability that the identification result of the first identification unit 120 is erroneous becomes lower.
 The first identification unit 120 identifies whether the input voice signal is a natural signal or a forward conversion signal. The first identification unit 120 learns based on this identification result. The first identification unit 120 includes a voice waveform identification unit 121, a voice feature amount identification unit 122, an integrated identification unit 123, and a first determination unit 140.
 The voice signal input to the first identification unit 120 is input to the voice waveform identification unit 121. The voice signal input to the voice waveform identification unit 121 is a voice signal determined by the first input determination unit 130, which will be described in detail later, and is a natural signal or a forward conversion signal. The voice waveform identification unit 121 identifies whether the voice signal input to the first identification unit 120 is a natural signal or a forward conversion signal based on the voice waveform of the input voice signal. The voice waveform identification unit 121 is an example of the voice waveform classifier.
 The voice signal input to the first identification unit 120 is also input to the voice feature amount identification unit 122. That is, the voice signal input to the voice feature amount identification unit 122 is the same as the voice signal input to the voice waveform identification unit 121. The voice feature amount identification unit 122 acquires a voice feature amount based on the input voice signal. The voice feature amount identification unit 122 identifies whether the voice signal input to the first identification unit 120 is a natural signal or a forward conversion signal based on the acquired voice feature amount. The voice feature amount identification unit 122 is an example of the voice feature amount classifier.
 The integrated identification unit 123 identifies whether the voice signal input to the first identification unit 120 is a natural signal or a forward conversion signal based on the identification result of the voice waveform identification unit 121 and the identification result of the voice feature amount identification unit 122. The identification result of the integrated identification unit 123 is the identification result of the first identification unit 120. The identification result of the integrated identification unit 123 is output to the first determination unit 140.
 The first determination unit 140 determines whether or not the identification result of the integrated identification unit 123 is correct based on the determination result of the first input determination unit 130.
 The voice waveform identification unit 121, the voice feature amount identification unit 122, and the integrated identification unit 123 learn based on the determination result of the first determination unit 140. Through learning, the voice waveform identification unit 121, the voice feature amount identification unit 122, and the integrated identification unit 123 learn so as to further improve the accuracy of identification. A specific example of such learning is processing that suitably adjusts the parameter values so as to increase the value of the loss function, which is a function that takes a larger value as the probability that the identification result of the integrated identification unit 123 is erroneous becomes lower.
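 The following is a hedged sketch of how a classifier combining a waveform branch, a feature branch, and an integration step (loosely mirroring the voice waveform identification unit 121, the voice feature amount identification unit 122, and the integrated identification unit 123) might be written. The mel-spectrogram feature, the averaging rule used for integration, and all layer settings are assumptions made for illustration, not details taken from this embodiment.

    import torch
    import torch.nn as nn
    import torchaudio

    class TwoBranchClassifier(nn.Module):
        def __init__(self, sample_rate: int = 22050, n_mels: int = 80):
            super().__init__()
            # voice feature extraction (assumed here: mel spectrogram)
            self.melspec = torchaudio.transforms.MelSpectrogram(
                sample_rate=sample_rate, n_fft=1024, hop_length=256, n_mels=n_mels)
            # decision from the raw waveform (cf. unit 121)
            self.wave_branch = nn.Sequential(
                nn.Conv1d(1, 32, kernel_size=41, stride=4, padding=20), nn.LeakyReLU(0.2),
                nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(32, 1))
            # decision from the voice feature amount (cf. unit 122)
            self.feat_branch = nn.Sequential(
                nn.Conv1d(n_mels, 32, kernel_size=5, padding=2), nn.LeakyReLU(0.2),
                nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(32, 1))

        def forward(self, wav: torch.Tensor) -> torch.Tensor:   # wav: (batch, 1, samples)
            logit_wave = self.wave_branch(wav)
            mel = self.melspec(wav).squeeze(1)                   # (batch, n_mels, frames)
            logit_feat = self.feat_branch(mel)
            return 0.5 * (logit_wave + logit_feat)               # integrated decision (cf. unit 123)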
 The first input determination unit 130 determines whether the voice signal to be input to the first identification unit 120 is a forward conversion signal or a natural signal.
 When the first input determination unit 130 determines that a natural signal is to be input to the first identification unit 120, one natural signal belonging to the natural signal group shown in the central column of FIG. 2 is input to the first identification unit 120. The natural signal group is a set of natural signals prepared in advance for learning. The synthesized signal group shown in the central column of FIG. 2 is a set of synthesized signals prepared in advance for learning. Hereinafter, a synthesized signal belonging to the synthesized signal group is referred to as a pre-synthesized signal.
 When the first input determination unit 130 determines that a synthesized signal is to be input to the first identification unit 120, the forward conversion signal is input to the first identification unit 120.
 The second generation unit 150 executes the inverse conversion processing on the input voice signal. When a forward conversion signal is input as the voice signal, the inverse conversion processing is executed on the acquired forward conversion signal. When a natural signal is input as the voice signal, the inverse conversion processing is executed on the acquired natural signal. The inverse conversion processing is processing that lowers the degree of natural signal of the input voice signal. The second generation unit 150 outputs the voice signal after the inverse conversion processing as an inverse conversion signal. The second generation unit 150 learns based on the identification result of the second identification unit 160, which will be described in detail later. Through learning, the second generation unit 150 learns so that the inverse conversion processing further lowers the degree of natural signal.
 A specific example of learning that further lowers the degree of natural signal by the inverse conversion processing is processing that suitably adjusts the parameter values so as to reduce the value of the loss function, which is a function that takes a larger value as the probability that the identification result of the second identification unit 160 is erroneous becomes lower.
 The second identification unit 160 identifies whether the input voice signal is an inverse conversion signal or a pre-synthesized signal. The second identification unit 160 learns based on its identification result. The second identification unit 160 includes a voice waveform identification unit 161, a voice feature amount identification unit 162, an integrated identification unit 163, and a second determination unit 180.
 The voice waveform identification unit 161 identifies whether the voice signal input to the second identification unit 160 is an inverse conversion signal or a pre-synthesized signal based on the voice waveform of the voice signal input to the second identification unit 160. The voice waveform identification unit 161 is an example of the voice waveform classifier.
 The voice feature amount identification unit 162 identifies whether the voice signal input to the second identification unit 160 is an inverse conversion signal or a pre-synthesized signal based on the voice feature amount of the voice signal input to the second identification unit 160. The voice feature amount identification unit 162 is an example of the voice feature amount classifier.
 The integrated identification unit 163 identifies whether the voice signal input to the second identification unit 160 is an inverse conversion signal or a pre-synthesized signal based on the identification result of the voice waveform identification unit 161 and the identification result of the voice feature amount identification unit 162. The identification result of the integrated identification unit 163 is the identification result of the second identification unit 160. The identification result of the integrated identification unit 163 is output to the second determination unit 180.
 The second determination unit 180 determines whether or not the identification result of the second identification unit 160 is correct based on the determination result of the second input determination unit 170.
 The voice waveform identification unit 161, the voice feature amount identification unit 162, and the integrated identification unit 163 learn based on the determination result of the second determination unit 180. Through learning, the voice waveform identification unit 161, the voice feature amount identification unit 162, and the integrated identification unit 163 learn so as to further improve the accuracy of identification. A specific example of such learning is processing that suitably adjusts the parameter values so as to increase the value of the loss function, which is a function that takes a larger value as the probability that the identification result of the integrated identification unit 163 is erroneous becomes lower.
 The second input determination unit 170 determines whether the voice signal to be input to the second generation unit 150 is a forward conversion signal or a natural signal. The second input determination unit 170 also determines whether the voice signal to be input to the second identification unit 160 is an inverse conversion signal or a pre-synthesized signal.
 The first generation unit 110, the first identification unit 120, the second generation unit 150, and the second identification unit 160 operate in cooperation so as to learn to reduce the objective function L represented by the following equations. That is, the objective function L is the loss function used when the first generation unit 110, the first identification unit 120, the second generation unit 150, and the second identification unit 160 learn.
 [Equations (1) to (17), which define the objective function L and its component loss terms, are provided as images (JPOXMLDOC01-appb-M000001 to M000017) in the original publication and are not reproduced here.]
 H1 represents the self-identity loss. More specifically, H1 is represented by the following equation (18).
 [Equation (18) is provided as an image (JPOXMLDOC01-appb-M000018) in the original publication and is not reproduced here.]
 The sum of H2 to H9 represents the adversarial loss. Dxwave represents a classifier that identifies what kind of signal the voice signal x is based on the waveform of the voice signal x. Dywave represents a classifier that identifies what kind of signal the voice signal y is based on the waveform of the voice signal y. Dxmsp represents a classifier that identifies what kind of signal the voice signal x is based on the voice feature amount of the voice signal x. Dymsp represents a classifier that identifies what kind of signal the voice signal y is based on the voice feature amount of the voice signal y. Hereinafter, for the sake of simplicity, a classifier is denoted by the symbol D. Dmsp(A) is a function that outputs the probability that A is the target voice feature amount, and log(1 - Dmsp(A)) is a function that outputs the probability that A is not the target voice feature amount.
 F(A) denotes a process of convolving a fast Fourier transform matrix windowed by a Hann window with A and then applying a mel filter to the absolute value of the result.
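 As a rough numerical illustration of F(A), the following sketch computes a Hann-windowed short-time Fourier transform, takes its magnitude, and applies a mel filter bank. The frame length, hop size, and number of mel bands are assumptions, not values specified in this embodiment.

    import numpy as np
    import librosa

    def mel_feature(wav: np.ndarray, sr: int = 22050,
                    n_fft: int = 1024, hop: int = 256, n_mels: int = 80) -> np.ndarray:
        spec = librosa.stft(wav, n_fft=n_fft, hop_length=hop, window="hann")
        magnitude = np.abs(spec)                              # |Hann-windowed FFT|
        mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)
        return mel_fb @ magnitude                             # (n_mels, frames)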
 λcyc, which weights the Lcyc term in the objective function, is a hyperparameter in learning. Gx→y is a mapping that converts the voice signal x into the voice signal y. The voice signal y is a voice signal having a higher degree of natural signal than the voice signal x. Dy represents an identification function that distinguishes whether the input voice signal y is a natural signal or a synthesized signal. Gy→x is a mapping that converts the voice signal y into the voice signal x. Dx represents an identification function that distinguishes whether the input voice signal x is a natural signal or a synthesized signal.
 Ladv represents the objective function in adversarial learning, that is, the adversarial loss. The adversarial loss is the value represented by the loss function in adversarial learning. Lid represents the identity mapping. The identity-mapping term exists in the objective function L so that the objective function L does not change when the input to the mapping Gx→y is the voice signal y instead of the voice signal x. The value of the identity mapping Lid represents the identity-mapping loss.
 L1 represents the loss function of the adversarial learning executed jointly by the first generation unit 110 and the first identification unit 120. L2 represents the loss function of the adversarial learning executed jointly by the second generation unit 150 and the second identification unit 160. L3 is a function representing the cycle-consistency loss in the CycleGAN, that is, a function indicating whether or not the mapping Gx→y and the mapping Gy→x are in one-to-one correspondence in the CycleGAN executed by the first generation unit 110, the first identification unit 120, the second generation unit 150, and the second identification unit 160 in cooperation.
 In this way, the objective function L is expressed by a function representing the adversarial loss, a function representing the cycle-consistency loss, and a function representing the identity-mapping loss.
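 Because the individual equations are reproduced above only as image references, the following LaTeX block gives only the general shape of such an objective, in the standard CycleGAN form with waveform and feature (mel-spectrogram) discriminators for each direction; it is a hedged reconstruction for orientation, not the exact equations (1) to (18) of this embodiment.

    % General shape only; not the patent's exact formulas.
    \begin{aligned}
    L &= L_{adv} + \lambda_{cyc} L_{cyc} + \lambda_{id} L_{id},\\
    L_{adv} &= \sum_{d \in \{\mathrm{wave},\, \mathrm{msp}\}} \Big(
          \mathbb{E}_{y}\big[\log D_{yd}(y)\big]
        + \mathbb{E}_{x}\big[\log\big(1 - D_{yd}(G_{x \to y}(x))\big)\big]\\
      &\qquad\qquad\qquad
        + \mathbb{E}_{x}\big[\log D_{xd}(x)\big]
        + \mathbb{E}_{y}\big[\log\big(1 - D_{xd}(G_{y \to x}(y))\big)\big] \Big),\\
    L_{cyc} &= \mathbb{E}_{x}\big[\lVert G_{y \to x}(G_{x \to y}(x)) - x \rVert_{1}\big]
             + \mathbb{E}_{y}\big[\lVert G_{x \to y}(G_{y \to x}(y)) - y \rVert_{1}\big],\\
    L_{id} &= \mathbb{E}_{y}\big[\lVert G_{x \to y}(y) - y \rVert_{1}\big].
    \end{aligned}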
 Here, an example of the flow of each of the forward conversion signal identification processing, the forward conversion learning processing, the forward conversion signal identification learning processing, the inverse conversion signal identification processing, the inverse conversion learning processing, and the inverse conversion signal identification learning processing will be described. The forward conversion signal identification processing is processing in which the first identification unit 120 identifies whether an input voice signal is a natural signal or a forward conversion signal. The forward conversion learning processing is processing in which the first generation unit 110 learns. The forward conversion signal identification learning processing is processing in which the first identification unit 120 learns. The inverse conversion signal identification processing is processing in which the second identification unit 160 identifies whether an input voice signal is an inverse conversion signal or a pre-synthesized signal. The inverse conversion learning processing is processing in which the second generation unit 150 learns. The inverse conversion signal identification learning processing is processing in which the second identification unit 160 learns.
 FIG. 3 is a flowchart showing an example of the flow of the forward conversion signal identification processing in the embodiment. The voice waveform identification unit 121 acquires the voice signal input to the first identification unit 120 and identifies whether the voice signal input to the first identification unit 120 is a natural signal or a forward conversion signal based on the acquired voice waveform (step S101). Next, the voice feature amount identification unit 122 acquires the voice feature amount of the voice signal input to the first identification unit 120 and identifies whether the voice signal input to the first identification unit 120 is a natural signal or a forward conversion signal based on the acquired voice feature amount (step S102). Next, the integrated identification unit 123 identifies whether the voice signal input to the first identification unit 120 is a natural signal or a forward conversion signal according to a predetermined rule based on the identification result of the voice waveform identification unit 121 and the identification result of the voice feature amount identification unit 122 (step S103). The identification result of the integrated identification unit 123 in step S103 is output to the first determination unit 140.
 FIG. 4 is a first flowchart showing an example of the flow of the forward conversion learning processing in the embodiment. The first input determination unit 130 determines that a forward conversion signal is to be input to the first identification unit 120 (step S201). Next, the first generation unit 110 acquires one synthesized signal from the synthesized signal group and generates a forward conversion signal by executing the forward conversion processing on the acquired synthesized signal (step S202). Next, the first generation unit 110 outputs the generated forward conversion signal to the first identification unit 120 (step S203). Next, the first identification unit 120 executes the forward conversion signal identification processing on the acquired voice signal (step S204), that is, the processes of steps S101 to S103 are executed. Next, the first determination unit 140 compares the identification result of the first identification unit 120 with the determination result of the first input determination unit 130 and determines whether or not the identification result is correct (step S205). Next, the first generation unit 110 learns based on the determination result of the first determination unit 140 so that the forward conversion processing further raises the degree of natural signal (step S206). Specifically, the first generation unit 110 learns so as to make the objective function L smaller.
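 To make the flow of FIG. 4 concrete, the following is a hedged sketch of a single forward conversion learning step expressed as one gradient update, reusing the module names assumed in the earlier sketches. The binary-cross-entropy form of the adversarial loss and the optimizer are assumptions; the embodiment itself only requires that the objective function L become smaller.

    import torch
    import torch.nn.functional as F

    def forward_conversion_learning_step(g_forward, d_first, synthesized_batch, optimizer_g):
        fake_natural = g_forward(synthesized_batch)        # S202: forward conversion
        logit = d_first(fake_natural)                      # S204: identification by the first classifier
        # S205/S206: the generator loss is small when the classifier judges the output "natural"
        g_loss = F.binary_cross_entropy_with_logits(logit, torch.ones_like(logit))
        optimizer_g.zero_grad()
        g_loss.backward()
        optimizer_g.step()
        return g_loss.item()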
 FIG. 5 is a second flowchart showing an example of the flow of the forward conversion learning processing in the embodiment. Hereinafter, processing similar to the processing shown in FIG. 3 or FIG. 4 is denoted by the same reference numerals as in FIG. 3 or FIG. 4, and its description is omitted.
 The second generation unit 150 outputs an inverse conversion signal (step S301). Next, the first generation unit 110 acquires the inverse conversion signal output by the second generation unit 150 and generates a forward conversion signal by executing the forward conversion processing on the acquired inverse conversion signal (step S302). Next, the processes of steps S203 to S206 are executed.
 FIG. 6 is a flowchart showing an example of the flow of the forward conversion signal identification learning processing in the embodiment. Hereinafter, processing similar to the processing shown in FIGS. 3 to 5 is denoted by the same reference numerals as in FIGS. 3 to 5, and its description is omitted.
 The first input determination unit 130 determines whether the voice signal to be input to the first identification unit 120 is a natural signal or a forward conversion signal (step S401). Next, the processes of steps S204 and S205 are executed. Next, the first identification unit 120 learns so as to further improve the accuracy of identification (step S402). Specifically, the first identification unit 120 learns so as to make the objective function L larger. More specifically, the voice waveform identification unit 121 and the voice feature amount identification unit 122 learn so as to make the objective function L larger.
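 Similarly, the following is a hedged sketch of the forward conversion signal identification learning of FIG. 6 as one gradient update of the first classifier, with the generator held fixed. The binary-cross-entropy loss is an assumed stand-in for making the objective function L larger; the function and variable names are carried over from the earlier sketches.

    import torch
    import torch.nn.functional as F

    def identification_learning_step(g_forward, d_first, natural_batch,
                                     synthesized_batch, optimizer_d):
        with torch.no_grad():                              # the generator is not updated here
            fake_natural = g_forward(synthesized_batch)
        logit_real = d_first(natural_batch)                # should be judged "natural"
        logit_fake = d_first(fake_natural)                 # should be judged "forward conversion signal"
        d_loss = (F.binary_cross_entropy_with_logits(logit_real, torch.ones_like(logit_real))
                  + F.binary_cross_entropy_with_logits(logit_fake, torch.zeros_like(logit_fake)))
        optimizer_d.zero_grad()
        d_loss.backward()
        optimizer_d.step()
        return d_loss.item()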
 図7は、実施形態における逆変換信号識別処理の流れの一例を示すフローチャートである。音声波形識別部161が第2識別部160に入力された音声信号の音声波形を取得し、取得した音声波形に基づいて第2識別部160に入力された音声信号が逆変換信号と事前合成信号とのいずれであるかを識別する(ステップS501)。次に音声特徴量識別部162が第2識別部160に入力された音声信号を取得し、取得した音声特徴量に基づいて第2識別部160に入力された音声信号が逆変換信号と事前合成信号とのいずれであるかを識別する(ステップS502)。次に統合識別部163が音声波形識別部161の識別結果と音声特徴量識別部162の識別結果とに基づき予め定められた所定の規則にしたがい、第2識別部160に入力された音声信号が逆変換信号と事前合成信号とのいずれであるかを識別する(ステップS503)。ステップS503における統合識別部163の識別結果が、第2判定部180に出力される。 FIG. 7 is a flowchart showing an example of the flow of the inverse transformation signal identification processing in the embodiment. The voice waveform identification unit 161 acquires the voice waveform of the voice signal input to the second identification unit 160, and the voice signal input to the second identification unit 160 based on the acquired voice waveform is an inverse conversion signal and a precombined signal. (Step S501). Next, the voice feature amount identification unit 162 acquires the voice signal input to the second identification unit 160, and the voice signal input to the second identification unit 160 is precombined with the inverse conversion signal based on the acquired voice feature amount. Identifying which of the signals is (step S502). Next, the integrated identification unit 163 receives the voice signal input to the second identification unit 160 according to a predetermined rule determined in advance based on the identification result of the voice waveform identification unit 161 and the identification result of the voice feature amount identification unit 162. It identifies whether it is an inverse conversion signal or a precombined signal (step S503). The identification result of the integrated identification unit 163 in step S503 is output to the second determination unit 180.
 FIG. 8 is a flowchart showing an example of the flow of the inverse conversion learning process in the embodiment. The second input determination unit 170 determines that the voice signal to be input to the second identification unit 160 is an inverse conversion signal (step S601). Next, the second generation unit 150 acquires a forward conversion signal and generates an inverse conversion signal by executing the inverse conversion processing on the acquired forward conversion signal (step S602). The second generation unit 150 then outputs the generated inverse conversion signal to the second identification unit 160 (step S603). Next, the second identification unit 160 executes the inverse conversion signal identification process on the acquired voice signal (step S604); that is, the processing of steps S501 to S503 is executed. Next, the second determination unit 180 determines whether the identification result of the second identification unit 160 is correct by comparing it with the determination result of the second input determination unit 170 (step S605). The second generation unit 150 then learns, based on the determination result of the second determination unit 180, so that the inverse conversion processing further improves the natural signal degree (step S606). Specifically, the second generation unit 150 learns so as to make the objective function L smaller. When the second generation unit 150 acquires a natural signal and generates an inverse conversion signal, the processing of steps S602 to S606 is performed in the same manner.
 FIG. 9 is a flowchart showing an example of the flow of the inverse conversion signal identification learning process in the embodiment. Hereinafter, processing identical to the processing shown in FIG. 7 or FIG. 8 is given the same reference signs as in FIG. 7 or FIG. 8, and its description is omitted.
 The second input determination unit 170 determines whether the voice signal to be input to the second identification unit 160 is to be a natural signal or an inverse conversion signal (step S701). Next, the processing of steps S604 and S605 is executed. The second identification unit 160 then learns so as to further improve its identification accuracy (step S702). Specifically, the second identification unit 160 learns so as to make the objective function L larger; more specifically, the voice waveform identification unit 161 and the voice feature identification unit 162 learn so as to make the objective function L larger.
 FIG. 10 is a flowchart showing an example of the flow of processing executed by the voice signal conversion model learning device 1 in the embodiment. FIG. 10 describes an example of the subsequent flow for the case where step S201 has been performed, and likewise for the case where the processing of step S601 is performed. Hereinafter, processing identical to the processing shown in FIGS. 3 to 9 is given the same reference signs as in FIGS. 3 to 9, and its description is omitted.
 Starting from step S201, the processing is executed in the order of step S202, step S203, step S204, step S205, step S206, step S402, step S601, step S602, step S604, step S605, step S606, and step S702. After step S702, it is determined whether an end condition is satisfied (step S801). The end condition is, for example, that the number of learning iterations has exceeded a predetermined number. Whether the end condition is satisfied is determined, for example, by the management unit 102 described later.
 When the end condition is satisfied (step S801: YES), the processing ends. When the end condition is not satisfied (step S801: NO), the processing of step S301 is executed, followed by the processing of step S302. After step S302, the processing returns to step S203.
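 The overall alternation shown in FIG. 10 (generator learning, discriminator learning, and the end-condition check) can be summarised as the control loop sketched below. The step functions are hypothetical placeholders for the processing blocks of the embodiment; only the ordering of the steps and the iteration-count end condition follow the text.

```python
# Control-flow sketch of the loop in FIG. 10 (placeholder step functions).
MAX_LEARNING_ROUNDS = 100_000   # assumed value; the text only requires "a predetermined number"

def run_learning(forward_learning, forward_disc_learning,
                 inverse_learning, inverse_disc_learning, regenerate_forward):
    rounds = 0
    while True:
        forward_learning()        # steps S202-S206: forward conversion learning
        forward_disc_learning()   # step S402: first identification unit learning
        inverse_learning()        # steps S601-S606: inverse conversion learning
        inverse_disc_learning()   # step S702: second identification unit learning
        rounds += 1
        if rounds >= MAX_LEARNING_ROUNDS:   # step S801: end condition
            break
        regenerate_forward()      # steps S301-S302: forward-convert the inverse
                                  # conversion signal before the next round

# run_learning(lambda: None, lambda: None, lambda: None, lambda: None, lambda: None)
```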
 The processing of step S206 and the processing of step S402 may be executed in the reverse order. Likewise, the processing of step S606 and the processing of step S702 may be executed in the reverse order.
 When, instead of the processing of step S201, the first input determination unit 130 determines that the voice signal to be input to the first identification unit 120 is a natural signal, the processing from step S602 to step S302 is not executed. In such a case, the processing ends after the processing of FIG. 6 is executed.
 When, instead of the processing of step S601, the second input determination unit 170 determines that the voice signal to be input to the second identification unit 160 is a natural signal, the processing of steps S602 to S604 and the processing of step S606 are not executed.
 In this way, by executing the forward conversion signal identification process, the forward conversion learning process, the forward conversion signal identification learning process, the inverse conversion signal identification process, the inverse conversion learning process, and the inverse conversion signal identification learning process, the voice signal conversion model learning device 1 obtains a voice signal conversion model whose natural signal degree becomes higher with each round of learning.
 FIG. 11 is a diagram showing an example of the hardware configuration of the voice signal conversion model learning device 1 in the embodiment.
 The voice signal conversion model learning device 1 includes a control unit 10 including a processor 91, such as a CPU (Central Processing Unit), and a memory 92 connected by a bus, and executes a program. By executing the program, the voice signal conversion model learning device 1 functions as a device including the control unit 10, an input unit 11, an interface unit 12, a storage unit 13, and an output unit 14. More specifically, the processor 91 reads the program stored in the storage unit 13 and stores the read program in the memory 92. By the processor 91 executing the program stored in the memory 92, the voice signal conversion model learning device 1 functions as a device including the control unit 10, the input unit 11, the interface unit 12, the storage unit 13, and the output unit 14.
 The control unit 10 controls the operation of the various functional units included in the voice signal conversion model learning device 1. The control unit 10 executes, for example, the forward conversion signal identification process, the forward conversion learning process, the forward conversion signal identification learning process, the inverse conversion signal identification process, the inverse conversion learning process, and the inverse conversion signal identification learning process.
 The input unit 11 includes an input device such as a mouse, a keyboard, or a touch panel. The input unit 11 may be configured as an interface that connects such an input device to the device itself. The input unit 11 receives input of various kinds of information to the device. The input unit 11 receives, for example, an input instructing the start of learning. The input unit 11 also receives, for example, an input of a synthesized signal to be added to the synthesized signal group and an input of a natural signal to be added to the natural signal group.
 The interface unit 12 includes a communication interface for connecting the device to an external device. The interface unit 12 communicates with the external device by wire or wirelessly. The external device may be, for example, a storage device such as a USB (Universal Serial Bus) memory. When the external device outputs, for example, a synthesized signal, the interface unit 12 acquires the synthesized signal output by the external device through communication with the external device. When the external device outputs, for example, a natural signal, the interface unit 12 acquires the natural signal output by the external device through communication with the external device.
 The interface unit 12 also includes a communication interface for connecting the device to the voice signal conversion device 2. The interface unit 12 communicates with the voice signal conversion device 2 by wire or wirelessly. Through communication with the voice signal conversion device 2, the interface unit 12 outputs the voice signal conversion model to the voice signal conversion device 2.
 The storage unit 13 is configured using a non-transitory computer-readable storage medium device such as a magnetic hard disk device or a semiconductor storage device. The storage unit 13 stores various kinds of information related to the voice signal conversion model learning device 1. The storage unit 13 stores, for example, the natural signal group and the synthesized signal group in advance. The storage unit 13 also stores, for example, synthesized signals and natural signals input via the input unit 11 or the interface unit 12, and stores, for example, the identification results of the first identification unit 120.
 The storage unit 13 further stores, for example, the identification results of the second identification unit 160, the determination results of the first determination unit 140 and the second determination unit 180, the determination results of the first input determination unit 130 and the second input determination unit 170, and the voice signal conversion model.
 The output unit 14 outputs various kinds of information. The output unit 14 includes a display device such as a CRT (Cathode Ray Tube) display, a liquid crystal display, or an organic EL (Electro-Luminescence) display. The output unit 14 may be configured as an interface that connects such a display device to the device itself. The output unit 14 outputs, for example, information input to the input unit 11.
 FIG. 12 is a diagram showing an example of the functional configuration of the control unit 10 in the embodiment. The control unit 10 includes a managed unit 101 and a management unit 102. The managed unit 101 includes the first generation unit 110, the first identification unit 120, the first input determination unit 130, the first determination unit 140, the second generation unit 150, the second identification unit 160, the second input determination unit 170, and the second determination unit 180. The managed unit 101 obtains the voice signal conversion model by executing the forward conversion signal identification process, the forward conversion learning process, the forward conversion signal identification learning process, the inverse conversion signal identification process, the inverse conversion learning process, and the inverse conversion signal identification learning process using the voice signals included in the natural signal group and the synthesized signal group. Specifically, the voice signal conversion model is a trained model representing the forward conversion processing performed by the first generation unit 110.
 The management unit 102 controls the operation of the managed unit 101. For example, the management unit 102 controls the timing at which the managed unit 101 executes each of the forward conversion signal identification process, the forward conversion learning process, the forward conversion signal identification learning process, the inverse conversion signal identification process, the inverse conversion learning process, and the inverse conversion signal identification learning process.
 The management unit 102 also controls, for example, the operations of the input unit 11, the interface unit 12, the storage unit 13, and the output unit 14. The management unit 102 reads, for example, various kinds of information from the storage unit 13 and outputs them to the managed unit 101. The management unit 102 acquires, for example, information input to the input unit 11 and outputs it to the managed unit 101 or records it in the storage unit 13. The management unit 102 likewise acquires, for example, information input to the interface unit 12 and outputs it to the managed unit 101 or records it in the storage unit 13. The management unit 102 also causes the output unit 14 to output, for example, information input to the input unit 11.
 The management unit 102 records, for example, the identification results of the first identification unit 120 and the second identification unit 160 in the storage unit 13. The management unit 102 likewise records, for example, the determination results of the first determination unit 140 and the second determination unit 180 and the determination results of the first input determination unit 130 and the second input determination unit 170 in the storage unit 13.
 FIG. 13 is a diagram showing an example of the hardware configuration of the voice signal conversion device 2 in the embodiment.
 The voice signal conversion device 2 includes a control unit 20 including a processor 93, such as a CPU, and a memory 94 connected by a bus, and executes a program. By executing the program, the voice signal conversion device 2 functions as a device including the control unit 20, an input unit 21, an interface unit 22, a storage unit 23, and an output unit 24. More specifically, the processor 93 reads the program stored in the storage unit 23 and stores the read program in the memory 94. By the processor 93 executing the program stored in the memory 94, the voice signal conversion device 2 functions as a device including the control unit 20, the input unit 21, the interface unit 22, the storage unit 23, and the output unit 24.
 The control unit 20 controls the operation of the various functional units included in the voice signal conversion device 2. The control unit 20 converts an unnaturally synthesized signal into a naturally synthesized signal using, for example, the voice signal conversion model obtained by the voice signal conversion model learning device 1.
 The input unit 21 includes an input device such as a mouse, a keyboard, or a touch panel. The input unit 21 may be configured as an interface that connects such an input device to the device itself. The input unit 21 receives input of various kinds of information to the device. The input unit 21 receives, for example, an input instructing the start of the processing of converting an unnaturally synthesized signal into a naturally synthesized signal. The input unit 21 also receives, for example, an input of the unnaturally synthesized signal to be converted.
 The interface unit 22 includes a communication interface for connecting the device to an external device. The interface unit 22 communicates with the external device by wire or wirelessly. The external device is, for example, the output destination of the naturally synthesized signal. In such a case, the interface unit 22 outputs the naturally synthesized signal to the external device through communication with the external device. The external device serving as the output destination of the naturally synthesized signal is, for example, an audio output device such as a speaker.
 The external device may also be, for example, a storage device such as a USB memory that stores the voice signal conversion model. When the external device stores the voice signal conversion model and outputs it, the interface unit 22 acquires the voice signal conversion model through communication with the external device.
 The external device may also be, for example, the output source of an unnaturally synthesized signal. In such a case, the interface unit 22 acquires the unnaturally synthesized signal from the external device through communication with the external device.
 The interface unit 22 further includes a communication interface for connecting the device to the voice signal conversion model learning device 1. The interface unit 22 communicates with the voice signal conversion model learning device 1 by wire or wirelessly, and acquires the voice signal conversion model from the voice signal conversion model learning device 1 through that communication.
 The storage unit 23 is configured using a non-transitory computer-readable storage medium device such as a magnetic hard disk device or a semiconductor storage device. The storage unit 23 stores various kinds of information related to the voice signal conversion device 2. The storage unit 23 stores, for example, the voice signal conversion model acquired via the interface unit 22.
 The output unit 24 outputs various kinds of information. The output unit 24 includes a display device such as a CRT display, a liquid crystal display, or an organic EL display. The output unit 24 may be configured as an interface that connects such a display device to the device itself. The output unit 24 outputs, for example, information input to the input unit 21.
 FIG. 14 is a diagram showing an example of the functional configuration of the control unit 20 in the embodiment. The control unit 20 includes a conversion target acquisition unit 201, a conversion unit 202, and a voice signal output control unit 203.
 The conversion target acquisition unit 201 acquires the unnaturally synthesized signal to be converted. The conversion target acquisition unit 201 acquires, for example, an unnaturally synthesized signal input to the input unit 21 or to the interface unit 22.
 The conversion unit 202 converts the conversion target acquired by the conversion target acquisition unit 201 into a naturally synthesized signal using the voice signal conversion model. The naturally synthesized signal is output to the voice signal output control unit 203.
 The voice signal output control unit 203 controls the operation of the interface unit 22. By controlling the operation of the interface unit 22, the voice signal output control unit 203 causes the interface unit 22 to output the naturally synthesized signal.
 FIG. 15 is a flowchart showing an example of the flow of processing executed by the voice signal conversion device 2 in the embodiment. The control unit 20 acquires an unnaturally synthesized signal input to the interface unit 22 (step S901). Next, the control unit 20 converts the unnaturally synthesized signal into a naturally synthesized signal using the voice signal conversion model stored in the storage unit 23 (step S902). The control unit 20 then controls the operation of the interface unit 22 so that the naturally synthesized signal is output to an output destination (step S903). The output destination is, for example, an external device such as a speaker.
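 As an illustration only, the conversion flow of steps S901 to S903 could look like the following sketch: a trained conversion model is loaded, applied to an unnaturally synthesized waveform, and the result is written out for playback. The file names, the use of a serialized PyTorch model, and the soundfile library are assumptions made for this example and are not part of the embodiment.

```python
# Illustrative inference sketch for the voice signal conversion device 2
# (hypothetical file names and model format; assumes a mono recording).
import torch
import soundfile as sf

model = torch.jit.load("conversion_model.pt")      # assumed serialized voice signal conversion model
model.eval()

waveform, sample_rate = sf.read("unnaturally_synthesized.wav", dtype="float32")  # step S901
x = torch.from_numpy(waveform).view(1, 1, -1)

with torch.no_grad():
    converted = model(x).view(-1).numpy()          # step S902: apply the conversion model

sf.write("naturally_synthesized.wav", converted, sample_rate)                    # step S903: output
```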
 (Experimental Results)
 FIGS. 16 and 17 show the results of a comparison experiment (hereinafter referred to as the "first experiment") between the voice signal conversion model obtained by the voice signal generation system 100 and voice signal conversion models obtained by other learning methods.
 The first experiment was conducted using 437 sentences included in a Japanese speech data set of a female narrator. Of the 437 sentences, 407 sentences (about one hour) were used to obtain the voice signal conversion models, and 30 sentences (four minutes) were used to obtain five-level MOS (Mean Opinion Score) ratings of the naturalness of the sound quality. The audio sampling rate was 22.05 kHz. There were ten subjects, and each subject evaluated 30 sentences and 20 sentences randomly selected for each learning method.
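 For reference, a five-level MOS is simply the mean of the listeners' ratings on a 1 to 5 scale per system. The snippet below is a minimal illustration with made-up ratings; the actual scores of the first experiment are those plotted in FIGS. 16 and 17.

```python
# Minimal MOS computation over hypothetical listener ratings (1-5 scale).
import numpy as np

ratings = {"SPSS": [3, 3, 4, 2, 3], "V2msp": [5, 4, 4, 5, 4]}    # made-up example scores
for system, scores in ratings.items():
    scores = np.asarray(scores, dtype=float)
    mos = scores.mean()
    ci95 = 1.96 * scores.std(ddof=1) / np.sqrt(len(scores))      # normal-approximation 95% interval
    print(f"{system}: MOS = {mos:.2f} +/- {ci95:.2f}")
```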
 FIG. 16 is a first diagram showing an example of the results of the first experiment, and FIG. 17 is a second diagram showing an example of the results of the first experiment. The horizontal axes of FIGS. 16 and 17 indicate the method used to obtain the voice signal conversion model, and the vertical axes indicate the five-level MOS rating of the naturalness of the sound quality. The dotted horizontal lines in FIGS. 16 and 17 represent the evaluation result for natural speech.
 "SPSS" denotes DNN (Deep Neural Network) text-to-speech synthesis (SPSS: Statistical Parametric Speech Synthesis). "GANv" denotes a correction method applied to voice features. "V1" denotes a method that uses a downsampling module with a convolutional neural network.
 "V2" denotes a method of obtaining a voice signal conversion model with a voice signal generation system 100 that includes a first simple identification unit in place of the first identification unit 120 and a second simple identification unit in place of the second identification unit 160. The first simple identification unit includes the voice waveform identification unit 121 but not the voice feature identification unit 122 or the integrated identification unit 123, and is a discriminator that identifies, from the waveform of an input voice signal, whether the input voice signal is a natural signal or a forward conversion signal. The second simple identification unit includes the voice waveform identification unit 161 but not the voice feature identification unit 162 or the integrated identification unit 163, and is a discriminator that identifies, from the waveform of an input voice signal, whether the input voice signal is an inverse conversion signal or a pre-synthesized signal.
 "V2msp" denotes a method of obtaining a voice signal conversion model with a voice signal generation system 100 that includes a third simple identification unit in place of the first identification unit 120 and a fourth simple identification unit in place of the second identification unit 160. The third simple identification unit includes the voice waveform identification unit 121, the voice feature identification unit 122, and the integrated identification unit 123, and its voice feature identification unit 122 uses the mel spectrogram of the voice signal to be identified as the feature used for identification. The fourth simple identification unit includes the voice waveform identification unit 161, the voice feature identification unit 162, and the integrated identification unit 163, and its voice feature identification unit 162 likewise uses the mel spectrogram of the voice signal to be identified as the feature used for identification.
 "V2ph" denotes a method of obtaining a voice signal conversion model with a voice signal generation system 100 that includes a fifth simple identification unit in place of the first identification unit 120 and a sixth simple identification unit in place of the second identification unit 160. The fifth simple identification unit includes the voice waveform identification unit 121, the voice feature identification unit 122, and the integrated identification unit 123, and its voice feature identification unit 122 uses the phase spectrum of the voice signal to be identified as the feature used for identification. The sixth simple identification unit includes the voice waveform identification unit 161, the voice feature identification unit 162, and the integrated identification unit 163, and its voice feature identification unit 162 likewise uses the phase spectrum of the voice signal to be identified as the feature used for identification.
 "V2mfcc" denotes a method of obtaining a voice signal conversion model with a voice signal generation system 100 that includes a seventh simple identification unit in place of the first identification unit 120 and an eighth simple identification unit in place of the second identification unit 160. The seventh simple identification unit includes the voice waveform identification unit 121, the voice feature identification unit 122, and the integrated identification unit 123, and its voice feature identification unit 122 uses the mel-frequency cepstral coefficients of the voice signal to be identified as the feature used for identification. The eighth simple identification unit includes the voice waveform identification unit 161, the voice feature identification unit 162, and the integrated identification unit 163, and its voice feature identification unit 162 likewise uses the mel-frequency cepstral coefficients of the voice signal to be identified as the feature used for identification.
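 The three discriminator input features compared here (mel spectrogram for V2msp, phase spectrum for V2ph, and mel-frequency cepstral coefficients for V2mfcc) could be extracted, for example, as in the following sketch. The use of librosa and the parameter values (FFT size, hop length, number of mel bands and coefficients) are assumptions made for illustration and are not specified by the embodiment.

```python
# Sketch of the three feature types fed to the voice feature identification units.
import numpy as np
import librosa

sr = 22050
t = np.arange(sr) / sr
y = 0.5 * np.sin(2 * np.pi * 220.0 * t).astype(np.float32)       # 1-second stand-in waveform

mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024, hop_length=256, n_mels=80)
log_mel = np.log(mel + 1e-6)                                      # mel spectrogram (V2msp)

stft = librosa.stft(y, n_fft=1024, hop_length=256)
phase = np.angle(stft)                                            # phase spectrum (V2ph)

mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=25)                # mel-frequency cepstral coefficients (V2mfcc)
```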
 FIGS. 16 and 17 show that V1 greatly improves the sound quality compared with SPSS. Here, an improvement in sound quality means that the natural signal degree becomes higher. FIGS. 16 and 17 also show that V1 improves the sound quality over GANv, that V2 improves the sound quality over SPSS, and that V2 does not improve the sound quality over V1; this is because V2 generates noisier speech than V1. FIGS. 16 and 17 further show that V2msp and V2mfcc obtain higher MOS ratings than V1, V2, V2ph, SPSS, and GANv.
 FIGS. 16 and 17 also show that, for V2msp and V2mfcc, the p-value of a two-sided Mann-Whitney test is 0.05 or more, which indicates that there is no statistically significant difference between the voice signals converted by V2msp or V2mfcc and the natural signals. FIGS. 16 and 17 show that V2ph produces noisy speech and has a lower MOS rating than V2. The results in FIGS. 16 and 17 suggest that it is effective to use the voice waveform discriminators (that is, the voice waveform identification units 121 and 161) together with the voice feature identification units 122 and 162. In FIGS. 16 and 17, "V2msp", "V2ph", and "V2mfcc" are examples of processing that converts speech using the voice signal conversion model obtained by the voice signal generation system 100.
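 The reported two-sided Mann-Whitney test can be reproduced, for example, with SciPy as in the sketch below; the score arrays are made-up stand-ins for the listeners' MOS ratings, not the data of the experiment.

```python
# Two-sided Mann-Whitney U test between hypothetical MOS ratings; a p-value of
# 0.05 or more means no statistically significant difference is detected.
from scipy.stats import mannwhitneyu

converted = [4, 5, 4, 4, 5, 3, 4, 5]   # made-up ratings for converted speech
natural = [5, 4, 4, 5, 4, 5, 4, 4]     # made-up ratings for recorded natural speech

stat, p_value = mannwhitneyu(converted, natural, alternative="two-sided")
print(f"U = {stat}, p = {p_value:.3f}")
```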
 FIG. 18 shows the results of a comparison experiment (hereinafter referred to as the "second experiment") between the voice signal conversion model obtained by the voice signal generation system 100 and voice signal conversion models obtained by other learning methods.
 The second experiment was conducted using 13,100 sentences included in the English speech data set LJSpeech (see Reference 1). Of the 13,100 sentences, 40 sentences were used to obtain five-level MOS ratings of the naturalness of the sound quality. The audio sampling rate was 22.05 kHz. There were 14 subjects, and each subject evaluated 15 sentences for each learning method. In the second experiment, spectral distortion was also calculated.
 FIG. 18 is a diagram showing an example of the results of the second experiment. FIG. 18 shows, for each learning method, the LSD (least squared distance) value and the MOS evaluation result. WORLD is the method described in Reference 2, Griffin-Lim is the method described in Reference 3, Open WaveNet is the method described in Reference 4, and WaveGlow is the method described in Reference 5.
 Reference 1: "The LJ Speech Dataset" [online] [retrieved March 30, 2020], Internet <URL: https://keithito.com/LJ-Speech-Dataset/>
 Reference 2: M. Morise, F. Yokomori, and K. Ozawa, "WORLD: a vocoder-based high-quality speech synthesis system for real-time applications," IEICE Transactions on Information and Systems, vol. 99, no. 7, pp. 1877-1884, 2016.
 Reference 3: D. Griffin and J. Lim, "Signal estimation from modified short-time Fourier transform," IEEE Transactions on Audio, Speech and Language Processing (TASLP), vol. 32, no. 2, pp. 236-243, 1984.
 Reference 4: Ryuichi Yamamoto et al., "WaveNet vocoder" [online] [retrieved March 30, 2020], Internet <URL: https://doi.org/10.5281/zenodo.1472609>
 Reference 5: R. Prenger, R. Valle, and B. Catanzaro, "WaveGlow: A flow-based generative network for speech synthesis," 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 3617-3621, 2019.
 FIG. 18 shows that Griffin-Lim has the lowest LSD. FIG. 18 also shows that the LSD of WORLD is large because WORLD is a parametric vocoder and therefore introduces large distortion. On the other hand, FIG. 18 shows that there is no difference in MOS rating between Griffin-Lim and WORLD.
 FIG. 18 shows that, when WaveGlow and open WaveNet are compared, open WaveNet has the larger LSD, whereas WaveGlow has the higher MOS rating. These results indicate that an LSD of around 4 is unlikely to affect the MOS rating. FIG. 18 shows that V2msp obtains the highest LSD and the highest MOS rating.
 In FIG. 18, "Recorded" denotes the recorded speech itself, which is the target speech for the converted speech. Therefore, there is no LSD value corresponding to "Recorded".
 The voice signal generation system 100 converts an input waveform into a waveform with a higher natural signal degree. Therefore, even when speech whose bandwidth has been reduced compared with the original speech (degraded speech) is input, the voice signal generation system 100 can convert the input speech into speech whose bandwidth has been restored. This means that the voice signal generation system 100 has expanded the bandwidth.
 FIG. 19 shows the results of a comparison experiment (hereinafter referred to as the "third experiment") between the voice signal conversion model obtained by the voice signal generation system 100 and voice signal conversion models obtained by other learning methods.
 In the third experiment, 50 sentences randomly selected for each of 60 speakers randomly chosen from the 109 speakers included in the English speech data set VCTK (see Reference 6), 3,000 sentences in total, were used to obtain the voice signal conversion models. Then, two male and two female speakers were randomly selected from the remaining speakers, two utterances were randomly selected for each selected speaker (eight sentences in total), and a MUSHRA test on sound quality was conducted.
 Reference 6: Ryuichi Yamamoto et al., "WaveNet vocoder" [online] [retrieved March 30, 2020], Internet <URL: https://doi.org/10.5281/zenodo.1472609>
 FIG. 19 is a diagram showing an example of the results of the third experiment. The vertical axis of FIG. 19 indicates the MUSHRA test results, and the horizontal axis indicates the method to be evaluated. "48" on the horizontal axis of FIG. 19 denotes natural speech sampled at 48 kHz. "16to48" on the horizontal axis of FIG. 19 denotes speech whose bandwidth has been expanded by the voice signal generation system 100.
 "8to48" and "8to16to48" on the horizontal axis of FIG. 19 also denote speech whose bandwidth has been expanded by the voice signal generation system 100. The differences among "16to48", "8to48", and "8to16to48" are as follows.
 "16to48" denotes the speech obtained when "16" is input to the voice signal conversion device 2 as degraded speech, the voice signal conversion model obtained by the voice signal generation system 100 is applied to "16", and the bandwidth of "16" is expanded up to 48 kHz. "16" denotes speech sampled at 48 kHz and then downsampled to 16 kHz.
 "8to48" denotes the speech obtained when "8" is input to the voice signal conversion device 2 as degraded speech, the voice signal conversion model obtained by the voice signal generation system 100 is applied to "8", and the bandwidth of "8" is expanded up to 48 kHz. "8" denotes speech sampled at 48 kHz and then downsampled to 8 kHz.
 "8to16to48" denotes the speech obtained when "8" is input to the voice signal conversion device 2 as degraded speech and converted to "16", and then "16" is input to the voice signal conversion device 2 as degraded speech and converted to "48". "48" denotes speech sampled at 48 kHz.
 "16" on the horizontal axis of FIG. 19 denotes natural speech downsampled to 16 kHz, and "4" denotes natural speech downsampled to 4 kHz.
 FIG. 19 shows that "16to48" differs little from the original sound, whereas "8to48" is degraded considerably compared with the original sound. The reason for the degradation is that most of the information in speech is concentrated at 16 kHz and below, so downsampling to 8 kHz greatly reduces the amount of information and learning does not proceed well. FIG. 19 also shows that "8to16to48" has higher sound quality than "8to48".
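 As an illustration of how the degraded inputs of the third experiment could be prepared and chained, the sketch below downsamples a 48 kHz waveform with SciPy; convert_8_to_16 and convert_16_to_48 are hypothetical stand-ins for trained conversion models of the voice signal conversion device 2 and are therefore left as comments.

```python
# Sketch of preparing degraded inputs for the "16to48", "8to48" and "8to16to48" conditions.
import numpy as np
from scipy.signal import resample_poly

def downsample(x, orig_sr, target_sr):
    return resample_poly(x, up=target_sr, down=orig_sr)

sr = 48000
t = np.arange(sr) / sr
speech_48k = 0.3 * np.sin(2 * np.pi * 200.0 * t)       # stand-in for a 48 kHz recording

speech_16k = downsample(speech_48k, 48000, 16000)       # input for the "16to48" condition
speech_8k = downsample(speech_48k, 48000, 8000)         # input for the "8to48" condition

# "8to16to48": first restore the 8 kHz material to 16 kHz quality, then expand it to 48 kHz.
# restored_16k = convert_8_to_16(speech_8k)              # hypothetical trained model
# expanded_48k = convert_16_to_48(restored_16k)          # hypothetical trained model
```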
 The voice signal generation system 100 of the embodiment configured as described above obtains a voice signal conversion model by executing the forward conversion signal identification process, the forward conversion learning process, the forward conversion signal identification learning process, the inverse conversion signal identification process, the inverse conversion learning process, and the inverse conversion signal identification learning process using both the voice waveform and the voice features of the voice signal, rather than only one of them. Therefore, the voice signal generation system 100 configured in this way can generate a voice signal with a higher natural signal degree than when a voice signal conversion model is obtained using only the voice waveform; that is, it can generate speech closer to the speech uttered by a human. An example of a method of obtaining a voice signal conversion model using only the voice waveform is SEGAN (Speech Enhancement Generative Adversarial Network).
 Furthermore, because the voice signal generation system 100 of the embodiment configured as described above obtains the voice signal conversion model by executing the forward conversion signal identification process, the forward conversion learning process, the forward conversion signal identification learning process, the inverse conversion signal identification process, the inverse conversion learning process, and the inverse conversion signal identification learning process, it can generate speech closer to the speech uttered by a human than when a voice signal conversion model is obtained only by a convolutional neural network using the voice waveform and the voice features.
 Because the voice signal generation system 100 of the embodiment configured as described above uses the forward conversion signal identification process, the forward conversion learning process, the forward conversion signal identification learning process, the inverse conversion signal identification process, the inverse conversion learning process, and the inverse conversion signal identification learning process, it can generate speech close to the speech uttered by a human even when the alignment of the voice signals used for learning is low. The voice signal generation system 100 therefore has the advantage that its applicable situations are less limited than those of SEGAN (Speech Enhancement Generative Adversarial Networks) (see Reference 7), which is effective only when the alignment is high. High alignment means that the difference between the voice signals used for learning and the ideal voice signals that the user wants the voice signal generation system 100 to output is small. A learning voice signal with high alignment is, for example, a voice signal obtained by superimposing noise on speech recorded in an ideal environment on a computer to simulate speech in a noisy environment and then removing the noise. Learning voice signals with low alignment are, for example, synthesized speech generated by text-to-speech synthesis or voice conversion; because the lengths of such voice signals also differ from signal to signal, their alignment is low in this respect as well.
 Reference 7: S. Pascual et al., "SEGAN: Speech enhancement generative adversarial network," 2017 Annual Conference of the International Speech Communication Association (INTERSPEECH), pp. 3642-3646, 2017.
 (Modifications)
 The method by which the voice signal generation system 100 generates the voice signal conversion model does not necessarily have to be a convolutional CycleGAN. The method by which the voice signal generation system 100 generates the voice signal conversion model (hereinafter referred to as the "model generation method") may be any method that satisfies the following learning method conditions.
 The learning method conditions include a first condition. The first condition is that the model generation method uses a first generator that outputs a forward conversion signal, which is a signal with a higher natural signal degree than the input voice signal, by executing forward conversion processing, which is a conversion that increases the natural signal degree, on the input voice signal.
 The learning method conditions include a second condition. The second condition is that the model generation method uses a first discriminator that identifies whether an input signal is a forward conversion signal or a natural signal.
 The learning method conditions include a third condition. The third condition is that the model generation method uses a second generator that outputs an inverse conversion signal, which has a lower natural signal degree than the forward conversion signal, by executing inverse conversion processing, which is a conversion that lowers the natural signal degree, on an input signal.
 The learning method conditions include a fourth condition. The fourth condition is that the model generation method uses a second discriminator that identifies whether an input signal is a pre-synthesized signal, which is a synthesized signal prepared in advance, or an inverse conversion signal. The synthesized signal that the second identification unit 160 reads from the synthesized signal group is an example of the pre-synthesized signal.
 The learning method conditions include a fifth condition. The fifth condition is that the first generator, the first discriminator, the second generator, and the second discriminator learn based on the identification results of the first discriminator and the identification results of the second discriminator.
 The learning method conditions may further include the following weak discriminator condition. The weak discriminator condition is that at least one of the first discriminator and the second discriminator learns using a voice waveform discriminator and a voice feature discriminator. The model generation method may therefore be, for example, a method that uses a third generator different from the first generator and the second generator and a third discriminator different from the first discriminator and the second discriminator.
 The first generation unit 110 is an example of the first generator, the first identification unit 120 is an example of the first discriminator, the second generation unit 150 is an example of the second generator, and the second identification unit 160 is an example of the second discriminator.
 As long as the method of generating the voice signal conversion model satisfies at least the first to fifth conditions, the voice signal generation system 100 can generate speech close to the speech uttered by a human even when the alignment of the voice signals used for learning is low.
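 The data flow implied by the first to fifth conditions can be summarised in the following sketch, in which G, F, D_nat, and D_syn are placeholder callables standing in for the first generator, the second generator, the first discriminator, and the second discriminator. Only the wiring follows the conditions; the concrete losses and update rules are those described above, and nothing in the sketch is specific to a convolutional CycleGAN.

```python
# Data-flow sketch of learning method conditions 1 to 5 (placeholder networks).
def one_learning_round(G, F, D_nat, D_syn, input_signal, natural_signal, presynthesized_signal):
    forward_signal = G(input_signal)              # condition 1: raise the natural signal degree
    p_natural_real = D_nat(natural_signal)        # condition 2: natural vs. forward conversion signal
    p_natural_fake = D_nat(forward_signal)
    inverse_signal = F(forward_signal)            # condition 3: lower the natural signal degree again
    p_synth_real = D_syn(presynthesized_signal)   # condition 4: pre-synthesized vs. inverse conversion signal
    p_synth_fake = D_syn(inverse_signal)
    # condition 5: G, F, D_nat and D_syn are all updated from these identification results
    return p_natural_real, p_natural_fake, p_synth_real, p_synth_fake
```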
 The voice waveform identification unit 121 and the voice waveform identification unit 161 may identify voice signals based on a frequency spectrum converted according to a perceptual scale of pitch. The perceptual scale of pitch is, for example, the mel scale. A frequency spectrum converted according to the perceptual scale of pitch is, for example, a spectrum represented by mel-frequency cepstral coefficients. The frequency spectrum may be, for example, a phase spectrum or an amplitude spectrum. A frequency spectrum converted according to the perceptual scale of pitch may also be, for example, a mel spectrogram. By relying on a perceptual scale of pitch in this way, information about human perception can also be used for generating speech, so the voice signal generation system 100 can generate speech even closer to the speech uttered by a human.
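 As an illustration of such a perceptual scale, one widely used hertz-to-mel mapping is shown below; the embodiment does not depend on this particular formula.

```python
# A common (HTK-style) hertz-to-mel conversion, shown only as an illustration
# of a perceptual scale of pitch.
import math

def hz_to_mel(f_hz: float) -> float:
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

print(hz_to_mel(1000.0))   # 1000 Hz is roughly 1000 mel on this scale
```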
 なお、音声信号変換モデル学習装置1は必ずしも入力された音声信号を人間の発する音声に近い音声の音声信号に変換する学習モデルを学習する必要は無い。音声信号変換モデル学習装置1は入力された音声信号を犬や猫等の人間以外の動物の音声に近い音声の音声信号に変換する学習モデルを学習してもよい。このような場合、音声信号変換装置2は入力された音声を人間以外の動物の音声に近い音声信号に変換する。上述の通り、本実施形態における動物は人間を含む。 Note that the audio signal conversion model learning device 1 does not necessarily have to learn a learning model that converts an input audio signal into an audio signal that is close to the audio emitted by a human being. The voice signal conversion model learning device 1 may learn a learning model that converts an input voice signal into a voice signal of a voice close to the voice of an animal other than a human such as a dog or a cat. In such a case, the voice signal conversion device 2 converts the input voice into a voice signal close to the voice of an animal other than a human. As described above, the animals in this embodiment include humans.
 なお、不自然信号と自然合成信号とは、同じ種別の動物の音声信号であることが望ましいが必ずしも同じでなくてもよい。 It is desirable that the unnatural signal and the naturally synthesized signal are audio signals of the same type of animal, but they do not necessarily have to be the same.
 なお、被管理部101は学習部の一例である。なお、不自然信号は入力信号の一例である。 The managed unit 101 is an example of the learning unit. The unnatural signal is an example of an input signal.
 音声信号変換モデル学習装置1は、ネットワークを介して通信可能に接続された複数台の情報処理装置を用いて実装されてもよい。この場合、音声信号変換モデル学習装置1が備える各機能部は、複数の情報処理装置に分散して実装されてもよい。例えば、第1生成部110と、第1識別部120と、第2生成部150と、第2識別部160とはそれぞれ異なる情報処理装置に実装されてもよい。 The voice signal conversion model learning device 1 may be implemented using a plurality of information processing devices communicably connected via a network. In this case, the functional units included in the voice signal conversion model learning device 1 may be implemented in a distributed manner across the plurality of information processing devices. For example, the first generation unit 110, the first identification unit 120, the second generation unit 150, and the second identification unit 160 may be implemented in different information processing devices.
 音声信号変換装置2は、ネットワークを介して通信可能に接続された複数台の情報処理装置を用いて実装されてもよい。この場合、音声信号変換装置2が備える各機能部は、複数の情報処理装置に分散して実装されてもよい。 The voice signal conversion device 2 may be implemented using a plurality of information processing devices communicably connected via a network. In this case, the functional units included in the voice signal conversion device 2 may be implemented in a distributed manner across the plurality of information processing devices.
 なお、音声信号生成システム100の各機能の全て又は一部は、ASIC(Application Specific Integrated Circuit)やPLD(Programmable Logic Device)やFPGA(Field Programmable Gate Array)等のハードウェアを用いて実現されてもよい。プログラムは、コンピュータ読み取り可能な記録媒体に記録されてもよい。コンピュータ読み取り可能な記録媒体とは、例えばフレキシブルディスク、光磁気ディスク、ROM、CD-ROM等の可搬媒体、コンピュータシステムに内蔵されるハードディスク等の記憶装置である。プログラムは、電気通信回線を介して送信されてもよい。 All or part of the functions of the voice signal generation system 100 may be realized by using hardware such as an ASIC (Application Specific Integrated Circuit), a PLD (Programmable Logic Device), or an FPGA (Field Programmable Gate Array). The program may be recorded on a computer-readable recording medium. The computer-readable recording medium is, for example, a portable medium such as a flexible disk, a magneto-optical disk, a ROM, or a CD-ROM, or a storage device such as a hard disk built into a computer system. The program may be transmitted via a telecommunication line.
 以上、この発明の実施形態について図面を参照して詳述してきたが、具体的な構成はこの実施形態に限られるものではなく、この発明の要旨を逸脱しない範囲の設計等も含まれる。 Although the embodiments of the present invention have been described in detail with reference to the drawings, the specific configuration is not limited to this embodiment, and includes designs and the like within a range that does not deviate from the gist of the present invention.
 100…音声信号生成システム、 1…音声信号変換モデル学習装置、 2…音声信号変換装置、 10…制御部、 11…入力部、 12…インタフェース部、 13…記憶部、 14…出力部、 101…被管理部、 102…管理部、 110…第1生成部、 120…第1識別部、 121…音声波形識別部、 122…音声特徴量識別部、 123…統合識別部、 130…第1入力決定部、 140…第1判定部、 150…第2生成部、 160…第2識別部、 161…音声波形識別部、 162…音声特徴量識別部、 163…統合識別部、 170…第2入力決定部、 180…第2判定部、 20…制御部、 21…入力部、 22…インタフェース部、 23…記憶部、 24…出力部、 201…変換対象取得部、 202…変換部、 203…音声信号出力制御部、 91…プロセッサ、 92…メモリ、 93…プロセッサ、 94…メモリ 100 ... Voice signal generation system, 1 ... Voice signal conversion model learning device, 2 ... Voice signal conversion device, 10 ... Control unit, 11 ... Input unit, 12 ... Interface unit, 13 ... Storage unit, 14 ... Output unit, 101 ... Managed unit, 102 ... Management unit, 110 ... 1st generation unit, 120 ... 1st identification unit, 121 ... Voice waveform identification unit, 122 ... Voice feature amount identification unit, 123 ... Integrated identification unit, 130 ... 1st input determination Unit, 140 ... 1st judgment unit, 150 ... 2nd generation unit, 160 ... 2nd identification unit, 161 ... Voice waveform identification unit, 162 ... Voice feature amount identification unit, 163 ... Integrated identification unit, 170 ... 2nd input determination Unit, 180 ... 2nd judgment unit, 20 ... control unit, 21 ... input unit, 22 ... interface unit, 23 ... storage unit, 24 ... output unit, 201 ... conversion target acquisition unit, 202 ... conversion unit, 203 ... audio signal Output control unit, 91 ... processor, 92 ... memory, 93 ... processor, 94 ... memory

Claims (8)

  1.  入力された音声信号である入力信号を、実際に動物が発する音声である自然信号との類似の度合を示す自然信号度が前記入力信号よりも高い音声信号に変換する学習済みモデルを機械学習の方法で得る学習部、
     を備え、
     前記機械学習の方法は、入力された音声信号に対して自然信号度を高める変換である順変換処理を実行することで前記音声信号よりも自然信号度の高い信号である順変換信号を出力する第1生成部と、入力された信号が順変換信号と自然信号とのいずれであるかを識別する第1識別部と、入力された音声信号に対して自然信号度を低める変換である逆変換処理を実行することで前記音声信号よりも自然信号度の低い逆変換信号を出力する第2生成部と、入力された信号が予め用意された信号であって合成された信号である事前合成信号と逆変換信号とのいずれであるかを識別する第2識別部とが、前記第1識別部及び前記第2識別部の識別結果に基づいて学習する方法である、
     音声信号変換モデル学習装置。
    A voice signal conversion model learning device comprising:
    a learning unit that obtains, by a machine learning method, a trained model that converts an input signal, which is an input voice signal, into a voice signal whose natural signal degree, which indicates a degree of similarity to a natural signal that is a voice actually uttered by an animal, is higher than that of the input signal,
    wherein the machine learning method is a method in which a first generation unit, a first identification unit, a second generation unit, and a second identification unit learn based on identification results of the first identification unit and the second identification unit, the first generation unit outputting a forward conversion signal, which is a signal having a higher natural signal degree than an input voice signal, by executing on the voice signal a forward conversion process that is a conversion for raising the natural signal degree, the first identification unit identifying whether an input signal is a forward conversion signal or a natural signal, the second generation unit outputting an inverse conversion signal, which is a signal having a lower natural signal degree than an input voice signal, by executing on the voice signal an inverse conversion process that is a conversion for lowering the natural signal degree, and the second identification unit identifying whether an input signal is a pre-synthesized signal, which is a signal prepared in advance and synthesized, or an inverse conversion signal.
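    For illustration only, the following is a minimal sketch (in PyTorch) of one training step of a CycleGAN-style objective of the kind recited above, assuming a least-squares adversarial loss and an L1 cycle-consistency term: G performs the forward conversion, F_inv the inverse conversion, D_nat identifies forward conversion signals versus natural signals, and D_syn identifies inverse conversion signals versus pre-synthesized signals. The loss formulation, weights, and all function names are assumptions made for this sketch, not the claimed configuration.

```python
# Minimal sketch of one adversarial training step with cycle consistency.
import torch
import torch.nn.functional as F_loss

def training_step(G, F_inv, D_nat, D_syn, x_syn, y_nat,
                  opt_g, opt_d, lambda_cyc=10.0):
    # x_syn: batch of pre-synthesized (unnatural) signals
    # y_nat: batch of natural signals actually uttered by an animal
    y_fake = G(x_syn)          # forward conversion: raise the natural signal degree
    x_fake = F_inv(y_nat)      # inverse conversion: lower the natural signal degree

    # Generator update: fool both discriminators and keep cycle consistency.
    adv_g = F_loss.mse_loss(D_nat(y_fake), torch.ones_like(D_nat(y_fake))) \
          + F_loss.mse_loss(D_syn(x_fake), torch.ones_like(D_syn(x_fake)))
    cyc = F_loss.l1_loss(F_inv(y_fake), x_syn) + F_loss.l1_loss(G(x_fake), y_nat)
    loss_g = adv_g + lambda_cyc * cyc
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()

    # Discriminator update: natural vs forward-converted, pre-synthesized vs inverse-converted.
    loss_d = F_loss.mse_loss(D_nat(y_nat), torch.ones_like(D_nat(y_nat))) \
           + F_loss.mse_loss(D_nat(y_fake.detach()), torch.zeros_like(D_nat(y_fake.detach()))) \
           + F_loss.mse_loss(D_syn(x_syn), torch.ones_like(D_syn(x_syn))) \
           + F_loss.mse_loss(D_syn(x_fake.detach()), torch.zeros_like(D_syn(x_fake.detach())))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()
    return loss_g.item(), loss_d.item()
```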
  2.  前記機械学習の方法は、循環型敵対的学習(CycleGAN:Cycle Generative Adversarial Networks)の方法である、
     請求項1に記載の音声信号変換モデル学習装置。
    wherein the machine learning method is a method of cyclic adversarial learning (CycleGAN: Cycle Generative Adversarial Networks),
    The voice signal conversion model learning device according to claim 1.
  3.  前記第1識別部及び前記第2識別部の少なくとも1つは学習に用いる音声信号の波形に基づいて前記音声信号が自然信号か否かを識別する音声波形識別器と、前記音声信号から所定の条件を満たす情報である音声特徴量を取得し、取得した音声特徴量に基づいて前記音声信号が自然信号か否かを識別する音声特徴量識別器と、を用いて学習する、
     請求項1又は2に記載の音声信号変換モデル学習装置。
    wherein at least one of the first identification unit and the second identification unit learns using a voice waveform classifier that identifies whether or not a voice signal used for learning is a natural signal based on a waveform of the voice signal, and a voice feature amount classifier that acquires, from the voice signal, a voice feature amount that is information satisfying a predetermined condition and identifies whether or not the voice signal is a natural signal based on the acquired voice feature amount,
    The voice signal conversion model learning device according to claim 1 or 2.
  4.  前記音声波形識別器は、音高の知覚的尺度に基づいて変換された前記音声信号の周波数スペクトルである、
     請求項3に記載の音声信号変換モデル学習装置。
    wherein the voice waveform classifier is a frequency spectrum of the voice signal converted based on a perceptual scale of pitch,
    The voice signal conversion model learning device according to claim 3.
  5.  入力された音声信号である入力信号を、実際に動物が発する音声である自然信号との類似の度合を示す自然信号度が前記入力信号よりも高い音声信号に変換する学習済みモデルを機械学習の方法で得る学習部、を備え、前記機械学習の方法は、入力された音声信号に対して自然信号度を高める変換である順変換処理を実行することで前記音声信号よりも自然信号度の高い信号である順変換信号を出力する第1生成部と、入力された信号が順変換信号と自然信号とのいずれであるかを識別する第1識別部と、入力された音声信号に対して自然信号度を低める変換である逆変換処理を実行することで前記音声信号よりも自然信号度の低い逆変換信号を出力する第2生成部と、入力された信号が予め用意された信号であって合成された信号である事前合成信号と逆変換信号とのいずれであるかを識別する第2識別部とが、前記第1識別部及び前記第2識別部の識別結果に基づいて学習する方法である音声信号変換モデル学習装置が得た前記学習済みモデルを用いて、入力された音声信号を変換する変換部、
     を備える音声信号変換装置。
    A voice signal conversion device comprising:
    a conversion unit that converts an input voice signal by using the trained model obtained by a voice signal conversion model learning device, the voice signal conversion model learning device comprising a learning unit that obtains, by a machine learning method, a trained model that converts an input signal, which is an input voice signal, into a voice signal whose natural signal degree, which indicates a degree of similarity to a natural signal that is a voice actually uttered by an animal, is higher than that of the input signal, wherein the machine learning method is a method in which a first generation unit that outputs a forward conversion signal, which is a signal having a higher natural signal degree than an input voice signal, by executing on the voice signal a forward conversion process that is a conversion for raising the natural signal degree, a first identification unit that identifies whether an input signal is a forward conversion signal or a natural signal, a second generation unit that outputs an inverse conversion signal, which is a signal having a lower natural signal degree than an input voice signal, by executing on the voice signal an inverse conversion process that is a conversion for lowering the natural signal degree, and a second identification unit that identifies whether an input signal is a pre-synthesized signal, which is a signal prepared in advance and synthesized, or an inverse conversion signal, learn based on identification results of the first identification unit and the second identification unit.
  6.  入力された音声信号である入力信号を、実際に動物が発する音声である自然信号との類似の度合を示す自然信号度が前記入力信号よりも高い音声信号に変換する学習済みモデルを機械学習の方法で得る学習ステップ、
     を有し、
     前記機械学習の方法は、入力された音声信号に対して自然信号度を高める変換である順変換処理を実行することで前記音声信号よりも自然信号度の高い信号である順変換信号を出力する第1生成部と、入力された信号が順変換信号と自然信号とのいずれであるかを識別する第1識別部と、入力された音声信号に対して自然信号度を低める変換である逆変換処理を実行することで前記音声信号よりも自然信号度の低い逆変換信号を出力する第2生成部と、入力された信号が予め用意された信号であって合成された信号である事前合成信号と逆変換信号とのいずれであるかを識別する第2識別部とが、前記第1識別部及び前記第2識別部の識別結果に基づいて学習する方法である、
     音声信号変換モデル学習方法。
    A voice signal conversion model learning method comprising:
    a learning step of obtaining, by a machine learning method, a trained model that converts an input signal, which is an input voice signal, into a voice signal whose natural signal degree, which indicates a degree of similarity to a natural signal that is a voice actually uttered by an animal, is higher than that of the input signal,
    wherein the machine learning method is a method in which a first generation unit, a first identification unit, a second generation unit, and a second identification unit learn based on identification results of the first identification unit and the second identification unit, the first generation unit outputting a forward conversion signal, which is a signal having a higher natural signal degree than an input voice signal, by executing on the voice signal a forward conversion process that is a conversion for raising the natural signal degree, the first identification unit identifying whether an input signal is a forward conversion signal or a natural signal, the second generation unit outputting an inverse conversion signal, which is a signal having a lower natural signal degree than an input voice signal, by executing on the voice signal an inverse conversion process that is a conversion for lowering the natural signal degree, and the second identification unit identifying whether an input signal is a pre-synthesized signal, which is a signal prepared in advance and synthesized, or an inverse conversion signal.
    Voice signal conversion model learning method.
  7.  請求項1から4のいずれか一項に記載の音声信号変換モデル学習装置としてコンピュータを機能させるためのプログラム。 A program for causing a computer to function as the voice signal conversion model learning device according to any one of claims 1 to 4.
  8.  請求項5に記載の音声信号変換装置としてコンピュータを機能させるためのプログラム。 A program for causing a computer to function as the voice signal conversion device according to claim 5.
PCT/JP2020/015389 2020-04-03 2020-04-03 Sound signal conversion model learning device, sound signal conversion device, sound signal conversion model learning method, and program WO2021199446A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/JP2020/015389 WO2021199446A1 (en) 2020-04-03 2020-04-03 Sound signal conversion model learning device, sound signal conversion device, sound signal conversion model learning method, and program
JP2022511494A JP7368779B2 (en) 2020-04-03 2020-04-03 Audio signal conversion model learning device, audio signal conversion device, audio signal conversion model learning method and program

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2020/015389 WO2021199446A1 (en) 2020-04-03 2020-04-03 Sound signal conversion model learning device, sound signal conversion device, sound signal conversion model learning method, and program

Publications (1)

Publication Number Publication Date
WO2021199446A1 true WO2021199446A1 (en) 2021-10-07

Family

ID=77927771

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2020/015389 WO2021199446A1 (en) 2020-04-03 2020-04-03 Sound signal conversion model learning device, sound signal conversion device, sound signal conversion model learning method, and program

Country Status (2)

Country Link
JP (1) JP7368779B2 (en)
WO (1) WO2021199446A1 (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2019101391A (en) * 2017-12-07 2019-06-24 日本電信電話株式会社 Series data converter, learning apparatus, and program
JP2019144404A (en) * 2018-02-20 2019-08-29 日本電信電話株式会社 Voice conversion learning device, voice conversion device, method and program
JP2020027193A (en) * 2018-08-13 2020-02-20 日本電信電話株式会社 Voice conversion learning device, voice conversion device, method, and program
CN110246488A (en) * 2019-06-14 2019-09-17 苏州思必驰信息科技有限公司 Half optimizes the phonetics transfer method and device of CycleGAN model

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
FANG FUMING, YAMAGISHI JUNICHI, ECHIZEN ISAO: "High-quality nonparallel voice conversion using CycleGAN", IPSJ SIG TECHNICAL REPORT, vol. 2017, no. 9 (2017-SLP-119), 21 December 2017 (2017-12-21), pages 1 - 6, XP055937956 *

Also Published As

Publication number Publication date
JP7368779B2 (en) 2023-10-25
JPWO2021199446A1 (en) 2021-10-07

Similar Documents

Publication Publication Date Title
US10679643B2 (en) Automatic audio captioning
US11605368B2 (en) Speech recognition using unspoken text and speech synthesis
US10176811B2 (en) Neural network-based voiceprint information extraction method and apparatus
JP7018659B2 (en) Voice conversion device, voice conversion method and program
US11856369B1 (en) Methods and systems implementing phonologically-trained computer-assisted hearing aids
Su et al. Bandwidth extension is all you need
JP7257593B2 (en) Training Speech Synthesis to Generate Distinguishable Speech Sounds
US11823655B2 (en) Synthetic speech processing
WO2022017040A1 (en) Speech synthesis method and system
JP2015040903A (en) Voice processor, voice processing method and program
WO2019116889A1 (en) Signal processing device and method, learning device and method, and program
CN110648684B (en) Bone conduction voice enhancement waveform generation method based on WaveNet
EP4205109A1 (en) Synthesized data augmentation using voice conversion and speech recognition models
US11676572B2 (en) Instantaneous learning in text-to-speech during dialog
Huang et al. Refined wavenet vocoder for variational autoencoder based voice conversion
Li et al. Speech Audio Super-Resolution for Speech Recognition.
US20230013370A1 (en) Generating audio waveforms using encoder and decoder neural networks
JP7192882B2 (en) Speech rhythm conversion device, model learning device, methods therefor, and program
Wu et al. Multilingual text-to-speech training using cross language voice conversion and self-supervised learning of speech representations
JP7393585B2 (en) WaveNet self-training for text-to-speech
JP7423056B2 (en) Reasoners and how to learn them
WO2021199446A1 (en) Sound signal conversion model learning device, sound signal conversion device, sound signal conversion model learning method, and program
Mottini et al. Voicy: Zero-shot non-parallel voice conversion in noisy reverberant environments
Zheng et al. Incorporating ultrasound tongue images for audio-visual speech enhancement through knowledge distillation
Yun et al. Voice conversion of synthesized speeches using deep neural networks

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20928762

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2022511494

Country of ref document: JP

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20928762

Country of ref document: EP

Kind code of ref document: A1