WO2021199446A1 - Sound signal conversion model learning device, sound signal conversion device, sound signal conversion model learning method, and program - Google Patents

Sound signal conversion model learning device, sound signal conversion device, sound signal conversion model learning method, and program

Info

Publication number
WO2021199446A1
Authority
WO
WIPO (PCT)
Prior art keywords
signal
voice
unit
conversion
input
Prior art date
Application number
PCT/JP2020/015389
Other languages
French (fr)
Japanese (ja)
Inventor
田中 宏
弘和 亀岡
卓弘 金子
伸克 北条
Original Assignee
日本電信電話株式会社 (Nippon Telegraph and Telephone Corporation)
Application filed by 日本電信電話株式会社 (Nippon Telegraph and Telephone Corporation)
Priority to PCT/JP2020/015389 priority Critical patent/WO2021199446A1/en
Priority to JP2022511494A priority patent/JP7368779B2/en
Publication of WO2021199446A1 publication Critical patent/WO2021199446A1/en

Links

Images

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00: Speech synthesis; Text to speech systems
    • G10L 13/06: Elementary speech units used in speech synthesisers; Concatenation rules
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 21/00: Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/003: Changing voice quality, e.g. pitch or formants
    • G10L 21/007: Changing voice quality, e.g. pitch or formants characterised by the process used

Definitions

  • the present invention relates to a voice signal conversion model learning device, a voice signal conversion device, a voice signal conversion model learning method, and a program.
  • Techniques for generating desired speech from input information, such as speech generation by the parametric vocoder method (see Non-Patent Document 1) and statistical voice quality conversion (see Non-Patent Document 2), are being researched as techniques that have the potential to expand human communication ability and physical function.
  • The parametric vocoder voice generation technology has been widely studied for applications such as assistance for persons with physical disabilities (see Non-Patent Documents 3 and 4), language education support (see Non-Patent Documents 5 and 6), and amusement (see Non-Patent Document 7), because of the ease of system construction and its high versatility.
  • GAN: Generative Adversarial Networks
  • SEGAN: Speech Enhancement Generative Adversarial Network
  • an object of the present invention is to provide a technique for generating a voice closer to the voice emitted by an animal.
  • One aspect of the present invention is a voice signal conversion model learning device including a learning unit that obtains, by a machine learning method, a trained model that converts an input signal, which is an input audio signal, into an audio signal whose degree of natural signal (the degree of similarity to a natural signal actually emitted by an animal) is higher than that of the input signal. The machine learning method uses: a first generator that, by executing a forward conversion process that raises the degree of natural signal of an input voice signal, outputs a forward conversion signal, which is a signal having a higher degree of natural signal than the input; a first identification unit that identifies whether an input signal is a forward conversion signal or a natural signal; a second generator that, by executing an inverse conversion process that lowers the degree of natural signal of an input audio signal, outputs an inverse conversion signal having a lower degree of natural signal than the audio signal; and a second identification unit that identifies whether an input signal is a pre-synthesized signal, which is a synthesized signal prepared in advance, or an inverse conversion signal. The first generator and the second generator learn based on the identification results of the first identification unit and the second identification unit.
  • FIG. 3 is a flowchart showing an example of the flow of the forward conversion signal identification process in the embodiment.
  • FIG. 4 is a first flowchart showing an example of the flow of the forward conversion learning process in the embodiment.
  • FIG. 5 is a second flowchart showing an example of the flow of the forward conversion learning process in the embodiment.
  • FIG. 6 is a flowchart showing an example of the flow of the forward conversion signal identification learning process in the embodiment.
  • FIG. 8 is a flowchart showing an example of the flow of the inverse conversion learning process in the embodiment.
  • FIG. 10 is a flowchart showing an example of the flow of processing executed by the voice signal conversion model learning device 1 in the embodiment.
  • FIG. 15 is a flowchart showing an example of the flow of processing executed by the voice signal conversion device 2 in the embodiment.
  • FIG. 17 is a second diagram showing an example of the experimental results of the first experiment.
  • FIG. 1 is an explanatory diagram illustrating an outline of the audio signal generation system 100 of the embodiment.
  • The audio signal generation system 100 improves the degree of natural signal of an unnaturally synthesized signal, that is, a synthesized audio signal (hereinafter referred to as a “synthetic signal”) whose degree of similarity to a natural signal (hereinafter referred to as the “degree of natural signal”) is low (hereinafter referred to as an “unnaturally synthesized signal”).
  • a natural signal is a voice actually emitted by a human being.
  • the voice signal generation system 100 converts the input unnaturally synthesized signal into a naturally synthesized signal which is a composite signal having a higher degree of natural signal than the input unnaturally synthesized signal. Converting an unnaturally synthesized signal into a naturally synthesized signal is equivalent to generating a naturally synthesized signal based on the unnaturally synthesized signal.
  • the voice signal generation system 100 includes a voice signal conversion model learning device 1 and a voice signal conversion device 2.
  • the voice signal conversion model learning device 1 obtains a trained model (hereinafter referred to as “voice signal conversion model”) that generates a naturally synthesized signal based on an unnaturally synthesized signal by machine learning.
  • performing machine learning is called learning.
  • performing machine learning means appropriately adjusting the values of parameters in the machine learning model.
  • learning to be A means that the value of the parameter in the machine learning model is adjusted to satisfy A.
  • A represents a predetermined condition.
  • The voice signal conversion model learning device 1 receives a natural signal and a synthesized signal as inputs and learns the voice signal conversion model by cycle-consistent adversarial learning (CycleGAN: Cycle Generative Adversarial Networks), using a voice waveform classifier and a voice feature amount classifier as the discriminators.
  • the voice waveform classifier is a discriminator that discriminates whether or not the voice signal is a natural signal based on the waveform of the voice signal used for learning (hereinafter referred to as “voice waveform”).
  • the voice feature amount classifier is a classifier that acquires information satisfying a predetermined condition from a voice signal used for learning as a voice feature amount and discriminates whether or not the voice signal is a natural signal based on the acquired voice feature amount.
  • the CycleGAN using the voice waveform classifier and the voice feature amount classifier will be referred to as a convolutional CycleGAN.
  • the voice feature amount is, for example, a phase spectrum of a voice signal.
  • the natural signal and the combined signal input to the voice signal conversion model learning device 1 may be stored in advance in the storage unit included in the voice signal conversion model learning device 1.
  • The convolutional CycleGAN is a neural network that learns the voice waveform and the voice feature amount of the voice signal used for learning with different classifiers.
  • A neural network that learns each feature of the data used for learning with a different classifier for each feature is here called a convolutional neural network. Therefore, the convolutional CycleGAN is both a neural network obtained by modifying CycleGAN and a neural network obtained by modifying the convolutional neural network.
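  • As an illustration of this two-classifier structure, the following is a minimal PyTorch sketch of one waveform discriminator and one feature-amount discriminator. The patent text does not disclose concrete network architectures, so the layer types, kernel sizes, and the class names WaveformDiscriminator and FeatureDiscriminator are assumptions made only for illustration.

```python
import torch
import torch.nn as nn

class WaveformDiscriminator(nn.Module):
    """Judges real/fake directly from the raw waveform (voice waveform classifier)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, 32, kernel_size=31, stride=4, padding=15), nn.LeakyReLU(0.2),
            nn.Conv1d(32, 64, kernel_size=31, stride=4, padding=15), nn.LeakyReLU(0.2),
            nn.Conv1d(64, 1, kernel_size=3, padding=1),
        )

    def forward(self, wav):            # wav: (batch, 1, samples)
        return self.net(wav)           # patch-wise real/fake scores

class FeatureDiscriminator(nn.Module):
    """Judges real/fake from a derived voice feature amount (e.g. a mel spectrogram)."""
    def __init__(self, n_feature_bins=80):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(n_feature_bins, 128, kernel_size=5, padding=2), nn.LeakyReLU(0.2),
            nn.Conv1d(128, 1, kernel_size=3, padding=1),
        )

    def forward(self, feat):           # feat: (batch, n_feature_bins, frames)
        return self.net(feat)
```

  • In the terminology of the embodiment, such a pair corresponds to the voice waveform identification unit 121 (or 161) and the voice feature amount identification unit 122 (or 162), whose outputs are then combined by the integrated identification unit.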
  • FIG. 2 is an explanatory diagram illustrating an outline of the voice signal conversion model learning device 1 according to the embodiment.
  • the voice signal conversion model learning device 1 includes a first generation unit 110, a first identification unit 120, a first input determination unit 130, a second generation unit 150, a second identification unit 160, and a second input determination unit 170.
  • the first generation unit 110, the first identification unit 120, the second generation unit 150, and the second identification unit 160 are functional units for learning.
  • The first generation unit 110, the first identification unit 120, the first input determination unit 130, the second generation unit 150, the second identification unit 160, and the second input determination unit 170 operate in cooperation to execute CycleGAN.
  • the first generation unit 110 executes forward conversion processing on the input audio signal.
  • the forward conversion process is a process for improving the degree of natural signal of the input audio signal.
  • the first generation unit 110 outputs the audio signal after the forward conversion process as a forward conversion signal.
  • The first generation unit 110 learns based on the identification result of the first identification unit 120, which will be described later in detail.
  • The first generation unit 110 learns so as to further improve the degree of natural signal achieved by the forward conversion process.
  • A specific example of such learning is a process of appropriately adjusting the parameter values so as to reduce the value of a loss function that takes a larger value the lower the probability that the identification result of the first identification unit 120 is incorrect.
  • the first identification unit 120 identifies whether the input audio signal is a natural signal or a forward conversion signal.
  • the first identification unit 120 learns based on the identification result.
  • the first identification unit 120 includes a voice waveform identification unit 121, a voice feature amount identification unit 122, an integrated identification unit 123, and a first determination unit 140.
  • the audio signal input to the first identification unit 120 is input to the audio waveform identification unit 121.
  • the audio signal input to the audio waveform identification unit 121 is an audio signal determined by the first input determination unit 130, which will be described in detail later, and is a natural signal or a forward conversion signal.
  • the voice waveform identification unit 121 identifies whether the voice signal input to the first identification unit 120 is a natural signal or a forward conversion signal based on the voice waveform of the input voice signal.
  • the voice waveform identification unit 121 is an example of a voice waveform classifier.
  • the voice signal input to the first identification unit 120 is input to the voice feature amount identification unit 122. That is, the voice signal input to the voice feature amount identification unit 122 is the same as the voice signal input to the voice waveform identification unit 121.
  • the voice feature amount identification unit 122 acquires the voice feature amount based on the input voice signal.
  • the voice feature amount identification unit 122 identifies whether the voice signal input to the first identification unit 120 is a natural signal or a forward conversion signal based on the acquired voice feature amount.
  • the voice feature amount discriminating unit 122 is an example of a voice feature amount discriminator.
  • The integrated identification unit 123 identifies, based on the identification result of the voice waveform identification unit 121 and the identification result of the voice feature amount identification unit 122, whether the voice signal input to the first identification unit 120 is a natural signal or a forward conversion signal.
  • the identification result of the integrated identification unit 123 is the identification result of the first identification unit 120.
  • the identification result of the integrated identification unit 123 is output to the first determination unit 140.
  • the first determination unit 140 determines whether or not the identification result of the integrated identification unit 123 is correct based on the determination result of the first input determination unit 130.
  • the voice waveform identification unit 121, the voice feature amount identification unit 122, and the integrated identification unit 123 learn based on the determination result of the first determination unit 140.
  • The voice waveform identification unit 121, the voice feature amount identification unit 122, and the integrated identification unit 123 learn so as to further improve the accuracy of identification.
  • A specific example of such learning is a process of appropriately adjusting the parameter values so as to increase the value of a loss function that takes a larger value the lower the probability that the identification result of the integrated identification unit 123 is incorrect.
  • the first input determination unit 130 determines whether the audio signal input to the first identification unit 120 is a forward conversion signal or a natural signal.
  • When the first input determination unit 130 determines a natural signal as the audio signal to be input to the first identification unit 120, one natural signal belonging to the natural signal group shown in the central column of FIG. 2 is input to the first identification unit 120.
  • the natural signal group is a set of natural signals prepared in advance for learning.
  • the composite signal group shown in the central column of FIG. 2 is a set of synthetic signals prepared in advance for learning.
  • the composite signal belonging to the composite signal group is referred to as a pre-synthesized signal.
  • When the first input determination unit 130 determines a forward conversion signal as the audio signal to be input, the forward conversion signal is input to the first identification unit 120.
  • the second generation unit 150 executes an inverse transformation process on the input audio signal.
  • When a forward conversion signal is input as the audio signal, the inverse conversion process is executed on the acquired forward conversion signal.
  • When a natural signal is input as the audio signal, the inverse conversion process is executed on the acquired natural signal.
  • the inverse transformation process is a process of reducing the degree of natural signal of the input audio signal.
  • the second generation unit 150 outputs the audio signal after the inverse transformation processing as the inverse transformation signal.
  • the second generation unit 150 learns based on the identification result of the second identification unit 160, which will be described in detail later.
  • The second generation unit 150 learns so as to further reduce the degree of natural signal achieved by the inverse conversion process.
  • A specific example of such learning is a process of appropriately adjusting the parameter values so as to reduce the value of a loss function that takes a larger value the lower the probability that the identification result of the second identification unit 160 is incorrect.
  • The second identification unit 160 identifies whether the input audio signal is an inverse conversion signal or a pre-synthesized signal. The second identification unit 160 learns based on its own identification result.
  • the second identification unit 160 includes a voice waveform identification unit 161, a voice feature amount identification unit 162, an integrated identification unit 163, and a second determination unit 180.
  • The voice waveform identification unit 161 identifies, based on the voice waveform of the voice signal input to the second identification unit 160, whether the voice signal input to the second identification unit 160 is an inverse conversion signal or a pre-synthesized signal.
  • the voice waveform identification unit 161 is an example of a voice waveform classifier.
  • The voice feature amount identification unit 162 identifies, based on the voice feature amount of the voice signal input to the second identification unit 160, whether the voice signal input to the second identification unit 160 is an inverse conversion signal or a pre-synthesized signal.
  • the voice feature amount identification unit 162 is an example of a voice feature amount classifier.
  • The integrated identification unit 163 identifies, based on the identification result of the voice waveform identification unit 161 and the identification result of the voice feature amount identification unit 162, whether the voice signal input to the second identification unit 160 is an inverse conversion signal or a pre-synthesized signal.
  • the identification result of the integrated identification unit 163 is the identification result of the second identification unit 160.
  • the identification result of the integrated identification unit 163 is output to the second determination unit 180.
  • the second determination unit 180 determines whether or not the identification result of the second identification unit 160 is correct based on the determination result of the second input determination unit 170.
  • the voice waveform identification unit 161, the voice feature amount identification unit 162, and the integrated identification unit 163 learn based on the determination result of the second determination unit 180.
  • The voice waveform identification unit 161, the voice feature amount identification unit 162, and the integrated identification unit 163 learn so as to further improve the accuracy of identification.
  • A specific example of such learning is a process of appropriately adjusting the parameter values so as to increase the value of a loss function that takes a larger value the lower the probability that the identification result of the integrated identification unit 163 is incorrect.
  • The second input determination unit 170 determines whether the audio signal input to the second generation unit 150 is a forward conversion signal or a natural signal. Further, the second input determination unit 170 also determines whether the audio signal input to the second identification unit 160 is an inverse conversion signal or a pre-synthesized signal.
  • The first generation unit 110, the first identification unit 120, the second generation unit 150, and the second identification unit 160 operate in cooperation with each other and learn so as to reduce the objective function L represented by the following equation. That is, the objective function L is the loss function used when the first generation unit 110, the first identification unit 120, the second generation unit 150, and the second identification unit 160 learn.
  • H1 represents self-identical loss. More specifically, H1 is represented by the following equation (18).
  • D_xwave represents a discriminator that identifies what kind of signal the audio signal x is based on the waveform of the audio signal x.
  • D_ywave represents a discriminator that identifies what kind of signal the voice signal y is based on the waveform of the voice signal y.
  • D_xmsp represents a discriminator that identifies what kind of signal the audio signal x is based on the voice feature amount of the audio signal x.
  • D_ymsp represents a discriminator that identifies what kind of signal the voice signal y is based on the voice feature amount of the voice signal y.
  • the classifier is represented by the symbol D.
  • D_msp(A) is a function that outputs the probability that A is the target voice feature amount.
  • log(1 - D_msp(A)) is a term that takes a larger value as the probability that A is the target voice feature amount becomes smaller.
  • F(A) denotes the process of convolving A with a fast Fourier transform matrix windowed by a Hanning window and then applying a mel filter to the absolute value of the result of the convolution.
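  • A minimal NumPy sketch consistent with this description of F(A) (a Hanning-windowed short-time Fourier transform followed by a mel filter applied to the magnitudes) is shown below; the frame length, hop size, and number of mel bands are assumptions, and librosa is used only to build the mel filter matrix.

```python
import numpy as np
import librosa

def mel_feature(wav, sr=22050, n_fft=1024, hop=256, n_mels=80):
    """F(A): Hanning-windowed STFT of the waveform, then a mel filter bank on the magnitudes."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(wav) - n_fft) // hop
    frames = np.stack([wav[i * hop: i * hop + n_fft] * window for i in range(n_frames)])
    spectrum = np.abs(np.fft.rfft(frames, axis=1))             # magnitude spectrum per frame
    mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)
    return spectrum @ mel_fb.T                                  # shape: (frames, n_mels)
```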
  • λ_cyc, which multiplies the L_cyc term in the objective function, represents a weight.
  • λ_cyc is a hyperparameter in learning.
  • G_x→y is a mapping that converts the voice signal x into the voice signal y.
  • the audio signal y is an audio signal having a higher degree of natural signal than the audio signal x.
  • D_y represents an identification function that identifies whether the input audio signal y is a natural signal or a synthesized signal.
  • G_y→x is a mapping that converts the voice signal y into the voice signal x.
  • D_x represents an identification function that identifies whether the input audio signal x is a natural signal or a synthesized signal.
  • L_adv represents an objective function in adversarial learning. That is, L_adv represents an adversarial loss.
  • The adversarial loss is the value represented by the loss function in adversarial learning.
  • L_id represents the identity-mapping term. The identity-mapping term is included in the objective function L so that the mapping G_x→y does not change its input when the input to the mapping G_x→y is the audio signal y instead of the audio signal x.
  • The value of the identity-mapping term L_id represents the identity-mapping loss.
  • L1 represents a loss function in the adversarial learning executed by the first generation unit 110 and the first identification unit 120 in cooperation.
  • L2 represents a loss function in the adversarial learning executed by the second generation unit 150 and the second identification unit 160 in cooperation.
  • L3 is a function representing the cycle-consistency loss in CycleGAN. That is, L3 is a function indicating the degree to which the mapping G_x→y and the mapping G_y→x are in one-to-one correspondence in the CycleGAN executed in cooperation by the first generation unit 110, the first identification unit 120, the second generation unit 150, and the second identification unit 160.
  • The objective function L is expressed by a function representing the adversarial loss, a function representing the cycle-consistency loss, and a function representing the identity-mapping loss.
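  • The equations themselves are not reproduced in this text. The following is the standard CycleGAN-style objective consistent with the description above, with L3 playing the role of the L_cyc term and with D_x, D_y each standing for the combination of the waveform and feature-amount discriminators of the corresponding domain; the exact form used in the patent may differ.

```latex
\begin{aligned}
L_{1} &= \mathbb{E}_{y}\bigl[\log D_{y}(y)\bigr]
       + \mathbb{E}_{x}\bigl[\log\bigl(1 - D_{y}(G_{x\to y}(x))\bigr)\bigr],\\
L_{2} &= \mathbb{E}_{x}\bigl[\log D_{x}(x)\bigr]
       + \mathbb{E}_{y}\bigl[\log\bigl(1 - D_{x}(G_{y\to x}(y))\bigr)\bigr],\\
L_{3} &= \mathbb{E}_{x}\bigl[\lVert G_{y\to x}(G_{x\to y}(x)) - x\rVert_{1}\bigr]
       + \mathbb{E}_{y}\bigl[\lVert G_{x\to y}(G_{y\to x}(y)) - y\rVert_{1}\bigr],\\
L_{id} &= \mathbb{E}_{y}\bigl[\lVert G_{x\to y}(y) - y\rVert_{1}\bigr]
        + \mathbb{E}_{x}\bigl[\lVert G_{y\to x}(x) - x\rVert_{1}\bigr],\\
L &= L_{1} + L_{2} + \lambda_{cyc}\,L_{3} + L_{id}.
\end{aligned}
```

  • Under this form, the first generation unit 110 and the second generation unit 150 are trained to decrease L, and the identification units are trained to increase it, as described in the learning processes below.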
  • The forward conversion signal identification process is a process in which the first identification unit 120 identifies whether the input audio signal is a natural signal or a forward conversion signal.
  • The forward conversion learning process is a process in which the first generation unit 110 learns.
  • The forward conversion signal identification learning process is a process in which the first identification unit 120 learns.
  • The inverse conversion signal identification process is a process in which the second identification unit 160 identifies whether the input audio signal is an inverse conversion signal or a pre-synthesized signal.
  • The inverse conversion learning process is a process in which the second generation unit 150 learns.
  • The inverse conversion signal identification learning process is a process in which the second identification unit 160 learns.
  • FIG. 3 is a flowchart showing an example of the flow of the forward conversion signal identification process in the embodiment.
  • The voice waveform identification unit 121 acquires the voice signal input to the first identification unit 120 and identifies, based on the acquired voice waveform, whether the voice signal input to the first identification unit 120 is a natural signal or a forward conversion signal (step S101).
  • The voice feature amount identification unit 122 acquires the voice feature amount of the voice signal input to the first identification unit 120 and identifies, based on the acquired voice feature amount, whether the voice signal input to the first identification unit 120 is a natural signal or a forward conversion signal (step S102).
  • The integrated identification unit 123 identifies, according to a predetermined rule based on the identification result of the voice waveform identification unit 121 and the identification result of the voice feature amount identification unit 122, whether the voice signal input to the first identification unit 120 is a natural signal or a forward conversion signal (step S103). The identification result of the integrated identification unit 123 in step S103 is output to the first determination unit 140.
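  • The "predetermined rule" by which the integrated identification unit 123 combines the two identification results is not specified in this text; one simple possibility, shown purely as a hypothetical sketch, is to average the two estimated probabilities and apply a threshold.

```python
def integrated_decision(p_waveform: float, p_feature: float, threshold: float = 0.5) -> bool:
    """Hypothetical combination rule: average the probability that the input is a natural
    signal as estimated from the waveform and from the voice feature amount, and decide
    'natural signal' when the average exceeds the threshold."""
    return (p_waveform + p_feature) / 2.0 > threshold
```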
  • FIG. 4 is a first flowchart showing an example of the flow of the forward conversion learning process in the embodiment.
  • the first input determination unit 130 determines the audio signal input to the first identification unit 120 as a forward conversion signal (step S201).
  • the first generation unit 110 acquires one composite signal from the composite signal group and executes a forward conversion process on the acquired composite signal to generate a forward conversion signal (step S202).
  • the first generation unit 110 outputs the generated forward conversion signal to the first identification unit 120 (step S203).
  • the first identification unit 120 executes forward conversion signal identification processing on the acquired voice signal (step S204). That is, the processes of steps S101 to S103 are executed.
  • the first determination unit 140 determines whether or not the identification result of the first identification unit 120 is correct by comparing with the determination result of the first input determination unit 130 (step S205).
  • the first generation unit 110 learns to further improve the natural signal degree by the forward conversion process based on the determination result of the first determination unit 140 (step S206). Specifically, the first generation unit 110 learns to make the objective function L smaller.
  • FIG. 5 is a second flowchart showing an example of the flow of the forward conversion learning process in the embodiment.
  • the same processing as that shown in FIG. 3 or 4 will be designated by the same reference numerals as those in FIG. 3 or 4, and the description thereof will be omitted.
  • the second generation unit 150 outputs an inverse conversion signal (step S301).
  • The first generation unit 110 acquires the inverse conversion signal output by the second generation unit 150 and generates a forward conversion signal by executing the forward conversion process on the acquired inverse conversion signal (step S302).
  • the processes of steps S203 to S206 are executed.
  • FIG. 6 is a flowchart showing an example of the flow of the forward conversion signal identification learning process in the embodiment.
  • the same processing as that shown in FIGS. 3 to 5 will be designated by the same reference numerals as those in FIGS. 3 to 5, and the description thereof will be omitted.
  • the first input determination unit 130 determines whether the audio signal input to the first identification unit 120 is a natural signal or a forward conversion signal (step S401). Next, the processes of steps S204 and S205 are executed. Next, the first identification unit 120 learns to further improve the accuracy of identification (step S402). Specifically, the first identification unit 120 learns to make the objective function L larger. More specifically, the voice waveform identification unit 121 and the voice feature amount identification unit 122 learn so as to make the objective function L larger.
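  • Taken together, the forward conversion learning process (steps S201 to S206) and the forward conversion signal identification learning process (steps S401, S204, S205, and S402) form the usual alternating GAN update: the generators take a gradient step that decreases the objective function L, and the identification units take a step that increases it. A schematic PyTorch-style sketch follows; the function objective_L and the model objects passed in are assumptions, since the text does not provide them.

```python
def train_step(G_xy, G_yx, discriminators, batch_x, batch_y, objective_L, opt_gen, opt_disc):
    """One alternating update: generators decrease the objective L, discriminators increase it."""
    # Generator step (forward/inverse conversion learning, e.g. steps S206 and S606).
    loss_g = objective_L(G_xy, G_yx, discriminators, batch_x, batch_y)
    opt_gen.zero_grad()
    loss_g.backward()
    opt_gen.step()

    # Discriminator step (identification learning, e.g. steps S402 and S702);
    # maximizing L is implemented here as minimizing -L.
    loss_d = -objective_L(G_xy, G_yx, discriminators, batch_x, batch_y)
    opt_disc.zero_grad()
    loss_d.backward()
    opt_disc.step()
    return loss_g.item(), loss_d.item()
```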
  • FIG. 7 is a flowchart showing an example of the flow of the inverse transformation signal identification processing in the embodiment.
  • The voice waveform identification unit 161 acquires the voice waveform of the voice signal input to the second identification unit 160 and identifies, based on the acquired voice waveform, whether the voice signal input to the second identification unit 160 is an inverse conversion signal or a pre-synthesized signal (step S501).
  • The voice feature amount identification unit 162 acquires the voice feature amount of the voice signal input to the second identification unit 160 and identifies, based on the acquired voice feature amount, whether the voice signal input to the second identification unit 160 is an inverse conversion signal or a pre-synthesized signal (step S502).
  • The integrated identification unit 163 identifies, according to a predetermined rule based on the identification result of the voice waveform identification unit 161 and the identification result of the voice feature amount identification unit 162, whether the voice signal input to the second identification unit 160 is an inverse conversion signal or a pre-synthesized signal (step S503).
  • the identification result of the integrated identification unit 163 in step S503 is output to the second determination unit 180.
  • FIG. 8 is a flowchart showing an example of the flow of the inverse transformation learning process in the embodiment.
  • the second input determination unit 170 determines the audio signal input to the second identification unit 160 as an inverse conversion signal (step S601).
  • the second generation unit 150 acquires the forward conversion signal and executes the reverse conversion process on the acquired forward conversion signal to generate the reverse conversion signal (step S602).
  • the second generation unit 150 outputs the generated inverse conversion signal to the second identification unit 160 (step S603).
  • The second identification unit 160 executes the inverse conversion signal identification process on the acquired voice signal (step S604). That is, the processes of steps S501 to S503 are executed.
  • the second determination unit 180 determines whether or not the identification result of the second identification unit 160 is correct by comparing with the determination result of the second input determination unit 170 (step S605).
  • the second generation unit 150 learns to further improve the natural signal degree by the inverse transformation process based on the determination result of the second determination unit 180 (step S606). Specifically, the second generation unit 150 learns to make the objective function L smaller.
  • the processes of steps S602 to S606 are similarly performed.
  • FIG. 9 is a flowchart showing an example of the flow of the inverse transformation signal identification learning process in the embodiment.
  • the same processing as that shown in FIG. 7 or 8 will be designated by the same reference numerals as those in FIG. 7 or 8, and the description thereof will be omitted.
  • the second input determination unit 170 determines whether the audio signal input to the second identification unit 160 is a natural signal or an inverse conversion signal (step S701). Next, the processes of steps S604 and S605 are executed. Next, the second identification unit 160 learns to further improve the accuracy of identification (step S702). Specifically, the second identification unit 160 learns to make the objective function L larger. More specifically, the voice waveform identification unit 161 and the voice feature amount identification unit 162 learn so as to make the objective function L larger.
  • FIG. 10 is a flowchart showing an example of the flow of processing executed by the voice signal conversion model learning device 1 in the embodiment.
  • An example of the subsequent processing flow is described below, taking as an example the case where the process of step S201 is performed.
  • An example of the processing flow is also described taking as an example the case where the process of step S601 is performed.
  • the same processing as that shown in FIGS. 3 to 9 will be described by assigning the same reference numerals as those shown in FIGS. 3 to 9 and omitting description thereof.
  • First, it is determined whether or not the end condition is satisfied (step S801).
  • the end condition is, for example, a condition that the number of times of learning exceeds a predetermined number of times. Whether or not the end condition is satisfied is determined by, for example, the management unit 102 described later.
  • When the end condition is satisfied (step S801: YES), the process ends. On the other hand, if the end condition is not satisfied (step S801: NO), the process of step S301 is executed. Next, the process of step S302 is executed. After step S302, the process returns to step S203.
  • processing in step S206 and the processing in step S402 may be executed in the reverse order.
  • the order in which the processes in step S606 and the processes in step S702 are executed may be reversed.
  • If, instead of the process of step S201, the first input determination unit 130 determines the audio signal input to the first identification unit 120 as a natural signal, the processes of steps S602 to S302 are not executed. In such a case, the process ends after the process of FIG. 6 is executed.
  • When, instead of the process of step S601, the second input determination unit 170 determines the audio signal input to the second identification unit 160 as a natural signal, the processes of steps S602 to S604 and the process of step S606 are not executed.
  • the voice signal conversion model learning device 1 executes the forward conversion signal identification process, the forward conversion learning process, the forward conversion signal identification learning process, the reverse conversion signal identification process, the reverse conversion learning process, and the reverse conversion signal identification learning process.
  • With each learning iteration, a voice signal conversion model that produces a higher degree of natural signal is obtained.
  • FIG. 11 is a diagram showing an example of the hardware configuration of the voice signal conversion model learning device 1 according to the embodiment.
  • the voice signal conversion model learning device 1 includes a control unit 10 including a processor 91 such as a CPU (Central Processing Unit) connected by a bus and a memory 92, and executes a program.
  • the voice signal conversion model learning device 1 functions as a device including a control unit 10, an input unit 11, an interface unit 12, a storage unit 13, and an output unit 14 by executing a program. More specifically, the processor 91 reads out the program stored in the storage unit 13, and stores the read program in the memory 92.
  • By executing the read program, the processor 91 causes the voice signal conversion model learning device 1 to function as a device including the control unit 10, the input unit 11, the interface unit 12, the storage unit 13, and the output unit 14.
  • the control unit 10 controls the operation of various functional units included in the voice signal conversion model learning device 1.
  • the control unit 10 executes, for example, forward conversion signal identification processing, forward conversion learning processing, forward conversion signal identification learning processing, reverse conversion signal identification processing, reverse conversion learning processing, and reverse conversion signal identification learning processing.
  • the input unit 11 includes an input device such as a mouse, a keyboard, and a touch panel.
  • the input unit 11 may be configured as an interface for connecting these input devices to its own device.
  • the input unit 11 receives input of various information to its own device.
  • the input unit 11 receives, for example, an input instructing the start of learning.
  • the input unit 11 receives, for example, an input of a composite signal to be added to the composite signal group.
  • the input unit 11 receives, for example, an input of a natural signal to be added to the natural signal group.
  • the interface unit 12 includes a communication interface for connecting the own device to an external device.
  • the interface unit 12 communicates with an external device via wire or wireless.
  • the external device may be a storage device such as a USB (Universal Serial Bus) memory, for example.
  • the interface unit 12 acquires the composite signal output by the external device by communicating with the external device.
  • the interface unit 12 acquires the natural signal output by the external device by communicating with the external device.
  • the interface unit 12 includes a communication interface for connecting the own device to the voice signal conversion device 2.
  • the interface unit 12 communicates with the voice signal conversion device 2 via wire or wireless.
  • the interface unit 12 outputs a voice signal conversion model to the voice signal conversion device 2 by communicating with the voice signal conversion device 2.
  • The storage unit 13 is configured using a non-transitory computer-readable storage medium such as a magnetic hard disk device or a semiconductor storage device.
  • the storage unit 13 stores various information related to the voice signal conversion model learning device 1.
  • the storage unit 13 stores, for example, a group of natural signals in advance.
  • the storage unit 13 stores, for example, a synthetic signal group in advance.
  • the storage unit 13 stores, for example, a composite signal and a natural signal input via the input unit 11 or the interface unit 12.
  • the storage unit 13 stores, for example, the identification result of the first identification unit 120.
  • the storage unit 13 stores, for example, the identification result of the second identification unit 160.
  • the storage unit 13 stores, for example, the determination result of the first determination unit 140.
  • the storage unit 13 stores, for example, the determination result of the second determination unit 180.
  • the storage unit 13 stores, for example, the determination result of the first input determination unit 130.
  • the storage unit 13 stores, for example, the determination result of the second input determination unit 170.
  • the storage unit 13 stores, for example, an audio signal conversion model.
  • the output unit 14 outputs various information.
  • the output unit 14 includes display devices such as a CRT (Cathode Ray Tube) display, a liquid crystal display, and an organic EL (Electro-Luminescence) display.
  • the output unit 14 may be configured as an interface for connecting these display devices to its own device.
  • the output unit 14 outputs, for example, the information input to the input unit 11.
  • FIG. 12 is a diagram showing an example of the functional configuration of the control unit 10 in the embodiment.
  • the control unit 10 includes a managed unit 101 and a management unit 102.
  • The managed unit 101 includes the first generation unit 110, the first identification unit 120, the first input determination unit 130, the first determination unit 140, the second generation unit 150, the second identification unit 160, the second input determination unit 170, and the second determination unit 180.
  • The managed unit 101 obtains the audio signal conversion model by executing the forward conversion signal identification process, the forward conversion learning process, the forward conversion signal identification learning process, the inverse conversion signal identification process, the inverse conversion learning process, and the inverse conversion signal identification learning process using each voice signal included in the natural signal group and the synthesized signal group. Specifically, the audio signal conversion model is a trained model that represents the forward conversion process by the first generation unit 110.
  • the management unit 102 controls the operation of the managed unit 101.
  • the management unit 102 executes, for example, the forward conversion signal identification process, the forward conversion learning process, the forward conversion signal identification learning process, the reverse conversion signal identification process, the reverse conversion learning process, and the reverse conversion signal identification learning process by the managed unit 101. Control the timing.
  • the management unit 102 controls, for example, the operations of the input unit 11, the interface unit 12, the storage unit 13, and the output unit 14.
  • the management unit 102 reads various information from, for example, the storage unit 13 and outputs it to the managed unit 101.
  • the management unit 102 acquires, for example, the information input to the input unit 11 and outputs it to the managed unit 101.
  • the management unit 102 acquires, for example, the information input to the input unit 11 and records it in the storage unit 13.
  • The management unit 102 acquires, for example, the information input to the interface unit 12 and outputs it to the managed unit 101.
  • The management unit 102 acquires, for example, the information input to the interface unit 12 and records it in the storage unit 13.
  • the management unit 102 causes the output unit 14 to output the information input to the input unit 11, for example.
  • the management unit 102 records, for example, the identification result of the first identification unit 120 in the storage unit 13.
  • the management unit 102 records, for example, the identification result of the second identification unit 160 in the storage unit 13.
  • The management unit 102 records, for example, the determination result of the first determination unit 140 in the storage unit 13.
  • The management unit 102 records, for example, the determination result of the second determination unit 180 in the storage unit 13.
  • The management unit 102 records, for example, the determination result of the first input determination unit 130 in the storage unit 13.
  • The management unit 102 records, for example, the determination result of the second input determination unit 170 in the storage unit 13.
  • FIG. 13 is a diagram showing an example of the hardware configuration of the audio signal conversion device 2 according to the embodiment.
  • the voice signal conversion device 2 includes a control unit 20 including a processor 93 such as a CPU connected by a bus and a memory 94, and executes a program.
  • the voice signal conversion device 2 functions as a device including a control unit 20, an input unit 21, an interface unit 22, a storage unit 23, and an output unit 24 by executing a program. More specifically, the processor 93 reads the program stored in the storage unit 23, and stores the read program in the memory 94.
  • the voice signal conversion device 2 functions as a device including the control unit 20, the input unit 21, the interface unit 22, the storage unit 23, and the output unit 24.
  • the control unit 20 controls the operation of various functional units included in the voice signal conversion device 2.
  • the control unit 20 converts the unnaturally synthesized signal into a naturally synthesized signal by using, for example, the voice signal conversion model obtained by the voice signal conversion model learning device 1.
  • the input unit 21 includes an input device such as a mouse, a keyboard, and a touch panel.
  • the input unit 21 may be configured as an interface for connecting these input devices to its own device.
  • the input unit 21 receives input of various information to its own device.
  • the input unit 21 receives, for example, an input instructing the start of a process of converting an unnaturally synthesized signal into a naturally synthesized signal.
  • the input unit 21 receives, for example, the input of the unnaturally synthesized signal to be converted.
  • the interface unit 22 includes a communication interface for connecting the own device to an external device.
  • the interface unit 22 communicates with an external device via wire or wireless.
  • the external device is, for example, an output destination of a naturally synthesized signal.
  • the interface unit 22 outputs a naturally synthesized signal to the external device by communicating with the external device.
  • the external device for outputting the naturally synthesized signal is, for example, an audio output device such as a speaker.
  • the external device may be, for example, a storage device such as a USB memory that stores the voice signal conversion model.
  • the interface unit 22 acquires the voice signal conversion model by communicating with the external device.
  • the external device is, for example, an output source of an unnaturally synthesized signal.
  • the interface unit 22 acquires an unnaturally synthesized signal from the external device by communicating with the external device.
  • the interface unit 22 includes a communication interface for connecting the own device to the voice signal conversion model learning device 1.
  • the interface unit 22 communicates with the voice signal conversion model learning device 1 via wire or wireless.
  • the interface unit 22 acquires a voice signal conversion model from the voice signal conversion model learning device 1 by communicating with the voice signal conversion model learning device 1.
  • The storage unit 23 is configured using a non-transitory computer-readable storage medium such as a magnetic hard disk device or a semiconductor storage device.
  • the storage unit 23 stores various information related to the voice signal conversion device 2.
  • The storage unit 23 stores, for example, the voice signal conversion model acquired via the interface unit 22.
  • the output unit 24 outputs various information.
  • the output unit 24 includes display devices such as a CRT display, a liquid crystal display, and an organic EL display.
  • the output unit 24 may be configured as an interface for connecting these display devices to its own device.
  • the output unit 24 outputs, for example, the information input to the input unit 21.
  • FIG. 14 is a diagram showing an example of the functional configuration of the control unit 20 in the embodiment.
  • the control unit 20 includes a conversion target acquisition unit 201, a conversion unit 202, and an audio signal output control unit 203.
  • the conversion target acquisition unit 201 acquires the unnatural composite signal to be converted.
  • the conversion target acquisition unit 201 acquires, for example, the unnatural composite signal input to the input unit 21.
  • the conversion target acquisition unit 201 acquires, for example, the unnaturally synthesized signal input to the interface unit 22.
  • the conversion unit 202 converts the conversion target acquired by the conversion target acquisition unit 201 into a naturally synthesized signal using the voice signal conversion model.
  • the naturally synthesized signal is output to the audio signal output control unit 203.
  • the voice signal output control unit 203 controls the operation of the interface unit 22.
  • the audio signal output control unit 203 causes the interface unit 22 to output a naturally synthesized signal by controlling the operation of the interface unit 22.
  • FIG. 15 is a flowchart showing an example of the flow of processing executed by the voice signal conversion device 2 in the embodiment.
  • the control unit 20 acquires the unnaturally synthesized signal input to the interface unit 22 (step S901).
  • the control unit 20 converts the unnaturally synthesized signal into a naturally synthesized signal using the audio signal conversion model stored in the storage unit 23 (step S902).
  • the control unit 20 controls the operation of the interface unit 22 to output the naturally synthesized signal to the output destination (step S903).
  • the output destination is, for example, an external device such as a speaker.
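  • In other words, steps S901 to S903 amount to loading the trained conversion model, passing the unnaturally synthesized waveform through the forward conversion, and sending the result to the output destination. A minimal sketch is shown below; it assumes the trained generator was exported with torch.save and operates directly on waveforms, and the file names are placeholders.

```python
import torch
import soundfile as sf

def convert_file(model_path: str, in_wav: str, out_wav: str) -> None:
    """Steps S901-S903: load the conversion model, convert the input signal, write the result."""
    generator = torch.load(model_path, map_location="cpu")    # trained forward-conversion model
    generator.eval()
    wav, sr = sf.read(in_wav, dtype="float32")                 # unnaturally synthesized signal
    x = torch.from_numpy(wav).view(1, 1, -1)                   # (batch, channel, samples)
    with torch.no_grad():
        y = generator(x)                                        # naturally synthesized signal
    sf.write(out_wav, y.squeeze().numpy(), sr)
```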
  • the first experiment was conducted using 437 sentences included in the Japanese voice data set of a female narrator. Of the 437 sentences in the Japanese voice dataset, 407 sentences (about 1 hour) were used to obtain the voice signal conversion model. Of the 437 sentences in the Japanese voice dataset, 30 sentences (4 minutes) were used to obtain a 5-step MOS (Mean Opinion Score) rating for the naturalness of sound quality. The audio sampling rate was 22.05 kHz. There were 10 subjects. Each subject evaluated 30 and 20 sentences randomly selected for each learning method.
  • FIG. 16 is a first diagram showing an example of the experimental results of the first experiment.
  • FIG. 17 is a second diagram showing an example of the experimental results of the first experiment.
  • the horizontal axis of FIGS. 16 and 17 shows a method for obtaining an audio signal conversion model.
  • the vertical axis of FIGS. 16 and 17 shows a 5-step MOS evaluation regarding the naturalness of sound quality.
  • The horizontal dotted line in FIGS. 16 and 17 represents the evaluation result of the natural voice.
  • SPSS indicates a method of DNN (Deep Neural Network) text-to-speech synthesis (SPSS: Statistical Parametric Speech Synthesis).
  • GANv indicates a correction method on the voice feature amount.
  • V1 indicates a method of using a downsampling module for a convolutional neural network.
  • V2 indicates a method of obtaining a voice signal conversion model by a voice signal generation system 100 having a first simple identification unit in place of the first identification unit 120 and a second simple identification unit in place of the second identification unit 160.
  • The first simple identification unit includes the voice waveform identification unit 121 but does not include the voice feature amount identification unit 122 or the integrated identification unit 123; it is a classifier that identifies, from the waveform of the input voice signal, whether the input voice signal is a natural signal or a forward conversion signal.
  • The second simple identification unit includes the voice waveform identification unit 161 but does not include the voice feature amount identification unit 162 or the integrated identification unit 163; it is a classifier that identifies, from the waveform of the input voice signal, whether the input voice signal is an inverse conversion signal or a pre-synthesized signal.
  • V2msp indicates a method of obtaining a voice signal conversion model by a voice signal generation system 100 having a third simple identification unit in place of the first identification unit 120 and a fourth simple identification unit in place of the second identification unit 160.
  • the third simple identification unit includes a voice waveform identification unit 121, a voice feature amount identification unit 122, and an integrated identification unit 123.
  • the voice feature amount identification unit 122 included in the third simple identification unit uses a mel spectrogram of the voice signal to be identified as the feature amount used for identification.
  • the fourth simple identification unit includes a voice waveform identification unit 161, a voice feature amount identification unit 162, and an integrated identification unit 163.
  • the voice feature amount identification unit 162 included in the fourth simple identification unit uses a mel spectrogram of the voice signal to be identified as the feature amount used for identification.
  • V2ph indicates a method of obtaining a voice signal conversion model by a voice signal generation system 100 having a fifth simple identification unit in place of the first identification unit 120 and a sixth simple identification unit in place of the second identification unit 160.
  • the fifth simple identification unit includes a voice waveform identification unit 121, a voice feature amount identification unit 122, and an integrated identification unit 123.
  • the voice feature amount identification unit 122 included in the fifth simple identification unit uses the phase spectrum of the voice signal to be identified as the feature amount used for identification.
  • the sixth simple identification unit includes a voice waveform identification unit 161, a voice feature amount identification unit 162, and an integrated identification unit 163.
  • the voice feature amount identification unit 162 included in the sixth simple identification unit uses the phase spectrum of the voice signal to be identified as the feature amount used for identification.
  • V2mfcc indicates a method of obtaining a voice signal conversion model by a voice signal generation system 100 having a seventh simple identification unit in place of the first identification unit 120 and an eighth simple identification unit in place of the second identification unit 160.
  • the seventh simple identification unit includes a voice waveform identification unit 121, a voice feature amount identification unit 122, and an integrated identification unit 123.
  • the voice feature amount identification unit 122 included in the seventh simple identification unit uses the mel frequency cepstrum coefficient of the voice signal to be identified as the feature amount used for identification.
  • the eighth simple identification unit includes a voice waveform identification unit 161, a voice feature amount identification unit 162, and an integrated identification unit 163.
  • the voice feature amount identification unit 162 included in the eighth simple identification unit uses the mel frequency cepstrum coefficient of the voice signal to be identified as the feature amount used for identification.
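  • The three feature-amount choices compared here (mel spectrogram for V2msp, phase spectrum for V2ph, mel frequency cepstrum coefficients for V2mfcc) can be extracted, for example, as follows; this librosa sketch uses assumed frame parameters and is not the exact configuration used in the experiments.

```python
import numpy as np
import librosa

def candidate_features(wav, sr=22050, n_fft=1024, hop=256):
    """Feature amounts compared in the first experiment: mel spectrogram, phase spectrum, MFCC."""
    stft = librosa.stft(wav, n_fft=n_fft, hop_length=hop)
    mel = librosa.feature.melspectrogram(y=wav, sr=sr, n_fft=n_fft, hop_length=hop, n_mels=80)
    phase = np.angle(stft)                                     # phase spectrum (V2ph)
    mfcc = librosa.feature.mfcc(y=wav, sr=sr, n_mfcc=24, n_fft=n_fft, hop_length=hop)
    return mel, phase, mfcc
```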
  • FIGS. 16 and 17 show that V1 has significantly improved sound quality as compared with SPSS.
  • the improvement of sound quality means that the degree of natural signal is increased.
  • FIGS. 16 and 17 show that V1 has improved sound quality over GANv.
  • FIGS. 16 and 17 show that V2 has improved sound quality over SPSS.
  • FIGS. 16 and 17 show that V2 does not improve sound quality over V1. This is because V2 produces more noisy sound than V1.
  • FIGS. 16 and 17 show that V2msp and V2mfcc have higher MOS ratings than V1, V2, V2ph, SPSS and GANv.
  • FIGS. 16 and 17 show that the p-value of the two-sided Mann-Whitney test is 0.05 or more for V2msp and V2mfcc. This indicates that there is no statistically significant difference between the audio signals converted by V2msp and V2mfcc and the natural signal.
  • FIGS. 16 and 17 show that V2ph produces noisy voice and has a lower MOS rating than V2. From the results of FIGS. 16 and 17, it is suggested that it is effective to use both the voice waveform classifiers (that is, the voice waveform identification units 121 and 161) and the voice feature amount identification units 122 and 162.
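  • The statistical claim above can be checked with a two-sided Mann-Whitney U test, for example with SciPy as sketched below; the score arrays are hypothetical placeholders, not the actual MOS ratings.

```python
from scipy.stats import mannwhitneyu

# Hypothetical 5-point MOS ratings; the real per-listener scores are not reproduced here.
natural_scores = [5, 4, 5, 4, 4, 5, 5, 4, 4, 5]
converted_scores = [4, 5, 4, 4, 5, 4, 5, 4, 5, 4]

stat, p_value = mannwhitneyu(natural_scores, converted_scores, alternative="two-sided")
print(stat, p_value)   # p_value >= 0.05: no statistically significant difference at the 5% level
```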
  • “V2msp”, “V2ph” and “V2mfcc” are examples of processing for converting audio using the audio signal conversion model obtained by the audio signal generation system 100.
  • FIG. 18 shows the experimental results of a comparison experiment (hereinafter referred to as "second experiment") between the audio signal conversion model obtained by the audio signal generation system 100 and the audio signal conversion model obtained by another learning method.
  • the second experiment was performed using 13100 sentences included in the English voice data set LJSpeech (see Reference 1). Forty of the 13100 sentences in the English speech dataset were used to obtain a five-step MOS rating for the naturalness of sound quality.
  • the audio sampling rate was 22.05 kHz.
  • spectral distortion was also calculated.
  • FIG. 18 is a diagram showing an example of the experimental results of the second experiment.
  • FIG. 18 shows the log-spectral distance (LSD) and the MOS evaluation result for each learning method.
  • WORLD is the method described in Reference 2.
  • Griffin-Lim is the method described in Reference 3.
  • OpenWaveNet is the method described in Reference 4.
  • WaveGlow is the method described in Reference 5.
  • Reference 1: "The LJ Speech Dataset," [online], retrieved March 30, 2020, <https://keithito.com/LJ-Speech-Dataset/>
  • Reference 2: M. Morise, F. Yokomori, and K. Ozawa, "WORLD: a vocoder-based high-quality speech synthesis system for real-time applications," IEICE Transactions on Information and Systems, vol. 99, no. 7, pp. 1877-1884, 2016.
  • Reference 3: D. Griffin and J. Lim, "Signal estimation from modified short-time Fourier transform," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 32, no. 2, pp. 236-243, 1984.
  • FIG. 18 shows that Griffin-Lim has the lowest LSD.
  • FIG. 18 shows that the LSD of WORLD is large, that is, the spectrum is greatly distorted, because WORLD is a parametric vocoder.
  • FIG. 18 shows that there is no difference in MOS evaluation between Griffin-Lim and WORLD.
  • FIG. 18 shows that, comparing WaveGlow and openWaveNet, openWaveNet has the larger LSD, while WaveGlow has the higher MOS evaluation. These results indicate that LSD values around 4 are unlikely to affect the MOS evaluation. FIG. 18 also shows that V2msp has the highest LSD and the highest MOS rating.
  • the audio signal generation system 100 converts the input waveform into a waveform having a higher degree of natural signal. Therefore, even when a voice whose band has been reduced from that of the original voice (a deteriorated voice) is input, the voice signal generation system 100 can, for example, convert the input voice into a voice whose band is restored. In other words, the voice signal generation system 100 performs band extension.
  • FIG. 19 shows the experimental results of a comparison experiment (hereinafter referred to as "third experiment") between the audio signal conversion model obtained by the audio signal generation system 100 and the audio signal conversion model obtained by another learning method.
  • FIG. 19 is a diagram showing an example of the experimental results of the third experiment.
  • the vertical axis of FIG. 19 shows the test results of the MUSHRA test.
  • the horizontal axis of FIG. 19 shows the method to be evaluated.
  • “48” on the horizontal axis of FIG. 19 indicates a natural voice sampled at 48 kHz.
  • “16to48” on the horizontal axis of FIG. 19 indicates a voice whose band has been expanded by the voice signal generation system 100.
  • “8to48” on the horizontal axis in FIG. 19 indicates the voice whose band has been expanded by the voice signal generation system 100.
  • “8to16to48” on the horizontal axis of FIG. 19 indicates a voice whose band has been expanded by the voice signal generation system 100.
  • the differences among "16to48", "8to48", and "8to16to48" are as follows.
  • "16to48" indicates the audio obtained by inputting "16" to the audio signal conversion device 2 as the deteriorated audio, applying the audio signal conversion model obtained by the audio signal generation system 100 to "16", and extending the band of "16" to 48 kHz. Here, "16" indicates a voice sampled at 48 kHz and then downsampled to 16 kHz.
  • “16” on the horizontal axis in FIG. 19 indicates natural voice downsampled to 16 kHz. “4” on the horizontal axis of FIG. 19 indicates a natural sound downsampled to 4 kHz.
  • FIG. 19 shows that "16to48” has a small difference from the original sound.
  • FIG. 19 shows that "8to48" is significantly deteriorated from the original sound. The reason for the deterioration is that, since the information in speech is concentrated at 16 kHz and below, downsampling to 8 kHz greatly reduces the amount of information and learning does not proceed well.
  • FIG. 19 shows that “8to16to48” has higher sound quality than “8to48”.
  • the voice signal generation system 100 of the embodiment configured as described above uses not only one of the voice waveform and the voice feature amount of the voice signal but both, and obtains a voice signal conversion model by executing the forward conversion signal identification processing, the forward conversion learning processing, the forward conversion signal identification learning processing, the inverse conversion signal identification processing, the inverse conversion learning processing, and the inverse conversion signal identification learning processing. Therefore, the audio signal generation system 100 configured in this way can generate an audio signal having a higher degree of natural signal than when an audio signal conversion model is obtained using only the audio waveform. That is, the voice signal generation system 100 configured in this way can generate a voice closer to the voice emitted by humans.
  • a method for obtaining a speech signal conversion model using only a speech waveform is, for example, SEGAN (Speech Enhancement Generative Adversarial Network).
  • the voice signal generation system 100 of the embodiment configured as described above obtains a voice signal conversion model by executing the forward conversion signal identification processing, the forward conversion learning processing, the forward conversion signal identification learning processing, the inverse conversion signal identification processing, the inverse conversion learning processing, and the inverse conversion signal identification learning processing. Therefore, the voice signal generation system 100 can generate a voice closer to the voice emitted by a human being than when the voice signal conversion model is obtained only by a convolutional neural network using the voice waveform and the voice feature amount.
  • the voice signal generation system 100 of the embodiment configured as described above uses the forward conversion signal identification processing, the forward conversion learning processing, the forward conversion signal identification learning processing, the inverse conversion signal identification processing, the inverse conversion learning processing, and the inverse conversion signal identification learning processing. Therefore, even when the alignment of the voice signals used for learning is low, it is possible to generate a voice close to the voice emitted by a human being. The speech signal generation system 100 therefore has the effect that its application scenes are not limited, unlike SEGAN (Speech Enhancement Generative Adversarial Network) (see Reference 7), which is effective only when the alignment is high.
  • an example of a learning audio signal with high alignment is an audio signal obtained by superimposing, on a computer, noise that simulates a noisy environment onto a voice recorded in an ideal environment and then removing the noise.
  • an example of a learning speech signal with low alignment is synthetic speech generated by text-to-speech synthesis or voice conversion. Since the length of such an audio signal also differs from signal to signal, the alignment is low in this respect as well.
  • the method by which the voice signal generation system 100 generates the voice signal conversion model does not necessarily have to be the convolutional CycleGAN.
  • the method for generating the voice signal conversion model by the voice signal generation system 100 (hereinafter referred to as “model generation method”) may be any method as long as it satisfies the following learning method conditions.
  • the learning method conditions include the first condition.
  • the first condition is that the model generation method is a method using a first generator that outputs a forward conversion signal, which is a signal having a higher degree of natural signal than the input audio signal, by executing forward conversion processing, which is a conversion that raises the degree of natural signal of the input audio signal.
  • the learning method condition includes the second condition.
  • the second condition is that the model generation method is a method using a first classifier that discriminates whether the input signal is a forward conversion signal or a natural signal.
  • the learning method condition includes the third condition.
  • the third condition is that the model generation method is a method using a second generator that outputs an inverse conversion signal having a lower degree of natural signal than the input signal by executing inverse conversion processing, which is a conversion that lowers the degree of natural signal of the input signal.
  • the learning method conditions include the fourth condition.
  • the fourth condition is that the model generation method is a method using a second classifier that discriminates whether the input signal is a pre-synthesized signal, which is a signal prepared in advance and synthesized, or an inverse conversion signal.
  • the composite signal read by the second identification unit 160 from the composite signal group is an example of the pre-synthesized signal.
  • the learning method condition includes the fifth condition.
  • the fifth condition is that, in the model generation method, the first generator, the first classifier, the second generator, and the second classifier learn based on the discrimination results of the first classifier and the second classifier.
  • the learning method condition may further include the following weak classifier conditions.
  • the weak classifier condition includes a condition that at least one of the first classifier and the second classifier learns using a voice waveform classifier and a voice feature amount classifier. The model generation method may also be, for example, a method that further uses a third generator different from the first generator and the second generator and a third classifier different from the first classifier and the second classifier. (A minimal structural sketch of these four components is given at the end of this list.)
  • the first identification unit 120 is an example of the first classifier.
  • the second generation unit 150 is an example of the second generator.
  • the second identification unit 160 is an example of the second classifier.
  • even when the alignment of the voice signals used for learning is low, the voice signal generation system 100 can use those voice signals for learning and generate a voice close to the voice emitted by a human being.
  • the voice waveform identification unit 121 and the voice waveform identification unit 161 may identify the voice signal based on the frequency spectrum converted based on the perceptual scale of pitch.
  • the perceptual measure of pitch is, for example, the Mel scale.
  • the frequency spectrum converted based on the perceptual measure of pitch is, for example, a spectrum represented by the mel frequency cepstrum coefficient.
  • the frequency spectrum may be, for example, a phase spectrum.
  • the frequency spectrum may be an amplitude spectrum.
  • the frequency spectrum converted based on the perceptual measure of pitch may be, for example, a mel spectrogram.
  • the audio signal conversion model learning device 1 does not necessarily have to learn a learning model that converts an input audio signal into an audio signal that is close to the audio emitted by a human being.
  • the voice signal conversion model learning device 1 may learn a learning model that converts an input voice signal into a voice signal of a voice close to the voice of an animal other than a human such as a dog or a cat.
  • the voice signal conversion device 2 converts the input voice into a voice signal close to the voice of an animal other than a human.
  • the animals in this embodiment include humans.
  • the unnatural signal and the naturally synthesized signal are audio signals of the same type of animal, but they do not necessarily have to be the same.
  • the managed unit 101 is an example of the learning unit.
  • the unnatural signal is an example of an input signal.
  • the voice signal conversion model learning device 1 may be implemented by using a plurality of information processing devices connected so as to be able to communicate via a network.
  • each functional unit included in the voice signal conversion model learning device 1 may be distributed among and implemented in a plurality of information processing devices.
  • the first generation unit 110, the first identification unit 120, the second generation unit 150, and the second identification unit 160 may be implemented in different information processing devices.
  • the voice signal conversion device 2 may be implemented by using a plurality of information processing devices connected so as to be able to communicate via a network.
  • each functional unit included in the voice signal conversion device 2 may be distributed among and implemented in a plurality of information processing devices.
  • the program may be recorded on a computer-readable recording medium.
  • the computer-readable recording medium is, for example, a flexible disk, a magneto-optical disk, a portable medium such as a ROM or a CD-ROM, or a storage device such as a hard disk built in a computer system.
  • the program may be transmitted over a telecommunication line.
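  • As a supplement to the learning method conditions described above, the following is a minimal, self-contained sketch of the four components those conditions name: two generators and two classifiers. All class names, layer sizes, and other details are illustrative assumptions and are not prescribed by this embodiment; PyTorch is used only for concreteness.

    import torch
    import torch.nn as nn

    class TinyGenerator(nn.Module):
        # 1-D convolutional waveform-to-waveform mapping (stand-in for Gx->y / Gy->x).
        def __init__(self):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv1d(1, 64, kernel_size=15, padding=7), nn.ReLU(),
                nn.Conv1d(64, 1, kernel_size=15, padding=7))

        def forward(self, wav):            # wav: (batch, 1, samples)
            return self.net(wav)

    class TinyClassifier(nn.Module):
        # Stand-in real/fake classifier; the waveform/feature split of the
        # identification units is sketched later in the description.
        def __init__(self):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv1d(1, 32, kernel_size=41, stride=4, padding=20), nn.LeakyReLU(0.2),
                nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(32, 1))

        def forward(self, wav):
            return self.net(wav)           # one logit per input signal

    g_forward = TinyGenerator()   # first condition: raises the degree of natural signal
    d_first   = TinyClassifier()  # second condition: forward conversion signal vs. natural signal
    g_inverse = TinyGenerator()   # third condition: lowers the degree of natural signal
    d_second  = TinyClassifier()  # fourth condition: inverse conversion signal vs. pre-synthesized signal
    # fifth condition: all four modules are trained from the decisions of d_first and d_second.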

Landscapes

  • Engineering & Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Machine Translation (AREA)

Abstract

Provided is a sound signal conversion model learning device equipped with a learning unit which obtains, by a machine learning method, a trained model for converting an input signal that is an inputted sound signal into a sound signal having a higher natural signal degree than the input signal, the natural signal degree indicating the degree of similarity to a natural signal that is a sound actually emitted by an animal. The machine learning method is a method wherein a first generation unit which performs forward conversion processing, which is a conversion for increasing the natural signal degree, on the inputted sound signal to output a forward conversion signal having a higher natural signal degree than the sound signal, a first identification unit which identifies whether an inputted signal is a forward conversion signal or a natural signal, a second generation unit which performs reverse conversion processing, which is a conversion for decreasing the natural signal degree, on an inputted sound signal to output a reverse conversion signal having a lower natural signal degree than the sound signal, and a second identification unit which identifies whether an inputted signal is a preliminary synthesized signal that is a preliminarily prepared signal and a synthesized signal, or a reverse conversion signal, learn on the basis of identification results of the first identification unit and the second identification unit.

Description

Voice signal conversion model learning device, voice signal conversion device, voice signal conversion model learning method, and program
 The present invention relates to a voice signal conversion model learning device, a voice signal conversion device, a voice signal conversion model learning method, and a program.
 Techniques for generating desired speech from input information, such as speech generation by a parametric vocoder method (see Non-Patent Document 1) and statistical voice conversion (see Non-Patent Document 2), have been studied as having the potential to extend human communication ability and physical function. For example, because such a system is easy to build and highly versatile, speech generation by a parametric vocoder method has been widely studied for application to assistance for persons with physical disabilities (see Non-Patent Documents 3 and 4), language education support (see Non-Patent Documents 5 and 6), and amusement (see Non-Patent Document 7).
 However, the voice generated using the above-described conventional techniques differs greatly from the voice actually uttered by a human being. One cause of this difference is that the generated feature amounts are excessively smoothed. GAN (Generative Adversarial Networks), one of the machine learning methods, has been proposed as a method for suppressing such excessive smoothing. As a method using GAN, for example, SEGAN (Speech Enhancement Generative Adversarial Network) has been proposed (see Non-Patent Document 8). However, the methods using SEGAN proposed so far have the problem that learning does not hold when the training data contain target speech waveforms having the same amplitude spectrum but different phase spectra, and it has therefore been difficult to reduce the difference from the voice actually uttered by a human being while using training data with few restrictions. Such a problem applies not only to humans but also to the generation of voices emitted by animals.
 In view of the above circumstances, an object of the present invention is to provide a technique for generating a voice closer to the voice emitted by an animal.
 One aspect of the present invention is a voice signal conversion model learning device including a learning unit that obtains, by a machine learning method, a trained model for converting an input signal, which is an input voice signal, into a voice signal whose degree of natural signal, which indicates the degree of similarity to a natural signal actually emitted by an animal, is higher than that of the input signal. The machine learning method is a method in which a first generation unit that outputs a forward conversion signal having a higher degree of natural signal than an input voice signal by executing forward conversion processing, which is a conversion that raises the degree of natural signal, a first identification unit that identifies whether an input signal is a forward conversion signal or a natural signal, a second generation unit that outputs an inverse conversion signal having a lower degree of natural signal than an input voice signal by executing inverse conversion processing, which is a conversion that lowers the degree of natural signal, and a second identification unit that identifies whether an input signal is a pre-synthesized signal, which is a signal prepared in advance and synthesized, or an inverse conversion signal, learn based on the identification results of the first identification unit and the second identification unit.
 According to the present invention, it is possible to generate a voice closer to the voice emitted by an animal.
 FIG. 1 is an explanatory diagram illustrating an outline of the audio signal generation system 100 of the embodiment.
 FIG. 2 is an explanatory diagram illustrating an outline of the voice signal conversion model learning device 1 in the embodiment.
 FIG. 3 is a flowchart showing an example of the flow of the forward conversion signal identification processing in the embodiment.
 FIG. 4 is a first flowchart showing an example of the flow of the forward conversion learning processing in the embodiment.
 FIG. 5 is a second flowchart showing an example of the flow of the forward conversion learning processing in the embodiment.
 FIG. 6 is a flowchart showing an example of the flow of the forward conversion signal identification learning processing in the embodiment.
 FIG. 7 is a flowchart showing an example of the flow of the inverse conversion signal identification processing in the embodiment.
 FIG. 8 is a flowchart showing an example of the flow of the inverse conversion learning processing in the embodiment.
 FIG. 9 is a flowchart showing an example of the flow of the inverse conversion signal identification learning processing in the embodiment.
 FIG. 10 is a flowchart showing an example of the flow of processing executed by the voice signal conversion model learning device 1 in the embodiment.
 FIG. 11 is a diagram showing an example of the hardware configuration of the voice signal conversion model learning device 1 in the embodiment.
 FIG. 12 is a diagram showing an example of the functional configuration of the control unit 10 in the embodiment.
 FIG. 13 is a diagram showing an example of the hardware configuration of the voice signal conversion device 2 in the embodiment.
 FIG. 14 is a diagram showing an example of the functional configuration of the control unit 20 in the embodiment.
 FIG. 15 is a flowchart showing an example of the flow of processing executed by the voice signal conversion device 2 in the embodiment.
 FIG. 16 is a first diagram showing an example of the experimental results of the first experiment.
 FIG. 17 is a second diagram showing an example of the experimental results of the first experiment.
 FIG. 18 is a diagram showing an example of the experimental results of the second experiment.
 FIG. 19 is a diagram showing an example of the experimental results of the third experiment.
(Embodiment)
 The outline of the audio signal generation system 100 of the embodiment will be described with reference to FIGS. 1 and 2. FIG. 1 is an explanatory diagram illustrating an outline of the audio signal generation system 100 of the embodiment. The audio signal generation system 100 raises the degree of natural signal of an unnaturally synthesized signal, that is, a synthesized audio signal (hereinafter referred to as a "synthesized signal") whose degree of similarity to a natural signal (this degree is hereinafter referred to as the "degree of natural signal") is low (such a signal is hereinafter referred to as an "unnaturally synthesized signal"). A natural signal is a voice actually emitted by a human being. In other words, the audio signal generation system 100 converts an input unnaturally synthesized signal into a naturally synthesized signal, which is a synthesized signal having a higher degree of natural signal than the input unnaturally synthesized signal. Converting an unnaturally synthesized signal into a naturally synthesized signal is equivalent to generating a naturally synthesized signal based on the unnaturally synthesized signal.
 The voice signal generation system 100 includes a voice signal conversion model learning device 1 and a voice signal conversion device 2. The voice signal conversion model learning device 1 obtains, by machine learning, a trained model (hereinafter referred to as the "voice signal conversion model") that generates a naturally synthesized signal based on an unnaturally synthesized signal. For simplicity of the following explanation, performing machine learning is referred to as learning. Performing machine learning means appropriately adjusting the values of the parameters of the machine learning model. In the following description, learning so as to be A means that the parameter values of the machine learning model are adjusted to satisfy A, where A represents a predetermined condition.
 The voice signal conversion model learning device 1 receives natural signals and synthesized signals as inputs and learns the voice signal conversion model by cyclic adversarial learning (CycleGAN: Cycle Generative Adversarial Networks) that uses a voice waveform classifier and a voice feature amount classifier in training the classifiers. The voice waveform classifier is a classifier that identifies whether or not a voice signal is a natural signal based on the waveform of the voice signal used for learning (hereinafter referred to as the "voice waveform"). The voice feature amount classifier is a classifier that acquires, as a voice feature amount, information satisfying a predetermined condition from the voice signal used for learning and identifies whether or not the voice signal is a natural signal based on the acquired voice feature amount. Hereinafter, a CycleGAN that uses a voice waveform classifier and a voice feature amount classifier is referred to as a convolutional CycleGAN. The voice feature amount is, for example, the phase spectrum of the voice signal. As will be described later, the natural signals and synthesized signals input to the voice signal conversion model learning device 1 may be stored in advance in a storage unit included in the voice signal conversion model learning device 1.
 The convolutional CycleGAN is a neural network that learns the voice waveform and the voice feature amount of the voice signal used for learning with separate classifiers. In general, a neural network that learns the feature amounts of the data used for learning with a different classifier for each feature amount is called a convolutional neural network. The convolutional CycleGAN is therefore both a neural network obtained by modifying CycleGAN and a neural network obtained by modifying a convolutional neural network.
 FIG. 2 is an explanatory diagram illustrating an outline of the voice signal conversion model learning device 1 in the embodiment. The voice signal conversion model learning device 1 includes a first generation unit 110, a first identification unit 120, a first input determination unit 130, a second generation unit 150, a second identification unit 160, and a second input determination unit 170. The first generation unit 110, the first identification unit 120, the second generation unit 150, and the second identification unit 160 are functional units that learn. In the voice signal conversion model learning device 1, the first generation unit 110, the first identification unit 120, the first input determination unit 130, the second generation unit 150, the second identification unit 160, and the second input determination unit 170 cooperate to execute the CycleGAN.
 The first generation unit 110 executes the forward conversion processing on the input voice signal. The forward conversion processing is processing that raises the degree of natural signal of the input voice signal. The first generation unit 110 outputs the voice signal after the forward conversion processing as a forward conversion signal. The first generation unit 110 learns based on the identification result of the first identification unit 120, which will be described in detail later. Through learning, the first generation unit 110 learns so that the forward conversion processing further raises the degree of natural signal.
 A specific example of learning that further raises the degree of natural signal by the forward conversion processing is processing that suitably adjusts the parameter values so as to reduce the value of the loss function, which is a function that takes a larger value as the probability that the identification result of the first identification unit 120 is erroneous becomes lower.
 The first identification unit 120 identifies whether the input voice signal is a natural signal or a forward conversion signal. The first identification unit 120 learns based on this identification result. The first identification unit 120 includes a voice waveform identification unit 121, a voice feature amount identification unit 122, an integrated identification unit 123, and a first determination unit 140.
 The voice signal input to the first identification unit 120 is input to the voice waveform identification unit 121. The voice signal input to the voice waveform identification unit 121 is a voice signal determined by the first input determination unit 130, which will be described in detail later, and is a natural signal or a forward conversion signal. The voice waveform identification unit 121 identifies whether the voice signal input to the first identification unit 120 is a natural signal or a forward conversion signal based on the voice waveform of the input voice signal. The voice waveform identification unit 121 is an example of the voice waveform classifier.
 The voice signal input to the first identification unit 120 is also input to the voice feature amount identification unit 122. That is, the voice signal input to the voice feature amount identification unit 122 is the same as the voice signal input to the voice waveform identification unit 121. The voice feature amount identification unit 122 acquires a voice feature amount based on the input voice signal. The voice feature amount identification unit 122 identifies whether the voice signal input to the first identification unit 120 is a natural signal or a forward conversion signal based on the acquired voice feature amount. The voice feature amount identification unit 122 is an example of the voice feature amount classifier.
 The integrated identification unit 123 identifies whether the voice signal input to the first identification unit 120 is a natural signal or a forward conversion signal based on the identification result of the voice waveform identification unit 121 and the identification result of the voice feature amount identification unit 122. The identification result of the integrated identification unit 123 is the identification result of the first identification unit 120. The identification result of the integrated identification unit 123 is output to the first determination unit 140.
 The first determination unit 140 determines whether or not the identification result of the integrated identification unit 123 is correct based on the determination result of the first input determination unit 130.
 The voice waveform identification unit 121, the voice feature amount identification unit 122, and the integrated identification unit 123 learn based on the determination result of the first determination unit 140. Through learning, the voice waveform identification unit 121, the voice feature amount identification unit 122, and the integrated identification unit 123 learn so as to further improve the accuracy of identification. A specific example of such learning is processing that suitably adjusts the parameter values so as to increase the value of the loss function, which is a function that takes a larger value as the probability that the identification result of the integrated identification unit 123 is erroneous becomes lower.
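 The following is a hedged sketch of how a classifier combining a waveform branch, a feature branch, and an integration step (loosely mirroring the voice waveform identification unit 121, the voice feature amount identification unit 122, and the integrated identification unit 123) might be written. The mel-spectrogram feature, the averaging rule used for integration, and all layer settings are assumptions made for illustration, not details taken from this embodiment.

    import torch
    import torch.nn as nn
    import torchaudio

    class TwoBranchClassifier(nn.Module):
        def __init__(self, sample_rate: int = 22050, n_mels: int = 80):
            super().__init__()
            # voice feature extraction (assumed here: mel spectrogram)
            self.melspec = torchaudio.transforms.MelSpectrogram(
                sample_rate=sample_rate, n_fft=1024, hop_length=256, n_mels=n_mels)
            # decision from the raw waveform (cf. unit 121)
            self.wave_branch = nn.Sequential(
                nn.Conv1d(1, 32, kernel_size=41, stride=4, padding=20), nn.LeakyReLU(0.2),
                nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(32, 1))
            # decision from the voice feature amount (cf. unit 122)
            self.feat_branch = nn.Sequential(
                nn.Conv1d(n_mels, 32, kernel_size=5, padding=2), nn.LeakyReLU(0.2),
                nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(32, 1))

        def forward(self, wav: torch.Tensor) -> torch.Tensor:   # wav: (batch, 1, samples)
            logit_wave = self.wave_branch(wav)
            mel = self.melspec(wav).squeeze(1)                   # (batch, n_mels, frames)
            logit_feat = self.feat_branch(mel)
            return 0.5 * (logit_wave + logit_feat)               # integrated decision (cf. unit 123)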
 The first input determination unit 130 determines whether the voice signal to be input to the first identification unit 120 is a forward conversion signal or a natural signal.
 When the first input determination unit 130 determines that a natural signal is to be input to the first identification unit 120, one natural signal belonging to the natural signal group shown in the central column of FIG. 2 is input to the first identification unit 120. The natural signal group is a set of natural signals prepared in advance for learning. The synthesized signal group shown in the central column of FIG. 2 is a set of synthesized signals prepared in advance for learning. Hereinafter, a synthesized signal belonging to the synthesized signal group is referred to as a pre-synthesized signal.
 When the first input determination unit 130 determines that a synthesized signal is to be input to the first identification unit 120, the forward conversion signal is input to the first identification unit 120.
 The second generation unit 150 executes the inverse conversion processing on the input voice signal. When a forward conversion signal is input as the voice signal, the inverse conversion processing is executed on the acquired forward conversion signal. When a natural signal is input as the voice signal, the inverse conversion processing is executed on the acquired natural signal. The inverse conversion processing is processing that lowers the degree of natural signal of the input voice signal. The second generation unit 150 outputs the voice signal after the inverse conversion processing as an inverse conversion signal. The second generation unit 150 learns based on the identification result of the second identification unit 160, which will be described in detail later. Through learning, the second generation unit 150 learns so that the inverse conversion processing further lowers the degree of natural signal.
 A specific example of learning that further lowers the degree of natural signal by the inverse conversion processing is processing that suitably adjusts the parameter values so as to reduce the value of the loss function, which is a function that takes a larger value as the probability that the identification result of the second identification unit 160 is erroneous becomes lower.
 The second identification unit 160 identifies whether the input voice signal is an inverse conversion signal or a pre-synthesized signal. The second identification unit 160 learns based on its identification result. The second identification unit 160 includes a voice waveform identification unit 161, a voice feature amount identification unit 162, an integrated identification unit 163, and a second determination unit 180.
 The voice waveform identification unit 161 identifies whether the voice signal input to the second identification unit 160 is an inverse conversion signal or a pre-synthesized signal based on the voice waveform of the voice signal input to the second identification unit 160. The voice waveform identification unit 161 is an example of the voice waveform classifier.
 The voice feature amount identification unit 162 identifies whether the voice signal input to the second identification unit 160 is an inverse conversion signal or a pre-synthesized signal based on the voice feature amount of the voice signal input to the second identification unit 160. The voice feature amount identification unit 162 is an example of the voice feature amount classifier.
 The integrated identification unit 163 identifies whether the voice signal input to the second identification unit 160 is an inverse conversion signal or a pre-synthesized signal based on the identification result of the voice waveform identification unit 161 and the identification result of the voice feature amount identification unit 162. The identification result of the integrated identification unit 163 is the identification result of the second identification unit 160. The identification result of the integrated identification unit 163 is output to the second determination unit 180.
 The second determination unit 180 determines whether or not the identification result of the second identification unit 160 is correct based on the determination result of the second input determination unit 170.
 The voice waveform identification unit 161, the voice feature amount identification unit 162, and the integrated identification unit 163 learn based on the determination result of the second determination unit 180. Through learning, the voice waveform identification unit 161, the voice feature amount identification unit 162, and the integrated identification unit 163 learn so as to further improve the accuracy of identification. A specific example of such learning is processing that suitably adjusts the parameter values so as to increase the value of the loss function, which is a function that takes a larger value as the probability that the identification result of the integrated identification unit 163 is erroneous becomes lower.
 The second input determination unit 170 determines whether the voice signal to be input to the second generation unit 150 is a forward conversion signal or a natural signal. The second input determination unit 170 also determines whether the voice signal to be input to the second identification unit 160 is an inverse conversion signal or a pre-synthesized signal.
 The first generation unit 110, the first identification unit 120, the second generation unit 150, and the second identification unit 160 operate in cooperation so as to learn to reduce the objective function L represented by the following equations. That is, the objective function L is the loss function used when the first generation unit 110, the first identification unit 120, the second generation unit 150, and the second identification unit 160 learn.
 [Equations (1) to (17), which define the objective function L and its component loss terms, are provided as images (JPOXMLDOC01-appb-M000001 to M000017) in the original publication and are not reproduced here.]
 H1 represents the self-identity loss. More specifically, H1 is represented by the following equation (18).
 [Equation (18) is provided as an image (JPOXMLDOC01-appb-M000018) in the original publication and is not reproduced here.]
 The sum of H2 to H9 represents the adversarial loss. Dxwave represents a classifier that identifies what kind of signal the voice signal x is based on the waveform of the voice signal x. Dywave represents a classifier that identifies what kind of signal the voice signal y is based on the waveform of the voice signal y. Dxmsp represents a classifier that identifies what kind of signal the voice signal x is based on the voice feature amount of the voice signal x. Dymsp represents a classifier that identifies what kind of signal the voice signal y is based on the voice feature amount of the voice signal y. Hereinafter, for the sake of simplicity, a classifier is denoted by the symbol D. Dmsp(A) is a function that outputs the probability that A is the target voice feature amount, and log(1 - Dmsp(A)) is a function that outputs the probability that A is not the target voice feature amount.
 F(A) denotes a process of convolving a fast Fourier transform matrix windowed by a Hann window with A and then applying a mel filter to the absolute value of the result.
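 As a rough numerical illustration of F(A), the following sketch computes a Hann-windowed short-time Fourier transform, takes its magnitude, and applies a mel filter bank. The frame length, hop size, and number of mel bands are assumptions, not values specified in this embodiment.

    import numpy as np
    import librosa

    def mel_feature(wav: np.ndarray, sr: int = 22050,
                    n_fft: int = 1024, hop: int = 256, n_mels: int = 80) -> np.ndarray:
        spec = librosa.stft(wav, n_fft=n_fft, hop_length=hop, window="hann")
        magnitude = np.abs(spec)                              # |Hann-windowed FFT|
        mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)
        return mel_fb @ magnitude                             # (n_mels, frames)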
 λcyc, which weights the Lcyc term in the objective function, is a hyperparameter in learning. Gx→y is a mapping that converts the voice signal x into the voice signal y. The voice signal y is a voice signal having a higher degree of natural signal than the voice signal x. Dy represents an identification function that distinguishes whether the input voice signal y is a natural signal or a synthesized signal. Gy→x is a mapping that converts the voice signal y into the voice signal x. Dx represents an identification function that distinguishes whether the input voice signal x is a natural signal or a synthesized signal.
 Ladv represents the objective function in adversarial learning, that is, the adversarial loss. The adversarial loss is the value represented by the loss function in adversarial learning. Lid represents the identity mapping. The identity-mapping term exists in the objective function L so that the objective function L does not change when the input to the mapping Gx→y is the voice signal y instead of the voice signal x. The value of the identity mapping Lid represents the identity-mapping loss.
 L1 represents the loss function of the adversarial learning executed jointly by the first generation unit 110 and the first identification unit 120. L2 represents the loss function of the adversarial learning executed jointly by the second generation unit 150 and the second identification unit 160. L3 is a function representing the cycle-consistency loss in the CycleGAN, that is, a function indicating whether or not the mapping Gx→y and the mapping Gy→x are in one-to-one correspondence in the CycleGAN executed by the first generation unit 110, the first identification unit 120, the second generation unit 150, and the second identification unit 160 in cooperation.
 In this way, the objective function L is expressed by a function representing the adversarial loss, a function representing the cycle-consistency loss, and a function representing the identity-mapping loss.
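 Because the individual equations are reproduced above only as image references, the following LaTeX block gives only the general shape of such an objective, in the standard CycleGAN form with waveform and feature (mel-spectrogram) discriminators for each direction; it is a hedged reconstruction for orientation, not the exact equations (1) to (18) of this embodiment.

    % General shape only; not the patent's exact formulas.
    \begin{aligned}
    L &= L_{adv} + \lambda_{cyc} L_{cyc} + \lambda_{id} L_{id},\\
    L_{adv} &= \sum_{d \in \{\mathrm{wave},\, \mathrm{msp}\}} \Big(
          \mathbb{E}_{y}\big[\log D_{yd}(y)\big]
        + \mathbb{E}_{x}\big[\log\big(1 - D_{yd}(G_{x \to y}(x))\big)\big]\\
      &\qquad\qquad\qquad
        + \mathbb{E}_{x}\big[\log D_{xd}(x)\big]
        + \mathbb{E}_{y}\big[\log\big(1 - D_{xd}(G_{y \to x}(y))\big)\big] \Big),\\
    L_{cyc} &= \mathbb{E}_{x}\big[\lVert G_{y \to x}(G_{x \to y}(x)) - x \rVert_{1}\big]
             + \mathbb{E}_{y}\big[\lVert G_{x \to y}(G_{y \to x}(y)) - y \rVert_{1}\big],\\
    L_{id} &= \mathbb{E}_{y}\big[\lVert G_{x \to y}(y) - y \rVert_{1}\big].
    \end{aligned}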
 Here, an example of the flow of each of the forward conversion signal identification processing, the forward conversion learning processing, the forward conversion signal identification learning processing, the inverse conversion signal identification processing, the inverse conversion learning processing, and the inverse conversion signal identification learning processing will be described. The forward conversion signal identification processing is processing in which the first identification unit 120 identifies whether an input voice signal is a natural signal or a forward conversion signal. The forward conversion learning processing is processing in which the first generation unit 110 learns. The forward conversion signal identification learning processing is processing in which the first identification unit 120 learns. The inverse conversion signal identification processing is processing in which the second identification unit 160 identifies whether an input voice signal is an inverse conversion signal or a pre-synthesized signal. The inverse conversion learning processing is processing in which the second generation unit 150 learns. The inverse conversion signal identification learning processing is processing in which the second identification unit 160 learns.
 FIG. 3 is a flowchart showing an example of the flow of the forward conversion signal identification processing in the embodiment. The voice waveform identification unit 121 acquires the voice signal input to the first identification unit 120 and identifies whether the voice signal input to the first identification unit 120 is a natural signal or a forward conversion signal based on the acquired voice waveform (step S101). Next, the voice feature amount identification unit 122 acquires the voice feature amount of the voice signal input to the first identification unit 120 and identifies whether the voice signal input to the first identification unit 120 is a natural signal or a forward conversion signal based on the acquired voice feature amount (step S102). Next, the integrated identification unit 123 identifies whether the voice signal input to the first identification unit 120 is a natural signal or a forward conversion signal according to a predetermined rule based on the identification result of the voice waveform identification unit 121 and the identification result of the voice feature amount identification unit 122 (step S103). The identification result of the integrated identification unit 123 in step S103 is output to the first determination unit 140.
 FIG. 4 is a first flowchart showing an example of the flow of the forward conversion learning processing in the embodiment. The first input determination unit 130 determines that a forward conversion signal is to be input to the first identification unit 120 (step S201). Next, the first generation unit 110 acquires one synthesized signal from the synthesized signal group and generates a forward conversion signal by executing the forward conversion processing on the acquired synthesized signal (step S202). Next, the first generation unit 110 outputs the generated forward conversion signal to the first identification unit 120 (step S203). Next, the first identification unit 120 executes the forward conversion signal identification processing on the acquired voice signal (step S204), that is, the processes of steps S101 to S103 are executed. Next, the first determination unit 140 compares the identification result of the first identification unit 120 with the determination result of the first input determination unit 130 and determines whether or not the identification result is correct (step S205). Next, the first generation unit 110 learns based on the determination result of the first determination unit 140 so that the forward conversion processing further raises the degree of natural signal (step S206). Specifically, the first generation unit 110 learns so as to make the objective function L smaller.
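 To make the flow of FIG. 4 concrete, the following is a hedged sketch of a single forward conversion learning step expressed as one gradient update, reusing the module names assumed in the earlier sketches. The binary-cross-entropy form of the adversarial loss and the optimizer are assumptions; the embodiment itself only requires that the objective function L become smaller.

    import torch
    import torch.nn.functional as F

    def forward_conversion_learning_step(g_forward, d_first, synthesized_batch, optimizer_g):
        fake_natural = g_forward(synthesized_batch)        # S202: forward conversion
        logit = d_first(fake_natural)                      # S204: identification by the first classifier
        # S205/S206: the generator loss is small when the classifier judges the output "natural"
        g_loss = F.binary_cross_entropy_with_logits(logit, torch.ones_like(logit))
        optimizer_g.zero_grad()
        g_loss.backward()
        optimizer_g.step()
        return g_loss.item()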
 FIG. 5 is a second flowchart showing an example of the flow of the forward conversion learning processing in the embodiment. Hereinafter, processing similar to the processing shown in FIG. 3 or FIG. 4 is denoted by the same reference numerals as in FIG. 3 or FIG. 4, and its description is omitted.
 The second generation unit 150 outputs an inverse conversion signal (step S301). Next, the first generation unit 110 acquires the inverse conversion signal output by the second generation unit 150 and generates a forward conversion signal by executing the forward conversion processing on the acquired inverse conversion signal (step S302). Next, the processes of steps S203 to S206 are executed.
 FIG. 6 is a flowchart showing an example of the flow of the forward conversion signal identification learning processing in the embodiment. Hereinafter, processing similar to the processing shown in FIGS. 3 to 5 is denoted by the same reference numerals as in FIGS. 3 to 5, and its description is omitted.
 The first input determination unit 130 determines whether the voice signal to be input to the first identification unit 120 is a natural signal or a forward conversion signal (step S401). Next, the processes of steps S204 and S205 are executed. Next, the first identification unit 120 learns so as to further improve the accuracy of identification (step S402). Specifically, the first identification unit 120 learns so as to make the objective function L larger. More specifically, the voice waveform identification unit 121 and the voice feature amount identification unit 122 learn so as to make the objective function L larger.
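 Similarly, the following is a hedged sketch of the forward conversion signal identification learning of FIG. 6 as one gradient update of the first classifier, with the generator held fixed. The binary-cross-entropy loss is an assumed stand-in for making the objective function L larger; the function and variable names are carried over from the earlier sketches.

    import torch
    import torch.nn.functional as F

    def identification_learning_step(g_forward, d_first, natural_batch,
                                     synthesized_batch, optimizer_d):
        with torch.no_grad():                              # the generator is not updated here
            fake_natural = g_forward(synthesized_batch)
        logit_real = d_first(natural_batch)                # should be judged "natural"
        logit_fake = d_first(fake_natural)                 # should be judged "forward conversion signal"
        d_loss = (F.binary_cross_entropy_with_logits(logit_real, torch.ones_like(logit_real))
                  + F.binary_cross_entropy_with_logits(logit_fake, torch.zeros_like(logit_fake)))
        optimizer_d.zero_grad()
        d_loss.backward()
        optimizer_d.step()
        return d_loss.item()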
 図7は、実施形態における逆変換信号識別処理の流れの一例を示すフローチャートである。音声波形識別部161が第2識別部160に入力された音声信号の音声波形を取得し、取得した音声波形に基づいて第2識別部160に入力された音声信号が逆変換信号と事前合成信号とのいずれであるかを識別する(ステップS501)。次に音声特徴量識別部162が第2識別部160に入力された音声信号を取得し、取得した音声特徴量に基づいて第2識別部160に入力された音声信号が逆変換信号と事前合成信号とのいずれであるかを識別する(ステップS502)。次に統合識別部163が音声波形識別部161の識別結果と音声特徴量識別部162の識別結果とに基づき予め定められた所定の規則にしたがい、第2識別部160に入力された音声信号が逆変換信号と事前合成信号とのいずれであるかを識別する(ステップS503)。ステップS503における統合識別部163の識別結果が、第2判定部180に出力される。 FIG. 7 is a flowchart showing an example of the flow of the inverse transformation signal identification processing in the embodiment. The voice waveform identification unit 161 acquires the voice waveform of the voice signal input to the second identification unit 160, and the voice signal input to the second identification unit 160 based on the acquired voice waveform is an inverse conversion signal and a precombined signal. (Step S501). Next, the voice feature amount identification unit 162 acquires the voice signal input to the second identification unit 160, and the voice signal input to the second identification unit 160 is precombined with the inverse conversion signal based on the acquired voice feature amount. Identifying which of the signals is (step S502). Next, the integrated identification unit 163 receives the voice signal input to the second identification unit 160 according to a predetermined rule determined in advance based on the identification result of the voice waveform identification unit 161 and the identification result of the voice feature amount identification unit 162. It identifies whether it is an inverse conversion signal or a precombined signal (step S503). The identification result of the integrated identification unit 163 in step S503 is output to the second determination unit 180.
 FIG. 8 is a flowchart showing an example of the flow of the inverse conversion learning process in the embodiment. The second input determination unit 170 determines that the voice signal to be input to the second identification unit 160 is an inverse conversion signal (step S601). Next, the second generation unit 150 acquires a forward conversion signal and generates an inverse conversion signal by executing the inverse conversion processing on the acquired forward conversion signal (step S602). The second generation unit 150 then outputs the generated inverse conversion signal to the second identification unit 160 (step S603). Next, the second identification unit 160 executes the inverse conversion signal identification process on the acquired voice signal (step S604); that is, the processing of steps S501 to S503 is executed. Next, the second determination unit 180 determines whether the identification result of the second identification unit 160 is correct by comparing it with the determination result of the second input determination unit 170 (step S605). The second generation unit 150 then learns, based on the determination result of the second determination unit 180, so that the inverse conversion processing further improves the natural signal degree (step S606). Specifically, the second generation unit 150 learns so as to make the objective function L smaller. When the second generation unit 150 acquires a natural signal and generates an inverse conversion signal, the processing of steps S602 to S606 is performed in the same manner.
 FIG. 9 is a flowchart showing an example of the flow of the inverse conversion signal identification learning process in the embodiment. Hereinafter, processing identical to the processing shown in FIG. 7 or FIG. 8 is given the same reference signs as in FIG. 7 or FIG. 8, and its description is omitted.
 The second input determination unit 170 determines whether the voice signal to be input to the second identification unit 160 is to be a natural signal or an inverse conversion signal (step S701). Next, the processing of steps S604 and S605 is executed. The second identification unit 160 then learns so as to further improve its identification accuracy (step S702). Specifically, the second identification unit 160 learns so as to make the objective function L larger; more specifically, the voice waveform identification unit 161 and the voice feature identification unit 162 learn so as to make the objective function L larger.
 FIG. 10 is a flowchart showing an example of the flow of processing executed by the voice signal conversion model learning device 1 in the embodiment. FIG. 10 describes an example of the subsequent flow for the case where step S201 has been performed, and likewise for the case where the processing of step S601 is performed. Hereinafter, processing identical to the processing shown in FIGS. 3 to 9 is given the same reference signs as in FIGS. 3 to 9, and its description is omitted.
 Starting from step S201, the processing is executed in the order of step S202, step S203, step S204, step S205, step S206, step S402, step S601, step S602, step S604, step S605, step S606, and step S702. After step S702, it is determined whether an end condition is satisfied (step S801). The end condition is, for example, that the number of learning iterations has exceeded a predetermined number. Whether the end condition is satisfied is determined, for example, by the management unit 102 described later.
 When the end condition is satisfied (step S801: YES), the processing ends. When the end condition is not satisfied (step S801: NO), the processing of step S301 is executed, followed by the processing of step S302. After step S302, the processing returns to step S203.
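 The overall alternation shown in FIG. 10 (generator learning, discriminator learning, and the end-condition check) can be summarised as the control loop sketched below. The step functions are hypothetical placeholders for the processing blocks of the embodiment; only the ordering of the steps and the iteration-count end condition follow the text.

```python
# Control-flow sketch of the loop in FIG. 10 (placeholder step functions).
MAX_LEARNING_ROUNDS = 100_000   # assumed value; the text only requires "a predetermined number"

def run_learning(forward_learning, forward_disc_learning,
                 inverse_learning, inverse_disc_learning, regenerate_forward):
    rounds = 0
    while True:
        forward_learning()        # steps S202-S206: forward conversion learning
        forward_disc_learning()   # step S402: first identification unit learning
        inverse_learning()        # steps S601-S606: inverse conversion learning
        inverse_disc_learning()   # step S702: second identification unit learning
        rounds += 1
        if rounds >= MAX_LEARNING_ROUNDS:   # step S801: end condition
            break
        regenerate_forward()      # steps S301-S302: forward-convert the inverse
                                  # conversion signal before the next round

# run_learning(lambda: None, lambda: None, lambda: None, lambda: None, lambda: None)
```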
 The processing of step S206 and the processing of step S402 may be executed in the reverse order. Likewise, the processing of step S606 and the processing of step S702 may be executed in the reverse order.
 When, instead of the processing of step S201, the first input determination unit 130 determines that the voice signal to be input to the first identification unit 120 is a natural signal, the processing from step S602 to step S302 is not executed. In such a case, the processing ends after the processing of FIG. 6 is executed.
 When, instead of the processing of step S601, the second input determination unit 170 determines that the voice signal to be input to the second identification unit 160 is a natural signal, the processing of steps S602 to S604 and the processing of step S606 are not executed.
 In this way, by executing the forward conversion signal identification process, the forward conversion learning process, the forward conversion signal identification learning process, the inverse conversion signal identification process, the inverse conversion learning process, and the inverse conversion signal identification learning process, the voice signal conversion model learning device 1 obtains a voice signal conversion model whose natural signal degree becomes higher with each round of learning.
 FIG. 11 is a diagram showing an example of the hardware configuration of the voice signal conversion model learning device 1 in the embodiment.
 The voice signal conversion model learning device 1 includes a control unit 10 including a processor 91, such as a CPU (Central Processing Unit), and a memory 92 connected by a bus, and executes a program. By executing the program, the voice signal conversion model learning device 1 functions as a device including the control unit 10, an input unit 11, an interface unit 12, a storage unit 13, and an output unit 14. More specifically, the processor 91 reads the program stored in the storage unit 13 and stores the read program in the memory 92. By the processor 91 executing the program stored in the memory 92, the voice signal conversion model learning device 1 functions as a device including the control unit 10, the input unit 11, the interface unit 12, the storage unit 13, and the output unit 14.
 The control unit 10 controls the operation of the various functional units included in the voice signal conversion model learning device 1. The control unit 10 executes, for example, the forward conversion signal identification process, the forward conversion learning process, the forward conversion signal identification learning process, the inverse conversion signal identification process, the inverse conversion learning process, and the inverse conversion signal identification learning process.
 The input unit 11 includes an input device such as a mouse, a keyboard, or a touch panel. The input unit 11 may be configured as an interface that connects such an input device to the device itself. The input unit 11 receives input of various kinds of information to the device. The input unit 11 receives, for example, an input instructing the start of learning. The input unit 11 also receives, for example, an input of a synthesized signal to be added to the synthesized signal group and an input of a natural signal to be added to the natural signal group.
 The interface unit 12 includes a communication interface for connecting the device to an external device. The interface unit 12 communicates with the external device by wire or wirelessly. The external device may be, for example, a storage device such as a USB (Universal Serial Bus) memory. When the external device outputs, for example, a synthesized signal, the interface unit 12 acquires the synthesized signal output by the external device through communication with the external device. When the external device outputs, for example, a natural signal, the interface unit 12 acquires the natural signal output by the external device through communication with the external device.
 The interface unit 12 also includes a communication interface for connecting the device to the voice signal conversion device 2. The interface unit 12 communicates with the voice signal conversion device 2 by wire or wirelessly. Through communication with the voice signal conversion device 2, the interface unit 12 outputs the voice signal conversion model to the voice signal conversion device 2.
 The storage unit 13 is configured using a non-transitory computer-readable storage medium device such as a magnetic hard disk device or a semiconductor storage device. The storage unit 13 stores various kinds of information related to the voice signal conversion model learning device 1. The storage unit 13 stores, for example, the natural signal group and the synthesized signal group in advance. The storage unit 13 also stores, for example, synthesized signals and natural signals input via the input unit 11 or the interface unit 12, and stores, for example, the identification results of the first identification unit 120.
 The storage unit 13 further stores, for example, the identification results of the second identification unit 160, the determination results of the first determination unit 140 and the second determination unit 180, the determination results of the first input determination unit 130 and the second input determination unit 170, and the voice signal conversion model.
 The output unit 14 outputs various kinds of information. The output unit 14 includes a display device such as a CRT (Cathode Ray Tube) display, a liquid crystal display, or an organic EL (Electro-Luminescence) display. The output unit 14 may be configured as an interface that connects such a display device to the device itself. The output unit 14 outputs, for example, information input to the input unit 11.
 FIG. 12 is a diagram showing an example of the functional configuration of the control unit 10 in the embodiment. The control unit 10 includes a managed unit 101 and a management unit 102. The managed unit 101 includes the first generation unit 110, the first identification unit 120, the first input determination unit 130, the first determination unit 140, the second generation unit 150, the second identification unit 160, the second input determination unit 170, and the second determination unit 180. The managed unit 101 obtains the voice signal conversion model by executing the forward conversion signal identification process, the forward conversion learning process, the forward conversion signal identification learning process, the inverse conversion signal identification process, the inverse conversion learning process, and the inverse conversion signal identification learning process using the voice signals included in the natural signal group and the synthesized signal group. Specifically, the voice signal conversion model is a trained model representing the forward conversion processing performed by the first generation unit 110.
 The management unit 102 controls the operation of the managed unit 101. For example, the management unit 102 controls the timing at which the managed unit 101 executes each of the forward conversion signal identification process, the forward conversion learning process, the forward conversion signal identification learning process, the inverse conversion signal identification process, the inverse conversion learning process, and the inverse conversion signal identification learning process.
 The management unit 102 also controls, for example, the operations of the input unit 11, the interface unit 12, the storage unit 13, and the output unit 14. The management unit 102 reads, for example, various kinds of information from the storage unit 13 and outputs them to the managed unit 101. The management unit 102 acquires, for example, information input to the input unit 11 and outputs it to the managed unit 101 or records it in the storage unit 13. The management unit 102 likewise acquires, for example, information input to the interface unit 12 and outputs it to the managed unit 101 or records it in the storage unit 13. The management unit 102 also causes the output unit 14 to output, for example, information input to the input unit 11.
 The management unit 102 records, for example, the identification results of the first identification unit 120 and the second identification unit 160 in the storage unit 13. The management unit 102 likewise records, for example, the determination results of the first determination unit 140 and the second determination unit 180 and the determination results of the first input determination unit 130 and the second input determination unit 170 in the storage unit 13.
 FIG. 13 is a diagram showing an example of the hardware configuration of the voice signal conversion device 2 in the embodiment.
 The voice signal conversion device 2 includes a control unit 20 including a processor 93, such as a CPU, and a memory 94 connected by a bus, and executes a program. By executing the program, the voice signal conversion device 2 functions as a device including the control unit 20, an input unit 21, an interface unit 22, a storage unit 23, and an output unit 24. More specifically, the processor 93 reads the program stored in the storage unit 23 and stores the read program in the memory 94. By the processor 93 executing the program stored in the memory 94, the voice signal conversion device 2 functions as a device including the control unit 20, the input unit 21, the interface unit 22, the storage unit 23, and the output unit 24.
 The control unit 20 controls the operation of the various functional units included in the voice signal conversion device 2. The control unit 20 converts an unnaturally synthesized signal into a naturally synthesized signal using, for example, the voice signal conversion model obtained by the voice signal conversion model learning device 1.
 The input unit 21 includes an input device such as a mouse, a keyboard, or a touch panel. The input unit 21 may be configured as an interface that connects such an input device to the device itself. The input unit 21 receives input of various kinds of information to the device. The input unit 21 receives, for example, an input instructing the start of the processing of converting an unnaturally synthesized signal into a naturally synthesized signal. The input unit 21 also receives, for example, an input of the unnaturally synthesized signal to be converted.
 The interface unit 22 includes a communication interface for connecting the device to an external device. The interface unit 22 communicates with the external device by wire or wirelessly. The external device is, for example, the output destination of the naturally synthesized signal. In such a case, the interface unit 22 outputs the naturally synthesized signal to the external device through communication with the external device. The external device serving as the output destination of the naturally synthesized signal is, for example, an audio output device such as a speaker.
 The external device may also be, for example, a storage device such as a USB memory that stores the voice signal conversion model. When the external device stores the voice signal conversion model and outputs it, the interface unit 22 acquires the voice signal conversion model through communication with the external device.
 The external device may also be, for example, the output source of an unnaturally synthesized signal. In such a case, the interface unit 22 acquires the unnaturally synthesized signal from the external device through communication with the external device.
 The interface unit 22 further includes a communication interface for connecting the device to the voice signal conversion model learning device 1. The interface unit 22 communicates with the voice signal conversion model learning device 1 by wire or wirelessly, and acquires the voice signal conversion model from the voice signal conversion model learning device 1 through that communication.
 The storage unit 23 is configured using a non-transitory computer-readable storage medium device such as a magnetic hard disk device or a semiconductor storage device. The storage unit 23 stores various kinds of information related to the voice signal conversion device 2. The storage unit 23 stores, for example, the voice signal conversion model acquired via the interface unit 22.
 The output unit 24 outputs various kinds of information. The output unit 24 includes a display device such as a CRT display, a liquid crystal display, or an organic EL display. The output unit 24 may be configured as an interface that connects such a display device to the device itself. The output unit 24 outputs, for example, information input to the input unit 21.
 FIG. 14 is a diagram showing an example of the functional configuration of the control unit 20 in the embodiment. The control unit 20 includes a conversion target acquisition unit 201, a conversion unit 202, and a voice signal output control unit 203.
 The conversion target acquisition unit 201 acquires the unnaturally synthesized signal to be converted. The conversion target acquisition unit 201 acquires, for example, an unnaturally synthesized signal input to the input unit 21 or to the interface unit 22.
 The conversion unit 202 converts the conversion target acquired by the conversion target acquisition unit 201 into a naturally synthesized signal using the voice signal conversion model. The naturally synthesized signal is output to the voice signal output control unit 203.
 The voice signal output control unit 203 controls the operation of the interface unit 22. By controlling the operation of the interface unit 22, the voice signal output control unit 203 causes the interface unit 22 to output the naturally synthesized signal.
 FIG. 15 is a flowchart showing an example of the flow of processing executed by the voice signal conversion device 2 in the embodiment. The control unit 20 acquires an unnaturally synthesized signal input to the interface unit 22 (step S901). Next, the control unit 20 converts the unnaturally synthesized signal into a naturally synthesized signal using the voice signal conversion model stored in the storage unit 23 (step S902). The control unit 20 then controls the operation of the interface unit 22 so that the naturally synthesized signal is output to an output destination (step S903). The output destination is, for example, an external device such as a speaker.
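 As an illustration only, the conversion flow of steps S901 to S903 could look like the following sketch: a trained conversion model is loaded, applied to an unnaturally synthesized waveform, and the result is written out for playback. The file names, the use of a serialized PyTorch model, and the soundfile library are assumptions made for this example and are not part of the embodiment.

```python
# Illustrative inference sketch for the voice signal conversion device 2
# (hypothetical file names and model format; assumes a mono recording).
import torch
import soundfile as sf

model = torch.jit.load("conversion_model.pt")      # assumed serialized voice signal conversion model
model.eval()

waveform, sample_rate = sf.read("unnaturally_synthesized.wav", dtype="float32")  # step S901
x = torch.from_numpy(waveform).view(1, 1, -1)

with torch.no_grad():
    converted = model(x).view(-1).numpy()          # step S902: apply the conversion model

sf.write("naturally_synthesized.wav", converted, sample_rate)                    # step S903: output
```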
 (Experimental Results)
 FIGS. 16 and 17 show the results of a comparison experiment (hereinafter referred to as the "first experiment") between the voice signal conversion model obtained by the voice signal generation system 100 and voice signal conversion models obtained by other learning methods.
 The first experiment was conducted using 437 sentences included in a Japanese speech data set of a female narrator. Of the 437 sentences, 407 sentences (about one hour) were used to obtain the voice signal conversion models, and 30 sentences (four minutes) were used to obtain five-level MOS (Mean Opinion Score) ratings of the naturalness of the sound quality. The audio sampling rate was 22.05 kHz. There were ten subjects, and each subject evaluated 30 sentences and 20 sentences randomly selected for each learning method.
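 For reference, a five-level MOS is simply the mean of the listeners' ratings on a 1 to 5 scale per system. The snippet below is a minimal illustration with made-up ratings; the actual scores of the first experiment are those plotted in FIGS. 16 and 17.

```python
# Minimal MOS computation over hypothetical listener ratings (1-5 scale).
import numpy as np

ratings = {"SPSS": [3, 3, 4, 2, 3], "V2msp": [5, 4, 4, 5, 4]}    # made-up example scores
for system, scores in ratings.items():
    scores = np.asarray(scores, dtype=float)
    mos = scores.mean()
    ci95 = 1.96 * scores.std(ddof=1) / np.sqrt(len(scores))      # normal-approximation 95% interval
    print(f"{system}: MOS = {mos:.2f} +/- {ci95:.2f}")
```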
 FIG. 16 is a first diagram showing an example of the results of the first experiment, and FIG. 17 is a second diagram showing an example of the results of the first experiment. The horizontal axes of FIGS. 16 and 17 indicate the method used to obtain the voice signal conversion model, and the vertical axes indicate the five-level MOS rating of the naturalness of the sound quality. The dotted horizontal lines in FIGS. 16 and 17 represent the evaluation result for natural speech.
 "SPSS" denotes DNN (Deep Neural Network) text-to-speech synthesis (SPSS: Statistical Parametric Speech Synthesis). "GANv" denotes a correction method applied to voice features. "V1" denotes a method that uses a downsampling module with a convolutional neural network.
 "V2" denotes a method of obtaining a voice signal conversion model with a voice signal generation system 100 that includes a first simple identification unit in place of the first identification unit 120 and a second simple identification unit in place of the second identification unit 160. The first simple identification unit includes the voice waveform identification unit 121 but not the voice feature identification unit 122 or the integrated identification unit 123, and is a discriminator that identifies, from the waveform of an input voice signal, whether the input voice signal is a natural signal or a forward conversion signal. The second simple identification unit includes the voice waveform identification unit 161 but not the voice feature identification unit 162 or the integrated identification unit 163, and is a discriminator that identifies, from the waveform of an input voice signal, whether the input voice signal is an inverse conversion signal or a pre-synthesized signal.
 "V2msp" denotes a method of obtaining a voice signal conversion model with a voice signal generation system 100 that includes a third simple identification unit in place of the first identification unit 120 and a fourth simple identification unit in place of the second identification unit 160. The third simple identification unit includes the voice waveform identification unit 121, the voice feature identification unit 122, and the integrated identification unit 123, and its voice feature identification unit 122 uses the mel spectrogram of the voice signal to be identified as the feature used for identification. The fourth simple identification unit includes the voice waveform identification unit 161, the voice feature identification unit 162, and the integrated identification unit 163, and its voice feature identification unit 162 likewise uses the mel spectrogram of the voice signal to be identified as the feature used for identification.
 "V2ph" denotes a method of obtaining a voice signal conversion model with a voice signal generation system 100 that includes a fifth simple identification unit in place of the first identification unit 120 and a sixth simple identification unit in place of the second identification unit 160. The fifth simple identification unit includes the voice waveform identification unit 121, the voice feature identification unit 122, and the integrated identification unit 123, and its voice feature identification unit 122 uses the phase spectrum of the voice signal to be identified as the feature used for identification. The sixth simple identification unit includes the voice waveform identification unit 161, the voice feature identification unit 162, and the integrated identification unit 163, and its voice feature identification unit 162 likewise uses the phase spectrum of the voice signal to be identified as the feature used for identification.
 "V2mfcc" denotes a method of obtaining a voice signal conversion model with a voice signal generation system 100 that includes a seventh simple identification unit in place of the first identification unit 120 and an eighth simple identification unit in place of the second identification unit 160. The seventh simple identification unit includes the voice waveform identification unit 121, the voice feature identification unit 122, and the integrated identification unit 123, and its voice feature identification unit 122 uses the mel-frequency cepstral coefficients of the voice signal to be identified as the feature used for identification. The eighth simple identification unit includes the voice waveform identification unit 161, the voice feature identification unit 162, and the integrated identification unit 163, and its voice feature identification unit 162 likewise uses the mel-frequency cepstral coefficients of the voice signal to be identified as the feature used for identification.
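 The three discriminator input features compared here (mel spectrogram for V2msp, phase spectrum for V2ph, and mel-frequency cepstral coefficients for V2mfcc) could be extracted, for example, as in the following sketch. The use of librosa and the parameter values (FFT size, hop length, number of mel bands and coefficients) are assumptions made for illustration and are not specified by the embodiment.

```python
# Sketch of the three feature types fed to the voice feature identification units.
import numpy as np
import librosa

sr = 22050
t = np.arange(sr) / sr
y = 0.5 * np.sin(2 * np.pi * 220.0 * t).astype(np.float32)       # 1-second stand-in waveform

mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024, hop_length=256, n_mels=80)
log_mel = np.log(mel + 1e-6)                                      # mel spectrogram (V2msp)

stft = librosa.stft(y, n_fft=1024, hop_length=256)
phase = np.angle(stft)                                            # phase spectrum (V2ph)

mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=25)                # mel-frequency cepstral coefficients (V2mfcc)
```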
 FIGS. 16 and 17 show that V1 greatly improves the sound quality compared with SPSS. Here, an improvement in sound quality means that the natural signal degree becomes higher. FIGS. 16 and 17 also show that V1 improves the sound quality over GANv, that V2 improves the sound quality over SPSS, and that V2 does not improve the sound quality over V1; this is because V2 generates noisier speech than V1. FIGS. 16 and 17 further show that V2msp and V2mfcc obtain higher MOS ratings than V1, V2, V2ph, SPSS, and GANv.
 FIGS. 16 and 17 also show that, for V2msp and V2mfcc, the p-value of a two-sided Mann-Whitney test is 0.05 or more, which indicates that there is no statistically significant difference between the voice signals converted by V2msp or V2mfcc and the natural signals. FIGS. 16 and 17 show that V2ph produces noisy speech and has a lower MOS rating than V2. The results in FIGS. 16 and 17 suggest that it is effective to use the voice waveform discriminators (that is, the voice waveform identification units 121 and 161) together with the voice feature identification units 122 and 162. In FIGS. 16 and 17, "V2msp", "V2ph", and "V2mfcc" are examples of processing that converts speech using the voice signal conversion model obtained by the voice signal generation system 100.
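 The reported two-sided Mann-Whitney test can be reproduced, for example, with SciPy as in the sketch below; the score arrays are made-up stand-ins for the listeners' MOS ratings, not the data of the experiment.

```python
# Two-sided Mann-Whitney U test between hypothetical MOS ratings; a p-value of
# 0.05 or more means no statistically significant difference is detected.
from scipy.stats import mannwhitneyu

converted = [4, 5, 4, 4, 5, 3, 4, 5]   # made-up ratings for converted speech
natural = [5, 4, 4, 5, 4, 5, 4, 4]     # made-up ratings for recorded natural speech

stat, p_value = mannwhitneyu(converted, natural, alternative="two-sided")
print(f"U = {stat}, p = {p_value:.3f}")
```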
 FIG. 18 shows the results of a comparison experiment (hereinafter referred to as the "second experiment") between the voice signal conversion model obtained by the voice signal generation system 100 and voice signal conversion models obtained by other learning methods.
 The second experiment was conducted using 13,100 sentences included in the English speech data set LJSpeech (see Reference 1). Of the 13,100 sentences, 40 sentences were used to obtain five-level MOS ratings of the naturalness of the sound quality. The audio sampling rate was 22.05 kHz. There were 14 subjects, and each subject evaluated 15 sentences for each learning method. In the second experiment, spectral distortion was also calculated.
 FIG. 18 is a diagram showing an example of the results of the second experiment. FIG. 18 shows, for each learning method, the LSD (least squared distance) value and the MOS evaluation result. WORLD is the method described in Reference 2, Griffin-Lim is the method described in Reference 3, Open WaveNet is the method described in Reference 4, and WaveGlow is the method described in Reference 5.
 Reference 1: "The LJ Speech Dataset" [online] [retrieved March 30, 2020], Internet <URL: https://keithito.com/LJ-Speech-Dataset/>
 Reference 2: M. Morise, F. Yokomori, and K. Ozawa, "WORLD: a vocoder-based high-quality speech synthesis system for real-time applications," IEICE Transactions on Information and Systems, vol. 99, no. 7, pp. 1877-1884, 2016.
 Reference 3: D. Griffin and J. Lim, "Signal estimation from modified short-time Fourier transform," IEEE Transactions on Audio, Speech and Language Processing (TASLP), vol. 32, no. 2, pp. 236-243, 1984.
 Reference 4: Ryuichi Yamamoto et al., "WaveNet vocoder" [online] [retrieved March 30, 2020], Internet <URL: https://doi.org/10.5281/zenodo.1472609>
 Reference 5: R. Prenger, R. Valle, and B. Catanzaro, "WaveGlow: A flow-based generative network for speech synthesis," 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 3617-3621, 2019.
 FIG. 18 shows that Griffin-Lim has the lowest LSD. FIG. 18 also shows that the LSD of WORLD is large because WORLD is a parametric vocoder and therefore introduces large distortion. On the other hand, FIG. 18 shows that there is no difference in MOS rating between Griffin-Lim and WORLD.
 FIG. 18 shows that, when WaveGlow and open WaveNet are compared, open WaveNet has the larger LSD, whereas WaveGlow has the higher MOS rating. These results indicate that an LSD of around 4 is unlikely to affect the MOS rating. FIG. 18 shows that V2msp obtains the highest LSD and the highest MOS rating.
 In FIG. 18, "Recorded" denotes the recorded speech itself, which is the target speech for the converted speech. Therefore, there is no LSD value corresponding to "Recorded".
 The voice signal generation system 100 converts an input waveform into a waveform with a higher natural signal degree. Therefore, even when speech whose bandwidth has been reduced compared with the original speech (degraded speech) is input, the voice signal generation system 100 can convert the input speech into speech whose bandwidth has been restored. This means that the voice signal generation system 100 has expanded the bandwidth.
 FIG. 19 shows the results of a comparison experiment (hereinafter referred to as the "third experiment") between the voice signal conversion model obtained by the voice signal generation system 100 and voice signal conversion models obtained by other learning methods.
 In the third experiment, 50 sentences randomly selected for each of 60 speakers randomly chosen from the 109 speakers included in the English speech data set VCTK (see Reference 6), 3,000 sentences in total, were used to obtain the voice signal conversion models. Then, two male and two female speakers were randomly selected from the remaining speakers, two utterances were randomly selected for each selected speaker (eight sentences in total), and a MUSHRA test on sound quality was conducted.
 Reference 6: Ryuichi Yamamoto et al., "WaveNet vocoder" [online] [retrieved March 30, 2020], Internet <URL: https://doi.org/10.5281/zenodo.1472609>
 FIG. 19 is a diagram showing an example of the results of the third experiment. The vertical axis of FIG. 19 indicates the MUSHRA test results, and the horizontal axis indicates the method to be evaluated. "48" on the horizontal axis of FIG. 19 denotes natural speech sampled at 48 kHz. "16to48" on the horizontal axis of FIG. 19 denotes speech whose bandwidth has been expanded by the voice signal generation system 100.
 "8to48" and "8to16to48" on the horizontal axis of FIG. 19 also denote speech whose bandwidth has been expanded by the voice signal generation system 100. The differences among "16to48", "8to48", and "8to16to48" are as follows.
 "16to48" denotes the speech obtained when "16" is input to the voice signal conversion device 2 as degraded speech, the voice signal conversion model obtained by the voice signal generation system 100 is applied to "16", and the bandwidth of "16" is expanded up to 48 kHz. "16" denotes speech sampled at 48 kHz and then downsampled to 16 kHz.
 "8to48" denotes the speech obtained when "8" is input to the voice signal conversion device 2 as degraded speech, the voice signal conversion model obtained by the voice signal generation system 100 is applied to "8", and the bandwidth of "8" is expanded up to 48 kHz. "8" denotes speech sampled at 48 kHz and then downsampled to 8 kHz.
 "8to16to48" denotes the speech obtained when "8" is input to the voice signal conversion device 2 as degraded speech and converted to "16", and then "16" is input to the voice signal conversion device 2 as degraded speech and converted to "48". "48" denotes speech sampled at 48 kHz.
 "16" on the horizontal axis of FIG. 19 denotes natural speech downsampled to 16 kHz, and "4" denotes natural speech downsampled to 4 kHz.
 FIG. 19 shows that "16to48" differs little from the original sound, whereas "8to48" is degraded considerably compared with the original sound. The reason for the degradation is that most of the information in speech is concentrated at 16 kHz and below, so downsampling to 8 kHz greatly reduces the amount of information and learning does not proceed well. FIG. 19 also shows that "8to16to48" has higher sound quality than "8to48".
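 As an illustration of how the degraded inputs of the third experiment could be prepared and chained, the sketch below downsamples a 48 kHz waveform with SciPy; convert_8_to_16 and convert_16_to_48 are hypothetical stand-ins for trained conversion models of the voice signal conversion device 2 and are therefore left as comments.

```python
# Sketch of preparing degraded inputs for the "16to48", "8to48" and "8to16to48" conditions.
import numpy as np
from scipy.signal import resample_poly

def downsample(x, orig_sr, target_sr):
    return resample_poly(x, up=target_sr, down=orig_sr)

sr = 48000
t = np.arange(sr) / sr
speech_48k = 0.3 * np.sin(2 * np.pi * 200.0 * t)       # stand-in for a 48 kHz recording

speech_16k = downsample(speech_48k, 48000, 16000)       # input for the "16to48" condition
speech_8k = downsample(speech_48k, 48000, 8000)         # input for the "8to48" condition

# "8to16to48": first restore the 8 kHz material to 16 kHz quality, then expand it to 48 kHz.
# restored_16k = convert_8_to_16(speech_8k)              # hypothetical trained model
# expanded_48k = convert_16_to_48(restored_16k)          # hypothetical trained model
```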
 The voice signal generation system 100 of the embodiment configured as described above obtains a voice signal conversion model by executing the forward conversion signal identification process, the forward conversion learning process, the forward conversion signal identification learning process, the inverse conversion signal identification process, the inverse conversion learning process, and the inverse conversion signal identification learning process using both the voice waveform and the voice features of the voice signal, rather than only one of them. Therefore, the voice signal generation system 100 configured in this way can generate a voice signal with a higher natural signal degree than when a voice signal conversion model is obtained using only the voice waveform; that is, it can generate speech closer to the speech uttered by a human. An example of a method of obtaining a voice signal conversion model using only the voice waveform is SEGAN (Speech Enhancement Generative Adversarial Network).
 Furthermore, because the voice signal generation system 100 of the embodiment configured as described above obtains the voice signal conversion model by executing the forward conversion signal identification process, the forward conversion learning process, the forward conversion signal identification learning process, the inverse conversion signal identification process, the inverse conversion learning process, and the inverse conversion signal identification learning process, it can generate speech closer to the speech uttered by a human than when a voice signal conversion model is obtained only by a convolutional neural network using the voice waveform and the voice features.
 Because the voice signal generation system 100 of the embodiment configured as described above uses the forward conversion signal identification process, the forward conversion learning process, the forward conversion signal identification learning process, the inverse conversion signal identification process, the inverse conversion learning process, and the inverse conversion signal identification learning process, it can generate speech close to the speech uttered by a human even when the alignment of the voice signals used for learning is low. The voice signal generation system 100 therefore has the advantage that its applicable situations are less limited than those of SEGAN (Speech Enhancement Generative Adversarial Networks) (see Reference 7), which is effective only when the alignment is high. High alignment means that the difference between the voice signals used for learning and the ideal voice signals that the user wants the voice signal generation system 100 to output is small. A learning voice signal with high alignment is, for example, a voice signal obtained by superimposing noise on speech recorded in an ideal environment on a computer to simulate speech in a noisy environment and then removing the noise. Learning voice signals with low alignment are, for example, synthesized speech generated by text-to-speech synthesis or voice conversion; because the lengths of such voice signals also differ from signal to signal, their alignment is low in this respect as well.
 Reference 7: S. Pascual et al., "SEGAN: Speech enhancement generative adversarial network," 2017 Annual Conference of the International Speech Communication Association (INTERSPEECH), pp. 3642-3646, 2017.
 (Modifications)
 The method by which the voice signal generation system 100 generates the voice signal conversion model does not necessarily have to be a convolutional CycleGAN. The method by which the voice signal generation system 100 generates the voice signal conversion model (hereinafter referred to as the "model generation method") may be any method that satisfies the following learning method conditions.
 The learning method conditions include a first condition. The first condition is that the model generation method uses a first generator that outputs a forward conversion signal, which is a signal with a higher natural signal degree than the input voice signal, by executing forward conversion processing, which is a conversion that increases the natural signal degree, on the input voice signal.
 The learning method conditions include a second condition. The second condition is that the model generation method uses a first discriminator that identifies whether an input signal is a forward conversion signal or a natural signal.
 The learning method conditions include a third condition. The third condition is that the model generation method uses a second generator that outputs an inverse conversion signal, which has a lower natural signal degree than the forward conversion signal, by executing inverse conversion processing, which is a conversion that lowers the natural signal degree, on an input signal.
 The learning method conditions include a fourth condition. The fourth condition is that the model generation method uses a second discriminator that identifies whether an input signal is a pre-synthesized signal, which is a synthesized signal prepared in advance, or an inverse conversion signal. The synthesized signal that the second identification unit 160 reads from the synthesized signal group is an example of the pre-synthesized signal.
 The learning method conditions include a fifth condition. The fifth condition is that the first generator, the first discriminator, the second generator, and the second discriminator learn based on the identification results of the first discriminator and the identification results of the second discriminator.
 The learning method conditions may further include the following weak discriminator condition. The weak discriminator condition is that at least one of the first discriminator and the second discriminator learns using a voice waveform discriminator and a voice feature discriminator. The model generation method may therefore be, for example, a method that uses a third generator different from the first generator and the second generator and a third discriminator different from the first discriminator and the second discriminator.
 The first generation unit 110 is an example of the first generator, the first identification unit 120 is an example of the first discriminator, the second generation unit 150 is an example of the second generator, and the second identification unit 160 is an example of the second discriminator.
 As long as the method of generating the voice signal conversion model satisfies at least the first to fifth conditions, the voice signal generation system 100 can generate speech close to the speech uttered by a human even when the alignment of the voice signals used for learning is low.
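 The data flow implied by the first to fifth conditions can be summarised in the following sketch, in which G, F, D_nat, and D_syn are placeholder callables standing in for the first generator, the second generator, the first discriminator, and the second discriminator. Only the wiring follows the conditions; the concrete losses and update rules are those described above, and nothing in the sketch is specific to a convolutional CycleGAN.

```python
# Data-flow sketch of learning method conditions 1 to 5 (placeholder networks).
def one_learning_round(G, F, D_nat, D_syn, input_signal, natural_signal, presynthesized_signal):
    forward_signal = G(input_signal)              # condition 1: raise the natural signal degree
    p_natural_real = D_nat(natural_signal)        # condition 2: natural vs. forward conversion signal
    p_natural_fake = D_nat(forward_signal)
    inverse_signal = F(forward_signal)            # condition 3: lower the natural signal degree again
    p_synth_real = D_syn(presynthesized_signal)   # condition 4: pre-synthesized vs. inverse conversion signal
    p_synth_fake = D_syn(inverse_signal)
    # condition 5: G, F, D_nat and D_syn are all updated from these identification results
    return p_natural_real, p_natural_fake, p_synth_real, p_synth_fake
```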
 The voice waveform identification unit 121 and the voice waveform identification unit 161 may identify voice signals based on a frequency spectrum converted according to a perceptual scale of pitch. The perceptual scale of pitch is, for example, the mel scale. A frequency spectrum converted according to the perceptual scale of pitch is, for example, a spectrum represented by mel-frequency cepstral coefficients. The frequency spectrum may be, for example, a phase spectrum or an amplitude spectrum. A frequency spectrum converted according to the perceptual scale of pitch may also be, for example, a mel spectrogram. By relying on a perceptual scale of pitch in this way, information about human perception can also be used for generating speech, so the voice signal generation system 100 can generate speech even closer to the speech uttered by a human.
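 As an illustration of such a perceptual scale, one widely used hertz-to-mel mapping is shown below; the embodiment does not depend on this particular formula.

```python
# A common (HTK-style) hertz-to-mel conversion, shown only as an illustration
# of a perceptual scale of pitch.
import math

def hz_to_mel(f_hz: float) -> float:
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

print(hz_to_mel(1000.0))   # 1000 Hz is roughly 1000 mel on this scale
```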
 なお、音声信号変換モデル学習装置1は必ずしも入力された音声信号を人間の発する音声に近い音声の音声信号に変換する学習モデルを学習する必要は無い。音声信号変換モデル学習装置1は入力された音声信号を犬や猫等の人間以外の動物の音声に近い音声の音声信号に変換する学習モデルを学習してもよい。このような場合、音声信号変換装置2は入力された音声を人間以外の動物の音声に近い音声信号に変換する。上述の通り、本実施形態における動物は人間を含む。 Note that the audio signal conversion model learning device 1 does not necessarily have to learn a learning model that converts an input audio signal into an audio signal that is close to the audio emitted by a human being. The voice signal conversion model learning device 1 may learn a learning model that converts an input voice signal into a voice signal of a voice close to the voice of an animal other than a human such as a dog or a cat. In such a case, the voice signal conversion device 2 converts the input voice into a voice signal close to the voice of an animal other than a human. As described above, the animals in this embodiment include humans.
 なお、不自然信号と自然合成信号とは、同じ種別の動物の音声信号であることが望ましいが必ずしも同じでなくてもよい。 It is desirable that the unnatural signal and the naturally synthesized signal are audio signals of the same type of animal, but they do not necessarily have to be the same.
 なお、被管理部101は学習部の一例である。なお、不自然信号は入力信号の一例である。 The managed unit 101 is an example of the learning unit. The unnatural signal is an example of an input signal.
 音声信号変換モデル学習装置1は、ネットワークを介して通信可能に接続された複数台の情報処理装置を用いて実装されてもよい。この場合、音声信号変換モデル学習装置1が備える各機能部は、複数の情報処理装置に分散して実装されてもよい。例えば、第1生成部110と、第1識別部120と、第2生成部150と、第2識別部160とはそれぞれ異なる情報処理装置に実装されてもよい。 The voice signal conversion model learning device 1 may be implemented using a plurality of information processing devices communicably connected via a network. In this case, the functional units included in the voice signal conversion model learning device 1 may be implemented in a distributed manner across the plurality of information processing devices. For example, the first generation unit 110, the first identification unit 120, the second generation unit 150, and the second identification unit 160 may be implemented in different information processing devices.
 音声信号変換装置2は、ネットワークを介して通信可能に接続された複数台の情報処理装置を用いて実装されてもよい。この場合、音声信号変換装置2が備える各機能部は、複数の情報処理装置に分散して実装されてもよい。 The voice signal conversion device 2 may be implemented using a plurality of information processing devices communicably connected via a network. In this case, the functional units included in the voice signal conversion device 2 may be implemented in a distributed manner across the plurality of information processing devices.
 なお、音声信号生成システム100の各機能の全て又は一部は、ASIC(Application Specific Integrated Circuit)やPLD(Programmable Logic Device)やFPGA(Field Programmable Gate Array)等のハードウェアを用いて実現されてもよい。プログラムは、コンピュータ読み取り可能な記録媒体に記録されてもよい。コンピュータ読み取り可能な記録媒体とは、例えばフレキシブルディスク、光磁気ディスク、ROM、CD-ROM等の可搬媒体、コンピュータシステムに内蔵されるハードディスク等の記憶装置である。プログラムは、電気通信回線を介して送信されてもよい。 All or part of the functions of the voice signal generation system 100 may be realized by using hardware such as an ASIC (Application Specific Integrated Circuit), a PLD (Programmable Logic Device), or an FPGA (Field Programmable Gate Array). The program may be recorded on a computer-readable recording medium. The computer-readable recording medium is, for example, a portable medium such as a flexible disk, a magneto-optical disk, a ROM, or a CD-ROM, or a storage device such as a hard disk built into a computer system. The program may be transmitted via a telecommunication line.
 以上、この発明の実施形態について図面を参照して詳述してきたが、具体的な構成はこの実施形態に限られるものではなく、この発明の要旨を逸脱しない範囲の設計等も含まれる。 Although the embodiments of the present invention have been described in detail with reference to the drawings, the specific configuration is not limited to this embodiment, and includes designs and the like within a range that does not deviate from the gist of the present invention.
 100…音声信号生成システム、 1…音声信号変換モデル学習装置、 2…音声信号変換装置、 10…制御部、 11…入力部、 12…インタフェース部、 13…記憶部、 14…出力部、 101…被管理部、 102…管理部、 110…第1生成部、 120…第1識別部、 121…音声波形識別部、 122…音声特徴量識別部、 123…統合識別部、 130…第1入力決定部、 140…第1判定部、 150…第2生成部、 160…第2識別部、 161…音声波形識別部、 162…音声特徴量識別部、 163…統合識別部、 170…第2入力決定部、 180…第2判定部、 20…制御部、 21…入力部、 22…インタフェース部、 23…記憶部、 24…出力部、 201…変換対象取得部、 202…変換部、 203…音声信号出力制御部、 91…プロセッサ、 92…メモリ、 93…プロセッサ、 94…メモリ 100 ... Voice signal generation system, 1 ... Voice signal conversion model learning device, 2 ... Voice signal conversion device, 10 ... Control unit, 11 ... Input unit, 12 ... Interface unit, 13 ... Storage unit, 14 ... Output unit, 101 ... Managed unit, 102 ... Management unit, 110 ... 1st generation unit, 120 ... 1st identification unit, 121 ... Voice waveform identification unit, 122 ... Voice feature amount identification unit, 123 ... Integrated identification unit, 130 ... 1st input determination Unit, 140 ... 1st judgment unit, 150 ... 2nd generation unit, 160 ... 2nd identification unit, 161 ... Voice waveform identification unit, 162 ... Voice feature amount identification unit, 163 ... Integrated identification unit, 170 ... 2nd input determination Unit, 180 ... 2nd judgment unit, 20 ... control unit, 21 ... input unit, 22 ... interface unit, 23 ... storage unit, 24 ... output unit, 201 ... conversion target acquisition unit, 202 ... conversion unit, 203 ... audio signal Output control unit, 91 ... processor, 92 ... memory, 93 ... processor, 94 ... memory

Claims (8)

  1.  入力された音声信号である入力信号を、実際に動物が発する音声である自然信号との類似の度合を示す自然信号度が前記入力信号よりも高い音声信号に変換する学習済みモデルを機械学習の方法で得る学習部、
     を備え、
     前記機械学習の方法は、入力された音声信号に対して自然信号度を高める変換である順変換処理を実行することで前記音声信号よりも自然信号度の高い信号である順変換信号を出力する第1生成部と、入力された信号が順変換信号と自然信号とのいずれであるかを識別する第1識別部と、入力された音声信号に対して自然信号度を低める変換である逆変換処理を実行することで前記音声信号よりも自然信号度の低い逆変換信号を出力する第2生成部と、入力された信号が予め用意された信号であって合成された信号である事前合成信号と逆変換信号とのいずれであるかを識別する第2識別部とが、前記第1識別部及び前記第2識別部の識別結果に基づいて学習する方法である、
     音声信号変換モデル学習装置。
    A voice signal conversion model learning device comprising:
    a learning unit that obtains, by a machine learning method, a trained model that converts an input signal, which is an input voice signal, into a voice signal whose natural signal degree, which indicates a degree of similarity to a natural signal that is a voice actually uttered by an animal, is higher than that of the input signal,
    wherein the machine learning method is a method in which a first generation unit, a first identification unit, a second generation unit, and a second identification unit learn based on identification results of the first identification unit and the second identification unit, the first generation unit outputting a forward conversion signal, which is a signal having a higher natural signal degree than an input voice signal, by executing on the voice signal a forward conversion process that is a conversion for raising the natural signal degree, the first identification unit identifying whether an input signal is a forward conversion signal or a natural signal, the second generation unit outputting an inverse conversion signal, which is a signal having a lower natural signal degree than an input voice signal, by executing on the voice signal an inverse conversion process that is a conversion for lowering the natural signal degree, and the second identification unit identifying whether an input signal is a pre-synthesized signal, which is a signal prepared in advance and synthesized, or an inverse conversion signal.
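    For illustration only, the following is a minimal sketch (in PyTorch) of one training step of a CycleGAN-style objective of the kind recited above, assuming a least-squares adversarial loss and an L1 cycle-consistency term: G performs the forward conversion, F_inv the inverse conversion, D_nat identifies forward conversion signals versus natural signals, and D_syn identifies inverse conversion signals versus pre-synthesized signals. The loss formulation, weights, and all function names are assumptions made for this sketch, not the claimed configuration.

```python
# Minimal sketch of one adversarial training step with cycle consistency.
import torch
import torch.nn.functional as F_loss

def training_step(G, F_inv, D_nat, D_syn, x_syn, y_nat,
                  opt_g, opt_d, lambda_cyc=10.0):
    # x_syn: batch of pre-synthesized (unnatural) signals
    # y_nat: batch of natural signals actually uttered by an animal
    y_fake = G(x_syn)          # forward conversion: raise the natural signal degree
    x_fake = F_inv(y_nat)      # inverse conversion: lower the natural signal degree

    # Generator update: fool both discriminators and keep cycle consistency.
    adv_g = F_loss.mse_loss(D_nat(y_fake), torch.ones_like(D_nat(y_fake))) \
          + F_loss.mse_loss(D_syn(x_fake), torch.ones_like(D_syn(x_fake)))
    cyc = F_loss.l1_loss(F_inv(y_fake), x_syn) + F_loss.l1_loss(G(x_fake), y_nat)
    loss_g = adv_g + lambda_cyc * cyc
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()

    # Discriminator update: natural vs forward-converted, pre-synthesized vs inverse-converted.
    loss_d = F_loss.mse_loss(D_nat(y_nat), torch.ones_like(D_nat(y_nat))) \
           + F_loss.mse_loss(D_nat(y_fake.detach()), torch.zeros_like(D_nat(y_fake.detach()))) \
           + F_loss.mse_loss(D_syn(x_syn), torch.ones_like(D_syn(x_syn))) \
           + F_loss.mse_loss(D_syn(x_fake.detach()), torch.zeros_like(D_syn(x_fake.detach())))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()
    return loss_g.item(), loss_d.item()
```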
  2.  前記機械学習の方法は、循環型敵対的学習(CycleGAN:Cycle Generative Adversarial Networks)の方法である、
     請求項1に記載の音声信号変換モデル学習装置。
    wherein the machine learning method is a method of cyclic adversarial learning (CycleGAN: Cycle Generative Adversarial Networks),
    The voice signal conversion model learning device according to claim 1.
  3.  前記第1識別部及び前記第2識別部の少なくとも1つは学習に用いる音声信号の波形に基づいて前記音声信号が自然信号か否かを識別する音声波形識別器と、前記音声信号から所定の条件を満たす情報である音声特徴量を取得し、取得した音声特徴量に基づいて前記音声信号が自然信号か否かを識別する音声特徴量識別器と、を用いて学習する、
     請求項1又は2に記載の音声信号変換モデル学習装置。
    wherein at least one of the first identification unit and the second identification unit learns using a voice waveform classifier that identifies whether or not a voice signal used for learning is a natural signal based on a waveform of the voice signal, and a voice feature amount classifier that acquires, from the voice signal, a voice feature amount that is information satisfying a predetermined condition and identifies whether or not the voice signal is a natural signal based on the acquired voice feature amount,
    The voice signal conversion model learning device according to claim 1 or 2.
  4.  前記音声波形識別器は、音高の知覚的尺度に基づいて変換された前記音声信号の周波数スペクトルである、
     請求項3に記載の音声信号変換モデル学習装置。
    wherein the voice waveform classifier is a frequency spectrum of the voice signal converted based on a perceptual scale of pitch,
    The voice signal conversion model learning device according to claim 3.
  5.  入力された音声信号である入力信号を、実際に動物が発する音声である自然信号との類似の度合を示す自然信号度が前記入力信号よりも高い音声信号に変換する学習済みモデルを機械学習の方法で得る学習部、を備え、前記機械学習の方法は、入力された音声信号に対して自然信号度を高める変換である順変換処理を実行することで前記音声信号よりも自然信号度の高い信号である順変換信号を出力する第1生成部と、入力された信号が順変換信号と自然信号とのいずれであるかを識別する第1識別部と、入力された音声信号に対して自然信号度を低める変換である逆変換処理を実行することで前記音声信号よりも自然信号度の低い逆変換信号を出力する第2生成部と、入力された信号が予め用意された信号であって合成された信号である事前合成信号と逆変換信号とのいずれであるかを識別する第2識別部とが、前記第1識別部及び前記第2識別部の識別結果に基づいて学習する方法である音声信号変換モデル学習装置が得た前記学習済みモデルを用いて、入力された音声信号を変換する変換部、
     を備える音声信号変換装置。
    A voice signal conversion device comprising:
    a conversion unit that converts an input voice signal by using the trained model obtained by a voice signal conversion model learning device, the voice signal conversion model learning device comprising a learning unit that obtains, by a machine learning method, a trained model that converts an input signal, which is an input voice signal, into a voice signal whose natural signal degree, which indicates a degree of similarity to a natural signal that is a voice actually uttered by an animal, is higher than that of the input signal, wherein the machine learning method is a method in which a first generation unit that outputs a forward conversion signal, which is a signal having a higher natural signal degree than an input voice signal, by executing on the voice signal a forward conversion process that is a conversion for raising the natural signal degree, a first identification unit that identifies whether an input signal is a forward conversion signal or a natural signal, a second generation unit that outputs an inverse conversion signal, which is a signal having a lower natural signal degree than an input voice signal, by executing on the voice signal an inverse conversion process that is a conversion for lowering the natural signal degree, and a second identification unit that identifies whether an input signal is a pre-synthesized signal, which is a signal prepared in advance and synthesized, or an inverse conversion signal, learn based on identification results of the first identification unit and the second identification unit.
  6.  入力された音声信号である入力信号を、実際に動物が発する音声である自然信号との類似の度合を示す自然信号度が前記入力信号よりも高い音声信号に変換する学習済みモデルを機械学習の方法で得る学習ステップ、
     を有し、
     前記機械学習の方法は、入力された音声信号に対して自然信号度を高める変換である順変換処理を実行することで前記音声信号よりも自然信号度の高い信号である順変換信号を出力する第1生成部と、入力された信号が順変換信号と自然信号とのいずれであるかを識別する第1識別部と、入力された音声信号に対して自然信号度を低める変換である逆変換処理を実行することで前記音声信号よりも自然信号度の低い逆変換信号を出力する第2生成部と、入力された信号が予め用意された信号であって合成された信号である事前合成信号と逆変換信号とのいずれであるかを識別する第2識別部とが、前記第1識別部及び前記第2識別部の識別結果に基づいて学習する方法である、
     音声信号変換モデル学習方法。
    A voice signal conversion model learning method comprising:
    a learning step of obtaining, by a machine learning method, a trained model that converts an input signal, which is an input voice signal, into a voice signal whose natural signal degree, which indicates a degree of similarity to a natural signal that is a voice actually uttered by an animal, is higher than that of the input signal,
    wherein the machine learning method is a method in which a first generation unit, a first identification unit, a second generation unit, and a second identification unit learn based on identification results of the first identification unit and the second identification unit, the first generation unit outputting a forward conversion signal, which is a signal having a higher natural signal degree than an input voice signal, by executing on the voice signal a forward conversion process that is a conversion for raising the natural signal degree, the first identification unit identifying whether an input signal is a forward conversion signal or a natural signal, the second generation unit outputting an inverse conversion signal, which is a signal having a lower natural signal degree than an input voice signal, by executing on the voice signal an inverse conversion process that is a conversion for lowering the natural signal degree, and the second identification unit identifying whether an input signal is a pre-synthesized signal, which is a signal prepared in advance and synthesized, or an inverse conversion signal.
    Voice signal conversion model learning method.
  7.  請求項1から4のいずれか一項に記載の音声信号変換モデル学習装置としてコンピュータを機能させるためのプログラム。 A program for causing a computer to function as the voice signal conversion model learning device according to any one of claims 1 to 4.
  8.  請求項5に記載の音声信号変換装置としてコンピュータを機能させるためのプログラム。 A program for causing a computer to function as the voice signal conversion device according to claim 5.
PCT/JP2020/015389 2020-04-03 2020-04-03 Sound signal conversion model learning device, sound signal conversion device, sound signal conversion model learning method, and program WO2021199446A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/JP2020/015389 WO2021199446A1 (en) 2020-04-03 2020-04-03 Sound signal conversion model learning device, sound signal conversion device, sound signal conversion model learning method, and program
JP2022511494A JP7368779B2 (en) 2020-04-03 2020-04-03 Audio signal conversion model learning device, audio signal conversion device, audio signal conversion model learning method and program

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2020/015389 WO2021199446A1 (en) 2020-04-03 2020-04-03 Sound signal conversion model learning device, sound signal conversion device, sound signal conversion model learning method, and program

Publications (1)

Publication Number Publication Date
WO2021199446A1 true WO2021199446A1 (en) 2021-10-07

Family

ID=77927771

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2020/015389 WO2021199446A1 (en) 2020-04-03 2020-04-03 Sound signal conversion model learning device, sound signal conversion device, sound signal conversion model learning method, and program

Country Status (2)

Country Link
JP (1) JP7368779B2 (en)
WO (1) WO2021199446A1 (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2019101391A (en) * 2017-12-07 2019-06-24 日本電信電話株式会社 Series data converter, learning apparatus, and program
JP2019144404A (en) * 2018-02-20 2019-08-29 日本電信電話株式会社 Voice conversion learning device, voice conversion device, method and program
JP2020027193A (en) * 2018-08-13 2020-02-20 日本電信電話株式会社 Voice conversion learning device, voice conversion device, method, and program
CN110246488A (en) * 2019-06-14 2019-09-17 苏州思必驰信息科技有限公司 Half optimizes the phonetics transfer method and device of CycleGAN model

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
FANG FUMING, YAMAGISHI JUNICHI, ECHIZEN ISAO: "High-quality nonparallel voice conversion using CycleGAN", IPSJ SIG TECHNICAL REPORT, vol. 2017, no. 9 (2017-SLP-119), 21 December 2017 (2017-12-21), pages 1 - 6, XP055937956 *

Also Published As

Publication number Publication date
JP7368779B2 (en) 2023-10-25
JPWO2021199446A1 (en) 2021-10-07

Similar Documents

Publication Publication Date Title
US10679643B2 (en) Automatic audio captioning
US11605368B2 (en) Speech recognition using unspoken text and speech synthesis
US10176811B2 (en) Neural network-based voiceprint information extraction method and apparatus
JP7018659B2 (en) Voice conversion device, voice conversion method and program
US11856369B1 (en) Methods and systems implementing phonologically-trained computer-assisted hearing aids
Su et al. Bandwidth extension is all you need
JP7257593B2 (en) Training Speech Synthesis to Generate Distinguishable Speech Sounds
US11823655B2 (en) Synthetic speech processing
WO2022017040A1 (en) Speech synthesis method and system
JP2015040903A (en) Voice processor, voice processing method and program
WO2019116889A1 (en) Signal processing device and method, learning device and method, and program
CN110648684B (en) Bone conduction voice enhancement waveform generation method based on WaveNet
EP4205109A1 (en) Synthesized data augmentation using voice conversion and speech recognition models
US11676572B2 (en) Instantaneous learning in text-to-speech during dialog
Huang et al. Refined wavenet vocoder for variational autoencoder based voice conversion
Li et al. Speech Audio Super-Resolution for Speech Recognition.
US20230013370A1 (en) Generating audio waveforms using encoder and decoder neural networks
JP7192882B2 (en) Speech rhythm conversion device, model learning device, methods therefor, and program
Wu et al. Multilingual text-to-speech training using cross language voice conversion and self-supervised learning of speech representations
JP7393585B2 (en) WaveNet self-training for text-to-speech
JP7423056B2 (en) Reasoners and how to learn them
WO2021199446A1 (en) Sound signal conversion model learning device, sound signal conversion device, sound signal conversion model learning method, and program
Mottini et al. Voicy: Zero-shot non-parallel voice conversion in noisy reverberant environments
Zheng et al. Incorporating ultrasound tongue images for audio-visual speech enhancement through knowledge distillation
Yun et al. Voice conversion of synthesized speeches using deep neural networks

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20928762

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2022511494

Country of ref document: JP

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20928762

Country of ref document: EP

Kind code of ref document: A1