CN111081266B - Training a generative adversarial network, and speech enhancement method and system - Google Patents

Training a generative adversarial network, and speech enhancement method and system

Info

Publication number
CN111081266B
Authority
CN
China
Prior art keywords
speech
noise
voice
generator
sub
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911312488.0A
Other languages
Chinese (zh)
Other versions
CN111081266A (en)
Inventor
刘刚
龚科
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
DMAI Guangzhou Co Ltd
Original Assignee
DMAI Guangzhou Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by DMAI Guangzhou Co Ltd filed Critical DMAI Guangzhou Co Ltd
Priority to CN201911312488.0A priority Critical patent/CN111081266B/en
Publication of CN111081266A publication Critical patent/CN111081266A/en
Application granted granted Critical
Publication of CN111081266B publication Critical patent/CN111081266B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 21/00 - Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/02 - Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L 21/0208 - Noise filtering

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The invention discloses a method and system for training a generative adversarial network and for speech enhancement. The training method comprises the following steps: inputting speech pairs, each consisting of a sub-speech of the generator's denoised speech and a sub-speech of the noisy speech, together with speech pairs, each consisting of a sub-speech of the clean speech corresponding to the noisy speech and a sub-speech of the noisy speech, into a local discriminator; inputting the speech pair consisting of the noisy speech and its corresponding clean speech, and the speech pair consisting of the generator's denoised speech and the noisy speech, into a global discriminator; and training the discriminators and the generator respectively. By training an adversarial network that comprises both a global discriminator and a local discriminator, the generator is guided to produce denoised speech from both the whole and the local parts, so that noise which is hard to eliminate in specific regions is removed in a targeted manner. A generator based on a feature pyramid network better extracts speech features and perceives noise, thereby eliminating noise more effectively.

Description

Training a generative adversarial network, and speech enhancement method and system
Technical Field
The invention relates to the technical field of speech recognition, and in particular to methods and systems for training a generative adversarial network and for speech enhancement.
Background
Speech Enhancement (SE) refers to removing the noise z from noisy speech y, thereby separating out the clean speech x, i.e., x = y - z. Removing noise from a mixed speech signal is one of the most challenging tasks in speech signal processing; conventional speech enhancement algorithms include spectral subtraction, subspace methods, and Wiener filtering. In recent years, speech enhancement methods based on deep learning have made great breakthroughs in the field, and in particular the application of the generative adversarial network (GAN) to speech enhancement has effectively improved the quality of denoised speech.
Since they were first proposed, generative adversarial networks have achieved good results in many challenging generation tasks such as image translation and super-resolution. A GAN contains two important components: a Generator (G) and a Discriminator (D). The generator maps samples z from a prior distribution Z to samples of the target distribution X; the discriminator D acts as a binary classifier that judges real samples as true and samples generated by the generator as false. The generator and the discriminator are trained adversarially: the generator tries to make its samples follow the real distribution as closely as possible so as to confuse the discriminator into judging them true, while the discriminator tries to separate real samples from generated ones. Through this continuing game the generated samples come very close to the real samples, until Nash equilibrium is reached and the discriminator cannot tell whether a sample is real or generated.
Existing GAN-based speech enhancement techniques basically take the framework used in image generation tasks and apply it directly to speech denoising, without exploiting the characteristics of speech for more effective noise reduction. Noisy speech is a highly non-stationary signal: some regions may be very noisy while others may have little or essentially no noise. The generator has no structure suited to perceiving this characteristic, and although the discriminator can guide the generator to produce speech that is as clean as possible, it does so only from a relatively coarse, global perspective; the generated speech is therefore likely to retain noise locally, and the speech enhancement effect is poor.
Disclosure of Invention
The invention therefore provides a method for training a generative adversarial network, a speech enhancement method, and corresponding systems, overcoming the poor speech enhancement quality of prior-art GAN-based speech enhancement techniques.
In a first aspect, an embodiment of the present invention provides a method for training a generative adversarial network, comprising the following steps: obtaining noisy speech y and the corresponding clean speech x to form a training set; inputting the noisy speech y into a generator to generate denoised speech x̂; intercepting k sub-speech segments of the same size at the same positions from the clean speech x, the noisy speech y, and the generator's denoised speech x̂; inputting the speech pairs (x̂_i, y_i), each consisting of a sub-speech of the generator's denoised speech and a sub-speech of the noisy speech, and the speech pairs (x_i, y_i), each consisting of a sub-speech of the noisy speech's corresponding clean speech and a sub-speech of the noisy speech, into a local discriminator; inputting the speech pair (x, y), consisting of the noisy speech and the corresponding clean speech, and the speech pair (x̂, y), consisting of the generator's denoised speech and the noisy speech, into a global discriminator; training the discriminators and the generator respectively; and obtaining a trained generative adversarial network according to a preset training end condition.
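Purely as an illustration of the data flow just described, the following PyTorch-style sketch shows how the four kinds of speech pairs could be fed to the two discriminators. The function and module names (relative_probs, local_C, global_C), the channel-concatenation encoding of a pair, and the reading of "relative probability" as sigmoid of a score difference (a relativistic-GAN-style formulation) are all assumptions for illustration, not taken from the patent.

    import torch

    def relative_probs(xs, ys, xhs, x, y, x_hat, local_C, global_C):
        """xs, ys, xhs: lists of k aligned sub-speech segments of the clean,
        noisy and denoised speech; x, y, x_hat: the full utterances, each of
        shape (batch, 1, T). local_C / global_C are critic networks that map
        a channel-concatenated speech pair to an unbounded score; the
        'relative probability' that one pair is true versus the other is
        realised here as sigmoid(C(pair_a) - C(pair_b)) (an assumption)."""
        p_local_real, p_local_fake = [], []
        for x_i, y_i, xh_i in zip(xs, ys, xhs):
            s_real = local_C(torch.cat([x_i, y_i], dim=1))   # pair (x_i, y_i)
            s_fake = local_C(torch.cat([xh_i, y_i], dim=1))  # pair (x̂_i, y_i)
            p_local_real.append(torch.sigmoid(s_real - s_fake))  # p(x_i, y_i)
            p_local_fake.append(torch.sigmoid(s_fake - s_real))  # p(x̂_i, y_i)
        g_real = global_C(torch.cat([x, y], dim=1))          # pair (x, y)
        g_fake = global_C(torch.cat([x_hat, y], dim=1))      # pair (x̂, y)
        p_global_real = torch.sigmoid(g_real - g_fake)       # p(x, y)
        p_global_fake = torch.sigmoid(g_fake - g_real)       # p(x̂, y)
        return p_local_real, p_local_fake, p_global_real, p_global_fake

Concatenating the two members of a pair along the channel axis is a common conditioning choice in conditional GANs; the patent does not prescribe this particular encoding.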
In one embodiment, the process of training the discriminators comprises: inputting the speech pairs (x̂_i, y_i), each consisting of a sub-speech of the generator's denoised speech and a sub-speech of the noisy speech, and the speech pairs (x_i, y_i), each consisting of a sub-speech of the noisy speech's corresponding clean speech and a sub-speech of the noisy speech, into the local discriminator to obtain the relative probability p(x_i, y_i) that each clean-speech sub-speech corresponding to the noisy speech is true; inputting the speech pair (x, y), consisting of the noisy speech and the corresponding clean speech, and the speech pair (x̂, y), consisting of the generator's denoised speech and the noisy speech, into the global discriminator to obtain the relative probability p(x, y) that the clean speech corresponding to the noisy speech is true; and fixing the parameters of the generator and calculating the discriminator loss to update the parameters of the discriminators.
In one embodiment, the discriminator loss is calculated by the following equations:
D_loss = localD_Loss + globalD_Loss,
localD_Loss = -log(min_i p(x_i, y_i)),
globalD_Loss = -log(p(x, y));
where globalD_Loss is the global discrimination loss, localD_Loss is the local discrimination loss, and min_i p(x_i, y_i) is the minimum of the relative probabilities that the k clean-speech sub-speech segments corresponding to the noisy speech are true.
In one embodiment, the process of training the generator includes: inputting the speech pairs (x̂_i, y_i), each consisting of a sub-speech of the generator's denoised speech and a sub-speech of the noisy speech, and the speech pairs (x_i, y_i), each consisting of a sub-speech of the noisy speech's corresponding clean speech and a sub-speech of the noisy speech, into the local discriminator to obtain the relative probability p(x̂_i, y_i) that each sub-speech of the generator's denoised speech is true; inputting the speech pair (x, y), consisting of the noisy speech and the corresponding clean speech, and the speech pair (x̂, y), consisting of the generator's denoised speech and the noisy speech, into the global discriminator to obtain the relative probability p(x̂, y) that the generator's denoised speech is true; and fixing the discriminator parameters and calculating the generation loss to update the generator parameters.
In one embodiment, the generation loss is calculated by the following formulas:
G_loss = localDG_Loss + globalDG_Loss + L_Loss,
localDG_Loss = -log(min_i p(x̂_i, y_i)),
globalDG_Loss = -log(p(x̂, y)),
L_Loss = ||x̂ - x||_1 + Σ_i ||x̂_i - x_i||_1;
where globalDG_Loss is the global adversarial loss of the discriminator, localDG_Loss is the local adversarial loss of the discriminator, and L_Loss is the L1 distance loss between the enhanced speech generated by the generator and the clean speech, taken at each granularity.
In a second aspect, an embodiment of the present invention provides a speech enhancement method, comprising: acquiring noisy speech to be enhanced; and inputting the noisy speech to be enhanced into the generator of a generative adversarial network obtained by the training method according to the first aspect of the embodiments of the present invention, so as to enhance the speech and generate clean speech.
In one embodiment, the generator uses a feature pyramid network as its backbone to obtain speech features of different scales.
In a third aspect, an embodiment of the present invention provides a system for training a generative adversarial network, comprising: a training set acquisition module, configured to obtain noisy speech y and the corresponding clean speech x to form a training set; a generator denoising module, configured to input the noisy speech into the generator to generate denoised speech x̂; a sub-speech acquisition module, configured to intercept k sub-speech segments of the same size at the same positions from the clean speech x, the noisy speech y, and the generator's denoised speech x̂; and a training module, configured to input the speech pairs (x̂_i, y_i), each consisting of a sub-speech of the denoised speech and a sub-speech of the noisy speech, and the speech pairs (x_i, y_i), each consisting of a sub-speech of the noisy speech's corresponding clean speech and a sub-speech of the noisy speech, into a local discriminator, to input the speech pair (x, y), consisting of the noisy speech and the corresponding clean speech, and the speech pair (x̂, y), consisting of the generator's denoised speech and the noisy speech, into a global discriminator, to train the discriminators and the generator respectively, and to obtain a trained generative adversarial network according to a preset training end condition.
In a fourth aspect, an embodiment of the present invention provides a speech enhancement system, comprising: a to-be-enhanced speech acquisition module, configured to acquire noisy speech to be enhanced; and a speech enhancement module, configured to input the noisy speech to be enhanced into the generator of a generative adversarial network obtained by the training method according to the first aspect of the embodiments of the present invention, so as to enhance the speech and generate clean speech.
In a fifth aspect, an embodiment of the present invention provides a computer device, including: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor, the instructions being executable by the at least one processor to cause the at least one processor to perform the method of the first or second aspect of the embodiments of the present invention.
In a sixth aspect, embodiments of the present invention provide a computer-readable storage medium storing computer instructions for causing a computer to perform the method according to the first or second aspect of the embodiments of the present invention.
The technical scheme of the invention has the following advantages:
1. The method and system for training a generative adversarial network provided by the embodiments of the invention exploit the non-stationarity of noisy speech and the local character of noise. The speech pairs (x̂_i, y_i), each consisting of a sub-speech of the generator's denoised speech and a sub-speech of the noisy speech, and the speech pairs (x_i, y_i), each consisting of a sub-speech of the noisy speech's corresponding clean speech and a sub-speech of the noisy speech, are input into a local discriminator; the speech pair (x, y), consisting of the noisy speech and the corresponding clean speech, and the speech pair (x̂, y), consisting of the generator's denoised speech and the noisy speech, are input into a global discriminator; and the discriminators and the generator are trained respectively. The trained adversarial network, through a dynamic granularity discriminator comprising the global discriminator and the local discriminator, guides the generator to produce denoised speech from both the whole and the local parts, thereby eliminating in a targeted manner noise that is hard to remove in specific regions.
2. The speech enhancement method and system provided by the embodiments of the invention take account of the non-stationary nature of noisy speech: the generator based on a feature pyramid network better extracts speech features and perceives noise, thereby eliminating noise more effectively.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts.
Fig. 1 is a flowchart of a specific example of a method for training a generative adversarial network according to an embodiment of the present invention;
FIG. 2 is a flowchart of a specific example of a speech enhancement method according to an embodiment of the present invention;
fig. 3 is a flowchart of a convolution operation performed by using a feature pyramid network as a backbone network according to an embodiment of the present invention;
FIG. 4 is a diagram illustrating a plurality of convolution operations provided by an embodiment of the present invention;
FIG. 5 is a block diagram of a specific example of a system for training a generative adversarial network according to an embodiment of the present invention;
FIG. 6 is a block diagram of a speech enhancement system according to an embodiment of the present invention;
fig. 7 is a block diagram of a specific example of a computer device according to an embodiment of the present invention.
Detailed Description
The technical solutions of the present invention will be described clearly and completely with reference to the accompanying drawings, and it should be understood that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In addition, the technical features involved in the different embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
Example 1
The method for training a generative adversarial network provided by the embodiment of the invention, as shown in fig. 1, comprises the following steps:
step S1: the noisy speech and its corresponding clean speech are obtained to form a training set.
In the embodiment of the present invention, clean speech x may be obtained first, and noise may be randomly added to it to obtain the noisy speech y; this is only by way of example and is not limiting.
Step S2: the noisy speech is input into a generator to generate denoised speech.
The embodiment of the invention inputs the noisy speech y into a preset generator to obtain the denoised speech x̂. The generator uses a feature pyramid network as its backbone and generates the denoised speech x̂ from speech features of different scales.
Step S3: sub-speech segments of the same size are intercepted at the same positions from the clean speech, the noisy speech, and the generator's denoised speech.
In this embodiment, the same interception operation is carried out at the same positions on the clean speech x, the noisy speech y, and the generator's denoised speech x̂, obtaining the sub-speech segments x_1, x_2, ..., x_k; y_1, y_2, ..., y_k; and x̂_1, x̂_2, ..., x̂_k, which correspond position by position. The value of k is set reasonably according to actual needs; in the embodiment of the invention, k is taken as 10, as in the sketch below.
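A minimal sketch of this interception step, assuming waveforms of shape (batch, 1, T) and randomly chosen but shared starting positions; the function name intercept_subspeech, the segment length seg_len, and the position-selection scheme are assumptions, since the patent fixes neither of them:

    import torch

    def intercept_subspeech(x, y, x_hat, k=10, seg_len=1024):
        """Cut k same-size sub-speech segments at the same positions from
        the clean speech x, noisy speech y and denoised speech x_hat,
        each of shape (batch, 1, T)."""
        T = x.shape[-1]
        # Shared starting positions so that x_i, y_i and x̂_i line up.
        starts = torch.randint(0, T - seg_len + 1, (k,)).tolist()
        xs = [x[..., s:s + seg_len] for s in starts]
        ys = [y[..., s:s + seg_len] for s in starts]
        xhs = [x_hat[..., s:s + seg_len] for s in starts]
        return xs, ys, xhs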
Step S4: the speech pairs (x̂_i, y_i), each consisting of a sub-speech of the generator's denoised speech and a sub-speech of the noisy speech, and the speech pairs (x_i, y_i), each consisting of a sub-speech of the noisy speech's corresponding clean speech and a sub-speech of the noisy speech, are input into a local discriminator; the speech pair (x, y), consisting of the noisy speech and the corresponding clean speech, and the speech pair (x̂, y), consisting of the generator's denoised speech and the noisy speech, are input into a global discriminator; the discriminators and the generator are trained respectively, and a trained generative adversarial network is obtained according to a preset training end condition.
In this embodiment, the preset training end condition may be that the loss value obtained in training falls below a preset threshold and/or that the number of training steps reaches a preset number, which is not limited herein.
The process of training the discriminator in the embodiment of the invention comprises the following steps:
1) inputting the speech pairs (x̂_i, y_i), each consisting of a sub-speech of the generator's denoised speech and a sub-speech of the noisy speech, and the speech pairs (x_i, y_i), each consisting of a sub-speech of the noisy speech's corresponding clean speech and a sub-speech of the noisy speech, into the local discriminator to obtain the relative probability p(x_i, y_i) that each clean-speech sub-speech corresponding to the noisy speech is true;
2) inputting the speech pair (x, y), consisting of the noisy speech and the corresponding clean speech, and the speech pair (x̂, y), consisting of the generator's denoised speech and the noisy speech, into the global discriminator to obtain the relative probability p(x, y) that the clean speech x corresponding to the noisy speech is true;
3) fixing the parameters of the generator and calculating the discriminator loss to update the parameters of the discriminators.
The discriminator loss in the embodiment of the invention is calculated by the following equations:
D_loss = localD_Loss + globalD_Loss,
localD_Loss = -log(min_i p(x_i, y_i)),
globalD_Loss = -log(p(x, y));
where globalD_Loss is the global discrimination loss, localD_Loss is the local discrimination loss, and min_i p(x_i, y_i) is the minimum of the relative probabilities that the k clean-speech sub-speech segments corresponding to the noisy speech are true.
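Assuming the discriminators output the relative probabilities described above, the discriminator loss could be computed as in the following sketch; the eps guard against log(0) and the batch-mean reduction are our additions, not part of the patent's formulas:

    import torch

    def discriminator_loss(p_local_real, p_global_real, eps=1e-8):
        """D_loss = localD_Loss + globalD_Loss, with
        localD_Loss  = -log(min_i p(x_i, y_i))  (worst local pair) and
        globalD_Loss = -log(p(x, y)).
        p_local_real: list of the k probabilities p(x_i, y_i);
        p_global_real: the probability p(x, y)."""
        p_min = torch.stack(list(p_local_real)).min()
        localD_loss = -torch.log(p_min + eps)
        globalD_loss = -torch.log(p_global_real + eps).mean()
        return localD_loss + globalD_loss

Taking the minimum over the k local pairs concentrates the penalty on the position where the discriminator is least confident, which is what allows training to target the hardest local region.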
In the embodiment of the present invention, the process of training the generator includes:
1) inputting the speech pairs (x̂_i, y_i), each consisting of a sub-speech of the generator's denoised speech and a sub-speech of the noisy speech, and the speech pairs (x_i, y_i), each consisting of a sub-speech of the noisy speech's corresponding clean speech and a sub-speech of the noisy speech, into the local discriminator to obtain the relative probability p(x̂_i, y_i) that each sub-speech of the generator's denoised speech is true;
2) inputting the speech pair (x, y), consisting of the noisy speech and the corresponding clean speech, and the speech pair (x̂, y), consisting of the generator's denoised speech and the noisy speech, into the global discriminator to obtain the relative probability p(x̂, y) that the generator's denoised speech is true;
3) fixing the discriminator parameters and calculating the generation loss to update the generator parameters.
The generation loss given in the embodiment of the present invention is calculated by the following formulas:
G_loss = localDG_Loss + globalDG_Loss + L_Loss,
localDG_Loss = -log(min_i p(x̂_i, y_i)),
globalDG_Loss = -log(p(x̂, y)),
L_Loss = ||x̂ - x||_1 + Σ_i ||x̂_i - x_i||_1;
where globalDG_Loss is the global adversarial loss of the discriminator, localDG_Loss is the local adversarial loss of the discriminator, and L_Loss is the L1 distance loss between the enhanced speech generated by the generator and the clean speech.
In the embodiment of the invention, it is desired that the speech generated by the generator has as much noise removed as possible at every granularity, that is, that the discriminator judges it to be true, i.e., that the discriminator's output is as close to 1 as possible. The L1 distance loss L_Loss is the L1 distance between the clean speech and the enhanced speech at each granularity. It should be noted that the distance may instead be a cosine distance, a Euclidean distance, or another similar distance, which is not limited herein.
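A corresponding sketch of the generation loss, under the same assumptions as the sketches above; reading L_Loss as the unweighted sum of the full-utterance and per-segment L1 terms is our interpretation of "the L1 distance at each granularity":

    import torch
    import torch.nn.functional as F

    def generator_loss(p_local_fake, p_global_fake, xs, xhs, x, x_hat, eps=1e-8):
        """G_loss = localDG_Loss + globalDG_Loss + L_Loss, with
        localDG_Loss  = -log(min_i p(x̂_i, y_i)),
        globalDG_Loss = -log(p(x̂, y)), and
        L_Loss the L1 distance between enhanced and clean speech at
        every granularity (full utterance plus the k segments)."""
        localDG = -torch.log(torch.stack(list(p_local_fake)).min() + eps)
        globalDG = -torch.log(p_global_fake + eps).mean()
        l_loss = F.l1_loss(x_hat, x)
        l_loss = l_loss + sum(F.l1_loss(xh, xc) for xh, xc in zip(xhs, xs))
        return localDG + globalDG + l_loss

Reusing the helper sketches above, one possible outer loop alternating the two updates until the preset end condition (the data loader, optimizers, critics, threshold, and step limit are all assumptions) is:

    for step, (x, y) in enumerate(loader):
        # Discriminator update: generator parameters fixed via detach().
        x_hat = G(y).detach()
        xs, ys, xhs = intercept_subspeech(x, y, x_hat)
        pl_r, _, pg_r, _ = relative_probs(xs, ys, xhs, x, y, x_hat,
                                          local_C, global_C)
        d_loss = discriminator_loss(pl_r, pg_r)
        opt_D.zero_grad(); d_loss.backward(); opt_D.step()

        # Generator update: discriminator parameters fixed (only opt_G steps).
        x_hat = G(y)
        xs, ys, xhs = intercept_subspeech(x, y, x_hat)
        _, pl_f, _, pg_f = relative_probs(xs, ys, xhs, x, y, x_hat,
                                          local_C, global_C)
        g_loss = generator_loss(pl_f, pg_f, xs, xhs, x, x_hat)
        opt_G.zero_grad(); g_loss.backward(); opt_G.step()

        if g_loss.item() < loss_threshold or step >= max_steps:
            break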
The method for training a generative adversarial network provided by the embodiment of the invention exploits the non-stationarity of noisy speech and the local character of noise: a dynamic granularity discriminator comprising a global discriminator and a local discriminator guides the generator to produce denoised speech from both the whole and the local parts, so that noise which is hard to eliminate in specific regions is removed in a targeted manner and the speech enhancement effect is better.
Example 2
An embodiment of the present invention provides a speech enhancement method, as shown in fig. 2, including the following steps:
Step S21: obtaining the noisy speech to be enhanced;
Step S22: inputting the noisy speech to be enhanced into the generator of the obtained generative adversarial network to enhance the speech and generate clean speech. The generator employed in this embodiment is the generator included in the generative adversarial network obtained by the training method of Embodiment 1.
In the embodiment of the present invention, the generator of the generative adversarial network uses a feature pyramid network as its backbone. As shown in fig. 3, a convolution with stride 1 is applied before the feature pyramid network. In the bottom-up path, the feature pyramid network contains 5 Dense Blocks, each of which produces features carrying information of a different semantic level. In the top-down path, 1x1 convolutions give the features of each scale the same number of channels; the four uppermost features are up-sampled to 1/4 of the length of the original input, spliced together, and convolved once to obtain a dense feature containing information of different semantic levels; this feature is then up-sampled twice and convolved once, combined with the output of the first Dense Block, to obtain the clean speech. This embodiment takes four speech features of different scales as an example, but is not limited thereto.
As shown in fig. 4, the Dense Block in this embodiment comprises a plurality of convolution operations, each of which takes the concatenated outputs of all preceding layers as input and feeds its own output to every following layer. Assuming the output of the l-th layer is X_l, and the outputs of layers l-1, l-2, ..., 0 (layer 0 being the input of the current Block) are X_{l-1}, X_{l-2}, ..., X_0, then:
X_l = H_l([X_{l-1}, X_{l-2}, ..., X_0])
where H_l denotes the convolution operation of the l-th layer and [·] denotes the splicing (concatenation) operation. To produce features of different scales, the embodiment of the invention embeds a convolution with stride 2 at the end of each Dense Block. After each convolution, a Parametric Rectified Linear Unit (PReLU) is used as the activation function.
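As an illustration of this architecture, the sketch below implements a 1-D Dense Block with a stride-2 convolution embedded at its end and PReLU activations, together with the bottom-up/top-down assembly described above. The class names, channel counts, kernel sizes, number of layers inside a block, and the linear interpolation mode are all assumptions, since the patent only fixes the overall structure:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class DenseBlock(nn.Module):
        """Each layer takes the concatenation of all previous outputs as
        input (X_l = H_l([X_{l-1}, ..., X_0])); the stride-2 convolution
        at the end halves the temporal length to create a new scale."""
        def __init__(self, in_ch, growth=16, layers=3):
            super().__init__()
            self.layers = nn.ModuleList()
            ch = in_ch
            for _ in range(layers):
                self.layers.append(nn.Sequential(
                    nn.Conv1d(ch, growth, kernel_size=3, padding=1),
                    nn.PReLU()))
                ch += growth
            self.down = nn.Sequential(  # stride-2 conv embedded at the end
                nn.Conv1d(ch, in_ch, kernel_size=3, stride=2, padding=1),
                nn.PReLU())

        def forward(self, x):
            feats = [x]
            for layer in self.layers:
                feats.append(layer(torch.cat(feats, dim=1)))
            return self.down(torch.cat(feats, dim=1))

    class FPNGenerator(nn.Module):
        """Bottom-up path of 5 Dense Blocks; top-down path of 1x1 convs
        that equalise channels, upsampling of the four uppermost features
        to 1/4 of the input length, splicing plus one convolution, then
        two upsamplings combined with the first block's output."""
        def __init__(self, ch=32):
            super().__init__()
            self.stem = nn.Sequential(
                nn.Conv1d(1, ch, 3, stride=1, padding=1), nn.PReLU())
            self.blocks = nn.ModuleList([DenseBlock(ch) for _ in range(5)])
            self.lateral = nn.ModuleList([nn.Conv1d(ch, ch, 1) for _ in range(4)])
            self.fuse = nn.Sequential(
                nn.Conv1d(4 * ch, ch, 3, padding=1), nn.PReLU())
            self.out = nn.Conv1d(2 * ch, 1, 3, padding=1)

        def forward(self, y):
            x = self.stem(y)
            feats = []
            for block in self.blocks:
                x = block(x)
                feats.append(x)           # scales 1/2, 1/4, ..., 1/32
            quarter = feats[1].shape[-1]  # length at 1/4 of the input
            up = [F.interpolate(lat(f), size=quarter, mode='linear',
                                align_corners=False)
                  for lat, f in zip(self.lateral, feats[1:])]
            dense = self.fuse(torch.cat(up, dim=1))
            # Two upsamplings back to full length, combining the first
            # Dense Block's output (at 1/2 resolution) along the way.
            dense = F.interpolate(dense, size=feats[0].shape[-1],
                                  mode='linear', align_corners=False)
            dense = torch.cat([dense, feats[0]], dim=1)
            dense = F.interpolate(dense, size=y.shape[-1],
                                  mode='linear', align_corners=False)
            return self.out(dense)

Each Dense Block halves the temporal length, so five blocks yield features at 1/2 down to 1/32 of the input; the four uppermost (1/4 to 1/32) form the top-down fusion, matching the four scales taken as an example above. One way to exercise the sketch: G = FPNGenerator(); x_hat = G(torch.randn(1, 1, 16384)).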
The speech enhancement method provided by the embodiment of the invention takes account of the non-stationary nature of noisy speech: the generator based on a feature pyramid network better extracts speech features and perceives noise, thereby eliminating noise more effectively. By exploiting the non-stationarity of noisy speech and the local character of noise, the dynamic granularity discriminator comprising the global discriminator and the local discriminator obtained in Embodiment 1 guides the generator to produce denoised speech from both the whole and the local parts, removing in a targeted manner noise that is hard to eliminate in specific regions.
Example 3
An embodiment of the present invention provides a system for training a generative adversarial network, as shown in fig. 5, comprising:
A training set acquisition module 1, configured to acquire noisy speech y and the corresponding clean speech x to form a training set; this module executes the method described in step S1 in Embodiment 1, and is not described herein again.
A generator denoising module 2, configured to input the noisy speech into the generator to generate denoised speech x̂; this module executes the method described in step S2 in Embodiment 1, and is not described herein again.
A sub-speech acquisition module 3, configured to intercept k sub-speech segments of the same size at the same positions from the clean speech x, the noisy speech y, and the generator's denoised speech x̂; this module executes the method described in step S3 in Embodiment 1, and is not described herein again.
A training module 4, configured to input the speech pairs (x̂_i, y_i), each consisting of a sub-speech of the denoised speech and a sub-speech of the noisy speech, and the speech pairs (x_i, y_i), each consisting of a sub-speech of the noisy speech's corresponding clean speech and a sub-speech of the noisy speech, into a local discriminator; to input the speech pair (x, y), consisting of the noisy speech and the corresponding clean speech, and the speech pair (x̂, y), consisting of the generator's denoised speech and the noisy speech, into a global discriminator; to train the discriminators and the generator respectively; and to obtain a trained generative adversarial network according to a preset training end condition. This module executes the method described in step S4 in Embodiment 1, and is not described herein again.
The system for training a generative adversarial network provided by the embodiment of the invention exploits the non-stationarity of noisy speech and the local character of noise: the dynamic granularity discriminator comprising a global discriminator and a local discriminator guides the generator to produce denoised speech from both the whole and the local parts, so that noise which is hard to eliminate in specific regions is removed in a targeted manner and the speech enhancement effect is better.
Example 4
An embodiment of the present invention provides a speech enhancement system, as shown in fig. 6, including:
A to-be-enhanced speech acquisition module 5, configured to acquire the noisy speech to be enhanced; this module executes the method described in step S21 in Embodiment 2, and is not described herein again.
A speech enhancement module 6, configured to input the noisy speech to be enhanced into the generator of the generative adversarial network obtained by the training method of Embodiment 1, so as to enhance the speech and generate clean speech; this module executes the method described in step S22 in Embodiment 2, and is not described herein again.
The speech enhancement system provided by the embodiment of the invention takes account of the non-stationary nature of noisy speech: the generator based on a feature pyramid network better extracts speech features and perceives noise, thereby eliminating noise more effectively. By exploiting the non-stationarity of noisy speech and the local character of noise, the dynamic granularity discriminator comprising the global discriminator and the local discriminator obtained in Embodiment 1 guides the generator to produce denoised speech from both the whole and the local parts, removing in a targeted manner noise that is hard to eliminate in specific regions.
Example 5
An embodiment of the present invention provides a computer device, as shown in fig. 7, comprising: at least one processor 401, such as a CPU (Central Processing Unit), at least one communication interface 403, a memory 404, and at least one communication bus 402, where the communication bus 402 is used to enable connective communication between these components. The communication interface 403 may include a display and a keyboard, and optionally may also include a standard wired interface and a standard wireless interface. The memory 404 may be a RAM (Random Access Memory) or a non-volatile memory, such as at least one disk memory; optionally, the memory 404 may be at least one storage device located remotely from the processor 401. The processor 401 may perform the method in Embodiment 1 or Embodiment 2. A set of program codes is stored in the memory 404, and the processor 401 calls the program codes stored in the memory 404 to execute the method in Embodiment 1 or Embodiment 2. The communication bus 402 may be a Peripheral Component Interconnect (PCI) bus or an Extended Industry Standard Architecture (EISA) bus, and may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one line is shown in FIG. 7, but this does not mean that there is only one bus or one type of bus.
The memory 404 may include a volatile memory, such as a random-access memory (RAM); it may also include a non-volatile memory, such as a flash memory, a hard disk drive (HDD), or a solid-state drive (SSD); the memory 404 may also comprise a combination of the above kinds of memory.
The processor 401 may be a Central Processing Unit (CPU), a Network Processor (NP), or a combination of a CPU and an NP.
The processor 401 may further include a hardware chip. The hardware chip may be an application-specific integrated circuit (ASIC), a Programmable Logic Device (PLD), or a combination thereof. The PLD may be a Complex Programmable Logic Device (CPLD), a field-programmable gate array (FPGA), a General Array Logic (GAL), or any combination thereof.
Optionally, the memory 404 is also used to store program instructions. The processor 401 may call program instructions to implement the method in embodiment 1 or embodiment 2 as the present application.
An embodiment of the present invention further provides a computer-readable storage medium storing computer-executable instructions which, when executed, can perform the method in Embodiment 1 or Embodiment 2. The storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM), a random access memory (RAM), a flash memory, a hard disk drive (HDD), a solid-state drive (SSD), or the like; the storage medium may also comprise a combination of the above kinds of memory.
It should be understood that the above examples are only for clarity of illustration and are not intended to limit the embodiments. Other variations and modifications will be apparent to persons skilled in the art in light of the above description; it is neither necessary nor possible to exhaust all embodiments here. Obvious variations or modifications may be made without departing from the spirit or scope of the invention.

Claims (9)

1. A method for training a generative adversarial network, characterized by comprising the following steps:
obtaining noisy speech y and the corresponding clean speech x to form a training set;
inputting the noisy speech y into a generator to generate denoised speech x̂;
intercepting k sub-speech segments of the same size at the same positions from the clean speech x, the noisy speech y, and the generator's denoised speech x̂;
inputting the speech pairs (x̂_i, y_i), each consisting of a sub-speech of the generator's denoised speech and a sub-speech of the noisy speech, and the speech pairs (x_i, y_i), each consisting of a sub-speech of the noisy speech's corresponding clean speech and a sub-speech of the noisy speech, into a local discriminator; inputting the speech pair (x, y), consisting of the noisy speech and the corresponding clean speech, and the speech pair (x̂, y), consisting of the generator's denoised speech and the noisy speech, into a global discriminator; training the discriminators and the generator respectively, and obtaining a trained generative adversarial network according to a preset training end condition;
where x̂_i denotes the i-th sub-speech of the denoised speech, y_i denotes the i-th sub-speech of the noisy speech, and x_i denotes the i-th sub-speech of the clean speech;
wherein the process of training the generator comprises:
inputting the speech pairs (x̂_i, y_i), each consisting of a sub-speech of the generator's denoised speech and a sub-speech of the noisy speech, and the speech pairs (x_i, y_i), each consisting of a sub-speech of the noisy speech's corresponding clean speech and a sub-speech of the noisy speech, into the local discriminator to obtain the relative probability p(x̂_i, y_i) that each sub-speech of the generator's denoised speech is true;
inputting the speech pair (x, y), consisting of the noisy speech and the corresponding clean speech, and the speech pair (x̂, y), consisting of the generator's denoised speech and the noisy speech, into the global discriminator to obtain the relative probability p(x̂, y) that the generator's denoised speech is true;
fixing the discriminator parameters and calculating the generation loss to update the parameters of the generator;
wherein the generation loss is calculated by the following formulas:
G_loss = localDG_Loss + globalDG_Loss + L_Loss,
localDG_Loss = -log(min_i p(x̂_i, y_i)),
globalDG_Loss = -log(p(x̂, y)),
L_Loss = ||x̂ - x||_1 + Σ_i ||x̂_i - x_i||_1;
where globalDG_Loss is the global adversarial loss of the discriminator, localDG_Loss is the local adversarial loss of the discriminator, and L_Loss is the L1 distance loss between the enhanced speech generated by the generator and the clean speech.
2. The method for training a generative adversarial network according to claim 1, wherein the process of training the discriminator comprises:
inputting the speech pairs (x̂_i, y_i), each consisting of a sub-speech of the generator's denoised speech and a sub-speech of the noisy speech, and the speech pairs (x_i, y_i), each consisting of a sub-speech of the noisy speech's corresponding clean speech and a sub-speech of the noisy speech, into the local discriminator to obtain the relative probability p(x_i, y_i) that each clean-speech sub-speech corresponding to the noisy speech is true;
inputting the speech pair (x, y), consisting of the noisy speech and the corresponding clean speech, and the speech pair (x̂, y), consisting of the generator's denoised speech and the noisy speech, into the global discriminator to obtain the relative probability p(x, y) that the clean speech corresponding to the noisy speech is true;
fixing the parameters of the generator and calculating the discriminator loss to update the parameters of the discriminator.
3. The method for training a generative adversarial network according to claim 2, wherein the discriminator loss is calculated by the following formulas:
D_loss = localD_Loss + globalD_Loss,
localD_Loss = -log(min_i p(x_i, y_i)),
globalD_Loss = -log(p(x, y));
where globalD_Loss is the global discrimination loss, localD_Loss is the local discrimination loss, and min_i p(x_i, y_i) is the minimum of the relative probabilities that the k clean-speech sub-speech segments corresponding to the noisy speech are true.
4. A method of speech enhancement, comprising:
acquiring noisy speech to be enhanced;
inputting the noisy speech to be enhanced into the generator of a generative adversarial network obtained by the method for training a generative adversarial network according to any one of claims 1 to 3, so as to enhance the speech and generate clean speech.
5. The speech enhancement method according to claim 4, wherein the generator uses a feature pyramid network as its backbone to obtain speech features of different scales.
6. A system for training a generative adversarial network, characterized by comprising:
a training set acquisition module, configured to obtain noisy speech y and the corresponding clean speech x to form a training set;
a generator denoising module, configured to input the noisy speech into the generator to generate denoised speech x̂;
a sub-speech acquisition module, configured to intercept k sub-speech segments of the same size at the same positions from the clean speech x, the noisy speech y, and the generator's denoised speech x̂;
a training module, configured to input the speech pairs (x̂_i, y_i), each consisting of a sub-speech of the denoised speech and a sub-speech of the noisy speech, and the speech pairs (x_i, y_i), each consisting of a sub-speech of the noisy speech's corresponding clean speech and a sub-speech of the noisy speech, into a local discriminator; to input the speech pair (x, y), consisting of the noisy speech and the corresponding clean speech, and the speech pair (x̂, y), consisting of the generator's denoised speech and the noisy speech, into a global discriminator; to train the discriminators and the generator respectively; and to obtain a trained generative adversarial network according to a preset training end condition;
where x̂_i denotes the i-th sub-speech of the denoised speech, y_i denotes the i-th sub-speech of the noisy speech, and x_i denotes the i-th sub-speech of the clean speech;
wherein the process of training the generator comprises:
inputting the speech pairs (x̂_i, y_i), each consisting of a sub-speech of the generator's denoised speech and a sub-speech of the noisy speech, and the speech pairs (x_i, y_i), each consisting of a sub-speech of the noisy speech's corresponding clean speech and a sub-speech of the noisy speech, into the local discriminator to obtain the relative probability p(x̂_i, y_i) that each sub-speech of the generator's denoised speech is true;
inputting the speech pair (x, y), consisting of the noisy speech and the corresponding clean speech, and the speech pair (x̂, y), consisting of the generator's denoised speech and the noisy speech, into the global discriminator to obtain the relative probability p(x̂, y) that the generator's denoised speech is true;
fixing the discriminator parameters and calculating the generation loss to update the parameters of the generator;
wherein the generation loss is calculated by the following formulas:
G_loss = localDG_Loss + globalDG_Loss + L_Loss,
localDG_Loss = -log(min_i p(x̂_i, y_i)),
globalDG_Loss = -log(p(x̂, y)),
L_Loss = ||x̂ - x||_1 + Σ_i ||x̂_i - x_i||_1;
where globalDG_Loss is the global adversarial loss of the discriminator, localDG_Loss is the local adversarial loss of the discriminator, and L_Loss is the L1 distance loss between the enhanced speech generated by the generator and the clean speech.
7. A speech enhancement system, comprising:
a to-be-enhanced speech acquisition module, configured to acquire noisy speech to be enhanced;
a speech enhancement module, configured to input the noisy speech to be enhanced into the generator of a generative adversarial network obtained by the method for training a generative adversarial network according to any one of claims 1 to 3, so as to enhance the speech and generate clean speech.
8. A computer device, comprising: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to cause the at least one processor to perform the method of any of claims 1-3 or 4-5.
9. A computer-readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any of claims 1-3 or 4-5.
CN201911312488.0A 2019-12-18 2019-12-18 Training a generative adversarial network, and speech enhancement method and system Active CN111081266B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911312488.0A CN111081266B (en) 2019-12-18 2019-12-18 Training a generative adversarial network, and speech enhancement method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911312488.0A CN111081266B (en) 2019-12-18 2019-12-18 Training a generative adversarial network, and speech enhancement method and system

Publications (2)

Publication Number Publication Date
CN111081266A CN111081266A (en) 2020-04-28
CN111081266B true CN111081266B (en) 2022-08-09

Family

ID=70315828

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911312488.0A Active CN111081266B (en) Training a generative adversarial network, and speech enhancement method and system

Country Status (1)

Country Link
CN (1) CN111081266B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111833893A (en) * 2020-06-16 2020-10-27 杭州云嘉云计算有限公司 Speech enhancement method based on artificial intelligence
CN112164008B (en) * 2020-09-29 2024-02-23 中国科学院深圳先进技术研究院 Training method of image data enhancement network, training device, medium and equipment thereof
CN112444810B (en) * 2020-10-27 2022-07-01 电子科技大学 Radar air multi-target super-resolution method
CN113096673B (en) * 2021-03-30 2022-09-30 山东省计算中心(国家超级计算济南中心) Voice processing method and system based on generation countermeasure network
CN113555028B (en) * 2021-07-19 2024-08-02 首约科技(北京)有限公司 Processing method for noise reduction of Internet of vehicles voice

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10971142B2 (en) * 2017-10-27 2021-04-06 Baidu Usa Llc Systems and methods for robust speech recognition using generative adversarial networks
CN108133702A (en) * 2017-12-20 2018-06-08 重庆邮电大学 A kind of deep neural network speech enhancement model based on MEE Optimality Criterias
CN109147810B (en) * 2018-09-30 2019-11-26 百度在线网络技术(北京)有限公司 Establish the method, apparatus, equipment and computer storage medium of speech enhancement network
CN109256144B (en) * 2018-11-20 2022-09-06 中国科学技术大学 Speech enhancement method based on ensemble learning and noise perception training
CN110085218A (en) * 2019-03-26 2019-08-02 天津大学 A kind of audio scene recognition method based on feature pyramid network
CN110047502A (en) * 2019-04-18 2019-07-23 广州九四智能科技有限公司 The recognition methods of hierarchical voice de-noising and system under noise circumstance
CN110428849B (en) * 2019-07-30 2021-10-08 珠海亿智电子科技有限公司 Voice enhancement method based on generation countermeasure network
CN110390950B (en) * 2019-08-17 2021-04-09 浙江树人学院(浙江树人大学) End-to-end voice enhancement method based on generation countermeasure network

Also Published As

Publication number Publication date
CN111081266A (en) 2020-04-28


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant