CN111081266B - Training a generative adversarial network, and speech enhancement method and system - Google Patents
- Publication number: CN111081266B
- Application number: CN201911312488.0A
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
Abstract
The invention discloses a method and a system for training a generative adversarial network and for speech enhancement. In the training method, the speech pairs formed from sub-segments of the generator's denoised speech and of the noisy speech, and the pairs formed from sub-segments of the noisy speech and the corresponding clean speech, are input to a local discriminator; the pair of noisy speech with its corresponding clean speech, and the pair of noisy speech with the generator's denoised speech, are input to a global discriminator; the discriminators and the generator are then trained respectively. By training an adversarial network containing both a global and a local discriminator, the generator is guided to produce denoised speech that is clean both globally and locally, so noise that is hard to remove in specific regions is eliminated in a targeted manner. A generator based on a feature pyramid network better extracts speech features and perceives noise, thereby removing noise more effectively.
Description
Technical Field
The invention relates to the technical field of speech recognition, and in particular to a method and system for training a generative adversarial network and for speech enhancement.
Background
Speech Enhancement (SE) refers to removing noise z from noisy speech y so as to recover the clean speech x, i.e., x = y − z. Removing noise from a mixed speech signal is one of the most challenging tasks in speech signal processing; conventional speech enhancement algorithms include spectral subtraction, subspace methods, and Wiener filtering. In recent years, speech enhancement methods based on deep learning have made great breakthroughs, and in particular the application of Generative Adversarial Networks (GANs) to speech enhancement has effectively improved the quality of denoised speech.
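As a point of reference for the classical baselines mentioned above, a minimal spectral-subtraction sketch (not part of the invention; the signal length, floor factor and the use of an oracle noise estimate are illustrative assumptions):

```python
import numpy as np

def spectral_subtraction(noisy, noise_estimate, floor=0.01):
    """Subtract an estimated noise magnitude spectrum from the noisy
    magnitude spectrum, floor the result to avoid negative magnitudes,
    and resynthesize with the noisy phase."""
    spec = np.fft.rfft(noisy)
    mag, phase = np.abs(spec), np.angle(spec)
    noise_mag = np.abs(np.fft.rfft(noise_estimate))
    clean_mag = np.maximum(mag - noise_mag, floor * mag)
    return np.fft.irfft(clean_mag * np.exp(1j * phase), n=len(noisy))

n = 256
x = np.sin(2 * np.pi * 5 * np.arange(n) / n)   # stand-in "clean speech"
rng = np.random.default_rng(0)
z = 0.3 * rng.standard_normal(n)               # additive noise
y = x + z                                      # noisy speech y = x + z
enhanced = spectral_subtraction(y, z)          # oracle noise estimate
mse_noisy = np.mean((y - x) ** 2)
mse_enhanced = np.mean((enhanced - x) ** 2)
print(mse_enhanced < mse_noisy)
```

With an oracle noise estimate the enhanced signal is much closer to x than the noisy input; in practice the noise spectrum must itself be estimated, which is exactly where such classical methods degrade.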
Since its introduction, the generative adversarial network has achieved good results in many challenging generation tasks such as image translation and super-resolution. It contains two important components: a Generator (G) and a Discriminator (D). The generator maps samples z from a prior distribution Z to samples of the target distribution X; D acts as a binary classifier that judges real samples as true and the false samples produced by the generator as false. The generator and the discriminator are trained adversarially: the generator tries to make its samples follow the real distribution as closely as possible so as to fool the discriminator into judging them true, while the discriminator tries to separate real samples from generated ones. Through this continual game, the samples generated by the generator come very close to the real samples, until Nash equilibrium is reached and the discriminator can no longer tell whether a sample is real or generated.
Existing speech enhancement techniques based on generative adversarial networks largely apply frameworks from image generation tasks directly to speech denoising, without exploiting the characteristics of speech for more effective noise reduction. Noisy speech is a highly non-stationary signal: some regions may be very noisy while others contain little or no noise. The generator has no suitable structure to perceive this characteristic, and although the discriminator can guide the generator to produce speech that is as clean as possible, it does so only from a relatively coarse, global perspective; the generated speech is therefore likely to contain noise locally, and the speech enhancement effect is poor.
Disclosure of Invention
Therefore, the present invention provides a method for training a generative adversarial network, a speech enhancement method, and corresponding systems, which overcome the poor speech enhancement quality of prior-art enhancement techniques based on generative adversarial networks.
In a first aspect, an embodiment of the present invention provides a method for training a generative adversarial network, comprising the following steps: obtaining noisy speech y and corresponding clean speech x to form a training set; inputting the noisy speech y into a generator to produce denoised speech x̂; intercepting k sub-speech segments of the same size at the same positions from the clean speech x, the noisy speech y, and the generator's denoised speech x̂; inputting the speech pairs (x̂_i, y_i), formed from sub-segments of the denoised speech and of the noisy speech, and the pairs (x_i, y_i), formed from sub-segments of the noisy speech and the corresponding clean speech, into a local discriminator; inputting the pair (x, y) of noisy speech and corresponding clean speech and the pair (x̂, y) of the generator's denoised speech and noisy speech into a global discriminator; training the discriminators and the generator respectively; and obtaining a trained generative adversarial network according to a preset training end condition.
In one embodiment, the process of training the discriminators comprises:
The speech pairs (x̂_i, y_i), formed from sub-segments of the generator's denoised speech and of the noisy speech, and the pairs (x_i, y_i), formed from sub-segments of the noisy speech and the corresponding clean speech, are input to the local discriminator to obtain the relative probability p(x_i, y_i) that the clean sub-segment corresponding to the noisy sub-segment is true. The pair (x, y) of noisy speech and corresponding clean speech and the pair (x̂, y) of the generator's denoised speech and noisy speech are input to the global discriminator to obtain the relative probability p(x, y) that the clean speech corresponding to the noisy speech is true. The parameters of the generator are fixed and the discriminator loss is calculated to update the parameters of the discriminators.
In one embodiment, the discriminator loss is calculated by the following formulas:
D_loss = localD_Loss + globalD_Loss,
localD_Loss = -log(min_i p(x_i, y_i)),
globalD_Loss = -log(p(x, y));
where globalD_Loss is the global discrimination loss, localD_Loss is the local discrimination loss, and min_i p(x_i, y_i) is the minimum, over the k sub-speech pairs, of the relative probability that the clean sub-segment corresponding to the noisy sub-segment is true.
In one embodiment, the process of training the generator comprises: inputting the speech pairs (x̂_i, y_i), formed from sub-segments of the generator's denoised speech and of the noisy speech, and the pairs (x_i, y_i), formed from sub-segments of the noisy speech and the corresponding clean speech, into the local discriminator to obtain the relative probability p(x̂_i, y_i) that the generator's denoised sub-segment is true; inputting the pair (x, y) of noisy speech and corresponding clean speech and the pair (x̂, y) of the generator's denoised speech and noisy speech into the global discriminator to obtain the relative probability p(x̂, y) that the generator's denoised speech is true; fixing the discriminator parameters and calculating the generation loss to update the generator parameters.
In one embodiment, the generation loss is calculated by the following formula:
G_loss=localDG_Loss+globalDG_Loss+L_Loss,
where globalDG_Loss is the adversarial loss from the global discriminator, localDG_Loss is the adversarial loss from the local discriminator, and L_Loss is the L_1 distance loss between the enhanced speech generated by the generator and the clean speech.
In a second aspect, an embodiment of the present invention provides a speech enhancement method, comprising: obtaining noisy speech to be enhanced; and inputting the noisy speech into the generator of a generative adversarial network obtained by the training method according to the first aspect of the embodiments of the present invention, so as to enhance the speech and generate clean speech.
In one embodiment, the generator uses a feature pyramid network as its backbone network to obtain speech features at different scales.
In a third aspect, an embodiment of the present invention provides a system for training a generative adversarial network, comprising: a training set acquisition module for obtaining noisy speech y and corresponding clean speech x to form a training set; a generator denoising module for inputting the noisy speech into the generator to produce denoised speech x̂; a sub-speech acquisition module for intercepting k sub-speech segments of the same size at the same positions from the clean speech x, the noisy speech y and the generator's denoised speech x̂; and a training module for inputting the speech pairs (x̂_i, y_i), formed from sub-segments of the denoised speech and of the noisy speech, and the pairs (x_i, y_i), formed from sub-segments of the noisy speech and the corresponding clean speech, into the local discriminator, inputting the pair (x, y) of noisy and clean speech and the pair (x̂, y) of denoised and noisy speech into the global discriminator, training the discriminators and the generator respectively, and obtaining a trained generative adversarial network according to a preset training end condition.
In a fourth aspect, an embodiment of the present invention provides a speech enhancement system, comprising: a to-be-enhanced speech acquisition module for obtaining noisy speech to be enhanced; and a speech enhancement module configured to input the noisy speech into the generator of a generative adversarial network obtained by the training method according to the first aspect of the embodiments of the present invention, so as to enhance the speech and generate clean speech.
In a fifth aspect, an embodiment of the present invention provides a computer device, including: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor, the instructions being executable by the at least one processor to cause the at least one processor to perform the method of the first or second aspect of the embodiments of the present invention.
In a sixth aspect, an embodiment of the present invention provides a computer-readable storage medium storing computer instructions for causing a processor to execute the method according to the first or second aspect of the embodiments of the present invention.
The technical scheme of the invention has the following advantages:
1. The method and system for training a generative adversarial network provided by the embodiments of the present invention exploit the non-stationarity of noisy speech and the locality of noise: the speech pairs (x̂_i, y_i), formed from sub-segments of the generator's denoised speech and of the noisy speech, and the pairs (x_i, y_i), formed from sub-segments of the noisy speech and the corresponding clean speech, are input to a local discriminator; the pair (x, y) of noisy speech and corresponding clean speech and the pair (x̂, y) of denoised and noisy speech are input to a global discriminator; the discriminators and the generator are trained respectively. Through a dynamic granularity discriminator comprising the global and local discriminators, the trained adversarial network guides the generator to produce denoised speech that is clean both globally and locally, removing noise that is hard to eliminate in specific regions in a targeted manner.
2. The speech enhancement method and system provided by the embodiments of the present invention exploit the non-stationary character of noisy speech: a generator based on a feature pyramid network better extracts speech features and perceives noise, thereby removing noise more effectively.
Drawings
In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of a specific example of a method for training a generative adversarial network according to an embodiment of the present invention;
Fig. 2 is a flowchart of a specific example of a speech enhancement method according to an embodiment of the present invention;
Fig. 3 is a flowchart of the convolution operations performed using a feature pyramid network as the backbone network according to an embodiment of the present invention;
Fig. 4 is a diagram illustrating multiple convolution operations provided by an embodiment of the present invention;
Fig. 5 is a block diagram of a system for training a generative adversarial network according to an embodiment of the present invention;
Fig. 6 is a block diagram of a speech enhancement system according to an embodiment of the present invention;
Fig. 7 is a block diagram of a specific example of a computer device according to an embodiment of the present invention.
Detailed Description
The technical solutions of the present invention will be described clearly and completely with reference to the accompanying drawings, and it should be understood that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In addition, the technical features involved in the different embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
Example 1
The method for training a generative adversarial network provided by the embodiment of the present invention, as shown in Fig. 1, comprises the following steps:
step S1: the noisy speech and its corresponding clean speech are obtained to form a training set.
In the embodiment of the present invention, clean speech x may be obtained first, and noise may be randomly added to it to obtain the noisy speech y; this is merely an example and is not limiting.
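A minimal sketch of this training-set construction, assuming additive noise mixed at a randomly drawn signal-to-noise ratio (the SNR range and signal lengths are illustrative choices, not taken from the patent):

```python
import numpy as np

def add_noise(clean, noise, snr_db):
    """Scale the noise so that the mixture has the requested SNR (in dB),
    then add it to the clean signal."""
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2)
    scale = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return clean + scale * noise

rng = np.random.default_rng(0)
x = np.sin(2 * np.pi * 5 * np.arange(1024) / 1024)   # stand-in clean speech x
snr_db = rng.uniform(0, 15)                          # random SNR per example
y = add_noise(x, rng.standard_normal(1024), snr_db)  # noisy speech y

# The realised SNR equals the requested one by construction.
measured = 10 * np.log10(np.mean(x ** 2) / np.mean((y - x) ** 2))
print(abs(measured - snr_db) < 1e-9)
```

Drawing a fresh SNR per example gives the discriminators a spread of noise levels to learn from, matching the "randomly added noise" described above.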
Step S2: the noisy speech is input to the generator to generate denoised speech.
In the embodiment of the present invention, the noisy speech y is input into a preset generator to obtain denoised speech x̂. The generator uses a feature pyramid network as its backbone network and generates the denoised speech x̂ by acquiring speech features at different scales.
Step S3: sub-speech segments of the same size are intercepted at the same positions from the clean speech x, the noisy speech y and the generator's denoised speech x̂.
In this embodiment, the same interception operation is performed at the same positions on the clean speech x, the noisy speech y and the generator's denoised speech x̂; the sub-segments obtained at corresponding positions are x_1, x_2, …, x_k; y_1, y_2, …, y_k; and x̂_1, x̂_2, …, x̂_k. The number k is set reasonably according to actual needs; in this embodiment of the invention k is taken as 10.
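The interception of step S3 can be sketched as aligned random crops; the segment length and the use of random start positions are assumptions for illustration (the text only requires same-size segments at the same positions in all three signals):

```python
import numpy as np

def crop_sub_speech(x, y, x_hat, k=10, seg_len=2048, rng=None):
    """Cut k aligned, equal-length sub-segments from the clean, noisy
    and denoised waveforms; identical start positions are used for all
    three signals so each triple (x_i, y_i, x_hat_i) covers the same span."""
    rng = rng or np.random.default_rng()
    starts = rng.integers(0, len(x) - seg_len, size=k)
    xs = np.stack([x[s:s + seg_len] for s in starts])
    ys = np.stack([y[s:s + seg_len] for s in starts])
    xhs = np.stack([x_hat[s:s + seg_len] for s in starts])
    return xs, ys, xhs

rng = np.random.default_rng(0)
x = rng.standard_normal(16000)                 # stand-in clean waveform
y = x + 0.1 * rng.standard_normal(16000)       # noisy speech
x_hat = x + 0.01 * rng.standard_normal(16000)  # generator output
xs, ys, xhs = crop_sub_speech(x, y, x_hat, k=10, seg_len=2048, rng=rng)
print(xs.shape, ys.shape, xhs.shape)           # three (10, 2048) arrays
```

The alignment is what lets the local discriminator compare a denoised region directly against the same region of the noisy and clean signals.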
Step S4: the speech pairs (x̂_i, y_i), formed from sub-segments of the generator's denoised speech and of the noisy speech, and the pairs (x_i, y_i), formed from sub-segments of the noisy speech and the corresponding clean speech, are input to the local discriminator; the pair (x, y) of noisy speech and corresponding clean speech and the pair (x̂, y) of denoised and noisy speech are input to the global discriminator; the discriminators and the generator are trained respectively, and a trained generative adversarial network is obtained according to a preset training end condition.
In this embodiment, the preset training end condition may be that the loss value obtained in training is smaller than a preset threshold and/or that the number of training steps reaches a preset count, which is not limited here.
The process of training the discriminators in the embodiment of the present invention comprises the following steps:
1) the speech pairs (x̂_i, y_i), formed from sub-segments of the generator's denoised speech and of the noisy speech, and the pairs (x_i, y_i), formed from sub-segments of the noisy speech and the corresponding clean speech, are input to the local discriminator to obtain the relative probability p(x_i, y_i) that the clean sub-segment corresponding to the noisy sub-segment is true;
2) the pair (x, y) of noisy speech and corresponding clean speech and the pair (x̂, y) of the generator's denoised speech and noisy speech are input to the global discriminator to obtain the relative probability p(x, y) that the clean speech x corresponding to the noisy speech is true;
3) the parameters of the generator are fixed and the discriminator loss is calculated to update the parameters of the discriminators.
The discriminator loss in the embodiment of the present invention is calculated by the following formulas:
D_loss = localD_Loss + globalD_Loss,
localD_Loss = -log(min_i p(x_i, y_i)),
globalD_Loss = -log(p(x, y));
where globalD_Loss is the global discrimination loss, localD_Loss is the local discrimination loss, and min_i p(x_i, y_i) is the minimum, over the k sub-speech pairs, of the relative probability that the clean sub-segment corresponding to the noisy sub-segment is true.
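The discriminator loss above can be computed directly from the discriminator outputs. The sketch below assumes the local and global discriminators emit probabilities in (0, 1); the concrete values stand in for real network outputs:

```python
import numpy as np

def discriminator_loss(p_global, p_local):
    """D_loss = localD_Loss + globalD_Loss, with
       localD_Loss  = -log(min_i p(x_i, y_i))   (hardest sub-speech pair)
       globalD_Loss = -log(p(x, y))."""
    return -np.log(np.min(p_local)) - np.log(p_global)

p_local = np.array([0.9, 0.8, 0.95])  # hypothetical local outputs, one per sub-speech pair
p_global = 0.85                        # hypothetical global output
d_loss = discriminator_loss(p_global, p_local)
print(round(d_loss, 4))                # -log(0.8) - log(0.85) ≈ 0.3857
```

Taking the minimum over the sub-speech pairs focuses the local loss on the worst region, which is exactly what the min(p(x_i, y_i)) term in the formula encodes.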
In the embodiment of the present invention, the process of training the generator comprises:
1) the speech pairs (x̂_i, y_i), formed from sub-segments of the generator's denoised speech and of the noisy speech, and the pairs (x_i, y_i), formed from sub-segments of the noisy speech and the corresponding clean speech, are input to the local discriminator to obtain the relative probability p(x̂_i, y_i) that the generator's denoised sub-segment is true;
2) the pair (x, y) of noisy speech and corresponding clean speech and the pair (x̂, y) of the generator's denoised speech and noisy speech are input to the global discriminator to obtain the relative probability p(x̂, y) that the generator's denoised speech is true;
3) the discriminator parameters are fixed and the generation loss is calculated to update the generator parameters.
The generation loss given in the embodiment of the present invention is calculated by the following formula:
G_loss=localDG_Loss+globalDG_Loss+L_Loss,
where globalDG_Loss is the adversarial loss from the global discriminator, localDG_Loss is the adversarial loss from the local discriminator, and L_Loss is the L_1 distance loss between the enhanced speech generated by the generator and the clean speech.
In the embodiment of the present invention, it is desired that the speech generated by the generator has noise removed at every granularity, that is, that the discriminators judge it as true, with outputs as close to 1 as possible. The L_1 distance loss L_Loss is the L_1 distance between the clean speech and the enhanced speech at each granularity. It should be noted that this distance may also be a cosine distance, a Euclidean distance, or another similar distance, which is not limited here.
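A sketch of the generation loss under stated assumptions: the text does not give explicit formulas for localDG_Loss and globalDG_Loss, so the code below assumes the usual -log adversarial form mirroring the discriminator loss (including the min over sub-speech pairs) and an unweighted L1 term; the weight lam and all concrete numbers are hypothetical:

```python
import numpy as np

def generator_loss(p_global_fake, p_local_fake, enhanced, clean, lam=1.0):
    """G_loss = localDG_Loss + globalDG_Loss + L_Loss.
    p_*_fake: probabilities the discriminators assign to the generator's
    output being real (assumed form); lam weights the L1 term (assumption)."""
    local_adv = -np.log(np.min(p_local_fake))   # assumed analog of localD_Loss
    global_adv = -np.log(p_global_fake)
    l1 = lam * np.mean(np.abs(enhanced - clean))
    return local_adv + global_adv + l1

rng = np.random.default_rng(0)
clean = rng.standard_normal(100)
enhanced = clean + 0.05 * rng.standard_normal(100)   # near-clean generator output
g_loss = generator_loss(0.6, np.array([0.5, 0.7]), enhanced, clean)
print(g_loss > 0)   # all three terms are non-negative here
```

The adversarial terms shrink as the discriminators' outputs approach 1, matching the goal stated above, while the L1 term keeps the enhanced speech close to the clean reference.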
The method for training a generative adversarial network provided by the embodiment of the present invention exploits the non-stationarity of noisy speech and the locality of noise, and uses a dynamic granularity discriminator comprising a global discriminator and a local discriminator to guide the generator to produce denoised speech that is clean both globally and locally, so noise that is hard to eliminate in specific regions is removed in a targeted manner and the speech enhancement effect is better.
Example 2
An embodiment of the present invention provides a speech enhancement method, as shown in Fig. 2, comprising the following steps:
Step S21: obtaining the noisy speech to be enhanced;
Step S22: the noisy speech to be enhanced is input into the generator of a trained generative adversarial network to enhance the speech and generate clean speech. The generator employed in this embodiment is the generator contained in the generative adversarial network obtained by the training method of Embodiment 1.
In the embodiment of the present invention, the generator of the generative adversarial network uses a feature pyramid network as its backbone network. As shown in Fig. 3, a convolution with stride 1 is first applied before the feature pyramid network. The bottom-up path of the feature pyramid network contains 5 Dense Blocks, each of which produces features carrying information at a different semantic level. In the top-down path, 1×1 convolutions give the features of each scale the same number of channels; the four uppermost features are upsampled to 1/4 of the original input length, spliced together and passed through one convolution to obtain a dense feature containing information at different semantic levels; this feature is then upsampled twice and convolved once, combining the output of the first Dense Block, to obtain the clean speech. In this embodiment four speech features of different scales are taken as an example, but the invention is not limited thereto.
As shown in Fig. 4, the Dense Block in this embodiment comprises multiple convolution operations, each of which takes the outputs of all preceding layers as input and feeds its own output to every following layer. Let the output of the l-th layer be X_l, and let the outputs of layers l−1, l−2, …, 0 (layer 0 being the input of the current Block) be X_{l−1}, X_{l−2}, …, X_0; then:

X_l = H_l([X_{l−1}, X_{l−2}, …, X_0])

where H_l denotes the convolution operation of the l-th layer and [·] denotes the splicing operation. To produce features of different scales, the embodiment of the invention embeds a convolution with stride 2 at the end of each Dense Block. After each convolution, a Parametric Rectified Linear Unit (PReLU) is used as the activation function.
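The dense connectivity X_l = H_l([X_{l−1}, …, X_0]) can be sketched with a toy stand-in for H_l; the random linear map plus ReLU below replaces the real convolution + PReLU, and the layer and channel counts are illustrative:

```python
import numpy as np

def dense_block(x0, num_layers=4, growth=4, rng=None):
    """Each layer takes the concatenation of ALL preceding outputs
    [X_{l-1}, ..., X_0] as input, as in the Dense Block equation."""
    rng = rng or np.random.default_rng(0)
    outputs = [x0]
    for _ in range(num_layers):
        concat = np.concatenate(outputs, axis=0)   # the splicing operation [.]
        W = rng.standard_normal((growth, concat.shape[0]))
        x_l = np.maximum(W @ concat, 0.0)          # stand-in for H_l + activation
        outputs.append(x_l)
    return outputs

outs = dense_block(np.ones((4, 32)))   # 4 input channels, signal length 32
print([o.shape[0] for o in outs])      # [4, 4, 4, 4, 4]
```

Each layer outputs `growth` channels while its input width grows as 4, 8, 12, 16 channels, the hallmark of dense connectivity; in the generator described above, a stride-2 convolution at the end of each block additionally halves the length.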
The speech enhancement method provided by the embodiment of the present invention exploits the non-stationary character of noisy speech: the generator based on a feature pyramid network better extracts speech features and perceives noise, thereby removing noise more effectively. By further exploiting the non-stationarity of noisy speech and the locality of noise, the dynamic granularity discriminator comprising the global and local discriminators obtained in Embodiment 1 guides the generator to produce denoised speech that is clean both globally and locally, removing noise that is hard to eliminate in specific regions in a more targeted manner.
Example 3
An embodiment of the present invention provides a system for training a generative adversarial network, as shown in Fig. 5, comprising:
The training set acquisition module 1 is used to obtain noisy speech y and corresponding clean speech x to form a training set; this module executes the method described in step S1 of Embodiment 1, which is not repeated here.
The generator denoising module 2 is used to input the noisy speech into the generator to generate denoised speech x̂; this module executes the method described in step S2 of Embodiment 1, which is not repeated here.
The sub-speech acquisition module 3 is used to intercept k sub-speech segments of the same size at the same positions from the clean speech x, the noisy speech y and the generator's denoised speech x̂; this module executes the method described in step S3 of Embodiment 1, which is not repeated here.
The training module 4 is used to input the speech pairs (x̂_i, y_i), formed from sub-segments of the denoised speech and of the noisy speech, and the pairs (x_i, y_i), formed from sub-segments of the noisy speech and the corresponding clean speech, into the local discriminator, to input the pair (x, y) of noisy and clean speech and the pair (x̂, y) of denoised and noisy speech into the global discriminator, to train the discriminators and the generator respectively, and to obtain a trained generative adversarial network according to a preset training end condition; this module executes the method described in step S4 of Embodiment 1, which is not repeated here.
The system for training a generative adversarial network provided by the embodiment of the present invention exploits the non-stationarity of noisy speech and the locality of noise, and uses a dynamic granularity discriminator comprising a global discriminator and a local discriminator to guide the generator to produce denoised speech that is clean both globally and locally, so noise that is hard to eliminate in specific regions is removed in a targeted manner and the speech enhancement effect is better.
Example 4
An embodiment of the present invention provides a speech enhancement system, as shown in Fig. 6, comprising:
The to-be-enhanced speech acquisition module 5 is used to obtain the noisy speech to be enhanced; this module executes the method described in step S21 of Embodiment 2, which is not repeated here.
The speech enhancement module 6 is configured to input the noisy speech to be enhanced into the generator of the generative adversarial network obtained by the training method of Embodiment 1 to enhance the speech and generate clean speech; this module executes the method described in step S22 of Embodiment 2, which is not repeated here.
The speech enhancement system provided by the embodiment of the present invention exploits the non-stationary character of noisy speech: the generator based on a feature pyramid network better extracts speech features and perceives noise, thereby removing noise more effectively. By further exploiting the non-stationarity of noisy speech and the locality of noise, the dynamic granularity discriminator comprising the global and local discriminators obtained in Embodiment 1 guides the generator to produce denoised speech that is clean both globally and locally, removing noise that is hard to eliminate in specific regions in a more targeted manner.
Example 5
An embodiment of the present invention provides a computer device, as shown in Fig. 7, comprising: at least one processor 401, such as a CPU (Central Processing Unit), at least one communication interface 403, a memory 404, and at least one communication bus 402, which enables communication between these components. The communication interface 403 may include a display and a keyboard, and optionally also a standard wired interface and a standard wireless interface. The memory 404 may be a RAM (Random Access Memory) or a non-volatile memory, such as at least one disk memory, and may optionally be at least one storage device located remotely from the processor 401. The processor 401 may perform the method of Embodiment 1 or Embodiment 2: a set of program codes is stored in the memory 404, and the processor 401 calls these program codes to execute that method. The communication bus 402 may be a Peripheral Component Interconnect (PCI) bus or an Extended Industry Standard Architecture (EISA) bus, and may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, only one line is shown in Fig. 7, but this does not mean that there is only one bus or one type of bus.
The memory 404 may include a volatile memory, such as a random-access memory (RAM); it may also include a non-volatile memory, such as a flash memory, a hard disk drive (HDD) or a solid-state drive (SSD); the memory 404 may also comprise a combination of the above kinds of memory.
The processor 401 may be a Central Processing Unit (CPU), a Network Processor (NP), or a combination of a CPU and an NP.
The processor 401 may further include a hardware chip. The hardware chip may be an application-specific integrated circuit (ASIC), a Programmable Logic Device (PLD), or a combination thereof. The PLD may be a Complex Programmable Logic Device (CPLD), a field-programmable gate array (FPGA), a General Array Logic (GAL), or any combination thereof.
Optionally, the memory 404 is also used to store program instructions, and the processor 401 may call these program instructions to implement the method in embodiment 1 or embodiment 2 of the present application.
An embodiment of the present invention further provides a computer-readable storage medium on which computer-executable instructions are stored; these instructions can execute the method in embodiment 1 or embodiment 2. The storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM), a random access memory (RAM), a flash memory, a hard disk drive (HDD), a solid-state drive (SSD), or the like; the storage medium may also comprise a combination of the above kinds of memory.
It should be understood that the above examples are given only for clarity of illustration and are not intended to limit the embodiments. Other variations and modifications will be apparent to persons skilled in the art in light of the above description; it is neither necessary nor possible to enumerate all embodiments exhaustively here, and obvious variations or modifications may be made without departing from the spirit or scope of the invention.
Claims (9)
1. A method for training a generative adversarial network, characterized by comprising the following steps:
obtaining a noisy speech y and a corresponding clean speech x to form a training set;
intercepting, from the clean speech x, the noisy speech y and the speech de-noised by the generator, k sub-speeches of the same size at the same positions respectively;
inputting the speech pairs composed of a sub-speech of the generator's de-noised speech and a sub-speech of the noisy speech, and the speech pairs (x_i, y_i) composed of a sub-speech of the noisy speech and a sub-speech of the corresponding clean speech, into the local discriminator respectively; inputting the speech pair (x, y) composed of the noisy speech and the corresponding clean speech, and the speech pair composed of the noisy speech and the generator's de-noised speech, into the global discriminator; training the discriminators and the generator respectively, and obtaining a well-trained generative adversarial network according to a preset training end condition; where y_i represents the i-th sub-speech of the noisy speech, x_i represents the i-th sub-speech of the clean speech, and the i-th sub-speech of the de-noised speech is taken correspondingly;
wherein the process of training the generator comprises:
inputting the speech pairs composed of a sub-speech of the generator's de-noised speech and a sub-speech of the noisy speech, and the speech pairs (x_i, y_i) composed of a sub-speech of the noisy speech and a sub-speech of the corresponding clean speech, into the local discriminator respectively, to obtain the relative probability that each sub-speech of the generator's de-noised speech is true;
inputting the speech pair (x, y) composed of the noisy speech and the corresponding clean speech, and the speech pair composed of the noisy speech and the generator's de-noised speech, into the global discriminator, to obtain the relative probability that the generator's de-noised speech is true;
Fixing the discriminator parameters and calculating the generation loss to update the parameters of the generator;
the generation loss is calculated by the following formula:
G_loss=localDG_Loss+globalDG_Loss+L_Loss,
wherein globalDG_Loss is the global adversarial loss of the discriminator, localDG_Loss is the local adversarial loss of the discriminator, and L_Loss is the L_1 distance loss between the enhanced speech generated by the generator and the clean speech.
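To make the combined objective concrete, here is a minimal numerical sketch of G_loss = localDG_Loss + globalDG_Loss + L_Loss. The patent gives no closed forms for the two adversarial terms, so the negative-log forms below, and the use of the least-confident sub-speech for the local term (mirroring the min() in claim 3), are illustrative assumptions, as are all function and variable names.

```python
import numpy as np

def generator_loss(p_local_fake, p_global_fake, enhanced, clean):
    """Sketch of G_loss = localDG_Loss + globalDG_Loss + L_Loss.

    p_local_fake: relative probabilities that each of the k de-noised
        sub-speeches is true (local discriminator outputs, assumed form).
    p_global_fake: relative probability that the whole de-noised
        utterance is true (global discriminator output, assumed form).
    enhanced, clean: waveforms as numpy arrays.
    """
    eps = 1e-8  # numerical guard against log(0)
    # Local adversarial loss on the least-confident sub-speech (assumption).
    local_dg_loss = -np.log(min(p_local_fake) + eps)
    # Global adversarial loss over the whole utterance.
    global_dg_loss = -np.log(p_global_fake + eps)
    # L_1 distance between enhanced and clean speech.
    l_loss = np.mean(np.abs(enhanced - clean))
    return local_dg_loss + global_dg_loss + l_loss
```

A perfect generator (all probabilities 1, enhanced equal to clean) drives this loss to zero, which matches the intent of the combined objective.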
2. The method for training a generative adversarial network as claimed in claim 1, wherein the process of training the discriminator comprises:
inputting the speech pairs composed of a sub-speech of the generator's de-noised speech and a sub-speech of the noisy speech, and the speech pairs (x_i, y_i) composed of a sub-speech of the noisy speech and a sub-speech of the corresponding clean speech, into the local discriminator respectively, to obtain the relative probability p(x_i, y_i) that the clean-speech sub-speech corresponding to the noisy sub-speech is true;
inputting the speech pair (x, y) composed of the noisy speech and the corresponding clean speech, and the speech pair composed of the noisy speech and the generator's de-noised speech, into the global discriminator, to obtain the relative probability p(x, y) that the clean speech corresponding to the noisy speech is true;
fixing the parameters of the generator and calculating the discriminator loss to update the parameters of the discriminator.
3. The method of training a generative adversarial network as claimed in claim 2, wherein the discriminator loss is calculated by the following formula:
D_loss=localD_Loss+globalD_Loss,
localD_Loss=-log(min(p(x_i,y_i))),
globalD_Loss=-log(p(x,y));
wherein globalD_Loss is the global discrimination loss, localD_Loss is the local discrimination loss, and min(p(x_i, y_i)) is the minimum of the relative probabilities that the k clean-speech sub-speeches corresponding to the noisy speech are true.
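A minimal numerical sketch of the discriminator loss of claim 3, implementing the two formulas as written; the function and argument names are illustrative assumptions, not part of the patent.

```python
import numpy as np

def discriminator_loss(p_local_real, p_global_real):
    """Sketch of D_loss = localD_Loss + globalD_Loss.

    p_local_real: the k relative probabilities p(x_i, y_i) that the
        clean sub-speeches are true (local discriminator outputs).
    p_global_real: the relative probability p(x, y) that the clean
        speech is true (global discriminator output).
    """
    eps = 1e-8  # numerical guard against log(0)
    local_d_loss = -np.log(min(p_local_real) + eps)   # -log(min p(x_i, y_i))
    global_d_loss = -np.log(p_global_real + eps)      # -log p(x, y)
    return local_d_loss + global_d_loss
```

Taking the minimum over the k sub-speech probabilities makes the local term penalize the single sub-region the discriminator is least sure about, which is consistent with the "dynamic granularity" idea described in the description.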
4. A method of speech enhancement, comprising:
acquiring the noisy speech to be enhanced;
inputting the noisy speech to be enhanced into the generator of the generative adversarial network obtained by the method for training a generative adversarial network according to any one of claims 1 to 3, so as to enhance the speech and generate clean speech.
5. The speech enhancement method of claim 4, wherein the generator uses a feature pyramid network as its backbone network to obtain speech features at different scales.
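Claim 5 only states that the backbone yields speech features at several scales; it does not specify the network. As an illustration of the multi-scale idea alone (not the patent's actual feature pyramid network), the sketch below builds a pyramid of progressively downsampled views of a waveform by stride-2 average-pooling; all names are hypothetical.

```python
import numpy as np

def multiscale_features(wave, num_scales=3):
    """Illustrative pyramid of speech 'features' at num_scales temporal
    resolutions, each level obtained by average-pooling pairs of samples
    of the previous level (a stand-in for learned downsampling layers)."""
    features = [np.asarray(wave, dtype=float)]
    for _ in range(num_scales - 1):
        w = features[-1]
        # Trim to an even length, then pool pairs of samples (stride 2).
        w = w[: len(w) // 2 * 2].reshape(-1, 2).mean(axis=1)
        features.append(w)
    return features
```

In an actual feature pyramid network the coarse levels would also be upsampled and fused back with the fine levels; this sketch shows only the multi-resolution decomposition that the claim refers to.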
6. A system for training a generative adversarial network, characterized by comprising:
a training set acquisition module, configured to obtain a noisy speech y and a corresponding clean speech x to form a training set;
a generator de-noising module, configured to input the noisy speech into the generator to generate de-noised speech;
a sub-speech acquisition module, configured to intercept, from the clean speech x, the noisy speech y and the speech de-noised by the generator, k sub-speeches of the same size at the same positions respectively;
a training module, configured to input the speech pairs composed of a sub-speech of the de-noised speech and a sub-speech of the noisy speech, and the speech pairs (x_i, y_i) composed of a sub-speech of the noisy speech and a sub-speech of the corresponding clean speech, into the local discriminator respectively; to input the speech pair (x, y) composed of the noisy speech and the corresponding clean speech, and the speech pair composed of the noisy speech and the generator's de-noised speech, into the global discriminator; to train the discriminators and the generator respectively; and to obtain a well-trained generative adversarial network according to a preset training end condition; where y_i represents the i-th sub-speech of the noisy speech, x_i represents the i-th sub-speech of the clean speech, and the i-th sub-speech of the de-noised speech is taken correspondingly;
wherein the process of training the generator comprises:
inputting the speech pairs composed of a sub-speech of the generator's de-noised speech and a sub-speech of the noisy speech, and the speech pairs (x_i, y_i) composed of a sub-speech of the noisy speech and a sub-speech of the corresponding clean speech, into the local discriminator respectively, to obtain the relative probability that each sub-speech of the generator's de-noised speech is true;
inputting the speech pair (x, y) composed of the noisy speech and the corresponding clean speech, and the speech pair composed of the noisy speech and the generator's de-noised speech, into the global discriminator, to obtain the relative probability that the generator's de-noised speech is true;
Fixing the discriminator parameters and calculating the generation loss to update the parameters of the generator;
the generation loss is calculated by the following formula:
G_loss=localDG_Loss+globalDG_Loss+L_Loss,
wherein globalDG_Loss is the global adversarial loss of the discriminator, localDG_Loss is the local adversarial loss of the discriminator, and L_Loss is the L_1 distance loss between the enhanced speech generated by the generator and the clean speech.
7. A speech enhancement system, comprising:
a to-be-enhanced speech acquisition module, configured to acquire the noisy speech to be enhanced;
a speech enhancement module, configured to input the noisy speech to be enhanced into the generator of the generative adversarial network obtained by the method for training a generative adversarial network according to any one of claims 1 to 3, so as to enhance the speech and generate clean speech.
8. A computer device, comprising: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to cause the at least one processor to perform the method of any of claims 1-3 or 4-5.
9. A computer-readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any of claims 1-3 or 4-5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911312488.0A CN111081266B (en) | 2019-12-18 | 2019-12-18 | Training generation countermeasure network, and voice enhancement method and system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111081266A CN111081266A (en) | 2020-04-28 |
CN111081266B true CN111081266B (en) | 2022-08-09 |
Family
ID=70315828
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911312488.0A Active CN111081266B (en) | 2019-12-18 | 2019-12-18 | Training generation countermeasure network, and voice enhancement method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111081266B (en) |
Families Citing this family (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111833893A (en) * | 2020-06-16 | 2020-10-27 | 杭州云嘉云计算有限公司 | Speech enhancement method based on artificial intelligence |
CN112164008B (en) * | 2020-09-29 | 2024-02-23 | 中国科学院深圳先进技术研究院 | Training method of image data enhancement network, training device, medium and equipment thereof |
CN112444810B (en) * | 2020-10-27 | 2022-07-01 | 电子科技大学 | Radar air multi-target super-resolution method |
CN113096673B (en) * | 2021-03-30 | 2022-09-30 | 山东省计算中心(国家超级计算济南中心) | Voice processing method and system based on generation countermeasure network |
CN113555028B (en) * | 2021-07-19 | 2024-08-02 | 首约科技(北京)有限公司 | Processing method for noise reduction of Internet of vehicles voice |
Family Cites Families (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10971142B2 (en) * | 2017-10-27 | 2021-04-06 | Baidu Usa Llc | Systems and methods for robust speech recognition using generative adversarial networks |
CN108133702A (en) * | 2017-12-20 | 2018-06-08 | 重庆邮电大学 | A kind of deep neural network speech enhan-cement model based on MEE Optimality Criterias |
CN109147810B (en) * | 2018-09-30 | 2019-11-26 | 百度在线网络技术(北京)有限公司 | Establish the method, apparatus, equipment and computer storage medium of speech enhan-cement network |
CN109256144B (en) * | 2018-11-20 | 2022-09-06 | 中国科学技术大学 | Speech enhancement method based on ensemble learning and noise perception training |
CN110085218A (en) * | 2019-03-26 | 2019-08-02 | 天津大学 | A kind of audio scene recognition method based on feature pyramid network |
CN110047502A (en) * | 2019-04-18 | 2019-07-23 | 广州九四智能科技有限公司 | The recognition methods of hierarchical voice de-noising and system under noise circumstance |
CN110428849B (en) * | 2019-07-30 | 2021-10-08 | 珠海亿智电子科技有限公司 | Voice enhancement method based on generation countermeasure network |
CN110390950B (en) * | 2019-08-17 | 2021-04-09 | 浙江树人学院(浙江树人大学) | End-to-end voice enhancement method based on generation countermeasure network |
Also Published As
Publication number | Publication date |
---|---|
CN111081266A (en) | 2020-04-28 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111081266B (en) | Training generation countermeasure network, and voice enhancement method and system | |
DE112017003563B4 (en) | METHOD AND SYSTEM OF AUTOMATIC LANGUAGE RECOGNITION USING POSTERIORI TRUST POINT NUMBERS | |
JP5897107B2 (en) | Detection of speech syllable / vowel / phoneme boundaries using auditory attention cues | |
CN108765334A (en) | A kind of image de-noising method, device and electronic equipment | |
CN110245621B (en) | Face recognition device, image processing method, feature extraction model, and storage medium | |
US20230274479A1 (en) | Learning apparatus and method for creating image and apparatus and method for image creation | |
CN112597918B (en) | Text detection method and device, electronic equipment and storage medium | |
CN111640123B (en) | Method, device, equipment and medium for generating background-free image | |
CN112308866A (en) | Image processing method, image processing device, electronic equipment and storage medium | |
CN113221925A (en) | Target detection method and device based on multi-scale image | |
CN114266894A (en) | Image segmentation method and device, electronic equipment and storage medium | |
CN111860077A (en) | Face detection method, face detection device, computer-readable storage medium and equipment | |
CN111444788B (en) | Behavior recognition method, apparatus and computer storage medium | |
CN111353514A (en) | Model training method, image recognition method, device and terminal equipment | |
CN117593633A (en) | Ocean scene-oriented image recognition method, system, equipment and storage medium | |
CN117252890A (en) | Carotid plaque segmentation method, device, equipment and medium | |
CN113256662B (en) | Pathological section image segmentation method and device, computer equipment and storage medium | |
CN116563556B (en) | Model training method | |
CN115565186B (en) | Training method and device for character recognition model, electronic equipment and storage medium | |
Liu et al. | Interference reduction in reverberant speech separation with visual voice activity detection | |
WO2020241074A1 (en) | Information processing method and program | |
KR102206792B1 (en) | Method for image denoising using parallel feature pyramid network, recording medium and device for performing the method | |
CN116543246A (en) | Training method of image denoising model, image denoising method, device and equipment | |
US20200372280A1 (en) | Apparatus and method for image processing for machine learning | |
CN110580336B (en) | Lip language word segmentation method and device, storage medium and electronic equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||