CN111081266B - Training a generative adversarial network, and speech enhancement method and system - Google Patents

Training a generative adversarial network, and speech enhancement method and system

Info

Publication number
CN111081266B
Authority
CN
China
Prior art keywords
speech
noise
voice
generator
sub
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911312488.0A
Other languages
Chinese (zh)
Other versions
CN111081266A (en)
Inventor
刘刚
龚科
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
DMAI Guangzhou Co Ltd
Original Assignee
DMAI Guangzhou Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by DMAI Guangzhou Co Ltd filed Critical DMAI Guangzhou Co Ltd
Priority to CN201911312488.0A priority Critical patent/CN111081266B/en
Publication of CN111081266A publication Critical patent/CN111081266A/en
Application granted granted Critical
Publication of CN111081266B publication Critical patent/CN111081266B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 21/00 - Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/02 - Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L 21/0208 - Noise filtering

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The invention discloses a method and system for training a generative adversarial network and for speech enhancement. The training method comprises the following steps: inputting speech pairs, each consisting of a sub-speech of the generator's denoised speech and a sub-speech of the noisy speech, together with speech pairs, each consisting of a sub-speech of the clean speech corresponding to the noisy speech and a sub-speech of the noisy speech, into a local discriminator; inputting the speech pair consisting of the noisy speech and its corresponding clean speech, and the speech pair consisting of the generator's denoised speech and the noisy speech, into a global discriminator; and training the discriminators and the generator respectively. By training an adversarial network that comprises both a global discriminator and a local discriminator, the generator is guided to produce denoised speech from both the whole and the local parts, so that noise which is hard to eliminate in specific regions is removed in a targeted manner. A generator based on a feature pyramid network better extracts speech features and perceives noise, thereby eliminating noise more effectively.

Description

Training a generative adversarial network, and speech enhancement method and system
Technical Field
The invention relates to the technical field of speech recognition, and in particular to methods and systems for training a generative adversarial network and for speech enhancement.
Background
Speech Enhancement (SE) refers to removing the noise z from noisy speech y, thereby separating out the clean speech x, i.e., x = y - z. Removing noise from a mixed speech signal is one of the most challenging tasks in speech signal processing; conventional speech enhancement algorithms include spectral subtraction, subspace methods, and Wiener filtering. In recent years, speech enhancement methods based on deep learning have made great breakthroughs in the field, and in particular the application of the generative adversarial network (GAN) to speech enhancement has effectively improved the quality of denoised speech.
Since they were first proposed, generative adversarial networks have achieved good results in many challenging generation tasks such as image translation and super-resolution. A GAN contains two important components: a Generator (G) and a Discriminator (D). The generator maps samples z from a prior distribution Z to samples of the target distribution X; the discriminator D acts as a binary classifier that judges real samples as true and samples generated by the generator as false. The generator and the discriminator are trained adversarially: the generator tries to make its samples follow the real distribution as closely as possible so as to confuse the discriminator into judging them true, while the discriminator tries to separate real samples from generated ones. Through this continuing game the generated samples come very close to the real samples, until Nash equilibrium is reached and the discriminator cannot tell whether a sample is real or generated.
Existing GAN-based speech enhancement techniques basically take the framework used in image generation tasks and apply it directly to speech denoising, without exploiting the characteristics of speech for more effective noise reduction. Noisy speech is a highly non-stationary signal: some regions may be very noisy while others may have little or essentially no noise. The generator has no structure suited to perceiving this characteristic, and although the discriminator can guide the generator to produce speech that is as clean as possible, it does so only from a relatively coarse, global perspective; the generated speech is therefore likely to retain noise locally, and the speech enhancement effect is poor.
Disclosure of Invention
The invention therefore provides a method for training a generative adversarial network, a speech enhancement method, and corresponding systems, overcoming the poor speech enhancement quality of prior-art GAN-based speech enhancement techniques.
In a first aspect, an embodiment of the present invention provides a method for training a generative adversarial network, comprising the following steps: obtaining noisy speech y and the corresponding clean speech x to form a training set; inputting the noisy speech y into a generator to generate denoised speech x̂; intercepting k sub-speech segments of the same size at the same positions from the clean speech x, the noisy speech y, and the generator's denoised speech x̂; inputting the speech pairs (x̂_i, y_i), each consisting of a sub-speech of the generator's denoised speech and a sub-speech of the noisy speech, and the speech pairs (x_i, y_i), each consisting of a sub-speech of the noisy speech's corresponding clean speech and a sub-speech of the noisy speech, into a local discriminator; inputting the speech pair (x, y), consisting of the noisy speech and the corresponding clean speech, and the speech pair (x̂, y), consisting of the generator's denoised speech and the noisy speech, into a global discriminator; training the discriminators and the generator respectively; and obtaining a trained generative adversarial network according to a preset training end condition.
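Purely as an illustration of the data flow just described, the following PyTorch-style sketch shows how the four kinds of speech pairs could be fed to the two discriminators. The function and module names (relative_probs, local_C, global_C), the channel-concatenation encoding of a pair, and the reading of "relative probability" as sigmoid of a score difference (a relativistic-GAN-style formulation) are all assumptions for illustration, not taken from the patent.

    import torch

    def relative_probs(xs, ys, xhs, x, y, x_hat, local_C, global_C):
        """xs, ys, xhs: lists of k aligned sub-speech segments of the clean,
        noisy and denoised speech; x, y, x_hat: the full utterances, each of
        shape (batch, 1, T). local_C / global_C are critic networks that map
        a channel-concatenated speech pair to an unbounded score; the
        'relative probability' that one pair is true versus the other is
        realised here as sigmoid(C(pair_a) - C(pair_b)) (an assumption)."""
        p_local_real, p_local_fake = [], []
        for x_i, y_i, xh_i in zip(xs, ys, xhs):
            s_real = local_C(torch.cat([x_i, y_i], dim=1))   # pair (x_i, y_i)
            s_fake = local_C(torch.cat([xh_i, y_i], dim=1))  # pair (x̂_i, y_i)
            p_local_real.append(torch.sigmoid(s_real - s_fake))  # p(x_i, y_i)
            p_local_fake.append(torch.sigmoid(s_fake - s_real))  # p(x̂_i, y_i)
        g_real = global_C(torch.cat([x, y], dim=1))          # pair (x, y)
        g_fake = global_C(torch.cat([x_hat, y], dim=1))      # pair (x̂, y)
        p_global_real = torch.sigmoid(g_real - g_fake)       # p(x, y)
        p_global_fake = torch.sigmoid(g_fake - g_real)       # p(x̂, y)
        return p_local_real, p_local_fake, p_global_real, p_global_fake

Concatenating the two members of a pair along the channel axis is a common conditioning choice in conditional GANs; the patent does not prescribe this particular encoding.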
In one embodiment, the process of training the discriminators comprises: inputting the speech pairs (x̂_i, y_i), each consisting of a sub-speech of the generator's denoised speech and a sub-speech of the noisy speech, and the speech pairs (x_i, y_i), each consisting of a sub-speech of the noisy speech's corresponding clean speech and a sub-speech of the noisy speech, into the local discriminator to obtain the relative probability p(x_i, y_i) that each clean-speech sub-speech corresponding to the noisy speech is true; inputting the speech pair (x, y), consisting of the noisy speech and the corresponding clean speech, and the speech pair (x̂, y), consisting of the generator's denoised speech and the noisy speech, into the global discriminator to obtain the relative probability p(x, y) that the clean speech corresponding to the noisy speech is true; and fixing the parameters of the generator and calculating the discriminator loss to update the parameters of the discriminators.
In one embodiment, the discriminator loss is calculated by the following equations:
D_loss = localD_Loss + globalD_Loss,
localD_Loss = -log(min_i p(x_i, y_i)),
globalD_Loss = -log(p(x, y));
where globalD_Loss is the global discrimination loss, localD_Loss is the local discrimination loss, and min_i p(x_i, y_i) is the minimum of the relative probabilities that the k clean-speech sub-speech segments corresponding to the noisy speech are true.
In one embodiment, the process of training the generator includes: inputting the speech pairs (x̂_i, y_i), each consisting of a sub-speech of the generator's denoised speech and a sub-speech of the noisy speech, and the speech pairs (x_i, y_i), each consisting of a sub-speech of the noisy speech's corresponding clean speech and a sub-speech of the noisy speech, into the local discriminator to obtain the relative probability p(x̂_i, y_i) that each sub-speech of the generator's denoised speech is true; inputting the speech pair (x, y), consisting of the noisy speech and the corresponding clean speech, and the speech pair (x̂, y), consisting of the generator's denoised speech and the noisy speech, into the global discriminator to obtain the relative probability p(x̂, y) that the generator's denoised speech is true; and fixing the discriminator parameters and calculating the generation loss to update the generator parameters.
In one embodiment, the generation loss is calculated by the following formulas:
G_loss = localDG_Loss + globalDG_Loss + L_Loss,
localDG_Loss = -log(min_i p(x̂_i, y_i)),
globalDG_Loss = -log(p(x̂, y)),
L_Loss = ||x̂ - x||_1 + Σ_i ||x̂_i - x_i||_1;
where globalDG_Loss is the global adversarial loss of the discriminator, localDG_Loss is the local adversarial loss of the discriminator, and L_Loss is the L1 distance loss between the enhanced speech generated by the generator and the clean speech, taken at each granularity.
In a second aspect, an embodiment of the present invention provides a speech enhancement method, comprising: acquiring noisy speech to be enhanced; and inputting the noisy speech to be enhanced into the generator of a generative adversarial network obtained by the training method according to the first aspect of the embodiments of the present invention, so as to enhance the speech and generate clean speech.
In one embodiment, the generator uses a feature pyramid network as its backbone to obtain speech features of different scales.
In a third aspect, an embodiment of the present invention provides a system for training a generative adversarial network, comprising: a training set acquisition module, configured to obtain noisy speech y and the corresponding clean speech x to form a training set; a generator denoising module, configured to input the noisy speech into the generator to generate denoised speech x̂; a sub-speech acquisition module, configured to intercept k sub-speech segments of the same size at the same positions from the clean speech x, the noisy speech y, and the generator's denoised speech x̂; and a training module, configured to input the speech pairs (x̂_i, y_i), each consisting of a sub-speech of the denoised speech and a sub-speech of the noisy speech, and the speech pairs (x_i, y_i), each consisting of a sub-speech of the noisy speech's corresponding clean speech and a sub-speech of the noisy speech, into a local discriminator, to input the speech pair (x, y), consisting of the noisy speech and the corresponding clean speech, and the speech pair (x̂, y), consisting of the generator's denoised speech and the noisy speech, into a global discriminator, to train the discriminators and the generator respectively, and to obtain a trained generative adversarial network according to a preset training end condition.
In a fourth aspect, an embodiment of the present invention provides a speech enhancement system, comprising: a to-be-enhanced speech acquisition module, configured to acquire noisy speech to be enhanced; and a speech enhancement module, configured to input the noisy speech to be enhanced into the generator of a generative adversarial network obtained by the training method according to the first aspect of the embodiments of the present invention, so as to enhance the speech and generate clean speech.
In a fifth aspect, an embodiment of the present invention provides a computer device, including: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor, the instructions being executable by the at least one processor to cause the at least one processor to perform the method of the first or second aspect of the embodiments of the present invention.
In a sixth aspect, embodiments of the present invention provide a computer-readable storage medium storing computer instructions for causing a computer to perform the method according to the first or second aspect of the embodiments of the present invention.
The technical scheme of the invention has the following advantages:
1. The method and system for training a generative adversarial network provided by the embodiments of the invention exploit the non-stationarity of noisy speech and the local character of noise. The speech pairs (x̂_i, y_i), each consisting of a sub-speech of the generator's denoised speech and a sub-speech of the noisy speech, and the speech pairs (x_i, y_i), each consisting of a sub-speech of the noisy speech's corresponding clean speech and a sub-speech of the noisy speech, are input into a local discriminator; the speech pair (x, y), consisting of the noisy speech and the corresponding clean speech, and the speech pair (x̂, y), consisting of the generator's denoised speech and the noisy speech, are input into a global discriminator; and the discriminators and the generator are trained respectively. The trained adversarial network, through a dynamic granularity discriminator comprising the global discriminator and the local discriminator, guides the generator to produce denoised speech from both the whole and the local parts, thereby eliminating in a targeted manner noise that is hard to remove in specific regions.
2. The speech enhancement method and system provided by the embodiments of the invention take account of the non-stationary nature of noisy speech: the generator based on a feature pyramid network better extracts speech features and perceives noise, thereby eliminating noise more effectively.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts.
Fig. 1 is a flowchart of a specific example of a method for training a generative adversarial network according to an embodiment of the present invention;
FIG. 2 is a flowchart of a specific example of a speech enhancement method according to an embodiment of the present invention;
fig. 3 is a flowchart of a convolution operation performed by using a feature pyramid network as a backbone network according to an embodiment of the present invention;
FIG. 4 is a diagram illustrating a plurality of convolution operations provided by an embodiment of the present invention;
FIG. 5 is a block diagram of a specific example of a system for training a generative adversarial network according to an embodiment of the present invention;
FIG. 6 is a block diagram of a speech enhancement system according to an embodiment of the present invention;
fig. 7 is a block diagram of a specific example of a computer device according to an embodiment of the present invention.
Detailed Description
The technical solutions of the present invention will be described clearly and completely with reference to the accompanying drawings, and it should be understood that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In addition, the technical features involved in the different embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
Example 1
The method for training a generative adversarial network provided by the embodiment of the invention, as shown in fig. 1, comprises the following steps:
step S1: the noisy speech and its corresponding clean speech are obtained to form a training set.
In the embodiment of the present invention, clean speech x may be obtained first, and noise may be randomly added to it to obtain the noisy speech y; this is only by way of example and is not limiting.
Step S2: the noisy speech is input into a generator to generate denoised speech.
The embodiment of the invention inputs the noisy speech y into a preset generator to obtain the denoised speech x̂. The generator uses a feature pyramid network as its backbone and generates the denoised speech x̂ from speech features of different scales.
Step S3: sub-speech segments of the same size are intercepted at the same positions from the clean speech, the noisy speech, and the generator's denoised speech.
In this embodiment, the same interception operation is carried out at the same positions on the clean speech x, the noisy speech y, and the generator's denoised speech x̂, obtaining the sub-speech segments x_1, x_2, ..., x_k; y_1, y_2, ..., y_k; and x̂_1, x̂_2, ..., x̂_k, which correspond position by position. The value of k is set reasonably according to actual needs; in the embodiment of the invention, k is taken as 10, as in the sketch below.
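A minimal sketch of this interception step, assuming waveforms of shape (batch, 1, T) and randomly chosen but shared starting positions; the function name intercept_subspeech, the segment length seg_len, and the position-selection scheme are assumptions, since the patent fixes neither of them:

    import torch

    def intercept_subspeech(x, y, x_hat, k=10, seg_len=1024):
        """Cut k same-size sub-speech segments at the same positions from
        the clean speech x, noisy speech y and denoised speech x_hat,
        each of shape (batch, 1, T)."""
        T = x.shape[-1]
        # Shared starting positions so that x_i, y_i and x̂_i line up.
        starts = torch.randint(0, T - seg_len + 1, (k,)).tolist()
        xs = [x[..., s:s + seg_len] for s in starts]
        ys = [y[..., s:s + seg_len] for s in starts]
        xhs = [x_hat[..., s:s + seg_len] for s in starts]
        return xs, ys, xhs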
Step S4: the speech pairs (x̂_i, y_i), each consisting of a sub-speech of the generator's denoised speech and a sub-speech of the noisy speech, and the speech pairs (x_i, y_i), each consisting of a sub-speech of the noisy speech's corresponding clean speech and a sub-speech of the noisy speech, are input into a local discriminator; the speech pair (x, y), consisting of the noisy speech and the corresponding clean speech, and the speech pair (x̂, y), consisting of the generator's denoised speech and the noisy speech, are input into a global discriminator; the discriminators and the generator are trained respectively, and a trained generative adversarial network is obtained according to a preset training end condition.
In this embodiment, the preset training end condition may be that the loss value obtained in training falls below a preset threshold and/or that the number of training steps reaches a preset number, which is not limited herein.
The process of training the discriminator in the embodiment of the invention comprises the following steps:
1) inputting the speech pairs (x̂_i, y_i), each consisting of a sub-speech of the generator's denoised speech and a sub-speech of the noisy speech, and the speech pairs (x_i, y_i), each consisting of a sub-speech of the noisy speech's corresponding clean speech and a sub-speech of the noisy speech, into the local discriminator to obtain the relative probability p(x_i, y_i) that each clean-speech sub-speech corresponding to the noisy speech is true;
2) inputting the speech pair (x, y), consisting of the noisy speech and the corresponding clean speech, and the speech pair (x̂, y), consisting of the generator's denoised speech and the noisy speech, into the global discriminator to obtain the relative probability p(x, y) that the clean speech x corresponding to the noisy speech is true;
3) fixing the parameters of the generator and calculating the discriminator loss to update the parameters of the discriminators.
The discriminator loss in the embodiment of the invention is calculated by the following equations:
D_loss = localD_Loss + globalD_Loss,
localD_Loss = -log(min_i p(x_i, y_i)),
globalD_Loss = -log(p(x, y));
where globalD_Loss is the global discrimination loss, localD_Loss is the local discrimination loss, and min_i p(x_i, y_i) is the minimum of the relative probabilities that the k clean-speech sub-speech segments corresponding to the noisy speech are true.
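Assuming the discriminators output the relative probabilities described above, the discriminator loss could be computed as in the following sketch; the eps guard against log(0) and the batch-mean reduction are our additions, not part of the patent's formulas:

    import torch

    def discriminator_loss(p_local_real, p_global_real, eps=1e-8):
        """D_loss = localD_Loss + globalD_Loss, with
        localD_Loss  = -log(min_i p(x_i, y_i))  (worst local pair) and
        globalD_Loss = -log(p(x, y)).
        p_local_real: list of the k probabilities p(x_i, y_i);
        p_global_real: the probability p(x, y)."""
        p_min = torch.stack(list(p_local_real)).min()
        localD_loss = -torch.log(p_min + eps)
        globalD_loss = -torch.log(p_global_real + eps).mean()
        return localD_loss + globalD_loss

Taking the minimum over the k local pairs concentrates the penalty on the position where the discriminator is least confident, which is what allows training to target the hardest local region.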
In the embodiment of the present invention, the process of training the generator includes:
1) inputting the speech pairs (x̂_i, y_i), each consisting of a sub-speech of the generator's denoised speech and a sub-speech of the noisy speech, and the speech pairs (x_i, y_i), each consisting of a sub-speech of the noisy speech's corresponding clean speech and a sub-speech of the noisy speech, into the local discriminator to obtain the relative probability p(x̂_i, y_i) that each sub-speech of the generator's denoised speech is true;
2) inputting the speech pair (x, y), consisting of the noisy speech and the corresponding clean speech, and the speech pair (x̂, y), consisting of the generator's denoised speech and the noisy speech, into the global discriminator to obtain the relative probability p(x̂, y) that the generator's denoised speech is true;
3) fixing the discriminator parameters and calculating the generation loss to update the generator parameters.
The generation loss given in the embodiment of the present invention is calculated by the following formulas:
G_loss = localDG_Loss + globalDG_Loss + L_Loss,
localDG_Loss = -log(min_i p(x̂_i, y_i)),
globalDG_Loss = -log(p(x̂, y)),
L_Loss = ||x̂ - x||_1 + Σ_i ||x̂_i - x_i||_1;
where globalDG_Loss is the global adversarial loss of the discriminator, localDG_Loss is the local adversarial loss of the discriminator, and L_Loss is the L1 distance loss between the enhanced speech generated by the generator and the clean speech.
In the embodiment of the invention, it is desired that the speech generated by the generator has as much noise removed as possible at every granularity, that is, that the discriminator judges it to be true, i.e., that the discriminator's output is as close to 1 as possible. The L1 distance loss L_Loss is the L1 distance between the clean speech and the enhanced speech at each granularity. It should be noted that the distance may instead be a cosine distance, a Euclidean distance, or another similar distance, which is not limited herein.
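A corresponding sketch of the generation loss, under the same assumptions as the sketches above; reading L_Loss as the unweighted sum of the full-utterance and per-segment L1 terms is our interpretation of "the L1 distance at each granularity":

    import torch
    import torch.nn.functional as F

    def generator_loss(p_local_fake, p_global_fake, xs, xhs, x, x_hat, eps=1e-8):
        """G_loss = localDG_Loss + globalDG_Loss + L_Loss, with
        localDG_Loss  = -log(min_i p(x̂_i, y_i)),
        globalDG_Loss = -log(p(x̂, y)), and
        L_Loss the L1 distance between enhanced and clean speech at
        every granularity (full utterance plus the k segments)."""
        localDG = -torch.log(torch.stack(list(p_local_fake)).min() + eps)
        globalDG = -torch.log(p_global_fake + eps).mean()
        l_loss = F.l1_loss(x_hat, x)
        l_loss = l_loss + sum(F.l1_loss(xh, xc) for xh, xc in zip(xhs, xs))
        return localDG + globalDG + l_loss

Reusing the helper sketches above, one possible outer loop alternating the two updates until the preset end condition (the data loader, optimizers, critics, threshold, and step limit are all assumptions) is:

    for step, (x, y) in enumerate(loader):
        # Discriminator update: generator parameters fixed via detach().
        x_hat = G(y).detach()
        xs, ys, xhs = intercept_subspeech(x, y, x_hat)
        pl_r, _, pg_r, _ = relative_probs(xs, ys, xhs, x, y, x_hat,
                                          local_C, global_C)
        d_loss = discriminator_loss(pl_r, pg_r)
        opt_D.zero_grad(); d_loss.backward(); opt_D.step()

        # Generator update: discriminator parameters fixed (only opt_G steps).
        x_hat = G(y)
        xs, ys, xhs = intercept_subspeech(x, y, x_hat)
        _, pl_f, _, pg_f = relative_probs(xs, ys, xhs, x, y, x_hat,
                                          local_C, global_C)
        g_loss = generator_loss(pl_f, pg_f, xs, xhs, x, x_hat)
        opt_G.zero_grad(); g_loss.backward(); opt_G.step()

        if g_loss.item() < loss_threshold or step >= max_steps:
            break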
The method for training a generative adversarial network provided by the embodiment of the invention exploits the non-stationarity of noisy speech and the local character of noise: a dynamic granularity discriminator comprising a global discriminator and a local discriminator guides the generator to produce denoised speech from both the whole and the local parts, so that noise which is hard to eliminate in specific regions is removed in a targeted manner and the speech enhancement effect is better.
Example 2
An embodiment of the present invention provides a speech enhancement method, as shown in fig. 2, including the following steps:
Step S21: obtaining the noisy speech to be enhanced;
Step S22: inputting the noisy speech to be enhanced into the generator of the obtained generative adversarial network to enhance the speech and generate clean speech. The generator employed in this embodiment is the generator included in the generative adversarial network obtained by the training method of Embodiment 1.
In the embodiment of the present invention, the generator of the generative adversarial network uses a feature pyramid network as its backbone. As shown in fig. 3, a convolution with stride 1 is applied before the feature pyramid network. In the bottom-up path, the feature pyramid network contains 5 Dense Blocks, each of which produces features carrying information of a different semantic level. In the top-down path, 1x1 convolutions give the features of each scale the same number of channels; the four uppermost features are up-sampled to 1/4 of the length of the original input, spliced together, and convolved once to obtain a dense feature containing information of different semantic levels; this feature is then up-sampled twice and convolved once, combined with the output of the first Dense Block, to obtain the clean speech. This embodiment takes four speech features of different scales as an example, but is not limited thereto.
As shown in fig. 4, the Dense Block in this embodiment comprises a plurality of convolution operations, each of which takes the concatenated outputs of all preceding layers as input and feeds its own output to every following layer. Assuming the output of the l-th layer is X_l, and the outputs of layers l-1, l-2, ..., 0 (layer 0 being the input of the current Block) are X_{l-1}, X_{l-2}, ..., X_0, then:
X_l = H_l([X_{l-1}, X_{l-2}, ..., X_0])
where H_l denotes the convolution operation of the l-th layer and [·] denotes the splicing (concatenation) operation. To produce features of different scales, the embodiment of the invention embeds a convolution with stride 2 at the end of each Dense Block. After each convolution, a Parametric Rectified Linear Unit (PReLU) is used as the activation function.
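As an illustration of this architecture, the sketch below implements a 1-D Dense Block with a stride-2 convolution embedded at its end and PReLU activations, together with the bottom-up/top-down assembly described above. The class names, channel counts, kernel sizes, number of layers inside a block, and the linear interpolation mode are all assumptions, since the patent only fixes the overall structure:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class DenseBlock(nn.Module):
        """Each layer takes the concatenation of all previous outputs as
        input (X_l = H_l([X_{l-1}, ..., X_0])); the stride-2 convolution
        at the end halves the temporal length to create a new scale."""
        def __init__(self, in_ch, growth=16, layers=3):
            super().__init__()
            self.layers = nn.ModuleList()
            ch = in_ch
            for _ in range(layers):
                self.layers.append(nn.Sequential(
                    nn.Conv1d(ch, growth, kernel_size=3, padding=1),
                    nn.PReLU()))
                ch += growth
            self.down = nn.Sequential(  # stride-2 conv embedded at the end
                nn.Conv1d(ch, in_ch, kernel_size=3, stride=2, padding=1),
                nn.PReLU())

        def forward(self, x):
            feats = [x]
            for layer in self.layers:
                feats.append(layer(torch.cat(feats, dim=1)))
            return self.down(torch.cat(feats, dim=1))

    class FPNGenerator(nn.Module):
        """Bottom-up path of 5 Dense Blocks; top-down path of 1x1 convs
        that equalise channels, upsampling of the four uppermost features
        to 1/4 of the input length, splicing plus one convolution, then
        two upsamplings combined with the first block's output."""
        def __init__(self, ch=32):
            super().__init__()
            self.stem = nn.Sequential(
                nn.Conv1d(1, ch, 3, stride=1, padding=1), nn.PReLU())
            self.blocks = nn.ModuleList([DenseBlock(ch) for _ in range(5)])
            self.lateral = nn.ModuleList([nn.Conv1d(ch, ch, 1) for _ in range(4)])
            self.fuse = nn.Sequential(
                nn.Conv1d(4 * ch, ch, 3, padding=1), nn.PReLU())
            self.out = nn.Conv1d(2 * ch, 1, 3, padding=1)

        def forward(self, y):
            x = self.stem(y)
            feats = []
            for block in self.blocks:
                x = block(x)
                feats.append(x)           # scales 1/2, 1/4, ..., 1/32
            quarter = feats[1].shape[-1]  # length at 1/4 of the input
            up = [F.interpolate(lat(f), size=quarter, mode='linear',
                                align_corners=False)
                  for lat, f in zip(self.lateral, feats[1:])]
            dense = self.fuse(torch.cat(up, dim=1))
            # Two upsamplings back to full length, combining the first
            # Dense Block's output (at 1/2 resolution) along the way.
            dense = F.interpolate(dense, size=feats[0].shape[-1],
                                  mode='linear', align_corners=False)
            dense = torch.cat([dense, feats[0]], dim=1)
            dense = F.interpolate(dense, size=y.shape[-1],
                                  mode='linear', align_corners=False)
            return self.out(dense)

Each Dense Block halves the temporal length, so five blocks yield features at 1/2 down to 1/32 of the input; the four uppermost (1/4 to 1/32) form the top-down fusion, matching the four scales taken as an example above. One way to exercise the sketch: G = FPNGenerator(); x_hat = G(torch.randn(1, 1, 16384)).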
The speech enhancement method provided by the embodiment of the invention takes account of the non-stationary nature of noisy speech: the generator based on a feature pyramid network better extracts speech features and perceives noise, thereby eliminating noise more effectively. By exploiting the non-stationarity of noisy speech and the local character of noise, the dynamic granularity discriminator comprising the global discriminator and the local discriminator obtained in Embodiment 1 guides the generator to produce denoised speech from both the whole and the local parts, removing in a targeted manner noise that is hard to eliminate in specific regions.
Example 3
An embodiment of the present invention provides a system for training a generative adversarial network, as shown in fig. 5, comprising:
A training set acquisition module 1, configured to acquire noisy speech y and the corresponding clean speech x to form a training set; this module executes the method described in step S1 in Embodiment 1, and is not described herein again.
A generator denoising module 2, configured to input the noisy speech into the generator to generate denoised speech x̂; this module executes the method described in step S2 in Embodiment 1, and is not described herein again.
A sub-speech acquisition module 3, configured to intercept k sub-speech segments of the same size at the same positions from the clean speech x, the noisy speech y, and the generator's denoised speech x̂; this module executes the method described in step S3 in Embodiment 1, and is not described herein again.
A training module 4, configured to input the speech pairs (x̂_i, y_i), each consisting of a sub-speech of the denoised speech and a sub-speech of the noisy speech, and the speech pairs (x_i, y_i), each consisting of a sub-speech of the noisy speech's corresponding clean speech and a sub-speech of the noisy speech, into a local discriminator; to input the speech pair (x, y), consisting of the noisy speech and the corresponding clean speech, and the speech pair (x̂, y), consisting of the generator's denoised speech and the noisy speech, into a global discriminator; to train the discriminators and the generator respectively; and to obtain a trained generative adversarial network according to a preset training end condition. This module executes the method described in step S4 in Embodiment 1, and is not described herein again.
The system for training a generative adversarial network provided by the embodiment of the invention exploits the non-stationarity of noisy speech and the local character of noise: the dynamic granularity discriminator comprising a global discriminator and a local discriminator guides the generator to produce denoised speech from both the whole and the local parts, so that noise which is hard to eliminate in specific regions is removed in a targeted manner and the speech enhancement effect is better.
Example 4
An embodiment of the present invention provides a speech enhancement system, as shown in fig. 6, including:
A to-be-enhanced speech acquisition module 5, configured to acquire the noisy speech to be enhanced; this module executes the method described in step S21 in Embodiment 2, and is not described herein again.
A speech enhancement module 6, configured to input the noisy speech to be enhanced into the generator of the generative adversarial network obtained by the training method of Embodiment 1, so as to enhance the speech and generate clean speech; this module executes the method described in step S22 in Embodiment 2, and is not described herein again.
The speech enhancement system provided by the embodiment of the invention takes account of the non-stationary nature of noisy speech: the generator based on a feature pyramid network better extracts speech features and perceives noise, thereby eliminating noise more effectively. By exploiting the non-stationarity of noisy speech and the local character of noise, the dynamic granularity discriminator comprising the global discriminator and the local discriminator obtained in Embodiment 1 guides the generator to produce denoised speech from both the whole and the local parts, removing in a targeted manner noise that is hard to eliminate in specific regions.
Example 5
An embodiment of the present invention provides a computer device, as shown in fig. 7, comprising: at least one processor 401, such as a CPU (Central Processing Unit), at least one communication interface 403, a memory 404, and at least one communication bus 402, where the communication bus 402 is used to enable connective communication between these components. The communication interface 403 may include a display and a keyboard, and optionally may also include a standard wired interface and a standard wireless interface. The memory 404 may be a RAM (Random Access Memory) or a non-volatile memory, such as at least one disk memory; optionally, the memory 404 may be at least one storage device located remotely from the processor 401. The processor 401 may perform the method in Embodiment 1 or Embodiment 2. A set of program codes is stored in the memory 404, and the processor 401 calls the program codes stored in the memory 404 to execute the method in Embodiment 1 or Embodiment 2. The communication bus 402 may be a Peripheral Component Interconnect (PCI) bus or an Extended Industry Standard Architecture (EISA) bus, and may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one line is shown in FIG. 7, but this does not mean that there is only one bus or one type of bus.
The memory 404 may include a volatile memory, such as a random-access memory (RAM); it may also include a non-volatile memory, such as a flash memory, a hard disk drive (HDD), or a solid-state drive (SSD); the memory 404 may also comprise a combination of the above kinds of memory.
The processor 401 may be a Central Processing Unit (CPU), a Network Processor (NP), or a combination of a CPU and an NP.
The processor 401 may further include a hardware chip. The hardware chip may be an application-specific integrated circuit (ASIC), a Programmable Logic Device (PLD), or a combination thereof. The PLD may be a Complex Programmable Logic Device (CPLD), a field-programmable gate array (FPGA), a General Array Logic (GAL), or any combination thereof.
Optionally, the memory 404 is also used to store program instructions. The processor 401 may call program instructions to implement the method in embodiment 1 or embodiment 2 as the present application.
An embodiment of the present invention further provides a computer-readable storage medium storing computer-executable instructions which, when executed, can perform the method in Embodiment 1 or Embodiment 2. The storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM), a random access memory (RAM), a flash memory, a hard disk drive (HDD), a solid-state drive (SSD), or the like; the storage medium may also comprise a combination of the above kinds of memory.
It should be understood that the above examples are only for clarity of illustration and are not intended to limit the embodiments. Other variations and modifications will be apparent to persons skilled in the art in light of the above description; it is neither necessary nor possible to exhaust all embodiments here. Obvious variations or modifications may be made without departing from the spirit or scope of the invention.

Claims (9)

1. A method for training a generative adversarial network, characterized by comprising the following steps:
obtaining noisy speech y and the corresponding clean speech x to form a training set;
inputting the noisy speech y into a generator to generate denoised speech x̂;
intercepting k sub-speech segments of the same size at the same positions from the clean speech x, the noisy speech y, and the generator's denoised speech x̂;
inputting the speech pairs (x̂_i, y_i), each consisting of a sub-speech of the generator's denoised speech and a sub-speech of the noisy speech, and the speech pairs (x_i, y_i), each consisting of a sub-speech of the noisy speech's corresponding clean speech and a sub-speech of the noisy speech, into a local discriminator; inputting the speech pair (x, y), consisting of the noisy speech and the corresponding clean speech, and the speech pair (x̂, y), consisting of the generator's denoised speech and the noisy speech, into a global discriminator; training the discriminators and the generator respectively, and obtaining a trained generative adversarial network according to a preset training end condition;
where x̂_i denotes the i-th sub-speech of the denoised speech, y_i denotes the i-th sub-speech of the noisy speech, and x_i denotes the i-th sub-speech of the clean speech;
wherein the process of training the generator comprises:
inputting the speech pairs (x̂_i, y_i), each consisting of a sub-speech of the generator's denoised speech and a sub-speech of the noisy speech, and the speech pairs (x_i, y_i), each consisting of a sub-speech of the noisy speech's corresponding clean speech and a sub-speech of the noisy speech, into the local discriminator to obtain the relative probability p(x̂_i, y_i) that each sub-speech of the generator's denoised speech is true;
inputting the speech pair (x, y), consisting of the noisy speech and the corresponding clean speech, and the speech pair (x̂, y), consisting of the generator's denoised speech and the noisy speech, into the global discriminator to obtain the relative probability p(x̂, y) that the generator's denoised speech is true;
fixing the discriminator parameters and calculating the generation loss to update the parameters of the generator;
wherein the generation loss is calculated by the following formulas:
G_loss = localDG_Loss + globalDG_Loss + L_Loss,
localDG_Loss = -log(min_i p(x̂_i, y_i)),
globalDG_Loss = -log(p(x̂, y)),
L_Loss = ||x̂ - x||_1 + Σ_i ||x̂_i - x_i||_1;
where globalDG_Loss is the global adversarial loss of the discriminator, localDG_Loss is the local adversarial loss of the discriminator, and L_Loss is the L1 distance loss between the enhanced speech generated by the generator and the clean speech.
2. The method for training a generative adversarial network according to claim 1, wherein the process of training the discriminator comprises:
inputting the speech pairs (x̂_i, y_i), each consisting of a sub-speech of the generator's denoised speech and a sub-speech of the noisy speech, and the speech pairs (x_i, y_i), each consisting of a sub-speech of the noisy speech's corresponding clean speech and a sub-speech of the noisy speech, into the local discriminator to obtain the relative probability p(x_i, y_i) that each clean-speech sub-speech corresponding to the noisy speech is true;
inputting the speech pair (x, y), consisting of the noisy speech and the corresponding clean speech, and the speech pair (x̂, y), consisting of the generator's denoised speech and the noisy speech, into the global discriminator to obtain the relative probability p(x, y) that the clean speech corresponding to the noisy speech is true;
fixing the parameters of the generator and calculating the discriminator loss to update the parameters of the discriminator.
3. The method for training a generative adversarial network according to claim 2, wherein the discriminator loss is calculated by the following formulas:
D_loss = localD_Loss + globalD_Loss,
localD_Loss = -log(min_i p(x_i, y_i)),
globalD_Loss = -log(p(x, y));
where globalD_Loss is the global discrimination loss, localD_Loss is the local discrimination loss, and min_i p(x_i, y_i) is the minimum of the relative probabilities that the k clean-speech sub-speech segments corresponding to the noisy speech are true.
4. A method of speech enhancement, comprising:
acquiring noisy speech to be enhanced;
inputting the noisy speech to be enhanced into the generator of a generative adversarial network obtained by the method for training a generative adversarial network according to any one of claims 1 to 3, so as to enhance the speech and generate clean speech.
5. The speech enhancement method according to claim 4, wherein the generator uses a feature pyramid network as its backbone to obtain speech features of different scales.
6. A system for training a generative adversarial network, characterized by comprising:
a training set acquisition module, configured to obtain noisy speech y and the corresponding clean speech x to form a training set;
a generator denoising module, configured to input the noisy speech into the generator to generate denoised speech x̂;
a sub-speech acquisition module, configured to intercept k sub-speech segments of the same size at the same positions from the clean speech x, the noisy speech y, and the generator's denoised speech x̂;
a training module, configured to input the speech pairs (x̂_i, y_i), each consisting of a sub-speech of the denoised speech and a sub-speech of the noisy speech, and the speech pairs (x_i, y_i), each consisting of a sub-speech of the noisy speech's corresponding clean speech and a sub-speech of the noisy speech, into a local discriminator; to input the speech pair (x, y), consisting of the noisy speech and the corresponding clean speech, and the speech pair (x̂, y), consisting of the generator's denoised speech and the noisy speech, into a global discriminator; to train the discriminators and the generator respectively; and to obtain a trained generative adversarial network according to a preset training end condition;
where x̂_i denotes the i-th sub-speech of the denoised speech, y_i denotes the i-th sub-speech of the noisy speech, and x_i denotes the i-th sub-speech of the clean speech;
wherein the process of training the generator comprises:
inputting the speech pairs (x̂_i, y_i), each consisting of a sub-speech of the generator's denoised speech and a sub-speech of the noisy speech, and the speech pairs (x_i, y_i), each consisting of a sub-speech of the noisy speech's corresponding clean speech and a sub-speech of the noisy speech, into the local discriminator to obtain the relative probability p(x̂_i, y_i) that each sub-speech of the generator's denoised speech is true;
inputting the speech pair (x, y), consisting of the noisy speech and the corresponding clean speech, and the speech pair (x̂, y), consisting of the generator's denoised speech and the noisy speech, into the global discriminator to obtain the relative probability p(x̂, y) that the generator's denoised speech is true;
fixing the discriminator parameters and calculating the generation loss to update the parameters of the generator;
wherein the generation loss is calculated by the following formulas:
G_loss = localDG_Loss + globalDG_Loss + L_Loss,
localDG_Loss = -log(min_i p(x̂_i, y_i)),
globalDG_Loss = -log(p(x̂, y)),
L_Loss = ||x̂ - x||_1 + Σ_i ||x̂_i - x_i||_1;
where globalDG_Loss is the global adversarial loss of the discriminator, localDG_Loss is the local adversarial loss of the discriminator, and L_Loss is the L1 distance loss between the enhanced speech generated by the generator and the clean speech.
7. A speech enhancement system, comprising:
a to-be-enhanced speech acquisition module, configured to acquire noisy speech to be enhanced;
a speech enhancement module, configured to input the noisy speech to be enhanced into the generator of a generative adversarial network obtained by the method for training a generative adversarial network according to any one of claims 1 to 3, so as to enhance the speech and generate clean speech.
8. A computer device, comprising: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to cause the at least one processor to perform the method of any of claims 1-3 or 4-5.
9. A computer-readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any of claims 1-3 or 4-5.
CN201911312488.0A 2019-12-18 2019-12-18 Training a generative adversarial network, and speech enhancement method and system Active CN111081266B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911312488.0A CN111081266B (en) 2019-12-18 2019-12-18 Training a generative adversarial network, and speech enhancement method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911312488.0A CN111081266B (en) 2019-12-18 2019-12-18 Training a generative adversarial network, and speech enhancement method and system

Publications (2)

Publication Number Publication Date
CN111081266A CN111081266A (en) 2020-04-28
CN111081266B true CN111081266B (en) 2022-08-09

Family

ID=70315828

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911312488.0A Active CN111081266B (en) Training a generative adversarial network, and speech enhancement method and system

Country Status (1)

Country Link
CN (1) CN111081266B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111833893A (en) * 2020-06-16 2020-10-27 杭州云嘉云计算有限公司 Speech enhancement method based on artificial intelligence
CN112164008B (en) * 2020-09-29 2024-02-23 中国科学院深圳先进技术研究院 Training method of image data enhancement network, training device, medium and equipment thereof
CN112444810B (en) * 2020-10-27 2022-07-01 电子科技大学 Radar air multi-target super-resolution method
CN113096673B (en) * 2021-03-30 2022-09-30 山东省计算中心(国家超级计算济南中心) Voice processing method and system based on generation countermeasure network
CN113555028B (en) * 2021-07-19 2024-08-02 首约科技(北京)有限公司 Processing method for noise reduction of Internet of vehicles voice

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10971142B2 (en) * 2017-10-27 2021-04-06 Baidu Usa Llc Systems and methods for robust speech recognition using generative adversarial networks
CN108133702A (en) * 2017-12-20 2018-06-08 重庆邮电大学 A kind of deep neural network speech enhancement model based on MEE Optimality Criterias
CN109147810B (en) * 2018-09-30 2019-11-26 百度在线网络技术(北京)有限公司 Establish the method, apparatus, equipment and computer storage medium of speech enhancement network
CN109256144B (en) * 2018-11-20 2022-09-06 中国科学技术大学 Speech enhancement method based on ensemble learning and noise perception training
CN110085218A (en) * 2019-03-26 2019-08-02 天津大学 A kind of audio scene recognition method based on feature pyramid network
CN110047502A (en) * 2019-04-18 2019-07-23 广州九四智能科技有限公司 The recognition methods of hierarchical voice de-noising and system under noise circumstance
CN110428849B (en) * 2019-07-30 2021-10-08 珠海亿智电子科技有限公司 Voice enhancement method based on generation countermeasure network
CN110390950B (en) * 2019-08-17 2021-04-09 浙江树人学院(浙江树人大学) End-to-end voice enhancement method based on generation countermeasure network

Also Published As

Publication number Publication date
CN111081266A (en) 2020-04-28


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant