CN111785288B - Voice enhancement method, device, equipment and storage medium


Info

Publication number: CN111785288B
Application number: CN202010615254.XA
Authority: CN (China)
Prior art keywords: voice, speech, enhanced, performance score, generator
Legal status: Active (granted)
Other languages: Chinese (zh)
Other versions: CN111785288A
Inventors: 邓承韵, 宋辉, 沙永涛, 张毅
Current Assignee: Beijing Didi Infinity Technology and Development Co Ltd
Original Assignee: Beijing Didi Infinity Technology and Development Co Ltd
Application filed by Beijing Didi Infinity Technology and Development Co Ltd
Publications: CN111785288A (application), CN111785288B (grant)

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/044: Recurrent networks, e.g. Hopfield networks
    • G06N 3/045: Combinations of networks
    • G06N 3/08: Learning methods
    • G06N 3/084: Backpropagation, e.g. using gradient descent
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 21/00: Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/02: Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L 25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00
    • G10L 25/27: Speech or voice analysis techniques characterised by the analysis technique
    • G10L 25/30: Speech or voice analysis techniques using neural networks

Abstract

Embodiments of the present disclosure provide a voice enhancement method, apparatus, device, and storage medium. The method comprises the following steps: acquiring a target voice; determining the scene type of the scene where the target voice is located; selecting a voice enhancement model corresponding to the scene type from preset voice enhancement models; and enhancing the target voice through the voice enhancement model corresponding to the scene type. The method of the embodiments of the disclosure improves the flexibility of voice enhancement, broadens the scenes in which voice enhancement can be applied, and at the same time ensures the voice enhancement effect in each scene.

Description

Voice enhancement method, device, equipment and storage medium
Technical Field
Embodiments of the present disclosure relate to the field of speech processing, and in particular, to a method, an apparatus, a device, and a storage medium for speech enhancement.
Background
Speech enhancement is the process of performing noise reduction on a speech signal to improve its quality.
In general, on the hardware side, the speech enhancement effect can be improved by collecting speech from different directions with multiple microphones; on the software side, the quality of the speech signal can be improved with deep learning techniques.
However, the above approaches focus mainly on improving the degree of noise removal, and do not fully consider the actual scene in which speech enhancement is applied.
Disclosure of Invention
Embodiments of the present disclosure provide a method, an apparatus, a device, and a storage medium for speech enhancement, so as to solve the problem that existing speech enhancement methods do not fully consider the actual scene of speech enhancement, which results in a poor speech enhancement effect.
In a first aspect, an embodiment of the present disclosure provides a speech enhancement method, including:
acquiring a target voice;
determining the scene type of the scene where the target voice is located;
selecting a voice enhancement model corresponding to the scene type from preset voice enhancement models;
and enhancing the target voice through a voice enhancement model corresponding to the scene type.
In a second aspect, an embodiment of the present disclosure provides a speech enhancement apparatus, including:
the acquisition module is used for acquiring target voice;
the determining module is used for determining the scene type of the scene where the target voice is located;
the selection module is used for selecting a voice enhancement model corresponding to the scene type from preset voice enhancement models;
and the enhancement module is used for enhancing the target voice through a voice enhancement model corresponding to the scene type.
In a third aspect, an embodiment of the present disclosure provides an electronic device, including:
a memory and a processor;
the memory is configured to store program instructions;
the processor is configured to invoke program instructions in the memory to perform the method according to the first aspect.
In a fourth aspect, embodiments of the present disclosure provide a computer-readable storage medium having stored thereon a computer program which, when executed, implements the method as described in the first aspect above.
In a fifth aspect, embodiments of the present disclosure provide a program product comprising a computer program which, when executed by a processor, implements the method according to the first aspect.
The voice enhancement method, the voice enhancement device, the voice enhancement equipment and the storage medium provided by the embodiment of the disclosure determine the scene type of the scene where the target voice is located, determine the voice enhancement model corresponding to the scene type, and perform voice enhancement on the target voice through the voice enhancement model corresponding to the scene type. Therefore, the target voice is subjected to voice enhancement in a targeted manner according to the scene where the target voice is located, so that the requirements of different scenes on the voice enhancement are met, the flexibility of the voice enhancement is improved, the applicable scenes of the voice enhancement are wider, and the voice enhancement effect under different scenes is ensured.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure.
Fig. 1 is a schematic diagram of a network architecture according to an embodiment of the present disclosure;
fig. 2 is a flowchart illustrating a speech enhancement method according to an embodiment of the disclosure;
FIG. 3 is a flowchart illustrating a speech enhancement model training process according to an embodiment of the present disclosure;
FIG. 4 is a flow chart illustrating a training process of a speech enhancement model according to an embodiment of the present disclosure;
fig. 5 is a diagram illustrating the structure of a generative adversarial network according to an embodiment of the present disclosure;
fig. 6 is a schematic structural diagram of a speech enhancement apparatus according to an embodiment of the present disclosure;
FIG. 7 is a schematic structural diagram of a training apparatus for a speech enhancement model according to an embodiment of the present disclosure;
fig. 8 is a schematic structural diagram of a server according to an embodiment of the present disclosure;
fig. 9 is a block diagram of a speech enhancement apparatus according to an embodiment of the present disclosure.
With the foregoing drawings in mind, certain embodiments of the disclosure have been shown and described in more detail below. These drawings and written description are not intended to limit the scope of the disclosed concepts in any way, but rather to illustrate the concepts of the disclosure to those skilled in the art by reference to specific embodiments.
Various possible embodiments of the present disclosure and technical advantages thereof will be described in detail below.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
First, terms related to embodiments of the present disclosure are explained:
generate a generic countermeasure network (GAN): a deep learning model, comprising a generator and a discriminator, wherein the generator is also called a generation model (Generative model), and the discriminator is also called a discrimination model (discrimination model);
convolutional Neural Networks (CNN), Convolutional Neural Networks: a feedforward neural network including convolution calculations and having a depth structure;
convolutional Recurrent Neural Network (CRNN): the convolutional Neural Network is obtained by combining a Recurrent Neural Network (RNN) and a convolutional Neural Network.
Magnitude spectrum and phase spectrum of speech signal: by performing fourier transform on the voice signal, a magnitude spectrum and a phase spectrum of the voice signal can be obtained, wherein the magnitude spectrum is used for representing the change of the magnitude of the voice signal along with the change of the signal frequency, and the phase spectrum is used for representing the change of the phase of the voice signal along with the change of the signal frequency.
Time-frequency mask (T-F mask, time-frequency mask): the method is also called time-frequency masking, and the time-frequency mask is acted on the amplitude spectrum of the voice signal to mask part of signals in the voice signal, so that the proper time-frequency mask can be generated to act on the voice signal to reduce the noise of the voice signal.
Speech enhancement may be used in different speech systems and speech devices; for example, speech enhancement may be used in speech recognition systems as well as in hearing aids. Different speech systems and devices place different requirements on the speech enhancement effect. In general, the requirement of a speech recognition system for speech enhancement is that the integrity of the speech is kept as much as possible while noise is removed and distortion is avoided; the requirement of a hearing aid for speech enhancement is that the subjective quality of the speech, which includes its clarity and intelligibility, is kept as much as possible while noise is removed.
Generally, when speech enhancement is performed, emphasis is placed on improving the degree of denoising, and the requirement of the application scene of speech enhancement on the speech enhancement effect is not fully considered.
According to the speech enhancement method, apparatus, device, and storage medium provided by the embodiments of the present disclosure, when the target speech is obtained, the scene type of the scene where the target speech is located is determined, and the target speech is enhanced through the speech enhancement model corresponding to that scene type. In this way, the target speech is enhanced in a targeted manner according to its scene, the requirements of different scenes on speech enhancement are fully considered, the flexibility of speech enhancement is improved, the applicable scenes of speech enhancement are broadened, and the speech enhancement effect in different scenes is ensured.
The method for enhancing speech provided by the embodiment of the present disclosure may be applied to the network architecture diagram shown in fig. 1. As shown in fig. 1, the network architecture includes: the terminal device 101 and the server 102, and network communication is established between the terminal device 101 and the server 102. The server 102 may receive the target speech sent by the terminal device 101, and enhance the target speech through a pre-trained speech enhancement model. Of course, the trained speech enhancement model may also be set in the terminal device 101, and the terminal device 101 enhances the target speech through the speech enhancement model.
The terminal device 101 may be a computer, a tablet computer, a smart phone, or other terminal devices, and the server 102 may be an individual server or a server group.
The following describes technical solutions of the embodiments of the present disclosure and how to solve the above technical problems with specific embodiments. The following several specific embodiments may be combined with each other, and details of the same or similar concepts or processes may not be repeated in some embodiments. Embodiments of the present disclosure will be described below with reference to the accompanying drawings.
Fig. 2 is a flowchart illustrating a speech enhancement method according to an embodiment of the disclosure. As shown in fig. 2, the method includes:
s201, obtaining target voice.
Wherein the number of the target voices is one or more.
Specifically, a plurality of voices can be collected and stored in advance, and a voice library including the plurality of voices is generated. When a voice enhancement request of a user is received, voice which is not subjected to voice enhancement can be obtained in a voice library, and the voice which is not subjected to enhancement is determined as target voice. Alternatively, a target voice transmitted by the user may be received to perform voice processing on the target voice specified by the user.
Specifically, the target voice may also be collected in real time, for example, in a case where a voice enhancement request of the user is received, the microphone is turned on, and the target voice is collected in real time.
In one possible embodiment, the target voice may be a character voice, for example, a voice of a character talking, a voice of a character singing. The target voice can also be animal voice, and can also be voice sent by objects, such as light music formed by instrumental sounds, so as to perform voice enhancement on different types of target voices.
S202, determining the scene type of the scene where the target voice is located.
Specifically, a plurality of scene types may be set in advance. When the target voice is obtained or after the target voice is obtained, the scene information of the target voice can be obtained, the scene information of the target voice is analyzed, and the scene type of the scene where the target voice is located is determined. In the process of analyzing the scene information of the target voice, preset keywords can be extracted from the scene information, and the scene type corresponding to the keywords in the scene information of the target voice is determined according to the preset corresponding relation between the keywords and the scene type.
As an example, keywords "make a call", "open a meeting", "sing" and the like may be preset, where a scene type corresponding to the keyword "make a call" is a call scene, a scene type corresponding to the keyword "open a meeting" is an open meeting scene, and a scene type corresponding to the keyword "sing" is a music scene.
Specifically, when the scene information of the target voice is acquired, the scene information of the target voice input by the user may be acquired, for example, when the target voice is a recording during a call, the user may input the scene information "call", and when the target voice is a recording during a meeting, the user may input the scene information "company meeting". And sending a plurality of preset scene types to the user, acquiring the scene type selected by the user and determining the scene type selected by the user as the scene type of the scene where the target voice is located.
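By way of illustration only, the keyword lookup described above could be sketched in Python as follows; the keyword table, default scene type, and function name are assumptions for illustration and not part of the claimed method:

```python
# Hypothetical keyword table; the actual keywords and scene types are
# configured according to the embodiment described above.
KEYWORD_TO_SCENE = {
    "call": "call scene",
    "meeting": "meeting scene",
    "sing": "music scene",
}

def resolve_scene_type(scene_info: str, default: str = "human ear listening scene") -> str:
    """Return the scene type whose keyword appears in the scene information."""
    text = scene_info.lower()
    for keyword, scene_type in KEYWORD_TO_SCENE.items():
        if keyword in text:
            return scene_type
    return default
```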
In a possible implementation manner, the scene type of the scene where the target voice is located may also be determined by performing preliminary recognition on the content of the target voice, so as to improve the accuracy of the scene type of the scene where the target voice is located, for example, when a word related to making a call appears in the target voice, the scene type of the scene where the target voice is located may be considered as a call scene.
In a possible implementation, the scene type may also be determined according to the application program that collects the target voice, for example, if the application program that collects the target voice is a taxi-taking application program, the scene type of the scene in which the target voice is located may be considered as a taxi-taking scene or a taxi scene, and if the application program that collects the target voice is a dictionary application program, the scene may be considered as a dictionary query scene. Therefore, the scene type of the scene where the target voice is located is associated with the actual service of the target voice, so that the voice enhancement can better meet the requirements of different application scenes.
In one possible embodiment, the preset plurality of scene types include a human ear listening scene and a machine recognition scene to ensure the reasonableness of the scene type setting. In the process of carrying out voice enhancement on the target voice in the human ear listening scene, the method focuses more on improving the voice subjective quality of the target voice while removing the noise of the target voice. In the process of carrying out voice enhancement on the target voice in the machine recognition scene, the method focuses more on removing the noise of the target voice, and simultaneously ensures the integrity of the voice and avoids the distortion of the voice. The human ear listening scene refers to a scene in which target voice after voice enhancement is used for playing, and the communication scene, the meeting scene and the taxi taking scene can be regarded as human ear listening scenes; the machine recognition scene refers to a scene in which target voice after voice enhancement is used for converting into corresponding text content, and the dictionary query scene can be regarded as the machine recognition scene.
S203, selecting a voice enhancement model corresponding to the scene type from the preset voice enhancement models.
Specifically, the speech enhancement models respectively corresponding to the scene types are preset, and the speech enhancement models can be pre-trained deep learning models. After the scene type of the scene where the target voice is located is obtained, a voice enhancement model corresponding to the scene type of the scene where the target voice is located can be selected from the voice enhancement models.
Specifically, during training, a corresponding performance index can be set for the speech enhancement model of each scene type: in the human ear listening scene, the subjective quality of speech can be used as the performance index, and in the machine recognition scene, the integrity of speech can be used as the performance index. Each speech enhancement model is then trained to improve its performance index, so that a speech enhancement model corresponding to each scene type is obtained.
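A minimal sketch of this selection step follows, assuming each per-scene model has already been trained and serialized with PyTorch; the paths and names are hypothetical:

```python
import torch

# Illustrative registry: one pre-trained speech enhancement model per scene type,
# each trained with the performance index of its scene (paths are hypothetical).
SCENE_MODEL_PATHS = {
    "human ear listening scene": "models/enhancer_subjective_quality.pt",
    "machine recognition scene": "models/enhancer_speech_recognition.pt",
}

def select_enhancement_model(scene_type: str) -> torch.nn.Module:
    """Select and load the speech enhancement model corresponding to the scene type."""
    model = torch.load(SCENE_MODEL_PATHS[scene_type], map_location="cpu")
    model.eval()
    return model
```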
And S204, enhancing the target voice through the voice enhancement model corresponding to the scene type.
Specifically, the target voice is input into a voice enhancement model corresponding to the scene type of the scene where the target voice is located, so as to obtain the enhanced target voice.
In one possible embodiment, the speech enhancement model is the generator of a pre-trained generative adversarial network, and the generator is a convolutional recurrent neural network. Generally, when the speech enhancement model is only a convolutional recurrent neural network, noise can be removed effectively, but the network easily overfits, which leads to excessive denoising and loss of speech components; when the speech enhancement model is only a generative adversarial network, the risk of overfitting is reduced, but the denoising capability is ordinary. Therefore, the convolutional recurrent neural network is used as the generator of the generative adversarial network, combining the denoising capability of the convolutional recurrent neural network with the overfitting resistance of the generative adversarial network, and the generative adversarial network is trained under the guidance of performance indexes to obtain the speech enhancement models for the various scene types and improve their speech enhancement effect.
In one possible embodiment, when the speech enhancement model is the generator of a generative adversarial network and the generator is a convolutional recurrent neural network, the target speech is enhanced by the speech enhancement model as follows: first, the magnitude spectrum and the phase spectrum of the target speech are extracted, and the magnitude spectrum of the target speech is input into the speech enhancement model to obtain a time-frequency mask for enhancing the target speech; second, the magnitude spectrum of the target speech is enhanced through the time-frequency mask to obtain the enhanced magnitude spectrum of the target speech; finally, the enhanced target speech is obtained from the enhanced magnitude spectrum and the phase spectrum of the target speech. In this way, the speech enhancement effect on the target speech is effectively improved through the convolutional recurrent neural network.
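By way of illustration only, the mask-based enhancement just described can be sketched as follows, assuming a generator that maps a magnitude spectrogram of shape (batch, 1, frequencies, frames) to a mask of the same shape; the FFT size and hop length are illustrative:

```python
import torch

N_FFT, HOP = 512, 128                      # illustrative STFT parameters
WINDOW = torch.hann_window(N_FFT)

def enhance(target_speech: torch.Tensor, generator: torch.nn.Module) -> torch.Tensor:
    """Enhance a 1-D waveform with a generator that outputs a time-frequency mask."""
    spec = torch.stft(target_speech, N_FFT, hop_length=HOP,
                      window=WINDOW, return_complex=True)
    magnitude, phase = spec.abs(), spec.angle()        # magnitude and phase spectra
    with torch.no_grad():
        mask = generator(magnitude[None, None])[0, 0]  # time-frequency mask in [0, 1]
    enhanced_mag = mask * magnitude                    # enhanced magnitude spectrum
    enhanced_spec = torch.polar(enhanced_mag, phase)   # recombine with the original phase
    return torch.istft(enhanced_spec, N_FFT, hop_length=HOP, window=WINDOW)
```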
In one possible implementation, the discriminator of the generative adversarial network is a convolutional neural network, so that the convolutional neural network assists the training of the generator and thereby improves the speech enhancement effect of the speech enhancement model.
According to the voice enhancement method provided by the embodiment of the disclosure, the voice enhancement is performed on the target voice in a targeted manner according to the scene where the target voice is located, so as to meet the requirements of different scenes on the voice enhancement, improve the flexibility of the voice enhancement, enable the applicable scenes of the voice enhancement to be wider, and simultaneously ensure the voice enhancement effect under different scenes.
Fig. 3 is a schematic flowchart of the training process of a speech enhancement model according to an embodiment of the present disclosure, where the speech enhancement model is the generator of a generative adversarial network, and the generator is a convolutional recurrent neural network. As shown in fig. 3, the method includes:
s301, obtaining training voice.
The training speech includes noisy speech and the original speech corresponding to the noisy speech, where the original speech is clean speech. When collecting training speech, a plurality of original speech samples can be collected, and different noises can be added to each original speech to obtain the noisy speech corresponding to it.
And S302, training the generative adversarial network multiple times according to the training speech and a preset performance index.
Specifically, the noisy speech in the training speech is used as the input of the generator in the generative adversarial network, and the generator enhances the noisy speech to obtain the enhanced noisy speech. Performance evaluation is then performed on the enhanced noisy speech according to the performance index to obtain its performance score; for example, when the performance index is subjective speech quality, the enhanced noisy speech is evaluated in terms of subjective speech quality. The performance score of the enhanced noisy speech is also predicted by the discriminator in the generative adversarial network to obtain its predicted performance score.
Specifically, the predicted performance score and the performance score of the enhanced noisy speech together reflect the prediction error of the discriminator, so the model parameters of the discriminator can be adjusted according to these two scores, which completes one training round of the discriminator. According to the predicted performance score of the enhanced noisy speech and a preset target performance score, the model parameters of the generator can be adjusted, which completes one training round of the generator.
Specifically, one training round of the discriminator followed by one training round of the generator constitutes one training round of the generative adversarial network. This training round is repeated multiple times to improve the speech enhancement effect of the generator. After training of the generative adversarial network is finished, the generator in the trained network is the speech enhancement model used for speech enhancement.
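By way of illustration, the alternation described above can be outlined as follows; `enhance` refers to the earlier sketch, `metric_fn` stands for the performance evaluation function, and `discriminator_step` and `generator_step` are placeholders for the parameter-adjustment steps sketched after the fig. 4 description below (where they operate on magnitude spectra and take an optimizer). All names are illustrative:

```python
def train_gan(generator, discriminator, training_pairs, metric_fn, rounds=50):
    """Outline of repeated training: each inner iteration runs one training
    round of the discriminator and then one of the generator."""
    for _ in range(rounds):
        for noisy, clean in training_pairs:            # (noisy speech, original speech)
            enhanced = enhance(noisy, generator)        # enhance the noisy speech with the generator
            score = metric_fn(enhanced, clean)          # performance score of the enhanced speech
            # Adjust the discriminator using the score and its own predictions,
            # then adjust the generator using the discriminator's prediction
            # (concrete sketches of both steps follow the fig. 4 description below).
            discriminator_step(discriminator, enhanced, clean, score)
            generator_step(generator, discriminator, noisy, clean)
    return generator   # the trained generator is the speech enhancement model
```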
According to the training process of the speech enhancement model provided by the embodiment of the disclosure, the generative adversarial network is trained multiple times according to the training data and the performance index, which improves the speech enhancement effect of the speech enhancement model; moreover, corresponding performance indexes can be designed for different scene types, so that speech enhancement models for different scene types can be obtained through training.
Fig. 4 is a flowchart illustrating a single training round of a speech enhancement model according to an embodiment of the present disclosure, where the speech enhancement model is the generator of a generative adversarial network and the generator is a convolutional recurrent neural network. As shown in fig. 4, the method includes:
s401, obtaining training voice.
The training speech includes noisy speech and the original speech corresponding to the noisy speech, where the original speech is clean speech. When collecting training speech, a plurality of original speech samples can be collected, and different noises can be added to each original speech to obtain the noisy speech corresponding to it.
And S402, enhancing the noisy voice through the generator.
Specifically, the magnitude spectrum and the phase spectrum of the noisy speech can be extracted by performing a Fourier transform on the noisy speech, and the magnitude spectrum of the noisy speech is used as the input of the generator to obtain the output of the generator. The output of the generator is a time-frequency mask, and the magnitude spectrum of the noisy speech is enhanced through this time-frequency mask to obtain an enhanced magnitude spectrum. An inverse Fourier transform is then performed on the phase spectrum of the noisy speech and the enhanced magnitude spectrum to obtain the enhanced noisy speech.
As an example, the time-frequency Mask may be an Ideal Binary Mask (IBM), an Ideal Ratio Mask (IRM), or a Phase Sensitive Mask (PSM).
The enhanced magnitude spectrum can be obtained by multiplying the time-frequency mask and the magnitude spectrum of the noisy speech.
In a feasible implementation manner, the generator sequentially includes an encoding structure, a loop structure, and a decoding structure. The network layers in the encoding structure are convolutional layers, the loop structure includes one or more bidirectional long short-term memory (BiLSTM) recurrent neural networks, the network layers in the decoding structure are deconvolution layers, and the decoding structure is the reverse process of the encoding structure. The encoding structure is used to extract the self-correlation of the speech signal at different dimensions and scales, the loop structure is used to extract the long-term and short-term temporal information of the speech signal, and the decoding structure combines the information extracted by the encoding structure and the loop structure to generate a time-frequency mask. A generator with this structure can effectively extract the information of the speech signal, reduce the risk of overfitting, and improve the speech enhancement effect.
Further, skip connections are established between the encoding structure and the decoding structure to improve the speech enhancement effect of the generator. Skip connections are used in deep neural networks to address gradient explosion and gradient vanishing during training and to improve the training effect of the deep neural network.
Here, establishing skip connections between the encoding structure and the decoding structure means establishing a skip connection between each convolutional layer in the encoding structure and the corresponding deconvolution layer in the decoding structure, so that the output of the convolutional layer is fed to the deconvolution layer through the skip connection. Since the decoding structure is the reverse process of the encoding structure, a skip connection is established between the first convolutional layer in the encoding structure and the last deconvolution layer in the decoding structure, a skip connection is established between the second convolutional layer and the second-to-last deconvolution layer, and so on.
In one possible embodiment, in the encoding structure and the decoding structure, Batch Normalization (BN) processing is performed on the output of each convolutional layer and the output of each deconvolution layer, so as to improve the data processing effect of the generator, and further improve the voice enhancement effect.
Further, in the generator, the activation function of the hidden layers may be the Exponential Linear Unit (ELU), and the activation function of the output layer may be a nonlinear activation function, so as to improve the data processing effect of the generator and further improve the speech enhancement effect. The input layer of the generator is the first convolutional layer in the encoding structure, and the output layer of the generator is the last deconvolution layer in the decoding structure.
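A compact PyTorch sketch of a generator with this encoder-BiLSTM-decoder layout follows; the layer counts, channel sizes, and the assumption of a 512-point FFT (257 frequency bins) are illustrative rather than the exact configuration of the embodiment:

```python
import torch
import torch.nn as nn

class CRNNGenerator(nn.Module):
    """Conv encoder -> BiLSTM -> deconv decoder with skip connections; outputs a T-F mask."""

    def __init__(self):
        super().__init__()
        chans = (1, 16, 32, 64)                                  # illustrative channel sizes
        self.encoder = nn.ModuleList(
            nn.Sequential(nn.Conv2d(ci, co, 3, stride=(2, 1), padding=1),
                          nn.BatchNorm2d(co), nn.ELU())
            for ci, co in zip(chans[:-1], chans[1:]))
        # After three stride-2 steps along frequency, 257 bins become 33.
        self.rnn = nn.LSTM(64 * 33, (64 * 33) // 2, num_layers=2,
                           bidirectional=True, batch_first=True)
        dec = []
        for ci, co in zip(chans[:0:-1], chans[-2::-1]):          # (64,32), (32,16), (16,1)
            block = [nn.ConvTranspose2d(2 * ci, co, 3, stride=(2, 1), padding=1)]
            if co != 1:                                          # hidden layers: BN + ELU
                block += [nn.BatchNorm2d(co), nn.ELU()]
            dec.append(nn.Sequential(*block))
        self.decoder = nn.ModuleList(dec)

    def forward(self, mag):                          # mag: (batch, 1, 257, frames)
        skips, x = [], mag
        for layer in self.encoder:
            x = layer(x)
            skips.append(x)
        b, c, f, t = x.shape
        r, _ = self.rnn(x.permute(0, 3, 1, 2).reshape(b, t, c * f))  # BiLSTM over time
        x = r.reshape(b, t, c, f).permute(0, 2, 3, 1)
        for layer, skip in zip(self.decoder, reversed(skips)):
            x = layer(torch.cat([x, skip], dim=1))   # skip connection from the encoder
        return torch.sigmoid(x)                      # time-frequency mask in [0, 1]
```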
And S403, performing performance evaluation on the enhanced voice with noise through a preset performance evaluation function corresponding to the performance index to obtain a performance score of the enhanced voice with noise.
And the preset performance evaluation function corresponding to the performance index is used for scoring the performance of the enhanced noisy speech in the aspect of the performance index. Therefore, the performance evaluation function may be specifically designed according to the performance index, and is not limited herein.
In one possible embodiment, the performance index is a speech recognition index or a subjective auditory index. If the performance index is a speech recognition index, the trained speech enhancement model corresponds to the machine recognition scene and is suitable for enhancing target speech in the machine recognition scene. If the performance index is a subjective auditory index, the trained speech enhancement model corresponds to the human ear listening scene and is suitable for enhancing target speech in the human ear listening scene. Therefore, speech enhancement models suitable for different scenes can be obtained by training with different performance indexes.
In one possible implementation, the speech recognition index includes a Word Error Rate (WER), so that the speech enhancement model is trained by using the Word Error Rate as a performance index, and the recognition accuracy of the target speech after speech enhancement is improved in a machine recognition scene.
In one possible embodiment, the subjective auditory index includes one or more of: subjective speech quality assessment (PESQ, Perceptual Evaluation of Speech Quality) and Short-Time Objective Intelligibility (STOI), so that the speech enhancement model is trained with PESQ or STOI as the performance index, which improves the clarity and intelligibility of the enhanced target speech in the human ear listening scene.
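A hedged sketch of a performance evaluation function for the human ear listening scene follows, assuming the third-party `pesq` and `pystoi` packages are available; the normalization to [0, 1] (so the score can serve as a discriminator target) is an assumption for illustration:

```python
import numpy as np
from pesq import pesq      # assumed third-party package
from pystoi import stoi    # assumed third-party package

def quality_score(enhanced: np.ndarray, original: np.ndarray, fs: int = 16000,
                  metric: str = "pesq") -> float:
    """Score enhanced speech against the clean original, normalized to roughly [0, 1]."""
    if metric == "pesq":
        raw = pesq(fs, original, enhanced, "wb")     # wide-band PESQ, roughly in [-0.5, 4.5]
        return (raw + 0.5) / 5.0
    if metric == "stoi":
        return float(stoi(original, enhanced, fs))   # STOI is already roughly in [0, 1]
    raise ValueError(f"unknown metric: {metric}")
```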
S404, performing performance score prediction on the enhanced noisy speech through the discriminator to obtain a prediction performance score of the enhanced noisy speech.
Specifically, the magnitude spectrum of the enhanced noisy speech may be extracted; this magnitude spectrum is the one enhanced by the time-frequency mask in S402. The magnitude spectrum may be input to the discriminator, and the discriminator performs performance score prediction on the enhanced noisy speech to obtain the output data of the discriminator, i.e., the predicted performance score of the enhanced noisy speech. The discriminator is a deep neural network whose output is a value in a preset performance score interval, and the performance score interval is a continuous interval, such as [0, 1].
In one possible embodiment, the discriminator is a convolutional neural network, which sequentially includes convolutional layers, a flatten layer, and a fully-connected layer, so as to improve the accuracy of the predicted performance score of the convolutional neural network.
In one possible implementation, in each convolutional layer of the discriminator, batch normalization and an activation function are applied to the output of the convolutional layer, where the activation function may be the Leaky Rectified Linear Unit (Leaky ReLU), so as to improve the accuracy of the predicted performance score of the discriminator.
In a possible embodiment, in the process of predicting the performance score of the enhanced noisy speech by the discriminator, the amplitude spectrum of the enhanced noisy speech is extracted, the amplitude spectrum of the original speech of the noisy speech is extracted, the amplitude spectrum of the enhanced noisy speech and the amplitude spectrum of the original speech are merged to obtain merged data, the merged data is used as input data of the discriminator, the input data of the discriminator is input into the discriminator to obtain the predicted performance score of the enhanced noisy speech, and therefore the accuracy of predicting the performance score of the enhanced noisy speech by the discriminator is improved by merging the amplitude spectrum of the enhanced noisy speech and the amplitude spectrum of the original speech.
Wherein the combination of the enhanced magnitude spectrum of the noisy speech and the magnitude spectrum of the original speech is a dimension combination. For example, if the dimension of the magnitude spectrum of the enhanced noisy speech is F × T × 1, and the dimension of the magnitude spectrum of the original speech is F × T × 1, then the dimension of the merged data is F × T × 2, where for each speech, F represents the total number of frequencies in the frequency series of the speech and T represents the total number of frames in the time series of the speech.
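A sketch of a discriminator matching this description follows: convolutional layers with batch normalization and Leaky ReLU, a flatten layer, and a fully-connected output in [0, 1]. The adaptive pooling step before the flatten layer is an assumption added here so that utterances of different lengths produce a fixed flattened size; channel counts are illustrative:

```python
import torch
import torch.nn as nn

class CNNDiscriminator(nn.Module):
    """Predicts a performance score in [0, 1] from merged (enhanced, clean) magnitude spectra."""

    def __init__(self):
        super().__init__()
        layers, chans = [], (2, 16, 32, 64, 64, 64)      # 2 input channels: the F x T x 2 merged data
        for ci, co in zip(chans[:-1], chans[1:]):
            layers += [nn.Conv2d(ci, co, 3, stride=2, padding=1),
                       nn.BatchNorm2d(co), nn.LeakyReLU(0.2)]
        self.conv = nn.Sequential(*layers)
        # Adaptive pooling (an assumption, not from the embodiment) gives a fixed flatten size.
        self.pool = nn.AdaptiveAvgPool2d((4, 4))
        self.head = nn.Sequential(nn.Flatten(), nn.Linear(64 * 4 * 4, 1), nn.Sigmoid())

    def forward(self, enhanced_mag, clean_mag):          # each: (batch, F, T)
        merged = torch.stack([enhanced_mag, clean_mag], dim=1)   # (batch, 2, F, T)
        return self.head(self.pool(self.conv(merged))).squeeze(-1)
```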
S405, performing performance score prediction on the original voice through the discriminator to obtain a prediction performance score of the original voice.
Specifically, the magnitude spectrum of the original speech may be extracted, the magnitude spectrum of the original speech is input to a discriminator, and the original speech is subjected to performance score prediction by the discriminator to obtain output data of the discriminator, where the output data of the discriminator is the predicted performance score of the original speech.
In a feasible implementation mode, in the process of predicting the performance score of the original voice through the discriminator, the amplitude spectrum of the original voice is extracted, the amplitude spectrums of two identical original voices are combined to obtain combined data, the combined data is used as input data of the discriminator, the input data of the discriminator is input into the discriminator to obtain the predicted performance score of the original voice, and therefore the accuracy of predicting the performance score of the original voice through the discriminator is improved by combining the amplitude spectrums of the two identical original voices.
S406, training the discriminator according to the performance score and the prediction performance score of the enhanced noisy speech and the prediction performance score of the original speech.
Specifically, the loss value of the discriminator for predicting the performance score can be calculated according to the difference between the performance score of the enhanced noisy speech and the predicted performance score of the enhanced noisy speech, and according to the difference between the predicted performance score of the original speech and the preset performance score of the original speech, and the discriminator is subjected to model parameter adjustment according to the loss value and a preset model optimization algorithm to complete one-time training of the discriminator.
In one possible implementation, the model optimization algorithm may employ an adaptive moment estimation (Adam) optimization algorithm to improve the model training effect.
In one possible embodiment, the loss function used to calculate the loss value of the discriminator's performance score prediction may be expressed as:

L_D = E_{(x,s)~(X,S)}[(D(s, s) - 1)^2 + (D(G(x), s) - Q'(iSTFT(G(x)), iSTFT(s)))^2]

where L_D is the loss value of the discriminator's performance score prediction and E_{(x,s)~(X,S)}(·) denotes the expected value. X denotes the noisy speech among all training speech in the current training round, and S denotes the corresponding original speech; for example, if 5 training samples are selected from the training database in each round, X denotes the 5 noisy speech samples, S denotes the 5 original speech samples, x is one of the noisy speech samples, and s is the original speech corresponding to x. D(s, s) denotes the predicted performance score of the original speech, where the two s indicate that the merged data of the magnitude spectra of two identical original speech samples is input to the discriminator when predicting the performance score of the original speech, and 1 is the preset performance score of the original speech. G(x) denotes the noisy speech enhanced by the generator, i.e., the enhanced noisy speech. D(G(x), s) denotes the predicted performance score of the enhanced noisy speech, for which the input to the discriminator is the merged data of the magnitude spectrum of the enhanced noisy speech and the magnitude spectrum of the original speech. Q'(iSTFT(G(x)), iSTFT(s)) denotes the performance score of the enhanced noisy speech, where Q'(·) is the performance evaluation function corresponding to the performance index, iSTFT(G(x)) denotes the enhanced noisy speech reconstructed in the time domain by the inverse short-time Fourier transform, and iSTFT(s) denotes the time-domain original speech obtained in the same way. The above formula improves the accuracy of the loss value for the discriminator's performance score prediction.
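Under the assumptions of the earlier sketches (magnitude-spectrum inputs and a discriminator returning a score in [0, 1]), one training round of the discriminator could be sketched as follows; the optimizer argument and function name are illustrative:

```python
import torch

def discriminator_step(discriminator, opt_d, enhanced_mag, clean_mag, quality):
    """One training round of the discriminator; `quality` is the performance
    score Q'(iSTFT(G(x)), iSTFT(s)) of the enhanced noisy speech, in [0, 1]."""
    pred_clean = discriminator(clean_mag, clean_mag)                 # D(s, s), target is 1
    pred_enhanced = discriminator(enhanced_mag.detach(), clean_mag)  # D(G(x), s), target is Q'
    loss_d = (pred_clean - 1.0).pow(2).mean() + (pred_enhanced - quality).pow(2).mean()
    opt_d.zero_grad()
    loss_d.backward()
    opt_d.step()
    return loss_d.item()
```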
S407, training the generator according to the enhanced predicted performance score of the noisy speech, the preset target performance score, the noisy speech and the original speech.
Wherein the preset target performance score is an ideal performance score of the enhanced noisy speech.
Specifically, when the generator is used to enhance the noisy speech, the output of the generator is a time-frequency mask, and the enhanced noisy speech is obtained by multiplying the time-frequency mask by the magnitude spectrum of the noisy speech, so that the target time-frequency mask can be obtained when the original speech corresponding to the noisy speech is known. The target time-frequency mask is multiplied by the magnitude spectrum of the noisy speech to obtain the magnitude spectrum of the original speech, and thus, the target time-frequency mask is one of the learning targets of the generator.
Specifically, a first loss value may be determined according to a difference between a predicted performance score and a target performance score of the enhanced noisy speech, and a second loss value may be determined according to a difference between a target time-frequency mask and a time-frequency mask. And adjusting the model parameters of the generator by combining the first loss value, the second loss value and a preset model optimization algorithm, and finishing one-time training of the generator so as to improve the training effect of the generator.
In one possible embodiment, the model optimization algorithm may employ an adaptive moment estimation optimization algorithm to improve the model training effect.
In one possible embodiment, the calculation formula of the first loss value can be expressed as:

L_G1 = E_{x~X}[(D(G(x), s) - 1)^2]

where L_G1 is the first loss value, E_{x~X}(·) denotes the expected value, D(G(x), s) denotes the predicted performance score of the enhanced noisy speech, and 1 is the preset target performance score.
In one possible embodiment, the second loss value can be expressed as a mean squared error between the generated time-frequency mask and the target time-frequency mask:

L_G2 = (1 / (T * F)) * Σ_{t=1..T} Σ_{f=1..F} (ŷ(t, f) - y(t, f))^2

where L_G2 is the second loss value, ŷ(t, f) is the value of the time-frequency mask generated by the generator at frame t and frequency f, y(t, f) is the value of the target time-frequency mask at frame t and frequency f, t is one of the frames in T, and f is one of the frequencies in F.
In one possible embodiment, the first loss value and the second loss value may be combined by means of weighted summation to obtain the loss value of the generator, so as to improve the accuracy of the loss value.
As an example, the formula for combining the first loss value and the second loss value may be expressed as:
L_G = L_G1 + λ · L_G2

where λ is a preset weight parameter and L_G is the loss value of the generator.
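A sketch of one training round of the generator under the assumptions of the earlier sketches follows; it combines the adversarial term L_G1 and the mask term L_G2 through the weight λ, and it computes the target mask as a clipped ratio of clean to noisy magnitude, which is an illustrative choice rather than the specific target mask of the embodiment:

```python
import torch

def generator_step(generator, discriminator, opt_g, noisy_mag, clean_mag, lam=0.5):
    """One training round of the generator using L_G = L_G1 + lambda * L_G2."""
    mask = generator(noisy_mag.unsqueeze(1)).squeeze(1)        # time-frequency mask from the generator
    enhanced_mag = mask * noisy_mag                            # enhanced magnitude spectrum
    # Target mask: an IRM-style ratio of clean to noisy magnitude (an assumption).
    target_mask = torch.clamp(clean_mag / (noisy_mag + 1e-8), 0.0, 1.0)
    loss_g1 = (discriminator(enhanced_mag, clean_mag) - 1.0).pow(2).mean()  # adversarial term
    loss_g2 = (mask - target_mask).pow(2).mean()               # mask regression term
    loss_g = loss_g1 + lam * loss_g2
    opt_g.zero_grad()
    loss_g.backward()
    opt_g.step()
    return loss_g.item()
```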
In the training process of the speech enhancement model provided by the embodiment of the disclosure, the generator and the discriminator of the generative adversarial network are trained according to the training data and the performance index, which improves the speech enhancement effect of the speech enhancement model; moreover, corresponding performance indexes can be designed for different scene types so as to train speech enhancement models for different scene types.
As an example, fig. 5 is a diagram illustrating the structure of a generative adversarial network according to an embodiment of the present disclosure. In fig. 5, the generator of the generative adversarial network is a CRNN, i.e., a convolutional recurrent neural network, and the discriminator is a CNN, i.e., a convolutional neural network.
As shown in fig. 5, the generator includes 5 convolutional layers, 2 bidirectional long short-term memory (BiLSTM) layers, and 5 deconvolution layers. The 5 convolutional layers form the encoding structure, the 2 BiLSTM layers form the loop structure, and the 5 deconvolution layers form the decoding structure. Each convolutional layer in the encoding structure is connected to the corresponding deconvolution layer in the decoding structure through a skip connection, so that the output data of the convolutional layer is fed into the corresponding deconvolution layer, which avoids gradient explosion or gradient vanishing during training.
As shown in fig. 5, the discriminator includes 5 convolutional layers, 1 flattening layer, and 1 fully-connected layer, where the flattening layer is used to perform dimensionality reduction on the output data of the convolutional layers, so that the output data can be input to the fully-connected layer, and then the corresponding predicted performance score is output through the fully-connected layer. And if the value range of the performance score is 0-1, the value range of the output data of the discriminator is 0-1.
As shown in fig. 5, in the training process of the generative adversarial network, the magnitude spectrum of the noisy speech is input into the generator to obtain the corresponding time-frequency mask. The time-frequency mask is multiplied by the magnitude spectrum of the noisy speech to obtain the enhanced magnitude spectrum. An inverse Fourier transform is performed on the enhanced magnitude spectrum and the phase spectrum of the noisy speech to obtain the enhanced noisy speech. Performance evaluation is then performed on the enhanced noisy speech through the performance evaluation function to obtain the performance score of the enhanced noisy speech.
As shown in fig. 5, after obtaining the enhanced noisy speech, the magnitude spectrum of the enhanced noisy speech (i.e., the magnitude spectrum enhanced by the time-frequency mask generated by the generator) and the magnitude spectrum of the original speech may be input to the discriminator to obtain the predicted performance score of the enhanced noisy speech and the predicted performance score of the original speech. The discriminator may be trained based on the performance score and predicted performance score of the enhanced noisy speech and the predicted performance score of the original speech. The generator may be trained based on the predicted performance score of the enhanced noisy speech, the target performance score, the time-frequency mask, and the target time-frequency mask. The arrow between the original speech and the performance evaluation function indicates that the performance evaluation function may require the original speech as one of its inputs when evaluating the enhanced noisy speech. The arrow between the performance evaluation function and the first convolutional layer of the discriminator indicates that the performance score of the enhanced noisy speech is used for training the discriminator.
Fig. 6 is a schematic structural diagram of a speech enhancement apparatus according to an embodiment of the present disclosure. As shown in fig. 6, the apparatus includes:
an obtaining module 601, configured to obtain a target voice;
a determining module 602, configured to determine a scene type of a scene where the target voice is located;
a selecting module 603, configured to select, from preset speech enhancement models, a speech enhancement model corresponding to a scene type;
and an enhancement module 604, configured to enhance the target speech through a speech enhancement model corresponding to the scene type.
In one possible embodiment, the scene type is a human ear listening scene or a machine recognition scene.
In one possible embodiment, the speech enhancement model is the generator of a pre-trained generative adversarial network, and the generator is a convolutional recurrent neural network.
In one possible implementation, the obtaining module 601 is further configured to: acquiring training voice, wherein the training voice comprises voice with noise and original voice of the voice with noise;
the apparatus also includes a training module to: and training the generating type countermeasure network according to the training voice and the preset performance index.
In one possible embodiment, the discriminator of the generative adversarial network is a convolutional neural network; the training module is specifically configured for: enhancing the noisy speech through the generator; performing performance evaluation on the enhanced noisy speech through a preset performance evaluation function corresponding to the performance index to obtain the performance score of the enhanced noisy speech; performing performance score prediction on the enhanced noisy speech through the discriminator to obtain the predicted performance score of the enhanced noisy speech; performing performance score prediction on the original speech through the discriminator to obtain the predicted performance score of the original speech; training the discriminator according to the performance score and predicted performance score of the enhanced noisy speech and the predicted performance score of the original speech; and training the generator according to the predicted performance score of the enhanced noisy speech, the preset target performance score, the noisy speech, and the original speech.
In one possible embodiment, the training module is specifically configured to: extracting a magnitude spectrum and a phase spectrum of the voice with noise; inputting the amplitude spectrum into a generator to obtain a time-frequency mask output by the generator; enhancing the amplitude spectrum through a time-frequency mask; and carrying out inverse Fourier transform on the phase spectrum and the enhanced amplitude spectrum to obtain the enhanced noisy speech.
In one possible embodiment, the training module is specifically configured to: determining a target time-frequency mask according to the voice with noise and the original voice; determining a first loss value according to the prediction performance score and the target performance score of the enhanced noisy speech, and determining a second loss value according to a target time-frequency mask code and a time-frequency mask code; the generator is trained based on the first loss value and the second loss value.
In one possible embodiment, the training module is specifically configured to: extracting the amplitude spectrum of the original voice and extracting the enhanced amplitude spectrum of the voice with noise; combining the amplitude spectrum of the original voice and the enhanced amplitude spectrum of the voice with noise to obtain input data of a discriminator; and inputting the input data of the discriminator into the discriminator to obtain the prediction performance score of the enhanced noisy speech.
In one possible embodiment, the performance indicator includes a speech recognition indicator or a subjective auditory sensation indicator.
In one possible implementation, the speech recognition index includes a word error rate.
In one possible embodiment, the subjective auditory sensation indicator includes one or more of: subjective speech quality assessment, short-term objective intelligibility.
Fig. 7 is a schematic structural diagram of a training apparatus for a speech enhancement model according to an embodiment of the present disclosure, where the speech enhancement model is the generator of a generative adversarial network, the generator is a convolutional recurrent neural network, and the discriminator is a convolutional neural network. As shown in fig. 7, the apparatus includes:
an obtaining module 701, configured to obtain training speech, where the training speech includes noisy speech and the original speech corresponding to the noisy speech;
the training module 702 is configured to train the generative confrontation network for multiple times according to the training speech and a preset performance index.
In one possible embodiment, the discriminator of the generative adversarial network is a convolutional neural network; the training module 702 is specifically configured for: enhancing the noisy speech through the generator; performing performance evaluation on the enhanced noisy speech through a preset performance evaluation function corresponding to the performance index to obtain the performance score of the enhanced noisy speech; performing performance score prediction on the enhanced noisy speech through the discriminator to obtain the predicted performance score of the enhanced noisy speech; performing performance score prediction on the original speech through the discriminator to obtain the predicted performance score of the original speech; training the discriminator according to the performance score and predicted performance score of the enhanced noisy speech and the predicted performance score of the original speech; and training the generator according to the predicted performance score of the enhanced noisy speech, the preset target performance score, the noisy speech, and the original speech.
In one possible implementation, the training module 702 is specifically configured to: extracting a magnitude spectrum and a phase spectrum of the voice with noise; inputting the amplitude spectrum into a generator to obtain a time-frequency mask output by the generator; enhancing the amplitude spectrum through a time-frequency mask; and carrying out inverse Fourier transform on the phase spectrum and the enhanced amplitude spectrum to obtain the enhanced noisy speech.
In one possible implementation, the training module 702 is specifically configured to: determining a target time-frequency mask according to the voice with noise and the original voice; determining a first loss value according to the prediction performance score and the target performance score of the enhanced noisy speech, and determining a second loss value according to a target time-frequency mask code and a time-frequency mask code; the generator is trained based on the first loss value and the second loss value.
In one possible implementation, the training module 702 is specifically configured to: extracting the amplitude spectrum of the original voice and extracting the enhanced amplitude spectrum of the voice with noise; combining the amplitude spectrum of the original voice and the enhanced amplitude spectrum of the voice with noise to obtain input data of a discriminator; and inputting the input data of the discriminator into the discriminator to obtain the prediction performance score of the enhanced noisy speech.
In one possible embodiment, the performance indicator includes a speech recognition indicator or a subjective auditory sensation indicator.
In one possible implementation, the speech recognition indicator includes a word error rate.
In one possible embodiment, the subjective auditory sensation indicator includes one or more of the following: subjective speech quality assessment and short-time objective intelligibility.
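For reference, such scores can be computed with the open-source `pesq` and `pystoi` packages; the package choice and the rescaling to roughly [0, 1] below are assumptions made for illustration, not part of this disclosure.

```python
from pesq import pesq      # perceptual evaluation of speech quality
from pystoi import stoi    # short-time objective intelligibility

def performance_score(original_wave, enhanced_wave, fs=16000, metric="pesq"):
    if metric == "pesq":
        raw = pesq(fs, original_wave, enhanced_wave, "wb")   # roughly -0.5 .. 4.5
        return (raw + 0.5) / 5.0                             # rescale to about [0, 1]
    if metric == "stoi":
        return stoi(original_wave, enhanced_wave, fs)        # already in [0, 1]
    raise ValueError(f"unknown metric: {metric}")
```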
The speech enhancement apparatus provided in fig. 6 and the training apparatus of the speech enhancement model provided in fig. 7 can perform the corresponding method embodiments described above; the implementation principles and technical effects are similar and are not repeated here.
Fig. 8 is a schematic structural diagram of a server according to an embodiment of the present disclosure. As shown in fig. 8, the server may include: a processor 801 and a memory 802. The memory 802 is used for storing computer-executable instructions, and the processor 801 executes these instructions to implement the method of any one of the above embodiments.
The processor 801 may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), and the like; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, or discrete hardware components. The memory 802 may include a random access memory (RAM), and may also include a non-volatile memory, such as at least one disk storage.
An embodiment of the present disclosure also provides a computer-readable storage medium having stored therein instructions, which, when run on a computer, cause the computer to perform the method of any of the embodiments described above.
An embodiment of the present disclosure also provides a program product comprising a computer program, the computer program being stored in a storage medium, the computer program being readable from the storage medium by at least one processor, the at least one processor being capable of implementing the method of any of the above embodiments when executing the computer program.
Fig. 9 is a block diagram of a speech enhancement apparatus 900 according to an embodiment of the disclosure. For example, the apparatus 900 may be provided as a server or a computer. Referring to fig. 9, the apparatus 900 includes a processing component 901, which further includes one or more processors, and memory resources represented by a memory 902 for storing instructions executable by the processing component 901, such as application programs. The application programs stored in the memory 902 may include one or more modules, each of which corresponds to a set of instructions. Further, the processing component 901 is configured to execute the instructions to perform the method of any one of the embodiments described above.
The device 900 may also include a power component 903 configured to perform power management of the device 900, a wired or wireless network interface 904 configured to connect the device 900 to a network, and an input/output (I/O) interface 905. The device 900 may operate based on an operating system stored in the memory 902, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, or the like.
The training device of the speech enhancement model may likewise be a computer or a server; therefore, its block diagram may refer to the block diagram of the speech enhancement apparatus 900 shown in fig. 9.
In the embodiments of the present disclosure, "at least one" means one or more, and "a plurality" means two or more. "And/or" describes an association between associated objects and indicates that three relationships may exist; for example, "A and/or B" may mean: A exists alone, both A and B exist, or B exists alone, where A and B may be singular or plural. The character "/" generally indicates that the associated objects before and after it are in an "or" relationship; in a formula, the character "/" indicates that the associated objects before and after it are in a "division" relationship. "At least one of the following" or similar expressions refer to any combination of these items, including any combination of single items or plural items. For example, "at least one of a, b, or c" may represent: a, b, c, a-b, a-c, b-c, or a-b-c, where a, b, and c may each be single or multiple.
It is to be understood that references to "first" and "second" in the embodiments of the present disclosure are merely for convenience of description and do not limit the scope of the embodiments of the present disclosure.
It is to be understood that the various numerical designations referred to in the embodiments of the disclosure are merely for convenience of description and are not intended to limit the scope of the embodiments of the disclosure.
It should be understood that, in the embodiment of the present disclosure, the sequence numbers of the above-mentioned processes do not mean the execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiment of the present disclosure.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. The embodiments of the disclosure are intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (18)

1. A method of speech enhancement, the method comprising:
acquiring target speech;
determining a scene type of a scene where the target speech is located;
selecting a speech enhancement model corresponding to the scene type from preset speech enhancement models;
enhancing the target speech through the speech enhancement model corresponding to the scene type;
wherein the speech enhancement model is a generator of a pre-trained generative adversarial network, the generator is a convolutional recurrent neural network, the generative adversarial network is obtained through multiple rounds of training based on training speech and a preset performance indicator, and the training speech comprises noisy speech and original speech;
wherein one training iteration of the generator comprises:
enhancing the noisy speech through the generator to obtain a time-frequency mask output by the generator and enhanced noisy speech;
performing performance score prediction on the enhanced noisy speech through a discriminator of the generative adversarial network to obtain a predicted performance score of the enhanced noisy speech;
and adjusting model parameters of the generator based on a difference between the predicted performance score of the enhanced noisy speech and a preset target performance score and a difference between the time-frequency mask and a target time-frequency mask, wherein the target time-frequency mask is obtained based on the original speech.
2. The method of claim 1, wherein the discriminator is a convolutional neural network, and one training iteration of the discriminator comprises:
performing performance score prediction on the original speech through the discriminator to obtain a predicted performance score of the original speech;
and training the discriminator according to the performance score and the predicted performance score of the enhanced noisy speech and the predicted performance score of the original speech.
3. The method of claim 1 or 2, wherein said enhancing, by said generator, said noisy speech comprises:
extracting a magnitude spectrum and a phase spectrum of the noisy speech;
inputting the magnitude spectrum into the generator to obtain the time-frequency mask output by the generator;
enhancing the magnitude spectrum through the time-frequency mask;
and performing an inverse Fourier transform on the phase spectrum and the enhanced magnitude spectrum to obtain the enhanced noisy speech.
4. The method of claim 1 or 2, wherein the adjusting model parameters of the generator based on the difference between the predicted performance score of the enhanced noisy speech and the preset target performance score and the difference between the time-frequency mask and the target time-frequency mask comprises:
determining the target time-frequency mask according to the noisy speech and the original speech;
determining a first loss value according to the predicted performance score of the enhanced noisy speech and the target performance score, and determining a second loss value according to the target time-frequency mask and the time-frequency mask;
and training the generator according to the first loss value and the second loss value.
5. The method according to claim 1 or 2, wherein said performing, by said discriminator, a performance score prediction on said enhanced noisy speech to obtain a predicted performance score of said enhanced noisy speech comprises:
extracting the magnitude spectrum of the original speech and the magnitude spectrum of the enhanced noisy speech;
combining the magnitude spectrum of the original speech and the magnitude spectrum of the enhanced noisy speech to obtain input data of the discriminator;
and inputting the input data of the discriminator into the discriminator to obtain the predicted performance score of the enhanced noisy speech.
6. The method according to claim 1 or 2, wherein the performance indicator is a speech recognition indicator or a subjective auditory sensation indicator.
7. The method of claim 6, wherein the speech recognition indicator comprises a word error rate.
8. The method of claim 6, wherein the subjective auditory sensation indicator comprises one or more of the following: subjective speech quality assessment and short-time objective intelligibility.
9. A speech enhancement apparatus, characterized in that the apparatus comprises:
an acquisition module, configured to acquire target speech;
a determining module, configured to determine a scene type of a scene where the target speech is located;
a selection module, configured to select a speech enhancement model corresponding to the scene type from preset speech enhancement models;
an enhancement module, configured to enhance the target speech through the speech enhancement model corresponding to the scene type;
wherein the speech enhancement model is a generator of a pre-trained generative adversarial network, the generator is a convolutional recurrent neural network, the generative adversarial network is obtained through multiple rounds of training based on training speech and a preset performance indicator, and the training speech comprises noisy speech and original speech;
the apparatus further comprises a training module, wherein in one training iteration of the generator, the training module is configured to:
enhance the noisy speech through the generator to obtain a time-frequency mask output by the generator and enhanced noisy speech;
perform performance score prediction on the enhanced noisy speech through a discriminator of the generative adversarial network to obtain a predicted performance score of the enhanced noisy speech;
and adjust model parameters of the generator based on a difference between the predicted performance score of the enhanced noisy speech and a preset target performance score and a difference between the time-frequency mask and a target time-frequency mask, wherein the target time-frequency mask is obtained based on the original speech.
10. The apparatus of claim 9, wherein the discriminator is a convolutional neural network; in one training iteration of the discriminator, the training module is configured to:
perform performance score prediction on the original speech through the discriminator to obtain a predicted performance score of the original speech;
and train the discriminator according to the performance score and the predicted performance score of the enhanced noisy speech and the predicted performance score of the original speech.
11. The apparatus according to claim 9 or 10, wherein the training module is specifically configured to:
extract a magnitude spectrum and a phase spectrum of the noisy speech;
input the magnitude spectrum into the generator to obtain the time-frequency mask output by the generator;
enhance the magnitude spectrum through the time-frequency mask;
and perform an inverse Fourier transform on the phase spectrum and the enhanced magnitude spectrum to obtain the enhanced noisy speech.
12. The apparatus according to claim 9 or 10, wherein the training module is specifically configured to:
determine the target time-frequency mask according to the noisy speech and the original speech;
determine a first loss value according to the predicted performance score of the enhanced noisy speech and the target performance score, and determine a second loss value according to the target time-frequency mask and the time-frequency mask;
and train the generator according to the first loss value and the second loss value.
13. The apparatus according to claim 9 or 10, wherein the training module is specifically configured to:
extract the magnitude spectrum of the original speech and the magnitude spectrum of the enhanced noisy speech;
combine the magnitude spectrum of the original speech and the magnitude spectrum of the enhanced noisy speech to obtain input data of the discriminator;
and input the input data of the discriminator into the discriminator to obtain the predicted performance score of the enhanced noisy speech.
14. The apparatus of claim 9 or 10, wherein the performance indicator is a speech recognition indicator or a subjective auditory sensation indicator.
15. The apparatus of claim 14, wherein the speech recognition indicator comprises a word error rate.
16. The apparatus of claim 14, wherein the subjective auditory sensation indicator comprises one or more of the following: subjective speech quality assessment and short-time objective intelligibility.
17. An electronic device, comprising: a memory and a processor;
the memory is to store program instructions;
the processor is configured to invoke program instructions in the memory to perform the method of any of claims 1-8.
18. A computer-readable storage medium, characterized in that the computer-readable storage medium has stored thereon a computer program; the computer program, when executed, implementing the method of any one of claims 1-8.
CN202010615254.XA 2020-06-30 2020-06-30 Voice enhancement method, device, equipment and storage medium Active CN111785288B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010615254.XA CN111785288B (en) 2020-06-30 2020-06-30 Voice enhancement method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010615254.XA CN111785288B (en) 2020-06-30 2020-06-30 Voice enhancement method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN111785288A CN111785288A (en) 2020-10-16
CN111785288B true CN111785288B (en) 2022-03-15

Family

ID=72760911

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010615254.XA Active CN111785288B (en) 2020-06-30 2020-06-30 Voice enhancement method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111785288B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113096682B (en) * 2021-03-20 2023-08-29 杭州知存智能科技有限公司 Real-time voice noise reduction method and device based on mask time domain decoder
CN113299302A (en) * 2021-04-22 2021-08-24 维沃移动通信(杭州)有限公司 Audio noise reduction method and device and electronic equipment
CN113160844A (en) * 2021-04-27 2021-07-23 山东省计算中心(国家超级计算济南中心) Speech enhancement method and system based on noise background classification
CN113436643A (en) * 2021-06-25 2021-09-24 平安科技(深圳)有限公司 Method, device, equipment and storage medium for training and applying speech enhancement model
CN113611318A (en) * 2021-06-29 2021-11-05 华为技术有限公司 Audio data enhancement method and related equipment
CN113921030B (en) * 2021-12-07 2022-06-07 江苏清微智能科技有限公司 Speech enhancement neural network training method and device based on weighted speech loss

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104575510A (en) * 2015-02-04 2015-04-29 深圳酷派技术有限公司 Noise reduction method, noise reduction device and terminal
CN105448292A (en) * 2014-08-19 2016-03-30 北京羽扇智信息科技有限公司 Scene-based real-time voice recognition system and method
CN110769111A (en) * 2019-10-28 2020-02-07 珠海格力电器股份有限公司 Noise reduction method, system, storage medium and terminal
CN110853663A (en) * 2019-10-12 2020-02-28 平安科技(深圳)有限公司 Speech enhancement method based on artificial intelligence, server and storage medium
CN110867192A (en) * 2019-10-23 2020-03-06 北京计算机技术及应用研究所 Speech enhancement method based on gated cyclic coding and decoding network
CN111192598A (en) * 2020-01-07 2020-05-22 哈尔滨理工大学 Voice enhancement method for jump connection deep neural network
CN112185408A (en) * 2020-10-10 2021-01-05 Oppo广东移动通信有限公司 Audio noise reduction method and device, electronic equipment and storage medium

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7856355B2 (en) * 2005-07-05 2010-12-21 Alcatel-Lucent Usa Inc. Speech quality assessment method and system
US8032366B2 (en) * 2008-05-16 2011-10-04 Tellabs Operations, Inc. Method and apparatus for low bit rate speech coding detection
JP6260504B2 (en) * 2014-02-27 2018-01-17 株式会社Jvcケンウッド Audio signal processing apparatus, audio signal processing method, and audio signal processing program
CN109427340A (en) * 2017-08-22 2019-03-05 杭州海康威视数字技术股份有限公司 A kind of sound enhancement method, device and electronic equipment
EP3514792B1 (en) * 2018-01-17 2023-10-18 Oticon A/s A method of optimizing a speech enhancement algorithm with a speech intelligibility prediction algorithm
CN108415683A (en) * 2018-03-07 2018-08-17 深圳车盒子科技有限公司 More scene voice householder methods, intelligent voice system, equipment and storage medium
CN110246510B (en) * 2019-06-24 2021-04-06 电子科技大学 End-to-end voice enhancement method based on RefineNet

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105448292A (en) * 2014-08-19 2016-03-30 北京羽扇智信息科技有限公司 Scene-based real-time voice recognition system and method
CN104575510A (en) * 2015-02-04 2015-04-29 深圳酷派技术有限公司 Noise reduction method, noise reduction device and terminal
CN110853663A (en) * 2019-10-12 2020-02-28 平安科技(深圳)有限公司 Speech enhancement method based on artificial intelligence, server and storage medium
CN110867192A (en) * 2019-10-23 2020-03-06 北京计算机技术及应用研究所 Speech enhancement method based on gated cyclic coding and decoding network
CN110769111A (en) * 2019-10-28 2020-02-07 珠海格力电器股份有限公司 Noise reduction method, system, storage medium and terminal
CN111192598A (en) * 2020-01-07 2020-05-22 哈尔滨理工大学 Voice enhancement method for jump connection deep neural network
CN112185408A (en) * 2020-10-10 2021-01-05 Oppo广东移动通信有限公司 Audio noise reduction method and device, electronic equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Speech signal separation based on generative adversarial networks; Liu Hang et al.; Computer Engineering; 2020-01-15; Vol. 46, No. 1; pp. 302-308 *

Also Published As

Publication number Publication date
CN111785288A (en) 2020-10-16

Similar Documents

Publication Publication Date Title
CN111785288B (en) Voice enhancement method, device, equipment and storage medium
CN110600017B (en) Training method of voice processing model, voice recognition method, system and device
Luo et al. Conv-tasnet: Surpassing ideal time–frequency magnitude masking for speech separation
Zhao et al. Monaural speech dereverberation using temporal convolutional networks with self attention
Pandey et al. Dense CNN with self-attention for time-domain speech enhancement
Fu et al. Metricgan: Generative adversarial networks based black-box metric scores optimization for speech enhancement
Qian et al. Speech Enhancement Using Bayesian Wavenet.
Triantafyllopoulos et al. Towards robust speech emotion recognition using deep residual networks for speech enhancement
CN109256144B (en) Speech enhancement method based on ensemble learning and noise perception training
Pawar et al. Convolution neural network based automatic speech emotion recognition using Mel-frequency Cepstrum coefficients
CN108922544B (en) Universal vector training method, voice clustering method, device, equipment and medium
EP3899936B1 (en) Source separation using an estimation and control of sound quality
Wang et al. Recurrent deep stacking networks for supervised speech separation
Zhang et al. Multi-channel multi-frame ADL-MVDR for target speech separation
CN108922543B (en) Model base establishing method, voice recognition method, device, equipment and medium
CN111508519B (en) Method and device for enhancing voice of audio signal
CN114242044B (en) Voice quality evaluation method, voice quality evaluation model training method and device
Hsieh et al. Improving perceptual quality by phone-fortified perceptual loss for speech enhancement
CN113053400A (en) Training method of audio signal noise reduction model, audio signal noise reduction method and device
Abdulatif et al. Investigating cross-domain losses for speech enhancement
Liu et al. Learning salient features for speech emotion recognition using CNN
CN112289334A (en) Reverberation elimination method and device
US20230116052A1 (en) Array geometry agnostic multi-channel personalized speech enhancement
CN115565548A (en) Abnormal sound detection method, abnormal sound detection device, storage medium and electronic equipment
CN115221351A (en) Audio matching method and device, electronic equipment and computer-readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant