CN114974280A

CN114974280A - Training method of audio noise reduction model, and audio noise reduction method and device

Info

Publication number: CN114974280A
Application number: CN202210518375.1A
Authority: CN
Inventors: 赵情恩
Original assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Current assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Priority date: 2022-05-12
Filing date: 2022-05-12
Publication date: 2022-08-30

Abstract

The disclosure provides a training method for an audio noise reduction model, and an audio noise reduction method and device, and relates to the field of computer technology, in particular to artificial intelligence, speech technology, and deep learning. The training method comprises: inputting the spectral features of noisy audio into a noise token model to obtain noise features; inputting the spectral features and the noise features into a noise reduction model to obtain noise-reduced audio; inputting the noise-reduced audio into a generative adversarial network to obtain an audio predicted value and an audio true value; adjusting the parameters of the noise token model, the noise reduction model, and the generative adversarial network, using a loss function, according to the noise-reduced audio, the noise-free audio, the audio predicted value, the audio true value, and an audio preset value; and obtaining the trained audio noise reduction model once the noise token model, the noise reduction model, and the generative adversarial network have all been adjusted to convergence. The disclosed scheme can reduce noise interference in audio, enhance audio quality, and improve audio intelligibility.

Description

Training method of audio noise reduction model, and audio noise reduction method and device

Technical Field

The present disclosure relates to the field of computer technology, and more particularly to the field of artificial intelligence, speech technology, and deep learning technology.

Background

In real-time audio and video communication, noise such as loud background sounds, keyboard tapping, and general clamor is inevitable. Audio noise reduction is required to suppress it. Audio noise reduction (or audio enhancement) refers to techniques that, when an audio signal is disturbed or even drowned out by background noise, extract as much of the useful (clean) audio signal as possible from the noisy signal so as to suppress or reduce the noise interference. Such noise is usually present both at the far end, where the sound is emitted, and at the near end, where it is received.

Disclosure of Invention

The disclosure provides a training method of an audio noise reduction model, and an audio noise reduction method and device.

According to an aspect of the present disclosure, there is provided a training method of an audio noise reduction model, including:

inputting the spectral features of noisy audio into a noise token model to obtain noise features;

inputting the spectral features and the noise features into a noise reduction model to obtain noise-reduced audio;

inputting the noise-reduced audio into a generative adversarial network to obtain an audio predicted value and an audio true value;

adjusting the parameters of the noise token model, the noise reduction model, and the generative adversarial network, respectively, using a loss function, according to the noise-reduced audio, the noise-free audio corresponding to the noisy audio, the audio predicted value, the audio true value, and an audio preset value; and

obtaining the trained audio noise reduction model once the noise token model, the noise reduction model, and the generative adversarial network have all been adjusted to convergence.

According to another aspect of the present disclosure, there is provided a method of audio noise reduction, comprising:

processing target noisy audio at an audio sending end using a pre-trained audio noise reduction model; and

sending the noise-reduced, enhanced audio produced by the pre-trained audio noise reduction model to an audio receiving end; wherein the pre-trained audio noise reduction model is obtained by the training method of the audio noise reduction model of any embodiment of the disclosure.

According to another aspect of the present disclosure, there is provided an audio noise reduction model training apparatus, including:

a feature module, configured to input the spectral features of noisy audio into a noise token model to obtain noise features;

a noise reduction module, configured to input the spectral features and the noise features into a noise reduction model to obtain noise-reduced audio;

a computing module, configured to input the noise-reduced audio into a generative adversarial network to obtain an audio predicted value and an audio true value;

an adjusting module, configured to adjust the parameters of the noise token model, the noise reduction model, and the generative adversarial network, respectively, using a loss function, according to the noise-reduced audio, the noise-free audio corresponding to the noisy audio, the audio predicted value, the audio true value, and an audio preset value; and

a training module, configured to obtain the trained audio noise reduction model once the noise token model, the noise reduction model, and the generative adversarial network have all converged.

According to another aspect of the present disclosure, there is provided an apparatus for audio noise reduction, including:

a processing module, configured to process target noisy audio at an audio sending end using a pre-trained audio noise reduction model; and

a sending module, configured to send the noise-reduced, enhanced audio produced by the pre-trained audio noise reduction model to an audio receiving end; wherein the pre-trained audio noise reduction model is obtained by the training method of the audio noise reduction model of any embodiment of the disclosure, or by the training apparatus of the audio noise reduction model of any embodiment of the disclosure.

According to another aspect of the present disclosure, there is provided an electronic device including:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein

the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any of the embodiments of the present disclosure.

According to another aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform a method of training an audio noise reduction model and/or a method of audio noise reduction in any of the embodiments of the present disclosure.

According to another aspect of the present disclosure, a computer program product is provided, comprising a computer program which, when executed by a processor, implements a method of training an audio noise reduction model and/or a method of audio noise reduction in any of the embodiments of the present disclosure.

The disclosed scheme can reduce noise interference in audio, enhance audio quality, and improve audio intelligibility.

It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.

Drawings

The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. In the drawings:

FIG. 1 is a schematic diagram of a method of training an audio noise reduction model according to an embodiment of the present disclosure;

FIG. 2 is a schematic diagram of an application of a training method of an audio noise reduction model according to an embodiment of the present disclosure;

FIG. 3 is a schematic diagram of an application of a training method of an audio noise reduction model according to an embodiment of the present disclosure;

FIG. 4 is a schematic diagram of an audio noise reduction model according to an embodiment of the disclosure;

FIG. 5 is a schematic structural diagram of a noise token model according to an embodiment of the present disclosure;

FIG. 6 is a schematic structural diagram of a noise reduction model according to an embodiment of the present disclosure;

FIG. 7 is a schematic diagram of a structure of a generator according to an embodiment of the present disclosure;

FIG. 8 is a schematic diagram of a structure of a discriminator according to an embodiment of the present disclosure;

FIG. 9 is a schematic diagram of a structure of a generative adversarial network according to an embodiment of the present disclosure;

FIG. 10 is a schematic diagram of a generative adversarial network according to an embodiment of the present disclosure;

FIG. 11 is a schematic diagram of a generative adversarial network according to an embodiment of the present disclosure;

FIG. 12 is a schematic diagram of a method of audio noise reduction according to an embodiment of the present disclosure;

FIG. 13 is a schematic diagram of an application of a method of audio noise reduction according to an embodiment of the present disclosure;

FIG. 14 is a schematic diagram of a training apparatus for an audio noise reduction model according to an embodiment of the present disclosure;

FIG. 15 is a schematic diagram of an apparatus for audio noise reduction according to an embodiment of the present disclosure;

FIG. 16 is a block diagram of an electronic device for implementing a method of audio noise reduction and/or a method of training an audio noise reduction model according to embodiments of the present disclosure.

Detailed Description

Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.

An embodiment of the present disclosure provides a training method of an audio noise reduction model. As shown in FIG. 1, which is a flowchart of the method of this embodiment, the method may include the following steps:

Step S101: inputting the spectral features of noisy audio into a noise token model to obtain noise features.

Step S102: inputting the spectral features and the noise features into a noise reduction model to obtain noise-reduced audio.

Step S103: inputting the noise-reduced audio into a generative adversarial network to obtain an audio predicted value and an audio true value.

Step S104: adjusting the parameters of the noise token model, the noise reduction model, and the generative adversarial network, respectively, using a loss function, according to the noise-reduced audio, the noise-free audio corresponding to the noisy audio, the audio predicted value, the audio true value, and an audio preset value.

Step S105: obtaining the trained audio noise reduction model once the noise token model, the noise reduction model, and the generative adversarial network have all been adjusted to convergence.

It should be noted that noisy audio can be understood as speech or audio that contains background noise. The background noise may include clamor in the environment, loud sounds, keyboard tapping, vehicle horns, and the like. The speech may include speech sent by a user through a smart terminal (e.g., a mobile phone call, a network call, or a video call). The audio may include music or video played by the terminal, or interactive speech generated by the smart terminal. The noisy audio can be understood as one frame of a continuous audio signal. The frame length and frame shift can be selected and adjusted as needed.

A spectral feature may be understood as any feature that relates to the spectrum of an audio signal or a speech signal. The spectral features of the noisy audio can be extracted from the audio by any method known in the art, such as a Fourier transform, a short-time Fourier transform, or a spectral feature extraction model.

The network structure of the noise token model can be selected and adjusted as needed and is not specifically limited here, as long as it can extract the noise feature of the noise in the noisy audio and express that feature as a vector. The noise feature may be any feature that characterizes noise, and its dimension may be selected and adjusted as needed.

The noise reduction model specifically adopted may be any model capable of realizing an audio noise reduction function in the prior art, and is not specifically limited herein. The network structure of the noise reduction model may be selected and adjusted according to the need, and is not specifically limited herein. The noise reduction audio output by the noise reduction model may be the noise reduction audio itself, or may be a spectral feature of the noise reduction audio.

The generator and the discriminator in the generative adversarial network may each be a neural network or a function, and are not specifically limited in this embodiment.

The audio predicted value may be a value that the discriminator predicts from the output of the generator. The audio true value may be calculated from the noise-reduced audio by a preset function in the generative adversarial network. The preset function may exist independently of the generator and the discriminator. The audio preset value may be a value set in advance. The noise-free audio corresponding to the noisy audio can be understood as the speech actually spoken by a speaker at the audio sending end, or the audio of the audio/video content actually played by the terminal at the audio sending end.

Adjusting the parameters of the noise token model, the noise reduction model, and the generative adversarial network respectively can be understood to mean that the parameters of all three are adjusted in a single parameter adjustment according to the loss function. It can also be understood to mean that only the parameters of one or more of the models, rather than all of them, are adjusted in a single parameter adjustment according to the loss function.

The loss function can be selected and adjusted as needed. There may be one or more loss functions. For example, the noise token model, the noise reduction model, and the generative adversarial network may each have their own loss function, and the three may also share a common loss function. The parameters of the noise token model, the noise reduction model, and the generative adversarial network are then adjusted by combining the loss values of these loss functions.
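The way several loss values might be combined can be illustrated with a minimal sketch. The three terms below, their squared-error form, and the weights are all assumptions made for illustration; the text above does not specify the loss functions. The sketch pairs the noise-reduced audio with the noise-free audio, the audio predicted value with the audio true value, and the audio predicted value with the audio preset value.

```python
import numpy as np

def training_losses(denoised, clean, predicted_value, true_value, preset_value,
                    weights=(1.0, 1.0, 1.0)):
    """Return three illustrative loss terms and their weighted sum.

    All terms are hypothetical squared errors chosen for this sketch:
      reconstruction: noise-reduced audio vs. noise-free audio
      discriminator:  audio predicted value vs. audio true value
      adversarial:    audio predicted value vs. audio preset value
    """
    denoised = np.asarray(denoised, dtype=np.float64)
    clean = np.asarray(clean, dtype=np.float64)
    losses = {
        "reconstruction": float(np.mean((denoised - clean) ** 2)),
        "discriminator": float((predicted_value - true_value) ** 2),
        "adversarial": float((predicted_value - preset_value) ** 2),
    }
    # Weighted sum of the three terms computed above.
    losses["total"] = float(sum(w * v for w, v in zip(weights, losses.values())))
    return losses

losses = training_losses(
    denoised=[0.0, 1.0], clean=[0.0, 0.0],
    predicted_value=0.6, true_value=0.8, preset_value=1.0,
)
```

In practice each term would typically drive a different subset of parameters, which matches the option described above of adjusting only some of the models in a given update.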

According to the disclosed scheme, the trained audio noise reduction model can reduce noise interference in audio, enhance audio quality, and improve audio intelligibility. Because the scheme builds the audio noise reduction model jointly from the noise token model, the noise reduction model, and the generative adversarial network, the trained model can handle different types of noise, places no upper limit on the amount of noise in the audio, and can handle non-stationary noise. The training method uses the loss function to adjust the parameters of the noise token model, the noise reduction model, and the generative adversarial network respectively, so that the whole transmission link of the audio signal is considered together, the modeling is optimized globally, and the experience of the listener at the audio receiving end is improved. The audio noise reduction model trained by this scheme can be applied to audio signal transmission in audio and video conferencing, interactive entertainment live streaming, and online education products. Because a generative adversarial network is built into the trained audio noise reduction model, the network can automatically learn and discover the characteristics of audio and noise, which strengthens the training. Adding a generative adversarial network to model training also effectively avoids the computational complexity and non-differentiability of traditional criteria, such as perceptual evaluation of speech quality (PESQ), the virtual speech quality objective listener (ViSQOL), short-time objective intelligibility (STOI), and the coherence speech intelligibility index (CSII).

FIG. 2 is a schematic diagram of a distributed cluster processing scenario according to an embodiment of the present disclosure. The distributed cluster system is one example of a cluster system and illustrates that the audio noise reduction model can be trained using a distributed cluster system. As shown in FIG. 2, the distributed cluster system includes a plurality of nodes (e.g., server cluster 201, server 202, server cluster 203, server 204, and server 205, where server 205 may also be connected to electronic devices such as a mobile phone 2051 and a desktop computer 2052). The nodes, together with any connected electronic devices, may jointly execute one or more training tasks for audio noise reduction models. Optionally, if the nodes in the distributed cluster system adopt a data-parallel training mode, the nodes may execute the training task based on the same training mode; if the nodes adopt a model-parallel training mode, the nodes may execute the training task based on different training modes. Optionally, after each round of model training is completed, data may be exchanged (e.g., synchronized) between the nodes.

FIG. 3 shows an application scenario of an audio noise reduction model trained with the training method according to an embodiment of the present disclosure. The audio sending end is the real environment where the speaker is located, and the audio receiving end is the real environment where the listener is located. The audio sending end includes the speech spoken by the speaker (Hello) and the far-end noise in the speaker's real environment (the far end being the audio sending end). The audio receiving end includes the enhanced audio received by the listener and the near-end noise in the listener's real environment (the near end being the audio receiving end). The enhanced audio is obtained by processing the noisy audio sent by the audio sending end (the speaker's speech plus the far-end noise) with the audio noise reduction model trained by the training method of the present disclosure.

In an embodiment, the training method of the audio noise reduction model provided by the embodiment of the present disclosure includes steps S101 to S105 and, before inputting the spectral features of the noisy audio into the noise token model to obtain the noise features, may further include:

constructing the noisy audio from background noise audio and noise-free audio of the audio sending end; and

obtaining the spectral features of the noisy audio using a short-time Fourier transform (STFT).

It should be noted that the noise-free audio can be understood as the speech actually spoken by the speaker or the sound in the audio/video played by the terminal. The background noise audio may include clamor in the environment, loud sounds, keyboard tapping, vehicle horns, and the like. The constructed noisy audio can be understood as continuous audio, or as one frame of continuous audio. The frame length and frame shift can be selected and adjusted as needed.

According to the disclosed scheme, the short-time Fourier transform yields more accurate spectral features of the noisy audio. Constructing the noisy audio from background noise audio and noise-free audio makes it more consistent with the real environment of the audio sending end where a speaker is located in actual use, so that an audio noise reduction model trained on such samples performs well.
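Constructing a noisy training sample from noise-free audio and background noise can be sketched as follows. Mixing at a chosen signal-to-noise ratio is an assumption of this sketch (the text does not say how the two signals are combined), and the function name `mix_at_snr`, the 16 kHz sample rate, and the synthetic signals are illustrative only.

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Add `noise` to `clean`, scaled so the mixture has the requested SNR in dB."""
    clean = np.asarray(clean, dtype=np.float64)
    noise = np.asarray(noise, dtype=np.float64)
    # Repeat and trim the noise so that it covers the whole clean signal.
    reps = int(np.ceil(len(clean) / len(noise)))
    noise = np.tile(noise, reps)[: len(clean)]
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2)
    # Gain that brings the noise power to clean_power / 10^(snr_db / 10).
    gain = np.sqrt(clean_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return clean + gain * noise

rng = np.random.default_rng(0)
sr = 16000                                    # assumed sample rate
t = np.arange(sr) / sr
clean = 0.5 * np.sin(2 * np.pi * 440.0 * t)   # stand-in for noise-free speech
noise = rng.standard_normal(sr // 2)          # stand-in for background noise
noisy = mix_at_snr(clean, noise, snr_db=5.0)
```

Varying `snr_db` and the noise recording across samples is one simple way to expose a model to a range of noise conditions.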

In one example, the noisy audio is first divided into frames using a Hanning window, with a frame length of W milliseconds (e.g., W = 64 ms) and a frame shift of H milliseconds (e.g., H = 8 ms), and a short-time Fourier transform is applied to each frame to obtain the spectral features.
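The framing in this example can be sketched with NumPy. The 16 kHz sample rate is an assumption (the text gives only the 64 ms frame length and 8 ms frame shift), as are the function name `stft_frames` and the test tone.

```python
import numpy as np

def stft_frames(signal, sample_rate=16000, win_ms=64, hop_ms=8):
    """Split `signal` into Hanning-windowed frames and return their spectra."""
    win_len = sample_rate * win_ms // 1000   # 1024 samples at 16 kHz
    hop_len = sample_rate * hop_ms // 1000   # 128 samples at 16 kHz
    window = np.hanning(win_len)
    # Number of full frames that fit in the signal (no padding).
    n_frames = 1 + (len(signal) - win_len) // hop_len
    frames = np.stack([
        signal[i * hop_len : i * hop_len + win_len] * window
        for i in range(n_frames)
    ])
    # One-sided FFT of each frame: shape (n_frames, win_len // 2 + 1).
    return np.fft.rfft(frames, axis=1)

sr = 16000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 1000.0 * t)   # 1 kHz test tone, one second long
spec = stft_frames(tone, sr)
```

With these settings each frame overlaps its neighbor by 56 ms, so one second of audio yields 118 frames of 513 frequency bins each.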

In one embodiment, the method for training an audio noise reduction model provided by the embodiments of the present disclosure includes steps S101 to S105, where step S101, inputting the spectral features of the noisy audio into the noise token model to obtain the noise features, may include:

inputting the spectral features of the noisy audio into a first network of the noise token model to extract a first feature; and

operating on the first feature and a high-dimensional noise matrix of the noise token model to obtain the noise feature, where the high-dimensional noise matrix is constructed from high-dimensional features of different types of noise.

It should be noted that the first feature, the noise feature, and the high-dimensional features of different types of noise in the embodiments of the present disclosure may be represented in the form of a vector. That is, the first feature may be understood as a first vector, the noise feature may be understood as a noise vector (noise embedding), and the high-dimensional features of different types of noise may be understood as high-dimensional vectors of different types of noise.

Operating on the first feature and the high-dimensional noise matrix of the noise token model can be understood as multiplying the first feature by, or adding it to, the high-dimensional noise matrix.

Because the high-dimensional noise matrix is constructed from the high-dimensional features of different types of noise, it may cover the high-dimensional feature of the noise in the noisy audio. If it does, the noise feature of the noisy audio can be accurately extracted through the matrix. If it does not, the matrix can still yield a noise feature that represents the noisy audio, because the high-dimensional features of the different noise types in the matrix can be combined to approximate the feature of the unknown noise.

According to the disclosed scheme, because a noise token model is built into the trained audio noise reduction model, environmental noise is modeled effectively. In addition, introducing a high-dimensional noise matrix constructed from the high-dimensional features of different types of noise allows the matrix to serve as a vector template for obtaining the noise feature of known or unknown noise. Operating on the first feature and the high-dimensional noise matrix maps the spectral features of the noisy audio to a corresponding subspace, so that the noise is modeled more accurately and a more accurate noise feature is obtained.

In one embodiment, the method for training an audio noise reduction model provided by the embodiments of the present disclosure includes steps S101 to S105 and, as shown in FIG. 4, the first network consists of N₁ two-dimensional convolutional layers (Conv2D), a long short-term memory network (LSTM), and a multi-head attention layer connected in sequence, where N₁ is a positive integer. In the example shown in FIG. 4, N₁ = 6.

The multi-head attention layer multiplies the high-dimensional noise matrix by the output of the long short-term memory network, yielding a set of coefficients (e.g., 0.20, 0.1, 0.7 in FIG. 4), each of which can represent the probability that the noise corresponds to the high-dimensional feature of a particular noise type in the high-dimensional noise matrix. The high-dimensional features of the noise types in the matrix (e.g., the high-dimensional features of class A noise, class B noise, and class C noise in FIG. 4) are then weighted by these coefficients to obtain the noise feature.

According to the disclosed scheme, the noise token model comprises two-dimensional convolutional layers, a long short-term memory network, a multi-head attention layer, and a high-dimensional noise matrix, so that it can more accurately extract the noise feature of the noisy audio.

In one specific example, the calculation of the multi-head attention layer involves three variables Q, K, and V, where Q is the output vector of the long short-term memory network layer, and K and V are both the high-dimensional noise matrix. Multiplying Q by K gives a set of correlation coefficients, which are then used to weight V, thereby characterizing the noise feature of the noisy audio.
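The Q, K, V computation can be illustrated with a minimal single-head sketch. Several choices here are assumptions rather than details from the text: the softmax normalization and the 1/sqrt(d) scaling follow the common scaled dot-product form, only one head is shown instead of a full multi-head layer, and the toy 3 x 8 noise matrix stands in for real noise templates.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    z = np.exp(x - np.max(x))
    return z / np.sum(z)

def noise_embedding(query, noise_matrix):
    """Weight the rows of `noise_matrix` (used as both K and V) by their match to `query`."""
    d = noise_matrix.shape[1]
    scores = noise_matrix @ query / np.sqrt(d)   # Q multiplied by K, one score per noise type
    weights = softmax(scores)                    # correlation coefficients that sum to 1
    return weights, weights @ noise_matrix       # coefficients applied to V

# Toy templates: three noise types with 8-dimensional one-hot features.
noise_matrix = np.eye(3, 8)
# Hypothetical LSTM output that resembles the third noise type.
query = 2.0 * noise_matrix[2]
weights, embedding = noise_embedding(query, noise_matrix)
```

The returned `weights` play the role of the coefficient set in FIG. 4, and `embedding` is the resulting noise feature for the frame.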

In the embodiment of the disclosure, the input of the noise token model is the spectral feature of each frame of audio, the output is the noise feature of the frame, and then the noise feature is input to the noise reduction model, so as to assist in training the perception capability of the audio noise reduction model on the environmental noise, and further perform targeted noise reduction on the audio.

In one embodiment, the method for training an audio noise reduction model provided by the embodiments of the present disclosure includes steps S101 to S105, where the process of constructing the high-dimensional noise matrix includes:

inputting different types of noise, respectively, into a preset noise identification network;

extracting high-dimensional features of the different types of noise from an underlying network layer of the preset noise identification network; and

constructing the high-dimensional noise matrix from the high-dimensional features of the different types of noise.

It should be noted that the preset noise identification network may adopt any network structure in the prior art, as long as the noise can be identified.

The high-dimensional features of the different types of noise may be represented in a vector manner, i.e. the high-dimensional features of the different types of noise may be understood as high-dimensional vectors of the different types of noise. That is, the high-dimensional noise matrix may be constructed based on high-dimensional vectors of different types of noise.

According to the disclosed scheme, because the features are extracted from an underlying network layer of the preset noise identification network, the resulting high-dimensional features of the different types of noise represent the noise more accurately, so that the constructed high-dimensional noise matrix performs better in application.

In one example, the preset noise identification network may consist of a 5-layer TDNN (time delay neural network) and/or an LSTM. The noise identification network is obtained by training it to classify different types of input noise. Each type of noise is then input into the noise identification network, and a vector is extracted from the last or second-to-last layer as the high-dimensional vector representing that noise type, giving a vector template for each type of noise. The noise used to train the noise identification network includes open-source noise data (e.g., Aurora2, HuCorpus) and various types of noise such as airports, restaurants, streets, stations, car interiors, exhibitions, and room reverberation.

In one example, 16 classes of noise may be used: a 256-dimensional vector is extracted for each class from the last or second-to-last layer of the noise identification network, and the high-dimensional noise matrix is then constructed from the 256-dimensional vectors of the 16 noise classes.
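Assembling the matrix in this example can be sketched as stacking one 256-dimensional vector per noise class into a 16 x 256 array. The recognizer below is a hypothetical stand-in (a single random projection followed by tanh), not the TDNN/LSTM network described above, and averaging over several clips per class is an assumption of the sketch.

```python
import numpy as np

N_CLASSES, FEAT_DIM, EMB_DIM = 16, 40, 256
rng = np.random.default_rng(0)
# Hypothetical stand-in for the trained recognizer's second-to-last layer.
W = rng.standard_normal((FEAT_DIM, EMB_DIM)) / np.sqrt(FEAT_DIM)

def penultimate_embedding(features):
    """Map the (frames, FEAT_DIM) features of one clip to a single EMB_DIM vector."""
    return np.tanh(features @ W).mean(axis=0)

def build_noise_matrix(clips_by_class):
    """Average the clip embeddings of each class and stack one row per class."""
    rows = []
    for clips in clips_by_class:
        vectors = np.stack([penultimate_embedding(c) for c in clips])
        rows.append(vectors.mean(axis=0))
    return np.stack(rows)

# Synthetic data: 4 clips of 50 frames for each of the 16 noise classes.
clips_by_class = [
    [rng.standard_normal((50, FEAT_DIM)) + k for _ in range(4)]
    for k in range(N_CLASSES)
]
noise_matrix = build_noise_matrix(clips_by_class)
```

Each row of `noise_matrix` then acts as the vector template for one noise class.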

In one embodiment, the method for training an audio noise reduction model provided by the embodiment of the present disclosure includes steps S101 to S105, where step S102, inputting the spectral features and the noise features into a noise reduction model to obtain noise-reduced audio, may include:

concatenating the spectral features and the noise features and inputting the result into an encoder (Encoder) of the noise reduction model, where the spectral features are obtained from the noisy audio by a short-time Fourier transform (STFT);

performing audio reconstruction on the output of the encoder using a decoder (Decoder) of the noise reduction model; and

performing an inverse short-time Fourier transform (Inverse STFT) on the reconstructed audio to obtain the noise-reduced audio.

According to the disclosed scheme, audio noise reduction can be realized effectively by combining an encoder, a decoder, and Fourier transforms.
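The last step, turning a reconstructed spectrum back into a waveform, can be sketched as a weighted overlap-add inverse STFT. In this sketch the encoder and decoder are replaced by an identity stand-in, the reconstructed spectrum is represented by separate real and imaginary arrays, and the 16 kHz sample rate and helper names are assumptions; it only demonstrates that the transform pair is invertible.

```python
import numpy as np

def stft(x, win, hop):
    """Windowed one-sided STFT without padding."""
    n = 1 + (len(x) - len(win)) // hop
    frames = np.stack([x[i * hop : i * hop + len(win)] * win for i in range(n)])
    return np.fft.rfft(frames, axis=1)

def istft(spec, win, hop):
    """Inverse STFT by weighted overlap-add, normalized by the summed squared window."""
    frames = np.fft.irfft(spec, n=len(win), axis=1)
    length = (len(frames) - 1) * hop + len(win)
    out = np.zeros(length)
    norm = np.zeros(length)
    for i, frame in enumerate(frames):
        start = i * hop
        out[start : start + len(win)] += frame * win
        norm[start : start + len(win)] += win ** 2
    # The window is zero at its ends, so guard the first and last samples.
    return out / np.maximum(norm, 1e-8)

rng = np.random.default_rng(0)
win, hop = np.hanning(1024), 128          # 64 ms window, 8 ms shift at 16 kHz
x = rng.standard_normal(16000)
spec = stft(x, win, hop)
# Identity stand-in for the model: it returns the real and imaginary parts unchanged.
real_part, imag_part = spec.real, spec.imag
recon = istft(real_part + 1j * imag_part, win, hop)
```

Away from the first and last window length, where the Hanning window tapers to zero, `recon` matches `x` to floating-point precision.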

In one embodiment, the method for training an audio noise reduction model provided by the embodiments of the present disclosure includes steps S101 to S105, as shown in fig. 5, where the noise reduction model includes an encoder, a long-short term memory network, and a decoder.

The encoder consists of N₂ groups connected in sequence, each comprising a two-dimensional convolution layer (Conv2D), a layer normalization layer (LayerNorm), and a rectified linear unit (ReLU), where N₂ is a positive integer.

There are N₃ long short-term memory (LSTM) networks, where N₃ is a positive integer.

The decoder comprises a real part decoder (real Decoder) and an imaginary part decoder (imaginary Decoder) arranged in parallel, each consisting of N₄ two-dimensional convolution layers, where N₄ is a positive integer.

The rectified linear unit of the encoder is connected to the input of the first long short-term memory network, the output of the last long short-term memory network is connected to the real part decoder and the imaginary part decoder respectively, and the encoder is also directly connected to the real part decoder and the imaginary part decoder (skip connection).

According to the scheme of the present disclosure, the noise reduction model comprises an encoder, long short-term memory networks, and a decoder, and the encoder is directly connected to the real part decoder and the imaginary part decoder. This allows the noise reduction model to perform noise reduction more effectively and makes the output noise-reduced audio more useful to the generative adversarial network, which safeguards the training of the whole audio noise reduction model.

In one example, the Encoder includes 5 layers of Conv2D, LayerNorm, and ReLU. The real Decoder includes 5 layers of Conv2D, and the imaginary Decoder includes 5 layers of Conv2D. The noise reduction model includes the Encoder, a 2-layer LSTM, the imaginary Decoder, and the real Decoder.

The direct connection of the encoder to the real decoder can be understood as follows: the output of the 1st convolutional layer (Conv2D) of the encoder is input to the 5th convolutional layer (Conv2D) of the real decoder, the output of the 2nd convolutional layer of the encoder is input to the 4th convolutional layer of the real decoder, and so on. The direct connection between the encoder and the imaginary decoder works in the same way and is not described again here.
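The pairing rule for the skip connections can be written down directly. The following sketch only enumerates which encoder layer feeds which decoder layer (1-based indices); the helper name `skip_pairs` is hypothetical.

```python
from typing import List, Tuple


def skip_pairs(num_layers: int) -> List[Tuple[int, int]]:
    """Return (encoder_layer, decoder_layer) pairs: layer i feeds layer num_layers + 1 - i."""
    return [(i, num_layers + 1 - i) for i in range(1, num_layers + 1)]


pairs = skip_pairs(5)
```

For the 5-layer example, the first pairs are (1, 5) and (2, 4), matching the description above; the same pairs apply to the imaginary decoder.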

The decoder reconstructs a clean speech signal based on a complex ideal ratio mask (CIRM), that is, it predicts a real part (real) and an imaginary part (imaginary), and then inverse short-time Fourier transform (inverse STFT) is performed to obtain the noise-reduced audio signal. This approach yields a better audio noise reduction effect.
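One common way to use a complex ratio mask is to multiply it, bin by bin, with the complex spectrum of the noisy audio. The sketch below assumes that the real and imaginary decoders output the two parts of such a mask; the disclosure does not spell out this step, and the function name and toy values are hypothetical.

```python
from typing import List


def apply_complex_mask(
    noisy_spec: List[List[complex]],
    mask_real: List[List[float]],
    mask_imag: List[List[float]],
) -> List[List[complex]]:
    """Multiply each time-frequency bin of the noisy spectrum by a complex mask value."""
    enhanced = []
    for frame, real_row, imag_row in zip(noisy_spec, mask_real, mask_imag):
        enhanced.append([y * complex(mr, mi) for y, mr, mi in zip(frame, real_row, imag_row)])
    return enhanced


noisy = [[complex(1.0, 1.0), complex(2.0, 0.0)]]  # 1 frame x 2 bins (toy values)
enhanced = apply_complex_mask(noisy, [[0.5, 1.0]], [[0.5, 0.0]])
```

Because the mask is complex, it can change both the magnitude and the phase of each bin before the inverse STFT is applied.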

In one embodiment, the method for training an audio noise reduction model provided by the embodiment of the present disclosure includes steps S101 to S105, where step S103, inputting the noise-reduced audio into the generative adversarial network to obtain an audio prediction value and an audio true value, may include:

Step S1031: inputting the noise-reduced audio and the background noise audio of the audio receiving end into a generator of the generative adversarial network to obtain enhanced audio.

Step S1032: inputting the enhanced audio and the noise-reduced audio into a discriminator of the generative adversarial network to obtain an audio prediction value.

Step S1033: obtaining an audio true value based on the enhanced audio and the noise-reduced audio by using a preset function.

It should be noted that the noise-reduced audio input into the generator of the generative adversarial network may be understood as the spectral features of the noise-reduced audio, and the background noise audio of the audio receiving end input into the generator may be understood as the spectral features of the background noise audio.

Noise exists not only at the audio sending end (speaker side) but also at the audio receiving end (listener side). Most noise reduction technologies consider only one end and do not analyze and optimize the two ends jointly to improve audio quality and intelligibility. The scheme of the present disclosure combines the background noise audio of the audio sending end and the audio receiving end and models the noisy audio end to end in a unified way, which on the one hand reduces background noise interference and improves audio (speech) quality, and on the other hand improves speech intelligibility. Because the background noise audio of the audio receiving end is input into the generator, the generator can perceive the environmental noise in advance, so that the generated audio suits the surrounding environment and blends with it better.

In one example, the preset function may employ a complementary cumulative distribution function (Q function).
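For reference, the Q function is the tail probability of the standard normal distribution, Q(x) = 0.5 * erfc(x / sqrt(2)). The sketch below only evaluates this function; how the disclosure maps audio to its argument is not stated here, so no such mapping is assumed.

```python
import math


def q_function(x: float) -> float:
    """Complementary cumulative distribution function of the standard normal distribution."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


half = q_function(0.0)   # probability mass above the mean
tail = q_function(1.0)   # probability mass above one standard deviation
```

The function decreases monotonically from 1 to 0, so its output always lies in the same interval as a sigmoid output.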

In one embodiment, the method for training an audio noise reduction model provided by the embodiment of the present disclosure includes steps S101 to S105, where step S1032 of step S103, inputting the enhanced audio and the noise-reduced audio into a discriminator of the generative adversarial network to obtain an audio prediction value, may include:

inputting the enhanced audio and the noise-reduced audio into a first discriminator of the generative adversarial network to obtain an audio quality prediction value; and

inputting the enhanced audio and the noise-reduced audio into a second discriminator of the generative adversarial network to obtain an audio intelligibility prediction value.

Step S1033, obtaining an audio true value based on the enhanced audio and the noise-reduced audio by using a preset function, may include:

obtaining an audio quality true value based on the enhanced audio and the noise-reduced audio by using a first preset function; and

obtaining an audio intelligibility true value based on the enhanced audio and the noise-reduced audio by using a second preset function.

The first discriminator is configured to discriminate the quality of the enhanced audio output by the generator. The second discriminator is configured to discriminate the intelligibility of the enhanced audio output by the generator. The first preset function and the second preset function may each adopt a complementary cumulative distribution function.

The criteria used in the embodiments of the present disclosure for measuring speech quality and intelligibility include Perceptual Evaluation of Speech Quality (PESQ, score range -0.5 to 4.5), Virtual Speech Quality Objective Listener (ViSQOL, score range 1 to 5), Short-Time Objective Intelligibility (STOI, score range 0 to 1), the Coherence Speech Intelligibility Index (CSII, score range 0 to +∞), and the like. Because these criteria are complex and non-differentiable, it is not easy to optimize the audio against them directly. The scheme of the present disclosure therefore includes a generative adversarial network in the audio noise reduction model: the generator produces the desired audio, the first discriminator and the second discriminator judge whether the generated audio meets expectations or how far it falls short, and the generative adversarial network is then optimized toward the preset target, so that it gains the ability both to discriminate and to generate. In this scheme, the conventional ability to measure the quality and intelligibility of a speech signal is given to the first discriminator and the second discriminator, so that the audio noise reduction model can assess the quality and intelligibility of the audio, which effectively improves the listener's experience.

In one example, FIG. 6 shows the overall structure of the audio noise reduction model to be trained, which comprises a noise token model, a noise reduction model, and a generative adversarial network. The audio sending end constructs the noisy audio from the noise-free audio (the voice of a speaker, or audio or video played by a terminal) and the background noise. The spectral features of the noisy audio are input into the noise token model and the noise reduction model. The noise token model inputs the noise features obtained from the noisy audio into the noise reduction model. The noise reduction model inputs the noise-reduced audio, obtained from the noise features and the spectral features of the noisy audio, into the generator of the generative adversarial network. The generator inputs the resulting enhanced audio into the first discriminator and the second discriminator of the generative adversarial network, and the noise-reduced audio output by the noise reduction model is input into the first discriminator and the second discriminator at the same time. A first loss value is obtained with the first loss function based on the noise-free audio and the noise-reduced audio, a second loss value with the second loss function based on the first discriminator and the first preset function, and a third loss value with the third loss function based on the second discriminator and the second preset function. A fourth loss function is constructed from these loss values to calculate a fourth loss value, and the parameters of the noise token model, the noise reduction model, and the generative adversarial network (the generator, the first discriminator, and the second discriminator) are adjusted based on the first, second, third, and fourth loss values.

In one embodiment, the method for training an audio noise reduction model provided by the embodiments of the present disclosure includes steps S101 to S105. As shown in fig. 7, the generator is composed of N₅ second networks, a fully connected layer (FC), an exponential activation layer (Exponential Activation), and an energy normalization layer (Energy Normalization) connected in sequence, where N₅ is a positive integer. Each second network is composed of a one-dimensional convolution layer (Conv1D), a layer normalization layer (LayerNorm), and a rectified linear unit (ReLU) connected in sequence. The input to the second network is the spectrum of the noise-reduced audio obtained by short-time Fourier transform (STFT).

According to the scheme of the disclosure, the generator comprises the convolution layer, the normalization layer, the rectified linear unit, the fully connected layer, the exponential activation layer, and the energy normalization layer, so that the enhanced audio output by the generator is of higher quality.

In one example, the generator structure is {{Conv1D + LayerNorm + ReLU} × 6 + FC + Exponential Activation + Energy Normalization}. The input is the concatenation of the spectrum of the noise-reduced audio and the spectrum of the background noise audio at the audio receiving end. After the convolution layers and the fully connected layer, the spectrum of the audio is adjusted by the exponential activation function of the exponential activation layer, and the adjusted spectrum undergoes energy normalization and inverse short-time Fourier transform (inverse STFT) to obtain the enhanced audio.

Exponential activation function, as follows:

α＝exp(4*tanh(u))

where u is the output of the FC layer and α is an amplification factor. Because tanh(u) lies between -1 and 1, α is bounded between exp(-4) ≈ 0.02 and exp(4) ≈ 55. The factor is then multiplied by the noise-reduced spectrum: a value greater than 1 enhances the spectral energy, and a value less than 1 reduces it.
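The exponential activation and its effect on the spectrum can be sketched as follows. The formula is taken from the text above; the helper names and the per-bin multiplication of magnitudes are illustrative assumptions.

```python
import math
from typing import List


def amplification(u: float) -> float:
    """Exponential activation: alpha = exp(4 * tanh(u)), bounded by exp(-4) and exp(4)."""
    return math.exp(4.0 * math.tanh(u))


def scale_spectrum(magnitudes: List[float], fc_outputs: List[float]) -> List[float]:
    """Multiply each spectral magnitude by the amplification factor of its FC output."""
    return [m * amplification(u) for m, u in zip(magnitudes, fc_outputs)]


unchanged = scale_spectrum([2.0], [0.0])        # u = 0 gives alpha = 1 (no change)
boosted = scale_spectrum([2.0], [0.5])          # u > 0 gives alpha > 1 (enhancement)
attenuated = scale_spectrum([2.0], [-0.5])      # u < 0 gives alpha < 1 (reduction)
```

The tanh keeps the exponent within [-4, 4], which prevents the amplification factor from growing without bound.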

In one embodiment, the training method of the audio noise reduction model provided by the embodiment of the present disclosure includes steps S101 to S105, wherein the generative adversarial network includes a first discriminator and a second discriminator arranged in parallel, the first discriminator being used to calculate the audio quality prediction value and the second discriminator being used to calculate the audio intelligibility prediction value.

Fig. 8 shows the structure of the first discriminator or the second discriminator. The first discriminator and the second discriminator are each composed of N₆ third networks, a global average pooling layer (GAPooling), a first fully connected layer, a leaky rectified linear unit (LeakyReLU), and a second fully connected layer connected in sequence, where N₆ is a positive integer. Each third network is composed of a two-dimensional convolution layer (Conv2D), a layer normalization layer (LayerNorm), and a rectified linear unit (ReLU) connected in sequence.

According to the scheme of the disclosure, the structure of the discriminators (the first discriminator and the second discriminator) comprises the two-dimensional convolution layer, the normalization layer, the rectified linear unit, the global average pooling layer, the first fully connected layer, the leaky rectified linear unit, and the second fully connected layer, so that the discriminators can predict the audio scores more accurately.

In one example, the first and second discriminators have the same structure, each composed of {Concat + {Conv2D + LayerNorm + ReLU} × 5 + GAPooling + FC + LeakyReLU + FC + Sigmoid}. The input is the noise-reduced audio and the enhanced audio output by the generator, and the output is the predicted scores, corresponding to the PESQ and ViSQOL criteria and to the STOI and CSII criteria respectively, each output being 2-dimensional. The audio true value is the exact value obtained by directly applying these criteria with the preset function (that is, the true value calculated according to the conventional criteria). Training brings the prediction of the discriminator close to the target true value, so that the discriminator can evaluate the quality and intelligibility of the audio.

Concat and Sigmoid are functions that operate on the input and the output, respectively, of the discriminators (the first and second discriminators).

In one example, as shown in fig. 9, the generative adversarial network operates as follows: the generator inputs the resulting enhanced audio into the first discriminator and the second discriminator of the generative adversarial network, and the noise-reduced audio output by the noise reduction model is input into the first discriminator and the second discriminator at the same time.

In one embodiment, the method for training an audio noise reduction model provided by the embodiment of the present disclosure includes steps S101 to S105, where step S104, adjusting the parameters of the noise token model, the noise reduction model, and the generative adversarial network by using the loss function according to the noise-reduced audio, the noise-free audio corresponding to the noisy audio, the audio prediction value, the audio true value, and the audio preset value, may include:

calculating a first loss value according to the noise-reduced audio and the noise-free audio corresponding to the noisy audio by using a first loss function;

calculating a second loss value according to the audio quality prediction value and the audio quality true value and/or the audio quality prediction value and the audio quality preset value by using a second loss function;

calculating a third loss value according to the audio intelligibility prediction value and the audio intelligibility true value and/or the audio intelligibility prediction value and the audio intelligibility preset value by using a third loss function;

calculating a fourth loss value from the first loss value, the second loss value, and the third loss value by using a fourth loss function; and

adjusting the parameters of the noise token model, the noise reduction model, and the generative adversarial network according to the first loss value, the second loss value, the third loss value, and the fourth loss value.

The first loss function, the second loss function, the third loss function, and the fourth loss function may be selected and adjusted as needed, and are not particularly limited herein.

According to the scheme of the disclosure, the first, second, and third loss functions allow the parameters of the noise token model, the noise reduction model, and the generative adversarial network to be adjusted individually, and the fourth loss function, which builds on the first three, allows these parameters to be adjusted globally. The overall parameters of the audio noise reduction model can thus be optimized and model training completed more quickly.

In one example, the first loss function may use the scale-invariant signal-to-noise ratio (SI-SNR), and the training goal for the first loss function is to maximize the SI-SNR. The calculation formula is as follows:

s_target = (<ŝ, s> · s) / ||s||²

e_noise = ŝ - s_target

SI-SNR = 10 · log10(||s_target||² / ||e_noise||²)

where ŝ represents the noise-reduced audio, s represents the noise-free audio, and ||s||² = <s, s> represents the signal power. Zero-mean normalization is applied to the reconstructed signal to ensure scale invariance.
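A minimal sketch of an SI-SNR computation, following the commonly used definition of the metric (project the estimate onto the clean reference, treat the remainder as noise). The small constant `eps` for numerical stability and the removal of the mean from both signals are assumptions of this sketch, and the toy signals are hypothetical.

```python
import math
from typing import Sequence


def si_snr(estimate: Sequence[float], reference: Sequence[float], eps: float = 1e-8) -> float:
    """Scale-invariant SNR in dB between an estimate and a clean reference."""
    if len(estimate) != len(reference) or not reference:
        raise ValueError("signals must be non-empty and of equal length")
    # Remove the mean from both signals (zero-mean normalization).
    est_mean = sum(estimate) / len(estimate)
    ref_mean = sum(reference) / len(reference)
    est = [e - est_mean for e in estimate]
    ref = [r - ref_mean for r in reference]
    # Project the estimate onto the reference: s_target = <est, ref> * ref / ||ref||^2.
    dot = sum(e * r for e, r in zip(est, ref))
    ref_energy = sum(r * r for r in ref) + eps
    target = [dot / ref_energy * r for r in ref]
    # Whatever the reference does not explain counts as noise.
    noise = [e - t for e, t in zip(est, target)]
    target_energy = sum(t * t for t in target)
    noise_energy = sum(n * n for n in noise)
    return 10.0 * math.log10((target_energy + eps) / (noise_energy + eps))


clean = [1.0, -1.0, 2.0, -2.0]
close = [1.1, -0.9, 2.0, -2.1]   # small deviation from the clean signal
far = [2.0, 0.0, 1.0, -3.0]      # large deviation from the clean signal
```

Since training maximizes the SI-SNR, a loss to be minimized would typically be its negative; rescaling the estimate by a constant leaves the value essentially unchanged.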

In one example, the second and third loss functions may use the mean squared error (MSE) as the loss function.

In one example, the loss for training the whole model consists of three parts. One part concerns the noise reduction model at the audio sending end and aims to improve the purity of the audio signal in terms of signal-to-noise ratio, thereby reducing the influence of noise. The other two parts concern the generative adversarial network at the audio receiving end: they aim to give the first discriminator and the second discriminator the conventional ability to measure the quality and intelligibility of a speech signal, and to make the generator produce the desired audio toward a set target. This is achieved by means of deep learning, and the loss introduced is as follows:

L = L_int + α*L_qua + β*L_sisnr

where α and β are hyperparameters set empirically (e.g., α = 0.6 and β = 0.05). L denotes the fourth loss function, L_sisnr is the first loss function (scale-invariant signal-to-noise ratio), L_qua is the second loss function (audio quality), and L_int is the third loss function (audio intelligibility).
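The weighted combination can be sketched as follows, using the example weights from the text. The function names are hypothetical, and the `mse` helper reflects the example in which the quality and intelligibility losses use the mean squared error.

```python
from typing import Sequence


def mse(predicted: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between predicted scores and target scores."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(predicted)


def total_loss(l_int: float, l_qua: float, l_sisnr: float,
               alpha: float = 0.6, beta: float = 0.05) -> float:
    """Combined loss L = L_int + alpha * L_qua + beta * L_sisnr."""
    return l_int + alpha * l_qua + beta * l_sisnr


example_mse = mse([1.0, 3.0], [0.0, 1.0])   # (1 + 4) / 2
example_total = total_loss(1.0, 2.0, 4.0)   # 1.0 + 0.6 * 2.0 + 0.05 * 4.0
```

The small default for `beta` keeps the SI-SNR term, which is measured in dB and can be numerically large, from dominating the two score-based terms.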

In another embodiment, the method for training an audio noise reduction model provided by the embodiment of the present disclosure includes steps S101 to S105, where step S104, adjusting the parameters of the noise token model, the noise reduction model, and the generative adversarial network by using the loss function according to the noise-reduced audio, the noise-free audio corresponding to the noisy audio, the audio prediction value, the audio true value, and the audio preset value, may include:

calculating a fifth loss value according to the noise-reduced audio and the noise-free audio corresponding to the noisy audio by using a fifth loss function;

calculating a sixth loss value according to the audio prediction value, the audio true value, and the audio preset value by using a sixth loss function;

calculating a seventh loss value from the fifth loss value and the sixth loss value by using a seventh loss function; and

adjusting the parameters of the noise token model, the noise reduction model, and the generative adversarial network according to the fifth loss value, the sixth loss value, and the seventh loss value.

According to the scheme of the disclosure, the fifth and sixth loss functions allow the parameters of the noise token model, the noise reduction model, and the generative adversarial network to be adjusted individually, and the seventh loss function, which builds on the fifth and sixth, allows these parameters to be adjusted globally. The overall parameters of the audio noise reduction model can thus be optimized and model training completed more quickly.

adjusting, for the i-th time, the parameters of the noise token model, the noise reduction model, and the discriminator of the generative adversarial network by using the loss function according to the noise-reduced audio, the noise-free audio corresponding to the noisy audio, the audio prediction value, the audio true value, and the audio preset value, where i is a positive integer;

obtaining the (i+1)-th noise-reduced audio, audio prediction value, and audio true value from the spectral features of the noisy audio, based on the noise token model, the noise reduction model, and the generative adversarial network after the parameter adjustment; and

adjusting, for the (i+1)-th time, the parameters of the already adjusted noise token model and noise reduction model and of the generator of the generative adversarial network by using the loss function according to the noise-free audio corresponding to the noisy audio, the audio preset value, and the (i+1)-th noise-reduced audio, audio prediction value, and audio true value.

According to the scheme of the disclosure, the generator and the discriminator are trained alternately. In one iteration, the loss is calculated according to the sixth loss function, and the discriminator is then updated by backpropagation according to the stochastic gradient descent criterion (as shown in fig. 10); the generator is not updated for the time being, while the far-end noise reduction model and the noise token model are both updated. The parameters of the discriminator are then fixed. In the next iteration, after the prediction score of the discriminator is calculated, the loss between the prediction score and the corresponding audio preset value is calculated; with the discriminator fixed, the generator is updated by backpropagation (as shown in fig. 11), and the noise reduction model and the noise token model are updated as well. This process is repeated continuously. The loss for the discriminator is calculated from the difference between the audio prediction value and the audio true value computed by the conventional criteria, while the loss for the generator is calculated from the difference between the audio prediction value and the audio preset value (e.g., the audio preset value is 1, the maximum value of the sigmoid).
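The alternating schedule can be sketched as follows. This is a simplified illustration that assumes strict even/odd alternation and a squared-error loss on a single score; the component names and both helper functions are hypothetical.

```python
from typing import Dict, List


def training_phase(iteration: int) -> Dict[str, List[str]]:
    """Return which components are updated and which are frozen in a given iteration."""
    shared = ["noise_reduction_model", "noise_token_model"]  # updated in both phases
    if iteration % 2 == 0:
        # Discriminator phase: the generator is left untouched.
        return {"update": ["discriminators"] + shared, "frozen": ["generator"]}
    # Generator phase: the discriminators are held fixed.
    return {"update": ["generator"] + shared, "frozen": ["discriminators"]}


def phase_loss(prediction: float, true_value: float, preset_value: float, iteration: int) -> float:
    """Squared error against the true value (discriminator phase) or the preset value (generator phase)."""
    target = true_value if iteration % 2 == 0 else preset_value
    return (prediction - target) ** 2


discriminator_loss = phase_loss(0.5, 0.25, 1.0, iteration=0)  # target is the true value 0.25
generator_loss = phase_loss(0.5, 0.25, 1.0, iteration=1)      # target is the preset value 1.0
```

Using the preset value 1 as the generator target pushes the generator toward audio that the discriminators score as highly as the sigmoid output allows.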

In one embodiment, the method for training an audio noise reduction model provided by the embodiment of the present disclosure includes steps S101 to S105, where step S105, obtaining the trained audio noise reduction model when the noise token model, the noise reduction model, and the generative adversarial network have all been adjusted to convergence, may include:

when the noise token model, the noise reduction model, and the generative adversarial network have all been adjusted to convergence, obtaining the trained audio noise reduction model based on the noise token model, the noise reduction model, and the generator of the generative adversarial network.

An embodiment of the present disclosure provides a method for audio noise reduction, as shown in fig. 12, which is a flowchart of the method for audio noise reduction of the present embodiment, and the method may include the following steps:

Step S1201: processing the target noisy audio at the audio sending end by using the pre-trained audio noise reduction model.

Step S1202: sending the noise-reduced enhanced audio produced by the pre-trained audio noise reduction model to the audio receiving end. The pre-trained audio noise reduction model is obtained by the training method of the audio noise reduction model of any embodiment of the disclosure.

According to the scheme disclosed by the invention, the audio noise reduction model is obtained by using the training method of the audio noise reduction model in any embodiment of the invention, so that the noise interference of an audio transmitting end and an audio receiving end can be effectively eliminated, the audio quality is enhanced, the audio intelligibility is improved, and the listener experience of the audio receiving end is ensured.

Fig. 13 shows an application scenario of the audio noise reduction method according to the embodiment of the present disclosure. The audio sending end is the real environment where the speaker is located, and the audio receiving end is the real environment where the listener is located. The audio sending end includes the voice audio (Hello) spoken by the speaker and the far-end noise in the speaker's real environment (the far end being the audio sending end). The audio receiving end includes the enhanced audio received by the listener and the near-end noise in the listener's real environment (the near end being the audio receiving end). The enhanced audio is obtained by processing the noisy audio sent by the audio sending end (the speaker's voice audio plus the far-end noise) with the audio noise reduction model trained by the training method of the present disclosure.

In one embodiment, the method for audio noise reduction provided by the embodiment of the present disclosure includes steps S1201 and S1202, where step S1201, processing the target noisy audio at the audio sending end by using the pre-trained audio noise reduction model, may include:

inputting the spectral features of the target noisy audio at the audio sending end into the noise token model of the pre-trained audio noise reduction model to obtain the noise features of the target noisy audio;

inputting the noise features and the spectral features of the target noisy audio into the noise reduction model of the pre-trained audio noise reduction model to obtain the target noise-reduced audio; and

inputting the target noise-reduced audio into the generator of the generative adversarial network of the pre-trained audio noise reduction model to obtain the noise-reduced enhanced audio.
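The three inference steps above chain together as a simple pipeline. In the sketch below the three trained components are passed in as callables; the stub functions are toy placeholders (assumptions) that only make the data flow visible and do not perform real noise reduction.

```python
from typing import Callable, List

Features = List[float]


def denoise_and_enhance(
    spectrum: Features,
    noise_token_model: Callable[[Features], Features],
    noise_reduction_model: Callable[[Features, Features], Features],
    generator: Callable[[Features], Features],
) -> Features:
    """Run noise token model -> noise reduction model -> generator on one input."""
    noise_feature = noise_token_model(spectrum)
    denoised = noise_reduction_model(spectrum, noise_feature)
    return generator(denoised)


# Toy stand-ins for the trained components (hypothetical).
def stub_token_model(spectrum: Features) -> Features:
    return [sum(spectrum) / len(spectrum)]


def stub_noise_reduction(spectrum: Features, noise_feature: Features) -> Features:
    return [x - noise_feature[0] for x in spectrum]


def stub_generator(denoised: Features) -> Features:
    return [2.0 * x for x in denoised]


result = denoise_and_enhance([1.0, 2.0, 3.0], stub_token_model, stub_noise_reduction, stub_generator)
```

Only the generator of the generative adversarial network appears in this pipeline; the discriminators are needed during training only.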

An embodiment of the present disclosure provides a training apparatus for an audio noise reduction model, as shown in fig. 14, which is a block diagram of a structure of the training apparatus for an audio noise reduction model of this embodiment, and the apparatus may include:

a feature module 140, configured to input the spectral features of the noisy audio into the noise token model to obtain noise features;

a noise reduction module 141, configured to input the spectral features and the noise features into the noise reduction model to obtain noise-reduced audio;

a calculation module 142, configured to input the noise-reduced audio into the generative adversarial network to obtain an audio prediction value and an audio true value;

an adjustment module 143, configured to adjust, by using the loss function, the parameters of the noise token model, the noise reduction model, and the generative adversarial network according to the noise-reduced audio, the noise-free audio corresponding to the noisy audio, the audio prediction value, the audio true value, and the audio preset value; and

a training module 144, configured to obtain the trained audio noise reduction model when the noise token model, the noise reduction model, and the generative adversarial network have all been adjusted to convergence.

In one embodiment, the training apparatus for an audio noise reduction model further comprises:

a construction module, configured to construct the noisy audio from the background noise audio of the audio sending end and the noise-free audio; and

an acquisition module, configured to acquire the spectral features of the noisy audio by using short-time Fourier transform.

In one embodiment, the feature module 140 includes:

an extraction submodule, configured to input the spectral features of the noisy audio into a first network of the noise token model and extract first features; and

an operation submodule, configured to operate on the first features and the high-dimensional noise matrix of the noise token model to obtain the noise features, wherein the high-dimensional noise matrix is constructed from high-dimensional features of different types of noise.

In one embodiment, the noise reduction module 141 includes:

a first input submodule, configured to input the spectral features and the noise features into an encoder of the noise reduction model, wherein the spectral features are obtained from the noisy audio by short-time Fourier transform;

a reconstruction submodule, configured to perform audio reconstruction on the output of the encoder by using a decoder of the noise reduction model; and

a calculation submodule, configured to perform inverse short-time Fourier transform on the reconstructed audio to obtain the noise-reduced audio.

In one embodiment, the calculation module 142 includes:

a second input submodule, configured to input the noise-reduced audio and the background noise audio of the audio receiving end into a generator of the generative adversarial network to obtain enhanced audio;

a third input submodule, configured to input the enhanced audio and the noise-reduced audio into a discriminator of the generative adversarial network to obtain an audio prediction value; and

a truth-value submodule, configured to obtain an audio true value based on the enhanced audio and the noise-reduced audio by using a preset function.

In one embodiment, the third input submodule is configured to input the enhanced audio and the noise-reduced audio into the first discriminator of the generative adversarial network to obtain an audio quality prediction value, and to input the enhanced audio and the noise-reduced audio into the second discriminator of the generative adversarial network to obtain an audio intelligibility prediction value; and

the truth-value submodule is configured to obtain an audio quality true value based on the enhanced audio and the noise-reduced audio by using a first preset function, and to obtain an audio intelligibility true value based on the enhanced audio and the noise-reduced audio by using a second preset function.

In one embodiment, the adjustment module 143 includes:

The first calculation submodule is configured to calculate a first loss value from the noise-reduced audio and the noise-free audio corresponding to the noisy audio by using a first loss function.

The second calculation submodule is configured to calculate a second loss value from the predicted audio quality value and the true audio quality value, and/or from the predicted audio quality value and a preset audio quality value, by using a second loss function.

The third calculation submodule is configured to calculate a third loss value from the predicted audio intelligibility value and the true audio intelligibility value, and/or from the predicted audio intelligibility value and a preset audio intelligibility value, by using a third loss function.

The fourth calculation submodule is configured to calculate a fourth loss value from the first loss value, the second loss value, and the third loss value by using a fourth loss function.

The first adjusting submodule is configured to adjust parameters of the noise token model, the noise reduction model, and the generative adversarial network according to the first loss value, the second loss value, the third loss value, and the fourth loss value.
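
The loss computation described above can be sketched as follows. This is a minimal illustration rather than the disclosed implementation: the disclosure does not fix the form of the loss functions, so mean squared error for the first loss, squared error for the second and third losses, and a weighted sum for the fourth loss are all assumptions, as are the names `mse`, `loss_values`, and `weights`. The `q_target` and `i_target` arguments stand for either the true value or the preset value, matching the "and/or" wording.

```python
from statistics import fmean


def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return fmean((x - y) ** 2 for x, y in zip(a, b))


def loss_values(denoised, clean, q_pred, q_target, i_pred, i_target,
                weights=(1.0, 1.0, 1.0)):
    """Return the first to fourth loss values (all loss forms are assumptions)."""
    first = mse(denoised, clean)            # noise-reduced vs. noise-free audio
    second = (q_pred - q_target) ** 2       # audio quality term
    third = (i_pred - i_target) ** 2        # audio intelligibility term
    w1, w2, w3 = weights
    fourth = w1 * first + w2 * second + w3 * third  # assumed weighted sum
    return first, second, third, fourth
```

Passing the true value as the target corresponds to scoring the predicted value against the preset function, while passing the preset value corresponds to pushing the predicted value toward a desired score.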

In one embodiment, the adjustment module 143 includes:

The fifth calculation submodule is configured to calculate a fifth loss value from the noise-reduced audio and the noise-free audio corresponding to the noisy audio by using a fifth loss function.

The sixth calculation submodule is configured to calculate a sixth loss value from the predicted audio value, the true audio value, and the preset audio value by using a sixth loss function.

The seventh calculation submodule is configured to calculate a seventh loss value from the fifth loss value and the sixth loss value by using a seventh loss function.

The second adjusting submodule is configured to adjust parameters of the noise token model, the noise reduction model, and the generative adversarial network respectively according to the fifth loss value, the sixth loss value, and the seventh loss value.

In one embodiment, the adjustment module 143 includes:

The third adjusting submodule is configured to adjust, for the i-th time, parameters of the noise token model, the noise reduction model, and the discriminator of the generative adversarial network by using the loss function according to the noise-reduced audio, the noise-free audio corresponding to the noisy audio, the predicted audio value, the true audio value, and the preset audio value, where i is a positive integer.

The obtaining submodule is configured to obtain the (i + 1)-th noise-reduced audio, predicted audio value, and true audio value from the spectral features of the noisy audio, based on the noise token model, the noise reduction model, and the generative adversarial network after the parameter adjustment.

The fourth adjusting submodule is configured to adjust, for the (i + 1)-th time, parameters of the noise token model, the noise reduction model, and the generator of the generative adversarial network after the parameter adjustment by using the loss function according to the noise-free audio corresponding to the noisy audio, the preset audio value, and the (i + 1)-th noise-reduced audio, predicted audio value, and true audio value.
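
The alternation between the i-th and (i + 1)-th adjustments can be sketched as a simple schedule. The sketch assumes that odd-numbered steps play the role of the i-th adjustment and even-numbered steps the role of the (i + 1)-th adjustment; the disclosure only requires that i be a positive integer, and the names `SHARED` and `groups_to_update` are hypothetical.

```python
# Parameter groups adjusted in both kinds of step.
SHARED = ("noise_token_model", "noise_reduction_model")


def groups_to_update(step):
    """Return the parameter groups adjusted at a 1-based training step.

    Assumption: odd steps stand for the i-th adjustment (discriminator side)
    and even steps for the (i + 1)-th adjustment (generator side).
    """
    if step < 1:
        raise ValueError("step must be a positive integer")
    gan_part = "discriminator" if step % 2 == 1 else "generator"
    return SHARED + (gan_part,)
```

Between two consecutive steps, the noise-reduced audio, predicted value, and true value would be recomputed with the freshly adjusted parameters, as the obtaining submodule describes.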

In one embodiment, training module 144 includes:

The training submodule is configured to obtain the trained audio noise reduction model based on the noise token model, the noise reduction model, and the generator of the generative adversarial network once the noise token model, the noise reduction model, and the generative adversarial network have been adjusted to convergence.

In one embodiment, the training method of the audio noise reduction model provided by the embodiments of the present disclosure includes steps S101 to S105, where the first network consists of N₁ two-dimensional convolution layers, a long short-term memory network, and a multi-head attention layer connected in sequence, where N₁ is a positive integer.

In one embodiment, the process of constructing the high-dimensional noise matrix includes:

extracting high-dimensional features of different types of noise from an underlying network layer of the preset noise identification network.
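
One way to picture the high-dimensional noise matrix is as a stack of per-type feature vectors, which the noise token model then combines with the first feature. The sketch below is an illustration under stated assumptions: the disclosure does not specify the operation, so the softmax-weighted sum used here is an assumed stand-in, and the names `build_noise_matrix`, `noise_feature`, and the example noise types are hypothetical.

```python
import math


def build_noise_matrix(embeddings_by_type):
    """Stack one feature vector per noise type into a row-wise matrix."""
    names = sorted(embeddings_by_type)
    matrix = [list(embeddings_by_type[name]) for name in names]
    if len({len(row) for row in matrix}) != 1:
        raise ValueError("all feature vectors must share one dimension")
    return names, matrix


def noise_feature(first_feature, matrix):
    """Softmax-weighted sum of the matrix rows (assumed operation)."""
    scores = [sum(q * k for q, k in zip(first_feature, row)) for row in matrix]
    peak = max(scores)                      # subtract the peak for stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(matrix[0])
    return [sum(w * row[d] for w, row in zip(weights, matrix)) for d in range(dim)]
```

With this choice, a first feature that resembles one noise type pulls the resulting noise feature toward that type's row of the matrix.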

In one embodiment, the noise reduction model includes an encoder, a long short-term memory network, and a decoder.

The encoder consists of N₂ two-dimensional convolution layers, a normalization layer, and a rectified linear unit connected in sequence, where N₂ is a positive integer.

The decoder comprises a real-part decoder and an imaginary-part decoder arranged in parallel, each consisting of N₄ two-dimensional convolution layers, where N₄ is a positive integer.

The rectified linear unit of the encoder is connected to the input of the first long short-term memory network, the output of the last long short-term memory network is connected to the real-part decoder and the imaginary-part decoder respectively, and the first two-dimensional convolution layer of the encoder is also connected to the real-part decoder and the imaginary-part decoder.
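
As a small illustration of why the decoder has parallel real-part and imaginary-part branches, the sketch below merges the two outputs into complex spectral bins, the form an inverse short-time Fourier transform operates on. The function name `combine_decoder_outputs` and the flat list representation of a frame are assumptions made for the example.

```python
def combine_decoder_outputs(real_part, imag_part):
    """Merge per-bin real and imaginary decoder outputs into complex bins."""
    if len(real_part) != len(imag_part):
        raise ValueError("real and imaginary outputs must have equal length")
    return [complex(r, i) for r, i in zip(real_part, imag_part)]


def magnitudes(bins):
    """Magnitude of each complex bin, useful for inspecting the result."""
    return [abs(b) for b in bins]
```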

In one embodiment, the generator consists of N₅ second networks, a fully connected layer, an exponential activation layer, and an energy normalization layer, where N₅ is a positive integer, and each second network consists of a one-dimensional convolution layer, a normalization layer, and a rectified linear unit connected in sequence.
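
The last two generator stages can be sketched as follows. The sketch assumes that the exponential activation turns the fully connected layer's output into strictly positive gains applied to the noise-reduced frame, and that energy normalization rescales the result to the energy of that input frame; neither detail is stated in the disclosure, and the names `exp_activation`, `energy_normalize`, and `generator_tail` are hypothetical.

```python
import math


def exp_activation(values):
    """Exponential activation: every output is strictly positive."""
    return [math.exp(v) for v in values]


def energy_normalize(enhanced, reference):
    """Rescale `enhanced` so its energy equals that of `reference`."""
    e_ref = sum(x * x for x in reference)
    e_enh = sum(x * x for x in enhanced)
    if e_enh == 0.0:
        return list(enhanced)               # nothing to rescale
    scale = math.sqrt(e_ref / e_enh)
    return [x * scale for x in enhanced]


def generator_tail(fc_output, denoised_frame):
    """Apply assumed gains, then restore the input frame's energy."""
    gains = exp_activation(fc_output)
    boosted = [g * x for g, x in zip(gains, denoised_frame)]
    return energy_normalize(boosted, denoised_frame)
```

Under these assumptions the generator redistributes energy across the frame without making the signal louder overall.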

In one embodiment, the generative adversarial network comprises a first discriminator and a second discriminator arranged in parallel, the first discriminator being used to calculate the predicted audio quality value and the second discriminator being used to calculate the predicted audio intelligibility value.

Each of the first and second discriminators consists of N₆ third networks, a global average pooling layer, a first fully connected layer, a leaky rectified linear unit, and a second fully connected layer connected in sequence, where N₆ is a positive integer, and each third network consists of a two-dimensional convolution layer, a normalization layer, and a rectified linear unit connected in sequence.
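
The portion of each discriminator that follows the third networks can be sketched in plain Python. This is an illustrative sketch only: the layer sizes, the leaky slope of 0.01, and the names `global_average_pool`, `dense`, `leaky_relu`, and `discriminator_head` are assumptions, and the convolutional third networks are taken as already having produced the feature maps.

```python
def global_average_pool(feature_maps):
    """Average each channel's two-dimensional map down to one number."""
    pooled = []
    for fmap in feature_maps:
        cells = [v for row in fmap for v in row]
        pooled.append(sum(cells) / len(cells))
    return pooled


def dense(inputs, weights, biases):
    """Fully connected layer: one weight row and one bias per output."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def leaky_relu(values, slope=0.01):
    """Leaky rectified linear unit with an assumed negative slope."""
    return [v if v >= 0 else slope * v for v in values]


def discriminator_head(feature_maps, w1, b1, w2, b2):
    """Pooling, first dense layer, leaky ReLU, second dense layer."""
    hidden = leaky_relu(dense(global_average_pool(feature_maps), w1, b1))
    return dense(hidden, w2, b2)[0]         # single predicted score
```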

An embodiment of the present disclosure provides an audio noise reduction apparatus, as shown in fig. 15, which is a block diagram of the audio noise reduction apparatus of this embodiment, and the apparatus may include:

The processing module 150 is configured to process the target noisy audio at the audio sending end by using the pre-trained audio noise reduction model.

The sending module 151 is configured to send the noise-reduced enhanced audio produced by the pre-trained audio noise reduction model to an audio receiving end. The pre-trained audio noise reduction model is obtained by the training method of the audio noise reduction model of any embodiment of the present disclosure, or by the training apparatus of the audio noise reduction model of any embodiment of the present disclosure.

In one embodiment, the processing module 150 includes:

The spectral feature input submodule is configured to input the spectral features of the target noisy audio at the audio sending end into the noise token model of the pre-trained audio noise reduction model to obtain the noise features of the target noisy audio.

The target noise reduction submodule is configured to input the noise features and the spectral features of the target noisy audio into the noise reduction model of the pre-trained audio noise reduction model to obtain target noise-reduced audio.

The noise reduction enhancement submodule is configured to input the target noise-reduced audio into the generator of the generative adversarial network of the pre-trained audio noise reduction model to obtain the noise-reduced enhanced audio.
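
The three-stage inference flow handled by these submodules can be sketched as a composition of callables. The function name `denoise_and_enhance` and the callable-based interface are assumptions for illustration; the trained models themselves are not shown.

```python
def denoise_and_enhance(spectrum, noise_token_model, noise_reduction_model, generator):
    """Run the three inference stages in the order described above."""
    # Stage 1: noise features from the spectral features.
    noise_feature = noise_token_model(spectrum)
    # Stage 2: noise-reduced audio from spectral features plus noise features.
    denoised = noise_reduction_model(spectrum, noise_feature)
    # Stage 3: noise-reduced enhanced audio from the generator alone;
    # the discriminators are not used at inference time.
    return generator(denoised)
```

Any three callables with matching inputs and outputs can be plugged in, which makes it straightforward to test the wiring with simple stand-ins before loading trained models.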

In the technical solutions of the present disclosure, the acquisition, storage, and use of users' personal information comply with relevant laws and regulations and do not violate public order and good customs.

The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.

FIG. 16 shows a schematic block diagram of an example electronic device 1600 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not intended to limit implementations of the disclosure described and/or claimed herein.

As shown in FIG. 16, the device 1600 includes a computing unit 1601, which may perform various appropriate actions and processes in accordance with a computer program stored in a read-only memory (ROM) 1602 or a computer program loaded from a storage unit 1608 into a random access memory (RAM) 1603. In the RAM 1603, various programs and data necessary for the operation of the device 1600 can also be stored. The computing unit 1601, ROM 1602 and RAM 1603 are connected to each other via a bus 1604. An input/output (I/O) interface 1605 is also connected to the bus 1604.

Various components in device 1600 connect to I/O interface 1605, including: an input unit 1606 such as a keyboard, a mouse, and the like; an output unit 1607 such as various types of displays, speakers, and the like; a storage unit 1608, such as a magnetic disk, optical disk, or the like; and a communication unit 1609 such as a network card, a modem, a wireless communication transceiver, etc. A communication unit 1609 allows device 1600 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunications networks.

Computing unit 1601 may be a variety of general purpose and/or special purpose processing components with processing and computing capabilities. Some examples of computing unit 1601 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The calculation unit 1601 performs the above-described respective methods and processes, such as a training method of an audio noise reduction model and/or a method of audio noise reduction. For example, in some embodiments, the method of training the audio noise reduction model and/or the method of audio noise reduction may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 1608. In some embodiments, part or all of the computer program can be loaded and/or installed onto device 1600 via ROM 1602 and/or communications unit 1609. When the computer program is loaded into RAM 1603 and executed by the computing unit 1601, one or more steps of the method for training an audio noise reduction model and/or the method for audio noise reduction described above may be performed. Alternatively, in other embodiments, the computing unit 1601 may be configured by any other suitable means (e.g., by means of firmware) to perform a training method of an audio noise reduction model and/or a method of audio noise reduction.

Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, field programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), systems on a chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.

Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.

In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user may provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.

The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include local area networks (LANs), wide area networks (WANs), and the Internet.

The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server combining a blockchain.

It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel or sequentially or in different orders, and are not limited herein as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved.

The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims

1. A method of training an audio noise reduction model, comprising:

inputting the frequency spectrum characteristic and the noise characteristic into a noise reduction model to obtain noise reduction audio;

respectively adjusting parameters of the noise token model, the noise reduction model and the generative confrontation network according to the noise reduction audio frequency, the noise-free audio frequency corresponding to the noise-containing audio frequency, the audio predicted value, the audio true value and the audio preset value by using a loss function; and

2. The method of claim 1, before said inputting spectral features of noisy audio into a noise token model to obtain noise features, further comprising:

constructing the noisy audio according to the background noise audio and the noise-free audio of the audio sending end;

and acquiring the spectral features of the noisy audio by using a short-time Fourier transform.

3. The method of claim 1, wherein the inputting spectral features of noisy audio into a noise token model to obtain noise features comprises:

inputting the spectral features of the noisy audio into a first network of the noise token model, and extracting a first feature;

performing an operation on the first feature and a high-dimensional noise matrix of the noise token model to obtain the noise features; wherein the high-dimensional noise matrix is constructed from high-dimensional features of different types of noise.

4. The method of claim 1, wherein said inputting the spectral feature and the noise feature into a noise reduction model resulting in noise reduced audio comprises:

inputting the spectral features and the noise features into an encoder of the noise reduction model; wherein the spectral features are obtained from the noisy audio by a short-time Fourier transform;

performing audio reconstruction on an output result of the encoder by using a decoder of the noise reduction model;

and carrying out inverse short-time Fourier transform on the reconstructed audio to obtain the noise reduction audio.

5. The method of claim 1, wherein said inputting the noise-reduced audio into a generative adversarial network to obtain a predicted audio value and a true audio value comprises:

inputting the noise reduction audio and the background noise audio of the audio receiving end into a generator of a generating type countermeasure network to obtain enhanced audio;

inputting the enhanced audio and the noise reduction audio into a discriminator of the generative countermeasure network to obtain an audio prediction value;

and obtaining an audio true value based on the enhanced audio and the noise reduction audio by using a preset function.

6. The method of claim 5, wherein said inputting said enhanced audio and said de-noised audio into a discriminator of said generative confrontation network to obtain an audio prediction value comprises:

inputting the enhanced audio and the noise reduction audio into a first discriminator of the generative countermeasure network to obtain an audio quality prediction value;

inputting the enhanced audio and the noise reduction audio into a second discriminator of the generative countermeasure network to obtain an audio intelligibility prediction value; and

obtaining an audio truth value based on the enhanced audio and the noise-reduced audio by using a preset function, including:

obtaining an audio quality true value based on the enhanced audio and the noise reduction audio by using a first preset function;

and obtaining an audio intelligibility true value based on the enhanced audio and the noise reduction audio by using a second preset function.

7. The method of claim 6, wherein said adjusting parameters of said noise token model, said noise reduction model and said generative countermeasure network according to said noise-reduced audio, said noise-free audio corresponding to said noise-containing audio, said predicted audio value, said true audio value and a preset audio value by using a loss function comprises:

calculating a first loss value according to the noise-free audio corresponding to the noise-containing audio and the noise-reduced audio by using a first loss function;

calculating a second loss value according to the audio quality predicted value and the audio quality true value and/or the audio quality predicted value and an audio quality preset value by using a second loss function;

calculating a third loss value according to the audio intelligibility prediction value and the audio intelligibility true value and/or the audio intelligibility prediction value and the audio intelligibility preset value by using a third loss function;

calculating a fourth loss value according to the first loss value, the second loss value and the third loss value by using a fourth loss function;

and adjusting parameters of the noise token model, the noise reduction model and the generative countermeasure network according to the first loss value, the second loss value, the third loss value and the fourth loss value respectively.

8. The method of claim 1, wherein the adjusting parameters of the noise token model, the noise reduction model and the generative countermeasure network according to the noise reduction audio, the noise-free audio corresponding to the noise-containing audio, the predicted audio value, the true audio value and the preset audio value by using the loss function comprises:

calculating a fifth loss value according to the noise-free audio corresponding to the noisy audio and the noise-reduced audio by using a fifth loss function;

calculating a sixth loss value according to the audio predicted value, the audio true value and the audio preset value by using a sixth loss function;

calculating a seventh loss value from the fifth loss value and the sixth loss value using a seventh loss function;

9. The method of claim 1, wherein the adjusting parameters of the noise token model, the noise reduction model and the generative countermeasure network according to the noise reduction audio, the noise-free audio corresponding to the noise-containing audio, the predicted audio value, the true audio value and the preset audio value by using the loss function comprises:

adjusting, for the i-th time, parameters of the noise token model, the noise reduction model, and a discriminator of the generative adversarial network according to the noise-reduced audio, the noise-free audio corresponding to the noisy audio, the predicted audio value, the true audio value, and the preset audio value by using the loss function, wherein i is a positive integer;

obtaining the (i + 1)-th noise-reduced audio, predicted audio value, and true audio value from the spectral features of the noisy audio, based on the noise token model, the noise reduction model, and the generative adversarial network after the parameter adjustment;

and adjusting, for the (i + 1)-th time, parameters of the noise token model, the noise reduction model, and a generator of the generative adversarial network after the parameter adjustment according to the noise-free audio corresponding to the noisy audio, the preset audio value, and the (i + 1)-th noise-reduced audio, predicted audio value, and true audio value by using the loss function.

10. The method of claim 5, wherein obtaining the trained audio noise reduction model once the noise token model, the noise reduction model, and the generative adversarial network have been adjusted to convergence comprises:

and under the condition that the noise token model, the noise reduction model and the generative confrontation network are adjusted to be converged, obtaining a trained audio noise reduction model based on the noise token model, the noise reduction model and the generator of the generative confrontation network.

11. The method of claim 3, wherein the first network consists of N₁ two-dimensional convolution layers, a long short-term memory network, and a multi-head attention layer connected in sequence, wherein N₁ is a positive integer.

12. The method of claim 3 or 11, wherein the construction of the high-dimensional noise matrix comprises:

respectively inputting the different types of noises into a preset noise identification network;

extracting high-dimensional characteristics of the different types of noise from an underlying network layer of the preset noise identification network;

and constructing the high-dimensional noise matrix according to the high-dimensional characteristics of the different types of noise.

13. The method of claim 4, wherein the noise reduction model comprises the encoder, long-short term memory network, and the decoder;

the encoder consists of N₂ two-dimensional convolution layers, a normalization layer, and a rectified linear unit connected in sequence, wherein N₂ is a positive integer;

the number of long short-term memory networks is N₃, wherein N₃ is a positive integer;

the decoder comprises a real-part decoder and an imaginary-part decoder arranged in parallel, each consisting of N₄ two-dimensional convolution layers, wherein N₄ is a positive integer;

the rectified linear unit of the encoder is connected to the input of the first long short-term memory network, the output of the last long short-term memory network is connected to the real-part decoder and the imaginary-part decoder respectively, and the encoder is also connected to the real-part decoder and the imaginary-part decoder.

14. The method of claim 5, wherein the generator consists of N₅ second networks, a fully connected layer, an exponential activation layer, and an energy normalization layer, wherein N₅ is a positive integer, and each second network consists of a one-dimensional convolution layer, a normalization layer, and a rectified linear unit connected in sequence.

15. The method of claim 5 or 6, wherein the generative confrontation network comprises a first discriminator and a second discriminator arranged in parallel, the first discriminator being used to calculate an audio quality prediction value and the second discriminator being used to calculate an audio intelligibility prediction value;

each of the first and second discriminators consists of N₆ third networks, a global average pooling layer, a first fully connected layer, a leaky rectified linear unit, and a second fully connected layer connected in sequence, wherein N₆ is a positive integer, and each third network consists of a two-dimensional convolution layer, a normalization layer, and a rectified linear unit connected in sequence.

16. A method of audio noise reduction, comprising:

sending the noise reduction enhanced audio obtained after the pre-trained audio noise reduction model is processed to an audio receiving end; wherein the pre-trained audio noise reduction model is obtained using the method of any one of claims 1 to 15.

17. The method of claim 16, wherein the processing the target noisy audio at the audio transmitter side using the pre-trained audio noise reduction model comprises:

inputting the frequency spectrum characteristic of a target noisy audio frequency of an audio frequency sending end into a noise token model of a pre-trained audio frequency noise reduction model to obtain the noise characteristic of the target noisy audio frequency;

inputting the noise characteristics of the target noisy audio and the frequency spectrum characteristics of the target noisy audio into a noise reduction model of the pre-trained audio noise reduction model to obtain a target noise reduction audio;

and inputting the target noise reduction audio into a generator of a generative confrontation network of the pre-trained audio noise reduction model to obtain noise reduction enhanced audio.

18. An apparatus for training an audio noise reduction model, comprising:

an adjusting module, configured to adjust parameters of the noise token model, the noise reduction model, and the generative countermeasure network respectively according to the noise-reduced audio, the noise-free audio corresponding to the noise-containing audio, the predicted audio value, the true audio value, and a preset audio value by using a loss function; and

and the training module is used for obtaining the trained audio noise reduction model under the condition that the noise token model, the noise reduction model and the generative confrontation network are adjusted to be converged.

19. The apparatus of claim 18, further comprising:

the construction module is used for constructing a noisy audio according to the background noise audio and the noiseless audio of the audio transmitting end;

and the acquisition module is used for acquiring the frequency spectrum characteristics of the noisy audio by using short-time Fourier transform.

20. The apparatus of claim 18, wherein the feature module comprises:

the extraction submodule is used for inputting the spectral characteristics of the noisy audio into a first network of the noise token model and extracting first characteristics;

the operation submodule is used for operating the first characteristic and a high-dimensional noise matrix of the noise token model to obtain a noise characteristic; wherein the high-dimensional noise matrix is constructed by high-dimensional features of different types of noise.

21. The apparatus of claim 18, wherein the noise reduction module comprises:

a first input submodule for inputting the spectral features and the noise features into an encoder of a noise reduction model; wherein the spectral features are obtained from the noisy audio by short-time Fourier transform;

the reconstruction submodule is used for carrying out audio reconstruction on the output result of the encoder by utilizing the decoder of the noise reduction model;

22. The apparatus of claim 18, wherein the computing module comprises:

the second input submodule is used for inputting the noise reduction audio and the background noise of the audio receiving end into a generator of a generating type countermeasure network to obtain enhanced audio;

a third input submodule, configured to input the enhanced audio and the noise-reduced audio to the discriminator of the generative countermeasure network, so as to obtain an audio prediction value;

23. The apparatus of claim 22, wherein the third input sub-module is configured to input the enhanced audio and the noise-reduced audio into the first discriminator of the generative confrontation network to obtain an audio quality prediction value; inputting the enhanced audio and the noise reduction audio into a second discriminator of the generative countermeasure network to obtain an audio intelligibility prediction value; and

the truth-value submodule is used for obtaining an audio quality truth value based on the enhanced audio and the noise reduction audio by utilizing a first preset function; and obtaining an audio intelligibility true value based on the enhanced audio and the noise reduction audio by using a second preset function.

24. The apparatus of claim 23, wherein the adjustment module comprises:

a first calculation submodule, configured to calculate, by using a first loss function, a first loss value according to the noise-reduced audio and the noise-free audio corresponding to the noisy audio;

a second calculation submodule, configured to calculate, by using a second loss function, a second loss value according to the audio quality prediction value and the audio quality true value, and/or the audio quality prediction value and the audio quality preset value;

a third calculation submodule, configured to calculate, by using a third loss function, a third loss value according to the audio intelligibility prediction value and the audio intelligibility true value, and/or the audio intelligibility prediction value and the audio intelligibility preset value;

a fourth calculation submodule, configured to calculate, by using a fourth loss function, a fourth loss value according to the first loss value, the second loss value, and the third loss value; and

a first adjusting submodule, configured to adjust the parameters of the noise token model, the noise reduction model, and the generative adversarial network, respectively, according to the first loss value, the second loss value, the third loss value, and the fourth loss value.
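The relationship among the four loss values can be sketched as follows. The claim names the inputs of each loss function but not its form, so the squared-error terms and the weighted sum used here are assumptions, and all function names and numeric values are hypothetical.

```python
def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def combined_losses(denoised, clean,
                    q_pred, q_true, q_preset,
                    i_pred, i_true, i_preset,
                    weights=(1.0, 1.0, 1.0)):
    # First loss: noise-reduced audio against the noise-free reference.
    loss1 = mse(denoised, clean)
    # Second loss: quality prediction against its true value and its preset value.
    loss2 = (q_pred - q_true) ** 2 + (q_pred - q_preset) ** 2
    # Third loss: intelligibility prediction against its true and preset values.
    loss3 = (i_pred - i_true) ** 2 + (i_pred - i_preset) ** 2
    # Fourth loss: ASSUMED to be a weighted sum of the first three.
    w1, w2, w3 = weights
    loss4 = w1 * loss1 + w2 * loss2 + w3 * loss3
    return loss1, loss2, loss3, loss4

loss1, loss2, loss3, loss4 = combined_losses(
    denoised=[0.5, 0.0], clean=[0.0, 0.0],
    q_pred=0.5, q_true=0.5, q_preset=1.0,
    i_pred=0.75, i_true=0.25, i_preset=1.0,
)
```

The second and third losses here use both the true-value term and the preset-value term; the "and/or" wording of the claim also allows either term alone.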

25. The apparatus of claim 18, wherein the adjustment module comprises:

a fifth calculation submodule, configured to calculate, by using a fifth loss function, a fifth loss value according to the noise-reduced audio and the noise-free audio corresponding to the noisy audio;

a sixth calculation submodule, configured to calculate, by using a sixth loss function, a sixth loss value according to the audio prediction value, the audio true value, and the audio preset value;

a seventh calculation submodule, configured to calculate, by using a seventh loss function, a seventh loss value according to the fifth loss value and the sixth loss value; and

a second adjusting submodule, configured to adjust the parameters of the noise token model, the noise reduction model, and the generative adversarial network, respectively, according to the fifth loss value, the sixth loss value, and the seventh loss value.

26. The apparatus of claim 18, wherein the adjustment module comprises:

a third adjusting submodule, configured to perform, by using a loss function, an i-th adjustment on the parameters of the noise token model, the noise reduction model, and the discriminator of the generative adversarial network according to the noise-reduced audio, the noise-free audio corresponding to the noisy audio, the audio prediction value, the audio true value, and the audio preset value, where i is a positive integer;

an acquisition submodule, configured to obtain, based on the noise token model, the noise reduction model, and the generative adversarial network after the parameter adjustment, an (i+1)-th noise-reduced audio, audio prediction value, and audio true value according to the spectral features of the noisy audio; and

a fourth adjusting submodule, configured to perform, by using the loss function, an (i+1)-th adjustment on the parameters of the noise token model, the noise reduction model, and the generator of the generative adversarial network after the parameter adjustment, according to the noise-free audio corresponding to the noisy audio, the audio preset value, and the (i+1)-th noise-reduced audio, audio prediction value, and audio true value.
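The alternation between the i-th and (i+1)-th adjustments can be sketched as a schedule that selects which parameter groups are updated at each step. The sketch assumes that odd-numbered steps update the discriminator and even-numbered steps update the generator; the claim only requires that the two kinds of adjustment follow one another. The group names are hypothetical labels.

```python
def trainable_groups(step):
    """Return the parameter groups adjusted at a given 1-based step."""
    # Both adjustments include the noise token model and the noise reduction model.
    shared = ["noise_token_model", "noise_reduction_model"]
    # ASSUMPTION: odd steps train the discriminator, even steps the generator.
    gan_part = "gan_discriminator" if step % 2 == 1 else "gan_generator"
    return shared + [gan_part]

def run_schedule(num_steps):
    """Count how often each group would be adjusted over num_steps steps."""
    counts = {
        "noise_token_model": 0,
        "noise_reduction_model": 0,
        "gan_discriminator": 0,
        "gan_generator": 0,
    }
    for step in range(1, num_steps + 1):
        for name in trainable_groups(step):
            counts[name] += 1  # stand-in for a real gradient update
    return counts

counts = run_schedule(4)
```

In a real training loop, the counter increment would be replaced by a forward pass, a loss computation, and an optimizer step restricted to the selected groups.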

27. The apparatus of claim 22, wherein the training module comprises:

a training submodule, configured to obtain the trained audio noise reduction model based on the noise token model, the noise reduction model, and the generator of the generative adversarial network, when the noise token model, the noise reduction model, and the generative adversarial network have been adjusted to convergence.

28. An apparatus for audio noise reduction, comprising:

a sending module, configured to send, to an audio receiving end, the noise-reduced enhanced audio obtained through processing by a pre-trained audio noise reduction model; wherein the pre-trained audio noise reduction model is obtained using the method of any one of claims 1 to 15, or the apparatus of any one of claims 18 to 27.

29. The apparatus of claim 28, wherein the processing module comprises:

a spectral feature input submodule, configured to input the spectral features of target noisy audio from an audio sending end into the noise token model of the pre-trained audio noise reduction model to obtain the noise features of the target noisy audio;

a target noise reduction submodule, configured to input the noise features of the target noisy audio and the spectral features of the target noisy audio into the noise reduction model of the pre-trained audio noise reduction model to obtain target noise-reduced audio; and

a noise reduction enhancement submodule, configured to input the target noise-reduced audio into the generator of the generative adversarial network of the pre-trained audio noise reduction model to obtain the noise-reduced enhanced audio.
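The three submodules of this claim form a simple chain, which can be sketched as a function that composes three callables. The stand-in models below are toy placeholders, not the trained networks of the disclosure: a noise-floor estimate, a subtraction, and a fixed gain. All names are hypothetical.

```python
def denoise_and_enhance(spectral_features, noise_token_model,
                        noise_reduction_model, generator):
    """Chain the three stages in the order given by the claim."""
    # Stage 1: spectral features -> noise features.
    noise_features = noise_token_model(spectral_features)
    # Stage 2: spectral features + noise features -> noise-reduced audio.
    denoised = noise_reduction_model(spectral_features, noise_features)
    # Stage 3: noise-reduced audio -> noise-reduced enhanced audio.
    return generator(denoised)

# Toy stand-ins (ASSUMPTIONS for illustration only).
def toy_noise_token_model(features):
    floor = min(features)            # treat the smallest value as the noise floor
    return [floor] * len(features)

def toy_noise_reduction_model(features, noise):
    return [f - n for f, n in zip(features, noise)]

def toy_generator(audio):
    return [2.0 * a for a in audio]  # fixed gain as a placeholder enhancement

enhanced = denoise_and_enhance([3.0, 5.0, 4.0], toy_noise_token_model,
                               toy_noise_reduction_model, toy_generator)
```

Passing the models as arguments keeps the chain independent of how each stage is implemented, so trained networks could replace the toy functions without changing the composition.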

30. An electronic device, comprising:

at least one processor; and

a memory storing instructions executable by the at least one processor, the instructions, when executed by the at least one processor, enabling the at least one processor to perform the method of any one of claims 1 to 17.

31. A non-transitory computer-readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any one of claims 1 to 17.

32. A computer program product comprising a computer program which, when executed by a processor, implements the method of any one of claims 1 to 17.