CN114694674A - Speech noise reduction method, device and equipment based on artificial intelligence and storage medium - Google Patents

Speech noise reduction method, device and equipment based on artificial intelligence and storage medium

Info

Publication number
CN114694674A
CN114694674A (application CN202210231278.4A)
Authority
CN
China
Prior art keywords
noise reduction, module, spectrogram, audio, domain noise
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210231278.4A
Other languages
Chinese (zh)
Inventor
李�杰
王广新
杨汉丹
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Youjie Zhixin Technology Co ltd
Original Assignee
Shenzhen Youjie Zhixin Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Youjie Zhixin Technology Co ltd filed Critical Shenzhen Youjie Zhixin Technology Co ltd
Priority to CN202210231278.4A priority Critical patent/CN114694674A/en
Publication of CN114694674A publication Critical patent/CN114694674A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G10L21/0232 Processing in the frequency domain
    • G10L21/0224 Processing in the time domain
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The application discloses an artificial-intelligence-based speech noise reduction method, apparatus, device and storage medium. The method comprises: acquiring a spectrogram to be denoised corresponding to the speech to be denoised; inputting the spectrogram to be denoised into a preset noise reduction model for noise reduction processing to obtain a denoised spectrogram, wherein the noise reduction model comprises, in order: an encoding module, a frequency domain noise reduction module, a time domain noise reduction module, a decoding module and a mask gain and reduction module, the frequency domain noise reduction module implements a multi-head self-attention mechanism using adjacent sub-band information, and the time domain noise reduction module comprises at least two time domain noise reduction sub-modules; and reconstructing a speech signal from the denoised spectrogram to obtain target clean speech. The method decouples the time and frequency domains, which benefits streaming speech noise reduction, suits application scenarios with limited computing resources and/or high real-time requirements, and helps improve the noise reduction effect.

Description

Speech noise reduction method, device and equipment based on artificial intelligence and storage medium
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to a method, an apparatus, a device, and a storage medium for speech noise reduction based on artificial intelligence.
Background
Speech usually contains noise, and when noisy speech is used in a real application, the accuracy of the speech application drops and the user experience suffers. Existing noise reduction models trained on convolutional neural networks achieve a good noise reduction effect when used to denoise speech, but they demand substantial computing resources and long computation time, and therefore cannot be applied in scenarios with limited computing resources and/or high real-time requirements.
Disclosure of Invention
The main purpose of the application is to provide an artificial-intelligence-based speech noise reduction method, apparatus, device and storage medium, so as to solve the technical problem that noise reduction models trained on convolutional neural networks cannot be applied in scenarios with limited computing resources and/or high real-time requirements.
In order to achieve the above object, the present application provides a speech noise reduction method based on artificial intelligence, the method comprising:
acquiring a spectrogram to be denoised corresponding to the voice to be denoised;
inputting the spectrogram to be denoised into a preset noise reduction model for noise reduction processing to obtain a denoised spectrogram, wherein the noise reduction model comprises, in order: an encoding module, a frequency domain noise reduction module, a time domain noise reduction module, a decoding module and a mask gain and reduction module, the frequency domain noise reduction module implements a multi-head self-attention mechanism using adjacent sub-band information, and the time domain noise reduction module comprises at least two time domain noise reduction sub-modules;
and reconstructing a voice signal of the denoised spectrogram to obtain a target clean voice.
Further, the step of inputting the spectrogram to be denoised into a preset denoising model for denoising to obtain a denoised spectrogram includes:
inputting the spectrogram to be denoised into the coding module for feature extraction to obtain a plurality of single-layer audio coding features and target audio coding features;
inputting the target audio coding features into the frequency domain noise reduction module for frequency domain noise reduction to obtain frequency domain noise-reduced audio features;
performing residual connection on the target audio coding features and the frequency-domain denoised audio features to obtain audio features to be processed;
inputting the audio features to be processed into the time domain noise reduction module to perform feature grouping, grouping time domain noise reduction and feature splicing respectively to obtain time domain noise reduced audio features;
performing residual connection on the frequency-domain denoised audio features and the time-domain denoised audio features to obtain audio features to be decoded;
inputting each single-layer audio coding feature and the audio feature to be decoded into the decoding module for decoding to obtain a spectrogram to be analyzed;
and inputting the spectrogram to be analyzed into the mask gain and reduction module for masking to obtain the noise-reduced spectrogram.
Further, the step of inputting the target audio coding feature into the frequency domain noise reduction module for frequency domain noise reduction to obtain a frequency domain noise-reduced audio feature includes:
adopting a dimension reduction submodule of the frequency domain noise reduction module to perform dimension reduction processing on the target audio coding feature to obtain a coding feature after dimension reduction;
performing frequency domain noise reduction on the dimension-reduced coding features by using a multi-head self-attention submodule of the frequency domain noise reduction module to obtain coding features to be dimension-increased, wherein the multi-head self-attention submodule implements a multi-head self-attention mechanism whose query, key and value are data determined according to a preset dependent frequency band width and adjacent sub-band information;
and performing dimension increasing processing on the to-be-dimension-increased coding features by adopting a dimension increasing submodule of the frequency domain noise reduction module to obtain the frequency domain noise-reduced audio features.
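A single-head numpy sketch of the idea restricted to adjacent sub-bands (the patent describes a multi-head mechanism with learned projections; both are omitted here, and the band-width parameter is illustrative):

```python
import numpy as np

def banded_freq_attention(feats, band=3):
    """feats: (F, D) features for one frame, one row per frequency sub-band.
    Each sub-band attends only to sub-bands within `band` bins of itself,
    i.e. only adjacent sub-band information is used."""
    F, D = feats.shape
    q = k = v = feats                         # learned Q/K/V projections omitted
    scores = q @ k.T / np.sqrt(D)             # (F, F) similarity between sub-bands
    idx = np.arange(F)
    mask = np.abs(idx[:, None] - idx[None, :]) <= band
    scores = np.where(mask, scores, -np.inf)  # block attention to distant sub-bands
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w = w / w.sum(axis=1, keepdims=True)      # softmax over the allowed band
    return w @ v, w
```

Restricting attention to a fixed band keeps the cost per frequency bin constant instead of quadratic in the number of sub-bands.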
Further, the step of inputting the audio features to be processed into the time domain noise reduction module to perform feature grouping, grouping time domain noise reduction and feature splicing respectively to obtain time domain noise-reduced audio features includes:
dividing the audio features to be processed by adopting a feature grouping layer of the time domain noise reduction module to obtain a plurality of single group audio features, wherein the number of the single group audio features is the same as that of the time domain noise reduction submodules;
inputting the ith single group of audio features into the ith time domain noise reduction submodule for time domain noise reduction to obtain the ith audio feature to be combined, wherein i is an integer greater than 0;
and performing characteristic splicing on each audio characteristic to be combined by adopting the characteristic combination layer of the time domain noise reduction module to obtain the time domain noise-reduced audio characteristic.
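The grouping, per-group denoising and splicing steps can be sketched as follows (identity placeholders stand in for the recurrent sub-modules; all names are illustrative):

```python
import numpy as np

def grouped_time_denoise(feats, submodules):
    """feats: (T, C) time-by-channel audio features. Channels are split into
    len(submodules) groups, each group is denoised independently by its own
    sub-module, and the outputs are spliced back along the channel axis."""
    groups = np.split(feats, len(submodules), axis=1)   # feature grouping
    outs = [f(g) for f, g in zip(submodules, groups)]   # grouped time-domain denoising
    return np.concatenate(outs, axis=1)                 # feature splicing
```

Splitting C channels into G groups means each recurrent sub-module operates on only C/G channels, which is where the claimed reduction in computation and network parameter count comes from.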
Further, the step of inputting the spectrogram to be denoised into the coding module for feature extraction to obtain a plurality of single-layer audio coding features and target audio coding features includes:
carrying out Pointwise convolution on the input vector of the kth coding layer by adopting the kth coding layer of the coding module to obtain a first audio characteristic;
acquiring a preset Depthwise convolution time dimension;
if the time dimension of the Depthwise convolution is equal to 1, performing conventional convolution on the first audio feature by adopting a kth coding layer to obtain a kth single-layer audio coding feature;
if the Depthwise convolution time dimension is equal to 2, performing causal convolution on the first audio feature by adopting a kth coding layer to obtain a kth single-layer audio coding feature;
taking the nth single-layer audio coding feature as the target audio coding feature;
wherein k is an integer greater than 0, k is less than or equal to n, n is greater than 0, n is the number of the coding layers; and when k is equal to 1, taking the spectrogram to be subjected to noise reduction as an input vector of the kth coding layer, and when k is greater than 1, taking the (k-1) th single-layer audio coding feature as an input vector of the kth coding layer.
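A numpy sketch of the two convolution types involved (illustrative only; the actual layers are learned convolutions over both time and frequency):

```python
import numpy as np

def pointwise_conv(x, w):
    """1x1 (Pointwise) convolution: mixes channels independently per frame.
    x: (T, C_in), w: (C_in, C_out)."""
    return x @ w

def causal_depthwise_conv(x, kernel):
    """Depthwise convolution along time with left-only (causal) padding.
    x: (T, C), kernel: (K, C) per-channel time filters. Output frame t
    depends only on frames <= t, which is what permits streaming when the
    Depthwise time dimension is 2."""
    K = kernel.shape[0]
    xp = np.pad(x, ((K - 1, 0), (0, 0)))   # causal padding: pad the past only
    return np.stack([(xp[t:t + K] * kernel).sum(axis=0)
                     for t in range(x.shape[0])])
```

With K = 1 the depthwise step degenerates to a per-frame scaling, matching the "conventional convolution" branch above.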
Further, the step of inputting each single-layer audio coding feature and the audio feature to be decoded into the decoding module for decoding to obtain a spectrogram to be analyzed includes:
performing dimension reduction processing on the (n+1-m)-th single-layer audio coding features to obtain coding features to be processed, wherein m is an integer greater than 0 and m is less than or equal to n;
adding, element-wise at the same positions, the output vector of the (m-1)-th decoding layer and the coding features to be processed to obtain the m-th features to be processed;
performing deconvolution processing on the m-th features to be processed to obtain the m-th single-layer decoding features;
taking the n-th single-layer decoding features as the spectrogram to be analyzed;
wherein, when m is equal to 1, the audio features to be decoded are taken as the output vector of the (m-1)-th decoding layer, and when m is greater than 1, the (m-1)-th single-layer decoding features are taken as the output vector of the (m-1)-th decoding layer.
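A minimal sketch of the skip connection just described (the linear dimension-reduction map is an assumption for illustration; the patent only states that a dimension reduction is applied):

```python
import numpy as np

def decoder_layer_input(prev_out, enc_feat, reduce_w):
    """prev_out: (T, D) output of the previous decoding layer.
    enc_feat: (T, E) matching single-layer audio coding feature.
    The encoder feature is dimension-reduced, then added element-wise
    at the same positions to the previous decoder output."""
    skip = enc_feat @ reduce_w          # dimension reduction (assumed linear)
    return prev_out + skip              # element-wise addition
```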
Further, before the step of inputting the spectrogram to be denoised into a preset denoising model for denoising, and obtaining a denoised spectrogram, the method further includes:
obtaining a plurality of training samples and a model to be trained;
training the model to be trained according to each training sample and a preset target function until a preset model training end condition is reached, and taking the model to be trained reaching the model training end condition as the noise reduction model;
wherein the objective function S is expressed as: S = SISNR loss + MSE loss + perceptual loss + regularization term, where the SISNR loss is the scale-invariant signal-to-noise ratio loss of the speech, the MSE loss is calculated from the mean squared error of the real part of the spectrogram, the mean squared error of the imaginary part of the spectrogram and the mean squared error of the spectrogram magnitude spectrum, and the perceptual loss is the perceptual loss of the speech.
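A numpy sketch of two of the named loss terms, the SI-SNR term and the three-part spectrogram MSE (the perceptual loss and regularization term are omitted, and the relative weighting of the terms is not specified in the source):

```python
import numpy as np

def si_snr(est, target, eps=1e-8):
    """Scale-invariant signal-to-noise ratio in dB (higher is better);
    a training objective would use its negative as the loss."""
    est = est - est.mean()
    target = target - target.mean()
    s_target = (est @ target) / (target @ target + eps) * target  # projection onto target
    e_noise = est - s_target                                      # residual error
    return 10 * np.log10((s_target @ s_target) / (e_noise @ e_noise + eps))

def spectrogram_mse(est_spec, tgt_spec):
    """MSE summed over real part, imaginary part and magnitude spectrum,
    matching the three components named in the objective function."""
    return (np.mean((est_spec.real - tgt_spec.real) ** 2)
            + np.mean((est_spec.imag - tgt_spec.imag) ** 2)
            + np.mean((np.abs(est_spec) - np.abs(tgt_spec)) ** 2))
```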
The application further provides an artificial-intelligence-based speech noise reduction apparatus, the apparatus comprising:
the data acquisition module is used for acquiring a spectrogram to be denoised corresponding to the voice to be denoised;
and the noise reduction processing module is used for inputting the spectrogram to be subjected to noise reduction into a preset noise reduction model to perform noise reduction processing so as to obtain a noise-reduced spectrogram, wherein the noise reduction model sequentially comprises: the system comprises an encoding module, a frequency domain noise reduction module, a time domain noise reduction module, a decoding module and a mask gain and reduction module, wherein the frequency domain noise reduction module is a module for realizing a multi-head self-attention mechanism by using adjacent sub-band information, and the time domain noise reduction module comprises: at least two time domain noise reduction sub-modules;
and the voice signal reconstruction module is used for reconstructing the voice signal of the denoised spectrogram to obtain target clean voice.
The present application further proposes a computer device comprising a memory and a processor, the memory storing a computer program, the processor implementing the steps of any of the above methods when executing the computer program.
The present application also proposes a computer-readable storage medium having stored thereon a computer program which, when being executed by a processor, carries out the steps of the method of any of the above.
The application relates to an artificial-intelligence-based speech noise reduction method, apparatus, device and storage medium. The method inputs the spectrogram to be denoised into a preset noise reduction model for noise reduction processing to obtain a denoised spectrogram, wherein the noise reduction model comprises, in order: an encoding module, a frequency domain noise reduction module, a time domain noise reduction module, a decoding module and a mask gain and reduction module; the frequency domain noise reduction module implements a multi-head self-attention mechanism using adjacent sub-band information, and the time domain noise reduction module comprises at least two time domain noise reduction sub-modules. The method then reconstructs a speech signal from the denoised spectrogram to obtain target clean speech. The noise reduction model thus performs, in order, feature extraction, frequency domain noise reduction, time domain noise reduction, decoding and mask gain and reduction; performing frequency domain and then time domain noise reduction achieves effective noise reduction and improves the noise reduction effect. Processing frequency domain and time domain noise reduction separately decouples the time and frequency domains, which benefits streaming speech noise reduction. Because the time domain noise reduction module adopts at least two time domain noise reduction sub-modules, grouped time-domain noise reduction is realized, reducing the amount of computation and the number of network parameters, which helps suit application scenarios with limited computing resources and/or high real-time requirements. Finally, the frequency domain noise reduction module implements a multi-head self-attention mechanism using adjacent sub-band information, which helps improve the noise reduction effect.
Drawings
FIG. 1 is a flowchart illustrating an artificial intelligence based speech noise reduction method according to an embodiment of the present application;
FIG. 2 is a block diagram illustrating a structure of an artificial intelligence based speech noise reduction apparatus according to an embodiment of the present application;
fig. 3 is a block diagram illustrating a structure of a computer device according to an embodiment of the present application.
The implementation, functional features and advantages of the objectives of the present application will be further explained with reference to the accompanying drawings.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
Referring to fig. 1, an embodiment of the present application provides a speech noise reduction method based on artificial intelligence, where the method includes:
s1: acquiring a spectrogram to be denoised corresponding to the voice to be denoised;
s2: inputting the spectrogram to be denoised into a preset denoising model for denoising to obtain a denoised spectrogram, wherein the denoising model sequentially comprises: the system comprises an encoding module, a frequency domain noise reduction module, a time domain noise reduction module, a decoding module and a mask gain and reduction module, wherein the frequency domain noise reduction module is a module for realizing a multi-head self-attention mechanism by using adjacent sub-band information, and the time domain noise reduction module comprises: at least two time domain noise reduction sub-modules;
s3: and reconstructing a voice signal of the denoised spectrogram to obtain a target clean voice.
This embodiment uses the noise reduction model to perform, in order, feature extraction, frequency domain noise reduction, time domain noise reduction, decoding and mask gain and reduction; performing frequency domain and then time domain noise reduction achieves effective noise reduction and improves the noise reduction effect. Processing frequency domain and time domain noise reduction separately decouples the time and frequency domains, which benefits streaming speech noise reduction. Because the time domain noise reduction module adopts at least two time domain noise reduction sub-modules, grouped time-domain noise reduction is realized, reducing the amount of computation and the number of network parameters, which helps suit application scenarios with limited computing resources and/or high real-time requirements. In addition, the frequency domain noise reduction module implements a multi-head self-attention mechanism using adjacent sub-band information, which helps improve the noise reduction effect.
For S1, a spectrogram to be noise reduced corresponding to the voice to be noise reduced input by the user may be obtained, a spectrogram to be noise reduced corresponding to the voice to be noise reduced may also be obtained from the database, and a spectrogram to be noise reduced corresponding to the voice to be noise reduced may also be obtained from a third-party application.
The speech to be denoised is one or more segments of speech on which noise reduction is to be performed.
The spectrogram to be denoised is a spectrogram of a voice to be denoised, wherein the spectrogram is a graph generated according to a Fourier spectrum.
The spectrogram to be denoised comprises 2 channels: a real-part channel and an imaginary-part channel. The real-part channel is the real part of the Fourier spectral features, and the imaginary-part channel is the imaginary part of the Fourier spectral features.
A short-time Fourier transform is performed on the speech to be denoised, and the resulting spectrogram is taken as the spectrogram to be processed; the DC component is then removed from the spectrogram to be processed, and the spectrogram without the DC component is taken as the spectrogram to be denoised. Because the DC component has little influence on spectrum reconstruction, removing it from the short-time Fourier spectrogram does not affect the noise reduction effect while reducing the amount of computation.
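As a minimal illustration, the STFT-plus-DC-removal step can be sketched in numpy (frame length, hop and window below are assumed values, not taken from the patent):

```python
import numpy as np

def stft_no_dc(x, n_fft=512, hop=256):
    """Return a complex spectrogram of shape (n_fft//2, n_frames)
    with the DC bin removed."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    spec = np.fft.rfft(frames, axis=1).T   # (n_fft//2 + 1, n_frames)
    return spec[1:, :]                     # drop bin 0: the DC component

# The real and imaginary parts form the model's 2 input channels.
x = np.random.randn(4096)
spec = stft_no_dc(x)
real_ch, imag_ch = spec.real, spec.imag
```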
And S2, inputting the spectrogram to be denoised into a preset denoising model, sequentially performing feature extraction, frequency domain denoising, time domain denoising, decoding and mask gain and reduction, and taking data output by the mask gain and reduction as the spectrogram after denoising.
The coding module is used for coding to realize the extraction of the audio features. The encoding module comprises a plurality of encoding layers, the encoding layers are in linear connection, and each encoding layer outputs a single-layer audio encoding characteristic.
Optionally, to implement streaming processing, convolution is applied only in the frequency dimension (that is, the convolution kernel size in the time dimension is 1) to reduce the amount of computation, or causal convolution is adopted in the time dimension.
Optionally, the number of coding layers of the coding module is at least 3.
The frequency domain noise reduction module is used for reducing noise in the dimension of the frequency domain, so that the information of the frequency domain is fully utilized. And the output of the last coding layer of the coding module is used as the input of the frequency domain noise reduction module. The frequency domain noise reduction module is a module for realizing a Multi-head self-attention mechanism (Multi-head self-attention) by using adjacent subband information.
The time domain noise reduction module is used for reducing noise in the time-domain dimension, so that time-domain information is fully utilized. The output of the frequency domain noise reduction module and the output of the last coding layer of the coding module are taken as the input of the time domain noise reduction module. The time domain noise reduction module comprises at least two time domain noise reduction sub-modules, thereby realizing grouped time-domain noise reduction. Each time domain noise reduction sub-module is a module built on a long short-term memory network (LSTM) and/or a gated recurrent unit (GRU).
And the decoding module is used for decoding to obtain the spectrogram after frequency domain noise reduction and time domain noise reduction. The decoding module comprises a plurality of decoding layers, and the decoding layers are linearly connected. The input to each decoding layer is data derived from the output of the previous decoding layer and the output of the encoding layer.
Optionally, the number of decoding layers of the decoding module is the same as the number of encoding layers of the encoding module.
Optionally, the number of decoding layers of the decoding module is at least 3.
And the mask gain and reduction module is used for enhancing the data corresponding to the wanted voice and suppressing the data corresponding to the unwanted voice in the spectrogram.
Optionally, the masking gain and reduction module performs masking by using 0 and 1. For example, in the spectrogram, the masking gain and reduction module performs masking with 1 to gain data corresponding to desired voice and performs masking with 0 to reduce data corresponding to undesired voice.
Optionally, the masking gain and reduction module performs masking by using a value from 0 to 1.
For S3, an inverse short-time Fourier transform is performed on the denoised spectrogram to obtain time-domain data to be processed; speech signal reconstruction is then performed on the time-domain data using the overlap-add method (also written Overlapadd or OLA), and the reconstructed clean speech is taken as the target clean speech corresponding to the speech to be denoised.
The method for performing speech signal reconstruction on the time domain data to be processed by using the Overlapadd method is not described herein again.
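For completeness, a generic overlap-add synthesis sketch (window-squared normalization is a common choice assumed here, not taken from the source):

```python
import numpy as np

def overlap_add(frames, hop, window):
    """frames: (n_frames, N) time-domain frames, e.g. from the inverse STFT
    of the denoised spectrogram. Overlapping frames are windowed, summed,
    and normalized by the accumulated squared window."""
    n_frames, N = frames.shape
    length = (n_frames - 1) * hop + N
    y = np.zeros(length)
    norm = np.zeros(length)
    for i in range(n_frames):
        y[i * hop:i * hop + N] += frames[i] * window
        norm[i * hop:i * hop + N] += window ** 2
    return np.where(norm > 1e-8, y / np.maximum(norm, 1e-8), 0.0)
```

With matching analysis and synthesis windows, interior samples are reconstructed exactly; only the first and last partial frames are attenuated.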
In an embodiment, the step of inputting the spectrogram to be denoised into a preset denoising model for denoising to obtain a denoised spectrogram includes:
s21: inputting the spectrogram to be denoised into the coding module for feature extraction to obtain a plurality of single-layer audio coding features and target audio coding features;
s22: inputting the target audio coding features into the frequency domain noise reduction module for frequency domain noise reduction to obtain frequency domain noise-reduced audio features;
s23: residual error connection is carried out on the target audio coding features and the frequency domain noise-reduced audio features to obtain audio features to be processed;
s24: inputting the audio features to be processed into the time domain noise reduction module to perform feature grouping, grouping time domain noise reduction and feature splicing respectively to obtain time domain noise reduced audio features;
s25: residual error connection is carried out on the frequency domain noise-reduced audio frequency characteristic and the time domain noise-reduced audio frequency characteristic to obtain an audio frequency characteristic to be decoded;
s26: inputting each single-layer audio coding feature and the audio feature to be decoded into the decoding module for decoding to obtain a spectrogram to be analyzed;
s27: and inputting the spectrogram to be analyzed into the mask gain and reduction module for masking to obtain the noise-reduced spectrogram.
This embodiment uses the noise reduction model to perform, in order, feature extraction, frequency domain noise reduction, time domain noise reduction, decoding and mask gain and reduction; performing frequency domain and then time domain noise reduction achieves effective noise reduction and improves the noise reduction effect. Processing frequency domain and time domain noise reduction separately decouples the time and frequency domains, which benefits streaming speech noise reduction. Because the time domain noise reduction module adopts at least two time domain noise reduction sub-modules, grouped time-domain noise reduction is realized, reducing the amount of computation and the number of network parameters, which helps suit application scenarios with limited computing resources and/or high real-time requirements. In addition, the frequency domain noise reduction module implements a multi-head self-attention mechanism using adjacent sub-band information, which helps improve the noise reduction effect.
For step S21, the spectrogram to be denoised is input into the encoding module for feature extraction, the audio feature data extracted by each encoding layer in the encoding module is used as a single-layer audio encoding feature, and the single-layer audio encoding feature extracted by the last encoding layer in the encoding module is used as a target audio encoding feature.
For S22, the target audio coding feature is input into the frequency domain noise reduction module, noise reduction is performed in the frequency domain dimension by the multi-head attention mechanism of the frequency domain noise reduction module, and the audio feature subjected to noise reduction in the frequency domain dimension is used as the audio feature subjected to noise reduction in the frequency domain.
For S23, residual connection is performed on the target audio coding features and the frequency-domain denoised audio features, and the audio features obtained by the residual connection are taken as the audio features to be processed.
The implementation method for residual error connection between the target audio coding feature and the frequency-domain noise-reduced audio feature is not described herein again.
For S24, the audio features to be processed are input into the time domain noise reduction module; the module performs feature grouping, performs grouped time-domain noise reduction in the time dimension on the grouped data, then performs feature splicing on the denoised features, and the spliced data is taken as the time-domain denoised audio features.
For S25, residual connection is performed on the frequency-domain denoised audio features and the time-domain denoised audio features, and the audio feature vector obtained by the residual connection is taken as the audio features to be decoded.
The method for implementing residual error connection between the frequency-domain noise-reduced audio features and the time-domain noise-reduced audio features is not described herein.
For S26, the mth feature to be decoded is determined according to the (n+1-m)th single-layer audio coding feature and the output vector of the (m-1)th decoding layer, wherein m is an integer greater than 0, m is less than or equal to n, and n is the number of coding layers in the coding module; the number of coding layers in the coding module is the same as the number of decoding layers in the decoding module. When m is equal to 1, the audio feature to be decoded is taken as the output vector of the (m-1)th decoding layer; when m is greater than 1, the (m-1)th single-layer decoding feature is taken as the output vector of the (m-1)th decoding layer.
And taking the data output by the last decoding layer of the decoding module as a spectrogram to be analyzed.
For S27, the spectrogram to be analyzed is input into the mask gain and reduction module; the mask gain and reduction module adopts a CRM (Complex Ratio Mask) as the noise reduction filtering function, gaining the data corresponding to wanted speech and reducing the data corresponding to unwanted speech in the spectrogram; and the masked spectrogram to be analyzed is taken as the noise-reduced spectrogram.
Optionally, masking is performed using the following formula: enhance_real + i·enhance_imag = (mask_real + i·mask_imag) × (noise_real + i·noise_imag), where enhance_real is the real part of the enhanced speech, enhance_imag is the imaginary part of the enhanced speech, mask_real is the mask enhancement coefficient of the real part, mask_imag is the mask enhancement coefficient of the imaginary part, noise_real is the real part of the noisy speech, noise_imag is the imaginary part of the noisy speech, and i is the imaginary unit.
The mask enhancement coefficient is a value from 0 to 1, and may be 0 or 1.
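A minimal sketch of the complex masking formula above, using NumPy and hypothetical variable names (noisy_real/noisy_imag stand for the components of the noisy-speech spectrogram):

```python
import numpy as np

def apply_crm_mask(noisy_real, noisy_imag, mask_real, mask_imag):
    """Complex Ratio Mask: (mask_r + i*mask_i) * (noisy_r + i*noisy_i)."""
    enhance_real = mask_real * noisy_real - mask_imag * noisy_imag
    enhance_imag = mask_real * noisy_imag + mask_imag * noisy_real
    return enhance_real, enhance_imag

# toy 2x3 spectrogram patch; mask coefficients lie in [0, 1]
nr = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
ni = np.array([[0.5, 0.5, 0.5], [0.5, 0.5, 0.5]])
mr = np.full_like(nr, 0.8)
mi = np.full_like(nr, 0.1)
er, ei = apply_crm_mask(nr, ni, mr, mi)

# cross-check against NumPy's native complex arithmetic
ref = (mr + 1j * mi) * (nr + 1j * ni)
assert np.allclose(er, ref.real) and np.allclose(ei, ref.imag)
```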
In an embodiment, the step of inputting the target audio coding feature into the frequency domain noise reduction module for frequency domain noise reduction to obtain a frequency domain noise-reduced audio feature includes:
s221: adopting a dimension reduction submodule of the frequency domain noise reduction module to perform dimension reduction processing on the target audio coding features to obtain dimension-reduced coding features;
s222: performing frequency domain noise reduction on the dimension-reduced coding features by using a multi-head self-attention submodule of the frequency domain noise reduction module to obtain to-be-dimension-increased coding features, wherein the multi-head self-attention submodule is a module implementing a multi-head self-attention mechanism, and the Query, Key, and Value of the multi-head self-attention mechanism of the frequency domain noise reduction submodule are data determined according to a preset dependent frequency band width and adjacent sub-band information;
s223: and performing dimension increasing processing on the to-be-dimension-increased coding features by adopting a dimension increasing submodule of the frequency domain noise reduction module to obtain the frequency domain noise-reduced audio features.
This embodiment sequentially performs dimension reduction, frequency domain noise reduction, and dimension increase, keeping the amount of computation controllable without degrading model performance; the Query, Key, and Value of the multi-head self-attention mechanism of the frequency domain noise reduction submodule are data determined according to a preset dependent frequency band width and adjacent sub-band information, which helps improve the frequency domain noise reduction effect.
And S221, performing dimension reduction processing on the target audio coding feature by using a dimension reduction sub-module of the frequency domain noise reduction module, and taking data obtained through the dimension reduction processing as the coding feature after dimension reduction.
And the dimension reduction submodule adopts a full connection layer.
And S222, performing frequency domain noise reduction on the dimension-reduced coding features by adopting the multi-head self-attention sub-module of the frequency domain noise reduction module, and taking data obtained by the frequency domain noise reduction as the dimension-to-be-raised coding features.
Query, Key, and Value are the elements of the multi-head self-attention mechanism. The basic principle is as follows: given a Query, the correlation between the Query and each Key is calculated, and the most appropriate Value is then retrieved according to that correlation.
Optionally, the dependent frequency band width adopts a preset context-dependent width. That the Query, Key, and Value of the multi-head self-attention mechanism of the frequency domain noise reduction submodule are data determined according to a preset dependent frequency band width and adjacent sub-band information means: the current frame together with the preceding frames and following frames within the preset context-dependent width is determined as the Query; the current frame is used as the Key; and a content vector is generated from the current frame and the preceding and following frames within the preset context-dependent width and used as the Value.
It is to be understood that the current frame and the preceding and following frames within the preset context-dependent width are consecutive frames.
For example, if the dependent frequency band width is 4, the 9 consecutive frames consisting of the current frame, the preceding 4 frames, and the following 4 frames determine the Query; the current frame is used as the Key; and a content vector generated from those 9 frames is used as the Value.
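The 9-frame example above can be sketched as follows; `context_window` is a hypothetical helper that gathers the current frame plus its neighbors within the dependent width (edge frames are clipped here, which is one possible boundary policy, not one the application specifies):

```python
import numpy as np

def context_window(feats, t, width):
    """Gather the current frame plus `width` frames on each side (edge-clipped).

    feats : (num_frames, dim) feature matrix
    Returns the 2*width+1 consecutive frames used to build Query/Value.
    """
    idx = np.clip(np.arange(t - width, t + width + 1), 0, len(feats) - 1)
    return feats[idx]

feats = np.arange(20, dtype=float).reshape(20, 1)  # 20 frames, 1-dim toy feature
win = context_window(feats, t=10, width=4)         # 9 consecutive frames
key = feats[10:11]                                 # Key: the current frame only
```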
And S223, performing dimension-increasing processing on the dimension-to-be-increased coding feature by adopting a dimension-increasing submodule of the frequency domain noise reduction module, and taking data obtained through the dimension-increasing processing as the frequency domain noise-reduced audio feature.
And the dimension-increasing submodule adopts a full connection layer.
In an embodiment, the step of inputting the audio feature to be processed into the time domain noise reduction module to perform feature grouping, grouping time domain noise reduction, and feature splicing, respectively, to obtain the time domain noise-reduced audio feature includes:
s241: dividing the audio features to be processed by adopting a feature grouping layer of the time domain noise reduction module to obtain a plurality of single group audio features, wherein the number of the single group audio features is the same as that of the time domain noise reduction submodules;
s242: inputting the ith single-group audio feature into the ith time domain noise reduction submodule for time domain noise reduction to obtain an ith audio feature to be combined, wherein i is an integer greater than 0;
s243: and performing feature splicing on each audio feature to be combined by adopting the feature combination layer of the time domain noise reduction module to obtain the time domain noise-reduced audio feature.
The embodiment respectively carries out feature grouping, grouping time domain noise reduction and feature splicing, ensures the width of the time domain noise reduction submodule, reduces the calculated amount and the network parameter amount, and is beneficial to being suitable for application scenes with limited calculation resources and/or higher real-time requirements.
And for S241, dividing the audio features to be processed by adopting the feature grouping layer of the time domain noise reduction module, and taking each group of divided features as a single group of audio features.
Wherein the number of the single set of audio features is the same as the number of the time domain noise reduction sub-modules, thereby preparing input data for each of the time domain noise reduction sub-modules.
For step S242, the ith single group of audio features is input into the ith time domain noise reduction submodule for time domain noise reduction, so as to obtain the ith audio feature to be combined, thereby implementing time domain noise reduction of each single group of audio features by using one time domain noise reduction submodule.
That is, the number of audio features to be combined is the same as the number of time-domain noise reduction sub-modules.
And S243, performing feature splicing on each audio feature to be combined by using the feature combination layer of the time domain noise reduction module, and taking data obtained by feature splicing as the time domain noise-reduced audio feature.
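A toy sketch of the feature grouping, grouped denoising, and feature splicing pipeline above; the callables stand in for the time domain noise reduction sub-modules, whose internal structure is not specified here:

```python
import numpy as np

def grouped_time_denoise(x, num_groups, submodules):
    """Split channels into groups, run each group's denoiser, then concatenate.

    x          : (frames, channels) audio feature to be processed
    submodules : one callable per group (stand-ins for the sub-modules)
    """
    groups = np.split(x, num_groups, axis=1)         # feature grouping layer
    denoised = [f(g) for f, g in zip(submodules, groups)]
    return np.concatenate(denoised, axis=1)          # feature combination layer

x = np.ones((5, 8))
subs = [lambda g: g * 0.9, lambda g: g * 0.8]        # 2 toy sub-modules
out = grouped_time_denoise(x, num_groups=2, submodules=subs)
```

Each sub-module only sees channels/num_groups channels, which is how the grouping reduces per-module computation and parameters.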
In an embodiment, the step of inputting the spectrogram to be noise-reduced into the encoding module for feature extraction to obtain a plurality of single-layer audio encoding features and target audio encoding features includes:
s211: carrying out Pointwise convolution on the input vector of the kth coding layer by adopting the kth coding layer of the coding module to obtain a first audio characteristic;
s212: acquiring a preset Depthwise convolution time dimension;
s213: if the time dimension of the Depthwise convolution is equal to 1, performing conventional convolution on the first audio characteristic by adopting a kth coding layer to obtain a kth single-layer audio coding characteristic;
s214: if the Depthwise convolution time dimension is equal to 2, performing causal convolution on the first audio feature by adopting a kth coding layer to obtain a kth single-layer audio coding feature;
s215: taking the nth single-layer audio coding feature as the target audio coding feature;
wherein k is an integer greater than 0, k is less than or equal to n, n is greater than 0, n is the number of the coding layers; and when k is equal to 1, taking the spectrogram to be subjected to noise reduction as an input vector of the kth coding layer, and when k is larger than 1, taking the (k-1) th single-layer audio coding feature as an input vector of the kth coding layer.
Conventional depthwise-separable convolution extracts features by performing Depthwise convolution first and then Pointwise convolution; this order does not synthesize channel information well. To address this, the present embodiment reverses the order: Pointwise convolution is performed first to synthesize information across the real-part and imaginary-part channels, Depthwise convolution is then applied, and the audio features obtained from this information synthesis are used for noise reduction, which helps improve the noise reduction effect. Compared with conventional convolution, Pointwise and Depthwise convolutions have fewer parameters, which improves the efficiency of speech noise reduction and further suits application scenarios with limited computing resources and/or high real-time requirements. Moreover, conventional convolution is adopted when the Depthwise convolution time dimension equals 1, and causal convolution when it equals 2, enabling streaming processing and further suiting such scenarios.
Conventional convolution with a kernel size of 3 × 3, 64 input channels, and 128 output channels has 64 × 128 × 3 × 3 parameters in this case.
In Pointwise convolution, the kernel size is 1 × 1 × M, where M is the number of channels of the previous layer. The convolution therefore performs a weighted combination of the previous Feature maps in the depth direction to generate new Feature maps; there are as many output Feature maps as there are convolution kernels.
In Depthwise convolution, one convolution kernel is responsible for one channel, and each channel is convolved by exactly one kernel. For a three-channel color image of 64 × 64 pixels, the convolution is carried out entirely within each two-dimensional plane, and the number of filters equals the number of channels of the previous layer, so one three-channel image yields 3 Feature maps after the operation.
In the above example, the conventional convolution has 64 × 128 × 3 × 3 = 73,728 parameters, while the Pointwise convolution followed by the Depthwise convolution has 64 × 128 + 128 × 3 × 3 = 9,344 parameters, so the parameter count is significantly reduced and the amount of computation is reduced.
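The arithmetic in the example above can be checked directly (biases are ignored, and the Depthwise convolution is assumed to follow the Pointwise convolution, so it operates on 128 channels):

```python
# Parameter counts for the example in the text: 64 input channels,
# 128 output channels, 3x3 spatial kernel (biases ignored).
c_in, c_out, k = 64, 128, 3

conventional = c_in * c_out * k * k   # one dense 3x3 convolution
pointwise = c_in * c_out              # 1x1 channel mixing
depthwise = c_out * k * k             # one 3x3 kernel per channel
separable = pointwise + depthwise

print(conventional, separable)  # 73728 vs 9344: roughly an 8x reduction
```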
And S211, adopting the kth coding layer of the coding module, performing Pointwise convolution on the input vector of the kth coding layer, and taking data obtained by the convolution as first audio features.
For S212, a preset Depthwise convolution time dimension may be obtained from the database, a preset Depthwise convolution time dimension input by the user may also be obtained, and the preset Depthwise convolution time dimension may also be written into the program implementing the present application.
For S213, if the Depthwise convolution time dimension is equal to 1, it means that the time domain convolution kernel is 1, and therefore, the kth coding layer is adopted, the first audio feature is subjected to the conventional convolution, and data obtained by the conventional convolution is used as the kth single-layer audio coding feature.
For S214, if the Depthwise convolution time dimension is equal to 2, it means that the time domain convolution kernel is 2, so that in the streaming inference, when calculating the current frame, information of the previous frame needs to be recorded, therefore, the kth encoding layer is adopted to perform causal convolution on the first audio feature, and data obtained by the causal convolution is used as the kth single-layer audio encoding feature.
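A minimal sketch of streaming causal convolution with a time-kernel of 2: only the cached previous frame and the current frame are needed, matching the frame-recording behavior described above (the function name and the per-channel kernel form are illustrative assumptions):

```python
import numpy as np

def streaming_causal_step(frame, prev_frame, kernel):
    """One streaming step of a causal convolution with time-kernel 2.

    The output depends only on the current frame and the cached previous
    frame, so no future lookahead is required.
    frame, prev_frame : (channels,) vectors;  kernel : (2, channels)
    """
    out = kernel[0] * prev_frame + kernel[1] * frame
    return out, frame  # the current frame becomes the new cache

kernel = np.array([[0.5, 0.5], [1.0, 1.0]])  # (time=2, channels=2)
cache = np.zeros(2)                          # zero state before frame 1
y1, cache = streaming_causal_step(np.array([2.0, 4.0]), cache, kernel)
y2, cache = streaming_causal_step(np.array([6.0, 8.0]), cache, kernel)
```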
Causal convolution refers to convolution in which the output for the current frame depends only on the current frame and previous frames, with no dependence on future frames.
For S215, the nth single-layer audio coding feature is taken as the target audio coding feature, that is, the audio feature extracted by the last coding layer in the coding module is taken as the target audio coding feature.
When k is equal to 1, the spectrogram to be noise-reduced is taken as the input vector of the kth coding layer; that is, the input vector of the 1st coding layer is the input vector of the coding module. When k is greater than 1, the (k-1)th single-layer audio coding feature is taken as the input vector of the kth coding layer; that is, the input vector of each coding layer after the 1st is the output vector of the previous coding layer.
In an embodiment, the step of inputting each single-layer audio coding feature and the audio feature to be decoded into the decoding module for decoding to obtain a spectrogram to be analyzed includes:
s261: performing dimensionality reduction on the (n + 1) -m single-layer audio coding features to obtain coding features to be processed, wherein m is an integer larger than 0, and m is smaller than or equal to n;
s262: adding the output vector of the m-1 th decoding layer and the element value of the same position of the coding feature to be processed to obtain the m-th feature to be processed;
s263: performing deconvolution processing on the mth feature to be processed to obtain an mth single-layer decoding feature;
s264: taking the nth single-layer decoding feature as the spectrogram to be analyzed;
and when m is equal to 1, the audio features to be decoded are taken as the output vector of the m-1 th decoding layer, and when m is larger than 1, the m-1 th single-layer decoding features are taken as the output vector of the m-1 th decoding layer.
In this embodiment, through the dimension reduction processing, the addition of element values at the same positions as the output vector of the (m-1)th decoding layer, and the deconvolution processing, the doubling of the channel count caused by the fusion (Concat) method is avoided, reducing the number of network parameters and the amount of computation, which further suits application scenarios with limited computing resources and/or high real-time requirements; this embodiment also performs better than the direct Skip method.
And S261, performing Pointwise convolution on the (n + 1-m) th single-layer audio coding feature to realize dimension reduction processing, and taking data obtained through the dimension reduction processing as a coding feature to be processed.
When the (n+1-m)th single-layer audio coding feature is subjected to Pointwise convolution, a 1 × 1 convolution is adopted.
And for S262, adding the output vector of the m-1 th decoding layer and the coding feature to be processed at the same position, and taking the data obtained by the addition as the m-th feature to be processed.
For example, the values of the elements in the row b and column c of the output vector of the m-1 th decoding layer are added to the values of the elements in the row b and column c of the coding feature to be processed, and the added data is used as the values of the elements in the row b and column c of the feature to be processed.
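A toy comparison of the element-wise addition used here against Concat fusion, illustrating why addition keeps the channel count (and thus the downstream parameter count) lower:

```python
import numpy as np

# Element-wise addition skip vs. Concat fusion: addition keeps the channel
# count unchanged, while Concat doubles it (and so doubles the channels the
# next deconvolution layer must process).
dec_out = np.ones((4, 16))       # output vector of decoding layer m-1
enc_feat = 2 * np.ones((4, 16))  # dimension-reduced encoder feature

added = dec_out + enc_feat                       # this application's fusion
concat = np.concatenate([dec_out, enc_feat], 1)  # conventional Concat fusion
```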
And for S263, performing deconvolution processing on the mth feature to be processed, and taking data obtained through deconvolution processing as the mth single-layer decoding feature.
For S264, the nth single-layer decoding feature is used as the spectrogram to be analyzed, so that the data output by the last decoding layer of the decoding module is used as the spectrogram to be analyzed.
In an embodiment, before the step of inputting the spectrogram to be denoised into a preset denoising model for denoising, and obtaining a denoised spectrogram, the method further includes:
s0211: obtaining a plurality of training samples and a model to be trained;
s0212: training the model to be trained according to each training sample and a preset target function until a preset model training end condition is reached, and taking the model to be trained reaching the model training end condition as the noise reduction model;
wherein the objective function S is expressed as: S = SISNR + MSE loss + perceptual loss + regularization term, where SISNR is the scale-invariant signal-to-noise ratio loss of the speech, the MSE loss is calculated from the mean square error of the real part of the spectrogram, the mean square error of the imaginary part of the spectrogram, and the mean square error of the spectrogram magnitude spectrum, and the perceptual loss is the perceptual loss of the speech.
This embodiment adopts the signal-to-noise ratio loss, the loss calculated from the mean square errors of the real part, imaginary part, and magnitude spectrum of the spectrogram, and the perceptual loss, so that model training fully considers the real part, imaginary part, magnitude, and perceptual information of the spectrogram, thereby improving the noise reduction capability of the model.
For S0211, a plurality of training samples input by the user may be obtained, a plurality of training samples may be obtained from a database, or a plurality of training samples may be obtained from a third-party application.
Each of the plurality of training samples comprises: a spectrum sample graph, a spectrogram calibration result, a clean speech calibration result, and a perceptual data calibration result. The spectrum sample graph is obtained by performing a short-time Fourier transform on a speech sample; the clean speech calibration result is the accurate clean speech corresponding to the spectrum sample graph; the spectrogram calibration result is the accurate spectrogram corresponding to that accurate clean speech; and the perceptual data calibration result is the accurate perceptual data corresponding to that accurate clean speech.
The speech sample is speech obtained by mixing clean speech with noise.
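One common way to build such mixed speech samples is to scale the noise so that a target signal-to-noise ratio is met; this sketch is an assumption about the mixing procedure, which the application does not specify:

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Mix clean speech with noise at a target SNR (dB) to build a sample."""
    noise = noise[: len(clean)]
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return clean + scale * noise

rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # 1 s, 440 Hz tone
noise = rng.standard_normal(16000)
noisy = mix_at_snr(clean, noise, snr_db=5.0)
```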
For S0212, the training sample from one of a plurality of training samples is taken as a target training sample; inputting the frequency spectrum sample graph of the target training sample into the model to be trained for noise reduction processing to obtain a frequency spectrum graph prediction result; reconstructing a voice signal according to the spectrogram prediction result to obtain a clean voice prediction result; inputting the spectrogram prediction result, the clean voice calibration result of the target training sample, the spectrogram calibration result and the perception data calibration result into the target function to calculate a loss value; updating the network parameters of the model to be trained by adopting the loss values obtained by calculation, and using the updated model to be trained for calculating the spectrogram prediction result next time; repeating the step of using one training sample in the plurality of training samples as a target training sample until the model training end condition is reached; and taking the model to be trained which reaches the model training end condition as the noise reduction model.
The model training end condition includes: the loss value of the model to be trained reaches a first convergence condition, or the number of training iterations of the model to be trained reaches a second convergence condition.
The first convergence condition means that the loss value of the model to be trained no longer decreases between two adjacent calculations.
The second convergence condition means that the training indicator no longer improves. For example, the training indicator may be the signal-to-noise ratio loss.
After the speech is subjected to a short-time Fourier transform, real and imaginary components are obtained. The real part of the spectrogram refers to the real component; the imaginary part of the spectrogram refers to the imaginary component.
Short-time fourier transform, a general tool for speech signal processing, defines a very useful class of time and frequency distributions that specify the complex amplitude of an arbitrary signal over time and frequency. The spectrogram magnitude spectrum is a complex magnitude obtained by short-time Fourier transform.
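A sketch of the MSE loss described above, combining the mean square errors of the real part, imaginary part, and magnitude spectrum (equal weighting is an assumption; the application does not specify the weights):

```python
import numpy as np

def spectrogram_mse_loss(pred, target):
    """MSE loss combining real part, imaginary part, and magnitude spectrum.

    pred, target : complex STFT spectrograms of equal shape
    """
    mse_real = np.mean((pred.real - target.real) ** 2)
    mse_imag = np.mean((pred.imag - target.imag) ** 2)
    mse_mag = np.mean((np.abs(pred) - np.abs(target)) ** 2)
    return mse_real + mse_imag + mse_mag

target = np.array([[1 + 1j, 2 + 0j], [0 + 2j, 1 - 1j]])
loss = spectrogram_mse_loss(target * 0.9, target)  # small mismatch
```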
The regularization term is self-defined; it is an L2-norm regularization applied to the weight values in the function corresponding to the signal-to-noise ratio loss and the function corresponding to the MSE loss. By adding the regularization term to the objective function, the model tends toward smaller parameter values during gradient descent, which reduces the model's flexibility and can alleviate overfitting to a certain extent.
The L2 norm, is the euclidean norm.
SISNR, whose full English name is scale-invariant signal-to-noise ratio, is a signal-to-noise ratio that is unaffected by rescaling of the signal. The loss function for SISNR is not described herein.
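Although the SISNR loss function is not given here, a standard scale-invariant SNR can be sketched as follows (the negative of this value would serve as the training loss; the mean removal and epsilon guard are conventional choices, not specified by the application):

```python
import numpy as np

def si_snr(estimate, reference, eps=1e-8):
    """Scale-invariant SNR in dB; the training loss would be its negative."""
    estimate = estimate - np.mean(estimate)
    reference = reference - np.mean(reference)
    # project the estimate onto the reference (the scale-invariant target)
    target = np.dot(estimate, reference) * reference / (np.dot(reference, reference) + eps)
    noise = estimate - target
    return 10 * np.log10((np.sum(target ** 2) + eps) / (np.sum(noise ** 2) + eps))

rng = np.random.default_rng(1)
ref = np.sin(np.linspace(0, 20, 1000))
est = ref + 0.05 * rng.standard_normal(1000)
a = si_snr(est, ref)
b = si_snr(2.0 * est, ref)  # rescaling the estimate leaves SI-SNR unchanged
```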
Perceptual loss includes: LMS (Log Mel Spectra) loss and PMSQE (Perceptual Metric for Speech Quality Evaluation) loss.
In another embodiment of the present application, the objective function S is expressed as: s = SISNR + MSE loss + regularization term.
Referring to fig. 2, the present application further provides an artificial intelligence-based speech noise reduction apparatus, the apparatus comprising:
the data acquisition module 100 is configured to acquire a spectrogram to be denoised corresponding to a voice to be denoised;
the denoising module 200 is configured to input the spectrogram to be denoised into a preset denoising model for denoising, so as to obtain a denoised spectrogram, where the denoising model sequentially includes: an encoding module, a frequency domain noise reduction module, a time domain noise reduction module, a decoding module, and a mask gain and reduction module, wherein the frequency domain noise reduction module is a module that implements a multi-head self-attention mechanism using adjacent sub-band information, and the time domain noise reduction module comprises: at least two time domain noise reduction sub-modules;
and a speech signal reconstruction module 300, configured to perform speech signal reconstruction on the noise-reduced spectrogram to obtain a target clean speech.
This embodiment uses the noise reduction model to sequentially perform feature extraction, frequency domain noise reduction, time domain noise reduction, decoding, and mask gain and reduction. Performing frequency domain noise reduction followed by time domain noise reduction achieves effective noise reduction and improves the noise reduction effect; processing frequency domain noise reduction and time domain noise reduction separately decouples the time domain from the frequency domain, which facilitates streaming speech noise reduction; the time domain noise reduction module adopts at least two time domain noise reduction sub-modules, realizing grouped time domain noise reduction and reducing the amount of computation and the number of network parameters, which suits application scenarios with limited computing resources and/or high real-time requirements; and the frequency domain noise reduction module implements a multi-head self-attention mechanism using adjacent sub-band information, which helps improve the noise reduction effect.
Referring to fig. 3, a computer device is also provided in an embodiment of the present application; the computer device may be a server, and its internal structure may be as shown in fig. 3. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. The processor of the computer device is used to provide computation and control capabilities. The memory of the computer device comprises a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for running the operating system and the computer program in the non-volatile storage medium. The database of the computer device is used for storing data such as data for the artificial intelligence based speech noise reduction method. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by the processor to implement an artificial intelligence based speech noise reduction method, which comprises the following steps: acquiring a spectrogram to be denoised corresponding to the speech to be denoised; inputting the spectrogram to be denoised into a preset denoising model for denoising to obtain a denoised spectrogram, wherein the denoising model sequentially comprises: an encoding module, a frequency domain noise reduction module, a time domain noise reduction module, a decoding module, and a mask gain and reduction module, the frequency domain noise reduction module being a module that implements a multi-head self-attention mechanism using adjacent sub-band information, and the time domain noise reduction module comprising: at least two time domain noise reduction sub-modules; and reconstructing a speech signal from the denoised spectrogram to obtain target clean speech.
This embodiment uses the noise reduction model to sequentially perform feature extraction, frequency domain noise reduction, time domain noise reduction, decoding, and mask gain and reduction. Performing frequency domain noise reduction followed by time domain noise reduction achieves effective noise reduction and improves the noise reduction effect; processing frequency domain noise reduction and time domain noise reduction separately decouples the time domain from the frequency domain, which facilitates streaming speech noise reduction; the time domain noise reduction module adopts at least two time domain noise reduction sub-modules, realizing grouped time domain noise reduction and reducing the amount of computation and the number of network parameters, which suits application scenarios with limited computing resources and/or high real-time requirements; and the frequency domain noise reduction module implements a multi-head self-attention mechanism using adjacent sub-band information, which helps improve the noise reduction effect.
An embodiment of the present application further provides a computer-readable storage medium, on which a computer program is stored, and the computer program, when executed by a processor, implements an artificial intelligence based speech noise reduction method, including the steps of: acquiring a spectrogram to be denoised corresponding to the voice to be denoised; inputting the spectrogram to be denoised into a preset denoising model to perform denoising processing to obtain a denoised spectrogram, wherein the denoising model sequentially comprises: the system comprises an encoding module, a frequency domain noise reduction module, a time domain noise reduction module, a decoding module and a mask gain and reduction module, wherein the frequency domain noise reduction module is a module for realizing a multi-head self-attention mechanism by using adjacent sub-band information, and the time domain noise reduction module comprises: at least two time domain noise reduction sub-modules; and reconstructing a voice signal of the denoised spectrogram to obtain a target clean voice.
The artificial intelligence based speech noise reduction method uses the noise reduction model to sequentially perform feature extraction, frequency domain noise reduction, time domain noise reduction, decoding, and mask gain and reduction. Performing frequency domain noise reduction followed by time domain noise reduction achieves effective noise reduction and improves the noise reduction effect; processing frequency domain noise reduction and time domain noise reduction separately decouples the time domain from the frequency domain, which facilitates streaming speech noise reduction; the time domain noise reduction module adopts at least two time domain noise reduction sub-modules, realizing grouped time domain noise reduction and reducing the amount of computation and the number of network parameters, which suits application scenarios with limited computing resources and/or high real-time requirements; and the frequency domain noise reduction module implements a multi-head self-attention mechanism using adjacent sub-band information, which helps improve the noise reduction effect.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing the relevant hardware. The computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, a database, or another medium provided herein and used in the embodiments may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link (Synchlink) DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, apparatus, article, or method that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, apparatus, article, or method. Without further limitation, an element introduced by the phrase "comprising a …" does not exclude the presence of other identical elements in the process, apparatus, article, or method that comprises the element.
The above description is only a preferred embodiment of the present application and is not intended to limit its scope; all equivalent structural and equivalent process modifications made using the contents of the specification and drawings of the present application, whether applied directly or indirectly in other related technical fields, likewise fall within the scope of protection of the present application.

Claims (10)

1. An artificial intelligence based speech noise reduction method, the method comprising:
acquiring a spectrogram to be denoised corresponding to the voice to be denoised;
inputting the spectrogram to be denoised into a preset noise reduction model for noise reduction to obtain a denoised spectrogram, wherein the noise reduction model comprises, in sequence: an encoding module, a frequency domain noise reduction module, a time domain noise reduction module, a decoding module, and a mask gain and reduction module, wherein the frequency domain noise reduction module is a module that implements a multi-head self-attention mechanism using adjacent sub-band information, and the time domain noise reduction module comprises at least two time domain noise reduction sub-modules;
and reconstructing a voice signal from the denoised spectrogram to obtain a target clean voice.
2. The artificial intelligence-based speech noise reduction method according to claim 1, wherein the step of inputting the spectrogram to be noise-reduced into a preset noise reduction model for noise reduction to obtain a noise-reduced spectrogram comprises:
inputting the spectrogram to be denoised into the coding module for feature extraction to obtain a plurality of single-layer audio coding features and a target audio coding feature;
inputting the target audio coding features into the frequency domain noise reduction module for frequency domain noise reduction to obtain frequency domain noise-reduced audio features;
performing residual connection on the target audio coding feature and the frequency domain noise-reduced audio feature to obtain an audio feature to be processed;
inputting the audio features to be processed into the time domain noise reduction module for feature grouping, grouped time domain noise reduction, and feature splicing to obtain time domain noise-reduced audio features;
performing residual connection on the frequency domain noise-reduced audio feature and the time domain noise-reduced audio feature to obtain an audio feature to be decoded;
inputting each single-layer audio coding feature and the audio feature to be decoded into the decoding module for decoding to obtain a spectrogram to be analyzed;
and inputting the spectrogram to be analyzed into the mask gain and reduction module for masking to obtain the noise-reduced spectrogram.
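The "mask gain and reduction" step of claim 2 is not spelled out in this excerpt; one plausible reading, sketched below in NumPy, is that the decoder output is squashed into a bounded gain mask and multiplied onto the noisy spectrogram. The sigmoid squashing and the `max_gain` bound are assumptions for illustration, not details taken from the patent.

```python
import numpy as np

def mask_and_restore(noisy_spec, mask_logits, max_gain=1.0):
    """Hypothetical masking step: squash the decoder output into a gain
    in (0, max_gain) per time-frequency bin, then apply it to the noisy
    spectrogram to obtain the denoised spectrogram."""
    mask = max_gain / (1.0 + np.exp(-mask_logits))   # sigmoid gain
    return mask * noisy_spec

rng = np.random.default_rng(6)
spec = rng.normal(size=(20, 16)) ** 2     # non-negative magnitude spectrogram
logits = rng.normal(size=(20, 16))        # stands in for the decoder output
out = mask_and_restore(spec, logits)
assert out.shape == spec.shape
assert np.all(out >= 0) and np.all(out <= spec + 1e-12)
```

Because the gain is bounded by `max_gain=1`, the masked spectrogram can only attenuate, never amplify, each bin of the noisy input.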
3. The artificial intelligence based speech noise reduction method of claim 2, wherein the step of inputting the target audio coding feature into the frequency domain noise reduction module for frequency domain noise reduction to obtain a frequency domain noise-reduced audio feature comprises:
performing dimension reduction processing on the target audio coding feature by using a dimension reduction submodule of the frequency domain noise reduction module to obtain a dimension-reduced coding feature;
performing frequency domain noise reduction on the dimension-reduced coding feature by using a multi-head self-attention submodule of the frequency domain noise reduction module to obtain a coding feature to be dimension-increased, wherein the multi-head self-attention submodule is a module that implements a multi-head self-attention mechanism, and the Query, Key, and Value of its multi-head self-attention mechanism are data determined according to a preset dependent frequency band width and adjacent sub-band information;
and performing dimension increasing processing on the coding feature to be dimension-increased by using a dimension increasing submodule of the frequency domain noise reduction module to obtain the frequency domain noise-reduced audio feature.
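Claim 3 restricts the multi-head self-attention of the frequency domain module to adjacent sub-band information within a preset dependent band width. A minimal NumPy sketch of such band-limited attention follows; the identity Query/Key/Value projections and the specific band width and head count are illustrative assumptions (a real module would use learned projections and the dimension reduction/increase submodules around it).

```python
import numpy as np

def banded_self_attention(x, band=2, heads=2):
    """Self-attention along the frequency axis where each sub-band only
    attends to sub-bands within `band` positions of itself.
    x: (freq, dim); projections are identity here for clarity."""
    F, D = x.shape
    assert D % heads == 0
    dh = D // heads
    out = np.zeros_like(x)
    idx = np.arange(F)
    mask = np.abs(idx[:, None] - idx[None, :]) > band  # too-distant bands
    for h in range(heads):
        q = k = v = x[:, h * dh:(h + 1) * dh]          # identity Q, K, V
        scores = q @ k.T / np.sqrt(dh)                 # (F, F) similarities
        scores[mask] = -np.inf                         # keep adjacent bands only
        w = np.exp(scores - scores.max(axis=1, keepdims=True))
        w /= w.sum(axis=1, keepdims=True)              # row-wise softmax
        out[:, h * dh:(h + 1) * dh] = w @ v
    return out

x = np.random.default_rng(1).normal(size=(16, 8))
y = banded_self_attention(x, band=2, heads=2)
print(y.shape)
```

Limiting attention to a fixed band width keeps the per-frame cost linear in the number of sub-bands rather than quadratic, which fits the resource-constrained scenarios the description targets.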
4. The artificial intelligence based speech noise reduction method according to claim 2, wherein the step of inputting the audio features to be processed into the time domain noise reduction module for feature grouping, grouped time domain noise reduction, and feature splicing to obtain time domain noise-reduced audio features comprises:
dividing the audio features to be processed by adopting a feature grouping layer of the time domain noise reduction module to obtain a plurality of single group audio features, wherein the number of the single group audio features is the same as that of the time domain noise reduction submodules;
inputting the ith single group of audio features into the ith time domain noise reduction submodule for time domain noise reduction to obtain the ith audio feature to be combined, wherein i is an integer greater than 0;
and performing feature splicing on each audio feature to be combined by using the feature combination layer of the time domain noise reduction module to obtain the time domain noise-reduced audio feature.
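The grouping, per-group denoising, and splicing of claim 4 can be sketched as follows; the causal moving average standing in for each time domain noise reduction sub-module is an illustrative assumption.

```python
import numpy as np

def grouped_time_denoise(x, n_groups=4):
    """Split (time, channels) features into n_groups along the channel
    axis, run a small per-group 'sub-module' (here: a causal moving
    average along time), then splice the groups back together."""
    groups = np.array_split(x, n_groups, axis=1)      # feature grouping
    out = []
    for g in groups:                                  # per-group denoising
        smoothed = np.copy(g)
        smoothed[1:] = 0.5 * g[1:] + 0.5 * g[:-1]     # causal: past frames only
        out.append(smoothed)
    return np.concatenate(out, axis=1)                # feature splicing

x = np.random.default_rng(2).normal(size=(50, 32))
y = grouped_time_denoise(x, n_groups=4)
assert y.shape == x.shape
```

Splitting C channels into G groups also shrinks any per-group dense or recurrent sub-module from roughly O(C²) to O(C²/G) parameters, which is the computation and parameter saving the description credits to grouped time domain noise reduction.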
5. The artificial intelligence based speech noise reduction method according to claim 2, wherein the step of inputting the spectrogram to be noise reduced into the coding module for feature extraction to obtain a plurality of single-layer audio coding features and target audio coding features comprises:
performing Pointwise convolution on the input vector of the kth coding layer by using the kth coding layer of the coding module to obtain a first audio feature;
acquiring a preset Depthwise convolution time dimension;
if the Depthwise convolution time dimension is equal to 1, performing conventional convolution on the first audio feature by using the kth coding layer to obtain a kth single-layer audio coding feature;
if the Depthwise convolution time dimension is equal to 2, performing causal convolution on the first audio feature by adopting a kth coding layer to obtain a kth single-layer audio coding feature;
taking the nth single-layer audio coding feature as the target audio coding feature;
wherein k is an integer greater than 0, k is less than or equal to n, n is greater than 0, n is the number of the coding layers; and when k is equal to 1, taking the spectrogram to be subjected to noise reduction as an input vector of the kth coding layer, and when k is larger than 1, taking the (k-1) th single-layer audio coding feature as an input vector of the kth coding layer.
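Claim 5's encoding layer combines a pointwise (1x1) convolution with a depthwise convolution whose time-kernel size selects between conventional and causal convolution; a causal kernel of time dimension 2 only looks at the current and previous frame, so no future lookahead is needed for streaming. The NumPy sketch below illustrates the two operations with random weights; the shapes and channel counts are assumptions.

```python
import numpy as np

def pointwise_conv(x, w):
    """1x1 (pointwise) convolution: mix channels at each time step.
    x: (time, c_in), w: (c_in, c_out)."""
    return x @ w

def depthwise_causal_conv(x, k):
    """Depthwise convolution with time-kernel size 2, applied causally:
    each output frame depends only on the current and previous frame.
    x: (time, c), k: (2, c) per-channel taps."""
    x_pad = np.vstack([np.zeros((1, x.shape[1])), x])  # causal left-pad
    return k[0] * x_pad[:-1] + k[1] * x_pad[1:]

rng = np.random.default_rng(3)
x = rng.normal(size=(10, 4))   # (time, c_in)
w = rng.normal(size=(4, 6))    # pointwise weights
k = rng.normal(size=(2, 6))    # depthwise taps, one pair per channel
y = depthwise_causal_conv(pointwise_conv(x, w), k)
print(y.shape)
```

The pointwise-then-depthwise factorization needs far fewer multiplications than a full convolution over all channel-time pairs, which again suits low-resource, real-time deployment.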
6. The artificial intelligence based speech noise reduction method according to claim 4, wherein the step of inputting each of the single-layer audio coding features and the audio features to be decoded into the decoding module for decoding to obtain the spectrogram to be analyzed comprises:
performing dimension reduction processing on the (n+1-m)th single-layer audio coding feature to obtain a coding feature to be processed, wherein m is an integer greater than 0 and m is less than or equal to n;
adding the output vector of the (m-1)th decoding layer to the coding feature to be processed element by element at the same positions to obtain the mth feature to be processed;
performing deconvolution processing on the mth feature to be processed to obtain an mth single-layer decoding feature;
taking the nth single-layer decoding feature as the spectrogram to be analyzed;
wherein, when m is equal to 1, the audio feature to be decoded is taken as the output vector of the (m-1)th decoding layer, and when m is greater than 1, the (m-1)th single-layer decoding feature is taken as the output vector of the (m-1)th decoding layer.
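One decoding step of claim 6, element-wise addition of the previous decoder output to a dimension-matched encoder skip feature followed by deconvolution, can be sketched as follows; the nearest-neighbour upsampling standing in for a learned transposed convolution is an illustrative simplification.

```python
import numpy as np

def decode_layer(prev_out, skip_feat):
    """One toy decoding step: add the previous decoder output to the
    (dimension-reduced) encoder skip feature at the same positions,
    then upsample along frequency as a stand-in for deconvolution."""
    fused = prev_out + skip_feat          # same-position element addition
    return np.repeat(fused, 2, axis=1)    # upsample frequency axis by 2

rng = np.random.default_rng(4)
x = rng.normal(size=(10, 8))              # audio feature to be decoded
skip = rng.normal(size=(10, 8))           # dimension-reduced encoder feature
y = decode_layer(x, skip)
print(y.shape)
```

Reusing the (n+1-m)th encoder feature at the mth decoder layer is the usual U-Net-style pairing: the shallowest encoder output feeds the deepest decoder layer, restoring detail lost during encoding.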
7. The artificial intelligence-based speech noise reduction method according to claim 1, wherein before the step of inputting the spectrogram to be noise-reduced into a preset noise reduction model for noise reduction processing to obtain a noise-reduced spectrogram, the method further comprises:
obtaining a plurality of training samples and a model to be trained;
training the model to be trained according to each training sample and a preset target function until a preset model training end condition is reached, and taking the model to be trained reaching the model training end condition as the noise reduction model;
wherein the objective function S is expressed as: S = SISNR + MSE loss + perceptual loss + regularization term, where SISNR is the scale-invariant signal-to-noise ratio loss of the speech, the MSE loss is calculated from the mean-square error of the real part of the spectrogram, the mean-square error of the imaginary part of the spectrogram, and the mean-square error of the spectrogram magnitude spectrum, and the perceptual loss is the perceptual loss of the speech.
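The SISNR term of claim 7 is conventionally the scale-invariant signal-to-noise ratio, computed by projecting the estimate onto the reference and comparing the energies of the projection and the residual; a training loss would use its negative. The NumPy sketch below follows that convention; the zero-meaning step and the `eps` constant are standard but assumed details not stated in the patent.

```python
import numpy as np

def si_snr(est, ref, eps=1e-8):
    """Scale-invariant SNR in dB (higher is better). Both signals are
    zero-meaned, then the estimate is split into its projection onto
    the reference and a residual 'noise' component."""
    est = est - est.mean()
    ref = ref - ref.mean()
    proj = (est @ ref) / (ref @ ref + eps) * ref   # component along ref
    noise = est - proj                              # everything else
    return 10 * np.log10((proj @ proj + eps) / (noise @ noise + eps))

rng = np.random.default_rng(5)
clean = rng.normal(size=1000)
assert si_snr(3.0 * clean, clean) > 100            # scaling does not hurt
assert si_snr(clean + rng.normal(size=1000), clean) < 10
```

Scale invariance is the point: an estimate that is a perfectly rescaled copy of the clean speech scores as near-perfect, so the model is not penalized for overall gain differences.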
8. An artificial intelligence based speech noise reduction apparatus, the apparatus comprising:
the data acquisition module is used for acquiring a spectrogram to be denoised corresponding to the voice to be denoised;
and the noise reduction processing module is used for inputting the spectrogram to be noise-reduced into a preset noise reduction model for noise reduction to obtain a noise-reduced spectrogram, wherein the noise reduction model comprises, in sequence: an encoding module, a frequency domain noise reduction module, a time domain noise reduction module, a decoding module, and a mask gain and reduction module, wherein the frequency domain noise reduction module is a module that implements a multi-head self-attention mechanism using adjacent sub-band information, and the time domain noise reduction module comprises at least two time domain noise reduction sub-modules;
and the voice signal reconstruction module is used for reconstructing the voice signal of the denoised spectrogram to obtain target clean voice.
9. A computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor implements the steps of the method of any one of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 7.
CN202210231278.4A 2022-03-10 2022-03-10 Speech noise reduction method, device and equipment based on artificial intelligence and storage medium Pending CN114694674A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210231278.4A CN114694674A (en) 2022-03-10 2022-03-10 Speech noise reduction method, device and equipment based on artificial intelligence and storage medium

Publications (1)

Publication Number Publication Date
CN114694674A true CN114694674A (en) 2022-07-01

Family

ID=82140032

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210231278.4A Pending CN114694674A (en) 2022-03-10 2022-03-10 Speech noise reduction method, device and equipment based on artificial intelligence and storage medium

Country Status (1)

Country Link
CN (1) CN114694674A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114171038A (en) * 2021-12-10 2022-03-11 北京百度网讯科技有限公司 Voice noise reduction method, device, equipment, storage medium and program product
CN114171038B (en) * 2021-12-10 2023-07-28 北京百度网讯科技有限公司 Voice noise reduction method, device, equipment and storage medium
WO2024090778A1 (en) * 2022-10-26 2024-05-02 삼성전자주식회사 Electronic device for separating audio object from audio data and control method thereof

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination