CN114067826A - Voice noise reduction method, device, equipment and storage medium

Voice noise reduction method, device, equipment and storage medium

Info

Publication number
CN114067826A
Authority
CN
China
Prior art keywords
noise reduction
spectrogram
module
voice
layer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210054737.6A
Other languages
Chinese (zh)
Other versions
CN114067826B (en)
Inventor
李杰
王广新
杨汉丹
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Youjie Zhixin Technology Co ltd
Original Assignee
Shenzhen Youjie Zhixin Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Youjie Zhixin Technology Co ltd filed Critical Shenzhen Youjie Zhixin Technology Co ltd
Priority to CN202210054737.6A priority Critical patent/CN114067826B/en
Publication of CN114067826A publication Critical patent/CN114067826A/en
Application granted granted Critical
Publication of CN114067826B publication Critical patent/CN114067826B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G10L21/0224 Processing in the time domain
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G10L21/0232 Processing in the frequency domain

Abstract

The application relates to a voice noise reduction method, device, equipment and storage medium, wherein the method comprises the following steps: acquiring a spectrogram to be denoised corresponding to the voice to be denoised; inputting the spectrogram to be denoised into a preset voice denoising model for denoising processing to obtain a denoised spectrogram, wherein the voice denoising model sequentially comprises: an encoding module, a frequency domain noise reduction module, a time domain noise reduction module, a decoding module and a mask enhancing and suppressing module; and reconstructing a voice signal from the denoised spectrogram to obtain the target voice. The voice noise reduction model sequentially performs feature extraction, frequency domain noise reduction, time domain noise reduction, decoding, and mask enhancement and suppression. Performing frequency domain noise reduction followed by time domain noise reduction achieves effective noise reduction, improves the noise reduction effect and the accuracy of the denoised voice, and improves the user experience. Because the frequency domain noise reduction and the time domain noise reduction are processed separately, the time domain and the frequency domain are decoupled, which facilitates streaming voice noise reduction.

Description

Voice noise reduction method, device, equipment and storage medium
Technical Field
The present application relates to the field of speech noise reduction technologies, and in particular, to a speech noise reduction method, apparatus, device, and storage medium.
Background
Speech usually contains noise, and when noisy speech is used in a practical scenario, the accuracy of the speech application decreases and the user experience suffers. In the prior art, a filter is used to denoise the speech, but a filter's ability to suppress noise is limited.
Disclosure of Invention
The present application mainly aims to provide a method, an apparatus, a device and a storage medium for speech noise reduction, and aims to solve the technical problems that in the prior art, a filter is used for noise reduction of speech, and the suppression effect of the filter on noise is limited.
In order to achieve the above object, the present application provides a method for reducing noise in speech, the method comprising:
acquiring a spectrogram to be denoised corresponding to the voice to be denoised;
inputting the spectrogram to be denoised into a preset voice denoising model for denoising processing to obtain a denoised spectrogram, wherein the voice denoising model sequentially comprises: an encoding module, a frequency domain noise reduction module, a time domain noise reduction module, a decoding module and a mask enhancing and suppressing module;
and reconstructing a voice signal of the denoised spectrogram to obtain target voice.
Further, the step of obtaining a spectrogram to be denoised corresponding to the voice to be denoised includes:
acquiring the voice to be denoised;
carrying out short-time Fourier transform on the voice to be denoised to obtain a spectrogram to be processed;
and performing direct current component removal processing on the spectrogram to be processed to obtain the spectrogram to be denoised.
Further, the step of inputting the spectrogram to be denoised into a preset speech denoising model for denoising to obtain a denoised spectrogram includes:
inputting the spectrogram to be denoised into the coding module for feature extraction to obtain a plurality of single-layer audio feature vectors and a target audio feature vector;
inputting the target audio characteristic vector into the frequency domain noise reduction module for frequency domain noise reduction to obtain a frequency domain noise-reduced audio characteristic vector, wherein the frequency domain noise reduction module is a module obtained based on a multi-head self-attention mechanism;
residual error connection is carried out on the target audio characteristic vector and the frequency domain noise-reduced audio characteristic vector to obtain an audio characteristic vector to be processed;
inputting the audio characteristic vector to be processed into the time domain noise reduction module for time domain noise reduction to obtain a time domain noise-reduced audio characteristic vector, wherein the time domain noise reduction module is a module obtained based on a long-short term memory artificial neural network;
residual error connection is carried out on the target audio characteristic vector and the time-domain noise-reduced audio characteristic vector to obtain an audio characteristic vector to be decoded;
inputting each single-layer audio characteristic vector and the audio characteristic vector to be decoded into the decoding module for decoding to obtain a spectrogram to be analyzed;
and inputting the spectrogram to be analyzed into the mask enhancing and suppressing module for masking to obtain the noise-reduced spectrogram.
Further, the encoding module includes n encoding layers, and the step of inputting the spectrogram to be denoised into the encoding module for feature extraction to obtain a plurality of single-layer audio feature vectors and a target audio feature vector includes:
inputting the input vector of the kth coding layer into the kth coding layer of the coding module to sequentially carry out Pointwise convolution and Depthwise convolution to obtain the kth single-layer audio characteristic vector;
taking the nth single-layer audio feature vector as the target audio feature vector;
wherein when k is equal to 1, the spectrogram to be denoised is taken as the input vector of the kth coding layer, and when k is greater than 1, the (k-1)-th single-layer audio feature vector is taken as the input vector of the kth coding layer.
Further, the step of inputting the input vector of the kth coding layer into the kth coding layer of the coding module to sequentially perform Pointwise convolution and Depthwise convolution to obtain the kth single-layer audio feature vector includes:
performing Pointwise convolution on the input vector of the kth coding layer by adopting the kth coding layer to obtain a first audio feature vector;
acquiring a preset Depthwise convolution time dimension;
if the time dimension of the Depthwise convolution is equal to 1, performing conventional convolution on the first audio feature vector by adopting a kth coding layer to obtain a kth single-layer audio feature vector;
and if the time dimension of the Depthwise convolution is equal to 2, performing causal convolution on the first audio feature vector by adopting the kth coding layer to obtain the kth single-layer audio feature vector.
Further, the decoding module includes n decoding layers, and the step of inputting each single-layer audio feature vector and the audio feature vector to be decoded into the decoding module for decoding to obtain a spectrogram to be analyzed includes:
sequentially performing vector connection, Pointwise convolution and deconvolution on the output vector of the (m-1)-th decoding layer and the (n+1-m)-th single-layer audio feature vector to obtain the m-th vector to be decoded, wherein m is an integer greater than 0, and m is less than or equal to n;
inputting the mth vector to be decoded into the mth decoding layer for decoding to obtain mth single-layer decoding data;
taking the nth single-layer decoding data as the spectrogram to be analyzed;
and when m is equal to 1, the audio feature vector to be decoded is taken as the output vector of the (m-1)-th decoding layer, and when m is greater than 1, the (m-1)-th single-layer decoding data is taken as the output vector of the (m-1)-th decoding layer.
Further, before the step of inputting the spectrogram to be denoised into a preset speech denoising model for denoising, and obtaining a denoised spectrogram, the method further includes:
obtaining a plurality of training samples;
training an initial model according to the training samples and a preset target function until a preset model training end condition is reached, and taking the initial model reaching the model training end condition as the voice noise reduction model;
wherein the objective function S is expressed as: S = SISNR loss + MSE loss + regularization term, where the SISNR loss is the scale-invariant signal-to-noise ratio loss, and the MSE loss is the loss calculated from the mean square error of the real part of the spectrogram, the mean square error of the imaginary part of the spectrogram, and the mean square error of the spectrogram magnitude spectrum.
The application also provides a speech noise reduction device, the device includes:
the data acquisition module is used for acquiring a spectrogram to be denoised corresponding to the voice to be denoised;
the noise reduction processing module is used for inputting the spectrogram to be denoised into a preset voice noise reduction model for noise reduction processing to obtain a denoised spectrogram, wherein the voice noise reduction model sequentially comprises: an encoding module, a frequency domain noise reduction module, a time domain noise reduction module, a decoding module and a mask enhancing and suppressing module;
and the voice signal reconstruction module is used for reconstructing the voice signal of the denoised spectrogram to obtain target voice.
The present application further proposes an electronic device comprising a memory and a processor, the memory storing a computer program, the processor implementing the steps of any of the above methods when executing the computer program.
The present application also proposes a computer-readable storage medium having stored thereon a computer program which, when being executed by a processor, carries out the steps of the method of any of the above.
The method comprises the steps of: acquiring a spectrogram to be denoised corresponding to the voice to be denoised; inputting the spectrogram to be denoised into a preset voice denoising model for denoising processing to obtain a denoised spectrogram, wherein the voice denoising model sequentially comprises: an encoding module, a frequency domain noise reduction module, a time domain noise reduction module, a decoding module and a mask enhancing and suppressing module; and reconstructing a voice signal from the denoised spectrogram to obtain the target voice. The voice noise reduction model sequentially performs feature extraction, frequency domain noise reduction, time domain noise reduction, decoding, and mask enhancement and suppression. Performing frequency domain noise reduction followed by time domain noise reduction achieves effective noise reduction, improves the noise reduction effect and the accuracy of the denoised voice, and improves the user experience. Because the frequency domain noise reduction and the time domain noise reduction are processed separately, the time domain and the frequency domain are decoupled, which facilitates streaming voice noise reduction.
Drawings
FIG. 1 is a flowchart illustrating a voice denoising method according to an embodiment of the present application;
fig. 2 is a schematic block diagram of a structure of a speech noise reduction apparatus according to an embodiment of the present application;
fig. 3 is a block diagram illustrating a structure of an electronic device according to an embodiment of the present application.
The implementation, functional features and advantages of the objectives of the present application will be further explained with reference to the accompanying drawings.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
Referring to fig. 1, an embodiment of the present application provides a speech noise reduction method, where the method includes:
s1: acquiring a spectrogram to be denoised corresponding to the voice to be denoised;
s2: inputting the spectrogram to be denoised into a preset voice denoising model for denoising processing to obtain a denoised spectrogram, wherein the voice denoising model sequentially comprises: an encoding module, a frequency domain noise reduction module, a time domain noise reduction module, a decoding module and a mask enhancing and suppressing module;
s3: and reconstructing a voice signal of the denoised spectrogram to obtain target voice.
In this embodiment, the voice noise reduction model sequentially performs feature extraction, frequency domain noise reduction, time domain noise reduction, decoding, and mask enhancement and suppression. Performing frequency domain noise reduction followed by time domain noise reduction achieves effective noise reduction, improves the noise reduction effect and the accuracy of the denoised voice, and improves the user experience. Because the frequency domain noise reduction and the time domain noise reduction are processed separately, the time domain and the frequency domain are decoupled, which facilitates streaming voice noise reduction.
For S1, a spectrogram to be noise reduced corresponding to the voice to be noise reduced input by the user may be obtained, a spectrogram to be noise reduced corresponding to the voice to be noise reduced may also be obtained from the database, and a spectrogram to be noise reduced corresponding to the voice to be noise reduced may also be obtained from a third-party application.
The voice to be denoised is one or more segments of speech on which noise reduction is to be performed.
The spectrogram to be denoised is a spectrogram of a voice to be denoised, wherein the spectrogram is a graph generated according to a Fourier spectrum.
The spectrogram to be denoised comprises 2 channels, the 2 channels being a real part channel and an imaginary part channel, respectively. The real part channel holds the real part of the Fourier spectral feature, and the imaginary part channel holds the imaginary part of the Fourier spectral feature.
And S2, inputting the spectrogram to be subjected to noise reduction into a preset voice noise reduction model, sequentially performing feature extraction, frequency domain noise reduction, time domain noise reduction, decoding and mask enhancement and suppression, and taking data output by the mask enhancement and suppression as the spectrogram subjected to noise reduction.
The coding module is used for coding to realize the extraction of the audio features. The encoding module comprises one or more encoding layers, the encoding layers are in linear connection, and each encoding layer outputs a single-layer audio characteristic vector.
Optionally, to reduce the amount of computation and enable streaming processing, convolution is applied only in the frequency dimension (that is, the convolution kernel size in the time dimension is 1), or a causal convolution is adopted in the time domain.
The frequency domain noise reduction module is used for reducing noise in the frequency domain dimension, so that frequency domain information is fully utilized. The output of the last coding layer of the coding module is used as the input of the frequency domain noise reduction module. The frequency domain noise reduction module is obtained based on a multi-head self-attention mechanism (multi-head self-attention).
It is to be understood that the frequency domain noise reduction module may also be a module based on a self-attention mechanism (self-attention).
The time domain noise reduction module is used for reducing noise in the dimension of the time domain, so that the information of the time domain is fully utilized. And taking the output of the frequency domain noise reduction module and the output of the last coding layer of the coding module as the input of the time domain noise reduction module. The time domain noise reduction module is obtained based on a long-short term memory artificial neural network (LSTM).
And the decoding module is used for decoding to obtain the spectrogram after frequency domain noise reduction and time domain noise reduction. The decoding module comprises one or more decoding layers, and the decoding layers are linearly connected. The input of each decoding layer is data resulting from a skip connection between the output of the previous decoding layer and the output of the corresponding coding layer.
The mask enhancing and suppressing module is used for enhancing data corresponding to the wanted voice and suppressing data corresponding to the unwanted voice in the spectrogram.
Optionally, the mask enhancing and suppressing module performs masking by using 0 and 1. For example, in the spectrogram, the mask enhancement and suppression module performs masking with 1 to enhance data corresponding to desired voice and performs masking with 0 to suppress data corresponding to undesired voice.
Optionally, the mask enhancing and suppressing module performs masking by using a value from 0 to 1.
For S3, inverse short-time Fourier transform is performed on the denoised spectrogram to obtain time domain data to be processed; voice signal reconstruction is then performed on the time domain data to be processed using the Overlapadd method, and the reconstructed voice is taken as the target voice corresponding to the voice to be denoised.
Overlapadd is also written as overlap-add (OLA).
The method for performing speech signal reconstruction on the time domain data to be processed by using the Overlapadd method is not described herein again.
The target speech is the clean speech obtained after denoising.
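As a minimal sketch of this reconstruction step in PyTorch (an illustration, not the patent's actual implementation): it assumes the 400-point, 16 kHz STFT configuration described later, an assumed 10 ms hop length and Hann window, and re-inserts a zero DC bin before the inverse transform, since the DC component was removed earlier; torch.istft performs the overlap-add internally.

```python
import torch

def reconstruct_speech(denoised_spec: torch.Tensor, n_fft: int = 400,
                       hop_length: int = 160) -> torch.Tensor:
    """Inverse STFT plus overlap-add reconstruction (step S3).

    `denoised_spec` is a complex tensor of shape (200, frames): the bins
    left after the DC bin was removed. A zero DC bin is re-inserted (the
    patent notes the DC component has little influence on reconstruction),
    then torch.istft performs the overlap-add internally.
    """
    dc = torch.zeros(1, denoised_spec.shape[-1], dtype=denoised_spec.dtype)
    full_spec = torch.cat([dc, denoised_spec], dim=0)   # (201, frames)
    window = torch.hann_window(n_fft)
    return torch.istft(full_spec, n_fft=n_fft, hop_length=hop_length,
                       window=window)
```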
In an embodiment, the step of obtaining a spectrogram to be noise reduced corresponding to a speech to be noise reduced includes:
s11: acquiring the voice to be denoised;
s12: carrying out short-time Fourier transform on the voice to be denoised to obtain a spectrogram to be processed;
s13: and performing direct current component removal processing on the spectrogram to be processed to obtain the spectrogram to be denoised.
In this embodiment, the spectrogram obtained by short-time Fourier transform, with its direct current component removed, is used as the spectrogram to be denoised. This reduces the amount of computation required for noise reduction and improves the noise reduction efficiency, making the application suitable for voice noise reduction scenarios with high real-time requirements and scenarios with limited computing resources. Because the direct current component has little influence on reconstructing the spectrum, removing it does not affect the noise reduction effect while reducing the amount of computation.
For S11, the speech to be noise-reduced input by the user may be acquired, the speech to be noise-reduced may be acquired from a database, or the speech to be noise-reduced may be acquired from a third-party application.
And S12, performing short-time Fourier transform on the voice to be denoised, and taking a spectrogram obtained by the short-time Fourier transform as a spectrogram to be processed.
For step S13, performing dc component removal processing on the spectrogram to be processed, and taking the spectrogram to be processed from which the dc component is removed as the spectrogram to be noise-reduced.
Optionally, when the framing time window of the speech to be noise-reduced is 25ms and the sampling rate is 16000, the frequency point of the short-time fourier transform is set to 512 points, so that 257 complex numbers are obtained after the short-time fourier transform, first 201 complex numbers are extracted from the 257 complex numbers, and then the complex numbers corresponding to the dc component are removed from the extracted 201 complex numbers to obtain 200 complex numbers, so that the spectrogram corresponding to the 200 complex numbers is used as the spectrogram to be noise-reduced. The method is the mainstream method in the prior art and has the problem of redundant calculation.
Optionally, when the framing time window of the speech to be denoised is 25 ms and the sampling rate is 16000, the frequency point of the short-time Fourier transform is set to 400 points, so that 201 complex numbers are obtained after the short-time Fourier transform, and 200 complex numbers are obtained by removing the complex number corresponding to the direct current component from the 201 complex numbers; the spectrogram corresponding to the 200 complex numbers is used as the spectrogram to be denoised. By processing with the actual number of sampling points even when it is not an integer power of 2, reconstruction is still possible while the amount of computation and the number of network parameters of the model are reduced, which further improves the noise reduction efficiency and further suits the application to voice noise reduction scenarios with high real-time requirements.
For example, when each frame corresponds to 512 points, 512 is an integer power of 2, and the frequency point of the short-time Fourier transform is directly set to 512 points, so that processing with the actual number of sampling points is achieved.
For example, when the framing time window of the speech to be denoised is 25 ms and the sampling rate is 16000, each frame corresponds to 400 points; 400 is not an integer power of 2, and the frequency point of the short-time Fourier transform is directly set to 400 points, so that processing with the actual number of sampling points is achieved.
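A minimal sketch of steps S12 and S13 under the 400-point configuration just described (PyTorch; the Hann window and the 10 ms hop length are assumptions not stated in the patent):

```python
import torch

def speech_to_spectrogram(speech: torch.Tensor, n_fft: int = 400,
                          hop_length: int = 160) -> torch.Tensor:
    """Short-time Fourier transform followed by DC-component removal.

    A 400-point FFT yields 201 complex bins per frame; dropping the DC bin
    (index 0) leaves the 200 bins used as the spectrogram to be denoised.
    The result is stacked into the 2-channel (real, imaginary) layout the
    model consumes: shape (2, 200, frames).
    """
    window = torch.hann_window(n_fft)
    spec = torch.stft(speech, n_fft=n_fft, hop_length=hop_length,
                      window=window, return_complex=True)   # (201, frames)
    spec = spec[1:, :]                                       # remove DC bin
    return torch.stack([spec.real, spec.imag], dim=0)        # (2, 200, frames)
```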
In an embodiment, the step of inputting the spectrogram to be denoised into a preset speech denoising model for denoising to obtain a denoised spectrogram includes:
s21: inputting the spectrogram to be denoised into the coding module for feature extraction to obtain a plurality of single-layer audio feature vectors and a target audio feature vector;
s22: inputting the target audio characteristic vector into the frequency domain noise reduction module for frequency domain noise reduction to obtain a frequency domain noise-reduced audio characteristic vector, wherein the frequency domain noise reduction module is a module obtained based on a multi-head self-attention mechanism;
s23: residual error connection is carried out on the target audio characteristic vector and the frequency domain noise-reduced audio characteristic vector to obtain an audio characteristic vector to be processed;
s24: inputting the audio characteristic vector to be processed into the time domain noise reduction module for time domain noise reduction to obtain a time domain noise-reduced audio characteristic vector, wherein the time domain noise reduction module is a module obtained based on a long-short term memory artificial neural network;
s25: residual error connection is carried out on the target audio characteristic vector and the time-domain noise-reduced audio characteristic vector to obtain an audio characteristic vector to be decoded;
s26: inputting each single-layer audio characteristic vector and the audio characteristic vector to be decoded into the decoding module for decoding to obtain a spectrogram to be analyzed;
s27: and inputting the spectrogram to be analyzed into the mask enhancing and suppressing module for masking to obtain the noise-reduced spectrogram.
In this embodiment, the voice noise reduction model sequentially performs feature extraction, frequency domain noise reduction, time domain noise reduction, decoding, and mask enhancement and suppression; performing frequency domain noise reduction followed by time domain noise reduction achieves effective noise reduction, improves the noise reduction effect and the accuracy of the denoised voice, and improves the user experience. Processing frequency domain noise reduction and time domain noise reduction separately decouples the time domain from the frequency domain, which facilitates streaming voice noise reduction. The frequency domain noise reduction module, obtained based on a multi-head self-attention mechanism, makes full use of frequency domain information; the time domain noise reduction module, obtained based on a long-short term memory artificial neural network, makes full use of time domain information. Realizing frequency domain noise reduction and time domain noise reduction through the multi-head self-attention mechanism and the long-short term memory artificial neural network respectively makes effective use of their time-sequence modeling capabilities. Moreover, the multi-head self-attention of the frequency domain operates only within a single frame, and the time domain computation depends only on the output of previous frames, so the model is very suitable for streaming processing.
For step S21, the spectrogram to be denoised is input into the encoding module for feature extraction, the audio feature data extracted by each encoding layer in the encoding module is used as a single-layer audio feature vector, and the audio feature data extracted by the last encoding layer in the encoding module is used as a target audio feature vector.
For S22, the target audio feature vector is input into the frequency domain noise reduction module, noise reduction is performed in the frequency domain dimension by the multi-head attention mechanism of the frequency domain noise reduction module, and the audio feature vector subjected to noise reduction in the frequency domain dimension is used as the audio feature vector subjected to noise reduction in the frequency domain.
And S23, residual connection (residual connection) is carried out on the target audio characteristic vector and the frequency domain noise-reduced audio characteristic vector, and the audio characteristic vector obtained by residual connection is used as the audio characteristic vector to be processed.
The implementation method for residual connection between the target audio feature vector and the frequency-domain noise-reduced audio feature vector is not described herein.
And S24, inputting the audio feature vector to be processed into the time domain noise reduction module, reducing noise in the time domain dimension through the time domain noise reduction module, and taking the audio feature vector subjected to noise reduction in the time domain dimension as the audio feature vector subjected to time domain noise reduction.
And S25, residual error connection is carried out on the target audio characteristic vector and the time-domain noise-reduced audio characteristic vector, and the audio characteristic vector obtained by residual error connection is used as the audio characteristic vector to be decoded.
The method for implementing residual connection between the target audio feature vector and the time-domain noise-reduced audio feature vector is not described herein again.
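As an illustration of steps S22 to S25, the following is a minimal PyTorch sketch (not the patent's actual implementation): it assumes a bottleneck feature tensor of shape (batch, time, freq, channels), applies multi-head self-attention within each frame along the frequency axis, runs an LSTM along the time axis, and takes both residual connections from the encoder's target audio feature vector, as described above. All layer sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DualDomainDenoiser(nn.Module):
    """Frequency-domain MHSA + time-domain LSTM with residual connections
    (steps S22 to S25). Assumed bottleneck shape: (batch, time, freq,
    channels); layer sizes are illustrative."""

    def __init__(self, channels: int = 64, freq_bins: int = 25, heads: int = 4):
        super().__init__()
        self.freq_attn = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.lstm = nn.LSTM(channels * freq_bins, channels * freq_bins,
                            batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, f, c = x.shape
        # S22: self-attention over the frequency axis, one frame at a time,
        # so no future frames are needed (streaming-friendly).
        frames = x.reshape(b * t, f, c)
        attn, _ = self.freq_attn(frames, frames, frames)
        attn = attn.reshape(b, t, f, c)
        y = x + attn                    # S23: residual connection
        # S24: LSTM along the time axis; its recurrence uses only past frames.
        z, _ = self.lstm(y.reshape(b, t, f * c))
        z = z.reshape(b, t, f, c)
        return x + z                    # S25: residual with the encoder output
```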
For S26, the output vector of the (m-1)-th decoding layer and the (n+1-m)-th single-layer audio feature vector are sequentially subjected to vector connection to obtain the m-th vector to be decoded, wherein m is an integer greater than 0, m is less than or equal to n, n is the number of coding layers in the coding module, and the number of coding layers in the coding module is the same as the number of decoding layers in the decoding module; when m is equal to 1, the audio feature vector to be decoded is taken as the output vector of the (m-1)-th decoding layer, and when m is greater than 1, the (m-1)-th single-layer decoding data is taken as the output vector of the (m-1)-th decoding layer.
And taking the data output by the last decoding layer of the decoding module as a spectrogram to be analyzed.
For S27, the spectrogram to be analyzed is input into the mask enhancing and suppressing module; the mask enhancing and suppressing module adopts a complex ratio mask (CRM) as the noise reduction filtering function, enhancing the data corresponding to the wanted voice and suppressing the data corresponding to the unwanted voice in the spectrogram; the masked spectrogram to be analyzed is taken as the denoised spectrogram.
Optionally, masking is performed using the following formula: enhance_real + i·enhance_imag = (mask_real + i·mask_imag) × (noisy_real + i·noisy_imag), where enhance_real is the real part of the enhanced speech, enhance_imag is the imaginary part of the enhanced speech, mask_real is the mask enhancement coefficient of the real part, mask_imag is the mask enhancement coefficient of the imaginary part, noisy_real is the real part of the noisy spectrogram, noisy_imag is the imaginary part of the noisy spectrogram, and i is the imaginary unit.
The mask enhancement coefficient is a value from 0 to 1, and may be 0 or 1.
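The masking formula above amounts to a complex multiplication. A minimal sketch in PyTorch, assuming the 2-channel real/imaginary layout described earlier (the function name is illustrative):

```python
import torch

def apply_complex_mask(noisy_spec: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Complex ratio masking (step S27): complex multiplication of the noisy
    spectrogram by the predicted mask, expanded into real/imaginary parts
    exactly as in the formula above. Both tensors have shape
    (2, freq, frames) with channel 0 = real, channel 1 = imaginary."""
    noisy_real, noisy_imag = noisy_spec[0], noisy_spec[1]
    mask_real, mask_imag = mask[0], mask[1]
    enhance_real = mask_real * noisy_real - mask_imag * noisy_imag
    enhance_imag = mask_real * noisy_imag + mask_imag * noisy_real
    return torch.stack([enhance_real, enhance_imag], dim=0)
```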
In one embodiment, the encoding module includes n encoding layers, and the step of inputting the spectrogram to be noise-reduced into the encoding module for feature extraction to obtain a plurality of single-layer audio feature vectors and a target audio feature vector includes:
s211: inputting the input vector of the kth coding layer into the kth coding layer of the coding module to sequentially carry out Pointwise convolution and Depthwise convolution to obtain the kth single-layer audio characteristic vector;
s212: taking the nth single-layer audio feature vector as the target audio feature vector;
wherein when k is equal to 1, the spectrogram to be denoised is taken as the input vector of the kth coding layer, and when k is greater than 1, the (k-1)-th single-layer audio feature vector is taken as the input vector of the kth coding layer.
In this embodiment, each coding layer performs information fusion in the order of Pointwise convolution first and Depthwise convolution second: the Pointwise convolution fuses information across channels (including the real part channel and the imaginary part channel), and the Depthwise convolution then extracts features within each channel; performing noise reduction on audio feature vectors obtained from this information fusion helps improve the noise reduction effect. Compared with conventional convolution, the Pointwise convolution and Depthwise convolution have far fewer parameters, which improves the voice noise reduction efficiency and makes the application suitable for voice noise reduction scenarios with high real-time requirements and scenarios with limited computing resources.
A conventional convolution with a kernel size of 3 × 3, 64 input channels and 128 output channels has 64 × 128 × 3 × 3 parameters (ignoring biases).
In a Pointwise convolution, the kernel size is 1 × 1 × M, where M is the number of channels of the previous layer. The convolution therefore performs a weighted combination of the previous feature maps in the depth direction to generate new feature maps; the number of output feature maps equals the number of convolution kernels.
In a Depthwise convolution, one convolution kernel is responsible for one channel, and each channel is convolved by exactly one convolution kernel; the convolution is performed entirely within a two-dimensional plane, and the number of filters is the same as the number of channels of the previous layer. For a three-channel 64 × 64 color image, a Depthwise convolution therefore produces 3 feature maps.
In the above example, the conventional convolution has 64 × 128 × 3 × 3 = 221,184 parameters, while the Pointwise convolution and Depthwise convolution together have 64 × 128 + 128 × 3 × 3 = 9,344 parameters, so the number of parameters, and with it the amount of computation, is significantly reduced.
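The arithmetic can be checked directly with PyTorch layer definitions (a sketch for verification only; `bias=False` isolates the weight counts):

```python
import torch.nn as nn

# Conventional 3x3 convolution, 64 -> 128 channels (weights only, no bias):
conv = nn.Conv2d(64, 128, kernel_size=3, bias=False)
assert conv.weight.numel() == 64 * 128 * 3 * 3        # 221,184 parameters

# Pointwise (1x1, 64 -> 128) followed by depthwise (3x3, one kernel per channel):
pointwise = nn.Conv2d(64, 128, kernel_size=1, bias=False)
depthwise = nn.Conv2d(128, 128, kernel_size=3, groups=128, bias=False)
assert pointwise.weight.numel() + depthwise.weight.numel() \
       == 64 * 128 + 128 * 3 * 3                      # 9,344 parameters
```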
And for S211, inputting the input vector of the kth coding layer into the kth coding layer of the coding module, firstly performing Pointwise convolution, performing Depthwise convolution on data obtained by the Pointwise convolution, and taking the data obtained by the Depthwise convolution as the kth single-layer audio feature vector.
For S212, the nth single-layer audio feature vector is used as the target audio feature vector, that is, the audio feature data extracted by the last encoding layer in the encoding module is used as the target audio feature vector.
When k is equal to 1, the spectrogram to be denoised is taken as the input vector of the kth coding layer, that is, the input vector of the 1st coding layer is the input vector of the coding module; when k is greater than 1, the (k-1)-th single-layer audio feature vector is taken as the input vector of the kth coding layer, that is, the input vector of each coding layer after the first is the output vector of the previous coding layer.
In an embodiment, the step of inputting the input vector of the kth coding layer into the kth coding layer of the coding module to sequentially perform Pointwise convolution and Depthwise convolution to obtain the kth single-layer audio feature vector includes:
s2111: performing Pointwise convolution on the input vector of the kth coding layer by adopting the kth coding layer to obtain a first audio feature vector;
s2112: acquiring a preset Depthwise convolution time dimension;
s2113: if the time dimension of the Depthwise convolution is equal to 1, performing conventional convolution on the first audio feature vector by adopting a kth coding layer to obtain a kth single-layer audio feature vector;
s2114: and if the time dimension of the Depthwise convolution is equal to 2, performing causal convolution on the first audio feature vector by adopting the kth coding layer to obtain the kth single-layer audio feature vector.
This embodiment uses conventional convolution when the Depthwise convolution time dimension equals 1 and causal convolution when the Depthwise convolution time dimension equals 2, thereby enabling streaming processing.
And for S2111, performing Pointwise convolution on the input vector of the kth coding layer by adopting the kth coding layer, and taking the data obtained by the convolution as a first audio feature vector.
For S2112, a preset Depthwise convolution time dimension may be acquired from the database, a preset Depthwise convolution time dimension input by the user may be acquired, and the preset Depthwise convolution time dimension may be written into the program implementing the present application.
For S2113, if the Depthwise convolution time dimension is equal to 1, it means that the time domain convolution kernel is 1, so that the kth encoding layer is adopted, the first audio feature vector is subjected to conventional convolution, and data obtained by the conventional convolution is used as the kth single-layer audio feature vector.
For S2114, if the Depthwise convolution time dimension is equal to 2, the time domain convolution kernel is 2, so in streaming inference the information of the previous frame needs to be recorded when the current frame is computed; therefore, the kth coding layer performs causal convolution on the first audio feature vector, and the data obtained by the causal convolution is taken as the kth single-layer audio feature vector.
Causal convolution means that the output for the current frame depends only on the current frame and previous frames, never on future frames.
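Putting steps S2111 to S2114 together, the following is a minimal sketch of one encoding layer, assuming 2-D convolutions over a (batch, channels, time, freq) tensor; the channel counts, frequency kernel and stride, and the activation are illustrative assumptions rather than values from the patent:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EncodingLayer(nn.Module):
    """One encoding layer (steps S2111 to S2114): Pointwise convolution, then
    a Depthwise convolution whose time-dimension kernel is 1 (conventional,
    no cross-frame context) or 2 (causal: the current frame only sees the
    previous one)."""

    def __init__(self, in_ch: int, out_ch: int, time_kernel: int = 2,
                 freq_kernel: int = 3, freq_stride: int = 2):
        super().__init__()
        assert time_kernel in (1, 2)
        self.time_kernel = time_kernel
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1)
        self.depthwise = nn.Conv2d(out_ch, out_ch,
                                   kernel_size=(time_kernel, freq_kernel),
                                   stride=(1, freq_stride),
                                   groups=out_ch)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time, freq)
        x = self.pointwise(x)
        if self.time_kernel == 2:
            # Causal convolution: pad one frame on the left (past) only, so
            # streaming inference needs to cache just the previous frame.
            x = F.pad(x, (0, 0, 1, 0))
        return torch.relu(self.depthwise(x))
```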
In an embodiment, the decoding module includes n decoding layers, and the step of inputting each single-layer audio feature vector and the audio feature vector to be decoded into the decoding module for decoding to obtain the spectrogram to be analyzed includes:
s251: sequentially performing vector connection, Pointwise convolution and deconvolution on the output vector of the (m-1)-th decoding layer and the (n+1-m)-th single-layer audio feature vector to obtain the m-th vector to be decoded, wherein m is an integer greater than 0, and m is less than or equal to n;
s252: inputting the mth vector to be decoded into the mth decoding layer for decoding to obtain mth single-layer decoding data;
s253: taking the nth single-layer decoding data as the spectrogram to be analyzed;
and when m is equal to 1, the audio feature vector to be decoded is taken as the output vector of the (m-1)-th decoding layer, and when m is greater than 1, the (m-1)-th single-layer decoding data is taken as the output vector of the (m-1)-th decoding layer.
In this embodiment, the output vector of the (m-1)-th decoding layer and the (n+1-m)-th single-layer audio feature vector are sequentially subjected to vector connection, Pointwise convolution and deconvolution, and the result is used as the input of the m-th decoding layer. The dimensionality reduction of the Pointwise convolution reduces the number of model parameters and the amount of computation, thereby improving the voice noise reduction efficiency and making the application suitable for voice noise reduction scenarios with high real-time requirements and scenarios with limited computing resources.
For S251, the output vector of the (m-1)-th decoding layer and the (n+1-m)-th single-layer audio feature vector are spliced in the channel dimension, Pointwise convolution is performed on the spliced data, deconvolution is performed on the Pointwise-convolved data, and the deconvolved data is taken as the input vector of the m-th decoding layer.
For S252, the mth vector to be decoded is input to the mth decoding layer for decoding, and the decoded data is used as mth single-layer decoded data.
For S253, the nth single-layer decoded data is used as the spectrogram to be analyzed, that is, the output of the last layer of the decoding module is used as the spectrogram to be analyzed.
When m is equal to 1, the audio feature vector to be decoded is taken as the output vector of the (m-1)-th decoding layer, that is, the output vector of the time domain noise reduction module is one of the data sources of the input vector of the first decoding layer of the decoding module.
When m is greater than 1, the (m-1)-th single-layer decoding data is taken as the output vector of the (m-1)-th decoding layer, that is, the output vector of the previous decoding layer is one of the data sources of the input vector of the current decoding layer.
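A corresponding sketch of one decoding step (S251 to S252) under the same assumptions as the encoding-layer sketch above; the transposed convolution restores the frequency resolution reduced by the encoder, and all shapes and kernel sizes are illustrative:

```python
import torch
import torch.nn as nn

class DecodingLayer(nn.Module):
    """One decoding step (S251 to S252): concatenate the previous decoder
    output with the matching encoder skip connection along the channel
    dimension, reduce dimensionality with a Pointwise convolution, then
    upsample the frequency axis with a transposed convolution."""

    def __init__(self, dec_ch: int, skip_ch: int, out_ch: int,
                 freq_kernel: int = 3, freq_stride: int = 2):
        super().__init__()
        self.pointwise = nn.Conv2d(dec_ch + skip_ch, out_ch, kernel_size=1)
        self.deconv = nn.ConvTranspose2d(out_ch, out_ch,
                                         kernel_size=(1, freq_kernel),
                                         stride=(1, freq_stride))

    def forward(self, prev_out: torch.Tensor, skip: torch.Tensor) -> torch.Tensor:
        x = torch.cat([prev_out, skip], dim=1)   # channel-dimension concat
        x = self.pointwise(x)                    # dimensionality reduction
        return torch.relu(self.deconv(x))
```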
In an embodiment, before the step of inputting the spectrogram to be denoised into a preset speech denoising model for denoising, and obtaining a denoised spectrogram, the method further includes:
s021: obtaining a plurality of training samples;
s022: training an initial model according to the training samples and a preset target function until a preset model training end condition is reached, and taking the initial model reaching the model training end condition as the voice noise reduction model;
wherein the objective function S is expressed as: S = SISNR loss + MSE loss + regularization term, where the SISNR loss is the scale-invariant signal-to-noise ratio loss, and the MSE loss is the loss calculated from the mean square error of the real part of the spectrogram, the mean square error of the imaginary part of the spectrogram, and the mean square error of the spectrogram magnitude spectrum.
In this embodiment, the objective function combines the signal-to-noise ratio loss with the loss calculated from the mean square error of the real part of the spectrogram, the mean square error of the imaginary part of the spectrogram, and the mean square error of the spectrogram magnitude spectrum, so that model training fully considers the real part, imaginary part and magnitude information of the spectrogram, thereby improving the noise reduction capability of the model.
For S021, a plurality of training samples input by the user may be obtained, a plurality of training samples may be obtained from a database, or a plurality of training samples may be obtained from a third-party application.
Each training sample of the plurality of training samples comprises: a spectrum sample graph, a spectrogram calibration result and a clean voice calibration result, wherein the spectrum sample graph is obtained by performing short-time Fourier transform on a voice sample, the clean voice calibration result is the accurate clean voice corresponding to the spectrum sample graph, and the spectrogram calibration result is the accurate spectrogram corresponding to that clean voice.
The voice sample is voice obtained by mixing clean voice and noise voice.
For S022, one training sample of the plurality of training samples is taken as a target training sample; the spectrum sample graph of the target training sample is input into the initial model for denoising to obtain a spectrogram prediction result; voice signal reconstruction is performed on the spectrogram prediction result to obtain a clean voice prediction result; the spectrogram prediction result and the clean voice prediction result, together with the spectrogram calibration result and the clean voice calibration result of the target training sample, are input into the objective function to calculate a loss value; the network parameters of the initial model are updated with the calculated loss value, and the updated initial model is used for the next spectrogram prediction; the step of taking one of the plurality of training samples as the target training sample is repeated until the model training end condition is reached; the initial model that reaches the model training end condition is taken as the voice noise reduction model.
The model training end conditions include: the loss value of the initial model reaches a first convergence condition or the iteration number of the initial model reaches a second convergence condition.
The first convergence condition means that the loss value of the initial model no longer decreases between two consecutive computations.
The second convergence condition means that the training indicator no longer improves; for example, the training indicator may be the signal-to-noise ratio loss.
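A minimal training-loop sketch for S022 (illustrative names and hyperparameters throughout): it assumes `samples` yields (spectrum sample graph, spectrogram calibration result, clean voice calibration result) triples, that `objective` reconstructs the predicted waveform internally for its SISNR term, and that the L2 regularization term of the objective is supplied through the optimizer's weight_decay.

```python
import torch

def train_denoiser(model, samples, objective, epochs: int = 10, lr: float = 1e-3):
    """Training loop for S022: predict, compute the loss, update parameters,
    and repeat until the training end condition (here, a fixed epoch count)."""
    opt = torch.optim.Adam(model.parameters(), lr=lr, weight_decay=1e-5)
    for _ in range(epochs):
        for noisy_spec, spec_target, wav_target in samples:
            pred_spec = model(noisy_spec)                    # denoise
            loss = objective(pred_spec, spec_target, wav_target)
            opt.zero_grad()
            loss.backward()
            opt.step()                                       # update parameters
```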
After the speech is subjected to a short-time Fourier transform, real and imaginary components are obtained. The real part of the spectrogram refers to the real component, and the imaginary part of the spectrogram refers to the imaginary component.
The short-time Fourier transform, a general tool for speech signal processing, defines a very useful class of time and frequency distributions that specify the complex amplitude of an arbitrary signal over time and frequency. The spectrogram magnitude spectrum is the magnitude of the complex values obtained by the short-time Fourier transform.
The regularization term is an L2-norm regularization applied to the weights in the function corresponding to the signal-to-noise ratio loss and the function corresponding to the MSE loss. Adding the regularization term to the objective function makes the model tend toward smaller parameters during gradient descent, reducing the model's flexibility and alleviating overfitting to a certain extent.
The L2 norm, is the euclidean norm.
SISNR, whose full English name is scale-invariant source-to-noise ratio, is a scale-invariant signal-to-noise ratio, i.e., a signal-to-noise ratio that is not affected by the scaling of the signal. The loss function of SISNR is not described herein again.
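A sketch of the two loss terms under common definitions (the patent does not give formulas beyond the description above, so this is an assumed implementation; the L2 regularization term is typically supplied via the optimizer's weight_decay rather than added explicitly):

```python
import torch
import torch.nn.functional as F

def si_snr_loss(est: torch.Tensor, ref: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Negative scale-invariant SNR between estimated and reference waveforms."""
    est = est - est.mean(dim=-1, keepdim=True)
    ref = ref - ref.mean(dim=-1, keepdim=True)
    # Project the estimate onto the reference; the remainder counts as noise.
    proj = (est * ref).sum(-1, keepdim=True) * ref / (ref.pow(2).sum(-1, keepdim=True) + eps)
    noise = est - proj
    si_snr = 10 * torch.log10(proj.pow(2).sum(-1) / (noise.pow(2).sum(-1) + eps) + eps)
    return -si_snr.mean()

def mse_spec_loss(est_spec: torch.Tensor, ref_spec: torch.Tensor) -> torch.Tensor:
    """MSE over the real part, the imaginary part, and the magnitude spectrum.
    Spectrograms are complex tensors of shape (freq, frames)."""
    return (F.mse_loss(est_spec.real, ref_spec.real)
            + F.mse_loss(est_spec.imag, ref_spec.imag)
            + F.mse_loss(est_spec.abs(), ref_spec.abs()))

# Total objective (regularization handled by the optimizer's weight_decay):
# loss = si_snr_loss(est_wav, ref_wav) + mse_spec_loss(est_spec, ref_spec)
```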
Referring to fig. 2, the present application also proposes a speech noise reduction apparatus, the apparatus comprising:
the data acquisition module 100 is configured to acquire a spectrogram to be denoised corresponding to a voice to be denoised;
the noise reduction processing module 200 is configured to input the spectrogram to be denoised into a preset voice noise reduction model for noise reduction processing to obtain a denoised spectrogram, wherein the voice noise reduction model sequentially comprises: an encoding module, a frequency domain noise reduction module, a time domain noise reduction module, a decoding module and a mask enhancing and suppressing module;
and a speech signal reconstruction module 300, configured to perform speech signal reconstruction on the noise-reduced spectrogram to obtain a target speech.
In this embodiment, the voice noise reduction model sequentially performs feature extraction, frequency domain noise reduction, time domain noise reduction, decoding, and mask enhancement and suppression. Performing frequency domain noise reduction followed by time domain noise reduction achieves effective noise reduction, improves the noise reduction effect and the accuracy of the denoised voice, and improves the user experience. Because the frequency domain noise reduction and the time domain noise reduction are processed separately, the time domain and the frequency domain are decoupled, which facilitates streaming voice noise reduction.
An embodiment of the present application further provides an electronic device, which includes a memory and a processor, the memory storing a computer program, and the processor implementing the following method steps when executing the computer program: a voice noise reduction method comprising: acquiring a spectrogram to be denoised corresponding to the voice to be denoised; inputting the spectrogram to be denoised into a preset voice denoising model for denoising processing to obtain a denoised spectrogram, wherein the voice denoising model sequentially comprises: an encoding module, a frequency domain noise reduction module, a time domain noise reduction module, a decoding module and a mask enhancing and suppressing module; and reconstructing a voice signal from the denoised spectrogram to obtain target voice.
The electronic device includes: computer devices and mobile electronic devices. Mobile electronic devices include, but are not limited to: cell-phone, panel computer and wearing equipment.
In this embodiment, the voice noise reduction model sequentially performs feature extraction, frequency domain noise reduction, time domain noise reduction, decoding, and mask enhancement and suppression. Performing frequency domain noise reduction followed by time domain noise reduction achieves effective noise reduction, improves the noise reduction effect and the accuracy of the denoised voice, and improves the user experience. Because the frequency domain noise reduction and the time domain noise reduction are processed separately, the time domain and the frequency domain are decoupled, which facilitates streaming voice noise reduction.
Referring to fig. 3, a computer device, which may be a server and whose internal structure may be as shown in fig. 3, is also provided in the embodiment of the present application. The computer device includes a processor, a memory, a network interface and a database connected by a system bus, wherein the processor of the computer device is used to provide computing and control capabilities. The memory of the computer device comprises a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program and a database. The internal memory provides an environment for the operation of the operating system and the computer program in the non-volatile storage medium. The database of the computer device is used for storing data used by the voice noise reduction method. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a voice noise reduction method, comprising the following steps: acquiring a spectrogram to be denoised corresponding to the voice to be denoised; inputting the spectrogram to be denoised into a preset voice denoising model for denoising processing to obtain a denoised spectrogram, wherein the voice denoising model sequentially comprises: an encoding module, a frequency domain noise reduction module, a time domain noise reduction module, a decoding module and a mask enhancing and suppressing module; and reconstructing a voice signal from the denoised spectrogram to obtain target voice.
In this embodiment, the voice noise reduction model sequentially performs feature extraction, frequency domain noise reduction, time domain noise reduction, decoding, and mask enhancement and suppression. Performing frequency domain noise reduction followed by time domain noise reduction achieves effective noise reduction, improves the noise reduction effect and the accuracy of the denoised voice, and improves the user experience. Because the frequency domain noise reduction and the time domain noise reduction are processed separately, the time domain and the frequency domain are decoupled, which facilitates streaming voice noise reduction.
An embodiment of the present application further provides a computer-readable storage medium on which a computer program is stored, and the computer program, when executed by a processor, implements a voice noise reduction method comprising the steps of: acquiring a spectrogram to be denoised corresponding to the voice to be denoised; inputting the spectrogram to be denoised into a preset voice denoising model for denoising processing to obtain a denoised spectrogram, wherein the voice denoising model sequentially comprises: an encoding module, a frequency domain noise reduction module, a time domain noise reduction module, a decoding module and a mask enhancing and suppressing module; and reconstructing a voice signal from the denoised spectrogram to obtain target voice.
The voice noise reduction method uses the voice noise reduction model to sequentially perform feature extraction, frequency domain noise reduction, time domain noise reduction, decoding, and mask enhancement and suppression. Performing frequency domain noise reduction followed by time domain noise reduction achieves effective noise reduction, improves the noise reduction effect and the accuracy of the denoised voice, and improves the user experience. Because the frequency domain noise reduction and the time domain noise reduction are processed separately, the time domain and the frequency domain are decoupled, which facilitates streaming voice noise reduction.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing related hardware; the computer program can be stored in a non-volatile computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium provided herein and used in the examples may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, apparatus, article, or method that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, apparatus, article, or method. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, apparatus, article, or method that includes the element.
The above description is only a preferred embodiment of the present application and is not intended to limit the scope of the present application. Any equivalent structural or equivalent process modification made using the contents of the specification and drawings of the present application, or any direct or indirect application in other related technical fields, is likewise included within the scope of patent protection of the present application.

Claims (10)

1. A method for speech noise reduction, the method comprising:
acquiring a spectrogram to be denoised corresponding to the voice to be denoised;
inputting the spectrogram to be denoised into a preset voice noise reduction model for noise reduction processing to obtain a denoised spectrogram, wherein the voice noise reduction model sequentially comprises: an encoding module, a frequency domain noise reduction module, a time domain noise reduction module, a decoding module, and a mask enhancing and suppressing module;
and reconstructing a voice signal from the denoised spectrogram to obtain a target voice.
2. The method of claim 1, wherein the step of acquiring the spectrogram to be denoised corresponding to the voice to be denoised comprises:
acquiring the voice to be denoised;
carrying out short-time Fourier transform on the voice to be denoised to obtain a spectrogram to be processed;
and performing direct current component removal processing on the spectrogram to be processed to obtain the spectrogram to be denoised.
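As a non-normative illustration of claim 2, and assuming that "direct current component removal" means discarding the 0 Hz bin of the short-time Fourier transform (one natural reading; the claim does not say how the DC component is removed), the step could look like this:

import torch

def spectrogram_to_denoise(voice: torch.Tensor,
                           n_fft: int = 512, hop: int = 256) -> torch.Tensor:
    """Claim 2 as code: STFT, then direct current component removal."""
    window = torch.hann_window(n_fft)
    # Spectrogram to be processed: complex bins, shape [n_fft//2 + 1, frames].
    spec = torch.stft(voice, n_fft, hop_length=hop, window=window,
                      return_complex=True)
    # Drop the 0 Hz (DC) bin, leaving n_fft//2 frequency bins, a power of
    # two that divides evenly through strided convolution layers.
    return spec[1:, :]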
3. The method of claim 1, wherein the step of inputting the spectrogram to be denoised into the preset voice noise reduction model for noise reduction processing to obtain the denoised spectrogram comprises:
inputting the spectrogram to be denoised into the encoding module for feature extraction to obtain a plurality of single-layer audio feature vectors and a target audio feature vector;
inputting the target audio feature vector into the frequency domain noise reduction module for frequency domain noise reduction to obtain a frequency-domain denoised audio feature vector, wherein the frequency domain noise reduction module is a module based on a multi-head self-attention mechanism;
performing residual connection on the target audio feature vector and the frequency-domain denoised audio feature vector to obtain an audio feature vector to be processed;
inputting the audio feature vector to be processed into the time domain noise reduction module for time domain noise reduction to obtain a time-domain denoised audio feature vector, wherein the time domain noise reduction module is a module based on a long short-term memory network;
performing residual connection on the target audio feature vector and the time-domain denoised audio feature vector to obtain an audio feature vector to be decoded;
inputting each single-layer audio feature vector and the audio feature vector to be decoded into the decoding module for decoding to obtain a spectrogram to be analyzed;
and inputting the spectrogram to be analyzed into the mask enhancing and suppressing module for masking to obtain the denoised spectrogram.
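A minimal sketch of the claim 3 data flow, with hypothetical layer sizes; whether the self-attention runs across the time axis or the frequency axis is not stated in this excerpt, so the sketch simply attends over the sequence dimension. Note that, as recited, both residual connections are taken against the target audio feature vector:

import torch
import torch.nn as nn

class DenoiserCore(nn.Module):
    """Hypothetical skeleton of the frequency/time noise reduction stages."""
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        # Frequency domain noise reduction: multi-head self-attention.
        self.freq_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Time domain noise reduction: long short-term memory network.
        self.time_lstm = nn.LSTM(dim, dim, batch_first=True)

    def forward(self, target: torch.Tensor) -> torch.Tensor:
        # target: [batch, steps, dim] target audio feature vector.
        freq_out, _ = self.freq_attn(target, target, target)
        to_process = target + freq_out            # first residual connection
        time_out, _ = self.time_lstm(to_process)
        return target + time_out                  # second residual, again vs. target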
4. The method of claim 3, wherein the encoding module comprises n coding layers, and the step of inputting the spectrogram to be denoised into the encoding module for feature extraction to obtain the plurality of single-layer audio feature vectors and the target audio feature vector comprises:
inputting the input vector of the kth coding layer into the kth coding layer of the encoding module to sequentially perform Pointwise convolution and Depthwise convolution, obtaining the kth single-layer audio feature vector;
taking the nth single-layer audio feature vector as the target audio feature vector;
wherein, when k is greater than 1, the (k-1)th single-layer audio feature vector serves as the input vector of the kth coding layer.
5. The method of claim 4, wherein the step of inputting the input vector of the kth coding layer into the kth coding layer of the encoding module to sequentially perform Pointwise convolution and Depthwise convolution, obtaining the kth single-layer audio feature vector, comprises:
performing Pointwise convolution on the input vector of the kth coding layer by the kth coding layer to obtain a first audio feature vector;
acquiring a preset time dimension of the Depthwise convolution;
if the time dimension of the Depthwise convolution is equal to 1, performing conventional convolution on the first audio feature vector by the kth coding layer to obtain the kth single-layer audio feature vector;
and if the time dimension of the Depthwise convolution is equal to 2, performing causal convolution on the first audio feature vector by the kth coding layer to obtain the kth single-layer audio feature vector.
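Claims 4 and 5 together describe one coding layer: a Pointwise convolution followed by a Depthwise convolution whose preset time dimension (1 or 2) selects a conventional or a causal convolution. The sketch below uses assumed channel counts and a frequency kernel of 3, and omits any downsampling stride for brevity; padding the time axis on the left only is what makes the kt == 2 case causal, which is what permits streaming:

import torch
import torch.nn as nn
import torch.nn.functional as F

class CodingLayer(nn.Module):
    """Hypothetical coding layer: Pointwise conv, then Depthwise conv."""
    def __init__(self, in_ch: int, out_ch: int, kt: int = 2, kf: int = 3):
        super().__init__()
        assert kt in (1, 2)  # preset Depthwise convolution time dimension
        self.kt, self.kf = kt, kf
        # Pointwise (1x1) convolution mixes channels.
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1)
        # Depthwise convolution: one filter per channel (groups=out_ch).
        self.depthwise = nn.Conv2d(out_ch, out_ch, kernel_size=(kt, kf),
                                   groups=out_ch)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [batch, channels, time, freq]
        x = self.pointwise(x)
        # Pad freq symmetrically; pad time on the left only when kt == 2,
        # so frame t never sees frame t + 1 (causal convolution).
        x = F.pad(x, ((self.kf - 1) // 2, (self.kf - 1) // 2, self.kt - 1, 0))
        return self.depthwise(x)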
6. The method of claim 4, wherein the decoding module comprises n decoding layers, and the step of inputting each of the single-layer audio feature vectors and the audio feature vector to be decoded into the decoding module for decoding to obtain the spectrogram to be analyzed comprises:
sequentially performing vector concatenation, Pointwise convolution, and deconvolution on the output vector of the (m-1)th decoding layer and the (n+1-m)th single-layer audio feature vector to obtain the mth vector to be decoded, wherein m is an integer greater than 0 and less than or equal to n;
inputting the mth vector to be decoded into the mth decoding layer for decoding to obtain the mth single-layer decoded data;
taking the nth single-layer decoded data as the spectrogram to be analyzed;
wherein, when m is equal to 1, the audio feature vector to be decoded serves as the output vector of the (m-1)th decoding layer, and when m is greater than 1, the (m-1)th single-layer decoded data serves as the output vector of the (m-1)th decoding layer.
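A sketch of one decoding step from claim 6, with hypothetical channel counts and an assumed frequency-axis stride of 2 for the deconvolution (mirroring a typical strided encoder); prev_out stands for the output vector of the (m-1)th decoding layer (for m = 1, the audio feature vector to be decoded) and skip for the (n+1-m)th single-layer audio feature vector:

import torch
import torch.nn as nn

class DecodingStep(nn.Module):
    """Hypothetical claim 6 step: concatenation, Pointwise conv, deconvolution."""
    def __init__(self, ch: int):
        super().__init__()
        self.pointwise = nn.Conv2d(2 * ch, ch, kernel_size=1)
        # Deconvolution (transposed conv) doubling the frequency axis.
        self.deconv = nn.ConvTranspose2d(ch, ch, kernel_size=(1, 3),
                                         stride=(1, 2), padding=(0, 1),
                                         output_padding=(0, 1))

    def forward(self, prev_out: torch.Tensor, skip: torch.Tensor) -> torch.Tensor:
        x = torch.cat([prev_out, skip], dim=1)  # vector concatenation
        x = self.pointwise(x)                   # Pointwise convolution
        return self.deconv(x)                   # deconvolution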
7. The method of claim 1, wherein, before the step of inputting the spectrogram to be denoised into the preset voice noise reduction model for noise reduction processing to obtain the denoised spectrogram, the method further comprises:
obtaining a plurality of training samples;
training an initial model according to the training samples and a preset objective function until a preset model training end condition is reached, and taking the initial model reaching the model training end condition as the voice noise reduction model;
wherein the objective function S is expressed as: S = SISNR + MSE loss + regularization term, where SISNR is the signal-to-noise ratio loss, and the MSE loss is calculated from the mean square error of the real part of the spectrogram, the mean square error of the imaginary part of the spectrogram, and the mean square error of the spectrogram magnitude spectrum.
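One plausible reading of the claim 7 objective function, sketched below: the SISNR term is implemented as a negative scale-invariant signal-to-noise ratio (a common choice, though the claim says only "signal-to-noise ratio loss"), the MSE term sums the three stated mean square errors, and the regularization term is assumed to be ordinary weight decay handled by the optimizer:

import torch
import torch.nn.functional as F

def sisnr_loss(est: torch.Tensor, ref: torch.Tensor, eps: float = 1e-8):
    """Negative scale-invariant SNR between estimated and clean waveforms."""
    est = est - est.mean(dim=-1, keepdim=True)
    ref = ref - ref.mean(dim=-1, keepdim=True)
    # Project the estimate onto the reference to isolate the target part.
    s_target = (est * ref).sum(-1, keepdim=True) * ref \
        / (ref.pow(2).sum(-1, keepdim=True) + eps)
    e_noise = est - s_target
    sisnr = 10 * torch.log10(s_target.pow(2).sum(-1)
                             / (e_noise.pow(2).sum(-1) + eps) + eps)
    return -sisnr.mean()

def mse_loss(est_spec: torch.Tensor, ref_spec: torch.Tensor):
    """MSE of the real part, the imaginary part, and the magnitude spectrum."""
    return (F.mse_loss(est_spec.real, ref_spec.real)
            + F.mse_loss(est_spec.imag, ref_spec.imag)
            + F.mse_loss(est_spec.abs(), ref_spec.abs()))

# S = SISNR + MSE loss + regularization; weight decay in the optimizer
# stands in for the regularization term in this sketch:
# loss = sisnr_loss(est_wav, clean_wav) + mse_loss(est_spec, clean_spec)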
8. An apparatus for speech noise reduction, the apparatus comprising:
the data acquisition module is used for acquiring a spectrogram to be denoised corresponding to the voice to be denoised;
the noise reduction processing module is used for inputting the spectrogram to be denoised into a preset voice noise reduction model for noise reduction processing to obtain a denoised spectrogram, wherein the voice noise reduction model sequentially comprises: an encoding module, a frequency domain noise reduction module, a time domain noise reduction module, a decoding module, and a mask enhancing and suppressing module;
and the voice signal reconstruction module is used for reconstructing a voice signal from the denoised spectrogram to obtain a target voice.
9. An electronic device comprising a memory and a processor, the memory storing a computer program, wherein the processor implements the steps of the method of any one of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 7.
CN202210054737.6A 2022-01-18 2022-01-18 Voice noise reduction method, device, equipment and storage medium Active CN114067826B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210054737.6A CN114067826B (en) 2022-01-18 2022-01-18 Voice noise reduction method, device, equipment and storage medium


Publications (2)

Publication Number Publication Date
CN114067826A true CN114067826A (en) 2022-02-18
CN114067826B CN114067826B (en) 2022-06-07

Family

ID=80231315

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210054737.6A Active CN114067826B (en) 2022-01-18 2022-01-18 Voice noise reduction method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114067826B (en)


Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109065067A (en) * 2018-08-16 2018-12-21 福建星网智慧科技股份有限公司 A kind of conference terminal voice de-noising method based on neural network model
CN110491407A (en) * 2019-08-15 2019-11-22 广州华多网络科技有限公司 Method, apparatus, electronic equipment and the storage medium of voice de-noising
CN111223493A (en) * 2020-01-08 2020-06-02 北京声加科技有限公司 Voice signal noise reduction processing method, microphone and electronic equipment
CN113223545A (en) * 2020-02-05 2021-08-06 字节跳动有限公司 Voice noise reduction method and device, terminal and storage medium
CN112927707A (en) * 2021-01-25 2021-06-08 北京达佳互联信息技术有限公司 Training method and device of voice enhancement model and voice enhancement method and device
CN113870888A (en) * 2021-09-24 2021-12-31 武汉大学 Feature extraction method and device based on time domain and frequency domain of voice signal, and echo cancellation method and device

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114171038A (en) * 2021-12-10 2022-03-11 北京百度网讯科技有限公司 Voice noise reduction method, device, equipment, storage medium and program product
CN114171038B (en) * 2021-12-10 2023-07-28 北京百度网讯科技有限公司 Voice noise reduction method, device, equipment and storage medium
CN114333882A (en) * 2022-03-09 2022-04-12 深圳市友杰智新科技有限公司 Voice noise reduction method, device and equipment based on amplitude spectrum and storage medium
CN114333882B (en) * 2022-03-09 2022-08-19 深圳市友杰智新科技有限公司 Voice noise reduction method, device and equipment based on amplitude spectrum and storage medium
CN117219098A (en) * 2023-09-13 2023-12-12 南京汇智互娱网络科技有限公司 Data processing system for intelligent agent

Also Published As

Publication number Publication date
CN114067826B (en) 2022-06-07

Similar Documents

Publication Publication Date Title
CN114067826B (en) Voice noise reduction method, device, equipment and storage medium
US20220004870A1 (en) Speech recognition method and apparatus, and neural network training method and apparatus
Abd El-Fattah et al. Speech enhancement with an adaptive Wiener filter
CN111653285B (en) Packet loss compensation method and device
JP6860901B2 (en) Learning device, speech synthesis system and speech synthesis method
CN108922544B (en) Universal vector training method, voice clustering method, device, equipment and medium
CN113345463A (en) Voice enhancement method, device, equipment and medium based on convolutional neural network
Murata et al. Gibbsddrm: A partially collapsed gibbs sampler for solving blind inverse problems with denoising diffusion restoration
CN113707167A (en) Training method and training device for residual echo suppression model
CN114067820B (en) Training method of voice noise reduction model, voice noise reduction method and related equipment
Şimşekli et al. Non-negative tensor factorization models for Bayesian audio processing
Abrol et al. Greedy double sparse dictionary learning for sparse representation of speech signals
CN113470688A (en) Voice data separation method, device, equipment and storage medium
CN114694674A (en) Speech noise reduction method, device and equipment based on artificial intelligence and storage medium
JP5669036B2 (en) Parameter estimation device for signal separation, signal separation device, parameter estimation method for signal separation, signal separation method, and program
KR20170088165A (en) Method and apparatus for speech recognition using deep neural network
Yu et al. Audio signal denoising with complex wavelets and adaptive block attenuation
CN111968651A (en) WT (WT) -based voiceprint recognition method and system
Yoshioka et al. Dereverberation by using time-variant nature of speech production system
Garg et al. Enhancement of speech signal using diminished empirical mean curve decomposition-based adaptive Wiener filtering
Sack et al. On audio enhancement via online non-negative matrix factorization
CN114360572A (en) Voice denoising method and device, electronic equipment and storage medium
CN114333882B (en) Voice noise reduction method, device and equipment based on amplitude spectrum and storage medium
CN116137153A (en) Training method of voice noise reduction model and voice enhancement method
Funaki Speech enhancement based on iterative wiener filter using complex speech analysis

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant