CN117174105A - Speech noise reduction and dereverberation method based on improved deep convolutional network


Info

Publication number
CN117174105A
CN117174105A
Authority
CN
China
Prior art keywords
data
convolution
voice
network
layer
Prior art date
Legal status
Pending
Application number
CN202311452944.8A
Other languages
Chinese (zh)
Inventor
韦伟才
邓海蛟
马健莹
潘晖
Current Assignee
Shenzhen Longxinwei Semiconductor Technology Co ltd
Original Assignee
Shenzhen Longxinwei Semiconductor Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Shenzhen Longxinwei Semiconductor Technology Co ltd filed Critical Shenzhen Longxinwei Semiconductor Technology Co ltd
Priority to CN202311452944.8A
Publication of CN117174105A
Legal status: Pending

Landscapes

  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The invention provides a speech noise reduction and dereverberation method based on an improved deep convolutional network, comprising the following steps: extracting feature data from the original speech and encoding the feature data; recognizing the encoded feature data through an improved time convolution network and outputting a recognition result; feature-decoding the recognition result and reconstructing the speech signal from the decoded data. By using the improved time convolution network, the invention effectively combines speech noise reduction with dereverberation, increasing the processing speed while improving the noise reduction and dereverberation effect.

Description

Speech noise reduction and dereverberation method based on improved deep convolutional network
Technical Field
The invention belongs to the technical field of voice enhancement, and particularly relates to a voice noise reduction and dereverberation method based on an improved deep convolutional network.
Background
Noise reduction and dereverberation are among the most important research directions in the field of speech signal processing. In speech recognition, speaker recognition, and audio processing, efficient noise reduction and dereverberation methods are required to improve the signal-to-noise ratio and intelligibility of the signal. Currently, methods such as spectral subtraction (Spectral Subtraction), wavelet-transform noise reduction, double-threshold energy extraction (Double-Threshold Energy Extraction), and reverberation cancellation based on blind source separation have gradually become mainstream and are widely used in actual production.
Although existing noise reduction and dereverberation techniques can already achieve certain results, many technical challenges and difficulties remain. For example, the time-varying, nonlinear characteristics and diversity of the speech signal itself affect the accuracy and robustness of noise reduction and dereverberation algorithms, and the processing speed is often not ideal. How to optimize the complexity and accuracy of these algorithms while improving their stability and reliability is therefore a problem to be solved in the art.
Disclosure of Invention
In order to solve at least one technical problem set forth above, the present invention provides a method for voice noise reduction and dereverberation based on an improved deep convolutional network, which can enhance the effects of voice noise reduction and dereverberation, and reduce the complexity of an algorithm to increase the speed of voice processing.
In a first aspect, the present invention provides a method for noise reduction and reverberation removal of speech based on an improved deep convolutional network, the method comprising:
extracting characteristic data of original voice, and encoding the characteristic data;
identifying the coded characteristic data through an improved time convolution network, and outputting an identification result;
and performing feature decoding on the identification result, and recombining a voice signal according to the decoded data.
In one possible implementation manner, the extracting feature data of the original voice includes:
filtering and reverberation processing are carried out on the original voice by adopting an FIR digital filter, and reverberation voice is generated;
mixing the reverberation voices with different signal to noise ratios to generate voice with noise;
performing short-time Fourier transform on the noisy speech to generate frequency domain data of the noisy speech;
and combining the real value and the imaginary value of the frequency domain data to generate the characteristic data of the original voice.
In one possible implementation, before performing the short-time fourier transform on the noisy speech, the method further includes:
pre-emphasis is carried out on the voice with noise, so that the signal-to-noise ratio of the voice with noise in a high-frequency part is improved;
and framing and windowing the pre-emphasized noisy speech.
In one possible implementation, the encoding the feature data includes:
inputting the characteristic data to an encoder, the encoder comprising a first sub-module and a second sub-module;
the characteristic data are processed sequentially by the convolution layer, the normalization layer and the PReLu activation layer of the first sub-module, and the output of the first sub-module's PReLu activation layer is processed sequentially by the convolution layer, the normalization layer and the PReLu activation layer of the second sub-module, generating the encoded data.
In one possible implementation, the convolution kernel of the convolution layer of the first sub-module has a size of (1, 3), a step size of (1, 1), and a number of 32; the convolution kernel size of the convolution layer of the second sub-module is (2, 5), the step size is (1, 2), and the number is 64.
In one possible implementation manner, performing feature decoding on the identification result includes:
and inputting the identification result to a decoder, wherein the decoder has the same number of sub-modules and the same network structure as the encoder.
In one possible implementation, before the identifying the encoded feature data by the modified time convolution network, training the time convolution network further includes:
obtaining training samples, wherein the training samples are data obtained by encoding characteristic data of a plurality of voices;
performing shape conversion on the training sample to generate a multidimensional tensor;
inputting the multidimensional tensor into a time convolution network for training, wherein the time convolution network comprises two residual blocks, and each residual block comprises two sub-residual modules;
each sub-residual module comprises a causal dilated convolution layer, a gated dilation convolution layer, a normalization layer, an activation layer and a Dropout layer connected in sequence; the Dropout layer of one sub-residual module is connected with the causal dilated convolution layer of the other sub-residual module.
In one possible implementation, the multi-dimensional tensor is input to a time convolution network for training, and further includes:
constructing a loss function according to the short-time objective speech intelligibility index;
and updating the gradient of the time convolution network by back propagation of the loss function in the time convolution network until the loss function meets the preset condition, and generating an improved time convolution network.
In one possible implementation, the reorganizing the speech signal according to the decoded data includes:
calculating a mask signal according to the decoded data to obtain a signal gain;
performing inverse Fourier transform on the signal gain to obtain a time domain framing signal;
and windowing and recombining the framing signals, and splicing the recombined frame signals to obtain complete voice signals.
In a second aspect, the present invention also provides a speech noise reduction and dereverberation system based on an improved deep convolutional network, the system comprising:
the characteristic coding unit is used for extracting characteristic data of the original voice and coding the characteristic data;
the characteristic recognition unit is used for recognizing the coded characteristic data through the improved time convolution network and outputting a recognition result;
And the voice reorganization unit is used for carrying out feature decoding on the identification result and reorganizing voice signals according to the decoded data.
Compared with the prior art, the invention has the beneficial effects that:
1) According to the invention, the convolutional neural network is used for extracting the features from the data after the short-time Fourier transform, so that the convolutional neural network can be more fully utilized for extracting more and higher-level abstract features from the data after the conversion. Compared with the traditional feature extraction method, the convolutional neural network has stronger data expression capability. The extracted characteristic data can greatly improve the learning efficiency and generalization capability of the model.
2) The improved time convolution network used in the invention can process time-series data in parallel without a recurrent structure, so efficient training and inference can be achieved on hardware accelerators such as GPUs, greatly reducing training time. Compared with conventional Recurrent Neural Networks (RNNs), Time Convolutional Networks (TCNs) do not suffer from vanishing/exploding gradients or the difficulty of capturing long-term dependencies; these problems are avoided by using a stack of 1D convolutional layers, each of which convolves over the entire sequence, effectively expanding the receptive field and enabling the TCN to process long sequences and extract the relevant information from them. In addition, TCNs are easier to implement and debug than conventional Convolutional Neural Networks (CNNs), and since their structure is not recursive, they are generally easier to parallelize and optimize than RNNs.
3) In the invention, gating expansion convolution is introduced into the improved time convolution network, and the history information which needs to be reserved or forgotten at the current moment can be adaptively selected by using the gating expansion convolution. Meanwhile, the use of the residual structure can further improve the training efficiency and accuracy of the model.
4) The invention uses an encoder-decoder structure in the deep convolutional network, which has strong feature extraction capability because the encoder can extract key features from the raw data; the encoder automatically reduces the dimensionality of the input data and removes redundant information, accomplishing automatic noise reduction; because a separable architecture is adopted between the encoder and the decoder, in some cases the data can be processed using the encoder alone, enabling cross-domain feature migration and giving the method strong transferability; and since the training strategy of a self-encoder is to minimize the reconstruction error, it is robust to noise and interference signals in the data.
5) According to the invention, the short-time objective speech intelligibility index (STOI) is used as the loss function, so that the model has stronger generalization capability and both the signal-to-noise ratio and a high degree of signal recovery can be taken into account. Moreover, when noise reduction and dereverberation are performed simultaneously, the model is guaranteed not to trade one off against the other, i.e., the dereverberation effect is not impaired by the noise reduction.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to the structures shown in these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a method for voice noise reduction and dereverberation based on an improved deep convolutional network according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of an improved deep convolutional network according to one embodiment of the present invention;
FIG. 3 is a flowchart of a method for voice noise reduction and dereverberation based on an improved deep convolutional network according to another embodiment of the present invention;
FIG. 4 is a schematic diagram of a speech noise reduction and dereverberation system based on an improved deep convolutional network according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of a speech noise reduction and dereverberation system based on an improved deep convolutional network according to another embodiment of the present invention;
Fig. 6 is a schematic structural diagram of an electronic device according to an embodiment of the invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are only some, but not all embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
It should be noted that, if a directional indication (such as up, down, left, right, front, and rear … …) is involved in the embodiment of the present invention, the directional indication is merely used to explain the relative positional relationship, movement condition, etc. between the components in a specific posture, and if the specific posture is changed, the directional indication is correspondingly changed.
In addition, if there is a description of "first", "second", etc. in the embodiments of the present invention, the description of "first", "second", etc. is for descriptive purposes only and is not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In addition, "and/or" throughout includes three parallel schemes; for example, "A and/or B" includes the scheme A, the scheme B, or the scheme where A and B are satisfied simultaneously. Furthermore, the technical solutions of the embodiments may be combined with each other, provided the combination can be realized by those skilled in the art; when technical solutions are contradictory or cannot be realized, the combination should be considered absent and outside the scope of protection claimed in the present invention.
Current common noise reduction and dereverberation methods include spectral subtraction, wavelet transform noise reduction, double threshold energy extraction, reverberation cancellation based on blind source separation, etc., but these methods are not ideal in terms of processing speed and accuracy. Therefore, the invention provides a voice noise reduction and dereverberation method based on an improved deep convolution network, which can effectively combine voice noise reduction and dereverberation by adopting the improved time convolution network, and improves the voice noise reduction and dereverberation effect while accelerating the processing speed.
Referring to fig. 1, one embodiment of the present invention provides a method for noise reduction and reverberation removal of speech based on an improved deep convolutional network, including:
s10, extracting characteristic data of the original voice, and encoding the characteristic data.
S20, identifying the coded characteristic data through an improved time convolution network, and outputting an identification result.
S30, performing feature decoding on the recognition result, and recombining the voice signal according to the decoded data.
In this embodiment, first, the original voice needs to be acquired, and the corresponding feature data needs to be extracted. Typically, feature data is obtained by signal processing and feature extraction, and common features include time domain features, frequency domain features, formant features, acoustic parameters, phoneme coding, and the like.
To facilitate the model recognition process, feature data is typically encoded after features are extracted. The feature encoding is optionally performed using a convolutional neural network, including encoding using an encoder. Input data may be progressively abstracted and represented in compression by stacking multiple layers in an encoder. The encoder learns the main characteristics of the data in the process and encodes the data into vector representation, so that noise and redundant information can be effectively reduced, and the performance and efficiency of the deep learning model are improved.
Further, this embodiment identifies the encoded characteristic data through the improved time convolution network. The data encoded by the encoder are first shape-transformed as input to the improved time convolution network, which is constructed with causal dilated convolution (Causal Dilated Convolution) and gated dilation convolution as its main components. Each residual block consists of causal dilated convolution, gated dilation convolution, activation, normalization and Dropout layers, assembled in a residual structure. Finally, feature decoding is performed on the recognition result, and the speech signal is reconstructed from the decoded data.
According to the embodiment, the improved time convolution network is used for effectively combining voice noise reduction and dereverberation, so that the processing speed is increased, and the voice noise reduction and dereverberation effect is improved.
In one embodiment, extracting feature data of an original speech includes:
filtering and reverberation processing are carried out on the original voice by adopting an FIR digital filter, and reverberation voice is generated;
mixing the reverberant voices with different signal to noise ratios to generate voice with noise;
performing short-time Fourier transform on the voice with noise to generate frequency domain data of the voice with noise;
and combining the real value and the imaginary value of the frequency domain data to generate the characteristic data of the original voice.
In order to further improve the noise reduction effect and the generalization capability of the model, when preprocessing speech data, digital filtering is first applied to the original clean speech, and the reverberated clean speech is used as the raw data for model training. On one hand, this processing effectively filters out invalid speech data and avoids interference with subsequent model training; on the other hand, by applying reverberation, the training data become reverberant data, with the original clean speech serving as the label data. After model training, the processed data support both noise reduction and dereverberation, and the generalization performance of the model is also improved.
Optionally, an FIR digital filter is used in the digital signal processing for filtering. The FIR filter adopts a linear weighting mode to filter an input signal and changes the frequency response of the input signal so as to realize signal processing operations such as notch, passband gain and the like to a certain extent. Specifically, it takes discrete time series data as input and performs convolution operation through a set of pre-designed filter coefficients to obtain an output sequence. Since the speech data of different sampling rates have different coefficients, the input signal can be filtered as desired. The method can filter out partial noise and interference signals in the input signal and improve the quality of the signal. Meanwhile, the frequency response curve of the output signal can be adjusted to be more in line with the target characteristic, so that the signal form is changed.
Further, the reverberant speech is mixed with noise at different signal-to-noise ratios to obtain fixed-length noisy speech; segments are randomly extracted and short-time Fourier transformed, and the real and imaginary values are extracted and combined as the input data of the encoder.
According to the embodiment, the tone quality of original voice can be improved through filtering and reverberation processing, and a more real environment background can be provided by combining the reverberation voices with different signal to noise ratios, so that the voice with noise is closer to an actual application scene. The time domain signal can be converted into the frequency domain representation through short-time Fourier transform, and the energy distribution information of the audio frequency on different frequencies is provided, so that the characteristic data of the original voice can be effectively extracted.
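To make this pipeline concrete, the following is a minimal Python sketch of the feature-extraction step, assuming numpy/scipy; the FFT size, hop length, sampling rate and the room impulse response used as the FIR coefficients are illustrative assumptions, not values fixed by this embodiment.

```python
import numpy as np
from scipy.signal import lfilter, stft

def make_features(clean, rir, noise, snr_db, n_fft=512, hop=256, fs=16000):
    """Sketch: FIR reverberation, SNR mixing, STFT, real/imaginary stacking."""
    # FIR filtering / reverberation: convolve the clean speech with a room
    # impulse response (the RIR plays the role of the FIR filter coefficients).
    reverberant = lfilter(rir, [1.0], clean)

    # Mix with noise at the requested signal-to-noise ratio.
    noise = noise[:len(reverberant)]
    scale = np.linalg.norm(reverberant) / (np.linalg.norm(noise) * 10 ** (snr_db / 20) + 1e-12)
    noisy = reverberant + scale * noise

    # Short-time Fourier transform -> frequency-domain data of the noisy speech.
    _, _, spec = stft(noisy, fs=fs, nperseg=n_fft, noverlap=n_fft - hop)

    # Combine real and imaginary values as the feature tensor,
    # shape (2, frequency_bins, frames).
    return np.stack([spec.real, spec.imag], axis=0)
```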
In one possible implementation, before performing the short-time fourier transform on the noisy speech, the method further includes:
pre-emphasis is carried out on the voice with noise, so that the signal-to-noise ratio of the voice with noise in a high-frequency part is improved;
and framing and windowing the pre-emphasized noisy speech.
First, pre-emphasis is used to improve the signal-to-noise ratio of the signal in the high-frequency part. The preprocessed data are then framed and windowed and converted to complex values using the short-time Fourier transform. For subsequent feature extraction, the real and imaginary parts are extracted, shape-converted and spliced together. Finally, the data are normalized using Layer Normalization (LN) for better subsequent feature extraction and learning.
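A minimal sketch of the pre-emphasis and framing/windowing steps, assuming the common first-order pre-emphasis filter y[n] = x[n] - a*x[n-1] with a = 0.97 and a Hann window; the coefficient and frame sizes are assumptions, not values specified by this embodiment.

```python
import numpy as np

def pre_emphasis(x, alpha=0.97):
    # y[n] = x[n] - alpha * x[n-1]: boosts the high-frequency part of the
    # noisy speech (alpha = 0.97 is a common choice, assumed here).
    return np.append(x[0], x[1:] - alpha * x[:-1])

def frame_and_window(x, frame_len=512, hop=256):
    # Split into overlapping frames and apply a Hann window to each frame.
    # Assumes len(x) >= frame_len.
    n_frames = 1 + (len(x) - frame_len) // hop
    window = np.hanning(frame_len)
    frames = np.stack([x[i * hop : i * hop + frame_len] for i in range(n_frames)])
    return frames * window
```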
In one possible implementation, the encoding the feature data includes:
inputting the characteristic data to an encoder, the encoder comprising a first sub-module and a second sub-module;
the characteristic data sequentially pass through the convolution layer, the normalization layer and the PReLu activation layer of the first sub-module to be processed, and the output data of the PReLu activation layer of the first sub-module sequentially passes through the convolution layer, the normalization layer and the PReLu activation layer of the second sub-module to be processed, so that encoded data are generated.
Optionally, the convolution kernel size of the convolution layer of the first sub-module is (1, 3), the step size is (1, 1), and the number is 32; the convolution kernel size of the convolution layer of the second sub-module is (2, 5), the step size is (1, 2), and the number is 64.
In this embodiment, when constructing the encoder, the first sub-module is built first: its first layer is a convolution layer with kernel (1, 3), step size (1, 1) and 32 convolution kernels; the convolution is followed by a batch normalization layer, which accelerates training, improves the generalization capability of the model and has a certain regularization effect; this is followed by a PReLu activation layer, which has stronger generalization capability, better sparsity and parameter-sharing capability than ReLu. The output of the first sub-module is then used as the input of the second sub-module, which adopts the same structure with 64 convolution kernels, step size (1, 2) and kernel size (2, 5), in order to encode and refine more information.
The encoder consists of two sub-convolution modules in which input data is progressively abstracted and represented in compression by stacking multiple layers. The coding module learns the main characteristics of the data in the process and codes the main characteristics into vector representation, so that noise and redundant information can be effectively reduced, and the performance and efficiency of the deep learning model are improved.
In this embodiment, by using a convolutional neural network to extract features from the data after the short-time Fourier transform, the network can be exploited more fully to extract more, higher-level abstract features from the transformed data. Compared with traditional feature-extraction methods, the convolutional neural network has stronger data-expression capability, and the extracted feature data can greatly improve the learning efficiency and generalization capability of the model.
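The following PyTorch sketch illustrates an encoder with the two sub-modules described above (kernels (1, 3)/(2, 5), strides (1, 1)/(1, 2), 32/64 kernels, batch normalization and PReLu). The input layout (batch, 2, time, frequency) with real/imaginary parts as channels and the padding choices are assumptions, not specified by the patent.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Sketch of the two sub-module encoder described above."""
    def __init__(self, in_ch=2):
        super().__init__()
        # First sub-module: 32 kernels of size (1, 3), stride (1, 1).
        self.sub1 = nn.Sequential(
            nn.Conv2d(in_ch, 32, kernel_size=(1, 3), stride=(1, 1), padding=(0, 1)),
            nn.BatchNorm2d(32),
            nn.PReLU(),
        )
        # Second sub-module: 64 kernels of size (2, 5), stride (1, 2).
        self.sub2 = nn.Sequential(
            nn.Conv2d(32, 64, kernel_size=(2, 5), stride=(1, 2), padding=(1, 2)),
            nn.BatchNorm2d(64),
            nn.PReLU(),
        )

    def forward(self, x):
        e1 = self.sub1(x)   # kept for the decoder's skip connection
        e2 = self.sub2(e1)
        return e1, e2
```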
In one possible implementation, before identifying the encoded feature data by the modified time convolution network, the method further includes training the time convolution network, including:
obtaining training samples, wherein the training samples are data obtained by encoding characteristic data of a plurality of voices;
performing shape conversion on the training sample to generate a multidimensional tensor;
inputting the multidimensional tensor into a time convolution network for training, wherein the time convolution network comprises two residual blocks, and each residual block comprises two sub-residual modules;
each sub-residual module comprises a causal dilated convolution layer, a gated dilation convolution layer, a normalization layer, an activation layer and a Dropout layer connected in sequence; the Dropout layer of one sub-residual module is connected with the causal dilated convolution layer of the other sub-residual module.
In this embodiment, a training sample is first obtained, where the training sample is data obtained by encoding feature data of a plurality of voices.
Specifically, feature data of several speech samples can be acquired and encoded. During feature extraction, the training data can be mixed at different signal-to-noise ratios to obtain fixed-length noisy speech; randomly extracted segments are short-time Fourier transformed, and the real and imaginary values are extracted and combined as the input data of the encoder. Once the encoder has encoded the training data, a training sample is obtained; the training sample is then shape-converted to generate a multidimensional tensor, which is input into the time convolution network for training.
In one possible implementation, the multi-dimensional tensor is input to a time convolution network for training, and further includes:
constructing a loss function according to the short-time objective speech intelligibility index;
and updating the gradient of the time convolution network by back propagation of the loss function in the time convolution network until the loss function meets the preset condition, and generating the improved time convolution network.
To aid understanding, in one specific embodiment, the training time convolutional network includes the steps of:
1) After shape conversion, the encoded characteristic data are connected to the time convolution network module as the input of the causal dilated convolution layer. Causal dilated convolution combines causal convolution with dilated convolution. An ordinary convolution shifted window by window repeats computation in the overlapping regions of neighbouring windows and does not substantially enlarge the receptive field; a dilated convolution, by contrast, skips a number of pixels between the taps of the convolution kernel, so it covers more spatial context at once and enlarges the receptive field of the network. Causal convolution adds a constraint along the time axis: the output of the convolution depends only on past inputs, never on future ones. In the causal dilated convolution used here, a causal convolution serves as the first layer and is followed by dilated convolutions. The gated dilation convolution mainly consists of two dilated convolutions and a sigmoid activation layer; the sigmoid function scales the output values into (0, 1). The output of the causal dilated convolution serves as the input of the gated dilation convolution and is fed to both dilated convolutions; the output of one dilated convolution, passed through the sigmoid activation layer, is multiplied element-wise with the output of the other, and the product is the output of the gated dilation convolution.
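A hedged PyTorch sketch of the gated dilation convolution, and of one sub-residual module of the residual structure described next; the kernel size, dilation rate, channel counts and dropout probability are illustrative assumptions.

```python
import torch
import torch.nn as nn

class GatedDilatedConv(nn.Module):
    """Two parallel dilated 1-D convolutions; one branch is passed through a
    sigmoid gate that scales the other branch's output into (0, 1)."""
    def __init__(self, channels, kernel_size=3, dilation=1):
        super().__init__()
        pad = (kernel_size - 1) * dilation   # left-pad only => causal
        self.pad = nn.ConstantPad1d((pad, 0), 0.0)
        self.filt = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)
        self.gate = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)

    def forward(self, x):
        x = self.pad(x)
        # Element-wise product of the filter branch and the sigmoid gate.
        return self.filt(x) * torch.sigmoid(self.gate(x))

class SubResidualBlock(nn.Module):
    """One sub-residual module: causal dilated convolution, gated dilation
    convolution, normalization, activation and Dropout, wrapped in a
    residual connection (1x1 convolution if channel counts differ)."""
    def __init__(self, in_ch, out_ch, kernel_size=3, dilation=1, p_drop=0.1):
        super().__init__()
        pad = (kernel_size - 1) * dilation
        self.causal = nn.Sequential(
            nn.ConstantPad1d((pad, 0), 0.0),
            nn.Conv1d(in_ch, out_ch, kernel_size, dilation=dilation),
        )
        self.gated = GatedDilatedConv(out_ch, kernel_size, dilation)
        self.post = nn.Sequential(nn.BatchNorm1d(out_ch), nn.PReLU(), nn.Dropout(p_drop))
        self.skip = nn.Conv1d(in_ch, out_ch, 1) if in_ch != out_ch else nn.Identity()

    def forward(self, x):
        y = self.post(self.gated(self.causal(x)))
        return y + self.skip(x)
```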
The time convolution network comprises two residual blocks, wherein each residual block comprises two sub-residual modules;
each sub-residual module comprises a causal dilated convolution layer, a gated dilation convolution layer, a normalization layer, an activation layer and a Dropout layer connected in sequence, with the Dropout layer of one sub-residual module connected to the causal dilated convolution layer of the other, as shown in fig. 2. It should be noted that for the entire residual block, if the number of input channels differs from the number of output channels of the causal dilated convolution (the number of filters of the second causal dilated convolution), an optional 1x1 convolution is inserted on the end-to-end (skip) connection. 2) For the loss function of the whole network in the back-propagation process, the short-time objective speech intelligibility index (STOI) is used, according to the characteristics of the input data and the function of the deep convolutional network; compared with common loss functions such as signal-to-noise ratio (SNR) and mean-square-error loss (MSE), the loss is computed by substituting the forward-calculation results, namely the estimated value and the actual value, into the loss function.
3) Gradient updating of the deep convolutional network: the actual value and the estimated value obtained by forward calculation are substituted into the designed loss function to obtain an error value; back propagation is essentially the gradient-computation process. The back-propagation algorithm is in essence an optimization method: the error on a training sample is computed and then propagated backwards through the network, finally yielding the gradient information of every parameter in the network, which is used to update the parameters so that the error is gradually reduced until a given accuracy requirement is met. Specifically, let the input of layer $l$ be $z^{l}$ and its output be $a^{l} = \sigma(z^{l})$; the input of layer $l+1$ is $z^{l+1}$ with output $a^{l+1}$. Let the error of layer $l$ be $\delta^{l}$. Then:

$$\delta^{l} = \left( (W^{l+1})^{\mathsf{T}} \delta^{l+1} \right) \odot \sigma'(z^{l})$$

where $W^{l+1}$ denotes the weight matrix connecting layer $l$ and layer $l+1$, $\odot$ denotes element-wise multiplication, and $\sigma'$ denotes the derivative of the activation function. From the error $\delta^{l}$ of layer $l$, the gradient information of a convolution kernel $w^{l}$ on layer $l$ is:

$$\frac{\partial L}{\partial w^{l}_{u,v,c}} = \sum_{n=1}^{N}\sum_{i}\sum_{j} \delta^{l,(n)}_{i,j}\; a^{l-1,(n)}_{i+u-1,\, j+v-1,\, c}$$

where $L$ denotes the loss function, $n$ indexes the samples at layer $l$ (with $n=1$ the first sample), $(u, v, c)$ index the height, width and channel of the convolution kernel at layer $l$, and $(i, j)$ range over the position offsets of the convolution kernel on the input feature map.

Evidently, the gradient of the convolution kernel during back propagation equals the sum over the input feature map of the products between the local input regions and the output errors, which also reflects the direction of kernel adjustment: the weights of the convolution kernel are changed at particular locations so as to reduce the error.
4) After repeated iterative training, the deep convolutional network continuously updates its weight coefficients until a satisfactory result is reached, or the final model file is obtained once all iterations are completed.
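The iterative training of steps 2) to 4) can be sketched as a standard gradient-descent loop. The optimizer (Adam), learning rate and epoch count below are assumptions, and `loss_fn` stands for the STOI-based loss discussed later in this description.

```python
import torch

def train(model, loader, loss_fn, epochs=50, lr=1e-3, device="cpu"):
    """Sketch of the iterative training loop; hyperparameters are assumed."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    model.to(device).train()
    for epoch in range(epochs):
        for noisy_feat, clean in loader:
            noisy_feat, clean = noisy_feat.to(device), clean.to(device)
            estimate = model(noisy_feat)      # forward calculation
            loss = loss_fn(estimate, clean)   # STOI-based loss (see below)
            opt.zero_grad()
            loss.backward()                   # back-propagate gradient information
            opt.step()                        # update the weight coefficients
    return model
```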
In one embodiment, the back propagation of the improved time convolution network comprises:
a) Back propagation of convolutional layers:
In the back propagation of the convolution layers, an error term must be computed for each convolution kernel and the weight parameters of each convolution kernel updated. First, for each element $y_{i,j}$ in the convolution-layer output tensor, the corresponding error term $\delta_{i,j} = \partial L / \partial y_{i,j}$ is calculated from the relation between the error term and the loss function.

Then, from the error terms, the gradient values of all convolution kernels $\partial L / \partial w^{(k)}_{m,n}$ are calculated, where $w^{(k)}_{m,n}$ represents the weight of the $k$-th convolution kernel at position $(m, n)$. Specifically, the gradient values of the convolution kernels may be calculated as follows:

1) Define a new error tensor $\delta$ with the same shape as the output tensor $y$ of the convolution layer, and initialize all its elements to 0.

2) For each element $\delta_{i,j}$ of the error tensor, first find the corresponding position $x_{i+m-1,\,j+n-1}$ of the convolution kernel in the input tensor (i.e., map the element coordinates of the output tensor back into the original input tensor), and then accumulate the product $\delta_{i,j}\, x_{i+m-1,\,j+n-1}$ into the kernel gradient.

For each convolution kernel $w^{(k)}_{m,n}$, calculate its gradient averaged over all samples:

$$\frac{\partial L}{\partial w^{(k)}_{m,n}} = \frac{1}{N} \sum_{b=1}^{N} \sum_{i}\sum_{j} \delta^{(b)}_{i,j}\; x^{(b)}_{i+m-1,\, j+n-1}$$

where $N$ is the number of samples and $x^{(b)}_{i+m-1,\,j+n-1}$ represents the element of the $b$-th sample in the input tensor at position $(i+m-1,\, j+n-1)$.
b) Batch Normalization (BN) back propagation:
the BN layer mainly includes two operations: normalization of mean, variance and linear transformation of scale and bias. In back propagation we need to calculate the gradients of the parameters and inputs in the BN layer.
During the back propagation of the BN layer, given the input $x$ and the output $z$, the gradients with respect to the scale factor $\gamma$ and the bias factor $\beta$ must be calculated. First, the gradients of the scale factor $\gamma$ and the bias factor $\beta$ can be obtained:

$$\frac{\partial L}{\partial \gamma} = \sum_{i=1}^{m} \frac{\partial L}{\partial z_i}\,\hat{x}_i, \qquad \frac{\partial L}{\partial \beta} = \sum_{i=1}^{m} \frac{\partial L}{\partial z_i}$$

where $\partial L/\partial z_i$ denotes the gradient of the loss function with respect to $z$ and $\hat{x}_i$ is the normalized input. The gradient of the normalized input is then calculated:

$$\frac{\partial L}{\partial \hat{x}_i} = \frac{\partial L}{\partial z_i}\,\gamma$$

The gradients of the mean and variance with respect to $x$ are then calculated. According to the chain rule, it is possible to obtain:

$$\frac{\partial L}{\partial \sigma^2} = \sum_{i=1}^{m} \frac{\partial L}{\partial \hat{x}_i}\,(x_i-\mu)\left(-\frac{1}{2}\right)(\sigma^2+\epsilon)^{-3/2}$$

$$\frac{\partial L}{\partial \mu} = -\frac{1}{\sqrt{\sigma^2+\epsilon}}\sum_{i=1}^{m} \frac{\partial L}{\partial \hat{x}_i} \;+\; \frac{\partial L}{\partial \sigma^2}\,\frac{1}{m}\sum_{i=1}^{m} \bigl(-2(x_i-\mu)\bigr)$$

$$\frac{\partial L}{\partial x_i} = \frac{\partial L}{\partial \hat{x}_i}\,\frac{1}{\sqrt{\sigma^2+\epsilon}} \;+\; \frac{\partial L}{\partial \sigma^2}\,\frac{2(x_i-\mu)}{m} \;+\; \frac{\partial L}{\partial \mu}\,\frac{1}{m}$$

where $m$ is the number of samples per batch. Finally, the input $x$, the gradients, the scale factor $\gamma$ and the bias factor $\beta$ are used to update the parameters in the BN layer, for example:

$$\gamma \leftarrow \gamma - \eta\,\frac{\partial L}{\partial \gamma}, \qquad \beta \leftarrow \beta - \eta\,\frac{\partial L}{\partial \beta}$$

where $\eta$ is the learning rate. The running statistics of the BN layer are maintained as moving averages with exponential decay rates applied to the sample mean and sample variance of the current batch.
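For reference, a numpy sketch of the BN backward pass implementing the gradients above; the data layout (batch of m rows, one feature per column) and the epsilon value are assumptions.

```python
import numpy as np

def batchnorm_backward(dz, x, gamma, eps=1e-5):
    """dz is dL/dz for the BN output z = gamma * x_hat + beta."""
    m = x.shape[0]
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)

    dgamma = np.sum(dz * x_hat, axis=0)   # dL/dgamma
    dbeta = np.sum(dz, axis=0)            # dL/dbeta
    dx_hat = dz * gamma                   # dL/dx_hat

    # Chain rule through the normalization of mean and variance.
    dvar = np.sum(dx_hat * (x - mu) * -0.5 * (var + eps) ** -1.5, axis=0)
    dmu = np.sum(dx_hat * -1.0 / np.sqrt(var + eps), axis=0) \
          + dvar * np.mean(-2.0 * (x - mu), axis=0)
    dx = dx_hat / np.sqrt(var + eps) + dvar * 2.0 * (x - mu) / m + dmu / m
    return dx, dgamma, dbeta
```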
In one embodiment, the loss function in the improved deep convolutional network uses primarily short-time objective speech intelligibility index (STOI), which is an objective quality assessment indicator that measures the intelligibility or clarity of speech signals. The loss function in the model is therefore, according to its characteristics, as follows:
$$\mathcal{L} = 1 - \frac{1}{N}\sum_{n=1}^{N} d_n, \qquad d_n = \frac{\langle s_n, \hat{s}_n \rangle}{\lVert s_n \rVert\; \lVert \hat{s}_n \rVert}$$

where $N$ is the number of frames of the speech signal, $s_n$ is the $n$-th frame of the actual (clean) signal, and $\hat{s}_n$ is the corresponding frame of the result of the inverse Fourier transform after the model gain calculation; $d_n$ is the cosine of the angle between the two frame vectors and reflects the similarity of the two speech signals in the time domain. For a perfectly reconstructed frame the estimate should be very close to the original, so $d_n$ will approach 1; otherwise it will approach 0. After averaging, the STOI score is typically reported as a percentage indicating the intelligibility of the distorted signal.
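A simplified PyTorch sketch of such a frame-correlation loss; it is an STOI-inspired surrogate operating on time-domain frames, not the full STOI computation, and the tensor layout is an assumption.

```python
import torch

def stoi_style_loss(estimate, target, eps=1e-8):
    """estimate/target: (batch, frames, frame_len) time-domain frames."""
    # Cosine similarity (the angle between the two frame vectors) per frame.
    dot = (estimate * target).sum(dim=-1)
    denom = estimate.norm(dim=-1) * target.norm(dim=-1) + eps
    d = dot / denom          # approaches 1 for a perfectly reconstructed frame
    return 1.0 - d.mean()    # average over frames and batch
```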
In this embodiment, the improved time convolution network processes time-series data in parallel without a recurrent structure, so efficient training and inference can be achieved on hardware accelerators such as GPUs, greatly reducing training time. Compared with conventional Recurrent Neural Networks (RNNs), Time Convolutional Networks (TCNs) do not suffer from vanishing/exploding gradients or the difficulty of capturing long-term dependencies; these problems are avoided by using a stack of 1D convolutional layers, each of which convolves over the entire sequence, effectively expanding the receptive field and enabling the TCN to process long sequences and extract the relevant information from them. In addition, TCNs are easier to implement and debug than conventional Convolutional Neural Networks (CNNs), and since their structure is not recursive, they are generally easier to parallelize and optimize than RNNs.
On the other hand, in the embodiment, gating expansion convolution is introduced into the improved time convolution network, and the history information which needs to be reserved or forgotten at the current moment can be adaptively selected by using the gating expansion convolution. Meanwhile, the use of the residual structure can further improve the training efficiency and accuracy of the model.
By using the short-time objective speech intelligibility index (STOI) as the loss function, the model gains stronger generalization capability and can take both the signal-to-noise ratio and a high degree of signal recovery into account. Moreover, when noise reduction and dereverberation are performed simultaneously, the model is guaranteed not to trade one off against the other, i.e., the dereverberation effect is not impaired by the noise reduction.
In one possible implementation, feature decoding the recognition result includes:
and inputting the identification result to a decoder, wherein the decoder has the same number of sub-modules and the same network structure as the encoder.
In this embodiment, when the decoder is constructed, the output of the time convolution network and the outputs of the corresponding sub-convolution modules in the encoder are connected to the input of the decoder and of each corresponding sub-module as the final input data; each sub-module of the decoder consists of a deconvolution, a batch normalization and a PReLu activation layer, and the number of sub-modules equals that of the encoder.
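A PyTorch sketch of such a decoder, mirroring the encoder sketch given earlier; the concatenation of skip connections along the channel axis and the transposed-convolution paddings are assumptions, not details fixed by the patent.

```python
import torch
import torch.nn as nn

class Decoder(nn.Module):
    """Two deconvolution sub-modules mirroring the encoder, fed with the TCN
    output concatenated with the corresponding encoder outputs."""
    def __init__(self):
        super().__init__()
        # Mirror of the encoder's second sub-module; input channels are
        # doubled (64 TCN output + 64 skip connection).
        self.sub1 = nn.Sequential(
            nn.ConvTranspose2d(128, 32, kernel_size=(2, 5), stride=(1, 2), padding=(1, 2)),
            nn.BatchNorm2d(32),
            nn.PReLU(),
        )
        # Mirror of the encoder's first sub-module; output has 2 channels
        # (real and imaginary parts of the predicted mask).
        self.sub2 = nn.Sequential(
            nn.ConvTranspose2d(64, 2, kernel_size=(1, 3), stride=(1, 1), padding=(0, 1)),
            nn.BatchNorm2d(2),
            nn.PReLU(),
        )

    def forward(self, tcn_out, enc2, enc1):
        x = self.sub1(torch.cat([tcn_out, enc2], dim=1))
        return self.sub2(torch.cat([x, enc1], dim=1))
```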
In one possible implementation, the reorganizing the speech signal according to the decoded data includes:
calculating a mask signal according to the decoded data to obtain a signal gain;
performing inverse Fourier transform on the signal gain to obtain a framing signal in a time domain;
windowing and recombining the divided frame signals, and splicing the recombined frame signals to obtain a complete voice signal.
The results produced by the decoding structure and the improved time convolution network are point-multiplied with the result of the initial short-time Fourier transform to obtain the final gain result; an inverse short-time Fourier transform, windowing and signal reconstruction then yield the final result, completing speech noise reduction and dereverberation.
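A minimal scipy sketch of this reconstruction step; the STFT parameters are illustrative and must match those used in the analysis stage.

```python
import numpy as np
from scipy.signal import istft

def reconstruct(mask_real, mask_imag, spec, fs=16000, n_fft=512, hop=256):
    """Point-multiply the predicted mask with the original STFT, then
    inverse-STFT (windowed overlap-add splicing) to the full waveform."""
    # Complex mask applied to the original spectrum -> signal gain.
    gain = (mask_real + 1j * mask_imag) * spec
    # Inverse short-time Fourier transform + overlap-add recombination.
    _, speech = istft(gain, fs=fs, nperseg=n_fft, noverlap=n_fft - hop)
    return speech
```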
Referring to fig. 3, in one embodiment, there is further provided a method for voice noise reduction and dereverberation based on an improved deep convolutional network, comprising the following five steps:
s101, in order to further improve the noise reduction effect and the generalization capability of the model, when preprocessing voice data, we firstly perform digital filter processing on the original clear voice. Meanwhile, the reverberated pure voice is taken as the original data of model training. On one hand, the processing method can effectively filter invalid voice data and avoid interference to subsequent model training; on the other hand, by performing reverberation processing on the data. The training data becomes reverberant data, wherein the original clean speech is used as label data. The processed data can play the roles of noise reduction and reverberation removal after model training, and can also improve the generalization performance of the model.
S102, the training data are mixed at different signal-to-noise ratios to obtain fixed-length noisy speech; randomly extracted segments are short-time Fourier transformed, and the real and imaginary values are extracted and combined as the input data of the encoding module. The encoding module consists of two sub-convolution modules; by stacking multiple layers within it, the input data are progressively abstracted and represented in compressed form. In this process the encoding module learns the main characteristics of the data and encodes them into a vector representation, which effectively reduces noise and redundant information and improves the performance and efficiency of the deep learning model.
S103, constructing the time convolution network: the data encoded by the encoding module are shape-converted as the input of the time convolution network, and the model is constructed using causal dilated convolution (Causal Dilated Convolution) and gated dilation convolution as its main components. Each residual block consists of causal dilated convolution, gated dilation convolution, activation, normalization and Dropout layers, assembled in a residual structure.
S104, constructing a decoding module, connecting the output of the time convolution network and the output of each sub-convolution module in the encoding module with the input of the decoding module and each corresponding sub-module as final input data, wherein each sub-module of the decoding module consists of deconvolution, batch normalization and PReLu activation layers, and the number of the sub-modules is equal to that of the encoding module.
S105, the results produced by the decoding structure and the improved time convolution network are point-multiplied with the result of the initial short-time Fourier transform to obtain the final gain result; an inverse short-time Fourier transform, windowing and signal reconstruction then yield the final result, completing speech noise reduction and dereverberation.
In this embodiment, the input data are first preprocessed: reverberation data are computed from open-source data and digitally filtered; pre-emphasis, framing, windowing and the short-time Fourier transform are applied. The transformed data are then processed with a deep learning algorithm: feature encoding is performed with a convolutional network, an improved Time Convolution Network (TCN) model is constructed, and the output of the model is fed to the feature-decoding network to obtain mask data. Finally, gain calculation is performed on the obtained mask data and the original signal, and the enhanced speech signal is obtained by inverse short-time Fourier transform, windowing and reconstruction of the calculated result. Speech noise reduction and dereverberation are thus effectively combined by means of deep learning, with good results.
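Tying the earlier sketches together, the following hypothetical glue function shows the order of operations of this embodiment; the module objects and the shape bookkeeping for the TCN input are assumptions.

```python
import torch

def denoise_dereverb(encoder, tcn, decoder, noisy_feat):
    """encoder/tcn/decoder are the hypothetical modules sketched above."""
    with torch.no_grad():
        e1, e2 = encoder(noisy_feat)                        # feature encoding
        b, c, t, f = e2.shape
        seq = e2.permute(0, 1, 3, 2).reshape(b, c * f, t)   # shape conversion
        out = tcn(seq)                                      # improved TCN recognition
        out = out.reshape(b, c, f, t).permute(0, 1, 3, 2)
        mask = decoder(out, e2, e1)                         # feature decoding -> mask
    # Apply the mask to the original STFT and invert, as sketched earlier.
    return mask
```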
Based on the same inventive concept as the above method, in another embodiment of the present invention, a speech noise reduction and dereverberation system based on an improved deep convolutional network is also disclosed. Referring to fig. 4, a speech noise reduction and dereverberation system based on an improved deep convolutional network according to the present embodiment includes:
A feature encoding unit 100, configured to extract feature data of an original voice, and encode the feature data;
the feature recognition unit 200 is configured to recognize the encoded feature data through the improved time convolution network, and output a recognition result;
and the voice reorganizing unit 300 is configured to perform feature decoding on the recognition result, and reorganize the voice signal according to the decoded data.
In the system disclosed in this embodiment, specific implementation of each module may also correspond to corresponding descriptions of the method embodiments shown in the foregoing embodiments, which are not repeated herein for simplicity.
Referring to fig. 5, in another embodiment, there is further provided a speech noise reduction and dereverberation system based on an improved deep convolutional network, which is applied to the speech noise reduction and dereverberation method based on an improved deep convolutional network described in any one of the above embodiments, and includes:
the voice preprocessing module 10 is used for processing the original voice signal. The module generates reverberation data by filtering the clean human voice signal using an FIR filter and convolving it with the room impulse response data.
The output of the voice preprocessing module 10 is electrically connected to the input of the model input data processing module 20, and the model input data processing module 20 uses a variety of processing methods including pre-emphasis, framing windowing, and short-time fourier transform.
The output end of the model input data processing module 20 is electrically connected with the input end of the feature encoding module 30, and the feature encoding module 30 performs feature compression and abstraction through the design of the convolution layer, the normalization layer and the activation layer of the encoding network.
The output end of the feature encoding module 30 is electrically connected with the input end of the improved time convolution network module 40, which is used for learning the long-term dependency relationships in the input data from the features of the encoding module, so as to better capture the structural information in the time sequence and adaptively select the historical information to be retained or forgotten at the current moment.
The output end of the improved time convolution network module 40 is electrically connected to the input end of the decoding module 50, which is used for reconstructing and recovering the input data to obtain an accurate reproduction of it.
The output end of the decoding module 50 is electrically connected with the input end of the mask calculation module 60, the mask calculation module 60 can obtain a mask through forward calculation of a network, and meanwhile, point multiplication operation is performed by utilizing real and imaginary values obtained through original Fourier transformation, so that a corresponding gain result is obtained.
The output end of the mask calculation module 60 is electrically connected to the input end of the speech signal reconstruction module 70, and the speech signal reconstruction module 70 is configured to convert the result of the mask calculation module into a final speech signal after noise reduction and reverberation removal. The module reversely converts the mask result back to the time domain through the inverse short-time Fourier transform, and then the data are recombined into a complete voice signal through the overlap splicing technology. Finally, the output voice signal is subjected to various audio processing steps such as noise reduction, reverberation removal and the like, and the quality is effectively improved. The mask calculation refers to a process of obtaining a gain result by performing a mask operation on input data after forward calculation of the deep learning model. Specifically, the procedure performs corresponding calculation and multiplication of mask data obtained by the forward calculation with the original input signal, thereby obtaining gain coefficients of corresponding positions. The main purpose of this process is to extract the effective information in the input signal, cancel unwanted noise and interference signals, and make the model process the input signal more accurately.
Alternatively, the feature encoding module 30 encodes and extracts features primarily through the convolutional layer, the normalization layer, and the activation layer, providing a more valuable representation of the features for subsequent tasks. The feature encoding module 30 performs feature extraction on the input data through convolution operation of the convolution layer, and can automatically learn and extract different features, and enhance the expressive power of the network by increasing the number of convolution kernels or using convolution kernels of different sizes. The normalization layer is used for accelerating training of the neural network, and can accelerate the convergence speed of the network by normalizing the output of the neurons, so that the accuracy and generalization performance of the model are improved. The activation layer is used as an important component of nonlinear transformation in the neural network, and can perform nonlinear transformation on the convolved output, so that the expression capacity and generalization performance of the network are improved.
Optionally, the feature encoding module 30 consists of two sub-convolution modules. The first sub-convolution module includes an input layer, a convolution layer, a normalization layer and an activation layer: the output end of the input layer is electrically connected to the convolution input end of the sub-convolution extraction module; the convolution layer is used to perform feature extraction on the input data; the output end of the convolution layer is electrically connected to the input end of the normalization layer, which uses batch normalization; and the output end of the normalization layer is electrically connected to the input end of the PReLu activation layer. PReLu introduces a learnable parameter while retaining the non-negative, linear-growth characteristics of ReLu; this parameter controls the slope for inputs below zero, improving the expressive and fitting capacity of the model. The output end of the PReLu activation layer is electrically connected to the input end of the convolution layer of the next sub-convolution module, which extracts more abstract feature data; the structure of the second sub-convolution module is identical to that of the first.
Optionally, the improved time convolution network module 40 consists of two residual blocks with two sub-residual blocks in each; each sub-residual block is connected in sequence by a causal dilated convolution, a one-dimensional convolution, a gated dilation convolution, a normalization layer, an activation layer and a Dropout layer, with the Dropout layer of one sub-residual block connected to the causal dilated convolution of the other, as shown in fig. 2. The module is used for learning the long-term dependency relationships of the data and adaptively selecting the historical information to be retained or forgotten at the current moment; the residual structure further improves the training efficiency and accuracy of the model.
Optionally, the decoding module 50 is composed of deconvolution, normalization and activation layers, and likewise two sub deconvolution modules correspond to and are identical in structure to the feature encoding module 30, and its input is composed of the output of each sub-module of the feature encoding module 30 together with the output of the modified time convolution network module 40 for reconstruction restoration of data.
Optionally, the speech signal reconstruction module 70 is configured to perform subsequent processing on the gain-processed data, so as to obtain a final speech signal after noise reduction and dereverberation. In a specific implementation, the module performs inverse short-time fourier transform to convert the frequency domain data into time domain data, and then uses a windowing technique to reconstruct the signal after the frame. Finally, by splicing all the recombined frames, the final noise-reduced and dereverberated speech signal is output.
In the system disclosed in this embodiment, specific implementation of each module may also correspond to corresponding descriptions of the method embodiments shown in the foregoing embodiments, which are not repeated herein for simplicity.
In one embodiment, the present invention also provides a computer-readable storage medium having stored therein a computer program comprising program instructions which, when executed by a processor of an electronic device, cause the processor to perform a method as any one of the possible implementations described above.
In one embodiment, the present invention also provides an electronic device, including: the electronic device comprises a processor, a transmitting means, an input means, an output means and a memory for storing computer program code comprising computer instructions which, when executed by the processor, cause the electronic device to perform a method as any one of the possible implementations described above.
Referring to fig. 6, fig. 6 is a schematic diagram of a hardware structure of an electronic device according to an embodiment of the invention.
The electronic device 2 comprises a processor 21, a memory 22, an input device 23 and an output device 24. The processor 21, memory 22, input device 23 and output device 24 are coupled by connectors, including various interfaces, transmission lines or buses, which are not limited by the embodiments of the present invention. It should be appreciated that in various embodiments of the invention, "coupled" means interconnected in a particular way, including directly or indirectly through other devices, e.g., through various interfaces, transmission lines, buses, etc.
The processor 21 may be one or more graphics processors (graphics processing unit, GPUs), which may be single-core GPUs or multi-core GPUs in the case where the processor 21 is a GPU. Alternatively, the processor 21 may be a processor group formed by a plurality of GPUs, and the plurality of processors are coupled to each other through one or more buses. In the alternative, the processor may be another type of processor, and the embodiment of the invention is not limited.
Memory 22 may be used to store computer program instructions as well as various types of computer program code for performing aspects of the present invention. Optionally, the memory includes, but is not limited to, a random access memory (random access memory, RAM), a read-only memory (ROM), an erasable programmable read-only memory (erasable programmable read only memory, EPROM), or a portable read-only memory (compact disc read-only memory, CD-ROM) for associated instructions and data.
The input means 23 are for inputting data and/or signals and the output means 24 are for outputting data and/or signals. The input device 23 and the output device 24 may be separate devices or an integral device.
It will be appreciated that in embodiments of the present invention, the memory 22 may not only be used to store relevant instructions, but embodiments of the present invention are not limited to the specific data stored in the memory.
It will be appreciated that fig. 6 shows only a simplified design of the electronic device. In practical applications, the electronic device may further include other necessary elements, including but not limited to any number of input/output devices, processors and memories, and all devices capable of implementing the embodiments of the present invention fall within the scope of protection of the present invention.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, and are not repeated herein. It will be further apparent to those skilled in the art that the descriptions of the various embodiments of the present invention are provided with emphasis, and that the same or similar parts may not be described in detail in different embodiments for convenience and brevity of description, and thus, parts not described in one embodiment or in detail may be referred to in description of other embodiments.
In the several embodiments provided by the present invention, it should be understood that the disclosed systems, devices, and methods may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of the units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit.
In the above embodiments, implementation may be in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, implementation may be in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions according to the embodiments of the present invention are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, they may be transmitted from one website, computer, server or data center to another website, computer, server or data center by wire (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wirelessly (e.g., infrared, radio, microwave). The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or data center that integrates one or more available media. The available medium may be a magnetic medium (e.g., a floppy disk, hard disk or magnetic tape), an optical medium (e.g., a digital versatile disc (DVD)), or a semiconductor medium (e.g., a solid state disk (SSD)), among others.
Those of ordinary skill in the art will appreciate that all or part of the flows of the above-described method embodiments may be implemented by a computer program instructing related hardware; the program may be stored in a computer-readable storage medium and, when executed, may include the flows of the above-described method embodiments. The aforementioned storage medium includes: a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or the like.

Claims (10)

1. A method for noise reduction and dereverberation of speech based on an improved deep convolutional network, the method comprising:
extracting characteristic data of original voice, and encoding the characteristic data;
identifying the coded characteristic data through an improved time convolution network, and outputting an identification result;
and performing feature decoding on the identification result, and recombining a voice signal according to the decoded data.
2. The method for noise reduction and dereverberation of speech based on the improved deep convolutional network as set forth in claim 1, wherein said extracting feature data of the original speech comprises:
filtering and reverberation processing are carried out on the original speech by adopting an FIR digital filter to generate reverberant speech;
mixing noise into the reverberant speech at different signal-to-noise ratios to generate noisy speech;
performing a short-time Fourier transform on the noisy speech to generate frequency-domain data of the noisy speech;
and combining the real values and imaginary values of the frequency-domain data to generate the characteristic data of the original speech.
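By way of illustration only (this sketch is not part of the claim), the feature extraction of claim 2 could be realized along the following lines in Python; the toy room impulse response, the 5 dB signal-to-noise ratio and the STFT parameters are assumed example values.

    import numpy as np
    from scipy.signal import lfilter, stft

    fs = 16000
    clean = np.random.randn(fs)                 # stand-in for the original speech
    rir = np.random.randn(2048) * np.exp(-np.linspace(0.0, 8.0, 2048))  # toy FIR room response

    reverb = lfilter(rir, [1.0], clean)         # FIR filtering -> reverberant speech

    snr_db = 5.0                                # assumed signal-to-noise ratio
    noise = np.random.randn(len(reverb))
    noise *= np.sqrt(np.sum(reverb ** 2) / (10 ** (snr_db / 10) * np.sum(noise ** 2)))
    noisy = reverb + noise                      # noisy reverberant speech

    _, _, Z = stft(noisy, fs=fs, nperseg=512, noverlap=256)
    features = np.stack([Z.real, Z.imag], axis=0)   # real/imaginary feature tensor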
3. The improved deep convolutional network-based speech noise reduction and dereverberation method of claim 2, further comprising, prior to performing the short-time Fourier transform on the noisy speech:
pre-emphasizing the noisy speech to improve the signal-to-noise ratio of its high-frequency part;
and framing and windowing the pre-emphasized noisy speech.
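Illustratively (not part of the claim), the pre-emphasis, framing and windowing of claim 3 could be sketched as follows; the 0.97 pre-emphasis coefficient and the 32 ms frame with 50% overlap are common defaults assumed here, not values fixed by the claim.

    import numpy as np

    fs = 16000
    x = np.random.randn(fs)                     # stand-in for the noisy speech

    alpha = 0.97                                # assumed pre-emphasis coefficient
    x_pre = np.append(x[0], x[1:] - alpha * x[:-1])   # boosts the high-frequency part

    frame_len, hop = 512, 256                   # 32 ms frames, 50% overlap at 16 kHz
    n_frames = 1 + (len(x_pre) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = x_pre[idx] * np.hanning(frame_len)  # framed and windowed signal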
4. The improved deep convolutional network-based speech noise reduction and dereverberation method of claim 2, wherein encoding the feature data comprises:
inputting the characteristic data to an encoder, the encoder comprising a first sub-module and a second sub-module;
processing the characteristic data sequentially through the convolution layer, the normalization layer and the PReLU activation layer of the first sub-module, and processing the output data of the PReLU activation layer of the first sub-module sequentially through the convolution layer, the normalization layer and the PReLU activation layer of the second sub-module to generate the encoded data.
5. The method for noise reduction and dereverberation of speech based on an improved deep convolutional network according to claim 4, wherein the convolution kernel size of the convolution layer of the first sub-module is (1, 3), the step size is (1, 1), and the number of convolution kernels is 32; the convolution kernel size of the convolution layer of the second sub-module is (2, 5), the step size is (1, 2), and the number of convolution kernels is 64.
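Illustratively (not part of the claims), an encoder with the sub-module structure of claim 4 and the kernel sizes, step sizes and kernel counts of claim 5 might be written in PyTorch as below; the (batch, real/imag, frames, frequency bins) input layout, the padding and the batch-normalization choice are assumptions.

    import torch
    import torch.nn as nn

    encoder = nn.Sequential(
        # first sub-module: convolution -> normalization -> PReLU
        nn.Conv2d(2, 32, kernel_size=(1, 3), stride=(1, 1), padding=(0, 1)),
        nn.BatchNorm2d(32),
        nn.PReLU(),
        # second sub-module: convolution -> normalization -> PReLU
        nn.Conv2d(32, 64, kernel_size=(2, 5), stride=(1, 2), padding=(1, 2)),
        nn.BatchNorm2d(64),
        nn.PReLU(),
    )

    feats = torch.randn(1, 2, 100, 257)         # (batch, real/imag, frames, bins)
    encoded = encoder(feats)                    # -> (1, 64, 101, 129)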
6. The method for speech noise reduction and dereverberation based on the improved deep convolutional network of claim 4, wherein feature decoding the recognition result comprises:
and inputting the identification result to a decoder, wherein the decoder has the same number of sub-modules and the same network structure as the encoder.
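Illustratively (not part of the claim), a hypothetical mirror-image decoder with the same number of sub-modules and the same structure as the encoder above could use transposed convolutions; everything beyond that correspondence is an assumption.

    import torch
    import torch.nn as nn

    decoder = nn.Sequential(
        nn.ConvTranspose2d(64, 32, kernel_size=(2, 5), stride=(1, 2), padding=(1, 2)),
        nn.BatchNorm2d(32),
        nn.PReLU(),
        nn.ConvTranspose2d(32, 2, kernel_size=(1, 3), stride=(1, 1), padding=(0, 1)),
    )

    decoded = decoder(torch.randn(1, 64, 101, 129))   # -> (1, 2, 100, 257)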
7. The improved deep convolutional network-based speech noise reduction and dereverberation method of claim 1, further comprising, prior to identifying the encoded feature data through the improved time convolution network, training the time convolution network, which comprises:
obtaining training samples, wherein the training samples are data obtained by encoding characteristic data of a plurality of voices;
performing shape conversion on the training sample to generate a multidimensional tensor;
inputting the multidimensional tensor into a time convolution network for training, wherein the time convolution network comprises two residual blocks, and each residual block comprises two sub-residual modules;
each sub-residual module comprises a causal dilated convolution layer, a gated dilated convolution layer, a normalization layer, an activation layer and a Dropout layer which are connected in sequence; the Dropout layer of one sub-residual module is connected with the causal dilated convolution layer of the other sub-residual module.
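Illustratively (not part of the claim), one sub-residual module of claim 7 might be sketched as follows; the channel count, the tanh/sigmoid gate and the placement of the skip connection are assumptions, and causality is obtained by left-only padding before each dilated convolution.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SubResidualModule(nn.Module):
        """Causal dilated conv -> gated dilated conv -> norm -> activation -> dropout."""
        def __init__(self, channels=64, kernel=3, dilation=1, p_drop=0.1):
            super().__init__()
            self.pad = (kernel - 1) * dilation      # left-only padding keeps it causal
            self.causal = nn.Conv1d(channels, channels, kernel, dilation=dilation)
            self.gate_a = nn.Conv1d(channels, channels, kernel, dilation=dilation)
            self.gate_b = nn.Conv1d(channels, channels, kernel, dilation=dilation)
            self.norm = nn.BatchNorm1d(channels)
            self.act = nn.PReLU()
            self.drop = nn.Dropout(p_drop)

        def forward(self, x):
            y = self.causal(F.pad(x, (self.pad, 0)))           # causal dilated convolution
            y = torch.tanh(self.gate_a(F.pad(y, (self.pad, 0)))) * \
                torch.sigmoid(self.gate_b(F.pad(y, (self.pad, 0))))  # gated dilated convolution
            return self.drop(self.act(self.norm(y))) + x       # assumed skip placement

    # the Dropout output of the first sub-module feeds the causal convolution
    # of the second, as the claim describes
    block = nn.Sequential(SubResidualModule(dilation=1), SubResidualModule(dilation=2))
    out = block(torch.randn(1, 64, 100))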
8. The improved deep convolutional network-based speech noise reduction and dereverberation method of claim 7, wherein inputting the multidimensional tensor into the time convolution network for training further comprises:
constructing a loss function according to the short-time objective intelligibility (STOI) index;
and back-propagating the loss function through the time convolution network to update its gradients until the loss function meets a preset condition, thereby generating the improved time convolution network.
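Illustratively (not part of the claim), a training step with an STOI-based loss could be sketched with the third-party torch_stoi package, assuming its differentiable NegSTOILoss interface; the stand-in model, optimizer, learning rate and toy batch are placeholders rather than the network of this invention.

    import torch
    import torch.nn as nn
    from torch_stoi import NegSTOILoss          # pip install torch_stoi

    model = nn.Conv1d(1, 1, 3, padding=1)       # stand-in for the time convolution network
    loss_fn = NegSTOILoss(sample_rate=16000)    # differentiable negative-STOI loss
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

    noisy = torch.randn(4, 1, 16000)            # toy batch of 1 s waveforms
    clean = torch.randn(4, 1, 16000)

    for step in range(10):                      # training loop sketch
        enhanced = model(noisy)
        loss = loss_fn(enhanced.squeeze(1), clean.squeeze(1)).mean()
        optimizer.zero_grad()
        loss.backward()                         # back-propagate through the network
        optimizer.step()                        # gradient update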
9. The method for speech noise reduction and dereverberation based on an improved deep convolutional network of claim 1, wherein recombining the speech signal from the decoded data comprises:
calculating a mask signal according to the decoded data to obtain a signal gain;
performing an inverse Fourier transform on the gained signal to obtain time-domain framed signals;
and windowing and recombining the framed signals, and splicing the recombined frames to obtain the complete speech signal.
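Illustratively (not part of the claim), the mask-and-gain step of claim 9 might interpret the decoded data as the real and imaginary planes of a complex mask and multiply it onto the noisy spectrum; all array shapes below are assumed example values.

    import numpy as np

    noisy_spec = np.random.randn(257, 100) + 1j * np.random.randn(257, 100)  # noisy STFT
    decoded = np.random.randn(2, 257, 100)      # stand-in for the decoded data

    mask = decoded[0] + 1j * decoded[1]         # complex mask from the real/imag planes
    gained = mask * noisy_spec                  # signal gain applied per bin
    # a per-frame inverse Fourier transform plus windowed overlap-add, as in the
    # reconstruction sketch earlier, would then yield the complete speech signal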
10. A speech noise reduction and dereverberation system based on an improved deep convolutional network, the system comprising:
the characteristic coding unit is used for extracting characteristic data of the original voice and coding the characteristic data;
the characteristic recognition unit is used for recognizing the coded characteristic data through the improved time convolution network and outputting a recognition result;
and the voice reorganization unit is used for carrying out feature decoding on the identification result and reorganizing voice signals according to the decoded data.
CN202311452944.8A 2023-11-03 2023-11-03 Speech noise reduction and dereverberation method based on improved deep convolutional network Pending CN117174105A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311452944.8A CN117174105A (en) 2023-11-03 2023-11-03 Speech noise reduction and dereverberation method based on improved deep convolutional network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311452944.8A CN117174105A (en) 2023-11-03 2023-11-03 Speech noise reduction and dereverberation method based on improved deep convolutional network

Publications (1)

Publication Number Publication Date
CN117174105A true CN117174105A (en) 2023-12-05

Family

ID=88930322

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311452944.8A Pending CN117174105A (en) 2023-11-03 2023-11-03 Speech noise reduction and dereverberation method based on improved deep convolutional network

Country Status (1)

Country Link
CN (1) CN117174105A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117594056A (en) * 2024-01-18 2024-02-23 深圳市龙芯威半导体科技有限公司 RNN voice noise reduction and dereverberation method and system based on SIFT
CN117854536A (en) * 2024-03-09 2024-04-09 深圳市龙芯威半导体科技有限公司 RNN noise reduction method and system based on multidimensional voice feature combination


Patent Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110751957A (en) * 2019-09-25 2020-02-04 电子科技大学 Speech enhancement method using stacked multi-scale modules
CN110970044A (en) * 2019-11-27 2020-04-07 武汉大学 Speech enhancement method oriented to speech recognition
US20210012767A1 (en) * 2020-09-25 2021-01-14 Intel Corporation Real-time dynamic noise reduction using convolutional networks
CN116686047A (en) * 2021-01-06 2023-09-01 杜比实验室特许公司 Determining a dialog quality measure for a mixed audio signal
CN113035217A (en) * 2021-03-01 2021-06-25 武汉大学 Voice enhancement method based on voiceprint embedding under low signal-to-noise ratio condition
CN113314140A (en) * 2021-05-31 2021-08-27 哈尔滨理工大学 Sound source separation algorithm of end-to-end time domain multi-scale convolutional neural network
CN114141266A (en) * 2021-12-08 2022-03-04 南京大学 Speech enhancement method for estimating prior signal-to-noise ratio based on PESQ driven reinforcement learning
CN114283795A (en) * 2021-12-24 2022-04-05 思必驰科技股份有限公司 Training and recognition method of voice enhancement model, electronic equipment and storage medium
CN114360567A (en) * 2022-02-16 2022-04-15 东北大学 Single-channel voice enhancement method based on deep rewinding product network
CN116935879A (en) * 2022-04-06 2023-10-24 重庆邮电大学 Two-stage network noise reduction and dereverberation method based on deep learning
CN115620737A (en) * 2022-09-28 2023-01-17 北京奕斯伟计算技术股份有限公司 Voice signal processing device, method, electronic equipment and sound amplification system
CN116486826A (en) * 2023-04-27 2023-07-25 哈尔滨理工大学 Voice enhancement method based on converged network
CN116959476A (en) * 2023-08-30 2023-10-27 腾讯科技(深圳)有限公司 Audio noise reduction processing method and device, storage medium and electronic equipment

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
意念回复: "TCN (Temporal Convolutional Network)", page 2, retrieved from the Internet <URL:https://blog.csdn.net/weixin_39910711/article/details/124678538> *
Chai Song (ed.): "A Basic Tutorial on Machine Learning", vol. 1, Chengdu: University of Electronic Science and Technology of China Press, pages 107-111 *
Xing Xiangqi: "Separation and Automatic Transcription of Piano Audio in Complex Environments", China Masters' Theses Full-text Database, Philosophy and Humanities, pages 9-11 *


Similar Documents

Publication Publication Date Title
CN109841226B (en) Single-channel real-time noise reduction method based on convolution recurrent neural network
JP7337953B2 (en) Speech recognition method and device, neural network training method and device, and computer program
CN110600017B (en) Training method of voice processing model, voice recognition method, system and device
CN107680611B (en) Single-channel sound separation method based on convolutional neural network
CN117174105A (en) Speech noise reduction and dereverberation method based on improved deep convolutional network
WO2020098256A1 (en) Speech enhancement method based on fully convolutional neural network, device, and storage medium
CN112801906B (en) Cyclic iterative image denoising method based on cyclic neural network
CN115602152B (en) Voice enhancement method based on multi-stage attention network
CN113362250A (en) Image denoising method and system based on dual-tree quaternary wavelet and deep learning
CN114974280A (en) Training method of audio noise reduction model, and audio noise reduction method and device
CN114495957A (en) Method, system and device for speech enhancement based on Transformer improvement
Qi et al. Exploring deep hybrid tensor-to-vector network architectures for regression based speech enhancement
CN113379606B (en) Face super-resolution method based on pre-training generation model
WO2022213825A1 (en) Neural network-based end-to-end speech enhancement method and apparatus
CN115295002A (en) Single-channel speech enhancement method based on interactive time-frequency attention mechanism
Li et al. Dynamic attention based generative adversarial network with phase post-processing for speech enhancement
CN111832596B (en) Data processing method, electronic device and computer readable medium
CN112907456A (en) Deep neural network image denoising method based on global smooth constraint prior model
CN111477239B (en) Noise removing method and system based on GRU neural network
CN112639832A (en) Identifying salient features of a generating network
CN117351983B (en) Transformer-based voice noise reduction method and system
CN113222113B (en) Signal generation method and device based on deconvolution layer
CN117292024B (en) Voice-based image generation method and device, medium and electronic equipment
CN117594056A (en) RNN voice noise reduction and dereverberation method and system based on SIFT
CN117894306A (en) Voice processing method, device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination