CN117594056A - RNN voice noise reduction and dereverberation method and system based on SIFT - Google Patents

RNN voice noise reduction and dereverberation method and system based on SIFT

Info

Publication number
CN117594056A
CN117594056A (application CN202410075344.2A)
Authority
CN
China
Prior art keywords
layer
sift
voice
rnn
lstm
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410075344.2A
Other languages
Chinese (zh)
Inventor
韦伟才
邓海蛟
马健莹
潘晖
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Longxinwei Semiconductor Technology Co ltd
Original Assignee
Shenzhen Longxinwei Semiconductor Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Longxinwei Semiconductor Technology Co ltd filed Critical Shenzhen Longxinwei Semiconductor Technology Co ltd
Priority to CN202410075344.2A
Publication of CN117594056A
Legal status: Pending

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G10L21/0232 Processing in the frequency domain
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G06N3/0442 Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/16 Speech classification or search using artificial neural networks
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L2021/02082 Noise filtering the noise being echo, reverberation of the speech

Abstract

The invention discloses an RNN voice noise reduction and dereverberation method and system based on SIFT. The method comprises the following steps: extracting SIFT features of original voice; inputting the SIFT features into a preset RNN model; and carrying out signal reconstruction according to output data of the RNN model to generate target voice. By combining SIFT features with the RNN network model, the invention improves voice processing speed, enhances the noise reduction and dereverberation effects, ensures real-time performance, keeps the requirement on device memory low, and saves operation cost.

Description

RNN voice noise reduction and dereverberation method and system based on SIFT
Technical Field
The invention relates to the technical field of voice enhancement, and in particular to an RNN voice noise reduction and dereverberation method and system based on SIFT.
Background
Noise reduction and dereverberation are among the most important research directions in the field of speech signal processing. Speech recognition, speaker recognition, and audio processing all require efficient noise reduction and dereverberation methods to improve the signal-to-noise ratio and intelligibility of the signal. Common methods such as spectral subtraction, wavelet-transform noise reduction, double-threshold energy extraction, and reverberation cancellation based on blind source separation have gradually become mainstream and are widely used in actual production.
Although existing noise reduction and dereverberation techniques can already achieve certain results, many technical challenges remain. For example, the time-varying, nonlinear characteristics and the diversity of the speech signal itself affect the accuracy and robustness of noise reduction and dereverberation algorithms, and processing speed is often unsatisfactory. How to reduce the complexity of these algorithms while improving their accuracy, stability, and reliability is therefore a problem to be solved in the art.
Disclosure of Invention
To solve at least one of the technical problems set forth above, the present invention provides an RNN speech noise reduction and dereverberation method and system based on SIFT, which enhance the noise reduction and dereverberation effects of speech, reduce algorithm complexity, and increase the speed of speech processing.
In a first aspect, the present invention provides a SIFT-based RNN speech noise reduction and dereverberation method, the method comprising:
extracting SIFT features of original voice;
and inputting the SIFT features into a preset RNN model, and carrying out signal reconstruction according to output data of the RNN model to generate target voice.
In one possible implementation manner, before the extracting the SIFT feature of the original voice, the method further includes:
And performing FIR digital filtering on the original voice, and performing reverberation convolution operation.
In one possible implementation manner, before the extracting the SIFT feature of the original voice, the method further includes performing a spectral transformation on the original voice, including: pre-emphasis is carried out on the voice signal of the original voice, so that the signal-to-noise ratio of the voice signal in a high-frequency part is improved;
carrying out framing windowing on the pre-emphasized voice signal, and carrying out short-time Fourier transform to generate a conversion signal of the voice signal from a time domain to a frequency domain;
and rotating and mapping the converted signals to generate a frequency spectrum image.
In one possible implementation manner, the extracting SIFT features of the original speech includes:
detecting scale space extremum of the frequency spectrum image, and identifying potential interested areas with scale and direction invariance;
extracting local extreme points under different scales in the potential interest region through a Gaussian differential pyramid, and taking the local extreme points as key points of a SIFT algorithm;
performing direction assignment on the key points, calculating gradient directions of the key points, and distributing the key points to corresponding gradient direction histograms;
and constructing a 4 multiplied by 4 window by taking the key points as the center, calculating the gradient amplitude and the gradient direction of each pixel point in the window, and determining the 128-dimensional SIFT feature vector of the key points.
In one possible implementation manner, the preset RNN model includes a frequency-time modulation spectrum sensing region extraction module;
the frequency-time modulation spectrum sensing region extraction module comprises a bidirectional LSTM network and a unidirectional LSTM network;
the bidirectional LSTM network is used for receiving input data of an input layer, performing first characteristic learning on the input data and transmitting a learning result to the LSTM unit;
and the unidirectional LSTM network performs secondary feature learning on the learning result, and extracts a frequency-time interested region.
In one possible implementation, the bidirectional LSTM network includes a bidirectional LSTM layer, a first fully connected layer, and a layer normalization layer;
the bidirectional LSTM layer comprises a forward LSTM layer and a reverse LSTM layer, which are respectively used for performing feature learning from different directions of an input data sequence of the input layer;
the first full connection layer is used for mapping nonlinear characteristics output by the bidirectional LSTM layer to a new characteristic space;
and the layer normalization layer is used for normalizing the output data of the first full-connection layer.
In one possible implementation, the unidirectional LSTM network includes a first LSTM layer, a second fully-connected layer, and a ReLU activation layer;
The first LSTM layer is used for carrying out feature learning on the output data of the layer normalization layer;
the second full connection layer is used for mapping nonlinear characteristics output by the first LSTM layer to a new characteristic space;
the ReLU activation layer is configured to introduce a nonlinear feature into the output data of the second fully-connected layer.
In one possible implementation manner, the preset RNN model further includes a narrowband filtering network module;
the narrow-band filtering network module comprises a second LSTM layer, a third full-connection layer and an output layer;
the second LSTM layer is configured to extract mask features from output data of the ReLU activation layer;
the third full connection layer is used for mapping nonlinear characteristics output by the second LSTM layer to a new characteristic space;
the output layer is connected with the output end of the third full-connection layer and is used for outputting mask data.
In one possible implementation manner, the performing signal reconstruction according to the output data of the RNN model, generating the target voice includes:
calculating a signal gain based on output data of the RNN model;
and performing inverse Fourier transform, windowing and signal reconstruction on the signal gain to generate target voice.
In a second aspect, the present invention also provides a SIFT-based RNN speech noise reduction and dereverberation system, the system comprising:
the feature extraction unit is used for extracting SIFT features of the original voice;
and the signal reconstruction unit is used for inputting the SIFT features into a preset RNN model, and performing signal reconstruction according to the output data of the RNN model to generate target voice.
Compared with the prior art, the invention has the beneficial effects that:
the invention discloses a RNN voice noise reduction and dereverberation method and system based on SIFT, wherein the method comprises the following steps: extracting SIFT features of original voice; and inputting the SIFT features into a preset RNN model, and carrying out signal reconstruction according to output data of the RNN model to generate target voice. According to the invention, the SIFT features are combined with the RNN network model, so that the voice processing speed can be improved, the voice noise reduction and dereverberation processing effects are enhanced, meanwhile, the real-time performance is ensured, the requirement on the equipment memory is low, and the operation cost is saved.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
In order to more clearly describe the embodiments of the present invention or the technical solutions in the background art, the following description will describe the drawings that are required to be used in the embodiments of the present invention or the background art.
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and together with the description, serve to explain the technical aspects of the disclosure.
FIG. 1 is a schematic flow chart of a RNN voice noise reduction and dereverberation method based on SIFT according to an embodiment of the present invention;
fig. 2 is a schematic flow chart of spectrum conversion on an original voice signal according to an embodiment of the present invention;
fig. 3 is a schematic flow chart of SIFT feature extraction on an original speech signal according to an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of an RNN model according to an embodiment of the present invention;
FIG. 5 is a schematic structural diagram of a SIFT-based RNN voice noise reduction and reverberation removal system according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of another embodiment of a SIFT-based RNN speech noise reduction and dereverberation system;
fig. 7 is a schematic hardware structure of an electronic device according to an embodiment of the present invention.
Detailed Description
In order that those skilled in the art will better understand the present invention, a technical solution in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in which it is apparent that the described embodiments are only some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The terms first, second and the like in the description and in the claims and in the above-described figures are used for distinguishing between different objects and not necessarily for describing a sequential or chronological order. Furthermore, the terms "comprise" and "have," as well as any variations thereof, are intended to cover a non-exclusive inclusion. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those listed steps or elements but may include other steps or elements not listed or inherent to such process, method, article, or apparatus.
The term "and/or" is herein merely an association relationship describing an associated object, meaning that there may be three relationships, e.g., a and/or B, may represent: a exists alone, A and B exist together, and B exists alone. In addition, the term "at least one" herein means any one of a plurality or any combination of at least two of a plurality, for example, including at least one of A, B, C, and may mean including any one or more elements selected from the group consisting of A, B and C.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the invention. The appearances of such phrases in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those of skill in the art will explicitly and implicitly appreciate that the embodiments described herein may be combined with other embodiments.
Furthermore, in the following detailed description, numerous specific details are set forth in order to provide a better illustration of the invention. It will be understood by those skilled in the art that the present invention may be practiced without some of these specific details. In some instances, well known methods, procedures, components, and circuits have not been described in detail so as not to obscure the present invention.
The currently common noise reduction and dereverberation methods include spectral subtraction, wavelet-transform noise reduction, double-threshold energy extraction, and reverberation cancellation based on blind source separation, but these methods still face many technical challenges. For example, the time-varying, nonlinear characteristics and the diversity of the speech signal itself affect the accuracy and robustness of noise reduction and dereverberation algorithms, and the processing efficiency is not ideal. Therefore, by extracting SIFT features of the voice and combining them with an RNN neural network model, the invention improves the speed of voice processing and the recognition precision after noise reduction and dereverberation.
Referring to fig. 1, fig. 1 is a flowchart of a method for reducing noise and dereverberation of RNN speech based on SIFT according to an embodiment (a) of the present invention.
A SIFT-based RNN speech denoising and dereverberation method, comprising:
s10, extracting SIFT features of original voice;
s20, inputting SIFT features into a preset RNN model, and performing signal reconstruction according to output data of the RNN model to generate target voice.
SIFT (Scale-Invariant Feature Transform) is a feature extraction algorithm widely used in computer vision. The SIFT algorithm automatically detects local features at different scales and orientations in an image; these features are invariant to transformations such as rotation and scaling, and the method extracts feature information from the image effectively while maintaining reasonable speed.
In general, noise reduction and reverberation removal are implemented with different technical methods, and achieving both at once within a single approach is itself a research direction in speech processing: traditional methods can only perform noise reduction and reverberation removal separately, whereas a deep learning algorithm, with its strong capabilities in feature extraction, analysis, and optimization, can achieve both simultaneously without difficulty. In particular, for sequence data such as voice, constructing the model with a recurrent neural network (RNN) yields good results.
In particular, the use of RNN models generally has the following advantages:
processing the sequence data: the RNN model has a memory function and can process sequence data such as text, time series, etc. It can build context by memorizing previous information and utilizing this information in processing subsequent data to better capture patterns and dependencies in the sequence.
Variable length input: the RNN model may process variable length input sequences. It can receive inputs of different lengths at each time step and generate outputs of corresponding lengths, making it very flexible in processing sequence data of different lengths.
Parameter sharing: the RNN model processes sequence data by sharing the same parameters over different time steps, which results in a model with fewer parameters. Parameter sharing can reduce the complexity of the model and improve the training efficiency and generalization capability of the model.
Context understanding: due to its memory function, the RNN model can better understand and utilize context information. The model may use previous information to infer the current context and predict and generate based on the context. This makes the RNN model excellent in tasks such as natural language processing and speech recognition.
Hierarchical representation capabilities: the RNN model may enhance its representational capacity by stacking multiple loop layers. The multi-layered RNN model can learn more complex features and relationships and better capture abstract representations in the input sequence. This enables the RNN model to handle more complex tasks and data.
In this embodiment, a preset RNN model needs to be acquired first, the model needs to take a large amount of speech data as samples, SIFT features of speech as model input, and mask signals as model output, so as to train the RNN model meeting training conditions. Then, when the voice noise reduction and dereverberation processing is carried out, only the SIFT feature of the voice to be recognized, namely the original voice, is required to be obtained and input into a trained RNN model to obtain a recognized mask signal, and finally, the signal reconstruction is carried out according to the mask signal, so that the noise reduction and dereverberation process of the original voice is completed.
Therefore, in this embodiment, by combining SIFT features with the RNN network model, on one hand the deep learning approach improves voice processing speed and lowers the memory requirement when the algorithm is ported and deployed; on the other hand, through the design of the training data and the deep learning algorithm, the noise reduction and dereverberation effects are optimized and the recognition accuracy of the voice is improved.
In one embodiment, before extracting the SIFT feature of the original speech in step S10, FIR digital filtering is further included on the original speech, and a reverberation convolution operation is performed.
In general, if the original speech is not preprocessed, there is much signal interference in extracting SIFT features, which reduces the accuracy of subsequent speech processing. Thus, in this embodiment, the original speech needs to be preprocessed.
Specifically, the voice pre-conversion is implemented as follows: the clean speech is first processed by an FIR digital filter based on a linear time-invariant model, i.e., assuming the signal is stationary in both the time and frequency domains, so that the filtering function can be achieved by a convolution operation. The FIR digital filter takes a set of discrete time-series data as input and convolves it with a set of pre-designed filter coefficients to obtain the output sequence, thereby filtering the input signal. Noise and interference in the input can thus be filtered out, improving signal quality and precision; the frequency response curve of the output can also be adjusted to better match the desired target characteristics, changing the signal morphology.
In this embodiment, the preprocessing of the voice uses simple digital filtering for the preliminary calculation and applies a reverberation convolution to the clean voice. On one hand, the preprocessed voice signal effectively filters out particular noise or useless information, avoiding interference with subsequent model learning; on the other hand, using the reverberation-convolved data as input lets the trained model remove reverberation, further improving the noise reduction effect, the generalization capability of the model, and the subsequent spectrum conversion.
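For illustration only, a minimal Python sketch of this preprocessing step, assuming numpy and scipy are available; the filter design parameters and the room impulse response rir are placeholders, not values specified by the invention:

```python
import numpy as np
from scipy.signal import firwin, lfilter

def preprocess(clean_wav, sr=16000, rir=None, num_taps=64, cutoff_hz=7000):
    """FIR digital filtering followed by an optional reverberation convolution."""
    taps = firwin(num_taps, cutoff_hz, fs=sr)      # linear time-invariant FIR design
    filtered = lfilter(taps, 1.0, clean_wav)       # convolution with the filter taps
    if rir is not None:
        # Convolving with a room impulse response synthesizes the reverberant
        # input used during training, so the model also learns to dereverberate.
        filtered = np.convolve(filtered, rir)[: len(filtered)]
    return filtered
```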
In one embodiment, before extracting SIFT features of the original speech in step S10, the method further includes performing a spectral transformation on the original speech, including:
pre-emphasis is carried out on the voice signal of the original voice, so that the signal-to-noise ratio of the voice signal in a high-frequency part is improved;
carrying out framing windowing on the pre-emphasized voice signal, and carrying out short-time Fourier transform to generate a conversion signal of the voice signal from a time domain to a frequency domain;
the converted signal is rotated and mapped to generate a spectral image.
If the SIFT feature is directly extracted without performing spectrum transformation on the original voice, the extracted feature is often in a condition of extraction disorder caused by noise, shielding and picture content disorder, and the final voice processing result is affected. In order to improve the extraction quality of SIFT features, in this embodiment, spectrum transformation is required to be performed on the original speech, and then SIFT feature extraction is performed.
Referring to fig. 2, fig. 2 provides the flow of spectral transformation of the original speech: first, pre-emphasis is applied to the voice signal to improve the signal-to-noise ratio of its high-frequency part; then framing and windowing turn the one-dimensional data into two dimensions while ensuring smooth transitions, and a short-time Fourier transform converts each frame from the time domain to the frequency domain; finally, the data are rotated 90 degrees anticlockwise and quantized into the range 0 to 255, representing different color values.
According to the embodiment, SIFT feature extraction is performed after frequency spectrum conversion, so that features can be well positioned in a spatial domain or a frequency domain by the SIFT feature extraction method; secondly, the probability of extraction disorder caused by noise, shielding and picture content disorder can be reduced, a large number of features can be extracted, and the features are highly independent.
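A minimal sketch of this spectrum conversion, assuming numpy; the frame length, hop size, and pre-emphasis coefficient are illustrative defaults rather than values fixed by the invention:

```python
import numpy as np

def spectrogram_image(wav, frame_len=512, hop=256, alpha=0.97):
    """Pre-emphasis, framing + windowing, short-time FFT, then rotation and
    quantization of the magnitudes into a 0-255 image for SIFT."""
    # Pre-emphasis boosts the high-frequency SNR: y[n] = x[n] - alpha * x[n-1].
    y = np.append(wav[0], wav[1:] - alpha * wav[:-1])
    win = np.hanning(frame_len)
    n_frames = 1 + (len(y) - frame_len) // hop
    frames = np.stack([y[i * hop : i * hop + frame_len] * win
                       for i in range(n_frames)])
    spec = np.fft.rfft(frames, axis=1)             # complex spectrum, time x freq
    mag = np.log1p(np.abs(spec))                   # log compression for display
    img = np.rot90(mag)                            # rotate 90 degrees anticlockwise
    img = (255 * (img - img.min()) / (np.ptp(img) + 1e-12)).astype(np.uint8)
    return img, spec                               # image for SIFT, spectrum for gain
```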
In a preferred embodiment, the voice may be preprocessed before performing step S10 to extract SIFT features of the original voice, and then subjected to spectral conversion. At this time, the SIFT-based RNN voice noise reduction and dereverberation method includes:
1) Performing FIR digital filtering on the original voice, and performing reverberation convolution operation;
2) Pre-emphasis is carried out on the voice signal after the reverberation convolution operation, so that the signal-to-noise ratio of the voice signal in a high-frequency part is improved;
3) Carrying out framing windowing on the pre-emphasized voice signal, and carrying out short-time Fourier transform to generate a conversion signal of the voice signal from a time domain to a frequency domain;
4) Rotating and mapping the converted signals to generate a frequency spectrum image;
5) Extracting SIFT features of original voice from the spectrum image;
6) And inputting the SIFT features into a preset RNN model, and carrying out signal reconstruction according to output data of the RNN model to generate target voice.
In this embodiment, FIR digital filtering and a reverberation convolution operation are first performed on the original voice, filtering out particular noise and useless information, avoiding interference with subsequent model learning, and improving the noise reduction effect and the generalization capability of the model; the voice signal then undergoes spectrum conversion, which localizes the SIFT features better, so that the extracted features are highly independent and do not interfere with each other, ultimately improving the recognition accuracy after voice processing.
In a specific embodiment, extracting SIFT features of the original speech includes:
detecting scale space extremum of the frequency spectrum image, and identifying potential interested areas with scale and direction invariance;
Extracting local extremum points under different scales in a potential region of interest through a Gaussian differential pyramid, and taking the local extremum points as key points of a SIFT algorithm;
performing direction assignment on the key points, calculating gradient directions of the key points, and distributing the key points to corresponding gradient direction histograms;
and constructing a 4 multiplied by 4 window by taking the key points as the center, calculating the gradient amplitude and the gradient direction of each pixel point in the window, and determining the 128-dimensional SIFT feature vector of the key points.
Referring to fig. 3, fig. 3 provides a flowchart for extracting SIFT features of the original speech. As shown in fig. 3, the extraction process comprises: first, scale-space extremum detection, using a Gaussian differential pyramid to identify potential regions of interest with scale and orientation invariance; then key-point localization, determining position and scale at each candidate location by fitting a fine model, with local extremum points at different scales extracted through the Gaussian difference serving as the key points of the SIFT algorithm; then orientation assignment, computing the gradient direction of each key point and assigning it to the gradient direction histogram in which the key point lies; and finally the key-point descriptor, measuring the local gradients at the selected scale in the neighborhood around each key point and transforming them into a representation: a 4×4 window is constructed centered on the key point, the gradient magnitude and direction of each pixel in the window are computed and accumulated into a statistical histogram, finally giving the 128-dimensional SIFT feature vector of the key point.
According to the embodiment, local features in different scales and directions in the image can be automatically detected by extracting SIFT features, the features are unchanged for rotation, scaling and other transformations of the image, and the traditional image feature extraction method can effectively extract feature information in the image on the premise of ensuring a certain speed, so that noise reduction and reverberation removal effects of voice are improved.
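Since SIFT treats the spectrum image like any grayscale image, the extraction can be sketched with OpenCV's stock implementation (cv2.SIFT_create, available in opencv-python since 4.4); this stands in for, and is not necessarily identical to, the patent's own extraction:

```python
import cv2  # opencv-python; SIFT has been in the main module since OpenCV 4.4

def extract_sift(spec_img):
    """Scale-space extremum detection, key-point localization, orientation
    assignment and 128-D descriptors on the spectrum image."""
    sift = cv2.SIFT_create()
    keypoints, descriptors = sift.detectAndCompute(spec_img, None)
    return keypoints, descriptors                  # descriptors: n_keypoints x 128
```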
In one embodiment, the preset RNN model includes a frequency-time modulation spectrum sensing region extraction module;
the frequency-time modulation spectrum sensing region extraction module comprises a bidirectional LSTM network and a unidirectional LSTM network;
the bidirectional LSTM network is used for receiving input data of an input layer, performing first characteristic learning on the input data and transmitting a learning result to the LSTM unit;
and the unidirectional LSTM network performs secondary feature learning on the learning result, and extracts the frequency-time interested region.
Further, the bidirectional LSTM network includes a bidirectional LSTM layer, a first fully connected layer, and a layer normalization layer;
the bidirectional LSTM layer comprises a forward LSTM layer and a reverse LSTM layer, which are respectively used for performing feature learning from different directions of an input data sequence of the input layer;
the first full connection layer is used for mapping nonlinear characteristics output by the bidirectional LSTM layer to a new characteristic space;
The layer normalization layer is used for normalizing the output data of the first full-connection layer.
The unidirectional LSTM network comprises a first LSTM layer, a second full connection layer and a ReLU activation layer;
the first LSTM layer is used for carrying out feature learning on the output data of the layer normalization layer;
the second full connection layer is used for mapping the nonlinear characteristics output by the first LSTM layer to a new characteristic space;
the ReLU activation layer is used to introduce nonlinear features into the output data of the second fully connected layer.
In deep learning, speech signals are processed as sequence signals, usually with an RNN, of which LSTM is one of the more effective variants. LSTM (Long Short-Term Memory) is a recurrent neural network structure mainly used for processing and predicting time-series data, and it addresses long-term dependence better than a plain recurrent neural network. In LSTM, each cell contains three gates: an input gate, an output gate, and a forget gate. These gates control the flow of information between the current input, the output of the previous time step, and the previous state. The advantage of LSTM is that it handles long-term dependencies in time-series data efficiently: whereas a traditional recurrent neural network easily suffers from vanishing or exploding gradients, the LSTM gating mechanism avoids these problems, so the model learns correlations in time-series data better, improving accuracy and stability. LSTM is therefore widely used in speech enhancement, machine translation, video analysis, and related fields.
In this embodiment, a recurrent neural network model is constructed, the extracted feature data is used as the input of the model, and a bidirectional LSTM layer is used as the main structure in the construction of the model. By bi-directional LSTM, the context information is better captured when processing sequence data, forward LSTM processes data from the beginning of the sequence, and reverse LSTM processes data from the end of the sequence. The outputs of the two LSTM's are connected together at each time step and then sent to the next level network for processing. Since the bi-directional LSTM can consider both past and future information, it performs well in many sequential tasks.
Referring to fig. 4, a structural block diagram of an RNN model of a SIFT-based RNN speech noise reduction and dereverberation method.
Specifically, the frequency-time modulation spectrum sensing region extraction module consists of a bidirectional LSTM and a unidirectional LSTM. The output end of the input layer is electrically connected with the input end of the bidirectional LSTM layer of the bidirectional LSTM network; the bidirectional LSTM layer improves the utilization of the input data and the generalization capability of the model. The output end of the bidirectional LSTM layer is electrically connected with the input end of the first fully-connected layer, which maps the nonlinear output to a new feature space. The output end of the first fully-connected layer is electrically connected with the input end of the layer normalization layer; layer normalization is used in the RNN to improve the convergence speed and learning capacity of the network. The output results of the layer normalization layer then pass through an addition layer before being electrically connected with the input end of the LSTM layer of the unidirectional LSTM network.
Further, in the unidirectional LSTM network, the first LSTM layer further learns the correlations and characteristics of the features; its output end is electrically connected to the input end of the second fully-connected layer, which performs feature mapping like the first fully-connected layer; the output end of the second fully-connected layer is electrically connected to the input end of the ReLU activation layer, which computes quickly, effectively avoids vanishing gradients, and introduces sparsity.
According to this embodiment, constructing an RNN model with an LSTM network makes effective use of the sequence data and captures longer dependencies. The model structure, which combines the bidirectional LSTM and the unidirectional LSTM with a residual structure, improves accuracy when learning the data, is more stable to noisy or incomplete input, and considers information both before and after the current moment, which benefits tasks needing global information; the training process also converges faster, improving the generalization capability of the model.
In one possible implementation, the preset RNN model further includes a narrowband filter network module;
The narrow-band filtering network module comprises a second LSTM layer, a third full-connection layer and an output layer;
the second LSTM layer is used for extracting mask features from the output data of the ReLU activation layer;
the third full connection layer is used for mapping nonlinear characteristics output by the second LSTM layer to a new characteristic space;
the output layer is connected with the output end of the third full connection layer and is used for outputting mask data.
In this embodiment, the preset RNN model includes a narrowband filter network module in addition to the frequency-time modulation spectrum sensing region extraction module. The narrowband filter network module consists of the second LSTM layer, the third fully-connected layer, and the output layer, and extracts mask data from the receptive field. The output end of the ReLU activation layer is electrically connected with the input end of the second LSTM layer in the narrowband filter network module, which extracts mask features from the output data of the ReLU activation layer; the third fully-connected layer maps the nonlinear features output by the second LSTM layer to a new feature space; and the output layer, connected with the output end of the third fully-connected layer, outputs the mask data.
Further, the output data of the narrowband filter network module is used for gain calculation: after the forward calculation the model obtains mask data for the input signal and combines it with the input data. During training, the forward-calculation output of the RNN network feeds the backward calculation, which obtains a loss against the true value through a loss function; the weights of the whole network are updated by stochastic gradient descent, and the final result is output after the iterative training of the model is completed.
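A PyTorch sketch of the described model follows; the layer widths, the sigmoid output, and the exact placement of the addition (residual) connection are assumptions, since the patent specifies the layer order but not these details:

```python
import torch
import torch.nn as nn

class SiftRnnModel(nn.Module):
    """Sketch of the described network: a bidirectional-LSTM block, a
    unidirectional-LSTM block, and a narrow-band filtering block.
    Widths and the residual placement are assumptions, not patent values."""

    def __init__(self, in_dim=128, hidden=256, mask_dim=257):
        super().__init__()                         # mask_dim = 512 // 2 + 1 rfft bins
        # Frequency-time modulation spectrum sensing region extraction module.
        self.bilstm = nn.LSTM(in_dim, hidden, batch_first=True, bidirectional=True)
        self.fc1 = nn.Linear(2 * hidden, hidden)   # map BiLSTM output to a new space
        self.norm = nn.LayerNorm(hidden)
        self.lstm1 = nn.LSTM(hidden, hidden, batch_first=True)
        self.fc2 = nn.Linear(hidden, hidden)
        self.relu = nn.ReLU()
        # Narrow-band filtering network module.
        self.lstm2 = nn.LSTM(hidden, hidden, batch_first=True)
        self.fc3 = nn.Linear(hidden, mask_dim)     # output layer: mask data

    def forward(self, x):                          # x: (batch, time, in_dim)
        h, _ = self.bilstm(x)                      # first feature learning
        h = self.norm(self.fc1(h))
        r, _ = self.lstm1(h)                       # secondary feature learning
        r = self.relu(self.fc2(r)) + h             # addition (residual) connection
        m, _ = self.lstm2(r)                       # mask feature extraction
        return torch.sigmoid(self.fc3(m))          # time-frequency mask in [0, 1]
```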
In a specific embodiment, training the recurrent neural network to obtain the preset RNN model includes:
1) The input feature data is normalized and then fed to a bidirectional LSTM layer. The bidirectional LSTM consists of two LSTM layers: a forward LSTM, which processes data from the beginning of the sequence, and a reverse LSTM, which processes data from the end of the sequence. The bidirectional LSTM is then connected to a fully-connected layer and a ReLU activation layer, forming the frequency-time modulation spectrum sensing region module. A narrowband filter network consisting of unidirectional LSTM layers follows, for noise and reverberation cancellation.
2) The recurrent neural network loss function used in back propagation is the signal-to-distortion ratio (SDR), which, given the characteristics of the input data and the behavior of the LSTM network, performs better than common alternatives such as the logarithmic mean square error loss. The result of the forward calculation is fed to the loss function to obtain a loss value, and the backward calculation minimizes this loss;
3) Gradient update of the recurrent neural network: calculate the partial derivatives of the loss function with respect to the output vector of each time step of the output layer. According to the chain rule, compute the partial derivatives of the error at time step t+1 with respect to the hidden state and cell state of time step t, and add them to the partial derivatives of time step t. For each time step t, after the results of step 2) are calculated, compute the partial derivatives with respect to the weights W of the output gate, input gate, forget gate, and candidate cell state of the current time step. Then accumulate the gradient of each time step by back-propagation along the time axis and sum the gradients of all time steps. The weights W and bias terms b of the model are updated using stochastic gradient descent (SGD).
4) A final model file is output after repeated iterative training or once the training conditions are met. The model input data processing module works as follows: the original data are divided into pure voice and noisy voice; human voice and noise of the same length are combined at various signal-to-noise ratios by cutting, splicing, and mixing; a fixed duration is set to generate a certain amount of voice data; and segments of this fixed duration are drawn at random for preprocessing when fed to the input processing module.
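A sketch of one training step under these choices, reusing the SiftRnnModel sketch above; the DataLoader and the decision to score SDR on magnitude spectra (rather than on the reconstructed waveform, as the patent describes) are simplifying assumptions:

```python
import torch

def neg_sdr_loss(est, ref, eps=1e-8):
    """Negative signal-to-distortion ratio; minimizing it maximizes SDR."""
    num = torch.sum(ref ** 2, dim=-1)
    den = torch.sum((ref - est) ** 2, dim=-1) + eps
    return -10.0 * torch.log10(num / den + eps).mean()

def train_epoch(model, loader, lr=1e-3):
    """One pass of forward calculation, SDR loss, and SGD weight update."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for feats, noisy_mag, clean_mag in loader:   # (batch, time, dim) tensors
        mask = model(feats)                      # forward calculation -> mask
        est = mask * noisy_mag                   # gain the noisy magnitude spectrum
        loss = neg_sdr_loss(est.flatten(1), clean_mag.flatten(1))
        opt.zero_grad()
        loss.backward()                          # BPTT along the time axis
        opt.step()                               # stochastic gradient descent update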
In a preferred embodiment, the establishment of RNN network models and back propagation:
calculating an output error: first calculate the output error delta(t) at the current time t, given by the difference between the actual output y(t) and the target value d(t), i.e., delta(t) = y(t) - d(t);
calculating the weight-update gradient of the LSTM output layer: from the error delta(t) and the hidden state h(t-1) of the previous time step, compute the update gradient of the output-layer weight matrix U, dU(t) = delta(t) * h(t-1); the update gradient of the bias term b is db(t) = delta(t).
Back-propagating the error to the previous time step: the error at time step t comprises the component propagated back through the hidden state, delta_h(t), and the component from the current output, delta_y(t). Here delta_h(t) is propagated through the output gate o(t), i.e., delta_h(t) = delta_h(t+1) * W_oh^T * f'(z(t)), where W_oh is the weight matrix of the output gate and f'(z(t)) is the derivative with respect to the current input state z(t).
Calculating the weight-update gradients inside the LSTM: from the error delta_h(t), the hidden state h(t-1) of the previous time step, and the input state x(t), compute the update gradients dV(t), dW(t) and dc(t) of the internal weight matrices V, W and the bias term c:
dV(t) = delta_h(t) * i(t) * f'(z(t)) * x(t)
dW(t) = delta_h(t) * i(t) * f'(z(t)) * h(t-1)
dc(t) = delta_h(t) * i(t) * f'(z(t))
back-propagating the error to the previous time step: the above steps are repeated, back propagating the error to an earlier time step. And finally, updating parameters according to the weight updating gradient of each time step.
Preferably, the loss function in the RNN network model mainly uses the signal-to-distortion ratio (SDR):

$$\mathcal{L}_{\mathrm{SDR}} = -10\log_{10}\frac{\sum_{n} s(n)^{2}}{\sum_{n}\left(s(n)-\hat{s}(n)\right)^{2}}$$

where the enhanced spectrum is obtained by applying the predicted mask to the input spectrum,

$$\hat{S}(t,f) = \left(M_{r}(t,f) + j\,M_{i}(t,f)\right)\,Y(t,f),$$

in which M(t, f) is the mask value obtained by the actual forward calculation, with real part M_r and imaginary part M_i; Y(t, f) is the actual input value; s is the true value; and s-hat is the result of the inverse Fourier transform after the model gain calculation, which after windowing and signal reconstruction yields the final noise-reduced voice.
Through the above training process, an RNN model with an LSTM network structure can be trained. Used for the noise reduction and dereverberation of voice, and combined with SIFT features, it can quickly reconstruct the mask data of the output voice, improving the effect and recognition precision of voice processing, with strong model generalization capability and fast operation speed.
In one possible implementation, performing signal reconstruction according to output data of the RNN model to generate a target voice includes:
calculating a signal gain based on output data of the RNN model;
and performing inverse Fourier transform, windowing and signal reconstruction on the signal gain to generate target voice.
According to this embodiment, the final gain value is obtained by a dot-product operation between the result of the recurrent neural network and the result of the initial fast Fourier transform; the final result is then obtained by inverse fast Fourier transform, windowing, and signal reconstruction. This completes the voice noise reduction and dereverberation, improves the recognition precision and generalization capability of the model, and increases the operation speed.
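A numpy sketch of this reconstruction, assuming the mask and the complex spectrum share the same frame and bin layout as in the spectrogram_image sketch above:

```python
import numpy as np

def reconstruct(mask, spec, frame_len=512, hop=256):
    """Apply the mask as a per-bin gain, inverse-FFT each frame, then
    window and overlap-add the frames back into a waveform."""
    gained = mask * spec                           # point-wise signal gain
    frames = np.fft.irfft(gained, n=frame_len, axis=1)
    win = np.hanning(frame_len)
    out = np.zeros(hop * (len(frames) - 1) + frame_len)
    for i, f in enumerate(frames):
        out[i * hop : i * hop + frame_len] += f * win  # windowed overlap-add
    return out
```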
Referring to fig. 5, an embodiment of the present invention further provides a SIFT-based RNN speech noise reduction and dereverberation system, including:
a feature extraction unit 100 for extracting SIFT features of an original voice;
the signal reconstruction unit 200 is configured to input the SIFT feature to a preset RNN model, perform signal reconstruction according to output data of the RNN model, and generate a target voice.
In some embodiments, the functions or units included in the system provided by the present embodiment may be used to perform the method described in the foregoing method embodiments, and specific implementation thereof may refer to the description of the foregoing method embodiments, which is not further described herein for brevity.
Referring to fig. 6, in one embodiment, another SIFT-based RNN speech noise reduction and dereverberation system is also provided, comprising:
the signal input module 10 is electrically connected with the input end of the voice preprocessing module, and is used for acquiring an original voice signal and transmitting the original voice signal to the voice preprocessing module 20;
the voice preprocessing module 20 is configured to digitally filter the pure voice signal and then perform reverberation convolution preprocessing, where an output end of the voice preprocessing module 20 is electrically connected with an input end of the spectrum conversion module 30;
the frequency spectrum conversion module 30 is configured to perform pre-emphasis and framing windowing on the voice, and then perform short-time fourier transform, rotation and mapping processing, where an output end of the frequency spectrum conversion module 30 is electrically connected to an input end of the SIFT feature extraction module 40;
the SIFT feature extraction module 40 is composed of scale-space extremum detection, key-point localization, orientation assignment, and key-point descriptors, and its output end is electrically connected with the input end of the recurrent neural network feature processing module 50;
the cyclic neural network processing module 50 is configured to extract a frequency-time modulation spectrum sensing region from the SIFT feature, and an output end of the cyclic neural network processing module 50 is electrically connected with an input end of the narrowband filter network module 60;
the narrowband filter network module 60 is configured to perform noise and reverberation cancellation on the region of interest and output the final mask value; its output end is electrically connected to the input end of the signal gain module 70;
the signal gain module 70 is configured to perform point multiplication calculation on the mask obtained by the network forward calculation and the real part and the imaginary part of the complex value of the original fourier transform to obtain a gain result, where an output end of the signal gain module 70 is electrically connected with an input end of the speech signal reconstruction module 80;
the voice signal reconstruction module 80 is configured to perform inverse fast fourier transform on the output result of the signal gain module, then perform windowing, splicing and recombination to form a voice signal, and output the final voice after noise reduction and reverberation removal.
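Chaining the sketches above gives an end-to-end picture of the pipeline in fig. 6; the helper names come from the earlier sketches rather than from the patent, and the alignment between SIFT keypoints and STFT frames is glossed over:

```python
import numpy as np
import torch

def denoise(raw_wav, model):
    """End-to-end sketch: preprocessing, spectrum conversion, SIFT features,
    RNN mask prediction, signal gain, and waveform reconstruction."""
    x = preprocess(raw_wav)                        # signal input + preprocessing
    img, spec = spectrogram_image(x)               # spectrum conversion module
    _, desc = extract_sift(img)                    # SIFT feature extraction module
    feats = torch.from_numpy(desc).float().unsqueeze(0)
    with torch.no_grad():
        mask = model(feats).squeeze(0).numpy()     # RNN + narrow-band modules
    n = min(len(mask), len(spec))                  # crude frame alignment
    return reconstruct(mask[:n], spec[:n])         # gain + inverse FFT + overlap-add
```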
In summary, the system provided in this embodiment can at least achieve the following effects:
1) The model structure constructed by combining the bidirectional LSTM and the unidirectional LSTM with the residual structure not only improves accuracy when learning the data, but is also more robust to noisy or incomplete input, and considers information both before and after the current moment, which benefits tasks needing global information; the training process converges faster, improving the generalization capability of the model.
2) Through SIFT feature extraction after frequency spectrum conversion, the SIFT feature extraction method can well locate features in a spatial domain or a frequency domain, can reduce the probability of extraction disorder caused by noise, shielding and picture content disorder, and can extract a large number of features at the same time, and the features are highly independent.
3) SIFT features are combined with the RNN network model. On one hand, the deep learning approach offers great advantages in memory usage and speech processing speed when the algorithm is ported and deployed, ensuring real-time performance with a low memory requirement; on the other hand, through the design of the training data and the deep learning algorithm, it shows good performance in noise reduction and dereverberation.
Referring to fig. 7, fig. 7 is a schematic diagram of a hardware structure of an electronic device according to an embodiment of the invention.
The electronic device 2 comprises a processor 21, a memory 22, an input device 23, and an output device 24. The processor 21, memory 22, input device 23, and output device 24 are coupled by connectors, including various interfaces, transmission lines or buses, which are not limited in this embodiment. It should be appreciated that, in the various embodiments of the invention, "coupled" means interconnected in a particular way, including directly connected, or indirectly connected through other devices, for example through various interfaces, transmission lines, buses, and the like.
The processor 21 may be one or more graphics processors (graphics processing unit, GPUs), which may be single-core GPUs or multi-core GPUs in the case where the processor 21 is a GPU. Alternatively, the processor 21 may be a processor group formed by a plurality of GPUs, and the plurality of processors are coupled to each other through one or more buses. Alternatively, the processor 21 may be another type of processor, and the embodiment of the present invention is not limited.
Memory 22 may be used to store computer program instructions as well as various types of computer program code for performing aspects of the present invention. Optionally, the memory 22 includes, but is not limited to, a random access memory (random access memory, RAM), a read-only memory (ROM), an erasable programmable read-only memory (erasable programmable read only memory, EPROM), or a portable read-only memory (compact disc read-only memory, CD-ROM), and the memory 22 is used to store relevant instructions and data, and the embodiment of the present invention is not limited to the data specifically stored in the memory 22.
The input means 23 are for inputting data and/or signals and the output means 24 are for outputting data and/or signals. The input device 23 and the output device 24 may be separate devices or may be an integral device.
It will be appreciated that fig. 7 shows only a simplified design of an electronic device. In practical applications, the electronic device may further include other necessary elements, including but not limited to any number of input devices, output devices, processors, memories, etc.; all devices capable of implementing the embodiments of the present invention are within the scope of protection of the present invention.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, and are not repeated herein. It will be further apparent to those skilled in the art that the descriptions of the various embodiments of the present invention are provided with emphasis, and that the same or similar parts may not be described in detail in different embodiments for convenience and brevity of description, and thus, parts not described in one embodiment or in detail may be referred to in description of other embodiments.
In the several embodiments provided by the present invention, it should be understood that the disclosed systems, devices, and methods may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of the units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit.
In the above embodiments, it may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When loaded and executed on a computer, produces a flow or function in accordance with embodiments of the present invention, in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, or other programmable apparatus. The computer instructions may be stored in or transmitted across a computer-readable storage medium. The computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by a wired (e.g., coaxial cable, fiber optic, digital subscriber line (digital subscriber line, DSL)), or wireless (e.g., infrared, wireless, microwave, etc.). The computer readable storage medium may be any available medium that can be accessed by a computer or a data storage device such as a server, data center, etc. that contains an integration of one or more available media. The usable medium may be a magnetic medium (e.g., a floppy disk, a hard disk, a magnetic tape), an optical medium (e.g., a digital versatile disk (digital versatile disc, DVD)), or a semiconductor medium (e.g., a Solid State Disk (SSD)), or the like.
Those of ordinary skill in the art will appreciate that all or part of the above-described method embodiments may be carried out by a computer program instructing the related hardware. The program may be stored in a computer-readable storage medium and, when executed, may include the flows of the above-described method embodiments. The aforementioned storage medium includes media that can store program code, such as a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.

Claims (10)

1. A SIFT-based RNN speech noise reduction and dereverberation method, the method comprising:
extracting SIFT features of an original voice; and
inputting the SIFT features into a preset RNN model, and performing signal reconstruction according to the output data of the RNN model to generate a target voice.
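For orientation, the following is a minimal Python sketch (not the patented implementation) of how the two claimed steps could be chained end to end. Every helper name here (spectral_transform, extract_sift_features, reconstruct_signal) is a hypothetical stand-in sketched after the corresponding dependent claims below, and the preset RNN model is assumed to emit a time-frequency mask.

```python
import torch

def denoise_and_dereverb(original_voice, rnn_model):
    # Claim 3 sketch: spectrogram image plus the complex STFT of the input.
    spectrum_image, Z = spectral_transform(original_voice)
    # Claim 4 sketch: 128-dimensional SIFT descriptors of the spectrogram.
    _, descriptors = extract_sift_features(spectrum_image)
    # Preset RNN model (claims 5-8 sketch) maps the descriptor sequence to a
    # time-frequency mask; this assumes one descriptor per STFT frame, an
    # alignment this sketch does not enforce.
    mask = rnn_model(torch.from_numpy(descriptors).unsqueeze(0))
    # Claim 9 sketch: gain application, inverse transform, reconstruction.
    return reconstruct_signal(Z, mask.squeeze(0).detach().numpy())
```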
2. The SIFT-based RNN speech noise reduction and dereverberation method of claim 1, further comprising, prior to the extracting SIFT features of the original speech:
performing FIR digital filtering on the original voice and then performing a reverberation convolution operation.
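A minimal sketch of this preprocessing, assuming SciPy's firwin/lfilter for the FIR stage and convolution with a room impulse response (RIR) for the reverberation; the tap count, cutoff, and toy RIR are illustrative values, not parameters disclosed in the patent.

```python
import numpy as np
from scipy.signal import firwin, lfilter, fftconvolve

def prepare_input_speech(voice, fs=16000, rir=None):
    # FIR digital filtering: a 101-tap low-pass filter with a 7 kHz cutoff
    # (illustrative values only).
    taps = firwin(numtaps=101, cutoff=7000.0, fs=fs)
    filtered = lfilter(taps, 1.0, voice)
    # Reverberation convolution: convolve with a measured RIR if available,
    # otherwise with a toy exponentially decaying impulse response.
    if rir is None:
        t = np.arange(int(0.3 * fs))
        rir = 0.05 * np.exp(-3.0 * t / fs) * np.random.randn(t.size)
        rir[0] = 1.0
    return fftconvolve(filtered, rir)[: filtered.size]
```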
3. The SIFT-based RNN speech noise reduction and dereverberation method of claim 1, further comprising, prior to the extracting SIFT features of the original speech, performing a spectral transformation on the original speech, comprising:
performing pre-emphasis on the voice signal of the original voice to improve the signal-to-noise ratio of its high-frequency part;
framing and windowing the pre-emphasized voice signal, then performing a short-time Fourier transform to convert the voice signal from the time domain to the frequency domain; and
rotating and mapping the transformed signal to generate a spectrum image.
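Sketched below with NumPy/SciPy: pre-emphasis as a first-order difference, framing/windowing and the STFT via scipy.signal.stft, and the rotate-and-map step approximated as a log-magnitude spectrum scaled to an 8-bit grayscale image (the patent does not spell out the exact mapping, so that last step is an assumption).

```python
import numpy as np
from scipy.signal import stft

def spectral_transform(voice, fs=16000, alpha=0.97):
    # Pre-emphasis y[n] = x[n] - alpha * x[n-1] boosts high frequencies.
    emphasized = np.append(voice[0], voice[1:] - alpha * voice[:-1])
    # Framing, Hann windowing, and short-time Fourier transform in one call.
    _, _, Z = stft(emphasized, fs=fs, window="hann", nperseg=512, noverlap=256)
    # Map the complex spectrum to a log-magnitude 8-bit "spectrum image"
    # (stand-in for the claim's rotate-and-map step).
    mag = 20.0 * np.log10(np.abs(Z) + 1e-10)
    img = ((mag - mag.min()) / (np.ptp(mag) + 1e-10) * 255.0).astype(np.uint8)
    return img, Z
```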
4. The SIFT-based RNN speech noise reduction and dereverberation method of claim 3, wherein the extracting SIFT features of the original speech comprises:
detecting scale-space extrema of the spectrum image and identifying potential regions of interest that are invariant to scale and orientation;
extracting local extreme points at different scales within the potential regions of interest through a difference-of-Gaussians pyramid, and taking these local extreme points as the key points of the SIFT algorithm;
assigning an orientation to each key point by computing its gradient directions and allocating them to the corresponding gradient orientation histogram; and
constructing a 4×4 window centered on each key point, computing the gradient magnitude and direction of every pixel in the window, and determining the key point's 128-dimensional SIFT feature vector.
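OpenCV's SIFT implementation performs these same four steps (difference-of-Gaussians scale-space extrema, orientation assignment, and 4×4-cell, 128-dimensional descriptors), so a sketch can simply delegate to it; applying it to a spectrum image is this editor's stand-in, not code from the patent.

```python
import cv2  # requires opencv-python >= 4.4, which ships SIFT in the main package

def extract_sift_features(spectrum_image):
    sift = cv2.SIFT_create()
    # Keypoints are DoG scale-space extrema with assigned orientations;
    # each descriptor is a 128-d vector (4x4 cells x 8 orientation bins).
    keypoints, descriptors = sift.detectAndCompute(spectrum_image, None)
    return keypoints, descriptors
```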
5. The SIFT-based RNN speech noise reduction and dereverberation method of claim 1, wherein the preset RNN model comprises a frequency-time modulation spectrum receptive region extraction module;
the frequency-time modulation spectrum receptive region extraction module comprises a bidirectional LSTM network and a unidirectional LSTM network;
the bidirectional LSTM network is configured to receive input data from an input layer, perform a first stage of feature learning on the input data, and pass the learning result to the unidirectional LSTM network; and
the unidirectional LSTM network performs a second stage of feature learning on the learning result and extracts a frequency-time region of interest.
6. The SIFT-based RNN speech noise reduction and dereverberation method of claim 5, wherein the bidirectional LSTM network comprises a bidirectional LSTM layer, a first fully connected layer, and a layer normalization layer;
the bidirectional LSTM layer comprises a forward LSTM layer and a reverse LSTM layer, which perform feature learning over the input layer's data sequence in opposite directions;
the first fully connected layer is configured to map the nonlinear features output by the bidirectional LSTM layer into a new feature space; and
the layer normalization layer is configured to normalize the output data of the first fully connected layer.
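One possible PyTorch reading of claim 6, with all layer sizes as illustrative assumptions (the patent does not disclose dimensions); nn.LSTM with bidirectional=True supplies the forward and reverse LSTM layers in a single module.

```python
import torch
from torch import nn

class BidirectionalLSTMNetwork(nn.Module):
    def __init__(self, input_dim=128, hidden_dim=256, feature_dim=256):
        super().__init__()
        # Forward and reverse LSTM layers in one bidirectional module.
        self.bilstm = nn.LSTM(input_dim, hidden_dim,
                              batch_first=True, bidirectional=True)
        # First fully connected layer: maps to a new feature space.
        self.fc1 = nn.Linear(2 * hidden_dim, feature_dim)
        # Layer normalization over the fully connected output.
        self.norm = nn.LayerNorm(feature_dim)

    def forward(self, x):            # x: (batch, time, input_dim)
        out, _ = self.bilstm(x)      # (batch, time, 2 * hidden_dim)
        return self.norm(self.fc1(out))
```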
7. The SIFT-based RNN speech noise reduction and dereverberation method of claim 6, wherein the unidirectional LSTM network comprises a first LSTM layer, a second fully connected layer, and a ReLU activation layer;
the first LSTM layer is configured to perform feature learning on the output data of the layer normalization layer;
the second fully connected layer is configured to map the nonlinear features output by the first LSTM layer into a new feature space; and
the ReLU activation layer is configured to introduce nonlinearity into the output data of the second fully connected layer.
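A matching sketch of claim 7, again with assumed sizes; it consumes the layer-normalized output of the bidirectional block sketched above.

```python
from torch import nn

class UnidirectionalLSTMNetwork(nn.Module):
    def __init__(self, input_dim=256, hidden_dim=256, feature_dim=256):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim, batch_first=True)  # first LSTM layer
        self.fc2 = nn.Linear(hidden_dim, feature_dim)  # second fully connected layer
        self.relu = nn.ReLU()                          # ReLU activation layer

    def forward(self, x):
        out, _ = self.lstm(x)
        return self.relu(self.fc2(out))
```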
8. The SIFT-based RNN speech noise reduction and dereverberation method of claim 7, wherein the preset RNN model further comprises a narrowband filter network module;
the narrowband filter network module comprises a second LSTM layer, a third fully connected layer, and an output layer;
the second LSTM layer is configured to extract mask features from the output data of the ReLU activation layer;
the third fully connected layer is configured to map the nonlinear features output by the second LSTM layer into a new feature space; and
the output layer is connected to the output of the third fully connected layer and is configured to output mask data.
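A sketch of claim 8's narrowband filter module, plus the claim 5 composition of all three blocks into one model. The sigmoid output layer is an assumption (the claim says only that the output layer emits mask data), and the 257 output bins match the 512-point STFT used in the earlier sketches.

```python
from torch import nn

class NarrowbandFilterNetwork(nn.Module):
    def __init__(self, input_dim=256, hidden_dim=256, num_bins=257):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim, batch_first=True)  # second LSTM layer
        self.fc3 = nn.Linear(hidden_dim, num_bins)  # third fully connected layer
        self.out = nn.Sigmoid()  # output layer emitting mask data in [0, 1] (assumed)

    def forward(self, x):
        out, _ = self.lstm(x)
        return self.out(self.fc3(out))

class PresetRNNModel(nn.Module):
    """Claim 5 composition: receptive-region extraction, then narrowband filtering."""
    def __init__(self):
        super().__init__()
        self.bidirectional = BidirectionalLSTMNetwork()
        self.unidirectional = UnidirectionalLSTMNetwork()
        self.narrowband = NarrowbandFilterNetwork()

    def forward(self, x):
        return self.narrowband(self.unidirectional(self.bidirectional(x)))
```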
9. The SIFT-based RNN speech noise reduction and dereverberation method of claim 1, wherein the performing signal reconstruction according to the output data of the RNN model to generate the target voice comprises:
calculating a signal gain based on the output data of the RNN model; and
performing an inverse Fourier transform, windowing, and signal reconstruction on the signal gain to generate the target voice.
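A closing sketch, assuming the RNN's mask is used directly as the per-bin signal gain: scipy.signal.istft performs the inverse FFT, synthesis windowing, and overlap-add reconstruction that the claim names.

```python
from scipy.signal import istft

def reconstruct_signal(Z, mask, fs=16000):
    # Interpret the RNN output as a (time, freq) gain and align it with the
    # (freq, time) complex STFT from the analysis stage.
    gain = mask.T
    enhanced = Z * gain
    # istft applies the inverse FFT, windowing, and overlap-add in one call.
    _, target_voice = istft(enhanced, fs=fs, window="hann",
                            nperseg=512, noverlap=256)
    return target_voice
```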
10. A SIFT-based RNN speech noise reduction and dereverberation system, the system comprising:
a feature extraction unit, configured to extract SIFT features of an original voice; and
a signal reconstruction unit, configured to input the SIFT features into a preset RNN model and perform signal reconstruction according to the output data of the RNN model to generate a target voice.
CN202410075344.2A 2024-01-18 2024-01-18 RNN voice noise reduction and dereverberation method and system based on SIFT Pending CN117594056A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410075344.2A CN117594056A (en) 2024-01-18 2024-01-18 RNN voice noise reduction and dereverberation method and system based on SIFT

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410075344.2A CN117594056A (en) 2024-01-18 2024-01-18 RNN voice noise reduction and dereverberation method and system based on SIFT

Publications (1)

Publication Number Publication Date
CN117594056A true CN117594056A (en) 2024-02-23

Family

ID=89913722

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410075344.2A Pending CN117594056A (en) 2024-01-18 2024-01-18 RNN voice noise reduction and dereverberation method and system based on SIFT

Country Status (1)

Country Link
CN (1) CN117594056A (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103729368A * 2012-10-13 2014-04-16 Fudan University Robust audio recognition method based on local spectrogram image descriptors
CN109284717A * 2018-09-25 2019-01-29 Central China Normal University Detection method and system for copy-paste tampering operations on digital audio
CN114242043A * 2022-01-25 2022-03-25 DingTalk (China) Information Technology Co., Ltd. Voice processing method, apparatus, storage medium and program product
CN114694670A * 2022-04-06 2022-07-01 South China University of Technology Multi-task network-based microphone array speech enhancement system and method
CN115273883A * 2022-09-27 2022-11-01 Chengdu Chipintelli Technology Co., Ltd. Convolutional recurrent neural network, and voice enhancement method and device
CN115331690A * 2022-08-17 2022-11-11 China Post Consumer Finance Co., Ltd. Method for eliminating noise of call voice in real time
CN115438698A * 2022-09-01 2022-12-06 SPIC Power Station Operation Technology (Beijing) Co., Ltd. Power equipment voice recognition method and system based on image processing
CN117059068A * 2022-05-07 2023-11-14 Tencent Technology (Shenzhen) Co., Ltd. Speech processing method, device, storage medium and computer equipment
CN117174105A * 2023-11-03 2023-12-05 Shenzhen Longxinwei Semiconductor Technology Co., Ltd. Speech noise reduction and dereverberation method based on improved deep convolutional network

Similar Documents

Publication Publication Date Title
CN110600017B (en) Training method of voice processing model, voice recognition method, system and device
WO2021043015A1 (en) Speech recognition method and apparatus, and neural network training method and apparatus
CN110491407B (en) Voice noise reduction method and device, electronic equipment and storage medium
CN109841226B (en) Single-channel real-time noise reduction method based on convolution recurrent neural network
CN110136744B (en) Audio fingerprint generation method, equipment and storage medium
WO2019227586A1 (en) Voice model training method, speaker recognition method, apparatus, device and medium
WO2021189642A1 (en) Method and device for signal processing, computer device, and storage medium
WO2022141868A1 (en) Method and apparatus for extracting speech features, terminal, and storage medium
CN113436643A (en) Method, device, equipment and storage medium for training and applying speech enhancement model
CN113488060B (en) Voiceprint recognition method and system based on variation information bottleneck
CN114974280A (en) Training method of audio noise reduction model, and audio noise reduction method and device
CN117174105A (en) Speech noise reduction and dereverberation method based on improved deep convolutional network
WO2019232833A1 (en) Speech differentiating method and device, computer device and storage medium
CN111357051A (en) Speech emotion recognition method, intelligent device and computer readable storage medium
CN114429151A (en) Magnetotelluric signal identification and reconstruction method and system based on depth residual error network
CN113763965A (en) Speaker identification method with multiple attention characteristics fused
CN113782044B (en) Voice enhancement method and device
CN114859269A (en) Cable fault diagnosis method based on voiceprint recognition technology
CN116913258B (en) Speech signal recognition method, device, electronic equipment and computer readable medium
CN112289338B (en) Signal processing method and device, computer equipment and readable storage medium
CN116403594B (en) Speech enhancement method and device based on noise update factor
Raj et al. Multilayered convolutional neural network-based auto-CODEC for audio signal denoising using mel-frequency cepstral coefficients
CN115938346B (en) Method, system, equipment and storage medium for evaluating sound level
CN117079005A (en) Optical cable fault monitoring method, system, device and readable storage medium
WO2022213825A1 (en) Neural network-based end-to-end speech enhancement method and apparatus

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination