CN117854536A - RNN noise reduction method and system based on multidimensional voice feature combination - Google Patents


Info

Publication number
CN117854536A
CN117854536A (application CN202410268153.8A)
Authority
CN
China
Prior art keywords: voice, voice data, data, gain, neural network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202410268153.8A
Other languages: Chinese (zh)
Other versions: CN117854536B (en)
Inventors: 韦伟才, 邓海蛟, 马健莹, 潘晖
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Longxinwei Semiconductor Technology Co ltd
Original Assignee
Shenzhen Longxinwei Semiconductor Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Longxinwei Semiconductor Technology Co ltd filed Critical Shenzhen Longxinwei Semiconductor Technology Co ltd
Priority to CN202410268153.8A priority Critical patent/CN117854536B/en
Publication of CN117854536A publication Critical patent/CN117854536A/en
Application granted granted Critical
Publication of CN117854536B publication Critical patent/CN117854536B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • Y: General tagging of new technological developments; general tagging of cross-sectional technologies spanning over several sections of the IPC; technical subjects covered by former USPC cross-reference art collections [XRACs] and digests
    • Y02: Technologies or applications for mitigation or adaptation against climate change
    • Y02T: Climate change mitigation technologies related to transportation
    • Y02T 10/00: Road transport of goods or passengers
    • Y02T 10/10: Internal combustion engine [ICE] based vehicles
    • Y02T 10/40: Engine management systems

Landscapes

  • Circuit For Audible Band Transducer (AREA)

Abstract

The application relates to an RNN noise reduction method and system based on multidimensional voice feature combination, in the technical field of voice noise reduction. The method comprises: receiving original voice data, preprocessing it through a digital filter, and outputting first voice data; performing a fast Fourier transform on the first voice data and acquiring multidimensional voice features via mel cepstrum filtering, discrete transforms, the autocorrelation function, channel energy normalization, band energy, and logarithmic transforms; combining the multidimensional voice features, inputting them into a preset recurrent neural network model, and extracting gain values for different frequency bands; performing interpolation expansion on the gain values and performing gain calculation with the expanded gain values and the first voice data to obtain a gain result; and performing an inverse fast Fourier transform and signal reconstruction on the gain result to obtain noise-reduced voice data.

Description

RNN noise reduction method and system based on multidimensional voice feature combination
Technical Field
The application relates to the technical field of voice noise reduction, in particular to an RNN noise reduction method and system based on multidimensional voice feature combination.
Background
Voice noise reduction technology uses digital signal processing algorithms to remove noise from audio that contains interfering noise, making the sound clearer and more natural. It is widely applied in communications, speech recognition, and other fields. The technology dates back to the 1970s, when imperfect telephone networks and low transmission quality made call quality very poor and noise interference severe. Research into voice noise reduction began as a way to improve call quality. The earliest methods were based on the principle of spectral subtraction in the frequency domain, but such methods distort the audio signal and the results are unsatisfactory.
As society develops, new and varied noise keeps filling our surroundings, and new digital-signal-processing methods for voice noise reduction continue to emerge. A typical one is wavelet-transform denoising, which can effectively separate noise from speech and thereby achieve a better noise reduction effect. There are also voice noise reduction technologies based on neural networks and deep learning; trained on large amounts of data, they adapt better to the noise reduction requirements of different scenes. How to guarantee information clarity effectively and in real time during communication thus remains an important research topic in digital signal processing.
In the related art, since the characteristics of a voice signal are generally hard to see in the time domain, the signal is usually observed after a fast Fourier transform converts it into an energy distribution over frequency; different energy distributions represent different voice characteristics, and different computations after the Fourier transform also yield different voice features. Typically, mel-frequency cepstral coefficients (MFCC) or the fast Fourier transform (FFT) spectrum alone is used as the feature input of a noise reduction method, but because noisy voice is complex, unstable, and volatile, noise reduction based on a single voice feature is often unsatisfactory.
Disclosure of Invention
To solve the problem that noise reduction based on a single voice feature is often unsatisfactory because noisy voice is complex, unstable, and volatile, the application provides an RNN noise reduction method and system based on multidimensional voice feature combination.
In a first aspect, the present application provides an RNN noise reduction method based on multidimensional voice feature combination, which adopts the following technical scheme, comprising the following steps:
receiving original voice data, preprocessing the original voice data through a digital filter, and outputting first voice data;
performing a fast Fourier transform on the first voice data, and acquiring multidimensional voice features via mel cepstrum filtering, discrete transforms, the autocorrelation function, channel energy normalization, band energy, and logarithmic transforms;
combining the multidimensional voice features, inputting them into a preset recurrent neural network model, and extracting gain values for different frequency bands;
performing interpolation expansion on the gain values, and performing gain calculation with the expanded gain values and the first voice data to obtain a gain result;
and performing an inverse fast Fourier transform and signal reconstruction on the gain result to obtain noise-reduced voice data.
Optionally, before preprocessing the original voice data through the digital filter and outputting the first voice data, the method further includes:
amplifying and arranging the original voice data through data enhancement, including performing data enhancement on the clean voice data and the noise data in the original voice data, and arranging the enhanced original voice data into a preset fixed length.
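Arranging enhanced data into a preset fixed length can be sketched as trim-or-pad; this is a minimal illustration, and the function name and zero-padding choice are assumptions, not taken from the patent:

```python
def arrange_fixed_length(samples, target_len):
    """Trim a clip longer than target_len; zero-pad a shorter one."""
    if len(samples) >= target_len:
        return samples[:target_len]
    return samples + [0.0] * (target_len - len(samples))

# Two augmented clips of unequal length, arranged to a fixed length of 4
clips = [[0.1, 0.2, 0.3, 0.4, 0.5], [0.7, 0.8]]
fixed = [arrange_fixed_length(c, 4) for c in clips]
```

Fixed-size clips let the later stages (framing, batching for network training) assume a uniform input shape.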
Optionally, performing the fast Fourier transform on the first voice data and acquiring the multidimensional voice features via mel cepstrum filtering, discrete transforms, the autocorrelation function, channel energy normalization, band energy, and logarithmic transforms includes:
performing pre-emphasis and windowing on the first voice data and then a short-time fast Fourier transform, so that the time-domain information of the first voice data is converted into frequency-domain information and one part of the multidimensional voice features is obtained;
obtaining mel-frequency cepstral coefficients by applying a logarithmic operation and a discrete cosine transform to the Fourier-transformed values;
and performing discrete transforms on the first voice data to acquire the other part of the multidimensional voice features.
Optionally, before combining the multidimensional voice features, inputting them into the preset recurrent neural network model, and extracting the gain values for different frequency bands, the method further includes:
splicing and combining the multiple voice features to serve as training data;
setting the number of neurons in the last recurrent layer of the recurrent neural network to match the number of voice features before combination, and using that layer as the output layer;
obtaining the result of each iteration after the training data passes through the forward computation of the recurrent neural network;
obtaining, from the input voice features serving as training data, the mask gains corresponding to the different voice features, so that noise suppression is performed at each iteration;
computing the loss function of the recurrent neural network in the backward pass, and obtaining the loss value after each iteration from the loss function and the iteration result;
calculating the weight-matrix gradient in each recurrent layer from the loss value and the output value of the recurrent layer in each state;
taking derivatives backward according to the weight-matrix gradients and calculating the weight-update values;
and updating the weights at each iteration through stochastic gradient descent and the weight-update values, obtaining the optimal recurrent-neural-network parameters after repeated iterative training.
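The per-iteration weight update by stochastic gradient descent reduces, per parameter, to the standard rule w ← w − lr·grad. A toy flat-vector sketch (the function name and learning rate are illustrative):

```python
def sgd_step(weights, grads, lr=0.01):
    """One stochastic-gradient-descent update: w <- w - lr * grad.
    weights and grads are flat parameter lists of equal length."""
    return [w - lr * g for w, g in zip(weights, grads)]

# Two parameters updated with learning rate 0.1
w = sgd_step([0.5, -0.2], [1.0, -1.0], lr=0.1)
```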
Optionally, computing the loss function of the recurrent neural network in the backward pass includes:
the loss function being of regression type, namely a logarithmic mean square error;
the loss function formula is as follows:
wherein (1)>Is an ideal ratio mask, ++>Is the actual ratio mask calculated by the recurrent neural network,/->1/2 is used to adjust the noise suppression level, and N is the number of frequency bands.
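As an illustration, a logarithmic mean-square-error loss over N band gains, with an exponent of 1/2 flattening the masks before comparison, can be sketched as follows. The exact arrangement of the log and exponent is an assumption reconstructed from the surrounding text, and all names are illustrative:

```python
import math

def log_mse_loss(ideal_gains, predicted_gains, gamma=0.5):
    """Logarithmic mean squared error between ideal and predicted band
    gains. gamma flattens the gains before the log (gamma = 1/2 per the
    text, trading noise suppression against speech distortion)."""
    n = len(ideal_gains)
    eps = 1e-12  # guard against log(0)
    return sum(
        (math.log(g ** gamma + eps) - math.log(p ** gamma + eps)) ** 2
        for g, p in zip(ideal_gains, predicted_gains)
    ) / n

# Identical masks give (near) zero loss; mismatched masks give positive loss
zero_loss = log_mse_loss([1.0, 0.5, 0.25], [1.0, 0.5, 0.25])
pos_loss = log_mse_loss([1.0, 0.5], [0.9, 0.6])
```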
Optionally, before computing the loss function of the recurrent neural network in the backward pass, the method further includes:
the recurrent layers including a bidirectional long short-term memory (LSTM) network layer, the LSTM layer including a hidden-state layer and a cell-state layer;
the hidden state layer (h_t) and the cell state layer (c_t) are formulated as follows:
wherein, [ h_ (t-1), x_t]The hidden state h_ (t-1) of the previous time step t_1 and the input vector x_t of the current time step t are spliced; the "; w_i, W_f, W_o and W_c are entitlement matrices; b_i, b_f, b_o and b_c are bias matrices; i_t, f_t, o_t and g_t are input gate, forget gate, output gate and cell state, respectively.
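The hidden-state and cell-state updates can be traced with a minimal scalar sketch. The gate names follow the text; the single-unit layout and the particular weights are illustrative simplifications, not the patent's network:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_cell_step(x_t, h_prev, c_prev, W, b):
    """One scalar LSTM step following the gate equations in the text.
    Each gate sees the concatenation [h_{t-1}, x_t]; W and b hold the
    per-gate weights and biases keyed by 'i', 'f', 'o', 'c'."""
    z = [h_prev, x_t]                       # concatenation [h_{t-1}, x_t]
    def lin(key):
        return W[key][0] * z[0] + W[key][1] * z[1] + b[key]
    i_t = sigmoid(lin("i"))                 # input gate
    f_t = sigmoid(lin("f"))                 # forget gate
    o_t = sigmoid(lin("o"))                 # output gate
    g_t = math.tanh(lin("c"))               # candidate cell state
    c_t = f_t * c_prev + i_t * g_t          # new cell state
    h_t = o_t * math.tanh(c_t)              # new hidden state
    return h_t, c_t

# One step from a zero state with toy weights
W = {k: (0.5, 0.5) for k in "ifoc"}
b = {k: 0.0 for k in "ifoc"}
h, c = lstm_cell_step(1.0, 0.0, 0.0, W, b)
```

A bidirectional layer, as the text specifies, runs one such recurrence forward and a second one backward over the frame sequence and concatenates the two hidden states.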
In a second aspect, the application discloses an RNN noise reduction device based on multidimensional speech feature combination, which adopts the following technical scheme and includes:
the voice data processing module is used for receiving original voice data, preprocessing the original voice data through a digital filter and outputting first voice data;
the voice feature extraction module is used for performing a fast Fourier transform on the first voice data and acquiring multidimensional voice features via mel cepstrum filtering, discrete transforms, the autocorrelation function, channel energy normalization, band energy, and logarithmic transforms;
the recurrent network processing module is used for combining the multidimensional voice features, inputting them into a preset recurrent neural network model, and extracting gain values for different frequency bands;
the voice data gain module is used for carrying out interpolation expansion on the gain value, carrying out gain calculation on the expanded gain value and the first voice data, and obtaining a gain result;
and the voice signal reconstruction module is used for carrying out inverse fast Fourier transform and signal reconstruction on the gain result to obtain noise reduction voice data.
Optionally, the voice data processing module includes:
the voice preprocessing module is used for receiving original voice data, preprocessing the original voice data through a digital filter and obtaining first voice data;
the voice framing module is used for framing long voice so as to enable the digital filter to process the original voice data subjected to framing;
and the feature extraction module is used for extracting different preprocessed voice features.
In a third aspect, the present application further provides a control apparatus, the apparatus comprising:
comprising a memory and a processor, the memory storing a computer program that can be loaded by the processor to perform the RNN noise reduction method based on multidimensional voice feature combination described above.
In a fourth aspect, the present application also provides a computer readable storage medium storing a computer program capable of being loaded by a processor and performing an RNN noise reduction method based on a combination of multi-dimensional speech features as described above.
In summary, through the constructed recurrent network model structure, the application can extract effective gains from input features of multiple dimensions. Because of the characteristics of the recurrent network model, context information relating noise and clean voice within the various voice features can be extracted and formed into associated feature memories, so that effective per-band voice gains are generated quickly; this exploits the combined advantages of multiple voice features in expressing different dimensions of the voice signal, supports continuous, real-time enhancement of the voice signal, and facilitates real-time, effective suppression of noise in voice. By combining features of multiple dimensions, the distinction between clean voice and noise can be expressed from multiple angles, and the combination develops their overall advantages further: the features of the various dimensions can be obtained quickly without complex operations, meeting real-time processing requirements. Through the designed loss function and output of the recurrent network model, the input multidimensional voice features can be learned effectively and corresponding gain values generated for the different frequency bands; the diversity of the input training data and the advantages of the various features are thus exploited, improving the training effect, the generalization ability of the model, and the efficiency of multidimensional voice feature extraction.
Drawings
FIG. 1 is a flow chart of an RNN noise reduction method based on multidimensional voice feature combination.
FIG. 2 is a block diagram of the recurrent neural network model in an RNN noise reduction method based on multidimensional voice feature combination.
FIG. 3 is a block diagram of an RNN noise reduction device based on multi-dimensional speech feature combinations.
Fig. 4 is a block diagram of the structure of the speech processing module.
Reference numerals illustrate: 310. a voice data processing module; 311. a voice preprocessing module; 312. a voice framing module; 313. a feature extraction module; 320. a voice feature extraction module; 330. a cyclic network processing module; 340. a voice data gain module; 350. and the voice signal reconstruction module.
Detailed Description
Specific embodiments of the present invention will be described in more detail below with reference to the drawings. Various details of the embodiments of the present disclosure are included to facilitate understanding, and they should be considered merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
In the description of the present invention, unless otherwise indicated, the use of the terms "first," "second," etc. to describe various elements is not intended to limit the positional relationship, timing relationship, or importance relationship of the elements, but is merely used to distinguish one element from another. In some examples, a first element and a second element may refer to the same instance of the element, and in some cases, they may also refer to different instances based on the description of the context.
The terminology used in the description of the various examples in this disclosure is for the purpose of describing particular examples only and is not intended to be limiting. Unless the context clearly indicates otherwise, the elements may be one or more if the number of the elements is not specifically limited. Furthermore, the term "and/or" as used in this disclosure encompasses any and all possible combinations of the listed items.
The embodiment of the application discloses an RNN noise reduction method based on multidimensional voice feature combination, executed by a control system. Through the constructed recurrent network model structure, effective gains can be extracted from input features of multiple dimensions. Because of the characteristics of the recurrent network model, context information relating noise and clean voice within the various voice features can be extracted and formed into associated feature memories, and effective per-band voice gains are generated quickly; the combined advantages of multiple voice features in expressing different dimensions of the voice signal are exploited, continuous real-time enhancement of the voice signal is supported, and noise in voice is suppressed effectively in real time.
It should be understood that voice features are the important properties and characteristics of a voice signal that distinguish different voices; they are commonly used for speech enhancement, speech recognition, speaker recognition, and so on. Mel-frequency cepstral coefficients (MFCCs) are a widely used voice feature obtained by mapping the frequency-domain voice signal onto the mel scale and applying logarithmic and discrete cosine transforms, yielding a set of coefficients that reflect the voice characteristics. The pitch frequency is the vibration frequency of the vocal cords, the reciprocal of the opening-closing period of the vocal tract through which the air flow passes when a person produces a voiced sound. Per-channel energy normalization (PCEN) introduces a per-channel normalization mechanism on top of FFT or Fbank features to suppress the influence of input-amplitude changes on recognition results. The fast Fourier transform (FFT), one of the most commonly used acoustic features, maps an audio signal from the time domain to the frequency domain, transforming the signal from a time-ordered space to a frequency-based space.
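PCEN as described can be sketched in its standard form, with a per-channel smoother followed by adaptive gain control and root compression. The smoothing and compression constants below are common illustrative values, not taken from the patent:

```python
def pcen(frames, s=0.025, alpha=0.98, delta=2.0, r=0.5, eps=1e-6):
    """Per-channel energy normalization over a sequence of band energies.
    Standard PCEN form (an assumption; the patent only names the technique):
      M_t    = (1 - s) * M_{t-1} + s * E_t              (per-channel smoother)
      PCEN_t = (E_t / (eps + M_t)**alpha + delta)**r - delta**r
    `frames` is a list of per-frame lists of channel energies."""
    m = list(frames[0])                  # initialize smoother with first frame
    out = []
    for e in frames:
        m = [(1 - s) * m_c + s * e_c for m_c, e_c in zip(m, e)]
        out.append([(e_c / (eps + m_c) ** alpha + delta) ** r - delta ** r
                    for e_c, m_c in zip(e, m)])
    return out

# Two channels with a 4x energy difference: PCEN pulls them close together,
# illustrating its suppression of input-amplitude variation
norm = pcen([[1.0, 4.0], [1.0, 4.0], [1.0, 4.0]])
```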
Referring to fig. 1, the embodiment of the present application at least includes steps S10 to S50.
S10, receiving original voice data, preprocessing the original voice data through a digital filter, and outputting first voice data.
S20, performing a fast Fourier transform on the first voice data, and acquiring multidimensional voice features via mel cepstrum filtering, discrete transforms, the autocorrelation function, channel energy normalization, band energy, and logarithmic transforms;
S30, combining the multidimensional voice features, inputting them into a preset recurrent neural network model, and extracting gain values for different frequency bands;
S40, performing interpolation expansion on the gain values, and performing gain calculation with the expanded gain values and the first voice data to obtain a gain result;
S50, performing an inverse fast Fourier transform and signal reconstruction on the gain result to obtain the noise-reduced voice data.
The multidimensional voice features are the various voice features extracted by different methods; in this application they are extracted mainly through mel cepstral coefficients and discrete-transform methods.
Specifically, the original voice data is amplified and arranged through data enhancement: the clean voice data and the noise data in the original voice data are augmented, arranged into a fixed length, and recombined into noisy original voice data that serves as the original input of the whole processing flow.
To further improve the noise reduction effect, the voice undergoes a noise reduction pretreatment: digital filtering applies a first, coarse pass to the voice signal. On one hand, special noise, such as components of extremely high or low frequency, can be filtered out effectively, avoiding interference with the subsequent noise reduction processing; on the other hand, the pretreated voice signal is purer, which benefits the subsequent extraction of the multiple voice features.
For the original input voice signal, a fast Fourier transform converts the time-domain signal into a frequency-domain signal, and various computations then yield voice features of different dimensions, targeting the fundamental frequency, frequency characteristics, amplitude, phase, and so on of the voice signal. The extracted features are then combined. The combined voice features are input into the established recurrent neural network; exploiting its unique advantages in sequence-signal processing, repeated iterative training completes the model's weight updates, and the model computes the gain values for the different frequency bands.
The gain values from the recurrent neural network are then interpolated up to the same size as the original voice frame data and used in a gain calculation with that data to obtain a result; finally, an inverse fast Fourier transform and signal reconstruction yield the final result, completing voice noise reduction and producing the noise-reduced voice data.
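The interpolation expansion of a short per-band gain vector to the frame's bin resolution can be sketched with simple linear interpolation (an assumption; the patent does not fix the interpolation scheme, and names are illustrative):

```python
def interpolate_gains(band_gains, n_bins):
    """Linearly interpolate per-band gains up to n_bins frequency bins."""
    n = len(band_gains)
    if n == 1:
        return [band_gains[0]] * n_bins
    out = []
    for i in range(n_bins):
        pos = i * (n - 1) / (n_bins - 1)    # fractional band index in [0, n-1]
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        frac = pos - lo
        out.append(band_gains[lo] * (1 - frac) + band_gains[hi] * frac)
    return out

# Two band gains expanded to five bins: a smooth ramp from 0 to 1
expanded = interpolate_gains([0.0, 1.0], 5)
```

The expanded vector has one gain per FFT bin and can be multiplied elementwise with the frame's spectrum before the inverse transform.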
Further, the preprocessing of the voice signal includes performing a first pass of voice noise reduction on the cleaned original voice data through a digital filter to obtain the first voice data. The digital filter adopted here is a finite impulse response (FIR) filter; it filters out the signals that are not of interest, leaving the wanted signals, i.e., removing excessively high or abnormal frequency components.
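A minimal FIR filtering sketch (direct-form convolution) follows; the 3-tap moving-average coefficients are only an illustrative low-pass choice, not the patent's filter design:

```python
def fir_filter(x, taps):
    """Direct-form FIR filter: y[n] = sum_k taps[k] * x[n-k],
    with zero initial conditions."""
    y = []
    for n in range(len(x)):
        acc = 0.0
        for k, h in enumerate(taps):
            if n - k >= 0:
                acc += h * x[n - k]
        y.append(acc)
    return y

# A rapidly alternating (high-frequency) input is smoothed by the
# 3-tap moving average once the filter's memory fills
smoothed = fir_filter([0.0, 3.0, 0.0, 3.0, 0.0, 3.0], [1/3, 1/3, 1/3])
```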
Further, the multidimensional feature extraction includes: performing a short-time fast Fourier transform on the pre-emphasized and windowed first voice data after preprocessing, converting time-domain information into frequency-domain information. From the fast Fourier transform results corresponding to the different voice features, band energies are computed; logarithmic transformation and a DCT then yield the Bark-frequency cepstral coefficients (BFCC); a logarithmic operation and discrete cosine transform on the Fourier-transformed values yield the mel cepstral coefficients; the pitch frequency is obtained through the autocorrelation function (ACF) method; and channel energy normalization (PCEN) is obtained by introducing a per-channel normalization mechanism. These features are extracted, normalized, and spliced; that is, discrete transforms of the first voice data produce the other part of the multidimensional voice features.
Regarding the voice features: the mel cepstral coefficients give a feature vector of length 22, and first- and second-order differences of the MFCC give 11 further features; the first eight pitch correlations and the pitch period are taken as features from the autocorrelation function; the PCEN value of one frame of data has length 1; and the spectrum is divided into 24 bands following Opus, with the 24 band energies log-transformed and then DCT-transformed to give 24 BFCCs (Bark-frequency cepstral coefficients). The total feature length is 68.
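Pitch extraction via the autocorrelation function can be sketched as picking the lag that maximizes the autocorrelation and taking its reciprocal. This is a toy estimator with no windowing, normalization, or octave-error handling, and the 50 to 500 Hz search range is an assumption:

```python
import math

def autocorr_pitch(x, sample_rate, f_min=50, f_max=500):
    """Estimate pitch as sample_rate / (lag maximizing the ACF),
    searching lags corresponding to f_max down to f_min."""
    best_lag, best_r = None, float("-inf")
    for lag in range(sample_rate // f_max, sample_rate // f_min + 1):
        r = sum(x[t] * x[t - lag] for t in range(lag, len(x)))
        if r > best_r:
            best_lag, best_r = lag, r
    return sample_rate / best_lag

# A 200 Hz tone at 8 kHz has a 40-sample period, so the ACF peaks at lag 40
sr = 8000
tone = [math.sin(2 * math.pi * 200 * t / sr) for t in range(400)]
pitch = autocorr_pitch(tone, sr)
```

The per-lag correlation values themselves (before taking the maximum) are what the text uses as the "pitch correlation" features.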
In some embodiments, referring to fig. 2, and considering the setup of recurrent-neural-network training, the relevant processing steps are as follows: splice and combine the multiple voice features as training data; set the number of neurons in the last recurrent layer to match the number of voice features before combination, and use that layer as the output layer; obtain the result of each iteration after forward computation of the recurrent neural network on the training data; obtain, from the input voice features serving as training data, the mask gains corresponding to the different voice features, so that noise suppression is performed at each iteration; compute the loss function of the recurrent neural network in the backward pass, and obtain the loss value after each iteration from the loss function and the iteration result; calculate the weight-matrix gradient in each recurrent layer from the loss value and the output value of the recurrent layer in each state; take derivatives backward according to the gradients and calculate the weight-update values; update the weights at each iteration through stochastic gradient descent and the update values; and obtain the optimal recurrent-neural-network parameters after repeated iterative training.
Specifically, after the multidimensional voice features are combined, they are input as training data, and the number of neurons in the last layer of the recurrent neural network corresponds to the number of features before combination, serving as the output layer. The result of each iteration is obtained after forward computation of the recurrent neural network on the training data, and mask gains corresponding to the different features are obtained from the input features serving as training data, so that noise suppression is performed at each iteration. In the backward pass, the loss function of the recurrent neural network is computed: given the input data features and the role of the network, the designed loss function is of regression type, namely a logarithmic mean square error; a loss value is obtained after each iteration, and the state of the network is then updated by minimizing the loss. For the gradient update of the recurrent neural network, the weight-matrix gradients are computed from the loss function: the gradient of the weight matrix in each recurrent layer is calculated from the loss value and the layer's output value in each state, derivatives are taken backward according to the gradient, the weight-update value is calculated, and the weights are updated at each iteration by stochastic gradient descent. The optimal parameters of the recurrent neural network are obtained after repeated iterative training.
The voice data enhancement includes: transforming the clean voice and the noisy voice in the original data by methods such as pitch shifting, speed change, time shifting, time-frequency-domain masking, and time warping, then cleaning all the voice, and cutting and splicing it into a fixed size to obtain the amplified voice data.
Because voice is time-domain data, i.e., sequence data, clear contextual and temporal relations exist within it even though it can be converted into the frequency domain; it therefore matches the characteristics of a recurrent neural network exactly. The specifics are as follows: constructing the voice noise reduction model.
Regarding the establishment of the recurrent-neural-network model: given the input samples and their temporal characteristics, long- and short-term memory of the samples and their associated contextual relations must be recorded, so a bidirectional long short-term memory network layer is used, which lets the model learn more features while meeting these requirements. During training, the initialized weights and gradients are updated to obtain the weights after each iteration. In the forward computation of the recurrent neural network, the state of each cell at the previous moment participates, through gates, in predicting and updating the next state, and the final weight values are obtained when iteration completes.
the loss function in the recurrent network uses a logarithmic mean square error, as shown below:

L = (1/N) Σ_{b=1}^{N} (M_b^γ − M̂_b^γ)²

wherein M_b is the ideal ratio mask, M̂_b is the actual ratio mask calculated by the recurrent neural network, the exponent γ = 1/2 is used to adjust the noise suppression level, and N is the number of frequency bands. The recurrent neural network layer comprises a bidirectional long short-term memory (LSTM) network layer, which helps extract more information; its hidden-layer state consists of two parts, the cell state layer (c_t) and the hidden state layer (h_t), which can be expressed as:

i_t = σ(W_i·[h_(t-1), x_t] + b_i)
f_t = σ(W_f·[h_(t-1), x_t] + b_f)
o_t = σ(W_o·[h_(t-1), x_t] + b_o)
g_t = tanh(W_c·[h_(t-1), x_t] + b_c)
c_t = f_t ⊙ c_(t-1) + i_t ⊙ g_t
h_t = o_t ⊙ tanh(c_t)

wherein [h_(t-1), x_t] denotes the concatenation of the hidden state h_(t-1) of the previous time step t-1 with the input vector x_t of the current time step t; W_i, W_f, W_o and W_c are weight matrices; b_i, b_f, b_o and b_c are bias matrices; i_t, f_t, o_t and g_t are the input gate, forget gate, output gate and candidate cell state, respectively; σ is the sigmoid function, and the activation function used inside the cell is tanh. The final purpose of the network is to minimize the error value, so that the loss L after training the network is as small as possible.
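A single forward step of the LSTM cell described above can be sketched in numpy as follows. This is an illustrative sketch, not the patented implementation: the function name, the dict layout of the weights, and the dimensions are assumptions; only the standard gate arithmetic (input, forget, output gates and candidate cell state) is shown.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step. W and b are dicts keyed 'i', 'f', 'o', 'c';
    each W[k] has shape (hidden, hidden + input_dim)."""
    hx = np.concatenate([h_prev, x_t])        # [h_(t-1), x_t]
    i_t = sigmoid(W['i'] @ hx + b['i'])       # input gate
    f_t = sigmoid(W['f'] @ hx + b['f'])       # forget gate
    o_t = sigmoid(W['o'] @ hx + b['o'])       # output gate
    g_t = np.tanh(W['c'] @ hx + b['c'])       # candidate cell state
    c_t = f_t * c_prev + i_t * g_t            # cell state update
    h_t = o_t * np.tanh(c_t)                  # hidden state output
    return h_t, c_t
```

Because o_t lies in (0, 1) and tanh(c_t) in (−1, 1), the hidden state is always bounded in magnitude by 1, which keeps the recurrent forward computation numerically stable.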
The input value X is a multidimensional matrix formed by extracting and combining the multidimensional speech features, and the final output is divided into 24 gain values by frequency band. The gain of each band is defined as g_b = sqrt(E_s(b) / E_x(b)), wherein E_s(b) is the band energy of the clean voice and E_x(b) is the band energy of the noisy voice.
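The band-gain definition and the loss above might be sketched as follows; this is a minimal illustration, in which the epsilon guard and the clipping of the gain to [0, 1] are assumptions added for numerical safety rather than part of the source text.

```python
import numpy as np

def band_gain(clean_band_energy, noisy_band_energy, eps=1e-12):
    """Ideal per-band gain: square root of the clean/noisy
    band-energy ratio, clipped to [0, 1]."""
    g = np.sqrt(clean_band_energy / (noisy_band_energy + eps))
    return np.clip(g, 0.0, 1.0)

def gain_loss(ideal, estimated, gamma=0.5):
    """Loss over the N bands: mean((g^gamma - ghat^gamma)^2).
    gamma = 1/2 adjusts how aggressively noise is suppressed."""
    return np.mean((ideal ** gamma - estimated ** gamma) ** 2)
```

With gamma below 1 the loss weights errors on small gains (strongly attenuated bands) more heavily, which is the sense in which the exponent controls the noise suppression level.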
In the training process, the error loss is calculated through the loss function, and the parameters of the bidirectional long short-term memory network layer, namely the cell state layer and the hidden state layer, are iteratively updated by gradient descent.
The implementation principle of the RNN noise reduction method based on multidimensional voice feature combination in the embodiment of the application is as follows. Through the constructed recurrent network model structure, effective gains can be extracted from the various input feature dimensions; at the same time, owing to the characteristics of the recurrent model, context information relating noise and clean speech within the various voice features can be extracted and retained as feature memory, so that effective per-band gains are generated quickly, the combined advantages of the various voice features in expressing different dimensions of the speech signal are exploited, continuous real-time enhancement of the speech signal is achieved, and noise in speech is effectively suppressed in real time. By combining features of multiple dimensions, the distinctions between clean speech and noise can be expressed from multiple angles, and combining them further develops their overall advantages: the features of each dimension can be obtained quickly without complex computation, meeting real-time processing requirements. Through the designed loss function and output of the recurrent network model, the input multidimensional voice features can be learned effectively and corresponding gain values generated for the different frequency bands; the diversity of the training data and the advantages of the various features are thereby exploited, improving the training effect, the generalization ability of the model, and the efficiency of multidimensional voice feature extraction.
FIG. 1 is a flowchart of the RNN noise reduction method based on multidimensional voice feature combination in one embodiment. It should be understood that, although the steps in the flowchart of FIG. 1 are shown in the sequence indicated by the arrows, they are not necessarily performed in that sequence; unless explicitly stated herein, the order of execution is not strictly limited, and the steps may be executed in other orders. Moreover, at least some of the steps in FIG. 1 may comprise a plurality of sub-steps or stages that are not necessarily performed at the same moment but may be performed at different moments, and these sub-steps or stages are not necessarily executed sequentially but may be performed in turn or alternately with at least some of the sub-steps or stages of other steps.
Based on the same technical concept, referring to fig. 3, the embodiment of the application further provides an RNN noise reduction device based on multidimensional speech feature combination, and the device adopts the following technical scheme that:
the voice data processing module 310 is configured to receive original voice data, pre-process the original voice data through a digital filter, and output first voice data;
the voice feature extraction module 320 is configured to perform a fast fourier transform on the first voice data, and obtain multi-dimensional voice features according to mel cepstrum filtering, discrete transformation, autocorrelation function, channel energy normalization, band energy, logarithmic transformation, and other manners;
the cyclic network processing module 330 is configured to combine the multidimensional speech features and input the combined multidimensional speech features into a preset cyclic neural network model to extract gain values for different frequency bands;
the voice data gain module 340 is configured to perform interpolation expansion on the gain values and perform gain calculation on the expanded gain values and the first voice data to obtain a gain result;
the voice signal reconstruction module 350 is configured to perform inverse fast fourier transform and signal reconstruction on the gain result to obtain noise-reduced voice data.
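The interpolation step of the voice data gain module 340, which expands the small set of per-band gains to per-bin gains before they scale the spectrum, might be sketched as follows. The function name is hypothetical, and the linear band placement and FFT size of 512 are assumptions based on the description.

```python
import numpy as np

def apply_band_gains(frame_fft, band_gains, n_fft=512):
    """Expand per-band gains to per-bin gains by linear interpolation
    over the one-sided spectrum, then scale the frame spectrum."""
    n_bins = n_fft // 2 + 1                      # one-sided bin count
    band_pos = np.linspace(0, n_bins - 1, num=len(band_gains))
    per_bin = np.interp(np.arange(n_bins), band_pos, band_gains)
    return frame_fft[:n_bins] * per_bin          # gained spectrum
```

Predicting only 24 gains and interpolating keeps the network output small while still producing a smooth per-bin attenuation curve across the 257 one-sided FFT bins.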
Referring to fig. 4, in some embodiments, the voice data processing module 310 specifically includes:
the voice preprocessing module 311 is configured to receive original voice data, preprocess the original voice data through a digital filter, and obtain first voice data;
a voice framing module 312, configured to frame-process the long voice so that the digital filter processes the frame-divided original voice data;
the feature extraction module 313 is configured to extract different preprocessed voice features.
Specifically, during preprocessing, the output end of the voice framing module 312 is electrically connected to the voice preprocessing module 311. The voice framing module 312 is configured to frame long speech with a frame length of 512 and to perform preliminary processing on the input noisy signal. The output end of the voice preprocessing module 311 is electrically connected to the input end of the feature extraction module 313, which is configured to extract the different preprocessed voice features: the Bark-frequency cepstral coefficients, mel cepstral coefficients, pitch features and channel-energy-normalization features are respectively extracted from the input data and combined. The output end of the feature extraction module 313 is electrically connected to the input end of the cyclic network processing module 330, which is configured to extract the audio band gains. The output end of the cyclic network processing module 330 is electrically connected to the input end of the voice data gain module 340, which extends the gain values to 512 points by interpolation and calculates the corresponding gained values of the noisy speech. The output end of the voice data gain module 340 is electrically connected to the input end of the voice signal reconstruction module 350, which performs an inverse fast Fourier transform and reconstructs the audio into a speech signal by overlap-adding frames with 50% overlap to obtain the final noise-reduced signal.
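The final overlap-add reconstruction with 50% overlap can be illustrated as below; this is a sketch under stated assumptions (the function name is hypothetical, the frames are taken as already inverse-transformed and windowed, and hop equals half the frame length).

```python
import numpy as np

def overlap_add(frames, hop):
    """Reconstruct a time signal from overlapping frames by summing
    each frame into the output at multiples of the hop size.
    With hop = frame_len // 2 this gives 50% overlap."""
    frame_len = frames.shape[1]
    out = np.zeros(hop * (len(frames) - 1) + frame_len)
    for k, fr in enumerate(frames):
        out[k * hop:k * hop + frame_len] += fr
    return out
```

In practice each frame would be windowed (e.g. with a square-root Hann window at analysis and synthesis) so that the overlapping windows sum to a constant and the reconstruction is free of amplitude modulation.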
The embodiment of the application also discloses a control device.
Specifically, the control device includes a memory and a processor, and the memory stores a computer program capable of being loaded by the processor and executing the RNN noise reduction method based on the multi-dimensional speech feature combination.
The embodiment of the application also discloses a computer readable storage medium.
Specifically, the computer-readable storage medium stores a computer program capable of being loaded by a processor and executing the RNN noise reduction method based on multidimensional voice feature combination described above. The computer-readable storage medium includes, for example: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or various other media capable of storing program code.
The foregoing are all preferred embodiments of the present application and are not intended to limit the scope of the present application in any way; therefore, all equivalent changes made according to the structure, shape and principle of the present application shall be covered by the protection scope of the present application.

Claims (10)

1. An RNN noise reduction method based on multidimensional speech feature combination, comprising:
receiving original voice data, preprocessing the original voice data through a digital filter, and outputting first voice data;
performing fast Fourier transform on the first voice data, and acquiring multi-dimensional voice characteristics according to Mel cepstrum filtering, discrete transformation, autocorrelation function, channel energy normalization, frequency band energy and logarithmic transformation modes;
the multidimensional voice features are combined and then input into a preset cyclic neural network model, and gain values aiming at different frequency bands are extracted;
performing interpolation expansion on the gain value, and performing gain calculation on the expanded gain value and the first voice data to obtain a gain result;
and carrying out inverse fast Fourier transform and signal reconstruction on the gain result to obtain noise-reduced voice data.
2. The RNN noise reduction method based on the combination of the multi-dimensional speech features according to claim 1, further comprising, before the preprocessing of the original speech data by the digital filter, outputting the first speech data:
amplifying and arranging the original voice data in a data enhancement mode, wherein the amplifying and arranging the original voice data comprises the steps of carrying out data enhancement on pure voice data and noise data in the original voice data, and arranging the original voice data after the data enhancement into a preset fixed length.
3. The RNN denoising method based on combination of multi-dimensional speech features according to claim 2, wherein the performing fast fourier transform on the first speech data, obtaining multi-dimensional speech features according to mel-cepstrum filtering, discrete transformation, autocorrelation function, channel energy normalization, and frequency band energy and logarithmic transformation, comprises:
pre-emphasis and windowing are carried out on the first voice data, short-time fast Fourier transform is carried out, so that time domain information of the first voice data is converted into frequency domain information, and partial multidimensional voice features are obtained;
obtaining mel-frequency cepstral coefficients from the Fourier-transformed data by mel filtering and related operations;
and performing discrete transformation on the first voice data to acquire the multidimensional voice feature of the other part.
4. The RNN noise reduction method based on the combination of multi-dimensional speech features according to claim 1, wherein before the step of inputting the combined multi-dimensional speech features into a preset cyclic neural network model and extracting gain values for different frequency bands, the method further comprises:
splicing and combining a plurality of voice features to serve as training data;
setting the number of neurons in the last recurrent layer of the recurrent neural network to correspond to the number of voice features before combination, this layer serving as the output layer;
after the training data is subjected to forward computation of the cyclic neural network, obtaining an iteration result of each time;
obtaining, from the input voice features serving as training data, mask gains corresponding to the different voice features, so that noise suppression is performed at each iteration;
reversely calculating a loss function of the cyclic neural network, and obtaining a loss value after each iteration according to the loss function and the iteration result;
calculating a weight matrix gradient in each circulating layer according to the loss value and the output value of the circulating layer in each state;
performing reverse derivative according to the weight matrix gradient, and calculating a weight update value;
and carrying out weight updating during each iteration through a random gradient descent method and the weight updating value, and obtaining the optimal cyclic neural network parameters after repeated iterative training.
5. The RNN denoising method based on multi-dimensional speech feature combination of claim 4, wherein the inverse computing the loss function of the recurrent neural network comprises:
the loss function is a regression-type function, namely a logarithmic mean square error, formulated as follows:

L = (1/N) Σ_{b=1}^{N} (M_b^γ − M̂_b^γ)²

wherein M_b is the ideal ratio mask, M̂_b is the actual ratio mask calculated by the recurrent neural network, the exponent γ = 1/2 is used to adjust the noise suppression level, and N is the number of frequency bands.
6. The RNN denoising method based on multi-dimensional speech feature combination of claim 5, further comprising, prior to the back computing the loss function of the recurrent neural network:
the cyclic neural network layer comprises a two-way long-period and short-period memory network layer, and the long-period and short-period memory network layer comprises: a hidden status layer and a cell status layer;
the hidden state layer (h_t) and the cell state layer (c_t) are formulated as follows:

i_t = σ(W_i·[h_(t-1), x_t] + b_i)
f_t = σ(W_f·[h_(t-1), x_t] + b_f)
o_t = σ(W_o·[h_(t-1), x_t] + b_o)
g_t = tanh(W_c·[h_(t-1), x_t] + b_c)
c_t = f_t ⊙ c_(t-1) + i_t ⊙ g_t
h_t = o_t ⊙ tanh(c_t)

wherein [h_(t-1), x_t] represents the concatenation of the hidden state h_(t-1) of the previous time step t-1 with the input vector x_t of the current time step t; W_i, W_f, W_o and W_c are weight matrices; b_i, b_f, b_o and b_c are bias matrices; i_t, f_t, o_t and g_t are the input gate, forget gate, output gate and candidate cell state, respectively.
7. An RNN noise reduction apparatus based on a combination of multi-dimensional speech features, the apparatus comprising:
the voice data processing module is used for receiving original voice data, preprocessing the original voice data through a digital filter and outputting first voice data;
the voice feature extraction module is used for carrying out fast Fourier transform on the first voice data and acquiring multidimensional voice features according to the modes of mel cepstrum filtering, discrete transformation, autocorrelation function, channel energy normalization, frequency band energy, logarithmic transformation and the like;
the cyclic network processing module is used for inputting the multidimensional voice characteristics into a preset cyclic neural network model after being combined, and extracting gain values aiming at different frequency bands;
the voice data gain module is used for carrying out interpolation expansion on the gain value, carrying out gain calculation on the expanded gain value and the first voice data, and obtaining a gain result;
and the voice signal reconstruction module is used for carrying out inverse fast Fourier transform and signal reconstruction on the gain result to obtain noise reduction voice data.
8. The RNN noise reduction device based on the combination of multi-dimensional speech features of claim 7, wherein the speech data processing module comprises:
the voice preprocessing module is used for receiving original voice data, preprocessing the original voice data through a digital filter and obtaining first voice data;
the voice framing module is used for framing long voice so as to enable the digital filter to process the original voice data subjected to framing;
and the feature extraction module is used for extracting different preprocessed voice features.
9. A control apparatus, characterized in that the apparatus comprises:
comprising a memory and a processor, said memory having stored thereon a computer program capable of being loaded by said processor and performing the method according to any of claims 1 to 6.
10. A computer readable storage medium, characterized in that a computer program is stored which can be loaded by a processor and which performs the method according to any of claims 1 to 6.
CN202410268153.8A 2024-03-09 RNN noise reduction method and system based on multidimensional voice feature combination Active CN117854536B (en)


Publications (2)

Publication Number Publication Date
CN117854536A true CN117854536A (en) 2024-04-09
CN117854536B CN117854536B (en) 2024-06-07


Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109065067A (en) * 2018-08-16 2018-12-21 福建星网智慧科技股份有限公司 A kind of conference terminal voice de-noising method based on neural network model
US20190385608A1 (en) * 2019-08-12 2019-12-19 Lg Electronics Inc. Intelligent voice recognizing method, apparatus, and intelligent computing device
CN111429932A (en) * 2020-06-10 2020-07-17 浙江远传信息技术股份有限公司 Voice noise reduction method, device, equipment and medium
US20210074266A1 (en) * 2019-09-06 2021-03-11 Evoco Labs Co., Ltd. Deep neural network based audio processing method, device and storage medium
CN113077806A (en) * 2021-03-23 2021-07-06 杭州朗和科技有限公司 Audio processing method and device, model training method and device, medium and equipment
CN114360572A (en) * 2022-01-20 2022-04-15 百果园技术(新加坡)有限公司 Voice denoising method and device, electronic equipment and storage medium
CN115223583A (en) * 2022-07-26 2022-10-21 宸芯科技有限公司 Voice enhancement method, device, equipment and medium
CN115762540A (en) * 2022-10-27 2023-03-07 深圳市龙芯威半导体科技有限公司 Multidimensional RNN voice noise reduction method, device, equipment and medium
US20230267947A1 (en) * 2020-07-31 2023-08-24 Dolby Laboratories Licensing Corporation Noise reduction using machine learning
CN116863944A (en) * 2023-07-10 2023-10-10 杭州电子科技大学 Voiceprint recognition method and system based on unsteady state audio enhancement and multi-scale attention
US20230386492A1 (en) * 2022-05-24 2023-11-30 Agora Lab, Inc. System and method for suppressing noise from audio signal
CN117174105A (en) * 2023-11-03 2023-12-05 深圳市龙芯威半导体科技有限公司 Speech noise reduction and dereverberation method based on improved deep convolutional network



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant