CN113782011B - Training method of frequency band gain model and voice noise reduction method for vehicle-mounted scene - Google Patents


Info

Publication number
CN113782011B
CN113782011B (application CN202110985541.4A)
Authority
CN
China
Prior art keywords
frequency band
noise
voice
layer
sru
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110985541.4A
Other languages
Chinese (zh)
Other versions
CN113782011A
Inventor
姜彦吉
张胜
宋湘钰
范佳亮
彭博
Current Assignee
Suzhou Automotive Research Institute of Tsinghua University
Original Assignee
Suzhou Automotive Research Institute of Tsinghua University
Priority date
Filing date
Publication date
Application filed by Suzhou Automotive Research Institute of Tsinghua University filed Critical Suzhou Automotive Research Institute of Tsinghua University
Priority to CN202110985541.4A
Publication of CN113782011A
Application granted
Publication of CN113782011B
Legal status: Active


Classifications

    • G10L 15/063 — Speech recognition; training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G06N 3/048 — Neural networks; activation functions
    • G06N 3/08 — Neural networks; learning methods
    • G10L 15/16 — Speech classification or search using artificial neural networks
    • G10L 15/20 — Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise
    • G10L 21/0208 — Speech enhancement; noise filtering
    • G10L 25/24 — Speech or voice analysis; the extracted parameters being the cepstrum
    • G10L 25/30 — Speech or voice analysis using neural networks

Abstract

The invention discloses a training method for a frequency band gain model and a voice noise reduction method for the vehicle-mounted scene. In the training method, a pure voice signal and a pure noise signal are each framed and then divided by frequency into pure voice bands and pure noise bands; the two signals are also mixed to form a noisy voice signal, which is framed and divided at the corresponding frequencies into noisy voice bands. From the energy of each pure noise, pure voice, and noisy voice band, the log spectra of the pure noise bands, the expected gain values, and the feature values are computed. The feature values are input into the frequency band gain model, which outputs a gain value for each noisy voice band, and the neural network is trained with the expected gain values and the log spectra as labels to optimize the parameters of the frequency band gain model. The training method and voice noise reduction method provided by the invention reduce the noise in noisy voice while preserving the robustness of the voice recognition system.

Description

Training method of frequency band gain model and voice noise reduction method for vehicle-mounted scene
Technical Field
The invention relates to the technical field of voice noise reduction, in particular to a training method of a frequency band gain model and a voice noise reduction method for a vehicle-mounted scene.
Background
As automobiles become more intelligent, the vehicle-mounted voice system has become standard equipment in the cabin, and it must meet two requirements: (1) clear voice call quality during driving; and (2) stable voice recognition performance during driving. Engine noise, wind noise, road noise, air-conditioning noise, and other interference during driving corrupt the voice signal, seriously degrading the performance of the voice system and the user experience. Noise in the vehicle-mounted scene has therefore become a problem that must be overcome.
Common approaches to voice noise reduction, and their characteristics, can be summarized as follows:
(1) Traditional signal-processing algorithms, such as spectral subtraction and Wiener filtering, assume that the voice follows a certain distribution and that the noise is stationary or slowly varying, and estimate the noise power spectrum or an ideal Wiener filter. These algorithms are simple and real-time, and achieve good separation performance when their assumptions hold; in real scene environments, however, the assumptions are rarely satisfied and the noise-reduction performance suffers.
(2) Decomposition-based methods, such as non-negative matrix factorization, assume that the spectrum of the sound signal has a low-rank structure, so that it can be represented with a small number of basis vectors and the basic spectral patterns in the sound signal can be mined.
(3) Rule-based algorithms model the speech enhancement problem in noisy scenes according to rules or mechanisms found in studies of auditory scene analysis. Being grounded in rules, they are highly interpretable, but because auditory studies generally use simple stimuli as input, the resulting rules do not necessarily apply to complex auditory environments: such models aim at reproducing laboratory results and are difficult to apply to practical problems. In addition, most auditory models depend heavily on grouping cues, especially the accuracy of pitch extraction, which is hard to guarantee in complex auditory environments, so the voice noise-reduction effect is not ideal.
(4) Noise-reduction algorithms based on deep learning models exploit the high computing power of modern hardware and the strong nonlinearity of deep neural networks to model voice, and can achieve good noise-reduction performance when driven by large volumes of data, but the models demand substantial computing resources and have poor real-time performance.
In addition, because voice noise reduction and voice recognition have different optimization targets, voice data processed by many noise-reduction algorithms is distorted, reducing the accuracy of the voice recognition system; the design of the noise-reduction algorithm must therefore be compatible with the design of the voice recognition model.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides a training method for a frequency band gain model and a voice noise reduction method for the vehicle-mounted scene. The technical scheme is as follows:
In one aspect, the invention provides a training method for a frequency band gain model. The model is a neural network with an SRU (Simple Recurrent Unit) architecture that applies noise-reduction gains to the signals of multiple frequency bands according to the feature values of the noisy voice signal.
the training method comprises the following steps:
s1, after a pure voice signal and a pure noise signal are respectively framed, carrying out frame-by-frame banding according to frequency to obtain n pure voice frequency bands and n pure noise frequency bands, and calculating the energy of each pure voice frequency band and each pure noise frequency band; mixing the pure voice signal and the pure noise signal to obtain a voice signal with noise, carrying out frame-dividing treatment on the voice signal with noise, carrying out band-dividing on the voice signal with noise frame by frame according to corresponding frequencies to obtain n voice frequency bands with noise, and calculating the energy of each voice frequency band with noise;
according to the energy of each pure noise frequency band, n corresponding logarithmic spectrums of the pure noise frequency bands are obtained;
according to the ratio of the energy of the pure voice frequency band to the energy of the noisy voice frequency band of the corresponding frequency band, n gain expected values are obtained;
obtaining corresponding logarithmic power spectrum according to energy of each noisy speech frequency band, and obtaining n MFCC coefficients through inverse discrete cosine transform to serve as n corresponding characteristic values of the noisy speech frequency band;
s2, inputting the n eigenvalues into the frequency band gain model to output gain values corresponding to each noisy speech frequency band, and performing neural network training by using the gain expected values and the log spectrum as labels to realize parameter optimization of the frequency band gain model.
Further, the band gain model includes a first SRU layer, a second SRU layer, a third SRU layer, a fourth SRU layer, a fifth SRU layer, a first fully connected layer, and a second fully connected layer:
the feature values are input to the first SRU layer, processed with a tanh activation function, and output;
the feature values are also input to the first fully connected layer, processed with a tanh activation function, and output to the second SRU layer, which processes them with a ReLU activation function and outputs to the third SRU layer;
the third SRU layer processes the output of the first fully connected layer and the output of the second SRU layer with a ReLU activation function;
the fourth SRU layer processes the output of the first SRU layer and the output of the third SRU layer with a ReLU activation function;
the fifth SRU layer processes the outputs of the first, third, and fourth SRU layers with a ReLU activation function;
the second fully connected layer processes the output of the fifth SRU layer with a sigmoid activation function to produce the gain values of the noisy voice bands.
Further, the SRU units in the second SRU layer can compute in parallel and update the hidden state through a forget gate.
Further, a voiced/unvoiced decision is made for each frame of the noisy voice signal to obtain its pitch period value, and the pitch period value is input into the frequency band gain model as an additional feature value for training.
Further, the pitch signal band corresponding to the noisy voice signal is obtained from the pitch period value, its energy is calculated, and a discrete cosine transform is applied together with the energies of the noisy voice bands to obtain related parameters, which are input into the frequency band gain model as additional feature values for training.
Further, the first and/or second derivative of the feature values is computed, and the result is input into the frequency band gain model as additional feature values for training.
Further, the pure voice signal, the pure noise signal, and the noisy voice signal are all divided into bands with a Mel filter bank.
On the other hand, the invention also provides a voice noise reduction method for the vehicle-mounted scene, comprising the following steps:
P1, frame the noisy voice, divide each frame by frequency into m noisy voice bands, extract the m corresponding feature values, and input them into the frequency band gain model to obtain the gain value of each noisy voice band;
P2, apply pitch filtering to each noisy voice band with a comb filter;
P3, calculate the energy of each filtered noisy voice band to obtain the ratio of the band energy before filtering to the energy after filtering;
P4, multiply the filtered band signal by this energy ratio and by the gain value of the corresponding noisy voice band to obtain the noise-reduced voice data.
Further, the comb filter is defined as:
x'[i] = x[i] + a × P[i]
where x[i] and x'[i] are the signals before and after filtering, a is the filter coefficient, and P is the frequency-domain data generated from the fundamental-frequency part of the voice signal.
Further, the filter coefficient a is calculated from corr, the correlation between the energy of the noisy voice signal and the pitch energy, and from the gain value g_b.
The technical scheme provided by the invention has the following beneficial effects:
(1) The noise reduction effect under the environment of low signal-to-noise ratio and unstable noise is improved;
(2) Noise is reduced on the voice with noise, and meanwhile, the robustness of a voice recognition system is guaranteed.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic diagram of a band gain model framework provided by an embodiment of the present invention;
FIG. 2 is a schematic diagram of a band gain model according to an embodiment of the present invention;
fig. 3 is a schematic structural diagram of an SRU unit in a band gain model according to an embodiment of the present invention;
fig. 4 is a flowchart of a method for voice noise reduction for a vehicle scene according to an embodiment of the present invention.
Detailed Description
For better understanding of the present invention, the objects, technical solutions and advantages thereof will be more clearly understood by those skilled in the art, and the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. It should be noted that the implementation manner not shown or described in the drawings is a manner known to those of ordinary skill in the art. Additionally, although examples of parameters including particular values may be provided herein, it should be appreciated that the parameters need not be exactly equal to the corresponding values, but may be approximated to the corresponding values within acceptable error margins or design constraints. It will be apparent that the described embodiments are merely some, but not all embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the present invention without making any inventive effort, shall fall within the scope of the present invention. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, in the description and claims, are intended to cover a non-exclusive inclusion, such that a process, method, apparatus, article, or device that comprises a list of steps or elements is not necessarily limited to those steps or elements that are expressly listed or inherent to such process, method, article, or device.
In one embodiment of the invention, a training method for a frequency band gain model is provided. The model is a neural network with an SRU architecture that applies noise-reduction gains to the signals of multiple frequency bands according to the feature values of the noisy voice signal.
The training method comprises the following steps:
s1, after a pure voice signal and a pure noise signal are respectively framed, carrying out frame-by-frame banding according to frequency to obtain n pure voice frequency bands and n pure noise frequency bands, and calculating the energy of each pure voice frequency band and each pure noise frequency band; mixing the pure voice signal and the pure noise signal to obtain a voice signal with noise, carrying out frame-dividing treatment on the voice signal with noise, carrying out band-dividing on the voice signal with noise frame by frame according to corresponding frequencies to obtain n voice frequency bands with noise, and calculating the energy of each voice frequency band with noise;
according to the energy of each pure noise frequency band, n corresponding logarithmic spectrums of the pure noise frequency bands are obtained;
according to the ratio of the energy of the pure voice frequency band to the energy of the noisy voice frequency band of the corresponding frequency band, n gain expected values are obtained;
obtaining corresponding logarithmic power spectrum according to energy of each noisy speech frequency band, and obtaining n MFCC coefficients through inverse discrete cosine transform to serve as n corresponding characteristic values of the noisy speech frequency band;
s2, inputting the n eigenvalues into the frequency band gain model to output gain values corresponding to each noisy speech frequency band, and performing neural network training by using the gain expected values and the log spectrum as labels to realize parameter optimization of the frequency band gain model.
The band gain model includes a first SRU layer, a second SRU layer, a third SRU layer, a fourth SRU layer, a fifth SRU layer, a first fully connected layer, and a second fully connected layer:
the feature values are input to the first SRU layer, processed with a tanh activation function, and output;
the feature values are also input to the first fully connected layer, processed with a tanh activation function, and output to the second SRU layer, which processes them with a ReLU activation function and outputs to the third SRU layer; the SRU units in the second SRU layer can compute in parallel and update the hidden state through a forget gate;
the third SRU layer processes the output of the first fully connected layer and the output of the second SRU layer with a ReLU activation function;
the fourth SRU layer processes the output of the first SRU layer and the output of the third SRU layer with a ReLU activation function;
the fifth SRU layer processes the outputs of the first, third, and fourth SRU layers with a ReLU activation function;
the second fully connected layer processes the output of the fifth SRU layer with a sigmoid activation function to produce the gain values of the noisy voice bands.
To enhance the training effect and shorten the training time, at least the following three methods can be used to enrich the feature values of the noisy voice:
Method one: make a voiced/unvoiced decision for each frame of the noisy voice signal to obtain its pitch period value, and input the pitch period value into the frequency band gain model as an additional feature value for training.
Method two: on the basis of method one, obtain the pitch signal band corresponding to the noisy voice signal from the pitch period value, calculate its energy, and apply a discrete cosine transform together with the energies of the noisy voice bands to obtain related parameters, which are input into the band gain model as additional feature values for training.
Method three: apply first- and/or second-derivative processing to the feature values, and input the result into the frequency band gain model as additional feature values for training.
It should be noted that all three methods can be used simultaneously; this greatly enlarges the feature set of the noisy voice, and feeding the enlarged set into the frequency band gain model for training yields better results.
In one embodiment of the invention, the following training is performed for the band gain model.
And step 1, collecting pure noise audio data and pure voice audio data in the driving process by using recording equipment.
The pure noise audio data collected during driving is noise recorded under a given vehicle speed, window state, and air-conditioner state. The vehicle speed is the speed corresponding to the background noise, recorded as an integer; for example, 80 denotes 80 km/h. The air-conditioner data records the fan gear under the background noise, with states [closed, half-open]. The window data records the opening state of the windows under the background noise, with states [closed, half-open, fully open]. The data covers noise under various driving conditions in the speed range 30-120 km/h with different opening states of the four windows and the air conditioner.
Step 2, up-sample or down-sample the collected audio data so that the pure voice audio data and the pure noise audio data share a single sampling rate in the range 8 kHz-48 kHz; these become the pure voice signal and the pure noise signal, respectively.
Step 3, frame the pure voice signal and the pure noise signal, with a frame length of 15-30 ms and a frame shift of 5-10 ms.
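A minimal sketch of the framing in step 3; the 20 ms frame length and 10 ms shift are example values within the stated 15-30 ms and 5-10 ms ranges, and `frame_signal` is a hypothetical helper, not from the patent:

```python
def frame_signal(samples, sample_rate=16000, frame_ms=20, shift_ms=10):
    """Split a signal into overlapping frames (step 3).

    Returns a list of frames; the last partial frame is dropped.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    shift = int(sample_rate * shift_ms / 1000)
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, shift)]
```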
Step 4, apply a window function to each frame of the audio data obtained in step 3; a flat-top window is used.
and 5, carrying out Fourier transform on the audio data obtained in the step 4, wherein the transformation formula is as follows:
wherein X is n (e ) Is a fourier transform for the time domain signal x (n), the subscript n denotes a time index, { ω (n) } is a real window sequence.
Step 6, divide the spectra obtained in step 5 into bands frame by frame according to frequency, to match the auditory characteristics of the human ear. The banding uses a Mel filter bank; 65 filters divide the spectrum into 66 frequency bands. The relation between Mel frequency and actual frequency is:

F_mel = 2595 × log10(1 + f / 700)

where F_mel is the perceived frequency in Mel and f is the actual frequency in Hz.
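Assuming the standard Mel-scale form F_mel = 2595·log10(1 + f/700) for the relation in step 6, the mapping and its inverse can be sketched as:

```python
import math

def hz_to_mel(f_hz):
    """Actual frequency (Hz) to perceived Mel frequency (step 6)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(f_mel):
    """Inverse mapping, used when placing Mel filter-bank edges."""
    return 700.0 * (10.0 ** (f_mel / 2595.0) - 1.0)
```

Filter-bank edges are typically chosen equally spaced on the Mel axis and mapped back to Hz with `mel_to_hz`.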
Step 7, calculate the energy of the banded signals obtained in step 6. The energy at frequency point k is:

E(k) = |X(k)|²

and the energy of band b is:

E(b) = Σ_k w_b(k) |X(k)|²

where w_b(k) is the amplitude of the band-b filter at frequency point k.
and step 8, pure noise energy is obtained from the pure noise signal in the calculation mode of step 7, 66 log spectrums are obtained by using the pure noise energy, and the calculation formula is as follows:
Ln[i]=log 10 (10 -2 +En[i])
wherein Ln [ i ] is a logarithmic spectrum and En [ i ] is pure noise energy.
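Steps 7-8 can be sketched directly from the formulas above; the helper names are hypothetical, and `weights` plays the role of the band filter amplitudes w_b(k):

```python
import math

def band_energy(spectrum, weights):
    """E(b) = sum_k w_b(k) |X(k)|^2 for one band (step 7)."""
    return sum(w * abs(x) ** 2 for w, x in zip(weights, spectrum))

def log_spectrum(band_energies):
    """Ln[i] = log10(1e-2 + En[i]) (step 8); the 1e-2 floor avoids log10(0)."""
    return [math.log10(1e-2 + e) for e in band_energies]
```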
Step 9, mix the pure voice signal and the pure noise signal, and down-sample the mixed audio data to save computation, obtaining the noisy voice signal; frame it correspondingly.
Step 10, filter the noisy voice with the Mel filter bank of step 6 to obtain the corresponding noisy voice bands, and accumulate the energy within each noisy voice band.
Step 11, take the logarithm of each filter output to obtain the log power spectrum of the corresponding band, then apply an inverse discrete cosine transform to obtain 66 MFCC coefficients, recorded as the feature values x1-x66, where x'(k) is the output power spectrum of the k-th filter and L is the number of MFCC coefficients.
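A sketch of step 11 under the common MFCC formulation — logarithm of each filter output followed by a type-II DCT. The exact DCT variant and the reuse of the 10⁻² log floor from step 8 are assumptions for the example:

```python
import math

def mfcc_from_band_energies(energies, n_coeffs=18):
    """Cepstral coefficients from Mel-band energies (steps 10-11).

    `energies` holds one energy per filter; a type-II DCT of the
    log energies yields the first `n_coeffs` MFCC-style coefficients.
    """
    log_e = [math.log10(1e-2 + e) for e in energies]
    n = len(log_e)
    return [sum(log_e[k] * math.cos(math.pi * i * (k + 0.5) / n)
                for k in range(n))
            for i in range(n_coeffs)]
```

A flat band spectrum concentrates all the cepstral energy in coefficient 0, which is a quick sanity check on the transform.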
Step 12, apply first- and second-order derivative processing to the first 18 coefficients of x1-x66, adding 36 feature values, recorded as x67-x102:

first-order: Δx[t] = x[t] − x[t−1]
second-order: Δ²x[t] = Δx[t] − Δx[t−1]
step 13, calculating the basis of the voice signal with noiseThe period of the sound, denoted as x, is taken as the characteristic value 103
The calculation steps comprise:
(1) Filter one frame of the noisy voice signal {x(n)} with a 900 Hz low-pass filter and discard the first 20 output values; record the result as {x'(n)}.
(2) Find the maximum amplitude of the first 100-120 samples and of the last 100-120 samples of {x'(n)}; multiply the smaller of the two by a factor of 0.68 to obtain the threshold level C_L.
(3) Performing center clipping on { x (n) } to obtain { y (n) } and three-level quantization to obtain { y' (n) };
(4) Calculate the cross-correlation R(k) of {y(n)} and {y'(n)}:

R(k) = Σ_n y(n) y'(n − k)

where k ranges over 20-150, corresponding to the pitch frequency range 60-500 Hz; R(0) corresponds to the short-time energy.
(5) Having obtained the cross-correlation, find the maximum value R_max among R(20)-R(150). If R_max < 0.25 R(0), the frame is considered unvoiced and the pitch period value P is 0; otherwise the pitch period P is the lag k at which R(k) attains its maximum R_max, i.e. P = argmax_{20≤k≤150} R(k).
Step 14, obtain the pitch signal band in the noisy voice according to step 13, and calculate the energy Ex of each noisy voice band and the energy Ep of the pitch signal band with the energy formula of step 7:

Ex = Σ_k w_b(k) |X(k)|²
Ep = Σ_k w_b(k) |P(k)|²

where w_b(k) is the amplitude of the band filter at frequency point k.
Apply a discrete cosine transform to the resulting energies to compute 12 values, x104-x115, as feature values. The discrete cosine transform is:

F(u) = c(u) Σ_{i=0}^{N−1} f(i) cos((i + 0.5)π u / N),  with c(0) = √(1/N) and c(u) = √(2/N) for u ≥ 1

where f(i) is the original signal, F(u) is the coefficient after the discrete cosine transform, N is the number of points of the original signal, and c(u) is the compensation coefficient that makes the discrete cosine transform matrix orthogonal.
Step 15, calculate the 66 expected gain values g[i] from the ratio of the banded pure voice energy to the noisy voice energy:

g[i] = √(E_y(b) / E_x(b))

where E_y(b) is the pure voice energy and E_x(b) is the noisy voice energy.
If g[i] > 1, set g[i] = 1; if the endpoint detection value of the pure voice is zero, if the silence flag obtained from the pure voice features is zero, or if g[i] = 0, set g[i] = −1.
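A sketch of the expected-gain target of step 15 with the clipping rules above. Taking the square root of the energy ratio is an assumption (the text says only "ratio"), as is the single `silence` flag standing in for the endpoint-detection and mute-mark conditions:

```python
import math

def expected_gain(e_clean, e_noisy, silence=False):
    """Per-band training target g[i] (step 15 and its clipping rules)."""
    if silence:
        return -1.0  # silent clean-speech band is labelled -1
    g = math.sqrt(e_clean / e_noisy) if e_noisy > 0 else 1.0
    return min(g, 1.0)  # gains above 1 are clipped to 1
```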
Step 16, input the 115 feature values x1-x115 into the band gain model. The model has 7 layers, 66 outputs, and 450 neurons; as shown in fig. 1 and fig. 2, the internal data flow is as follows:
step 16-1 firstly enters a full-connection layer, tanh activation function processing is used, weight constraint is set, loss function weight is 0.3-0.5, constraint is carried out on a main weight matrix by 0.45-0.5, constraint is carried out on a bias vector by 0.45-0.5, a regular term applied to the weight is 10-6-10-7, a regular term applied to the bias vector is 10-6-10-7, and 64 values are output in total.
Step 16-2: input the 64 values obtained in step 16-1 into the SRU layer, which computes in parallel over the input x_t, where W denotes a weight matrix:

f_t = σ(W_f x_t + b_f)
r_t = σ(W_r x_t + b_r)

Step 16-3: using the values calculated in step 16-2, update the hidden state c_t through the forget gate and finally obtain the output h_t, where g denotes the activation function:

c_t = f_t ⊙ c_{t−1} + (1 − f_t) ⊙ (W x_t)
h_t = g(c_t)
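The SRU recurrence of steps 16-2 and 16-3 can be sketched for a single scalar unit as follows. The c_t update is the standard SRU form from Lei et al.'s paper (an assumption, as the patent does not reproduce it), and the output shown uses the reset gate r_t in the original SRU's highway form, whereas step 16-3 above writes simply h_t = g(c_t):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sru_step(x_t, c_prev, w, w_f, b_f, w_r, b_r):
    """One scalar SRU time step (steps 16-2 / 16-3).

    f_t and r_t depend only on x_t, which is what lets the layer
    compute its gates in parallel across time; only the c_t update
    is sequential.
    """
    f_t = sigmoid(w_f * x_t + b_f)                   # forget gate
    r_t = sigmoid(w_r * x_t + b_r)                   # reset gate
    c_t = f_t * c_prev + (1.0 - f_t) * (w * x_t)     # hidden-state update
    h_t = r_t * math.tanh(c_t) + (1.0 - r_t) * x_t   # highway output, g = tanh
    return h_t, c_t
```

With the forget gate saturated near 1 the unit simply carries its state forward, which is the behaviour the forget-gate update is meant to provide.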
Step 16-4 referring to fig. 3, the SRU units in steps 16-2 and 16-3 output 36 values using the Relu activation function process.
Step 16-5 places the outputs of steps 16-1 and 16-4 into one SRU layer for processing using the Relu activation function for a total of 42 outputs.
Step 16-6 puts the 115 feature values input at the beginning into one SRU layer, and uses the tanh activation function to process, and total 86 outputs.
Step 16-7 places the output values of both layers of step 16-5 and step 16-6 into a new SRU layer, and outputs 48 values in total using the Relu activation function process.
Step 16-8 places the outputs of the three layers in step 16-5, step 16-6 and step 16-7 into a new SRU layer, processed using the Relu activation function, and outputs 108 values in total.
Step 16-9 takes the output of step 16-8 as the input to a fully connected layer and processes it with a sigmoid activation function, outputting 66 gain values in total.
The basic construction of the band gain model is completed through the steps.
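The wiring of steps 16-1 to 16-9 can be sanity-checked with a small NumPy sketch. Random dense projections stand in for the trained dense/SRU layers, so only the layer sizes and skip connections are illustrated, not real SRU recurrence; note the per-layer sizes sum to the stated 450 neurons:

```python
import numpy as np

rng = np.random.default_rng(42)
relu = lambda z: np.maximum(z, 0.0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def layer(x, n_out, act):
    """Stand-in for a dense/SRU layer: a fresh random projection + activation."""
    Wt = rng.standard_normal((len(x), n_out)) * 0.1
    return act(x @ Wt)

x = rng.standard_normal(115)                                 # features x1..x115
fc1  = layer(x, 64, np.tanh)                                 # step 16-1: dense, tanh
sru2 = layer(fc1, 36, relu)                                  # steps 16-2 to 16-4
sru3 = layer(np.concatenate([fc1, sru2]), 42, relu)          # step 16-5
sru1 = layer(x, 86, np.tanh)                                 # step 16-6
sru4 = layer(np.concatenate([sru3, sru1]), 48, relu)         # step 16-7
sru5 = layer(np.concatenate([sru3, sru1, sru4]), 108, relu)  # step 16-8
gains = layer(sru5, 66, sigmoid)                             # step 16-9: dense, sigmoid
assert sum(map(len, [fc1, sru2, sru3, sru1, sru4, sru5, gains])) == 450
```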
Step 17, inputting the extracted 115 feature values into the band gain model and training it with the 66 gain expected values and the 66 log spectra as labels. The whole model produces a 66-dimensional output whose components act on different frequency bands to accomplish the noise suppression task.
Specifically, the data are divided into batches of 30 to 40 samples, i.e. 30 to 40 samples are used for each training step; training is run for 100 to 120 epochs, and 10% to 20% of the training set is held out as a validation set. The gain data are obtained after training.
The optimizer and loss function used for training are set as follows: the Adam optimizer is used for gradient control, together with the cross entropy loss function:

L = -(1/n) Σ_x [ y ln a + (1 - y) ln(1 - a) ]

where x denotes a sample, y the actual label, a the predicted output, and n the total number of samples.
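A sketch of the loss, with one added assumption not stated in the patent: bands labeled -1 (the silent "don't care" targets) are masked out of the average:

```python
import numpy as np

def cross_entropy(y_true, y_pred, eps=1e-7):
    """Binary cross-entropy, L = -(1/n) * sum(y ln a + (1-y) ln(1-a)).

    Targets of -1 are masked out (an assumption: silent bands should not
    contribute to the loss).  Predictions are clipped away from 0/1 to
    keep the logarithms finite.
    """
    mask = y_true >= 0.0
    a = np.clip(y_pred[mask], eps, 1.0 - eps)
    y = y_true[mask]
    return -np.mean(y * np.log(a) + (1.0 - y) * np.log(1.0 - a))
```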
It should be noted that the steps of this embodiment are not strictly sequential, and may be flexibly exchanged or deleted according to the actual implementation, and the method still falls within the scope of protection of the embodiment without substantial modification.
In one embodiment of the present invention, a method for voice noise reduction applicable to a vehicle-mounted scene is provided, including the following steps:
p1, carrying out frame division processing on noisy speech, carrying out frame-by-frame banding according to frequency to obtain m noisy speech frequency bands, extracting m corresponding characteristic values of the m noisy speech frequency bands, and inputting the m corresponding characteristic values into the frequency band gain model to obtain gain values corresponding to the noisy speech frequency bands;
p2, adopting a comb filter to carry out pitch filtering on the voice frequency band with noise;
p3, calculating the energy of the noisy speech frequency band after filtering to obtain the energy ratio of the noisy speech frequency band before and after filtering;
and P4, multiplying the signal after filtering the noisy speech frequency band by the energy ratio, and multiplying the signal with the noisy speech frequency band by a gain value corresponding to the noisy speech frequency band to obtain noise-reduced speech data.
The setting formula of the comb filter is as follows:
x′[i]=x[i]+a×P
wherein x[i] and x′[i] respectively represent the signals before and after filtering, P is frequency domain data generated from the speech signal containing the fundamental frequency part, and a is the filter coefficient, calculated from cope, the correlation value between the noisy speech energy and the pitch energy, and from the gain value g_b (see step c below).
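A hedged sketch of the comb filter and its coefficient. Only three boundary cases for a are stated in the text, so the interpolated middle case in filter_coeff below is a guess, not the patent's rule:

```python
import numpy as np

def comb_filter(x, p, a):
    """Pitch comb filter x'[i] = x[i] + a * P over one band's frequency bins.

    x: noisy band spectrum, p: spectrum built from the pitch (fundamental
    frequency) component, a: filter coefficient in [0, 1].
    """
    return x + a * p

def filter_coeff(cope, g_b):
    """Filter coefficient from the pitch correlation cope and the band gain g_b."""
    if cope <= 0.0 or g_b >= 1.0:
        return 0.0          # no pitch, or no noise: leave the band unfiltered
    if cope >= g_b:
        return 1.0          # strong pitch correlation: full comb filtering
    return cope / g_b       # hypothetical interpolation between the stated cases
```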
The vehicle-mounted voice recognition system performs poorly under driving conditions with a low signal-to-noise ratio; one important reason is that many non-stationary noises arise in a real vehicle environment and degrade recognition. Traditional voice noise reduction techniques cannot perform voice noise reduction well in real scenes, whereas the method provided by the invention performs real-time noise reduction at different vehicle speeds and in different scenes, including with windows open and air conditioning on, effectively alleviating inaccurate voice recognition in various driving environments.
Referring to fig. 4, the method specifically aims at processing vehicle-mounted noise reduction, and includes the following steps:
step a, voice audio data in various driving scenes are collected by using recording equipment, the voice audio data are noisy voices under different driving conditions, the noisy voices are subjected to frame-by-frame processing, the noisy voices are banded according to frequency by frame to obtain noisy voice frequency bands, characteristic values are extracted from the noisy voice frequency bands and are input into a trained frequency band gain model to obtain corresponding gain values, and the process of extracting the characteristic values is included in the embodiment and is not repeated.
Step b, designing a low-pass filter with a cut-off frequency of 800 Hz and passing the noisy speech through it to remove high-frequency noise.
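One way to realize such a filter is a windowed-sinc FIR design; the 16 kHz sampling rate and the tap count below are assumptions for illustration, not taken from the patent:

```python
import numpy as np

def lowpass_fir(cutoff_hz=800.0, fs=16000.0, num_taps=101):
    """Windowed-sinc FIR low-pass filter (Hamming window)."""
    fc = cutoff_hz / fs                       # normalized cutoff (cycles/sample)
    n = np.arange(num_taps) - (num_taps - 1) / 2
    h = 2 * fc * np.sinc(2 * fc * n)          # ideal low-pass impulse response
    h *= np.hamming(num_taps)                 # taper to reduce ripple
    return h / h.sum()                        # unity gain at DC

def apply_lowpass(signal, h):
    return np.convolve(signal, h, mode="same")
```

A 200 Hz tone should pass nearly unchanged while a 4 kHz tone is strongly attenuated.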
And c, finishing pitch filtering according to the gain value obtained in the step a, wherein the pitch filtering adopts a comb filter, and the formula of the comb filter is as follows:
x′[i]=x[i]+a×P
where a is the filter coefficient and P is frequency domain data generated from the speech signal containing the fundamental frequency part. The filter re-weights the energy of the original speech.
The filter coefficients are calculated using the following formula:
wherein cope is the correlation value between the energy of the noisy speech band signal and the pitch energy, and g_b is the gain value. When cope ≥ g_b, a = 1; when g_b = 1, a = 0, since there is no noise; when cope = 0, a = 0, since there is no pitch.
The calculation formula of cope is as follows:

cope = Expe′ / √( Ex × Ep )

wherein

Expe′ = Σ_k w_b(k) X(k) P(k)

Ex = Σ_k w_b(k) |X(k)|²

Ep = Σ_k w_b(k) |P(k)|²

thereby obtaining the expansion formula of cope:

cope = Σ_k w_b(k) X(k) P(k) / √( (Σ_k w_b(k) |X(k)|²) × (Σ_k w_b(k) |P(k)|²) )

where Expe′ is the normalized cross-correlation term, X(k) is the signal spectrum, and P(k) is the frequency domain data generated from the speech signal containing the fundamental frequency part; w_b(k) is the amplitude of the frequency band at frequency point k, Ex is the energy data of the band, and Ep is the energy data of the corresponding pitch band.
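The expansion maps directly to NumPy; a small epsilon guards against division by zero for empty bands:

```python
import numpy as np

def pitch_correlation(X, P, w_b):
    """Normalized band correlation cope = Expe' / sqrt(Ex * Ep).

    X: noisy band spectrum, P: pitch-band spectrum, w_b: band window
    amplitudes at each frequency bin k.
    """
    expe = np.sum(w_b * X * P)             # cross term Expe'
    ex = np.sum(w_b * np.abs(X) ** 2)      # band energy Ex
    ep = np.sum(w_b * np.abs(P) ** 2)      # pitch band energy Ep
    return expe / np.sqrt(ex * ep + 1e-12)
```

When the band and its pitch component coincide, cope is 1; when they are in antiphase, cope is -1.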
Step d calculates the energy newE of the noisy speech band filtered in step c.
Step e, calculating the ratio of the noisy speech band energy E before filtering to the corresponding band energy newE after filtering; the calculation formula is as follows:

norm = √( E / newE )
Step f multiplies the filtered signal X[i] by the ratio norm obtained in step e to obtain the signal X′[i], so that the energy of each frequency band is the same as the energy of the original signal.
X′[i]=X[i]×norm
Step g, multiplying the signal X′[i] obtained in step f by the gain value corresponding to each frequency band to obtain the noise-reduced speech data X″[i].
X″[i]=X′[i]×g[i]
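Steps d to g for a single band can be sketched as follows; the square root in the energy-matching factor is an assumption, consistent with energy scaling as the square of amplitude:

```python
import numpy as np

def renormalize_and_gain(X_filt, band_energy_before, g):
    """Restore the band's pre-filter energy, then apply the band gain.

    X_filt: the band's spectrum after comb filtering,
    band_energy_before: the band energy E before filtering,
    g: the gain value g[i] predicted for this band.
    """
    new_e = np.sum(np.abs(X_filt) ** 2)                 # newE, post-filter energy
    norm = np.sqrt(band_energy_before / (new_e + 1e-12))
    X_prime = X_filt * norm                             # X'[i], energy matched
    return X_prime * g                                  # X''[i] = X'[i] * g[i]
```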
And h, performing inverse fast Fourier transform operation on each frame of data, and converting the frequency domain signal into a time domain.
Inverse fast Fourier transform:

x[n] = (1/N) Σ_{k=0}^{N-1} X(k) e^{j2πkn/N}, n = 0, 1, …, N-1
and i, synthesizing the processed data of each frame, and outputting the noise-reduced audio stream.
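Steps h and i, the per-frame inverse FFT followed by synthesis, can be sketched as follows; the hop length and the absence of a synthesis window are simplifying assumptions:

```python
import numpy as np

def synthesize(frames_freq, hop):
    """Per-frame inverse FFT followed by overlap-add synthesis.

    frames_freq: list of one-sided spectra (np.fft.rfft of each frame);
    hop: frame shift in samples.
    """
    frames = [np.fft.irfft(F) for F in frames_freq]     # back to time domain
    frame_len = len(frames[0])
    out = np.zeros(hop * (len(frames) - 1) + frame_len)
    for k, frame in enumerate(frames):                  # overlap-add
        out[k * hop : k * hop + frame_len] += frame
    return out
```

With non-overlapping frames (hop equal to the frame length) this reduces to simple concatenation, so a single frame round-trips exactly.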
In the traditional voice noise reduction method based on signal processing, many parameters need to be estimated manually or fine-tuned, so that the obtained parameters are not accurate enough, the noise estimation is not accurate enough, and the noise reduction effect is not ideal under the environment of low signal-to-noise ratio and unstable noise. According to the invention, the traditional signal processing technology is combined with the deep learning method, the noise reduction parameters are trained and learned in a data driving mode, the real-time advantage of the traditional signal processing algorithm is reserved, the noise reduction performance of the algorithm is improved, and meanwhile, when the algorithm is designed, the hearing characteristics of human ears and the perception characteristics of a voice recognition model are considered, so that the noise of voice with noise is reduced, and meanwhile, the robustness of a voice recognition system is ensured.
The invention provides a training method of a frequency band gain model and a voice noise reduction method for a vehicle-mounted scene. According to the auditory characteristic law of human ears and the dependence on voice characteristics in a voice recognition model, carrying out frequency band division filtering operation on received voice with noise according to a mel cepstrum, and carrying out gain control on each frame of data of signals, thereby realizing voice noise reduction.
The foregoing description of the preferred embodiments of the invention is not intended to limit the invention to the precise form disclosed, and any such modifications, equivalents, and alternatives falling within the spirit and scope of the invention are intended to be included within the scope of the invention.

Claims (9)

1. The training method of the frequency band gain model is characterized in that the frequency band gain model is based on a neural network model, an SRU architecture is adopted, and the frequency band gain model can carry out noise reduction gain on signals of a plurality of frequency bands according to characteristic values of noise-carrying voice signals;
the band gain model includes a first SRU layer, a second SRU layer, a third SRU layer, a fourth SRU layer, a fifth SRU layer, a first fully connected layer, and a second fully connected layer,
the characteristic value is input to the first SRU layer, processed by using a tanh activation function and then output; the characteristic value is input to the first full-connection layer, processed by using a tanh activation function and then output to the second SRU layer, and processed by using a Relu activation function and then output to the third SRU layer; in the third SRU layer, processing the output of the first fully-connected layer and the output of the second SRU layer by using a Relu activation function and outputting; in the fourth SRU layer, processing the output of the first SRU layer and the output of the third SRU layer by using a Relu activation function and outputting; in the fifth SRU layer, processing the output of the first SRU layer, the output of the third SRU layer and the output of the fourth SRU layer by using a Relu activation function and outputting; in the second full connection layer, the output of the fifth SRU layer is processed by using a sigmoid activation function and then is output, so that the gain value of the voice frequency band with noise is obtained;
the training method comprises the following steps:
s1, after a pure voice signal and a pure noise signal are respectively framed, carrying out frame-by-frame banding according to frequency to obtain n pure voice frequency bands and n pure noise frequency bands, and calculating the energy of each pure voice frequency band and each pure noise frequency band; mixing the pure voice signal and the pure noise signal to obtain a voice signal with noise, carrying out frame-dividing treatment on the voice signal with noise, carrying out band-dividing on the voice signal with noise frame by frame according to corresponding frequencies to obtain n voice frequency bands with noise, and calculating the energy of each voice frequency band with noise;
according to the energy of each pure noise frequency band, n corresponding logarithmic spectrums of the pure noise frequency bands are obtained;
according to the ratio of the energy of the pure voice frequency band to the energy of the noisy voice frequency band of the corresponding frequency band, n gain expected values are obtained;
obtaining corresponding logarithmic power spectrum according to energy of each noisy speech frequency band, and obtaining n MFCC coefficients through inverse discrete cosine transform to serve as n corresponding characteristic values of the noisy speech frequency band;
s2, inputting the n eigenvalues into the frequency band gain model to output gain values corresponding to each noisy speech frequency band, and performing neural network training by using the gain expected values and the log spectrum as labels to realize parameter optimization of the frequency band gain model.
2. The method of claim 1, wherein the SRU units in the second SRU layer are capable of performing parallel computation and updating hidden states through a forgetting gate.
3. The method according to claim 1, wherein each frame of the noisy speech signal is subjected to unvoiced sound judgment and processing to obtain a pitch period value thereof, and the pitch period value is inputted as a new feature value to the band gain model for training.
4. A method of training a band gain model according to claim 3, wherein a pitch signal band corresponding to the noisy speech signal is obtained from the pitch period value, the energy of the pitch signal band is calculated, and the discrete cosine transform is performed in combination with the energy of the noisy speech band to obtain a correlation parameter, which is input as a new feature value to the band gain model for training.
5. The method according to claim 1, wherein the feature values are subjected to first derivative and/or second derivative processing, and the obtained result is inputted as a new feature value to the band gain model for training.
6. The method of claim 1, wherein the pure speech signal, the pure noise signal, and the noisy speech signal are each banded using a mel filter.
7. The voice noise reduction method suitable for the vehicle-mounted scene is characterized by comprising the following steps of:
p1, carrying out frame division processing on noisy speech, carrying out frame-by-frame banding according to frequency to obtain m noisy speech frequency bands, extracting m corresponding characteristic values of the m noisy speech frequency bands, and inputting a frequency band gain model according to any one of claims 1 to 6 to obtain gain values corresponding to the noisy speech frequency bands;
p2, adopting a comb filter to carry out pitch filtering on the voice frequency band with noise;
p3, calculating the energy of the noisy speech frequency band after filtering to obtain the energy ratio of the noisy speech frequency band before and after filtering;
and P4, multiplying the signal after filtering the noisy speech frequency band by the energy ratio, and multiplying the signal with the noisy speech frequency band by a gain value corresponding to the noisy speech frequency band to obtain noise-reduced speech data.
8. The method for voice noise reduction applicable to an on-vehicle scene according to claim 7, wherein the comb filter is set as follows:
x′[i]=x[i]+a×P
wherein x[i] and x′[i] respectively represent the signals before and after filtering, a is the filter coefficient, and P is frequency domain data generated from the speech signal containing the fundamental frequency part.
9. The method of claim 8, wherein the filter coefficients are calculated using the following formula:
wherein cope is the correlation value between the energy of the noisy speech signal and the pitch energy, and g_b is the gain value.
CN202110985541.4A 2021-08-26 2021-08-26 Training method of frequency band gain model and voice noise reduction method for vehicle-mounted scene Active CN113782011B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110985541.4A CN113782011B (en) 2021-08-26 2021-08-26 Training method of frequency band gain model and voice noise reduction method for vehicle-mounted scene

Publications (2)

Publication Number Publication Date
CN113782011A CN113782011A (en) 2021-12-10
CN113782011B true CN113782011B (en) 2024-04-09

Family

ID=78839274

Country Status (1)

Country Link
CN (1) CN113782011B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114387772B (en) * 2021-12-15 2022-11-25 深圳市东峰盛科技有限公司 Security protection control is with camera that has alarm structure
CN117198308B (en) * 2023-09-11 2024-03-19 辽宁工程技术大学 Style migration method for in-vehicle feedback sound effect

Citations (9)

Publication number Priority date Publication date Assignee Title
WO1999012155A1 (en) * 1997-09-30 1999-03-11 Qualcomm Incorporated Channel gain modification system and method for noise reduction in voice communication
JP2005348173A (en) * 2004-06-03 2005-12-15 Nippon Telegr & Teleph Corp <Ntt> Noise reduction method, device for executing the same method, program and its recording medium
CN103646648A (en) * 2013-11-19 2014-03-19 清华大学 Noise power estimation method
CN108877782A (en) * 2018-07-04 2018-11-23 百度在线网络技术(北京)有限公司 Audio recognition method and device
CN109065067A (en) * 2018-08-16 2018-12-21 福建星网智慧科技股份有限公司 A kind of conference terminal voice de-noising method based on neural network model
CN109767782A (en) * 2018-12-28 2019-05-17 中国科学院声学研究所 A kind of sound enhancement method improving DNN model generalization performance
CN110120225A (en) * 2019-04-01 2019-08-13 西安电子科技大学 A kind of audio defeat system and method for the structure based on GRU network
CN110335620A (en) * 2019-07-08 2019-10-15 广州欢聊网络科技有限公司 A kind of noise suppressing method, device and mobile terminal
CN110610715A (en) * 2019-07-29 2019-12-24 西安工程大学 Noise reduction method based on CNN-DNN hybrid neural network

Family Cites Families (3)

Publication number Priority date Publication date Assignee Title
US20060028337A1 (en) * 2004-08-09 2006-02-09 Li Qi P Voice-operated remote control for TV and electronic systems
US8005668B2 (en) * 2004-09-22 2011-08-23 General Motors Llc Adaptive confidence thresholds in telematics system speech recognition
ES2928295T3 (en) * 2020-02-14 2022-11-16 System One Noc & Dev Solutions S A Method for improving telephone voice signals based on convolutional neural networks

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant