CN113782011B - Training method of frequency band gain model and voice noise reduction method for vehicle-mounted scene - Google Patents


Info

Publication number
CN113782011B
CN113782011B (application CN202110985541.4A)
Authority
CN
China
Prior art keywords
frequency band
noise
voice
layer
sru
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110985541.4A
Other languages
Chinese (zh)
Other versions
CN113782011A
Inventor
姜彦吉
张胜
宋湘钰
范佳亮
彭博
Current Assignee
Suzhou Automotive Research Institute of Tsinghua University
Original Assignee
Suzhou Automotive Research Institute of Tsinghua University
Priority date
Filing date
Publication date
Application filed by Suzhou Automotive Research Institute of Tsinghua University filed Critical Suzhou Automotive Research Institute of Tsinghua University
Priority to CN202110985541.4A
Publication of CN113782011A
Application granted
Publication of CN113782011B
Legal status: Active


Classifications

    • G10L 15/063 — Speech recognition; training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G06N 3/048 — Neural networks; activation functions
    • G06N 3/08 — Neural networks; learning methods
    • G10L 15/16 — Speech classification or search using artificial neural networks
    • G10L 15/20 — Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise
    • G10L 21/0208 — Speech enhancement; noise filtering
    • G10L 25/24 — Speech or voice analysis; the extracted parameters being the cepstrum
    • G10L 25/30 — Speech or voice analysis using neural networks

Abstract

The invention discloses a training method for a frequency band gain model and a voice noise reduction method for the vehicle-mounted scene. In the training method, a pure voice signal and a pure noise signal are each framed and then divided by frequency into pure voice bands and pure noise bands; the two signals are also mixed to form a noisy voice signal, which is framed and divided at the corresponding frequencies into noisy voice bands. From the energy of each pure noise, pure voice, and noisy voice band, the log spectra of the pure noise bands, the expected gain values, and the feature values are computed. The feature values are input into the frequency band gain model, which outputs a gain value for each noisy voice band, and the neural network is trained with the expected gain values and the log spectra as labels to optimize the parameters of the frequency band gain model. The training method and voice noise reduction method provided by the invention reduce the noise in noisy voice while preserving the robustness of the voice recognition system.

Description

Training method of frequency band gain model and voice noise reduction method for vehicle-mounted scene
Technical Field
The invention relates to the technical field of voice noise reduction, in particular to a training method of a frequency band gain model and a voice noise reduction method for a vehicle-mounted scene.
Background
As automobiles become more intelligent, the vehicle-mounted voice system has become standard equipment in the cabin, and it must meet two requirements: (1) clear voice call quality during driving; and (2) stable voice recognition performance during driving. Engine noise, wind noise, road noise, air-conditioning noise, and other interference during driving corrupt the voice signal, seriously degrading the performance of the voice system and the user experience. Noise in the vehicle-mounted scene has therefore become a problem that must be overcome.
Common approaches to voice noise reduction, and their characteristics, can be summarized as follows:
(1) Traditional signal-processing algorithms, such as spectral subtraction and Wiener filtering, assume that the voice follows a certain distribution and that the noise is stationary or slowly varying, and estimate the noise power spectrum or an ideal Wiener filter. These algorithms are simple and real-time, and achieve good separation performance when their assumptions hold; in real scene environments, however, the assumptions are rarely satisfied and the noise-reduction performance suffers.
(2) Decomposition-based methods, such as non-negative matrix factorization, assume that the spectrum of the sound signal has a low-rank structure, so that it can be represented with a small number of basis vectors and the basic spectral patterns in the sound signal can be mined.
(3) Rule-based algorithms model the speech enhancement problem in noisy scenes according to rules or mechanisms found in studies of auditory scene analysis. Being grounded in rules, they are highly interpretable, but because auditory studies generally use simple stimuli as input, the resulting rules do not necessarily apply to complex auditory environments: such models aim at reproducing laboratory results and are difficult to apply to practical problems. In addition, most auditory models depend heavily on grouping cues, especially the accuracy of pitch extraction, which is hard to guarantee in complex auditory environments, so the voice noise-reduction effect is not ideal.
(4) Noise-reduction algorithms based on deep learning models exploit the high computing power of modern hardware and the strong nonlinearity of deep neural networks to model voice, and can achieve good noise-reduction performance when driven by large volumes of data, but the models demand substantial computing resources and have poor real-time performance.
In addition, because voice noise reduction and voice recognition have different optimization targets, voice data processed by many noise-reduction algorithms is distorted, reducing the accuracy of the voice recognition system; the design of the noise-reduction algorithm must therefore be compatible with the design of the voice recognition model.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides a training method for a frequency band gain model and a voice noise reduction method for the vehicle-mounted scene. The technical scheme is as follows:
In one aspect, the invention provides a training method for a frequency band gain model. The model is a neural network with an SRU (Simple Recurrent Unit) architecture that applies noise-reduction gains to the signals of multiple frequency bands according to the feature values of the noisy voice signal.
the training method comprises the following steps:
s1, after a pure voice signal and a pure noise signal are respectively framed, carrying out frame-by-frame banding according to frequency to obtain n pure voice frequency bands and n pure noise frequency bands, and calculating the energy of each pure voice frequency band and each pure noise frequency band; mixing the pure voice signal and the pure noise signal to obtain a voice signal with noise, carrying out frame-dividing treatment on the voice signal with noise, carrying out band-dividing on the voice signal with noise frame by frame according to corresponding frequencies to obtain n voice frequency bands with noise, and calculating the energy of each voice frequency band with noise;
according to the energy of each pure noise frequency band, n corresponding logarithmic spectrums of the pure noise frequency bands are obtained;
according to the ratio of the energy of the pure voice frequency band to the energy of the noisy voice frequency band of the corresponding frequency band, n gain expected values are obtained;
obtaining corresponding logarithmic power spectrum according to energy of each noisy speech frequency band, and obtaining n MFCC coefficients through inverse discrete cosine transform to serve as n corresponding characteristic values of the noisy speech frequency band;
s2, inputting the n eigenvalues into the frequency band gain model to output gain values corresponding to each noisy speech frequency band, and performing neural network training by using the gain expected values and the log spectrum as labels to realize parameter optimization of the frequency band gain model.
Further, the band gain model includes a first SRU layer, a second SRU layer, a third SRU layer, a fourth SRU layer, a fifth SRU layer, a first fully connected layer, and a second fully connected layer:
the feature values are input to the first SRU layer, processed with a tanh activation function, and output;
the feature values are also input to the first fully connected layer, processed with a tanh activation function, and output to the second SRU layer, which processes them with a ReLU activation function and outputs to the third SRU layer;
the third SRU layer processes the output of the first fully connected layer and the output of the second SRU layer with a ReLU activation function;
the fourth SRU layer processes the output of the first SRU layer and the output of the third SRU layer with a ReLU activation function;
the fifth SRU layer processes the outputs of the first, third, and fourth SRU layers with a ReLU activation function;
the second fully connected layer processes the output of the fifth SRU layer with a sigmoid activation function to produce the gain values of the noisy voice bands.
Further, the SRU units in the second SRU layer can compute in parallel and update the hidden state through a forget gate.
Further, a voiced/unvoiced decision is made for each frame of the noisy voice signal to obtain its pitch period value, and the pitch period value is input into the frequency band gain model as an additional feature value for training.
Further, the pitch signal band corresponding to the noisy voice signal is obtained from the pitch period value, its energy is calculated, and a discrete cosine transform is applied together with the energies of the noisy voice bands to obtain related parameters, which are input into the frequency band gain model as additional feature values for training.
Further, the first and/or second derivative of the feature values is computed, and the result is input into the frequency band gain model as additional feature values for training.
Further, the pure voice signal, the pure noise signal, and the noisy voice signal are all divided into bands with a Mel filter bank.
On the other hand, the invention also provides a voice noise reduction method for the vehicle-mounted scene, comprising the following steps:
P1, frame the noisy voice, divide each frame by frequency into m noisy voice bands, extract the m corresponding feature values, and input them into the frequency band gain model to obtain the gain value of each noisy voice band;
P2, apply pitch filtering to each noisy voice band with a comb filter;
P3, calculate the energy of each filtered noisy voice band to obtain the ratio of the band energy before filtering to the energy after filtering;
P4, multiply the filtered band signal by this energy ratio and by the gain value of the corresponding noisy voice band to obtain the noise-reduced voice data.
Further, the comb filter is defined as:
x'[i] = x[i] + a × P[i]
where x[i] and x'[i] are the signals before and after filtering, a is the filter coefficient, and P is the frequency-domain data generated from the fundamental-frequency part of the voice signal.
Further, the filter coefficient a is calculated from corr, the correlation between the energy of the noisy voice signal and the pitch energy, and from the gain value g_b.
The technical scheme provided by the invention has the following beneficial effects:
(1) The noise reduction effect under the environment of low signal-to-noise ratio and unstable noise is improved;
(2) Noise is reduced on the voice with noise, and meanwhile, the robustness of a voice recognition system is guaranteed.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic diagram of a band gain model framework provided by an embodiment of the present invention;
FIG. 2 is a schematic diagram of a band gain model according to an embodiment of the present invention;
fig. 3 is a schematic structural diagram of an SRU unit in a band gain model according to an embodiment of the present invention;
fig. 4 is a flowchart of a method for voice noise reduction for a vehicle scene according to an embodiment of the present invention.
Detailed Description
For better understanding of the present invention, the objects, technical solutions and advantages thereof will be more clearly understood by those skilled in the art, and the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. It should be noted that the implementation manner not shown or described in the drawings is a manner known to those of ordinary skill in the art. Additionally, although examples of parameters including particular values may be provided herein, it should be appreciated that the parameters need not be exactly equal to the corresponding values, but may be approximated to the corresponding values within acceptable error margins or design constraints. It will be apparent that the described embodiments are merely some, but not all embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the present invention without making any inventive effort, shall fall within the scope of the present invention. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, in the description and claims, are intended to cover a non-exclusive inclusion, such that a process, method, apparatus, article, or device that comprises a list of steps or elements is not necessarily limited to those steps or elements that are expressly listed or inherent to such process, method, article, or device.
In one embodiment of the invention, a training method for a frequency band gain model is provided. The model is a neural network with an SRU architecture that applies noise-reduction gains to the signals of multiple frequency bands according to the feature values of the noisy voice signal.
The training method comprises the following steps:
s1, after a pure voice signal and a pure noise signal are respectively framed, carrying out frame-by-frame banding according to frequency to obtain n pure voice frequency bands and n pure noise frequency bands, and calculating the energy of each pure voice frequency band and each pure noise frequency band; mixing the pure voice signal and the pure noise signal to obtain a voice signal with noise, carrying out frame-dividing treatment on the voice signal with noise, carrying out band-dividing on the voice signal with noise frame by frame according to corresponding frequencies to obtain n voice frequency bands with noise, and calculating the energy of each voice frequency band with noise;
according to the energy of each pure noise frequency band, n corresponding logarithmic spectrums of the pure noise frequency bands are obtained;
according to the ratio of the energy of the pure voice frequency band to the energy of the noisy voice frequency band of the corresponding frequency band, n gain expected values are obtained;
obtaining corresponding logarithmic power spectrum according to energy of each noisy speech frequency band, and obtaining n MFCC coefficients through inverse discrete cosine transform to serve as n corresponding characteristic values of the noisy speech frequency band;
s2, inputting the n eigenvalues into the frequency band gain model to output gain values corresponding to each noisy speech frequency band, and performing neural network training by using the gain expected values and the log spectrum as labels to realize parameter optimization of the frequency band gain model.
The band gain model includes a first SRU layer, a second SRU layer, a third SRU layer, a fourth SRU layer, a fifth SRU layer, a first fully connected layer, and a second fully connected layer:
the feature values are input to the first SRU layer, processed with a tanh activation function, and output;
the feature values are also input to the first fully connected layer, processed with a tanh activation function, and output to the second SRU layer, which processes them with a ReLU activation function and outputs to the third SRU layer; the SRU units in the second SRU layer can compute in parallel and update the hidden state through a forget gate;
the third SRU layer processes the output of the first fully connected layer and the output of the second SRU layer with a ReLU activation function;
the fourth SRU layer processes the output of the first SRU layer and the output of the third SRU layer with a ReLU activation function;
the fifth SRU layer processes the outputs of the first, third, and fourth SRU layers with a ReLU activation function;
the second fully connected layer processes the output of the fifth SRU layer with a sigmoid activation function to produce the gain values of the noisy voice bands.
To enhance the training effect and shorten the training time, at least the following three methods can be used to enrich the feature values of the noisy voice:
Method one: make a voiced/unvoiced decision for each frame of the noisy voice signal to obtain its pitch period value, and input the pitch period value into the frequency band gain model as an additional feature value for training.
Method two: on the basis of method one, obtain the pitch signal band corresponding to the noisy voice signal from the pitch period value, calculate its energy, and apply a discrete cosine transform together with the energies of the noisy voice bands to obtain related parameters, which are input into the band gain model as additional feature values for training.
Method three: apply first- and/or second-derivative processing to the feature values, and input the result into the frequency band gain model as additional feature values for training.
It should be noted that all three methods can be used simultaneously; this greatly enlarges the feature set of the noisy voice, and feeding the enlarged set into the frequency band gain model for training yields better results.
In one embodiment of the invention, the following training is performed for the band gain model.
And step 1, collecting pure noise audio data and pure voice audio data in the driving process by using recording equipment.
The pure noise audio data collected during driving is noise recorded under a given vehicle speed, window state, and air-conditioner state. The vehicle speed is the speed corresponding to the background noise, recorded as an integer; for example, 80 denotes 80 km/h. The air-conditioner data records the fan gear under the background noise, with states [closed, half-open]. The window data records the opening state of the windows under the background noise, with states [closed, half-open, fully open]. The data covers noise under various driving conditions in the speed range 30-120 km/h with different opening states of the four windows and the air conditioner.
Step 2, up-sample or down-sample the collected audio data so that the pure voice audio data and the pure noise audio data share a single sampling rate in the range 8 kHz-48 kHz; these become the pure voice signal and the pure noise signal, respectively.
Step 3, frame the pure voice signal and the pure noise signal, with a frame length of 15-30 ms and a frame shift of 5-10 ms.
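A minimal sketch of the framing in step 3; the 20 ms frame length and 10 ms shift are example values within the stated 15-30 ms and 5-10 ms ranges, and `frame_signal` is a hypothetical helper, not from the patent:

```python
def frame_signal(samples, sample_rate=16000, frame_ms=20, shift_ms=10):
    """Split a signal into overlapping frames (step 3).

    Returns a list of frames; the last partial frame is dropped.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    shift = int(sample_rate * shift_ms / 1000)
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, shift)]
```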
Step 4, apply a window function to each frame of the audio data obtained in step 3; a flat-top window is used.
and 5, carrying out Fourier transform on the audio data obtained in the step 4, wherein the transformation formula is as follows:
wherein X is n (e ) Is a fourier transform for the time domain signal x (n), the subscript n denotes a time index, { ω (n) } is a real window sequence.
Step 6, divide the spectra obtained in step 5 into bands frame by frame according to frequency, to match the auditory characteristics of the human ear. The banding uses a Mel filter bank; 65 filters divide the spectrum into 66 frequency bands. The relation between Mel frequency and actual frequency is:

F_mel = 2595 × log10(1 + f / 700)

where F_mel is the perceived frequency in Mel and f is the actual frequency in Hz.
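Assuming the standard Mel-scale form F_mel = 2595·log10(1 + f/700) for the relation in step 6, the mapping and its inverse can be sketched as:

```python
import math

def hz_to_mel(f_hz):
    """Actual frequency (Hz) to perceived Mel frequency (step 6)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(f_mel):
    """Inverse mapping, used when placing Mel filter-bank edges."""
    return 700.0 * (10.0 ** (f_mel / 2595.0) - 1.0)
```

Filter-bank edges are typically chosen equally spaced on the Mel axis and mapped back to Hz with `mel_to_hz`.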
Step 7, calculate the energy of the banded signals obtained in step 6. The energy at frequency point k is:

E(k) = |X(k)|²

and the energy of band b is:

E(b) = Σ_k w_b(k) |X(k)|²

where w_b(k) is the amplitude of the band-b filter at frequency point k.
and step 8, pure noise energy is obtained from the pure noise signal in the calculation mode of step 7, 66 log spectrums are obtained by using the pure noise energy, and the calculation formula is as follows:
Ln[i]=log 10 (10 -2 +En[i])
wherein Ln [ i ] is a logarithmic spectrum and En [ i ] is pure noise energy.
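Steps 7-8 can be sketched directly from the formulas above; the helper names are hypothetical, and `weights` plays the role of the band filter amplitudes w_b(k):

```python
import math

def band_energy(spectrum, weights):
    """E(b) = sum_k w_b(k) |X(k)|^2 for one band (step 7)."""
    return sum(w * abs(x) ** 2 for w, x in zip(weights, spectrum))

def log_spectrum(band_energies):
    """Ln[i] = log10(1e-2 + En[i]) (step 8); the 1e-2 floor avoids log10(0)."""
    return [math.log10(1e-2 + e) for e in band_energies]
```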
Step 9, mix the pure voice signal and the pure noise signal, and down-sample the mixed audio data to save computation, obtaining the noisy voice signal; frame it correspondingly.
Step 10, filter the noisy voice with the Mel filter bank of step 6 to obtain the corresponding noisy voice bands, and accumulate the energy within each noisy voice band.
Step 11, take the logarithm of each filter output to obtain the log power spectrum of the corresponding band, then apply an inverse discrete cosine transform to obtain 66 MFCC coefficients, recorded as the feature values x1-x66, where x'(k) is the output power spectrum of the k-th filter and L is the number of MFCC coefficients.
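A sketch of step 11 under the common MFCC formulation — logarithm of each filter output followed by a type-II DCT. The exact DCT variant and the reuse of the 10⁻² log floor from step 8 are assumptions for the example:

```python
import math

def mfcc_from_band_energies(energies, n_coeffs=18):
    """Cepstral coefficients from Mel-band energies (steps 10-11).

    `energies` holds one energy per filter; a type-II DCT of the
    log energies yields the first `n_coeffs` MFCC-style coefficients.
    """
    log_e = [math.log10(1e-2 + e) for e in energies]
    n = len(log_e)
    return [sum(log_e[k] * math.cos(math.pi * i * (k + 0.5) / n)
                for k in range(n))
            for i in range(n_coeffs)]
```

A flat band spectrum concentrates all the cepstral energy in coefficient 0, which is a quick sanity check on the transform.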
Step 12, apply first- and second-order derivative processing to the first 18 coefficients of x1-x66, adding 36 feature values, recorded as x67-x102:

first-order: Δx[t] = x[t] − x[t−1]
second-order: Δ²x[t] = Δx[t] − Δx[t−1]
step 13, calculating the basis of the voice signal with noiseThe period of the sound, denoted as x, is taken as the characteristic value 103
The calculation steps comprise:
(1) Filter one frame of the noisy voice signal {x(n)} with a 900 Hz low-pass filter and discard the first 20 output values; record the result as {x'(n)}.
(2) Find the maximum amplitude of the first 100-120 samples and of the last 100-120 samples of {x'(n)}; multiply the smaller of the two by a factor of 0.68 to obtain the threshold level C_L.
(3) Performing center clipping on { x (n) } to obtain { y (n) } and three-level quantization to obtain { y' (n) };
(4) Calculate the cross-correlation R(k) of {y(n)} and {y'(n)}:

R(k) = Σ_n y(n) y'(n − k)

where k ranges over 20-150, corresponding to the pitch frequency range 60-500 Hz; R(0) corresponds to the short-time energy.
(5) Having obtained the cross-correlation, find the maximum value R_max among R(20)-R(150). If R_max < 0.25 R(0), the frame is considered unvoiced and the pitch period value P is 0; otherwise the pitch period P is the lag k at which R(k) attains its maximum R_max, i.e. P = argmax_{20≤k≤150} R(k).
Step 14, obtain the pitch signal band in the noisy voice according to step 13, and calculate the energy Ex of each noisy voice band and the energy Ep of the pitch signal band with the energy formula of step 7:

Ex = Σ_k w_b(k) |X(k)|²
Ep = Σ_k w_b(k) |P(k)|²

where w_b(k) is the amplitude of the band filter at frequency point k.
Apply a discrete cosine transform to the resulting energies to compute 12 values, x104-x115, as feature values. The discrete cosine transform is:

F(u) = c(u) Σ_{i=0}^{N−1} f(i) cos((i + 0.5)π u / N),  with c(0) = √(1/N) and c(u) = √(2/N) for u ≥ 1

where f(i) is the original signal, F(u) is the coefficient after the discrete cosine transform, N is the number of points of the original signal, and c(u) is the compensation coefficient that makes the discrete cosine transform matrix orthogonal.
Step 15, calculate the 66 expected gain values g[i] from the ratio of the banded pure voice energy to the noisy voice energy:

g[i] = √(E_y(b) / E_x(b))

where E_y(b) is the pure voice energy and E_x(b) is the noisy voice energy.
If g[i] > 1, set g[i] = 1; if the endpoint detection value of the pure voice is zero, if the silence flag obtained from the pure voice features is zero, or if g[i] = 0, set g[i] = −1.
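A sketch of the expected-gain target of step 15 with the clipping rules above. Taking the square root of the energy ratio is an assumption (the text says only "ratio"), as is the single `silence` flag standing in for the endpoint-detection and mute-mark conditions:

```python
import math

def expected_gain(e_clean, e_noisy, silence=False):
    """Per-band training target g[i] (step 15 and its clipping rules)."""
    if silence:
        return -1.0  # silent clean-speech band is labelled -1
    g = math.sqrt(e_clean / e_noisy) if e_noisy > 0 else 1.0
    return min(g, 1.0)  # gains above 1 are clipped to 1
```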
Step 16, input the 115 feature values x1-x115 into the band gain model. The model has 7 layers, 66 outputs, and 450 neurons; as shown in fig. 1 and fig. 2, the internal data flow is as follows:
step 16-1 firstly enters a full-connection layer, tanh activation function processing is used, weight constraint is set, loss function weight is 0.3-0.5, constraint is carried out on a main weight matrix by 0.45-0.5, constraint is carried out on a bias vector by 0.45-0.5, a regular term applied to the weight is 10-6-10-7, a regular term applied to the bias vector is 10-6-10-7, and 64 values are output in total.
Step 16-2: input the 64 values obtained in step 16-1 into the SRU layer, which computes in parallel over the input x_t, where W denotes a weight matrix:

f_t = σ(W_f x_t + b_f)
r_t = σ(W_r x_t + b_r)

Step 16-3: using the values calculated in step 16-2, update the hidden state c_t through the forget gate and finally obtain the output h_t, where g denotes the activation function:

c_t = f_t ⊙ c_{t−1} + (1 − f_t) ⊙ (W x_t)
h_t = g(c_t)
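The SRU recurrence of steps 16-2 and 16-3 can be sketched for a single scalar unit as follows. The c_t update is the standard SRU form from Lei et al.'s paper (an assumption, as the patent does not reproduce it), and the output shown uses the reset gate r_t in the original SRU's highway form, whereas step 16-3 above writes simply h_t = g(c_t):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sru_step(x_t, c_prev, w, w_f, b_f, w_r, b_r):
    """One scalar SRU time step (steps 16-2 / 16-3).

    f_t and r_t depend only on x_t, which is what lets the layer
    compute its gates in parallel across time; only the c_t update
    is sequential.
    """
    f_t = sigmoid(w_f * x_t + b_f)                   # forget gate
    r_t = sigmoid(w_r * x_t + b_r)                   # reset gate
    c_t = f_t * c_prev + (1.0 - f_t) * (w * x_t)     # hidden-state update
    h_t = r_t * math.tanh(c_t) + (1.0 - r_t) * x_t   # highway output, g = tanh
    return h_t, c_t
```

With the forget gate saturated near 1 the unit simply carries its state forward, which is the behaviour the forget-gate update is meant to provide.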
Step 16-4 referring to fig. 3, the SRU units in steps 16-2 and 16-3 output 36 values using the Relu activation function process.
Step 16-5 places the outputs of steps 16-1 and 16-4 into one SRU layer for processing using the Relu activation function for a total of 42 outputs.
Step 16-6 puts the 115 feature values input at the beginning into one SRU layer, and uses the tanh activation function to process, and total 86 outputs.
Step 16-7 places the output values of both layers of step 16-5 and step 16-6 into a new SRU layer, and outputs 48 values in total using the Relu activation function process.
Step 16-8 places the outputs of the three layers in step 16-5, step 16-6 and step 16-7 into a new SRU layer, processed using the Relu activation function, and outputs 108 values in total.
Step 16-9 takes the output of step 16-8 as the input to a fully connected layer and processes it with a sigmoid activation function, outputting 66 gain values in total.
The basic construction of the band gain model is completed through the steps.
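The wiring of steps 16-1 to 16-9 can be sanity-checked with a small NumPy sketch. Random dense projections stand in for the trained dense/SRU layers, so only the layer sizes and skip connections are illustrated, not real SRU recurrence; note the per-layer sizes sum to the stated 450 neurons:

```python
import numpy as np

rng = np.random.default_rng(42)
relu = lambda z: np.maximum(z, 0.0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def layer(x, n_out, act):
    """Stand-in for a dense/SRU layer: a fresh random projection + activation."""
    Wt = rng.standard_normal((len(x), n_out)) * 0.1
    return act(x @ Wt)

x = rng.standard_normal(115)                                 # features x1..x115
fc1  = layer(x, 64, np.tanh)                                 # step 16-1: dense, tanh
sru2 = layer(fc1, 36, relu)                                  # steps 16-2 to 16-4
sru3 = layer(np.concatenate([fc1, sru2]), 42, relu)          # step 16-5
sru1 = layer(x, 86, np.tanh)                                 # step 16-6
sru4 = layer(np.concatenate([sru3, sru1]), 48, relu)         # step 16-7
sru5 = layer(np.concatenate([sru3, sru1, sru4]), 108, relu)  # step 16-8
gains = layer(sru5, 66, sigmoid)                             # step 16-9: dense, sigmoid
assert sum(map(len, [fc1, sru2, sru3, sru1, sru4, sru5, gains])) == 450
```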
Step 17, inputting the extracted 115 feature values into the band gain model and training it with the 66 gain expected values and the 66 log spectra as labels. The whole model produces a 66-dimensional output whose components act on different frequency bands to accomplish the noise suppression task.
Specifically, the data are divided into batches of 30 to 40 samples, i.e. 30 to 40 samples are used for each training step; training is run for 100 to 120 epochs, and 10% to 20% of the training set is held out as a validation set. The gain data are obtained after training.
The optimizer and loss function used for training are set as follows: the Adam optimizer is used for gradient control, together with the cross entropy loss function:

L = -(1/n) Σ_x [ y ln a + (1 - y) ln(1 - a) ]

where x denotes a sample, y the actual label, a the predicted output, and n the total number of samples.
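A sketch of the loss, with one added assumption not stated in the patent: bands labeled -1 (the silent "don't care" targets) are masked out of the average:

```python
import numpy as np

def cross_entropy(y_true, y_pred, eps=1e-7):
    """Binary cross-entropy, L = -(1/n) * sum(y ln a + (1-y) ln(1-a)).

    Targets of -1 are masked out (an assumption: silent bands should not
    contribute to the loss).  Predictions are clipped away from 0/1 to
    keep the logarithms finite.
    """
    mask = y_true >= 0.0
    a = np.clip(y_pred[mask], eps, 1.0 - eps)
    y = y_true[mask]
    return -np.mean(y * np.log(a) + (1.0 - y) * np.log(1.0 - a))
```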
It should be noted that the steps of this embodiment are not strictly sequential, and may be flexibly exchanged or deleted according to the actual implementation, and the method still falls within the scope of protection of the embodiment without substantial modification.
In one embodiment of the present invention, a method for voice noise reduction applicable to a vehicle-mounted scene is provided, including the following steps:
p1, carrying out frame division processing on noisy speech, carrying out frame-by-frame banding according to frequency to obtain m noisy speech frequency bands, extracting m corresponding characteristic values of the m noisy speech frequency bands, and inputting the m corresponding characteristic values into the frequency band gain model to obtain gain values corresponding to the noisy speech frequency bands;
p2, adopting a comb filter to carry out pitch filtering on the voice frequency band with noise;
p3, calculating the energy of the noisy speech frequency band after filtering to obtain the energy ratio of the noisy speech frequency band before and after filtering;
and P4, multiplying the signal after filtering the noisy speech frequency band by the energy ratio, and multiplying the signal with the noisy speech frequency band by a gain value corresponding to the noisy speech frequency band to obtain noise-reduced speech data.
The setting formula of the comb filter is as follows:
x′[i]=x[i]+a×P
wherein x[i] and x′[i] respectively represent the signals before and after filtering, P is frequency domain data generated from the speech signal containing the fundamental frequency part, and a is the filter coefficient, calculated from cope, the correlation value between the noisy speech energy and the pitch energy, and from the gain value g_b (see step c below).
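A hedged sketch of the comb filter and its coefficient. Only three boundary cases for a are stated in the text, so the interpolated middle case in filter_coeff below is a guess, not the patent's rule:

```python
import numpy as np

def comb_filter(x, p, a):
    """Pitch comb filter x'[i] = x[i] + a * P over one band's frequency bins.

    x: noisy band spectrum, p: spectrum built from the pitch (fundamental
    frequency) component, a: filter coefficient in [0, 1].
    """
    return x + a * p

def filter_coeff(cope, g_b):
    """Filter coefficient from the pitch correlation cope and the band gain g_b."""
    if cope <= 0.0 or g_b >= 1.0:
        return 0.0          # no pitch, or no noise: leave the band unfiltered
    if cope >= g_b:
        return 1.0          # strong pitch correlation: full comb filtering
    return cope / g_b       # hypothetical interpolation between the stated cases
```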
The vehicle-mounted voice recognition system performs poorly under driving conditions with a low signal-to-noise ratio; one important reason is that many non-stationary noises arise in a real vehicle environment and degrade recognition. Traditional voice noise reduction techniques cannot perform voice noise reduction well in real scenes, whereas the method provided by the invention performs real-time noise reduction at different vehicle speeds and in different scenes, including with windows open and air conditioning on, effectively alleviating inaccurate voice recognition in various driving environments.
Referring to fig. 4, the method specifically aims at processing vehicle-mounted noise reduction, and includes the following steps:
step a, voice audio data in various driving scenes are collected by using recording equipment, the voice audio data are noisy voices under different driving conditions, the noisy voices are subjected to frame-by-frame processing, the noisy voices are banded according to frequency by frame to obtain noisy voice frequency bands, characteristic values are extracted from the noisy voice frequency bands and are input into a trained frequency band gain model to obtain corresponding gain values, and the process of extracting the characteristic values is included in the embodiment and is not repeated.
Step b, designing a low-pass filter with a cut-off frequency of 800 Hz and passing the noisy speech through it to remove high-frequency noise.
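One way to realize such a filter is a windowed-sinc FIR design; the 16 kHz sampling rate and the tap count below are assumptions for illustration, not taken from the patent:

```python
import numpy as np

def lowpass_fir(cutoff_hz=800.0, fs=16000.0, num_taps=101):
    """Windowed-sinc FIR low-pass filter (Hamming window)."""
    fc = cutoff_hz / fs                       # normalized cutoff (cycles/sample)
    n = np.arange(num_taps) - (num_taps - 1) / 2
    h = 2 * fc * np.sinc(2 * fc * n)          # ideal low-pass impulse response
    h *= np.hamming(num_taps)                 # taper to reduce ripple
    return h / h.sum()                        # unity gain at DC

def apply_lowpass(signal, h):
    return np.convolve(signal, h, mode="same")
```

A 200 Hz tone should pass nearly unchanged while a 4 kHz tone is strongly attenuated.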
And c, finishing pitch filtering according to the gain value obtained in the step a, wherein the pitch filtering adopts a comb filter, and the formula of the comb filter is as follows:
x′[i]=x[i]+a×P
where a is the filter coefficient and P is frequency domain data generated from the speech signal containing the fundamental frequency part. The filter re-weights the energy of the original speech.
The filter coefficients are calculated using the following formula:
wherein cope is the correlation value between the energy of the noisy speech band signal and the pitch energy, and g_b is the gain value. When cope ≥ g_b, a = 1; when g_b = 1, a = 0, since there is no noise; when cope = 0, a = 0, since there is no pitch.
The calculation formula of cope is as follows:

cope = Expe′ / √( Ex × Ep )

wherein

Expe′ = Σ_k w_b(k) X(k) P(k)

Ex = Σ_k w_b(k) |X(k)|²

Ep = Σ_k w_b(k) |P(k)|²

thereby obtaining the expansion formula of cope:

cope = Σ_k w_b(k) X(k) P(k) / √( (Σ_k w_b(k) |X(k)|²) × (Σ_k w_b(k) |P(k)|²) )

where Expe′ is the normalized cross-correlation term, X(k) is the signal spectrum, and P(k) is the frequency domain data generated from the speech signal containing the fundamental frequency part; w_b(k) is the amplitude of the frequency band at frequency point k, Ex is the energy data of the band, and Ep is the energy data of the corresponding pitch band.
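The expansion maps directly to NumPy; a small epsilon guards against division by zero for empty bands:

```python
import numpy as np

def pitch_correlation(X, P, w_b):
    """Normalized band correlation cope = Expe' / sqrt(Ex * Ep).

    X: noisy band spectrum, P: pitch-band spectrum, w_b: band window
    amplitudes at each frequency bin k.
    """
    expe = np.sum(w_b * X * P)             # cross term Expe'
    ex = np.sum(w_b * np.abs(X) ** 2)      # band energy Ex
    ep = np.sum(w_b * np.abs(P) ** 2)      # pitch band energy Ep
    return expe / np.sqrt(ex * ep + 1e-12)
```

When the band and its pitch component coincide, cope is 1; when they are in antiphase, cope is -1.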
Step d calculates the energy newE of the noisy speech band filtered in step c.
Step e, calculating the ratio of the noisy speech band energy E before filtering to the corresponding band energy newE after filtering; the calculation formula is as follows:

norm = √( E / newE )
Step f multiplies the filtered signal X[i] by the ratio norm obtained in step e to obtain the signal X′[i], so that the energy of each frequency band is the same as the energy of the original signal.
X′[i]=X[i]×norm
Step g, multiplying the signal X′[i] obtained in step f by the gain value corresponding to each frequency band to obtain the noise-reduced speech data X″[i].
X″[i]=X′[i]×g[i]
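Steps d to g for a single band can be sketched as follows; the square root in the energy-matching factor is an assumption, consistent with energy scaling as the square of amplitude:

```python
import numpy as np

def renormalize_and_gain(X_filt, band_energy_before, g):
    """Restore the band's pre-filter energy, then apply the band gain.

    X_filt: the band's spectrum after comb filtering,
    band_energy_before: the band energy E before filtering,
    g: the gain value g[i] predicted for this band.
    """
    new_e = np.sum(np.abs(X_filt) ** 2)                 # newE, post-filter energy
    norm = np.sqrt(band_energy_before / (new_e + 1e-12))
    X_prime = X_filt * norm                             # X'[i], energy matched
    return X_prime * g                                  # X''[i] = X'[i] * g[i]
```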
And h, performing inverse fast Fourier transform operation on each frame of data, and converting the frequency domain signal into a time domain.
Inverse fast Fourier transform:

x[n] = (1/N) Σ_{k=0}^{N-1} X(k) e^{j2πkn/N}, n = 0, 1, …, N-1
and i, synthesizing the processed data of each frame, and outputting the noise-reduced audio stream.
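Steps h and i, the per-frame inverse FFT followed by synthesis, can be sketched as follows; the hop length and the absence of a synthesis window are simplifying assumptions:

```python
import numpy as np

def synthesize(frames_freq, hop):
    """Per-frame inverse FFT followed by overlap-add synthesis.

    frames_freq: list of one-sided spectra (np.fft.rfft of each frame);
    hop: frame shift in samples.
    """
    frames = [np.fft.irfft(F) for F in frames_freq]     # back to time domain
    frame_len = len(frames[0])
    out = np.zeros(hop * (len(frames) - 1) + frame_len)
    for k, frame in enumerate(frames):                  # overlap-add
        out[k * hop : k * hop + frame_len] += frame
    return out
```

With non-overlapping frames (hop equal to the frame length) this reduces to simple concatenation, so a single frame round-trips exactly.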
In the traditional voice noise reduction method based on signal processing, many parameters need to be estimated manually or fine-tuned, so that the obtained parameters are not accurate enough, the noise estimation is not accurate enough, and the noise reduction effect is not ideal under the environment of low signal-to-noise ratio and unstable noise. According to the invention, the traditional signal processing technology is combined with the deep learning method, the noise reduction parameters are trained and learned in a data driving mode, the real-time advantage of the traditional signal processing algorithm is reserved, the noise reduction performance of the algorithm is improved, and meanwhile, when the algorithm is designed, the hearing characteristics of human ears and the perception characteristics of a voice recognition model are considered, so that the noise of voice with noise is reduced, and meanwhile, the robustness of a voice recognition system is ensured.
The invention provides a training method of a frequency band gain model and a voice noise reduction method for a vehicle-mounted scene. According to the auditory characteristic law of human ears and the dependence on voice characteristics in a voice recognition model, carrying out frequency band division filtering operation on received voice with noise according to a mel cepstrum, and carrying out gain control on each frame of data of signals, thereby realizing voice noise reduction.
The foregoing description of the preferred embodiments of the invention is not intended to limit the invention to the precise form disclosed, and any such modifications, equivalents, and alternatives falling within the spirit and scope of the invention are intended to be included within the scope of the invention.

Claims (9)

1. The training method of the frequency band gain model is characterized in that the frequency band gain model is based on a neural network model, an SRU architecture is adopted, and the frequency band gain model can carry out noise reduction gain on signals of a plurality of frequency bands according to characteristic values of noise-carrying voice signals;
the band gain model includes a first SRU layer, a second SRU layer, a third SRU layer, a fourth SRU layer, a fifth SRU layer, a first fully connected layer, and a second fully connected layer,
the characteristic value is input to the first SRU layer, processed by using a tanh activation function and then output; the characteristic value is input to the first full-connection layer, processed by using a tanh activation function and then output to the second SRU layer, and processed by using a Relu activation function and then output to the third SRU layer; in the third SRU layer, processing the output of the first fully-connected layer and the output of the second SRU layer by using a Relu activation function and outputting; in the fourth SRU layer, processing the output of the first SRU layer and the output of the third SRU layer by using a Relu activation function and outputting; in the fifth SRU layer, processing the output of the first SRU layer, the output of the third SRU layer and the output of the fourth SRU layer by using a Relu activation function and outputting; in the second full connection layer, the output of the fifth SRU layer is processed by using a sigmoid activation function and then is output, so that the gain value of the voice frequency band with noise is obtained;
the training method comprises the following steps:
s1, after a pure voice signal and a pure noise signal are respectively framed, carrying out frame-by-frame banding according to frequency to obtain n pure voice frequency bands and n pure noise frequency bands, and calculating the energy of each pure voice frequency band and each pure noise frequency band; mixing the pure voice signal and the pure noise signal to obtain a voice signal with noise, carrying out frame-dividing treatment on the voice signal with noise, carrying out band-dividing on the voice signal with noise frame by frame according to corresponding frequencies to obtain n voice frequency bands with noise, and calculating the energy of each voice frequency band with noise;
according to the energy of each pure noise frequency band, n corresponding logarithmic spectrums of the pure noise frequency bands are obtained;
according to the ratio of the energy of the pure voice frequency band to the energy of the noisy voice frequency band of the corresponding frequency band, n gain expected values are obtained;
obtaining corresponding logarithmic power spectrum according to energy of each noisy speech frequency band, and obtaining n MFCC coefficients through inverse discrete cosine transform to serve as n corresponding characteristic values of the noisy speech frequency band;
s2, inputting the n eigenvalues into the frequency band gain model to output gain values corresponding to each noisy speech frequency band, and performing neural network training by using the gain expected values and the log spectrum as labels to realize parameter optimization of the frequency band gain model.
2. The method of claim 1, wherein the SRU units in the second SRU layer are capable of performing parallel computation and updating hidden states through a forgetting gate.
3. The method according to claim 1, wherein each frame of the noisy speech signal is subjected to unvoiced sound judgment and processing to obtain a pitch period value thereof, and the pitch period value is inputted as a new feature value to the band gain model for training.
4. A method of training a band gain model according to claim 3, wherein a pitch signal band corresponding to the noisy speech signal is obtained from the pitch period value, the energy of the pitch signal band is calculated, and the discrete cosine transform is performed in combination with the energy of the noisy speech band to obtain a correlation parameter, which is input as a new feature value to the band gain model for training.
5. The method according to claim 1, wherein the feature values are subjected to first derivative and/or second derivative processing, and the obtained result is inputted as a new feature value to the band gain model for training.
6. The method of claim 1, wherein the pure speech signal, the pure noise signal, and the noisy speech signal are each banded using a mel filter.
7. The voice noise reduction method suitable for the vehicle-mounted scene is characterized by comprising the following steps of:
p1, carrying out frame division processing on noisy speech, carrying out frame-by-frame banding according to frequency to obtain m noisy speech frequency bands, extracting m corresponding characteristic values of the m noisy speech frequency bands, and inputting a frequency band gain model according to any one of claims 1 to 6 to obtain gain values corresponding to the noisy speech frequency bands;
p2, adopting a comb filter to carry out pitch filtering on the voice frequency band with noise;
p3, calculating the energy of the noisy speech frequency band after filtering to obtain the energy ratio of the noisy speech frequency band before and after filtering;
and P4, multiplying the signal after filtering the noisy speech frequency band by the energy ratio, and multiplying the signal with the noisy speech frequency band by a gain value corresponding to the noisy speech frequency band to obtain noise-reduced speech data.
8. The method for voice noise reduction applicable to an on-vehicle scene according to claim 7, wherein the comb filter is set as follows:
x′[i]=x[i]+a×P
wherein x[i] and x′[i] respectively represent the signals before and after filtering, a is the filter coefficient, and P is frequency domain data generated from the speech signal containing the fundamental frequency part.
9. The method of claim 8, wherein the filter coefficients are calculated using the following formula:
wherein cope is the correlation value between the energy of the noisy speech signal and the pitch energy, and g_b is the gain value.
CN202110985541.4A 2021-08-26 2021-08-26 Training method of frequency band gain model and voice noise reduction method for vehicle-mounted scene Active CN113782011B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110985541.4A CN113782011B (en) 2021-08-26 2021-08-26 Training method of frequency band gain model and voice noise reduction method for vehicle-mounted scene

Publications (2)

Publication Number Publication Date
CN113782011A CN113782011A (en) 2021-12-10
CN113782011B true CN113782011B (en) 2024-04-09

Family

ID=78839274

Country Status (1)

Country Link
CN (1) CN113782011B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114387772B (en) * 2021-12-15 2022-11-25 深圳市东峰盛科技有限公司 Security protection control is with camera that has alarm structure
CN117198308B (en) * 2023-09-11 2024-03-19 辽宁工程技术大学 Style migration method for in-vehicle feedback sound effect

Citations (9)

Publication number Priority date Publication date Assignee Title
WO1999012155A1 (en) * 1997-09-30 1999-03-11 Qualcomm Incorporated Channel gain modification system and method for noise reduction in voice communication
JP2005348173A (en) * 2004-06-03 2005-12-15 Nippon Telegr & Teleph Corp <Ntt> Noise reduction method, device for executing the same method, program and its recording medium
CN103646648A (en) * 2013-11-19 2014-03-19 清华大学 Noise power estimation method
CN108877782A (en) * 2018-07-04 2018-11-23 百度在线网络技术(北京)有限公司 Audio recognition method and device
CN109065067A (en) * 2018-08-16 2018-12-21 福建星网智慧科技股份有限公司 A kind of conference terminal voice de-noising method based on neural network model
CN109767782A (en) * 2018-12-28 2019-05-17 中国科学院声学研究所 A kind of sound enhancement method improving DNN model generalization performance
CN110120225A (en) * 2019-04-01 2019-08-13 西安电子科技大学 A kind of audio defeat system and method for the structure based on GRU network
CN110335620A (en) * 2019-07-08 2019-10-15 广州欢聊网络科技有限公司 A kind of noise suppressing method, device and mobile terminal
CN110610715A (en) * 2019-07-29 2019-12-24 西安工程大学 Noise reduction method based on CNN-DNN hybrid neural network

Family Cites Families (3)

Publication number Priority date Publication date Assignee Title
US20060028337A1 (en) * 2004-08-09 2006-02-09 Li Qi P Voice-operated remote control for TV and electronic systems
US8005668B2 (en) * 2004-09-22 2011-08-23 General Motors Llc Adaptive confidence thresholds in telematics system speech recognition
ES2928295T3 (en) * 2020-02-14 2022-11-16 System One Noc & Dev Solutions S A Method for improving telephone voice signals based on convolutional neural networks

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant