CN111653272A - Vehicle-mounted voice enhancement algorithm based on deep belief network - Google Patents

Vehicle-mounted voice enhancement algorithm based on deep belief network

Info

Publication number
CN111653272A
CN111653272A (application CN202010484415.6A)
Authority
CN
China
Prior art keywords
vehicle
belief network
deep belief
layer
signal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010484415.6A
Other languages
Chinese (zh)
Inventor
周伟
钱龙
施建阳
张英鹏
李鹏华
董莉娜
计超
易军
郑福建
汪彦
郭鑫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University of Science and Technology
Original Assignee
Chongqing University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University of Science and Technology filed Critical Chongqing University of Science and Technology
Priority to CN202010484415.6A priority Critical patent/CN111653272A/en
Publication of CN111653272A publication Critical patent/CN111653272A/en
Pending legal-status Critical Current


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 Training
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10K SOUND-PRODUCING DEVICES; METHODS OR DEVICES FOR PROTECTING AGAINST, OR FOR DAMPING, NOISE OR OTHER ACOUSTIC WAVES IN GENERAL; ACOUSTICS NOT OTHERWISE PROVIDED FOR
    • G10K 11/00 Methods or devices for transmitting, conducting or directing sound in general; Methods or devices for protecting against, or for damping, noise or other acoustic waves in general
    • G10K 11/16 Methods or devices for protecting against, or for damping, noise or other acoustic waves in general
    • G10K 11/175 Methods or devices for protecting against, or for damping, noise or other acoustic waves in general using interference effects; Masking sound
    • G10K 11/178 Methods or devices for protecting against, or for damping, noise or other acoustic waves in general using interference effects; Masking sound by electro-acoustically regenerating the original acoustic waves in anti-phase
    • G10K 11/1785 Methods, e.g. algorithms; Devices
    • G10K 11/17853 Methods, e.g. algorithms; Devices of the filter
    • G10K 11/17854 Methods, e.g. algorithms; Devices of the filter, the filter being an adaptive filter
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/08 Speech classification or search
    • G10L 15/16 Speech classification or search using artificial neural networks
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/20 Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L 21/0208 Noise filtering
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00 Road transport of goods or passengers
    • Y02T 10/10 Internal combustion engine [ICE] based vehicles
    • Y02T 10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Human Computer Interaction (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Mobile Radio Communication Systems (AREA)

Abstract

The invention provides a vehicle-mounted voice enhancement algorithm based on a deep belief network, comprising the following steps. Step 1: dividing the vehicle-mounted voice signal into a training sample signal and a test sample signal. Step 2: optimizing the learning rate, the initial weight and the number of hidden nodes of the DBN with a QPSO algorithm. Step 3: replacing the sigmoid function with a hyperbolic tangent (tanh) activation function to optimize the deep belief network model. Step 4: performing greedy layer-by-layer unsupervised learning on the optimized deep belief network to obtain abstract voice feature vectors of the input vehicle-mounted voice signal. Step 5: inputting the abstract voice signal into a least mean square (LMS) algorithm to obtain an enhanced voice signal. The invention combines the deep belief network with the traditional least mean square algorithm for voice enhancement, exploiting both the strong learning and feature-extraction capabilities of the deep belief network and the efficiency of the traditional voice enhancement algorithm.

Description

Vehicle-mounted voice enhancement algorithm based on deep belief network
Technical Field
The invention relates to voice enhancement technology, and in particular to a vehicle-mounted voice enhancement algorithm based on a deep belief network.
Background
In recent years, with rapid economic development, people's consumption level has risen steadily and the automobile has become a primary means of transport. Statistics show that in the first half of 2019 the number of motor vehicles nationwide reached 340 million, with 12.42 million newly registered automobiles and 14.08 million newly licensed drivers. With ever-increasing market competition, automotive manufacturers are constantly upgrading in-vehicle electronic devices, such as in-vehicle multimedia systems, in-vehicle navigation systems, in-vehicle hands-free systems and in-vehicle control systems, to meet the diverse demands of users during driving.
Ideally, the driver issues a voice command to control the in-vehicle electronic equipment, and the vehicle-mounted voice recognition system invokes the corresponding device according to the recognized content. However, a real vehicle-mounted environment contains various background noises, such as engine noise, tire noise, wind noise, air-conditioning noise, and human noise from the passenger compartment. At present, voice recognition accuracy in a quiet environment reaches about 98%, but in a real environment, particularly in a complex vehicle-mounted noise environment, the accuracy drops sharply.
Disclosure of Invention
The invention aims to provide a vehicle-mounted voice enhancement algorithm that maintains high voice recognition accuracy in a real environment, particularly in a complex vehicle-mounted noise environment.
The invention provides a vehicle-mounted voice enhancement algorithm based on a deep belief network, which comprises the following steps:
step 1: dividing the vehicle-mounted voice signal into a training sample signal and a test sample signal;
step 2: optimizing the learning rate, the initial weight and the number of hidden nodes of the DBN by adopting a QPSO algorithm;
QPSO denotes the quantum-behaved particle swarm optimization algorithm, and DBN denotes the deep belief network.
step 3: replacing the sigmoid function with a tanh activation function to optimize the deep belief network model;
step 4: performing greedy layer-by-layer unsupervised learning on the optimized deep belief network to obtain abstract voice feature vectors of the input vehicle-mounted voice signal;
step 5: inputting the abstract voice signal into a least mean square (LMS) algorithm to obtain an enhanced voice signal.
Further, step 2 includes training the DBN with a restricted Boltzmann machine; the restricted Boltzmann machine represents the current state of the system through energy, and the energy expression is as follows:

E(v, h \mid \theta) = -\sum_{i=1}^{n} a_i v_i - \sum_{j=1}^{m} b_j h_j - \sum_{i=1}^{n} \sum_{j=1}^{m} v_i W_{ij} h_j

where n represents the number of nodes of the visible layer v, m represents the number of nodes of the hidden layer h, a represents the bias of the visible layer, b represents the bias of the hidden layer, W_{ij} represents the weight from visible-layer node i to hidden-layer node j, and θ = {W, a, b} represents the set of all parameters of the system;
the probability distribution of the whole system is as follows:
P(v, h \mid \theta) = \frac{e^{-E(v, h \mid \theta)}}{\sum_{v, h} e^{-E(v, h \mid \theta)}}

the logarithmic derivative of the marginal probability distribution is taken using the following equation:

\frac{\partial \ln P(v \mid \theta)}{\partial \theta} = \left\langle \frac{\partial (-E(v, h \mid \theta))}{\partial \theta} \right\rangle_{\mathrm{data}} - \left\langle \frac{\partial (-E(v, h \mid \theta))}{\partial \theta} \right\rangle_{\mathrm{model}}

For the training samples, the subscript "data" denotes the distribution P(h | v, θ) and "model" denotes the distribution P(v, h | θ), where ⟨·⟩_P denotes the mathematical expectation with respect to the distribution P;

the logarithmic derivative formulas of the marginal probability distribution are expressed as:

\frac{\partial \ln P(v \mid \theta)}{\partial W_{ij}} = \langle v_i h_j \rangle_{\mathrm{data}} - \langle v_i h_j \rangle_{\mathrm{model}}

\frac{\partial \ln P(v \mid \theta)}{\partial a_i} = \langle v_i \rangle_{\mathrm{data}} - \langle v_i \rangle_{\mathrm{model}}

\frac{\partial \ln P(v \mid \theta)}{\partial b_j} = \langle h_j \rangle_{\mathrm{data}} - \langle h_j \rangle_{\mathrm{model}}

The RBM parameters are updated by adopting the following formula:

W_{ij}^{(k+1)} = W_{ij}^{(k)} + \varepsilon \left( \langle v_i h_j \rangle_{\mathrm{data}} - \langle v_i h_j \rangle_{\mathrm{model}} \right)

wherein k denotes the number of Gibbs sampling steps and ε denotes the learning rate;
RBM denotes a restricted Boltzmann machine;
optimizing the learning rate, the initial weight and the number of hidden nodes by adopting the following formulas:
Z_b = \frac{1}{Z} \sum_{i=1}^{Z} p_i

P = \mu p_i + (1 - \mu) p_j

X(t+1) = P \pm \alpha_p \left| Z_b - X(t) \right| \ln(1/u)

in the formulas: Z is the size of the population; μ and u are random numbers uniformly distributed on the interval [0,1]; Z_b is the mean point of the individual best positions of all particles; p_i is the individual best position of particle i and p_j is the global best position of the swarm; X(t) is the position of particle i at the t-th iteration; and α_p is the contraction-expansion factor.
Further, in step 3, the sigmoid function expression adopts the following formula:
f(x) = \frac{1}{1 + e^{-x}}
the tanh activation function expression adopts the following formula:
f(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}
the invention has the beneficial effects that: the deep belief network is combined with the traditional minimum mean square error algorithm to carry out voice enhancement, so that the strong learning capability and the feature extraction capability of the deep belief network are utilized, and the high efficiency of the traditional voice enhancement algorithm is combined. And (3) performing feature learning on the vehicle-mounted voice signal through a deep belief network, and inputting the feature learning into an LMS algorithm to obtain the optimal voice enhancement effect. Speech enhancement, as a pre-processing scheme, is an effective and necessary way to suppress noise and interference, providing convenience for subsequent speech recognition.
Drawings
FIG. 1 is a system block diagram of the present invention.
FIG. 2 is a flow chart of the QPSO algorithm.
Fig. 3 is a schematic diagram of an LMS algorithm adaptive noise canceller.
Detailed Description of the Embodiments
the following provides a more detailed description of the embodiments and the operation of the present invention with reference to the accompanying drawings.
As shown in FIG. 1, the invention provides a vehicle-mounted voice enhancement algorithm based on a deep belief network. The specific steps are as follows:
Step 1: Preprocess the acquired vehicle-mounted voice signals, i.e., remove the mean and normalize, and divide the samples into training samples and test samples.
Step 2: First, a deep belief network model is constructed; the DBN model consists of several RBMs and automatically extracts high-level features. The RBM represents the current state of the system through energy, with the energy expression:

E(v, h \mid \theta) = -\sum_{i=1}^{n} a_i v_i - \sum_{j=1}^{m} b_j h_j - \sum_{i=1}^{n} \sum_{j=1}^{m} v_i W_{ij} h_j

wherein n denotes the number of nodes of the visible layer v, m denotes the number of nodes of the hidden layer h, a and b denote the biases of the visible and hidden layers respectively, and W_{ij} denotes the weight from visible-layer node i to hidden-layer node j. The set of all parameters of the system is denoted by θ = {W, a, b}.
The probability distribution of the whole system is as follows:
P(v, h \mid \theta) = \frac{e^{-E(v, h \mid \theta)}}{\sum_{v, h} e^{-E(v, h \mid \theta)}}

The denominator is called the normalization factor; it keeps the system probability within the range [0,1]. From the above equation the marginal probability distribution is obtained, which is estimated by the maximum likelihood method; its logarithmic derivative is:

\frac{\partial \ln P(v \mid \theta)}{\partial \theta} = \left\langle \frac{\partial (-E(v, h \mid \theta))}{\partial \theta} \right\rangle_{\mathrm{data}} - \left\langle \frac{\partial (-E(v, h \mid \theta))}{\partial \theta} \right\rangle_{\mathrm{model}}

For the training samples, "data" denotes the distribution P(h | v, θ) and "model" denotes the distribution P(v, h | θ), where ⟨·⟩_P denotes the mathematical expectation with respect to the distribution P. Since θ = {W, a, b}, combining this with the derivative of the energy definition, the above equation can be expressed as:

\frac{\partial \ln P(v \mid \theta)}{\partial W_{ij}} = \langle v_i h_j \rangle_{\mathrm{data}} - \langle v_i h_j \rangle_{\mathrm{model}}

\frac{\partial \ln P(v \mid \theta)}{\partial a_i} = \langle v_i \rangle_{\mathrm{data}} - \langle v_i \rangle_{\mathrm{model}}

\frac{\partial \ln P(v \mid \theta)}{\partial b_j} = \langle h_j \rangle_{\mathrm{data}} - \langle h_j \rangle_{\mathrm{model}}

The log-likelihood is approximated by Gibbs sampling, and the gradient used to update the RBM parameters is obtained. The calculation expression is:

W_{ij}^{(k+1)} = W_{ij}^{(k)} + \varepsilon \left( \langle v_i h_j \rangle_{\mathrm{data}} - \langle v_i h_j \rangle_{\mathrm{model}} \right)

where k denotes the number of Gibbs sampling steps and ε denotes the learning rate.
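For concreteness, this update can be illustrated with a minimal NumPy sketch of a single CD-k step with k = 1. The code below is written for this description rather than taken from the patent, and its names (cd1_update, with lr standing for ε) are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0, W, a, b, lr=0.01):
    """One contrastive-divergence update (k = 1) of the RBM parameters {W, a, b}."""
    ph0 = sigmoid(v0 @ W + b)                  # P(h|v) on the data ("data" term)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    pv1 = sigmoid(h0 @ W.T + a)                # one Gibbs step: reconstruct v
    ph1 = sigmoid(pv1 @ W + b)                 # hidden probabilities ("model" term)
    n = v0.shape[0]
    W += lr * (v0.T @ ph0 - pv1.T @ ph1) / n   # <v_i h_j>_data - <v_i h_j>_model
    a += lr * (v0 - pv1).mean(axis=0)          # <v_i>_data - <v_i>_model
    b += lr * (ph0 - ph1).mean(axis=0)         # <h_j>_data - <h_j>_model
    return W, a, b

# Toy usage: n = 6 visible nodes, m = 4 hidden nodes, batch of 8 binary vectors.
W = 0.01 * rng.standard_normal((6, 4))
a, b = np.zeros(6), np.zeros(4)
v = (rng.random((8, 6)) < 0.5).astype(float)
W, a, b = cd1_update(v, W, a, b)
```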
The learning rate, the initial weights and the number of hidden-layer nodes are optimized with the QPSO algorithm; the calculation expressions are:

Z_b = \frac{1}{Z} \sum_{i=1}^{Z} p_i

P = \mu p_i + (1 - \mu) p_j

X(t+1) = P \pm \alpha_p \left| Z_b - X(t) \right| \ln(1/u)

in the formulas: Z is the size of the population; μ and u are random numbers uniformly distributed on the interval [0,1]; Z_b is the mean point of the individual best positions of all particles; p_i is the individual best position of particle i and p_j is the global best position of the swarm; X(t) is the position of particle i at the t-th iteration; and α_p is the contraction-expansion factor.
The flow of the QPSO algorithm is shown in FIG. 2; the process of optimizing the DBN with the QPSO algorithm can be described as follows (an illustrative code sketch follows this list):
1) Initialize the QPSO algorithm, including the particle positions and search ranges, the contraction-expansion factor, the number of iterations, etc.; the DBN learning rate, initial weights and number of hidden-layer nodes to be optimized are mapped to the particle positions;
2) Compute the fitness of each particle in the population to obtain the individual best position of each particle and the global best position of the population;
3) Compute the mean point of the individual best positions of all particles in the population, then update the particle positions;
4) Repeat 2)-3) until the iteration stop condition is met; the output optimization result gives the parameters of the DBN.
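As noted above, the following rough sketch implements steps 1)-4) for a generic fitness function. In the patent, each particle position would encode the DBN learning rate, initial weight and hidden-node count, and the fitness would be the DBN training error; all function and parameter names here are assumptions made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

def qpso(fitness, dim, n_particles=20, n_iter=100, alpha=0.75, lo=-5.0, hi=5.0):
    """Minimize `fitness` with quantum-behaved PSO; returns the global best position."""
    X = rng.uniform(lo, hi, (n_particles, dim))      # particle positions
    pbest = X.copy()                                 # individual best positions p_i
    pbest_f = np.array([fitness(x) for x in X])
    g = int(pbest_f.argmin())
    gbest, gbest_f = pbest[g].copy(), pbest_f[g]     # global best position
    for _ in range(n_iter):
        zb = pbest.mean(axis=0)                      # mean best point Z_b
        for i in range(n_particles):
            mu = rng.random(dim)
            u = np.maximum(rng.random(dim), 1e-12)   # avoid log(1/0)
            P = mu * pbest[i] + (1.0 - mu) * gbest   # local attractor
            sign = np.where(rng.random(dim) < 0.5, 1.0, -1.0)
            X[i] = np.clip(P + sign * alpha * np.abs(zb - X[i]) * np.log(1.0 / u),
                           lo, hi)
            f = fitness(X[i])
            if f < pbest_f[i]:
                pbest[i], pbest_f[i] = X[i].copy(), f
                if f < gbest_f:
                    gbest, gbest_f = X[i].copy(), f
    return gbest

# Toy usage: minimize the sphere function in 3 dimensions.
best = qpso(lambda x: float(np.sum(x ** 2)), dim=3)
```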
Step 3: Set the basic parameters of the DBN according to the dimension of the voice signal, the size of the sample set and the result optimized by the QPSO algorithm in step 2, including the number of hidden-layer units, the number of model layers, the number of training epochs, the batch size, the momentum, the learning rate, the penalty rate, the initial biases and the initial weights. The traditional sigmoid function, whose expression is:

f(x) = \frac{1}{1 + e^{-x}}

is replaced with the tanh activation function, whose expression is:

f(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}

This substitution alleviates the vanishing-gradient problem of the traditional DBN activation function and effectively improves the convergence and stability of the network.
Step 4: High-level representations of the input vehicle-mounted voice signals are obtained by layer-by-layer feature extraction, and the weights of the network connections are optimized. First, an unsupervised training mode is adopted and the RBMs are trained layer by layer, preserving the features of the voice signals as much as possible; then the network is fine-tuned by back-propagation to learn ideal high-level abstract voice feature signals. The whole process is as follows (see the sketch after this list):
1) Parameter initialization. Set the parameters according to step 3, including the weight matrix, the visible-layer bias vector, the hidden-layer bias vector, the number of sampling steps, the number of iterations, the learning rate, etc.
2) Parameter update. Perform Gibbs sampling several times with the contrastive divergence algorithm, and update the parameters with the parameter update formula.
3) Layer-by-layer training. Train each RBM layer by layer until all RBMs are trained.
4) Fine-tuning. Adjust the weights and biases of each network layer using the error back-propagation mechanism of the network model.
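The sketch referenced above illustrates steps 1)-3), i.e., greedy layer-wise CD-1 pretraining in which each RBM's hidden output feeds the next RBM; the back-propagation fine-tuning of step 4) is omitted for brevity. It is an illustrative reconstruction with arbitrary sizes and names, not code from the patent.

```python
import numpy as np

rng = np.random.default_rng(2)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm(data, n_hidden, lr=0.01, epochs=10):
    """Train one RBM with CD-1 and return its parameters (W, a, b)."""
    n_visible = data.shape[1]
    W = 0.01 * rng.standard_normal((n_visible, n_hidden))
    a, b = np.zeros(n_visible), np.zeros(n_hidden)
    for _ in range(epochs):
        ph0 = sigmoid(data @ W + b)
        h0 = (rng.random(ph0.shape) < ph0).astype(float)
        pv1 = sigmoid(h0 @ W.T + a)               # reconstruction
        ph1 = sigmoid(pv1 @ W + b)
        n = data.shape[0]
        W += lr * (data.T @ ph0 - pv1.T @ ph1) / n
        a += lr * (data - pv1).mean(axis=0)
        b += lr * (ph0 - ph1).mean(axis=0)
    return W, a, b

def pretrain_dbn(data, layer_sizes):
    """Greedy layer-wise pretraining: stack RBMs, feeding features upward."""
    params, x = [], data
    for n_hidden in layer_sizes:
        W, a, b = train_rbm(x, n_hidden)
        params.append((W, a, b))
        x = sigmoid(x @ W + b)    # abstract features passed to the next layer
    return params, x              # x: high-level abstract feature vectors

# Toy usage: 32 "frames" of 16-dimensional features through a 3-layer DBN.
frames = rng.random((32, 16))
params, features = pretrain_dbn(frames, layer_sizes=[12, 8, 4])
```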
Step 5: Input the high-level abstract voice signal into the adaptive filtering algorithm to perform voice enhancement. In the schematic diagram of FIG. 3, s(n) denotes the original voice signal; x(n) denotes the noisy voice signal; y(n) denotes the output signal of the filter; v(n) denotes the noise signal; v_0(n) and v_1(n) denote two noise signals, each uncorrelated with the speech; e(n) denotes the error signal, expressed as follows:
e(n) = x(n) - y(n) = s(n) + v_0(n) - y(n)
squaring both sides and then taking the mathematical expectation, one can obtain
E[e^2(n)] = E[s^2(n)] + E[(v_0(n) - y(n))^2] + 2E[s(n)(v_0(n) - y(n))]
The adaptive process of the LMS algorithm automatically adjusts the tap weights W(n) of the filter so that the error E[e^2(n)] reaches a minimum. Since s(n) and v_0(n) are independent of each other, minimizing E[e^2(n)] only requires minimizing E[(v_0(n) - y(n))^2].
The iterative formulas of the least mean square (LMS) algorithm based on the steepest descent method are:

\nabla(n) = \frac{\partial E[e^2(n)]}{\partial W(n)}

W(n+1) = W(n) - \mu \nabla(n)

so that one obtains:

W(n+1) = W(n) + 2\mu e(n) X(n)
where μ denotes a step factor.
The order of the adaptive filter is M, the filter coefficient vector is F, and the input signal sequence is X; the output is:

y(n) = \sum_{i=0}^{M-1} w_i x(n - i)

e(n) = d(n) - y(n)
from which one obtains:

e(n) = d(n) - \sum_{i=0}^{M-1} w_i x(n - i)
let F be [ w ]0w1…wM-1]T,Xj=[x1jx2j...xnj]Then the output of the filter can be written in matrix form:
Figure BDA0002518600590000086
Figure BDA0002518600590000087
the cost function is defined as:
J(F) = E[e_j^2]
when the cost function in the above equation is minimized, it is considered that optimal filtering is achieved, and such adaptive filtering becomes least mean square adaptive filtering.
For least mean square adaptive filtering, the filter coefficients that minimize the mean square error need to be determined, and gradient descent methods are generally used to solve such problems. The iterative formula of the filter coefficient vector is:
F_{j+1} = F_j - \frac{\mu}{2} \nabla_j

wherein

\nabla_j = \frac{\partial E[e_j^2]}{\partial F_j}

is the gradient of the cost function.
Since the instantaneous gradient -2 e_j X_j is an unbiased estimate of the true gradient, in practical applications the instantaneous gradient can be used instead of the true gradient, that is:

\hat{\nabla}_j = -2 e_j X_j

F_{j+1} = F_j + \mu e_j X_j
Through successive iterations, the optimal filter coefficients are obtained, realizing adaptive filtering of the input vehicle-mounted voice signal for voice enhancement.
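The derivation above corresponds to the short LMS sketch below (illustrative code written for this description, not taken from the patent): F is the coefficient vector of order M, X_j holds the most recent M input samples, and the update is F_{j+1} = F_j + μ e_j X_j.

```python
import numpy as np

def lms_filter(x, d, order=8, mu=0.01):
    """LMS adaptive filter: y_j = F^T X_j, e_j = d_j - y_j, F += mu * e_j * X_j.

    x: filter input sequence; d: desired signal; returns (y, e, F).
    """
    n = len(x)
    F = np.zeros(order)                        # F = [w_0, w_1, ..., w_{M-1}]^T
    y, e = np.zeros(n), np.zeros(n)
    for j in range(order - 1, n):
        X_j = x[j - order + 1 : j + 1][::-1]   # current and previous M-1 samples
        y[j] = F @ X_j                         # filter output
        e[j] = d[j] - y[j]                     # error signal
        F += mu * e[j] * X_j                   # instantaneous-gradient update
    return y, e, F
```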
The specific steps of the LMS-based voice noise reduction algorithm are as follows (an illustrative usage sketch follows the steps):
1) First, the voice signal x(n) containing noise is expressed as:

x(n) = s(n) + v_0(n)
2) Then, the bionic wavelet transform is applied to the noisy voice signal x(n) to obtain the noisy wavelet coefficients, as follows:

W_x(a, \tau) = W_s(a, \tau) + W_{v_0}(a, \tau)

wherein W_x(a, τ) represents the coefficients produced after the noisy voice signal is transformed using the bionic wavelet.
3) Adaptive filtering processing. The input signal of the adaptive canceller is N = [n_0, n_1, \ldots, n_{M-1}]^{\mathrm{T}}, and the output signal of the filter is:

y = F^{\mathrm{T}} N
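Putting the pieces together, the canceller can be exercised on synthetic data as sketched below; all signals and names are illustrative, and the bionic wavelet stage is omitted. The reference noise v1(n) drives the filter, the primary input x(n) = s(n) + v0(n) serves as the desired signal, and the error e(n) approximates the clean speech s(n).

```python
import numpy as np

rng = np.random.default_rng(3)

def lms_cancel(ref, primary, order=16, mu=0.005):
    """Adaptive noise canceller: e(n) = primary(n) - F^T ref_window -> clean speech."""
    n, F = len(ref), np.zeros(order)
    e = np.zeros(n)
    for j in range(order - 1, n):
        X_j = ref[j - order + 1 : j + 1][::-1]
        e[j] = primary[j] - F @ X_j
        F += mu * e[j] * X_j
    return e

t = np.arange(4000)
s = np.sin(2 * np.pi * 0.01 * t)                    # stand-in "speech" s(n)
v1 = 0.5 * rng.standard_normal(t.size)              # reference noise v1(n)
v0 = np.convolve(v1, [0.6, 0.3, 0.1])[: t.size]     # noise path to the primary mic
x = s + v0                                          # primary input x(n) = s(n) + v0(n)
enhanced = lms_cancel(v1, x)                        # e(n) approximates s(n)
```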
according to the method, after noise filtering is carried out through a vehicle-mounted voice enhancement algorithm based on a deep belief network, an enhanced vehicle-mounted voice signal is output, and finally, the calculated signal-to-noise ratio and the calculated PESQ value are improved compared with those of a traditional voice enhancement algorithm. The algorithm can effectively eliminate the noise of the original voice signal, furthest reserve the information of the original voice signal and effectively enhance the processed voice signal.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (3)

1. A vehicle-mounted voice enhancement algorithm based on a deep belief network is characterized by comprising the following steps:
step 1: dividing the vehicle-mounted voice signal into a training sample signal and a test sample signal;
step 2: optimizing the learning rate, the initial weight and the number of hidden nodes of the DBN by adopting a QPSO algorithm;
step 3: replacing the sigmoid function with a tanh activation function to optimize the deep belief network model;
step 4: performing greedy layer-by-layer unsupervised learning on the optimized deep belief network to obtain abstract voice feature vectors of the input vehicle-mounted voice signal;
step 5: inputting the abstract voice signal into a least mean square (LMS) algorithm to obtain an enhanced voice signal.
2. The vehicle-mounted speech enhancement algorithm based on the deep belief network according to claim 1, characterized in that: the step 2 comprises training the DBN by adopting a restricted Boltzmann machine;
the restricted Boltzmann machine represents the current state of the system through energy, and the energy expression is as follows:
E(v, h \mid \theta) = -\sum_{i=1}^{n} a_i v_i - \sum_{j=1}^{m} b_j h_j - \sum_{i=1}^{n} \sum_{j=1}^{m} v_i W_{ij} h_j

where n represents the number of nodes of the visible layer v, m represents the number of nodes of the hidden layer h, a represents the bias of the visible layer, b represents the bias of the hidden layer, W_{ij} represents the weight from visible-layer node i to hidden-layer node j, and θ = {W, a, b} represents the set of all parameters of the system;
the probability distribution of the whole system is as follows:
P(v, h \mid \theta) = \frac{e^{-E(v, h \mid \theta)}}{\sum_{v, h} e^{-E(v, h \mid \theta)}}

the logarithmic derivative of the marginal probability distribution is taken using the following equation:

\frac{\partial \ln P(v \mid \theta)}{\partial \theta} = \left\langle \frac{\partial (-E(v, h \mid \theta))}{\partial \theta} \right\rangle_{\mathrm{data}} - \left\langle \frac{\partial (-E(v, h \mid \theta))}{\partial \theta} \right\rangle_{\mathrm{model}}
for the training samples, "data" is used to represent the distribution P(h | v, θ), and "model" is used to represent the distribution P(v, h | θ), where ⟨·⟩_P represents the mathematical expectation with respect to the distribution P;

the logarithmic derivative formulas of the marginal probability distribution are expressed as:

\frac{\partial \ln P(v \mid \theta)}{\partial W_{ij}} = \langle v_i h_j \rangle_{\mathrm{data}} - \langle v_i h_j \rangle_{\mathrm{model}}

\frac{\partial \ln P(v \mid \theta)}{\partial a_i} = \langle v_i \rangle_{\mathrm{data}} - \langle v_i \rangle_{\mathrm{model}}

\frac{\partial \ln P(v \mid \theta)}{\partial b_j} = \langle h_j \rangle_{\mathrm{data}} - \langle h_j \rangle_{\mathrm{model}}
the RBM parameter is obtained by adopting the following formula:
W_{ij}^{(k+1)} = W_{ij}^{(k)} + \varepsilon \left( \langle v_i h_j \rangle_{\mathrm{data}} - \langle v_i h_j \rangle_{\mathrm{model}} \right)

wherein k represents the number of Gibbs sampling steps and ε represents the learning rate;
Optimizing the learning rate, the initial weight and the number of hidden nodes by adopting the following formulas:
Z_b = \frac{1}{Z} \sum_{i=1}^{Z} p_i

P = \mu p_i + (1 - \mu) p_j

X(t+1) = P \pm \alpha_p \left| Z_b - X(t) \right| \ln(1/u)

in the formulas: Z is the size of the population; μ and u are random numbers uniformly distributed on the interval [0,1]; Z_b is the mean point of the individual best positions of all particles; p_i is the individual best position of particle i and p_j is the global best position of the swarm; X(t) is the position of particle i at the t-th iteration; and α_p is the contraction-expansion factor.
3. The vehicle-mounted speech enhancement algorithm based on the deep belief network according to claim 1, characterized in that: in the step 3, the sigmoid function expression adopts the following formula:
f(x) = \frac{1}{1 + e^{-x}}
the tanh activation function expression adopts the following formula:
f(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}
CN202010484415.6A 2020-06-01 2020-06-01 Vehicle-mounted voice enhancement algorithm based on deep belief network Pending CN111653272A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010484415.6A CN111653272A (en) 2020-06-01 2020-06-01 Vehicle-mounted voice enhancement algorithm based on deep belief network


Publications (1)

Publication Number Publication Date
CN111653272A true CN111653272A (en) 2020-09-11

Family

ID=72352034

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010484415.6A Pending CN111653272A (en) 2020-06-01 2020-06-01 Vehicle-mounted voice enhancement algorithm based on deep belief network

Country Status (1)

Country Link
CN (1) CN111653272A (en)



Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106024001A (en) * 2016-05-03 2016-10-12 电子科技大学 Method used for improving speech enhancement performance of microphone array
CN109492746A (en) * 2016-09-06 2019-03-19 青岛理工大学 Deepness belief network parameter optimization method based on GA-PSO Hybrid Algorithm
CN107358966A (en) * 2017-06-27 2017-11-17 北京理工大学 Based on deep learning speech enhan-cement without reference voice quality objective evaluation method
CN108615533A (en) * 2018-03-28 2018-10-02 天津大学 A kind of high-performance sound enhancement method based on deep learning
CN109086817A (en) * 2018-07-25 2018-12-25 西安工程大学 A kind of Fault Diagnosis for HV Circuit Breakers method based on deepness belief network
CN109671433A (en) * 2019-01-10 2019-04-23 腾讯科技(深圳)有限公司 A kind of detection method and relevant apparatus of keyword

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
朱萌 (Zhu Meng), "Big-data-oriented state information mining and fault diagnosis of high-voltage circuit breakers", China Master's Theses Full-text Database, Engineering Science and Technology II *
王伟军 (Wang Weijun) et al., "A speech noise reduction method based on adaptive filtering", Modern Electronics Technique *
王涛 (Wang Tao), "Research on speech noise reduction processing technology", China Master's Theses Full-text Database, Information Science and Technology *
邢传玺 (Xing Chuanxi), 宋扬 (Song Yang), "Shallow-Sea Environmental Parameter Inversion and Acoustic Signal Processing Technology", 30 June 2018, Beijing Institute of Technology Press *
黄家华 (Huang Jiahua) et al., "TE process fault diagnosis based on a parameter-optimized deep belief network", Abstracts of the 30th Chinese Process Control Conference (CPCC 2019) *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112687294A (en) * 2020-12-21 2021-04-20 重庆科技学院 Vehicle-mounted noise identification method

Similar Documents

Publication Publication Date Title
CN109841226B (en) Single-channel real-time noise reduction method based on convolution recurrent neural network
CN108766419B (en) Abnormal voice distinguishing method based on deep learning
Droppo et al. Evaluation of the SPLICE algorithm on the Aurora2 database.
CN110867181A (en) Multi-target speech enhancement method based on SCNN and TCNN joint estimation
Dharanipragada et al. A nonlinear unsupervised adaptation technique for speech recognition.
CN112331224A (en) Lightweight time domain convolution network voice enhancement method and system
Hilger et al. Quantile based histogram equalization for noise robust speech recognition
CN108735199B (en) Self-adaptive training method and system of acoustic model
CN110634476B (en) Method and system for rapidly building robust acoustic model
CN113936681B (en) Speech enhancement method based on mask mapping and mixed cavity convolution network
WO2022012206A1 (en) Audio signal processing method, device, equipment, and storage medium
WO2005098820A1 (en) Speech recognition device and speech recognition method
CN109346084A (en) Method for distinguishing speek person based on depth storehouse autoencoder network
CN111899757A (en) Single-channel voice separation method and system for target speaker extraction
CN113539293B (en) Single-channel voice separation method based on convolutional neural network and joint optimization
CN112270405A (en) Filter pruning method and system of convolution neural network model based on norm
CN113763965A (en) Speaker identification method with multiple attention characteristics fused
CN111653272A (en) Vehicle-mounted voice enhancement algorithm based on deep belief network
CN114678030A (en) Voiceprint identification method and device based on depth residual error network and attention mechanism
CN111798828A (en) Synthetic audio detection method, system, mobile terminal and storage medium
CN111091809B (en) Regional accent recognition method and device based on depth feature fusion
CN114863938A (en) Bird language identification method and system based on attention residual error and feature fusion
Xu et al. Robust speech recognition based on noise and SNR classification-a multiple-model framework.
CN109871448B (en) Short text classification method and system
CN111667836B (en) Text irrelevant multi-label speaker recognition method based on deep learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20200911)