CN112259113A

CN112259113A - Preprocessing system for improving accuracy rate of speech recognition in vehicle and control method thereof

Info

Publication number: CN112259113A
Application number: CN202011060176.8A
Authority: CN
Inventors: 彭博; 姜彦吉; 琚林锋; 范佳亮; 郑四发
Original assignee: Suzhou Automotive Research Institute of Tsinghua University
Current assignee: Suzhou Automotive Research Institute of Tsinghua University
Priority date: 2020-09-30
Filing date: 2020-09-30
Publication date: 2021-01-22

Abstract

The invention discloses a preprocessing system for improving the accuracy of in-vehicle speech recognition and a control method thereof. The control method comprises the following steps: vibration sensors collect original vibration signals at different positions outside the vehicle, and in-vehicle microphones collect the sound signal inside the vehicle; a transfer channel model is established through a convolutional neural network noise reduction model and a recurrent neural network model to form a mapping between the original vibration signals and the in-vehicle noise signal, and a modeled cancellation signal is output; and the in-vehicle noise signal is removed from the sound signal according to the modeled cancellation signal to obtain a noise-reduced residual signal, which serves as the input signal of the speech recognition system. The preprocessing system performs noise reduction and enhancement on the speech signal at the front end of the speech recognition system, which improves recognition accuracy, and it uses the microphone arrangement to locate the passenger who is speaking, so that the control requirements of different passengers can be met.

Description

Preprocessing system for improving accuracy rate of speech recognition in vehicle and control method thereof

Technical Field

The invention relates to the field of voice noise reduction, in particular to a preprocessing system for improving the accuracy rate of voice recognition in a vehicle and a control method thereof.

Background

Voice interaction is the most convenient and efficient way to control an on-board system while driving, and in-vehicle speech recognition and control has become a key direction for future human-vehicle interaction. When a vehicle is moving, the main sources of in-vehicle background noise are road noise, engine noise and wind noise. This non-stationary, time-varying noise can seriously degrade the performance of a speech recognition system and reduce its recognition accuracy, so a preprocessing system is needed to denoise the noisy speech signal. In addition, existing on-board speech recognition systems cannot localize and distinguish speech signals within the vehicle, so they cannot provide targeted voice control for occupants in different seats, which also degrades the occupants' experience.

Existing noise reduction approaches for in-vehicle speech can be divided into single-channel and multi-channel speech enhancement. Single-channel enhancement is generally based on voice activity detection or on statistical models, and it improves recognition accuracy in the complex in-vehicle environment by making the speech recognition algorithm more robust. However, it distorts the speech signal while reducing noise, so it is usually applied to stationary noise and performs poorly on non-stationary noise. Multi-channel speech enhancement arranges several microphones in the vehicle to form a microphone array that receives the sound signal and performs noise reduction, which gives a better recognition result. It can handle non-stationary noise to a certain extent, but problems such as unsatisfactory noise reduction performance and poor noise tracking remain.

Disclosure of Invention

In order to overcome the defects of the prior art, the invention provides a preprocessing system for improving the accuracy of in-vehicle speech recognition and a control method thereof. The technical scheme is as follows:

In one aspect, the invention provides a control method of a preprocessing system for improving the accuracy of in-vehicle speech recognition, which comprises the following steps:

S1, collecting original vibration signals at different positions outside the vehicle, and collecting the current in-vehicle sound signal, wherein the in-vehicle sound signal comprises an in-vehicle noise signal and an in-vehicle speech signal;

S2, performing real-time feature learning on the original vibration signals through a convolutional neural network noise reduction model to obtain a corresponding feature vector, and outputting the feature vector to a recurrent neural network model;

S3, establishing a transfer channel model through the recurrent neural network model to form a mapping between the original vibration signals and the in-vehicle noise signal, and outputting a modeled cancellation signal;

S4, removing the in-vehicle noise signal from the in-vehicle sound signal according to the cancellation signal to obtain a noise-reduced speech signal, which is used as the input signal of a speech recognition system.

Further, in step S2, the convolutional neural network noise reduction model has a six-layer structure, which includes, in order from input to output, a first convolutional layer, a second convolutional layer, a maximum pooling layer, a third convolutional layer, an average pooling layer, and a Dropout layer, and a feature vector output is obtained after inputting the signal data matrix.

Further, in step S3, outputting the modeled cancellation signal includes the following steps:

S31, obtaining a cancellation signal through the recurrent neural network model;

S32, comparing the cancellation signal with the in-vehicle noise signal to obtain a residual error signal;

S33, when the residual error signal exceeds a set value, executing S34; when the residual error signal does not exceed the set value, directly outputting the cancellation signal;

S34, updating the model parameters of the recurrent neural network model according to the residual error signal, and executing S31 to S33 again.
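
The control flow of steps S31 to S34 can be sketched as a simple loop. The sketch below is illustrative only: the names `cancel_with_threshold` and `GainModel`, the use of the summed squared residual as the quantity compared with the set value, and the single-gain model standing in for the recurrent neural network are all assumptions that do not appear in the original text.

```python
from typing import Callable, List, Sequence, Tuple


def cancel_with_threshold(
    predict: Callable[[], List[float]],
    update: Callable[[List[float]], None],
    noise: Sequence[float],
    set_value: float,
    max_rounds: int = 1000,
) -> Tuple[List[float], int]:
    """Repeat S31 to S34 until the residual energy no longer exceeds set_value."""
    cancellation = predict()                                      # S31
    for rounds in range(max_rounds):
        residual = [c - d for c, d in zip(cancellation, noise)]   # S32
        if sum(r * r for r in residual) <= set_value:             # S33: output directly
            return cancellation, rounds
        update(residual)                                          # S34
        cancellation = predict()                                  # back to S31
    return cancellation, max_rounds


class GainModel:
    """Hypothetical stand-in for the recurrent network: a single scalar gain."""

    def __init__(self, reference: Sequence[float], step: float = 0.5) -> None:
        self.reference = list(reference)
        self.gain = 0.0
        self.step = step

    def predict(self) -> List[float]:
        return [self.gain * v for v in self.reference]

    def update(self, residual: Sequence[float]) -> None:
        # Gradient step on the mean squared residual (factor 2 absorbed in step).
        grad = sum(r * v for r, v in zip(residual, self.reference)) / len(self.reference)
        self.gain -= self.step * grad


vibration = [1.0, -1.0, 0.5, -0.5]            # made-up reference samples
cabin_noise = [0.5 * v for v in vibration]    # made-up transfer path: gain 0.5
model = GainModel(vibration)
cancellation, rounds = cancel_with_threshold(
    model.predict, model.update, cabin_noise, set_value=1e-6
)
print(rounds, round(model.gain, 3))
```

The loop returns as soon as the residual falls to the set value, so a model that already tracks the noise is never updated unnecessarily.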

Further, in step S1, there are a plurality of in-vehicle microphones simultaneously picking up sound signals of different positions in the current vehicle.

Further, before step S2, acoustic features of the in-vehicle speech signals at different positions in the vehicle are extracted by a convolutional neural network positioning model, and the in-vehicle space is divided into a plurality of regions according to the extracted acoustic features.

Further, the acoustic characteristics include intensity and duration.

Furthermore, the convolutional neural network positioning model is of a five-layer structure and sequentially comprises a first convolutional layer, a maximum pooling layer, a second convolutional layer, an average pooling layer and a Softmax layer from input to output, and a vector output is obtained after a signal data matrix is input.

Further, the convolutional neural network positioning model takes the energy and time difference of the sound signals collected by each in-vehicle microphone as learning features, and obtains the probability of the in-vehicle speaker in each in-vehicle area through the time domain signals in the sound signals.

In another aspect, the invention provides a preprocessing system for improving the accuracy of in-vehicle speech recognition, comprising:

vehicle exterior vibration sensors, which are distributed at different positions outside the vehicle, are connected to a digital signal transmission line, and are used for collecting original vibration signals outside the vehicle;

in-vehicle microphones, which are distributed at different positions in the vehicle, are connected to the digital signal transmission line, and are used for collecting sound signals in the vehicle; and

a speech signal preprocessing module connected to the digital signal transmission line, wherein the vehicle exterior vibration sensors and the in-vehicle microphones output the collected signals to the speech signal preprocessing module through the digital signal transmission line, the module performs noise reduction on the sound signals collected by the in-vehicle microphones through a convolutional neural network noise reduction model and a recurrent neural network model, and the module can determine the position of the speaker from the sound signals collected by the in-vehicle microphones through a convolutional neural network positioning model.

Further, the vehicle exterior vibration sensors include a first vehicle exterior vibration sensor, a second vehicle exterior vibration sensor and a third vehicle exterior vibration sensor. The first vehicle exterior vibration sensor is arranged at the engine compartment of the vehicle and collects engine noise; the second vehicle exterior vibration sensor is arranged near a vehicle tire and collects road noise generated when the vehicle is running; the third vehicle exterior vibration sensor is arranged at the exterior rear-view mirror and collects wind noise formed at the rear-view mirror when the vehicle is running.

The technical scheme provided by the invention has the following beneficial effects:

a. the self-adaptive control of the acquired noise signals is realized, and the noise is reduced before the voice recognition is carried out, so that the recognition rate of a recognition system is improved;

b. realizing the personalized voice recognition control and service of passengers at different riding positions in the vehicle.

Drawings

In order to illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.

FIG. 1 is a schematic structural diagram of a preprocessing system for improving accuracy of speech recognition in a vehicle according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of the distribution of components of the preprocessing system for improving the accuracy of speech recognition in a vehicle according to an embodiment of the present invention;

FIG. 3 is a schematic diagram of a transmission path of a speech noise reduction signal of the preprocessing system according to the embodiment of the present invention for improving the accuracy of speech recognition in a vehicle;

FIG. 4 is a schematic diagram of an online learning and noise reduction process of the preprocessing system for improving the accuracy of speech recognition in the vehicle according to the embodiment of the present invention;

FIG. 5 is a schematic input/output diagram of a neural network noise reduction model of a preprocessing system for improving accuracy of speech recognition in a vehicle according to an embodiment of the present invention;

FIG. 6 is a schematic diagram of a speech positioning process of the preprocessing system for improving accuracy of in-vehicle speech recognition according to the embodiment of the present invention;

FIG. 7 is a schematic diagram of a transmission path of a speech positioning signal of the preprocessing system for improving accuracy of speech recognition in a vehicle according to an embodiment of the present invention;

FIG. 8 is a schematic flow chart of a neural network positioning model of the preprocessing system for improving the accuracy of speech recognition in a vehicle according to the embodiment of the present invention.

Reference numerals: 11, first vehicle exterior vibration sensor; 12, second vehicle exterior vibration sensor; 13, third vehicle exterior vibration sensor; 2, in-vehicle microphone; 3, speech signal preprocessing module; 4, digital signal transmission line; 5, speech recognition system; 6, passenger position.

Detailed Description

In order that those skilled in the art may better understand the technical solutions of the present invention, the technical solutions of the embodiments are described below clearly and completely with reference to the accompanying drawings. Implementations not shown or described in the drawings take forms known to those of ordinary skill in the art. While examples of parameters with particular values may be provided herein, the parameters need not equal those values exactly and may approximate them within acceptable error margins or design constraints. The described embodiments are only some of the embodiments of the invention, not all of them. All other embodiments that a person skilled in the art can derive from the embodiments given herein without creative effort fall within the protection scope of the present invention. In addition, the terms "comprises" and "comprising," and any variations thereof, in the description and claims of this invention are intended to cover a non-exclusive inclusion: a process, method, system, article or apparatus that comprises a list of steps or elements is not necessarily limited to the steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article or apparatus.

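In an embodiment of the present invention, a control method of a preprocessing system for improving the accuracy of in-vehicle speech recognition is provided. With reference to FIG. 1 to FIG. 4, the method includes the following steps:

S1, collecting original vibration signals at different positions outside the vehicle, and collecting the current in-vehicle sound signal, wherein the in-vehicle sound signal comprises an in-vehicle noise signal and an in-vehicle speech signal;

S2, performing real-time feature learning on the original vibration signals through a convolutional neural network noise reduction model to obtain a corresponding feature vector, and outputting the feature vector to a recurrent neural network model;

S3, establishing a transfer channel model through the recurrent neural network model to form a mapping between the original vibration signals and the in-vehicle noise signal, and outputting a modeled cancellation signal;

S4, removing the in-vehicle noise signal from the in-vehicle sound signal according to the cancellation signal to obtain a noise-reduced speech signal, which is used as the input signal of a speech recognition system.

In step S3, outputting the modeled cancellation signal includes the following steps:

S31, obtaining a cancellation signal through the recurrent neural network model;

S32, comparing the cancellation signal with the in-vehicle noise signal to obtain a residual error signal;

S33, when the residual error signal exceeds a set value, executing S34; when the residual error signal does not exceed the set value, directly outputting the cancellation signal;

S34, updating the model parameters of the recurrent neural network model according to the residual error signal, and executing S31 to S33 again.

In step S3, the model parameters of the recurrent neural network model are updated using a minimum mean square error algorithm. In step S2, the convolutional neural network noise reduction model has a six-layer structure which comprises, in order from input to output, a first convolutional layer, a second convolutional layer, a max pooling layer, a third convolutional layer, an average pooling layer and a Dropout layer; a feature vector output is obtained after the signal data matrix is input.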
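
To make the six-layer structure concrete, the following sketch walks a 3 × n input through the layers and records the channel count and sequence length after each one. The kernel counts of 80, 50 and 20 are the preferred values given later in this description; the kernel width of 3, the pooling window of 2, stride-1 convolution without padding, and the function names are assumptions chosen only for illustration.

```python
from typing import List, Tuple


def conv_len(length: int, kernel: int) -> int:
    """Output length of a stride-1 one-dimensional convolution without padding."""
    return length - kernel + 1


def pool_len(length: int, window: int) -> int:
    """Output length of non-overlapping pooling (stride equal to the window)."""
    return length // window


def noise_cnn_shapes(n: int, kernel: int = 3, pool: int = 2) -> List[Tuple[str, int, int]]:
    """Return (layer, channels, length) after each of the six layers."""
    shapes = []
    channels, length = 3, n                     # 3 x n input: one row per noise source
    for name, out_channels in (("conv1", 80), ("conv2", 50)):
        channels, length = out_channels, conv_len(length, kernel)
        shapes.append((name, channels, length))
    length = pool_len(length, pool)
    shapes.append(("max_pool", channels, length))
    channels, length = 20, conv_len(length, kernel)
    shapes.append(("conv3", channels, length))
    length = pool_len(length, pool)
    shapes.append(("avg_pool", channels, length))
    shapes.append(("dropout", channels, length))   # dropout leaves the shape unchanged
    return shapes


shapes = noise_cnn_shapes(64)
for layer in shapes:
    print(layer)
```

With these assumed sizes, 64 input samples per channel shrink to a 20 × 14 feature map that is handed to the recurrent neural network model.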

In an embodiment of the present invention, the control method of the speech signal preprocessing system includes a speech positioning method in addition to the speech noise reduction method. On the basis of the above embodiment, a plurality of in-vehicle microphones simultaneously collect the current in-vehicle sound signal, and feature learning is performed through a neural network model before speech positioning. Specifically, acoustic features of the in-vehicle speech signals at different positions in the vehicle are extracted through a convolutional neural network positioning model, and the in-vehicle space is divided into a plurality of regions according to the extracted acoustic features, which comprise intensity and duration. For example, the in-vehicle space is divided into 4 regions according to the distribution of passenger positions, so that each region corresponds to a different range of signal features received by the in-vehicle microphones; generally one seat corresponds to one region. The convolutional neural network positioning model takes the energy and time difference of the sound signals collected by each in-vehicle microphone as learning features, and obtains the probability that the speaker is in each seat from the time-domain signals in the sound signals. The convolutional neural network positioning model has a five-layer structure which comprises, in order from input to output, a first convolutional layer, a max pooling layer, a second convolutional layer, an average pooling layer and a Softmax layer; after a signal data matrix is input, a vector output is obtained that contains the probability of the speaker being in each seat.

In an embodiment of the present invention, a preprocessing system for improving the accuracy of in-vehicle speech recognition based on the above control method is provided. Referring to FIG. 1 to FIG. 3, the system includes vehicle exterior vibration sensors (preferably digital microphones), in-vehicle microphones 2 and a speech signal preprocessing module 3. The vehicle exterior vibration sensors are distributed at different positions outside the vehicle, are connected to a digital signal transmission line 4, and collect original vibration signals outside the vehicle. Preferably, they include a first vehicle exterior vibration sensor 11, a second vehicle exterior vibration sensor 12 and a third vehicle exterior vibration sensor 13. The first vehicle exterior vibration sensor 11 is arranged on the engine of the vehicle and collects the engine vibration signal. The second vehicle exterior vibration sensors 12 are arranged near the four tire frames of the vehicle, one sensor near each tire frame, and collect the road noise generated when the vehicle is running. The third vehicle exterior vibration sensor 13 is arranged at the exterior rear-view mirror and collects the wind noise generated at the rear-view mirror when the vehicle is running.

The in-vehicle microphones 2 are distributed at different positions in the vehicle, are connected to the digital signal transmission line 4, and collect the sound signals in the vehicle. They are preferably digital dual-microphone pairs, and their number is preferably 4, with one in-vehicle microphone 2 mounted near each passenger position 6, for example at the grab handles above the four doors. The sound signals collected by the in-vehicle microphones 2 comprise the in-vehicle speech signal of the speaker and the in-vehicle noise signal formed when original vibration outside the vehicle is transmitted into the vehicle. In-vehicle microphones 2 at different positions can collect the speech of the same speaker simultaneously.

The speech signal preprocessing module 3 is connected to the digital signal transmission line 4, which preferably uses an A2B connection. The vehicle exterior vibration sensors and the in-vehicle microphones 2 output the collected signals to the speech signal preprocessing module 3 through the digital signal transmission line 4. The chip used by the speech signal preprocessing module 3 is preferably an NXP i.MX 8M Plus, which has a dedicated neural processing engine. The module performs noise reduction on the sound signals collected by the in-vehicle microphones 2 through a convolutional neural network noise reduction model and a recurrent neural network model: using the feedforward signals and the algorithm model, it removes the noise from the signal fed to the speech recognition system, and the remaining clean speech signal is used for recognition. The speech signal preprocessing module 3 can also determine the position of the speaker from the sound signals collected by the in-vehicle microphones 2 through a convolutional neural network positioning model.

Specifically, the noise reduction part of the speech signal preprocessing system uses a feedforward structure. The feedforward signals are collected by the three vehicle exterior vibration sensors: the original vibration signals collected by the first, second and third vehicle exterior vibration sensors are denoted x(n), y(n) and z(n) respectively. The 4 in-vehicle microphones collect, on the one hand, the speech signals at the different passenger positions; the speech signals from the 4 passenger positions are denoted b₁(n), b₂(n), b₃(n) and b₄(n). On the other hand, the microphones collect the in-vehicle noise signal. The in-vehicle noise is formed mainly by the three exterior noise sources propagating into the vehicle through a transfer channel (air or the vehicle body). The in-vehicle noise signal therefore consists of three parts corresponding to x(n), y(n) and z(n) collected outside the vehicle, denoted x′(n), y′(n) and z′(n). The sound signal collected by an in-vehicle microphone can thus be regarded as speech plus additive noise. It is denoted s(n), the superposition of the clean speech signal and the in-vehicle noise signal, and can be expressed by the following formula:

s(n) = b_j(n) + x′(n) + y′(n) + z′(n)

where j = 1, 2, 3, 4.
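
The additive model can be checked numerically. In the sketch below every sample value is made up, and the helper names `mix`, `cancel` and `error_energy` are hypothetical; it only illustrates that subtracting modeled noise components from s(n) leaves the speech signal, and that an imperfect model leaves a proportionally smaller residual noise.

```python
import math
from typing import List, Sequence


def mix(speech: Sequence[float], *noise_parts: Sequence[float]) -> List[float]:
    """s(n) = b_j(n) + x'(n) + y'(n) + z'(n)."""
    return [b + sum(parts) for b, *parts in zip(speech, *noise_parts)]


def cancel(mic: Sequence[float], *estimates: Sequence[float]) -> List[float]:
    """Subtract the modeled noise components from the microphone signal."""
    return [s - sum(parts) for s, *parts in zip(mic, *estimates)]


n = 16
speech = [math.sin(0.5 * k) for k in range(n)]              # b_j(n), made up
engine = [0.30 * math.sin(1.3 * k) for k in range(n)]       # x'(n), made up
road = [0.20 * math.cos(2.1 * k) for k in range(n)]         # y'(n), made up
wind = [0.10 * math.sin(3.7 * k + 1.0) for k in range(n)]   # z'(n), made up

mic = mix(speech, engine, road, wind)
exact = cancel(mic, engine, road, wind)
# An imperfect model that recovers only 90 % of each noise component.
rough = cancel(mic, *([0.9 * v for v in part] for part in (engine, road, wind)))


def error_energy(signal: Sequence[float]) -> float:
    """Energy of the difference between a signal and the clean speech."""
    return sum((a - b) ** 2 for a, b in zip(signal, speech))


print(error_energy(mic), error_energy(rough), error_energy(exact))
```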

Referring to FIG. 3, the noise source signals x(n), y(n), z(n) collected by the vehicle exterior vibration sensors are transmitted to the speech signal preprocessing module through the digital signal transmission line. The noise reduction part of the module establishes a mapping between the in-vehicle noise signals x′(n), y′(n), z′(n) and the exterior original vibration signals x(n), y(n), z(n). This removes the influence of the transfer channel delay and brings the in-vehicle noise signal under control, so that the system tracks the noise, the noise reduction effect of the preprocessing system improves, and damage to the speech signal is reduced. The clean speech signal remaining after noise reduction and enhancement is denoted b′(n) and is input to the speech recognition system, which improves speech recognition accuracy.

The control algorithm of the noise reduction part of the speech signal preprocessing module uses a hybrid of a one-dimensional convolutional neural network model and a recurrent neural network model. The hybrid model extracts the characteristics of the signal transfer channel in real time, which effectively addresses the signal transmission delay of the processing system and enables noise reduction of the speech signal. The one-dimensional convolutional neural network model is the convolutional neural network noise reduction model. The original vibration signals x(n), y(n), z(n) collected by the vehicle exterior vibration sensors correspond to the sequences (x(1), x(2), ..., x(n)), (y(1), y(2), ..., y(n)) and (z(1), z(2), ..., z(n)). The corresponding sequences after modeling are (X(1), X(2), ..., X(n)), (Y(1), Y(2), ..., Y(n)) and (Z(1), Z(2), ..., Z(n)). The noise signals x′(n), y′(n), z′(n) collected by the in-vehicle microphones correspond to the sequences (x′(1), x′(2), ..., x′(n)), (y′(1), y′(2), ..., y′(n)) and (z′(1), z′(2), ..., z′(n)). The speech signal preprocessing module uses a minimum mean square error algorithm to adjust the control parameter W of the recurrent neural network model so that the sum of squared errors between the model output and the noise signal collected by the in-vehicle microphones is minimized, which achieves the noise reduction; this sum of squared errors is denoted ε(n). The squared errors of the residual signals for the three noise sources after noise reduction are calculated as follows:

ε_x(n) = [e_x(n)]² = [X(n) - x′(n)]²

ε_y(n) = [e_y(n)]² = [Y(n) - y′(n)]²

ε_z(n) = [e_z(n)]² = [Z(n) - z′(n)]²

The update formula for the model parameter W of the recurrent neural network model in the speech signal preprocessing module is as follows:

where μ_x, μ_y and μ_z are the model convergence factors.
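
A conventional least-mean-square gradient step is one way an update driven by the squared error ε_x(n) and a convergence factor μ_x can look. The sketch below is an assumption made for illustration and is not the update formula of this publication: it replaces the recurrent neural network with a short linear filter, handles only one noise source, and uses a made-up transfer path and made-up vibration samples. The function name `lms_identify` is hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def lms_identify(
    reference: Sequence[float], target: Sequence[float], taps: int, mu: float
) -> Tuple[List[float], List[float]]:
    """Adapt filter weights so that the filtered reference tracks the target."""
    weights = [0.0] * taps
    squared_errors = []
    for n in range(taps - 1, len(reference)):
        window = reference[n - taps + 1:n + 1][::-1]            # newest sample first
        output = sum(w * v for w, v in zip(weights, window))    # X(n)
        error = output - target[n]                              # e_x(n) = X(n) - x'(n)
        # Gradient of [e_x(n)]^2 with respect to the weights is 2 * e_x(n) * window.
        weights = [w - 2.0 * mu * error * v for w, v in zip(weights, window)]
        squared_errors.append(error * error)                    # eps_x(n)
    return weights, squared_errors


rng = random.Random(0)
vibration = [rng.uniform(-1.0, 1.0) for _ in range(4000)]       # x(n), made up
true_path = [0.6, -0.3, 0.1]                                    # made-up transfer path
cabin_noise = [
    sum(true_path[k] * vibration[n - k] for k in range(len(true_path)) if n - k >= 0)
    for n in range(len(vibration))
]                                                               # x'(n)

weights, squared_errors = lms_identify(vibration, cabin_noise, taps=3, mu=0.05)
print([round(w, 3) for w in weights])
```

A larger convergence factor adapts faster but can become unstable, which is why such factors are usually tuned per noise source.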

Referring to fig. 4, the noise reduction scene of the speech signal preprocessing module can be regarded as an infinite sequence of training tasks, and the real-time online learning and noise reduction process of the noise signal of the hybrid model is as follows:

(1) Input data: after the collected noise signals are preprocessed, the vibration signal values from the three noise sources are stored at discrete time intervals, producing a 3 × n matrix that serves as the multi-channel input of the one-dimensional convolutional neural network.

(2) 1D CNN layer: the first one-dimensional convolutional layer uses 80 convolution kernels to extract features from the input signal; its output is the weight matrix of the convolution kernels. In the steps below, "1D CNN layer" denotes a one-dimensional convolutional layer.

(3) 1D CNN layer: the output of the first convolutional layer is passed to the second convolutional layer, which uses 50 convolution kernels for further feature extraction.

(4) Max pooling layer: to reduce the complexity of the output and prevent overfitting, a max pooling layer follows the convolutional layers.

(5) 1D CNN layer: the output of the pooling layer is connected to another convolutional layer, which uses 20 convolution kernels to abstract high-dimensional features.

(6) Average pooling layer: to avoid overfitting, this layer takes the average of every two weights in the neural network.

(7) Dropout layer: the weights of randomly selected neurons are set to 0. Because the noise signal is short-time stationary, the dropout ratio is preferably 0.4, so the weights of 40% of the neurons in this layer are set to zero.

(8) RNN structure: the feature vector of the noise signal output by the previous layer is input to a recurrent neural network (RNN) structure, which minimizes the mean square error between the sequence of network output vectors and the sequence of noise signals collected by the in-vehicle microphones. As shown in FIG. 5, U, W and V are model parameters, where W is the parameter W described above and is adjusted accordingly; the RNN model shares its parameters across time steps, s is the hidden state, and o is the model output vector.

(9) Model training: the network model parameters are updated using a real-time online recurrent learning algorithm for the recurrent neural network, thereby training the model.
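
The roles of U, W, V, s and o in step (8) can be illustrated with a minimal forward pass. The sketch assumes the common recurrence s_t = tanh(U x_t + W s_{t-1}) and o_t = V s_t; the tanh activation, the layer sizes and all numeric values are assumptions, since the description does not state them. The same three matrices are reused at every time step, which is the parameter sharing mentioned above.

```python
import math
from typing import List, Sequence, Tuple

Matrix = List[List[float]]


def matvec(m: Matrix, v: Sequence[float]) -> List[float]:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def rnn_forward(
    features: Sequence[Sequence[float]], U: Matrix, W: Matrix, V: Matrix
) -> Tuple[List[List[float]], List[List[float]]]:
    """Run s_t = tanh(U x_t + W s_{t-1}), o_t = V s_t with shared U, W, V."""
    state = [0.0] * len(W)                       # initial hidden state s_0
    outputs, states = [], []
    for x_t in features:
        pre = [a + b for a, b in zip(matvec(U, x_t), matvec(W, state))]
        state = [math.tanh(p) for p in pre]      # hidden state s
        states.append(state)
        outputs.append(matvec(V, state))         # model output vector o
    return outputs, states


# Made-up parameters: 2 input features, 2 hidden units, 3 outputs (one per noise source).
U = [[0.5, -0.2], [0.1, 0.4]]
W = [[0.3, 0.0], [-0.1, 0.2]]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, -0.5]]
features = [[0.0, 0.0], [1.0, 0.5], [0.2, -0.3]]

outputs, states = rnn_forward(features, U, W, V)
for o in outputs:
    print([round(v, 4) for v in o])
```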

In an embodiment of the present invention, the speech signal preprocessing module also has a speech positioning function. The 4 in-vehicle microphones at different positions receive a speech signal uttered at a given passenger position with different intensity and timing. On this basis, a multi-channel convolutional neural network positioning model for determining the position of the speech signal (hereinafter the convolutional neural network positioning model) is established to extract speech signal features at the different positions, which implements the speech positioning function of the preprocessing system. Referring to FIG. 6 and FIG. 7, the speech signals received by the 4 in-vehicle microphones from one passenger are denoted p(n), q(n), g(n) and h(n), with sequences (p(1), p(2), ..., p(n)), (q(1), q(2), ..., q(n)), (g(1), g(2), ..., g(n)) and (h(1), h(2), ..., h(n)), respectively.
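
The intensity and timing differences between the four channels can be made explicit with a small hand-computed example. In the described system the positioning model learns such features itself; the sketch below instead computes channel energy and the cross-correlation delay directly, purely to show what differs between p(n), q(n), g(n) and h(n). The burst samples, delays, gains and helper names are all made up.

```python
from typing import Dict, List, Sequence, Tuple


def energy(signal: Sequence[float]) -> float:
    """Sum of squared samples of one channel."""
    return sum(v * v for v in signal)


def best_lag(reference: Sequence[float], other: Sequence[float], max_lag: int) -> int:
    """Delay of `other` relative to `reference`, in samples, by peak cross-correlation."""
    def correlation(lag: int) -> float:
        return sum(
            reference[i] * other[i + lag]
            for i in range(len(reference))
            if 0 <= i + lag < len(other)
        )
    return max(range(-max_lag, max_lag + 1), key=correlation)


def delayed(signal: Sequence[float], delay: int, gain: float) -> List[float]:
    """Shift a signal right by `delay` samples and scale it, keeping its length."""
    return [0.0] * delay + [gain * v for v in signal[: len(signal) - delay]]


burst = [0.0, 0.0, 1.0, 2.0, -1.5, 0.5] + [0.0] * 14        # made-up utterance
paths = {"p": (0, 1.0), "q": (2, 0.7), "g": (3, 0.5), "h": (5, 0.4)}  # (delay, gain)
channels: Dict[str, List[float]] = {
    name: delayed(burst, d, a) for name, (d, a) in paths.items()
}

features: Dict[str, Tuple[float, int]] = {
    name: (energy(sig), best_lag(channels["p"], sig, 6))
    for name, sig in channels.items()
}
for name, (e, lag) in features.items():
    print(name, round(e, 3), lag)
```

In this made-up case the microphone nearest the speaker (channel p) shows the highest energy and the earliest arrival.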

Referring to fig. 8, the algorithm steps of the convolutional neural network localization model are as follows:

(1) Input data: after the speech signals collected by the 4 in-vehicle microphones are preprocessed, a 4 × n matrix is generated from their sequences and used as the multi-channel input of the one-dimensional convolutional neural network.

(2) 1D CNN layer: the first layer defines one-dimensional convolution kernels, each of which learns one feature in the first layer of the neural network. For speech signal positioning, preferably 10 convolution kernels are defined, so that 10 position features are extracted in the first layer of the network.

(3) Max pooling layer: to reduce the complexity of the output and prevent overfitting, a max pooling layer is used after the CNN layer.

(4) 1D CNN layer: higher-level features are then learned through a second convolutional layer, which defines 6 convolution kernels.

(5) Average pooling layer: an average pooling layer is used to further avoid overfitting. It takes the average of every two weights in the neural network, so that each one-dimensional feature convolution kernel leaves only one weight in this layer.

(6) Output layer: a fully connected layer with Softmax activation is used. The 4 neurons of the output layer correspond to the four seating positions in the vehicle, and the output values represent the probabilities of the four positions. The output can be passed to the speech recognition system as a basis for personalized service. For example, the speech recognition system can grant the driver the maximum voice control authority while ordinary passengers get only basic voice control. The preprocessing system filters the output values and passes the maximum probability value to the speech recognition system.
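
The output stage of step (6) can be sketched as a softmax followed by selection of the most probable seat. The seat order, the command groups and the function names below are hypothetical and serve only to illustrate how a position estimate could gate voice control authority; the input values stand in for the four output-layer activations.

```python
import math
from typing import List, Sequence, Tuple

SEATS = ("driver", "front passenger", "rear left", "rear right")   # hypothetical order


def softmax(logits: Sequence[float]) -> List[float]:
    """Convert output-layer activations into probabilities that sum to 1."""
    peak = max(logits)                        # subtract the peak for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def locate_speaker(logits: Sequence[float]) -> Tuple[str, float]:
    """Return the most probable seat and its probability."""
    probs = softmax(logits)
    index = max(range(len(probs)), key=probs.__getitem__)
    return SEATS[index], probs[index]


def allowed_commands(seat: str) -> Tuple[str, ...]:
    """Hypothetical authority table: the driver gets every command group."""
    basic = ("window", "media")
    if seat == "driver":
        return basic + ("navigation", "driving mode")
    return basic


seat, prob = locate_speaker([2.0, 0.5, 0.1, -1.0])   # made-up activations
print(seat, round(prob, 3), allowed_commands(seat))
```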

It should be noted that the numbers and positions of the in-vehicle microphones and vehicle exterior vibration sensors include, but are not limited to, those described above; for example, a vehicle exterior vibration sensor may additionally be arranged at the rear of the vehicle. The formulas for calculating the relevant parameters in the above embodiments are likewise only preferred examples, and other formulas that realize similar functions are also covered. The number of convolution kernels in each layer of each neural network model, the amount of input data, and the related parameter settings in the above embodiments are preferred values obtained from testing, and other values that realize similar functions are also covered.

The invention provides a speech signal preprocessing system for the in-vehicle environment, covering both the installation of the system hardware on the vehicle body and the implementation of the corresponding algorithms. While the vehicle is running, the system uses the vehicle exterior vibration sensors to collect the non-stationary vibration signals of the noise sources in real time, extracts the noise features, and establishes a mapping between these signals and the noise signal received by the speech recognition system inside the vehicle, taking the system delay into account. At the same time, a multi-position, dual-channel microphone structure is used inside the vehicle, and an algorithm matches speech signals to the four seating positions. The speech signal preprocessing system performs noise reduction and enhancement on the speech signal at the front end of the speech recognition system, which further improves speech recognition accuracy in the vehicle environment, and it locates the passenger who issues a voice control command. This helps the speech recognition system meet the personalized recognition requirements of different passengers, achieves good human-vehicle voice interaction, and gives passengers a better driving experience.

The noise reduction part of the voice signal preprocessing system provided by the invention adopts a feedforward adaptive control structure. Vibration sensors near the noise sources collect, in real time, the vibration signals of the three noise sources present during driving (road noise, engine noise and wind noise). A one-dimensional convolutional neural network control model performs feature learning on the real-time noise signals, and a recurrent (cyclic) neural network model performs signal tracking and control, which helps improve the accuracy of the voice recognition system. The positioning part collects voice signals with four channels of microphones in the vehicle, extracts the acoustic features of the voice signals at different positions in the vehicle with a one-dimensional convolutional neural network model, and then calculates and judges the position of the passenger who uttered the voice signal.
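The feedforward structure described above can be illustrated with a much simpler stand-in. The sketch below is not the convolutional and recurrent model of the invention; it swaps in a normalized LMS (NLMS) adaptive FIR filter purely to show the signal flow: a vibration reference is mapped to a modeled cancellation signal, which is subtracted from the microphone signal to leave a residual. The function names, the tap count, the step size and the synthetic two-tap transmission path are all assumptions made for illustration.

```python
import random


def nlms_feedforward_cancel(reference, microphone, taps=8, mu=0.5, eps=1e-8):
    """Subtract a modeled noise estimate from the microphone signal.

    reference  : vibration-sensor samples (noise reference)
    microphone : in-cabin microphone samples (speech plus noise)
    Returns the residual, i.e. what would be passed on to the recognizer.
    NOTE: NLMS is an illustrative substitute for the patent's CNN + RNN model.
    """
    weights = [0.0] * taps
    history = [0.0] * taps  # most recent reference samples, newest first
    residual = []
    for x, d in zip(reference, microphone):
        history = [x] + history[:-1]
        y = sum(w * h for w, h in zip(weights, history))  # modeled cancellation signal
        e = d - y                                          # residual after cancellation
        norm = sum(h * h for h in history) + eps
        # Normalized LMS update: step scaled by the reference power.
        weights = [w + mu * e * h / norm for w, h in zip(weights, history)]
        residual.append(e)
    return residual


def synthetic_demo(n=2000, seed=0):
    """Synthetic check with a made-up 2-tap path from vibration to cabin noise."""
    rng = random.Random(seed)
    reference = [rng.gauss(0.0, 1.0) for _ in range(n)]
    cabin = [0.6 * reference[i] + (0.3 * reference[i - 1] if i > 0 else 0.0)
             for i in range(n)]
    residual = nlms_feedforward_cancel(reference, cabin)
    tail = slice(n - 200, n)  # compare power after the filter has adapted

    def power(signal):
        return sum(v * v for v in signal[tail])

    return power(cabin), power(residual)


if __name__ == "__main__":
    p_in, p_out = synthetic_demo()
    print("noise power before:", p_in, "after:", p_out)
```

In this synthetic case the microphone contains only noise, so the residual power should fall far below the input power once the filter has adapted. With real speech present, the speech is uncorrelated with the vibration reference and would remain in the residual.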

The voice signal preprocessing system provided by the invention controls the noise in the vehicle, denoises and enhances the noisy voice signal, provides a clean voice signal to the voice recognition system, and locates the passenger who uttered the voice signal so that the voice recognition system can perform personalized recognition and control. The system overcomes the voice signal distortion caused by noise reduction in traditional voice enhancement systems: the algorithm adaptively extracts the in-vehicle noise signal features in real time, so the noise signal features need not be defined manually, and the noise is filtered out without damaging the voice signal, which improves the recognition accuracy of the voice recognition system. The feedforward structure of the system collects vibration signals in real time in the driving state and trains the noise reduction model of the preprocessing system, which effectively addresses the signal delay problem of the noise reduction system and achieves effective tracking of in-vehicle noise. The system is therefore suitable for complex in-vehicle noise environments in different driving states and at different speeds, and the noise reduction performance of the preprocessing system is improved. The in-vehicle microphones also determine the position of the voice signal, provide the source position of the voice instruction to the voice recognition system, and extend the functions of the preprocessing system.
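The position judgment can be made more concrete with a hand-computed example of the two cues that the claims name as learning features, namely the energy and the time difference of the signals at the in-vehicle microphones. The sketch below is an assumption-laden illustration and not the convolutional positioning model itself: `energy`, `best_lag` and `softmax` are hypothetical helper names, the lag is found by a brute-force cross-correlation search, and the scores passed to `softmax` are invented numbers standing in for the output of a trained network.

```python
import math


def energy(signal):
    """Sum of squared samples; a louder (closer) talker gives a larger value."""
    return sum(s * s for s in signal)


def best_lag(a, b, max_lag):
    """Lag in samples by which b trails a, found by maximizing cross-correlation."""
    def corr(lag):
        return sum(a[n] * b[n + lag]
                   for n in range(len(a)) if 0 <= n + lag < len(b))
    return max(range(-max_lag, max_lag + 1), key=corr)


def softmax(scores):
    """Turn arbitrary region scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    # Hypothetical pulse heard first at microphone A, later and weaker at B.
    mic_a = [0, 0, 1, 2, 1, 0, 0, 0, 0, 0, 0, 0]
    mic_b = [0, 0, 0, 0, 0, 0.5, 1, 0.5, 0, 0, 0, 0]
    print("energy A, B:", energy(mic_a), energy(mic_b))
    print("lag of B relative to A:", best_lag(mic_a, mic_b, max_lag=5))
    # Invented scores for four seating regions, for illustration only.
    print("region probabilities:", softmax([2.0, 0.5, 0.1, 0.1]))
```

A positive lag together with a lower energy at microphone B suggests the talker sits nearer microphone A. In the invention this decision is learned by the network from the time-domain signals rather than computed by fixed rules like these.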

The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims

1. A control method of a preprocessing system for improving the accuracy of speech recognition in a vehicle is characterized by comprising the following steps:

2. The method for controlling a preprocessing system for improving accuracy of speech recognition in a vehicle according to claim 1, wherein in step S2, the convolutional neural network noise reduction model has a six-layer structure, and comprises, in order from input to output, a first convolutional layer, a second convolutional layer, a maximum pooling layer, a third convolutional layer, an average pooling layer, and a Dropout layer, and a feature vector output is obtained after inputting the signal data matrix.

3. The control method of the preprocessing system for improving the accuracy of speech recognition in the vehicle according to claim 1, wherein the step of outputting the modeled cancellation signal in S3 includes the following steps:

4. The control method for the preprocessing system according to claim 1, wherein in step S1, a plurality of in-vehicle microphones are present for simultaneously capturing sound signals of different positions in the current vehicle.

5. The control method for the preprocessing system for improving the accuracy of the in-vehicle speech recognition according to claim 4, wherein before the step of S2, the acoustic features of the in-vehicle speech signals at different positions in the vehicle are extracted by the convolutional neural network positioning model, and the in-vehicle space is divided into a plurality of regions according to the extracted acoustic features.

6. The control method of the preprocessing system for improving the accuracy of in-vehicle speech recognition according to claim 5, wherein the acoustic features include intensity and duration.

7. The control method of the preprocessing system for improving the accuracy of the speech recognition in the vehicle according to claim 5, wherein the convolutional neural network positioning model has a five-layer structure, and comprises a first convolutional layer, a maximum pooling layer, a second convolutional layer, an average pooling layer and a Softmax layer from input to output in sequence, and a vector output is obtained after a signal data matrix is input.

8. The control method for the preprocessing system for improving the accuracy of the in-vehicle speech recognition according to claim 7, wherein the convolutional neural network localization model uses the energy and time difference of the sound signal collected by each in-vehicle microphone as the learning feature, and the convolutional neural network localization model obtains the probability of the in-vehicle speaker appearing in each in-vehicle area through the time domain signal in the sound signal.

9. A pre-processing system for improving accuracy of speech recognition in a vehicle is characterized by comprising

10. The pre-processing system for improving accuracy of in-vehicle speech recognition according to claim 9, wherein the out-of-vehicle vibration sensor comprises a first out-of-vehicle vibration sensor, a second out-of-vehicle vibration sensor, and a third out-of-vehicle vibration sensor; the first out-of-vehicle vibration sensor is arranged at the engine compartment of the vehicle and used for collecting engine noise; the second out-of-vehicle vibration sensor is arranged near a vehicle tire and used for collecting road noise generated when the vehicle runs; the third out-of-vehicle vibration sensor is arranged at the exterior rear-view mirror of the vehicle and used for collecting wind noise formed at the rear-view mirror when the vehicle runs.