WO2021208455A1

WO2021208455A1 - Neural network speech recognition method and system oriented to home spoken environment

Info

Publication number: WO2021208455A1
Application number: PCT/CN2020/133554
Authority: WO
Inventors: 张晖; 程铭; 赵海涛; 孙雁飞; 倪艺洋; 朱洪波
Original assignee: 南京邮电大学
Priority date: 2020-04-15
Filing date: 2020-12-03
Publication date: 2021-10-21
Also published as: JP2022540968A; CN111477220A; JP7166683B2; CN111477220B

Abstract

A neural network speech recognition method and system oriented to a home spoken environment, comprising: model construction: adding a long-short-term memory network into a deep neural network, to construct a combined neural network DNN-LSTM model; pre-processing an acquired speech data set to obtain a feature vector set, and using the feature vector set as an input of the DNN-LSTM model to perform iterative training, so as to train same to an optimal acoustic model; causing an input speech signal of an unknown language to be subjected to the trained DNN-LSTM model to respectively obtain a Chinese output probability vector set and an English output probability vector set; performing language matching according to the Chinese output probability vector set and the English output probability vector set, and outputting a determination result. The present invention is able to quickly and accurately recognize content of a speaker in a home scene, and can be widely applied to an actual home scene.

Description

Neural network speech recognition method and system oriented to the home spoken environment

Technical field

The invention belongs to the technical field of intelligent recognition, and specifically relates to a neural network speech recognition method and system oriented to the home spoken environment.

Background art

Speech recognition research takes speech as its object: the speech signal is converted into information that a computer can process, so that the speaker's voice commands and text content can be recognized. There are broadly three approaches to speech recognition: methods based on linguistics and acoustics, model matching, and neural networks. The first approach appeared earliest, but the complexity of its models has kept it from reaching a practical stage. The second approach is typified by the Hidden Markov Model (HMM), a probabilistic model for labeling problems that describes how a model randomly generates an observation sequence; it greatly advanced speech recognition technology. The third approach, when it relies on shallow neural networks, is prone to gradient instability during training; manual extraction of sample features is time-consuming and laborious, and the recognition results are not very good. In traditional speech recognition systems, GMM-HMM acoustic modeling is the most widely used method in practice, but when dealing with complex speech signals in the home environment, the traditional model applies to only a narrow range of scenarios.

Summary of the invention

Objective of the invention: To overcome the shortcomings of the prior art, the present invention provides a neural network speech recognition method oriented to the home spoken environment, which addresses the problems of low speech recognition rate and poor recognition efficiency. The present invention also provides a neural network speech recognition system oriented to the home spoken environment.

Technical solution: On the one hand, the neural network speech recognition method oriented to the spoken language environment at home according to the present invention includes:

Model construction: add a long short-term memory (LSTM) network to a deep neural network (DNN) to build a combined neural network DNN-LSTM model;

Model training:

Chinese speech data training: preprocess the collected Chinese speech data set to obtain a Chinese feature vector set, and use the Chinese feature vector set as the input of the DNN-LSTM model for iterative training to obtain the optimal Chinese acoustic model;

English speech data training: preprocess the collected English speech data set to obtain an English feature vector set, and use the English feature vector set as the input of the DNN-LSTM model for iterative training to obtain the optimal English acoustic model;

Model test:

Pass an input speech signal voice0 in an unknown language through the Chinese acoustic model and the English acoustic model respectively to obtain a Chinese output probability vector set and an English output probability vector set;

Perform language matching according to the Chinese output probability vector set and the English output probability vector set, and output the judgment result.

Further, it includes:

The combined neural network DNN-LSTM model includes an input layer, a long short-term memory network, a second hidden layer, a third hidden layer, a fourth hidden layer, and an output layer, with the long short-term memory network serving as the first hidden layer.

Further, it includes:

The first hidden layer has 512 nodes and uses the sigmoid and tanh functions as activation functions; the second, third and fourth hidden layers each have 1024 nodes and use the sigmoid function as the activation function.

Further, it includes:

The Chinese feature vector set is used as the input of the DNN-LSTM model for iterative training, and the training steps include:

(1) Initialize the weight matrix W and the bias vector b in the model structure to random values;

(2) Iterate from the first iteration up to the maximum number of iterations; in each iteration, traverse from the first speech training sample to the last training sample;

(3) For each training sample, input the corresponding feature vector to the input layer; traverse from the first hidden layer to the output layer, using the forward propagation algorithm to compute the output of each layer being traversed, and then compute the output-layer error according to the loss function; after forward propagation completes, traverse from the fourth hidden layer back to the first hidden layer, using the back propagation algorithm to compute the error of each hidden layer being traversed;

(4) After back propagation completes, traverse from the first hidden layer to the output layer and update the weight matrix W^n and bias vector b^n of each corresponding layer, where n denotes the layer being traversed, n = 1, 2, 3, 4, 5. This completes training on one sample within one iteration; if not all samples have been traversed, continue with the next sample; if all samples have been traversed, proceed to the next iteration;

(5) When all the change values of W and b do not exceed the iteration threshold, stop the iteration loop;

(6) Save the optimal weight matrix W and bias vector b of each layer.

Further, it includes:

Performing language matching according to the Chinese output probability vector set and the English output probability vector set includes:

Use the information entropy formula to calculate the information entropy of the Chinese output probability vector set P and of the English output probability vector set P′, denoted H and H′ respectively; where P = {p_1, p_2, ..., p_q}, P′ = {p′_1, p′_2, ..., p′_t}, q is the total number of output categories of the Chinese acoustic model, and t is the total number of output categories of the English acoustic model;

If some p_i in the probability vector set output by the Chinese acoustic model is significantly larger than the other probability values, while the probability values in the probability vector set output by the English acoustic model differ little from one another, where 1≤i≤q;

And if the information entropy H obtained from the Chinese acoustic model is smaller than the information entropy H′ obtained from the English acoustic model, then the input speech signal voice0 of the unknown language is Chinese, and the output probability of the Chinese acoustic model is used as the final output result;

If some p′_j in the probability vector set output by the English acoustic model is significantly larger than the other probability values, while the probability values in the probability vector set output by the Chinese acoustic model differ little from one another, where 1≤j≤t;

And if the information entropy H′ obtained from the English acoustic model is smaller than the information entropy H obtained from the Chinese acoustic model, then the input speech signal voice0 of the unknown language is English, and the output probability of the English acoustic model is used as the final output result.
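
The entropy comparison described above can be sketched in a few lines of Python. This is a minimal illustration and not the patented implementation: the function names entropy and match_language, the natural logarithm, the handling of a tie, and the two example probability vectors are all assumptions made for the example.

```python
import math

def entropy(probs):
    """Shannon entropy (natural log); zero probabilities contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def match_language(p_chinese, p_english):
    """Pick the model whose output distribution has the smaller entropy."""
    h_cn = entropy(p_chinese)   # H
    h_en = entropy(p_english)   # H'
    if h_cn < h_en:
        return "Chinese", p_chinese
    if h_en < h_cn:
        return "English", p_english
    return "undetermined", None  # assumption: the source does not define a tie

# Hypothetical outputs of the two acoustic models for one utterance voice0.
P = [0.90, 0.03, 0.03, 0.02, 0.02]        # one value clearly dominates
P_prime = [0.22, 0.20, 0.19, 0.21, 0.18]  # values differ little

language, output = match_language(P, P_prime)
print(language)
```

When q and t differ, the two entropies have different maxima (log q and log t). Dividing each entropy by its maximum is one way to make them comparable; the source does not state whether it does so.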

On the other hand, the present invention also provides a system implemented by the above-mentioned neural network speech recognition method for the spoken language environment at home, which includes:

A model building module, configured to add a long short-term memory network to a deep neural network to build the combined neural network DNN-LSTM model;

A model training module, including a Chinese model training unit and an English model training unit. The Chinese model training unit is configured to preprocess the collected Chinese speech data set to obtain a Chinese feature vector set, and to use the Chinese feature vector set as the input of the DNN-LSTM model for iterative training to obtain the optimal Chinese acoustic model;

The English model training unit is configured to preprocess the collected English speech data set to obtain an English feature vector set, and to use the English feature vector set as the input of the DNN-LSTM model for iterative training to obtain the optimal English acoustic model;

A model testing module, including a voice input unit and a voice type judgment unit. The voice input unit is configured to pass an input speech signal voice0 in an unknown language through the Chinese acoustic model and the English acoustic model respectively, to obtain the Chinese output probability vector set and the English output probability vector set; the voice type judgment unit is configured to perform language matching according to the Chinese output probability vector set and the English output probability vector set, and to output the judgment result.

Beneficial effects: Compared with the prior art, the present invention has significant advantages. It combines the ability of the LSTM to record long-range historical information in its memory unit with the ability of the DNN to effectively extract high-level information from the data, and proposes adding an LSTM as the first hidden layer of a DNN to build a combined DNN-LSTM neural network for acoustic modeling. The combined network is trained and tested on a Chinese data set and an English data set to obtain a Chinese acoustic model and an English acoustic model. Using the concept of entropy, the outputs of the two acoustic models for an input speech signal are compared, and the result with the smaller entropy is taken as the output of the acoustic model. This achieves simple language identification and improves the overall speech recognition rate, so that the speaker's content in a home scene can be recognized quickly and accurately and the method can be widely applied in actual home scenes.

Description of the drawings

Figure 1 is a block diagram of the overall structure of the combined neural network speech recognition algorithm for the home spoken environment according to the present invention;

Figure 2 is a structural diagram of the DNN-LSTM model of the present invention;

Figure 3 shows the overall structure of LSTM.

Detailed description

In order to describe in more detail the combined neural network speech recognition algorithm oriented to the spoken language environment at home proposed by the present invention, an example is described below with reference to the accompanying drawings.

Figure 1 is a block diagram of the overall structure of the combined neural network speech recognition algorithm for the home spoken environment. First, the characteristics of the DNN and the LSTM are combined to construct the DNN-LSTM model; then, the DNN-LSTM model is trained on the Chinese data set and the English data set, and the resulting Chinese acoustic model and English acoustic model are saved; finally, the result is output through language matching, achieving both language identification and speech recognition.

DNN stands for Deep Neural Network, and LSTM stands for Long Short-Term Memory network. Figure 3 shows the three-gate logic structure inside the LSTM. The core element of the LSTM is the cell state, which carries information forward over time. It runs straight through the entire chain with only a few small linear interactions, so information can easily flow along it without significant change. During this transfer, information is added to or removed from the cell state through the gate structure, based on the input at the current time, the hidden-layer state at the previous time, and the cell state at the previous time. In speech recognition, the memory unit of the LSTM model is mainly used to store and process speech features. It implements three computations, namely the forget gate, the input gate and the output gate, which protect and control the neuron state c_t at the current time, as follows:

(1) Input gate: this gate determines how much of the input x_t is kept in c_t. It is computed as:

i_t = σ(W_i · [h_{t-1}, x_t] + b_i) (3)

c̃_t = tanh(W_c · [h_{t-1}, x_t] + b_c) (4)

where i_t is the input-gate activation at time t and c̃_t is the candidate state to be saved through the input gate. W_i and W_c are weight matrices, and b_i and b_c are bias terms; x_{t-1}, x_t, x_{t+1} are the inputs at the previous, current and next time respectively; h_{t-1}, h_t, h_{t+1} are the neuron states at the previous, current and next time respectively; and σ is the sigmoid function.

(2) Forget gate: this gate determines how much of the previous cell state c_{t-1} is retained in c_t at time t. It is computed as:

f_t = σ(W_f · [h_{t-1}, x_t] + b_f) (5)

where W_f is the weight matrix and b_f is the bias term.

(3) Output gate: this gate controls how much of the cell state c_t is output to the current output value h_t of the LSTM. First, the state after the input gate and the forget gate, c_t, is computed as:

c_t = f_t * c_{t-1} + i_t * c̃_t (6)

where the first term is the component remaining in c_t after the information passes through the forget gate, and the second term is the component remaining in c_t after the information passes through the input gate. Then, to determine how much of c_t is retained in h_t, the output gate is computed as:

o_t = σ(W_o · [h_{t-1}, x_t] + b_o) (7)

where o_t is the state of the output gate at time t, W_o is the weight matrix, and b_o is the bias term. Finally, after the output gate, the final output of the hidden layer is:

h_t = o_t * tanh(c_t) (8)
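
As a rough illustration of how the three gates interact within one time step, the following sketch implements a single LSTM step in plain Python. The helper names (affine, lstm_step, zero_matrix) and the params dictionary layout are hypothetical, and the candidate state and cell-state update are written in the standard LSTM form, which is an assumption about the formulas numbered (4) and (6).

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def affine(W, v, b):
    """Compute W·v + b for a matrix W stored as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) + bias for row, bias in zip(W, b)]

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step; returns the new hidden state h_t and cell state c_t."""
    hx = h_prev + x_t  # concatenation [h_{t-1}, x_t]
    i_t = [sigmoid(v) for v in affine(params["W_i"], hx, params["b_i"])]        # (3)
    c_tilde = [math.tanh(v) for v in affine(params["W_c"], hx, params["b_c"])]  # (4)
    f_t = [sigmoid(v) for v in affine(params["W_f"], hx, params["b_f"])]        # (5)
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, c_tilde)]      # (6)
    o_t = [sigmoid(v) for v in affine(params["W_o"], hx, params["b_o"])]        # (7)
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]                          # (8)
    return h_t, c_t

def zero_matrix(rows, cols):
    return [[0.0] * cols for _ in range(rows)]

# Toy sizes (2 hidden units, 1 input feature) with all-zero parameters, so every
# gate evaluates to 0.5 and the candidate state to 0.
hidden, inputs = 2, 1
params = {}
for gate in ("i", "c", "f", "o"):
    params["W_" + gate] = zero_matrix(hidden, hidden + inputs)
    params["b_" + gate] = [0.0] * hidden

h_t, c_t = lstm_step([1.0], [0.0, 0.0], [1.0, -1.0], params)
print(h_t, c_t)
```

With zero parameters the forget gate simply halves the previous cell state, which makes the role of each term in the cell-state update easy to check by hand.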

Specifically, the combined neural network speech recognition method for the home spoken environment includes:

First, build the model: add a long short-term memory network to the deep neural network to build a combined neural network DNN-LSTM model;

Figure 2 shows the DNN-LSTM model. Its structure is as follows: layer 0 is the input layer, layers 1 to 4 are hidden layers, and layer 5 is the output layer, whose activation function is the softmax function. Among the hidden layers, the first is an LSTM network structure with 512 nodes, using the sigmoid and tanh functions as activation functions; to prevent the network from overfitting the data, a Dropout strategy is applied to its neural units. The following three hidden layers are all DNN network structures with 1024 nodes each, using the sigmoid function as the activation function. That is, the combined neural network DNN-LSTM model includes the input layer, the long short-term memory network, the second hidden layer, the third hidden layer, the fourth hidden layer and the output layer, with the long short-term memory network serving as the first hidden layer.

The model has 6 layers. Let the input vector of the neurons in layer n be z^n and the output vector be y^n; then:

z^n = W^n y^{n-1} + b^n, n = 1, 2, 3, 4, 5 (1)

where W^n is the weight matrix from layer n-1 to layer n, and b^n is the bias of layer n. From the input vector, the output is obtained as:

y^n = f_n(z^n) (2)

where f_n is the activation function of layer n.
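
The layer-by-layer computation in formulas (1) and (2) can be illustrated with a small Python sketch that takes the previous layer's output y^{n-1} as the input to layer n. The function names and the tiny weight values below are invented for the example; the real model uses 512 and 1024 nodes per hidden layer and an LSTM as its first hidden layer.

```python
import math

def sigmoid_vec(z):
    return [1.0 / (1.0 + math.exp(-v)) for v in z]

def softmax(z):
    m = max(z)  # subtract the maximum for numerical stability
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def layer_forward(y_prev, W, b, f):
    """Formula (1): z = W y_prev + b, then formula (2): y = f(z)."""
    z = [sum(w * y for w, y in zip(row, y_prev)) + bias for row, bias in zip(W, b)]
    return f(z)

# Hypothetical tiny network: 3 inputs -> 2 sigmoid units -> 2 softmax outputs.
W1 = [[0.1, -0.2, 0.3], [0.0, 0.5, -0.5]]
b1 = [0.0, 0.1]
W2 = [[1.0, -1.0], [-1.0, 1.0]]
b2 = [0.0, 0.0]

y0 = [1.0, 2.0, 3.0]
y1 = layer_forward(y0, W1, b1, sigmoid_vec)  # hidden layer, sigmoid activation
y2 = layer_forward(y1, W2, b2, softmax)      # output layer, softmax activation
print(y2)
```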

Second, perform model training:

Chinese speech data training: preprocess the collected Chinese speech data set to obtain the Chinese feature vector set vector0, and use the Chinese feature vector set as the input of the DNN-LSTM model for iterative training to train to the optimal Chinese acoustic model;

The pre-processing operations include sampling, pre-emphasis, windowing and framing, and endpoint detection. The resulting feature vector set vector0 is used as the DNN-LSTM model input for iterative training to obtain the optimal acoustic model China_model.
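
The source names the pre-processing stages but gives no parameters, so the following Python sketch only shows what such a pipeline might look like. The pre-emphasis coefficient 0.97, the Hamming window, the frame length and hop, and the energy-based endpoint detector are all assumptions chosen for illustration; they are not values from the patent.

```python
import math

def pre_emphasis(signal, alpha=0.97):
    """y[n] = x[n] - alpha * x[n-1]; the first sample is passed through."""
    if not signal:
        return []
    return [signal[0]] + [signal[n] - alpha * signal[n - 1]
                          for n in range(1, len(signal))]

def frame_and_window(signal, frame_len, hop):
    """Split the signal into overlapping frames and apply a Hamming window."""
    window = [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (frame_len - 1))
              for n in range(frame_len)]
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frames.append([signal[start + n] * window[n] for n in range(frame_len)])
    return frames

def detect_endpoints(frames, energy_threshold):
    """Return the indices of frames whose short-time energy exceeds the threshold."""
    return [k for k, frame in enumerate(frames)
            if sum(v * v for v in frame) > energy_threshold]

# Synthetic test signal: silence, a sine burst standing in for speech, silence.
signal = ([0.0] * 100
          + [math.sin(2.0 * math.pi * 0.05 * n) for n in range(200)]
          + [0.0] * 100)

emphasized = pre_emphasis(signal)
frames = frame_and_window(emphasized, frame_len=50, hop=25)
voiced = detect_endpoints(frames, energy_threshold=0.01)
print(len(frames), voiced)
```

In a real system the retained frames would then be converted to acoustic features before being fed to the DNN-LSTM model; the source does not specify which features are used.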

The training steps are as follows:

(1) Initialize the weight matrix W and the bias vector b in the network structure to random values.

(2) Iterate from the first iteration up to the maximum number of iterations; in this embodiment, the maximum number of iterations is set to 50. In each iteration, traverse from the first training sample to the last training sample, where i denotes the training sample being traversed;

(3) In the training of each sample, the input vector serves as the input to the first layer of the DNN, denoted a^1. Then traverse from the first hidden layer to the output layer, with n denoting the layer being traversed; each layer computes, by the forward propagation algorithm, a^{i,n} = f(z^{i,n}) = f(W^n a^{i,n-1} + b^n), which is the output of layer n for the i-th training sample.

Compute the output-layer error δ^{i,L} according to the loss function, where L denotes the output layer. After forward propagation completes, traverse from the last hidden layer back to the first hidden layer and compute, by the back propagation algorithm, δ^{i,n} = (W^{n+1})^T δ^{i,n+1} ⊙ f′(z^{i,n}), which is the error term of layer n for the i-th training sample; T denotes transposition, f′ denotes the derivative of f, and ⊙ denotes the element-wise (Hadamard) product.

(4) After back propagation completes, traverse from the first hidden layer to the output layer and update W^n and b^n of the layer n being traversed by gradient descent, where α is the iteration step size and m is the total number of training samples. This completes training on one sample within one iteration; if not all samples have been traversed, continue with the next sample; if all samples have been traversed, proceed to the next iteration;

(5) When all the change values of W and b do not exceed the iteration threshold, stop the iteration loop;

(6) Save the optimal weight matrix W and bias vector b of each layer.
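
The training steps above can be sketched as a small, self-contained Python loop. Several simplifications are assumed and should not be read as the patented procedure: every hidden layer is a plain sigmoid layer (the LSTM first layer is omitted), the loss is cross-entropy with a softmax output so that the output-layer error is the predicted probability minus the one-hot target, and the update is a per-sample gradient-descent step W^n ← W^n − α δ^{i,n} (a^{i,n-1})^T, b^n ← b^n − α δ^{i,n}. The names train and forward, the step size, the stopping threshold and the toy data are hypothetical.

```python
import math
import random

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def forward(W, b, x):
    """Return the activations of every layer; a[0] is the input vector."""
    a = [x]
    for n in range(len(W)):
        z = [sum(w * v for w, v in zip(row, a[-1])) + bn
             for row, bn in zip(W[n], b[n])]
        a.append(softmax(z) if n == len(W) - 1 else [sigmoid(v) for v in z])
    return a

def train(samples, labels, sizes, alpha=0.5, max_iters=50, threshold=1e-4, seed=0):
    rng = random.Random(seed)
    layers = len(sizes) - 1
    # (1) Random initialisation of the weights and biases.
    W = [[[rng.uniform(-0.5, 0.5) for _ in range(sizes[n])]
          for _ in range(sizes[n + 1])] for n in range(layers)]
    b = [[rng.uniform(-0.5, 0.5) for _ in range(sizes[n + 1])] for n in range(layers)]
    for _ in range(max_iters):                    # (2) loop over iterations
        max_change = 0.0
        for x, target in zip(samples, labels):    # (2) loop over samples
            a = forward(W, b, x)                  # (3) forward propagation
            deltas = [None] * layers
            deltas[-1] = [p - (1.0 if k == target else 0.0)
                          for k, p in enumerate(a[-1])]
            for n in range(layers - 2, -1, -1):   # (3) back propagation
                back = [sum(W[n + 1][k][j] * deltas[n + 1][k]
                            for k in range(len(deltas[n + 1])))
                        for j in range(len(a[n + 1]))]
                deltas[n] = [g * s * (1.0 - s) for g, s in zip(back, a[n + 1])]
            for n in range(layers):               # (4) update W and b
                for k, d in enumerate(deltas[n]):
                    for j, v in enumerate(a[n]):
                        W[n][k][j] -= alpha * d * v
                        max_change = max(max_change, abs(alpha * d * v))
                    b[n][k] -= alpha * d
                    max_change = max(max_change, abs(alpha * d))
        if max_change <= threshold:               # (5) stop on small changes
            break
    return W, b                                   # (6) keep the final W and b

# Toy two-class problem: the class equals the first input coordinate.
samples = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
labels = [0, 0, 1, 1]
W, b = train(samples, labels, sizes=[2, 3, 2])
print(forward(W, b, [1.0, 0.0])[-1])
```

The deltas for all layers are computed before any weight is changed, so that back propagation uses the same weights as the forward pass.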

English speech data training: preprocess the collected English speech data set to obtain the English feature vector set vector1, and use the English feature vector set as the input of the DNN-LSTM model for iterative training to obtain the optimal English acoustic model. The pre-processing operations again include sampling, pre-emphasis, windowing and framing, and endpoint detection, and the feature vector set vector1 is used as the DNN-LSTM model input for iterative training to obtain the optimal acoustic model English_model. The specific traversal steps are the same as those for Chinese speech data training and are not repeated here.

Finally, carry out the model test, the specific steps are as follows:

The information entropy formula is:

H = -Σ_{i=1}^{q} p_i log p_i

If some p_i in the probability vector set output by the Chinese acoustic model is significantly greater than the other probability values, while the probability values in the probability vector set output by the English acoustic model do not differ significantly from one another, where 1≤i≤q;

In this embodiment, whether probability values differ significantly depends on the number of output classes of the softmax output layer: the more output classes, the smaller the corresponding range. The range is denoted β. If the difference between probability values is greater than or equal to β, the difference is significant; if the differences between all probability values are less than β, the difference is not significant. In the experiments, with 5 output classes, β is approximately 0.2; the more output classes, the smaller the range.
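
One possible reading of the β criterion is sketched below in Python. The source does not give an exact formula, so both tests are assumptions: a value is treated as dominant when it exceeds the second-largest value by at least β, and a distribution is treated as flat when the gap between its largest and smallest values is below β. The function names and the example vectors are hypothetical; β = 0.2 is the value the source reports for 5 output classes.

```python
def has_dominant_value(probs, beta):
    """True if the largest probability exceeds every other one by at least beta."""
    ordered = sorted(probs, reverse=True)
    return len(ordered) > 1 and ordered[0] - ordered[1] >= beta

def is_flat(probs, beta):
    """True if no two probabilities differ by beta or more."""
    return max(probs) - min(probs) < beta

beta = 0.2  # reported in the source for 5 output classes
P = [0.90, 0.03, 0.03, 0.02, 0.02]        # hypothetical Chinese model output
P_prime = [0.22, 0.20, 0.19, 0.21, 0.18]  # hypothetical English model output

print(has_dominant_value(P, beta), is_flat(P_prime, beta))
```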

And if the information entropy H obtained from the Chinese acoustic model is smaller than the information entropy H′ obtained from the English acoustic model, then the input speech signal voice0 of the unknown language is Chinese, and the output probability of the Chinese acoustic model is used as the final output result. That is, by the properties of information entropy, the greater the entropy, the greater the amount of information in the system and the higher the uncertainty; the entropy reaches its maximum when p_1 = p_2 = ... = p_q. Here the information entropy of the Chinese output probabilities is much smaller than that of the English output probabilities, which means the Chinese speech signal matches the Chinese acoustic model better, so the output probability of the Chinese acoustic model is taken as the final output result.
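
A short worked example may make the entropy argument concrete. It assumes the Shannon entropy with the natural logarithm (the source does not state the base) and uses two invented probability vectors with q = 5.

```latex
H(P) = -\sum_{i=1}^{q} p_i \ln p_i,
\qquad
H_{\max} = -\sum_{i=1}^{q} \frac{1}{q} \ln \frac{1}{q} = \ln q

% Uniform output, q = 5: maximum uncertainty
H = \ln 5 \approx 1.609

% Peaked output, P = (0.9, 0.025, 0.025, 0.025, 0.025)
H = -0.9 \ln 0.9 - 4 \cdot 0.025 \ln 0.025 \approx 0.095 + 0.369 = 0.464
```

The peaked distribution has roughly a third of the entropy of the uniform one, which is the kind of gap the language-matching rule relies on.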

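If some p′_j in the probability vector set output by the English acoustic model is significantly larger than the other probability values, while the probability values in the probability vector set output by the Chinese acoustic model differ little from one another, where 1≤j≤t; and if the information entropy H′ obtained from the English acoustic model is smaller than the information entropy H obtained from the Chinese acoustic model, then the input speech signal voice0 of the unknown language is English, and the output probability of the English acoustic model is used as the final output result.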

On the other hand, the present invention also provides a neural network speech recognition system oriented to the spoken language environment at home. The system includes:

As for the system/device embodiment, since it is basically similar to the method embodiment, the description is relatively simple, and for related parts, please refer to the part of the description of the method embodiment.

It should be noted that, in this document, relational terms such as first and second are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between these entities or operations.

Those skilled in the art should understand that the embodiments of the present application can be provided as methods, systems, or computer program products. Therefore, this application may adopt the form of a complete hardware embodiment, a complete application embodiment, or an embodiment combining applications and hardware. Moreover, this application may adopt the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program codes.

The present invention is described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to embodiments of the present invention. It should be understood that each process and/or block in the flowcharts and/or block diagrams, and combinations of processes and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or other programmable data processing equipment to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing equipment produce a device that implements the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram.

These computer program instructions can also be stored in a computer-readable memory that can guide a computer or other programmable data processing equipment to work in a specific manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including the instruction device. The device implements the functions specified in one process or multiple processes in the flowchart and/or one block or multiple blocks in the block diagram.

These computer program instructions can also be loaded onto a computer or other programmable data processing equipment, so that a series of operation steps is executed on the computer or other programmable equipment to produce computer-implemented processing; the instructions executed on the computer or other programmable equipment thus provide steps for implementing the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram.

Although the preferred embodiments of the present invention have been described, those skilled in the art can make additional changes and modifications to these embodiments once they learn the basic creative concept. Therefore, the appended claims are intended to be interpreted as including the preferred embodiments and all changes and modifications falling within the scope of the present invention.

Obviously, those skilled in the art can make various changes and modifications to the present invention without departing from the spirit and scope of the present invention. In this way, if these modifications and variations of the present invention fall within the scope of the claims of the present invention and equivalent technologies, the present invention is also intended to include these modifications and variations.

Claims

1. A neural network speech recognition method oriented to the home spoken environment, characterized in that the method includes:

Model construction: add a long short-term memory network to a deep neural network to build a combined neural network DNN-LSTM model;

Model training:

Chinese speech data training: preprocess the collected Chinese speech data set to obtain a Chinese feature vector set, and use the Chinese feature vector set as the input of the DNN-LSTM model for iterative training to obtain the optimal Chinese acoustic model;

English speech data training: preprocess the collected English speech data set to obtain an English feature vector set, and use the English feature vector set as the input of the DNN-LSTM model for iterative training to obtain the optimal English acoustic model;

Model test:

Pass an input speech signal voice0 in an unknown language through the Chinese acoustic model and the English acoustic model respectively to obtain a Chinese output probability vector set and an English output probability vector set;

Perform language matching according to the Chinese output probability vector set and the English output probability vector set, and output the judgment result.
2. The neural network speech recognition method oriented to the home spoken environment according to claim 1, wherein the combined neural network DNN-LSTM model includes an input layer, a long short-term memory network, a second hidden layer, a third hidden layer, a fourth hidden layer and an output layer, with the long short-term memory network serving as the first hidden layer.
3. The neural network speech recognition method oriented to the home spoken environment according to claim 2, wherein the first hidden layer has 512 nodes and uses the sigmoid and tanh functions as activation functions, and the second, third and fourth hidden layers each have 1024 nodes and use the sigmoid function as the activation function.
4. The neural network speech recognition method oriented to the home spoken environment according to claim 3, wherein the Chinese feature vector set is used as the input of the DNN-LSTM model for iterative training, and the training steps comprise:

(1) Initialize the weight matrix W and the bias vector b in the model structure to random values;

(2) Iterate from the first iteration up to the maximum number of iterations; in each iteration, traverse from the first speech training sample to the last training sample;

(3) For each training sample, input the corresponding feature vector to the input layer; traverse from the first hidden layer to the output layer, using the forward propagation algorithm to compute the output of each layer being traversed, and then compute the output-layer error according to the loss function; after forward propagation completes, traverse from the fourth hidden layer back to the first hidden layer, using the back propagation algorithm to compute the error of each hidden layer being traversed;

(4) After back propagation completes, traverse from the first hidden layer to the output layer and update the weight matrix W^n and bias vector b^n of each corresponding layer, where n denotes the layer being traversed, n = 1, 2, 3, 4, 5. This completes training on one sample within one iteration; if not all samples have been traversed, continue with the next sample; if all samples have been traversed, proceed to the next iteration;

(5) When all the change values of W and b do not exceed the iteration threshold, stop the iteration loop;

(6) Save the optimal weight matrix W and bias vector b of each layer.
5. The neural network speech recognition method oriented to the home spoken environment according to claim 1, wherein performing language matching according to the Chinese output probability vector set and the English output probability vector set includes:

Use the information entropy formula to calculate the information entropy of the Chinese output probability vector set P and of the English output probability vector set P′, denoted H and H′ respectively; where P = {p_1, p_2, ..., p_q}, P′ = {p′_1, p′_2, ..., p′_t}, q is the total number of output categories of the Chinese acoustic model, and t is the total number of output categories of the English acoustic model;

If some p_i in the probability vector set output by the Chinese acoustic model is significantly larger than the other probability values, while the probability values in the probability vector set output by the English acoustic model differ little from one another, where 1≤i≤q;

And if the information entropy H obtained from the Chinese acoustic model is smaller than the information entropy H′ obtained from the English acoustic model, then the input speech signal voice0 of the unknown language is Chinese, and the output probability of the Chinese acoustic model is used as the final output result;

If some p′_j in the probability vector set output by the English acoustic model is significantly larger than the other probability values, while the probability values in the probability vector set output by the Chinese acoustic model differ little from one another, where 1≤j≤t;

And if the information entropy H′ obtained from the English acoustic model is smaller than the information entropy H obtained from the Chinese acoustic model, then the input speech signal voice0 of the unknown language is English, and the output probability of the English acoustic model is used as the final output result.
6. A system implementing the neural network speech recognition method oriented to the home spoken environment according to any one of claims 1-5, wherein the system comprises:

A model building module, configured to add a long short-term memory network to a deep neural network to build the combined neural network DNN-LSTM model;

A model training module, including a Chinese model training unit and an English model training unit. The Chinese model training unit is configured to preprocess the collected Chinese speech data set to obtain a Chinese feature vector set, and to use the Chinese feature vector set as the input of the DNN-LSTM model for iterative training to obtain the optimal Chinese acoustic model;

The English model training unit is configured to preprocess the collected English speech data set to obtain an English feature vector set, and to use the English feature vector set as the input of the DNN-LSTM model for iterative training to obtain the optimal English acoustic model;

A model testing module, including a voice input unit and a voice type judgment unit. The voice input unit is configured to pass an input speech signal voice0 in an unknown language through the Chinese acoustic model and the English acoustic model respectively, to obtain the Chinese output probability vector set and the English output probability vector set; the voice type judgment unit is configured to perform language matching according to the Chinese output probability vector set and the English output probability vector set, and to output the judgment result.