CN110858480B - Speech recognition method based on N-gram neural network language model - Google Patents

Speech recognition method based on N-gram neural network language model

Info

Publication number
CN110858480B
Authority
CN
China
Prior art keywords
language model
neural network
model
candidate
network language
Prior art date
Legal status
Active
Application number
CN201810928881.1A
Other languages
Chinese (zh)
Other versions
CN110858480A (en)
Inventor
张鹏远
张一珂
潘接林
颜永红
Current Assignee
Institute of Acoustics CAS
Beijing Kexin Technology Co Ltd
Original Assignee
Institute of Acoustics CAS
Beijing Kexin Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Institute of Acoustics CAS, Beijing Kexin Technology Co Ltd filed Critical Institute of Acoustics CAS
Priority to CN201810928881.1A
Publication of CN110858480A
Application granted
Publication of CN110858480B
Active legal status
Anticipated expiration

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/16 Speech classification or search using artificial neural networks
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training

Landscapes

  • Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a speech recognition method based on an N-gram neural network language model, which comprises the following steps: step 1) establishing and training an n-th-order n-gram neural network language model; step 2) for each test utterance u, selecting the K highest-scoring candidate results with a recognizer; recomputing the language-model scores of the K candidate results based on the trained n-th-order n-gram neural network language model; then recomputing the scores of the K candidate results and selecting the highest-scoring candidate as the final recognition result of the test utterance u. Both the performance and the computational efficiency of the speech recognition method surpass those of speech recognition methods based on an RNN language model.

Description

Speech recognition method based on N-gram neural network language model
Technical Field
The invention relates to the fields of speech recognition and natural language processing, and in particular to a speech recognition method based on an N-gram neural network language model.
Background
A language model (LM) is a mathematical model that describes the probability distribution of word sequences, and it plays an important role in applications related to natural language processing. With the development of deep learning, language-model modeling based on deep neural networks (DNNs) has shown great potential in a series of tasks such as speech recognition, machine translation and text generation.
Relevant studies have shown that the performance of a neural network language model depends heavily on the specific model structure. The currently mainstream neural network structures include the standard DNN, the convolutional neural network (CNN) and the recurrent neural network (RNN). The DNN model is generally used for simple classification tasks. The CNN model can model high-dimensional data and capture correlations among the data dimensions, and is mainly used for tasks such as image processing. The RNN model, by contrast, can efficiently compress historical information through its recurrent connections and is therefore well suited to modeling sequential data. Because natural sentences inherently have a strong temporal structure and are typical sequential data, the RNN model is widely applied to natural-language tasks.
At present, RNN-based language models achieve markedly better performance than statistical language models and DNN/CNN language models on a series of tasks such as speech recognition and machine translation. In the speech recognition task, however, the RNN model still has the following problems: 1) when unrolled in time, its recurrent structure resembles a deep DNN model, and deep networks suffer from vanishing and exploding gradients during training, which makes the RNN model difficult to train and limits its performance; 2) compared with forward network structures such as the DNN and CNN, the RNN structure has a low degree of parallelism and cannot be parallelized along the time axis, so the time complexity of RNN computation is high, making it difficult to apply in speech recognition systems with strict real-time requirements, such as voice input methods and smart speakers.
Disclosure of Invention
The invention aims to overcome the above technical defects by providing a speech recognition method based on an N-gram neural network language model that simplifies the structure of the neural network language model, reduces its computational complexity and increases its degree of parallelism without degrading the performance of the speech recognition system.
In order to achieve the above object, the present invention provides a speech recognition method based on an N-gram neural network language model, the method comprising:
step 1) establishing and training an n-th-order n-gram neural network language model;
step 2) for each test utterance u, selecting the K highest-scoring candidate results with a recognizer; recomputing the language-model scores of the K candidate results based on the trained n-th-order n-gram neural network language model; then recomputing the scores of the K candidate results and selecting the highest-scoring candidate as the final recognition result of the test utterance u.
As an improvement of the above method, the step 1) specifically includes:
step 1-1) given a sentence l = w_1,…,w_M containing M words in the training set S, the input to the n-th-order n-gram neural network language model when predicting the word w_i, 1 ≤ i ≤ M, is the n preceding words w_{i−n},…,w_{i−1}; their one-hot encodings are mapped through the matrix C ∈ R^{V×h} by a table-lookup operation to the low-dimensional representation vectors e_{i−k} ∈ R^h, k = 1,…,n, wherein V denotes the size of the vocabulary and is the number of rows of the matrix C, and h is the number of columns of the matrix C;
step 1-2) the final feature representation of the word w_{i−k} is obtained by an affine transformation: f_{i−k} = σ(F_k·e_{i−k}), wherein σ denotes a nonlinear function and the matrix F_k ∈ R^{h×h};
step 1-3) the history information vector is computed as h_i = Σ_{k=1}^{n} f_{i−k};
step 1-4) an implicit representation vector of the history information is obtained by an affine transformation of the history information vector h_i: H_i = σ(H·h_i), wherein the matrix H ∈ R^{h×h};
step 1-5) H_i is probability-normalized through an affine transformation and the softmax function to obtain the probability distribution of the word w_i to be predicted: y_i = g(W^T·H_i), wherein g(·) denotes the softmax function and the matrix W ∈ R^{h×V};
step 1-6) the cross entropy of y_i = (y_{i1},…,y_{iV}) and w_i = (w_{i1},…,w_{iV}) is computed as the loss function
L_i = −Σ_{d=1}^{V} w_{id}·log y_{id}
wherein y_{id} is the d-th component of y_i, w_{id} is the d-th component of w_i, and 1 ≤ d ≤ V;
step 1-7) for the n-th-order n-gram neural network language model, the partial history sums h_{i−k} (accumulating only the features of the input words preceding w_{i−k}) are substituted for h_i, and the auxiliary loss functions L_{i−k}, k = 1,…,n−1, are computed according to steps 1-4) to 1-6);
step 1-8) steps 1-6) to 1-7) yield n loss functions L_{i−n+1},…,L_i; the final optimization objective L̂_i of the word w_i to be predicted is
L̂_i = α·L_i + (1−α)·Σ_{k=1}^{n−1} L_{i−k}
wherein 0 ≤ α ≤ 1 is the weight;
step 1-9) the model parameters are updated on the training set by stochastic gradient descent according to
θ ← θ − λ·∂L̂_i/∂θ
wherein λ is the learning rate and θ denotes the model parameters, comprising the matrices C, F_k (k = 1,2,…,n), H and W; when the model parameters θ converge, the training of the n-th-order n-gram neural network language model is complete.
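The forward computation of steps 1-1) to 1-6) can be sketched in a few lines of NumPy. This is an illustrative toy, not the patented implementation: the sizes V, h and n, the sigmoid nonlinearity, the random weights and the word indices are all stand-in assumptions for the demonstration.

```python
import numpy as np

# Toy forward pass of an n-th-order feedforward n-gram NNLM (steps 1-1 to 1-6).
rng = np.random.default_rng(0)
V, h, n = 50, 8, 4                           # assumed vocabulary size, hidden size, order

sigma = lambda x: 1.0 / (1.0 + np.exp(-x))   # nonlinear function sigma (assumed sigmoid)

C = rng.normal(scale=0.1, size=(V, h))       # embedding (lookup) matrix C
F = rng.normal(scale=0.1, size=(n, h, h))    # per-position matrices F_k
H = rng.normal(scale=0.1, size=(h, h))       # history transform H
W = rng.normal(scale=0.1, size=(h, V))       # output matrix W

context = [3, 17, 42, 9]                     # assumed indices of w_{i-n},...,w_{i-1}
target = 11                                  # assumed index of the word w_i to predict

# step 1-1) one-hot times C is just row selection, i.e. a table lookup
e = C[context]                               # shape (n, h)
# step 1-2) per-word feature f_{i-k} = sigma(F_k e_{i-k})
f = np.stack([sigma(F[k] @ e[k]) for k in range(n)])
# step 1-3) history vector h_i = sum over the n features
h_i = f.sum(axis=0)
# step 1-4) implicit history representation H_i = sigma(H h_i)
H_i = sigma(H @ h_i)
# step 1-5) softmax over the vocabulary
logits = W.T @ H_i
y = np.exp(logits - logits.max())
y /= y.sum()
# step 1-6) cross-entropy loss against the one-hot target
L = -np.log(y[target])
```

Because multiplying a one-hot vector by C just selects a row of C, the embedding step reduces to table lookup, and the n per-position transforms F_k·e_{i−k} are independent of each other, which is what makes this forward structure easy to parallelize compared with an RNN.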
As an improvement of the above method, the step 2) specifically includes:
step 2-1) for each test utterance u, a number of candidate recognition results are obtained with a recognizer, the recognizer assigning the p-th candidate result u_p of the test utterance u the score S(u_p):
S(u_p) = a(u_p) + μ·l(u_p)
wherein a(u_p) is the acoustic-model score of the candidate result u_p, l(u_p) is the language-model score of the candidate result u_p, and μ is a language-model score coefficient; the K highest-scoring candidate results u_p, 1 ≤ p ≤ K, are selected;
step 2-2) the language-model score l̂(u_p) of each candidate result u_p, 1 ≤ p ≤ K, is recomputed with the trained n-gram neural network language model;
step 2-3) the score of each candidate result u_p, 1 ≤ p ≤ K, is recomputed as
Ŝ(u_p) = a(u_p) + μ·l̂(u_p);
step 2-4) the candidate result with the highest score Ŝ(u_p) is selected as the final recognition result of the test utterance u.
As an improvement of the above method, the step 2-2) is specifically:
for a candidate result u_p = w_{p1},…,w_{pM} containing M words, it is input into the trained n-th-order n-gram neural network language model to obtain the probability y_{pm} of each word w_{pm}; the language-model score of the candidate result u_p, 1 ≤ p ≤ K, is then
l̂(u_p) = Σ_{m=1}^{M} log y_{pm}
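The re-scoring of steps 2-1) to 2-4) is a linear re-combination of the acoustic score with the new language-model score, followed by an argmax. A minimal sketch, in which all scores and the coefficient μ are made-up illustrative numbers, not values from the patent:

```python
# Hypothetical scores for K = 3 candidates of one test utterance.
a     = [-120.0, -118.5, -121.0]   # acoustic-model scores a(u_p)
l_old = [ -35.0,  -39.0,  -33.0]   # first-pass language-model scores l(u_p)
l_new = [ -30.0,  -38.0,  -36.0]   # NNLM re-scored language-model scores
mu    = 1.0                        # language-model score coefficient (assumed)

first_pass = [ap + mu * lp for ap, lp in zip(a, l_old)]   # S(u_p)
rescored   = [ap + mu * lp for ap, lp in zip(a, l_new)]   # re-scored S(u_p)

best_first = first_pass.index(max(first_pass))
best_resc  = rescored.index(max(rescored))
print(best_first, best_resc)  # prints: 2 0
```

With these assumed numbers the first pass prefers candidate index 2, while the NNLM re-score promotes candidate index 0: the second pass can change the winner, which is the purpose of re-scoring.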
the invention has the following advantages:
1. the n-gram neural network language model adopts only a forward network structure, so the problems of vanishing and exploding gradients are effectively avoided;
2. compared with an RNN language model, the n-gram neural network language model reduces the computational complexity of the model and improves its parallel efficiency;
3. both the performance and the computational efficiency of the speech recognition method surpass those of speech recognition methods based on an RNN language model.
Drawings
FIG. 1 is a schematic structural diagram of an N-gram neural network language model according to the present invention;
FIG. 2 is a flowchart of a speech recognition method based on an N-gram neural network language model according to the present invention.
Detailed Description
The method of the present invention is described in detail below with reference to the accompanying drawings and examples.
As shown in fig. 1 and fig. 2, the present invention provides a speech recognition method based on an N-gram neural network language model, including:
For an n-th-order n-gram neural network language model:
1) Given a sentence l = w_1,…,w_M containing M words in the training set S, when predicting the word w_i (i = 1,2,…,M) the input to the model is the n words w_{i−n},…,w_{i−1} preceding the word w_i to be predicted; their one-hot encodings are mapped through the matrix C ∈ R^{V×h} by a table-lookup operation to the low-dimensional representation vectors e_{i−k} ∈ R^h, where V denotes the size of the vocabulary and typically h ≪ V.
2) The final feature representation of the word w_{i−k} is obtained by an affine transformation: f_{i−k} = σ(F_k·e_{i−k}), where σ denotes a nonlinear function and the matrix F_k ∈ R^{h×h}.
3) The history information vector is computed as h_i = Σ_{k=1}^{n} f_{i−k}.
4) An implicit representation vector of the history information is obtained by an affine transformation of h_i: H_i = σ(H·h_i), where σ denotes a nonlinear function and the matrix H ∈ R^{h×h}.
5) H_i is probability-normalized through an affine transformation and the softmax function to obtain the probability distribution of the word w_i to be predicted: y_i = g(W^T·H_i), where g(·) denotes the softmax function and the matrix W ∈ R^{h×V}.
6) The cross entropy of y_i and w_i is then computed as the loss function at time i:
L_i = −Σ_{d=1}^{V} w_{id}·log y_{id}.
7) For the n-th-order model, the partial history sums (accumulating only the features of the input words preceding w_{i−k}) are substituted for h_i in step 3), and the auxiliary loss functions L_{i−k} are then computed according to steps 4) to 6).
8) Steps 6) to 7) finally yield n loss functions L_{i−n+1},…,L_i. The invention takes L_i as the main optimization objective at time i and all the other loss functions L_{i−n+1},…,L_{i−1} as auxiliary optimization objectives. The final optimization objective takes the form
L̂_i = α·L_i + (1−α)·Σ_{k=1}^{n−1} L_{i−k}
where α (0 ≤ α ≤ 1) is the weight between the main and auxiliary optimization objectives.
9) The model is trained by stochastic gradient descent (SGD), i.e. the model parameters are updated according to
θ ← θ − λ·∂L̂_i/∂θ
where θ denotes the model parameters, i.e. the matrices C, F_k (k = 1,2,…,n), H and W described in the steps above, and λ is the learning rate, which controls the step size of the parameter update.
10) After model training is complete, in the test stage, for a sentence s = w_1,…,w_M containing M words, the probability y_m of each word w_m (m = 1,2,…,M) is computed in turn according to steps 1) to 6) above (at time m there is no need to compute the inputs of the auxiliary objective functions of step 7)). The probability of the sentence s is then obtained as
P(s) = Π_{m=1}^{M} y_m.
11) Re-scoring the recognition results: in a first decoding pass, for each test utterance u a number of candidate recognition results are obtained with a recognizer, the recognizer assigning the k-th candidate result u_k of the test utterance u the score S(u_k):
S(u_k) = a(u_k) + μ·l(u_k)
where a(u_k) is the acoustic-model score of the candidate result u_k, l(u_k) is the language-model score of the candidate result u_k, and μ is a language-model score coefficient; the K highest-scoring candidate results u_k, 1 ≤ k ≤ K, are selected.
In the re-scoring pass, the n-gram neural network language model is used to recompute the language-model score l̂(u_k) of u_k according to step 10), and the score of the candidate result u_k is recomputed as
Ŝ(u_k) = a(u_k) + μ·l̂(u_k).
The candidate result with the highest score Ŝ(u_k) is then selected as the final recognition result of the test utterance u.
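The sentence probability of step 10) is the product of the per-word probabilities, which in practice is accumulated in the log domain to avoid numerical underflow on long sentences. A minimal sketch with assumed per-word probability values (the numbers are illustrative only):

```python
import math

# Assumed per-word probabilities y_m for a 4-word sentence.
y = [0.12, 0.4, 0.05, 0.2]

log_p = sum(math.log(p) for p in y)   # log P(s) = sum_m log y_m
p_s = math.exp(log_p)                 # P(s) = product of the y_m
print(round(p_s, 6))                  # prints: 0.00048
```

The same log-domain sum is what serves as the language-model score during re-scoring, since the score enters the combination S = a + μ·l additively.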
Example:
This example illustrates an implementation of the invention on the Switchboard dataset and a performance comparison with an RNN language model.
The example selects a model order n = 20 and a vocabulary size V = 25,000. When predicting the word w_i, the inputs of the model are w_{i−20},…,w_{i−1}; first, each word w_{i−k} (k = 1,2,…,20) is expressed as a 25,000-dimensional one-hot vector. The hidden vector dimension is chosen as h = 300, so the dimension of the matrix C is 25,000 × 300, the dimensions of the matrices F_k and H are all 300 × 300, and the dimension of the matrix W is 300 × 25,000. The sigmoid function is chosen as the nonlinear function σ.
The table-lookup operation on the matrix C yields the low-dimensional word vector e_{i−k} of w_{i−k}, and the affine transformation f_{i−k} = σ(F_k·e_{i−k}) then yields the corresponding implicit feature representation of w_{i−k}. For any time i, the implicit features f_{i−20}, f_{i−19},…,f_{i−1} of the words preceding time i are accumulated to obtain the history information vector h_i of time i. The affine transformation H_i = σ(H·h_i) then yields the implicit feature representation H_i corresponding to h_i, and the affine transformation and softmax y_i = g(W^T·H_i) yield the probability distribution of the word w_i, wherein g(·) denotes the softmax function. The final optimization objective is then computed according to the formula of step 8) above, and the model parameters are updated according to the formula of step 9). The example takes a weight α = 0.5 and a learning rate λ = 1.0, and the training process comprises 50 iterations (epochs) in total.
After model training is complete, for a given sentence s = w_1,…,w_M the sentence probability P(s) is computed according to the formula of step 10). Then the top 100 candidate results of each test utterance are re-scored according to the formula of step 11), and the candidate with the highest re-scored score Ŝ is selected as the final recognition result of the test utterance u.
On the Switchboard dataset, the performance comparison between the n-gram neural network language model (NNLM) of the invention and an RNN language model is shown in Table 1. The number of parameters reflects the computational complexity of the model: in general, the more parameters, the higher the complexity. The real-time factor is normalized to the running time of the RNN model; this index reflects the computational efficiency of the model and, when the parameter counts of the models are comparable, can be regarded as parallelization efficiency. The lower the real-time factor, the higher the computational efficiency and degree of parallelism of the model. The results show that, compared with the current mainstream speech recognition method based on an RNN language model, the invention reduces the computational complexity of the neural network language model and improves its parallelization efficiency without any loss of recognition accuracy.
Table 1: performance comparison of different neural network language models on the Switchboard test set
Identifying error rates Amount of ginseng Real time rate
RNN 19.08% 10.5M 1
NNLM 18.97% 10M 0.73
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solution of the invention, not to limit it. Although the invention has been described in detail with reference to the embodiments, those skilled in the art will understand that various changes may be made and equivalents substituted without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (2)

1. A speech recognition method based on an N-gram neural network language model, the method comprising:
step 1) establishing and training an n-th-order n-gram neural network language model;
step 2) for each test utterance u, selecting the K highest-scoring candidate results with a recognizer; recomputing the language-model scores of the K candidate results based on the trained n-th-order n-gram neural network language model; then recomputing the scores of the K candidate results and selecting the highest-scoring candidate as the final recognition result of the test utterance u;
the step 2) specifically comprises the following steps:
step 2-1) for each test utterance u, obtaining a number of candidate recognition results with a recognizer, the recognizer assigning the p-th candidate result u_p of the test utterance u the score S(u_p):
S(u_p) = a(u_p) + μ·l(u_p)
wherein a(u_p) is the acoustic-model score of the candidate result u_p, l(u_p) is the language-model score of the candidate result u_p, and μ is a language-model score coefficient; selecting the K highest-scoring candidate results u_p, 1 ≤ p ≤ K;
step 2-2) recomputing the language-model score l̂(u_p) of each candidate result u_p, 1 ≤ p ≤ K, with the trained n-gram neural network language model;
step 2-3) recomputing the score of each candidate result u_p, 1 ≤ p ≤ K, as
Ŝ(u_p) = a(u_p) + μ·l̂(u_p);
step 2-4) selecting the candidate result with the highest score Ŝ(u_p) as the final recognition result of the test utterance u;
the step 2-2) is specifically:
for a candidate result u_p = w_{p1},…,w_{pM} containing M words, inputting it into the trained n-th-order n-gram neural network language model to obtain the probability y_{pm} of each word w_{pm}; the language-model score of the candidate result u_p, 1 ≤ p ≤ K, is then
l̂(u_p) = Σ_{m=1}^{M} log y_{pm}
2. The speech recognition method based on the N-gram neural network language model according to claim 1, wherein the step 1) specifically comprises:
step 1-1) given a sentence l = w_1,…,w_M containing M words in the training set S, the input to the n-th-order n-gram neural network language model when predicting the word w_i, 1 ≤ i ≤ M, is the n preceding words w_{i−n},…,w_{i−1}; their one-hot encodings are mapped through the matrix C ∈ R^{V×h} by a table-lookup operation to the low-dimensional representation vectors e_{i−k} ∈ R^h, k = 1,…,n, wherein V denotes the size of the vocabulary and is the number of rows of the matrix C, and h is the number of columns of the matrix C;
step 1-2) the final feature representation of the word w_{i−k} is obtained by an affine transformation: f_{i−k} = σ(F_k·e_{i−k}), wherein σ denotes a nonlinear function and the matrix F_k ∈ R^{h×h};
step 1-3) the history information vector is computed as h_i = Σ_{k=1}^{n} f_{i−k};
step 1-4) an implicit representation vector of the history information is obtained by an affine transformation of the history information vector h_i: H_i = σ(H·h_i), wherein the matrix H ∈ R^{h×h};
step 1-5) H_i is probability-normalized through an affine transformation and the softmax function to obtain the probability distribution of the word w_i to be predicted: y_i = g(W^T·H_i), wherein g(·) denotes the softmax function and the matrix W ∈ R^{h×V};
step 1-6) the cross entropy of y_i = (y_{i1},…,y_{iV}) and w_i = (w_{i1},…,w_{iV}) is computed as the loss function
L_i = −Σ_{d=1}^{V} w_{id}·log y_{id}
wherein y_{id} is the d-th component of y_i, w_{id} is the d-th component of w_i, and 1 ≤ d ≤ V;
step 1-7) the partial history sums h_{i−k} (accumulating only the features of the input words preceding w_{i−k}) are substituted for h_i, and the n−1 auxiliary loss functions L_{i−k}, k = 1,…,n−1, are computed according to steps 1-4) to 1-6);
step 1-8) the final optimization objective L̂_i of the word w_i to be predicted is computed as
L̂_i = α·L_i + (1−α)·Σ_{k=1}^{n−1} L_{i−k}
wherein 0 ≤ α ≤ 1 is the weight;
step 1-9) the model parameters are updated on the training set by stochastic gradient descent according to
θ ← θ − λ·∂L̂_i/∂θ
wherein λ is the learning rate and θ denotes the model parameters, comprising the matrices C, F_k (k = 1,2,…,n), H and W; when the model parameters θ converge, the training of the n-th-order n-gram neural network language model is complete.
CN201810928881.1A 2018-08-15 2018-08-15 Speech recognition method based on N-gram neural network language model Active CN110858480B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810928881.1A CN110858480B (en) 2018-08-15 2018-08-15 Speech recognition method based on N-gram neural network language model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810928881.1A CN110858480B (en) 2018-08-15 2018-08-15 Speech recognition method based on N-gram neural network language model

Publications (2)

Publication Number Publication Date
CN110858480A CN110858480A (en) 2020-03-03
CN110858480B 2022-05-17

Family

ID=69635973

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810928881.1A Active CN110858480B (en) 2018-08-15 2018-08-15 Speech recognition method based on N-gram neural network language model

Country Status (1)

Country Link
CN (1) CN110858480B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111554276B (en) * 2020-05-15 2023-11-03 深圳前海微众银行股份有限公司 Speech recognition method, device, equipment and computer readable storage medium
CN112037773B (en) * 2020-11-05 2021-01-29 北京淇瑀信息科技有限公司 N-optimal spoken language semantic recognition method and device and electronic equipment
CN113380228A (en) * 2021-06-08 2021-09-10 北京它思智能科技有限公司 Online voice recognition method and system based on recurrent neural network language model
CN117316143A (en) * 2023-11-30 2023-12-29 深圳市金大智能创新科技有限公司 Method for human-computer interaction based on virtual person

Citations (3)

Publication number Priority date Publication date Assignee Title
CN106486115A (en) * 2015-08-28 2017-03-08 Kabushiki Kaisha Toshiba Method and apparatus for improving a neural network language model, and speech recognition method and apparatus
CN106803422A (en) * 2015-11-26 2017-06-06 Institute of Acoustics CAS Language model re-scoring method based on a long short-term memory network
CN108062954A (en) * 2016-11-08 2018-05-22 iFLYTEK Co., Ltd. Speech recognition method and device

Family Cites Families (2)

Publication number Priority date Publication date Assignee Title
US20140067394A1 (en) * 2012-08-28 2014-03-06 King Abdulaziz City For Science And Technology System and method for decoding speech
US10176799B2 (en) * 2016-02-02 2019-01-08 Mitsubishi Electric Research Laboratories, Inc. Method and system for training language models to reduce recognition errors


Non-Patent Citations (1)

Title
A Neural Probabilistic Language Model; Yoshua Bengio et al.; Journal of Machine Learning Research 3; 2003-02-03; pp. 1137-1155 *

Also Published As

Publication number Publication date
CN110858480A (en) 2020-03-03

Similar Documents

Publication Publication Date Title
CN110858480B (en) Speech recognition method based on N-gram neural network language model
Audhkhasi et al. Direct acoustics-to-word models for english conversational speech recognition
US10929744B2 (en) Fixed-point training method for deep neural networks based on dynamic fixed-point conversion scheme
CN111145728B (en) Speech recognition model training method, system, mobile terminal and storage medium
CN108804611B (en) Dialog reply generation method and system based on self-critical sequence learning
WO2017135334A1 (en) Method and system for training language models to reduce recognition errors
CN111429889A (en) Method, apparatus, device and computer readable storage medium for real-time speech recognition based on truncated attention
CN108831445A (en) Sichuan dialect recognition method, acoustic model training method, device and equipment
CN111210807B (en) Speech recognition model training method, system, mobile terminal and storage medium
Huang et al. SNDCNN: Self-normalizing deep CNNs with scaled exponential linear units for speech recognition
US20180068652A1 (en) Apparatus and method for training a neural network language model, speech recognition apparatus and method
CN104538028A (en) Continuous speech recognition method based on a deep long short-term memory recurrent neural network
CN111145729A (en) Speech recognition model training method, system, mobile terminal and storage medium
Bai et al. A Time Delay Neural Network with Shared Weight Self-Attention for Small-Footprint Keyword Spotting.
CN110085215A (en) Language model data augmentation method based on a generative adversarial network
CN113255366B (en) Aspect-level text emotion analysis method based on heterogeneous graph neural network
CN108549703A (en) Training method for a Mongolian language model based on a recurrent neural network
CN114925195A (en) Standard content text abstract generation method integrating vocabulary coding and structure coding
CN116578699A (en) Sequence classification prediction method and system based on Transformer
CN113806543B (en) Text classification method using a gated recurrent unit with residual skip connections
US20180061395A1 (en) Apparatus and method for training a neural network auxiliary model, speech recognition apparatus and method
CN109670171B (en) Word vector representation learning method based on word pair asymmetric co-occurrence
Madhavaraj et al. Data-pooling and multi-task learning for enhanced performance of speech recognition systems in multiple low resourced languages
CN115376547B (en) Pronunciation evaluation method, pronunciation evaluation device, computer equipment and storage medium
CN111210815A (en) Deep neural network construction method for voice command word recognition, and recognition method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant