CN110211574B - Method for establishing voice recognition model based on bottleneck characteristics and multi-scale multi-head attention mechanism - Google Patents


Info

Publication number: CN110211574B
Application number: CN201910477034.2A
Authority: CN (China)
Prior art keywords: network, layer, output, bottleneck, model
Legal status: Active (granted)
Other versions: CN110211574A
Other languages: Chinese (zh)
Inventors: 韩纪庆, 唐海桃, 郑铁然, 郑贵滨
Current assignee: Harbin Institute of Technology
Original assignee: Harbin Institute of Technology
Legal events: application filed by Harbin Institute of Technology; priority to CN201910477034.2A; publication of CN110211574A; application granted; publication of CN110211574B


Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/02 - Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 - Training
    • G10L15/08 - Speech classification or search
    • G10L15/16 - Speech classification or search using artificial neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Image Analysis (AREA)

Abstract

A method for building a speech recognition model based on bottleneck features and a multi-scale multi-head attention mechanism, belonging to the field of modeling methods. Conventional attention models suffer from poor recognition performance and a single attention scale. In the proposed method, a deep belief network extracts bottleneck features at the front end, which increases the robustness of the model; the back end uses a multi-scale multi-head attention model composed of convolution kernels of different scales to model speech units at the phoneme, syllable and word levels, computing the hidden-state sequence and output sequence of the recurrent neural network position by position. The decoding network corresponding to each head's attention network computes the elements of its output sequence at each position, and finally all the head outputs are collected and integrated into a new output sequence. The invention can improve the recognition performance of a speech recognition system.

Description

Method for establishing voice recognition model based on bottleneck characteristics and multi-scale multi-head attention mechanism
Technical Field
The invention relates to a training model in the technical field of speech recognition, and in particular to an improved method that raises recognition performance by extracting bottleneck features to increase model robustness and by building a multi-scale multi-head model that models speech units at the phoneme, syllable and word levels.
Background
Speech signals are among the most common signals in human society and are an important medium for people to express, communicate and spread information. In the current era of information explosion, massive volumes of speech data are constantly generated over the internet and telephone channels, and Automatic Speech Recognition (ASR) is increasingly needed to recognize, classify and retrieve large-scale speech signals efficiently.

Compared with a traditional speech recognition system based on the Hidden Markov Model (HMM), an end-to-end system integrates the acoustic model, the pronunciation dictionary and the language model into a single neural network, making the model more concise and its parameters easier to optimize directly. The end-to-end model is also more general: it reduces the dependence on specialist linguistic knowledge and lowers the difficulty of building a speech recognition system. Moreover, its training data need no alignment information, so annotated speech can be used for training directly. The attention model drops the independence assumption on model outputs and quantifies the alignment between phonemes and features through attention coefficients, which gives the model a degree of interpretability.

This patent develops its research on end-to-end speech recognition from the attention model and focuses on the long training period and poor recognition performance of current attention-based speech recognition systems. The attention model contains a large number of recurrent-network units, so its parameter scale is large and its internal structure complex, which makes training inefficient; adding hardware computing resources only partly alleviates this, and addressing it at the algorithm level remains a hard problem. The attention model also abandons phonetic prior knowledge entirely, so it lacks effective initialization parameters and its parameters converge slowly, which is another important reason for the long training time; how to integrate the model with phonetic features is therefore very meaningful work. The gap in performance between the attention model and conventional models has several causes: insufficient training due to limited corpus resources, inaccurate alignment between the generated features and the phonemes (graphemes), and poor robustness of the model to noise. In view of these problems, the model needs to be studied in depth to find a reasonable solution.
Disclosure of Invention
The invention aims to solve the poor recognition performance and single attention scale of existing attention models, and provides a method for building a speech recognition model based on bottleneck features and a multi-scale multi-head attention mechanism.
A method for establishing a speech recognition model based on bottleneck characteristics and a multi-scale multi-head attention mechanism comprises the following steps:
step one, use the input sample FBank speech feature vectors X = (x_1, x_2, ..., x_T) to train the RBM networks in the DBN without supervision, obtaining the first three connection weight matrices W_1, W_2, W_3 of the coding network as their initialization; these three connection weight matrices, together with a randomly initialized output layer W_4, form the DBN-based bottleneck feature extraction network at the front end of the coding network; RBM stands for Restricted Boltzmann Machine; DBN stands for Deep Belief Network; FBank denotes filter-bank features; in the sample FBank feature vector X = (x_1, x_2, ..., x_T), x_1 is the FBank feature frame at the first time step, x_2 the frame at the second time step, and x_T the frame at the T-th time step, and these frames together form the speech feature sequence; FBank features are the Mel filter-bank outputs, i.e. Mel-Frequency Cepstral Coefficients (MFCC) without the discrete cosine transform applied;
step two, input the sample FBank speech feature vectors X = (x_1, x_2, ..., x_T) into the four-layer bottleneck feature extraction network of step one and obtain the network output v by forward computation; then, using the network output v and the actual labels of the sample training data X, train the bottleneck feature extraction network with supervision by the BP algorithm, computing the gradients of the weights W_l and biases b_l of each of the four layers and updating the weights W_l and biases b_l iteratively; the supervised training of step two makes the computation faster and more accurate;
step three, remove the last output layer of the four-layer bottleneck feature extraction network of step two, keep only the first three layers as the front end of the coding network, and take the bottleneck feature sequence (v_1, v_2, ..., v_n) output by this front end as the input of the RNN; here the RNN is the back end of the coding network and consists of two groups of GRU-based multi-layer RNNs, a forward network and a backward network;
the input FBank speech feature vectors X pass through the front-end bottleneck feature extraction network of the coding network to give the bottleneck feature sequence (v_1, v_2, ..., v_n), which the GRU-based multi-layer RNN at the back end of the encoder finally outputs as the high-level feature sequence (h_1, h_2, ..., h_T);
the coding network consists of the bottleneck feature extraction network of step one and the recurrent neural network of step three, and the bottleneck feature extraction network is called the front end of the coding network; GRU stands for Gated Recurrent Unit; RNN stands for Recurrent Neural Network;
step four, with the multi-scale multi-head attention mechanism, feed the high-level feature sequence (h_1, h_2, ..., h_T) output by the back end of the coding network into convolution kernels F_i of different scales, modelling speech units at the phoneme, syllable and word levels to obtain a target vector ct_o for each head, which eases the back-end pattern classification; then take each head's ct_o as the input of that head's decoding RNN in the multi-scale multi-head attention network, computing the RNN hidden-state sequence (s_1, s_2, ..., s_o) and the output phoneme sequence (y_1, y_2, ..., y_o) position by position;
step five, splice the output sequences (y_1, y_2, ..., y_o) computed by the decoding RNN corresponding to each head's attention network in the multi-scale multi-head attention network to obtain the new output sequences (y_1, y_2, ..., y_o)_i; integrate the new output sequences with a DNN into the final output sequence (Y_1, Y_2, ..., Y_o). Here the output y_o of each head's decoding network is obtained from the target vector ct_o generated by that head's attention network and the element y_{o-1} at position o-1.
The invention has the beneficial effects that:
the invention provides a speech recognition method based on bottleneck features and a multi-scale multi-head attention mechanism. The method uses a DBN to extract bottleneck features as the front end of the coding network, which increases the robustness of the speech recognition system, and uses for the decoding network a multi-scale multi-head attention model composed of convolution kernels of different scales, so that speech units at different levels such as phonemes, syllables and words can each be modelled, improving the recognition performance of the speech recognition system.
Speech recognition based on the multi-scale multi-head attention model designed by this method outperforms conventional models in phoneme sequence recognition and has stronger temporal modelling and discriminative power. The designed speech recognition training model improves recognition accuracy and reduces the system's misrecognition rate and rejection rate.
Drawings
FIG. 1 is a multi-scale multi-head attention model incorporating a bottleneck feature extraction network;
FIG. 2 is a DBN-based bottleneck feature extraction network;
Detailed Description
The first embodiment:
The method for building the speech recognition model based on bottleneck features and the multi-scale multi-head attention mechanism of this embodiment comprises the following steps:
step one, use the input sample FBank speech feature vectors X = (x_1, x_2, ..., x_T) to train the RBM networks in the DBN without supervision, obtaining the first three connection weight matrices W_1, W_2, W_3 of the coding network as their initialization; these three connection weight matrices, together with a randomly initialized output layer W_4, form the DBN-based bottleneck feature extraction network at the front end of the coding network; RBM stands for Restricted Boltzmann Machine; DBN stands for Deep Belief Network; FBank denotes filter-bank features; in the sample FBank feature vector X = (x_1, x_2, ..., x_T), x_1 is the FBank feature frame at the first time step, x_2 the frame at the second time step, and x_T the frame at the T-th time step, and these frames together form the speech feature sequence; FBank features are the Mel filter-bank outputs, i.e. Mel-Frequency Cepstral Coefficients (MFCC) without the discrete cosine transform applied;
step two, input the sample FBank speech feature vectors X = (x_1, x_2, ..., x_T) into the four-layer bottleneck feature extraction network of step one and obtain the network output v by forward computation; then, using the network output v and the actual labels of the sample training data X, train the bottleneck feature extraction network with supervision by the BP algorithm, computing the gradients of the weights W_l and biases b_l of each of the four layers and updating the weights W_l and biases b_l iteratively; the supervised training of step two makes the computation faster and more accurate;
step three, remove the last output layer of the four-layer bottleneck feature extraction network of step two, keep only the first three layers as the front end of the coding network, and take the bottleneck feature sequence (v_1, v_2, ..., v_n) output by this front end as the input of the RNN; here the RNN is the back end of the coding network and consists of two groups of GRU-based multi-layer RNNs, a forward network and a backward network;
the input FBank speech feature vectors X pass through the front-end bottleneck feature extraction network of the coding network to give the bottleneck feature sequence (v_1, v_2, ..., v_n), which the GRU-based multi-layer RNN at the back end of the encoder finally outputs as the high-level feature sequence (h_1, h_2, ..., h_T);
the coding network consists of the bottleneck feature extraction network of step one and the recurrent neural network of step three, and the bottleneck feature extraction network is called the front end of the coding network; GRU stands for Gated Recurrent Unit; RNN stands for Recurrent Neural Network;
step four, with the multi-scale multi-head attention mechanism, feed the high-level feature sequence (h_1, h_2, ..., h_T) output by the back end of the coding network into convolution kernels F_i of different scales, modelling speech units at the phoneme, syllable and word levels to obtain a target vector ct_o for each head, which eases the back-end pattern classification; then take each head's ct_o as the input of that head's decoding RNN in the multi-scale multi-head attention network, computing the RNN hidden-state sequence (s_1, s_2, ..., s_o) and the output phoneme sequence (y_1, y_2, ..., y_o) position by position;
step five, splice the output sequences (y_1, y_2, ..., y_o) computed by the decoding RNN corresponding to each head's attention network in the multi-scale multi-head attention network to obtain the new output sequences (y_1, y_2, ..., y_o)_i; integrate the new output sequences with a DNN into the final output sequence (Y_1, Y_2, ..., Y_o). Here the output y_o of each head's decoding network is obtained from the target vector ct_o generated by that head's attention network and the element y_{o-1} at position o-1.
The second embodiment:
Different from the first embodiment, in step one of this embodiment's method the input speech feature vectors X = (x_1, x_2, ..., x_T) use 40-dimensional FBank features plus energy, spliced with the corresponding first- and second-order differences, 123 dimensions in total; the extracted features are first normalized over the training set so that each component follows a standard normal distribution, and the features of the test set and development set are then normalized with the training-set normalization parameters.
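To make this front end concrete, the following is a minimal Python sketch of the 123-dimensional feature pipeline described in this embodiment (40 log-Mel FBank coefficients plus energy, spliced with first- and second-order differences, then normalized with training-set statistics only). It assumes librosa is available; the 25 ms window, 10 ms hop and the helper names are illustrative assumptions, not part of the patent.

    import numpy as np
    import librosa

    def fbank_123(signal, sr=16000):
        # 40 Mel filter-bank channels; window/hop lengths are assumed values
        mel = librosa.feature.melspectrogram(y=signal, sr=sr, n_mels=40,
                                             n_fft=int(0.025 * sr),
                                             hop_length=int(0.010 * sr))
        logmel = np.log(mel + 1e-10)                              # (40, T)
        energy = np.log(mel.sum(axis=0, keepdims=True) + 1e-10)   # (1, T)
        static = np.vstack([logmel, energy])                      # 41 static dims
        d1 = librosa.feature.delta(static, order=1)               # first-order difference
        d2 = librosa.feature.delta(static, order=2)               # second-order difference
        return np.vstack([static, d1, d2]).T                      # (T, 123)

    def normalise(train_feats, other_feats):
        # standard-normal normalization with training-set statistics only;
        # the same mu/sigma are reused for the development and test sets
        mu, sigma = train_feats.mean(axis=0), train_feats.std(axis=0) + 1e-8
        return (train_feats - mu) / sigma, (other_feats - mu) / sigma

Normalizing with training-set statistics only, as the embodiment requires, avoids leaking development- or test-set information into the features.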
The third embodiment:
Different from the first or second embodiment, in steps one and two of this embodiment's method the RBM networks in the DBN undergo an unsupervised training process, where training the RBM network combines unsupervised pre-training with supervised training by the error back-propagation algorithm; the input of the RBM network is the FBank speech features, its output layer is a softmax layer, and each output-layer unit corresponds to the posterior probability of one triphone state; three hidden layers lie between the input layer and the output layer, the second hidden layer is the bottleneck layer, and its number of state units is smaller than that of the other hidden layers.
The fourth embodiment:
Different from the first or second embodiment, in step one of this embodiment's method the process of unsupervised training of the RBM networks with the input FBank speech features is as follows:
Step 1.1: on the basis of a 4-layer DBN, add a classification-label layer that outputs the phoneme states, giving the 5-layer bottleneck feature extraction RBM network, whose input is the speech features; the 4-layer DBN is formed by stacking 4 groups of single-layer RBMs, of which the first layer is a Gaussian-Bernoulli RBM and the other three layers are Bernoulli-Bernoulli RBMs;
Step 1.2: in an RBM there are no connections among the neurons within the visible layer or within the hidden layer, so given one layer the units of the other are conditionally independent, and the posterior probabilities p(v|h) and p(h|v) can both be computed conveniently; the posterior probability p(h|v) of a Bernoulli-Bernoulli RBM is computed as:
p(h|v) = exp(-E(v,h)) / Σ_h' exp(-E(v,h')),  with E(v,h) = -a^T v - b^T h - h^T W v   (1)
where E(v,h) is the energy of the single-layer RBM, v is the visible-layer vector, a is the visible-layer bias vector, h is the hidden-layer vector, b is the hidden-layer bias vector, W is the weight matrix connecting the visible and hidden layers, and W_{i,*} v denotes the product of the i-th row of W with v; deriving from equation (1) shows that, with the visible layer fixed, the components of the hidden-layer state are independent of one another;
Step 1.3: for a Bernoulli-Bernoulli RBM, each component h_i of the hidden-layer neuron state takes only the values 0 and 1, and:
p(h_i = 1 | v) = σ(W_{i,*} v + b_i)   (2)
wherein σ (·) represents a Sigmoid function;
the vector of hidden layer states can be represented by the following formula:
p(h=1|v)=σ(Wv+b) (3)
from the symmetry relationship, under known hidden layer states, the visible layer states of a single-layer bernoulli-bernoulli RBM are:
p(v = 1 | h) = σ(W^T h + a)   (4)
Step 1.4: for the Gaussian-Bernoulli RBM, whose visible layer follows a Gaussian distribution, the posterior probability p(h = 1 | v) is computed as in equation (3), and p(v|h) is computed as:
p(v | h) = N(v; W^T h + a, I)   (5)
where I denotes the identity matrix;
Step 1.5: the stacking process of the coding DBN network is as follows:
first, the Gaussian-Bernoulli RBM is trained, because the input speech features at the first layer are real-valued and match a Gaussian distribution; then the activation probabilities of its hidden units are taken as the visible-layer input of the next, Bernoulli-Bernoulli RBM; the hidden state of the second-layer RBM is taken as the input of the third-layer RBM, and all remaining layers follow by analogy;
under a given training data set V, the parameters of a single-layer RBM are trained by minimizing the negative log-likelihood with gradient descent:
min_θ  -Σ_{v∈V} log p(v; θ)   (6)
during training, the derivative of the negative log-likelihood generally has the form:
-∂log p(v)/∂θ = ⟨∂E(v,h)/∂θ⟩_data - ⟨∂E(v,h)/∂θ⟩_model   (7)
where θ denotes the RBM parameters, which determine the distribution of the visible-layer vector v, and ⟨ξ⟩_data and ⟨ξ⟩_model denote the expectation of a random variable ξ estimated from the sample data and from the model, respectively; for the weights of the connection matrix in a single-layer RBM this gives:
∂log p(v)/∂W_{ij} = ⟨v_i h_j⟩_data - ⟨v_i h_j⟩_model   (8)
Taking the Bernoulli-Bernoulli RBM as an example, the expectation ⟨v_i h_j⟩_data is the frequency with which the visible-layer state v_i and the hidden-layer state h_j are 1 simultaneously in the training data, and the expectation ⟨v_i h_j⟩_model is computed over the distribution of the model generated by the RBM.
Step six, training the DBN:
firstly, initializing a first layer RBM by using a Contrast Divergence (CD) algorithm according to input sample data; wherein, in the CD algorithm, a model is estimated by using a Sample obtained by Gibbs sampling (Gibbs Sample), and the Gibbs sampling process in the CD algorithm specifically comprises the following steps:
1) the Gibbs sampler initializes the visible layer with a data sample to obtain the visible-layer state v;
2) the posterior probability p(h|v) is computed with equation (3) to give the hidden-layer state h;
next, the visible-layer state v is sampled and reconstructed with equation (4) or equation (5), according to the RBM type; then the parameters of the first-layer RBM are fixed, the hidden-layer state of this RBM is computed from the sample data and taken as the visible-layer state of the second-layer RBM, and at this point the second-layer RBM is trained with the CD algorithm;
this process is then repeated, and the weight parameters connecting the DBN layers are finally obtained one by one;
finally, after training only the weight matrices connecting the layers are kept, as the initialization parameters of the DBN transfer matrices.
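As an illustration of step 1.6, the following NumPy sketch performs one CD-1 update for a Bernoulli-Bernoulli RBM, following equations (3), (4) and (8): a single Gibbs step reconstructs the visible layer, and the weight gradient is the difference between the data and model statistics ⟨v h⟩. The layer sizes, learning rate and function names are assumptions, not part of the patent.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def cd1_update(v0, W, a, b, lr=0.01, rng=np.random.default_rng(0)):
        # positive phase: hidden posterior given the data, equation (3)
        ph0 = sigmoid(v0 @ W.T + b)                # p(h = 1 | v0)
        h0 = (rng.random(ph0.shape) < ph0) * 1.0   # sample binary hidden states
        # negative phase: one Gibbs step, equation (4) then equation (3)
        pv1 = sigmoid(h0 @ W + a)                  # reconstructed p(v = 1 | h0)
        ph1 = sigmoid(pv1 @ W.T + b)
        # gradient estimate <v h>_data - <v h>_model, equation (8)
        batch = v0.shape[0]
        W += lr * (ph0.T @ v0 - ph1.T @ pv1) / batch
        a += lr * (v0 - pv1).mean(axis=0)
        b += lr * (ph0 - ph1).mean(axis=0)
        return W, a, b

Stacking per step 1.5 then amounts to calling this layer by layer, feeding each trained layer's hidden activation probabilities to the next RBM as its visible data.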
The fifth embodiment:
Different from the first or second embodiment, in step two of this embodiment's method the BP algorithm performs supervised training on the network: after the network has been trained without supervision on the sample input data by means of the RBM technique, the first 3 connection weight matrices of the coding network are initialized, and the trained weight matrices can partly reflect and "memorize" the effective information in the sample data. To further strengthen the network's ability to extract and discriminate speech features, and to give the bottleneck features used by the subsequent attention model stronger robustness and discriminative power, the speech samples are labelled and supervised training is carried out on top of the unsupervised training; the specific process is as follows:
Step 2.1: the DBN weights are migrated directly into a Deep Neural Network (DNN); fine-tuning then starts from these known partial DNN model parameters. With the first 3 connection weight matrices W_1, W_2, W_3 of the network known, these three layers and one randomly initialized output layer W_4 form the DBN-based bottleneck feature extraction network at the front end of the coding network; activation functions are added between layers, and the forward pass of the network is computed:
let f(·) denote the activation function; the 1st hidden-layer state h_1 and the l-th hidden-layer state h_l are, respectively:
h_1 = f(W_1 x + b_1)   (9)
h_l = f(W_l h_{l-1} + b_l)   (10)
from the 3rd hidden-layer state h_3, the final output vector v of the network is obtained through the softmax function:
v = softmax(W_4 h_3 + b_4)   (11)
when the network is trained in this way, a fixed number of samples are randomly drawn from the training sample set each time, the features of the drawn samples are fed into the network, and the network output is obtained by forward computation;
Step 2.2: the gradients of each layer's parameters are computed with the BP algorithm from the network output and the sample training data, and the model parameters are then updated:
the DBN network training criterion typically uses the minimum Cross Entropy (CE) function:
J_CE = -Σ_{m=1}^{M} Σ_i t_i^(m) log v_i^(m)   (12)
where t^(m) denotes the target label vector of sample m and v^(m) the corresponding network output; the network parameters are then fine-tuned with the back-propagation algorithm (BP), whose general model-parameter updates are:
W_{t+1}^l = W_t^l - μ ΔW_t^l   (13)
b_{t+1}^l = b_t^l - μ Δb_t^l   (14)
where W_t^l and b_t^l are the weight matrix and bias vector of layer l of the network after t parameter updates, μ is the learning rate, and ΔW_t^l and Δb_t^l are the gradients of the weight matrix and the bias vector, computed from the training samples as first derivatives:
ΔW_t^l = (1/M) Σ_{m=1}^{M} ∂J_CE^(m)/∂W_t^l   (15)
Δb_t^l = (1/M) Σ_{m=1}^{M} ∂J_CE^(m)/∂b_t^l   (16)
where M is the number of samples used in each training step;
Step 2.3: steps 2.1 and 2.2 are carried out cyclically, and the error between the network output and the model's target output, compared against a preset threshold, is taken as the sign of whether training has finished:
first, during training the error between the network output and the reference output is usually computed on a development set for cross-validation;
then, to obtain the bottleneck feature extraction network on this basis, the state units, transfer matrix and bias vector of the last output layer of the trained network are all deleted, the remainder is kept as the front end of the coding network, and the state of the bottleneck layer is taken as the output of the coding-network front end; to extract the bottleneck features, the original speech feature parameters are used as network input and the output vector of the bottleneck layer is computed layer by layer through the network; this vector is the bottleneck feature parameter. Because the RBM is used to initialize the model parameters during network training, and because of the alignment information provided by the GMM-HMM model during supervised training, the extracted bottleneck features discriminate different phoneme states better and are more robust than conventional features. Applying the bottleneck features to the attention model is thus equivalent to migrating the knowledge in this network into the attention model.
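The fine-tuning stage of steps 2.1-2.3 can be sketched in PyTorch as below: the three DBN matrices initialize the first layers, a randomly initialized softmax output layer W_4 is appended, the stack is trained with cross entropy (equations (12)-(16) reduce to cross entropy plus SGD), and afterwards the output layer is bypassed so the bottleneck-layer activations serve as features. The layer widths (1024-40-1024) and the number of output states are assumptions for illustration.

    import torch
    import torch.nn as nn

    class BottleneckDNN(nn.Module):
        def __init__(self, dbn_weights, dbn_biases, n_states):
            super().__init__()
            dims = [123, 1024, 40, 1024]            # 2nd hidden layer = bottleneck
            self.hidden = nn.ModuleList(
                nn.Linear(dims[i], dims[i + 1]) for i in range(3))
            for lin, W, b in zip(self.hidden, dbn_weights, dbn_biases):
                lin.weight.data.copy_(W)            # migrate DBN weights W1..W3
                lin.bias.data.copy_(b)
            self.out = nn.Linear(dims[3], n_states) # randomly initialized W4

        def forward(self, x, return_bottleneck=False):
            for i, lin in enumerate(self.hidden):
                x = torch.sigmoid(lin(x))
                if return_bottleneck and i == 1:    # bottleneck-layer state
                    return x
            return self.out(x)                      # logits; softmax is in the loss

    # one supervised update on a random mini-batch (illustrative sizes)
    model = BottleneckDNN([torch.randn(1024, 123), torch.randn(40, 1024),
                           torch.randn(1024, 40)],
                          [torch.zeros(1024), torch.zeros(40), torch.zeros(1024)],
                          n_states=1000)
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    loss = nn.CrossEntropyLoss()(model(torch.randn(8, 123)),
                                 torch.randint(0, 1000, (8,)))
    loss.backward(); opt.step()
    bnf = model(torch.randn(8, 123), return_bottleneck=True)  # (8, 40) features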
The sixth embodiment:
Different from the fifth embodiment, in this embodiment's method:
the multi-scale multi-head attention model consists of a coding network, a decoding network and an attention network;
the coding network and the decoding network both contain recurrent-neural-network units;
the front end of the coding network is the DBN feature extraction network and its back end is a deep recurrent neural network, so that the contextual information of the speech feature sequence is learned and mined, high-level information is extracted from the original features, and the discriminability and representational power of the features are strengthened;
the main body of the attention network is a multilayer perceptron; the network inputs are the output of the coding network and the hidden-unit state of the decoding network, and the network output is the relevance score between the coding network and the decoding network; the attention coefficient is obtained after exponential normalization of the relevance score;
the decoding network is formed by connecting a single-layer recurrent neural network with a maxout network; the attention coefficients are obtained from the attention network, the outputs of all the coding-network positions are then weighted and summed to obtain the target vector, and finally the target vector is used as network input to compute the posterior probability of each phoneme at each position of the output sequence.
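The following PyTorch sketch shows schematically how these three sub-networks connect, with assumed dimensions throughout: the coding network turns bottleneck features into high-level states, an attention MLP scores them against the decoder state, and the decoding recurrent network emits one phoneme posterior per output position. For brevity it shows a single head and omits the convolutional location features and the maxout layer, which are detailed in the embodiments below.

    import torch
    import torch.nn as nn

    bnf = torch.randn(1, 200, 40)                   # bottleneck features (B, T, 40)
    encoder = nn.GRU(40, 256, num_layers=2, batch_first=True, bidirectional=True)
    h, _ = encoder(bnf)                             # high-level sequence (B, T, 512)

    score_mlp = nn.Sequential(nn.Linear(512 + 320, 128), nn.Tanh(),
                              nn.Linear(128, 1))    # attention network a(.)
    dec_cell = nn.GRUCell(512 + 62, 320)            # decoding RNN, 62 phoneme classes
    out_layer = nn.Linear(320, 62)                  # stands in for the maxout network

    s = torch.zeros(1, 320)                         # decoder hidden state s_o
    y = torch.zeros(1, 62)                          # previous output y_{o-1}
    for o in range(30):                             # 30 output positions
        e = score_mlp(torch.cat([h, s[:, None, :].expand(-1, h.size(1), -1)], -1))
        alpha = torch.softmax(e.squeeze(-1), dim=1) # attention coefficients
        ct = (alpha[:, :, None] * h).sum(dim=1)     # target vector ct_o
        s = dec_cell(torch.cat([y, ct], dim=-1), s)
        y = torch.softmax(out_layer(s), dim=-1)     # phoneme posterior at position o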
The seventh embodiment:
Different from the sixth embodiment, in step three of this embodiment's method the high-level feature sequence (h_1, h_2, ..., h_T) is obtained by building the front end and the back end of the coding network of the multi-scale multi-head attention model; the specific process is as follows:
the back end of the coding network consists of two groups of multi-layer recurrent neural networks based on GRU units, a forward network and a backward network; this bidirectional structure can propagate past and future information simultaneously and raises the information content of the high-level features.
Step 3.1: first, the trained DBN bottleneck features (v_1, v_2, ..., v_n) from the front end of the coding network are taken as the input of the recurrent neural network; then the number of stacked recurrent-network layers is reduced, with only one layer kept to process the temporal information of the feature sequence, and the high-level feature sequence (h_1, h_2, ..., h_T) is output;
Step three, at the time t of the rear end of the coding network, the hidden layer states of the forward network and the backward network of the layer c are respectively
Figure BDA0002082613110000101
And
Figure BDA0002082613110000102
since the implicit layer unit information transfer directions of the forward network and the backward network are opposite, the corresponding transfer functions are respectively expressed by the following equations (17) and (18):
Figure BDA0002082613110000103
Figure BDA0002082613110000104
Step 3.2: the forward network and the backward network pass features between layers in parallel through their respective transfer functions, and the time axis is downsampled during this propagation to reduce the amount of computation. In the forward network, with the GRU as the basic unit, the hidden states are computed as in equations (17) and (18): from the input feature sequence (x_1, x_2, ..., x_T) the layer-1 hidden states (h→_1^1, h→_2^1, ..., h→_T^1) can be obtained, and likewise the layer-(c-1) hidden states h→_t^{c-1} can be computed; the hidden state at layer c is then given by equation (19):
h→_t^c = GRU(h→_{t-1}^c, h→_t^{c-1})   (19)
similarly, the layer-c hidden states of the backward network are computed;
Step 3.3: the layer-c hidden states of the forward network and the backward network are spliced to give the high-level feature h_t of the coding network at time t:
h_t = [h→_t^c ; h←_t^c]   (20)
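A compact PyTorch sketch of this back end, under assumed widths and an assumed downsampling factor of 2: each layer is a bidirectional GRU whose forward and backward states are concatenated as in equation (20), and the sequence is thinned in time between layers to cut computation, as step 3.2 describes.

    import torch
    import torch.nn as nn

    class BidirectionalBackEnd(nn.Module):
        def __init__(self, in_dim=40, hidden=256, layers=3):
            super().__init__()
            dims = [in_dim] + [2 * hidden] * (layers - 1)
            self.rnns = nn.ModuleList(
                nn.GRU(d, hidden, batch_first=True, bidirectional=True)
                for d in dims)

        def forward(self, x):                  # x: bottleneck features (B, n, in_dim)
            for i, rnn in enumerate(self.rnns):
                x, _ = rnn(x)                  # [h_fwd ; h_bwd] per frame, eq. (20)
                if i + 1 < len(self.rnns):
                    x = x[:, ::2, :]           # downsample time between layers
            return x                           # high-level sequence (h_1, ..., h_T)

    enc = BidirectionalBackEnd()
    h = enc(torch.randn(2, 320, 40))           # (2, 80, 512) after two 2x reductions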
The eighth embodiment:
Different from the seventh embodiment, in step four of this embodiment's method the attention network of the multi-scale multi-head attention model is built as follows:
the attention mechanism is a technique for extracting effective features from a feature sequence in a sequence-to-sequence model; it assigns attention to the relevant regions of the feature sequence and thereby ignores the irrelevant features. To compute an element y_o of the output sequence of the sequence-to-sequence model with the multi-scale multi-head attention mechanism, the feature sequence (h_1, h_2, ..., h_T) is mapped with convolution kernels F_i of different scales to one target vector ct_o for each head of the attention mechanism; the target vector ct_o contains the compressed important information of the sequence and converts the important feature information of the sequence into a fixed-length vector, which eases back-end pattern classification; ct_o is then used as the input of the decoding network behind each head's attention network, and the hidden-state sequence (s_1, s_2, ..., s_o) and the output sequence (y_1, y_2, ..., y_o) of the recurrent neural network are computed position by position; wherein,
a single attention head of the multi-scale multi-head attention model computes the target vector ct_o corresponding to output-sequence position o ∈ {1, 2, ..., O} with the attention mechanism as follows:
first, the relevance between the hidden state s_{o-1} at output-sequence position o-1 and the feature at time t is computed, as in equation (21):
e_{o,t} = a(s_{o-1}, h_t)   (21)
where a(·) denotes the attention network, a multilayer perceptron with a single hidden layer, which can be written as:
e_{o,t} = ω^T tanh(W[s_{o-1}, h_t, f_{o,t}] + b)   (22)
f_o = F_i ∗ α_{o-1}   (23)
where W and b are the input-to-hidden-layer weight matrix and bias vector respectively, ω is the hidden-to-output-layer weight matrix, F_i is a convolution kernel, and α_{o-1} is the attention-coefficient vector (see equation (24)) of the previous output position, so f_{o,t} is the t-th column of f_o; for the multi-scale multi-head attention mechanism the kernels F_i differ in size, so the convolution part of each head uses a convolution filter of a different size. Because the model is multi-headed and its convolution filters differ in size, it can not only process the sequence data with representation information from different subspaces at different sequence positions, but the differently sized convolution filters can also model information at the phoneme, syllable and word levels.
The multi-scale multi-head model is initialized randomly with different convolution filter parameters, so the multi-head model forms an ensemble; convolution filters of different sizes represent different models, and the attention scores of the previous position can be learned and trained under the supervision of these various models to obtain models of different speech units. Here, models with a smaller convolution filter mainly represent the phoneme level, those of normal size mainly represent the syllable level, and those with a larger filter represent the word level. The multi-scale multi-head attention model can thus model speech units at different levels, achieving the goal of improving the speech recognition result.
Then, the relevance of the features at all times is exponentially normalized, and the normalized value is called attention coefficient as follows:
Figure BDA0002082613110000121
and finally, taking the attention coefficient as weight, and carrying out weighted summation on the features at all moments to obtain a target vector ct of an output sequence position o under the attention mechanismo:
Figure BDA0002082613110000122
The process of computing the target vector with the attention mechanism can be abbreviated as equation (26):
ct_o = attention([h_1, h_2, ..., h_T], s_{o-1})   (26).
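The NumPy sketch below runs equations (21)-(25) for one attention head: the previous attention coefficients α_{o-1} are convolved with that head's kernel F_i (equation (23)), scored by the single-hidden-layer MLP (equation (22)), normalized (equation (24)), and used to pool the encoder states into ct_o (equation (25)). Looping over kernel widths imitates the multi-scale heads; all dimensions and widths are assumptions.

    import numpy as np

    def head_step(H, s_prev, alpha_prev, F, W, b, w_out):
        # H: (T, n) encoder states; s_prev: (m,) decoder state; F: (width, c) kernel
        T, width = H.shape[0], F.shape[0]
        padded = np.pad(alpha_prev, width // 2)
        f = np.stack([padded[t:t + width] @ F for t in range(T)])   # eq. (23)
        e = np.array([w_out @ np.tanh(W @ np.concatenate([s_prev, H[t], f[t]]) + b)
                      for t in range(T)])                           # eq. (22)
        alpha = np.exp(e - e.max()); alpha /= alpha.sum()           # eq. (24)
        return alpha @ H, alpha                                     # ct_o, eq. (25)

    T, n, m, c, hid = 50, 512, 320, 10, 128
    H, s = np.random.randn(T, n), np.random.randn(m)
    alpha = np.full(T, 1.0 / T)                 # uniform attention at position 0
    for width in (3, 7, 15):                    # phoneme / syllable / word scales
        F = 0.1 * np.random.randn(width, c)
        W = 0.01 * np.random.randn(hid, m + n + c)
        ct, alpha_i = head_step(H, s, alpha, F, W, np.zeros(hid),
                                0.01 * np.random.randn(hid))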
The ninth embodiment:
Different from the eighth embodiment, in step five of this embodiment's method the decoding network corresponding to each head's attention network computes the element y_o at position o of the output sequence (y_1, y_2, ..., y_o) as follows:
for a single head of the multi-scale multi-head attention model, when the decoding network behind each head's attention network computes the element y_o at position o of the output sequence (y_1, y_2, ..., y_o), the inputs of the network are the target vector ct_o generated by that head's multi-scale single-attention network and the element y_{o-1} at position o-1; the computation process is as follows:
first, the decoding network obtains the target vector ct_o from the attention network, and takes ct_o and y_{o-1} as the input of the GRU-unit-based recurrent neural network to compute the hidden state s_o of the recurrent neural network, as in equation (27):
s_o = GRU(s_{o-1}, [y_{o-1}, ct_o])   (27)
then, given the decoding-network hidden state s_o ∈ R^d as input, the posterior probability p(i | s_o) of phoneme i is obtained through the maxout network;
each hidden unit of the maxout network has several candidate units, and the network selects the candidate with the maximum value as its output; the computation process is:
z_{i,j} = s_o^T W_{·,i,j} + b_{i,j}   (28)
u_i = max_{j∈{1,...,k}} z_{i,j}   (29)
where d is the dimension of the input hidden state s_o, corresponding to the number of hidden units; W ∈ R^{d×m×k} and b ∈ R^{m×k} are the maxout network's parameter matrix and bias vector; k is the number of candidate units per hidden unit of the maxout network; and m is the number of output-layer units, corresponding to the number of output phoneme classes in the acoustic model;
finally, as shown in equation (30), the output vector y_o is obtained from the output layer of the maxout network; the i-th component of y_o represents the posterior probability of phoneme i appearing at output-sequence position o:
y_{o,i} = p(i | s_o) = exp(u_i) / Σ_{i'=1}^{m} exp(u_{i'})   (30)
finally, all the output sequences (y_1, y_2, ..., y_o)_i are integrated into the new output sequence (Y_1, Y_2, ..., Y_o) with a DNN:
Y_i = DNN((y_1, y_2, ..., y_o)_i)   (31).
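One decoding step per head and the DNN fusion of equations (27)-(31) can be sketched in PyTorch as follows; the maxout layer keeps k candidate units per output unit and takes their maximum, and the sizes (d = 320, m = 62 phonemes, k = 4, three heads) are assumptions.

    import torch
    import torch.nn as nn

    class MaxoutOutput(nn.Module):
        def __init__(self, d=320, m=62, k=4):
            super().__init__()
            self.proj = nn.Linear(d, m * k)     # W in R^{d x m x k}, flattened
            self.m, self.k = m, k

        def forward(self, s):                   # s: (B, d) decoder hidden state s_o
            z = self.proj(s).view(-1, self.m, self.k)
            u = z.max(dim=-1).values            # eqs. (28)-(29): max over candidates
            return torch.softmax(u, dim=-1)     # eq. (30): phoneme posteriors y_o

    heads, m = 3, 62
    cell = nn.GRUCell(512 + m, 320)             # s_o = GRU(s_{o-1}, [y_{o-1}, ct_o])
    maxout = MaxoutOutput()
    fuse = nn.Sequential(nn.Linear(heads * m, 128), nn.ReLU(),
                         nn.Linear(128, m))     # the DNN of eq. (31)

    y_heads = []
    for _ in range(heads):                      # one decoding RNN per attention head
        ct = torch.randn(1, 512)                # ct_o from that head's attention
        s = cell(torch.cat([torch.zeros(1, m), ct], dim=-1), torch.zeros(1, 320))
        y_heads.append(maxout(s))               # eq. (27) then eq. (30)
    Y = torch.softmax(fuse(torch.cat(y_heads, dim=-1)), dim=-1)   # Y_o, eq. (31)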
Speech recognition based on the multi-scale multi-head attention model outperforms conventional models in phoneme sequence recognition and has stronger temporal modelling and discriminative power. To increase the model's robustness in noisy environments, the DBN-based bottleneck features are added on top of this model, for the following reasons. First, because the DBN-based bottleneck feature extraction network is not recursive, the gradients of many frames can be computed in parallel on a GPU during DBN training, which effectively raises training efficiency and avoids the large scale and long training time caused by the multi-layer structure of a recurrent neural network. Second, reducing the number of recurrent layers shortens the distance over which gradients propagate backwards through the network, further cutting the training cost; at the same time, using more discriminative features makes model training easier to converge, raising training efficiency further. Finally, the DBN-based bottleneck features are strongly robust to noise, so using them compensates for the weak noise resistance of the multi-scale multi-head attention model. The multi-scale model is initialized randomly with different convolution filter parameters, so the multi-head model forms an ensemble and is superior to a single-head model; convolution filters of different sizes represent different models, and the attention scores of the previous position can be learned and trained under the supervision of these various models to obtain models of different speech units: smaller filters mainly represent the phoneme level, normal sizes the syllable level, and larger sizes the word level. The multi-scale multi-head attention model can thus model speech units at different levels, achieving the goal of improving speech recognition.
The foregoing shows and describes the basic principles and main features of the present invention and its advantages. Those skilled in the art will understand that the invention is not limited to the embodiments described above; the embodiments and the description only illustrate the principle of the invention, and various changes and improvements may be made without departing from its spirit and scope, all of which fall within the scope of the claimed invention. The scope of the invention is defined by the appended claims and their equivalents.

Claims (7)

1. A method for building a speech recognition model based on bottleneck features and a multi-scale multi-head attention mechanism, characterized in that the method comprises the following steps:
step one, use the input sample FBank speech feature vectors X = (x_1, x_2, ..., x_T) to train the RBM networks in the DBN without supervision, obtaining the first three connection weight matrices W_1, W_2, W_3 of the coding network as their initialization; these three connection weight matrices, together with a randomly initialized output layer W_4, form the DBN-based bottleneck feature extraction network at the front end of the coding network; the RBM network denotes a restricted Boltzmann machine; DBN denotes a deep belief network; FBank denotes filter-bank features;
step two, input the sample FBank speech feature vectors X = (x_1, x_2, ..., x_T) into the four-layer bottleneck feature extraction network of step one and obtain the network output v by forward computation; then, using the network output v and the actual labels of the sample training data X, train the bottleneck feature extraction network with supervision by the BP algorithm, computing the gradients ΔW_t^l and Δb_t^l of the weights and biases of each of the four layers and updating the weights and biases iteratively;
step three, remove the last output layer of the four-layer bottleneck feature extraction network of step two, keep only the first three layers as the front end of the coding network, and take the bottleneck feature sequence (v_1, v_2, ..., v_n) output by this front end as the input of the RNN; here the RNN is the back end of the coding network and consists of two groups of GRU-based multi-layer RNNs, a forward network and a backward network;
the input FBank speech feature vectors X pass through the front-end bottleneck feature extraction network of the coding network to give the bottleneck feature sequence (v_1, v_2, ..., v_n), which the GRU-based multi-layer RNN at the back end of the encoder finally outputs as the high-level feature sequence (h_1, h_2, ..., h_T);
step four, with the multi-scale multi-head attention mechanism, feed the high-level feature sequence (h_1, h_2, ..., h_T) output by the back end of the coding network into convolution kernels F_i of different scales, modelling speech units at the phoneme, syllable and word levels to obtain a target vector ct_o for each head; then take each head's ct_o as the input of that head's decoding RNN in the multi-scale multi-head attention network, computing the RNN hidden-state sequence (s_1, s_2, ..., s_o) and the output phoneme sequence (y_1, y_2, ..., y_o);
in the fourth step, the attention network of the multi-scale multi-head attention model is built as follows:
with the multi-scale multi-head attention mechanism, to compute an element y_o of the output sequence of the sequence-to-sequence model, the feature sequence (h_1, h_2, ..., h_T) is mapped with convolution kernels F_i of different scales to one target vector ct_o for each head of the attention mechanism; ct_o is then used as the input of the decoding network behind each head's attention network, and the hidden-state sequence (s_1, s_2, ..., s_o) and the output sequence (y_1, y_2, ..., y_o) of the recurrent neural network are computed position by position; wherein,
a single attention head of the multi-scale multi-head attention model computes the target vector ct_o corresponding to output-sequence position o ∈ {1, 2, ..., O} with the attention mechanism as follows:
first, the relevance between the hidden state s_{o-1} at output-sequence position o-1 and the feature at time t is computed, as in equation (21):
e_{o,t} = a(s_{o-1}, h_t)   (21)
where a(·) denotes the attention network, a multilayer perceptron with a single hidden layer, expressed as:
e_{o,t} = ω^T tanh(W[s_{o-1}, h_t, f_{o,t}] + b)   (22)
f_o = F_i ∗ α_{o-1}   (23)
where W and b are the input-to-hidden-layer weight matrix and bias vector respectively, ω is the hidden-to-output-layer weight matrix, F_i is a convolution kernel, and α_{o-1} is the attention-coefficient vector (see equation (24)) of the previous output position; for the multi-scale multi-head attention mechanism the kernels F_i differ in size, and the convolution part of each head uses a convolution filter of a different size;
the multi-scale multi-head model is initialized randomly with different convolution filter parameters, so that the multi-head model forms an ensemble model; convolution filters of different sizes represent different models, so the attention scores of the previous position can be learned and trained under the supervision of these various models to obtain models of different speech units; the models with a smaller convolution filter size mainly represent the phoneme level, those of normal size mainly represent the syllable level, and those with a larger size represent the word level;
then, the relevances of the features at all times are exponentially normalized; the normalized value is called the attention coefficient:
α_{o,t} = exp(e_{o,t}) / Σ_{t'=1}^{T} exp(e_{o,t'})   (24)
finally, with the attention coefficients as weights, the features at all times are weighted and summed to give the target vector ct_o of output-sequence position o under the attention mechanism:
ct_o = Σ_{t=1}^{T} α_{o,t} h_t   (25)
the process of computing the target vector with the attention mechanism can be abbreviated as equation (26):
ct_o = attention([h_1, h_2, ..., h_T], s_{o-1})   (26)
step five, splice the output sequences (y_1, y_2, ..., y_o) computed by the decoding RNN corresponding to each head's attention network in the multi-scale multi-head attention network to obtain the new output sequences (y_1, y_2, ..., y_o)_i; integrate the new output sequences with a DNN into the final output sequence (Y_1, Y_2, ..., Y_o);
in the fifth step, the decoding network corresponding to each head's attention network computes the element y_o at position o of the output sequence (y_1, y_2, ..., y_o) as follows:
for a single head of the multi-scale multi-head attention model, when the decoding network behind each head's attention network computes the element y_o at position o of the output sequence (y_1, y_2, ..., y_o), the inputs of the network are the target vector ct_o generated by that head's multi-scale single-attention network and the element y_{o-1} at position o-1; the computation process is as follows:
first, the decoding network obtains the target vector ct_o from the attention network, and takes ct_o and y_{o-1} as the input of the GRU-unit-based recurrent neural network to compute the hidden state s_o of the recurrent neural network, as in equation (27):
s_o = GRU(s_{o-1}, [y_{o-1}, ct_o])   (27)
then, given the decoding-network hidden state s_o ∈ R^d as input, the posterior probability p(i | s_o) of phoneme i is obtained through the maxout network;
Each hidden layer unit of the maxout network has a plurality of candidate units, the network selects the unit with the maximum value as output, and the calculation process is as follows:
Figure FDA0003297912520000033
Figure FDA0003297912520000034
where d is the dimension of the input hidden state s_o, corresponding to the number of hidden units; W ∈ R^{d×m×k} and b ∈ R^{m×k} are the maxout network's parameter matrix and bias vector; k is the number of candidate units per hidden unit of the maxout network; and m is the number of output-layer units, corresponding to the number of output phoneme classes in the acoustic model;
finally, as shown in equation (30), the output vector y_o is obtained from the output layer of the maxout network; the i-th component of y_o represents the posterior probability of phoneme i appearing at output-sequence position o:
y_{o,i} = p(i | s_o) = exp(u_i) / Σ_{i'=1}^{m} exp(u_{i'})   (30)
finally, all the output sequences (y_1, y_2, ..., y_o)_i are integrated into the new output sequence (Y_1, Y_2, ..., Y_o) with a DNN:
Y_i = DNN((y_1, y_2, ..., y_o)_i)   (31).
2. The method for building a speech recognition model based on bottleneck features and a multi-scale multi-head attention mechanism according to claim 1, characterized in that: in step one, the input FBank speech feature vectors X = (x_1, x_2, ..., x_T) use 40-dimensional FBank features plus energy, spliced with the corresponding first- and second-order differences, 123 dimensions in total; the extracted features are first normalized over the training set so that each component follows a standard normal distribution, and the features of the test set and development set are then normalized with the training-set normalization parameters.
3. The method for building a speech recognition model based on bottleneck features and a multi-scale multi-head attention mechanism according to claim 2, characterized in that: in steps one and two, the RBM networks in the DBN undergo an unsupervised training process; the input of the RBM network is the FBank speech features, its output layer is a softmax layer, and each output-layer unit corresponds to the posterior probability of one triphone state; three hidden layers lie between the input layer and the output layer, the second hidden layer is the bottleneck layer, and its number of state units is smaller than that of the other hidden layers.
4. The method for building a speech recognition model based on bottleneck features and a multi-scale multi-head attention mechanism according to claim 3, characterized in that: in step one, the process of unsupervised training of the RBM networks in the DBN with the input FBank speech feature vectors X = (x_1, x_2, ..., x_T) is:
Step 1.1: on the basis of a 4-layer DBN, add a classification-label layer that outputs the phoneme states, giving the 5-layer bottleneck feature extraction RBM network, whose input is the speech features; the 4-layer DBN is formed by stacking 4 groups of single-layer RBMs, of which the first layer is a Gaussian-Bernoulli RBM and the other three layers are Bernoulli-Bernoulli RBMs;
Step 1.2: in the RBM, the posterior probability p(h|v) of a Bernoulli-Bernoulli RBM is computed as:
p(h|v) = exp(-E(v,h)) / Σ_h' exp(-E(v,h')),  with E(v,h) = -a^T v - b^T h - h^T W v   (1)
where E(v,h) is the energy of the single-layer RBM, v is the visible-layer vector, a is the visible-layer bias vector, h is the hidden-layer vector, b is the hidden-layer bias vector, W is the weight matrix connecting the visible and hidden layers, and W_{i,*} v denotes the product of the i-th row of W with v; all the components of the hidden-layer state are mutually independent;
Step 1.3: in the Bernoulli-Bernoulli RBM, each component h_i of the hidden-layer neuron state takes only the values 0 and 1, so that:

p(h_i = 1 | v) = σ(W_{i,*} v + b_i)   (2)
wherein σ(·) denotes the sigmoid function;
the vector of hidden layer states can be represented by the following formula:
p(h=1|v)=σ(Wv+b) (3)
by symmetry, given the hidden-layer state, the visible-layer state of a single-layer Bernoulli-Bernoulli RBM satisfies:

p(v = 1 | h) = σ(W^T h + a)   (4)
Step 1.4: for the Gaussian-Bernoulli RBM, whose visible layer follows a Gaussian distribution, the posterior probability p(h = 1 | v) is calculated as in equation (3), and p(v|h) is calculated as:

p(v|h) = N(v; W^T h + a, I)   (5)

wherein I denotes the identity matrix;
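The conditionals in equations (2)-(5) can be written down directly; the following numpy sketch assumes the usual Bernoulli and Gaussian-Bernoulli conventions, and the shapes and function names are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def p_h_given_v(v, W, b):
    """Equation (3): p(h = 1 | v) = sigma(W v + b)."""
    return sigmoid(W @ v + b)

def p_v_given_h_bernoulli(h, W, a):
    """Equation (4): p(v = 1 | h) = sigma(W^T h + a)."""
    return sigmoid(W.T @ h + a)

def sample_v_given_h_gaussian(h, W, a, rng):
    """Equation (5): v | h ~ N(W^T h + a, I) (Gaussian-Bernoulli RBM)."""
    mean = W.T @ h + a
    return mean + rng.standard_normal(mean.shape)

rng = np.random.default_rng(0)
W = 0.1 * rng.standard_normal((8, 16))        # 8 hidden, 16 visible units
v = (rng.random(16) < 0.5).astype(float)
ph = p_h_given_v(v, W, np.zeros(8))           # hidden activation probabilities
```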
Step 1.5: the stacking process of the DBN in the coding network is as follows: first, a Gaussian-Bernoulli RBM is trained; then the activation probabilities of its hidden-layer units serve as the visible-layer input of the next, Bernoulli-Bernoulli RBM; the hidden state of the second RBM serves as the input of the third RBM, and so on for all layers;
given a training data set V, the parameters of a single-layer RBM are trained by minimizing the negative log-likelihood with gradient descent:

θ* = argmin_θ ( −Σ_{v∈V} log p(v; θ) )   (6)
during training, the derivative of the negative log-likelihood has the general form:

−∂log p(v)/∂θ = ⟨∂E(v,h)/∂θ⟩_data − ⟨∂E(v,h)/∂θ⟩_model   (7)
wherein θ denotes the RBM model parameters, and ⟨ξ⟩_data and ⟨ξ⟩_model respectively denote the expectation of a random variable ξ estimated from the sample data and from the model; for the weights of the connection matrix in a single-layer RBM this gives:

∂log p(v)/∂w_{ij} = ⟨v_j h_i⟩_data − ⟨v_j h_i⟩_model   (8)
Step 1.6: training the DBN:
First, the first-layer RBM is initialized from the input sample data and trained with the contrastive divergence (CD) algorithm; in the CD algorithm the model expectation is estimated with samples obtained by Gibbs sampling, which proceeds as follows:
1) the Gibbs sampler initializes the visible layer with a data sample, giving the visible-layer state v;
2) the posterior probability p(h|v) is computed with equation (3) and taken as the hidden-layer state h;
Next, the visible-layer state v is sampled and reconstructed with equation (4) or equation (5), according to the RBM type; the parameters of the first-layer RBM are then fixed, its hidden-layer state is computed from the sample data and taken as the visible-layer state of the second-layer RBM, and the second-layer RBM is in turn trained with the CD algorithm;
This process is repeated, yielding the weight parameters connecting the DBN layers one layer at a time;
Finally, after training, only the weight matrices connecting the layers are retained, as the initialization parameters of the DBN transfer matrices.
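As a concrete reference for Step 1.6, here is a hedged one-step contrastive-divergence (CD-1) update for a Bernoulli-Bernoulli RBM, with the model expectation in equation (8) estimated by a single Gibbs reconstruction; the learning rate, the sizes and the use of probabilities in the negative phase are assumptions of this sketch, not the patent's reference implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0, W, a, b, lr, rng):
    """One CD-1 step for a Bernoulli-Bernoulli RBM.
    v0: (n_vis,) binary data sample; W: (n_hid, n_vis);
    a: visible bias, b: hidden bias. Returns updated (W, a, b)."""
    ph0 = sigmoid(W @ v0 + b)                           # p(h|v0), eq. (3)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)    # sample hidden state
    pv1 = sigmoid(W.T @ h0 + a)                         # reconstruction, eq. (4)
    ph1 = sigmoid(W @ pv1 + b)                          # hidden probs on reconstruction
    W += lr * (np.outer(ph0, v0) - np.outer(ph1, pv1))  # <v h>_data - <v h>_model, eq. (8)
    a += lr * (v0 - pv1)
    b += lr * (ph0 - ph1)
    return W, a, b

rng = np.random.default_rng(0)
W = 0.01 * rng.standard_normal((100, 123))
a, b = np.zeros(123), np.zeros(100)
v0 = (rng.random(123) < 0.5).astype(float)
W, a, b = cd1_update(v0, W, a, b, lr=0.01, rng=rng)
```

Stacking then proceeds as in Step 1.5: the hidden activation probabilities of the trained RBM become the visible input of the next one.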
5. The method for building a speech recognition model based on bottleneck features and a multi-scale multi-head attention mechanism according to claim 4, wherein: in the second step, the BP algorithm performs supervised training of the bottleneck feature extraction network from its output v and the labels of the sample training data X; that is, on the basis of the unsupervised pre-training, the labeled speech samples are used for supervised training, as follows:
Step 2.1: the DBN weights are migrated directly into a deep neural network, which is then fine-tuned on the basis of the known partial DNN model parameters; the first three connection matrices W_1, W_2, W_3 of the network are known, and together with a randomly initialized output layer W_4 they form the DBN-based bottleneck feature extraction network at the front end of the coding network; activation functions are added between the layers, and the forward computation of the network is:
Let f(·) denote the activation function; the 1st hidden-layer state h_1 and the l-th hidden-layer state h_l are respectively:

h_1 = f(W_1 x + b_1)   (9)

h_l = f(W_l h_{l-1} + b_l)   (10)

The output vector v at the end of the network is obtained at the 4th layer through the softmax function:

v = softmax(W_4 h_3 + b_4)   (11)
when training the network in this way, a fixed number of samples is randomly drawn from the training sample set each time; the features of the drawn samples are fed into the network, and the network output is obtained by forward computation;
Step 2.2: from the network output and the sample labels, the gradient of each layer's parameters is computed with the BP algorithm, and the model parameters are then updated:
The DBN network training criterion is the minimum cross-entropy function:

J_CE = −(1/M) Σ_{m=1}^{M} Σ_i t_i^{(m)} log v_i^{(m)}   (12)

wherein t^{(m)} is the label vector of sample m;
Next, the network parameters are fine-tuned with the back-propagation algorithm, and the model parameters are updated according to:

W_t^l = W_{t−1}^l − μ ΔW_t^l   (13)

b_t^l = b_{t−1}^l − μ Δb_t^l   (14)

wherein W_t^l and b_t^l are respectively the weight matrix and bias vector of layer l after t parameter updates, μ is the learning rate, and ΔW_t^l and Δb_t^l are the gradients of the weight matrix and bias vector, obtained by computing the first derivative over the training samples:

ΔW_t^l = (1/M) Σ_{m=1}^{M} ∂J_m/∂W_{t−1}^l   (15)

Δb_t^l = (1/M) Σ_{m=1}^{M} ∂J_m/∂b_{t−1}^l   (16)

wherein M is the number of samples used in each training step;
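A compact numpy sketch of one fine-tuning step of equations (12)-(16) follows, collapsing the network to a single tanh hidden layer plus a softmax output to keep the example short; all sizes, the tanh activation and the function name are assumptions.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def finetune_step(X, T, W1, b1, W4, b4, mu):
    """X: (M, d_in) minibatch; T: (M, m) one-hot labels.
    Forward pass (eqs. (9)-(11)), mean cross-entropy (eq. (12)),
    and gradient update (eqs. (13)-(16))."""
    H = np.tanh(X @ W1 + b1)              # hidden layer, eq. (10)
    V = softmax(H @ W4 + b4)              # output layer, eq. (11)
    M = X.shape[0]
    dZ = (V - T) / M                      # d(J_CE)/d(logits)
    dW4, db4 = H.T @ dZ, dZ.sum(axis=0)
    dH = (dZ @ W4.T) * (1.0 - H ** 2)     # backprop through tanh
    dW1, db1 = X.T @ dH, dH.sum(axis=0)
    # eqs. (13)-(14): W_t = W_{t-1} - mu * dW
    return W1 - mu * dW1, b1 - mu * db1, W4 - mu * dW4, b4 - mu * db4

rng = np.random.default_rng(0)
X = rng.standard_normal((32, 123))
T = np.eye(48)[rng.integers(0, 48, 32)]
W1, b1 = 0.01 * rng.standard_normal((123, 256)), np.zeros(256)
W4, b4 = 0.01 * rng.standard_normal((256, 48)), np.zeros(48)
W1, b1, W4, b4 = finetune_step(X, T, W1, b1, W4, b4, mu=0.1)
```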
Step 2.3: Steps 2.1 and 2.2 are carried out cyclically, and the error between the network output and the target output, compared against a preset threshold, serves as the criterion for ending training:
First, during training, a development set is used for cross-validation by computing the error between the network output and the reference output;
Then, on this basis, the state units, transfer matrix and bias vector of the final output layer are all removed from the trained network; the remainder serves as the front end of the coding network, with the state of the bottleneck layer of the feature extraction network as its output; taking the original speech feature parameters as network input, the bottleneck-layer output vector is computed layer by layer through the network; this vector constitutes the bottleneck feature parameters, completing bottleneck feature extraction.
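Once fine-tuning ends, extracting bottleneck features reduces to a truncated forward pass, sketched below; the layer sizes (123-1024-42-1024) and the random stand-in weights are assumptions, and `extract_bottleneck` is an illustrative name.

```python
import numpy as np

rng = np.random.default_rng(0)
sizes = [123, 1024, 42, 1024]            # input, hidden 1, bottleneck, hidden 3
Ws = [0.01 * rng.standard_normal((i, o)) for i, o in zip(sizes, sizes[1:])]
bs = [np.zeros(o) for o in sizes[1:]]

def extract_bottleneck(x):
    """Forward pass stopping at the (narrow) bottleneck layer; the
    softmax output layer has already been discarded."""
    h = x
    for W, b in zip(Ws[:2], bs[:2]):     # layers up to and including bottleneck
        h = np.tanh(h @ W + b)
    return h                             # 42-dim bottleneck feature

v = extract_bottleneck(rng.standard_normal(123))
assert v.shape == (42,)
```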
6. The method for building a speech recognition model based on bottleneck features and a multi-scale multi-head attention mechanism according to claim 5, wherein: the multi-scale multi-head attention model consists of an encoding network, a decoding network and an attention network;
the coding network and the decoding network both comprise recurrent neural network units;
the front end of the coding network is the DBN feature extraction network, and its back end is a deep recurrent neural network;
the main body of the attention network is a multilayer perceptron; the network inputs are the output of the coding network and the hidden-layer unit state of the decoding network, and the network output is the relevance score between the two; the attention coefficients are obtained after exponential (softmax) normalization of the relevance scores;
the decoding network is formed by connecting a single-layer recurrent neural network with a maxout network; attention coefficients obtained from the attention network are used to compute a weighted sum of all coding-network outputs, yielding the target vectors; finally, the target vectors serve as network input to compute the posterior probability of each phoneme at each position of the output sequence, as sketched below.
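For one attention head at one output position, the MLP scoring, softmax normalization and weighted sum described in claim 6 can be sketched as follows; the additive (tanh) scoring form and all dimensions are assumptions of the example.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attend(H, s, Wh, Ws, w):
    """H: (T, d_h) coding-network outputs; s: (d_s,) decoding-network
    hidden state. Returns the target (context) vector and the
    attention coefficients for one head at one output position."""
    scores = np.tanh(H @ Wh + s @ Ws) @ w   # relevance scores
    alpha = softmax(scores)                  # attention coefficients
    return alpha @ H, alpha                  # weighted sum of encodings

T_len, d_h, d_s, d_a = 50, 256, 256, 128
rng = np.random.default_rng(0)
ctx, alpha = attend(rng.standard_normal((T_len, d_h)),
                    rng.standard_normal(d_s),
                    0.1 * rng.standard_normal((d_h, d_a)),
                    0.1 * rng.standard_normal((d_s, d_a)),
                    0.1 * rng.standard_normal(d_a))
assert ctx.shape == (d_h,) and abs(alpha.sum() - 1.0) < 1e-6
```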
7. The method of claim 6, wherein: in the third step, the high-level feature sequence (h_1, h_2, ..., h_T) is obtained through the front end and back end of the coding network of the multi-scale multi-head attention model; the specific construction process is:
Step 3.1: first, the trained DBN bottleneck features (v_1, v_2, ..., v_n) from the front end of the coding network are taken as the input of the recurrent neural network; the number of stacked recurrent-network layers is then reduced, retaining only a 1-layer network to process the temporal information of the feature sequence and output the high-level feature sequence (h_1, h_2, ..., h_T);
Step 3.2: at time t at the back end of the coding network, the hidden-layer states of the forward and backward networks in layer c are →h_t^c and ←h_t^c respectively; since the forward and backward networks pass hidden-unit information in opposite directions, the corresponding transfer functions are given by equations (17) and (18):

→h_t^c = f(→h_{t−1}^c, h_t^{c−1})   (17)

←h_t^c = f(←h_{t+1}^c, h_t^{c−1})   (18)
Step 3.3: in the forward network, the GRU is used as the basic unit, with hidden-layer states computed as in equations (17) and (18); inputting the feature sequence (x_1, x_2, ..., x_T) yields the layer-1 hidden states →h_t^1; similarly, the layer-(c−1) hidden states →h_t^{c−1} can be computed, and the layer-c hidden states are given by equation (19):

→h_t^c = GRU(→h_{t−1}^c, h_t^{c−1})   (19)
similarly, the layer-c hidden states of the backward network are computed;
Step 3.4: the layer-c hidden states of the forward and backward networks are concatenated to obtain the high-level feature h_t of the coding network at time t:

h_t = [→h_t^c ; ←h_t^c]   (20)
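A minimal numpy sketch of one bidirectional layer of equations (17)-(20) follows: a forward and a backward GRU pass whose layer-c states are concatenated as in equation (20). The GRU gating below follows the standard formulation without biases, and all sizes and names are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class GRUCell:
    """Standard GRU gating (biases omitted for brevity)."""
    def __init__(self, d_in, d_h, rng):
        s = 1.0 / np.sqrt(d_h)
        self.Wz, self.Wr, self.Wn = (rng.uniform(-s, s, (d_h, d_in + d_h))
                                     for _ in range(3))

    def step(self, x, h):
        xh = np.concatenate([x, h])
        z = sigmoid(self.Wz @ xh)                        # update gate
        r = sigmoid(self.Wr @ xh)                        # reset gate
        n = np.tanh(self.Wn @ np.concatenate([x, r * h]))
        return (1.0 - z) * h + z * n

def bidirectional_layer(X, fw, bw, d_h):
    """X: (T, d_in) layer-(c-1) states; returns (T, 2*d_h) features,
    the concatenation of forward and backward states (eq. (20))."""
    T = len(X)
    hf, hb = np.zeros(d_h), np.zeros(d_h)
    Hf, Hb = np.zeros((T, d_h)), np.zeros((T, d_h))
    for t in range(T):                                   # eq. (17): forward pass
        hf = fw.step(X[t], hf)
        Hf[t] = hf
    for t in reversed(range(T)):                         # eq. (18): backward pass
        hb = bw.step(X[t], hb)
        Hb[t] = hb
    return np.concatenate([Hf, Hb], axis=1)

rng = np.random.default_rng(0)
fw, bw = GRUCell(42, 64, rng), GRUCell(42, 64, rng)
H = bidirectional_layer(rng.standard_normal((100, 42)), fw, bw, 64)
assert H.shape == (100, 128)
```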
CN201910477034.2A 2019-06-03 2019-06-03 Method for establishing voice recognition model based on bottleneck characteristics and multi-scale multi-head attention mechanism Active CN110211574B (en)
