CN110211574B - Method for establishing voice recognition model based on bottleneck characteristics and multi-scale multi-head attention mechanism - Google Patents


Info

Publication number: CN110211574B
Application number: CN201910477034.2A
Authority: CN (China)
Prior art keywords: network, layer, output, bottleneck, model
Legal status: Active (granted)
Other versions: CN110211574A
Other languages: Chinese (zh)
Inventors: 韩纪庆, 唐海桃, 郑铁然, 郑贵滨
Current assignee: Harbin Institute of Technology
Original assignee: Harbin Institute of Technology
Legal events: application filed by Harbin Institute of Technology; priority to CN201910477034.2A; publication of CN110211574A; application granted; publication of CN110211574B


Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/02 - Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 - Training
    • G10L15/08 - Speech classification or search
    • G10L15/16 - Speech classification or search using artificial neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Image Analysis (AREA)

Abstract

A method for building a speech recognition model based on bottleneck features and a multi-scale multi-head attention mechanism, belonging to the field of modeling methods. Conventional attention models suffer from poor recognition performance and a single attention scale. In the proposed method, a deep belief network extracts bottleneck features at the front end, which increases the robustness of the model; the back end uses a multi-scale multi-head attention model composed of convolution kernels of different scales to model speech units at the phoneme, syllable and word levels, computing the hidden-state sequence and output sequence of the recurrent neural network position by position. The decoding network corresponding to each head's attention network computes the elements of its output sequence at each position, and finally all the head outputs are collected and integrated into a new output sequence. The invention can improve the recognition performance of a speech recognition system.

Description

Method for establishing voice recognition model based on bottleneck characteristics and multi-scale multi-head attention mechanism
Technical Field
The invention relates to a training model in the technical field of speech recognition, and in particular to an improved method that raises recognition performance by extracting bottleneck features to increase model robustness and by building a multi-scale multi-head model that models speech units at the phoneme, syllable and word levels.
Background
Speech signals are among the most common signals in human society and are an important medium for people to express, communicate and spread information. In the current era of information explosion, massive volumes of speech data are constantly generated over the internet and telephone channels, and Automatic Speech Recognition (ASR) is increasingly needed to recognize, classify and retrieve large-scale speech signals efficiently.

Compared with a traditional speech recognition system based on the Hidden Markov Model (HMM), an end-to-end system integrates the acoustic model, the pronunciation dictionary and the language model into a single neural network, making the model more concise and its parameters easier to optimize directly. The end-to-end model is also more general: it reduces the dependence on specialist linguistic knowledge and lowers the difficulty of building a speech recognition system. Moreover, its training data need no alignment information, so annotated speech can be used for training directly. The attention model drops the independence assumption on model outputs and quantifies the alignment between phonemes and features through attention coefficients, which gives the model a degree of interpretability.

This patent develops its research on end-to-end speech recognition from the attention model and focuses on the long training period and poor recognition performance of current attention-based speech recognition systems. The attention model contains a large number of recurrent-network units, so its parameter scale is large and its internal structure complex, which makes training inefficient; adding hardware computing resources only partly alleviates this, and addressing it at the algorithm level remains a hard problem. The attention model also abandons phonetic prior knowledge entirely, so it lacks effective initialization parameters and its parameters converge slowly, which is another important reason for the long training time; how to integrate the model with phonetic features is therefore very meaningful work. The gap in performance between the attention model and conventional models has several causes: insufficient training due to limited corpus resources, inaccurate alignment between the generated features and the phonemes (graphemes), and poor robustness of the model to noise. In view of these problems, the model needs to be studied in depth to find a reasonable solution.
Disclosure of Invention
The invention aims to solve the poor recognition performance and single attention scale of existing attention models, and provides a method for building a speech recognition model based on bottleneck features and a multi-scale multi-head attention mechanism.
A method for establishing a speech recognition model based on bottleneck characteristics and a multi-scale multi-head attention mechanism comprises the following steps:
step one, use the input sample FBank speech feature vectors X = (x_1, x_2, ..., x_T) to train the RBM networks in the DBN without supervision, obtaining the first three connection weight matrices W_1, W_2, W_3 of the coding network as their initialization; these three connection weight matrices, together with a randomly initialized output layer W_4, form the DBN-based bottleneck feature extraction network at the front end of the coding network; RBM stands for Restricted Boltzmann Machine; DBN stands for Deep Belief Network; FBank denotes filter-bank features; in the sample FBank feature vector X = (x_1, x_2, ..., x_T), x_1 is the FBank feature frame at the first time step, x_2 the frame at the second time step, and x_T the frame at the T-th time step, and these frames together form the speech feature sequence; FBank features are the Mel filter-bank outputs, i.e. Mel-Frequency Cepstral Coefficients (MFCC) without the discrete cosine transform applied;
step two, input the sample FBank speech feature vectors X = (x_1, x_2, ..., x_T) into the four-layer bottleneck feature extraction network of step one and obtain the network output v by forward computation; then, using the network output v and the actual labels of the sample training data X, train the bottleneck feature extraction network with supervision by the BP algorithm, computing the gradients of the weights W_l and biases b_l of each of the four layers and updating the weights W_l and biases b_l iteratively; the supervised training of step two makes the computation faster and more accurate;
step three, remove the last output layer of the four-layer bottleneck feature extraction network of step two, keep only the first three layers as the front end of the coding network, and take the bottleneck feature sequence (v_1, v_2, ..., v_n) output by this front end as the input of the RNN; here the RNN is the back end of the coding network and consists of two groups of GRU-based multi-layer RNNs, a forward network and a backward network;
the input FBank speech feature vectors X pass through the front-end bottleneck feature extraction network of the coding network to give the bottleneck feature sequence (v_1, v_2, ..., v_n), which the GRU-based multi-layer RNN at the back end of the encoder finally outputs as the high-level feature sequence (h_1, h_2, ..., h_T);
the coding network consists of the bottleneck feature extraction network of step one and the recurrent neural network of step three, and the bottleneck feature extraction network is called the front end of the coding network; GRU stands for Gated Recurrent Unit; RNN stands for Recurrent Neural Network;
step four, with the multi-scale multi-head attention mechanism, feed the high-level feature sequence (h_1, h_2, ..., h_T) output by the back end of the coding network into convolution kernels F_i of different scales, modelling speech units at the phoneme, syllable and word levels to obtain a target vector ct_o for each head, which eases the back-end pattern classification; then take each head's ct_o as the input of that head's decoding RNN in the multi-scale multi-head attention network, computing the RNN hidden-state sequence (s_1, s_2, ..., s_o) and the output phoneme sequence (y_1, y_2, ..., y_o) position by position;
step five, splice the output sequences (y_1, y_2, ..., y_o) computed by the decoding RNN corresponding to each head's attention network in the multi-scale multi-head attention network to obtain the new output sequences (y_1, y_2, ..., y_o)_i; integrate the new output sequences with a DNN into the final output sequence (Y_1, Y_2, ..., Y_o). Here the output y_o of each head's decoding network is obtained from the target vector ct_o generated by that head's attention network and the element y_{o-1} at position o-1.
The invention has the beneficial effects that:
the invention provides a speech recognition method based on bottleneck features and a multi-scale multi-head attention mechanism. The method uses a DBN to extract bottleneck features as the front end of the coding network, which increases the robustness of the speech recognition system, and uses for the decoding network a multi-scale multi-head attention model composed of convolution kernels of different scales, so that speech units at different levels such as phonemes, syllables and words can each be modelled, improving the recognition performance of the speech recognition system.
Speech recognition based on the multi-scale multi-head attention model designed by this method outperforms conventional models in phoneme sequence recognition and has stronger temporal modelling and discriminative power. The designed speech recognition training model improves recognition accuracy and reduces the system's misrecognition rate and rejection rate.
Drawings
FIG. 1 is a multi-scale multi-head attention model incorporating a bottleneck feature extraction network;
FIG. 2 is a DBN-based bottleneck feature extraction network;
Detailed Description
The first embodiment:
The method for building the speech recognition model based on bottleneck features and the multi-scale multi-head attention mechanism of this embodiment comprises the following steps:
step one, use the input sample FBank speech feature vectors X = (x_1, x_2, ..., x_T) to train the RBM networks in the DBN without supervision, obtaining the first three connection weight matrices W_1, W_2, W_3 of the coding network as their initialization; these three connection weight matrices, together with a randomly initialized output layer W_4, form the DBN-based bottleneck feature extraction network at the front end of the coding network; RBM stands for Restricted Boltzmann Machine; DBN stands for Deep Belief Network; FBank denotes filter-bank features; in the sample FBank feature vector X = (x_1, x_2, ..., x_T), x_1 is the FBank feature frame at the first time step, x_2 the frame at the second time step, and x_T the frame at the T-th time step, and these frames together form the speech feature sequence; FBank features are the Mel filter-bank outputs, i.e. Mel-Frequency Cepstral Coefficients (MFCC) without the discrete cosine transform applied;
step two, input the sample FBank speech feature vectors X = (x_1, x_2, ..., x_T) into the four-layer bottleneck feature extraction network of step one and obtain the network output v by forward computation; then, using the network output v and the actual labels of the sample training data X, train the bottleneck feature extraction network with supervision by the BP algorithm, computing the gradients of the weights W_l and biases b_l of each of the four layers and updating the weights W_l and biases b_l iteratively; the supervised training of step two makes the computation faster and more accurate;
step three, remove the last output layer of the four-layer bottleneck feature extraction network of step two, keep only the first three layers as the front end of the coding network, and take the bottleneck feature sequence (v_1, v_2, ..., v_n) output by this front end as the input of the RNN; here the RNN is the back end of the coding network and consists of two groups of GRU-based multi-layer RNNs, a forward network and a backward network;
the input FBank speech feature vectors X pass through the front-end bottleneck feature extraction network of the coding network to give the bottleneck feature sequence (v_1, v_2, ..., v_n), which the GRU-based multi-layer RNN at the back end of the encoder finally outputs as the high-level feature sequence (h_1, h_2, ..., h_T);
the coding network consists of the bottleneck feature extraction network of step one and the recurrent neural network of step three, and the bottleneck feature extraction network is called the front end of the coding network; GRU stands for Gated Recurrent Unit; RNN stands for Recurrent Neural Network;
step four, with the multi-scale multi-head attention mechanism, feed the high-level feature sequence (h_1, h_2, ..., h_T) output by the back end of the coding network into convolution kernels F_i of different scales, modelling speech units at the phoneme, syllable and word levels to obtain a target vector ct_o for each head, which eases the back-end pattern classification; then take each head's ct_o as the input of that head's decoding RNN in the multi-scale multi-head attention network, computing the RNN hidden-state sequence (s_1, s_2, ..., s_o) and the output phoneme sequence (y_1, y_2, ..., y_o) position by position;
step five, splice the output sequences (y_1, y_2, ..., y_o) computed by the decoding RNN corresponding to each head's attention network in the multi-scale multi-head attention network to obtain the new output sequences (y_1, y_2, ..., y_o)_i; integrate the new output sequences with a DNN into the final output sequence (Y_1, Y_2, ..., Y_o). Here the output y_o of each head's decoding network is obtained from the target vector ct_o generated by that head's attention network and the element y_{o-1} at position o-1.
The second embodiment:
Different from the first embodiment, in step one of this embodiment's method the input speech feature vectors X = (x_1, x_2, ..., x_T) use 40-dimensional FBank features plus energy, spliced with the corresponding first- and second-order differences, 123 dimensions in total; the extracted features are first normalized over the training set so that each component follows a standard normal distribution, and the features of the test set and development set are then normalized with the training-set normalization parameters.
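To make this front end concrete, the following is a minimal Python sketch of the 123-dimensional feature pipeline described in this embodiment (40 log-Mel FBank coefficients plus energy, spliced with first- and second-order differences, then normalized with training-set statistics only). It assumes librosa is available; the 25 ms window, 10 ms hop and the helper names are illustrative assumptions, not part of the patent.

    import numpy as np
    import librosa

    def fbank_123(signal, sr=16000):
        # 40 Mel filter-bank channels; window/hop lengths are assumed values
        mel = librosa.feature.melspectrogram(y=signal, sr=sr, n_mels=40,
                                             n_fft=int(0.025 * sr),
                                             hop_length=int(0.010 * sr))
        logmel = np.log(mel + 1e-10)                              # (40, T)
        energy = np.log(mel.sum(axis=0, keepdims=True) + 1e-10)   # (1, T)
        static = np.vstack([logmel, energy])                      # 41 static dims
        d1 = librosa.feature.delta(static, order=1)               # first-order difference
        d2 = librosa.feature.delta(static, order=2)               # second-order difference
        return np.vstack([static, d1, d2]).T                      # (T, 123)

    def normalise(train_feats, other_feats):
        # standard-normal normalization with training-set statistics only;
        # the same mu/sigma are reused for the development and test sets
        mu, sigma = train_feats.mean(axis=0), train_feats.std(axis=0) + 1e-8
        return (train_feats - mu) / sigma, (other_feats - mu) / sigma

Normalizing with training-set statistics only, as the embodiment requires, avoids leaking development- or test-set information into the features.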
The third embodiment:
Different from the first or second embodiment, in steps one and two of this embodiment's method the RBM networks in the DBN undergo an unsupervised training process, where training the RBM network combines unsupervised pre-training with supervised training by the error back-propagation algorithm; the input of the RBM network is the FBank speech features, its output layer is a softmax layer, and each output-layer unit corresponds to the posterior probability of one triphone state; three hidden layers lie between the input layer and the output layer, the second hidden layer is the bottleneck layer, and its number of state units is smaller than that of the other hidden layers.
The fourth embodiment:
Different from the first or second embodiment, in step one of this embodiment's method the process of unsupervised training of the RBM networks with the input FBank speech features is as follows:
Step 1.1: on the basis of a 4-layer DBN, add a classification-label layer that outputs the phoneme states, giving the 5-layer bottleneck feature extraction RBM network, whose input is the speech features; the 4-layer DBN is formed by stacking 4 groups of single-layer RBMs, of which the first layer is a Gaussian-Bernoulli RBM and the other three layers are Bernoulli-Bernoulli RBMs;
Step 1.2: in an RBM there are no connections among the neurons within the visible layer or within the hidden layer, so given one layer the units of the other are conditionally independent, and the posterior probabilities p(v|h) and p(h|v) can both be computed conveniently; the posterior probability p(h|v) of a Bernoulli-Bernoulli RBM is computed as:
p(h|v) = exp(-E(v,h)) / Σ_h' exp(-E(v,h')),  with E(v,h) = -a^T v - b^T h - h^T W v   (1)
where E(v,h) is the energy of the single-layer RBM, v is the visible-layer vector, a is the visible-layer bias vector, h is the hidden-layer vector, b is the hidden-layer bias vector, W is the weight matrix connecting the visible and hidden layers, and W_{i,*} v denotes the product of the i-th row of W with v; deriving from equation (1) shows that, with the visible layer fixed, the components of the hidden-layer state are independent of one another;
Step 1.3: for a Bernoulli-Bernoulli RBM, each component h_i of the hidden-layer neuron state takes only the values 0 and 1, and:
p(h_i = 1 | v) = σ(W_{i,*} v + b_i)   (2)
wherein σ (·) represents a Sigmoid function;
the vector of hidden layer states can be represented by the following formula:
p(h=1|v)=σ(Wv+b) (3)
from the symmetry relationship, under known hidden layer states, the visible layer states of a single-layer bernoulli-bernoulli RBM are:
p(v = 1 | h) = σ(W^T h + a)   (4)
Step 1.4: for the Gaussian-Bernoulli RBM, whose visible layer follows a Gaussian distribution, the posterior probability p(h = 1 | v) is computed as in equation (3), and p(v|h) is computed as:
p(v | h) = N(v; W^T h + a, I)   (5)
where I denotes the identity matrix;
Step 1.5: the stacking process of the coding DBN network is as follows:
first, the Gaussian-Bernoulli RBM is trained, because the input speech features at the first layer are real-valued and match a Gaussian distribution; then the activation probabilities of its hidden units are taken as the visible-layer input of the next, Bernoulli-Bernoulli RBM; the hidden state of the second-layer RBM is taken as the input of the third-layer RBM, and all remaining layers follow by analogy;
under a given training data set V, the parameters of a single-layer RBM are trained by minimizing the negative log-likelihood with gradient descent:
min_θ  -Σ_{v∈V} log p(v; θ)   (6)
during training, the derivative of the negative log-likelihood generally has the form:
-∂log p(v)/∂θ = ⟨∂E(v,h)/∂θ⟩_data - ⟨∂E(v,h)/∂θ⟩_model   (7)
where θ denotes the RBM parameters, which determine the distribution of the visible-layer vector v, and ⟨ξ⟩_data and ⟨ξ⟩_model denote the expectation of a random variable ξ estimated from the sample data and from the model, respectively; for the weights of the connection matrix in a single-layer RBM this gives:
∂log p(v)/∂W_{ij} = ⟨v_i h_j⟩_data - ⟨v_i h_j⟩_model   (8)
Taking the Bernoulli-Bernoulli RBM as an example, the expectation ⟨v_i h_j⟩_data is the frequency with which the visible-layer state v_i and the hidden-layer state h_j are 1 simultaneously in the training data, and the expectation ⟨v_i h_j⟩_model is computed over the distribution of the model generated by the RBM.
Step six, training the DBN:
firstly, initializing a first layer RBM by using a Contrast Divergence (CD) algorithm according to input sample data; wherein, in the CD algorithm, a model is estimated by using a Sample obtained by Gibbs sampling (Gibbs Sample), and the Gibbs sampling process in the CD algorithm specifically comprises the following steps:
1) the Gibbs sampler initializes the visible layer with a data sample to obtain the visible-layer state v;
2) the posterior probability p(h|v) is computed with equation (3) to give the hidden-layer state h;
next, the visible-layer state v is sampled and reconstructed with equation (4) or equation (5), according to the RBM type; then the parameters of the first-layer RBM are fixed, the hidden-layer state of this RBM is computed from the sample data and taken as the visible-layer state of the second-layer RBM, and at this point the second-layer RBM is trained with the CD algorithm;
this process is then repeated, and the weight parameters connecting the DBN layers are finally obtained one by one;
finally, after training only the weight matrices connecting the layers are kept, as the initialization parameters of the DBN transfer matrices.
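As an illustration of step 1.6, the following NumPy sketch performs one CD-1 update for a Bernoulli-Bernoulli RBM, following equations (3), (4) and (8): a single Gibbs step reconstructs the visible layer, and the weight gradient is the difference between the data and model statistics ⟨v h⟩. The layer sizes, learning rate and function names are assumptions, not part of the patent.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def cd1_update(v0, W, a, b, lr=0.01, rng=np.random.default_rng(0)):
        # positive phase: hidden posterior given the data, equation (3)
        ph0 = sigmoid(v0 @ W.T + b)                # p(h = 1 | v0)
        h0 = (rng.random(ph0.shape) < ph0) * 1.0   # sample binary hidden states
        # negative phase: one Gibbs step, equation (4) then equation (3)
        pv1 = sigmoid(h0 @ W + a)                  # reconstructed p(v = 1 | h0)
        ph1 = sigmoid(pv1 @ W.T + b)
        # gradient estimate <v h>_data - <v h>_model, equation (8)
        batch = v0.shape[0]
        W += lr * (ph0.T @ v0 - ph1.T @ pv1) / batch
        a += lr * (v0 - pv1).mean(axis=0)
        b += lr * (ph0 - ph1).mean(axis=0)
        return W, a, b

Stacking per step 1.5 then amounts to calling this layer by layer, feeding each trained layer's hidden activation probabilities to the next RBM as its visible data.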
The fifth embodiment:
Different from the first or second embodiment, in step two of this embodiment's method the BP algorithm performs supervised training on the network: after the network has been trained without supervision on the sample input data by means of the RBM technique, the first 3 connection weight matrices of the coding network are initialized, and the trained weight matrices can partly reflect and "memorize" the effective information in the sample data. To further strengthen the network's ability to extract and discriminate speech features, and to give the bottleneck features used by the subsequent attention model stronger robustness and discriminative power, the speech samples are labelled and supervised training is carried out on top of the unsupervised training; the specific process is as follows:
Step 2.1: the DBN weights are migrated directly into a Deep Neural Network (DNN); fine-tuning then starts from these known partial DNN model parameters. With the first 3 connection weight matrices W_1, W_2, W_3 of the network known, these three layers and one randomly initialized output layer W_4 form the DBN-based bottleneck feature extraction network at the front end of the coding network; activation functions are added between layers, and the forward pass of the network is computed:
let f(·) denote the activation function; the 1st hidden-layer state h_1 and the l-th hidden-layer state h_l are, respectively:
h_1 = f(W_1 x + b_1)   (9)
h_l = f(W_l h_{l-1} + b_l)   (10)
from the 3rd hidden-layer state h_3, the final output vector v of the network is obtained through the softmax function:
v = softmax(W_4 h_3 + b_4)   (11)
when the network is trained in this way, a fixed number of samples are randomly drawn from the training sample set each time, the features of the drawn samples are fed into the network, and the network output is obtained by forward computation;
Step 2.2: the gradients of each layer's parameters are computed with the BP algorithm from the network output and the sample training data, and the model parameters are then updated:
the DBN network training criterion typically uses the minimum Cross Entropy (CE) function:
J_CE = -Σ_{m=1}^{M} Σ_i t_i^(m) log v_i^(m)   (12)
where t^(m) denotes the target label vector of sample m and v^(m) the corresponding network output; the network parameters are then fine-tuned with the back-propagation algorithm (BP), whose general model-parameter updates are:
W_{t+1}^l = W_t^l - μ ΔW_t^l   (13)
b_{t+1}^l = b_t^l - μ Δb_t^l   (14)
where W_t^l and b_t^l are the weight matrix and bias vector of layer l of the network after t parameter updates, μ is the learning rate, and ΔW_t^l and Δb_t^l are the gradients of the weight matrix and the bias vector, computed from the training samples as first derivatives:
ΔW_t^l = (1/M) Σ_{m=1}^{M} ∂J_CE^(m)/∂W_t^l   (15)
Δb_t^l = (1/M) Σ_{m=1}^{M} ∂J_CE^(m)/∂b_t^l   (16)
where M is the number of samples used in each training step;
Step 2.3: steps 2.1 and 2.2 are carried out cyclically, and the error between the network output and the model's target output, compared against a preset threshold, is taken as the sign of whether training has finished:
first, during training the error between the network output and the reference output is usually computed on a development set for cross-validation;
then, to obtain the bottleneck feature extraction network on this basis, the state units, transfer matrix and bias vector of the last output layer of the trained network are all deleted, the remainder is kept as the front end of the coding network, and the state of the bottleneck layer is taken as the output of the coding-network front end; to extract the bottleneck features, the original speech feature parameters are used as network input and the output vector of the bottleneck layer is computed layer by layer through the network; this vector is the bottleneck feature parameter. Because the RBM is used to initialize the model parameters during network training, and because of the alignment information provided by the GMM-HMM model during supervised training, the extracted bottleneck features discriminate different phoneme states better and are more robust than conventional features. Applying the bottleneck features to the attention model is thus equivalent to migrating the knowledge in this network into the attention model.
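The fine-tuning stage of steps 2.1-2.3 can be sketched in PyTorch as below: the three DBN matrices initialize the first layers, a randomly initialized softmax output layer W_4 is appended, the stack is trained with cross entropy (equations (12)-(16) reduce to cross entropy plus SGD), and afterwards the output layer is bypassed so the bottleneck-layer activations serve as features. The layer widths (1024-40-1024) and the number of output states are assumptions for illustration.

    import torch
    import torch.nn as nn

    class BottleneckDNN(nn.Module):
        def __init__(self, dbn_weights, dbn_biases, n_states):
            super().__init__()
            dims = [123, 1024, 40, 1024]            # 2nd hidden layer = bottleneck
            self.hidden = nn.ModuleList(
                nn.Linear(dims[i], dims[i + 1]) for i in range(3))
            for lin, W, b in zip(self.hidden, dbn_weights, dbn_biases):
                lin.weight.data.copy_(W)            # migrate DBN weights W1..W3
                lin.bias.data.copy_(b)
            self.out = nn.Linear(dims[3], n_states) # randomly initialized W4

        def forward(self, x, return_bottleneck=False):
            for i, lin in enumerate(self.hidden):
                x = torch.sigmoid(lin(x))
                if return_bottleneck and i == 1:    # bottleneck-layer state
                    return x
            return self.out(x)                      # logits; softmax is in the loss

    # one supervised update on a random mini-batch (illustrative sizes)
    model = BottleneckDNN([torch.randn(1024, 123), torch.randn(40, 1024),
                           torch.randn(1024, 40)],
                          [torch.zeros(1024), torch.zeros(40), torch.zeros(1024)],
                          n_states=1000)
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    loss = nn.CrossEntropyLoss()(model(torch.randn(8, 123)),
                                 torch.randint(0, 1000, (8,)))
    loss.backward(); opt.step()
    bnf = model(torch.randn(8, 123), return_bottleneck=True)  # (8, 40) features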
The sixth embodiment:
Different from the fifth embodiment, in this embodiment's method:
the multi-scale multi-head attention model consists of a coding network, a decoding network and an attention network;
the coding network and the decoding network both contain recurrent-neural-network units;
the front end of the coding network is the DBN feature extraction network and its back end is a deep recurrent neural network, so that the contextual information of the speech feature sequence is learned and mined, high-level information is extracted from the original features, and the discriminability and representational power of the features are strengthened;
the main body of the attention network is a multilayer perceptron; the network inputs are the output of the coding network and the hidden-unit state of the decoding network, and the network output is the relevance score between the coding network and the decoding network; the attention coefficient is obtained after exponential normalization of the relevance score;
the decoding network is formed by connecting a single-layer recurrent neural network with a maxout network; the attention coefficients are obtained from the attention network, the outputs of all the coding-network positions are then weighted and summed to obtain the target vector, and finally the target vector is used as network input to compute the posterior probability of each phoneme at each position of the output sequence.
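The following PyTorch sketch shows schematically how these three sub-networks connect, with assumed dimensions throughout: the coding network turns bottleneck features into high-level states, an attention MLP scores them against the decoder state, and the decoding recurrent network emits one phoneme posterior per output position. For brevity it shows a single head and omits the convolutional location features and the maxout layer, which are detailed in the embodiments below.

    import torch
    import torch.nn as nn

    bnf = torch.randn(1, 200, 40)                   # bottleneck features (B, T, 40)
    encoder = nn.GRU(40, 256, num_layers=2, batch_first=True, bidirectional=True)
    h, _ = encoder(bnf)                             # high-level sequence (B, T, 512)

    score_mlp = nn.Sequential(nn.Linear(512 + 320, 128), nn.Tanh(),
                              nn.Linear(128, 1))    # attention network a(.)
    dec_cell = nn.GRUCell(512 + 62, 320)            # decoding RNN, 62 phoneme classes
    out_layer = nn.Linear(320, 62)                  # stands in for the maxout network

    s = torch.zeros(1, 320)                         # decoder hidden state s_o
    y = torch.zeros(1, 62)                          # previous output y_{o-1}
    for o in range(30):                             # 30 output positions
        e = score_mlp(torch.cat([h, s[:, None, :].expand(-1, h.size(1), -1)], -1))
        alpha = torch.softmax(e.squeeze(-1), dim=1) # attention coefficients
        ct = (alpha[:, :, None] * h).sum(dim=1)     # target vector ct_o
        s = dec_cell(torch.cat([y, ct], dim=-1), s)
        y = torch.softmax(out_layer(s), dim=-1)     # phoneme posterior at position o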
The seventh embodiment:
Different from the sixth embodiment, in step three of this embodiment's method the high-level feature sequence (h_1, h_2, ..., h_T) is obtained by building the front end and the back end of the coding network of the multi-scale multi-head attention model; the specific process is as follows:
the back end of the coding network consists of two groups of multi-layer recurrent neural networks based on GRU units, a forward network and a backward network; this bidirectional structure can propagate past and future information simultaneously and raises the information content of the high-level features.
Step 3.1: first, the trained DBN bottleneck features (v_1, v_2, ..., v_n) from the front end of the coding network are taken as the input of the recurrent neural network; then the number of stacked recurrent-network layers is reduced, with only one layer kept to process the temporal information of the feature sequence, and the high-level feature sequence (h_1, h_2, ..., h_T) is output;
Step three, at the time t of the rear end of the coding network, the hidden layer states of the forward network and the backward network of the layer c are respectively
Figure BDA0002082613110000101
And
Figure BDA0002082613110000102
since the implicit layer unit information transfer directions of the forward network and the backward network are opposite, the corresponding transfer functions are respectively expressed by the following equations (17) and (18):
Figure BDA0002082613110000103
Figure BDA0002082613110000104
Step 3.2: the forward network and the backward network pass features between layers in parallel through their respective transfer functions, and the time axis is downsampled during this propagation to reduce the amount of computation. In the forward network, with the GRU as the basic unit, the hidden states are computed as in equations (17) and (18): from the input feature sequence (x_1, x_2, ..., x_T) the layer-1 hidden states (h→_1^1, h→_2^1, ..., h→_T^1) can be obtained, and likewise the layer-(c-1) hidden states h→_t^{c-1} can be computed; the hidden state at layer c is then given by equation (19):
h→_t^c = GRU(h→_{t-1}^c, h→_t^{c-1})   (19)
similarly, the layer-c hidden states of the backward network are computed;
Step 3.3: the layer-c hidden states of the forward network and the backward network are spliced to give the high-level feature h_t of the coding network at time t:
h_t = [h→_t^c ; h←_t^c]   (20)
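A compact PyTorch sketch of this back end, under assumed widths and an assumed downsampling factor of 2: each layer is a bidirectional GRU whose forward and backward states are concatenated as in equation (20), and the sequence is thinned in time between layers to cut computation, as step 3.2 describes.

    import torch
    import torch.nn as nn

    class BidirectionalBackEnd(nn.Module):
        def __init__(self, in_dim=40, hidden=256, layers=3):
            super().__init__()
            dims = [in_dim] + [2 * hidden] * (layers - 1)
            self.rnns = nn.ModuleList(
                nn.GRU(d, hidden, batch_first=True, bidirectional=True)
                for d in dims)

        def forward(self, x):                  # x: bottleneck features (B, n, in_dim)
            for i, rnn in enumerate(self.rnns):
                x, _ = rnn(x)                  # [h_fwd ; h_bwd] per frame, eq. (20)
                if i + 1 < len(self.rnns):
                    x = x[:, ::2, :]           # downsample time between layers
            return x                           # high-level sequence (h_1, ..., h_T)

    enc = BidirectionalBackEnd()
    h = enc(torch.randn(2, 320, 40))           # (2, 80, 512) after two 2x reductions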
The eighth embodiment:
Different from the seventh embodiment, in step four of this embodiment's method the attention network of the multi-scale multi-head attention model is built as follows:
the attention mechanism is a technique for extracting effective features from a feature sequence in a sequence-to-sequence model; it assigns attention to the relevant regions of the feature sequence and thereby ignores the irrelevant features. To compute an element y_o of the output sequence of the sequence-to-sequence model with the multi-scale multi-head attention mechanism, the feature sequence (h_1, h_2, ..., h_T) is mapped with convolution kernels F_i of different scales to one target vector ct_o for each head of the attention mechanism; the target vector ct_o contains the compressed important information of the sequence and converts the important feature information of the sequence into a fixed-length vector, which eases back-end pattern classification; ct_o is then used as the input of the decoding network behind each head's attention network, and the hidden-state sequence (s_1, s_2, ..., s_o) and the output sequence (y_1, y_2, ..., y_o) of the recurrent neural network are computed position by position; wherein,
a single attention head of the multi-scale multi-head attention model computes the target vector ct_o corresponding to output-sequence position o ∈ {1, 2, ..., O} with the attention mechanism as follows:
first, the relevance between the hidden state s_{o-1} at output-sequence position o-1 and the feature at time t is computed, as in equation (21):
e_{o,t} = a(s_{o-1}, h_t)   (21)
where a(·) denotes the attention network, a multilayer perceptron with a single hidden layer, which can be written as:
e_{o,t} = ω^T tanh(W[s_{o-1}, h_t, f_{o,t}] + b)   (22)
f_o = F_i ∗ α_{o-1}   (23)
where W and b are the input-to-hidden-layer weight matrix and bias vector respectively, ω is the hidden-to-output-layer weight matrix, F_i is a convolution kernel, and α_{o-1} is the attention-coefficient vector (see equation (24)) of the previous output position, so f_{o,t} is the t-th column of f_o; for the multi-scale multi-head attention mechanism the kernels F_i differ in size, so the convolution part of each head uses a convolution filter of a different size. Because the model is multi-headed and its convolution filters differ in size, it can not only process the sequence data with representation information from different subspaces at different sequence positions, but the differently sized convolution filters can also model information at the phoneme, syllable and word levels.
The multi-scale multi-head model is initialized randomly with different convolution filter parameters, so the multi-head model forms an ensemble; convolution filters of different sizes represent different models, and the attention scores of the previous position can be learned and trained under the supervision of these various models to obtain models of different speech units. Here, models with a smaller convolution filter mainly represent the phoneme level, those of normal size mainly represent the syllable level, and those with a larger filter represent the word level. The multi-scale multi-head attention model can thus model speech units at different levels, achieving the goal of improving the speech recognition result.
Then, the relevance of the features at all times is exponentially normalized, and the normalized value is called attention coefficient as follows:
Figure BDA0002082613110000121
and finally, taking the attention coefficient as weight, and carrying out weighted summation on the features at all moments to obtain a target vector ct of an output sequence position o under the attention mechanismo:
Figure BDA0002082613110000122
The process of computing the target vector with the attention mechanism can be abbreviated as equation (26):
ct_o = attention([h_1, h_2, ..., h_T], s_{o-1})   (26).
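The NumPy sketch below runs equations (21)-(25) for one attention head: the previous attention coefficients α_{o-1} are convolved with that head's kernel F_i (equation (23)), scored by the single-hidden-layer MLP (equation (22)), normalized (equation (24)), and used to pool the encoder states into ct_o (equation (25)). Looping over kernel widths imitates the multi-scale heads; all dimensions and widths are assumptions.

    import numpy as np

    def head_step(H, s_prev, alpha_prev, F, W, b, w_out):
        # H: (T, n) encoder states; s_prev: (m,) decoder state; F: (width, c) kernel
        T, width = H.shape[0], F.shape[0]
        padded = np.pad(alpha_prev, width // 2)
        f = np.stack([padded[t:t + width] @ F for t in range(T)])   # eq. (23)
        e = np.array([w_out @ np.tanh(W @ np.concatenate([s_prev, H[t], f[t]]) + b)
                      for t in range(T)])                           # eq. (22)
        alpha = np.exp(e - e.max()); alpha /= alpha.sum()           # eq. (24)
        return alpha @ H, alpha                                     # ct_o, eq. (25)

    T, n, m, c, hid = 50, 512, 320, 10, 128
    H, s = np.random.randn(T, n), np.random.randn(m)
    alpha = np.full(T, 1.0 / T)                 # uniform attention at position 0
    for width in (3, 7, 15):                    # phoneme / syllable / word scales
        F = 0.1 * np.random.randn(width, c)
        W = 0.01 * np.random.randn(hid, m + n + c)
        ct, alpha_i = head_step(H, s, alpha, F, W, np.zeros(hid),
                                0.01 * np.random.randn(hid))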
The ninth embodiment:
Different from the eighth embodiment, in step five of this embodiment's method the decoding network corresponding to each head's attention network computes the element y_o at position o of the output sequence (y_1, y_2, ..., y_o) as follows:
for a single head of the multi-scale multi-head attention model, when the decoding network behind each head's attention network computes the element y_o at position o of the output sequence (y_1, y_2, ..., y_o), the inputs of the network are the target vector ct_o generated by that head's multi-scale single-attention network and the element y_{o-1} at position o-1; the computation process is as follows:
first, the decoding network obtains the target vector ct_o from the attention network, and takes ct_o and y_{o-1} as the input of the GRU-unit-based recurrent neural network to compute the hidden state s_o of the recurrent neural network, as in equation (27):
s_o = GRU(s_{o-1}, [y_{o-1}, ct_o])   (27)
then, given the decoding-network hidden state s_o ∈ R^d as input, the posterior probability p(i | s_o) of phoneme i is obtained through the maxout network;
each hidden unit of the maxout network has several candidate units, and the network selects the candidate with the maximum value as its output; the computation process is:
z_{i,j} = s_o^T W_{·,i,j} + b_{i,j}   (28)
u_i = max_{j∈{1,...,k}} z_{i,j}   (29)
where d is the dimension of the input hidden state s_o, corresponding to the number of hidden units; W ∈ R^{d×m×k} and b ∈ R^{m×k} are the maxout network's parameter matrix and bias vector; k is the number of candidate units per hidden unit of the maxout network; and m is the number of output-layer units, corresponding to the number of output phoneme classes in the acoustic model;
finally, as shown in equation (30), the output vector y_o is obtained from the output layer of the maxout network; the i-th component of y_o represents the posterior probability of phoneme i appearing at output-sequence position o:
y_{o,i} = p(i | s_o) = exp(u_i) / Σ_{i'=1}^{m} exp(u_{i'})   (30)
finally, all the output sequences (y_1, y_2, ..., y_o)_i are integrated into the new output sequence (Y_1, Y_2, ..., Y_o) with a DNN:
Y_i = DNN((y_1, y_2, ..., y_o)_i)   (31).
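One decoding step per head and the DNN fusion of equations (27)-(31) can be sketched in PyTorch as follows; the maxout layer keeps k candidate units per output unit and takes their maximum, and the sizes (d = 320, m = 62 phonemes, k = 4, three heads) are assumptions.

    import torch
    import torch.nn as nn

    class MaxoutOutput(nn.Module):
        def __init__(self, d=320, m=62, k=4):
            super().__init__()
            self.proj = nn.Linear(d, m * k)     # W in R^{d x m x k}, flattened
            self.m, self.k = m, k

        def forward(self, s):                   # s: (B, d) decoder hidden state s_o
            z = self.proj(s).view(-1, self.m, self.k)
            u = z.max(dim=-1).values            # eqs. (28)-(29): max over candidates
            return torch.softmax(u, dim=-1)     # eq. (30): phoneme posteriors y_o

    heads, m = 3, 62
    cell = nn.GRUCell(512 + m, 320)             # s_o = GRU(s_{o-1}, [y_{o-1}, ct_o])
    maxout = MaxoutOutput()
    fuse = nn.Sequential(nn.Linear(heads * m, 128), nn.ReLU(),
                         nn.Linear(128, m))     # the DNN of eq. (31)

    y_heads = []
    for _ in range(heads):                      # one decoding RNN per attention head
        ct = torch.randn(1, 512)                # ct_o from that head's attention
        s = cell(torch.cat([torch.zeros(1, m), ct], dim=-1), torch.zeros(1, 320))
        y_heads.append(maxout(s))               # eq. (27) then eq. (30)
    Y = torch.softmax(fuse(torch.cat(y_heads, dim=-1)), dim=-1)   # Y_o, eq. (31)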
Speech recognition based on the multi-scale multi-head attention model outperforms conventional models in phoneme sequence recognition and has stronger temporal modelling and discriminative power. To increase the model's robustness in noisy environments, the DBN-based bottleneck features are added on top of this model, for the following reasons. First, because the DBN-based bottleneck feature extraction network is not recursive, the gradients of many frames can be computed in parallel on a GPU during DBN training, which effectively raises training efficiency and avoids the large scale and long training time caused by the multi-layer structure of a recurrent neural network. Second, reducing the number of recurrent layers shortens the distance over which gradients propagate backwards through the network, further cutting the training cost; at the same time, using more discriminative features makes model training easier to converge, raising training efficiency further. Finally, the DBN-based bottleneck features are strongly robust to noise, so using them compensates for the weak noise resistance of the multi-scale multi-head attention model. The multi-scale model is initialized randomly with different convolution filter parameters, so the multi-head model forms an ensemble and is superior to a single-head model; convolution filters of different sizes represent different models, and the attention scores of the previous position can be learned and trained under the supervision of these various models to obtain models of different speech units: smaller filters mainly represent the phoneme level, normal sizes the syllable level, and larger sizes the word level. The multi-scale multi-head attention model can thus model speech units at different levels, achieving the goal of improving speech recognition.
The foregoing shows and describes the basic principles and main features of the present invention and its advantages. Those skilled in the art will understand that the invention is not limited to the embodiments described above; the embodiments and the description only illustrate the principle of the invention, and various changes and improvements may be made without departing from its spirit and scope, all of which fall within the scope of the claimed invention. The scope of the invention is defined by the appended claims and their equivalents.

Claims (7)

1. A method for building a speech recognition model based on bottleneck features and a multi-scale multi-head attention mechanism, characterized in that the method comprises the following steps:
step one, use the input sample FBank speech feature vectors X = (x_1, x_2, ..., x_T) to train the RBM networks in the DBN without supervision, obtaining the first three connection weight matrices W_1, W_2, W_3 of the coding network as their initialization; these three connection weight matrices, together with a randomly initialized output layer W_4, form the DBN-based bottleneck feature extraction network at the front end of the coding network; the RBM network denotes a restricted Boltzmann machine; DBN denotes a deep belief network; FBank denotes filter-bank features;
step two, input the sample FBank speech feature vectors X = (x_1, x_2, ..., x_T) into the four-layer bottleneck feature extraction network of step one and obtain the network output v by forward computation; then, using the network output v and the actual labels of the sample training data X, train the bottleneck feature extraction network with supervision by the BP algorithm, computing the gradients ΔW_t^l and Δb_t^l of the weights and biases of each of the four layers and updating the weights and biases iteratively;
step three, remove the last output layer of the four-layer bottleneck feature extraction network of step two, keep only the first three layers as the front end of the coding network, and take the bottleneck feature sequence (v_1, v_2, ..., v_n) output by this front end as the input of the RNN; here the RNN is the back end of the coding network and consists of two groups of GRU-based multi-layer RNNs, a forward network and a backward network;
the input FBank speech feature vectors X pass through the front-end bottleneck feature extraction network of the coding network to give the bottleneck feature sequence (v_1, v_2, ..., v_n), which the GRU-based multi-layer RNN at the back end of the encoder finally outputs as the high-level feature sequence (h_1, h_2, ..., h_T);
step four, with the multi-scale multi-head attention mechanism, feed the high-level feature sequence (h_1, h_2, ..., h_T) output by the back end of the coding network into convolution kernels F_i of different scales, modelling speech units at the phoneme, syllable and word levels to obtain a target vector ct_o for each head; then take each head's ct_o as the input of that head's decoding RNN in the multi-scale multi-head attention network, computing the RNN hidden-state sequence (s_1, s_2, ..., s_o) and the output phoneme sequence (y_1, y_2, ..., y_o);
in the fourth step, the attention network of the multi-scale multi-head attention model is built as follows:
with the multi-scale multi-head attention mechanism, to compute an element y_o of the output sequence of the sequence-to-sequence model, the feature sequence (h_1, h_2, ..., h_T) is mapped with convolution kernels F_i of different scales to one target vector ct_o for each head of the attention mechanism; ct_o is then used as the input of the decoding network behind each head's attention network, and the hidden-state sequence (s_1, s_2, ..., s_o) and the output sequence (y_1, y_2, ..., y_o) of the recurrent neural network are computed position by position; wherein,
a single attention head of the multi-scale multi-head attention model computes the target vector ct_o corresponding to output-sequence position o ∈ {1, 2, ..., O} with the attention mechanism as follows:
first, the relevance between the hidden state s_{o-1} at output-sequence position o-1 and the feature at time t is computed, as in equation (21):
e_{o,t} = a(s_{o-1}, h_t)   (21)
where a(·) denotes the attention network, a multilayer perceptron with a single hidden layer, expressed as:
e_{o,t} = ω^T tanh(W[s_{o-1}, h_t, f_{o,t}] + b)   (22)
f_o = F_i ∗ α_{o-1}   (23)
where W and b are the input-to-hidden-layer weight matrix and bias vector respectively, ω is the hidden-to-output-layer weight matrix, F_i is a convolution kernel, and α_{o-1} is the attention-coefficient vector (see equation (24)) of the previous output position; for the multi-scale multi-head attention mechanism the kernels F_i differ in size, and the convolution part of each head uses a convolution filter of a different size;
the multi-scale multi-head model is initialized randomly with different convolution filter parameters, so that the multi-head model forms an ensemble model; convolution filters of different sizes represent different models, so the attention scores of the previous position can be learned and trained under the supervision of these various models to obtain models of different speech units; the models with a smaller convolution filter size mainly represent the phoneme level, those of normal size mainly represent the syllable level, and those with a larger size represent the word level;
then, the relevances of the features at all times are exponentially normalized; the normalized value is called the attention coefficient:
α_{o,t} = exp(e_{o,t}) / Σ_{t'=1}^{T} exp(e_{o,t'})   (24)
finally, with the attention coefficients as weights, the features at all times are weighted and summed to give the target vector ct_o of output-sequence position o under the attention mechanism:
ct_o = Σ_{t=1}^{T} α_{o,t} h_t   (25)
the process of computing the target vector with the attention mechanism can be abbreviated as equation (26):
ct_o = attention([h_1, h_2, ..., h_T], s_{o-1})   (26)
step five, splice the output sequences (y_1, y_2, ..., y_o) computed by the decoding RNN corresponding to each head's attention network in the multi-scale multi-head attention network to obtain the new output sequences (y_1, y_2, ..., y_o)_i; integrate the new output sequences with a DNN into the final output sequence (Y_1, Y_2, ..., Y_o);
in the fifth step, the decoding network corresponding to each head's attention network computes the element y_o at position o of the output sequence (y_1, y_2, ..., y_o) as follows:
for a single head of the multi-scale multi-head attention model, when the decoding network behind each head's attention network computes the element y_o at position o of the output sequence (y_1, y_2, ..., y_o), the inputs of the network are the target vector ct_o generated by that head's multi-scale single-attention network and the element y_{o-1} at position o-1; the computation process is as follows:
first, the decoding network obtains the target vector ct_o from the attention network, and takes ct_o and y_{o-1} as the input of the GRU-unit-based recurrent neural network to compute the hidden state s_o of the recurrent neural network, as in equation (27):
s_o = GRU(s_{o-1}, [y_{o-1}, ct_o])   (27)
then, given the decoding-network hidden state s_o ∈ R^d as input, the posterior probability p(i | s_o) of phoneme i is obtained through the maxout network;
Each hidden layer unit of the maxout network has a plurality of candidate units, the network selects the unit with the maximum value as output, and the calculation process is as follows:
Figure FDA0003297912520000033
Figure FDA0003297912520000034
where d is the dimension of the input hidden state s_o, corresponding to the number of hidden units; W ∈ R^{d×m×k} and b ∈ R^{m×k} are the maxout network's parameter matrix and bias vector; k is the number of candidate units per hidden unit of the maxout network; and m is the number of output-layer units, corresponding to the number of output phoneme classes in the acoustic model;
finally, as shown in equation (30), the output vector y_o is obtained from the output layer of the maxout network; the i-th component of y_o represents the posterior probability of phoneme i appearing at output-sequence position o:
y_{o,i} = p(i | s_o) = exp(u_i) / Σ_{i'=1}^{m} exp(u_{i'})   (30)
finally, all the output sequences (y_1, y_2, ..., y_o)_i are integrated into the new output sequence (Y_1, Y_2, ..., Y_o) with a DNN:
Y_i = DNN((y_1, y_2, ..., y_o)_i)   (31).
2. The method for building a speech recognition model based on bottleneck features and a multi-scale multi-head attention mechanism according to claim 1, characterized in that: in step one, the input FBank speech feature vectors X = (x_1, x_2, ..., x_T) use 40-dimensional FBank features plus energy, spliced with the corresponding first- and second-order differences, 123 dimensions in total; the extracted features are first normalized over the training set so that each component follows a standard normal distribution, and the features of the test set and development set are then normalized with the training-set normalization parameters.
3. The method for building a speech recognition model based on bottleneck features and a multi-scale multi-head attention mechanism according to claim 2, characterized in that: in steps one and two, the RBM networks in the DBN undergo an unsupervised training process; the input of the RBM network is the FBank speech features, its output layer is a softmax layer, and each output-layer unit corresponds to the posterior probability of one triphone state; three hidden layers lie between the input layer and the output layer, the second hidden layer is the bottleneck layer, and its number of state units is smaller than that of the other hidden layers.
4. The method for building a speech recognition model based on bottleneck features and a multi-scale multi-head attention mechanism according to claim 3, characterized in that: in step one, the process of unsupervised training of the RBM networks in the DBN with the input FBank speech feature vectors X = (x_1, x_2, ..., x_T) is:
Step 1.1: on the basis of a 4-layer DBN, add a classification-label layer that outputs the phoneme states, giving the 5-layer bottleneck feature extraction RBM network, whose input is the speech features; the 4-layer DBN is formed by stacking 4 groups of single-layer RBMs, of which the first layer is a Gaussian-Bernoulli RBM and the other three layers are Bernoulli-Bernoulli RBMs;
Step 1.2: in the RBM, the posterior probability p(h|v) of a Bernoulli-Bernoulli RBM is computed as:
p(h|v) = exp(-E(v,h)) / Σ_h' exp(-E(v,h')),  with E(v,h) = -a^T v - b^T h - h^T W v   (1)
where E(v,h) is the energy of the single-layer RBM, v is the visible-layer vector, a is the visible-layer bias vector, h is the hidden-layer vector, b is the hidden-layer bias vector, W is the weight matrix connecting the visible and hidden layers, and W_{i,*} v denotes the product of the i-th row of W with v; all the components of the hidden-layer state are mutually independent;
Step 1.3: in the Bernoulli-Bernoulli RBM, each component h_i of the hidden-layer neuron state takes only the values 0 and 1, so that:

p(h_i = 1 | v) = σ(W_{i,*} v + b_i)   (2)
wherein σ(·) denotes the sigmoid function;
the vector of hidden layer states can be represented by the following formula:
p(h=1|v)=σ(Wv+b) (3)
by symmetry, given the hidden-layer state, the visible-layer state of a single-layer Bernoulli-Bernoulli RBM satisfies:

p(v = 1 | h) = σ(W^T h + a)   (4)
Step 1.4: for the Gaussian-Bernoulli RBM, whose visible layer follows a Gaussian distribution, the posterior probability p(h = 1 | v) is calculated as in equation (3), and p(v|h) is calculated as:

p(v|h) = N(v; W^T h + a, I)   (5)

wherein I denotes the identity matrix;
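The conditionals in equations (2)-(5) can be written down directly; the following numpy sketch assumes the usual Bernoulli and Gaussian-Bernoulli conventions, and the shapes and function names are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def p_h_given_v(v, W, b):
    """Equation (3): p(h = 1 | v) = sigma(W v + b)."""
    return sigmoid(W @ v + b)

def p_v_given_h_bernoulli(h, W, a):
    """Equation (4): p(v = 1 | h) = sigma(W^T h + a)."""
    return sigmoid(W.T @ h + a)

def sample_v_given_h_gaussian(h, W, a, rng):
    """Equation (5): v | h ~ N(W^T h + a, I) (Gaussian-Bernoulli RBM)."""
    mean = W.T @ h + a
    return mean + rng.standard_normal(mean.shape)

rng = np.random.default_rng(0)
W = 0.1 * rng.standard_normal((8, 16))        # 8 hidden, 16 visible units
v = (rng.random(16) < 0.5).astype(float)
ph = p_h_given_v(v, W, np.zeros(8))           # hidden activation probabilities
```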
Step 1.5: the stacking process of the DBN in the coding network is as follows: first, a Gaussian-Bernoulli RBM is trained; then the activation probabilities of its hidden-layer units serve as the visible-layer input of the next, Bernoulli-Bernoulli RBM; the hidden state of the second RBM serves as the input of the third RBM, and so on for all layers;
given a training data set V, the parameters of a single-layer RBM are trained by minimizing the negative log-likelihood with gradient descent:

θ* = argmin_θ ( −Σ_{v∈V} log p(v; θ) )   (6)
during training, the derivative of the negative log-likelihood has the general form:

−∂log p(v)/∂θ = ⟨∂E(v,h)/∂θ⟩_data − ⟨∂E(v,h)/∂θ⟩_model   (7)
wherein θ denotes the RBM model parameters, and ⟨ξ⟩_data and ⟨ξ⟩_model respectively denote the expectation of a random variable ξ estimated from the sample data and from the model; for the weights of the connection matrix in a single-layer RBM this gives:

∂log p(v)/∂w_{ij} = ⟨v_j h_i⟩_data − ⟨v_j h_i⟩_model   (8)
Step 1.6: training the DBN:
First, the first-layer RBM is initialized from the input sample data and trained with the contrastive divergence (CD) algorithm; in the CD algorithm the model expectation is estimated with samples obtained by Gibbs sampling, which proceeds as follows:
1) the Gibbs sampler initializes the visible layer with a data sample, giving the visible-layer state v;
2) the posterior probability p(h|v) is computed with equation (3) and taken as the hidden-layer state h;
Next, the visible-layer state v is sampled and reconstructed with equation (4) or equation (5), according to the RBM type; the parameters of the first-layer RBM are then fixed, its hidden-layer state is computed from the sample data and taken as the visible-layer state of the second-layer RBM, and the second-layer RBM is in turn trained with the CD algorithm;
This process is repeated, yielding the weight parameters connecting the DBN layers one layer at a time;
Finally, after training, only the weight matrices connecting the layers are retained, as the initialization parameters of the DBN transfer matrices.
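As a concrete reference for Step 1.6, here is a hedged one-step contrastive-divergence (CD-1) update for a Bernoulli-Bernoulli RBM, with the model expectation in equation (8) estimated by a single Gibbs reconstruction; the learning rate, the sizes and the use of probabilities in the negative phase are assumptions of this sketch, not the patent's reference implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0, W, a, b, lr, rng):
    """One CD-1 step for a Bernoulli-Bernoulli RBM.
    v0: (n_vis,) binary data sample; W: (n_hid, n_vis);
    a: visible bias, b: hidden bias. Returns updated (W, a, b)."""
    ph0 = sigmoid(W @ v0 + b)                           # p(h|v0), eq. (3)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)    # sample hidden state
    pv1 = sigmoid(W.T @ h0 + a)                         # reconstruction, eq. (4)
    ph1 = sigmoid(W @ pv1 + b)                          # hidden probs on reconstruction
    W += lr * (np.outer(ph0, v0) - np.outer(ph1, pv1))  # <v h>_data - <v h>_model, eq. (8)
    a += lr * (v0 - pv1)
    b += lr * (ph0 - ph1)
    return W, a, b

rng = np.random.default_rng(0)
W = 0.01 * rng.standard_normal((100, 123))
a, b = np.zeros(123), np.zeros(100)
v0 = (rng.random(123) < 0.5).astype(float)
W, a, b = cd1_update(v0, W, a, b, lr=0.01, rng=rng)
```

Stacking then proceeds as in Step 1.5: the hidden activation probabilities of the trained RBM become the visible input of the next one.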
5. The method for building a speech recognition model based on bottleneck features and a multi-scale multi-head attention mechanism according to claim 4, wherein: in the second step, the BP algorithm performs supervised training of the bottleneck feature extraction network from its output v and the labels of the sample training data X; that is, on the basis of the unsupervised pre-training, the labeled speech samples are used for supervised training, as follows:
Step 2.1: the DBN weights are migrated directly into a deep neural network, which is then fine-tuned on the basis of the known partial DNN model parameters; the first three connection matrices W_1, W_2, W_3 of the network are known, and together with a randomly initialized output layer W_4 they form the DBN-based bottleneck feature extraction network at the front end of the coding network; activation functions are added between the layers, and the forward computation of the network is:
Let f(·) denote the activation function; the 1st hidden-layer state h_1 and the l-th hidden-layer state h_l are respectively:

h_1 = f(W_1 x + b_1)   (9)

h_l = f(W_l h_{l-1} + b_l)   (10)

The output vector v at the end of the network is obtained at the 4th layer through the softmax function:

v = softmax(W_4 h_3 + b_4)   (11)
when training the network in this way, a fixed number of samples is randomly drawn from the training sample set each time; the features of the drawn samples are fed into the network, and the network output is obtained by forward computation;
Step 2.2: from the network output and the sample labels, the gradient of each layer's parameters is computed with the BP algorithm, and the model parameters are then updated:
The DBN network training criterion is the minimum cross-entropy function:

J_CE = −(1/M) Σ_{m=1}^{M} Σ_i t_i^{(m)} log v_i^{(m)}   (12)

wherein t^{(m)} is the label vector of sample m;
Next, the network parameters are fine-tuned with the back-propagation algorithm, and the model parameters are updated according to:

W_t^l = W_{t−1}^l − μ ΔW_t^l   (13)

b_t^l = b_{t−1}^l − μ Δb_t^l   (14)

wherein W_t^l and b_t^l are respectively the weight matrix and bias vector of layer l after t parameter updates, μ is the learning rate, and ΔW_t^l and Δb_t^l are the gradients of the weight matrix and bias vector, obtained by computing the first derivative over the training samples:

ΔW_t^l = (1/M) Σ_{m=1}^{M} ∂J_m/∂W_{t−1}^l   (15)

Δb_t^l = (1/M) Σ_{m=1}^{M} ∂J_m/∂b_{t−1}^l   (16)

wherein M is the number of samples used in each training step;
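A compact numpy sketch of one fine-tuning step of equations (12)-(16) follows, collapsing the network to a single tanh hidden layer plus a softmax output to keep the example short; all sizes, the tanh activation and the function name are assumptions.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def finetune_step(X, T, W1, b1, W4, b4, mu):
    """X: (M, d_in) minibatch; T: (M, m) one-hot labels.
    Forward pass (eqs. (9)-(11)), mean cross-entropy (eq. (12)),
    and gradient update (eqs. (13)-(16))."""
    H = np.tanh(X @ W1 + b1)              # hidden layer, eq. (10)
    V = softmax(H @ W4 + b4)              # output layer, eq. (11)
    M = X.shape[0]
    dZ = (V - T) / M                      # d(J_CE)/d(logits)
    dW4, db4 = H.T @ dZ, dZ.sum(axis=0)
    dH = (dZ @ W4.T) * (1.0 - H ** 2)     # backprop through tanh
    dW1, db1 = X.T @ dH, dH.sum(axis=0)
    # eqs. (13)-(14): W_t = W_{t-1} - mu * dW
    return W1 - mu * dW1, b1 - mu * db1, W4 - mu * dW4, b4 - mu * db4

rng = np.random.default_rng(0)
X = rng.standard_normal((32, 123))
T = np.eye(48)[rng.integers(0, 48, 32)]
W1, b1 = 0.01 * rng.standard_normal((123, 256)), np.zeros(256)
W4, b4 = 0.01 * rng.standard_normal((256, 48)), np.zeros(48)
W1, b1, W4, b4 = finetune_step(X, T, W1, b1, W4, b4, mu=0.1)
```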
Step 2.3: Steps 2.1 and 2.2 are carried out cyclically, and the error between the network output and the target output, compared against a preset threshold, serves as the criterion for ending training:
First, during training, a development set is used for cross-validation by computing the error between the network output and the reference output;
Then, on this basis, the state units, transfer matrix and bias vector of the final output layer are all removed from the trained network; the remainder serves as the front end of the coding network, with the state of the bottleneck layer of the feature extraction network as its output; taking the original speech feature parameters as network input, the bottleneck-layer output vector is computed layer by layer through the network; this vector constitutes the bottleneck feature parameters, completing bottleneck feature extraction.
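Once fine-tuning ends, extracting bottleneck features reduces to a truncated forward pass, sketched below; the layer sizes (123-1024-42-1024) and the random stand-in weights are assumptions, and `extract_bottleneck` is an illustrative name.

```python
import numpy as np

rng = np.random.default_rng(0)
sizes = [123, 1024, 42, 1024]            # input, hidden 1, bottleneck, hidden 3
Ws = [0.01 * rng.standard_normal((i, o)) for i, o in zip(sizes, sizes[1:])]
bs = [np.zeros(o) for o in sizes[1:]]

def extract_bottleneck(x):
    """Forward pass stopping at the (narrow) bottleneck layer; the
    softmax output layer has already been discarded."""
    h = x
    for W, b in zip(Ws[:2], bs[:2]):     # layers up to and including bottleneck
        h = np.tanh(h @ W + b)
    return h                             # 42-dim bottleneck feature

v = extract_bottleneck(rng.standard_normal(123))
assert v.shape == (42,)
```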
6. The method for building a speech recognition model based on bottleneck features and a multi-scale multi-head attention mechanism according to claim 5, wherein: the multi-scale multi-head attention model consists of an encoding network, a decoding network and an attention network;
the coding network and the decoding network both comprise recurrent neural network units;
the front end of the coding network is the DBN feature extraction network, and its back end is a deep recurrent neural network;
the main body of the attention network is a multilayer perceptron; the network inputs are the output of the coding network and the hidden-layer unit state of the decoding network, and the network output is the relevance score between the two; the attention coefficients are obtained after exponential (softmax) normalization of the relevance scores;
the decoding network is formed by connecting a single-layer recurrent neural network with a maxout network; attention coefficients obtained from the attention network are used to compute a weighted sum of all coding-network outputs, yielding the target vectors; finally, the target vectors serve as network input to compute the posterior probability of each phoneme at each position of the output sequence, as sketched below.
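For one attention head at one output position, the MLP scoring, softmax normalization and weighted sum described in claim 6 can be sketched as follows; the additive (tanh) scoring form and all dimensions are assumptions of the example.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attend(H, s, Wh, Ws, w):
    """H: (T, d_h) coding-network outputs; s: (d_s,) decoding-network
    hidden state. Returns the target (context) vector and the
    attention coefficients for one head at one output position."""
    scores = np.tanh(H @ Wh + s @ Ws) @ w   # relevance scores
    alpha = softmax(scores)                  # attention coefficients
    return alpha @ H, alpha                  # weighted sum of encodings

T_len, d_h, d_s, d_a = 50, 256, 256, 128
rng = np.random.default_rng(0)
ctx, alpha = attend(rng.standard_normal((T_len, d_h)),
                    rng.standard_normal(d_s),
                    0.1 * rng.standard_normal((d_h, d_a)),
                    0.1 * rng.standard_normal((d_s, d_a)),
                    0.1 * rng.standard_normal(d_a))
assert ctx.shape == (d_h,) and abs(alpha.sum() - 1.0) < 1e-6
```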
7. The method of claim 6, wherein: in the third step, the high-level feature sequence (h_1, h_2, ..., h_T) is obtained through the front end and back end of the coding network of the multi-scale multi-head attention model; the specific construction process is:
Step 3.1: first, the trained DBN bottleneck features (v_1, v_2, ..., v_n) from the front end of the coding network are taken as the input of the recurrent neural network; the number of stacked recurrent-network layers is then reduced, retaining only a 1-layer network to process the temporal information of the feature sequence and output the high-level feature sequence (h_1, h_2, ..., h_T);
Step 3.2: at time t at the back end of the coding network, the hidden-layer states of the forward and backward networks in layer c are →h_t^c and ←h_t^c respectively; since the forward and backward networks pass hidden-unit information in opposite directions, the corresponding transfer functions are given by equations (17) and (18):

→h_t^c = f(→h_{t−1}^c, h_t^{c−1})   (17)

←h_t^c = f(←h_{t+1}^c, h_t^{c−1})   (18)
Step 3.3: in the forward network, the GRU is used as the basic unit, with hidden-layer states computed as in equations (17) and (18); inputting the feature sequence (x_1, x_2, ..., x_T) yields the layer-1 hidden states →h_t^1; similarly, the layer-(c−1) hidden states →h_t^{c−1} can be computed, and the layer-c hidden states are given by equation (19):

→h_t^c = GRU(→h_{t−1}^c, h_t^{c−1})   (19)
similarly, the layer-c hidden states of the backward network are computed;
Step 3.4: the layer-c hidden states of the forward and backward networks are concatenated to obtain the high-level feature h_t of the coding network at time t:

h_t = [→h_t^c ; ←h_t^c]   (20)
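A minimal numpy sketch of one bidirectional layer of equations (17)-(20) follows: a forward and a backward GRU pass whose layer-c states are concatenated as in equation (20). The GRU gating below follows the standard formulation without biases, and all sizes and names are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class GRUCell:
    """Standard GRU gating (biases omitted for brevity)."""
    def __init__(self, d_in, d_h, rng):
        s = 1.0 / np.sqrt(d_h)
        self.Wz, self.Wr, self.Wn = (rng.uniform(-s, s, (d_h, d_in + d_h))
                                     for _ in range(3))

    def step(self, x, h):
        xh = np.concatenate([x, h])
        z = sigmoid(self.Wz @ xh)                        # update gate
        r = sigmoid(self.Wr @ xh)                        # reset gate
        n = np.tanh(self.Wn @ np.concatenate([x, r * h]))
        return (1.0 - z) * h + z * n

def bidirectional_layer(X, fw, bw, d_h):
    """X: (T, d_in) layer-(c-1) states; returns (T, 2*d_h) features,
    the concatenation of forward and backward states (eq. (20))."""
    T = len(X)
    hf, hb = np.zeros(d_h), np.zeros(d_h)
    Hf, Hb = np.zeros((T, d_h)), np.zeros((T, d_h))
    for t in range(T):                                   # eq. (17): forward pass
        hf = fw.step(X[t], hf)
        Hf[t] = hf
    for t in reversed(range(T)):                         # eq. (18): backward pass
        hb = bw.step(X[t], hb)
        Hb[t] = hb
    return np.concatenate([Hf, Hb], axis=1)

rng = np.random.default_rng(0)
fw, bw = GRUCell(42, 64, rng), GRUCell(42, 64, rng)
H = bidirectional_layer(rng.standard_normal((100, 42)), fw, bw, 64)
assert H.shape == (100, 128)
```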
CN201910477034.2A 2019-06-03 2019-06-03 Method for establishing voice recognition model based on bottleneck characteristics and multi-scale multi-head attention mechanism Active CN110211574B (en)
