CN112509564A - End-to-end voice recognition method based on connection time sequence classification and self-attention mechanism - Google Patents

End-to-end voice recognition method based on connection time sequence classification and self-attention mechanism Download PDF

Info

Publication number
CN112509564A
CN112509564A CN202011101902.6A
Authority
CN
China
Prior art keywords
voice
encoder
training
decoder
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011101902.6A
Other languages
Chinese (zh)
Other versions
CN112509564B (en)
Inventor
庞伟
王亮
陆生礼
狄敏
姚志强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southeast University Wuxi Institute Of Integrated Circuit Technology
Jiangsu Province Nanjing University Of Science And Technology Electronic Information Technology Co ltd
Original Assignee
Southeast University Wuxi Institute Of Integrated Circuit Technology
Jiangsu Province Nanjing University Of Science And Technology Electronic Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southeast University Wuxi Institute Of Integrated Circuit Technology, Jiangsu Province Nanjing University Of Science And Technology Electronic Information Technology Co ltd filed Critical Southeast University Wuxi Institute Of Integrated Circuit Technology
Priority to CN202011101902.6A priority Critical patent/CN112509564B/en
Publication of CN112509564A publication Critical patent/CN112509564A/en
Application granted granted Critical
Publication of CN112509564B publication Critical patent/CN112509564B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/08 - Speech classification or search
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/02 - Feature extraction for speech recognition; Selection of recognition unit
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 - Training
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/08 - Speech classification or search
    • G10L15/18 - Speech classification or search using natural language modelling
    • G10L15/183 - Speech classification or search using natural language modelling using context dependencies, e.g. language models

Abstract

The invention discloses an end-to-end speech recognition method based on connection time sequence classification (connectionist temporal classification, CTC) and a self-attention mechanism. The method shares a single encoder network: the output of the encoder is trained with the CTC criterion and is also used as the input of the decoder, realizing the attention relationship between the encoder and the decoder; the decoder is trained with the cross-entropy criterion; finally, the two training criteria are combined by assigning them different weights. The method accelerates model convergence, obtains more accurate alignment, captures the internal relationships within the input, and improves the accuracy and robustness of the speech recognition system.

Description

End-to-end voice recognition method based on connection time sequence classification and self-attention mechanism
Technical Field
The invention discloses an end-to-end speech recognition method based on connection time sequence classification and a self-attention mechanism, relates to speech recognition technology, and belongs to the technical field of computing, calculating and counting.
Background
In recent years, with the improvement of computing power, the accumulation of data and progress in algorithms, deep learning has gradually been replacing traditional probability-based machine learning research, and fields such as computer vision, natural language processing and computer audition have become the most important research focuses in artificial intelligence. The development of speech recognition technology has benefited from the dividend brought by the rapid development of deep learning. Since the Deep Neural Network (DNN) replaced the Gaussian Mixture Model (GMM) for modelling the observation probability of speech in 2011, speech recognition has succeeded in large-vocabulary continuous speech recognition, and the recognition accuracy has made its biggest breakthrough of the past 10 years. To date, many techniques have been developed in the search for better speech recognition, so that recognition accuracy now exceeds the human level. Behind this result, however, lie a number of complex techniques deployed on the server side, which occupy a large amount of storage space and computational resources and also consume a large amount of energy.
Existing speech recognition methods mainly use a Long Short-Term Memory network (LSTM) to model speech. The drawbacks of this approach are that training cannot be parallelized, since the next input can only be processed after the current input has been handled, so the training time is too long, and problems such as vanishing or dispersing gradients easily arise. Moreover, when the input sequence is long, only information on the order of 100 steps can be remembered, not information or sequences on the order of 1000 steps. The biggest drawback is the particularly high hardware requirement: the computation is bound by memory bandwidth, which is a nightmare for hardware designers and ultimately limits the applicability of the solution.
Disclosure of Invention
The purpose of the invention is as follows: in order to overcome the defects in the prior art, the invention provides an end-to-end voice recognition method based on connection time sequence classification and a self-attention mechanism.
The technical scheme is as follows: in order to achieve the purpose, the invention adopts the technical scheme that:
An end-to-end speech recognition method based on connection time sequence classification and a self-attention mechanism adopts an encoder-decoder structure as its main framework. The encoder-decoder structure comprises an encoder sub-network and a decoder sub-network; both tasks share the same encoder network, and after the encoder network has been trained, a speech recognition decoder directly outputs the corresponding decoded sequence. Meanwhile, the high-level abstract features extracted by the encoder are used as part of the input of the decoder and trained jointly with it. The two sub-networks use different training criteria; different weights are then given to the outputs of the loss functions, and the model parameters are updated to different degrees through the joint optimization of the two sub-networks.
The encoder-decoder structure is implemented with a Transformer architecture, which models the input speech and the input labels in depth by stacking multiple Transformer layers and mines the relationship between speech and text.
The speech recognition decoder decodes using beam search, and the decoded result, obtained from the CTC output, corresponds directly to the correct English word or Chinese character sequence. The acoustic model is modelled with large-granularity units such as English words or Chinese characters; the Chinese characters used as output classes are generally the common characters that appear in the training and test sets.
The method specifically comprises the following steps:
step 1, data preparation and feature extraction: collecting voice data to obtain a voice data set; feature extraction is first performed on the speech in the data set with a Mel filter-bank algorithm; the speech is then converted from the time domain to the frequency domain by a discrete Fourier transform so that its characteristics can be observed more easily; the finally extracted features serve as the input of the network;
step 2, training an acoustic model: the encoder sub-network is regarded as an acoustic model; the speech features extracted from the training set are input into the encoder sub-network, the internal relationship between the frames of the speech signal is obtained through a self-attention mechanism, feature classification is performed after a fully connected mapping, and training is finally carried out with the CTC training criterion;
the encoder sub-network comprises a position-encoding layer, a multi-head attention layer, a feed-forward network layer, residual connections and layer normalization, and a CTC loss function finally aligns the input speech frames with the output of the encoder; position information is added to each frame of features before the features are input into the encoder sub-network; after the position information has been added, the features are input into the encoder sub-network, where multi-head attention is computed first in order to obtain multi-level characteristics of the speech signal; a fully connected layer then performs feature mapping, one part serving as input to the decoder and the other part as the output of the encoder; the output of the encoder is followed by one fully connected layer that gives the classification result of the final modelling units, and the number of neurons of this fully connected layer equals the number of modelled Chinese characters plus one, i.e. a blank unit;
step 3, training a language model: the decoder sub-network is regarded as a language model and is trained with the cross-entropy training criterion; the first self-attention layer at the input of the decoder sub-network must be masked, i.e. inputs at later moments cannot be seen while the current moment is being processed; the second attention layer of the decoder sub-network receives the output of the encoder, realizing the attention relationship between the encoder and the decoder; the end of the decoder sub-network uses a softmax output, and the language model is trained with a cross-entropy loss function; the overall loss value is the weighted sum of the respective loss values of the encoder and the decoder, different weights represent different degrees of parameter updating, and the best model is obtained by adjusting the weights;
and step 4, decoding: during training, the model parameters with the smallest loss value are saved as the trained model; decoding uses a beam-search mode, which each time keeps the n most probable of the paths passing through the current node; when the last node is reached, the sequence with the greatest path probability is taken as the final recognition result; during testing, features are extracted from the speech to be tested, which is then input into the trained model for processing and finally sent to the decoder to obtain the final recognition result.
Preferably: the encoder network uses the CTC training criterion and the decoder network uses the cross-entropy training criterion; the two outputs are weighted, and different emphases in updating the model parameters are realized by adjusting the weighting coefficient. The loss function is loss = λ * loss_ctc + (1 - λ) * loss_sa, where loss represents the total loss value of the model, loss_ctc represents the loss value of the CTC training criterion, loss_sa represents the loss value of the cross-entropy training criterion, and λ is the weighting adjustment factor.
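As an illustration only (the patent does not specify an implementation framework; PyTorch, the blank index and the padding index below are assumptions), the weighted objective loss = λ * loss_ctc + (1 - λ) * loss_sa could be computed along the following lines:

    import torch.nn.functional as F

    def joint_loss(enc_log_probs, targets, input_lengths, target_lengths,
                   dec_logits, dec_targets, lam=0.3):
        """Weighted sum of the CTC loss (encoder) and cross-entropy loss (decoder).

        enc_log_probs: (T, batch, classes) log-probabilities from the encoder head.
        dec_logits:    (batch, seq_len, vocab) decoder outputs before softmax.
        lam:           the weighting adjustment factor lambda from the claim.
        """
        loss_ctc = F.ctc_loss(enc_log_probs, targets, input_lengths, target_lengths,
                              blank=0, zero_infinity=True)          # assumed blank index 0
        loss_sa = F.cross_entropy(dec_logits.transpose(1, 2), dec_targets,
                                  ignore_index=-100)                 # assumed padding index
        return lam * loss_ctc + (1.0 - lam) * loss_sa

Setting lam to 1 reduces the objective to pure CTC training, and setting it to 0 leaves only the cross-entropy branch, matching the two degenerate cases described in the detailed description below.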
Preferably: in step 1, extraction is performed in units of speech frames with a frame length of 25 ms and a frame shift of 10 ms.
Preferably: windowing is applied to the speech extracted in step 1, so that the data of adjacent speech frames remain continuous and spectral leakage is prevented.
Preferably: the features extracted in step 1 are 40-dimensional Mel features.
Preferably: data enhancement is performed on the voice data collected in step 1, the data enhancement comprising deletion operations in the time domain and the frequency domain of the speech.
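For illustration only, such deletion in the time and frequency domains can be realized as SpecAugment-style masking; the sketch below assumes PyTorch, and the mask widths are arbitrary values not taken from the patent:

    import torch

    def mask_time_and_frequency(feats, max_time_mask=20, max_freq_mask=8):
        """Zero out a random span of frames and a random band of Mel channels.

        feats: (num_frames, num_mels) features of one utterance.
        The patent only states that deletion is applied in the time and frequency
        domains of the speech; the mask widths here are illustrative assumptions.
        """
        feats = feats.clone()
        num_frames, num_mels = feats.shape

        t = torch.randint(0, max_time_mask + 1, (1,)).item()
        t0 = torch.randint(0, max(1, num_frames - t), (1,)).item()
        feats[t0:t0 + t, :] = 0.0                     # deletion in the time domain

        f = torch.randint(0, max_freq_mask + 1, (1,)).item()
        f0 = torch.randint(0, max(1, num_mels - f), (1,)).item()
        feats[:, f0:f0 + f] = 0.0                     # deletion in the frequency domain
        return feats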
A computer-readable storage medium, on which a computer program is stored which, when executed by a processor, implements an end-to-end speech recognition method based on connection timing classification and a self-attention mechanism.
A terminal device, comprising: a memory, a processor and a computer program stored on the memory and runnable on the processor, wherein the processor, when executing the program, implements an end-to-end speech recognition method based on connection timing classification and a self-attention mechanism.
Compared with the prior art, the invention has the following beneficial effects:
(1) the invention provides an end-to-end voice recognition method based on connection time sequence classification and a self-attention mechanism.
(2) The invention uses two training criteria during training but only one decoding mode during decoding, thereby improving the accuracy of speech recognition without increasing the decoding burden.
Drawings
Fig. 1 is a schematic structural view of the present invention.
FIG. 2 is a flow chart of speech recognition according to the present invention.
Detailed Description
The present invention is further illustrated by the following description in conjunction with the accompanying drawings and specific embodiments. It is to be understood that these examples are given solely for the purpose of illustration and are not intended to limit the scope of the invention; various equivalent modifications will occur to those skilled in the art upon reading the present invention, and these fall within the scope of the appended claims.
An end-to-end speech recognition method based on connection time sequence classification and a self-attention mechanism uses a hybrid mechanism of connectionist temporal classification (CTC) and self-attention (SA) to model English words or Chinese characters directly; no pre-processing or post-processing is needed, and the output directly corresponds to the correct English word sequence or Chinese character sequence. The method shares the same encoder network: the output of the encoder is trained with the CTC criterion and is also used as input to the decoder, realizing the attention relationship between the encoder and the decoder; the decoder is trained with the cross-entropy criterion; finally, the two training criteria are given different weights, which represent how strongly each criterion updates the network parameters during back-propagation. When the weight is 0 or 1, the system can be regarded as two entirely different network models. By controlling the weight, the invention adjusts the modelling capacity of the model autonomously, while the pre-alignment property of CTC accelerates convergence and reduces the training time of the model. The model structure is shown in Fig. 1 and comprises an acoustic model, a language model and a decoder. The recognition method comprises the following four steps:
Step 1, data preparation and feature extraction: a publicly available speech data set, the Aishell-1 corpus, is used and divided into a training set, a validation set and a test set; a feature extraction algorithm, specifically a Mel filter-bank algorithm, then extracts representative features from every utterance. Because speech is stationary over short intervals, extraction is performed in units of speech frames with a frame length of 25 ms and a frame shift of 10 ms. Windowing is applied to the extracted speech so that the data of adjacent frames remain continuous and spectral leakage is prevented. The speech is then transformed from the time domain to the frequency domain with a discrete Fourier transform so that its characteristics can be observed better. Since the perception of sound by the human ear follows the Mel-frequency scale, that is, a perceived doubling of a sound corresponds to a doubling on the Mel frequency axis, feature extraction is carried out on each frame of the speech signal with a Mel-frequency filter bank. The finally extracted features are 40-dimensional Mel features, which are used as the input of the network.
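A possible realization of this front end, sketched here with the librosa library (the toolkit, FFT size, window type and sampling rate are assumptions; the patent only fixes the 25 ms / 10 ms framing and the 40 Mel dimensions):

    import librosa
    import numpy as np

    def extract_fbank(wav_path, n_mels=40):
        """Extract 40-dimensional log Mel filter-bank features.

        Frames are 25 ms long with a 10 ms shift; a Hann window keeps adjacent
        frames continuous and limits spectral leakage, and the DFT inside the
        Mel-spectrogram computation moves each frame to the frequency domain.
        """
        y, sr = librosa.load(wav_path, sr=16000)                # assumed 16 kHz sampling rate
        win_length = int(0.025 * sr)                            # 25 ms frame length
        hop_length = int(0.010 * sr)                            # 10 ms frame shift
        mel = librosa.feature.melspectrogram(
            y=y, sr=sr, n_fft=512, win_length=win_length,
            hop_length=hop_length, n_mels=n_mels, window="hann")
        return librosa.power_to_db(mel).T.astype(np.float32)    # (num_frames, n_mels)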
Step 2, training the acoustic model. As shown in Fig. 1, the encoder sub-network is regarded as the acoustic model: the extracted speech features are input into the encoder sub-network, the internal relationship between the frames of the speech signal is obtained through the self-attention mechanism, feature classification is performed after a fully connected mapping, and training finally uses the CTC training criterion. Because the CTC criterion does not require the input speech to be aligned with the output sequence in advance, a large amount of manpower and material resources are saved and development efficiency is improved.
The input speech features have undergone data enhancement, which includes deletion operations in the time domain and the frequency domain of the speech. Before entering the network, a position encoding must be superimposed on the feature dimension to indicate the relative position of each speech frame; the frames are then processed inside the network, and the output dimension of the encoder is 6308, representing the number of Chinese character classes in the final output.
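The exact form of the position encoding is not spelled out in the patent; a common choice, shown here purely as an assumption, is the sinusoidal encoding of the original Transformer (the sketch assumes PyTorch and an even feature dimension):

    import math
    import torch

    def add_positional_encoding(features):
        """Superimpose sinusoidal position information on each speech frame.

        features: (batch, num_frames, feat_dim) tensor; feat_dim is assumed even.
        The patent only states that a position encoding is superimposed on the
        feature dimension; the sinusoidal form here is an assumption.
        """
        _, num_frames, dim = features.shape
        position = torch.arange(num_frames, dtype=torch.float32).unsqueeze(1)        # (T, 1)
        div_term = torch.exp(torch.arange(0, dim, 2, dtype=torch.float32)
                             * (-math.log(10000.0) / dim))                           # (dim/2,)
        pe = torch.zeros(num_frames, dim)
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        return features + pe.unsqueeze(0).to(features.device)                        # broadcast over batch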
The encoder structure specifically comprises a position-encoding layer, a multi-head attention layer, a feed-forward network layer, two residual connections and layer normalization; finally, the CTC loss function aligns the input speech frames with the output of the encoder, which accelerates the overall training. Position information must be added to each frame feature before it is input into the network, because the attention mechanism relates features across the whole utterance and does not itself retain position information when computing the relationships between input features. After the position information has been added, the features can be input into the network; multi-head attention is computed first, similar in spirit to the multiple kernels of a convolution, in order to obtain multi-level characteristics of the speech signal. Residual connections are added to avoid vanishing gradients when the parameters are updated during back-propagation, and layer normalization keeps the data within a suitable range as it flows through the different layers of the network. A fully connected layer is then used for feature mapping, one part serving as input to the decoder and the other as the output of the encoder. The output of the encoder is followed by one fully connected layer that gives the classification result of the final modelling units, and the number of neurons of this layer equals the number of modelled Chinese characters plus one, i.e. a blank unit.
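For illustration, one encoder layer of this kind can be assembled from standard PyTorch modules; the model width, number of heads and feed-forward size below are assumptions, while the 6308-way output and the CTC loss follow the embodiment:

    import torch.nn as nn

    class SpeechEncoderLayer(nn.Module):
        """Multi-head self-attention and feed-forward block with two residual
        connections and layer normalization, as described for the encoder."""

        def __init__(self, d_model=256, n_heads=4, d_ff=1024):
            super().__init__()
            self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                    nn.Linear(d_ff, d_model))
            self.norm1 = nn.LayerNorm(d_model)
            self.norm2 = nn.LayerNorm(d_model)

        def forward(self, x):
            attn_out, _ = self.attn(x, x, x)     # relate every speech frame to every other frame
            x = self.norm1(x + attn_out)         # residual connection 1
            x = self.norm2(x + self.ff(x))       # residual connection 2
            return x

    # Encoder classification head: a fully connected layer whose width equals the
    # number of modelled Chinese characters plus one blank unit (6308 in the
    # embodiment), trained with the CTC loss.
    ctc_head = nn.Linear(256, 6308)
    ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)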
Step 3, training the language model. As shown in Fig. 1, the decoder sub-network is regarded as the language model and is trained with the cross-entropy training criterion. The decoder sub-network comprises a masked multi-head attention layer, a feed-forward network layer, three residual connections and layer normalization; its structure and processing flow are similar to those of the encoder sub-network. The input of the decoder is the text sequence corresponding to the current training utterance. It enters the decoder sub-network through an embedding, the text features likewise acquire the internal relationships between the tokens through the self-attention mechanism, and finally feature fusion and classification are performed through a fully connected layer with a softmax activation, trained according to the cross-entropy training criterion.
The input is the superposition of the embedding and the position encoding of the text sequence. The special point of this step is that, when the current moment is being decoded, the inputs at later moments must be covered to prevent data leakage and inaccurate recognition results.
The decoder sub-network differs from the encoder sub-network in two points. First, the first self-attention layer at the decoder input must be masked, i.e. inputs at later moments cannot be seen while the current moment is being processed, which prevents data leakage from degrading the model. Second, the second attention layer of the decoder sub-network receives the output of the encoder, realizing the attention relationship between the encoder and the decoder. The end of the network uses a softmax output, and the network model is trained with the cross-entropy loss function. The overall loss value of the model is the weighted sum of the loss values of the encoder and the decoder; different weights represent different degrees of parameter updating, and the best model is obtained by adjusting the weights.
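The two attention layers of the decoder can be sketched in the same illustrative style (again PyTorch, with assumed hyper-parameters); the boolean causal mask implements the covering of later moments described above:

    import torch
    import torch.nn as nn

    class SpeechDecoderLayer(nn.Module):
        """Masked self-attention over the text, attention to the encoder output,
        and a feed-forward block, each followed by residual addition and layer norm."""

        def __init__(self, d_model=256, n_heads=4, d_ff=1024):
            super().__init__()
            self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                    nn.Linear(d_ff, d_model))
            self.norm1 = nn.LayerNorm(d_model)
            self.norm2 = nn.LayerNorm(d_model)
            self.norm3 = nn.LayerNorm(d_model)

        def forward(self, text, enc_out):
            seq_len = text.size(1)
            # Boolean causal mask: True entries are positions a step may NOT attend to,
            # so inputs at later moments stay hidden from the current moment.
            causal = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool,
                                           device=text.device), diagonal=1)
            sa, _ = self.self_attn(text, text, text, attn_mask=causal)
            x = self.norm1(text + sa)
            ca, _ = self.cross_attn(x, enc_out, enc_out)   # attention between decoder and encoder
            x = self.norm2(x + ca)
            return self.norm3(x + self.ff(x))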
Step 4, decoding. Through the joint optimization of the two criteria, the model parameters with the smallest loss are obtained, the test-set data are decoded with beam search, and the recognition result is output; accordingly, the model parameters with the smallest loss value are saved during training. Decoding uses beam search, which each time keeps the n most probable of the paths passing through the current node, so that correct paths that temporarily have a low probability are not lost. When the last node is reached, the sequence with the greatest path probability is taken as the final recognition result. During testing, features are extracted from the speech, input into the network for processing, and finally sent to the decoder to obtain the final recognition result.
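A simplified beam-search illustration over given per-step log-probabilities (in the actual decoder each step's distribution depends on the previously decoded tokens; this sketch only shows how the n best partial paths are kept and the most probable sequence is returned):

    import torch

    def beam_search(step_log_probs, beam_width=5):
        """Keep the beam_width most probable partial paths at every step.

        step_log_probs: (num_steps, vocab_size) log-probabilities per decoding step,
        assumed to be given; the sequence with the largest total log-probability
        among the surviving paths is returned at the end.
        """
        beams = [([], 0.0)]                                   # (token sequence, log-probability)
        for step in step_log_probs:
            candidates = []
            for seq, score in beams:
                top_p, top_idx = step.topk(beam_width)        # n best extensions of this path
                for p, idx in zip(top_p.tolist(), top_idx.tolist()):
                    candidates.append((seq + [idx], score + p))
            candidates.sort(key=lambda c: c[1], reverse=True)
            beams = candidates[:beam_width]                   # keep only the n best paths
        return beams[0][0]                                    # sequence with the maximal probability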
A computer-readable storage medium, on which a computer program is stored which, when executed by a processor, implements an end-to-end speech recognition method based on connection timing classification and a self-attention mechanism.
A terminal device, comprising: a memory, a processor and a computer program stored on the memory and runnable on the processor, wherein the processor, when executing the program, implements an end-to-end speech recognition method based on connection timing classification and a self-attention mechanism.
In summary, the present invention provides an end-to-end speech recognition method based on connection time sequence classification and a self-attention mechanism. It uses the conditional-independence assumption of CTC to align speech and text quickly, thereby speeding up convergence, and uses SA to capture global associations, solving the long-distance dependency problem. Without increasing the complexity of the model, a recognition result better than that of either single model is obtained by adjusting the weighting factor and combining the advantages of both.
The invention combines the mature CTC technique with the SA-based Transformer model, which not only accelerates the training of the model but also captures the relationships between all frames of the speech, improving the stability and recognition rate of the speech recognition system.
The above description covers only the preferred embodiments of the present invention. It should be noted that various modifications and adaptations can be made by those skilled in the art without departing from the principles of the invention, and these also fall within the scope of protection of the invention.

Claims (8)

1. An end-to-end speech recognition method based on connection timing classification and a self-attention mechanism is characterized by comprising the following steps:
step 1, data preparation and feature extraction: collecting voice data to obtain a voice data set; feature extraction is first performed on the speech in the data set with a Mel filter-bank algorithm; the speech is then converted from the time domain to the frequency domain by a discrete Fourier transform so that its characteristics can be observed more easily; the finally extracted features serve as the input of the network;
step 2, training an acoustic model: the encoder sub-network is regarded as an acoustic model; the speech features extracted from the training set are input into the encoder sub-network, the internal relationship between the frames of the speech signal is obtained through a self-attention mechanism, feature classification is performed after a fully connected mapping, and training is finally carried out with the CTC training criterion;
the encoder sub-network comprises a position-encoding layer, a multi-head attention layer, a feed-forward network layer, residual connections and layer normalization, and a CTC loss function finally aligns the input speech frames with the output of the encoder; position information is added to each frame of features before the features are input into the encoder sub-network; after the position information has been added, the features are input into the encoder sub-network, where multi-head attention is computed first in order to obtain multi-level characteristics of the speech signal; a fully connected layer then performs feature mapping, one part serving as input to the decoder and the other part as the output of the encoder; the output of the encoder is followed by one fully connected layer that gives the classification result of the final modelling units, and the number of neurons of this fully connected layer equals the number of modelled Chinese characters plus one, i.e. a blank unit;
step 3, training a language model: the decoder sub-network is regarded as a language model and is trained with the cross-entropy training criterion; the first self-attention layer at the input of the decoder sub-network must be masked, i.e. inputs at later moments cannot be seen while the current moment is being processed; the second attention layer of the decoder sub-network receives the output of the encoder, realizing the attention relationship between the encoder and the decoder; the end of the decoder sub-network uses a softmax output, and the language model is trained with a cross-entropy loss function; the overall loss value is the weighted sum of the respective loss values of the encoder and the decoder, different weights represent different degrees of parameter updating, and the best model is obtained by adjusting the weights;
and step 4, decoding: during training, the model parameters with the smallest loss value are saved as the trained model; decoding uses a beam-search mode, which each time keeps the n most probable of the paths passing through the current node; when the last node is reached, the sequence with the greatest path probability is taken as the final recognition result; during testing, features are extracted from the speech to be tested, which is then input into the trained model for processing and finally sent to the decoder to obtain the final recognition result.
2. The end-to-end speech recognition method based on connection timing classification and a self-attention mechanism of claim 1, wherein: the loss function is loss = λ * loss_ctc + (1 - λ) * loss_sa, where loss represents the total loss value of the model, loss_ctc represents the loss value of the CTC training criterion, loss_sa represents the loss value of the cross-entropy training criterion, and λ is the weighting adjustment factor.
3. The method of claim 2, wherein: in step 1, extraction is performed in units of speech frames with a frame length of 25 ms and a frame shift of 10 ms.
4. The method of claim 3, wherein: windowing is applied to the speech extracted in step 1, so that the data of adjacent speech frames remain continuous and spectral leakage is prevented.
5. The method of claim 4, wherein: the features extracted in step 1 are 40-dimensional Mel features.
6. The method of claim 5, wherein: data enhancement is performed on the voice data collected in step 1, the data enhancement comprising deletion operations in the time domain and the frequency domain of the speech.
7. A computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, carries out the speech recognition method of claim 1.
8. A terminal device, comprising: a memory, a processor and a computer program stored on the memory and runnable on the processor, wherein the processor, when executing the program, implements the speech recognition method of claim 1.
CN202011101902.6A 2020-10-15 2020-10-15 End-to-end voice recognition method based on connection time sequence classification and self-attention mechanism Active CN112509564B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011101902.6A CN112509564B (en) 2020-10-15 2020-10-15 End-to-end voice recognition method based on connection time sequence classification and self-attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011101902.6A CN112509564B (en) 2020-10-15 2020-10-15 End-to-end voice recognition method based on connection time sequence classification and self-attention mechanism

Publications (2)

Publication Number Publication Date
CN112509564A true CN112509564A (en) 2021-03-16
CN112509564B CN112509564B (en) 2024-04-02

Family

ID=74953853

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011101902.6A Active CN112509564B (en) 2020-10-15 2020-10-15 End-to-end voice recognition method based on connection time sequence classification and self-attention mechanism

Country Status (1)

Country Link
CN (1) CN112509564B (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101471071A (en) * 2007-12-26 2009-07-01 中国科学院自动化研究所 Speech synthesis system based on mixed hidden Markov model
US20200135174A1 (en) * 2018-10-24 2020-04-30 Tencent America LLC Multi-task training architecture and strategy for attention-based speech recognition system
CN110070879A (en) * 2019-05-13 2019-07-30 吴小军 A method of intelligent expression and phonoreception game are made based on change of voice technology

Cited By (31)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112767926B (en) * 2021-04-09 2021-06-25 北京世纪好未来教育科技有限公司 End-to-end speech recognition two-pass decoding method and device
CN112767926A (en) * 2021-04-09 2021-05-07 北京世纪好未来教育科技有限公司 End-to-end speech recognition two-pass decoding method and device
CN112863489A (en) * 2021-04-26 2021-05-28 腾讯科技(深圳)有限公司 Speech recognition method, apparatus, device and medium
CN112863489B (en) * 2021-04-26 2021-07-27 腾讯科技(深圳)有限公司 Speech recognition method, apparatus, device and medium
CN113241075A (en) * 2021-05-06 2021-08-10 西北工业大学 Transformer end-to-end speech recognition method based on residual Gaussian self-attention
CN113436616A (en) * 2021-05-28 2021-09-24 中国科学院声学研究所 Multi-field self-adaptive end-to-end voice recognition method, system and electronic device
CN113257280A (en) * 2021-06-07 2021-08-13 苏州大学 Speech emotion recognition method based on wav2vec
CN113257239A (en) * 2021-06-15 2021-08-13 深圳市北科瑞声科技股份有限公司 Voice recognition method and device, electronic equipment and storage medium
CN113409772A (en) * 2021-06-15 2021-09-17 西北工业大学 Encoder and end-to-end voice recognition system based on local generation type attention mechanism and adopting same
CN113257239B (en) * 2021-06-15 2021-10-08 深圳市北科瑞声科技股份有限公司 Voice recognition method and device, electronic equipment and storage medium
CN113257248A (en) * 2021-06-18 2021-08-13 中国科学院自动化研究所 Streaming and non-streaming mixed voice recognition system and streaming voice recognition method
CN113488028A (en) * 2021-06-23 2021-10-08 中科极限元(杭州)智能科技股份有限公司 Speech transcription recognition training decoding method and system based on rapid skip decoding
CN113488028B (en) * 2021-06-23 2024-02-27 中科极限元(杭州)智能科技股份有限公司 Speech transcription recognition training decoding method and system based on fast jump decoding
CN113362812A (en) * 2021-06-30 2021-09-07 北京搜狗科技发展有限公司 Voice recognition method and device and electronic equipment
CN113362812B (en) * 2021-06-30 2024-02-13 北京搜狗科技发展有限公司 Voice recognition method and device and electronic equipment
CN113808573A (en) * 2021-08-06 2021-12-17 华南理工大学 Dialect classification method and system based on mixed domain attention and time sequence self-attention
CN113808573B (en) * 2021-08-06 2023-11-07 华南理工大学 Dialect classification method and system based on mixed domain attention and time sequence self-attention
CN113762356B (en) * 2021-08-17 2023-06-16 中山大学 Cluster load prediction method and system based on clustering and attention mechanism
CN113762356A (en) * 2021-08-17 2021-12-07 中山大学 Cluster load prediction method and system based on clustering and attention mechanism
CN113782007A (en) * 2021-09-07 2021-12-10 上海企创信息科技有限公司 Voice recognition method and device, voice recognition equipment and storage medium
CN113688822A (en) * 2021-09-07 2021-11-23 河南工业大学 Time sequence attention mechanism scene image identification method
CN113782029A (en) * 2021-09-22 2021-12-10 广东电网有限责任公司 Training method, device and equipment of speech recognition model and storage medium
CN113782029B (en) * 2021-09-22 2023-10-27 广东电网有限责任公司 Training method, device, equipment and storage medium of voice recognition model
CN113707136A (en) * 2021-10-28 2021-11-26 南京南大电子智慧型服务机器人研究院有限公司 Audio and video mixed voice front-end processing method for voice interaction of service robot
CN113936641B (en) * 2021-12-17 2022-03-25 中国科学院自动化研究所 Customizable end-to-end system for Chinese-English mixed speech recognition
CN113936641A (en) * 2021-12-17 2022-01-14 中国科学院自动化研究所 Customizable end-to-end system for Chinese-English mixed speech recognition
CN113990296B (en) * 2021-12-24 2022-05-27 深圳市友杰智新科技有限公司 Training method and post-processing method of voice acoustic model and related equipment
CN113990296A (en) * 2021-12-24 2022-01-28 深圳市友杰智新科技有限公司 Training method and post-processing method of voice acoustic model and related equipment
CN114596839A (en) * 2022-03-03 2022-06-07 网络通信与安全紫金山实验室 End-to-end voice recognition method, system and storage medium
CN116781417A (en) * 2023-08-15 2023-09-19 北京中电慧声科技有限公司 Anti-cracking voice interaction method and system based on voice recognition
CN116781417B (en) * 2023-08-15 2023-11-17 北京中电慧声科技有限公司 Anti-cracking voice interaction method and system based on voice recognition

Also Published As

Publication number Publication date
CN112509564B (en) 2024-04-02

Similar Documents

Publication Publication Date Title
CN112509564A (en) End-to-end voice recognition method based on connection time sequence classification and self-attention mechanism
US11062699B2 (en) Speech recognition with trained GMM-HMM and LSTM models
WO2018227781A1 (en) Voice recognition method, apparatus, computer device, and storage medium
CN110444208A (en) A kind of speech recognition attack defense method and device based on gradient estimation and CTC algorithm
CN109767759A (en) End-to-end speech recognition methods based on modified CLDNN structure
CN110189749A (en) Voice keyword automatic identifying method
WO2018166316A1 (en) Speaker's flu symptoms recognition method fused with multiple end-to-end neural network structures
CN111899757B (en) Single-channel voice separation method and system for target speaker extraction
CN103345923A (en) Sparse representation based short-voice speaker recognition method
CN108922513A (en) Speech differentiation method, apparatus, computer equipment and storage medium
CN115019776A (en) Voice recognition model, training method thereof, voice recognition method and device
CN109754790A (en) A kind of speech recognition system and method based on mixing acoustic model
CN107039036A (en) A kind of high-quality method for distinguishing speek person based on autocoding depth confidence network
CN117095694B (en) Bird song recognition method based on tag hierarchical structure attribute relationship
CN114141238A (en) Voice enhancement method fusing Transformer and U-net network
CN113488060B (en) Voiceprint recognition method and system based on variation information bottleneck
CN114783418B (en) End-to-end voice recognition method and system based on sparse self-attention mechanism
CN115394287A (en) Mixed language voice recognition method, device, system and storage medium
Gao et al. ToneNet: A CNN Model of Tone Classification of Mandarin Chinese.
CN108806725A (en) Speech differentiation method, apparatus, computer equipment and storage medium
CN114944150A (en) Dual-task-based Conformer land-air communication acoustic model construction method
CN113571095B (en) Speech emotion recognition method and system based on nested deep neural network
Monteiro et al. On the performance of time-pooling strategies for end-to-end spoken language identification
Li et al. Bidirectional LSTM Network with Ordered Neurons for Speech Enhancement.
CN108831486B (en) Speaker recognition method based on DNN and GMM models

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant