CN112509564B - End-to-end voice recognition method based on connection time sequence classification and self-attention mechanism - Google Patents

End-to-end voice recognition method based on connection time sequence classification and self-attention mechanism

Info

Publication number
CN112509564B
CN112509564B (application CN202011101902.6A)
Authority
CN
China
Prior art keywords
voice
encoder
training
self
decoder
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011101902.6A
Other languages
Chinese (zh)
Other versions
CN112509564A (en)
Inventor
庞伟
王亮
陆生礼
狄敏
姚志强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southeast University-Wuxi Institute Of Integrated Circuit Technology
Jiangsu Province Nanjing University Of Science And Technology Electronic Information Technology Co ltd
Original Assignee
Southeast University-Wuxi Institute Of Integrated Circuit Technology
Jiangsu Province Nanjing University Of Science And Technology Electronic Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southeast University-Wuxi Institute Of Integrated Circuit Technology, Jiangsu Province Nanjing University Of Science And Technology Electronic Information Technology Co ltd filed Critical Southeast University-Wuxi Institute Of Integrated Circuit Technology
Priority to CN202011101902.6A priority Critical patent/CN112509564B/en
Publication of CN112509564A publication Critical patent/CN112509564A/en
Application granted granted Critical
Publication of CN112509564B publication Critical patent/CN112509564B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 Training
    • G10L 15/08 Speech classification or search
    • G10L 15/18 Speech classification or search using natural language modelling
    • G10L 15/183 Speech classification or search using natural language modelling using context dependencies, e.g. language models

Abstract

The invention discloses an end-to-end speech recognition method based on connectionist temporal classification (CTC) and a self-attention mechanism (SA). A hybrid CTC/SA mechanism models English words or Chinese characters directly, without preprocessing or post-processing, so that the output corresponds directly to the correct English word sequence or Chinese character sequence. The two branches share the same encoder network: the encoder output is trained with the CTC criterion and is also fed to the decoder, realizing the attention relationship between encoder and decoder; the decoder is trained with the cross-entropy criterion; finally the two training criteria are combined with different weights. The invention not only accelerates model convergence and yields more accurate alignments, but also captures the internal relations among the inputs, thereby improving the accuracy and robustness of the speech recognition system.

Description

End-to-end voice recognition method based on connection time sequence classification and self-attention mechanism
Technical Field
The invention discloses an end-to-end speech recognition method based on connectionist temporal classification and a self-attention mechanism, relates to speech recognition technology, and belongs to the technical field of computing, calculating and counting.
Background
In recent years, with the growth of computing power, the accumulation of data and the improvement of algorithms, deep learning has gradually been replacing research based on traditional machine learning; computer vision, natural language processing, computer audition and related fields have become the most important research hot spots in artificial intelligence, and the development of speech recognition in particular has benefited from the rapid development of deep learning. Since 2011, when the deep neural network (Deep Neural Network, DNN) replaced the Gaussian mixture model (Gaussian Mixture Model, GMM) for modelling the observation probability of speech, speech recognition has achieved success in large-vocabulary continuous speech recognition, the biggest breakthrough in recognition performance of the last ten years. On the road of exploring speech recognition, many techniques have since been developed that push recognition accuracy beyond the human level. The price, however, is that many complex techniques are stacked together and deployed on the server side, consuming large amounts of storage space, computing resources and energy.
Existing speech recognition methods mainly use the Long Short-Term Memory (LSTM) network to model speech. The drawbacks of this approach are that training cannot be parallelized, because the next input can only be processed after the current input has been handled, so training takes too long and gradient vanishing or dispersion occurs easily; moreover, when the input sequence is long, only information on the order of 100 time steps can be remembered, not sequences on the order of 1000 steps. The biggest disadvantage is that the hardware requirements are particularly high: the computation is bound by memory bandwidth, which is a nightmare for the hardware designer and ultimately limits the applicability of the solution.
Disclosure of Invention
The invention aims to: in order to overcome the defects in the prior art, the invention provides an end-to-end speech recognition method based on connectionist temporal classification and a self-attention mechanism, which uses CTC to align the speech and SA to analyse the relations within the speech; combining the advantages of CTC and SA accelerates training, simplifies the model and improves robustness.
Technical scheme: in order to achieve the above purpose, the invention adopts the following technical scheme:
an end-to-end speech recognition method based on connection time sequence classification and self-attention mechanism adopts the structure of an encoder-decoder as a main body framework, wherein the structure of the encoder-decoder comprises an encoder sub-network and a decoder sub-network; at the same time, the high-level abstract features extracted by the encoder are used as a part of the input of the decoder to be trained jointly with the decoder. The two sub-networks respectively use different training criteria, then give different weights to the output of the loss function, and update the model parameters to different degrees through the joint optimization of the two sub-networks.
The encoder-decoder architecture is implemented with the Transformer architecture, which performs deep modelling of the input speech and the input labels by stacking multiple Transformer layers and mines the links between speech and text.
The speech recognition decoder decodes by means of beam search; the decoding result corresponds directly to the correct English word or Chinese character sequence, and CTC decoding is used. The acoustic model is built on large-granularity units such as English words or Chinese characters; the output classification set is generally chosen as the characters appearing in the training set and the test set together with common Chinese characters.
The method specifically comprises the following steps:
step 1, data preparation and feature extraction: collecting voice data to obtain a voice data set; firstly, extracting features from the voice in the voice data set using a Mel filter bank algorithm; then, using the discrete Fourier transform to convert the voice from the time domain to the frequency domain, so that the characteristics of the voice can be observed better; the finally extracted features are used as the input of the network;

step 2, training an acoustic model: the encoder sub-network is regarded as the acoustic model; the voice features extracted from the training set are input into the encoder sub-network, the inherent relations between the frames of the voice signal are obtained through a self-attention mechanism, feature classification is then performed after fully connected mapping, and finally training is carried out using the CTC training criterion;

the encoder sub-network comprises a position coding layer, a multi-head attention layer, a feed-forward network layer, residual connections and layer normalization layers, and finally a CTC loss function is used to align the input voice frames with the output of the encoder; before the features are input into the encoder sub-network, position information is added to each frame of features; after the position information has been added, the features are input into the encoder sub-network, where multi-head attention is first computed to obtain multi-level features of the voice signal; a fully connected layer then performs feature mapping, one part being used as input to the decoder and the other part as the output of the encoder; a further fully connected layer after the encoder output gives the classification result over the final modelling units, the number of neurons of this fully connected layer being equal to the number of Chinese characters to be modelled plus one, namely a blank unit;

step 3, training a language model: the decoder sub-network is regarded as the language model; the language model is trained using the cross-entropy training criterion; the first self-attention layer of the decoder sub-network input must be masked, i.e. inputs at later moments cannot be seen when the current moment is processed; the second self-attention layer of the decoder sub-network receives the output of the encoder, realizing the attention relationship between the encoder and the decoder; the end of the decoder sub-network uses a softmax output, and the language model is trained with a cross-entropy loss function; the overall loss value is calculated by weighting and summing the loss values of the encoder and the decoder, different weights representing different degrees of parameter updating, and the best model is obtained by adjusting the weights;

step 4, decoding: during training, the model parameters with the minimum loss value are saved as the trained model; decoding uses beam search, which at each step keeps the n most probable paths leading to the current node; when the last node is reached, the sequence with the maximum path probability is taken as the final recognition result; during testing, the voice to be tested goes through feature extraction, is input into the trained model for processing, and is finally sent to the decoder to obtain the final recognition result.
Preferably: the encoder network uses the CTC training criterion and the decoder network uses the cross-entropy training criterion; the outputs of the two loss functions are weighted, and different emphases of the model parameter update are realized by adjusting the weighting coefficient; the loss function is loss = λ * loss_ctc + (1 - λ) * loss_sa, where loss represents the total loss value of the model, loss_ctc represents the loss value of the CTC training criterion, loss_sa represents the loss value of the cross-entropy training criterion, and λ is the weighting adjustment factor.
Preferably: in step 1, the features are extracted in units of voice frames with a frame length of 25 ms and a frame shift of 10 ms.
Preferably: in step 1, windowing is performed on the extracted voice, so that the data continuity of adjacent voice frames is ensured, and frequency spectrum leakage is prevented.
Preferably: the features extracted in step 1 are 40-dimensional mel features.
Preferably: data enhancement is carried out on the voice data collected in step 1, the data enhancement comprising deletion (masking) operations in the time domain and the frequency domain of the voice.
A computer readable storage medium having stored thereon a computer program which when executed by a processor implements an end-to-end speech recognition method based on connection timing classification and self-attention mechanisms.
A terminal device, comprising: the system comprises a memory, a processor and a computer program stored on the memory and running on the processor, wherein the processor realizes an end-to-end voice recognition method based on connection time sequence classification and self-attention mechanism when executing the program.
Compared with the prior art, the invention has the following beneficial effects:
(1) The invention provides an end-to-end speech recognition method based on connectionist temporal classification and a self-attention mechanism. By creatively combining the CTC and SA mechanisms, it not only accelerates the convergence speed of the model and obtains more accurate alignments, but also captures the internal relations between the inputs, thereby improving the accuracy and robustness of the speech recognition system.
(2) The invention uses two training criteria during training but only one decoding mode during decoding, improving the accuracy of speech recognition without increasing the decoding burden.
Drawings
Fig. 1 is a schematic structural view of the present invention.
Fig. 2 is a flow chart of speech recognition of the present invention.
Detailed Description
The present invention is further illustrated by the accompanying drawings and the detailed description below, which are to be understood as merely illustrative of the invention and not limiting of its scope; after reading the invention, various equivalent modifications made by the skilled person fall within the scope defined by the appended claims.
An end-to-end speech recognition method based on connectionist temporal classification and a self-attention mechanism uses a hybrid mechanism of connectionist temporal classification (Connectionist Temporal Classification, CTC) and self-attention (Self-Attention Mechanism, SA) to model English words or Chinese characters directly, without preprocessing or post-processing; the output corresponds directly to the correct English word sequence or Chinese character sequence. The two branches share the same encoder network: the encoder output uses the CTC training criterion and also serves as input to the decoder, realizing the attention relationship between encoder and decoder; the decoder is trained with the cross-entropy criterion; finally the two training criteria are given different weights, the weights representing how strongly each criterion drives the network parameter updates in back-propagation. When the weight is 0 or 1, the system reduces to two entirely different network models. The invention adjusts the modelling capability of the model by controlling the weight, and at the same time uses the pre-alignment property of CTC to accelerate the convergence rate of the model and reduce training time. The model structure is shown in fig. 1 and comprises an acoustic model, a language model and a decoder. The recognition method comprises the following four steps:
Step 1, data preparation and feature extraction: a publicly available voice data set, the Aishell-1 corpus, is used and divided into a training set, a validation set and a test set; representative features are extracted from each utterance with a feature extraction algorithm, specifically a Mel filter bank algorithm. Because speech is short-time stationary, extraction is performed in units of voice frames with a frame length of 25 ms and a frame shift of 10 ms. Windowing is applied to the extracted frames to keep the data of adjacent frames continuous and to prevent spectral leakage. The speech is then converted from the time domain to the frequency domain using the discrete Fourier transform, so that its characteristics can be observed better. Since the human ear perceives sound on the Mel-frequency scale, i.e. if two sounds are perceived as differing by a factor of two, their Mel frequencies also differ by a factor of two, each frame of the speech signal is passed through a Mel-frequency filter bank for feature extraction. The finally extracted feature is a 40-dimensional Mel feature, which is taken as the input of the network.
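As a concrete illustration of this step, a minimal feature-extraction sketch in Python is given below; it assumes 16 kHz mono audio and uses torchaudio's Kaldi-compatible filter bank routine, and the file name and the window type are placeholders rather than values fixed by the invention.

```python
import torchaudio
import torchaudio.compliance.kaldi as kaldi

# Hypothetical utterance path; the patent itself uses the Aishell-1 corpus (16 kHz mono).
waveform, sample_rate = torchaudio.load("example_utterance.wav")

# 40-dimensional Mel filter-bank features: 25 ms frame length, 10 ms frame shift,
# a window applied per frame to keep adjacent frames continuous and limit spectral leakage.
features = kaldi.fbank(
    waveform,
    sample_frequency=sample_rate,
    frame_length=25.0,      # milliseconds
    frame_shift=10.0,       # milliseconds
    num_mel_bins=40,
    window_type="hamming",  # example window choice, not specified by the patent
)
print(features.shape)       # (num_frames, 40) -> input to the encoder network
```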
Step 2, training the acoustic model. As shown in fig. 1, the encoder sub-network is regarded as the acoustic model: the extracted speech features are input to the encoder sub-network, the inherent relations between the frames of the speech signal are obtained through the self-attention mechanism, feature classification is then performed after fully connected mapping, and finally training is carried out with the CTC training criterion. The output of the acoustic model uses the CTC training criterion; this training method does not require the input speech to be pre-aligned with the output sequence, which saves a great deal of manpower and material resources and improves development efficiency.
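A minimal sketch of applying the CTC criterion to the encoder output with PyTorch's built-in nn.CTCLoss follows; the batch size, sequence lengths and the blank index 0 are illustrative assumptions, not values prescribed by the patent.

```python
import torch
import torch.nn as nn

vocab_size = 6308 + 1                  # Chinese character classes plus one blank unit
ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)

T, N, S = 200, 8, 30                   # frames, batch size, target length (example values)
# Encoder output as (T, N, C) log-probabilities over the modelling units.
log_probs = torch.randn(T, N, vocab_size).log_softmax(dim=-1).requires_grad_()

# Target character sequences and lengths; no frame-level alignment labels are needed.
targets = torch.randint(1, vocab_size, (N, S), dtype=torch.long)
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), S, dtype=torch.long)

loss_ctc = ctc_loss(log_probs, targets, input_lengths, target_lengths)
loss_ctc.backward()
```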
The input speech features are features after data enhancement; the data enhancement comprises deletion operations in the time domain and the frequency domain of the speech. Before entering the network, positional encoding is superimposed on the feature dimension to represent the relative position of each frame, and the frames are then processed by the network. The final output dimension of the encoder is 6308, the number of Chinese character classes to be output; since CTC is used for decoding, a blank is added on top of the character classes, meaning that during decoding, if the network cannot recognize a frame of speech or the frame is a silent segment, the network outputs a blank as the marker.
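The deletion operations in the time and frequency domains correspond to SpecAugment-style masking; a small sketch using torchaudio's masking transforms is given below, with arbitrary example mask widths that the patent does not specify.

```python
import torch
import torchaudio.transforms as T

freq_mask = T.FrequencyMasking(freq_mask_param=8)   # zero out up to 8 Mel channels
time_mask = T.TimeMasking(time_mask_param=20)        # zero out up to 20 frames

# features: (batch, num_mel_bins, num_frames), e.g. the 40-dimensional Mel features
features = torch.randn(1, 40, 300)
augmented = time_mask(freq_mask(features))           # deletion in frequency, then in time
```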
The structure of the encoder specifically comprises a position coding layer, a multi-head attention layer, a feed-forward network layer, two residual connections and layer normalization; finally, a CTC loss function aligns the input voice frames with the encoder output, which speeds up the overall training. Position information has to be added to each frame of features before they are input to the network, because the self-attention mechanism relates features over the whole utterance when acquiring the relationships between inputs and does not itself remember position information. After the position information has been added, the features are input to the network; multi-head attention is computed first, which, similar to the multiple convolution kernels of a convolutional network, obtains multi-level features of the speech signal. A residual connection is added to avoid the gradient vanishing problem when back-propagating the parameter updates. Layer normalization keeps the data within a suitable range as it flows through the different layers of the network. A fully connected layer then performs feature mapping, partly for the input of the decoder and partly for the output of the encoder. After the encoder output, a further fully connected layer gives the classification result over the final modelling units; the number of neurons of this layer equals the number of Chinese characters to be modelled plus one, namely a blank unit.
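The encoder sub-network described above can be sketched with PyTorch's stock Transformer encoder layers, as below; the model width, head count and layer count are illustrative assumptions, not the hyper-parameters specified by the patent, and only the output size (6308 characters plus one blank) comes from the text.

```python
import math
import torch
import torch.nn as nn

class SpeechEncoder(nn.Module):
    """Positional encoding + stacked multi-head self-attention layers (each with a
    feed-forward network, residual connections and layer normalization) + a fully
    connected output layer over the modelling units plus one blank unit."""

    def __init__(self, feat_dim=40, d_model=256, nhead=4, num_layers=6, vocab_size=6309):
        super().__init__()
        self.input_proj = nn.Linear(feat_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead, dim_feedforward=1024, dropout=0.1)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.output_fc = nn.Linear(d_model, vocab_size)   # 6308 characters + 1 blank

    def forward(self, feats):                     # feats: (T, N, feat_dim)
        x = self.input_proj(feats)
        x = x + self._positional_encoding(x)      # add position information per frame
        hidden = self.encoder(x)                  # also fed to the decoder's cross-attention
        log_probs = self.output_fc(hidden).log_softmax(dim=-1)   # consumed by the CTC loss
        return hidden, log_probs

    @staticmethod
    def _positional_encoding(x):
        length, _, d_model = x.shape
        pos = torch.arange(length, dtype=torch.float32).unsqueeze(1)
        div = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32)
                        * (-math.log(10000.0) / d_model))
        pe = torch.zeros(length, 1, d_model)
        pe[:, 0, 0::2] = torch.sin(pos * div)
        pe[:, 0, 1::2] = torch.cos(pos * div)
        return pe.to(x.device)

encoder = SpeechEncoder()
hidden, log_probs = encoder(torch.randn(200, 8, 40))   # 200 frames, batch of 8
```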
Step 3, training the language model. As shown in fig. 1, the decoder sub-network is regarded as the language model and is trained with the cross-entropy training criterion. The decoder sub-network comprises a masked multi-head attention layer, a feed-forward network layer, three residual connections and layer normalization; its structure and processing flow are similar to those of the encoder sub-network. The input of the decoder is the text sequence corresponding to the current training utterance; it enters the decoder sub-network through an embedding layer, the text features acquire the internal relations between the characters through the self-attention mechanism, and finally feature fusion and classification are performed by a fully connected layer with softmax activation, so that training can be carried out with the cross-entropy training criterion.
The input text sequence is passed through the embedding layer and superimposed with the positional encoding. The special point of this step is that, when decoding the current moment, the inputs at later moments must be masked, to prevent data leakage from distorting the recognition results.
The decoder sub-network differs from the encoder sub-network in two points: first, the first self-attention layer of the decoder sub-network input must be masked, i.e. inputs at later moments cannot be seen when the current moment is processed, which prevents data leakage from affecting the model; second, the second self-attention layer of the decoder sub-network receives the output of the encoder, realizing the attention relationship between the encoder and the decoder. The end of the network uses a softmax output, and the network is trained with a cross-entropy loss function. The overall loss value of the model is calculated as the weighted sum of the encoder and decoder loss values; different weights represent different degrees of parameter updating, and the best model is obtained by adjusting the weights.
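A minimal sketch of the decoder sub-network and of the weighted joint objective loss = λ * loss_ctc + (1 - λ) * loss_sa is given below; it reuses PyTorch's TransformerDecoder for the masked self-attention and the encoder-decoder attention, and the sizes, the example λ value and the dummy tensors are assumptions made only for illustration.

```python
import torch
import torch.nn as nn

class TextDecoder(nn.Module):
    """Embedded text input, masked self-attention, cross-attention over the encoder
    output, and a final classification layer trained with cross entropy."""

    def __init__(self, vocab_size=6309, d_model=256, nhead=4, num_layers=6):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead, dim_feedforward=1024, dropout=0.1)
        self.decoder = nn.TransformerDecoder(layer, num_layers)
        self.classifier = nn.Linear(d_model, vocab_size)

    def forward(self, tokens, encoder_hidden):    # tokens: (S, N), encoder_hidden: (T, N, d_model)
        s = tokens.size(0)
        # Upper-triangular -inf mask: the current step cannot attend to later inputs.
        tgt_mask = torch.triu(torch.full((s, s), float("-inf")), diagonal=1)
        x = self.embed(tokens)                    # positional encoding omitted for brevity
        x = self.decoder(x, encoder_hidden, tgt_mask=tgt_mask)
        return self.classifier(x)                 # logits fed to softmax / cross entropy

decoder = TextDecoder()
encoder_hidden = torch.randn(200, 8, 256)          # from the encoder sub-network
tokens = torch.randint(1, 6309, (30, 8))           # text of the current utterance (teacher
logits = decoder(tokens, encoder_hidden)           #  forcing; shifted targets omitted)
loss_sa = nn.CrossEntropyLoss()(logits.reshape(-1, 6309), tokens.reshape(-1))

lam = 0.3                                          # example weighting factor λ
loss_ctc = torch.tensor(0.0)                       # placeholder; comes from the encoder branch
total_loss = lam * loss_ctc + (1 - lam) * loss_sa  # different weights, different update emphasis
```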
Step 4, decoding. Through the joint optimization of the two branches, the model parameters with the minimum loss are obtained; the test-set data are then decoded with beam search and the recognition result is output. Therefore, during training, the model parameters with the minimum loss value are saved. Decoding uses beam search, which at each step keeps the n most probable paths leading to the current node, so that correct paths with a temporarily lower probability are not discarded. When the last node is reached, the sequence with the highest path probability is taken as the final recognition result. During testing, the speech goes through feature extraction, is input into the network for processing, and is finally sent to the decoder to obtain the final recognition result.
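The beam search described in this step can be sketched as follows; the scoring callback, beam width, maximum length and end-of-sequence identifier are assumptions made purely for illustration.

```python
def beam_search(step_fn, vocab_size, beam_size=5, max_len=50, eos_id=1):
    """Keep the n most probable partial paths at every step and return the
    sequence whose total path probability is highest.

    step_fn(prefix) is assumed to return a list of log-probabilities over the
    vocabulary for the next token given the partial hypothesis `prefix`.
    """
    beams = [([], 0.0)]                 # (token sequence, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            log_probs = step_fn(seq)
            for tok in range(vocab_size):
                candidates.append((seq + [tok], score + log_probs[tok]))
        # Keep only the top-n paths reaching the current step, so that correct
        # paths with a temporarily lower probability are not discarded too early.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates[:beam_size]:
            (finished if seq[-1] == eos_id else beams).append((seq, score))
        if not beams:                   # every surviving path has ended
            break
    finished.extend(beams)
    return max(finished, key=lambda c: c[1])[0]
```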
A computer readable storage medium having stored thereon a computer program which when executed by a processor implements an end-to-end speech recognition method based on connection timing classification and self-attention mechanisms.
A terminal device, comprising: the system comprises a memory, a processor and a computer program stored on the memory and running on the processor, wherein the processor realizes an end-to-end voice recognition method based on connection time sequence classification and self-attention mechanism when executing the program.
In summary, the invention provides an end-to-end speech recognition method based on connectionist temporal classification and a self-attention mechanism, which uses the conditional-independence assumption of CTC to quickly align speech and text, thereby accelerating convergence, and at the same time uses SA to capture global relations, solving the long-distance dependency problem. Without increasing the complexity of the model, a better recognition result than either single model is obtained by adjusting the weighting factor and combining the advantages of both.
The invention combines the mature CTC technique with the SA-based Transformer model, which not only accelerates the training of the model but also captures the relations between all frames of the speech, improving the stability and recognition rate of the speech recognition system.
The foregoing is only a preferred embodiment of the invention, it being noted that: it will be apparent to those skilled in the art that various modifications and adaptations can be made without departing from the principles of the present invention, and such modifications and adaptations are intended to be comprehended within the scope of the invention.

Claims (8)

1. An end-to-end speech recognition method based on connection timing classification and self-attention mechanism, comprising the steps of:
step 1, data preparation and feature extraction: collecting voice data to obtain a voice data set; firstly, extracting features from the voice in the voice data set using a Mel filter bank algorithm; then, using the discrete Fourier transform to convert the voice from the time domain to the frequency domain, so that the characteristics of the voice can be observed better; the finally extracted features are used as the input of the network;

step 2, training an acoustic model: the encoder sub-network is regarded as the acoustic model; the voice features extracted from the training set are input into the encoder sub-network, the inherent relations between the frames of the voice signal are obtained through a self-attention mechanism, feature classification is then performed after fully connected mapping, and finally training is carried out using the CTC training criterion;

the encoder sub-network comprises a position coding layer, a multi-head attention layer, a feed-forward network layer, residual connections and layer normalization layers, and finally a CTC loss function is used to align the input voice frames with the output of the encoder; before the features are input into the encoder sub-network, position information is added to each frame of features; after the position information has been added, the features are input into the encoder sub-network, where multi-head attention is first computed to obtain multi-level features of the voice signal; a fully connected layer then performs feature mapping, one part being used as input to the decoder and the other part as the output of the encoder; a further fully connected layer after the encoder output gives the classification result over the final modelling units, the number of neurons of this fully connected layer being equal to the number of Chinese characters to be modelled plus one, namely a blank unit;

step 3, training a language model: the decoder sub-network is regarded as the language model; the language model is trained using the cross-entropy training criterion; the first self-attention layer of the decoder sub-network input must be masked, i.e. inputs at later moments cannot be seen when the current moment is processed; the second self-attention layer of the decoder sub-network receives the output of the encoder, realizing the attention relationship between the encoder and the decoder; the end of the decoder sub-network uses a softmax output, and the language model is trained with a cross-entropy loss function; the overall loss value is calculated by weighting and summing the loss values of the encoder and the decoder, different weights representing different degrees of parameter updating, and the best model is obtained by adjusting the weights;

step 4, decoding: during training, the model parameters with the minimum loss value are saved as the trained model; decoding uses beam search, which at each step keeps the n most probable paths leading to the current node; when the last node is reached, the sequence with the maximum path probability is taken as the final recognition result; during testing, the voice to be tested goes through feature extraction, is input into the trained model for processing, and is finally sent to the decoder to obtain the final recognition result.
2. The end-to-end speech recognition method based on connection timing classification and self-attention mechanism of claim 1, wherein: the loss function is loss = λ * loss_ctc + (1 - λ) * loss_sa, where loss represents the total loss value of the model, loss_ctc represents the loss value of the CTC training criterion, loss_sa represents the loss value of the cross-entropy training criterion, and λ is the weighting adjustment factor.
3. The end-to-end speech recognition method based on connection timing classification and self-attention mechanism of claim 2, wherein: in step 1, the features are extracted in units of voice frames with a frame length of 25 ms and a frame shift of 10 ms.
4. The end-to-end speech recognition method based on connection timing classification and self-attention mechanism of claim 3, wherein: in step 1, windowing is performed on the extracted voice, so that the data continuity of adjacent voice frames is ensured, and frequency spectrum leakage is prevented.
5. The method for end-to-end speech recognition based on connection timing classification and self-attention mechanism of claim 4, wherein: the features extracted in step 1 are 40-dimensional mel features.
6. The method for end-to-end speech recognition based on connection timing classification and self-attention mechanism of claim 5, wherein: data enhancement is carried out on the voice data collected in step 1, the data enhancement comprising deletion operations in the time domain and the frequency domain of the voice.
7. A computer readable storage medium having stored thereon a computer program, characterized in that the program, when executed by a processor, implements the speech recognition method of claim 1.
8. A terminal device, comprising: a memory, a processor and a computer program stored on the memory and running on the processor, characterized in that the processor implements the speech recognition method according to claim 1 when executing the program.
CN202011101902.6A 2020-10-15 2020-10-15 End-to-end voice recognition method based on connection time sequence classification and self-attention mechanism Active CN112509564B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011101902.6A CN112509564B (en) 2020-10-15 2020-10-15 End-to-end voice recognition method based on connection time sequence classification and self-attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011101902.6A CN112509564B (en) 2020-10-15 2020-10-15 End-to-end voice recognition method based on connection time sequence classification and self-attention mechanism

Publications (2)

Publication Number Publication Date
CN112509564A CN112509564A (en) 2021-03-16
CN112509564B true CN112509564B (en) 2024-04-02

Family

ID=74953853

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011101902.6A Active CN112509564B (en) 2020-10-15 2020-10-15 End-to-end voice recognition method based on connection time sequence classification and self-attention mechanism

Country Status (1)

Country Link
CN (1) CN112509564B (en)

Families Citing this family (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112767926B (en) * 2021-04-09 2021-06-25 北京世纪好未来教育科技有限公司 End-to-end speech recognition two-pass decoding method and device
CN112863489B (en) * 2021-04-26 2021-07-27 腾讯科技(深圳)有限公司 Speech recognition method, apparatus, device and medium
CN113241075A (en) * 2021-05-06 2021-08-10 西北工业大学 Transformer end-to-end speech recognition method based on residual Gaussian self-attention
CN113436616B (en) * 2021-05-28 2022-08-02 中国科学院声学研究所 Multi-field self-adaptive end-to-end voice recognition method, system and electronic device
CN113257280A (en) * 2021-06-07 2021-08-13 苏州大学 Speech emotion recognition method based on wav2vec
CN113257239B (en) * 2021-06-15 2021-10-08 深圳市北科瑞声科技股份有限公司 Voice recognition method and device, electronic equipment and storage medium
CN113409772A (en) * 2021-06-15 2021-09-17 西北工业大学 Encoder and end-to-end voice recognition system based on local generation type attention mechanism and adopting same
CN113257248B (en) * 2021-06-18 2021-10-15 中国科学院自动化研究所 Streaming and non-streaming mixed voice recognition system and streaming voice recognition method
CN113488028B (en) * 2021-06-23 2024-02-27 中科极限元(杭州)智能科技股份有限公司 Speech transcription recognition training decoding method and system based on fast jump decoding
CN113362812B (en) * 2021-06-30 2024-02-13 北京搜狗科技发展有限公司 Voice recognition method and device and electronic equipment
CN113808573B (en) * 2021-08-06 2023-11-07 华南理工大学 Dialect classification method and system based on mixed domain attention and time sequence self-attention
CN113762356B (en) * 2021-08-17 2023-06-16 中山大学 Cluster load prediction method and system based on clustering and attention mechanism
CN113688822A (en) * 2021-09-07 2021-11-23 河南工业大学 Time sequence attention mechanism scene image identification method
CN113782007A (en) * 2021-09-07 2021-12-10 上海企创信息科技有限公司 Voice recognition method and device, voice recognition equipment and storage medium
CN113782029B (en) * 2021-09-22 2023-10-27 广东电网有限责任公司 Training method, device, equipment and storage medium of voice recognition model
CN113707136B (en) * 2021-10-28 2021-12-31 南京南大电子智慧型服务机器人研究院有限公司 Audio and video mixed voice front-end processing method for voice interaction of service robot
CN113936641B (en) * 2021-12-17 2022-03-25 中国科学院自动化研究所 Customizable end-to-end system for Chinese-English mixed speech recognition
CN113990296B (en) * 2021-12-24 2022-05-27 深圳市友杰智新科技有限公司 Training method and post-processing method of voice acoustic model and related equipment
CN114596839A (en) * 2022-03-03 2022-06-07 网络通信与安全紫金山实验室 End-to-end voice recognition method, system and storage medium
CN116781417B (en) * 2023-08-15 2023-11-17 北京中电慧声科技有限公司 Anti-cracking voice interaction method and system based on voice recognition

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101471071A (en) * 2007-12-26 2009-07-01 中国科学院自动化研究所 Speech synthesis system based on mixed hidden Markov model
CN110070879A (en) * 2019-05-13 2019-07-30 吴小军 A method of intelligent expression and phonoreception game are made based on change of voice technology

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11257481B2 (en) * 2018-10-24 2022-02-22 Tencent America LLC Multi-task training architecture and strategy for attention-based speech recognition system

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101471071A (en) * 2007-12-26 2009-07-01 中国科学院自动化研究所 Speech synthesis system based on mixed hidden Markov model
CN110070879A (en) * 2019-05-13 2019-07-30 吴小军 A method of intelligent expression and phonoreception game are made based on change of voice technology

Also Published As

Publication number Publication date
CN112509564A (en) 2021-03-16

Similar Documents

Publication Publication Date Title
CN112509564B (en) End-to-end voice recognition method based on connection time sequence classification and self-attention mechanism
CN110491416B (en) Telephone voice emotion analysis and identification method based on LSTM and SAE
CN107331384B (en) Audio recognition method, device, computer equipment and storage medium
CN108899051B (en) Speech emotion recognition model and recognition method based on joint feature representation
CN110459225B (en) Speaker recognition system based on CNN fusion characteristics
TW201935464A (en) Method and device for voiceprint recognition based on memorability bottleneck features
Zhang et al. Seq2seq attentional siamese neural networks for text-dependent speaker verification
CN107633842A (en) Audio recognition method, device, computer equipment and storage medium
CN108922513A (en) Speech differentiation method, apparatus, computer equipment and storage medium
CN110189749A (en) Voice keyword automatic identifying method
WO2018166316A1 (en) Speaker's flu symptoms recognition method fused with multiple end-to-end neural network structures
CN103117060A (en) Modeling approach and modeling system of acoustic model used in speech recognition
US20160358599A1 (en) Speech enhancement method, speech recognition method, clustering method and device
CN109256118B (en) End-to-end Chinese dialect identification system and method based on generative auditory model
CN110349588A (en) A kind of LSTM network method for recognizing sound-groove of word-based insertion
CN111429938A (en) Single-channel voice separation method and device and electronic equipment
CN107993663A (en) A kind of method for recognizing sound-groove based on Android
CN109754790A (en) A kind of speech recognition system and method based on mixing acoustic model
CN107039036A (en) A kind of high-quality method for distinguishing speek person based on autocoding depth confidence network
CN108922543A (en) Model library method for building up, audio recognition method, device, equipment and medium
CN113488060B (en) Voiceprint recognition method and system based on variation information bottleneck
CN111899757A (en) Single-channel voice separation method and system for target speaker extraction
CN114783418B (en) End-to-end voice recognition method and system based on sparse self-attention mechanism
Sheng et al. GANs for children: A generative data augmentation strategy for children speech recognition
CN116230019A (en) Deep emotion clustering method based on semi-supervised speech emotion recognition framework

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant