CN114783418A - End-to-end voice recognition method and system based on sparse self-attention mechanism

End-to-end voice recognition method and system based on sparse self-attention mechanism

Info

Publication number
CN114783418A
CN114783418A
Authority
CN
China
Prior art keywords
attention
matrix
sparse self
speech recognition
attention mechanism
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210694730.0A
Other languages
Chinese (zh)
Other versions
CN114783418B (en)
Inventor
魏建国
杨家豪
路文焕
裴连军
付金栋
朱咏梅
刘焕志
倪景宽
赵莹莹
李政
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin Energy Investment Group Co ltd
Tianjin University
Original Assignee
Tianjin Energy Investment Group Co ltd
Tianjin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin Energy Investment Group Co ltd, Tianjin University filed Critical Tianjin Energy Investment Group Co ltd
Priority to CN202210694730.0A priority Critical patent/CN114783418B/en
Publication of CN114783418A publication Critical patent/CN114783418A/en
Application granted granted Critical
Publication of CN114783418B publication Critical patent/CN114783418B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 — Speech recognition
    • G10L15/02 — Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/06 — Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 — Training
    • G10L15/26 — Speech to text systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses an end-to-end speech recognition method and system based on a sparse self-attention mechanism, comprising the following steps: acquiring an audio data set, and performing preprocessing, data enhancement and feature extraction on the data set to obtain an acoustic feature sequence; down-sampling the acoustic feature sequence and inputting it into a sparse self-attention speech recognition model, which recognizes the input sequence using the hyper-parameters with the best training effect; and decoding the recognition result of the sparse self-attention speech recognition model to obtain the corresponding text sequence. Without increasing the complexity of the model, the number of dot-product operations is reduced and the input sequence is made to focus more on the frames of certain key time steps. The invention achieves better recognition accuracy while improving storage and time efficiency.

Description

End-to-end voice recognition method and system based on sparse self-attention mechanism
Technical Field
The invention relates to the technical field of voice recognition, in particular to an end-to-end voice recognition method and system based on a sparse self-attention mechanism.
Background
Automatic Speech Recognition (ASR) is a sequence-to-sequence task in which an input audio signal sequence is converted into the corresponding text content. With the rapid development of artificial intelligence, speech recognition technology, as one of the important means of human-computer interaction, is widely embedded in various intelligent devices and chat robots. Traditional speech recognition technology is divided into three modules, namely an acoustic model, a pronunciation dictionary and a language model, and such traditional models suffer from limitations such as inconsistent assumptions and accumulated errors. In the last decade, as deep learning has been widely applied to fields such as computer vision and natural language processing, methods replacing each sub-module of speech recognition technology with a neural network have also appeared.
In order to overcome the disadvantages of the traditional model, end-to-end speech recognition technology was developed. End-to-end speech recognition models fall mainly into two types. One is the recurrent neural network (RNN) model based on Connectionist Temporal Classification (CTC), but such RNN-based models can only process data sequentially and cannot effectively capture temporal information. The other is the Encoder-Decoder model based on the Self-Attention Mechanism; being a non-recurrent network, it can process all input speech signal frames in parallel, thus effectively solving the problem of recurrent-neural-network-based speech recognition models.
The encoder-decoder model based on the self-attention mechanism trains quickly, recognizes well, and can process all time steps of the input sequence in parallel, but it still has several obvious disadvantages. First, since attention is a weighted-average calculation, the weight distribution tends to be sparse; such a distribution makes the result focus only on the "most important information" while ignoring other "secondary information", which is usually inappropriate. Second, since the self-attention mechanism is computed with dot products, its time complexity is quadratic in the length of the input sequence. Unlike a machine translation task, a segment of speech, after signal preprocessing, often yields an input audio sequence a thousand frames long, so the huge computational cost becomes a performance bottleneck in the speech recognition task.
Recently, many approaches have been proposed to remedy the drawbacks of the self-attention mechanism in the encoder-decoder model, such as windowing the attention so that only local attention information inside the window is considered at each time point, or adding a Gaussian distribution function to the weight calculation so that frames at different time-step distances from the current frame receive different weights when their attention is computed. These methods can optimize the model, but they also bring problems such as increasing the model complexity and artificially constraining the attention process.
The traditional self-attention-based speech recognition method is limited by its weighted-average calculation and cannot effectively focus on meaningful local information. When the input audio signal is relatively long, the quadratic computational complexity also degrades recognition effectiveness and efficiency. Much research has focused on adding more artificial assumptions to the attention mechanism or shortening the input audio signal by down-sampling; however, these methods either make the model more complicated and reduce efficiency while improving performance, or degrade recognition accuracy while improving efficiency.
Disclosure of Invention
Therefore, an object of the present invention is to provide an end-to-end speech recognition method based on a sparse self-attention mechanism, which adopts a deterministic sparse attention mechanism that reduces the number of dot-product operations without increasing the complexity of the model and makes the input sequence focus more on the frames of certain key time steps. The invention achieves better recognition accuracy while improving storage and time efficiency.
In order to achieve the above object, an end-to-end speech recognition method based on a sparse self-attention mechanism of the present invention includes the following steps:
s1, acquiring an audio data set, and preprocessing, enhancing data and extracting features of the data set to obtain an acoustic feature sequence;
s2, performing down-sampling on the acoustic feature sequence, inputting the down-sampled acoustic feature sequence into a speech recognition model based on a sparse self-attention mechanism, and recognizing the input acoustic feature sequence by utilizing a hyper-parameter with an optimal training effect in the speech recognition model based on the sparse self-attention mechanism;
and S3, decoding the recognition result of the speech recognition model based on the sparse self-attention mechanism to obtain a corresponding text sequence.
Further preferably, in the sparse self-attention mechanism-based speech recognition model, the self-attention adopts multi-head dot product attention, and the multi-head dot product attention linearly transforms the input acoustic features to generate a Query matrix, a Key matrix and a Value matrix with the same dimension.
Further preferably, in S2, the construction process of the speech recognition model based on sparse self-attention mechanism includes the following steps:
s201, carrying out convolution and down-sampling on the input acoustic features to obtain a Key matrix,
S202, calculating the vector with the maximum value of the input acoustic feature sequence in each dimension globally, recording it as the global vector, and calculating the attention weight matrix SQK using the obtained global vector and the original Query matrix;
S203, sampling the Query matrix, and recalculating the attention weight using the sampled $\tilde{Q}$ matrix and the $\tilde{K}$ matrix; correlating the newly obtained attention weight with the Value matrix replaced by the mean vector to obtain the sparse self-attention relationship, giving the final speech recognition model based on the sparse self-attention mechanism.
Further preferably, in S203, the sampling the original Query matrix includes the following steps:
taking out the λ indices with the largest weights in the attention weight matrix SQK, using these indices to take out the corresponding Query vectors from the Query matrix for down-sampling, and recording the sampled Query matrix as the $\tilde{Q}$ matrix.
Further preferably, in S203, the Value matrix after being replaced by the mean vector is represented by the following formula:
$$V\_mean^{(E)} = \frac{1}{t}\sum_{i=1}^{t} v_i^{(E)};$$

wherein $V\_mean$ represents the Value matrix after replacement, $v_i^{(E)}$ represents the E-th feature value in the i-th time-step vector of the Value matrix, and t is the total number of time-step vectors.
Further preferably, in S2, the training process of the sparse self-attention mechanism-based speech recognition model includes:
inputting the acoustic characteristic sequence into an encoder for sampling;
acquiring a text data set, and extracting corresponding text features from all texts in the text data set by using a Tokenizer; inputting the text features into a decoder;
and the sampled acoustic feature sequence is used as the input of a speech recognition model based on a sparse self-attention mechanism, the decoding result of a decoder is used as the output, the speech recognition model based on the sparse self-attention mechanism is trained, the hyper-parameters are adjusted in the training process to obtain different test results, and the combination of the hyper-parameters with the optimal test results is selected as the output.
The invention also provides an end-to-end voice recognition system based on the sparse self-attention mechanism, which comprises a data acquisition module, an encoder, a voice recognition model based on the sparse self-attention mechanism and a decoder;
the data acquisition module is used for acquiring an audio data set;
the encoder is used for preprocessing, enhancing data and extracting features of the data set to obtain an acoustic feature sequence;
the voice recognition model based on the sparse self-attention mechanism performs down-sampling on the acoustic feature sequence, and recognizes the input acoustic feature sequence by utilizing the hyper-parameter with the optimal training effect;
the decoder is used for decoding the recognition result of the speech recognition model based on the sparse self-attention mechanism to obtain a corresponding text sequence.
Further preferably, the sparse self-attention mechanism-based speech recognition model includes a self-attention matrix generation module; self-attention adopts multi-head dot-product attention, which linearly transforms the input acoustic features to generate a Query matrix, a Key matrix and a Value matrix with the same dimensions.
Further preferably, the sparse self-attention mechanism-based speech recognition model comprises a self-attention weight calculation module, which is used for performing convolution down-sampling on the input acoustic features to obtain a Key matrix; calculating the vector with the maximum value of the input acoustic feature sequence in each dimension globally, recording it as the global vector, and calculating the attention weight matrix SQK using the obtained global vector and the original Query matrix; sampling the Query matrix, and recalculating the attention weight using the sampled $\tilde{Q}$ matrix and the $\tilde{K}$ matrix; and correlating the newly obtained attention weight with the Value matrix replaced by the mean vector to obtain the sparse self-attention relationship, giving the final speech recognition model based on the sparse self-attention mechanism.
Compared with the prior art, the end-to-end voice recognition method and the system based on the sparse self-attention mechanism at least have the following advantages:
1. By adopting a deterministic sparse attention mechanism, the number of dot-product operations is reduced without increasing the complexity of the model, and the input sequence is made to focus more on the frames of certain key time steps, so that better recognition accuracy is obtained while storage and time efficiency are improved.
Drawings
FIG. 1 is a schematic flow chart of an end-to-end speech recognition method based on a sparse self-attention mechanism according to the present invention.
FIG. 2 is a flow chart of the sparse self-attention mechanism of the present invention;
FIG. 3 is a diagram of an input test audio spectrogram according to the present invention;
FIG. 4 is an attention-weight heat map for the test audio output of the present invention.
Detailed Description
The invention is described in further detail below with reference to the figures and the detailed description.
As shown in fig. 1, an embodiment of an aspect of the present invention provides an end-to-end speech recognition method based on a sparse self-attention mechanism, including the following steps:
s1, acquiring an audio data set, and preprocessing, enhancing data and extracting features of the data set to obtain an acoustic feature sequence;
s2, down-sampling the acoustic feature sequence, inputting the down-sampled acoustic feature sequence into a speech recognition model based on a sparse self-attention mechanism, and recognizing the input acoustic feature sequence by using a hyper-parameter with an optimal training effect;
and S3, decoding the recognition result of the voice recognition model based on the sparse self-attention mechanism to obtain a corresponding text sequence.
It should be noted that, in the speech recognition model based on the sparse self-attention mechanism, the self-attention adopts multi-head dot product attention, and the multi-head dot product attention performs linear transformation on the input acoustic features to generate a Query matrix, a Key matrix and a Value matrix with the same dimension.
In S2, the construction process of the sparse attention-based speech recognition model includes the following steps:
s201, carrying out convolution and down-sampling on the input acoustic features to obtain a Key matrix,
S202, calculating the vector with the maximum value of the input acoustic feature sequence in each dimension globally, recording it as the global vector, and calculating the attention weight matrix SQK using the obtained global vector and the original Query matrix;
S203, sampling the Query matrix, and recalculating the attention weight using the sampled $\tilde{Q}$ matrix and the $\tilde{K}$ matrix; correlating the newly obtained attention weight with the Value matrix replaced by the mean vector to obtain the sparse self-attention relationship, giving the final speech recognition model based on the sparse self-attention mechanism.
The method for sampling the original Query matrix comprises the following steps:
taking out the λ indices with the largest weights in the attention weight matrix SQK, using these indices to take out the corresponding Query vectors from the Query matrix for down-sampling, and recording the sampled Query matrix as the $\tilde{Q}$ matrix.
The Value matrix after the replacement by the mean vector is expressed by the following formula:
$$V\_mean^{(E)} = \frac{1}{t}\sum_{i=1}^{t} v_i^{(E)};$$

where $V\_mean$ represents the Value matrix after replacement, $v_i^{(E)}$ represents the E-th feature value in the i-th time-step vector of the Value matrix, and t is the total number of time-step vectors.
In the S2, the training process of the sparse self-attention mechanism based speech recognition model includes:
inputting the acoustic feature sequence into an encoder for sampling;
acquiring a text data set, and extracting corresponding text features of all texts in the text data set by using a Tokenizer; inputting the text features into a decoder;
and the sampled acoustic feature sequence is used as the input of a speech recognition model based on a sparse self-attention mechanism, the decoding result of a decoder is used as the output, the speech recognition model based on the sparse self-attention mechanism is trained, the hyper-parameters are adjusted in the training process to obtain different test results, and the combination of the hyper-parameters with the optimal test results is selected as the output.
In the speech recognition model based on the sparse self-attention mechanism, the self-attention adopts multi-head dot product attention, and the multi-head dot product attention carries out linear transformation on input acoustic features to generate a Query matrix, a Key matrix and a Value matrix with the same dimension.
The invention also provides an end-to-end voice recognition system based on the sparse self-attention mechanism, which comprises a data acquisition module, a coder, a voice recognition model based on the sparse self-attention mechanism and a decoder;
the data acquisition module is used for acquiring an audio data set;
the encoder is used for preprocessing, enhancing data and extracting features of the data set to obtain an acoustic feature sequence;
the voice recognition model based on the sparse self-attention mechanism performs down-sampling on the acoustic feature sequence, and recognizes the input acoustic feature sequence by utilizing the hyper-parameters with the optimal training effect;
the decoder is used for decoding the recognition result of the speech recognition model based on the sparse self-attention mechanism to obtain a corresponding text sequence.
The speech recognition model based on the sparse self-attention mechanism comprises a self-attention matrix generation module; self-attention adopts multi-head dot-product attention, which performs a linear transformation on the input acoustic features to generate a Query matrix, a Key matrix and a Value matrix with the same dimensionality.
The speech recognition model based on the sparse self-attention mechanism comprises a self-attention weight calculation module, which is used for performing convolution down-sampling on the input acoustic features to obtain a Key matrix; calculating the vector with the maximum value of the input acoustic feature sequence in each dimension globally, recording it as the global vector, and calculating the attention weight matrix SQK using the obtained global vector and the original Query matrix; sampling the Query matrix, and recalculating the attention weight using the sampled $\tilde{Q}$ matrix and the $\tilde{K}$ matrix; and correlating the newly obtained attention weight with the Value matrix replaced by the mean vector to obtain the sparse self-attention relationship, giving the final speech recognition model based on the sparse self-attention mechanism.
In one embodiment of the invention, in step one, a Chinese audio data set is prepared and the audio data is subjected to preprocessing, data enhancement and feature extraction; a text tokenizer is pre-trained so that the training targets better conform to the characteristics of the language, improving the recognition of rare words.
The process in the first step of the invention comprises the following steps:
and step 11, data parameters. The sampling rate was 16 kHz.
Step 12, pre-emphasis, framing and windowing are performed on the sampled original signal. The frame length and frame shift are set to 25 ms and 10 ms, respectively.
Step 13, the noise data in RIRS_NOISES is used to perform data enhancement, noise addition and reverberation processing on the preprocessed audio data.
Step 14, acoustic features are extracted and convolution down-sampling is performed on the features. A short-time Fourier transform (STFT) is applied to each frame of the signal to obtain a short-time magnitude spectrum, and 80-dimensional acoustic features are obtained through a Mel filter bank (Fbank). Down-sampling uses two 3x3 convolutional layers, each with a stride of 2 in both dimensions, and the output feature dimension is 256. The acoustic feature sequence after convolutional down-sampling can be represented as:

$$X = (x_1, x_2, \ldots, x_n)$$

where n represents the length of the audio signal and $x_m$ represents the audio feature of the m-th frame.
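A minimal PyTorch sketch of steps 12-14 (framing/windowing, Fbank extraction, and convolutional down-sampling) is given below for clarity. The torchaudio calls and the exact module layout are our illustrative assumptions; the patent itself only fixes the parameters (16 kHz input, 25 ms/10 ms frames, 80 Mel bins, two 3x3 stride-2 convolutions, 256 output dimensions).

```python
import torch
import torchaudio

def extract_fbank(wav_path: str) -> torch.Tensor:
    """Load 16 kHz audio and compute 80-dim Fbank features (25 ms frames, 10 ms shift)."""
    waveform, sample_rate = torchaudio.load(wav_path)
    return torchaudio.compliance.kaldi.fbank(
        waveform,
        sample_frequency=sample_rate,   # 16 kHz per step 11
        num_mel_bins=80,                # 80-dimensional acoustic features
        frame_length=25.0,              # frame length 25 ms
        frame_shift=10.0,               # frame shift 10 ms
        preemphasis_coefficient=0.97,   # pre-emphasis from step 12
    )                                   # shape: (num_frames, 80)

class ConvSubsampling(torch.nn.Module):
    """Two 3x3 convolutions with stride 2, projecting to a 256-dim sequence."""
    def __init__(self, out_dim: int = 256):
        super().__init__()
        self.conv = torch.nn.Sequential(
            torch.nn.Conv2d(1, out_dim, kernel_size=3, stride=2), torch.nn.ReLU(),
            torch.nn.Conv2d(out_dim, out_dim, kernel_size=3, stride=2), torch.nn.ReLU(),
        )
        # two stride-2 convolutions shrink the 80 Mel bins to 19 frequency positions
        self.proj = torch.nn.Linear(out_dim * 19, out_dim)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        x = self.conv(feats.unsqueeze(1))   # (batch, frames, 80) -> (batch, 256, t', f')
        b, c, t, f = x.shape
        return self.proj(x.transpose(1, 2).reshape(b, t, c * f))   # (batch, t', 256)
```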
Step 15, a subword Tokenizer is pre-trained on all texts in the dataset. Using this Tokenizer, the corresponding text features are obtained; the extracted text feature sequence can be expressed as:

$$Y = (y_1, y_2, \ldots, y_m)$$

where m represents the length of the token sequence extracted from the text and $y_m$ represents the m-th token feature.
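The patent does not name a specific Tokenizer implementation; as one possibility, a subword model can be pre-trained with the SentencePiece library as sketched below (the file names and vocabulary size are hypothetical).

```python
import sentencepiece as spm

# Pre-train a subword tokenizer on all transcripts of the dataset
# ("train_transcripts.txt" is a hypothetical path, one transcript per line).
spm.SentencePieceTrainer.train(
    input="train_transcripts.txt",
    model_prefix="asr_tokenizer",
    vocab_size=5000,          # illustrative value
    model_type="unigram",
)

# Extract the token sequence Y = (y_1, ..., y_m) for one transcript.
sp = spm.SentencePieceProcessor(model_file="asr_tokenizer.model")
token_ids = sp.encode("一段中文转写文本", out_type=int)
```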
Fig. 2 is a flow chart of the sparse self-attention mechanism proposed by the present invention.
Step two, building the speech recognition model based on the sparse self-attention mechanism. The model is built on the Speech-Transformer; aiming at the defects of the original model, a sparse self-attention mechanism is proposed to replace the multi-head self-attention module at the core of the model. The network comprises m encoder modules and n decoder modules, where m and n are both set to 12; the extracted acoustic features are input to the encoder, and the text features are input to the decoder.
Step 21, a conventional multi-head dot product attention method.
First, the input acoustic features are linearly transformed.
The formula for the linear transformation is:

$$Q = XW^{Q},\qquad K = XW^{K},\qquad V = XW^{V}$$

where Q, K and V respectively denote the Query, Key and Value matrices, whose dimensions are the same as those of the input speech features, and $W^{Q}$, $W^{K}$ and $W^{V}$ are the weight matrices required for the linear transformations.
The traditional calculation formula of multi-head dot product attention can be described as follows:
$$\operatorname{Attention}(Q, K, V) = \operatorname{softmax}\!\left(\frac{QK^{\mathsf T}}{\sqrt{d}}\right)V$$

where d represents the dimension of the data, i.e., the hidden dimension of the acoustic features after convolutional down-sampling, and V is the Value matrix.
For each attention module, the attention values of the individual heads are concatenated; this process can be described as:

$$\operatorname{MultiHead}(Q, K, V) = \operatorname{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\,W^{O}$$

where h is the number of heads and $W^{O}$ is the weight matrix of the linear transformation applied after the multi-head concatenation.
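For reference, the conventional multi-head dot-product attention of step 21 can be sketched in PyTorch as follows; the per-call creation of the linear layers with random weights is for illustration only (in the actual model these are trained parameters of each encoder/decoder block).

```python
import math
import torch

def multi_head_attention(x: torch.Tensor, h: int = 4) -> torch.Tensor:
    """Conventional multi-head dot-product self-attention; x: (batch, n, d)."""
    b, n, d = x.shape
    w_q, w_k, w_v, w_o = (torch.nn.Linear(d, d) for _ in range(4))
    q, k, v = w_q(x), w_k(x), w_v(x)            # Q = XW^Q, K = XW^K, V = XW^V
    split = lambda m: m.view(b, n, h, d // h).transpose(1, 2)   # (b, h, n, d/h)
    q, k, v = split(q), split(k), split(v)
    # softmax(QK^T / sqrt(d_head)) V, computed independently in each head
    weights = torch.softmax(q @ k.transpose(-2, -1) / math.sqrt(d // h), dim=-1)
    heads = weights @ v                          # (b, h, n, d/h)
    # concatenate the h heads and apply the output transformation W^O
    return w_o(heads.transpose(1, 2).reshape(b, n, d))

out = multi_head_attention(torch.randn(2, 100, 256))   # d = 256 as in the patent
```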
And step 22, constructing a sparse self-attention mechanism-based encoder.
In order to extract the truly key frames in the input acoustic feature sequence, a Query selector is implemented. First, the encoder input features are linearly transformed to obtain the Key matrix, and a down-sampling operation is performed on the Key: the vectors with the larger values across the dimensions of the input acoustic feature sequence are extracted and recorded as $\tilde{K}$. The formula for selecting the Key can be described as:

$$\tilde{K} = K\big[\operatorname{argsort}_{\downarrow}\big(\max{}_{D}(k_i)\big)\big]$$

where $\operatorname{argsort}_{\downarrow}$ orders all time-step vectors of the Key matrix from largest to smallest by their maximum value over the feature dimension D, the i-th entry corresponding to the vector with the i-th largest value. The obtained global vector $\tilde{K}$ is then used together with the original Query matrix to calculate the attention weight, and the formula for calculating the attention weight at this step can be described as:
$$SQK = \operatorname{softmax}\!\left(\frac{Q\tilde{K}^{\mathsf T}}{\sqrt{d}}\right)$$
after the weight matrix SQK is obtained, the Query matrix is sampled. Specifically, the lambda subscripts with the maximum weight in the SQK matrix are taken out, the subscripts are used for taking out the corresponding Query vectors in the Query matrix for down-sampling, and the sampled Query matrix is recorded as
Figure 622497DEST_PATH_IMAGE001
Figure 359509DEST_PATH_IMAGE019
Where λ is the corresponding hyperparameter.
After the sampled $\tilde{Q}$ and $\tilde{K}$ matrices are obtained through the above steps, the attention weight is recalculated; the formula at this step can be described as:

$$\operatorname{Attention}(\tilde{Q}, \tilde{K}, V) = \operatorname{softmax}\!\left(\frac{\tilde{Q}\tilde{K}^{\mathsf T}}{\sqrt{d}}\right)V$$

where V is the Value matrix.
The other Query vectors in the Query matrix are dropped; when calculating the final self-attention layer output, they are replaced by the global mean Value vector, denoted V_mean:

$$V\_mean^{(E)} = \frac{1}{t}\sum_{i=1}^{t} v_i^{(E)};$$

where $V\_mean$ represents the Value matrix after replacement, $v_i^{(E)}$ represents the E-th feature value in the i-th time-step vector of the Value matrix, and t is the total number of time-step vectors.
The resulting sparse self-attention module output formula can be described as:
$$\operatorname{output}_i = \begin{cases} \Big[\operatorname{softmax}\!\big(\frac{\tilde{Q}\tilde{K}^{\mathsf T}}{\sqrt{d}}\big)V\Big]_i, & i \in \operatorname{Top}_{\lambda}(SQK) \\ V\_mean, & \text{otherwise} \end{cases}$$

that is, the λ sampled query positions take the recalculated attention output, while all remaining positions take the global mean vector V_mean.
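Putting the pieces of step 22 together, a single-head sketch of the sparse self-attention forward pass is given below. Where the text leaves details open, we make assumptions and mark them: the number of retained key vectors (`n_keys`), tie-breaking in the top-k selections, and restricting V to the sampled key positions when recomputing the attention.

```python
import math
import torch

def sparse_self_attention(q, k, v, lam: int, n_keys: int):
    """q, k, v: (n, d) Query/Key/Value; lam: number of sampled queries (hyperparameter)."""
    n, d = q.shape
    # 1) Down-sample the Key matrix: keep the n_keys time steps whose maximum
    #    value over the feature dimension is largest (the global vector K~).
    key_idx = k.max(dim=-1).values.topk(n_keys).indices
    k_tilde, v_tilde = k[key_idx], v[key_idx]   # restricting V here is our assumption
    # 2) Attention weights SQK between the original queries and K~.
    sqk = torch.softmax(q @ k_tilde.T / math.sqrt(d), dim=-1)      # (n, n_keys)
    # 3) Sample the lam queries carrying the largest weights in SQK (Q~).
    query_idx = sqk.max(dim=-1).values.topk(lam).indices
    q_tilde = q[query_idx]
    # 4) Recompute the attention with the sampled Q~ and K~.
    attn = torch.softmax(q_tilde @ k_tilde.T / math.sqrt(d), dim=-1) @ v_tilde
    # 5) Dropped query positions are replaced by the global mean Value vector V_mean.
    out = v.mean(dim=0).expand(n, d).clone()
    out[query_idx] = attn
    return out

out = sparse_self_attention(torch.randn(100, 256), torch.randn(100, 256),
                            torch.randn(100, 256), lam=25, n_keys=50)
```

Compared with the n² dot products of full self-attention, this sketch needs only n·n_keys + λ·n_keys dot products, which is where the storage and time savings discussed below come from.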
and step three, training the voice recognition model, storing the voice recognition model with the optimal verification set result, and performing decoding test on the test set by using the model.
Step 31, training phase. We construct the model on the basis of the Speech-Transformer and use Adam and SGD as a two-stage optimizer. The data dimensions of the attention module and the hidden layer are set to 256 and 2048, respectively, and after the training parameters are set the model is trained for a total of 50 epochs.
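A minimal sketch of the two-stage optimization is shown below; the switch epoch and learning rates are hypothetical, as the patent fixes only the optimizers (Adam, then SGD), the 256/2048 dimensions, and the 50 training epochs.

```python
import torch

model = torch.nn.Transformer(d_model=256, dim_feedforward=2048)  # stand-in for the Speech-Transformer
adam = torch.optim.Adam(model.parameters(), lr=1e-3)
sgd = torch.optim.SGD(model.parameters(), lr=1e-4, momentum=0.9)
SWITCH_EPOCH = 40   # hypothetical point where training switches from Adam to SGD

for epoch in range(50):
    optimizer = adam if epoch < SWITCH_EPOCH else sgd
    # ... one pass over the training data, calling optimizer.zero_grad(),
    #     loss.backward() and optimizer.step() per batch
```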
Step 32, test phase. Models are trained with different hyper-parameter settings, and the combination of hyper-parameters with the best test-set result is finally selected as the output. Fig. 3 shows a spectrogram of a test audio input during testing. As shown in Table 1, we compare the recognition performance of several different speech recognition models on the data set; the end-to-end speech recognition method based on the sparse self-attention mechanism provided by the invention achieves the best result. FIG. 4 is the attention-weight heat map for the test audio output.
TABLE 1 comparison of time complexity and speech recognition word error rates for different models
Table 2 compares the storage and time efficiency of the proposed method for input audio feature sequences of different lengths.
TABLE 2 Comparison of self-attention layer decoding time
It should be understood that the above examples are only for clarity of illustration and are not intended to limit the embodiments. Other variations or modifications will be apparent to persons skilled in the art in light of the above description; it is neither necessary nor possible to exhaustively enumerate all embodiments, and obvious variations or modifications derived therefrom remain within the scope of the invention.

Claims (9)

1. An end-to-end speech recognition method based on a sparse self-attention mechanism is characterized by comprising the following steps:
S1, acquiring an audio data set, and performing preprocessing, data enhancement and feature extraction on the data set to obtain an acoustic feature sequence;
s2, down-sampling the acoustic feature sequence, inputting the down-sampled acoustic feature sequence into a speech recognition model based on a sparse self-attention mechanism, and recognizing the input acoustic feature sequence by using a hyper-parameter with an optimal training effect;
and S3, decoding the recognition result of the speech recognition model based on the sparse self-attention mechanism to obtain a corresponding text sequence.
2. The end-to-end speech recognition method based on the sparse self-attention mechanism as claimed in claim 1, wherein in the speech recognition model based on the sparse self-attention mechanism, the self-attention adopts multi-head dot product attention, and the multi-head dot product attention linearly transforms the input acoustic features to generate a Query matrix, a Key matrix and a Value matrix with the same dimension.
3. The sparse self-attention mechanism-based end-to-end speech recognition method according to claim 2, wherein in S2, the sparse self-attention mechanism-based speech recognition model building process comprises the following steps:
s201, carrying out convolution and down-sampling on the input acoustic features to obtain a Key matrix,
s202, calculating a vector with the maximum value of the input acoustic feature sequence in each global dimension, recording the vector as a global vector, and calculating an attention weight matrix SQK by using the obtained global vector and an original Query matrix;
S203, sampling the Query matrix, and recalculating the attention weight using the sampled $\tilde{Q}$ matrix and the $\tilde{K}$ matrix; correlating the newly obtained attention weight with the Value matrix replaced by the mean vector to obtain the sparse self-attention relationship, as the final speech recognition model based on the sparse self-attention mechanism.
4. The sparse self-attention mechanism-based end-to-end speech recognition method of claim 3, wherein in S203, sampling the original Query matrix comprises the following steps:
taking out the λ indices with the largest weights in the attention weight matrix SQK, using these indices to take out the corresponding Query vectors from the Query matrix for down-sampling, and recording the sampled Query matrix as the $\tilde{Q}$ matrix.
5. The sparse self-attention mechanism-based end-to-end speech recognition method of claim 3, wherein in S203, the Value matrix after the replacement with the mean vector is represented by the following formula:
$$V\_mean^{(E)} = \frac{1}{t}\sum_{i=1}^{t} v_i^{(E)};$$

wherein $V\_mean$ represents the Value matrix after replacement, $v_i^{(E)}$ represents the E-th feature value in the i-th time-step vector of the Value matrix, and t is the total number of time-step vectors.
6. The sparse self-attention mechanism-based end-to-end speech recognition method of claim 1, wherein in the S2, the training process of the sparse self-attention mechanism-based speech recognition model comprises:
inputting the acoustic characteristic sequence into an encoder for sampling;
acquiring a text data set, and extracting corresponding text features of all texts in the text data set by using a Tokenizer; inputting the text features into a decoder;
and the sampled acoustic feature sequence is used as the input of a speech recognition model based on a sparse self-attention mechanism, the decoding result of a decoder is used as the output, the speech recognition model based on the sparse self-attention mechanism is trained, the hyper-parameters are adjusted in the training process to obtain different test results, and the combination of the hyper-parameters with the optimal test results is selected as the output.
7. An end-to-end speech recognition system based on a sparse self-attention mechanism is characterized by comprising a data acquisition module, an encoder, a speech recognition model based on the sparse self-attention mechanism and a decoder;
the data acquisition module is used for acquiring an audio data set;
the encoder is used for preprocessing, enhancing data and extracting features of the data set to obtain an acoustic feature sequence;
the voice recognition model based on the sparse self-attention mechanism performs down-sampling on the acoustic feature sequence, and recognizes the input acoustic feature sequence by utilizing the hyper-parameter with the optimal training effect;
the decoder is used for decoding the recognition result of the voice recognition model based on the sparse self-attention mechanism to obtain a corresponding text sequence.
8. The sparse self-attention mechanism-based end-to-end speech recognition system of claim 7, wherein the sparse self-attention mechanism-based speech recognition model includes a self-attention matrix generation module; self-attention adopts multi-head dot-product attention, which linearly transforms the input acoustic features to generate a Query matrix, a Key matrix and a Value matrix with the same dimensions.
9. The sparse self-attention mechanism-based end-to-end speech recognition system of claim 8, wherein the sparse self-attention mechanism-based speech recognition model comprises a self-attention weight calculation module configured to perform convolution down-sampling on the input acoustic features to obtain a Key matrix; calculate the vector with the maximum value of the input acoustic feature sequence in each dimension globally, record it as the global vector, and calculate the attention weight matrix SQK using the obtained global vector and the original Query matrix; sample the Query matrix, and recalculate the attention weight using the sampled $\tilde{Q}$ matrix and the $\tilde{K}$ matrix; and correlate the newly obtained attention weight with the Value matrix replaced by the mean vector to obtain the sparse self-attention relationship, as the final speech recognition model based on the sparse self-attention mechanism.
CN202210694730.0A 2022-06-20 2022-06-20 End-to-end voice recognition method and system based on sparse self-attention mechanism Active CN114783418B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210694730.0A CN114783418B (en) 2022-06-20 2022-06-20 End-to-end voice recognition method and system based on sparse self-attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210694730.0A CN114783418B (en) 2022-06-20 2022-06-20 End-to-end voice recognition method and system based on sparse self-attention mechanism

Publications (2)

Publication Number Publication Date
CN114783418A true CN114783418A (en) 2022-07-22
CN114783418B CN114783418B (en) 2022-08-23

Family

ID=82420363

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210694730.0A Active CN114783418B (en) 2022-06-20 2022-06-20 End-to-end voice recognition method and system based on sparse self-attention mechanism

Country Status (1)

Country Link
CN (1) CN114783418B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115081752A (en) * 2022-08-11 2022-09-20 浙江君同智能科技有限责任公司 Black and gray production crowdsourcing flow prediction device and method
CN115796407A (en) * 2023-02-13 2023-03-14 中建科技集团有限公司 Production line fault prediction method and related equipment

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200043483A1 (en) * 2018-08-01 2020-02-06 Google Llc Minimum word error rate training for attention-based sequence-to-sequence models
CN113140220A (en) * 2021-04-12 2021-07-20 西北工业大学 Lightweight end-to-end speech recognition method based on convolution self-attention transformation network
CN113380232A (en) * 2021-06-15 2021-09-10 哈尔滨工业大学 End-to-end speech recognition method based on constraint structured sparse attention mechanism and storage medium
CN113642646A (en) * 2021-08-13 2021-11-12 重庆邮电大学 Image threat article classification and positioning method based on multiple attention and semantics
WO2022121150A1 (en) * 2020-12-10 2022-06-16 平安科技(深圳)有限公司 Speech recognition method and apparatus based on self-attention mechanism and memory network

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200043483A1 (en) * 2018-08-01 2020-02-06 Google Llc Minimum word error rate training for attention-based sequence-to-sequence models
WO2022121150A1 (en) * 2020-12-10 2022-06-16 平安科技(深圳)有限公司 Speech recognition method and apparatus based on self-attention mechanism and memory network
CN113140220A (en) * 2021-04-12 2021-07-20 西北工业大学 Lightweight end-to-end speech recognition method based on convolution self-attention transformation network
CN113380232A (en) * 2021-06-15 2021-09-10 哈尔滨工业大学 End-to-end speech recognition method based on constraint structured sparse attention mechanism and storage medium
CN113642646A (en) * 2021-08-13 2021-11-12 重庆邮电大学 Image threat article classification and positioning method based on multiple attention and semantics

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Timo Lohrenz et al.: "Relaxed Attention: A Simple Method to Boost Performance of End-to-End Automatic Speech Recognition", arXiv:2107.01275v2 *
Zhifu Gao et al.: "SAN-M: Memory Equipped Self-Attention for End-to-End Speech Recognition", arXiv:2006.01713v1 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115081752A (en) * 2022-08-11 2022-09-20 浙江君同智能科技有限责任公司 Black and gray production crowdsourcing flow prediction device and method
CN115796407A (en) * 2023-02-13 2023-03-14 中建科技集团有限公司 Production line fault prediction method and related equipment

Also Published As

Publication number Publication date
CN114783418B (en) 2022-08-23

Similar Documents

Publication Publication Date Title
CN110491416B (en) Telephone voice emotion analysis and identification method based on LSTM and SAE
Gaikwad et al. A review on speech recognition technique
Zhu et al. Phone-to-audio alignment without text: A semi-supervised approach
CN114783418B (en) End-to-end voice recognition method and system based on sparse self-attention mechanism
CN108305616A (en) A kind of audio scene recognition method and device based on long feature extraction in short-term
CN110189749A (en) Voice keyword automatic identifying method
CN111798840A (en) Voice keyword recognition method and device
CN111724770B (en) Audio keyword identification method for generating confrontation network based on deep convolution
CN115019776A (en) Voice recognition model, training method thereof, voice recognition method and device
CN114495969A (en) Voice recognition method integrating voice enhancement
CN112735404A (en) Ironic detection method, system, terminal device and storage medium
CN111653270B (en) Voice processing method and device, computer readable storage medium and electronic equipment
CN114550703A (en) Training method and device of voice recognition system, and voice recognition method and device
CN115394287A (en) Mixed language voice recognition method, device, system and storage medium
CN114999460A (en) Lightweight Chinese speech recognition method combined with Transformer
Rudresh et al. Performance analysis of speech digit recognition using cepstrum and vector quantization
CN113611285B (en) Language identification method based on stacked bidirectional time sequence pooling
CN118136022A (en) Intelligent voice recognition system and method
CN117558278A (en) Self-adaptive voice recognition method and system
CN116631383A (en) Voice recognition method based on self-supervision pre-training and interactive fusion network
CN115376547B (en) Pronunciation evaluation method, pronunciation evaluation device, computer equipment and storage medium
CN114333762B (en) Expressive force-based speech synthesis method, expressive force-based speech synthesis system, electronic device and storage medium
CN114626424B (en) Data enhancement-based silent speech recognition method and device
CN113628639A (en) Voice emotion recognition method based on multi-head attention mechanism
Tailor et al. Deep learning approach for spoken digit recognition in Gujarati language

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant