CN114783418A - End-to-end voice recognition method and system based on sparse self-attention mechanism

End-to-end voice recognition method and system based on sparse self-attention mechanism

Info

Publication number
CN114783418A
CN114783418A
Authority
CN
China
Prior art keywords
attention
matrix
sparse self
speech recognition
attention mechanism
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210694730.0A
Other languages
Chinese (zh)
Other versions
CN114783418B (en)
Inventor
魏建国
杨家豪
路文焕
裴连军
付金栋
朱咏梅
刘焕志
倪景宽
赵莹莹
李政
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin Energy Investment Group Co ltd
Tianjin University
Original Assignee
Tianjin Energy Investment Group Co ltd
Tianjin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin Energy Investment Group Co ltd, Tianjin University filed Critical Tianjin Energy Investment Group Co ltd
Priority to CN202210694730.0A priority Critical patent/CN114783418B/en
Publication of CN114783418A publication Critical patent/CN114783418A/en
Application granted granted Critical
Publication of CN114783418B publication Critical patent/CN114783418B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 — Speech recognition
    • G10L15/02 — Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/06 — Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 — Training
    • G10L15/26 — Speech to text systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses an end-to-end speech recognition method and system based on a sparse self-attention mechanism, comprising the following steps: acquiring an audio data set, and performing preprocessing, data enhancement and feature extraction on the data set to obtain an acoustic feature sequence; down-sampling the acoustic feature sequence and inputting it into a sparse self-attention speech recognition model, which recognizes the input sequence using the hyper-parameters with the best training effect; and decoding the recognition result of the sparse self-attention speech recognition model to obtain the corresponding text sequence. Without increasing the complexity of the model, the number of dot-product operations is reduced and the input sequence is made to focus more on the frames of certain key time steps. The invention achieves better recognition accuracy while improving storage and time efficiency.

Description

End-to-end voice recognition method and system based on sparse self-attention mechanism
Technical Field
The invention relates to the technical field of voice recognition, in particular to an end-to-end voice recognition method and system based on a sparse self-attention mechanism.
Background
Automatic Speech Recognition (ASR) is a sequence-to-sequence task in which an input audio signal sequence is converted into the corresponding text content. With the rapid development of artificial intelligence, speech recognition technology, as one of the important means of human-computer interaction, is widely embedded in various intelligent devices and chat robots. Traditional speech recognition technology is divided into three modules, namely an acoustic model, a pronunciation dictionary and a language model, and such traditional models suffer from limitations such as inconsistent assumptions and accumulated errors. In the last decade, as deep learning has been widely applied to fields such as computer vision and natural language processing, methods replacing each sub-module of speech recognition technology with a neural network have also appeared.
In order to overcome the disadvantages of the traditional model, end-to-end speech recognition technology was developed. End-to-end speech recognition models fall mainly into two types. One is the recurrent neural network (RNN) model based on Connectionist Temporal Classification (CTC), but such RNN-based models can only process data sequentially and cannot effectively capture temporal information. The other is the Encoder-Decoder model based on the Self-Attention Mechanism; being a non-recurrent network, it can process all input speech signal frames in parallel, thus effectively solving the problem of recurrent-neural-network-based speech recognition models.
The encoder-decoder model based on the self-attention mechanism trains quickly, recognizes well, and can process all time steps of the input sequence in parallel, but it still has several obvious disadvantages. First, since attention is a weighted-average calculation, the weight distribution tends to be sparse; such a distribution makes the result focus only on the "most important information" while ignoring other "secondary information", which is usually inappropriate. Second, since the self-attention mechanism is computed with dot products, its time complexity is quadratic in the length of the input sequence. Unlike a machine translation task, a segment of speech, after signal preprocessing, often yields an input audio sequence a thousand frames long, so the huge computational cost becomes a performance bottleneck in the speech recognition task.
Recently, many approaches have been proposed to remedy the drawbacks of the self-attention mechanism in the encoder-decoder model, such as windowing the attention so that only local attention information inside the window is considered at each time point, or adding a Gaussian distribution function to the weight calculation so that frames at different time-step distances from the current frame receive different weights when their attention is computed. These methods can optimize the model, but they also bring problems such as increasing the model complexity and artificially constraining the attention process.
The traditional self-attention-based speech recognition method is limited by its weighted-average calculation and cannot effectively focus on meaningful local information. When the input audio signal is relatively long, the quadratic computational complexity also degrades recognition effectiveness and efficiency. Much research has focused on adding more artificial assumptions to the attention mechanism or shortening the input audio signal by down-sampling; however, these methods either make the model more complicated and reduce efficiency while improving performance, or degrade recognition accuracy while improving efficiency.
Disclosure of Invention
Therefore, an object of the present invention is to provide an end-to-end speech recognition method based on a sparse self-attention mechanism, which adopts a deterministic sparse attention mechanism that reduces the number of dot-product operations without increasing the complexity of the model and makes the input sequence focus more on the frames of certain key time steps. The invention achieves better recognition accuracy while improving storage and time efficiency.
In order to achieve the above object, an end-to-end speech recognition method based on a sparse self-attention mechanism of the present invention includes the following steps:
s1, acquiring an audio data set, and preprocessing, enhancing data and extracting features of the data set to obtain an acoustic feature sequence;
s2, performing down-sampling on the acoustic feature sequence, inputting the down-sampled acoustic feature sequence into a speech recognition model based on a sparse self-attention mechanism, and recognizing the input acoustic feature sequence by utilizing a hyper-parameter with an optimal training effect in the speech recognition model based on the sparse self-attention mechanism;
and S3, decoding the recognition result of the speech recognition model based on the sparse self-attention mechanism to obtain a corresponding text sequence.
Further preferably, in the sparse self-attention mechanism-based speech recognition model, the self-attention adopts multi-head dot product attention, and the multi-head dot product attention linearly transforms the input acoustic features to generate a Query matrix, a Key matrix and a Value matrix with the same dimension.
Further preferably, in S2, the construction process of the speech recognition model based on sparse self-attention mechanism includes the following steps:
s201, carrying out convolution and down-sampling on the input acoustic features to obtain a Key matrix,
S202, calculating the vector with the maximum value of the input acoustic feature sequence in each dimension globally, recording it as the global vector, and calculating the attention weight matrix SQK using the obtained global vector and the original Query matrix;
S203, sampling the Query matrix, and recalculating the attention weight using the sampled $\tilde{Q}$ matrix and the $\tilde{K}$ matrix; correlating the newly obtained attention weight with the Value matrix replaced by the mean vector to obtain the sparse self-attention relationship, giving the final speech recognition model based on the sparse self-attention mechanism.
Further preferably, in S203, the sampling the original Query matrix includes the following steps:
taking out the λ indices with the largest weights in the attention weight matrix SQK, using these indices to take out the corresponding Query vectors from the Query matrix for down-sampling, and recording the sampled Query matrix as the $\tilde{Q}$ matrix.
Further preferably, in S203, the Value matrix after being replaced by the mean vector is represented by the following formula:
$$V\_mean^{(E)} = \frac{1}{t}\sum_{i=1}^{t} v_i^{(E)};$$

wherein $V\_mean$ represents the Value matrix after replacement, $v_i^{(E)}$ represents the E-th feature value in the i-th time-step vector of the Value matrix, and t is the total number of time-step vectors.
Further preferably, in S2, the training process of the sparse self-attention mechanism-based speech recognition model includes:
inputting the acoustic characteristic sequence into an encoder for sampling;
acquiring a text data set, and extracting corresponding text features from all texts in the text data set by using a Tokenizer; inputting the text features into a decoder;
and the sampled acoustic feature sequence is used as the input of a speech recognition model based on a sparse self-attention mechanism, the decoding result of a decoder is used as the output, the speech recognition model based on the sparse self-attention mechanism is trained, the hyper-parameters are adjusted in the training process to obtain different test results, and the combination of the hyper-parameters with the optimal test results is selected as the output.
The invention also provides an end-to-end voice recognition system based on the sparse self-attention mechanism, which comprises a data acquisition module, an encoder, a voice recognition model based on the sparse self-attention mechanism and a decoder;
the data acquisition module is used for acquiring an audio data set;
the encoder is used for preprocessing, enhancing data and extracting features of the data set to obtain an acoustic feature sequence;
the voice recognition model based on the sparse self-attention mechanism performs down-sampling on the acoustic feature sequence, and recognizes the input acoustic feature sequence by utilizing the hyper-parameter with the optimal training effect;
the decoder is used for decoding the recognition result of the speech recognition model based on the sparse self-attention mechanism to obtain a corresponding text sequence.
Further preferably, the sparse self-attention mechanism-based speech recognition model includes a self-attention matrix generation module; self-attention adopts multi-head dot-product attention, which linearly transforms the input acoustic features to generate a Query matrix, a Key matrix and a Value matrix with the same dimensions.
Further preferably, the sparse self-attention mechanism-based speech recognition model comprises a self-attention weight calculation module, which is used for performing convolution down-sampling on the input acoustic features to obtain a Key matrix; calculating the vector with the maximum value of the input acoustic feature sequence in each dimension globally, recording it as the global vector, and calculating the attention weight matrix SQK using the obtained global vector and the original Query matrix; sampling the Query matrix, and recalculating the attention weight using the sampled $\tilde{Q}$ matrix and the $\tilde{K}$ matrix; and correlating the newly obtained attention weight with the Value matrix replaced by the mean vector to obtain the sparse self-attention relationship, giving the final speech recognition model based on the sparse self-attention mechanism.
Compared with the prior art, the end-to-end voice recognition method and the system based on the sparse self-attention mechanism at least have the following advantages:
1. By adopting a deterministic sparse attention mechanism, the number of dot-product operations is reduced without increasing the complexity of the model, and the input sequence is made to focus more on the frames of certain key time steps, so that better recognition accuracy is obtained while storage and time efficiency are improved.
Drawings
FIG. 1 is a schematic flow chart of an end-to-end speech recognition method based on a sparse self-attention mechanism according to the present invention.
FIG. 2 is a flow chart of the sparse self-attention mechanism of the present invention;
FIG. 3 is a diagram of an input test audio spectrogram according to the present invention;
FIG. 4 is an attention-weight heat map for the test audio output of the present invention.
Detailed Description
The invention is described in further detail below with reference to the figures and the detailed description.
As shown in fig. 1, an embodiment of an aspect of the present invention provides an end-to-end speech recognition method based on a sparse self-attention mechanism, including the following steps:
s1, acquiring an audio data set, and preprocessing, enhancing data and extracting features of the data set to obtain an acoustic feature sequence;
s2, down-sampling the acoustic feature sequence, inputting the down-sampled acoustic feature sequence into a speech recognition model based on a sparse self-attention mechanism, and recognizing the input acoustic feature sequence by using a hyper-parameter with an optimal training effect;
and S3, decoding the recognition result of the voice recognition model based on the sparse self-attention mechanism to obtain a corresponding text sequence.
It should be noted that, in the speech recognition model based on the sparse self-attention mechanism, the self-attention adopts multi-head dot product attention, and the multi-head dot product attention performs linear transformation on the input acoustic features to generate a Query matrix, a Key matrix and a Value matrix with the same dimension.
In S2, the construction process of the sparse attention-based speech recognition model includes the following steps:
s201, carrying out convolution and down-sampling on the input acoustic features to obtain a Key matrix,
S202, calculating the vector with the maximum value of the input acoustic feature sequence in each dimension globally, recording it as the global vector, and calculating the attention weight matrix SQK using the obtained global vector and the original Query matrix;
S203, sampling the Query matrix, and recalculating the attention weight using the sampled $\tilde{Q}$ matrix and the $\tilde{K}$ matrix; correlating the newly obtained attention weight with the Value matrix replaced by the mean vector to obtain the sparse self-attention relationship, giving the final speech recognition model based on the sparse self-attention mechanism.
The method for sampling the original Query matrix comprises the following steps:
taking out the λ indices with the largest weights in the attention weight matrix SQK, using these indices to take out the corresponding Query vectors from the Query matrix for down-sampling, and recording the sampled Query matrix as the $\tilde{Q}$ matrix.
The Value matrix after the replacement by the mean vector is expressed by the following formula:
$$V\_mean^{(E)} = \frac{1}{t}\sum_{i=1}^{t} v_i^{(E)};$$

where $V\_mean$ represents the Value matrix after replacement, $v_i^{(E)}$ represents the E-th feature value in the i-th time-step vector of the Value matrix, and t is the total number of time-step vectors.
In the S2, the training process of the sparse self-attention mechanism based speech recognition model includes:
inputting the acoustic feature sequence into an encoder for sampling;
acquiring a text data set, and extracting corresponding text features of all texts in the text data set by using a Tokenizer; inputting the text features into a decoder;
and the sampled acoustic feature sequence is used as the input of a speech recognition model based on a sparse self-attention mechanism, the decoding result of a decoder is used as the output, the speech recognition model based on the sparse self-attention mechanism is trained, the hyper-parameters are adjusted in the training process to obtain different test results, and the combination of the hyper-parameters with the optimal test results is selected as the output.
In the speech recognition model based on the sparse self-attention mechanism, the self-attention adopts multi-head dot product attention, and the multi-head dot product attention carries out linear transformation on input acoustic features to generate a Query matrix, a Key matrix and a Value matrix with the same dimension.
The invention also provides an end-to-end voice recognition system based on the sparse self-attention mechanism, which comprises a data acquisition module, a coder, a voice recognition model based on the sparse self-attention mechanism and a decoder;
the data acquisition module is used for acquiring an audio data set;
the encoder is used for preprocessing, enhancing data and extracting features of the data set to obtain an acoustic feature sequence;
the voice recognition model based on the sparse self-attention mechanism performs down-sampling on the acoustic feature sequence, and recognizes the input acoustic feature sequence by utilizing the hyper-parameters with the optimal training effect;
the decoder is used for decoding the recognition result of the speech recognition model based on the sparse self-attention mechanism to obtain a corresponding text sequence.
The speech recognition model based on the sparse self-attention mechanism comprises a self-attention matrix generation module; self-attention adopts multi-head dot-product attention, which performs a linear transformation on the input acoustic features to generate a Query matrix, a Key matrix and a Value matrix with the same dimensionality.
The speech recognition model based on the sparse self-attention mechanism comprises a self-attention weight calculation module, which is used for performing convolution down-sampling on the input acoustic features to obtain a Key matrix; calculating the vector with the maximum value of the input acoustic feature sequence in each dimension globally, recording it as the global vector, and calculating the attention weight matrix SQK using the obtained global vector and the original Query matrix; sampling the Query matrix, and recalculating the attention weight using the sampled $\tilde{Q}$ matrix and the $\tilde{K}$ matrix; and correlating the newly obtained attention weight with the Value matrix replaced by the mean vector to obtain the sparse self-attention relationship, giving the final speech recognition model based on the sparse self-attention mechanism.
In one embodiment of the invention, in step one, a Chinese audio data set is prepared and the audio data is subjected to preprocessing, data enhancement and feature extraction; a text tokenizer is pre-trained so that the training targets better conform to the characteristics of the language, improving the recognition of rare words.
The process in the first step of the invention comprises the following steps:
and step 11, data parameters. The sampling rate was 16 kHz.
Step 12, pre-emphasis, framing and windowing are performed on the sampled original signal. The frame length and frame shift are set to 25 ms and 10 ms, respectively.
Step 13, the noise data in RIRS_NOISES is used to perform data enhancement, noise addition and reverberation processing on the preprocessed audio data.
Step 14, acoustic features are extracted and convolution down-sampling is performed on the features. A short-time Fourier transform (STFT) is applied to each frame of the signal to obtain a short-time magnitude spectrum, and 80-dimensional acoustic features are obtained through a Mel filter bank (Fbank). Down-sampling uses two 3x3 convolutional layers, each with a stride of 2 in both dimensions, and the output feature dimension is 256. The acoustic feature sequence after convolutional down-sampling can be represented as:

$$X = (x_1, x_2, \ldots, x_n)$$

where n represents the length of the audio signal and $x_m$ represents the audio feature of the m-th frame.
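A minimal PyTorch sketch of steps 12-14 (framing/windowing, Fbank extraction, and convolutional down-sampling) is given below for clarity. The torchaudio calls and the exact module layout are our illustrative assumptions; the patent itself only fixes the parameters (16 kHz input, 25 ms/10 ms frames, 80 Mel bins, two 3x3 stride-2 convolutions, 256 output dimensions).

```python
import torch
import torchaudio

def extract_fbank(wav_path: str) -> torch.Tensor:
    """Load 16 kHz audio and compute 80-dim Fbank features (25 ms frames, 10 ms shift)."""
    waveform, sample_rate = torchaudio.load(wav_path)
    return torchaudio.compliance.kaldi.fbank(
        waveform,
        sample_frequency=sample_rate,   # 16 kHz per step 11
        num_mel_bins=80,                # 80-dimensional acoustic features
        frame_length=25.0,              # frame length 25 ms
        frame_shift=10.0,               # frame shift 10 ms
        preemphasis_coefficient=0.97,   # pre-emphasis from step 12
    )                                   # shape: (num_frames, 80)

class ConvSubsampling(torch.nn.Module):
    """Two 3x3 convolutions with stride 2, projecting to a 256-dim sequence."""
    def __init__(self, out_dim: int = 256):
        super().__init__()
        self.conv = torch.nn.Sequential(
            torch.nn.Conv2d(1, out_dim, kernel_size=3, stride=2), torch.nn.ReLU(),
            torch.nn.Conv2d(out_dim, out_dim, kernel_size=3, stride=2), torch.nn.ReLU(),
        )
        # two stride-2 convolutions shrink the 80 Mel bins to 19 frequency positions
        self.proj = torch.nn.Linear(out_dim * 19, out_dim)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        x = self.conv(feats.unsqueeze(1))   # (batch, frames, 80) -> (batch, 256, t', f')
        b, c, t, f = x.shape
        return self.proj(x.transpose(1, 2).reshape(b, t, c * f))   # (batch, t', 256)
```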
Step 15, a subword Tokenizer is pre-trained on all texts in the dataset. Using this Tokenizer, the corresponding text features are obtained; the extracted text feature sequence can be expressed as:

$$Y = (y_1, y_2, \ldots, y_m)$$

where m represents the length of the token sequence extracted from the text and $y_m$ represents the m-th token feature.
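The patent does not name a specific Tokenizer implementation; as one possibility, a subword model can be pre-trained with the SentencePiece library as sketched below (the file names and vocabulary size are hypothetical).

```python
import sentencepiece as spm

# Pre-train a subword tokenizer on all transcripts of the dataset
# ("train_transcripts.txt" is a hypothetical path, one transcript per line).
spm.SentencePieceTrainer.train(
    input="train_transcripts.txt",
    model_prefix="asr_tokenizer",
    vocab_size=5000,          # illustrative value
    model_type="unigram",
)

# Extract the token sequence Y = (y_1, ..., y_m) for one transcript.
sp = spm.SentencePieceProcessor(model_file="asr_tokenizer.model")
token_ids = sp.encode("一段中文转写文本", out_type=int)
```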
Fig. 2 is a flow chart of the sparse self-attention mechanism proposed by the present invention.
Step two, building the speech recognition model based on the sparse self-attention mechanism. The model is built on the Speech-Transformer; aiming at the defects of the original model, a sparse self-attention mechanism is proposed to replace the multi-head self-attention module at the core of the model. The network comprises m encoder modules and n decoder modules, where m and n are both set to 12; the extracted acoustic features are input to the encoder, and the text features are input to the decoder.
Step 21, a conventional multi-head dot product attention method.
First, the input acoustic features are linearly transformed.
The formula for the linear transformation is:

$$Q = XW^{Q},\qquad K = XW^{K},\qquad V = XW^{V}$$

where Q, K and V respectively denote the Query, Key and Value matrices, whose dimensions are the same as those of the input speech features, and $W^{Q}$, $W^{K}$ and $W^{V}$ are the weight matrices required for the linear transformations.
The traditional calculation formula of multi-head dot product attention can be described as follows:
$$\operatorname{Attention}(Q, K, V) = \operatorname{softmax}\!\left(\frac{QK^{\mathsf T}}{\sqrt{d}}\right)V$$

where d represents the dimension of the data, i.e., the hidden dimension of the acoustic features after convolutional down-sampling, and V is the Value matrix.
For each attention module, the attention values of the individual heads are concatenated; this process can be described as:

$$\operatorname{MultiHead}(Q, K, V) = \operatorname{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\,W^{O}$$

where h is the number of heads and $W^{O}$ is the weight matrix of the linear transformation applied after the multi-head concatenation.
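For reference, the conventional multi-head dot-product attention of step 21 can be sketched in PyTorch as follows; the per-call creation of the linear layers with random weights is for illustration only (in the actual model these are trained parameters of each encoder/decoder block).

```python
import math
import torch

def multi_head_attention(x: torch.Tensor, h: int = 4) -> torch.Tensor:
    """Conventional multi-head dot-product self-attention; x: (batch, n, d)."""
    b, n, d = x.shape
    w_q, w_k, w_v, w_o = (torch.nn.Linear(d, d) for _ in range(4))
    q, k, v = w_q(x), w_k(x), w_v(x)            # Q = XW^Q, K = XW^K, V = XW^V
    split = lambda m: m.view(b, n, h, d // h).transpose(1, 2)   # (b, h, n, d/h)
    q, k, v = split(q), split(k), split(v)
    # softmax(QK^T / sqrt(d_head)) V, computed independently in each head
    weights = torch.softmax(q @ k.transpose(-2, -1) / math.sqrt(d // h), dim=-1)
    heads = weights @ v                          # (b, h, n, d/h)
    # concatenate the h heads and apply the output transformation W^O
    return w_o(heads.transpose(1, 2).reshape(b, n, d))

out = multi_head_attention(torch.randn(2, 100, 256))   # d = 256 as in the patent
```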
And step 22, constructing a sparse self-attention mechanism-based encoder.
In order to extract the truly key frames in the input acoustic feature sequence, a Query selector is implemented. First, the encoder input features are linearly transformed to obtain the Key matrix, and a down-sampling operation is performed on the Key: the vectors with the larger values across the dimensions of the input acoustic feature sequence are extracted and recorded as $\tilde{K}$. The formula for selecting the Key can be described as:

$$\tilde{K} = K\big[\operatorname{argsort}_{\downarrow}\big(\max{}_{D}(k_i)\big)\big]$$

where $\operatorname{argsort}_{\downarrow}$ orders all time-step vectors of the Key matrix from largest to smallest by their maximum value over the feature dimension D, the i-th entry corresponding to the vector with the i-th largest value. The obtained global vector $\tilde{K}$ is then used together with the original Query matrix to calculate the attention weight, and the formula for calculating the attention weight at this step can be described as:
$$SQK = \operatorname{softmax}\!\left(\frac{Q\tilde{K}^{\mathsf T}}{\sqrt{d}}\right)$$
after the weight matrix SQK is obtained, the Query matrix is sampled. Specifically, the lambda subscripts with the maximum weight in the SQK matrix are taken out, the subscripts are used for taking out the corresponding Query vectors in the Query matrix for down-sampling, and the sampled Query matrix is recorded as
Figure 622497DEST_PATH_IMAGE001
Figure 359509DEST_PATH_IMAGE019
Where λ is the corresponding hyperparameter.
After the sampled $\tilde{Q}$ and $\tilde{K}$ matrices are obtained through the above steps, the attention weight is recalculated; the formula at this step can be described as:

$$\operatorname{Attention}(\tilde{Q}, \tilde{K}, V) = \operatorname{softmax}\!\left(\frac{\tilde{Q}\tilde{K}^{\mathsf T}}{\sqrt{d}}\right)V$$

where V is the Value matrix.
The other Query vectors in the Query matrix are dropped; when calculating the final self-attention layer output, they are replaced by the global mean Value vector, denoted V_mean:

$$V\_mean^{(E)} = \frac{1}{t}\sum_{i=1}^{t} v_i^{(E)};$$

where $V\_mean$ represents the Value matrix after replacement, $v_i^{(E)}$ represents the E-th feature value in the i-th time-step vector of the Value matrix, and t is the total number of time-step vectors.
The resulting sparse self-attention module output formula can be described as:
$$\operatorname{output}_i = \begin{cases} \Big[\operatorname{softmax}\!\big(\frac{\tilde{Q}\tilde{K}^{\mathsf T}}{\sqrt{d}}\big)V\Big]_i, & i \in \operatorname{Top}_{\lambda}(SQK) \\ V\_mean, & \text{otherwise} \end{cases}$$

that is, the λ sampled query positions take the recalculated attention output, while all remaining positions take the global mean vector V_mean.
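Putting the pieces of step 22 together, a single-head sketch of the sparse self-attention forward pass is given below. Where the text leaves details open, we make assumptions and mark them: the number of retained key vectors (`n_keys`), tie-breaking in the top-k selections, and restricting V to the sampled key positions when recomputing the attention.

```python
import math
import torch

def sparse_self_attention(q, k, v, lam: int, n_keys: int):
    """q, k, v: (n, d) Query/Key/Value; lam: number of sampled queries (hyperparameter)."""
    n, d = q.shape
    # 1) Down-sample the Key matrix: keep the n_keys time steps whose maximum
    #    value over the feature dimension is largest (the global vector K~).
    key_idx = k.max(dim=-1).values.topk(n_keys).indices
    k_tilde, v_tilde = k[key_idx], v[key_idx]   # restricting V here is our assumption
    # 2) Attention weights SQK between the original queries and K~.
    sqk = torch.softmax(q @ k_tilde.T / math.sqrt(d), dim=-1)      # (n, n_keys)
    # 3) Sample the lam queries carrying the largest weights in SQK (Q~).
    query_idx = sqk.max(dim=-1).values.topk(lam).indices
    q_tilde = q[query_idx]
    # 4) Recompute the attention with the sampled Q~ and K~.
    attn = torch.softmax(q_tilde @ k_tilde.T / math.sqrt(d), dim=-1) @ v_tilde
    # 5) Dropped query positions are replaced by the global mean Value vector V_mean.
    out = v.mean(dim=0).expand(n, d).clone()
    out[query_idx] = attn
    return out

out = sparse_self_attention(torch.randn(100, 256), torch.randn(100, 256),
                            torch.randn(100, 256), lam=25, n_keys=50)
```

Compared with the n² dot products of full self-attention, this sketch needs only n·n_keys + λ·n_keys dot products, which is where the storage and time savings discussed below come from.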
and step three, training the voice recognition model, storing the voice recognition model with the optimal verification set result, and performing decoding test on the test set by using the model.
Step 31, training phase. We construct the model on the basis of the Speech-Transformer and use Adam and SGD as a two-stage optimizer. The data dimensions of the attention module and the hidden layer are set to 256 and 2048, respectively, and after the training parameters are set the model is trained for a total of 50 epochs.
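A minimal sketch of the two-stage optimization is shown below; the switch epoch and learning rates are hypothetical, as the patent fixes only the optimizers (Adam, then SGD), the 256/2048 dimensions, and the 50 training epochs.

```python
import torch

model = torch.nn.Transformer(d_model=256, dim_feedforward=2048)  # stand-in for the Speech-Transformer
adam = torch.optim.Adam(model.parameters(), lr=1e-3)
sgd = torch.optim.SGD(model.parameters(), lr=1e-4, momentum=0.9)
SWITCH_EPOCH = 40   # hypothetical point where training switches from Adam to SGD

for epoch in range(50):
    optimizer = adam if epoch < SWITCH_EPOCH else sgd
    # ... one pass over the training data, calling optimizer.zero_grad(),
    #     loss.backward() and optimizer.step() per batch
```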
Step 32, test phase. Models are trained with different hyper-parameter settings, and the combination of hyper-parameters with the best test-set result is finally selected as the output. Fig. 3 shows a spectrogram of a test audio input during testing. As shown in Table 1, we compare the recognition performance of several different speech recognition models on the data set; the end-to-end speech recognition method based on the sparse self-attention mechanism provided by the invention achieves the best result. FIG. 4 is the attention-weight heat map for the test audio output.
TABLE 1 comparison of time complexity and speech recognition word error rates for different models
Table 2 compares the storage and time efficiency of the proposed method for input audio feature sequences of different lengths.
TABLE 2 Comparison of self-attention layer decoding time
It should be understood that the above examples are only for clarity of illustration and are not intended to limit the embodiments. Other variations or modifications will be apparent to persons skilled in the art in light of the above description; it is neither necessary nor possible to exhaustively enumerate all embodiments, and obvious variations or modifications derived therefrom remain within the scope of the invention.

Claims (9)

1. An end-to-end speech recognition method based on a sparse self-attention mechanism is characterized by comprising the following steps:
S1, acquiring an audio data set, and performing preprocessing, data enhancement and feature extraction on the data set to obtain an acoustic feature sequence;
s2, down-sampling the acoustic feature sequence, inputting the down-sampled acoustic feature sequence into a speech recognition model based on a sparse self-attention mechanism, and recognizing the input acoustic feature sequence by using a hyper-parameter with an optimal training effect;
and S3, decoding the recognition result of the speech recognition model based on the sparse self-attention mechanism to obtain a corresponding text sequence.
2. The end-to-end speech recognition method based on the sparse self-attention mechanism as claimed in claim 1, wherein in the speech recognition model based on the sparse self-attention mechanism, the self-attention adopts multi-head dot product attention, and the multi-head dot product attention linearly transforms the input acoustic features to generate a Query matrix, a Key matrix and a Value matrix with the same dimension.
3. The sparse self-attention mechanism-based end-to-end speech recognition method according to claim 2, wherein in S2, the sparse self-attention mechanism-based speech recognition model building process comprises the following steps:
s201, carrying out convolution and down-sampling on the input acoustic features to obtain a Key matrix,
s202, calculating a vector with the maximum value of the input acoustic feature sequence in each global dimension, recording the vector as a global vector, and calculating an attention weight matrix SQK by using the obtained global vector and an original Query matrix;
S203, sampling the Query matrix, and recalculating the attention weight using the sampled $\tilde{Q}$ matrix and the $\tilde{K}$ matrix; correlating the newly obtained attention weight with the Value matrix replaced by the mean vector to obtain the sparse self-attention relationship, as the final speech recognition model based on the sparse self-attention mechanism.
4. The sparse self-attention mechanism-based end-to-end speech recognition method of claim 3, wherein in S203, sampling the original Query matrix comprises the following steps:
taking out the λ indices with the largest weights in the attention weight matrix SQK, using these indices to take out the corresponding Query vectors from the Query matrix for down-sampling, and recording the sampled Query matrix as the $\tilde{Q}$ matrix.
5. The sparse self-attention mechanism-based end-to-end speech recognition method of claim 3, wherein in S203, the Value matrix after the replacement with the mean vector is represented by the following formula:
$$V\_mean^{(E)} = \frac{1}{t}\sum_{i=1}^{t} v_i^{(E)};$$

wherein $V\_mean$ represents the Value matrix after replacement, $v_i^{(E)}$ represents the E-th feature value in the i-th time-step vector of the Value matrix, and t is the total number of time-step vectors.
6. The sparse self-attention mechanism-based end-to-end speech recognition method of claim 1, wherein in the S2, the training process of the sparse self-attention mechanism-based speech recognition model comprises:
inputting the acoustic characteristic sequence into an encoder for sampling;
acquiring a text data set, and extracting corresponding text features of all texts in the text data set by using a Tokenizer; inputting the text features into a decoder;
and the sampled acoustic feature sequence is used as the input of a speech recognition model based on a sparse self-attention mechanism, the decoding result of a decoder is used as the output, the speech recognition model based on the sparse self-attention mechanism is trained, the hyper-parameters are adjusted in the training process to obtain different test results, and the combination of the hyper-parameters with the optimal test results is selected as the output.
7. An end-to-end speech recognition system based on a sparse self-attention mechanism is characterized by comprising a data acquisition module, an encoder, a speech recognition model based on the sparse self-attention mechanism and a decoder;
the data acquisition module is used for acquiring an audio data set;
the encoder is used for preprocessing, enhancing data and extracting features of the data set to obtain an acoustic feature sequence;
the voice recognition model based on the sparse self-attention mechanism performs down-sampling on the acoustic feature sequence, and recognizes the input acoustic feature sequence by utilizing the hyper-parameter with the optimal training effect;
the decoder is used for decoding the recognition result of the voice recognition model based on the sparse self-attention mechanism to obtain a corresponding text sequence.
8. The sparse self-attention mechanism-based end-to-end speech recognition system of claim 7, wherein the sparse self-attention mechanism-based speech recognition model includes a self-attention matrix generation module; self-attention adopts multi-head dot-product attention, which linearly transforms the input acoustic features to generate a Query matrix, a Key matrix and a Value matrix with the same dimensions.
9. The sparse self-attention mechanism-based end-to-end speech recognition system of claim 8, wherein the sparse self-attention mechanism-based speech recognition model comprises a self-attention weight calculation module configured to perform convolution down-sampling on the input acoustic features to obtain a Key matrix; calculate the vector with the maximum value of the input acoustic feature sequence in each dimension globally, record it as the global vector, and calculate the attention weight matrix SQK using the obtained global vector and the original Query matrix; sample the Query matrix, and recalculate the attention weight using the sampled $\tilde{Q}$ matrix and the $\tilde{K}$ matrix; and correlate the newly obtained attention weight with the Value matrix replaced by the mean vector to obtain the sparse self-attention relationship, as the final speech recognition model based on the sparse self-attention mechanism.
CN202210694730.0A 2022-06-20 2022-06-20 End-to-end voice recognition method and system based on sparse self-attention mechanism Active CN114783418B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210694730.0A CN114783418B (en) 2022-06-20 2022-06-20 End-to-end voice recognition method and system based on sparse self-attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210694730.0A CN114783418B (en) 2022-06-20 2022-06-20 End-to-end voice recognition method and system based on sparse self-attention mechanism

Publications (2)

Publication Number Publication Date
CN114783418A true CN114783418A (en) 2022-07-22
CN114783418B CN114783418B (en) 2022-08-23

Family

ID=82420363

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210694730.0A Active CN114783418B (en) 2022-06-20 2022-06-20 End-to-end voice recognition method and system based on sparse self-attention mechanism

Country Status (1)

Country Link
CN (1) CN114783418B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115081752A (en) * 2022-08-11 2022-09-20 浙江君同智能科技有限责任公司 Black and gray production crowdsourcing flow prediction device and method
CN115796407A (en) * 2023-02-13 2023-03-14 中建科技集团有限公司 Production line fault prediction method and related equipment

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200043483A1 (en) * 2018-08-01 2020-02-06 Google Llc Minimum word error rate training for attention-based sequence-to-sequence models
CN113140220A (en) * 2021-04-12 2021-07-20 西北工业大学 Lightweight end-to-end speech recognition method based on convolution self-attention transformation network
CN113380232A (en) * 2021-06-15 2021-09-10 哈尔滨工业大学 End-to-end speech recognition method based on constraint structured sparse attention mechanism and storage medium
CN113642646A (en) * 2021-08-13 2021-11-12 重庆邮电大学 Image threat article classification and positioning method based on multiple attention and semantics
WO2022121150A1 (en) * 2020-12-10 2022-06-16 平安科技(深圳)有限公司 Speech recognition method and apparatus based on self-attention mechanism and memory network

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200043483A1 (en) * 2018-08-01 2020-02-06 Google Llc Minimum word error rate training for attention-based sequence-to-sequence models
WO2022121150A1 (en) * 2020-12-10 2022-06-16 平安科技(深圳)有限公司 Speech recognition method and apparatus based on self-attention mechanism and memory network
CN113140220A (en) * 2021-04-12 2021-07-20 西北工业大学 Lightweight end-to-end speech recognition method based on convolution self-attention transformation network
CN113380232A (en) * 2021-06-15 2021-09-10 哈尔滨工业大学 End-to-end speech recognition method based on constraint structured sparse attention mechanism and storage medium
CN113642646A (en) * 2021-08-13 2021-11-12 重庆邮电大学 Image threat article classification and positioning method based on multiple attention and semantics

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Timo Lohrenz et al.: "Relaxed Attention: A Simple Method to Boost Performance of End-to-End Automatic Speech Recognition", arXiv:2107.01275v2 *
Zhifu Gao et al.: "SAN-M: Memory Equipped Self-Attention for End-to-End Speech Recognition", arXiv:2006.01713v1 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115081752A (en) * 2022-08-11 2022-09-20 浙江君同智能科技有限责任公司 Black and gray production crowdsourcing flow prediction device and method
CN115796407A (en) * 2023-02-13 2023-03-14 中建科技集团有限公司 Production line fault prediction method and related equipment

Also Published As

Publication number Publication date
CN114783418B (en) 2022-08-23

Similar Documents

Publication Publication Date Title
CN110491416B (en) Telephone voice emotion analysis and identification method based on LSTM and SAE
Gaikwad et al. A review on speech recognition technique
Zhu et al. Phone-to-audio alignment without text: A semi-supervised approach
CN114783418B (en) End-to-end voice recognition method and system based on sparse self-attention mechanism
CN108305616A (en) A kind of audio scene recognition method and device based on long feature extraction in short-term
CN110189749A (en) Voice keyword automatic identifying method
CN111798840A (en) Voice keyword recognition method and device
CN111724770B (en) Audio keyword identification method for generating confrontation network based on deep convolution
CN115019776A (en) Voice recognition model, training method thereof, voice recognition method and device
CN114495969A (en) Voice recognition method integrating voice enhancement
CN112735404A (en) Ironic detection method, system, terminal device and storage medium
CN111653270B (en) Voice processing method and device, computer readable storage medium and electronic equipment
CN114550703A (en) Training method and device of voice recognition system, and voice recognition method and device
CN115394287A (en) Mixed language voice recognition method, device, system and storage medium
CN114999460A (en) Lightweight Chinese speech recognition method combined with Transformer
Rudresh et al. Performance analysis of speech digit recognition using cepstrum and vector quantization
CN113611285B (en) Language identification method based on stacked bidirectional time sequence pooling
CN118136022A (en) Intelligent voice recognition system and method
CN117558278A (en) Self-adaptive voice recognition method and system
CN116631383A (en) Voice recognition method based on self-supervision pre-training and interactive fusion network
CN115376547B (en) Pronunciation evaluation method, pronunciation evaluation device, computer equipment and storage medium
CN114333762B (en) Expressive force-based speech synthesis method, expressive force-based speech synthesis system, electronic device and storage medium
CN114626424B (en) Data enhancement-based silent speech recognition method and device
CN113628639A (en) Voice emotion recognition method based on multi-head attention mechanism
Tailor et al. Deep learning approach for spoken digit recognition in Gujarati language

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant