CN114783418A - End-to-end voice recognition method and system based on sparse self-attention mechanism - Google Patents
End-to-end speech recognition method and system based on sparse self-attention mechanism
- Publication number: CN114783418A
- Application number: CN202210694730.0A
- Authority: CN (China)
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion)
Classifications
- G10L15/02: Feature extraction for speech recognition; selection of recognition unit
- G10L15/063: Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/26: Speech to text systems
Abstract
The invention discloses an end-to-end speech recognition method and system based on a sparse self-attention mechanism, comprising the following steps: acquiring an audio data set, and performing preprocessing, data enhancement and feature extraction on the data set to obtain an acoustic feature sequence; down-sampling the acoustic feature sequence and inputting it into a sparse self-attention speech recognition model, which recognizes the input acoustic feature sequence using the hyper-parameters that achieved the best training results; and decoding the recognition result of the sparse self-attention speech recognition model to obtain the corresponding text sequence. Without increasing model complexity, the number of dot-product operations is reduced and the input sequence is made to focus more on frames at certain key time steps. The invention achieves better recognition accuracy while improving storage and time efficiency.
Description
Technical Field
The invention relates to the technical field of voice recognition, in particular to an end-to-end voice recognition method and system based on a sparse self-attention mechanism.
Background
Automatic Speech Recognition (ASR) is a sequence-to-sequence task in which an input audio signal sequence is converted into the corresponding text content. With the rapid development of artificial intelligence, speech recognition technology, as one of the important modes of human-computer interaction, is widely embedded in various intelligent devices and chat robots. Traditional speech recognition systems are divided into three modules: an acoustic model, a pronunciation dictionary and a language model; such systems suffer from limitations including inconsistent assumptions and accumulated errors. Over the last decade, as deep learning has been widely applied to computer vision and natural language processing, methods that replace each sub-module of the speech recognition pipeline with a neural network have also emerged.
To overcome the disadvantages of the traditional pipeline, End-to-End speech recognition technology was developed. End-to-end speech recognition models fall mainly into two types. One is the recurrent neural network model based on Connectionist Temporal Classification (CTC); however, such recurrent-network-based models can only process data sequentially and cannot be parallelized efficiently. The other is the Encoder-Decoder model based on the Self-Attention Mechanism; the encoder-decoder model is a non-recurrent network that can process all input speech signal frames in parallel, effectively solving the problem of recurrent-network-based speech recognition models.
The encoder-decoder model based on the self-attention mechanism trains quickly, recognizes well, and can process all time steps of the input sequence in parallel, but it still has several obvious disadvantages. First, since attention is a weighted-average calculation, its weight distribution is prone to degenerate: the result often focuses only on the "most important" information while ignoring other "secondary" information, which is usually inappropriate. Second, since the self-attention mechanism relies on dot-product operations, its time complexity is quadratic in the length of the input sequence. Unlike a machine translation task, after signal preprocessing the input audio sequence of a speech signal can easily reach a thousand frames, so this enormous computational cost becomes a performance bottleneck in speech recognition.
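The quadratic cost can be made concrete with a minimal numpy sketch (illustrative only, not the patent's implementation): self-attention forms an n x n weight matrix, so a 1000-frame input already requires a million query-key dot products.

```python
import numpy as np

def attention_weights(q, k):
    """Scaled dot-product attention weights: one dot product per (query, key) pair."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)          # (n, n) -- quadratic in sequence length
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return w / w.sum(axis=-1, keepdims=True)

n, d = 1000, 256                            # ~1000 frames is common after preprocessing
rng = np.random.default_rng(0)
x = rng.standard_normal((n, d))
w = attention_weights(x, x)
print(w.shape, n * n)                       # (1000, 1000) -- 1,000,000 dot products
```

Each row of the weight matrix is a softmax distribution over all n frames, which is exactly the weighted average the text criticizes.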
Recently, many approaches have been proposed to remedy the drawbacks of the self-attention mechanism in the encoder-decoder model, for example windowing the attention so that each time point attends only to local information inside the window, or adding a Gaussian distribution function to the weight calculation so that frames at different time-step distances receive different weights when computing the attention between the current frame and other frames. These methods can optimize the model, but they also increase model complexity and artificially restrict the attention process.
Traditional speech recognition methods based on the self-attention mechanism are limited by the weighted-average calculation and cannot effectively focus on meaningful local information. When the input audio signal is long, the quadratic computational complexity also degrades recognition effectiveness and efficiency. Much research has focused on adding more artificial assumptions to the attention mechanism or on shortening the input audio signal by down-sampling; however, these methods either make the model more complicated and reduce efficiency while improving performance, or degrade recognition accuracy while improving efficiency.
Disclosure of Invention
Therefore, an object of the present invention is to provide an end-to-end speech recognition method based on a sparse self-attention mechanism that adopts a deterministic sparse attention mechanism, reduces the number of dot-product operations without increasing model complexity, and makes the input sequence focus more on frames at certain key time steps. The invention achieves better recognition accuracy while improving storage and time efficiency.
In order to achieve the above object, the end-to-end speech recognition method based on a sparse self-attention mechanism of the present invention comprises the following steps:
S1, acquiring an audio data set, and performing preprocessing, data enhancement and feature extraction on the data set to obtain an acoustic feature sequence;
S2, down-sampling the acoustic feature sequence, inputting it into a speech recognition model based on a sparse self-attention mechanism, and recognizing the input acoustic feature sequence in that model using the hyper-parameters that achieved the best training results;
S3, decoding the recognition result of the speech recognition model based on the sparse self-attention mechanism to obtain the corresponding text sequence.
Further preferably, in the speech recognition model based on the sparse self-attention mechanism, self-attention uses multi-head dot-product attention, which linearly transforms the input acoustic features to generate Query, Key and Value matrices of the same dimension.
Further preferably, in S2, the speech recognition model based on the sparse self-attention mechanism is constructed by the following steps:
S201, performing convolution and down-sampling on the input acoustic features to obtain a Key matrix;
S202, computing the vector of per-dimension global maxima of the input acoustic feature sequence, recording it as the global vector, and computing the attention weight matrix SQK from the obtained global vector and the original Query matrix;
S203, sampling the Query matrix, and recalculating the attention weights from the sampled Query matrix and the down-sampled Key matrix; combining the newly obtained attention weights with the Value matrix after mean-vector replacement to obtain the sparse self-attention relationship used in the final speech recognition model.
Further preferably, in S203, sampling the original Query matrix comprises the following steps:
taking out the λ subscripts with the largest weights in the attention weight matrix SQK, and using these subscripts to take out the corresponding Query vectors from the Query matrix for down-sampling, thereby obtaining the sampled Query matrix.
Further preferably, in S203, the Value matrix after mean-vector replacement is given by:
V_mean_e = (1/t) · Σ_{i=1}^{t} V_{i,e}
where V_mean denotes the replaced Value matrix, V_{i,e} is the e-th feature value in the i-th time-step vector of the Value matrix, and t is the total number of time-step vectors.
Further preferably, in S2, the training process of the speech recognition model based on the sparse self-attention mechanism comprises:
inputting the acoustic feature sequence into an encoder for sampling;
acquiring a text data set, and extracting the corresponding text features from all texts in the text data set using a Tokenizer; inputting the text features into a decoder;
taking the sampled acoustic feature sequence as the input of the speech recognition model based on the sparse self-attention mechanism and the decoder's decoding result as the output, training the model, adjusting the hyper-parameters during training to obtain different test results, and selecting the hyper-parameter combination with the best test result as output.
The invention also provides an end-to-end speech recognition system based on the sparse self-attention mechanism, comprising a data acquisition module, an encoder, a speech recognition model based on the sparse self-attention mechanism, and a decoder;
the data acquisition module is used for acquiring an audio data set;
the encoder is used for preprocessing, data enhancement and feature extraction on the data set to obtain an acoustic feature sequence;
the speech recognition model based on the sparse self-attention mechanism down-samples the acoustic feature sequence and recognizes the input acoustic feature sequence using the hyper-parameters that achieved the best training results;
the decoder is used for decoding the recognition result of the speech recognition model based on the sparse self-attention mechanism to obtain the corresponding text sequence.
Further preferably, the speech recognition model based on the sparse self-attention mechanism comprises a self-attention matrix generation module, in which self-attention uses multi-head dot-product attention; the multi-head dot-product attention linearly transforms the input acoustic features to generate Query, Key and Value matrices of the same dimension.
Further preferably, the speech recognition model based on the sparse self-attention mechanism comprises a self-attention weight calculation module, which performs convolution and down-sampling on the input acoustic features to obtain a Key matrix; computes the vector of per-dimension global maxima of the input acoustic feature sequence, recorded as the global vector, and computes the attention weight matrix SQK from the obtained global vector and the original Query matrix; samples the Query matrix and recalculates the attention weights from the sampled Query matrix and the down-sampled Key matrix; and combines the newly obtained attention weights with the Value matrix after mean-vector replacement to obtain the sparse self-attention relationship used in the final speech recognition model.
Compared with the prior art, the end-to-end speech recognition method and system based on the sparse self-attention mechanism have at least the following advantage:
1. By adopting a deterministic sparse attention mechanism, the number of dot-product operations is reduced without increasing model complexity, and the input sequence is made to focus more on frames at certain key time steps, achieving better recognition accuracy while improving storage and time efficiency.
Drawings
FIG. 1 is a schematic flow chart of an end-to-end speech recognition method based on a sparse self-attention mechanism according to the present invention.
FIG. 2 is a flow chart of the sparse self-attention mechanism of the present invention;
FIG. 3 is a diagram of an input test audio spectrogram according to the present invention;
FIG. 4 is a test audio output attention weight hotspot graph of the present invention.
Detailed Description
The invention is described in further detail below with reference to the figures and the detailed description.
As shown in fig. 1, an embodiment of an aspect of the present invention provides an end-to-end speech recognition method based on a sparse self-attention mechanism, including the following steps:
S1, acquiring an audio data set, and performing preprocessing, data enhancement and feature extraction on the data set to obtain an acoustic feature sequence;
S2, down-sampling the acoustic feature sequence, inputting it into a speech recognition model based on a sparse self-attention mechanism, and recognizing the input acoustic feature sequence using the hyper-parameters that achieved the best training results;
S3, decoding the recognition result of the speech recognition model based on the sparse self-attention mechanism to obtain the corresponding text sequence.
It should be noted that, in the speech recognition model based on the sparse self-attention mechanism, self-attention uses multi-head dot-product attention, which linearly transforms the input acoustic features to generate Query, Key and Value matrices of the same dimension.
In S2, the construction of the speech recognition model based on the sparse self-attention mechanism comprises the following steps:
S201, performing convolution and down-sampling on the input acoustic features to obtain a Key matrix;
S202, computing the vector of per-dimension global maxima of the input acoustic feature sequence, recording it as the global vector, and computing the attention weight matrix SQK from the obtained global vector and the original Query matrix;
S203, sampling the Query matrix, and recalculating the attention weights from the sampled Query matrix and the down-sampled Key matrix; combining the newly obtained attention weights with the Value matrix after mean-vector replacement to obtain the sparse self-attention relationship used in the final speech recognition model.
Sampling the original Query matrix comprises the following steps:
taking out the λ subscripts with the largest weights in the attention weight matrix SQK, and using these subscripts to take out the corresponding Query vectors from the Query matrix for down-sampling, thereby obtaining the sampled Query matrix.
The Value matrix after mean-vector replacement is given by:
V_mean_e = (1/t) · Σ_{i=1}^{t} V_{i,e}
where V_mean denotes the replaced Value matrix, V_{i,e} is the e-th feature value in the i-th time-step vector of the Value matrix, and t is the total number of time-step vectors.
In S2, the training process of the speech recognition model based on the sparse self-attention mechanism comprises:
inputting the acoustic feature sequence into an encoder for sampling;
acquiring a text data set, and extracting the corresponding text features from all texts in the text data set using a Tokenizer; inputting the text features into a decoder;
taking the sampled acoustic feature sequence as the input of the speech recognition model based on the sparse self-attention mechanism and the decoder's decoding result as the output, training the model, adjusting the hyper-parameters during training to obtain different test results, and selecting the hyper-parameter combination with the best test result as output.
In the speech recognition model based on the sparse self-attention mechanism, self-attention uses multi-head dot-product attention, which linearly transforms the input acoustic features to generate Query, Key and Value matrices of the same dimension.
The invention also provides an end-to-end speech recognition system based on the sparse self-attention mechanism, comprising a data acquisition module, an encoder, a speech recognition model based on the sparse self-attention mechanism, and a decoder;
the data acquisition module is used for acquiring an audio data set;
the encoder is used for preprocessing, data enhancement and feature extraction on the data set to obtain an acoustic feature sequence;
the speech recognition model based on the sparse self-attention mechanism down-samples the acoustic feature sequence and recognizes the input acoustic feature sequence using the hyper-parameters that achieved the best training results;
the decoder is used for decoding the recognition result of the speech recognition model based on the sparse self-attention mechanism to obtain the corresponding text sequence.
The speech recognition model based on the sparse self-attention mechanism comprises a self-attention matrix generation module, in which self-attention uses multi-head dot-product attention; the multi-head dot-product attention linearly transforms the input acoustic features to generate Query, Key and Value matrices of the same dimension.
The speech recognition model based on the sparse self-attention mechanism comprises a self-attention weight calculation module, which performs convolution and down-sampling on the input acoustic features to obtain a Key matrix; computes the vector of per-dimension global maxima of the input acoustic feature sequence, recorded as the global vector, and computes the attention weight matrix SQK from the obtained global vector and the original Query matrix; samples the Query matrix and recalculates the attention weights from the sampled Query matrix and the down-sampled Key matrix; and combines the newly obtained attention weights with the Value matrix after mean-vector replacement to obtain the sparse self-attention relationship used in the final speech recognition model.
In one embodiment of the invention, in step one, a Chinese audio data set is prepared, and the audio data are preprocessed, augmented and feature-extracted; pre-trained text tokenization makes the training target better conform to the characteristics of the language and improves the recognition of rare words.
The process in the first step of the invention comprises the following steps:
and step 11, data parameters. The sampling rate was 16 kHz.
Step 12, pre-emphasis, framing and windowing of the sampled raw signal, with the frame length and frame shift set to 25 ms and 10 ms, respectively.
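The preprocessing of step 12 can be sketched with numpy as follows. This is a hedged illustration under common conventions (a pre-emphasis coefficient of 0.97 and a Hamming window are our assumptions; the patent fixes only the 25 ms frame length and 10 ms frame shift):

```python
import numpy as np

def preprocess(signal, sr=16000, frame_ms=25, shift_ms=10, alpha=0.97):
    """Pre-emphasis, framing and windowing of a raw audio signal (sketch)."""
    # Pre-emphasis: boost high frequencies with a first-order difference filter.
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])
    frame_len = int(sr * frame_ms / 1000)      # 400 samples at 16 kHz
    frame_shift = int(sr * shift_ms / 1000)    # 160 samples at 16 kHz
    n_frames = 1 + max(0, (len(emphasized) - frame_len) // frame_shift)
    # Build an index matrix so each row selects one overlapping frame.
    idx = (np.arange(frame_len)[None, :]
           + frame_shift * np.arange(n_frames)[:, None])
    return emphasized[idx] * np.hamming(frame_len)

frames = preprocess(np.random.default_rng(0).standard_normal(16000))  # 1 s of audio
print(frames.shape)                             # (98, 400)
```

One second of 16 kHz audio yields 98 overlapping frames of 400 samples each.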
Step 13, data augmentation: noise addition and reverberation are applied to the preprocessed audio using the noise data in RIRS_NOISES.
Step 14, acoustic feature extraction and convolutional down-sampling of the features. A short-time Fourier transform (STFT) is applied to each frame to obtain the short-time magnitude spectrum, and 80-dimensional acoustic features are obtained through a Mel filter bank (Fbank). Down-sampling uses two 3x3 convolutional layers with stride 2 in both time and frequency; the output feature dimension is 256. The acoustic feature sequence after convolutional down-sampling can be written as X = (x_1, ..., x_n), where n is the length of the audio sequence and x_m is the audio signal feature of the m-th frame. Correspondingly, the token sequence is Y = (y_1, ..., y_m), where m is the text length and y_m is the m-th token feature extracted from the text.
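The effect of the two stride-2 convolutions on the sequence length can be checked with simple arithmetic (padding of 1 is our assumption; the patent fixes only the 3x3 kernel and stride 2):

```python
def conv_out_len(n, kernel=3, stride=2, padding=1):
    """Output length along the time axis after one convolutional layer."""
    return (n + 2 * padding - kernel) // stride + 1

# Two 3x3 convolutions with stride 2 shrink the time axis roughly 4x,
# which is what makes thousand-frame inputs tractable for attention.
n = 1000
for _ in range(2):
    n = conv_out_len(n)
print(n)   # 250
```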
Fig. 2 is a flow chart of the sparse self-attention mechanism proposed by the present invention.
Step 2, building the speech recognition model based on the sparse self-attention mechanism. The model is built on the basis of the Speech-Transformer; to address the defects of the original model, a sparse attention mechanism is proposed to replace its core multi-head self-attention module. The network comprises m encoder modules and n decoder modules, with both m and n set to 12; the extracted acoustic features are input to the encoder and the text features to the decoder.
Step 21, first, the input acoustic features are linearly transformed:
Q = X·W_Q, K = X·W_K, V = X·W_V
where Q, K and V respectively denote the Query, Key and Value matrices, whose dimensions are the same as those of the input speech features, and W_Q, W_K and W_V are the weight matrices of the linear transformations.
The conventional multi-head dot-product attention can be described as:
Attention(Q, K, V) = softmax(Q·K^T / sqrt(d)) · V
where d is the data dimension, i.e., the hidden dimension of the acoustic features after convolutional down-sampling, and V is the Value matrix.
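The conventional attention computation can be sketched directly in numpy (an illustration of the standard formula, not the patent's code):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d)) V."""
    d = q.shape[-1]
    return softmax(q @ k.T / np.sqrt(d)) @ v

rng = np.random.default_rng(0)
q = k = v = rng.standard_normal((6, 4))   # 6 time steps, hidden dimension d = 4
out = attention(q, k, v)
print(out.shape)                           # (6, 4)
```

Each output row is a convex combination of the Value rows, which is the "weighted average" property discussed in the Background section.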
For each attention module, the attention values of the individual heads are concatenated, which can be described as:
MultiHead(Q, K, V) = Concat(head_1, ..., head_h) · W_O
where h is the number of heads and W_O is the weight matrix of the linear transformation applied after concatenating the heads.
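The multi-head computation can be sketched as follows (a hedged illustration: random stand-in weights, h = 4 heads, and per-head splitting along the feature axis, which is the usual convention):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(x, w_q, w_k, w_v, w_o, h=4):
    """Project to Q, K, V, attend per head, concatenate, apply W_O."""
    n, d = x.shape
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    dh = d // h                                   # per-head dimension
    heads = []
    for i in range(h):
        qi, ki, vi = (m[:, i * dh:(i + 1) * dh] for m in (q, k, v))
        heads.append(softmax(qi @ ki.T / np.sqrt(dh)) @ vi)
    return np.concatenate(heads, axis=-1) @ w_o   # Concat(head_1..head_h) W_O

rng = np.random.default_rng(0)
n, d = 8, 16
x = rng.standard_normal((n, d))
w_q, w_k, w_v, w_o = (rng.standard_normal((d, d)) for _ in range(4))
out = multi_head_attention(x, w_q, w_k, w_v, w_o)
print(out.shape)   # (8, 16)
```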
Step 22, constructing the encoder based on the sparse self-attention mechanism.
To extract the correct key frames from the input acoustic feature sequence, a Query selector is implemented. First, the encoder input features are linearly transformed to obtain the Key matrix, which is then down-sampled by extracting, in each dimension, the largest feature value across all time steps; the resulting global key vector is recorded as K_global, with components
K_global_e = max_{1<=i<=t} K_{i,e}
where K_{i,e} is the feature value in dimension e of the i-th time-step vector of the Key matrix. Using the obtained global vector K_global and the original Query matrix, the attention weights are computed as:
SQK = softmax(Q·K_global^T / sqrt(d))
after the weight matrix SQK is obtained, the Query matrix is sampled. Specifically, the lambda subscripts with the maximum weight in the SQK matrix are taken out, the subscripts are used for taking out the corresponding Query vectors in the Query matrix for down-sampling, and the sampled Query matrix is recorded as。
Where λ is the corresponding hyperparameter.
Obtained after sampling by the above stepsMatrix sumAfter the matrix, the attention weight is recalculated, and the formula for calculating the attention weight at this time can be described as:
where V is the Value matrix.
The remaining Query vectors in the Query matrix are dropped; when computing the final self-attention layer output, their positions are filled with the global mean Value vector, denoted V_mean:
V_mean_e = (1/t) · Σ_{i=1}^{t} V_{i,e}
where V_mean denotes the replaced Value matrix and V_{i,e} is the e-th feature value in the i-th time-step vector of the Value matrix.
The final output of the sparse self-attention module therefore consists of the recalculated attention values at the λ sampled positions and the mean vector V_mean at all other positions.
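A minimal numpy sketch of this sparse self-attention module, reflecting our reading of the description (the variable names, the argsort-based top-λ selection, and the per-dimension max for the global key vector are assumptions):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sparse_self_attention(q, k, v, lam):
    """Score each query against a global key vector (per-dimension maxima of K);
    only the lam highest-scoring queries attend normally, while the remaining
    output positions are filled with the global mean of V (V_mean)."""
    n, d = q.shape
    k_global = k.max(axis=0)                   # global key vector
    sqk = q @ k_global / np.sqrt(d)            # one score per query (the SQK weights)
    idx = np.argsort(sqk)[-lam:]               # lam queries with the largest weights
    out = np.broadcast_to(v.mean(axis=0), (n, d)).copy()   # V_mean everywhere
    q_s = q[idx]                               # sampled Query matrix
    out[idx] = softmax(q_s @ k.T / np.sqrt(d)) @ v          # full attention rows
    return out, idx

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((10, 4)) for _ in range(3))
out, idx = sparse_self_attention(q, k, v, lam=3)
print(out.shape, len(idx))                     # (10, 4) 3
```

Only λ = 3 of the 10 query rows incur the full n key dot products; the other 7 rows cost nothing beyond the shared mean vector.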
and step three, training the voice recognition model, storing the voice recognition model with the optimal verification set result, and performing decoding test on the test set by using the model.
Step 31, training phase. The model is built on the basis of the Speech-Transformer. Adam and SGD are used as two-stage optimizers; the data dimensions of the attention module and the hidden layer are set to 256 and 2048, respectively, and after the training parameters are set the model is trained for 50 epochs in total.
Step 32, test phase. Models are trained with different hyper-parameter settings, and the hyper-parameter combination with the best test-set result is finally selected as output. Fig. 3 shows the input test audio spectrogram used during testing. As shown in Table 1, the recognition performance of several different speech recognition models is compared on the data set, and the proposed end-to-end speech recognition method based on the sparse attention mechanism achieves the best result. Fig. 4 is the attention-weight heat map output for the test audio.
TABLE 1 comparison of time complexity and speech recognition word error rates for different models
Table 2 compares the storage and time efficiency of the proposed method on input audio feature sequences of different lengths.
TABLE 2 self-attention layer decoding time contrast
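The decoding-time savings reported in Table 2 stem from the reduced number of query-key dot products: full self-attention needs n² of them per layer, while the sparse variant scores each query once against the global vector and then runs only λ full query rows. A back-of-the-envelope sketch (our arithmetic, not the patent's measurements):

```python
def dot_product_counts(n, lam):
    """Query-key dot products per layer: full vs. sparse self-attention."""
    full = n * n                # every query against every key
    sparse = n + lam * n        # n global-vector scores + lam full query rows
    return full, sparse

full, sparse = dot_product_counts(n=1000, lam=50)
print(full, sparse, full / sparse)   # 1000000 51000 -> roughly 20x fewer
```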
It should be understood that the above examples are given only for clarity of illustration and are not intended to limit the embodiments. Other variations and modifications will be apparent to those skilled in the art in light of the above description; it is neither necessary nor possible to exhaust all embodiments here, and obvious variations or modifications derived therefrom remain within the scope of the invention.
Claims (9)
1. An end-to-end speech recognition method based on a sparse self-attention mechanism is characterized by comprising the following steps:
S1, acquiring an audio data set, and performing preprocessing, data enhancement and feature extraction on the data set to obtain an acoustic feature sequence;
S2, down-sampling the acoustic feature sequence, inputting it into a speech recognition model based on a sparse self-attention mechanism, and recognizing the input acoustic feature sequence using the hyper-parameters that achieved the best training results;
S3, decoding the recognition result of the speech recognition model based on the sparse self-attention mechanism to obtain the corresponding text sequence.
2. The end-to-end speech recognition method based on the sparse self-attention mechanism as claimed in claim 1, wherein in the speech recognition model based on the sparse self-attention mechanism, the self-attention adopts multi-head dot product attention, and the multi-head dot product attention linearly transforms the input acoustic features to generate a Query matrix, a Key matrix and a Value matrix with the same dimension.
3. The sparse self-attention mechanism-based end-to-end speech recognition method according to claim 2, wherein in S2, the sparse self-attention mechanism-based speech recognition model building process comprises the following steps:
s201, performing convolution and down-sampling on the input acoustic features to obtain a Key matrix;
s202, calculating a global vector whose entries are the maximum values of the input acoustic feature sequence in each dimension, and calculating an attention weight matrix SQK by using the obtained global vector and the original Query matrix;
s203, sampling the Query matrix, and recalculating the attention weight by using the sampled Query matrix and the Key matrix; correlating the newly obtained attention weight with the Value matrix replaced by the mean vector to obtain a sparse self-attention relationship, which serves as the final speech recognition model based on the sparse self-attention mechanism.
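A minimal NumPy sketch of S201-S203 follows. Several details are assumptions: average pooling stands in for the claim's convolutional down-sampling, the score of each query against the global max vector stands in for the SQK weight matrix, and a top-u rule stands in for the patent's exact Query sampling; non-sampled output positions fall back to the mean vector of Value, as claim 3 describes.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def sparse_self_attention(x, w_q, w_k, w_v, stride=2, top_u=3):
    T, d = x.shape
    Q, K, V = x @ w_q, x @ w_k, x @ w_v
    # S201: down-sample Key (average pooling as a stand-in for the
    # convolutional down-sampling of the claim); Value is pooled to match.
    K_ds = K.reshape(T // stride, stride, d).mean(axis=1)
    V_ds = V.reshape(T // stride, stride, d).mean(axis=1)
    # S202: global vector = per-dimension maximum over the sequence;
    # score each query position against it (stand-in for SQK).
    g = K.max(axis=0)
    score = (Q @ g) / np.sqrt(d)
    # S203: keep only the top-u highest-scoring queries (the "sampled"
    # Query matrix) and recompute attention for them; every other output
    # position is replaced by the mean vector of Value.
    idx = np.sort(np.argsort(score)[-top_u:])
    attn = softmax(Q[idx] @ K_ds.T / np.sqrt(d))
    out = np.tile(V.mean(axis=0), (T, 1))
    out[idx] = attn @ V_ds
    return out, idx

rng = np.random.default_rng(1)
T, d = 8, 4
x = rng.standard_normal((T, d))
w_q, w_k, w_v = (rng.standard_normal((d, d)) for _ in range(3))
out, idx = sparse_self_attention(x, w_q, w_k, w_v)
print(out.shape)  # (8, 4); only len(idx) rows get freshly computed attention
```

Because only `top_u` rows of the attention map are computed, the dot-product cost drops from O(T^2) toward O(u * T / stride), which is the efficiency gain the description's Table 2 refers to.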
4. The sparse self-attention mechanism-based end-to-end speech recognition method of claim 3, wherein in S203, sampling the original Query matrix comprises the following steps:
5. The sparse self-attention mechanism-based end-to-end speech recognition method of claim 3, wherein in S203, the Value matrix after the replacement with the mean vector is represented by the following formula:
6. The sparse self-attention mechanism-based end-to-end speech recognition method of claim 1, wherein in the S2, the training process of the sparse self-attention mechanism-based speech recognition model comprises:
inputting the acoustic feature sequence into an encoder for sampling;
acquiring a text data set, and extracting corresponding text features of all texts in the text data set by using a Tokenizer; inputting the text features into a decoder;
and taking the sampled acoustic feature sequence as the input of the speech recognition model based on the sparse self-attention mechanism and the decoding result of the decoder as the output, training the speech recognition model based on the sparse self-attention mechanism, adjusting the hyper-parameters during training to obtain different test results, and selecting the hyper-parameter combination with the optimal test result as the output.
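The hyper-parameter selection described in claim 6 amounts to training once per combination and keeping the best test result. A generic grid-search sketch, with a toy error function standing in for "train the model and measure a test result" (the grid keys and values below are illustrative, not the patent's):

```python
import itertools

def select_hyperparameters(train_and_test, grid):
    """Train once per hyper-parameter combination and keep the combination
    with the best (here: lowest) test error, as claim 6 describes."""
    best_params, best_err = None, float("inf")
    for combo in itertools.product(*grid.values()):
        params = dict(zip(grid.keys(), combo))
        err = train_and_test(**params)
        if err < best_err:
            best_params, best_err = params, err
    return best_params, best_err

# Toy stand-in: pretend the best setting is lr=0.01 with 4 heads.
toy = lambda lr, heads: abs(lr - 0.01) + 0.001 * abs(heads - 4)
grid = {"lr": [0.1, 0.01], "heads": [4, 8]}
best, err = select_hyperparameters(toy, grid)
print(best)  # {'lr': 0.01, 'heads': 4}
```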
7. An end-to-end speech recognition system based on a sparse self-attention mechanism is characterized by comprising a data acquisition module, an encoder, a speech recognition model based on the sparse self-attention mechanism and a decoder;
the data acquisition module is used for acquiring an audio data set;
the encoder is used for preprocessing, enhancing data and extracting features of the data set to obtain an acoustic feature sequence;
the speech recognition model based on the sparse self-attention mechanism performs down-sampling on the acoustic feature sequence, and recognizes the input acoustic feature sequence by utilizing the hyper-parameters with the optimal training effect;
the decoder is used for decoding the recognition result of the speech recognition model based on the sparse self-attention mechanism to obtain a corresponding text sequence.
8. The sparse self-attention mechanism-based end-to-end speech recognition system of claim 7, wherein a self-attention matrix generation module is included in the sparse self-attention mechanism-based speech recognition model, self-attention adopts multi-head dot product attention, and the multi-head dot product attention linearly transforms input acoustic features to generate a Query matrix, a Key matrix and a Value matrix with the same dimension.
9. The sparse self-attention mechanism-based end-to-end speech recognition system of claim 8, wherein the sparse self-attention mechanism-based speech recognition model comprises a self-attention weight calculation module configured to: perform convolution down-sampling on the input acoustic features to obtain a Key matrix; calculate a global vector whose entries are the maximum values of the input acoustic feature sequence in each dimension, and calculate an attention weight matrix SQK by using the obtained global vector and the original Query matrix; sample the Query matrix and recalculate the attention weight by using the sampled Query matrix and the Key matrix; and correlate the newly obtained attention weight with the Value matrix replaced by the mean vector to obtain a sparse self-attention relationship, which serves as the final speech recognition model based on the sparse self-attention mechanism.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210694730.0A CN114783418B (en) | 2022-06-20 | 2022-06-20 | End-to-end voice recognition method and system based on sparse self-attention mechanism |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114783418A true CN114783418A (en) | 2022-07-22 |
CN114783418B CN114783418B (en) | 2022-08-23 |
Family
ID=82420363
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210694730.0A Active CN114783418B (en) | 2022-06-20 | 2022-06-20 | End-to-end voice recognition method and system based on sparse self-attention mechanism |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114783418B (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115081752A (en) * | 2022-08-11 | 2022-09-20 | 浙江君同智能科技有限责任公司 | Black and gray production crowdsourcing flow prediction device and method |
CN115796407A (en) * | 2023-02-13 | 2023-03-14 | 中建科技集团有限公司 | Production line fault prediction method and related equipment |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20200043483A1 (en) * | 2018-08-01 | 2020-02-06 | Google Llc | Minimum word error rate training for attention-based sequence-to-sequence models |
CN113140220A (en) * | 2021-04-12 | 2021-07-20 | 西北工业大学 | Lightweight end-to-end speech recognition method based on convolution self-attention transformation network |
CN113380232A (en) * | 2021-06-15 | 2021-09-10 | 哈尔滨工业大学 | End-to-end speech recognition method based on constraint structured sparse attention mechanism and storage medium |
CN113642646A (en) * | 2021-08-13 | 2021-11-12 | 重庆邮电大学 | Image threat article classification and positioning method based on multiple attention and semantics |
WO2022121150A1 (en) * | 2020-12-10 | 2022-06-16 | 平安科技(深圳)有限公司 | Speech recognition method and apparatus based on self-attention mechanism and memory network |
Non-Patent Citations (2)
Title |
---|
TIMO LOHRENZ ET AL.: "Relaxed Attention: A Simple Method to Boost Performance of End-to-End Automatic Speech Recognition", arXiv:2107.01275v2 * |
ZHIFU GAO ET AL.: "SAN-M: Memory Equipped Self-Attention for End-to-End Speech Recognition", arXiv:2006.01713v1 * |
Also Published As
Publication number | Publication date |
---|---|
CN114783418B (en) | 2022-08-23 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110491416B (en) | Telephone voice emotion analysis and identification method based on LSTM and SAE | |
Gaikwad et al. | A review on speech recognition technique | |
Zhu et al. | Phone-to-audio alignment without text: A semi-supervised approach | |
CN114783418B (en) | End-to-end voice recognition method and system based on sparse self-attention mechanism | |
CN108305616A (en) | A kind of audio scene recognition method and device based on long feature extraction in short-term | |
CN110189749A (en) | Voice keyword automatic identifying method | |
CN111798840A (en) | Voice keyword recognition method and device | |
CN111724770B (en) | Audio keyword identification method for generating confrontation network based on deep convolution | |
CN115019776A (en) | Voice recognition model, training method thereof, voice recognition method and device | |
CN114495969A (en) | Voice recognition method integrating voice enhancement | |
CN112735404A (en) | Ironic detection method, system, terminal device and storage medium | |
CN111653270B (en) | Voice processing method and device, computer readable storage medium and electronic equipment | |
CN114550703A (en) | Training method and device of voice recognition system, and voice recognition method and device | |
CN115394287A (en) | Mixed language voice recognition method, device, system and storage medium | |
CN114999460A (en) | Lightweight Chinese speech recognition method combined with Transformer | |
Rudresh et al. | Performance analysis of speech digit recognition using cepstrum and vector quantization | |
CN113611285B (en) | Language identification method based on stacked bidirectional time sequence pooling | |
CN118136022A (en) | Intelligent voice recognition system and method | |
CN117558278A (en) | Self-adaptive voice recognition method and system | |
CN116631383A (en) | Voice recognition method based on self-supervision pre-training and interactive fusion network | |
CN115376547B (en) | Pronunciation evaluation method, pronunciation evaluation device, computer equipment and storage medium | |
CN114333762B (en) | Expressive force-based speech synthesis method, expressive force-based speech synthesis system, electronic device and storage medium | |
CN114626424B (en) | Data enhancement-based silent speech recognition method and device | |
CN113628639A (en) | Voice emotion recognition method based on multi-head attention mechanism | |
Tailor et al. | Deep learning approach for spoken digit recognition in Gujarati language |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||