CN115240645A

CN115240645A - Stream type voice recognition method based on attention re-scoring

Info

Publication number: CN115240645A
Application number: CN202210864507.6A
Authority: CN
Inventors: 杜军朝; 刘惠; 张志鹏; 于英涛; 潘江涛
Original assignee: Xidian University
Current assignee: Xidian University
Priority date: 2022-07-21
Filing date: 2022-07-21
Publication date: 2022-10-25

Abstract

The invention provides a stream type voice recognition method based on attention re-scoring, which comprises the following steps: acquiring a training data set; constructing a streaming voice recognition model based on attention re-scoring; carrying out iterative training on the flow type voice recognition model; and acquiring a real-time voice recognition result. The stream type voice recognition model constructed by the invention comprises an attention redressing stage, the result of the stream type voice recognition is optimized by using complete voice coding in the process of training the model and obtaining a real-time voice recognition result, the complete voice information can be fully used in the process of the stream type recognition, and the recognition accuracy is effectively improved while the stream type low delay is kept.

Description

Stream type voice recognition method based on attention re-scoring

Technical Field

The invention belongs to the technical field of voice recognition, relates to a voice recognition method, and particularly relates to a stream type voice recognition method based on attention re-scoring.

Background

The task of voice recognition is to convert voice into a machine-processable input form such as characters or instructions, and is one of important ways of man-machine interaction; and with the constant popularization of smart phones and smart sound boxes, the importance of the smart phones and the smart sound boxes is increasing. The voice recognition is the first step of voice interaction and the most important step, and the rapid and accurate voice recognition can effectively improve the use experience of various voice interactions.

Speech recognition can be classified into two types, streaming speech recognition and non-streaming speech recognition, depending on whether the speech recognition is real-time or not. The non-streaming voice recognition needs to perform recognition after obtaining a complete voice fragment, and an accurate result can be given by using complete voice information during the non-streaming voice recognition. Streaming speech recognition requires that the user be presented with a real-time recognition result while obtaining the real-time speech of the user. The low latency of streaming speech recognition can greatly improve the user experience, but in streaming recognition, the model can only use limited speech information, and the recognition accuracy is often lower than that of non-streaming recognition. Low latency for streaming identification is difficult to compromise with high accuracy for non-streaming identification. On the other hand, the streaming speech recognition under different scenes often has different requirements on delay and accuracy, and the model often needs to be retrained according to different scene requirements.

The patent application with publication number CN113327603A entitled "speech recognition method, apparatus, electronic device and computer readable storage medium" discloses a speech recognition method, which first obtains a speech feature sequence of a speech signal to be recognized, and divides the speech feature sequence into a plurality of speech blocks; inputting the voice characteristic sequence into a self-attention-based encoder to obtain voice codes, wherein each self-attention layer in the encoder can only notice one voice block before and/or after the current voice block, namely, only limited voice information can be used for calculation; the speech coding is input into a connection time sequence classification CTC module, the CTC module processes the speech coding block by block and predicts the character output, the speech coding is cut off when the predicted character output changes, a speech coding section corresponding to each character is obtained, and a decoder based on attention decodes each speech coding section to obtain a corresponding final recognition result. The method finds out the voice coding section corresponding to each character through CTC, and then decodes each voice coding section by using an attention-based decoder, so that the accuracy is improved to a certain extent while the low delay of the stream type recognition is kept, but the method still only uses limited voice information and does not utilize the information of complete voice, thereby leading the recognition accuracy to be still lower.

Disclosure of Invention

The invention aims to overcome the defects in the prior art, and provides a stream type voice recognition method based on attention re-scoring, which is used for solving the technical problem that the prior art is difficult to consider both the high accuracy of non-stream type recognition and the low delay of stream type recognition.

In order to realize the purpose, the technical scheme adopted by the invention comprises the following steps:

(1) Acquiring a training data set:

(1a) Obtaining N pieces of noiseless voice data S = { S } from different speakers ₁ ,...,s _n ,...,s _N And text content L = { L) corresponding to S ₁ ,...,l _n ,...,l _N And M pieces of natural noise data F = { F) from different scenes ₁ ,...,f _m ,...,f _M }; wherein N is more than or equal to 200000,s _n Representing the nth piece of noiseless speech data, l _n Denotes s _n M is more than or equal to 2000, f _m Representing the mth piece of natural noise data;

(1b) Every strip of S noiseless speech data S _n Mixing with any piece of natural noise data in F, and combining N pieces of training data into a training data set T = { T = { (T) } ₁ ,...,t _n ,...,t _N In which t _n Denotes s _n Corresponding training data;

(2) Constructing a streaming voice recognition model based on attention re-scoring:

(2a) Constructing a structure of a streaming voice recognition model based on attention re-scoring:

constructing a streaming voice recognition model comprising a voice feature extraction module, a streaming voice recognition module and an attention re-grading module which are connected in sequence; the streaming voice recognition module comprises a fully-connected layer with position codes, a voice coder and a connection time sequence classification CTC prediction layer which are connected in sequence, wherein the voice coder comprises a plurality of former structures which are connected in sequence, and the CTC prediction layer adopts a fully-connected layer of which the activation function is a Softmax function; the attention re-grading module comprises an attention decoder and an attention prediction layer which are connected in sequence, wherein the attention decoder comprises a plurality of transform structures which are connected in sequence, and the attention prediction layer adopts a full connection layer of which an activation function is a Softmax function;

(2b) Defining a loss function L for an attention-re-scoring based streaming speech recognition model _Joint ：

L _Joint ＝λL _CTC +(1-λ)L _Attention

Wherein L is _CTC Representing the CTC loss function, L _Attention Representing an attention loss function, wherein lambda is a weight factor, and 0 < lambda < 1; l is a radical of an alcohol _Attention Using KL divergence loss with tag smoothing;

(3) Carrying out iterative training on the flow type voice recognition model V:

(3a) The initial iteration number is I, the maximum iteration number is I, I is more than or equal to 200, and the current streaming voice recognition model is V _i And let i =1,V = V _i ；

(3b) And (3) carrying out forward propagation by taking the training data set T as the input of the streaming voice recognition model V according to batches:

(3b1) The voice feature extraction module extracts each training data t _n The mel filter bank FBank characteristic;

(3b2) The streaming speech recognition module converts each training data t _n The characteristics of the Mel filter bank FBank are uniformly divided into a plurality of voice blocks, and the full connection layer with position coding is t _n The FBank features of all speech blocks add relative position coding and add t _n The FBank characteristic with position coding is evenly divided into a plurality of voice blocks; the speech coder calculates the speech coding of each speech block by block, the visual field of multi-head self-attention of each relation structure in the speech coder needs to be limited in the calculation process, the attention to the current speech block and a fixed number of speech blocks before the current speech block can only be calculated when each speech block is processed, and the speech coding of all the speech blocks forms t _n The speech coding of (2); the CTC prediction layer adopts a CTC prefix beam search algorithm, the beam width is set to be R, R is more than or equal to 10 and less than or equal to 50, and t is calculated according to the speech codes of all the speech blocks _n The streaming identification result and the corresponding streaming score;

(3b3) Attention decoder pass t in attention re-scoring module _n Calculating decoding information corresponding to each stream type recognition result according to the complete voice coding and the R stream type recognition results; the attention prediction layer calculates the attention score of each stream type recognition result according to the decoding information, then carries out weighted summation on the stream type score and the attention score of each stream type recognition result to obtain the final score of each stream type result, and then takes the stream type recognition result with the highest final score as t _n Final recognition result l of _n ；

(3c) Using a joint loss function L _Joint Through l _n And l _n Calculating the loss value of the streaming voice recognition model in the ith iteration, adopting an Adam optimization method, and updating the weight parameters of the streaming voice recognition module and the attention re-scoring module in the streaming voice recognition model through the minimized loss value to obtain a real-time voice information extraction model V of the target speaker after the ith iteration _i ；

(3d) Judging whether I is more than or equal to I, if so, obtaining a trained streaming voice recognition model V ^* Otherwise, let i = i +1, and perform step (3 b);

(4) Acquiring a real-time voice recognition result:

(4a) Cutting the real-time collected voice stream into a plurality of voice blocks with equal length, and using the cut voice blocks as a trained stream type voice recognition model V ^* Is forward propagated to obtain in real timeTaking a plurality of streaming recognition results and corresponding streaming scores, and using the streaming recognition result with the highest streaming score as a real-time voice recognition result;

(4b) After the speech stream is over, V ^* The flow type recognition result is re-scored to give a final recognition result.

Compared with the prior art, the invention has the following advantages:

the stream type voice recognition model constructed by the invention comprises an attention re-grading module, and the attention re-grading module optimizes the result of the stream type voice recognition by using complete voice coding in the process of training the model and acquiring the voice recognition result, so that the problem of low recognition accuracy caused by the fact that the stream type voice recognition only can use limited voice information in the prior art is solved, the complete voice information can be fully utilized, and the recognition accuracy is effectively improved while the stream type low delay is kept.

Drawings

FIG. 1 is a flow chart of an implementation of the present invention;

FIG. 2 is a diagram of a flow-type speech recognition model based on attention re-scoring constructed by the present invention.

Detailed Description

The invention is described in further detail below with reference to the figures and the specific embodiments.

Referring to fig. 1, the present invention includes the steps of:

step 1) acquiring a training data set:

(1a) Acquiring N pieces of noiseless voice data S = { S) from different speakers ₁ ,...,s _n ,...,s _N And text content L = { L ] corresponding to S ₁ ,...,l _n ,...,l _N Acquiring M pieces of natural noise data F = { F) from different scenes ₁ ,...,f _m ,...,f _M }; wherein N is more than or equal to 200000,s _n Representing the nth noiseless speech data,/ _n Denotes s _n M is greater than or equal to 2000, f _m Representing the mth piece of natural noise data;

in this embodiment, the noiseless speech data is from 400 different speakers, about 350 noiseless speech data per speaker, about 140000 noiseless speeches in total, and about 178 hours in total; the natural noise data set selects a TUT2016 data set and a TUT2017 data set, audio scene data of about 27.25 hours are summed, the audio time length of each noise scene is 109 minutes, each noise data is about 30 seconds, and about 3250 pieces of natural noise data are totally collected;

in the embodiment, all voice data or noise data are wav files, and the sampling frequency is 16kHz;

(1b) Each of the noiseless voice data S of S _n Mixing with any piece of natural noise data in F, and combining N pieces of training data into a training data set T = { T = { (T) } ₁ ,...,t _n ,...,t _N Where t is _n Denotes s _n Corresponding training data, wherein each piece of noise-free speech data S of S _n Mixing with any piece of natural noise data in the F, and the method comprises the following steps:

(1b1) Initializing training data t _n Allowed minimum signal-to-noise ratio min _snr Maximum signal-to-noise ratio max _snr Hopefully, the proportion of the noiseless voice data reserved in the training data set T is p, and p is more than or equal to 0 and less than or equal to 1;

(1b2) Each piece of noiseless speech data S of S is calculated _n Average power of

Any piece of natural noise data F in the sum F _k Average power of

And according to

And

calculating s _n Corresponding noise figure gamma _n ：

Wherein k is [1, M ]]，SNR _n Denotes t _n Signal to noise ratio of (d), min _snr ≤SNR _n ≤max _snr ，α _n Denotes a threshold value, 0. Ltoreq. Alpha _n ≤1；

(1b3) According to the noise coefficient gamma _n Calculating s _n Corresponding training data t _n ：

t _n ＝s _n +γ _n ·f _k 。

The invention determines the noise factor gamma from the signal power of the noise-free speech and natural noise data and the selected signal-to-noise ratio _n ，γ _n =0, representing training data t _n The training data set T only contains noiseless voice data, namely the training data set T contains part of noiseless voice data and also contains voice data which contains noise and is obtained by mixing part of noiseless voice data and natural noise data, so that the problem that the training is interfered by voice features damaged by too strong noise is avoided. Meanwhile, the signal-to-noise ratio which dynamically changes in a certain range greatly enriches the richness of the training data set, and the robustness of the model facing different noise scenes is improved. When a training data set is manufactured, the proportion p of data added with noise needs to be controlled, if the proportion is too large, the performance of the model for recognizing clear voice is reduced, and if the proportion is too small, the robustness of the model for noise interference is reduced, wherein p =0.6 is set in the embodiment; this embodiment sets the minimum signal-to-noise ratio min _snr Maximum signal to noise ratio max of 1 _snr Is set to 10;

step 2) constructing a streaming voice recognition model based on attention re-scoring, wherein the structure of the model is shown in FIG. 2:

constructing a streaming voice recognition model comprising a streaming voice recognition module and an attention re-grading module which are connected in sequence; the streaming voice recognition module comprises a fully-connected layer with position codes, a voice coder and a connection time sequence classification CTC prediction layer which are connected in sequence, wherein the voice coder comprises a plurality of former structures which are connected in sequence, and the CTC prediction layer adopts a fully-connected layer of which an activation function is a Softmax function; the attention re-grading module comprises an attention decoder and an attention prediction layer which are connected in sequence, wherein the attention decoder comprises a plurality of transform structures which are connected in sequence, and the attention prediction layer adopts a full connection layer of which an activation function is a Softmax function;

in this embodiment, the fully-connected layer with position coding maps the 320-dimensional speech feature input into the 256-dimensional speech feature using a fully-connected neural network composed of 256 neurons with 320-dimensional internal dimensions; and adding relative position codes to 256 voice features;

in this embodiment, 12 causal convolution-based former structures with 256 dimensions of self-attention dimension are used in the speech encoder, and each former structure includes four modules: front and back two feedforward modules (FFN), the middle is a multi-head self-attention Module (MHSA) and a convolution module (Conv) with the number of 4 attention heads, residual connection is added in front and back of each module, and each structure is finally normalized (LayerNorm);

in this embodiment, the attention decoder consists of 6 transform decoders stacked; each Transformer decoder comprises a multi-head self-attention module with 4 attention heads, a multi-head attention module with 4 attention heads and a feedforward neural network module, residual connection is added in front of and behind each module, and each structure is finally normalized;

(2b) Defining a loss function L for a streaming speech recognition module and an attention re-scoring module _Joint ：

L _Joint ＝λL _CTC +(1-λ)L _Attention

Wherein L is _CTC Representing the CTC loss function, L _Attention Representing an attention loss function, wherein lambda is a weight factor, and 0 < lambda < 1; l is _Attention Calculating a predicted character sequence l by using KL divergence with label smoothness _n For real character sequence l _n KL divergence of (1) _n Using a one-hot vector representation with tag smoothing, a tag smoothing techniqueOverfitting of the model can be effectively prevented.

Step 3), carrying out iterative training on the flow type voice recognition model V:

In this embodiment, I is 200;

(3b) And (3) performing forward propagation by taking the training data set T as the input of the streaming voice recognition model V according to batches:

in this embodiment, the speech feature extraction module performs pre-emphasis, framing, windowing, short-time fourier transform, power spectrum taking, mel filter bank taking, and logarithm taking on each piece of training data, and then obtains mel filter bank features corresponding to the training data; in the embodiment, a 25ms frame length and a 10ms frame shift are adopted during framing, the dimension of the extracted Mel filter bank characteristic is 80 dimensions, and 320 ViMel filter bank characteristics are obtained through 4-frame stacking and 3-frame downsampling;

(3b2) The streaming voice recognition module converts each training data t _n The characteristics of the Mel filter bank FBank are uniformly divided into a plurality of voice blocks, and the full connection layer with position coding is t _n The FBank characteristics of all speech blocks add relative position coding and t _n The FBank characteristic with position coding is evenly divided into a plurality of voice blocks; the speech coder calculates the speech coding of each speech block by block, the visual field of multi-head self-attention of each relation structure in the speech coder needs to be limited in the calculation process, the attention to the current speech block and a fixed number of speech blocks before the current speech block can only be calculated when each speech block is processed, and the speech coding of all the speech blocks forms t _n The speech coding of (2); the CTC prediction layer adopts a CTC prefix beam search algorithm, the beam width is set to be R, R is more than or equal to 10 and less than or equal to 50, and t is calculated according to the speech codes of all the speech blocks _n The streaming type recognition result and the corresponding streaming type score;

the full-connection layer with the position code maps the input 320-dimensional Mel filter bank characteristics to 256 dimensions while adding the relative position code at the in-place input, thereby reducing the input dimension and further reducing the parameter quantity of the whole model.

The former structure in the speech encoder replaces the original depth separable convolution with a causal convolution. Causal convolution cuts off the future speech information on the right side, only pays attention to the information before the current time step, and does not depend on the information after the current time step. The causal convolution based former structure is more suitable for streaming voice data.

On the other hand, to adapt to streaming recognition, the multi-headed self-attention of the former in the speech encoder can only notice the information before the current time step; in the invention, attention of the blocks is used, namely, each former can only calculate the attention of the current voice block to the current voice block and the voice block which is prior to the current voice block in time sequence.

During stream type recognition, the larger voice block enables the model to receive more voice information, the recognition accuracy rate is higher, but the model depends on more future information, the recognition delay is increased, and the real-time performance is reduced; smaller speech blocks provide less speech information, and recognition accuracy is reduced, but real-time performance is improved. The existing training method uses a voice block with a fixed length, and the accuracy rate is greatly reduced by changing the length of the voice block. The invention dynamically changes the length of the voice block during training, so that the model can adapt to the voice blocks with different lengths. When in use, different delays and accuracy rates can be obtained by adjusting the length of the voice block according to actual requirements. In this embodiment, the length chunksize of each speech block is set to be a random integer between 1 and 30, that is, each speech block contains the mel filter bank characteristic of a chunksize frame, the duration of the corresponding speech block is 30 milliseconds to 900 milliseconds, and the average latency is 15 milliseconds to 450 milliseconds.

The CTC prediction layer adopts a prefix bundle search algorithm, and an initial identification result is a blank label; searching block by block from the first voice block, when processing each voice block, finding the path with the highest probability from the existing result, combining and continuing characters and removing blank labels, finally fusing a plurality of paths with consistent result, keeping the beam width result with the highest probability, and taking the result with the highest probability as the current streaming output result; the process is repeated until all the voice blocks are processed, and the beamwidth stream type identification result and the corresponding probability which is the fraction of the stream type identification are obtained. In this embodiment, the beam width is set to 10, and a larger beam width will result in more accurate results, but will also result in more delay.

(3b3) Attention decoder pass t in attention re-scoring module _n Calculating decoding information corresponding to each stream type recognition result according to the complete voice coding and the R stream type recognition results; the attention prediction layer calculates the attention score of each stream type recognition result according to the decoding information, then carries out weighted summation on the stream type score and the attention score of each stream type recognition result to obtain the final score of each stream type result, and then takes the stream type recognition result with the highest final score as t _n Final recognition result l _n ；

The attention decoder adopts a Teacher-Forcing mode during decoding, takes a plurality of streaming type recognition results as labels, decodes the voice codes of complete voice, can perform repeated scoring on a plurality of decoded voice codes by one-time execution, avoids the autoregressive decoding process, and has better real-time performance.

(3c) Using a joint loss function L _Joint Through l _n And l _n Calculating the loss value of the streaming voice recognition model in the ith iteration, adopting an Adam optimization method, updating the weight parameters of the streaming voice recognition module and the attention scoring module in the streaming voice recognition model through the minimized loss value, and obtaining a real-time voice information extraction model V of the target speaker after the ith iteration _i ；

step 4), acquiring a real-time voice recognition result:

(4a) Cutting the real-time collected voice stream into multiple voice blocks with equal length, and cutting out the voice blocksCutting a plurality of voice blocks as a trained streaming voice recognition model V ^* The input of the voice recognition system is transmitted forward to obtain a plurality of streaming recognition results and corresponding streaming scores in real time, and one streaming recognition result with the highest streaming score is used as a real-time voice recognition result;

(4b) After the speech stream is over, V ^* The flow type recognition result is re-scored, and a final recognition result is given.

The length of the voice block can be determined according to different requirements of the computing power, the accuracy rate and the delay of the actual equipment, the variation range corresponds to the range of the length of the voice block in training, and the embodiment is set to be 30 milliseconds to 900 milliseconds;

in actual recognition, the output state of each former structure in the speech encoder can be cached, and when the next speech block is calculated, the cached output state is directly used as historical information to be spliced in the hidden state of the corresponding former structure, so that repeated calculation of the historical information is avoided. The block-wise attention used can compute multiframe features within a speech block at once, further reducing latency.

Claims

1. A stream-type voice recognition method based on attention re-scoring is characterized by comprising the following steps:

(1) Acquiring a training data set:

(1a) Acquiring N pieces of noiseless voice data S = { S) from different speakers ₁ ,...,s _n ,...,s _N And text content L = { L) corresponding to S ₁ ,...,l _n ,...,l _N And M pieces of natural noise data F = { F) from different scenes ₁ ,...,f _m ,...,f _M }; wherein N is more than or equal to 200000,s _n Representing the nth piece of noiseless speech data, l _n Denotes s _n M is greater than or equal to 2000, f _m Representing the mth piece of natural noise data;

(1b) Every strip of S noiseless speech data S _n Mixing with any piece of natural noise data in F to obtain training data, and combining N pieces of training data to form a training data set T ={t ₁ ,...,t _n ,...,t _N Where t is _n Denotes s _n Corresponding training data;

constructing a streaming voice recognition model comprising a voice feature extraction module, a streaming voice recognition module and an attention scoring module which are connected in sequence; the streaming voice recognition module comprises a fully-connected layer with position codes, a voice coder and a connection time sequence classification CTC prediction layer which are connected in sequence, wherein the voice coder comprises a plurality of former structures which are connected in sequence, and the CTC prediction layer adopts a fully-connected layer of which an activation function is a Softmax function; the attention re-grading module comprises an attention decoder and an attention prediction layer which are connected in sequence, wherein the attention decoder comprises a plurality of transform structures which are connected in sequence, and the attention prediction layer adopts a full connection layer of which an activation function is a Softmax function;

(2b) Defining a loss function L for an attention-re-scoring-based streaming speech recognition model _Joint ：

L _Joint ＝λL _CTC +(1-λ)L _Attention

Wherein L is _CTC Representing the CTC loss function, L _Attention Representing an attention loss function, wherein lambda is a weight factor, and 0 < lambda < 1; l is _Attention Using KL divergence loss with label smoothing;

(3) Carrying out iterative training on the flow type voice recognition model V:

(3a) The initialization iteration number is I, the maximum iteration number is I, I is more than or equal to 200, and the current streaming voice recognition model is V _i And let i =1,V = V _i ；

(3b1) The voice characteristic extraction module extracts each training data t _n The mel filter bank FBank characteristic;

(3b2) Streaming speech recognition module will eachTraining data t _n The characteristics of the Mel filter bank FBank are uniformly divided into a plurality of voice blocks, and the full connection layer with position coding is t _n The FBank features of all speech blocks add relative position coding and add t _n The FBank characteristic with position coding is evenly divided into a plurality of voice blocks; the speech encoder calculates the speech coding of each speech block by block, the multi-head self-attention visual field of each relation structure in the speech encoder needs to be limited in the calculation process, the attention to the current speech block and the speech blocks in front of the current speech block in a fixed number can only be calculated when each speech block is processed, and the speech coding of all the speech blocks forms t _n The speech coding of (3); the CTC prediction layer adopts a CTC prefix beam search algorithm, the beam width is set to be R, R is more than or equal to 10 and less than or equal to 50, and t is calculated according to the voice codes of all the voice blocks _n The streaming type recognition result and the corresponding streaming type score;

(4) Acquiring a real-time voice recognition result:

(4a) Cutting the real-time collected voice stream into a plurality of voice blocks with equal length, and using the cut voice blocks as a trained stream type voice recognition model V ^* The input of the voice recognition system is transmitted forward to obtain a plurality of streaming recognition results and corresponding streaming scores in real time, and one streaming recognition result with the highest streaming score is used as a real-time voice recognition result;

2. The attention-re-scoring-based streaming speech recognition method according to claim 1, wherein each of the noiseless speech data S of S in the step (1 b) is _n Mixing with any piece of natural noise data in the F, and the implementation steps are as follows:

(1b1) Initializing training data t _n The minimum allowable signal-to-noise ratio is min _snr Maximum signal-to-noise ratio max _snr Hopefully, the proportion of the noiseless voice data reserved in the training data set T is p, and p is more than or equal to 0 and less than or equal to 1;

Any piece of natural noise data F in the sum F _k Average power of

And in accordance with

And

calculating s _n Corresponding noise coefficient gamma _n ：

Wherein k is equal to [1,M ]]，SNR _n Represents t _n Signal to noise ratio of (d), min _snr ≤SNR _n ≤max _snr ，α _n Denotes a threshold value, 0. Ltoreq. Alpha _n ≤1；

t _n ＝s _n +γ _n f _k 。

3. The streaming speech recognition method based on attention scoring according to claim 1, wherein the structure of the streaming speech recognition model based on attention scoring in step (2 a) is 12, and the number of sequentially connected transform structures included in the speech encoder is 6.

4. The streaming speech recognition method based on attention re-scoring according to claim 1, wherein the step (3 b 2) is implemented by using t _n The FBank characteristic with position code is divided into a plurality of voice blocks uniformly, the voice block length of each batch of input training data is randomly determined within a certain range, and the same batch of training data uses the voice blocks with the same length when being divided.

5. The streaming speech recognition method based on attention re-scoring as claimed in claim 1, wherein the attention decoder in the attention re-scoring module in step (3 b 3) passes t _n The method comprises the steps of calculating decoding information corresponding to each stream type recognition result by using the voice coding and a plurality of stream type recognition results, wherein a Teacher-Forcing mode is adopted for decoding when an attention decoder decodes, and the stream type recognition results are directly used as labels to carry out t-direction recognition _n The complete speech encoding of (a) is decoded.