Disclosure of Invention
The invention aims to overcome the defects of the prior art by providing a streaming speech recognition method based on attention re-scoring, which solves the technical problem that the prior art cannot simultaneously achieve both the high accuracy of non-streaming recognition and the low latency of streaming recognition.
In order to achieve the above purpose, the technical scheme adopted by the invention comprises the following steps:
(1) Acquiring a training data set:
(1a) Obtaining N pieces of noiseless speech data S = {s_1, ..., s_n, ..., s_N} from different speakers and the text content L = {l_1, ..., l_n, ..., l_N} corresponding to S, and M pieces of natural noise data F = {f_1, ..., f_m, ..., f_M} from different scenes; wherein N ≥ 200000, s_n represents the n-th piece of noiseless speech data, l_n denotes the text content of s_n, M ≥ 2000, and f_m represents the m-th piece of natural noise data;
(1b) Mixing each piece of noiseless speech data s_n in S with an arbitrary piece of natural noise data in F, and combining the N pieces of training data into a training data set T = {t_1, ..., t_n, ..., t_N}, where t_n denotes the training data corresponding to s_n;
(2) Constructing a streaming voice recognition model based on attention re-scoring:
(2a) Constructing a structure of a streaming voice recognition model based on attention re-scoring:
constructing a streaming speech recognition model comprising a speech feature extraction module, a streaming speech recognition module and an attention re-scoring module which are connected in sequence; the streaming speech recognition module comprises a fully-connected layer with position coding, a speech encoder and a connectionist temporal classification (CTC) prediction layer which are connected in sequence, wherein the speech encoder comprises a plurality of Conformer structures connected in sequence, and the CTC prediction layer adopts a fully-connected layer whose activation function is the Softmax function; the attention re-scoring module comprises an attention decoder and an attention prediction layer which are connected in sequence, wherein the attention decoder comprises a plurality of Transformer structures connected in sequence, and the attention prediction layer adopts a fully-connected layer whose activation function is the Softmax function;
(2b) Defining a joint loss function L_Joint for the attention-re-scoring-based streaming speech recognition model:

L_Joint = λ·L_CTC + (1 − λ)·L_Attention

wherein L_CTC represents the CTC loss function, L_Attention represents the attention loss function, and λ is a weight factor, 0 < λ < 1; L_Attention uses a KL divergence loss with label smoothing;
(3) Performing iterative training on the streaming speech recognition model V:
(3a) Let the iteration index be i and the maximum number of iterations be I, with I ≥ 200; denote the current streaming speech recognition model by V_i, and let i = 1, V = V_i;
(3b) Forward propagation is performed in batches with the training data set T as the input of the streaming speech recognition model V:
(3b1) The speech feature extraction module extracts the Mel filter bank (FBank) features of each piece of training data t_n;
(3b2) The streaming speech recognition module processes the Mel filter bank FBank features of each piece of training data t_n as follows: the fully-connected layer with position coding adds relative position coding to the FBank features of t_n, and the FBank features with position coding are evenly divided into a plurality of speech blocks; the speech encoder calculates the speech coding of each speech block block by block, and during the calculation the field of view of the multi-head self-attention of each Conformer structure in the speech encoder is restricted, so that when processing each speech block, attention can only be computed over the current speech block and a fixed number of speech blocks preceding it; the speech codings of all speech blocks form the speech coding of t_n; the CTC prediction layer adopts the CTC prefix beam search algorithm with the beam width set to R, 10 ≤ R ≤ 50, and calculates the R streaming recognition results of t_n and the corresponding streaming scores from the speech codings of all speech blocks;
(3b3) The attention decoder in the attention re-scoring module calculates the decoding information corresponding to each streaming recognition result from the complete speech coding of t_n and the R streaming recognition results; the attention prediction layer calculates the attention score of each streaming recognition result from the decoding information, performs a weighted summation of the streaming score and the attention score of each streaming recognition result to obtain its final score, and takes the streaming recognition result with the highest final score as the final recognition result l̂_n of t_n;
(3c) Using the joint loss function L_Joint, calculate the loss value of the streaming speech recognition model in the i-th iteration from l̂_n and l_n; adopting the Adam optimization method, update the weight parameters of the streaming speech recognition module and the attention re-scoring module in the streaming speech recognition model by minimizing the loss value, obtaining the streaming speech recognition model V_i after the i-th iteration;
(3d) Judge whether i ≥ I holds; if so, a trained streaming speech recognition model V* is obtained; otherwise, let i = i + 1 and return to step (3b);
(4) Acquiring a real-time voice recognition result:
(4a) Cutting the speech stream collected in real time into a plurality of speech blocks of equal length, and using the cut speech blocks as the input of the trained streaming speech recognition model V* for forward propagation, so as to obtain a plurality of streaming recognition results and corresponding streaming scores in real time; the streaming recognition result with the highest streaming score is taken as the real-time speech recognition result;
(4b) After the speech stream ends, V* re-scores the streaming recognition results and gives the final recognition result.
Compared with the prior art, the invention has the following advantages:
the streaming speech recognition model constructed by the invention comprises an attention re-scoring module; during model training and during acquisition of the speech recognition result, the attention re-scoring module optimizes the streaming recognition results using the complete speech coding. This solves the problem in the prior art that streaming speech recognition can only use limited speech information and therefore suffers from low recognition accuracy; the invention can make full use of the complete speech information and effectively improves the recognition accuracy while keeping the low latency of streaming recognition.
Detailed Description
The invention is described in further detail below with reference to the figures and the specific embodiments.
Referring to fig. 1, the present invention includes the steps of:
step 1) acquiring a training data set:
(1a) Acquiring N pieces of noiseless speech data S = {s_1, ..., s_n, ..., s_N} from different speakers and the text content L = {l_1, ..., l_n, ..., l_N} corresponding to S, and acquiring M pieces of natural noise data F = {f_1, ..., f_m, ..., f_M} from different scenes; wherein N ≥ 200000, s_n represents the n-th piece of noiseless speech data, l_n denotes the text content of s_n, M ≥ 2000, and f_m represents the m-th piece of natural noise data;
in this embodiment, the noiseless speech data come from 400 different speakers, about 350 pieces per speaker, about 140000 pieces of noiseless speech in total, amounting to about 178 hours; the natural noise data are taken from the TUT2016 and TUT2017 data sets, about 27.25 hours of audio scene data in total, with 109 minutes of audio per noise scene; each piece of noise data is about 30 seconds long, giving about 3250 pieces of natural noise data in total;
in this embodiment, all speech data and noise data are wav files with a sampling frequency of 16 kHz;
(1b) Mix each piece of noiseless speech data s_n of S with an arbitrary piece of natural noise data in F, and combine the N pieces of training data into a training data set T = {t_1, ..., t_n, ..., t_N}, where t_n denotes the training data corresponding to s_n; each piece of noiseless speech data s_n of S is mixed with an arbitrary piece of natural noise data in F as follows:
(1b1) Initialize the minimum signal-to-noise ratio min_snr and the maximum signal-to-noise ratio max_snr allowed for the training data t_n, and the desired proportion p of noiseless speech data retained in the training data set T, with 0 ≤ p ≤ 1;
(1b2) Calculate the average power P(s_n) of each piece of noiseless speech data s_n in S and the average power P(f_k) of an arbitrary piece of natural noise data f_k in F, and calculate from P(s_n) and P(f_k) the noise coefficient γ_n corresponding to s_n:

γ_n = α_n · √( P(s_n) / ( P(f_k) · 10^(SNR_n/10) ) )

wherein k ∈ [1, M], SNR_n denotes the signal-to-noise ratio of t_n, min_snr ≤ SNR_n ≤ max_snr, and α_n denotes a threshold, 0 ≤ α_n ≤ 1;
(1b3) According to the noise coefficient γ_n, calculate the training data t_n corresponding to s_n:

t_n = s_n + γ_n · f_k.
The invention determines the noise coefficient γ_n from the signal power of the noiseless speech and the natural noise data together with the selected signal-to-noise ratio; γ_n = 0 means that the training data t_n contains only noiseless speech data, i.e. the training data set T contains part of the noiseless speech data as well as noisy speech data obtained by mixing the remaining noiseless speech data with natural noise data, which avoids training being disturbed by speech features damaged by excessively strong noise. Meanwhile, a signal-to-noise ratio that varies dynamically within a certain range greatly enriches the training data set and improves the robustness of the model in different noise scenes. When making the training data set, the proportion p of noise-added data needs to be controlled: if the proportion is too large, the performance of the model in recognizing clean speech decreases; if it is too small, the robustness of the model to noise interference decreases. In this embodiment p = 0.6; the minimum signal-to-noise ratio min_snr is set to 1 and the maximum signal-to-noise ratio max_snr is set to 10;
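As an illustrative sketch only (not part of the claimed embodiment), the mixing of step (1b) can be expressed in Python as follows; the function names `mix_with_noise` and `make_training_example` are hypothetical, and the scaling assumes the standard power-based SNR definition used above:

```python
import math
import random

def mix_with_noise(clean, noise, snr_db):
    """Scale the noise so the mixture reaches the target SNR (in dB), then add it."""
    p_clean = sum(x * x for x in clean) / len(clean)
    p_noise = sum(x * x for x in noise) / len(noise)
    gamma = math.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return [s + gamma * n for s, n in zip(clean, noise)]

def make_training_example(clean, noise, p=0.6, min_snr=1, max_snr=10, rng=random):
    """With probability p, add noise at a random SNR in [min_snr, max_snr];
    otherwise keep the clean speech unchanged."""
    if rng.random() < p:
        return mix_with_noise(clean, noise, rng.uniform(min_snr, max_snr))
    return list(clean)
```

Here a clean copy is kept with probability 1 − p, matching the role of the proportion p in this embodiment.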
step 2) constructing a streaming voice recognition model based on attention re-scoring, wherein the structure of the model is shown in FIG. 2:
(2a) Constructing a structure of a streaming voice recognition model based on attention re-scoring:
constructing a streaming speech recognition model comprising a streaming speech recognition module and an attention re-scoring module which are connected in sequence; the streaming speech recognition module comprises a fully-connected layer with position coding, a speech encoder and a connectionist temporal classification (CTC) prediction layer which are connected in sequence, wherein the speech encoder comprises a plurality of Conformer structures connected in sequence, and the CTC prediction layer adopts a fully-connected layer whose activation function is the Softmax function; the attention re-scoring module comprises an attention decoder and an attention prediction layer which are connected in sequence, wherein the attention decoder comprises a plurality of Transformer structures connected in sequence, and the attention prediction layer adopts a fully-connected layer whose activation function is the Softmax function;
in this embodiment, the fully-connected layer with position coding maps the input 320-dimensional speech features to 256-dimensional speech features using a fully-connected neural network composed of 256 neurons with a 320-dimensional input, and adds relative position coding to the 256-dimensional speech features;
in this embodiment, the speech encoder uses 12 causal-convolution-based Conformer structures with a self-attention dimension of 256; each Conformer structure comprises four modules: two feed-forward modules (FFN) at the front and back, with a multi-head self-attention module (MHSA) with 4 attention heads and a convolution module (Conv) in the middle; residual connections are added before and after each module, and each structure ends with layer normalization (LayerNorm);
in this embodiment, the attention decoder consists of 6 stacked Transformer decoders; each Transformer decoder comprises a multi-head self-attention module with 4 attention heads, a multi-head attention module with 4 attention heads and a feed-forward neural network module; residual connections are added before and after each module, and each structure ends with layer normalization;
(2b) Defining a joint loss function L_Joint for the streaming speech recognition module and the attention re-scoring module:

L_Joint = λ·L_CTC + (1 − λ)·L_Attention

wherein L_CTC represents the CTC loss function, L_Attention represents the attention loss function, and λ is a weight factor, 0 < λ < 1; L_Attention uses the KL divergence with label smoothing, calculating the KL divergence of the predicted character sequence l̂_n with respect to the real character sequence l_n, where l_n is represented by a one-hot vector with label smoothing; the label smoothing technique can effectively prevent overfitting of the model.
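For illustration, a minimal sketch of the label-smoothed target and the joint loss follows (hypothetical helper names, not part of the embodiment; λ is left as a parameter since its value is not fixed by the formula above):

```python
import math

def label_smoothed_target(vocab_size, true_idx, eps=0.1):
    """One-hot target softened by label smoothing: the true class gets 1 - eps,
    and the remaining mass eps is shared equally by the other classes."""
    off = eps / (vocab_size - 1)
    target = [off] * vocab_size
    target[true_idx] = 1.0 - eps
    return target

def kl_divergence(target, pred):
    """KL(target || pred); terms with zero target probability contribute 0."""
    return sum(t * math.log(t / p) for t, p in zip(target, pred) if t > 0)

def joint_loss(l_ctc, l_attention, lam):
    """L_Joint = lambda * L_CTC + (1 - lambda) * L_Attention."""
    return lam * l_ctc + (1.0 - lam) * l_attention
```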
Step 3), carrying out iterative training on the flow type voice recognition model V:
(3a) Let the iteration index be i and the maximum number of iterations be I, with I ≥ 200; denote the current streaming speech recognition model by V_i, and let i = 1, V = V_i;
In this embodiment, I is 200;
(3b) Forward propagation is performed in batches with the training data set T as the input of the streaming speech recognition model V:
(3b1) The speech feature extraction module extracts the Mel filter bank (FBank) features of each piece of training data t_n;
in this embodiment, the speech feature extraction module performs pre-emphasis, framing, windowing, short-time Fourier transform, power spectrum computation, Mel filter bank filtering and logarithm on each piece of training data to obtain the corresponding Mel filter bank features; a 25 ms frame length and a 10 ms frame shift are adopted during framing, the extracted Mel filter bank features are 80-dimensional, and 320-dimensional Mel filter bank features are obtained through 4-frame stacking and 3-frame downsampling;
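The 4-frame stacking and 3-frame downsampling step can be sketched as follows (a hypothetical helper, shown with 1-dimensional frames for brevity; in the embodiment each frame is an 80-dimensional FBank vector, so four stacked frames give 320 dimensions):

```python
def stack_and_downsample(frames, stack=4, stride=3):
    """Stack `stack` consecutive feature frames into one wider frame and keep
    every `stride`-th result, widening the features while reducing the rate."""
    out = []
    for i in range(0, len(frames) - stack + 1, stride):
        merged = []
        for frame in frames[i:i + stack]:
            merged.extend(frame)
        out.append(merged)
    return out
```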
(3b2) The streaming speech recognition module processes the Mel filter bank FBank features of each piece of training data t_n as follows: the fully-connected layer with position coding adds relative position coding to the FBank features of t_n, and the FBank features with position coding are evenly divided into a plurality of speech blocks; the speech encoder calculates the speech coding of each speech block block by block, and during the calculation the field of view of the multi-head self-attention of each Conformer structure in the speech encoder is restricted, so that when processing each speech block, attention can only be computed over the current speech block and a fixed number of speech blocks preceding it; the speech codings of all speech blocks form the speech coding of t_n; the CTC prediction layer adopts the CTC prefix beam search algorithm with the beam width set to R, 10 ≤ R ≤ 50, and calculates the R streaming recognition results of t_n and the corresponding streaming scores from the speech codings of all speech blocks;
The fully-connected layer with position coding maps the input 320-dimensional Mel filter bank features to 256 dimensions while adding relative position coding to the input, thereby reducing the input dimension and hence the parameter count of the whole model.
The Conformer structure in the speech encoder replaces the original depthwise separable convolution with a causal convolution. The causal convolution cuts off the future speech information on the right side, attends only to information before the current time step, and does not depend on information after it. The causal-convolution-based Conformer structure is therefore better suited to streaming speech data.
On the other hand, to adapt to streaming recognition, the multi-headed self-attention of the former in the speech encoder can only notice the information before the current time step; in the invention, attention of the blocks is used, namely, each former can only calculate the attention of the current voice block to the current voice block and the voice block which is prior to the current voice block in time sequence.
During streaming recognition, a larger speech block lets the model receive more speech information, so the recognition accuracy is higher, but the model then depends on more future information, increasing the recognition delay and reducing real-time performance; a smaller speech block provides less speech information, so the recognition accuracy decreases but the real-time performance improves. Existing training methods use speech blocks of a fixed length, and changing the speech block length greatly reduces the accuracy. The invention dynamically changes the speech block length during training, so that the model can adapt to speech blocks of different lengths; in use, different delays and accuracies can be obtained by adjusting the speech block length according to actual requirements. In this embodiment, the length chunksize of each speech block is set to a random integer between 1 and 30, i.e., each speech block contains the Mel filter bank features of chunksize frames; the corresponding speech block duration is 30 milliseconds to 900 milliseconds, and the average latency is 15 milliseconds to 450 milliseconds.
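The block-wise attention restriction and the dynamic chunk length can be sketched as follows (hypothetical helpers, not the embodiment's implementation; `num_left_chunks` plays the role of the fixed number of preceding speech blocks each block may attend to):

```python
import random

def sample_chunk_size(rng=random):
    """Dynamic-chunk training: draw a fresh chunk length (in frames) per batch."""
    return rng.randint(1, 30)

def chunk_attention_mask(num_frames, chunk_size, num_left_chunks):
    """mask[i][j] is True when frame i may attend to frame j, i.e. j lies in
    frame i's own chunk or in at most num_left_chunks chunks before it."""
    mask = [[False] * num_frames for _ in range(num_frames)]
    for i in range(num_frames):
        chunk = i // chunk_size
        lo = max(0, (chunk - num_left_chunks) * chunk_size)
        hi = min(num_frames, (chunk + 1) * chunk_size)
        for j in range(lo, hi):
            mask[i][j] = True
    return mask
```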
The CTC prediction layer adopts the prefix beam search algorithm, with the initial recognition result being a blank label; it searches block by block starting from the first speech block; when processing each speech block, it finds the highest-probability paths among the existing results, merges consecutive repeated characters and removes blank labels, then fuses paths with identical results and keeps the beam-width results with the highest probabilities, taking the most probable one as the current streaming output; this process is repeated until all speech blocks have been processed, yielding beam-width streaming recognition results and their corresponding probabilities, which serve as the streaming scores. In this embodiment the beam width is set to 10; a larger beam width gives more accurate results but also incurs more delay.
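A compact sketch of CTC prefix beam search over per-frame log-probabilities follows; this is the textbook algorithm rather than the exact implementation of the embodiment, and the function names are hypothetical:

```python
import math
from collections import defaultdict

def logaddexp(a, b):
    """Numerically stable log(exp(a) + exp(b))."""
    if a == -math.inf:
        return b
    if b == -math.inf:
        return a
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))

def ctc_prefix_beam_search(log_probs, beam_width, blank=0):
    """log_probs: per-frame lists of log-probabilities over the vocabulary.
    Returns up to beam_width (prefix, log_score) pairs, best first."""
    # Each prefix stores (log P(paths ending in blank), log P(ending in non-blank)).
    beams = {(): (0.0, -math.inf)}
    for frame in log_probs:
        new = defaultdict(lambda: (-math.inf, -math.inf))
        for prefix, (pb, pnb) in beams.items():
            total = logaddexp(pb, pnb)
            for c, lp in enumerate(frame):
                if c == blank:
                    nb, nn = new[prefix]
                    new[prefix] = (logaddexp(nb, total + lp), nn)
                elif prefix and prefix[-1] == c:
                    # Repeating the last character without an intervening blank
                    # keeps the prefix unchanged...
                    nb, nn = new[prefix]
                    new[prefix] = (nb, logaddexp(nn, pnb + lp))
                    # ...while extending the prefix is only allowed after a blank.
                    ext = prefix + (c,)
                    nb, nn = new[ext]
                    new[ext] = (nb, logaddexp(nn, pb + lp))
                else:
                    ext = prefix + (c,)
                    nb, nn = new[ext]
                    new[ext] = (nb, logaddexp(nn, total + lp))
        beams = dict(sorted(new.items(),
                            key=lambda kv: -logaddexp(*kv[1]))[:beam_width])
    return [(p, logaddexp(pb, pnb)) for p, (pb, pnb) in beams.items()]
```

With a two-frame input where the blank and a single label are equally likely, the collapsed prefix containing the label accumulates probability 0.75 and ranks first.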
(3b3) The attention decoder in the attention re-scoring module calculates the decoding information corresponding to each streaming recognition result from the complete speech coding of t_n and the R streaming recognition results; the attention prediction layer calculates the attention score of each streaming recognition result from the decoding information, performs a weighted summation of the streaming score and the attention score of each streaming recognition result to obtain its final score, and takes the streaming recognition result with the highest final score as the final recognition result l̂_n of t_n;
The attention decoder adopts the Teacher-Forcing mode during decoding: taking the multiple streaming recognition results as labels, it decodes the speech coding of the complete speech, so that multiple candidates can be re-scored in a single pass; this avoids an autoregressive decoding process and gives better real-time performance.
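The re-scoring step that combines the streaming (CTC) score with the attention score can be sketched as follows (hypothetical helper; the weight is a free parameter, not a value fixed by the embodiment):

```python
def rescore(hypotheses, stream_weight=0.5):
    """hypotheses: list of (text, streaming_score, attention_score) triples.
    The final score is a weighted sum; the best-scoring text is returned."""
    def final_score(h):
        return stream_weight * h[1] + (1.0 - stream_weight) * h[2]
    return max(hypotheses, key=final_score)[0]
```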
(3c) Using the joint loss function L_Joint, calculate the loss value of the streaming speech recognition model in the i-th iteration from l̂_n and l_n; adopting the Adam optimization method, update the weight parameters of the streaming speech recognition module and the attention re-scoring module in the streaming speech recognition model by minimizing the loss value, obtaining the streaming speech recognition model V_i after the i-th iteration;
(3d) Judge whether i ≥ I holds; if so, a trained streaming speech recognition model V* is obtained; otherwise, let i = i + 1 and return to step (3b);
step 4), acquiring a real-time voice recognition result:
(4a) Cut the speech stream collected in real time into a plurality of speech blocks of equal length, and use the cut speech blocks as the input of the trained streaming speech recognition model V* for forward propagation, obtaining a plurality of streaming recognition results and corresponding streaming scores in real time; the streaming recognition result with the highest streaming score is taken as the real-time speech recognition result;
(4b) After the speech stream ends, V* re-scores the streaming recognition results and gives the final recognition result.
The length of the speech block can be determined according to the computing power of the actual device and its different requirements on accuracy and delay; its variation range corresponds to the range of speech block lengths used in training, which in this embodiment is 30 milliseconds to 900 milliseconds;
in actual recognition, the output state of each Conformer structure in the speech encoder can be cached; when the next speech block is computed, the cached output states are directly spliced into the hidden state of the corresponding Conformer structure as historical information, avoiding repeated computation of historical information. The block-wise attention used can compute the multi-frame features within a speech block at once, further reducing latency.
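A minimal sketch of the per-layer output cache described above follows (a hypothetical class; a real implementation would cache encoder tensors rather than Python lists, but the splicing logic is the same):

```python
class EncoderStateCache:
    """Keeps the most recent `left_frames` output frames per encoder layer so the
    next speech block can reuse history instead of recomputing it."""
    def __init__(self, num_layers, left_frames):
        self.left_frames = left_frames
        self.history = [[] for _ in range(num_layers)]

    def extend(self, layer, chunk_outputs):
        """Prepend cached history to a new chunk's outputs, then refresh the cache."""
        full = self.history[layer] + list(chunk_outputs)
        self.history[layer] = full[-self.left_frames:]
        return full
```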