CN115240645A - Streaming speech recognition method based on attention re-scoring

Streaming speech recognition method based on attention re-scoring

Info

Publication number
CN115240645A
Authority
CN
China
Prior art keywords
attention
voice
speech
streaming
recognition
Prior art date
Legal status
Pending
Application number
CN202210864507.6A
Other languages
Chinese (zh)
Inventor
杜军朝 (Du Junchao)
刘惠 (Liu Hui)
张志鹏 (Zhang Zhipeng)
于英涛 (Yu Yingtao)
潘江涛 (Pan Jiangtao)
Current Assignee
Xidian University
Original Assignee
Xidian University
Priority date: 2022-07-21
Filing date: 2022-07-21
Publication date: 2022-10-25
Application filed by Xidian University
Priority: CN202210864507.6A
Publication: CN115240645A
Legal status: Pending


Classifications

    • G10L15/00 Speech recognition
    • G10L15/01 Assessment or evaluation of speech recognition systems
    • G10L15/06 Creation of reference templates; training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L15/16 Speech classification or search using artificial neural networks


Abstract

The invention provides a streaming speech recognition method based on attention re-scoring, comprising the following steps: acquiring a training data set; constructing a streaming speech recognition model based on attention re-scoring; iteratively training the streaming speech recognition model; and acquiring real-time speech recognition results. The constructed streaming speech recognition model contains an attention re-scoring module that uses the encoding of the complete utterance to refine the streaming recognition results, both while training the model and while obtaining real-time recognition results. The streaming recognition process can therefore make full use of the complete speech information, effectively improving recognition accuracy while preserving the low latency of streaming recognition.

Description

Streaming speech recognition method based on attention re-scoring
Technical Field
The invention belongs to the technical field of speech recognition and relates to a speech recognition method, in particular to a streaming speech recognition method based on attention re-scoring.
Background
The task of speech recognition is to convert speech into a machine-processable form such as text or commands, and it is one of the most important modes of human-computer interaction; with the continued spread of smartphones and smart speakers, its importance keeps growing. Speech recognition is the first and most critical step of voice interaction, and fast, accurate recognition markedly improves the experience of all kinds of voice interfaces.
Depending on whether recognition happens in real time, speech recognition can be divided into streaming and non-streaming recognition. Non-streaming recognition starts only after the complete utterance has been received, so it can exploit the full speech signal and give accurate results. Streaming recognition must present recognition results to the user in real time while the user's speech is still arriving. Its low latency greatly improves the user experience, but during streaming recognition the model can only use limited speech information, so its accuracy is usually lower than that of non-streaming recognition: the low latency of streaming recognition and the high accuracy of non-streaming recognition are hard to reconcile. Moreover, different deployment scenarios impose different latency and accuracy requirements on streaming recognition, and models often have to be retrained for each scenario.
The patent application with publication number CN113327603A, entitled "speech recognition method, apparatus, electronic device and computer readable storage medium", discloses a speech recognition method that first extracts the feature sequence of the speech signal to be recognized and divides it into several speech blocks; the feature sequence is fed into a self-attention-based encoder in which each self-attention layer can only attend to one block before and/or after the current block, i.e. only limited speech information enters the computation. The resulting speech encoding is passed to a connectionist temporal classification (CTC) module that processes it block by block and predicts character outputs; the encoding is segmented wherever the predicted character output changes, yielding one encoding segment per character, and an attention-based decoder decodes each segment to obtain the final recognition result. By locating the encoding segment of each character with CTC and decoding each segment with an attention-based decoder, the method improves accuracy to some extent while keeping the low latency of streaming recognition; however, it still uses only limited speech information and never exploits the complete utterance, so its recognition accuracy remains low.
Disclosure of Invention
The invention aims to overcome the above shortcomings of the prior art and provides a streaming speech recognition method based on attention re-scoring, in order to solve the technical problem that the prior art cannot combine the high accuracy of non-streaming recognition with the low latency of streaming recognition.
To achieve this purpose, the technical scheme adopted by the invention comprises the following steps:
(1) Acquiring a training data set:
(1a) Obtain N pieces of noise-free speech data S = {s_1, ..., s_n, ..., s_N} from different speakers, the text content L = {l_1, ..., l_n, ..., l_N} corresponding to S, and M pieces of natural noise data F = {f_1, ..., f_m, ..., f_M} from different scenes, where N ≥ 200000, s_n denotes the n-th piece of noise-free speech, l_n denotes the text content of s_n, M ≥ 2000, and f_m denotes the m-th piece of natural noise;
(1b) Mix each piece of noise-free speech s_n in S with an arbitrary piece of natural noise in F to obtain training data, and combine the N pieces of training data into a training data set T = {t_1, ..., t_n, ..., t_N}, where t_n denotes the training data corresponding to s_n;
(2) Constructing a streaming speech recognition model based on attention re-scoring:
(2a) Construct the structure of the streaming speech recognition model:
the model comprises a speech feature extraction module, a streaming speech recognition module and an attention re-scoring module connected in sequence. The streaming speech recognition module comprises a fully-connected layer with position coding, a speech encoder and a connectionist temporal classification (CTC) prediction layer connected in sequence; the speech encoder comprises several sequentially connected Conformer structures, and the CTC prediction layer is a fully-connected layer whose activation function is the Softmax function. The attention re-scoring module comprises an attention decoder and an attention prediction layer connected in sequence; the attention decoder comprises several sequentially connected Transformer structures, and the attention prediction layer is a fully-connected layer whose activation function is the Softmax function;
(2b) Define the loss function L_Joint of the attention-re-scoring-based streaming speech recognition model:

L_Joint = λ·L_CTC + (1 - λ)·L_Attention

where L_CTC denotes the CTC loss function, L_Attention denotes the attention loss function, and λ is a weight factor with 0 < λ < 1; L_Attention uses a KL-divergence loss with label smoothing;
(3) Iterative training of the streaming speech recognition model V:
(3a) Initialize the iteration counter i and the maximum iteration number I, with I ≥ 200; denote the current streaming speech recognition model by V_i and let i = 1, V = V_i;
(3b) Propagate the training data set T forward through the streaming speech recognition model V in batches:
(3b1) The speech feature extraction module extracts the Mel filter bank (FBank) features of each piece of training data t_n;
(3b2) The streaming speech recognition module feeds the FBank features of each t_n through the fully-connected layer with position coding, which adds relative position coding, and divides the position-coded features evenly into several speech blocks. The speech encoder computes the speech encoding block by block; during this computation the field of view of the multi-head self-attention in each Conformer structure is restricted, so that while processing a block it may only attend to the current block and a fixed number of preceding blocks. The encodings of all blocks together form the speech encoding of t_n. The CTC prediction layer applies the CTC prefix beam search algorithm with beam width R, 10 ≤ R ≤ 50, and computes the streaming recognition results of t_n and the corresponding streaming scores from the encodings of all speech blocks;
(3b3) The attention decoder in the attention re-scoring module computes, from the complete speech encoding of t_n and the R streaming recognition results, the decoding information corresponding to each streaming result; the attention prediction layer computes an attention score for each streaming result from this decoding information, forms the final score of each result as a weighted sum of its streaming score and attention score, and takes the streaming result with the highest final score as the final recognition result l'_n of t_n;
(3c) Using the joint loss function L_Joint, compute the loss of the streaming speech recognition model at the i-th iteration from l'_n and l_n; minimize this loss with the Adam optimization method to update the weight parameters of the streaming speech recognition module and the attention re-scoring module, obtaining the streaming speech recognition model V_i after the i-th iteration;
(3d) Judge whether i ≥ I: if so, the trained streaming speech recognition model V* is obtained; otherwise let i = i + 1 and return to step (3b);
(4) Acquiring real-time speech recognition results:
(4a) Cut the speech stream collected in real time into several speech blocks of equal length and feed the cut blocks as input to the trained streaming speech recognition model V*; forward propagation yields several streaming recognition results with their streaming scores in real time, and the streaming result with the highest streaming score serves as the real-time recognition result;
(4b) After the speech stream ends, V* re-scores the streaming recognition results and gives the final recognition result.
Compared with the prior art, the invention has the following advantage:
the streaming speech recognition model constructed by the invention contains an attention re-scoring module that refines the streaming recognition results using the encoding of the complete utterance, both while training the model and while acquiring recognition results. This solves the problem of the prior art that streaming recognition can only use limited speech information and therefore has low accuracy: the complete speech information is fully exploited, and recognition accuracy is effectively improved while the low latency of streaming recognition is preserved.
Drawings
FIG. 1 is a flow chart of an implementation of the present invention;
FIG. 2 is a diagram of the streaming speech recognition model based on attention re-scoring constructed by the present invention.
Detailed Description
The invention is described in further detail below with reference to the figures and the specific embodiments.
Referring to fig. 1, the present invention includes the steps of:
Step 1) Acquiring a training data set:
(1a) Obtain N pieces of noise-free speech data S = {s_1, ..., s_n, ..., s_N} from different speakers, the text content L = {l_1, ..., l_n, ..., l_N} corresponding to S, and M pieces of natural noise data F = {f_1, ..., f_m, ..., f_M} from different scenes, where N ≥ 200000, s_n denotes the n-th piece of noise-free speech, l_n denotes the text content of s_n, M ≥ 2000, and f_m denotes the m-th piece of natural noise;
In this embodiment, the noise-free speech comes from 400 different speakers, with about 350 utterances per speaker, roughly 140000 utterances and about 178 hours in total. The natural noise data are taken from the TUT2016 and TUT2017 data sets, about 27.25 hours of acoustic scene audio in total; each noise scene contributes 109 minutes of audio, each noise clip lasts about 30 seconds, and there are about 3250 noise clips altogether.
In this embodiment, all speech and noise data are wav files sampled at 16 kHz.
(1b) Mix each piece of noise-free speech s_n in S with an arbitrary piece of natural noise in F, and combine the N pieces of training data into a training data set T = {t_1, ..., t_n, ..., t_N}, where t_n denotes the training data corresponding to s_n. The mixing is implemented by the following steps:
(1b1) Initialize the minimum signal-to-noise ratio min_snr and the maximum signal-to-noise ratio max_snr allowed for the training data t_n, and the desired proportion p of training data to be mixed with noise, with 0 ≤ p ≤ 1;
(1b2) Compute the average power of each piece of noise-free speech s_n in S,

P_{s_n} = (1/T_{s_n}) · Σ_t s_n(t)²,

and the average power of an arbitrary piece of natural noise f_k in F,

P_{f_k} = (1/T_{f_k}) · Σ_t f_k(t)²,

where T_{s_n} and T_{f_k} are the numbers of samples; from P_{s_n} and P_{f_k}, compute the noise coefficient γ_n corresponding to s_n:

γ_n = 0, if α_n > p;
γ_n = sqrt( P_{s_n} / (P_{f_k} · 10^(SNR_n/10)) ), otherwise,

where k ∈ [1, M], SNR_n denotes the signal-to-noise ratio of t_n with min_snr ≤ SNR_n ≤ max_snr, and α_n denotes a random threshold with 0 ≤ α_n ≤ 1;
(1b3) From the noise coefficient γ_n, compute the training data t_n corresponding to s_n:

t_n = s_n + γ_n · f_k
The invention determines the noise coefficient γ_n from the signal power of the noise-free speech and the natural noise data and from the selected signal-to-noise ratio. γ_n = 0 means the training data t_n consists of noise-free speech only, so the training data set T contains some purely clean utterances alongside noisy utterances obtained by mixing clean speech with natural noise; this prevents training from being disturbed by speech features destroyed by excessive noise. Meanwhile, a signal-to-noise ratio that varies dynamically within a certain range greatly enriches the training data set and improves the robustness of the model in different noise scenarios. When building the training data set, the proportion p of noise-added data must be controlled: if it is too large, the model's performance on clean speech degrades; if it is too small, the model's robustness against noise degrades. This embodiment sets p = 0.6, the minimum signal-to-noise ratio min_snr to 1, and the maximum signal-to-noise ratio max_snr to 10;
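As an illustration only (not part of the claimed method), the mixing of steps (1b1) to (1b3) can be sketched in Python as follows; the function and parameter names are assumptions, and length matching of the noise clip is crudely simplified:

```python
import numpy as np

def mix_with_noise(s, f, p=0.6, min_snr=1.0, max_snr=10.0, rng=np.random):
    """Mix clean speech s with noise clip f at a random SNR drawn from
    [min_snr, max_snr] dB; keep s clean with probability (1 - p)."""
    if rng.random() > p:                        # retain a (1 - p) share of clean utterances
        return s                                # gamma_n = 0: noise-free training sample
    snr_db = rng.uniform(min_snr, max_snr)      # per-utterance SNR_n in dB
    f = np.resize(f, s.shape)                   # crude length matching for the sketch
    p_s = np.mean(s ** 2)                       # average power of the clean speech
    p_f = np.mean(f ** 2)                       # average power of the noise clip
    gamma = np.sqrt(p_s / (p_f * 10 ** (snr_db / 10)))   # noise coefficient gamma_n
    return s + gamma * f                        # t_n = s_n + gamma_n * f_k
```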
Step 2) Constructing a streaming speech recognition model based on attention re-scoring, whose structure is shown in FIG. 2:
(2a) Construct the structure of the streaming speech recognition model:
the model comprises a streaming speech recognition module and an attention re-scoring module connected in sequence. The streaming speech recognition module comprises a fully-connected layer with position coding, a speech encoder and a connectionist temporal classification (CTC) prediction layer connected in sequence; the speech encoder comprises several sequentially connected Conformer structures, and the CTC prediction layer is a fully-connected layer whose activation function is the Softmax function. The attention re-scoring module comprises an attention decoder and an attention prediction layer connected in sequence; the attention decoder comprises several sequentially connected Transformer structures, and the attention prediction layer is a fully-connected layer whose activation function is the Softmax function;
In this embodiment, the fully-connected layer with position coding uses a fully-connected neural network of 256 neurons with 320-dimensional input to map the 320-dimensional speech features to 256 dimensions, and adds relative position coding to the 256-dimensional features;
In this embodiment, the speech encoder uses 12 causal-convolution-based Conformer structures with a self-attention dimension of 256. Each Conformer structure contains four modules: a feed-forward module (FFN) at the front and another at the back, with a multi-head self-attention module (MHSA) with 4 attention heads and a convolution module (Conv) in between; a residual connection is added around each module, and each structure ends with layer normalization (LayerNorm);
In this embodiment, the attention decoder consists of 6 stacked Transformer decoders. Each Transformer decoder contains a multi-head self-attention module with 4 attention heads, a multi-head cross-attention module with 4 attention heads, and a feed-forward module; a residual connection is added around each module, and each structure ends with layer normalization;
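For orientation, a compact PyTorch sketch of how these modules compose is given below. The stock `nn.TransformerEncoder` merely stands in for the 12 causal-convolution Conformer structures described above, relative position coding is omitted, and all names and dimensions other than those stated in this embodiment are assumptions:

```python
import torch.nn as nn

class StreamingASR(nn.Module):
    """Sketch: front-end FC (320 -> 256) + chunked encoder + CTC head,
    plus an attention decoder used to re-score full utterances."""
    def __init__(self, feat_dim=320, d_model=256, vocab=5000,
                 enc_layers=12, dec_layers=6, heads=4):
        super().__init__()
        self.proj = nn.Linear(feat_dim, d_model)            # 320-d FBank -> 256-d
        enc = nn.TransformerEncoderLayer(d_model, heads, 2048, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc, enc_layers)   # stand-in for 12 Conformer blocks
        self.ctc_head = nn.Linear(d_model, vocab)           # CTC prediction layer
        dec = nn.TransformerDecoderLayer(d_model, heads, 2048, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec, dec_layers)   # attention re-scoring decoder
        self.att_head = nn.Linear(d_model, vocab)           # attention prediction layer
        self.embed = nn.Embedding(vocab, d_model)

    def forward(self, feats, hyp_tokens, chunk_mask):
        h = self.encoder(self.proj(feats), mask=chunk_mask)  # block-wise speech encoding
        ctc_logits = self.ctc_head(h)                        # per-frame token logits
        d = self.decoder(self.embed(hyp_tokens), h)          # teacher-forced re-scoring pass
        return ctc_logits, self.att_head(d)
```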
(2b) Define the loss function L_Joint of the streaming speech recognition module and the attention re-scoring module:

L_Joint = λ·L_CTC + (1 - λ)·L_Attention

where L_CTC denotes the CTC loss function, L_Attention denotes the attention loss function, and λ is a weight factor with 0 < λ < 1. L_Attention uses a KL-divergence loss with label smoothing: it computes the KL divergence of the predicted character sequence l'_n with respect to the true character sequence l_n, where l_n is represented by one-hot vectors with label smoothing; the label smoothing technique effectively prevents the model from overfitting.
Step 3) Iterative training of the streaming speech recognition model V:
(3a) Initialize the iteration counter i and the maximum iteration number I, with I ≥ 200; denote the current streaming speech recognition model by V_i and let i = 1, V = V_i;
In this embodiment, I is set to 200;
(3b) Propagate the training data set T forward through the streaming speech recognition model V in batches:
(3b1) The speech feature extraction module extracts the Mel filter bank (FBank) features of each piece of training data t_n;
In this embodiment, the speech feature extraction module performs pre-emphasis, framing, windowing, short-time Fourier transform, power-spectrum computation, Mel filtering and logarithm-taking on each piece of training data to obtain its Mel filter bank features. Framing uses a 25 ms frame length with a 10 ms frame shift, and the extracted Mel filter bank features have 80 dimensions; stacking 4 consecutive frames and downsampling by a factor of 3 yields 320-dimensional Mel filter bank features;
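A sketch of this front-end with torchaudio (the Kaldi-compatible `fbank` routine performs the pre-emphasis, framing, windowing and log-Mel steps internally; the stacking and subsampling indices follow this embodiment's 4-frame stack and factor-3 downsampling):

```python
import torch
import torchaudio

def extract_features(wav_path):
    """80-d FBank at 25 ms frames / 10 ms shift, then stack 4 consecutive
    frames and keep every 3rd -> 320-d features at a 30 ms frame rate."""
    wav, sr = torchaudio.load(wav_path)                  # 16 kHz mono wav assumed
    fbank = torchaudio.compliance.kaldi.fbank(
        wav, num_mel_bins=80, frame_length=25.0,
        frame_shift=10.0, sample_frequency=sr)           # (T, 80) log-Mel features
    T = fbank.size(0) - 3
    stacked = torch.cat([fbank[i:i + T] for i in range(4)], dim=1)  # (T-3, 320)
    return stacked[::3]                                  # factor-3 downsampling
```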
(3b2) The streaming speech recognition module feeds the FBank features of each training datum t_n through the fully-connected layer with position coding, which adds relative position coding, and divides the position-coded features evenly into several speech blocks. The speech encoder computes the speech encoding block by block; during this computation the field of view of the multi-head self-attention in each Conformer structure is restricted, so that while processing a block it may only attend to the current block and a fixed number of preceding blocks. The encodings of all blocks together form the speech encoding of t_n. The CTC prediction layer applies the CTC prefix beam search algorithm with beam width R, 10 ≤ R ≤ 50, and computes the streaming recognition results of t_n and the corresponding streaming scores from the encodings of all speech blocks.
The fully-connected layer with position coding maps the input 320-dimensional Mel filter bank features to 256 dimensions while adding relative position coding to the input, reducing the input dimension and thereby the parameter count of the whole model.
The Conformer structures in the speech encoder replace the original depthwise separable convolution with a causal convolution. Causal convolution cuts off the future speech information on the right side: it attends only to information before the current time step and does not depend on anything after it, which makes the causal-convolution-based Conformer structure better suited to streaming speech data.
Likewise, to fit streaming recognition, the multi-head self-attention of the Conformer structures in the speech encoder may only look at information before the current time step. The invention uses block-wise attention: each Conformer structure computes attention only from the current speech block to itself and to the speech blocks that precede it in time.
During streaming recognition, a larger speech block lets the model receive more speech information and raises recognition accuracy, but it depends on more future information, which increases recognition delay and reduces real-time performance; a smaller speech block provides less speech information and lowers accuracy, but improves real-time performance. Existing training methods use a fixed block length, and changing the block length at inference time sharply reduces accuracy. The invention varies the block length dynamically during training so that the model adapts to blocks of different lengths; in deployment, different latency/accuracy trade-offs are obtained simply by adjusting the block length according to actual requirements. In this embodiment, the length chunksize of each speech block is a random integer between 1 and 30, i.e. each block contains the Mel filter bank features of chunksize frames; the corresponding block duration is 30 to 900 milliseconds and the average latency is 15 to 450 milliseconds.
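The restricted field of view and the dynamic chunk size can be sketched as an attention mask (a hedged sketch; `left_chunks` expresses the "fixed number of preceding blocks" and is an assumed parameter name):

```python
import torch

def chunk_attention_mask(num_frames, chunk_size, left_chunks=-1):
    """Boolean mask (True = may attend): frame i sees its own chunk and a
    fixed number of past chunks (all past chunks when left_chunks < 0)."""
    chunk_id = torch.arange(num_frames) // chunk_size    # chunk index of each frame
    q, k = chunk_id.unsqueeze(1), chunk_id.unsqueeze(0)
    mask = k <= q                                        # never attend to future chunks
    if left_chunks >= 0:
        mask &= k >= q - left_chunks                     # bounded history window
    return mask

# training: draw a random chunk size per batch so one model serves many latencies
chunk = int(torch.randint(1, 31, (1,)))                  # 1..30 frames per block
mask = chunk_attention_mask(num_frames=120, chunk_size=chunk)
```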
The CTC prediction layer uses the prefix beam search algorithm, with the initial recognition result being the blank label. Starting from the first speech block, it searches block by block: while processing each block it extends the most probable existing prefixes, merging repeated characters and removing blank labels, and finally fuses paths that yield the same prefix, keeping the beamwidth most probable results; the most probable result is taken as the current streaming output. This is repeated until all speech blocks have been processed, yielding beamwidth streaming recognition results together with their probabilities, which serve as the streaming scores. In this embodiment, the beam width is set to 10; a larger beam width gives more accurate results but also introduces more delay.
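A self-contained sketch of CTC prefix beam search over per-frame posteriors (the block-synchronous variant described above simply feeds this loop chunk by chunk; probabilities are kept in the linear domain for readability, whereas a practical implementation would work in log space):

```python
import collections
import numpy as np

def ctc_prefix_beam_search(probs, beam=10, blank=0):
    """probs: (T, V) per-frame posteriors. Tracks p_b / p_nb, the probability
    of each prefix ending / not ending in blank, so that repeated characters
    and blanks are merged correctly. Returns [(prefix, score)] best-first."""
    beams = {(): (1.0, 0.0)}                             # prefix -> (p_blank, p_non_blank)
    for t in range(probs.shape[0]):
        nxt = collections.defaultdict(lambda: [0.0, 0.0])
        for prefix, (pb, pnb) in beams.items():
            for v in range(probs.shape[1]):
                p = probs[t, v]
                if v == blank:
                    nxt[prefix][0] += (pb + pnb) * p     # blank keeps the prefix
                elif prefix and v == prefix[-1]:
                    nxt[prefix][1] += pnb * p            # repeat without blank merges
                    nxt[prefix + (v,)][1] += pb * p      # repeat after blank extends
                else:
                    nxt[prefix + (v,)][1] += (pb + pnb) * p
        beams = dict(sorted(nxt.items(), key=lambda kv: -sum(kv[1]))[:beam])
    return sorted(((pre, sum(s)) for pre, s in beams.items()), key=lambda x: -x[1])
```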
(3b3) The attention decoder in the attention re-scoring module computes, from the complete speech encoding of t_n and the R streaming recognition results, the decoding information corresponding to each streaming result; the attention prediction layer computes an attention score for each streaming result from this decoding information, forms the final score of each result as a weighted sum of its streaming score and attention score, and takes the streaming result with the highest final score as the final recognition result l'_n of t_n.
The attention decoder decodes in Teacher-Forcing mode: the streaming recognition results serve as labels while the encoding of the complete utterance is decoded, so all candidate results are re-scored in a single forward pass. This avoids an autoregressive decoding process and gives better real-time performance.
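A sketch of the re-scoring step under stated assumptions: `decoder_forward` is a hypothetical helper wrapping the attention decoder plus prediction layer, and the hypotheses are shown in a loop for clarity although, as described above, they can be padded into a single teacher-forced batch:

```python
import torch

def rescore(model, enc_out, hyps, ctc_scores, weight=0.5, sos=1, eos=2):
    """Combine each streaming hypothesis's CTC score with its teacher-forced
    attention log-probability and return the best-scoring hypothesis."""
    best, best_score = None, float("-inf")
    for hyp, ctc_score in zip(hyps, ctc_scores):
        tokens = torch.tensor([[sos] + list(hyp)])       # decoder input, teacher forcing
        logits = model.decoder_forward(tokens, enc_out)  # hypothetical helper: (1, len+1, V)
        logp = logits.log_softmax(-1)[0]
        att_score = sum(logp[i, tok].item()              # log-prob of each reference token
                        for i, tok in enumerate(list(hyp) + [eos]))
        score = weight * att_score + (1 - weight) * ctc_score
        if score > best_score:
            best, best_score = hyp, score
    return best, best_score
```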
(3c) Using the joint loss function L_Joint, compute the loss of the streaming speech recognition model at the i-th iteration from l'_n and l_n; minimize this loss with the Adam optimization method to update the weight parameters of the streaming speech recognition module and the attention re-scoring module, obtaining the streaming speech recognition model V_i after the i-th iteration;
(3d) Judge whether i ≥ I: if so, the trained streaming speech recognition model V* is obtained; otherwise let i = i + 1 and return to step (3b);
Step 4) Acquiring real-time speech recognition results:
(4a) Cut the speech stream collected in real time into several speech blocks of equal length and feed the cut blocks as input to the trained streaming speech recognition model V*; forward propagation yields several streaming recognition results with their streaming scores in real time, and the streaming result with the highest streaming score serves as the real-time recognition result;
(4b) After the speech stream ends, V* re-scores the streaming recognition results and gives the final recognition result.
The speech block length can be chosen according to the computing power of the actual device and its accuracy and latency requirements; its range corresponds to the range of block lengths used in training, which this embodiment sets to 30 to 900 milliseconds;
During actual recognition, the output state of each Conformer structure in the speech encoder can be cached; when the next speech block is computed, the cached output states are spliced directly into the hidden state of the corresponding Conformer structure as historical information, avoiding repeated computation over the history. The block-wise attention used here computes the multi-frame features within a speech block all at once, further reducing latency.
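A heavily simplified sketch of the caching idea (an assumption-laden illustration: it re-runs the wrapped layer over a bounded history window, whereas a production implementation would splice the cached states directly into the attention keys/values of each Conformer structure to avoid any recomputation):

```python
import torch

class CachedLayer(torch.nn.Module):
    """Keep past hidden states as history context for the next chunk."""
    def __init__(self, layer, max_history_frames=64):
        super().__init__()
        self.layer = layer
        self.cache = None                          # (B, H, D) cached hidden states
        self.max_history_frames = max_history_frames

    def forward(self, chunk):                      # chunk: (B, C, D) new frames only
        hist = chunk if self.cache is None else torch.cat([self.cache, chunk], dim=1)
        out = self.layer(hist)[:, -chunk.size(1):]                # outputs for new frames only
        self.cache = hist[:, -self.max_history_frames:].detach()  # bounded history buffer
        return out
```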

Claims (5)

1. A streaming speech recognition method based on attention re-scoring, characterized by comprising the following steps:
(1) Acquiring a training data set:
(1a) Obtain N pieces of noise-free speech data S = {s_1, ..., s_n, ..., s_N} from different speakers, the text content L = {l_1, ..., l_n, ..., l_N} corresponding to S, and M pieces of natural noise data F = {f_1, ..., f_m, ..., f_M} from different scenes, where N ≥ 200000, s_n denotes the n-th piece of noise-free speech, l_n denotes the text content of s_n, M ≥ 2000, and f_m denotes the m-th piece of natural noise;
(1b) Mix each piece of noise-free speech s_n in S with an arbitrary piece of natural noise in F to obtain training data, and combine the N pieces of training data into a training data set T = {t_1, ..., t_n, ..., t_N}, where t_n denotes the training data corresponding to s_n;
(2) Constructing a streaming speech recognition model based on attention re-scoring:
(2a) Construct the structure of the streaming speech recognition model:
the model comprises a speech feature extraction module, a streaming speech recognition module and an attention re-scoring module connected in sequence. The streaming speech recognition module comprises a fully-connected layer with position coding, a speech encoder and a connectionist temporal classification (CTC) prediction layer connected in sequence; the speech encoder comprises several sequentially connected Conformer structures, and the CTC prediction layer is a fully-connected layer whose activation function is the Softmax function. The attention re-scoring module comprises an attention decoder and an attention prediction layer connected in sequence; the attention decoder comprises several sequentially connected Transformer structures, and the attention prediction layer is a fully-connected layer whose activation function is the Softmax function;
(2b) Define the loss function L_Joint of the attention-re-scoring-based streaming speech recognition model:

L_Joint = λ·L_CTC + (1 - λ)·L_Attention

where L_CTC denotes the CTC loss function, L_Attention denotes the attention loss function, and λ is a weight factor with 0 < λ < 1; L_Attention uses a KL-divergence loss with label smoothing;
(3) Iterative training of the streaming speech recognition model V:
(3a) Initialize the iteration counter i and the maximum iteration number I, with I ≥ 200; denote the current streaming speech recognition model by V_i and let i = 1, V = V_i;
(3b) Propagate the training data set T forward through the streaming speech recognition model V in batches:
(3b1) The speech feature extraction module extracts the Mel filter bank (FBank) features of each piece of training data t_n;
(3b2) The streaming speech recognition module feeds the FBank features of each t_n through the fully-connected layer with position coding, which adds relative position coding, and divides the position-coded features evenly into several speech blocks. The speech encoder computes the speech encoding block by block; during this computation the field of view of the multi-head self-attention in each Conformer structure is restricted, so that while processing a block it may only attend to the current block and a fixed number of preceding blocks. The encodings of all blocks together form the speech encoding of t_n. The CTC prediction layer applies the CTC prefix beam search algorithm with beam width R, 10 ≤ R ≤ 50, and computes the streaming recognition results of t_n and the corresponding streaming scores from the encodings of all speech blocks;
(3b3) The attention decoder in the attention re-scoring module computes, from the complete speech encoding of t_n and the R streaming recognition results, the decoding information corresponding to each streaming result; the attention prediction layer computes an attention score for each streaming result from this decoding information, forms the final score of each result as a weighted sum of its streaming score and attention score, and takes the streaming result with the highest final score as the final recognition result l'_n of t_n;
(3c) Using the joint loss function L_Joint, compute the loss of the streaming speech recognition model at the i-th iteration from l'_n and l_n; minimize this loss with the Adam optimization method to update the weight parameters of the streaming speech recognition module and the attention re-scoring module, obtaining the streaming speech recognition model V_i after the i-th iteration;
(3d) Judge whether i ≥ I: if so, the trained streaming speech recognition model V* is obtained; otherwise let i = i + 1 and return to step (3b);
(4) Acquiring real-time speech recognition results:
(4a) Cut the speech stream collected in real time into several speech blocks of equal length and feed the cut blocks as input to the trained streaming speech recognition model V*; forward propagation yields several streaming recognition results with their streaming scores in real time, and the streaming result with the highest streaming score serves as the real-time recognition result;
(4b) After the speech stream ends, V* re-scores the streaming recognition results and gives the final recognition result.
2. The attention-re-scoring-based streaming speech recognition method according to claim 1, wherein mixing each piece of noise-free speech s_n in S with an arbitrary piece of natural noise in F in step (1b) is implemented by the following steps:
(1b1) Initialize the minimum signal-to-noise ratio min_snr and the maximum signal-to-noise ratio max_snr allowed for the training data t_n, and the desired proportion p of training data to be mixed with noise, with 0 ≤ p ≤ 1;
(1b2) Compute the average power of each piece of noise-free speech s_n in S,

P_{s_n} = (1/T_{s_n}) · Σ_t s_n(t)²,

and the average power of an arbitrary piece of natural noise f_k in F,

P_{f_k} = (1/T_{f_k}) · Σ_t f_k(t)²,

where T_{s_n} and T_{f_k} are the numbers of samples; from P_{s_n} and P_{f_k}, compute the noise coefficient γ_n corresponding to s_n:

γ_n = 0, if α_n > p;
γ_n = sqrt( P_{s_n} / (P_{f_k} · 10^(SNR_n/10)) ), otherwise,

where k ∈ [1, M], SNR_n denotes the signal-to-noise ratio of t_n with min_snr ≤ SNR_n ≤ max_snr, and α_n denotes a random threshold with 0 ≤ α_n ≤ 1;
(1b3) From the noise coefficient γ_n, compute the training data t_n corresponding to s_n:

t_n = s_n + γ_n · f_k
3. The attention-re-scoring-based streaming speech recognition method according to claim 1, wherein in the streaming speech recognition model of step (2a) the number of sequentially connected Conformer structures in the speech encoder is 12, and the number of sequentially connected Transformer structures in the attention decoder is 6.
4. The attention-re-scoring-based streaming speech recognition method according to claim 1, wherein when the position-coded FBank features of t_n are divided evenly into several speech blocks in step (3b2), the speech block length of each batch of input training data is determined randomly within a certain range, and the training data of the same batch use speech blocks of the same length when divided.
5. The attention-re-scoring-based streaming speech recognition method according to claim 1, wherein the computation in step (3b3), in which the attention decoder of the attention re-scoring module derives the decoding information corresponding to each streaming recognition result from the complete speech encoding of t_n and the several streaming recognition results, decodes in Teacher-Forcing mode: the streaming recognition results are used directly as labels while the complete speech encoding of t_n is decoded.
CN202210864507.6A, filed 2022-07-21: Streaming speech recognition method based on attention re-scoring. Publication: CN115240645A (pending).

Priority Applications (1)

CN202210864507.6A (priority and filing date 2022-07-21): Streaming speech recognition method based on attention re-scoring

Publications (1)

CN115240645A, published 2022-10-25

Family ID: 83675654

Family Applications (1)

CN202210864507.6A: Streaming speech recognition method based on attention re-scoring (pending)

Country Status (1)

CN: CN115240645A (en)

Cited By (2)

* Cited by examiner, † Cited by third party

CN117558265A * (priority 2024-01-12, published 2024-02-13), 联通(广东)产业互联网有限公司: Dialect streaming speech recognition method and device, electronic equipment and storage medium
CN117558265B * (priority 2024-01-12, granted 2024-04-19), 联通(广东)产业互联网有限公司: Dialect streaming speech recognition method and device, electronic equipment and storage medium


Legal Events

Date Code Title Description
PB01: Publication
SE01: Entry into force of request for substantive examination
CB03: Change of inventor or designer information
Inventor after: Du Junchao; Liu Hui; Zhang Zhipeng; Wei Yuheng; Yu Yingtao; Pan Jiangtao
Inventor before: Du Junchao; Liu Hui; Zhang Zhipeng; Yu Yingtao; Pan Jiangtao