CN113257248B - Streaming and non-streaming mixed voice recognition system and streaming voice recognition method - Google Patents


Info

Publication number
CN113257248B
Authority
CN
China
Prior art keywords
streaming
stream
sequences
candidate
decoder
Prior art date
Legal status
Active
Application number
CN202110675286.3A
Other languages
Chinese (zh)
Other versions
CN113257248A (en)
Inventor
陶建华
田正坤
易江燕
Current Assignee
Institute of Automation of Chinese Academy of Science
Original Assignee
Institute of Automation of Chinese Academy of Science
Priority date
Filing date
Publication date
Application filed by Institute of Automation of Chinese Academy of Science
Priority to CN202110675286.3A
Publication of CN113257248A
Application granted
Publication of CN113257248B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/26 Speech to text systems
    • G10L 15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 Training
    • G10L 15/08 Speech classification or search
    • G10L 15/16 Speech classification or search using artificial neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

The invention provides a hybrid streaming and non-streaming speech recognition system comprising a streaming encoder, a connectionist temporal classification (CTC) decoder, and an attention decoder. The streaming encoder is built from Transformer layers with a local self-attention mechanism. The CTC decoder contains a linear mapping layer that projects the encoding state into a pre-designed vocabulary space so that the projected representation has the same dimension as the vocabulary; the predicted tokens are then computed through a Softmax and used for streaming decoding. The attention decoder is built on a Transformer decoder and consists of a front-end convolution layer and several stacked unidirectional Transformer coding layers, with a final linear mapping layer whose output dimension equals the vocabulary size, from which the final output probabilities are computed.

Description

Streaming and non-streaming mixed voice recognition system and streaming voice recognition method
Technical Field
The present application relates to the field of speech recognition, and more particularly to a hybrid streaming and non-streaming speech recognition system.
Background
Speech recognition technology is now widely deployed. Depending on the application scenario, recognition systems can be divided into streaming and non-streaming systems. To reduce latency and real-time factor, a streaming system greatly restricts the acoustic context it can rely on, which in turn degrades recognition accuracy to some extent. A non-streaming system is used where the real-time factor is not a concern; it can exploit the entire acoustic sequence for prediction and generally recognizes more accurately than a streaming system. To meet both task requirements, however, separate models are usually trained for the streaming and non-streaming tasks, and no effective scheme exists for applying a single model to both. The present invention provides a speech recognition system that integrates streaming and non-streaming recognition into a single model with two decoding modes, making it suitable for both types of tasks.
Many schemes exist for streaming and for non-streaming speech recognition, but few unify the two recognition models into one framework. Existing unified approaches mainly follow two ideas:
The first is Google's approach, which trains the encoder with variable context so that the same encoder adapts to both streaming (local context) and non-streaming (global context) operation. During training, the streaming and non-streaming configurations are trained simultaneously: in the streaming configuration the future acoustic context is masked and only the historical context is used, while in the non-streaming configuration the entire acoustic context is modeled without masking. To close the performance gap between the streaming and non-streaming configurations, the approach also applies knowledge distillation, using the non-streaming model to improve the streaming model. A single decoder then supports both decoding modes; only the encoder context configuration needs to be selected according to the task.
The second is the hybrid model proposed by Alibaba, which contains two encoders (streaming and non-streaming) and two decoders. The input speech is encoded by different encoders depending on the task: for a streaming task the streaming encoder is selected, a streaming decoder produces a preliminary result, and a non-streaming decoder rescores that result; for non-streaming decoding, only the non-streaming encoder and decoder are used. Models with this structure are relatively complex.
Embodiments disclosed in application publication No. CN111402891A provide a speech recognition method, apparatus, device, and storage medium. The method obtains the speech feature sequence of the current speech signal to be recognized; feeds the feature sequence into a pre-trained Deep-FSMN model to obtain an output sequence representing the probability of each phoneme; feeds that output sequence into a pre-trained CTC model to obtain the corresponding phoneme sequence; and feeds the phoneme sequence into a language model, which converts it into the final character sequence as the recognition result. In this way, model performance can be improved, the latency of speech recognition can be reduced, the amount of computation is reduced, and the recognition effect is improved.
Application publication No. CN111968629A claims a Chinese speech recognition method combining a Transformer with CNN-DFSMN-CTC, comprising the steps of: S1, preprocessing the speech signal and extracting 80-dimensional log-Mel Fbank features; S2, convolving the extracted 80-dimensional Fbank features with a CNN convolutional network; S3, feeding the features into the DFSMN network structure; S4, using CTC loss as the loss function of the acoustic model, predicting with a beam search algorithm, and optimizing with the Adam optimizer; S5, introducing a strong Transformer language model and training iteratively until the optimal model structure is reached; and S6, combining the Transformer with the CNN-DFSMN-CTC acoustic model and verifying on multiple data sets to obtain the best recognition result. That invention reports high recognition accuracy and fast decoding, with a character error rate of 11.8 percent across multiple data sets and a best character error rate of 7.8 percent on the Aidatang data set.
The main problems of the prior art involve two aspects:
(1) Model redundancy and a large training workload: at present, separate models are usually trained for the streaming and non-streaming recognition tasks. These models duplicate part of each other's functionality, the overall system is redundant, and training them separately increases the training workload.
(2) Complex model structure: in approaches similar to the Alibaba model, the system contains two encoders and two decoders, and different encoder combinations are applied to different tasks; the model structure is complex and difficult to train.
Disclosure of Invention
In view of this, the present invention provides a hybrid streaming and non-streaming speech recognition system. Specifically, the invention is implemented by the following technical solutions:
In a first aspect, the present invention provides a hybrid streaming and non-streaming speech recognition system comprising: a streaming encoder, a connectionist temporal classification (CTC) decoder, and an attention decoder. The streaming encoder is used for streaming modeling and must avoid depending on the full acoustic context; it is built from Transformer layers with a local self-attention mechanism and outputs an encoding state. The CTC decoder contains a linear mapping layer that projects the encoding state into a pre-designed vocabulary space, producing a projected representation whose dimension equals the vocabulary size; the predicted tokens are then computed through a Softmax, and this branch is used mainly for streaming decoding. The attention decoder is built on a Transformer decoder and consists of a front-end convolution layer and several stacked unidirectional Transformer coding layers, with a final linear mapping layer whose output dimension equals the vocabulary size, from which the final output probabilities are computed;
during model training, the CTC decoder computes a CTC loss function and the attention decoder computes a cross-entropy loss function; the weighted sum of the CTC loss function and the cross-entropy loss function serves as the model loss function of the recognition system;
during streaming inference, the CTC decoder leads and the attention decoder assists: the CTC decoder applies a beam search algorithm to the encoding state to generate N streaming candidate acoustic sequences together with their CTC streaming acoustic scores; the attention decoder then rescores the N streaming candidates, the candidates are reordered according to the score of each streaming candidate acoustic sequence, and the highest-scoring candidate is taken as the final streaming recognition result;
during non-streaming inference, the attention decoder leads and the CTC decoder assists: the attention decoder applies a beam search algorithm and takes the M highest-scoring sequences produced during decoding as M non-streaming candidate acoustic sequences; the CTC decoder then rescores the M candidates, the candidates are reordered according to their scores, and the highest-scoring candidate is taken as the final non-streaming recognition result.
Preferably, the model loss function takes the specific form:
model loss function = λ × CTC loss function + (1 − λ) × cross-entropy loss function;
where λ is a tunable parameter with 0.1 ≤ λ ≤ 0.3.
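As a minimal illustrative sketch only (assuming a PyTorch-style implementation; the tensor shapes, the blank index, and the padding convention are assumptions not specified in this disclosure), the weighted combination can be computed as follows:

    import torch.nn.functional as F

    def joint_loss(ctc_logits, ctc_targets, input_lengths, target_lengths,
                   att_logits, att_targets, lam=0.1):
        # CTC branch: frame-level logits of shape (batch, T, V), converted to
        # (T, batch, V) log-probabilities as required by F.ctc_loss; blank assumed at index 0.
        log_probs = F.log_softmax(ctc_logits, dim=-1).transpose(0, 1)
        l_ctc = F.ctc_loss(log_probs, ctc_targets, input_lengths, target_lengths, blank=0)
        # Attention branch: token-level cross entropy; padded positions assumed to carry label -100.
        l_ce = F.cross_entropy(att_logits.reshape(-1, att_logits.size(-1)),
                               att_targets.reshape(-1), ignore_index=-100)
        # Model loss = lam * CTC loss + (1 - lam) * cross-entropy loss, with 0.1 <= lam <= 0.3.
        return lam * l_ctc + (1.0 - lam) * l_ce

With λ in this range the cross-entropy term dominates while the CTC term acts as an auxiliary objective for the streaming branch.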
Preferably, the attention decoder rescores the N streaming candidate acoustic sequences as follows:
a sentence-start token is prepended to each streaming candidate acoustic sequence to form an input streaming candidate sequence;
the attention decoder takes the N input streaming candidate sequences and the encoding states corresponding to the N streaming candidate sequences as input, predicts for each a streaming target candidate sequence that contains the end token but not the start token, sums the probabilities at each position of the streaming target candidate sequence, and computes a streaming attention score that is used as the rescoring score of the N streaming candidate acoustic sequences.
Preferably, the rescoring further comprises: computing a weighted sum of the streaming attention score and the CTC streaming acoustic score and using it as the rescoring score of the N streaming candidate acoustic sequences.
Preferably, N is a tunable parameter with 10 ≤ N ≤ 100.
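A minimal sketch of this rescoring step in plain Python is given below; attention_score is a hypothetical callable that runs the attention decoder in teacher-forcing mode on the start-token-prepended candidate and returns its streaming attention score, and the 0.5 weight is only an example value:

    def rescore_streaming(candidates, ctc_scores, attention_score, enc_states, weight=0.5):
        rescored = []
        for hyp, ctc_score in zip(candidates, ctc_scores):
            att_score = attention_score(enc_states, hyp)  # streaming attention score of this candidate
            # Weighted sum of the streaming attention score and the CTC streaming acoustic score.
            combined = weight * att_score + (1.0 - weight) * ctc_score
            rescored.append((combined, hyp))
        rescored.sort(key=lambda pair: pair[0], reverse=True)
        return rescored[0][1]  # the candidate with the highest rescoring score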
Preferably, the specific process by which the attention decoder, using a beam search algorithm, takes the M highest-scoring non-streaming sequences produced during decoding as the M non-streaming candidate acoustic sequences is as follows:
prediction starts from the start token; at each step the complete encoding state and the tokens predicted in previous steps are input, and the score of the newly predicted token is computed; this process is repeated until the end token is predicted; the M highest-scoring non-streaming sequences produced during decoding of the attention decoder, with the start and end tokens removed, are then taken as the M non-streaming candidate acoustic sequences, and their scores as the non-streaming attention scores.
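A simplified label-synchronous beam search following this process might look as below; decoder_step is a hypothetical callable that returns log-probabilities over the vocabulary given the complete encoding state and the tokens predicted so far, and the beam width, sos/eos token ids, and maximum length are illustrative assumptions:

    def attention_beam_search(enc_states, decoder_step, sos, eos, beam=10, max_len=200):
        beams = [([sos], 0.0)]          # (token sequence, accumulated log-probability)
        finished = []
        for _ in range(max_len):
            new_beams = []
            for tokens, score in beams:
                # Complete encoding state + tokens predicted so far -> next-token log-probabilities.
                log_probs = decoder_step(enc_states, tokens)
                best = sorted(enumerate(log_probs), key=lambda p: -p[1])[:beam]
                for tok, lp in best:
                    cand = (tokens + [tok], score + lp)
                    (finished if tok == eos else new_beams).append(cand)
            if not new_beams:
                break
            beams = sorted(new_beams, key=lambda b: -b[1])[:beam]
        finished = finished or beams
        finished = sorted(finished, key=lambda b: -b[1])[:beam]
        # Strip the start and end tokens; keep the sequences with their attention scores.
        return [(seq[1:-1] if seq[-1] == eos else seq[1:], s) for seq, s in finished]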
Preferably, the CTC decoder rescores the M non-streaming candidate acoustic sequences as follows: the CTC non-streaming acoustic scores and the non-streaming attention scores are combined by weighted summation to obtain the CTC decoder's rescoring scores for the M non-streaming candidate acoustic sequences.
Preferably, M is a tunable parameter with 10 ≤ M ≤ 100.
The invention also provides a streaming speech recognition method, comprising the following steps:
(1) each time the input audio stream accumulates a fixed length, the acoustic features of that segment are computed and fed to the streaming encoder;
(2) the streaming acoustic feature stream is converted by the streaming encoder into a streaming encoding state and fed to the CTC decoder;
(3) the CTC decoder predicts over the streaming encoding state using a beam search algorithm;
(4) steps (1)-(3) are repeated; when the sentence ends, the streaming encoding state is closed, finally yielding N streaming candidate acoustic sequences and their CTC streaming acoustic scores;
(5) the attention decoder rescores the N streaming candidate acoustic sequences: a sentence-start token is prepended to each streaming candidate sequence to form an input streaming candidate sequence; the attention decoder takes the N input streaming candidate sequences and the encoding states corresponding to the N streaming candidate sequences as input, predicts for each a streaming target candidate sequence that contains the end token but not the start token, sums the probabilities at each position of the streaming target candidate sequence, and computes a streaming attention score used as the rescoring score of the N streaming candidate acoustic sequences;
alternatively, the streaming attention score and the CTC streaming acoustic score are combined by weighted summation and used as the rescoring score of the N streaming candidate acoustic sequences;
(6) the N streaming candidate acoustic sequences are reordered according to the score of each streaming candidate acoustic sequence, and the highest-scoring candidate is taken as the final streaming recognition result. Streaming recognition performance can be improved by increasing the number N of streaming candidate acoustic sequences; a typical value is N = 10, and the parameter range is 10 ≤ N ≤ 100.
The invention also provides a non-streaming speech recognition method, comprising the following steps:
(1) after the audio input ends, features are extracted from the entire audio and fed to the streaming encoder for encoding;
(2) the attention decoder takes the complete output of the streaming encoder and the start token as input; prediction starts from the start token, and at each step the complete encoding state and the tokens predicted in previous steps are input, after which the score of the newly predicted token is computed;
(3) step (2) is repeated until the end token is predicted; the M highest-scoring non-streaming sequences produced during decoding of the attention decoder, with the start and end tokens removed, are then taken as the M non-streaming candidate acoustic sequences, and their scores as the non-streaming attention scores;
(4) the CTC decoder rescores all M non-streaming candidate acoustic sequences: a dynamic programming algorithm computes, given the complete speech input, the probability of predicting each target non-streaming candidate acoustic sequence, which serves as its CTC non-streaming acoustic score (an illustrative sketch of this computation follows this list);
(5) the CTC non-streaming acoustic score and the non-streaming attention score are combined by weighted summation to obtain the CTC decoder's rescoring score for the M non-streaming candidate acoustic sequences, which are reordered accordingly;
(6) the candidate with the highest rescoring score among the M non-streaming candidate acoustic sequences is output as the recognition result.
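For reference, the dynamic-programming computation mentioned in step (4) can be sketched as follows (a sketch only, assuming per-frame log-probabilities of shape (T, V) with the blank label at index 0; this is the standard CTC forward recursion, not code taken from this disclosure):

    import numpy as np

    def ctc_log_score(log_probs, target, blank=0):
        # Interleave blanks: target [a, b] -> [blank, a, blank, b, blank].
        ext = [blank]
        for t in target:
            ext += [t, blank]
        T, S = log_probs.shape[0], len(ext)
        alpha = np.full((T, S), -np.inf)
        alpha[0, 0] = log_probs[0, blank]
        if S > 1:
            alpha[0, 1] = log_probs[0, ext[1]]
        for t in range(1, T):
            for s in range(S):
                cands = [alpha[t - 1, s]]
                if s > 0:
                    cands.append(alpha[t - 1, s - 1])
                if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                    cands.append(alpha[t - 1, s - 2])
                alpha[t, s] = np.logaddexp.reduce(cands) + log_probs[t, ext[s]]
        # Total log-probability of the target sequence given the complete input.
        if S == 1:
            return alpha[T - 1, 0]
        return np.logaddexp(alpha[T - 1, S - 1], alpha[T - 1, S - 2])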
Compared with the prior art, the technical solution provided by the embodiments of the application has the following advantages:
(1) The model structure is simple: the invention contains only one streaming encoder and two decoders (a CTC decoder and an attention decoder), so the structure is relatively simple.
(2) The decoding processes are simple and complementary: in both streaming and non-streaming decoding of the model, a performance gain can be obtained simply by swapping the order in which the two decoders are applied.
(3) Training is simple: compared with training separate streaming and non-streaming models, the system contains both types of decoders and can be trained jointly, so the different modules mutually improve each other's convergence and training speed.
Drawings
FIG. 1 is a block diagram of a hybrid speech recognition system for streaming and non-streaming according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of the Transformer self-attention mechanism of the streaming encoder according to an embodiment of the present invention;
FIG. 3 is a flow chart of a streaming speech recognition method according to an embodiment of the present invention;
FIG. 4 is a flowchart of a non-streaming speech recognition method according to an embodiment of the present invention.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present invention. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the invention, as detailed in the appended claims.
As shown in FIG. 1, an embodiment of the present application provides a hybrid streaming and non-streaming speech recognition system,
the method comprises the following steps: a stream encoder, a concatenated sequential classification decoder, and an attention mechanism decoder; the stream encoder is constructed by adopting a Transformer based on a local self-attention mechanism for stream modeling, and the dependence on all acoustic contexts needs to be eliminated; the joint time sequence classification decoder comprises a linear mapping layer which is responsible for mapping the coding state to a pre-designed word list space, so that the dimension represented by the coding state mapping is the same as that of the word list space, and then the predicted mark is calculated through Softmax and is mainly used for stream decoding; the attention mechanism decoder is constructed by adopting a Transformer decoder and consists of a front-end convolution layer and a plurality of repeated unidirectional Transformer coding layers, the last layer is a linear mapping layer, the dimensionality represented by the coding state mapping is the same as the dimensionality of the vocabulary space, and the final output probability is calculated;
during model training, the CTC decoder computes the CTC loss function L_CTC; the attention decoder computes the cross-entropy loss function L_CE; the weighted sum of the two serves as the model loss function L_final of the recognition system.
The model loss function takes the specific form:
model loss function = λ × CTC loss function + (1 − λ) × cross-entropy loss function;
where λ is a tunable parameter, set here to 0.1; that is,
L_final = λ · L_CTC + (1 − λ) · L_CE
During streaming inference, the CTC decoder leads and the attention decoder assists: the CTC decoder applies a beam search algorithm to the encoding state to generate N streaming candidate acoustic sequences together with their CTC streaming acoustic scores; the attention decoder then rescores the N streaming candidates, the candidates are reordered according to the score of each streaming candidate acoustic sequence, and the highest-scoring candidate is taken as the final streaming recognition result;
the attention decoder rescores the N streaming candidate acoustic sequences as follows:
a sentence-start token is prepended to each streaming candidate sequence to form an input streaming candidate sequence;
the attention decoder takes the N input streaming candidate sequences and the encoding states corresponding to the N streaming candidate sequences as input, predicts for each a streaming target candidate sequence that contains the end token but not the start token, sums the probabilities at each position of the streaming target candidate sequence, and computes a streaming attention score used as the rescoring score of the N streaming candidate acoustic sequences;
the rescoring may further comprise: computing a weighted sum of the streaming attention score and the CTC streaming acoustic score and using it as the rescoring score of the N streaming candidate acoustic sequences; a typical value of N is 10, and larger values such as 50 or 100 may also be used;
during non-streaming inference, the attention decoder leads and the CTC decoder assists: the attention decoder applies a beam search algorithm and takes the M highest-scoring non-streaming sequences produced during decoding as M non-streaming candidate acoustic sequences; the CTC decoder then rescores the M candidates, the candidates are reordered according to their scores, and the highest-scoring candidate is taken as the final non-streaming recognition result;
the specific process by which the attention decoder, using a beam search algorithm, takes the M highest-scoring non-streaming sequences produced during decoding as the M non-streaming candidate acoustic sequences is as follows:
prediction starts from the start token; at each step the complete encoding state and the tokens predicted in previous steps are input, and the score of the newly predicted token is computed; this process is repeated until the end token is predicted; the M highest-scoring non-streaming sequences produced during decoding of the attention decoder, with the start and end tokens removed, are then taken as the M non-streaming candidate acoustic sequences, and their scores as the non-streaming attention scores;
the CTC decoder rescores the M non-streaming candidate acoustic sequences as follows: the CTC non-streaming acoustic scores and the non-streaming attention scores are combined by weighted summation to obtain the CTC decoder's rescoring scores for the M non-streaming candidate acoustic sequences;
a typical value of M is 10, and larger values such as 50 or 100 may also be used;
as shown in fig. 2, the streaming encoder is constructed by using a streaming Transformer model structure; the system consists of a front-end module and a multi-layer repeated unidirectional Transformer coding layer. The front-end module comprises two layers of convolution, the convolution kernel of the front-end module is 3X3, the step length is set to be 2, ReLU activation functions are used between the layers of convolution as connection, and finally the model dimension is output through a linear mapping layer. The unidirectional Transformer coding layer consists of a multi-head unidirectional self-attention mechanism and a feedforward network, and each multi-head unidirectional self-attention mechanism and feedforward network uses residual connection and post-layer normalization to help the model to converge. The relative position coding is introduced in the unidirectional self-attention mechanism. The encoder calculates as follows:
a convolution front-end module:
O_conv1 = ReLU(Conv2D(x))
O_front = Linear(ReLU(Conv2D(O_conv1)))
where ReLU denotes the activation function, Conv2D denotes a 2D convolution layer, x denotes the input speech features, O_conv1 denotes the output of the first convolution layer, and O_front denotes the output of the convolution front-end module.
The multi-head one-way self-attention mechanism:
head_i = Softmax((Q·W_i^Q)(K·W_i^K)^T / √d_k + A)·(V·W_i^V)
O_SLF = Concat(head_1, …, head_H)·W^O
where W_i^Q, W_i^K, W_i^V and W^O denote learnable parameter matrices, and Q, K, V denote the query, key and value matrices, respectively. For the self-attention mechanism Q = K = V: in the first self-attention layer Q = K = V = O_front, and in subsequent layers Q = K = V equals the output of the preceding feed-forward network layer. A denotes a learnable relative position encoding matrix; the relative position encoding of Transformer-XL may also be used here instead. d_k denotes the dimension of the last axis of the key matrix. Each attention head head_i is an independent attention mechanism; the H attention outputs are concatenated and linearly mapped to obtain the final output O_SLF.
Calculation of the feedforward neural network:
O_FFN = Linear(GLU(Linear(O_SLF)))
where Linear denotes a linear mapping and GLU denotes the activation function;
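A rough sketch of the convolution front-end and one unidirectional coding layer, under the stated assumptions (3×3 kernels with stride 2, ReLU, a final linear mapping, GLU feed-forward, residual connections with post-layer normalization), is given below in PyTorch. The model dimension, number of heads, and input feature dimension are illustrative; the learnable relative position encoding described above is omitted, and a strictly causal mask stands in for the local/unidirectional self-attention:

    import torch
    import torch.nn as nn

    class ConvFrontEnd(nn.Module):
        def __init__(self, feat_dim=80, d_model=256):
            super().__init__()
            self.conv1 = nn.Conv2d(1, d_model, kernel_size=3, stride=2, padding=1)
            self.conv2 = nn.Conv2d(d_model, d_model, kernel_size=3, stride=2, padding=1)
            self.relu = nn.ReLU()
            # Linear mapping back to the model dimension after 4x subsampling of the feature axis.
            self.proj = nn.Linear(d_model * (((feat_dim + 1) // 2 + 1) // 2), d_model)

        def forward(self, x):                      # x: (batch, time, feat_dim)
            x = x.unsqueeze(1)                     # add a channel axis
            x = self.relu(self.conv1(x))
            x = self.relu(self.conv2(x))           # (batch, d_model, time', feat')
            b, c, t, f = x.size()
            return self.proj(x.permute(0, 2, 1, 3).reshape(b, t, c * f))

    class UnidirectionalEncoderLayer(nn.Module):
        def __init__(self, d_model=256, n_heads=4, d_ff=1024):
            super().__init__()
            self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.ffn = nn.Sequential(nn.Linear(d_model, 2 * d_ff), nn.GLU(),
                                     nn.Linear(d_ff, d_model))
            self.norm1 = nn.LayerNorm(d_model)
            self.norm2 = nn.LayerNorm(d_model)

        def forward(self, x):                      # x: (batch, time, d_model)
            t = x.size(1)
            # Each position may attend only to itself and to past positions.
            mask = torch.triu(torch.ones(t, t, dtype=torch.bool), diagonal=1)
            out, _ = self.attn(x, x, x, attn_mask=mask)
            x = self.norm1(x + out)                # residual connection + post-layer normalization
            return self.norm2(x + self.ffn(x))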
The attention decoder is built on a Transformer decoder and comprises a front-end word-embedding module, sine-cosine position encoding, several stacked Transformer decoding layers, and a final linear mapping component. Each Transformer decoding layer consists of a masked self-attention mechanism, an encoder-decoder attention mechanism, and a feed-forward network. Each multi-head masked self-attention mechanism, encoder-decoder attention mechanism, and feed-forward network uses a residual connection and post-layer normalization to help the model converge. The computation proceeds as follows:
O_e = Embedding(y) + P_e
where y denotes the input token sequence, P_e denotes the sine-cosine position encoding, and O_e denotes the word-embedded representation with position encoding.
A shading self-attention mechanism:
head_i = Softmax(Mask((Q·W_i^Q)(K·W_i^K)^T / √d_k))·(V·W_i^V)
O_SLF = Concat(head_1, …, head_H)·W^O
where W_i^Q, W_i^K, W_i^V and W^O denote learnable parameter matrices, and Q, K, V denote the query, key and value matrices, respectively. For the self-attention mechanism Q = K = V: in the first self-attention layer Q = K = V = O_e, and otherwise they equal the output of the preceding feed-forward network layer. During the self-attention computation, the information of future positions of each vector must be masked out to force the model to learn the temporal dependencies of the language.
The attention mechanism of encoding and decoding is as follows:
head_i = Softmax((Q·W_i^Q)(K·W_i^K)^T / √d_k)·(V·W_i^V)
O_CED = Concat(head_1, …, head_H)·W^O
where W_i^Q, W_i^K, W_i^V and W^O denote learnable parameter matrices, and Q, K, V denote the query, key and value matrices, respectively. For the encoder-decoder attention mechanism, Q = O_SLF (the output of the masked self-attention), while K and V equal the encoder output; the concatenated heads are linearly mapped to obtain the output O_CED.
Calculating a feedforward network:
O_FFN = Linear(GLU(Linear(O_CED)))
where Linear denotes a linear mapping and GLU denotes the activation function.
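A hedged sketch of one such decoding layer is given below (PyTorch; the word-embedding and sine-cosine position-encoding front-end are omitted, and the module choices and dimensions are assumptions rather than the exact implementation of this disclosure). It shows the masked self-attention over already-predicted tokens, the encoder-decoder attention whose queries come from the decoder while keys and values come from the encoder output, and the GLU feed-forward network:

    import torch
    import torch.nn as nn

    class DecoderLayer(nn.Module):
        def __init__(self, d_model=256, n_heads=4, d_ff=1024):
            super().__init__()
            self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.ffn = nn.Sequential(nn.Linear(d_model, 2 * d_ff), nn.GLU(),
                                     nn.Linear(d_ff, d_model))
            self.norms = nn.ModuleList([nn.LayerNorm(d_model) for _ in range(3)])

        def forward(self, y, enc_out):             # y: (batch, dec_len, d_model)
            L = y.size(1)
            # Mask future positions so each token only sees what has already been predicted.
            causal = torch.triu(torch.ones(L, L, dtype=torch.bool), diagonal=1)
            s, _ = self.self_attn(y, y, y, attn_mask=causal)
            y = self.norms[0](y + s)
            # Encoder-decoder attention: queries from the decoder, keys and values from the encoder.
            c, _ = self.cross_attn(y, enc_out, enc_out)
            y = self.norms[1](y + c)
            return self.norms[2](y + self.ffn(y))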
As shown in FIG. 3, a streaming speech recognition method includes:
(1) each time the input audio stream accumulates a fixed length, the acoustic features of that segment are computed and fed to the streaming encoder;
(2) the streaming acoustic feature stream is converted by the streaming encoder into a streaming encoding state and fed to the CTC decoder;
(3) the CTC decoder predicts over the streaming encoding state using a beam search algorithm;
(4) steps (1)-(3) are repeated; when the sentence ends, the streaming encoding state is closed, finally yielding N streaming candidate acoustic sequences and their CTC streaming acoustic scores;
(5) the attention decoder rescores the N streaming candidate acoustic sequences: a sentence-start token is prepended to each streaming candidate sequence to form an input streaming candidate sequence; the attention decoder takes the N input streaming candidate sequences and the encoding states corresponding to the N streaming candidate sequences as input, predicts for each a streaming target candidate sequence that contains the end token but not the start token, sums the probabilities at each position of the streaming target candidate sequence, and computes a streaming attention score used as the rescoring score of the N streaming candidate acoustic sequences;
alternatively, the streaming attention score and the CTC streaming acoustic score are combined by weighted summation and used as the rescoring score of the N streaming candidate acoustic sequences;
(6) the N streaming candidate acoustic sequences are reordered according to the score of each streaming candidate acoustic sequence, and the highest-scoring candidate is taken as the final streaming recognition result. Streaming recognition performance can be improved by increasing the number N of streaming candidate acoustic sequences; a typical value is N = 10, and the parameter range is 10 ≤ N ≤ 100.
(7) Steps (4) and (5) may be performed segment by segment: whenever a fixed number of tokens have been recognized, the current streaming candidate acoustic sequences are reordered once and can be pruned according to the ordering result to improve the efficiency of subsequent decoding; a schematic sketch of this chunk-by-chunk loop is given after this list.
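A schematic chunk-by-chunk loop corresponding to steps (1)-(4) above is shown below; extract_features, stream_encoder, and ctc_prefix_beam_search are hypothetical callables standing in for the modules of this embodiment:

    def streaming_decode(audio_chunks, extract_features, stream_encoder,
                         ctc_prefix_beam_search, n_best=10):
        enc_states = []
        hyps = []
        for chunk in audio_chunks:                     # each chunk is a fixed-length audio segment
            feats = extract_features(chunk)            # (1) streaming acoustic feature stream
            enc_states.append(stream_encoder(feats))   # (2) streaming encoding state
            # (3) incremental CTC prefix beam search over the encoding states seen so far
            hyps = ctc_prefix_beam_search(enc_states, beam=n_best)
        # (4) at the end of the sentence: N candidates with their CTC streaming acoustic scores;
        # the encoding states are kept so the attention decoder can rescore the candidates in (5).
        return hyps, enc_states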
As shown in FIG. 4, a non-streaming speech recognition method includes:
(1) after the audio input ends, features are extracted from the entire audio and fed to the streaming encoder for encoding;
(2) the attention decoder takes the complete output of the streaming encoder and the start token as input; prediction starts from the start token, and at each step the complete encoding state and the tokens predicted in previous steps are input, after which the score of the newly predicted token is computed;
(3) step (2) is repeated until the end token is predicted; the M highest-scoring non-streaming sequences produced during decoding of the attention decoder, with the start and end tokens removed, are then taken as the M non-streaming candidate acoustic sequences, and their scores as the non-streaming attention scores;
(4) the CTC decoder rescores all M non-streaming candidate acoustic sequences: a dynamic programming algorithm computes, given the complete speech input, the probability of predicting each target non-streaming candidate acoustic sequence, which serves as its CTC non-streaming acoustic score;
(5) the CTC non-streaming acoustic score and the non-streaming attention score are combined by weighted summation to obtain the CTC decoder's rescoring score for the M non-streaming candidate acoustic sequences, which are reordered accordingly;
(6) the candidate with the highest rescoring score among the M non-streaming candidate acoustic sequences is output as the recognition result.
It is to be understood that although the terms first, second, third, etc. may be used herein to describe various information, this information should not be limited to these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information and, similarly, second information may also be referred to as first information, without departing from the scope of the present invention. The word "if" as used herein may be interpreted as "when" or "upon" or "in response to a determination", depending on the context.
While this specification contains many specific implementation details, these should not be construed as limitations on the scope of any invention or of what may be claimed, but rather as descriptions of features specific to particular embodiments of particular inventions. Certain features that are described in this specification in the context of separate embodiments can also be implemented in combination in a single embodiment. In other instances, features described in connection with one embodiment may be implemented as discrete components or in any suitable subcombination. Moreover, although features may be described above as acting in certain combinations and even initially claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a subcombination or variation of a subcombination.
Similarly, while operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In some cases, multitasking and parallel processing may be advantageous. Moreover, the separation of various system modules and components in the embodiments described above should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.
Thus, particular embodiments of the subject matter have been described. Other embodiments are within the scope of the following claims. In some cases, the actions recited in the claims can be performed in a different order and still achieve desirable results. Further, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some implementations, multitasking and parallel processing may be advantageous.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (9)

1. A hybrid streaming and non-streaming speech recognition system, comprising: a streaming encoder, a connectionist temporal classification (CTC) decoder, and an attention decoder; the streaming encoder is built from Transformer layers with a local self-attention mechanism and outputs an encoding state; the CTC decoder comprises a linear mapping layer responsible for projecting the encoding state into a pre-designed vocabulary space to obtain a projected representation whose dimension equals the vocabulary size, after which the predicted tokens are computed through a Softmax and used for streaming decoding; the attention decoder is built on a Transformer decoder and consists of a front-end convolution layer and several stacked unidirectional Transformer coding layers, with a final linear mapping layer whose output dimension equals the vocabulary size, from which the final output probabilities are computed; the hybrid streaming and non-streaming speech recognition system is trained such that, during training, the CTC decoder computes a CTC loss function, the attention decoder computes a cross-entropy loss function, and a weighted sum of the CTC loss function and the cross-entropy loss function serves as the model loss function of the recognition system;
during streaming inference, the CTC decoder leads and the attention decoder assists: the CTC decoder applies a beam search algorithm to the encoding state to generate N streaming candidate acoustic sequences and their CTC streaming acoustic scores, the attention decoder rescores the N streaming candidate acoustic sequences, the N streaming candidate acoustic sequences are reordered according to the score of each streaming candidate acoustic sequence, and the streaming candidate acoustic sequence with the highest score is taken as the final streaming recognition result;
during non-streaming inference, the attention decoder leads and the CTC decoder assists: the attention decoder applies a beam search algorithm and takes the M highest-scoring non-streaming sequences produced during decoding as M non-streaming candidate acoustic sequences, the CTC decoder rescores the M non-streaming candidate acoustic sequences, the M non-streaming candidate acoustic sequences are reordered according to their scores, and the non-streaming candidate acoustic sequence with the highest score is taken as the final non-streaming recognition result.
2. The hybrid streaming and non-streaming speech recognition system of claim 1, wherein the model loss function takes the specific form:
model loss function = λ × CTC loss function + (1 − λ) × cross-entropy loss function;
wherein λ is a tunable parameter with 0.1 ≤ λ ≤ 0.3.
3. The hybrid streaming and non-streaming speech recognition system of claim 1, wherein the attention decoder rescores the N streaming candidate acoustic sequences by:
prepending a sentence-start token to each streaming candidate acoustic sequence to form an input streaming candidate sequence;
the attention decoder taking the N input streaming candidate sequences and the encoding states corresponding to the N streaming candidate sequences as input, predicting a streaming target candidate sequence that contains the end token but not the start token, summing the probabilities at each position of the streaming target candidate sequence, and computing a streaming attention score that is used as the rescoring score of the N streaming candidate acoustic sequences.
4. The hybrid streaming and non-streaming speech recognition system of claim 3, wherein the rescoring further comprises: computing a weighted sum of the streaming attention score and the CTC streaming acoustic score as the rescoring score of the N streaming candidate acoustic sequences.
5. The system of claim 4, wherein N is a tunable parameter with 10 ≤ N ≤ 100.
6. The hybrid streaming and non-streaming speech recognition system of claim 1, wherein the attention decoder uses a beam search algorithm to take the M highest-scoring non-streaming sequences produced during decoding as the M non-streaming candidate acoustic sequences by:
predicting from the start token, wherein each step takes as input the complete encoding state and the tokens predicted in previous steps and then computes the score of the predicted token; repeating this process until the end token is predicted; and then taking the M highest-scoring non-streaming sequences produced during decoding of the attention decoder, with the start and end tokens removed, as the M non-streaming candidate acoustic sequences and their non-streaming attention scores.
7. The system of claim 6, wherein the CTC decoder rescores the M non-streaming candidate acoustic sequences by: computing a weighted sum of the CTC non-streaming acoustic scores and the non-streaming attention scores to obtain the CTC decoder's rescoring scores for the M non-streaming candidate acoustic sequences.
8. The system of claim 7, wherein M is a tunable parameter with 10 ≤ M ≤ 100.
9. A hybrid streaming and non-streaming speech recognition method, comprising a streaming speech recognition method and a non-streaming speech recognition method, wherein the streaming speech recognition method comprises the following steps:
(1) each time the input audio stream accumulates a fixed length, computing the acoustic features of that segment to be fed to the streaming encoder;
(2) converting the streaming acoustic feature stream, through the streaming encoder, into a streaming encoding state that is fed to the connectionist temporal classification (CTC) decoder;
(3) the CTC decoder predicting over the streaming encoding state using a beam search algorithm;
(4) repeating steps (1)-(3); when a sentence ends, closing the streaming encoding state and finally generating N streaming candidate acoustic sequences and their CTC streaming acoustic scores;
(5) the attention decoder rescoring the N streaming candidate acoustic sequences: a sentence-start token is prepended to each streaming candidate sequence to form an input streaming candidate sequence; the attention decoder takes the N input streaming candidate sequences and the encoding states corresponding to the N streaming candidate sequences as input, predicts a streaming target candidate sequence that contains the end token but not the start token, sums the probabilities at each position of the streaming target candidate sequence, and computes a streaming attention score used as the rescoring score of the N streaming candidate acoustic sequences;
or computing a weighted sum of the streaming attention score and the CTC streaming acoustic score as the rescoring score of the N streaming candidate acoustic sequences;
(6) reordering the N streaming candidate acoustic sequences according to the score of each streaming candidate acoustic sequence and taking the streaming candidate acoustic sequence with the highest score as the final streaming recognition result, wherein streaming recognition performance is improved by increasing the number N of streaming candidate acoustic sequences, a typical value of N is 10, and the parameter range is 10 ≤ N ≤ 100;
and wherein the non-streaming speech recognition method comprises the following steps: (1) after the audio input ends, extracting features from the entire audio and feeding them to the streaming encoder for encoding;
(2) the attention decoder taking the complete output of the streaming encoder and the start token as input; prediction starts from the start token, at each step the complete encoding state and the tokens predicted in previous steps are input, and the score of the predicted token is then computed;
(3) repeating step (2) until the end token is predicted; then taking the M highest-scoring non-streaming sequences produced during decoding of the attention decoder, with the start and end tokens removed, as the M non-streaming candidate acoustic sequences and their non-streaming attention scores;
(4) using the CTC decoder to rescore all M non-streaming candidate acoustic sequences, wherein a dynamic programming algorithm computes, given the complete speech input, the probability of predicting each target non-streaming candidate acoustic sequence as its CTC non-streaming acoustic score;
(5) computing a weighted sum of the CTC non-streaming acoustic score and the non-streaming attention score as the CTC decoder's rescoring score for the M non-streaming candidate acoustic sequences, and reordering the candidates accordingly;
(6) outputting the non-streaming candidate acoustic sequence with the highest rescoring score among the M non-streaming candidate acoustic sequences as the recognition result.
CN202110675286.3A 2021-06-18 2021-06-18 Streaming and non-streaming mixed voice recognition system and streaming voice recognition method Active CN113257248B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110675286.3A CN113257248B (en) 2021-06-18 2021-06-18 Streaming and non-streaming mixed voice recognition system and streaming voice recognition method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110675286.3A CN113257248B (en) 2021-06-18 2021-06-18 Streaming and non-streaming mixed voice recognition system and streaming voice recognition method

Publications (2)

Publication Number Publication Date
CN113257248A CN113257248A (en) 2021-08-13
CN113257248B true CN113257248B (en) 2021-10-15

Family

ID=77188576

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110675286.3A Active CN113257248B (en) 2021-06-18 2021-06-18 Streaming and non-streaming mixed voice recognition system and streaming voice recognition method

Country Status (1)

Country Link
CN (1) CN113257248B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111968629A (en) * 2020-07-08 2020-11-20 重庆邮电大学 Chinese speech recognition method combining Transformer and CNN-DFSMN-CTC
CN113674734B (en) * 2021-08-24 2023-08-01 中国铁道科学研究院集团有限公司电子计算技术研究所 Information query method and system based on voice recognition, equipment and storage medium
CN113539273B (en) * 2021-09-16 2021-12-10 腾讯科技(深圳)有限公司 Voice recognition method and device, computer equipment and storage medium
CN113705541B (en) * 2021-10-21 2022-04-01 中国科学院自动化研究所 Expression recognition method and system based on transform marker selection and combination

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110473529B (en) * 2019-09-09 2021-11-05 北京中科智极科技有限公司 Stream type voice transcription system based on self-attention mechanism
US11302309B2 (en) * 2019-09-13 2022-04-12 International Business Machines Corporation Aligning spike timing of models for maching learning
CN111179918B (en) * 2020-02-20 2022-10-14 中国科学院声学研究所 Joint meaning time classification and truncation type attention combined online voice recognition technology
CN111402891B (en) * 2020-03-23 2023-08-11 抖音视界有限公司 Speech recognition method, device, equipment and storage medium
CN111968629A (en) * 2020-07-08 2020-11-20 重庆邮电大学 Chinese speech recognition method combining Transformer and CNN-DFSMN-CTC
CN112037798B (en) * 2020-09-18 2022-03-01 中科极限元(杭州)智能科技股份有限公司 Voice recognition method and system based on trigger type non-autoregressive model
CN112509564B (en) * 2020-10-15 2024-04-02 江苏南大电子信息技术股份有限公司 End-to-end voice recognition method based on connection time sequence classification and self-attention mechanism
CN112530437B (en) * 2020-11-18 2023-10-20 北京百度网讯科技有限公司 Semantic recognition method, device, equipment and storage medium
CN112802467B (en) * 2020-12-21 2024-05-31 出门问问(武汉)信息科技有限公司 Speech recognition method and device

Also Published As

Publication number Publication date
CN113257248A (en) 2021-08-13


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant