CN113257248B - Streaming and non-streaming mixed voice recognition system and streaming voice recognition method - Google Patents


Info

Publication number
CN113257248B
Authority
CN
China
Prior art keywords
streaming
stream
sequences
candidate
decoder
Prior art date
Legal status
Active
Application number
CN202110675286.3A
Other languages
Chinese (zh)
Other versions
CN113257248A (en)
Inventor
陶建华
田正坤
易江燕
Current Assignee
Institute of Automation of Chinese Academy of Science
Original Assignee
Institute of Automation of Chinese Academy of Science
Priority date
Filing date
Publication date
Application filed by Institute of Automation of Chinese Academy of Science
Priority to CN202110675286.3A
Publication of CN113257248A
Application granted
Publication of CN113257248B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/26 Speech to text systems
    • G10L 15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 Training
    • G10L 15/08 Speech classification or search
    • G10L 15/16 Speech classification or search using artificial neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

The invention provides a hybrid streaming and non-streaming speech recognition system comprising a streaming encoder, a connectionist temporal classification (CTC) decoder, and an attention decoder. The streaming encoder is built from Transformer layers with a local self-attention mechanism. The CTC decoder contains a linear mapping layer that projects the encoding state into a pre-designed vocabulary space so that the projected representation has the same dimension as the vocabulary; the predicted tokens are then computed through a Softmax and used for streaming decoding. The attention decoder is built on a Transformer decoder and consists of a front-end convolution layer and several stacked unidirectional Transformer coding layers, with a final linear mapping layer whose output dimension equals the vocabulary size, from which the final output probabilities are computed.

Description

Streaming and non-streaming mixed voice recognition system and streaming voice recognition method
Technical Field
The present application relates to the field of speech recognition, and more particularly to a hybrid streaming and non-streaming speech recognition system.
Background
Speech recognition technology is now widely deployed. Depending on the application scenario, recognition systems can be divided into streaming and non-streaming systems. To reduce latency and real-time factor, a streaming system greatly restricts the acoustic context it can rely on, which in turn degrades recognition accuracy to some extent. A non-streaming system is used where the real-time factor is not a concern; it can exploit the entire acoustic sequence for prediction and generally recognizes more accurately than a streaming system. To meet both task requirements, however, separate models are usually trained for the streaming and non-streaming tasks, and no effective scheme exists for applying a single model to both. The present invention provides a speech recognition system that integrates streaming and non-streaming recognition into a single model with two decoding modes, making it suitable for both types of tasks.
Many schemes exist for streaming and for non-streaming speech recognition, but few unify the two recognition models into one framework. Existing unified approaches mainly follow two ideas:
The first is Google's approach, which trains the encoder with variable context so that the same encoder adapts to both streaming (local context) and non-streaming (global context) operation. During training, the streaming and non-streaming configurations are trained simultaneously: in the streaming configuration the future acoustic context is masked and only the historical context is used, while in the non-streaming configuration the entire acoustic context is modeled without masking. To close the performance gap between the streaming and non-streaming configurations, the approach also applies knowledge distillation, using the non-streaming model to improve the streaming model. A single decoder then supports both decoding modes; only the encoder context configuration needs to be selected according to the task.
The second is the hybrid model proposed by Alibaba, which contains two encoders (streaming and non-streaming) and two decoders. The input speech is encoded by different encoders depending on the task: for a streaming task the streaming encoder is selected, a streaming decoder produces a preliminary result, and a non-streaming decoder rescores that result; for non-streaming decoding, only the non-streaming encoder and decoder are used. Models with this structure are relatively complex.
Embodiments disclosed in application publication No. CN111402891A provide a speech recognition method, apparatus, device, and storage medium. The method obtains the speech feature sequence of the current speech signal to be recognized; feeds the feature sequence into a pre-trained Deep-FSMN model to obtain an output sequence representing the probability of each phoneme; feeds that output sequence into a pre-trained CTC model to obtain the corresponding phoneme sequence; and feeds the phoneme sequence into a language model, which converts it into the final character sequence as the recognition result. In this way, model performance can be improved, the latency of speech recognition can be reduced, the amount of computation is reduced, and the recognition effect is improved.
Application publication No. CN111968629A claims a Chinese speech recognition method combining a Transformer with CNN-DFSMN-CTC, comprising the steps of: S1, preprocessing the speech signal and extracting 80-dimensional log-Mel Fbank features; S2, convolving the extracted 80-dimensional Fbank features with a CNN convolutional network; S3, feeding the features into the DFSMN network structure; S4, using CTC loss as the loss function of the acoustic model, predicting with a beam search algorithm, and optimizing with the Adam optimizer; S5, introducing a strong Transformer language model and training iteratively until the optimal model structure is reached; and S6, combining the Transformer with the CNN-DFSMN-CTC acoustic model and verifying on multiple data sets to obtain the best recognition result. That invention reports high recognition accuracy and fast decoding, with a character error rate of 11.8 percent across multiple data sets and a best character error rate of 7.8 percent on the Aidatang data set.
The main problems of the prior art involve two aspects:
(1) Model redundancy and a large training workload: at present, separate models are usually trained for the streaming and non-streaming recognition tasks. These models duplicate part of each other's functionality, the overall system is redundant, and training them separately increases the training workload.
(2) Complex model structure: in approaches similar to the Alibaba model, the system contains two encoders and two decoders, and different encoder combinations are applied to different tasks; the model structure is complex and difficult to train.
Disclosure of Invention
In view of this, the present invention provides a hybrid streaming and non-streaming speech recognition system. Specifically, the invention is implemented by the following technical solutions:
In a first aspect, the present invention provides a hybrid streaming and non-streaming speech recognition system comprising: a streaming encoder, a connectionist temporal classification (CTC) decoder, and an attention decoder. The streaming encoder is used for streaming modeling and must avoid depending on the full acoustic context; it is built from Transformer layers with a local self-attention mechanism and outputs an encoding state. The CTC decoder contains a linear mapping layer that projects the encoding state into a pre-designed vocabulary space, producing a projected representation whose dimension equals the vocabulary size; the predicted tokens are then computed through a Softmax, and this branch is used mainly for streaming decoding. The attention decoder is built on a Transformer decoder and consists of a front-end convolution layer and several stacked unidirectional Transformer coding layers, with a final linear mapping layer whose output dimension equals the vocabulary size, from which the final output probabilities are computed;
during model training, the CTC decoder computes a CTC loss function and the attention decoder computes a cross-entropy loss function; the weighted sum of the CTC loss function and the cross-entropy loss function serves as the model loss function of the recognition system;
during streaming inference, the CTC decoder leads and the attention decoder assists: the CTC decoder applies a beam search algorithm to the encoding state to generate N streaming candidate acoustic sequences together with their CTC streaming acoustic scores; the attention decoder then rescores the N streaming candidates, the candidates are reordered according to the score of each streaming candidate acoustic sequence, and the highest-scoring candidate is taken as the final streaming recognition result;
during non-streaming inference, the attention decoder leads and the CTC decoder assists: the attention decoder applies a beam search algorithm and takes the M highest-scoring sequences produced during decoding as M non-streaming candidate acoustic sequences; the CTC decoder then rescores the M candidates, the candidates are reordered according to their scores, and the highest-scoring candidate is taken as the final non-streaming recognition result.
Preferably, the model loss function takes the specific form:
model loss function = λ × CTC loss function + (1 − λ) × cross-entropy loss function;
where λ is a tunable parameter with 0.1 ≤ λ ≤ 0.3.
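As a minimal illustrative sketch only (assuming a PyTorch-style implementation; the tensor shapes, the blank index, and the padding convention are assumptions not specified in this disclosure), the weighted combination can be computed as follows:

    import torch.nn.functional as F

    def joint_loss(ctc_logits, ctc_targets, input_lengths, target_lengths,
                   att_logits, att_targets, lam=0.1):
        # CTC branch: frame-level logits of shape (batch, T, V), converted to
        # (T, batch, V) log-probabilities as required by F.ctc_loss; blank assumed at index 0.
        log_probs = F.log_softmax(ctc_logits, dim=-1).transpose(0, 1)
        l_ctc = F.ctc_loss(log_probs, ctc_targets, input_lengths, target_lengths, blank=0)
        # Attention branch: token-level cross entropy; padded positions assumed to carry label -100.
        l_ce = F.cross_entropy(att_logits.reshape(-1, att_logits.size(-1)),
                               att_targets.reshape(-1), ignore_index=-100)
        # Model loss = lam * CTC loss + (1 - lam) * cross-entropy loss, with 0.1 <= lam <= 0.3.
        return lam * l_ctc + (1.0 - lam) * l_ce

With λ in this range the cross-entropy term dominates while the CTC term acts as an auxiliary objective for the streaming branch.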
Preferably, the attention decoder rescores the N streaming candidate acoustic sequences as follows:
a sentence-start token is prepended to each streaming candidate acoustic sequence to form an input streaming candidate sequence;
the attention decoder takes the N input streaming candidate sequences and the encoding states corresponding to the N streaming candidate sequences as input, predicts for each a streaming target candidate sequence that contains the end token but not the start token, sums the probabilities at each position of the streaming target candidate sequence, and computes a streaming attention score that is used as the rescoring score of the N streaming candidate acoustic sequences.
Preferably, the rescoring further comprises: computing a weighted sum of the streaming attention score and the CTC streaming acoustic score and using it as the rescoring score of the N streaming candidate acoustic sequences.
Preferably, N is a tunable parameter with 10 ≤ N ≤ 100.
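A minimal sketch of this rescoring step in plain Python is given below; attention_score is a hypothetical callable that runs the attention decoder in teacher-forcing mode on the start-token-prepended candidate and returns its streaming attention score, and the 0.5 weight is only an example value:

    def rescore_streaming(candidates, ctc_scores, attention_score, enc_states, weight=0.5):
        rescored = []
        for hyp, ctc_score in zip(candidates, ctc_scores):
            att_score = attention_score(enc_states, hyp)  # streaming attention score of this candidate
            # Weighted sum of the streaming attention score and the CTC streaming acoustic score.
            combined = weight * att_score + (1.0 - weight) * ctc_score
            rescored.append((combined, hyp))
        rescored.sort(key=lambda pair: pair[0], reverse=True)
        return rescored[0][1]  # the candidate with the highest rescoring score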
Preferably, the specific process by which the attention decoder, using a beam search algorithm, takes the M highest-scoring non-streaming sequences produced during decoding as the M non-streaming candidate acoustic sequences is as follows:
prediction starts from the start token; at each step the complete encoding state and the tokens predicted in previous steps are input, and the score of the newly predicted token is computed; this process is repeated until the end token is predicted; the M highest-scoring non-streaming sequences produced during decoding of the attention decoder, with the start and end tokens removed, are then taken as the M non-streaming candidate acoustic sequences, and their scores as the non-streaming attention scores.
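A simplified label-synchronous beam search following this process might look as below; decoder_step is a hypothetical callable that returns log-probabilities over the vocabulary given the complete encoding state and the tokens predicted so far, and the beam width, sos/eos token ids, and maximum length are illustrative assumptions:

    def attention_beam_search(enc_states, decoder_step, sos, eos, beam=10, max_len=200):
        beams = [([sos], 0.0)]          # (token sequence, accumulated log-probability)
        finished = []
        for _ in range(max_len):
            new_beams = []
            for tokens, score in beams:
                # Complete encoding state + tokens predicted so far -> next-token log-probabilities.
                log_probs = decoder_step(enc_states, tokens)
                best = sorted(enumerate(log_probs), key=lambda p: -p[1])[:beam]
                for tok, lp in best:
                    cand = (tokens + [tok], score + lp)
                    (finished if tok == eos else new_beams).append(cand)
            if not new_beams:
                break
            beams = sorted(new_beams, key=lambda b: -b[1])[:beam]
        finished = finished or beams
        finished = sorted(finished, key=lambda b: -b[1])[:beam]
        # Strip the start and end tokens; keep the sequences with their attention scores.
        return [(seq[1:-1] if seq[-1] == eos else seq[1:], s) for seq, s in finished]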
Preferably, the CTC decoder rescores the M non-streaming candidate acoustic sequences as follows: the CTC non-streaming acoustic scores and the non-streaming attention scores are combined by weighted summation to obtain the CTC decoder's rescoring scores for the M non-streaming candidate acoustic sequences.
Preferably, M is a tunable parameter with 10 ≤ M ≤ 100.
The invention also provides a streaming speech recognition method, comprising the following steps:
(1) each time the input audio stream accumulates a fixed length, the acoustic features of that segment are computed and fed to the streaming encoder;
(2) the streaming acoustic feature stream is converted by the streaming encoder into a streaming encoding state and fed to the CTC decoder;
(3) the CTC decoder predicts over the streaming encoding state using a beam search algorithm;
(4) steps (1)-(3) are repeated; when the sentence ends, the streaming encoding state is closed, finally yielding N streaming candidate acoustic sequences and their CTC streaming acoustic scores;
(5) the attention decoder rescores the N streaming candidate acoustic sequences: a sentence-start token is prepended to each streaming candidate sequence to form an input streaming candidate sequence; the attention decoder takes the N input streaming candidate sequences and the encoding states corresponding to the N streaming candidate sequences as input, predicts for each a streaming target candidate sequence that contains the end token but not the start token, sums the probabilities at each position of the streaming target candidate sequence, and computes a streaming attention score used as the rescoring score of the N streaming candidate acoustic sequences;
alternatively, the streaming attention score and the CTC streaming acoustic score are combined by weighted summation and used as the rescoring score of the N streaming candidate acoustic sequences;
(6) the N streaming candidate acoustic sequences are reordered according to the score of each streaming candidate acoustic sequence, and the highest-scoring candidate is taken as the final streaming recognition result. Streaming recognition performance can be improved by increasing the number N of streaming candidate acoustic sequences; a typical value is N = 10, and the parameter range is 10 ≤ N ≤ 100.
The invention also provides a non-streaming speech recognition method, comprising the following steps:
(1) after the audio input ends, features are extracted from the entire audio and fed to the streaming encoder for encoding;
(2) the attention decoder takes the complete output of the streaming encoder and the start token as input; prediction starts from the start token, and at each step the complete encoding state and the tokens predicted in previous steps are input, after which the score of the newly predicted token is computed;
(3) step (2) is repeated until the end token is predicted; the M highest-scoring non-streaming sequences produced during decoding of the attention decoder, with the start and end tokens removed, are then taken as the M non-streaming candidate acoustic sequences, and their scores as the non-streaming attention scores;
(4) the CTC decoder rescores all M non-streaming candidate acoustic sequences: a dynamic programming algorithm computes, given the complete speech input, the probability of predicting each target non-streaming candidate acoustic sequence, which serves as its CTC non-streaming acoustic score (an illustrative sketch of this computation follows this list);
(5) the CTC non-streaming acoustic score and the non-streaming attention score are combined by weighted summation to obtain the CTC decoder's rescoring score for the M non-streaming candidate acoustic sequences, which are reordered accordingly;
(6) the candidate with the highest rescoring score among the M non-streaming candidate acoustic sequences is output as the recognition result.
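For reference, the dynamic-programming computation mentioned in step (4) can be sketched as follows (a sketch only, assuming per-frame log-probabilities of shape (T, V) with the blank label at index 0; this is the standard CTC forward recursion, not code taken from this disclosure):

    import numpy as np

    def ctc_log_score(log_probs, target, blank=0):
        # Interleave blanks: target [a, b] -> [blank, a, blank, b, blank].
        ext = [blank]
        for t in target:
            ext += [t, blank]
        T, S = log_probs.shape[0], len(ext)
        alpha = np.full((T, S), -np.inf)
        alpha[0, 0] = log_probs[0, blank]
        if S > 1:
            alpha[0, 1] = log_probs[0, ext[1]]
        for t in range(1, T):
            for s in range(S):
                cands = [alpha[t - 1, s]]
                if s > 0:
                    cands.append(alpha[t - 1, s - 1])
                if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                    cands.append(alpha[t - 1, s - 2])
                alpha[t, s] = np.logaddexp.reduce(cands) + log_probs[t, ext[s]]
        # Total log-probability of the target sequence given the complete input.
        if S == 1:
            return alpha[T - 1, 0]
        return np.logaddexp(alpha[T - 1, S - 1], alpha[T - 1, S - 2])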
Compared with the prior art, the technical solution provided by the embodiments of the application has the following advantages:
(1) The model structure is simple: the invention contains only one streaming encoder and two decoders (a CTC decoder and an attention decoder), so the structure is relatively simple.
(2) The decoding processes are simple and complementary: in both streaming and non-streaming decoding of the model, a performance gain can be obtained simply by swapping the order in which the two decoders are applied.
(3) Training is simple: compared with training separate streaming and non-streaming models, the system contains both types of decoders and can be trained jointly, so the different modules mutually improve each other's convergence and training speed.
Drawings
FIG. 1 is a block diagram of a hybrid speech recognition system for streaming and non-streaming according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of the Transformer self-attention mechanism of the streaming encoder according to an embodiment of the present invention;
FIG. 3 is a flow chart of a streaming speech recognition method according to an embodiment of the present invention;
FIG. 4 is a flowchart of a non-streaming speech recognition method according to an embodiment of the present invention.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present invention. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the invention, as detailed in the appended claims.
As shown in FIG. 1, an embodiment of the present application provides a hybrid streaming and non-streaming speech recognition system,
the method comprises the following steps: a stream encoder, a concatenated sequential classification decoder, and an attention mechanism decoder; the stream encoder is constructed by adopting a Transformer based on a local self-attention mechanism for stream modeling, and the dependence on all acoustic contexts needs to be eliminated; the joint time sequence classification decoder comprises a linear mapping layer which is responsible for mapping the coding state to a pre-designed word list space, so that the dimension represented by the coding state mapping is the same as that of the word list space, and then the predicted mark is calculated through Softmax and is mainly used for stream decoding; the attention mechanism decoder is constructed by adopting a Transformer decoder and consists of a front-end convolution layer and a plurality of repeated unidirectional Transformer coding layers, the last layer is a linear mapping layer, the dimensionality represented by the coding state mapping is the same as the dimensionality of the vocabulary space, and the final output probability is calculated;
during model training, the CTC decoder computes the CTC loss function L_CTC; the attention decoder computes the cross-entropy loss function L_CE; the weighted sum of the two serves as the model loss function L_final of the recognition system.
The model loss function takes the specific form:
model loss function = λ × CTC loss function + (1 − λ) × cross-entropy loss function;
where λ is a tunable parameter, set here to 0.1; that is,
L_final = λ · L_CTC + (1 − λ) · L_CE
During streaming inference, the CTC decoder leads and the attention decoder assists: the CTC decoder applies a beam search algorithm to the encoding state to generate N streaming candidate acoustic sequences together with their CTC streaming acoustic scores; the attention decoder then rescores the N streaming candidates, the candidates are reordered according to the score of each streaming candidate acoustic sequence, and the highest-scoring candidate is taken as the final streaming recognition result;
the attention decoder rescores the N streaming candidate acoustic sequences as follows:
a sentence-start token is prepended to each streaming candidate sequence to form an input streaming candidate sequence;
the attention decoder takes the N input streaming candidate sequences and the encoding states corresponding to the N streaming candidate sequences as input, predicts for each a streaming target candidate sequence that contains the end token but not the start token, sums the probabilities at each position of the streaming target candidate sequence, and computes a streaming attention score used as the rescoring score of the N streaming candidate acoustic sequences;
the rescoring may further comprise: computing a weighted sum of the streaming attention score and the CTC streaming acoustic score and using it as the rescoring score of the N streaming candidate acoustic sequences; a typical value of N is 10, and larger values such as 50 or 100 may also be used;
during non-streaming inference, the attention decoder leads and the CTC decoder assists: the attention decoder applies a beam search algorithm and takes the M highest-scoring non-streaming sequences produced during decoding as M non-streaming candidate acoustic sequences; the CTC decoder then rescores the M candidates, the candidates are reordered according to their scores, and the highest-scoring candidate is taken as the final non-streaming recognition result;
the specific process by which the attention decoder, using a beam search algorithm, takes the M highest-scoring non-streaming sequences produced during decoding as the M non-streaming candidate acoustic sequences is as follows:
prediction starts from the start token; at each step the complete encoding state and the tokens predicted in previous steps are input, and the score of the newly predicted token is computed; this process is repeated until the end token is predicted; the M highest-scoring non-streaming sequences produced during decoding of the attention decoder, with the start and end tokens removed, are then taken as the M non-streaming candidate acoustic sequences, and their scores as the non-streaming attention scores;
the CTC decoder rescores the M non-streaming candidate acoustic sequences as follows: the CTC non-streaming acoustic scores and the non-streaming attention scores are combined by weighted summation to obtain the CTC decoder's rescoring scores for the M non-streaming candidate acoustic sequences;
a typical value of M is 10, and larger values such as 50 or 100 may also be used;
as shown in fig. 2, the streaming encoder is constructed by using a streaming Transformer model structure; the system consists of a front-end module and a multi-layer repeated unidirectional Transformer coding layer. The front-end module comprises two layers of convolution, the convolution kernel of the front-end module is 3X3, the step length is set to be 2, ReLU activation functions are used between the layers of convolution as connection, and finally the model dimension is output through a linear mapping layer. The unidirectional Transformer coding layer consists of a multi-head unidirectional self-attention mechanism and a feedforward network, and each multi-head unidirectional self-attention mechanism and feedforward network uses residual connection and post-layer normalization to help the model to converge. The relative position coding is introduced in the unidirectional self-attention mechanism. The encoder calculates as follows:
a convolution front-end module:
O_conv1 = ReLU(Conv2D(x))
O_front = Linear(ReLU(Conv2D(O_conv1)))
where ReLU denotes the activation function, Conv2D denotes a 2D convolution layer, x denotes the input speech features, O_conv1 denotes the output of the first convolution layer, and O_front denotes the output of the convolution front-end module.
The multi-head one-way self-attention mechanism:
head_i = Softmax((Q·W_i^Q)(K·W_i^K)^T / √d_k + A)·(V·W_i^V)
O_SLF = Concat(head_1, …, head_H)·W^O
where W_i^Q, W_i^K, W_i^V and W^O denote learnable parameter matrices, and Q, K, V denote the query, key and value matrices, respectively. For the self-attention mechanism Q = K = V: in the first self-attention layer Q = K = V = O_front, and in subsequent layers Q = K = V equals the output of the preceding feed-forward network layer. A denotes a learnable relative position encoding matrix; the relative position encoding of Transformer-XL may also be used here instead. d_k denotes the dimension of the last axis of the key matrix. Each attention head head_i is an independent attention mechanism; the H attention outputs are concatenated and linearly mapped to obtain the final output O_SLF.
Calculation of the feedforward neural network:
O_FFN = Linear(GLU(Linear(O_SLF)))
where Linear denotes a linear mapping and GLU denotes the activation function;
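A rough sketch of the convolution front-end and one unidirectional coding layer, under the stated assumptions (3×3 kernels with stride 2, ReLU, a final linear mapping, GLU feed-forward, residual connections with post-layer normalization), is given below in PyTorch. The model dimension, number of heads, and input feature dimension are illustrative; the learnable relative position encoding described above is omitted, and a strictly causal mask stands in for the local/unidirectional self-attention:

    import torch
    import torch.nn as nn

    class ConvFrontEnd(nn.Module):
        def __init__(self, feat_dim=80, d_model=256):
            super().__init__()
            self.conv1 = nn.Conv2d(1, d_model, kernel_size=3, stride=2, padding=1)
            self.conv2 = nn.Conv2d(d_model, d_model, kernel_size=3, stride=2, padding=1)
            self.relu = nn.ReLU()
            # Linear mapping back to the model dimension after 4x subsampling of the feature axis.
            self.proj = nn.Linear(d_model * (((feat_dim + 1) // 2 + 1) // 2), d_model)

        def forward(self, x):                      # x: (batch, time, feat_dim)
            x = x.unsqueeze(1)                     # add a channel axis
            x = self.relu(self.conv1(x))
            x = self.relu(self.conv2(x))           # (batch, d_model, time', feat')
            b, c, t, f = x.size()
            return self.proj(x.permute(0, 2, 1, 3).reshape(b, t, c * f))

    class UnidirectionalEncoderLayer(nn.Module):
        def __init__(self, d_model=256, n_heads=4, d_ff=1024):
            super().__init__()
            self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.ffn = nn.Sequential(nn.Linear(d_model, 2 * d_ff), nn.GLU(),
                                     nn.Linear(d_ff, d_model))
            self.norm1 = nn.LayerNorm(d_model)
            self.norm2 = nn.LayerNorm(d_model)

        def forward(self, x):                      # x: (batch, time, d_model)
            t = x.size(1)
            # Each position may attend only to itself and to past positions.
            mask = torch.triu(torch.ones(t, t, dtype=torch.bool), diagonal=1)
            out, _ = self.attn(x, x, x, attn_mask=mask)
            x = self.norm1(x + out)                # residual connection + post-layer normalization
            return self.norm2(x + self.ffn(x))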
The attention decoder is built on a Transformer decoder and comprises a front-end word-embedding module, sine-cosine position encoding, several stacked Transformer decoding layers, and a final linear mapping component. Each Transformer decoding layer consists of a masked self-attention mechanism, an encoder-decoder attention mechanism, and a feed-forward network. Each multi-head masked self-attention mechanism, encoder-decoder attention mechanism, and feed-forward network uses a residual connection and post-layer normalization to help the model converge. The computation proceeds as follows:
O_e = Embedding(y) + P_e
where y denotes the input token sequence, P_e denotes the sine-cosine position encoding, and O_e denotes the word-embedded representation with position encoding.
A shading self-attention mechanism:
head_i = Softmax(Mask((Q·W_i^Q)(K·W_i^K)^T / √d_k))·(V·W_i^V)
O_SLF = Concat(head_1, …, head_H)·W^O
where W_i^Q, W_i^K, W_i^V and W^O denote learnable parameter matrices, and Q, K, V denote the query, key and value matrices, respectively. For the self-attention mechanism Q = K = V: in the first self-attention layer Q = K = V = O_e, and otherwise they equal the output of the preceding feed-forward network layer. During the self-attention computation, the information of future positions of each vector must be masked out to force the model to learn the temporal dependencies of the language.
The attention mechanism of encoding and decoding is as follows:
head_i = Softmax((Q·W_i^Q)(K·W_i^K)^T / √d_k)·(V·W_i^V)
O_CED = Concat(head_1, …, head_H)·W^O
where W_i^Q, W_i^K, W_i^V and W^O denote learnable parameter matrices, and Q, K, V denote the query, key and value matrices, respectively. For the encoder-decoder attention mechanism, Q = O_SLF (the output of the masked self-attention), while K and V equal the encoder output; the concatenated heads are linearly mapped to obtain the output O_CED.
Calculating a feedforward network:
O_FFN = Linear(GLU(Linear(O_CED)))
where Linear denotes a linear mapping and GLU denotes the activation function.
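A hedged sketch of one such decoding layer is given below (PyTorch; the word-embedding and sine-cosine position-encoding front-end are omitted, and the module choices and dimensions are assumptions rather than the exact implementation of this disclosure). It shows the masked self-attention over already-predicted tokens, the encoder-decoder attention whose queries come from the decoder while keys and values come from the encoder output, and the GLU feed-forward network:

    import torch
    import torch.nn as nn

    class DecoderLayer(nn.Module):
        def __init__(self, d_model=256, n_heads=4, d_ff=1024):
            super().__init__()
            self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.ffn = nn.Sequential(nn.Linear(d_model, 2 * d_ff), nn.GLU(),
                                     nn.Linear(d_ff, d_model))
            self.norms = nn.ModuleList([nn.LayerNorm(d_model) for _ in range(3)])

        def forward(self, y, enc_out):             # y: (batch, dec_len, d_model)
            L = y.size(1)
            # Mask future positions so each token only sees what has already been predicted.
            causal = torch.triu(torch.ones(L, L, dtype=torch.bool), diagonal=1)
            s, _ = self.self_attn(y, y, y, attn_mask=causal)
            y = self.norms[0](y + s)
            # Encoder-decoder attention: queries from the decoder, keys and values from the encoder.
            c, _ = self.cross_attn(y, enc_out, enc_out)
            y = self.norms[1](y + c)
            return self.norms[2](y + self.ffn(y))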
As shown in FIG. 3, a streaming speech recognition method includes:
(1) each time the input audio stream accumulates a fixed length, the acoustic features of that segment are computed and fed to the streaming encoder;
(2) the streaming acoustic feature stream is converted by the streaming encoder into a streaming encoding state and fed to the CTC decoder;
(3) the CTC decoder predicts over the streaming encoding state using a beam search algorithm;
(4) steps (1)-(3) are repeated; when the sentence ends, the streaming encoding state is closed, finally yielding N streaming candidate acoustic sequences and their CTC streaming acoustic scores;
(5) the attention decoder rescores the N streaming candidate acoustic sequences: a sentence-start token is prepended to each streaming candidate sequence to form an input streaming candidate sequence; the attention decoder takes the N input streaming candidate sequences and the encoding states corresponding to the N streaming candidate sequences as input, predicts for each a streaming target candidate sequence that contains the end token but not the start token, sums the probabilities at each position of the streaming target candidate sequence, and computes a streaming attention score used as the rescoring score of the N streaming candidate acoustic sequences;
alternatively, the streaming attention score and the CTC streaming acoustic score are combined by weighted summation and used as the rescoring score of the N streaming candidate acoustic sequences;
(6) the N streaming candidate acoustic sequences are reordered according to the score of each streaming candidate acoustic sequence, and the highest-scoring candidate is taken as the final streaming recognition result. Streaming recognition performance can be improved by increasing the number N of streaming candidate acoustic sequences; a typical value is N = 10, and the parameter range is 10 ≤ N ≤ 100.
(7) Steps (4) and (5) may be performed segment by segment: whenever a fixed number of tokens have been recognized, the current streaming candidate acoustic sequences are reordered once and can be pruned according to the ordering result to improve the efficiency of subsequent decoding; a schematic sketch of this chunk-by-chunk loop is given after this list.
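A schematic chunk-by-chunk loop corresponding to steps (1)-(4) above is shown below; extract_features, stream_encoder, and ctc_prefix_beam_search are hypothetical callables standing in for the modules of this embodiment:

    def streaming_decode(audio_chunks, extract_features, stream_encoder,
                         ctc_prefix_beam_search, n_best=10):
        enc_states = []
        hyps = []
        for chunk in audio_chunks:                     # each chunk is a fixed-length audio segment
            feats = extract_features(chunk)            # (1) streaming acoustic feature stream
            enc_states.append(stream_encoder(feats))   # (2) streaming encoding state
            # (3) incremental CTC prefix beam search over the encoding states seen so far
            hyps = ctc_prefix_beam_search(enc_states, beam=n_best)
        # (4) at the end of the sentence: N candidates with their CTC streaming acoustic scores;
        # the encoding states are kept so the attention decoder can rescore the candidates in (5).
        return hyps, enc_states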
As shown in FIG. 4, a non-streaming speech recognition method includes:
(1) after the audio input ends, features are extracted from the entire audio and fed to the streaming encoder for encoding;
(2) the attention decoder takes the complete output of the streaming encoder and the start token as input; prediction starts from the start token, and at each step the complete encoding state and the tokens predicted in previous steps are input, after which the score of the newly predicted token is computed;
(3) step (2) is repeated until the end token is predicted; the M highest-scoring non-streaming sequences produced during decoding of the attention decoder, with the start and end tokens removed, are then taken as the M non-streaming candidate acoustic sequences, and their scores as the non-streaming attention scores;
(4) the CTC decoder rescores all M non-streaming candidate acoustic sequences: a dynamic programming algorithm computes, given the complete speech input, the probability of predicting each target non-streaming candidate acoustic sequence, which serves as its CTC non-streaming acoustic score;
(5) the CTC non-streaming acoustic score and the non-streaming attention score are combined by weighted summation to obtain the CTC decoder's rescoring score for the M non-streaming candidate acoustic sequences, which are reordered accordingly;
(6) the candidate with the highest rescoring score among the M non-streaming candidate acoustic sequences is output as the recognition result.
It is to be understood that although the terms first, second, third, etc. may be used herein to describe various information, this information should not be limited to these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information and, similarly, second information may also be referred to as first information, without departing from the scope of the present invention. The word "if" as used herein may be interpreted as "when" or "upon" or "in response to a determination", depending on the context.
While this specification contains many specific implementation details, these should not be construed as limitations on the scope of any invention or of what may be claimed, but rather as descriptions of features specific to particular embodiments of particular inventions. Certain features that are described in this specification in the context of separate embodiments can also be implemented in combination in a single embodiment. In other instances, features described in connection with one embodiment may be implemented as discrete components or in any suitable subcombination. Moreover, although features may be described above as acting in certain combinations and even initially claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a subcombination or variation of a subcombination.
Similarly, while operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In some cases, multitasking and parallel processing may be advantageous. Moreover, the separation of various system modules and components in the embodiments described above should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.
Thus, particular embodiments of the subject matter have been described. Other embodiments are within the scope of the following claims. In some cases, the actions recited in the claims can be performed in a different order and still achieve desirable results. Further, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some implementations, multitasking and parallel processing may be advantageous.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (9)

1. A hybrid streaming and non-streaming speech recognition system, comprising: a streaming encoder, a connectionist temporal classification (CTC) decoder, and an attention decoder; the streaming encoder is built from Transformer layers with a local self-attention mechanism and outputs an encoding state; the CTC decoder comprises a linear mapping layer responsible for projecting the encoding state into a pre-designed vocabulary space to obtain a projected representation whose dimension equals the vocabulary size, after which the predicted tokens are computed through a Softmax and used for streaming decoding; the attention decoder is built on a Transformer decoder and consists of a front-end convolution layer and several stacked unidirectional Transformer coding layers, with a final linear mapping layer whose output dimension equals the vocabulary size, from which the final output probabilities are computed; the hybrid streaming and non-streaming speech recognition system is trained such that, during training, the CTC decoder computes a CTC loss function, the attention decoder computes a cross-entropy loss function, and a weighted sum of the CTC loss function and the cross-entropy loss function serves as the model loss function of the recognition system;
during streaming inference, the CTC decoder leads and the attention decoder assists: the CTC decoder applies a beam search algorithm to the encoding state to generate N streaming candidate acoustic sequences and their CTC streaming acoustic scores, the attention decoder rescores the N streaming candidate acoustic sequences, the N streaming candidate acoustic sequences are reordered according to the score of each streaming candidate acoustic sequence, and the streaming candidate acoustic sequence with the highest score is taken as the final streaming recognition result;
during non-streaming inference, the attention decoder leads and the CTC decoder assists: the attention decoder applies a beam search algorithm and takes the M highest-scoring non-streaming sequences produced during decoding as M non-streaming candidate acoustic sequences, the CTC decoder rescores the M non-streaming candidate acoustic sequences, the M non-streaming candidate acoustic sequences are reordered according to their scores, and the non-streaming candidate acoustic sequence with the highest score is taken as the final non-streaming recognition result.
2. The hybrid streaming and non-streaming speech recognition system of claim 1, wherein the model loss function takes the specific form:
model loss function = λ × CTC loss function + (1 − λ) × cross-entropy loss function;
wherein λ is a tunable parameter with 0.1 ≤ λ ≤ 0.3.
3. The hybrid streaming and non-streaming speech recognition system of claim 1, wherein the attention decoder rescores the N streaming candidate acoustic sequences by:
prepending a sentence-start token to each streaming candidate acoustic sequence to form an input streaming candidate sequence;
the attention decoder taking the N input streaming candidate sequences and the encoding states corresponding to the N streaming candidate sequences as input, predicting a streaming target candidate sequence that contains the end token but not the start token, summing the probabilities at each position of the streaming target candidate sequence, and computing a streaming attention score that is used as the rescoring score of the N streaming candidate acoustic sequences.
4. The hybrid streaming and non-streaming speech recognition system of claim 3, wherein the rescoring further comprises: computing a weighted sum of the streaming attention score and the CTC streaming acoustic score as the rescoring score of the N streaming candidate acoustic sequences.
5. The system of claim 4, wherein N is a tunable parameter with 10 ≤ N ≤ 100.
6. The hybrid streaming and non-streaming speech recognition system of claim 1, wherein the attention decoder uses a beam search algorithm to take the M highest-scoring non-streaming sequences produced during decoding as the M non-streaming candidate acoustic sequences by:
predicting from the start token, wherein each step takes as input the complete encoding state and the tokens predicted in previous steps and then computes the score of the predicted token; repeating this process until the end token is predicted; and then taking the M highest-scoring non-streaming sequences produced during decoding of the attention decoder, with the start and end tokens removed, as the M non-streaming candidate acoustic sequences and their non-streaming attention scores.
7. The system of claim 6, wherein the CTC decoder rescores the M non-streaming candidate acoustic sequences by: computing a weighted sum of the CTC non-streaming acoustic scores and the non-streaming attention scores to obtain the CTC decoder's rescoring scores for the M non-streaming candidate acoustic sequences.
8. The system of claim 7, wherein M is a tunable parameter with 10 ≤ M ≤ 100.
9. A hybrid streaming and non-streaming speech recognition method, comprising a streaming speech recognition method and a non-streaming speech recognition method, wherein the streaming speech recognition method comprises the following steps:
(1) each time the input audio stream accumulates a fixed length, computing the acoustic features of that segment to be fed to the streaming encoder;
(2) converting the streaming acoustic feature stream, through the streaming encoder, into a streaming encoding state that is fed to the connectionist temporal classification (CTC) decoder;
(3) the CTC decoder predicting over the streaming encoding state using a beam search algorithm;
(4) repeating steps (1)-(3); when a sentence ends, closing the streaming encoding state and finally generating N streaming candidate acoustic sequences and their CTC streaming acoustic scores;
(5) the attention decoder rescoring the N streaming candidate acoustic sequences: a sentence-start token is prepended to each streaming candidate sequence to form an input streaming candidate sequence; the attention decoder takes the N input streaming candidate sequences and the encoding states corresponding to the N streaming candidate sequences as input, predicts a streaming target candidate sequence that contains the end token but not the start token, sums the probabilities at each position of the streaming target candidate sequence, and computes a streaming attention score used as the rescoring score of the N streaming candidate acoustic sequences;
or computing a weighted sum of the streaming attention score and the CTC streaming acoustic score as the rescoring score of the N streaming candidate acoustic sequences;
(6) reordering the N streaming candidate acoustic sequences according to the score of each streaming candidate acoustic sequence and taking the streaming candidate acoustic sequence with the highest score as the final streaming recognition result, wherein streaming recognition performance is improved by increasing the number N of streaming candidate acoustic sequences, a typical value of N is 10, and the parameter range is 10 ≤ N ≤ 100;
and wherein the non-streaming speech recognition method comprises the following steps: (1) after the audio input ends, extracting features from the entire audio and feeding them to the streaming encoder for encoding;
(2) the attention decoder taking the complete output of the streaming encoder and the start token as input; prediction starts from the start token, at each step the complete encoding state and the tokens predicted in previous steps are input, and the score of the predicted token is then computed;
(3) repeating step (2) until the end token is predicted; then taking the M highest-scoring non-streaming sequences produced during decoding of the attention decoder, with the start and end tokens removed, as the M non-streaming candidate acoustic sequences and their non-streaming attention scores;
(4) using the CTC decoder to rescore all M non-streaming candidate acoustic sequences, wherein a dynamic programming algorithm computes, given the complete speech input, the probability of predicting each target non-streaming candidate acoustic sequence as its CTC non-streaming acoustic score;
(5) computing a weighted sum of the CTC non-streaming acoustic score and the non-streaming attention score as the CTC decoder's rescoring score for the M non-streaming candidate acoustic sequences, and reordering the candidates accordingly;
(6) outputting the non-streaming candidate acoustic sequence with the highest rescoring score among the M non-streaming candidate acoustic sequences as the recognition result.
CN202110675286.3A 2021-06-18 2021-06-18 Streaming and non-streaming mixed voice recognition system and streaming voice recognition method Active CN113257248B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110675286.3A CN113257248B (en) 2021-06-18 2021-06-18 Streaming and non-streaming mixed voice recognition system and streaming voice recognition method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110675286.3A CN113257248B (en) 2021-06-18 2021-06-18 Streaming and non-streaming mixed voice recognition system and streaming voice recognition method

Publications (2)

Publication Number Publication Date
CN113257248A CN113257248A (en) 2021-08-13
CN113257248B true CN113257248B (en) 2021-10-15

Family

ID=77188576

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110675286.3A Active CN113257248B (en) 2021-06-18 2021-06-18 Streaming and non-streaming mixed voice recognition system and streaming voice recognition method

Country Status (1)

Country Link
CN (1) CN113257248B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111968629A (en) * 2020-07-08 2020-11-20 重庆邮电大学 Chinese speech recognition method combining Transformer and CNN-DFSMN-CTC
CN113674734B (en) * 2021-08-24 2023-08-01 中国铁道科学研究院集团有限公司电子计算技术研究所 Information query method and system based on voice recognition, equipment and storage medium
CN113539273B (en) * 2021-09-16 2021-12-10 腾讯科技(深圳)有限公司 Voice recognition method and device, computer equipment and storage medium
CN113705541B (en) * 2021-10-21 2022-04-01 中国科学院自动化研究所 Expression recognition method and system based on transform marker selection and combination

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110473529B (en) * 2019-09-09 2021-11-05 北京中科智极科技有限公司 Stream type voice transcription system based on self-attention mechanism
US11302309B2 (en) * 2019-09-13 2022-04-12 International Business Machines Corporation Aligning spike timing of models for maching learning
CN111179918B (en) * 2020-02-20 2022-10-14 中国科学院声学研究所 Joint meaning time classification and truncation type attention combined online voice recognition technology
CN111402891B (en) * 2020-03-23 2023-08-11 抖音视界有限公司 Speech recognition method, device, equipment and storage medium
CN111968629A (en) * 2020-07-08 2020-11-20 重庆邮电大学 Chinese speech recognition method combining Transformer and CNN-DFSMN-CTC
CN112037798B (en) * 2020-09-18 2022-03-01 中科极限元(杭州)智能科技股份有限公司 Voice recognition method and system based on trigger type non-autoregressive model
CN112509564B (en) * 2020-10-15 2024-04-02 江苏南大电子信息技术股份有限公司 End-to-end voice recognition method based on connection time sequence classification and self-attention mechanism
CN112530437B (en) * 2020-11-18 2023-10-20 北京百度网讯科技有限公司 Semantic recognition method, device, equipment and storage medium
CN112802467B (en) * 2020-12-21 2024-05-31 出门问问(武汉)信息科技有限公司 Speech recognition method and device

Also Published As

Publication number Publication date
CN113257248A (en) 2021-08-13


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant