CN114155834A - Voice recognition method, device, equipment and storage medium - Google Patents

Voice recognition method, device, equipment and storage medium

Info

Publication number
CN114155834A
Authority
CN
China
Prior art keywords
recognized
voice
speech
decoding
coding
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111432141.7A
Other languages
Chinese (zh)
Inventor
刘丹
韩凯
魏思
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Science and Technology of China USTC
iFlytek Co Ltd
Original Assignee
iFlytek Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by iFlytek Co Ltd filed Critical iFlytek Co Ltd
Priority to CN202111432141.7A
Publication of CN114155834A
Legal status: Pending

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L15/26 Speech to text systems
    • G10L19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/0018 Speech coding using phonetic or linguistical decoding of the source; Reconstruction using text-to-speech synthesis

Abstract

The application provides a speech recognition method, apparatus, device and storage medium, wherein the method includes: acquiring coding features obtained by an encoder encoding the acoustic features of a speech to be recognized, where the encoder is trained according to a first recognition result of a speech sample and the text label of the speech sample, and the first recognition result of the speech sample is determined according to the coding features obtained by the encoder encoding the acoustic features of the speech sample and the attention coefficients of the recognition result of the speech sample to each frame of coding features output by the encoder; and determining a speech recognition result of the speech to be recognized according to the coding features of the speech to be recognized. With this technical solution, the accuracy of speech recognition can be improved.

Description

Voice recognition method, device, equipment and storage medium
Technical Field
The present application relates to the field of speech recognition technology, and more particularly, to a speech recognition method, apparatus, device, and storage medium.
Background
Automatic Speech Recognition (ASR) is a technology by which a machine converts a speech signal into the corresponding text or command through recognition and understanding, that is, a technology that lets a machine understand human speech.
Currently, end-to-end speech recognition is the mainstream scheme, and among end-to-end schemes the attention-based one gives the best recognition results. However, the attention mechanism of a conventional attention-based end-to-end model can hardly guarantee monotonicity: the attention of the model is unconstrained and unordered, which makes it difficult to further improve recognition accuracy, and the recognition effect is often poor, in particular when streaming recognition is required.
Disclosure of Invention
Based on the above technical current situation, the present application provides a speech recognition method, apparatus, device and storage medium, which can constrain the attention of speech recognition, thereby improving the accuracy of speech recognition.
A speech recognition method comprising:
acquiring coding characteristics obtained by coding acoustic characteristics of a voice to be recognized by a coder;
the encoder is obtained according to a first recognition result of a voice sample and text label training of the voice sample, and the first recognition result of the voice sample is determined according to an encoding characteristic obtained by encoding the acoustic characteristic of the voice sample by the encoder and an attention coefficient of the recognition result of the voice sample to each frame of encoding characteristic output by the encoder;
and determining a voice recognition result of the voice to be recognized according to the coding characteristics of the voice to be recognized.
Optionally, determining a speech recognition result of the speech to be recognized according to the coding feature of the speech to be recognized, including:
decoding the coding features of the voice to be recognized to obtain the decoding features of the voice to be recognized;
determining attention coefficients of the recognition result of the voice to be recognized to the coding features of each frame of the voice to be recognized according to the coding features and the decoding features of the voice to be recognized;
and determining a voice recognition result of the voice to be recognized according to the coding feature and the decoding feature of the voice to be recognized and the attention coefficient of the recognition result of the voice to be recognized to the coding feature of each frame of the voice to be recognized.
Optionally, the encoding step of the encoder encoding the acoustic feature of the speech to be recognized to obtain an encoding feature includes:
coding each frame of acoustic features of the speech to be recognized based on an attention mechanism respectively, so as to obtain the coding features of the speech to be recognized;
the method for coding any frame of acoustic features based on the self-attention mechanism comprises the following steps: and coding the frame acoustic features according to the acoustic feature sequence with the set length containing the frame acoustic features to obtain coding features corresponding to the frame acoustic features.
Optionally, the obtaining the decoding feature of the speech to be recognized by decoding the coding feature of the speech to be recognized includes:
decoding the coding features of each frame of the voice to be recognized and the decoding features corresponding to the recognized result of the voice to be recognized to obtain the decoding features of the voice to be recognized;
and the recognized result of the voice to be recognized is the recognized result of the voice to be recognized, which is obtained before the current moment.
Optionally, determining a speech recognition result for the speech to be recognized according to the coding feature and the decoding feature of the speech to be recognized and the attention coefficient of the recognition result of the speech to be recognized to the coding feature of each frame of the speech to be recognized, including:
determining a first decoding result of the speech to be recognized according to the coding features of each frame of the speech to be recognized and the attention coefficient of the recognition result of the speech to be recognized to the coding features of each frame of the speech to be recognized;
determining a second decoding result of the voice to be recognized according to the decoding characteristics of the voice to be recognized;
and determining a voice recognition result of the voice to be recognized according to the first decoding result and the second decoding result.
Optionally, determining the first decoding result of the speech to be recognized according to the coding features of each frame of the speech to be recognized and the attention coefficient of the decoding result of the speech to be recognized to the coding features of each frame of the speech to be recognized, where the determining the first decoding result of the speech to be recognized includes:
respectively decoding each frame coding feature of the speech to be recognized to obtain a decoding result corresponding to each frame coding feature;
and weighting the decoding result corresponding to each frame of coding feature by taking the attention coefficient of the recognition result of the speech to be recognized to each frame of coding feature of the speech to be recognized as a weight to obtain a first decoding result of the speech to be recognized.
Optionally, the first decoding result and the second decoding result respectively include a plurality of decoding paths;
determining a voice recognition result of the voice to be recognized according to the first decoding result and the second decoding result, including:
and performing weighted summation on the scores of the decoding paths of the first decoding result and the second decoding result, and determining a voice recognition result of the voice to be recognized from the decoding paths after weighted summation of the scores.
Optionally, the training process of the encoder includes:
acquiring coding characteristics obtained by coding the acoustic characteristics of the voice sample by the coder, and acquiring decoding characteristics obtained by decoding the coding characteristics of the voice sample;
determining attention coefficients of the recognition results of the voice samples to the coding features of all frames of the voice samples according to the coding features and the decoding features of the voice samples;
determining a first recognition result of the voice sample according to the coding features of the voice sample and the attention coefficient;
and optimizing the parameters of the encoder by comparing the first recognition result with the label of the voice sample.
Optionally, obtaining coding features obtained by processing acoustic features of a speech to be recognized by a coder, and determining a speech recognition result of the speech to be recognized according to the coding features of the speech to be recognized, includes:
inputting the acoustic features of the voice to be recognized into a pre-trained voice recognition model, enabling the voice recognition model to obtain coding features obtained by processing the acoustic features of the voice to be recognized by a coder of the voice recognition model, and determining a voice recognition result of the voice to be recognized according to the coding features of the voice to be recognized.
Optionally, the speech recognition model includes an encoder and a decoder, and the training process for the speech recognition model includes:
inputting the acoustic characteristics of a voice sample into an encoder to obtain the encoding characteristics output by the encoder, and decoding the encoding characteristics by using a decoder to obtain decoding characteristics;
determining attention coefficients of the decoding results of the voice samples to the coding features of the frames of the voice samples according to the coding features and the decoding features;
determining a first recognition result of the voice sample according to the coding features of the voice sample and the attention coefficient, and determining a first loss function by comparing the first recognition result with a label of the voice sample;
determining a second recognition result of the voice sample according to the decoding characteristics of the voice sample, and determining a second loss function by comparing the second recognition result with the label of the voice sample;
optimizing parameters of the encoder and the decoder using the first loss function and the second loss function.
Optionally, optimizing parameters of the encoder and the decoder by using the first loss function and the second loss function includes:
carrying out weighted summation on the first loss function and the second loss function to obtain a combined loss function;
optimizing parameters of the encoder by using the joint loss function;
and
optimizing parameters of the decoder using the second loss function, or using the second loss function and the joint loss function.
A speech recognition apparatus comprising:
the encoding unit is used for acquiring encoding characteristics obtained by encoding the acoustic characteristics of the speech to be recognized by the encoder;
the encoder is obtained according to a first recognition result of a voice sample and text label training of the voice sample, and the first recognition result of the voice sample is determined according to an encoding characteristic obtained by encoding the acoustic characteristic of the voice sample by the encoder and an attention coefficient of the recognition result of the voice sample to each frame of encoding characteristic output by the encoder;
and the decoding unit is used for determining a voice recognition result of the voice to be recognized according to the coding characteristics of the voice to be recognized.
A speech recognition device comprising:
a memory and a processor;
wherein the memory is connected with the processor and used for storing programs;
the processor is used for realizing the voice recognition method by operating the program in the memory.
A storage medium having stored thereon a computer program which, when executed by a processor, implements the speech recognition method described above.
According to the speech recognition method described above, the training process of the encoder incorporates the attention coefficients of the speech recognition result over each frame of coding features output by the encoder. This training mode regularizes the encoder and helps promote the monotonicity of the encoder's attention mechanism; in the coding features of the speech to be recognized output by an encoder trained in this way, the coding feature frames attended to by speech recognition are more accurate, so the coding features output by the encoder are more conducive to speech recognition, that is, the accuracy of the speech recognition result can be improved.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, it is obvious that the drawings in the following description are only embodiments of the present application, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.
Fig. 1 is a schematic flowchart of a speech recognition method according to an embodiment of the present application;
FIG. 2 is a schematic structural diagram of a speech recognition model provided in an embodiment of the present application;
FIG. 3 is a schematic structural diagram of a regularization module in a speech recognition model provided by an embodiment of the present application;
FIG. 4 is a schematic structural diagram of an encoder of a speech recognition model provided in an embodiment of the present application;
FIG. 5 is a flow chart of another speech recognition method provided by the embodiments of the present application;
FIG. 6 is a schematic structural diagram of a decoder of a speech recognition model provided in an embodiment of the present application;
fig. 7 is a schematic structural diagram of a speech recognition apparatus according to an embodiment of the present application;
fig. 8 is a schematic structural diagram of a speech recognition device according to an embodiment of the present application.
Detailed Description
The technical scheme of the embodiment of the application is suitable for the application scene of voice recognition, and the accuracy and the recognition efficiency of end-to-end voice recognition can be improved by adopting the technical scheme of the embodiment of the application.
Speech recognition is widely used in the fields of home appliances, communications, automotive electronics, medical care, home services, consumer electronics, and the like.
Currently, end-to-end speech recognition is the most common speech recognition solution. The mainstream end-to-end models mainly fall into three types: end-to-end ASR based on CTC (Connectionist Temporal Classification), the attention-based encoder-decoder model (Attention model), and end-to-end ASR based on RNN-T (Recurrent Neural Network Transducer). Although all three end-to-end models show excellent performance in the speech recognition field, each has disadvantages. CTC assumes that the outputs of different frames are conditionally independent in order to improve efficiency, so a language model often has to be added during decoding. RNN-T removes the conditional-independence assumption of CTC and integrates a language model into the system, but the model is difficult to train and some unreasonable decoding paths exist.
In comparison, the Attention-based Encoder-Decoder model gives the best speech recognition results, but it depends strongly on the length of the input sentence, and monotonicity of the attention is difficult to guarantee. Specifically, in the existing Attention-based Encoder-Decoder model, the attention of the decoding end to the encoding end is unconstrained and unordered, which makes the recognition accuracy and efficiency of the model difficult to improve.
Based on the technical current situation, the embodiment of the application provides a new speech recognition scheme aiming at an end-to-end model based on an attention mechanism, and the end-to-end speech recognition realized based on the scheme can strengthen the restriction of the decoding end on the attention of the encoding end, so that the attention of the decoding end is focused on correct encoding characteristics, thereby playing a role in regularization on the recognition process and further improving the recognition effect. Meanwhile, the attention regularization processing can ensure the attention monotonicity of the decoding end to the encoding end, and has a positive effect on improving the speech recognition accuracy, particularly the accuracy of stream type recognition.
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The embodiment of the application provides a voice recognition method, which is suitable for an end-to-end voice recognition model based on an attention mechanism.
Similar to the conventional Attention-based Encoder-Decoder model, the Attention-based end-to-end speech recognition model is also mainly composed of an Encoder and a Decoder. The encoder performs convolution and downsampling on an original high frame rate acoustic feature sequence according to the characteristics of voice acoustic features (such as MFCC (Mel frequency cepstrum coefficient) or Filterbank) to obtain a low frame rate feature sequence, and extracts hidden layer features, namely encoding features, which are friendly to voice recognition by means of self-attention mechanism focusing on context.
The decoder models the conditional probability p(y_t | h_e, y_<t) from the coding features h_e output by the encoder, using attention and autoregressive mechanisms. The decoder is optimized using the cross-entropy loss L.
Based on the end-to-end speech recognition model, an embodiment of the present application provides a speech recognition method, which is shown in fig. 1 and includes:
s101, obtaining coding characteristics obtained by coding the acoustic characteristics of the speech to be recognized by the coder.
The encoder is obtained by training a first recognition result of a voice sample and a text label of the voice sample, and the first recognition result of the voice sample is determined according to an encoding characteristic obtained by encoding an acoustic characteristic of the voice sample by the encoder and an attention coefficient of the recognition result of the voice sample to each frame of encoding characteristic output by the encoder.
Specifically, the acoustic feature of the speech to be recognized may be any type of acoustic feature obtained by extracting the acoustic feature of the speech to be recognized, for example, the acoustic feature may be of a type of MFCC, Filterbank, or the like.
The acoustic features of the speech to be recognized are input into the encoder of the end-to-end speech recognition model. The encoder performs convolutional down-sampling on the original high-frame-rate acoustic feature sequence to obtain a low-frame-rate feature sequence, then attends to the context with a self-attention mechanism and extracts hidden-layer features friendly to speech recognition, giving the coding features h_e. The coding features h_e are input to a decoder, which decodes them to obtain the decoding features h_d; by applying linear projection and softmax classification to h_d, the decoding result at the current moment can be obtained.
In a conventional end-to-end speech recognition model training scheme, the speech sample recognition result output by a decoder is compared with the text label of the speech sample, the cross entropy loss L is calculated, and the parameters of the encoder and the decoder are optimized by using the cross entropy loss L.
In addition to the above training modes, the embodiments of the present application also perform special training on the encoder in the end-to-end speech recognition model.
Referring to fig. 2, on the basis of a conventional end-to-end speech recognition model formed by an encoder and a decoder, the embodiment of the present application additionally adds a regularization module. The regularization module performs MLP (multi-layer perceptron) attention on the coding features h_e output by the hidden layer of the encoder and the decoding features h_d output by the hidden layer of the decoder to obtain attention coefficients, which represent the attention of the decoding features h_d to the coding features h_e. Since the decoding features h_d output by the decoder hidden layer are directly used to determine the decoding result, these attention coefficients can be taken as the attention coefficients of the decoding result of the speech over each frame of coding features of the speech.
The specific structure of the regularization module is shown in fig. 3. Each frame of coding features h_e of the speech and the decoding feature h_d output by the decoder at the i-th moment are each passed through a linear projection and then added; a tanh activation function is applied, followed by another linear projection, and finally the attention coefficient is computed through softmax:

α_{i,t} = softmax(W · tanh(u·h_e + v·h_d))

where α_{i,t} denotes the attention coefficient of the decoding result of the speech output at the i-th moment to the t-th frame of coding features of the speech, and W, u and v are linear projection layer parameters.
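For illustration only, the following is a minimal sketch of such an MLP attention computation, assuming PyTorch; the class name, dimension names and tensor shapes are illustrative assumptions rather than the actual implementation of the regularization module.

    import torch
    import torch.nn as nn

    class RegularizationAttention(nn.Module):
        # Computes alpha_{i,t} = softmax_t( W * tanh(u*h_e[t] + v*h_d[i]) ) as sketched above.
        def __init__(self, enc_dim, dec_dim, attn_dim):
            super().__init__()
            self.u = nn.Linear(enc_dim, attn_dim, bias=False)  # linear projection of each frame coding feature
            self.v = nn.Linear(dec_dim, attn_dim, bias=False)  # linear projection of the decoding feature at moment i
            self.w = nn.Linear(attn_dim, 1, bias=False)        # final linear projection to a scalar score per frame

        def forward(self, h_e, h_d):
            # h_e: (T, enc_dim) coding features of all T frames; h_d: (dec_dim,) decoding feature at moment i
            scores = self.w(torch.tanh(self.u(h_e) + self.v(h_d)))   # (T, 1)
            return torch.softmax(scores.squeeze(-1), dim=0)          # attention coefficients alpha_{i,t}, shape (T,)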
The specific processing contents of the regularizing module and the specific working contents of the encoder and the decoder can be described with reference to the corresponding contents in the following embodiments.
The regularization module is used for model training, and is particularly applied to training of a model encoder. In the training process, the acoustic characteristics of the voice sample are input into a model encoder, then the encoding characteristics of the voice sample output by the encoder are obtained, the decoding characteristics of the voice sample output by the decoder after the encoding characteristics output by the encoder are decoded are obtained, and MLP attention operation is carried out on the encoding characteristics and the decoding characteristics of the voice sample by utilizing a regularization module. Through the regularization module operation, the attention coefficient of the recognition result of the speech sample to each frame coding feature of the speech sample can be determined.
In the speech recognition process, the decoding end outputs the decoding result of the partial speech frame of the speech sample at each moment. Therefore, the above-mentioned attention coefficient for determining the recognition result of the speech sample to the coding feature of each frame of the speech sample is, specifically, the attention coefficient for determining the decoding result of the current time to the coding feature of each frame of the speech sample. The attention coefficient may be used to reflect the magnitude of the influence of each frame encoding characteristic of the speech sample on the decoding result of the speech sample at the current time. According to the above calculation, the attention coefficient of the decoding result of the speech sample at each time point to the encoding feature of each frame of the speech sample can be determined.
Then, a first recognition result of the speech sample is determined according to the coding features of the speech sample and the calculated attention coefficient. Specifically, after linear projection and softmax are respectively performed on each frame of coding features of a speech sample, a V-dimensional probability distribution (V is the size of a dictionary) is obtained. Then, the attention coefficient obtained through the calculation is used for weighting the probability distribution corresponding to each frame of coding feature, so that the probability distribution of each word in the dictionary is obtained, and the first recognition result is obtained.
Finally, the first recognition result is compared with the text label of the speech sample to calculate the cross-entropy loss L_E, and this cross-entropy loss L_E is then used to optimize the parameters of the encoder.
At the same time, linear projection and softmax classification are applied to the decoding features h_d of the speech sample output by the decoder to obtain a decoding result; this decoding result is compared with the text label of the speech sample to calculate the cross-entropy loss L_D. The cross-entropy loss L_D can also be used for parameter optimization of the encoder as well as of the decoder.
Details of the above encoder training process can also be found in the description of the training process of the attention-based end-to-end speech recognition model below.
It should be noted that, when the encoder is trained in the above manner, the loss function includes the attention coefficients of the speech recognition result over each frame of the speech coding features output by the encoder. Optimizing the encoder parameters on the basis of this loss function makes the encoder encode more accurately those acoustic feature frames that attention focuses on during decoding, so the recognition result obtained from the resulting coding features is more accurate.
That is, the above training mode combining the attention coefficient of the recognition result to the coding feature plays a role in regularization to the coding process of the encoder, so that the encoder can optimize the extraction of the coding feature concerned by the speech recognition result, and the coding feature concerned by the speech recognition result output by the encoder is more accurate, thereby being beneficial to improving the accuracy of the speech recognition result.
S102, determining a voice recognition result of the voice to be recognized according to the coding characteristics of the voice to be recognized.
For example, after the acoustic feature of the speech to be recognized is input into the encoder, the encoding feature obtained by encoding the acoustic feature of the speech to be recognized output by the encoder is obtained, then the encoding feature is decoded by the decoder to obtain the decoding feature, and after the linear projection and softmax classification are performed on the decoding feature, the decoding result at the current moment can be obtained. And sequentially splicing the decoding results at all the moments to obtain a complete voice recognition result of the voice to be recognized.
As can be seen from the above description, in the speech recognition method provided in the embodiment of the present application, the training process of the encoder incorporates the attention coefficients of the speech recognition result over each frame of coding features output by the encoder. This training mode regularizes the encoder and helps promote the monotonicity of the encoder's attention mechanism; in the coding features of the speech to be recognized output by an encoder trained in this way, the coding feature frames attended to by speech recognition are more accurate, so the coding features output by the encoder are more conducive to speech recognition, that is, the accuracy of the speech recognition result can be improved.
As an alternative embodiment, the above-mentioned encoder adopts a network structure based on the VGG-Transformer, shown in fig. 4; the specific functions and names of the various parts of this network structure can be found in prior-art introductions to the VGG and Transformer network structures. Based on this structure, the encoder performs convolutional down-sampling on the acoustic features of the speech to be recognized with the VGG network, then attends to the context using the self-attention mechanism of the Transformer network, and extracts hidden-layer features friendly to speech recognition, namely the coding features h_e.
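The following sketch, assuming PyTorch and illustrative layer sizes, shows the general shape of such a front-end: VGG-style convolutional down-sampling followed by self-attention layers. The standard PyTorch Transformer encoder layer is used here as a stand-in; it does not implement the relative-position multi-head attention described below.

    import torch
    import torch.nn as nn

    class VGGTransformerEncoder(nn.Module):
        def __init__(self, feat_dim=80, d_model=256, n_heads=4, n_layers=6):
            super().__init__()
            # Two conv blocks, each halving the time resolution (4x down-sampling overall).
            self.vgg = nn.Sequential(
                nn.Conv2d(1, 32, 3, stride=1, padding=1), nn.ReLU(),
                nn.Conv2d(32, 32, 3, stride=1, padding=1), nn.ReLU(),
                nn.MaxPool2d(2),
                nn.Conv2d(32, 64, 3, stride=1, padding=1), nn.ReLU(),
                nn.Conv2d(64, 64, 3, stride=1, padding=1), nn.ReLU(),
                nn.MaxPool2d(2),
            )
            self.proj = nn.Linear(64 * (feat_dim // 4), d_model)
            layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
            self.transformer = nn.TransformerEncoder(layer, n_layers)

        def forward(self, feats):
            # feats: (batch, T, feat_dim) high-frame-rate acoustic features (e.g. Filterbank)
            x = self.vgg(feats.unsqueeze(1))                 # (batch, 64, T/4, feat_dim/4)
            b, c, t, f = x.shape
            x = x.permute(0, 2, 1, 3).reshape(b, t, c * f)   # merge channel and frequency dimensions
            h_e = self.transformer(self.proj(x))             # (batch, T/4, d_model) coding features
            return h_e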
The Transformer is an attention-based model mechanism. Before the Transformer was proposed, RNNs were usually used for sequential encoding in order to capture long-distance dependencies; because each output of an RNN depends on the current input and the previous hidden state, it cannot be computed in parallel and model efficiency is low. The multi-head self-attention mechanism proposed in the Transformer allows the model to be computed in parallel.
Since self-attention carries no order information, changing the order of the input sequence leaves the result unchanged. To make the encoding result match the input order, position encoding (PE) can be added to self-attention: each unit of the input sequence is given an encoding related only to its position, so that when the units of the sequence change order the PE changes accordingly, giving self-attention position information and making the encoding result correspond exactly to the input order. The PE in the original Transformer is absolute position information based on sin and cos; when sample lengths in the test set are far longer than the lengths commonly seen in the training set, the resulting position codes have not been seen by the network, so the network cannot produce a robust result. Relative position encoding solves this problem well. Therefore, the Transformer network in the embodiment of the present application employs a multi-head self-attention mechanism with relative position encoding, i.e., the multi-head relative attention module in fig. 4.
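As a rough illustration of the idea of relative position information, the sketch below adds a learned bias that depends only on the (clipped) relative distance between frames to single-head self-attention scores. This is a simplified stand-in for the purpose of illustration, not the multi-head relative attention scheme actually used in the embodiment.

    import torch
    import torch.nn as nn

    class RelativeBiasSelfAttention(nn.Module):
        def __init__(self, d_model, max_dist=128):
            super().__init__()
            self.q = nn.Linear(d_model, d_model)
            self.k = nn.Linear(d_model, d_model)
            self.v = nn.Linear(d_model, d_model)
            self.rel_bias = nn.Embedding(2 * max_dist + 1, 1)  # one learned bias per clipped relative distance
            self.max_dist = max_dist
            self.scale = d_model ** -0.5

        def forward(self, x):
            # x: (T, d_model) sequence of frame features
            T = x.size(0)
            q, k, v = self.q(x), self.k(x), self.v(x)
            logits = q @ k.transpose(0, 1) * self.scale                        # (T, T) content scores
            pos = torch.arange(T)
            dist = (pos[None, :] - pos[:, None]).clamp(-self.max_dist, self.max_dist)
            logits = logits + self.rel_bias(dist + self.max_dist).squeeze(-1)  # add relative-position bias
            attn = torch.softmax(logits, dim=-1)
            return attn @ v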
It will be appreciated that the encoder described above is capable of implementing an encoding function based on the self-attention mechanism. In order to meet the requirements of a streaming recognition scene, in the embodiment of the present application, when the encoder performs self-attention, a segment is intercepted, and the self-attention mechanism only focuses on a preceding and following small segment of feature sequence, not on the entire speech sequence, so that the encoder can be used for streaming recognition.
Specifically, when the encoder encodes the acoustic feature of the speech to be recognized, specifically, the encoder performs encoding processing based on the attention mechanism on each frame of acoustic feature of the speech to be recognized, so as to obtain the encoding feature of the speech to be recognized.
When the coding processing based on the attention mechanism is carried out on any frame of acoustic features, the frame of acoustic features are coded according to an acoustic feature sequence with set length containing the frame of acoustic features, and coding features corresponding to the frame of acoustic features are obtained. That is, the attention is limited to a small acoustic feature sequence containing the frame acoustic features, and the frame acoustic features are encoded to obtain corresponding encoding features. The short acoustic feature sequence containing the frame acoustic features may be a feature sequence composed of the frame acoustic features and a short acoustic feature sequence before or after the frame acoustic features.
Because the coding attention is not in the whole voice sequence, the coding features for recognition can be obtained based on the acoustic feature sequence with limited length, and then the corresponding recognition result can be obtained, namely, the streaming recognition effect can be realized.
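A hedged sketch of this idea, assuming PyTorch and arbitrary window sizes: an attention mask restricts each frame to a fixed-length stretch of preceding and following frames instead of the entire utterance.

    import torch

    def windowed_attention_mask(num_frames, left=16, right=4):
        # Boolean mask: True marks positions a frame is NOT allowed to attend to.
        pos = torch.arange(num_frames)
        offset = pos[None, :] - pos[:, None]          # offset[t, s] = s - t
        allowed = (offset >= -left) & (offset <= right)
        return ~allowed                               # (num_frames, num_frames)

    # Example: pass the mask to a standard PyTorch Transformer encoder so that each frame
    # only attends to frames within the window.
    # encoder_out = transformer_encoder(features, mask=windowed_attention_mask(features.size(1)))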
The above network structure of the VGG-Transformer, the processing procedure for acquiring the coding features of the acoustic features based on the network structure in combination with the self-attention mechanism, and the specific content of the relative position code may refer to the descriptions about the network structure and the relative position code of the VGG-Transformer in the conventional art, and the detailed description of the embodiments of the present application is not repeated.
In addition, the above-mentioned encoder may also adopt other network structures, for example a VGG-Conformer network structure, a convolution + Conformer network structure, or a bidirectional LSTM network structure; the embodiment of the present application is not strictly limited in this respect, as long as the encoder function described in the above embodiment can be realized and the encoder training scheme described in the above embodiment can be applied.
As a preferred embodiment, in addition to using the regularization module during training to combine the recognition result of the speech sample with the attention coefficients over each frame of coding features output by the encoder, the speech recognition method provided in the embodiment of the present application also uses the regularization module to determine the speech recognition result of the speech to be recognized in actual speech recognition applications.
Referring to fig. 5, in the speech recognition method provided in the embodiment of the present application, the acoustic feature of the speech to be recognized is input into the encoder trained in the above manner, and after step S501 is executed and the coding feature obtained by coding the acoustic feature of the speech to be recognized by the encoder is obtained, the speech recognition result of the speech to be recognized is obtained by executing the following steps S502 to S504:
s502, decoding the coding characteristics of the voice to be recognized to obtain the decoding characteristics of the voice to be recognized.
Illustratively, the coding features obtained by coding the acoustic features of the speech to be recognized by the coder are input into the decoder, so that the decoder decodes the coding features, and then the feature vector output by the last hidden layer of the decoder is obtained, namely, the decoding result of the coding features, namely, the decoding features of the speech to be recognized.
The decoding feature obtained by decoding the encoding feature is, specifically, a decoding feature obtained by decoding the encoding feature and used for determining a decoding result at the current time. Specifically, the "decoding result at the current time" may be a complete recognition result of the whole speech to be recognized, or may be a recognition result of a partial speech frame of the speech to be recognized.
When the speech to be recognized is short, or in an offline speech recognition scenario, the complete recognition result of the speech to be recognized may be output at one time, and at this time, the decoding feature obtained by decoding the coding feature of the speech to be recognized is specifically a decoding feature used for determining the complete recognition result of the speech to be recognized.
When the speech to be recognized is long or in a streaming speech recognition scenario, the recognition result of the speech to be recognized is output in a streaming manner, and at this time, the decoding feature obtained by decoding the coding feature of the speech to be recognized is specifically the decoding feature used for determining the recognition result of the current speech frame to be recognized of the speech to be recognized.
The embodiment of the application takes a streaming output voice recognition result as an example, and introduces an implementation process of recognizing the voice to be recognized and determining a decoding result at the current moment. The decoding results at other times can also be obtained as described in the embodiments of the present application. Therefore, the decoding feature obtained by decoding the coding feature is specifically a decoding feature used for determining a decoding result at the current time, and the decoding result at the current time is a recognition result of a partial speech frame to be currently recognized of the speech to be recognized.
When the speech recognition result is output in a streaming manner, the decoding results output by the decoding end are output in sequence, and not all the recognition results of the speech to be recognized are output at one time.
For example, assume that the speech to be recognized has T frames in total (frame 0 to frame T-1), so that T frames of coding features are correspondingly obtained. When the decoder decodes these T frames of coding features, what it outputs at a given moment are the decoding features used to determine the recognition result of the current partial speech frames to be recognized, rather than decoding features for determining the complete recognition result of the speech to be recognized at that moment.
For example, at the 1 st moment, the decoder decodes the coding feature of the speech to be recognized to obtain a decoding feature, and based on the decoding feature, the recognition result of the 0 th frame of the speech to be recognized can be determined as the decoding result output by the decoding end at the 1 st moment; at the 2 nd moment, the decoder decodes the coding feature of the speech to be recognized to obtain a decoding feature, and based on the decoding feature, the recognition result of the 1 st frame of the speech to be recognized can be determined and used as the decoding result output by the decoding end at the 2 nd moment; by analogy, at the ith moment, the decoder decodes the coding feature of the speech to be recognized to obtain a decoding feature, and based on the decoding feature, the recognition result of the (i-1) th frame of the speech to be recognized can be determined and used as the decoding result output by the decoding end at the ith moment. And finally, sequentially splicing the decoding results output by the decoding end at each moment to obtain a complete recognition result of the speech to be recognized.
In the above example, the decoding characteristics output by the decoder at a certain time may also be the decoding characteristics used for determining the recognition results of a plurality of speech frames to be recognized of the speech to be recognized. In this way, the decoding characteristics output by the decoder each time are actually the decoding characteristics used for determining the recognition result of one or more speech frames to be recognized of the speech to be recognized at the current moment.
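Schematically, the streaming decoding described above can be pictured as the following loop; the functions decoder.step and decoder.classify and the <eos> symbol are hypothetical names used only for illustration.

    def streaming_decode(encoder_features, decoder, max_steps):
        results = []                 # recognition results emitted so far
        prev_decoder_outputs = []    # decoding features output before the current moment
        for step in range(max_steps):
            h_d = decoder.step(encoder_features, prev_decoder_outputs)  # decoding feature at this moment
            token = decoder.classify(h_d)    # linear projection + softmax -> result at this moment
            if token == "<eos>":             # assumed end-of-sentence symbol
                break
            results.append(token)
            prev_decoder_outputs.append(h_d)
        return "".join(results)              # splice the per-moment results into the full transcript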
As an exemplary implementation manner, the decoder described above in this embodiment of the present application adopts a Transformer network structure, whose structure can be seen in fig. 6. The decoder contains two attention parts. One is the self-attention at the decoder input, here multi-head relative attention with relative position encoding, the same as in the encoder. The other is the encoder-decoder attention, which does not require relative positions, so a standard multi-head attention structure is used.
Based on the above decoder network structure, when the decoding features of the speech to be recognized are obtained by the decoder, specifically, the coding features of each frame of the speech to be recognized output by the encoder are input into the multi-head attention module of the decoder, while the decoding features corresponding to the recognized result of the speech to be recognized are input into the multi-head relative attention module. The recognized result of the speech to be recognized refers to the recognition result of the speech to be recognized obtained before the current moment. In a speech recognition scenario with streaming output of recognition results, the recognition results of partial speech frames of the speech to be recognized are output at each moment in order from front to back; correspondingly, the decoder outputs at each moment the decoding features used to determine the recognition result of those partial speech frames. Therefore, the decoding features corresponding to the recognized result of the speech to be recognized are the decoding features output by the decoder before the current moment. These decoding features (outputs) are input into the multi-head relative attention module, so that the decoder decodes the per-frame coding features output by the encoder together with the decoding features it output before the current moment, obtaining the decoding features of the speech to be recognized at the current moment.
It can be understood that when the decoder determines the decoding characteristics at the current moment, the decoder refers to the coding characteristics of the speech to be recognized and the decoding characteristics output before the current moment, so that the decoding characteristics output by the decoder at the current moment simultaneously contain all information of the acoustic characteristics of the speech to be recognized and the characteristic information of the preorder recognized result, and therefore the decoding characteristic information at the current moment is richer, and more accurate decoding results can be recognized.
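A minimal sketch of this dual dependence, assuming PyTorch (relative position encoding, layer normalization and most residual connections omitted): the decoder layer first applies self-attention over the decoding features already produced, then cross-attention over the per-frame coding features output by the encoder.

    import torch
    import torch.nn as nn

    class SimpleDecoderLayer(nn.Module):
        def __init__(self, d_model=256, n_heads=4):
            super().__init__()
            self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                                     nn.Linear(4 * d_model, d_model))

        def forward(self, prev_dec, h_e):
            # prev_dec: (batch, i, d_model) decoding features output before the current moment
            # h_e:      (batch, T, d_model) coding features of each frame of the speech to be recognized
            x, _ = self.self_attn(prev_dec, prev_dec, prev_dec)   # attend to the recognized history
            x, _ = self.cross_attn(x, h_e, h_e)                   # attend to the encoder output
            h_d = x + self.ffn(x)
            return h_d[:, -1]                                     # decoding feature at the current moment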
S503, according to the coding features and the decoding features of the voice to be recognized, determining attention coefficients of the recognition results of the voice to be recognized to the coding features of the frames of the voice to be recognized.
Specifically, as can be seen from the above description, the decoding characteristics output by the decoder are directly used to determine the decoding result at the current time. Therefore, the attention coefficient of each frame of the coding feature of the speech to be recognized of the decoding feature output by the decoder at the current moment can represent the attention coefficient of the coding feature of each frame of the speech to be recognized of the decoding result of the speech to be recognized at the current moment.
In the embodiment of the present application, the regularization module of the end-to-end speech recognition model shown in fig. 2 is used to perform an MLP attention operation on the coding features h_e output by the encoder and the decoding features h_d output by the decoder, obtaining attention coefficients that represent the attention of the decoding features h_d to the coding features h_e, i.e. the attention coefficients of the decoding result of the speech to be recognized at the current moment over each frame of coding features of the speech to be recognized.
The specific structure of the regularization module is shown in fig. 3. Each frame of coding features h_e of the speech to be recognized and the decoding feature h_d output by the decoder at the i-th moment are each passed through a linear projection and then added, followed by a tanh activation function and another linear projection; finally the attention coefficient is computed through softmax:

α_{i,t} = softmax(W · tanh(u·h_e + v·h_d))

where α_{i,t} denotes the attention coefficient of the decoding result (i.e. the recognition result) of the speech to be recognized output at the i-th moment to the t-th frame of coding features of the speech to be recognized, and W, u and v are linear projection layer parameters.
Through the calculation, the attention coefficient of the coding feature of each frame of the speech to be recognized of the recognition result of the current moment of the speech to be recognized can be determined. The attention coefficient may be used to reflect the influence of each frame coding feature of the speech to be recognized on the recognition result of the speech to be recognized at the current time.
S504, determining a voice recognition result of the voice to be recognized according to the coding feature and the decoding feature of the voice to be recognized and the attention coefficient of the recognition result of the voice to be recognized to the coding feature of each frame of the voice to be recognized.
Specifically, as a simple implementation, linear projection and softmax classification are applied to the decoding feature h_d^i output by the decoder at the current (i-th) moment, giving the conditional probability distribution at the i-th moment:

p(y_i = j | x, y_<i) = exp(W_j · h_d^i) / Σ_{j'=1..V} exp(W_{j'} · h_d^i)

where x denotes the coding features of the acoustic features of the speech to be recognized; W_j and W_{j'} are linear projection layer parameters whose values are actually the same, the different symbols only distinguishing the parameter indexed over the dictionary dimension from the one at the current position; V denotes the dictionary dimension; and y_<i denotes the decoding results before moment i.

The conditional probability distribution at the i-th moment has the dictionary dimension and represents the probability that the recognition result at the current moment corresponds to each word in the dictionary, i.e. it is the recognition result of the speech to be recognized at the current moment.
Alternatively, linear projection and softmax are applied to each frame of coding features h_e output by the encoder, giving a V-dimensional probability distribution (V being the dictionary size), i.e. the decoding result corresponding to each frame of coding features. For example, after this operation on the t-th frame of coding features, the probability distribution obtained is:

p(y_t | h_e^t) = softmax(W_e · h_e^t)

where W_e denotes the linear projection layer parameters and h_e^t denotes the t-th frame of coding features of the speech to be recognized; where two different symbols appear for the coding features, their values are actually the same, the symbols only distinguishing the term under the dictionary dimension from the current term.
And then, taking the attention coefficient of each frame of coding feature of the speech to be recognized of the recognition result of the speech to be recognized as a weight, and weighting the decoding result corresponding to each frame of coding feature to obtain the recognition result of the speech to be recognized.
Referring to the above description, assume that the decoding result corresponding to the t-th frame of coding features is p(y_t | h_e^t) and that the attention coefficient of the recognition result of the speech to be recognized at the i-th moment to the t-th frame of coding features is α_{i,t}. The decoding results corresponding to each frame of coding features can then be weighted according to the following formula to obtain the recognition result p_i of the speech to be recognized at the i-th moment:

p_i = Σ_t α_{i,t} · p(y_t | h_e^t)

The dimension of the recognition result p_i is the dictionary dimension, i.e. the V dimension.
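A small sketch of this weighting, assuming PyTorch tensors; the argument names are illustrative.

    import torch

    def first_decoding_result(h_e, W_e, alpha_i):
        # h_e:     (T, enc_dim)  per-frame coding features of the speech to be recognized
        # W_e:     (V, enc_dim)  linear projection onto the dictionary (V = dictionary size)
        # alpha_i: (T,)          attention coefficients of the step-i recognition result over the T frames
        frame_probs = torch.softmax(h_e @ W_e.T, dim=-1)    # p(y_t | h_e^t), shape (T, V)
        p_i = (alpha_i.unsqueeze(-1) * frame_probs).sum(0)  # weighted sum over frames, shape (V,)
        return p_i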
Both of the above modes acquire the recognition result of the speech to be recognized through a single route; they are simple to implement, but the accuracy of the recognition result is not high enough.
In order to further improve the accuracy of the recognition result, the embodiment of the application provides that the recognition results of the two approaches are combined to determine the recognition result of the speech to be recognized. The method can be realized by executing the following steps A1-A3:
a1, determining a first decoding result of the speech to be recognized according to the frame coding features of the speech to be recognized and the attention coefficient of the recognition result of the speech to be recognized to the frame coding features of the speech to be recognized.
Specifically, the feature h is encoded in each frame output by the encodereAnd respectively carrying out linear projection and softmax to obtain a decoding result corresponding to the coding features of each frame. Then, taking the attention coefficient of each frame of coding feature of the speech to be recognized as a weight according to the recognition result of the speech to be recognized, weighting the decoding result corresponding to each frame of coding feature to obtain the recognition result of the speech to be recognized, and naming the recognition result as the first decoding result of the speech to be recognized for convenience of distinguishing. The specific process of the above treatment can be described with reference to the above embodiments.
A2, determining a second decoding result of the speech to be recognized according to the decoding characteristics of the speech to be recognized.
Specifically, linear projection and softmax classification are applied to the decoding features h_d^i output by the decoder to obtain the conditional probability distribution, i.e. the recognition result of the speech to be recognized at the i-th moment; for convenience of distinction it is named the second decoding result of the speech to be recognized. The specific processing can be found in the description of the corresponding contents of the above embodiments.
A3, determining a voice recognition result of the voice to be recognized according to the first decoding result and the second decoding result.
In general, neither the first decoding result nor the second decoding result has only one decoding path, but includes a plurality of decoding paths. For example, the first decoding result has 5 decoding paths, the second decoding result also has 5 decoding paths, and the decoding paths in the first decoding result and the decoding paths in the second decoding result have a corresponding relationship. In each decoding path, a certain recognition character is the probability distribution of each word in the dictionary, and the score of the recognition character can be represented.
On the basis, the embodiment of the application performs weighted summation on the scores of the corresponding decoding paths in the first decoding result and the second decoding result, and then selects one or more decoding paths with the highest score from the scores of the decoding paths after weighted summation to serve as the finally determined decoding result of the speech to be recognized.
The weight given to the scores of the decoding paths in the first decoding result is between 0.1 and 0.3, and correspondingly the weight given to the scores of the decoding paths in the second decoding result is between 0.7 and 0.9; that is, the result output by the decoder is given the larger weight.
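A hedged sketch of this score fusion, using illustrative weights of 0.2 and 0.8 inside the ranges given above; path scores are assumed to be aligned by index between the two results.

    def fuse_path_scores(first_scores, second_scores, w_first=0.2, w_second=0.8):
        # first_scores / second_scores: per-path scores of the first and second decoding results
        fused = [w_first * s1 + w_second * s2
                 for s1, s2 in zip(first_scores, second_scores)]
        best_path = max(range(len(fused)), key=lambda k: fused[k])  # path with the highest fused score
        return best_path, fused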
Through the introduction, the voice recognition method provided by the embodiment of the application not only regularizes the model training, but also ensures the monotonicity of the model attention mechanism. Meanwhile, in the application of voice recognition, the recognition result of the encoding end and the recognition result of the decoding end are fused, the regularization effect is also realized on the recognition process, and the voice recognition efficiency and accuracy can be further improved.
In the foregoing embodiments, it has been described that the speech recognition method provided in the embodiments of the present application is applied to an end-to-end speech recognition model based on an attention mechanism. The speech recognition model includes an encoder and a decoder, and may further include a regularization module, as described in the above embodiments.
Therefore, in actual execution, the speech recognition method provided in the embodiment of the present application may specifically be to input the acoustic feature of the speech to be recognized into a pre-trained speech recognition model, so that the speech recognition model obtains the coding feature obtained by processing the acoustic feature of the speech to be recognized by its encoder, and determine the speech recognition result of the speech to be recognized according to the coding feature of the speech to be recognized.
In a preferred embodiment, the speech recognition model is obtained by training as shown in the following steps B1-B5:
and B1, inputting the acoustic characteristics of the voice sample into an encoder to obtain the encoding characteristics output by the encoder, and decoding the encoding characteristics by a decoder to obtain the decoding characteristics.
And B2, according to the coding features and the decoding features, determining attention coefficients of the decoding results of the voice samples to the coding features of the frames of the voice samples.
B3, determining a first recognition result of the voice sample according to the coding feature of the voice sample and the attention coefficient, and determining a first loss function by comparing the first recognition result with the label of the voice sample.
B4, according to the decoding characteristics of the voice sample, determining a second recognition result of the voice sample, and comparing the second recognition result with the label of the voice sample to determine a second loss function.
Specifically, the processes of obtaining the coding features and decoding features, calculating the attention coefficients, and calculating the first recognition result, the second recognition result, the first loss function and the second loss function may all be carried out with reference to the description of the foregoing embodiments, for example the training process for the encoder described above, and are not detailed again here.
Specifically, after obtaining the first loss function and the second loss function, the embodiment of the present application trains the speech recognition model by performing the following step B5:
B5, optimizing the parameters of the encoder and the decoder by using the first loss function and the second loss function.
Specifically, assume that the first loss function is L_E and the second loss function is L_D. The first loss function L_E and the second loss function L_D are weighted and summed to obtain a joint loss function L:
L = λL_E + (1 - λ)L_D
where the value of λ is about 0.3.
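For illustration only, the joint loss function may be computed as sketched below, assuming both recognition results are given as per-position scores over the dictionary and the labels are token indices; the tensor shapes and names are assumptions for the example.

```python
# Sketch of the joint loss L = λL_E + (1 - λ)L_D with λ ≈ 0.3.
import torch.nn.functional as F

def joint_loss(first_logits, second_logits, labels, lam=0.3):
    """first_logits:  encoder-side scores of the first recognition result, (T_out, vocab)
    second_logits: decoder-side scores of the second recognition result, (T_out, vocab)
    labels:        gold token indices of the text label, (T_out,)"""
    loss_e = F.cross_entropy(first_logits, labels)   # first loss function L_E
    loss_d = F.cross_entropy(second_logits, labels)  # second loss function L_D
    return lam * loss_e + (1.0 - lam) * loss_d, loss_d
```

The second return value L_D is kept separately because, as described below, the decoder parameters may be optimized with the second loss function alone.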
Then, the encoder of the speech recognition model is trained using the joint loss function L. As can be understood from the above description, the joint loss function L contains the attention coefficient information of the recognition result of the speech sample with respect to each frame of coding features of the speech sample. Optimizing the encoder parameters with the joint loss function therefore has a regularization effect on the encoder, promotes the monotonicity of the encoder's attention mechanism, and improves the encoding accuracy of the coding-feature frames attended to by the recognition result, so that a more accurate recognition result can be obtained from the coding features output by the encoder.
Meanwhile, the parameters of the decoder are optimized using the second loss function L_D alone, or using both the second loss function L_D and the joint loss function L described above.
In particular, optimizing the decoder parameters with only the second loss function L_D corresponds to the conventional model training scheme.
In the embodiment of the present application, the second loss function and the joint loss function may also be combined to optimize the decoder parameters, with the two loss functions applied sequentially. For example, the decoder may first be optimized with the second loss function and then optimized with the joint loss function.
Because the joint loss function contains the attention coefficient information of the recognition result over each frame of coding features, using it to optimize the decoder parameters helps, to a certain extent, to focus the decoder's attention on the correct coding features during decoding, so that a more accurate decoding result can be obtained.
On the other hand, since the joint loss function includes the first loss function, applying it may also affect the direction in which the second loss function optimizes the decoder parameters; the two may interfere with each other, so that a better parameter optimization effect cannot be obtained.
In view of the above, in the actual training process, whether the decoder parameters are optimized with the second loss function alone, or with both the second loss function and the joint loss function, can be decided according to the actual effect on the speech recognition result.
In correspondence with the above-mentioned speech recognition method, the embodiment of the present application further provides a speech recognition apparatus, as shown in fig. 7, the apparatus includes:
the encoding unit 100 is configured to acquire an encoding feature obtained by encoding an acoustic feature of a speech to be recognized by an encoder;
the encoder is obtained according to a first recognition result of a voice sample and text label training of the voice sample, and the first recognition result of the voice sample is determined according to an encoding characteristic obtained by encoding the acoustic characteristic of the voice sample by the encoder and an attention coefficient of the recognition result of the voice sample to each frame of encoding characteristic output by the encoder;
the decoding unit 110 is configured to determine a speech recognition result of the speech to be recognized according to the coding feature of the speech to be recognized.
As an optional implementation manner, determining a speech recognition result of the speech to be recognized according to the coding feature of the speech to be recognized includes:
decoding the coding features of the voice to be recognized to obtain the decoding features of the voice to be recognized;
determining attention coefficients of the recognition result of the voice to be recognized to the coding features of each frame of the voice to be recognized according to the coding features and the decoding features of the voice to be recognized;
and determining a voice recognition result of the voice to be recognized according to the coding feature and the decoding feature of the voice to be recognized and the attention coefficient of the recognition result of the voice to be recognized to the coding feature of each frame of the voice to be recognized.
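A minimal sketch of the attention-coefficient computation in the steps above is given below, assuming single-head scaled dot-product attention between decoding features and coding features; the attention form and tensor shapes are assumptions for the example.

```python
# Sketch: attention coefficients of the recognition result over each frame of
# coding features, computed from the decoding features and the coding features.
import torch.nn.functional as F

def attention_coefficients(decoding_features, coding_features):
    """decoding_features: (T_out, d); coding_features: (num_frames, d).
    Returns (T_out, num_frames): one attention distribution per recognized position."""
    d = coding_features.shape[-1]
    scores = decoding_features @ coding_features.transpose(0, 1)  # (T_out, num_frames)
    return F.softmax(scores / d ** 0.5, dim=-1)
```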
As an optional implementation manner, the encoder encodes the acoustic features of the speech to be recognized to obtain the coding features, including:
coding each frame of acoustic features of the speech to be recognized based on an attention mechanism respectively, so as to obtain the coding features of the speech to be recognized;
the method for coding any frame of acoustic features based on the self-attention mechanism comprises the following steps: and coding the frame acoustic features according to the acoustic feature sequence with the set length containing the frame acoustic features to obtain coding features corresponding to the frame acoustic features.
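As an illustration of the windowed encoding just described, a single-head scaled dot-product sketch follows; treating the window as the fixed-length segment ending at the current frame, as well as the projection matrices and window length, are assumptions for the example.

```python
# Sketch: encode one frame of acoustic features from an acoustic-feature
# sequence of set length that contains that frame.
import torch.nn.functional as F

def encode_frame(acoustic_features, t, w_q, w_k, w_v, window_len=16):
    """acoustic_features: (num_frames, d); w_q/w_k/w_v: (d, d) projection matrices.
    Returns the (1, d) coding feature corresponding to frame t."""
    d = acoustic_features.shape[-1]
    start = max(0, t - window_len + 1)
    segment = acoustic_features[start:t + 1]            # window containing frame t
    q = acoustic_features[t:t + 1] @ w_q                # query from frame t, (1, d)
    k, v = segment @ w_k, segment @ w_v                 # keys / values, (w, d)
    attn = F.softmax(q @ k.transpose(0, 1) / d ** 0.5, dim=-1)   # (1, w)
    return attn @ v
```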
As an optional implementation manner, obtaining the decoding feature of the speech to be recognized by decoding the coding feature of the speech to be recognized includes:
decoding the coding features of each frame of the voice to be recognized and the decoding features corresponding to the recognized result of the voice to be recognized to obtain the decoding features of the voice to be recognized;
and the recognized result of the speech to be recognized is the recognition result of the speech to be recognized obtained before the current moment.
As an optional implementation manner, determining a speech recognition result for the speech to be recognized according to the coding feature and the decoding feature of the speech to be recognized and the attention coefficient of the recognition result of the speech to be recognized to the coding feature of each frame of the speech to be recognized includes:
determining a first decoding result of the speech to be recognized according to the coding features of each frame of the speech to be recognized and the attention coefficient of the recognition result of the speech to be recognized to the coding features of each frame of the speech to be recognized;
determining a second decoding result of the voice to be recognized according to the decoding characteristics of the voice to be recognized;
and determining a voice recognition result of the voice to be recognized according to the first decoding result and the second decoding result.
As an optional implementation manner, determining a first decoding result of the speech to be recognized according to the coding features of each frame of the speech to be recognized and the attention coefficient of the decoding result of the speech to be recognized to the coding features of each frame of the speech to be recognized includes:
respectively decoding each frame coding feature of the speech to be recognized to obtain a decoding result corresponding to each frame coding feature;
and weighting the decoding result corresponding to each frame of coding feature by taking the attention coefficient of the recognition result of the speech to be recognized to each frame of coding feature of the speech to be recognized as a weight to obtain a first decoding result of the speech to be recognized.
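A minimal sketch of this first decoding result is given below; decoding each frame of coding features with a shared linear classifier, and weighting probabilities rather than raw scores, are assumptions for the example.

```python
# Sketch: decode each frame of coding features separately, then weight the
# per-frame decoding results by the attention coefficients of the recognition
# result to obtain the first decoding result.
import torch.nn.functional as F

def first_decoding_result(coding_features, attention_coeffs, classifier_weight):
    """coding_features:   (num_frames, d)
    attention_coeffs:  (T_out, num_frames), attention of the recognition result
                       to each frame of coding features
    classifier_weight: (d, vocab_size), assumed shared output projection."""
    per_frame_probs = F.softmax(coding_features @ classifier_weight, dim=-1)  # (num_frames, vocab)
    return attention_coeffs @ per_frame_probs                                  # (T_out, vocab)
```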
As an optional implementation manner, the first decoding result and the second decoding result respectively include a plurality of decoding paths;
determining a voice recognition result of the voice to be recognized according to the first decoding result and the second decoding result, including:
and performing weighted summation on the scores of the decoding paths of the first decoding result and the second decoding result, and determining a voice recognition result of the voice to be recognized from the decoding paths after weighted summation of the scores.
As an optional implementation, the training process of the encoder includes:
acquiring coding characteristics obtained by coding the acoustic characteristics of the voice sample by the coder, and acquiring decoding characteristics obtained by decoding the coding characteristics of the voice sample;
determining attention coefficients of the recognition results of the voice samples to the coding features of all frames of the voice samples according to the coding features and the decoding features of the voice samples;
determining a first recognition result of the voice sample according to the coding features of the voice sample and the attention coefficient;
and optimizing the parameters of the encoder by comparing the first recognition result with the label of the voice sample.
As an optional implementation manner, acquiring coding features obtained by processing acoustic features of a speech to be recognized by an encoder, and determining a speech recognition result of the speech to be recognized according to the coding features of the speech to be recognized includes:
inputting the acoustic features of the voice to be recognized into a pre-trained voice recognition model, enabling the voice recognition model to obtain coding features obtained by processing the acoustic features of the voice to be recognized by a coder of the voice recognition model, and determining a voice recognition result of the voice to be recognized according to the coding features of the voice to be recognized.
As an optional implementation, the speech recognition model includes an encoder and a decoder, and the training process for the speech recognition model includes:
inputting the acoustic characteristics of a voice sample into an encoder to obtain the encoding characteristics output by the encoder, and decoding the encoding characteristics by using a decoder to obtain decoding characteristics;
determining attention coefficients of the decoding results of the voice samples to the coding features of the frames of the voice samples according to the coding features and the decoding features;
determining a first recognition result of the voice sample according to the coding features of the voice sample and the attention coefficient, and determining a first loss function by comparing the first recognition result with a label of the voice sample;
determining a second recognition result of the voice sample according to the decoding characteristics of the voice sample, and determining a second loss function by comparing the second recognition result with the label of the voice sample;
optimizing parameters of the encoder and the decoder using the first loss function and the second loss function.
As an optional implementation, optimizing the parameters of the encoder and the decoder by using the first loss function and the second loss function includes:
carrying out weighted summation on the first loss function and the second loss function to obtain a combined loss function;
optimizing parameters of the encoder by using the joint loss function;
and,
optimizing parameters of the decoder using the second loss function, or using the second loss function and the joint loss function.
Specifically, the detailed working contents of each part of the voice recognition apparatus are referred to the corresponding contents in each embodiment of the voice recognition method, and are not repeated here.
Another embodiment of the present application further provides a speech recognition apparatus, as shown in fig. 8, the apparatus including:
a memory 200 and a processor 210;
wherein, the memory 200 is connected to the processor 210 for storing programs;
the processor 210 is configured to implement the speech recognition method disclosed in any of the above embodiments by running the program stored in the memory 200.
Specifically, the voice recognition device may further include: a bus, a communication interface 220, an input device 230, and an output device 240.
The processor 210, the memory 200, the communication interface 220, the input device 230, and the output device 240 are connected to each other through a bus. Wherein:
a bus may include a path that transfers information between components of a computer system.
The processor 210 may be a general-purpose processor, such as a general-purpose Central Processing Unit (CPU) or a microprocessor, or may be an Application-Specific Integrated Circuit (ASIC) or one or more integrated circuits for controlling the execution of the programs of the present invention. It may also be a Digital Signal Processor (DSP), a Field-Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, or discrete hardware components.
The processor 210 may include a main processor and may also include a baseband chip, modem, and the like.
The memory 200 stores programs for executing the technical solution of the present invention, and may also store an operating system and other key services. In particular, the program may include program code, which includes computer operating instructions. More specifically, the memory 200 may include a read-only memory (ROM), other types of static storage devices that may store static information and instructions, a random access memory (RAM), other types of dynamic storage devices that may store information and instructions, a magnetic disk storage, a flash memory, and so forth.
The input device 230 may include a means for receiving data and information input by a user, such as a keyboard, mouse, camera, scanner, light pen, voice input device, touch screen, pedometer, or gravity sensor, among others.
Output device 240 may include equipment that allows output of information to a user, such as a display screen, a printer, speakers, and the like.
Communication interface 220 may include any device, such as a transceiver, used to communicate with other devices or communication networks, such as an Ethernet, a Radio Access Network (RAN), a Wireless Local Area Network (WLAN), etc.
The processor 210 executes the programs stored in the memory 200 and invokes other devices, which can be used to implement the steps of the speech recognition method provided by the above-described embodiments of the present application.
Another embodiment of the present application further provides a storage medium, where a computer program is stored on the storage medium, and when the computer program is executed by a processor, the computer program implements the steps of the speech recognition method provided in the foregoing embodiment of the present application.
Specifically, the specific working contents of each part of the voice recognition device and the specific processing contents of the computer program on the storage medium when being executed by the processor can refer to the contents of each embodiment of the voice recognition method, and are not described herein again.
While, for purposes of simplicity of explanation, the foregoing method embodiments have been described as a series of acts or combination of acts, it will be appreciated by those skilled in the art that the present application is not limited by the order of acts or acts described, as some steps may occur in other orders or concurrently with other steps in accordance with the application. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required in this application.
It should be noted that, in the present specification, the embodiments are all described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments may be referred to each other. For the device-like embodiment, since it is basically similar to the method embodiment, the description is simple, and for the relevant points, reference may be made to the partial description of the method embodiment.
The steps in the method of each embodiment of the present application may be sequentially adjusted, combined, and deleted according to actual needs, and technical features described in each embodiment may be replaced or combined.
The modules and sub-modules in the device and the terminal in the embodiments of the application can be combined, divided and deleted according to actual needs.
In the several embodiments provided in the present application, it should be understood that the disclosed terminal, apparatus and method may be implemented in other manners. For example, the above-described terminal embodiments are merely illustrative, and for example, the division of a module or a sub-module is only one logical division, and there may be other divisions when the terminal is actually implemented, for example, a plurality of sub-modules or modules may be combined or integrated into another module, or some features may be omitted or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or modules, and may be in an electrical, mechanical or other form.
The modules or sub-modules described as separate parts may or may not be physically separate, and parts that are modules or sub-modules may or may not be physical modules or sub-modules, may be located in one place, or may be distributed over a plurality of network modules or sub-modules. Some or all of the modules or sub-modules can be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
In addition, each functional module or sub-module in the embodiments of the present application may be integrated into one processing module, or each module or sub-module may exist alone physically, or two or more modules or sub-modules may be integrated into one module. The integrated modules or sub-modules may be implemented in the form of hardware, or may be implemented in the form of software functional modules or sub-modules.
Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software unit executed by a processor, or in a combination of the two. The software unit may reside in random access memory (RAM), memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (14)

1. A speech recognition method, comprising:
acquiring coding characteristics obtained by coding acoustic characteristics of a voice to be recognized by a coder;
the encoder is obtained according to a first recognition result of a voice sample and text label training of the voice sample, and the first recognition result of the voice sample is determined according to an encoding characteristic obtained by encoding the acoustic characteristic of the voice sample by the encoder and an attention coefficient of the recognition result of the voice sample to each frame of encoding characteristic output by the encoder;
and determining a voice recognition result of the voice to be recognized according to the coding characteristics of the voice to be recognized.
2. The method of claim 1, wherein determining the speech recognition result of the speech to be recognized according to the coding features of the speech to be recognized comprises:
decoding the coding features of the voice to be recognized to obtain the decoding features of the voice to be recognized;
determining attention coefficients of the recognition result of the voice to be recognized to the coding features of each frame of the voice to be recognized according to the coding features and the decoding features of the voice to be recognized;
and determining a voice recognition result of the voice to be recognized according to the coding feature and the decoding feature of the voice to be recognized and the attention coefficient of the recognition result of the voice to be recognized to the coding feature of each frame of the voice to be recognized.
3. The method of claim 1 or 2, wherein the encoder encodes the acoustic features of the speech to be recognized to obtain the encoded features, and comprises:
coding each frame of acoustic features of the speech to be recognized based on an attention mechanism respectively, so as to obtain the coding features of the speech to be recognized;
the method for coding any frame of acoustic features based on the self-attention mechanism comprises the following steps: and coding the frame acoustic features according to the acoustic feature sequence with the set length containing the frame acoustic features to obtain coding features corresponding to the frame acoustic features.
4. The method according to claim 2, wherein obtaining the decoded features of the speech to be recognized by decoding the encoded features of the speech to be recognized comprises:
decoding the coding features of each frame of the voice to be recognized and the decoding features corresponding to the recognized result of the voice to be recognized to obtain the decoding features of the voice to be recognized;
and the recognized result of the voice to be recognized is the recognized result of the voice to be recognized, which is obtained before the current moment.
5. The method of claim 2, wherein determining the speech recognition result for the speech to be recognized according to the coding feature and the decoding feature of the speech to be recognized and the attention coefficient of the recognition result for each frame of coding feature of the speech to be recognized comprises:
determining a first decoding result of the speech to be recognized according to the coding features of each frame of the speech to be recognized and the attention coefficient of the recognition result of the speech to be recognized to the coding features of each frame of the speech to be recognized;
determining a second decoding result of the voice to be recognized according to the decoding characteristics of the voice to be recognized;
and determining a voice recognition result of the voice to be recognized according to the first decoding result and the second decoding result.
6. The method according to claim 5, wherein determining the first decoding result of the speech to be recognized according to the coding features of each frame of the speech to be recognized and the attention coefficient of the decoding result of the speech to be recognized to the coding features of each frame of the speech to be recognized comprises:
respectively decoding each frame coding feature of the speech to be recognized to obtain a decoding result corresponding to each frame coding feature;
and weighting the decoding result corresponding to each frame of coding feature by taking the attention coefficient of the recognition result of the speech to be recognized to each frame of coding feature of the speech to be recognized as a weight to obtain a first decoding result of the speech to be recognized.
7. The method of claim 5, wherein the first decoding result and the second decoding result respectively comprise a plurality of decoding paths;
determining a voice recognition result of the voice to be recognized according to the first decoding result and the second decoding result, including:
and performing weighted summation on the scores of the decoding paths of the first decoding result and the second decoding result, and determining a voice recognition result of the voice to be recognized from the decoding paths after weighted summation of the scores.
8. The method of claim 1, wherein the training process of the encoder comprises:
acquiring coding characteristics obtained by coding the acoustic characteristics of the voice sample by the coder, and acquiring decoding characteristics obtained by decoding the coding characteristics of the voice sample;
determining attention coefficients of the recognition results of the voice samples to the coding features of all frames of the voice samples according to the coding features and the decoding features of the voice samples;
determining a first recognition result of the voice sample according to the coding features of the voice sample and the attention coefficient;
and optimizing the parameters of the encoder by comparing the first recognition result with the label of the voice sample.
9. The method of claim 1, wherein obtaining coding features obtained by processing acoustic features of a speech to be recognized by an encoder, and determining a speech recognition result of the speech to be recognized according to the coding features of the speech to be recognized comprises:
inputting the acoustic features of the voice to be recognized into a pre-trained voice recognition model, enabling the voice recognition model to obtain coding features obtained by processing the acoustic features of the voice to be recognized by a coder of the voice recognition model, and determining a voice recognition result of the voice to be recognized according to the coding features of the voice to be recognized.
10. The method of claim 9, wherein the speech recognition model comprises an encoder and a decoder, and wherein the training process for the speech recognition model comprises:
inputting the acoustic characteristics of a voice sample into an encoder to obtain the encoding characteristics output by the encoder, and decoding the encoding characteristics by using a decoder to obtain decoding characteristics;
determining attention coefficients of the decoding results of the voice samples to the coding features of the frames of the voice samples according to the coding features and the decoding features;
determining a first recognition result of the voice sample according to the coding features of the voice sample and the attention coefficient, and determining a first loss function by comparing the first recognition result with a label of the voice sample;
determining a second recognition result of the voice sample according to the decoding characteristics of the voice sample, and determining a second loss function by comparing the second recognition result with the label of the voice sample;
optimizing parameters of the encoder and the decoder using the first loss function and the second loss function.
11. The method of claim 10, wherein optimizing parameters of the encoder and the decoder using the first loss function and the second loss function comprises:
carrying out weighted summation on the first loss function and the second loss function to obtain a combined loss function;
optimizing parameters of the encoder by using the joint loss function;
and,
optimizing parameters of the decoder using the second loss function, or using the second loss function and the joint loss function.
12. A speech recognition apparatus, comprising:
the encoding unit is used for acquiring encoding characteristics obtained by encoding the acoustic characteristics of the speech to be recognized by the encoder;
the encoder is obtained according to a first recognition result of a voice sample and text label training of the voice sample, and the first recognition result of the voice sample is determined according to an encoding characteristic obtained by encoding the acoustic characteristic of the voice sample by the encoder and an attention coefficient of the recognition result of the voice sample to each frame of encoding characteristic output by the encoder;
and the decoding unit is used for determining a voice recognition result of the voice to be recognized according to the coding characteristics of the voice to be recognized.
13. A speech recognition device, comprising:
a memory and a processor;
wherein the memory is connected with the processor and used for storing programs;
the processor is configured to implement the speech recognition method according to any one of claims 1 to 11 by executing the program in the memory.
14. A storage medium, characterized in that the storage medium has stored thereon a computer program which, when executed by a processor, implements a speech recognition method according to any one of claims 1 to 11.
CN202111432141.7A 2021-11-29 2021-11-29 Voice recognition method, device, equipment and storage medium Pending CN114155834A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111432141.7A CN114155834A (en) 2021-11-29 2021-11-29 Voice recognition method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111432141.7A CN114155834A (en) 2021-11-29 2021-11-29 Voice recognition method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114155834A true CN114155834A (en) 2022-03-08

Family

ID=80784288

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111432141.7A Pending CN114155834A (en) 2021-11-29 2021-11-29 Voice recognition method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114155834A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114743554A (en) * 2022-06-09 2022-07-12 武汉工商学院 Intelligent household interaction method and device based on Internet of things

Similar Documents

Publication Publication Date Title
CN110956959B (en) Speech recognition error correction method, related device and readable storage medium
CN109785824B (en) Training method and device of voice translation model
CN111128137B (en) Training method and device for acoustic model, computer equipment and storage medium
CN114787914A (en) System and method for streaming end-to-end speech recognition with asynchronous decoder
CN112509555B (en) Dialect voice recognition method, device, medium and electronic equipment
CN112712813B (en) Voice processing method, device, equipment and storage medium
CN112528637B (en) Text processing model training method, device, computer equipment and storage medium
JP2020086436A (en) Decoding method in artificial neural network, speech recognition device, and speech recognition system
CN113283244A (en) Pre-training model-based bidding data named entity identification method
CN112837669B (en) Speech synthesis method, device and server
CN111401259B (en) Model training method, system, computer readable medium and electronic device
CN116884391B (en) Multimode fusion audio generation method and device based on diffusion model
CN112802444B (en) Speech synthesis method, device, equipment and storage medium
CN113886643A (en) Digital human video generation method and device, electronic equipment and storage medium
CN107463928A (en) Word sequence error correction algorithm, system and its equipment based on OCR and two-way LSTM
CN114694255B (en) Sentence-level lip language recognition method based on channel attention and time convolution network
CN109979461B (en) Voice translation method and device
CN114155834A (en) Voice recognition method, device, equipment and storage medium
CN114678032A (en) Training method, voice conversion method and device and electronic equipment
CN114999443A (en) Voice generation method and device, storage medium and electronic equipment
CN112908293B (en) Method and device for correcting pronunciations of polyphones based on semantic attention mechanism
CN113362804A (en) Method, device, terminal and storage medium for synthesizing voice
CN112786003A (en) Speech synthesis model training method and device, terminal equipment and storage medium
WO2023226239A1 (en) Object emotion analysis method and apparatus and electronic device
CN112466282B (en) Speech recognition system and method oriented to aerospace professional field

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right
TA01 Transfer of patent application right

Effective date of registration: 20230506

Address after: 230026 Jinzhai Road, Baohe District, Hefei, Anhui Province, No. 96

Applicant after: University of Science and Technology of China

Applicant after: IFLYTEK Co.,Ltd.

Address before: NO.666, Wangjiang West Road, hi tech Zone, Hefei City, Anhui Province

Applicant before: IFLYTEK Co.,Ltd.