CN113362813A - Voice recognition method and device and electronic equipment - Google Patents

Voice recognition method and device and electronic equipment

Info

Publication number
CN113362813A
Authority
CN
China
Prior art keywords
output
encoder
candidate
model
recognition result
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110745581.1A
Other languages
Chinese (zh)
Inventor
王智超
杨文文
周盼
陈伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Sogou Technology Development Co Ltd
Original Assignee
Beijing Sogou Technology Development Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Sogou Technology Development Co Ltd filed Critical Beijing Sogou Technology Development Co Ltd
Priority to CN202110745581.1A priority Critical patent/CN113362813A/en
Publication of CN113362813A publication Critical patent/CN113362813A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L15/08 Speech classification or search
    • G10L15/16 Speech classification or search using artificial neural networks
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L15/28 Constructional details of speech recognition systems
    • G10L2015/0631 Creating reference templates; Clustering
    • G10L2015/0633 Creating reference templates; Clustering using lexical or orthographic knowledge sources

Abstract

The embodiments of the present application provide a speech recognition method, a speech recognition apparatus, and an electronic device. The method includes: acquiring speech data to be recognized; inputting acoustic features of the speech data to be recognized into a speech recognition model for processing, the speech recognition model including an encoder having an output layer trained based on the Connectionist Temporal Classification (CTC) criterion; extracting hidden-layer features of the acoustic features through the encoder, and decoding the hidden-layer features of the acoustic features through the output layer of the encoder in a non-autoregressive decoding manner; and determining a speech recognition result according to the output result of the output layer of the encoder. Because the output layer trained based on the CTC criterion decodes the hidden-layer features of the acoustic features in a non-autoregressive manner, the embodiments of the present application can greatly improve the decoding speed of the speech recognition model.

Description

Voice recognition method and device and electronic equipment
Technical Field
The present application relates to the field of artificial intelligence technology, and in particular, to a speech recognition method, a speech recognition apparatus, and an electronic device.
Background
Speech recognition refers to converting received speech information into text information. A conventional speech recognition system includes an acoustic model, a language model, and a dictionary model. An end-to-end speech recognition system fuses these three models into a single neural network model for joint modeling, which simplifies the construction of the speech recognition system and also improves its performance.
End-to-end recognition mainly includes end-to-end recognition based on the Connectionist Temporal Classification (CTC) criterion and Attention-based Encoder-Decoder (AED) end-to-end speech recognition. In recent years, research has fused CTC-based and AED-based end-to-end speech recognition into the AED-CTC/Attention end-to-end speech recognition model, which has achieved breakthrough progress on many public speech recognition tasks.
However, in practical applications of the AED-CTC/Attention end-to-end speech recognition model, the decoder branch decodes autoregressively in the decoding stage: the decoder must predict the next piece of text from the acoustic features and the previously predicted text, so one decoder computation is required every time the model outputs a word, which results in low decoding efficiency.
Disclosure of Invention
The embodiments of the present application provide a speech recognition method to improve the efficiency of speech recognition.
Correspondingly, the embodiments of the present application also provide a speech recognition apparatus and an electronic device to ensure the implementation and application of the method.
In order to solve the above problem, an embodiment of the present application discloses a speech recognition method, which specifically includes:
acquiring voice data to be recognized;
inputting the acoustic features of the voice data to be recognized into a speech recognition model for processing; the speech recognition model includes an encoder having an output layer trained based on the Connectionist Temporal Classification (CTC) criterion; extracting hidden-layer features of the acoustic features through the encoder, and decoding the hidden-layer features of the acoustic features through the output layer of the encoder in a non-autoregressive decoding manner;
and determining a voice recognition result according to the output result of the output layer of the encoder.
Optionally, the determining a speech recognition result according to an output result of an output layer of the encoder includes:
decoding a character sequence output by the output layer of the encoder through a weighted finite-state transducer (WFST) network encoded with a word-level language model to obtain a candidate recognition result;
and determining a voice recognition result according to the candidate recognition result.
Optionally, the decoding, by the WFST network encoded with the word-level language model, the character sequence output by the output layer of the encoder to obtain a candidate recognition result includes:
inputting the character sequence into the WFST network for decoding; the WFST network is constructed based on a language model, a dictionary model and an output mapping model; the language model is used for judging whether a word sequence conforms to grammar and for giving the occurrence probability of the word sequence; the dictionary model is used for mapping character sequences into word sequences; the output mapping model is used for mapping the symbol sequence output by the output layer to single characters;
and obtaining a plurality of candidate sentences output by the WFST network and scores corresponding to the candidate sentences.
Optionally, the candidate recognition result includes a plurality of candidate sentences and scores corresponding to the candidate sentences; the determining a speech recognition result according to the candidate recognition result includes:
determining corresponding normalized probability values according to the scores corresponding to the candidate sentences respectively;
and taking the candidate sentence with the maximum normalized probability value as a voice recognition result.
Optionally, the candidate recognition result includes a plurality of candidate sentences and scores corresponding to the plurality of candidate sentences; the determining a speech recognition result according to the candidate recognition result includes:
inputting the candidate recognition result into a decoder of the speech recognition model; the decoder has an output layer trained based on the attention criterion, and the candidate sentences are re-scored through the output layer of the decoder;
determining a corresponding normalized probability value according to a new score corresponding to the candidate sentence output by an output layer of the decoder;
and taking the candidate sentence with the maximum normalized probability value as a voice recognition result.
Optionally, the speech recognition model is trained by:
acquiring sample voice data and a text label corresponding to the sample voice data;
taking the acoustic features of the sample voice data as the input of an encoder of the speech recognition model, extracting the hidden-layer features of the acoustic features through the encoder, predicting an output result based on the hidden-layer features of the acoustic features through an output layer of the encoder, and calculating an error according to the CTC criterion from the output result of the encoder and the text label;
taking the hidden layer features and the text labels output by the encoder as the input of a decoder of the speech recognition model, predicting by the decoder based on the hidden layer features and the text labels, predicting an output result by an output layer of the decoder, and calculating an error according to an attention criterion according to the output result of the decoder and the text labels;
performing model joint parameter training based on the errors calculated according to the CTC criterion and the errors calculated according to the attention criterion.
Optionally, said model joint parameter training based on said error calculated according to CTC criteria and said error calculated according to attention criteria comprises:
determining a target error according to the product of the error calculated according to the CTC criterion and a preset first weight value and the product of the error calculated according to the attention criterion and a preset second weight value;
and adjusting parameters of the voice recognition model according to the target error.
The embodiment of the present application further discloses a speech recognition apparatus, including:
the voice data acquisition module is used for acquiring voice data to be recognized;
the model processing module is used for inputting the acoustic features of the voice data to be recognized into a speech recognition model for processing; the speech recognition model includes an encoder having an output layer trained based on the Connectionist Temporal Classification (CTC) criterion; extracting hidden-layer features of the acoustic features through the encoder, and decoding the hidden-layer features of the acoustic features through the output layer of the encoder in a non-autoregressive decoding manner;
and the recognition result determining module is used for determining a voice recognition result according to the output result of the output layer of the encoder.
Optionally, the identification result determining module includes:
the candidate recognition result determining submodule is used for decoding a character sequence output by the output layer of the encoder through a weighted finite-state transducer (WFST) network encoded with a word-level language model to obtain a candidate recognition result;
and the voice recognition result determining submodule is used for determining a voice recognition result according to the candidate recognition result.
Optionally, the candidate recognition result determining sub-module includes:
a network decoding unit, configured to input the character sequence into the WFST network for decoding; the WFST network is constructed based on a language model, a dictionary model and an output mapping model; the language model is used for judging whether a word sequence conforms to grammar and for giving the occurrence probability of the word sequence; the dictionary model is used for mapping character sequences into word sequences; the output mapping model is used for mapping the symbol sequence output by the output layer to single characters;
and the candidate sentence determining unit is used for obtaining a plurality of candidate sentences output by the WFST network and scores corresponding to the candidate sentences.
Optionally, the candidate recognition result includes a plurality of candidate sentences and scores corresponding to the candidate sentences; the voice recognition result determination submodule includes:
the first normalization unit is used for determining corresponding normalization probability values according to the scores corresponding to the candidate sentences respectively;
and a first recognition result determination unit for taking the candidate sentence with the maximum normalized probability value as the voice recognition result.
Optionally, the candidate recognition result includes a plurality of candidate sentences and scores corresponding to the plurality of candidate sentences; the voice recognition result determination submodule includes:
a re-scoring unit for inputting the candidate recognition result into a decoder of the speech recognition model; the decoder has an output layer trained based on the attention criterion, and the candidate sentences are re-scored through the output layer of the decoder;
the second normalization unit is used for determining a corresponding normalization probability value according to a new score corresponding to the candidate statement output by the output layer of the decoder;
and a second recognition result determination unit for taking the candidate sentence with the maximum normalized probability value as the voice recognition result.
Optionally, the speech recognition model is trained by:
the sample data acquisition module is used for acquiring sample voice data and text labels corresponding to the sample voice data;
the encoder processing module is used for taking the acoustic features of the sample voice data as the input of an encoder of the speech recognition model, extracting the hidden-layer features of the acoustic features through the encoder, predicting an output result based on the hidden-layer features of the acoustic features through an output layer of the encoder, and calculating an error according to the CTC criterion from the output result of the encoder and the text label;
a decoder processing module, configured to use the hidden layer feature and the text label output by the encoder as input of a decoder of the speech recognition model, predict, by the decoder, based on the hidden layer feature and the text label, predict an output result by an output layer of the decoder, and calculate an error according to an attention criterion according to the output result of the decoder and the text label;
a model training module for performing model joint parameter training based on the errors calculated according to the CTC criterion and the errors calculated according to the attention criterion.
Optionally, the model training module comprises:
a target error determination submodule, configured to determine a target error according to a product of the error calculated according to the CTC criterion and the preset first weight value and a product of the error calculated according to the attention criterion and the preset second weight value;
and the model parameter adjusting submodule is used for adjusting the parameters of the voice recognition model according to the target error.
The embodiment of the present application also discloses an electronic device, comprising: a processor, a memory, and a computer program stored on the memory and executable on the processor, wherein the computer program, when executed by the processor, implements the steps of the speech recognition method described above.
The embodiment of the present application also discloses a computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the speech recognition method described above.
The embodiment of the application has the following advantages:
In the embodiments of the present application, the server can acquire voice data to be recognized; input the acoustic features of the voice data to be recognized into a speech recognition model for processing, the speech recognition model including an encoder; extract hidden-layer features of the acoustic features through the encoder, and decode the hidden-layer features of the acoustic features in a non-autoregressive decoding manner through an output layer trained based on the CTC criterion; and finally determine a speech recognition result according to the output result of the output layer. Compared with autoregressive decoding, decoding the hidden-layer features of the acoustic features in a non-autoregressive manner through the output layer trained based on the CTC criterion can greatly improve the decoding speed of the speech recognition model.
Drawings
FIG. 1 is a flow chart illustrating steps of a speech recognition method according to an embodiment of the present application;
FIG. 2 is a flow chart of steps of another speech recognition method provided by an embodiment of the present application;
FIG. 3 is a schematic diagram of a speech recognition model in an embodiment of the present application;
FIG. 4 is a flow chart of a method for training a speech recognition model in an embodiment of the present application;
fig. 5 is a block diagram of a speech recognition apparatus according to an embodiment of the present application;
FIG. 6 is a block diagram of a speech recognition device in an alternative embodiment of the present application;
FIG. 7 illustrates a block diagram of an electronic device for speech recognition, according to an example embodiment;
fig. 8 is a schematic structural diagram of an electronic device for speech recognition according to another exemplary embodiment of the present application.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present application more comprehensible, the present application is described in further detail with reference to the accompanying drawings and the detailed description.
The AED-CTC/Attention end-to-end speech recognition model has two main modules: an encoder and a decoder. The encoder learns the acoustic features, and the decoder jointly decodes the linguistic information and the acoustic information. The encoder has an output layer trained based on the CTC criterion, which can automatically learn word-boundary alignment. The automatic alignment capability of the CTC output layer gives the text a stronger monotonic alignment with the acoustic features, so the decoder can avoid problems such as truncation of long sentences; in turn, the joint modeling capability of the decoder gives the CTC output layer richer text-context modeling and stronger recognition capability. The encoder and the decoder may adopt neural network architectures such as RNN, LSTM, BLSTM, or Transformer.
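To make this structure concrete, the following is a minimal, hypothetical PyTorch sketch of an AED-CTC/Attention-style model: a Transformer encoder with a CTC output layer and a Transformer decoder with an attention output layer. The layer sizes, input projection, and vocabulary handling are illustrative assumptions, not the exact architecture of this application.

```python
import torch
import torch.nn as nn

class AEDCTCAttentionModel(nn.Module):
    """Illustrative AED-CTC/Attention skeleton (hypothetical sizes)."""
    def __init__(self, feat_dim=80, vocab_size=5000, d_model=256, nhead=4, layers=6):
        super().__init__()
        self.input_proj = nn.Linear(feat_dim, d_model)            # acoustic features -> model dim
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, layers)   # extracts hidden-layer features
        self.ctc_output = nn.Linear(d_model, vocab_size)          # CTC-trained output layer (vocab incl. blank)
        self.embed = nn.Embedding(vocab_size, d_model)            # text-label embedding for the decoder
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, layers)
        self.att_output = nn.Linear(d_model, vocab_size)          # attention-trained output layer

    def forward(self, feats, labels):
        # feats: (batch, frames, feat_dim); labels: (batch, label_len)
        hidden = self.encoder(self.input_proj(feats))             # hidden-layer features
        ctc_logits = self.ctc_output(hidden)                      # per-frame CTC logits
        tgt_len = labels.size(1)                                  # causal mask for the decoder
        tgt_mask = torch.triu(torch.full((tgt_len, tgt_len), float('-inf')), diagonal=1)
        dec_hidden = self.decoder(self.embed(labels), hidden, tgt_mask=tgt_mask)
        att_logits = self.att_output(dec_hidden)                  # per-token attention logits
        return ctc_logits, att_logits
```

Both output branches share the same encoder hidden features, which is what allows the joint CTC/attention training described later.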
In the embodiments of the present application, the speech recognition method using the end-to-end speech recognition model can be applied to various service scenarios. For example, in a personal scenario, a user's mobile terminal may be equipped with a voice input method. The voice input method can obtain the user's voice data and call a speech recognition server to perform non-real-time speech recognition. The speech recognition server can recognize the voice data with the end-to-end speech recognition model to obtain a speech recognition result and return the result to the voice input method.
For another example, in a vehicle-mounted scenario, a voice assistant of a vehicle-mounted terminal may obtain the user's voice data and call a speech recognition server to perform real-time speech recognition. The speech recognition server can recognize the voice data with the end-to-end speech recognition model to obtain a speech recognition result, and then perform semantic recognition on the speech recognition result to obtain a semantic recognition result, so that the voice assistant can respond to the semantic recognition result and complete the interaction.
Referring to fig. 1, a flowchart illustrating steps of an embodiment of a speech recognition method of the present application is shown, which may specifically include the following steps:
step 101, obtaining voice data to be recognized.
The method of the embodiment of the application can be applied to a server, such as a voice recognition server. The server may provide speech recognition services to clients (e.g., voice assistant, instant messaging APP, etc.).
The server is deployed with a speech recognition model, which may be, for example, an AED-CTC/Attention end-to-end speech recognition model. The speech recognition model may perform a non-real-time speech recognition task for the speech data to be recognized or a real-time speech recognition task for the speech data to be recognized according to the invocation request.
The client can send the voice data to be recognized to the server, and after receiving the voice data to be recognized, the server extracts the acoustic features of the voice data. The voice data may include multiple frames of sound signals, and acoustic features may be extracted separately for each frame. Common acoustic feature representations include Fbank (filter bank) feature vectors and MFCC (Mel-frequency cepstral coefficient) feature vectors.
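As a concrete illustration, the sketch below extracts frame-level Fbank (log-Mel filter bank) and MFCC features with librosa; the 16 kHz sampling rate, 25 ms window, 10 ms shift, and file path are illustrative assumptions rather than values specified by this application.

```python
import librosa
import numpy as np

def extract_features(wav_path, sr=16000):
    """Return per-frame Fbank and MFCC features for one utterance."""
    y, sr = librosa.load(wav_path, sr=sr)                        # waveform, resampled to 16 kHz
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=400, hop_length=160, n_mels=80)        # 25 ms windows, 10 ms shift
    fbank = librosa.power_to_db(mel).T                           # log-Mel (Fbank), shape (frames, 80)
    mfcc = librosa.feature.mfcc(
        y=y, sr=sr, n_fft=400, hop_length=160, n_mfcc=13).T      # MFCC, shape (frames, 13)
    return fbank.astype(np.float32), mfcc.astype(np.float32)
```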
Step 102, inputting the acoustic characteristics of the voice data to be recognized into a voice recognition model for processing; the speech recognition model includes an encoder having an output layer trained based on CTC criteria; extracting the hidden layer characteristics of the acoustic characteristics through the encoder, and decoding the hidden layer characteristics of the acoustic characteristics through an output layer of the encoder according to a non-autoregressive decoding mode.
The speech recognition model may include an encoder, and a decoder. At the end of the encoder there may be an output layer trained based on CTC criteria.
An output layer trained based on the CTC criterion can predict aligned output text from the acoustic features, helping the attention-based end-to-end speech recognition model remain monotonic during model training, thereby keeping an implicit alignment between the output text of the model and the input acoustic features.
The output layer trained based on the CTC criterion decodes and outputs in a non-autoregressive decoding manner: the decoding results have no dependency on one another, and every decoding result of the whole output sequence is predicted synchronously in parallel. The autoregressive decoding manner is the opposite: it uses the already generated decoding results as known information to predict the next decoding result at each step, and finally combines the decoding results generated at each time step into a complete output sequence. Compared with autoregressive decoding, decoding in a non-autoregressive manner through the CTC-trained output layer greatly improves the decoding speed of the speech recognition model and therefore the speed of speech recognition.
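The contrast between the two decoding modes can be sketched as follows. This is a hypothetical illustration: `ctc_logits` stands for the frame-level output of the CTC-trained output layer and `decoder_step` for one pass through an autoregressive decoder; neither name is taken from the application.

```python
import torch

def non_autoregressive_decode(ctc_logits):
    # One parallel pass: every frame's output symbol is predicted independently.
    return ctc_logits.argmax(dim=-1)                     # (frames,) symbol ids, including blanks

def autoregressive_decode(decoder_step, encoder_hidden, sos_id, eos_id, max_len=100):
    # One decoder computation per output token: each step conditions on the previous tokens.
    tokens = [sos_id]
    for _ in range(max_len):
        next_id = decoder_step(encoder_hidden, tokens)   # predicts the next token
        if next_id == eos_id:
            break
        tokens.append(next_id)
    return tokens[1:]
```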
And 103, determining a voice recognition result according to the output result of the output layer of the encoder.
The output result of the output layer trained based on the CTC criterion may be a character sequence, where each character is obtained by decoding the acoustic features in the non-autoregressive decoding manner. The speech recognition result can be further determined from this character sequence.
In the embodiments of the present application, the server can acquire voice data to be recognized; input the acoustic features of the voice data to be recognized into a speech recognition model for processing, the speech recognition model including an encoder; extract hidden-layer features of the acoustic features through the encoder, and decode the hidden-layer features of the acoustic features in a non-autoregressive decoding manner through an output layer trained based on the CTC criterion; and finally determine a speech recognition result according to the output result of the output layer. Compared with autoregressive decoding, decoding the hidden-layer features of the acoustic features in a non-autoregressive manner through the output layer trained based on the CTC criterion can greatly improve the decoding speed of the speech recognition model.
Referring to fig. 2, a flowchart illustrating steps of an alternative embodiment of the speech recognition method of the present application is shown, which may specifically include the following steps:
step 201, voice data to be recognized is obtained.
Step 202, inputting the acoustic characteristics of the voice data to be recognized into a voice recognition model for processing; the speech recognition model comprises an encoder; extracting hidden layer characteristics of the acoustic characteristics through the encoder, and decoding the hidden layer characteristics of the acoustic characteristics through an output layer of the encoder according to a non-autoregressive decoding mode; wherein an output layer of the encoder is trained using CTC criteria.
And step 203, decoding the character sequence output by the output layer of the encoder through a WFST network encoded with a word-level language model to obtain a candidate recognition result.
A weighted finite-state transducer (WFST) network is a directed graph that can be used to express any language and to encode linguistic information.
In the embodiments of the present application, because the output layer trained based on the CTC criterion decodes in a non-autoregressive manner and can output only one character at each decoding step, only a character-level language model can be fused during decoding. The modeling capability of a character-level language model is far weaker than that of a word-level language model, especially for hot-word optimization of proper nouns, so such a language model contributes little to the recognition performance of the speech recognition model, and the speech recognition model has a weakness in recognizing hot words.
To address this, word-level language model information can be encoded into the WFST network, and the character sequence output by the CTC-trained output layer can then be decoded by the WFST network. In this way, word-level language model information is fused into the decoding process of speech recognition, which on the one hand solves the difficulty of hot-word optimization for the speech recognition model and on the other hand improves the performance of the speech recognition model.
In an alternative embodiment of the present application, the step 203 may comprise the following sub-steps S11-S12:
substep S11, inputting the character sequence into the WFST network for decoding; the WFST network is constructed based on a language model, a dictionary model and an output mapping model; the language model is used for judging whether a word sequence conforms to grammar and for giving the occurrence probability of the word sequence; the dictionary model is used for mapping character sequences into word sequences; the output mapping model is used for mapping the symbol sequence output by the output layer to single characters.
And a substep S12 of obtaining a plurality of candidate sentences output by the WFST network and scores corresponding to the candidate sentences.
In the present application, the WFST network may be constructed from a language model, a dictionary model and an output mapping model. Illustratively, the WFST network may be represented by the following formula:
WFST = T ∘ L ∘ G
where ∘ denotes composition. G is the language model, used to judge whether an input word sequence conforms to the grammar of the language and to give the probability of the word sequence occurring. L is the dictionary model, used to map character sequences to word sequences. T is the output mapping model, used to map the symbol sequence output by the output layer to characters. Its rule is to collapse several identical consecutive outputs of the model with no blank symbol between them into a single character, to map identical outputs separated by a blank symbol to several characters, and to map isolated outputs to single characters. For example, with ∅ denoting the blank symbol, T(I I ∅ have ∅ a ∅ small small ∅ dream ∅) = "I have a small dream".
T, L and G are composed to obtain the final WFST network. The input of the final WFST network is the character sequence output by the output layer trained based on the CTC criterion, and its output is a plurality of candidate sentences and their corresponding scores.
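The mapping rule of T (collapse identical consecutive outputs with no blank between them, then remove blanks) can be illustrated with the short sketch below. The blank id and the standalone function are illustrative assumptions; in the application this rule is realized inside the composed WFST network rather than as separate code.

```python
def ctc_collapse(symbol_ids, blank_id=0):
    """Map an output-layer symbol sequence to characters: merge repeats, drop blanks."""
    chars = []
    prev = None
    for s in symbol_ids:
        if s != blank_id and s != prev:   # identical outputs separated by a blank are kept,
            chars.append(s)               # because prev is reset to the blank id in between
        prev = s
    return chars

# e.g. with blank_id=0: ctc_collapse([5, 5, 0, 7, 0, 7, 0]) -> [5, 7, 7]
```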
And step 204, determining a voice recognition result according to the candidate recognition result.
The WFST network may determine one or more candidate recognition results from the character sequence. A candidate recognition result may include candidate sentences and the scores corresponding to the candidate sentences.
In an alternative embodiment of the present application, the step 204 may include the following sub-steps S21-S22:
and a substep S21, determining corresponding normalized probability values according to the scores corresponding to the candidate sentences, respectively.
Illustratively, the scores corresponding to the candidate sentences may be normalized by a softmax () normalization function to obtain normalized probability values corresponding to the candidate sentences.
And a substep S22, taking the candidate sentence with the maximum normalized probability value as the speech recognition result.
The greater the normalized probability value of a candidate sentence, the more likely that sentence is. The candidate sentence with the highest normalized probability value is used as the final speech recognition result.
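A minimal sketch of this selection step follows; the candidate sentences and scores are placeholders, not outputs of the application's WFST network.

```python
import numpy as np

def pick_best(candidates, scores):
    """Normalize WFST scores with softmax and return the most probable candidate sentence."""
    scores = np.asarray(scores, dtype=np.float64)
    probs = np.exp(scores - scores.max())     # subtract the max for numerical stability
    probs /= probs.sum()                      # softmax normalization
    return candidates[int(np.argmax(probs))], probs

best, probs = pick_best(["I have a small dream", "I have a small drum"], [3.2, 1.1])
```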
In another embodiment of the present application, the step 204 may include the following sub-steps S31-S33:
a substep S31 of inputting the candidate recognition result into a decoder of the speech recognition model; the decoder has an output layer trained based on attention criteria to re-score the candidate sentences through the output layer of the decoder.
The decoder may have an output layer trained based on attention criteria, by which the score of the candidate sentence may be modified to obtain a new score.
And a substep S32 of determining a corresponding normalized probability value according to the new score corresponding to the candidate sentence output by the output layer of the decoder.
Illustratively, the new score corresponding to each candidate sentence may be normalized by a softmax () normalization function, so as to obtain a normalized probability value corresponding to each candidate sentence.
And a substep S33, taking the candidate sentence with the maximum normalized probability value as the speech recognition result.
And taking the candidate sentence with the maximum probability value as a final voice recognition result.
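A hedged sketch of this rescoring pass is shown below, reusing the hypothetical AEDCTCAttentionModel sketched earlier. Teacher-forcing the candidate tokens through the decoder and summing their log-probabilities is one common way to obtain the new score; it is not necessarily the application's exact formulation.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def rescore(model, encoder_hidden, candidate_ids):
    """Return the attention-decoder log-probability of one candidate sentence."""
    # candidate_ids is assumed to start with the start-of-sentence token
    labels = torch.tensor([candidate_ids])                         # (1, len)
    tgt_len = labels.size(1)
    tgt_mask = torch.triu(torch.full((tgt_len, tgt_len), float('-inf')), diagonal=1)
    dec_hidden = model.decoder(model.embed(labels), encoder_hidden, tgt_mask=tgt_mask)
    log_probs = F.log_softmax(model.att_output(dec_hidden), dim=-1)
    # score each position against the next token in the candidate (teacher forcing)
    token_scores = log_probs[0, :-1].gather(1, labels[0, 1:].unsqueeze(1))
    return token_scores.sum().item()
```

The resulting scores can then be normalized with softmax and the highest-probability candidate chosen, exactly as in the sketch above.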
In the embodiments of the present application, the server can acquire the voice data to be recognized; input the acoustic features of the voice data to be recognized into a speech recognition model for processing, the speech recognition model including an encoder; extract hidden-layer features of the acoustic features through the encoder, and decode the hidden-layer features of the acoustic features in a non-autoregressive decoding manner through the output layer of the encoder trained based on the CTC criterion; decode the character sequence output by the output layer of the encoder through a WFST network encoded with a word-level language model to obtain a candidate recognition result; and determine a speech recognition result according to the candidate recognition result. Compared with autoregressive decoding, decoding the hidden-layer features of the acoustic features in a non-autoregressive manner through the output layer trained based on the CTC criterion can greatly improve the decoding speed of the speech recognition model.
Furthermore, the character sequence is decoded through a WFST network encoded with a word-level language model to obtain a candidate recognition result, which alleviates the difficulty of hot-word optimization for the speech recognition model and improves the performance of the speech recognition model.
Referring to fig. 3, a schematic diagram of a speech recognition model in an embodiment of the present application is shown. The speech recognition model includes an encoder and a decoder. The input of the encoder is the acoustic features, and the encoder is responsible for modeling the acoustic features. The input of the decoder is the text label of the speech and the output of the encoder; the decoder is responsible for modeling the linguistic information on the one hand and for predicting the next output of the model by combining the acoustic and linguistic information on the other hand. Both the encoder and the decoder enhance their modeling capability through an internal attention mechanism.
The encoder has an output layer trained based on the CTC criterion, and the decoder has an output layer trained based on the attention criterion. The speech recognition model may be optimized using the CTC criterion and the attention criterion as joint training criteria. After the output layer of the encoder, a WFST network constructed from a language model, a dictionary model and an output mapping model may be connected. During inference of the speech recognition model, the acoustic features of the voice data to be recognized can be input into the encoder, the encoder decodes the acoustic features and outputs a character sequence, the character sequence is further decoded by the WFST network to generate a plurality of candidate sentences and their corresponding scores, and the speech recognition result is determined according to the candidate sentences and the corresponding scores.
Referring to fig. 4, a flowchart of a training method of a speech recognition model in an embodiment of the present application is shown, where the training method of the speech recognition model includes:
step 401, obtaining sample voice data and a text label corresponding to the sample voice data.
The text label of the sample voice data can be generated by manually listening to and transcribing the sample voice data, and the text label can be used to assist in training the model.
Step 402, taking the acoustic features of the sample voice data as the input of the encoder of the voice recognition model, extracting the hidden layer features of the acoustic features through the encoder, predicting the output result based on the hidden layer features of the acoustic features through the output layer of the encoder, and calculating the error according to the CTC criterion according to the output result of the encoder and the text label.
Specifically, the CTC criterion loss function may be calculated as an error based on the output of the encoder and the text label.
Step 403, using the hidden layer feature and the text label output by the encoder as the input of a decoder of the speech recognition model, predicting by the decoder based on the hidden layer feature and the text label, predicting an output result by the output layer of the decoder, and calculating an error according to an attention criterion according to the output result of the decoder and the text label.
The decoder may encode the text label and perform prediction according to the hidden-layer features output by the encoder and the text label. The output result predicted by the output layer of the decoder may include candidate sentences and the probabilities corresponding to the candidate sentences.
Specifically, the attention criterion loss function may be calculated as an error based on the output of the decoder and the text label. For example, the attention criterion loss function may use a Cross-Entropy (CE) loss function.
Step 404, performing model joint parameter training based on the errors calculated according to the CTC criterion and the errors calculated according to the attention criterion.
When the CTC criterion loss function and the attention criterion loss function meet a preset condition, it can be determined that the model training is complete. For example, when the errors given by the CTC criterion loss function and the attention criterion loss function are less than a preset error threshold, it may be determined that the model training is complete.
In an alternative embodiment, the step 404 may comprise: determining a target error according to the product of the error calculated according to the CTC criterion and the preset first weight value and the product of the error calculated according to the attention criterion and the preset second weight value; and adjusting parameters of the voice recognition model according to the target error.
Specifically, the target loss function may be determined according to the product of the CTC criterion loss function and a preset first weight value and the product of the attention criterion loss function and a preset second weight value. The parameters of the speech recognition model are then adjusted according to the target loss function.
Illustratively, the target loss function may be represented by the following formula:
L_CTC/Attention = λ · L_CTC + (1 - λ) · L_Attention
where λ is a hyper-parameter used to set the weights of the CTC criterion and the attention criterion; its value lies between 0 and 1 and can be obtained through repeated experiments. L_CTC is the CTC criterion loss function and L_Attention is the attention criterion loss function. For example, if λ is 0.4, the first weight value of the CTC criterion loss function is 0.4 and the second weight value of the attention criterion loss function is 0.6. After the target loss function is obtained, the parameters of the speech recognition model can be adjusted according to the target loss function. Training jointly with the CTC criterion and the attention criterion keeps the speech recognition model monotonic during training, so that the output text of the speech recognition model remains implicitly aligned with the input acoustic features; this makes the speech recognition training more effective and improves the final recognition accuracy.
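A minimal PyTorch sketch of this joint objective follows (λ = 0.4 matches the example above). The tensor shapes, padding conventions, and the use of separate padded label tensors for the two criteria are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def joint_loss(ctc_logits, att_logits, ctc_targets, att_targets,
               feat_lens, target_lens, lam=0.4, blank=0, pad_id=-100):
    """Target error = lam * L_CTC + (1 - lam) * L_Attention (illustrative shapes)."""
    # CTC criterion on the encoder output layer: expects (frames, batch, vocab) log-probs
    log_probs = F.log_softmax(ctc_logits, dim=-1).transpose(0, 1)
    l_ctc = F.ctc_loss(log_probs, ctc_targets, feat_lens, target_lens,
                       blank=blank, zero_infinity=True)
    # Attention criterion on the decoder output layer: cross-entropy against the text labels
    l_att = F.cross_entropy(att_logits.reshape(-1, att_logits.size(-1)),
                            att_targets.reshape(-1), ignore_index=pad_id)
    # Target error: weighted sum of the two criteria
    return lam * l_ctc + (1.0 - lam) * l_att
```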
It should be noted that, for simplicity of description, the method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the embodiments are not limited by the order of acts described, as some steps may occur in other orders or concurrently depending on the embodiments. Further, those skilled in the art will also appreciate that the embodiments described in the specification are presently preferred and that no particular act is required of the embodiments of the application.
Referring to fig. 5, a block diagram of a speech recognition apparatus according to an embodiment of the present application is shown, which may specifically include the following modules:
a voice data obtaining module 501, configured to obtain voice data to be recognized;
a model processing module 502, configured to input the acoustic features of the speech data to be recognized into a speech recognition model for processing; the speech recognition model includes an encoder having an output layer trained based on the Connectionist Temporal Classification (CTC) criterion; extract hidden-layer features of the acoustic features through the encoder, and decode the hidden-layer features of the acoustic features through the output layer of the encoder in a non-autoregressive decoding manner;
a recognition result determining module 503, configured to determine a speech recognition result according to an output result of the output layer of the encoder.
Referring to fig. 6, a block diagram of an alternative embodiment of a speech recognition device of the present application is shown.
In an optional embodiment of the present application, the recognition result determining module 503 may include:
a candidate recognition result determining submodule 5031, configured to decode, through a weighted finite-state transducer (WFST) network encoded with a word-level language model, the character sequence output by the output layer of the encoder to obtain a candidate recognition result;
a voice recognition result determining sub-module 5032, configured to determine a voice recognition result according to the candidate recognition result.
In an alternative embodiment of the present application, the candidate recognition result determining sub-module 5031 may include:
a network decoding unit 50311, configured to input the character sequence into the WFST network for decoding; the WFST network is constructed based on a language model, a dictionary model and an output mapping model; the language model is used for judging whether a word sequence conforms to grammar and for giving the occurrence probability of the word sequence; the dictionary model is used for mapping character sequences into word sequences; the output mapping model is used for mapping the symbol sequence output by the output layer to single characters;
a candidate sentence determination unit 50312, configured to obtain a plurality of candidate sentences output by the WFST network and scores corresponding to the plurality of candidate sentences.
In an optional embodiment of the present application, the candidate recognition result includes a plurality of candidate sentences and scores corresponding to the candidate sentences; the voice recognition result determining sub-module 5032 may include:
a first normalizing unit 50321, configured to determine corresponding normalized probability values according to the scores corresponding to the candidate sentences, respectively;
a first recognition result determining unit 50322 configured to use the candidate sentence with the largest normalized probability value as the speech recognition result.
In an optional embodiment of the present application, the candidate recognition result includes a plurality of candidate sentences and scores corresponding to the plurality of candidate sentences; the voice recognition result determining sub-module 5032 may include:
a re-scoring unit 50323, configured to input the candidate recognition result into a decoder of the speech recognition model; the decoder has an output layer trained based on the attention criterion, and the candidate sentences are re-scored through the output layer of the decoder;
a second normalizing unit 50324, configured to determine a corresponding normalized probability value according to a new score corresponding to the candidate sentence output by the output layer of the decoder;
a second recognition result determining unit 50325, configured to use the candidate sentence with the largest normalized probability value as the speech recognition result.
In an alternative embodiment of the present application, the speech recognition model is trained by:
a sample data obtaining module 504, configured to obtain sample voice data and a text label corresponding to the sample voice data;
an encoder processing module 505, configured to use the acoustic features of the sample speech data as an input of an encoder of the speech recognition model, extract, by the encoder, hidden layer features of the acoustic features, predict, by an output layer of the encoder, an output result based on the hidden layer features of the acoustic features, and calculate an error according to a CTC criterion according to the output result of the encoder and the text label;
a decoder processing module 506, configured to use the hidden layer feature and the text label output by the encoder as input of a decoder of the speech recognition model, perform prediction by the decoder based on the hidden layer feature and the text label, predict an output result by an output layer of the decoder, and calculate an error according to an attention criterion according to the output result of the decoder and the text label;
a model training module 507 for performing model joint parameter training based on the error calculated according to the CTC criterion and the error calculated according to the attention criterion.
In an alternative embodiment of the present application, the model training module 507 may include:
a target error determination submodule 5071, configured to determine a target error according to a product of the error calculated according to the CTC criterion and the preset first weight value and a product of the error calculated according to the attention criterion and the preset second weight value;
and the model parameter adjusting submodule 5072 is used for adjusting parameters of the voice recognition model according to the target error.
In the embodiments of the present application, the server can acquire voice data to be recognized; input the acoustic features of the voice data to be recognized into a speech recognition model for processing, the speech recognition model including an encoder; extract hidden-layer features of the acoustic features through the encoder, and decode the hidden-layer features of the acoustic features in a non-autoregressive decoding manner through an output layer trained based on the CTC criterion; and finally determine a speech recognition result according to the output result of the output layer. Compared with autoregressive decoding, decoding the hidden-layer features of the acoustic features in a non-autoregressive manner through the output layer trained based on the CTC criterion can greatly improve the decoding speed of the speech recognition model.
For the device embodiment, since it is basically similar to the method embodiment, the description is simple, and for the relevant points, refer to the partial description of the method embodiment.
Fig. 7 is a block diagram illustrating an architecture of an electronic device 700 for speech recognition, according to an example embodiment. For example, the electronic device 700 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a gaming console, a tablet device, a medical device, a fitness device, a personal digital assistant, a smart wearable device, and the like.
Referring to fig. 7, electronic device 700 may include one or more of the following components: a processing component 702, a memory 704, a power component 706, a multimedia component 708, an audio component 710, an input/output (I/O) interface 712, a sensor component 714, and a communication component 716.
The processing component 702 generally controls overall operation of the electronic device 700, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing element 702 may include one or more processors 720 to execute instructions to perform all or part of the steps of the methods described above. Further, the processing component 702 may include one or more modules that facilitate interaction between the processing component 702 and other components. For example, the processing component 702 can include a multimedia module to facilitate interaction between the multimedia component 708 and the processing component 702.
The memory 704 is configured to store various types of data to support operations at the electronic device 700. Examples of such data include instructions for any application or method operating on the electronic device 700, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 704 may be implemented by any type or combination of volatile or non-volatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
The power component 706 provides power to the various components of the electronic device 700. The power components 706 may include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for the electronic device 700.
The multimedia component 708 includes a screen that provides an output interface between the electronic device 700 and a user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 708 includes a front facing camera and/or a rear facing camera. The front camera and/or the rear camera may receive external multimedia data when the electronic device 700 is in an operation mode, such as a photographing mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have a focal length and optical zoom capability.
The audio component 710 is configured to output and/or input audio signals. For example, the audio component 710 includes a Microphone (MIC) configured to receive external audio signals when the electronic device 700 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signal may further be stored in the memory 704 or transmitted via the communication component 716. In some embodiments, audio component 710 also includes a speaker for outputting audio signals.
The I/O interface 712 provides an interface between the processing component 702 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
The sensor assembly 714 includes one or more sensors for providing various aspects of status assessment for the electronic device 700. For example, the sensor assembly 714 may detect an open/closed state of the electronic device 700, the relative positioning of components, such as a display and keypad of the electronic device 700, the sensor assembly 714 may also detect a change in the position of the electronic device 700 or a component of the electronic device 700, the presence or absence of user contact with the electronic device 700, orientation or acceleration/deceleration of the electronic device 700, and a change in the temperature of the electronic device 700. The sensor assembly 714 may include a proximity sensor configured to detect the presence of a nearby object without any physical contact. The sensor assembly 714 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 714 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 716 is configured to facilitate wired or wireless communication between the electronic device 700 and other devices. The electronic device 700 may access a wireless network based on a communication standard, such as WiFi, 2G or 3G, or a combination thereof. In an exemplary embodiment, the communication component 716 receives a broadcast signal or broadcast-related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 716 further includes a Near Field Communication (NFC) module to facilitate short-range communication. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, Infrared Data Association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the electronic device 700 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors or other electronic components for performing the above-described methods.
In an exemplary embodiment, a non-transitory computer readable storage medium comprising instructions, such as the memory 704 comprising instructions, executable by the processor 720 of the electronic device 700 to perform the above-described method is also provided. For example, the non-transitory computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
A non-transitory computer readable storage medium in which instructions, when executed by a processor of an electronic device, enable the electronic device to perform a method of speech recognition, the method comprising:
acquiring voice data to be recognized;
inputting acoustic features of the voice data to be recognized into a voice recognition model for processing; the speech recognition model includes an encoder having an output layer trained based on a Connectionist Temporal Classification (CTC) criterion; extracting hidden layer features of the acoustic features through the encoder, and decoding the hidden layer features of the acoustic features through the output layer of the encoder in a non-autoregressive decoding mode;
and determining a voice recognition result according to the output result of the output layer of the encoder.
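As a rough illustration of the steps above, the sketch below runs the encoder once, applies the CTC-trained output layer to every frame in parallel, and collapses the frame-wise labels into a token sequence. It assumes PyTorch-style callables; `encoder`, `ctc_head` and the tensor shapes in the comments are illustrative placeholders rather than the implementation of this application.

```python
import torch

def non_autoregressive_ctc_decode(feats, encoder, ctc_head, blank_id=0):
    """Greedy (non-autoregressive) CTC decoding: every frame is scored in one
    parallel pass, with no dependence on previously decoded tokens."""
    with torch.no_grad():
        hidden = encoder(feats)            # (T, H) hidden-layer features
        logits = ctc_head(hidden)          # (T, V) per-frame label scores
        frame_ids = logits.argmax(dim=-1)  # best label for every frame
    # Standard CTC collapse: merge repeated labels, then drop blanks.
    tokens, prev = [], None
    for idx in frame_ids.tolist():
        if idx != blank_id and idx != prev:
            tokens.append(idx)
        prev = idx
    return tokens
```

Because nothing in the loop above depends on previously emitted tokens, the whole utterance is decoded in one forward pass, which is what distinguishes this mode from autoregressive decoding.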
Optionally, the determining a speech recognition result according to an output result of an output layer of the encoder includes:
decoding a word sequence output by an output layer of the encoder through a weighted finite-state transducer (WFST) network encoded with a word-level language model to obtain a candidate recognition result;
and determining a voice recognition result according to the candidate recognition result.
Optionally, the decoding, by the WFST network encoded with the word-level language model, the word sequence output by the output layer of the encoder to obtain a candidate recognition result includes:
inputting the word sequence into the WFST network for decoding; the WFST network is constructed based on a language model, a dictionary model and an output mapping model; the language model is used for judging whether a word sequence conforms to the grammar and for estimating the occurrence probability of the word sequence; the dictionary model is used for mapping character sequences into word sequences; and the output mapping model is used for mapping the label sequence emitted by the output layer to single characters;
and obtaining a plurality of candidate sentences output by the WFST network and scores corresponding to the candidate sentences.
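To make the role of the three component models concrete, the toy sketch below plays the part of the composed WFST: a small lexicon (standing in for the dictionary model) groups the encoder output into words, and a bigram table (standing in for the word-level language model) scores each grouping, yielding several candidate sentences with scores. The lexicon entries, the bigram log-probabilities and the example input are all invented for illustration; a real system would search a compiled WFST with a decoding toolkit instead of enumerating segmentations.

```python
# Toy stand-in for the composed WFST search; all tables are invented examples.
LEXICON = {("new", "york"): "New_York", ("new",): "new", ("york",): "york"}
BIGRAM_LOGPROB = {
    ("<s>", "New_York"): -0.7, ("New_York", "</s>"): -0.4,
    ("<s>", "new"): -1.6, ("new", "york"): -2.3, ("york", "</s>"): -0.9,
}

def lm_score(words):
    """Bigram language-model log-probability of a word sequence."""
    seq = ["<s>"] + words + ["</s>"]
    return sum(BIGRAM_LOGPROB.get(pair, -10.0) for pair in zip(seq, seq[1:]))

def candidate_sentences(tokens):
    """Enumerate lexicon segmentations of the output sequence and return
    (sentence, score) pairs, mimicking the n-best list of the WFST network."""
    results = []
    def expand(pos, words):
        if pos == len(tokens):
            results.append((" ".join(words), lm_score(words)))
            return
        for end in range(pos + 1, len(tokens) + 1):
            key = tuple(tokens[pos:end])
            if key in LEXICON:
                expand(end, words + [LEXICON[key]])
    expand(0, [])
    return sorted(results, key=lambda c: c[1], reverse=True)

print(candidate_sentences(["new", "york"]))
# roughly [('New_York', -1.1), ('new york', -4.8)]
```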
Optionally, the candidate recognition result includes a plurality of candidate sentences and scores corresponding to the candidate sentences; the determining a speech recognition result according to the candidate recognition result includes:
determining corresponding normalized probability values according to the scores corresponding to the candidate sentences respectively;
and taking the candidate sentence with the maximum normalized probability value as a voice recognition result.
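The normalization described above amounts to a softmax over the candidate scores followed by selecting the argmax. A minimal sketch, reusing the toy candidates from the previous example (the numbers are illustrative only):

```python
import math

def pick_by_normalized_probability(candidates):
    """candidates: list of (sentence, score) pairs. Softmax-normalize the
    scores and return the sentence with the largest normalized probability."""
    max_score = max(score for _, score in candidates)   # numerical stability
    exp_scores = [(s, math.exp(score - max_score)) for s, score in candidates]
    total = sum(e for _, e in exp_scores)
    probs = [(s, e / total) for s, e in exp_scores]
    return max(probs, key=lambda p: p[1])

# pick_by_normalized_probability([("New_York", -1.1), ("new york", -4.8)])
# -> ("New_York", 0.976) approximately
```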
Optionally, the candidate recognition result includes a plurality of candidate sentences and scores corresponding to the plurality of candidate sentences; the determining a speech recognition result according to the candidate recognition result includes:
inputting the candidate recognition result into a decoder of the speech recognition model; the decoder has an output layer trained based on an attention criterion, and the candidate sentences are rescored through the output layer of the decoder;
determining a corresponding normalized probability value according to a new score corresponding to the candidate sentence output by an output layer of the decoder;
and taking the candidate sentence with the maximum normalized probability value as a voice recognition result.
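This second pass can be pictured as follows: each candidate sentence is fed to the attention decoder under teacher forcing, the per-token log-probabilities are summed into a new score, and the normalized probabilities are compared again. In the sketch below, `decoder`, `tokenizer` and their signatures are hypothetical stand-ins, not the interface disclosed by this application.

```python
import torch

def rescore_and_pick(candidates, hidden, decoder, tokenizer):
    """Rescore the candidate sentences with the attention decoder and pick the best.

    Assumptions: `decoder(hidden, tokens)` returns per-position log-probabilities
    of shape (len(tokens), vocab_size) under teacher forcing, and `tokenizer`
    maps a sentence to a list of token ids."""
    rescored = []
    with torch.no_grad():
        for sentence, _old_score in candidates:
            tokens = tokenizer(sentence)
            log_probs = decoder(hidden, tokens)           # (len(tokens), V)
            new_score = sum(log_probs[i, t].item() for i, t in enumerate(tokens))
            rescored.append((sentence, new_score))
    # Normalize the new scores and take the candidate with the largest probability.
    probs = torch.softmax(torch.tensor([s for _, s in rescored]), dim=0)
    best = int(torch.argmax(probs))
    return rescored[best][0], float(probs[best])
```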
Optionally, the speech recognition model is trained by:
acquiring sample voice data and a text label corresponding to the sample voice data;
taking the acoustic features of the sample voice data as the input of an encoder of the voice recognition model, extracting hidden layer features of the acoustic features through the encoder, predicting an output result based on the hidden layer features of the acoustic features through an output layer of the encoder, and calculating an error according to the CTC criterion from the output result of the encoder and the text label;
taking the hidden layer features output by the encoder and the text labels as the input of a decoder of the speech recognition model, predicting, by the decoder, based on the hidden layer features and the text labels, predicting an output result through an output layer of the decoder, and calculating an error according to an attention criterion from the output result of the decoder and the text labels;
performing model joint parameter training based on the errors calculated according to the CTC criterion and the errors calculated according to the attention criterion.
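Put concretely, one training step computes both branch errors from a single encoder pass: a CTC loss on the encoder output layer and a cross-entropy (attention) loss on the decoder output layer. The PyTorch sketch below assumes `encoder`, `ctc_head` and `decoder` modules with the shapes noted in the comments and omits padding handling for brevity; it illustrates joint CTC/attention training in general rather than the exact code of this application.

```python
import torch
import torch.nn as nn

ctc_criterion = nn.CTCLoss(blank=0, zero_infinity=True)
att_criterion = nn.CrossEntropyLoss()

def compute_branch_errors(feats, feat_lens, labels, label_lens,
                          encoder, ctc_head, decoder):
    """Compute the CTC error on the encoder output layer and the attention
    (cross-entropy) error on the decoder output layer from shared hidden features."""
    hidden = encoder(feats)                               # (T, N, H) hidden features
    # Encoder branch: per-frame log-probabilities scored by the CTC criterion.
    ctc_log_probs = ctc_head(hidden).log_softmax(dim=-1)  # (T, N, V)
    loss_ctc = ctc_criterion(ctc_log_probs, labels, feat_lens, label_lens)
    # Decoder branch: teacher-forced prediction of each label position.
    att_logits = decoder(hidden, labels)                  # (N, L, V)
    loss_att = att_criterion(att_logits.transpose(1, 2), labels)  # (N, V, L) vs (N, L)
    return loss_ctc, loss_att
```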
Optionally, said model joint parameter training based on said error calculated according to CTC criteria and said error calculated according to attention criteria comprises:
determining a target error according to the product of the error calculated according to the CTC criterion and the preset first weight value and the product of the error calculated according to the attention criterion and the preset second weight value;
and adjusting parameters of the voice recognition model according to the target error.
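In other words, the target error is a weighted sum of the two branch errors, and a single backward pass updates the encoder, both output layers and the decoder together. Continuing the previous sketch (the 0.3/0.7 weights are placeholders, not values taken from this application):

```python
ctc_weight, att_weight = 0.3, 0.7   # placeholder first/second weight values

def joint_training_step(optimizer, loss_ctc, loss_att):
    """Combine the two branch errors into the target error and adjust all
    model parameters jointly."""
    target_error = ctc_weight * loss_ctc + att_weight * loss_att
    optimizer.zero_grad()
    target_error.backward()
    optimizer.step()
    return float(target_error)
```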
Fig. 8 is a schematic structural diagram of an electronic device 800 for speech recognition according to another exemplary embodiment of the present application. The electronic device 800 may be a server, whose configuration and performance may vary widely; it may include one or more Central Processing Units (CPUs) 822 (e.g., one or more processors), memory 832, and one or more storage media 830 (e.g., one or more mass storage devices) storing applications 842 or data 844. The memory 832 and the storage medium 830 may be transient or persistent storage. The program stored in the storage medium 830 may include one or more modules (not shown), each of which may include a series of instruction operations for the server. Still further, the central processor 822 may be configured to communicate with the storage medium 830 to execute, on the server, the series of instruction operations stored in the storage medium 830.
The server may also include one or more power supplies 826, one or more wired or wireless network interfaces 850, one or more input-output interfaces 858, one or more keyboards 856, and/or one or more operating systems 841, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, etc.
In an exemplary embodiment, the server is configured to execute one or more programs by the one or more central processors 822 including instructions for:
acquiring voice data to be recognized;
inputting acoustic features of the voice data to be recognized into a voice recognition model for processing; the speech recognition model includes an encoder having an output layer trained based on a Connectionist Temporal Classification (CTC) criterion; extracting hidden layer features of the acoustic features through the encoder, and decoding the hidden layer features of the acoustic features through the output layer of the encoder in a non-autoregressive decoding mode;
and determining a voice recognition result according to the output result of the output layer of the encoder.
Optionally, the determining a speech recognition result according to an output result of an output layer of the encoder includes:
decoding a word sequence output by an output layer of the encoder through a weighted finite-state transducer (WFST) network encoded with a word-level language model to obtain a candidate recognition result;
and determining a voice recognition result according to the candidate recognition result.
Optionally, the decoding, by the WFST network encoded with the word-level language model, the word sequence output by the output layer of the encoder to obtain a candidate recognition result includes:
inputting the word sequence into the WFST network for decoding; the WFST network is constructed based on a language model, a dictionary model and an output mapping model; the language model is used for judging whether a word sequence conforms to the grammar and for estimating the occurrence probability of the word sequence; the dictionary model is used for mapping character sequences into word sequences; and the output mapping model is used for mapping the label sequence emitted by the output layer to single characters;
and obtaining a plurality of candidate sentences output by the WFST network and scores corresponding to the candidate sentences.
Optionally, the candidate recognition result includes a plurality of candidate sentences and scores corresponding to the candidate sentences; the determining a speech recognition result according to the candidate recognition result includes:
determining corresponding normalized probability values according to the scores corresponding to the candidate sentences respectively;
and taking the candidate sentence with the maximum normalized probability value as a voice recognition result.
Optionally, the candidate recognition result includes a plurality of candidate sentences and scores corresponding to the plurality of candidate sentences; the determining a speech recognition result according to the candidate recognition result includes:
inputting the candidate recognition result into a decoder of the speech recognition model; the decoder has an output layer trained based on an attention criterion, and the candidate sentences are rescored through the output layer of the decoder;
determining a corresponding normalized probability value according to a new score corresponding to the candidate sentence output by an output layer of the decoder;
and taking the candidate sentence with the maximum normalized probability value as a voice recognition result.
Optionally, further comprising instructions for training the speech recognition model by:
acquiring sample voice data and a text label corresponding to the sample voice data;
taking the acoustic features of the sample voice data as the input of an encoder of the voice recognition model, extracting hidden layer features of the acoustic features through the encoder, predicting an output result based on the hidden layer features of the acoustic features through an output layer of the encoder, and calculating an error according to the CTC criterion from the output result of the encoder and the text label;
taking the hidden layer features output by the encoder and the text labels as the input of a decoder of the speech recognition model, predicting, by the decoder, based on the hidden layer features and the text labels, predicting an output result through an output layer of the decoder, and calculating an error according to an attention criterion from the output result of the decoder and the text labels;
performing model joint parameter training based on the errors calculated according to the CTC criterion and the errors calculated according to the attention criterion.
Optionally, said model joint parameter training based on said error calculated according to CTC criteria and said error calculated according to attention criteria comprises:
determining a target error according to the product of the error calculated according to the CTC criterion and the preset first weight value and the product of the error calculated according to the attention criterion and the preset second weight value;
and adjusting parameters of the voice recognition model according to the target error.
The embodiments in the present specification are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.
Embodiments of the present application are described with reference to flowchart illustrations and/or block diagrams of methods, terminal devices (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable speech recognition terminal to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable speech recognition terminal, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable speech recognition terminal to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable speech recognition terminal to cause a series of operational steps to be performed on the computer or other programmable terminal to produce a computer implemented process such that the instructions which execute on the computer or other programmable terminal provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present application have been described, additional variations and modifications of these embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including the preferred embodiment and all such alterations and modifications as fall within the true scope of the embodiments of the application.
Finally, it should also be noted that, herein, relational terms such as first and second, and the like, may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or terminal. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in a process, method, article, or terminal that comprises the element.
The foregoing has described in detail the speech recognition method, speech recognition apparatus and electronic device provided by the present application. Specific examples have been used herein to illustrate the principles and embodiments of the present application, and the description of the foregoing embodiments is only intended to help understand the method and its core ideas. Meanwhile, for a person skilled in the art, there may be variations in the specific embodiments and the application scope according to the ideas of the present application. In summary, the content of this specification should not be construed as a limitation of the present application.

Claims (15)

1. A speech recognition method, comprising:
acquiring voice data to be recognized;
inputting acoustic features of the voice data to be recognized into a voice recognition model for processing; the speech recognition model includes an encoder having an output layer trained based on a Connectionist Temporal Classification (CTC) criterion; extracting hidden layer features of the acoustic features through the encoder, and decoding the hidden layer features of the acoustic features through the output layer of the encoder in a non-autoregressive decoding mode;
and determining a voice recognition result according to the output result of the output layer of the encoder.
2. The method of claim 1, wherein determining the speech recognition result according to the output result of the output layer of the encoder comprises:
decoding a word sequence output by an output layer of the encoder through a weighted finite-state transducer (WFST) network encoded with a word-level language model to obtain a candidate recognition result;
and determining a voice recognition result according to the candidate recognition result.
3. The method of claim 2, wherein decoding the word sequence output by the output layer of the encoder through the WFST network encoded with the word-level language model to obtain candidate recognition results comprises:
inputting the word sequence into the WFST network for decoding; the WFST network is constructed based on a language model, a dictionary model and an output mapping model; the language model is used for judging whether a word sequence conforms to the grammar and for estimating the occurrence probability of the word sequence; the dictionary model is used for mapping character sequences into word sequences; and the output mapping model is used for mapping the label sequence emitted by the output layer to single characters;
and obtaining a plurality of candidate sentences output by the WFST network and scores corresponding to the candidate sentences.
4. The method of claim 2, wherein the candidate recognition result comprises a plurality of candidate sentences and scores corresponding to the candidate sentences; the determining a speech recognition result according to the candidate recognition result includes:
determining corresponding normalized probability values according to the scores corresponding to the candidate sentences respectively;
and taking the candidate sentence with the maximum normalized probability value as a voice recognition result.
5. The method of claim 2, wherein the candidate recognition result comprises a plurality of candidate sentences and scores corresponding to the plurality of candidate sentences; the determining a speech recognition result according to the candidate recognition result includes:
inputting the candidate recognition result into a decoder of the speech recognition model; the decoder has an output layer trained based on an attention criterion, and the candidate sentences are rescored through the output layer of the decoder;
determining a corresponding normalized probability value according to a new score corresponding to the candidate sentence output by an output layer of the decoder;
and taking the candidate sentence with the maximum normalized probability value as a voice recognition result.
6. The method of claim 1, wherein the speech recognition model is trained by:
acquiring sample voice data and a text label corresponding to the sample voice data;
taking the acoustic features of the sample voice data as the input of an encoder of the voice recognition model, extracting hidden layer features of the acoustic features through the encoder, predicting an output result based on the hidden layer features of the acoustic features through an output layer of the encoder, and calculating an error according to the CTC criterion from the output result of the encoder and the text label;
taking the hidden layer features output by the encoder and the text labels as the input of a decoder of the speech recognition model, predicting, by the decoder, based on the hidden layer features and the text labels, predicting an output result through an output layer of the decoder, and calculating an error according to an attention criterion from the output result of the decoder and the text labels;
performing model joint parameter training based on the errors calculated according to the CTC criterion and the errors calculated according to the attention criterion.
7. The method of claim 6, wherein said model joint parameter training based on said errors calculated according to CTC criteria and said errors calculated according to attention criteria comprises:
determining a target error according to the product of the error calculated according to the CTC criterion and the preset first weight value and the product of the error calculated according to the attention criterion and the preset second weight value;
and adjusting parameters of the voice recognition model according to the target error.
8. A speech recognition apparatus, comprising:
the voice data acquisition module is used for acquiring voice data to be recognized;
the model processing module is used for inputting acoustic features of the voice data to be recognized into a voice recognition model for processing; the speech recognition model includes an encoder having an output layer trained based on a Connectionist Temporal Classification (CTC) criterion; extracting hidden layer features of the acoustic features through the encoder, and decoding the hidden layer features of the acoustic features through the output layer of the encoder in a non-autoregressive decoding mode;
and the recognition result determining module is used for determining a voice recognition result according to the output result of the output layer of the encoder.
9. The apparatus of claim 8, wherein the recognition result determining module comprises:
the candidate recognition result determining submodule is used for decoding a word sequence output by an output layer of the encoder through a weighted finite-state transducer (WFST) network encoded with a word-level language model to obtain a candidate recognition result;
and the voice recognition result determining submodule is used for determining a voice recognition result according to the candidate recognition result.
10. The apparatus of claim 9, wherein the candidate recognition result determining sub-module comprises:
a network decoding unit, configured to input the word sequence into the WFST network for decoding; the WFST network is constructed based on a language model, a dictionary model and an output mapping model; the language model is used for judging whether a word sequence conforms to the grammar and for estimating the occurrence probability of the word sequence; the dictionary model is used for mapping character sequences into word sequences; and the output mapping model is used for mapping the label sequence emitted by the output layer to single characters;
and the candidate sentence determining unit is used for obtaining a plurality of candidate sentences output by the WFST network and scores corresponding to the candidate sentences.
11. The apparatus of claim 9, wherein the candidate recognition result comprises a plurality of candidate sentences and scores corresponding to the candidate sentences; the voice recognition result determination submodule includes:
the first normalization unit is used for determining corresponding normalization probability values according to the scores corresponding to the candidate sentences respectively;
and a first recognition result determination unit for taking the candidate sentence with the maximum normalized probability value as the voice recognition result.
12. The apparatus of claim 9, wherein the candidate recognition result comprises a plurality of candidate sentences and scores corresponding to the plurality of candidate sentences; the voice recognition result determination submodule includes:
a re-scoring unit for inputting the candidate recognition result into a decoder of the speech recognition model; the decoder has an output layer trained based on an attention criterion, and the candidate sentences are rescored through the output layer of the decoder;
the second normalization unit is used for determining a corresponding normalization probability value according to a new score corresponding to the candidate statement output by the output layer of the decoder;
and a second recognition result determination unit for taking the candidate sentence with the maximum normalized probability value as the voice recognition result.
13. The apparatus of claim 8, wherein the speech recognition model is trained by:
the sample data acquisition module is used for acquiring sample voice data and text labels corresponding to the sample voice data;
the encoder processing module is used for taking the acoustic features of the sample voice data as the input of an encoder of the voice recognition model, extracting hidden layer features of the acoustic features through the encoder, predicting an output result based on the hidden layer features of the acoustic features through an output layer of the encoder, and calculating an error according to the CTC criterion from the output result of the encoder and the text labels;
a decoder processing module, configured to use the hidden layer features output by the encoder and the text labels as input of a decoder of the speech recognition model, predict, by the decoder, based on the hidden layer features and the text labels, predict an output result through an output layer of the decoder, and calculate an error according to an attention criterion from the output result of the decoder and the text labels;
a model training module for performing model joint parameter training based on the errors calculated according to the CTC criterion and the errors calculated according to the attention criterion.
14. An electronic device, comprising: processor, memory and a computer program stored on the memory and being executable on the processor, the computer program, when being executed by the processor, realizing the steps of the speech recognition method according to any of the claims 1-7.
15. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the speech recognition method according to any one of claims 1 to 7.
CN202110745581.1A 2021-06-30 2021-06-30 Voice recognition method and device and electronic equipment Pending CN113362813A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110745581.1A CN113362813A (en) 2021-06-30 2021-06-30 Voice recognition method and device and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110745581.1A CN113362813A (en) 2021-06-30 2021-06-30 Voice recognition method and device and electronic equipment

Publications (1)

Publication Number Publication Date
CN113362813A true CN113362813A (en) 2021-09-07

Family

ID=77537940

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110745581.1A Pending CN113362813A (en) 2021-06-30 2021-06-30 Voice recognition method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN113362813A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113516973A (en) * 2021-09-13 2021-10-19 珠海亿智电子科技有限公司 Non-autoregressive speech recognition network, method and equipment based on bidirectional context
CN115691476A (en) * 2022-06-06 2023-02-03 腾讯科技(深圳)有限公司 Training method of voice recognition model, voice recognition method, device and equipment
CN117116264A (en) * 2023-02-20 2023-11-24 荣耀终端有限公司 Voice recognition method, electronic equipment and medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5502790A (en) * 1991-12-24 1996-03-26 Oki Electric Industry Co., Ltd. Speech recognition method and system using triphones, diphones, and phonemes
AU2001283180A1 (en) * 2000-08-14 2002-02-25 Transgenetics Incorporated Transcription factor target gene discovery
US20030144840A1 (en) * 2002-01-30 2003-07-31 Changxue Ma Method and apparatus for speech detection using time-frequency variance
CN112037798A (en) * 2020-09-18 2020-12-04 中科极限元(杭州)智能科技股份有限公司 Voice recognition method and system based on trigger type non-autoregressive model
CN112712804A (en) * 2020-12-23 2021-04-27 哈尔滨工业大学(威海) Speech recognition method, system, medium, computer device, terminal and application

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
XINGCHEN SONG等: "NON-AUTOREGRESSIVE TRANSFORMER ASR WITH CTC-ENHANCED DECODER INPUT", 《ICASSP 2021 - 2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP)》, pages 1 - 4 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113516973A (en) * 2021-09-13 2021-10-19 珠海亿智电子科技有限公司 Non-autoregressive speech recognition network, method and equipment based on bidirectional context
CN113516973B (en) * 2021-09-13 2021-11-16 珠海亿智电子科技有限公司 Non-autoregressive speech recognition network, method and equipment based on bidirectional context
CN115691476A (en) * 2022-06-06 2023-02-03 腾讯科技(深圳)有限公司 Training method of voice recognition model, voice recognition method, device and equipment
CN115691476B (en) * 2022-06-06 2023-07-04 腾讯科技(深圳)有限公司 Training method of voice recognition model, voice recognition method, device and equipment
CN117116264A (en) * 2023-02-20 2023-11-24 荣耀终端有限公司 Voice recognition method, electronic equipment and medium

Similar Documents

Publication Publication Date Title
CN110634483B (en) Man-machine interaction method and device, electronic equipment and storage medium
CN107291690B (en) Punctuation adding method and device and punctuation adding device
CN109801644B (en) Separation method, separation device, electronic equipment and readable medium for mixed sound signal
CN113362812B (en) Voice recognition method and device and electronic equipment
CN107221330B (en) Punctuation adding method and device and punctuation adding device
CN108520741A (en) A kind of whispering voice restoration methods, device, equipment and readable storage medium storing program for executing
CN108346425B (en) Voice activity detection method and device and voice recognition method and device
CN113362813A (en) Voice recognition method and device and electronic equipment
CN111583944A (en) Sound changing method and device
CN111524521A (en) Voiceprint extraction model training method, voiceprint recognition method, voiceprint extraction model training device, voiceprint recognition device and voiceprint recognition medium
CN110992942B (en) Voice recognition method and device for voice recognition
CN107274903B (en) Text processing method and device for text processing
CN108364635B (en) Voice recognition method and device
CN111640424B (en) Voice recognition method and device and electronic equipment
CN110415702A (en) Training method and device, conversion method and device
US11532310B2 (en) System and method for recognizing user's speech
WO2022147692A1 (en) Voice command recognition method, electronic device and non-transitory computer-readable storage medium
WO2021218843A1 (en) Streaming end-to-end speech recognition method and apparatus, and electronic device
CN110930977B (en) Data processing method and device and electronic equipment
CN113113040B (en) Audio processing method and device, terminal and storage medium
CN112863499B (en) Speech recognition method and device, storage medium
CN113889070A (en) Voice synthesis method and device for voice synthesis
CN111667829A (en) Information processing method and device, and storage medium
CN113345452B (en) Voice conversion method, training method, device and medium of voice conversion model
CN110782898B (en) End-to-end voice awakening method and device and computer equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination