CN111402861B - Voice recognition method, device, equipment and storage medium


Info

Publication number
CN111402861B
Authority
CN
China
Prior art keywords
language
confidence
recognition
voice
classification
Prior art date
Legal status
Active
Application number
CN202010217558.0A
Other languages
Chinese (zh)
Other versions
CN111402861A (en)
Inventor
陈明佳 (Chen Mingjia)
Current Assignee
Sipic Technology Co Ltd
Original Assignee
Sipic Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Sipic Technology Co Ltd filed Critical Sipic Technology Co Ltd
Priority to CN202010217558.0A
Publication of CN111402861A
Application granted
Publication of CN111402861B

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/005 Language recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L15/08 Speech classification or search
    • G10L2015/0631 Creating reference templates; Clustering

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Machine Translation (AREA)

Abstract

The embodiment of the invention discloses a voice recognition method, device, equipment and storage medium. The method comprises the following steps: acquiring a voice to be recognized, and converting the voice into acoustic features; inputting the acoustic features into at least two language acoustic models, and outputting the corresponding phoneme sequences, where the language of the phoneme sequence output by each language acoustic model is different; converting the phoneme sequence of each language into a corresponding character sequence, and determining the recognition confidence of the character sequence; performing language classification on the voice according to the acoustic features, and determining the language confidence that the voice belongs to each language; and determining the classification recognition score of the voice for each language according to the recognition confidence and the language confidence, and taking the character sequence corresponding to the highest classification recognition score as the recognition result of the voice. When the method is used to recognize audio segments of different languages, the user does not need to switch between recognition systems for different languages, while high recognition accuracy, low time delay and a good user experience are achieved.

Description

Voice recognition method, device, equipment and storage medium
Technical Field
The embodiment of the invention relates to the technical field of voice recognition, in particular to a voice recognition method, a voice recognition device, voice recognition equipment and a storage medium.
Background
With the development of globalization, users often receive audio in different languages, and when the text form of the language corresponding to the audio is needed, speech recognition must be performed on the audio. For example, speech recognition is performed on multi-lingual mixed audio, or speech recognition is performed on a segment of audio in the main language and then on a segment of audio in an auxiliary language.
When performing speech recognition on audio mixed with multiple languages, the multiple languages are usually modeled jointly and combined into one technical framework. Combining multi-language modeling into one framework limits the learning capability of the model and reduces its recognition capability for any single language; moreover, because the amount of data differs across languages, the features the model learns differ as well, so the speech recognition capability for a language with a large amount of data is noticeably stronger than for a language with a small amount of data. Although this approach can solve the problem of recognizing fragments of an auxiliary language mixed into the main language, when the user speaks entirely in the main language for one period of time and entirely in the auxiliary language for another, the recognition accuracy is poor, and recognition may fail altogether.
To solve the above problem, prior-art speech recognition methods generally adopt one of two schemes. One is to build two complete speech recognition systems for the different languages; in practical application, the user then has to switch languages manually or in some other manual way, which is very inconvenient and gives a poor user experience. The other is to perform language classification through a language classification model before the speech recognition system performs recognition, and then run speech recognition for the classified language; however, language classification errors make the speech recognition accuracy even worse, and the added classification module increases the amount of computation and the time delay.
Disclosure of Invention
The embodiment of the invention provides a voice recognition method, device, equipment and storage medium, which can improve voice recognition accuracy and reduce time delay without requiring the user to switch languages.
In a first aspect, an embodiment of the present invention provides a speech recognition method, where the method includes:
acquiring a voice to be recognized, and converting the voice into acoustic features;
inputting the acoustic features into at least two language acoustic models, and outputting corresponding phoneme sequences; the language of the phoneme sequence output by each language acoustic model is different;
converting the phoneme sequence of each language into a corresponding character sequence, and determining the recognition confidence of the character sequence;
according to the acoustic features, language classification is carried out on the voice, and language confidence coefficients of the voice belonging to various languages are determined;
and determining the classification recognition score of the voice for each language according to the recognition confidence coefficient and the language confidence coefficient, and taking the character sequence corresponding to the highest value of the classification recognition score as the recognition result of the voice.
In a second aspect, an embodiment of the present invention further provides a speech recognition apparatus, where the apparatus includes:
the acoustic feature conversion module is used for acquiring a voice to be recognized and converting the voice into acoustic features;
the phoneme sequence output module is used for inputting the acoustic features into at least two language acoustic models and outputting corresponding phoneme sequences; the language of the phoneme sequence output by each language acoustic model is different;
the character sequence conversion module is used for converting the phoneme sequences of all languages into corresponding character sequences and determining the recognition confidence coefficients of the character sequences;
a language classification module, configured to classify the language of the speech according to the acoustic feature, and determine language confidence that the speech belongs to each language;
and the recognition result acquisition module is used for determining the classification recognition score of the voice for each language according to the recognition confidence coefficient and the language confidence coefficient, and taking the character sequence corresponding to the highest value of the classification recognition score as the recognition result of the voice.
In a third aspect, an embodiment of the present invention further provides a speech recognition device, where the speech recognition device includes:
one or more processors;
a storage device for storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement a speech recognition method according to any embodiment of the invention.
In a fourth aspect, the embodiment of the present invention further provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements a speech recognition method according to any embodiment of the present invention.
According to the technical scheme of the embodiment of the invention, the voice to be recognized is acquired and converted into acoustic features; the acoustic features are input into at least two language acoustic models, which output corresponding phoneme sequences, the language of the phoneme sequence output by each language acoustic model being different; the phoneme sequence of each language is converted into a corresponding character sequence, and the recognition confidence of the character sequence is determined; the voice is classified by language according to the acoustic features, and the language confidence that the voice belongs to each language is determined; and the classification recognition score of the voice for each language is determined according to the recognition confidence and the language confidence, with the character sequence corresponding to the highest classification recognition score taken as the recognition result of the voice. This solves the problem that the user has to switch languages when recognizing two voices of different languages, and achieves high voice recognition accuracy and low time delay without user switching.
Drawings
Fig. 1 is a flowchart of a speech recognition method according to an embodiment of the present invention;
FIG. 2 is a flowchart of a speech recognition method according to a second embodiment of the present invention;
fig. 3 is a flowchart of a speech recognition method according to a third embodiment of the present invention;
FIG. 4 is a block diagram of a speech recognition system according to an embodiment of the present invention;
FIG. 5 is a block diagram of a speech recognition system according to an embodiment of the present invention;
fig. 6 is a schematic structural diagram of a speech recognition apparatus according to a fourth embodiment of the present invention;
fig. 7 is a schematic structural diagram of a speech recognition device according to a fifth embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting of the invention. It should be further noted that, for the convenience of description, only some of the structures related to the present invention are shown in the drawings, not all of the structures.
Example one
Fig. 1 is a flowchart of a speech recognition method according to an embodiment of the present invention, where the present embodiment is applicable to a case of recognizing speech of different languages, the method may be executed by a speech recognition apparatus, the apparatus may be implemented by software and/or hardware, and the apparatus may be integrated in a processor, as shown in fig. 1, and the method specifically includes:
step 110, obtaining the voice to be recognized, and converting the voice into acoustic features.
The speech to be recognized may be a complete speech in a certain language, for example Chinese speech, or foreign-language speech (which may be English, Japanese, French, Russian, etc.); it may also be Mandarin speech or a dialect (including Minnan (southern Fujian), Northeastern, Shanxi, Cantonese, etc.). The speech to be recognized may be acquired through a microphone or other devices; the present invention is not particularly limited in this respect. The acquired voice to be recognized can be converted into acoustic features through the voice signal preprocessing module, where the acoustic features may be the spectrum of the voice, feature vectors, sentence vectors, or the like.
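As an illustration of this preprocessing step, the following is a minimal Python sketch; log-mel filterbank features and the librosa library are assumptions made for the example, since the description only requires a spectrum, feature vectors or the like.

```python
import librosa
import numpy as np

def speech_to_features(path, sr=16000, n_mels=80):
    """Convert a speech waveform into a (frames, n_mels) acoustic-feature
    matrix. Log-mel features are an assumed stand-in for the "spectrum or
    feature vectors" named in the description."""
    y, sr = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    return librosa.power_to_db(mel, ref=np.max).T
```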
In an implementation manner of the embodiment of the present invention, optionally, converting the speech into the acoustic feature includes: and inputting the voice into a voice processing deep learning model to obtain the acoustic characteristics of the voice.
Speech processing deep learning models that can be used include a Long Short-Term Memory network (LSTM), a Convolutional Neural Network (CNN), a Deep Neural Network (DNN), or Bidirectional Encoder Representations from Transformers (BERT).
In an implementation manner of the embodiment of the present invention, optionally, the speech processing deep learning model includes: BERT model.
In the embodiment of the invention, the BERT model is preferably adopted to convert the voice into acoustic features. In voice recognition, some languages have little manually labeled data and high labeling costs; with a general deep learning model such as LSTM, CNN or DNN, the model's generalization ability is weak and its accuracy on untrained data is low. The BERT model has obvious advantages in feature extraction: it can be pre-trained on a large amount of unlabeled data and then fine-tuned with task-specific voice data. That is, only one large-scale pre-training is needed, after which a good feature extraction effect can be achieved through quick fine-tuning; a large amount of unlabeled data can be used for training, which increases the generalization ability of the model, and the accuracy of the model can meet the requirements of practical application. In the technical scheme of the embodiment of the invention, voices of different languages can share the BERT model for acoustic feature conversion, which greatly reduces the computation and time delay of voice recognition. Meanwhile, because the BERT model is independent of the language acoustic models of the various languages, it can be conveniently swapped out, for example for a Transformer neural network with better effect in the field of voice recognition, which facilitates upgrading, updating and maintaining the voice recognition framework.
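A structural sketch of such a shared encoder is given below in PyTorch; the use of torch.nn.TransformerEncoder as a stand-in for the BERT architecture, and all layer sizes, are illustrative assumptions, since the patent only requires one pre-trained model whose features all language acoustic models share.

```python
import torch.nn as nn

class SharedAcousticEncoder(nn.Module):
    """BERT-style shared feature extractor: pre-trained once on unlabeled
    speech, fine-tuned with task data, and shared by every language
    acoustic model. Dimensions here are illustrative, not from the patent."""
    def __init__(self, feat_dim=80, d_model=256, nhead=4, num_layers=6):
        super().__init__()
        self.proj = nn.Linear(feat_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, feats):
        # feats: (batch, frames, feat_dim) acoustic features
        return self.encoder(self.proj(feats))
```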
Step 120, inputting the acoustic features into at least two language acoustic models, and outputting corresponding phoneme sequences; wherein, the language of the phoneme sequence output by each language acoustic model is different.
Under the multi-task learning framework, multiple tasks can be implemented in one speech recognition model, and an independent language acoustic model can be established for each language. A language acoustic model is obtained by training on a certain language and converts acoustic features into a phoneme sequence of that language; for example, a Chinese language acoustic model is trained on the acoustic features corresponding to Chinese, and inputting acoustic features into it yields a Chinese phoneme sequence. The acoustic features are input into the language acoustic models of the various languages simultaneously, i.e. processed in parallel, to obtain the corresponding phoneme sequences. For example, the acoustic features of one piece of speech can be input into the Chinese language acoustic model and the Japanese language acoustic model at the same time, yielding a Chinese phoneme sequence and a Japanese phoneme sequence. A phoneme is the smallest unit of speech, and each piece of speech can be converted into a phoneme sequence; for example, for the speech "hello", the corresponding Chinese phoneme sequence is "nihao". A sketch of this parallel processing is given below.
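In the sketch, the per-language models are stubs that return the description's own example phoneme strings, and the thread pool is an assumed mechanism; the patent only requires that all models process the same features in parallel.

```python
from concurrent.futures import ThreadPoolExecutor

# Stub language acoustic models returning illustrative phoneme sequences;
# real models would map acoustic features to phonemes of their language.
def chinese_acoustic_model(feats):
    return "nihao"

def japanese_acoustic_model(feats):
    return "konnichiwa"

def decode_all_languages(feats, models):
    # Feed the same acoustic features to every language acoustic model.
    with ThreadPoolExecutor() as pool:
        futures = {lang: pool.submit(model, feats)
                   for lang, model in models.items()}
        return {lang: f.result() for lang, f in futures.items()}

phoneme_sequences = decode_all_languages(
    feats=None,  # placeholder for real acoustic features
    models={"zh": chinese_acoustic_model, "ja": japanese_acoustic_model})
```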
Step 130, converting the phoneme sequence of each language into a corresponding character sequence, and determining the recognition confidence of the character sequence.
The phoneme sequences of the various languages can be converted in parallel to obtain the character sequences of the corresponding languages. For example, a Chinese phoneme sequence may be converted into a text sequence through a Chinese language model: the phoneme sequence "nihao" may be converted into the text sequence "hello". A model may output multiple results, each with a corresponding confidence, and the result with the highest confidence is output. For example, for the speech "hello", the phoneme sequences output by the language acoustic model may include "nihao", "lihao" and "leihao", with confidences of 0.9, 0.08 and 0.02 respectively; the phoneme sequence "nihao" is therefore taken as the output of the language acoustic model. Similarly, each text sequence has a corresponding recognition confidence. Here, a confidence can be understood as the probability of the result occurring; it is learned during model training and is therefore produced when the model outputs a result.
In an implementation manner of the embodiment of the present invention, optionally, converting the phoneme sequence of each language into a corresponding text sequence, and determining the recognition confidence of the text sequence includes: and respectively inputting the phoneme sequences of each language into the language models of the corresponding languages to obtain the character sequences corresponding to the phoneme sequences of each language, and determining the recognition confidence of the character sequences.
The language model may be obtained by training on a certain language and can convert a phoneme sequence into a character sequence of that language. For example, a Chinese language model is trained on phoneme sequences corresponding to Chinese; inputting a phoneme sequence into it yields a Chinese character sequence together with the recognition confidence of that output. For example, for the phoneme sequence "nihao", the Chinese language model may convert it into the text sequence "hello" with recognition confidence 0.8, into "li hao" with recognition confidence 0.1, and into "lei hao" with recognition confidence 0.1. The output of the Chinese language model can then be the character sequence corresponding to the highest recognition confidence, i.e. the text sequence "hello" with recognition confidence 0.8.
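The selection of the output character sequence reduces to an argmax over the candidates, as in the minimal sketch below; the candidate texts and scores mirror the "nihao" example above and would in practice come from the language model.

```python
def best_text_sequence(candidates):
    """Return the (text, recognition confidence) pair with the highest
    recognition confidence among the language model's candidates."""
    return max(candidates.items(), key=lambda kv: kv[1])

# Illustrative candidates for the phoneme sequence "nihao".
candidates = {"hello": 0.8, "li hao": 0.1, "lei hao": 0.1}
text, rec_conf = best_text_sequence(candidates)  # ("hello", 0.8)
```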
Step 140, performing language classification on the voice according to the acoustic features, and determining the language confidence that the voice belongs to each language.
The acoustic classification model can be trained on the acoustic features of voices of various languages, so that the acoustic features of a section of voice can be classified by language to determine the language confidence of each language for that voice. For example, if the acoustic classification model is trained on the acoustic features of Chinese, Japanese and English speech, then when the acoustic features of a section of speech are input into it, the language confidence that the speech belongs to Chinese, to Japanese, to English, and to none of the three can be obtained. The language corresponding to the highest language confidence, together with that confidence, may be taken as the output of the acoustic classification model, for example: the speech belongs to Chinese with a language confidence of 0.71.
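A minimal sketch of turning the acoustic classification model's raw outputs into language confidences follows; the softmax normalization and the logit values are assumptions, since the patent only requires one confidence per trained language plus a none-of-these class.

```python
import numpy as np

def language_confidences(logits, labels):
    """Softmax over the acoustic classification model's output logits,
    yielding one confidence per language (plus "other")."""
    exp = np.exp(logits - logits.max())  # subtract max for stability
    probs = exp / exp.sum()
    return dict(zip(labels, probs))

confs = language_confidences(np.array([2.0, 0.9, -0.5, -1.5]),
                             ["zh", "ja", "en", "other"])
best_language = max(confs, key=confs.get)
```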
Step 150, determining the classification recognition score of the voice for each language according to the recognition confidence and the language confidence, and taking the character sequence corresponding to the highest classification recognition score as the voice recognition result.
The information fusion decision module may determine the classification recognition score of the speech for each language according to the recognition confidence of the text sequence and the language confidence that the speech belongs to a certain language. The classification recognition score may be the product of the recognition confidence and the language confidence, their sum, or determined in other manners; the present invention is not particularly limited. The character sequence corresponding to the highest classification recognition score is the most probable recognition result, and is therefore taken as the voice recognition result.
For example, for a speech segment X, assume that the content of X is CCFD in language A. After processing by the voice recognition module (comprising the acoustic classification model, the language acoustic models and the language models) in the multi-task learning architecture, the recognition result for language A is CCFD with recognition confidence 0.7, and the recognition result for language B is HJKL with recognition confidence 0.45. The acoustic classification model gives three classification results: the language confidence that the voice belongs to language A is 0.75, the language confidence that it belongs to language B is 0.24, and the language confidence that it belongs to neither language A nor language B is 0.01. Finally, the classification recognition score of speech X for the language-A output CCFD is 0.7 × 0.75 = 0.525, and the classification recognition score for the language-B output HJKL is 0.45 × 0.24 = 0.108. Therefore, the recognition result of the speech X is CCFD in language A.
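The worked example above corresponds directly to the following sketch, using the product form of the classification recognition score:

```python
def classification_score(rec_conf, lang_conf):
    # Fusion rule of this example: product of the recognition
    # confidence and the language confidence.
    return rec_conf * lang_conf

results = {
    "A": ("CCFD", classification_score(0.7, 0.75)),   # 0.525
    "B": ("HJKL", classification_score(0.45, 0.24)),  # 0.108
}
text, score = max(results.values(), key=lambda ts: ts[1])  # ("CCFD", 0.525)
```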
In a specific implementation manner of the embodiment of the present invention, semantic models of different languages may be added after the language models to perform semantic domain classification on the text sequence of the corresponding language and determine a domain confidence; the classification recognition score of the voice for each language is then determined according to the recognition confidence, the language confidence and the domain confidence, and the character sequence corresponding to the highest classification recognition score is taken as the recognition result of the voice, making the recognition result more accurate.
In another specific implementation manner of the embodiment of the present invention, a text language classification model may be added after the language models to perform text language classification on the text sequences output by the language model of each language and determine the text language confidence that the text sequences belong to each language; the language classification score of the voice for each language is determined according to the language confidence and the text language confidence, and the language with the highest language classification score is determined as the target language; all character sequences are domain-classified through the semantic model corresponding to the target language, and the domain confidence of each character sequence in each domain is determined; the classification recognition score of the voice for each language is determined according to the recognition confidence, the language classification score and the domain confidence, and the character sequence corresponding to the highest classification recognition score is taken as the voice recognition result, making both the language recognition result and the voice recognition result more accurate.
According to the technical scheme of the embodiment of the invention, the voice to be recognized is acquired and converted into acoustic features; the acoustic features are input into at least two language acoustic models, which output corresponding phoneme sequences, the language of the phoneme sequence output by each language acoustic model being different; the phoneme sequence of each language is converted into a corresponding character sequence, and the recognition confidence of the character sequence is determined; the voice is classified by language according to the acoustic features, and the language confidence that the voice belongs to each language is determined; and the classification recognition score of the voice for each language is determined according to the recognition confidence and the language confidence, with the character sequence corresponding to the highest classification recognition score taken as the recognition result of the voice. This solves the problem that the user has to switch manually, or in some other manual way, when recognizing two sections of voice in different languages, and achieves high voice recognition accuracy, low time delay and a small number of modules without user switching.
Example two
Fig. 2 is a flowchart of a speech recognition method provided in a second embodiment of the present invention, which is a further refinement of the above technical solution, and the technical solution in this embodiment may be combined with various alternatives in one or more of the above embodiments.
As shown in fig. 2, the method includes:
step 210, obtaining the voice to be recognized, and converting the voice into acoustic features.
In an implementation manner of the embodiment of the present invention, optionally, converting the speech into the acoustic feature includes: and inputting the voice into a voice processing deep learning model to obtain the acoustic characteristics of the voice.
In an implementation manner of the embodiment of the present invention, optionally, the deep learning model for speech processing includes: BERT model.
Step 220, inputting the acoustic features into at least two language acoustic models, and outputting corresponding phoneme sequences; wherein, the language of the phoneme sequence output by each language acoustic model is different.
Step 230, converting the phoneme sequence of each language into a corresponding character sequence, and determining the recognition confidence of the character sequence.
In an implementation manner of the embodiment of the present invention, optionally, converting the phoneme sequence of each language into a corresponding text sequence, and determining the recognition confidence of the text sequence, includes: and respectively inputting the phoneme sequences of each language into the language models of the corresponding languages to obtain the character sequences corresponding to the phoneme sequences of each language, and determining the recognition confidence coefficients of the character sequences.
Step 240, performing language classification on the voice according to the acoustic features, and determining the language confidence that the voice belongs to each language.
Step 250, determining, for the character sequence of each language, the domain confidence of the character sequence in each domain.
After the language model, domain classification may be performed on the text sequences of the various languages, and the domain confidence of each text sequence in each domain determined; a domain here may be, for example, music, stories, games, movies, or reading. For example, the text sequence obtained from the language-A model may be domain-classified to determine its domain confidence in the music, story, game, movie and reading domains; if its domain confidence in the music domain is the highest, at 0.68, then the domain confidence of the language-A text sequence in the music domain is determined to be 0.68.
In an implementation manner of the embodiment of the present invention, optionally, determining the domain confidence of the text sequence in each domain includes: and performing field classification on the character sequence through a semantic model corresponding to the language of the character sequence to obtain the field confidence of the character sequence in each field.
Semantic models for the different languages can be added after the language models, performing semantic domain classification on the character sequences of the corresponding languages and determining the domain confidences. A semantic model can be trained on character sequences of a specific language and can domain-classify the character sequence of that language for a section of voice, determining its domain confidence in each domain. For example, a semantic model for language A may be trained on the music, movie and reading domains; inputting the language-A character sequence yields its domain confidence in the music domain, in the movie domain, in the reading domain, and in none of the three, and the domain corresponding to the highest domain confidence can be selected as the domain of the language-A character sequence. By using the semantic classification module, a more accurate result can be selected when the language confidences from the acoustic classification model are very close; in task-oriented voice dialogue in particular, the accuracy of domain classification shows its value, and errors from the acoustic classification model can be corrected according to text-level information. A sketch of this step follows.
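In the sketch, the domain labels and confidence values reuse the music-domain example above; a real semantic model would compute them from the character sequence.

```python
def best_domain(domain_confs):
    """Return the (domain, confidence) pair with the highest domain
    confidence produced by a language's semantic model."""
    return max(domain_confs.items(), key=lambda kv: kv[1])

# Illustrative output of the language-A semantic model for its text sequence.
domain_confs = {"music": 0.68, "story": 0.12, "game": 0.08,
                "movie": 0.07, "reading": 0.05}
domain, domain_conf = best_domain(domain_confs)  # ("music", 0.68)
```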
Step 260, determining the classification recognition score of the voice for each language according to the recognition confidence, the language confidence and the domain confidence.
The information fusion decision module may determine the classification recognition score of the speech for each language according to the recognition confidence of the text sequence, the language confidence that the speech belongs to a certain language, and the domain confidence of the text sequence in a certain domain. The classification recognition score may be the product of the recognition confidence and the language confidence plus the highest domain confidence, the sum of the recognition confidence, the language confidence and the highest domain confidence, or determined in other manners; the present invention is not limited in this respect.
In an implementation manner of the embodiment of the present invention, optionally, determining the classification recognition score of the speech for each language according to the recognition confidence, the language confidence and the domain confidence includes: determining the product of the recognition confidence and the language confidence, and determining the arithmetic sum of that product and the highest domain confidence; and taking the arithmetic sum as the classification recognition score of the speech for that language.
In the embodiment of the present invention, a preferred method for determining the classification recognition scores of the languages is as follows: the classification recognition score of the specific language = recognition confidence of the character sequence of the specific language × language confidence that the speech of the specific language belongs to a certain language + highest domain confidence corresponding to the character sequence in each domain. In task-type voice conversation, the semantic classification module is added to classify the fields, so that the accuracy of voice recognition is further improved.
For example, for a speech segment X, assume that the content of X is CCFD in language A. After processing by the voice recognition module (comprising the acoustic classification model, the language acoustic models and the language models) in the multi-task learning architecture, the recognition result for language A is CCFD with recognition confidence 0.7, and the recognition result for language B is HJKL with recognition confidence 0.45. The acoustic classification model gives three classification results: the language confidence that the voice belongs to language A is 0.75, the language confidence that it belongs to language B is 0.24, and the language confidence that it belongs to neither language A nor language B is 0.01. In the semantic classification model, the best domain classification result of CCFD in the semantic model of language A is that CCFD belongs to the music domain, with domain confidence 0.71; the best domain classification result of HJKL in the semantic model of language B is that HJKL belongs to the reading domain, with domain confidence 0.39. Finally, the classification recognition score of speech X for the language-A output CCFD is 0.7 × 0.75 + 0.71 = 1.235, and the classification recognition score for the language-B output HJKL is 0.45 × 0.24 + 0.39 = 0.498. Therefore, the recognition result of the speech X is CCFD in language A.
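The same worked example, expressed as a sketch of this embodiment's preferred formula (product of the two confidences plus the highest domain confidence):

```python
def classification_score_v2(rec_conf, lang_conf, max_domain_conf):
    # Score = recognition confidence x language confidence
    #         + highest domain confidence of the text sequence.
    return rec_conf * lang_conf + max_domain_conf

score_a = classification_score_v2(0.7, 0.75, 0.71)   # 1.235 -> CCFD wins
score_b = classification_score_v2(0.45, 0.24, 0.39)  # 0.498
```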
Step 270, taking the character sequence corresponding to the highest classification recognition score as the voice recognition result.
According to the technical scheme of the embodiment of the invention, the voice to be recognized is acquired and converted into acoustic features; the acoustic features are input into at least two language acoustic models, which output corresponding phoneme sequences, the language of the phoneme sequence output by each language acoustic model being different; the phoneme sequence of each language is converted into a corresponding character sequence, and the recognition confidence of the character sequence is determined; the voice is classified by language according to the acoustic features, and the language confidence that the voice belongs to each language is determined; for the character sequence of each language, the domain confidence of the character sequence in each domain is determined; the classification recognition score of the voice for each language is determined according to the recognition confidence, the language confidence and the domain confidence; and the character sequence corresponding to the highest classification recognition score is taken as the recognition result of the voice. This solves the problem that the user has to switch manually, or in some other manual way, when recognizing two sections of voice in different languages, and achieves high voice recognition accuracy, low time delay and a small number of modules without user switching. In task-oriented voice dialogue in particular, domain classification allows errors from the acoustic classification model to be corrected according to text-level information, further improving the accuracy of voice recognition.
Example three
Fig. 3 is a flowchart of a speech recognition method provided in a third embodiment of the present invention, which is a further refinement of the above technical solution, and the technical solution in this embodiment may be combined with various alternatives in one or more of the above embodiments.
As shown in fig. 3, the method includes:
step 310, obtaining the voice to be recognized, and converting the voice into acoustic features.
In an implementation manner of the embodiment of the present invention, optionally, converting the speech into the acoustic feature includes: and inputting the voice into a voice processing deep learning model to obtain the acoustic characteristics of the voice.
In an implementation manner of the embodiment of the present invention, optionally, the speech processing deep learning model includes: BERT model.
Step 320, inputting the acoustic features into at least two language acoustic models, and outputting corresponding phoneme sequences; wherein, the language of the phoneme sequence output by each language acoustic model is different.
Step 330, converting the phoneme sequence of each language into a corresponding character sequence, and determining the recognition confidence of the character sequence.
In an implementation manner of the embodiment of the present invention, optionally, converting the phoneme sequence of each language into a corresponding text sequence, and determining the recognition confidence of the text sequence includes: and respectively inputting the phoneme sequences of each language into the language models of the corresponding languages to obtain the character sequences corresponding to the phoneme sequences of each language, and determining the recognition confidence of the character sequences.
Step 340, performing language classification on the voice according to the acoustic features, and determining the language confidence that the voice belongs to each language.
Step 350, performing character language classification on the character sequence of each language, and determining the character language confidence that the character sequences belong to each language.
A text language classification model may be added after the language models; it classifies the text language of the text sequences output by the language model of each language and determines the character language confidence that the text sequences belong to each language. The text language classification model may be trained on word sequences of multiple languages and can determine, from a multilingual set of word sequences, the confidence that all of them belong to the same language, for example the confidence that the word sequence of language A and the word sequence of language B together belong to language A, or together belong to language B.
Step 360, determining the language classification score of the voice for each language according to the language confidence and the character language confidence, and determining the language with the highest language classification score as the target language.
For example, for the speech X, the language confidence that X belongs to language A is 0.6, the language confidence that X belongs to language B is 0.3, and the language confidence that X belongs to neither language A nor language B is 0.1. The character language confidence that the character sequences of X in language A and in language B together belong to language A is 0.5, the character language confidence that they together belong to language B is 0.4, and the character language confidence that they belong to both languages is 0.1. Then the language classification score of language A is 0.6 + 0.5 = 1.1 and the language classification score of language B is 0.3 + 0.4 = 0.7, so language A is the target language.
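The target-language selection in this example reduces to the following sketch, summing the acoustic-level language confidence and the text-level character language confidence:

```python
def language_classification_score(lang_conf, char_lang_conf):
    # Sum of the acoustic-level and text-level language confidences.
    return lang_conf + char_lang_conf

scores = {"A": language_classification_score(0.6, 0.5),  # 1.1
          "B": language_classification_score(0.3, 0.4)}  # 0.7
target_language = max(scores, key=scores.get)  # "A"
```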
Step 370, performing domain classification on all the character sequences through the semantic model corresponding to the target language, and determining the domain confidence of each character sequence in each domain.
For example, when language A is determined to be the target language, the text sequence of language A and the text sequence of language B may both be input into the semantic model of language A for domain classification, determining the domain confidence that the language-A text sequence belongs to each domain and the domain confidence that the language-B text sequence belongs to each domain. This is applicable to speech recognition when multiple languages are mixed within the speech X.
Step 380, determining the classification recognition score of the voice for each language according to the recognition confidence, the language classification score and the domain confidence.
The information fusion decision module can determine the classification recognition score of the speech for each language according to the recognition confidence of the character sequence, the language classification score, and the domain confidence of the character sequence in a certain domain.
For example, for a piece of speech X, assume that the content of X is CCFD in language A. After processing by the voice recognition module (comprising the acoustic classification model, the language acoustic models and the language models) in the multi-task learning framework, the recognition result for language A is CCFD with recognition confidence 0.7, and the recognition result for language B is HJKL with recognition confidence 0.45. The acoustic classification model gives three classification results: the language confidence that the voice belongs to language A is 0.75, the language confidence that it belongs to language B is 0.24, and the language confidence that it belongs to neither language A nor language B is 0.01. The character language confidence that CCFD and HJKL together belong to language A is 0.5, the character language confidence that they together belong to language B is 0.4, and the character language confidence that they belong to both is 0.1. In the semantic classification model, the best domain classification result of CCFD in the semantic model of the target language is that CCFD belongs to the music domain, with domain confidence 0.71; the best domain classification result of HJKL is that HJKL belongs to the reading domain, with domain confidence 0.39. Finally, the classification recognition score of speech X for the language-A output CCFD is 0.7 × (0.75 + 0.5)/2 + 0.71 = 1.1475, and the classification recognition score for the language-B output HJKL is 0.45 × (0.24 + 0.4)/2 + 0.39 = 0.534. Therefore, the recognition result of the speech X is CCFD in language A. This fusion rule is sketched below.
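In the sketch, averaging the two language scores before multiplying is taken directly from the arithmetic above.

```python
def classification_score_v3(rec_conf, lang_conf, char_lang_conf, domain_conf):
    # Score = recognition confidence x averaged language confidences
    #         + domain confidence of the text sequence.
    return rec_conf * (lang_conf + char_lang_conf) / 2 + domain_conf

score_a = classification_score_v3(0.7, 0.75, 0.5, 0.71)   # 1.1475 -> CCFD
score_b = classification_score_v3(0.45, 0.24, 0.4, 0.39)  # 0.534
```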
Step 390, taking the character sequence corresponding to the highest classification recognition score as the voice recognition result.
According to the technical scheme of the embodiment of the invention, the voice to be recognized is acquired and converted into acoustic features; the acoustic features are input into at least two language acoustic models, which output corresponding phoneme sequences, the language of the phoneme sequence output by each language acoustic model being different; the phoneme sequence of each language is converted into a corresponding character sequence, and the recognition confidence of the character sequence is determined; the voice is classified by language according to the acoustic features, and the language confidence that the voice belongs to each language is determined; for the character sequence of each language, character language classification is performed and the character language confidence that the character sequences belong to each language is determined; the language classification score of the voice for each language is determined according to the language confidence and the character language confidence, and the language with the highest language classification score is determined as the target language; all character sequences are domain-classified through the semantic model corresponding to the target language, and the domain confidence of each character sequence in each domain is determined; the classification recognition score of the voice for each language is determined according to the recognition confidence, the language classification score and the domain confidence; and the character sequence corresponding to the highest classification recognition score is taken as the recognition result of the voice. This solves the problem that the user has to switch manually, or in some other manual way, when recognizing two sections of voice in different languages, achieves high voice recognition accuracy, low time delay and a small number of modules without user switching, and, through domain classification in task-oriented voice dialogue in particular, allows errors from the acoustic classification model to be corrected according to text-level information, further improving accuracy and making the method suitable for speech recognition of mixed languages as well.
Fig. 4 is a block diagram of a speech recognition system according to an embodiment of the present invention. As shown in fig. 4, the process of using an embodiment of the present invention may be as follows: the voice data of the voice is converted into acoustic features through the voice signal preprocessing module, and the acoustic features are taken as the input of the multi-task learning voice recognition module. The acoustic features are converted into phoneme sequences by the multi-task learning voice recognition module, the phoneme sequences are converted into character sequences through the language models and the voice recognition decoder, and the information used for the fusion decision, which may include the recognition confidence and the language confidence of each language, is output. The character sequences in the multi-task learning voice recognition module are domain-classified through the semantic classification module, and the domain confidence is determined. The fusion decision module performs a fusion decision using the recognition confidence and language confidence from the multi-task learning voice recognition module and the domain confidence from the semantic classification module, and selects the final voice recognition result.
Fig. 5 is a block diagram of a speech recognition system according to an embodiment of the present invention. As shown in fig. 5, the process of using an embodiment of the present invention may specifically be as follows. The multi-task learning speech recognition module includes the acoustic classification model, the language acoustic models, and the language models. The input voice is passed through the BERT model to obtain deep acoustic features, and the acoustic features are input into the language acoustic model of language A, the language acoustic model of language B, and the acoustic classification model. Each language acoustic model obtains the phoneme sequence corresponding to the voice from the acoustic features, and the acoustic classification model judges, from the acoustic features, the language confidence that the voice belongs to each language. The language models of the different languages, used together with the decoder, convert the phoneme sequence of the corresponding language into a character sequence of that language and determine the recognition confidence of the character sequence. The semantic classification models of the different languages perform domain classification on the character sequences of the corresponding languages and determine the domain confidence in each domain. In the information fusion decision module, a decision is made using the recognition confidence, the language confidence and the domain confidence to determine the finally output character sequence; a minimal end-to-end sketch follows.
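The sketch uses the embodiment-two fusion rule; every model is a stub returning the worked example's values, so it illustrates only the data flow, not the trained networks.

```python
# Stubs standing in for the trained networks of fig. 5.
def bert_features(speech):
    return speech  # shared deep acoustic features (stub)

def acoustic_classifier(feats):
    return {"A": 0.75, "B": 0.24}  # language confidences

MODELS = {  # per-language acoustic / language / semantic models (stubs)
    "A": {"acoustic": lambda f: "c c f d",
          "language": lambda p: ("CCFD", 0.7),
          "semantic": lambda t: 0.71},
    "B": {"acoustic": lambda f: "h j k l",
          "language": lambda p: ("HJKL", 0.45),
          "semantic": lambda t: 0.39},
}

def recognize(speech):
    feats = bert_features(speech)
    lang_confs = acoustic_classifier(feats)
    best_text, best_score = None, float("-inf")
    for lang, m in MODELS.items():
        text, rec_conf = m["language"](m["acoustic"](feats))
        score = rec_conf * lang_confs[lang] + m["semantic"](text)
        if score > best_score:
            best_text, best_score = text, score
    return best_text

print(recognize("raw waveform"))  # -> CCFD
```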
Example four
Fig. 6 is a schematic structural diagram of a speech recognition apparatus according to a fourth embodiment of the present invention. With reference to fig. 6, the apparatus comprises: an acoustic feature conversion module 410, a phoneme sequence output module 420, a text sequence conversion module 430, a language classification module 440 and a recognition result acquisition module 450.
The acoustic feature conversion module 410 is configured to acquire a voice to be recognized, and convert the voice into an acoustic feature;
a phoneme sequence output module 420, configured to input the acoustic features into at least two language acoustic models, and output a corresponding phoneme sequence; the language of the phoneme sequence output by each language acoustic model is different;
a text sequence conversion module 430, configured to convert the phoneme sequences of the respective languages into corresponding text sequences, and determine recognition confidence of the text sequences;
a language classification module 440, configured to classify the language of the speech according to the acoustic features, and determine language confidence that the speech belongs to each language;
the recognition result obtaining module 450 is configured to determine a classification recognition score of the speech for each language according to the recognition confidence and the language confidence, and use a text sequence corresponding to the highest value of the classification recognition score as a recognition result of the speech.
Optionally, the recognition result obtaining module 450 includes: a first determination unit for the confidence of the domain and a first determination unit for the classification recognition score;
a first determining unit of the domain confidence degree, which is used for determining the domain confidence degree of the character sequence in each domain according to the character sequence of each language;
and the first determination unit of the classification recognition score is used for determining the classification recognition score of the voice aiming at each language according to the recognition confidence coefficient, the language confidence coefficient and the domain confidence coefficient.
Optionally, the first determining unit for domain confidence includes: a domain confidence determining subunit;
and the domain confidence determining subunit is used for performing domain classification on the character sequences through the semantic models corresponding to the languages of the character sequences to obtain the domain confidence of the character sequences in each domain.
Optionally, the recognition result obtaining module 450 includes: a character language confidence determining unit, a target language determining unit, a second domain confidence determining unit and a second classification recognition score determining unit;
a text language confidence determining unit, configured to perform text language classification on the character sequence of each language, and determine the text language confidence that the character sequences belong to each language;
the target language determining unit is used for determining language classification scores of voices belonging to various languages according to the language confidence degrees and the character language confidence degrees, and determining the language corresponding to the highest value of the language classification scores as the target language;
the second determining unit of the domain confidence is used for carrying out domain classification on all the character sequences through the semantic model corresponding to the target language and determining the domain confidence corresponding to each domain of the character sequences;
and the second determination unit of the classification recognition score is used for determining the classification recognition score of the voice aiming at each language according to the recognition confidence, the language classification score and the domain confidence.
Optionally, the first determining unit for the classification recognition score includes: a calculating subunit and a classification recognition score determining subunit;
the calculating subunit is used for determining the product of the recognition confidence and the language confidence, and determining the arithmetic sum of the product and the highest domain confidence;
and the classification recognition score determining subunit is used for taking the arithmetic sum as the classification recognition score of the speech for each language.
Optionally, the acoustic feature conversion module 410 includes: an acoustic feature conversion unit;
and the acoustic feature conversion unit is configured to input the speech into a speech processing deep learning model to obtain the acoustic features of the speech.
Optionally, the text sequence conversion module 430 includes: a text sequence conversion unit;
and the text sequence conversion unit is configured to input the phoneme sequence of each language into the language model of the corresponding language, to obtain the text sequence corresponding to each phoneme sequence, and to determine the recognition confidence of that text sequence.
Optionally, the speech processing deep learning model includes a Bidirectional Encoder Representations from Transformers (BERT) model.
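As a rough illustration of the acoustic-feature step, the sketch below uses a small generic PyTorch Transformer encoder over mel-spectrogram frames. The architecture, dimensions, and class name are assumptions, since the patent specifies only that a BERT-style deep learning model produces the acoustic features.

import torch

class SpeechEncoder(torch.nn.Module):
    # Hypothetical stand-in for the speech processing deep learning model.
    def __init__(self, n_mels=80, d_model=256):
        super().__init__()
        self.proj = torch.nn.Linear(n_mels, d_model)
        layer = torch.nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = torch.nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, mel_frames):
        # mel_frames: (batch, time, n_mels) -> acoustic features: (batch, time, d_model)
        return self.encoder(self.proj(mel_frames))

features = SpeechEncoder()(torch.randn(1, 120, 80))   # 120 frames of 80-dim mels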
The speech recognition apparatus provided by this embodiment of the present invention can perform the speech recognition method provided by any embodiment of the present invention, and has the functional modules and beneficial effects corresponding to performing that method.
Embodiment Five
Fig. 7 is a schematic structural diagram of a speech recognition device according to a fifth embodiment of the present invention. As shown in Fig. 7, the speech recognition device includes:
one or more processors 510 (one processor 510 is shown in Fig. 7 by way of example);
a memory 520;
the device may further include: an input device 530 and an output device 550.
The processor 510, the memory 520, the input device 530, and the output device 550 in the device may be connected by a bus or by other means; connection by a bus is taken as the example in Fig. 7.
The memory 520, as a non-transitory computer-readable storage medium, may be used to store software programs, computer-executable programs, and modules, such as the program instructions/modules corresponding to the speech recognition method in the embodiments of the present invention (for example, the acoustic feature conversion module 410, the phoneme sequence output module 420, the text sequence conversion module 430, the language classification module 440, and the recognition result obtaining module 450 shown in Fig. 3). The processor 510 runs the software programs, instructions, and modules stored in the memory 520 to execute the various functional applications and data processing of the computer device, that is, to implement the speech recognition method of the above method embodiments, namely:
acquiring speech to be recognized, and converting the speech into acoustic features;
inputting the acoustic features into at least two language acoustic models, and outputting corresponding phoneme sequences, wherein each language acoustic model outputs a phoneme sequence of a different language;
converting the phoneme sequence of each language into a corresponding text sequence, and determining the recognition confidence of the text sequence;
performing language classification on the speech according to the acoustic features, and determining the language confidence with which the speech belongs to each language;
and determining the classification recognition score of the speech for each language according to the recognition confidence and the language confidence, and taking the text sequence corresponding to the highest classification recognition score as the recognition result of the speech.
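A compact end-to-end sketch of these five steps is given below. Every model object (the feature extractor, the per-language acoustic and language models, and the language classifier) is a hypothetical stand-in with an assumed interface; the patent does not prescribe these APIs, and the simple product used to combine the confidences stands in for the fuller scoring described above.

def recognize(speech, feature_extractor, acoustic_models, language_models, lang_classifier):
    feats = feature_extractor(speech)                     # step 1: speech -> acoustic features
    hyps = {}
    for lg, am in acoustic_models.items():                # step 2: one acoustic model per language
        phonemes = am.decode(feats)
        text, rec_conf = language_models[lg].decode(phonemes)   # step 3: phonemes -> text + confidence
        hyps[lg] = (text, rec_conf)
    lang_conf = lang_classifier(feats)                    # step 4: language confidences from features
    scores = {lg: hyps[lg][1] * lang_conf[lg] for lg in hyps}   # step 5 (simplified combination)
    best = max(scores, key=scores.get)
    return hyps[best][0]                                  # text sequence with the highest score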
The memory 520 may include a program storage area and a data storage area, wherein the program storage area may store an operating system and an application program required by at least one function, and the data storage area may store data created according to the use of the computer device, and the like. Further, the memory 520 may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage device. In some embodiments, the memory 520 may optionally include memory located remotely from the processor 510, which may be connected to the terminal device through a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The input device 530 may be used to receive input numeric or character information and generate key signal inputs related to user settings and function control of the computer apparatus. The output means 550 may comprise a display device such as a display screen.
Embodiment Six
A sixth embodiment of the present invention provides a computer-readable storage medium on which a computer program is stored; when executed by a processor, the computer program implements the speech recognition method according to the embodiments of the present invention:
acquiring speech to be recognized, and converting the speech into acoustic features;
inputting the acoustic features into at least two language acoustic models, and outputting corresponding phoneme sequences, wherein each language acoustic model outputs a phoneme sequence of a different language;
converting the phoneme sequence of each language into a corresponding text sequence, and determining the recognition confidence of the text sequence;
performing language classification on the speech according to the acoustic features, and determining the language confidence with which the speech belongs to each language;
and determining the classification recognition score of the speech for each language according to the recognition confidence and the language confidence, and taking the text sequence corresponding to the highest classification recognition score as the recognition result of the speech.
Any combination of one or more computer-readable media may be employed. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object-oriented programming language such as Java, Smalltalk, or C++, as well as conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
It is to be noted that the foregoing is only illustrative of the preferred embodiments of the present invention and the technical principles employed. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, although the present invention has been described in greater detail by the above embodiments, the present invention is not limited to the above embodiments, and may include other equivalent embodiments without departing from the spirit of the present invention, and the scope of the present invention is determined by the scope of the appended claims.

Claims (6)

1. A speech recognition method, comprising:
acquiring speech to be recognized, and converting the speech into acoustic features;
inputting the acoustic features into at least two language acoustic models, and outputting corresponding phoneme sequences, wherein each language acoustic model outputs a phoneme sequence of a different language;
converting the phoneme sequence of each language into a corresponding text sequence, and determining the recognition confidence of the text sequence;
performing language classification on the speech according to the acoustic features, and determining the language confidence with which the speech belongs to each language;
determining a classification recognition score of the speech for each language according to the recognition confidence and the language confidence, and taking the text sequence corresponding to the highest classification recognition score as the recognition result of the speech;
wherein the determining the classification recognition score of the speech for each language according to the recognition confidence and the language confidence comprises:
performing, for the text sequence of each language, text language classification on the text sequence, and determining the text language confidence with which the text sequence belongs to each language;
determining language classification scores with which the speech belongs to each language according to the language confidence and the text language confidence, and determining the language corresponding to the highest language classification score as a target language;
performing domain classification on all the text sequences through the semantic model corresponding to the target language, and determining the domain confidence of each text sequence for each domain;
and determining the classification recognition score of the speech for each language according to the recognition confidence, the language classification score, and the domain confidence.
2. The method of claim 1, wherein converting the speech into acoustic features comprises:
inputting the speech into a speech processing deep learning model to obtain the acoustic features of the speech;
and wherein converting the phoneme sequence of each language into a corresponding text sequence and determining the recognition confidence of the text sequence comprises:
inputting the phoneme sequence of each language into the language model of the corresponding language, to obtain the text sequence corresponding to each phoneme sequence, and determining the recognition confidence of the text sequence.
3. The method of claim 2, wherein the speech processing deep learning model comprises a Bidirectional Encoder Representations from Transformers (BERT) model.
4. A speech recognition apparatus, comprising:
an acoustic feature conversion module, configured to acquire speech to be recognized and convert the speech into acoustic features;
a phoneme sequence output module, configured to input the acoustic features into at least two language acoustic models and output corresponding phoneme sequences, wherein each language acoustic model outputs a phoneme sequence of a different language;
a text sequence conversion module, configured to convert the phoneme sequence of each language into a corresponding text sequence and determine the recognition confidence of the text sequence;
a language classification module, configured to perform language classification on the speech according to the acoustic features and determine the language confidence with which the speech belongs to each language;
a recognition result obtaining module, configured to determine, according to the recognition confidence and the language confidence, the classification recognition score of the speech for each language, and to take the text sequence corresponding to the highest classification recognition score as the recognition result of the speech;
wherein the recognition result obtaining module includes: a text language confidence determining unit, a target language determining unit, a second domain confidence determining unit, and a second classification recognition score determining unit;
the text language confidence determining unit is configured to perform, for the text sequence of each language, text language classification on the text sequence, and determine the text language confidence with which the text sequence belongs to each language;
the target language determining unit is configured to determine language classification scores with which the speech belongs to each language according to the language confidence and the text language confidence, and determine the language corresponding to the highest language classification score as the target language;
the second domain confidence determining unit is configured to perform domain classification on all the text sequences through the semantic model corresponding to the target language, and determine the domain confidence of each text sequence for each domain;
and the second classification recognition score determining unit is configured to determine the classification recognition score of the speech for each language according to the recognition confidence, the language classification score, and the domain confidence.
5. A speech recognition device, comprising:
one or more processors;
a storage device for storing one or more programs,
the one or more programs, when executed by the one or more processors, cause the one or more processors to implement a speech recognition method as recited in any one of claims 1-3.
6. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out a speech recognition method as claimed in any one of claims 1 to 3.
CN202010217558.0A 2020-03-25 2020-03-25 Voice recognition method, device, equipment and storage medium Active CN111402861B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010217558.0A CN111402861B (en) 2020-03-25 2020-03-25 Voice recognition method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010217558.0A CN111402861B (en) 2020-03-25 2020-03-25 Voice recognition method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN111402861A CN111402861A (en) 2020-07-10
CN111402861B true CN111402861B (en) 2022-11-15

Family

ID=71431265

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010217558.0A Active CN111402861B (en) 2020-03-25 2020-03-25 Voice recognition method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111402861B (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112017645B (en) * 2020-08-31 2024-04-26 广州市百果园信息技术有限公司 Voice recognition method and device
CN112349288A (en) * 2020-09-18 2021-02-09 昆明理工大学 Chinese speech recognition method based on pinyin constraint joint learning
CN112466280B (en) * 2020-12-01 2021-12-24 北京百度网讯科技有限公司 Voice interaction method and device, electronic equipment and readable storage medium
CN112836522B (en) * 2021-01-29 2023-07-21 青岛海尔科技有限公司 Method and device for determining voice recognition result, storage medium and electronic device
CN112908333B (en) * 2021-05-08 2021-07-16 鹏城实验室 Speech recognition method, device, equipment and computer readable storage medium
CN113870839B (en) * 2021-09-29 2022-05-03 北京中科智加科技有限公司 Language identification device of language identification model based on multitask
CN115132182B (en) * 2022-05-24 2024-02-23 腾讯科技(深圳)有限公司 Data identification method, device, equipment and readable storage medium
CN115472165A (en) * 2022-07-07 2022-12-13 脸萌有限公司 Method, apparatus, device and storage medium for speech recognition
CN115394288B (en) * 2022-10-28 2023-01-24 成都爱维译科技有限公司 Language identification method and system for civil aviation multi-language radio land-air conversation

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104681036A (en) * 2014-11-20 2015-06-03 苏州驰声信息科技有限公司 System and method for detecting language voice frequency
CN110148416A (en) * 2019-04-23 2019-08-20 腾讯科技(深圳)有限公司 Audio recognition method, device, equipment and storage medium
CN110491382A (en) * 2019-03-11 2019-11-22 腾讯科技(深圳)有限公司 Audio recognition method, device and interactive voice equipment based on artificial intelligence
CN110634487A (en) * 2019-10-24 2019-12-31 科大讯飞股份有限公司 Bilingual mixed speech recognition method, device, equipment and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9129591B2 (en) * 2012-03-08 2015-09-08 Google Inc. Recognizing speech in multiple languages


Also Published As

Publication number Publication date
CN111402861A (en) 2020-07-10


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 215021 building 14, Tengfei Science Park, No. 388, Xinping street, Suzhou Industrial Park, Suzhou City, Jiangsu Province

Applicant after: Sipic Technology Co.,Ltd.

Address before: 215021 building 14, Tengfei Science Park, No. 388, Xinping street, Suzhou Industrial Park, Suzhou City, Jiangsu Province

Applicant before: AI SPEECH Co.,Ltd.

GR01 Patent grant