CN116343755A - Domain-adaptive speech recognition method, device, computer equipment and storage medium - Google Patents


Info

Publication number
CN116343755A
CN116343755A (application CN202310313176.1A)
Authority
CN
China
Prior art keywords: recognition, target, candidate, sentence, score
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310313176.1A
Other languages
Chinese (zh)
Inventor
赵梦原
王健宗
程宁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd
Priority claimed from application CN202310313176.1A
Publication of CN116343755A
Legal status: Pending


Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/01: Assessment or evaluation of speech recognition systems
    • G10L17/00: Speaker identification or verification
    • G10L17/04: Training, enrolment or model building
    • G10L19/00: Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/008: Multichannel audio signal coding or decoding using interchannel correlation to reduce redundancy, e.g. joint-stereo, intensity-coding or matrixing

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Signal Processing (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a domain-adaptive speech recognition method, a device, computer equipment, and a storage medium. The method comprises the following steps: acquiring speech data to be recognized; performing primary decoding on the speech data to be recognized to obtain a plurality of candidate recognition results, and obtaining an optimal recognition result from the plurality of candidate recognition results; performing domain judgment according to the optimal recognition result, and determining a target domain; and performing secondary decoding on the candidate recognition results with a target language model corresponding to the target domain, to obtain the target recognition result. Because the candidate recognition results are decoded a second time with a language model matched to the target domain, the method adapts to the varying domains a user speaks about, improves accuracy in multi-domain and cross-domain recognition, improves the robustness of speech recognition, and provides a better user experience.

Description

Domain-adaptive speech recognition method, device, computer equipment and storage medium
Technical Field
The present invention relates to the field of speech recognition technology, and in particular, to a method, an apparatus, a computer device, and a storage medium for domain-adaptive speech recognition.
Background
Speech recognition technology has been widely deployed in many contexts with good results. However, because of the complexity of human language and of the speech signal itself, and given current model performance, a speech recognition system cannot guarantee high recognition accuracy in every scenario. An existing speech recognition model is generally accurate only within the specific domain it was built for; once multiple domains are involved, or the speech domain is unrestricted, its accuracy drops sharply. For example, a speech recognition system for the financial domain recognizes finance-related speech accurately, but its accuracy on speech about music, games, or other domains is much lower. In practical application scenarios the content a user utters spans a very wide range of domains, and a user may deal with content from different domains at different moments, so an ordinary speech recognition system struggles to meet these complex usage requirements, which degrades the user experience.
Disclosure of Invention
The embodiments of the invention provide a domain-adaptive speech recognition method, a device, computer equipment, and a storage medium, which address the accuracy problem of multi-domain and cross-domain speech recognition.
A domain-adaptive speech recognition method, comprising:
acquiring voice data to be recognized;
performing primary decoding on the voice data to be recognized to obtain a plurality of candidate recognition results, and obtaining an optimal recognition result from the plurality of candidate recognition results;
performing domain judgment according to the optimal recognition result, and determining a target domain;
and adopting a target language model corresponding to the target field to perform secondary decoding on the candidate recognition results to obtain target recognition results.
A domain-adaptive speech recognition device, comprising:
a to-be-recognized speech data acquisition module, configured to acquire speech data to be recognized;
a primary decoding result acquisition module, configured to perform primary decoding on the speech data to be recognized, obtain a plurality of candidate recognition results, and obtain an optimal recognition result from the plurality of candidate recognition results;
a target domain determining module, configured to perform domain judgment according to the optimal recognition result and determine a target domain;
and a target recognition result acquisition module, configured to perform secondary decoding on the plurality of candidate recognition results with a target language model corresponding to the target domain, to obtain a target recognition result.
A computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, the processor implementing the domain-adaptive speech recognition method as described above when executing the computer program.
A computer readable storage medium storing a computer program which when executed by a processor implements the domain adaptive speech recognition method described above.
In the above domain-adaptive speech recognition method, device, computer equipment, and storage medium, the plurality of candidate recognition results are decoded a second time with the target language model corresponding to the target domain, so the resulting target recognition result is more accurate and better matches actual usage. Because the target domain of the speech data is judged automatically from its optimal recognition result, the method adapts to the varying domains a user speaks about, improves the accuracy of multi-domain and cross-domain speech recognition, improves the robustness of speech recognition, and provides a better user experience.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the description of the embodiments of the present invention will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic diagram of an application environment of a domain-adaptive speech recognition method according to an embodiment of the present invention;
FIG. 2 is a flow chart of a method of domain adaptive speech recognition in accordance with an embodiment of the present invention;
FIG. 3 is another flow chart of a method of domain adaptive speech recognition in an embodiment of the present invention;
FIG. 4 is another flow chart of a method of domain adaptive speech recognition in an embodiment of the present invention;
FIG. 5 is another flow chart of a method of domain adaptive speech recognition in an embodiment of the present invention;
FIG. 6 is another flow chart of a method of domain adaptive speech recognition in an embodiment of the present invention;
FIG. 7 is another flow chart of a method of domain adaptive speech recognition in an embodiment of the present invention;
FIG. 8 is another flow chart of a method of domain adaptive speech recognition in an embodiment of the present invention;
FIG. 9 is a schematic diagram of a domain-adaptive speech recognition device according to an embodiment of the present invention;
FIG. 10 is a schematic diagram of a computer device in accordance with an embodiment of the invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The domain-adaptive speech recognition method provided by the embodiments of the invention can be applied in the application environment shown in Fig. 1. Specifically, the method is applied in a domain-adaptive speech recognition system which, as shown in Fig. 1, comprises a client and a server that communicate over a network to realize domain-adaptive speech recognition. The client, also called the user side, is the program that pairs with the server to provide local services to the user. The client may be installed on, but is not limited to, personal computers, notebook computers, smartphones, tablet computers, and portable wearable devices. The server may be implemented as a stand-alone server or as a cluster composed of a plurality of servers.
In one embodiment, as shown in fig. 2, a method for domain-adaptive speech recognition is provided, and the method is applied to the server in fig. 1, and includes the following steps:
S201: acquiring speech data to be recognized;
S202: performing primary decoding on the speech data to be recognized to obtain a plurality of candidate recognition results, and obtaining an optimal recognition result from the plurality of candidate recognition results;
S203: performing domain judgment according to the optimal recognition result, and determining a target domain;
S204: performing secondary decoding on the plurality of candidate recognition results with the target language model corresponding to the target domain, to obtain the target recognition result.
The speech data to be recognized is the audio input on which recognition is to be performed; it may be speech data from multiple domains or from different domains at different times, and its content may cover a variety of domains.
As an example, in step S201, the server may acquire voice data to be recognized, where the voice data to be recognized may be multi-domain voice data or cross-domain voice data. In this example, the server obtains the voice data to be recognized in each field as the input data of the subsequent primary decoding, so as to ensure the feasibility of the subsequent primary decoding.
Primary decoding is the process of recognizing the speech data to produce a plurality of speech recognition results. The candidate recognition results are the recognition results with the highest recognition probabilities after primary decoding; they can be understood as the highest-probability recognition paths, or original recognition sentences. The optimal recognition result is the single result with the best recognition effect among the candidate recognition results.
As an example, in step S202, the server performs a decoding process on the obtained voice data to be recognized, obtains a plurality of candidate recognition results, and obtains an optimal recognition result from the plurality of candidate recognition results. In this example, when the voice data to be recognized is decoded once, firstly, the voice data to be recognized is processed by using an acoustic model, and an acoustic model processing result is obtained; processing the voice data to be recognized by using the universal language model to obtain a universal language model processing result; selecting a plurality of recognition results with good recognition effects from the acoustic model processing results and the general language model processing results as a plurality of candidate recognition results; and selecting one recognition result with the best recognition effect from the plurality of candidate recognition results as the optimal recognition result.
In this example, the acoustic model and the general language model each process the speech data to be recognized, and the recognition results that score best are kept as the plurality of candidate recognition results; this makes the input to the subsequent secondary decoding more realistic and the secondary decoding result more accurate. The single best result among the candidates is kept as the optimal recognition result, which is later used for domain judgment; an accurate optimal result makes the judged target domain accurate, which in turn makes the target language model used in secondary decoding well targeted and helps ensure the accuracy of the final target recognition result.
As an example, in step S203, the server performs domain judgment on the voice data to be recognized according to the optimal recognition result, and obtains the target domain of the voice data to be recognized. In this example, the server uses the domain judgment model to perform the domain judgment process on the input optimal recognition result as follows: inputting the optimal recognition result into the domain judgment model, outputting probability values of the domains corresponding to the optimal recognition result, and selecting the domain with the largest probability value as the target domain.
In the example, the field judgment model is used to judge the target field according to the optimal recognition result, so that feasibility is provided for secondary decoding, and accuracy of the secondary decoding is guaranteed.
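The domain-judgment step just described reduces to an argmax over per-domain probabilities. The following minimal sketch assumes the (otherwise unspecified) domain-judgment model has already emitted a probability for each domain given the optimal recognition result; the domain names and values are hypothetical.

```python
def judge_domain(domain_probs: dict) -> str:
    """Step S203: pick the domain whose probability value is largest.

    `domain_probs` maps each candidate domain to the probability the
    domain-judgment model assigned it for the optimal recognition result.
    """
    return max(domain_probs, key=domain_probs.get)


# Hypothetical model output for a finance-related utterance.
probs = {"finance": 0.82, "general": 0.10, "music": 0.05, "games": 0.03}
target_domain = judge_domain(probs)  # -> "finance"
```

The argmax choice mirrors the text: the domain with the largest probability value becomes the target domain.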
The target language model refers to a language model corresponding to the target field, and can be understood as a language model obtained by training data corresponding to the target field, and the language model has higher recognition accuracy on input data of the target field. The secondary decoding refers to a process of identifying a plurality of candidate identification results to obtain a target identification result.
As an example, in step S204, the server adopts a target language model corresponding to the target domain to perform secondary decoding on the multiple candidate recognition results, so as to obtain a target recognition result, which specifically includes: on the basis of acquiring the target field, the server obtains a target language model corresponding to the target field, identifies a plurality of candidate identification results by using the target language model, obtains a plurality of identification results and scores corresponding to the identification results, and selects the identification result with the highest score as the target identification result.
In this example, the target language model corresponding to the target domain is used to identify and score the multiple candidate identification results, so as to obtain the target identification result.
In the domain-adaptive speech recognition method provided by this embodiment, the speech data is first decoded once to obtain a plurality of candidate recognition results and the optimal recognition result, which makes the subsequent domain judgment feasible, keeps the secondary decoding process accurate, and makes the target recognition result more realistic. Domain judgment is then performed on the optimal recognition result to determine the target domain, and the candidate recognition results are decoded a second time with the target language model corresponding to that domain, so the target recognition result obtained is more accurate and more practical. Understandably, because the target domain of an utterance is judged automatically from its optimal recognition result, the system adapts to the varying domains a user speaks about, improves the accuracy of multi-domain and cross-domain speech recognition, improves the robustness, accuracy, and stability of speech recognition, and provides a better user experience.
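Steps S201 to S204 can be sketched as the following two-pass pipeline. This is an illustrative skeleton, not the patented implementation: the callables `first_pass`, `judge_domain`, and `domain_lms` are hypothetical placeholders for the acoustic-plus-general-LM decoder, the domain-judgment model, and the per-domain target language models.

```python
from typing import Callable, Dict, List, Sequence, Tuple

def domain_adaptive_recognize(
    speech: bytes,
    first_pass: Callable[[bytes], List[Tuple[str, float]]],
    judge_domain: Callable[[str], str],
    domain_lms: Dict[str, Callable[[Sequence[str]], List[Tuple[str, float]]]],
) -> str:
    """Two-pass, domain-adaptive decoding (steps S201 to S204)."""
    # S202: primary decoding yields scored N-best candidates, best first.
    candidates = sorted(first_pass(speech), key=lambda p: p[1], reverse=True)
    best_sentence = candidates[0][0]          # the optimal recognition result
    # S203: judge the target domain from the optimal recognition result.
    domain = judge_domain(best_sentence)
    # S204: rescore all candidates with the target-domain language model.
    rescored = domain_lms[domain]([s for s, _ in candidates])
    return max(rescored, key=lambda p: p[1])[0]


# Toy stand-ins for the real models, for illustration only.
first_pass = lambda audio: [("bye stock now", 0.70), ("buy stock now", 0.90)]
judge = lambda sentence: "finance"
lms = {"finance": lambda sents: [(s, 1.0 if "buy" in s else 0.2) for s in sents]}
result = domain_adaptive_recognize(b"<pcm audio>", first_pass, judge, lms)  # -> "buy stock now"
```

The toy target LM prefers the finance-plausible sentence, so the second pass corrects the first pass's candidate ranking, which is the point of the method.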
In one embodiment, as shown in fig. 3, step S202, namely, decoding the voice data to be recognized once to obtain a plurality of candidate recognition results, and obtaining an optimal recognition result from the plurality of candidate recognition results, includes:
S301: decoding the speech data to be recognized with an acoustic model, to obtain a plurality of original recognition sentences and a target acoustic score for each original recognition sentence;
S302: decoding the speech data to be recognized with a general language model, to obtain the plurality of original recognition sentences and a first language score for each original recognition sentence;
S303: processing the target acoustic score and the first language score of each original recognition sentence, to obtain a first recognition score for each original recognition sentence;
S304: sorting the plurality of original recognition sentences by first recognition score, and taking the top N original recognition sentences, where N is greater than or equal to 2, as the plurality of candidate recognition results;
S305: selecting the original recognition sentence with the largest first recognition score from the candidate recognition results as the optimal recognition result.
As an example, in step S301, the server decodes the speech data to be recognized using the acoustic model to obtain a plurality of original recognition sentences, and at the same time, obtains a score corresponding to each original recognition sentence as the target acoustic score corresponding to each original recognition sentence. The target acoustic score herein may be understood as the score of each original recognition sentence output by the acoustic model recognition.
In this example, the server decodes the voice data to be recognized by using an acoustic model, and obtains a plurality of original recognition sentences and target acoustic scores corresponding to each original recognition sentence. For example, the acoustic model identifies a certain voice data to be identified, so as to obtain X original identification sentences, wherein X is more than or equal to 2, and each original identification sentence outputs a corresponding target acoustic score.
As an example, in step S302, the server decodes the speech data to be recognized using the universal language model to obtain a plurality of original recognition sentences, and at the same time, obtains a score corresponding to each original recognition sentence as the first language score corresponding to each original recognition sentence.
In one embodiment, the server decodes the voice data to be recognized by using a universal language model, and obtains a plurality of original recognition sentences and a first language score corresponding to each original recognition sentence. For example, the universal language model identifies a certain voice data to be identified to obtain Y original identification sentences, wherein Y is more than or equal to 2, and each original identification sentence outputs a corresponding first language score. The first language score herein may be understood as the score of each original recognition sentence output by the generic language model recognition.
As an example, in step S303, the server performs fusion processing on the target acoustic scores and the first language scores corresponding to the plurality of original recognition sentences, and obtains the scores corresponding to the plurality of original recognition sentences as the first recognition scores corresponding to the original recognition sentences.
In an embodiment, the server fuses the target acoustic scores and the first language scores corresponding to the plurality of original recognition sentences to obtain the corresponding first recognition scores. For example, the acoustic model recognizes some speech data and obtains X original recognition sentences with their target acoustic scores, where X is greater than or equal to 2; the general language model recognizes the same speech data and obtains Y original recognition sentences with their first language scores, where Y is greater than or equal to 2. Among the X original recognition sentences, the target acoustic score of each original recognition sentence is weighted by an acoustic weight w1; among the Y original recognition sentences, the first language score of each original recognition sentence is weighted by a language weight w2; the two weighted terms are summed to give the first recognition score of each original recognition sentence. Finally, the plurality of original recognition sentences and their corresponding first recognition scores are obtained.
As an example, in step S304, the server sorts the first recognition scores corresponding to the plurality of original recognition sentences, and determines the first N original recognition sentences with the larger first recognition scores as a plurality of candidate recognition results, where N is greater than or equal to 2.
In one embodiment, the server uses the first N original recognition sentences with the first recognition scores being larger in the original recognition sentences as a plurality of candidate recognition results, wherein N is more than or equal to 2. For example, the first N original recognition sentences with larger first recognition scores are selected as the plurality of candidate recognition results on the basis of acquiring the first recognition scores corresponding to the plurality of original recognition sentences.
As an example, in step S305, the server selects one original recognition sentence with the largest first recognition score from the plurality of candidate recognition results, and determines the selected original recognition sentence as the optimal recognition result, specifically, determines one original recognition sentence with the largest first recognition score from the plurality of candidate recognition results as the optimal recognition result.
In one embodiment, the server uses an original recognition sentence with the largest first recognition score as the optimal recognition result. For example, the N candidate recognition results include first recognition scores corresponding to the N original recognition sentences, and one of the N candidate recognition results, in which the first recognition score is the largest, may be selected as the optimal recognition result.
In the domain-adaptive speech recognition method provided by this embodiment, the acoustic model and the general language model each decode the speech data to obtain a plurality of original recognition sentences; the acoustic model can recognize the user's accent, which reduces the influence of accent on the primary decoding result. Each original recognition sentence is scored by both models, the scores are fused into a first recognition score, the sentences are sorted by first recognition score, and the top N are taken as the plurality of candidate recognition results, which makes the input to the subsequent secondary decoding more realistic and its result more accurate. The original recognition sentence with the largest first recognition score is then selected from the candidates as the optimal recognition result, which keeps the subsequent domain judgment accurate, the judged target domain accurate, and the target language model used in secondary decoding well matched.
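The weighted fusion and top-N selection of steps S303 to S305 can be sketched as follows. The weight values w1 and w2 and the toy scores are illustrative assumptions (the text specifies a weighted sum but no concrete values). The patent also does not say how sentences hypothesized by only one of the two models are handled, so this sketch fuses only sentences scored by both.

```python
def fuse_first_pass(acoustic: dict, language: dict, w1: float = 0.6,
                    w2: float = 0.4, n: int = 2):
    """Steps S303 to S305: weighted score fusion, top-N, and the best result.

    `acoustic` maps each original recognition sentence to its target
    acoustic score; `language` maps sentences to first language scores.
    """
    fused = {s: w1 * acoustic[s] + w2 * language[s]
             for s in acoustic.keys() & language.keys()}
    ranked = sorted(fused.items(), key=lambda kv: kv[1], reverse=True)
    candidates = ranked[:n]      # the N candidate recognition results
    optimal = candidates[0]      # the optimal recognition result
    return candidates, optimal


# Illustrative scores for three original recognition sentences.
acoustic = {"check my balance": 1.0, "czech my balance": 0.5, "check my ballads": 0.8}
language = {"check my balance": 0.6, "czech my balance": 0.4, "check my ballads": 0.2}
cands, best = fuse_first_pass(acoustic, language, n=2)  # best sentence: "check my balance"
```

With these numbers the fused scores are 0.84, 0.46, and 0.56 respectively, so the top two sentences become the candidate recognition results and the highest-scoring one becomes the optimal recognition result.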
In one embodiment, as shown in fig. 4, step S204, that is, performing secondary decoding on a plurality of candidate recognition results by using a target language model corresponding to a target domain, obtains a target recognition result, includes:
S401: performing secondary decoding on the plurality of candidate recognition results with the target language model corresponding to the target domain, to obtain a plurality of candidate recognition sentences and a second language score for each candidate recognition sentence;
S402: processing the second language score and the target acoustic score of each candidate recognition sentence, to obtain a second recognition score for each candidate recognition sentence;
S403: sorting the plurality of candidate recognition sentences by second recognition score, and determining the candidate recognition sentence with the largest second recognition score as the target recognition result.
The candidate recognition result comprises a candidate recognition sentence and a target acoustic score corresponding to the candidate recognition sentence.
As an example, in step S401, the server confirms the language model corresponding to the target domain as the target language model, and performs secondary decoding recognition on the multiple candidate recognition results by using the target language model, so as to obtain multiple candidate recognition sentences and scores corresponding to each candidate recognition sentence, and the scores are used as the second language score of each candidate recognition sentence.
In one embodiment, the server uses the target language model to perform secondary decoding recognition on the multiple candidate recognition results to obtain multiple candidate recognition sentences and second language scores corresponding to each candidate recognition sentence. For example, on the basis of selecting a plurality of candidate recognition results with the number of N, performing secondary decoding recognition on the plurality of candidate recognition results by using a target language model to obtain N candidate recognition sentences and a second language score corresponding to each candidate recognition sentence.
As an example, in step S402, the server performs fusion processing on the second language score and the target acoustic score corresponding to each candidate recognition sentence, to obtain a second recognition score corresponding to each candidate recognition sentence.
In an embodiment, the server fuses the second language score of each candidate recognition sentence with its target acoustic score to obtain the second recognition score of each candidate recognition sentence. In this example, the second language scores of the N candidate recognition sentences are fused with the target acoustic scores, obtaining the second recognition scores of the N candidate recognition sentences. For example, for the candidate recognition sentences corresponding to the N candidate recognition results, the target acoustic score of each candidate recognition sentence is weighted by the acoustic weight w1, the second language score of each candidate recognition sentence is weighted by the language weight w2, and the two weighted terms are summed to give the second recognition score of each candidate recognition sentence; finally, the second recognition scores of the N candidate recognition sentences are obtained.
As an example, in step S403, the server ranks the plurality of candidate recognition sentences according to the second recognition scores, and determines the candidate recognition sentence with the largest second recognition score as the target recognition result. In an embodiment, the candidate recognition sentence with the largest second recognition score is selected as the target recognition result of the voice data to be recognized. For example, on the basis of obtaining second recognition scores corresponding to the N candidate recognition sentences, a candidate recognition sentence with the largest second recognition score among the N candidate recognition sentences is selected as the target recognition result.
In the domain adaptive speech recognition method provided by the embodiment, a target language model corresponding to a target domain is used for recognizing a plurality of candidate recognition results, so that the recognition accuracy of the obtained plurality of candidate recognition sentences and second language scores is ensured; and fusing the second language score corresponding to each candidate recognition sentence with the target acoustic score to obtain a second recognition score, and selecting the candidate recognition sentence corresponding to the maximum second recognition score as a target recognition result corresponding to the voice data to be recognized, so that the recognition result of the obtained voice data to be recognized is more accurate and more practical.
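A minimal sketch of the second-pass rescoring in steps S402 and S403, assuming the same weighted-sum fusion with acoustic weight w1 and language weight w2 as in the first pass; the weight values and scores below are illustrative assumptions.

```python
def second_pass_result(acoustic: dict, second_language: dict,
                       w1: float = 0.6, w2: float = 0.4) -> str:
    """Steps S402 and S403: fuse each candidate recognition sentence's
    target acoustic score with its second language score, then return the
    sentence whose second recognition score is largest, i.e. the target
    recognition result."""
    second_scores = {s: w1 * acoustic[s] + w2 * second_language[s]
                     for s in acoustic}
    return max(second_scores, key=second_scores.get)


# Illustrative scores for N = 2 candidate recognition sentences.
acoustic = {"transfer funds": 0.9, "transfer fans": 0.8}
second_language = {"transfer funds": 0.7, "transfer fans": 0.3}
target = second_pass_result(acoustic, second_language)  # -> "transfer funds"
```

Here the target-domain language model's second language score breaks the near-tie between the acoustically similar candidates.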
In one embodiment, as shown in fig. 5, step S402, that is, processing the second language score and the target acoustic score corresponding to each candidate recognition sentence, obtains the second recognition score corresponding to each candidate recognition sentence, includes:
s501: determining a target language score corresponding to each candidate recognition sentence according to the first language score and the second language score corresponding to each candidate recognition sentence;
s502: and processing the target language score corresponding to each candidate recognition sentence and the target acoustic score corresponding to each candidate recognition sentence to acquire a second recognition score corresponding to each candidate recognition sentence.
The candidate recognition result further comprises a first language score corresponding to the candidate recognition sentence.
As an example, in step S501, after performing secondary decoding on the plurality of candidate recognition results with the target language model corresponding to the target domain to obtain the second language score of each candidate recognition sentence, the server may fuse those second language scores with the first language scores that the general language model produced in step S302, obtaining the target language score of each candidate recognition sentence. Understandably, because the target language score fuses the first language score output by the general language model with the second language score output by the target language model, combining the recognition results of the two models makes the target language score more accurate than the score of either single model (i.e., the general language model or the target language model alone).
As an example, in step S502, the server fuses the target language score of each candidate recognition sentence with the target acoustic score of that sentence to obtain the second recognition score of each candidate recognition sentence. For example, for the candidate recognition sentences corresponding to the N candidate recognition results, the target acoustic score of each candidate recognition sentence is weighted by an acoustic weight w1, the target language score of each candidate recognition sentence is weighted by a language weight w2, and the two weighted terms are summed to obtain the second recognition score of each candidate recognition sentence, finally yielding the second recognition scores of all N candidate recognition sentences. Understandably, the acoustic model can capture the user's accent in the voice data to be recognized; the target acoustic score obtained from the acoustic model is then fused with the target language score to obtain the second recognition score.
In the domain adaptive speech recognition method provided by this embodiment, the second recognition score of each candidate recognition sentence is obtained, which makes the subsequent acquisition of the target recognition result feasible.
In one embodiment, as shown in fig. 6, step S501, that is, determining a target language score corresponding to each candidate recognition sentence according to a first language score and a second language score corresponding to each candidate recognition sentence, includes:
s601: acquiring a first fusion weight corresponding to the first language score and a second fusion weight corresponding to the second language score;
s602: and obtaining the target language score corresponding to each candidate identification sentence according to the first language score, the first fusion weight, the second language score and the second fusion weight.
As an example, in step S601, on the basis of obtaining the first language score and the second language score of each candidate recognition sentence, the server acquires a first fusion weight corresponding to the first language score and a second fusion weight corresponding to the second language score. Because the first language score is produced by the general language model while the second language score is produced by the target language model corresponding to the target domain, and because, once the target domain has been determined from the optimal recognition result, the target language model generally recognizes more accurately than the general language model, the second fusion weight may be set greater than the first fusion weight so that the second language score output by the target language model carries more weight. For example, the first fusion weight may be set to 0.3 and the second fusion weight to 0.7.
As an example, in step S602, the server fuses the first language score and the second language score according to the first fusion weight and the second fusion weight, respectively, to obtain a target language score corresponding to each candidate recognition sentence.
In an embodiment, for each candidate recognition sentence, the first language score weighted by the first fusion weight is summed with the second language score weighted by the second fusion weight to obtain the target language score of that sentence. For example, after obtaining the first language scores, the first fusion weight, the second language scores and the second fusion weight for the N candidate recognition sentences, the first language score of each candidate recognition sentence is weighted by the first fusion weight, the second language score of each candidate recognition sentence is weighted by the second fusion weight, and the two weighted terms are summed to obtain the target language score of each candidate recognition sentence, finally yielding the target language scores of all N candidate recognition sentences.
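A minimal sketch of this two-weight language-score fusion, using the illustrative 0.3/0.7 weights mentioned above (the scores themselves are hypothetical):

```python
def fuse_language_scores(first_scores, second_scores, w_general=0.3, w_domain=0.7):
    """Target language score = general-LM score * first fusion weight
    + domain-LM score * second fusion weight, per candidate sentence."""
    return [w_general * s1 + w_domain * s2
            for s1, s2 in zip(first_scores, second_scores)]

# First (general LM) and second (domain LM) language scores for N = 2 sentences.
target_lang_scores = fuse_language_scores([-9.1, -8.7], [-6.0, -7.2])
```

Giving the domain language model the larger weight reflects the assumption, stated above, that it scores in-domain sentences more reliably than the general model.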
In the domain adaptive speech recognition method provided by this embodiment, the target language score of each candidate recognition sentence is obtained, which makes the subsequent acquisition of the second recognition score feasible.
In another embodiment, as shown in fig. 7, step S204, that is, performing secondary decoding on a plurality of candidate recognition results by using a target language model corresponding to a target domain, obtains a target recognition result, includes:
s701: performing secondary decoding on the multiple candidate recognition results by adopting a target language model corresponding to the target field to obtain multiple candidate recognition sentences and second language scores corresponding to each candidate recognition sentence;
s702: processing the second language score and the first recognition score corresponding to each candidate recognition sentence to obtain a third recognition score corresponding to each candidate recognition sentence;
s703: and sorting the plurality of candidate recognition sentences according to the third recognition scores, and determining the candidate recognition sentences with the largest third recognition scores as target recognition results.
The candidate recognition result comprises a candidate recognition sentence and a first recognition score corresponding to the candidate recognition sentence.
As an example, in step S701, the server uses a target language model corresponding to the target domain to perform secondary decoding on the multiple candidate recognition results, and obtains multiple candidate recognition sentences and a second language score corresponding to each candidate recognition sentence.
In an embodiment, the server performs secondary decoding recognition on the multiple candidate recognition results by using the target language model on the basis of selecting the multiple candidate recognition results with the number of N, so as to obtain N candidate recognition sentences and a second language score corresponding to each candidate recognition sentence.
As an example, in step S702, the server fuses the second language score and the first recognition score of each candidate recognition sentence to obtain the third recognition score of each candidate recognition sentence. Understandably, because the first recognition score was obtained by fusing the target acoustic score and the first language score after primary decoding, fusing it with the second language score obtained by secondary decoding yields a more accurate third recognition score, so that the target recognition result obtained from the third recognition score better matches the actual content of the voice data to be recognized.
In one embodiment, the server fuses the second language score and the first recognition score of each of the N candidate recognition sentences to obtain the third recognition score of each of the N candidate recognition sentences. For example, for the candidate recognition sentences corresponding to the N candidate recognition results, the first recognition score of each candidate recognition sentence is weighted by a preset weight w3, the second language score of each candidate recognition sentence is weighted by a preset weight w4, and the two weighted terms are summed to obtain the third recognition score of each candidate recognition sentence, finally yielding the third recognition scores of all N candidate recognition sentences. Understandably, fusing the first recognition score with the second language score to obtain the third recognition score evaluates the recognition effects of the acoustic model, the general language model and the target language model simultaneously, so the resulting third recognition score is more accurate.
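This rerank-by-third-score step can be sketched as follows; the sentence strings, score values, and weights w3 and w4 are hypothetical, chosen only for illustration:

```python
def rerank_candidates(candidates, w3=0.5, w4=0.5):
    """candidates: (sentence, first_recognition_score, second_language_score) tuples.
    Returns (sentence, third_recognition_score) pairs sorted best-first."""
    scored = [(s, w3 * first + w4 * second) for s, first, second in candidates]
    return sorted(scored, key=lambda t: t[1], reverse=True)

ranked = rerank_candidates([
    ("check my loan rate", -10.5, -6.1),   # in-domain wording, better LM score
    ("check my lone rate", -10.2, -9.8),
])
target_sentence = ranked[0][0]  # candidate with the largest third recognition score
```

Here the domain language model's second score overturns the slightly better first recognition score of the mis-recognized homophone, which is exactly the behavior the secondary decoding is meant to provide.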
As an example, in step S703, the server sorts the plurality of candidate recognition sentences according to the third recognition score, and determines the candidate recognition sentence with the largest third recognition score as the target recognition result.
In an embodiment, the candidate recognition sentence with the largest third recognition score is selected as the target recognition result of the voice data to be recognized. On the basis of obtaining the third recognition scores of the N candidate recognition sentences, the candidate recognition sentence with the largest third recognition score among the N is selected as the target recognition result.
In the domain adaptive speech recognition method provided by the embodiment, a target language model corresponding to a target domain is used for recognizing a plurality of candidate recognition results, so that the recognition accuracy of the obtained plurality of candidate recognition sentences and second language scores is ensured; and fusing the second language score corresponding to each candidate recognition sentence with the first recognition score to obtain a third recognition score, and selecting the candidate recognition sentence corresponding to the maximum third recognition score as a target recognition result corresponding to the voice data to be recognized, so that the recognition result of the obtained voice data to be recognized is more accurate and more practical.
In one embodiment, as shown in fig. 8, step S203, that is, performing domain judgment according to the optimal recognition result, determines a target domain, includes:
s801: performing space mapping on the optimal recognition result by adopting a word vector mapping model to obtain a target vector;
s802: processing the target vector by adopting a neural network model, and determining the recognition probabilities corresponding to a plurality of configuration fields;
s803: and determining the configuration domain with the maximum recognition probability as the target domain.
The domain judgment model comprises a word vector mapping model and a neural network model. The word vector mapping model is a model for converting an original recognition sentence corresponding to the optimal recognition result into a target vector. The neural network model is used for acquiring the recognition probabilities of the optimal recognition results corresponding to different configuration fields according to the target vector.
As an example, in step S801, the server performs spatial mapping on the optimal recognition result using a word vector mapping model, converts the original recognition sentence corresponding to the optimal recognition result into a vector in a specific format, and uses the vector in the specific format as the target vector.
In one embodiment, the server uses a word-vector mapping model, namely a word-embedding layer, to convert the original recognition sentence corresponding to the optimal recognition result into the target vector. This ensures that the recognition probabilities can subsequently be obtained through the neural network.
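A toy sketch of the word-vector mapping step; the embedding table and its two-dimensional vectors are hypothetical stand-ins for a trained word-embedding layer (a real system would also feed the per-word vectors to a sequence model rather than averaging them):

```python
# Hypothetical 2-dimensional word embeddings; in practice these come from
# a trained embedding layer with a far larger vocabulary and dimension.
EMBEDDINGS = {
    "loan": [0.9, 0.1],
    "rate": [0.8, 0.2],
    "game": [0.1, 0.9],
}

def map_to_target_vector(tokens, dim=2):
    """Average the word vectors of a recognized sentence into one fixed-size vector;
    out-of-vocabulary words map to the zero vector."""
    vecs = [EMBEDDINGS.get(t, [0.0] * dim) for t in tokens] or [[0.0] * dim]
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]

target_vector = map_to_target_vector("check my loan rate".split())
```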
As an example, in step S802, the server uses a neural network model to process the target vector, that is, inputs the target vector into the neural network model, respectively identifies the target vector through language models of a plurality of configuration fields in the neural network model, and outputs the identification probability of the optimal identification result corresponding to different configuration fields.
In an embodiment, the server processes the target vector by adopting a neural network model of an LSTM layer, specifically uses language models of a plurality of configuration fields inside the LSTM layer to respectively identify the target vector, and outputs identification probabilities corresponding to each configuration field after passing through two full connection layers. Understandably, neural network models include, but are not limited to, LSTM.
It will be appreciated that, before performing the above steps, the server needs to train a language model of a plurality of configuration domains based on training data of a plurality of domains, and the training process includes:
the server inputs the acquired text data with the domain labels into the neural network model, and outputs the text data with the domain labels to obtain the corresponding domain labels, wherein the text data with the domain labels can be text data from a plurality of domains. Training is carried out by using a large amount of text data with domain labels from different domains to obtain language models of corresponding domains, and finally, language models of a plurality of configuration domains are trained. For example, training is performed using text data having a game field tag, and the obtained language model is a game field language model. In an embodiment, the text data with the domain labels may be language model training corpus, or may be labeling text of a speech training set.
As an example, in step S803, the server determines the configured domain with the highest recognition probability as the target domain. For example, domain recognition is performed on the optimal recognition result, the configured domain with the highest recognition probability is obtained, and that domain is taken as the target domain.
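The judge-then-argmax step can be illustrated with toy per-domain unigram language models; the patent's neural network (e.g. an LSTM followed by fully connected layers) is replaced here by hypothetical unigram probabilities purely for illustration:

```python
import math

# Hypothetical configured domains and per-word probabilities.
DOMAIN_LMS = {
    "finance": {"loan": 0.4, "rate": 0.4, "game": 0.01},
    "gaming":  {"loan": 0.01, "rate": 0.04, "game": 0.6},
}

def judge_domain(tokens, domain_lms=DOMAIN_LMS, floor=1e-6):
    """Score the optimal recognition result under each configured domain's
    language model and return the domain with the highest log-probability."""
    log_probs = {
        domain: sum(math.log(lm.get(t, floor)) for t in tokens)
        for domain, lm in domain_lms.items()
    }
    return max(log_probs, key=log_probs.get)

target_domain = judge_domain(["loan", "rate"])
```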
In the domain adaptive voice recognition method provided by the embodiment, the recognition probabilities of the optimal recognition result in the plurality of configuration domains are obtained through the domain judgment model, the configuration domain with the largest recognition probability is selected and determined as the target domain, and the accuracy of target domain recognition can be guaranteed.
It should be understood that the sequence numbers of the steps in the foregoing embodiments do not imply an order of execution; the execution order of the processes should be determined by their functions and internal logic, and should not constitute any limitation on the implementation process of the embodiments of the present invention.
In one embodiment, a domain-adaptive voice recognition apparatus is provided, where the domain-adaptive voice recognition apparatus corresponds to the domain-adaptive voice recognition method in the above embodiment one by one. As shown in fig. 9, the domain-adaptive voice recognition apparatus includes a voice data to be recognized acquisition module 901, a primary decoding result acquisition module 902, a target domain determination module 903, and a target recognition result acquisition module 904. The functional modules are described in detail as follows:
The voice data to be recognized acquisition module 901 is configured to acquire voice data to be recognized;
a primary decoding result obtaining module 902, configured to perform primary decoding on the voice data to be identified, obtain a plurality of candidate recognition results, and obtain an optimal recognition result from the plurality of candidate recognition results;
the target domain determining module 903 is configured to determine a target domain by performing domain judgment according to the optimal recognition result;
the target recognition result obtaining module 904 performs secondary decoding on the multiple candidate recognition results by adopting a target language model corresponding to the target field to obtain a target recognition result.
In one embodiment, the primary decoding result obtaining module 902 includes:
the target acoustic score obtaining submodule is used for decoding the voice data to be recognized by adopting an acoustic model to obtain a plurality of original recognition sentences and target acoustic scores corresponding to the original recognition sentences;
the first language score obtaining sub-module is used for decoding the voice data to be recognized by adopting a general language model to obtain a plurality of original recognition sentences and first language scores corresponding to each original recognition sentence;
the first recognition score acquisition sub-module is used for processing according to the target acoustic scores and the first language scores corresponding to the plurality of original recognition sentences to acquire first recognition scores corresponding to the plurality of original recognition sentences;
The candidate recognition result determining submodule is used for determining the N original recognition sentences with the largest first recognition scores as a plurality of candidate recognition results, wherein N is greater than or equal to 2;
and the optimal recognition result determining sub-module is used for determining an original recognition sentence with the largest first recognition score as the optimal recognition result.
In one embodiment, the target area determination module 903 includes:
the target vector acquisition sub-module is used for carrying out space mapping on the optimal recognition result by adopting a word vector mapping model to acquire a target vector;
the recognition probability determination submodule is used for processing the target vector by adopting a neural network model and determining recognition probabilities corresponding to a plurality of configuration fields;
and the target domain determining sub-module is used for determining the configuration domain with the largest recognition probability as the target domain.
In one embodiment, the target recognition result obtaining module 904 includes:
the second language score obtaining sub-module is used for carrying out secondary decoding on the candidate recognition results by adopting a target language model corresponding to the target field to obtain a plurality of candidate recognition sentences and second language scores corresponding to each candidate recognition sentence; the candidate recognition result comprises candidate recognition sentences and target acoustic scores corresponding to the candidate recognition sentences;
The second recognition score obtaining sub-module is used for processing the second language score and the target acoustic score corresponding to each candidate recognition sentence and obtaining the second recognition score corresponding to each candidate recognition sentence;
and the target recognition result determining submodule is used for determining the candidate recognition sentences with the second recognition scores being the largest as the target recognition result.
In one embodiment, the second recognition score acquisition sub-module includes:
the target language score determining unit is used for determining the target language score corresponding to each candidate identification sentence according to the first language score and the second language score corresponding to each candidate identification sentence; the candidate recognition result also comprises a first language score corresponding to the candidate recognition sentence;
and the second recognition score acquisition unit is used for processing the target language score corresponding to each candidate recognition sentence and the target acoustic score corresponding to each candidate recognition sentence to acquire the second recognition score corresponding to each candidate recognition sentence.
In one embodiment, the target language score determining unit includes:
the fusion weight acquisition subunit is used for acquiring a first fusion weight corresponding to the first language score and a second fusion weight corresponding to the second language score;
The target language score obtaining subunit is configured to obtain a target language score corresponding to each candidate recognition sentence according to the first language score, the first fusion weight, the second language score and the second fusion weight.
In another embodiment, the target recognition result acquisition module 904 includes:
the second language score obtaining sub-module is used for carrying out secondary decoding on the candidate recognition results by adopting a target language model corresponding to the target field to obtain a plurality of candidate recognition sentences and second language scores corresponding to each candidate recognition sentence;
the third recognition score obtaining sub-module is used for processing the second language score and the first recognition score corresponding to each candidate recognition sentence and obtaining a third recognition score corresponding to each candidate recognition sentence;
and the target recognition result determining sub-module is used for sequencing the plurality of candidate recognition sentences according to the third recognition scores and determining the candidate recognition sentences with the largest third recognition scores as the target recognition result.
For specific limitations of the domain-adaptive speech recognition apparatus, reference may be made to the above limitation of the domain-adaptive speech recognition method, and no further description is given here. The modules in the above-described domain adaptive voice recognition apparatus may be implemented in whole or in part by software, hardware, or a combination thereof. The above modules may be embedded in hardware or may be independent of a processor in the computer device, or may be stored in software in a memory in the computer device, so that the processor may call and execute operations corresponding to the above modules.
In one embodiment, a computer device is provided, which may be a server, and the internal structure of which may be as shown in fig. 10. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, computer programs, and a database. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The database of the computer device is used for storing data adopted or generated in the process of executing the field self-adaptive voice recognition method. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a domain-adaptive speech recognition method.
In an embodiment, a computer device is provided, including a memory, a processor, and a computer program stored in the memory and capable of running on the processor, where the processor implements the method for domain-adaptive speech recognition in the above embodiment when executing the computer program, for example, S201-S204 shown in fig. 2, or S201-S204 shown in fig. 3-8, and is not repeated here. Alternatively, the processor may implement the functions of each module/unit in this embodiment of the domain adaptive voice recognition apparatus when executing the computer program, for example, the functions of the to-be-recognized voice data acquisition module 901, the primary decoding result acquisition module 902, the target domain determining module 903, and the target recognition result acquisition module 904 shown in fig. 9, which are not repeated herein.
In an embodiment, a computer readable storage medium is provided, and a computer program is stored on the computer readable storage medium, where the computer program is executed by a processor to implement the domain adaptive speech recognition method in the above embodiment, for example, S201-S204 shown in fig. 2, or S201-S204 shown in fig. 3-8, which are not repeated herein. Alternatively, the functions of each module/unit in the above embodiment of the domain adaptive voice recognition apparatus, for example, the functions of the to-be-recognized voice data acquisition module 901, the primary decoding result acquisition module 902, the target domain determining module 903, and the target recognition result acquisition module 904 shown in fig. 9, are implemented when the computer program is executed by a processor, and are not repeated here. The computer readable storage medium may be nonvolatile or may be volatile.
Those skilled in the art will appreciate that implementing all or part of the above-described methods may be accomplished by way of a computer program stored on a non-transitory computer readable storage medium, which when executed, may comprise the steps of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the various embodiments provided herein may include non-volatile and/or volatile memory. The nonvolatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM), among others.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-described division of the functional units and modules is illustrated, and in practical application, the above-described functional distribution may be performed by different functional units and modules according to needs, i.e. the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-described functions.
The above embodiments are only for illustrating the technical solution of the present invention, and not for limiting the same; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention, and are intended to be included in the scope of the present invention.

Claims (10)

1. A domain-adaptive speech recognition method, comprising:
acquiring voice data to be recognized;
performing primary decoding on the voice data to be recognized to obtain a plurality of candidate recognition results, and obtaining an optimal recognition result from the plurality of candidate recognition results;
Performing domain judgment according to the optimal recognition result, and determining a target domain;
and adopting a target language model corresponding to the target field to perform secondary decoding on the candidate recognition results to obtain target recognition results.
2. The method of claim 1, wherein said performing primary decoding on the voice data to be recognized to obtain a plurality of candidate recognition results, and obtaining an optimal recognition result from the plurality of candidate recognition results, comprises:
decoding the voice data to be recognized by adopting an acoustic model to acquire a plurality of original recognition sentences and target acoustic scores corresponding to the original recognition sentences;
decoding the voice data to be recognized by adopting a general language model to acquire a plurality of original recognition sentences and first language scores corresponding to each original recognition sentence;
processing according to target acoustic scores and first language scores corresponding to the plurality of original identification sentences to obtain first identification scores corresponding to the plurality of original identification sentences;
sorting the first recognition scores corresponding to the plurality of original recognition sentences, and determining the N original recognition sentences with the largest first recognition scores as the plurality of candidate recognition results, wherein N is greater than or equal to 2;
And selecting one original recognition sentence with the largest first recognition score from the candidate recognition results, and determining the original recognition sentence as the optimal recognition result.
3. The domain-adaptive speech recognition method according to claim 2, wherein each candidate recognition result comprises a candidate recognition sentence and the target acoustic score corresponding to that candidate recognition sentence; and
wherein performing secondary decoding on the plurality of candidate recognition results with the target language model corresponding to the target domain to obtain the target recognition result comprises:
performing secondary decoding on the candidate recognition results with the target language model corresponding to the target domain to obtain a plurality of candidate recognition sentences and a second language score corresponding to each candidate recognition sentence;
combining the second language score and the target acoustic score of each candidate recognition sentence to obtain a second recognition score for each candidate recognition sentence; and
sorting the candidate recognition sentences by second recognition score, and taking the candidate recognition sentence with the highest second recognition score as the target recognition result.
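The second-pass rescoring step can be sketched as below. Here `target_lm_score` stands in for the target-domain language model as a hypothetical callable mapping a sentence to a log-probability, and the fusion rule mirrors the first pass; both are illustrative assumptions:

```python
def second_pass_rescore(candidates, target_lm_score, lm_weight=0.5):
    """Rescore first-pass candidates with a target-domain language model.

    candidates: list of (candidate_sentence, target_acoustic_score)
    target_lm_score: callable mapping a sentence to its second language score
    """
    rescored = []
    for sentence, acoustic in candidates:
        second_lm = target_lm_score(sentence)  # second language score
        # second recognition score: acoustic score fused with the domain-LM score
        rescored.append((sentence, acoustic + lm_weight * second_lm))
    # the candidate with the largest second recognition score is the target result
    return max(rescored, key=lambda h: h[1])
```

A domain LM can promote a candidate that the general LM ranked second, which is the point of the two-pass design: the acoustic scores are reused, only the language scores change.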
4. The method according to claim 3, wherein each candidate recognition result further comprises the first language score corresponding to the candidate recognition sentence; and
wherein combining the second language score and the target acoustic score of each candidate recognition sentence to obtain the second recognition score comprises:
determining a target language score for each candidate recognition sentence from the first language score and the second language score corresponding to that sentence; and
combining the target language score and the target acoustic score of each candidate recognition sentence to obtain the second recognition score for that sentence.
5. The method according to claim 4, wherein determining the target language score for each candidate recognition sentence from the first language score and the second language score comprises:
obtaining a first fusion weight corresponding to the first language score and a second fusion weight corresponding to the second language score; and
computing the target language score for each candidate recognition sentence from the first language score, the first fusion weight, the second language score, and the second fusion weight.
6. The method according to claim 2, wherein each candidate recognition result comprises a candidate recognition sentence and the first recognition score corresponding to that sentence; and
wherein performing secondary decoding on the plurality of candidate recognition results with the target language model corresponding to the target domain to obtain the target recognition result comprises:
performing secondary decoding on the candidate recognition results with the target language model corresponding to the target domain to obtain a plurality of candidate recognition sentences and a second language score corresponding to each candidate recognition sentence;
combining the second language score and the first recognition score of each candidate recognition sentence to obtain a third recognition score for each candidate recognition sentence; and
sorting the candidate recognition sentences by third recognition score, and taking the candidate recognition sentence with the highest third recognition score as the target recognition result.
7. The method according to claim 1, wherein performing the domain judgment according to the optimal recognition result to determine the target domain comprises:
mapping the optimal recognition result into a vector space with a word vector mapping model to obtain a target vector;
processing the target vector with a neural network model to determine a recognition probability for each of a plurality of configured domains; and
taking the configured domain with the highest recognition probability as the target domain.
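The domain-judgment step above can be illustrated with a minimal stand-in classifier. A single softmax layer here plays the role of the neural network model; the weight matrix and the domain names are hypothetical, and the patent does not specify the network architecture:

```python
import math

def classify_domain(target_vector, domain_weights, domains):
    """Pick the configured domain with the highest recognition probability.

    target_vector: sentence embedding from the word vector mapping model
    domain_weights: one weight row per configured domain (illustrative)
    """
    logits = [sum(w * x for w, x in zip(row, target_vector))
              for row in domain_weights]
    exps = [math.exp(z) for z in logits]
    total = sum(exps)
    probs = [e / total for e in exps]  # recognition probability per domain
    best = max(range(len(domains)), key=lambda i: probs[i])
    # the configured domain with the highest probability is the target domain
    return domains[best], probs[best]
```

In practice the embedding would come from the word vector mapping model applied to the optimal recognition result, and the classifier would be trained on labeled in-domain transcripts.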
8. A domain-adaptive speech recognition device, comprising:
a speech data acquisition module, configured to acquire speech data to be recognized;
a primary decoding result acquisition module, configured to perform primary decoding on the speech data to be recognized, obtain a plurality of candidate recognition results, and select an optimal recognition result from the plurality of candidate recognition results;
a target domain determination module, configured to perform domain judgment according to the optimal recognition result to determine a target domain; and
a target recognition result acquisition module, configured to perform secondary decoding on the plurality of candidate recognition results with a target language model corresponding to the target domain to obtain a target recognition result.
9. A computer device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the domain-adaptive speech recognition method according to any one of claims 1 to 7.
10. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the domain-adaptive speech recognition method according to any one of claims 1 to 7.
CN202310313176.1A 2023-03-15 2023-03-15 Domain-adaptive speech recognition method, device, computer equipment and storage medium Pending CN116343755A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310313176.1A CN116343755A (en) 2023-03-15 2023-03-15 Domain-adaptive speech recognition method, device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310313176.1A CN116343755A (en) 2023-03-15 2023-03-15 Domain-adaptive speech recognition method, device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN116343755A (en) 2023-06-27

Family

ID=86875846

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310313176.1A Pending CN116343755A (en) 2023-03-15 2023-03-15 Domain-adaptive speech recognition method, device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116343755A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117558263A (en) * 2024-01-10 2024-02-13 科大讯飞股份有限公司 Speech recognition method, device, equipment and readable storage medium
CN117558263B (en) * 2024-01-10 2024-04-26 科大讯飞股份有限公司 Speech recognition method, device, equipment and readable storage medium


Similar Documents

Publication Publication Date Title
CN110162633B (en) Voice data intention determining method and device, computer equipment and storage medium
US9390711B2 (en) Information recognition method and apparatus
WO2018133761A1 (en) Method and device for man-machine dialogue
US20190005951A1 (en) Method of processing dialogue based on dialog act information
CN110472224B (en) Quality of service detection method, apparatus, computer device and storage medium
CN111145733B (en) Speech recognition method, speech recognition device, computer equipment and computer readable storage medium
CN113297366B (en) Emotion recognition model training method, device, equipment and medium for multi-round dialogue
CN108682420A (en) A kind of voice and video telephone accent recognition method and terminal device
US11636272B2 (en) Hybrid natural language understanding
CN111563144A (en) Statement context prediction-based user intention identification method and device
CN111223476B (en) Method and device for extracting voice feature vector, computer equipment and storage medium
JP2020004382A (en) Method and device for voice interaction
CN111897935B (en) Knowledge graph-based conversational path selection method and device and computer equipment
EP2988298B1 (en) Response generation method, response generation apparatus, and response generation program
CN113157876A (en) Information feedback method, device, terminal and storage medium
KR20180060903A (en) Electronic device and method thereof for performing translation by sharing context of utterance
CN111611383A (en) User intention recognition method and device, computer equipment and storage medium
JP2021096847A (en) Recommending multimedia based on user utterance
CN111402864A (en) Voice processing method and electronic equipment
CN114579718A (en) Text feature generation method, device, equipment and storage medium combining RPA and AI
KR20190074508A (en) Method for crowdsourcing data of chat model for chatbot
CN110931002B (en) Man-machine interaction method, device, computer equipment and storage medium
CN109002498B (en) Man-machine conversation method, device, equipment and storage medium
CN114783405B (en) Speech synthesis method, device, electronic equipment and storage medium
CN116343755A (en) Domain-adaptive speech recognition method, device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination