CN111883122B - Speech recognition method and device, storage medium and electronic equipment - Google Patents

Speech recognition method and device, storage medium and electronic equipment

Info

Publication number
CN111883122B
CN111883122B
Authority
CN
China
Prior art keywords
response
voice
results
voice recognition
result
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010712229.3A
Other languages
Chinese (zh)
Other versions
CN111883122A (en)
Inventor
赵培
朱文博
韩俊明
苏腾荣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Haier Uplus Intelligent Technology Beijing Co Ltd
Original Assignee
Haier Uplus Intelligent Technology Beijing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Haier Uplus Intelligent Technology Beijing Co Ltd filed Critical Haier Uplus Intelligent Technology Beijing Co Ltd
Priority to CN202010712229.3A priority Critical patent/CN111883122B/en
Publication of CN111883122A publication Critical patent/CN111883122A/en
Application granted granted Critical
Publication of CN111883122B publication Critical patent/CN111883122B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L15/26 - Speech to text systems
    • G10L2015/223 - Execution procedure of a spoken command

Abstract

The application provides a speech recognition method and apparatus, a storage medium, and an electronic device. The method includes: parsing a plurality of speech recognition results input into a plurality of sub-domain models to obtain a plurality of response results corresponding to the plurality of speech recognition results, and combining the plurality of response results to obtain a combined response result; determining the response result with the highest confidence according to the plurality of response results, a plurality of first confidences of the plurality of response results, the combined response result, and a second confidence of the combined response result; and acquiring a target voice output by a target device according to the response result with the highest confidence, and determining a response voice for the target voice, where the response voice instructs the target device to determine whether to execute the operation corresponding to the target voice. This technical scheme solves the problems in intelligent speech dialogue systems whereby speech recognition is error-prone and of low accuracy, so that parsing errors reduce the success rate of the dialogue.

Description

Speech recognition method and device, storage medium and electronic equipment
Technical Field
The present application relates to the field of communications, and in particular, to a method and apparatus for voice recognition, a storage medium, and an electronic device.
Background
With the rise of automatic speech recognition (Automatic Speech Recognition, abbreviated as ASR), speech recognition is increasingly applied in intelligent speech dialogue systems. However, factors such as changes in the acoustic environment and ambiguous pronunciation can cause errors in the recognized text. These errors cascade into the downstream semantic parsing module, causing parsing errors and lowering the success rate of the dialogue.
For the problems in the related art that, in intelligent speech dialogue systems, speech recognition is error-prone and of low accuracy, so that parsing errors reduce the success rate of the dialogue, no effective technical scheme has yet been proposed.
Disclosure of Invention
The embodiments of the present application provide a speech recognition method and apparatus, a storage medium, and an electronic device, to at least solve the problems in the related art that, in intelligent speech dialogue systems, speech recognition is error-prone and of low accuracy, so that parsing errors reduce the success rate of the dialogue.
According to one embodiment of the present application, a speech recognition method applied to a voice interaction system is provided, including: parsing a plurality of speech recognition results input into a plurality of sub-domain models to obtain a plurality of response results corresponding to the plurality of speech recognition results, and combining the plurality of response results to obtain a combined response result; determining the response result with the highest confidence according to the plurality of response results, a plurality of first confidences of the plurality of response results, the combined response result, and a second confidence of the combined response result; and acquiring a target voice output by a target device according to the response result with the highest confidence, and determining a response voice for the target voice, where the response voice instructs the target device to determine whether to execute the operation corresponding to the target voice.
According to another embodiment of the present application, a speech recognition apparatus is also provided, including: a first processing module, configured to parse a plurality of speech recognition results input into a plurality of sub-domain models to obtain a plurality of response results corresponding to the plurality of speech recognition results, and to combine the plurality of response results into a combined response result; a first determining module, configured to determine the response result with the highest confidence according to the plurality of response results, the plurality of first confidences of the plurality of response results, the combined response result, and the second confidence of the combined response result; and a first acquiring module, configured to acquire a target voice output by the target device according to the response result with the highest confidence and to determine a response voice for the target voice, where the response voice instructs the target device to determine whether to execute the operation corresponding to the target voice.
According to another embodiment of the present application, a storage medium is also provided, including a stored program, where the program, when executed, performs any one of the above speech recognition methods.
According to another embodiment of the present application, an electronic device is also provided, including a memory and a processor, where the memory stores a computer program and the processor is configured to run the computer program to perform any one of the above speech recognition methods.
According to the application, a plurality of speech recognition results input into a plurality of sub-domain models are first parsed to obtain a plurality of response results corresponding to them, and the plurality of response results are combined to obtain a combined response result; the response result with the highest confidence is determined according to the plurality of response results, the plurality of first confidences of the plurality of response results, the combined response result, and the second confidence of the combined response result; a target voice output by the target device according to the response result with the highest confidence is acquired, and a response voice for the target voice is determined, where the response voice instructs the target device to determine whether to execute the operation corresponding to the target voice. This technical scheme solves the problems in the related art that, in intelligent speech dialogue systems, speech recognition is error-prone and of low accuracy, so that parsing errors reduce the success rate of the dialogue.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute a limitation on the application. In the drawings:
FIG. 1 is a flow chart of an alternative speech recognition method according to an embodiment of the present application;
FIG. 2 is a flow chart of an alternative determination of combined response results according to an embodiment of the application;
FIG. 3 is a schematic diagram of an alternative speech recognition model according to an embodiment of the present application;
FIG. 4 is a block diagram of an alternative speech recognition device according to an embodiment of the present application;
FIG. 5 is a block diagram of an alternative first processing module according to an embodiment of the application;
FIG. 6 is a block diagram of another alternative speech recognition device according to an embodiment of the present application.
Detailed Description
The application will be described in detail hereinafter with reference to the drawings in conjunction with embodiments. It should be noted that, without conflict, the embodiments of the present application and features of the embodiments may be combined with each other.
It should be noted that the terms "first," "second," and the like in the description and the claims of the present application and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order.
In a conventional speech dialogue system, natural speech audio data from the user is captured by the input device of the voice interaction system and fed into multiple speech recognition engines, yielding multiple recognition results. Typically, the recognition results are marked and aligned as wholes and then ranked to select an optimal result, which is passed to the semantic parsing module for semantic parsing. However, different engines may disagree on specific words; because such partial errors are not corrected, an erroneous portion may be selected as the engine's final output, which lowers the overall accuracy of speech recognition.
Existing semantic parsing schemes process only a single text, so if speech recognition goes wrong, all subsequent semantic parsing goes wrong as well. In addition, semantic parsing relies on a single model whose operation is complex and whose iteration is costly in time. If, after semantic parsing, the dialogue state is updated directly and the dialogue response is generated and output, then an error in any one module propagates to all modules, with no step of mutual verification.
Existing multi-engine schemes mainly optimize over each engine's whole result: a final recognition result is chosen first, semantic parsing is performed on that single engine result, the dialogue state is updated directly, and dialogue generation follows. Relying on a single engine's result leads to a higher error rate in the recognized text; as semantic parsing and text generation build on that text, errors accumulate layer by layer, reducing system stability and making requirements hard to meet.
The following embodiments of the present application therefore mix and output the results of multiple speech recognition engines to compensate for the weakness of outputting each engine's result separately: the mixed result and the individual engine results are parsed respectively, final response results are generated, and the final response results are processed, thereby improving recognition accuracy.
In order to solve the above technical problem, this embodiment provides a speech recognition method. FIG. 1 is a flowchart of an alternative speech recognition method according to an embodiment of the present application; as shown in FIG. 1, the flow includes the following steps:
step S102, analyzing a plurality of voice recognition results input into a plurality of sub-domain models to obtain a plurality of response results corresponding to the plurality of voice recognition results, and combining the plurality of response results to obtain a combined response result;
step S104, determining a response result with highest confidence according to the response results, the first confidence levels of the response results, the combined response result and the second confidence level of the combined response result;
step S106, obtaining target voice output by the target equipment according to the response result with highest confidence, and determining response voice of the target voice based on the target voice, wherein the response voice is used for indicating the target equipment to determine whether to execute the operation corresponding to the target voice.
According to the application, firstly, a plurality of voice recognition results input into a plurality of sub-domain models are analyzed to obtain a plurality of response results corresponding to the plurality of voice recognition results, and then the plurality of response results are combined to obtain a combined response result; determining a response result with the highest confidence coefficient according to the response results, the first confidence coefficients of the response results, the combined response result and the second confidence coefficient of the combined response result; and acquiring target voice output by the target equipment according to the response result with the highest confidence, and determining response voice of the target voice based on the target voice, wherein the response voice is used for indicating the target equipment to determine whether to execute the operation corresponding to the target voice. By adopting the technical scheme, the problems that in the related technology, in an intelligent voice dialogue system, errors are easy to occur in voice recognition, the accuracy is low, and then analysis errors are caused to influence the success rate of the dialogue are solved.
Optionally, before analyzing the plurality of speech recognition results input to the plurality of sub-domain models, the method further includes: acquiring a plurality of voice recognition results obtained by recognizing voice information to be recognized by a plurality of voice recognition engines; and determining the first confidence levels corresponding to the voice recognition results.
The voice information to be recognized may be the user's natural speech audio data in the voice interaction system. Recognizing it with the plurality of speech recognition engines yields the plurality of speech recognition results, and the plurality of first confidences corresponding to the plurality of speech recognition results are determined.
It should be noted that the above speech recognition engines recognize speech with different accuracy rates. Suppose the engines are engine A, engine B, and engine C, with accuracy rates of 90% for engine A, 88% for engine B, and 96% for engine C; then the first confidences of the three recognition results may be 0.90, 0.88, and 0.96. It will be appreciated that this is only an example, and the present embodiment is not limited to it.
In order to make the response result corresponding to the voice recognition result more accurate, in the embodiment of the present application, the combination of the plurality of response results to obtain a combined response result may be implemented by the following technical scheme: dividing each response result in the plurality of response results into a plurality of segmentation words according to a division rule; determining a minimum editing distance between any two response results in the plurality of response results, and recording position information corresponding to the minimum editing distance, wherein the minimum editing distance is used for representing the number of inconsistent segmentation words between any two response results, and the position information is used for representing the positions of the inconsistent segmentation words in the any two response results; and combining the plurality of response results according to the minimum editing distance and the position information to obtain the combined response result.
In the embodiment of the present application, as shown in fig. 2, a method for determining a combined response result is further provided, as follows:
step S201, receiving audio input data from a user;
First, an audio input file (corresponding to the voice information to be recognized) from the user is obtained by the audio input device of the man-machine dialogue system; the file is then fed through a cloud service into a plurality of speech recognition engines, generating a plurality of candidate recognition results (i.e., the plurality of speech recognition results).
Step S202, the consistency of all recognition results is checked according to the priorities of the speech recognition engines; if the plurality of speech recognition results are fully consistent or completely inconsistent, go to step S203; if they are partially inconsistent, go to step S204;
Step S203, the plurality of speech recognition results and their confidences are passed in turn to the semantic parsing module;
Step S204, the multi-engine recognition results are segmented in turn using the bidirectional maximum matching algorithm;
Step S205, the minimum edit distance between the i-th word of one engine and the i-th words of the other engines is calculated in turn;
Step S206, it is determined whether the minimum edit distance between different engines is smaller than a preset threshold;
if the distance exceeds the threshold, go to step S207;
if the distance is smaller than or equal to the threshold, go to step S208;
Step S207, take the word with the smallest edit distance and the higher engine priority;
Step S208, determine the recognition result of the i-th word;
Step S209, combine all the word segments into the combined response result, set response weight coefficients according to the proportion of each word in the sentence, and estimate the second confidence.
Further, the plurality of speech recognition results are divided into parts under the same segmentation rule; for example, the multi-engine recognition results can be segmented with the bidirectional maximum matching algorithm, so that segmentation is unified across engines. A sketch of such a segmenter follows.
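By way of illustration, a minimal bidirectional maximum matching segmenter might look like the following Python sketch. This is an illustrative sketch rather than the patented implementation: the toy vocabulary, the tie-breaking heuristic (fewer tokens, then fewer single-character tokens), and all function names are assumptions.

```python
def max_match(text, vocab, max_len=4, reverse=False):
    """Greedy maximum matching over text; scans right-to-left if reverse."""
    s = text[::-1] if reverse else text
    tokens, i = [], 0
    while i < len(s):
        for size in range(min(max_len, len(s) - i), 0, -1):
            piece = s[i:i + size][::-1] if reverse else s[i:i + size]
            if size == 1 or piece in vocab:   # single characters always match
                tokens.append(piece)
                i += size
                break
    return tokens[::-1] if reverse else tokens

def bidirectional_max_match(text, vocab):
    """Run both scans and keep the segmentation a common heuristic prefers:
    fewer tokens, then fewer single-character tokens on a tie."""
    fwd = max_match(text, vocab)
    bwd = max_match(text, vocab, reverse=True)
    if len(fwd) != len(bwd):
        return min(fwd, bwd, key=len)
    return min(fwd, bwd, key=lambda t: sum(len(w) == 1 for w in t))

# Toy example mirroring "set to low-speed wind" (设置为低速风):
vocab = {"设置为", "低速", "风"}
print(bidirectional_max_match("设置为低速风", vocab))  # ['设置为', '低速', '风']
```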
For the case where the engines' recognition results are partially inconsistent, the minimum edit distance is calculated and all calculation results between the words are recorded. For each inconsistent part, the minimum edit distance is obtained by preference and the most probable recognition result is derived from it; boundary positions are marked at the same time; the recognition results are then recombined into the finally output recognition result (i.e., the combined response result).
For example, assume the plurality of speech recognition engines are engine A, engine B, and engine C, and:
Engine A recognizes the input audio and segments the result as: setting the sealing at a low speed;
Engine B recognizes the input audio and segments the result as: setting the air as plastic drop;
Engine C recognizes the input audio and segments the result as: what is set to low-speed wind.
The minimum edit distances between the three engines' recognition results are computed pairwise, giving the following results and the recorded position information of the inconsistent parts:
The minimum edit distance between A and B is δ_AB = 3, with position information [A_{3,5}, B_{3,5}] (representing three replacement operations);
The minimum edit distance between A and C is δ_AC = 4, with position information [A_{5,5}, C_{5,8}] (representing one replacement and three insertion operations);
The minimum edit distance between B and C is δ_BC1 = 2 and δ_BC2 = 3, denoting a first and a second inconsistent part, with position information [B_{3,4}, C_{3,4}] and [B_{end}, C_{6,8}] (indicating that the first part is a replacement, followed by three insertion operations).
The results of engine A and engine C are consistent before position 4, so the edit distance of that part is 0; the new combined result then uses δ_AC to resolve the first inconsistent part, giving: low speed.
From the position information recorded for δ_BC1 = 2 and δ_BC2 = 3, engine B and engine C are considered consistent at position 5. In summary, combining the three engines' recognition results yields the combined result: "set to low-speed wind" (corresponding to the above combined response result).
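The alignment and recombination step can be illustrated with the following Python sketch. The dynamic-programming edit distance is standard; the merge shown is a simplified position-wise vote with a priority tie-breaker, which only approximates the alignment-based recombination described above, and the token lists and engine priority are hypothetical stand-ins for the example.

```python
from itertools import product

def edit_ops(a, b):
    """Token-level minimum edit distance between segmentations a and b,
    plus a backtrace giving the positions of inconsistent tokens."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i, j in product(range(1, m + 1), range(1, n + 1)):
        cost = 0 if a[i - 1] == b[j - 1] else 1
        d[i][j] = min(d[i-1][j] + 1, d[i][j-1] + 1, d[i-1][j-1] + cost)
    ops, i, j = [], m, n
    while i > 0 or j > 0:
        if i and j and d[i][j] == d[i-1][j-1] + (a[i-1] != b[j-1]):
            if a[i-1] != b[j-1]:
                ops.append(('replace', i - 1, j - 1))
            i, j = i - 1, j - 1
        elif i and d[i][j] == d[i-1][j] + 1:
            ops.append(('delete', i - 1, j)); i -= 1
        else:
            ops.append(('insert', i, j - 1)); j -= 1
    return d[m][n], list(reversed(ops))

def merge(results, priority):
    """Simplified recombination: per position, majority vote over the
    engines' tokens, breaking ties by engine priority."""
    merged = []
    for pos in range(max(len(t) for t in results.values())):
        votes = {}
        for eng, toks in results.items():
            if pos < len(toks):
                votes.setdefault(toks[pos], []).append(eng)
        merged.append(max(votes, key=lambda w: (len(votes[w]),
                      -min(priority.index(e) for e in votes[w]))))
    return merged

results = {'A': ['set to', 'seal', 'low speed'],
           'B': ['set to', 'plastic', 'drop'],
           'C': ['set to', 'low speed', 'wind']}
print(edit_ops(results['A'], results['B'])[0])   # 2 inconsistent tokens
print(merge(results, priority=['C', 'A', 'B']))  # ['set to', 'low speed', 'wind']
```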
Correspondingly, after the combined response result is obtained, the embodiment of the application further provides a method for determining the second confidence coefficient corresponding to the combined response result, which is as follows:
since there is some misjudgment of the combined result (i.e., the combined response result), the confidence result needs to be recalculated for reference.
The following description will be given by taking the results of the above 3 engine identifications as an example:
from the above, the multi-speech engine recognition result obtains a combination result as follows: "set to low speed wind" (i.e., the first speech recognition result).
For the above-mentioned multiple speech recognition results, each speech recognition result includes multiple words, for each word there is corresponding position information, by obtaining the position information specified by each word in each speech recognition result, the proportion of each word in the respective engine recognition result can be obtained, taking multiple speech recognition engines as a, B, and C as examples, the average value obtained in each engine recognition result of each word (i.e. word 1 "set to", word 2 "low speed", and word 3 "wind") can be used as the weight, and the weights corresponding to these three words are respectively recorded as:
the calculation method comprises the following steps:wherein C is 1 For word number statistics of word 1, C A Statistics of the overall word count of the recognition result obtained by the engine A, C B Statistics of the number of words of the recognition result obtained by the engine B, C C And counting the overall word number of the recognition result obtained by the engine C. The same theory can calculate +.>And will not be described in detail herein.
The confidence level calculation method for calculating the final combined recognition result (namely the combined response result) by using a weighted average method is as follows:
note that Conf A ,Conf B ,Conf C ,Conf multi Representing the confidence of engines a, B, C, respectively, and a second confidence of the combined response result.
It should be noted that, the accuracy of the recognized voices of the three engines is assumed to be: 90% engine A, 88% engine and 96% engine C, conf A Can be 0.9, conf B May be 0.88, conf C May be 0.96.
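A hedged sketch of the second-confidence estimate follows. The exact weighted-average formula is not reproduced above, so the reading below, which weights each merged word by its average share C_i/C_e over the engines that support it and averages those engines' confidences, is one plausible interpretation rather than the patented formula; all inputs are the toy values from the example.

```python
def second_confidence(merged, engine_results, engine_conf):
    """Estimate the combined result's confidence (an assumed reading of
    the weighted-average method described above).

    merged:         list of words in the combined result
    engine_results: {engine: list of segmented words}
    engine_conf:    {engine: first confidence}
    """
    totals = {e: sum(len(w) for w in toks)          # C_A, C_B, C_C
              for e, toks in engine_results.items()}
    num = den = 0.0
    for i, word in enumerate(merged):
        # Engines whose i-th token agrees with the merged choice.
        support = [e for e, toks in engine_results.items()
                   if i < len(toks) and toks[i] == word]
        if not support:
            continue
        # Word weight: average share C_i / C_e over supporting engines.
        weight = sum(len(word) / totals[e] for e in support) / len(support)
        conf = sum(engine_conf[e] for e in support) / len(support)
        num += weight * conf
        den += weight
    return num / den if den else 0.0

engine_results = {'A': ['set to', 'seal', 'low speed'],
                  'B': ['set to', 'plastic', 'drop'],
                  'C': ['set to', 'low speed', 'wind']}
merged = ['set to', 'low speed', 'wind']
conf = second_confidence(merged, engine_results,
                         {'A': 0.90, 'B': 0.88, 'C': 0.96})
print(round(conf, 3))   # ~0.945 under these toy inputs
```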
In an embodiment of the present application, before the analyzing the plurality of speech recognition results input to the plurality of sub-domain models, the method further includes: determining a plurality of categories corresponding to the plurality of voice recognition results, wherein the plurality of categories are used for representing language types to which the plurality of voice recognition results belong, the plurality of voice recognition results are in one-to-one correspondence with the plurality of categories, and the plurality of sub-field models are in one-to-one correspondence with the plurality of categories; and inputting the voice recognition results into the sub-domain models according to the categories.
The plurality of sub-domain models can be obtained by training a language model and a classifier, via neural networks, on millions of items of real user corpora. As shown in FIG. 3, the models include a rejection module that filters the texts corresponding to the plurality of speech recognition results: poorly recognized texts are filtered out, which avoids parsing them and in turn reduces the time complexity of the subsequent logic.
The above process of determining the plurality of sub-domain models can be understood as domain recognition of the plurality of speech recognition results (corresponding to the plurality of categories). For example, the sub-domain models may be used to recognize several language types, such as sports, appliance-control, and music languages, with the corresponding categories being the sports, appliance-control, and music language categories. The foregoing is merely an example and is not intended to be limiting.
In the embodiment of the application, the plurality of voice recognition results are analyzed based on the pre-trained plurality of sub-domain models to obtain a plurality of response results, and after determining a plurality of sub-domain models (i.e., domains) corresponding to the plurality of voice recognition results, a plurality of categories corresponding to the plurality of voice recognition results can be determined.
The above process of determining multiple classes may be understood as a sub-domain recognition of multiple speech recognition results, which may be implemented by different classifiers.
User intention analysis and slot extraction can be performed for the different sub-domains by different parsers. For example, suppose the text corresponding to a speech recognition result is "at 3:20, turn on the air conditioner in the living room". The user intention is to turn on a device, and the slots are "time" = "3:20", "location" = "living room", "device" = "air conditioner". By recognizing the user intention and extracting the slots from this text, it can be assigned to the control domain, which handles home-appliance control requests. Texts of other domains are identified accordingly; for example, the song domain handles requests to play songs, and so on.
Because the data distributions of the domains differ, different parsers are used for the different distributions, so that each module (i.e., sub-domain) is highly decoupled and focuses only on its own rules. Although combining text from different speech recognition engines across multiple modules could compound error rates, this blocked processing reduces the error rate of each word in the text, so the parsing effect is greatly improved. A small dispatch sketch follows.
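To make the decoupling concrete, the following Python sketch dispatches a recognized text to a per-domain parser. The domains, keyword rules, and regular expression are hypothetical stand-ins for the trained classifier and parsers described above.

```python
import re

# Hypothetical per-domain parsers: each sub-domain knows only its own
# slot rules, which keeps the modules decoupled.
def parse_control(text):
    slots = {}
    m = re.search(r'\d{1,2}:\d{2}', text)
    if m:
        slots['time'] = m.group()
    for loc in ('living room', 'bedroom', 'kitchen'):
        if loc in text:
            slots['location'] = loc
    for dev in ('air conditioner', 'washing machine', 'rice cooker'):
        if dev in text:
            slots['device'] = dev
    return {'intent': 'turn_on_device', 'slots': slots}

def parse_music(text):
    return {'intent': 'play_song', 'slots': {'query': text}}

# Stand-in for the trained domain classifier (keyword-based here).
def classify_domain(text):
    devices = ('air conditioner', 'washing machine', 'rice cooker')
    return 'control' if any(d in text for d in devices) else 'music'

PARSERS = {'control': parse_control, 'music': parse_music}

text = "at 3:20 turn on the air conditioner in the living room"
print(PARSERS[classify_domain(text)](text))
# {'intent': 'turn_on_device', 'slots': {'time': '3:20',
#  'location': 'living room', 'device': 'air conditioner'}}
```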
In the embodiment of the application, another method for determining the combined response result is also provided, as follows:
In the above manner, each text that passes the rejection module can be parsed to obtain the plurality of response results (which may be a plurality of semantic results) corresponding to the plurality of speech recognition results, and the plurality of response results are combined to obtain the combined response result. When combining, if the same position has different recognition results across the speech recognition engines, the inconsistent parts are processed uniformly using the first confidences together with cross entropy, so that the inconsistent parts of the response results become identical, giving the processed response results.
And combining the processed multiple response results to obtain a combined response result.
Take the speech recognition engines to be engine A, engine B, and engine C as an example.
The response result obtained by parsing the recognition result of engine A is: open the single-rinse program of the attack, with confidence 0.65; the extracted slot information is: mode = "single-rinse", confidence 0.95 (the confidence corresponding to "single-rinse");
By the cross-entropy formula, the parsing score corresponding to engine A is: mode: -0.95 × ln(0.65) = 0.4092;
The response result obtained by parsing the recognition result of engine B is: open the single-pick program of the washing machine, with confidence 0.71; the extracted slot information is: device = "washing machine", confidence 0.92 (the confidence corresponding to "washing machine"), and mode = "single-pick", confidence 0.68 (the confidence corresponding to "single-pick");
By the cross-entropy formula, the parsing scores corresponding to engine B are: device: -0.92 × ln(0.71) = 0.3151; mode: -0.68 × ln(0.71) = 0.2329;
The response result obtained by parsing the recognition result of engine C is: open the single-pick program of the attack, confidence 0.68; the extracted slot information is: mode = "single-pick", confidence 0.65 (the confidence corresponding to "single-pick");
By the cross-entropy formula, the parsing score corresponding to engine C is: mode: -0.65 × ln(0.68) = 0.2507;
The combined output of the response results is: open the single-rinse program of the washing machine, confidence 0.73; it can be appreciated that this combined output is the combined response result.
The slot information extracted from the combined response result is: device = "washing machine", confidence 0.91; mode = "single-rinse", confidence 0.93;
By the cross-entropy formula, the parsing scores corresponding to the combined output are: device: -0.91 × ln(0.73) = 0.2864; mode: -0.93 × ln(0.73) = 0.2927;
Combining the parsing results of engine A, engine B, engine C, and the combined output, and selecting the value with the highest confidence for each slot, the device corresponds to the washing machine and the mode corresponds to single-rinse.
As for the response result with the highest confidence: for example, the response result obtained by parsing the recognition result of engine A is: open the single-rinse program of the attack, confidence 0.65;
the response result obtained by parsing the recognition result of engine B is: open the single-pick program of the washing machine, confidence 0.71;
the response result obtained by parsing the recognition result of engine C is: open the single-pick program of the attack, confidence 0.68;
and the combined response result is: open the single-rinse program of the washing machine, confidence 0.73.
Selecting the value with the highest confidence gives the response result with the highest confidence, namely: open the single-rinse program of the washing machine, as sketched below.
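The scoring and selection above can be condensed into the following Python sketch. The natural logarithm reproduces the printed figures; the selection rule (per-slot highest confidence, then highest response confidence for the final answer) is the one the example implies rather than a formula stated in the text, and the candidate table is the hypothetical one from the example.

```python
import math

def slot_score(slot_conf, response_conf):
    """Cross-entropy style score used in the example:
    -slot_confidence * ln(response_confidence)."""
    return -slot_conf * math.log(response_conf)

# (response confidence, {slot: (value, slot confidence)})
candidates = {
    'A':     (0.65, {'mode': ('single-rinse', 0.95)}),
    'B':     (0.71, {'device': ('washing machine', 0.92),
                     'mode': ('single-pick', 0.68)}),
    'C':     (0.68, {'mode': ('single-pick', 0.65)}),
    'multi': (0.73, {'device': ('washing machine', 0.91),
                     'mode': ('single-rinse', 0.93)}),
}

print(round(slot_score(0.95, 0.65), 4))   # 0.4092, as in the example

# Per slot, keep the value with the highest slot confidence ...
best_slots = {}
for resp_conf, slots in candidates.values():
    for name, (value, conf) in slots.items():
        if name not in best_slots or conf > best_slots[name][1]:
            best_slots[name] = (value, conf)
print(best_slots)   # device -> washing machine, mode -> single-rinse

# ... and answer with the candidate whose response confidence is highest.
best = max(candidates, key=lambda e: candidates[e][0])
print(best)         # 'multi': open the single-rinse program of the washing machine
```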
In an embodiment of the present application, after the target voice output by the target device according to the response result with the highest confidence coefficient is obtained and the response voice of the target voice is determined based on the target voice, the method further includes: and in the case that the response voice instructs to execute the operation corresponding to the target voice, instructing the target device to execute the operation.
Optionally, the target device may be a smart home appliance. Suppose the response result with the highest confidence output by the terminal is: please cook at 12:00. The target device (e.g., a smart rice cooker) then outputs the target voice according to this response result: confirm turning on the cooking mode at 12:00? The terminal outputs the response voice according to the target voice: confirmed, turn it on. At this point the target device (e.g., the smart rice cooker) may turn on the cooking working mode at 12:00.
The above technical solution is described below with reference to the preferred embodiments, but is not limited to the technical solution of the embodiments of the present application.
Preferred embodiment 1
FIG. 3 is a schematic diagram of an alternative speech recognition model according to an embodiment of the present application. As shown in FIG. 3, the multi-engine recognition results are input to the rejection module of the plurality of sub-domain models, domain recognition is performed on them to determine the corresponding sub-domains, a plurality of intention analyses are obtained through the sub-domains, and the corresponding response results are generated. Furthermore, by intention combination over the plurality of intention analyses, the combined response result corresponding to the combined intention is obtained; the duplicated parts of the responses and the combined response result are then removed, a selection is made according to the defined rules, the final response is output, and the dialogue state is updated.
For example, engine A's recognition result is "open the single-rinse program of the attack"; after semantic parsing, the output response is: which device do you want to operate?
Engine B's recognition result is "open the single-pick program of the washing machine"; after semantic parsing, the output response is: which program should be turned on?
Engine C's recognition result is "open the single-pick program of the attack"; after semantic parsing, the output response is: I did not understand, please say it again;
The combined recognition result is "open the single-rinse program of the washing machine"; after semantic parsing, the output response is: setting the single-rinse program for you;
The responses of engine A, engine B, and engine C all either need more information, miss slot information, or simply fail to understand, while the semantic information corresponding to the combined response result is clear and its response is definite; finally, the complete and meaningful response is output.
After determining the final output response result, the dialogue state is updated in the man-machine dialogue system.
The multi-engine speech recognition scheme in the embodiments of the present application starts from the source: on the one hand, multiple speech recognition engines reduce the error rate of speech-to-text conversion; on the other hand, semantic parsing is performed on the multiple recognized texts, dialogue responses are generated and preferred or combined according to certain strategies, and the final output response is selected by the established strategy. In the embodiments of the present application, the error rate of semantic parsing is thus reduced; dialogue responses can be generated first, and the dialogue state is updated afterwards, once a response has been selected according to the strategy.
In summary, with reference to FIG. 2 and FIG. 3, semantic parsing can describe the score of a segment of speech through the plurality of response results, the plurality of first confidences, the combined response result and its second confidence, and by calculating their cross entropy. The best response is selected by confidence comparison. Because the optimal response content is selected from multiple responses, the fault tolerance of the model is increased and the system is more stable. The combination of the multi-engine speech recognition scheme and the multi-intention parsing scheme is better suited to voice products in complex real-world environments.
In this embodiment, a voice recognition device is further provided, and the device is used to implement the foregoing embodiments and preferred embodiments, and will not be described in detail. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. While the means described in the following embodiments are preferably implemented in software, implementation in hardware, or a combination of software and hardware, is also possible and contemplated.
FIG. 4 is a block diagram of an alternative speech recognition device according to an embodiment of the application, as shown in FIG. 4, comprising:
a first processing module 42, configured to analyze a plurality of speech recognition results input into a plurality of sub-domain models, obtain a plurality of response results corresponding to the plurality of speech recognition results, and combine the plurality of response results to obtain a combined response result;
a first determining module 44, configured to determine a response result with the highest confidence according to the plurality of response results, the plurality of first confidence levels of the plurality of response results, the combined response result, and the second confidence level of the combined response result;
the first obtaining module 46 is configured to obtain a target voice output by the target device according to the response result with the highest confidence, and determine a response voice of the target voice based on the target voice, where the response voice is used to instruct the target device to determine whether to execute an operation corresponding to the target voice.
According to the application, a plurality of voice recognition results input into a plurality of sub-domain models are analyzed to obtain a plurality of response results corresponding to the plurality of voice recognition results, and the plurality of response results are combined to obtain a combined response result; determining a response result with highest confidence according to the response results, the first confidence degrees of the response results, the combined response result and the second confidence degrees of the combined response result; and acquiring target voice output by the target equipment according to the response result with the highest confidence, and determining response voice of the target voice based on the target voice, wherein the response voice is used for indicating the target equipment to determine whether to execute the operation corresponding to the target voice. By adopting the technical scheme, the problems that in the related technology, in an intelligent voice dialogue system, errors are easy to occur in voice recognition, the accuracy is low, and then analysis errors are caused to influence the success rate of the dialogue are solved.
In the embodiment of the present application, the first processing module 42 includes: a dividing unit 422, configured to divide each response result of the plurality of response results into a plurality of segmentation words according to a division rule; a determining unit 424, configured to determine a minimum edit distance between any two response results among the plurality of response results, and record location information corresponding to the minimum edit distance, where the minimum edit distance is used to represent a number of inconsistent word segments between any two response results, and the location information is used to represent a location where the inconsistent word segments are located in any two response results; the first processing unit 426 is configured to combine the plurality of response results according to the minimum editing distance and the position information, so as to obtain the combined response result.
Fig. 6 is a block diagram of another alternative speech recognition apparatus according to an embodiment of the present application, and as shown in fig. 6, the apparatus further includes:
a second determining module 48, configured to determine a plurality of categories corresponding to a plurality of voice recognition results before analyzing the plurality of voice recognition results input to a plurality of sub-domain models, where the plurality of categories are used to represent language types to which the plurality of voice recognition results belong, the plurality of voice recognition results are in one-to-one correspondence with the plurality of categories, and the plurality of sub-domain models are in one-to-one correspondence with the plurality of categories; the input module 50 is configured to input the plurality of speech recognition results into the plurality of sub-domain models according to the plurality of categories.
As shown in fig. 6, the apparatus further includes: a second obtaining module 52, configured to obtain a plurality of voice recognition results obtained by recognizing the voice information to be recognized by a plurality of voice recognition engines; the third determining module 54 is configured to determine the plurality of first confidence degrees corresponding to the plurality of speech recognition results.
As shown in fig. 6, the apparatus further includes: and an instruction module 56, configured to instruct the target device to perform the operation when the response voice instructs to perform the operation corresponding to the target voice.
An embodiment of the present application also provides a storage medium including a stored program, wherein the program executes the method of any one of the above.
Alternatively, in the present embodiment, the above-described storage medium may be configured to store program code for performing the steps of:
s1, analyzing a plurality of voice recognition results input into a plurality of sub-domain models to obtain a plurality of response results corresponding to the plurality of voice recognition results, and combining the plurality of response results to obtain a combined response result;
s2, determining a response result with highest confidence according to the response results, the first confidence degrees of the response results, the combined response result and the second confidence degrees of the combined response result;
s3, obtaining target voice output by the target equipment according to the response result with highest confidence, and determining response voice of the target voice based on the target voice, wherein the response voice is used for indicating the target equipment to determine whether to execute operation corresponding to the target voice.
Alternatively, in the present embodiment, the storage medium may include, but is not limited to: a U-disk, a Read-Only Memory (ROM), a random access Memory (Random Access Memory, RAM), a removable hard disk, a magnetic disk, or an optical disk, or other various media capable of storing program codes.
Alternatively, specific examples in this embodiment may refer to examples described in the foregoing embodiments and optional implementations, and this embodiment is not described herein.
An embodiment of the application also provides an electronic device comprising a memory having stored therein a computer program and a processor arranged to run the computer program to perform the steps of any of the method embodiments described above.
Optionally, the electronic device may further include a transmission device and an input/output device, where the transmission device is connected to the processor, and the input/output device is connected to the processor.
Alternatively, in the present embodiment, the above-described processor may be configured to execute the following steps by a computer program:
s1, analyzing a plurality of voice recognition results input into a plurality of sub-domain models to obtain a plurality of response results corresponding to the plurality of voice recognition results, and combining the plurality of response results to obtain a combined response result;
s2, determining a response result with highest confidence according to the response results, the first confidence degrees of the response results, the combined response result and the second confidence degrees of the combined response result;
s3, obtaining target voice output by the target equipment according to the response result with highest confidence, and determining response voice of the target voice based on the target voice, wherein the response voice is used for indicating the target equipment to determine whether to execute operation corresponding to the target voice.
Alternatively, specific examples in this embodiment may refer to examples described in the foregoing embodiments and optional implementations, and this embodiment is not described herein.
It will be appreciated by those skilled in the art that the modules or steps of the application described above may be implemented in a general purpose computing device, they may be concentrated on a single computing device, or distributed across a network of computing devices, they may alternatively be implemented in program code executable by computing devices, so that they may be stored in a memory device for execution by computing devices, and in some cases, the steps shown or described may be performed in a different order than that shown or described, or they may be separately fabricated into individual integrated circuit modules, or multiple modules or steps within them may be fabricated into a single integrated circuit module for implementation. Thus, the present application is not limited to any specific combination of hardware and software.
The above description is only of the preferred embodiments of the present application and is not intended to limit the present application, and various modifications and variations will be apparent to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the principle of the present application should be included in the protection scope of the present application.

Claims (10)

1. A voice recognition method applied to a voice interaction system, comprising:
analyzing a plurality of voice recognition results input into a plurality of sub-domain models to obtain a plurality of response results corresponding to the plurality of voice recognition results, and combining the plurality of response results to obtain a combined response result, wherein the plurality of voice recognition results are in one-to-one correspondence with a plurality of categories, the plurality of sub-domain models are in one-to-one correspondence with the plurality of categories, and the plurality of categories are used for representing language types to which the plurality of voice recognition results belong;
determining a response result with highest confidence according to the response results, the first confidence degrees of the response results, the combined response result and the second confidence degrees of the combined response result;
and acquiring target voice output by the target equipment according to the response result with the highest confidence coefficient, and determining response voice of the target voice based on the target voice, wherein the response voice is used for indicating the target equipment to determine whether to execute the operation corresponding to the target voice.
2. The method of claim 1, wherein combining the plurality of response results to obtain a combined response result comprises:
dividing each response result in the plurality of response results into a plurality of segmentation words according to a division rule;
determining the minimum editing distance between any two response results in the plurality of response results, and recording position information corresponding to the minimum editing distance, wherein the minimum editing distance is used for representing the number of inconsistent segmentation words between any two response results, and the position information is used for representing the positions of the inconsistent segmentation words in the any two response results;
and combining the plurality of response results according to the minimum editing distance and the position information to obtain the combined response result.
3. The method of claim 1, wherein prior to said analyzing the plurality of speech recognition results input to the plurality of sub-domain models, the method further comprises:
determining the multiple categories corresponding to the multiple voice recognition results;
and inputting the voice recognition results into the sub-domain models according to the categories.
4. The method of claim 1, wherein prior to said analyzing the plurality of speech recognition results input to the plurality of sub-domain models, the method further comprises:
acquiring a plurality of voice recognition results obtained by recognizing voice information to be recognized by a plurality of voice recognition engines;
and determining the first confidence degrees corresponding to the voice recognition results.
5. The method according to claim 1, wherein after the target voice output by the target device according to the response result with the highest confidence is obtained and the response voice of the target voice is determined based on the target voice, the method further comprises:
and under the condition that the response voice indicates to execute the operation corresponding to the target voice, indicating the target equipment to execute the operation.
6. A speech recognition apparatus, comprising:
the first processing module is used for analyzing a plurality of voice recognition results input into a plurality of sub-domain models to obtain a plurality of response results corresponding to the plurality of voice recognition results, and combining the plurality of response results to obtain a combined response result, wherein the plurality of voice recognition results are in one-to-one correspondence with a plurality of categories, the plurality of sub-domain models are in one-to-one correspondence with the plurality of categories, and the plurality of categories are used for representing language types to which the plurality of voice recognition results belong;
the first determining module is used for determining a response result with highest confidence according to the plurality of response results, the plurality of first confidence degrees of the plurality of response results, the combined response result and the second confidence degree of the combined response result;
the first acquisition module is used for acquiring target voice output by the target equipment according to the response result with the highest confidence coefficient, and determining response voice of the target voice based on the target voice, wherein the response voice is used for indicating the target equipment to determine whether to execute the operation corresponding to the target voice.
7. The apparatus of claim 6, wherein the first processing module comprises:
the dividing unit is used for dividing each response result in the plurality of response results into a plurality of segmentation words according to a dividing rule;
a determining unit, configured to determine a minimum editing distance between any two response results in the plurality of response results, and record location information corresponding to the minimum editing distance, where the minimum editing distance is used to represent a number of inconsistent segmentation words between any two response results, and the location information is used to represent a location of the inconsistent segmentation words in the any two response results;
and the first processing unit is used for combining the plurality of response results according to the minimum editing distance and the position information to obtain the combined response result.
8. The apparatus of claim 6, wherein the apparatus further comprises:
the second determining module is used for determining a plurality of categories corresponding to a plurality of voice recognition results before analyzing the plurality of voice recognition results input into a plurality of sub-domain models;
and the input module is used for inputting the voice recognition results into the sub-domain models according to the categories.
9. A storage medium having a computer program stored therein, wherein the computer program is arranged to perform the method of any of claims 1 to 5 when run.
10. An electronic device comprising a memory and a processor, characterized in that the memory has stored therein a computer program, the processor being arranged to run the computer program to perform the speech recognition method as claimed in any of the claims 1 to 5.
CN202010712229.3A 2020-07-22 2020-07-22 Speech recognition method and device, storage medium and electronic equipment Active CN111883122B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010712229.3A CN111883122B (en) 2020-07-22 2020-07-22 Speech recognition method and device, storage medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010712229.3A CN111883122B (en) 2020-07-22 2020-07-22 Speech recognition method and device, storage medium and electronic equipment

Publications (2)

Publication Number Publication Date
CN111883122A CN111883122A (en) 2020-11-03
CN111883122B true CN111883122B (en) 2023-10-27

Family

ID=73155211

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010712229.3A Active CN111883122B (en) 2020-07-22 2020-07-22 Speech recognition method and device, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN111883122B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112614490B (en) * 2020-12-09 2024-04-16 北京罗克维尔斯科技有限公司 Method, device, medium, equipment, system and vehicle for generating voice instruction
CN112735394B (en) * 2020-12-16 2022-12-30 青岛海尔科技有限公司 Semantic parsing method and device for voice
CN112836522B (en) * 2021-01-29 2023-07-21 青岛海尔科技有限公司 Method and device for determining voice recognition result, storage medium and electronic device
CN113593535A (en) * 2021-06-30 2021-11-02 青岛海尔科技有限公司 Voice data processing method and device, storage medium and electronic device
CN113793597A (en) * 2021-09-15 2021-12-14 云知声智能科技股份有限公司 Voice recognition method and device, electronic equipment and storage medium
CN114840653B (en) * 2022-04-26 2023-08-01 北京百度网讯科技有限公司 Dialogue processing method, device, equipment and storage medium

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7363228B2 (en) * 2003-09-18 2008-04-22 Interactive Intelligence, Inc. Speech recognition system and method
US7340395B2 (en) * 2004-04-23 2008-03-04 Sap Aktiengesellschaft Multiple speech recognition engines
CA2785081C (en) * 2009-12-31 2021-03-30 Volt Delta Resources, Llc Method and system for processing multiple speech recognition results from a single utterance
EP3089159B1 (en) * 2015-04-28 2019-08-28 Google LLC Correcting voice recognition using selective re-speak
JP6570651B2 (en) * 2015-11-25 2019-09-04 三菱電機株式会社 Voice dialogue apparatus and voice dialogue method
US10971157B2 (en) * 2017-01-11 2021-04-06 Nuance Communications, Inc. Methods and apparatus for hybrid speech recognition processing

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2010139744A (en) * 2008-12-11 2010-06-24 Ntt Docomo Inc Voice recognition result correcting device and voice recognition result correction method
CN104795069A (en) * 2014-01-21 2015-07-22 腾讯科技(深圳)有限公司 Speech recognition method and server
CN106796788A (en) * 2014-08-28 2017-05-31 苹果公司 Automatic speech recognition is improved based on user feedback
WO2018059957A1 (en) * 2016-09-30 2018-04-05 Robert Bosch Gmbh System and method for speech recognition
CN109791767A (en) * 2016-09-30 2019-05-21 罗伯特·博世有限公司 System and method for speech recognition
CN110634486A (en) * 2018-06-21 2019-12-31 阿里巴巴集团控股有限公司 Voice processing method and device
CN110148416A (en) * 2019-04-23 2019-08-20 腾讯科技(深圳)有限公司 Audio recognition method, device, equipment and storage medium
CN110491383A (en) * 2019-09-25 2019-11-22 北京声智科技有限公司 A kind of voice interactive method, device, system, storage medium and processor
CN111049996A (en) * 2019-12-26 2020-04-21 苏州思必驰信息科技有限公司 Multi-scene voice recognition method and device and intelligent customer service system applying same

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
SA Khoubrouy et al. Microphone array processing strategies for distant-based automatic speech recognition. IEEE Signal Processing Letters. 2016, pp. 1344-1348. *
Zhou Zuyang. Research on human voice recognition technology based on the VxWorks platform. China Excellent Doctoral and Master's Dissertations Full-text Database. 2005, pp. 5-7. *

Also Published As

Publication number Publication date
CN111883122A (en) 2020-11-03

Similar Documents

Publication Publication Date Title
CN111883122B (en) Speech recognition method and device, storage medium and electronic equipment
US5812975A (en) State transition model design method and voice recognition method and apparatus using same
US20210149994A1 (en) Device and method for machine reading comprehension question and answer
CN110853626B (en) Bidirectional attention neural network-based dialogue understanding method, device and equipment
JP2005084681A (en) Method and system for semantic language modeling and reliability measurement
US20080183468A1 (en) Augmentation and calibration of output from non-deterministic text generators by modeling its characteristics in specific environments
CN108364650B (en) Device and method for adjusting voice recognition result
CN108027814B (en) Stop word recognition method and device
CN106649253B (en) Auxiliary control method and system based on rear verifying
CN108932944B (en) Decoding method and device
CN110415679A (en) Voice error correction method, device, equipment and storage medium
CN111178081B (en) Semantic recognition method, server, electronic device and computer storage medium
CN109785846A (en) The role recognition method and device of the voice data of monophonic
CN111161726A (en) Intelligent voice interaction method, equipment, medium and system
CN111354354B (en) Training method, training device and terminal equipment based on semantic recognition
CN114420102B (en) Method and device for speech sentence-breaking, electronic equipment and storage medium
CN110866094B (en) Instruction recognition method, instruction recognition device, storage medium, and electronic device
JP5975938B2 (en) Speech recognition apparatus, speech recognition method and program
CN111640423B (en) Word boundary estimation method and device and electronic equipment
CN111680514B (en) Information processing and model training method, device, equipment and storage medium
CN102237082B (en) Self-adaption method of speech recognition system
CN114783424A (en) Text corpus screening method, device, equipment and storage medium
CN110428814A (en) A kind of method and device of speech recognition
KR20110071742A (en) Apparatus for utterance verification based on word specific confidence threshold
JP2007026347A (en) Text mining device, text mining method and text mining program

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant