CN111883122A - Voice recognition method and device, storage medium and electronic equipment - Google Patents

Voice recognition method and device, storage medium and electronic equipment

Info

Publication number
CN111883122A
CN111883122A (application CN202010712229.3A)
Authority
CN
China
Prior art keywords
response
results
voice
result
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010712229.3A
Other languages
Chinese (zh)
Other versions
CN111883122B (en)
Inventor
赵培
朱文博
韩俊明
苏腾荣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Haier Uplus Intelligent Technology Beijing Co Ltd
Original Assignee
Haier Uplus Intelligent Technology Beijing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Haier Uplus Intelligent Technology Beijing Co Ltd filed Critical Haier Uplus Intelligent Technology Beijing Co Ltd
Priority to CN202010712229.3A priority Critical patent/CN111883122B/en
Publication of CN111883122A publication Critical patent/CN111883122A/en
Application granted granted Critical
Publication of CN111883122B publication Critical patent/CN111883122B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L15/26 Speech to text systems
    • G10L2015/223 Execution procedure of a spoken command

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a voice recognition method and device, a storage medium and electronic equipment, wherein the method comprises the following steps: analyzing a plurality of voice recognition results input to a plurality of sub-domain models to obtain a plurality of response results corresponding to the plurality of voice recognition results, and combining the plurality of response results to obtain a combined response result; determining the response result with the highest confidence level according to the plurality of response results, the plurality of first confidence levels of the plurality of response results, the combined response result and the second confidence level of the combined response result; and acquiring target voice output by the target equipment according to the response result with the highest confidence level, and determining the response voice of the target voice based on the target voice, wherein the response voice is used for indicating the target equipment to determine whether to execute the operation corresponding to the target voice. By adopting the technical scheme, the invention solves the problem that, in an intelligent voice dialogue system, voice recognition is prone to errors and has low accuracy, so that parsing errors reduce the success rate of the dialogue.

Description

Voice recognition method and device, storage medium and electronic equipment
Technical Field
The present invention relates to the field of communications, and in particular, to a voice recognition method and apparatus, a storage medium, and an electronic device.
Background
With the rise of automatic speech recognition (ASR) technology, speech recognition is increasingly applied to intelligent voice dialogue systems. However, changes in the acoustic environment and unclear pronunciation can cause errors in the recognized text. Such errors cascade into the downstream semantic parsing module, causing parsing errors that reduce the success rate of the dialogue.
For the problem in the related art that, in an intelligent voice dialogue system, voice recognition is prone to errors and its accuracy is not high, so that parsing errors reduce the success rate of the dialogue, no effective technical solution has yet been provided.
Disclosure of Invention
The embodiment of the invention provides a voice recognition method and device, a storage medium and electronic equipment, which at least solve the problem in the related art that, in an intelligent voice dialogue system, voice recognition is prone to errors and has low accuracy, so that parsing errors reduce the success rate of the dialogue.
According to an embodiment of the present invention, there is provided a speech recognition method applied to a speech interaction system, including: analyzing a plurality of voice recognition results input to a plurality of sub-domain models to obtain a plurality of response results corresponding to the plurality of voice recognition results, and combining the plurality of response results to obtain a combined response result; determining a response result with the highest confidence level according to the plurality of response results, the plurality of first confidence levels of the plurality of response results, the combined response result and the second confidence level of the combined response result; and acquiring a target voice output by the target equipment according to the response result with the highest confidence level, and determining the response voice of the target voice based on the target voice, wherein the response voice is used for indicating the target equipment to determine whether to execute the operation corresponding to the target voice.
According to another embodiment of the present invention, there is also provided a speech recognition apparatus including: the first processing module is used for analyzing a plurality of voice recognition results input to a plurality of sub-domain models to obtain a plurality of response results corresponding to the plurality of voice recognition results, and combining the plurality of response results to obtain a combined response result; a first determining module, configured to determine a response result with a highest confidence level according to the multiple response results, the multiple first confidence levels of the multiple response results, the combined response result, and the second confidence level of the combined response result; and a first obtaining module, configured to obtain a target voice output by a target device according to a response result with a highest confidence level, and determine a response voice of the target voice based on the target voice, where the response voice is used to instruct the target device to determine whether to execute an operation corresponding to the target voice.
According to another embodiment of the present invention, there is also provided a storage medium including a stored program, wherein the program executes the voice recognition method according to any one of the above.
According to another embodiment of the present invention, there is also provided an electronic device, comprising a memory and a processor, wherein the memory stores a computer program, and the processor is configured to run the computer program to perform the voice recognition method described above.
According to the invention, firstly, a plurality of voice recognition results input to a plurality of sub-domain models are analyzed, a plurality of response results corresponding to the plurality of voice recognition results can be obtained, and then the plurality of response results are combined to obtain a combined response result; the response result with the highest confidence level is determined according to the plurality of response results, the plurality of first confidence levels of the plurality of response results, the combined response result and the second confidence level of the combined response result; and the target voice output by the target equipment according to the response result with the highest confidence level is acquired, and the response voice of the target voice is determined based on the target voice, wherein the response voice is used for indicating the target equipment to determine whether to execute the operation corresponding to the target voice. By adopting the technical scheme, the problem in the related art that, in an intelligent voice dialogue system, voice recognition is prone to errors and has low accuracy, so that parsing errors reduce the success rate of the dialogue, is solved.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:
FIG. 1 is a flow diagram of an alternative speech recognition method according to an embodiment of the present invention;
FIG. 2 is a flow chart of an alternative method of determining a combined response result according to an embodiment of the invention;
FIG. 3 is a schematic illustration of an alternative speech recognition model according to an embodiment of the present invention;
FIG. 4 is a block diagram of an alternative speech recognition arrangement according to an embodiment of the present invention;
FIG. 5 is a block diagram of an alternative first processing module according to an embodiment of the invention;
FIG. 6 is a block diagram of another alternative speech recognition apparatus according to an embodiment of the present invention.
Detailed Description
The invention will be described in detail hereinafter with reference to the accompanying drawings in conjunction with embodiments. It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order.
It should be noted that, in an existing voice dialogue system, natural speech audio data from a user may be acquired through an input device of the voice interaction system and input to multiple speech recognition engines to recognize the user's speech, thereby obtaining multiple speech recognition results. Generally, the overall recognition results are marked and aligned, the multiple speech recognition results are then ranked to obtain an optimal speech recognition result, and the optimal result is passed to a semantic parsing module for semantic parsing. However, different speech recognition engines may recognize a specific word inconsistently; since some errors are never corrected and erroneous candidates may be selected as an engine's recognition result, using the erroneous parts as the final output of the speech recognition engines reduces the overall accuracy of speech recognition.
The existing semantic parsing scheme processes only one text, so if the speech recognition is wrong, all subsequent semantic parsing will be wrong. Moreover, semantic parsing uses a single model, which is complex and costly to iterate. After semantic parsing, the dialogue state is directly updated and the dialogue response is generated and output; if one module is wrong, all downstream modules are wrong, and these stages are never verified against one another.
The existing multi-engine approach mainly optimizes each speech recognition result as a whole and performs semantic parsing only after obtaining a final recognition result; that is, semantic parsing is performed on a single engine's recognition result, the dialogue state is updated directly, and the dialogue is then generated. However, recognition by a single speech recognition engine yields a high error rate in the recognition result, and performing semantic parsing and text generation on top of the recognized text compounds the error rate layer by layer, which reduces the stability of the system and makes the requirements difficult to meet.
The following embodiments of the present invention provide a solution that mixes the outputs of multiple speech recognition engines to make up for the disadvantage of outputting each engine's result individually, thereby improving recognition accuracy, and that generates and processes a final response result by parsing both the individual engine results and the combined result.
In order to solve the above technical problem, in this embodiment 1, a speech recognition method is provided, and fig. 1 is a flowchart of an alternative speech recognition method according to an embodiment of the present invention, as shown in fig. 1, the flowchart includes the following steps:
step S102, analyzing a plurality of voice recognition results input to a plurality of sub-domain models to obtain a plurality of response results corresponding to the plurality of voice recognition results, and combining the plurality of response results to obtain a combined response result;
step S104, determining a response result with the highest confidence according to the response results, the first confidence of the response results, the combined response result and the second confidence of the combined response result;
step S106, obtaining a target voice output by the target device according to the response result with the highest confidence, and determining a response voice of the target voice based on the target voice, where the response voice is used to instruct the target device to determine whether to execute an operation corresponding to the target voice.
According to the invention, firstly, a plurality of voice recognition results input to a plurality of sub-domain models are analyzed, a plurality of response results corresponding to the plurality of voice recognition results can be obtained, and then the plurality of response results are combined to obtain a combined response result; the response result with the highest confidence level is determined according to the plurality of response results, the plurality of first confidence levels of the plurality of response results, the combined response result and the second confidence level of the combined response result; and the target voice output by the target equipment according to the response result with the highest confidence level is acquired, and the response voice of the target voice is determined based on the target voice, wherein the response voice is used for indicating the target equipment to determine whether to execute the operation corresponding to the target voice. By adopting the technical scheme, the problem in the related art that, in an intelligent voice dialogue system, voice recognition is prone to errors and has low accuracy, so that parsing errors reduce the success rate of the dialogue, is solved.
Optionally, before the analyzing the plurality of speech recognition results input to the plurality of sub-domain models, the method further includes: acquiring a plurality of voice recognition results obtained by recognizing voice information to be recognized by a plurality of voice recognition engines; and determining the plurality of first confidence degrees corresponding to the plurality of voice recognition results.
The voice information to be recognized may be natural voice audio data of a user in the voice interactive system. The speech information to be recognized is recognized through a plurality of speech recognition engines, a plurality of speech recognition results of the speech information to be recognized in the speech recognition engines can be obtained, and a plurality of first confidence degrees corresponding to the speech recognition results are determined.
Since the accuracy with which the multiple speech recognition engines recognize speech differs, assume the multiple speech recognition engines are engine A, engine B, and engine C, and the accuracy of these three engines in recognizing speech is: engine A 90%, engine B 88%, and engine C 96%. Then the first confidence levels of the three speech recognition results recognized by the three engines may be 0.90, 0.88, and 0.96, respectively. It is understood that the above is only an example, and the present embodiment is not limited thereto.
In order to make the response result corresponding to the speech recognition result more accurate, in the embodiment of the present invention, the multiple response results are combined to obtain a combined response result, which can be implemented by the following technical solution: dividing each response result in the plurality of response results into a plurality of word segments according to a division rule; determining a minimum edit distance between any two response results in the plurality of response results, and recording position information corresponding to the minimum edit distance, wherein the minimum edit distance is used for indicating the number of inconsistent word segments between any two response results, and the position information is used for indicating the positions of the inconsistent word segments in the two response results; and combining the plurality of response results according to the minimum edit distance and the position information to obtain the combined response result.
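As an illustration, this combination step might be sketched in Python as follows. This is a minimal sketch under simplifying assumptions: the recognition results are assumed to be already segmented into word lists, the alignment is simplified to position-by-position comparison rather than a full alignment traceback, and all function names are illustrative rather than taken from the patent.

```python
# Minimal sketch of the combination step; names are illustrative, not from
# the patent. Each recognition result is assumed to be pre-segmented into a
# list of words, and alignment is simplified to position-by-position.
from itertools import combinations

def edit_distance(a, b):
    """Word-level Levenshtein distance between two token lists."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n]

def combine_results(results):
    """results: token lists ordered by engine priority (highest first).
    Where the engines disagree at a position, prefer the token shared by
    the pair of engines whose overall edit distance is smallest; if no
    pair agrees there, fall back to the highest-priority engine."""
    pairs = sorted(combinations(range(len(results)), 2),
                   key=lambda p: edit_distance(results[p[0]], results[p[1]]))
    combined = []
    for k in range(min(len(r) for r in results)):
        tokens = [r[k] for r in results]
        if len(set(tokens)) == 1:          # all engines agree
            combined.append(tokens[0])
            continue
        for i, j in pairs:                 # closest agreeing pair wins
            if tokens[i] == tokens[j]:
                combined.append(tokens[i])
                break
        else:                              # no agreement: engine priority
            combined.append(tokens[0])
    return combined
```

In the patent's full procedure the inconsistent spans are located through the recorded position information rather than assumed to line up index by index; the sketch only conveys the preference order (pairwise agreement weighted by edit distance, then engine priority).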
In the embodiment of the present invention, as shown in fig. 2, a method for determining a combined response result is further provided, which is as follows:
step S201, receiving audio input data from a user;
firstly, an audio input file (corresponding to the voice information to be recognized) from a user is obtained by using an audio input device of a man-machine conversation system, and then the audio input file is input to a plurality of voice recognition engines through a cloud service to generate a plurality of candidate voice recognition results (namely the plurality of voice recognition results).
Step S202, checking the consistency of each recognition result according to the priority of the multi-voice recognition engine; when the plurality of speech recognition results are consistent or completely inconsistent, jumping to step S203; skipping to step S204 under the condition that the multiple voice recognition results are not completely inconsistent;
step S203, a plurality of voice recognition results and confidence degrees are sequentially transmitted to a semantic analysis module;
step S204, the multi-engine recognition result sequentially uses a bidirectional maximum matching algorithm to perform word segmentation operation;
step S205, sequentially calculating the minimum edit distance between the ith word of the engine and the ith words of other engines;
step S206, determining whether the minimum editing distance between different engines is smaller than an estimated threshold value;
wherein, when the value is higher than the estimated threshold value, skipping to step S207;
if the value is lower than or equal to the estimated threshold value, jumping to step S208;
step S207, selecting the one with the minimum editing distance and higher engine priority;
step S208, determining the recognition result of the ith word;
step S209, combine all partial words into a combined response result, set a response weight coefficient according to the proportion of the word in the sentence, and estimate the second confidence.
Further, the speech recognition results may be divided into several parts according to the same division rule; for example, the bidirectional maximum matching algorithm may be used to segment the multiple speech recognition results, so that the word segmentation is unified across engines.
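Bidirectional maximum matching itself is a standard dictionary-based segmenter. A minimal sketch, assuming a small in-memory lexicon and an unspaced input string (as in Chinese text), is given below; the tie-breaking rules are common heuristics and are not specified by the patent.

```python
# Minimal bidirectional maximum matching (BMM) sketch; the lexicon and
# tie-break rules here are illustrative assumptions, not from the patent.

def forward_mm(text, lexicon, max_len=4):
    """Forward maximum matching: greedily take the longest lexicon word
    starting at the current position; unknown single characters pass through."""
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:
                tokens.append(piece)
                i += size
                break
    return tokens

def backward_mm(text, lexicon, max_len=4):
    """Backward maximum matching: same idea, scanning from the end."""
    tokens, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in lexicon:
                tokens.insert(0, piece)
                j -= size
                break
    return tokens

def bidirectional_mm(text, lexicon):
    """Keep the direction that yields fewer tokens; on a tie, prefer the
    result with fewer single-character tokens (a common heuristic)."""
    fwd, bwd = forward_mm(text, lexicon), backward_mm(text, lexicon)
    if len(fwd) != len(bwd):
        return fwd if len(fwd) < len(bwd) else bwd
    singles = lambda toks: sum(1 for t in toks if len(t) == 1)
    return fwd if singles(fwd) <= singles(bwd) else bwd

print(bidirectional_mm("abcd", {"ab", "cd", "abc"}))  # ['ab', 'cd']
```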
For the case where parts of the recognition results of the multiple speech engines are inconsistent, minimum edit distance calculation is performed, and all calculation results between each pair of words are recorded. For each inconsistent part, the pairing with the minimum edit distance is preferred, and the most likely recognition result is obtained from the minimum edit distance result; the boundary positions are marked; and the parts are recombined into the final output recognition result (i.e., the combined response result).
for example, assume that the plurality of speech recognition engines are engine a, engine B, and engine C, respectively, and:
the engine A identifies the result of the input audio and performs word segmentation to obtain: "set to low speed seal";
the engine B identifies the result of the input audio and divides the word into: "set as plastic drop wind";
the engine C recognizes the result and performs word segmentation on the input audio as follows: "what is set to low speed wind".
Minimum edit distance calculation is performed on the recognition results of the three engines pairwise, obtaining the following results, with the position information of the inconsistent parts recorded:
The minimum edit distance between A and B is 3, with position information [A(3,5), B(3,5)] (indicating three substitution operations);
the minimum edit distance between A and C is 4, with position information [A(5,5), C(5,8)] (indicating one substitution and three insertion operations);
the minimum edit distance between B and C has two inconsistent portions, 2 for the first and 3 for the second, with position information [B(3,4), C(3,4)] and [B(end), C(6,8)] (the first portion indicating one substitution, the second portion three insertion operations).
The results of engine A and engine C are consistent before position 4, so the edit distance of that part is 0, and through the minimum edit distance between A and C, the new combined result for the first inconsistent portion is: "low speed".
Through the two minimum edit distances between B and C and the recorded position information, engine B and engine C are considered consistent at the 5th position. In summary, the three speech recognition results of the three speech recognition engines are combined to obtain the combined result: "set to low speed wind" (corresponding to the combined response result described above).
Accordingly, after obtaining the combined response result, an embodiment of the present invention further provides a method for determining a second confidence corresponding to the combined response result, where the method includes:
since there is a certain false positive in the combined result (i.e., the combined response result), the result of the confidence level needs to be recalculated for reference.
Here, the following description is made by taking the above results of 3 engine identifications as examples:
as can be seen from the above, the combined result obtained from the recognition results of the multiple speech engines is: "set to low velocity wind" (i.e., first speech recognition result).
For the multiple speech recognition results, each speech recognition result includes multiple words, and each word has corresponding position information. By obtaining the position information of each word in each speech recognition result, the proportion of each word in the respective engine's recognition result can be obtained. Taking the multiple speech recognition engines A, B, and C as an example, the average proportion of each word (i.e., word 1 "set to", word 2 "low speed", and word 3 "wind") across the engines' recognition results can be taken as its weight. The weights corresponding to the three words are denoted w1, w2, and w3, and are calculated as follows:

w1 = (C1/CA + C1/CB + C1/CC) / 3

where C1 is the word count of word 1, CA is the overall word count of the recognition result obtained by engine A, CB is the overall word count of the recognition result obtained by engine B, and CC is the overall word count of the recognition result obtained by engine C. The weights w2 and w3 can be calculated by the same principle, and will not be described in detail here.
Then, the confidence of the final combined recognition result (i.e., the combined response result) is calculated by the weighted average method:

Conf_multi = w1 · Conf_A + w2 · Conf_B + w3 · Conf_C

where Conf_A, Conf_B, and Conf_C represent the confidences of engines A, B, and C, respectively, and Conf_multi represents the second confidence of the combined response result.
It should be noted that, assuming the accuracies of the three engines in recognizing speech are 90%, 88%, and 96%, respectively, Conf_A may be 0.90, Conf_B may be 0.88, and Conf_C may be 0.96.
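The weight and confidence computation above can be expressed compactly. The sketch below follows the textual description; pairing each combined word's weight with one engine confidence, and normalizing by the total weight, are assumptions of this illustration, since the patent's exact formula images are not reproduced here.

```python
# Sketch of the weighted-average second confidence; the word-to-engine
# pairing and the normalization are assumptions of this illustration.

def word_weight(word_count, engine_totals):
    """w_i: average over engines of (word i's count) / (engine's total count)."""
    return sum(word_count / total for total in engine_totals) / len(engine_totals)

def combined_confidence(word_counts, engine_totals, engine_confs):
    """Conf_multi as a weighted average of the engine confidences."""
    weights = [word_weight(c, engine_totals) for c in word_counts]
    return sum(w * conf for w, conf in zip(weights, engine_confs)) / sum(weights)

# Toy numbers: three words of 3, 2, and 1 characters; engine outputs of
# 6, 6, and 9 characters; engine confidences 0.90, 0.88, 0.96.
print(round(combined_confidence([3, 2, 1], [6, 6, 9], [0.90, 0.88, 0.96]), 3))
```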
In an embodiment of the present invention, before analyzing the plurality of speech recognition results input to the plurality of sub-domain models, the method further includes: determining a plurality of categories corresponding to the plurality of voice recognition results, wherein the plurality of categories are used for representing language types to which the plurality of voice recognition results belong, the plurality of voice recognition results are in one-to-one correspondence with the plurality of categories, and the plurality of sub-domain models are in one-to-one correspondence with the plurality of categories; and inputting the plurality of voice recognition results to the plurality of sub-domain models according to the plurality of categories.
The plurality of sub-domain models can be obtained by training a language model and a classifier with a neural network on millions of items of real user corpora. As shown in fig. 3, the models include a rejection module, which performs text filtering on the texts corresponding to the multiple speech recognition results and can filter out poorly recognized texts. By filtering texts through the rejection module, parsing of poorly recognized texts can be avoided, and the time complexity of the subsequent logic can be reduced.
The above process of determining a plurality of sub-domain models may be understood as a domain recognition (corresponding to the above plurality of classes) on a plurality of speech recognition results. For example, the plurality of sub-domain models may be used to identify a plurality of language types, such as a sports language, a home appliance control language, a music language, and so on, and the plurality of categories are a sports language category, a home appliance control language category, a music language category, and so on. The above is merely an example and is not intended to be limiting.
In an embodiment of the present invention, the plurality of speech recognition results are analyzed based on the plurality of pre-trained sub-domain models to obtain a plurality of response results, and after the plurality of sub-domain models (i.e., domains) corresponding to the plurality of speech recognition results are determined, a plurality of categories corresponding to the plurality of speech recognition results may be determined.
The above process of determining multiple classes can be understood as a sub-domain recognition of multiple speech recognition results, which can be implemented by different classifiers.
For different sub-domains, user intention analysis and slot extraction can be performed through different parsers. For example, suppose the text information corresponding to the voice recognition result is "at 3:20, turn on the air conditioner in the living room". The user intention is to turn on a device, and the slots are: "time" = "3:20"; "position" = "living room"; "device" = "air conditioner". It can be understood that, by identifying the user intention and extracting the slots, this piece of text information can be assigned to the control domain, where the control domain handles requests for controlling household appliances. Accordingly, text information of other domains can also be identified, such as the song domain, which handles requests to play songs, and so on.
Because the data distribution of different domains differs, different parsers are used according to each domain's particular distribution, so that each module (i.e., sub-domain) is highly decoupled and focuses only on the rules specific to it. In this way, although combining text information from different speech recognition engines across multiple modules (i.e., sub-domains) could compound the error rate, the block-by-block processing reduces the error rate of each word in the text information, so the parsing effect is greatly improved. A sketch of such per-domain routing follows.
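As a rough illustration of this decoupled, per-domain design, the routing could look like the following sketch; the domain names, keyword classifier, and parser bodies are all hypothetical stand-ins for the trained classifier and parsers described above.

```python
# Hypothetical routing of a recognized text to per-domain parsers; the
# domains, keyword rules, and parser logic are illustrative stand-ins
# for the trained classifier and parsers described in the text.
from typing import Callable, Dict

def parse_control(text: str) -> dict:
    """Toy control-domain parser: intent plus a few keyword-based slots."""
    slots = {}
    if "air conditioner" in text:
        slots["device"] = "air conditioner"
    if "living room" in text:
        slots["position"] = "living room"
    return {"intent": "turn_on_device", "slots": slots}

def parse_song(text: str) -> dict:
    """Toy song-domain parser."""
    return {"intent": "play_song", "slots": {}}

PARSERS: Dict[str, Callable[[str], dict]] = {
    "control": parse_control,
    "song": parse_song,
}

def classify_domain(text: str) -> str:
    """Keyword stand-in for the neural domain classifier."""
    return "control" if "turn on" in text or "open" in text else "song"

def route(text: str) -> dict:
    """Send the text to the parser of its recognized domain."""
    return PARSERS[classify_domain(text)](text)

print(route("at 3:20, turn on the air conditioner in the living room"))
```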
In the embodiment of the present invention, another method for determining a combined response result is further provided, where the method includes:
by the above manner, each piece of text information passing through the rejection module can be analyzed to obtain a plurality of response results (which may be a plurality of semantic results) corresponding to a plurality of speech recognition results, and the plurality of response results are combined to obtain the combined response result. When the multiple response results are combined, if the same position of the multiple response results has different recognition results in different speech recognition engines, the inconsistent parts in the multiple response results are uniformly processed by combining the first confidence coefficient and the cross entropy, so that the inconsistent parts in the multiple response results are the same, and the processed multiple response results are obtained.
And combining the processed multiple response results to obtain a combined response result.
Assume the speech recognition engines are engine A, engine B, and engine C, respectively.
The response result obtained by parsing the speech recognition result of engine A is: "open the single-rinse program of the attack", with a confidence of 0.65; the extracted slot information is: mode = "single-rinse", confidence: 0.95 (the confidence corresponding to "single-rinse");
the parsing score corresponding to engine A is obtained through the cross entropy formula: mode: -0.95 × ln(0.65) = 0.4092;
the response result obtained by parsing the speech recognition result of engine B is: "open the single-pick program of the washing machine", with a confidence of 0.71; the extracted slot information is: device = "washing machine", confidence = 0.92 (the confidence corresponding to "washing machine"); mode = "single-pick", confidence: 0.68 (the confidence corresponding to "single-pick");
the parsing score corresponding to engine B is obtained through the cross entropy formula: device: -0.92 × ln(0.71) = 0.3151; mode: -0.68 × ln(0.71) = 0.2329;
the response result obtained by parsing the speech recognition result of engine C is: "open the single-pick program of the attack", confidence: 0.68; the extracted slot information is: mode = "single-pick", confidence: 0.65 (the confidence corresponding to "single-pick");
the parsing score corresponding to engine C is obtained through the cross entropy formula: mode: -0.65 × ln(0.68) = 0.2507;
the response results corresponding to the speech recognitions are combined and output: "open the single-rinse program of the washing machine", confidence: 0.73; it can be understood that this combined output is the combined response result.
Slot information is extracted from the combined response result: device = "washing machine", confidence: 0.91; mode = "single-rinse", confidence: 0.93;
the parsing score corresponding to the combined output is obtained through the cross entropy formula: device: -0.91 × ln(0.73) = 0.2864; mode: -0.93 × ln(0.73) = 0.2927;
synthesizing the parsing scores corresponding to engine A, engine B, engine C, and the combined output, and selecting the value with the highest score for each slot, gives device = "washing machine" and mode = "single-rinse".
As for the highest response result: the response result obtained by parsing the recognition result of engine A is "open the single-rinse program of the attack", with a confidence of 0.65;
the response result obtained by parsing the recognition result of engine B is "open the single-pick program of the washing machine", with a confidence of 0.71;
the response result obtained by parsing the recognition result of engine C is "open the single-pick program of the attack", with a confidence of 0.68;
the combined response result is "open the single-rinse program of the washing machine", with a confidence of 0.73.
Selecting the value with the highest confidence gives the highest response result, namely "open the single-rinse program of the washing machine".
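The slot-resolution arithmetic in this example can be reproduced with a few lines of code. The sketch below scores each slot value by -slot_conf × ln(result_conf), as in the worked numbers above, and keeps the highest-scoring value per slot; the data layout is an assumption of this illustration.

```python
import math

# Reproduces the worked example above: each parse scores its slot values
# by -slot_conf * ln(result_conf), and the best-scoring value wins per slot.

def slot_scores(result_conf, slots):
    """slots: {name: (value, slot_conf)} for one candidate parse."""
    return {name: (value, -conf * math.log(result_conf))
            for name, (value, conf) in slots.items()}

parses = [  # (result confidence, extracted slots)
    (0.65, {"mode": ("single-rinse", 0.95)}),                                       # engine A
    (0.71, {"device": ("washing machine", 0.92), "mode": ("single-pick", 0.68)}),   # engine B
    (0.68, {"mode": ("single-pick", 0.65)}),                                        # engine C
    (0.73, {"device": ("washing machine", 0.91), "mode": ("single-rinse", 0.93)}),  # combined
]

best = {}
for result_conf, slots in parses:
    for name, (value, score) in slot_scores(result_conf, slots).items():
        if name not in best or score > best[name][1]:
            best[name] = (value, score)

print(best)  # device -> washing machine (0.3151), mode -> single-rinse (0.4092)
```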
In an embodiment of the present invention, after the obtaining of the target voice output by the target device according to the response result with the highest confidence and determining the response voice of the target voice based on the target voice, the method further includes: and instructing the target device to execute the operation when the response voice instructs to execute the operation corresponding to the target voice.
Optionally, the target device may be a smart home device. Assume the highest response result output by the terminal is: "please cook at 12:00". The target device (such as a smart rice cooker) outputs the target voice according to the highest response result: "confirm turning on the cooking mode at 12:00?". The terminal then outputs the response voice according to the target voice: "confirmed, turn it on", and at this time the target device (such as the smart rice cooker) can run the cooking mode at 12:00.
The technical solutions described above are described below with reference to preferred embodiments, but are not intended to limit the technical solutions of the embodiments of the present invention.
Preferred embodiment 1
Fig. 3 is a schematic diagram of an alternative speech recognition model according to an embodiment of the present invention. As shown in fig. 3, the multiple speech recognition results are input to the rejection module of the multiple sub-domain models, domain recognition is performed on the multiple results to determine their corresponding sub-domains, multiple intention analyses are obtained through the sub-domains, and multiple corresponding response results are generated. Further, by combining the multiple intention analyses, a combined response result corresponding to the intention combination can be obtained; then, duplicate parts among the multiple responses and the combined response result are removed, a selection is made according to predefined rules, the final response is output, and the dialogue state is updated.
For example, engine A recognizes the result as "open the single-rinse program of the attack"; after semantic parsing, the output response is: "what device do you want to operate?";
engine B recognizes the result as "open the single-pick program of the washing machine"; after semantic parsing, the output response is: "what program should the washing machine run?";
engine C recognizes the result as "open the single-pick program of the attack"; after semantic parsing, the output response is: "I do not understand; please say it another way";
the combined recognition result is "open the single-rinse program of the washing machine"; after semantic parsing, the output response is: "setting the single-rinse program for you";
the responses of engine A, engine B, and engine C all either require additional information, lack slot information, or simply cannot be understood, whereas the semantic information corresponding to the combined response result is clear and its response is definite; finally, the complete and meaningful response is output.
And after the final output response result is determined, the dialogue state is updated in the man-machine dialogue system.
The multi-speech-recognition-engine approach in the embodiment of the present invention starts from the source: on the one hand, it can reduce the error rate from speech to words; on the other hand, it performs semantic parsing on multiple recognized texts, generates dialogue responses according to certain preference or combination strategies, and selects the final output response according to a set strategy. In the embodiment of the present invention, the error rate of semantic parsing is reduced; the dialogue responses can be generated first, and the dialogue state is updated afterwards, once a response has been selected according to the strategy.
In summary, in conjunction with fig. 2 and fig. 3, the multiple response results, the multiple first confidence levels, the combined response result, and the second confidence level of the combined response result, together with their calculated cross entropies, represent how well each speech segment can be described by semantic parsing. The best response is selected by confidence comparison. Because the optimal response content is selected from multiple responses, the fault tolerance of the model is increased, and the system is more stable. Therefore, the combined strategy of the multi-speech-recognition-engine scheme and the multi-semantic-recognition scheme is better suited to voice products in complex real-world environments.
In this embodiment, a speech recognition apparatus is further provided, and the speech recognition apparatus is used to implement the foregoing embodiments and preferred embodiments, which have already been described and are not described again. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. Although the means described in the embodiments below are preferably implemented in software, an implementation in hardware, or a combination of software and hardware is also possible and contemplated.
Fig. 4 is a block diagram of an alternative speech recognition apparatus according to an embodiment of the present invention, as shown in fig. 4, the apparatus including:
a first processing module 42, configured to analyze multiple speech recognition results input to multiple sub-domain models to obtain multiple response results corresponding to the multiple speech recognition results, and combine the multiple response results to obtain a combined response result;
a first determining module 44, configured to determine a response result with a highest confidence level according to the multiple response results, the multiple first confidence levels of the multiple response results, the combined response result, and the second confidence level of the combined response result;
a first obtaining module 46, configured to obtain a target voice output by a target device according to a response result with a highest confidence level, and determine a response voice of the target voice based on the target voice, where the response voice is used to instruct the target device to determine whether to execute an operation corresponding to the target voice.
According to the invention, a plurality of voice recognition results input to a plurality of sub-domain models are analyzed to obtain a plurality of response results corresponding to the plurality of voice recognition results, and the plurality of response results are combined to obtain a combined response result; the response result with the highest confidence level is determined according to the plurality of response results, the plurality of first confidence levels of the plurality of response results, the combined response result and the second confidence level of the combined response result; and the target voice output by the target equipment according to the response result with the highest confidence level is acquired, and the response voice of the target voice is determined based on the target voice, wherein the response voice is used for indicating the target equipment to determine whether to execute the operation corresponding to the target voice. By adopting the technical scheme, the problem in the related art that, in an intelligent voice dialogue system, voice recognition is prone to errors and has low accuracy, so that parsing errors reduce the success rate of the dialogue, is solved.
In an embodiment of the present invention, the first processing module 42 includes: a dividing unit 422, configured to divide each response result in the multiple response results into multiple word segments according to a division rule; a determining unit 424, configured to determine a minimum edit distance between any two response results of the multiple response results, and record position information corresponding to the minimum edit distance, where the minimum edit distance is used to indicate the number of inconsistent word segments between any two response results, and the position information is used to indicate the positions of the inconsistent word segments in any two response results; and a first processing unit 426, configured to combine the multiple response results according to the minimum edit distance and the position information to obtain the combined response result.
Fig. 6 is a block diagram of another alternative speech recognition apparatus according to an embodiment of the present invention, and as shown in fig. 6, the apparatus further includes:
a second determining module 48, configured to determine, before the analyzing the plurality of speech recognition results input to the plurality of sub-domain models, a plurality of categories corresponding to the plurality of speech recognition results, where the plurality of categories are used to indicate language types to which the plurality of speech recognition results belong, the plurality of speech recognition results are in one-to-one correspondence with the plurality of categories, and the plurality of sub-domain models are in one-to-one correspondence with the plurality of categories; an input module 50, configured to input the speech recognition results into the sub-domain models according to the categories.
As shown in fig. 6, the above apparatus further includes: a second obtaining module 52, configured to obtain multiple speech recognition results obtained by recognizing speech information to be recognized by multiple speech recognition engines; a third determining module 54, configured to determine the plurality of first confidence levels corresponding to the plurality of speech recognition results.
As shown in fig. 6, the above apparatus further includes: an instructing module 56, configured to instruct the target device to perform the operation when the response voice instructs to perform the operation corresponding to the target voice.
An embodiment of the present invention further provides a storage medium including a stored program, wherein the program executes any one of the methods described above.
Alternatively, in the present embodiment, the storage medium may be configured to store program codes for performing the following steps:
s1, analyzing a plurality of voice recognition results input to a plurality of sub-domain models to obtain a plurality of response results corresponding to the plurality of voice recognition results, and combining the plurality of response results to obtain a combined response result;
s2, determining a response result with the highest confidence according to the response results, the first confidence of the response results, the combined response result and the second confidence of the combined response result;
and S3, acquiring a target voice output by the target device according to the response result with the highest confidence coefficient, and determining a response voice of the target voice based on the target voice, wherein the response voice is used for indicating the target device to determine whether to execute an operation corresponding to the target voice.
Optionally, in this embodiment, the storage medium may include, but is not limited to: various media capable of storing program code, such as a USB flash disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, or an optical disk.
Optionally, the specific examples in this embodiment may refer to the examples described in the above embodiments and optional implementation manners, and this embodiment is not described herein again.
Embodiments of the present invention also provide an electronic device comprising a memory having a computer program stored therein and a processor arranged to run the computer program to perform the steps of any of the above method embodiments.
Optionally, the electronic device may further include a transmission device and an input/output device, wherein the transmission device is connected to the processor, and the input/output device is connected to the processor.
Optionally, in this embodiment, the processor may be configured to execute the following steps by a computer program:
s1, analyzing a plurality of voice recognition results input to a plurality of sub-domain models to obtain a plurality of response results corresponding to the plurality of voice recognition results, and combining the plurality of response results to obtain a combined response result;
s2, determining a response result with the highest confidence according to the response results, the first confidence of the response results, the combined response result and the second confidence of the combined response result;
and S3, acquiring a target voice output by the target device according to the response result with the highest confidence coefficient, and determining a response voice of the target voice based on the target voice, wherein the response voice is used for indicating the target device to determine whether to execute an operation corresponding to the target voice.
Optionally, the specific examples in this embodiment may refer to the examples described in the above embodiments and optional implementation manners, and this embodiment is not described herein again.
It will be apparent to those skilled in the art that the modules or steps of the present invention described above may be implemented by a general-purpose computing device; they may be centralized on a single computing device or distributed across a network of multiple computing devices. Alternatively, they may be implemented by program code executable by a computing device, so that they may be stored in a storage device and executed by the computing device; in some cases, the steps shown or described may be performed in an order different from that described herein, or they may be fabricated separately as individual integrated circuit modules, or multiple modules or steps among them may be fabricated as a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.
The above description is only a preferred embodiment of the present invention, and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A speech recognition method is applied to a speech interaction system and is characterized by comprising the following steps:
analyzing a plurality of voice recognition results input to a plurality of sub-domain models to obtain a plurality of response results corresponding to the plurality of voice recognition results, and combining the plurality of response results to obtain a combined response result;
determining a response result with the highest confidence degree according to the multiple response results, the multiple first confidence degrees of the multiple response results, the combined response result and the second confidence degree of the combined response result;
and acquiring target voice output by the target equipment according to the response result with the highest confidence degree, and determining the response voice of the target voice based on the target voice, wherein the response voice is used for indicating the target equipment to determine whether to execute the operation corresponding to the target voice.
2. The method of claim 1, wherein combining the plurality of response results to obtain a combined response result comprises:
dividing each response result in the plurality of response results into a plurality of word segments according to a division rule;
determining a minimum edit distance between any two response results in the plurality of response results, and recording position information corresponding to the minimum edit distance, wherein the minimum edit distance is used for representing the number of inconsistent word segments between any two response results, and the position information is used for representing the positions of the inconsistent word segments in any two response results;
and combining the plurality of response results according to the minimum edit distance and the position information to obtain the combined response result.
3. The method of claim 1, wherein prior to the analyzing the plurality of speech recognition results input to the plurality of sub-domain models, the method further comprises:
determining a plurality of categories corresponding to the plurality of voice recognition results, wherein the plurality of categories are used for representing language types to which the plurality of voice recognition results belong, the plurality of voice recognition results are in one-to-one correspondence with the plurality of categories, and the plurality of sub-domain models are in one-to-one correspondence with the plurality of categories;
inputting the plurality of speech recognition results to the plurality of sub-domain models according to the plurality of categories.
4. The method of claim 1, wherein prior to the analyzing the plurality of speech recognition results input to the plurality of sub-domain models, the method further comprises:
acquiring a plurality of voice recognition results obtained by recognizing voice information to be recognized by a plurality of voice recognition engines;
determining the plurality of first confidence degrees corresponding to the plurality of voice recognition results.
5. The method according to claim 1, wherein after the obtaining of the target voice output by the target device according to the response result with the highest confidence degree and the determining of the response voice of the target voice based on the target voice, the method further comprises:
and instructing the target equipment to execute the operation under the condition that the response voice indicates to execute the operation corresponding to the target voice.
6. A speech recognition apparatus, comprising:
the first processing module is used for analyzing a plurality of voice recognition results input to a plurality of sub-domain models to obtain a plurality of response results corresponding to the plurality of voice recognition results, and combining the plurality of response results to obtain a combined response result;
a first determining module, configured to determine, according to the multiple response results, multiple first confidence degrees of the multiple response results, the combined response result, and a second confidence degree of the combined response result, a response result with a highest confidence degree;
and the first obtaining module is used for obtaining a target voice output by the target equipment according to the response result with the highest confidence coefficient, and determining the response voice of the target voice based on the target voice, wherein the response voice is used for indicating the target equipment to determine whether to execute the operation corresponding to the target voice.
7. The apparatus of claim 6, wherein the first processing module comprises:
the dividing unit is used for dividing each response result in the plurality of response results into a plurality of word segments according to a division rule;
a determining unit, configured to determine a minimum edit distance between any two response results in the multiple response results, and record position information corresponding to the minimum edit distance, where the minimum edit distance is used to indicate the number of inconsistent word segments between any two response results, and the position information is used to indicate the positions of the inconsistent word segments in any two response results;
and the first processing unit is used for combining the plurality of response results according to the minimum edit distance and the position information to obtain the combined response result.
8. The apparatus of claim 6, further comprising:
a second determining module, configured to determine, before the analyzing of the multiple speech recognition results input to multiple sub-domain models, multiple categories corresponding to the multiple speech recognition results, where the multiple categories are used to represent language types to which the multiple speech recognition results belong, the multiple speech recognition results are in one-to-one correspondence with the multiple categories, and the multiple sub-domain models are in one-to-one correspondence with the multiple categories;
and the input module is used for inputting the plurality of voice recognition results to the plurality of sub-field models according to the plurality of categories.
9. A storage medium, in which a computer program is stored, wherein the computer program is arranged to perform the method of any of claims 1 to 5 when executed.
10. An electronic device comprising a memory and a processor, wherein the memory has stored therein a computer program, and wherein the processor is arranged to run the computer program to perform the speech recognition method of any of claims 1 to 5.
Application CN202010712229.3A (priority date 2020-07-22; filing date 2020-07-22): Speech recognition method and device, storage medium and electronic equipment. Status: Active; granted as CN111883122B.

Priority Applications (1)

Application Number: CN202010712229.3A; Priority Date: 2020-07-22; Filing Date: 2020-07-22; Title: Speech recognition method and device, storage medium and electronic equipment

Publications (2)

Publication Number: CN111883122A; Publication Date: 2020-11-03
Publication Number: CN111883122B; Publication Date: 2023-10-27

Family ID: 73155211

Family Applications (1)

Application Number: CN202010712229.3A (Active; granted as CN111883122B); Priority Date: 2020-07-22; Filing Date: 2020-07-22; Title: Speech recognition method and device, storage medium and electronic equipment

Country Status (1)

Country: CN; Publication: CN111883122B (en)

Patent Citations (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050065796A1 (en) * 2003-09-18 2005-03-24 Wyss Felix I. Speech recognition system and method
US20050240404A1 (en) * 2004-04-23 2005-10-27 Rama Gurram Multiple speech recognition engines
JP2010139744A (en) * 2008-12-11 2010-06-24 Ntt Docomo Inc Voice recognition result correcting device and voice recognition result correction method
US20110161077A1 (en) * 2009-12-31 2011-06-30 Bielby Gregory J Method and system for processing multiple speech recognition results from a single utterance
CN104795069A (en) * 2014-01-21 2015-07-22 腾讯科技(深圳)有限公司 Speech recognition method and server
CN106796788A * 2014-08-28 2017-05-31 Apple Inc. Improving automatic speech recognition based on user feedback
US20160322049A1 (en) * 2015-04-28 2016-11-03 Google Inc. Correcting voice recognition using selective re-speak
US20180277119A1 (en) * 2015-11-25 2018-09-27 Mitsubishi Electric Corporation Speech dialogue device and speech dialogue method
WO2018059957A1 (en) * 2016-09-30 2018-04-05 Robert Bosch Gmbh System and method for speech recognition
CN109791767A (en) * 2016-09-30 2019-05-21 罗伯特·博世有限公司 System and method for speech recognition
US20180197545A1 (en) * 2017-01-11 2018-07-12 Nuance Communications, Inc. Methods and apparatus for hybrid speech recognition processing
CN110634486A (en) * 2018-06-21 2019-12-31 阿里巴巴集团控股有限公司 Voice processing method and device
CN110148416A * 2019-04-23 2019-08-20 Tencent Technology (Shenzhen) Co., Ltd. Speech recognition method, device, equipment and storage medium
CN110491383A * 2019-09-25 2019-11-22 Beijing SoundAI Technology Co., Ltd. Voice interaction method, device, system, storage medium and processor
CN111049996A (en) * 2019-12-26 2020-04-21 苏州思必驰信息科技有限公司 Multi-scene voice recognition method and device and intelligent customer service system applying same

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
SA KHOUBROUY et al.: "Microphone array processing strategies for distant-based automatic speech recognition", pages 1344-1348 *
周祖洋: "Research on human voice recognition technology based on the VxWorks platform" (基于VxWorks平台的人声识别技术的研究), pages 5-7 *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112614490A (en) * 2020-12-09 2021-04-06 北京罗克维尔斯科技有限公司 Method, device, medium, equipment, system and vehicle for generating voice instruction
CN112614490B (en) * 2020-12-09 2024-04-16 北京罗克维尔斯科技有限公司 Method, device, medium, equipment, system and vehicle for generating voice instruction
CN112735394A (en) * 2020-12-16 2021-04-30 青岛海尔科技有限公司 Semantic parsing method and device for voice
CN112836522A (en) * 2021-01-29 2021-05-25 青岛海尔科技有限公司 Method and device for determining voice recognition result, storage medium and electronic device
CN113593535A (en) * 2021-06-30 2021-11-02 青岛海尔科技有限公司 Voice data processing method and device, storage medium and electronic device
WO2023273776A1 (en) * 2021-06-30 2023-01-05 青岛海尔科技有限公司 Speech data processing method and apparatus, and storage medium and electronic apparatus
CN113593535B (en) * 2021-06-30 2024-05-24 青岛海尔科技有限公司 Voice data processing method and device, storage medium and electronic device
CN113793597A (en) * 2021-09-15 2021-12-14 云知声智能科技股份有限公司 Voice recognition method and device, electronic equipment and storage medium
CN114840653A (en) * 2022-04-26 2022-08-02 北京百度网讯科技有限公司 Dialogue processing method, device, equipment and storage medium

Also Published As

Publication number Publication date
CN111883122B (en) 2023-10-27

Similar Documents

Publication Publication Date Title
CN111883122B (en) Speech recognition method and device, storage medium and electronic equipment
KR100800367B1 (en) Sensor based speech recognizer selection, adaptation and combination
US20210149994A1 (en) Device and method for machine reading comprehension question and answer
CN104903954A (en) Speaker verification and identification using artificial neural network-based sub-phonetic unit discrimination
CN108364650B (en) Device and method for adjusting voice recognition result
CN107423363A (en) Art generation method, device, equipment and storage medium based on artificial intelligence
KR20060082465A (en) Method and apparatus for classifying voice and non-voice using sound model
CN110415679A (en) Voice error correction method, device, equipment and storage medium
WO2004072955A1 (en) Augmentation and calibration of output from non-deterministic text generators by modeling its characteristics in specific environments
CN111178081B (en) Semantic recognition method, server, electronic device and computer storage medium
CN115328756A (en) Test case generation method, device and equipment
CN112951211B (en) Voice awakening method and device
CN111462761A (en) Voiceprint data generation method and device, computer device and storage medium
CN113362822A (en) Black box voice confrontation sample generation method with auditory masking
JP6810580B2 (en) Language model learning device and its program
CN111816170A (en) Training of audio classification model and junk audio recognition method and device
CN113869398B (en) Unbalanced text classification method, device, equipment and storage medium
CN110503958A (en) Audio recognition method, system, mobile terminal and storage medium
CN117152308B (en) Virtual person action expression optimization method and system
CN110866094A (en) Instruction recognition method, instruction recognition device, storage medium, and electronic device
CN113077812A (en) Speech signal generation model training method, echo cancellation method, device and equipment
JP3987927B2 (en) Waveform recognition method and apparatus, and program
US6499012B1 (en) Method and apparatus for hierarchical training of speech models for use in speaker verification
CN111680514B (en) Information processing and model training method, device, equipment and storage medium
CN102237082B (en) Self-adaption method of speech recognition system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant