CN111883122B - Speech recognition method and device, storage medium and electronic equipment - Google Patents

Speech recognition method and device, storage medium and electronic equipment

Info

Publication number
CN111883122B
CN111883122B
Authority
CN
China
Prior art keywords
response
voice
results
voice recognition
result
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010712229.3A
Other languages
Chinese (zh)
Other versions
CN111883122A (en)
Inventor
赵培
朱文博
韩俊明
苏腾荣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Haier Uplus Intelligent Technology Beijing Co Ltd
Original Assignee
Haier Uplus Intelligent Technology Beijing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Haier Uplus Intelligent Technology Beijing Co Ltd filed Critical Haier Uplus Intelligent Technology Beijing Co Ltd
Priority to CN202010712229.3A priority Critical patent/CN111883122B/en
Publication of CN111883122A publication Critical patent/CN111883122A/en
Application granted granted Critical
Publication of CN111883122B publication Critical patent/CN111883122B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L15/26 - Speech to text systems
    • G10L2015/223 - Execution procedure of a spoken command

Abstract

The application provides a speech recognition method and apparatus, a storage medium, and an electronic device. The method includes: parsing a plurality of speech recognition results input into a plurality of sub-domain models to obtain a plurality of response results corresponding to the plurality of speech recognition results, and combining the plurality of response results to obtain a combined response result; determining the response result with the highest confidence according to the plurality of response results, a plurality of first confidences of the plurality of response results, the combined response result, and a second confidence of the combined response result; and acquiring a target voice output by a target device according to the response result with the highest confidence, and determining a response voice for the target voice, where the response voice instructs the target device to determine whether to execute the operation corresponding to the target voice. This technical scheme solves the problems in intelligent speech dialogue systems whereby speech recognition is error-prone and of low accuracy, so that parsing errors reduce the success rate of the dialogue.

Description

Speech recognition method and device, storage medium and electronic equipment
Technical Field
The present application relates to the field of communications, and in particular, to a method and apparatus for voice recognition, a storage medium, and an electronic device.
Background
With the rise of automatic speech recognition (Automatic Speech Recognition, abbreviated as ASR), speech recognition is increasingly applied in intelligent speech dialogue systems. However, factors such as changes in the acoustic environment and ambiguous pronunciation can cause errors in the recognized text. These errors cascade into the downstream semantic parsing module, causing parsing errors and lowering the success rate of the dialogue.
For the problems in the related art that, in intelligent speech dialogue systems, speech recognition is error-prone and of low accuracy, so that parsing errors reduce the success rate of the dialogue, no effective technical scheme has yet been proposed.
Disclosure of Invention
The embodiments of the present application provide a speech recognition method and apparatus, a storage medium, and an electronic device, to at least solve the problems in the related art that, in intelligent speech dialogue systems, speech recognition is error-prone and of low accuracy, so that parsing errors reduce the success rate of the dialogue.
According to one embodiment of the present application, a speech recognition method applied to a voice interaction system is provided, including: parsing a plurality of speech recognition results input into a plurality of sub-domain models to obtain a plurality of response results corresponding to the plurality of speech recognition results, and combining the plurality of response results to obtain a combined response result; determining the response result with the highest confidence according to the plurality of response results, a plurality of first confidences of the plurality of response results, the combined response result, and a second confidence of the combined response result; and acquiring a target voice output by a target device according to the response result with the highest confidence, and determining a response voice for the target voice, where the response voice instructs the target device to determine whether to execute the operation corresponding to the target voice.
According to another embodiment of the present application, a speech recognition apparatus is also provided, including: a first processing module, configured to parse a plurality of speech recognition results input into a plurality of sub-domain models to obtain a plurality of response results corresponding to the plurality of speech recognition results, and to combine the plurality of response results into a combined response result; a first determining module, configured to determine the response result with the highest confidence according to the plurality of response results, the plurality of first confidences of the plurality of response results, the combined response result, and the second confidence of the combined response result; and a first acquiring module, configured to acquire a target voice output by the target device according to the response result with the highest confidence and to determine a response voice for the target voice, where the response voice instructs the target device to determine whether to execute the operation corresponding to the target voice.
According to another embodiment of the present application, a storage medium is also provided, including a stored program, where the program, when executed, performs any one of the above speech recognition methods.
According to another embodiment of the present application, an electronic device is also provided, including a memory and a processor, where the memory stores a computer program and the processor is configured to run the computer program to perform any one of the above speech recognition methods.
According to the application, a plurality of speech recognition results input into a plurality of sub-domain models are first parsed to obtain a plurality of response results corresponding to them, and the plurality of response results are combined to obtain a combined response result; the response result with the highest confidence is determined according to the plurality of response results, the plurality of first confidences of the plurality of response results, the combined response result, and the second confidence of the combined response result; a target voice output by the target device according to the response result with the highest confidence is acquired, and a response voice for the target voice is determined, where the response voice instructs the target device to determine whether to execute the operation corresponding to the target voice. This technical scheme solves the problems in the related art that, in intelligent speech dialogue systems, speech recognition is error-prone and of low accuracy, so that parsing errors reduce the success rate of the dialogue.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute a limitation on the application. In the drawings:
FIG. 1 is a flow chart of an alternative speech recognition method according to an embodiment of the present application;
FIG. 2 is a flow chart of an alternative determination of combined response results according to an embodiment of the application;
FIG. 3 is a schematic diagram of an alternative speech recognition model according to an embodiment of the present application;
FIG. 4 is a block diagram of an alternative speech recognition device according to an embodiment of the present application;
FIG. 5 is a block diagram of an alternative first processing module according to an embodiment of the application;
FIG. 6 is a block diagram of another alternative speech recognition device according to an embodiment of the present application.
Detailed Description
The application will be described in detail hereinafter with reference to the drawings in conjunction with embodiments. It should be noted that, without conflict, the embodiments of the present application and features of the embodiments may be combined with each other.
It should be noted that the terms "first," "second," and the like in the description and the claims of the present application and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order.
In a conventional speech dialogue system, natural speech audio data from the user is captured by the input device of the voice interaction system and fed into multiple speech recognition engines, yielding multiple recognition results. Typically, the recognition results are marked and aligned as wholes and then ranked to select an optimal result, which is passed to the semantic parsing module for semantic parsing. However, different engines may disagree on specific words; because such partial errors are not corrected, an erroneous portion may be selected as the engine's final output, which lowers the overall accuracy of speech recognition.
Existing semantic parsing schemes process only a single text, so if speech recognition goes wrong, all subsequent semantic parsing goes wrong as well. In addition, semantic parsing relies on a single model whose operation is complex and whose iteration is costly in time. If, after semantic parsing, the dialogue state is updated directly and the dialogue response is generated and output, then an error in any one module propagates to all modules, with no step of mutual verification.
Existing multi-engine schemes mainly optimize over each engine's whole result: a final recognition result is chosen first, semantic parsing is performed on that single engine result, the dialogue state is updated directly, and dialogue generation follows. Relying on a single engine's result leads to a higher error rate in the recognized text; as semantic parsing and text generation build on that text, errors accumulate layer by layer, reducing system stability and making requirements hard to meet.
The following embodiments of the present application therefore mix and output the results of multiple speech recognition engines to compensate for the weakness of outputting each engine's result separately: the mixed result and the individual engine results are parsed respectively, final response results are generated, and the final response results are processed, thereby improving recognition accuracy.
In order to solve the above technical problem, this embodiment provides a speech recognition method. FIG. 1 is a flowchart of an alternative speech recognition method according to an embodiment of the present application; as shown in FIG. 1, the flow includes the following steps:
step S102, analyzing a plurality of voice recognition results input into a plurality of sub-domain models to obtain a plurality of response results corresponding to the plurality of voice recognition results, and combining the plurality of response results to obtain a combined response result;
step S104, determining a response result with highest confidence according to the response results, the first confidence levels of the response results, the combined response result and the second confidence level of the combined response result;
step S106, obtaining target voice output by the target equipment according to the response result with highest confidence, and determining response voice of the target voice based on the target voice, wherein the response voice is used for indicating the target equipment to determine whether to execute the operation corresponding to the target voice.
According to the application, firstly, a plurality of voice recognition results input into a plurality of sub-domain models are analyzed to obtain a plurality of response results corresponding to the plurality of voice recognition results, and then the plurality of response results are combined to obtain a combined response result; determining a response result with the highest confidence coefficient according to the response results, the first confidence coefficients of the response results, the combined response result and the second confidence coefficient of the combined response result; and acquiring target voice output by the target equipment according to the response result with the highest confidence, and determining response voice of the target voice based on the target voice, wherein the response voice is used for indicating the target equipment to determine whether to execute the operation corresponding to the target voice. By adopting the technical scheme, the problems that in the related technology, in an intelligent voice dialogue system, errors are easy to occur in voice recognition, the accuracy is low, and then analysis errors are caused to influence the success rate of the dialogue are solved.
Optionally, before analyzing the plurality of speech recognition results input to the plurality of sub-domain models, the method further includes: acquiring a plurality of voice recognition results obtained by recognizing voice information to be recognized by a plurality of voice recognition engines; and determining the first confidence levels corresponding to the voice recognition results.
The voice information to be recognized may be the user's natural speech audio data in the voice interaction system. Recognizing it with the plurality of speech recognition engines yields the plurality of speech recognition results, and the plurality of first confidences corresponding to the plurality of speech recognition results are determined.
It should be noted that the above speech recognition engines recognize speech with different accuracy rates. Suppose the engines are engine A, engine B, and engine C, with accuracy rates of 90% for engine A, 88% for engine B, and 96% for engine C; then the first confidences of the three recognition results may be 0.90, 0.88, and 0.96. It will be appreciated that this is only an example, and the present embodiment is not limited to it.
In order to make the response result corresponding to the voice recognition result more accurate, in the embodiment of the present application, the combination of the plurality of response results to obtain a combined response result may be implemented by the following technical scheme: dividing each response result in the plurality of response results into a plurality of segmentation words according to a division rule; determining a minimum editing distance between any two response results in the plurality of response results, and recording position information corresponding to the minimum editing distance, wherein the minimum editing distance is used for representing the number of inconsistent segmentation words between any two response results, and the position information is used for representing the positions of the inconsistent segmentation words in the any two response results; and combining the plurality of response results according to the minimum editing distance and the position information to obtain the combined response result.
In the embodiment of the present application, as shown in fig. 2, a method for determining a combined response result is further provided, as follows:
step S201, receiving audio input data from a user;
First, an audio input file (corresponding to the voice information to be recognized) from the user is obtained by the audio input device of the man-machine dialogue system; the file is then fed through a cloud service into a plurality of speech recognition engines, generating a plurality of candidate recognition results (i.e., the plurality of speech recognition results).
Step S202, the consistency of all recognition results is checked according to the priorities of the speech recognition engines; if the plurality of speech recognition results are fully consistent or completely inconsistent, go to step S203; if they are partially inconsistent, go to step S204;
Step S203, the plurality of speech recognition results and their confidences are passed in turn to the semantic parsing module;
Step S204, the multi-engine recognition results are segmented in turn using the bidirectional maximum matching algorithm;
Step S205, the minimum edit distance between the i-th word of one engine and the i-th words of the other engines is calculated in turn;
Step S206, it is determined whether the minimum edit distance between different engines is smaller than a preset threshold;
if the distance exceeds the threshold, go to step S207;
if the distance is smaller than or equal to the threshold, go to step S208;
Step S207, take the word with the smallest edit distance and the higher engine priority;
Step S208, determine the recognition result of the i-th word;
Step S209, combine all the word segments into the combined response result, set response weight coefficients according to the proportion of each word in the sentence, and estimate the second confidence.
Further, the plurality of speech recognition results are divided into parts under the same segmentation rule; for example, the multi-engine recognition results can be segmented with the bidirectional maximum matching algorithm, so that segmentation is unified across engines. A sketch of such a segmenter follows.
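By way of illustration, a minimal bidirectional maximum matching segmenter might look like the following Python sketch. This is an illustrative sketch rather than the patented implementation: the toy vocabulary, the tie-breaking heuristic (fewer tokens, then fewer single-character tokens), and all function names are assumptions.

```python
def max_match(text, vocab, max_len=4, reverse=False):
    """Greedy maximum matching over text; scans right-to-left if reverse."""
    s = text[::-1] if reverse else text
    tokens, i = [], 0
    while i < len(s):
        for size in range(min(max_len, len(s) - i), 0, -1):
            piece = s[i:i + size][::-1] if reverse else s[i:i + size]
            if size == 1 or piece in vocab:   # single characters always match
                tokens.append(piece)
                i += size
                break
    return tokens[::-1] if reverse else tokens

def bidirectional_max_match(text, vocab):
    """Run both scans and keep the segmentation a common heuristic prefers:
    fewer tokens, then fewer single-character tokens on a tie."""
    fwd = max_match(text, vocab)
    bwd = max_match(text, vocab, reverse=True)
    if len(fwd) != len(bwd):
        return min(fwd, bwd, key=len)
    return min(fwd, bwd, key=lambda t: sum(len(w) == 1 for w in t))

# Toy example mirroring "set to low-speed wind" (设置为低速风):
vocab = {"设置为", "低速", "风"}
print(bidirectional_max_match("设置为低速风", vocab))  # ['设置为', '低速', '风']
```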
For the case where the engines' recognition results are partially inconsistent, the minimum edit distance is calculated and all calculation results between the words are recorded. For each inconsistent part, the minimum edit distance is obtained by preference and the most probable recognition result is derived from it; boundary positions are marked at the same time; the recognition results are then recombined into the finally output recognition result (i.e., the combined response result).
For example, assume the plurality of speech recognition engines are engine A, engine B, and engine C, and:
Engine A recognizes the input audio and segments the result as: setting the sealing at a low speed;
Engine B recognizes the input audio and segments the result as: setting the air as plastic drop;
Engine C recognizes the input audio and segments the result as: what is set to low-speed wind.
The minimum edit distances between the three engines' recognition results are computed pairwise, giving the following results and the recorded position information of the inconsistent parts:
The minimum edit distance between A and B is δ_AB = 3, with position information [A_{3,5}, B_{3,5}] (representing three replacement operations);
The minimum edit distance between A and C is δ_AC = 4, with position information [A_{5,5}, C_{5,8}] (representing one replacement and three insertion operations);
The minimum edit distance between B and C is δ_BC1 = 2 and δ_BC2 = 3, denoting a first and a second inconsistent part, with position information [B_{3,4}, C_{3,4}] and [B_{end}, C_{6,8}] (indicating that the first part is a replacement, followed by three insertion operations).
The results of engine A and engine C are consistent before position 4, so the edit distance of that part is 0; the new combined result then uses δ_AC to resolve the first inconsistent part, giving: low speed.
From the position information recorded for δ_BC1 = 2 and δ_BC2 = 3, engine B and engine C are considered consistent at position 5. In summary, combining the three engines' recognition results yields the combined result: "set to low-speed wind" (corresponding to the above combined response result).
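The alignment and recombination step can be illustrated with the following Python sketch. The dynamic-programming edit distance is standard; the merge shown is a simplified position-wise vote with a priority tie-breaker, which only approximates the alignment-based recombination described above, and the token lists and engine priority are hypothetical stand-ins for the example.

```python
from itertools import product

def edit_ops(a, b):
    """Token-level minimum edit distance between segmentations a and b,
    plus a backtrace giving the positions of inconsistent tokens."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i, j in product(range(1, m + 1), range(1, n + 1)):
        cost = 0 if a[i - 1] == b[j - 1] else 1
        d[i][j] = min(d[i-1][j] + 1, d[i][j-1] + 1, d[i-1][j-1] + cost)
    ops, i, j = [], m, n
    while i > 0 or j > 0:
        if i and j and d[i][j] == d[i-1][j-1] + (a[i-1] != b[j-1]):
            if a[i-1] != b[j-1]:
                ops.append(('replace', i - 1, j - 1))
            i, j = i - 1, j - 1
        elif i and d[i][j] == d[i-1][j] + 1:
            ops.append(('delete', i - 1, j)); i -= 1
        else:
            ops.append(('insert', i, j - 1)); j -= 1
    return d[m][n], list(reversed(ops))

def merge(results, priority):
    """Simplified recombination: per position, majority vote over the
    engines' tokens, breaking ties by engine priority."""
    merged = []
    for pos in range(max(len(t) for t in results.values())):
        votes = {}
        for eng, toks in results.items():
            if pos < len(toks):
                votes.setdefault(toks[pos], []).append(eng)
        merged.append(max(votes, key=lambda w: (len(votes[w]),
                      -min(priority.index(e) for e in votes[w]))))
    return merged

results = {'A': ['set to', 'seal', 'low speed'],
           'B': ['set to', 'plastic', 'drop'],
           'C': ['set to', 'low speed', 'wind']}
print(edit_ops(results['A'], results['B'])[0])   # 2 inconsistent tokens
print(merge(results, priority=['C', 'A', 'B']))  # ['set to', 'low speed', 'wind']
```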
Correspondingly, after the combined response result is obtained, the embodiment of the application further provides a method for determining the second confidence coefficient corresponding to the combined response result, which is as follows:
since there is some misjudgment of the combined result (i.e., the combined response result), the confidence result needs to be recalculated for reference.
The following description will be given by taking the results of the above 3 engine identifications as an example:
from the above, the multi-speech engine recognition result obtains a combination result as follows: "set to low speed wind" (i.e., the first speech recognition result).
For the above-mentioned multiple speech recognition results, each speech recognition result includes multiple words, for each word there is corresponding position information, by obtaining the position information specified by each word in each speech recognition result, the proportion of each word in the respective engine recognition result can be obtained, taking multiple speech recognition engines as a, B, and C as examples, the average value obtained in each engine recognition result of each word (i.e. word 1 "set to", word 2 "low speed", and word 3 "wind") can be used as the weight, and the weights corresponding to these three words are respectively recorded as:
the calculation method comprises the following steps:wherein C is 1 For word number statistics of word 1, C A Statistics of the overall word count of the recognition result obtained by the engine A, C B Statistics of the number of words of the recognition result obtained by the engine B, C C And counting the overall word number of the recognition result obtained by the engine C. The same theory can calculate +.>And will not be described in detail herein.
The confidence level calculation method for calculating the final combined recognition result (namely the combined response result) by using a weighted average method is as follows:
note that Conf A ,Conf B ,Conf C ,Conf multi Representing the confidence of engines a, B, C, respectively, and a second confidence of the combined response result.
It should be noted that, the accuracy of the recognized voices of the three engines is assumed to be: 90% engine A, 88% engine and 96% engine C, conf A Can be 0.9, conf B May be 0.88, conf C May be 0.96.
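A hedged sketch of the second-confidence estimate follows. The exact weighted-average formula is not reproduced above, so the reading below, which weights each merged word by its average share C_i/C_e over the engines that support it and averages those engines' confidences, is one plausible interpretation rather than the patented formula; all inputs are the toy values from the example.

```python
def second_confidence(merged, engine_results, engine_conf):
    """Estimate the combined result's confidence (an assumed reading of
    the weighted-average method described above).

    merged:         list of words in the combined result
    engine_results: {engine: list of segmented words}
    engine_conf:    {engine: first confidence}
    """
    totals = {e: sum(len(w) for w in toks)          # C_A, C_B, C_C
              for e, toks in engine_results.items()}
    num = den = 0.0
    for i, word in enumerate(merged):
        # Engines whose i-th token agrees with the merged choice.
        support = [e for e, toks in engine_results.items()
                   if i < len(toks) and toks[i] == word]
        if not support:
            continue
        # Word weight: average share C_i / C_e over supporting engines.
        weight = sum(len(word) / totals[e] for e in support) / len(support)
        conf = sum(engine_conf[e] for e in support) / len(support)
        num += weight * conf
        den += weight
    return num / den if den else 0.0

engine_results = {'A': ['set to', 'seal', 'low speed'],
                  'B': ['set to', 'plastic', 'drop'],
                  'C': ['set to', 'low speed', 'wind']}
merged = ['set to', 'low speed', 'wind']
conf = second_confidence(merged, engine_results,
                         {'A': 0.90, 'B': 0.88, 'C': 0.96})
print(round(conf, 3))   # ~0.945 under these toy inputs
```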
In an embodiment of the present application, before the analyzing the plurality of speech recognition results input to the plurality of sub-domain models, the method further includes: determining a plurality of categories corresponding to the plurality of voice recognition results, wherein the plurality of categories are used for representing language types to which the plurality of voice recognition results belong, the plurality of voice recognition results are in one-to-one correspondence with the plurality of categories, and the plurality of sub-field models are in one-to-one correspondence with the plurality of categories; and inputting the voice recognition results into the sub-domain models according to the categories.
The plurality of sub-domain models can be obtained by training a language model and a classifier, via neural networks, on millions of items of real user corpora. As shown in FIG. 3, the models include a rejection module that filters the texts corresponding to the plurality of speech recognition results: poorly recognized texts are filtered out, which avoids parsing them and in turn reduces the time complexity of the subsequent logic.
The above process of determining the plurality of sub-domain models can be understood as domain recognition of the plurality of speech recognition results (corresponding to the plurality of categories). For example, the sub-domain models may be used to recognize several language types, such as sports, appliance-control, and music languages, with the corresponding categories being the sports, appliance-control, and music language categories. The foregoing is merely an example and is not intended to be limiting.
In the embodiment of the application, the plurality of voice recognition results are analyzed based on the pre-trained plurality of sub-domain models to obtain a plurality of response results, and after determining a plurality of sub-domain models (i.e., domains) corresponding to the plurality of voice recognition results, a plurality of categories corresponding to the plurality of voice recognition results can be determined.
The above process of determining multiple classes may be understood as a sub-domain recognition of multiple speech recognition results, which may be implemented by different classifiers.
User intention analysis and slot extraction can be performed for the different sub-domains by different parsers. For example, suppose the text corresponding to a speech recognition result is "at 3:20, turn on the air conditioner in the living room". The user intention is to turn on a device, and the slots are "time" = "3:20", "location" = "living room", "device" = "air conditioner". By recognizing the user intention and extracting the slots from this text, it can be assigned to the control domain, which handles home-appliance control requests. Texts of other domains are identified accordingly; for example, the song domain handles requests to play songs, and so on.
Because the data distributions of the domains differ, different parsers are used for the different distributions, so that each module (i.e., sub-domain) is highly decoupled and focuses only on its own rules. Although combining text from different speech recognition engines across multiple modules could compound error rates, this blocked processing reduces the error rate of each word in the text, so the parsing effect is greatly improved. A small dispatch sketch follows.
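To make the decoupling concrete, the following Python sketch dispatches a recognized text to a per-domain parser. The domains, keyword rules, and regular expression are hypothetical stand-ins for the trained classifier and parsers described above.

```python
import re

# Hypothetical per-domain parsers: each sub-domain knows only its own
# slot rules, which keeps the modules decoupled.
def parse_control(text):
    slots = {}
    m = re.search(r'\d{1,2}:\d{2}', text)
    if m:
        slots['time'] = m.group()
    for loc in ('living room', 'bedroom', 'kitchen'):
        if loc in text:
            slots['location'] = loc
    for dev in ('air conditioner', 'washing machine', 'rice cooker'):
        if dev in text:
            slots['device'] = dev
    return {'intent': 'turn_on_device', 'slots': slots}

def parse_music(text):
    return {'intent': 'play_song', 'slots': {'query': text}}

# Stand-in for the trained domain classifier (keyword-based here).
def classify_domain(text):
    devices = ('air conditioner', 'washing machine', 'rice cooker')
    return 'control' if any(d in text for d in devices) else 'music'

PARSERS = {'control': parse_control, 'music': parse_music}

text = "at 3:20 turn on the air conditioner in the living room"
print(PARSERS[classify_domain(text)](text))
# {'intent': 'turn_on_device', 'slots': {'time': '3:20',
#  'location': 'living room', 'device': 'air conditioner'}}
```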
In the embodiment of the application, another method for determining the combined response result is also provided, as follows:
In the above manner, each text that passes the rejection module can be parsed to obtain the plurality of response results (which may be a plurality of semantic results) corresponding to the plurality of speech recognition results, and the plurality of response results are combined to obtain the combined response result. When combining, if the same position has different recognition results across the speech recognition engines, the inconsistent parts are processed uniformly using the first confidences together with cross entropy, so that the inconsistent parts of the response results become identical, giving the processed response results.
And combining the processed multiple response results to obtain a combined response result.
Take the speech recognition engines to be engine A, engine B, and engine C as an example.
The response result obtained by parsing the recognition result of engine A is: open the single-rinse program of the attack, with confidence 0.65; the extracted slot information is: mode = "single-rinse", confidence 0.95 (the confidence corresponding to "single-rinse");
By the cross-entropy formula, the parsing score corresponding to engine A is: mode: -0.95 × ln(0.65) = 0.4092;
The response result obtained by parsing the recognition result of engine B is: open the single-pick program of the washing machine, with confidence 0.71; the extracted slot information is: device = "washing machine", confidence 0.92 (the confidence corresponding to "washing machine"), and mode = "single-pick", confidence 0.68 (the confidence corresponding to "single-pick");
By the cross-entropy formula, the parsing scores corresponding to engine B are: device: -0.92 × ln(0.71) = 0.3151; mode: -0.68 × ln(0.71) = 0.2329;
The response result obtained by parsing the recognition result of engine C is: open the single-pick program of the attack, confidence 0.68; the extracted slot information is: mode = "single-pick", confidence 0.65 (the confidence corresponding to "single-pick");
By the cross-entropy formula, the parsing score corresponding to engine C is: mode: -0.65 × ln(0.68) = 0.2507;
The combined output of the response results is: open the single-rinse program of the washing machine, confidence 0.73; it can be appreciated that this combined output is the combined response result.
The slot information extracted from the combined response result is: device = "washing machine", confidence 0.91; mode = "single-rinse", confidence 0.93;
By the cross-entropy formula, the parsing scores corresponding to the combined output are: device: -0.91 × ln(0.73) = 0.2864; mode: -0.93 × ln(0.73) = 0.2927;
Combining the parsing results of engine A, engine B, engine C, and the combined output, and selecting the value with the highest confidence for each slot, the device corresponds to the washing machine and the mode corresponds to single-rinse.
As for the response result with the highest confidence: for example, the response result obtained by parsing the recognition result of engine A is: open the single-rinse program of the attack, confidence 0.65;
the response result obtained by parsing the recognition result of engine B is: open the single-pick program of the washing machine, confidence 0.71;
the response result obtained by parsing the recognition result of engine C is: open the single-pick program of the attack, confidence 0.68;
and the combined response result is: open the single-rinse program of the washing machine, confidence 0.73.
Selecting the value with the highest confidence gives the response result with the highest confidence, namely: open the single-rinse program of the washing machine, as sketched below.
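The scoring and selection above can be condensed into the following Python sketch. The natural logarithm reproduces the printed figures; the selection rule (per-slot highest confidence, then highest response confidence for the final answer) is the one the example implies rather than a formula stated in the text, and the candidate table is the hypothetical one from the example.

```python
import math

def slot_score(slot_conf, response_conf):
    """Cross-entropy style score used in the example:
    -slot_confidence * ln(response_confidence)."""
    return -slot_conf * math.log(response_conf)

# (response confidence, {slot: (value, slot confidence)})
candidates = {
    'A':     (0.65, {'mode': ('single-rinse', 0.95)}),
    'B':     (0.71, {'device': ('washing machine', 0.92),
                     'mode': ('single-pick', 0.68)}),
    'C':     (0.68, {'mode': ('single-pick', 0.65)}),
    'multi': (0.73, {'device': ('washing machine', 0.91),
                     'mode': ('single-rinse', 0.93)}),
}

print(round(slot_score(0.95, 0.65), 4))   # 0.4092, as in the example

# Per slot, keep the value with the highest slot confidence ...
best_slots = {}
for resp_conf, slots in candidates.values():
    for name, (value, conf) in slots.items():
        if name not in best_slots or conf > best_slots[name][1]:
            best_slots[name] = (value, conf)
print(best_slots)   # device -> washing machine, mode -> single-rinse

# ... and answer with the candidate whose response confidence is highest.
best = max(candidates, key=lambda e: candidates[e][0])
print(best)         # 'multi': open the single-rinse program of the washing machine
```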
In an embodiment of the present application, after the target voice output by the target device according to the response result with the highest confidence coefficient is obtained and the response voice of the target voice is determined based on the target voice, the method further includes: and in the case that the response voice instructs to execute the operation corresponding to the target voice, instructing the target device to execute the operation.
Optionally, the target device may be a smart home appliance. Suppose the response result with the highest confidence output by the terminal is: please cook at 12:00. The target device (e.g., a smart rice cooker) then outputs the target voice according to this response result: confirm turning on the cooking mode at 12:00? The terminal outputs the response voice according to the target voice: confirmed, turn it on. At this point the target device (e.g., the smart rice cooker) may turn on the cooking working mode at 12:00.
The above technical solution is described below with reference to the preferred embodiments, but is not limited to the technical solution of the embodiments of the present application.
Preferred embodiment 1
FIG. 3 is a schematic diagram of an alternative speech recognition model according to an embodiment of the present application. As shown in FIG. 3, the multi-engine recognition results are input to the rejection module of the plurality of sub-domain models, domain recognition is performed on them to determine the corresponding sub-domains, a plurality of intention analyses are obtained through the sub-domains, and the corresponding response results are generated. Furthermore, by intention combination over the plurality of intention analyses, the combined response result corresponding to the combined intention is obtained; the duplicated parts of the responses and the combined response result are then removed, a selection is made according to the defined rules, the final response is output, and the dialogue state is updated.
For example, engine A's recognition result is "open the single-rinse program of the attack"; after semantic parsing, the output response is: which device do you want to operate?
Engine B's recognition result is "open the single-pick program of the washing machine"; after semantic parsing, the output response is: which program should be turned on?
Engine C's recognition result is "open the single-pick program of the attack"; after semantic parsing, the output response is: I did not understand, please say it again;
The combined recognition result is "open the single-rinse program of the washing machine"; after semantic parsing, the output response is: setting the single-rinse program for you;
The responses of engine A, engine B, and engine C all either need more information, miss slot information, or simply fail to understand, while the semantic information corresponding to the combined response result is clear and its response is definite; finally, the complete and meaningful response is output.
After determining the final output response result, the dialogue state is updated in the man-machine dialogue system.
The multi-engine speech recognition scheme in the embodiments of the present application starts from the source: on the one hand, multiple speech recognition engines reduce the error rate of speech-to-text conversion; on the other hand, semantic parsing is performed on the multiple recognized texts, dialogue responses are generated and preferred or combined according to certain strategies, and the final output response is selected by the established strategy. In the embodiments of the present application, the error rate of semantic parsing is thus reduced; dialogue responses can be generated first, and the dialogue state is updated afterwards, once a response has been selected according to the strategy.
In summary, with reference to FIG. 2 and FIG. 3, semantic parsing can describe the score of a segment of speech through the plurality of response results, the plurality of first confidences, the combined response result and its second confidence, and by calculating their cross entropy. The best response is selected by confidence comparison. Because the optimal response content is selected from multiple responses, the fault tolerance of the model is increased and the system is more stable. The combination of the multi-engine speech recognition scheme and the multi-intention parsing scheme is better suited to voice products in complex real-world environments.
In this embodiment, a voice recognition device is further provided, and the device is used to implement the foregoing embodiments and preferred embodiments, and will not be described in detail. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. While the means described in the following embodiments are preferably implemented in software, implementation in hardware, or a combination of software and hardware, is also possible and contemplated.
FIG. 4 is a block diagram of an alternative speech recognition device according to an embodiment of the application, as shown in FIG. 4, comprising:
a first processing module 42, configured to analyze a plurality of speech recognition results input into a plurality of sub-domain models, obtain a plurality of response results corresponding to the plurality of speech recognition results, and combine the plurality of response results to obtain a combined response result;
a first determining module 44, configured to determine a response result with the highest confidence according to the plurality of response results, the plurality of first confidence levels of the plurality of response results, the combined response result, and the second confidence level of the combined response result;
the first obtaining module 46 is configured to obtain a target voice output by the target device according to the response result with the highest confidence, and determine a response voice of the target voice based on the target voice, where the response voice is used to instruct the target device to determine whether to execute an operation corresponding to the target voice.
According to the application, a plurality of voice recognition results input into a plurality of sub-domain models are analyzed to obtain a plurality of response results corresponding to the plurality of voice recognition results, and the plurality of response results are combined to obtain a combined response result; determining a response result with highest confidence according to the response results, the first confidence degrees of the response results, the combined response result and the second confidence degrees of the combined response result; and acquiring target voice output by the target equipment according to the response result with the highest confidence, and determining response voice of the target voice based on the target voice, wherein the response voice is used for indicating the target equipment to determine whether to execute the operation corresponding to the target voice. By adopting the technical scheme, the problems that in the related technology, in an intelligent voice dialogue system, errors are easy to occur in voice recognition, the accuracy is low, and then analysis errors are caused to influence the success rate of the dialogue are solved.
In the embodiment of the present application, the first processing module 42 includes: a dividing unit 422, configured to divide each response result of the plurality of response results into a plurality of segmentation words according to a division rule; a determining unit 424, configured to determine a minimum edit distance between any two response results among the plurality of response results, and record location information corresponding to the minimum edit distance, where the minimum edit distance is used to represent a number of inconsistent word segments between any two response results, and the location information is used to represent a location where the inconsistent word segments are located in any two response results; the first processing unit 426 is configured to combine the plurality of response results according to the minimum editing distance and the position information, so as to obtain the combined response result.
Fig. 6 is a block diagram of another alternative speech recognition apparatus according to an embodiment of the present application, and as shown in fig. 6, the apparatus further includes:
a second determining module 48, configured to determine a plurality of categories corresponding to a plurality of voice recognition results before analyzing the plurality of voice recognition results input to a plurality of sub-domain models, where the plurality of categories are used to represent language types to which the plurality of voice recognition results belong, the plurality of voice recognition results are in one-to-one correspondence with the plurality of categories, and the plurality of sub-domain models are in one-to-one correspondence with the plurality of categories; the input module 50 is configured to input the plurality of speech recognition results into the plurality of sub-domain models according to the plurality of categories.
As shown in fig. 6, the apparatus further includes: a second obtaining module 52, configured to obtain a plurality of voice recognition results obtained by recognizing the voice information to be recognized by a plurality of voice recognition engines; the third determining module 54 is configured to determine the plurality of first confidence degrees corresponding to the plurality of speech recognition results.
As shown in fig. 6, the apparatus further includes: and an instruction module 56, configured to instruct the target device to perform the operation when the response voice instructs to perform the operation corresponding to the target voice.
An embodiment of the present application also provides a storage medium including a stored program, wherein the program executes the method of any one of the above.
Alternatively, in the present embodiment, the above-described storage medium may be configured to store program code for performing the steps of:
s1, analyzing a plurality of voice recognition results input into a plurality of sub-domain models to obtain a plurality of response results corresponding to the plurality of voice recognition results, and combining the plurality of response results to obtain a combined response result;
s2, determining a response result with highest confidence according to the response results, the first confidence degrees of the response results, the combined response result and the second confidence degrees of the combined response result;
s3, obtaining target voice output by the target equipment according to the response result with highest confidence, and determining response voice of the target voice based on the target voice, wherein the response voice is used for indicating the target equipment to determine whether to execute operation corresponding to the target voice.
Alternatively, in the present embodiment, the storage medium may include, but is not limited to: a U-disk, a Read-Only Memory (ROM), a random access Memory (Random Access Memory, RAM), a removable hard disk, a magnetic disk, or an optical disk, or other various media capable of storing program codes.
Alternatively, specific examples in this embodiment may refer to examples described in the foregoing embodiments and optional implementations, and this embodiment is not described herein.
An embodiment of the application also provides an electronic device comprising a memory having stored therein a computer program and a processor arranged to run the computer program to perform the steps of any of the method embodiments described above.
Optionally, the electronic device may further include a transmission device and an input/output device, where the transmission device is connected to the processor, and the input/output device is connected to the processor.
Alternatively, in the present embodiment, the above-described processor may be configured to execute the following steps by a computer program:
s1, analyzing a plurality of voice recognition results input into a plurality of sub-domain models to obtain a plurality of response results corresponding to the plurality of voice recognition results, and combining the plurality of response results to obtain a combined response result;
s2, determining a response result with highest confidence according to the response results, the first confidence degrees of the response results, the combined response result and the second confidence degrees of the combined response result;
s3, obtaining target voice output by the target equipment according to the response result with highest confidence, and determining response voice of the target voice based on the target voice, wherein the response voice is used for indicating the target equipment to determine whether to execute operation corresponding to the target voice.
Alternatively, specific examples in this embodiment may refer to examples described in the foregoing embodiments and optional implementations, and this embodiment is not described herein.
It will be appreciated by those skilled in the art that the modules or steps of the application described above may be implemented in a general purpose computing device, they may be concentrated on a single computing device, or distributed across a network of computing devices, they may alternatively be implemented in program code executable by computing devices, so that they may be stored in a memory device for execution by computing devices, and in some cases, the steps shown or described may be performed in a different order than that shown or described, or they may be separately fabricated into individual integrated circuit modules, or multiple modules or steps within them may be fabricated into a single integrated circuit module for implementation. Thus, the present application is not limited to any specific combination of hardware and software.
The above description is only of the preferred embodiments of the present application and is not intended to limit the present application, and various modifications and variations will be apparent to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the principle of the present application should be included in the protection scope of the present application.

Claims (10)

1. A voice recognition method applied to a voice interaction system, comprising:
analyzing a plurality of voice recognition results input into a plurality of sub-domain models to obtain a plurality of response results corresponding to the plurality of voice recognition results, and combining the plurality of response results to obtain a combined response result, wherein the plurality of voice recognition results are in one-to-one correspondence with a plurality of categories, the plurality of sub-domain models are in one-to-one correspondence with the plurality of categories, and the plurality of categories are used for representing language types to which the plurality of voice recognition results belong;
determining a response result with highest confidence according to the response results, the first confidence degrees of the response results, the combined response result and the second confidence degrees of the combined response result;
and acquiring target voice output by the target equipment according to the response result with the highest confidence coefficient, and determining response voice of the target voice based on the target voice, wherein the response voice is used for indicating the target equipment to determine whether to execute the operation corresponding to the target voice.
2. The method of claim 1, wherein combining the plurality of response results to obtain a combined response result comprises:
dividing each response result in the plurality of response results into a plurality of segmentation words according to a division rule;
determining the minimum editing distance between any two response results in the plurality of response results, and recording position information corresponding to the minimum editing distance, wherein the minimum editing distance is used for representing the number of inconsistent segmentation words between any two response results, and the position information is used for representing the positions of the inconsistent segmentation words in the any two response results;
and combining the plurality of response results according to the minimum editing distance and the position information to obtain the combined response result.
3. The method of claim 1, wherein prior to said analyzing the plurality of speech recognition results input to the plurality of sub-domain models, the method further comprises:
determining the multiple categories corresponding to the multiple voice recognition results;
and inputting the voice recognition results into the sub-domain models according to the categories.
4. The method of claim 1, wherein prior to said analyzing the plurality of speech recognition results input to the plurality of sub-domain models, the method further comprises:
acquiring a plurality of voice recognition results obtained by recognizing voice information to be recognized by a plurality of voice recognition engines;
and determining the first confidence degrees corresponding to the voice recognition results.
5. The method according to claim 1, wherein after the target voice output by the target device according to the response result with the highest confidence is obtained and the response voice of the target voice is determined based on the target voice, the method further comprises:
and under the condition that the response voice indicates to execute the operation corresponding to the target voice, indicating the target equipment to execute the operation.
6. A speech recognition apparatus, comprising:
the first processing module is used for analyzing a plurality of voice recognition results input into a plurality of sub-domain models to obtain a plurality of response results corresponding to the plurality of voice recognition results, and combining the plurality of response results to obtain a combined response result, wherein the plurality of voice recognition results are in one-to-one correspondence with a plurality of categories, the plurality of sub-domain models are in one-to-one correspondence with the plurality of categories, and the plurality of categories are used for representing language types to which the plurality of voice recognition results belong;
the first determining module is used for determining a response result with highest confidence according to the plurality of response results, the plurality of first confidence degrees of the plurality of response results, the combined response result and the second confidence degree of the combined response result;
the first acquisition module is used for acquiring target voice output by the target equipment according to the response result with the highest confidence coefficient, and determining response voice of the target voice based on the target voice, wherein the response voice is used for indicating the target equipment to determine whether to execute the operation corresponding to the target voice.
7. The apparatus of claim 6, wherein the first processing module comprises:
the dividing unit is used for dividing each response result in the plurality of response results into a plurality of segmentation words according to a dividing rule;
a determining unit, configured to determine a minimum editing distance between any two response results in the plurality of response results, and record location information corresponding to the minimum editing distance, where the minimum editing distance is used to represent a number of inconsistent segmentation words between any two response results, and the location information is used to represent a location of the inconsistent segmentation words in the any two response results;
and the first processing unit is used for combining the plurality of response results according to the minimum editing distance and the position information to obtain the combined response result.
8. The apparatus of claim 6, wherein the apparatus further comprises:
the second determining module is used for determining a plurality of categories corresponding to a plurality of voice recognition results before analyzing the plurality of voice recognition results input into a plurality of sub-domain models;
and the input module is used for inputting the voice recognition results into the sub-domain models according to the categories.
9. A storage medium having a computer program stored therein, wherein the computer program is arranged to perform the method of any of claims 1 to 5 when run.
10. An electronic device comprising a memory and a processor, characterized in that the memory has stored therein a computer program, the processor being arranged to run the computer program to perform the speech recognition method as claimed in any of the claims 1 to 5.
CN202010712229.3A 2020-07-22 2020-07-22 Speech recognition method and device, storage medium and electronic equipment Active CN111883122B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010712229.3A CN111883122B (en) 2020-07-22 2020-07-22 Speech recognition method and device, storage medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010712229.3A CN111883122B (en) 2020-07-22 2020-07-22 Speech recognition method and device, storage medium and electronic equipment

Publications (2)

Publication Number Publication Date
CN111883122A CN111883122A (en) 2020-11-03
CN111883122B true CN111883122B (en) 2023-10-27

Family

ID=73155211

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010712229.3A Active CN111883122B (en) 2020-07-22 2020-07-22 Speech recognition method and device, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN111883122B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112614490B (en) * 2020-12-09 2024-04-16 北京罗克维尔斯科技有限公司 Method, device, medium, equipment, system and vehicle for generating voice instruction
CN112735394B (en) * 2020-12-16 2022-12-30 青岛海尔科技有限公司 Semantic parsing method and device for voice
CN112836522B (en) * 2021-01-29 2023-07-21 青岛海尔科技有限公司 Method and device for determining voice recognition result, storage medium and electronic device
CN113593535A (en) * 2021-06-30 2021-11-02 青岛海尔科技有限公司 Voice data processing method and device, storage medium and electronic device
CN113793597A (en) * 2021-09-15 2021-12-14 云知声智能科技股份有限公司 Voice recognition method and device, electronic equipment and storage medium
CN114840653B (en) * 2022-04-26 2023-08-01 北京百度网讯科技有限公司 Dialogue processing method, device, equipment and storage medium

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7363228B2 (en) * 2003-09-18 2008-04-22 Interactive Intelligence, Inc. Speech recognition system and method
US7340395B2 (en) * 2004-04-23 2008-03-04 Sap Aktiengesellschaft Multiple speech recognition engines
CA2785081C (en) * 2009-12-31 2021-03-30 Volt Delta Resources, Llc Method and system for processing multiple speech recognition results from a single utterance
EP3089159B1 (en) * 2015-04-28 2019-08-28 Google LLC Correcting voice recognition using selective re-speak
JP6570651B2 (en) * 2015-11-25 2019-09-04 三菱電機株式会社 Voice dialogue apparatus and voice dialogue method
US10971157B2 (en) * 2017-01-11 2021-04-06 Nuance Communications, Inc. Methods and apparatus for hybrid speech recognition processing

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2010139744A (en) * 2008-12-11 2010-06-24 Ntt Docomo Inc Voice recognition result correcting device and voice recognition result correction method
CN104795069A (en) * 2014-01-21 2015-07-22 腾讯科技(深圳)有限公司 Speech recognition method and server
CN106796788A (en) * 2014-08-28 2017-05-31 苹果公司 Automatic speech recognition is improved based on user feedback
WO2018059957A1 (en) * 2016-09-30 2018-04-05 Robert Bosch Gmbh System and method for speech recognition
CN109791767A (en) * 2016-09-30 2019-05-21 罗伯特·博世有限公司 System and method for speech recognition
CN110634486A (en) * 2018-06-21 2019-12-31 阿里巴巴集团控股有限公司 Voice processing method and device
CN110148416A (en) * 2019-04-23 2019-08-20 腾讯科技(深圳)有限公司 Audio recognition method, device, equipment and storage medium
CN110491383A (en) * 2019-09-25 2019-11-22 北京声智科技有限公司 A kind of voice interactive method, device, system, storage medium and processor
CN111049996A (en) * 2019-12-26 2020-04-21 苏州思必驰信息科技有限公司 Multi-scene voice recognition method and device and intelligent customer service system applying same

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
SA Khoubrouy et al. Microphone array processing strategies for distant-based automatic speech recognition. IEEE Signal Processing Letters. 2016, pp. 1344-1348. *
Zhou Zuyang. Research on human voice recognition technology based on the VxWorks platform. China Excellent Doctoral and Master's Dissertations Full-text Database. 2005, pp. 5-7. *

Also Published As

Publication number Publication date
CN111883122A (en) 2020-11-03

Similar Documents

Publication Publication Date Title
CN111883122B (en) Speech recognition method and device, storage medium and electronic equipment
US5812975A (en) State transition model design method and voice recognition method and apparatus using same
US20210149994A1 (en) Device and method for machine reading comprehension question and answer
CN110853626B (en) Bidirectional attention neural network-based dialogue understanding method, device and equipment
JP2005084681A (en) Method and system for semantic language modeling and reliability measurement
US20080183468A1 (en) Augmentation and calibration of output from non-deterministic text generators by modeling its characteristics in specific environments
CN108364650B (en) Device and method for adjusting voice recognition result
CN108027814B (en) Stop word recognition method and device
CN106649253B (en) Auxiliary control method and system based on rear verifying
CN108932944B (en) Decoding method and device
CN110415679A (en) Voice error correction method, device, equipment and storage medium
CN111178081B (en) Semantic recognition method, server, electronic device and computer storage medium
CN109785846A (en) The role recognition method and device of the voice data of monophonic
CN111161726A (en) Intelligent voice interaction method, equipment, medium and system
CN111354354B (en) Training method, training device and terminal equipment based on semantic recognition
CN114420102B (en) Method and device for speech sentence-breaking, electronic equipment and storage medium
CN110866094B (en) Instruction recognition method, instruction recognition device, storage medium, and electronic device
JP5975938B2 (en) Speech recognition apparatus, speech recognition method and program
CN111640423B (en) Word boundary estimation method and device and electronic equipment
CN111680514B (en) Information processing and model training method, device, equipment and storage medium
CN102237082B (en) Self-adaption method of speech recognition system
CN114783424A (en) Text corpus screening method, device, equipment and storage medium
CN110428814A (en) A kind of method and device of speech recognition
KR20110071742A (en) Apparatus for utterance verification based on word specific confidence threshold
JP2007026347A (en) Text mining device, text mining method and text mining program

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant