CN111326140B - Speech recognition result discriminating method, correcting method, device, equipment and storage medium


Info

Publication number
CN111326140B
Authority
CN
China
Prior art keywords
voice
recognition result
probability
determining
target
Prior art date
Legal status
Active
Application number
CN202010170991.3A
Other languages
Chinese (zh)
Other versions
CN111326140A (en)
Inventor
王容基
舒翔
陈韬
Current Assignee
iFlytek Co Ltd
Original Assignee
iFlytek Co Ltd
Priority date
Filing date
Publication date
Application filed by iFlytek Co Ltd
Priority to CN202010170991.3A
Publication of CN111326140A
Application granted
Publication of CN111326140B
Legal status: Active

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/01: Assessment or evaluation of speech recognition systems
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/60: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for measuring the quality of voice signals

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Quality & Reliability (AREA)
  • Telephonic Communication Services (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application provides a speech recognition result discriminating method, a correcting method, a device, equipment and a storage medium. The discriminating method includes: acquiring user behavior characterization information corresponding to a second voice, wherein the user behavior characterization information can reflect the correlation between the content expressed by the second voice and that expressed by a first voice, the first voice being the voice input immediately before the second voice; and judging whether the target recognition result of the second voice is an erroneous recognition result according to the user behavior characterization information corresponding to the second voice. The correcting method includes: judging whether the target recognition result of the second voice is an erroneous recognition result using the above discriminating method, and if so, determining the correct recognition result of the second voice from the candidate recognition result set of the second voice according to that set and a pre-constructed structured database. The method and the device are simple to implement, do not depend on manual work, are compatible with varied user requirements, and offer strong universality and good user experience.

Description

Speech recognition result discriminating method, correcting method, device, equipment and storage medium
Technical Field
The present application relates to the field of speech processing technologies, and in particular to a method for discriminating a speech recognition result, a method for correcting one, and a corresponding device, equipment and storage medium.
Background
With the rapid development of artificial intelligence technology, intelligent terminals play an increasingly important role in people's lives, and the maturing of speech recognition technology has made voice interaction a human-computer interaction mode deeply favored by users.
In some application scenarios a user has an explicit requirement and typically inputs speech to express it so that the terminal can respond. The terminal generally needs to perform two processes, speech recognition and semantic understanding: first, the speech input by the user is recognized to obtain a speech recognition result, and then semantic understanding is performed on that result to obtain the user's requirement and respond to it.
It can be understood that, for the terminal to respond correctly, a correct speech recognition result must be obtained first. However, current speech recognition schemes cannot guarantee a correct recognition result for every input voice, and an erroneous recognition result affects subsequent semantic understanding, causing the terminal to respond incorrectly. To avoid such error responses, the erroneous recognition result must first be found and then corrected; how to find and correct erroneous recognition results is a problem that currently needs to be solved.
Disclosure of Invention
In view of this, the present application provides a speech recognition result discriminating method, a correcting method, a device, equipment and a storage medium, which are used to find erroneous recognition results and then determine the correct ones, so that a terminal can give a correct response to a user's input voice. The technical scheme is as follows:
a speech recognition result distinguishing method comprises the following steps:
acquiring user behavior characterization information corresponding to a second voice, wherein the user behavior characterization information can reflect the correlation between the content expressed by the second voice and the content expressed by a first voice, and the first voice is the voice input immediately before the second voice;
judging whether the target recognition result of the second voice is an erroneous recognition result according to the user behavior characterization information corresponding to the second voice.
Optionally, the user behavior characterization information corresponding to the second voice includes any one of, or any combination of, the following three kinds of information:
the input time interval between the first voice and the second voice, the user behavior information between the first voice input and the second voice input, and the recognition information corresponding to the first voice and the second voice respectively.
Optionally, the user behavior characterization information corresponding to the second voice includes: the recognition information corresponding to the first voice and the second voice respectively;
correspondingly, the determining whether the target recognition result of the second voice is an erroneous recognition result according to the user behavior characterization information corresponding to the second voice includes:
determining, according to the recognition information corresponding to the first voice and the second voice respectively, the probability that the content expressed by the first voice and the second voice is the same content;
judging whether the target recognition result of the second voice is an erroneous recognition result according to the probability that the content expressed by the first voice and the second voice is the same content.
Optionally, the user behavior characterization information corresponding to the second voice includes: the input time interval between the first voice and the second voice, the user behavior information between the first voice input and the second voice input, and the recognition information corresponding to the first voice and the second voice respectively;
correspondingly, the determining whether the target recognition result of the second voice is an erroneous recognition result according to the user behavior characterization information includes:
if the input time interval is smaller than a preset time threshold, determining, according to the user behavior information, the probability that the target recognition result of the second voice is an erroneous recognition result, and taking this probability as a first probability;
determining, according to the recognition information corresponding to the first voice and the second voice respectively, the probability that the target recognition result of the second voice is an erroneous recognition result, and taking this probability as a second probability;
and judging whether the target recognition result of the second voice is an erroneous recognition result according to the first probability and the second probability.
Optionally, the determining, according to the user behavior information, the probability that the target recognition result of the second voice is an erroneous recognition result includes:
determining the probability that the target recognition result of the second voice is an erroneous recognition result according to the probability that the target recognition result of the first voice is an erroneous recognition result and the user behavior information;
wherein the probability that the target recognition result of the first voice is an erroneous recognition result is determined according to the user behavior information between the input of the voice preceding the first voice and the input of the first voice.
Optionally, the determining the probability that the target recognition result of the second voice is an erroneous recognition result according to the probability that the target recognition result of the first voice is an erroneous recognition result and the user behavior information includes:
determining the behavior category corresponding to the user behavior information according to preset behavior categories;
determining, according to a preset correspondence between behavior categories and scores, the score of the behavior category corresponding to the user behavior information, and taking this score as a target score;
and determining the probability that the target recognition result of the second voice is an erroneous recognition result according to the probability that the target recognition result of the first voice is an erroneous recognition result and the target score.
Optionally, the determining, according to the recognition information corresponding to the first voice and the second voice respectively, the probability that the target recognition result of the second voice is an erroneous recognition result includes:
determining, according to the recognition information corresponding to the first voice and the second voice respectively, the probability that the content expressed by the first voice and the second voice is the same content, and taking this probability as the probability that the target recognition result of the second voice is an erroneous recognition result.
Optionally, the recognition information includes a candidate recognition result set corresponding to the voice and/or a target recognition result determined from the candidate recognition result set;
the determining, according to the recognition information corresponding to the first voice and the second voice respectively, the probability that the content expressed by the first voice and the second voice is the same content includes:
calculating the similarity in any one of four calculation modes: calculating the similarity between the target recognition result of the second voice and the target recognition result of the first voice; calculating the similarity between the target recognition result of the second voice and each candidate recognition result in the candidate recognition result set of the first voice; calculating the similarity between each candidate recognition result in the candidate recognition result set of the second voice and each candidate recognition result in the candidate recognition result set of the first voice; or calculating the similarity between the target recognition result of the first voice and each candidate recognition result in the candidate recognition result set of the second voice;
and determining the probability that the content expressed by the first voice and the second voice is the same content according to the similarity obtained in any one of the four calculation modes.
Optionally, the process of calculating the similarity of the two recognition results includes:
for each of the two recognition results:
performing word segmentation on the recognition result, and removing non-keywords from words obtained by the word segmentation to obtain keywords in the recognition result;
determining word vectors of the keywords, and determining weights of the keywords according to parts of speech of the keywords and the occurrence of the keywords in a pre-constructed structured database, wherein the structured database is constructed according to an application scene, the structured database comprises a plurality of data records, and each data record comprises at least one keyword;
determining sentence vectors of the recognition results according to the word vectors and the weights of the keywords;
and determining the similarity of the two recognition results according to the sentence vectors of the two recognition results.
Optionally, the determining, according to the first probability and the second probability, whether the target recognition result of the second voice is an erroneous recognition result includes:
determining the confidence of the target recognition result of the second voice according to the first probability and the second probability;
and if the confidence of the target recognition result of the second voice is smaller than a preset confidence threshold, judging that the target recognition result of the second voice is an erroneous recognition result.
A voice recognition result correction method comprising:
judging whether the target recognition result of the second voice is an erroneous recognition result by using the speech recognition result discriminating method described above;
if the target recognition result of the second voice is an erroneous recognition result, determining the correct recognition result of the second voice from the candidate recognition result set of the second voice according to the candidate recognition result set of the second voice and a pre-constructed structured database.
Optionally, the determining, according to the candidate recognition result set of the second voice and the pre-constructed structured database, a correct recognition result of the second voice from the candidate recognition result set of the second voice includes:
for each candidate recognition result in the set of candidate recognition results for the second speech:
extracting keywords from the candidate recognition result;
retrieving, in the structured database, a data record containing the keywords;
and if a data record containing the keywords is retrieved, determining the candidate recognition result as the correct recognition result of the second voice.
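The retrieval-based correction just described can be pictured with a short sketch; the function names, the keyword-set database layout and the subset test below are assumptions for illustration, not the prescribed implementation.

```python
from typing import Callable, List, Optional, Set

def correct_recognition_result(
    candidates: List[str],
    database: List[Set[str]],
    extract_keywords: Callable[[str], Set[str]],
) -> Optional[str]:
    """Return the first candidate whose keywords appear together in some data record."""
    for candidate in candidates:
        keywords = extract_keywords(candidate)  # segmentation + non-keyword removal
        # The candidate is taken as the correct recognition result if at least
        # one data record contains its keywords.
        if any(keywords <= record for record in database):
            return candidate
    return None  # no candidate matched any data record

# Toy usage with a two-record database; each record is a set of keywords:
db = [{"lisi", "drama"}, {"zhangsan", "variety", "program"}]
kw = lambda s: set(s.split()) - {"play"}
print(correct_recognition_result(["play lizi drama", "play lisi drama"], db, kw))
# -> "play lisi drama"
```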
A speech recognition result discriminating apparatus comprising: the information acquisition module and the recognition result judging module;
the information acquisition module is used for acquiring user behavior characterization information corresponding to a second voice, wherein the user behavior characterization information can reflect the correlation between the content expressed by the second voice and that expressed by a first voice, and the first voice is the voice input immediately before the second voice;
the recognition result judging module is used for judging whether the target recognition result of the second voice is an erroneous recognition result according to the user behavior characterization information corresponding to the second voice.
A voice recognition result correcting device comprises the voice recognition result judging device and a recognition result correcting module;
the voice recognition result judging device is used for judging whether the target recognition result of the second voice is an erroneous recognition result;
and the recognition result correction module is used for determining the correct recognition result of the second voice from the candidate recognition result set of the second voice according to the candidate recognition result set of the second voice and a pre-constructed structured database when the target recognition result of the second voice is an error recognition result.
A speech recognition result discriminating apparatus comprising: a memory and a processor;
the memory is used for storing programs;
the processor is configured to execute the program to implement each step of the method for discriminating a speech recognition result according to any one of the above.
A readable storage medium having stored thereon a computer program, wherein the computer program, when executed by a processor, implements the steps of the speech recognition result discrimination method according to any one of the above.
A speech recognition result correction apparatus, characterized by comprising: a memory and a processor;
the memory is used for storing programs;
the processor is configured to execute the program to implement each step of the method for correcting a speech recognition result according to any one of the above.
A readable storage medium having stored thereon a computer program, wherein the computer program, when executed by a processor, implements the steps of the method for correcting speech recognition results of any of the above.
Considering that a user makes certain specific behaviors in response to an error response of the terminal, the present application proposes obtaining user behavior characterization information that corresponds to the second voice and can reflect the correlation between the content expressed by the second voice and the previous input voice, and then judging, according to this information, whether the target recognition result of the second voice is an erroneous recognition result. The speech recognition result discriminating method provided by the embodiments of the present application can judge whether the recognition result of an input voice is wrong; it is simple to implement, compatible with varied user requirements, highly universal, and does not depend on manual work. The correcting method can not only judge whether the target recognition result of the second voice is an erroneous recognition result, but also, when it is, determine the correct recognition result from the candidate recognition result set of the second voice, so that subsequent semantic understanding is performed on the correct recognition result, the terminal gives a correct response, and the user experience is good.
Drawings
To describe the technical solutions in the embodiments of the present application more clearly, the following briefly introduces the accompanying drawings required for the embodiments. Apparently, the drawings described below are merely some embodiments of the present application, and a person skilled in the art may derive other drawings from them without inventive effort.
Fig. 1 is a flowchart of a speech recognition result discriminating method according to an embodiment of the present application;
Fig. 2 is a flowchart of one implementation of judging whether the target recognition result of the second voice is an erroneous recognition result according to the user behavior characterization information corresponding to the second voice, provided in an embodiment of the present application;
Fig. 3 is a flowchart of another implementation of judging whether the target recognition result of the second voice is an erroneous recognition result according to the user behavior characterization information corresponding to the second voice, provided in an embodiment of the present application;
Fig. 4 is a flowchart of a speech recognition result correcting method according to an embodiment of the present application;
Fig. 5 is a schematic structural diagram of a speech recognition result discriminating device according to an embodiment of the present application;
Fig. 6 is a schematic structural diagram of a speech recognition result correcting device according to an embodiment of the present application;
Fig. 7 is a schematic structural diagram of speech recognition result discriminating equipment according to an embodiment of the present application;
Fig. 8 is a schematic structural diagram of speech recognition result correcting equipment according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application are described below clearly and completely with reference to the accompanying drawings. Apparently, the described embodiments are merely some rather than all of the embodiments of the present application. All other embodiments obtained by a person skilled in the art based on these embodiments without inventive effort shall fall within the protection scope of the present application.
To solve the problem that an erroneous recognition result of an input voice leads to erroneous semantic understanding and an error response from the terminal, the inventors conducted research, with the following initial idea: collect a large number of erroneous recognition results, label these erroneous speech recognition results with the correct recognition results, and train a speech recognition model with the labeled recognition results so that the model can give correct recognition results.
Through research, the inventors found that this scheme has several defects. First, erroneous recognition results are difficult to collect, and gathering a large number of them takes a long time. Second, a large number of erroneous recognition results need to be labeled manually, at high time and labor cost. Third, the response is not timely: a long period is needed from data collection through labeling to model training, so it is difficult to solve in time the problem of a terminal giving an error response to some user requirement. Fourth, user requirements may change frequently, so the model would need frequent optimization training; that is, the scheme cannot keep up with users' varied requirements.
In view of the problems of the above scheme, the inventors conducted further research and finally provide a speech recognition result discriminating method and a speech recognition result correcting method that require no large amount of training data, do not depend on manual work, respond in time, and are compatible with varied user requirements. The discriminating method can judge whether the recognition result of an input voice is an erroneous recognition result; the correcting method can both judge whether the recognition result of an input voice is erroneous and, when it is, determine the correct recognition result of the input voice, so that subsequent semantic understanding can be based on the correct recognition result to acquire the user's requirement, and the terminal can respond correctly to the input voice.
The speech recognition result discriminating method and correcting method provided by the present application can be applied to a terminal with data processing capability. The terminal may be a screen terminal (such as a smartphone, a notebook computer, a PC, a PAD or a smart television) or a screenless terminal (such as a smart speaker). The terminal can receive the voice input by a user, perform speech recognition, judge whether the speech recognition result is wrong, determine the correct speech recognition result if it is, then perform semantic understanding on the correct result, and finally make a correct response based on the semantic understanding result.
Next, the speech recognition result discriminating method and the speech recognition result correcting method provided in the present application will be described by the following embodiments.
Referring to Fig. 1, which shows a flowchart of a speech recognition result discriminating method provided by an embodiment of the present application, the method may include:
step S101: and obtaining user behavior characterization information corresponding to the second voice.
The user behavior characterization information is information capable of characterizing user behavior, and it can reflect the correlation between the content expressed by the second voice and that expressed by the first voice. It should be noted that the first voice is the voice input immediately before the second voice.
Through the behaviors made by a large number of users in response to terminal error responses, the inventors found that the requirements of terminal users are quite scattered and change quickly, but for a product in a specific scenario the requirements are limited by the product's usage scenario, so users' behaviors exhibit certain characteristics. Taking users in a large-screen television scenario as an example, when the television responds erroneously to a user's input voice, the user's behavior usually has obvious characteristics such as continuous rapid expression, repeated expression of the same content, and specific remote-control key actions. For instance, when the user inputs to the television the voice "I want to watch TV drama A", if a speech recognition error causes a drama other than A to be presented, the user will quickly input the voice with the same content again. Likewise, when the television responds correctly to an input voice, the user usually presses the confirmation key of the remote control, whereas when the television responds erroneously, the user usually presses the return key, the home key, program-browsing keys (such as the up, down, left and right keys), and so on.
Based on the above findings, the present application proposes that information abstracted from user behavior and capable of characterizing it, namely user behavior characterization information, can be obtained and used to judge the target recognition result of an input voice.
In one possible implementation, the user behavior characterization information corresponding to the second voice may include, but is not limited to, any one of, or a combination of, the following three kinds of information: the input time interval between the first voice and the second voice, the user behavior information between the first voice input and the second voice input, and the recognition information corresponding to the first voice and the second voice respectively. The recognition information includes the candidate recognition result set corresponding to a voice and/or the target recognition result determined from that set. It should be noted that, in the speech recognition stage, a plurality of candidate recognition results are given for an input voice, each candidate recognition result having a score (the sum of an acoustic score and a language score), and the target recognition result is the candidate with the highest score among them.
It should be noted that all three kinds of information can characterize user behavior: the input time interval between the first voice and the second voice can characterize whether the user expressed continuously and rapidly; the user behavior information between the first voice input and the second voice input can characterize what feedback behavior the user made in response to the terminal's response to the first voice; and the recognition information corresponding to the first voice and the second voice can characterize whether the user repeatedly expressed the same content.
In view of discrimination accuracy, the user behavior characterization information corresponding to the second voice preferably includes the recognition information corresponding to the first voice and the second voice respectively, and more preferably includes the input time interval between the first voice and the second voice, the user behavior information between the first voice input and the second voice input, and the recognition information corresponding to the first voice and the second voice respectively.
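To make the data involved concrete, the following minimal Python sketch mirrors the three kinds of information listed above; the class names and fields are illustrative assumptions, not structures prescribed by the present application.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class RecognitionInfo:
    """Recognition information for one input voice."""
    candidates: List[str]  # candidate recognition results from the recognizer
    scores: List[float]    # per-candidate score: acoustic score + language score

    @property
    def target(self) -> str:
        # the target recognition result is the highest-scored candidate
        return self.candidates[self.scores.index(max(self.scores))]

@dataclass
class BehaviorCharacterization:
    """User behavior characterization information for a second voice."""
    input_interval_s: float                              # interval between the two inputs
    behaviors: List[str] = field(default_factory=list)   # key presses / voice commands in between
    first: Optional[RecognitionInfo] = None              # recognition info of the first voice
    second: Optional[RecognitionInfo] = None             # recognition info of the second voice
```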
Step S102: judging whether the target recognition result of the second voice is an erroneous recognition result according to the user behavior characterization information corresponding to the second voice.
It should be noted that the "target recognition result of the second voice" referred to in this application is the recognition result obtained by performing speech recognition on the second voice; this result may be the correct recognition result or an erroneous recognition result of the second voice. The "target recognition result of the first voice" referred to later is similar.
Starting from the behaviors users make in response to terminal error responses, the embodiment of the present application first obtains user behavior characterization information corresponding to the second voice, for example the input time interval between the first voice and the second voice (characterizing whether the user made continuous rapid input) and the recognition information corresponding to the two input voices (characterizing whether the user repeatedly expressed the same content), and then judges whether the target recognition result of the second voice is an erroneous recognition result according to this information. The speech recognition result discriminating method provided by the embodiment of the present application can judge whether the recognition result of an input voice is correct; it is simple to implement, compatible with varied user requirements, highly universal, and does not depend on manual work.
Another embodiment of the present application introduces "step S102" of the above embodiment: judging whether the target recognition result of the second voice is an erroneous recognition result according to the user behavior characterization information corresponding to the second voice.
As mentioned in the above embodiment, the user behavior characterization information corresponding to the second voice may include one or more of: the input time interval between the first voice and the second voice (characterizing whether the user made a continuous rapid expression), the user behavior information between the first voice input and the second voice input (characterizing what feedback behavior the user made in response to the response to the first voice), and the recognition information corresponding to the first voice and the second voice respectively (characterizing whether the user repeatedly expressed the same content). Considering that the behavior of "repeatedly expressing the same content" has relatively high confidence as a basis for discrimination, if only one of the three kinds of information is used as the basis, the recognition information corresponding to the first voice and the second voice respectively is preferred.
Next, "step S102" of the above embodiment is described taking as an example the case where the user behavior characterization information corresponding to the second voice includes the recognition information corresponding to the first voice and the second voice respectively.
Referring to Fig. 2, which shows a flowchart of one implementation of judging whether the target recognition result of the second voice is an erroneous recognition result according to the user behavior characterization information corresponding to the second voice, the process may include:
step S201: and determining the probability that the content expressed by the first voice and the second voice is the same content according to the identification information respectively corresponding to the first voice and the second voice.
The probability that the content expressed by the first voice and the second voice is the same content is the probability that the user repeatedly expresses the same content.
Step S202: judging whether the target recognition result of the second voice is an erroneous recognition result according to the probability that the content expressed by the first voice and the second voice is the same content.
The above embodiment mentioned that users usually "repeatedly express the same content" in response to a terminal error response; on this basis, this embodiment can judge whether the target recognition result of the second voice is an erroneous recognition result according to the probability that the content expressed by the first voice and the second voice is the same content. It should be noted that the greater this probability, the greater the probability that the target recognition result of the second voice is an erroneous recognition result; conversely, the smaller this probability, the smaller the probability that the target recognition result of the second voice is erroneous.
In some cases, judging the recognition result based on only one of the three kinds of information is insufficient; that is, the accuracy of a judgment based on a single kind of information may not be high enough. To obtain high judgment accuracy, the user behavior characterization information corresponding to the second voice may include the input time interval between the first voice and the second voice, the user behavior information between the first voice input and the second voice input, and the recognition information corresponding to the first voice and the second voice respectively. On this basis, referring to Fig. 3, which shows a flowchart of another implementation of judging whether the target recognition result of the second voice is an erroneous recognition result according to the user behavior characterization information corresponding to the second voice, the process may include:
step S301: and judging that the input time interval of the first voice and the second voice is smaller than a preset time threshold, if yes, executing the step S302a and the step S302b, otherwise, executing other operations (carrying out semantic understanding on the target recognition result of the second voice, and further enabling the terminal to respond).
It should be noted that an input time interval between the first voice and the second voice smaller than the preset time threshold indicates that the user performed continuous, rapid input.
Step S302a: determining, according to the user behavior information between the first voice input and the second voice input, the probability that the target recognition result of the second voice is an erroneous recognition result, and taking this probability as a first probability P1.
Step S302b: determining, according to the recognition information corresponding to the first voice and the second voice respectively, the probability that the target recognition result of the second voice is an erroneous recognition result, and taking this probability as a second probability P2.
Specifically, the probability that the content expressed by the first voice and the second voice is the same content can be determined according to the recognition information corresponding to the first voice and the second voice respectively, and used as the probability that the target recognition result of the second voice is an erroneous recognition result.
This embodiment does not limit the execution order of step S302a and step S302b: step S302a may be executed before step S302b, step S302b before step S302a, or the two steps may be executed in parallel.
Step S303: judging whether the target recognition result of the second voice is an erroneous recognition result according to the first probability P1 and the second probability P2.
Specifically, the confidence P of the target recognition result of the second voice can be determined according to the first probability P1 and the second probability P2; if the confidence P of the target recognition result of the second voice is smaller than a preset confidence threshold P_th, it is judged that the target recognition result of the second voice is an erroneous recognition result; otherwise, it is judged to be the correct recognition result.
It should be noted that, since the first probability P1 and the second probability P2 are determined from two different dimensions (the first probability from the dimension of the user's feedback behavior, the second probability from the dimension of the similarity of the content expressed twice by the user), when the confidence P of the target recognition result of the second voice is determined according to the first probability P1 and the second probability P2, the two probabilities are first normalized. Optionally, the normalization may be performed according to the following formula:

P̂ = (P - Mean) / StandardDeviation    (1)

where Mean denotes the mean and StandardDeviation the standard deviation of the corresponding probability over the input rounds. Assuming the user performs 3 consecutive rounds of input (the voice input in the 3rd round being the "second voice" in the present application and the voice input in the 2nd round being the "first voice"), Mean1 is the mean of the first probability corresponding to the 3rd-round input voice (i.e., the first probability determined according to the user behavior information between the 2nd-round input and the 3rd-round input) and the first probability corresponding to the 2nd-round input voice (i.e., the first probability determined according to the user behavior information between the 1st-round input and the 2nd-round input), and StandardDeviation1 is the standard deviation of these two first probabilities. Similarly, Mean2 is the mean of the second probability corresponding to the 3rd-round input voice (determined according to the recognition information of the 3rd-round and 2nd-round input voices) and the second probability corresponding to the 2nd-round input voice (determined according to the recognition information of the 2nd-round and 1st-round input voices), and StandardDeviation2 is the standard deviation of these two second probabilities.

After the normalized first probability P̂1 and the normalized second probability P̂2 are obtained, the confidence P of the target recognition result of the second voice is determined according to P̂1 and P̂2. There are various implementations: for example, the two probabilities P̂1 and P̂2 may be multiplied and the product used as the confidence P of the target recognition result of the second voice; alternatively, the two probabilities may be treated as two variables and a method for solving the optimal solution of a two-variable problem (for example, gradient descent) may be adopted, the obtained value being used as the confidence P of the target recognition result of the second voice.
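The confidence computation just described can be condensed into a short sketch. It is one reading of the text rather than the prescribed implementation: how the per-round probability histories are collected, the choice of the multiplication variant, and the threshold value are all assumptions.

```python
import statistics

def z_normalize(history):
    """Formula (1): normalize the latest probability against the mean and
    standard deviation of the probabilities observed over the input rounds."""
    mean = statistics.mean(history)
    std = statistics.pstdev(history) or 1.0  # guard against zero deviation
    return (history[-1] - mean) / std

def confidence(p1_history, p2_history):
    p1_hat = z_normalize(p1_history)  # feedback-behavior dimension
    p2_hat = z_normalize(p2_history)  # repeated-content dimension
    return p1_hat * p2_hat            # the multiplication variant described above

# Three rounds of input: each history holds the per-round probability values.
P = confidence(p1_history=[0.20, 0.45], p2_history=[0.30, 0.70])
is_erroneous = P < 0.1  # P_th is a preset threshold; 0.1 is an illustrative value
```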
In another embodiment of the present application, "determining, according to the user behavior information between the first voice input and the second voice input, the probability that the target recognition result of the second voice is an erroneous recognition result" in step S302a of the above embodiment is described.
Determining the probability that the target recognition result of the second voice is an erroneous recognition result according to the user behavior information between the first voice input and the second voice input may include: determining the probability that the target recognition result of the second voice is an erroneous recognition result according to the probability that the target recognition result of the first voice is an erroneous recognition result and the user behavior information.
Here, the probability that the target recognition result of the first voice is an erroneous recognition result is determined according to the user behavior information between the input of the voice preceding the first voice and the input of the first voice; specifically, it is determined according to that user behavior information together with the probability that the target recognition result of the voice preceding the first voice is an erroneous recognition result.
Further, the process of determining the probability that the target recognition result of the second voice is an erroneous recognition result according to the probability that the target recognition result of the first voice is an erroneous recognition result and the user behavior information may include:
Step a1: determining the behavior category corresponding to the user behavior information according to preset behavior categories.
In one possible implementation, the following behavior categories may be preset: "explicit positive", "explicit negative", "implicit positive" and "implicit negative".
If, after the terminal responds to an input voice, the user presses a confirmation key or issues a voice instruction indicating confirmation, the user's behavior is considered an "explicit positive" behavior; if the user presses a key to exit the current page (such as the return key or the home key) or issues a voice instruction indicating exit from the current page, the behavior is considered "explicit negative"; if the user browses, for example by pressing the up, down, left and right keys or turning pages by voice, the behavior is considered "implicit positive"; and if the user shows behavior indicating that the response misses the intention or is unsatisfactory, for example expressing again within a very short time (usually <5 s) or using negative words in the newly expressed content, the behavior is considered "implicit negative".
It should be noted that the user behavior information can indicate what feedback behavior the user made in response to the terminal's response, so the behavior it indicates can be assigned to one of the four categories. For example, if the user behavior information indicates that the user turned pages by voice, the corresponding behavior category can be determined to be "implicit positive".
Step a2: determining, according to a preset correspondence between behavior categories and scores, the score of the behavior category corresponding to the user behavior information, and taking this score as a target score.
When the behavior categories are set, the score corresponding to each category can be set at the same time; that is, the correspondence between behavior categories and scores is preset. For example, the score corresponding to "explicit positive" may be set to 0.5, "explicit negative" to -0.5, "implicit positive" to 0.05, and "implicit negative" to -0.05. If the user behavior information indicates that the user turned pages by voice, the corresponding behavior category is "implicit positive", whose score is 0.05 according to the correspondence; that is, the target score is 0.05.
Step a3: determining the probability that the target recognition result of the second voice is an erroneous recognition result according to the probability that the target recognition result of the first voice is an erroneous recognition result and the target score.
Specifically, according to the probability that the target recognition result of the first voice is an erroneous recognition result and the target score, the probability that the target recognition result of the second voice is an erroneous recognition result can be determined using the following formula:

P(x_i) = | P(x_{i-1}) * ∑_i γ_i |    (2)

where P(x_i) is the probability that the target recognition result of the second voice is an erroneous recognition result, P(x_{i-1}) is the probability that the target recognition result of the first voice (i.e., the voice input immediately before the second voice) is an erroneous recognition result, and γ_i is the score of a behavior category corresponding to the user behavior information between the first voice input and the second voice input. It should be noted that multiple user behaviors may occur between the first voice input and the second voice input; for example, the user may first make an "implicit positive" behavior and then an "explicit negative" behavior, in which case the score corresponding to "implicit positive" is summed with the score corresponding to "explicit negative". The term ∑_i γ_i in the above formula is exactly this sum of the scores of the behavior categories corresponding to the multiple user behaviors.
If the user performs three consecutive rounds of input, with the second voice being the voice input in the third round and the first voice the voice input in the second round, then P(x_i) is P(x_3), the probability that the target recognition result of the third-round voice is an erroneous recognition result, and P(x_{i-1}) is P(x_2), the probability that the target recognition result of the second-round voice is an erroneous recognition result; P(x_2) is itself determined according to the scores of the behavior categories corresponding to the user behaviors between the first-round and second-round voice inputs, together with P(x_1).
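Formula (2) is small enough to sketch directly in code. The dictionary keys are illustrative; the score values are the example values given above, and the example call assumes P(x_2) = 0.8.

```python
# Illustrative scores for the four behavior categories (values from the text).
BEHAVIOR_SCORES = {
    "explicit_positive": 0.5,
    "explicit_negative": -0.5,
    "implicit_positive": 0.05,
    "implicit_negative": -0.05,
}

def erroneous_probability(prev_probability: float, behaviors: list) -> float:
    """Formula (2): P(x_i) = |P(x_{i-1}) * sum of behavior-category scores|,
    where `behaviors` are the user behaviors observed between the two inputs."""
    gamma_sum = sum(BEHAVIOR_SCORES[b] for b in behaviors)
    return abs(prev_probability * gamma_sum)

# The user first browsed (implicit positive), then pressed the return key
# (explicit negative) before entering the next voice:
p_x3 = erroneous_probability(0.8, ["implicit_positive", "explicit_negative"])
# |0.8 * (0.05 - 0.5)| = 0.36
```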
The above embodiments give two possible implementations of "judging whether the target recognition result of the second voice is an erroneous recognition result according to the user behavior characterization information corresponding to the second voice". Both implementations involve "determining, according to the recognition information corresponding to the first voice and the second voice respectively, the probability that the content expressed by the first voice and the second voice is the same content"; the implementation process of this determination is introduced next.
The determining, according to the recognition information corresponding to the first voice and the second voice respectively, the probability that the content expressed by the first voice and the second voice is the same content may include:
step b1, calculating the similarity according to any one of the following four calculation modes:
the first calculation mode is as follows: and calculating the similarity between the target recognition result of the second voice and the target recognition result of the first voice.
The second calculation mode: and calculating the similarity between the target recognition result of the second voice and each candidate recognition result in the candidate recognition result set of the first voice.
The second calculation mode: and calculating the similarity between each candidate recognition result in the candidate recognition result set of the second voice and each candidate recognition result in the candidate recognition result set of the first voice.
Fourth calculation method: and calculating the similarity between the target recognition result of the first voice and each candidate recognition result in the candidate recognition result set of the second voice.
Assuming that the candidate recognition result set of the second voice is {e1, e2, e3, e4} with target recognition result e2, and the candidate recognition result set of the first voice is {f1, f2, f3, f4} with target recognition result f3, then: the first calculation mode calculates the similarity of e2 and f3; the second calculates the similarities of e2 with each of f1, f2, f3 and f4; the third calculates the similarities of each of e1, e2, e3 and e4 with each of f1, f2, f3 and f4; and the fourth calculates the similarities of f3 with each of e1, e2, e3 and e4.
Among the four calculation methods, the third calculation method is preferable.
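As one concrete picture of the preferred third mode, the sketch below computes every pairwise similarity between the two candidate sets and aggregates with a maximum; the max aggregation is an assumption, since the text computes the pairwise similarities without fixing here how they are combined into a single probability. The `similarity` argument stands for the recognition-result similarity whose computation is described next.

```python
from itertools import product
from typing import Callable, List

def same_content_probability(
    second_candidates: List[str],
    first_candidates: List[str],
    similarity: Callable[[str, str], float],
) -> float:
    """Third calculation mode: similarity of every candidate of the second
    voice against every candidate of the first voice, aggregated by max."""
    return max(similarity(e, f) for e, f in product(second_candidates, first_candidates))
```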
In all of the above four calculation modes, the similarity between two recognition results needs to be calculated; the procedure for calculating the similarity of two recognition results is given below:
step c1, respectively determining sentence vectors of two recognition results in the following manner:
and c11, performing word segmentation on the recognition result, and removing non-keywords from words obtained by the word segmentation to obtain keywords in the recognition result.
It should be noted that in some application scenarios, users' voice inputs typically follow typical sentence patterns, such as "I want to watch ..." or "play ...". On this basis, the words in a typical sentence pattern can be removed as non-keywords, i.e. "I / want / watch" and "play" are removed.
For example, a recognition result in the television scenario is "I want to watch Zhang San's variety program". After word segmentation, "I / want / watch / Zhang San / of / variety / program" is obtained; since "I want to watch ..." is a typical sentence pattern of the television scenario, "I / want / watch" is removed, and the remainder, "Zhang San / of / variety / program", is kept as the keywords.
Step c12, determining word vectors of the keywords, and determining weights of the keywords according to the parts of speech of the keywords and the occurrence of the keywords in a pre-constructed structured database.
For the recognition result "I want to watch Zhang San's variety program", the keywords "Zhang San", "of", "variety" and "program" can be obtained through step c11. The purpose of this step is to determine the word vectors of these four keywords; specifically, word2vec may be used to determine the word vector of each of the four words. In general, after word2vec determines a word vector, dimension-reduction processing is also required for the determined word vector; for example, if word2vec determines the word vector of "variety" to be {0 0 11 …}, dimension reduction yields {0.72 -0.43 …}, and the word vectors obtained after the dimension-reduction processing are used as the final word vectors.
In this step, the weights of the keywords are determined in addition to their word vectors; the part of speech of each keyword and its occurrence in the pre-constructed structured database are considered when determining the weight.
Specifically, weights corresponding to the various parts of speech are preset. After the keywords are obtained, the part of speech of each keyword is determined, and the weight corresponding to that part of speech is used as the keyword's initial weight; the word frequency of the keyword in the structured database is also determined, and the final weight of the keyword is determined from its initial weight and its word frequency in the structured database. Specifically, the initial weight of the keyword may be multiplied by its word frequency in the structured database, and the product used as the keyword's final weight.
For example, the weight for nouns may be set to 0.8, for pronouns to 0.05, for verbs to 0.2, for adjectives to 0.4, and for auxiliary words to 0. Among the four keywords above, "Zhang San", "variety", and "program" are nouns, while "的" is an auxiliary word, so the weights corresponding to their parts of speech are taken as their initial weights: 0.8 for "Zhang San", "variety", and "program", and 0 for "的". Meanwhile, the word frequencies with which the four keywords appear in the structured database are determined. Assuming these frequencies are cf1, cf2, cf3, and cf4 respectively, the final weights of the four keywords are 0.8·cf1, 0, 0.8·cf3, and 0.8·cf4.
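A short sketch of this weight computation follows; the POS tag names and the helper name keyword_weight are assumed for illustration, while the numeric weights follow the example above.

```python
# Final keyword weight = initial (part-of-speech) weight * word frequency
# in the structured database, as described above.
POS_WEIGHTS = {"noun": 0.8, "pronoun": 0.05, "verb": 0.2,
               "adjective": 0.4, "auxiliary": 0.0}

def keyword_weight(pos: str, word_freq: float) -> float:
    """Multiply the POS-based initial weight by the database word frequency."""
    return POS_WEIGHTS.get(pos, 0.0) * word_freq

cf1 = 0.6                                  # assumed word frequency of "张三"
print(keyword_weight("noun", cf1))         # 0.8 * cf1 = 0.48
print(keyword_weight("auxiliary", 0.9))    # auxiliary words always weigh 0
```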
It should be noted that the structured database is set according to the application scenario and may include a plurality of data records, each data record containing at least one keyword. The following table shows an example of the structured database for a television scenario; as can be seen from Table 1, each data record corresponds to a television program, and the keywords in each data record are key information about the corresponding program, such as the type of the program and the actors in the program.
Table 1: Example structured database for a television scenario
[The contents of Table 1 appear as images in the original publication and are not reproduced here.]
Alternatively, the word frequency with which a keyword appears in the structured database may be determined as follows:
cf(y) = (n(y) + u(y)) / (n + u)
where cf(y) is the word frequency of keyword y in the structured database, n is the total number of data records in the structured database, n(y) is the number of data records that contain keyword y, u is a smoothing coefficient, and u(y) is the smoothing coefficient of keyword y, generally set to 1.
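A sketch of this word-frequency computation is given below, assuming the additive-smoothing form reconstructed above; representing each data record as a set of keywords is also an illustrative choice.

```python
# Smoothed word frequency: cf(y) = (n(y) + u(y)) / (n + u).
def word_frequency(records, keyword, u=1.0, u_y=1.0):
    """Smoothed fraction of data records containing the keyword."""
    n = len(records)                                # total data records
    n_y = sum(1 for r in records if keyword in r)   # records containing y
    return (n_y + u_y) / (n + u)

records = [{"张三", "综艺", "节目"}, {"李四", "电视剧"}]
print(word_frequency(records, "综艺"))              # (1 + 1) / (2 + 1) = 0.667
```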
Step c13: determining the sentence vector of the recognition result according to the word vectors and weights of the keywords.
Specifically, each keyword's word vector is multiplied by its weight to obtain a weighted word vector; the weighted word vectors of all keywords are then concatenated, and the concatenated vector is used as the sentence vector of the recognition result.
For the recognition result "I want to watch Zhang San's variety program", the word vectors and weights of the four keywords are obtained through step c12. If the word vectors of the four keywords are v1, v2, v3, and v4, and their weights are w1, w2, w3, and w4, then w1·v1, w2·v2, w3·v3, and w4·v4 are concatenated, and the concatenated vector is used as the sentence vector of the recognition result "I want to watch Zhang San's variety program".
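The construction of step c13 can be sketched as follows; the function name and the toy numbers are illustrative.

```python
# Step c13: scale each word vector by its weight, then concatenate.
import numpy as np

def sentence_vector(word_vectors, weights):
    """Concatenate weight-scaled word vectors: [w1*v1, w2*v2, ...]."""
    return np.concatenate([w * v for v, w in zip(word_vectors, weights)])

v = [np.array([0.72, -0.43]), np.array([0.10, 0.55])]   # toy word vectors
w = [0.8, 0.0]                                           # toy weights
print(sentence_vector(v, w))    # [0.576, -0.344, 0.0, 0.0]
```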
Step c2: determining the similarity of the two recognition results according to the sentence vectors of the two recognition results.
Through step c1, the sentence vectors of the two recognition results are obtained. Since a sentence vector is a characterization vector of its recognition result, the similarity between the two sentence vectors can be calculated and used as the similarity of the two recognition results.
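The embodiment does not fix a particular vector similarity measure; the sketch below assumes cosine similarity, a common choice.

```python
# Step c2, assuming cosine similarity between the two sentence vectors
# (which are assumed here to have the same dimension).
import numpy as np

def similarity(s1, s2):
    """Cosine similarity; returns 0.0 for a zero-length vector."""
    denom = np.linalg.norm(s1) * np.linalg.norm(s2)
    return float(np.dot(s1, s2) / denom) if denom else 0.0
```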
Step b2: determining, according to the similarity obtained by any one of the four calculation modes, the probability that the content expressed by the first voice and the second voice is the same content.
In the first calculation mode, only one similarity is obtained, so it can be used directly as the probability that the content expressed by the first voice and the second voice is the same content. In the other three calculation modes, a plurality of similarities are obtained; these similarities are multiplied together, and the product is used as the probability that the content expressed by the first voice and the second voice is the same content.
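Step b2 can be sketched for all four calculation modes as follows; sim stands for the recognition-result similarity routine outlined in steps c1 and c2, and the function name is an assumption.

```python
# Step b2: mode 1 uses its single similarity directly as the probability;
# modes 2-4 multiply all the similarities they produce.
from itertools import product
import math

def same_content_probability(mode, e_set, e_tgt, f_set, f_tgt, sim):
    # e_*: second voice (candidate set / target); f_*: first voice
    if mode == 1:
        return sim(e_tgt, f_tgt)
    if mode == 2:
        sims = [sim(e_tgt, f) for f in f_set]
    elif mode == 3:
        sims = [sim(e, f) for e, f in product(e_set, f_set)]
    else:                                       # mode 4
        sims = [sim(f_tgt, e) for e in e_set]
    return math.prod(sims)
```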
The speech recognition result discrimination method described above requires neither collecting a large amount of training data nor manually analyzing and labeling data; it therefore does not depend on manual effort, can be compatible with a variety of user requirements, and has high universality.
The embodiment of the application also provides a method for correcting the voice recognition result, referring to fig. 4, a flow diagram of the method is shown, and the method may include:
step S401: and obtaining user behavior characterization information corresponding to the second voice.
The user behavior characterization information can reflect the correlation between the second voice and the content expressed by the first voice; the first voice is the voice input before the second voice.
Step S402: judging whether the target recognition result of the second voice is an error recognition result according to the user behavior characterization information corresponding to the second voice.
It should be noted that the specific implementation of steps S401 to S402 is the same as that of steps S101 to S102 in the speech recognition result discrimination method provided in the foregoing embodiment; reference may be made to that embodiment, and details are not repeated here.
Step S403: if the target recognition result of the second voice is an erroneous recognition result, determining the correct recognition result of the second voice from the candidate recognition result set of the second voice according to that candidate set and a pre-constructed structured database.
Specifically, the process of determining the correct recognition result of the second voice from the candidate recognition result set of the second voice according to that candidate set and the pre-constructed structured database may include: for each candidate recognition result in the candidate recognition result set of the second voice, first extracting keywords from the candidate recognition result, then searching the structured database for a data record containing the extracted keywords, and, if such a data record is found, determining the candidate recognition result as the correct recognition result of the second voice. It should be noted that the structured database in this embodiment is the structured database mentioned in the foregoing embodiment.
For example, suppose the candidate recognition result set of the second voice contains three candidate recognition results that differ only in a homophonic name: "I want to watch Li Si's variety show", "I want to watch Li Shi's variety show", and "I want to watch Li Sih's variety show". For the first candidate, the keywords "Li Si" and "variety" are extracted and the structured database is searched for data records containing both; similarly, the keywords "Li Shi" and "variety" are extracted from the second candidate and searched for, and the keywords "Li Sih" and "variety" are extracted from the third candidate and searched for. If a data record containing "Li Si" and "variety" is retrieved, but no data record containing "Li Shi" and "variety" and no data record containing "Li Sih" and "variety" is found, the candidate recognition result "I want to watch Li Si's variety show" is determined as the correct recognition result of the second voice.
It should be noted that, in some cases, more than one correct recognition result may be determined in the above manner. If this occurs, the score of each recognition result so determined can be obtained, and the recognition result with the highest score used as the final correct recognition result of the second voice. For example, suppose three correct recognition results are determined in this manner, namely recognition result a, recognition result b, and recognition result c. The scores of the three results (each score being the sum of the acoustic score and the language score obtained in the speech recognition stage) are obtained, and, assuming recognition result b has the highest score, recognition result b is taken as the final correct recognition result.
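A sketch of step S403 with this highest-score tie-break follows; extract_keywords is the step-c11 routine sketched earlier, and the data-record representation and function name are assumptions.

```python
# Step S403: keep candidates whose keywords are all contained in some data
# record; if several remain, pick the one with the highest score
# (acoustic score + language score from the recognition stage).
def correct_result(candidates, scores, records):
    matched = []
    for cand in candidates:
        keys = set(extract_keywords(cand))
        if any(keys <= record for record in records):  # record holds all keys
            matched.append(cand)
    if not matched:
        return None                                    # no correction found
    return max(matched, key=lambda c: scores[c])       # highest-scoring match
```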
According to the speech recognition result correction method provided by this embodiment, whether the target recognition result of the second voice is an erroneous recognition result can be judged from the user behavior characterization information corresponding to the second voice; when it is judged to be erroneous, the correct recognition result can be determined from the candidate recognition results of the second voice, so that semantic understanding can subsequently be performed on the correct recognition result and the terminal can give a correct response, providing a good user experience. The speech recognition result correction method provided by this embodiment is simple to implement, does not depend on manual effort, can be compatible with a variety of user requirements, and has high universality.
An embodiment of the present application further provides a speech recognition result discriminating apparatus, which is described below; the speech recognition result discriminating apparatus described below and the speech recognition result discriminating method described above may be referred to in correspondence with each other.
Referring to fig. 5, which shows a schematic structural diagram of a speech recognition result discriminating apparatus 50 provided in an embodiment of the present application, the apparatus may include: an information acquisition module 501 and a recognition result discrimination module 502.
The information obtaining module 501 is configured to obtain user behavior characterization information corresponding to the second voice.
The user behavior characterization information can reflect the correlation between the second voice and the content expressed by the first voice, where the first voice is the voice input before the second voice.
The recognition result judging module 502 is configured to judge whether the target recognition result of the second voice is an erroneous recognition result according to the user behavior characterization information corresponding to the second voice.
According to the speech recognition result discriminating apparatus provided by this embodiment of the application, whether the target recognition result of the second voice is an erroneous recognition result can be judged from the user behavior characterization information corresponding to the second voice; the discrimination process is simple to implement, can meet a variety of user requirements, has high universality, and does not rely on manual effort.
In a possible implementation, in the speech recognition result discriminating apparatus provided in the foregoing embodiment, the user behavior characterization information corresponding to the second voice obtained by the information obtaining module 501 includes any one or more of the following three types of information: the input time interval between the first voice and the second voice, the user behavior information from when the first voice was input to when the second voice was input, and the recognition information corresponding to the first voice and the second voice respectively.
In a possible implementation, in the speech recognition result discriminating apparatus provided in the foregoing embodiment, the user behavior characterization information corresponding to the second voice obtained by the information obtaining module 501 includes the recognition information corresponding to the first voice and the second voice respectively.
Accordingly, the recognition result discrimination module 502 may include: the probability determination sub-module and the recognition result discrimination sub-module.
The probability determination sub-module is configured to determine, according to the recognition information corresponding to the first voice and the second voice respectively, the probability that the content expressed by the first voice and the second voice is the same content.

The recognition result discrimination sub-module is configured to judge, according to the probability that the content expressed by the first voice and the second voice is the same content, whether the target recognition result of the second voice is an erroneous recognition result.
In a possible implementation, in the speech recognition result discriminating apparatus provided in the foregoing embodiment, the user behavior characterization information corresponding to the second voice obtained by the information obtaining module 501 includes: the input time interval between the first voice and the second voice, the user behavior information from when the first voice was input to when the second voice was input, and the recognition information corresponding to the first voice and the second voice respectively.
Accordingly, the recognition result discrimination module 502 may include: a first probability determination submodule, a second probability determination submodule, and a recognition result discrimination sub-module.
The first probability determination submodule is configured to determine, if the input time interval is smaller than a preset time threshold, the probability that the target recognition result of the second voice is an erroneous recognition result according to the user behavior information, as a first probability.

The second probability determination submodule is configured to determine the probability that the target recognition result of the second voice is an erroneous recognition result according to the recognition information corresponding to the first voice and the second voice respectively, as a second probability.

The recognition result discrimination sub-module is configured to judge whether the target recognition result of the second voice is an erroneous recognition result according to the first probability and the second probability.
In a possible implementation, the first probability determination submodule is specifically configured to determine the probability that the target recognition result of the second voice is an erroneous recognition result according to the probability that the target recognition result corresponding to the first voice is an erroneous recognition result and the user behavior information.

The probability that the target recognition result corresponding to the first voice is an erroneous recognition result is determined according to the user behavior information from when the voice previous to the first voice was input to when the first voice was input.
In one possible implementation, the first probability determination submodule may include: a behavior category determination sub-module, a score determination sub-module, and an error probability determination sub-module.

The behavior category determination sub-module is configured to determine the behavior category corresponding to the user behavior information according to preset behavior categories.

The score determination sub-module is configured to determine, according to a preset correspondence between behavior categories and scores, the score of the behavior category corresponding to the user behavior information, as a target score.

The error probability determination sub-module is configured to determine the probability that the target recognition result of the second voice is an erroneous recognition result according to the probability that the recognition result corresponding to the first voice is an erroneous recognition result and the target score.
In a possible implementation, the second probability determination submodule is specifically configured to determine, according to the recognition information corresponding to the first voice and the second voice respectively, the probability that the content expressed by the first voice and the second voice is the same content, as the probability that the target recognition result of the second voice is an erroneous recognition result.
In a possible implementation, the recognition information includes the candidate recognition result set of the corresponding voice and/or the target recognition result determined from the candidate recognition result set.
When determining, according to the recognition information corresponding to the first voice and the second voice respectively, the probability that the content expressed by the first voice and the second voice is the same content, the probability determination sub-module and the second probability determination submodule are specifically configured to:
the similarity is calculated in any of four calculation modes: calculating the similarity between the target recognition result of the second voice and the target recognition result of the first voice; calculating the similarity between the target recognition result of the second voice and each candidate recognition result in the candidate recognition result set of the first voice; calculating the similarity of each candidate recognition result in the candidate recognition result set of the second voice and each candidate recognition result in the candidate recognition result set of the first voice; calculating the similarity between the target recognition result of the first voice and each candidate recognition result in the candidate recognition result set of the second voice;
and determine, according to the similarity obtained by any one of the four calculation modes, the probability that the content expressed by the first voice and the second voice is the same content.
In one possible implementation, when calculating the similarity of two recognition results, the probability determination sub-module and the second probability determination submodule are specifically configured to, for each of the two recognition results:
performing word segmentation on the recognition result, and removing non-keywords from words obtained by the word segmentation to obtain keywords in the recognition result; determining word vectors of the keywords, and determining weights of the keywords according to parts of speech of the keywords and the occurrence of the keywords in a pre-constructed structured database, wherein the structured database is constructed according to an application scene, the structured database comprises a plurality of data records, and each data record comprises at least one keyword; determining sentence vectors of the recognition results according to the word vectors and the weights of the keywords; and determining the similarity of the two recognition results according to the sentence vectors of the two recognition results.
In a possible implementation, when judging whether the target recognition result of the second voice is an erroneous recognition result according to the first probability and the second probability, the recognition result discrimination sub-module is specifically configured to: determine the confidence of the target recognition result of the second voice according to the first probability and the second probability; and, if that confidence is smaller than a preset confidence threshold, judge that the target recognition result of the second voice is an erroneous recognition result.
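The embodiments determine a confidence from the first probability and the second probability and compare it with a threshold, but do not spell out the combination rule. The weighted form below is therefore purely an assumed illustration, not the patented computation.

```python
# An assumed confidence rule: higher error probabilities -> lower confidence.
def confidence(p1, p2, alpha=0.5):
    return 1.0 - (alpha * p1 + (1.0 - alpha) * p2)

def is_erroneous(p1, p2, threshold=0.5):
    """Judge the target recognition result erroneous below the threshold."""
    return confidence(p1, p2) < threshold
```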
Referring to fig. 6, which shows a schematic structural diagram of a speech recognition result correction apparatus 60, the apparatus 60 may include the speech recognition result discriminating apparatus 50 provided in the foregoing embodiment, and further include a recognition result correction module 601.
The voice recognition result discriminating means 50 is used for discriminating whether the target recognition result of the second voice is a wrong recognition result.
The recognition result correction module 601 is configured to determine, when the target recognition result of the second voice is an erroneous recognition result, the correct recognition result of the second voice from the candidate recognition result set of the second voice according to that candidate set and a pre-constructed structured database.
According to the speech recognition result correction apparatus provided by this embodiment, whether the target recognition result of the second voice is an erroneous recognition result can be judged from the user behavior characterization information corresponding to the second voice; when it is judged to be erroneous, the correct recognition result can be determined from the candidate recognition results of the second voice, so that semantic understanding can subsequently be performed on the correct recognition result and the terminal can give a correct response, providing a good user experience.
In one possible implementation, the recognition result correction module 601 in the speech recognition result correction apparatus provided in the foregoing embodiment may include: a keyword extraction sub-module, a retrieval sub-module, and a correct recognition result determination sub-module.

The keyword extraction sub-module is configured to extract keywords from each candidate recognition result in the candidate recognition result set of the second voice.

The retrieval sub-module is configured to retrieve, from the structured database, data records containing the keywords extracted by the keyword extraction sub-module.

The correct recognition result determination sub-module is configured to determine a candidate recognition result as the correct recognition result of the second voice when the retrieval sub-module retrieves a data record containing the keywords extracted from that candidate.
An embodiment of the present application further provides a speech recognition result discriminating device. Referring to fig. 7, which shows a schematic structural diagram of the device, the device may include: at least one processor 701, at least one communication interface 702, at least one memory 703, and at least one communication bus 704;
in this embodiment of the application, the number of each of the processor 701, the communication interface 702, the memory 703, and the communication bus 704 is at least one, and the processor 701, the communication interface 702, and the memory 703 communicate with one another through the communication bus 704;

the processor 701 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), one or more integrated circuits configured to implement embodiments of the present invention, or the like;

the memory 703 may include a high-speed RAM memory and may also include a non-volatile memory, for example at least one disk memory;
wherein the memory stores a program, the processor is operable to invoke the program stored in the memory, the program operable to:
acquiring user behavior characterization information corresponding to a second voice, wherein the user behavior characterization information can reflect the correlation between the second voice and content expressed by a first voice, and the first voice is the voice input before the second voice;
judging whether the target recognition result of the second voice is an error recognition result or not according to the user behavior characterization information corresponding to the second voice.
Optionally, for the refined and extended functions of the program, reference may be made to the foregoing description.
The embodiment of the application also provides a readable storage medium, which can store a program suitable for being executed by a processor, the program being configured to:
Acquiring user behavior characterization information corresponding to a second voice, wherein the user behavior characterization information can reflect the correlation between the second voice and content expressed by a first voice, and the first voice is the voice input before the second voice;
judging whether the target recognition result of the second voice is an error recognition result or not according to the user behavior characterization information corresponding to the second voice.
The embodiment of the application also provides a voice recognition result correction device, referring to fig. 8, a schematic structural diagram of the voice recognition result correction device is shown, and the device may include: at least one processor 801, at least one communication interface 802, at least one memory 803, and at least one communication bus 804;
in this embodiment of the application, the number of each of the processor 801, the communication interface 802, the memory 803, and the communication bus 804 is at least one, and the processor 801, the communication interface 802, and the memory 803 communicate with one another through the communication bus 804;

the processor 801 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), one or more integrated circuits configured to implement embodiments of the present invention, or the like;

the memory 803 may include a high-speed RAM memory and may also include a non-volatile memory, for example at least one disk memory;
wherein the memory stores a program, the processor is operable to invoke the program stored in the memory, the program operable to:
by adopting the method for judging the voice recognition result provided by the embodiment, whether the target recognition result of the second voice is an error recognition result is judged;
if the target recognition result of the second voice is an error recognition result, determining a correct recognition result of the second voice from the candidate recognition result set of the second voice according to the candidate recognition result set of the second voice and a pre-constructed structured database.
Optionally, for the refined and extended functions of the program, reference may be made to the foregoing description.
The embodiment of the application also provides a readable storage medium, which can store a program suitable for being executed by a processor, the program being configured to:
by adopting the method for judging the voice recognition result provided by the embodiment, whether the target recognition result of the second voice is an error recognition result is judged;
if the target recognition result of the second voice is an error recognition result, determining a correct recognition result of the second voice from the candidate recognition result set of the second voice according to the candidate recognition result set of the second voice and a pre-constructed structured database.
Finally, it should also be noted that relational terms such as first and second are used solely to distinguish one entity or action from another, and do not necessarily require or imply any actual relationship or order between such entities or actions. Moreover, the terms "comprises", "comprising", and any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.
In this specification, the embodiments are described in a progressive manner, with each embodiment focusing on its differences from the others; for identical or similar parts, reference may be made between the embodiments.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (16)

1. A method for discriminating a speech recognition result, comprising:
acquiring user behavior characterization information corresponding to a second voice, wherein the user behavior characterization information can reflect the correlation between the second voice and the content expressed by a first voice, the user behavior characterization information comprises recognition information corresponding to the first voice and the second voice respectively, the input time interval between the first voice and the second voice, and user behavior information from when the first voice is input to when the second voice is input, the recognition information corresponding to the first voice and the second voice respectively can characterize whether the user repeatedly expresses the same content, and the first voice is the voice input previous to the second voice;
judging whether the target recognition result of the first voice is an error recognition result or not according to the user behavior characterization information corresponding to the second voice;
the step of judging whether the target recognition result of the first voice is an error recognition result according to the user behavior characterization information corresponding to the second voice comprises the following steps:
if the input time interval is smaller than a preset time threshold, determining the probability that the target recognition result representing the first voice is an error recognition result according to the user behavior information as a first probability;
Determining the probability that the target recognition result representing the first voice is an error recognition result according to the recognition information respectively corresponding to the first voice and the second voice, and taking the probability as a second probability;
and judging whether the target recognition result of the first voice is an error recognition result according to the first probability and the second probability.
2. The method for determining a speech recognition result according to claim 1, wherein determining whether the target recognition result of the first speech is a false recognition result according to the user behavior characterization information corresponding to the second speech includes:
determining the probability that the content expressed by the first voice and the second voice is the same content according to the identification information respectively corresponding to the first voice and the second voice;
judging whether the target recognition result of the first voice is an error recognition result according to the probability that the content expressed by the first voice and the second voice is the same content.
3. The method according to claim 1, wherein determining, based on the user behavior information, a probability that a target recognition result characterizing the first voice is a false recognition result includes:
Determining the probability of representing the target recognition result of the first voice as the error recognition result according to the probability of representing the target recognition result corresponding to the first voice as the error recognition result and the user behavior information;
wherein the probability that the target recognition result corresponding to the first voice is an erroneous recognition result is determined according to the user behavior information from when the voice previous to the first voice is input to when the first voice is input.
4. The method for determining a recognition result of a voice according to claim 3, wherein determining the probability that the target recognition result representing the first voice is the wrong recognition result according to the probability that the target recognition result representing the first voice is the wrong recognition result and the user behavior information comprises:
determining a behavior category corresponding to the user behavior information according to a preset behavior category;
determining the score of the behavior category corresponding to the user behavior information according to the preset corresponding relation between the behavior category and the score, and taking the score as a target score;
and determining the probability of representing the target recognition result of the first voice as the error recognition result according to the probability of representing the recognition result corresponding to the first voice as the error recognition result and the target score.
5. The method according to claim 4, wherein determining the probability that the target recognition result representing the first voice is the erroneous recognition result according to the recognition information corresponding to the first voice and the second voice, respectively, comprises: and determining the probability that the content expressed by the first voice and the second voice is the same content according to the identification information respectively corresponding to the first voice and the second voice, and taking the probability as the probability that the target identification result representing the first voice is the wrong identification result.
6. The speech recognition result discrimination method according to claim 2 or 5, wherein the recognition information includes a candidate recognition result set of the corresponding speech and/or a target recognition result determined from the candidate recognition result set;
the determining, according to the identification information respectively corresponding to the first voice and the second voice, the probability that the content expressed by the first voice and the second voice is the same content includes:
the similarity is calculated in any of four calculation modes: calculating the similarity between the target recognition result of the second voice and the target recognition result of the first voice; calculating the similarity between the target recognition result of the second voice and each candidate recognition result in the candidate recognition result set of the first voice; calculating the similarity of each candidate recognition result in the candidate recognition result set of the second voice and each candidate recognition result in the candidate recognition result set of the first voice; calculating the similarity between the target recognition result of the first voice and each candidate recognition result in the candidate recognition result set of the second voice;
And determining the probability that the content expressed by the first voice and the second voice is the same content according to the similarity obtained by any one of the four calculation modes.
7. The method according to claim 6, wherein the process of calculating the similarity of two recognition results includes:
for each of the two recognition results:
performing word segmentation on the recognition result, and removing non-keywords from words obtained by the word segmentation to obtain keywords in the recognition result;
determining word vectors of the keywords, and determining weights of the keywords according to parts of speech of the keywords and the occurrence of the keywords in a pre-constructed structured database, wherein the structured database is constructed according to an application scene, the structured database comprises a plurality of data records, and each data record comprises at least one keyword;
determining sentence vectors of the recognition results according to the word vectors and the weights of the keywords;
and determining the similarity of the two recognition results according to the sentence vectors of the two recognition results.
8. The method according to claim 1, wherein the step of determining whether the target recognition result of the first voice is a false recognition result based on the first probability and the second probability comprises:
Determining the confidence level of the target recognition result of the second voice according to the first probability and the second probability;
if the confidence coefficient of the target recognition result of the second voice is smaller than a preset confidence coefficient threshold value, judging that the target recognition result of the first voice is an error recognition result.
9. A method for correcting a speech recognition result, comprising:
judging whether the target recognition result of the first voice is an erroneous recognition result by adopting the method for judging a voice recognition result according to any one of claims 1 to 8;
if the target recognition result of the first voice is an error recognition result, determining the correct recognition result of the second voice from the candidate recognition result set of the second voice according to the candidate recognition result set of the second voice and a pre-constructed structured database.
10. The method according to claim 9, wherein the determining the correct recognition result of the second voice from the candidate recognition result set of the second voice based on the candidate recognition result set of the second voice and a pre-constructed structured database includes:
For each candidate recognition result in the set of candidate recognition results for the second speech:
extracting keywords from the candidate recognition results;
retrieving a data record containing the keyword in the structured database;
and if the data record containing the key words is retrieved, determining the candidate recognition result as a correct recognition result of the second voice.
11. A speech recognition result discriminating apparatus, comprising: the information acquisition module and the recognition result judging module;
the information acquisition module is configured to acquire user behavior characterization information corresponding to a second voice, wherein the user behavior characterization information can reflect the correlation between the second voice and the content expressed by a first voice, the user behavior characterization information comprises recognition information corresponding to the first voice and the second voice respectively, the input time interval between the first voice and the second voice, and user behavior information from when the first voice is input to when the second voice is input, the recognition information corresponding to the first voice and the second voice respectively can characterize whether the user repeatedly expresses the same content, and the first voice is the voice input previous to the second voice;
The recognition result judging module is used for judging whether the target recognition result of the first voice is an error recognition result according to the user behavior characterization information corresponding to the second voice;
the step of judging whether the target recognition result of the first voice is an error recognition result according to the user behavior characterization information corresponding to the second voice comprises the following steps:
if the input time interval is smaller than a preset time threshold, determining the probability that the target recognition result representing the first voice is an error recognition result according to the user behavior information as a first probability;
determining the probability that the target recognition result representing the first voice is an error recognition result according to the recognition information respectively corresponding to the first voice and the second voice, and taking the probability as a second probability;
and judging whether the target recognition result of the first voice is an error recognition result according to the first probability and the second probability.
12. A speech recognition result correcting apparatus comprising the speech recognition result discriminating apparatus according to claim 11 and a recognition result correcting module;
the voice recognition result judging device is used for judging whether the target recognition result of the first voice is an error recognition result or not;
The recognition result correction module is used for determining the correct recognition result of the second voice from the candidate recognition result set of the second voice according to the candidate recognition result set of the second voice and a pre-constructed structured database when the target recognition result of the first voice is an error recognition result.
13. A speech recognition result discriminating apparatus, characterized by comprising: a memory and a processor;
the memory is used for storing programs;
the processor is configured to execute the program to implement the steps of the method for discriminating a speech recognition result according to any one of claims 1 to 8.
14. A readable storage medium having stored thereon a computer program, wherein the computer program, when executed by a processor, implements the steps of the speech recognition result discrimination method according to any one of claims 1 to 8.
15. A speech recognition result correction apparatus, characterized by comprising: a memory and a processor;
the memory is used for storing programs;
the processor is configured to execute the program to implement the steps of the method for correcting a speech recognition result according to any one of claims 9 to 10.
16. A readable storage medium having stored thereon a computer program, which, when executed by a processor, implements the steps of the speech recognition result correction method according to any one of claims 9 to 10.
CN202010170991.3A 2020-03-12 2020-03-12 Speech recognition result discriminating method, correcting method, device, equipment and storage medium Active CN111326140B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010170991.3A CN111326140B (en) 2020-03-12 2020-03-12 Speech recognition result discriminating method, correcting method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010170991.3A CN111326140B (en) 2020-03-12 2020-03-12 Speech recognition result discriminating method, correcting method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN111326140A CN111326140A (en) 2020-06-23
CN111326140B true CN111326140B (en) 2023-05-30

Family

ID=71171633

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010170991.3A Active CN111326140B (en) 2020-03-12 2020-03-12 Speech recognition result discriminating method, correcting method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111326140B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112435512B (en) * 2020-11-12 2023-01-24 郑州大学 Voice behavior assessment and evaluation method for rail transit simulation training
CN113223500B (en) * 2021-04-12 2022-02-25 北京百度网讯科技有限公司 Speech recognition method, method for training speech recognition model and corresponding device
CN113378530A (en) * 2021-06-28 2021-09-10 北京七维视觉传媒科技有限公司 Voice editing method and device, equipment and medium
CN115798465B (en) * 2023-02-07 2023-04-07 天创光电工程有限公司 Voice input method, system and readable storage medium
CN116662764B (en) * 2023-07-28 2023-09-29 中国电子科技集团公司第十五研究所 Data identification method for error identification correction, model training method, device and equipment

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107678561A (en) * 2017-09-29 2018-02-09 百度在线网络技术(北京)有限公司 Phonetic entry error correction method and device based on artificial intelligence

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4870686A (en) * 1987-10-19 1989-09-26 Motorola, Inc. Method for entering digit sequences by voice command
CN105810188B (en) * 2014-12-30 2020-02-21 联想(北京)有限公司 Information processing method and electronic equipment
JP6280074B2 (en) * 2015-03-25 2018-02-14 日本電信電話株式会社 Rephrase detection device, speech recognition system, rephrase detection method, program
JP6804909B2 (en) * 2016-09-15 2020-12-23 東芝テック株式会社 Speech recognition device, speech recognition method and speech recognition program
CN106486126B (en) * 2016-12-19 2019-11-19 北京云知声信息技术有限公司 Speech recognition error correction method and device
US11521608B2 (en) * 2017-05-24 2022-12-06 Rovi Guides, Inc. Methods and systems for correcting, based on speech, input generated using automatic speech recognition
CN110520925B (en) * 2017-06-06 2020-12-15 谷歌有限责任公司 End of query detection


Also Published As

Publication number Publication date
CN111326140A (en) 2020-06-23

Similar Documents

Publication Publication Date Title
CN111326140B (en) Speech recognition result discriminating method, correcting method, device, equipment and storage medium
KR102315732B1 (en) Speech recognition method, device, apparatus, and storage medium
US11194448B2 (en) Apparatus for vision and language-assisted smartphone task automation and method thereof
CN109922371B (en) Natural language processing method, apparatus and storage medium
CN108334490B (en) Keyword extraction method and keyword extraction device
CN107423440B (en) Question-answer context switching and reinforced selection method based on emotion analysis
CN109634436B (en) Method, device, equipment and readable storage medium for associating input method
CN110674396B (en) Text information processing method and device, electronic equipment and readable storage medium
CN108027814B (en) Stop word recognition method and device
CN110134777B (en) Question duplication eliminating method and device, electronic equipment and computer readable storage medium
CN108227564B (en) Information processing method, terminal and computer readable medium
CN107092602B (en) Automatic response method and system
CN110347866B (en) Information processing method, information processing device, storage medium and electronic equipment
CN111400513A (en) Data processing method, data processing device, computer equipment and storage medium
CN112700768B (en) Speech recognition method, electronic equipment and storage device
CN109492085B (en) Answer determination method, device, terminal and storage medium based on data processing
CN116628142B (en) Knowledge retrieval method, device, equipment and readable storage medium
CN111858966B (en) Knowledge graph updating method and device, terminal equipment and readable storage medium
CN112581297A (en) Information pushing method and device based on artificial intelligence and computer equipment
CN112417095A (en) Voice message processing method and device
CN116644228A (en) Multi-mode full text information retrieval method, system and storage medium
CN115759048A (en) Script text processing method and device
CN113220824B (en) Data retrieval method, device, equipment and storage medium
CN115063858A (en) Video facial expression recognition model training method, device, equipment and storage medium
CN111382265A (en) Search method, apparatus, device and medium

Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant