CN112669833A - Voice interaction error correction method and device

Info

Publication number: CN112669833A
Application number: CN201910940847.0A
Authority: CN (China)
Legal status: Pending
Prior art keywords: voice, voice instruction, intonation, label, information
Original language: Chinese (zh)
Inventor: 杜国威
Applicant and current assignee: Beijing Anyun Century Technology Co Ltd

Abstract

The invention discloses a voice interaction error correction method and device, relates to the technical field of natural language processing, and improves the accuracy of voice instruction recognition. The main technical scheme of the invention is as follows: when a first voice instruction issued by a user is received, semantic information and intonation information contained in the first voice instruction are parsed; a second voice instruction is received, and semantic information and intonation information contained in the second voice instruction are parsed, the second voice instruction being the voice instruction adjacent to the first voice instruction; whether to execute a correction operation on the first voice instruction is judged by comparing the intonation information contained in the second voice instruction with the intonation information contained in the first voice instruction; if so, the semantic information contained in the first voice instruction is corrected according to the semantic information contained in the second voice instruction. The invention is mainly applied to automatically correcting adjacent received voice instructions in the course of processing input voice instructions.

Description

Voice interaction error correction method and device
Technical Field
The invention relates to the technical field of natural language processing, in particular to a voice interaction error correction method and device.
Background
With the innovation and development of science and technology, the long-standing goal of people conversing with machines in natural language has become a reality, and intelligent products built on natural language processing technology are increasingly popular. For example, a smart speaker can not only execute control instructions issued by a user but also chat with the user in a dialogue, so such intelligent services are more and more favored by users.
However, after a smart speaker currently sold on the market is woken up by voice, there is no way for it to automatically correct adjacent received voice instructions in the course of processing the input voice instructions. For example, when the smart speaker receives a voice instruction such as "help me set the 12:00 alarm for tomorrow", the machine automatically sets the alarm to 12:00 even when that instruction was wrong or immediately followed by a correction (for instance, the alarm should not have been set, or should have been set to a different time), so the operation does not accord with the real intention with which the user issued the voice instruction. The user therefore has to check again whether the machine's operation was correct and, if it was not, issue the voice instruction again; this repeated operation reduces the user experience.
Disclosure of Invention
In view of this, the present invention provides a method and an apparatus for error correction of voice interaction, which mainly aim to automatically correct errors of received adjacent voice commands in the process of processing input voice commands, ensure that output control operations are in accordance with the real intentions of users, improve accuracy of voice command recognition, and improve user operation experience.
In order to achieve the above purpose, the present invention mainly provides the following technical solutions:
in a first aspect, the present invention provides a method for correcting errors in voice interaction, the method comprising:
when a first voice instruction sent by a user is received, analyzing semantic information and intonation information contained in the first voice instruction;
receiving a second voice instruction, and analyzing semantic information and intonation information contained in the second voice instruction, wherein the second voice instruction is a voice instruction which is adjacent to the first voice instruction;
judging whether to execute a correction operation on the first voice instruction or not by comparing the intonation information contained in the second voice instruction with the intonation information contained in the first voice instruction;
if yes, correcting the semantic information contained in the first voice command according to the semantic information contained in the second voice command.
Optionally, before the receiving the first voice instruction sent by the user, the method further includes:
acquiring a plurality of historical voice instructions corresponding to the user;
analyzing semantic information and intonation information contained in each historical voice instruction;
randomly extracting two adjacent voice instructions from the plurality of historical voice instructions;
judging whether logic association exists between the two adjacent voice instructions or not according to semantic information respectively corresponding to the two adjacent voice instructions;
if yes, creating a label according to the logic association to obtain a mapping relation between the label and the logic association;
calculating difference information between intonation information corresponding to the two adjacent voice instructions respectively, wherein the difference information is intonation change information measured in four dimensions of voice height, voice speed, voice length and voice weight;
and labeling the difference information by using the label to obtain the intonation change information corresponding to the label.
Optionally, after obtaining the intonation change information corresponding to the tag, the method further includes:
obtaining intonation change information corresponding to each label;
comparing the similarity between the intonation change information corresponding to the two labels by randomly extracting the two labels;
and if the similarity reaches a first preset threshold value, integrating the two labels to obtain an upper-level label, wherein the upper-level label corresponds to two groups of intonation change information.
Optionally, after obtaining the intonation change information corresponding to the tag, the method further includes:
analyzing the word meaning of each label;
matching the label with a label recorded on a preset label template by comparing the similarity of words, wherein the preset label template is used for standardizing the label;
if the matching is successful, replacing the label with the label recorded on the preset label template;
if a plurality of same labels exist after matching operation, performing deduplication processing on the same labels and reserving one label, wherein the label corresponds to a plurality of groups of intonation change information.
Optionally, the determining whether to perform a correction operation on the first voice instruction by comparing the intonation information included in the second voice instruction with the intonation information included in the first voice instruction includes:
respectively calculating difference information between the first voice instruction and the second voice instruction in four dimensions of voice height, voice speed, voice length and voice weight;
calculating whether the similarity between the difference information and the intonation change information corresponding to the label reaches a second preset threshold value or not by comparing the difference information with the intonation change information corresponding to the label;
if yes, determining the logic association existing between the first voice command and the second voice command according to the label by searching the mapping relation between the label and the logic association;
when the logic association between the first voice command and the second voice command is determined to be a correction relationship, determining to execute a correction operation on the first voice command.
Optionally, after the receiving the second voice instruction, the method further includes:
identifying whether the second voice instruction is a word combination containing a negative word;
if yes, correcting semantic information contained in the first voice instruction according to semantic information contained in the second voice instruction;
if not, judging whether to execute a correction operation on the first voice instruction or not by comparing the intonation information contained in the second voice instruction with the intonation information contained in the first voice instruction.
Optionally, the method further includes:
recording the system time for receiving the second voice command;
detecting whether the system time is within a preset correction time limit corresponding to the first voice instruction;
if the system time is within the preset correction time limit corresponding to the first voice instruction, judging whether to execute a correction operation on the first voice instruction or not by comparing the intonation information contained in the second voice instruction with the intonation information contained in the first voice instruction;
and if the first voice instruction is not within the preset correction time limit corresponding to the first voice instruction, respectively executing control operation according to the first voice instruction and the second voice instruction according to the sequence of the received voice instructions.
Optionally, before comparing the intonation information included in the second voice instruction with the intonation information included in the first voice instruction, the method further includes:
verifying whether the semantic information respectively contained in the second voice instruction and the first voice instruction has correlation or not;
and if the correlation exists, comparing the intonation information contained in the second voice instruction with the intonation information contained in the first voice instruction.
Optionally, when it is determined that a corrective operation is performed on the first voice instruction, the method further includes:
outputting prompt information to a user, wherein the prompt information is used for inquiring the user to confirm whether to execute the operation corresponding to the first voice instruction;
and if the indication information fed back by the user is not received within the preset time, controlling to execute correction operation on the first voice instruction.
Optionally, the correcting the semantic information included in the first voice instruction according to the semantic information included in the second voice instruction includes:
ignoring the first voice instruction;
and controlling to execute the operation corresponding to the second voice instruction.
In a second aspect, the present invention further provides an error correction apparatus for voice interaction, including:
the analysis unit is used for analyzing semantic information and intonation information contained in a first voice instruction when the first voice instruction sent by a user is received;
the receiving unit is used for receiving a second voice instruction;
the analysis unit is further configured to analyze semantic information and intonation information included in the second voice instruction, where the second voice instruction is a voice instruction that is adjacent to the first voice instruction;
the judging unit is used for judging whether to execute correction operation on the first voice instruction or not by comparing the intonation information contained in the second voice instruction analyzed and obtained by the analyzing unit with the intonation information contained in the first voice instruction analyzed and obtained by the analyzing unit;
and the correcting unit is used for correcting the semantic information contained in the first voice command according to the semantic information contained in the second voice command when the judging unit judges that the correcting operation is executed on the first voice command.
Optionally, the apparatus further comprises:
the acquisition unit is used for acquiring a plurality of historical voice instructions corresponding to a user before receiving a first voice instruction sent by the user;
the analysis unit is further used for analyzing semantic information and intonation information contained in each historical voice instruction;
the extraction unit is used for extracting two adjacent voice instructions from the plurality of historical voice instructions acquired by the acquisition unit;
the judging unit is used for judging whether the two adjacent voice instructions have logic association or not according to the semantic information respectively corresponding to the two adjacent voice instructions extracted by the extracting unit;
the creating unit is used for creating a label according to the logic association when the judging unit judges that the two adjacent voice instructions have the logic association, so as to obtain the mapping relation between the label and the logic association;
the calculation unit is used for calculating difference information between the intonation information respectively corresponding to the two adjacent voice instructions extracted by the extraction unit, wherein the difference information is intonation change information measured in four dimensions of voice height, voice speed, voice length and voice weight;
and the marking unit is used for marking the difference information by using the label created by the creating unit to obtain the intonation change information corresponding to the label.
Optionally, the apparatus further comprises:
the obtaining unit is further configured to obtain the intonation change information corresponding to each tag after the intonation change information corresponding to the tag is obtained;
the comparison unit is used for comparing the similarity between the intonation change information corresponding to the two labels by randomly extracting the two labels;
and the integration unit is used for integrating two labels to obtain an upper label if the similarity compared by the comparison unit reaches a first preset threshold, and the upper label corresponds to two groups of intonation change information.
Optionally, the apparatus further comprises:
the analyzing unit is further configured to analyze the word meaning of each tag after the intonation change information corresponding to the tag is obtained;
the matching unit is used for matching the label with a label recorded on a preset label template by comparing the similarity of words, and the preset label template is used for standardizing the label;
a replacing unit, configured to replace the tag with the tag recorded in the preset tag template if the matching unit succeeds in matching;
and the processing unit is used for performing de-duplication processing on a plurality of identical labels and reserving one label if identical labels exist after the matching operation, wherein the label corresponds to a plurality of groups of intonation change information.
Optionally, the determining unit includes:
the calculation module is used for respectively calculating difference information between the first voice instruction and the second voice instruction in four dimensions of voice height, voice speed, voice length and voice weight;
the calculation module is further configured to compare the difference information with the intonation change information corresponding to the tag, and calculate whether a similarity between the difference information and the intonation change information corresponding to the tag reaches a second preset threshold;
a determining module, configured to determine, when the similarity between the difference information and the intonation change information corresponding to the tag calculated by the calculating module reaches a second preset threshold, a logical association existing between the first voice instruction and the second voice instruction according to the tag by searching for a mapping relationship between the tag and the logical association;
a determining module, configured to determine to perform a correction operation on the first voice instruction when the determining module determines that the logical association between the first voice instruction and the second voice instruction is a correction relationship.
Optionally, the apparatus further comprises:
a recognition unit configured to recognize, after the receiving of the second voice instruction, whether the second voice instruction is a word combination including a negative word;
the correcting unit is further used for correcting the semantic information contained in the first voice instruction according to the semantic information contained in the second voice instruction when the recognizing unit recognizes that the second voice instruction is a word combination containing a negative word;
the judging unit is further configured to judge whether to perform a correction operation on the first voice instruction by comparing intonation information included in the second voice instruction with intonation information included in the first voice instruction when the recognizing unit recognizes that the second voice instruction is not a word combination including a negative word.
Optionally, the apparatus further comprises:
the recording unit is used for recording the system time for receiving the second voice command;
the detection unit is used for detecting whether the system time is within a preset correction time limit corresponding to the first voice instruction;
the judging unit is further configured to, when the detecting unit detects that the system time is within a preset correction time limit corresponding to the first voice instruction, judge whether to perform a correction operation on the first voice instruction by comparing intonation information included in the second voice instruction with intonation information included in the first voice instruction;
and the execution unit is used for executing control operation according to the first voice instruction and the second voice instruction respectively according to the sequence of the received voice instructions when the detection unit detects that the system time is not within the preset correction time limit corresponding to the first voice instruction.
Optionally, the apparatus further comprises:
a verification unit, configured to verify whether there is a correlation between semantic information included in the second voice instruction and semantic information included in the first voice instruction respectively before comparing the intonation information included in the second voice instruction with the intonation information included in the first voice instruction;
and the comparison unit is used for comparing the intonation information contained in the second voice instruction with the intonation information contained in the first voice instruction when the verification unit verifies that the semantic information contained in the second voice instruction and the semantic information contained in the first voice instruction are relevant.
Optionally, the apparatus further comprises:
the prompting unit is used for outputting prompting information to a user when the first voice instruction is determined to be corrected, and the prompting information is used for inquiring the user to confirm whether the operation corresponding to the first voice instruction is executed or not;
and the control unit is used for controlling execution of the correction operation on the first voice instruction if the indication information fed back by the user is not received within the preset time.
Optionally, the correcting unit includes:
the ignoring module is used for ignoring the first voice instruction;
and the control module is used for controlling and executing the operation corresponding to the second voice instruction.
In a third aspect, the present invention provides a storage medium storing a plurality of instructions adapted to be loaded by a processor and to perform the method for error correction of voice interaction according to the first aspect.
In a fourth aspect, the present invention provides an electronic device comprising a storage medium and a processor;
the processor is suitable for realizing instructions;
the storage medium adapted to store a plurality of instructions;
the instructions are adapted to be loaded by the processor and to perform the method of error correction for voice interaction as described in the first aspect.
By the technical scheme, the technical scheme provided by the invention at least has the following advantages:
in the invention, for two adjacent voice instructions received successively, tone change information is obtained by comparing tone information respectively contained in the two voice instructions, so that whether to execute automatic error correction of the voice instructions is pre-judged according to the tone change information, and the output control operation is ensured to be in accordance with the real intention of a user. Compared with the prior art, the method and the device solve the problems that the output control operation is not consistent with the real intention of the user, the accuracy is low due to the fact that repeated checking and correction are needed, and the user experience is poor.
The foregoing description is only an overview of the technical solutions of the present invention, and the embodiments of the present invention are described below in order to make the technical means of the present invention more clearly understood and to make the above and other objects, features, and advantages of the present invention more clearly understandable.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:
fig. 1 is a flowchart of an error correction method for voice interaction according to an embodiment of the present invention;
FIG. 2 is a flowchart of another error correction method for voice interaction according to an embodiment of the present invention;
fig. 3 is a block diagram illustrating an error correction apparatus for voice interaction according to an embodiment of the present invention;
fig. 4 is a block diagram of another error correction apparatus for voice interaction according to an embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present invention will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the invention are shown in the drawings, it should be understood that the invention can be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the invention to those skilled in the art.
The embodiment of the invention provides a voice interaction error correction method, as shown in fig. 1, the method obtains intonation change information by comparing intonation information respectively contained in two adjacent voice instructions received successively, so as to pre-judge whether to execute automatic error correction of the voice instructions according to the intonation change information, and the embodiment of the invention provides the following specific steps:
101. when a first voice instruction sent by a user is received, semantic information and intonation information contained in the first voice instruction are analyzed.
In the embodiment of the present invention, two voice instructions received successively are processed to determine whether automatic correction of a voice instruction needs to be performed; in these two adjacent voice instructions, the former is the "first voice instruction" and the latter is the "second voice instruction". For the embodiment of the present invention, the two adjacent voice instructions may be received back to back or may be separated by an interval.
After the voice instruction is converted into text data using Automatic Speech Recognition (ASR) technology, the semantic information is obtained by analyzing the lexical meaning contained in the text data; in other words, obtaining the semantic information is equivalent to recognizing the operation intention of the user.
The intonation information refers to the intonation of the speech, namely the arrangement and change of pitch within a sentence. English, for example, has five basic tones: rising (↗), falling (↙), rise-fall (Λ), fall-rise (v) and level (→). The factors forming intonation are complex; in the embodiment of the invention they are limited to four main factors (the height, speed, length and weight of the voice). Intonation is also a decisive factor in the expression of mood. The embodiment of the invention does not limit the specific implementation method for measuring the height, speed, length and weight of the voice.
In the embodiment of the invention, a voice instruction contains semantic information and intonation information; the intonation information reflects the attitude and mood that the speaker expresses through intonation, and considering the semantic information and the intonation information together makes it possible to accurately identify the real intention of the speaker.
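For illustration only, the parsed result of a single voice instruction can be pictured as a record holding the semantic information and the four-dimensional intonation information described above. The following Python sketch shows one assumed data layout; the field names and units are assumptions and are not part of the patent.

```python
from dataclasses import dataclass

@dataclass
class IntonationFeatures:
    """Intonation information measured in the four dimensions named above (units are assumptions)."""
    pitch: float    # voice height, e.g. mean fundamental frequency in Hz
    speed: float    # voice speed, e.g. syllables per second
    length: float   # voice length, e.g. utterance duration in seconds
    stress: float   # voice weight, e.g. mean loudness or energy

@dataclass
class ParsedInstruction:
    """A voice instruction after ASR and parsing: text, operation intent, and intonation."""
    text: str                       # text data produced by ASR
    semantics: dict                 # semantic information, e.g. {"action": "set_alarm", "time": "12:00"}
    intonation: IntonationFeatures  # intonation information of the utterance
```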
In the embodiment of the present invention, a user can hold a voice conversation with a smart speaker. During the conversation, the user can initiate a plurality of voice instructions, and the smart speaker continuously receives the voice instructions and controls the execution of a plurality of operations, such as playing music, switching playback, adjusting sound effects, connecting to wifi, downloading, broadcasting the weather and other functional operations.
102. And receiving a second voice instruction, and analyzing semantic information and intonation information contained in the second voice instruction, wherein the second voice instruction is a voice instruction adjacent to the first voice instruction.
In the embodiment of the present invention, two consecutive voice commands received consecutively are processed to determine whether there is a need to perform automatic correction of the voice command, so that the former command is the "first voice command" and the latter command is the "second voice command" in the two consecutive voice commands.
103. And judging whether to execute correction operation on the first voice instruction or not by comparing the tone information contained in the second voice instruction with the tone information contained in the first voice instruction.
In the embodiment of the present invention, after the smart speaker is successfully awakened, it can pick up the speaker's voice, that is, receive the voice instruction, convert the voice instruction into text data using ASR technology, and display the text on the screen.
For example, two adjacent voice instructions are received successively and converted into the text data 'Xiaobao, play a song by Zhou Jielun' and 'play a song by Wang Fei', respectively.
The two voice instructions express two different intentions, and comparing the intonation information contained in each of them means identifying the rise and fall of the intonation in terms of the height, speed, length and weight of the voice. In most user dialogues, when a user realizes that a voice instruction was wrong, the user quickly follows the wrong voice instruction with a new one; although the user is still thinking the instruction over, he or she consciously raises the pitch, speaks faster or stresses the words when speaking the new voice instruction, so the intonation information contained in the wrong voice instruction differs from that contained in the new voice instruction.
Therefore, in the embodiment of the present invention, intonation change information is obtained by comparing the intonation information contained in the second voice instruction with that contained in the first voice instruction, so as to detect whether there is a large fluctuation in intonation. For the operation intentions corresponding to the two voice instructions, it can then be predicted from this fluctuation whether one of the two operation intentions has to be selected in time, that is, whether the operation intention corresponding to the first voice instruction should be corrected.
For example, for the two voice instructions listed above, both are song playing operations, but the smart speaker can only select one song to play at a time. If a large fluctuation in intonation is detected between the two voice instructions, it is predicted that the real intention of the user is 'play a song by Wang Fei' (i.e., the operation corresponding to the second voice instruction), rather than 'play a song by Zhou Jielun' (i.e., the operation corresponding to the first voice instruction).
In the embodiment of the invention, the requirement on intonation recognition is high, and an intonation model can be trained with historical voice instructions, so that when a voice instruction is received, the intonation information it contains can be recognized in real time in the four dimensions (height, speed, length and weight of the voice), and it can further be compared whether there is a large intonation fluctuation between two adjacent voice instructions.
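As a rough illustration of how the output of such an intonation model might be compared across two adjacent instructions, the sketch below computes a per-dimension relative change; the dictionary layout and the relative-change formula are assumptions used only for illustration, not the patent's method.

```python
DIMENSIONS = ("pitch", "speed", "length", "stress")  # height, speed, length and weight of the voice

def intonation_difference(first: dict, second: dict) -> dict:
    """Relative change of the second instruction's intonation against the first, per dimension.

    `first` and `second` map each dimension name to a numeric feature value; a large
    absolute value in the result indicates a large intonation fluctuation in that dimension."""
    return {d: (second[d] - first[d]) / first[d] if first[d] else 0.0 for d in DIMENSIONS}
```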
104. And if the correction operation is judged to be executed on the first voice command, correcting the semantic information contained in the first voice command according to the semantic information contained in the second voice command.
In the embodiment of the invention, by comparing the intonation information respectively contained in two adjacent voice instructions and determining that the intonation fluctuation between the two voice instructions reaches a certain degree, the need to perform a correction operation on the first voice instruction can be predicted, so that the operation intention of the first voice instruction is corrected according to the operation intention of the second voice instruction.
In the embodiment of the present invention, this correction operation makes it possible to predict whether the real intention of the user is the first voice instruction or the second voice instruction, thereby avoiding situations such as: executing only the first voice instruction and ignoring the second; or executing the first voice instruction first and then executing the second. The first case does not conform to the user's true intention; the second case amounts to performing several operations before the real intention of the user is met. In both cases the accuracy of recognizing the voice instruction is low, which reduces the user experience.
In the embodiment of the present invention, for two consecutive voice commands received successively, intonation change information is obtained by comparing intonation information respectively contained in the two voice commands, so as to pre-determine whether to execute automatic error correction of the voice command according to the intonation change information, thereby ensuring that output control operation is in accordance with the real intention of a user. Compared with the prior art, the method and the device solve the problems that the output control operation is not consistent with the real intention of the user, the accuracy is low due to the fact that repeated checking and correction are needed, and the user experience is poor.
In order to explain the above embodiment in more detail, another error correction method for voice interaction is provided in the embodiment of the present invention, as shown in fig. 2, which is a specific further refinement and complement to the error correction method for voice interaction provided in the above embodiment, and for this embodiment of the present invention, the following specific steps are provided:
201. acquiring a plurality of historical voice instructions corresponding to a user, and analyzing semantic information and intonation information contained in each historical voice instruction.
In the embodiment of the invention, the obtained historical voice instruction is equivalent to a large amount of sample data, and each historical voice instruction is analyzed to obtain semantic information and intonation information contained in each historical voice instruction, so that the intonation information, namely intonation habits, intonation changes and the like, which are used by a user in combination when expressing different semantics can be longitudinally analyzed by using the large amount of sample data.
202. And randomly extracting two adjacent voice instructions from the plurality of historical voice instructions, judging whether the two adjacent voice instructions have logic association or not according to semantic information respectively corresponding to the two adjacent voice instructions, and if so, creating a label according to the logic association to obtain a mapping relation between the label and the logic association.
For the embodiment of the invention, when a large number of historical voice instructions are analyzed, two adjacent voice instructions with a logical association are first searched for from the semantic aspect; then, from the intonation aspect, it is analyzed how the user's intonation fluctuates when such a logical association exists.
It should be noted that, in the embodiment of the present invention, the intonation change information is measured in four dimensions (the height, speed, length and weight of the voice); the specific implementation method for calculating the intonation change information in each dimension is not limited in the embodiment of the present invention.
For example: when the two adjacent voice instructions are 'help me set the 12:00 alarm for tomorrow' and 'play a song by Zhou Jielun', analyzing the semantic information contained in them shows that the user intentions represented by the two voice instructions are not logically related, and there is no need to further analyze the intonation change information between them, so the operations corresponding to the two voice instructions are directly executed in the order in which the voice instructions were received.
However, for another example: when the two adjacent voice instructions extracted are 'help me set the 12:00 alarm for tomorrow' and 'no, set the 11:30 alarm for tomorrow', analyzing the semantic information contained in them shows that the two voice instructions represent different user intentions but are logically associated. Combined with the historical actual operation corresponding to the historical voice instructions, the latter voice instruction is the real intention of the user, so the logical association between the two voice instructions is a correction relationship, namely: the user's goal is to correct the former voice instruction with the latter one. For the embodiment of the present invention, when it is determined that two adjacent voice instructions are logically associated, the intonation change information existing between the two voice instructions needs to be further analyzed.
It should be noted that, in the embodiment of the present invention, the logical association obtained by analyzing two adjacent historical voice instructions includes but is not limited to the correction relationship; it needs to be determined in combination with the specific semantic application scenario. For example, the logical association may also be an "or" relationship, indicating that the user hesitates between the operation intentions corresponding to the two voice instructions.
For example, when the two adjacent voice instructions extracted are 'help me set the 12:00 alarm for tomorrow' and 'is it the 12:00 alarm for tomorrow?', the latter voice instruction initiated by the user amounts to a question about the former voice instruction and is uncertain, so the real intention of the user is unclear at this time.
In the embodiment of the present invention, a label is created according to the logical association, and a mapping relationship between the label and the logical association is obtained. For example, for a logical association that is a "correction relationship", a label such as "negation-correction" is created, and for a logical association that is an "or relationship", a label such as "question-hesitation" is created. Specifically, the content of the created label can be customized according to usage habits.
203. And calculating difference information between the intonation information respectively corresponding to the two adjacent voice instructions, wherein the difference information is intonation change information measured in four dimensions of voice height, voice speed, voice length and voice weight, and labeling the difference information with the label to obtain the intonation change information corresponding to the label.
In the embodiment of the invention, after the two adjacent voice commands are judged to have the logical association, the intonation fluctuation change existing under the logical association is further analyzed, namely the difference information between the intonation information respectively corresponding to the two adjacent voice commands. Specifically, the intonation change information is measured in four dimensions of voice height, voice speed, voice length and voice weight. For example, a intonation model may be trained in advance in combination with Natural Language Processing (NLP), so as to recognize intonation information included in a voice instruction from four dimensions (high and low, fast and slow, long and short, and light and heavy) for a historical voice instruction, so as to further compare whether there is a large intonation fluctuation change between two adjacent voice instructions.
After the intonation change information corresponding to the logical association is obtained through calculation, the correspondence between the label and the intonation change information can be further obtained according to the mapping relationship between the label and the logical association. Its function is as follows: in the process of processing voice instructions received in real time, when the intonation change information of two adjacent voice instructions received in real time is calculated and analyzed, the correspondence between labels and intonation change information established in advance is consulted, so that the label corresponding to the intonation change information obtained in real time is known and the real intention of the user can be pre-judged.
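A possible way to organize steps 201-203 in code is sketched below. The helper `find_logical_association` (returning, for example, "correction", "or" or None) and the label names are assumptions used only to show the shape of the label-to-logical-association and label-to-intonation-change mappings.

```python
from collections import defaultdict

def build_label_mappings(history, find_logical_association, intonation_difference):
    """Scan adjacent pairs of historical instructions; whenever a pair is logically
    associated, create (or reuse) a label for the association and attach the intonation
    change information computed for that pair.  Each element of `history` is assumed to
    expose `semantics` and `intonation` attributes (e.g. the ParsedInstruction sketch above)."""
    label_to_association = {}             # mapping relationship between label and logical association
    label_to_changes = defaultdict(list)  # intonation change information corresponding to each label
    label_names = {"correction": "negation-correction", "or": "question-hesitation"}  # assumed names
    for prev, curr in zip(history, history[1:]):
        association = find_logical_association(prev.semantics, curr.semantics)
        if association is None:
            continue  # no logical association between the pair, so no intonation analysis is needed
        label = label_names.get(association, association)
        label_to_association[label] = association
        label_to_changes[label].append(intonation_difference(prev.intonation, curr.intonation))
    return label_to_association, dict(label_to_changes)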
Further, in the embodiment of the present invention, after obtaining the intonation change information corresponding to the tag, the tag may be further normalized, and a specific method may be as follows:
one method is as follows: obtaining the intonation change information corresponding to each label, comparing the similarity between the intonation change information corresponding to the two labels by randomly extracting the two labels, and integrating the two labels to obtain an upper label if the similarity reaches a first preset threshold value, wherein the upper label corresponds to two groups of intonation change information. Therefore, the labels corresponding to similar fluctuation changes are integrated from the language tone fluctuation change layer to reduce the number of the labels, avoid the existence of excessive redundant and disordered labels, and realize that the fluctuation changes of multiple similar languages can be marked by utilizing one upper label.
The other method is as follows: analyzing the meaning of words of each label, matching the labels with the labels recorded on a preset label template by comparing the similarity of the words, wherein the preset label template is used for standardizing the labels, if the matching is successful, the labels recorded on the preset label template are used for replacing the labels, if a plurality of same labels exist after the matching operation, the same labels are subjected to de-duplication processing, one label is reserved, and the labels correspond to a plurality of groups of tone change information. Therefore, the labels with similar word meanings are obtained from the word meaning layer of the labels, the labels with similar word meanings are standardized and unified by using the preset label template, then the duplication elimination processing is carried out, and finally the multiple groups of tone change information corresponding to the same label are obtained, and correspondingly, the multiple groups of tone change information are high in similarity. For the method, the standard operation of the label is realized by replacing the tone fluctuation change level from the word meaning level, the calculation process is simplified, and the calculation cost is saved.
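The second normalization route (matching label word meaning against a preset label template and then de-duplicating) might look roughly like the sketch below; `word_similarity`, the template contents and the threshold value are assumptions.

```python
def normalize_labels(label_to_changes, label_template, word_similarity, threshold=0.8):
    """Replace each label with the closest standardized label on the preset template,
    then merge duplicates so one label keeps several groups of intonation change information."""
    normalized = {}
    for label, changes in label_to_changes.items():
        best = max(label_template, key=lambda t: word_similarity(label, t))
        if word_similarity(label, best) >= threshold:
            label = best                                  # successful match: use the template's label
        normalized.setdefault(label, []).extend(changes)  # de-duplication of identical labels
    return normalized
```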
204. When a first voice instruction sent by a user is received, semantic information and intonation information contained in the first voice instruction are analyzed.
205. And receiving a second voice instruction, and analyzing semantic information and intonation information contained in the second voice instruction, wherein the second voice instruction is a voice instruction adjacent to the first voice instruction.
In steps 204 and 205, the received first voice instruction and second voice instruction may be two consecutive voice instructions or two voice instructions separated by an interval, but it should be ensured that the interval between receiving the two voice instructions is within the preset correction time limit; for example, if the interval between two adjacent voice instructions is long, it probably indicates that the operation intentions corresponding to the two voice instructions are completely unrelated and there is no need to perform a correction. Specifically, the step of first judging whether the second voice instruction falls within the preset correction time limit may be:
first, the system time of receiving the second voice command is recorded,
and secondly, detecting whether the system time is within a preset correction time limit corresponding to the first voice instruction, and if so, judging whether to execute a correction operation on the first voice instruction by comparing the intonation information contained in the second voice instruction with the intonation information contained in the first voice instruction.
However, if the first voice command is not within the preset correction time limit corresponding to the first voice command, the control operation is executed according to the first voice command and the second voice command according to the sequence of the received voice commands.
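The preset correction time limit described above can be checked with something as simple as the following sketch; the 5-second window is an assumed value, not taken from the patent.

```python
CORRECTION_WINDOW_SECONDS = 5.0  # preset correction time limit; the value is an assumption

def within_correction_window(first_received_at: float, second_received_at: float) -> bool:
    """True when the system time at which the second instruction was received falls inside
    the correction time limit that started when the first instruction was received.
    Timestamps are assumed to be seconds, e.g. values returned by time.time()."""
    return (second_received_at - first_received_at) <= CORRECTION_WINDOW_SECONDS

# If the check fails, no correction is attempted and both instructions are executed
# in the order in which they were received.
```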
206. It is identified whether the second speech instruction is a word combination containing a negative word.
In the embodiment of the invention, after two adjacent voice commands are received, whether the second voice command is a word combination containing a negative word or not can be identified in advance, and if so, the operation step of correcting the voice command can be simplified, thereby improving the correction efficiency.
For example: two adjacent voice instructions are 'play a song by Zhou Jielun' and 'no, that's not right'. When the second voice instruction is recognized as such a word combination containing a negative word, the judgment can be made directly: the operation intentions corresponding to the two adjacent voice instructions are different, the latter is the real intention of the user, and it is used to correct the voice instruction. Therefore, for these two adjacent voice instructions, the smart speaker does not execute any control operation, that is, it does not execute the operation corresponding to the first voice instruction.
207a, if the second voice command is recognized as a word combination containing a negative word, the semantic information contained in the first voice command is corrected according to the semantic information contained in the second voice command.
In the embodiment of the invention, if the second voice instruction is recognized as a word combination containing a negative word, the correction operation can be simplified, that is, the first voice instruction is simply not executed; then, if a third voice instruction is received, the operation corresponding to the third voice instruction is executed directly. This improves correction efficiency, steps 207b-208b do not need to be executed, and processing cost is saved.
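Recognizing whether the second instruction is a word combination containing a negative word can be approximated by a simple vocabulary check, as sketched below; the word list is an assumption, since the patent does not enumerate negative words.

```python
# Assumed vocabulary of negative words; the patent does not list one.
NEGATIVE_WORDS = ("不对", "不是", "不要", "错了", "no", "not", "wrong")

def contains_negative_word(asr_text: str) -> bool:
    """True if the ASR text of the second voice instruction contains a negative word,
    in which case the correction can be performed directly (step 207a)."""
    return any(word in asr_text for word in NEGATIVE_WORDS)
```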
207b, if the second voice instruction is not recognized as a word combination containing a negative word, judging whether to execute a correction operation on the first voice instruction by comparing the intonation information contained in the second voice instruction with the intonation information contained in the first voice instruction.
It should be noted that, before comparing whether there is an intonation fluctuation between the intonation information contained in the second voice instruction and that contained in the first voice instruction, it may also be verified in advance whether there is a correlation between the semantic information respectively contained in the second voice instruction and the first voice instruction; if there is no correlation, the subsequent operation of determining whether the voice instruction needs to be corrected does not need to be performed.
For example: two adjacent voice instructions are 'help me set the 12:00 alarm for tomorrow' and 'play a song by Zhou Jielun'. Analyzing the semantic information contained in them shows that the user intentions represented by the two voice instructions are completely unrelated, and executing both voice instructions accords with the real intentions of the user.
However, if there is a correlation, the intonation information included in the second voice command is compared with the intonation information included in the first voice command to determine whether to perform a correction operation on the first voice command, and the specific steps may be as follows:
the method comprises the steps of firstly, respectively calculating difference information between a first voice instruction and a second voice instruction in four dimensions of voice height, voice speed, voice length and voice weight.
The difference information is intonation change information measured in the four dimensions of voice height, voice speed, voice length and voice weight. The specific calculation method is not limited in the embodiment of the present invention.
Secondly, the difference information is compared with the intonation change information corresponding to the label, and it is calculated whether the similarity between the difference information and the intonation change information corresponding to the label reaches a second preset threshold.
The intonation change information corresponding to the label is sample information of intonation fluctuation, and the correspondence between the label and this sample information is obtained from the longitudinal analysis performed in advance on the historical voice instructions; see step 203, which is not repeated here.
In the embodiment of the invention, this amounts to comparing the intonation change information calculated in real time between the two adjacent received voice instructions with the sample information of intonation fluctuation; if the similarity reaches the threshold, the intonation change information obtained in real time is marked with the label.
Thirdly, if so, the logical association existing between the first voice instruction and the second voice instruction is determined according to the label by searching the mapping relationship between the label and the logical association.
The label is created according to the logical association existing between the voice commands, and in step 203, in the process of analyzing the historical voice command sample, the mapping relationship between the label and the logical association is established in advance.
In the embodiment of the present invention, after the label corresponding to the intonation change information obtained through real-time calculation is determined, the logical association corresponding to that label can be found by searching the mapping relationship between labels and logical associations established in advance.
And fourthly, judging to execute correction operation on the first voice command when the logic association between the first voice command and the second voice command is determined to be a correction relation.
For example, for two adjacent voice instructions received in real time, 'Xiaobao, play a song by Zhou Jielun' and 'play a song by Wang Fei', when it is determined that the latter has a correction relationship with the former, it is judged that the correction operation is to be performed on the first voice instruction, and it is obtained that the user's real intention is to perform the 'play a song by Wang Fei' operation.
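Putting the first to fourth steps together, the correction decision could be sketched as follows; `similarity` (between two intonation-change records), the threshold value and the passed-in helpers are assumptions, and `intonation_difference` is the helper sketched earlier under step 103.

```python
def should_correct_first_instruction(first, second, label_to_changes, label_to_association,
                                     intonation_difference, similarity, threshold=0.75):
    """First step: compute the four-dimensional difference between the two instructions.
    Second step: compare it with the intonation change information attached to each label.
    Third step: look up the logical association mapped to the matching label.
    Fourth step: correct only when that logical association is a correction relationship."""
    diff = intonation_difference(first.intonation, second.intonation)
    for label, samples in label_to_changes.items():
        if any(similarity(diff, sample) >= threshold for sample in samples):
            return label_to_association.get(label) == "correction"
    return False
```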
208b, if the correction operation is judged to be executed on the first voice command, correcting the semantic information contained in the first voice command according to the semantic information contained in the second voice command.
In the embodiment of the present invention, the specific steps of performing the correction operation may be: and ignoring the first voice command and controlling to execute the operation corresponding to the second voice command.
Further, when it is determined that a correction operation is to be performed on the first voice instruction, prompt information may be output to the user asking the user to confirm whether the operation corresponding to the first voice instruction should be executed; adding this confirmation step increases interaction with the user. If the indication information fed back by the user is not received within a preset time, the correction operation on the first voice instruction can be directly controlled to be executed, so as to avoid interfering with the user's operation.
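One way the correction operation with the optional confirmation prompt might be wired up is sketched below; `ask_user_with_timeout` and `execute_operation` stand in for device-specific interfaces and are assumptions.

```python
def perform_correction(first, second, ask_user_with_timeout, execute_operation, timeout=3.0):
    """Ignore the first voice instruction and execute the second, unless the user explicitly
    confirms the first instruction within the preset time; receiving no feedback within the
    timeout counts as agreeing to the correction."""
    question = 'Should the earlier instruction "%s" still be executed?' % first.text
    confirmed_first = ask_user_with_timeout(question, timeout)  # falsy when no feedback arrives in time
    if confirmed_first:
        execute_operation(first.semantics)   # user insists on the first instruction
    else:
        execute_operation(second.semantics)  # correction: ignore the first, execute the second
```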
Further, as an implementation of the methods shown in fig. 1 and fig. 2, an embodiment of the present invention provides an error correction apparatus for voice interaction. The embodiment of the apparatus corresponds to the embodiment of the method, and for convenience of reading, details in the embodiment of the apparatus are not repeated one by one, but it should be clear that the apparatus in the embodiment can correspondingly implement all the contents in the embodiment of the method. The device is applied to automatically correct errors of received voice commands in the process of processing input voice commands, and specifically as shown in fig. 3, the device comprises:
the analysis unit 301 is configured to, when a first voice instruction sent by a user is received, analyze semantic information and intonation information included in the first voice instruction;
a receiving unit 302, configured to receive a second voice instruction;
the parsing unit 301 is further configured to parse semantic information and intonation information included in the second voice instruction, where the second voice instruction is a voice instruction that is adjacent to the first voice instruction;
a determining unit 303, configured to determine whether to perform a correction operation on the first voice instruction by comparing intonation information included in the second voice instruction analyzed by the analyzing unit 301 with intonation information included in the first voice instruction analyzed by the analyzing unit 301;
a correcting unit 304, configured to correct the semantic information included in the first voice instruction according to the semantic information included in the second voice instruction when the determining unit 303 determines that the correcting operation is performed on the first voice instruction.
Further, as shown in fig. 4, the apparatus further includes:
an obtaining unit 305, configured to obtain, before the receiving of a first voice instruction issued by a user, a plurality of historical voice instructions corresponding to the user;
the parsing unit 301 is further configured to parse semantic information and intonation information included in each historical voice instruction;
an extracting unit 306, configured to randomly extract two adjacent voice instructions from the plurality of historical voice instructions acquired by the acquiring unit 305;
the judging unit 303 is configured to judge whether there is a logical association between two adjacent voice instructions according to semantic information respectively corresponding to the two adjacent voice instructions extracted by the extracting unit 306;
a creating unit 307, configured to create a tag according to the logical association when the determining unit 303 determines that the two adjacent voice instructions have the logical association, so as to obtain a mapping relationship between the tag and the logical association;
a calculating unit 308, configured to calculate difference information between intonation information corresponding to the two adjacent voice instructions extracted by the extracting unit 306, where the difference information is intonation change information measured in four dimensions, i.e., voice height, voice speed, voice length, and voice weight;
a labeling unit 309, configured to label the difference information with the label created by the creating unit 307, so as to obtain intonation change information corresponding to the label.
Further, as shown in fig. 4, the apparatus further includes:
the obtaining unit 305 is further configured to obtain the intonation change information corresponding to each tag after the intonation change information corresponding to the tag is obtained;
a comparing unit 310, configured to compare similarity between the intonation change information corresponding to the two tags by arbitrarily extracting the two tags;
an integrating unit 311, configured to, if the similarity obtained by the comparing unit 310 reaches a first preset threshold, integrate two tags to obtain an upper tag, where the upper tag corresponds to two sets of intonation change information.
Further, as shown in fig. 4, the apparatus further includes:
the analyzing unit 301 is further configured to analyze the word meaning of each tag after the intonation change information corresponding to the tag is obtained;
a matching unit 312, configured to match the tag with a tag recorded on a preset tag template by comparing similarity of words, where the preset tag template is used to standardize the tag;
a replacing unit 313, configured to replace the tag with the tag recorded on the preset tag template if the matching performed by the matching unit 312 succeeds;
a processing unit 314, configured to, if a plurality of identical tags exist after the matching operation, perform deduplication processing on the plurality of identical tags and retain one tag, where the retained tag corresponds to multiple sets of intonation change information.
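Units 312 to 314 can be read as a normalization pass over the raw tags. The sketch below uses difflib's string similarity as a stand-in for the word-similarity comparison, and the contents of the preset tag template are invented for the example.

```python
from difflib import SequenceMatcher

PRESET_TAG_TEMPLATE = ("correction", "supplement", "repetition")  # hypothetical template

def best_template_match(tag, threshold=0.6):
    # Compare word similarity between the raw tag and every template tag.
    score, best = max((SequenceMatcher(None, tag, t).ratio(), t) for t in PRESET_TAG_TEMPLATE)
    return best if score >= threshold else None

def normalize_tags(tag_to_changes):
    normalized = {}
    for tag, changes in tag_to_changes.items():
        template = best_template_match(tag)
        key = template if template is not None else tag
        # Identical tags after matching are deduplicated: the surviving tag
        # accumulates every group of intonation change information.
        normalized.setdefault(key, []).extend(changes)
    return normalized
```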
Further, as shown in fig. 4, the determining unit 303 includes:
a calculating module 3031, configured to calculate difference information between the first voice instruction and the second voice instruction in four dimensions, namely voice height, voice speed, voice length, and voice weight;
the calculating module 3031 is further configured to calculate whether a similarity between the difference information and the intonation change information corresponding to the tag reaches a second preset threshold by comparing the difference information with the intonation change information corresponding to the tag;
a determining module 3032, configured to, when the similarity between the difference information and the intonation change information corresponding to the tag calculated by the calculating module 3031 reaches a second preset threshold, determine, according to the tag, a logical association existing between the first voice instruction and the second voice instruction by searching for a mapping relationship between the tag and the logical association;
a determining module 3033, configured to determine to perform a correction operation on the first voice instruction when the determining module 3032 determines that the logical association between the first voice instruction and the second voice instruction is a correction relationship.
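Putting modules 3031 to 3033 together, one possible decision routine is shown below. It reuses the cosine and mean_change helpers and the tag stores from the earlier sketches, and the value of the second preset threshold is an assumption.

```python
def decide_correction(first_tone, second_tone, tag_to_changes, tag_to_association,
                      second_threshold=0.85):
    # Difference information between the two live instructions in the four dimensions.
    diff = tuple(b - a for a, b in zip(first_tone, second_tone))
    for tag, changes in tag_to_changes.items():
        if cosine(diff, mean_change(changes)) >= second_threshold:
            # Look up the mapping relationship between the tag and its logical
            # association; only a correction relationship triggers correction.
            if tag_to_association.get(tag) == "correction":
                return True
    return False
```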
Further, as shown in fig. 4, the apparatus further includes:
a recognition unit 315, configured to, after the second voice instruction is received, recognize whether the second voice instruction is a word combination including a negative word;
the correcting unit 304 is further configured to correct the semantic information included in the first voice instruction according to the semantic information included in the second voice instruction when the recognition unit 315 recognizes that the second voice instruction is a word combination including a negative word;
the judging unit 303 is configured to, when the recognition unit 315 recognizes that the second voice instruction is not a word combination including a negative word, judge whether to perform a correction operation on the first voice instruction by comparing the intonation information included in the second voice instruction with the intonation information included in the first voice instruction.
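A simple reading of the negative-word shortcut performed by unit 315 is sketched below; the list of negation cues is an assumption, and the VoiceInstruction class from the first sketch is reused.

```python
NEGATIVE_WORDS = {"no", "not", "don't", "wrong"}  # hypothetical negation cues

def contains_negative_word(semantics: str) -> bool:
    text = semantics.lower()
    return any(word in NEGATIVE_WORDS for word in text.split()) or "i meant" in text

def resolve_semantics(first, second, judge_by_intonation):
    # A negated adjacent instruction ("no, play B") corrects directly; otherwise
    # fall back to the intonation comparison of the judging unit.
    if contains_negative_word(second.text):
        return second.text
    return second.text if judge_by_intonation(first, second) else first.text
```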
Further, as shown in fig. 4, the apparatus further includes:
a recording unit 316 for recording the system time of receiving the second voice instruction;
the detecting unit 317 is configured to detect whether the system time is within a preset correction time limit corresponding to the first voice instruction;
the determining unit 303 is further configured to, when the detecting unit 317 detects that the system time is within the preset correction time limit corresponding to the first voice instruction, determine whether to perform a correction operation on the first voice instruction by comparing the intonation information included in the second voice instruction with the intonation information included in the first voice instruction;
an executing unit 318, configured to, when the detecting unit 317 detects that the system time is not within the preset correction time limit corresponding to the first voice instruction, execute control operations according to the first voice instruction and the second voice instruction respectively, in the order in which the voice instructions were received.
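The time-window check of units 316 to 318 might look as follows; the window length is an assumed value standing in for the preset correction time limit, and the timestamps are expected to come from a monotonic clock.

```python
CORRECTION_WINDOW_SECONDS = 5.0  # assumed preset correction time limit

def within_correction_window(first_received_at: float, second_received_at: float) -> bool:
    # Recording/detecting units 316 and 317: only an adjacent instruction arriving
    # inside the window can still correct the first one.
    return (second_received_at - first_received_at) <= CORRECTION_WINDOW_SECONDS

def route(first, second, first_ts, second_ts, judge, execute):
    if within_correction_window(first_ts, second_ts) and judge(first, second):
        execute(second)   # correction: only the second instruction is carried out
        return
    # Outside the window (or no correction detected): execute both, in the order received.
    execute(first)
    execute(second)
```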
Further, as shown in fig. 4, the apparatus further includes:
a verification unit 319, configured to verify whether there is a correlation between the semantic information included in the second voice instruction and the semantic information included in the first voice instruction before the intonation information included in the second voice instruction is compared with the intonation information included in the first voice instruction;
a comparing unit 320, configured to compare the intonation information included in the second voice instruction and the intonation information included in the first voice instruction when the verifying unit 319 verifies that there is a correlation between the semantic information included in the second voice instruction and the semantic information included in the first voice instruction.
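The correlation check performed by the verification unit 319 before any intonation comparison could be approximated by a content-word overlap test, as sketched below; the stop-word list and the overlap threshold are assumptions.

```python
def semantically_related(sem_a: str, sem_b: str, min_overlap: int = 1) -> bool:
    # Shared content words are taken as evidence that the two instructions refer
    # to the same request; a crude stand-in for the verification unit 319.
    stop_words = {"the", "a", "an", "to", "please"}
    words_a = {w for w in sem_a.lower().split() if w not in stop_words}
    words_b = {w for w in sem_b.lower().split() if w not in stop_words}
    return len(words_a & words_b) >= min_overlap

def compare_if_related(first, second, compare_intonation):
    # Comparing unit 320 only runs once the correlation has been verified.
    if semantically_related(first.text, second.text):
        return compare_intonation(first, second)
    return False
```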
Further, as shown in fig. 4, the apparatus further includes:
a prompting unit 321, configured to output prompting information to a user when it is determined that a correction operation is performed on the first voice instruction, where the prompting information is used to ask the user to confirm whether to perform an operation corresponding to the first voice instruction;
the control unit 322 is configured to control execution of the correction operation on the first voice instruction if no indication information fed back by the user is received within a preset time.
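The confirmation step of units 321 and 322 can be sketched as a prompt with a feedback timeout. The queue-based feedback channel, the prompt wording, and the timeout value are assumptions about how user feedback might be delivered.

```python
import queue

def confirm_correction(prompt_text: str, feedback: queue.Queue,
                       timeout_seconds: float = 3.0) -> bool:
    # Prompting unit 321: ask whether the first instruction should still be executed.
    print(prompt_text)
    try:
        answer = feedback.get(timeout=timeout_seconds)
    except queue.Empty:
        # Control unit 322: no feedback within the preset time, so correct.
        return True
    # Any explicit confirmation of the first instruction suppresses the correction.
    return answer.strip().lower() not in {"yes", "keep it"}
```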
Further, as shown in fig. 4, the correcting unit 304 includes:
an ignoring module 3041 for ignoring the first voice instruction;
the control module 3042 is configured to control to execute an operation corresponding to the second voice instruction.
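Finally, the correcting unit's two modules reduce to dropping the first instruction and dispatching only the second, for example against a hypothetical handler table:

```python
def apply_correction(first_command: str, second_command: str, handlers: dict) -> None:
    # Ignoring module 3041: the first instruction is deliberately not dispatched.
    # Control module 3042: only the operation mapped to the second instruction runs.
    action = handlers.get(second_command)
    if action is not None:
        action()

# Example with invented handlers:
# handlers = {"play song B": lambda: media_player.play("song B")}
# apply_correction("play song A", "play song B", handlers)
```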
Further, according to the above method embodiment, another embodiment of the present invention further provides a storage medium, where the storage medium stores a plurality of instructions, and the instructions are adapted to be loaded by a processor and execute the error correction method for voice interaction as described above.
With the instructions for error correction of voice interaction stored in the storage medium provided by the embodiment of the invention, for two voice instructions received in succession, intonation change information is obtained by comparing the intonation information respectively contained in the two adjacent voice instructions, so that whether to perform automatic error correction of the voice instructions is judged in advance according to the intonation change information, thereby ensuring that the output control operation conforms to the real intention of the user.
Further, according to the above method embodiment, another embodiment of the present invention also provides an electronic device, which includes a storage medium and a processor;
the processor is suitable for realizing instructions;
the storage medium adapted to store a plurality of instructions;
the instructions are adapted to be loaded by the processor and to perform the error correction method of voice interaction as described above.
According to the electronic device for error correction of voice interaction provided by the embodiment of the invention, for two adjacent voice instructions received in succession, intonation change information is obtained by comparing the intonation information respectively contained in the two voice instructions, so that whether to perform automatic error correction of the voice instructions is judged in advance according to the intonation change information, thereby ensuring that the output control operation conforms to the real intention of the user.
In the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
It will be appreciated that the relevant features of the method and the apparatus described above may be cross-referenced. In addition, "first", "second", and the like in the above embodiments are used only to distinguish between the embodiments and do not indicate that one embodiment is superior to another.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
The algorithms and displays presented herein are not inherently related to any particular computer, virtual machine, or other apparatus. Various general purpose systems may also be used with the teachings herein. The required structure for constructing such a system will be apparent from the description above. Moreover, the present invention is not directed to any particular programming language. It is appreciated that a variety of programming languages may be used to implement the teachings of the present invention as described herein, and any descriptions of specific languages are provided above to disclose the best mode of the invention.
In the description provided herein, numerous specific details are set forth. It is understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
Similarly, it should be appreciated that in the foregoing description of exemplary embodiments of the invention, various features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. However, the disclosed method should not be interpreted as reflecting an intention that: that the invention claimed requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this invention.
Those skilled in the art will appreciate that the modules in the device in an embodiment may be adaptively changed and disposed in one or more devices different from the embodiment. The modules or units or components of the embodiments may be combined into one module or unit or component, and furthermore they may be divided into a plurality of sub-modules or sub-units or sub-components. All of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or elements of any method or apparatus so disclosed, may be combined in any combination, except combinations where at least some of such features and/or processes or elements are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.
Furthermore, those skilled in the art will appreciate that while some embodiments described herein include some features included in other embodiments, rather than other features, combinations of features of different embodiments are meant to be within the scope of the invention and form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination.
The various component embodiments of the invention may be implemented in hardware, or in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will appreciate that a microprocessor or Digital Signal Processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components of the voice interaction error correction method and apparatus according to embodiments of the invention. The present invention may also be embodied as apparatus or device programs (e.g., computer programs and computer program products) for performing a portion or all of the methods described herein. Such programs implementing the present invention may be stored on computer-readable media or may be in the form of one or more signals. Such a signal may be downloaded from an internet website or provided on a carrier signal or in any other form.
It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the unit claims enumerating several means, several of these means may be embodied by one and the same item of hardware. The usage of the words first, second and third, etcetera do not indicate any ordering. These words may be interpreted as names.

Claims (10)

1. A method for error correction of voice interaction, the method comprising:
when a first voice instruction sent by a user is received, analyzing semantic information and intonation information contained in the first voice instruction;
receiving a second voice instruction, and analyzing semantic information and intonation information contained in the second voice instruction, wherein the second voice instruction is a voice instruction which is adjacent to the first voice instruction;
judging whether to execute a correction operation on the first voice instruction or not by comparing the intonation information contained in the second voice instruction with the intonation information contained in the first voice instruction;
if yes, correcting the semantic information contained in the first voice instruction according to the semantic information contained in the second voice instruction.
2. The method of claim 1, wherein prior to said receiving a first voice instruction from a user, the method further comprises:
acquiring a plurality of historical voice instructions corresponding to the user;
analyzing semantic information and intonation information contained in each historical voice instruction;
randomly extracting two adjacent voice instructions from the plurality of historical voice instructions;
judging whether logic association exists between the two adjacent voice instructions or not according to semantic information respectively corresponding to the two adjacent voice instructions;
if yes, creating a label according to the logic association to obtain a mapping relation between the label and the logic association;
calculating difference information between intonation information corresponding to the two adjacent voice instructions respectively, wherein the difference information is intonation change information measured in four dimensions of voice height, voice speed, voice length and voice weight;
and labeling the difference information by using the label to obtain the intonation change information corresponding to the label.
3. The method according to claim 2, wherein after obtaining the intonation change information corresponding to the tag, the method further comprises:
obtaining intonation change information corresponding to each label;
randomly extracting two labels and comparing the similarity between the intonation change information respectively corresponding to the two labels;
and if the similarity reaches a first preset threshold value, integrating the two labels to obtain an upper label, wherein the upper label corresponds to two groups of tone variation information.
4. The method according to claim 2, wherein after obtaining the intonation change information corresponding to the tag, the method further comprises:
analyzing the word meaning of each label;
matching the label with a label recorded on a preset label template by comparing the similarity of words, wherein the preset label template is used for standardizing the label;
if the matching is successful, replacing the label with the label recorded on the preset label template;
if a plurality of same labels exist after matching operation, performing deduplication processing on the same labels and reserving one label, wherein the label corresponds to a plurality of groups of intonation change information.
5. The method according to any one of claims 2-4, wherein the judging whether to execute the correction operation on the first voice instruction by comparing the intonation information contained in the second voice instruction with the intonation information contained in the first voice instruction comprises:
respectively calculating difference information between the first voice instruction and the second voice instruction in four dimensions of voice height, voice speed, voice length and voice weight;
calculating whether the similarity between the difference information and the intonation change information corresponding to the label reaches a second preset threshold value or not by comparing the difference information with the intonation change information corresponding to the label;
if yes, determining the logic association existing between the first voice instruction and the second voice instruction according to the label by searching the mapping relation between the label and the logic association;
when the logic association between the first voice instruction and the second voice instruction is determined to be a correction relationship, determining to execute the correction operation on the first voice instruction.
6. The method of claim 1, wherein after said receiving a second voice instruction, the method further comprises:
identifying whether the second voice instruction is a word combination containing a negative word;
if yes, correcting semantic information contained in the first voice instruction according to semantic information contained in the second voice instruction;
if not, judging whether to execute the correction operation on the first voice instruction or not by comparing the intonation information contained in the second voice instruction with the intonation information contained in the first voice instruction.
7. An apparatus for error correction of voice interactions, the apparatus comprising:
the analysis unit is used for analyzing semantic information and intonation information contained in a first voice instruction when the first voice instruction sent by a user is received;
the receiving unit is used for receiving a second voice instruction;
the analysis unit is further configured to analyze semantic information and intonation information included in the second voice instruction, where the second voice instruction is a voice instruction that is adjacent to the first voice instruction;
the judging unit is used for judging whether to execute correction operation on the first voice instruction or not by comparing the intonation information contained in the second voice instruction analyzed and obtained by the analyzing unit with the intonation information contained in the first voice instruction analyzed and obtained by the analyzing unit;
and the correcting unit is used for correcting the semantic information contained in the first voice command according to the semantic information contained in the second voice command when the judging unit judges that the correcting operation is executed on the first voice command.
8. The apparatus of claim 7, further comprising:
the acquisition unit is used for acquiring a plurality of historical voice instructions corresponding to a user before receiving a first voice instruction sent by the user;
the analysis unit is further used for analyzing semantic information and intonation information contained in each historical voice instruction;
the extraction unit is used for extracting two adjacent voice instructions from the plurality of historical voice instructions acquired by the acquisition unit;
the judging unit is used for judging whether the two adjacent voice instructions have logic association or not according to the semantic information respectively corresponding to the two adjacent voice instructions extracted by the extracting unit;
the creating unit is used for creating a label according to the logic association when the judging unit judges that the two adjacent voice instructions have the logic association, so as to obtain the mapping relation between the label and the logic association;
the calculation unit is used for calculating difference information between the intonation information respectively corresponding to the two adjacent voice instructions extracted by the extraction unit, wherein the difference information is intonation change information measured in four dimensions of voice height, voice speed, voice length and voice weight;
and the marking unit is used for marking the difference information by using the label created by the creating unit to obtain the intonation change information corresponding to the label.
9. A storage medium storing a plurality of instructions adapted to be loaded by a processor and to perform a method of error correction of a voice interaction according to any one of claims 1 to 6.
10. An electronic device, comprising a storage medium and a processor;
the processor is suitable for realizing instructions; the storage medium adapted to store a plurality of instructions;
the instructions are adapted to be loaded by the processor and to perform the method of error correction of voice interaction according to any of claims 1-6.
CN201910940847.0A 2019-09-30 2019-09-30 Voice interaction error correction method and device Pending CN112669833A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910940847.0A CN112669833A (en) 2019-09-30 2019-09-30 Voice interaction error correction method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910940847.0A CN112669833A (en) 2019-09-30 2019-09-30 Voice interaction error correction method and device

Publications (1)

Publication Number Publication Date
CN112669833A true CN112669833A (en) 2021-04-16

Family

ID=75399642

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910940847.0A Pending CN112669833A (en) 2019-09-30 2019-09-30 Voice interaction error correction method and device

Country Status (1)

Country Link
CN (1) CN112669833A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114721516A (en) * 2022-03-29 2022-07-08 网易有道信息技术(北京)有限公司 Multi-object interaction method based on virtual space and related equipment

Legal Events

Date Code Title Description
PB01 Publication
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20210416