CN108877792B - Method, apparatus, electronic device and computer readable storage medium for processing voice conversations - Google Patents


Info

Publication number
CN108877792B
Authority
CN
China
Prior art keywords
voice
reply
recognition result
user
speech
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810541680.6A
Other languages
Chinese (zh)
Other versions
CN108877792A (en)
Inventor
王矩
张晶晶
孙珂
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201810541680.6A
Publication of CN108877792A
Application granted
Publication of CN108877792B
Legal status: Active

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/221 - Announcement of recognition results
    • G10L2015/223 - Execution procedure of a spoken command

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

According to example embodiments of the present disclosure, a method, apparatus, electronic device, and computer-readable storage medium for processing a voice conversation are provided. The method includes providing a first reply to a first voice in response to receiving the first voice from a user, wherein the first reply is generated based on a first recognition result of the first voice. The method also includes receiving a second voice from the user, wherein the second voice is used to correct or supplement the first recognition result. In addition, the method includes generating a second reply to be provided to the user based on the first voice and the second voice, wherein the second reply is more consistent with the intent of the user than the first reply. According to embodiments of the disclosure, when a speech recognition anomaly prevents the chat robot from accurately recognizing the content of the user's voice, the user can actively correct or supplement it through a further voice turn, so that the anomaly in speech recognition can be resolved.

Description

Method, apparatus, electronic device and computer readable storage medium for processing voice conversations
Technical Field
Embodiments of the present disclosure relate generally to the field of artificial intelligence and, more particularly, to a method, apparatus, electronic device, and computer-readable storage medium for processing voice conversations.
Background
In recent years, the concept of Conversation as a Platform has gained increasing traction, and more and more network products and applications have begun to adopt conversational man-machine interaction. A chat robot is a computer program that interacts with people through text, voice, pictures, and the like: it understands what the user says and responds automatically. Chat robots can, to some extent, converse in place of a real person, and can be integrated into a dialog system as an automatic online assistant for scenarios such as casual chat, customer service, and information inquiry.
Voice dialog is a common form of man-machine interaction. Compared with text dialog, voice dialog additionally involves processing of the speech itself, such as front-end speech recognition and speech synthesis. Because the dialog system operates on the speech recognition output, it places high demands on recognition accuracy. Application scenarios for voice dialog include intelligent voice assistants, smart speakers, vehicle navigation, and the like.
Disclosure of Invention
According to example embodiments of the present disclosure, a method, apparatus, electronic device, and computer-readable storage medium for processing a voice conversation are provided.
In a first aspect of the present disclosure, a method for processing a voice conversation is provided. The method comprises the following steps: providing a first reply to the first voice in response to receiving the first voice from the user, wherein the first reply is generated based on a first recognition result of the first voice; receiving a second voice from the user, wherein the second voice is used for correcting or supplementing the first recognition result; and generating a second reply to be provided to the user based on the first voice and the second voice, wherein the second reply is more consistent with the intent of the user than the first reply.
In a second aspect of the present disclosure, an apparatus for processing a voice conversation is provided. The device comprises: a first providing module configured to provide a first reply to the first voice in response to receiving the first voice from the user, wherein the first reply is generated based on a first recognition result of the first voice; a voice receiving module configured to receive a second voice from the user, wherein the second voice is used for correcting or supplementing the first recognition result; and a second providing module configured to generate a second reply to be provided to the user based on the first voice and the second voice, wherein the second reply is more consistent with the user's intent than the first reply.
In a third aspect of the present disclosure, an electronic device is provided that includes one or more processors; and a storage device for storing one or more programs. The one or more programs, when executed by the one or more processors, cause the electronic device to implement methods or processes in accordance with embodiments of the present disclosure.
In a fourth aspect of the present disclosure, there is provided a computer readable medium having stored thereon a computer program which when executed by a processor implements a method or process according to an embodiment of the present disclosure.
It should be understood that this Summary is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The above and other features, advantages, and aspects of embodiments of the present disclosure will become more apparent from the following detailed description taken in conjunction with the accompanying drawings, in which like or similar reference numerals denote like or similar elements:
FIG. 1 illustrates a schematic diagram of an example environment in which embodiments of the present disclosure can be implemented;
FIG. 2 illustrates a flow chart of a method for processing a voice conversation in accordance with an embodiment of the present disclosure;
FIG. 3 shows a schematic diagram of a process for processing a voice message according to an embodiment of the present disclosure;
FIG. 4 illustrates a flow chart of a method of resolving text recognition errors through a dialog, according to an embodiment of the disclosure;
FIG. 5 illustrates a flow chart of a method of resolving digital identification errors through a dialogue in accordance with an embodiment of the present disclosure;
FIG. 6 illustrates a flow chart of a method of supplementing recognition results with a conversation in accordance with an embodiment of the present disclosure;
FIG. 7 illustrates a block diagram of an apparatus for processing a voice conversation in accordance with an embodiment of the present disclosure; and
FIG. 8 illustrates a block diagram of an electronic device capable of implementing various embodiments of the present disclosure.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are illustrated in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth here; rather, these embodiments are provided so that the disclosure will be more thorough and complete. It should be understood that the drawings and embodiments of the present disclosure are for illustration only and are not intended to limit the scope of the present disclosure.
In describing embodiments of the present disclosure, the term "comprising" and similar terms should be understood as open-ended, i.e., "including but not limited to". The term "based on" should be understood as "based at least in part on". The term "one embodiment" or "the embodiment" should be understood as "at least one embodiment". Other explicit and implicit definitions may also be included below.
In the context of voice dialogs, speech recognition anomalies (such as recognition errors or failures to recognize) are often caused by ambient noise or user accents. Two common ways to mitigate this problem are to improve the accuracy of speech recognition itself and to increase semantic fault tolerance. Even with both improvements, however, situations can still arise in which an accurate dialog is impossible because of a recognition anomaly. Since semantic understanding is generally based on the recognition result of the speech, unexpected consequences can follow when the chat robot misrecognizes, or cannot recognize, the user's intent.
Embodiments of the present disclosure propose a scheme for handling voice conversations. According to embodiments of the present disclosure, when the chat robot misrecognizes or fails to recognize the content of the user's voice because of a speech recognition anomaly, the user can actively correct or supplement it through a further voice turn. A semantic understanding platform according to embodiments of the disclosure can thereby resolve speech recognition anomalies and improve the user experience during chat. Some example embodiments of the present disclosure are described in detail below with reference to FIGS. 1-8.
FIG. 1 illustrates a schematic diagram of an example environment 100 in which embodiments of the present disclosure can be implemented. In the example environment 100, a user 110 engages in a voice conversation with a chat bot 120 (also referred to as a "chat engine"). The user 110 may be local to the chat bot 120, i.e., may talk to it directly. Alternatively, the user 110 may conduct the voice conversation with the chat bot 120 over a network using a local device (such as a laptop, desktop, smart phone, or tablet). It should be appreciated that the chat bot 120 may be deployed on a local electronic device, in the cloud, or in a distributed manner.
Referring to FIG. 1, the user 110 sends a voice 121 (referred to as a "first voice") to the chat robot 120, and the chat robot 120 processes the voice 121 and provides a corresponding reply 122 (referred to as a "first reply") to the user 110. At this point, the first round of dialog between the user 110 and the chat robot 120 is complete. In some embodiments, the text of the voice 121 may simultaneously be shown on the display of the user device so that the user has a clearer view of the current dialog content.
In embodiments of the present disclosure, if the reply 122 fails to meet the needs of the user 110 (e.g., a recognition error for the voice 121 prevents the chat robot 120 from accurately recognizing the intent of the user 110), the user may send a further voice 123 (referred to as a "second voice") to the chat robot 120 as a correction or supplement; the chat robot 120 processes the voice 123 and provides a corresponding reply 124 (referred to as a "second reply") to the user 110. According to embodiments of the present disclosure, since the voice 123 corrects or supplements the recognition result of the voice 121, the chat robot 120 can recognize the intention of the user 110 more accurately by combining the voices 121 and 123. Table 1 below shows one example of a dialog in which the first speech recognition result is corrected.
TABLE 1
Recognition result of user voice: Find Wang Feng's contact details (the given name recognized with a wrong homophonic character)
Reply of chat robot: Querying Wang Feng's contact details, please wait
Recognition result of user voice: It's the Feng as in "harvest" (丰收)
Reply of chat robot: Querying Wang Feng's contact details, please wait (now with the intended character 丰)
For example, the recognition result of the voice 121 of the user 110 is "find Wang Feng's contact details", and the chat bot 120 generates the corresponding reply 122, "Querying Wang Feng's contact details, please wait." Because the recognized name uses a character that merely sounds like the one the user 110 intends, he corrects the recognition result of the voice 121 in the voice 123 by saying "the Feng as in 'harvest'", a common way of disambiguating homophones in Chinese. The chat bot 120 generates the reply 124, "Querying Wang Feng's contact details, please wait", based on the corrected content. In this manner, the chat bot 120 is able to accurately identify the actual intent of the user 110.
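The patent does not specify at this point how the chat robot decides that a follow-up utterance is a correction rather than a new query (a semantic test for supplements is described later, with reference to FIG. 6). A minimal heuristic sketch in Python, whose pattern list and function name are our own illustration rather than the patent's mechanism, might look like this:

```python
import re

# Illustrative patterns that often mark a corrective utterance in Chinese
# voice dialogs; this list is an assumption for the sketch, not taken from
# the patent text.
CORRECTION_PATTERNS = [
    re.compile(r"^是(.+)的(.)$"),        # "是<word>的<char>": clarify one character
    re.compile(r"^不是(.+?)，?是(.+)$"),  # "不是X，是Y": replace X with Y
    re.compile(r"^是(\d+)$"),            # "是<digits>": correct part of a number
]

def looks_like_correction(utterance: str) -> bool:
    """Return True if the utterance matches a known correction pattern."""
    text = utterance.strip()
    return any(p.match(text) for p in CORRECTION_PATTERNS)

# The second turns of Table 1 and Table 4 both match:
assert looks_like_correction("是丰收的丰")          # "the Feng as in 'harvest'"
assert looks_like_correction("是110")              # "it's 110"
assert not looks_like_correction("今天天气怎么样")  # an ordinary new query
```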
Fig. 2 illustrates a flow chart of a method 200 for processing a voice conversation in accordance with an embodiment of the present disclosure. It should be appreciated that the method 200 may be performed by the chat bot 120 described above with reference to fig. 1.
At block 202, in response to receiving a first voice from a user, a first reply to the first voice is provided, wherein the first reply is generated based on a first recognition result of the first voice. For example, after the chat bot 120 receives the voice 121 from the user 110, it provides the corresponding reply 122 to the user 110. In embodiments of the present disclosure, a speech recognition anomaly may prevent the first reply from accurately reflecting the user's intent: the chat robot may, for example, give an erroneous reply, or prompt that it did not understand and ask the user to speak again.
In some embodiments, the reply 122 may be provided to the user 110 by voice only. In other embodiments, the recognition result of the voice 121 and the text form of the reply 122 may also be presented visually (e.g., via a display device) in order to make the user more intuitively aware of the recognition result of his voice.
At block 204, a second voice from the user is received, wherein the second voice is used to correct or supplement the first recognition result. For example, because the reply 122 (i.e., the first reply) failed to capture the user's intent, the chat bot 120 receives a further voice 123 from the user 110 as a correction or supplement. That is, when the chat bot 120 misrecognizes or fails to recognize the voice content of the user 110 because of a speech recognition error, the user 110 can actively clarify through a further voice turn, for example by correcting one or more words and/or digits by voice, or by supplementing the earlier content by voice.
At block 206, a second reply to be provided to the user is generated based on the first voice and the second voice, wherein the second reply is more consistent with the user's intent than the first reply. For example, the chat bot 120 provides the reply 124 to the user 110 based on the recognition results of both the voice 123 and the voice 121. Because the voice 123 corrects or supplements the recognition result of the voice 121, the chat bot 120 can better understand the intent of the user 110; the reply 124 (i.e., the second reply) is therefore more consistent with that intent than the reply 122 (the first reply), resolving the speech recognition anomaly and improving the user experience of the chat process.
In some embodiments, if the second reply correctly identifies the user's intent, an action associated with the second reply may be performed. For example, once the second voice has corrected or supplemented the first recognition result so that the chat robot can accurately recognize the user's intent, the chat robot may perform, or instruct another component to perform, an action associated with the second reply, such as placing a telephone call or starting map navigation. In some embodiments, if no further speech is received from the user within a threshold time after the second reply is provided, the second reply may be assumed by default to have met the user's intent, and the action associated with it may be performed directly. It should be appreciated that the action associated with the second reply may also be performed concurrently with, before, or after the generation of the second reply, regardless of whether the second reply already meets the user's intent.
In some embodiments, if the second reply still does not capture the user's intent, a third voice may be received from the user, and a third reply is then provided to the user based at least in part on the third voice. That is, although the second reply is closer to the user's intent than the first reply, it may still not fully satisfy the user's needs; in that case the user can initiate further speech to continue correcting or supplementing the previous recognition results.
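Blocks 202-206, together with the timeout rule and the optional third turn described above, amount to a small control loop. The following Python sketch shows one way to arrange it; every injected callable (listen, speak, generate_reply, merge, execute_action) is a placeholder assumed for illustration, since the patent specifies only the control flow:

```python
from typing import Callable, Optional

REPLY_TIMEOUT_S = 5.0  # the "threshold time"; the patent leaves its value open

def dialog_exchange(
    listen: Callable[[Optional[float]], Optional[str]],  # recognized text, or None on timeout
    speak: Callable[[str], None],
    generate_reply: Callable[[str], str],
    merge: Callable[[str, str], str],     # fold a correction/supplement into the result
    execute_action: Callable[[str], None],
) -> None:
    """One correction-aware exchange: blocks 202, 204 and 206 of method 200."""
    result = listen(None)                 # block 202: first voice, recognized
    speak(generate_reply(result))         # first reply

    while True:
        follow_up = listen(REPLY_TIMEOUT_S)
        if follow_up is None:
            # No further speech within the threshold time: assume the last
            # reply met the user's intent and perform the associated action.
            execute_action(result)
            return
        result = merge(result, follow_up)  # blocks 204/206: correct or supplement
        speak(generate_reply(result))      # second (or third, ...) reply
```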
Fig. 3 shows a schematic diagram of a process 300 for processing a voice message according to an embodiment of the disclosure. It should be appreciated that process 300 may be performed by chat bot 120 described above with reference to fig. 1, and that process 300 may provide an example implementation of a reply based on received speech described above with reference to fig. 2.
At block 302, speech from a user is input, and at block 304 the input speech is converted to text by automatic speech recognition (ASR). At block 306, the text is converted by natural language understanding (NLU) into a representation the computer can act on. At block 308, the intent and word slots extracted from the text are integrated with the historical dialog state through dialog state tracking (DST). At block 310, the action that best matches the current dialog state is selected by ranking the candidate actions. Once the action is obtained, a natural-language reply is generated (NLG) at block 312 and synthesized into speech (TTS) at block 314. Then, at block 316, the synthesized speech is output to the user. In process 300, blocks 302, 304, 314, and 316 concern speech processing, while blocks 306, 308, 310, and 312 concern natural language processing; dialog state tracking and action-candidate ranking together constitute dialog management, which generates the action to be performed based on the semantic representation of the speech and the current context, and updates that context.
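Process 300 is a conventional spoken-dialog pipeline. The Python sketch below makes the data flow concrete; every stage here is a toy stand-in (a real system would call an ASR engine, an NLU model, an action ranker, and a TTS engine), and all names are our own illustration:

```python
from dataclasses import dataclass, field

def asr(audio: bytes) -> str:                       # block 304: speech -> text
    return audio.decode("utf-8")                    # toy: treat audio as its transcript

def nlu(text: str) -> dict:                         # block 306: text -> intent + word slots
    return {"intent": "navigate", "slots": {"destination": text}}

@dataclass
class DialogState:                                  # dialog state kept across turns
    history: list = field(default_factory=list)
    slots: dict = field(default_factory=dict)

def update_state(state: DialogState, semantics: dict) -> DialogState:
    state.history.append(semantics)                 # block 308: integrate with history
    state.slots.update(semantics["slots"])
    return state

def rank_actions(state: DialogState) -> str:        # block 310: best-matching action
    return "navigate_to:" + state.slots.get("destination", "")

def nlg(action: str) -> str:                        # block 312: natural language generation
    return f"Querying a route for: {action}"

def tts(text: str) -> bytes:                        # block 314: speech synthesis
    return text.encode("utf-8")                     # toy: "synthesized" audio

def process_turn(audio_in: bytes, state: DialogState) -> bytes:
    """One pass through blocks 304-314 of process 300."""
    text = asr(audio_in)
    semantics = nlu(text)
    state = update_state(state, semantics)
    action = rank_actions(state)
    return tts(nlg(action))                         # block 316: speech output

print(process_turn("Xi'erqi".encode("utf-8"), DialogState()))
```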
In some embodiments, in order for the user to have a better interactive experience, the speech recognition results may be presented on a display device (e.g., a display of the user device). In this case, the recognition result of the voice and its reply may be presented simultaneously through the display device so that the user can know the recognition result of his voice.
Fig. 4 illustrates a flow chart of a method 400 of resolving text recognition errors through a dialog, according to an embodiment of the disclosure. It should be appreciated that method 400 may be performed by chat robot 120 described above with reference to fig. 1, and that block 402 may be an example implementation of block 204 described above with reference to fig. 2, and blocks 404 and 406 may be example implementations of block 206 described above with reference to fig. 2.
At block 402, a second speech is received that corrects one or more words in the first recognition result, where the first recognition result of the first speech contains a word recognition error. At block 404, the first recognition result is corrected using the second recognition result of the second speech. At block 406, a second reply is presented via the display device based on the corrected first recognition result. For example, Tables 2-3 below show examples of voice dialogs that correct one or more wrong characters in the first speech recognition result.
TABLE 2
Recognition result of user voice: I want to go to "Xi'erqi" (the final character of 西二旗 misrecognized as the homophone 奇, "odd")
Reply of chat robot: Sorry, I don't understand what you mean
Recognition result of user voice: It's Xi'erqi (西二旗)
Reply of chat robot: Querying a navigation route to Xi'erqi
TABLE 3
Recognition result of user voice: I want to go to "Xi'erqi" (the first and last characters misrecognized as the homophones 习, "practice", and 奇, "odd")
Reply of chat robot: Sorry, I don't understand what you mean
Recognition result of user voice: The Xi as in "east-west" (东西) and the Qi as in "national flag" (国旗)
Reply of chat robot: Querying a navigation route to Xi'erqi
In the example of Table 2, the user's second voice corrects a single wrong character in the recognition result of the first voice; in the example of Table 3, the user corrects multiple wrong characters. The chat robot then provides a second reply based on the corrected recognition result; in both examples the corrected result is "I want to go to Xi'erqi". It should be appreciated that while the examples of Tables 2-3 correct Chinese text, embodiments of the present disclosure can likewise correct speech recognition errors in other languages.
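One way to implement this character-level correction is to locate the homophone in the first recognition result and swap in the character the user clarified. The sketch below assumes a clarification of the "是<word>的<char>" form from Table 1 and uses the third-party pypinyin package for pronunciations; the Chinese strings are reconstructed from the translated examples, and the function is our own illustration rather than the patent's algorithm:

```python
import re
from pypinyin import lazy_pinyin  # third-party: pip install pypinyin

def apply_char_clarification(recognized: str, clarification: str) -> str:
    """Apply a clarification such as "是丰收的丰" ("the Feng as in 'harvest'"):
    replace the character in `recognized` that sounds like the clarified
    character with the clarified character itself."""
    m = re.match(r"^是(.+)的(.)$", clarification.strip())
    if not m:
        return recognized                  # not a recognizable clarification
    target = m.group(2)                    # the intended character, e.g. 丰
    target_sound = lazy_pinyin(target)[0]
    chars = list(recognized)
    for i, c in enumerate(chars):
        if c != target and lazy_pinyin(c)[0] == target_sound:
            chars[i] = target              # swap the homophone, e.g. 峰 -> 丰
            break
    return "".join(chars)

# Reconstructed Table 1: "查找王峰联系方式" + "是丰收的丰" -> "查找王丰联系方式"
print(apply_char_clarification("查找王峰联系方式", "是丰收的丰"))
```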
Fig. 5 illustrates a flow chart of a method 500 of resolving a digital identification error through a dialogue, according to an embodiment of the disclosure. It should be appreciated that method 500 may be performed by chat robot 120 described above with reference to fig. 1, and that block 502 may be an example implementation of block 204 described above with reference to fig. 2, and blocks 504 and 506 may be example implementations of block 206 described above with reference to fig. 2.
At block 502, a second speech is received that corrects one or more digits in the first recognition result, where the first recognition result of the first speech contains a digit recognition error. At block 504, the first recognition result is corrected using one or more digits in the second recognition result of the second speech. At block 506, a second reply is presented via the display device based on the corrected first recognition result. For example, Table 4 below shows an example of a voice dialog that corrects a digit error in the first speech recognition result.
TABLE 4
Recognition result of user voice: Call 13511652271
Reply of chat robot: Calling 13511652271
Recognition result of user voice: It's 110
Reply of chat robot: Calling 13511052271
In the example of Table 4, the user's second voice corrects a single erroneous digit in the recognition result of the first voice: the sixth digit of the phone number should be 0 rather than 6. In some embodiments, when the user corrects digits, the portion of the digits to be corrected may be determined based on a maximum match between the recognition results of the second voice and the first voice. For example, in Table 4, the digits "110" best match the substring "116" of the first recognition result, so it can be determined that "110" is intended to correct the "116" of the previous turn. In some embodiments, the user may also correct erroneous text and digits in the speech recognition result at the same time.
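The "maximum match" is not further defined in the text; one plausible reading, sketched below in Python, slides a window the size of the correction across the digits of the first recognition result and replaces the best-matching window (the function name and the tie-breaking rule are our own choices):

```python
def correct_digits(number: str, correction: str) -> str:
    """Replace the window of `number` that best matches `correction`:
    one plausible reading of the "maximum match" between the second
    and first recognition results."""
    if not correction or len(correction) > len(number):
        return number
    best_start, best_score = 0, -1
    for start in range(len(number) - len(correction) + 1):
        window = number[start:start + len(correction)]
        score = sum(a == b for a, b in zip(window, correction))
        if score > best_score:             # ties keep the leftmost window
            best_start, best_score = start, score
    return number[:best_start] + correction + number[best_start + len(correction):]

# Table 4: the correction "110" best matches "116" (two of three digits),
# so the sixth digit is replaced: 13511652271 -> 13511052271
assert correct_digits("13511652271", "110") == "13511052271"
```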
The inventors of the present application have found that some speech recognition errors can only be discovered through a visual display (e.g., wrong characters caused by homophones), while others can be noticed from speech alone, e.g., one or more digits of a telephone number recognized incorrectly. In some embodiments, therefore, the recognition result of the first voice and the first reply are not shown on a display device; instead, the first reply is provided by voice only. In this case, the user, on hearing the set of digits read out in the first reply, may speak a voice that corrects one or more digits in that set, and the chat bot then provides a second reply, better matching the user's intent, based on the corrected set of digits. It should be appreciated that a scenario without a displayed recognition result may be one with no display at all (e.g., a smart speaker without a screen) or one where a display exists but is not used to show the recognition result (e.g., a voice conversation while a smart phone's screen is off).
Fig. 6 illustrates a flow chart of a method 600 of supplementing recognition results with a conversation in accordance with an embodiment of the present disclosure. It should be appreciated that method 600 may be performed by chat robot 120 described above with reference to fig. 1, and that block 602 may be an example implementation of block 204 described above with reference to fig. 2, and blocks 604 and 606 may be example implementations of block 206 described above with reference to fig. 2.
At block 602, a second speech is received that supplements the first recognition result; for example, the first recognition result of the first speech does not fully express the user's needs. At block 604, in response to determining that the content of the second speech semantically supplements the content of the first speech, the first recognition result and the second recognition result of the second speech are combined to generate a third recognition result.
Whether the contents of the two voices have a supplementary relationship may be determined in various ways. For example, in some embodiments it may be determined whether the content of the second voice semantically continues the content of the first voice: if the contents of the two voices can be parsed together as a single whole, they can be considered to have a semantic continuation relationship, and the content of the second voice can accordingly be determined to supplement the first voice. Alternatively or additionally, in some embodiments it may be determined whether the probability of the contents of the two voices co-occurring is greater than a predetermined threshold: if the two contents frequently occur together, the content of the second voice may be determined to supplement the first voice.
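The two tests just described can be expressed as a pair of injected predicates, since the patent does not fix how the parse or the co-occurrence probability is computed. A Python sketch under those assumptions:

```python
from typing import Callable

def is_supplement(
    first: str,
    second: str,
    parses_as_whole: Callable[[str], bool],     # does the text parse as one query?
    cooccur_prob: Callable[[str, str], float],  # estimated co-occurrence probability
    threshold: float = 0.5,                     # "predetermined threshold" (assumed value)
) -> bool:
    """The second voice supplements the first if the concatenation parses as
    a single whole (semantic continuation), or the two contents co-occur
    with probability above the threshold."""
    if parses_as_whole(first + second):
        return True
    return cooccur_prob(first, second) > threshold

def combine(first: str, second: str) -> str:
    """Third recognition result: the supplement appended to the first
    (a simple concatenation; richer merging is equally possible)."""
    return first + second

# With suitable predicates, "我想去北京大学" + "西门" ("I want to go to
# Peking University" + "the west gate") yields "我想去北京大学西门".
```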
At block 606, a second reply is presented via the display device based on the third recognition result. For example, Table 5 below shows a dialog example that supplements the first speech recognition result.
TABLE 5
Recognition result of user voice: I want to go to Peking University
Recognition result of user voice (supplement): The west gate
Reply of chat robot: Querying a navigation route to the west gate of Peking University
In the example of Table 5, the user's intent is to go to the west gate of Peking University, but the speech recognition system had already started recognition after receiving "Peking University" and therefore could not capture the user's complete intent. In this case, the user can supply the missing information in natural language; the recognition results of the two voices are then combined into a new recognition result, "I want to go to the west gate of Peking University", and a corresponding reply is generated based on the new result. In some embodiments, for a recognition result that is not fully expressed, the user may supplement the information directly, without waiting for the recognition and parsing results to be returned and without having to repeat the previous dialog content.
Fig. 7 illustrates a block diagram of an apparatus 700 for processing a voice conversation in accordance with an embodiment of the present disclosure. As shown in fig. 7, the apparatus 700 includes a first providing module 710, a voice receiving module 720, and a second providing module 730. The first providing module 710 is configured to provide a first reply to the first voice in response to receiving the first voice from the user, wherein the first reply is generated based on a first recognition result of the first voice. The speech receiving module 720 is configured to receive a second speech from the user, wherein the second speech is used to correct or supplement the first recognition result. The second providing module 730 is configured to generate a second reply to be provided to the user based on the first voice and the second voice, wherein the second reply is more consistent with the user's intent than the first reply.
In some embodiments, wherein the first providing module 710 includes a first rendering module configured to render the first recognition result and the first reply via the display device.
In some embodiments, wherein the second providing module 730 includes a first correction module configured to correct text recognition errors in the first recognition result using the second recognition result for the second speech.
In some embodiments, wherein the second providing module 730 includes a second correction module configured to correct a digital recognition error in the first recognition result using one or more digits in the second recognition result for the second speech.
In some embodiments, wherein the second speech is used to supplement the first recognition result, and the second providing module 730 comprises: a combining module configured to combine the first recognition result and the second recognition result of the second voice to generate a third recognition result in response to determining that the content of the second voice semantically complements the content of the first voice; and a second presentation module configured to present the second reply through the display device based on the third recognition result.
In some embodiments, wherein the first providing module 710 comprises a first speech providing module configured to provide a first reply by speech only, wherein the first reply comprises a set of digits from the first recognition result.
In some embodiments, wherein the second speech is used to correct a group of digits, and the second providing module 730 comprises: a third correction module configured to correct a set of digits using one or more digits in a second recognition result of the second speech; and a second speech providing module configured to provide a second reply by speech based on the corrected set of digits.
In some embodiments, the apparatus 700 further comprises an action execution module configured to execute an action associated with the second reply in response to the second reply having identified the user's intent.
In some embodiments, the apparatus 700 further comprises: a third voice receiving module configured to receive a third voice from the user in response to the second reply not yet identifying the user's intent; and a third providing module configured to provide a third reply to the user based at least in part on the third voice.
It should be appreciated that the first providing module 710, the voice receiving module 720, and the second providing module 730 shown in FIG. 7 may be included in the chat robot 120 described with reference to FIG. 1. Moreover, it should be understood that the modules illustrated in FIG. 7 may perform the steps or actions of the methods and processes described in embodiments of the present disclosure.
Fig. 8 shows a schematic block diagram of an example device 800 that may be used to implement embodiments of the present disclosure. It should be appreciated that the device 800 may be used to implement the apparatus 700 for processing voice conversations or the chat bot 120 described in this disclosure. As shown, the device 800 includes a Central Processing Unit (CPU) 801 that can perform various suitable actions and processes in accordance with computer program instructions stored in a Read Only Memory (ROM) 802 or loaded from a storage unit 808 into a Random Access Memory (RAM) 803. In the RAM 803, various programs and data required for the operation of the device 800 can also be stored. The CPU 801, ROM 802, and RAM 803 are connected to each other by a bus 804. An input/output (I/O) interface 805 is also connected to the bus 804.
Various components in device 800 are connected to I/O interface 805, including: an input unit 806 such as a keyboard, mouse, etc.; an output unit 807 such as various types of displays, speakers, and the like; a storage unit 808, such as a magnetic disk, optical disk, etc.; and a communication unit 809, such as a network card, modem, wireless communication transceiver, or the like. The communication unit 809 allows the device 800 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.
Processing unit 801 performs the various methods and processes described above, such as method 200, process 300, method 400, method 500, and method 600. For example, in some embodiments, the methods 200, 300, 400, 500, and 600 may be implemented as computer software programs tangibly embodied on a machine-readable medium, such as the storage unit 808. In some embodiments, part or all of the computer program may be loaded and/or installed onto device 800 via ROM 802 and/or communication unit 809. When the computer program is loaded into RAM 803 and executed by CPU 801, one or more of the acts or steps of method 200, process 300, method 400, method 500, and method 600 described above may be performed. Alternatively, in other embodiments, CPU 801 may be configured to perform method 200, process 300, method 400, method 500, and method 600 in any other suitable manner (e.g., by means of firmware).
The functions described above herein may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: a Field Programmable Gate Array (FPGA), an Application Specific Integrated Circuit (ASIC), an Application Specific Standard Product (ASSP), a System on a Chip (SOC), a Complex Programmable Logic Device (CPLD), and so forth.
Program code for carrying out the methods of the present disclosure may be written in any combination of one or more programming languages. Such program code may be provided to a processor or controller of a general-purpose computer, special-purpose computer, or other programmable data processing apparatus, such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowcharts and/or block diagrams to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine, or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
Moreover, although the acts or steps are depicted in a particular order, this should not be understood as requiring that such acts or steps be performed in the particular order shown or in sequential order, or that all illustrated acts or steps be performed, to achieve desirable results. In certain circumstances, multitasking and parallel processing may be advantageous. Likewise, while several specific implementation details are included in the discussion above, these should not be construed as limiting the scope of the present disclosure. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single implementation. Conversely, various features that are described in the context of a single implementation can also be implemented in multiple implementations separately or in any suitable subcombination.
Although embodiments of the disclosure have been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are example forms of implementing the claims.

Claims (16)

1. A method for processing a voice conversation, comprising:
providing a first reply to a first voice in response to receiving the first voice from a user, the first reply being generated based on a first recognition result of the first voice, the first recognition result including first digital content therein, the first reply including the first digital content;
receiving a second voice from the user, the second voice being used to correct or supplement the first recognition result, a second recognition result of the second voice including second digital content, wherein the second voice is determined to be used to correct or supplement the first recognition result in the case where the second voice and the first voice can be parsed together as a whole;
generating, by a chat robot, a second reply to be provided to the user based on the first voice and the second voice, the second reply more conforming to the user's intent than the first reply; and
responsive to not receiving speech from the user within a threshold time after the second reply is provided, determining that the second reply meets the user's intent and performing an action associated with the second reply,
wherein generating the second reply comprises:
determining a set of digits to be corrected in the first digital content based on a maximum match between the second digital content and the first recognition result, wherein the set of digits to be corrected has the same number of digits as the second digital content;
correcting the set of digits to be corrected in the first digital content using the second digital content; and
generating the second reply based on the corrected first digital content.
2. The method of claim 1, wherein providing a first reply to the first voice comprises:
presenting the first recognition result and the first reply via a display device.
3. The method of claim 2, wherein generating the second reply further comprises:
correcting a word recognition error in the first recognition result using the second recognition result of the second voice.
4. The method of claim 2, wherein generating the second reply further comprises:
combining the first recognition result and a second recognition result of the second voice to generate a third recognition result in response to determining that the content of the second voice semantically complements the content of the first voice; and
based on the third recognition result, presenting the second reply through the display device.
5. The method of claim 1, wherein providing a first reply to the first voice comprises:
providing the first reply by voice.
6. The method of claim 5, wherein generating the second reply further comprises:
providing the second reply by voice based on the corrected set of digits.
7. The method of any of claims 1-6, further comprising:
receiving a third voice from the user in response to the second reply not yet identifying the intent of the user; and
a third reply is provided to the user based at least in part on the third voice.
8. An apparatus for processing a voice conversation, comprising:
a first providing module configured to provide a first reply to a first voice in response to receiving the first voice from a user, the first reply being generated based on a first recognition result of the first voice, the first recognition result including first digital content therein, the first reply including the first digital content;
a speech receiving module configured to receive a second speech from the user, the second speech being used to correct or supplement the first recognition result, a second recognition result of the second speech including second digital content, wherein the second speech is determined to be used to correct or supplement the first recognition result in the case where the second speech and the first speech can be parsed together as a whole;
a second providing module configured to cause a chat robot to generate a second reply to be provided to the user based on the first voice and the second voice, the second reply more conforming to the user's intent than the first reply; and
an action execution module configured to determine that the second reply meets the user's intent and to execute an action associated with the second reply in response to not receiving speech from the user within a threshold time after the second reply is provided,
wherein the second providing module comprises:
a digital content determination module configured to determine a set of digits to be corrected in the first digital content based on a maximum match between the second digital content and the first recognition result, wherein the set of digits to be corrected has the same number of digits as the second digital content;
a correction module configured to correct the set of digits to be corrected in the first digital content using the second digital content; and
a second reply generation module configured to generate the second reply based on the corrected first digital content.
9. The apparatus of claim 8, wherein the first providing module comprises:
a first presentation module configured to present the first recognition result and the first reply via a display device.
10. The apparatus of claim 9, wherein the second providing module further comprises:
a first correction module configured to correct text recognition errors in the first recognition result using the second recognition result of the second speech.
11. The apparatus of claim 9, wherein the second providing module further comprises:
a combining module configured to combine the first recognition result and a second recognition result of the second voice to generate a third recognition result in response to determining that the content of the second voice semantically complements the content of the first voice; and
and a second presentation module configured to present the second reply through the display device based on a third recognition result.
12. The apparatus of claim 8, wherein the first providing module comprises:
a first voice providing module configured to provide the first reply by voice.
13. The apparatus of claim 12, wherein the second providing module further comprises:
a second voice providing module configured to provide the second reply only by voice based on the corrected set of digits.
14. The apparatus of any of claims 8-13, further comprising:
a third speech receiving module configured to receive a third speech from the user in response to the second reply not yet identifying the intent of the user; and
a third providing module configured to provide a third reply to the user based at least in part on the third voice.
15. An electronic device, the electronic device comprising:
one or more processors; and
storage means for storing one or more programs that when executed by the one or more processors cause the electronic device to implement the method of any of claims 1-7.
16. A computer readable storage medium having stored thereon a computer program which when executed by a processor implements the method according to any of claims 1-7.
CN201810541680.6A 2018-05-30 2018-05-30 Method, apparatus, electronic device and computer readable storage medium for processing voice conversations Active CN108877792B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810541680.6A CN108877792B (en) 2018-05-30 2018-05-30 Method, apparatus, electronic device and computer readable storage medium for processing voice conversations

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810541680.6A CN108877792B (en) 2018-05-30 2018-05-30 Method, apparatus, electronic device and computer readable storage medium for processing voice conversations

Publications (2)

Publication Number Publication Date
CN108877792A (en) 2018-11-23
CN108877792B 2023-10-24

Family

ID=64335845

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810541680.6A Active CN108877792B (en) 2018-05-30 2018-05-30 Method, apparatus, electronic device and computer readable storage medium for processing voice conversations

Country Status (1)

Country Link
CN (1) CN108877792B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109712616B (en) * 2018-11-29 2023-11-14 平安科技(深圳)有限公司 Telephone number error correction method and device based on data processing and computer equipment
CN109922371B (en) * 2019-03-11 2021-07-09 海信视像科技股份有限公司 Natural language processing method, apparatus and storage medium
CN110223694B (en) * 2019-06-26 2021-10-15 百度在线网络技术(北京)有限公司 Voice processing method, system and device
CN110299152A (en) * 2019-06-28 2019-10-01 北京猎户星空科技有限公司 Interactive output control method, device, electronic equipment and storage medium
CN110347815A (en) * 2019-07-11 2019-10-18 上海蔚来汽车有限公司 Multi-task processing method and multitasking system in speech dialogue system
CN110738997B (en) * 2019-10-25 2022-06-17 百度在线网络技术(北京)有限公司 Information correction method and device, electronic equipment and storage medium
CN112002321B (en) * 2020-08-11 2023-09-19 海信电子科技(武汉)有限公司 Display device, server and voice interaction method
CN115841814A (en) * 2021-09-18 2023-03-24 华为技术有限公司 Voice interaction method and electronic equipment

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2012037820A (en) * 2010-08-11 2012-02-23 Murata Mach Ltd Voice recognition apparatus, voice recognition apparatus for picking, and voice recognition method
CN105094315A (en) * 2015-06-25 2015-11-25 百度在线网络技术(北京)有限公司 Method and apparatus for smart man-machine chat based on artificial intelligence
CN105468582A (en) * 2015-11-18 2016-04-06 苏州思必驰信息科技有限公司 Method and device for correcting numeric string based on human-computer interaction
CN107305483A (en) * 2016-04-25 2017-10-31 北京搜狗科技发展有限公司 A kind of voice interactive method and device based on semantics recognition
US9728188B1 (en) * 2016-06-28 2017-08-08 Amazon Technologies, Inc. Methods and devices for ignoring similar audio being received by a system
JP2018004976A (en) * 2016-07-04 2018-01-11 パナソニック インテレクチュアル プロパティ コーポレーション オブ アメリカPanasonic Intellectual Property Corporation of America Voice interactive method, voice interactive device and voice interactive program
CN107870977A (en) * 2016-09-27 2018-04-03 谷歌公司 Chat robots output is formed based on User Status
CN106710592A (en) * 2016-12-29 2017-05-24 北京奇虎科技有限公司 Speech recognition error correction method and speech recognition error correction device used for intelligent hardware equipment
CN107045496A (en) * 2017-04-19 2017-08-15 畅捷通信息技术股份有限公司 The error correction method and error correction device of text after speech recognition
CN107943914A (en) * 2017-11-20 2018-04-20 渡鸦科技(北京)有限责任公司 Voice information processing method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Apple Siri vs. Samsung Bixby: voice assistants have become really smart (苹果Siri对比三星Bixby，语音助手都成精了); 盛世流光中; iQiyi (《爱奇艺》); 2017-11-23; http://www.iqiyi.com/w_19rv1enov9.html *

Also Published As

Publication number Publication date
CN108877792A (en) 2018-11-23

Similar Documents

Publication Publication Date Title
CN108877792B (en) Method, apparatus, electronic device and computer readable storage medium for processing voice conversations
KR102222317B1 (en) Speech recognition method, electronic device, and computer storage medium
US11217236B2 (en) Method and apparatus for extracting information
EP2869298A1 (en) Information identification method and apparatus
US9894030B2 (en) Method, device, computer storage medium, and apparatus for providing candidate words
CN111737987B (en) Intention recognition method, device, equipment and storage medium
CN111898643A (en) Semantic matching method and device
CN110956955B (en) Voice interaction method and device
CN112883966B (en) Image character recognition method, device, medium and electronic equipment
KR101695348B1 (en) Apparatus for providing service based messenger and method using the same
JP2018063271A (en) Voice dialogue apparatus, voice dialogue system, and control method of voice dialogue apparatus
CN111508472A (en) Language switching method and device and storage medium
CN113157877A (en) Multi-semantic recognition method, device, equipment and medium
US10600405B2 (en) Speech signal processing method and speech signal processing apparatus
CN111916085A (en) Human-computer conversation matching method, device and medium based on pronunciation similarity
CN113015002B (en) Processing method and device for anchor video data
JP2018159729A (en) Interaction system construction support device, method and program
CN114860910A (en) Intelligent dialogue method and system
CN111382322B (en) Method and device for determining similarity of character strings
CN108877781B (en) Method and system for searching film through intelligent voice
CN112632241A (en) Method, device, equipment and computer readable medium for intelligent conversation
CN113221514A (en) Text processing method and device, electronic equipment and storage medium
CN112163078A (en) Intelligent response method, device, server and storage medium
CN112256855A (en) User intention identification method and device
CN111401011B (en) Information processing method and device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant