CN108877792B - Method, apparatus, electronic device and computer readable storage medium for processing voice conversations - Google Patents


Info

Publication number
CN108877792B
Authority
CN
China
Prior art keywords
voice
reply
recognition result
user
speech
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810541680.6A
Other languages
Chinese (zh)
Other versions
CN108877792A (en)
Inventor
王矩
张晶晶
孙珂
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201810541680.6A
Publication of CN108877792A
Application granted
Publication of CN108877792B
Legal status: Active

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/221 - Announcement of recognition results
    • G10L2015/223 - Execution procedure of a spoken command

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

According to example embodiments of the present disclosure, a method, apparatus, electronic device, and computer-readable storage medium for processing a voice conversation are provided. The method includes providing a first reply to a first voice in response to receiving the first voice from a user, wherein the first reply is generated based on a first recognition result of the first voice. The method also includes receiving a second voice from the user, wherein the second voice is used to correct or supplement the first recognition result. In addition, the method includes generating a second reply to be provided to the user based on the first voice and the second voice, wherein the second reply is more consistent with the intent of the user than the first reply. According to embodiments of the disclosure, when a speech recognition anomaly prevents the chat robot from accurately recognizing the content of the user's voice, the user can actively correct or supplement it through a further voice turn, so that the anomaly in speech recognition can be resolved.

Description

Method, apparatus, electronic device and computer readable storage medium for processing voice conversations
Technical Field
Embodiments of the present disclosure relate generally to the field of artificial intelligence and, more particularly, to a method, apparatus, electronic device, and computer-readable storage medium for processing voice conversations.
Background
In recent years, the concept of Conversation as a Platform has gained increasing traction, and more and more network products and applications have begun to adopt conversational man-machine interaction. A chat robot is a computer program that interacts with people through text, voice, pictures, and the like: it understands what the user says and responds automatically. Chat robots can, to some extent, converse in place of a real person, and can be integrated into a dialog system as an automatic online assistant for scenarios such as casual chat, customer service, and information inquiry.
Voice dialog is a common form of man-machine interaction. Compared with text dialog, voice dialog additionally involves processing of the speech itself, such as front-end speech recognition and speech synthesis. Because the dialog system operates on the speech recognition output, it places high demands on recognition accuracy. Application scenarios for voice dialog include intelligent voice assistants, smart speakers, vehicle navigation, and the like.
Disclosure of Invention
According to example embodiments of the present disclosure, a method, apparatus, electronic device, and computer-readable storage medium for processing a voice conversation are provided.
In a first aspect of the present disclosure, a method for processing a voice conversation is provided. The method comprises the following steps: providing a first reply to the first voice in response to receiving the first voice from the user, wherein the first reply is generated based on a first recognition result of the first voice; receiving a second voice from the user, wherein the second voice is used for correcting or supplementing the first recognition result; and generating a second reply to be provided to the user based on the first voice and the second voice, wherein the second reply is more consistent with the intent of the user than the first reply.
In a second aspect of the present disclosure, an apparatus for processing a voice conversation is provided. The device comprises: a first providing module configured to provide a first reply to the first voice in response to receiving the first voice from the user, wherein the first reply is generated based on a first recognition result of the first voice; a voice receiving module configured to receive a second voice from the user, wherein the second voice is used for correcting or supplementing the first recognition result; and a second providing module configured to generate a second reply to be provided to the user based on the first voice and the second voice, wherein the second reply is more consistent with the user's intent than the first reply.
In a third aspect of the present disclosure, an electronic device is provided that includes one or more processors; and a storage device for storing one or more programs. The one or more programs, when executed by the one or more processors, cause the electronic device to implement methods or processes in accordance with embodiments of the present disclosure.
In a fourth aspect of the present disclosure, there is provided a computer readable medium having stored thereon a computer program which when executed by a processor implements a method or process according to an embodiment of the present disclosure.
It should be understood that this Summary is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The above and other features, advantages, and aspects of embodiments of the present disclosure will become more apparent from the following detailed description taken in conjunction with the accompanying drawings, in which like or similar reference numerals denote like or similar elements:
FIG. 1 illustrates a schematic diagram of an example environment in which embodiments of the present disclosure can be implemented;
FIG. 2 illustrates a flow chart of a method for processing a voice conversation in accordance with an embodiment of the present disclosure;
FIG. 3 shows a schematic diagram of a process for processing a voice message according to an embodiment of the present disclosure;
FIG. 4 illustrates a flow chart of a method of resolving text recognition errors through a dialog, according to an embodiment of the disclosure;
FIG. 5 illustrates a flow chart of a method of resolving digital identification errors through a dialogue in accordance with an embodiment of the present disclosure;
FIG. 6 illustrates a flow chart of a method of supplementing recognition results with a conversation in accordance with an embodiment of the present disclosure;
FIG. 7 illustrates a block diagram of an apparatus for processing a voice conversation in accordance with an embodiment of the present disclosure; and
FIG. 8 illustrates a block diagram of an electronic device capable of implementing various embodiments of the present disclosure.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are illustrated in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth here; rather, these embodiments are provided so that the disclosure will be more thorough and complete. It should be understood that the drawings and embodiments of the present disclosure are for illustration only and are not intended to limit the scope of the present disclosure.
In describing embodiments of the present disclosure, the term "comprising" and similar terms should be understood as open-ended, i.e., "including but not limited to". The term "based on" should be understood as "based at least in part on". The term "one embodiment" or "the embodiment" should be understood as "at least one embodiment". Other explicit and implicit definitions may also be included below.
In the context of voice dialogs, speech recognition anomalies (such as recognition errors or failures to recognize) are often caused by ambient noise or user accents. Two common ways to mitigate this problem are to improve the accuracy of speech recognition itself and to increase semantic fault tolerance. Even with both improvements, however, situations can still arise in which an accurate dialog is impossible because of a recognition anomaly. Since semantic understanding is generally based on the recognition result of the speech, unexpected consequences can follow when the chat robot misrecognizes, or cannot recognize, the user's intent.
Embodiments of the present disclosure propose a scheme for handling voice conversations. According to embodiments of the present disclosure, when the chat robot misrecognizes or fails to recognize the content of the user's voice because of a speech recognition anomaly, the user can actively correct or supplement it through a further voice turn. A semantic understanding platform according to embodiments of the disclosure can thereby resolve speech recognition anomalies and improve the user experience during chat. Some example embodiments of the present disclosure are described in detail below with reference to FIGS. 1-8.
FIG. 1 illustrates a schematic diagram of an example environment 100 in which embodiments of the present disclosure can be implemented. In the example environment 100, a user 110 engages in a voice conversation with a chat bot 120 (also referred to as a "chat engine"). The user 110 may be local to the chat bot 120, i.e., may talk to it directly. Alternatively, the user 110 may conduct the voice conversation with the chat bot 120 over a network using a local device (such as a laptop, desktop, smart phone, or tablet). It should be appreciated that the chat bot 120 may be deployed on a local electronic device, in the cloud, or in a distributed manner.
Referring to FIG. 1, the user 110 sends a voice 121 (referred to as a "first voice") to the chat robot 120, and the chat robot 120 processes the voice 121 and provides a corresponding reply 122 (referred to as a "first reply") to the user 110. At this point, the first round of dialog between the user 110 and the chat robot 120 is complete. In some embodiments, the text of the voice 121 may simultaneously be shown on the display of the user device so that the user has a clearer view of the current dialog content.
In embodiments of the present disclosure, if the reply 122 fails to meet the needs of the user 110 (e.g., a recognition error for the voice 121 prevents the chat robot 120 from accurately recognizing the intent of the user 110), the user may send a further voice 123 (referred to as a "second voice") to the chat robot 120 as a correction or supplement; the chat robot 120 processes the voice 123 and provides a corresponding reply 124 (referred to as a "second reply") to the user 110. According to embodiments of the present disclosure, since the voice 123 corrects or supplements the recognition result of the voice 121, the chat robot 120 can recognize the intention of the user 110 more accurately by combining the voices 121 and 123. Table 1 below shows one example of a dialog in which the first speech recognition result is corrected.
TABLE 1
Recognition result of user voice: Find Wang Feng's contact details (the given name recognized with a wrong homophonic character)
Reply of chat robot: Querying Wang Feng's contact details, please wait
Recognition result of user voice: It's the Feng as in "harvest" (丰收)
Reply of chat robot: Querying Wang Feng's contact details, please wait (now with the intended character 丰)
For example, the recognition result of the voice 121 of the user 110 is "find Wang Feng's contact details", and the chat bot 120 generates the corresponding reply 122, "Querying Wang Feng's contact details, please wait." Because the recognized name uses a character that merely sounds like the one the user 110 intends, he corrects the recognition result of the voice 121 in the voice 123 by saying "the Feng as in 'harvest'", a common way of disambiguating homophones in Chinese. The chat bot 120 generates the reply 124, "Querying Wang Feng's contact details, please wait", based on the corrected content. In this manner, the chat bot 120 is able to accurately identify the actual intent of the user 110.
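The patent does not specify at this point how the chat robot decides that a follow-up utterance is a correction rather than a new query (a semantic test for supplements is described later, with reference to FIG. 6). A minimal heuristic sketch in Python, whose pattern list and function name are our own illustration rather than the patent's mechanism, might look like this:

```python
import re

# Illustrative patterns that often mark a corrective utterance in Chinese
# voice dialogs; this list is an assumption for the sketch, not taken from
# the patent text.
CORRECTION_PATTERNS = [
    re.compile(r"^是(.+)的(.)$"),        # "是<word>的<char>": clarify one character
    re.compile(r"^不是(.+?)，?是(.+)$"),  # "不是X，是Y": replace X with Y
    re.compile(r"^是(\d+)$"),            # "是<digits>": correct part of a number
]

def looks_like_correction(utterance: str) -> bool:
    """Return True if the utterance matches a known correction pattern."""
    text = utterance.strip()
    return any(p.match(text) for p in CORRECTION_PATTERNS)

# The second turns of Table 1 and Table 4 both match:
assert looks_like_correction("是丰收的丰")          # "the Feng as in 'harvest'"
assert looks_like_correction("是110")              # "it's 110"
assert not looks_like_correction("今天天气怎么样")  # an ordinary new query
```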
Fig. 2 illustrates a flow chart of a method 200 for processing a voice conversation in accordance with an embodiment of the present disclosure. It should be appreciated that the method 200 may be performed by the chat bot 120 described above with reference to fig. 1.
At block 202, in response to receiving a first voice from a user, a first reply to the first voice is provided, wherein the first reply is generated based on a first recognition result of the first voice. For example, after the chat bot 120 receives the voice 121 from the user 110, it provides the corresponding reply 122 to the user 110. In embodiments of the present disclosure, a speech recognition anomaly may prevent the first reply from accurately reflecting the user's intent: the chat robot may, for example, give an erroneous reply, or prompt that it did not understand and ask the user to speak again.
In some embodiments, the reply 122 may be provided to the user 110 by voice only. In other embodiments, the recognition result of the voice 121 and the text form of the reply 122 may also be presented visually (e.g., via a display device) in order to make the user more intuitively aware of the recognition result of his voice.
At block 204, a second voice from the user is received, wherein the second voice is used to correct or supplement the first recognition result. For example, because the reply 122 (i.e., the first reply) failed to capture the user's intent, the chat bot 120 receives a further voice 123 from the user 110 as a correction or supplement. That is, when the chat bot 120 misrecognizes or fails to recognize the voice content of the user 110 because of a speech recognition error, the user 110 can actively clarify through a further voice turn, for example by correcting one or more words and/or digits by voice, or by supplementing the earlier content by voice.
At block 206, a second reply to be provided to the user is generated based on the first voice and the second voice, wherein the second reply is more consistent with the user's intent than the first reply. For example, the chat bot 120 provides the reply 124 to the user 110 based on the recognition results of both the voice 123 and the voice 121. Because the voice 123 corrects or supplements the recognition result of the voice 121, the chat bot 120 can better understand the intent of the user 110; the reply 124 (i.e., the second reply) is therefore more consistent with that intent than the reply 122 (the first reply), resolving the speech recognition anomaly and improving the user experience of the chat process.
In some embodiments, if the second reply correctly identifies the user's intent, an action associated with the second reply may be performed. For example, once the second voice has corrected or supplemented the first recognition result so that the chat robot can accurately recognize the user's intent, the chat robot may perform, or instruct another component to perform, an action associated with the second reply, such as placing a telephone call or starting map navigation. In some embodiments, if no further speech is received from the user within a threshold time after the second reply is provided, the second reply may be assumed by default to have met the user's intent, and the action associated with it may be performed directly. It should be appreciated that the action associated with the second reply may also be performed concurrently with, before, or after the generation of the second reply, regardless of whether the second reply already meets the user's intent.
In some embodiments, if the second reply still does not capture the user's intent, a third voice may be received from the user, and a third reply is then provided to the user based at least in part on the third voice. That is, although the second reply is closer to the user's intent than the first reply, it may still not fully satisfy the user's needs; in that case the user can initiate further speech to continue correcting or supplementing the previous recognition results.
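Blocks 202-206, together with the timeout rule and the optional third turn described above, amount to a small control loop. The following Python sketch shows one way to arrange it; every injected callable (listen, speak, generate_reply, merge, execute_action) is a placeholder assumed for illustration, since the patent specifies only the control flow:

```python
from typing import Callable, Optional

REPLY_TIMEOUT_S = 5.0  # the "threshold time"; the patent leaves its value open

def dialog_exchange(
    listen: Callable[[Optional[float]], Optional[str]],  # recognized text, or None on timeout
    speak: Callable[[str], None],
    generate_reply: Callable[[str], str],
    merge: Callable[[str, str], str],     # fold a correction/supplement into the result
    execute_action: Callable[[str], None],
) -> None:
    """One correction-aware exchange: blocks 202, 204 and 206 of method 200."""
    result = listen(None)                 # block 202: first voice, recognized
    speak(generate_reply(result))         # first reply

    while True:
        follow_up = listen(REPLY_TIMEOUT_S)
        if follow_up is None:
            # No further speech within the threshold time: assume the last
            # reply met the user's intent and perform the associated action.
            execute_action(result)
            return
        result = merge(result, follow_up)  # blocks 204/206: correct or supplement
        speak(generate_reply(result))      # second (or third, ...) reply
```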
Fig. 3 shows a schematic diagram of a process 300 for processing a voice message according to an embodiment of the disclosure. It should be appreciated that process 300 may be performed by chat bot 120 described above with reference to fig. 1, and that process 300 may provide an example implementation of a reply based on received speech described above with reference to fig. 2.
At block 302, speech from a user is input, and at block 304 the input speech is converted to text by automatic speech recognition (ASR). At block 306, the text is converted by natural language understanding (NLU) into a representation the computer can act on. At block 308, the intent and word slots extracted from the text are integrated with the historical dialog state through dialog state tracking (DST). At block 310, the action that best matches the current dialog state is selected by ranking the candidate actions. Once the action is obtained, a natural-language reply is generated (NLG) at block 312 and synthesized into speech (TTS) at block 314. Then, at block 316, the synthesized speech is output to the user. In process 300, blocks 302, 304, 314, and 316 concern speech processing, while blocks 306, 308, 310, and 312 concern natural language processing; dialog state tracking and action-candidate ranking together constitute dialog management, which generates the action to be performed based on the semantic representation of the speech and the current context, and updates that context.
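Process 300 is a conventional spoken-dialog pipeline. The Python sketch below makes the data flow concrete; every stage here is a toy stand-in (a real system would call an ASR engine, an NLU model, an action ranker, and a TTS engine), and all names are our own illustration:

```python
from dataclasses import dataclass, field

def asr(audio: bytes) -> str:                       # block 304: speech -> text
    return audio.decode("utf-8")                    # toy: treat audio as its transcript

def nlu(text: str) -> dict:                         # block 306: text -> intent + word slots
    return {"intent": "navigate", "slots": {"destination": text}}

@dataclass
class DialogState:                                  # dialog state kept across turns
    history: list = field(default_factory=list)
    slots: dict = field(default_factory=dict)

def update_state(state: DialogState, semantics: dict) -> DialogState:
    state.history.append(semantics)                 # block 308: integrate with history
    state.slots.update(semantics["slots"])
    return state

def rank_actions(state: DialogState) -> str:        # block 310: best-matching action
    return "navigate_to:" + state.slots.get("destination", "")

def nlg(action: str) -> str:                        # block 312: natural language generation
    return f"Querying a route for: {action}"

def tts(text: str) -> bytes:                        # block 314: speech synthesis
    return text.encode("utf-8")                     # toy: "synthesized" audio

def process_turn(audio_in: bytes, state: DialogState) -> bytes:
    """One pass through blocks 304-314 of process 300."""
    text = asr(audio_in)
    semantics = nlu(text)
    state = update_state(state, semantics)
    action = rank_actions(state)
    return tts(nlg(action))                         # block 316: speech output

print(process_turn("Xi'erqi".encode("utf-8"), DialogState()))
```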
In some embodiments, in order for the user to have a better interactive experience, the speech recognition results may be presented on a display device (e.g., a display of the user device). In this case, the recognition result of the voice and its reply may be presented simultaneously through the display device so that the user can know the recognition result of his voice.
Fig. 4 illustrates a flow chart of a method 400 of resolving text recognition errors through a dialog, according to an embodiment of the disclosure. It should be appreciated that method 400 may be performed by chat robot 120 described above with reference to fig. 1, and that block 402 may be an example implementation of block 204 described above with reference to fig. 2, and blocks 404 and 406 may be example implementations of block 206 described above with reference to fig. 2.
At block 402, a second speech is received that corrects one or more words in the first recognition result, where the first recognition result of the first speech contains a word recognition error. At block 404, the first recognition result is corrected using the second recognition result of the second speech. At block 406, a second reply is presented via the display device based on the corrected first recognition result. For example, Tables 2-3 below show examples of voice dialogs that correct one or more wrong characters in the first speech recognition result.
TABLE 2
Recognition result of user voice: I want to go to "Xi'erqi" (the final character of 西二旗 misrecognized as the homophone 奇, "odd")
Reply of chat robot: Sorry, I don't understand what you mean
Recognition result of user voice: It's Xi'erqi (西二旗)
Reply of chat robot: Querying a navigation route to Xi'erqi
TABLE 3
Recognition result of user voice: I want to go to "Xi'erqi" (the first and last characters misrecognized as the homophones 习, "practice", and 奇, "odd")
Reply of chat robot: Sorry, I don't understand what you mean
Recognition result of user voice: The Xi as in "east-west" (东西) and the Qi as in "national flag" (国旗)
Reply of chat robot: Querying a navigation route to Xi'erqi
In the example of Table 2, the user's second voice corrects a single wrong character in the recognition result of the first voice; in the example of Table 3, the user corrects multiple wrong characters. The chat robot then provides a second reply based on the corrected recognition result; in both examples the corrected result is "I want to go to Xi'erqi". It should be appreciated that while the examples of Tables 2-3 correct Chinese text, embodiments of the present disclosure can likewise correct speech recognition errors in other languages.
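One way to implement this character-level correction is to locate the homophone in the first recognition result and swap in the character the user clarified. The sketch below assumes a clarification of the "是<word>的<char>" form from Table 1 and uses the third-party pypinyin package for pronunciations; the Chinese strings are reconstructed from the translated examples, and the function is our own illustration rather than the patent's algorithm:

```python
import re
from pypinyin import lazy_pinyin  # third-party: pip install pypinyin

def apply_char_clarification(recognized: str, clarification: str) -> str:
    """Apply a clarification such as "是丰收的丰" ("the Feng as in 'harvest'"):
    replace the character in `recognized` that sounds like the clarified
    character with the clarified character itself."""
    m = re.match(r"^是(.+)的(.)$", clarification.strip())
    if not m:
        return recognized                  # not a recognizable clarification
    target = m.group(2)                    # the intended character, e.g. 丰
    target_sound = lazy_pinyin(target)[0]
    chars = list(recognized)
    for i, c in enumerate(chars):
        if c != target and lazy_pinyin(c)[0] == target_sound:
            chars[i] = target              # swap the homophone, e.g. 峰 -> 丰
            break
    return "".join(chars)

# Reconstructed Table 1: "查找王峰联系方式" + "是丰收的丰" -> "查找王丰联系方式"
print(apply_char_clarification("查找王峰联系方式", "是丰收的丰"))
```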
Fig. 5 illustrates a flow chart of a method 500 of resolving a digital identification error through a dialogue, according to an embodiment of the disclosure. It should be appreciated that method 500 may be performed by chat robot 120 described above with reference to fig. 1, and that block 502 may be an example implementation of block 204 described above with reference to fig. 2, and blocks 504 and 506 may be example implementations of block 206 described above with reference to fig. 2.
At block 502, a second speech is received that corrects one or more digits in the first recognition result, where the first recognition result of the first speech contains a digit recognition error. At block 504, the first recognition result is corrected using one or more digits in the second recognition result of the second speech. At block 506, a second reply is presented via the display device based on the corrected first recognition result. For example, Table 4 below shows an example of a voice dialog that corrects a digit error in the first speech recognition result.
TABLE 4
Recognition result of user voice: Call 13511652271
Reply of chat robot: Calling 13511652271
Recognition result of user voice: It's 110
Reply of chat robot: Calling 13511052271
In the example of Table 4, the user's second voice corrects a single erroneous digit in the recognition result of the first voice: the sixth digit of the phone number should be 0 rather than 6. In some embodiments, when the user corrects digits, the portion of the digits to be corrected may be determined based on a maximum match between the recognition results of the second voice and the first voice. For example, in Table 4, the digits "110" best match the substring "116" of the first recognition result, so it can be determined that "110" is intended to correct the "116" of the previous turn. In some embodiments, the user may also correct erroneous text and digits in the speech recognition result at the same time.
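The "maximum match" is not further defined in the text; one plausible reading, sketched below in Python, slides a window the size of the correction across the digits of the first recognition result and replaces the best-matching window (the function name and the tie-breaking rule are our own choices):

```python
def correct_digits(number: str, correction: str) -> str:
    """Replace the window of `number` that best matches `correction`:
    one plausible reading of the "maximum match" between the second
    and first recognition results."""
    if not correction or len(correction) > len(number):
        return number
    best_start, best_score = 0, -1
    for start in range(len(number) - len(correction) + 1):
        window = number[start:start + len(correction)]
        score = sum(a == b for a, b in zip(window, correction))
        if score > best_score:             # ties keep the leftmost window
            best_start, best_score = start, score
    return number[:best_start] + correction + number[best_start + len(correction):]

# Table 4: the correction "110" best matches "116" (two of three digits),
# so the sixth digit is replaced: 13511652271 -> 13511052271
assert correct_digits("13511652271", "110") == "13511052271"
```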
The inventors of the present application have found that some speech recognition errors can only be discovered through a visual display (e.g., wrong characters caused by homophones), while others can be noticed from speech alone, e.g., one or more digits of a telephone number recognized incorrectly. In some embodiments, therefore, the recognition result of the first voice and the first reply are not shown on a display device; instead, the first reply is provided by voice only. In this case, the user, on hearing the set of digits read out in the first reply, may speak a voice that corrects one or more digits in that set, and the chat bot then provides a second reply, better matching the user's intent, based on the corrected set of digits. It should be appreciated that a scenario without a displayed recognition result may be one with no display at all (e.g., a smart speaker without a screen) or one where a display exists but is not used to show the recognition result (e.g., a voice conversation while a smart phone's screen is off).
Fig. 6 illustrates a flow chart of a method 600 of supplementing recognition results with a conversation in accordance with an embodiment of the present disclosure. It should be appreciated that method 600 may be performed by chat robot 120 described above with reference to fig. 1, and that block 602 may be an example implementation of block 204 described above with reference to fig. 2, and blocks 604 and 606 may be example implementations of block 206 described above with reference to fig. 2.
At block 602, a second speech is received that supplements the first recognition result; for example, the first recognition result of the first speech does not fully express the user's needs. At block 604, in response to determining that the content of the second speech semantically supplements the content of the first speech, the first recognition result and the second recognition result of the second speech are combined to generate a third recognition result.
Whether the contents of the two voices have a supplementary relationship may be determined in various ways. For example, in some embodiments it may be determined whether the content of the second voice semantically continues the content of the first voice: if the contents of the two voices can be parsed together as a single whole, they can be considered to have a semantic continuation relationship, and the content of the second voice can accordingly be determined to supplement the first voice. Alternatively or additionally, in some embodiments it may be determined whether the probability of the contents of the two voices co-occurring is greater than a predetermined threshold: if the two contents frequently occur together, the content of the second voice may be determined to supplement the first voice.
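The two tests just described can be expressed as a pair of injected predicates, since the patent does not fix how the parse or the co-occurrence probability is computed. A Python sketch under those assumptions:

```python
from typing import Callable

def is_supplement(
    first: str,
    second: str,
    parses_as_whole: Callable[[str], bool],     # does the text parse as one query?
    cooccur_prob: Callable[[str, str], float],  # estimated co-occurrence probability
    threshold: float = 0.5,                     # "predetermined threshold" (assumed value)
) -> bool:
    """The second voice supplements the first if the concatenation parses as
    a single whole (semantic continuation), or the two contents co-occur
    with probability above the threshold."""
    if parses_as_whole(first + second):
        return True
    return cooccur_prob(first, second) > threshold

def combine(first: str, second: str) -> str:
    """Third recognition result: the supplement appended to the first
    (a simple concatenation; richer merging is equally possible)."""
    return first + second

# With suitable predicates, "我想去北京大学" + "西门" ("I want to go to
# Peking University" + "the west gate") yields "我想去北京大学西门".
```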
At block 606, a second reply is presented via the display device based on the third recognition result. For example, Table 5 below shows a dialog example that supplements the first speech recognition result.
TABLE 5
Recognition result of user voice: I want to go to Peking University
Recognition result of user voice (supplement): The west gate
Reply of chat robot: Querying a navigation route to the west gate of Peking University
In the example of Table 5, the user's intent is to go to the west gate of Peking University, but the speech recognition system had already started recognition after receiving "Peking University" and therefore could not capture the user's complete intent. In this case, the user can supply the missing information in natural language; the recognition results of the two voices are then combined into a new recognition result, "I want to go to the west gate of Peking University", and a corresponding reply is generated based on the new result. In some embodiments, for a recognition result that is not fully expressed, the user may supplement the information directly, without waiting for the recognition and parsing results to be returned and without having to repeat the previous dialog content.
Fig. 7 illustrates a block diagram of an apparatus 700 for processing a voice conversation in accordance with an embodiment of the present disclosure. As shown in fig. 7, the apparatus 700 includes a first providing module 710, a voice receiving module 720, and a second providing module 730. The first providing module 710 is configured to provide a first reply to the first voice in response to receiving the first voice from the user, wherein the first reply is generated based on a first recognition result of the first voice. The speech receiving module 720 is configured to receive a second speech from the user, wherein the second speech is used to correct or supplement the first recognition result. The second providing module 730 is configured to generate a second reply to be provided to the user based on the first voice and the second voice, wherein the second reply is more consistent with the user's intent than the first reply.
In some embodiments, wherein the first providing module 710 includes a first rendering module configured to render the first recognition result and the first reply via the display device.
In some embodiments, wherein the second providing module 730 includes a first correction module configured to correct text recognition errors in the first recognition result using the second recognition result for the second speech.
In some embodiments, wherein the second providing module 730 includes a second correction module configured to correct a digital recognition error in the first recognition result using one or more digits in the second recognition result for the second speech.
In some embodiments, wherein the second speech is used to supplement the first recognition result, and the second providing module 730 comprises: a combining module configured to combine the first recognition result and the second recognition result of the second voice to generate a third recognition result in response to determining that the content of the second voice semantically complements the content of the first voice; and a second presentation module configured to present the second reply through the display device based on the third recognition result.
In some embodiments, wherein the first providing module 710 comprises a first speech providing module configured to provide a first reply by speech only, wherein the first reply comprises a set of digits from the first recognition result.
In some embodiments, wherein the second speech is used to correct a group of digits, and the second providing module 730 comprises: a third correction module configured to correct a set of digits using one or more digits in a second recognition result of the second speech; and a second speech providing module configured to provide a second reply by speech based on the corrected set of digits.
In some embodiments, the apparatus 700 further comprises an action execution module configured to execute an action associated with the second reply in response to the second reply having identified the user's intent.
In some embodiments, the apparatus 700 further comprises: a third voice receiving module configured to receive a third voice from the user in response to the second reply not yet identifying the user's intent; and a third providing module configured to provide a third reply to the user based at least in part on the third voice.
It should be appreciated that the first providing module 710, the voice receiving module 720, and the second providing module 730 shown in FIG. 7 may be included in the chat robot 120 described with reference to FIG. 1. Moreover, it should be understood that the modules illustrated in FIG. 7 may perform the steps or actions of the methods and processes described in embodiments of the present disclosure.
Fig. 8 shows a schematic block diagram of an example device 800 that may be used to implement embodiments of the present disclosure. It should be appreciated that the device 800 may be used to implement the apparatus 700 for processing voice conversations or the chat bot 120 described in this disclosure. As shown, the device 800 includes a Central Processing Unit (CPU) 801 that can perform various suitable actions and processes in accordance with computer program instructions stored in a Read Only Memory (ROM) 802 or loaded from a storage unit 808 into a Random Access Memory (RAM) 803. In the RAM 803, various programs and data required for the operation of the device 800 can also be stored. The CPU 801, ROM 802, and RAM 803 are connected to each other by a bus 804. An input/output (I/O) interface 805 is also connected to the bus 804.
Various components in device 800 are connected to I/O interface 805, including: an input unit 806 such as a keyboard, mouse, etc.; an output unit 807 such as various types of displays, speakers, and the like; a storage unit 808, such as a magnetic disk, optical disk, etc.; and a communication unit 809, such as a network card, modem, wireless communication transceiver, or the like. The communication unit 809 allows the device 800 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.
Processing unit 801 performs the various methods and processes described above, such as method 200, process 300, method 400, method 500, and method 600. For example, in some embodiments, the methods 200, 300, 400, 500, and 600 may be implemented as computer software programs tangibly embodied on a machine-readable medium, such as the storage unit 808. In some embodiments, part or all of the computer program may be loaded and/or installed onto device 800 via ROM 802 and/or communication unit 809. When the computer program is loaded into RAM 803 and executed by CPU 801, one or more of the acts or steps of method 200, process 300, method 400, method 500, and method 600 described above may be performed. Alternatively, in other embodiments, CPU 801 may be configured to perform method 200, process 300, method 400, method 500, and method 600 in any other suitable manner (e.g., by means of firmware).
The functions described above herein may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: a Field Programmable Gate Array (FPGA), an Application Specific Integrated Circuit (ASIC), an Application Specific Standard Product (ASSP), a System on a Chip (SOC), a Complex Programmable Logic Device (CPLD), and so forth.
Program code for carrying out the methods of the present disclosure may be written in any combination of one or more programming languages. Such program code may be provided to a processor or controller of a general-purpose computer, special-purpose computer, or other programmable data processing apparatus, such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowcharts and/or block diagrams to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine, or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
Moreover, although the acts or steps are depicted in a particular order, this should not be understood as requiring that such acts or steps be performed in the particular order shown or in sequential order, or that all illustrated acts or steps be performed, to achieve desirable results. In certain circumstances, multitasking and parallel processing may be advantageous. Likewise, while several specific implementation details are included in the discussion above, these should not be construed as limiting the scope of the present disclosure. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single implementation. Conversely, various features that are described in the context of a single implementation can also be implemented in multiple implementations separately or in any suitable subcombination.
Although embodiments of the disclosure have been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are example forms of implementing the claims.

Claims (16)

1. A method for processing a voice conversation, comprising:
providing a first reply to a first voice in response to receiving the first voice from a user, the first reply being generated based on a first recognition result of the first voice, the first recognition result including first digital content therein, the first reply including the first digital content;
receiving a second voice from the user, the second voice being used to correct or supplement the first recognition result, a second recognition result of the second voice including second digital content, wherein the second voice is determined to be used to correct or supplement the first recognition result in the case where the second voice and the first voice can be parsed together as a whole;
generating, by a chat robot, a second reply to be provided to the user based on the first voice and the second voice, the second reply more conforming to the user's intent than the first reply; and
responsive to not receiving speech from the user within a threshold time after the second reply is provided, determining that the second reply meets the user's intent and performing an action associated with the second reply,
wherein generating the second reply comprises:
determining a set of digits to be corrected in the first digital content based on a maximum match between the second digital content and the first recognition result, wherein the set of digits to be corrected has the same number of digits as the second digital content;
correcting the set of digits to be corrected in the first digital content using the second digital content; and
generating the second reply based on the corrected first digital content.
2. The method of claim 1, wherein providing a first reply to the first voice comprises:
presenting the first recognition result and the first reply via a display device.
3. The method of claim 2, wherein generating the second reply further comprises:
correcting a word recognition error in the first recognition result using the second recognition result of the second voice.
4. The method of claim 2, wherein generating the second reply further comprises:
combining the first recognition result and a second recognition result of the second voice to generate a third recognition result in response to determining that the content of the second voice semantically complements the content of the first voice; and
based on the third recognition result, presenting the second reply through the display device.
5. The method of claim 1, wherein providing a first reply to the first voice comprises:
providing the first reply by voice.
6. The method of claim 5, wherein generating the second reply further comprises:
providing the second reply by voice based on the corrected set of digits.
7. The method of any of claims 1-6, further comprising:
receiving a third voice from the user in response to the second reply not yet identifying the intent of the user; and
a third reply is provided to the user based at least in part on the third voice.
8. An apparatus for processing a voice conversation, comprising:
a first providing module configured to provide a first reply to a first voice in response to receiving the first voice from a user, the first reply being generated based on a first recognition result of the first voice, the first recognition result including first digital content therein, the first reply including the first digital content;
a speech receiving module configured to receive a second speech from the user, the second speech being used to correct or supplement the first recognition result, a second recognition result of the second speech including second digital content, wherein the second speech is determined to be used to correct or supplement the first recognition result in the case where the second speech and the first speech can be parsed together as a whole;
a second providing module configured to cause a chat robot to generate a second reply to be provided to the user based on the first voice and the second voice, the second reply more conforming to the user's intent than the first reply; and
an action execution module configured to determine that the second reply meets the user's intent and to execute an action associated with the second reply in response to not receiving speech from the user within a threshold time after the second reply is provided,
wherein the second providing module comprises:
a digital content determination module configured to determine a set of digits to be corrected in the first digital content based on a maximum match between the second digital content and the first recognition result, wherein the set of digits to be corrected has the same number of digits as the second digital content;
a correction module configured to correct the set of digits to be corrected in the first digital content using the second digital content; and
a second reply generation module configured to generate the second reply based on the corrected first digital content.
9. The apparatus of claim 8, wherein the first providing module comprises:
a first presentation module configured to present the first recognition result and the first reply via a display device.
10. The apparatus of claim 9, wherein the second providing module further comprises:
a first correction module configured to correct text recognition errors in the first recognition result using the second recognition result of the second speech.
11. The apparatus of claim 9, wherein the second providing module further comprises:
a combining module configured to combine the first recognition result and a second recognition result of the second voice to generate a third recognition result in response to determining that the content of the second voice semantically complements the content of the first voice; and
and a second presentation module configured to present the second reply through the display device based on a third recognition result.
12. The apparatus of claim 8, wherein the first providing module comprises:
a first voice providing module configured to provide the first reply by voice.
13. The apparatus of claim 12, wherein the second providing module further comprises:
a second voice providing module configured to provide the second reply only by voice based on the corrected set of digits.
14. The apparatus of any of claims 8-13, further comprising:
a third speech receiving module configured to receive a third speech from the user in response to the second reply not yet identifying the intent of the user; and
a third providing module configured to provide a third reply to the user based at least in part on the third voice.
15. An electronic device, the electronic device comprising:
one or more processors; and
storage means for storing one or more programs that when executed by the one or more processors cause the electronic device to implement the method of any of claims 1-7.
16. A computer readable storage medium having stored thereon a computer program which when executed by a processor implements the method according to any of claims 1-7.
CN201810541680.6A 2018-05-30 2018-05-30 Method, apparatus, electronic device and computer readable storage medium for processing voice conversations Active CN108877792B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810541680.6A CN108877792B (en) 2018-05-30 2018-05-30 Method, apparatus, electronic device and computer readable storage medium for processing voice conversations

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810541680.6A CN108877792B (en) 2018-05-30 2018-05-30 Method, apparatus, electronic device and computer readable storage medium for processing voice conversations

Publications (2)

Publication Number Publication Date
CN108877792A (en) 2018-11-23
CN108877792B 2023-10-24

Family

ID=64335845

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810541680.6A Active CN108877792B (en) 2018-05-30 2018-05-30 Method, apparatus, electronic device and computer readable storage medium for processing voice conversations

Country Status (1)

Country Link
CN (1) CN108877792B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109712616B (en) * 2018-11-29 2023-11-14 平安科技(深圳)有限公司 Telephone number error correction method and device based on data processing and computer equipment
CN109922371B (en) * 2019-03-11 2021-07-09 海信视像科技股份有限公司 Natural language processing method, apparatus and storage medium
CN110223694B (en) * 2019-06-26 2021-10-15 百度在线网络技术(北京)有限公司 Voice processing method, system and device
CN110299152A (en) * 2019-06-28 2019-10-01 北京猎户星空科技有限公司 Interactive output control method, device, electronic equipment and storage medium
CN110347815A (en) * 2019-07-11 2019-10-18 上海蔚来汽车有限公司 Multi-task processing method and multitasking system in speech dialogue system
CN110738997B (en) * 2019-10-25 2022-06-17 百度在线网络技术(北京)有限公司 Information correction method and device, electronic equipment and storage medium
CN112002321B (en) * 2020-08-11 2023-09-19 海信电子科技(武汉)有限公司 Display device, server and voice interaction method
CN115841814A (en) * 2021-09-18 2023-03-24 华为技术有限公司 Voice interaction method and electronic equipment

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2012037820A (en) * 2010-08-11 2012-02-23 Murata Mach Ltd Voice recognition apparatus, voice recognition apparatus for picking, and voice recognition method
CN105094315A (en) * 2015-06-25 2015-11-25 百度在线网络技术(北京)有限公司 Method and apparatus for smart man-machine chat based on artificial intelligence
CN105468582A (en) * 2015-11-18 2016-04-06 苏州思必驰信息科技有限公司 Method and device for correcting numeric string based on human-computer interaction
CN107305483A (en) * 2016-04-25 2017-10-31 北京搜狗科技发展有限公司 A kind of voice interactive method and device based on semantics recognition
US9728188B1 (en) * 2016-06-28 2017-08-08 Amazon Technologies, Inc. Methods and devices for ignoring similar audio being received by a system
JP2018004976A (en) * 2016-07-04 2018-01-11 パナソニック インテレクチュアル プロパティ コーポレーション オブ アメリカPanasonic Intellectual Property Corporation of America Voice interactive method, voice interactive device and voice interactive program
CN107870977A (en) * 2016-09-27 2018-04-03 谷歌公司 Chat robots output is formed based on User Status
CN106710592A (en) * 2016-12-29 2017-05-24 北京奇虎科技有限公司 Speech recognition error correction method and speech recognition error correction device used for intelligent hardware equipment
CN107045496A (en) * 2017-04-19 2017-08-15 畅捷通信息技术股份有限公司 The error correction method and error correction device of text after speech recognition
CN107943914A (en) * 2017-11-20 2018-04-20 渡鸦科技(北京)有限责任公司 Voice information processing method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Apple Siri vs. Samsung Bixby: voice assistants have become really smart (苹果Siri对比三星Bixby，语音助手都成精了); 盛世流光中; iQiyi (《爱奇艺》); 2017-11-23; http://www.iqiyi.com/w_19rv1enov9.html *

Also Published As

Publication number Publication date
CN108877792A (en) 2018-11-23

Similar Documents

Publication Publication Date Title
CN108877792B (en) Method, apparatus, electronic device and computer readable storage medium for processing voice conversations
KR102222317B1 (en) Speech recognition method, electronic device, and computer storage medium
US11217236B2 (en) Method and apparatus for extracting information
EP2869298A1 (en) Information identification method and apparatus
US9894030B2 (en) Method, device, computer storage medium, and apparatus for providing candidate words
CN111737987B (en) Intention recognition method, device, equipment and storage medium
CN111898643A (en) Semantic matching method and device
CN110956955B (en) Voice interaction method and device
CN112883966B (en) Image character recognition method, device, medium and electronic equipment
KR101695348B1 (en) Apparatus for providing service based messenger and method using the same
JP2018063271A (en) Voice dialogue apparatus, voice dialogue system, and control method of voice dialogue apparatus
CN111508472A (en) Language switching method and device and storage medium
CN113157877A (en) Multi-semantic recognition method, device, equipment and medium
US10600405B2 (en) Speech signal processing method and speech signal processing apparatus
CN111916085A (en) Human-computer conversation matching method, device and medium based on pronunciation similarity
CN113015002B (en) Processing method and device for anchor video data
JP2018159729A (en) Interaction system construction support device, method and program
CN114860910A (en) Intelligent dialogue method and system
CN111382322B (en) Method and device for determining similarity of character strings
CN108877781B (en) Method and system for searching film through intelligent voice
CN112632241A (en) Method, device, equipment and computer readable medium for intelligent conversation
CN113221514A (en) Text processing method and device, electronic equipment and storage medium
CN112163078A (en) Intelligent response method, device, server and storage medium
CN112256855A (en) User intention identification method and device
CN111401011B (en) Information processing method and device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant