CN108877792A - Method, apparatus, electronic device, and computer-readable storage medium for processing a voice dialogue - Google Patents
- Publication number: CN108877792A
- Application number: CN201810541680.6A
- Authority
- CN
- China
- Prior art keywords
- voice
- recognition result
- user
- reply
- module
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
- G10L2015/221—Announcement of recognition results
- G10L2015/223—Execution procedure of a spoken command
Abstract
According to example embodiments of the present disclosure, a method, an apparatus, an electronic device, and a computer-readable storage medium for processing a voice dialogue are provided. The method includes, in response to receiving a first voice from a user, providing a first reply to the first voice, where the first reply is generated based on a first recognition result of the first voice. The method further includes receiving a second voice from the user, where the second voice is used to correct or supplement the first recognition result. In addition, the method includes generating and providing a second reply to the user based on the first voice and the second voice, where the second reply better matches the user's intention than the first reply. According to embodiments of the present disclosure, when a speech recognition anomaly prevents a chatbot from accurately recognizing the content of the user's speech, the user can actively correct or supplement it through further voice dialogue, thereby resolving the anomaly in speech recognition.
Description
Technical field
Embodiments of the present disclosure relate generally to the field of artificial intelligence, and more particularly to a method, an apparatus, an electronic device, and a computer-readable storage medium for processing a voice dialogue.
Background

In recent years, the idea of "Conversation as a Platform" has increasingly taken hold, and more and more networked products and applications have begun to adopt conversational human-machine interaction. A chatbot is a computer program or piece of software that interacts with humans through text, voice, pictures, and the like; it can understand what the user says and respond automatically. Chatbots can, to some extent, converse in place of a human and can be integrated into dialogue systems as automated online assistants for scenarios such as casual chat, customer service, and information queries.

Voice dialogue is a common form of human-computer interaction. Compared with text dialogue, voice dialogue also involves processing the speech itself, such as front-end signal processing, speech recognition, and speech synthesis. Because the dialogue system operates on the output of speech recognition, it places high demands on recognition accuracy. Application scenarios of voice dialogue include intelligent voice assistants, smart speakers, in-vehicle navigation, and so on.
Summary of the invention
According to example embodiments of the present disclosure, a method, an apparatus, an electronic device, and a computer-readable storage medium for processing a voice dialogue are provided.

In a first aspect of the present disclosure, a method for processing a voice dialogue is provided. The method includes: in response to receiving a first voice from a user, providing a first reply to the first voice, where the first reply is generated based on a first recognition result of the first voice; receiving a second voice from the user, where the second voice is used to correct or supplement the first recognition result; and, based on the first voice and the second voice, generating and providing a second reply to the user, where the second reply better matches the user's intention than the first reply.

In a second aspect of the present disclosure, an apparatus for processing a voice dialogue is provided. The apparatus includes: a first providing module configured to, in response to receiving a first voice from a user, provide a first reply to the first voice, where the first reply is generated based on a first recognition result of the first voice; a voice receiving module configured to receive a second voice from the user, where the second voice is used to correct or supplement the first recognition result; and a second providing module configured to, based on the first voice and the second voice, generate and provide a second reply to the user, where the second reply better matches the user's intention than the first reply.

In a third aspect of the present disclosure, an electronic device is provided, which includes one or more processors and a storage device for storing one or more programs. The one or more programs, when executed by the one or more processors, cause the electronic device to implement methods or processes according to embodiments of the present disclosure.

In a fourth aspect of the present disclosure, a computer-readable medium is provided on which a computer program is stored; when executed by a processor, the program implements methods or processes according to embodiments of the present disclosure.

It should be understood that the content described in this Summary is not intended to limit the key or essential features of embodiments of the present disclosure, nor to limit the scope of the present disclosure. Other features of the present disclosure will become easy to understand from the description below.
Brief description of the drawings

The above and other features, advantages, and aspects of the embodiments of the present disclosure will become more apparent from the following detailed description taken in conjunction with the accompanying drawings. In the drawings, the same or similar reference numerals denote the same or similar elements, in which:

Fig. 1 shows a schematic diagram of an example environment in which embodiments of the present disclosure can be implemented;
Fig. 2 shows a flowchart of a method for processing a voice dialogue according to an embodiment of the present disclosure;
Fig. 3 shows a schematic diagram of a process for handling a voice message according to an embodiment of the present disclosure;
Fig. 4 shows a flowchart of a method for resolving text recognition errors through dialogue according to an embodiment of the present disclosure;
Fig. 5 shows a flowchart of a method for resolving digit recognition errors through dialogue according to an embodiment of the present disclosure;
Fig. 6 shows a flowchart of a method for supplementing a recognition result through dialogue according to an embodiment of the present disclosure;
Fig. 7 shows a block diagram of an apparatus for processing a voice dialogue according to an embodiment of the present disclosure; and
Fig. 8 shows a block diagram of an electronic device in which multiple embodiments of the present disclosure can be implemented.
Detailed description

Embodiments of the present disclosure are described in more detail below with reference to the accompanying drawings. Although certain embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure can be implemented in various forms and should not be construed as limited to the embodiments set forth here; rather, these embodiments are provided so that the present disclosure will be understood more thoroughly and completely. It should be understood that the drawings and embodiments of the present disclosure are for illustration only and are not intended to limit the scope of protection of the present disclosure.

In the description of embodiments of the present disclosure, the term "include" and its variants should be understood as open-ended inclusion, i.e., "including but not limited to". The term "based on" should be understood as "based at least in part on". The terms "one embodiment" or "an embodiment" should be understood as "at least one embodiment". Other explicit and implicit definitions may also be included below.
In voice dialogue scenarios, background noise or user accents frequently cause speech recognition anomalies (such as recognition errors or failure to recognize at all). To address such anomalies, one line of improvement is to raise the accuracy of speech recognition itself; another is to improve the fault tolerance of semantic understanding. However, even with both improvements, scenarios still arise in which an accurate dialogue is impossible because of a speech recognition anomaly. In general, semantic understanding is performed on the recognition result of the speech, so when the chatbot cannot recognize the user's intention, or recognizes it incorrectly, unexpected consequences may follow.

Embodiments of the present disclosure propose a scheme for processing a voice dialogue. According to embodiments of the present disclosure, when a speech recognition anomaly causes the chatbot to misrecognize, or fail to recognize, the content of the user's speech, the user can actively correct or supplement it through further voice dialogue. A semantic understanding platform according to embodiments of the present disclosure can thus resolve speech recognition anomalies and improve the user experience of the chat process. Some example embodiments of the present disclosure are described in detail below with reference to Figs. 1-8.
Fig. 1 shows a schematic diagram of an example environment 100 in which embodiments of the present disclosure can be implemented. In the example environment 100, a user 110 carries on a voice dialogue with a chatbot 120 (also referred to as a "chat engine"). Optionally, the user 110 can converse with the chatbot 120 directly, i.e., locally at the chatbot 120. Alternatively, the user 110 can use a local device (a laptop computer, desktop computer, smartphone, tablet computer, etc.) to carry on the voice dialogue with the chatbot 120 over a network. It should be understood that the chatbot 120 can be deployed in a local electronic device, in the cloud, or in a distributed fashion.
With reference to Fig. 1, the user 110 sends a voice 121 (referred to as a "first voice") to the chatbot 120; the chatbot 120 processes the voice 121 and provides a corresponding reply 122 (referred to as a "first reply") to the user 110. At this point, the first round of dialogue between the user 110 and the chatbot 120 is complete. In some embodiments, the text of the voice 121 can simultaneously be shown on the display of the user device, so that the user clearly understands the current dialogue content.

In embodiments of the present disclosure, when the reply 122 does not satisfy the need of the user 110 (for example, a recognition error in the voice 121 prevents the chatbot 120 from accurately identifying the intention of the user 110), the user can send a further voice 123 (referred to as a "second voice") to the chatbot 120 for correction or supplementation; the chatbot 120 processes the voice 123 and provides a corresponding reply 124 (referred to as a "second reply") to the user 110. According to embodiments of the present disclosure, because the voice 123 corrects or supplements the recognition result of the voice 121, the chatbot 120 can, by combining the voices 121 and 123, identify the intention of the user 110 more accurately. Table 1 below shows an example dialogue in which a voice corrects the recognition result of the first voice.
Table 1

| Recognition result of the user's voice: | Look up the contact information of Wang Feng (王峰) |
| Chatbot's reply: | Looking up the contact information of Wang Feng (王峰), please wait |
| Recognition result of the user's voice: | It's the feng (丰) of "fengshou" (丰收, harvest) |
| Chatbot's reply: | Looking up the contact information of Wang Feng (王丰), please wait |
For example, the recognition result of the voice 121 of the user 110 is "look up the contact information of Wang Feng (王峰)", and the chatbot 120 generates the corresponding reply 122, "looking up the contact information of Wang Feng (王峰), please wait". Since the user 110 originally meant "look up the contact information of Wang Feng (王丰)", the user corrects the recognition result of the voice 121 in the voice 123 by saying "it's the feng (丰) of 'harvest' (丰收)". The chatbot 120 then generates the reply 124, "looking up the contact information of Wang Feng (王丰), please wait", based on the corrected content. In this way, the chatbot 120 can accurately identify the true intention of the user 110.
Fig. 2 shows a flowchart of a method 200 for processing a voice dialogue according to an embodiment of the present disclosure. It should be understood that the method 200 can be executed by the chatbot 120 described above with reference to Fig. 1.

At block 202, in response to receiving a first voice from a user, a first reply to the first voice is provided, where the first reply is generated based on a first recognition result of the first voice. For example, after the chatbot 120 receives the voice 121 from the user 110, it provides the corresponding reply 122 to the user 110. In embodiments of the present disclosure, the first reply fails to accurately identify the user's intention because of a speech recognition anomaly; for example, it may be an erroneous reply, or a prompt such as "could not recognize" asking the user to speak again.

In some embodiments, the reply 122 can be provided to the user 110 by voice only. In other embodiments, to give the user a more intuitive understanding of the recognition result of their speech, the recognition result of the voice 121 and the text of the reply 122 can also be presented visually (for example, through a display device).
At block 204, a second voice is received from the user, where the second voice is used to correct or supplement the first recognition result. For example, because the reply 122 (i.e., the first reply) fails to correctly identify the user's intention, the chatbot 120 further receives from the user 110 the voice 123 for correction or supplementation. That is, when a speech recognition error causes the chatbot 120 to misrecognize, or fail to recognize, the speech content of the user 110, the user 110 can actively clarify through further voice dialogue. For example, the user can actively correct one or more characters and/or digits by voice, or actively supplement by voice.
At block 206, based on the first voice and the second voice, a second reply is generated and provided to the user, where the second reply better matches the user's intention than the first reply. For example, based on the recognition result of the voice 123 and the recognition result of the voice 121, the chatbot 120 provides the reply 124 to the user 110. Because the voice 123 corrects or supplements the recognition result of the voice 121, the chatbot 120 can further understand the intention of the user 110; the reply 124 (i.e., the second reply) therefore matches the intention of the user 110 better than the reply 122 (the first reply), resolving the anomaly in speech recognition and improving the user experience of the chat process.
In some embodiments, if the second reply correctly identifies the user's intention, an action associated with the second reply can be executed. For example, because the second voice corrects or supplements the first recognition result of the first voice, the chatbot can accurately identify the user's intention; the chatbot can therefore execute, or instruct the execution of, an action associated with the second reply, such as making a phone call or starting map navigation. In some embodiments, if no further voice is received from the user within a threshold time after the second reply is provided, it can be assumed by default that the second reply has satisfied the user's intention, and the action associated with the second reply can then be executed directly. It should be understood that the action associated with the second reply can also be executed directly while, before, or after the second reply is generated, regardless of whether the second reply has satisfied the user's intention.
In some embodiments, if the second reply still does not identify the user's intention, a third voice can be received from the user. Next, based at least in part on the third voice, a third reply is provided to the user. For example, although the second reply matches the user's intention better than the first reply, it may still fail to fully satisfy the user's need. In this case, the user can initiate a further voice to continue correcting or supplementing the previous speech recognition result.
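The correct-or-supplement loop of blocks 202-206 can be sketched as a small stateful program. Everything below (the `ChatBot` class, the "it is X" correction pattern, `toy_corrector`) is an illustrative assumption, not the patent's implementation: the sketch only shows how a second voice can revise the recognition result kept from the first.

```python
def shared_prefix(a: str, b: str) -> int:
    """Length of the common prefix of two strings."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def toy_corrector(prev, new):
    """Treat utterances of the form "it is X" as corrections: replace the
    token of `prev` most similar to X (longest shared prefix).
    Returns None when `new` is not a correction."""
    if not new.startswith("it is "):
        return None
    x = new[len("it is "):]
    tokens = prev.split()
    best = max(range(len(tokens)), key=lambda i: shared_prefix(tokens[i], x))
    if shared_prefix(tokens[best], x) == 0:
        return None
    tokens[best] = x
    return " ".join(tokens)

class ChatBot:
    """Keeps the last recognition result so a later utterance can revise it
    (blocks 202-206 of method 200, sketched)."""
    def __init__(self, corrector):
        self.corrector = corrector
        self.last_result = None

    def on_voice(self, recognized_text):
        if self.last_result is not None:
            merged = self.corrector(self.last_result, recognized_text)
            if merged is not None:          # the second voice corrects the first
                self.last_result = merged
                return f"Reply based on: {merged}"
        self.last_result = recognized_text  # a fresh first voice
        return f"Reply based on: {recognized_text}"
```

With this sketch, `ChatBot(toy_corrector).on_voice("call 13511652271")` answers from the first recognition result, and a follow-up `on_voice("it is 13511052271")` revises the stored number before replying.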
Fig. 3 shows a schematic diagram of a process 300 for handling a voice message according to an embodiment of the present disclosure. It should be understood that the process 300 can be executed by the chatbot 120 described above with reference to Fig. 1, and that the process 300 can be an example implementation of providing a reply based on the received voice as described above with reference to Fig. 2.

At block 302, a voice from the user is input; at block 304, the input voice is converted into text by automatic speech recognition (ASR). At block 306, the text is converted by natural language understanding (NLU) into a representation that the computer can understand. At block 308, the intention and slots in the text are extracted and integrated with the historical dialogue state through dialogue state tracking (DST). At block 310, according to the current dialogue state, the one action that best fits the current state is selected by ranking candidate actions. After the action is performed, at block 312 natural language is produced by natural language generation (NLG), and at block 314 the generated natural language undergoes text-to-speech synthesis (TTS). Then, at block 316, the voice is output to be provided to the user. In the process 300, blocks 302, 304, 314, and 316 relate to speech processing, while blocks 306, 308, 310, and 312 relate to natural language processing; dialogue state tracking and candidate-action ranking together constitute dialogue management, which can generate the action to be executed based on the semantic representation of the speech and the current context, and update the context.
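The stage sequence of process 300 can be sketched as a chain of functions. The stage names follow the figure, but every implementation below is a toy stand-in (real ASR, NLU, and TTS are external systems), so treat this only as an illustration of the data flow between blocks:

```python
def asr(audio):                      # block 304: audio -> text (stubbed)
    return audio                     # pretend the audio is already transcribed

def nlu(text):                       # block 306: text -> intention and slots
    words = text.split()
    return {"intent": words[0], "slots": words[1:]}

def dst(state, frame):               # block 308: merge frame into dialogue state
    merged = dict(state)
    merged["intent"] = frame["intent"]
    merged["slots"] = merged.get("slots", []) + frame["slots"]
    return merged

def policy(state):                   # block 310: pick the best-fitting action
    if state["intent"] == "goto":
        return ("navigate", " ".join(state["slots"]))
    return ("fallback", "")

def nlg(action):                     # block 312: action -> reply text
    kind, arg = action
    if kind == "navigate":
        return f"Looking up the route to {arg}"
    return "Sorry, I did not understand"

def tts(text):                       # block 314: text -> audio (stubbed)
    return text.encode("utf-8")

def one_turn(audio, state):          # blocks 302-316 chained together
    text = asr(audio)
    state = dst(state, nlu(text))
    reply = nlg(policy(state))
    return tts(reply), state
```

Because the toy `dst` keeps accumulating slots across turns, a second call such as `one_turn("goto west gate", state)` extends the earlier destination, which mirrors the supplement behavior described later for method 600.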
In some embodiments, to give the user a better interactive experience, the speech recognition result can be shown on a display device (such as the display of the user device). In this case, the recognition result of the voice and its reply can be presented simultaneously through the display device, so that the user is aware of the recognition result of their speech.
Fig. 4 shows a flowchart of a method 400 for resolving text recognition errors through dialogue according to an embodiment of the present disclosure. It should be understood that the method 400 can be executed by the chatbot 120 described above with reference to Fig. 1; block 402 can be an example implementation of block 204 described above with reference to Fig. 2, and blocks 404 and 406 can be example implementations of block 206 described above with reference to Fig. 2.

At block 402, a second voice for correcting one or more characters in the first recognition result is received, where the first recognition result of the first voice contains a text recognition error. At block 404, the first recognition result is corrected using the second recognition result of the second voice. At block 406, a second reply is presented through the display device based on the corrected first recognition result. For example, Tables 2-3 below show example voice dialogues that correct one or more erroneous characters in the recognition result of the first voice.
Table 2

| Recognition result of the user's voice: | I want to go to Xi'erqi (西二奇) |
| Chatbot's reply: | Sorry, I do not understand what you mean |
| Recognition result of the user's voice: | It's Xi'erqi (西二旗) |
| Chatbot's reply: | Looking up the navigation route to Xi'erqi (西二旗) |
Table 3

| Recognition result of the user's voice: | I want to go to Xi'erqi (习二奇) |
| Chatbot's reply: | Sorry, I do not understand what you mean |
| Recognition result of the user's voice: | It's the xi (西) of "dongxi" (东西, thing) and the qi (旗) of "guoqi" (国旗, national flag) |
| Chatbot's reply: | Looking up the navigation route to Xi'erqi (西二旗) |
In the example of Table 2, the user's second voice corrects the single wrong character, qi (奇), in the recognition result of the first voice; in the example of Table 3, the user's second voice corrects multiple wrong characters, xi (习) and qi (奇). Next, the chatbot provides the second reply based on the corrected speech recognition result; for example, in the examples of Tables 2 and 3, the corrected recognition result is "I want to go to Xi'erqi (西二旗)". It should be understood that although the embodiments of Tables 2-3 correct Chinese text, embodiments of the present disclosure can also be used to correct speech recognition errors in other languages.
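The "character of word" correction pattern in Table 3 could be realized along the following lines. The pinyin table is a tiny illustrative lexicon and the matching rule is an assumption, not the patent's algorithm: the user names the intended character, and the system substitutes it for a same-sounding character in the recognition result.

```python
# Tiny illustrative pronunciation lexicon; a real system would use a full one.
PINYIN = {"西": "xi", "习": "xi", "旗": "qi", "奇": "qi", "二": "er"}

def correct_homophones(result: str, intended_chars: str) -> str:
    """For each character the user spelled out, replace the first character
    of `result` that sounds the same but is written differently."""
    out = list(result)
    for ch in intended_chars:
        target = PINYIN.get(ch)
        if target is None:
            continue
        for i, old in enumerate(out):
            if old != ch and PINYIN.get(old) == target:
                out[i] = ch
                break
    return "".join(out)
```

Under these assumptions, `correct_homophones("我要去习二奇", "西旗")` yields "我要去西二旗", matching the outcome in Table 3.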
Fig. 5 shows a flowchart of a method 500 for resolving digit recognition errors through dialogue according to an embodiment of the present disclosure. It should be understood that the method 500 can be executed by the chatbot 120 described above with reference to Fig. 1; block 502 can be an example implementation of block 204 described above with reference to Fig. 2, and blocks 504 and 506 can be example implementations of block 206 described above with reference to Fig. 2.

At block 502, a second voice for correcting one or more digits in the first recognition result is received, where the first recognition result of the first voice contains a digit recognition error. At block 504, the first recognition result is corrected using one or more digits in the second recognition result of the second voice. At block 506, a second reply is presented through the display device based on the corrected first recognition result. For example, Table 4 below shows an example voice dialogue that corrects a digit error in the recognition result of the first voice.
Table 4

| Recognition result of the user's voice: | Call 13511652271 |
| Chatbot's reply: | Calling 13511652271 |
| Recognition result of the user's voice: | It's 110 |
| Chatbot's reply: | Calling 13511052271 |
In the example of Table 4, the user's second voice corrects the single erroneous digit "6" in the recognition result of the first voice; that is, the sixth digit of the telephone number should be 0 rather than 6. In some embodiments, during the user's digit correction, the digit segment that the user needs to correct can be determined based on the maximum match between the second voice and the recognition result of the first voice. For example, in the example of Table 4, the digits "110" match "116" in the recognition result to the greatest extent, so it can be determined that "110" is intended to correct "116" in the previous round of dialogue. In some embodiments, the user can also correct both erroneous characters and erroneous digits in the speech recognition result at the same time.
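The maximum-match heuristic for digit correction might be sketched as follows; this is an illustrative reading of the idea, not the patent's algorithm. The correction is slid over the recognized number, and the window with the most agreeing digits is replaced:

```python
def correct_digits(original: str, correction: str) -> str:
    """Replace the window of `original` that best matches `correction`
    (maximum number of agreeing digits; leftmost window on ties)."""
    n = len(correction)
    best_pos, best_score = 0, -1
    for i in range(len(original) - n + 1):
        # Count positions where the window agrees with the correction.
        score = sum(a == b for a, b in zip(original[i:i + n], correction))
        if score > best_score:
            best_pos, best_score = i, score
    return original[:best_pos] + correction + original[best_pos + n:]
```

For the Table 4 dialogue, `correct_digits("13511652271", "110")` finds that the window "116" agrees with "110" in two of three positions, more than any other window, and returns "13511052271".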
The inventors of the present application found that some speech recognition errors can only be discovered through a visual display (for example, text recognition errors caused by homophones), while other speech recognition errors can also be discovered by voice, such as one or more misrecognized digits in a telephone number. In some embodiments, it is also possible not to show the recognition result of the first voice and the first reply on a display device, and instead to provide the first reply by voice only. In this case, the first reply can read a set of digits aloud, and the user can, based on it, issue a voice for correcting one or more digits in the set. Next, the chatbot provides, based on the corrected set of digits, a second reply that better matches the user's intention. It should be understood that a scenario in which the recognition result is not displayed can be a scenario without a display (for example, a smart speaker without a screen), or one with a display that is not used to show the speech recognition result (for example, a voice dialogue while a smartphone's screen is off).
Fig. 6 shows a flowchart of a method 600 for supplementing a recognition result through dialogue according to an embodiment of the present disclosure. It should be understood that the method 600 can be executed by the chatbot 120 described above with reference to Fig. 1; block 602 can be an example implementation of block 204 described above with reference to Fig. 2, and blocks 604 and 606 can be example implementations of block 206 described above with reference to Fig. 2.

At block 602, a second voice for supplementing the first recognition result is received. For example, the first recognition result of the first voice does not completely reflect the user's need. At block 604, in response to the content of the second voice semantically supplementing the content of the first voice, the first recognition result and the second recognition result of the second voice are combined to generate a third recognition result.
Whether the contents of two segments of voice have a supplementary relationship can be determined by various means. For example, in some embodiments, it can be determined whether the content of the second voice has a semantically continuous relationship with the content of the first voice. For example, if the contents of the two segments can be parsed together as a whole, it can be assumed that they are semantically continuous, and accordingly it is determined that the content of the second voice is a supplement to the first voice. Alternatively or additionally, in some embodiments, it can be determined whether the probability that the contents of the two segments occur together is greater than a predetermined threshold. For example, if the contents of the two segments co-occur most of the time, it can be determined that the content of the second voice is a supplement to the first voice.
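The two tests just described can be sketched as follows. `KNOWN_PHRASES` and `COOCCUR_PROB` are tiny made-up stand-ins for a real parser and real corpus statistics, so this is only an illustration of the decision logic:

```python
# Toy stand-ins: a real system would query a parser and corpus statistics.
KNOWN_PHRASES = {"Peking University west gate"}
COOCCUR_PROB = {("Peking University", "west gate"): 0.8,
                ("Peking University", "the weather"): 0.1}

def is_supplement(first: str, second: str, threshold: float = 0.5) -> bool:
    """Decide whether `second` semantically supplements `first`."""
    # Test 1: the concatenation parses as one complete, known phrase.
    if f"{first} {second}" in KNOWN_PHRASES:
        return True
    # Test 2: the two contents co-occur with probability above a threshold.
    return COOCCUR_PROB.get((first, second), 0.0) > threshold
```

With these toy tables, "west gate" is judged a supplement to "Peking University", while "the weather" is not, so the latter would start a fresh request rather than extend the earlier one.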
At block 606, a second reply is presented through the display device based on the third recognition result. For example, Table 5 below shows an example dialogue in which the recognition result of the first voice is supplemented.
Table 5

| Recognition result of the user's voice: | I want to go to Peking University |
| Chatbot's reply: | Looking up the navigation route to Peking University |
| Recognition result of the user's voice: | The west gate |
| Chatbot's reply: | Looking up the navigation route to the west gate of Peking University |
In the example of table 5, user's is intended to Peking University west gate, and speech recognition system is receiving " north
Identification is just had begun after capital university ", thus not can recognize that the accurate intention of user.In this case, user can lead to
Natural language dialogue supplemental information is crossed, the recognition result for then combining the two voices generates new recognition result " I will go to Beijing
University west gate " then generates corresponding reply based on new recognition result.In some embodiments, complete for not expressing
Recognition result, user can direct supplemental information, without wait recognition result and parsing result return, do not need to repeat yet
Previous conversation content.
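The supplementation flow of Table 5 can be sketched as follows. The `parse_intent` lookup is a hypothetical stand-in for the system's semantic parser, and simple string concatenation stands in for whatever combination strategy an implementation actually uses.

```python
# Sketch of blocks 604-606: combine two recognition results into a
# third one when the second supplements the first. The intent table
# is a hypothetical stand-in for a real semantic parser.

INTENTS = {
    "I want to go to Peking University west gate": "navigate:PKU west gate",
}


def parse_intent(text: str):
    """Return the parsed intent, or None if the text is incomplete."""
    return INTENTS.get(text)


def combine(first_result: str, second_result: str) -> str:
    """Generate the third recognition result from the two partial ones."""
    return f"{first_result} {second_result}"


first = "I want to go to Peking University"
second = "west gate"
assert parse_intent(first) is None  # the first voice alone is ambiguous
third = combine(first, second)      # third recognition result
print(parse_intent(third))          # navigate:PKU west gate
```

The point the example makes is the one in the description: neither partial result parses on its own, but the combined third result yields the user's precise intention without repeating the earlier dialogue.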
Fig. 7 shows a block diagram of a device 700 for processing a voice dialogue according to an embodiment of the present disclosure. As shown in Fig. 7, the device 700 includes a first providing module 710, a voice receiving module 720, and a second providing module 730. The first providing module 710 is configured to, in response to receiving a first voice from a user, provide a first reply for the first voice, wherein the first reply is generated based on a first recognition result of the first voice. The voice receiving module 720 is configured to receive a second voice from the user, wherein the second voice is used to correct or supplement the first recognition result. The second providing module 730 is configured to generate, based on the first voice and the second voice, a second reply to be provided to the user, wherein the second reply better matches the user's intention than the first reply.
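Under the assumption that each module can be modeled as a simple callable object, the structure of device 700 might be sketched as below. Recognition and reply generation are stubbed out; a real device would invoke an ASR engine and a dialogue policy, so all class and method names here are illustrative.

```python
# Minimal sketch of device 700 and its three modules. The bodies are
# stubs: identity for recognition, string formatting for replies.


class FirstProvideModule:          # module 710
    def __call__(self, first_voice: str) -> str:
        first_result = first_voice  # stub for speech recognition
        return f"reply to: {first_result}"


class VoiceReceiveModule:          # module 720
    def __call__(self, second_voice: str) -> str:
        return second_voice        # second voice corrects/supplements


class SecondProvideModule:         # module 730
    def __call__(self, first_voice: str, second_voice: str) -> str:
        combined = f"{first_voice} {second_voice}"
        return f"reply to: {combined}"


class Device700:
    def __init__(self):
        self.provide_first = FirstProvideModule()
        self.receive = VoiceReceiveModule()
        self.provide_second = SecondProvideModule()


dev = Device700()
print(dev.provide_first("Peking University"))                # first reply
print(dev.provide_second("Peking University", "west gate"))  # second reply
```

The second reply is produced from both voices, mirroring how module 730 refines the intention that module 710 alone could not resolve.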
In some embodiments, the first providing module 710 includes a first presenting module, which is configured to present the first recognition result and the first reply through a display device.
In some embodiments, the second providing module 730 includes a first correcting module, which is configured to use a second recognition result of the second voice to correct a text recognition error in the first recognition result.
In some embodiments, the second providing module 730 includes a second correcting module, which is configured to use one or more digits in a second recognition result of the second voice to correct a digit recognition error in the first recognition result.
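The digit-correction behavior of the second correcting module can be sketched as follows. The alignment strategy (replace the first run of digits wholesale with the digits from the second voice) is an illustrative assumption, since the disclosure does not fix a particular correction algorithm, and the phone numbers are made up.

```python
import re

# Sketch of correcting a digit recognition error in the first
# recognition result with digits taken from the second voice.
# Replacing the first digit run wholesale is an assumption.


def correct_digits(first_result: str, second_result: str) -> str:
    """Replace the digit sequence in first_result with the digits
    the user spoke in the second voice."""
    new_digits = "".join(re.findall(r"\d+", second_result))
    if not new_digits:
        return first_result  # nothing to correct with
    return re.sub(r"\d+", new_digits, first_result, count=1)


first = "call 13802145678"           # misrecognized number
second = "no, it is 13802145679"     # user's spoken correction
print(correct_digits(first, second))  # call 13802145679
```

A production system would more likely align the two digit strings and patch only the differing positions, but the input/output contract is the one the module describes.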
In some embodiments, the second voice is used to supplement the first recognition result, and the second providing module 730 includes: a combining module, configured to, in response to determining that the content of the second voice semantically supplements the content of the first voice, combine the first recognition result with a second recognition result of the second voice to generate a third recognition result; and a second presenting module, configured to present the second reply through the display device based on the third recognition result.
In some embodiments, the first providing module 710 includes a first voice providing module, which is configured to provide the first reply by voice only, wherein the first reply includes a digit sequence from the first recognition result.
In some embodiments, the second voice is used to correct the digit sequence, and the second providing module 730 includes: a third correcting module, configured to use one or more digits in a second recognition result of the second voice to correct the digit sequence; and a second voice providing module, configured to provide the second reply by voice based on the corrected digit sequence.
In some embodiments, the device 700 further includes an action executing module, which is configured to, in response to the second reply having identified the user's intention, execute an action associated with the second reply.
In some embodiments, the device 700 further includes: a third voice receiving module, configured to, in response to the second reply still failing to identify the user's intention, receive a third voice from the user; and a third providing module, configured to provide a third reply to the user based at least in part on the third voice.
It should be appreciated that the first providing module 710, the voice receiving module 720, and the second providing module 730 shown in Fig. 7 can be included in the chat robot 120 described with reference to Fig. 1. Furthermore, it should be understood that the modules shown in Fig. 7 can perform steps or actions in the methods or processes described with reference to embodiments of the present disclosure.
Fig. 8 shows a schematic block diagram of an example device 800 that can be used to implement embodiments of the present disclosure. It should be understood that the device 800 can be used to implement the device 700 for processing a voice dialogue or the chat robot 120 described in the present disclosure. As shown, the device 800 includes a central processing unit (CPU) 801, which can perform various appropriate actions and processing according to computer program instructions stored in a read-only memory (ROM) 802 or computer program instructions loaded from a storage unit 808 into a random access memory (RAM) 803. The RAM 803 can also store various programs and data required for the operation of the device 800. The CPU 801, the ROM 802, and the RAM 803 are connected to one another through a bus 804. An input/output (I/O) interface 805 is also connected to the bus 804.
A plurality of components in the device 800 are connected to the I/O interface 805, including: an input unit 806, such as a keyboard or a mouse; an output unit 807, such as various types of displays and loudspeakers; a storage unit 808, such as a magnetic disk or an optical disc; and a communication unit 809, such as a network card, a modem, or a wireless communication transceiver. The communication unit 809 allows the device 800 to exchange information/data with other devices through a computer network such as the Internet and/or various telecommunication networks.
The processing unit 801 performs the methods and processes described above, such as the method 200, the process 300, the method 400, the method 500, and the method 600. For example, in some embodiments, the method 200, the process 300, the method 400, the method 500, and the method 600 can be implemented as a computer software program that is tangibly embodied in a machine-readable medium, such as the storage unit 808. In some embodiments, part or all of the computer program can be loaded into and/or installed on the device 800 via the ROM 802 and/or the communication unit 809. When the computer program is loaded into the RAM 803 and executed by the CPU 801, one or more actions or steps of the method 200, the process 300, the method 400, the method 500, and the method 600 described above can be performed. Alternatively, in other embodiments, the CPU 801 can be configured to perform the method 200, the process 300, the method 400, the method 500, and the method 600 by any other appropriate means (for example, by means of firmware).
The functions described herein can be performed, at least in part, by one or more hardware logic components. For example, and without limitation, exemplary types of hardware logic components that can be used include: field-programmable gate arrays (FPGA), application-specific integrated circuits (ASIC), application-specific standard products (ASSP), systems on chip (SOC), complex programmable logic devices (CPLD), and so on.
Program code for implementing the methods of the present disclosure can be written in any combination of one or more programming languages. The program code can be provided to a processor or controller of a general-purpose computer, a special-purpose computer, or another programmable data processing apparatus, such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowcharts and/or block diagrams to be implemented. The program code can be executed entirely on a machine, partly on a machine, partly on a machine and partly on a remote machine as a stand-alone software package, or entirely on a remote machine or server.
In the context of the present disclosure, a machine-readable medium can be a tangible medium that can contain or store a program for use by, or in connection with, an instruction execution system, apparatus, or device. The machine-readable medium can be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium can include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
In addition, although the actions or steps are depicted in a particular order, this should not be understood as requiring that such actions or steps be performed in the particular order shown or in sequential order, or that all illustrated actions or steps be performed, to achieve desirable results. In certain circumstances, multitasking and parallel processing may be advantageous. Similarly, although the above discussion contains several specific implementation details, these should not be construed as limitations on the scope of the present disclosure. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single implementation. Conversely, various features that are described in the context of a single implementation can also be implemented in multiple implementations separately or in any suitable subcombination.
Although the present disclosure has been described in language specific to structural features and/or methodological logical actions, it should be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or actions described above. Rather, the specific features and actions described above are merely exemplary forms of implementing the claims.
Claims (20)
1. A method for processing a voice dialogue, comprising:
in response to receiving a first voice from a user, providing a first reply for the first voice, the first reply being generated based on a first recognition result of the first voice;
receiving a second voice from the user, the second voice being used to correct or supplement the first recognition result; and
generating, based on the first voice and the second voice, a second reply to be provided to the user, the second reply better matching an intention of the user than the first reply.
2. The method according to claim 1, wherein providing the first reply for the first voice comprises:
presenting the first recognition result and the first reply through a display device.
3. The method according to claim 2, wherein generating the second reply comprises:
using a second recognition result of the second voice to correct a text recognition error in the first recognition result.
4. The method according to claim 2, wherein generating the second reply comprises:
using one or more digits in a second recognition result of the second voice to correct a digit recognition error in the first recognition result.
5. The method according to claim 2, wherein generating the second reply comprises:
in response to determining that the content of the second voice semantically supplements the content of the first voice, combining the first recognition result with a second recognition result of the second voice to generate a third recognition result; and
presenting the second reply through the display device based on the third recognition result.
6. The method according to claim 1, wherein providing the first reply for the first voice comprises:
providing the first reply by voice, the first reply comprising a digit sequence from the first recognition result.
7. The method according to claim 6, wherein generating the second reply comprises:
correcting the digit sequence using one or more digits in a second recognition result of the second voice; and
providing the second reply by voice based on the corrected digit sequence.
8. The method according to any one of claims 1-7, further comprising:
in response to the second reply having identified the intention of the user, executing an action associated with the second reply.
9. The method according to any one of claims 1-7, further comprising:
in response to the second reply still failing to identify the intention of the user, receiving a third voice from the user; and
providing a third reply to the user based at least in part on the third voice.
10. A device for processing a voice dialogue, comprising:
a first providing module, configured to, in response to receiving a first voice from a user, provide a first reply for the first voice, the first reply being generated based on a first recognition result of the first voice;
a voice receiving module, configured to receive a second voice from the user, the second voice being used to correct or supplement the first recognition result; and
a second providing module, configured to generate, based on the first voice and the second voice, a second reply to be provided to the user, the second reply better matching an intention of the user than the first reply.
11. The device according to claim 10, wherein the first providing module comprises:
a first presenting module, configured to present the first recognition result and the first reply through a display device.
12. The device according to claim 11, wherein the second providing module comprises:
a first correcting module, configured to use a second recognition result of the second voice to correct a text recognition error in the first recognition result.
13. The device according to claim 11, wherein the second providing module comprises:
a second correcting module, configured to use one or more digits in a second recognition result of the second voice to correct a digit recognition error in the first recognition result.
14. The device according to claim 11, wherein the second providing module comprises:
a combining module, configured to, in response to determining that the content of the second voice semantically supplements the content of the first voice, combine the first recognition result with a second recognition result of the second voice to generate a third recognition result; and
a second presenting module, configured to present the second reply through the display device based on the third recognition result.
15. The device according to claim 10, wherein the first providing module comprises:
a first voice providing module, configured to provide the first reply by voice, the first reply comprising a digit sequence from the first recognition result.
16. The device according to claim 15, wherein the second providing module comprises:
a third correcting module, configured to use one or more digits in a second recognition result of the second voice to correct the digit sequence; and
a second voice providing module, configured to provide the second reply by voice based on the corrected digit sequence.
17. The device according to any one of claims 10-16, further comprising:
an action executing module, configured to, in response to the second reply having identified the intention of the user, execute an action associated with the second reply.
18. The device according to any one of claims 10-16, further comprising:
a third voice receiving module, configured to, in response to the second reply still failing to identify the intention of the user, receive a third voice from the user; and
a third providing module, configured to provide a third reply to the user based at least in part on the third voice.
19. An electronic device, comprising:
one or more processors; and
a storage device for storing one or more programs, wherein the one or more programs, when executed by the one or more processors, cause the electronic device to implement the method according to any one of claims 1-9.
20. A computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the method according to any one of claims 1-9.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810541680.6A CN108877792B (en) | 2018-05-30 | 2018-05-30 | Method, apparatus, electronic device and computer readable storage medium for processing voice conversations |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108877792A true CN108877792A (en) | 2018-11-23 |
CN108877792B CN108877792B (en) | 2023-10-24 |
Family
ID=64335845
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810541680.6A Active CN108877792B (en) | 2018-05-30 | 2018-05-30 | Method, apparatus, electronic device and computer readable storage medium for processing voice conversations |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108877792B (en) |
Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2012037820A (en) * | 2010-08-11 | 2012-02-23 | Murata Mach Ltd | Voice recognition apparatus, voice recognition apparatus for picking, and voice recognition method |
CN105094315A (en) * | 2015-06-25 | 2015-11-25 | 百度在线网络技术(北京)有限公司 | Method and apparatus for smart man-machine chat based on artificial intelligence |
CN105468582A (en) * | 2015-11-18 | 2016-04-06 | 苏州思必驰信息科技有限公司 | Method and device for correcting numeric string based on human-computer interaction |
CN106710592A (en) * | 2016-12-29 | 2017-05-24 | 北京奇虎科技有限公司 | Speech recognition error correction method and speech recognition error correction device used for intelligent hardware equipment |
US9728188B1 (en) * | 2016-06-28 | 2017-08-08 | Amazon Technologies, Inc. | Methods and devices for ignoring similar audio being received by a system |
CN107045496A (en) * | 2017-04-19 | 2017-08-15 | 畅捷通信息技术股份有限公司 | The error correction method and error correction device of text after speech recognition |
CN107305483A (en) * | 2016-04-25 | 2017-10-31 | 北京搜狗科技发展有限公司 | A kind of voice interactive method and device based on semantics recognition |
JP2018004976A (en) * | 2016-07-04 | 2018-01-11 | パナソニック インテレクチュアル プロパティ コーポレーション オブ アメリカPanasonic Intellectual Property Corporation of America | Voice interactive method, voice interactive device and voice interactive program |
CN107870977A (en) * | 2016-09-27 | 2018-04-03 | 谷歌公司 | Chat robots output is formed based on User Status |
CN107943914A (en) * | 2017-11-20 | 2018-04-20 | 渡鸦科技(北京)有限责任公司 | Voice information processing method and device |
Non-Patent Citations (1)
Title |
---|
盛世流光中: "苹果Siri对比三星Bixby,语音助手都成精了", 《爱奇艺》, 23 November 2017 (2017-11-23), pages 19 *
Cited By (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109712616B (en) * | 2018-11-29 | 2023-11-14 | 平安科技(深圳)有限公司 | Telephone number error correction method and device based on data processing and computer equipment |
CN109712616A (en) * | 2018-11-29 | 2019-05-03 | 平安科技(深圳)有限公司 | Telephone number error correction method, device and computer equipment based on data processing |
CN109922371A (en) * | 2019-03-11 | 2019-06-21 | 青岛海信电器股份有限公司 | Natural language processing method, equipment and storage medium |
CN109922371B (en) * | 2019-03-11 | 2021-07-09 | 海信视像科技股份有限公司 | Natural language processing method, apparatus and storage medium |
CN110223694B (en) * | 2019-06-26 | 2021-10-15 | 百度在线网络技术(北京)有限公司 | Voice processing method, system and device |
CN110223694A (en) * | 2019-06-26 | 2019-09-10 | 百度在线网络技术(北京)有限公司 | Method of speech processing, system and device |
CN110299152A (en) * | 2019-06-28 | 2019-10-01 | 北京猎户星空科技有限公司 | Interactive output control method, device, electronic equipment and storage medium |
CN110347815A (en) * | 2019-07-11 | 2019-10-18 | 上海蔚来汽车有限公司 | Multi-task processing method and multitasking system in speech dialogue system |
CN110738997B (en) * | 2019-10-25 | 2022-06-17 | 百度在线网络技术(北京)有限公司 | Information correction method and device, electronic equipment and storage medium |
CN110738997A (en) * | 2019-10-25 | 2020-01-31 | 百度在线网络技术(北京)有限公司 | information correction method, device, electronic equipment and storage medium |
CN112002321A (en) * | 2020-08-11 | 2020-11-27 | 海信电子科技(武汉)有限公司 | Display device, server and voice interaction method |
CN112002321B (en) * | 2020-08-11 | 2023-09-19 | 海信电子科技(武汉)有限公司 | Display device, server and voice interaction method |
WO2023040658A1 (en) * | 2021-09-18 | 2023-03-23 | 华为技术有限公司 | Speech interaction method and electronic device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||