CN113779208A - Method and device for man-machine conversation - Google Patents

Method and device for man-machine conversation

Info

Publication number
CN113779208A
Authority
CN
China
Prior art keywords
conversation
intention
user
interrupt
voice
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011549879.7A
Other languages
Chinese (zh)
Inventor
刘亚龙
蔡玉玉
吴俊仪
黄善洛
王佳
杨帆
丁国宏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Huijun Technology Co ltd
Original Assignee
Beijing Huijun Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Huijun Technology Co ltd filed Critical Beijing Huijun Technology Co ltd
Priority to CN202011549879.7A
Publication of CN113779208A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G06F16/3329 Natural language query formulation or dialogue systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3343 Query execution using phonetics
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L15/222 Barge in, i.e. overridable guidance for interrupting prompts

Abstract

The embodiment of the disclosure discloses a method and a device for man-machine conversation, relating to the technical field of artificial intelligence. One embodiment of the method comprises: in response to receiving speech input by the user while a first guide script is playing, performing semantic recognition on the speech to obtain the user's first intention; determining whether the first intention is to interrupt the conversation; if the first intention is to interrupt the conversation, stopping playback of the first guide script; continuing semantic recognition of the user's speech until the end of the speech is detected, obtaining a second intention; and playing a second guide script according to the second intention. This embodiment addresses the inaccurate or delayed barge-in that frequently occurs in the intelligent customer service field and its application directions.

Description

Method and device for man-machine conversation
Technical Field
The embodiment of the disclosure relates to the technical field of artificial intelligence, in particular to a method and a device for man-machine conversation.
Background
Generally, in an intelligent voice conversation, one party is a real user and the other party is a voice robot. The voice robot broadcasts a guide script through speech synthesis and detects the user's input through speech recognition; speech synthesis and speech recognition run simultaneously, and when speech recognition detects voice input, the TTS (Text To Speech) playback is interrupted, producing a barge-in effect.
The prior art mainly triggers barge-in from acoustic features: once valid speech is detected, the user is assumed to have an intention to speak. The drawback is that although the user has certainly made an utterance, the intention to interrupt is not necessarily genuine; the utterance may be a backchannel word or a false barge-in caused by noise. Moreover, acoustic detection must accumulate a certain amount of audio before features can be extracted, which introduces a large delay. The prior art therefore faces a contradiction between barge-in accuracy and efficiency: improving efficiency produces more false barge-ins and lowers accuracy, while improving accuracy incurs greater delay.
Disclosure of Invention
The embodiment of the disclosure provides a method and a device for man-machine conversation.
In a first aspect, an embodiment of the present disclosure provides a method for human-machine conversation, including: in response to receiving speech input by the user while a first guide script is playing, performing semantic recognition on the speech to obtain the user's first intention; determining whether the first intention is to interrupt the conversation; if the first intention is to interrupt the conversation, stopping playback of the first guide script; continuing semantic recognition of the user's speech until the end of the speech is detected, obtaining a second intention; and playing a second guide script according to the second intention.
In some embodiments, performing semantic recognition on the speech to obtain the user's first intention includes: performing speech recognition on the speech to obtain a partial text; and performing semantic recognition on the partial text to obtain the first intention.
In some embodiments, determining whether the first intention is to interrupt the conversation includes: determining the relevance between the played first guide script and the partial text; if the relevance is greater than a predetermined threshold, the first intention is to interrupt the conversation.
In some embodiments, determining whether the first intention is to interrupt the conversation includes: determining whether the played first guide script includes the partial text; if so, the first intention is to interrupt the conversation.
In some embodiments, before performing semantic recognition on the speech to obtain the user's first intention, the method further comprises: if a predetermined condition is met, stopping playback of the first guide script before semantic recognition is performed.
In some embodiments, the method further comprises: if the first intention is not to interrupt the conversation, continuing to play the first guide script.
In some embodiments, the method further comprises: if the first intention is not to interrupt the conversation, replaying the first guide script.
In a second aspect, an embodiment of the present disclosure provides an apparatus for human-machine conversation, including: a first recognition unit configured to, in response to receiving speech input by the user while a first guide script is playing, perform semantic recognition on the speech to obtain the user's first intention; a determination unit configured to determine whether the first intention is to interrupt the conversation; an interruption unit configured to stop playback of the first guide script if the first intention is to interrupt the conversation; a second recognition unit configured to continue semantic recognition of the user's speech until the end of the speech is detected, obtaining a second intention; and a playing unit configured to play a second guide script according to the second intention.
In some embodiments, the first recognition unit is further configured to: perform speech recognition on the speech to obtain a partial text; and perform semantic recognition on the partial text to obtain the first intention.
In some embodiments, the determination unit is further configured to: determine the relevance between the played first guide script and the partial text; if the relevance is greater than a predetermined threshold, the first intention is to interrupt the conversation.
In some embodiments, the determination unit is further configured to: determine whether the played first guide script includes the partial text; if so, the first intention is to interrupt the conversation.
In some embodiments, the first recognition unit is further configured to: if a predetermined condition is met, stop playback of the first guide script before semantic recognition is performed.
In some embodiments, the playing unit is further configured to: if the first intention is not to interrupt the conversation, continue playing the first guide script.
In some embodiments, the playing unit is further configured to: if the first intention is not to interrupt the conversation, replay the first guide script.
In a third aspect, an embodiment of the present disclosure provides an electronic device for human-machine conversation, including: one or more processors; and a storage device having one or more programs stored thereon which, when executed by the one or more processors, cause the one or more processors to implement the method as described in any implementation of the first aspect.
In a fourth aspect, an embodiment of the present disclosure provides a computer readable medium having a computer program stored thereon, wherein the program, when executed by a processor, implements the method as described in any implementation of the first aspect.
The embodiment of the disclosure provides a method and a device for man-machine conversation, aiming to solve the inaccurate or delayed barge-in that frequently occurs in the intelligent customer service field and its application directions. Compared with the prior art, the method and the device solve the false barge-in problem by adding semantic detection. Combining semantic analysis with context analysis reveals whether the current speech recognition result is strongly related to the context, so the interrupt intention can be judged before a barge-in is triggered, avoiding false barge-ins. Furthermore, by combining acoustic and semantic analysis, when valid speech input is detected but the intention has not yet been determined and an interruption is merely possible, a tentative barge-in is attempted, and semantic detection continues on the subsequent speech recognition results. If an interrupt intention is found, the barge-in stands; if no interrupt intention is detected, the barge-in is treated as a false barge-in and the guide script is restored.
Drawings
Other features, objects and advantages of the disclosure will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:
FIG. 1 is an exemplary system architecture diagram in which one embodiment of the present disclosure may be applied;
FIG. 2 is a flow diagram of one embodiment of a method for human-machine conversation according to the present disclosure;
FIG. 3 is a flow diagram of yet another embodiment of a method for human-machine conversation in accordance with the present disclosure;
FIG. 4 is a schematic illustration of an application scenario of a method for human-machine dialog according to the present disclosure;
FIG. 5 is a schematic block diagram illustrating one embodiment of an apparatus for man-machine conversation according to the present disclosure;
FIG. 6 is a schematic block diagram of a computer system suitable for use with an electronic device implementing embodiments of the present disclosure.
Detailed Description
The present disclosure is described in further detail below with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the related invention are shown in the drawings.
It should be noted that, in the present disclosure, the embodiments and features of the embodiments may be combined with each other without conflict. The present disclosure will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
Fig. 1 shows an exemplary system architecture to which embodiments of the disclosed method for human-machine conversation or apparatus for human-machine conversation may be applied.
As shown in fig. 1, the system as a whole includes a user 101, a terminal 102, a dialogue management server 103, a speech synthesis engine 104, a speech recognition engine 105, and a semantic recognition engine 106. The dialogue management server may be a stand-alone server or may be integrated in the semantic recognition engine. The terminal 102 is installed with an intelligent media service application, which interfaces the user side with the back-end engines, systematically coordinates the work of the back-end engines, and also serves as the controller of the barge-in function. Some existing protocol frameworks can implement the intelligent media service; for example, a service based on the MRCP (Media Resource Control Protocol) standard may externally interface a telephone system, internally integrate the various speech and semantic engines, coordinate them to implement the barge-in function, and provide an intelligent voice dialog system over a telephone network.
The terminal 102 may be hardware or software. When the terminal 102 is hardware, it may be any of various electronic devices supporting voice input, including but not limited to a smart speaker, a smart phone, a tablet computer, an e-book reader, an MP3 player (Moving Picture Experts Group Audio Layer III), an MP4 player (Moving Picture Experts Group Audio Layer IV), a laptop computer, a desktop computer, and the like. When the terminal 102 is software, it can be installed in the electronic devices listed above. It may be implemented as multiple pieces of software or software modules (e.g., to provide distributed services) or as a single piece of software or software module, and is not specifically limited herein.
The dialogue management server 103, the speech synthesis engine 104, the speech recognition engine 105, and the semantic recognition engine 106 may be integrated in one server or may be physically separated from each other.
It should be noted that the method for man-machine conversation provided by the embodiments of the present disclosure is generally executed by the terminal 102, and accordingly, the apparatus for man-machine conversation is generally disposed in the terminal 102.
It should be understood that the numbers of the terminal 102, the dialogue management server 103, the speech synthesis engine 104, the speech recognition engine 105, and the semantic recognition engine 106 in fig. 1 are merely illustrative. There may be any number of terminals 102, dialog management servers 103, speech synthesis engines 104, speech recognition engines 105, semantic recognition engines 106, as desired for the implementation.
With continued reference to FIG. 2, a flow 200 of one embodiment of a method for human-machine conversation in accordance with the present disclosure is shown. The method for man-machine conversation comprises the following steps:
Step 201: in response to receiving speech input by the user while the first guide script is playing, perform semantic recognition on the speech to obtain the user's first intention.
In this embodiment, the executing entity of the method for human-machine conversation (for example, the terminal shown in fig. 1) may play the first guide script (provided by the dialogue management server) through the speaker, and the user may make a sound before the terminal has finished playing. The user may genuinely want to express something, may merely be uttering a backchannel word, or the sound may be noise; semantic recognition of the speech is therefore required to obtain the user's first intention. The intention can be determined by directly matching against keywords built into the terminal. For example, if the user says "right", the word matches a preset backchannel word library, and the user's intention can be determined to be mere acknowledgment rather than interrupting the conversation. Audio for some backchannel words can also be preset and matched at the phoneme level directly against the user's speech input to determine whether the utterance is a backchannel.
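As a non-limiting illustration of the backchannel check described above, the following is a minimal sketch (the word list, normalization, and function name are assumptions, not taken from the patent) of filtering out mere acknowledgments before any interrupt decision is made:

```python
# Hypothetical backchannel word library; a deployment would load its own list.
BACKCHANNEL_WORDS = {"right", "yes", "yeah", "uh-huh", "okay", "mm-hmm"}

def is_backchannel(partial_text: str) -> bool:
    """Return True when the recognized fragment matches the preset
    backchannel library and should not be treated as an interrupt."""
    normalized = partial_text.strip().lower().strip(".,!?")
    return normalized in BACKCHANNEL_WORDS
```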
In some optional implementations of this embodiment, performing semantic recognition on the speech to obtain the user's first intention includes: performing speech recognition on the speech to obtain a partial text; and performing semantic recognition on the partial text to obtain the first intention. The speech may be sent to a speech recognition engine, which returns a partial text (since the user's speech has not yet finished). The partial text is then sent to a semantic recognition engine to obtain the first intention. The semantic recognition engine may be a neural network model, such as a binary classifier. The training samples are sentences with interrupt labels: for example, sentences that interrupt a conversation are set as positive samples, and sentences that do not are set as negative samples. The binary classifier can then determine whether the user's sentence is intended to interrupt the conversation. The semantic recognition engine may also be a reading comprehension model, which derives the answer, namely the first intention, from the user's sentence.
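To make the classifier variant concrete, here is a minimal sketch of such a binary interrupt-intent classifier. The patent does not fix a model; a TF-IDF plus logistic regression pipeline (scikit-learn) stands in for the neural network here, and the tiny training set is purely illustrative:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Illustrative labeled samples: 1 = interrupts the conversation, 0 = does not.
sentences = ["what certificate", "wait a moment", "listen to me",
             "right", "uh-huh", "okay"]
labels = [1, 1, 1, 0, 0, 0]

intent_clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
intent_clf.fit(sentences, labels)

def is_interrupt_intent(partial_text: str) -> bool:
    """Classify a partial ASR result as interrupt / non-interrupt."""
    return bool(intent_clf.predict([partial_text])[0])
```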
Step 202: determine whether the first intention is to interrupt the conversation.
In this embodiment, the intention identified in step 201 can be directly judged as interrupting or not interrupting the conversation; for example, if the user's sentence is "interrupt", the intention obtained is directly to interrupt the conversation. The first intention may also indirectly amount to interrupting the conversation, e.g., "listen to me", "wait a moment", or "I didn't hear that", all of which are intended to make the machine pause rather than continue speaking. A classifier can divide intentions into two categories: interrupting the conversation and not interrupting the conversation.
In some optional implementations of this embodiment, determining whether the first intention is to interrupt the conversation includes: determining the relevance between the played first guide script and the partial text; if the relevance is greater than a predetermined threshold, the first intention is to interrupt the conversation. The similarity between the partial text and the first guide script can be calculated either with a direct similarity calculation method or by judging semantic similarity with a semantic matching model; relevance is measured by similarity. If the relevance is greater than the predetermined threshold, this indicates the input is not a backchannel word or noise but a genuine interruption of the conversation.
In some optional implementations of this embodiment, determining whether the first intention is to interrupt the conversation includes: determining whether the played first guide script includes the partial text; if so, the first intention is to interrupt the conversation. Keywords can be extracted from the partial text and matched against the first guide script; a successful match indicates the user is asking back about the script. For example, when the machine says "please provide certificate information and ..." and the user inputs "what certificate", the keyword is "certificate", which matches the first guide script, so the user's intention is to interrupt the conversation.
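A minimal sketch of these two checks follows; the similarity measure, threshold value, and keyword rule are placeholders (the patent leaves all three unspecified), with difflib standing in for a semantic matching model:

```python
from difflib import SequenceMatcher

RELEVANCE_THRESHOLD = 0.3  # assumed value; only "predetermined" in the patent

def relevance(guide_script: str, partial_text: str) -> float:
    # Character-level similarity as a stand-in for a semantic matcher.
    return SequenceMatcher(None, guide_script, partial_text).ratio()

def intends_to_interrupt(guide_script: str, partial_text: str) -> bool:
    # Check 1: relevance between script and partial text above threshold.
    if relevance(guide_script, partial_text) > RELEVANCE_THRESHOLD:
        return True
    # Check 2: a keyword from the partial text occurs in the guide script,
    # e.g. "certificate" in "please provide certificate information and ...".
    keywords = [w for w in partial_text.split() if len(w) > 3]
    return any(kw in guide_script for kw in keywords)
```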
Step 203: if the first intention is to interrupt the conversation, stop playing the first guide script.
In this embodiment, if the user's intention is to interrupt the conversation, playback of the first guide script is stopped, and a new round of conversation starts after the user's voice input is completed.
Step 204: continue semantic recognition of the speech input by the user until the end of the speech is detected, obtaining a second intention.
In this embodiment, a voice boundary can be detected through voice endpoint detection to determine that the user's voice input has ended. During this process the speech is continuously sent to the speech recognition engine, finally yielding a complete text. The complete text is sent to the semantic recognition engine for semantic recognition to obtain the second intention. The semantic recognition engine may also be a reading comprehension model, which derives the answer, namely the second intention, from the user's sentence.
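The loop below sketches this continue-until-endpoint behavior; asr_engine and nlu_engine are assumed streaming interfaces invented for illustration, not a real engine API:

```python
def recognize_until_endpoint(audio_frames, asr_engine, nlu_engine):
    """Feed audio to the recognizer until voice endpoint detection fires,
    then run semantic recognition on the complete text to obtain the
    second intention."""
    full_text = ""
    for frame in audio_frames:
        partial = asr_engine.feed(frame)    # latest partial text, if any
        if partial:
            full_text = partial
        if asr_engine.endpoint_detected():  # VAD found the voice boundary
            break
    return nlu_engine.intent(full_text)     # the second intention
```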
Step 205: play the second guide script according to the second intention.
In this embodiment, dialog management is performed according to the second intention, and the corresponding second guide script is determined. For example, while the first guide script "please provide certificate information and ..." is playing, the user inputs "what certificate"; the system determines that the user has interrupted the first guide script, recognizes that the user wants to know which certificates are acceptable, and plays the second guide script "acceptable certificates include an identity card, passport, or military officer certificate ...".
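As an illustration of this dialog-management step, a lookup table can map the recognized second intention to its guide script; the intent names, script texts, and callback below are hypothetical:

```python
# Hypothetical intention-to-script table standing in for dialog management.
GUIDE_SCRIPTS = {
    "ask_certificate_types": "Acceptable certificates include an identity "
                             "card, passport, or military officer certificate.",
    "ask_schedule_time": "Please tell me a convenient date and time.",
}

def play_second_guide_script(second_intention: str, tts_play) -> None:
    """Pick the script for the intention and hand it to speech synthesis;
    tts_play is an assumed text-to-speech callback."""
    script = GUIDE_SCRIPTS.get(second_intention,
                               "Sorry, could you repeat that?")
    tts_play(script)
```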
The method provided by the above embodiment of the present disclosure is accuracy-first: the semantic recognition engine analyzes the collected partial (partial text) results, does not trigger a barge-in while the partial results are few or no sufficiently definite intention has been obtained, and triggers a barge-in only when enough partial results have been collected or the intention has been confirmed by comparison. Thereafter the speech and semantic recognition engines continue working until the user finishes speaking, and a corresponding answer or guide script is invoked according to the recognized semantics for a new round of conversation.
With continued reference to FIG. 3, a flow 300 of yet another embodiment of a method for human-machine conversation in accordance with the present disclosure is shown. The method for man-machine conversation comprises the following steps:
Step 301: in response to receiving speech input by the user while the first guide script is playing, stop playing the first guide script.
In this embodiment, the executing entity of the method for man-machine conversation (for example, the terminal shown in fig. 1) may play the first guide script through the speaker, and the user may make a sound before the terminal has finished playing. Without judging whether the sound is a backchannel, noise, or a genuine interrupt intention, playback of the first guide script is stopped immediately.
The method of the present embodiment is adopted when a predetermined condition is met. For example, if the user selects the efficiency-first man-machine conversation mode, or if the latency of the man-machine conversation application exceeds a preset duration, or the user complains, the system can automatically switch to the efficiency-first man-machine conversation mode.
Step 302: perform semantic recognition on the speech to obtain the user's first intention.
Step 302 is substantially the same as step 201 and is therefore not described again.
Step 303: determine whether the first intention is to interrupt the conversation.
Step 303 is substantially the same as step 202 and is therefore not described again.
Step 304: if the first intention is to interrupt the conversation, continue semantic recognition of the speech input by the user until the end of the speech is detected to obtain a second intention, and play the second guide script according to the second intention.
This step is substantially the same as steps 204-205.
Step 305: if the first intention is not to interrupt the conversation, resume or replay the first guide script.
In this embodiment, the difference from the flow 200 is that the first guide script stops playing before semantic recognition; if the first intention turns out not to be an interruption, the stop was a false positive, and the first guide script needs to be resumed or replayed. Whether to resume or replay is chosen according to the interrupted duration: if the interruption exceeds a predetermined threshold, the first guide script is replayed from the beginning; otherwise playback can resume from the previously played content. For completeness of the statement, playback may back up to an earlier word boundary. For example, the first guide script is "please provide identity information and schedule a time"; the user is detected making a sound just as "please provide identity information" has been played, and playback is interrupted. When the user's intention turns out not to be an interruption, playback continues, but starting from "and" would be unsuitable, so playback can restart from the beginning or from "identity information", ensuring the user receives complete information.
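A minimal sketch of this resume-or-replay decision is given below; the time threshold and the two-word backup are assumptions chosen for illustration:

```python
import time

REPLAY_THRESHOLD_S = 2.0  # assumed cutoff; the patent only says "predetermined"

def resume_or_replay(script_words, stopped_at_word, interrupted_since):
    """Restore a falsely interrupted guide script.
    script_words: the guide script split into words;
    stopped_at_word: index of the word where playback stopped;
    interrupted_since: monotonic timestamp when playback was stopped."""
    if time.monotonic() - interrupted_since > REPLAY_THRESHOLD_S:
        return " ".join(script_words)            # replay from the beginning
    # Resume, backing up a couple of words so the user never hears a
    # dangling fragment such as "and" (e.g. restart at "identity").
    resume_from = max(0, stopped_at_word - 2)
    return " ".join(script_words[resume_from:])
```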
Alternatively, the user's conversation-interrupting behavior in man-machine conversations can be counted to draw a user portrait reflecting whether the user tends to interrupt. If speech is detected many times but the user's intention is never to interrupt, the user can be considered not to favor interrupting, and the man-machine conversation mode is switched to the accuracy-first manner of the flow 200.
As can be seen from fig. 3, the flow 300 of the method for human-machine conversation in this embodiment emphasizes efficiency, compared with the accuracy-first embodiment corresponding to fig. 2. This efficiency-first approach may be employed for users who tend to interrupt conversations.
The barge-in function is applied to an intelligent voice conversation scene in which one party is a real user and the other is a voice robot; the voice robot converses with the user by combining speech synthesis, speech recognition, semantic recognition, and related technologies. Speech synthesis serves as the mouth of the voice robot, converting the designed script text into speech expressed to the user; speech recognition serves as its ears, monitoring the user's voice input in real time and converting the user's expression into text; semantic recognition serves as its brain, understanding the user's expression in real time and reacting accordingly. The barge-in function means that when the voice robot is delivering a script and detects the user trying to cut in, it stops its own speech, listens attentively to the user's expression, and gives a correct response according to the semantics the user expresses.
The barge-in function is a key technology for building a user-centered intelligent voice conversation robot. Completing barge-in in conjunction with semantics makes the interruption more accurate.
As shown in fig. 4, the axis represents the direction of real-time speech, and each square represents a time slice. The barge-in function and the speech recognition function are started together in combination with VAD (Voice Activity Detection), and when the speech recognition engine detects and returns a partial result, barge-in is implemented with one of two strategies:
one is correctly prioritized (see flow 200), indicated in figure 4 in the upper half of the number axis. The semantic recognition engine analyzes according to the collected partial results, does not trigger interruption before the partial results are few or no enough positive intention is obtained (in the figure, the intention of the user cannot be recognized before 0.6-0.8 seconds, so the dialogue is not interrupted, and the bar in by partial), and triggers interruption when enough partial results are collected or the intention is determined by comparison (in the figure, the intention of the user is recognized by semantics at 0.6-0.8 seconds, namely the dialogue is interrupted, and is expressed by bar in by NLP), and then the speech and semantic recognition engine continues working until the user expression is finished (END ASR represents the speech END, the speech recognition is finished, and then NLP is called for semantic recognition), and corresponding answer or guide conversation (bar in by NLP) is called according to the recognized semantics, and a new round of dialogue is carried out.
The other is efficiency first (see flow 300), represented in figure 4 by the lower half of the number axis. Before the semantic recognition engine works, triggering interruption (detecting speech in by VAD in 0.1 second), continuously collecting partial results by semantic recognition till speech is finished (voice END), giving judgment on interruption or not (barrel in by VAD END), and if an interruption intention is detected, then carrying out the following process as the same as the previous scheme. If non-interrupting intent is detected, the interrupted dialog (Response) will be restarted or continued.
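The following sketch contrasts the two strategies over a stream of partial recognition results. The Strategy names and the tts/nlu interfaces are invented for illustration; they do not correspond to any engine API named in the patent:

```python
from enum import Enum

class Strategy(Enum):
    ACCURACY_FIRST = 1     # flow 200: barge in only once intent is confirmed
    EFFICIENCY_FIRST = 2   # flow 300: barge in as soon as VAD detects speech

def handle_barge_in(strategy, partials, nlu, tts, guide_script):
    """Process partial ASR results until the voice endpoint is reached.
    `partials` yields text fragments and ends at the endpoint;
    nlu.is_interrupt and tts.stop/resume_or_replay are assumed interfaces."""
    interrupted = False
    if strategy is Strategy.EFFICIENCY_FIRST:
        tts.stop()                       # tentative barge-in on first speech
        interrupted = True
    intent_found = False
    for text in partials:
        if not intent_found and nlu.is_interrupt(text, guide_script):
            intent_found = True
            if not interrupted:          # accuracy-first: interrupt on intent
                tts.stop()
                interrupted = True
    if interrupted and not intent_found:
        tts.resume_or_replay()           # false barge-in: restore the script
    return intent_found
```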
With further reference to fig. 5, as an implementation of the methods shown in the above figures, the present disclosure provides an embodiment of an apparatus for man-machine conversation, which corresponds to the method embodiment shown in fig. 2 and is particularly applicable to various electronic devices.
As shown in fig. 5, the apparatus 500 for human-machine conversation of the present embodiment includes: a first recognition unit 501, a determination unit 502, an interruption unit 503, a second recognition unit 504, and a playing unit 505. The first recognition unit 501 is configured to, in response to receiving speech input by the user while the first guide script is playing, perform semantic recognition on the speech to obtain the user's first intention; the determination unit 502 is configured to determine whether the first intention is to interrupt the conversation; the interruption unit 503 is configured to stop playback of the first guide script if the first intention is to interrupt the conversation; the second recognition unit 504 is configured to continue semantic recognition of the user's speech until the end of the speech is detected, obtaining a second intention; and the playing unit 505 is configured to play the second guide script according to the second intention.
In this embodiment, for the specific processing of the first recognition unit 501, the determination unit 502, the interruption unit 503, the second recognition unit 504, and the playing unit 505 of the apparatus 500 for man-machine conversation, reference may be made to steps 201, 202, 203, 204, and 205 in the embodiment corresponding to fig. 2.
In some optional implementations of this embodiment, the first recognition unit 501 is further configured to: perform speech recognition on the speech to obtain a partial text; and perform semantic recognition on the partial text to obtain the first intention.
In some optional implementations of this embodiment, the determination unit 502 is further configured to: determine the relevance between the played first guide script and the partial text; if the relevance is greater than a predetermined threshold, the first intention is to interrupt the conversation.
In some optional implementations of this embodiment, the determination unit 502 is further configured to: determine whether the played first guide script includes the partial text; if so, the first intention is to interrupt the conversation.
In some optional implementations of this embodiment, the first recognition unit 501 is further configured to: if a predetermined condition is met, stop playback of the first guide script before semantic recognition is performed.
In some optional implementations of this embodiment, the playing unit 505 is further configured to: if the first intention is not to interrupt the conversation, continue playing the first guide script.
In some optional implementations of this embodiment, the playing unit 505 is further configured to: if the first intention is not to interrupt the conversation, replay the first guide script.
Referring now to fig. 6, shown is a schematic diagram of an electronic device (e.g., the terminal of fig. 1) 600 suitable for use in implementing embodiments of the present disclosure. The terminal in the embodiments of the present disclosure may include, but is not limited to, a mobile terminal such as a mobile phone, a notebook computer, a digital broadcast receiver, a PDA (personal digital assistant), a PAD (tablet computer), a PMP (portable multimedia player), a vehicle terminal (e.g., a car navigation terminal), and the like, and a fixed terminal such as a digital TV, a desktop computer, and the like. The terminal shown in fig. 6 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present disclosure.
As shown in fig. 6, electronic device 600 may include a processing means (e.g., central processing unit, graphics processor, etc.) 601 that may perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM) 602 or a program loaded from a storage means 608 into a Random Access Memory (RAM) 603. In the RAM 603, various programs and data necessary for the operation of the electronic device 600 are also stored. The processing device 601, the ROM 602, and the RAM 603 are connected to each other via a bus 604. An input/output (I/O) interface 605 is also connected to the bus 604.
Generally, the following devices may be connected to the I/O interface 605: input devices 606 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; output devices 607 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, and the like; storage 608 including, for example, tape, hard disk, etc.; and a communication device 609. The communication means 609 may allow the electronic device 600 to communicate with other devices wirelessly or by wire to exchange data. While fig. 6 illustrates an electronic device 600 having various means, it is to be understood that not all illustrated means are required to be implemented or provided. More or fewer devices may alternatively be implemented or provided. Each block shown in fig. 6 may represent one device or may represent multiple devices as desired.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication means 609, or may be installed from the storage means 608, or may be installed from the ROM 602. The computer program, when executed by the processing device 601, performs the above-described functions defined in the methods of embodiments of the present disclosure. It should be noted that the computer readable medium described in the embodiments of the present disclosure may be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In embodiments of the disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In embodiments of the present disclosure, however, a computer readable signal medium may comprise a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.
The computer readable medium may be embodied in the electronic device, or may exist separately without being assembled into the electronic device. The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: in response to receiving speech input by the user while a first guide script is playing, perform semantic recognition on the speech to obtain the user's first intention; determine whether the first intention is to interrupt the conversation; if the first intention is to interrupt the conversation, stop playback of the first guide script; continue semantic recognition of the user's speech until the end of the speech is detected, obtaining a second intention; and play a second guide script according to the second intention.
Computer program code for carrying out operations for embodiments of the present disclosure may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, or C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in the embodiments of the present disclosure may be implemented by software or hardware. The described units may also be provided in a processor, which may be described as: a processor including a first recognition unit, a determination unit, an interruption unit, a second recognition unit, and a playing unit. The names of these units do not in some cases constitute a limitation on the units themselves; for example, the first recognition unit may also be described as "a unit that, in response to receiving speech input by the user while the first guide script is playing, performs semantic recognition on the speech to obtain the user's first intention".
The foregoing description is only exemplary of the preferred embodiments of the disclosure and is illustrative of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the invention in the present disclosure is not limited to the specific combination of the above-mentioned features, but also encompasses other embodiments in which any combination of the above-mentioned features or their equivalents is possible without departing from the inventive concept, for example, technical solutions formed by mutually replacing the above features with (but not limited to) features having similar functions disclosed in this disclosure.

Claims (10)

1. A method for human-machine conversation, comprising:
in response to receiving speech input by a user while a first guide script is playing, performing semantic recognition on the speech to obtain a first intention of the user;
determining whether the first intention is to interrupt a conversation;
if the first intention is to interrupt the conversation, stopping playback of the first guide script;
continuing semantic recognition of the speech input by the user until the end of the speech is detected, obtaining a second intention; and
playing a second guide script according to the second intention.
2. The method of claim 1, wherein the performing semantic recognition on the speech to obtain the first intention of the user comprises:
performing speech recognition on the speech to obtain a partial text; and
performing semantic recognition on the partial text to obtain the first intention.
3. The method of claim 1, wherein the determining whether the first intention is to interrupt a conversation comprises:
determining the relevance between the played first guide script and the partial text;
if the relevance is greater than a predetermined threshold, the first intention is to interrupt the conversation.
4. The method of claim 1, wherein the determining whether the first intention is to interrupt a conversation comprises:
determining whether the played first guide script includes the partial text;
if so, the first intention is to interrupt the conversation.
5. The method of claim 1, wherein before the performing semantic recognition on the speech to obtain the first intention of the user, the method further comprises:
if a predetermined condition is met, stopping playback of the first guide script before the semantic recognition is performed.
6. The method of claim 5, wherein the method further comprises:
if the first intention is not to interrupt a conversation, continuing to play the first guide script.
7. The method of claim 5, wherein the method further comprises:
if the first intention is not to interrupt a conversation, replaying the first guide script.
8. An apparatus for human-machine conversation, comprising:
a first recognition unit configured to, in response to receiving speech input by a user while a first guide script is playing, perform semantic recognition on the speech to obtain a first intention of the user;
a determination unit configured to determine whether the first intention is to interrupt a conversation;
an interruption unit configured to stop playback of the first guide script if the first intention is to interrupt a conversation;
a second recognition unit configured to continue semantic recognition of the speech input by the user until the end of the speech is detected, obtaining a second intention; and
a playing unit configured to play a second guide script according to the second intention.
9. An electronic device for human-machine conversation, comprising:
one or more processors;
a storage device having one or more programs stored thereon which,
when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-7.
10. A computer-readable medium, on which a computer program is stored, wherein the program, when executed by a processor, implements the method of any one of claims 1-7.
CN202011549879.7A 2020-12-24 2020-12-24 Method and device for man-machine conversation Pending CN113779208A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011549879.7A CN113779208A (en) 2020-12-24 2020-12-24 Method and device for man-machine conversation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011549879.7A CN113779208A (en) 2020-12-24 2020-12-24 Method and device for man-machine conversation

Publications (1)

Publication Number Publication Date
CN113779208A true CN113779208A (en) 2021-12-10

Family

ID=78835337

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011549879.7A Pending CN113779208A (en) 2020-12-24 2020-12-24 Method and device for man-machine conversation

Country Status (1)

Country Link
CN (1) CN113779208A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114528822A (en) * 2022-02-25 2022-05-24 平安科技(深圳)有限公司 Conversation process control method, device, server and medium for customer service robot
CN116959442A (en) * 2023-07-29 2023-10-27 浙江阳宁科技有限公司 Chip for intelligent switch panel and method thereof
CN117496973A (en) * 2024-01-02 2024-02-02 四川蜀天信息技术有限公司 Method, device, equipment and medium for improving man-machine conversation interaction experience

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111508474A (en) * 2019-08-08 2020-08-07 马上消费金融股份有限公司 Voice interruption method, electronic equipment and storage device
CN111540349A (en) * 2020-03-27 2020-08-14 北京捷通华声科技股份有限公司 Voice interruption method and device
US20200327890A1 (en) * 2017-11-28 2020-10-15 Sony Corporation Information processing device and information processing method
CN112037799A (en) * 2020-11-04 2020-12-04 深圳追一科技有限公司 Voice interrupt processing method and device, computer equipment and storage medium

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200327890A1 (en) * 2017-11-28 2020-10-15 Sony Corporation Information processing device and information processing method
CN111508474A (en) * 2019-08-08 2020-08-07 马上消费金融股份有限公司 Voice interruption method, electronic equipment and storage device
CN111540349A (en) * 2020-03-27 2020-08-14 北京捷通华声科技股份有限公司 Voice interruption method and device
CN112037799A (en) * 2020-11-04 2020-12-04 深圳追一科技有限公司 Voice interrupt processing method and device, computer equipment and storage medium

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114528822A (en) * 2022-02-25 2022-05-24 平安科技(深圳)有限公司 Conversation process control method, device, server and medium for customer service robot
CN114528822B (en) * 2022-02-25 2024-02-06 平安科技(深圳)有限公司 Conversation flow control method and device of customer service robot, server and medium
CN116959442A (en) * 2023-07-29 2023-10-27 浙江阳宁科技有限公司 Chip for intelligent switch panel and method thereof
CN116959442B (en) * 2023-07-29 2024-03-19 浙江阳宁科技有限公司 Chip for intelligent switch panel and method thereof
CN117496973A (en) * 2024-01-02 2024-02-02 四川蜀天信息技术有限公司 Method, device, equipment and medium for improving man-machine conversation interaction experience
CN117496973B (en) * 2024-01-02 2024-03-19 四川蜀天信息技术有限公司 Method, device, equipment and medium for improving man-machine conversation interaction experience

Similar Documents

Publication Publication Date Title
Anguera et al. Speaker diarization: A review of recent research
US10516782B2 (en) Conference searching and playback of search results
US9293133B2 (en) Improving voice communication over a network
WO2022105861A1 (en) Method and apparatus for recognizing voice, electronic device and medium
CN113779208A (en) Method and device for man-machine conversation
CN110047481B (en) Method and apparatus for speech recognition
US10811005B2 (en) Adapting voice input processing based on voice input characteristics
US9711167B2 (en) System and method for real-time speaker segmentation of audio interactions
US20180191912A1 (en) Selective conference digest
WO2020238209A1 (en) Audio processing method, system and related device
CN110689877A (en) Voice end point detection method and device
US20180190266A1 (en) Conference word cloud
US11687526B1 (en) Identifying user content
JP7230806B2 (en) Information processing device and information processing method
CN109712610A (en) The method and apparatus of voice for identification
CN113362828B (en) Method and apparatus for recognizing speech
CN109994106B (en) Voice processing method and equipment
US11783808B2 (en) Audio content recognition method and apparatus, and device and computer-readable medium
US8868419B2 (en) Generalizing text content summary from speech content
CN108877779B (en) Method and device for detecting voice tail point
CN114385800A (en) Voice conversation method and device
CN116417003A (en) Voice interaction system, method, electronic device and storage medium
CN107680592A (en) A kind of mobile terminal sound recognition methods and mobile terminal and storage medium
CN112700767B (en) Man-machine conversation interruption method and device
WO2019228140A1 (en) Instruction execution method and apparatus, storage medium, and electronic device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination