CN114064858A - Dialogue processing method and device for dialogue robot, electronic equipment and medium - Google Patents

Dialogue processing method and device for dialogue robot, electronic equipment and medium Download PDF

Info

Publication number
CN114064858A
CN114064858A (application CN202111432736.2A)
Authority
CN
China
Prior art keywords
intention
user
robot
voice output
preset
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111432736.2A
Other languages
Chinese (zh)
Inventor
高乐
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202111432736.2A priority Critical patent/CN114064858A/en
Publication of CN114064858A publication Critical patent/CN114064858A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3343Query execution using phonetics
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/332Query formulation
    • G06F16/3329Natural language query formulation or dialogue systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3344Query execution using natural language analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • G06F40/35Discourse or dialogue representation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Manipulator (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a dialogue processing method, apparatus, medium, and terminal for a dialogue robot. The method includes: acquiring the user's real-time voice data and executing robot voice output according to a preset flow; when a conflict occurs, obtaining text information and the user's voice interval time from the real-time voice data; judging whether the user's intention is the original intention or a new intention based on the in-progress text, the interval time, and the user intention; and, according to the judgment result, executing the robot voice-output strategy and content corresponding to each intention under a preset mapping relation. For dialogue scenes in which the customer interrupts or barges in, the method first judges the conflict type and then recognizes the intention, executing separate processing flows for interruption and barge-in. This avoids ignoring the customer's real request or producing misunderstandings, enables the dialogue robot to cope with situations that frequently occur in real scenes, and greatly improves the robot's conversational ability when challenged by the customer.

Description

Dialogue processing method and device for dialogue robot, electronic equipment and medium
Technical Field
The present invention relates to the field of computer applications, and in particular, to a method and an apparatus for processing a dialog of a dialog robot, an electronic device, and a medium.
Background
With the rapid development of artificial intelligence, dialogue robots are widely used. Typically, a natural language processing system is mounted: when a question is put to the dialogue robot, it captures the input keywords, finds the most suitable answer in a database through an algorithm, and gives the corresponding reply.
However, the logic of human conversation varies widely; many customers do not follow the robot's conversational rules at all and do not speak at regular intervals. In a real scene, a customer may interrupt the robot mid-sentence, for example cutting in one second after the robot starts speaking, or may barge in just as the robot prepares to broadcast, supplementing what was said before. Existing robots cannot cope with these complex situations: they either ignore the customer's real request or misunderstand it, so the conversation cannot continue, the AI robot appears unintelligent in chat interaction, and the user experience degrades.
Disclosure of Invention
In view of the above-mentioned shortcomings of the prior art, the present invention provides a method, an apparatus, a medium and a terminal for processing dialog of a dialog robot, so as to solve the above-mentioned technical problems.
The invention provides a dialogue processing method of a dialogue robot, which comprises the following steps:
acquiring real-time voice data of a user, and executing robot voice output according to a preset flow;
when the real-time voice data and the robot voice output conflict in the time dimension, acquiring text information and the user's voice interval time from the real-time voice data, wherein the voice-output conflict types include a first conflict type representing an interruption;
when the conflict is of the first conflict type, executing a first processing flow according to the word count of the in-progress text in the text information, the first processing flow comprising stopping the robot voice output and performing intention recognition after the user stops speaking;
if the recognition result is a non-interrupting intention, continuing the voice output according to the original flow;
and if the recognition result is an interrupting intention, executing a second processing flow, which comprises judging whether the user's intention is the original intention or a new intention according to the in-progress text, the interval time, and the user intention, and, according to the judgment result, executing the robot voice-output strategy and content corresponding to each intention under a preset mapping relation.
In an embodiment of the present invention, the voice-output conflict types further include a second conflict type representing a barge-in (call preemption);
when the conflict is of the second conflict type, the robot is triggered to splice the original content with the user's barge-in content so as to judge the user's intention anew, and to execute the robot voice-output strategy and content corresponding to each intention under the preset mapping relation.
In an embodiment of the present invention, when the conflict is of the first conflict type, the word count of the in-progress text is compared with a preset word-count threshold:
if the word count of the in-progress text is greater than the preset threshold, the robot voice output is stopped, and intention recognition is performed after the user's voice output ends;
if the word count of the in-progress text is less than or equal to the preset threshold, intention recognition is performed directly; if the recognition result is an interrupting intention, the robot voice output is stopped; if the result is a non-interrupting intention, the robot voice output continues.
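The two-way threshold comparison above can be sketched as follows. This is a minimal illustration, not the patented implementation: the threshold value, the stub classes, and the keyword-based intent recognizer are assumptions for demonstration only.

```python
# Minimal sketch of first-conflict handling: branch on the word count of the
# user's in-progress text. All names and the threshold value are hypothetical.
WORD_COUNT_THRESHOLD = 5  # preset word-count threshold (assumed value)

class RobotStub:
    """Stand-in for the robot's voice-output channel."""
    def __init__(self):
        self.playing = True

    def stop_output(self):
        self.playing = False

    def continue_output(self):
        self.playing = True

def recognize_intent_stub(text):
    # Toy recognizer; a real system would call the NLP pipeline instead.
    return "interrupt" if "stop" in text else "non-interrupt"

def handle_first_conflict(in_progress_text, robot, recognize_intent):
    if len(in_progress_text.split()) > WORD_COUNT_THRESHOLD:
        # Long utterance: stop output first, recognize after the user finishes.
        robot.stop_output()
        return recognize_intent(in_progress_text)
    # Short utterance: recognize immediately, stop only on an interrupting intent.
    intent = recognize_intent(in_progress_text)
    if intent == "interrupt":
        robot.stop_output()
    else:
        robot.continue_output()
    return intent
```

Note that the two branches differ only in when recognition happens relative to stopping playback, which is the core of the distinction between the two interruption cases described later in the embodiment.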
In an embodiment of the present invention, under the first conflict type, the method further includes:
if the word count of the in-progress text is less than or equal to the preset threshold and the interrupting intention is judged to be of the resume-play class, the interruption is not processed and the voice is played and output normally;
if the word count of the in-progress text is less than or equal to the preset threshold and the interrupting intention is judged to be objection-then-resume, a preset objection script is played; playback stops when the user continues speaking, and after the user is again judged to be in the resume-play class, the original voice is played and output normally;
if the word count of the in-progress text is less than or equal to the preset threshold and the interrupting intention is judged to be an objection, preset objection and auxiliary scripts are played; playback stops when the user continues speaking, the user's intention is judged again, and the voice is replayed and output according to the user's final intention;
and if the word count of the in-progress text is less than or equal to the preset threshold and the interrupting intention is judged to be a rejection, the robot voice output is ended.
In an embodiment of the present invention, under the first conflict type, the method further includes:
if the word count of the in-progress text is greater than the preset threshold and the user continues speaking, playback stops; if the intention of the continued speech is of the resume-play class, the original voice is played and output normally after the user finishes speaking;
if the word count of the in-progress text is greater than the preset threshold and the user continues speaking, playback stops; if the intention is objection-then-resume, a preset transition script is played first after the user finishes speaking, and then the original voice is played normally;
if the word count of the in-progress text is greater than the preset threshold and the user continues speaking, playback stops; if the intention is a rejection, the robot voice output is ended.
In an embodiment of the present invention, a time threshold is preset; when the interval between two consecutive utterances of the user's voice output is smaller than the time threshold, the conflict is judged to be of the second conflict type;
when the second conflict type is judged, the previous intention and the current intention are merged, multi-intention processing is performed in the turn that matched the previous intention, the robot's waiting time is adjusted when barge-ins occur several times in succession, and a preset linking script is played; after the barge-in ends, the script is played according to the merged intention without repeating the content already broadcast before the barge-in.
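The barge-in test and the intent merge described above can be sketched as below. The 3-second threshold and the order-preserving deduplicating merge policy are assumptions for illustration; the patent does not fix either value.

```python
# Hedged sketch of second-conflict (barge-in) detection and intent merging.
TIME_THRESHOLD_S = 3.0  # preset time threshold (assumed value)

def is_barge_in(interval_s):
    """An utterance counts as a barge-in when the gap between the user's two
    consecutive utterances is below the preset time threshold."""
    return interval_s < TIME_THRESHOLD_S

def merge_intents(previous, current):
    """Combine previous and current intents for multi-intent processing in the
    turn that matched the previous intention (deduplicated, order preserved)."""
    merged = list(previous)
    for intent in current:
        if intent not in merged:
            merged.append(intent)
    return merged
```

After the merge, the response would be generated once for the combined intent list, which is what prevents the already-broadcast content from being repeated.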
In an embodiment of the present invention, the intention identification includes:
setting a main natural language processor and a plurality of sub natural language processors, and acquiring the intention of the real-time voice data through the main natural language processor;
dispatching the real-time voice data to the sub natural language processors, wherein the main natural language processor and the sub natural language processors have different query intents;
acquiring the intention recognition results of the plurality of sub natural language processors, and feeding all recognition results back to the main natural language processor;
and evaluating all recognition results according to their confidence degrees, the main natural language processor selecting one recognition result as the final recognized intention according to the evaluation.
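The selection step of this main/sub processor arrangement reduces to picking the highest-confidence result among those fed back. The sketch below assumes each sub-processor returns an `(intent, confidence)` pair; the function name and data shape are illustrative, not the patent's API.

```python
# Sketch of the main processor's evaluation step: choose the best-scoring
# intent among the sub natural language processors' results.
def select_final_intent(sub_results):
    """sub_results: list of (intent, confidence) pairs fed back by the sub
    processors. Returns the highest-confidence intent, or None if empty."""
    if not sub_results:
        return None
    best_intent, _ = max(sub_results, key=lambda pair: pair[1])
    return best_intent
```

A real deployment would likely also apply tie-breaking or a minimum-confidence floor, but the source does not specify these, so they are omitted here.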
The invention also provides a dialogue processing device of a dialogue robot, comprising:
a voice acquisition module for acquiring the user's real-time voice data;
a voice output module for executing robot voice output according to a preset flow;
a processing module comprising a recognition unit and a control unit;
when the real-time voice data conflicts with the robot voice output in the time dimension, the recognition unit acquires text information and the user's voice interval time from the real-time voice data, the voice-output conflict types including a first conflict type representing an interruption;
when the conflict is of the first conflict type, the control unit executes a first processing flow according to the word count of the in-progress text in the text information, the first processing flow comprising stopping the robot voice output, waiting for the user to stop speaking, and performing intention recognition through the recognition unit;
if the recognition result is a non-interrupting intention, the voice output continues according to the original flow;
if the recognition result is an interrupting intention, a second processing flow is executed, which comprises judging whether the user's intention is the original intention or a new intention according to the in-progress text, the interval time, and the user intention, and executing the robot voice-output strategy and content corresponding to each intention under the preset mapping relation according to the judgment result.
The invention also provides an electronic device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, the processor implementing the method according to any one of the preceding claims when executing the computer program.
The invention also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements a method as in any one of the above.
The invention has the following beneficial effects: for the complex problems in dialogue scenes where a customer interrupts or barges in, the invention first judges the conflict type and then recognizes the intention, executing separate processing flows for interruption and barge-in. The customer's real request is therefore not ignored and no misunderstanding arises; the dialogue robot can handle situations that frequently occur in real scenes, and its conversational ability when challenged by the customer is greatly improved.
In addition, the invention splices the earlier content with the barge-in content and recognizes the intention once more, which greatly improves recognition accuracy and prevents the intention the customer really wants to express from being ignored because of the barge-in. The invention also handles customer interruptions and barge-ins such that, when the robot needs to continue along its flow, it responds smoothly and unobtrusively through sentence cutting and linking scripts, and a newly broadcast script does not repeat the part already broadcast; the robot therefore expresses itself more naturally, closer to a human, improving the user's conversational experience.
Drawings
Fig. 1 is a flow chart illustrating a conversation processing method of a conversation robot according to an embodiment of the present invention.
Fig. 2 is a schematic diagram of an intention recognition flow of a conversation processing method of a conversation robot in an embodiment of the present invention.
Fig. 3 is a schematic diagram of a hardware structure of an electronic device according to an embodiment of the present invention.
Fig. 4 is a schematic diagram of a hardware structure of an electronic device according to another embodiment of the present invention.
Fig. 5 is a schematic diagram of a hardware configuration of a dialogue processing device of the dialogue robot of the present invention.
Detailed Description
The embodiments of the present invention are described below with reference to specific embodiments, and other advantages and effects of the present invention will be easily understood by those skilled in the art from the disclosure of the present specification. The invention is capable of other and different embodiments and of being practiced or of being carried out in various ways, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention. It is to be noted that the features in the following embodiments and examples may be combined with each other without conflict.
It should be noted that the drawings provided in the following embodiments are only for illustrating the basic idea of the present invention, and the components related to the present invention are only shown in the drawings rather than drawn according to the number, shape and size of the components in actual implementation, and the type, quantity and proportion of the components in actual implementation may be changed freely, and the layout of the components may be more complicated.
In the following description, numerous details are set forth to provide a more thorough explanation of embodiments of the present invention, however, it will be apparent to one skilled in the art that embodiments of the present invention may be practiced without these specific details, and in other embodiments, well-known structures and devices are shown in block diagram form, rather than in detail, in order to avoid obscuring embodiments of the present invention.
As shown in fig. 1, the conversation processing method of the conversation robot in the present embodiment includes:
s1, acquiring real-time voice data of a user, and executing robot voice output according to a preset flow;
s2, when real-time voice data and robot voice output conflict in a time dimension, acquiring character information and user voice interval time according to the real-time voice data, wherein the type of voice output conflict comprises a first conflict type used for representing interruption;
s3, when the user is in the first conflict type, executing a first processing flow according to the word number of the process words in the word information, wherein the first processing flow comprises stopping the voice output of the robot, and performing intention identification after the user stops outputting;
s4, if the recognition result is the non-interruption intention, continuing to execute voice output according to the original flow;
and S5, if the recognition result is the intention interruption, executing a second processing flow, wherein the second processing flow comprises the steps of judging whether the intention of the user is the original intention or the new intention according to the process characters, the interval time and the intention of the user, and executing the strategy and the content of the robot voice output corresponding to different intentions according to the preset mapping relation according to the judgment result.
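Steps S1–S5 amount to a small dispatch over (conflict present, recognition result, intent class). The sketch below is a minimal control loop under that reading; the mapping table is a hypothetical example of the "preset mapping relation", and the stubbed classifier stands in for the recognition pipeline.

```python
# Minimal control-loop sketch of steps S1-S5. The mapping table and all
# names are illustrative assumptions, not the patented implementation.
RESPONSE_MAP = {            # preset mapping: intent -> (strategy, content)
    "original": ("resume", "continue original script"),
    "new":      ("switch", "answer the new question"),
}

def dispatch(conflict, recognized, classify_intent):
    """Return the robot's next (strategy, content) per steps S3-S5."""
    if not conflict:
        return ("play", "preset flow")       # S1: no conflict, follow the flow
    if recognized == "non-interrupt":
        return ("play", "preset flow")       # S4: continue as before
    user_intent = classify_intent()          # S5: original vs. new intention
    return RESPONSE_MAP[user_intent]
```

In the embodiment, `classify_intent` would combine the in-progress text, the interval time, and the recognized user intention; here it is a callable so the branching itself can be checked in isolation.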
In step S1 of this embodiment, the user's real-time voice data is first acquired; when no interruption or barge-in occurs, the robot voice output proceeds according to the preset flow. The user's speech may be captured in real time through ASR (Automatic Speech Recognition), which in this embodiment converts the speech into text. Other techniques achieving the same function may of course be substituted, and are not described again here.
In step S2 of this embodiment, when the real-time voice data and the robot voice output conflict in the time dimension, the text information and the user's voice interval time are acquired from the real-time voice data; the voice-output conflict types include a first conflict type representing an interruption. When the user's voice conflicts with the robot's voice output, it is first judged whether the conflict is an interruption and then whether it is a barge-in.
In steps S4 and S5 of the present embodiment, if the recognition result is a non-interrupting intention, the voice output continues as before; if it is an interrupting intention, a second processing flow is executed, which comprises judging whether the user's intention is the original intention or a new intention according to the in-progress text, the interval time, and the user intention, and executing the robot voice-output strategy and content corresponding to each intention under the preset mapping relation according to the judgment result.
In this embodiment, the voice-output conflict types further include a second conflict type representing a barge-in; when the conflict is of the second conflict type, the robot is triggered to splice the original content with the user's barge-in content so as to judge the user's intention anew, and to execute the robot voice-output strategy and content corresponding to each intention under the preset mapping relation.
In this embodiment, whether the first conflict type constitutes an interruption can be judged via a preset word-count threshold. An "intention" here is computer-readable data representing the meaning that a computer system component has recognized in a natural language query; intentions may be pre-classified based on the NLP (natural language processing) recognition result, in this embodiment for example into resume-play, objection-then-resume, objection, and rejection classes. The interruption processing flow in this embodiment may cover two cases:
Interruption case 1: if the word count of the in-progress text is greater than the preset word-count threshold, the robot voice output is stopped, and intention recognition is performed after the user's voice output ends.
Specifically, the sub-cases where the word count is at or below the threshold (interruption case 2) include:
if the word count of the in-progress text is less than or equal to the preset threshold and the interrupting intention is judged to be of the resume-play class, the interruption is not processed and the voice is played and output normally;
if the word count of the in-progress text is less than or equal to the preset threshold and the interrupting intention is judged to be objection-then-resume, a preset objection script is played; playback stops when the user continues speaking, and after the user is again judged to be in the resume-play class, the original voice is played and output normally;
if the word count of the in-progress text is less than or equal to the preset threshold and the interrupting intention is judged to be an objection, preset objection and auxiliary scripts are played; playback stops when the user continues speaking, the user's intention is judged again, and the voice is replayed and output according to the user's final intention;
and if the word count of the in-progress text is less than or equal to the preset threshold and the interrupting intention is judged to be a rejection, the robot voice output is ended.
Interruption case 2: if the word count of the in-progress text is less than or equal to the preset word-count threshold, intention recognition is performed; if the result is an interrupting intention, the robot voice output is stopped; if it is a non-interrupting intention, the robot voice output continues and the broadcast is unaffected.
Specifically, the sub-cases where the word count exceeds the threshold (interruption case 1) include:
if the word count of the in-progress text is greater than the preset threshold and the user continues speaking, playback stops; if the intention of the continued speech is of the resume-play class, the original voice is played and output normally after the user finishes speaking;
if the word count of the in-progress text is greater than the preset threshold and the user continues speaking, playback stops; if the intention is objection-then-resume, a preset transition script is played first after the user finishes speaking, and then the original voice is played normally;
if the word count of the in-progress text is greater than the preset threshold and the user continues speaking, playback stops; if the intention is a rejection, the robot voice output is ended.
in this embodiment, based on the ASR-processed characters, the break logic under different conditions may be preset according to the sub-spitting text:
Case 101 — no interruption, play normally: if the user does not interrupt, the robot plays the voice output normally;
Case 102 — resume-play intent, playback not interrupted: if the number of words spoken by the user is at or below the preset threshold and the intent is judged to be resume-play, the intent is not processed and the robot plays the voice output normally;
Case 103 — objection-then-resume intent, playback interrupted: if the word count is at or below the threshold and the intent is judged objection-then-resume, the robot plays the objection script; when the customer continues speaking, playback stops, and once the customer is again judged to be in the resume-play class, the robot plays the original voice normally;
Case 104 — objection intent: if the word count is at or below the threshold and the intent is judged to be an objection, the robot plays the objection and auxiliary scripts; when the customer continues speaking, playback stops, the intention is judged again, and the robot replays the voice output according to the customer's final intention;
Case 105 — rejection intent: if the word count is at or below the threshold and the intent is judged to be a rejection, the robot voice output ends.
Case 206 — resume-play intent, word count above threshold: if the number of words spoken by the user exceeds the preset threshold, playback stops when the customer continues speaking; if the intent is judged resume-play, the robot plays the original voice normally after the customer finishes speaking;
Case 207 — objection-then-resume intent, word count above threshold: if the number of words spoken exceeds the preset threshold, playback stops when the customer continues speaking; if the intent is judged objection-then-resume, the robot first plays a transition script after the customer finishes, then plays the original voice normally.
Case 208 — rejection intent, word count above threshold: if the number of words spoken exceeds the preset threshold and the intent is judged to be a rejection, the robot voice output ends.
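The eight numbered cases form a decision table over two inputs: whether the word count exceeds the threshold, and the recognized intent class. The sketch below encodes that table directly; the case mapping follows the text above, while the threshold value, intent labels, and action strings are illustrative assumptions.

```python
# Decision-table sketch of cases 101-208: (word count over threshold?, intent)
# -> action. Labels and actions paraphrase the text; names are hypothetical.
ACTIONS = {
    (False, "resume"):           "play normally",                          # 102
    (False, "objection_resume"): "play objection script, then resume",     # 103
    (False, "objection"):        "play objection + auxiliary script, re-recognize",  # 104
    (False, "reject"):           "end output",                             # 105
    (True,  "resume"):           "stop, resume after user finishes",       # 206
    (True,  "objection_resume"): "stop, play transition script, then resume",  # 207
    (True,  "reject"):           "end output",                             # 208
}

def decide(word_count, intent, threshold=5):
    over = word_count > threshold
    # Case 101: no interruption / no matching intent -> keep playing normally.
    return ACTIONS.get((over, intent), "play normally")
```

Encoding the logic as a table rather than nested conditionals makes the symmetry between the at-or-below-threshold cases (10x) and the above-threshold cases (20x) explicit, and makes it easy to add intent classes.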
In this embodiment, across the two interruption cases, case 1 produces an intention-recognition result, while case 2 triggers the robot to stop voice output only when the recognition result is an interrupting intention. After either case, it must further be judged whether the user is barging in. A time threshold is preset; when the interval between two consecutive utterances of the user's voice output is smaller than this threshold, the conflict is judged to be of the second conflict type. On that judgment, the previous intention and the current intention are merged, multi-intention processing is performed in the turn that matched the previous intention, the robot's waiting time is adjusted when barge-ins occur several times in succession, a preset linking script is played, and already-broadcast content is not repeated. Concretely, whether the barge-in rule is met may be judged from the interval between the user's two consecutive utterances: for example, with a preset threshold of 3 seconds, if the interval is below the threshold, the barge-in processing flow is triggered according to the user's intention, comprising merging the previous and current intentions and performing multi-intention processing in the turn matching the previous intention. When several successive barge-ins occur, a preset handling mode may be triggered, such as adjusting the waiting time or adding a linking script (for example, a brief acknowledgement filler).
When the interval between the two utterances is greater than or equal to the time threshold, trunk-branch matching is performed:
if the trunk branch is matched, the core-meaning rule is enabled, and the main and auxiliary scripts have not been fully broadcast, the user's intention is ignored and the linking and auxiliary scripts are broadcast directly;
if the trunk branch is not matched, or it is matched but the core-meaning rule is not enabled and the main and auxiliary scripts have been fully broadcast, matching proceeds normally according to the intention. In this embodiment, the ASR mode must output text word by word; if no new text is output for more than 3 s and no end mark is given, the robot hangs up and records a log. The script branches in this embodiment mainly include a trunk branch, an objection branch, and an ending branch. The trunk branch broadcasts the linking and auxiliary scripts, ignoring the customer's intent. The objection branch replies to an objection, ignores further objection actions after the reply, and broadcasts the linking and auxiliary scripts (the auxiliary scripts still follow the rule of sequential cyclic broadcast). The ending branch performs normal matching according to intention.
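The three-way branch routing above can be sketched as a small selector. This is a reading of the rules, not the patent's implementation: the trigger for the objection branch (an objection intent) is an assumption inferred from the branch's description, and all names are hypothetical.

```python
# Illustrative routing among the trunk, objection, and ending script branches.
def route_branch(matched_trunk, core_rule_on, fully_broadcast, intent=None):
    """Pick a script branch per the rules above.

    matched_trunk:   whether the utterance matched the trunk branch
    core_rule_on:    whether the core-meaning rule is enabled
    fully_broadcast: whether the main and auxiliary scripts finished playing
    intent:          recognized intent (objection trigger is an assumption)
    """
    if matched_trunk and core_rule_on and not fully_broadcast:
        return "trunk"      # ignore intent; broadcast linking + auxiliary scripts
    if intent == "objection":
        return "objection"  # reply to the objection, then linking + auxiliary
    return "ending"         # normal matching according to intention
```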
In this embodiment, after the first processing flow and the second processing flow, when the robot needs to continue the subsequent flow, the script is cut, the preset linking phrase is prepended, and the result is broadcast; parts that have already been broadcast are not broadcast again. In this way the robot joins scripts smoothly and sounds less abrupt, and the re-broadcast script never repeats the previously broadcast part, so the robot's expression during the dialog is more natural.
In this embodiment, when a joined broadcast is needed, an identifier indicating whether the sentence is a "continue broadcast" is added to the TTS request, and the TTS engine splices and returns audio according to the dialog manager's (DM's) continue-broadcast identifier. TTS (Text-To-Speech, speech synthesis) converts text into voice and mainly comes in two types: the splicing method and the parametric method. The splicing method selects the required basic units from speech recorded in advance and concatenates them. Units may be syllables, phonemes, and so on; to improve the continuity of the synthesized speech, diphones are also commonly used as units. Its advantage is higher speech quality; its disadvantage is the large database required — typically tens of hours of finished corpus, at high cost. The parametric method generates speech parameters (including fundamental frequency, formant frequencies, etc.) at every moment from a statistical model and then converts the parameters into a waveform. It is mainly divided into three modules: front end, back end, and vocoder. The front end analyzes the text to determine the pronunciation of each character, the tone of the sentence, the rhythm with which it is read, which places should be emphasized, and so on. Common prosody-related annotations include, but are not limited to, prosodic boundaries, accents, boundary tones, and even emotions. The database requirement is relatively small, and optionally the parametric method may be adopted in this embodiment for the joined broadcast. A splicing operation, which requires a voice library covering all scenes, may be as in Table 1:
TABLE 1 (example splicing operations; the table figures are not reproduced in this text)
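The continue-broadcast splicing described above might look roughly like this; the request format and engine interface are assumptions, and the "voice library" is reduced to a character map purely for illustration:

```python
# Hedged sketch of the "continue broadcast" TTS request and splicing engine.
def build_tts_request(text, continue_broadcast):
    # The DM attaches an identifier marking whether this sentence resumes a
    # previously interrupted broadcast.
    return {"text": text, "continue_broadcast": continue_broadcast}

class SplicingTTSEngine:
    """Toy engine: concatenates pre-recorded unit clips and, when the DM sets
    the continue-broadcast identifier, skips the part already broadcast."""

    def __init__(self, voice_library):
        self.voice_library = voice_library  # unit -> audio clip (here: str)
        self.position = 0                   # characters already broadcast

    def synthesize(self, request):
        text = request["text"]
        if request["continue_broadcast"]:
            text = text[self.position:]     # do not re-broadcast played parts
        self.position += len(text)
        return "".join(self.voice_library.get(ch, ch) for ch in text)
```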
In this embodiment, natural language query intent assignment may be combined in, matching a particular natural language query against intents from multiple intent matchers and pointing it to the appropriate dialog query processor, which further enables the robot to join scripts smoothly without sounding abrupt. For example, for the user's utterance "I'm hungry," the query may be matched against multiple extended natural language processors, each capable of processing the query and generating an intent for it (e.g., order pizza, order coffee). Each extended natural language processor may be independent of the system's main natural language processor, may return at least one intent, and may also return one or more entities for a natural language query. Each extended natural language processor may also produce intents independently, without awareness of the other processors, and may identify the intent of a query using its own form of natural-language matching. A way to disambiguate between multiple intent matchers and dialog query processors may also be provided — possibly receiving user input selections, such as a choice between different intents or different dialog query processors — by using data such as user rankings, user preferences, contextual information, and/or user profiles. As shown in fig. 2, this specifically comprises:
S601, setting a main natural language processor and a plurality of sub natural language processors, and acquiring the intention of the real-time voice data through the main natural language processor;
S602, distributing the real-time voice data to the sub natural language processors, wherein the main natural language processor and the sub natural language processors have different query intentions;
S603, acquiring the intention recognition results of the plurality of sub natural language processors, and feeding all recognition results back to the main natural language processor;
S604, evaluating all recognition results according to their confidence degrees, the main natural language processor selecting one recognition result as the final recognition intention result according to the evaluation.
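Steps S601–S604 can be sketched as a simple fan-out with confidence-based selection; the keyword matching and the fixed 0.9 confidence below are stand-ins for a real NLP model, and all class and method names are illustrative assumptions:

```python
# Minimal sketch of the main/sub natural-language-processor arbitration.
class SubNLP:
    def __init__(self, name, keywords, intent):
        self.name, self.keywords, self.intent = name, keywords, intent

    def parse(self, query):
        # S602/S603: return an (intent, confidence) result, or None if this
        # processor cannot understand the query.
        if any(k in query for k in self.keywords):
            return {"intent": self.intent, "confidence": 0.9, "source": self.name}
        return None

def main_nlp(query, sub_processors):
    # S603: collect every recognition result that came back.
    results = [r for p in sub_processors if (r := p.parse(query)) is not None]
    # S604: evaluate by confidence and select one final recognition result.
    return max(results, key=lambda r: r["confidence"], default=None)
```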
In this embodiment, the main natural language processor may send a query to a large set of component natural language processors, of which only a subset are able to understand the particular query and return a corresponding intent. If the main natural language processor sends the corpus query "I'm hungry" to multiple different sub natural language processors, a sub processor extended for food ordering may be programmed to return an intent for the query, and another for making restaurant reservations may also be programmed to return an intent, while one extended for scheduling taxi rides may not be. Thus, the main natural language processor may send the query to all three extended natural language processors, but it may receive back only a "food order" intent from the food-ordering processor and a "make a reservation" intent from the restaurant-reservation processor. The main natural language processor may then match each intent with the corresponding dialog query processor. For example, the extended natural language processor may be part of the same extension as the corresponding dialog query processor, as indicated to the main natural language processor when the extension is registered. Thus, after receiving an intent from a particular sub natural language processor, the main natural language processor may look up that processor's extension registry entry and find data identifying the corresponding dialog query processor. As another example, along with returning the intent, the sub natural language processor may also return an identifier (e.g., an address) of the dialog query processor that handles the intent.
Such an identifier may be used by the main natural language processor to match the received intent to the matching dialog query processor.
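The registry lookup described above might be sketched as follows; the registry contents and field names are hypothetical:

```python
# Assumed sketch of routing a returned intent to its dialog query processor.
REGISTRY = {
    # extension (sub-processor) name -> dialog query processor id (illustrative)
    "food_order_nlp": "food_order_dialog_processor",
    "restaurant_nlp": "reservation_dialog_processor",
}

def route_intent(intent, registry=REGISTRY):
    # Prefer an identifier returned alongside the intent; otherwise look up
    # the registry entry for the sub-processor that produced the intent.
    if "processor_id" in intent:
        return intent["processor_id"]
    return registry.get(intent["source"])
```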
Accordingly, as shown in fig. 5, the present embodiment further provides a dialogue processing apparatus for a dialogue robot, including:
a voice acquisition module 101 for acquiring real-time voice data of a user;
the voice output module 102 is used for executing robot voice output according to a preset flow;
a processing module 103 comprising an identification unit and a control unit;
when real-time voice data and robot voice output conflict in a time dimension, the recognition unit acquires character information and user voice interval time according to the real-time voice data, wherein the type of voice output conflict comprises a first conflict type used for representing interruption;
when the conflict is of the first conflict type, the control unit executes a first processing flow according to the word number of the process words in the text information, wherein the first processing flow comprises stopping the robot voice output, waiting for the user to stop outputting, and performing intention recognition through the recognition unit;
if the recognition result is the non-interruption intention, continuing to execute voice output according to the original flow;
and if the recognition result is the interrupting intention, executing a second processing flow, wherein the second processing flow comprises judging whether the user intention is the original intention or the new intention according to the process characters, the interval time and the user intention, and executing strategies and contents of robot voice output corresponding to different intentions according to a preset mapping relation according to the judgment result.
In this embodiment, the real-time voice data of the user is first acquired through the voice acquisition module, and in the absence of interruption or barge-in the robot voice output is executed according to the preset flow. The recognition unit may convert the user speech acquired in real time into text through ASR (Automatic Speech Recognition).
In this embodiment, when the real-time voice data conflicts with the robot voice output in the time dimension, the text information and the user voice interval time are acquired from the real-time voice data, and the types of voice output conflicts include a first conflict type representing an interruption. When the user voice and the robot voice output conflict, it is first determined whether the conflict is an interruption, and then whether it is a barge-in.
In the embodiment, if the recognition result is the non-interruption intention, the voice output is continuously executed according to the original flow; and if the recognition result is the interrupting intention, executing a second processing flow, wherein the second processing flow comprises judging whether the user intention is the original intention or the new intention according to the process characters, the interval time and the user intention, and executing strategies and contents of robot voice output corresponding to different intentions according to a preset mapping relation according to the judgment result.
In this embodiment, the types of voice output conflicts further include a second conflict type indicating a barge-in (call preemption); when the conflict is of the second conflict type, the robot is triggered to splice its original content with the user's barge-in content and then play the result, so as to judge the user intention again and execute the strategy and content of the robot voice output corresponding to the different intentions according to the preset mapping relation.
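The splice-and-rejudge step above can be sketched as follows; the function signature and the policy map are illustrative assumptions:

```python
# Illustrative sketch of second-conflict-type handling: splice the user's
# earlier utterance with the barge-in utterance, re-run intent recognition,
# and look up the preset mapping from intent to output strategy.
def rejudge_intent(original_text, barge_in_text, recognize, policy_map):
    spliced = f"{original_text} {barge_in_text}".strip()  # splice the two utterances
    intent = recognize(spliced)                           # judge the intention again
    # Preset mapping relation: intent -> voice-output strategy/content.
    return policy_map.get(intent, policy_map.get("default"))
```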
In this embodiment, whether the first conflict type constitutes an interruption can be judged by presetting a word-number threshold. The "intention" in this embodiment is computer-readable data indicating the meaning that a computer system component has recognized a natural language query as expressing; intentions may be classified in advance based on the NLP (natural language processing) recognition result, for example, in this embodiment, into a continuation class, an objection-continuation class, an objection class, and a rejection class. The interruption processing flow in this embodiment may include two cases:
Interruption case 1: if the word number of the process words (the partial transcript produced so far) is larger than the preset word-number threshold, the robot voice output is stopped and, after the user voice output is finished, intention recognition is performed.
Specifically, interruption case 1 includes:
if the word number of the process words is larger than the preset word-number threshold and the user continues speaking, playing is stopped; if the intention of the continued speech is the continuation class, the original voice output is resumed normally after the user finishes speaking;
if the word number of the process words is larger than the preset word-number threshold and the user continues speaking, playing is stopped; if the intention of the continued speech is the objection-continuation class, a preset transition phrase is played first after the user finishes speaking, and then the original voice output is resumed normally;
if the word number of the process words is larger than the preset word-number threshold and the user continues speaking, playing is stopped; if the intention of the continued speech is rejection, the robot voice output is ended.
Interruption case 2: if the word number of the process words is less than or equal to the preset word-number threshold, intention recognition is performed directly; if the recognition result is an interrupting intention, the robot voice output is stopped, while if the recognition result is a non-interrupting intention, the robot voice output continues and the broadcast is unaffected.
Specifically, interruption case 2 includes:
if the word number of the process words is less than or equal to the preset word-number threshold and the interrupting intention is judged to be the continuation class, the interrupting intention is not processed and the voice output is played normally;
if the word number of the process words is less than or equal to the preset word-number threshold and the interrupting intention is judged to be the objection-continuation class, a preset objection script is played; playing stops if the user continues speaking, and after the user's continued speech is judged to be the continuation class the original voice output is played normally;
if the word number of the process words is less than or equal to the preset word-number threshold and the interrupting intention is judged to be the objection class, a preset objection script and an auxiliary script are played; playing stops if the user continues speaking, the user intention is judged again, and the voice output is replayed according to the user's final intention;
if the word number of the process words is less than or equal to the preset word-number threshold and the interrupting intention is judged to be rejection, the robot voice output is ended.
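The two interruption cases above can be condensed into a small decision routine; the word-number threshold value and the intent labels below are illustrative assumptions taken from this embodiment's description:

```python
# Consolidated sketch of interruption handling by word-count threshold.
WORD_THRESHOLD = 5  # illustrative preset word-number threshold

def handle_interruption(word_count, intent):
    if word_count <= WORD_THRESHOLD:
        # Few words: recognize the intention directly and act on it.
        if intent == "continue":           return "keep_playing"
        if intent == "objection_continue": return "play_objection_then_resume"
        if intent == "objection":          return "play_objection_and_auxiliary"
        if intent == "reject":             return "end_output"
        return "keep_playing"              # non-interrupting intent: keep broadcasting
    # Many words: stop output, wait for the user to finish, then act on intent.
    if intent == "continue":               return "resume_after_user_finishes"
    if intent == "objection_continue":     return "play_transition_then_resume"
    if intent == "reject":                 return "end_output"
    return "stop_and_recognize"
```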
In this embodiment, when a joined broadcast is required, an identifier indicating whether the sentence is a "continue broadcast" is added to the Text-To-Speech (TTS) request, and the TTS engine splices and returns audio according to this identifier. Text is converted into voice by TTS, which mainly comes in two types: the splicing method and the parametric method. The splicing method selects the required basic units from speech recorded in advance and concatenates them; units may be syllables, phonemes, and so on, and diphones are also commonly used as units to improve the continuity of the synthesized speech. Its advantage is higher speech quality; its disadvantage is the large database required — typically tens of hours of finished corpus, at high cost. The parametric method generates speech parameters (including fundamental frequency, formant frequencies, etc.) at every moment from a statistical model and then converts the parameters into a waveform. It is mainly divided into three modules: front end, back end, and vocoder. The front end analyzes the text to determine the pronunciation of each character, the tone of the sentence, the rhythm with which it is read, which places should be emphasized, and so on. Common prosody-related annotations include, but are not limited to, prosodic boundaries, accents, boundary tones, and even emotions. The database requirement is relatively small, and optionally the parametric method may be adopted in this embodiment for the joined broadcast.
The present embodiment also provides a computer-readable storage medium on which a computer program is stored, which when executed by a processor implements any of the methods in the present embodiments.
The present embodiment further provides an electronic terminal, including: a processor and a memory;
the memory is used for storing computer programs, and the processor is used for executing the computer programs stored by the memory so as to enable the terminal to execute the method in the embodiment.
The computer-readable storage medium in the present embodiment can be understood by those skilled in the art as follows: all or part of the steps for implementing the above method embodiments may be performed by hardware associated with a computer program. The aforementioned computer program may be stored in a computer readable storage medium. When executed, the program performs steps comprising the method embodiments described above; and the aforementioned storage medium includes: various media that can store program codes, such as ROM, RAM, magnetic or optical disks.
The electronic device provided by the embodiment comprises a processor, a memory, a transceiver, and a communication interface. The memory and the communication interface are connected with the processor and the transceiver to complete mutual communication; the memory stores a computer program, the communication interface is used for communication, and the processor and the transceiver run the computer program so that the electronic terminal executes the steps of the above method.
As shown in fig. 3, the electronic device may include: an input device 1100, a first processor 1101, an output device 1102, a first memory 1103, and at least one communication bus 1104. The communication bus 1104 is used to implement communication connections between the elements. The first memory 1103 may include a high-speed RAM memory, and may also include a non-volatile storage NVM, such as at least one disk memory, and the first memory 1103 may store various programs for performing various processing functions and implementing the method steps of the present embodiment.
Fig. 4 is a hardware structure of an electronic device provided in another embodiment, and the electronic device in this embodiment may include a second processor 1201 and a second memory 1202. The second processor 1201 executes the computer program code stored in the second memory 1202 to implement the method described in fig. 1 in the above embodiment. The second memory 1202 is configured to store various types of data to support operations at the terminal device. Examples of such data include instructions for any application or method operating on the terminal device, such as messages, pictures, videos, and so forth. The second memory 1202 may include a Random Access Memory (RAM) and may also include a non-volatile memory (non-volatile memory), such as at least one disk memory. The electronic device may further include: communication components 1203, power components 1204, multimedia components 1205, audio components 1206, input/output interfaces 1207, and/or sensor components 1208. The specific components included in the terminal device are set according to actual requirements, which is not limited in this embodiment.
In this embodiment, the memory may include a random access memory (RAM), and may also include a non-volatile memory, such as at least one disk memory.
The processor may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), and the like; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
In the above embodiments, unless otherwise specified, the description of common objects by using "first", "second", etc. ordinal numbers only indicate that they refer to different instances of the same object, rather than indicating that the objects being described must be in a given sequence, whether temporally, spatially, in ranking, or in any other manner. In the above-described embodiments, reference in the specification to "the embodiment," "an embodiment," "another embodiment," or "other embodiments" means that a particular feature, structure, or characteristic described in connection with the embodiments is included in at least some embodiments, but not necessarily all embodiments. The various appearances of the phrase "the present embodiment," "one embodiment," or "another embodiment" are not necessarily all referring to the same embodiment. If the specification states a component, feature, structure, or characteristic "may", "might", or "could" be included, that particular component, feature, structure, or characteristic is not necessarily included. If the specification or claim refers to "a" or "an" element, that does not mean there is only one of the element. If the specification or claim refers to "a further" element, that does not preclude there being more than one of the further element.
The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the system embodiment, since it is substantially similar to the method embodiment, the description is simple, and for the relevant points, reference may be made to the partial description of the method embodiment.
The invention is operational with numerous general purpose or special purpose computing system environments or configurations. For example: personal computers, server computers, hand-held or portable devices, tablet-type devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like.
The invention may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The invention may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
In the embodiments described above, although the present invention has been described in conjunction with specific embodiments thereof, many alternatives, modifications, and variations of these embodiments will be apparent to those skilled in the art in light of the foregoing description. For example, other memory structures (e.g., dynamic RAM (DRAM)) may use the discussed embodiments. The embodiments of the invention are intended to embrace all such alternatives, modifications and variances that fall within the broad scope of the appended claims.
The foregoing embodiments are merely illustrative of the principles of the present invention and its efficacy, and are not to be construed as limiting the invention. Any person skilled in the art can modify or change the above-mentioned embodiments without departing from the spirit and scope of the present invention. Accordingly, it is intended that all equivalent modifications or changes which can be made by those skilled in the art without departing from the spirit and technical spirit of the present invention be covered by the claims of the present invention.

Claims (10)

1. A dialogue processing method for a dialogue robot, comprising:
acquiring real-time voice data of a user, and executing robot voice output according to a preset flow;
when voice output conflict occurs between the real-time voice data and robot voice output in a time dimension, acquiring text information and user voice interval time according to the real-time voice data, wherein the type of the voice output conflict comprises a first conflict type used for representing interruption;
when the user is in the first conflict type, executing a first processing flow according to the word number of the process words in the word information, wherein the first processing flow comprises stopping the robot voice output, and performing intention identification after the user stops outputting;
if the recognition result is the non-interruption intention, continuing to execute voice output according to the original flow;
and if the recognition result is the interruption intention, executing a second processing flow, wherein the second processing flow comprises judging whether the user intention is an original intention or a new intention according to the process characters, the interval time and the user intention, and executing strategies and contents of robot voice output corresponding to different intentions according to a preset mapping relation according to the judgment result.
2. The dialogue processing method of a dialogue robot according to claim 1, wherein the types of the voice output conflict further comprise a second conflict type for indicating a barge-in;
and when the conflict is of the second conflict type, triggering the robot to splice its original content with the barge-in content of the user and then play the result, so as to judge the intention of the user again, and executing the strategy and the content of the robot voice output corresponding to different intentions according to the preset mapping relation.
3. The dialogue processing method of a dialogue robot according to claim 1, wherein when in the first conflict type, the word number of the process words is compared with a preset word number threshold value:
if the word number of the process characters is larger than a preset word number threshold value, stopping the robot voice output, and after the user voice output is finished, performing intention identification;
if the process character number is less than or equal to a preset character number threshold value, directly performing intention identification; if the recognition result is the interruption intention, stopping the robot voice output; and if the recognition result is the non-interruption intention, continuing the robot voice output.
4. The conversation process method of a conversation robot according to claim 3, further comprising, when in the first conflict type:
if the word number of the process characters is less than or equal to a preset word number threshold value and the interruption intention is judged to be a continuous playing type, the interruption intention of the continuous playing type is not processed, and the voice is normally played and output;
if the word number of the process characters is less than or equal to a preset word number threshold value and the interruption intention is judged to be objected continuous playing, playing a preset objection dialog, stopping playing when the user continues speaking, and normally playing the original voice output after the user's continued speech is judged to be continuous playing;
if the word number of the process characters is less than or equal to a preset word number threshold value and the interruption intention is judged to be objection, playing preset objection dialogues and auxiliary dialogues, stopping playing when the user continues speaking, judging the intention of the user again, and playing the voice again according to the final intention of the user for outputting;
and if the word number of the process characters is less than or equal to the preset word number threshold value and the interruption intention is judged to be refused, ending the robot voice output.
5. The conversation processing method of a conversation robot according to claim 3, further comprising, in the first conflict type:
if the word count of the in-progress text is greater than the preset word-count threshold and the user continues speaking, stopping playback; if the intention of the continued speech is of the continue-playing type, resuming normal playback of the original voice output after the user finishes speaking;
if the word count of the in-progress text is greater than the preset word-count threshold and the user continues speaking, stopping playback; if the intention of the continued speech is of the objection-then-continue type, first playing a preset transition script after the user finishes speaking, and then resuming normal playback of the original voice output;
and if the word count of the in-progress text is greater than the preset word-count threshold and the user continues speaking, stopping playback; if the intention of the continued speech is a refusal, ending the robot voice output.
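Claim 5 mirrors claim 4 for the above-threshold case: playback stops first, and the post-speech action depends on the intent of the user's continued speech. A sketch under the same illustrative labels:

```python
def dispatch_long_interruption(intent):
    """Map the intent of continued speech (above-threshold case) to an action.

    Assumes playback has already stopped; labels are illustrative.
    """
    if intent == "continue_play":
        return "resume_original_audio"
    if intent == "objection_continue":
        # Bridge with a preset transition script before resuming.
        return "play_transition_script_then_resume"
    if intent == "refusal":
        return "end_voice_output"
    return "resume_original_audio"
```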
6. The conversation processing method of a conversation robot according to claim 2, wherein:
a time threshold is preset, and when the interval between two consecutive sentences of the user's voice output is less than the time threshold, the user's voice output is determined to be in a second conflict type;
when the second conflict type is determined, the previous intention and the current intention are merged and handled as multiple intentions within the turn matching the previous intention; when the user barges in several times in succession, the robot's waiting time is adjusted and a preset bridging script is played; after the user's speech is completed, the script corresponding to the merged intention is played without repeating content already played before the interruption.
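The second-conflict detection and intent merging of claim 6 can be sketched as follows; the threshold value, the tuple layout, and the intent names are assumptions for illustration:

```python
TIME_THRESHOLD = 1.5  # preset time threshold in seconds (value is illustrative)

def merge_if_second_conflict(prev, curr):
    """Merge consecutive user intents when their utterances are close in time.

    prev: (intent, end_time) of the previous user utterance.
    curr: (intent, start_time) of the current user utterance.
    Returns the list of intents to handle in the current turn.
    """
    prev_intent, prev_end = prev
    curr_intent, curr_start = curr
    if curr_start - prev_end < TIME_THRESHOLD:
        # Second conflict type: treat both intents as one multi-intent turn.
        return [prev_intent, curr_intent]
    return [curr_intent]
```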
7. The conversation processing method of a conversation robot according to claim 1, wherein the intention recognition comprises:
setting a main natural language processor and a plurality of sub natural language processors, and acquiring the intention of the real-time voice data through the main natural language processor;
dispatching the real-time voice data to the sub natural language processors, wherein the main natural language processor and each sub natural language processor handle different query intentions;
acquiring the intention recognition results of the plurality of sub natural language processors and feeding all recognition results back to the main natural language processor;
and evaluating all recognition results according to their confidence degrees, the main natural language processor selecting one recognition result as the final intention recognition result according to the evaluation.
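The fan-out-and-arbitrate scheme of claim 7 can be sketched as a main processor that dispatches the utterance to several sub-processors and keeps the highest-confidence result. The sub-processor callables here are hypothetical stand-ins, not the patent's interfaces:

```python
def arbitrate(text, sub_processors):
    """Main-processor arbitration over sub natural language processors.

    Each sub-processor is a callable returning (intent, confidence);
    the result with the highest confidence is selected as final.
    """
    results = [proc(text) for proc in sub_processors]
    return max(results, key=lambda result: result[1])
```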
8. A conversation processing apparatus for a conversation robot, comprising:
a voice acquisition module for acquiring real-time voice data of a user;
a voice output module for executing robot voice output according to a preset flow; and
a processing module comprising a recognition unit and a control unit;
wherein, when the real-time voice data conflicts with the robot voice output in the time dimension, the recognition unit acquires text information and the user voice interval time from the real-time voice data, the voice output conflict types comprising a first conflict type representing an interruption;
in the first conflict type, the control unit executes a first processing flow according to the word count of the in-progress text in the text information, the first processing flow comprising stopping the robot voice output, waiting for the user to stop speaking, and performing intention recognition through the recognition unit;
if the recognition result is a non-interruption intention, voice output continues according to the original flow;
and if the recognition result is an interruption intention, the control unit executes a second processing flow comprising judging, according to the in-progress text, the interval time and the user intention, whether the user intention is the original intention or a new intention, and executing the robot voice output strategy and content corresponding to the judged intention according to a preset mapping relation.
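The module layout of claim 8 can be sketched as plain classes; the class names mirror the claim's modules, but the interfaces shown are assumptions for illustration, not the patent's API:

```python
from dataclasses import dataclass


class VoiceAcquisitionModule:
    """Acquires the user's real-time voice data."""
    def capture(self):
        raise NotImplementedError  # e.g. read frames from a microphone stream


class VoiceOutputModule:
    """Executes robot voice output according to a preset flow."""
    def play(self, script):
        raise NotImplementedError  # e.g. hand the script to a TTS engine


@dataclass
class ProcessingModule:
    """Bundles the two units the claim names inside the processing module."""
    recognition_unit: object  # extracts text info and user voice interval time
    control_unit: object      # runs the first/second processing flows
```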
9. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the method of any of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1 to 7.
CN202111432736.2A 2021-11-29 2021-11-29 Dialogue processing method and device for dialogue robot, electronic equipment and medium Pending CN114064858A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111432736.2A CN114064858A (en) 2021-11-29 2021-11-29 Dialogue processing method and device for dialogue robot, electronic equipment and medium


Publications (1)

Publication Number Publication Date
CN114064858A true CN114064858A (en) 2022-02-18

Family

ID=80277017

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111432736.2A Pending CN114064858A (en) 2021-11-29 2021-11-29 Dialogue processing method and device for dialogue robot, electronic equipment and medium

Country Status (1)

Country Link
CN (1) CN114064858A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114528822A (en) * 2022-02-25 2022-05-24 平安科技(深圳)有限公司 Conversation process control method, device, server and medium for customer service robot
WO2023159749A1 (en) * 2022-02-25 2023-08-31 平安科技(深圳)有限公司 Dialogue process control method and apparatus of customer service robot, server and medium
CN114528822B (en) * 2022-02-25 2024-02-06 平安科技(深圳)有限公司 Conversation flow control method and device of customer service robot, server and medium
CN116798427A (en) * 2023-06-21 2023-09-22 支付宝(杭州)信息技术有限公司 Man-machine interaction method based on multiple modes and digital man system

Similar Documents

Publication Publication Date Title
US11264030B2 (en) Indicator for voice-based communications
US20220020357A1 (en) On-device learning in a hybrid speech processing system
US10074369B2 (en) Voice-based communications
US10453449B2 (en) Indicator for voice-based communications
US10482885B1 (en) Speaker based anaphora resolution
US8064573B2 (en) Computer generated prompting
US11869495B2 (en) Voice to voice natural language understanding processing
DE112021001064T5 (en) Device-directed utterance recognition
CN114064858A (en) Dialogue processing method and device for dialogue robot, electronic equipment and medium
US11276403B2 (en) Natural language speech processing application selection
US20240144933A1 (en) Voice-controlled communication requests and responses
US20240005923A1 (en) Systems and methods for disambiguating a voice search query
US20240029743A1 (en) Intermediate data for inter-device speech processing
DE112022000504T5 (en) Interactive content delivery
WO2018045154A1 (en) Voice-based communications
US10957313B1 (en) System command processing
DE112021000292T5 (en) VOICE PROCESSING SYSTEM
CN112102807A (en) Speech synthesis method, apparatus, computer device and storage medium
CN113132214A (en) Conversation method, device, server and storage medium
CN112397053B (en) Voice recognition method and device, electronic equipment and readable storage medium
US11763809B1 (en) Access to multiple virtual assistants
Neto et al. The development of a multi-purpose spoken dialogue system.
CN117496973B (en) Method, device, equipment and medium for improving man-machine conversation interaction experience
CN113421549A (en) Speech synthesis method, speech synthesis device, computer equipment and storage medium
JP2022161353A (en) Information output system, server device and information output method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination