CN110557451B

CN110557451B - Dialogue interaction processing method and device, electronic equipment and storage medium

Info

Publication number: CN110557451B
Application number: CN201910817112.9A
Authority: CN
Inventors: 刘瑛; 孙珂; 赵媛媛; 孙叔琦; 常月; 孙辉丰; 陈雷; 李婷婷
Original assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Current assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Priority date: 2019-08-30
Filing date: 2019-08-30
Publication date: 2021-02-05
Anticipated expiration: 2039-08-30
Also published as: CN110557451A

Abstract

The application provides a dialogue interaction processing method and apparatus, an electronic device, and a storage medium. The method includes: when it is detected that an intelligent calling device has established a dialogue interaction connection, establishing an uplink channel and a downlink channel with an agent control through a full-duplex component; receiving text information forwarded by the agent control and the broadcast state of the intelligent calling device; and processing the text information and the broadcast state to generate an asynchronous signal, and sending the asynchronous signal to the agent control so that the agent control sends it to the intelligent calling device for corresponding processing. This addresses the problem in prior-art dialogue interaction that an unsmooth dialogue process and misunderstanding of the user's intention lead to a poor interaction effect. The intelligent calling device is connected to the agent control, and the uplink and downlink channels established with the agent control through the full-duplex component transmit data such as voice, text, and broadcast state in real time, which improves dialogue interaction efficiency while keeping the dialogue smooth and meeting the user's needs.

Description

Dialogue interaction processing method and device, electronic equipment and storage medium

Technical Field

The present application relates to the field of artificial intelligence technologies, and in particular, to a method and an apparatus for processing dialog interaction, an electronic device, and a storage medium.

Background

At present, with the continuous development of artificial intelligence technology, more and more scenes support man-machine intelligent conversation, such as conversation between a user and a robot customer service, and man-machine intelligent conversation becomes a common conversation interaction mode in people's life.

In the related art, a robot customer service system needs to execute three remote calls in sequence, calling ASR (Automatic Speech Recognition), the semantic understanding and dialogue service, and TTS (Text To Speech) respectively, to realize man-machine intelligent dialogue.

Summary of the Application

The present application is directed to solving, at least to some extent, one of the technical problems in the related art described above.

Therefore, a first objective of the present application is to provide a dialogue interaction processing method that solves the prior-art problem of a poor dialogue interaction effect caused by an unsmooth dialogue process and misunderstanding of the user's intention. In the method, the intelligent calling device establishes a connection with the agent control, and an uplink channel and a downlink channel with the agent control are established through the full-duplex component to transmit data such as voice, text, and broadcast state in real time, thereby improving dialogue interaction efficiency, ensuring the fluency of the dialogue, and meeting the user's needs.

A second object of the present application is to provide a dialogue interaction processing apparatus.

A third object of the present application is to propose a computer device.

A fourth object of the present application is to propose a non-transitory computer-readable storage medium.

In order to achieve the above object, an embodiment of a first aspect of the present application provides a dialog interaction processing method, including: when the intelligent calling equipment is detected to establish conversation interactive connection, an uplink channel and a downlink channel with the agent control are established through the full-duplex component; receiving text information forwarded by the agent control and the broadcasting state of the intelligent calling equipment; the intelligent calling equipment receives voice information and sends the voice information to the agent control, and the voice information is converted to generate the text information; and processing the text information and the broadcast state to generate an asynchronous signal and sending the asynchronous signal to the agent control so that the agent control sends the asynchronous signal to the intelligent calling equipment to perform corresponding processing.

In addition, the dialogue interaction processing method in the embodiment of the application also has the following additional technical features:

Optionally, processing the text information and the broadcast state to generate an asynchronous signal, and sending the asynchronous signal to the agent control so that the agent control sends the asynchronous signal to the intelligent calling device for corresponding processing, includes: performing semantic analysis on the text information, and judging according to the semantic analysis result whether a preset interruption condition is met; if the preset interruption condition is met and the broadcast state indicates that the intelligent calling device is currently broadcasting, generating an interrupt signal; and sending the interrupt signal to the agent control so that the agent control sends the interrupt signal to the intelligent calling device to stop broadcasting.

Optionally, the preset interruption condition includes: one or more of preset key press, intention interruption and preset keyword interruption.

Optionally, the processing the text information and the broadcast state to generate an asynchronous signal, and sending the asynchronous signal to the agent control, so that the agent control sends the asynchronous signal to the intelligent call device to perform corresponding processing, including: determining the broadcast state as a state to be broadcast according to the broadcast state, performing semantic recognition on the text information to obtain abnormal intention response, and calling a preset reply text from a preset database; and generating a preset reply voice from the preset reply text and sending the preset reply voice to the agent control, wherein the agent control sends the preset reply voice to the intelligent calling equipment for broadcasting.

Optionally, after the preset reply voice is generated from the preset reply text and sent to the agent control, and the agent control sends the preset reply voice to the intelligent calling device for broadcasting, the method further includes: if no text information forwarded by the agent control is received within a preset time threshold, determining that a silence condition is met; and calling a target text from the preset database, generating target voice information from the target text, and sending the target voice information to the agent control so that the agent control sends the target voice information to the intelligent calling device for broadcasting.

Optionally, the processing the text information and the broadcast state to generate an asynchronous signal, and sending the asynchronous signal to the agent control, so that the agent control sends the asynchronous signal to the intelligent call device to perform corresponding processing, including: extracting key words in the text information; and generating a reply text according to the keyword, generating reply voice from the reply text and sending the reply voice to the agent control, and sending the reply voice to the intelligent calling equipment by the agent control for broadcasting.

Optionally, processing the text information and the broadcast state to generate an asynchronous signal, and sending the asynchronous signal to the agent control so that the agent control sends the asynchronous signal to the intelligent calling device for corresponding processing, includes: determining, according to the broadcast state, that the intelligent calling device is in a to-be-broadcast state, and performing text detection on the text information; if the text detection result is a text error, correcting the text information; and generating a text to be replied according to the corrected text information, generating a voice to be replied from the text to be replied, and sending the voice to be replied to the agent control, which sends it to the intelligent calling device for broadcasting.

Optionally, after receiving the text information forwarded by the agent control and the broadcast state of the intelligent calling device, performing semantic analysis on the text information; and if the triggering control event is determined according to the semantic analysis result, generating a corresponding control instruction and sending the control instruction to the agent control so that the agent control sends the control instruction to the intelligent calling equipment for corresponding control operation.

To achieve the above object, a second aspect of the present application provides a dialog interaction processing apparatus, including: the establishing module is used for establishing an uplink channel and a downlink channel with the agent control through the full duplex component when detecting that the intelligent calling equipment establishes the interactive connection; the receiving module is used for receiving the text information forwarded by the agent control and the broadcasting state of the intelligent calling equipment; the intelligent calling equipment receives voice information and sends the voice information to the agent control, and the voice information is converted to generate the text information; and the processing module is used for processing the text information and the broadcast state to generate an asynchronous signal and sending the asynchronous signal to the agent control so that the agent control sends the asynchronous signal to the intelligent calling equipment to perform corresponding processing.

In addition, the dialogue interaction processing device of the embodiment of the application also has the following additional technical features:

Optionally, the processing module is specifically configured to: perform semantic analysis on the text information, and judge according to the semantic analysis result whether a preset interruption condition is met; if the preset interruption condition is met and the broadcast state indicates that the intelligent calling device is currently broadcasting, generate an interrupt signal; and send the interrupt signal to the agent control so that the agent control sends the interrupt signal to the intelligent calling device to stop broadcasting.

Optionally, the processing module is specifically configured to: determining the broadcast state as a state to be broadcast according to the broadcast state, performing semantic recognition on the text information to obtain abnormal intention response, and calling a preset reply text from a preset database; and generating a preset reply voice from the preset reply text and sending the preset reply voice to the agent control, wherein the agent control sends the preset reply voice to the intelligent calling equipment for broadcasting.

Optionally, the apparatus further includes: a determining module, configured to determine that a silence condition is met if no text information forwarded by the agent control is received within a preset time threshold; and a calling and generating module, configured to call a target text from a preset database, generate target voice information from the target text, and send the target voice information to the agent control, so that the agent control sends the target voice information to the intelligent calling device for broadcasting.

Optionally, the processing module is specifically configured to: extracting key words in the text information; and generating a reply text according to the keyword, generating reply voice from the reply text and sending the reply voice to the agent control, and sending the reply voice to the intelligent calling equipment by the agent control for broadcasting.

Optionally, the processing module is specifically configured to: determine, according to the broadcast state, that the intelligent calling device is in a to-be-broadcast state, and perform text detection on the text information; if the text detection result is a text error, correct the text information; and generate a text to be replied according to the corrected text information, generate a voice to be replied from the text to be replied, and send the voice to be replied to the agent control, which sends it to the intelligent calling device for broadcasting.

Optionally, the apparatus further includes an analysis module, configured to perform semantic analysis on the text information; and the generating module is used for generating a corresponding control instruction and sending the control instruction to the agent control if the semantic analysis result is determined to be the trigger control event, so that the agent control sends the control instruction to the intelligent calling equipment for corresponding control operation.

To achieve the above object, a third aspect of the present application provides a computer device, including: a processor and a memory; the processor reads the executable program code stored in the memory to run a program corresponding to the executable program code, so as to implement the dialog interaction processing method according to the embodiment of the first aspect.

To achieve the above object, a fourth aspect of the present application provides a non-transitory computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements a dialog interaction processing method according to the first aspect.

To achieve the above object, a fifth aspect of the present application provides a computer program product, where instructions of the computer program product, when executed by a processor, implement the dialog interaction processing method according to the first aspect.

The technical scheme provided by the embodiment of the application can have the following beneficial effects:

When it is detected that the intelligent calling device has established a dialogue interaction connection, an uplink channel and a downlink channel with the agent control are established through the full-duplex component; text information forwarded by the agent control and the broadcast state of the intelligent calling device are received; and the text information and the broadcast state are processed to generate an asynchronous signal, which is sent to the agent control so that the agent control sends it to the intelligent calling device for corresponding processing. This addresses the problem in prior-art dialogue interaction that an unsmooth dialogue process and misunderstanding of the user's intention lead to a poor interaction effect. The intelligent calling device is connected to the agent control, and the uplink and downlink channels established with the agent control through the full-duplex component transmit data such as voice, text, and broadcast state in real time, which improves dialogue interaction efficiency while keeping the dialogue smooth and meeting the user's needs.

Additional aspects and advantages of the present application will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the present application.

Drawings

The foregoing and/or additional aspects and advantages of the present application will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:

FIG. 1 is a flow diagram of a dialog interaction processing method according to one embodiment of the present application;

FIG. 2 is an exemplary diagram of dialogue interaction processing module connections according to one embodiment of the present application;

FIG. 3 is a diagram of an example structure of a full-duplex assembly according to one embodiment of the present application;

FIG. 4 is a flow diagram of a dialog interaction processing method according to another embodiment of the present application;

FIG. 5 is a flow diagram of a dialog interaction processing method according to yet another embodiment of the present application;

FIG. 6 is a flow diagram of a method of conversational interaction processing according to yet another embodiment of the present application;

FIG. 7 is a block diagram of a dialogue interaction processing apparatus according to an embodiment of the present application;

FIG. 8 is a schematic structural diagram of a dialogue interaction processing apparatus according to another embodiment of the present application;

FIG. 9 is a schematic structural diagram of a dialogue interaction processing apparatus according to yet another embodiment of the present application.

Detailed Description

Reference will now be made in detail to embodiments of the present application, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are exemplary and intended to be used for explaining the present application and should not be construed as limiting the present application.

A dialogue interaction processing method, apparatus, electronic device, and storage medium according to embodiments of the present application are described below with reference to the drawings.

Aiming at the technical problems that the conversation process is not smooth and the intention of a user is wrongly understood in a conversation interaction processing mode in the prior art, so that the conversation interaction effect is poor, the conversation interaction processing method is provided, connection is established between intelligent calling equipment and an agent control, and an uplink channel and a downlink channel of the agent control are established through a full-duplex component to realize real-time transmission of data such as voice, text, playing states and the like, so that the conversation interaction efficiency is improved, the conversation smoothness is ensured, and the use requirements of the user are met.

Specifically, fig. 1 is a flowchart of a dialogue interaction processing method according to an embodiment of the present application, and as shown in fig. 1, the method includes:

Step 101, when it is detected that the intelligent calling device has established a dialogue interaction connection, establishing an uplink channel and a downlink channel with the agent control through the full-duplex component.

Specifically, as shown in FIG. 2, in the embodiment of the present application the intelligent calling device may establish a connection with the agent control through a dialogue interface based on the standard MRCP (v2) protocol. The agent control is connected to the ASR, the semantic understanding and dialogue service, and the TTS respectively, so that the user's audio serves as the input and the audio of the intelligent calling device serves as the output.

More specifically, as shown in FIG. 3, the semantic understanding and dialogue service includes a full-duplex component. When it is detected that the intelligent calling device has established a dialogue interaction connection, an uplink channel and a downlink channel with the agent control are established through the full-duplex component. It can be understood that detecting the connection means that the user has sent a connection request to the intelligent calling device and that, after the intelligent calling device fed back confirmation information, the two established a connection; that is, the intelligent calling device is in a state in which it can carry out dialogue interaction with the user.

As shown in FIG. 3, an uplink channel and a downlink channel that can work simultaneously are provided between the full-duplex component and the agent control. Through the uplink channel, the full-duplex component receives information sent by the agent control, such as text information, voice information, and the broadcast state of the intelligent calling device. Through the downlink channel, it sends event stream information, such as interrupt signals and reply voice, to the agent control. The full-duplex component further includes modules for receiving and sending, real-time semantic forwarding, computation, and the like, which process the text information and the broadcast state and judge interruption conditions, silence conditions, broadcast progress, and so on. When a set condition is met, an asynchronous signal is generated and sent to the intelligent calling device through the agent control, thereby completing real-time transmission of uplink and downlink text information and events, completing voice and semantic interaction, and realizing coordinated control of ASR and TTS.
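The simultaneous uplink and downlink channels described above can be illustrated with a minimal sketch. It assumes two in-process `asyncio` queues stand in for the channels; the names `UplinkMessage`, `DownlinkEvent`, and `FullDuplexComponent`, the state strings, and the placeholder rule inside `run` are all hypothetical and are not taken from the application.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class UplinkMessage:
    """What the agent control forwards: ASR text plus the device's broadcast state."""
    text: str
    broadcast_state: str  # assumed values: "broadcasting", "to_be_broadcast", "stopped"


@dataclass
class DownlinkEvent:
    """An asynchronous signal sent back toward the agent control."""
    kind: str  # assumed values: "interrupt" or "reply"
    payload: str = ""


class FullDuplexComponent:
    """Hypothetical stand-in for the full-duplex component of FIG. 3."""

    def __init__(self) -> None:
        self.uplink: asyncio.Queue = asyncio.Queue()
        self.downlink: asyncio.Queue = asyncio.Queue()

    async def run(self) -> None:
        """Consume uplink messages and emit downlink events until None arrives."""
        while True:
            msg = await self.uplink.get()
            if msg is None:  # sentinel: the connection was closed
                break
            if msg.broadcast_state == "broadcasting" and msg.text:
                # Placeholder rule only: any recognized speech during playback interrupts.
                await self.downlink.put(DownlinkEvent("interrupt"))
            else:
                await self.downlink.put(DownlinkEvent("reply", f"echo: {msg.text}"))


async def demo() -> list:
    component = FullDuplexComponent()
    worker = asyncio.create_task(component.run())
    await component.uplink.put(UplinkMessage("wait a moment", "broadcasting"))
    await component.uplink.put(UplinkMessage("hello", "to_be_broadcast"))
    await component.uplink.put(None)
    await worker
    events = []
    while not component.downlink.empty():
        events.append(component.downlink.get_nowait())
    return events
```

Because the two queues are independent, the component can keep reading uplink text while earlier downlink events are still waiting to be delivered, which is the property the full-duplex arrangement relies on.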

Step 102, receiving text information forwarded by the agent control and the broadcast state of the intelligent calling device, where the intelligent calling device receives voice information and sends it to the agent control, and the voice information is converted to generate the text information.

Specifically, after the dialogue interaction starts, the user can send voice information to the intelligent calling device as needed. The intelligent calling device receives the voice information and sends it to the agent control, and the agent control converts the voice information into text information through the ASR and obtains the broadcast state of the intelligent calling device. The text information forwarded by the agent control and the broadcast state of the intelligent calling device can therefore be received through the uplink channel.

The broadcast state may be, for example, a broadcasting state, a to-be-broadcast state, or a stopped state.

Step 103, processing the text information and the broadcast state to generate an asynchronous signal, and sending the asynchronous signal to the agent control so that the agent control sends the asynchronous signal to the intelligent calling device for corresponding processing.

Specifically, there are various ways to process the text information and the broadcast state to generate an asynchronous signal and send it to the agent control so that the agent control sends it to the intelligent calling device for corresponding processing. The approach can be selected according to actual application needs, for example as follows:

In a first example, semantic analysis is performed on the text information. If the semantic analysis result meets a preset interruption condition and the broadcast state indicates that the intelligent calling device is currently broadcasting, an interrupt signal is generated and sent to the agent control, so that the agent control sends the interrupt signal to the intelligent calling device to stop broadcasting.

In a second example, the broadcast state indicates a to-be-broadcast state and semantic recognition identifies the text information as an abnormal intention response. A preset reply text is called from a preset database, a preset reply voice is generated from the preset reply text and sent to the agent control, and the agent control sends the preset reply voice to the intelligent calling device for broadcasting.

In a third example, keywords in the text information are extracted, a reply text is generated according to the keywords, a reply voice is generated from the reply text and sent to the agent control, and the agent control sends the reply voice to the intelligent calling device for broadcasting.

In a fourth example, the broadcast state indicates a to-be-broadcast state and text detection is performed on the text information. If the text detection result is a text error, the text information is corrected, a text to be replied is generated according to the corrected text information, a voice to be replied is generated from the text to be replied and sent to the agent control, and the agent control sends the voice to be replied to the intelligent calling device for broadcasting.

Specifically, semantic error correction is incorporated to correct speech recognition errors and errors in the text information produced by ASR. Before the text information is sent for semantic recognition, it is corrected according to the fluency of the Chinese expression, which improves the accuracy of end-to-end semantic understanding. An example scenario is as follows. User: "I had a traffic collision while form (form -> driving) in the post factory village." Intelligent calling device: "OK, the address of the incident has been recorded for you."
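The four examples of step 103 can be read as a dispatch on the broadcast state and on the result of analysing the text. The following is a minimal sketch under stated assumptions: the enum values, the `Analysis` flags, and especially the priority order among the branches are illustrative choices, since the application does not specify how the examples are combined.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class BroadcastState(Enum):
    BROADCASTING = "broadcasting"
    TO_BE_BROADCAST = "to_be_broadcast"
    STOPPED = "stopped"


class SignalKind(Enum):
    INTERRUPT = "interrupt"              # first example
    PRESET_REPLY = "preset_reply"        # second example
    KEYWORD_REPLY = "keyword_reply"      # third example
    CORRECTED_REPLY = "corrected_reply"  # fourth example


@dataclass
class Analysis:
    """Hypothetical summary of the semantic analysis of one piece of text."""
    meets_interrupt_condition: bool = False
    abnormal_intent: bool = False
    text_error: bool = False


def choose_signal(state: BroadcastState, analysis: Analysis) -> Optional[SignalKind]:
    """Pick which asynchronous signal to generate, or None to leave playback alone."""
    if state is BroadcastState.BROADCASTING:
        # Assumption: while broadcasting, the only possible signal is an interrupt.
        return SignalKind.INTERRUPT if analysis.meets_interrupt_condition else None
    if state is BroadcastState.TO_BE_BROADCAST:
        if analysis.abnormal_intent:
            return SignalKind.PRESET_REPLY
        if analysis.text_error:
            return SignalKind.CORRECTED_REPLY
    # Assumption: everything else falls back to a keyword-based reply.
    return SignalKind.KEYWORD_REPLY
```

A real system would likely derive the `Analysis` flags from the semantic understanding service; here they are plain booleans so that the branching itself is easy to follow.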

To sum up, in the dialogue interaction processing method according to the embodiment of the application, when it is detected that the intelligent calling device has established a dialogue interaction connection, an uplink channel and a downlink channel with the agent control are established through the full-duplex component; text information forwarded by the agent control and the broadcast state of the intelligent calling device are received, where the intelligent calling device receives voice information and sends it to the agent control, and the voice information is converted to generate the text information; and the text information and the broadcast state are processed to generate an asynchronous signal, which is sent to the agent control so that the agent control sends it to the intelligent calling device for corresponding processing. This addresses the problem in prior-art dialogue interaction that an unsmooth dialogue process and misunderstanding of the user's intention lead to a poor interaction effect. The intelligent calling device is connected to the agent control, and the uplink and downlink channels established with the agent control through the full-duplex component transmit data such as voice, text, and broadcast state in real time, which improves dialogue interaction efficiency while keeping the dialogue smooth and meeting the user's needs.

Fig. 4 is a flowchart of a dialogue interaction processing method according to another embodiment of the present application, as shown in fig. 4, the method includes:

Step 201, when it is detected that the intelligent calling device has established a dialogue interaction connection, establishing an uplink channel and a downlink channel with the agent control through the full-duplex component.

Step 202, receiving text information forwarded by the agent control and the broadcast state of the intelligent calling device, where the intelligent calling device receives voice information and sends it to the agent control, and the voice information is converted to generate the text information.

It should be noted that steps 201 to 202 are the same as steps 101 to 102; refer to the description of steps 101 to 102, which is not repeated here.

Step 203, performing semantic analysis on the text information, and judging according to the semantic analysis result whether a preset interruption condition is met.

Step 204, if the preset interruption condition is met according to the semantic analysis result and the broadcast state indicates that the intelligent calling device is currently broadcasting, generating an interrupt signal.

Step 205, sending the interrupt signal to the agent control so that the agent control sends the interrupt signal to the intelligent calling device to stop broadcasting.

Specifically, when the user inputs voice information while the intelligent calling device is broadcasting, the agent control receives the voice information and the ASR recognizes the effective speech. Semantic analysis is then performed on the resulting text information, and the semantic analysis result is checked against the interruption conditions: whether it is a preset key press, that is, a judgment made by custom logic (for example, identifying a meaningless interruption from the number of words); whether it is a meaningless interruption; and whether it is a keyword interruption, where the keywords may come from a user-defined word list; and so on. When an interruption condition is met, the full-duplex component generates an interrupt signal for the intelligent calling device and issues it to the agent control, which forwards it to the intelligent calling device, and the intelligent calling device immediately stops broadcasting.

An example scenario is as follows. Intelligent calling device: "Hello sir, this is ......" User: "Wait, you are not a robot, are you?" Intelligent calling device: "Sorry, you found me out. Hello sir, this is the XX customer service center."

In this way, the user receives a timely response in interruption scenarios during the man-machine conversation, and the input and output rhythms of ASR and TTS are controlled, so that the man-machine conversation is as natural and smooth as a conversation between people.
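Steps 203 to 205 can be sketched as a small decision function. This is a hedged illustration only: it covers the keyword and intention checks plus a word-count filter for meaningless input, it leaves out the preset key press, and the names `InterruptConfig`, `meets_interrupt_condition`, and `should_interrupt`, the default keyword list, and the intent label `question_robot` are all hypothetical.

```python
import re
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class InterruptConfig:
    """Hypothetical developer-configurable interruption settings."""
    keywords: Set[str] = field(default_factory=lambda: {"wait", "stop"})
    interrupt_intents: Set[str] = field(default_factory=lambda: {"question_robot"})
    min_words: int = 2  # shorter utterances without a keyword count as meaningless


def meets_interrupt_condition(text: str, intent: Optional[str],
                              config: InterruptConfig) -> bool:
    """Return True if the recognized text satisfies an interruption condition."""
    words = re.findall(r"[a-z']+", text.lower())
    # Keyword interruption: any word from the user-defined word list.
    if any(word in config.keywords for word in words):
        return True
    # Custom logic: very short input with no keyword is treated as meaningless.
    if len(words) < config.min_words:
        return False
    # Intention interruption: the recognized intent is in the configured set.
    return intent in config.interrupt_intents


def should_interrupt(text: str, intent: Optional[str], broadcast_state: str,
                     config: Optional[InterruptConfig] = None) -> bool:
    """Generate an interrupt only while the device is actually broadcasting."""
    config = config or InterruptConfig()
    return (broadcast_state == "broadcasting"
            and meets_interrupt_condition(text, intent, config))
```

For instance, `should_interrupt("Wait, you are not a robot, are you?", None, "broadcasting")` evaluates to `True` because of the keyword "wait", while the same text in the `"to_be_broadcast"` state produces no interrupt.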

Fig. 5 is a flowchart of a dialogue interaction processing method according to another embodiment of the present application, as shown in fig. 5, the method includes:

Step 301, when it is detected that the intelligent calling device has established a dialogue interaction connection, establishing an uplink channel and a downlink channel with the agent control through the full-duplex component.

Step 302, receiving text information forwarded by the agent control and the broadcast state of the intelligent calling device, where the intelligent calling device receives voice information and sends it to the agent control, and the voice information is converted to generate the text information.

It should be noted that steps 301 to 302 are the same as steps 101 to 102; refer to the description of steps 101 to 102, which is not repeated here.

Step 303, determining, according to the broadcast state, that the intelligent calling device is in a to-be-broadcast state, performing semantic recognition on the text information to identify an abnormal intention response, and calling a preset reply text from a preset database.

Step 304, generating a preset reply voice from the preset reply text and sending the preset reply voice to the agent control, which sends the preset reply voice to the intelligent calling device for broadcasting.

Specifically, for common abnormal intention responses in a call, such as the user being busy, not needing the product, not hearing clearly, teasing, being abusive, or questioning the intelligent calling device, semantic recognition identifies the text information as an abnormal intention response. A preset reply text is then called from the preset database, a preset reply voice is generated from the preset reply text and sent to the agent control, and the agent control sends the preset reply voice to the intelligent calling device for broadcasting.

Example scenarios are as follows. User: I am busy right now; intelligent calling device: Sorry, I will call back later. User: I do not need this product right now; intelligent calling device: All right, you can learn about it first. User: I did not hear clearly what you just said; intelligent calling device: Then let me repeat it. User: Your voice is sweet, let's add each other on WeChat; intelligent calling device: Sorry, I can only chat with you here.
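A minimal sketch of steps 303 and 304 follows. The in-memory dictionary stands in for the preset database, substring matching stands in for semantic recognition, and the names `PRESET_REPLIES`, `INTENT_PHRASES`, `recognize_abnormal_intent` and `preset_reply` are hypothetical. The text-to-speech step is omitted, so the function returns the reply text only.

```python
from typing import Optional

# Hypothetical stand-in for the preset database: intent label -> preset reply text.
PRESET_REPLIES = {
    "busy": "Sorry, I will call back later.",
    "not_heard": "Then let me repeat it.",
}

# Hypothetical phrase lists standing in for real semantic recognition.
INTENT_PHRASES = {
    "busy": ("busy",),
    "not_heard": ("did not hear", "didn't hear"),
}


def recognize_abnormal_intent(text: str) -> Optional[str]:
    """Return the first abnormal intent whose phrase occurs in the text."""
    lowered = text.lower()
    for intent, phrases in INTENT_PHRASES.items():
        if any(phrase in lowered for phrase in phrases):
            return intent
    return None


def preset_reply(text: str, broadcast_state: str) -> Optional[str]:
    """Look up a preset reply only when the device is waiting to broadcast."""
    if broadcast_state != "to_be_broadcast":
        return None
    intent = recognize_abnormal_intent(text)
    return PRESET_REPLIES.get(intent) if intent else None
```

A real system would replace the phrase lists with a trained intent recognizer and pass the returned text to TTS before sending the audio to the agent control.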

Step 305, if the text information forwarded by the agent control is not received within a preset time threshold, determining that a silence condition is met.

Step 306, calling the target text from the preset database, generating target voice information of the target text, and sending the target voice information to the agent control, so that the agent control sends the target voice information to the intelligent calling device for broadcasting.

Specifically, when the user remains silent for too long during the conversation, the intelligent calling device needs to actively ask a question to continue the conversation. The technique allows a developer to configure the duration of a single silence, the replies used for multiple consecutive silences, the response flow triggered by silence, and the like.

An example scenario is as follows. Intelligent calling device: Following the introduction just now, may I ask whether you are interested? User: ……; intelligent calling device: Sorry, may I ask whether you are still there? User: ……; intelligent calling device: Hello, are you still there? Following the introduction just now, may I ask whether you are interested?

Therefore, a timely response can be given in the user-silence scenario during the human-machine conversation, the input and output rhythm of ASR and TTS is controlled, and the human-machine conversation becomes as natural and smooth as a human-to-human conversation.
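The silence handling of steps 305 and 306 could be sketched as below. The `SilenceMonitor` class, its field names, the default timeout and the default prompts are assumptions for illustration. Timestamps are passed in explicitly so that the behaviour is deterministic, and a caller would typically supply `time.monotonic()`.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SilenceMonitor:
    # Developer-configurable values; the defaults are illustrative assumptions.
    timeout_s: float = 5.0
    prompts: List[str] = field(default_factory=lambda: [
        "Sorry, are you still there?",
        "Hello, are you still there? May I ask whether you are interested?",
    ])
    last_activity: float = 0.0
    silent_rounds: int = 0

    def on_text(self, now: float) -> None:
        """Record that text information arrived from the agent control."""
        self.last_activity = now
        self.silent_rounds = 0

    def poll(self, now: float) -> Optional[str]:
        """Return the next prompt text if the silence condition is met."""
        if now - self.last_activity < self.timeout_s:
            return None
        if self.silent_rounds >= len(self.prompts):
            return None  # all configured prompts have already been used
        prompt = self.prompts[self.silent_rounds]
        self.silent_rounds += 1
        self.last_activity = now  # restart the timer after prompting
        return prompt
```

The returned prompt plays the role of the target text; generating the target voice information from it is left to a TTS step outside this sketch.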

Fig. 6 is a flowchart of a dialogue interaction processing method according to still another embodiment of the present application, as shown in fig. 6, the method including:

Step 401, when detecting that the intelligent calling device establishes a session interactive connection, establishing an uplink and downlink channel with the agent control through the full duplex component.

Step 402, receiving text information forwarded by the agent control and a broadcasting state of the intelligent calling device; the intelligent calling equipment receives the voice information and sends the voice information to the agent control, and the voice information is converted to generate text information.

It should be noted that steps 401 to 402 are the same as steps 101 to 102; refer to the description of steps 101 to 102 for details, which are not repeated here.

Step 403, extracting keywords in the text message.

Step 404, generating a reply text according to the keyword, generating a reply voice from the reply text, and sending the reply voice to the agent control, so that the agent control sends the reply voice to the intelligent calling device for broadcasting.

Specifically, keywords in the text information are extracted and denoised in combination with a semantic error correction technique. When the text information contains a long string of digits or letters, such as a mobile phone number, an identity card number or an order number, irrelevant words may be mixed in, for example text noise such as "um", "oh", "uh" or commas. Removal of such noise fragments is supported when the key information is extracted, and the extracted result is confirmed a second time with the user.

Example scenarios are as follows. User: Help me call 138, uh, xxxx, 1725; intelligent calling device: OK, dialing 138xxxx1725 for you. User: My order number is DQ636, let me see, uh, 456WOK5; intelligent calling device: OK, confirming that your order number is DQ636456WOK5.
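The denoising of spoken numbers can be illustrated with a simple heuristic. This is an assumption chosen for brevity and is not the semantic error correction technique of the application: only alphanumeric fragments that contain at least one digit are kept, so filler words are dropped, but a purely alphabetic fragment of a code would also be lost. The function names are hypothetical.

```python
import re


def extract_code(utterance: str) -> str:
    """Join the fragments of a spoken number, dropping filler words.

    Heuristic (an assumption): keep only alphanumeric tokens that contain
    at least one digit, then concatenate them in order.
    """
    tokens = re.findall(r"[A-Za-z0-9]+", utterance)
    kept = [token for token in tokens if any(ch.isdigit() for ch in token)]
    return "".join(kept)


def confirmation_prompt(utterance: str) -> str:
    """Build the secondary-confirmation text sent back to the user."""
    return f"OK, confirming that your number is {extract_code(utterance)}."
```

For example, `extract_code("My order number is DQ636, let me see, uh, 456WOK5")` joins the two fragments into a single order number, which the reply then reads back for confirmation.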

Step 405, performing semantic analysis on the text information.

Step 406, if it is determined according to the semantic analysis result that a control event is triggered, generating a corresponding control instruction and sending the control instruction to the agent control, so that the agent control sends the control instruction to the intelligent calling device to perform the corresponding control operation.

Specifically, semantic analysis is performed on the text information to generate a new event, which is converted through full-duplex computation into a control instruction and sent to the call center. For example, during the dialogue interaction, a user utterance such as "help me transfer to XXX" or "help me top up XX with one hundred yuan" triggers a control event; a corresponding control instruction is generated and sent to the agent control, and the agent control sends the control instruction to the intelligent calling device to perform the corresponding control operation.
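Steps 405 and 406 might look like the following sketch, in which two regular expressions stand in for semantic analysis. The instruction format (`action`, `target`, `account`, `amount`) and the function name `control_instruction` are hypothetical; the application does not define a concrete instruction schema.

```python
import re
from typing import Optional


def control_instruction(text: str) -> Optional[dict]:
    """Map an utterance to a control instruction, or None if no event is triggered."""
    # Transfer request, e.g. "transfer me to billing".
    match = re.search(r"transfer (?:me )?to (?P<target>.+)", text, re.IGNORECASE)
    if match:
        return {"action": "transfer", "target": match.group("target").strip(" .")}
    # Top-up request, e.g. "top up card01 with 100 yuan".
    match = re.search(r"top up (?P<account>\S+) (?:with|by) (?P<amount>\d+)",
                      text, re.IGNORECASE)
    if match:
        return {"action": "recharge",
                "account": match.group("account"),
                "amount": int(match.group("amount"))}
    return None
```

The returned dictionary represents the control instruction that would be sent to the agent control and forwarded to the intelligent calling device.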

Therefore, to give the user a good experience throughout the conversation, ASR, TTS and the semantic understanding and dialogue technology need to be deeply coordinated so as to achieve end-to-end control.

The dialogue interaction processing method of the embodiments of the application adopts a full-duplex mechanism in the semantic part to transmit uplink and downlink text information and events in real time, connects voice interaction with semantic interaction, and realizes linked control of ASR and TTS. The scenario functions support logic such as optimizing and correcting the ASR-recognized text, interruption handling and silence handling, and are closely combined with semantic understanding and dialogue management, so that the robot is more intelligent and the conversation process is smoother. The method reduces the repeated invocation, for each call, of links such as authentication, data forwarding and flow control, which can reduce the end-to-end call latency and improve calling efficiency. It can also reduce the integration cost of the client system. For example, integrating the speech and semantic capabilities of intelligent customer service into the intelligent calling device would otherwise require a large amount of separate integration work, whereas here the audio needs to be transmitted only once to obtain the TTS audio replied to the user, and all intermediate processing steps are handed to the integrated system, so that the calling platform does not have to do a large amount of AI technology management work. Once semantic understanding and dialogue are connected with ASR and TTS, timely responses in scenarios such as "user silence" and "user interruption" during the human-machine conversation can be given, the input and output rhythm of ASR and TTS is controlled, and the human-machine conversation becomes as natural and smooth as a human-to-human conversation.

In order to implement the above embodiments, the present application further provides a dialog interaction processing apparatus. Fig. 7 is a schematic structural diagram of a dialogue interaction processing apparatus according to an embodiment of the present application, and as shown in fig. 7, the dialogue interaction processing apparatus includes: an establishing module 701, a receiving module 702 and a processing module 703, wherein,

an establishing module 701, configured to establish, when it is detected that the intelligent call device establishes a session interaction connection, an uplink channel and a downlink channel with the agent control through the full-duplex component;

a receiving module 702, configured to receive text information forwarded by the agent control and a broadcast state of the intelligent calling device; the intelligent calling equipment receives voice information and sends the voice information to the agent control, and the voice information is converted to generate the text information;

the processing module 703 is configured to process the text information and the broadcast state to generate an asynchronous signal, and send the asynchronous signal to the agent control, so that the agent control sends the asynchronous signal to the intelligent call device to perform corresponding processing.
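One way to picture the three modules is as two queues between the agent control and the semantic side, one per direction. The sketch below uses `asyncio.Queue` for the uplink and downlink channels. The message fields (`text`, `state`, `type`), the `None` end-of-session marker and the trivial keyword rule are assumptions made for the example, not the application's actual protocol.

```python
import asyncio


async def semantic_side(uplink: asyncio.Queue, downlink: asyncio.Queue) -> None:
    """Receive text and broadcast state on the uplink, emit signals on the downlink."""
    while True:
        message = await uplink.get()
        if message is None:  # assumed end-of-session marker
            break
        # Trivial stand-in for the processing module's decision logic.
        if message["state"] == "broadcasting" and "wait" in message["text"].lower():
            await downlink.put({"type": "interrupt"})
        else:
            await downlink.put({"type": "noop"})


async def demo() -> list:
    """Play the agent control: push two messages up and collect the signals."""
    uplink: asyncio.Queue = asyncio.Queue()
    downlink: asyncio.Queue = asyncio.Queue()
    worker = asyncio.create_task(semantic_side(uplink, downlink))
    await uplink.put({"text": "Wait a moment", "state": "broadcasting"})
    await uplink.put({"text": "hello", "state": "to_be_broadcast"})
    await uplink.put(None)
    await worker
    return [downlink.get_nowait() for _ in range(downlink.qsize())]
```

Running `asyncio.run(demo())` returns the downlink signals in order. Because the two directions use separate queues, a signal can be sent down while further text is still arriving, which is the property the full-duplex component is meant to provide.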

In an embodiment of the present application, the processing module 703 is specifically configured to: performing semantic analysis on the text information, and judging whether a preset interruption condition is met or not according to a semantic analysis result; if the preset interruption condition is judged to be met according to the semantic analysis result and the broadcasting state is determined to be the broadcasting state according to the broadcasting state, an interruption signal is generated; and sending the interrupt signal to the agent control so that the agent control sends the interrupt signal to the intelligent calling equipment to stop broadcasting.

In an embodiment of the present application, the preset interruption condition includes: one or more of preset key press, intention interruption and preset keyword interruption.

In an embodiment of the present application, the processing module is specifically configured to: determining the broadcast state as a state to be broadcast according to the broadcast state, performing semantic recognition on the text information to obtain abnormal intention response, and calling a preset reply text from a preset database; and generating a preset reply voice from the preset reply text and sending the preset reply voice to the agent control, wherein the agent control sends the preset reply voice to the intelligent calling equipment for broadcasting.

In an embodiment of the present application, as shown in fig. 8, on the basis of fig. 7, the apparatus further includes: a determining module 704 and a calling generation module 705.

The determining module 704 is configured to determine that a silence condition is met if the text information forwarded by the agent control is not received within a preset time threshold.

The calling generation module 705 is configured to call a target text from a preset database, generate target voice information of the target text, and send the target voice information to the agent control, so that the agent control sends the target voice information to the intelligent call device for broadcasting.

In an embodiment of the present application, the processing module 703 is specifically configured to: extracting key words in the text information; and generating a reply text according to the keyword, generating reply voice from the reply text and sending the reply voice to the agent control, and sending the reply voice to the intelligent calling equipment by the agent control for broadcasting.

In an embodiment of the present application, the processing module 703 is specifically configured to: determining the text information to be broadcasted according to the broadcasting state, and carrying out text detection on the text information; if the text detection result is a text error, modifying the text information; and generating a text to be replied according to the modified text information, generating the voice to be replied from the text to be replied, and sending the voice to be replied to the agent control, wherein the voice to be replied is sent to the intelligent calling equipment by the agent control for broadcasting.

In an embodiment of the present application, as shown in fig. 9, on the basis of fig. 7, the apparatus further includes: an analysis module 706 and a generation module 707.

The analysis module 706 is configured to perform semantic analysis on the text information.

A generating module 707, configured to generate a corresponding control instruction according to a result of the semantic analysis to send to the agent control, so that the agent control sends the control instruction to the intelligent call device to perform a corresponding control operation.

It should be noted that the foregoing explanation on the embodiment of the dialog interaction processing method is also applicable to the dialog interaction processing apparatus of this embodiment, and details are not described here again.

To sum up, when the dialogue interaction processing apparatus of the embodiments of the application detects that the intelligent calling device establishes a dialogue interaction connection, it establishes an uplink channel and a downlink channel with the agent control through the full-duplex component; receives the text information forwarded by the agent control and the broadcasting state of the intelligent calling device; and processes the text information and the broadcasting state to generate an asynchronous signal and sends the asynchronous signal to the agent control, so that the agent control sends the asynchronous signal to the intelligent calling device for corresponding processing. This solves the problem in the prior art that the conversation process of the dialogue interaction mode is not smooth and the user's intention is misunderstood, resulting in a poor dialogue interaction effect. The intelligent calling device is connected with the agent control, and the uplink and downlink channels established with the agent control through the full-duplex component realize real-time transmission of data such as voice, text and broadcasting state, which ensures the fluency of the dialogue while improving dialogue interaction efficiency and meets the user's needs.

In order to implement the foregoing embodiments, the present application further provides a computer device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and when the processor executes the computer program, the dialog interaction processing method as described in the foregoing embodiments is implemented.

In order to implement the above embodiments, the present application also proposes a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the dialog interaction processing method as described in the aforementioned method embodiments.

In the description herein, reference to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the application. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined by one skilled in the art without contradiction.

Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present application, "plurality" means at least two, e.g., two, three, etc., unless specifically limited otherwise.

Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing steps of a custom logic function or process, and alternate implementations are included within the scope of the preferred embodiment of the present application in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the present application.

The logic and/or steps represented in the flowcharts or otherwise described herein, e.g., an ordered listing of executable instructions that can be considered to implement logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). Additionally, the computer-readable medium could even be paper or another suitable medium upon which the program is printed, as the program can be electronically captured, via for instance optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner if necessary, and then stored in a computer memory.

It should be understood that portions of the present application may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. If implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.

It will be understood by those skilled in the art that all or part of the steps of the methods in the above embodiments may be completed by a program instructing relevant hardware; the program may be stored in a computer readable storage medium and, when executed, performs one or a combination of the steps of the method embodiments.

In addition, functional units in the embodiments of the present application may be integrated into one processing module, or each unit may exist alone physically, or two or more units are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The integrated module, if implemented in the form of a software functional module and sold or used as a stand-alone product, may also be stored in a computer readable storage medium.

The storage medium mentioned above may be a read-only memory, a magnetic or optical disk, etc. Although embodiments of the present application have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present application, and that variations, modifications, substitutions and alterations may be made to the above embodiments by those of ordinary skill in the art within the scope of the present application.

Claims

1. A method for processing dialogue interaction, comprising the steps of:

when the intelligent calling equipment is detected to establish conversation interactive connection, an uplink channel and a downlink channel with the agent control are established through the full-duplex component;

receiving text information forwarded by the agent control and the broadcasting state of the intelligent calling equipment; the intelligent calling equipment receives voice information and sends the voice information to the agent control, and the voice information is converted to generate the text information;

processing the text information and the broadcast state to generate an asynchronous signal, and sending the asynchronous signal to the agent control so that the agent control sends the asynchronous signal to the intelligent calling equipment for corresponding processing; wherein the full-duplex component generates the asynchronous signal.

2. The method of claim 1, wherein the processing the text message and the broadcast status to generate an asynchronous signal and sending the asynchronous signal to the agent control, so that the agent control sends the asynchronous signal to the intelligent calling device for corresponding processing, comprises:

performing semantic analysis on the text information, and judging whether a preset interruption condition is met or not according to a semantic analysis result;

if the preset interruption condition is judged to be met according to the semantic analysis result and the broadcasting state is determined to be the broadcasting state according to the broadcasting state, an interruption signal is generated;

and sending the interrupt signal to the agent control so that the agent control sends the interrupt signal to the intelligent calling equipment to stop broadcasting.

3. The method of claim 2, wherein the preset interrupt condition comprises:

one or more of preset key press, intention interruption and preset keyword interruption.

4. The method of claim 1, wherein the processing the text message and the broadcast status to generate an asynchronous signal and sending the asynchronous signal to the agent control, so that the agent control sends the asynchronous signal to the intelligent calling device for corresponding processing, comprises:

determining the broadcast state as a state to be broadcast according to the broadcast state, performing semantic recognition on the text information to obtain abnormal intention response, and calling a preset reply text from a preset database;

and generating a preset reply voice from the preset reply text and sending the preset reply voice to the agent control, wherein the agent control sends the preset reply voice to the intelligent calling equipment for broadcasting.

5. The method of claim 4, wherein after the generating of the preset reply text into a preset reply voice and sending the preset reply voice to the agent control, the agent control sending the preset reply voice to the intelligent calling device for broadcasting, further comprising:

if the text information forwarded by the agent control is not received within a preset time threshold, determining that a silent condition is met;

and calling a target text from a preset database, generating target voice information of the target text, and sending the target voice information to the agent control so that the agent control sends the target voice information to the intelligent calling equipment for broadcasting.

6. The method of claim 1, wherein the processing the text message and the broadcast status to generate an asynchronous signal and sending the asynchronous signal to the agent control, so that the agent control sends the asynchronous signal to the intelligent calling device for corresponding processing, comprises:

extracting key words in the text information;

and generating a reply text according to the keyword, generating reply voice from the reply text and sending the reply voice to the agent control, and sending the reply voice to the intelligent calling equipment by the agent control for broadcasting.

7. The method of claim 1, wherein the processing the text message and the broadcast status to generate an asynchronous signal and sending the asynchronous signal to the agent control, so that the agent control sends the asynchronous signal to the intelligent calling device for corresponding processing, comprises:

determining the text information to be broadcasted according to the broadcasting state, and carrying out text detection on the text information;

if the text detection result is a text error, modifying the text information;

and generating a text to be replied according to the modified text information, generating a voice to be replied from the text to be replied, and sending the voice to be replied to the agent control, wherein the voice to be replied is sent to the intelligent calling equipment by the agent control for broadcasting.

8. The method of claim 1, wherein after receiving the text message forwarded by the agent control and the announcement status of the intelligent calling device, further comprising:

performing semantic analysis on the text information;

and if the triggering control event is determined according to the semantic analysis result, generating a corresponding control instruction and sending the control instruction to the agent control so that the agent control sends the control instruction to the intelligent calling equipment for corresponding control operation.

9. A dialog interaction processing apparatus, comprising:

the establishing module is used for establishing an uplink channel and a downlink channel with the agent control through the full duplex component when detecting that the intelligent calling equipment establishes the interactive connection;

the receiving module is used for receiving the text information forwarded by the agent control and the broadcasting state of the intelligent calling equipment; the intelligent calling equipment receives voice information and sends the voice information to the agent control, and the voice information is converted to generate the text information;

the processing module is used for processing the text information and the broadcast state to generate an asynchronous signal and sending the asynchronous signal to the agent control so that the agent control sends the asynchronous signal to the intelligent calling equipment to perform corresponding processing; wherein the full-duplex component generates the asynchronous signal.

10. The apparatus of claim 9, wherein the processing module is specifically configured to:

11. The apparatus of claim 10, wherein the preset interrupt condition comprises:

12. The apparatus of claim 9, wherein the processing module is specifically configured to:

13. The apparatus of claim 12, further comprising:

the determining module is used for determining that the silent condition is met if the text information forwarded by the agent control is not received within a preset time threshold;

and the calling generation module is used for calling a target text from a preset database, generating target voice information of the target text and sending the target voice information to the agent control, so that the agent control sends the target voice information to the intelligent calling equipment for broadcasting.

14. The apparatus of claim 9, wherein the processing module is specifically configured to:

extracting key words in the text information;

15. The apparatus of claim 9, wherein the processing module is specifically configured to:

if the text detection result is a text error, modifying the text information;

16. The apparatus of claim 9, further comprising:

the analysis module is used for carrying out semantic analysis on the text information;

and the generating module is used for generating a corresponding control instruction and sending the control instruction to the agent control if the semantic analysis result is determined to be the trigger control event, so that the agent control sends the control instruction to the intelligent calling equipment for corresponding control operation.

17. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the dialog interaction processing method according to any one of claims 1 to 8 when executing the computer program.

18. A non-transitory computer-readable storage medium having stored thereon a computer program, wherein the computer program, when executed by a processor, implements the dialog interaction processing method according to any one of claims 1-8.