CN110995569A - Intelligent interaction method and device, computer equipment and storage medium - Google Patents


Info

Publication number
CN110995569A
Authority
CN
China
Prior art keywords
user
content
dialog
information
candidate
Prior art date
Legal status
Granted
Application number
CN201911103109.7A
Other languages
Chinese (zh)
Other versions
CN110995569B (en)
Inventor
缪畅宇
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN201911103109.7A
Publication of CN110995569A
Application granted
Publication of CN110995569B

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L51/00: User-to-user messaging in packet-switching networks, transmitted according to store-and-forward or real-time protocols, e.g. e-mail
    • H04L51/02: User-to-user messaging using automatic reactions or user delegation, e.g. automatic replies or chatbot-generated messages
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/953: Querying, e.g. by the use of web search engines
    • G06F16/9535: Search customisation based on user profiles and personalisation
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06Q: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR SUCH PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00: Commerce
    • G06Q30/06: Buying, selling or leasing transactions
    • G06Q30/0601: Electronic shopping [e-shopping]
    • G06Q30/0631: Item recommendations

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Accounting & Taxation (AREA)
  • Finance (AREA)
  • General Business, Economics & Management (AREA)
  • Strategic Management (AREA)
  • Marketing (AREA)
  • Economics (AREA)
  • Development Economics (AREA)
  • Signal Processing (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

Embodiments of the invention disclose an intelligent interaction method and apparatus, a computer device, and a storage medium. A chat page between a user and a virtual user is displayed, the chat page including the dialog message currently sent by the user to the virtual user. A reply message from the virtual user to the dialog message is displayed on the chat page, the reply message including a dialog reply text automatically generated by the virtual user and target multi-modal content associated with the dialog reply text. When a play operation by the user on the target multi-modal content is detected, the target multi-modal content is played. By pairing multi-modal content with text during the conversation, users are better retained, the forms of dialog between the virtual user and the user are enriched, and the interest of the chat and its attraction to the user are greatly increased.

Description

Intelligent interaction method and device, computer equipment and storage medium
Technical Field
The invention relates to the field of Internet technology, and in particular to an intelligent interaction method and apparatus, a computer device, and a storage medium.
Background
With the development of computer technology, more and more intelligent devices provide intelligent interaction functions for users. Based on these functions, an intelligent device can reply to information input by the user so as to carry on a conversation with the user.
However, during the conversation, the intelligent device and the user generally converse only by text or voice. This single form of dialog is not conducive to user retention.
Disclosure of Invention
The embodiments of the invention provide an intelligent interaction method and apparatus, a computer device, and a storage medium, which can enrich the forms of dialog with the user.
The embodiment of the invention provides an intelligent interaction method, which comprises the following steps:
displaying a chat page between a user and a virtual user, wherein the chat page includes a dialog message currently sent by the user to the virtual user;
displaying, on the chat page, a reply message of the virtual user to the dialog message, wherein the reply message includes a dialog reply text automatically generated by the virtual user and target multi-modal content associated with the dialog reply text;
and when a play operation by the user on the target multi-modal content is detected, playing the target multi-modal content.
Optionally, playing the target multi-modal content when a play operation by the user on the target multi-modal content is detected includes:
when the playing operation of the user for the target multi-modal content is detected, displaying a virtual resource transfer page corresponding to the target multi-modal content;
triggering virtual resource transfer for the target multimodal content based on a virtual resource transfer operation of a user for the virtual resource transfer page;
and when the virtual resource is transferred successfully, playing the target multi-modal content.
Optionally, displaying the reply message of the virtual user to the dialog message on the chat page includes:
displaying, on the chat page, the dialog reply text in the reply message and a target multi-modal content list, wherein the target multi-modal content list includes at least two target multi-modal contents.
Optionally, a candidate type list of the target multi-modal content is further displayed on the chat page, where the candidate type list includes candidate types of the target multi-modal content;
the intelligent interaction method further comprises the following steps:
when the user selection operation for the candidate type in the candidate type list is detected, the target multi-modal content of the selected candidate type is switched and displayed in the target multi-modal content list.
Optionally, when the user's operation of playing the target multi-modal content is detected, playing the target multi-modal content includes:
when a play operation by the user on the target multi-modal content is detected, displaying a play page of the target multi-modal content, wherein similar multi-modal content similar to the target multi-modal content is also included on the play page.
Optionally, when the user's operation of playing the target multi-modal content is detected, playing the target multi-modal content includes:
when a play operation by the user on the target multi-modal content is detected, playing a target content segment in the target multi-modal content, wherein the target content segment is a content segment in the target multi-modal content that is associated with the semantics of the dialog reply text.
Optionally, before the chat page displays the reply message of the virtual user to the dialog message, the intelligent interaction method further includes:
when receiving a dialogue message sent by a user aiming at the virtual user, acquiring historical dialogue information of the virtual user and the user;
generating a dialog reply text of the dialog message based on the dialog message and historical dialog information;
obtaining correlation information between the dialog reply text and candidate multi-modal content;
determining, from the candidate multimodal content, targeted multimodal content to reply to the dialog message based on the relevance information;
and combining the conversation reply text and the target multi-modal content to obtain a reply message corresponding to the conversation message.
Optionally, the obtaining of the correlation information between the dialog reply text and the candidate multimodal content includes:
acquiring a prediction model, wherein the prediction model is used for predicting user preference degrees corresponding to candidate multi-modal content of each preset type in a dialogue scene;
analyzing the dialogue messages and historical dialogue information through the prediction model, and predicting the user preference degree of the user for candidate multi-modal content of each preset type in the current dialogue scene;
selecting a target type from preset types of the candidate multi-modal content based on the predicted user preference degree;
and acquiring correlation information between the dialog reply text and the candidate multi-modal content of the target type.
Optionally, the prediction model includes a prediction submodel corresponding to each preset type of the candidate multi-modal content, and the prediction submodel is used for predicting a possible stay time of a user on the corresponding preset type of candidate multi-modal content in a dialog scene;
the analyzing the dialog messages and the historical dialog information through the prediction model to predict the user preference degree of the user for each preset type of candidate multi-modal content in the current dialog scene comprises the following steps:
analyzing the dialogue messages and historical dialogue information through each prediction submodel to predict the possible stay time of the user on each preset type of candidate multi-modal content in the current dialogue scene;
selecting a target type from the preset types of the candidate multi-modal content based on the predicted user preference degree, wherein the selecting the target type comprises the following steps:
and selecting the predicted preset type with the longest possible stay time of the user as a target type from the preset types of the candidate multi-modal content.
Optionally, before selecting the target type from the preset types of the candidate multi-modal content based on the predicted user preference degree, the method further includes:
determining whether a preset type with the user preference degree not lower than a preset minimum user preference degree exists in the preset types of the candidate multi-modal content or not based on the predicted user preference degree;
and if so, continuing the step of selecting a target type from the preset types of the candidate multi-modal content based on the predicted user preference degree.
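For illustration only, the following is a minimal Python sketch of this selection logic, assuming per-type submodels that map dialogue features to a predicted stay time in seconds; the submodel interface, feature construction, and threshold value are assumptions, not the patent's implementation.

```python
# Minimal sketch of target-type selection by predicted stay time.
# The submodel interface and the minimum-preference threshold are
# illustrative assumptions, not the patent's actual implementation.
from typing import Callable, Dict, List, Optional

# One prediction submodel per preset type ("music", "audiobook", ...),
# each mapping dialogue features to a predicted stay time in seconds.
Submodel = Callable[[List[str]], float]

def select_target_type(
    dialog_message: str,
    history: List[str],
    submodels: Dict[str, Submodel],
    min_preference_s: float = 30.0,  # assumed preset minimum user preference
) -> Optional[str]:
    features = history + [dialog_message]  # simplistic feature construction
    predictions = {t: m(features) for t, m in submodels.items()}
    best_type, best_time = max(predictions.items(), key=lambda kv: kv[1])
    # If no preset type reaches the minimum preference degree, select nothing.
    if best_time < min_preference_s:
        return None
    return best_type
```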
Optionally, the obtaining of the correlation information between the dialog reply text and the candidate multi-modal content of the target type includes:
obtaining a tag set of the candidate multi-modal content of the target type;
calculating relevance information between the dialog reply text and tags in the tag set of the candidate multi-modal content.
Optionally, the intelligent interaction method further includes:
obtaining a regression model;
acquiring sample historical dialogue information of the virtual user and the user, and acquiring historical stay time of the user in each preset type of candidate multi-modal content in a historical dialogue process corresponding to the sample historical dialogue information;
determining training samples corresponding to each preset type of candidate multi-modal content based on the sample historical dialogue information and the user's historical stay time on each preset type of candidate multi-modal content corresponding to the sample historical dialogue information;
and respectively training one regression model by using the training sample corresponding to the candidate multi-modal content of each preset type to obtain the prediction sub-model corresponding to the candidate multi-modal content of each preset type.
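A sketch of this per-type training procedure follows; TF-IDF features and ridge regression are illustrative assumptions, since the embodiment specifies only that one regression model is trained per preset type on historical stay times.

```python
# Sketch: train one regression submodel per preset content type from sample
# historical dialogue texts and the observed stay times (in seconds).
# TF-IDF features and ridge regression are illustrative choices.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline

def train_submodels(samples_by_type):
    """samples_by_type: {content_type: [(dialogue_text, stay_seconds), ...]}"""
    submodels = {}
    for ctype, samples in samples_by_type.items():
        texts = [text for text, _ in samples]
        stay_times = [stay for _, stay in samples]
        model = make_pipeline(TfidfVectorizer(), Ridge())
        model.fit(texts, stay_times)  # regression on historical stay time
        submodels[ctype] = model      # one submodel per preset type
    return submodels
```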
Optionally, before obtaining the tag set of the candidate multi-modal content of the target type, the method further includes:
acquiring bearing information and/or user interaction information of the candidate multi-modal content;
and analyzing the characteristics of the candidate multi-modal content on at least one description dimension based on the bearing information and/or the user interaction information of the candidate multi-modal content, and generating a label on at least one description dimension for the candidate multi-modal content to obtain a label set of the candidate multi-modal content.
Optionally, the calculating the relevance information of the dialog reply text and the tags in the tag set of the candidate multi-modal content includes:
analyzing the conversation reply text to obtain keywords in the conversation reply text;
determining similar labels similar to the keywords in a label set of the candidate multi-modal content;
and determining the relevance information of the candidate multi-modal content corresponding to the tag set and the dialog reply text based on the similar tags in the tag set.
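As an illustration of the tag matching described above, the sketch below scores one piece of candidate content; the keyword extractor and word-similarity function (`extract_keywords`, `word_sim`) are assumed stand-ins not specified by the patent.

```python
# Sketch: relevance between a dialog reply text and one piece of candidate
# content via its tag set. `extract_keywords` and `word_sim` stand in for
# any keyword extractor and word-similarity measure (e.g. embedding cosine).

def relevance(reply_text, tag_set, extract_keywords, word_sim, threshold=0.7):
    keywords = extract_keywords(reply_text)
    # Tags that are sufficiently similar to at least one reply keyword.
    similar_tags = {
        tag for tag in tag_set
        if any(word_sim(kw, tag) >= threshold for kw in keywords)
    }
    # Simple relevance score: fraction of the content's tags that matched.
    return len(similar_tags) / max(len(tag_set), 1)
```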
Optionally, the obtaining of the historical dialog information between the virtual user and the user includes:
acquiring historical dialogue information generated by the virtual user and the user in a current dialogue scene;
or acquiring historical dialogue information between the virtual user and the user within a historical time period which is a preset time length away from the current time;
or acquiring historical conversation information of the virtual user and the user and belonging to a target topic, wherein the target topic is the topic to which the conversation message belongs.
Optionally, the generating a dialog reply text of the dialog message based on the dialog message and the historical dialog information includes:
determining topic information of topics to which the historical conversation information belongs;
acquiring meaning description information of the historical dialogue information and meaning description information of the dialogue information;
determining a dialog reply text of the dialog message based on the topic information, the meaning description information of the historical dialog information, and the meaning description information of the dialog message.
The embodiment of the invention also provides an intelligent interaction device, which comprises:
the system comprises a page display unit, a chat page display unit and a chat display unit, wherein the page display unit is used for displaying the chat page between a user and a virtual user, and the chat page comprises a conversation message sent to the virtual user by the user currently;
the reply display unit is used for displaying a reply message of the virtual user aiming at the conversation message on the chat page; wherein the reply message comprises a dialog reply text automatically generated by the virtual user and target multimodal content associated with the dialog reply text;
and the playing unit is used for playing the target multi-modal content when the playing operation of the user for the target multi-modal content is detected.
Optionally, the playing unit includes:
the virtual transfer page display subunit is configured to, when a play operation of the user on the target multi-modal content is detected, display a virtual resource transfer page corresponding to the target multi-modal content;
the virtual resource transfer subunit is used for triggering virtual resource transfer aiming at the target multi-modal content based on the virtual resource transfer operation of a user aiming at the virtual resource transfer page;
and the playing subunit is used for playing the target multi-modal content when the virtual resource is successfully transferred.
Optionally, the reply display unit is configured to display, on the chat page, a dialog reply text in the reply message and a target multimodal content list, where the target multimodal content list includes at least two target multimodal contents.
Optionally, a candidate type list of the target multi-modal content is further displayed on the chat page, where the candidate type list includes candidate types of the target multi-modal content;
the device further comprises:
and the switching display unit is used for switching and displaying the target multi-modal content of the selected candidate type in the target multi-modal content list when the user selection operation aiming at the candidate type in the candidate type list is detected.
Optionally, the playing unit is configured to, when a playing operation of the user for the target multi-modal content is detected, display a playing page of the target multi-modal content, where similar multi-modal content similar to the target multi-modal content is also displayed on the playing page.
Optionally, the playing unit is configured to play a target content segment in the target multi-modal content when a playing operation of the user on the target multi-modal content is detected, where the target content segment is a content segment in the target multi-modal content that is associated with semantics of the dialog reply text.
Optionally, the intelligent interaction device of this embodiment further includes:
the conversation acquisition unit is used for acquiring historical conversation information of the virtual user and the user when a conversation message sent by the user aiming at the virtual user is received;
the generating unit is used for generating a dialogue reply text of the dialogue message based on the dialogue message and historical dialogue information;
a correlation obtaining unit, configured to obtain correlation information between the dialog reply text and candidate multimodal content;
a determining unit, configured to determine, from the candidate multi-modal content, a target multi-modal content to reply to the dialog message based on the relevance information;
and the combination unit is used for combining the dialog reply text and the target multi-modal content to obtain a reply message corresponding to the dialog message.
Optionally, the correlation obtaining unit includes:
the device comprises a first obtaining subunit, a second obtaining subunit and a third obtaining subunit, wherein the first obtaining subunit is used for obtaining a prediction model, and the prediction model is used for predicting the user preference degree corresponding to each preset type of candidate multi-modal content in a dialogue scene;
the prediction subunit is used for analyzing the dialogue messages and the historical dialogue information through the prediction model and predicting the user preference degree of the user on candidate multi-modal content of each preset type in the current dialogue scene;
a selecting subunit, configured to select a target type from preset types of the candidate multimodal content based on the predicted user preference degree;
and the second acquisition subunit is used for acquiring correlation information between the dialog reply text and the candidate multi-modal content of the target type.
Optionally, the prediction model includes a prediction submodel corresponding to each preset type of the candidate multi-modal content, and the prediction submodel is used for predicting a possible stay time of a user on the corresponding preset type of candidate multi-modal content in a dialog scene;
the prediction subunit is used for analyzing the dialogue messages and the historical dialogue information through each prediction submodel to predict the possible stay time of the user on each preset type of candidate multi-modal content in the current dialogue scene;
the selecting subunit is configured to select, from the preset types of the candidate multimodal content, the preset type with the longest predicted possible dwell time of the user as the target type.
Optionally, the apparatus further includes a control unit, configured to determine, before the selecting subunit selects the target type from the preset types of the candidate multi-modal content based on the predicted user preference degree, whether a preset type having a user preference degree not lower than a preset minimum user preference degree exists in the preset types of the candidate multi-modal content based on the predicted user preference degree; and if so, controlling the selection subunit to continue the step of selecting the target type from the preset types of the candidate multi-modal content based on the predicted user preference degree.
Optionally, the second obtaining subunit is configured to:
obtaining a tag set of the candidate multi-modal content of the target type;
calculating relevance information between the dialog reply text and tags in the tag set of the candidate multi-modal content.
Optionally, the intelligent interaction device further comprises a model processing unit, configured to:
obtaining a regression model;
acquiring sample historical dialogue information of the virtual user and the user, and acquiring historical stay time of the user in each preset type of candidate multi-modal content in a historical dialogue process corresponding to the sample historical dialogue information;
determining training samples corresponding to each preset type of candidate multi-modal content based on the sample historical dialogue information and the user's historical stay time on each preset type of candidate multi-modal content corresponding to the sample historical dialogue information;
and respectively training one regression model by using the training sample corresponding to the candidate multi-modal content of each preset type to obtain the prediction sub-model corresponding to the candidate multi-modal content of each preset type.
Optionally, the intelligent interaction device further includes a tag setting unit, configured to acquire bearer information and/or user interaction information of the candidate multi-modal content before the second acquiring subunit acquires the tag set of the candidate multi-modal content of the target type; and analyzing the characteristics of the candidate multi-modal content on at least one description dimension based on the bearing information and/or the user interaction information of the candidate multi-modal content, and generating a label on at least one description dimension for the candidate multi-modal content to obtain a label set of the candidate multi-modal content.
Optionally, the second obtaining subunit is configured to:
analyzing the conversation reply text to obtain keywords in the conversation reply text;
determining similar labels similar to the keywords in a label set of the candidate multi-modal content;
and determining the relevance information of the candidate multi-modal content corresponding to the tag set and the dialog reply text based on the similar tags in the tag set.
Optionally, the dialogue acquisition unit is configured to:
acquiring historical dialogue information generated by the virtual user and the user in a current dialogue scene;
or acquiring historical dialogue information between the virtual user and the user within a historical time period which is a preset time length away from the current time;
or acquiring historical conversation information of the virtual user and the user and belonging to a target topic, wherein the target topic is the topic to which the conversation message belongs.
Optionally, the generating unit includes:
the topic determining subunit is used for determining topic information of the topic to which the historical conversation information belongs;
a description information acquisition subunit, configured to acquire meaning description information of the historical dialog information and meaning description information of the dialog message;
a text determination subunit, configured to determine a dialog reply text of the dialog message based on the topic information, the meaning description information of the historical dialog information, and the meaning description information of the dialog message.
The present embodiment also provides a storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the steps of the intelligent interaction method as described above.
The present embodiment also provides a computer device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the program to implement the steps of the intelligent interaction method as described above.
This embodiment discloses an intelligent interaction method and apparatus, a computer device, and a storage medium. A chat page between a user and a virtual user is displayed, the chat page including the dialog message currently sent by the user to the virtual user; a reply message of the virtual user to the dialog message is displayed on the chat page, the reply message including a dialog reply text automatically generated by the virtual user and target multi-modal content associated with the dialog reply text; and when a play operation by the user on the target multi-modal content is detected, the target multi-modal content is played. By pairing multi-modal content with text during the conversation, users are better retained, the forms of dialog with the user are enriched, and the interest of the chat and its attraction to the user are greatly increased.
Drawings
In order to explain the technical solutions in the embodiments of the present invention more clearly, the drawings needed for describing the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1a is a scene schematic diagram of an intelligent interaction method provided by an embodiment of the present invention;
FIG. 1b is a flowchart of an intelligent interaction method according to an embodiment of the present invention;
fig. 2a is a schematic display diagram of a chat page implemented based on the intelligent interaction method provided in the embodiment of the present invention;
FIG. 2b is a diagram illustrating a list display of targeted multimodal content in a chat page, in accordance with an embodiment of the present invention;
FIG. 2c is a schematic diagram illustrating a display of a list of targeted multimodal content and a list of candidate types of targeted multimodal content in a chat page in an embodiment of the invention;
FIG. 2d is a schematic diagram illustrating the display of a playback page of targeted multimodal content, in an embodiment of the present invention;
fig. 2e is a schematic flowchart of a reply message generation method in the intelligent interaction process according to an embodiment of the present invention;
FIG. 2f is a schematic diagram of another chat page implemented by the intelligent interaction method according to the embodiment of the present invention;
FIG. 3 is a schematic structural diagram of an intelligent interaction device according to an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of a computer device provided by an embodiment of the present invention;
fig. 5 is an alternative structure diagram of the distributed system 100 applied to the blockchain system according to the embodiment of the present invention;
fig. 6 is an alternative schematic diagram of a block structure according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The embodiments of the invention provide an intelligent interaction method and apparatus, a computer device, and a storage medium. Specifically, the embodiments provide an intelligent interaction apparatus suitable for a first computer device (which may be referred to as a first intelligent interaction apparatus for distinction) and an intelligent interaction apparatus suitable for a second computer device (which may be referred to as a second intelligent interaction apparatus for distinction). The first computer device may be a device such as a terminal, and the terminal may be a mobile phone, a tablet computer, a notebook computer, an intelligent robot, a smart band, or the like; the second computer device may be a device such as a server, and the server may be a single server or a server cluster composed of multiple servers.
For example, the first intelligent interaction device may be integrated in the terminal, and the second intelligent interaction device may be integrated in the server.
The embodiment of the invention introduces the intelligent interaction method by taking the first computer equipment as a terminal and the second computer equipment as a server as an example.
Referring to fig. 1a, an embodiment of the present invention provides an intelligent interactive system, which includes a terminal 10, a server 20, and the like; the terminal 10 and the server 20 are connected via a network, such as a wired or wireless network, wherein the intelligent interaction device is integrated in the terminal, such as in the form of a client, and more specifically, such as in the form of a dialog system.
The terminal 10 may display a chat page between the user and the virtual user, where the chat page includes the dialog message currently sent by the user to the virtual user; display, on the chat page, a reply message of the virtual user to the dialog message, where the reply message includes a dialog reply text automatically generated by the virtual user and target multi-modal content associated with the dialog reply text; and, when a play operation by the user on the target multi-modal content is detected, play the target multi-modal content.
After the user sends a dialog message to the virtual user, receipt of that message may trigger acquisition of a reply message to the dialog message. The acquisition process may be completed by the terminal 10 or the server 20, or by the terminal 10 and the server 20 together.
For example, the terminal 10 may acquire historical dialogue information with the user when receiving a dialogue message sent by the user; generating a dialog reply text of the dialog message based on the dialog message and historical dialog information; the dialog reply text is sent to the server 20 to trigger retrieval of the targeted multimodal content by the server 20.
Optionally, the server 20 may obtain correlation information between the dialog reply text and the candidate multimodal content after receiving the dialog reply text; determining, from the candidate multimodal content, targeted multimodal content to reply to the dialog message based on the relevance information; and sending the target multi-modal content to the terminal 10, so that the terminal 10 combines the dialog reply text and the target multi-modal content to obtain a reply message corresponding to the dialog message.
In one embodiment, the terminal 10 may be an intelligent robot, and the user's dialog messages may be captured through a user input module, such as an audio capture module or a touch input module.
The following are detailed below. It should be noted that the following description of the embodiments is not intended to limit the preferred order of the embodiments.
The embodiment of the invention will be described from the perspective of an intelligent interaction device, which may be specifically integrated in a terminal or a server.
The intelligent interaction method provided by the embodiment of the invention may be executed by a processor of a terminal, and may be applied to an intelligent interaction scenario (for example, interaction between a user and a smart speaker, an intelligent chat robot, or the like). Based on this intelligent interaction method, the user's chat experience in such scenarios can be improved. In an intelligent interaction scenario, casual chat is an important module. Compared with task-oriented modules, casual chat has no explicit goal and is carried out with the user on the intelligent device in an anthropomorphic manner. The technology behind it is far from simple, however, because the system must be convincingly anthropomorphic to arouse the user's desire to communicate; a good chat module can greatly improve user retention and market word of mouth.
The intelligent interaction method of the embodiment can interact with the user by matching the multi-modal content with the text in the chat scene, so as to improve the interest of the chat and the attraction to the user, as shown in fig. 1b, the specific flow of the intelligent interaction method may be as follows:
101. Displaying a chat page between a user and a virtual user, wherein the chat page includes a dialog message currently sent by the user to the virtual user;
wherein, step 101 may be implemented by a mobile terminal, such as a mobile phone, a smart robot, and the like. The virtual user may be emulated by an application in the terminal, which may be an instant messaging client, an application with a dialog function, such as a search engine, etc., or a service of the terminal system itself. Correspondingly, the chat page can be a chat page of an instant messaging client and a chat page provided for an intelligent conversation function of a terminal system.
The chat page may include an input control, such as an input box, for facilitating a user to input a dialog message in a touch manner. In an example of this embodiment, in step 101, an audio capture module of the terminal may be in an open state, and the terminal may capture external audio data through the audio capture module, obtain voice content of the user based on analysis of the audio data captured by the audio capture module, translate the voice content into text information, and display the text information as a dialog message in a chat page.
102. Displaying a reply message of the virtual user to the conversation message on a chat page; the reply message comprises a dialog reply text automatically generated by the virtual user and target multi-modal content associated with the dialog reply text;
in this embodiment, as for the multi-modal content, the form of the multi-modal content is not uniform, and the multi-modal content of this embodiment includes but is not limited to: music includes audio information, and text information such as lyrics; the audio books comprise audio information such as reading aloud and music, and text information such as books; the phase sound comprises audio information such as human voice and text information such as phase sound content; video includes multi-dimensional information such as images, text, audio, etc.
Optionally, referring to the schematic diagram shown in fig. 2a, in a chat page 201 between a user and a robot (i.e., a virtual user), the user (labeled "use" in the figure) and the robot (labeled "system" in fig. 2a) are holding a conversation. During the conversation, for a dialog message of the user, the robot may display only a dialog reply text as the reply message, or may display the dialog reply text together with multi-modal content as the reply message; this embodiment places no limitation on this.
103. And when the playing operation of the user for the target multi-modal content is detected, playing the target multi-modal content.
In this embodiment, the playing operation of the user for the target multimodal content may be a touch operation such as clicking, double-clicking, long-pressing, or the like, or may be an operation implemented in a voice manner.
Optionally, the step "playing the target multi-modal content when the user's playing operation for the target multi-modal content is detected" may include:
collecting voice information of a user, and analyzing a dialogue message input by the user based on the voice information;
and when the dialogue message is analyzed to contain a playing instruction of the target multi-modal content, playing the target multi-modal content.
Optionally, the step "playing the target multi-modal content when the user's playing operation for the target multi-modal content is detected" may include:
and when receiving a play touch operation of the user for the target multi-modal content, playing the target multi-modal content.
The play touch operation may be an arbitrarily set touch operation, such as a click operation.
Alternatively, in one example, target multi-modal content recommended by the virtual user may require purchase by the user before it can be played.
Optionally, the step "playing the target multi-modal content when the user's playing operation for the target multi-modal content is detected" may include:
when the playing operation of a user for the target multi-modal content is detected, displaying a virtual resource transfer page corresponding to the target multi-modal content;
triggering virtual resource transfer for the target multi-modal content based on a virtual resource transfer operation of a user for a virtual resource transfer page;
and when the virtual resource transfer is successful, playing the target multi-modal content.
To help the user learn the specifics of the target multi-modal content before deciding whether to purchase it, the scheme of this embodiment also provides a try-before-you-buy service. Optionally, the step of displaying the virtual resource transfer page corresponding to the target multi-modal content when a play operation by the user on the target multi-modal content is detected may include:
when a play operation by the user on the target multi-modal content is detected, displaying a virtual resource transfer selection control corresponding to the target multi-modal content, wherein the virtual resource transfer selection control includes: a virtual resource transfer page display sub-control, a virtual resource transfer end sub-control, and a trial reading content playing sub-control;
and when the triggering operation of the user for the virtual resource transfer page display sub-control is detected, displaying the virtual resource transfer page corresponding to the target multi-modal content.
Optionally, when a trigger operation by the user on the virtual resource transfer end sub-control is detected, the virtual resource transfer selection control is hidden; and when a trigger operation by the user on the trial reading content playing sub-control is detected, the trial reading content of the target multi-modal content is played.
In this embodiment, the virtual resource transfer page may include a specific amount of a virtual resource to be transferred, a virtual resource transfer confirmation control for triggering virtual resource transfer, and the like.
Optionally, the step "triggering the virtual resource transfer for the target multimodal content based on the virtual resource transfer operation of the user for the virtual resource transfer page" may include:
and when the confirmation operation of the user for the virtual resource transfer confirmation control in the virtual resource transfer page is detected, triggering the virtual resource transfer for the target multi-modal content.
For example, still referring to fig. 2a, when a play operation by the user on target multi-modal content, such as the audiobook "Energized", is detected on page 201, a virtual resource transfer selection control is displayed (as shown at page 202). The control includes a "Yes" sub-control for displaying the virtual resource transfer page, a "No" sub-control for ending the virtual resource transfer, and a "Listen to a 3-minute trial" sub-control for playing trial content. When the user's confirmation operation on the virtual resource transfer page display sub-control (the "Yes" control) is detected, the virtual resource transfer page shown at 203 is displayed. In another embodiment, the user may replace the touch operation on the "Yes" control with a voice instruction; for example, when the terminal detects that the user has input the voice information "this book is good, let's buy it", it determines that the user has input a purchase instruction for the audiobook "Energized" and displays the virtual resource transfer page shown at 203.
In fig. 2a, the virtual resource transfer page shown at 203 displays the amount of virtual resources to be transferred, the payee information, and a virtual resource transfer confirmation control such as a "payment" control. When a trigger operation by the user on the payment control, such as a click, is detected, virtual resource transfer for the target multi-modal content is triggered; when the transfer succeeds, the target multi-modal content is played. In the scheme where the user inputs a voice instruction to trigger display of the virtual resource transfer page, the voice instruction may also be translated into text information and displayed in the chat page. As shown in fig. 2a, on the page displayed at 203, when the user's click operation for payment is detected, virtual resource transfer is carried out; after the transfer succeeds, the chat page shown at 204 is displayed, with the user's voice instruction added to the chat page.
Through the above scheme, this embodiment helps convert aimless chatting into purposeful scenarios such as listening to music or audiobooks, and increases potential commercialization opportunities such as paid music and paid audiobooks.
In one embodiment, there may be more than one piece of target multi-modal content, and in this embodiment the target multi-modal content can be displayed as a list. Optionally, in this embodiment, the step of displaying a reply message of the virtual user to the dialog message on the chat page may include:
and displaying the dialog reply text in the reply message and a target multi-modal content list on the chat page, wherein the target multi-modal content list comprises at least two target multi-modal contents.
For example, referring to fig. 2b, for the user's dialog message "I feel much better after listening; got anything else fun?", the virtual user has retrieved 4 crosstalk videos, and the 4 videos are displayed as a list in the reply message on the chat page. Optionally, in this embodiment, the order of the target multi-modal content in the list is based on the relevance between each target multi-modal content and the corresponding dialog reply text: the higher the relevance, the higher the item's position in the list.
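A minimal sketch of this ordering, assuming a relevance function such as the tag-based score described earlier (the top-k cutoff is an assumption):

```python
# Sketch: order the target multi-modal content list by relevance to the
# dialog reply text, highest first, and keep the top few for display.

def build_reply_list(candidates, reply_text, relevance, top_k=4):
    """candidates: content IDs; relevance(content_id, reply_text) -> float."""
    ranked = sorted(candidates, key=lambda c: relevance(c, reply_text), reverse=True)
    return ranked[:top_k]  # e.g. the 4 crosstalk videos of fig. 2b
```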
Optionally, in this embodiment, a candidate type list of the target multi-modal content is further displayed on the chat page, where the candidate type list includes candidate types of the target multi-modal content. Correspondingly, the intelligent interaction method of the embodiment further includes:
and when the user selection operation for the candidate type in the candidate type list is detected, switching and displaying the target multi-modal content of the selected candidate type on the chat page.
Optionally, the step "when the user's selection operation for a candidate type in the candidate type list is detected, switching and displaying the target multimodal content of the selected candidate type on the chat page" may include:
and when the user selection operation for the candidate type in the candidate type list is detected, switching and displaying the target multi-modal content of the selected candidate type in the target multi-modal content list.
For example, referring to the chat page shown in fig. 2c, for the user's dialog message "I feel much better after listening; got anything else fun?", three types of multi-modal content acceptable to the user are retrieved, namely "crosstalk", "audiobooks", and "music", and the displayed candidate type list includes these three type options. In the chat page shown in fig. 2c, the target multi-modal content list and the candidate type list are displayed in the reply message for the dialog message, and when a selection operation by the user on a candidate type in the candidate type list is detected, the target multi-modal content under the selected candidate type is switched to and displayed on the chat page. For example, in the chat page of fig. 2c, when the user's selection operation for the "audiobooks" type is detected, the target multi-modal content of the "audiobooks" type is displayed in the target multi-modal content list.
Optionally, in this embodiment, the step "playing the target multi-modal content when the user's play operation for the target multi-modal content is detected" may include:
and when the playing operation of the user for the target multi-modal content is detected, displaying a playing page of the target multi-modal content, and playing the target multi-modal content in the playing page of the target multi-modal content.
Optionally, similar multi-modal content similar to the target multi-modal content may also be displayed on the play page.
For example, referring to fig. 2d, when a play operation by the user on target multi-modal content such as the movie "Wind and Rain XX Way" is detected in a chat page with the robot, a play page of that target multi-modal content is displayed.
In this embodiment, for multimodal content such as videos that need a certain display area, the multimodal content can be played through a playing page, and similar content similar to the target multimodal content can be displayed in the playing page in order to better serve the user.
Considering that some target multi-modal content, such as movies, has a long total duration and would occupy much of the user's time, in order to reduce the user's watching or listening time and improve the dialogue experience, in this embodiment, when the user's play operation is detected, a highlight segment of the target multi-modal content can be played to the user directly.
Optionally, the step "playing the target multi-modal content when the user's playing operation for the target multi-modal content is detected" may include:
when the playing operation of the user for the target multi-modal content is detected, the target content segment in the target multi-modal content is played, wherein the target content segment is a content segment in the target multi-modal content, and the content segment is associated with the semantics of the dialog reply text.
For example, referring again to fig. 2d, when a play operation by the user on target multi-modal content such as the movie "Wind and Rain XX Way" is detected in a chat page with the robot, a play page of the movie is displayed, and in the play page the movie is played from the play start point A of the target content segment. Optionally, the play end point of the target content segment may be the play end point of the entire movie, or a time point between play start point A and the end of the movie.
Further, the target content segment may be a segment pre-marked for different semantics by the content producer or an operator, or may be determined in real time by the virtual user (e.g., the dialog system) according to the semantics of the dialog reply text; this embodiment places no limitation on this.
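For illustration, a sketch of choosing the play start point from pre-marked segments follows; the segment schema and the `semantic_sim` function are assumptions:

```python
# Sketch: choose the target content segment whose semantic label best
# matches the dialog reply text, and return its play start/end points.

def pick_segment(segments, reply_text, semantic_sim):
    """segments: [{"label": str, "start_s": float, "end_s": float}, ...]"""
    best = max(segments, key=lambda s: semantic_sim(s["label"], reply_text))
    return best["start_s"], best["end_s"]  # e.g. play start point A onward
```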
In this embodiment, every time the virtual user receives a dialog message sent by the user, a reply message for that dialog message needs to be determined. This embodiment provides a method that can be used to generate a reply message containing multi-modal content; referring to fig. 2e, the reply message generation method of this embodiment includes:
205. when receiving a dialogue message sent by a user aiming at the virtual user, acquiring historical dialogue information of the virtual user and the user;
in this embodiment, the dialog message of the user may be acquired through any feasible receiving manner. The type of the dialog message is not limited, and may be text, voice, image, video, and other types of information.
In this embodiment, the step of receiving the dialog message sent by the user for the virtual user may include: and acquiring an external audio signal through an audio acquisition module, and analyzing the current voice information of the user from the audio signal to be used as the dialogue message of the user.
In this embodiment, a voice-type dialog message may be converted into a text-type dialog message, and the text-type dialog message is used in the subsequent steps.
For example, in this embodiment, the step of receiving the dialog message sent by the user for the virtual user may include: the chat page, such as 201 shown in fig. 2a, is displayed, and when a dialog information input operation of the user on the chat page is detected, the dialog message input by the user on the chat page is acquired. It is understood that, in this embodiment, the dialog message input by the user on the chat page may be any one of text, image, video, audio and the like, or a mixture of at least two of these.
The chat page in this embodiment may be a chat session page provided by any instant messaging client, an intelligent system session page provided by the terminal's system, or a user operation page with a dialogue function provided by a client such as a search engine; this embodiment places no limitation on this.
For dialog messages of non-text information types input through the chat page, the non-text information can be converted into text information in a corresponding conversion manner.
For example, after a dialog message input by a user on a chat page is acquired, if the dialog message contains non-text information, the information type of the non-text information is determined, a text conversion model corresponding to the information type is acquired, and the non-text information is converted into text information by the text conversion model.
For example, if image information exists in the dialog message, a text conversion model corresponding to the image information is acquired, and the image information is converted into text information. Text conversion models corresponding to image information include, but are not limited to, an OCR (Optical Character Recognition) model, an image captioning (Image Caption) model, and the like.
For other types of non-text information, the existing text conversion model can be adopted to realize the conversion from the non-text information to the text information, and the embodiment has no limitation on the specific type of the text conversion model.
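The dispatch described above might look like the following sketch; the converter stubs are hypothetical placeholders for an OCR model, an image-captioning model, and a speech recognizer:

```python
# Sketch: convert non-text parts of a dialog message to text by information
# type. The stub converters are hypothetical placeholders for real models.
from typing import Callable, Dict

def ocr_stub(data: bytes) -> str: return "<recognized document text>"
def caption_stub(data: bytes) -> str: return "<image description>"
def asr_stub(data: bytes) -> str: return "<transcribed speech>"

CONVERTERS: Dict[str, Callable[[bytes], str]] = {
    "scanned_doc": ocr_stub,   # OCR model
    "image": caption_stub,     # image-captioning model
    "audio": asr_stub,         # speech recognition model
}

def to_text(kind: str, data: bytes) -> str:
    if kind == "text":
        return data.decode("utf-8")
    if kind not in CONVERTERS:
        raise ValueError(f"no text conversion model for type {kind!r}")
    return CONVERTERS[kind](data)
```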
In this embodiment, the historical dialogue information obtained in step 205 has a particular property: it corresponds to the received dialog message, that is, the historical dialogue information and the received dialog message have a certain association (for example, an association in content). The historical dialogue information may include historical dialog messages of the user and historical reply messages of the terminal.
In this embodiment, there are various ways to obtain the historical session information, and the historical session information may be obtained from a server or obtained locally.
Optionally, the step of "obtaining historical dialog information between the virtual user and the user" may include:
sending a historical dialogue information acquisition request to a server, wherein the historical dialogue information acquisition request carries identification information of a user and a virtual user and attribute information of a current dialogue message of the user;
and receiving historical conversation information of the virtual user and the user, which is fed back by the server in response to the historical conversation information acquisition request.
The attribute information of the dialog message includes, but is not limited to, content carried by the dialog message, a topic to which the dialog message belongs, input time of the dialog message, and the like.
Optionally, the step of "obtaining historical dialog information between the virtual user and the user" may include:
and acquiring historical conversation information of the virtual user and the user and belonging to a target topic from the locally stored historical conversation information, wherein the target topic is the topic to which the conversation message belongs.
Optionally, in an embodiment, the obtained historical dialog message may be only a historical dialog message generated in a current dialog scenario, and the step of "obtaining historical dialog information of the virtual user and the user" may include:
and acquiring historical conversation information of the virtual user and the user in the current conversation scene.
For example, when the session interruption time of the virtual user and the user exceeds a preset interruption time threshold, the current session scene is considered to be ended, and a session performed by the user after the session interruption time exceeds the preset interruption time threshold is considered to be a new session scene.
In one embodiment, the step of "obtaining historical dialogue information of the virtual user and the user" may include:
and acquiring historical dialogue information between the virtual user and the user within a historical time period which is a preset time length away from the current time.
Considering that the closer in time historical dialogue information is to the dialog message, the more relevant it tends to be, in one example of this embodiment the acquired historical dialogue information is restricted to a historical period of a preset length. The preset length may be set according to actual needs, for example 5 min or 8 min.
In one embodiment, the step of "obtaining historical dialogue information of the virtual user and the user" may include:
and acquiring historical conversation information of the virtual user and the user and belonging to a target topic, wherein the target topic is the topic to which the conversation message belongs.
In this embodiment, during the conversation between the virtual user and the user, the topic to which each piece of dialog information belongs may be determined based on the dialog information between the virtual user and the user. When a dialog message sent by the user to the virtual user is received, whether the dialog message belongs to the same topic as the temporally closest historical dialog information is analyzed; if so, the topic corresponding to that closest historical dialog information is taken as the topic of the dialog message; otherwise, the current topic is determined based on the dialog message itself.
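The three retrieval strategies above can be combined in one place. The sketch below is a non-limiting Python illustration; it assumes history records are dicts sorted by ascending 'time' and carrying a 'topic' field, and the threshold values simply echo the examples given earlier (a 5 min interruption threshold, an 8 min window).

```python
# Combined sketch of the three retrieval strategies. Assumes `history`
# is a list of dicts with 'time' (epoch seconds) and 'topic' fields,
# sorted by ascending time; thresholds are illustrative.

def filter_history(history, now, scene_gap=5 * 60, window=8 * 60, topic=None):
    # Strategy 1: current dialog scene -- walk backwards and stop at the
    # first gap longer than the interruption threshold.
    scene, prev_time = [], now
    for record in reversed(history):
        if prev_time - record["time"] > scene_gap:
            break
        scene.append(record)
        prev_time = record["time"]
    scene.reverse()

    # Strategy 2: a fixed time window before the current time.
    recent = [r for r in history if now - r["time"] <= window]

    # Strategy 3: records belonging to the target topic of the message.
    on_topic = [r for r in history if topic is not None and r["topic"] == topic]
    return scene, recent, on_topic
```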
206. Based on the dialog message and the historical dialog information, a dialog reply text for the dialog message is generated.
In this embodiment, a dialog reply text corresponding to the dialog message may be generated based on the historical dialog information and the dialog message. Generating the dialog reply text from both the historical dialog information and the dialog message maintains consistency with the context and improves the likelihood that the subsequent target multimodal content is clicked by the user.
In this embodiment, optionally, the step "generating a dialog reply text for the dialog message based on the dialog message and the historical dialog information" may include:
determining topic information of topics to which historical conversation information belongs;
acquiring meaning description information of the historical dialog information and meaning description information of the dialog message;
and determining the dialog reply text of the dialog message based on the topic information, the meaning description information of the historical dialog information and the meaning description information of the dialog message.
In this embodiment, a plurality of topics may be preset, and the historical dialog information of this embodiment is analyzed based on a trained topic identification model to obtain a first code of the historical dialog information, where the first code is the topic information and indicates the probabilities that the historical dialog information belongs to the respective preset topics.
In this embodiment, the historical dialog information may also be encoded to obtain a second encoding of the historical dialog information, where the second encoding is used to describe the meaning of the historical dialog information. Optionally, the present embodiment may further encode the dialog message input by the user to obtain a third code of the dialog message, where the third code is used to describe the meaning of the dialog message.
The embodiment may obtain the dialog reply text of the dialog message by decoding the first code, the second code and the third code.
In this embodiment, the historical dialog information and the dialog message may be encoded by an encoder.
The historical dialog information of this embodiment may comprise multiple pieces, in which case the process of obtaining the second code may include: for the first piece among the plurality of pieces of historical dialog information, obtaining a hidden vector of that piece according to the piece itself; for each subsequent piece, obtaining its hidden vector according to the piece and the hidden vector of the previous piece; and obtaining the second code according to the hidden vectors of the plurality of pieces of historical dialog information.
The dialog message in this embodiment may be segmented into a plurality of words, and the process of obtaining the third code may include: for the first word among the plurality of words in the dialog message, obtaining a hidden vector of the word according to the word itself; for each word after the first, obtaining its hidden vector according to the word and the hidden vector of the previous word; and obtaining the third code of the dialog message according to the hidden vectors of the words.
In this embodiment, the process of obtaining the dialog reply text may include: processing the first code and the third code by the coding unit to obtain a hidden vector for the first word among the words of the dialog message; processing that hidden vector and the second code by the attention unit to obtain a reply word identifier corresponding to the word, and determining the corresponding reply word according to the reply word identifier; for each word after the first, processing the first code, the third code, the reply word identifier corresponding to the previous word, and the reply word corresponding to the previous word by the coding unit to obtain a hidden vector, then processing that hidden vector and the second code by the attention unit to obtain the reply word identifier corresponding to the word and determining the corresponding reply word; and generating the dialog reply text from the reply words corresponding to the respective words.
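The following PyTorch sketch illustrates one possible realization of this encode-decode scheme under stated assumptions: GRUs for the recurrent units, greedy word-by-word decoding, a flat token sequence for the history, and arbitrary dimensions. It is not the embodiment's prescribed architecture.

```python
import torch
import torch.nn as nn

class ReplyGenerator(nn.Module):
    """First code: topic distribution; second code: hidden vectors over
    the history; third code: hidden vector over the message words."""

    def __init__(self, vocab_size, n_topics, dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.topic_head = nn.Linear(dim, n_topics)              # first code
        self.history_rnn = nn.GRU(dim, dim, batch_first=True)   # second code
        self.message_rnn = nn.GRU(dim, dim, batch_first=True)   # third code
        self.decoder = nn.GRUCell(dim + n_topics, dim)          # coding unit
        self.attn = nn.Linear(dim * 2, 1)                       # attention unit
        self.out = nn.Linear(dim * 2, vocab_size)

    def forward(self, history_ids, message_ids, max_len=20):
        # Each hidden vector depends on the current piece/word and the
        # previous hidden vector, as described above.
        h_states, h_last = self.history_rnn(self.embed(history_ids))
        _, m_last = self.message_rnn(self.embed(message_ids))
        topic = torch.softmax(self.topic_head(h_last[-1]), dim=-1)

        state, prev = m_last[-1], torch.zeros_like(m_last[-1])
        words = []
        for _ in range(max_len):
            # Combine the first and third codes (plus the previous reply
            # word) to get the next hidden vector.
            state = self.decoder(torch.cat([prev, topic], dim=-1), state)
            # Attend over the second code to pick the next reply word id.
            scores = self.attn(torch.cat(
                [h_states, state.unsqueeze(1).expand_as(h_states)], dim=-1))
            context = (torch.softmax(scores, dim=1) * h_states).sum(dim=1)
            word_id = self.out(torch.cat([state, context], dim=-1)).argmax(-1)
            words.append(word_id)
            prev = self.embed(word_id)
        return torch.stack(words, dim=1)  # reply word ids, one row per sample
```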
207. Correlation information between the dialog reply text and the candidate multimodal content is obtained.
In this embodiment, the words contained in the dialog reply text may be used to obtain the correlation information with the candidate multimodal content.
When obtaining the relevance information, the relevance between the candidate multi-modal content and the dialog reply text can be determined based on any information related to the candidate multi-modal content.
For example, the step of "obtaining relevance information between the dialog reply text and the candidate multimodal content" may include: the method comprises the steps of analyzing the associated information of the candidate multi-modal content, determining the description information of the candidate multi-modal content on at least one description dimension, and determining the relevance information of the candidate multi-modal content and the reply text based on the description information of the candidate multi-modal content and the dialog reply text.
The description dimension of the candidate multi-modal content includes, but is not limited to, an emotion description dimension, an author description dimension, a multi-modal content type description dimension, a topic description dimension, and the like. Wherein the associated information of the candidate multi-modal content includes, but is not limited to, bearer information and/or user interaction information of the candidate multi-modal content.
In this embodiment, optionally, the step of "obtaining correlation information between the dialog reply text and the candidate multimodal content" may include:
acquiring a prediction model, wherein the prediction model is used for predicting the user preference degree corresponding to each preset type of candidate multi-modal content in a dialogue scene;
analyzing the dialogue messages and historical dialogue information through a prediction model, and predicting the user preference degree of the user on candidate multi-modal content of each preset type in the current dialogue scene;
selecting a target type from preset types of candidate multi-modal content based on the predicted user preference degree;
correlation information between the dialog reply text and the candidate multimodal content of the target type is obtained.
The target type may be a type in which a user preference degree in a preset type of the candidate multimodal content satisfies a certain condition, such as a type with a highest user preference degree.
In this embodiment, the prediction model may be a model that predicts the user preference degree of the user. The preset types of the candidate multi-modal content are not limited, and can be multi-modal content of types such as music, audio books, videos, and television series.
The user preference level in this embodiment is a user preference level for a certain type of candidate multimodal content. The user preference level may be any information indicating the user preference, such as the probability of the user watching the preset type of candidate multimodal content, the staying time of the user on the preset type of candidate multimodal content, and so on.
In this embodiment, the "stay duration" in the stay duration of the candidate multimodal content may be understood as a duration spent on some series of contact or non-contact operations, such as browsing, listening, playing, clicking, sharing, and commenting, of the candidate multimodal content.
In one embodiment, the prediction model includes a prediction submodel corresponding to each preset type of the candidate multi-modal content, and the prediction submodel is used for predicting the possible stay time of the user on the corresponding preset type of the candidate multi-modal content in the dialog scene.
The step of predicting the user preference degree of the user for each preset type of candidate multi-modal content in the current dialog scenario by analyzing the dialog messages and the historical dialog information through the prediction model may include:
and analyzing the dialog messages and the historical dialog information through each predictor model so as to predict the possible stay time of the user on each preset type of candidate multi-modal content in the current dialog scene.
Correspondingly, the step of "selecting a target type from the preset types of the candidate multimodal content based on the predicted user preference degree" may include:
and selecting the predicted preset type with the longest possible stay time of the user as the target type from the preset types of the candidate multi-modal content.
In this embodiment, the prediction sub-model may be implemented based on a regression model, and optionally, the intelligent interaction method of this embodiment may further include:
obtaining a regression model;
acquiring sample historical dialogue information of a virtual user and the user, and acquiring historical stay time of the user in each preset type of candidate multi-modal content in a historical dialogue process corresponding to the sample historical dialogue information;
determining training samples corresponding to each preset type of candidate multi-modal content based on the sample historical dialog information and the user's historical stay durations on each preset type of candidate multi-modal content corresponding to that sample historical dialog information;
and respectively training a regression model by using the training samples corresponding to the candidate multi-modal content of each preset type to obtain the prediction sub-model corresponding to the candidate multi-modal content of each preset type.
In this embodiment, existing dialog information between the user and the virtual user can be collected as sample dialog information, and the user's operations on the candidate multimodal content recommended by the terminal during the historical dialogs can be collected and modeled, so as to judge which type of candidate multimodal content should be recommended under which dialog context to achieve the highest user acceptance.
Optionally, in this embodiment, sample historical dialog information of the user and the virtual user is recorded as X, where X is an expression in text form; the duration the user spends on multimodal content in the dialog process corresponding to the historical dialog information is recorded as T (i.e., the historical stay duration, such as song-listening duration, crosstalk-listening duration, video-watching duration, and the like), where the durations T corresponding to different types of candidate multimodal content are regarded as different dependent variables.
Optionally, in this embodiment, a regression model f(X) = T is trained on the collected user corpus. It can be understood that the number of types of candidate multimodal content determines the number of regression models trained, and each trained regression model can predict the user's possible stay duration on one type of candidate multimodal content based on the dialog message and the historical dialog information.
For example, in a real scenario, for each user turn, this embodiment may input the current dialog message together with its dialog history into the trained f(·) to predict the user's possible stay durations on the different types of candidate multimodal content.
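The training procedure above can be illustrated as follows; TF-IDF features and ridge regression are stand-in choices, since the embodiment only requires some regression model f(X) = T per content type.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline

def train_submodels(samples):
    """samples: {content_type: [(history_text X, stay_duration T), ...]}."""
    submodels = {}
    for content_type, pairs in samples.items():
        texts = [x for x, _ in pairs]
        durations = [t for _, t in pairs]
        model = make_pipeline(TfidfVectorizer(), Ridge())
        model.fit(texts, durations)   # learns f(X) = T for this type
        submodels[content_type] = model
    return submodels
```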
Considering that the user may not want to receive any multimodal content in the current dialog scene, and to avoid the poor experience caused by forcing recommendations on an unwilling user, in this embodiment, before the step "selecting a target type from the preset types of the candidate multimodal content based on the predicted user preference degree", the method may further include:
determining whether a preset type with the user preference degree not lower than a preset minimum user preference degree exists in the preset types of the candidate multi-modal content or not based on the predicted user preference degree;
if yes, continuing to select a target type from the preset types of the candidate multi-modal content based on the predicted user preference degree;
and if not, not continuing the step of selecting a target type from the preset types of the candidate multimodal content based on the predicted user preference degree, and instead taking the dialog reply text as the reply message corresponding to the dialog message.
Optionally, when the determination result is yes, the step "selecting a target type from the preset types of the candidate multimodal content based on the predicted user preference degree" may include: based on the predicted user preference degree, selecting a target type from preset types with the user preference degree not lower than a preset minimum user preference degree.
When multiple target types are selected, the reply message displayed in the chat page shown in fig. 2c includes a candidate type list of the target multimodal content, and the candidate types in the list are the target types determined in the above step.
For the embodiment in which the user preference degree is the user's possible stay duration, the step "determining whether a preset type with the user preference degree not lower than a preset minimum user preference degree exists in the preset types of the candidate multimodal content based on the predicted user preference degree" may include:
determining, based on the predicted possible stay durations, whether there is a preset type of candidate multimodal content whose possible stay duration is not less than a preset minimum possible stay duration.
Correspondingly, when the determination result is yes, the preset type with the longest predicted possible stay duration is selected from the preset types of candidate multimodal content as the target type, and the subsequent steps continue; when the determination result is no, the step of selecting a target type based on the predicted user preference degree may be skipped, and the dialog reply text is used as the reply message corresponding to the dialog message.
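A sketch of this gate, with an illustrative (not prescribed) minimum dwell threshold:

```python
MIN_STAY_SECONDS = 10.0   # illustrative minimum, not a prescribed value

def decide_reply(dialog_reply_text, predicted_stay):
    """predicted_stay: {content_type: predicted possible stay duration}."""
    eligible = {t: d for t, d in predicted_stay.items()
                if d >= MIN_STAY_SECONDS}
    if not eligible:
        # No type clears the minimum: reply with the text alone.
        return dialog_reply_text, None
    return dialog_reply_text, max(eligible, key=eligible.get)
```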
208. Based on the relevance information, target multimodal content for replying to the dialog message is determined from the candidate multimodal content.
Optionally, in this embodiment, based on the relevance information, the target multi-modal content whose relevance to the dialog reply text satisfies a certain preset condition may be determined from the candidate multi-modal content, for example, the candidate multi-modal content with the highest relevance to the dialog reply text is determined as the target multi-modal content for replying to the dialog message.
For candidate multimodal content of a target type, the step "determining target multimodal content for replying to the dialog message from the candidate multimodal content based on the relevance information" may include:
determining, based on the relevance information, the target multimodal content for replying to the dialog message from the candidate multimodal content of the target type.
209. The dialog reply text and the target multi-modal content are combined to obtain a reply message corresponding to the dialog message.
Optionally, in this embodiment, play inquiry information for the target multimodal content may be generated based on the target multimodal content, and the play inquiry information may include one or more of the type, name, author, and similar attributes of the target multimodal content. The link address of the target multimodal content can be used as the link of the play inquiry information.
After the reply message is displayed on the chat page, when the user's playing operation on the target multimodal content is detected, the target multimodal content is played according to its link address.
In this embodiment, when the chat page displays the virtual user's reply message for the dialog message, the text in the reply message may be played by voice; specifically, the dialog reply text in the reply message and the play inquiry information of the target multimodal content are played.
In this embodiment, it is feasible to process each kind of multimodal content separately, but doing so increases system cost and complexity. To reduce system cost and complexity, this embodiment proposes a tagging method, that is, each kind of multimodal content is mapped onto a unified tagging system.
Such as:
[ music ] "accompany you too long years", correspond the label set: { cure line, accompany, Mandarin, Chen XX, inspire, Warm }
[ talking book ] "energized", corresponding tag: { job, competence, business, social science, sociology, growth, psychology }
[ phase sound ] "i want to fight" corresponds to the label: { Guo XX, fun, happy, struggle, in X, idealism, wire }
Optionally, in this embodiment, the step of "obtaining correlation information between the dialog reply text and the candidate multimodal content of the target type" may include:
acquiring a tag set of candidate multi-modal content of a target type;
relevance information is computed for the dialog reply text and the tags in the set of tags for the candidate multimodal content.
There are many methods for tagging multimodal content, and this embodiment is not limited in this respect. Optionally, the tag set can be set manually for the candidate multimodal content; the tag set can also be obtained in an automatic labeling manner. For example, for music-type multimodal content, word-frequency statistics can be computed over the lyrics, singer names, music comments, and the like, and high-frequency words used as the music's tags.
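The word-frequency auto-labeling mentioned above might look like the following sketch; the whitespace tokenizer and stop list are placeholders for whatever text processing an implementation actually uses.

```python
from collections import Counter

def auto_tags(texts, stopwords=frozenset(), top_k=8):
    """texts: lyrics, singer names, music comments, etc. for one item."""
    counts = Counter(word
                     for text in texts
                     for word in text.lower().split()
                     if word and word not in stopwords)
    # Keep the highest-frequency words as the item's tags.
    return {word for word, _ in counts.most_common(top_k)}
```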
Optionally, in this embodiment, before the step "obtaining the tag set of the candidate multi-modal content of the target type", the method further includes:
acquiring bearing information and/or user interaction information of the candidate multi-modal content;
and analyzing the characteristics of the candidate multi-modal content on at least one description dimension based on the bearing information and/or the user interaction information of the candidate multi-modal content, and generating a label on at least one description dimension for the candidate multi-modal content to obtain a label set of the candidate multi-modal content.
The bearer information of the multimodal content may include the basic information the multimodal content has once produced, such as producer information, production time information, and the specific content itself. Taking a video as an example, its bearer information may include the background music, lines, character voices, and image frames in the video. The user interaction information of the multimodal content can be understood as all information generated by users interacting with the multimodal content, including but not limited to users' comments on the content and bullet-screen comments (barrage) sent for the content.
Optionally, in this embodiment, the step of "calculating relevance information of tags in the dialog reply text and the tag set of the candidate multimodal content" may include:
analyzing the dialog reply text to obtain keywords in the dialog reply text;
determining similar labels similar to the keywords in a label set of the candidate multi-modal content;
and determining the relevance information of the candidate multi-modal content corresponding to the tag set and the dialog reply text based on the similar tags in the tag set.
Optionally, the relevance information between the tag set and the dialog reply text may be determined based on the number of similar tags in the tag set: the greater the number of similar tags in the tag set, the higher the determined relevance between the candidate multimodal content corresponding to the tag set and the dialog reply text.
For the calculation of the relevance information, any other feasible scheme may also be adopted; this embodiment imposes no limitation here. For example, the dialog reply text and the tags in the tag set may be mapped to a vector space, the similarity between them calculated based on the mapped vectors, and that similarity used as the relevance information.
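Both relevance schemes, counting tags similar to the reply's keywords and cosine similarity in a vector space, are sketched below; the keyword extractor, similarity predicate, and embedding function are assumed to be supplied elsewhere.

```python
import math

def relevance_by_tag_count(keywords, tag_set, is_similar):
    # More tags similar to the reply's keywords means higher relevance.
    return sum(1 for tag in tag_set
               if any(is_similar(kw, tag) for kw in keywords))

def relevance_by_cosine(reply_vec, tag_vecs):
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb) if na and nb else 0.0
    # Take the best-matching tag vector as the relevance score.
    return max((cos(reply_vec, v) for v in tag_vecs), default=0.0)
```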
The intelligent interaction method of this embodiment is exemplified below, taking an intelligent system as the execution subject of the method and referring to the chat page shown in fig. 2f. In this embodiment, the candidate multimodal content and the corresponding tag sets may be stored in the server, or in a blockchain to which the server and the terminal belong, which is not limited in this embodiment.
Referring to fig. 2f, the dialog between the user and the intelligent system is as follows:
User input (hereinafter "User"): Hello, little assistant
System output (hereinafter "System"): Hello
User: I want to talk with you for a while
System: Go ahead, the little assistant is all ears
User: I have been feeling a bit down recently
System: Hope the little owner cheers up soon; the little assistant will always keep you company. [Plays Chen XX's "Accompany You Through the Long Years"]
User: My work is not going smoothly, and I feel my abilities have been improving slowly lately. Any suggestions?
System: I understand. Let the little assistant share a book with the little owner. [Plays the audio book "Energized"]
User: The book sounds good; I will buy it
System: My pleasure
User: After listening to the music my mood is much better. Anything funnier?
System: For something funny, nothing beats Guo XX. [Plays Guo XX's crosstalk "I Want to Fight"]
In the example shown above, when the system receives the user's dialog message "I have been feeling a bit down recently", it may acquire the historical dialog information, namely "User: Hello, little assistant", "System: Hello", "User: I want to talk with you for a while", "System: Go ahead, the little assistant is all ears", and, based on the historical dialog information and the current dialog message, generate the dialog reply text "Hope the little owner cheers up soon; the little assistant will always keep you company". The historical dialog information and the current dialog message are input into the trained regression models to predict the user's possible stay duration on each type of candidate multimodal content in the dialog scene. If the type with the longest predicted stay duration is music, the relevance between the tags of the music-type candidate multimodal content and the dialog reply text is calculated, so that Chen XX's song "Accompany You Through the Long Years" is associated because of its tags {warm, companionship}.
When the system receives the user's dialog message "My work is not going smoothly, and I feel my abilities have been improving slowly lately. Any suggestions?", it may acquire the historical dialog information, namely the preceding turns from "User: Hello, little assistant" up to "System: Hope the little owner cheers up soon; the little assistant will always keep you company". Based on the historical dialog information and the current dialog message, the dialog reply text "I understand. Let the little assistant share a book with the little owner" is generated. The historical dialog information and the current dialog message are then input into the trained regression models to predict the user's possible stay duration on each type of candidate multimodal content; the type with the longest predicted stay duration is determined to be the audio book, the relevance between the tags of audio-book-type candidates and the dialog reply text is calculated, and the audio book "Energized" is associated because of its tags {workplace, competence}.
By adopting the embodiment of the present application, a chat page between a user and a virtual user can be displayed, where the chat page includes the dialog message currently sent by the user to the virtual user; a reply message of the virtual user for the dialog message is displayed on the chat page, where the reply message includes a dialog reply text automatically generated by the virtual user and target multimodal content associated with that text; and when the user's playing operation on the target multimodal content is detected, the target multimodal content is played. This intelligent interaction method can not only reply to the user's text message but also recommend suitable multimodal content to the user according to the dialog history between the user and the virtual user and the user's current input, which greatly increases the interest of the chat and its attraction to the user. At the same time, an aimless casual chat can be converted into purposeful scenarios such as listening to music or audio books, increasing the potential for commercialization, such as paid music and paid audio books.
In order to better implement the above method, correspondingly, an embodiment of the present invention further provides an intelligent interaction device, which can be integrated in a terminal, and with reference to fig. 3, the intelligent interaction device includes:
a page display unit 301, configured to display a chat page between a user and a virtual user, where the chat page includes a dialog message currently sent to the virtual user by the user;
a reply display unit 302, configured to display, on the chat page, a reply message of the virtual user to the conversation message; wherein the reply message comprises a dialog reply text automatically generated by the virtual user and target multimodal content associated with the dialog reply text;
a playing unit 303, configured to play the target multi-modal content when a playing operation of the user for the target multi-modal content is detected.
In one embodiment, the playing unit 303 includes:
the virtual transfer page display subunit is configured to, when a play operation of the user on the target multi-modal content is detected, display a virtual resource transfer page corresponding to the target multi-modal content;
the virtual resource transfer subunit is used for triggering virtual resource transfer aiming at the target multi-modal content based on the virtual resource transfer operation of a user aiming at the virtual resource transfer page;
and the playing subunit is used for playing the target multi-modal content when the virtual resource is successfully transferred.
In one embodiment, the reply display unit is used for displaying the dialog reply text in the reply message and a target multi-modal content list on the chat page, wherein the target multi-modal content list comprises at least two target multi-modal contents.
In one embodiment, a candidate type list of the target multi-modal content is further displayed on the chat page, and the candidate type list comprises a candidate type of the target multi-modal content;
the device further comprises: and the switching display unit is used for switching and displaying the target multi-modal content of the selected candidate type in the target multi-modal content list when the user selection operation aiming at the candidate type in the candidate type list is detected.
In one embodiment, the playing unit is configured to display a playing page of the target multi-modal content when a playing operation of the user on the target multi-modal content is detected, wherein similar multi-modal content similar to the target multi-modal content is also displayed on the playing page.
In one embodiment, the playing unit is configured to play a target content segment in the target multi-modal content when a playing operation of the user on the target multi-modal content is detected, where the target content segment is a content segment in the target multi-modal content that is associated with semantics of the dialog reply text.
In one embodiment, the intelligent interaction device of this embodiment further includes:
the conversation acquisition unit is used for acquiring historical conversation information of the virtual user and the user when a conversation message sent by the user aiming at the virtual user is received;
the generating unit is used for generating a dialogue reply text of the dialogue message based on the dialogue message and historical dialogue information;
a correlation obtaining unit, configured to obtain correlation information between the dialog reply text and candidate multimodal content;
a determining unit, configured to determine, from the candidate multi-modal content, a target multi-modal content to reply to the dialog message based on the relevance information;
and the combination unit is used for combining the dialog reply text and the target multi-modal content to obtain a reply message corresponding to the dialog message.
In one embodiment, the correlation obtaining unit includes:
the device comprises a first obtaining subunit, a second obtaining subunit and a third obtaining subunit, wherein the first obtaining subunit is used for obtaining a prediction model, and the prediction model is used for predicting the user preference degree corresponding to each preset type of candidate multi-modal content in a dialogue scene;
the prediction subunit is used for analyzing the dialogue messages and the historical dialogue information through the prediction model and predicting the user preference degree of the user on candidate multi-modal content of each preset type in the current dialogue scene;
a selecting subunit, configured to select a target type from preset types of the candidate multimodal content based on the predicted user preference degree;
and the second acquisition subunit is used for acquiring correlation information between the dialog reply text and the candidate multi-modal content of the target type.
In one embodiment, the prediction model comprises a prediction submodel corresponding to each preset type of the candidate multi-modal content, and the prediction submodel is used for predicting the possible stay time of the user on the candidate multi-modal content of the corresponding preset type in a dialog scene;
the prediction subunit is used for analyzing the dialogue messages and the historical dialogue information through each prediction submodel to predict the possible stay time of the user on each preset type of candidate multi-modal content in the current dialogue scene;
the selecting subunit is configured to select, from the preset types of the candidate multimodal content, the preset type with the longest predicted possible dwell time of the user as the target type.
In one embodiment, the apparatus further includes a control unit for determining whether there is a preset type having a degree of user preference not lower than a preset minimum degree of user preference among the preset types of the candidate multimodal content based on the predicted degree of user preference before the selecting subunit selects the target type from the preset types of the candidate multimodal content based on the predicted degree of user preference; and if so, controlling the selection subunit to continue the step of selecting the target type from the preset types of the candidate multi-modal content based on the predicted user preference degree.
In one embodiment, the second obtaining subunit is configured to:
obtaining a tag set of the candidate multi-modal content of the target type;
relevance information of tags in the dialog reply text and the set of tags of the candidate multimodal content is calculated.
In one embodiment, the intelligent interaction device further comprises a model processing unit for:
obtaining a regression model;
acquiring sample historical dialogue information of the virtual user and the user, and acquiring historical stay time of the user in each preset type of candidate multi-modal content in a historical dialogue process corresponding to the sample historical dialogue information;
determining training samples corresponding to each preset type of candidate multi-modal content based on the sample historical dialog information and the user's historical stay durations on each preset type of candidate multi-modal content corresponding to that sample historical dialog information;
and respectively training one regression model by using the training sample corresponding to the candidate multi-modal content of each preset type to obtain the prediction sub-model corresponding to the candidate multi-modal content of each preset type.
In one embodiment, the intelligent interaction device further comprises a tag setting unit, configured to obtain bearer information and/or user interaction information of the candidate multi-modal content before the second obtaining subunit obtains the tag set of the candidate multi-modal content of the target type; and analyzing the characteristics of the candidate multi-modal content on at least one description dimension based on the bearing information and/or the user interaction information of the candidate multi-modal content, and generating a label on at least one description dimension for the candidate multi-modal content to obtain a label set of the candidate multi-modal content.
In one embodiment, the second obtaining subunit is configured to:
analyzing the conversation reply text to obtain keywords in the conversation reply text;
determining similar labels similar to the keywords in a label set of the candidate multi-modal content;
and determining the relevance information of the candidate multi-modal content corresponding to the tag set and the dialog reply text based on the similar tags in the tag set.
In one embodiment, the conversation acquisition unit is configured to:
acquiring historical dialogue information generated by the virtual user and the user in a current dialogue scene;
or acquiring historical dialogue information between the virtual user and the user within a historical time period which is a preset time length away from the current time;
or acquiring historical conversation information of the virtual user and the user and belonging to a target topic, wherein the target topic is the topic to which the conversation message belongs.
In one embodiment, a generation unit includes:
the topic determining subunit is used for determining topic information of the topic to which the historical conversation information belongs;
a description information acquisition subunit, configured to acquire meaning description information of the historical dialog information and meaning description information of the dialog message;
a text determination subunit, configured to determine a dialog reply text of the dialog message based on the topic information, the meaning description information of the historical dialog information, and the meaning description information of the dialog message.
The device disclosed in the embodiment of the present invention can display a chat page between a user and a virtual user, where the chat page includes the dialog message currently sent by the user to the virtual user; display, on the chat page, the virtual user's reply message for the dialog message, where the reply message includes a dialog reply text automatically generated by the virtual user and target multimodal content associated with that text; and play the target multimodal content when the user's playing operation on it is detected. In this way the user can be replied to with multimodal content matched with text during the conversation, which enriches the forms of dialog with the user and greatly increases the interest of the chat and its attraction to the user.
In addition, an embodiment of the present invention further provides a computer device, where the computer device may be a terminal or a server, as shown in fig. 4, which shows a schematic structural diagram of the computer device according to the embodiment of the present invention, and specifically:
the computer device may include components such as a processor 401 of one or more processing cores, memory 402 of one or more computer-readable storage media, and a power supply 403. Those skilled in the art will appreciate that the computer device configuration illustrated in FIG. 4 does not constitute a limitation of computer devices, and may include more or fewer components than those illustrated, or some components may be combined, or a different arrangement of components.
Wherein:
the processor 401 is a control center of the computer device, connects various parts of the entire computer device using various interfaces and lines, and performs various functions of the computer device and processes data by operating or executing software programs and/or units stored in the memory 402 and calling data stored in the memory 402, thereby monitoring the computer device as a whole. Optionally, in one embodiment, processor 401 may include one or more processing cores; preferably, the processor 401 may integrate an application processor, which mainly handles operating systems, user interfaces, application programs, etc., and a modem processor, which mainly handles wireless communications. It will be appreciated that the modem processor described above may not be integrated into the processor 401.
The memory 402 may be used to store software programs and units, and the processor 401 executes various functional applications and data processing by operating the software programs and units stored in the memory 402. The memory 402 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the storage data area may store data created according to use of the computer device, and the like. Further, the memory 402 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid state storage device. Accordingly, the memory 402 may also include a memory controller to provide the processor 401 access to the memory 402.
The computer device further comprises a power supply 403 for supplying power to the various components, and preferably, the power supply 403 is logically connected to the processor 401 via a power management system, so that functions of managing charging, discharging, and power consumption are implemented via the power management system. The power supply 403 may also include any component of one or more dc or ac power sources, recharging systems, power failure detection circuitry, power converters or inverters, power status indicators, and the like.
When the computer device is a terminal, the computer device may further include an input unit 404, and the input unit 404 may be used to receive input numeric or character information and generate a keyboard, mouse, joystick, optical or trackball signal input in relation to user setting and function control. Of course, it is understood that the present embodiment does not exclude the solution that the server includes the input unit, and the server of the present embodiment may also include the input unit 404.
Although not shown, the computer device, such as the terminal, of the present embodiment may further include a display unit and the like, which are not described herein again. Similarly, the present embodiment does not exclude the scheme that the server includes the display unit, and the server in the present embodiment may also include the display unit.
Specifically, in this embodiment, the processor 401 in the computer device loads the executable file corresponding to the process of one or more application programs into the memory 402 according to the following instructions, and the processor 401 runs the application programs stored in the memory 402, thereby implementing various functions as follows:
displaying a chat page between a user and a virtual user, wherein the chat page comprises a conversation message sent by the user to the virtual user currently;
displaying a reply message of a virtual user aiming at the conversation message on the chat page; wherein the reply message comprises a dialog reply text automatically generated by the virtual user and target multimodal content associated with the dialog reply text;
and when a playing operation of the user for the target multi-modal content is detected, playing the target multi-modal content.
The above operations can be implemented in the foregoing embodiments, and are not described in detail herein.
As can be seen from the above, the computer device of this embodiment can reply to the user during the chat with multimodal content, such as music, songs, audio books, and videos, matched with text.
The reply information generation system according to the embodiment of the present invention may be a distributed system formed by connecting a client and a plurality of nodes (computer devices in any form in an access network, such as servers and terminals) in a network communication manner.
Taking a distributed system as an example of a blockchain system, referring to fig. 5, fig. 5 is an optional structural schematic diagram of the distributed system 100 applied to the blockchain system, which is formed by a plurality of nodes (computing devices in any form in an access network, such as servers and user terminals) and clients; a Peer-to-Peer (P2P) network is formed between the nodes, and the P2P protocol is an application-layer protocol operating over the Transmission Control Protocol (TCP). In a distributed system, any machine, such as a server or a terminal, can join to become a node; a node comprises a hardware layer, a middle layer, an operating system layer, and an application layer. In this embodiment, the prediction model, the training samples, the current and historical dialog information between the virtual user and the user, the target multimodal content, and the candidate multimodal content with its corresponding tag sets may all be stored in the shared ledger of the blockchain system through the nodes of the distributed system, and a computer device (e.g., a terminal or server) may obtain the candidate multimodal content, its tag sets, and the target multimodal content based on the record data stored in the shared ledger.
Referring to the functions of each node in the blockchain system shown in fig. 5, the functions involved include:
1) routing, a basic function that a node has, is used to support communication between nodes.
Besides the routing function, the node may also have the following functions:
2) Application, deployed in the blockchain to implement specific services according to actual business requirements. It records data related to the implemented functions to form record data, carries a digital signature in the record data to indicate the source of the task data, and sends the record data to other nodes in the blockchain system, so that the other nodes add the record data to a temporary block when the source and integrity of the record data are verified successfully.
For example, the services implemented by the application include:
2.1) Wallet, for providing the function of electronic money transactions, including initiating a transaction (i.e., sending the transaction record of the current transaction to other nodes in the blockchain system; after the other nodes verify it successfully, the record data of the transaction is stored in a temporary block of the blockchain as a response confirming that the transaction is valid). Of course, the wallet also supports querying the electronic money remaining at an electronic money address.
2.2) Shared ledger, for providing functions such as storage, query, and modification of account data. Record data of operations on the account data is sent to other nodes in the blockchain system; after the other nodes verify its validity, the record data is stored in a temporary block as a response acknowledging that the account data is valid, and a confirmation may be sent to the node that initiated the operation.
2.3) Smart contracts: computerized agreements that can enforce the terms of a contract, implemented by code deployed on the shared ledger and executed when certain conditions are met, used to complete automated transactions according to actual business requirements; for example, querying the logistics status of goods purchased by a buyer, or transferring the buyer's electronic money to the merchant's address after the buyer signs for the goods. Of course, smart contracts are not limited to contracts for transactions and may also execute contracts that process received information.
3) Blockchain, comprising a series of blocks connected to one another in the chronological order of their generation. New blocks, once added to the blockchain, cannot be removed, and the blocks record the data submitted by nodes in the blockchain system.
Referring to fig. 6, fig. 6 is an optional schematic diagram of a block structure according to an embodiment of the present invention. Each block includes the hash value of the transaction records stored in the block (the hash value of the block) and the hash value of the previous block, and the blocks are connected by these hash values to form a blockchain. A block may also include information such as a timestamp of when the block was generated. A blockchain is essentially a decentralized database, a string of data blocks associated using cryptography, each containing related information used to verify the validity (anti-counterfeiting) of its information and to generate the next block.
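As a minimal illustration of this block structure (not the patent's prescribed format), each block below stores its own hash, a timestamp, and the previous block's hash, so the chain's linkage can be verified:

```python
import hashlib
import json
import time

def make_block(records, prev_hash):
    block = {"records": records, "timestamp": time.time(),
             "prev_hash": prev_hash}
    block["hash"] = hashlib.sha256(
        json.dumps(block, sort_keys=True).encode()).hexdigest()
    return block

def verify_chain(chain):
    # Each block must reference the hash of the block before it.
    return all(chain[i]["prev_hash"] == chain[i - 1]["hash"]
               for i in range(1, len(chain)))
```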
It will be understood by those skilled in the art that all or part of the steps of the methods of the above embodiments may be performed by instructions or by associated hardware controlled by the instructions, which may be stored in a computer readable storage medium and loaded and executed by a processor.
To this end, the present invention further provides a storage medium, in which a plurality of instructions are stored, where the instructions can be loaded by a processor to execute the steps in any one of the intelligent interaction methods provided by the embodiments of the present invention.
The above operations can be implemented in the foregoing embodiments, and are not described in detail herein.
Wherein the storage medium may include: read Only Memory (ROM), Random Access Memory (RAM), magnetic or optical disks, and the like.
Since the instructions stored in the storage medium can execute the steps in any of the intelligent interaction methods provided in the embodiments of the present invention, the beneficial effects that can be achieved by any of the intelligent interaction methods provided in the embodiments of the present invention can be achieved, which are detailed in the foregoing embodiments and will not be described herein again.
The intelligent interaction method, the intelligent interaction device, the computer equipment and the storage medium provided by the embodiment of the invention are described in detail, a specific example is applied in the text to explain the principle and the implementation of the invention, and the description of the embodiment is only used for helping to understand the method and the core idea of the invention; meanwhile, for those skilled in the art, according to the idea of the present invention, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present invention.

Claims (15)

1. An intelligent interaction method, comprising:
displaying a chat page between a user and a virtual user, wherein the chat page comprises a conversation message sent by the user to the virtual user currently;
displaying a reply message of a virtual user aiming at the conversation message on the chat page; wherein the reply message comprises a dialog reply text automatically generated by the virtual user and target multimodal content associated with the dialog reply text;
when the user is detected to play the target multi-modal content, the target multi-modal content is played.
2. The intelligent interaction method as claimed in claim 1, wherein said playing the target multi-modal content when a playing operation of the user for the target multi-modal content is detected comprises:
when the playing operation of the user for the target multi-modal content is detected, displaying a virtual resource transfer page corresponding to the target multi-modal content;
triggering virtual resource transfer for the target multimodal content based on a virtual resource transfer operation of a user for the virtual resource transfer page;
and when the virtual resource is transferred successfully, playing the target multi-modal content.
3. The intelligent interaction method as claimed in claim 1, wherein said displaying a reply message of the virtual user to the conversation message on the chat page comprises:
and displaying a dialog reply text in the reply message and a target multi-modal content list on the chat page, wherein the target multi-modal content list comprises at least two target multi-modal contents.
4. The intelligent interaction method of claim 3, wherein a list of candidate types of the targeted multimodal content is further displayed on the chat page, wherein the list of candidate types comprises candidate types of the targeted multimodal content;
the intelligent interaction method further comprises the following steps:
when the user selection operation for the candidate type in the candidate type list is detected, the target multi-modal content of the selected candidate type is switched and displayed in the target multi-modal content list.
5. The intelligent interaction method as claimed in claim 1, wherein said playing the target multi-modal content when a playing operation of the user for the target multi-modal content is detected comprises:
when a playing operation of the user for the target multi-modal content is detected, playing a target content segment in the target multi-modal content, wherein the target content segment is a content segment in the target multi-modal content that is associated with the semantics of the dialog reply text.
6. The intelligent interaction method as claimed in claim 1, further comprising, before the chat page displays the reply message of the virtual user to the conversation message:
when receiving a dialogue message sent by a user aiming at the virtual user, acquiring historical dialogue information of the virtual user and the user;
generating a dialog reply text of the dialog message based on the dialog message and historical dialog information;
obtaining correlation information between the dialog reply text and candidate multi-modal content;
determining, from the candidate multimodal content, targeted multimodal content to reply to the dialog message based on the relevance information;
and combining the conversation reply text and the target multi-modal content to obtain a reply message corresponding to the conversation message.
7. The intelligent interaction method of claim 6, wherein the obtaining of the correlation information between the dialog reply text and the candidate multimodal content comprises:
acquiring a prediction model, wherein the prediction model is used for predicting user preference degrees corresponding to candidate multi-modal content of each preset type in a dialogue scene;
analyzing the dialogue messages and historical dialogue information through the prediction model, and predicting the user preference degree of the user for candidate multi-modal content of each preset type in the current dialogue scene;
selecting a target type from preset types of the candidate multi-modal content based on the predicted user preference degree;
and acquiring correlation information between the dialog reply text and the candidate multi-modal content of the target type.
8. The intelligent interaction method according to claim 7, wherein the prediction model comprises a prediction submodel corresponding to each preset type of the candidate multi-modal content, and the prediction submodel is used for predicting the possible stay time of the user on the corresponding preset type of the candidate multi-modal content in the dialog scene;
the analyzing the dialog messages and the historical dialog information through the prediction model to predict the user preference degree of the user for each preset type of candidate multi-modal content in the current dialog scene comprises the following steps:
analyzing the dialogue messages and historical dialogue information through each prediction submodel to predict the possible stay time of the user on each preset type of candidate multi-modal content in the current dialogue scene;
selecting a target type from the preset types of the candidate multi-modal content based on the predicted user preference degree, wherein the selecting the target type comprises the following steps:
and selecting the predicted preset type with the longest possible stay time of the user as a target type from the preset types of the candidate multi-modal content.
9. The intelligent interaction method according to claim 7, further comprising, before the selecting a target type from the preset types of the candidate multi-modal content based on the predicted user preference degrees:
determining, based on the predicted user preference degrees, whether any preset type of the candidate multi-modal content has a user preference degree not lower than a preset minimum user preference degree;
and if so, proceeding to the step of selecting a target type from the preset types of the candidate multi-modal content based on the predicted user preference degrees.
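(Reading claims 8 and 9 together, one illustrative realization: a submodel per preset type predicts dwell time, the longest predicted dwell time wins, and the whole step is skipped when no type clears a preset floor. The threshold value and the predict signature are invented for this sketch.)

    MIN_PREFERENCE = 3.0   # hypothetical floor, e.g. seconds of predicted dwell time

    def select_type_by_dwell_time(submodels, dialog_message, history):
        # One submodel per preset type; each predicts how long the user
        # would likely stay on content of that type in this dialog scene.
        dwell = {ctype: m.predict(dialog_message, history)
                 for ctype, m in submodels.items()}
        # Claim 9's guard: proceed only if some type meets the minimum.
        if max(dwell.values(), default=0.0) < MIN_PREFERENCE:
            return None   # caller can fall back to a text-only reply
        # Claim 8's rule: the type with the longest predicted dwell time.
        return max(dwell, key=dwell.get)
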
10. The intelligent interaction method according to claim 7, wherein the acquiring correlation information between the dialog reply text and the candidate multi-modal content of the target type comprises:
acquiring a tag set of the candidate multi-modal content of the target type;
and calculating correlation information between the dialog reply text and the tags in the tag set of the candidate multi-modal content.
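(A deliberately simple sketch of claim 10's tag-based relevance, using literal word overlap between the reply text and a candidate's tag set; a production system would more plausibly use semantic similarity, and both the tokenization and the scoring here are assumptions.)

    def tag_relevance(reply_text, tag_set):
        # Fraction of the candidate's tags that literally occur
        # in the dialog reply text (naive overlap measure).
        words = set(reply_text.lower().split())
        if not tag_set:
            return 0.0
        hits = sum(1 for tag in tag_set if tag.lower() in words)
        return hits / len(tag_set)
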
11. The intelligent interaction method according to claim 10, further comprising, before the acquiring a tag set of the candidate multi-modal content of the target type:
acquiring carrier information and/or user interaction information of the candidate multi-modal content;
and analyzing, based on the carrier information and/or the user interaction information, characteristics of the candidate multi-modal content in at least one description dimension, and generating a tag in each such description dimension for the candidate multi-modal content to obtain the tag set of the candidate multi-modal content.
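(One way claim 11's tag generation could look: derive tags in a few description dimensions from the content's carrier information and user interaction information. Every field name below — title, category, comments — is a hypothetical stand-in, not a structure named in the patent.)

    def build_tag_set(carrier_info, interactions):
        # Derive tags in several description dimensions; field names
        # are assumptions for illustration.
        tags = set()
        tags.update(carrier_info.get("title", "").lower().split())   # "what it is" dimension
        if category := carrier_info.get("category"):
            tags.add(category.lower())                                # genre dimension
        for comment in interactions.get("comments", []):
            # audience-reaction dimension, keeping only longer words
            tags.update(w for w in comment.lower().split() if len(w) > 3)
        return tags
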
12. The intelligent interaction method according to any one of claims 6 to 11, wherein the generating a dialog reply text for the dialog message based on the dialog message and the historical dialog information comprises:
determining topic information of the topic to which the historical dialog information belongs;
acquiring meaning description information of the historical dialog information and meaning description information of the dialog message;
and determining the dialog reply text for the dialog message based on the topic information, the meaning description information of the historical dialog information, and the meaning description information of the dialog message.
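(To make claim 12 concrete, a sketch in which three signals condition the reply: the topic of the history, an encoding of the history, and an encoding of the current message. The topic_model, encoder, and decoder callables are placeholders, not components named in the patent, and history is assumed to be a list of utterance strings.)

    def generate_dialog_reply(dialog_message, history, topic_model, encoder, decoder):
        topic = topic_model(history)                 # topic information of the history
        history_vec = encoder(" ".join(history))     # meaning description of the history
        message_vec = encoder(dialog_message)        # meaning description of the message
        # Condition the reply on all three signals, per claim 12.
        return decoder(topic, history_vec, message_vec)
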
13. An intelligent interaction device, comprising:
a page display unit, configured to display a chat page between a user and a virtual user, the chat page comprising a dialog message currently sent by the user to the virtual user;
a reply display unit, configured to display, on the chat page, a reply message of the virtual user for the dialog message, wherein the reply message comprises a dialog reply text automatically generated by the virtual user and target multi-modal content associated with the dialog reply text;
and a playing unit, configured to play the target multi-modal content when the user's play operation for the target multi-modal content is detected.
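(A structural sketch of the device in claim 13, one attribute per unit boundary; the method names and unit interfaces are invented for illustration.)

    class IntelligentInteractionDevice:
        # Three cooperating units, mirroring claim 13's decomposition.
        def __init__(self, page_display_unit, reply_display_unit, playing_unit):
            self.page_display_unit = page_display_unit
            self.reply_display_unit = reply_display_unit
            self.playing_unit = playing_unit

        def on_dialog_message(self, user, message):
            self.page_display_unit.show_chat_page(user, message)

        def on_reply(self, reply_message):
            self.reply_display_unit.show_reply(reply_message)

        def on_play_operation(self, target_content):
            self.playing_unit.play(target_content)
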
14. A storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 12.
15. A computer device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the steps of the method according to any one of claims 1 to 12.
CN201911103109.7A 2019-11-12 2019-11-12 Intelligent interaction method and device, computer equipment and storage medium Active CN110995569B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911103109.7A CN110995569B (en) 2019-11-12 2019-11-12 Intelligent interaction method and device, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN110995569A (en) 2020-04-10
CN110995569B (en) 2023-04-07

Family

ID=70083921

Country Status (1)

Country Link
CN (1) CN110995569B (en)

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2010093780A2 (en) * 2009-02-13 2010-08-19 University Of Florida Research Foundation, Inc. Communication and skills training using interactive virtual humans
US20120023419A1 (en) * 2010-07-22 2012-01-26 Kannan Pallipuram V Slider and history field for smart chat sessions
US20140173430A1 (en) * 2012-12-19 2014-06-19 Rabbit, Inc. Method and system for sharing and discovery
US20140279050A1 (en) * 2008-05-21 2014-09-18 The Delfin Project, Inc. Dynamic chatbot
CN106773923A (en) * 2016-11-30 2017-05-31 北京光年无限科技有限公司 The multi-modal affection data exchange method and device of object manipulator
CN108108340A (en) * 2017-11-28 2018-06-01 北京光年无限科技有限公司 For the dialogue exchange method and system of intelligent robot
CN108174248A (en) * 2018-01-25 2018-06-15 腾讯科技(深圳)有限公司 Video broadcasting method, video playing control method, device and storage medium
CN108960402A (en) * 2018-06-11 2018-12-07 上海乐言信息科技有限公司 A kind of mixed strategy formula emotion towards chat robots pacifies system
CN109033223A (en) * 2018-06-29 2018-12-18 北京百度网讯科技有限公司 For method, apparatus, equipment and computer readable storage medium across type session
CN109844708A (en) * 2017-06-21 2019-06-04 微软技术许可有限责任公司 Recommend media content by chat robots
CN110136713A (en) * 2019-05-14 2019-08-16 苏州思必驰信息科技有限公司 Dialogue method and system of the user in multi-modal interaction

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ARNAUD RAMEY: "Vision-based people detection using depth information for social robots", International Journal of Advanced Robotic Systems *
LI Fangfang et al.: "Intelligent shopping guide system based on chatbots", Fujian Computer *

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112801752B (en) * 2021-02-04 2024-05-03 腾讯科技(深圳)有限公司 Page display method, device, equipment and medium based on application mall
CN112801752A (en) * 2021-02-04 2021-05-14 腾讯科技(深圳)有限公司 Page display method, device, equipment and medium based on application mall
CN115470325A (en) * 2021-06-10 2022-12-13 腾讯科技(深圳)有限公司 Message reply method, device and equipment
CN115470325B (en) * 2021-06-10 2024-05-10 腾讯科技(深圳)有限公司 Message reply method, device and equipment
CN113709402A (en) * 2021-08-31 2021-11-26 中国平安人寿保险股份有限公司 Audio and video conversation method, device, equipment and storage medium based on artificial intelligence
CN113709402B (en) * 2021-08-31 2022-11-25 中国平安人寿保险股份有限公司 Audio and video conversation method, device, equipment and storage medium based on artificial intelligence
CN114020153A (en) * 2021-11-04 2022-02-08 上海元梦智能科技有限公司 Multi-mode man-machine interaction method and device
CN114896385A (en) * 2022-07-15 2022-08-12 北京聆心智能科技有限公司 Training of conversation generation model and conversation generation method, device and equipment
CN115378890A (en) * 2022-08-12 2022-11-22 腾讯科技(武汉)有限公司 Information input method, information input device, storage medium and computer equipment
CN115378890B (en) * 2022-08-12 2023-08-18 腾讯科技(武汉)有限公司 Information input method, device, storage medium and computer equipment
CN115426434A (en) * 2022-08-15 2022-12-02 北京达佳互联信息技术有限公司 Data processing method, device and storage medium
CN115426434B (en) * 2022-08-15 2023-10-31 北京达佳互联信息技术有限公司 Data processing method, device and storage medium
CN116881406B (en) * 2023-09-08 2024-01-09 国网信息通信产业集团有限公司 Multi-mode intelligent file retrieval method and system
CN116881406A (en) * 2023-09-08 2023-10-13 国网信息通信产业集团有限公司 Multi-mode intelligent file retrieval method and system

Also Published As

Publication number Publication date
CN110995569B (en) 2023-04-07

Similar Documents

Publication Publication Date Title
CN110995569B (en) Intelligent interaction method and device, computer equipment and storage medium
CN110581772B (en) Instant messaging message interaction method and device and computer readable storage medium
CN110598037B (en) Image searching method, device and storage medium
US20170201562A1 (en) System and method for automatically recreating personal media through fusion of multimodal features
CN111079015B (en) Recommendation method and device, computer equipment and storage medium
CN110139162A (en) The sharing method and device of media content, storage medium, electronic device
CN111008336A (en) Content recommendation method, device and equipment and readable storage medium
CN105204886B (en) A kind of method, user terminal and server activating application program
CN110727761B (en) Object information acquisition method and device and electronic equipment
CN113377971A (en) Multimedia resource generation method and device, electronic equipment and storage medium
CN111324773A (en) Background music construction method and device, electronic equipment and storage medium
CN113284523A (en) Dynamic effect display method and device, computer equipment and storage medium
CN112165639B (en) Content distribution method, device, electronic equipment and storage medium
CN111143454B (en) Text output method and device and readable storage medium
CN110517672A (en) User's intension recognizing method, method for executing user command, system and equipment
KR102243275B1 (en) Method, device and computer readable storage medium for automatically generating content regarding offline object
CN112069830A (en) Intelligent conversation method and device
CN111339349A (en) Singing bill recommendation method
CN116775815B (en) Dialogue data processing method and device, electronic equipment and storage medium
CN116757855A (en) Intelligent insurance service method, device, equipment and storage medium
CN111787042A (en) Method and device for pushing information
CN109313638B (en) Application recommendation
CN111753507B (en) Text data processing method, device, equipment and storage medium
CN109408679A (en) Method, apparatus, electronic equipment and the storage medium of intelligent management application program
CN105681155A (en) User information processing method and apparatus in instant messaging

Legal Events

Date Code Title Description
PB01 Publication
REG Reference to a national code (ref country code: HK; ref legal event code: DE; ref document number: 40021685)
SE01 Entry into force of request for substantive examination
GR01 Patent grant