CN114048299A - Dialogue method, apparatus, device, computer-readable storage medium, and program product


Info

Publication number: CN114048299A
Authority: CN (China)
Prior art keywords: conversation, video, target, base, knowledge
Legal status: Pending
Application number: CN202111393299.8A
Other languages: Chinese (zh)
Inventors: 杨海军, 徐倩, 杨强
Current Assignee: WeBank Co Ltd
Original Assignee: WeBank Co Ltd
Application filed by WeBank Co Ltd
Priority to CN202111393299.8A
Publication of CN114048299A

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval of unstructured textual data
    • G06F16/33: Querying
    • G06F16/3329: Natural language query formulation or dialogue systems
    • G06F16/3344: Query execution using natural language analysis
    • G06F16/335: Filtering based on additional data, e.g. user or group profiles
    • G06F16/70: Information retrieval of video data
    • G06F16/78: Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/7834: Retrieval using metadata automatically derived from the content, using audio features
    • G06F16/7844: Retrieval using metadata automatically derived from the content, using original textual content or text extracted from visual content or transcript of audio data
    • G06F40/00: Handling natural language data
    • G06F40/117: Tagging; Marking up; Designating a block; Setting of attributes
    • G06F40/35: Discourse or dialogue representation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Library & Information Science (AREA)
  • Mathematical Physics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application provides a dialogue method, apparatus, device, computer-readable storage medium, and program product. The method includes: acquiring a dialogue request sent by a terminal, where the dialogue request carries a dialogue question and a user identifier; screening a target knowledge sub-base from a pre-established dialogue knowledge base according to the user identifier and the dialogue question, where the target knowledge sub-base is used for storing reference knowledge of a target object in dialogue with the user; acquiring, in the target knowledge sub-base, a dialogue video according to the dialogue question, where the dialogue video is a dynamic picture of the target object answering the dialogue question; and sending the dialogue video to the terminal so that the dialogue video is output on the terminal. In this way, a video dialogue between the user and the target object is realized, meeting the user's personalized emotional needs.

Description

Dialogue method, apparatus, device, computer-readable storage medium, and program product
Technical Field
The present application relates to the field of artificial intelligence technology, and in particular, but not exclusively, to a dialogue method, apparatus, device, computer-readable storage medium, and program product.
Background
With the continuous development of artificial intelligence, the Internet, and other technologies, robots have become increasingly powerful, and the requirements on the interaction modes between robots and users have grown increasingly diverse. In the prior art, when a robot converses with a user, it typically retrieves a response related to the user's question from an offline or online corpus or chat database; the response content is uniform and carries no emotional color.
People sometimes need emotional comfort, especially conversation with relatives, yet for various reasons they cannot talk with the people they want to. For example, a child whose parents died young loses the chance to talk with them; children whose parents work far away cannot release their emotions in time; when one spouse dies, the other has no way to ease their longing. A robot that communicates without emotional color cannot meet the user's personalized emotional needs.
Disclosure of Invention
The embodiments of the present application provide a dialogue method, apparatus, device, computer-readable storage medium, and computer program product, which can realize human-machine video dialogue and meet users' personalized emotional needs.
The technical solutions in the embodiments of the present application are realized as follows:
An embodiment of the present application provides a dialogue method, which includes the following steps:
acquiring a dialogue request sent by a terminal, where the dialogue request carries a dialogue question and a user identifier;
screening a target knowledge sub-base from a pre-established dialogue knowledge base according to the user identifier and the dialogue question, where the target knowledge sub-base is used for storing reference knowledge of a target object in dialogue with the user;
acquiring, in the target knowledge sub-base, a dialogue video according to the dialogue question, where the dialogue video is a dynamic picture of the target object answering the dialogue question;
and sending the dialogue video to the terminal so that the dialogue video is output on the terminal.
An embodiment of the present application provides a dialogue apparatus, which includes:
a first acquisition module, configured to acquire a dialogue request sent by a terminal, where the dialogue request carries a dialogue question and a user identifier;
a screening module, configured to screen a target knowledge sub-base from a pre-established dialogue knowledge base according to the user identifier and the dialogue question, where the target knowledge sub-base is used for storing reference knowledge of a target object in dialogue with the user;
a second acquisition module, configured to acquire, in the target knowledge sub-base, a dialogue video according to the dialogue question, where the dialogue video is a dynamic picture of the target object answering the dialogue question;
and a sending module, configured to send the dialogue video to the terminal so that the dialogue video is output on the terminal.
An embodiment of the present application provides an electronic device, which includes:
a memory, configured to store executable instructions;
and a processor, configured to implement the dialogue method provided by the embodiments of the present application when executing the executable instructions stored in the memory.
An embodiment of the present application provides a computer-readable storage medium storing executable instructions which, when executed by a processor, cause the processor to implement the dialogue method provided by the embodiments of the present application.
An embodiment of the present application provides a computer program product including a computer program which, when executed by a processor, implements the dialogue method provided by the embodiments of the present application.
The embodiments of the present application have the following beneficial effects:
In the dialogue method provided by the embodiments of the present application, when a user wants to converse with a target object, the user inputs a dialogue question through a terminal, and the terminal sends a dialogue request carrying the dialogue question and a user identifier to the server. According to the user identifier and the dialogue question, the server screens out, from a pre-established dialogue knowledge base, a target knowledge sub-base storing reference knowledge of the target object; acquires from it, according to the dialogue question, a dialogue video of the target object answering the dialogue question; and finally sends the dialogue video to the terminal for playback.
Drawings
Fig. 1 is a schematic diagram of the network architecture of a dialogue system according to an embodiment of the present application;
Fig. 2 is a schematic structural diagram of an electronic device provided in an embodiment of the present application;
Fig. 3 is a schematic flow chart of an implementation of the dialogue method according to an embodiment of the present application;
Fig. 4 is a schematic flow chart of another implementation of the dialogue method according to an embodiment of the present application;
Fig. 5 is a schematic flow chart of a further implementation of the dialogue method provided in an embodiment of the present application;
Fig. 6 is a schematic diagram of collecting emotional question sets according to an embodiment of the present application;
Fig. 7 is a schematic diagram of the framework of a dialogue knowledge base according to an embodiment of the present application;
Fig. 8 is a schematic view of an emotional dialogue processing flow provided in an embodiment of the present application.
Detailed Description
In order to make the objectives, technical solutions, and advantages of the present application clearer, the present application is described in further detail below with reference to the accompanying drawings. The described embodiments should not be considered as limiting the present application, and all other embodiments obtained by a person of ordinary skill in the art without creative effort shall fall within the protection scope of the present application.
In the following description, reference is made to "some embodiments", which describe a subset of all possible embodiments. It should be understood that "some embodiments" may be the same subset or different subsets of all possible embodiments, and may be combined with one another where no conflict arises.
In the following description, the terms "first/second/third" are used only to distinguish similar objects and do not denote a particular order; it should be understood that, where permissible, "first/second/third" may be interchanged in a specific order or sequence, so that the embodiments of the present application described herein can be practiced in an order other than that shown or described herein.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of the present application only and is not intended to be limiting of the application.
Before the embodiments of the present application are described in further detail, the terms and expressions involved in the embodiments of the present application are explained; the following interpretations apply to them.
1) Intelligent question-answering system: in a question-and-answer mode, it precisely locates the knowledge that users need and, through interaction with the users, provides them with personalized information services.
2) Speech synthesis and speech recognition: two key technologies necessary for realizing human-machine speech communication and building a spoken-language system with listening and speaking capabilities. Speech recognition refers to the technology by which a machine converts speech signals into corresponding text or commands through recognition and understanding; speech synthesis refers to the technology of generating artificial speech by mechanical or electronic means.
3) Face recognition: a biometric technology for identifying a person based on facial feature information. It generally refers to a series of related technologies that use a camera to acquire an image or video stream containing a face, automatically detect and track the face in the image, and then recognize the detected face.
Based on the above explanations of the terms involved in the embodiments of the present application, the dialogue system provided in the embodiments of the present application is described first. Referring to fig. 1, fig. 1 is a schematic diagram of the network architecture of the dialogue system, which includes at least one terminal 100, a server 200, and a network 300 (fig. 1 shows one terminal 100 as an example). The terminal 100 is connected to the server 200 through the network 300.
In some embodiments, the terminal 100 may be, but is not limited to, a smart phone, a vehicle-mounted terminal, a laptop computer, a tablet computer, a desktop computer, a dedicated messaging device, a portable gaming device, a smart speaker, a smart watch, and the like. The server 200 may be an independent physical server, a server cluster or distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, Content Delivery Network (CDN), big data, and artificial intelligence platforms. The network 300 may be a wide area network, a local area network, or a combination of the two. The terminal 100 and the server 200 may be connected directly or indirectly through wired or wireless communication, which is not limited in the embodiments of the present application.
The terminal 100 is configured to obtain a user identifier, receive a dialogue question input by the user, generate a dialogue request according to the dialogue question and the user identifier, and send the dialogue request to the server 200.
The server 200 is configured to: acquire the dialogue request, carrying the dialogue question and the user identifier, sent by the terminal; screen a target knowledge sub-base from a pre-established dialogue knowledge base according to the user identifier and the dialogue question, where the target knowledge sub-base is used for storing reference knowledge of a target object in dialogue with the user; acquire, in the target knowledge sub-base, a dialogue video according to the dialogue question, where the dialogue video is a dynamic picture of the target object answering the dialogue question; and send the dialogue video to the terminal 100.
The terminal 100 is further configured to output the dialogue video. The user watches the dialogue video and realizes, through human-computer interaction, a video dialogue with the target object, meeting the user's personalized emotional needs.
Referring to fig. 2, fig. 2 is a schematic diagram of the composition of an electronic device according to an embodiment of the present application. In practical applications, the electronic device 10 may be implemented as the terminal 100 or the server 200 in fig. 1; the electronic device implementing the dialogue method of the embodiments of the present application is described here taking the electronic device 10 as the server 200 shown in fig. 1 as an example. The electronic device 10 shown in fig. 2 includes: at least one processor 110, a memory 150, at least one network interface 120, and a user interface 130. The various components in the electronic device 10 are coupled together by a bus system 140. It will be appreciated that the bus system 140 is used to enable communication among these components. In addition to a data bus, the bus system 140 includes a power bus, a control bus, and a status signal bus. For clarity of illustration, however, the various buses are all labeled as bus system 140 in fig. 2.
The processor 110 may be an integrated circuit chip having signal processing capabilities, such as a general-purpose processor, a Digital Signal Processor (DSP), another programmable logic device, a discrete gate or transistor logic device, or discrete hardware components, where the general-purpose processor may be a microprocessor or any conventional processor.
The user interface 130 includes one or more output devices 131, including one or more speakers and/or one or more visual display screens, that enable the presentation of media content. The user interface 130 also includes one or more input devices 132 including user interface components that facilitate user input, such as a keyboard, mouse, microphone, touch screen display, camera, other input buttons and controls.
The memory 150 may be removable, non-removable, or a combination thereof. Exemplary hardware devices include solid state memory, hard disk drives, optical disk drives, and the like. Memory 150 optionally includes one or more storage devices physically located remotely from processor 110.
The memory 150 includes volatile memory or nonvolatile memory, and may include both volatile and nonvolatile memory. The nonvolatile memory may be a Read Only Memory (ROM), and the volatile memory may be a Random Access Memory (RAM). The memory 150 described in embodiments herein is intended to comprise any suitable type of memory.
In some embodiments, memory 150 is capable of storing data to support various operations, examples of which include programs, modules, and data structures, or subsets or supersets thereof, as exemplified below.
an operating system 151, including system programs for handling various basic system services and performing hardware-related tasks, such as a framework layer, a core library layer, and a driver layer;
a network communication module 152 for reaching other computing devices via one or more (wired or wireless) network interfaces 120, exemplary network interfaces 120 including: Bluetooth, Wireless Fidelity (WiFi), Universal Serial Bus (USB), etc.;
a presentation module 153 for enabling presentation of information (e.g., user interfaces for operating peripherals and displaying content and information) via one or more output devices 131 (e.g., display screens, speakers, etc.) associated with the user interface 130;
an input processing module 154 for detecting one or more user inputs or interactions from one of the one or more input devices 132 and translating the detected inputs or interactions.
In some embodiments, the dialogue apparatus provided by the embodiments of the present application may be implemented in software. Fig. 2 shows the dialogue apparatus 155 stored in the memory 150, which may be software in the form of programs, plug-ins, and the like, and includes the following software modules: a first acquisition module 1551, a screening module 1552, a second acquisition module 1553, and a sending module 1554. These modules are logical, and thus can be arbitrarily combined or further split according to the functions implemented. The functions of the respective modules are explained below.
In other embodiments, the dialogue apparatus provided in the embodiments of the present application may be implemented in hardware. For example, it may be a processor in the form of a hardware decoding processor programmed to execute the dialogue method provided in the embodiments of the present application; the processor in the form of a hardware decoding processor may be one or more Application-Specific Integrated Circuits (ASICs), DSPs, Programmable Logic Devices (PLDs), Complex Programmable Logic Devices (CPLDs), Field-Programmable Gate Arrays (FPGAs), or other electronic components.
The dialogue method provided in the embodiments of the present application is described below. In some embodiments, the dialogue method may be implemented by the terminal or the server of the network architecture shown in fig. 1 alone, or by the terminal and the server in cooperation. Taking the server as an example, refer to fig. 3, which is a schematic flow chart of an implementation of the dialogue method provided in an embodiment of the present application; the method is described with reference to the steps shown in fig. 3.
Step S301, a dialogue request sent by the terminal is acquired.
In real life, a user may be unable to communicate with a specific person they want to talk to, for various reasons such as distance, time, or death.
In practical applications, the user accesses the dialogue system through a terminal and inputs a dialogue question on it; the terminal carries the dialogue question in a dialogue request and sends it to the server. To realize a targeted, personalized dialogue, the terminal also obtains the user identifier and carries it in the dialogue request sent to the server.
Step S302, a target knowledge sub-base is screened out from a pre-established dialogue knowledge base according to the user identifier and the dialogue question.
After acquiring the dialogue request sent by the terminal, the server parses it to obtain the dialogue question and the user identifier that it carries, and then screens out the target knowledge sub-base from the pre-established dialogue knowledge base according to them.
The dialogue questions of different users generally differ, and even when the questions are identical, the target objects of the dialogues differ. For example, if user A and user B both say "Dad, you've been working hard", the dialogue questions are the same, but user A's "dad" and user B's "dad" are different people, and the responses obtained may differ. Based on this scenario, the target knowledge sub-base corresponding to the target object of the current user's dialogue is screened out from the dialogue knowledge base according to the user identifier and the dialogue question.
In actual implementation, a dialogue knowledge sub-base corresponding to the user identifier may first be screened from the pre-established dialogue knowledge base; this sub-base contains knowledge sub-bases for each person relationship of the user's dialogues (such as parents, spouse, children, and friends). The person relationship of the current dialogue is then determined from the dialogue question. For example, if the dialogue question input by user A is "Dad, you've been working hard", the target object can be determined to be user A's dad; the knowledge sub-base whose person relationship is "dad" is screened out from the knowledge sub-bases of the person relationships, giving the target knowledge sub-base, which stores the reference knowledge of user A's dad.
Alternatively, the person relationship of the current dialogue may first be determined from the dialogue question, and the dialogue knowledge sub-bases of that person relationship screened out from the pre-established dialogue knowledge base; the target knowledge sub-base is then determined according to the user identifier. For "Dad, you've been working hard", the knowledge sub-bases of all users' dads are first screened from the dialogue knowledge base, and then user A's is screened from among them, giving the target knowledge sub-base. The reference knowledge it stores is the same as that obtained by the first method: the reference knowledge of user A's dad.
Step S303, in the target knowledge sub-base, the dialogue video is acquired according to the dialogue question.
In the embodiments of the present application, the target knowledge sub-base may store questions and responses of the target object, which may be information in video, audio, text, image, and other forms. The dialogue video, a dynamic picture of the target object answering the dialogue question, is determined in the target knowledge sub-base of the target object according to the dialogue question. The dialogue video may be an original video in the target knowledge sub-base, or a video synthesized from an image of the target object together with audio or text.
Step S304, the dialogue video is sent to the terminal so that the dialogue video is output on the terminal.
The server sends the dialogue video to the terminal, and the terminal outputs it on its display interface. The user watches the video, and the target object in the video answers the user's dialogue question as if the two were in a face-to-face conversation. In the embodiments of the present application, the dialogue video makes the target object vivid and lifelike, so the user can have a real-time video dialogue with the person they want to talk to, meeting the user's personalized emotional needs.
In the dialogue method provided by the embodiments of the present application, the server acquires a dialogue request sent by the terminal, the dialogue request carrying a dialogue question and a user identifier; screens out, from a pre-established dialogue knowledge base and according to the user identifier and the dialogue question, a target knowledge sub-base used for storing reference knowledge of a target object in dialogue with the user; acquires, in the target knowledge sub-base and according to the dialogue question, a dialogue video that is a dynamic picture of the target object answering the dialogue question; and sends the dialogue video to the terminal so that it is output there. In this way, a video dialogue between the user and the target object is realized, meeting the user's personalized emotional needs.
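For illustration only, the server-side flow of steps S301 to S304 can be condensed into a minimal Python sketch. The embodiments do not specify any data layout or identifiers, so every name and structure below (handle_dialogue_request, the nested-dict knowledge base, and so on) is a hypothetical assumption made for readability:

```python
# Minimal sketch of steps S301-S304, assuming a knowledge base laid out
# as nested dicts keyed by user identifier and then by person relationship.
# All names here are illustrative assumptions, not from the embodiments.

def infer_relationship(question: str) -> str:
    # Part of step S302: derive the person relationship from the question,
    # e.g. a question addressed to "Dad" targets the user's father.
    return "dad" if question.lower().startswith("dad") else "friend"

def handle_dialogue_request(request: dict, knowledge_base: dict) -> dict:
    # Step S301: the dialogue request carries a question and a user identifier.
    question, user_id = request["question"], request["user_id"]

    # Step S302: screen out the target knowledge sub-base.
    target_sub_base = knowledge_base[user_id][infer_relationship(question)]

    # Step S303: look up the dynamic picture answering the question
    # (a real system would match by semantic similarity, sketched later).
    video = target_sub_base.get(question)

    # Step S304: return the dialogue video for output on the terminal.
    return {"dialogue_video": video}

kb = {"user_a": {"dad": {"Dad, you've been working hard": "reply_0001.mp4"}}}
print(handle_dialogue_request(
    {"question": "Dad, you've been working hard", "user_id": "user_a"}, kb))
```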
In one implementation, step S302, "screening out the target knowledge sub-base from the pre-established dialogue knowledge base according to the user identifier and the dialogue question", may be implemented as the following steps:
Step S302a1, a dialogue knowledge sub-base corresponding to the user identifier is screened from the pre-established dialogue knowledge base.
Step S302a2, the person relationship between the user and the target object is determined from the dialogue question.
The person relationship may refer to identity or social attributes relative to the user's identity, and a unique target object can be determined from it. In practical applications, the target object may be a dad, a mom, a friend of a particular identity, a son, a daughter, a wife, a husband, and so on.
Step S302a3, the target knowledge sub-base corresponding to the target object is screened out from the dialogue knowledge sub-base according to the person relationship.
That is, a dialogue knowledge sub-base corresponding to the user identifier is screened from the pre-established dialogue knowledge base; it contains knowledge sub-bases for each person relationship of the user's dialogues (such as parents, spouse, children, and friends). The person relationship of the current dialogue is determined from the dialogue question: if the dialogue question input by user A is "Dad, you've been working hard", the target object can be determined to be user A's dad, and the knowledge sub-base whose person relationship is "dad" is screened out from the knowledge sub-bases of the person relationships, giving the target knowledge sub-base, which stores the reference knowledge of user A's dad.
In another implementation, step S302 may also be implemented as the following steps:
Step S302b1, the person relationship between the user and the target object is determined from the dialogue question.
Step S302b2, the dialogue knowledge sub-bases corresponding to the person relationship are screened from the pre-established dialogue knowledge base.
Step S302b3, the target knowledge sub-base corresponding to the target object is screened out from those dialogue knowledge sub-bases according to the user identifier.
That is, the person relationship of the current dialogue is determined from the dialogue question, the dialogue knowledge sub-bases of that person relationship are screened out from the pre-established dialogue knowledge base, and the target knowledge sub-base is then determined according to the user identifier. For "Dad, you've been working hard", the knowledge sub-bases of all users' dads are first screened from the dialogue knowledge base, and then user A's is screened from among them, giving the target knowledge sub-base. The reference knowledge it stores is the same as that obtained by the first method: the reference knowledge of user A's dad.
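The equivalence of the two screening orders can be checked with a small sketch, again under an assumed layout (a dict keyed by user identifier and person relationship); neither the layout nor the names come from the embodiments:

```python
# Sketch of the two screening orders of steps S302a1-S302a3 and
# S302b1-S302b3; both select the same target knowledge sub-base.
knowledge_base = {
    ("user_a", "dad"): {"owner": "user A's dad"},
    ("user_a", "mom"): {"owner": "user A's mom"},
    ("user_b", "dad"): {"owner": "user B's dad"},
}

def by_user_then_relationship(user_id: str, relationship: str) -> dict:
    # First keep the user's sub-bases, then pick the person relationship.
    user_sub_bases = {rel: sb for (uid, rel), sb in knowledge_base.items()
                      if uid == user_id}
    return user_sub_bases[relationship]

def by_relationship_then_user(user_id: str, relationship: str) -> dict:
    # First keep all sub-bases of the relationship, then pick the user.
    rel_sub_bases = {uid: sb for (uid, rel), sb in knowledge_base.items()
                     if rel == relationship}
    return rel_sub_bases[user_id]

assert (by_user_then_relationship("user_a", "dad")
        is by_relationship_then_user("user_a", "dad"))
```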
In some embodiments, step S303, "acquiring the dialogue video according to the dialogue question in the target knowledge sub-base", can be implemented as the following steps:
Step S3031, a target reference question corresponding to the dialogue question is acquired from the reference question set included in the target knowledge sub-base.
Since the dialogue question may be input as text, speech, or by other means, different expressions may convey the same or similar meaning. Storing every possible dialogue question in the knowledge base is obviously impractical; in practice, questions with the same or similar meanings can be merged according to their semantics and mapped to the same answer, which greatly saves storage space.
When acquiring the target reference question, semantic analysis may be performed on the dialogue question to obtain an analysis result; the similarity between the analysis result and each reference question in the reference question set of the target knowledge sub-base is computed, and the reference question with the highest similarity is determined as the target reference question.
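As a minimal sketch of this selection step: the embodiments leave the semantic analysis open, so plain token overlap (Jaccard similarity) stands in for it below purely as a placeholder; any semantic similarity measure could be substituted:

```python
# Sketch of step S3031: pick the reference question most similar to the
# user's dialogue question. Jaccard token overlap is only a stand-in for
# the unspecified semantic analysis.

def jaccard(a: str, b: str) -> float:
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def select_target_reference_question(question: str, references: list) -> str:
    # The reference question with the highest similarity wins.
    return max(references, key=lambda ref: jaccard(question, ref))

refs = ["Dad, I am getting married next month", "Dad, play a game with me"]
print(select_target_reference_question("Dad, I will get married next month", refs))
```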
Step S3032, the dialogue video is acquired according to the target reference question.
After the target reference question is determined, reply material is searched for, according to the target reference question, in the reference reply set included in the target knowledge sub-base, and the dialogue video is determined from the reply material. The reply material here includes at least one of video, audio, and text.
In some embodiments, when retrieval of the dialogue video from the reference reply set fails, the server may also search an online corpus or chat database for answers related to the target reference question.
In one implementation, step S3032 may be implemented by the following steps:
Step S0321, a reference video corresponding to the target reference question is searched for in the reference reply set included in the target knowledge sub-base, giving a first search result.
When acquiring reply material, video takes priority over audio, and audio over text. A reference video is therefore searched for first in the reference reply set, giving the first search result.
Step S0322, it is determined whether the first search result is non-empty.
When the first search result is non-empty, the search succeeded and at least one reference video corresponds to the target reference question; the flow proceeds to step S0323. When the first search result is empty, the search failed and no reference video for the question exists in the target knowledge sub-base; the flow proceeds to step S0324, and a video is synthesized based on an image of the target object.
Step S0323, the reference video included in the first search result is determined to be the dialogue video.
The reference video is itself the dialogue video of the target object answering the user's dialogue question, and no further processing is needed. After the dialogue video is obtained, the flow proceeds to step S304.
Step S0324, an image of the target object is searched for in the target knowledge sub-base.
When an image of the target object is found in the target knowledge sub-base, the user uploaded an image of the dialogue object to the server in advance and the server stored it in the target knowledge sub-base; the flow proceeds to step S0325. When no image of the target object is found, the user did not upload one, or the server failed to store it, so the target knowledge sub-base holds no image of the target object; the flow proceeds to step S03213.
Step S0325, a reference audio corresponding to the target reference question is searched for in the reference reply set, giving a second search result.
That is, if no reference video was found in the reference reply set, the search continues with the reference audio corresponding to the target reference question, giving the second search result.
Step S0326, it is determined whether the second search result is non-empty.
When the second search result is non-empty, the search succeeded and at least one reference audio corresponds to the target reference question; the flow proceeds to step S0327, and a dialogue video is synthesized from the image of the target object and the reference audio. When the second search result is empty, the search failed and no reference audio for the question exists in the target knowledge sub-base; the flow proceeds to step S0329 to search for a reference text.
Step S0327, the reference audio included in the second search result is determined to be the dialogue audio.
The reference audio is the dialogue audio of the target object answering the user's dialogue question. So that the user can both hear the relative's voice and see the relative, in the embodiments of the present application the dialogue audio and the image of the target object are fused into a dialogue video, letting the user and the target object chat as if face to face and meeting the user's emotional needs.
Step S0328, the image of the target object and the dialogue audio are fused to obtain the dialogue video.
In the fusion, the static image and the dialogue audio can be fused directly, and the static image can be animated by combining information such as the semantics and emotion of the dialogue audio, making the dialogue video more vivid and lifelike. In one implementation, the server can estimate the target object's expression from the dialogue question and the dialogue audio, obtaining first expression information; then adjust the facial features of the target object in the image according to the first expression information and the dialogue audio, obtaining a first ordered dynamic image; and synthesize the first ordered dynamic image with the dialogue audio to obtain the dialogue video. For example, if the dialogue audio contains words such as "haha" or "happy", or its tone is relaxed and cheerful, the target object is answering the dialogue question happily; the first expression information is determined to be happy, the corners of the mouth in the image can be adjusted upward, the mouth opened and closed in step with the audio content, and a dialogue video of the target object happily speaking the dialogue audio is synthesized. After the dialogue video is obtained, the flow proceeds to step S304.
Step S0329, a reference text corresponding to the target reference question is searched for in the reference reply set, giving a third search result.
That is, if no reference audio was found in the reference reply set, the search continues with the reference text corresponding to the target reference question, giving the third search result.
Step S03210, it is determined whether the third search result is non-empty.
When the third search result is non-empty, the search succeeded and at least one reference text corresponds to the target reference question; the flow proceeds to step S03211, and a dialogue video is synthesized from the image of the target object and the reference text. When the third search result is empty, the search failed and no reference text for the question exists in the target knowledge sub-base; the flow proceeds to step S03213, and it is determined that acquisition of the dialogue video has failed.
Step S03211, the reference text included in the third search result is determined to be the dialogue text.
The reference text is the dialogue text of the target object answering the user's dialogue question. If the user merely read the dialogue text, they could not truly feel the emotion attached to the target object. In the embodiments of the present application, the dialogue text and the image of the target object are therefore fused into a dialogue video, realizing a face-to-face video chat between the user and the target object and meeting the emotional needs.
Step S03212, the image of the target object and the dialogue text are fused to obtain the dialogue video.
In the fusion, the static image and the dialogue text can be fused directly; the dialogue text can also be synthesized into simulated dialogue audio matching the target object's timbre and tone, and the static image animated by combining information such as semantics and emotion, making the dialogue video more vivid and lifelike. In one implementation, the server can estimate the target object's expression from the dialogue question and the dialogue text, obtaining second expression information; acquire audio information of the target object and generate simulated dialogue audio from the audio information and the dialogue text; adjust the facial features of the target object in the image according to the second expression information and the simulated dialogue audio, obtaining a second ordered dynamic image; and synthesize the second ordered dynamic image with the simulated dialogue audio to obtain the dialogue video. For example, if the dialogue text contains words indicating sadness, such as sighs or sobbing, the target object is answering the dialogue question sadly; the second expression information is determined to be sad, the corners of the mouth in the image can be adjusted downward, the mouth opened and closed in step with the audio content, and a dialogue video of the target object sadly speaking the dialogue audio is synthesized. After the dialogue video is obtained, the flow proceeds to step S304.
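As a rough structural sketch only, the two fusion paths (steps S0328 and S03212) can be outlined as below. The expression estimation, speech synthesis, and video synthesis are placeholders: the embodiments disclose no concrete models, and the cue words, function names, and return format are all assumptions:

```python
# Structural sketch of the fusion in steps S0328 and S03212; every helper
# here is a placeholder for an unspecified media or ML component.

HAPPY_CUES, SAD_CUES = {"haha", "happy"}, {"sad", "sobbing"}

def estimate_expression(question: str, content: str) -> str:
    # First/second expression information, estimated from cue words.
    words = set(content.lower().split())
    if words & HAPPY_CUES:
        return "happy"  # e.g. adjust the mouth corners upward
    if words & SAD_CUES:
        return "sad"    # e.g. adjust the mouth corners downward
    return "neutral"

def fuse_image_and_audio(image: bytes, audio: bytes, expression: str) -> dict:
    # Adjust facial features per the expression, open and close the mouth
    # in step with the audio, then mux the ordered dynamic image and audio.
    return {"frames": image, "track": audio, "expression": expression}

def synthesize_speech(text: str, voice_profile: bytes) -> bytes:
    # Simulated dialogue audio generated from the target object's voice data.
    return text.encode()  # placeholder for a real TTS engine

def video_from_text(image: bytes, question: str, text: str, voice: bytes) -> dict:
    # Text path: estimate the expression, synthesize speech, then fuse.
    expression = estimate_expression(question, text)
    return fuse_image_and_audio(image, synthesize_speech(text, voice), expression)
```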
Step S03213, it is determined that acquisition of the dialogue video has failed.
When neither a reference video, reference audio, nor reference text corresponding to the dialogue question is found in the reference reply set, it is determined that the server has failed to acquire the dialogue video, and prompt information is sent to the terminal to prompt the user to upload reference knowledge of the target object. The server may also generate a dialogue video of a non-target object from other data, to meet the user's emotional needs as far as possible.
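The video-over-audio-over-text fallback of steps S0321 to S03213 can then be summarized in one function, reusing the placeholder helpers from the fusion sketch above; the sub-base layout (keys "answers", "image", "voice") is an assumption, not part of the embodiments:

```python
# Sketch of steps S0321-S03213: try video, then audio, then text, and
# report failure otherwise. Relies on estimate_expression,
# fuse_image_and_audio, and video_from_text from the preceding sketch.

def acquire_dialogue_video(sub_base: dict, ref_question: str, user_question: str):
    answers = sub_base.get("answers", {}).get(ref_question, {})

    video = answers.get("video")   # step S0321: first search result
    if video is not None:
        return video               # step S0323: the reference video is used as-is

    image = sub_base.get("image")  # step S0324
    if image is None:
        return None                # step S03213: acquisition fails

    audio = answers.get("audio")   # step S0325: second search result
    if audio is not None:          # steps S0327-S0328
        expression = estimate_expression(user_question, "")  # audio cues omitted here
        return fuse_image_and_audio(image, audio, expression)

    text = answers.get("text")     # step S0329: third search result
    if text is not None:           # steps S03211-S03212
        return video_from_text(image, user_question, text, sub_base.get("voice", b""))

    return None                    # step S03213: prompt the user to upload knowledge
```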
On the basis of the embodiment shown in fig. 3, an embodiment of the present application further provides a dialogue method. Fig. 4 is a schematic flow chart of another implementation of the dialogue method; as shown in fig. 4, the method includes the following steps:
Step S401, a dialogue request sent by the terminal is acquired.
The dialogue request carries a dialogue question and a user identifier, the user identifier being the identifier of the user logged into the dialogue system.
In the embodiments of the present application, steps S401 to S403 and step S405 correspond one-to-one to steps S301 to S304 in the embodiment shown in fig. 3; for their implementation, refer to the detailed descriptions of steps S301 to S304.
Step S402, a target knowledge sub-base is screened out from the pre-established dialogue knowledge base according to the user identifier and the dialogue question.
The target knowledge sub-base is used for storing reference knowledge of the target object in dialogue with the user.
Step S403, the dialogue video is acquired in the target knowledge sub-base according to the dialogue question.
The dialogue video is a dynamic picture of the target object answering the dialogue question.
Step S404, it is determined whether the dialogue video has been acquired successfully.
When the administrator or the user has uploaded in advance, to the target knowledge sub-base, an image of the target object and video, audio, or text of the target object concerning the dialogue question, the dialogue video can be acquired successfully from the target knowledge sub-base, and the flow proceeds to step S405; when the dialogue video cannot be acquired successfully, the flow proceeds to step S406.
Step S405, the dialogue video is sent to the terminal so that the dialogue video is output on the terminal.
Step S406, a dialogue response is generated.
The dialogue response carries prompt information, which is used to indicate that the search for the reference knowledge of the target object has failed. Specifically, when the search for the image of the target object fails, the user is prompted "please upload an image of the target object"; when the search for the reference text corresponding to the target reference question fails, the user may be prompted "please upload reference knowledge corresponding to the target reference question, such as a reference video, reference audio, or reference text", and so on.
In other embodiments, when the search for the reference text corresponding to the target reference question fails, a response related to the target reference question may be retrieved from an online corpus or chat database, realizing an automatic response without requiring the user to input text, audio, video, or other information.
Step S407, the dialogue response is sent to the terminal so that the user uploads the reference knowledge of the target object to the server according to the prompt information.
In the dialogue method provided by this embodiment, when the server can successfully acquire the dialogue video from the target knowledge sub-base, it sends the dialogue video to the terminal for output; when it cannot, it sends a dialogue response to the terminal so that the user can upload reference knowledge of the target object according to the prompt information the response carries, updating the dialogue knowledge base to flexibly meet more of the user's personalized emotional needs.
Based on the foregoing embodiments, a dialogue method is further provided in the embodiments of the present application. Fig. 5 is a schematic flow chart of a further implementation of the dialogue method, applied to the network architecture shown in fig. 1; as shown in fig. 5, the method includes the following steps:
Step S501, the terminal acquires the user identifier and receives the dialogue question input by the user.
The user identifier may be a unique identity generated by the server when the user registers with the dialogue system, or unique information filled in at registration that can determine the user's identity, such as a mobile phone number or an identity card number. The user can open an application (App) installed on the terminal, log in to an account to enter the dialogue system, select the person they want to talk to (e.g., "dad"), enter the dialogue interface, and input a dialogue question there. The dialogue question can be input by voice (the corresponding dialogue question being audio information), by text (text information), or by other means, such as video (video information).
Step S502, the terminal generates a dialogue request according to the dialogue question and the user identifier.
The dialogue request carries the user-input dialogue question and the user identifier.
Step S503, the terminal sends the dialogue request to the server, as sketched below.
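A terminal-side sketch of steps S502 and S503 follows, assuming a plain HTTP/JSON transport; the embodiments fix no wire format, and the URL, field names, and headers are illustrative only:

```python
# Sketch of steps S502-S503: build the dialogue request carrying the
# question and user identifier, then send it to the (assumed) server URL.
import json
import urllib.request

def send_dialogue_request(server_url: str, user_id: str, question: str) -> bytes:
    # Step S502: the dialogue request carries the question and user identifier.
    payload = json.dumps({"user_id": user_id, "question": question}).encode("utf-8")

    # Step S503: send the dialogue request to the server.
    req = urllib.request.Request(
        server_url, data=payload, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return resp.read()  # e.g. the dialogue video output in step S507
```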
Step S504, the server screens out a target knowledge sub-base from the pre-established dialogue knowledge base according to the user identifier and the dialogue question.
The target knowledge sub-base is used for storing reference knowledge of the target object in dialogue with the user, the reference knowledge including at least one of reference video, reference audio, and reference text. In one implementation, this step may be implemented as: screening a dialogue knowledge sub-base corresponding to the user identifier from the pre-established dialogue knowledge base; determining the person relationship between the user and the target object from the dialogue question; and screening out, according to the person relationship, the target knowledge sub-base corresponding to the target object from the dialogue knowledge sub-base.
Step S505, the server acquires the dialogue video in the target knowledge sub-base according to the dialogue question.
The dialogue video is a dynamic picture of the target object answering the dialogue question.
The server acquires a target reference question corresponding to the dialogue question from the reference question set included in the target knowledge sub-base, and then acquires the dialogue video according to the target reference question.
When acquiring the dialogue video, the server first searches, in the reference reply set included in the target knowledge sub-base, for a reference video corresponding to the target reference question, giving a first search result; when the first search result is non-empty, the reference video it includes is determined to be the dialogue video. When the first search result is empty, an image of the target object is searched for in the target knowledge sub-base, and a reference audio corresponding to the target reference question is searched for in the reference reply set, giving a second search result; when the second search result is non-empty, the reference audio it includes is determined to be the dialogue audio, and the image of the target object and the dialogue audio are fused to obtain the dialogue video. When the second search result is empty, a reference text corresponding to the target reference question is searched for in the reference reply set, giving a third search result; when the third search result is non-empty, the reference text it includes is determined to be the dialogue text, and the image of the target object and the dialogue text are fused to obtain the dialogue video.
In some embodiments, when the third search result is empty, a dialogue response carrying prompt information is generated, the prompt information indicating that the search for the reference knowledge of the target object has failed; the dialogue response is sent to the terminal so that the user uploads the reference knowledge of the target object to the server according to the prompt information.
Step S506, the server sends the dialogue video to the terminal.
Step S507, the terminal outputs the dialogue video.
In the dialogue method provided by this embodiment, the terminal obtains the user identifier, receives the dialogue question input by the user, generates a dialogue request from them, and sends it to the server. The server screens out, from the pre-established dialogue knowledge base and according to the user identifier and the dialogue question, a target knowledge sub-base storing reference knowledge of the target object in dialogue with the user; acquires, in the target knowledge sub-base and according to the dialogue question, a dialogue video that is a dynamic picture of the target object answering the dialogue question; and sends the dialogue video to the terminal. The terminal receives and outputs the dialogue video, and the user, by watching it, realizes a video dialogue with the target object, meeting the user's personalized emotional needs.
Next, an exemplary application of the embodiment of the present application in a practical application scenario will be described.
With the development of the fields of artificial intelligence, intelligent hardware and the like, a man-machine interaction mode based on voice recognition is more and more approved by users. When people need to feel emotions, particularly conversation with relatives, the people cannot converse with some people who want to speak due to various reasons. Parents die when children are still young, for example, children lose the opportunity to talk with parents; parents are not at the side of the children outside, and the emotion of the children cannot be released in time; one part of the couple dies, and the other part cannot ease thought. In view of the emotional requirements of human beings in such scenes, the embodiment of the application provides a solution mainly aiming at emotional video conversation among the relatives. The embodiment of the application integrates a plurality of intelligent modules, including an intelligent question answering engine (supporting single-round or multi-round conversation), voice synthesis, voice recognition, face recognition and the like.
The intelligent question answering engine: and performing semantic analysis on the questioning content of the user, and searching a correct answer corresponding to the question from a knowledge base. Single-theory conversations supporting a question-and-answer, and multiple rounds of interactive conversations surrounding a certain topic.
Speech recognition: converts the user's speech into corresponding text, which facilitates storing the conversation content, allows the intelligent dialogue engine to perform semantic analysis, and enables intelligent quality inspection of quality inspection items.
Speech synthesis: converts the text content to be conveyed to the user into speech and plays it for the user to listen to, making the conversation process more natural and smooth, providing an immersive conversation experience, further simulating a real customer service scene, and improving training quality.
Face recognition: identifies the user's identity from the face picture captured by the camera and returns it to the system, which uses it to decide the processing logic of the subsequent conversation.
The intelligent video conversation service of the embodiments of the present application can help resolve the regret of people who, for various reasons, are unable to communicate with someone, and can effectively comfort and encourage the interlocutor.
Fig. 6 is a schematic diagram of the collection of emotional question sets according to an embodiment of the present application. As shown in Fig. 6, the emotional question set 60 may be divided into:
a father-to-child question set 601, a child-to-father question set (not shown), a mother-to-child question set 602, a child-to-mother question set (not shown), a husband-to-wife question set 603, a wife-to-husband question set (not shown), and so on.
For example, a question set in which the child asks and the father answers may include the following questions:
father, I talk about love.
Father, I want to get married in the next month.
Father, I were criticized by a teacher, which was very difficult.
Dad, play a game with them.
……
The emotional question sets 60 may be collected in two ways: question sets preset by the product administrator, and question sets edited and uploaded by the users themselves.
Fig. 7 is a schematic diagram of a framework of a dialog knowledge base provided in an embodiment of the present application. Once the emotional question sets exist, the corresponding knowledge base 70 needs to be edited for them. Each knowledge entry comprises a question and an answer. Each question comprises a standard question, similar questions, and a recording corresponding to the question (optional; it mainly addresses dialects that are difficult to recognize and facilitates training dialect models). The answers are of three types: video, recording, and text. Video is preferred, because seeing a video of the relative delivers the strongest impact and emotion, followed by recording and then text. The knowledge base 70 has two management channels: one is the customized terminal device 71, dedicated to managing the knowledge, holding conversations, and the like, which makes the experience more realistic and moving; the other is an App or H5 page on the mobile phone 72, through which the knowledge base can be managed and conversations held after logging in to an account, though the experience is not as good as on the customized terminal.
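Purely as an illustrative sketch, the knowledge structure described above could be modeled as follows; the type and field names are assumptions made for this example, not part of the disclosed scheme:

    from dataclasses import dataclass, field
    from typing import List, Optional

    @dataclass
    class Question:
        standard: str                      # the standard question
        similar: List[str] = field(default_factory=list)  # similar questions
        recording: Optional[bytes] = None  # optional recording (for dialects)

    @dataclass
    class Answer:
        # Answer types in order of preference: video > recording > text.
        video: Optional[bytes] = None
        recording: Optional[bytes] = None
        text: Optional[str] = None

    @dataclass
    class KnowledgeEntry:
        question: Question
        answer: Answer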
Fig. 8 is a schematic diagram of an emotional dialogue processing flow provided in an embodiment of the present application. With the knowledge base 70 in place, the intelligent question-answering engine 80 coordinates the cross-spatiotemporal, immersive emotional dialogue between relatives. First, the customized terminal 71 or the App on the mobile phone 72 obtains the identity of the interlocutor through face recognition, or the user selects the identity of the interlocutor, such as father and son, father and daughter, and the like. The user then speaks; the speech is converted to text by the speech recognition engine 81, and the intelligent question-answering engine 80 retrieves the question in the knowledge base 70, returning a default answer (configurable) if no answer is found and the answer otherwise. If the answer type is video, the video is played directly on the customized terminal 71 or the mobile phone 72; if the answer type is recording, a photo is displayed while the recording is played; if the answer type is text, the text is converted to speech by the speech synthesis engine 82 and then combined with the photo to synthesize a video.
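The processing flow of Fig. 8 may be sketched as follows. The function names (recognize_face, select_identity, speech_to_text, retrieve_answer, synthesize_speech, synthesize_video, and the play_* helpers) are hypothetical stand-ins for the engines 80 to 82 and the playback steps described above, not a disclosed API:

    def emotional_dialogue_turn(camera_frame, utterance, knowledge_base):
        # Obtain the interlocutor's identity by face recognition,
        # or let the user select it manually.
        identity = recognize_face(camera_frame) or select_identity()
        # Convert the user's speech to text (speech recognition engine 81).
        question = speech_to_text(utterance)
        # Retrieve the answer (intelligent question-answering engine 80).
        answer = retrieve_answer(knowledge_base, identity, question)
        if answer is None:
            return play_default_answer()  # configurable default reply
        if answer.video is not None:
            return play_video(answer.video)
        if answer.recording is not None:
            return play_photo_with_audio(identity.photo, answer.recording)
        # Text answer: synthesize speech (speech synthesis engine 82),
        # then synthesize a video with the photo.
        audio = synthesize_speech(answer.text)
        return play_video(synthesize_video(identity.photo, audio))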
In the embodiments of the present application, an intelligent video emotional conversation system is implemented by means of artificial intelligence technologies such as machine learning, deep learning, transfer learning, and natural language processing. The system integrates intelligent modules such as an intelligent question-answering engine, speech recognition, speech synthesis, and face recognition. The solution can intelligently identify the user's identity and select an adapted conversation mode; intelligently search for a suitable answer to the user's question; and support multiple answer modes to cover various reply scenarios. It helps resolve the regret of people who cannot communicate with someone for various reasons, and effectively comforts and encourages them.
Continuing with the exemplary structure of the dialog device provided in the embodiments of the present application implemented as software modules, in some embodiments, as shown in Fig. 2, the dialog device 155 stored in the memory 150 is applied to a server, and the software modules in the dialog device 155 may include:
a first obtaining module 1551, configured to obtain a conversation request sent by a terminal, where the conversation request carries a conversation question and a user identifier;
a screening module 1552, configured to screen a target knowledge sub-base from a pre-established conversation knowledge base according to the user identifier and the conversation question, where the target knowledge sub-base is used to store reference knowledge of a target object having a conversation with the user;
a second obtaining module 1553, configured to obtain, in the target knowledge sub-base, a conversation video according to the conversation question, where the conversation video is a dynamic picture of the target object answering the conversation question;
a sending module 1554, configured to send the conversation video to the terminal, so as to output the conversation video on the terminal.
In some embodiments, the screening module 1552 comprises:
a first screening unit, configured to screen out a conversation knowledge sub-base corresponding to the user identifier from the pre-established conversation knowledge base;
a determining unit, configured to determine the character relationship between the user and the target object according to the conversation question;
and a second screening unit, configured to screen out the target knowledge sub-base corresponding to the target object from the conversation knowledge sub-base according to the character relationship.
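As an illustration of the two-stage screening, the following sketch assumes that the character relationship can be inferred from a term of address in the conversation question (for example, "Dad" implies a father-child relationship); the keyword table and function names are assumptions for this example only:

    # Hypothetical mapping from terms of address to character relationships.
    RELATION_KEYWORDS = {
        "dad": "father", "father": "father",
        "mom": "mother", "mother": "mother",
    }

    def screen_target_sub_base(conversation_knowledge_base, user_id, question):
        # First stage: the conversation knowledge sub-base of this user.
        user_sub_base = conversation_knowledge_base[user_id]
        # Second stage: infer the character relationship from the question,
        # then pick the target object's knowledge sub-base.
        lowered = question.lower()
        relation = next((rel for word, rel in RELATION_KEYWORDS.items()
                         if word in lowered), None)
        return user_sub_base[relation]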
In some embodiments, the second obtaining module 1553 comprises:
a first obtaining unit, configured to obtain a target reference question corresponding to the conversation question in a reference question set included in the target knowledge sub-base;
and a second obtaining unit, configured to obtain the conversation video according to the target reference question.
In some embodiments, the second obtaining unit is further configured to:
searching a reference video corresponding to the target reference question in a reference answer set included in the target knowledge sub-base to obtain a first search result;
and when the first search result is not empty, determining the reference video included in the first search result as the conversation video.
In some embodiments, the second obtaining unit is further configured to:
when the first search result is empty, searching the image of the target object in the target knowledge sub-base;
searching a reference audio corresponding to the target reference question in the reference answer set to obtain a second search result;
when the second search result is not empty, determining the reference audio included in the second search result as conversation audio;
and carrying out fusion processing on the image of the target object and the conversation audio to obtain a conversation video.
In some embodiments, the second obtaining unit is further configured to:
estimating the expression of the target object according to the conversation question and the conversation audio to obtain first expression information;
adjusting the facial features of the target object in the image according to the first expression information and the conversation audio to obtain a first ordered dynamic image;
and synthesizing the first ordered dynamic image and the dialogue audio to obtain a dialogue video.
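A sketch of this audio-driven fusion follows; estimate_expression, animate_face, and mux_video are hypothetical names standing in for the expression estimation, facial adjustment, and synthesis steps described above:

    def fuse_audio(image, question, dialogue_audio):
        # Estimate the target object's expression from the conversation
        # question and the dialogue audio (first expression information).
        expression = estimate_expression(question, dialogue_audio)
        # Adjust the facial features of the target object in the image
        # according to the expression and the audio, producing an ordered
        # sequence of frames (the first ordered dynamic image).
        frames = animate_face(image, expression, dialogue_audio)
        # Synthesize the frame sequence with the audio track.
        return mux_video(frames, dialogue_audio)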
In some embodiments, the second obtaining unit is further configured to:
when the second search result is empty, searching a reference text corresponding to the target reference question in the reference answer set to obtain a third search result;
when the third search result is not empty, determining a reference text included in the third search result as a dialog text;
and carrying out fusion processing on the image of the target object and the conversation text to obtain a conversation video.
In some embodiments, the second obtaining unit is further configured to:
estimating the expression of the target object according to the conversation question and the conversation text to obtain second expression information;
acquiring audio information of the target object, and generating a simulated dialogue audio according to the audio information and the dialogue text;
adjusting the facial features of the target object in the image according to the second expression information and the simulated dialogue audio to obtain a second ordered dynamic image;
and synthesizing the second ordered dynamic image and the simulated dialogue audio to obtain a dialogue video.
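The text path differs only in that the dialogue audio must first be generated in the target object's own voice; get_audio_information and clone_voice_tts are hypothetical stand-ins for this synthesis step, and the remaining helpers are the same as in the audio-path sketch above:

    def fuse_text(image, question, dialogue_text, sub_base):
        # Obtain audio information of the target object and generate
        # simulated dialogue audio in the target object's own voice.
        voice_info = get_audio_information(sub_base)
        simulated_audio = clone_voice_tts(dialogue_text, voice_info)
        # Estimate expression (second expression information), then reuse
        # the same animation and muxing steps as the audio path.
        expression = estimate_expression(question, dialogue_text)
        frames = animate_face(image, expression, simulated_audio)
        return mux_video(frames, simulated_audio)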
In some embodiments, the dialog device 155 further includes:
a generating module, configured to generate a dialog response when the third search result is null, where the dialog response carries prompt information, and the prompt information is used to prompt that the reference knowledge search of the target object fails;
the sending module is further configured to send the dialog response to the terminal, so that the user uploads the reference knowledge of the target object to a server according to the prompt information.
It should be noted here that the above description of the dialog device embodiments is similar to the description of the method embodiments above and has similar beneficial effects. For technical details not disclosed in the dialog device embodiments of the present application, reference is made to the description of the method embodiments of the present application.
Embodiments of the present application provide a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to execute the dialogue method according to the embodiment of the present application.
Embodiments of the present application provide a storage medium having stored therein executable instructions, which when executed by a processor, will cause the processor to perform the methods provided by embodiments of the present application, for example, the methods as illustrated in fig. 3 to 5.
In some embodiments, the storage medium may be memory such as FRAM, ROM, PROM, EPROM, EEPROM, flash memory, magnetic surface memory, optical disk, or CD-ROM; or may be various devices including one or any combination of the above memories.
In some embodiments, executable instructions may be written in any form of programming language (including compiled or interpreted languages), in the form of programs, software modules, scripts or code, and may be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment.
By way of example, executable instructions may correspond, but do not necessarily have to correspond, to files in a file system, and may be stored in a portion of a file that holds other programs or data, such as in one or more scripts in a Hypertext Markup Language (HTML) document, in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub-programs, or portions of code).
By way of example, executable instructions may be deployed to be executed on one computing device or on multiple computing devices at one site or distributed across multiple sites and interconnected by a communication network.
The above description is only an example of the present application, and is not intended to limit the scope of the present application. Any modification, equivalent replacement, and improvement made within the spirit and scope of the present application are included in the protection scope of the present application.

Claims (13)

1. A dialogue method, the method comprising:
acquiring a conversation request sent by a terminal, wherein the conversation request carries a conversation question and a user identifier;
screening out a target knowledge sub-base from a pre-established conversation knowledge base according to the user identifier and the conversation question, wherein the target knowledge sub-base is used for storing reference knowledge of a target object having a conversation with the user;
in the target knowledge sub-base, acquiring a conversation video according to the conversation question, wherein the conversation video is a dynamic picture of the target object for answering the conversation question;
and sending the conversation video to the terminal so as to output the conversation video on the terminal.
2. The method of claim 1, wherein the screening out a target knowledge sub-base from a pre-established conversation knowledge base according to the user identifier and the conversation question comprises:
screening out a conversation knowledge sub-base corresponding to the user identifier from the pre-established conversation knowledge base;
determining the character relationship between the user and the target object according to the conversation question;
and screening out a target knowledge sub-base corresponding to the target object from the conversation knowledge sub-base according to the character relationship.
3. The method of claim 1, wherein the obtaining a conversation video according to the conversation question in the target knowledge sub-base comprises:
obtaining a target reference question corresponding to the conversation question in a reference question set included in the target knowledge sub-base;
and obtaining the conversation video according to the target reference question.
4. The method of claim 3, wherein the obtaining the conversation video according to the target reference question comprises:
searching a reference video corresponding to the target reference question in a reference answer set included in the target knowledge sub-base to obtain a first search result;
and when the first search result is not empty, determining the reference video included in the first search result as the conversation video.
5. The method of claim 4, wherein the obtaining the conversation video according to the target reference question further comprises:
when the first search result is empty, searching the image of the target object in the target knowledge sub-base;
searching a reference audio corresponding to the target reference question in the reference answer set to obtain a second search result;
when the second search result is not empty, determining the reference audio included in the second search result as conversation audio;
and carrying out fusion processing on the image of the target object and the conversation audio to obtain a conversation video.
6. The method according to claim 5, wherein the fusing the image of the target object and the dialogue audio to obtain a dialogue video comprises:
estimating the expression of the target object according to the conversation question and the conversation audio to obtain first expression information;
adjusting the facial features of the target object in the image according to the first expression information and the conversation audio to obtain a first ordered dynamic image;
and synthesizing the first ordered dynamic image and the dialogue audio to obtain a dialogue video.
7. The method of claim 5, wherein the obtaining the conversation video according to the target reference question further comprises:
when the second search result is empty, searching a reference text corresponding to the target reference question in the reference answer set to obtain a third search result;
when the third search result is not empty, determining a reference text included in the third search result as a dialog text;
and carrying out fusion processing on the image of the target object and the conversation text to obtain a conversation video.
8. The method according to claim 7, wherein the fusing the image of the target object and the dialog text to obtain a dialog video comprises:
estimating the expression of the target object according to the conversation question and the conversation text to obtain second expression information;
acquiring audio information of the target object, and generating a simulated dialogue audio according to the audio information and the dialogue text;
adjusting the facial features of the target object in the image according to the second expression information and the simulated dialogue audio to obtain a second ordered dynamic image;
and synthesizing the second ordered dynamic image and the simulated dialogue audio to obtain a dialogue video.
9. The method of claim 7, further comprising:
when the third search result is empty, generating a dialogue response, wherein the dialogue response carries prompt information, and the prompt information is used for prompting that the reference knowledge search of the target object fails;
and sending the dialogue response to the terminal, so that the user uploads the reference knowledge of the target object to a server according to the prompt information.
10. A dialog device, characterized in that the device comprises:
the system comprises a first acquisition module, a second acquisition module and a first processing module, wherein the first acquisition module is used for acquiring a conversation request sent by a terminal, and the conversation request carries a conversation problem and a user identifier;
the screening module is used for screening a target knowledge sub-base from a pre-established conversation knowledge base according to the user identification and the conversation problem, wherein the target knowledge sub-base is used for storing reference knowledge of a target object which is in conversation with the user;
a second obtaining module, configured to obtain, in the target knowledge sub-base, a conversation video according to the conversation question, where the conversation video is a dynamic picture of the target object answering the conversation question;
and the sending module is used for sending the conversation video to the terminal so as to output the conversation video on the terminal.
11. An electronic device, characterized in that the device comprises:
a memory for storing executable instructions;
a processor, configured to implement the dialogue method of any one of claims 1 to 9 when executing the executable instructions stored in the memory.
12. A computer-readable storage medium having stored thereon executable instructions for causing a processor to perform the dialogue method of any one of claims 1 to 9 when executed.
13. A computer program product comprising a computer program, characterized in that the computer program implements the dialogue method of any one of claims 1 to 9 when executed by a processor.
CN202111393299.8A 2021-11-23 2021-11-23 Dialogue method, apparatus, device, computer-readable storage medium, and program product Pending CN114048299A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111393299.8A CN114048299A (en) 2021-11-23 2021-11-23 Dialogue method, apparatus, device, computer-readable storage medium, and program product

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111393299.8A CN114048299A (en) 2021-11-23 2021-11-23 Dialogue method, apparatus, device, computer-readable storage medium, and program product

Publications (1)

Publication Number Publication Date
CN114048299A true CN114048299A (en) 2022-02-15

Family

ID=80210794

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111393299.8A Pending CN114048299A (en) 2021-11-23 2021-11-23 Dialogue method, apparatus, device, computer-readable storage medium, and program product

Country Status (1)

Country Link
CN (1) CN114048299A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115148187A (en) * 2022-07-01 2022-10-04 南京硅基智能科技有限公司 System implementation method of intelligent figure repeated engraving terminal
CN115700873A (en) * 2022-07-01 2023-02-07 南京硅基智能科技有限公司 Intelligent figure repeated engraving terminal
CN115148187B (en) * 2022-07-01 2023-08-22 南京硅基智能科技有限公司 System implementation method of intelligent character re-engraving terminal


Legal Events

Date Code Title Description
PB01 Publication