CN113792196A - Method and device for man-machine interaction based on multi-modal dialog state representation - Google Patents


Info

Publication number
CN113792196A
Authority
CN
China
Prior art keywords
modal
dialog
result
state representation
analysis
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111064527.7A
Other languages
Chinese (zh)
Inventor
赵楠
张孟馨
吴友政
周伯文
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jingdong Shangke Information Technology Co Ltd
Original Assignee
Beijing Jingdong Shangke Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Jingdong Shangke Information Technology Co Ltd filed Critical Beijing Jingdong Shangke Information Technology Co Ltd
Priority to CN202111064527.7A priority Critical patent/CN113792196A/en
Publication of CN113792196A publication Critical patent/CN113792196A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/903 Querying
    • G06F16/9032 Query formulation
    • G06F16/90332 Natural language query formulation or dialogue systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state

Abstract

The present disclosure provides a method and a device for man-machine interaction based on multi-modal dialog state representation. The method comprises the following steps: acquiring original multi-modal input information; processing the original multi-modal input information to obtain a multi-modal dialog state representation result; determining a multi-modal dialog strategy according to the multi-modal dialog state representation result; and finishing multi-modal information output according to the multi-modal dialog strategy. By defining a dialog state representation suited to real-scene dialogs, the method can fully express the dialog interaction in communication, support the realization of a multi-modal dialog system, and achieve accurate dialog representation.

Description

Method and device for man-machine interaction based on multi-modal dialog state representation
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a method and an apparatus for human-computer interaction based on multi-modal dialog state representation.
Background
With the development of technology and the progress of social demands, human-computer interaction is stepping towards a new stage of anthropomorphic interaction. A human-computer interaction system in a real scene needs certain communication skills and strategy-planning capability. In addition, a multi-modal interactive robot cannot interact through text or voice alone; it also needs to present charts or pictures at the right moment during communication to help users understand better. Conversational communication in real scenes exhibits a variety of language phenomena, such as switching between active and passive roles, topic rotation, and long-range dependence on context, so representing the dialog state only with intents and slot values cannot meet the requirements of real scenes. Both intents and slots must be defined in advance, which is difficult to manage; the intent/slot-value definition scheme is not universal, and sharing it across related knowledge fields is very difficult. Moreover, such a scheme lacks a detailed description of conversational behavior in real scenes and gives no consideration to multi-modal dialog states.
Disclosure of Invention
The present disclosure provides a method and device for man-machine interaction based on multi-modal dialog state representation, which overcome the defects of the prior art, namely its lack of universality and its difficulty in conducting dialogs accurately, and thereby achieve accurate dialog and cross-domain universality.
In a first aspect, the present disclosure provides a method for human-computer interaction based on multi-modal dialog state representation, including:
acquiring original multi-modal input information;
processing the original multi-modal input information to obtain a multi-modal dialog state representation result;
determining a multi-modal dialog strategy according to the multi-modal dialog state representation result;
and finishing multi-modal information output according to the multi-modal dialog strategy.
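For orientation, the following is a minimal sketch of how the four steps above could be chained in code. It is illustrative only: the class and function names (MultimodalInput, analyze_input, decide_policy, render_output) and the toy logic inside them are assumptions introduced here, not identifiers or algorithms taken from the present disclosure.

```python
# Minimal sketch of the four-step interaction loop described above.
# All names and the toy logic are illustrative assumptions, not part of the disclosure.

from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class MultimodalInput:
    """Original multi-modal input information (acquisition step)."""
    text: str = ""
    audio: bytes = b""
    images: List[bytes] = field(default_factory=list)


@dataclass
class DialogState:
    """Multi-modal dialog state representation result."""
    dialog_acts: List[str] = field(default_factory=list)           # guides strategy generation
    dialog_elements: Dict[str, str] = field(default_factory=dict)  # action / object / condition / question type
    dialog_scene: Dict[str, str] = field(default_factory=dict)     # persona / media / style / device


def analyze_input(raw: MultimodalInput) -> DialogState:
    """Process the original multi-modal input into a dialog state representation."""
    state = DialogState(dialog_scene={"media": "text", "device": "web page"})
    if "invoice" in raw.text:
        state.dialog_elements = {"action": "make up", "object": "invoice",
                                 "condition": "hotel", "question_type": "how-query"}
        state.dialog_acts = ["Task:request"]
    return state


def decide_policy(state: DialogState) -> Dict[str, Any]:
    """Determine a multi-modal dialog strategy from the state representation."""
    return {"response_act": "Task:answer",
            "media": state.dialog_scene.get("media", "text"),
            "style": "calm"}


def render_output(policy: Dict[str, Any]) -> str:
    """Complete the multi-modal information output according to the strategy."""
    return f"[{policy['media']}/{policy['style']}] Here is how to reissue the hotel invoice ..."


if __name__ == "__main__":
    raw = MultimodalInput(text="how to make up the hotel invoice")  # acquire input
    print(render_output(decide_policy(analyze_input(raw))))
```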
According to the method for man-machine interaction based on multi-modal dialog state representation provided by the present disclosure, the processing the original multi-modal input information to obtain a multi-modal dialog state representation result specifically includes:
performing single-mode analysis on the original multi-mode input information to obtain a single-mode representation result;
obtaining relevant information of a dialog scene according to the original multi-modal input information;
and performing multi-modal understanding and chapter semantic analysis on the single-modal representation result and the relevant information of the dialog scene to obtain a multi-modal dialog state representation result.
According to the method for man-machine interaction based on multi-modal dialog state representation provided by the present disclosure, the performing single-modal analysis on the original multi-modal input information to obtain a single-modal representation result specifically includes:
performing voice recognition on the original multi-modal input information to obtain a voice recognition result, and performing semantic analysis on the voice recognition result to obtain a semantic analysis result;
performing emotion analysis and behavior gesture analysis on the original multi-modal input information to obtain corresponding emotion analysis results and behavior gesture analysis results;
and forming a single-mode representation result by the semantic analysis result, the emotion analysis result and the behavior gesture analysis result.
According to the method for man-machine interaction based on multi-modal dialog state representation, the multi-modal dialog state representation result comprises dialog behaviors, dialog elements and dialog scenes;
wherein the conversation behavior is used for guiding conversation strategy generation;
the dialog element is used to determine the intention of the interlocutor;
the dialog scenario is used to determine a corresponding media interaction type.
According to the method for man-machine interaction based on multi-modal dialog state representation, the dialog behavior is used for guiding dialog strategy generation, and the method specifically comprises the following steps:
acquiring a human-computer interaction scene;
performing dialogue behavior dimension analysis according to the scene to obtain a dialogue behavior dimension analysis result;
and determining the generation of a conversation strategy according to the dimension analysis result of the conversation behavior.
According to the method for man-machine interaction based on multi-modal dialog state representation, the dialog elements are used for determining the intention of an interlocutor, and the method specifically comprises the following steps:
obtaining the sentence of the interlocutor;
performing multi-factor dialogue element representation on the statement to obtain a multi-factor dialogue element representation result;
determining the intention of the interlocutor based on the multi-factor dialog element representation result.
According to the method for man-machine interaction based on multi-modal dialog state representation, the dialog scene is used for determining a corresponding media interaction type, and the method specifically comprises the following steps:
carrying out user portrait analysis, media type analysis, style emotion analysis and equipment type analysis on the interlocutor to respectively obtain a user portrait result, a media type result, a style emotion result and an equipment type result of the interlocutor;
and determining a media interaction type interacted with the interlocutor according to the user portrait result, the media type result, the style emotion result and the equipment type result of the interlocutor.
According to the method for man-machine interaction based on multi-modal dialog state representation provided by the present disclosure, the performing multi-factor dialog element representation on the sentence to obtain a multi-factor dialog element representation result specifically includes:
factoring the statement from the semantic perspective to obtain four dimensional factors of the action, the object, the condition and the question type of the statement;
determining a multi-factor dialog element representation result according to the four dimensional factors;
wherein the action refers to a predicate part in the sentence and is assumed by a verb or an adjective in the sentence;
the object refers to an influencer of the action or a central word of a nominal phrase sentence;
the condition refers to the state and condition of the action, and the modification and attribute of the object;
the question types refer to different query request categories in the interaction process set according to common knowledge.
In a second aspect, the present disclosure provides an apparatus for human-computer interaction based on multi-modal dialog state representation, comprising:
the first processing module is used for acquiring original multi-modal input information;
the second processing module is used for processing the original multi-modal input information and acquiring a multi-modal dialog state representation result;
the third processing module is used for determining a multi-modal dialog strategy according to the multi-modal dialog state representation result;
and the fourth processing module is used for finishing multi-modal information output according to the multi-modal dialog strategy.
The present disclosure also provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the steps of the method for human-computer interaction based on multimodal dialog state representation as described in any of the above when executing the program.
The present disclosure also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the method for human-computer interaction based on multimodal dialog state representation as recited in any of the above.
In the method and device for man-machine interaction based on multi-modal dialog state representation provided by the present disclosure, original multi-modal input information is acquired and processed to obtain a multi-modal dialog state representation result; this representation can characterize the dialog features from multiple dimensions and therefore has a more anthropomorphic effect. A multi-modal dialog strategy is then determined according to the multi-modal dialog state representation result, and multi-modal information output is completed according to that strategy. Because the multi-modal dialog representation is more accurate, the multi-modal output constructed from it is more accurate and can better exhibit diverse, humanized modes of interaction.
Drawings
In order to more clearly illustrate the technical solutions of the present disclosure or the prior art, the drawings needed for the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present disclosure, and other drawings can be obtained by those skilled in the art without creative efforts.
FIG. 1 is a flow diagram of a method of human-machine interaction based on multimodal dialog state representation provided by the present disclosure;
FIG. 2 is a schematic diagram of a multi-modal human-machine interaction system architecture provided by the present disclosure;
FIG. 3 is a schematic diagram of a deep-level multi-modal dialog state representation provided by the present disclosure;
FIG. 4 is a schematic structural diagram of a device for man-machine interaction based on multi-modal dialog state representation provided by the present disclosure;
fig. 5 is a schematic structural diagram of an electronic device provided by the present disclosure.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present disclosure more clear, the technical solutions of the embodiments of the present disclosure will be described clearly and completely with reference to the drawings in the embodiments of the present disclosure, and it is obvious that the described embodiments are some, but not all, of the embodiments of the present disclosure. All other embodiments, which can be obtained by a person skilled in the art without making creative efforts based on the embodiments of the present disclosure, belong to the protection scope of the embodiments of the present disclosure.
A method for man-machine interaction based on multi-modal dialog state representation according to an embodiment of the present disclosure is described below with reference to fig. 1-2, including:
step 100: acquiring original multi-modal input information;
Specifically, with the rapid development of big data, deep learning and computing power, computers have evolved into intelligent systems that can represent and recognize multi-modal information such as voice, vision and text, and fuse knowledge to achieve comprehension and reasoning. To remove the obstacles ordinary users face when completing complex tasks in diversified scenes, human-computer interaction is stepping towards a new stage of intelligent, anthropomorphic user interfaces (IUI). Such systems are typically applied to intelligent customer service facing scenes such as telephone service, online text customer service, face-to-face consultation, sales and after-sales service. In terms of multi-turn conversation research and open human-computer interaction platforms, existing systems perform well on specific tasks in specific fields and specific modalities, but their multi-turn conversation capability under multi-modal, complex-scene, low-resource and cold-start conditions urgently needs improvement. For example, conventional dialog management technology is limited to single-domain task dialogs; it lacks global dialog management that integrates multiple response modules such as task dialog, intelligent question answering, knowledge-graph question answering and chit-chat, as well as the ability to generate emotionally appropriate responses in high-noise, complex scenes.
Because the present disclosure is directed to various situations, various conversation behaviors need to be completed in the course of carrying out tasks such as robot conversational communication, for example self-introduction, commodity recommendation, inviting an evaluation, and emotional soothing; only by adopting such anthropomorphic, stylized strategy replies can a better user experience be provided.
Each source or form of information may be referred to as a modality. For example, humans have touch, hearing, vision and smell; there are information media such as voice, video and text; and there is a wide variety of sensors such as radar, infrared and accelerometers. Each of these may be called a modality. The notion can also be defined very broadly: two different languages may be regarded as two modalities, and even data sets collected under two different situations may be regarded as two modalities. The original multi-modal input information retrieved by the present disclosure can therefore be in text form, image form or video form. After the multi-modal input data are obtained, a digital-human capability interface is called to analyze them.
Step 200: processing the original multi-modal input information to obtain a multi-modal dialog state representation result;
specifically, in the present disclosure, a multi-modal dialog state representation result is obtained by processing original multi-modal information, and processing the input original multi-modal information from multiple dimensions, such as a semantic angle, an action angle, and the like.
Step 300: determining a multi-modal dialog strategy according to the multi-modal dialog state representation result;
Specifically, after the multi-modal dialog state representation result is determined, the machine's response mode in the human-computer interaction can be decided from it, for example whether the selected dialog emotional state is soothing or exciting, calm or happy, and whether the dialog modality is visual or voice, picture, video or text, and so on.
Step 400: and finishing multi-modal information output according to the multi-modal dialog strategy.
Specifically, after determining the multi-modal dialog strategy, the robot performs the multi-modal output process, which is a process of calling multi-modal resource data owned by the robot system and outputting them in different ways. For example, if it is desired to cause the robot to output a facial expression, the facial expression is output by playing a video or displaying an image on a display screen for the robot provided with the display screen. The multimodal resource data involved in the robotic system typically includes audio data, video data, image data, or other multimedia data, as well as program instructions for controlling motors that drive the robot's actions, etc.
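As a rough illustration of this output stage, the sketch below dispatches a chosen strategy to different output channels. The channel names, the resource dictionary and the print statements are assumptions for illustration; a real robot system would call its own playback, display, speech synthesis and motor control interfaces.

```python
# Illustrative dispatch of a multi-modal dialog strategy to output channels.
# Channel names and the resource lookup are assumed for this sketch.

from typing import Dict


def emit_output(policy: Dict[str, str], resources: Dict[str, str]) -> None:
    media = policy.get("media", "text")
    if media == "video":
        print(f"play video: {resources.get('facial_expression_clip', '<missing>')}")
    elif media == "image":
        print(f"show image: {resources.get('chart', '<missing>')}")
    elif media == "speech":
        print(f"synthesize speech: {policy.get('text', '')}")
    else:
        print(f"display text: {policy.get('text', '')}")
    # A physical robot would additionally issue motor-control instructions here.


emit_output({"media": "image", "text": "Please see the chart"}, {"chart": "sales_chart.png"})
```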
The man-machine interaction method based on multi-modal dialog state representation provided by the present disclosure acquires original multi-modal input information and processes it to obtain a multi-modal dialog state representation result; this representation can characterize the dialog features from multiple dimensions and therefore has a more anthropomorphic effect. A multi-modal dialog strategy is then determined according to the multi-modal dialog state representation result, and multi-modal information output is completed according to that strategy. Because the multi-modal dialog representation is more accurate, the multi-modal output constructed from it is more accurate and can better exhibit diverse, humanized modes of interaction.
According to the method for man-machine interaction based on multi-modal dialog state representation provided by the embodiment of the present disclosure, the processing the original multi-modal input information to obtain a multi-modal dialog state representation result specifically includes:
performing single-mode analysis on the original multi-mode input information to obtain a single-mode representation result;
obtaining relevant information of a dialog scene according to the original multi-modal input information;
and performing multi-modal understanding and chapter semantic analysis on the single-modal representation result and the relevant information of the dialog scene to obtain a multi-modal dialog state representation result.
Specifically, multi-modal semantic features are extracted on the basis of a multi-modal data feature representation model: data feature extraction models for text, images and audio/video are built on pre-trained models, and these models are used to complete, respectively, single-modal semantic feature extraction (text semantic features, image features and video features) and the extraction of textual description information for image and video data. In addition, the single-modal representation result is obtained through processes such as behavior/gesture analysis and emotion analysis.
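A hedged sketch of this per-modality feature extraction is given below. The encoder class is a placeholder standing in for whatever pre-trained text, image or audio/video model is actually used; the dummy vector it returns is not meaningful.

```python
# Sketch of single-modal semantic feature extraction with pre-trained encoders.
# PretrainedEncoder is a placeholder for any pre-trained model with an encode step.

from typing import Dict, List


class PretrainedEncoder:
    """Stand-in for a pre-trained model mapping raw data to a feature vector."""

    def __init__(self, name: str):
        self.name = name

    def encode(self, data: object) -> List[float]:
        return [0.0, 0.0, 0.0]  # dummy vector; a real model returns learned features


ENCODERS: Dict[str, PretrainedEncoder] = {
    "text": PretrainedEncoder("text encoder"),
    "image": PretrainedEncoder("image encoder"),
    "audio_video": PretrainedEncoder("audio/video encoder"),
}


def extract_single_modal_features(raw: Dict[str, object]) -> Dict[str, List[float]]:
    """Encode each available modality independently."""
    return {modality: ENCODERS[modality].encode(data)
            for modality, data in raw.items() if modality in ENCODERS}


print(extract_single_modal_features({"text": "how to make up the hotel invoice"}))
```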
In addition, the original multi-modal input information is analyzed for dialog scene information. The scene information represents the environment of the scene in which the dialog occurs, is closely related to anthropomorphic audiovisual perception, and can describe the type of response from four angles: user portrait (Persona), media type (Media), style emotion (Style) and device type (Device).
Multi-modal understanding aims to achieve the ability to process and understand information from multiple source modalities by means of machine learning. The currently popular research direction is multi-modal learning across images, video, audio and semantics.
Multi-modal deep semantic understanding realizes semantic understanding of text and visual images simultaneously. For example, traditional AI recognition applied to a picture of a puppy under the shade of a small tree merely classifies two objects, one being the puppy and the other a tree. Visual semantic understanding yields "a puppy is enjoying the cool under the shade of a small tree", while a deeper understanding of the meaning behind the text is "a puppy is enjoying the cool under the shade of a small tree because it is a hot summer day outside". The latter is multi-modal deep semantic understanding.
And then, performing multi-modal understanding and chapter semantic analysis on the single-modal representation result and the dialogue scene related information to obtain a multi-modal dialogue state representation result.
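The fusion step could be sketched as follows; the plain dictionary merge and the three-turn context window are simplifying assumptions standing in for the multi-modal understanding and discourse-level (chapter) semantic analysis described here.

```python
# Hedged sketch of fusing single-modal results and scene information into a
# multi-modal dialog state representation result. The merge logic is assumed.

from typing import Any, Dict, List


def fuse_to_dialog_state(single_modal: Dict[str, Any],
                         scene_info: Dict[str, str],
                         dialog_history: List[str]) -> Dict[str, Any]:
    return {
        "dialog_acts": single_modal.get("dialog_acts", []),
        "dialog_elements": single_modal.get("semantics", {}),
        "dialog_scene": scene_info,
        # Discourse-level analysis would resolve references against the history;
        # here we simply attach the last few turns as context.
        "context": dialog_history[-3:],
    }


state = fuse_to_dialog_state({"semantics": {"action": "make up", "object": "invoice"}},
                             {"media": "text", "device": "web page"},
                             ["hello", "I booked a hotel", "how to make up the invoice"])
print(state["context"])
```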
According to the method for man-machine interaction based on multi-modal dialog state representation provided by the embodiment of the present disclosure, the performing single-modal analysis on the original multi-modal input information to obtain a single-modal representation result specifically includes:
performing voice recognition on the original multi-modal input information to obtain a voice recognition result, and performing semantic analysis on the voice recognition result to obtain a semantic analysis result;
performing emotion analysis and behavior gesture analysis on the original multi-modal input information to obtain corresponding emotion analysis results and behavior gesture analysis results;
and forming a single-mode representation result by the semantic analysis result, the emotion analysis result and the behavior gesture analysis result.
Specifically, the original multi-modal input information is correspondingly processed by calling a voice recognition interface, an emotion analysis interface and a gesture behavior analysis interface, so that a corresponding single-modal representation result is obtained.
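A sketch of this stage is shown below. The four interface functions are placeholders for the speech recognition, semantic analysis, emotion analysis and gesture/behavior analysis services being called; their hard-coded return values exist only to make the example runnable.

```python
# Placeholder single-modal analysis interfaces; return values are hard-coded
# for illustration and stand in for real recognition and analysis services.

from typing import Dict, List


def speech_recognition(audio: bytes) -> str:
    return "how to make up the hotel invoice"


def semantic_analysis(text: str) -> Dict[str, str]:
    return {"action": "make up", "object": "invoice", "condition": "hotel"}


def emotion_analysis(audio: bytes, images: List[bytes]) -> str:
    return "urgent"


def gesture_analysis(images: List[bytes]) -> str:
    return "pointing at the screen"


def single_modal_representation(audio: bytes, images: List[bytes]) -> Dict[str, object]:
    """Combine the semantic, emotion and gesture results into one representation."""
    text = speech_recognition(audio)
    return {
        "semantics": semantic_analysis(text),
        "emotion": emotion_analysis(audio, images),
        "gesture": gesture_analysis(images),
    }


print(single_modal_representation(b"", []))
```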
According to the method for man-machine interaction based on the multi-modal dialog state representation, the multi-modal dialog state representation result comprises a dialog behavior, a dialog element and a dialog scene;
wherein the conversation behavior is used for guiding conversation strategy generation;
the dialog element is used to determine the intention of the interlocutor;
the dialog scenario is used to determine a corresponding media interaction type.
Specifically, referring to fig. 3, the present disclosure comprehensively describes the information required for dialog decision and dialog generation in human-computer interaction from three aspects: dialog behavior (multi-dimensional dialog behavior analysis), dialog elements (multi-factor dialog elements) and dialog scene (multi-type dialog scene description). It thereby depicts more finely the dialog behavior, dialog semantic elements and dialog scene information that support anthropomorphic, multi-modal interaction in communication, meeting the requirements of human-computer interaction in complex scenes. The method is not bound to domain knowledge, is universal, and can be applied to man-machine dialog systems in various fields such as e-commerce, tourism and medical care.
According to the method for man-machine interaction based on multi-modal dialog state representation, the dialog behavior is used for guiding dialog strategy generation, and the method specifically comprises the following steps:
acquiring a human-computer interaction scene;
performing dialogue behavior dimension analysis according to the scene to obtain a dialogue behavior dimension analysis result;
and determining the generation of a conversation strategy according to the dimension analysis result of the conversation behavior.
In particular, the dialog act (DA), which marks speaker intentions (such as statements, questions, commitments, directives, etc.) independently of any specific dialog system, plays a crucial role in spoken language understanding systems and therefore has a certain generality. Dialog behavior, also known as speech acts or social acts, is an attempt to formalize and generalize intent. A conversation is conducted around the interaction between two roles: the speaker (sender), the party currently speaking, i.e. the one generating the current dialog act, and the recipient (addressee), the other participant of the conversation and the interaction object of the current speaker.
The present disclosure combines the characteristics of human-computer interaction scenes such as human-operated telephone service and online customer service, for example the commodity recommendation scene handled by pre-sales customer service staff and special scenes such as the emotional conciliation scene handled by after-sales customer service staff for customers. Drawing on general dialogue-act schemes proposed abroad, it provides a dialogue-act classification scheme that can fully represent the characteristics of spoken human-computer communication, and defines five dialogue-act analysis dimensions (the Task dimension, Time Management dimension, Feedback dimension, Own and Partner Communication Management dimension, and Social Obligations Management dimension), which together represent the complex spoken dialogue-act state in real scenes. The definition of each dimension is shown in Table 1 below:
TABLE 1
(The content of Table 1, defining the five dialogue-act dimensions listed above, is provided as images in the original publication: Figure BDA0003257719240000101 and Figure BDA0003257719240000111.)
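The five dimensions named above can be carried as labels on each utterance; the enumeration below simply transcribes their names, while the example tags attached to the sample utterance are assumptions, since the per-dimension values are defined in Table 1 of the original publication.

```python
# The five dialogue-act analysis dimensions as an enumeration; the example
# per-dimension tags are illustrative assumptions.

from enum import Enum


class DialogActDimension(Enum):
    TASK = "Task"
    TIME_MANAGEMENT = "Time Management"
    FEEDBACK = "Feedback"
    OWN_AND_PARTNER_COMMUNICATION_MANAGEMENT = "Own and Partner Communication Management"
    SOCIAL_OBLIGATIONS_MANAGEMENT = "Social Obligations Management"


# A single utterance can carry acts on several dimensions at once, e.g. a
# greeting that also opens a task request:
utterance_acts = {
    DialogActDimension.SOCIAL_OBLIGATIONS_MANAGEMENT: "greeting",
    DialogActDimension.TASK: "request",
}
print({dim.value: act for dim, act in utterance_acts.items()})
```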
According to the method for man-machine interaction based on multi-modal dialog state representation, provided by the embodiment of the disclosure, wherein the dialog element is used for determining the intention of the interlocutor, the method specifically comprises the following steps:
obtaining the sentence of the interlocutor;
performing multi-factor dialogue element representation on the statement to obtain a multi-factor dialogue element representation result;
determining the intention of the interlocutor based on the multi-factor dialog element representation result.
In particular, the present disclosure proposes a simple and novel semantic representation framework, called the multi-factor dialog element representation framework, to replace the classical semantic representation that combines intents and slot values. Under this framework, different intentions are distinguished by four key factors: action, object, condition and question type. These four key concepts are used to distinguish intentions rather than to fully express all the complex grammatical and semantic content of a sentence. The multi-factor semantic framework is motivated by the fact that the number of possible sentences is infinite, so fully representing every sentence is not feasible, while overly general semantic representations struggle to meet the needs of real scenes. In a particular domain or scene, however, the possible semantic space is limited, so all intents can be distinguished with a limited set of key concepts, without representing each intent directly and completely, and this granularity of representation still meets the needs of the scene.
The framework mainly addresses the fine-grained description of user intention: different semantic intentions are distinguished by differences in their factors, and related semantic knowledge points are associated through shared factors, relating what intentions have in common and where they differ. Four dimensional factors are set in the framework: action, object, condition (modifier/attribute/state/condition) and question type. The condition dimension is a mixed type comprising several finer-grained factors such as modifier, attribute, state and condition; since these factors usually do not co-occur in the same question, they are merged into one dimension, referred to below simply as condition. Multi-dimensional factor semantic analysis of a question yields several factors, which are then concatenated into a factor expression that conveys the semantics. For example, the factor expression of the question "how to make up the hotel invoice" is "make up (action) + invoice (object) + hotel (condition) + how-query (question type)", where the parentheses give the factor types; since the four dimensional factors are generally arranged in a fixed order, the factor types can be omitted, in which case the factor expression is "make up + invoice + hotel + how-query".
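Using the worked example just given, a minimal data structure for the four factors might look like the sketch below; the class name and method are assumptions, while the factor values reproduce the "hotel invoice" example from the text.

```python
# Multi-factor dialog element representation for the worked example above.
# The DialogElements class and factor_expression method are illustrative.

from dataclasses import dataclass


@dataclass
class DialogElements:
    action: str          # predicate part, usually carried by a verb or adjective
    object: str          # affected object, or head word of a nominal phrase
    condition: str       # modifier / attribute / state / condition
    question_type: str   # query request category, e.g. how-query

    def factor_expression(self) -> str:
        # Fixed factor order, so the factor types themselves can be omitted.
        return " + ".join([self.action, self.object, self.condition, self.question_type])


elements = DialogElements(action="make up", object="invoice",
                          condition="hotel", question_type="how-query")
print(elements.factor_expression())   # make up + invoice + hotel + how-query
```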
According to the man-machine interaction method based on the multi-modal dialog state representation, provided by the embodiment of the disclosure, the dialog scene is used for determining a corresponding media interaction type, and the method specifically comprises the following steps:
carrying out user portrait analysis, media type analysis, style emotion analysis and equipment type analysis on the interlocutor to respectively obtain a user portrait result, a media type result, a style emotion result and an equipment type result of the interlocutor;
and determining a media interaction type interacted with the interlocutor according to the user portrait result, the media type result, the style emotion result and the equipment type result of the interlocutor.
Specifically, the scene information represents the environment of the scene in which the dialog occurs; it is closely related to anthropomorphic audiovisual perception and describes the type of response from four angles: user portrait (Persona), media type (Media), style emotion (Style) and device type (Device). The user portrait describes profile information about the interlocutor (e.g. age, occupation, hobbies). The media type indicates the preferred presentation media, i.e. in what form the input and output complete the interaction (e.g. text, speech, diagram, picture). The style emotion expresses the emotional attitude carried by the current utterance (e.g. anger, urgency). The device type specifies which devices will be used in the presentation; the device ultimately supports the interaction in terms of physical hardware (e.g. web page, phone or PDA).
In a multi-modal dialog system, a dialog decision (Policy) unit takes the dialog scene information into full consideration and selects the most appropriate media interaction type in the decision process of a dialog strategy. When the dialog generation (NLG) unit generates the response text, the response reply with rich stylized characteristics can be generated by utilizing the portrait and style information of the user, so that better user experience can be provided, more user participation is brought, and the completion rate of the dialog task is improved.
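The sketch below shows one way a dialog decision (Policy) unit might weigh the four scene components when picking a media interaction type. The DialogScene fields follow the Persona/Media/Style/Device description above, but the selection rules themselves are assumptions, not the strategy claimed in this disclosure.

```python
# Illustrative scene-aware selection of a media interaction type.
# The selection rules are assumed for this sketch.

from dataclasses import dataclass


@dataclass
class DialogScene:
    persona: dict   # e.g. {"age": 30, "occupation": "engineer"}
    media: str      # preferred presentation media, e.g. "diagram"
    style: str      # emotional attitude of the current utterance, e.g. "urgency"
    device: str     # e.g. "web page", "phone", "PDA"


def choose_media_type(scene: DialogScene) -> str:
    if scene.device == "phone" and scene.media == "diagram":
        return "spoken summary plus a follow-up picture"  # small screen: keep visuals light
    if scene.style in {"anger", "urgency"}:
        return "short text"                               # do not slow the user down
    return scene.media


scene = DialogScene(persona={"age": 30}, media="diagram", style="calm", device="web page")
print(choose_media_type(scene))   # diagram
```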
According to the method for man-machine interaction based on multi-modal dialog state representation provided by the embodiment of the present disclosure, the performing multi-factor dialog element representation on the statement to obtain a multi-factor dialog element representation result specifically includes:
factoring the statement from the semantic perspective to obtain four dimensional factors of the action, the object, the condition and the question type of the statement;
determining a multi-factor dialog element representation result according to the four dimensional factors;
wherein the action refers to a predicate part in the sentence and is assumed by a verb or an adjective in the sentence;
the object refers to an influencer of the action or a central word of a nominal phrase sentence;
the condition refers to the state and condition of the action, and the modification and attribute of the object;
the question types refer to different query request categories in the interaction process set according to common knowledge.
Specifically, splitting into factors does not depend on specific domain knowledge: the subject, predicate and object components of the sentence are obtained through general Chinese syntactic analysis, and the factors can be decomposed by understanding the central idea of the sentence from the semantic perspective. When splitting, the following rules can be followed:
actions, i.e., predicate portions in sentences, are usually undertaken by verbs or adjectives in the sentence.
An object, i.e. an influencer of an action, or a core word of a nominal phrase sentence.
The conditions, that is, the states and conditions of the actions and the modifications and attributes of the objects are usually not simultaneously presented in one sentence, and are expressed in one dimension.
Question types, i.e. the different query request categories in the interaction process, set according to common knowledge: yesno-query, a yes/no question; choice-query, a choice question; where-query, a position question; when-query, a time question; why-query, a reason question; whynot-query, a negated-reason question; what-query, an entity question; who-query, a person question; how-query, an action/status question; howoften-query, a frequency question; howmany-query, a quantity question; statement-positive, a positive statement; statement-negative, a negative statement.
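For reference, the question-type taxonomy listed above can be written out as an enumeration; the sketch below only transcribes the categories (with the spellings corrected as in the list) and assumes nothing beyond them.

```python
# Question-type taxonomy from the list above, transcribed as an enumeration.

from enum import Enum


class QuestionType(Enum):
    YESNO_QUERY = "yes/no question"
    CHOICE_QUERY = "choice question"
    WHERE_QUERY = "position question"
    WHEN_QUERY = "time question"
    WHY_QUERY = "reason question"
    WHYNOT_QUERY = "negated-reason question"
    WHAT_QUERY = "entity question"
    WHO_QUERY = "person question"
    HOW_QUERY = "action/status question"
    HOWOFTEN_QUERY = "frequency question"
    HOWMANY_QUERY = "quantity question"
    STATEMENT_POSITIVE = "positive statement"
    STATEMENT_NEGATIVE = "negative statement"


print(QuestionType.HOW_QUERY.value)   # action/status question
```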
As shown in fig. 4, the present disclosure provides an apparatus for human-computer interaction based on multi-modal dialog state representation, including:
a first processing module 41, configured to obtain original multi-modal input information;
the second processing module 42 is configured to process the original multi-modal input information to obtain a multi-modal dialog state representation result;
a third processing module 43, configured to determine a multi-modal dialog strategy according to the multi-modal dialog state representation result;
and the fourth processing module 44 is configured to complete multi-modal information output according to the multi-modal dialog policy.
Since the apparatus provided by the embodiment of the present invention can be used for executing the method described in the above embodiment, and the operation principle and the beneficial effect are similar, detailed descriptions are omitted here, and specific contents can be referred to the description of the above embodiment.
The device for man-machine interaction based on multi-modal dialog state representation provided by the present disclosure acquires original multi-modal input information and processes it to obtain a multi-modal dialog state representation result; this representation can characterize the dialog features from multiple dimensions and therefore has a more anthropomorphic effect. A multi-modal dialog strategy is then determined according to the multi-modal dialog state representation result, and multi-modal information output is completed according to that strategy. Because the multi-modal dialog representation is more accurate, the multi-modal output constructed from it is more accurate and can better exhibit diverse, humanized modes of interaction.
According to the device for man-machine interaction based on multi-modal dialog state representation provided by the embodiment of the present disclosure, the second processing module 42 is specifically configured to:
performing single-mode analysis on the original multi-mode input information to obtain a single-mode representation result;
obtaining relevant information of a dialog scene according to the original multi-modal input information;
and performing multi-modal understanding and chapter semantic analysis on the single-modal representation result and the relevant information of the dialog scene to obtain a multi-modal dialog state representation result.
According to the device for man-machine interaction based on multi-modal dialog state representation provided by the embodiment of the present disclosure, the second processing module 42 is further specifically configured to:
performing voice recognition on the original multi-modal input information to obtain a voice recognition result, and performing semantic analysis on the voice recognition result to obtain a semantic analysis result;
performing emotion analysis and behavior gesture analysis on the original multi-modal input information to obtain corresponding emotion analysis results and behavior gesture analysis results;
and forming a single-mode representation result by the semantic analysis result, the emotion analysis result and the behavior gesture analysis result.
According to the device for man-machine interaction based on multi-modal dialog state representation, provided by the embodiment of the disclosure, in the second processing module 42, the multi-modal dialog state representation result comprises a dialog behavior, a dialog element and a dialog scene;
wherein the conversation behavior is used for guiding conversation strategy generation;
the dialog element is used to determine the intention of the interlocutor;
the dialog scenario is used to determine a corresponding media interaction type.
According to the device for man-machine interaction based on multi-modal dialog state representation provided by the embodiment of the present disclosure, the second processing module 42 is further specifically configured to:
acquiring a human-computer interaction scene;
performing dialogue behavior dimension analysis according to the scene to obtain a dialogue behavior dimension analysis result;
and determining the generation of a conversation strategy according to the dimension analysis result of the conversation behavior.
According to the device for man-machine interaction based on multi-modal dialog state representation provided by the embodiment of the present disclosure, the second processing module 42 is further specifically configured to:
obtaining the sentence of the interlocutor;
performing multi-factor dialogue element representation on the statement to obtain a multi-factor dialogue element representation result;
determining the intention of the interlocutor based on the multi-factor dialog element representation result.
According to the device for man-machine interaction based on multi-modal dialog state representation provided by the embodiment of the present disclosure, the second processing module 42 is further specifically configured to:
carrying out user portrait analysis, media type analysis, style emotion analysis and equipment type analysis on the interlocutor to respectively obtain a user portrait result, a media type result, a style emotion result and an equipment type result of the interlocutor;
and determining a media interaction type interacted with the interlocutor according to the user portrait result, the media type result, the style emotion result and the equipment type result of the interlocutor.
According to the device for man-machine interaction based on multi-modal dialog state representation provided by the embodiment of the present disclosure, the second processing module 42 is further specifically configured to:
factoring the statement from the semantic perspective to obtain four dimensional factors of the action, the object, the condition and the question type of the statement;
determining a multi-factor dialog element representation result according to the four dimensional factors;
wherein the action refers to a predicate part in the sentence and is assumed by a verb or an adjective in the sentence;
the object refers to an influencer of the action or a central word of a nominal phrase sentence;
the condition refers to the state and condition of the action, and the modification and attribute of the object;
the question types refer to different query request categories in the interaction process set according to common knowledge.
Fig. 5 illustrates a physical structure diagram of an electronic device, which, as shown in fig. 5, may include: a processor (processor) 510, a communication interface (Communications Interface) 520, a memory (memory) 530 and a communication bus 540, wherein the processor 510, the communication interface 520 and the memory 530 communicate with each other via the communication bus 540. The processor 510 may invoke logic instructions in the memory 530 to perform the method of human-machine interaction based on multi-modal dialog state representation provided by the present disclosure, the method comprising: acquiring original multi-modal input information; processing the original multi-modal input information to obtain a multi-modal dialog state representation result; determining a multi-modal dialog strategy according to the multi-modal dialog state representation result; and finishing multi-modal information output according to the multi-modal dialog strategy.
Furthermore, the logic instructions in the memory 530 may be implemented in the form of software functional units and stored in a computer readable storage medium when the software functional units are sold or used as independent products. Based on such understanding, the technical solutions of the embodiments of the present disclosure may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the methods described in the embodiments of the present disclosure. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
In another aspect, the present disclosure also provides a computer program product comprising a computer program stored on a non-transitory computer-readable storage medium, the computer program comprising program instructions which, when executed by a computer, enable the computer to perform the method of human-machine interaction based on multi-modal dialog state representation provided above, the method comprising: acquiring original multi-modal input information; processing the original multi-modal input information to obtain a multi-modal dialog state representation result; determining a multi-modal dialog strategy according to the multi-modal dialog state representation result; and finishing multi-modal information output according to the multi-modal dialog strategy.
In yet another aspect, the present disclosure also provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the method of human-machine interaction based on multi-modal dialog state representation provided above, the method comprising: acquiring original multi-modal input information; processing the original multi-modal input information to obtain a multi-modal dialog state representation result; determining a multi-modal dialog strategy according to the multi-modal dialog state representation result; and finishing multi-modal information output according to the multi-modal dialog strategy.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solutions of the present disclosure, not to limit them; although the present disclosure has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present disclosure.

Claims (11)

1. A method for human-computer interaction based on multi-modal dialog state representations, comprising:
acquiring original multi-modal input information;
processing the original multi-modal input information to obtain a multi-modal dialog state representation result;
determining a multi-modal dialog strategy according to the multi-modal dialog state representation result;
and finishing multi-modal information output according to the multi-modal dialog strategy.
2. The method for human-computer interaction based on multi-modal dialog state representation according to claim 1, wherein the processing the original multi-modal input information to obtain a multi-modal dialog state representation result specifically comprises:
performing single-mode analysis on the original multi-mode input information to obtain a single-mode representation result;
obtaining relevant information of a dialog scene according to the original multi-modal input information;
and performing multi-modal understanding and chapter semantic analysis on the single-modal representation result and the relevant information of the dialog scene to obtain a multi-modal dialog state representation result.
3. The method for human-computer interaction based on multi-modal dialog state representation according to claim 2, wherein the performing single-modal analysis on the original multi-modal input information to obtain a single-modal representation result specifically comprises:
performing voice recognition on the original multi-modal input information to obtain a voice recognition result, and performing semantic analysis on the voice recognition result to obtain a semantic analysis result;
performing emotion analysis and behavior gesture analysis on the original multi-modal input information to obtain corresponding emotion analysis results and behavior gesture analysis results;
and forming a single-mode representation result by the semantic analysis result, the emotion analysis result and the behavior gesture analysis result.
4. The method for human-computer interaction based on multi-modal dialog state representation of claim 1 or 2, characterized in that the multi-modal dialog state representation results comprise dialog behaviors, dialog elements and dialog scenarios;
wherein the conversation behavior is used for guiding conversation strategy generation;
the dialog element is used to determine the intention of the interlocutor;
the dialog scenario is used to determine a corresponding media interaction type.
5. The method for human-computer interaction based on multi-modal dialog state representation according to claim 4, wherein the dialog behavior is used to guide dialog strategy generation, and specifically comprises:
acquiring a human-computer interaction scene;
performing dialogue behavior dimension analysis according to the scene to obtain a dialogue behavior dimension analysis result;
and determining the generation of a conversation strategy according to the dimension analysis result of the conversation behavior.
6. The method for human-computer interaction based on multi-modal dialog state representation according to claim 4, wherein the dialog element is used to determine the intention of the interlocutor, specifically comprising:
obtaining the sentence of the interlocutor;
performing multi-factor dialogue element representation on the statement to obtain a multi-factor dialogue element representation result;
determining the intention of the interlocutor based on the multi-factor dialog element representation result.
7. The method for human-computer interaction based on multi-modal dialog state representation according to claim 4, wherein the dialog scenario is used for determining a corresponding media interaction type, and in particular comprises:
carrying out user portrait analysis, media type analysis, style emotion analysis and equipment type analysis on the interlocutor to respectively obtain a user portrait result, a media type result, a style emotion result and an equipment type result of the interlocutor;
and determining a media interaction type interacted with the interlocutor according to the user portrait result, the media type result, the style emotion result and the equipment type result of the interlocutor.
8. The method for man-machine interaction based on multi-modal dialog state representation according to claim 6, wherein the multi-factor dialog element representation of the sentence to obtain a multi-factor dialog element representation result specifically comprises:
factoring the statement from the semantic perspective to obtain four dimensional factors of the action, the object, the condition and the question type of the statement;
determining a multi-factor dialog element representation result according to the four dimensional factors;
wherein the action refers to a predicate part in the sentence and is assumed by a verb or an adjective in the sentence;
the object refers to an influencer of the action or a central word of a nominal phrase sentence;
the condition refers to the state and condition of the action, and the modification and attribute of the object;
the question types refer to different query request categories in the interaction process set according to common knowledge.
9. An apparatus for human-computer interaction based on multi-modal dialog state representation, comprising:
the first processing module is used for acquiring original multi-modal input information;
the second processing module is used for processing the original multi-modal input information and acquiring a multi-modal dialog state representation result;
the third processing module is used for determining a multi-modal dialog strategy according to the multi-modal dialog state representation result;
and the fourth processing module is used for finishing multi-modal information output according to the multi-modal dialog strategy.
10. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the steps of the method for human-machine interaction based on multi-modal dialog state representation according to any of claims 1 to 8 when executing the program.
11. A non-transitory computer readable storage medium having stored thereon a computer program, which when executed by a processor implements the steps of the method for human-computer interaction based on multi-modal dialog state representation according to any of claims 1 to 8.
CN202111064527.7A 2021-09-10 2021-09-10 Method and device for man-machine interaction based on multi-modal dialog state representation Pending CN113792196A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111064527.7A CN113792196A (en) 2021-09-10 2021-09-10 Method and device for man-machine interaction based on multi-modal dialog state representation


Publications (1)

Publication Number Publication Date
CN113792196A true CN113792196A (en) 2021-12-14

Family

ID=78879994

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111064527.7A Pending CN113792196A (en) 2021-09-10 2021-09-10 Method and device for man-machine interaction based on multi-modal dialog state representation

Country Status (1)

Country Link
CN (1) CN113792196A (en)


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114356092A (en) * 2022-01-05 2022-04-15 花脸数字技术(杭州)有限公司 Multi-mode-based man-machine interaction system for digital human information processing
CN114356092B (en) * 2022-01-05 2022-09-09 花脸数字技术(杭州)有限公司 Multi-mode-based man-machine interaction system for digital human information processing
CN115905490A (en) * 2022-11-25 2023-04-04 北京百度网讯科技有限公司 Man-machine interaction dialogue method, device and equipment
CN115905490B (en) * 2022-11-25 2024-03-22 北京百度网讯科技有限公司 Man-machine interaction dialogue method, device and equipment


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination