CN113792196A - Method and device for man-machine interaction based on multi-modal dialog state representation - Google Patents


Info

Publication number
CN113792196A
Authority
CN
China
Prior art keywords
modal
dialog
result
state representation
analysis
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111064527.7A
Other languages
Chinese (zh)
Inventor
赵楠
张孟馨
吴友政
周伯文
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jingdong Shangke Information Technology Co Ltd
Original Assignee
Beijing Jingdong Shangke Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Jingdong Shangke Information Technology Co Ltd filed Critical Beijing Jingdong Shangke Information Technology Co Ltd
Priority to CN202111064527.7A priority Critical patent/CN113792196A/en
Publication of CN113792196A publication Critical patent/CN113792196A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/903 Querying
    • G06F16/9032 Query formulation
    • G06F16/90332 Natural language query formulation or dialogue systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state

Abstract

The present disclosure provides a method and a device for man-machine interaction based on multi-modal dialog state representation. The method comprises the following steps: acquiring original multi-modal input information; processing the original multi-modal input information to obtain a multi-modal dialog state representation result; determining a multi-modal dialog strategy according to the multi-modal dialog state representation result; and finishing multi-modal information output according to the multi-modal dialog strategy. By defining a dialog state representation suited to real-scene dialogs, the method can fully express the dialog interaction in communication, support the realization of a multi-modal dialog system, and achieve accurate dialog representation.

Description

Method and device for man-machine interaction based on multi-modal dialog state representation
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a method and an apparatus for human-computer interaction based on multi-modal dialog state representation.
Background
With the development of technology and the progress of social demands, human-computer interaction is stepping towards a new stage of anthropomorphic interaction. A human-computer interaction system in a real scene needs certain communication skills and strategy-planning capability. In addition, a multi-modal interactive robot cannot interact through text or voice alone; it also needs to present charts or pictures at the right moment during communication to help users understand better. Conversational communication in real scenes exhibits a variety of language phenomena, such as switching between active and passive roles, topic rotation, and long-range dependence on context, so representing the dialog state only with intents and slot values cannot meet the requirements of real scenes. Both intents and slots must be defined in advance, which is difficult to manage; the intent/slot-value definition scheme is not universal, and sharing it across related knowledge fields is very difficult. Moreover, such a scheme lacks a detailed description of conversational behavior in real scenes and gives no consideration to multi-modal dialog states.
Disclosure of Invention
The present disclosure provides a method and device for man-machine interaction based on multi-modal dialog state representation, which overcome the defects of the prior art, namely its lack of universality and its difficulty in conducting dialogs accurately, and thereby achieve accurate dialog and cross-domain universality.
In a first aspect, the present disclosure provides a method for human-computer interaction based on multi-modal dialog state representation, including:
acquiring original multi-modal input information;
processing the original multi-modal input information to obtain a multi-modal dialog state representation result;
determining a multi-modal dialog strategy according to the multi-modal dialog state representation result;
and finishing multi-modal information output according to the multi-modal dialog strategy.
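For orientation, the following is a minimal sketch of how the four steps above could be chained in code. It is illustrative only: the class and function names (MultimodalInput, analyze_input, decide_policy, render_output) and the toy logic inside them are assumptions introduced here, not identifiers or algorithms taken from the present disclosure.

```python
# Minimal sketch of the four-step interaction loop described above.
# All names and the toy logic are illustrative assumptions, not part of the disclosure.

from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class MultimodalInput:
    """Original multi-modal input information (acquisition step)."""
    text: str = ""
    audio: bytes = b""
    images: List[bytes] = field(default_factory=list)


@dataclass
class DialogState:
    """Multi-modal dialog state representation result."""
    dialog_acts: List[str] = field(default_factory=list)           # guides strategy generation
    dialog_elements: Dict[str, str] = field(default_factory=dict)  # action / object / condition / question type
    dialog_scene: Dict[str, str] = field(default_factory=dict)     # persona / media / style / device


def analyze_input(raw: MultimodalInput) -> DialogState:
    """Process the original multi-modal input into a dialog state representation."""
    state = DialogState(dialog_scene={"media": "text", "device": "web page"})
    if "invoice" in raw.text:
        state.dialog_elements = {"action": "make up", "object": "invoice",
                                 "condition": "hotel", "question_type": "how-query"}
        state.dialog_acts = ["Task:request"]
    return state


def decide_policy(state: DialogState) -> Dict[str, Any]:
    """Determine a multi-modal dialog strategy from the state representation."""
    return {"response_act": "Task:answer",
            "media": state.dialog_scene.get("media", "text"),
            "style": "calm"}


def render_output(policy: Dict[str, Any]) -> str:
    """Complete the multi-modal information output according to the strategy."""
    return f"[{policy['media']}/{policy['style']}] Here is how to reissue the hotel invoice ..."


if __name__ == "__main__":
    raw = MultimodalInput(text="how to make up the hotel invoice")  # acquire input
    print(render_output(decide_policy(analyze_input(raw))))
```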
According to the method for man-machine interaction based on multi-modal dialog state representation provided by the present disclosure, the processing the original multi-modal input information to obtain a multi-modal dialog state representation result specifically includes:
performing single-mode analysis on the original multi-mode input information to obtain a single-mode representation result;
obtaining relevant information of a dialog scene according to the original multi-modal input information;
and performing multi-modal understanding and chapter semantic analysis on the single-modal representation result and the relevant information of the dialog scene to obtain a multi-modal dialog state representation result.
According to the method for man-machine interaction based on multi-modal dialog state representation provided by the present disclosure, the performing single-modal analysis on the original multi-modal input information to obtain a single-modal representation result specifically includes:
performing voice recognition on the original multi-modal input information to obtain a voice recognition result, and performing semantic analysis on the voice recognition result to obtain a semantic analysis result;
performing emotion analysis and behavior gesture analysis on the original multi-modal input information to obtain corresponding emotion analysis results and behavior gesture analysis results;
and forming a single-mode representation result by the semantic analysis result, the emotion analysis result and the behavior gesture analysis result.
According to the method for man-machine interaction based on multi-modal dialog state representation, the multi-modal dialog state representation result comprises dialog behaviors, dialog elements and dialog scenes;
wherein the conversation behavior is used for guiding conversation strategy generation;
the dialog element is used to determine the intention of the interlocutor;
the dialog scenario is used to determine a corresponding media interaction type.
According to the method for man-machine interaction based on multi-modal dialog state representation, the dialog behavior is used for guiding dialog strategy generation, and the method specifically comprises the following steps:
acquiring a human-computer interaction scene;
performing dialogue behavior dimension analysis according to the scene to obtain a dialogue behavior dimension analysis result;
and determining the generation of a conversation strategy according to the dimension analysis result of the conversation behavior.
According to the method for man-machine interaction based on multi-modal dialog state representation, the dialog elements are used for determining the intention of an interlocutor, and the method specifically comprises the following steps:
obtaining the sentence of the interlocutor;
performing multi-factor dialogue element representation on the statement to obtain a multi-factor dialogue element representation result;
determining the intention of the interlocutor based on the multi-factor dialog element representation result.
According to the method for man-machine interaction based on multi-modal dialog state representation, the dialog scene is used for determining a corresponding media interaction type, and the method specifically comprises the following steps:
carrying out user portrait analysis, media type analysis, style emotion analysis and equipment type analysis on the interlocutor to respectively obtain a user portrait result, a media type result, a style emotion result and an equipment type result of the interlocutor;
and determining a media interaction type interacted with the interlocutor according to the user portrait result, the media type result, the style emotion result and the equipment type result of the interlocutor.
According to the method for man-machine interaction based on multi-modal dialog state representation provided by the present disclosure, the performing multi-factor dialog element representation on the sentence to obtain a multi-factor dialog element representation result specifically includes:
factoring the statement from the semantic perspective to obtain four dimensional factors of the action, the object, the condition and the question type of the statement;
determining a multi-factor dialog element representation result according to the four dimensional factors;
wherein the action refers to a predicate part in the sentence and is assumed by a verb or an adjective in the sentence;
the object refers to an influencer of the action or a central word of a nominal phrase sentence;
the condition refers to the state and condition of the action, and the modification and attribute of the object;
the question types refer to different query request categories in the interaction process set according to common knowledge.
In a second aspect, the present disclosure provides an apparatus for human-computer interaction based on multi-modal dialog state representation, comprising:
the first processing module is used for acquiring original multi-modal input information;
the second processing module is used for processing the original multi-modal input information and acquiring a multi-modal dialog state representation result;
the third processing module is used for determining a multi-modal dialog strategy according to the multi-modal dialog state representation result;
and the fourth processing module is used for finishing multi-modal information output according to the multi-modal dialog strategy.
The present disclosure also provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the steps of the method for human-computer interaction based on multimodal dialog state representation as described in any of the above when executing the program.
The present disclosure also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the method for human-computer interaction based on multimodal dialog state representation as recited in any of the above.
In the method and device for man-machine interaction based on multi-modal dialog state representation provided by the present disclosure, original multi-modal input information is acquired and processed to obtain a multi-modal dialog state representation result; this representation can characterize the dialog features from multiple dimensions and therefore has a more anthropomorphic effect. A multi-modal dialog strategy is then determined according to the multi-modal dialog state representation result, and multi-modal information output is completed according to that strategy. Because the multi-modal dialog representation is more accurate, the multi-modal output constructed from it is more accurate and can better exhibit diverse, humanized modes of interaction.
Drawings
In order to more clearly illustrate the technical solutions of the present disclosure or the prior art, the drawings needed for the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present disclosure, and other drawings can be obtained by those skilled in the art without creative efforts.
FIG. 1 is a flow diagram of a method of human-machine interaction based on multimodal dialog state representation provided by the present disclosure;
FIG. 2 is a schematic diagram of a multi-modal human-machine interaction system architecture provided by the present disclosure;
FIG. 3 is a schematic diagram of a deep-level multi-modal dialog state representation provided by the present disclosure;
FIG. 4 is a schematic structural diagram of a device for man-machine interaction based on multi-modal dialog state representation provided by the present disclosure;
fig. 5 is a schematic structural diagram of an electronic device provided by the present disclosure.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present disclosure more clear, the technical solutions of the embodiments of the present disclosure will be described clearly and completely with reference to the drawings in the embodiments of the present disclosure, and it is obvious that the described embodiments are some, but not all, of the embodiments of the present disclosure. All other embodiments, which can be obtained by a person skilled in the art without making creative efforts based on the embodiments of the present disclosure, belong to the protection scope of the embodiments of the present disclosure.
A method for man-machine interaction based on multi-modal dialog state representation according to an embodiment of the present disclosure is described below with reference to fig. 1-2, including:
step 100: acquiring original multi-modal input information;
Specifically, with the rapid development of big data, deep learning and computing power, computers have evolved into intelligent systems that can represent and recognize multi-modal information such as voice, vision and text, and fuse knowledge to achieve comprehension and reasoning. To remove the obstacles ordinary users face when completing complex tasks in diversified scenes, human-computer interaction is stepping towards a new stage of intelligent, anthropomorphic user interfaces (IUI). Such systems are typically applied to intelligent customer service facing scenes such as telephone service, online text customer service, face-to-face consultation, sales and after-sales service. In terms of multi-turn conversation research and open human-computer interaction platforms, existing systems perform well on specific tasks in specific fields and specific modalities, but their multi-turn conversation capability under multi-modal, complex-scene, low-resource and cold-start conditions urgently needs improvement. For example, conventional dialog management technology is limited to single-domain task dialogs; it lacks global dialog management that integrates multiple response modules such as task dialog, intelligent question answering, knowledge-graph question answering and chit-chat, as well as the ability to generate emotionally appropriate responses in high-noise, complex scenes.
Because the present disclosure is directed to various situations, various conversation behaviors need to be completed in the course of carrying out tasks such as robot conversational communication, for example self-introduction, commodity recommendation, inviting an evaluation, and emotional soothing; only by adopting such anthropomorphic, stylized strategy replies can a better user experience be provided.
Each source or form of information may be referred to as a modality. For example, humans have touch, hearing, vision and smell; there are information media such as voice, video and text; and there is a wide variety of sensors such as radar, infrared and accelerometers. Each of these may be called a modality. The notion can also be defined very broadly: two different languages may be regarded as two modalities, and even data sets collected under two different situations may be regarded as two modalities. The original multi-modal input information retrieved by the present disclosure can therefore be in text form, image form or video form. After the multi-modal input data are obtained, a digital-human capability interface is called to analyze them.
Step 200: processing the original multi-modal input information to obtain a multi-modal dialog state representation result;
specifically, in the present disclosure, a multi-modal dialog state representation result is obtained by processing original multi-modal information, and processing the input original multi-modal information from multiple dimensions, such as a semantic angle, an action angle, and the like.
Step 300: determining a multi-modal dialog strategy according to the multi-modal dialog state representation result;
Specifically, after the multi-modal dialog state representation result is determined, the machine's response mode in the human-computer interaction can be decided from it, for example whether the selected dialog emotional state is soothing or exciting, calm or happy, and whether the dialog modality is visual or voice, picture, video or text, and so on.
Step 400: and finishing multi-modal information output according to the multi-modal dialog strategy.
Specifically, after determining the multi-modal dialog strategy, the robot performs the multi-modal output process, which is a process of calling multi-modal resource data owned by the robot system and outputting them in different ways. For example, if it is desired to cause the robot to output a facial expression, the facial expression is output by playing a video or displaying an image on a display screen for the robot provided with the display screen. The multimodal resource data involved in the robotic system typically includes audio data, video data, image data, or other multimedia data, as well as program instructions for controlling motors that drive the robot's actions, etc.
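As a rough illustration of this output stage, the sketch below dispatches a chosen strategy to different output channels. The channel names, the resource dictionary and the print statements are assumptions for illustration; a real robot system would call its own playback, display, speech synthesis and motor control interfaces.

```python
# Illustrative dispatch of a multi-modal dialog strategy to output channels.
# Channel names and the resource lookup are assumed for this sketch.

from typing import Dict


def emit_output(policy: Dict[str, str], resources: Dict[str, str]) -> None:
    media = policy.get("media", "text")
    if media == "video":
        print(f"play video: {resources.get('facial_expression_clip', '<missing>')}")
    elif media == "image":
        print(f"show image: {resources.get('chart', '<missing>')}")
    elif media == "speech":
        print(f"synthesize speech: {policy.get('text', '')}")
    else:
        print(f"display text: {policy.get('text', '')}")
    # A physical robot would additionally issue motor-control instructions here.


emit_output({"media": "image", "text": "Please see the chart"}, {"chart": "sales_chart.png"})
```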
The man-machine interaction method based on multi-modal dialog state representation provided by the present disclosure acquires original multi-modal input information and processes it to obtain a multi-modal dialog state representation result; this representation can characterize the dialog features from multiple dimensions and therefore has a more anthropomorphic effect. A multi-modal dialog strategy is then determined according to the multi-modal dialog state representation result, and multi-modal information output is completed according to that strategy. Because the multi-modal dialog representation is more accurate, the multi-modal output constructed from it is more accurate and can better exhibit diverse, humanized modes of interaction.
According to the method for man-machine interaction based on multi-modal dialog state representation provided by the embodiment of the present disclosure, the processing the original multi-modal input information to obtain a multi-modal dialog state representation result specifically includes:
performing single-mode analysis on the original multi-mode input information to obtain a single-mode representation result;
obtaining relevant information of a dialog scene according to the original multi-modal input information;
and performing multi-modal understanding and chapter semantic analysis on the single-modal representation result and the relevant information of the dialog scene to obtain a multi-modal dialog state representation result.
Specifically, multi-modal semantic features are extracted on the basis of a multi-modal data feature representation model: data feature extraction models for text, images and audio/video are built on pre-trained models, and these models are used to complete, respectively, single-modal semantic feature extraction (text semantic features, image features and video features) and the extraction of textual description information for image and video data. In addition, the single-modal representation result is obtained through processes such as behavior/gesture analysis and emotion analysis.
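A hedged sketch of this per-modality feature extraction is given below. The encoder class is a placeholder standing in for whatever pre-trained text, image or audio/video model is actually used; the dummy vector it returns is not meaningful.

```python
# Sketch of single-modal semantic feature extraction with pre-trained encoders.
# PretrainedEncoder is a placeholder for any pre-trained model with an encode step.

from typing import Dict, List


class PretrainedEncoder:
    """Stand-in for a pre-trained model mapping raw data to a feature vector."""

    def __init__(self, name: str):
        self.name = name

    def encode(self, data: object) -> List[float]:
        return [0.0, 0.0, 0.0]  # dummy vector; a real model returns learned features


ENCODERS: Dict[str, PretrainedEncoder] = {
    "text": PretrainedEncoder("text encoder"),
    "image": PretrainedEncoder("image encoder"),
    "audio_video": PretrainedEncoder("audio/video encoder"),
}


def extract_single_modal_features(raw: Dict[str, object]) -> Dict[str, List[float]]:
    """Encode each available modality independently."""
    return {modality: ENCODERS[modality].encode(data)
            for modality, data in raw.items() if modality in ENCODERS}


print(extract_single_modal_features({"text": "how to make up the hotel invoice"}))
```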
In addition, the original multi-modal input information is analyzed for dialog scene information. The scene information represents the environment of the scene in which the dialog occurs, is closely related to anthropomorphic audiovisual perception, and can describe the type of response from four angles: user portrait (Persona), media type (Media), style emotion (Style) and device type (Device).
Multi-modal understanding aims to achieve the ability to process and understand information from multiple source modalities by means of machine learning. The currently popular research direction is multi-modal learning across images, video, audio and semantics.
Multi-modal deep semantic understanding realizes semantic understanding of text and visual images simultaneously. For example, traditional AI recognition applied to a picture of a puppy under the shade of a small tree merely classifies two objects, one being the puppy and the other a tree. Visual semantic understanding yields "a puppy is enjoying the cool under the shade of a small tree", while a deeper understanding of the meaning behind the text is "a puppy is enjoying the cool under the shade of a small tree because it is a hot summer day outside". The latter is multi-modal deep semantic understanding.
And then, performing multi-modal understanding and chapter semantic analysis on the single-modal representation result and the dialogue scene related information to obtain a multi-modal dialogue state representation result.
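The fusion step could be sketched as follows; the plain dictionary merge and the three-turn context window are simplifying assumptions standing in for the multi-modal understanding and discourse-level (chapter) semantic analysis described here.

```python
# Hedged sketch of fusing single-modal results and scene information into a
# multi-modal dialog state representation result. The merge logic is assumed.

from typing import Any, Dict, List


def fuse_to_dialog_state(single_modal: Dict[str, Any],
                         scene_info: Dict[str, str],
                         dialog_history: List[str]) -> Dict[str, Any]:
    return {
        "dialog_acts": single_modal.get("dialog_acts", []),
        "dialog_elements": single_modal.get("semantics", {}),
        "dialog_scene": scene_info,
        # Discourse-level analysis would resolve references against the history;
        # here we simply attach the last few turns as context.
        "context": dialog_history[-3:],
    }


state = fuse_to_dialog_state({"semantics": {"action": "make up", "object": "invoice"}},
                             {"media": "text", "device": "web page"},
                             ["hello", "I booked a hotel", "how to make up the invoice"])
print(state["context"])
```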
According to the method for man-machine interaction based on multi-modal dialog state representation provided by the embodiment of the present disclosure, the performing single-modal analysis on the original multi-modal input information to obtain a single-modal representation result specifically includes:
performing voice recognition on the original multi-modal input information to obtain a voice recognition result, and performing semantic analysis on the voice recognition result to obtain a semantic analysis result;
performing emotion analysis and behavior gesture analysis on the original multi-modal input information to obtain corresponding emotion analysis results and behavior gesture analysis results;
and forming a single-mode representation result by the semantic analysis result, the emotion analysis result and the behavior gesture analysis result.
Specifically, the original multi-modal input information is correspondingly processed by calling a voice recognition interface, an emotion analysis interface and a gesture behavior analysis interface, so that a corresponding single-modal representation result is obtained.
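A sketch of this stage is shown below. The four interface functions are placeholders for the speech recognition, semantic analysis, emotion analysis and gesture/behavior analysis services being called; their hard-coded return values exist only to make the example runnable.

```python
# Placeholder single-modal analysis interfaces; return values are hard-coded
# for illustration and stand in for real recognition and analysis services.

from typing import Dict, List


def speech_recognition(audio: bytes) -> str:
    return "how to make up the hotel invoice"


def semantic_analysis(text: str) -> Dict[str, str]:
    return {"action": "make up", "object": "invoice", "condition": "hotel"}


def emotion_analysis(audio: bytes, images: List[bytes]) -> str:
    return "urgent"


def gesture_analysis(images: List[bytes]) -> str:
    return "pointing at the screen"


def single_modal_representation(audio: bytes, images: List[bytes]) -> Dict[str, object]:
    """Combine the semantic, emotion and gesture results into one representation."""
    text = speech_recognition(audio)
    return {
        "semantics": semantic_analysis(text),
        "emotion": emotion_analysis(audio, images),
        "gesture": gesture_analysis(images),
    }


print(single_modal_representation(b"", []))
```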
According to the method for man-machine interaction based on the multi-modal dialog state representation, the multi-modal dialog state representation result comprises a dialog behavior, a dialog element and a dialog scene;
wherein the conversation behavior is used for guiding conversation strategy generation;
the dialog element is used to determine the intention of the interlocutor;
the dialog scenario is used to determine a corresponding media interaction type.
Specifically, referring to fig. 3, the present disclosure comprehensively describes the information required for dialog decision and dialog generation in human-computer interaction from three aspects: dialog behavior (multi-dimensional dialog behavior analysis), dialog elements (multi-factor dialog elements) and dialog scene (multi-type dialog scene description). It thereby depicts more finely the dialog behavior, dialog semantic elements and dialog scene information that support anthropomorphic, multi-modal interaction in communication, meeting the requirements of human-computer interaction in complex scenes. The method is not bound to domain knowledge, is universal, and can be applied to man-machine dialog systems in various fields such as e-commerce, tourism and medical care.
According to the method for man-machine interaction based on multi-modal dialog state representation, the dialog behavior is used for guiding dialog strategy generation, and the method specifically comprises the following steps:
acquiring a human-computer interaction scene;
performing dialogue behavior dimension analysis according to the scene to obtain a dialogue behavior dimension analysis result;
and determining the generation of a conversation strategy according to the dimension analysis result of the conversation behavior.
In particular, the dialog act (DA), which marks speaker intentions (such as statements, questions, commitments, directives, etc.) independently of any specific dialog system, plays a crucial role in spoken language understanding systems and therefore has a certain generality. Dialog behavior, also known as speech acts or social acts, is an attempt to formalize and generalize intent. A conversation is conducted around the interaction between two roles: the speaker (sender), the party currently speaking, i.e. the one generating the current dialog act, and the recipient (addressee), the other participant of the conversation and the interaction object of the current speaker.
The present disclosure combines the characteristics of human-computer interaction scenes such as human-operated telephone service and online customer service, for example the commodity recommendation scene handled by pre-sales customer service staff and special scenes such as the emotional conciliation scene handled by after-sales customer service staff for customers. Drawing on general dialogue-act schemes proposed abroad, it provides a dialogue-act classification scheme that can fully represent the characteristics of spoken human-computer communication, and defines five dialogue-act analysis dimensions (the Task dimension, Time Management dimension, Feedback dimension, Own and Partner Communication Management dimension, and Social Obligations Management dimension), which together represent the complex spoken dialogue-act state in real scenes. The definition of each dimension is shown in Table 1 below:
TABLE 1
(The content of Table 1, defining the five dialogue-act dimensions listed above, is provided as images in the original publication: Figure BDA0003257719240000101 and Figure BDA0003257719240000111.)
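The five dimensions named above can be carried as labels on each utterance; the enumeration below simply transcribes their names, while the example tags attached to the sample utterance are assumptions, since the per-dimension values are defined in Table 1 of the original publication.

```python
# The five dialogue-act analysis dimensions as an enumeration; the example
# per-dimension tags are illustrative assumptions.

from enum import Enum


class DialogActDimension(Enum):
    TASK = "Task"
    TIME_MANAGEMENT = "Time Management"
    FEEDBACK = "Feedback"
    OWN_AND_PARTNER_COMMUNICATION_MANAGEMENT = "Own and Partner Communication Management"
    SOCIAL_OBLIGATIONS_MANAGEMENT = "Social Obligations Management"


# A single utterance can carry acts on several dimensions at once, e.g. a
# greeting that also opens a task request:
utterance_acts = {
    DialogActDimension.SOCIAL_OBLIGATIONS_MANAGEMENT: "greeting",
    DialogActDimension.TASK: "request",
}
print({dim.value: act for dim, act in utterance_acts.items()})
```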
According to the method for man-machine interaction based on multi-modal dialog state representation, provided by the embodiment of the disclosure, wherein the dialog element is used for determining the intention of the interlocutor, the method specifically comprises the following steps:
obtaining the sentence of the interlocutor;
performing multi-factor dialogue element representation on the statement to obtain a multi-factor dialogue element representation result;
determining the intention of the interlocutor based on the multi-factor dialog element representation result.
In particular, the present disclosure proposes a simple and novel semantic representation framework, called the multi-factor dialog element representation framework, to replace the classical semantic representation that combines intents and slot values. Under this framework, different intentions are distinguished by four key factors: action, object, condition and question type. These four key concepts are used to distinguish intentions rather than to fully express all the complex grammatical and semantic content of a sentence. The multi-factor semantic framework is motivated by the fact that the number of possible sentences is infinite, so fully representing every sentence is not feasible, while overly general semantic representations struggle to meet the needs of real scenes. In a particular domain or scene, however, the possible semantic space is limited, so all intents can be distinguished with a limited set of key concepts, without representing each intent directly and completely, and this granularity of representation still meets the needs of the scene.
The framework mainly addresses the fine-grained description of user intention: different semantic intentions are distinguished by differences in their factors, and related semantic knowledge points are associated through shared factors, relating what intentions have in common and where they differ. Four dimensional factors are set in the framework: action, object, condition (modifier/attribute/state/condition) and question type. The condition dimension is a mixed type comprising several finer-grained factors such as modifier, attribute, state and condition; since these factors usually do not co-occur in the same question, they are merged into one dimension, referred to below simply as condition. Multi-dimensional factor semantic analysis of a question yields several factors, which are then concatenated into a factor expression that conveys the semantics. For example, the factor expression of the question "how to make up the hotel invoice" is "make up (action) + invoice (object) + hotel (condition) + how-query (question type)", where the parentheses give the factor types; since the four dimensional factors are generally arranged in a fixed order, the factor types can be omitted, in which case the factor expression is "make up + invoice + hotel + how-query".
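Using the worked example just given, a minimal data structure for the four factors might look like the sketch below; the class name and method are assumptions, while the factor values reproduce the "hotel invoice" example from the text.

```python
# Multi-factor dialog element representation for the worked example above.
# The DialogElements class and factor_expression method are illustrative.

from dataclasses import dataclass


@dataclass
class DialogElements:
    action: str          # predicate part, usually carried by a verb or adjective
    object: str          # affected object, or head word of a nominal phrase
    condition: str       # modifier / attribute / state / condition
    question_type: str   # query request category, e.g. how-query

    def factor_expression(self) -> str:
        # Fixed factor order, so the factor types themselves can be omitted.
        return " + ".join([self.action, self.object, self.condition, self.question_type])


elements = DialogElements(action="make up", object="invoice",
                          condition="hotel", question_type="how-query")
print(elements.factor_expression())   # make up + invoice + hotel + how-query
```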
According to the man-machine interaction method based on the multi-modal dialog state representation, provided by the embodiment of the disclosure, the dialog scene is used for determining a corresponding media interaction type, and the method specifically comprises the following steps:
carrying out user portrait analysis, media type analysis, style emotion analysis and equipment type analysis on the interlocutor to respectively obtain a user portrait result, a media type result, a style emotion result and an equipment type result of the interlocutor;
and determining a media interaction type interacted with the interlocutor according to the user portrait result, the media type result, the style emotion result and the equipment type result of the interlocutor.
Specifically, the scene information represents the environment of the scene in which the dialog occurs; it is closely related to anthropomorphic audiovisual perception and describes the type of response from four angles: user portrait (Persona), media type (Media), style emotion (Style) and device type (Device). The user portrait describes profile information about the interlocutor (e.g. age, occupation, hobbies). The media type indicates the preferred presentation media, i.e. in what form the input and output complete the interaction (e.g. text, speech, diagram, picture). The style emotion expresses the emotional attitude carried by the current utterance (e.g. anger, urgency). The device type specifies which devices will be used in the presentation; the device ultimately supports the interaction in terms of physical hardware (e.g. web page, phone or PDA).
In a multi-modal dialog system, a dialog decision (Policy) unit takes the dialog scene information into full consideration and selects the most appropriate media interaction type in the decision process of a dialog strategy. When the dialog generation (NLG) unit generates the response text, the response reply with rich stylized characteristics can be generated by utilizing the portrait and style information of the user, so that better user experience can be provided, more user participation is brought, and the completion rate of the dialog task is improved.
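The sketch below shows one way a dialog decision (Policy) unit might weigh the four scene components when picking a media interaction type. The DialogScene fields follow the Persona/Media/Style/Device description above, but the selection rules themselves are assumptions, not the strategy claimed in this disclosure.

```python
# Illustrative scene-aware selection of a media interaction type.
# The selection rules are assumed for this sketch.

from dataclasses import dataclass


@dataclass
class DialogScene:
    persona: dict   # e.g. {"age": 30, "occupation": "engineer"}
    media: str      # preferred presentation media, e.g. "diagram"
    style: str      # emotional attitude of the current utterance, e.g. "urgency"
    device: str     # e.g. "web page", "phone", "PDA"


def choose_media_type(scene: DialogScene) -> str:
    if scene.device == "phone" and scene.media == "diagram":
        return "spoken summary plus a follow-up picture"  # small screen: keep visuals light
    if scene.style in {"anger", "urgency"}:
        return "short text"                               # do not slow the user down
    return scene.media


scene = DialogScene(persona={"age": 30}, media="diagram", style="calm", device="web page")
print(choose_media_type(scene))   # diagram
```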
According to the method for man-machine interaction based on multi-modal dialog state representation provided by the embodiment of the present disclosure, the performing multi-factor dialog element representation on the statement to obtain a multi-factor dialog element representation result specifically includes:
factoring the statement from the semantic perspective to obtain four dimensional factors of the action, the object, the condition and the question type of the statement;
determining a multi-factor dialog element representation result according to the four dimensional factors;
wherein the action refers to a predicate part in the sentence and is assumed by a verb or an adjective in the sentence;
the object refers to an influencer of the action or a central word of a nominal phrase sentence;
the condition refers to the state and condition of the action, and the modification and attribute of the object;
the question types refer to different query request categories in the interaction process set according to common knowledge.
Specifically, splitting into factors does not depend on specific domain knowledge: the subject, predicate and object components of the sentence are obtained through general Chinese syntactic analysis, and the factors can be decomposed by understanding the central idea of the sentence from the semantic perspective. When splitting, the following rules can be followed:
actions, i.e., predicate portions in sentences, are usually undertaken by verbs or adjectives in the sentence.
An object, i.e. an influencer of an action, or a core word of a nominal phrase sentence.
The conditions, that is, the states and conditions of the actions and the modifications and attributes of the objects are usually not simultaneously presented in one sentence, and are expressed in one dimension.
Question types, i.e. the different query request categories in the interaction process, set according to common knowledge: yesno-query, a yes/no question; choice-query, a choice question; where-query, a position question; when-query, a time question; why-query, a reason question; whynot-query, a negated-reason question; what-query, an entity question; who-query, a person question; how-query, an action/status question; howoften-query, a frequency question; howmany-query, a quantity question; statement-positive, a positive statement; statement-negative, a negative statement.
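For reference, the question-type taxonomy listed above can be written out as an enumeration; the sketch below only transcribes the categories (with the spellings corrected as in the list) and assumes nothing beyond them.

```python
# Question-type taxonomy from the list above, transcribed as an enumeration.

from enum import Enum


class QuestionType(Enum):
    YESNO_QUERY = "yes/no question"
    CHOICE_QUERY = "choice question"
    WHERE_QUERY = "position question"
    WHEN_QUERY = "time question"
    WHY_QUERY = "reason question"
    WHYNOT_QUERY = "negated-reason question"
    WHAT_QUERY = "entity question"
    WHO_QUERY = "person question"
    HOW_QUERY = "action/status question"
    HOWOFTEN_QUERY = "frequency question"
    HOWMANY_QUERY = "quantity question"
    STATEMENT_POSITIVE = "positive statement"
    STATEMENT_NEGATIVE = "negative statement"


print(QuestionType.HOW_QUERY.value)   # action/status question
```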
As shown in fig. 4, the present disclosure provides an apparatus for human-computer interaction based on multi-modal dialog state representation, including:
a first processing module 41, configured to obtain original multi-modal input information;
the second processing module 42 is configured to process the original multi-modal input information to obtain a multi-modal dialog state representation result;
a third processing module 43, configured to determine a multi-modal dialog strategy according to the multi-modal dialog state representation result;
and the fourth processing module 44 is configured to complete multi-modal information output according to the multi-modal dialog policy.
Since the apparatus provided by the embodiment of the present invention can be used for executing the method described in the above embodiment, and the operation principle and the beneficial effect are similar, detailed descriptions are omitted here, and specific contents can be referred to the description of the above embodiment.
The device for man-machine interaction based on multi-modal dialog state representation provided by the present disclosure acquires original multi-modal input information and processes it to obtain a multi-modal dialog state representation result; this representation can characterize the dialog features from multiple dimensions and therefore has a more anthropomorphic effect. A multi-modal dialog strategy is then determined according to the multi-modal dialog state representation result, and multi-modal information output is completed according to that strategy. Because the multi-modal dialog representation is more accurate, the multi-modal output constructed from it is more accurate and can better exhibit diverse, humanized modes of interaction.
According to the device for man-machine interaction based on multi-modal dialog state representation provided by the embodiment of the present disclosure, the second processing module 42 is specifically configured to:
performing single-mode analysis on the original multi-mode input information to obtain a single-mode representation result;
obtaining relevant information of a dialog scene according to the original multi-modal input information;
and performing multi-modal understanding and chapter semantic analysis on the single-modal representation result and the relevant information of the dialog scene to obtain a multi-modal dialog state representation result.
According to the device for man-machine interaction based on multi-modal dialog state representation provided by the embodiment of the present disclosure, the second processing module 42 is further specifically configured to:
performing voice recognition on the original multi-modal input information to obtain a voice recognition result, and performing semantic analysis on the voice recognition result to obtain a semantic analysis result;
performing emotion analysis and behavior gesture analysis on the original multi-modal input information to obtain corresponding emotion analysis results and behavior gesture analysis results;
and forming a single-mode representation result by the semantic analysis result, the emotion analysis result and the behavior gesture analysis result.
According to the device for man-machine interaction based on multi-modal dialog state representation, provided by the embodiment of the disclosure, in the second processing module 42, the multi-modal dialog state representation result comprises a dialog behavior, a dialog element and a dialog scene;
wherein the conversation behavior is used for guiding conversation strategy generation;
the dialog element is used to determine the intention of the interlocutor;
the dialog scenario is used to determine a corresponding media interaction type.
According to the device for man-machine interaction based on multi-modal dialog state representation provided by the embodiment of the present disclosure, the second processing module 42 is further specifically configured to:
acquiring a human-computer interaction scene;
performing dialogue behavior dimension analysis according to the scene to obtain a dialogue behavior dimension analysis result;
and determining the generation of a conversation strategy according to the dimension analysis result of the conversation behavior.
According to the device for man-machine interaction based on multi-modal dialog state representation provided by the embodiment of the present disclosure, the second processing module 42 is further specifically configured to:
obtaining the sentence of the interlocutor;
performing multi-factor dialogue element representation on the statement to obtain a multi-factor dialogue element representation result;
determining the intention of the interlocutor based on the multi-factor dialog element representation result.
According to the device for man-machine interaction based on multi-modal dialog state representation provided by the embodiment of the present disclosure, the second processing module 42 is further specifically configured to:
carrying out user portrait analysis, media type analysis, style emotion analysis and equipment type analysis on the interlocutor to respectively obtain a user portrait result, a media type result, a style emotion result and an equipment type result of the interlocutor;
and determining a media interaction type interacted with the interlocutor according to the user portrait result, the media type result, the style emotion result and the equipment type result of the interlocutor.
According to the device for man-machine interaction based on multi-modal dialog state representation provided by the embodiment of the present disclosure, the second processing module 42 is further specifically configured to:
factoring the statement from the semantic perspective to obtain four dimensional factors of the action, the object, the condition and the question type of the statement;
determining a multi-factor dialog element representation result according to the four dimensional factors;
wherein the action refers to a predicate part in the sentence and is assumed by a verb or an adjective in the sentence;
the object refers to an influencer of the action or a central word of a nominal phrase sentence;
the condition refers to the state and condition of the action, and the modification and attribute of the object;
the question types refer to different query request categories in the interaction process set according to common knowledge.
Fig. 5 illustrates a physical structure diagram of an electronic device, which, as shown in fig. 5, may include: a processor (processor) 510, a communication interface (Communications Interface) 520, a memory (memory) 530 and a communication bus 540, wherein the processor 510, the communication interface 520 and the memory 530 communicate with each other via the communication bus 540. The processor 510 may invoke logic instructions in the memory 530 to perform the method of human-machine interaction based on multi-modal dialog state representation provided by the present disclosure, the method comprising: acquiring original multi-modal input information; processing the original multi-modal input information to obtain a multi-modal dialog state representation result; determining a multi-modal dialog strategy according to the multi-modal dialog state representation result; and finishing multi-modal information output according to the multi-modal dialog strategy.
Furthermore, the logic instructions in the memory 530 may be implemented in the form of software functional units and stored in a computer readable storage medium when the software functional units are sold or used as independent products. Based on such understanding, the technical solutions of the embodiments of the present disclosure may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the methods described in the embodiments of the present disclosure. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
In another aspect, the present disclosure also provides a computer program product comprising a computer program stored on a non-transitory computer-readable storage medium, the computer program comprising program instructions which, when executed by a computer, enable the computer to perform the method of human-machine interaction based on multi-modal dialog state representation provided above, the method comprising: acquiring original multi-modal input information; processing the original multi-modal input information to obtain a multi-modal dialog state representation result; determining a multi-modal dialog strategy according to the multi-modal dialog state representation result; and finishing multi-modal information output according to the multi-modal dialog strategy.
In yet another aspect, the present disclosure also provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the method of human-machine interaction based on multi-modal dialog state representation provided above, the method comprising: acquiring original multi-modal input information; processing the original multi-modal input information to obtain a multi-modal dialog state representation result; determining a multi-modal dialog strategy according to the multi-modal dialog state representation result; and finishing multi-modal information output according to the multi-modal dialog strategy.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solutions of the present disclosure, not to limit them; although the present disclosure has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present disclosure.

Claims (11)

1. A method for human-computer interaction based on multi-modal dialog state representations, comprising:
acquiring original multi-modal input information;
processing the original multi-modal input information to obtain a multi-modal dialog state representation result;
determining a multi-modal dialog strategy according to the multi-modal dialog state representation result;
and finishing multi-modal information output according to the multi-modal dialog strategy.
2. The method for human-computer interaction based on multi-modal dialog state representation according to claim 1, wherein the processing the original multi-modal input information to obtain a multi-modal dialog state representation result specifically comprises:
performing single-mode analysis on the original multi-mode input information to obtain a single-mode representation result;
obtaining relevant information of a dialog scene according to the original multi-modal input information;
and performing multi-modal understanding and chapter semantic analysis on the single-modal representation result and the relevant information of the dialog scene to obtain a multi-modal dialog state representation result.
3. The method for human-computer interaction based on multi-modal dialog state representation according to claim 2, wherein the performing single-modal analysis on the original multi-modal input information to obtain a single-modal representation result specifically comprises:
performing voice recognition on the original multi-modal input information to obtain a voice recognition result, and performing semantic analysis on the voice recognition result to obtain a semantic analysis result;
performing emotion analysis and behavior gesture analysis on the original multi-modal input information to obtain corresponding emotion analysis results and behavior gesture analysis results;
and forming a single-mode representation result by the semantic analysis result, the emotion analysis result and the behavior gesture analysis result.
4. The method for human-computer interaction based on multi-modal dialog state representation of claim 1 or 2, characterized in that the multi-modal dialog state representation results comprise dialog behaviors, dialog elements and dialog scenarios;
wherein the conversation behavior is used for guiding conversation strategy generation;
the dialog element is used to determine the intention of the interlocutor;
the dialog scenario is used to determine a corresponding media interaction type.
5. The method for human-computer interaction based on multi-modal dialog state representation according to claim 4, wherein the dialog behavior is used to guide dialog strategy generation, and specifically comprises:
acquiring a human-computer interaction scene;
performing dialogue behavior dimension analysis according to the scene to obtain a dialogue behavior dimension analysis result;
and determining the generation of a conversation strategy according to the dimension analysis result of the conversation behavior.
6. The method for human-computer interaction based on multi-modal dialog state representation according to claim 4, wherein the dialog element is used to determine the intention of the interlocutor, specifically comprising:
obtaining the sentence of the interlocutor;
performing multi-factor dialogue element representation on the statement to obtain a multi-factor dialogue element representation result;
determining the intention of the interlocutor based on the multi-factor dialog element representation result.
7. The method for human-computer interaction based on multi-modal dialog state representation according to claim 4, wherein the dialog scenario is used for determining a corresponding media interaction type, and in particular comprises:
carrying out user portrait analysis, media type analysis, style emotion analysis and equipment type analysis on the interlocutor to respectively obtain a user portrait result, a media type result, a style emotion result and an equipment type result of the interlocutor;
and determining a media interaction type interacted with the interlocutor according to the user portrait result, the media type result, the style emotion result and the equipment type result of the interlocutor.
8. The method for man-machine interaction based on multi-modal dialog state representation according to claim 6, wherein the multi-factor dialog element representation of the sentence to obtain a multi-factor dialog element representation result specifically comprises:
factoring the statement from the semantic perspective to obtain four dimensional factors of the action, the object, the condition and the question type of the statement;
determining a multi-factor dialog element representation result according to the four dimensional factors;
wherein the action refers to a predicate part in the sentence and is assumed by a verb or an adjective in the sentence;
the object refers to an influencer of the action or a central word of a nominal phrase sentence;
the condition refers to the state and condition of the action, and the modification and attribute of the object;
the question types refer to different query request categories in the interaction process set according to common knowledge.
9. An apparatus for human-computer interaction based on multi-modal dialog state representation, comprising:
the first processing module is used for acquiring original multi-modal input information;
the second processing module is used for processing the original multi-modal input information and acquiring a multi-modal dialog state representation result;
the third processing module is used for determining a multi-modal dialog strategy according to the multi-modal dialog state representation result;
and the fourth processing module is used for finishing multi-modal information output according to the multi-modal dialog strategy.
10. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the steps of the method for human-machine interaction based on multi-modal dialog state representation according to any of claims 1 to 8 when executing the program.
11. A non-transitory computer readable storage medium having stored thereon a computer program, which when executed by a processor implements the steps of the method for human-computer interaction based on multi-modal dialog state representation according to any of claims 1 to 8.
CN202111064527.7A 2021-09-10 2021-09-10 Method and device for man-machine interaction based on multi-modal dialog state representation Pending CN113792196A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111064527.7A CN113792196A (en) 2021-09-10 2021-09-10 Method and device for man-machine interaction based on multi-modal dialog state representation


Publications (1)

Publication Number Publication Date
CN113792196A true CN113792196A (en) 2021-12-14

Family

ID=78879994

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111064527.7A Pending CN113792196A (en) 2021-09-10 2021-09-10 Method and device for man-machine interaction based on multi-modal dialog state representation

Country Status (1)

Country Link
CN (1) CN113792196A (en)


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114356092A (en) * 2022-01-05 2022-04-15 花脸数字技术(杭州)有限公司 Multi-mode-based man-machine interaction system for digital human information processing
CN114356092B (en) * 2022-01-05 2022-09-09 花脸数字技术(杭州)有限公司 Multi-mode-based man-machine interaction system for digital human information processing
CN115905490A (en) * 2022-11-25 2023-04-04 北京百度网讯科技有限公司 Man-machine interaction dialogue method, device and equipment
CN115905490B (en) * 2022-11-25 2024-03-22 北京百度网讯科技有限公司 Man-machine interaction dialogue method, device and equipment


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination