CN115129839A - Visual dialogue answer generation method and device based on graph perception

Visual dialogue answer generation method and device based on graph perception

Info

Publication number
CN115129839A
CN115129839A (application CN202210685096.4A)
Authority
CN
China
Prior art keywords
graph
features
visual
information
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210685096.4A
Other languages
Chinese (zh)
Inventor
刘安安
徐宁
张国楷
郭俊波
靳国庆
张勇东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University
Konami Sports Club Co Ltd
Original Assignee
Tianjin University
People Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University, People Co Ltd filed Critical Tianjin University
Priority to CN202210685096.4A priority Critical patent/CN115129839A/en
Publication of CN115129839A publication Critical patent/CN115129839A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/332Query formulation
    • G06F16/3329Natural language query formulation or dialogue systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367Ontology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/901Indexing; Data structures therefor; Storage structures
    • G06F16/9024Graphs; Linked lists
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods


Abstract

The invention discloses a visual dialogue answer generation method and device based on graph perception. The method comprises the following steps: constructing a query library for each modality according to its properties, using a self-attention mechanism to assign weights to the feature vectors and sum them according to how sensitive the query elements are to the information within the modality, and retaining the modal features that benefit the reasoning process in the actual scene; performing entity recognition and relation detection on the dialogue history, constructing a basic graph structure with entities as nodes and relations as directed edges, and encoding the entities and relations with GloVe word vectors; enhancing the graph semantics multiple times through iterative updating, so that the in-graph information is fed back to the dialogue history and the image content repeatedly and the information transfer process forms a closed loop that fully mines the interaction between the modalities; and integrating the graph features obtained by repeated iterative fusion with the visual and text features and sending them into a decoder, so as to realize collaborative representation of the multi-modal information and generate an answer to the question posed in the current scene. The device comprises: a processor and a memory.

Description

Visual dialogue answer generation method and device based on graph perception
Technical Field
The invention relates to the field of visual dialogue generation, and in particular to a visual dialogue answer generation method and device based on graph perception.
Background
With the continued development of artificial intelligence, visual dialogue has received unprecedented attention at the intersection of computer vision and natural language processing. In this task, given a picture, a picture description and a set of dialogues (i.e., multiple question-answer pairs), the agent must fully understand the latent cross-modal semantic associations implied by the current question and infer an accurate answer. Compared with other vision-and-language tasks, such as image description generation [1], scene graph generation [2], visual commonsense reasoning [3] and visual question answering (VQA) [4], and in order to further explore the deep semantic dependencies between vision and language under actual requirements and application scenarios, visual dialogue requires not only fine-grained cross-modal understanding between text and images, but also global semantic dependencies among the current question, the dialogue history and the visual information. The technology aims to replace human perception and thinking by having an agent give accurate feedback to the current question instead of a human answering it, and it can be widely applied to human-computer interaction and to helping visually impaired users perceive their surroundings. To build a research platform for visual dialogue generation, the VisDial v0.9 and VisDial v1.0 datasets were proposed [5] to verify the application capability of models in actual scenes.
Existing methods [6-11] have demonstrated innovative approaches and excellent performance in visual dialogue generation. Researchers have mainly focused on using attention mechanisms to let textual information guide the extraction of visual information, then embedding and fusing the multi-modal features, and finally sending them to a decoder to derive answer clues; existing frameworks such as DAN [6] and RAA-Net [7] all achieve good performance. However, this reasoning process is one-way, leading to insufficient cross-modal interaction and limiting the accuracy and richness of the generated answers. GNN [8] and FGA [9] introduce graph structures into the framework to alleviate this shortcoming: multi-level semantics are abstracted from text and vision to construct a graph, interaction among graph nodes containing multi-modal information is realized through intra-graph message passing, and the graph is embedded to obtain graph features for answer generation.
However, the above existing models attach too much importance to the role of the high-order information in the graph during inference, neglect the reasoning capability of the original natural language and visual content, and thereby weaken their role in the reasoning process. This shows that it is necessary to introduce a dynamic structure to optimize the model, so that a close interaction relationship is established between the graph modality and the visual and textual content, and the reasoning over vision and text is strengthened through a loop outside the graph. To date, no strategy has used the graph structure as a medium to enrich the semantics of dialogue turns and visual regions.
In summary, although the field of visual dialogue generation has made a series of advances [8,9], a graph-aware multi-modal semantic interaction framework has not yet been designed, and the effect of close interaction between the graph modality and the visual and textual content on reasoning has been ignored. At present, mainstream methods still perform feature extraction and fusion on the original information; such one-way, coarse-grained operations cannot sufficiently explore the semantic dependencies between text and vision, which harms the quality of the answers generated for the current scene.
Based on the current state of research, the main challenges lie in the following three aspects:
1. how to abstract a graph structure from the visual and textual multi-modal information and then iteratively enhance the graph semantics;
2. how to feed the high-order information in the graph back to the dialogue history and image regions and optimize the self-attention weighting process;
3. how to jointly embed the graph-modality features with the text, visual and other features and perform collaborative reasoning on the current question.
Disclosure of Invention
The invention provides a visual dialogue answer generation method and device based on graph perception. Query libraries for storing query vectors are built separately according to the characteristics of the visual and textual modalities, and the query vectors perceive the local features within each modality through self-attention to obtain high-order semantic vectors. In the graph construction and iterative updating stage, the entities and relations in the dialogue history are identified to establish a basic directed graph structure, and the directed graph undergoes feature embedding and multi-stage interaction with the text and visual features, enriching the semantic information of vision, text and graph at the same time. In the multi-modal collaborative reasoning stage, node-level feature fusion is performed on the graphs of each stage, which are then embedded into a higher-level semantic space by a multi-layer perceptron to form high-level graph features; these are semantically perceived and fused with the multi-modal features selected by the attention module to generate a vector with strong reasoning capability, which is used to generate an answer to the current question in the specific scene. The details are described in the following:
in a first aspect, a method for generating visual dialog answers based on graph perception includes the steps of:
respectively constructing query libraries according to the properties of each modality, using a self-attention mechanism to assign weights to the feature vectors and sum them according to how sensitive the query elements are to the information within the modality, and retaining the modal features that benefit the reasoning process in the actual scene;
performing entity recognition and relation detection on the dialogue history, constructing a basic graph structure with entities as nodes and relations as directed edges, and encoding the entities and relations with GloVe word vectors;
enhancing the graph semantics multiple times through iterative updating, so that the in-graph information is fed back to the dialogue history and the image content repeatedly; the information transfer process forms a closed loop, which is used to fully mine the interaction between the modalities;
integrating the graph features obtained by repeated iterative fusion with the visual and text features and then sending them into a decoder, so as to realize collaborative representation of the multi-modal information and generate an answer to the question posed in the current scene.
Wherein, prior to step 1), the method further comprises:
and encoding text information such as the visual description and the dialogue history with a long short-term memory network (LSTM), initializing the basic directed graph from the text information, and extracting picture features with Faster R-CNN.
Further, constructing the basic graph structure specifically comprises:
identifying the entities and relations in the text information according to its syntactic structure and semantics, initializing a directed graph, performing global semantic enhancement on every node in the graph with the visual description and question features, integrating the graph node features and sending them respectively into the dialogue history and picture features, and, after selection by all the query vectors in the query libraries, performing semantic enhancement on the graph nodes with the question-related dialogue features and picture features.
Wherein integrating the graph features obtained by repeated iterative fusion with the visual and text features and then sending them into a decoder to realize collaborative representation of the multi-modal information and generate an answer to the question posed in the current scene specifically comprises:
and performing node-level feature fusion on the multi-stage high-order semantic graphs, embedding them to generate a graph vector, feeding the graph vector back to the relevant dialogue turns and picture regions again, performing vector concatenation and weighted summation, and obtaining the answer reasoning feature after a multi-layer perceptron and an activation function.
Wherein the method further comprises:
using a fully connected layer, a multi-layer perceptron, an activation function and self-attention to jointly embed the text, visual and graph features.
In a second aspect, an apparatus for generating answers to a visual dialog based on graph perception, the apparatus comprising: a processor and a memory, the memory having stored therein program instructions, the processor invoking the program instructions stored in the memory to cause the apparatus to perform the method steps of any of the first aspects.
In a third aspect, a computer readable storage medium stores a computer program comprising program instructions which, when executed by a processor, cause the processor to perform the method steps of any one of the first aspects.
The technical scheme provided by the invention has the beneficial effects that:
1. The method establishes query libraries separately according to the characteristics of the text and visual modalities and stores in them query vectors that can accurately perceive the key clues of the current modality; the query vectors in each library are sent to a self-attention mechanism, which performs sensitivity weighting and fusion of the local information of the current modality, so that the modal features that benefit the reasoning process in the actual scene are retained as far as possible;
Existing methods that use an attention mechanism [6,7] mostly adopt only the current question vector as the query vector, so the weighted multi-modal features attend only to the current question and ignore the context semantics. By constructing query libraries, the method keeps the overall semantic integrity of the multi-modal information during screening and can therefore generate more accurate answers with richer semantics.
2. The invention constructs a directed graph and enhances its semantics through repeated iterations, which is used to feed information back to the dialogue history and image content; the whole information transfer process forms a closed loop and reflects the close interaction between the modalities;
Existing methods only use one-way transfer of text information to visual content for reasoning, so the interaction between modalities is sparse; even when some methods introduce a graph structure, the information only circulates inside the graph, and the graph still has no close interaction with the visual and textual content. The invention attends to the relation between the graph and the visual and textual content at every stage, uses the graph to re-perceive them, improves the localization precision of image regions and the selection accuracy of dialogue turns, and refines the reasoning process.
3. The invention designs a collaborative reasoning mechanism between the graph and the visual and textual content, i.e., answer reasoning is performed after the prior knowledge of the current scene has been fully exploited and encoded by a multi-layer perceptron. Existing methods that introduce graph structures only embed the graph to obtain the encoder feature vector, which weakens the reasoning effect of the text and visual features, or they generate the feature vector from text and vision alone; neither process makes full use of the available modal information. The method fully establishes the semantic association between the graph and the visual and textual content, mines the latent relations between entities, narrows the semantic search space for the current question and context more accurately, and ensures that the final multi-modal features contain complete semantics, i.e., the selected image regions and dialogue turns fit the current context better, so the generated answer is more relevant to the context.
Drawings
FIG. 1 is a flow chart of a visual dialog answer generation method based on graph-aware closed-loop reasoning;
FIG. 2 is a general framework of a visual dialog answer generation method based on graph perception;
fig. 3 is a schematic structural diagram of an apparatus for generating visual dialog answers based on graph perception.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention are described in further detail below.
Example 1
A method for generating visual dialog answers based on graph perception, referring to fig. 1, the method comprising the steps of:
101: performing round-level encoding of the dialogue history, the visual description and the current question with an LSTM (long short-term memory network), and performing region-level feature extraction on the image content with Faster R-CNN (a region-based convolutional object detector), to obtain the text and visual feature vectors respectively;
102: respectively constructing query libraries according to the properties of each modality, and using a self-attention mechanism to assign weights to the feature vectors and sum them according to how sensitive the query elements are to the information within the modality, so as to retain, as far as possible, the modal features that benefit the reasoning process in the actual scene;
most of the existing methods adopting the attention mechanism only adopt the current problem vector as a query vector, and the multi-modal feature attention points obtained by weighting are only limited to the current problem and ignore the context semantics. According to the embodiment of the invention, the query library is constructed, so that the overall semantic integrity of the multi-mode information is ensured to be kept in the screening process, and more accurate answers with richer semantics can be generated.
103: performing entity recognition and relation detection on the dialogue history, constructing a basic graph structure with entities as nodes and relations as directed edges, and encoding it with GloVe word vectors;
104: enhancing the graph semantics multiple times through iterative updating, feeding the graph features back to the dialogue history and the image content repeatedly, so that the whole information transfer process forms a closed loop and the close interaction between the modalities is mined;
the existing method only utilizes the unidirectional transmission of text information to visual content to carry out reasoning, the interaction between the modes is sparse, even if a graph structure is introduced in part of methods, the information is only circularly transmitted in the graph, and the graph and the visual text still have no close interaction relation. The embodiment of the invention pays attention to the relation between the graph at each stage and the visual text, and the graph is used for re-perceiving the visual text to refine the reasoning process.
105: integrating the graph features obtained by repeated iterative fusion with the visual, text and other features and sending them into a decoder, so as to realize collaborative representation of the multi-modal information and finally generate an accurate answer to the question posed by a human in the current scene.
In summary, in the embodiments of the present invention, the query libraries constructed in steps 101 to 105 fully preserve the semantic globality across the modalities, prevent semantic bias and information loss during the attention mechanism, and improve the model's reasoning capability for the current question. The embodiments optimize the data preprocessing flow for the different modalities and design a closed-loop reasoning framework with dense interaction, which mines the dependencies between modalities to further refine reasoning, adequately substitutes for human perception and thinking, provides more accurate feedback on the scene and context, and reveals the effectiveness of dense interaction in the cross-modal reasoning process. The invention can narrow the semantic search space for the current question and context more accurately and ensure that the final multi-modal features contain complete semantics, i.e., the selected image regions and dialogue turns fit the current context better, so the generated answer is more relevant to the context.
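To make the flow of steps 101 to 105 easier to follow, a minimal Python sketch of how the stages could be chained in one forward pass is given below; the function names, argument lists and division of labour are illustrative assumptions rather than the patent's own implementation, and each stage is detailed with its own sketch in Example 2.

    # Hypothetical end-to-end skeleton of steps 101-105; all names below are illustrative.
    def visual_dialog_forward(dialog_rounds, caption, question, image,
                              encode_text, encode_regions, query_bank_attend,
                              build_graph, iterate_graph, fuse_and_decode):
        H, C, Q = encode_text(dialog_rounds, caption, question)      # 101: round-level LSTM encoding
        V = encode_regions(image)                                    # 101: Faster R-CNN region features
        text_ctx, vis_ctx = query_bank_attend(H, V, C, Q)            # 102: query libraries + self-attention
        G = build_graph(dialog_rounds)                               # 103: entity/relation graph (GloVe-encoded)
        graph_states = iterate_graph(G, text_ctx, vis_ctx, Q, C)     # 104: closed-loop iterative graph updates
        return fuse_and_decode(graph_states, text_ctx, vis_ctx, Q)   # 105: joint fusion and answer decoding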
Example 2
The scheme in example 1 is further described below with reference to specific examples and calculation formulas, which are described in detail below:
201: performing round-level encoding of the dialogue history, the visual description and the current question with an LSTM (long short-term memory network), and performing region-level feature extraction on the image content with Faster R-CNN (a region-based convolutional object detector), to obtain the text and visual feature vectors respectively;
given historical dialogue, visual description and current question, the information is sent to different LSTMs for feature embedding, and the embedding vectors are respectively expressed as H ═ H 1 ,h 2 ,h 3 ,…,h m H, C, Q, wherein h i ={q i ,a i The dialogue system comprises a dialogue system, a dialogue system and a dialogue system, wherein the dialogue system comprises a plurality of dialogue units, each dialogue unit comprises a plurality of dialogue rounds, and each dialogue unit comprises a question and an answer; given image, sending it into fast-RCNN to extract its characteristic, and expressing its characteristic vector as V ═ V { (V) 1 ,v 2 ,v 3 ,…,v n N denotes the number of image visual areas.
202: respectively constructing query libraries according to the properties of each modality, and using a self-attention mechanism to assign weights to the feature vectors and sum them according to how sensitive the query elements are to the information within the modality, so as to retain, as far as possible, the modal features that benefit the reasoning process in the actual scene;
for historical conversations, a query library is established to store visual description features C and current question features Q to generate corresponding attention based on the sensitivity between the query vector and different conversation rounds, and visual description-conversation associations and question-conversation associations are established:
[Formula (1) appears in the original as an image: H_c is the Softmax-normalized, self-attention weighted summation of the dialogue-turn features h_i, with attention scores produced from the visual description C through the fully connected layer W_1 and the all-one vector 1^T.]
where H_q can be obtained through the same calculation by replacing C with Q; H_c and H_q are the feature vectors obtained by self-attention weighted summation over the dialogue history driven by the visual description and the current question, respectively; W_1 denotes a fully connected layer, 1^T an all-one vector, Softmax the weight normalization function, and h_i the i-th dialogue-turn feature; the symbol "°" denotes matrix multiplication. H_c and H_q can adaptively enhance the semantics of the global question-driven graph in step 203; both are mapped into a common embedding semantic space and combined with the current-stage graph feature g_qc and the question feature Q:
μ = ReLU(MLP[σ(T[H_c; H_q]), σ(W_2 g_qc), σ(W_3 Q)])    (2)
where T denotes a transformation matrix, ReLU the ReLU activation function, MLP a multi-layer perceptron, σ an activation function, and W_2 and W_3 learnable parameters; μ is defined as the text feature used to keep enriching the node semantics of the current-stage graph and to serve as the query vector of the image query library.
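As an illustration of the text query library and of formula (2), the sketch below implements one plausible reading of this step in PyTorch. Because formula (1) is only reproduced as an image, the concatenation-plus-linear scoring used for the attention weights, the choice of sigmoid for σ, and all layer sizes are assumptions, not the patent's exact formulation.

    import torch
    import torch.nn as nn

    hid, m = 512, 10
    H = torch.randn(m, hid)        # dialogue-turn features h_i
    C = torch.randn(hid)           # visual-description feature
    Q = torch.randn(hid)           # current-question feature
    g_qc = torch.randn(hid)        # current-stage graph feature (see steps 203-204)

    W1 = nn.Linear(2 * hid, 1)     # scores each (dialogue turn, query) pair; assumed form of formula (1)

    def attend_history(query):
        paired = torch.cat([H, query.unsqueeze(0).expand(m, -1)], dim=-1)
        alpha = torch.softmax(W1(paired).squeeze(-1), dim=0)   # sensitivity weights over the turns
        return alpha @ H                                        # weighted sum of dialogue turns

    H_c, H_q = attend_history(C), attend_history(Q)             # description- / question-driven history

    # Formula (2): fuse the attended history, the graph feature and the question into mu.
    T, W2, W3 = nn.Linear(2 * hid, hid), nn.Linear(hid, hid), nn.Linear(hid, hid)
    mlp = nn.Linear(3 * hid, hid)
    sigma = torch.sigmoid                                        # sigma chosen as sigmoid (assumption)
    mu = torch.relu(mlp(torch.cat([sigma(T(torch.cat([H_c, H_q]))),
                                   sigma(W2(g_qc)),
                                   sigma(W3(Q))])))
    print(mu.shape)   # text feature that queries the image query library in the next step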
Similarly, a query library is built for the image to store the current question feature Q, the text feature μ and the current-stage graph feature g_t, and an attention-weighted summation of the visual features is performed:
[Formula (3) appears in the original as an image: V_q is the Softmax-weighted summation of the region features v_i, with attention scores produced from the question Q through the fully connected layer W_4.]
where v_i denotes the visual feature of the i-th region; V_μ and V_g can be obtained through the same calculation by replacing Q with μ and g_t respectively, and W_4 denotes a fully connected layer.
In order to prevent missing effective visual information, the perceived visual features are fused:
[Formula (4) appears in the original as an image: the attended visual features V_q, V_μ and V_g are fused into ζ.]
where ζ is defined as the visual feature used to continue enhancing the node semantics of the current-stage graph.
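A corresponding sketch for the image query library follows; since formulas (3) and (4) are reproduced only as images, the scoring layer applied to concatenated (region, query) pairs and the simple learned fusion producing ζ are assumptions.

    import torch
    import torch.nn as nn

    hid, n = 512, 36
    V = torch.randn(n, hid)        # region features v_i (assumed already projected to hid dims)
    Q = torch.randn(hid)           # current-question feature
    mu = torch.randn(hid)          # text feature from formula (2)
    g_t = torch.randn(hid)         # current-stage graph feature

    W4 = nn.Linear(2 * hid, 1)     # scoring layer; assumed form of formula (3)

    def attend_regions(query):
        paired = torch.cat([V, query.unsqueeze(0).expand(n, -1)], dim=-1)
        return torch.softmax(W4(paired).squeeze(-1), dim=0) @ V   # weighted sum over regions

    V_q, V_mu, V_g = attend_regions(Q), attend_regions(mu), attend_regions(g_t)

    # Formula (4): fuse the three attended views so no useful visual evidence is dropped.
    fuse = nn.Linear(3 * hid, hid)
    zeta = torch.relu(fuse(torch.cat([V_q, V_mu, V_g])))
    print(zeta.shape)   # visual feature used to keep enhancing the current-stage graph nodes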
203: performing entity recognition and relation detection on the dialogue history, constructing a basic graph structure with entities as nodes and relations as directed edges, and encoding it with GloVe word vectors;
there are a large number of noun entities and their relationship representations in the historical dialog, and these a priori knowledge can be used to construct a basic directed graph under the premise of guaranteeing semantic integrity, such as the visual description in fig. 2, "a group of young people is playing skateboards in a skateboarding field", which can be abstracted into "skateboarding field → (there) → young people → (playing) → skateboards". And for the extracted elements such as the relation, the entity, the attribute and the like, encoding the elements by using a GloVe word vector to obtain a basic directed graph G o
204: enhancing the graph semantics multiple times through iterative updating, feeding the graph features back to the dialogue history and the image content repeatedly, so that the whole information transfer process forms a closed loop and the close interaction between the modalities is mined.
This step aims to find subgraphs whose nodes carry specific semantics, which promotes both language understanding and visual grounding. In the first stage, the basic directed graph G_o is parsed from the dialogue. In the second stage it is initialized with the visual description and the current question:
G_qc = ReLU(W_7(σ(W_5[G_o, Q] + W_6[G_o, C])))    (5)
where G_qc, defined as the global question-driven graph, is used to enrich and take part in the dialogue rounds. Under the guidance of the question Q, the text feature μ is fused with the current-stage graph node features, so that the semantics of the related graph nodes are continuously and comprehensively enhanced:
[Formula (6) appears in the original as an image: under the guidance of the question, the text feature μ is fused with the current-stage graph node features to yield the text-driven graph G_t.]
where G_t, defined as the text-driven graph, feeds features back into the visual content. Similarly to the formula above, replacing μ with ζ yields the vision-driven graph G_v through the same calculation. The embodiment of the invention thus obtains graphs in four different states, G_o, G_qc, G_t and G_v, which can interact and perceive the dialogue and the image from multiple angles:
G_multi = MLP(G_o + G_qc + G_t + G_v)    (7)
where G_multi is defined as the multi-state fusion graph. Graph convolution is applied to G_o, G_qc, G_t, G_v and G_multi, and the node and directed-edge features are summed element by element to obtain the corresponding graph embedding features g_o, g_qc, g_t, g_v and g_multi.
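The sketch below walks through formulas (5) to (7) and the graph embedding step for a toy graph; since formula (6) appears only as an image, the linear fusion used for the text-driven and vision-driven updates, the choice of sigmoid for σ, the node count and all layer shapes are assumptions.

    import torch
    import torch.nn as nn

    hid, k = 512, 4                    # k graph nodes (illustrative)
    G_o = torch.randn(k, hid)          # basic directed-graph node features (GloVe-initialised, projected)
    E = torch.randn(k, k, hid)         # directed-edge features (zero where no edge exists)
    Q, C = torch.randn(hid), torch.randn(hid)
    mu, zeta = torch.randn(hid), torch.randn(hid)    # text / visual features from step 202

    def cat_node(G, x):
        """Concatenate a global feature x onto every node of G."""
        return torch.cat([G, x.unsqueeze(0).expand(G.size(0), -1)], dim=-1)

    # Formula (5): global question-driven graph, initialised from the question and the caption.
    W5, W6, W7 = nn.Linear(2 * hid, hid), nn.Linear(2 * hid, hid), nn.Linear(hid, hid)
    G_qc = torch.relu(W7(torch.sigmoid(W5(cat_node(G_o, Q)) + W6(cat_node(G_o, C)))))

    # Formula (6) is an image in the original; a linear fusion of the current graph nodes with
    # the text feature mu (and, analogously, the visual feature zeta) is assumed here.
    Wt, Wv = nn.Linear(2 * hid, hid), nn.Linear(2 * hid, hid)
    G_t = torch.relu(Wt(cat_node(G_qc, mu)))       # text-driven graph
    G_v = torch.relu(Wv(cat_node(G_qc, zeta)))     # vision-driven graph

    # Formula (7): multi-state fusion graph.
    mlp = nn.Sequential(nn.Linear(hid, hid), nn.ReLU())
    G_multi = mlp(G_o + G_qc + G_t + G_v)

    def graph_embed(G_nodes, G_edges):
        """Graph embedding: element-wise sum of node and directed-edge features
        (a simplification of the graph-convolution step described in the text)."""
        return G_nodes.sum(dim=0) + G_edges.sum(dim=(0, 1))

    g_o, g_qc, g_t, g_v, g_multi = (graph_embed(G, E) for G in (G_o, G_qc, G_t, G_v, G_multi))
    print(g_multi.shape)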
205: finally, integrating the graph features obtained by repeated iterative fusion with the visual, text and other features and sending them into a decoder, so as to realize collaborative representation of the multi-modal information and finally generate an accurate answer to the question posed by a human in the current scene.
The dialogue turns and image regions are re-mined with the graph embedding features; the whole reasoning process thus forms a closed loop, with sufficiently close information interaction between the modalities. g_multi integrates information over the interacting turns and visual regions and mines the correlation between language and vision:
[Formula (8) appears in the original as an image: g_multi attends over the dialogue-turn features in H to produce a graph-guided text feature; replacing H with V, the same calculation yields a graph-guided visual feature. The symbols of these two graph-guided features are likewise rendered as images in the original.]
Finally, the text and visual vectors (whose symbols appear in the original as images) are obtained through a self-attention module, and the two are semantically fused through a multi-layer perceptron:
[Formula (9) appears in the original as an image: the multi-layer perceptron fuses the graph-guided text and visual vectors into ψ.]
and psi is a multi-modal fusion feature, and is sent to a decoder to generate an answer to the current question.
By introducing the graph structure and designing an iterative updating scheme for the graph, the embodiment of the invention ensures close interaction among the multi-modal information, thereby mining the semantics shared by the different modalities, meeting the requirements of actual scenes, and refining the reasoning process. The graph-aware closed-loop reasoning method for visual dialogue outperforms current mainstream methods and can fully perceive the multi-modal information so as to generate answers that fit the scene of the current question.
Based on the same inventive concept, an embodiment of the present invention further provides a device for generating visual dialog answers based on graph perception, referring to fig. 3, where the device includes: a processor 1 and a memory 2, the memory 2 having stored therein program instructions, the processor 1 calling the program instructions stored in the memory 2 to cause the apparatus to perform the following method steps in an embodiment:
respectively constructing query libraries according to the properties of each modality, using a self-attention mechanism to assign weights to the feature vectors and sum them according to how sensitive the query elements are to the information within the modality, and retaining the modal features that benefit the reasoning process in the actual scene;
performing entity recognition and relation detection on the dialogue history, constructing a basic graph structure with entities as nodes and relations as directed edges, and encoding the entities and relations with GloVe word vectors;
enhancing the graph semantics multiple times through iterative updating, so that the in-graph information is fed back to the dialogue history and the image content repeatedly; the information transfer process forms a closed loop, which is used to fully mine the interaction between the modalities;
integrating the graph features obtained by repeated iterative fusion with the visual and text features and then sending them into a decoder, so as to realize collaborative representation of the multi-modal information and generate an answer to the question posed in the current scene.
Wherein the device further comprises:
and encoding text information such as the visual description and the dialogue history with a long short-term memory network (LSTM), initializing the basic directed graph from the text information, and extracting picture features with Faster R-CNN.
Further, constructing the basic graph structure specifically comprises:
identifying the entities and relations in the text information according to its syntactic structure and semantics, initializing a directed graph, performing global semantic enhancement on every node in the graph with the visual description and question features, integrating the graph node features and sending them respectively into the dialogue history and picture features, and, after selection by all the query vectors in the query libraries, performing semantic enhancement on the graph nodes with the question-related dialogue features and picture features.
Wherein integrating the graph features obtained by repeated iterative fusion with the visual and text features and then sending them into a decoder to realize collaborative representation of the multi-modal information and generate an answer to the question posed in the current scene specifically comprises:
and performing node-level feature fusion on the multi-stage high-order semantic graphs, embedding them to generate a graph vector, feeding the graph vector back to the relevant dialogue turns and picture regions again, performing vector concatenation and weighted summation, and obtaining the answer reasoning feature after a multi-layer perceptron and an activation function.
Wherein the device further comprises:
using a fully connected layer, a multi-layer perceptron, an activation function and self-attention to jointly embed the text, visual and graph features.
It should be noted that the device description in the above embodiments corresponds to the method description in the embodiments, and the embodiments of the present invention are not described herein again.
The processor 1 and the memory 2 may be implemented by computers, single-chip microcomputers, microcontrollers and other devices with computing capability; the specific implementation is not limited in the embodiments of the present invention and is chosen according to the needs of the practical application.
The data signal is transmitted between the memory 2 and the processor 1 through the bus 3, which is not described in detail in the embodiment of the present invention.
Based on the same inventive concept, an embodiment of the present invention further provides a computer-readable storage medium, where the storage medium includes a stored program, and when the program runs, the apparatus on which the storage medium is located is controlled to execute the method steps in the foregoing embodiments.
The computer readable storage medium includes, but is not limited to, flash memory, hard disk, solid state disk, and the like.
It should be noted that the descriptions of the readable storage medium in the above embodiments correspond to the descriptions of the method in the embodiments, and the descriptions of the embodiments of the present invention are not repeated here.
In the above embodiments, all or part of the implementation may be realized by software, hardware, firmware, or any combination thereof. When implemented in software, may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. The procedures or functions according to the embodiments of the invention may be carried out in whole or in part when the computer program instructions are loaded and executed on a computer.
The computer may be a general purpose computer, a special purpose computer, a network of computers, or other programmable device. The computer instructions may be stored on or transmitted over a computer-readable storage medium. The computer-readable storage medium can be any available medium that can be accessed by a computer or a data storage device, such as a server, a data center, etc., that incorporates one or more of the available media. The usable medium may be a magnetic medium or a semiconductor medium, etc.
References:
[1] S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. L. Zitnick, and D. Parikh. VQA: Visual question answering. In ICCV, pages 2425-2433, 2015.
[2] L. Chen, Z. Jiang, J. Xiao, and W. Liu. Human-like controllable image captioning with verb-specific semantic roles. In CVPR, pages 16846-16856, 2021.
[3] T. Chen, W. Yu, R. Chen, and L. Lin. Knowledge-embedded routing network for scene graph generation. In CVPR, pages 6163-6171, 2019.
[4] Y. Cho and I. Kim. NMN-VD: A neural module network for visual dialog. Sensors, page 931, 2021.
[5] A. Das, S. Kottur, K. Gupta, et al. Visual dialog. In CVPR, 2017.
[6] G. Kang, J. Lim, and B. Zhang. Dual attention networks for visual reference resolution in visual dialog. In EMNLP-IJCNLP, pages 2024-2033, 2019.
[7] D. Guo, H. Wang, S. Wang, and M. Wang. Textual-visual reference-aware attention network for visual dialog. IEEE Trans. Image Process., pages 6655-6666, 2020.
[8] Z. Zheng, W. Wang, S. Qi, and S. Zhu. Reasoning visual dialogs with structural and partial observations. In CVPR, pages 6669-6678, 2019.
[9] I. Schwartz, S. Yu, T. Hazan, and A. G. Schwing. Factor graph attention. In CVPR, pages 2039-2048, 2019.
[10] L. Zhao, L. Gao, and J. Song. Adaptive visual memory network for visual dialog. Journal of University of Electronic Science and Technology of China, 2021, 50(05): 749-.
[11] Y. Niu and W. Zhang. A survey of visual question answering and dialogue. Computer Science, 2021, 48(03): 87-96.
In the embodiment of the present invention, except for the specific description of the model of each device, the model of other devices is not limited as long as the device can perform the above functions.
Those skilled in the art will appreciate that the drawings are only schematic illustrations of preferred embodiments, and that the serial numbers of the above embodiments of the present invention are merely for description and do not represent the relative merits of the embodiments.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims (7)

1. A method for generating visual dialog answers based on graph perception, the method comprising the steps of:
respectively constructing query libraries according to the properties of each modality, using a self-attention mechanism to assign weights to the feature vectors and sum them according to how sensitive the query elements are to the information within the modality, and retaining the modal features that benefit the reasoning process in the actual scene;
carrying out entity recognition and relation detection on the dialogue history, constructing a basic graph structure with entities as nodes and relations as directed edges, and encoding the entities and relations with GloVe word vectors;
enhancing the graph semantics multiple times through iterative updating, so as to feed the graph features back to the dialogue history and the image content repeatedly, make the information transfer process form a closed loop, and mine the interaction between the modalities;
and integrating the graph features obtained by repeated iterative fusion with the visual and text features and then sending them into a decoder, so as to realize collaborative representation of the multi-modal information and generate an answer to the question posed in the current scene.
2. The method of claim 1, further comprising:
and encoding text information such as the visual description and the dialogue history with a long short-term memory network (LSTM), initializing the basic directed graph from the text information, and extracting picture features with Faster R-CNN.
3. The method according to claim 1, wherein constructing the basic graph structure specifically comprises:
identifying the entities and relations in the text information according to its syntactic structure and semantics, initializing a directed graph, performing global semantic enhancement on every node in the graph with the visual description and question features, integrating the graph node features and sending them respectively into the dialogue history and picture features, and, after selection by all the query vectors in the query libraries, performing semantic enhancement on the graph nodes with the question-related dialogue features and picture features.
4. The method for generating visual dialogue answers according to claim 1, wherein integrating the graph features obtained by repeated iterative fusion with the visual and text features and then sending them into a decoder to realize collaborative representation of the multi-modal information and generate an answer to the question posed in the current scene specifically comprises:
and performing node-level feature fusion on the multi-stage high-order semantic graphs, embedding them to generate a graph vector, feeding the graph vector back to the relevant dialogue turns and picture regions again, performing vector concatenation and weighted summation, and obtaining the answer reasoning feature after a multi-layer perceptron and an activation function.
5. The method of claim 1, further comprising:
using a fully connected layer, a multi-layer perceptron, an activation function and self-attention to jointly embed the text, visual and graph features.
6. A visual dialog answer generation apparatus based on graph perception, the apparatus comprising: a processor and a memory, the memory having stored therein program instructions, the processor invoking the program instructions stored in the memory to cause the apparatus to perform the method steps of any of claims 1-5.
7. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program comprising program instructions which, when executed by a processor, cause the processor to carry out the method steps of any of claims 1-5.
CN202210685096.4A 2022-06-16 2022-06-16 Visual dialogue answer generation method and device based on graph perception Pending CN115129839A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210685096.4A CN115129839A (en) 2022-06-16 2022-06-16 Visual dialogue answer generation method and device based on graph perception

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210685096.4A CN115129839A (en) 2022-06-16 2022-06-16 Visual dialogue answer generation method and device based on graph perception

Publications (1)

Publication Number Publication Date
CN115129839A true CN115129839A (en) 2022-09-30

Family

ID=83377641

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210685096.4A Pending CN115129839A (en) 2022-06-16 2022-06-16 Visual dialogue answer generation method and device based on graph perception

Country Status (1)

Country Link
CN (1) CN115129839A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116340778A (en) * 2023-05-25 2023-06-27 智慧眼科技股份有限公司 Medical large model construction method based on multiple modes and related equipment thereof
CN116340778B (en) * 2023-05-25 2023-10-03 智慧眼科技股份有限公司 Medical large model construction method based on multiple modes and related equipment thereof
CN116862000A (en) * 2023-09-01 2023-10-10 浪潮电子信息产业股份有限公司 Causal thinking chain generation method, device and equipment for generating artificial intelligence
CN116862000B (en) * 2023-09-01 2024-01-23 浪潮电子信息产业股份有限公司 Causal thinking chain generation method, device and equipment for generating artificial intelligence
CN117290429A (en) * 2023-11-24 2023-12-26 山东焦易网数字科技股份有限公司 Method for calling data system interface through natural language
CN117290429B (en) * 2023-11-24 2024-02-20 山东焦易网数字科技股份有限公司 Method for calling data system interface through natural language

Similar Documents

Publication Publication Date Title
CN111522962B (en) Sequence recommendation method, device and computer readable storage medium
Zhou et al. A comprehensive survey on pretrained foundation models: A history from bert to chatgpt
Battaglia et al. Relational inductive biases, deep learning, and graph networks
CN110717017B (en) Method for processing corpus
CN112131366B (en) Method, device and storage medium for training text classification model and text classification
CN111782838B (en) Image question-answering method, device, computer equipment and medium
Gu et al. A systematic survey of prompt engineering on vision-language foundation models
CN115129839A (en) Visual dialogue answer generation method and device based on graph perception
CN110990555B (en) End-to-end retrieval type dialogue method and system and computer equipment
WO2021212601A1 (en) Image-based writing assisting method and apparatus, medium, and device
CN113553418B (en) Visual dialogue generation method and device based on multi-modal learning
CN116681810B (en) Virtual object action generation method, device, computer equipment and storage medium
EP4302234A1 (en) Cross-modal processing for vision and language
CN111881292A (en) Text classification method and device
CN115438160A (en) Question and answer method and device based on deep learning and electronic equipment
Sur MRRC: multiple role representation crossover interpretation for image captioning with R-CNN feature distribution composition (FDC)
Jedoui et al. Deep Bayesian active learning for multiple correct outputs
Li et al. Graph convolutional network meta-learning with multi-granularity POS guidance for video captioning
CN114328943A (en) Question answering method, device, equipment and storage medium based on knowledge graph
CN113869324A (en) Video common-sense knowledge reasoning implementation method based on multi-mode fusion
Lu et al. Coordinated-joint translation fusion framework with sentiment-interactive graph convolutional networks for multimodal sentiment analysis
CN114547308A (en) Text processing method and device, electronic equipment and storage medium
Lee et al. Language Model Using Differentiable Neural Computer Based on Forget Gate-Based Memory Deallocation.
Liu et al. Closed-loop reasoning with graph-aware dense interaction for visual dialog
CN113869518A (en) Visual common sense reasoning method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination