CN115129839A - Visual dialogue answer generation method and device based on graph perception

Visual dialogue answer generation method and device based on graph perception

Info

Publication number
CN115129839A
CN115129839A (application CN202210685096.4A)
Authority
CN
China
Prior art keywords
graph
features
visual
information
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210685096.4A
Other languages
Chinese (zh)
Inventor
刘安安
徐宁
张国楷
郭俊波
靳国庆
张勇东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University
Konami Sports Club Co Ltd
Original Assignee
Tianjin University
People Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University, People Co Ltd filed Critical Tianjin University
Priority to CN202210685096.4A priority Critical patent/CN115129839A/en
Publication of CN115129839A publication Critical patent/CN115129839A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/332Query formulation
    • G06F16/3329Natural language query formulation or dialogue systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367Ontology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/901Indexing; Data structures therefor; Storage structures
    • G06F16/9024Graphs; Linked lists
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods


Abstract

The invention discloses a visual dialogue answer generation method and device based on graph perception. The method comprises the following steps: constructing a query library for each modality according to its properties, using a self-attention mechanism to assign weights to the feature vectors and sum them according to how sensitive the query elements are to the information within the modality, and retaining the modal features that benefit the reasoning process in the actual scene; performing entity recognition and relation detection on the dialogue history, constructing a basic graph structure with entities as nodes and relations as directed edges, and encoding the entities and relations with GloVe word vectors; enhancing the graph semantics multiple times through iterative updating, so that the in-graph information is fed back to the dialogue history and the image content repeatedly and the information transfer process forms a closed loop that fully mines the interaction between the modalities; and integrating the graph features obtained by repeated iterative fusion with the visual and text features and sending them into a decoder, so as to realize collaborative representation of the multi-modal information and generate an answer to the question posed in the current scene. The device comprises: a processor and a memory.

Description

Visual dialogue answer generation method and device based on graph perception
Technical Field
The invention relates to the field of visual dialogue generation, and in particular to a visual dialogue answer generation method and device based on graph perception.
Background
With the continued development of artificial intelligence, visual dialogue has received unprecedented attention at the intersection of computer vision and natural language processing. In this task, given a picture, a picture description and a set of dialogues (i.e., multiple question-answer pairs), the agent must fully understand the latent cross-modal semantic associations implied by the current question and infer an accurate answer. Compared with other vision-and-language tasks, such as image description generation [1], scene graph generation [2], visual commonsense reasoning [3] and visual question answering (VQA) [4], and in order to further explore the deep semantic dependencies between vision and language under actual requirements and application scenarios, visual dialogue requires not only fine-grained cross-modal understanding between text and images, but also global semantic dependencies among the current question, the dialogue history and the visual information. The technology aims to replace human perception and thinking by having an agent give accurate feedback to the current question instead of a human answering it, and it can be widely applied to human-computer interaction and to helping visually impaired users perceive their surroundings. To build a research platform for visual dialogue generation, the VisDial v0.9 and VisDial v1.0 datasets were proposed [5] to verify the application capability of models in actual scenes.
Existing methods [6-11] have demonstrated innovative approaches and excellent performance in visual dialogue generation. Researchers have mainly focused on using attention mechanisms to let textual information guide the extraction of visual information, then embedding and fusing the multi-modal features, and finally sending them to a decoder to derive answer clues; existing frameworks such as DAN [6] and RAA-Net [7] all achieve good performance. However, this reasoning process is one-way, leading to insufficient cross-modal interaction and limiting the accuracy and richness of the generated answers. GNN [8] and FGA [9] introduce graph structures into the framework to alleviate this shortcoming: multi-level semantics are abstracted from text and vision to construct a graph, interaction among graph nodes containing multi-modal information is realized through intra-graph message passing, and the graph is embedded to obtain graph features for answer generation.
However, the above existing models attach too much importance to the role of the high-order information in the graph during inference, neglect the reasoning capability of the original natural language and visual content, and thereby weaken their role in the reasoning process. This shows that it is necessary to introduce a dynamic structure to optimize the model, so that a close interaction relationship is established between the graph modality and the visual and textual content, and the reasoning over vision and text is strengthened through a loop outside the graph. To date, no strategy has used the graph structure as a medium to enrich the semantics of dialogue turns and visual regions.
In summary, although the field of visual dialogue generation has made a series of advances [8,9], a graph-aware multi-modal semantic interaction framework has not yet been designed, and the effect of close interaction between the graph modality and the visual and textual content on reasoning has been ignored. At present, mainstream methods still perform feature extraction and fusion on the original information; such one-way, coarse-grained operations cannot sufficiently explore the semantic dependencies between text and vision, which harms the quality of the answers generated for the current scene.
Based on the current state of research, the main challenges lie in the following three aspects:
1. how to abstract a graph structure from the visual and textual multi-modal information and then iteratively enhance the graph semantics;
2. how to feed the high-order information in the graph back to the dialogue history and image regions and optimize the self-attention weighting process;
3. how to jointly embed the graph-modality features with the text, visual and other features and perform collaborative reasoning on the current question.
Disclosure of Invention
The invention provides a visual dialogue answer generation method and device based on graph perception. Query libraries for storing query vectors are built separately according to the characteristics of the visual and textual modalities, and the query vectors perceive the local features within each modality through self-attention to obtain high-order semantic vectors. In the graph construction and iterative updating stage, the entities and relations in the dialogue history are identified to establish a basic directed graph structure, and the directed graph undergoes feature embedding and multi-stage interaction with the text and visual features, enriching the semantic information of vision, text and graph at the same time. In the multi-modal collaborative reasoning stage, node-level feature fusion is performed on the graphs of each stage, which are then embedded into a higher-level semantic space by a multi-layer perceptron to form high-level graph features; these are semantically perceived and fused with the multi-modal features selected by the attention module to generate a vector with strong reasoning capability, which is used to generate an answer to the current question in the specific scene. The details are described in the following:
in a first aspect, a method for generating visual dialog answers based on graph perception includes the steps of:
respectively constructing query libraries according to the properties of each modality, using a self-attention mechanism to assign weights to the feature vectors and sum them according to how sensitive the query elements are to the information within the modality, and retaining the modal features that benefit the reasoning process in the actual scene;
performing entity recognition and relation detection on the dialogue history, constructing a basic graph structure with entities as nodes and relations as directed edges, and encoding the entities and relations with GloVe word vectors;
enhancing the graph semantics multiple times through iterative updating, so that the in-graph information is fed back to the dialogue history and the image content repeatedly; the information transfer process forms a closed loop, which is used to fully mine the interaction between the modalities;
integrating the graph features obtained by repeated iterative fusion with the visual and text features and then sending them into a decoder, so as to realize collaborative representation of the multi-modal information and generate an answer to the question posed in the current scene.
Wherein, prior to step 1), the method further comprises:
and encoding text information such as the visual description and the dialogue history with a long short-term memory network (LSTM), initializing the basic directed graph from the text information, and extracting picture features with Faster R-CNN.
Further, constructing the basic graph structure specifically comprises:
identifying the entities and relations in the text information according to its syntactic structure and semantics, initializing a directed graph, performing global semantic enhancement on every node in the graph with the visual description and question features, integrating the graph node features and sending them respectively into the dialogue history and picture features, and, after selection by all the query vectors in the query libraries, performing semantic enhancement on the graph nodes with the question-related dialogue features and picture features.
Wherein integrating the graph features obtained by repeated iterative fusion with the visual and text features and then sending them into a decoder to realize collaborative representation of the multi-modal information and generate an answer to the question posed in the current scene specifically comprises:
and performing node-level feature fusion on the multi-stage high-order semantic graphs, embedding them to generate a graph vector, feeding the graph vector back to the relevant dialogue turns and picture regions again, performing vector concatenation and weighted summation, and obtaining the answer reasoning feature after a multi-layer perceptron and an activation function.
Wherein the method further comprises:
using a fully connected layer, a multi-layer perceptron, an activation function and self-attention to jointly embed the text, visual and graph features.
In a second aspect, an apparatus for generating answers to a visual dialog based on graph perception, the apparatus comprising: a processor and a memory, the memory having stored therein program instructions, the processor invoking the program instructions stored in the memory to cause the apparatus to perform the method steps of any of the first aspects.
In a third aspect, a computer readable storage medium stores a computer program comprising program instructions which, when executed by a processor, cause the processor to perform the method steps of any one of the first aspects.
The technical scheme provided by the invention has the beneficial effects that:
1. The method establishes query libraries separately according to the characteristics of the text and visual modalities and stores in them query vectors that can accurately perceive the key clues of the current modality; the query vectors in each library are sent to a self-attention mechanism, which performs sensitivity weighting and fusion of the local information of the current modality, so that the modal features that benefit the reasoning process in the actual scene are retained as far as possible;
Existing methods that use an attention mechanism [6,7] mostly adopt only the current question vector as the query vector, so the weighted multi-modal features attend only to the current question and ignore the context semantics. By constructing query libraries, the method keeps the overall semantic integrity of the multi-modal information during screening and can therefore generate more accurate answers with richer semantics.
2. The invention constructs a directed graph and enhances its semantics through repeated iterations, which is used to feed information back to the dialogue history and image content; the whole information transfer process forms a closed loop and reflects the close interaction between the modalities;
Existing methods only use one-way transfer of text information to visual content for reasoning, so the interaction between modalities is sparse; even when some methods introduce a graph structure, the information only circulates inside the graph, and the graph still has no close interaction with the visual and textual content. The invention attends to the relation between the graph and the visual and textual content at every stage, uses the graph to re-perceive them, improves the localization precision of image regions and the selection accuracy of dialogue turns, and refines the reasoning process.
3. The invention designs a collaborative reasoning mechanism between the graph and the visual and textual content, i.e., answer reasoning is performed after the prior knowledge of the current scene has been fully exploited and encoded by a multi-layer perceptron. Existing methods that introduce graph structures only embed the graph to obtain the encoder feature vector, which weakens the reasoning effect of the text and visual features, or they generate the feature vector from text and vision alone; neither process makes full use of the available modal information. The method fully establishes the semantic association between the graph and the visual and textual content, mines the latent relations between entities, narrows the semantic search space for the current question and context more accurately, and ensures that the final multi-modal features contain complete semantics, i.e., the selected image regions and dialogue turns fit the current context better, so the generated answer is more relevant to the context.
Drawings
FIG. 1 is a flow chart of a visual dialog answer generation method based on graph-aware closed-loop reasoning;
FIG. 2 is a general framework of a visual dialog answer generation method based on graph perception;
fig. 3 is a schematic structural diagram of an apparatus for generating visual dialog answers based on graph perception.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention are described in further detail below.
Example 1
A method for generating visual dialog answers based on graph perception, referring to fig. 1, the method comprising the steps of:
101: performing round-level encoding of the dialogue history, the visual description and the current question with an LSTM (long short-term memory network), and performing region-level feature extraction on the image content with Faster R-CNN (a region-based convolutional object detector), to obtain the text and visual feature vectors respectively;
102: respectively constructing query libraries according to the properties of each modality, and using a self-attention mechanism to assign weights to the feature vectors and sum them according to how sensitive the query elements are to the information within the modality, so as to retain, as far as possible, the modal features that benefit the reasoning process in the actual scene;
most of the existing methods adopting the attention mechanism only adopt the current problem vector as a query vector, and the multi-modal feature attention points obtained by weighting are only limited to the current problem and ignore the context semantics. According to the embodiment of the invention, the query library is constructed, so that the overall semantic integrity of the multi-mode information is ensured to be kept in the screening process, and more accurate answers with richer semantics can be generated.
103: performing entity recognition and relation detection on the dialogue history, constructing a basic graph structure with entities as nodes and relations as directed edges, and encoding it with GloVe word vectors;
104: enhancing the graph semantics multiple times through iterative updating, feeding the graph features back to the dialogue history and the image content repeatedly, so that the whole information transfer process forms a closed loop and the close interaction between the modalities is mined;
the existing method only utilizes the unidirectional transmission of text information to visual content to carry out reasoning, the interaction between the modes is sparse, even if a graph structure is introduced in part of methods, the information is only circularly transmitted in the graph, and the graph and the visual text still have no close interaction relation. The embodiment of the invention pays attention to the relation between the graph at each stage and the visual text, and the graph is used for re-perceiving the visual text to refine the reasoning process.
105: integrating the graph features obtained by repeated iterative fusion with the visual, text and other features and sending them into a decoder, so as to realize collaborative representation of the multi-modal information and finally generate an accurate answer to the question posed by a human in the current scene.
In summary, in the embodiments of the present invention, the query libraries constructed in steps 101 to 105 fully preserve the semantic globality across the modalities, prevent semantic bias and information loss during the attention mechanism, and improve the model's reasoning capability for the current question. The embodiments optimize the data preprocessing flow for the different modalities and design a closed-loop reasoning framework with dense interaction, which mines the dependencies between modalities to further refine reasoning, adequately substitutes for human perception and thinking, provides more accurate feedback on the scene and context, and reveals the effectiveness of dense interaction in the cross-modal reasoning process. The invention can narrow the semantic search space for the current question and context more accurately and ensure that the final multi-modal features contain complete semantics, i.e., the selected image regions and dialogue turns fit the current context better, so the generated answer is more relevant to the context.
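To make the flow of steps 101 to 105 easier to follow, a minimal Python sketch of how the stages could be chained in one forward pass is given below; the function names, argument lists and division of labour are illustrative assumptions rather than the patent's own implementation, and each stage is detailed with its own sketch in Example 2.

    # Hypothetical end-to-end skeleton of steps 101-105; all names below are illustrative.
    def visual_dialog_forward(dialog_rounds, caption, question, image,
                              encode_text, encode_regions, query_bank_attend,
                              build_graph, iterate_graph, fuse_and_decode):
        H, C, Q = encode_text(dialog_rounds, caption, question)      # 101: round-level LSTM encoding
        V = encode_regions(image)                                    # 101: Faster R-CNN region features
        text_ctx, vis_ctx = query_bank_attend(H, V, C, Q)            # 102: query libraries + self-attention
        G = build_graph(dialog_rounds)                               # 103: entity/relation graph (GloVe-encoded)
        graph_states = iterate_graph(G, text_ctx, vis_ctx, Q, C)     # 104: closed-loop iterative graph updates
        return fuse_and_decode(graph_states, text_ctx, vis_ctx, Q)   # 105: joint fusion and answer decoding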
Example 2
The scheme in example 1 is further described below with reference to specific examples and calculation formulas, which are described in detail below:
201: performing round-level encoding of the dialogue history, the visual description and the current question with an LSTM (long short-term memory network), and performing region-level feature extraction on the image content with Faster R-CNN (a region-based convolutional object detector), to obtain the text and visual feature vectors respectively;
given historical dialogue, visual description and current question, the information is sent to different LSTMs for feature embedding, and the embedding vectors are respectively expressed as H ═ H 1 ,h 2 ,h 3 ,…,h m H, C, Q, wherein h i ={q i ,a i The dialogue system comprises a dialogue system, a dialogue system and a dialogue system, wherein the dialogue system comprises a plurality of dialogue units, each dialogue unit comprises a plurality of dialogue rounds, and each dialogue unit comprises a question and an answer; given image, sending it into fast-RCNN to extract its characteristic, and expressing its characteristic vector as V ═ V { (V) 1 ,v 2 ,v 3 ,…,v n N denotes the number of image visual areas.
202: respectively constructing query libraries according to the properties of each modality, and using a self-attention mechanism to assign weights to the feature vectors and sum them according to how sensitive the query elements are to the information within the modality, so as to retain, as far as possible, the modal features that benefit the reasoning process in the actual scene;
for historical conversations, a query library is established to store visual description features C and current question features Q to generate corresponding attention based on the sensitivity between the query vector and different conversation rounds, and visual description-conversation associations and question-conversation associations are established:
[Formula (1) appears in the original as an image: H_c is the Softmax-normalized, self-attention weighted summation of the dialogue-turn features h_i, with attention scores produced from the visual description C through the fully connected layer W_1 and the all-one vector 1^T.]
where H_q can be obtained through the same calculation by replacing C with Q; H_c and H_q are the feature vectors obtained by self-attention weighted summation over the dialogue history driven by the visual description and the current question, respectively; W_1 denotes a fully connected layer, 1^T an all-one vector, Softmax the weight normalization function, and h_i the i-th dialogue-turn feature; the symbol "°" denotes matrix multiplication. H_c and H_q can adaptively enhance the semantics of the global question-driven graph in step 203; both are mapped into a common embedding semantic space and combined with the current-stage graph feature g_qc and the question feature Q:
μ = ReLU(MLP[σ(T[H_c; H_q]), σ(W_2 g_qc), σ(W_3 Q)])    (2)
where T denotes a transformation matrix, ReLU the ReLU activation function, MLP a multi-layer perceptron, σ an activation function, and W_2 and W_3 learnable parameters; μ is defined as the text feature used to keep enriching the node semantics of the current-stage graph and to serve as the query vector of the image query library.
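As an illustration of the text query library and of formula (2), the sketch below implements one plausible reading of this step in PyTorch. Because formula (1) is only reproduced as an image, the concatenation-plus-linear scoring used for the attention weights, the choice of sigmoid for σ, and all layer sizes are assumptions, not the patent's exact formulation.

    import torch
    import torch.nn as nn

    hid, m = 512, 10
    H = torch.randn(m, hid)        # dialogue-turn features h_i
    C = torch.randn(hid)           # visual-description feature
    Q = torch.randn(hid)           # current-question feature
    g_qc = torch.randn(hid)        # current-stage graph feature (see steps 203-204)

    W1 = nn.Linear(2 * hid, 1)     # scores each (dialogue turn, query) pair; assumed form of formula (1)

    def attend_history(query):
        paired = torch.cat([H, query.unsqueeze(0).expand(m, -1)], dim=-1)
        alpha = torch.softmax(W1(paired).squeeze(-1), dim=0)   # sensitivity weights over the turns
        return alpha @ H                                        # weighted sum of dialogue turns

    H_c, H_q = attend_history(C), attend_history(Q)             # description- / question-driven history

    # Formula (2): fuse the attended history, the graph feature and the question into mu.
    T, W2, W3 = nn.Linear(2 * hid, hid), nn.Linear(hid, hid), nn.Linear(hid, hid)
    mlp = nn.Linear(3 * hid, hid)
    sigma = torch.sigmoid                                        # sigma chosen as sigmoid (assumption)
    mu = torch.relu(mlp(torch.cat([sigma(T(torch.cat([H_c, H_q]))),
                                   sigma(W2(g_qc)),
                                   sigma(W3(Q))])))
    print(mu.shape)   # text feature that queries the image query library in the next step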
Similarly, a query library is built for the image to store the current question feature Q, the text feature μ and the current-stage graph feature g_t, and an attention-weighted summation of the visual features is performed:
[Formula (3) appears in the original as an image: V_q is the Softmax-weighted summation of the region features v_i, with attention scores produced from the question Q through the fully connected layer W_4.]
where v_i denotes the visual feature of the i-th region; V_μ and V_g can be obtained through the same calculation by replacing Q with μ and g_t respectively, and W_4 denotes a fully connected layer.
In order to prevent missing effective visual information, the perceived visual features are fused:
[Formula (4) appears in the original as an image: the attended visual features V_q, V_μ and V_g are fused into ζ.]
where ζ is defined as the visual feature used to continue enhancing the node semantics of the current-stage graph.
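A corresponding sketch for the image query library follows; since formulas (3) and (4) are reproduced only as images, the scoring layer applied to concatenated (region, query) pairs and the simple learned fusion producing ζ are assumptions.

    import torch
    import torch.nn as nn

    hid, n = 512, 36
    V = torch.randn(n, hid)        # region features v_i (assumed already projected to hid dims)
    Q = torch.randn(hid)           # current-question feature
    mu = torch.randn(hid)          # text feature from formula (2)
    g_t = torch.randn(hid)         # current-stage graph feature

    W4 = nn.Linear(2 * hid, 1)     # scoring layer; assumed form of formula (3)

    def attend_regions(query):
        paired = torch.cat([V, query.unsqueeze(0).expand(n, -1)], dim=-1)
        return torch.softmax(W4(paired).squeeze(-1), dim=0) @ V   # weighted sum over regions

    V_q, V_mu, V_g = attend_regions(Q), attend_regions(mu), attend_regions(g_t)

    # Formula (4): fuse the three attended views so no useful visual evidence is dropped.
    fuse = nn.Linear(3 * hid, hid)
    zeta = torch.relu(fuse(torch.cat([V_q, V_mu, V_g])))
    print(zeta.shape)   # visual feature used to keep enhancing the current-stage graph nodes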
203: performing entity recognition and relation detection on the dialogue history, constructing a basic graph structure with entities as nodes and relations as directed edges, and encoding it with GloVe word vectors;
there are a large number of noun entities and their relationship representations in the historical dialog, and these a priori knowledge can be used to construct a basic directed graph under the premise of guaranteeing semantic integrity, such as the visual description in fig. 2, "a group of young people is playing skateboards in a skateboarding field", which can be abstracted into "skateboarding field → (there) → young people → (playing) → skateboards". And for the extracted elements such as the relation, the entity, the attribute and the like, encoding the elements by using a GloVe word vector to obtain a basic directed graph G o
204: enhancing the graph semantics multiple times through iterative updating, feeding the graph features back to the dialogue history and the image content repeatedly, so that the whole information transfer process forms a closed loop and the close interaction between the modalities is mined.
This step aims to find subgraphs whose nodes carry specific semantics, which promotes both language understanding and visual grounding. In the first stage, the basic directed graph G_o is parsed from the dialogue. In the second stage it is initialized with the visual description and the current question:
G_qc = ReLU(W_7(σ(W_5[G_o, Q] + W_6[G_o, C])))    (5)
where G_qc, defined as the global question-driven graph, is used to enrich and take part in the dialogue rounds. Under the guidance of the question Q, the text feature μ is fused with the current-stage graph node features, so that the semantics of the related graph nodes are continuously and comprehensively enhanced:
[Formula (6) appears in the original as an image: under the guidance of the question, the text feature μ is fused with the current-stage graph node features to yield the text-driven graph G_t.]
where G_t, defined as the text-driven graph, feeds features back into the visual content. Similarly to the formula above, replacing μ with ζ yields the vision-driven graph G_v through the same calculation. The embodiment of the invention thus obtains graphs in four different states, G_o, G_qc, G_t and G_v, which can interact and perceive the dialogue and the image from multiple angles:
G_multi = MLP(G_o + G_qc + G_t + G_v)    (7)
where G_multi is defined as the multi-state fusion graph. Graph convolution is applied to G_o, G_qc, G_t, G_v and G_multi, and the node and directed-edge features are summed element by element to obtain the corresponding graph embedding features g_o, g_qc, g_t, g_v and g_multi.
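The sketch below walks through formulas (5) to (7) and the graph embedding step for a toy graph; since formula (6) appears only as an image, the linear fusion used for the text-driven and vision-driven updates, the choice of sigmoid for σ, the node count and all layer shapes are assumptions.

    import torch
    import torch.nn as nn

    hid, k = 512, 4                    # k graph nodes (illustrative)
    G_o = torch.randn(k, hid)          # basic directed-graph node features (GloVe-initialised, projected)
    E = torch.randn(k, k, hid)         # directed-edge features (zero where no edge exists)
    Q, C = torch.randn(hid), torch.randn(hid)
    mu, zeta = torch.randn(hid), torch.randn(hid)    # text / visual features from step 202

    def cat_node(G, x):
        """Concatenate a global feature x onto every node of G."""
        return torch.cat([G, x.unsqueeze(0).expand(G.size(0), -1)], dim=-1)

    # Formula (5): global question-driven graph, initialised from the question and the caption.
    W5, W6, W7 = nn.Linear(2 * hid, hid), nn.Linear(2 * hid, hid), nn.Linear(hid, hid)
    G_qc = torch.relu(W7(torch.sigmoid(W5(cat_node(G_o, Q)) + W6(cat_node(G_o, C)))))

    # Formula (6) is an image in the original; a linear fusion of the current graph nodes with
    # the text feature mu (and, analogously, the visual feature zeta) is assumed here.
    Wt, Wv = nn.Linear(2 * hid, hid), nn.Linear(2 * hid, hid)
    G_t = torch.relu(Wt(cat_node(G_qc, mu)))       # text-driven graph
    G_v = torch.relu(Wv(cat_node(G_qc, zeta)))     # vision-driven graph

    # Formula (7): multi-state fusion graph.
    mlp = nn.Sequential(nn.Linear(hid, hid), nn.ReLU())
    G_multi = mlp(G_o + G_qc + G_t + G_v)

    def graph_embed(G_nodes, G_edges):
        """Graph embedding: element-wise sum of node and directed-edge features
        (a simplification of the graph-convolution step described in the text)."""
        return G_nodes.sum(dim=0) + G_edges.sum(dim=(0, 1))

    g_o, g_qc, g_t, g_v, g_multi = (graph_embed(G, E) for G in (G_o, G_qc, G_t, G_v, G_multi))
    print(g_multi.shape)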
205: finally, integrating the graph features obtained by repeated iterative fusion with the visual, text and other features and sending them into a decoder, so as to realize collaborative representation of the multi-modal information and finally generate an accurate answer to the question posed by a human in the current scene.
The dialogue turns and image regions are re-mined with the graph embedding features; the whole reasoning process thus forms a closed loop, with sufficiently close information interaction between the modalities. g_multi integrates information over the interacting turns and visual regions and mines the correlation between language and vision:
[Formula (8) appears in the original as an image: g_multi attends over the dialogue-turn features in H to produce a graph-guided text feature; replacing H with V, the same calculation yields a graph-guided visual feature. The symbols of these two graph-guided features are likewise rendered as images in the original.]
Finally, the text and visual vectors (whose symbols appear in the original as images) are obtained through a self-attention module, and the two are semantically fused through a multi-layer perceptron:
[Formula (9) appears in the original as an image: the multi-layer perceptron fuses the graph-guided text and visual vectors into ψ.]
and psi is a multi-modal fusion feature, and is sent to a decoder to generate an answer to the current question.
By introducing the graph structure and designing an iterative updating scheme for the graph, the embodiment of the invention ensures close interaction among the multi-modal information, thereby mining the semantics shared by the different modalities, meeting the requirements of actual scenes, and refining the reasoning process. The graph-aware closed-loop reasoning method for visual dialogue outperforms current mainstream methods and can fully perceive the multi-modal information so as to generate answers that fit the scene of the current question.
Based on the same inventive concept, an embodiment of the present invention further provides a device for generating visual dialog answers based on graph perception, referring to fig. 3, where the device includes: a processor 1 and a memory 2, the memory 2 having stored therein program instructions, the processor 1 calling the program instructions stored in the memory 2 to cause the apparatus to perform the following method steps in an embodiment:
respectively constructing query libraries according to the properties of each modality, using a self-attention mechanism to assign weights to the feature vectors and sum them according to how sensitive the query elements are to the information within the modality, and retaining the modal features that benefit the reasoning process in the actual scene;
performing entity recognition and relation detection on the dialogue history, constructing a basic graph structure with entities as nodes and relations as directed edges, and encoding the entities and relations with GloVe word vectors;
enhancing the graph semantics multiple times through iterative updating, so that the in-graph information is fed back to the dialogue history and the image content repeatedly; the information transfer process forms a closed loop, which is used to fully mine the interaction between the modalities;
integrating the graph features obtained by repeated iterative fusion with the visual and text features and then sending them into a decoder, so as to realize collaborative representation of the multi-modal information and generate an answer to the question posed in the current scene.
Wherein the device further comprises:
and encoding text information such as the visual description and the dialogue history with a long short-term memory network (LSTM), initializing the basic directed graph from the text information, and extracting picture features with Faster R-CNN.
Further, constructing the basic graph structure specifically comprises:
identifying the entities and relations in the text information according to its syntactic structure and semantics, initializing a directed graph, performing global semantic enhancement on every node in the graph with the visual description and question features, integrating the graph node features and sending them respectively into the dialogue history and picture features, and, after selection by all the query vectors in the query libraries, performing semantic enhancement on the graph nodes with the question-related dialogue features and picture features.
Wherein integrating the graph features obtained by repeated iterative fusion with the visual and text features and then sending them into a decoder to realize collaborative representation of the multi-modal information and generate an answer to the question posed in the current scene specifically comprises:
and performing node-level feature fusion on the multi-stage high-order semantic graphs, embedding them to generate a graph vector, feeding the graph vector back to the relevant dialogue turns and picture regions again, performing vector concatenation and weighted summation, and obtaining the answer reasoning feature after a multi-layer perceptron and an activation function.
Wherein the device further comprises:
using a fully connected layer, a multi-layer perceptron, an activation function and self-attention to jointly embed the text, visual and graph features.
It should be noted that the device description in the above embodiments corresponds to the method description in the embodiments, and the embodiments of the present invention are not described herein again.
The processor 1 and the memory 2 may be implemented by computers, single-chip microcomputers, microcontrollers and other devices with computing capability; the specific implementation is not limited in the embodiments of the present invention and is chosen according to the needs of the practical application.
The data signal is transmitted between the memory 2 and the processor 1 through the bus 3, which is not described in detail in the embodiment of the present invention.
Based on the same inventive concept, an embodiment of the present invention further provides a computer-readable storage medium, where the storage medium includes a stored program, and when the program runs, the apparatus on which the storage medium is located is controlled to execute the method steps in the foregoing embodiments.
The computer readable storage medium includes, but is not limited to, flash memory, hard disk, solid state disk, and the like.
It should be noted that the descriptions of the readable storage medium in the above embodiments correspond to the descriptions of the method in the embodiments, and the descriptions of the embodiments of the present invention are not repeated here.
In the above embodiments, all or part of the implementation may be realized by software, hardware, firmware, or any combination thereof. When implemented in software, may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. The procedures or functions according to the embodiments of the invention may be carried out in whole or in part when the computer program instructions are loaded and executed on a computer.
The computer may be a general purpose computer, a special purpose computer, a network of computers, or other programmable device. The computer instructions may be stored on or transmitted over a computer-readable storage medium. The computer-readable storage medium can be any available medium that can be accessed by a computer or a data storage device, such as a server, a data center, etc., that incorporates one or more of the available media. The usable medium may be a magnetic medium or a semiconductor medium, etc.
References:
[1] S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. L. Zitnick, and D. Parikh. VQA: Visual question answering. In ICCV, pages 2425-2433, 2015.
[2] L. Chen, Z. Jiang, J. Xiao, and W. Liu. Human-like controllable image captioning with verb-specific semantic roles. In CVPR, pages 16846-16856, 2021.
[3] T. Chen, W. Yu, R. Chen, and L. Lin. Knowledge-embedded routing network for scene graph generation. In CVPR, pages 6163-6171, 2019.
[4] Y. Cho and I. Kim. NMN-VD: A neural module network for visual dialog. Sensors, page 931, 2021.
[5] A. Das, S. Kottur, K. Gupta, et al. Visual dialog. In CVPR, 2017.
[6] G. Kang, J. Lim, and B. Zhang. Dual attention networks for visual reference resolution in visual dialog. In EMNLP-IJCNLP, pages 2024-2033, 2019.
[7] D. Guo, H. Wang, S. Wang, and M. Wang. Textual-visual reference-aware attention network for visual dialog. IEEE Trans. Image Process., pages 6655-6666, 2020.
[8] Z. Zheng, W. Wang, S. Qi, and S. Zhu. Reasoning visual dialogs with structural and partial observations. In CVPR, pages 6669-6678, 2019.
[9] I. Schwartz, S. Yu, T. Hazan, and A. G. Schwing. Factor graph attention. In CVPR, pages 2039-2048, 2019.
[10] L. Zhao, L. Gao, and J. Song. Adaptive visual memory network for visual dialog. Journal of University of Electronic Science and Technology of China, 2021, 50(05): 749-.
[11] Y. Niu and W. Zhang. A survey of visual question answering and dialogue. Computer Science, 2021, 48(03): 87-96.
In the embodiment of the present invention, except for the specific description of the model of each device, the model of other devices is not limited as long as the device can perform the above functions.
Those skilled in the art will appreciate that the drawings are only schematic illustrations of preferred embodiments, and that the serial numbers of the above embodiments of the present invention are merely for description and do not represent the relative merits of the embodiments.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims (7)

1. A method for generating visual dialog answers based on graph perception, the method comprising the steps of:
respectively constructing query libraries according to the properties of each modality, using a self-attention mechanism to assign weights to the feature vectors and sum them according to how sensitive the query elements are to the information within the modality, and retaining the modal features that benefit the reasoning process in the actual scene;
carrying out entity recognition and relation detection on the dialogue history, constructing a basic graph structure with entities as nodes and relations as directed edges, and encoding the entities and relations with GloVe word vectors;
enhancing the graph semantics multiple times through iterative updating, so as to feed the graph features back to the dialogue history and the image content repeatedly, make the information transfer process form a closed loop, and mine the interaction between the modalities;
and integrating the graph features obtained by repeated iterative fusion with the visual and text features and then sending them into a decoder, so as to realize collaborative representation of the multi-modal information and generate an answer to the question posed in the current scene.
2. The method of claim 1, further comprising:
and encoding text information such as the visual description and the dialogue history with a long short-term memory network (LSTM), initializing the basic directed graph from the text information, and extracting picture features with Faster R-CNN.
3. The method according to claim 1, wherein constructing the basic graph structure specifically comprises:
identifying the entities and relations in the text information according to its syntactic structure and semantics, initializing a directed graph, performing global semantic enhancement on every node in the graph with the visual description and question features, integrating the graph node features and sending them respectively into the dialogue history and picture features, and, after selection by all the query vectors in the query libraries, performing semantic enhancement on the graph nodes with the question-related dialogue features and picture features.
4. The method for generating visual dialogue answers according to claim 1, wherein integrating the graph features obtained by repeated iterative fusion with the visual and text features and then sending them into a decoder to realize collaborative representation of the multi-modal information and generate an answer to the question posed in the current scene specifically comprises:
and performing node-level feature fusion on the multi-stage high-order semantic graphs, embedding them to generate a graph vector, feeding the graph vector back to the relevant dialogue turns and picture regions again, performing vector concatenation and weighted summation, and obtaining the answer reasoning feature after a multi-layer perceptron and an activation function.
5. The method of claim 1, further comprising:
using a fully connected layer, a multi-layer perceptron, an activation function and self-attention to jointly embed the text, visual and graph features.
6. A visual dialog answer generation apparatus based on graph perception, the apparatus comprising: a processor and a memory, the memory having stored therein program instructions, the processor invoking the program instructions stored in the memory to cause the apparatus to perform the method steps of any of claims 1-5.
7. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program comprising program instructions which, when executed by a processor, cause the processor to carry out the method steps of any of claims 1-5.
CN202210685096.4A 2022-06-16 2022-06-16 Visual dialogue answer generation method and device based on graph perception Pending CN115129839A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210685096.4A CN115129839A (en) 2022-06-16 2022-06-16 Visual dialogue answer generation method and device based on graph perception

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210685096.4A CN115129839A (en) 2022-06-16 2022-06-16 Visual dialogue answer generation method and device based on graph perception

Publications (1)

Publication Number Publication Date
CN115129839A true CN115129839A (en) 2022-09-30

Family

ID=83377641

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210685096.4A Pending CN115129839A (en) 2022-06-16 2022-06-16 Visual dialogue answer generation method and device based on graph perception

Country Status (1)

Country Link
CN (1) CN115129839A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116340778A (en) * 2023-05-25 2023-06-27 智慧眼科技股份有限公司 Medical large model construction method based on multiple modes and related equipment thereof
CN116340778B (en) * 2023-05-25 2023-10-03 智慧眼科技股份有限公司 Medical large model construction method based on multiple modes and related equipment thereof
CN116862000A (en) * 2023-09-01 2023-10-10 浪潮电子信息产业股份有限公司 Causal thinking chain generation method, device and equipment for generating artificial intelligence
CN116862000B (en) * 2023-09-01 2024-01-23 浪潮电子信息产业股份有限公司 Causal thinking chain generation method, device and equipment for generating artificial intelligence
CN117290429A (en) * 2023-11-24 2023-12-26 山东焦易网数字科技股份有限公司 Method for calling data system interface through natural language
CN117290429B (en) * 2023-11-24 2024-02-20 山东焦易网数字科技股份有限公司 Method for calling data system interface through natural language

Similar Documents

Publication Publication Date Title
CN111522962B (en) Sequence recommendation method, device and computer readable storage medium
Zhou et al. A comprehensive survey on pretrained foundation models: A history from bert to chatgpt
Battaglia et al. Relational inductive biases, deep learning, and graph networks
CN110717017B (en) Method for processing corpus
CN112131366B (en) Method, device and storage medium for training text classification model and text classification
CN111782838B (en) Image question-answering method, device, computer equipment and medium
Gu et al. A systematic survey of prompt engineering on vision-language foundation models
CN115129839A (en) Visual dialogue answer generation method and device based on graph perception
CN110990555B (en) End-to-end retrieval type dialogue method and system and computer equipment
WO2021212601A1 (en) Image-based writing assisting method and apparatus, medium, and device
CN113553418B (en) Visual dialogue generation method and device based on multi-modal learning
CN116681810B (en) Virtual object action generation method, device, computer equipment and storage medium
EP4302234A1 (en) Cross-modal processing for vision and language
CN111881292A (en) Text classification method and device
CN115438160A (en) Question and answer method and device based on deep learning and electronic equipment
Sur MRRC: multiple role representation crossover interpretation for image captioning with R-CNN feature distribution composition (FDC)
Jedoui et al. Deep Bayesian active learning for multiple correct outputs
Li et al. Graph convolutional network meta-learning with multi-granularity POS guidance for video captioning
CN114328943A (en) Question answering method, device, equipment and storage medium based on knowledge graph
CN113869324A (en) Video common-sense knowledge reasoning implementation method based on multi-mode fusion
Lu et al. Coordinated-joint translation fusion framework with sentiment-interactive graph convolutional networks for multimodal sentiment analysis
CN114547308A (en) Text processing method and device, electronic equipment and storage medium
Lee et al. Language Model Using Differentiable Neural Computer Based on Forget Gate-Based Memory Deallocation.
Liu et al. Closed-loop reasoning with graph-aware dense interaction for visual dialog
CN113869518A (en) Visual common sense reasoning method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination