CN114691847A - Relational attention network visual question-answering method based on depth perception and semantic guidance

Relational attention network visual question-answering method based on depth perception and semantic guidance

Info

Publication number
CN114691847A
CN114691847A (application CN202210231121.1A)
Authority
CN
China
Prior art keywords
image
attention
correlation
visual
namely
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210231121.1A
Other languages
Chinese (zh)
Other versions
CN114691847B (en)
Inventor
魏巍
刘宇航
彭道万
刘逸帆
潘为燃
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huazhong University of Science and Technology
Original Assignee
Huazhong University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huazhong University of Science and Technology filed Critical Huazhong University of Science and Technology
Priority to CN202210231121.1A priority Critical patent/CN114691847B/en
Publication of CN114691847A publication Critical patent/CN114691847A/en
Application granted granted Critical
Publication of CN114691847B publication Critical patent/CN114691847B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/33 Querying
    • G06F 16/332 Query formulation
    • G06F 16/3329 Natural language query formulation or dialogue systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/33 Querying
    • G06F 16/3331 Query processing
    • G06F 16/334 Query execution
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/33 Querying
    • G06F 16/3331 Query processing
    • G06F 16/334 Query execution
    • G06F 16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/50 Depth or shape recovery
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/60 Analysis of geometric attributes
    • G06T 7/62 Analysis of geometric attributes of area, perimeter, diameter or volume
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/70 Determining position or orientation of objects or cameras
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Geometry (AREA)
  • Human Computer Interaction (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a relational attention network visual question-answering method based on depth perception and semantic guidance, which comprises the following steps: 1) construct the three-dimensional spatial relationship between image targets; 2) from this three-dimensional spatial relationship, obtain the correlation score of image targets i and j in the spatial dimension; 3) obtain the correlation between image targets i and j by combining implicit attention and explicit attention; 4) following the Transformer framework, replace the conventional self-attention layer with the improved attention mechanism to obtain the visual question-answering model. The invention introduces three-dimensional spatial correlation into the conventional self-attention mechanism and improves the accuracy of visual question answering.

Description

Relational attention network visual question-answering method based on depth perception and semantic guidance
Technical Field
The invention relates to natural language processing technology, and in particular to a relational attention network visual question-answering method based on depth perception and semantic guidance.
Background
Conventional visual question-answering methods are generally based on deep feature-fusion models, such as bilinear block-diagonal fusion (BLOCK) and self-attention fusion, but these methods have difficulty answering complex questions that require spatial-relationship reasoning. With the progress of deep learning, many studies based on deep neural network models have sought to improve the visual question-answering task. These methods generally extract visual representations of image targets and word-vector representations from the image and the text respectively as input, achieve multi-modal entity alignment through end-to-end training, and then predict the answer with a multi-class classification strategy. Recently, much research has built models on attention networks; although such models show excellent performance on the visual question-answering task, they do not consider the spatial or semantic relationships between image targets, and are therefore limited on complex questions that involve visual reasoning.
Disclosure of Invention
The technical problem to be solved by the invention is to provide, in view of the defects in the prior art, a relational attention network visual question-answering method based on depth perception and semantic guidance.
The technical solution adopted by the invention to solve this technical problem is as follows: a relational attention network visual question-answering method based on depth perception and semantic guidance, comprising the following steps:
1) Construction of the three-dimensional spatial relationship between image targets
The visual relationship in two-dimensional space is computed from the rectangular-box coordinates of the image targets. For image targets i and j this two-dimensional spatial relationship is denoted r_ij^2D; it is derived from the rectangular boxes of the two image targets by a fixed geometric calculation (formula given as an equation image in the original), where x_i, y_i, w_i and h_i denote the center-point coordinates, width and height of the rectangular box of image target i.
From the depth values dep_i and dep_j at the center points of the rectangular boxes of image targets i and j, the visual relationship in depth space, denoted r_ij^dep, is then computed (formula given as an equation image in the original), where S_ij denotes the area of the overlapping part of rectangular boxes i and j.
From the two-dimensional spatial relationship r_ij^2D and the depth spatial relationship r_ij^dep, the three-dimensional spatial relationship between the image targets, r_ij^3D, is obtained as
r_ij^3D = σ(W [r_ij^2D ; r_ij^dep]),
where W is a learnable weight parameter, d_s = 64 is the dimension of the explicit spatial-relationship representation, and σ is the ReLU activation function;
2) depth perception and semantic guidance attention mechanism
Using the explicitly modelled three-dimensional spatial relationship above, a correlation score in the spatial dimension between image targets i and j, denoted α_ij^spa, is computed by combining two terms. The term f_spa measures the correlation of the two image targets in the spatial dimension and is obtained as the dot product of the visual feature q_i of the i-th image target with the three-dimensional spatial-relationship representation r_ij^3D. The term f_sem measures the correlation between the spatial relationship of the two image targets and the text semantics; it uses a learnable weight parameter and the textual feature representation s of the question, taken from the last-layer feature of the [CLS] position of a BERT model;
3) combining implicit attention and explicit attention
The final correlation α_ij between image targets i and j is obtained by weighting the implicit correlation α_ij^im and the explicit correlation α_ij^spa;
4) attention mechanism incorporated into visual question-answering model
Following the Transformer framework, the conventional self-attention layer is replaced by the improved attention mechanism α_ij, with all α_ij collected into a matrix A. The improved Transformer is then computed layer by layer on this matrix (the layer update is given as equation images in the original), where L is the number of Transformer layers and FFN is a multilayer perceptron (MLP) consisting of two fully connected layers with a ReLU-activated hidden layer, i.e.
FFN(X) = W2 σ(W1 X + b1) + b2,
where W1, b1, W2 and b2 are learnable parameters.
The invention has the following beneficial effects:
1. The invention introduces three-dimensional spatial correlation into the conventional self-attention mechanism and flexibly extends it, realizing explicit modelling and computation of the three-dimensional spatial relationships between image targets;
2. By modelling the three-dimensional spatial relationships between image targets and, on this basis, designing a depth-perception and semantic-guidance attention mechanism, two attention-weight bias terms, the correlation weights of the spatial dimension and of the semantic dimension, are introduced so that a more accurate spatial-correlation computation is performed between the input image targets, which improves the accuracy of visual question answering.
Drawings
The invention will be further described with reference to the accompanying drawings and examples, in which:
FIG. 1 is a schematic diagram of an overall model structure of the visual question answering method of the present invention;
FIG. 2 is a schematic structural diagram of a depth perception and semantic guidance relationship attention mechanism in the visual question-answering method of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail with reference to the following embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
As shown in fig. 1, a relational attention network visual question-answering method based on depth perception and semantic guidance comprises the following steps.
The invention provides a depth-perception and semantically guided relational attention network which, by explicitly modelling the three-dimensional spatial relationships between image targets, computes both a spatial correlation and a semantic correlation between the image targets. The method mainly comprises the following parts:
1) Construction of the three-dimensional spatial relationship between image targets
The visual relationship in two-dimensional space is computed from the rectangular-box coordinates of the image targets. For image targets i and j this two-dimensional spatial relationship is denoted r_ij^2D; it is derived from the rectangular boxes of the two image targets by a fixed geometric calculation (formula given as an equation image in the original), where x_i, y_i, w_i and h_i denote the center-point coordinates, width and height of the rectangular box of image target i.
From the depth values dep_i and dep_j at the center points of the rectangular boxes of image targets i and j, the visual relationship in depth space, denoted r_ij^dep, is then computed (formula given as an equation image in the original), where S_ij denotes the area of the overlapping part of rectangular boxes i and j.
From the two-dimensional spatial relationship r_ij^2D and the depth spatial relationship r_ij^dep, the three-dimensional spatial relationship between the image targets, r_ij^3D, is obtained as
r_ij^3D = σ(W [r_ij^2D ; r_ij^dep]),
where W is a learnable weight parameter, d_s = 64 is the dimension of the explicit spatial-relationship representation, and σ is the ReLU activation function.
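For illustration, the following minimal Python (NumPy) sketch shows how step 1) can be realized. The exact two-dimensional and depth formulas are given only as equation images in the original text, so a widely used relative-geometry encoding and a simple depth-difference/overlap feature are assumed here as stand-ins; only the fusion r_ij^3D = σ(W [r_ij^2D ; r_ij^dep]) with d_s = 64 and σ = ReLU follows the stated definition.

import numpy as np

rng = np.random.default_rng(0)
d_s = 64  # dimension of the explicit spatial-relationship representation (stated)

def relation_2d(box_i, box_j, eps=1e-6):
    # Boxes are (cx, cy, w, h).  The patent gives the 2-D formula only as an
    # image; the common relative-geometry encoding is assumed as a stand-in.
    xi, yi, wi, hi = box_i
    xj, yj, wj, hj = box_j
    return np.array([np.log(abs(xi - xj) / wi + eps),
                     np.log(abs(yi - yj) / hi + eps),
                     np.log(wj / wi),
                     np.log(hj / hi)])

def relation_depth(dep_i, dep_j, box_i, box_j, eps=1e-6):
    # Depth relation from the center-point depths dep_i, dep_j and the overlap
    # area of the two boxes (assumed form; the original formula is an image).
    xi, yi, wi, hi = box_i
    xj, yj, wj, hj = box_j
    ix = max(0.0, min(xi + wi / 2, xj + wj / 2) - max(xi - wi / 2, xj - wj / 2))
    iy = max(0.0, min(yi + hi / 2, yj + hj / 2) - max(yi - hi / 2, yj - hj / 2))
    overlap = ix * iy
    return np.array([dep_i - dep_j, overlap / (wi * hi + eps)])

def relation_3d(box_i, box_j, dep_i, dep_j, W):
    # r3d_ij = ReLU(W [r2d_ij ; rdep_ij]), as stated in step 1.3);
    # W is the learnable weight, here of shape (d_s, 6).
    r = np.concatenate([relation_2d(box_i, box_j),
                        relation_depth(dep_i, dep_j, box_i, box_j)])
    return np.maximum(W @ r, 0.0)

W = rng.normal(scale=0.1, size=(d_s, 6))             # placeholder for the learned weight
box_a, box_b = (50.0, 60.0, 40.0, 30.0), (80.0, 65.0, 20.0, 25.0)
r3d_ab = relation_3d(box_a, box_b, dep_i=3.2, dep_j=5.1, W=W)
print(r3d_ab.shape)                                   # (64,)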
2) Depth-perception and semantic-guidance attention mechanism
As shown in fig. 2, the explicitly modelled three-dimensional spatial relationship above is used to compute a correlation score in the spatial dimension between image targets i and j, denoted α_ij^spa, by combining two terms. The term f_spa measures the correlation of the two image targets in the spatial dimension and is obtained as the dot product of the visual feature q_i of the i-th image target with the three-dimensional spatial-relationship representation r_ij^3D. The term f_sem measures the correlation between the spatial relationship of the two image targets and the text semantics; it uses a learnable weight parameter and the textual feature representation s of the question, taken from the last-layer feature of the [CLS] position of a BERT model.
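A corresponding sketch of step 2) is given below. The source specifies that f_spa is a dot product between the visual feature q_i and r_ij^3D, and that f_sem relates r_ij^3D to the BERT [CLS] feature s of the question through a learnable weight; the projection matrices W_spa and W_sem and the additive combination of the two terms are assumptions, since the exact formulas appear only as equation images.

import numpy as np

rng = np.random.default_rng(1)
d_h, d_s = 512, 64            # feature dimension (assumed) and relation dimension (stated)

W_spa = rng.normal(scale=0.02, size=(d_h, d_s))   # assumed projection of r3d for f_spa
W_sem = rng.normal(scale=0.02, size=(d_h, d_s))   # assumed learnable weight for f_sem

def f_spa(q_i, r3d_ij):
    # correlation in the spatial dimension: dot product of the visual feature
    # q_i with (a projection of) the 3-D relation, as described in step 2).
    return float(q_i @ (W_spa @ r3d_ij))

def f_sem(s, r3d_ij):
    # semantic guidance: correlation between the question's BERT [CLS] feature s
    # and the 3-D relation; a projected dot product is assumed here.
    return float(s @ (W_sem @ r3d_ij))

def alpha_spa(q_i, s, r3d_ij):
    # explicit correlation score of image targets i and j; the combination
    # of the two terms is assumed to be additive.
    return f_spa(q_i, r3d_ij) + f_sem(s, r3d_ij)

q_i = rng.normal(size=d_h)      # RoI visual feature of image target i
s = rng.normal(size=d_h)        # last-layer [CLS] feature of the question
r3d_ij = rng.normal(size=d_s)   # output of step 1)
print(round(alpha_spa(q_i, s, r3d_ij), 4))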
3) Combining implicit attention and explicit attention
The final correlation α_ij between image targets i and j is obtained by weighting the implicit correlation α_ij^im and the explicit correlation α_ij^spa.
The original Transformer model uses an implicitly correlated self-attention mechanism to compute the correlation between input entries. Let the feature matrix formed by the RoI features of the input image targets be X ∈ R^(N×d_h), where N is the number of detected image targets and d_h is the feature dimension. To measure the implicit relationship between image targets, the invention first uses a scaled dot product f(·,·) to compute the implicit correlation between image targets i and j, and then applies a softmax function over all image-target neighbours to normalize it into the correlation score α_ij^im. Specifically, the input features X are first mapped into the query, key and value hidden spaces and then used to measure the implicit correlation between two image targets, namely:
q_i = W_q x_i, k_j = W_k x_j, v_j = W_v x_j,
f(q_i, k_j) = q_i · k_j / sqrt(d_h),
α_ij^im = exp(f(q_i, k_j)) / Σ_j' exp(f(q_i, k_j')),
where W_q, W_k and W_v are learnable fully connected layer parameters, x_i and x_j are the visual features of the i-th and j-th image targets, q_i, k_j and v_j are the visual features mapped into the hidden space, f(·,·) is the scaled dot-product function, and exp(·) is the exponential function with base e.
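The implicit branch follows the standard scaled dot-product self-attention described above; the sketch below reproduces it with NumPy and then mixes it with a stand-in explicit score matrix using the average weighting stated in claim 5.

import numpy as np

rng = np.random.default_rng(2)
N, d_h = 5, 512                  # number of detected image targets, feature dimension

X = rng.normal(size=(N, d_h))    # RoI feature matrix of the image targets
W_q = rng.normal(scale=0.02, size=(d_h, d_h))
W_k = rng.normal(scale=0.02, size=(d_h, d_h))

def implicit_attention(X):
    # scaled dot-product self-attention over the image targets:
    # q_i = W_q x_i, k_j = W_k x_j, then a softmax over all neighbours j.
    Q, K = X @ W_q.T, X @ W_k.T
    scores = Q @ K.T / np.sqrt(d_h)
    scores -= scores.max(axis=1, keepdims=True)      # numerical stability
    E = np.exp(scores)
    return E / E.sum(axis=1, keepdims=True)          # alpha^im_ij

A_im = implicit_attention(X)

# step 3): combine with the explicit (depth- and semantics-aware) scores;
# a row-normalized random matrix stands in for alpha^spa, and the 0.5/0.5
# mix follows the average weighting stated in claim 5.
A_spa = rng.random((N, N))
A_spa = A_spa / A_spa.sum(axis=1, keepdims=True)
A = 0.5 * A_im + 0.5 * A_spa
print(A.sum(axis=1))             # each row still sums to 1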
By combining the implicit and explicit attention mechanisms in this way, the relevance between image targets is measured both in the feature dimension and in the spatial dimension. Compared with the original Transformer, which only considers the feature-level relevance of its input, the method also considers the relevance of the input in the spatial dimension, which improves its ability to answer complex questions involving visual reasoning.
4) Attention mechanism incorporated into visual question-answering model
Following the Transformer framework, the conventional self-attention layer is replaced by the improved attention mechanism α_ij, with all α_ij collected into a matrix A ∈ R^(N×N). The improved Transformer is then computed layer by layer on this matrix (the layer update is given as equation images in the original), where L is the number of Transformer layers and FFN is a multilayer perceptron (MLP) consisting of two fully connected layers with a ReLU-activated hidden layer, i.e.
FFN(X) = W2 σ(W1 X + b1) + b2,
where W1, b1, W2 and b2 are learnable parameters.
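Finally, a sketch of the modified Transformer layer of step 4): the learned correlation matrix A (all α_ij) replaces the self-attention weights, and the stated FFN(X) = W2 σ(W1 X + b1) + b2 is applied. The residual connections and the exact placement of the value projection are assumptions, since the layer update is given only as equation images.

import numpy as np

rng = np.random.default_rng(3)
N, d_h, d_ff, L = 5, 512, 1024, 4     # targets, hidden dim, FFN dim (assumed), layers

def ffn(X, W1, b1, W2, b2):
    # FFN(X) = W2 * ReLU(W1 X + b1) + b2, exactly as stated in step 4)
    return np.maximum(X @ W1.T + b1, 0.0) @ W2.T + b2

def improved_layer(X, A, params):
    # One improved Transformer layer: the self-attention weights are replaced
    # by the combined correlation matrix A (all alpha_ij).  The value
    # projection and residual connections are assumptions.
    W_v, W1, b1, W2, b2 = params
    H = A @ (X @ W_v.T) + X            # relation-weighted values + residual
    return ffn(H, W1, b1, W2, b2) + H  # position-wise FFN + residual

def init_params():
    return (rng.normal(scale=0.02, size=(d_h, d_h)),   # W_v
            rng.normal(scale=0.02, size=(d_ff, d_h)),  # W1
            np.zeros(d_ff),                            # b1
            rng.normal(scale=0.02, size=(d_h, d_ff)),  # W2
            np.zeros(d_h))                             # b2

X = rng.normal(size=(N, d_h))          # RoI features of the image targets
A = np.full((N, N), 1.0 / N)           # stand-in for the combined alpha matrix
for _ in range(L):                     # stack L layers as described
    X = improved_layer(X, A, init_params())
print(X.shape)                         # (5, 512)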
The invention thus provides a neural network architecture for the visual question-answering task that models both implicit and explicit relationships between image targets; by constructing the spatial and semantic relationships between image targets implicitly and explicitly, subsequent relational reasoning is better supported. The depth-perception and semantic-guidance relation attention module proposed by the invention is incorporated into the self-attention layer of the Transformer framework: a layer measuring the similarity between the spatial relationships of image targets and the text semantics is added, and the original self-attention weights are adjusted to obtain a new image-target correlation matrix, which reflects the correlation between image targets at the relational level.
Experiments show that, compared with existing mainstream methods, the method provided by the invention achieves better results. The experiments are evaluated on two benchmark visual question-answering datasets, Visual Question Answering v2 (VQA v2) and GQA. Detailed information about the datasets is shown in Table 1.
Table 1. Dataset information
The experimental section evaluates the effectiveness of the proposed visual question-answering model on the different datasets. Specifically, accuracy on the VQA v2 and GQA datasets is used as the evaluation metric, and the comparison results are given in Table 2 and Table 3, respectively.
Table 2. Comparison of experimental results on the VQA v2 dataset
Table 3. Comparison of experimental results on the GQA dataset
It is noteworthy that, as the two tables show, the proposed method consistently outperforms all of the reference models on the different visual question-answering tasks. Most of these models focus attention on the image-target entities themselves and ignore the modelling of the spatial and semantic relationships between image targets, so they lack the ability to reason across image targets. By explicitly modelling the three-dimensional spatial position features of the image targets and incorporating the proposed attention mechanism into the neural network structure, the method can explicitly model the relationships between image targets and thereby support relational reasoning between them.
It will be understood that modifications and variations can be made by persons skilled in the art in light of the above teachings and all such modifications and variations are intended to be included within the scope of the invention as defined in the appended claims.

Claims (5)

1. A relational attention network visual question-answering method based on depth perception and semantic guidance, characterized by comprising the following steps:
1) construction of the three-dimensional spatial relationship between image targets
1.1) using the rectangular-box coordinates of the image targets, compute the visual relationship in two-dimensional space; for image targets i and j, this two-dimensional spatial relationship is denoted r_ij^2D;
1.2) from the depth values dep_i and dep_j at the center points of the rectangular boxes of image targets i and j, compute the visual relationship in depth space, r_ij^dep;
1.3) from the two-dimensional spatial relationship r_ij^2D and the depth spatial relationship r_ij^dep between image targets i and j, obtain the three-dimensional spatial relationship between the image targets, r_ij^3D, namely:
r_ij^3D = σ(W [r_ij^2D ; r_ij^dep]),
wherein W is a learnable weight parameter, d_s is the dimension of the explicit spatial-relationship representation, and σ is the ReLU activation function;
2) depth-perception and semantic-guidance attention mechanism
from the three-dimensional spatial relationship between the image targets, obtain the correlation score α_ij^spa of image targets i and j in the spatial dimension, wherein f_spa computes the correlation of the two image targets in the spatial dimension and is obtained as the dot product of the visual feature q_i of the i-th image target with the three-dimensional spatial-relationship representation r_ij^3D, and f_sem computes the correlation between the spatial relationship of the two image targets and the text semantics, using a learnable weight parameter and the textual feature representation s of the question, taken from the last-layer feature of the [CLS] position of a BERT model;
3) combining implicit attention and explicit attention
the correlation α_ij between image targets i and j is obtained by weighting the implicit correlation α_ij^im and the explicit correlation α_ij^spa;
4) incorporating the attention mechanism into the visual question-answering model
following the Transformer framework, the conventional self-attention layer is replaced by the improved attention mechanism α_ij, with all α_ij expressed in matrix form as A; the improved Transformer is computed accordingly, wherein L is the number of Transformer layers and FFN is a multilayer perceptron consisting of two fully connected layers with a ReLU-activated hidden layer, namely:
FFN(X) = W2 σ(W1 X + b1) + b2,
wherein W1, b1, W2 and b2 are learnable parameters.
2. The relational attention network visual question-answering method based on depth perception and semantic guidance according to claim 1, characterized in that, in step 1.1), the two-dimensional spatial relationship r_ij^2D is computed from the rectangular boxes of the two image targets, wherein x_i, y_i, w_i and h_i respectively denote the center-point coordinates, width and height of the rectangular box of image target i, and x_j, y_j, w_j and h_j respectively denote the center-point coordinates, width and height of the rectangular box of image target j.
3. The relational attention network visual question-answering method based on depth perception and semantic guidance according to claim 1, characterized in that, in step 1.2), the visual relationship in depth space r_ij^dep is computed, wherein w_i and h_i are the width and height of the rectangular box of image target i, w_j and h_j are the width and height of the rectangular box of image target j, and S_ij is the area of the overlapping part of the rectangular box of image target i and the rectangular box of image target j.
4. The relational attention network visual question-answering method based on depth perception and semantic guidance according to claim 1, characterized in that, in step 3), the implicit correlation α_ij^im is computed with the self-attention mechanism of the Transformer model, specifically:
q_i = W_q x_i, k_j = W_k x_j, v_j = W_v x_j,
α_ij^im = exp(f(q_i, k_j)) / Σ_j' exp(f(q_i, k_j')),
wherein W_q, W_k and W_v are learnable fully connected layer parameters, x_i and x_j are the visual features of the i-th and j-th image targets, q_i, k_j and v_j are the visual features mapped into the hidden space, f(·,·) is the scaled dot-product function, and exp(·) is the exponential function with base e.
5. The relational attention network visual question-answering method based on depth perception and semantic guidance according to claim 1, characterized in that, in step 3), the correlation α_ij between image targets i and j is obtained by average weighting of the implicit correlation α_ij^im and the explicit correlation α_ij^spa, namely:
α_ij = (α_ij^im + α_ij^spa) / 2.
CN202210231121.1A 2022-03-10 2022-03-10 Relation attention network vision question-answering method based on depth perception and semantic guidance Active CN114691847B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210231121.1A CN114691847B (en) 2022-03-10 2022-03-10 Relation attention network vision question-answering method based on depth perception and semantic guidance

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210231121.1A CN114691847B (en) 2022-03-10 2022-03-10 Relation attention network vision question-answering method based on depth perception and semantic guidance

Publications (2)

Publication Number Publication Date
CN114691847A true CN114691847A (en) 2022-07-01
CN114691847B CN114691847B (en) 2024-04-26

Family

ID=82138315

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210231121.1A Active CN114691847B (en) 2022-03-10 2022-03-10 Relation attention network vision question-answering method based on depth perception and semantic guidance

Country Status (1)

Country Link
CN (1) CN114691847B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2711670A1 (en) * 2012-09-21 2014-03-26 Technische Universität München Visual localisation
US20180322646A1 (en) * 2016-01-05 2018-11-08 California Institute Of Technology Gaussian mixture models for temporal depth fusion
CN110377710A (en) * 2019-06-17 2019-10-25 杭州电子科技大学 A kind of vision question and answer fusion Enhancement Method based on multi-modal fusion
CN111984772A (en) * 2020-07-23 2020-11-24 中山大学 Medical image question-answering method and system based on deep learning
US20210081728A1 (en) * 2019-09-12 2021-03-18 Nec Laboratories America, Inc Contextual grounding of natural language phrases in images
EP3920048A1 (en) * 2020-06-02 2021-12-08 Siemens Aktiengesellschaft Method and system for automated visual question answering

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2711670A1 (en) * 2012-09-21 2014-03-26 Technische Universität München Visual localisation
US20180322646A1 (en) * 2016-01-05 2018-11-08 California Institute Of Technology Gaussian mixture models for temporal depth fusion
CN110377710A (en) * 2019-06-17 2019-10-25 杭州电子科技大学 A kind of vision question and answer fusion Enhancement Method based on multi-modal fusion
US20210081728A1 (en) * 2019-09-12 2021-03-18 Nec Laboratories America, Inc Contextual grounding of natural language phrases in images
EP3920048A1 (en) * 2020-06-02 2021-12-08 Siemens Aktiengesellschaft Method and system for automated visual question answering
CN111984772A (en) * 2020-07-23 2020-11-24 中山大学 Medical image question-answering method and system based on deep learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
HUASONG ZHONG ET AL.: "Self-adaptive neural module transformer for visual question answering", IEEE Transactions on Multimedia, vol. 23, 18 May 2020 (2020-05-18), pages 1264-1273, XP011851901, DOI: 10.1109/TMM.2020.2995278 *
QIU NAN ET AL.: "Research on visual question answering model based on composite image-text features" (in Chinese), Application Research of Computers, vol. 38, no. 08, 23 April 2021 (2021-04-23), pages 2293-2298 *

Also Published As

Publication number Publication date
CN114691847B (en) 2024-04-26

Similar Documents

Publication Publication Date Title
Gan et al. Sparse attention based separable dilated convolutional neural network for targeted sentiment analysis
CN110263912B (en) Image question-answering method based on multi-target association depth reasoning
CN113610126B (en) Label-free knowledge distillation method based on multi-target detection model and storage medium
CN109766427B (en) Intelligent question-answering method based on collaborative attention for virtual learning environment
CN113191357B (en) Multilevel image-text matching method based on graph attention network
CN110309503A (en) A kind of subjective item Rating Model and methods of marking based on deep learning BERT--CNN
CN111897944B (en) Knowledge graph question-answering system based on semantic space sharing
CN111242197B (en) Image text matching method based on double-view semantic reasoning network
Jiang et al. An eight-layer convolutional neural network with stochastic pooling, batch normalization and dropout for fingerspelling recognition of Chinese sign language
CN114936623B (en) Aspect-level emotion analysis method integrating multi-mode data
CN111563149A (en) Entity linking method for Chinese knowledge map question-answering system
CN111611367B (en) Visual question-answering method introducing external knowledge
CN114973125A (en) Method and system for assisting navigation in intelligent navigation scene by using knowledge graph
CN112632250A (en) Question and answer method and system under multi-document scene
CN116611024A (en) Multi-mode trans mock detection method based on facts and emotion oppositivity
CN115331075A (en) Countermeasures type multi-modal pre-training method for enhancing knowledge of multi-modal scene graph
Yu et al. Question classification based on MAC-LSTM
CN114595306A (en) Text similarity calculation system and method based on distance perception self-attention mechanism and multi-angle modeling
Zhou et al. Stock prediction based on bidirectional gated recurrent unit with convolutional neural network and feature selection
CN116187349A (en) Visual question-answering method based on scene graph relation information enhancement
CN110889340A (en) Visual question-answering model based on iterative attention mechanism
CN116303976B (en) Penetration test question-answering method, system and medium based on network security knowledge graph
CN114691847A (en) Relational attention network visual question-answering method based on deep perception and semantic guidance
Tian et al. Scene graph generation by multi-level semantic tasks
Xin et al. Knowledge-based intelligent education recommendation system with IoT networks

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant