CN114691847B - Relation attention network vision question-answering method based on depth perception and semantic guidance - Google Patents
- Publication number
- CN114691847B (application CN202210231121.1A)
- Authority
- CN
- China
- Prior art keywords
- image
- attention
- visual
- correlation
- depth
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G06F16/3329—Natural language query formulation or dialogue systems
- G06F16/334—Query execution
- G06F16/3344—Query execution using natural language analysis
- G06N3/04—Neural networks; architecture, e.g. interconnection topology
- G06T7/50—Image analysis; depth or shape recovery
- G06T7/62—Analysis of geometric attributes of area, perimeter, diameter or volume
- G06T7/70—Determining position or orientation of objects or cameras
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention discloses a relational attention network visual question-answering method based on depth perception and semantic guidance, which comprises the following steps: 1) constructing the three-dimensional spatial relationships between image targets; 2) obtaining the correlation score between image targets i and j in the spatial dimension from their three-dimensional spatial relationship; 3) obtaining the correlation between image targets i and j by combining implicit and explicit attention; 4) following the framework of the Transformer, replacing the traditional self-attention layer with the improved attention mechanism to obtain the visual question-answering model. The invention introduces three-dimensional spatial correlation into the traditional self-attention mechanism and improves the accuracy of visual question answering.
Description
Technical Field
The invention relates to natural language processing technology, and in particular to a relational attention network visual question-answering method based on depth perception and semantic guidance.
Background
Conventional visual question-answering methods are typically based on deep feature-fusion models, such as bilinear block-diagonal fusion (BLOCK) and self-attention fusion, but these methods struggle to answer complex questions that require spatial-relationship reasoning. With the advance of deep learning, many studies based on deep neural network models have focused on improving visual question-answering tasks: they typically extract image-target visual representations and word-vector representations from the image and text respectively, achieve multi-modal entity alignment through end-to-end training, and then predict answers with a multi-class classification strategy. Recently, many research works have built models on attention networks. Although these models perform excellently on visual question-answering tasks, they do not take the spatial or semantic relationships between image targets into account, which limits them on complex questions involving visual reasoning. Moreover, previous attention mechanisms only consider the correlation between image targets and text entities, not the correlation between visual spatial or semantic relationships and the text information, so such models lack understanding and reasoning capability for visual relationships.
Disclosure of Invention
Aiming at the defects in the prior art, the invention provides a relational attention network vision question-answering method based on depth perception and semantic guidance.
The technical solution adopted to solve the technical problem is as follows: a relational attention network visual question-answering method based on depth perception and semantic guidance, comprising the following steps:
1) Three-dimensional spatial relationship construction between image objects
Calculating the visual relationship in two-dimensional space using the rectangular-box coordinates of the image targets: the two-dimensional spatial relationship between image targets i and j, denoted r_ij^2d, is obtained by calculation from the rectangular boxes of the two image targets,
wherein (x_i, y_i), w_i and h_i are respectively the center-point coordinates, width and height of the rectangular box of image target i;
According to the depth distance values dep_i and dep_j of the rectangular-box center points of image targets i and j, the visual relationship r_ij^dep in depth space is then calculated,
wherein the overlap term is the area of the overlapping part of rectangular boxes i and j;
According to the two-dimensional spatial relationship r_ij^2d and the depth spatial relationship r_ij^dep between image targets i and j, the three-dimensional spatial relationship r_ij between the image targets can be obtained by projecting their combination with a learnable weight W_s,
wherein d_s = 64 is the dimension of the explicit spatial-relationship representation, and σ is the activation function ReLU;
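The construction in step 1) can be sketched in code. The patent's exact box-geometry formulas are not reproduced in this text, so the 2-D encoding below (log-ratio features, as commonly used in relation networks) and the depth term are illustrative assumptions; `spatial_relation_3d`, `W_s` and the feature layout are hypothetical names.

```python
import numpy as np

def spatial_relation_3d(box_i, box_j, dep_i, dep_j, W_s):
    """Sketch of the explicit 3-D spatial relation r_ij between two targets.

    box = (x, y, w, h): rectangle center coordinates, width, height.
    The exact 2-D and depth encodings are assumptions for illustration.
    """
    xi, yi, wi, hi = box_i
    xj, yj, wj, hj = box_j
    # assumed 2-D geometry features between rectangular boxes i and j
    r2d = np.array([
        np.log(abs(xj - xi) / wi + 1e-6),
        np.log(abs(yj - yi) / hi + 1e-6),
        np.log(wj / wi),
        np.log(hj / hi),
    ])
    # assumed depth relation from the center-point depth values
    rdep = np.array([dep_j - dep_i])
    # project the combined relation to d_s = 64 dimensions with
    # the ReLU activation (sigma in the text)
    return np.maximum(W_s @ np.concatenate([r2d, rdep]), 0.0)

# usage: W_s maps the 5 raw geometry features to d_s = 64 dimensions
W_s = np.random.default_rng(0).normal(size=(64, 5))
r_ij = spatial_relation_3d((10, 10, 4, 4), (14, 12, 4, 4), 1.0, 2.0, W_s)
```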
2) Depth aware and semantic guided attention mechanism
Using the explicitly modeled three-dimensional spatial relationship above, a correlation score α_ij^ex between image targets i and j in the spatial dimension is calculated from two terms, f_spa and f_sem,
wherein f_spa calculates the correlation of the two image targets in the spatial dimension and is obtained as the dot product of the visual feature q_i of the input i-th image target and the three-dimensional spatial-relationship representation r_ij;
f_sem calculates the correlation of the spatial relationship of the two image targets with the text semantics,
wherein a learnable weight parameter is used, and the text feature representation of the question is derived from the last-layer feature at the [CLS] position of the BERT model;
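A minimal sketch of the two score terms, assuming f_spa is the dot product q_i · r_ij described above and f_sem a bilinear similarity between the question's [CLS] feature and r_ij; the bilinear form, the additive combination, the softmax normalization and all names here are assumptions, not the patent's exact formulas.

```python
import numpy as np

def explicit_scores(q_i, R_i, q_text, W_sem):
    """Explicit correlation of image target i with every target j.

    q_i    : (d,)   visual feature of target i
    R_i    : (N, d) 3-D spatial relations r_ij for all j
    q_text : (t,)   [CLS] text feature of the question (e.g. from BERT)
    W_sem  : (d, t) learnable weight of the semantic-guidance term
    """
    f_spa = R_i @ q_i                 # spatial relevance: q_i . r_ij
    f_sem = R_i @ (W_sem @ q_text)    # relevance of r_ij to the question text
    s = f_spa + f_sem                 # assumed additive combination
    e = np.exp(s - s.max())           # softmax over neighbors j
    return e / e.sum()                # alpha^ex_ij, sums to 1

rng = np.random.default_rng(1)
alpha_ex = explicit_scores(rng.normal(size=8),
                           rng.normal(size=(5, 8)),
                           rng.normal(size=16),
                           rng.normal(size=(8, 16)))
```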
3) Combining implicit and explicit attention
The correlation α_ij between image targets i and j is ultimately obtained by weighting the implicit correlation α_ij^im and the explicit correlation α_ij^ex;
4) Attention mechanisms are incorporated into visual question-answering models
According to the framework of the Transformer, all α_ij are expressed in matrix form, i.e. the traditional self-attention layer is replaced with the improved attention mechanism α_ij; the improved Transformer applies the improved attention followed by a feed-forward network at each layer,
wherein L is the number of Transformer layers, and FFN consists of two fully connected layers, i.e. a multi-layer perceptron (MLP) with a ReLU-activated hidden layer:
FFN(X)=W2σ(W1X+b1)+b2
wherein W1, W2, b1 and b2 are learnable parameters.
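The layer update in step 4) can be sketched as follows. Residual connections and layer normalization, whose placement is not detailed in this text, are omitted, and all parameter names (`improved_layer`, `W_v`, etc.) are illustrative assumptions.

```python
import numpy as np

def ffn(X, W1, b1, W2, b2):
    """FFN(X) = W2 * ReLU(W1 X + b1) + b2, as given in the text."""
    return np.maximum(X @ W1 + b1, 0.0) @ W2 + b2

def improved_layer(X, A, W_v, W1, b1, W2, b2):
    """One improved Transformer layer: the self-attention weights are
    replaced by the combined correlation matrix A = (alpha_ij)."""
    H = A @ (X @ W_v)            # attend over targets with improved weights
    return ffn(H, W1, b1, W2, b2)

rng = np.random.default_rng(2)
N, d = 5, 8
X = rng.normal(size=(N, d))
A = np.full((N, N), 1.0 / N)     # a valid row-stochastic attention matrix
out = improved_layer(X, A,
                     rng.normal(size=(d, d)),                # W_v
                     rng.normal(size=(d, 16)), np.zeros(16), # W1, b1
                     rng.normal(size=(16, d)), np.zeros(d))  # W2, b2
```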
The invention has the beneficial effects that:
1. The invention introduces three-dimensional spatial correlation into the traditional self-attention mechanism and flexibly extends it to realize explicit modeling and calculation of the three-dimensional spatial relationships between image targets;
2. By modeling the three-dimensional spatial relationships between image targets and, on this basis, designing a depth-aware and semantic-guided attention mechanism, two different attention-weight bias terms (the correlation weights of the spatial dimension and of the semantic dimension) are introduced to perform more accurate spatial-correlation calculation between the input image targets, improving the accuracy of visual question answering.
Drawings
The invention will be further described with reference to the accompanying drawings and examples, in which:
FIG. 1 is a schematic diagram of the overall model structure of the visual question-answering method of the present invention;
Fig. 2 is a schematic structural diagram of a depth perception and semantic guidance relationship attention mechanism in the visual question-answering method of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the following examples in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
As shown in fig. 1, a relational attention network visual question-answering method based on depth perception and semantic guidance comprises the following steps:
The invention provides a depth-aware and semantically guided relational attention network, which computes the spatial correlation and the semantic relevance between image targets by explicitly modeling their three-dimensional spatial relationships. The method is mainly divided into the following parts:
1) Three-dimensional spatial relationship construction between image objects
Calculating the visual relationship in two-dimensional space using the rectangular-box coordinates of the image targets: the two-dimensional spatial relationship between image targets i and j, denoted r_ij^2d, is obtained by calculation from the rectangular boxes of the two image targets,
wherein (x_i, y_i), w_i and h_i are respectively the center-point coordinates, width and height of the rectangular box of image target i;
According to the depth distance values dep_i and dep_j of the rectangular-box center points of image targets i and j, the visual relationship r_ij^dep in depth space is then calculated,
wherein the overlap term is the area of the overlapping part of rectangular boxes i and j;
According to the two-dimensional spatial relationship r_ij^2d and the depth spatial relationship r_ij^dep between image targets i and j, the three-dimensional spatial relationship r_ij between the image targets can be obtained by projecting their combination with a learnable weight W_s,
wherein d_s = 64 is the dimension of the explicit spatial-relationship representation, and σ is the activation function ReLU;
2) Depth aware and semantic guided attention mechanism
As shown in FIG. 2, using the explicitly modeled three-dimensional spatial relationship above, a correlation score α_ij^ex between image targets i and j in the spatial dimension is calculated from two terms, f_spa and f_sem,
wherein f_spa calculates the correlation of the two image targets in the spatial dimension and is obtained as the dot product of the visual feature q_i of the input i-th image target and the three-dimensional spatial-relationship representation r_ij;
f_sem calculates the correlation of the spatial relationship of the two image targets with the text semantics,
wherein a learnable weight parameter is used, and the text feature representation of the question is derived from the last-layer feature at the [CLS] position of the BERT model;
3) Combining implicit and explicit attention
The correlation α_ij between image targets i and j is ultimately obtained by weighting the implicit correlation α_ij^im and the explicit correlation α_ij^ex;
The original Transformer model uses an implicit-correlation self-attention mechanism to compute the correlation between inputs. Let the feature matrix composed of the RoI features of the input image targets be X ∈ R^(N×d_h), where N is the number of detected image targets and d_h is the feature dimension. To measure the implicit relations between image targets, the invention first adopts a scaled dot product f(·,·) to compute the implicit correlation between image targets i and j, and then adopts a softmax function to normalize over all image-target neighbors, obtaining the correlation score α_ij^im. Specifically, the input feature X is first mapped into the query, key and value hidden spaces, which are then used to measure the implicit correlation between two image targets, namely:
qi=Wqxi
kj=Wkxj
vj=Wvxj
Wherein W_q, W_k and W_v are learnable fully connected layer parameters; x_i and x_j are the visual features of the i-th and j-th image targets; q_i, k_j and v_j are the visual features mapped into the hidden space; f(·,·) is the scaled dot-product function; and exp(·) is the exponential function with the natural number e as base.
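The implicit scaled dot-product attention above, and its combination with an explicit matrix (step 3), can be sketched as follows; the convex-combination weight `lam` is an assumption, since this text does not give the exact weighting scheme.

```python
import numpy as np

def combined_attention(X, W_q, W_k, alpha_ex, lam=0.5):
    """Implicit scaled dot-product attention alpha^im, softmax-normalized
    over neighbors j, then weighted with the explicit matrix alpha^ex."""
    N, d_h = X.shape
    Q, K = X @ W_q, X @ W_k
    scores = Q @ K.T / np.sqrt(d_h)                    # f(q_i, k_j)
    e = np.exp(scores - scores.max(axis=1, keepdims=True))
    alpha_im = e / e.sum(axis=1, keepdims=True)        # softmax over j
    return lam * alpha_im + (1 - lam) * alpha_ex       # combined alpha_ij

rng = np.random.default_rng(3)
N, d = 4, 8
X = rng.normal(size=(N, d))
alpha_ex = np.full((N, N), 1.0 / N)   # row-stochastic explicit matrix
A = combined_attention(X, rng.normal(size=(d, d)),
                       rng.normal(size=(d, d)), alpha_ex)
```

Since each row of `alpha_im` and of `alpha_ex` sums to 1, every row of the combined matrix also sums to 1.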
By combining the implicit and explicit attention mechanisms, the invention measures the correlation between image targets in both the feature dimension and the spatial dimension. Compared with the original Transformer, which only considers the correlation of the inputs at the feature level, the invention also considers their correlation in the spatial dimension, thereby improving the ability to answer complex questions involving visual reasoning.
4) Attention mechanisms are incorporated into visual question-answering models
According to the framework of the Transformer, all α_ij are expressed in matrix form, i.e. the traditional self-attention layer is replaced with the improved attention mechanism α_ij; the improved Transformer applies the improved attention followed by a feed-forward network at each layer,
wherein L is the number of Transformer layers, and FFN consists of two fully connected layers, i.e. a multi-layer perceptron (MLP) with a ReLU-activated hidden layer:
FFN(X)=W2σ(W1X+b1)+b2
wherein W1, W2, b1 and b2 are learnable parameters.
The invention provides a neural network architecture for visual question-answering tasks that models image-target relations both implicitly and explicitly; by constructing the spatial and semantic relationships between image targets implicitly and explicitly, subsequent relational reasoning is better realized. The depth-aware and semantic-guided relational attention module is integrated into the self-attention layer of the Transformer architecture, i.e. a layer measuring the similarity between image-target spatial relationships and text semantics is added, and a new image-target correlation matrix is obtained by adjusting the original self-attention weights; this correlation matrix reflects the correlations between image targets at the relational level.
Experiments show that, compared with existing mainstream methods, the method provided by the invention achieves better results. The experiments were evaluated on two benchmark visual question-answering datasets, Visual Question Answering v2 (VQA v2) and GQA. Details of the datasets are shown in Table 1.
Table 1 dataset information
The experimental section evaluates the effectiveness of the proposed visual question-answering model on different datasets. Specifically, we report accuracy on the VQA v2 and GQA datasets as the evaluation index of the model; experimental comparison results are given in Tables 2 and 3, respectively.
TABLE 2 VQA v2 dataset comparative experiment results
Table 3 GQA dataset comparative experiment results
From the two tables above, it can be observed that the proposed method consistently outperforms all of these baseline models on the different visual question-answering tasks. Most of these models focus attention on image-target entities while neglecting the modeling of the spatial and semantic relationships of image targets, so they lack the ability to reason between image targets. By explicitly modeling the three-dimensional spatial position features of image targets and incorporating them into the neural network structure through an attention mechanism, the proposed method explicitly models the relationships between image targets and thereby realizes relational reasoning between them.
It will be understood that modifications and variations will be apparent to those skilled in the art from the foregoing description, and it is intended that all such modifications and variations be included within the scope of the following claims.
Claims (4)
1. A relational attention network visual question-answering method based on depth perception and semantic guidance, characterized by comprising the following steps:
1) Three-dimensional spatial relationship construction between image objects
1.1) Calculating a visual relationship in two-dimensional space using the rectangular-box coordinates of image targets i and j, their two-dimensional spatial relationship being expressed as r_ij^2d;
1.2) Calculating the visual relationship r_ij^dep in depth space according to the depth distance values dep_i and dep_j of the rectangular-box center points of image targets i and j;
1.3) Obtaining the three-dimensional spatial relationship r_ij between the image targets according to the two-dimensional spatial relationship r_ij^2d and the depth spatial relationship r_ij^dep between image targets i and j,
wherein W_s is a learnable weight parameter, d_s is the dimension of the explicit spatial-relationship representation, and σ is the activation function ReLU;
2) Depth aware and semantic guided attention mechanism
Acquiring the correlation score α_ij^ex between image targets i and j in the spatial dimension according to the three-dimensional spatial relationship between the image targets,
wherein f_spa is used to calculate the correlation of the two image targets in the spatial dimension and is obtained as the dot product of the visual feature q_i of the input i-th image target and the three-dimensional spatial-relationship representation r_ij;
f_sem is used to calculate the correlation of the spatial relationship of the two image targets with the text semantics,
wherein a learnable weight parameter is used, and the text feature representation of the question is derived from the last-layer feature at the [CLS] position of the BERT model;
3) Combining implicit and explicit attention
The correlation α_ij between image targets i and j is obtained by weighting the implicit correlation α_ij^im and the explicit correlation α_ij^ex;
4) Attention mechanisms are incorporated into visual question-answering models
According to the framework of the Transformer, the improved attention mechanism α_ij is adopted to replace the traditional self-attention layer, and all α_ij are expressed in matrix form; the improved Transformer applies the improved attention followed by a feed-forward network at each layer,
wherein L is the number of Transformer layers, and FFN is a multi-layer perceptron with two fully connected layers and a ReLU-activated hidden layer,
wherein its weights and biases are learnable parameters.
2. The depth perception and semantic guidance based relational attention network visual question-answering method according to claim 1, wherein in step 1.1), the two-dimensional spatial relationship r_ij^2d is obtained by calculation from the rectangular boxes of the two image targets,
wherein (x_i, y_i), w_i and h_i are respectively the center-point coordinates, width and height of the rectangular box of image target i, and (x_j, y_j), w_j and h_j are respectively the center-point coordinates, width and height of the rectangular box of image target j.
3. The depth perception and semantic guidance based relational attention network visual question-answering method according to claim 1, wherein in step 1.2), the visual relationship r_ij^dep in depth space is calculated,
wherein w_i and h_i are the width and height of the rectangular box of image target i, w_j and h_j are the width and height of the rectangular box of image target j, and the overlap term is the area of the overlapping part of the rectangular boxes of image targets i and j.
4. The depth perception and semantic guidance based relational attention network visual question-answering method according to claim 1, wherein in step 3), the implicit correlation α_ij^im is obtained by a scaled dot product followed by softmax normalization,
wherein W_q and W_k are learnable fully connected layer parameters, x_i and x_j are respectively the visual features of the i-th and j-th image targets, q_i and k_j are the visual features mapped into the hidden space, f(·,·) is the scaled dot-product function, and exp(·) is the exponential function with the natural number e as base.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210231121.1A CN114691847B (en) | 2022-03-10 | 2022-03-10 | Relation attention network vision question-answering method based on depth perception and semantic guidance |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114691847A CN114691847A (en) | 2022-07-01 |
CN114691847B true CN114691847B (en) | 2024-04-26 |
Family
ID=82138315
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210231121.1A Active CN114691847B (en) | 2022-03-10 | 2022-03-10 | Relation attention network vision question-answering method based on depth perception and semantic guidance |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114691847B (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP2711670A1 (en) * | 2012-09-21 | 2014-03-26 | Technische Universität München | Visual localisation |
CN110377710A (en) * | 2019-06-17 | 2019-10-25 | 杭州电子科技大学 | A kind of vision question and answer fusion Enhancement Method based on multi-modal fusion |
CN111984772A (en) * | 2020-07-23 | 2020-11-24 | 中山大学 | Medical image question-answering method and system based on deep learning |
EP3920048A1 (en) * | 2020-06-02 | 2021-12-08 | Siemens Aktiengesellschaft | Method and system for automated visual question answering |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11461912B2 (en) * | 2016-01-05 | 2022-10-04 | California Institute Of Technology | Gaussian mixture models for temporal depth fusion |
US11620814B2 (en) * | 2019-09-12 | 2023-04-04 | Nec Corporation | Contextual grounding of natural language phrases in images |
Non-Patent Citations (2)
Title |
---|
Self-adaptive neural module transformer for visual question answering; Huasong Zhong et al.; IEEE Transactions on Multimedia; 2020-05-18; Vol. 23; pp. 1264-1273 *
Research on a visual question-answering model based on composite image-text features; Qiu Nan et al.; Application Research of Computers; 2021-04-23; Vol. 38, No. 08; pp. 2293-2298 *
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Gan et al. | Sparse attention based separable dilated convolutional neural network for targeted sentiment analysis | |
CN110222770B (en) | Visual question-answering method based on combined relationship attention network | |
CN111242197B (en) | Image text matching method based on double-view semantic reasoning network | |
Jiang et al. | An eight-layer convolutional neural network with stochastic pooling, batch normalization and dropout for fingerspelling recognition of Chinese sign language | |
Zeng et al. | Fine-grained image retrieval via piecewise cross entropy loss | |
CN114936623B (en) | Aspect-level emotion analysis method integrating multi-mode data | |
CN105976397B (en) | A kind of method for tracking target | |
CN112949622A (en) | Bimodal character classification method and device fusing text and image | |
CN114612767B (en) | Scene graph-based image understanding and expressing method, system and storage medium | |
CN112784782A (en) | Three-dimensional object identification method based on multi-view double-attention network | |
CN116611024A (en) | Multi-mode trans mock detection method based on facts and emotion oppositivity | |
CN116484042A (en) | Visual question-answering method combining autocorrelation and interactive guided attention mechanism | |
CN106021402A (en) | Multi-modal multi-class Boosting frame construction method and device for cross-modal retrieval | |
CN116187349A (en) | Visual question-answering method based on scene graph relation information enhancement | |
CN110889340A (en) | Visual question-answering model based on iterative attention mechanism | |
Zheng et al. | BDLA: Bi-directional local alignment for few-shot learning | |
CN117393098A (en) | Medical image report generation method based on visual priori and cross-modal alignment network | |
CN114691847B (en) | Relation attention network vision question-answering method based on depth perception and semantic guidance | |
CN116958740A (en) | Zero sample target detection method based on semantic perception and self-adaptive contrast learning | |
CN116071410A (en) | Point cloud registration method, system, equipment and medium based on deep learning | |
Zhao et al. | Episode-based personalization network for gaze estimation without calibration | |
CN114332623A (en) | Method and system for generating countermeasure sample by utilizing spatial transformation | |
Yu et al. | Research on folk handicraft image recognition based on neural networks and visual saliency | |
Wenhui et al. | Lidar image classification based on convolutional neural networks | |
Liu | Evaluation Algorithm of Teaching Work Quality in Colleges and Universities Based on Deep Denoising Autoencoder Network |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||