CN112184805B - Graph attention network construction method based on fusion of visual and spatial relations - Google Patents

Graph attention network construction method based on fusion of visual and spatial relations

Info

Publication number
CN112184805B
CN112184805B (application CN202010946723.6A)
Authority
CN
China
Prior art keywords
visual
node
graph
attention
central node
Prior art date
Legal status
Active
Application number
CN202010946723.6A
Other languages
Chinese (zh)
Other versions
CN112184805A (en)
Inventor
俞俊
杨艳
Current Assignee
Hangzhou Dianzi University
Original Assignee
Hangzhou Dianzi University
Priority date
Filing date
Publication date
Application filed by Hangzhou Dianzi University
Priority to CN202010946723.6A
Publication of CN112184805A
Application granted
Publication of CN112184805B


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/70Determining position or orientation of objects or cameras
    • G06T7/73Determining position or orientation of objects or cameras using feature-based methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/44Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V10/443Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
    • G06V10/449Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/60Type of objects
    • G06V20/62Text, e.g. of license plates, overlay texts or captions on TV images
    • G06V20/635Overlay text, e.g. embedded captions in a TV program
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10004Still image; Photographic image
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • Multimedia (AREA)
  • Biodiversity & Conservation Biology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a graph attention network construction method based on the fusion of visual and spatial relations. The method comprises the following steps. 1. Compute the visual feature and the absolute position feature of each target object in an input image; these two features form a dual-attribute node in the graph built from the input image, finally forming a graph. 2. Compute the spatial geometric relative position feature from each neighboring node to the central node in the graph. 3. Compute the attention weight between each neighboring node and the central node. 4. Compute the transfer information from each neighboring node to the central node. 5. Multiply the transfer information of all neighboring nodes corresponding to the central node by the corresponding attention weights and sum the results to obtain the aggregated information of the central node. The visual feature is updated with this information, while the absolute position feature remains unchanged. The invention can be used to assist the analysis of various visual scenes and is a general model that can be embedded into different visual tasks.

Description

Graph attention network construction method based on fusion of visual and spatial relations
Technical Field
The invention relates to the field of modeling the visual and spatial position relations among different objects with graph attention networks, and in particular to a graph attention network construction method based on the fusion of visual and spatial relations, namely a graph attention network that combines visual features with spatial position features.
Background
Exploring the relationship between target objects in an image may assist in understanding the visual scene, thereby helping to improve performance of related visual tasks, such as target detection, image captioning, visual question answering, and the like. However, different images contain different target objects, and there are various relationships between the different objects. In most practical related task applications, no annotation of relationships between objects is provided in advance. Therefore, it is important to design a model that can mine relationships between target objects in detail.
Attention networks are a common mechanism for modeling object relationships. In general, the objects detected in an image and the relationships between them are formed into a graph that consists of nodes and the edges connecting them. An attention network mainly involves two processes: attention weight calculation and information transfer and aggregation. However, most previous attention networks use only the visual information of the nodes for both processes and ignore the spatial relationships between nodes, even though the spatial information of objects has a great influence on modeling the relationships between them.
To address this problem, researchers have attempted to add the spatial position relationships between targets to the attention network as well, creating spatial attention networks. Nevertheless, these methods still suffer from at least two problems: the spatial features currently used do not describe the spatial relationship comprehensively, and most spatial attention networks inject spatial information only into the edge-weight calculation.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a graph attention network based on the fusion of visual and spatial relations, which fuses the visual and spatial relations of an image into a dual-attribute graph and can accurately model the relations among the objects in the image. The network can complete the aggregation and updating of visual and spatial information at the same time and can be applied to related visual tasks to help understand visual scene information. The method achieves good results on tasks such as target detection, image captioning, and visual question answering.
A method for constructing a graph attention network based on the fusion of visual and spatial relations comprises the following specific steps:
Step (1): compute the visual feature and the absolute position feature corresponding to each target object in the input image.
The visual feature and the absolute position feature of each target object form a dual-attribute node in the graph built from the input image, finally forming a graph;
Step (2): compute the spatial geometric relative position feature from each neighboring node to the central node in the graph;
Step (3): compute the attention weight between each neighboring node and the central node, which comprises visual attention and composite attention;
Step (4): compute the transfer information from each neighboring node to the central node, which comprises visual transfer information and composite transfer information;
Step (5): multiply the transfer information from all neighboring nodes corresponding to a central node by the corresponding attention weights and sum the results to obtain the aggregated information of the central node. This information is used only to update the visual feature of the central node; the absolute position feature of the central node remains unchanged.
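Read together, steps (1)-(5) operate on a graph whose nodes carry two attributes, of which only the visual one is ever rewritten. The short Python sketch below is illustrative only (the class and field names are not from the patent) and shows one way such a dual-attribute node can be represented.

from dataclasses import dataclass
import torch

@dataclass
class DualAttributeNode:
    """One dual-attribute node of the graph in steps (1)-(5).

    Only `visual` is replaced by the aggregation of step (5); `abs_pos`
    (center x, center y, width, height) is never changed.
    """
    visual: torch.Tensor    # visual feature of the target object
    abs_pos: torch.Tensor   # absolute position feature (c1, c2, w, h)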
Further, in step (1) an image is input, target detection is performed, and for each detected target the corresponding visual feature and absolute position feature are computed. The visual feature and the absolute position feature of each object form a dual-attribute node in the graph built from the image, finally forming a graph. The implementation is as follows:
The visual feature in step (1) refers to the visual feature of a target object in the input image.
The absolute position feature in step (1) refers to the absolute position feature of a target object in the input image, built from the center coordinates, width, and height of the object; the specific formula is as follows:
where i denotes the i-th dual-attribute node in the graph; c_{1i} and c_{2i} are the center coordinates of the target object, and w_i and h_i denote its width and height, respectively. The target object may be an object inside a detection box obtained by target detection, or a target object framed manually.
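As a concrete illustration of how the absolute position feature can be assembled from a detection box, the minimal Python sketch below is offered under stated assumptions: the (x1, y1, x2, y2) box convention and the optional normalization by image size are not specified by the patent, which only names the center coordinates, width, and height.

import torch

def absolute_position_features(boxes: torch.Tensor,
                               img_w: float, img_h: float,
                               normalize: bool = True) -> torch.Tensor:
    """Absolute position feature (c1, c2, w, h) per box.

    boxes: (N, 4) tensor of (x1, y1, x2, y2) detection boxes.
    Normalizing by image size is an assumption; the patent only names
    the center coordinates, width, and height of each target object.
    """
    x1, y1, x2, y2 = boxes.unbind(dim=1)
    c1 = (x1 + x2) / 2          # center x, i.e. c_1i
    c2 = (y1 + y2) / 2          # center y, i.e. c_2i
    w = x2 - x1                 # box width  w_i
    h = y2 - y1                 # box height h_i
    feat = torch.stack([c1, c2, w, h], dim=1)
    if normalize:
        feat = feat / feat.new_tensor([img_w, img_h, img_w, img_h])
    return feat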
Further, the spatial geometric relative position feature from each neighboring node to the central node in the graph is computed in step (2) as follows:
Given a central node and one of its corresponding neighboring nodes, the relative distance, the relative scale, and the relative direction between them are computed; the calculation formulas of these three spatial position relations are as follows:
Finally, the spatial geometric relative position feature is obtained by embedding these three spatial position features:
where Emb(·) is a position-embedding mapping operation.
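The patent's exact formulas for the three spatial relations and for Emb(·) are published as figures and are not reproduced in this text. The sketch below therefore uses common formulations from the spatial-relation literature (center-offset distance normalized by the central box's diagonal, log area ratio, and the offset angle) with a learned linear layer standing in for Emb(·); all of these concrete choices are assumptions.

import torch
import torch.nn as nn

class RelativePositionEmbedding(nn.Module):
    """Spatial geometric relative position feature between node pairs.

    The relative distance / scale / direction below follow common choices
    in the spatial-relation literature; the patent's own formulas are not
    reproduced here, so treat these as illustrative assumptions.
    """
    def __init__(self, d_geo: int = 64):
        super().__init__()
        self.emb = nn.Linear(3, d_geo)   # stands in for Emb(.)

    def forward(self, abs_pos: torch.Tensor) -> torch.Tensor:
        # abs_pos: (N, 4) rows of (c1, c2, w, h)
        c1, c2, w, h = abs_pos.unbind(dim=1)
        dx = c1.unsqueeze(0) - c1.unsqueeze(1)        # dx[i, j] = offset from central node i to neighbor j
        dy = c2.unsqueeze(0) - c2.unsqueeze(1)
        diag = torch.sqrt(w ** 2 + h ** 2).clamp_min(1e-6)
        rel_dist = torch.sqrt(dx ** 2 + dy ** 2) / diag.unsqueeze(1)   # relative distance (normalized by node i's diagonal)
        rel_scale = torch.log((w * h).unsqueeze(0) /
                              (w * h).unsqueeze(1).clamp_min(1e-6))    # relative scale (log area ratio)
        rel_dir = torch.atan2(dy, dx)                                  # relative direction (angle of the offset)
        geo = torch.stack([rel_dist, rel_scale, rel_dir], dim=-1)      # (N, N, 3)
        return self.emb(geo)                                           # (N, N, d_geo)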
Further, the attention weight from each neighboring node to the central node computed in step (3) comprises two parts, visual attention and composite attention; the specific process is as follows:
3-1. The spatial geometric relative position feature is combined with the visual feature of the neighboring node to generate a composite feature; the specific formula is as follows:
where || denotes a concatenation operation; σ denotes a nonlinear transformation; W_* and b_* denote a weight matrix and a bias, respectively, with the same notation used in the formulas below and * standing for the corresponding subscript.
3-2. The composite attention is computed from the visual feature of the central node and the composite feature; the specific calculation formula is as follows:
where W_{ap+2}, W_{ap+1} and W_t denote the weight matrices of different fully connected layers; similarly, b_{ap+2}, b_{ap+1} and b_t denote the corresponding biases.
3-3. The visual attention is computed from the visual feature of the central node and the visual feature of its corresponding neighboring node; the specific formula is as follows:
where d_{av} denotes the dimension of the corresponding weight matrix; the two weight matrices share the same dimension.
3-4. The computed visual attention and composite attention are used to obtain the attention weight a_{ij} between each neighboring node and the central node in the graph; the specific formula is as follows:
where α is a harmonic weight that balances visual attention and composite attention, and N is the number of neighboring nodes corresponding to the central node.
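The weight-calculation formulas of steps 3-1 to 3-4 are likewise published as figures in the patent. The sketch below therefore shows only one plausible reading: concatenate the relative geometry with the neighbor's visual feature to form the composite feature, score it against the central node's visual feature for the composite attention, use a projected dot product for the visual attention, and blend the two with the harmonic weight α before normalizing over each node's neighbors. The layer shapes, the ReLU nonlinearity, and the scaled dot-product form are assumptions.

import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusedAttentionWeights(nn.Module):
    """Attention weight a_ij combining visual attention and composite attention.

    A plausible reading of steps 3-1 to 3-4; the exact formulas in the patent
    were published as images, so the layer shapes and the scaled dot-product
    form used here are illustrative assumptions.
    """
    def __init__(self, d_v: int, d_geo: int, d_av: int = 64, alpha: float = 0.5):
        super().__init__()
        self.alpha = alpha                       # harmonic weight between the two attentions
        # 3-1: composite feature from [geometry || neighbor visual feature]
        self.compose = nn.Linear(d_geo + d_v, d_v)
        # 3-2: composite attention score from the (central visual, composite) pair
        self.comp_score = nn.Sequential(nn.Linear(2 * d_v, d_av), nn.ReLU(),
                                        nn.Linear(d_av, 1))
        # 3-3: visual attention via a projected dot product
        self.q = nn.Linear(d_v, d_av)
        self.k = nn.Linear(d_v, d_av)

    def forward(self, visual: torch.Tensor, rel_geo: torch.Tensor,
                adj: torch.Tensor) -> torch.Tensor:
        n = visual.size(0)
        # 3-1: composite feature, shape (N, N, d_v)
        nbr = visual.unsqueeze(0).expand(n, n, -1)                 # neighbor j features
        comp = torch.relu(self.compose(torch.cat([rel_geo, nbr], dim=-1)))
        # 3-2: composite attention logits
        ctr = visual.unsqueeze(1).expand(n, n, -1)                 # central node i features
        comp_logits = self.comp_score(torch.cat([ctr, comp], dim=-1)).squeeze(-1)
        # 3-3: visual attention logits (scaled dot product)
        vis_logits = self.q(visual) @ self.k(visual).t() / math.sqrt(self.q.out_features)
        # 3-4: blend with the harmonic weight and normalize over each node's neighbors
        logits = self.alpha * vis_logits + (1.0 - self.alpha) * comp_logits
        logits = logits.masked_fill(adj == 0, float('-inf'))
        # Assumes every node has at least one neighbor in `adj`.
        return F.softmax(logits, dim=1)          # a_ij: each row sums to 1 over neighbors j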
Further, in step (4), visual transfer information and composite transfer information are used to compute the transfer information from each neighboring node to the central node, as follows:
4-1. Compute the visual transfer information and the composite transfer information; the corresponding formulas are as follows:
4-2. Using the obtained visual transfer information and composite transfer information, compute the transfer information from each neighboring node to the central node; the specific formula is as follows:
further, in step (5), the transmission information from all neighboring nodes corresponding to a central node to the central node is multiplied by the corresponding attention weight and summed to obtain updated and aggregated information of the central nodeThe specific formula is as follows:
n is the number of all adjacent nodes corresponding to the central node; this aggregate information is used only to update the visual information of the central node, the absolute position characteristics of the node remaining unchanged.
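A matching sketch of steps (4) and (5) follows, again under assumed linear projections for the two kinds of transfer information; it aggregates the per-edge messages with the attention weights a_{ij} and returns only the new visual features, while the absolute position features are carried through unchanged by the caller.

import torch
import torch.nn as nn

class FusedMessagePassing(nn.Module):
    """Steps (4)-(5): transfer information and attention-weighted aggregation.

    The linear projections and the simple sum of visual and composite
    messages are illustrative assumptions; the patent's formulas were
    published as images and are not reproduced here.
    """
    def __init__(self, d_v: int):
        super().__init__()
        self.msg_visual = nn.Linear(d_v, d_v)      # visual transfer information
        self.msg_composite = nn.Linear(d_v, d_v)   # composite transfer information

    def forward(self, visual: torch.Tensor, composite: torch.Tensor,
                attn: torch.Tensor) -> torch.Tensor:
        # visual:    (N, d_v)    node visual features
        # composite: (N, N, d_v) composite feature for each (central, neighbor) pair
        # attn:      (N, N)      attention weights a_ij from step (3)
        n = visual.size(0)
        vis_msg = self.msg_visual(visual).unsqueeze(0).expand(n, n, -1)  # message from neighbor j
        comp_msg = self.msg_composite(composite)
        msg = vis_msg + comp_msg                   # per-edge transfer information
        # Step (5): weight each neighbor's message by a_ij and sum over neighbors j.
        # The caller keeps the absolute position features unchanged.
        return torch.einsum('ij,ijd->id', attn, msg)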
The invention has the beneficial effects that:
the invention uses a dual attribute map to build an analysis model of the target object relationship based on the predecessor. The node attributes of the graph include both visual features of the object and spatial location features of the object. In this figure, the spatial information of the object can naturally be seamlessly combined with the visual features and aggregate propagated in the figure. Meanwhile, the invention combines the relative distance, the relative scale and the relative direction of the target object to construct the spatial position characteristic.
Aiming at the problem of relation modeling of each object in an image, the invention provides a graph annotation meaning network model based on fusion of visual and spatial relations, which combines the visual characteristics and the spatial position characteristics of the object, and improves the accuracy of relation analysis of each object in the image to a certain extent. The model is beneficial to visual scene understanding, and has better effects on tasks such as target detection, image captions, visual question answering and the like. The method can be used for practical applications such as automatic driving behavior detection in the future.
Drawings
Fig. 1 is a schematic flow diagram of the graph attention network based on the fusion of visual and spatial relations in the method of the present invention.
Fig. 2 is a detailed flow diagram of the graph attention network based on the fusion of visual and spatial relations in the method of the present invention.
Fig. 3 is the network framework in which the method of the present invention is applied to object detection.
Fig. 4 is a schematic diagram of the application of the method of the present invention to a target detection task.
Fig. 5 is a schematic diagram of the application of the method of the present invention in an image captioning task.
Fig. 6 is a schematic diagram of the application of the method of the present invention in a visual question answering task.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings.
As shown in Fig. 1, the invention provides a graph attention network based on the fusion of visual and spatial relations, which can be used to model the relations between visual objects and thereby assist visual scene analysis; the detailed flow is shown in Fig. 2. The invention can be embedded into different specific visual tasks, such as the three tasks corresponding to Figs. 4, 5, and 6, respectively: object detection, image captioning, and visual question answering. Notably, beyond these three tasks, other related tasks can also embed the graph attention network based on the fusion of visual and spatial relations into their visual models for better analysis and understanding of visual scenes.
The embodiment is described in detail here by taking the object detection task of Fig. 3 as an example. In general, target detection produces a detection box and the corresponding classification category for each target object. Fig. 3 uses ResNet-101 pre-trained on the ImageNet image library as the backbone network, and the overall target detection network adopts the Fast R-CNN architecture. The graph attention network of the present invention, based on the fusion of visual and spatial relations, is then added after the candidate-box generation module and placed before the object classification and detection-box regression layers. Two graph attention networks based on the fusion of visual and spatial relations are embedded in this implementation. In detail, a layer normalization operation is added in front of each graph attention network based on the fusion of visual and spatial relations, and a fully connected layer with ReLU is added after it. This graph attention network based on the fusion of visual and spatial relations helps improve the accuracy of the final target detection task.
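The arrangement just described (layer normalization in front of each fused graph attention layer, a fully connected layer with ReLU behind it, and two such units stacked between the candidate-box features and the classification/regression layers) can be sketched as follows; the module names, the 512-dimensional width, and the injected gat_layer interface are assumptions rather than the patent's exact implementation.

import torch.nn as nn

class VSFusedBlock(nn.Module):
    """One embedded unit: LayerNorm -> fused graph attention layer -> FC + ReLU."""
    def __init__(self, gat_layer: nn.Module, d_v: int = 512):
        super().__init__()
        self.norm = nn.LayerNorm(d_v)                 # normalization placed before the graph attention
        self.gat = gat_layer                          # any visual/spatial-fusion graph attention layer
        self.fc = nn.Sequential(nn.Linear(d_v, d_v), nn.ReLU())

    def forward(self, visual, abs_pos, adj):
        updated = self.gat(self.norm(visual), abs_pos, adj)   # only the visual features change
        return self.fc(updated)

class DetectionHead(nn.Module):
    """Two stacked blocks between the proposal features and the cls/reg layers."""
    def __init__(self, gat_layer1: nn.Module, gat_layer2: nn.Module,
                 num_classes: int, d_v: int = 512):
        super().__init__()
        self.block1 = VSFusedBlock(gat_layer1, d_v)
        self.block2 = VSFusedBlock(gat_layer2, d_v)
        self.cls = nn.Linear(d_v, num_classes)        # category of each candidate box
        self.reg = nn.Linear(d_v, 4)                  # detection-box coordinates

    def forward(self, visual, abs_pos, adj):
        visual = self.block1(visual, abs_pos, adj)
        visual = self.block2(visual, abs_pos, adj)    # absolute positions stay fixed throughout
        return self.cls(visual), self.reg(visual)

Stacking two blocks mirrors the description of updating the node visual features twice before the final classification and detection-box regression.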
The specific implementation steps are as follows:
Step (1): generate candidate targets for an input image to obtain a set of candidate boxes. In the target detection network, the output of the 5th convolutional layer is linearly transformed into a 512-dimensional feature that serves as the visual feature of each node. The absolute position feature of a node consists of the center coordinates, width, and height of its candidate box. The visual feature and the absolute position feature of each target then form a dual-attribute node in the graph built from the image, finally forming a graph. The number of dual-attribute nodes in the graph equals the number of candidate boxes;
Step (2): compute the spatial geometric relative position feature between each neighboring node and the central node in the dual-attribute graph.
Given a central node and one of its corresponding neighboring nodes, the relative distance, the relative scale, and the relative direction between them are computed; the corresponding calculation formulas are formulas 2, 3, and 4, respectively.
The spatial geometric relative position feature is then computed by embedding these three spatial position features, where Emb(·) is a position-embedding mapping operation.
Step (3): compute the attention weight between each neighboring node and the central node in the dual-attribute graph; the attention weight contains two parts, visual attention and composite attention.
First, the spatial geometric relative position feature is combined with the visual feature of the corresponding neighboring node to generate a composite feature, where || denotes a concatenation operation, σ denotes a nonlinear transformation, and W_* and b_* denote a weight matrix and a bias, respectively.
Then the composite attention is computed from the visual feature of the central node and the composite feature.
The visual attention is computed from the visual feature of the central node and the visual features of its neighbors.
Finally, the attention weight a_{ij} between each neighboring node and the central node in the graph is computed from the visual attention and the composite attention, where α is a harmonic weight that balances visual attention and composite attention.
Step (4): compute the transfer information from each neighboring node to the central node, which comprises two parts: visual transfer information and composite transfer information.
The obtained visual transfer information and composite transfer information are then used to compute the transfer information from each neighboring node to the central node.
Step (5): multiply the transfer information from all neighboring nodes of a central node by the corresponding attention weights and sum the results to obtain the aggregated and updated information of the central node. This information becomes the updated visual feature of the node, while the absolute position feature of the node remains unchanged.
The updated dual-attribute nodes are then fed into the next graph attention network based on the fusion of visual and spatial relations, which performs the same operation. After the visual features of the nodes have been updated twice, the final visual features are used for classification and regression to obtain the category of each detected target and the coordinates of its detection box.
In order to test the performance of the graph attention model based on the fusion of visual and spatial relations, different attention models (SA, SA-P, RN, SORN, SGRN, GAGA-Net) were added to the target detection task and their detection performance was tested. The experimental results are shown in Table 1, where AP, AP50, and AP75 are metrics measuring detection performance. As Table 1 shows, the graph attention network constructed by the method based on the fusion of visual and spatial relations (GAGA-Net) brings a clear improvement in target detection.
Table 1. Performance comparison of different attention models on target detection
To further test the performance of the graph attention network based on the fusion of visual and spatial relations, the different attention models (SA, SA-P, RN, SORN, SGRN, GAGA-Net) were added to the image captioning and visual question answering tasks and their performance was tested; the experimental results are shown in Tables 2 and 3, respectively. In Table 2, CIDEr, BLEU-1, ROUGE, and METEOR are four common metrics measuring the accuracy of caption generation. In Table 3, All, Y/N, Number, and Other are four question types of visual question answering, and the corresponding metric is answer accuracy. As Tables 2 and 3 show, the graph attention network based on the fusion of visual and spatial relations (GAGA-Net) performs markedly well on the image captioning and visual question answering tasks.
Table 2. Performance comparison of different attention models on image captioning
Table 3. Performance comparison of different attention models on visual question answering

Claims (4)

1. A method for constructing a graph attention network based on the fusion of visual and spatial relations, characterized by comprising the following steps:
step (1): computing the visual feature and the absolute position feature corresponding to each target object in an input image;
the visual feature and the absolute position feature of each target object form a dual-attribute node in the graph built from the input image, finally forming a graph;
step (2): computing the spatial geometric relative position feature from each neighboring node to the central node in the graph;
step (3): computing the attention weight between each neighboring node and the central node, which comprises visual attention and composite attention;
step (4): computing the transfer information from each neighboring node to the central node, which comprises visual transfer information and composite transfer information;
step (5): multiplying the transfer information from all neighboring nodes corresponding to a central node by the corresponding attention weights and summing the results to obtain the aggregated information of the central node;
the specific implementation process of the step (2) is as follows:
2-1. Given a central node and one of its corresponding neighboring nodes, the relative distance, the relative scale, and the relative direction between them are computed; the calculation formulas of these three spatial position relations are as follows:
where i denotes the i-th dual-attribute node in the graph; c_{1i} and c_{2i} are the center coordinates of the target object, and w_i and h_i denote its width and height, respectively;
2-2. The spatial geometric relative position feature is obtained by embedding the above three spatial position features:
where Emb(·) is a position-embedding mapping operation;
the specific implementation process of the step (3) is as follows:
3-1. The spatial geometric relative position feature is combined with the visual feature of the neighboring node to generate a composite feature; the specific formula is as follows:
where || denotes a concatenation operation; σ denotes a nonlinear transformation; W_* and b_* denote a weight matrix and a bias, respectively, with the same notation used in the following formulas and * standing for the corresponding subscript;
3-2. The composite attention is computed from the visual feature of the central node and the composite feature; the specific calculation formula is as follows:
where W_{ap+2}, W_{ap+1} and W_t denote the weight matrices of different fully connected layers; similarly, b_{ap+2}, b_{ap+1} and b_t denote the corresponding biases;
3-3. The visual attention is computed from the visual feature of the central node and the visual feature of its corresponding neighboring node; the specific formula is as follows:
where d_{av} denotes the dimension of the corresponding weight matrix;
3-4. The computed visual attention and composite attention are used to obtain the attention weight a_{ij} between each neighboring node and the central node in the graph; the specific formula is as follows:
where α is a harmonic weight that balances visual attention and composite attention, and N is the number of neighboring nodes corresponding to the central node.
2. The method for constructing a graph attention network based on the fusion of visual and spatial relations according to claim 1, wherein step (1) is specifically implemented as follows:
the visual feature in step (1) refers to the visual feature of a target object in the input image;
the absolute position feature in step (1) refers to the absolute position feature of a target object in the input image; the specific formula is as follows:
where i denotes the i-th dual-attribute node in the graph; c_{1i} and c_{2i} are the center coordinates of the target object, and w_i and h_i denote its width and height, respectively.
3. The method for constructing a graph attention network based on the fusion of visual and spatial relations according to claim 1, wherein step (4) uses visual transfer information and composite transfer information to compute the transfer information from each neighboring node to the central node, specifically as follows:
4-1. computing the visual transfer information and the composite transfer information; the corresponding formulas are as follows:
4-2. computing the transfer information from each neighboring node to the central node from the obtained visual transfer information and composite transfer information; the specific formula is as follows:
4. The method for constructing a graph attention network based on the fusion of visual and spatial relations according to claim 3, wherein step (5) is specifically implemented by the following formula:
where N is the number of neighboring nodes corresponding to the central node; this aggregated information is used only to update the visual information of the central node, while the absolute position feature of the node remains unchanged.
CN202010946723.6A 2020-09-10 2020-09-10 Graph attention network construction method based on fusion of visual and spatial relations Active CN112184805B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010946723.6A CN112184805B (en) 2020-09-10 2020-09-10 Graph attention network construction method based on fusion of visual and spatial relations

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010946723.6A CN112184805B (en) 2020-09-10 2020-09-10 Graph attention network construction method based on fusion of visual and spatial relations

Publications (2)

Publication Number Publication Date
CN112184805A CN112184805A (en) 2021-01-05
CN112184805B true CN112184805B (en) 2024-04-09

Family

ID=73921743

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010946723.6A Active CN112184805B (en) 2020-09-10 2020-09-10 Graph attention network construction method based on fusion of visual and spatial relations

Country Status (1)

Country Link
CN (1) CN112184805B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112905819B (en) * 2021-01-06 2022-09-23 中国石油大学(华东) Visual question-answering method of original feature injection network based on composite attention

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110222770A (en) * 2019-06-10 2019-09-10 成都澳海川科技有限公司 A kind of vision answering method based on syntagmatic attention network

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11210572B2 (en) * 2018-12-17 2021-12-28 Sri International Aligning symbols and objects using co-attention for understanding visual content

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110222770A (en) * 2019-06-10 2019-09-10 成都澳海川科技有限公司 A kind of vision answering method based on syntagmatic attention network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
增强视觉特征的视觉问答任务研究 (Research on visual question answering with enhanced visual features); 秦淑婧; 杨关; 中原工学院学报 (Journal of Zhongyuan University of Technology); 2020-02-25 (No. 01); full text *

Also Published As

Publication number Publication date
CN112184805A (en) 2021-01-05

Similar Documents

Publication Publication Date Title
CN110009674B (en) Monocular image depth of field real-time calculation method based on unsupervised depth learning
CN111275618A (en) Depth map super-resolution reconstruction network construction method based on double-branch perception
CN111881773B (en) Event camera human body posture estimation method and system based on position offset
CN103606188A (en) Geographical information on-demand acquisition method based on image point cloud
CN105229703A (en) For the system and method using the position data of sensing to carry out generating three-dimensional models
WO2024060395A1 (en) Deep learning-based high-precision point cloud completion method and apparatus
CN111024089A (en) Indoor positioning navigation method based on BIM and computer vision technology
CN113012208B (en) Multi-view remote sensing image registration method and system
CN111881804A (en) Attitude estimation model training method, system, medium and terminal based on joint training
CN112184805B (en) 2020-09-10 2024-04-09 Graph attention network construction method based on fusion of visual and spatial relations
CN111739037B (en) Semantic segmentation method for indoor scene RGB-D image
CN112767486A (en) Monocular 6D attitude estimation method and device based on deep convolutional neural network
CN114494436A (en) Indoor scene positioning method and device
CN114820655A (en) Weak supervision building segmentation method taking reliable area as attention mechanism supervision
CN116385660A (en) Indoor single view scene semantic reconstruction method and system
CN103854271B (en) A kind of planar pickup machine scaling method
CN114663880A (en) Three-dimensional target detection method based on multi-level cross-modal self-attention mechanism
CN113807362A (en) Image classification method based on interlayer semantic information fusion deep convolutional network
CN117557804A (en) Multi-label classification method combining target structure embedding and multi-level feature fusion
CN104156952B (en) A kind of image matching method for resisting deformation
CN111932612A (en) Intelligent vehicle vision positioning method and device based on second-order hidden Markov model
CN116433904A (en) Cross-modal RGB-D semantic segmentation method based on shape perception and pixel convolution
CN103761725B (en) A kind of video plane detection method based on innovatory algorithm
CN114155406A (en) Pose estimation method based on region-level feature fusion
CN114549958A (en) Night and disguised target detection method based on context information perception mechanism

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant