CN112819012B - Image description generation method based on multi-source cooperative features


Info

Publication number
CN112819012B
CN112819012B
Authority
CN
China
Prior art keywords: features, image, feature, grid, enhancement
Prior art date
Legal status
Active
Application number
CN202110128180.1A
Other languages
Chinese (zh)
Other versions
CN112819012A (en)
Inventor
Sun Xiaoshuai (孙晓帅)
Ji Rongrong (纪荣嵘)
Luo Yunpeng (骆云鹏)
Current Assignee
Xiamen University
Original Assignee
Xiamen University
Priority date
Filing date
Publication date
Application filed by Xiamen University
Priority to CN202110128180.1A
Publication of CN112819012A
Application granted
Publication of CN112819012B
Status: Active
Anticipated expiration

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/40 - Extraction of image or video features
    • G06V10/46 - Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/462 - Salient features, e.g. scale invariant feature transforms [SIFT]
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/25 - Fusion techniques
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/20 - Image preprocessing
    • G06V10/25 - Determination of region of interest [ROI] or a volume of interest [VOI]
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00 - Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07 - Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biophysics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Image Analysis (AREA)

Abstract

An image description generation method based on multi-source cooperative features relates to multi-source feature extraction, enhancement and fusion, belongs to the technical field of artificial intelligence, and comprises the following steps: step 1, simultaneously extracting grid features and region features of an image by adopting a target detector; step 2, using the absolute and relative position information of the features to help the model understand the features and to self-enhance each type of feature; and step 3, using the geometric alignment relation between the features to interactively enhance the two types of features, exchange important visual information, and achieve better visual expression. Aiming at the limitation that traditional image description methods based on single-source features lack scene and detail information, the method provides multi-source cooperative feature extraction, fusion and enhancement to strengthen the visual prior, thereby improving the accuracy of the generated descriptions.

Description

Image description generation method based on multi-source cooperative features
Technical Field
The invention relates to multi-source feature extraction, enhancement and fusion, in particular to an image description generation method based on multi-source cooperative features.
Background
Image description generation is the task of automatically generating a descriptive sentence for an input image. The task spans the two fields of computer vision and natural language processing: the main challenge lies not only in comprehensively understanding the objects and relationships in an image through object recognition, scene recognition, attribute and relationship detection, and so on, but also in generating fluent sentences that conform to the visual semantics. Image description generation has a wide range of applications; for example, it can help autonomous driving systems understand road conditions and help visually impaired people understand their environment.
Despite its challenges, image description generation has made great progress over years of development, both in benchmark datasets and in methods. Lin et al. (Lin, T.-Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; and Zitnick, C. L. 2014. Microsoft COCO: Common objects in context. In ECCV.) proposed COCO, the benchmark dataset for image description generation. Vinyals et al. (Vinyals, O.; Toshev, A.; Bengio, S.; and Erhan, D. 2015. Show and tell: A neural image caption generator. In CVPR.) first adopted the encoder-decoder structure from machine translation as the dominant paradigm for image description generation. Anderson et al. (Anderson, P.; He, X.; Buehler, C.; Teney, D.; Johnson, M.; Gould, S.; and Zhang, L. 2018. Bottom-up and top-down attention for image captioning and visual question answering. In CVPR.) proposed a method that provides an image prior using an object detector. Rennie et al. (Rennie, S. J.; Marcheret, E.; Mroueh, Y.; Ross, J.; and Goel, V. 2017. Self-critical sequence training for image captioning. In CVPR.) used a reinforcement learning method to solve the problem of inconsistent behavior between training and testing of an image description generation network.
The above work lays a solid foundation for image description generation. Compared with the grid features used in earlier methods, the region features produced by the object detection network of Anderson et al. greatly reduce the difficulty of visual-semantic embedding, since most salient regions in images tend to be objects. Despite this great success, region features still suffer from limitations due to the lack of contextual information and fine-grained details. The detected regions may not cover the entire image, making it impossible to correctly describe the global scene. At the same time, each region is represented by a single feature vector, which inevitably loses a large amount of object detail. These drawbacks, however, are exactly the strengths of grid features, which cover all the content of a given image at a finer granularity.
Based on this background, an image description generation method based on multi-source cooperative features is studied to make up for the shortcomings of existing methods, obtain more accurate and fine-grained image descriptions, and advance the industrial application of image description generation.
Disclosure of Invention
Aiming at the shortcomings of the image features used by traditional image description generation methods, the invention provides a multi-source feature cooperation method that extracts and uses multiple kinds of image features to strengthen the prior information of an image and generate more accurate and detailed image descriptions, namely an image description generation method based on multi-source cooperative features.
The invention comprises the following steps:
1) simultaneously extracting grid features and region features of the image by adopting a target detector;
2) establishing a comprehensive relation attention mechanism to assist the model in feature understanding and relation modeling by using the absolute position information and the relative position information of the features, and performing self-enhancement of the two kinds of features;
3) utilizing the geometric alignment relation between the features to interactively and cooperatively enhance the two kinds of features, exchange important visual information, and achieve better visual expression.
In step 1), the specific method for simultaneously extracting the grid feature and the region feature of the image by using the target detector may be:
(1) performing target detection and attribute prediction training on the Visual Genome dataset by using Faster R-CNN as the target detector;
(2) extracting the image features corresponding to the detection boxes, detected by the target detector, whose confidence is higher than 20% as the region features, and taking the features extracted by the backbone network of the target detector as the grid features.
In step 2), the absolute position information is the position of a grid feature or a region feature in the whole picture. For the relative position information, the geometric information of a grid feature or a region feature is represented as a rectangular box (x, y, w, h), where (x, y) are the coordinates of the upper-left corner of the box and w, h are the width and height of the box; the relative position of two boxes box_i and box_j is then expressed as a 4-dimensional vector:

Ω(i, j) = ( log(|x_i - x_j| / w_i), log(|y_i - y_j| / h_i), log(w_i / w_j), log(h_i / h_j) )

After the 4-dimensional relative encoding vector is obtained, it is mapped to d_model dimensions using the PE function.

After the absolute position codes and the relative position codes are obtained, feature self-enhancement can be carried out using a Transformer model.
In step 3), the specific steps of utilizing the geometric alignment relation between the features to interactively enhance the two kinds of features and exchange important visual information, so as to achieve better visual expression, may be:
(1) constructing a geometric alignment graph according to the position information of the region features and the grid features;
(2) performing visual information interaction and enhancement according to the geometric alignment graph.
The invention has the following outstanding advantages:
1. The method overcomes the limitations of single-source features. It considers, for the first time, the complementarity of multi-source features, taking into account both the self-enhancement within each kind of feature and the cooperative promotion between different kinds of features, constructs an image description generation method based on multi-source cooperative features, designs and implements a model, and obtains more accurate and fine-grained high-quality image description text.
2. The invention makes full use of the meta-information of feature positions: it explicitly considers the absolute position information of each feature and explicitly models the relative position information between features, which further helps the model understand the inherent properties of, and relationships between, the features.
3. The invention designs a lightweight method for interaction between features, which performs more efficient and lightweight information interaction through the geometric position information between different types of features.
Drawings
Fig. 1 is an overall block diagram of an embodiment of the present invention.
Fig. 2 is a schematic structural diagram of the feature self-enhancement module according to an embodiment of the present invention.
Fig. 3 is a schematic structural diagram of the feature cooperative enhancement module according to an embodiment of the present invention.
Fig. 4 shows the geometric alignment of features according to an embodiment of the present invention.
Detailed Description
The following examples will further illustrate the present invention with reference to the accompanying drawings.
The overall framework of the image description generation method based on multi-source cooperative features provided by the invention is shown in Fig. 1, and specifically comprises the following aspects:
1) Image feature extraction. Faster R-CNN is used as the target detector; on the basis of the Faster R-CNN detector, the 5th convolution module is merged into the target detection backbone network, a 1x1 region-of-interest pooling (RoI Pooling) method is used in the detection head network, and two fully-connected layers are used as the detection heads. Target detection and attribute prediction training is performed on the Visual Genome dataset. For a picture, the trained network is used to compute up to the end of the 5th convolution module to obtain a feature map, which is then average-pooled into 7x7 grid features. For the region features, the image features corresponding to the detection boxes whose confidence is higher than 20% are extracted as region features; when the number of region features is less than 10, the top-10 detection results by confidence are taken, and the maximum number of region features is set to 100, i.e., each picture has at most 100 region features. Therefore, for each picture, 7x7 = 49 grid features and 10 to 100 region features are obtained.
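To make this pipeline concrete, the following is a minimal sketch in PyTorch. It is not the patented detector: it substitutes torchvision's off-the-shelf Faster R-CNN (ResNet-50 FPN) and a plain ResNet-50 trunk for the Visual-Genome-trained detector described above, and it omits attribute prediction, while keeping the quantities from the text (confidence > 20%, 10 to 100 regions, a 7x7 grid of 49 grid features).

```python
import torch
import torch.nn.functional as F
import torchvision
from torchvision.ops import roi_align

# Stand-in models (the patent trains its own detector on Visual Genome).
detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()
cnn = torchvision.models.resnet50(weights="DEFAULT").eval()
trunk = torch.nn.Sequential(*list(cnn.children())[:-2])        # conv stages up to C5

@torch.no_grad()
def extract_features(image: torch.Tensor):
    """image: 3xHxW float tensor in [0, 1] (ImageNet normalization omitted)."""
    # Grid features: average-pool the C5 feature map to a 7x7 grid (49 vectors).
    c5 = trunk(image.unsqueeze(0))                              # 1 x 2048 x H/32 x W/32
    grid = F.adaptive_avg_pool2d(c5, (7, 7))                    # 1 x 2048 x 7 x 7
    grid_feats = grid.flatten(2).transpose(1, 2).squeeze(0)     # 49 x 2048

    # Region features: detections with confidence > 0.2, at least 10, at most 100.
    det = detector([image])[0]                                  # boxes sorted by score
    keep = det["scores"] > 0.2
    if keep.sum() < 10:
        keep = torch.zeros_like(keep)
        keep[:10] = True                                        # fall back to top-10
    boxes = det["boxes"][keep][:100]
    # Pool one vector per region from the same C5 map (stride 32).
    rois = roi_align(c5, [boxes], output_size=(1, 1), spatial_scale=1 / 32.0)
    region_feats = rois.flatten(1)                              # N x 2048
    return grid_feats, region_feats, boxes
```

The returned boxes are kept alongside the region features because the later modules need the geometric information for position encoding and graph construction.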
2) Feature self-enhancement; the module structure is shown in Fig. 2. The purpose of the feature self-enhancement module is to let the grid features and the region features enhance their feature expression through interaction within each feature type. In this process, the absolute position information and the relative position information are used to establish a comprehensive relation attention (CRA) mechanism that assists the model in feature understanding and relation modeling.
The absolute position information is the position of a grid feature or region feature in the whole picture. For a grid feature, its absolute position can be represented by a two-dimensional coordinate (i, j); to feed this coordinate into the neural network, the 2-dimensional coordinate is mapped into a high-dimensional vector by the GPE function:

GPE(i, j) = [PE_i; PE_j]    (1)

where PE_i, PE_j ∈ R^(d_model/2) are sinusoidal positional encodings and d_model is the middle-layer feature dimension of the neural network:

PE(pos, 2k) = sin( pos / 10000^(2k / d_model) )    (2)
PE(pos, 2k+1) = cos( pos / 10000^(2k / d_model) )    (3)
where pos represents a position (i.e., i or j) and k represents a dimension index. For a region feature, the corresponding rectangular box (x_min, y_min, x_max, y_max) is mapped to a high-dimensional vector by a linear mapping RPE:

RPE(i) = B_i W_emb    (4)

where i is the index of the region feature, B_i = (x_min, y_min, x_max, y_max), (x_min, y_min) are the coordinates of the upper-left corner of the box, (x_max, y_max) are the coordinates of the lower-right corner of the box, and W_emb ∈ R^(4 x d_model) is a learnable parameter matrix.
In order to better fuse the relative position information, it is added according to the geometric information. For this purpose, the geometric information of a grid feature or region feature is represented as a rectangular box (x, y, w, h), where (x, y) are the coordinates of the upper-left corner of the box and w, h are the width and height of the box. The relative position of two boxes box_i and box_j is then expressed as a 4-dimensional vector:

Ω(i, j) = ( log(|x_i - x_j| / w_i), log(|y_i - y_j| / h_i), log(w_i / w_j), log(h_i / h_j) )    (5)

After the 4-dimensional relative encoding vector is obtained, it is mapped to d_model dimensions using the PE function so that it can be fed into the subsequent modules.
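The sketch below illustrates how the three encodings above could be computed: the sinusoidal GPE of equations (1)-(3) for a grid cell (i, j), the learned linear RPE of equation (4) for a region box, and the 4-dimensional relative geometry vector of equation (5) between two (x, y, w, h) boxes. The sinusoidal form, the log-ratio geometry vector, and the epsilon clamp are assumptions of this reconstruction rather than a verbatim transcription of the patented module.

```python
import torch
import torch.nn as nn

d_model = 512  # middle-layer feature dimension

def sinusoid_pe(pos: float, dim: int) -> torch.Tensor:
    """Standard sinusoidal encoding of a scalar position (dim assumed even)."""
    k = torch.arange(0, dim, 2, dtype=torch.float)
    angle = pos / (10000 ** (k / dim))
    pe = torch.zeros(dim)
    pe[0::2] = torch.sin(angle)
    pe[1::2] = torch.cos(angle)
    return pe

def gpe(i: int, j: int) -> torch.Tensor:
    """GPE(i, j) = [PE_i ; PE_j] for a grid cell, eq. (1)."""
    half = d_model // 2
    return torch.cat([sinusoid_pe(i, half), sinusoid_pe(j, half)])

w_emb = nn.Linear(4, d_model, bias=False)  # learnable W_emb of eq. (4)

def rpe(box_xyxy: torch.Tensor) -> torch.Tensor:
    """RPE of a region box (x_min, y_min, x_max, y_max), eq. (4)."""
    return w_emb(box_xyxy)

def relative_geometry(box_i: torch.Tensor, box_j: torch.Tensor) -> torch.Tensor:
    """4-d relative position vector between two (x, y, w, h) boxes, eq. (5)."""
    xi, yi, wi, hi = box_i
    xj, yj, wj, hj = box_j
    eps = 1e-3  # avoid log(0) when the two boxes are perfectly aligned
    return torch.stack([
        torch.log(torch.abs(xi - xj).clamp(min=eps) / wi),
        torch.log(torch.abs(yi - yj).clamp(min=eps) / hi),
        torch.log(wi / wj),
        torch.log(hi / hj),
    ])
```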
After the absolute position codes and the relative position codes are obtained, a Transformer model is used to carry out feature self-enhancement. Concretely, the multi-head comprehensive relation attention is:

MHCRA(Q, K, V) = Concat(head_1, ..., head_h) W^O    (6)
head_i = CRA(Q W_i^Q, K W_i^K, V W_i^V)    (7)
CRA(Q, K, V) = softmax( Q K^T / sqrt(d_k) + Ω ) V    (8)

where W^O, W_i^Q, W_i^K and W_i^V are learnable parameter matrices and Ω is the relative position code mapped to an attention bias. After the comprehensive relation attention (CRA) calculation method is determined, the self-enhancement step of the features can be carried out. The grid features and the region features of the l-th layer are denoted as V_g^(l) and V_r^(l), respectively; then:

Z_g^(l) = MHCRA( V_g^(l) + GPE, V_g^(l) + GPE, V_g^(l) )    (9)
Z_r^(l) = MHCRA( V_r^(l) + RPE, V_r^(l) + RPE, V_r^(l) )    (10)

where RPE and GPE are the absolute position codes of the region features and the grid features, respectively, and Ω_rr, Ω_gg are the corresponding relative position codes used inside the attention. A two-layer feed-forward network FFN is then used as the intermediate mapping:

V_g^(l+1) = FFN( Z_g^(l) )    (11)
V_r^(l+1) = FFN( Z_r^(l) )    (12)
After self-enhancement is completed in the feature self-enhancement module, the two feature streams enter the next module for cooperative enhancement of the features.
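As a rough sketch of a single attention head under the reconstruction in equations (6)-(10), the code below adds the absolute position code to the queries and keys and adds the relative position code Ω, assumed here to be already reduced to one scalar logit per feature pair, as a bias on the scaled dot-product logits. Both design choices are assumptions consistent with the text, not a verbatim transcription of the patented CRA.

```python
import math
import torch
import torch.nn as nn

class CRAHead(nn.Module):
    """One comprehensive-relation-attention head (sketch of eqs. (7)-(8))."""
    def __init__(self, d_model: int, d_k: int):
        super().__init__()
        self.wq = nn.Linear(d_model, d_k)
        self.wk = nn.Linear(d_model, d_k)
        self.wv = nn.Linear(d_model, d_k)

    def forward(self, x: torch.Tensor, abs_pe: torch.Tensor, rel_bias: torch.Tensor):
        # x:        (N, d_model) grid or region features
        # abs_pe:   (N, d_model) absolute position codes (GPE or RPE)
        # rel_bias: (N, N) relative position codes reduced to one logit per pair
        q = self.wq(x + abs_pe)          # absolute position added to the queries
        k = self.wk(x + abs_pe)          # ... and to the keys
        v = self.wv(x)
        logits = q @ k.t() / math.sqrt(q.size(-1)) + rel_bias
        return torch.softmax(logits, dim=-1) @ v
```

Stacking h such heads, concatenating their outputs through W^O, and applying the two-layer FFN of equations (11)-(12) would give one self-enhancement layer.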
3) Feature cooperative enhancement; the module structure is shown in Fig. 3. The feature cooperative enhancement module aims to model the interaction between the two different kinds of features to enhance the feature expression. In order to perform this interaction more efficiently, a geometric alignment graph G = (V, E) is first constructed, as shown in Fig. 4. In this graph, all region features and grid features are independent nodes forming the node set V; for the edge set E, there is an edge between a region feature node and a grid feature node if and only if the region feature and the grid feature geometrically intersect. A multi-head locality-constrained cross attention mechanism (MHLCCA) is then used in the feature cooperative enhancement module:

MHLCCA(Q, K, V) = Concat(head_1, ..., head_h) W^O    (13)
head_i = LCCA(Q W_i^Q, K W_i^K, V W_i^V)    (14)
LCCA(Q, K, V) = graph-softmax( Q K^T / sqrt(d_k) + Ω ) V    (15)

where the graph-softmax operation is based on the graph G: for each node, the normalization is performed only over the nodes connected to it, and the weights of unconnected nodes are set to zero. For the outputs V_g^(l+1) and V_r^(l+1) of the feature self-enhancement module, the feature cooperative enhancement module computes:

U_g^(l) = MHLCCA( V_g^(l+1) + GPE, V_r^(l+1) + RPE, V_r^(l+1) )    (16)
U_r^(l) = MHLCCA( V_r^(l+1) + RPE, V_g^(l+1) + GPE, V_g^(l+1) )    (17)

where Ω_rg and Ω_gr are the relative position information between the region features and the grid features. Region features are embedded into grid features and vice versa to enhance both kinds of features. Specifically, the grid features attend to the region features to acquire high-level object information, and the region features attend to the grid features to supplement detail information. The output of this layer is then obtained through two FFN layers:

O_g^(l) = FFN( U_g^(l) )    (18)
O_r^(l) = FFN( U_r^(l) )    (19)
the feature self-enhancement module and the cooperative enhancement module act alternately for 3 times, and finally the obtained features are input into the language generation module.
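To make the geometric alignment graph and the graph-softmax concrete, the sketch below builds a region-to-grid adjacency mask from box/grid-cell intersection and uses it to suppress attention between unconnected nodes, as described for MHLCCA above. The cell-size arithmetic, the masking-by-negative-infinity trick, and the assumption that every region intersects at least one cell are implementation choices of this sketch, not details taken from the patent.

```python
import torch

def alignment_mask(boxes: torch.Tensor, img_w: float, img_h: float,
                   grid: int = 7) -> torch.Tensor:
    """Boolean (num_regions, grid*grid) mask: True where a region box
    geometrically intersects a grid cell, i.e. the edges of graph G."""
    cell_w, cell_h = img_w / grid, img_h / grid
    ys, xs = torch.meshgrid(torch.arange(grid), torch.arange(grid), indexing="ij")
    cells = torch.stack([xs.flatten() * cell_w, ys.flatten() * cell_h,
                         (xs.flatten() + 1) * cell_w, (ys.flatten() + 1) * cell_h], dim=1)
    bx1, by1, bx2, by2 = boxes[:, 0:1], boxes[:, 1:2], boxes[:, 2:3], boxes[:, 3:4]
    inter_w = torch.clamp(torch.min(bx2, cells[:, 2]) - torch.max(bx1, cells[:, 0]), min=0)
    inter_h = torch.clamp(torch.min(by2, cells[:, 3]) - torch.max(by1, cells[:, 1]), min=0)
    return (inter_w * inter_h) > 0                       # (num_regions, 49)

def graph_softmax_attention(q, k, v, mask):
    """Cross attention whose weights between unconnected nodes are zeroed:
    the softmax runs only over neighbours in the alignment graph.
    Assumes every query node has at least one neighbour."""
    logits = q @ k.t() / q.size(-1) ** 0.5
    logits = logits.masked_fill(~mask, float("-inf"))
    return torch.softmax(logits, dim=-1) @ v
```

For the opposite direction (grid features attending to region features), the transposed mask would be used.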
4) Language generation module. Given the enhanced visual features and the previously generated partial sentence w_1, w_2, ..., w_i, the language generation module generates the next word w_(i+1). First, the generated partial sentence is mapped to d_model-dimensional vectors by word embedding; the vectors are arranged in rows and combined with the placeholder vector of the next position to obtain a matrix H^(0). This matrix then goes through the self-attention module:

MHSA(Q, K, V) = Concat(head_1, ..., head_h) W^O    (20)
head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)    (21)
Attention(Q, K, V) = softmax( Q K^T / sqrt(d_k) ) V    (22)
where W^O, W_i^Q, W_i^K and W_i^V are all learnable parameters and pos_* is the positional encoding of the words. For the output H^(l) of the l-th layer, the language generation module computes:

M^(l) = MHSA(H^(l))    (23)
H^(l+1) = FFN(M^(l))    (24)

The (i+1)-th word is finally predicted by projecting the last-layer output at position i+1 onto the vocabulary:

p( w_(i+1) | w_1, ..., w_i ) = softmax( H_(i+1)^(L) W_p )    (25)

where W_p is a learnable vocabulary projection matrix.
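A minimal sketch of one decoding step as described above: the generated prefix w_1..w_i is embedded, passed through masked self-attention, cross-attention to the enhanced visual features, and an FFN, and the last position is projected to vocabulary logits as in equation (25). PyTorch's built-in TransformerDecoder is used as a stand-in for the patented decoder; the cross-attention placement, the omission of the word positional encoding, and the vocabulary size are assumptions, and this is greedy prediction of a single word rather than the beam search used in the experiments.

```python
import torch
import torch.nn as nn

d_model, vocab_size, n_layers = 512, 10000, 3      # vocab_size is a placeholder

embed = nn.Embedding(vocab_size, d_model)
decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True),
    num_layers=n_layers)
to_vocab = nn.Linear(d_model, vocab_size)          # vocabulary projection, eq. (25)

@torch.no_grad()
def next_word(prefix_ids: torch.Tensor, visual_feats: torch.Tensor) -> int:
    """Predict w_(i+1) from the prefix w_1..w_i and the enhanced features."""
    tgt = embed(prefix_ids).unsqueeze(0)                          # 1 x i x d_model
    causal = nn.Transformer.generate_square_subsequent_mask(tgt.size(1))
    h = decoder(tgt, visual_feats.unsqueeze(0), tgt_mask=causal)  # 1 x i x d_model
    logits = to_vocab(h[:, -1])                                   # last position
    return int(torch.softmax(logits, dim=-1).argmax())
```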
5) Loss function. The whole model is trained in two stages. The loss function of the first stage is the cross-entropy over each word prediction:

L_XE(θ) = - Σ_(t=1..T) log p_θ( w_t* | w_1*, ..., w_(t-1)* )    (26)

i.e., the negative log-probability of each ground-truth word. The second-stage loss function is the reinforce loss of reinforcement learning:

L_RL(θ) = - (1/k) Σ_(i=1..k) ( r(w^i) - b ) log p_θ( w^i )    (27)

where r denotes the CIDEr score, b denotes the baseline, and k is the beam search size.
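A sketch of the two training objectives, assuming per-step vocabulary logits for the ground-truth caption in the first stage and k sampled or beam-searched sequences with their CIDEr rewards in the second stage; taking the baseline b as the mean reward of the k sequences is an assumption consistent with "b denotes the baseline".

```python
import torch
import torch.nn.functional as F

def xe_loss(logits: torch.Tensor, target_ids: torch.Tensor) -> torch.Tensor:
    """Stage-1 cross-entropy over the ground-truth caption, eq. (26).
    logits: (T, vocab), target_ids: (T,)"""
    return F.cross_entropy(logits, target_ids)

def scst_loss(seq_log_probs: torch.Tensor, rewards: torch.Tensor) -> torch.Tensor:
    """Stage-2 self-critical (REINFORCE) loss, eq. (27).
    seq_log_probs: (k,) summed log-probability of each of the k sequences
    rewards:       (k,) CIDEr score r of each sequence"""
    baseline = rewards.mean()            # b: mean reward of the k sequences (assumed)
    return -((rewards - baseline).detach() * seq_log_probs).mean()
```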
The specific implementation results are as follows:
experiments were performed on the reference image subtitle data set COCO. This data set contains 123287 pictures, each with 5 different annotations. For data partitioning, 113287, 5000 images were used for training, validation, and testing, respectively, following the widely adopted Karpathy segmentation method. Will dmodelSet to 512 and the number of heads set to 8. The number of layers for both encoder and decoder is set to 3. In the first stage of training, the model is preheated for 4 rounds, and the learning rate is linearly increased to 1x 10-4. Then setting the learning rate to 1x 10 between 5-10 rounds-4Is set to be 2 multiplied by 10 between 11 and 12-5Then set to 4 × 10-6. The batch size was set to 50. After the 18 era XE pre-training phase, the Cider optimization model is started, and the learning rate is 5 multiplied by 10-6The batch size is 100. The Adam optimizer was used in both stages, with a bundle search width of 5. Models were evaluated using BLEU @ N, METEOR, ROUGE-L, CIDER, and SPICE following standard evaluation procedures.
The final image description test results are shown in Table 1.

TABLE 1

Model                       B-1   B-4   M     R     C      S
SCST (ResNet-101)           -     34.2  26.7  57.7  114.0  -
Up-Down (ResNet-101)        79.8  36.3  27.7  56.9  120.1  21.4
HAN (ResNet-101)            80.9  37.6  27.8  58.1  121.7  21.5
GCN-LSTM (ResNet-101)       80.5  38.2  28.5  58.5  128.3  22.0
SGAE (ResNet-101)           80.8  38.4  28.4  58.6  127.8  22.1
ORT (ResNet-101)            80.5  38.6  28.7  58.4  127.8  22.1
AoA (ResNet-101)            80.2  38.9  29.2  58.8  129.8  22.4
M2 (ResNet-101)             80.8  39.1  29.2  58.6  131.2  22.6
X-Transformer (SENet-154)   80.9  39.7  29.5  59.1  132.8  23.4
Ours (ResNeXt-101)          81.4  39.8  29.5  59.1  133.8  23.0

Claims (3)

1. An image description generation method based on multi-source cooperative features, characterized by comprising the following steps:
1) simultaneously extracting grid features and region features of the image by adopting a target detector;
2) establishing a comprehensive relation attention mechanism to assist the model in feature understanding and relation modeling by using the absolute position information and the relative position information of the features, and performing self-enhancement of the two kinds of features;
the absolute position information is the position of a grid feature or a region feature in the whole picture; for the relative position information, the geometric information of a grid feature or a region feature is represented as a rectangular box (x, y, w, h), wherein (x, y) are the coordinates of the upper-left corner of the box and w, h are the width and the height of the box; the relative position of two boxes box_i and box_j is then expressed as a 4-dimensional vector:
Ω(i, j) = ( log(|x_i - x_j| / w_i), log(|y_i - y_j| / h_i), log(w_i / w_j), log(h_i / h_j) )
after the 4-dimensional relative encoding vector is obtained, it is mapped to d_model dimensions using the PE function, wherein d_model is the middle-layer feature dimension of the neural network;
feature self-enhancement is carried out by using a Transformer model after the absolute position codes and the relative position codes are obtained;
3) utilizing the geometric alignment relation between the features to interactively and cooperatively enhance the two kinds of features, exchange important visual information, and achieve better visual expression.
2. The image description generation method based on multi-source cooperative features of claim 1, wherein in step 1), the specific method for simultaneously extracting the grid features and the region features of the image by using the target detector comprises:
(1) performing target detection and attribute prediction training on the Visual Genome dataset by using Faster R-CNN as the target detector;
(2) extracting the image features corresponding to the detection boxes, detected by the target detector, whose confidence is higher than 20% as the region features, and taking the features extracted by the backbone network of the target detector as the grid features.
3. The image description generation method based on multi-source cooperative features as claimed in claim 1, wherein in step 3), the specific steps of interactively enhancing the two kinds of features by utilizing the geometric alignment relation between the features and exchanging important visual information, so as to achieve better visual expression, comprise:
(1) constructing a geometric alignment graph according to the position information of the region features and the grid features;
(2) performing visual information interaction and enhancement according to the geometric alignment graph.
CN202110128180.1A 2021-01-29 2021-01-29 Image description generation method based on multi-source cooperative features Active CN112819012B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110128180.1A CN112819012B (en) 2021-01-29 2021-01-29 Image description generation method based on multi-source cooperative features

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110128180.1A CN112819012B (en) 2021-01-29 2021-01-29 Image description generation method based on multi-source cooperative features

Publications (2)

Publication Number Publication Date
CN112819012A CN112819012A (en) 2021-05-18
CN112819012B true CN112819012B (en) 2022-05-03

Family

ID=75858380

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110128180.1A Active CN112819012B (en) 2021-01-29 2021-01-29 Image description generation method based on multi-source cooperative features

Country Status (1)

Country Link
CN (1) CN112819012B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113378919B (en) * 2021-06-09 2022-06-14 重庆师范大学 Image description generation method for fusing visual sense and enhancing multilayer global features
CN114898121B (en) * 2022-06-13 2023-05-30 河海大学 Automatic generation method for concrete dam defect image description based on graph attention network


Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102737243A (en) * 2011-03-31 2012-10-17 富士通株式会社 Method and device for acquiring descriptive information of multiple images and image matching method
CN110717498A (en) * 2019-09-16 2020-01-21 腾讯科技(深圳)有限公司 Image description generation method and device and electronic equipment
CN111144553A (en) * 2019-12-28 2020-05-12 北京工业大学 Image description method based on space-time memory attention
CN111523534A (en) * 2020-03-31 2020-08-11 华东师范大学 Image description method
CN111737511A (en) * 2020-06-17 2020-10-02 南强智视(厦门)科技有限公司 Image description method based on self-adaptive local concept embedding
CN111612103A (en) * 2020-06-23 2020-09-01 中国人民解放军国防科技大学 Image description generation method, system and medium combined with abstract semantic representation

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Dual-Level Collaborative Transformer for Image Captioning; Yunpeng Luo et al.; arXiv; 2021-08-03; pp. 1-8 *
High-Quality Image Captioning With Fine-Grained and Semantic-Guided Visual Attention; Zongjian Zhang et al.; IEEE Transactions on Multimedia; 2018-12-20; Vol. 21, No. 7; pp. 1681-1693 *
A survey of image captioning technology (图像描述技术综述); Miao Yi et al.; Computer Science (计算机科学); 2020-12-15; Vol. 47, No. 12; pp. 149-160 *

Also Published As

Publication number Publication date
CN112819012A (en) 2021-05-18

Similar Documents

Publication Publication Date Title
Melekhov et al. Dgc-net: Dense geometric correspondence network
Jiang et al. Scfont: Structure-guided chinese font generation via deep stacked networks
Ji et al. Deep view morphing
Zhou et al. Contextual ensemble network for semantic segmentation
CN111325165B (en) Urban remote sensing image scene classification method considering spatial relationship information
Lee et al. Deep architecture with cross guidance between single image and sparse lidar data for depth completion
CN108334830A (en) A kind of scene recognition method based on target semanteme and appearance of depth Fusion Features
CN112819012B (en) Image description generation method based on multi-source cooperative features
CN110827312B (en) Learning method based on cooperative visual attention neural network
CN107481279A (en) A kind of monocular video depth map computational methods
CN117033609B (en) Text visual question-answering method, device, computer equipment and storage medium
Qiu et al. Hallucinating visual instances in total absentia
Zhong et al. 3d geometry-aware semantic labeling of outdoor street scenes
Wang et al. KTN: Knowledge transfer network for learning multiperson 2D-3D correspondences
CN115018999A (en) Multi-robot-cooperation dense point cloud map construction method and device
Huang et al. Attention‐Enhanced One‐Stage Algorithm for Traffic Sign Detection and Recognition
Qi et al. Sparse prior guided deep multi-view stereo
Shen et al. ImLiDAR: cross-sensor dynamic message propagation network for 3D object detection
CN115240121B (en) Joint modeling method and device for enhancing local features of pedestrians
Zhou et al. Lrfnet: an occlusion robust fusion network for semantic segmentation with light field
Zheng et al. Modular graph attention network for complex visual relational reasoning
Wang et al. An Improved Convolutional Neural Network‐Based Scene Image Recognition Method
Lyu et al. Deep semantic feature matching using confidential correspondence consistency
Wang et al. Generative data augmentation by conditional inpainting for multi-class object detection in infrared images
Jiang et al. DI-MVS: Learning Efficient Multi-View Stereo With Depth-Aware Iterations

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant