CN114677580B - Image description method based on self-adaptive enhanced self-attention network - Google Patents
- Publication number: CN114677580B (application CN202210586762.9A)
- Authority
- CN
- China
- Prior art keywords
- semantic
- geometric
- adaptive
- vector
- relation
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G06N3/04 — Computing arrangements based on biological models; neural networks; architecture, e.g. interconnection topology
- G06N3/08 — Computing arrangements based on biological models; neural networks; learning methods
Abstract
The invention relates to the technical field of image description and discloses an image description method based on an adaptive enhanced self-attention network. The method models object relationships jointly at the geometric and semantic levels and, when a definite geometric or semantic relationship exists between two objects in an image, adaptively strengthens the visual relationship in the given image, realizing high-precision, credible image description generation.
Description
Technical Field
The invention relates to the technical field of image description, in particular to an image description method based on a self-adaptive enhanced self-attention network.
Background
Image description aims to automatically generate a sentence describing a given image; it tightly couples vision and language and is an important multi-modal task. A generated description must not only identify the objects of interest in the image but also describe the relationships between them. A key challenge for image description is therefore how to model the relationships between the identified objects accurately and efficiently, which is crucial for generation quality. Recently, geometric information has been widely studied as a way to enhance region-level features, because geometric features, i.e., relative distance and relative size, encode an explicit positional relationship between objects.
Visual relationships comprise both geometric positions and semantic interactions, which together indicate the correlation between objects in the region-level representation. Previous work, however, uses only geometric positions to enhance the representation of visual relationships, and such shallow positional information cannot cover semantic relationships involving complex actions. The prior art is therefore limited: existing image description models struggle to generate credible descriptions with accurate semantic relationship prediction.
Disclosure of Invention
In order to solve the technical problem, the invention provides an image description method based on an adaptive enhanced self-attention network.
In order to solve the technical problem, the invention adopts the following technical scheme:
an image description method based on an adaptive enhanced self-attention network comprises the following steps:
step one: for a given scene image, construct a semantic relation graph S using a scene graph extractor; detect the region features of the targets of interest in the scene image with a pre-trained Faster R-CNN model and compute the region feature embedding vector H; and construct a geometric relation graph G from the relative geometric features of the target bounding boxes;
step two: encode the relationships between targets through an adaptive relation-enhanced attention mechanism, whose three attention-score components are

$$\Omega_H=\frac{Q_H K_H^{\top}}{\sqrt{d}},\qquad \Omega_S=f_S(Q,K,S),\qquad \Omega_G=f_G(Q,K,G)$$

wherein $\Omega_H$ is the region feature weight, $\Omega_S$ is the semantic enhanced attention, and $\Omega_G$ is the geometric enhanced attention; $Q_H=HW_Q^H$ and $K_H=HW_K^H$ are, respectively, the query region feature vector and the key-value region feature vector generated from the region feature embedding vector H with the region-feature projection matrices $W_Q^H$ and $W_K^H$;

wherein $\Omega_S=\dfrac{Q_H K_S^{\top}+Q_S K_H^{\top}}{\sqrt{d}}$; $W_Q^S$ and $W_K^S$ are projection matrices based on the semantic relationship, $Q_S=SW_Q^S$ is the projected query semantic relationship vector, and $K_S=SW_K^S$ is the projected key-value semantic relationship vector;

wherein $\Omega_G=\dfrac{Q_H K_G^{\top}}{\sqrt{d}}$; $W_K^G$ is a projection matrix based on the geometric relationship and $K_G=GW_K^G$ is the projected key-value geometric relationship vector;
step three: strengthen the semantic and geometric relations through adaptive weight assignment to obtain the adaptive relation-enhanced attention

$$\mathrm{AdaRA}(H)=\operatorname{softmax}\!\left(\Omega_H+\beta_S\odot\Omega_S+\beta_G\odot\Omega_G\right)V_H$$

wherein the gating value $[\beta_S,\beta_G]=\sigma\!\left([V_H;V_S;V_G]\,W_g\right)$; $V_H=HW_V^H$, $V_S=SW_V^S$ and $V_G=GW_V^G$ are the value region feature vector, value semantic relationship vector and value geometric relationship vector generated with the projection matrices $W_V^H$, $W_V^S$ and $W_V^G$; $\beta_S$ is the output weight of the semantic relationship, $\beta_G$ is the output weight of the geometric relationship, and $W_g$ is a learnable parameter;
step four: feed the features of the last encoder layer into a Transformer decoder and output the description of the scene image.
Specifically, when the semantic relation graph S is constructed with the scene graph extractor in step one, a scene graph parser first extracts a semantic relationship triple for each pair of targets in the scene image; a word embedding layer then converts the triples into semantic embedding vectors; finally, all semantic embedding vectors are assembled into a high-dimensional directed semantic vector matrix, namely the semantic relation graph S.
Specifically, when the region feature embedding vector H is computed in step one, the visual information of the targets of interest in the scene image is encoded into region features V with the pre-trained Faster R-CNN model, and the target categories C are obtained at the same time; a semantic embedding layer then maps C into word embedding vectors $E_C$ of the target categories; to obtain context information for the region features, the region feature embedding vector is computed as $H=f([V;E_C])$, wherein f(·) is a feed-forward network with a ReLU activation function and [;] denotes the feature stitching operation.
Specifically, when the geometric relation graph G is constructed in step one, the relative geometric features between the bounding boxes of target i and target j are first computed:

$$\lambda_{ij}=\left(\log\frac{|x_i-x_j|}{w_i},\ \log\frac{|y_i-y_j|}{h_i},\ \log\frac{w_j}{w_i},\ \log\frac{h_j}{h_i}\right)$$

wherein $\lambda_{ij}$ is a four-dimensional vector containing relative distances and sizes; $x_i$, $y_i$, $w_i$, $h_i$ are the center abscissa, center ordinate, width and height of the bounding box of target i, and $x_j$, $y_j$, $w_j$, $h_j$ are those of the bounding box of target j;

the relative geometric features $\lambda_{ij}$ are then fed into a position encoding function to obtain a high-dimensional geometric relationship matrix, namely the geometric relation graph G.
The method is first trained with a cross-entropy loss $L_{XE}(\theta)=-\sum_t \log p_\theta\!\left(y_t^{*}\mid y_{1:t-1}^{*}\right)$, wherein θ denotes the parameters of the adaptive enhanced self-attention network, $y_t^{*}$ is the ground-truth target word at the current time, and $y_{1:t-1}^{*}$ is the previously generated word sequence;

additional model training then maximizes the expected reward $E_{y_{1:T}\sim p_\theta}\!\left[r(y_{1:T})\right]$, wherein the reward r is the CIDEr-D score and E denotes the expected reward; the goal is, based on the parameters θ and the word sequences $y_{1:T}$ generated at different training moments, to maximize the expected CIDEr-D reward E for the generated words;

in the training process, several rounds of cross-entropy training are performed first, and reinforcement learning is then used to further optimize until convergence.
Compared with the prior art, the invention has the following beneficial technical effects:
To address inaccurate relationship generation between objects in the image description task, the invention models relationships jointly at the geometric and semantic levels, enriching the feature representation of visual relationships and achieving more accurate, higher-quality image description generation. In particular, when a definite geometric or semantic relationship exists between two objects in an image, the method adaptively strengthens the visual relationship in the given image and realizes high-precision, credible image description generation.
Drawings
Fig. 1 is an overall schematic diagram of the adaptive enhanced self-attention network according to the present invention.
Detailed Description
A preferred embodiment of the present invention will be described in detail below with reference to the accompanying drawings.
Fig. 1 shows the overall network structure. For a given natural scene image, the invention obtains the final image description through three modules. First, (a) feature extraction: a scene graph extractor constructs the semantic relation graph, and the pre-trained object detector Faster R-CNN detects the regions of interest and bounding boxes to construct the geometric relation graph. Second, (b) an encoder with the adaptive relation-enhanced attention mechanism: 1) direction-sensitive semantic enhancement considers the bidirectional association between region features and semantic relations, which together represent the complete triple (subject-predicate-object) information; 2) geometric relationship enhancement dynamically computes the association between region features and geometric features; 3) adaptive relationship weight assignment adaptively strengthens the different relationship features. Finally, (c) language generation prediction: a Transformer decoder generates a credible image description with accurate relationships. The details are as follows:
1. feature extraction
The present invention uses the region features, semantic relation graph and geometric relation graph of the objects to jointly represent the complex relationships between objects. The visual features of the targets carry rich object detail information, which is important for image understanding; the semantic relation graph represents the actions and behaviors between objects; and the geometric relation graph, reflecting the spatial pattern of individual objects, supplements the visual information.
1.1 regional characteristics
The invention first uses the pre-trained object detector Faster R-CNN to encode the visual information of a series of targets of interest into region features V and, at the same time, obtains the target categories C. A semantic embedding layer then maps C into the word embedding vectors $E_C$ of the target categories. Finally, to obtain context information for the region features, the region feature embedding vector H is computed as:

$$H=f([V;E_C])$$

wherein f(·) is a feed-forward network with a ReLU activation function and [;] denotes the feature stitching operation.
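As a minimal NumPy sketch of the region-feature embedding step above; the dimensions, the weight names (`W1`, `W2`) and the single-hidden-layer form of the feed-forward network are illustrative assumptions, not taken from the patent:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d_v, d_e, d = 5, 8, 4, 16             # regions, visual dim, class-embedding dim, model dim

V = rng.standard_normal((N, d_v))        # region visual features from the detector
E_c = rng.standard_normal((N, d_e))      # word embeddings of the predicted classes

W1 = rng.standard_normal((d_v + d_e, d)) # FFN weights (assumed single hidden layer)
b1 = np.zeros(d)
W2 = rng.standard_normal((d, d))
b2 = np.zeros(d)

def ffn(x):
    """Feed-forward network with a ReLU activation, as the text describes f(.)."""
    return np.maximum(x @ W1 + b1, 0.0) @ W2 + b2

# [V; E_c] feature stitching, then the FFN, gives the embedding H
H = ffn(np.concatenate([V, E_c], axis=1))
print(H.shape)  # (5, 16)
```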
1.2 semantic relationship graph
To obtain accurate visual semantic association information, the invention uses an existing scene graph parser to extract a semantic relationship triple (subject-predicate-object) for each pair of targets in the scene image, then uses a word embedding layer to convert the triples into semantic embedding vectors, and finally assembles all semantic embedding vectors into a high-dimensional directed semantic vector matrix, namely the semantic relation graph S. This matrix serves explicitly as semantic prior knowledge to guide the encoder to encode more accurate semantic relationships.
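A toy sketch of assembling the directed semantic matrix S from parsed triples; the vocabulary, embedding table and example predicates are invented for illustration only:

```python
import numpy as np

rng = np.random.default_rng(1)
vocab = {"<none>": 0, "riding": 1, "holding": 2}   # hypothetical predicate vocabulary
emb = rng.standard_normal((len(vocab), 6))          # word-embedding layer, d_s = 6

N = 3                                               # number of detected objects
# Predicates a scene-graph parser might return for ordered pairs (i, j):
triples = {(0, 1): "riding", (2, 0): "holding"}

S = np.zeros((N, N, 6))
for (i, j), pred in triples.items():
    S[i, j] = emb[vocab[pred]]                      # directed: S[i, j] != S[j, i]
print(S.shape)  # (3, 3, 6)
```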
1.3 geometric relational graph
Unlike the semantic graph, the geometric graph is undirected. The invention first computes the relative geometric features between the bounding boxes of target i and target j; $\lambda_{ij}$ is a four-dimensional vector containing relative distances and sizes, of the form:

$$\lambda_{ij}=\left(\log\frac{|x_i-x_j|}{w_i},\ \log\frac{|y_i-y_j|}{h_i},\ \log\frac{w_j}{w_i},\ \log\frac{h_j}{h_i}\right)$$

wherein $x_i$, $y_i$, $w_i$, $h_i$ are the center abscissa, center ordinate, width and height of the bounding box of target i, and $x_j$, $y_j$, $w_j$, $h_j$ are those of the bounding box of target j. To obtain the geometric relationship information, the relative geometric features $\lambda_{ij}$ are fed into an existing position encoding function, which outputs a high-dimensional geometric relationship matrix, namely the geometric relation graph G. This explicit high-dimensional geometric relationship matrix serves as geometric prior knowledge to guide the encoder to encode more accurate geometric relationships.
2. Adaptive relationship enhanced attention mechanism
The invention proposes a novel adaptive relation-enhanced attention mechanism that considers semantic relationship enhancement and geometric relationship enhancement simultaneously and adaptively encodes arbitrary pairwise relationship representation vectors between targets; these representation vectors are computed jointly over all inputs by the adaptive relation-enhancement mechanism, as shown in Fig. 1. Specifically, the mechanism comprises: 1) a relation-enhanced attention mechanism, and 2) adaptive relationship weight assignment, each detailed below.
2.1 relationship-enhanced attention mechanism
To realize the relation-enhanced attention mechanism, the invention simultaneously exploits the prior knowledge of semantic and geometric relationships and divides the attention mechanism into three parts: semantic enhanced attention, geometric enhanced attention and the region feature weight. Semantic enhanced attention denotes the attention enhancement obtained from the semantic relationship prior; geometric enhanced attention denotes the attention enhancement obtained from the geometric relationship prior; and the region feature weight denotes the original attention score derived from the region visual features. Specifically, the enhanced attention scores are computed as:

$$\Omega_H=\frac{Q_H K_H^{\top}}{\sqrt{d}},\qquad \Omega_S=f_S(Q,K,S),\qquad \Omega_G=f_G(Q,K,G)$$

wherein H is the region feature embedding vector; $Q_H=HW_Q^H$ and $K_H=HW_K^H$ are the query and key-value region feature vectors generated with the region-feature projection matrices $W_Q^H$ and $W_K^H$, so that, following the basic computation of the attention mechanism, $\Omega_H$ denotes the region feature weight computed from the region feature embedding vector. $\Omega_S$ is the semantic enhanced attention, $\Omega_G$ is the geometric enhanced attention, $f_S$ is the semantic relationship enhancement function, and $f_G$ is the geometric relationship enhancement function; the two enhancement functions are described in detail below.
The semantic relationship enhancement function raises the attention score with the semantic relationship information of two targets whenever a specific relationship exists between them; for this purpose the invention proposes a novel direction-sensitive semantic enhanced attention algorithm. Specifically, the invention considers both directions of attention, from the region feature information to the semantic relations and from the semantic relations to the region feature information, and the semantic relationship enhancement function is computed as:

$$\Omega_S=\frac{Q_c K_s^{\top}+Q_s K_c^{\top}}{\sqrt{d}}$$

wherein $W_Q^S$ and $W_K^S$ are projection matrices based on the semantic relationship, $Q_s=SW_Q^S$ and $K_s=SW_K^S$ are the projected query and key-value semantic relationship vectors, and the subscripts c and s denote region features (content) and semantic relationships (semantic), respectively, so that $Q_c=Q_H$ and $K_c=K_H$. In the basic computation of the attention mechanism, $Q_cK_s^{\top}$ and $Q_sK_c^{\top}$ thus represent the attention of the region feature information to the semantic relations and of the semantic relations to the region feature information, respectively.
For the geometric enhancement function, unlike the semantic relationship, the geometric matrix is symmetric. The invention performs dynamic geometric enhancement with the query region feature vector $Q_H$ and the key-value geometric relationship vector $K_g=GW_K^G$:

$$\Omega_G=\frac{Q_H K_g^{\top}}{\sqrt{d}}$$
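The three attention-score components can be sketched jointly in NumPy; the einsum combination implementing the direction-sensitive semantic term is one reading of the text above and should be treated as an assumption:

```python
import numpy as np

rng = np.random.default_rng(2)
N, d = 4, 8

H = rng.standard_normal((N, d))          # region feature embeddings
S = rng.standard_normal((N, N, d))       # semantic relation graph (directed)
G = rng.standard_normal((N, N, d))       # geometric relation graph

Wq_h, Wk_h = rng.standard_normal((2, d, d))
Wq_s, Wk_s = rng.standard_normal((2, d, d))
Wk_g = rng.standard_normal((d, d))

Q_h, K_h = H @ Wq_h, H @ Wk_h            # query / key region feature vectors
Q_s, K_s = S @ Wq_s, S @ Wk_s            # projected semantic relation vectors
K_g = G @ Wk_g                           # projected geometric relation vectors

scale = np.sqrt(d)
omega_H = Q_h @ K_h.T / scale                            # region feature weight
# Direction-sensitive semantic term: region->relation plus relation->region
omega_S = (np.einsum('id,ijd->ij', Q_h, K_s)
           + np.einsum('ijd,jd->ij', Q_s, K_h)) / scale
omega_G = np.einsum('id,ijd->ij', Q_h, K_g) / scale      # dynamic geometric term
print(omega_H.shape, omega_S.shape, omega_G.shape)
```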
2.2 adaptive relationship weight assignment
In different scenes, the image description depends unevenly on semantic and geometric information, which determines when the semantic and geometric features need to be activated. The invention therefore introduces an adaptively learned weight assignment module into the attention aggregation operation to obtain different levels of semantic and geometric relationship enhancement. Specifically, a gated fusion module implemented with a sigmoid function is introduced:

$$[\beta_S,\beta_G]=\sigma\!\left([V_H;V_S;V_G]\,W_g\right)$$

wherein $V_H=HW_V^H$, $V_S=SW_V^S$ and $V_G=GW_V^G$ are the value region feature vector, value semantic relationship vector and value geometric relationship vector generated with the projection matrices $W_V^H$, $W_V^S$ and $W_V^G$. The gating value indicates which kind of relationship information matters more in the current state: $\beta_S$ is the output weight of the semantic relationship, $\beta_G$ is the output weight of the geometric relationship, and $W_g$ is a learnable parameter. Finally, the output of the adaptive relation-enhanced attention is computed as:

$$\mathrm{AdaRA}(H)=\operatorname{softmax}\!\left(\Omega_H+\beta_S\odot\Omega_S+\beta_G\odot\Omega_G\right)V_H$$
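The gated fusion and final aggregation might look as follows; the gate's shapes (concatenated value vectors mapped to two scalars per region) and the pooled semantic/geometric value vectors are assumptions for this sketch:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(3)
N, d = 4, 8

# Value vectors and the three attention-score maps from the previous steps
V_h = rng.standard_normal((N, d))                 # value region feature vectors
V_s = rng.standard_normal((N, d))                 # pooled value semantic vectors (assumed)
V_g = rng.standard_normal((N, d))                 # pooled value geometric vectors (assumed)
omega_H, omega_S, omega_G = rng.standard_normal((3, N, N))
W_gate = rng.standard_normal((3 * d, 2))          # learnable gating parameters (assumed shape)

gate = sigmoid(np.concatenate([V_h, V_s, V_g], axis=1) @ W_gate)  # (N, 2)
beta_S, beta_G = gate[:, :1], gate[:, 1:]         # per-region output weights

logits = omega_H + beta_S * omega_S + beta_G * omega_G
attn = np.exp(logits - logits.max(axis=1, keepdims=True))
attn /= attn.sum(axis=1, keepdims=True)           # row-wise softmax
out = attn @ V_h                                  # adaptive relation-enhanced attention
print(out.shape)  # (4, 8)
```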
In summary, the invention considers direction-sensitive semantic enhanced attention and geometric enhanced attention from different perspectives: the former focuses more on the provided prior semantic relationships, whereas the latter focuses more on the geometric relationships between objects.
3. Language generation prediction
After encoding, the invention uses a Transformer decoder to generate the image description. The input to the decoder is the features from the last encoder layer, which contain the alignment between region features and words, and the output is a descriptive sentence, as shown in Fig. 1. The confidence of the word distribution output after the decoding layers is:

$$p\!\left(y_t\mid y_{1:t-1}\right)=\operatorname{softmax}\!\left(W_p\,h_t\right)$$

wherein $h_t$ denotes the decoder state at step t and $W_p$ the output projection.
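The per-step word distribution can be sketched as a linear projection followed by a softmax; `W_p` and `h_t` are assumed names for the output projection and the last-layer decoder state:

```python
import numpy as np

rng = np.random.default_rng(4)
d, vocab_size = 8, 10

h_t = rng.standard_normal(d)                # decoder state at step t (last layer)
W_p = rng.standard_normal((d, vocab_size))  # output projection (assumed name)

logits = h_t @ W_p
p = np.exp(logits - logits.max())
p /= p.sum()                                # word distribution at step t
print(p.shape)  # (10,)
```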
4. Training loss function
The adaptive enhanced self-attention network of the invention is trained end to end. Following the standard image description training regime, the network is first optimized with a cross-entropy loss, with training objective:

$$L_{XE}(\theta)=-\sum_{t=1}^{T}\log p_\theta\!\left(y_t^{*}\mid y_{1:t-1}^{*}\right)$$

wherein θ denotes the model parameters, $y_t^{*}$ is the ground-truth target word at the current time, and $y_{1:t-1}^{*}$ is the previously generated word sequence.
Then, additional training is performed with reinforcement learning:

$$L_{RL}(\theta)=-E_{y_{1:T}\sim p_\theta}\!\left[r(y_{1:T})\right]$$

wherein the reward r is the CIDEr-D score; CIDEr-D assigns a lower weight to words that occur frequently across all reference captions yet are irrelevant to the visual information, and E denotes the expected reward. The goal of the additional training is, based on the model parameters θ and the word sequences $y_{1:T}$ generated at different training moments, to maximize the expected CIDEr-D reward E for the generated words.
In the training process, the model is first trained for several rounds with the cross-entropy loss and then further optimized with the CIDEr-D reward until convergence.
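The two-stage regime can be illustrated with toy numbers; `cross_entropy` and the self-critical weight below are simplified stand-ins (the real reward is a CIDEr-D score, which is only mocked here):

```python
import numpy as np

def cross_entropy(probs, target_ids):
    """Stage 1: XE loss, -sum over t of log p(y*_t | y*_<t) along the ground truth."""
    return -np.sum(np.log([p[t] for p, t in zip(probs, target_ids)]))

def scst_grad_weight(sampled_reward, baseline_reward):
    """Stage 2 (sketch): self-critical advantage (r - b); positive advantage
    reinforces the sampled caption, negative suppresses it."""
    return sampled_reward - baseline_reward

# Toy per-step word distributions over a 5-word vocabulary, 3 decoding steps
probs = np.full((3, 5), 0.2)
xe = cross_entropy(probs, [1, 0, 4])      # each step contributes -log(0.2)

w = scst_grad_weight(sampled_reward=1.2, baseline_reward=0.9)  # mocked CIDEr-D values
print(xe, w)
```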
The invention uses the semantic and geometric relationships jointly to cover the complex visual relationships between targets and can adaptively strengthen both. Extensive experimental results show that the invention significantly and effectively improves the precision of image description and generates more accurate semantic and geometric relationship words, demonstrating its advantages on the image description task.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein, and any reference signs in the claims are not intended to be construed as limiting the claim concerned.
Furthermore, it should be understood that although the present description refers to embodiments, not every embodiment may contain only a single embodiment, and such description is for clarity only, and those skilled in the art should integrate the description, and the embodiments may be combined as appropriate to form other embodiments understood by those skilled in the art.
Claims (5)
1. An image description method based on an adaptive enhanced self-attention network comprises the following steps:
the method comprising the following steps: step one: for a given scene image, construct a semantic relation graph S using a scene graph extractor; detect the region features of the targets of interest in the scene image with a pre-trained Faster R-CNN model and compute the region feature embedding vector H; and construct a geometric relation graph G from the relative geometric features of the target bounding boxes;
step two: encode the relationships between targets through an adaptive relation-enhanced attention mechanism, whose three attention-score components are

$$\Omega_H=\frac{Q_H K_H^{\top}}{\sqrt{d}},\qquad \Omega_S=f_S(Q,K,S),\qquad \Omega_G=f_G(Q,K,G)$$

wherein $\Omega_H$ is the region feature weight, $\Omega_S$ is the semantic enhanced attention, and $\Omega_G$ is the geometric enhanced attention; $Q_H=HW_Q^H$ and $K_H=HW_K^H$ are, respectively, the query region feature vector and the key-value region feature vector generated from the region feature embedding vector H with the region-feature projection matrices $W_Q^H$ and $W_K^H$;

wherein $\Omega_S=\dfrac{Q_H K_S^{\top}+Q_S K_H^{\top}}{\sqrt{d}}$; $W_Q^S$ and $W_K^S$ are projection matrices based on the semantic relationship, $Q_S=SW_Q^S$ is the projected query semantic relationship vector, and $K_S=SW_K^S$ is the projected key-value semantic relationship vector;

wherein $\Omega_G=\dfrac{Q_H K_G^{\top}}{\sqrt{d}}$; $W_K^G$ is a projection matrix based on the geometric relationship and $K_G=GW_K^G$ is the projected key-value geometric relationship vector;
Step three: strengthen the semantic and geometric relations through adaptive weight assignment to obtain the adaptive relation-enhanced attention

$$\mathrm{AdaRA}(H)=\operatorname{softmax}\!\left(\Omega_H+\beta_S\odot\Omega_S+\beta_G\odot\Omega_G\right)V_H$$

wherein the gating value $[\beta_S,\beta_G]=\sigma\!\left([V_H;V_S;V_G]\,W_g\right)$ and σ(·) is a sigmoid function; $V_H=HW_V^H$, $V_S=SW_V^S$ and $V_G=GW_V^G$ are the value region feature vector, value semantic relationship vector and value geometric relationship vector generated with the projection matrices $W_V^H$, $W_V^S$ and $W_V^G$; $\beta_S$ is the output weight of the semantic relationship, $\beta_G$ is the output weight of the geometric relationship, and $W_g$ is a learnable parameter;
step four: feed the features of the last encoder layer into a Transformer decoder and output the description of the scene image.
2. The image description method based on the adaptive enhanced self-attention network of claim 1, wherein, when the semantic relation graph S is constructed with the scene graph extractor in step one, a scene graph parser extracts a semantic relationship triple for each pair of targets in the scene image, a word embedding layer then converts the semantic relationship triples into semantic embedding vectors, and finally all semantic embedding vectors are assembled into a high-dimensional directed semantic vector matrix, namely the semantic relation graph S.
3. The image description method based on the adaptive enhanced self-attention network of claim 1, wherein, when the region feature embedding vector H is computed in step one, the visual information of the targets of interest in the scene image is encoded into region features V with the pre-trained Faster R-CNN model, and the target categories C are obtained at the same time; a semantic embedding layer then maps C into word embedding vectors $E_C$ of the target categories; to obtain context information for the region features, the region feature embedding vector is computed as $H=f([V;E_C])$, wherein f(·) is a feed-forward network with a ReLU activation function and [;] denotes the feature stitching operation.
4. The image description method based on the adaptive enhanced self-attention network of claim 1, wherein, when the geometric relation graph G is constructed in step one, the relative geometric features between the bounding boxes of target i and target j are first computed:

$$\lambda_{ij}=\left(\log\frac{|x_i-x_j|}{w_i},\ \log\frac{|y_i-y_j|}{h_i},\ \log\frac{w_j}{w_i},\ \log\frac{h_j}{h_i}\right)$$

wherein $\lambda_{ij}$ is a four-dimensional vector containing relative distances and sizes; $x_i$, $y_i$, $w_i$, $h_i$ are the center abscissa, center ordinate, width and height of the bounding box of target i, and $x_j$, $y_j$, $w_j$, $h_j$ are those of the bounding box of target j;
5. The image description method based on the adaptive enhanced self-attention network of claim 1, wherein the image description method is trained with a cross-entropy loss function $L_{XE}(\theta)=-\sum_t \log p_\theta\!\left(y_t^{*}\mid y_{1:t-1}^{*}\right)$, wherein θ denotes the parameters of the adaptive enhanced self-attention network, $y_t^{*}$ is the ground-truth target word at the current time, and $y_{1:t-1}^{*}$ is the previously generated word sequence;

wherein additional model training maximizes the expected reward $E_{y_{1:T}\sim p_\theta}\!\left[r(y_{1:T})\right]$, the reward r being the CIDEr-D score and E denoting the expected reward, the goal being, based on the parameters θ and the word sequences $y_{1:T}$ generated at different training moments, to maximize the expected CIDEr-D reward E for the generated words;

wherein, in the training process, several rounds of cross-entropy training are performed first, and reinforcement learning is then used to further optimize until convergence.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210586762.9A CN114677580B (en) | 2022-05-27 | 2022-05-27 | Image description method based on self-adaptive enhanced self-attention network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114677580A CN114677580A (en) | 2022-06-28 |
CN114677580B true CN114677580B (en) | 2022-09-30 |
Family
ID=82079777
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210586762.9A Active CN114677580B (en) | 2022-05-27 | 2022-05-27 | Image description method based on self-adaptive enhanced self-attention network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114677580B (en) |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116152118B (en) * | 2023-04-18 | 2023-07-14 | 中国科学技术大学 | Image description method based on contour feature enhancement |
CN116204674B (en) * | 2023-04-28 | 2023-07-18 | 中国科学技术大学 | Image description method based on visual concept word association structural modeling |
CN117612170A (en) * | 2024-01-23 | 2024-02-27 | 中国科学技术大学 | Image-to-long text generation method combining memory network and diffusion network |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112256904A (en) * | 2020-09-21 | 2021-01-22 | 天津大学 | Image retrieval method based on visual description sentences |
CN113515951A (en) * | 2021-07-19 | 2021-10-19 | 同济大学 | Story description generation method based on knowledge enhanced attention network and group-level semantics |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11238631B2 (en) * | 2019-08-19 | 2022-02-01 | Sri International | Align-to-ground, weakly supervised phrase grounding guided by image-caption alignment |
- 2022-05-27: CN application CN202210586762.9A filed; granted as CN114677580B (status: active)
Non-Patent Citations (5)
- NONG Yuan-jun et al.; "A image caption method of construction scene based on attention mechanism and encoding-decoding architecture"; Journal of Zhejiang University (Engineering Science) (《浙江大学学报(工学版)》), vol. 56, no. 2, Feb. 2022, pp. 236-244.
- Longteng Guo et al.; "Aligning Linguistic Words and Visual Semantic Units for Image Captioning"; MM '19: Proceedings of the 27th ACM International Conference on Multimedia, Oct. 2019, pp. 765-773.
- Linghui Li et al.; "Image Caption with Global-Local Attention"; Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (AAAI-17), 2017, pp. 4133-4139.
- DENG Xu-ran et al.; "A survey of automatic image content description techniques" (图像内容自动描述技术综述); Journal of Information Security Research (《信息安全研究》), vol. 5, no. 11, Nov. 2019, pp. 988-992.
- TIAN Jing-xian; "Research on image description generation based on attention mechanism" (基于注意力机制的图像描述生成研究); China Master's Theses Full-text Database, Information Science and Technology, no. 04, Apr. 2022, pp. I138-966.
Legal Events
- PB01: Publication
- SE01: Entry into force of request for substantive examination
- GR01: Patent grant