CN114677580B - Image description method based on self-adaptive enhanced self-attention network - Google Patents

Image description method based on self-adaptive enhanced self-attention network Download PDF

Info

Publication number
CN114677580B
CN114677580B CN202210586762.9A CN202210586762A CN114677580B CN 114677580 B CN114677580 B CN 114677580B CN 202210586762 A CN202210586762 A CN 202210586762A CN 114677580 B CN114677580 B CN 114677580B
Authority
CN
China
Prior art keywords
semantic
geometric
adaptive
vector
relation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210586762.9A
Other languages
Chinese (zh)
Other versions
CN114677580A (en
Inventor
毛震东
张勇东
李经宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Science and Technology of China USTC
Original Assignee
University of Science and Technology of China USTC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Science and Technology of China USTC filed Critical University of Science and Technology of China USTC
Priority to CN202210586762.9A priority Critical patent/CN114677580B/en
Publication of CN114677580A publication Critical patent/CN114677580A/en
Application granted granted Critical
Publication of CN114677580B publication Critical patent/CN114677580B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to the technical field of image description, and discloses an image description method based on a self-adaptive enhanced self-attention network, which can be used for modeling the relationship on a geometric level and a semantic level together, and can be used for adaptively enhancing the visual relationship in a given image to realize high-precision credible image description generation when a definite geometric or semantic relationship exists between two objects in the image.

Description

Image description method based on self-adaptive enhanced self-attention network
Technical Field
The invention relates to the technical field of image description, in particular to an image description method based on a self-adaptive enhanced self-attention network.
Background
The image description aims to automatically generate a sentence description for a given image, can well combine vision and language, and is an important multi-modal task. The generated image description is to not only identify objects of interest in the image, but also to describe the relationships between the objects. For image description, a key challenge is how to accurately and efficiently model the relationships between identified objects, which is crucial to improve the quality of the generation. Recently, geometric information has been extensively studied to enhance features at the regional level, because geometric features, i.e., relative distance and relative size, contain a clear positional relationship between objects.
Visual relationships include geometric positions and semantic interactions, which indicate the correlation between objects in the region-level representation, but previous work only uses geometric positions to enhance the representation of the visual relationships, and only shallow position information cannot cover the semantic relationships with complex action. Therefore, the prior art is limited in that the image description model has difficulty generating descriptions with accurate semantic relationship prediction credible.
Disclosure of Invention
In order to solve the technical problem, the invention provides an image description method based on an adaptive enhanced self-attention network.
In order to solve the technical problem, the invention adopts the following technical scheme:
an image description method based on an adaptive enhanced self-attention network comprises the following steps:
the method comprises the following steps: for a given scene image, a semantic relation graph S is constructed by using a scene graph extractor, the regional characteristics of an interested target in the scene image are detected through a pre-trained Faster-RCNN model, a regional characteristic embedding vector H is obtained through calculation, and a geometric relation graph G is constructed through the relative geometric characteristics of a target boundary box;
step two: the method for enhancing the attention mechanism through the adaptive relationship comprises the following steps of:
enhancing attention by calculating relationships
Figure DEST_PATH_IMAGE002
Wherein,
Figure DEST_PATH_IMAGE004
in order to be the region feature weight,
Figure DEST_PATH_IMAGE006
in order to enhance the attention of the semantics,
Figure DEST_PATH_IMAGE008
to enhance attention for geometry;
Figure DEST_PATH_IMAGE010
and
Figure DEST_PATH_IMAGE012
respectively, using projection matrices based on regional characteristics
Figure DEST_PATH_IMAGE014
And
Figure DEST_PATH_IMAGE016
the generated query region feature vector and the key value region feature vector are as follows:
Figure DEST_PATH_IMAGE018
wherein,
Figure DEST_PATH_IMAGE020
Figure DEST_PATH_IMAGE022
and
Figure DEST_PATH_IMAGE024
is a projection matrix based on a semantic relationship,
Figure DEST_PATH_IMAGE026
for the projected query semantic relationship vector,
Figure DEST_PATH_IMAGE028
obtaining a key value semantic relation vector for projection;
wherein,
Figure DEST_PATH_IMAGE030
Figure DEST_PATH_IMAGE032
in order to be a projection matrix based on a geometric relationship,
Figure DEST_PATH_IMAGE034
keys derived for projectionA value geometric relationship vector;
step three: the semantic relation and the geometric relation are enhanced in a self-adaptive weight distribution mode to obtain self-adaptive relation enhanced attention
Figure DEST_PATH_IMAGE036
Wherein the gating value
Figure DEST_PATH_IMAGE038
Figure DEST_PATH_IMAGE040
Figure DEST_PATH_IMAGE042
And
Figure DEST_PATH_IMAGE044
respectively using projection matrices
Figure DEST_PATH_IMAGE046
Figure DEST_PATH_IMAGE048
And
Figure DEST_PATH_IMAGE050
generating a truth value region feature vector, a truth value semantic relation vector and a truth value geometric relation vector:
Figure DEST_PATH_IMAGE052
Figure DEST_PATH_IMAGE054
an output weight representing the semantic relationship is determined,
Figure DEST_PATH_IMAGE056
the output weights representing the geometric relationship,
Figure DEST_PATH_IMAGE058
is a learnable parameter;
step four: and inputting the characteristics of the last layer of the encoder into a transform decoder, and outputting the description of the scene image.
Specifically, when the scene graph extractor is used to construct the semantic relation graph S in the first step, the scene graph analyzer is used to extract semantic relation triples for each pair of targets of the scene image, then a word embedding layer is used to convert the semantic relation triples into semantic embedding vectors, and finally all the semantic embedding vectors are constructed into a high-dimensional directed semantic vector matrix, namely the semantic relation graph S.
Specifically, when the region feature embedding vector H is calculated in the step one, the visual information of the target of interest in the scene image is coded into the region feature by using the pre-trained Faster R-CNN model
Figure DEST_PATH_IMAGE060
Meanwhile, obtaining the category C of the target, and then adopting a semantic embedding layer to map the category C into a word embedding vector of the target category
Figure DEST_PATH_IMAGE062
Calculating region feature embedding vector for obtaining context information of region feature
Figure DEST_PATH_IMAGE064
Wherein
Figure DEST_PATH_IMAGE066
is a feed-forward network with a ReLU activation function]A feature stitching operation is represented.
Specifically, when the geometric relationship graph G is constructed in the first step, the relative geometric features between the bounding boxes of the object i and the object j are first calculated
Figure DEST_PATH_IMAGE068
Figure DEST_PATH_IMAGE070
Wherein,
Figure DEST_PATH_IMAGE072
for a four-dimensional vector containing relative distances and magnitudes,
Figure DEST_PATH_IMAGE074
respectively is the central abscissa, central ordinate, width and height of the target i bounding box,
Figure DEST_PATH_IMAGE076
respectively the central abscissa, central ordinate, width and height of the bounding box of the target j;
relative geometric features
Figure DEST_PATH_IMAGE077
And inputting a position coding function to obtain a high-dimensional geometric relationship matrix, namely the geometric relationship graph G.
In particular, a cross entropy loss function is employed
Figure DEST_PATH_IMAGE079
Training the image description method:
Figure DEST_PATH_IMAGE081
wherein,
Figure DEST_PATH_IMAGE083
to adaptively enhance the parameters of the adaptive attention network,
Figure DEST_PATH_IMAGE085
is the target true value sequence at the current time,
Figure DEST_PATH_IMAGE087
for previously generated word sequences;
then, reinforcement learning is utilized
Figure DEST_PATH_IMAGE089
Additional model training was performed:
Figure DEST_PATH_IMAGE091
wherein the prize is awarded
Figure DEST_PATH_IMAGE093
For CIDER-D scores, E for expected rewards, and the goal of additional model training is based on parameters
Figure DEST_PATH_IMAGE095
And generating word sequences at different training moments
Figure DEST_PATH_IMAGE097
Generating a reward
Figure 397609DEST_PATH_IMAGE093
A CIDER-D score of (1), maximizing the expected reward E for generating a word;
in the training process, first several rounds of cross entropy loss training are used, and then reinforcement learning is used
Figure 688913DEST_PATH_IMAGE089
And further optimizing until convergence.
Compared with the prior art, the invention has the beneficial technical effects that:
in order to solve the problem of inaccurate relation generation among objects in the image description task, the invention can jointly model the relation on the geometric level and the semantic level, enrich the characteristic representation of the visual relation and realize the more accurate and higher-quality image description generation effect. In particular, when a definite geometric or semantic relation exists between two objects in an image, the method can adaptively enhance the visual relation in the given image and realize high-precision credible image description generation.
Drawings
Fig. 1 is an overall schematic diagram of an adaptive enhanced adaptive attention network according to the present invention.
Detailed Description
A preferred embodiment of the present invention will be described in detail below with reference to the accompanying drawings.
Fig. 1 shows the overall network structure. For a given natural scene image, the invention obtains the final image description through the processing of three modules. First, (a) feature extraction: the scene graph extractor is used for constructing a semantic relation graph, and the pre-trained object detector fast-RCNN is used for detecting the interested region and the bounding box to construct a geometric relation graph. Second, (b) an encoder with adaptive relationship enhancement attention mechanism: 1) the direction sensitive semantic enhancement considers the bidirectional association of the region features to the semantic relations and the semantic relations to the region features, and the region features and the semantic relations are used for representing complete triple (subject-predicate-object) information; 2) the geometric relationship enhancement dynamically calculates the association between the regional features and the geometric features; 3) adaptive relationship weight assignment adaptively enhances different relationship characteristics. Finally, (c) language-generated predictions: a trusted image description with accurate relationships is generated using a translator decoder. The detailed description is as follows:
1. feature extraction
The present invention uses regional, semantic, and geometric relational graph features of objects to collectively represent complex relationships between objects. The target visual features have rich object detail information, which is important for image understanding; the semantic relation graph represents action behaviors among the objects; a geometric map reflecting the spatial pattern of a single object may supplement the visual information.
1.1 regional characteristics
The invention firstly utilizes a pre-training target detector Faster R-CNN model to encode visual information of a series of interested targets into regional characteristics
Figure 822127DEST_PATH_IMAGE060
And meanwhile, obtaining the class C of the target. Then, the invention adopts a semantic embedding layer to map C into a word embedding vector of a target category
Figure DEST_PATH_IMAGE098
. Finally, to obtain context information for the regional characteristics, a counter is calculatedCalculating the region feature embedding vector H as follows:
Figure DEST_PATH_IMAGE099
wherein,
Figure 630814DEST_PATH_IMAGE066
is a feed-forward network with a ReLU activation function]A feature stitching operation is represented.
1.2 semantic relationship graph
In order to obtain accurate visual semantic association information, the invention uses the existing scene graph analyzer to extract semantic relationship triples (subject-predicate-object) for each pair of targets of a scene image, then uses a word embedding layer to convert the triples into semantic embedding vectors, and finally constructs all the semantic embedding vectors into a high-dimensional directed semantic vector matrix, namely a semantic relationship graph S; the matrix can be used as semantic relation priori knowledge in an explicit mode to guide an encoder to encode more accurate semantic relations.
1.3 geometric relational graph
Unlike the semantic graph, the geometric graph is an undirected graph. The invention first calculates the relative geometric features between the bounding boxes of object i and object j
Figure 362010DEST_PATH_IMAGE068
,
Figure 198116DEST_PATH_IMAGE068
Is a four-dimensional vector containing relative distances and magnitudes, of the form:
Figure 155708DEST_PATH_IMAGE070
wherein,
Figure DEST_PATH_IMAGE100
,
Figure DEST_PATH_IMAGE101
are respectively eyesThe central abscissa, the central ordinate, the width and the height of the bounding box of the index i,
Figure 361562DEST_PATH_IMAGE076
respectively the central abscissa, the central ordinate, the width and the height of the bounding box of the target j; to obtain the geometric relationship information, the relative geometric characteristics are input by using the existing position coding function
Figure 462110DEST_PATH_IMAGE068
Outputting a high-dimensional geometric relationship matrix, i.e. the geometric relationship graph
Figure DEST_PATH_IMAGE103
. This explicit high-dimensional geometric relationship matrix will serve as geometric prior knowledge to guide the encoder to encode more accurate geometric relationships.
2. Adaptive relationship enhanced attention mechanism
The invention provides a novel adaptive relationship enhancement attention mechanism, which considers semantic relationship enhancement and geometric relationship enhancement at the same time, adaptively encodes arbitrary relationship representation vectors between targets in pairs, and the relationship representation vectors perform joint calculation on all inputs by using the adaptive relationship enhancement mechanism, as shown in figure 1. Specifically, the adaptive relationship-enhanced attention mechanism includes: 1) a relation enhancement attention mechanism, and 2) adaptive relation weight distribution, which are respectively described in detail below.
2.1 relationship-enhanced attention mechanism
In order to realize the relation enhancement attention mechanism, the invention simultaneously utilizes the prior knowledge of semantic relation and geometric relation to divide the attention mechanism into three parts, namely semantic enhancement attention, geometric enhancement attention and regional characteristic weight. Semantic enhancement attention means attention enhancement obtained a priori according to a semantic relationship; the geometric enhancement attention represents the attention enhancement obtained a priori according to the geometric relationship; the regional feature weights represent the original attention scores derived from regional visual features. In particular, enhanced attention scores for semantically enhanced attention, geometrically enhanced attention, and regional feature weights
Figure DEST_PATH_IMAGE105
The calculation is as follows:
Figure DEST_PATH_IMAGE107
Figure DEST_PATH_IMAGE108
wherein H is a region feature embedding vector,
Figure DEST_PATH_IMAGE109
and
Figure 610064DEST_PATH_IMAGE012
respectively using projection matrices based on regional features
Figure 914006DEST_PATH_IMAGE014
And
Figure 533338DEST_PATH_IMAGE016
the generated query region feature vector and key-value region feature vector are used, so according to the basic calculation mode of the attention mechanism,
Figure 645650DEST_PATH_IMAGE004
it represents the region feature weight calculated from the region feature embedding vector.
Figure 308713DEST_PATH_IMAGE006
In order to enhance the attention of the semantics,
Figure DEST_PATH_IMAGE110
in order to enhance the attention geometrically,
Figure DEST_PATH_IMAGE112
in order to enhance the function for the semantic relationship,
Figure DEST_PATH_IMAGE114
is a geometric relationship enhancement function.The semantic relationship enhancement function and the geometric relationship enhancement function are described in detail below.
Compared with the prior art, the semantic relation enhancement function has the advantages that when a specific relation exists between two targets, attention scores are improved by utilizing semantic relation information of the two targets, and a novel direction-sensitive semantic enhancement attention algorithm is provided for the method. Specifically, the invention considers the concern of the regional characteristic information to the semantic relation and the concern of the semantic relation to the two directions of the regional characteristic information, and the semantic relation enhancement function
Figure 490164DEST_PATH_IMAGE112
The calculation is as follows:
Figure DEST_PATH_IMAGE115
Figure DEST_PATH_IMAGE116
and
Figure DEST_PATH_IMAGE118
is a projection matrix based on a semantic relationship,
Figure 146142DEST_PATH_IMAGE026
and
Figure DEST_PATH_IMAGE119
subscripts c and s represent region features (content) and semantic relationships (semantic) respectively for the projected query semantic relationship vector and key value semantic relationship vector. Thus in the basic way of calculation of the attention mechanism,
Figure DEST_PATH_IMAGE120
and
Figure DEST_PATH_IMAGE121
the attention of the regional characteristic information to the semantic relation and the attention of the semantic relation to the regional characteristic information in two directions are represented respectively.
For geometric enhancement functions, unlike semantic relationships, the geometric matrix is symmetric. The invention inquires the feature vector of the region
Figure 847382DEST_PATH_IMAGE010
Using key-value geometric relation vectors
Figure 930613DEST_PATH_IMAGE034
Dynamic geometric enhancement was performed:
Figure DEST_PATH_IMAGE122
wherein,
Figure DEST_PATH_IMAGE123
is a projection matrix with respect to the geometric relationship.
2.2 adaptive relationship weight assignment
In different scenarios, the image description generates an inconsistency in dependency on semantic and geometric information, which indicates when semantic and geometric features need to be activated. Therefore, the invention provides a weight distribution module for self-adaptive learning in attention aggregation operation to obtain semantic and geometric relationship information enhancement of different levels. Specifically, the patent introduces a gating fusion module and implements a sigmoid function
Figure DEST_PATH_IMAGE125
Figure DEST_PATH_IMAGE126
Figure DEST_PATH_IMAGE127
Wherein,
Figure 130519DEST_PATH_IMAGE040
Figure 340921DEST_PATH_IMAGE042
and
Figure DEST_PATH_IMAGE128
respectively using projection matrices
Figure DEST_PATH_IMAGE130
Figure DEST_PATH_IMAGE131
And
Figure 880356DEST_PATH_IMAGE050
and generating a truth-value region feature vector, a truth-value semantic relation vector and a truth-value geometric relation vector. Gated value
Figure DEST_PATH_IMAGE132
Indicating which property information is more important to the current state,
Figure 619641DEST_PATH_IMAGE054
an output weight representing the semantic relationship,
Figure 136205DEST_PATH_IMAGE056
the output weights representing the geometric relationship,
Figure 25663DEST_PATH_IMAGE058
are learnable parameters. Finally, the output of the adaptive relationship enhancement attention is calculated as:
Figure DEST_PATH_IMAGE133
in summary, the present invention considers direction sensitive semantic enhanced attention and geometric enhanced attention from different perspectives, the former focusing more on the provided prior semantic relationship, and the latter focusing more on the geometric relationship between objects, in contrast.
3. Language generation prediction
After encoding, the present invention uses a transform decoder (translator decoder) to generate the image description. The input to the decoder is the features from the last layer of encoders, which contain the alignment of the region features to the words, and the output is a descriptive sentence of the region features, as shown in fig. 1. The confidence of the word distribution output after the decoding layer is as follows:
Figure DEST_PATH_IMAGE135
wherein,
Figure DEST_PATH_IMAGE137
w is the learnable weight and b is the bias term for the output of the decoder layer.
4. Training loss function
The adaptive enhanced adaptive attention network of the present invention is trained in an end-to-end manner. According to a standard image description training mode, the adaptive enhanced self-attention network of the invention is firstly optimized by using cross entropy loss, and a training target loss function is as follows:
Figure DEST_PATH_IMAGE138
wherein,
Figure 278659DEST_PATH_IMAGE083
in order to be the parameters of the model,
Figure DEST_PATH_IMAGE139
is the target true value sequence at the current time,
Figure 923267DEST_PATH_IMAGE087
is a previously generated word sequence.
Then, additional training is performed with reinforcement learning:
Figure 425661DEST_PATH_IMAGE091
wherein the prize is awarded
Figure 977865DEST_PATH_IMAGE093
CIDER-D scores, CIDER-D will be given a lower weight score based on the more occurrences of visual information-independent words in all reference labels, E representing the desired reward. The goal of the additional training is to be based on the model parameters
Figure 773783DEST_PATH_IMAGE083
And generating words at different training moments
Figure 136762DEST_PATH_IMAGE097
Information, using generated rewards
Figure 877185DEST_PATH_IMAGE093
The CIDER-D score of (1) maximizes the desired reward E for generating words.
In the training process, the model is first trained for several rounds using cross entropy loss, and then further optimized using CIDER-D awards until convergence.
The invention utilizes the semantic relation and the geometric relation to jointly cover the complex visual relation between the targets, and can adaptively enhance the semantic relation and the geometric relation, and a large number of experimental results show that the invention can obviously and effectively improve the precision of image description, generate more accurate semantic and geometric relation words and better embody the advantages of the invention on the task of image description.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein, and any reference signs in the claims are not intended to be construed as limiting the claim concerned.
Furthermore, it should be understood that although the present description refers to embodiments, not every embodiment may contain only a single embodiment, and such description is for clarity only, and those skilled in the art should integrate the description, and the embodiments may be combined as appropriate to form other embodiments understood by those skilled in the art.

Claims (5)

1. An image description method based on an adaptive enhanced self-attention network comprises the following steps:
the method comprises the following steps: for a given scene image, a semantic relation graph S is constructed by using a scene graph extractor, the regional characteristics of an interested target in the scene image are detected through a pre-trained Faster-RCNN model, a regional characteristic embedding vector H is obtained through calculation, and a geometric relation graph G is constructed through the relative geometric characteristics of a target boundary box;
step two: the method for enhancing the attention mechanism through the adaptive relationship comprises the following steps of:
calculating an enhanced attention score
Figure 976090DEST_PATH_IMAGE001
Wherein,
Figure 865549DEST_PATH_IMAGE002
in order to be the region feature weight,
Figure 603698DEST_PATH_IMAGE003
in order to enhance the attention of the semantics,
Figure 920410DEST_PATH_IMAGE004
to enhance attention for geometry;
Figure 111219DEST_PATH_IMAGE005
and
Figure 788057DEST_PATH_IMAGE006
respectively, using projection matrices based on regional characteristics
Figure 849554DEST_PATH_IMAGE007
And
Figure 399484DEST_PATH_IMAGE008
the generated query region feature vector and the key value region feature vector are as follows:
Figure 15273DEST_PATH_IMAGE009
wherein,
Figure 308851DEST_PATH_IMAGE010
Figure 693696DEST_PATH_IMAGE011
and
Figure 148949DEST_PATH_IMAGE012
is a projection matrix based on the semantic relation,
Figure 252034DEST_PATH_IMAGE013
for the projected query semantic relationship vector,
Figure 83724DEST_PATH_IMAGE014
obtaining a key value semantic relation vector for projection;
wherein,
Figure 385392DEST_PATH_IMAGE015
Figure 198496DEST_PATH_IMAGE016
is a projection matrix based on a geometric relationship,
Figure 851194DEST_PATH_IMAGE017
key value geometric relation vector obtained for projection;
Step three: the semantic relation and the geometric relation are enhanced in a self-adaptive weight distribution mode to obtain self-adaptive relation enhanced attention
Figure 424258DEST_PATH_IMAGE018
Wherein the gating value
Figure 580433DEST_PATH_IMAGE019
And σ () is a sigmoid function,
Figure 580750DEST_PATH_IMAGE020
Figure 455165DEST_PATH_IMAGE021
and
Figure 831920DEST_PATH_IMAGE022
respectively using projection matrices
Figure 842601DEST_PATH_IMAGE023
Figure 76136DEST_PATH_IMAGE024
And
Figure 627728DEST_PATH_IMAGE025
generating a truth value region feature vector, a truth value semantic relation vector and a truth value geometric relation vector:
Figure 604911DEST_PATH_IMAGE026
Figure 673361DEST_PATH_IMAGE027
an output weight representing the semantic relationship,
Figure 77798DEST_PATH_IMAGE028
output representing geometric relationshipsThe weight of the weight is calculated,
Figure 864488DEST_PATH_IMAGE029
is a learnable parameter;
step four: and inputting the characteristics of the last layer of the encoder into a transform decoder, and outputting the description of the scene image.
2. The image description method based on the adaptive augmented adaptive attention network of claim 1, characterized in that: when a scene graph extractor is used for constructing a semantic relation graph S in the first step, a scene graph analyzer is used for extracting semantic relation triples for each pair of targets of a scene image, then a word embedding layer is used for converting the semantic relation triples into semantic embedding vectors, and finally all the semantic embedding vectors are constructed into a high-dimensional directed semantic vector matrix, namely the semantic relation graph S.
3. The image description method based on the adaptive augmented adaptive attention network of claim 1, characterized in that: when the region feature embedding vector H is calculated in the first step, visual information of an interested target in a scene image is coded into region features by utilizing a pre-trained Faster R-CNN model
Figure 379783DEST_PATH_IMAGE030
Meanwhile, obtaining the category C of the target, and then adopting a semantic embedding layer to map the category C into a word embedding vector of the target category
Figure 365057DEST_PATH_IMAGE031
Calculating region feature embedding vector for obtaining context information of region feature
Figure 878078DEST_PATH_IMAGE032
Wherein
Figure 214381DEST_PATH_IMAGE033
for activation with ReLUA feedforward network of functions]A feature stitching operation is represented.
4. The image description method based on the adaptive augmented adaptive attention network of claim 1, characterized in that: when the geometric relational graph G is constructed in the first step, firstly, the relative geometric characteristics between the bounding boxes of the object i and the object j are calculated
Figure 720318DEST_PATH_IMAGE034
Figure 560098DEST_PATH_IMAGE035
Wherein,
Figure 306337DEST_PATH_IMAGE036
for a four-dimensional vector containing the relative distances and magnitudes,
Figure 802040DEST_PATH_IMAGE037
respectively is the central abscissa, central ordinate, width and height of the target i bounding box,
Figure 924717DEST_PATH_IMAGE038
respectively the central abscissa, central ordinate, width and height of the bounding box of the target j;
relative geometric features
Figure 822266DEST_PATH_IMAGE034
And inputting a position coding function to obtain a high-dimensional geometric relationship matrix, namely the geometric relationship graph G.
5. The image description method based on the adaptive augmented adaptive attention network of claim 1, characterized in that: using cross entropy loss function
Figure 739406DEST_PATH_IMAGE039
Training the image description method:
Figure 987985DEST_PATH_IMAGE040
wherein,
Figure 383194DEST_PATH_IMAGE041
to adaptively enhance the parameters of the adaptive attention network,
Figure 197566DEST_PATH_IMAGE042
is the target true value sequence at the current time,
Figure 738138DEST_PATH_IMAGE043
for previously generated word sequences;
then, reinforcement learning is utilized
Figure 270751DEST_PATH_IMAGE044
Additional model training was performed:
Figure 672913DEST_PATH_IMAGE045
wherein the prize is awarded
Figure 341792DEST_PATH_IMAGE046
For CIDER-D scores, E desired rewards, and the goal of the additional model training is based on the parameters
Figure 600735DEST_PATH_IMAGE047
And generating word sequences at different training moments
Figure 558327DEST_PATH_IMAGE048
Generating a reward
Figure 764180DEST_PATH_IMAGE049
A CIDER-D score of maximizing the expected reward E for generating a word;
in the training process, first several rounds of cross entropy loss training are used, and then reinforcement learning is used
Figure 553144DEST_PATH_IMAGE050
And further optimizing until convergence.
CN202210586762.9A 2022-05-27 2022-05-27 Image description method based on self-adaptive enhanced self-attention network Active CN114677580B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210586762.9A CN114677580B (en) 2022-05-27 2022-05-27 Image description method based on self-adaptive enhanced self-attention network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210586762.9A CN114677580B (en) 2022-05-27 2022-05-27 Image description method based on self-adaptive enhanced self-attention network

Publications (2)

Publication Number Publication Date
CN114677580A CN114677580A (en) 2022-06-28
CN114677580B true CN114677580B (en) 2022-09-30

Family

ID=82079777

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210586762.9A Active CN114677580B (en) 2022-05-27 2022-05-27 Image description method based on self-adaptive enhanced self-attention network

Country Status (1)

Country Link
CN (1) CN114677580B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116152118B (en) * 2023-04-18 2023-07-14 中国科学技术大学 Image description method based on contour feature enhancement
CN116204674B (en) * 2023-04-28 2023-07-18 中国科学技术大学 Image description method based on visual concept word association structural modeling
CN117612170A (en) * 2024-01-23 2024-02-27 中国科学技术大学 Image-to-long text generation method combining memory network and diffusion network

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112256904A (en) * 2020-09-21 2021-01-22 天津大学 Image retrieval method based on visual description sentences
CN113515951A (en) * 2021-07-19 2021-10-19 同济大学 Story description generation method based on knowledge enhanced attention network and group-level semantics

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11238631B2 (en) * 2019-08-19 2022-02-01 Sri International Align-to-ground, weakly supervised phrase grounding guided by image-caption alignment

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112256904A (en) * 2020-09-21 2021-01-22 天津大学 Image retrieval method based on visual description sentences
CN113515951A (en) * 2021-07-19 2021-10-19 同济大学 Story description generation method based on knowledge enhanced attention network and group-level semantics

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
A image caption method of construction scene based on attention mechanism and encoding-decoding architecture;NONG Yuan-jun et al.;《浙江大学学报(工学版)》;20220228;第56卷(第2期);第236-244页 *
Aligning Linguistic Words and Visual Semantic Units for Image Captioning;Longteng Guo et al.;《MM"19: Proceedings of the 27th ACM International Conference on Multimedia》;20191015;第765-773页 *
Image Caption with Global-Local Attention;Linghui Li et al.;《Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (AAAI-17)》;20171231;第4133-4139页 *
图像内容自动描述技术综述;邓旭冉 等;《信息安全研究》;20191130;第5卷(第11期);第988-992页 *
基于注意力机制的图像描述生成研究;田敬贤;《中国优秀博硕士学位论文全文数据库(硕士)信息科技辑》;20220415(第04期);第I138-966页 *

Also Published As

Publication number Publication date
CN114677580A (en) 2022-06-28

Similar Documents

Publication Publication Date Title
CN114677580B (en) Image description method based on self-adaptive enhanced self-attention network
CN111581405B (en) Cross-modal generalization zero sample retrieval method for generating confrontation network based on dual learning
CN110263912B (en) Image question-answering method based on multi-target association depth reasoning
CN107844743B (en) Image multi-subtitle automatic generation method based on multi-scale hierarchical residual error network
CN113343705B (en) Text semantic based detail preservation image generation method and system
CN109685724B (en) Symmetric perception face image completion method based on deep learning
CN112949647B (en) Three-dimensional scene description method and device, electronic equipment and storage medium
CN113609326B (en) Image description generation method based on relationship between external knowledge and target
CN114926835A (en) Text generation method and device, and model training method and device
CN115658954B (en) Cross-modal search countermeasure method based on prompt learning
CN114612767B (en) Scene graph-based image understanding and expressing method, system and storage medium
CN115761900B (en) Internet of things cloud platform for practical training base management
CN114970517A (en) Visual question and answer oriented method based on multi-modal interaction context perception
CN115908908A (en) Remote sensing image gathering type target identification method and device based on graph attention network
Zhuang et al. Improving remote sensing image captioning by combining grid features and transformer
CN117036778A (en) Potential safety hazard identification labeling method based on image-text conversion model
Li et al. Caption generation from road images for traffic scene modeling
CN113747168A (en) Training method of multimedia data description model and generation method of description information
CN114661874B (en) Visual question-answering method based on multi-angle semantic understanding and self-adaptive double channels
CN116434058A (en) Image description generation method and system based on visual text alignment
CN116109649A (en) 3D point cloud instance segmentation method based on semantic error correction
CN113420680B (en) Remote sensing image area attention and text generation method based on GRU attention
CN114818739A (en) Visual question-answering method optimized by using position information
Zhou et al. Joint scence network and attention-guided for image captioning
Wang Video description with GAN

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant