CN115311465A - Image description method based on double attention models - Google Patents

Image description method based on double attention models

Info

Publication number
CN115311465A
Authority
CN
China
Prior art keywords
lstm
model
image
attention
visual
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210955941.5A
Other languages
Chinese (zh)
Inventor
李佩 (Li Pei)
汪红娟 (Wang Hongjuan)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Institute of Graphic Communication
Original Assignee
Beijing Institute of Graphic Communication
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Institute of Graphic Communication filed Critical Beijing Institute of Graphic Communication
Priority to CN202210955941.5A priority Critical patent/CN115311465A/en
Publication of CN115311465A publication Critical patent/CN115311465A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/42Global feature extraction by analysis of the whole pattern, e.g. using frequency domain transformations or autocorrelation
    • G06V10/422Global feature extraction by analysis of the whole pattern, e.g. using frequency domain transformations or autocorrelation for representing the structure of the pattern or shape of an object therefor
    • G06V10/424Syntactic representation, e.g. by using alphabets or grammars
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/082Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/25Determination of region of interest [ROI] or a volume of interest [VOI]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Multimedia (AREA)
  • Computing Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Image Analysis (AREA)

Abstract

The invention realizes an image description method based on a dual attention model in the field of image processing. The image description research task is completed with an encoder-decoder framework that combines a convolutional neural network, a long short-term memory network, and an attention mechanism. First, an encoder is built: a new encoder model is constructed from a ResNeXt-101 network and the object detector Faster R-CNN to obtain the target regions of the image under test and perform feature extraction. Then a decoder is constructed: a dual attention mechanism and a two-layer LSTM form a new decoder that generates the description sentence of the image. Finally, the model is optimized with the REINFORCE reinforcement learning algorithm. The method provided by the invention effectively improves the accuracy of the image description model and the richness of the image description content, and improves the overall performance of the image description model.

Description

Image description method based on double attention models
Technical Field
The invention belongs to the technical field of image processing, and particularly relates to an image description method based on a double-attention model.
Background
The image description task is to automatically generate descriptive text for image content using a computer. It requires determining the categories of objects in a picture, analyzing the relationships among them, enabling the machine to understand the deeper meaning of the image, and returning language that a user can understand. Image description lies at the intersection of computer vision and natural language processing and is a cross-disciplinary research topic bridging vision and text.
Most current mainstream research on the image description task makes substantial changes to the traditional "encoder-decoder" architecture, that is, a Convolutional Neural Network (CNN) is used as the encoder to collect image feature data and a Long Short-Term Memory network (LSTM) is used as the decoder to realize image description. Description words are generated one by one in a recurrent loop; when the loop finishes, all words have been obtained and are combined into a description sentence for the input image. However, during this recurrent process information can be lost over time, so the content of the finally generated description sentence may deviate from the content of the image under test.
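As a concrete illustration of this generation loop, the sketch below wires a single-layer LSTM decoder to a pre-computed CNN feature and produces words one step at a time; the module names, dimensions, and greedy decoding strategy are illustrative assumptions, not the method claimed later in this document.

```python
# Minimal CNN-encoder / LSTM-decoder captioning loop (illustrative sketch only).
import torch
import torch.nn as nn

class SimpleCaptioner(nn.Module):
    def __init__(self, feat_dim=2048, embed_dim=512, hidden_dim=512, vocab_size=10000):
        super().__init__()
        self.init_h = nn.Linear(feat_dim, hidden_dim)   # map image feature to initial LSTM state
        self.init_c = nn.Linear(feat_dim, hidden_dim)
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTMCell(embed_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    @torch.no_grad()
    def greedy_decode(self, feat, start_id=1, end_id=2, max_len=20):
        h, c = torch.tanh(self.init_h(feat)), torch.tanh(self.init_c(feat))
        word = torch.full((feat.size(0),), start_id, dtype=torch.long)
        caption = []
        for _ in range(max_len):                    # one word per recurrent step
            h, c = self.lstm(self.embed(word), (h, c))
            word = self.out(h).argmax(dim=-1)       # pick the most probable next word
            caption.append(word)
            if (word == end_id).all():              # stop once every sequence emitted <end>
                break
        return torch.stack(caption, dim=1)

feat = torch.randn(2, 2048)                         # e.g. global CNN features of two images
print(SimpleCaptioner().greedy_decode(feat).shape)  # (batch, generated length)
```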
Popular mobile Internet apps such as Xiaohongshu (Little Red Book), Dianping, Weibo, and the like are all social products built on multimedia information fusion. Taking Xiaohongshu as an example, every post combines text and images, and most pictures carry category tags, but these must be added manually by the publisher. Image annotation in this scenario is limited: editing is time-consuming for publishers, and it is neither convenient nor flexible for searchers. In the field of artificial intelligence, human-computer interaction still suffers from comprehension barriers; typical AI assistants such as Xiao Ai, Siri, and the like currently support no operations involving image input, and a conversational robot is not truly intelligent when it comes to relating images to a conversation. The main reason for this communication barrier is that the performance of image description technology is not yet strong enough. Computers should be able to obtain more information from images than simple classification and localization; ideally they would generate natural image description sentences the way humans do, which motivates accelerating research on the image description task [2]. Text generation and image recognition belong to natural language processing and deep learning, respectively, and their combination constitutes today's image description task. Research on image description is currently in full swing; a familiar analogue is the picture-composition questions that appear in the English and Chinese papers of everyday and large-scale examinations.
Disclosure of Invention
Therefore, the invention first provides an image description method based on a dual attention model. An image is fed into the system through external input, the image description research task is completed with an encoder-decoder framework, and a convolutional neural network, a long short-term memory network, and an attention mechanism are combined. First, an encoder is built: a new encoder model is constructed from a ResNeXt-101 network and the object detector Faster R-CNN to obtain the target regions of the image under test and perform feature extraction. Then a decoder is constructed: a dual attention mechanism and a two-layer LSTM form a new decoder that generates the description sentence of the image. The model is optimized with the REINFORCE reinforcement learning algorithm. The output is image content automatically described by the computer. This technology can help visually impaired people better understand images on the network and even in the real world, and it can also be applied to early childhood education, intelligent news-headline labeling, human-computer interaction, and other fields. Its most widespread current use is the automatic generation of news articles; for example, at the 2022 Winter Olympics, some media used the technology to convert pictures into textual descriptions so as to publish competition results in real time.
For network pre-training, the ResNeXt network structure is fine-tuned: the unneeded final layer of the ResNeXt network is removed, and ResNeXt-101 is then pre-trained on the ImageNet dataset.
The encoder first extracts the basic features of the image with the ResNeXt network, then performs object detection with the Faster R-CNN algorithm and further extracts the target features.
Object detection is divided into two stages. In the first stage, the features output by the last layer of the ResNeXt network are fed into the region proposal network of Faster R-CNN to generate candidate boxes, and the bounding boxes of the regions of interest are corrected for the first time. The generated candidate boxes and the feature map are then fed into the region-of-interest pooling layer to obtain a fixed-size feature map for each candidate box. Finally, the category and precise information of the visual features are determined through softmax and a fully connected layer, the bounding boxes are corrected a second time, and the resulting visual feature set is recorded as $V = \{v_1, v_2, \ldots, v_n\}$.
The feature obtained by average pooling is denoted $\bar{v}$ and is computed as

$$\bar{v} = \frac{1}{n}\sum_{i=1}^{n} v_i$$

This global visual feature is input to the decoding end to construct the visual attention model.
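A minimal sketch of this pooling step (the region count and feature dimension are assumed):

```python
# Average-pool the detected region features into a global visual feature v_bar.
import torch

n, d = 36, 2048                      # e.g. 36 detected regions, 2048-d features
V = torch.randn(n, d)                # V = {v_1, ..., v_n} from the encoder
v_bar = V.mean(dim=0)                # v_bar = (1/n) * sum_i v_i, fed to the decoder
print(v_bar.shape)                   # torch.Size([2048])
```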
The visual LSTM model is denoted $\mathrm{LSTM}_{1v}$, where the subscript 1 indicates the first-layer LSTM. The average vector $\bar{v}$, the previous word embedding $w_{t-1}$, and the previous output state vector of the language LSTM, $h^{2}_{t-1}$, are fused as the input used to compute the visual attention query, where $E$ denotes an identity-matrix transformation and $\theta_v$ is a parameter. Taking the visual attention query vector as the guide vector of the visual attention mechanism yields a visual LSTM based on the visual attention mechanism, whose output is denoted $h^{v}_{t}$.
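The sketch below illustrates how such a first-layer visual LSTM can produce an attention query and attend over the region features $V$; the concatenation-style fusion, the additive attention form, and all layer sizes are assumptions made for illustration rather than the exact formulation of the invention.

```python
# Sketch of a first-layer "visual LSTM" that computes a visual attention query and
# attends over region features V (illustrative assumptions throughout).
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisualAttentionLSTM(nn.Module):
    def __init__(self, feat_dim=2048, embed_dim=512, hidden_dim=512, att_dim=512):
        super().__init__()
        # input = [v_bar ; w_{t-1} ; h^lang_{t-1}]  (fusion by concatenation, assumed)
        self.lstm_1v = nn.LSTMCell(feat_dim + embed_dim + hidden_dim, hidden_dim)
        self.w_v = nn.Linear(feat_dim, att_dim)
        self.w_q = nn.Linear(hidden_dim, att_dim)
        self.w_a = nn.Linear(att_dim, 1)

    def forward(self, V, v_bar, w_prev, h_lang_prev, state_1v):
        x = torch.cat([v_bar, w_prev, h_lang_prev], dim=-1)
        h_1v, c_1v = self.lstm_1v(x, state_1v)          # h_1v acts as the visual attention query
        scores = self.w_a(torch.tanh(self.w_v(V) + self.w_q(h_1v).unsqueeze(1)))
        alpha = F.softmax(scores, dim=1)                # attention weights over the n regions
        v_att = (alpha * V).sum(dim=1)                  # attended visual feature
        return v_att, (h_1v, c_1v)

att = VisualAttentionLSTM()
V = torch.randn(2, 36, 2048)                            # 2 images, 36 regions each
v_att, state = att(V, V.mean(1), torch.randn(2, 512), torch.randn(2, 512),
                   (torch.zeros(2, 512), torch.zeros(2, 512)))
print(v_att.shape)                                      # torch.Size([2, 2048])
```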
Description sentences are predicted with a recurrent neural network from the global visual features extracted by the convolutional neural network, and an attention model is added at the encoder stage. A scene-graph parser is used to extract object semantic features, recorded as the set $S = \{S_{obj}, S_{attr}, S_{rel}\}$, where the target object words are denoted $O = \{obj_1, obj_2, \ldots, obj_n\}$, the object attribute words are denoted $A = \{attr_1, attr_2, \ldots, attr_n\}$, and the object relation words are denoted $R = \{rel_1, rel_2, \ldots, rel_n\}$. The semantic words are encoded with the One-Hot method and mapped into a Visual-Semantic space by a word-embedding matrix. The vocabulary features of the target entities are denoted $S_{obj} = \{s_{obj_1}, s_{obj_2}, \ldots, s_{obj_n}\}$, the attribute vocabulary features $S_{attr} = \{s_{attr_1}, s_{attr_2}, \ldots, s_{attr_n}\}$, and the relation vocabulary features $S_{rel} = \{s_{rel_1}, s_{rel_2}, \ldots, s_{rel_n}\}$. The average feature of the global semantic features, $\bar{s}$, is then computed as

$$\bar{s} = \frac{1}{n}\sum_{i=1}^{n} s_i$$
The semantic LSTM model is denoted $\mathrm{LSTM}_{1s}$, where the subscript 1 indicates the first-layer LSTM. The average vector $\bar{s}$, the previous word embedding $w_{t-1}$, and the state vector of the previous language LSTM, $h^{2}_{t-1}$, serve as the inputs used to compute the semantic attention query, where $E$ denotes an identity-matrix transformation and $\theta_v$ is a parameter. A semantic LSTM based on the semantic attention mechanism is thus obtained, and its output is denoted $h^{s}_{t}$.
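A small sketch of this semantic-word embedding and averaging step; the vocabulary ids, embedding dimension, and word lists below are placeholders rather than values from the invention.

```python
# Embed scene-graph words (objects, attributes, relations) into a visual-semantic
# space and average them into a global semantic feature s_bar.
import torch
import torch.nn as nn

vocab_size, sem_dim = 10000, 512
embed = nn.Embedding(vocab_size, sem_dim)        # word-embedding matrix (maps one-hot ids)

obj_ids  = torch.tensor([12, 87, 431])           # S_obj : detected object words
attr_ids = torch.tensor([55, 209])               # S_attr: attribute words
rel_ids  = torch.tensor([731])                   # S_rel : relation words

S = embed(torch.cat([obj_ids, attr_ids, rel_ids]))   # all semantic features in one set
s_bar = S.mean(dim=0)                                # global semantic feature for the
print(s_bar.shape)                                   # semantic attention LSTM
```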
The decoder adopts a two-layer LSTM structure comprising an attention LSTM and a language LSTM. The semantic attention mechanism and the visual attention mechanism are combined and placed in the attention model of the first-layer LSTM. The image features detected by Faster R-CNN are defined as the set $V$; after average pooling they yield the visual and semantic mean values, and the word-embedding vector of the current time step is taken as part of the input of the second-layer language LSTM. This model is denoted the attRNeXt model. A standard LSTM operates on a single time step as $h_t = \mathrm{LSTM}(h_{t-1}, x_t)$. The outputs $h^{v}_{t}$ and $h^{s}_{t}$ obtained in the encoder are turned into attended image features $\hat{v}_{t}$ and $\hat{s}_{t}$ by the attention LSTM layer "Attend", which are then input to the language LSTM, and the words are generated sequentially as

$$p(y_t) = \mathrm{softmax}\big(W_p h^{2}_{t} + b_p\big)$$

where $\theta_l$, $W_p$, $b_p$ are parameters and $y_t$ is the generated word.
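The sketch below shows one decoding time step of such a two-layer decoder, with a first-layer attention LSTM that fuses visual and semantic attention and a second-layer language LSTM followed by a softmax over the vocabulary; fusion by concatenation, the simplified attention scorers, and all sizes are illustrative assumptions.

```python
# One decoding time step of a two-layer decoder with dual (visual + semantic) attention.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualAttentionDecoderStep(nn.Module):
    def __init__(self, feat_dim=2048, sem_dim=512, embed_dim=512, hidden_dim=512, vocab_size=10000):
        super().__init__()
        self.lstm_att = nn.LSTMCell(feat_dim + sem_dim + embed_dim + hidden_dim, hidden_dim)
        self.lstm_lang = nn.LSTMCell(feat_dim + sem_dim + hidden_dim, hidden_dim)
        self.att_v = nn.Linear(hidden_dim + feat_dim, 1)   # simplified visual attention scorer
        self.att_s = nn.Linear(hidden_dim + sem_dim, 1)    # simplified semantic attention scorer
        self.W_p = nn.Linear(hidden_dim, vocab_size)       # y_t ~ softmax(W_p h^2_t + b_p)

    def forward(self, V, S, v_bar, s_bar, w_t, h1, c1, h2, c2):
        h1, c1 = self.lstm_att(torch.cat([v_bar, s_bar, w_t, h2], -1), (h1, c1))
        a_v = F.softmax(self.att_v(torch.cat([h1.unsqueeze(1).expand(-1, V.size(1), -1), V], -1)), 1)
        a_s = F.softmax(self.att_s(torch.cat([h1.unsqueeze(1).expand(-1, S.size(1), -1), S], -1)), 1)
        v_att, s_att = (a_v * V).sum(1), (a_s * S).sum(1)  # "Attend": dual attention outputs
        h2, c2 = self.lstm_lang(torch.cat([v_att, s_att, h1], -1), (h2, c2))
        return F.log_softmax(self.W_p(h2), dim=-1), (h1, c1, h2, c2)
```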
the image description is then optimizedThe objective function is: the objective function is: h P (θ)=-∑log p(y t |I;y 1:t-1 ) Wherein y is t Is a word generated at time t, I represents an input image, y 1:t-1 Representing the generated subtitles.
Then a reinforced learning algorithm is used to obtain the target,
Figure BDA0003791354620000046
the problem that the model is unmatched in training and testing is solved, wherein r is the standard score of the used automatic evaluation index BLEU, CIDER or METEOR index, and whether the predicted statement and the evaluation standard are similar or not is judged.
The technical effects to be realized by the invention are as follows:
by using the image description method based on the double-attention model, the accuracy of the image description model and the richness of the image description content are effectively improved, and the overall performance of the image description model is improved.
Drawings
FIG. 1 is the image description model based on the ResNeXt101 model;
FIG. 2 is a schematic diagram of the object detection technique of the Faster R-CNN network structure;
FIG. 3 is a diagram of the construction of a single-layer LSTM and a double-layer LSTM at each time step;
FIG. 4 is the model structure based on a double-layer LSTM and a dual attention mechanism;
fig. 5 shows an example of experimental results.
Detailed Description
The following is a preferred embodiment of the present invention and is further described with reference to the accompanying drawings, but the present invention is not limited to this embodiment.
The invention provides an image description method based on a double-attention model.
Image description lies at the intersection of computer vision and natural language processing; it is a cross-disciplinary research topic bridging vision and text. However, most image captioning models are not controllable: descriptions are generated passively at a coarse-grained level, so research at the finer-grained level can make the generated description sentences more accurate and richer in content. The technology of automatically describing image content by computer can help visually impaired people better understand images on the network and even in the real world, and it can also be applied to early childhood education, intelligent news-headline labeling, human-computer interaction, and other fields. Its most widespread current use is the automatic generation of news articles; for example, at the 2022 Winter Olympics, some media used the technology to convert pictures into textual descriptions so as to publish competition results in real time.
The invention improves the operating mechanism of deep-learning-based computer-vision image description to improve its performance. An encoder-decoder framework is adopted to complete the image description research task, several methods are used to address problems in the task, model performance is improved, and the accuracy of image description and the richness of the sentences are increased. The research of the invention focuses on:
(1) When constructing the new image description model, in the encoding part a new encoder model is built from a ResNeXt-101 network and the object detector Faster R-CNN to obtain the target regions of the image under test and extract their features.
(2) In the decoder part, a dual attention mechanism and a two-layer LSTM are combined to construct a new decoder that generates the description sentence of the image.
(3) For model optimization, the REINFORCE algorithm can optimize non-differentiable, discrete metrics during training, so this reinforcement learning algorithm is adopted to optimize the model.
The invention adopts a framework combining a convolutional neural network, a long short-term memory network, and an attention mechanism.
When building the model with the ResNeXt network, the ResNet-101 network model of the encoding part is replaced by a ResNeXt-101 network structure, referred to in the invention as the ResNeXt101 model, while the LSTM network is still used in the decoding part. The model diagram is shown in fig. 1.
For network pre-training, the ResNeXt network structure needs to be fine-tuned: the unneeded final layer of the ResNeXt network is removed, and ResNeXt-101 is then pre-trained on the ImageNet dataset.
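A sketch of this backbone preparation using torchvision's resnext101_32x8d as a stand-in; the exact layer surgery and the torchvision weight-loading API are assumptions (newer torchvision versions use a weights= argument instead of pretrained=True).

```python
# Load an ImageNet-pretrained ResNeXt-101 and drop the final classification layers so
# the network outputs feature maps instead of class scores.
import torch
import torch.nn as nn
import torchvision

resnext = torchvision.models.resnext101_32x8d(pretrained=True)
backbone = nn.Sequential(*list(resnext.children())[:-2])   # strip avgpool + fc head

with torch.no_grad():
    fmap = backbone(torch.randn(1, 3, 224, 224))           # (1, 2048, 7, 7) feature map
print(fmap.shape)
```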
At the encoding end, the ResNeXt network extracts the basic features of the image, and the Faster R-CNN algorithm then performs object detection and extracts the target features. Object detection is divided into two stages. In the first stage, the features output by the last layer of the ResNeXt network are fed into the region proposal network (RPN) of Faster R-CNN to generate candidate boxes, and the bounding boxes of the regions of interest are corrected for the first time. In the second stage, the generated candidate boxes and the feature map are fed into the region-of-interest pooling (ROI pooling) layer to obtain a fixed-size feature map for each candidate box. Finally, the category and precise information of the visual features are determined through softmax and a fully connected layer, the bounding boxes are corrected a second time, and the resulting visual feature set is recorded as $V = \{v_1, v_2, \ldots, v_n\}$. The technical route of object detection is shown in fig. 2.
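The region feature extraction can be sketched as follows; the stock torchvision Faster R-CNN (ResNet-50-FPN weights) and RoI Align pooling are stand-ins for the ResNeXt-101-based detector described above, so treat this as an illustrative pipeline rather than the patented encoder.

```python
# Propose object boxes with an off-the-shelf Faster R-CNN, then pool fixed-size region
# features from a ResNeXt feature map with RoI Align to form V = {v_1, ..., v_n}.
import torch
import torch.nn as nn
import torchvision
from torchvision.ops import roi_align

detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True).eval()
backbone = nn.Sequential(
    *list(torchvision.models.resnext101_32x8d(pretrained=True).children())[:-2]).eval()

image = torch.rand(3, 480, 640)                      # dummy RGB image in [0, 1]
with torch.no_grad():
    det = detector([image])[0]                       # dict with "boxes", "labels", "scores"
    boxes = det["boxes"][det["scores"] > 0.5]        # keep confident detections, (n, 4)

    fmap = backbone(image.unsqueeze(0))              # (1, 2048, H/32, W/32) feature map
    scale = fmap.shape[-1] / image.shape[-1]         # image coords -> feature-map coords
    regions = roi_align(fmap, [boxes], output_size=(7, 7), spatial_scale=scale)
    V = regions.mean(dim=(2, 3))                     # (n, 2048) visual feature set
print(V.shape)
```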
The feature obtained by average pooling is denoted $\bar{v}$ and is computed as

$$\bar{v} = \frac{1}{n}\sum_{i=1}^{n} v_i \qquad (1)$$

This global visual feature is input to the decoding end to construct the visual attention model.
The visual LSTM based on the visual attention mechanism is obtained from formula (2), and its output is denoted $h^{v}_{t}$. Let the visual LSTM model be $\mathrm{LSTM}_{1v}$, where the subscript 1 indicates the first-layer LSTM. The average vector $\bar{v}$, the previous word embedding $w_{t-1}$, and the state vector of the previous language LSTM, $h^{2}_{t-1}$, are fused to compute the visual attention query, and the visual attention query vector is taken as the guide vector of the visual attention mechanism:

$$h^{v}_{t} = \mathrm{LSTM}_{1v}\big(\big[\bar{v};\, E w_{t-1};\, h^{2}_{t-1}\big]\big) \qquad (2)$$
On the other hand, inspired by the visual attention mechanism, semantic attention feature extraction predicts the description sentence with a recurrent neural network from the global visual features extracted by the convolutional neural network and adds an attention model at the encoder stage. The invention uses a scene-graph parser to extract object semantic features, recorded as the set $S = \{S_{obj}, S_{attr}, S_{rel}\}$, where the target object words are denoted $O = \{obj_1, obj_2, \ldots, obj_n\}$, the object attribute words are denoted $A = \{attr_1, attr_2, \ldots, attr_n\}$, and the object relation words are denoted $R = \{rel_1, rel_2, \ldots, rel_n\}$. The semantic words are encoded with the One-Hot method and mapped into a Visual-Semantic space by a word-embedding matrix. The vocabulary features of the target entities are denoted $S_{obj} = \{s_{obj_1}, s_{obj_2}, \ldots, s_{obj_n}\}$, the attribute vocabulary features $S_{attr} = \{s_{attr_1}, s_{attr_2}, \ldots, s_{attr_n}\}$, and the relation vocabulary features $S_{rel} = \{s_{rel_1}, s_{rel_2}, \ldots, s_{rel_n}\}$. The average feature of the global semantic features, $\bar{s}$, is computed according to formula (3):

$$\bar{s} = \frac{1}{n}\sum_{i=1}^{n} s_i \qquad (3)$$

The semantic LSTM based on the semantic attention mechanism is obtained from formula (4), and its output is denoted $h^{s}_{t}$. The semantic LSTM model is denoted $\mathrm{LSTM}_{1s}$, where the subscript 1 indicates the first-layer LSTM. The average vector $\bar{s}$, the previous word embedding $w_{t-1}$, and the state vector of the previous language LSTM, $h^{2}_{t-1}$, are fused to compute the semantic attention query:

$$h^{s}_{t} = \mathrm{LSTM}_{1s}\big(\big[\bar{s};\, E w_{t-1};\, h^{2}_{t-1}\big]\big) \qquad (4)$$

Semantic attention takes the semantic features of the scene graph's target entities, relations, and attributes as the attention query elements, and the query vector serves as the guide vector of the semantic attention mechanism.
When the ResNeXt101 model uses the language LSTM, the image visual signals delivered by the encoder, the current image words, and the text of the manual image annotations serve as the inputs of the language-layer LSTM. The construction of the single-layer LSTM and the double-layer LSTM at each time step is shown in FIG. 3, with the single-layer LSTM single-time-step structure on the left and the double-layer LSTM single-time-step structure on the right.
The invention uses a two-layer LSTM structure in the decoder section, comprising an attention LSTM and a language LSTM. The semantic attention mechanism and the visual attention mechanism are combined and placed in the attention model of the first-layer LSTM; the image features detected by Faster R-CNN are defined as the set V, their average pooling yields the visual and semantic mean values, and the word-embedding vector of the current time step is taken as part of the input of the second-layer language LSTM. This model is denoted the attRNeXt model, and the structure of the decoder part is shown in fig. 4. The LSTM used in each layer is a standard long short-term memory network, and a standard LSTM operates on a single time step as shown in equation (5) below.
$$h_t = \mathrm{LSTM}(h_{t-1}, x_t) \qquad (5)$$
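Equation (5) corresponds to one call of a standard LSTM cell, for example (sizes are illustrative):

```python
# The standard LSTM single-time-step update h_t = LSTM(h_{t-1}, x_t) of equation (5).
import torch
import torch.nn as nn

cell = nn.LSTMCell(input_size=512, hidden_size=512)
x_t = torch.randn(4, 512)                       # input at time t (batch of 4)
h_prev, c_prev = torch.zeros(4, 512), torch.zeros(4, 512)
h_t, c_t = cell(x_t, (h_prev, c_prev))          # one recurrent step
```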
The outputs $h^{v}_{t}$ and $h^{s}_{t}$ obtained from equations (2) and (4) are turned into attended image features by the attention LSTM layer "Attend". The language LSTM then takes the attended features $\hat{v}_{t}$ and $\hat{s}_{t}$ as input and generates the words sequentially, as shown in formulas (6) and (7), where $\theta_l$, $W_p$, $b_p$ are parameters and $y_t$ is the generated word:

$$h^{2}_{t} = \mathrm{LSTM}_{2}\big(\big[\hat{v}_{t};\, \hat{s}_{t}\big],\, h^{2}_{t-1};\ \theta_l\big) \qquad (6)$$

$$p(y_t \mid y_{1:t-1}, I) = \mathrm{softmax}\big(W_p\, h^{2}_{t} + b_p\big) \qquad (7)$$
Next, the image description is optimized; the objective function is computed as shown in equation (8) below:

$$H_P(\theta) = -\sum_{t}\log p\!\left(y_t \mid I;\, y_{1:t-1}\right) \qquad (8)$$
In addition, because the objective function is fed the ground-truth answer at every time step during training, while the evaluation metrics used by the invention are discrete and non-differentiable, a mismatch arises between the model at training and testing time. To address this, a reinforcement learning algorithm is used for improvement, as shown in equation (9) below:

$$\nabla_\theta L(\theta) = -\,\mathbb{E}_{y\sim p_\theta}\!\left[r(y)\,\nabla_\theta \log p_\theta(y)\right] \qquad (9)$$

where $r$ is the score of the automatic evaluation metric used (BLEU, CIDEr, METEOR, etc.), which judges whether the predicted sentences are similar to the references.
Table 1 shows the comparison of the performance indicators of the model of the invention and the reference models on the MSCOCO dataset.
It can be seen that the models share some similarities and differences. They are alike in that all adopt the encoder-decoder deep learning architecture. They differ slightly in how the encoder and decoder are realized: the encoders differ mainly in the network models used for image feature extraction, and the decoders differ mainly in the attention mechanisms. By adopting a ResNeXt network as the encoder and a two-layer LSTM with a dual attention mechanism in the decoder, the invention's image description model achieves better performance. As the table shows, the model's coherence and accuracy index B@4 is 0.08 higher than the lowest baseline (Deep VS) and 0.06 higher than the highest (Hard-Att); on the semantic-richness metric CIDEr the model scores 1.103, which is 0.443 higher than the lowest (Deep VS) and 0.247 higher than the highest (Hard-Att). The results show that the overall performance of the model improves in both accuracy and semantic richness.
In the verification experiment, to evaluate the performance and effectiveness of the model, the model built here is evaluated on the MSCOCO2014 dataset, in which every image has manually annotated image descriptions (ground truth). The MSCOCO2014 dataset is then divided with the Karpathy split: 113,287 images for the training set, 5,000 for the validation set, and 5,000 for the test set. The model counts all manually annotated description words in the dataset, and the output description sentences are composed of words that occur more than five times plus the start/end markers <start>/<end>.
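A sketch of this vocabulary construction; the caption format and the small in-line caption list are placeholders standing in for the MSCOCO annotations.

```python
# Count all ground-truth caption words and keep those appearing more than five times,
# plus the special markers, to build the output vocabulary.
from collections import Counter

def build_vocab(captions, min_count=5):
    counter = Counter(w for cap in captions for w in cap.lower().split())
    words = [w for w, c in counter.items() if c > min_count]
    vocab = ["<pad>", "<start>", "<end>", "<unk>"] + sorted(words)
    return {w: i for i, w in enumerate(vocab)}

captions = ["a woman in a gray t-shirt is sleeping on a sofa",
            "two dogs are running on the road"]          # stand-in for MSCOCO annotations
word2id = build_vocab(captions, min_count=0)             # use min_count=5 on the real data
print(len(word2id))
```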
The experimental results are shown in FIG. 5.
In fig. 5 (a), the reference description is: an adult wearing a gray T-shirt with red sleeves is sleeping on a sofa. Our model outputs: a (woman) wearing (black pants) and a gray T-shirt is sleeping on the sofa in her (room).
In fig. 5 (b), our model's description is more accurate and semantically richer than the one given by the dataset: it adds the color of the other (black-and-white) dog and makes explicit that the dogs are on the (road). The results show that the attRNeXt image description model proposed by the invention, while staying close to the original description, improves the richness and accuracy of the content, and the overall description of the image is more accurate.

Claims (4)

1. An image description method based on a dual attention model, characterized in that: image information is input through an input unit, the image description research task is completed with an encoder-decoder framework, and a framework combining a convolutional neural network, a long short-term memory network, and an attention mechanism finally outputs the textual expression of the image; first, an encoder is constructed: a new encoder model is built from a ResNeXt-101 network and the object detector Faster R-CNN to obtain the target regions of the image under test and perform feature extraction; then a decoder is constructed: a dual attention mechanism and a two-layer LSTM form a new decoder that generates the description sentence of the image; the model is optimized with the REINFORCE reinforcement learning algorithm;
the method comprises the steps of finely adjusting a ResNeXt network structure when network pre-training is carried out, removing a useless network layer of the last layer of the ResNeXt network structure, pre-training ResNeXt-101 on an ImageNet data set, obtaining target characteristics, namely visual characteristics and semantic characteristics by utilizing the pre-trained ResNeXt network and combining a Faster R-CNN algorithm, and then constructing a double-attention-machine image description model by utilizing an attention-machine method.
2. The image description method based on a dual attention model according to claim 1, characterized in that: the encoder first extracts the basic features of the image with the ResNeXt network, then performs object detection with the Faster R-CNN algorithm and further extracts the target features.
3. The image description method based on a dual attention model according to claim 2, characterized in that: object detection is divided into two stages; in the first stage, the features output by the last layer of the ResNeXt network are fed into the region proposal network of Faster R-CNN to generate candidate boxes, and the bounding boxes of the regions of interest are corrected for the first time; the generated candidate boxes and the feature map are fed into the region-of-interest pooling layer to obtain a fixed-size feature map for each candidate box; finally, the category and precise information of the visual features are determined through softmax and a fully connected layer, the bounding boxes are corrected a second time, and the resulting visual feature set is recorded as $V = \{v_1, v_2, \ldots, v_n\}$;
the feature obtained by average pooling is denoted $\bar{v}$ and computed as $\bar{v} = \frac{1}{n}\sum_{i=1}^{n} v_i$; this global visual feature is input to the decoding end to construct the visual attention model;
the visual LSTM model is denoted $\mathrm{LSTM}_{1v}$, where the subscript 1 indicates the first-layer LSTM; the average vector $\bar{v}$, the previous word embedding $w_{t-1}$, and the previous output state vector of the language LSTM, $h^{2}_{t-1}$, are fused as the input used to compute the visual attention query, where $E$ denotes an identity-matrix transformation and $\theta_v$ is a parameter; the visual attention query vector is taken as the guide vector of the visual attention mechanism, a visual LSTM based on the visual attention mechanism is obtained, and its output is denoted $h^{v}_{t}$;
the global visual features extracted with the convolutional neural network are used to predict the description sentence with a recurrent neural network, an attention model is added at the encoder stage, and a scene-graph parser is used to extract the object semantic features, recorded as the set $S = \{S_{obj}, S_{attr}, S_{rel}\}$, where the target object words are denoted $O = \{obj_1, obj_2, \ldots, obj_n\}$, the object attribute words are denoted $A = \{attr_1, attr_2, \ldots, attr_n\}$, and the object relation words are denoted $R = \{rel_1, rel_2, \ldots, rel_n\}$; the semantic words are encoded with the One-Hot method and mapped into a Visual-Semantic space by a word-embedding matrix; the vocabulary features of the target entities are denoted $S_{obj} = \{s_{obj_1}, s_{obj_2}, \ldots, s_{obj_n}\}$, the attribute vocabulary features $S_{attr} = \{s_{attr_1}, s_{attr_2}, \ldots, s_{attr_n}\}$, and the relation vocabulary features $S_{rel} = \{s_{rel_1}, s_{rel_2}, \ldots, s_{rel_n}\}$; the average feature of the global semantic features, $\bar{s}$, is then obtained as $\bar{s} = \frac{1}{n}\sum_{i=1}^{n} s_i$;
the semantic LSTM model is denoted $\mathrm{LSTM}_{1s}$, where the subscript 1 indicates the first-layer LSTM; the average vector $\bar{s}$, the previous word embedding $w_{t-1}$, and the state vector of the previous language LSTM, $h^{2}_{t-1}$, are fused to compute the semantic attention query, where $E$ denotes an identity-matrix transformation and $\theta_v$ is a parameter; a semantic LSTM based on the semantic attention mechanism is obtained, and its output is denoted $h^{s}_{t}$.
4. The image description method based on a dual attention model according to claim 3, characterized in that: the decoder adopts a two-layer LSTM structure comprising an attention LSTM and a language LSTM; the semantic attention mechanism and the visual attention mechanism are combined and placed in the attention model of the first-layer LSTM; the image features detected by Faster R-CNN are defined as the set $V$, their average pooling yields the visual and semantic mean values, and the word-embedding vector of the current time step is taken as part of the input of the second-layer language LSTM; the model is denoted the attRNeXt model, and a standard LSTM operates on a single time step as $h_t = \mathrm{LSTM}(h_{t-1}, x_t)$; the outputs $h^{v}_{t}$ and $h^{s}_{t}$ obtained in the encoder are turned into attended image features $\hat{v}_{t}$ and $\hat{s}_{t}$ by the attention LSTM layer "Attend", which are then input to the language LSTM, and the words are generated sequentially as $p(y_t) = \mathrm{softmax}(W_p h^{2}_{t} + b_p)$, where $\theta_l$, $W_p$, $b_p$ are parameters and $y_t$ is the generated word;
the image description is then optimized with the objective function $H_P(\theta) = -\sum_{t}\log p(y_t \mid I;\, y_{1:t-1})$, where $y_t$ is the word generated at time $t$, $I$ denotes the input image, and $y_{1:t-1}$ denotes the caption generated so far;
a reinforcement learning algorithm, $\nabla_\theta L(\theta) = -\,\mathbb{E}_{y\sim p_\theta}[r(y)\,\nabla_\theta \log p_\theta(y)]$, is then applied to resolve the mismatch between the model at training and testing time, where $r$ is the score of the automatic evaluation metric used (BLEU, CIDEr, or METEOR), which judges whether the predicted sentence is similar to the reference.
CN202210955941.5A 2022-08-10 2022-08-10 Image description method based on double attention models Pending CN115311465A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210955941.5A CN115311465A (en) 2022-08-10 2022-08-10 Image description method based on double attention models

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210955941.5A CN115311465A (en) 2022-08-10 2022-08-10 Image description method based on double attention models

Publications (1)

Publication Number Publication Date
CN115311465A true CN115311465A (en) 2022-11-08

Family

ID=83861565

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210955941.5A Pending CN115311465A (en) 2022-08-10 2022-08-10 Image description method based on double attention models

Country Status (1)

Country Link
CN (1) CN115311465A (en)


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116543289A (en) * 2023-05-10 2023-08-04 南通大学 Image description method based on encoder-decoder and Bi-LSTM attention model
CN116543289B (en) * 2023-05-10 2023-11-21 南通大学 Image description method based on encoder-decoder and Bi-LSTM attention model
CN116543146A (en) * 2023-07-06 2023-08-04 贵州大学 Image dense description method based on window self-attention and multi-scale mechanism
CN116543146B (en) * 2023-07-06 2023-09-26 贵州大学 Image dense description method based on window self-attention and multi-scale mechanism

Similar Documents

Publication Publication Date Title
Zhu et al. Knowledge-based question answering by tree-to-sequence learning
Gao et al. Hierarchical representation network with auxiliary tasks for video captioning and video question answering
CN115311465A (en) Image description method based on double attention models
CN110866542A (en) Depth representation learning method based on feature controllable fusion
CN113204675B (en) Cross-modal video time retrieval method based on cross-modal object inference network
CN114818717A (en) Chinese named entity recognition method and system fusing vocabulary and syntax information
CN113657115A (en) Multi-modal Mongolian emotion analysis method based on ironic recognition and fine-grained feature fusion
CN116402066A (en) Attribute-level text emotion joint extraction method and system for multi-network feature fusion
CN114492441A (en) BilSTM-BiDAF named entity identification method based on machine reading understanding
CN113392265A (en) Multimedia processing method, device and equipment
Khurram et al. Dense-captionnet: a sentence generation architecture for fine-grained description of image semantics
CN117235114A (en) Retrieval method based on cross-modal semantic and mixed inverse fact training
Yan et al. Multimodal feature fusion based on object relation for video captioning
Wang et al. Dynamic-balanced double-attention fusion for image captioning
CN112966518B (en) High-quality answer identification method for large-scale online learning platform
CN112084788B (en) Automatic labeling method and system for implicit emotion tendencies of image captions
Nam et al. A survey on multimodal bidirectional machine learning translation of image and natural language processing
González et al. Piglet: Pixel-level grounding of language expressions with transformers
CN112507717A (en) Medical field entity classification method fusing entity keyword features
CN116662924A (en) Aspect-level multi-mode emotion analysis method based on dual-channel and attention mechanism
CN116258147A (en) Multimode comment emotion analysis method and system based on heterogram convolution
CN111813927A (en) Sentence similarity calculation method based on topic model and LSTM
Wei et al. Stack-vs: Stacked visual-semantic attention for image caption generation
Zhou et al. Joint scence network and attention-guided for image captioning
Huang et al. Modeling Multiple Aesthetic Views for Series Photo Selection

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination