CN115311465A - Image description method based on double attention models - Google Patents

Image description method based on double attention models

Info

Publication number
CN115311465A
Authority
CN
China
Prior art keywords
lstm
model
image
attention
visual
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210955941.5A
Other languages
Chinese (zh)
Inventor
李佩 (Li Pei)
汪红娟 (Wang Hongjuan)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Institute of Graphic Communication
Original Assignee
Beijing Institute of Graphic Communication
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Institute of Graphic Communication filed Critical Beijing Institute of Graphic Communication
Priority to CN202210955941.5A priority Critical patent/CN115311465A/en
Publication of CN115311465A publication Critical patent/CN115311465A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/42Global feature extraction by analysis of the whole pattern, e.g. using frequency domain transformations or autocorrelation
    • G06V10/422Global feature extraction by analysis of the whole pattern, e.g. using frequency domain transformations or autocorrelation for representing the structure of the pattern or shape of an object therefor
    • G06V10/424Syntactic representation, e.g. by using alphabets or grammars
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/082Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/25Determination of region of interest [ROI] or a volume of interest [VOI]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Multimedia (AREA)
  • Computing Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Image Analysis (AREA)

Abstract

The invention realizes an image description method based on a dual attention model in the field of image processing. The image description research task is completed with an encoder-decoder framework that combines a convolutional neural network, a long short-term memory network, and an attention mechanism. First, an encoder is built: a new encoder model is constructed from a ResNeXt-101 network and the object detector Faster R-CNN to obtain the target regions of the image under test and perform feature extraction. Then a decoder is constructed: a dual attention mechanism and a two-layer LSTM form a new decoder that generates the description sentence of the image. Finally, the model is optimized with the REINFORCE reinforcement learning algorithm. The method provided by the invention effectively improves the accuracy of the image description model and the richness of the image description content, and improves the overall performance of the image description model.

Description

Image description method based on double attention models
Technical Field
The invention belongs to the technical field of image processing, and particularly relates to an image description method based on a double-attention model.
Background
The image description task is to automatically generate descriptive text for image content using a computer. It requires determining the categories of objects in a picture, analyzing the relationships among them, enabling the machine to understand the deeper meaning of the image, and returning language that a user can understand. Image description lies at the intersection of computer vision and natural language processing and is a cross-disciplinary research topic bridging vision and text.
Most current mainstream research on the image description task makes substantial changes to the traditional "encoder-decoder" architecture, that is, a Convolutional Neural Network (CNN) is used as the encoder to collect image feature data and a Long Short-Term Memory network (LSTM) is used as the decoder to realize image description. Description words are generated one by one in a recurrent loop; when the loop finishes, all words have been obtained and are combined into a description sentence for the input image. However, during this recurrent process information can be lost over time, so the content of the finally generated description sentence may deviate from the content of the image under test.
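As a concrete illustration of this generation loop, the sketch below wires a single-layer LSTM decoder to a pre-computed CNN feature and produces words one step at a time; the module names, dimensions, and greedy decoding strategy are illustrative assumptions, not the method claimed later in this document.

```python
# Minimal CNN-encoder / LSTM-decoder captioning loop (illustrative sketch only).
import torch
import torch.nn as nn

class SimpleCaptioner(nn.Module):
    def __init__(self, feat_dim=2048, embed_dim=512, hidden_dim=512, vocab_size=10000):
        super().__init__()
        self.init_h = nn.Linear(feat_dim, hidden_dim)   # map image feature to initial LSTM state
        self.init_c = nn.Linear(feat_dim, hidden_dim)
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTMCell(embed_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    @torch.no_grad()
    def greedy_decode(self, feat, start_id=1, end_id=2, max_len=20):
        h, c = torch.tanh(self.init_h(feat)), torch.tanh(self.init_c(feat))
        word = torch.full((feat.size(0),), start_id, dtype=torch.long)
        caption = []
        for _ in range(max_len):                    # one word per recurrent step
            h, c = self.lstm(self.embed(word), (h, c))
            word = self.out(h).argmax(dim=-1)       # pick the most probable next word
            caption.append(word)
            if (word == end_id).all():              # stop once every sequence emitted <end>
                break
        return torch.stack(caption, dim=1)

feat = torch.randn(2, 2048)                         # e.g. global CNN features of two images
print(SimpleCaptioner().greedy_decode(feat).shape)  # (batch, generated length)
```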
Popular mobile Internet apps such as Xiaohongshu (Little Red Book), Dianping, Weibo, and the like are all social products built on multimedia information fusion. Taking Xiaohongshu as an example, every post combines text and images, and most pictures carry category tags, but these must be added manually by the publisher. Image annotation in this scenario is limited: editing is time-consuming for publishers, and it is neither convenient nor flexible for searchers. In the field of artificial intelligence, human-computer interaction still suffers from comprehension barriers; typical AI assistants such as Xiao Ai, Siri, and the like currently support no operations involving image input, and a conversational robot is not truly intelligent when it comes to relating images to a conversation. The main reason for this communication barrier is that the performance of image description technology is not yet strong enough. Computers should be able to obtain more information from images than simple classification and localization; ideally they would generate natural image description sentences the way humans do, which motivates accelerating research on the image description task [2]. Text generation and image recognition belong to natural language processing and deep learning, respectively, and their combination constitutes today's image description task. Research on image description is currently in full swing; a familiar analogue is the picture-composition questions that appear in the English and Chinese papers of everyday and large-scale examinations.
Disclosure of Invention
Therefore, the invention first provides an image description method based on a dual attention model. An image is fed into the system through external input, the image description research task is completed with an encoder-decoder framework, and a convolutional neural network, a long short-term memory network, and an attention mechanism are combined. First, an encoder is built: a new encoder model is constructed from a ResNeXt-101 network and the object detector Faster R-CNN to obtain the target regions of the image under test and perform feature extraction. Then a decoder is constructed: a dual attention mechanism and a two-layer LSTM form a new decoder that generates the description sentence of the image. The model is optimized with the REINFORCE reinforcement learning algorithm. The output is image content automatically described by the computer. This technology can help visually impaired people better understand images on the network and even in the real world, and it can also be applied to early childhood education, intelligent news-headline labeling, human-computer interaction, and other fields. Its most widespread current use is the automatic generation of news articles; for example, at the 2022 Winter Olympics, some media used the technology to convert pictures into textual descriptions so as to publish competition results in real time.
For network pre-training, the ResNeXt network structure is fine-tuned: the unneeded final layer of the ResNeXt network is removed, and ResNeXt-101 is then pre-trained on the ImageNet dataset.
The encoder first extracts the basic features of the image with the ResNeXt network, then performs object detection with the Faster R-CNN algorithm and further extracts the target features.
Object detection is divided into two stages. In the first stage, the features output by the last layer of the ResNeXt network are fed into the region proposal network of Faster R-CNN to generate candidate boxes, and the bounding boxes of the regions of interest are corrected for the first time. The generated candidate boxes and the feature map are then fed into the region-of-interest pooling layer to obtain a fixed-size feature map for each candidate box. Finally, the category and precise information of the visual features are determined through softmax and a fully connected layer, the bounding boxes are corrected a second time, and the resulting visual feature set is recorded as $V = \{v_1, v_2, \ldots, v_n\}$.
The feature obtained by average pooling is denoted $\bar{v}$ and is computed as

$$\bar{v} = \frac{1}{n}\sum_{i=1}^{n} v_i$$

This global visual feature is input to the decoding end to construct the visual attention model.
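A minimal sketch of this pooling step (the region count and feature dimension are assumed):

```python
# Average-pool the detected region features into a global visual feature v_bar.
import torch

n, d = 36, 2048                      # e.g. 36 detected regions, 2048-d features
V = torch.randn(n, d)                # V = {v_1, ..., v_n} from the encoder
v_bar = V.mean(dim=0)                # v_bar = (1/n) * sum_i v_i, fed to the decoder
print(v_bar.shape)                   # torch.Size([2048])
```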
The visual LSTM model is denoted $\mathrm{LSTM}_{1v}$, where the subscript 1 indicates the first-layer LSTM. The average vector $\bar{v}$, the previous word embedding $w_{t-1}$, and the previous output state vector of the language LSTM, $h^{2}_{t-1}$, are fused as the input used to compute the visual attention query, where $E$ denotes an identity-matrix transformation and $\theta_v$ is a parameter. Taking the visual attention query vector as the guide vector of the visual attention mechanism yields a visual LSTM based on the visual attention mechanism, whose output is denoted $h^{v}_{t}$.
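The sketch below illustrates how such a first-layer visual LSTM can produce an attention query and attend over the region features $V$; the concatenation-style fusion, the additive attention form, and all layer sizes are assumptions made for illustration rather than the exact formulation of the invention.

```python
# Sketch of a first-layer "visual LSTM" that computes a visual attention query and
# attends over region features V (illustrative assumptions throughout).
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisualAttentionLSTM(nn.Module):
    def __init__(self, feat_dim=2048, embed_dim=512, hidden_dim=512, att_dim=512):
        super().__init__()
        # input = [v_bar ; w_{t-1} ; h^lang_{t-1}]  (fusion by concatenation, assumed)
        self.lstm_1v = nn.LSTMCell(feat_dim + embed_dim + hidden_dim, hidden_dim)
        self.w_v = nn.Linear(feat_dim, att_dim)
        self.w_q = nn.Linear(hidden_dim, att_dim)
        self.w_a = nn.Linear(att_dim, 1)

    def forward(self, V, v_bar, w_prev, h_lang_prev, state_1v):
        x = torch.cat([v_bar, w_prev, h_lang_prev], dim=-1)
        h_1v, c_1v = self.lstm_1v(x, state_1v)          # h_1v acts as the visual attention query
        scores = self.w_a(torch.tanh(self.w_v(V) + self.w_q(h_1v).unsqueeze(1)))
        alpha = F.softmax(scores, dim=1)                # attention weights over the n regions
        v_att = (alpha * V).sum(dim=1)                  # attended visual feature
        return v_att, (h_1v, c_1v)

att = VisualAttentionLSTM()
V = torch.randn(2, 36, 2048)                            # 2 images, 36 regions each
v_att, state = att(V, V.mean(1), torch.randn(2, 512), torch.randn(2, 512),
                   (torch.zeros(2, 512), torch.zeros(2, 512)))
print(v_att.shape)                                      # torch.Size([2, 2048])
```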
Description sentences are predicted with a recurrent neural network from the global visual features extracted by the convolutional neural network, and an attention model is added at the encoder stage. A scene-graph parser is used to extract object semantic features, recorded as the set $S = \{S_{obj}, S_{attr}, S_{rel}\}$, where the target object words are denoted $O = \{obj_1, obj_2, \ldots, obj_n\}$, the object attribute words are denoted $A = \{attr_1, attr_2, \ldots, attr_n\}$, and the object relation words are denoted $R = \{rel_1, rel_2, \ldots, rel_n\}$. The semantic words are encoded with the One-Hot method and mapped into a Visual-Semantic space by a word-embedding matrix. The vocabulary features of the target entities are denoted $S_{obj} = \{s_{obj_1}, s_{obj_2}, \ldots, s_{obj_n}\}$, the attribute vocabulary features $S_{attr} = \{s_{attr_1}, s_{attr_2}, \ldots, s_{attr_n}\}$, and the relation vocabulary features $S_{rel} = \{s_{rel_1}, s_{rel_2}, \ldots, s_{rel_n}\}$. The average feature of the global semantic features, $\bar{s}$, is then computed as

$$\bar{s} = \frac{1}{n}\sum_{i=1}^{n} s_i$$
The semantic LSTM model is denoted $\mathrm{LSTM}_{1s}$, where the subscript 1 indicates the first-layer LSTM. The average vector $\bar{s}$, the previous word embedding $w_{t-1}$, and the state vector of the previous language LSTM, $h^{2}_{t-1}$, serve as the inputs used to compute the semantic attention query, where $E$ denotes an identity-matrix transformation and $\theta_v$ is a parameter. A semantic LSTM based on the semantic attention mechanism is thus obtained, and its output is denoted $h^{s}_{t}$.
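A small sketch of this semantic-word embedding and averaging step; the vocabulary ids, embedding dimension, and word lists below are placeholders rather than values from the invention.

```python
# Embed scene-graph words (objects, attributes, relations) into a visual-semantic
# space and average them into a global semantic feature s_bar.
import torch
import torch.nn as nn

vocab_size, sem_dim = 10000, 512
embed = nn.Embedding(vocab_size, sem_dim)        # word-embedding matrix (maps one-hot ids)

obj_ids  = torch.tensor([12, 87, 431])           # S_obj : detected object words
attr_ids = torch.tensor([55, 209])               # S_attr: attribute words
rel_ids  = torch.tensor([731])                   # S_rel : relation words

S = embed(torch.cat([obj_ids, attr_ids, rel_ids]))   # all semantic features in one set
s_bar = S.mean(dim=0)                                # global semantic feature for the
print(s_bar.shape)                                   # semantic attention LSTM
```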
The decoder adopts a two-layer LSTM structure comprising an attention LSTM and a language LSTM. The semantic attention mechanism and the visual attention mechanism are combined and placed in the attention model of the first-layer LSTM. The image features detected by Faster R-CNN are defined as the set $V$; after average pooling they yield the visual and semantic mean values, and the word-embedding vector of the current time step is taken as part of the input of the second-layer language LSTM. This model is denoted the attRNeXt model. A standard LSTM operates on a single time step as $h_t = \mathrm{LSTM}(h_{t-1}, x_t)$. The outputs $h^{v}_{t}$ and $h^{s}_{t}$ obtained in the encoder are turned into attended image features $\hat{v}_{t}$ and $\hat{s}_{t}$ by the attention LSTM layer "Attend", which are then input to the language LSTM, and the words are generated sequentially as

$$p(y_t) = \mathrm{softmax}\big(W_p h^{2}_{t} + b_p\big)$$

where $\theta_l$, $W_p$, $b_p$ are parameters and $y_t$ is the generated word.
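The sketch below shows one decoding time step of such a two-layer decoder, with a first-layer attention LSTM that fuses visual and semantic attention and a second-layer language LSTM followed by a softmax over the vocabulary; fusion by concatenation, the simplified attention scorers, and all sizes are illustrative assumptions.

```python
# One decoding time step of a two-layer decoder with dual (visual + semantic) attention.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualAttentionDecoderStep(nn.Module):
    def __init__(self, feat_dim=2048, sem_dim=512, embed_dim=512, hidden_dim=512, vocab_size=10000):
        super().__init__()
        self.lstm_att = nn.LSTMCell(feat_dim + sem_dim + embed_dim + hidden_dim, hidden_dim)
        self.lstm_lang = nn.LSTMCell(feat_dim + sem_dim + hidden_dim, hidden_dim)
        self.att_v = nn.Linear(hidden_dim + feat_dim, 1)   # simplified visual attention scorer
        self.att_s = nn.Linear(hidden_dim + sem_dim, 1)    # simplified semantic attention scorer
        self.W_p = nn.Linear(hidden_dim, vocab_size)       # y_t ~ softmax(W_p h^2_t + b_p)

    def forward(self, V, S, v_bar, s_bar, w_t, h1, c1, h2, c2):
        h1, c1 = self.lstm_att(torch.cat([v_bar, s_bar, w_t, h2], -1), (h1, c1))
        a_v = F.softmax(self.att_v(torch.cat([h1.unsqueeze(1).expand(-1, V.size(1), -1), V], -1)), 1)
        a_s = F.softmax(self.att_s(torch.cat([h1.unsqueeze(1).expand(-1, S.size(1), -1), S], -1)), 1)
        v_att, s_att = (a_v * V).sum(1), (a_s * S).sum(1)  # "Attend": dual attention outputs
        h2, c2 = self.lstm_lang(torch.cat([v_att, s_att, h1], -1), (h2, c2))
        return F.log_softmax(self.W_p(h2), dim=-1), (h1, c1, h2, c2)
```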
the image description is then optimizedThe objective function is: the objective function is: h P (θ)=-∑log p(y t |I;y 1:t-1 ) Wherein y is t Is a word generated at time t, I represents an input image, y 1:t-1 Representing the generated subtitles.
Then a reinforced learning algorithm is used to obtain the target,
Figure BDA0003791354620000046
the problem that the model is unmatched in training and testing is solved, wherein r is the standard score of the used automatic evaluation index BLEU, CIDER or METEOR index, and whether the predicted statement and the evaluation standard are similar or not is judged.
The technical effects to be realized by the invention are as follows:
by using the image description method based on the double-attention model, the accuracy of the image description model and the richness of the image description content are effectively improved, and the overall performance of the image description model is improved.
Drawings
FIG. 1 is the image description model based on the ResNeXt101 model;
FIG. 2 is a schematic diagram of the object detection technique of the Faster R-CNN network structure;
FIG. 3 is a diagram of the construction of a single-layer LSTM and a double-layer LSTM at each time step;
FIG. 4 is the model structure based on a double-layer LSTM and a dual attention mechanism;
fig. 5 shows an example of experimental results.
Detailed Description
The following is a preferred embodiment of the present invention and is further described with reference to the accompanying drawings, but the present invention is not limited to this embodiment.
The invention provides an image description method based on a double-attention model.
Image description lies at the intersection of computer vision and natural language processing; it is a cross-disciplinary research topic bridging vision and text. However, most image captioning models are not controllable: descriptions are generated passively at a coarse-grained level, so research at the finer-grained level can make the generated description sentences more accurate and richer in content. The technology of automatically describing image content by computer can help visually impaired people better understand images on the network and even in the real world, and it can also be applied to early childhood education, intelligent news-headline labeling, human-computer interaction, and other fields. Its most widespread current use is the automatic generation of news articles; for example, at the 2022 Winter Olympics, some media used the technology to convert pictures into textual descriptions so as to publish competition results in real time.
The invention improves the operating mechanism of deep-learning-based computer-vision image description to improve its performance. An encoder-decoder framework is adopted to complete the image description research task, several methods are used to address problems in the task, model performance is improved, and the accuracy of image description and the richness of the sentences are increased. The research of the invention focuses on:
(1) When constructing the new image description model, in the encoding part a new encoder model is built from a ResNeXt-101 network and the object detector Faster R-CNN to obtain the target regions of the image under test and extract their features.
(2) In the decoder part, a dual attention mechanism and a two-layer LSTM are combined to construct a new decoder that generates the description sentence of the image.
(3) For model optimization, the REINFORCE algorithm can optimize non-differentiable, discrete metrics during training, so this reinforcement learning algorithm is adopted to optimize the model.
The invention adopts a framework combining a convolutional neural network, a long short-term memory network, and an attention mechanism.
When building the model with the ResNeXt network, the ResNet-101 network model of the encoding part is replaced by a ResNeXt-101 network structure, referred to in the invention as the ResNeXt101 model, while the LSTM network is still used in the decoding part. The model diagram is shown in fig. 1.
For network pre-training, the ResNeXt network structure needs to be fine-tuned: the unneeded final layer of the ResNeXt network is removed, and ResNeXt-101 is then pre-trained on the ImageNet dataset.
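A sketch of this backbone preparation using torchvision's resnext101_32x8d as a stand-in; the exact layer surgery and the torchvision weight-loading API are assumptions (newer torchvision versions use a weights= argument instead of pretrained=True).

```python
# Load an ImageNet-pretrained ResNeXt-101 and drop the final classification layers so
# the network outputs feature maps instead of class scores.
import torch
import torch.nn as nn
import torchvision

resnext = torchvision.models.resnext101_32x8d(pretrained=True)
backbone = nn.Sequential(*list(resnext.children())[:-2])   # strip avgpool + fc head

with torch.no_grad():
    fmap = backbone(torch.randn(1, 3, 224, 224))           # (1, 2048, 7, 7) feature map
print(fmap.shape)
```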
At the encoding end, the ResNeXt network extracts the basic features of the image, and the Faster R-CNN algorithm then performs object detection and extracts the target features. Object detection is divided into two stages. In the first stage, the features output by the last layer of the ResNeXt network are fed into the region proposal network (RPN) of Faster R-CNN to generate candidate boxes, and the bounding boxes of the regions of interest are corrected for the first time. In the second stage, the generated candidate boxes and the feature map are fed into the region-of-interest pooling (ROI pooling) layer to obtain a fixed-size feature map for each candidate box. Finally, the category and precise information of the visual features are determined through softmax and a fully connected layer, the bounding boxes are corrected a second time, and the resulting visual feature set is recorded as $V = \{v_1, v_2, \ldots, v_n\}$. The technical route of object detection is shown in fig. 2.
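The region feature extraction can be sketched as follows; the stock torchvision Faster R-CNN (ResNet-50-FPN weights) and RoI Align pooling are stand-ins for the ResNeXt-101-based detector described above, so treat this as an illustrative pipeline rather than the patented encoder.

```python
# Propose object boxes with an off-the-shelf Faster R-CNN, then pool fixed-size region
# features from a ResNeXt feature map with RoI Align to form V = {v_1, ..., v_n}.
import torch
import torch.nn as nn
import torchvision
from torchvision.ops import roi_align

detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True).eval()
backbone = nn.Sequential(
    *list(torchvision.models.resnext101_32x8d(pretrained=True).children())[:-2]).eval()

image = torch.rand(3, 480, 640)                      # dummy RGB image in [0, 1]
with torch.no_grad():
    det = detector([image])[0]                       # dict with "boxes", "labels", "scores"
    boxes = det["boxes"][det["scores"] > 0.5]        # keep confident detections, (n, 4)

    fmap = backbone(image.unsqueeze(0))              # (1, 2048, H/32, W/32) feature map
    scale = fmap.shape[-1] / image.shape[-1]         # image coords -> feature-map coords
    regions = roi_align(fmap, [boxes], output_size=(7, 7), spatial_scale=scale)
    V = regions.mean(dim=(2, 3))                     # (n, 2048) visual feature set
print(V.shape)
```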
The feature obtained by average pooling is denoted $\bar{v}$ and is computed as

$$\bar{v} = \frac{1}{n}\sum_{i=1}^{n} v_i \qquad (1)$$

This global visual feature is input to the decoding end to construct the visual attention model.
The visual LSTM based on the visual attention mechanism is obtained from formula (2), and its output is denoted $h^{v}_{t}$. Let the visual LSTM model be $\mathrm{LSTM}_{1v}$, where the subscript 1 indicates the first-layer LSTM. The average vector $\bar{v}$, the previous word embedding $w_{t-1}$, and the state vector of the previous language LSTM, $h^{2}_{t-1}$, are fused to compute the visual attention query, and the visual attention query vector is taken as the guide vector of the visual attention mechanism:

$$h^{v}_{t} = \mathrm{LSTM}_{1v}\big(\big[\bar{v};\, E w_{t-1};\, h^{2}_{t-1}\big]\big) \qquad (2)$$
On the other hand, inspired by the visual attention mechanism, semantic attention feature extraction predicts the description sentence with a recurrent neural network from the global visual features extracted by the convolutional neural network and adds an attention model at the encoder stage. The invention uses a scene-graph parser to extract object semantic features, recorded as the set $S = \{S_{obj}, S_{attr}, S_{rel}\}$, where the target object words are denoted $O = \{obj_1, obj_2, \ldots, obj_n\}$, the object attribute words are denoted $A = \{attr_1, attr_2, \ldots, attr_n\}$, and the object relation words are denoted $R = \{rel_1, rel_2, \ldots, rel_n\}$. The semantic words are encoded with the One-Hot method and mapped into a Visual-Semantic space by a word-embedding matrix. The vocabulary features of the target entities are denoted $S_{obj} = \{s_{obj_1}, s_{obj_2}, \ldots, s_{obj_n}\}$, the attribute vocabulary features $S_{attr} = \{s_{attr_1}, s_{attr_2}, \ldots, s_{attr_n}\}$, and the relation vocabulary features $S_{rel} = \{s_{rel_1}, s_{rel_2}, \ldots, s_{rel_n}\}$. The average feature of the global semantic features, $\bar{s}$, is computed according to formula (3):

$$\bar{s} = \frac{1}{n}\sum_{i=1}^{n} s_i \qquad (3)$$

The semantic LSTM based on the semantic attention mechanism is obtained from formula (4), and its output is denoted $h^{s}_{t}$. The semantic LSTM model is denoted $\mathrm{LSTM}_{1s}$, where the subscript 1 indicates the first-layer LSTM. The average vector $\bar{s}$, the previous word embedding $w_{t-1}$, and the state vector of the previous language LSTM, $h^{2}_{t-1}$, are fused to compute the semantic attention query:

$$h^{s}_{t} = \mathrm{LSTM}_{1s}\big(\big[\bar{s};\, E w_{t-1};\, h^{2}_{t-1}\big]\big) \qquad (4)$$

Semantic attention takes the semantic features of the scene graph's target entities, relations, and attributes as the attention query elements, and the query vector serves as the guide vector of the semantic attention mechanism.
When the ResNeXt101 model uses the language LSTM, the image visual signals delivered by the encoder, the current image words, and the text of the manual image annotations serve as the inputs of the language-layer LSTM. The construction of the single-layer LSTM and the double-layer LSTM at each time step is shown in FIG. 3, with the single-layer LSTM single-time-step structure on the left and the double-layer LSTM single-time-step structure on the right.
The invention uses a two-layer LSTM structure in the decoder section, comprising an attention LSTM and a language LSTM. The semantic attention mechanism and the visual attention mechanism are combined and placed in the attention model of the first-layer LSTM; the image features detected by Faster R-CNN are defined as the set V, their average pooling yields the visual and semantic mean values, and the word-embedding vector of the current time step is taken as part of the input of the second-layer language LSTM. This model is denoted the attRNeXt model, and the structure of the decoder part is shown in fig. 4. The LSTM used in each layer is a standard long short-term memory network, and a standard LSTM operates on a single time step as shown in equation (5) below.
$$h_t = \mathrm{LSTM}(h_{t-1}, x_t) \qquad (5)$$
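Equation (5) corresponds to one call of a standard LSTM cell, for example (sizes are illustrative):

```python
# The standard LSTM single-time-step update h_t = LSTM(h_{t-1}, x_t) of equation (5).
import torch
import torch.nn as nn

cell = nn.LSTMCell(input_size=512, hidden_size=512)
x_t = torch.randn(4, 512)                       # input at time t (batch of 4)
h_prev, c_prev = torch.zeros(4, 512), torch.zeros(4, 512)
h_t, c_t = cell(x_t, (h_prev, c_prev))          # one recurrent step
```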
The outputs $h^{v}_{t}$ and $h^{s}_{t}$ obtained from equations (2) and (4) are turned into attended image features by the attention LSTM layer "Attend". The language LSTM then takes the attended features $\hat{v}_{t}$ and $\hat{s}_{t}$ as input and generates the words sequentially, as shown in formulas (6) and (7), where $\theta_l$, $W_p$, $b_p$ are parameters and $y_t$ is the generated word:

$$h^{2}_{t} = \mathrm{LSTM}_{2}\big(\big[\hat{v}_{t};\, \hat{s}_{t}\big],\, h^{2}_{t-1};\ \theta_l\big) \qquad (6)$$

$$p(y_t \mid y_{1:t-1}, I) = \mathrm{softmax}\big(W_p\, h^{2}_{t} + b_p\big) \qquad (7)$$
Next, the image description is optimized; the objective function is computed as shown in equation (8) below:

$$H_P(\theta) = -\sum_{t}\log p\!\left(y_t \mid I;\, y_{1:t-1}\right) \qquad (8)$$
In addition, because the objective function is fed the ground-truth answer at every time step during training, while the evaluation metrics used by the invention are discrete and non-differentiable, a mismatch arises between the model at training and testing time. To address this, a reinforcement learning algorithm is used for improvement, as shown in equation (9) below:

$$\nabla_\theta L(\theta) = -\,\mathbb{E}_{y\sim p_\theta}\!\left[r(y)\,\nabla_\theta \log p_\theta(y)\right] \qquad (9)$$

where $r$ is the score of the automatic evaluation metric used (BLEU, CIDEr, METEOR, etc.), which judges whether the predicted sentences are similar to the references.
Table 1 shows the comparison of the performance indicators of the model of the invention and the reference models on the MSCOCO dataset.
It can be seen that the models share some similarities and differences. They are alike in that all adopt the encoder-decoder deep learning architecture. They differ slightly in how the encoder and decoder are realized: the encoders differ mainly in the network models used for image feature extraction, and the decoders differ mainly in the attention mechanisms. By adopting a ResNeXt network as the encoder and a two-layer LSTM with a dual attention mechanism in the decoder, the invention's image description model achieves better performance. As the table shows, the model's coherence and accuracy index B@4 is 0.08 higher than the lowest baseline (Deep VS) and 0.06 higher than the highest (Hard-Att); on the semantic-richness metric CIDEr the model scores 1.103, which is 0.443 higher than the lowest (Deep VS) and 0.247 higher than the highest (Hard-Att). The results show that the overall performance of the model improves in both accuracy and semantic richness.
In the verification experiment, to evaluate the performance and effectiveness of the model, the model built here is evaluated on the MSCOCO2014 dataset, in which every image has manually annotated image descriptions (ground truth). The MSCOCO2014 dataset is then divided with the Karpathy split: 113,287 images for the training set, 5,000 for the validation set, and 5,000 for the test set. The model counts all manually annotated description words in the dataset, and the output description sentences are composed of words that occur more than five times plus the start/end markers <start>/<end>.
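A sketch of this vocabulary construction; the caption format and the small in-line caption list are placeholders standing in for the MSCOCO annotations.

```python
# Count all ground-truth caption words and keep those appearing more than five times,
# plus the special markers, to build the output vocabulary.
from collections import Counter

def build_vocab(captions, min_count=5):
    counter = Counter(w for cap in captions for w in cap.lower().split())
    words = [w for w, c in counter.items() if c > min_count]
    vocab = ["<pad>", "<start>", "<end>", "<unk>"] + sorted(words)
    return {w: i for i, w in enumerate(vocab)}

captions = ["a woman in a gray t-shirt is sleeping on a sofa",
            "two dogs are running on the road"]          # stand-in for MSCOCO annotations
word2id = build_vocab(captions, min_count=0)             # use min_count=5 on the real data
print(len(word2id))
```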
The experimental results are shown in FIG. 5.
In fig. 5 (a), the reference description is: an adult wearing a gray T-shirt with red sleeves is sleeping on a sofa. Our model outputs: a (woman) wearing (black pants) and a gray T-shirt is sleeping on the sofa in her (room).
In fig. 5 (b), our model's description is more accurate and semantically richer than the one given by the dataset: it adds the color of the other (black-and-white) dog and makes explicit that the dogs are on the (road). The results show that the attRNeXt image description model proposed by the invention, while staying close to the original description, improves the richness and accuracy of the content, and the overall description of the image is more accurate.

Claims (4)

1. An image description method based on a dual attention model, characterized in that: image information is input through an input unit, the image description research task is completed with an encoder-decoder framework, and a framework combining a convolutional neural network, a long short-term memory network, and an attention mechanism finally outputs the textual expression of the image; first, an encoder is constructed: a new encoder model is built from a ResNeXt-101 network and the object detector Faster R-CNN to obtain the target regions of the image under test and perform feature extraction; then a decoder is constructed: a dual attention mechanism and a two-layer LSTM form a new decoder that generates the description sentence of the image; the model is optimized with the REINFORCE reinforcement learning algorithm;
the method comprises the steps of finely adjusting a ResNeXt network structure when network pre-training is carried out, removing a useless network layer of the last layer of the ResNeXt network structure, pre-training ResNeXt-101 on an ImageNet data set, obtaining target characteristics, namely visual characteristics and semantic characteristics by utilizing the pre-trained ResNeXt network and combining a Faster R-CNN algorithm, and then constructing a double-attention-machine image description model by utilizing an attention-machine method.
2. The image description method based on a dual attention model according to claim 1, characterized in that: the encoder first extracts the basic features of the image with the ResNeXt network, then performs object detection with the Faster R-CNN algorithm and further extracts the target features.
3. The image description method based on a dual attention model according to claim 2, characterized in that: object detection is divided into two stages; in the first stage, the features output by the last layer of the ResNeXt network are fed into the region proposal network of Faster R-CNN to generate candidate boxes, and the bounding boxes of the regions of interest are corrected for the first time; the generated candidate boxes and the feature map are fed into the region-of-interest pooling layer to obtain a fixed-size feature map for each candidate box; finally, the category and precise information of the visual features are determined through softmax and a fully connected layer, the bounding boxes are corrected a second time, and the resulting visual feature set is recorded as $V = \{v_1, v_2, \ldots, v_n\}$;
the feature obtained by average pooling is denoted $\bar{v}$ and computed as $\bar{v} = \frac{1}{n}\sum_{i=1}^{n} v_i$; this global visual feature is input to the decoding end to construct the visual attention model;
the visual LSTM model is denoted $\mathrm{LSTM}_{1v}$, where the subscript 1 indicates the first-layer LSTM; the average vector $\bar{v}$, the previous word embedding $w_{t-1}$, and the previous output state vector of the language LSTM, $h^{2}_{t-1}$, are fused as the input used to compute the visual attention query, where $E$ denotes an identity-matrix transformation and $\theta_v$ is a parameter; the visual attention query vector is taken as the guide vector of the visual attention mechanism, a visual LSTM based on the visual attention mechanism is obtained, and its output is denoted $h^{v}_{t}$;
the global visual features extracted with the convolutional neural network are used to predict the description sentence with a recurrent neural network, an attention model is added at the encoder stage, and a scene-graph parser is used to extract the object semantic features, recorded as the set $S = \{S_{obj}, S_{attr}, S_{rel}\}$, where the target object words are denoted $O = \{obj_1, obj_2, \ldots, obj_n\}$, the object attribute words are denoted $A = \{attr_1, attr_2, \ldots, attr_n\}$, and the object relation words are denoted $R = \{rel_1, rel_2, \ldots, rel_n\}$; the semantic words are encoded with the One-Hot method and mapped into a Visual-Semantic space by a word-embedding matrix; the vocabulary features of the target entities are denoted $S_{obj} = \{s_{obj_1}, s_{obj_2}, \ldots, s_{obj_n}\}$, the attribute vocabulary features $S_{attr} = \{s_{attr_1}, s_{attr_2}, \ldots, s_{attr_n}\}$, and the relation vocabulary features $S_{rel} = \{s_{rel_1}, s_{rel_2}, \ldots, s_{rel_n}\}$; the average feature of the global semantic features, $\bar{s}$, is then obtained as $\bar{s} = \frac{1}{n}\sum_{i=1}^{n} s_i$;
the semantic LSTM model is denoted $\mathrm{LSTM}_{1s}$, where the subscript 1 indicates the first-layer LSTM; the average vector $\bar{s}$, the previous word embedding $w_{t-1}$, and the state vector of the previous language LSTM, $h^{2}_{t-1}$, are fused to compute the semantic attention query, where $E$ denotes an identity-matrix transformation and $\theta_v$ is a parameter; a semantic LSTM based on the semantic attention mechanism is obtained, and its output is denoted $h^{s}_{t}$.
4. The image description method based on a dual attention model according to claim 3, characterized in that: the decoder adopts a two-layer LSTM structure comprising an attention LSTM and a language LSTM; the semantic attention mechanism and the visual attention mechanism are combined and placed in the attention model of the first-layer LSTM; the image features detected by Faster R-CNN are defined as the set $V$, their average pooling yields the visual and semantic mean values, and the word-embedding vector of the current time step is taken as part of the input of the second-layer language LSTM; the model is denoted the attRNeXt model, and a standard LSTM operates on a single time step as $h_t = \mathrm{LSTM}(h_{t-1}, x_t)$; the outputs $h^{v}_{t}$ and $h^{s}_{t}$ obtained in the encoder are turned into attended image features $\hat{v}_{t}$ and $\hat{s}_{t}$ by the attention LSTM layer "Attend", which are then input to the language LSTM, and the words are generated sequentially as $p(y_t) = \mathrm{softmax}(W_p h^{2}_{t} + b_p)$, where $\theta_l$, $W_p$, $b_p$ are parameters and $y_t$ is the generated word;
the image description is then optimized with the objective function $H_P(\theta) = -\sum_{t}\log p(y_t \mid I;\, y_{1:t-1})$, where $y_t$ is the word generated at time $t$, $I$ denotes the input image, and $y_{1:t-1}$ denotes the caption generated so far;
a reinforcement learning algorithm, $\nabla_\theta L(\theta) = -\,\mathbb{E}_{y\sim p_\theta}[r(y)\,\nabla_\theta \log p_\theta(y)]$, is then applied to resolve the mismatch between the model at training and testing time, where $r$ is the score of the automatic evaluation metric used (BLEU, CIDEr, or METEOR), which judges whether the predicted sentence is similar to the reference.
CN202210955941.5A 2022-08-10 2022-08-10 Image description method based on double attention models Pending CN115311465A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210955941.5A CN115311465A (en) 2022-08-10 2022-08-10 Image description method based on double attention models

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210955941.5A CN115311465A (en) 2022-08-10 2022-08-10 Image description method based on double attention models

Publications (1)

Publication Number Publication Date
CN115311465A true CN115311465A (en) 2022-11-08

Family

ID=83861565

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210955941.5A Pending CN115311465A (en) 2022-08-10 2022-08-10 Image description method based on double attention models

Country Status (1)

Country Link
CN (1) CN115311465A (en)


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116543289A (en) * 2023-05-10 2023-08-04 南通大学 Image description method based on encoder-decoder and Bi-LSTM attention model
CN116543289B (en) * 2023-05-10 2023-11-21 南通大学 Image description method based on encoder-decoder and Bi-LSTM attention model
CN116543146A (en) * 2023-07-06 2023-08-04 贵州大学 Image dense description method based on window self-attention and multi-scale mechanism
CN116543146B (en) * 2023-07-06 2023-09-26 贵州大学 Image dense description method based on window self-attention and multi-scale mechanism

Similar Documents

Publication Publication Date Title
Zhu et al. Knowledge-based question answering by tree-to-sequence learning
Gao et al. Hierarchical representation network with auxiliary tasks for video captioning and video question answering
CN115311465A (en) Image description method based on double attention models
CN110866542A (en) Depth representation learning method based on feature controllable fusion
CN113204675B (en) Cross-modal video time retrieval method based on cross-modal object inference network
CN114818717A (en) Chinese named entity recognition method and system fusing vocabulary and syntax information
CN113657115A (en) Multi-modal Mongolian emotion analysis method based on ironic recognition and fine-grained feature fusion
CN116402066A (en) Attribute-level text emotion joint extraction method and system for multi-network feature fusion
CN114492441A (en) BilSTM-BiDAF named entity identification method based on machine reading understanding
CN113392265A (en) Multimedia processing method, device and equipment
Khurram et al. Dense-captionnet: a sentence generation architecture for fine-grained description of image semantics
CN117235114A (en) Retrieval method based on cross-modal semantic and mixed inverse fact training
Yan et al. Multimodal feature fusion based on object relation for video captioning
Wang et al. Dynamic-balanced double-attention fusion for image captioning
CN112966518B (en) High-quality answer identification method for large-scale online learning platform
CN112084788B (en) Automatic labeling method and system for implicit emotion tendencies of image captions
Nam et al. A survey on multimodal bidirectional machine learning translation of image and natural language processing
González et al. Piglet: Pixel-level grounding of language expressions with transformers
CN112507717A (en) Medical field entity classification method fusing entity keyword features
CN116662924A (en) Aspect-level multi-mode emotion analysis method based on dual-channel and attention mechanism
CN116258147A (en) Multimode comment emotion analysis method and system based on heterogram convolution
CN111813927A (en) Sentence similarity calculation method based on topic model and LSTM
Wei et al. Stack-vs: Stacked visual-semantic attention for image caption generation
Zhou et al. Joint scence network and attention-guided for image captioning
Huang et al. Modeling Multiple Aesthetic Views for Series Photo Selection

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination