CN112819012B

CN112819012B - Image description generation method based on multi-source cooperative features

Info

Publication number: CN112819012B
Application number: CN202110128180.1A
Authority: CN
Inventors: 孙晓帅; 纪荣嵘; 骆云鹏
Original assignee: Xiamen University
Current assignee: Xiamen University
Priority date: 2021-01-29
Filing date: 2021-01-29
Publication date: 2022-05-03
Anticipated expiration: 2041-01-29
Also published as: CN112819012A

Abstract

An image description generation method based on multi-source collaborative features relates to multi-source feature extraction, enhancement and fusion, belongs to the technical field of artificial intelligence, and comprises the following steps: step 1, simultaneously extracting grid characteristics and area characteristics of an image by adopting a target detector; step 2, utilizing absolute and relative position information of the features to assist the model in understanding the features and interacting and enhancing the two features; and 3, utilizing the geometric alignment relation between the features to interactively enhance the two features, exchanging important visual information and realizing better visual expression. The method aims at the limitation that the traditional image description method based on single-source characteristics lacks scenes and detailed information, provides a multi-source cooperative characteristic extraction, fusion and enhancement method, and strengthens visual prior, thereby improving the accuracy of generation description.

Description

Image description generation method based on multi-source cooperative features

Technical Field

The invention relates to multi-source feature extraction, enhancement and fusion, in particular to an image description generation method based on multi-source cooperative features.

Background

Image description generation is the task of automatically generating descriptive statements for an input image. The task of generating image descriptions spans two fields of computer vision and natural language processing, and the main challenge lies not only in the comprehensive understanding of objects and relationships in images through object recognition, scene recognition, attribute and relationship detection, etc., but also in the generation of fluent sentences conforming to visual semantics. The image description generation has wide application range, and can help the automatic driving field to understand the road condition and also help the vision-impaired people to understand the environment.

Despite the challenges of image description generation, over the years of development, great progress has been made in image description generation, both in reference data sets and methods. Lin et al (Lin, t. -y.; Maire, m.; Belongie, s.; Hays, j.; Perona, p.; ramann, d.; dolar, p.; and Zitnick, c.l.2014.microsoft COCO: Common objects in context.in ECCV.) propose a baseline dataset COCO for image description generation. Vinyals et al (Vinyals, O.; Toshev, A.; Bengio, S.; and Erhan, D.2015.show and tell: A neural image capture generator. in CVPR.) for the first time, used the encoder-decoder structure in the field of machine translation as a large paradigm for image description generation. Anderson et al (Rennie, s.j.; marcher, e.; Mroueh, y.; Ross, j.; and Goel, v.2017. self-diagnostic sequence tracing for image capturing. in CVPR) propose a method that provides an image prior using an object detector. Rennie et al (Anderson, P.; He, X.; Buehler, C.; Teney, D.; Johnson, M.; Gould, S.; and Zhang, L.2018.bottom-up and top-down adherence for image capturing and visual query answering. in CVPR.) use an reinforcement learning method to solve the problem of inconsistent behavior when training and testing an image description generation network.

The above work lays a solid foundation for image description generation. The regional features proposed by Anderson et al by the target detection network greatly reduce the difficulty of visual semantic embedding, compared to the mesh features used in earlier methods, since most salient regions in images tend to be targets. Despite great success, regional features are still subject to scaling due to lack of contextual information and fine-grained detail. The detected area may not cover the entire image, resulting in an inability to correctly describe the global scene. At the same time, each region is represented by a single feature vector, which inevitably loses a large amount of object detail. However, these drawbacks are the advantages of grid features, which in contrast cover all the content of a given image in a more fragmented form.

Based on the background, the image description generation method based on the multi-source cooperative features is selected and researched to make up for the defects of the existing method, more accurate and fine image description contents are obtained, and the step of industrial application of image description generation is promoted.

Disclosure of Invention

The invention aims to provide a multi-source characteristic cooperation method aiming at the defects of the image characteristics of the traditional image report generation method, extract and use various image characteristics so as to strengthen the prior information of an image and perform more accurate and detailed image description generation, and provides an image description generation method based on the multi-source cooperation characteristics.

The invention comprises the following steps:

1) simultaneously extracting grid characteristics and area characteristics of the image by adopting a target detector;

2) establishing a comprehensive relation attention mechanism auxiliary model for feature understanding and relation modeling by using the absolute position information and the relative position information of the features, wherein the auxiliary model performs feature understanding and self-enhancement of the two features;

3) by utilizing the geometric alignment relation between the features, the two features are subjected to interactive cooperation enhancement, important visual information is exchanged, and better visual expression is realized.

In step 1), the specific method for simultaneously extracting the grid feature and the region feature of the image by using the target detector may be:

(1) target detection and attribute prediction training were performed on the Visual Genome dataset using the fast-RCNN as the target detector.

(2) Extracting the image features corresponding to the detection frames with the confidence degree higher than 20% detected by the target detector as regional features, and extracting the features extracted by the target detector backbone network as grid features.

In step 2), the absolute position information is the position of the grid feature or the area feature in the whole picture; the relative positionThe geometric information of the grid feature and the region feature can be represented as a rectangular frame (x, y, w, h), wherein (x, y) is the coordinates of the upper left corner of the frame, and w, h are the width and the height of the frame; then two boxes are put_iAnd box_jIs expressed as a 4-dimensional vector:

after the 4-dimensional relative encoding vector is obtained, it is also mapped to d using the PE function_modelMaintaining;

the characteristic self-enhancement can be carried out by using a Transformer model after obtaining absolute position codes and relative position codes.

In step 3), the specific steps of utilizing the geometric alignment relationship between the features to interactively enhance the two features and exchange important visual information to realize better visual expression may be:

(1) and constructing a geometric alignment graph according to the position information of the region characteristic and the grid characteristic.

(2) And performing visual information interaction and enhancement according to the geometric alignment graph.

The invention has the following outstanding advantages:

1. the method overcomes the limitation and the defect of single-source characteristics, considers the complementarity of the multi-source characteristics for the first time, considers the self-reinforcement in various characteristics and the cooperative promotion among the characteristics, constructs the image description generation method of the multi-source cooperative characteristics, designs and realizes a model, and obtains a more accurate and fine high-quality image description text.

2. The invention fully utilizes the meta-information of the feature position, specifically considers the absolute position information of the feature and specifically models the relative position information between the features, and further helps the model to understand the inherent properties and the relationship between the features.

3. The invention designs a method for interaction among light-weight features, which carries out more efficient and light-weight information interaction through geometric position information among different types of features.

Drawings

Fig. 1 is an overall block diagram of an embodiment of the present invention.

Fig. 2 is a schematic structural diagram of a characteristic self-reinforcing module according to an embodiment of the present invention.

Fig. 3 is a schematic structural diagram of a feature cooperation enhancing module according to an embodiment of the present invention.

FIG. 4 is a geometric alignment of features of an embodiment of the present invention.

Detailed Description

The following examples will further illustrate the present invention with reference to the accompanying drawings.

The overall framework of the work of the image description generation method based on the multi-source collaborative feature provided by the invention is shown in FIG. 1, and specifically comprises the following aspects:

1) and (5) extracting image features. And (3) using a fast-RCNN as a target detector, combining a 5 th convolution module into a target detection backbone network on the basis of the fast-RCNN target detector, using a 1x1 region-of-interest Pooling (RoI Pooling) method in a target detection head network, and using two fully-connected layers as detection heads of the target detector to perform target detection and attribute prediction training on a Visual Genome data set. For one picture, a network obtained by training is used, a feature map is obtained by calculating until the 5 th convolution module is finished, and then the feature map is averaged and pooled into 7x7 grid features. For the region features, extracting the corresponding image features in the detection frame with the confidence coefficient higher than 20% detected by the target detector as the region features, when the number of the region features is less than 10, extracting the detection result 10 before the confidence coefficient, and setting the maximum number of the region features as 100, namely, each picture has 100 region features at most. Therefore, for each picture, 7x 7-49 grid features and 10-100 region features can be obtained.

2) The feature is self-reinforcing and the modular structure is shown in figure 2. The purpose of the feature self-enhancement module is to enable grid features and region features to enhance feature expression through respective internal interaction, and in the process, absolute position information and relative position information are used for establishing a comprehensive relationship attention mechanism (CRA) auxiliary model for feature understanding and relationship modeling.

Absolute position information, i.e. the position of a grid feature or a region feature in the whole picture, for the grid feature, its absolute position can be represented by using a two-dimensional coordinate (i, j), and in order to input the coordinate into the neural network, the 2-dimensional coordinate is mapped into a high-dimensional vector by GPE function:

GPE(i,j)＝[PE_i；PE_j] (1)

wherein, the first and the second end of the pipe are connected with each other,

d_modelmiddle layer feature dimension for neural networks:

where pos represents a position (i.e., i or j) and k represents a dimension. For the region feature, the corresponding rectangular frame (x) is mapped by a linear mapping RPE_min,y_min,x_max,y_max) Mapping to a high-dimensional vector:

RPE(i)＝B_iW_emb (4)

where i is a subscript of the region feature, (x)_min,y_min) Is the upper left corner coordinate of the feature, (x)_max,y_max) Is the coordinate of the upper right corner of the feature,

is a learnable parameter matrix.

In order to better fuse the relative position information, the relative position information is added according to the geometric information, for this purpose, the geometric information of the grid characteristic and the region characteristic is represented as a rectangular frame (x, y, w, h), wherein (x, y) is the upper left corner of the frameThe coordinates w, h are the width and height of the box. Then two boxes are put_iAnd box_jIs expressed as a 4-dimensional vector:

after the 4-dimensional relative encoding vector is obtained, it is also mapped to d using the PE function_modelDimension for input into subsequent modules.

After the absolute position code and the relative position code are obtained, a Transformer model is used for carrying out feature self-enhancement, and concretely, the Transformer model comprises the following components:

MHCRA(Q,K,V)＝Concat(head₁,...,head_h)W^O (6)

after a comprehensive relationship attention mechanism (CRA) calculation method is determined, a self-enhancement step of the features can be carried out, and the grid features and the area features of the ith layer are respectively recorded as

And

then:

wherein RPE and GPE are absolute position codes of region feature and grid feature, omega_**Is the relative position code of the two. Two layers of forward propagation network FFN are then used as intermediate mappings:

the self-enhancement module of the characteristics is used, after the self-enhancement is finished, the two paths of characteristics enter the next module to carry out the cooperative enhancement of the characteristics;

3) the feature cooperation is enhanced, and the module structure is shown in fig. 3. The feature cooperation enhancement module aims to model the interaction between two different features to enhance the feature expression, and in order to more efficiently perform the interaction between the two features, a geometric alignment graph is firstly constructed, G is (V, E), as shown in FIG. 4, in the geometric graph, all region features and grid features are independent nodes to form a node set V, for an edge set E, an edge is arranged between a region feature node and a grid feature node, and if and only if the region features and the grid features are geometrically intersected, a multi-head cross attention Mechanism (MHLCCA) is used in the feature cooperation enhancement module;

MHLCCA(Q,K,V)＝Concat(head₁,...,head_h)W^O (13)

wherein, the graph-softmax operation is based on the graph G, for each node, only all nodes connected with the graph G are subjected to normalization operation, and the weight values of the nodes without connection are set to be zero; self-increment for featuresFirst output of strong module

The feature cooperation enhancement module calculates:

wherein omega_rg,Ω_grIs the relative position information between the region feature and the mesh feature. Region features are embedded in grid features and vice versa to enhance both features. Specifically, the mesh feature focuses on the region feature to acquire high-level object information, and the region feature focuses on the mesh feature to supplement the detail information. The output of this layer is then obtained by two FFN layers:

the feature self-enhancement module and the cooperative enhancement module act alternately for 3 times, and finally the obtained features are input into the language generation module.

4) And a language generation module. Language generation module given enhanced features

And the previously generated partial sentence w₁,w₂,...,w_iIs generated to generate the next word w_i+1First, the generated part of the sentence is mapped to d by the word embedding method_modelDimension vectors arranged in rows and combined with the space vector of the next position to obtain a matrix

Then proceed to the self-attention module:

MHSA(Q,K,V)＝Concat(head₁,...,head_h)W^O (20)

wherein the content of the first and second substances,

are all learnable parameters, pos_*Is the position coding of the word. Output H for l-th layer^(l)The language generation module calculates:

M^(l)＝MHSA(H^(l)) (23)

H^(l+1)＝FFN(M^(l)) (24)

the i +1 th word is finally predicted as:

5) a loss function. The whole model is divided into two stages of training, and the loss function of the first stage is as follows:

i.e., the probability of each word prediction, the second stage loss function is:

reinforce loss for reinforcement learning, where r denotes CIDER, b denotes baseline, and k is the bundle search size.

The specific implementation results are as follows:

experiments were performed on the reference image subtitle data set COCO. This data set contains 123287 pictures, each with 5 different annotations. For data partitioning, 113287, 5000 images were used for training, validation, and testing, respectively, following the widely adopted Karpathy segmentation method. Will d_modelSet to 512 and the number of heads set to 8. The number of layers for both encoder and decoder is set to 3. In the first stage of training, the model is preheated for 4 rounds, and the learning rate is linearly increased to 1x 10^-4. Then setting the learning rate to 1x 10 between 5-10 rounds^-4Is set to be 2 multiplied by 10 between 11 and 12^-5Then set to 4 × 10^-6. The batch size was set to 50. After the 18 era XE pre-training phase, the Cider optimization model is started, and the learning rate is 5 multiplied by 10^-6The batch size is 100. The Adam optimizer was used in both stages, with a bundle search width of 5. Models were evaluated using BLEU @ N, METEOR, ROUGE-L, CIDER, and SPICE following standard evaluation procedures.

TABLE 1

Model	B-1	B-4	M	R	C	S
							SCST(ResNet-101)	-	34.2	26.7	57.7	114	-
Up-Down(ResNet-101)	79.8	36.3	27.7	56.9	120.1	21.4
							HAN(ResNet-101)	80.9	37.6	27.8	58.1	121.7	21.5
GCN-LSTM(ResNet-101)	80.5	38.2	28.5	58.5	128.3	22
							SGAE(ResNet-101)	80.8	38.4	28.4	58.6	127.8	22.1
ORT(ResNet-101)	80.5	38.6	28.7	58.4	127.8	22.1
							AoA(ResNet-101)	80.2	38.9	29.2	58.8	129.8	22.4
M2(ResNet-101)	80.8	39.1	29.2	58.6	131.2	22.6
							X-Transformer(SENet-154)	80.9	39.7	29.5	59.1	132.8	23.4
Ours(ResNeXt-101)141 2	81.4	39.8	29.5	59.1	133.8	23

Image description final test results are shown in table 1.

Claims

1. An image description generation method based on multi-source collaborative features is characterized by comprising the following steps:

2) establishing a comprehensive relation attention mechanism auxiliary model for feature understanding and relation modeling by using absolute position information and relative position information of features, wherein the auxiliary model performs feature understanding and self-enhancement of two features;

the absolute position information is the position of the grid feature or the area feature in the whole picture; the relative position information is that the geometric information of the grid characteristic and the region characteristic is represented as a rectangular frame (x, y, w, h), wherein (x, y) is the coordinate of the upper left corner of the frame, and w, h are the width and the height of the frame; then two boxes are put_iAnd box_jIs expressed as a 4-dimensional vector:

after the 4-dimensional relative encoding vector is obtained, it is also mapped to d using the PE function_modelDimension, d_modelIs the middle layer characteristic dimension of the neural network;

the characteristic self-enhancement is carried out by using a Transformer model after obtaining absolute position codes and relative position codes;

2. The image description generation method based on multi-source cooperative features of claim 1, wherein in step 1), the specific method for simultaneously extracting the grid features and the region features of the image by using the target detector comprises:

(1) performing target detection and attribute prediction training on a Visual Genome dataset by using a fast-RCNN as a target detector;

3. The image description generation method based on multi-source cooperative features as claimed in claim 1, wherein in step 3), the two features are interactively enhanced by using the geometric alignment relationship between the features, and exchange important visual information, so as to realize better visual expression, specifically comprising the steps of:

(1) constructing a geometric alignment graph according to the position information of the region characteristic and the grid characteristic;