Detailed Description
The present invention will be described in detail below with reference to the accompanying drawings and specific embodiments.
The method of the invention is implemented according to the following steps:
step 1, constructing a model data set,
1.1) establishing a training set of data samples and a validation set of image data,
training a deep learning network requires a large number of labeled samples; since manual image annotation is both limited and enormously labor-intensive, and the model must be given control conditions to steer the generated description, this step selects the published MSCOCO data set images and labels as the data samples of the feature extraction network; 90% of the MSCOCO data set is randomly selected as the training set, and the remaining image samples are used as the validation set; these are collectively referred to as sample images;
1.2) establishing a data set of control conditions,
when generating an image text description, an Abstract Scene Graph (ASG) is provided for each sample image as an input to control the structure of the generated sentence; the structure of the abstract scene graph is shown in fig. 1 and includes three types of nodes (i.e., an object node o, an attribute node a, and a relationship node r) and the edges connecting the nodes;
referring to fig. 1, in the sentence "a bundle of pink flowers is placed on a wooden table", "flowers" and "table" are object nodes o; "pink", "a bundle of", and "wooden" are attribute nodes a; and the association of the flowers with the table ("is placed on") is a relationship node r;
in this step, an ASG generator (a published technique) is adopted to generate an abstract scene graph for each sample image; the published RPN model is used to detect the object nodes in each image, and attribute nodes are attached to the object nodes by automatic sampling; a relationship node only requires deciding whether a relationship exists between two objects, so a simple classification network is adopted to judge whether a relationship node (i.e., an edge) exists between two objects; finally, the ASG of a sample image is denoted $G_{ks}=(N_{ks},E_{ks})$, $ks=1,2,\ldots,N_s$, where $N_s$ is the number of samples in the data set; the node set of the ASG is $N_{ks}=[node_1^{ks},node_2^{ks},\ldots,node_{Ne}^{ks}]$, $node_k^{ks}\in\{o,a,r\}$, $k=1,2,\ldots,Ne$, where $Ne$ is the number of nodes; for convenience of description and calculation, the node number of each sample image is set to a fixed value, with a preferred range $Ne\in[10,20]$; if the number of actually extracted nodes exceeds $Ne$, unrelated isolated nodes are eliminated or the number of attribute nodes is limited, and if the number of actually extracted nodes is less than $Ne$, the corresponding nodes are set to 0; the edge set of the ASG is $E_{ks}=[e_{i,j}]_{Ne\times Ne}$, $e_{i,j}\in\{0,1\}$, where 1 indicates an association between two nodes and 0 indicates no association;
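By way of illustration only (not part of the original disclosure), the fixed-size ASG representation described above might be sketched in Python as follows; the integer coding NODE_TYPE and the helper pad_asg are hypothetical names introduced here:

```python
# A minimal sketch of padding an ASG to a fixed node count Ne (step 1.2):
# node labels in {o, a, r} become integers, missing nodes are zero-padded,
# and the edges form an Ne x Ne 0/1 adjacency matrix.
import numpy as np

NE = 10                                           # within the preferred range [10, 20]
NODE_TYPE = {"pad": 0, "o": 1, "a": 2, "r": 3}    # hypothetical integer coding

def pad_asg(node_types, edges):
    """node_types: list of 'o'/'a'/'r' labels; edges: list of (i, j) pairs."""
    nodes = np.zeros(NE, dtype=np.int64)
    adj = np.zeros((NE, NE), dtype=np.float32)
    for k, t in enumerate(node_types[:NE]):       # drop surplus nodes beyond Ne
        nodes[k] = NODE_TYPE[t]
    for i, j in edges:
        if i < NE and j < NE:
            adj[i, j] = 1.0                       # e_ij = 1 means "associated"
    return nodes, adj

# "a bundle of pink flowers is placed on a wooden table":
# node 0 = flowers (o), 1 = pink (a), 2 = a bundle of (a), 3 = table (o), 4 = placed-on (r)
nodes, adj = pad_asg(["o", "a", "a", "o", "r"], [(1, 0), (2, 0), (0, 4), (4, 3)])
```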
step 2, extracting the characteristics of the data set,
2.1) establishing a semantic dictionary describing the text,
the m most frequently occurring words are selected from the labels of the sample images to form a semantic concept set; the total number m of words is chosen according to the text description domain and the required description accuracy; based on the MSCOCO data set and general requirements, the preferred value range of m is [10000,12000]; an integer serial number is assigned to each word, and three special tokens are appended after the integer serial numbers, namely a start marker bit, an end marker bit, and a low-frequency word bit, so that m+3 integer serial numbers form the dictionary;
for the $ks$-th sample image in the training set, the established dictionary is used to perform semantic dictionary labeling of the data set sample images, in the form $Y_{ks}=[y_1^{ks},y_2^{ks},\ldots,y_{L_{ks}}^{ks}]$, where $L_{ks}$ is the text description length of the $ks$-th image, and $y_k^{ks}$ is the integer serial number of the $k$-th word in the text semantic dictionary, $k=1,2,\ldots,L_{ks}$;
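As a non-limiting sketch of the dictionary construction of step 2.1) and this labeling, the following fragment builds the m+3-entry dictionary and encodes a caption; the token names <start>, <end>, and <unk> are illustrative stand-ins for the three special bits:

```python
# Build the semantic dictionary: the m most frequent words get integer ids,
# then three special tokens (start marker, end marker, low-frequency word).
from collections import Counter

def build_dictionary(captions, m=10000):
    counts = Counter(w for cap in captions for w in cap.lower().split())
    word2id = {w: i for i, (w, _) in enumerate(counts.most_common(m))}
    word2id["<start>"] = m          # start marker bit
    word2id["<end>"] = m + 1        # end marker bit
    word2id["<unk>"] = m + 2        # low-frequency word bit
    return word2id

def encode_caption(caption, word2id):
    ids = [word2id.get(w, word2id["<unk>"]) for w in caption.lower().split()]
    return [word2id["<start>"]] + ids + [word2id["<end>"]]
```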
2.2) extracting the image characteristics,
the global features of the sample images are extracted with a ResNet network (a published technique), i.e., the output of the last average pooling layer of the convolutional network ResNet, an $M_1$-dimensional feature vector, is taken as the global feature $g_{ks}$ of the image, preferably $M_1=2048$;
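A hedged sketch of this extraction follows; the text only names "ResNet", so ResNet-101 from torchvision is assumed here, with the final average-pool output giving the 2048-dimensional feature $g_{ks}$:

```python
# Global image feature (step 2.2): output of the last average pooling layer.
import torch
import torchvision.models as models
import torchvision.transforms as T

resnet = models.resnet101(weights=models.ResNet101_Weights.IMAGENET1K_V1)
backbone = torch.nn.Sequential(*list(resnet.children())[:-1])  # drop fc, keep avgpool
backbone.eval()

preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

@torch.no_grad()
def global_feature(pil_image):
    x = preprocess(pil_image).unsqueeze(0)   # (1, 3, 224, 224)
    return backbone(x).flatten(1)            # (1, 2048) == M1
```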
2.3) extracting abstract scene graph characteristics,
according to the ASG node set $N_{ks}=[node_1^{ks},node_2^{ks},\ldots,node_{Ne}^{ks}]$, $node_k^{ks}\in\{o,a,r\}$, $k=1,2,\ldots,Ne$, $ks=1,2,\ldots,N_s$ obtained in step 1.2), a Faster-RCNN network (a published technique) is adopted, and the fully-connected fc7 layer of the Faster-RCNN is taken as the image region feature; for convenience of calculation, the region feature is an $M_1$-dimensional feature vector;
let the region features of all the extracted nodes be expressed as $X_{ks}=[x_1^{ks},x_2^{ks},\ldots,x_{Ne}^{ks}]$, where for $node_k^{ks}=o$, the target node is characterized by the feature extracted on the corresponding region; for $node_k^{ks}=a$, the attribute node shares the region feature of the object node it is connected to; and for $node_k^{ks}=r$, the feature is extracted from the union region of the two associated targets involved;
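The assignment of region features to the three node types might be sketched as follows, assuming the fc7 region features and union-box features have been precomputed with Faster-RCNN; owner and union_feats are hypothetical bookkeeping structures, not from the disclosure:

```python
# Node features (step 2.3): object nodes take their own region feature,
# attribute nodes copy the feature of the object they modify, and relationship
# nodes take the feature of the union box of the two objects they connect.
import numpy as np

def node_features(region_feats, node_types, owner, union_feats):
    """region_feats: (num_obj, M1) fc7 features; owner[k]: object index for
    'o'/'a' nodes; union_feats[k]: (M1,) union-box feature for 'r' nodes."""
    M1 = region_feats.shape[1]
    X = np.zeros((len(node_types), M1), dtype=np.float32)
    for k, t in enumerate(node_types):
        if t in ("o", "a"):          # attribute shares its object's feature
            X[k] = region_feats[owner[k]]
        elif t == "r":               # relation uses the union-region feature
            X[k] = union_feats[k]
    return X
```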
step 3, constructing an encoder for extracting key features,
3.1) establishing a network for extracting key features of the image,
the structure of the coding network for extracting the key image features is shown in fig. 2; the region features $X_{ks}$ obtained in step 2.3) are taken as input, and the different region features are weighted through a multi-head attention mechanism so that the network focuses on the important parts to be described, giving the weighted region features $\hat{X}_{ks}$; the association relations between the objects are then determined through the AoA module and layer normalization; finally, the features are combined with the node attributes through the node embedding module to obtain the output $Z_k^{ks}$;
3.2) constructing a multi-head attention module,
the region features $X_{ks}$ obtained in step 2.3) generally contain a large amount of redundant information, and every region feature carries the same importance; constructing a multi-head attention module maps the feature vectors into different subspaces so that the model can understand the features from different angles, which strengthens the encoding of the region features, makes the obtained region features more accurate, and makes the description focus more prominent; the module is described in detail below,
3.2a) the region features $X_{ks}$ obtained in step 2.3) are passed through three different linear transformations to obtain a query vector $Q_k^{ks}$, a key vector $K_k^{ks}$, and a value vector $V_k^{ks}$ of the same dimension; the linear transformations are expressed as $Q_k^{ks}=W^Q x_k^{ks}$, $K_k^{ks}=W^K x_k^{ks}$, $V_k^{ks}=W^V x_k^{ks}$, where $W^Q$, $W^K$, $W^V$ are different randomly initialized mapping matrices obtained by network training;
3.2b) the query vector $Q_k^{ks}$, key vector $K_k^{ks}$, and value vector $V_k^{ks}$ are each divided into $n_1$ ($n_1$ is an empirical value, preferably $n_1=8$) sub-features of dimension $M_2=M_1/n_1$: query sub-features $Q_i$, key sub-features $K_i$, and value sub-features $V_i$, $i=1,2,\ldots,n_1$; the similarity score between $Q_i$ and $K_i$ is computed as $s_i=f_{sim}(Q_i,K_i)$, where $f_{sim}$ is the similarity-score function, defined as the scaled dot product $f_{sim}(Q_i,K_i)=Q_i K_i^{\top}/\sqrt{M_2}$;
then a softmax operation is applied to the similarity scores, which are used as weights in a weighted summation to obtain the spatial attention sub-features $head_i=\mathrm{softmax}(s_i)\,V_i$;
finally, the weighted features of the multiple subspaces are fused to obtain the region features containing attention weights, $\hat{X}_{ks}=W^O[head_1;head_2;\ldots;head_{n_1}]$, where $W^O$ is a linear mapping obtained through network training;
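A minimal PyTorch sketch of this multi-head attention follows, under the stated assumption that $f_{sim}$ is the scaled dot product; the feature width m1 is a configurable parameter here (512 is used for illustration), with the per-head dimension $M_2=M_1/n_1$:

```python
# Multi-head attention over region features (step 3.2).
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    def __init__(self, m1=512, n1=8):
        super().__init__()
        assert m1 % n1 == 0
        self.n1, self.m2 = n1, m1 // n1          # M2 = M1 / n1 per-head dimension
        self.WQ = nn.Linear(m1, m1, bias=False)  # randomly initialized mappings,
        self.WK = nn.Linear(m1, m1, bias=False)  # learned during training
        self.WV = nn.Linear(m1, m1, bias=False)
        self.WO = nn.Linear(m1, m1, bias=False)  # fuses the n1 sub-features

    def forward(self, x):                        # x: (batch, Ne, M1)
        b, ne, _ = x.shape
        split = lambda t: t.view(b, ne, self.n1, self.m2).transpose(1, 2)
        q, k, v = split(self.WQ(x)), split(self.WK(x)), split(self.WV(x))
        scores = q @ k.transpose(-2, -1) / self.m2 ** 0.5   # similarity scores
        attn = F.softmax(scores, dim=-1)                    # weights for summation
        out = (attn @ v).transpose(1, 2).reshape(b, ne, -1)
        return self.WO(out), self.WQ(x)          # x_hat, plus Q for the AoA module
```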
3.3) constructing the AoA module,
in order to accurately predict the semantic relationship between two objects in a sample image, the region features containing attention weights $\hat{X}_{ks}$ and the query vector $Q_k^{ks}$ are combined in the AoA module to improve the accuracy of word prediction on the features;
referring to fig. 3, the AoA module comprises two separate linear transformations, which respectively generate an information vector $f_k^{ks}$ and an attention gate vector $m_k^{ks}$, expressed as $f_k^{ks}=W_f[Q_k^{ks};\hat{x}_k^{ks}]+b_f$ and $m_k^{ks}=\sigma(W_m[Q_k^{ks};\hat{x}_k^{ks}]+b_m)$, where $W_f$ and $W_m$ are the two linear transformation weights learned by the network, $b_f$ and $b_m$ are constant bias terms, and $\sigma$ is the sigmoid activation function;
a dot-product (element-wise) operation is then performed on the information vector $f_k^{ks}$ and the attention gate vector $m_k^{ks}$ to obtain the attention information feature $\hat{f}_k^{ks}=f_k^{ks}\odot m_k^{ks}$, which expresses the dependency relations among the objects more appropriately; here $\odot$ denotes the element-wise product, which makes the feature dimensions with larger values larger and those with smaller values smaller, thereby enlarging the differences between features;
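The AoA module might be sketched as follows; concatenating $Q$ and $\hat{x}$ before the two linear layers is an assumption consistent with the published AoA design, and the layer normalization mentioned in step 3.1) is folded in here:

```python
# AoA module (step 3.3): an information vector gated element-wise by a
# sigmoid attention gate, followed by layer normalization.
import torch
import torch.nn as nn

class AoA(nn.Module):
    def __init__(self, m1=512):
        super().__init__()
        self.info = nn.Linear(2 * m1, m1)   # f_k = W_f [Q; x_hat] + b_f
        self.gate = nn.Linear(2 * m1, m1)   # m_k = sigmoid(W_m [Q; x_hat] + b_m)
        self.norm = nn.LayerNorm(m1)

    def forward(self, q, x_hat):            # both (batch, Ne, M1)
        h = torch.cat([q, x_hat], dim=-1)
        f = self.info(h)                    # information vector
        m = torch.sigmoid(self.gate(h))     # attention gate vector
        return self.norm(f * m)             # element-wise product widens feature gaps
```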
3.4) constructing a node embedding module,
since the attention information feature $\hat{f}_k^{ks}$ alone cannot embody the controllability of the image text description, different node embeddings are applied to enhance the features representing the different node types, giving the node-attribute-aware feature $Z_k^{ks}=\hat{f}_k^{ks}+W_r[node_k^{ks}]+pos_k$,
where $W_r$ is a $3\times M_1$ node embedding matrix obtained by network learning, and $W_r[1]$, $W_r[2]$, $W_r[3]$ respectively denote the first, second, and third rows of $W_r$, selected according to the attribute of the $k$-th node; $pos_k$ is an $M_1$-dimensional position embedding vector, which increases the weight coefficient of $W_r[2]$ when the node is an attribute node and is used to distinguish the order of different attribute nodes connected to the same object;
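A sketch of this node embedding follows; nn.Embedding rows stand in for the $3\times M_1$ matrix $W_r$, and attr_order is a hypothetical per-node index supplying the position embedding that distinguishes sibling attribute nodes:

```python
# Node embedding (step 3.4): a role embedding per node type plus a position
# embedding that orders attribute nodes attached to the same object.
import torch
import torch.nn as nn

class NodeEmbedding(nn.Module):
    def __init__(self, m1=512, max_pos=10):
        super().__init__()
        self.role = nn.Embedding(3, m1)       # rows W_r[1..3] for o / a / r
        self.pos = nn.Embedding(max_pos, m1)  # distinguishes sibling attributes

    def forward(self, f_hat, node_type, attr_order):
        # f_hat: (batch, Ne, M1); node_type, attr_order: (batch, Ne) int tensors
        return f_hat + self.role(node_type) + self.pos(attr_order)
```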
step 4, enhancing the relationships between the encoded image features through the coding network,
according to the feature coding network shown in fig. 2, the region features $Z_k^{ks}$ obtained in step 3.4) are taken as input, and adjacent node features are combined through graph convolution (GCN), so that the coding network can process structured feature information and focus on semantically connected node features, yielding a better image text description result;
4.1) constructing a bidirectional abstract scene graph,
because the influence of an edge on the two nodes it connects in the ASG is mutual, and because the node types differ, the way a message passes from one type of node to another differs from the way it passes in the reverse direction; therefore, the original directed unidirectional edges in the abstract scene graph ASG need to be expanded into bidirectional edges with different meanings, i.e., $G_{ks}=(N_{ks},E_{ks})$, $ks=1,2,\ldots,N_s$ is changed into a multi-relational scene graph $G_{ks}'=(N_{ks},E_{ks},R_{ks})$, where $R_{ks}$ comprises 6 interaction relations among the nodes: object-to-attribute oa, attribute-to-object ao, subject-to-relationship or, relationship-to-subject ro, object-to-relationship sr, and relationship-to-object rs;
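This edge expansion can be sketched as follows; the typed-edge input format and the REVERSE table are illustrative assumptions:

```python
# Bidirectional expansion (step 4.1): every directed ASG edge gets a typed
# reverse edge, turning G = (N, E) into G' = (N, E, R) with 6 relation types.
REVERSE = {"oa": "ao", "or": "ro", "sr": "rs"}   # forward type -> reverse type

def to_multi_relational(typed_edges):
    """typed_edges: list of (i, j, t) with t in {'oa', 'or', 'sr'}.
    Returns edges grouped by relation type, for the per-relation GCN weights."""
    by_rel = {t: [] for t in ["oa", "ao", "or", "ro", "sr", "rs"]}
    for i, j, t in typed_edges:
        by_rel[t].append((i, j))                 # original directed edge
        by_rel[REVERSE[t]].append((j, i))        # added reverse edge
    return by_rel
```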
4.2) carrying out graph convolution,
the node features $Z_k^{ks}$ are encoded through the graph convolution operation to obtain the final region features; the expression is $Z_k^{ks,(l+1)}=\sigma\big(W_0^{(l)}Z_k^{ks,(l)}+\sum_{s\in R_{ks}}\sum_{j\in N_k^s}\frac{1}{|N_k^s|}W_s^{(l)}Z_j^{ks,(l)}\big)$,
where $N_k^s$ denotes the neighbor nodes of node $k$ under the relation $s$, $\sigma$ is the ReLU activation function, and $W_s^{(l)}$ is the parameter for the relation $s$ at the $l$-th layer learned by the network;
applying the GCN once brings feature information from the neighboring nodes to each node, while stacking it multiple times obtains a wider context; preferably $l\in[2,4]$; finally, the output of the $l$-th layer is taken as the 10 512-dimensional region features output by the encoding stage, and these region features are averaged to obtain the global coding feature $\bar{z}_{ks}=\frac{1}{Ne}\sum_{k=1}^{Ne}Z_k^{ks,(l)}$;
4.3) carrying out feature fusion,
the global coding feature $\bar{z}_{ks}$ and the global feature $g_{ks}$ obtained in step 2.2) are fused to obtain the global feature $\tilde{g}_{ks}$ output by the encoding stage;
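Steps 4.2) and 4.3) together might look like the following sketch; the per-relation weights follow the graph convolution above, while the concatenate-then-project fusion is an assumption, since the fusion formula itself is not reproduced here (batching is also glossed over):

```python
# Relational graph convolution encoder with mean pooling and feature fusion
# (steps 4.2-4.3); adjs holds one row-normalized adjacency per relation type.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RelationalGCNEncoder(nn.Module):
    def __init__(self, m1=512, num_rel=6, layers=2):      # l in [2, 4] preferred
        super().__init__()
        self.layers = nn.ModuleList(
            nn.ModuleList(nn.Linear(m1, m1, bias=False) for _ in range(num_rel + 1))
            for _ in range(layers))
        self.fuse = nn.Linear(2 * m1, m1)                 # assumed fusion

    def forward(self, z, adjs, g):
        # z: (Ne, M1) node features; adjs: list of 6 (Ne, Ne) matrices; g: (M1,)
        for ws in self.layers:
            out = ws[0](z)                                # self connection
            for a, w in zip(adjs, list(ws)[1:]):
                out = out + a @ w(z)                      # neighbors under relation s
            z = F.relu(out)
        z_bar = z.mean(dim=0)                             # global coding feature
        return z, self.fuse(torch.cat([z_bar, g], dim=-1))
```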
step 5, inputting the coded characteristics into a decoding network output statement,
the decoding network model mainly consists of a double-layer LSTM network, as shown in fig. 4; the global coding feature $\tilde{g}_{ks}$ obtained in step 4 is taken as input, the weights of the nodes to attend to at each decoding step are calculated through GraphAttention (a published technique), the current word is then predicted from the output of the double-layer LSTM network, the current LSTM output is fed back to update the global coding features and recalculate the node weights, the next word is generated, and so on; the specific process is as follows,
5.1) constructing the double-layer LSTM network model, which consists of a Top-Down Attention LSTM and a Language LSTM (published techniques); the input of the Attention LSTM at time $t$ is formed from the global feature $\tilde{g}_{ks}$, the output $h_{t-1}$ of the Language LSTM at time $t-1$, and the word embedding feature $W_{t-1}$, and operating on these yields the output of the Attention LSTM at time $t$, $h_t^{a}=\mathrm{LSTM}_{att}([\tilde{g}_{ks};h_{t-1};W_{t-1}];\theta_a)$, where $\theta_a$ is a network parameter;
the output $h_t^{a}$ of the Attention LSTM and the weighted region features $\hat{z}_t$ are then taken as the input of the Language LSTM to generate the word prediction result at time $t$, where $W_{ksc}$, $W_{hc}$, $W_c$, $\theta_l$ are all parameters of the network training and $\hat{z}_t$ is the region feature updated at time $t$;
the prediction result is then passed through a softmax function to obtain the word prediction probability matrix, which is fed back to update the features $\hat{z}_t$ and perform the next word prediction;
step 6, training the network constructed in the above steps,
the coding network and decoding network constructed through steps 1-5 are trained; the coding network yields condition-controllable image features, which are input into the decoding network to complete the image text description;
the network is trained using the standard cross-entropy loss; for the text description of an image under the control condition $G_{ks}$, the loss $L_{ks}$ is expressed as $L_{ks}=-\sum_{t=1}^{L_{ks}}\log p\big(y_t^{ks}\mid y_{1:t-1}^{ks},G_{ks}\big)$;
the specific parameters set in the training process are: a batch size of preferably 128, an iteration number (Epoch) of preferably 50, and an initial learning rate of preferably 0.0002; a complete constructed model is thus obtained, which can generate a controllable image text description from an image and a designated ASG.
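Finally, a minimal training-loop sketch with the standard cross-entropy loss and the stated hyper-parameters; encoder, decoder, and the loader format refer to the illustrative modules above, and init_state is a hypothetical zero-state helper:

```python
# Training loop (step 6) with cross-entropy loss over ground-truth words.
import torch
import torch.nn.functional as F

def init_state(batch, m1=512):
    z = lambda: torch.zeros(batch, m1)
    return ((z(), z()), (z(), z()))

def train_epoch(encoder, decoder, loader, optimizer):
    for g, z, adjs, caps in loader:              # caps: (batch, L) word ids
        z_enc, g_enc = encoder(z, adjs, g)       # batching glossed over here
        state = init_state(caps.size(0))
        loss = 0.0
        for t in range(caps.size(1) - 1):        # teacher forcing
            logits, state = decoder.step(caps[:, t], z_enc, g_enc, state)
            loss = loss + F.cross_entropy(logits, caps[:, t + 1])
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

# preferred hyper-parameters from the text: batch size 128, 50 epochs, lr 2e-4
# optimizer = torch.optim.Adam(params, lr=0.0002)
```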