Detailed Description
The present invention will be described in detail below with reference to the accompanying drawings and specific embodiments.
The method of the invention is implemented according to the following steps:
step 1, constructing a model data set,
1.1) establishing a training set of data samples and a validation set of image data,
training a deep learning network requires a large number of labeled samples; since manual image annotation is both limited and enormously labor-intensive, and the model must be given control conditions to steer the generated description, this step selects the published MSCOCO data set images and labels as the data samples of the feature extraction network; 90% of the MSCOCO data set is randomly selected as the training set, and the remaining image samples are used as the validation set; these are collectively referred to as sample images;
1.2) establishing a data set of control conditions,
when generating an image text description, an Abstract Scene Graph (ASG) is provided for each sample image as an input to control the structure of the generated sentence; the structure of the abstract scene graph is shown in fig. 1 and includes three types of nodes (i.e., an object node o, an attribute node a, and a relationship node r) and the edges connecting the nodes;
referring to fig. 1, in the sentence "a bundle of pink flowers is placed on a wooden table", "flowers" and "table" are object nodes o; "pink", "a bundle of", and "wooden" are attribute nodes a; and the association of the flowers with the table ("is placed on") is a relationship node r;
in this step, an ASG generator (a published technique) is adopted to generate an abstract scene graph for each sample image; the published RPN model is used to detect the object nodes in each image, and attribute nodes are attached to the object nodes by automatic sampling; a relationship node only requires deciding whether a relationship exists between two objects, so a simple classification network is adopted to judge whether a relationship node (i.e., an edge) exists between two objects; finally, the ASG of a sample image is denoted $G_{ks}=(N_{ks},E_{ks})$, $ks=1,2,\ldots,N_s$, where $N_s$ is the number of samples in the data set; the node set of the ASG is $N_{ks}=[node_1^{ks},node_2^{ks},\ldots,node_{Ne}^{ks}]$, $node_k^{ks}\in\{o,a,r\}$, $k=1,2,\ldots,Ne$, where $Ne$ is the number of nodes; for convenience of description and calculation, the node number of each sample image is set to a fixed value, with a preferred range $Ne\in[10,20]$; if the number of actually extracted nodes exceeds $Ne$, unrelated isolated nodes are eliminated or the number of attribute nodes is limited, and if the number of actually extracted nodes is less than $Ne$, the corresponding nodes are set to 0; the edge set of the ASG is $E_{ks}=[e_{i,j}]_{Ne\times Ne}$, $e_{i,j}\in\{0,1\}$, where 1 indicates an association between two nodes and 0 indicates no association;
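By way of illustration only (not part of the original disclosure), the fixed-size ASG representation described above might be sketched in Python as follows; the integer coding NODE_TYPE and the helper pad_asg are hypothetical names introduced here:

```python
# A minimal sketch of padding an ASG to a fixed node count Ne (step 1.2):
# node labels in {o, a, r} become integers, missing nodes are zero-padded,
# and the edges form an Ne x Ne 0/1 adjacency matrix.
import numpy as np

NE = 10                                           # within the preferred range [10, 20]
NODE_TYPE = {"pad": 0, "o": 1, "a": 2, "r": 3}    # hypothetical integer coding

def pad_asg(node_types, edges):
    """node_types: list of 'o'/'a'/'r' labels; edges: list of (i, j) pairs."""
    nodes = np.zeros(NE, dtype=np.int64)
    adj = np.zeros((NE, NE), dtype=np.float32)
    for k, t in enumerate(node_types[:NE]):       # drop surplus nodes beyond Ne
        nodes[k] = NODE_TYPE[t]
    for i, j in edges:
        if i < NE and j < NE:
            adj[i, j] = 1.0                       # e_ij = 1 means "associated"
    return nodes, adj

# "a bundle of pink flowers is placed on a wooden table":
# node 0 = flowers (o), 1 = pink (a), 2 = a bundle of (a), 3 = table (o), 4 = placed-on (r)
nodes, adj = pad_asg(["o", "a", "a", "o", "r"], [(1, 0), (2, 0), (0, 4), (4, 3)])
```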
step 2, extracting the characteristics of the data set,
2.1) establishing a semantic dictionary describing the text,
the m most frequently occurring words are selected from the labels of the sample images to form a semantic concept set; the total number m of words is chosen according to the text description domain and the required description accuracy; based on the MSCOCO data set and general requirements, the preferred value range of m is [10000,12000]; an integer serial number is assigned to each word, and three special tokens are appended after the integer serial numbers, namely a start marker bit, an end marker bit, and a low-frequency word bit, so that m+3 integer serial numbers form the dictionary;
for the $ks$-th sample image in the training set, the established dictionary is used to perform semantic dictionary labeling of the data set sample images, in the form $Y_{ks}=[y_1^{ks},y_2^{ks},\ldots,y_{L_{ks}}^{ks}]$, where $L_{ks}$ is the text description length of the $ks$-th image, and $y_k^{ks}$ is the integer serial number of the $k$-th word in the text semantic dictionary, $k=1,2,\ldots,L_{ks}$;
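As a non-limiting sketch of the dictionary construction of step 2.1) and this labeling, the following fragment builds the m+3-entry dictionary and encodes a caption; the token names <start>, <end>, and <unk> are illustrative stand-ins for the three special bits:

```python
# Build the semantic dictionary: the m most frequent words get integer ids,
# then three special tokens (start marker, end marker, low-frequency word).
from collections import Counter

def build_dictionary(captions, m=10000):
    counts = Counter(w for cap in captions for w in cap.lower().split())
    word2id = {w: i for i, (w, _) in enumerate(counts.most_common(m))}
    word2id["<start>"] = m          # start marker bit
    word2id["<end>"] = m + 1        # end marker bit
    word2id["<unk>"] = m + 2        # low-frequency word bit
    return word2id

def encode_caption(caption, word2id):
    ids = [word2id.get(w, word2id["<unk>"]) for w in caption.lower().split()]
    return [word2id["<start>"]] + ids + [word2id["<end>"]]
```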
2.2) extracting the image characteristics,
the global features of the sample images are extracted with a ResNet network (a published technique), i.e., the output of the last average pooling layer of the convolutional network ResNet, an $M_1$-dimensional feature vector, is taken as the global feature $g_{ks}$ of the image, preferably $M_1=2048$;
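A hedged sketch of this extraction follows; the text only names "ResNet", so ResNet-101 from torchvision is assumed here, with the final average-pool output giving the 2048-dimensional feature $g_{ks}$:

```python
# Global image feature (step 2.2): output of the last average pooling layer.
import torch
import torchvision.models as models
import torchvision.transforms as T

resnet = models.resnet101(weights=models.ResNet101_Weights.IMAGENET1K_V1)
backbone = torch.nn.Sequential(*list(resnet.children())[:-1])  # drop fc, keep avgpool
backbone.eval()

preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

@torch.no_grad()
def global_feature(pil_image):
    x = preprocess(pil_image).unsqueeze(0)   # (1, 3, 224, 224)
    return backbone(x).flatten(1)            # (1, 2048) == M1
```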
2.3) extracting abstract scene graph characteristics,
according to the ASG node set $N_{ks}=[node_1^{ks},node_2^{ks},\ldots,node_{Ne}^{ks}]$, $node_k^{ks}\in\{o,a,r\}$, $k=1,2,\ldots,Ne$, $ks=1,2,\ldots,N_s$ obtained in step 1.2), a Faster-RCNN network (a published technique) is adopted, and the fully-connected fc7 layer of the Faster-RCNN is taken as the image region feature; for convenience of calculation, the region feature is an $M_1$-dimensional feature vector;
let the region features of all the extracted nodes be expressed as $X_{ks}=[x_1^{ks},x_2^{ks},\ldots,x_{Ne}^{ks}]$, where for $node_k^{ks}=o$, the target node is characterized by the feature extracted on the corresponding region; for $node_k^{ks}=a$, the attribute node shares the region feature of the object node it is connected to; and for $node_k^{ks}=r$, the feature is extracted from the union region of the two associated targets involved;
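The assignment of region features to the three node types might be sketched as follows, assuming the fc7 region features and union-box features have been precomputed with Faster-RCNN; owner and union_feats are hypothetical bookkeeping structures, not from the disclosure:

```python
# Node features (step 2.3): object nodes take their own region feature,
# attribute nodes copy the feature of the object they modify, and relationship
# nodes take the feature of the union box of the two objects they connect.
import numpy as np

def node_features(region_feats, node_types, owner, union_feats):
    """region_feats: (num_obj, M1) fc7 features; owner[k]: object index for
    'o'/'a' nodes; union_feats[k]: (M1,) union-box feature for 'r' nodes."""
    M1 = region_feats.shape[1]
    X = np.zeros((len(node_types), M1), dtype=np.float32)
    for k, t in enumerate(node_types):
        if t in ("o", "a"):          # attribute shares its object's feature
            X[k] = region_feats[owner[k]]
        elif t == "r":               # relation uses the union-region feature
            X[k] = union_feats[k]
    return X
```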
step 3, constructing an encoder for extracting key features,
3.1) establishing a network for extracting key features of the image,
the structure of the coding network for extracting the key image features is shown in fig. 2; the region features $X_{ks}$ obtained in step 2.3) are taken as input, and the different region features are weighted through a multi-head attention mechanism so that the network focuses on the important parts to be described, giving the weighted region features $\hat{X}_{ks}$; the association relations between the objects are then determined through the AoA module and layer normalization; finally, the features are combined with the node attributes through the node embedding module to obtain the output $Z_k^{ks}$;
3.2) constructing a multi-head attention module,
the region features $X_{ks}$ obtained in step 2.3) generally contain a large amount of redundant information, and every region feature carries the same importance; constructing a multi-head attention module maps the feature vectors into different subspaces so that the model can understand the features from different angles, which strengthens the encoding of the region features, makes the obtained region features more accurate, and makes the description focus more prominent; the module is described in detail below,
3.2a) the region features $X_{ks}$ obtained in step 2.3) are passed through three different linear transformations to obtain a query vector $Q_k^{ks}$, a key vector $K_k^{ks}$, and a value vector $V_k^{ks}$ of the same dimension; the linear transformations are expressed as $Q_k^{ks}=W^Q x_k^{ks}$, $K_k^{ks}=W^K x_k^{ks}$, $V_k^{ks}=W^V x_k^{ks}$, where $W^Q$, $W^K$, $W^V$ are different randomly initialized mapping matrices obtained by network training;
3.2b) the query vector $Q_k^{ks}$, key vector $K_k^{ks}$, and value vector $V_k^{ks}$ are each divided into $n_1$ ($n_1$ is an empirical value, preferably $n_1=8$) sub-features of dimension $M_2=M_1/n_1$: query sub-features $Q_i$, key sub-features $K_i$, and value sub-features $V_i$, $i=1,2,\ldots,n_1$; the similarity score between $Q_i$ and $K_i$ is computed as $s_i=f_{sim}(Q_i,K_i)$, where $f_{sim}$ is the similarity-score function, defined as the scaled dot product $f_{sim}(Q_i,K_i)=Q_i K_i^{\top}/\sqrt{M_2}$;
then a softmax operation is applied to the similarity scores, which are used as weights in a weighted summation to obtain the spatial attention sub-features $head_i=\mathrm{softmax}(s_i)\,V_i$;
finally, the weighted features of the multiple subspaces are fused to obtain the region features containing attention weights, $\hat{X}_{ks}=W^O[head_1;head_2;\ldots;head_{n_1}]$, where $W^O$ is a linear mapping obtained through network training;
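A minimal PyTorch sketch of this multi-head attention follows, under the stated assumption that $f_{sim}$ is the scaled dot product; the feature width m1 is a configurable parameter here (512 is used for illustration), with the per-head dimension $M_2=M_1/n_1$:

```python
# Multi-head attention over region features (step 3.2).
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    def __init__(self, m1=512, n1=8):
        super().__init__()
        assert m1 % n1 == 0
        self.n1, self.m2 = n1, m1 // n1          # M2 = M1 / n1 per-head dimension
        self.WQ = nn.Linear(m1, m1, bias=False)  # randomly initialized mappings,
        self.WK = nn.Linear(m1, m1, bias=False)  # learned during training
        self.WV = nn.Linear(m1, m1, bias=False)
        self.WO = nn.Linear(m1, m1, bias=False)  # fuses the n1 sub-features

    def forward(self, x):                        # x: (batch, Ne, M1)
        b, ne, _ = x.shape
        split = lambda t: t.view(b, ne, self.n1, self.m2).transpose(1, 2)
        q, k, v = split(self.WQ(x)), split(self.WK(x)), split(self.WV(x))
        scores = q @ k.transpose(-2, -1) / self.m2 ** 0.5   # similarity scores
        attn = F.softmax(scores, dim=-1)                    # weights for summation
        out = (attn @ v).transpose(1, 2).reshape(b, ne, -1)
        return self.WO(out), self.WQ(x)          # x_hat, plus Q for the AoA module
```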
3.3) constructing the AoA module,
in order to accurately predict the semantic relationship between two objects in a sample image, the region features containing attention weights $\hat{X}_{ks}$ and the query vector $Q_k^{ks}$ are combined in the AoA module to improve the accuracy of word prediction on the features;
referring to fig. 3, the AoA module comprises two separate linear transformations, which respectively generate an information vector $f_k^{ks}$ and an attention gate vector $m_k^{ks}$, expressed as $f_k^{ks}=W_f[Q_k^{ks};\hat{x}_k^{ks}]+b_f$ and $m_k^{ks}=\sigma(W_m[Q_k^{ks};\hat{x}_k^{ks}]+b_m)$, where $W_f$ and $W_m$ are the two linear transformation weights learned by the network, $b_f$ and $b_m$ are constant bias terms, and $\sigma$ is the sigmoid activation function;
a dot-product (element-wise) operation is then performed on the information vector $f_k^{ks}$ and the attention gate vector $m_k^{ks}$ to obtain the attention information feature $\hat{f}_k^{ks}=f_k^{ks}\odot m_k^{ks}$, which expresses the dependency relations among the objects more appropriately; here $\odot$ denotes the element-wise product, which makes the feature dimensions with larger values larger and those with smaller values smaller, thereby enlarging the differences between features;
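The AoA module might be sketched as follows; concatenating $Q$ and $\hat{x}$ before the two linear layers is an assumption consistent with the published AoA design, and the layer normalization mentioned in step 3.1) is folded in here:

```python
# AoA module (step 3.3): an information vector gated element-wise by a
# sigmoid attention gate, followed by layer normalization.
import torch
import torch.nn as nn

class AoA(nn.Module):
    def __init__(self, m1=512):
        super().__init__()
        self.info = nn.Linear(2 * m1, m1)   # f_k = W_f [Q; x_hat] + b_f
        self.gate = nn.Linear(2 * m1, m1)   # m_k = sigmoid(W_m [Q; x_hat] + b_m)
        self.norm = nn.LayerNorm(m1)

    def forward(self, q, x_hat):            # both (batch, Ne, M1)
        h = torch.cat([q, x_hat], dim=-1)
        f = self.info(h)                    # information vector
        m = torch.sigmoid(self.gate(h))     # attention gate vector
        return self.norm(f * m)             # element-wise product widens feature gaps
```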
3.4) constructing a node embedding module,
since the attention information feature $\hat{f}_k^{ks}$ alone cannot embody the controllability of the image text description, different node embeddings are applied to enhance the features representing the different node types, giving the node-attribute-aware feature $Z_k^{ks}=\hat{f}_k^{ks}+W_r[node_k^{ks}]+pos_k$,
where $W_r$ is a $3\times M_1$ node embedding matrix obtained by network learning, and $W_r[1]$, $W_r[2]$, $W_r[3]$ respectively denote the first, second, and third rows of $W_r$, selected according to the attribute of the $k$-th node; $pos_k$ is an $M_1$-dimensional position embedding vector, which increases the weight coefficient of $W_r[2]$ when the node is an attribute node and is used to distinguish the order of different attribute nodes connected to the same object;
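A sketch of this node embedding follows; nn.Embedding rows stand in for the $3\times M_1$ matrix $W_r$, and attr_order is a hypothetical per-node index supplying the position embedding that distinguishes sibling attribute nodes:

```python
# Node embedding (step 3.4): a role embedding per node type plus a position
# embedding that orders attribute nodes attached to the same object.
import torch
import torch.nn as nn

class NodeEmbedding(nn.Module):
    def __init__(self, m1=512, max_pos=10):
        super().__init__()
        self.role = nn.Embedding(3, m1)       # rows W_r[1..3] for o / a / r
        self.pos = nn.Embedding(max_pos, m1)  # distinguishes sibling attributes

    def forward(self, f_hat, node_type, attr_order):
        # f_hat: (batch, Ne, M1); node_type, attr_order: (batch, Ne) int tensors
        return f_hat + self.role(node_type) + self.pos(attr_order)
```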
step 4, enhancing the relationships between the encoded image features through the coding network,
according to the feature coding network shown in fig. 2, the region features $Z_k^{ks}$ obtained in step 3.4) are taken as input, and adjacent node features are combined through graph convolution (GCN), so that the coding network can process structured feature information and focus on semantically connected node features, yielding a better image text description result;
4.1) constructing a bidirectional abstract scene graph,
because the influence of an edge on the two nodes it connects in the ASG is mutual, and because the node types differ, the way a message passes from one type of node to another differs from the way it passes in the reverse direction; therefore, the original directed unidirectional edges in the abstract scene graph ASG need to be expanded into bidirectional edges with different meanings, i.e., $G_{ks}=(N_{ks},E_{ks})$, $ks=1,2,\ldots,N_s$ is changed into a multi-relational scene graph $G_{ks}'=(N_{ks},E_{ks},R_{ks})$, where $R_{ks}$ comprises 6 interaction relations among the nodes: object-to-attribute oa, attribute-to-object ao, subject-to-relationship or, relationship-to-subject ro, object-to-relationship sr, and relationship-to-object rs;
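This edge expansion can be sketched as follows; the typed-edge input format and the REVERSE table are illustrative assumptions:

```python
# Bidirectional expansion (step 4.1): every directed ASG edge gets a typed
# reverse edge, turning G = (N, E) into G' = (N, E, R) with 6 relation types.
REVERSE = {"oa": "ao", "or": "ro", "sr": "rs"}   # forward type -> reverse type

def to_multi_relational(typed_edges):
    """typed_edges: list of (i, j, t) with t in {'oa', 'or', 'sr'}.
    Returns edges grouped by relation type, for the per-relation GCN weights."""
    by_rel = {t: [] for t in ["oa", "ao", "or", "ro", "sr", "rs"]}
    for i, j, t in typed_edges:
        by_rel[t].append((i, j))                 # original directed edge
        by_rel[REVERSE[t]].append((j, i))        # added reverse edge
    return by_rel
```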
4.2) carrying out graph convolution,
the node features $Z_k^{ks}$ are encoded through the graph convolution operation to obtain the final region features; the expression is $Z_k^{ks,(l+1)}=\sigma\big(W_0^{(l)}Z_k^{ks,(l)}+\sum_{s\in R_{ks}}\sum_{j\in N_k^s}\frac{1}{|N_k^s|}W_s^{(l)}Z_j^{ks,(l)}\big)$,
where $N_k^s$ denotes the neighbor nodes of node $k$ under the relation $s$, $\sigma$ is the ReLU activation function, and $W_s^{(l)}$ is the parameter for the relation $s$ at the $l$-th layer learned by the network;
applying the GCN once brings feature information from the neighboring nodes to each node, while stacking it multiple times obtains a wider context; preferably $l\in[2,4]$; finally, the output of the $l$-th layer is taken as the 10 512-dimensional region features output by the encoding stage, and these region features are averaged to obtain the global coding feature $\bar{z}_{ks}=\frac{1}{Ne}\sum_{k=1}^{Ne}Z_k^{ks,(l)}$;
4.3) carrying out feature fusion,
the global coding feature $\bar{z}_{ks}$ and the global feature $g_{ks}$ obtained in step 2.2) are fused to obtain the global feature $\tilde{g}_{ks}$ output by the encoding stage;
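Steps 4.2) and 4.3) together might look like the following sketch; the per-relation weights follow the graph convolution above, while the concatenate-then-project fusion is an assumption, since the fusion formula itself is not reproduced here (batching is also glossed over):

```python
# Relational graph convolution encoder with mean pooling and feature fusion
# (steps 4.2-4.3); adjs holds one row-normalized adjacency per relation type.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RelationalGCNEncoder(nn.Module):
    def __init__(self, m1=512, num_rel=6, layers=2):      # l in [2, 4] preferred
        super().__init__()
        self.layers = nn.ModuleList(
            nn.ModuleList(nn.Linear(m1, m1, bias=False) for _ in range(num_rel + 1))
            for _ in range(layers))
        self.fuse = nn.Linear(2 * m1, m1)                 # assumed fusion

    def forward(self, z, adjs, g):
        # z: (Ne, M1) node features; adjs: list of 6 (Ne, Ne) matrices; g: (M1,)
        for ws in self.layers:
            out = ws[0](z)                                # self connection
            for a, w in zip(adjs, list(ws)[1:]):
                out = out + a @ w(z)                      # neighbors under relation s
            z = F.relu(out)
        z_bar = z.mean(dim=0)                             # global coding feature
        return z, self.fuse(torch.cat([z_bar, g], dim=-1))
```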
step 5, inputting the coded characteristics into a decoding network output statement,
the decoding network model mainly consists of a double-layer LSTM network, as shown in fig. 4; the global coding feature $\tilde{g}_{ks}$ obtained in step 4 is taken as input, the weights of the nodes to attend to at each decoding step are calculated through GraphAttention (a published technique), the current word is then predicted from the output of the double-layer LSTM network, the current LSTM output is fed back to update the global coding features and recalculate the node weights, the next word is generated, and so on; the specific process is as follows,
5.1) constructing the double-layer LSTM network model, which consists of a Top-Down Attention LSTM and a Language LSTM (published techniques); the input of the Attention LSTM at time $t$ is formed from the global feature $\tilde{g}_{ks}$, the output $h_{t-1}$ of the Language LSTM at time $t-1$, and the word embedding feature $W_{t-1}$, and operating on these yields the output of the Attention LSTM at time $t$, $h_t^{a}=\mathrm{LSTM}_{att}([\tilde{g}_{ks};h_{t-1};W_{t-1}];\theta_a)$, where $\theta_a$ is a network parameter;
the output $h_t^{a}$ of the Attention LSTM and the weighted region features $\hat{z}_t$ are then taken as the input of the Language LSTM to generate the word prediction result at time $t$, where $W_{ksc}$, $W_{hc}$, $W_c$, $\theta_l$ are all parameters of the network training and $\hat{z}_t$ is the region feature updated at time $t$;
the prediction result is then passed through a softmax function to obtain the word prediction probability matrix, which is fed back to update the features $\hat{z}_t$ and perform the next word prediction;
step 6, training the network constructed in the above steps,
the coding network and decoding network constructed through steps 1-5 are trained; the coding network yields condition-controllable image features, which are input into the decoding network to complete the image text description;
the network is trained using the standard cross-entropy loss; for the text description of an image under the control condition $G_{ks}$, the loss $L_{ks}$ is expressed as $L_{ks}=-\sum_{t=1}^{L_{ks}}\log p\big(y_t^{ks}\mid y_{1:t-1}^{ks},G_{ks}\big)$;
the specific parameters set in the training process are: a batch size of preferably 128, an iteration number (Epoch) of preferably 50, and an initial learning rate of preferably 0.0002; a complete constructed model is thus obtained, which can generate a controllable image text description from an image and a designated ASG.
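Finally, a minimal training-loop sketch with the standard cross-entropy loss and the stated hyper-parameters; encoder, decoder, and the loader format refer to the illustrative modules above, and init_state is a hypothetical zero-state helper:

```python
# Training loop (step 6) with cross-entropy loss over ground-truth words.
import torch
import torch.nn.functional as F

def init_state(batch, m1=512):
    z = lambda: torch.zeros(batch, m1)
    return ((z(), z()), (z(), z()))

def train_epoch(encoder, decoder, loader, optimizer):
    for g, z, adjs, caps in loader:              # caps: (batch, L) word ids
        z_enc, g_enc = encoder(z, adjs, g)       # batching glossed over here
        state = init_state(caps.size(0))
        loss = 0.0
        for t in range(caps.size(1) - 1):        # teacher forcing
            logits, state = decoder.step(caps[:, t], z_enc, g_enc, state)
            loss = loss + F.cross_entropy(logits, caps[:, t + 1])
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

# preferred hyper-parameters from the text: batch size 128, 50 epochs, lr 2e-4
# optimizer = torch.optim.Adam(params, lr=0.0002)
```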