CN112733866A - Network construction method for improving text description correctness of controllable image - Google Patents

Network construction method for improving text description correctness of controllable image Download PDF

Info

Publication number
CN112733866A
CN112733866A, CN202110110377.2A, CN202110110377A
Authority
CN
China
Prior art keywords
node
network
image
features
follows
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110110377.2A
Other languages
Chinese (zh)
Other versions
CN112733866B (en)
Inventor
朱虹
张雨嘉
杜森
史静
刘媛媛
王栋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hunan Qianliyun Medical Technology Co ltd
Shenzhen Wanzhida Technology Co ltd
Original Assignee
Xian University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian University of Technology filed Critical Xian University of Technology
Priority to CN202110110377.2A priority Critical patent/CN112733866B/en
Publication of CN112733866A publication Critical patent/CN112733866A/en
Application granted granted Critical
Publication of CN112733866B publication Critical patent/CN112733866B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/242 Dictionaries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/048 Activation functions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Abstract

The invention discloses a network construction method for improving the correctness of controllable image text description, which comprises the following steps: step 1, constructing a model data set; step 2, extracting the features of the data set; step 3, constructing an encoder for extracting key features; step 4, constructing a coding network that enhances the relationships between the image coding features; step 5, inputting the coded features into a decoding network to output a sentence; and step 6, constructing a training network according to the above steps, training the constructed coding network and decoding network through steps 1 to 5, obtaining condition-controllable image features with the coding network, and inputting the image features into the decoding network to complete the image text description. The method gives the controllable image text description higher correctness.

Description

Network construction method for improving text description correctness of controllable image
Technical Field
The invention belongs to the technical field of image text description algorithms, and relates to a network construction method for improving the correctness of controllable image text description.
Background
Images are the most common information carriers in human activities and contain abundant useful information. Automatically extracting image content and describing it correctly is difficult, but feasible. An image text description algorithm means that, for a given image, a computer automatically outputs a sentence describing the image content. Because this completes the cross-modal conversion from image to text information, it can be applied in many fields, including fast cross-modal retrieval of images, so research in this direction has broad application prospects.
The correctness of an image text description mainly depends on two aspects: first, the ability to recognize the objects, the scene and the mutual relationships among the objects contained in the image; second, the ability to accurately output the text content describing those objects. Correct recognition is a precondition for correctly outputting the image text description, and this work is completed in the encoder of the model; however, the prior art suffers from inaccurate output information and a deviation of the output emphasis in this respect.
Disclosure of Invention
The invention aims to provide a network construction method for improving the correctness of controllable image text description, solving the problems in the prior art that the sentences produced during image text description are inaccurate and the description content is uncontrollable.
The invention adopts the technical scheme that a network construction method for improving the text description correctness of a controllable image is implemented according to the following steps:
step 1, constructing a model data set;
step 2, extracting the characteristics of the data set;
step 3, constructing an encoder for extracting key features;
step 4, constructing a coding network that enhances the relationships between the image coding features;
step 5, inputting the coded features into a decoding network to output a sentence,
taking the global coding features obtained in step 4 as input, calculating the weight of the node that needs attention at each decoding step through Graph Attention, predicting the current word from the output of a double-layer LSTM network, and then feeding the output of the current LSTM back to update the global coding features and recalculate the node weights, and so on;
step 6, constructing a training network according to the steps,
training the constructed coding network and decoding network through steps 1 to 5, obtaining condition-controllable image features with the coding network, and inputting the image features into the decoding network to complete the image text description.
The invention has the advantage that, in the coding network model, the differences in the degree of description among different objects are amplified by enhancing the features of the key parts of the image, and these features are then combined with the controllable conditions to obtain more accurate image coding features. After these features are input into the decoding network, a text description of the input image is generated; compared with the algorithm indexes published in the currently retrieved mainstream papers, the controllable image text description has higher accuracy.
Drawings
FIG. 1 shows the form in which the method of the present invention controls the text description of an image;
FIG. 2 is a flow chart of the overall structure of the feature encoding network model of the method of the present invention;
FIG. 3 is a flow chart of the AoA module structure for enhancing the relationship characteristics between objects according to the method of the present invention;
fig. 4 is a flow chart of the structure of the decoding network model adopted by the method of the present invention.
Detailed Description
The present invention will be described in detail below with reference to the accompanying drawings and specific embodiments.
The method of the invention is implemented according to the following steps:
step 1, constructing a model data set,
1.1) establishing a training set of data samples and a validation set of image data,
training the data set of a deep learning network requires a large number of labeled samples; considering that labeling the images oneself has certain limitations and a huge workload, and that the model needs control conditions to control the generated description, this step selects the published MSCOCO data set image samples and labels as the data samples of the feature extraction network; 90% of the MSCOCO data set is randomly selected as the training set, and the remaining image samples are used as the validation set; these are collectively referred to as sample images;
1.2) establishing a data set of control conditions,
when generating an image text description, an Abstract Scene Graph (ASG) is provided for each sample image as an input to control the structure of the generated sentence; the structure of the abstract scene graph is shown in fig. 1 and comprises three types of nodes (i.e., an object node o, an attribute node a and a relationship node r) and the edges connecting the nodes;
referring to fig. 1, in the sentence describing that "a bundle of pink flowers is placed on a wooden table", "flowers" and "table" are object nodes o; "pink", "a bundle of" and "wooden" are attribute nodes a; "the flowers are associated with the table" is a relationship node r;
in this step, an ASG generator (a published technique) is adopted to generate a respective abstract scene graph for each sample image; the published RPN model is used to detect the object nodes in each image, and attribute nodes are added to the object nodes by automatic sampling; a relationship node only needs to determine whether a relationship exists between two objects, so a simple classification network is adopted to judge whether a relationship node (i.e., an edge) exists between two objects; finally, the ASG of a sample image is denoted G^{ks}=(N^{ks},E^{ks}), ks=1,2,...,N_s, where N_s is the number of samples in the data set, and the node set of the ASG is N^{ks}=[node_1^{ks},node_2^{ks},...,node_{Ne}^{ks}], node_k^{ks} ∈ {o,a,r}, k=1,2,...,Ne, where Ne is the number of nodes; for convenience of description and calculation, the node number of each sample image is set to a fixed value, with a preferred range of Ne ∈ [10,20]; if the number of actually extracted nodes is larger than Ne, unrelated isolated nodes are eliminated or the number of attribute nodes is limited, and if it is smaller than Ne, the corresponding nodes are set to 0; the edge set of the ASG is E^{ks}=[e_{i,j}]_{Ne×Ne}, e_{i,j} ∈ {0,1}, i.e., 1 if two nodes are associated and 0 if not;
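For illustration only, the following minimal Python sketch shows one way the abstract scene graph of step 1.2 could be held in memory, with the three node types, the fixed node budget Ne and the 0/1 adjacency matrix described above; the class and field names are assumptions of this sketch, not part of the disclosed method.

# Hypothetical container for the ASG described in step 1.2; names are illustrative.
from dataclasses import dataclass, field
from enum import Enum
import numpy as np

class NodeType(Enum):
    OBJECT = "o"
    ATTRIBUTE = "a"
    RELATION = "r"

@dataclass
class AbstractSceneGraph:
    ne: int = 10                                    # fixed node budget Ne in [10, 20]
    node_types: list = field(default_factory=list)  # one NodeType per real node
    edges: np.ndarray = None                        # Ne x Ne matrix, e[i, j] in {0, 1}

    def __post_init__(self):
        self.edges = np.zeros((self.ne, self.ne), dtype=np.int64)

    def add_edge(self, i: int, j: int) -> None:
        # An edge marks that nodes i and j are associated (e.g. object <-> attribute).
        self.edges[i, j] = 1

    def pad(self) -> None:
        # Nodes beyond the ones actually extracted stay as zero entries,
        # mirroring "set the corresponding nodes to 0" when fewer than Ne are found.
        while len(self.node_types) < self.ne:
            self.node_types.append(None)

# Example for "a bundle of pink flowers is placed on a wooden table":
asg = AbstractSceneGraph(ne=10)
asg.node_types = [NodeType.OBJECT, NodeType.ATTRIBUTE, NodeType.ATTRIBUTE,  # flowers, pink, a bundle of
                  NodeType.OBJECT, NodeType.ATTRIBUTE,                      # table, wooden
                  NodeType.RELATION]                                        # "placed on"
asg.add_edge(1, 0); asg.add_edge(2, 0)   # attributes -> flowers
asg.add_edge(4, 3)                       # wooden -> table
asg.add_edge(0, 5); asg.add_edge(5, 3)   # flowers -> relation -> table
asg.pad()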
step 2, extracting the characteristics of the data set,
2.1) establishing a semantic dictionary describing the text,
selecting the first m words with the highest frequency of occurrence among all words in the labels of the sample images to form a semantic concept set; the total number m of words is selected according to the text description field and the required description accuracy; this step is based on the MSCOCO data set and general requirements, and the preferred value range of m is [10000,12000]; an integer serial number is assigned to each word, and three special bits are appended, namely a start marker bit, an end marker bit and a low-frequency word bit, so that m + 3 integer serial numbers form the dictionary;
for the ks-th sample image in the training set, the text description is denoted Y^{ks} = [y_1^{ks}, y_2^{ks}, ..., y_{L_{ks}}^{ks}], where L_{ks} is the length of the text description of the ks-th image and y_k^{ks}, k=1,2,...,L_{ks}, is the integer serial number of the kth word in the text semantic dictionary; the established dictionary is used to label the data set sample images with this semantic dictionary form;
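As a concrete illustration of step 2.1, the sketch below builds such a dictionary from tokenized MSCOCO captions; the special-token names <start>, <end> and <unk> and the function names are assumptions made for this example.

# Minimal sketch of the semantic dictionary of step 2.1 (token names are assumptions).
from collections import Counter

def build_dictionary(captions, m=10000):
    """captions: list of tokenized reference sentences taken from the MSCOCO labels."""
    freq = Counter(w for caption in captions for w in caption)
    words = [w for w, _ in freq.most_common(m)]           # m most frequent words
    vocab = {w: i for i, w in enumerate(words)}
    vocab["<start>"] = m                                   # start marker bit
    vocab["<end>"] = m + 1                                 # end marker bit
    vocab["<unk>"] = m + 2                                 # low-frequency word bit
    return vocab

def encode_caption(caption, vocab):
    # Y^{ks}: serial numbers of the words, wrapped by the start/end markers.
    ids = [vocab["<start>"]]
    ids += [vocab.get(w, vocab["<unk>"]) for w in caption]
    ids.append(vocab["<end>"])
    return ids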
2.2) extracting the image characteristics,
extracting the global feature of each sample image with the ResNet network (a published technique): the M1-dimensional feature vector output by the last average pooling layer of the convolutional network ResNet is taken as the global feature g^{ks} of the image, preferably M1 = 2048;
2.3) extracting abstract scene graph characteristics,
according to the ASG node set N^{ks}=[node_1^{ks},node_2^{ks},...,node_{Ne}^{ks}], node_k^{ks} ∈ {o,a,r}, k=1,2,...,Ne, ks=1,2,...,N_s obtained in step 1.2), a Faster-RCNN network (a published technique) is adopted, the fully connected fc7 layer of the Faster-RCNN is taken as the image region feature, and for convenience of calculation the region feature is taken as an M1-dimensional feature vector;
the region features of all the extracted nodes are expressed as X^{ks} = [x_1^{ks}, x_2^{ks}, ..., x_{Ne}^{ks}],
wherein for node_k^{ks} = o the target node is characterized by the feature extracted on the corresponding region; for node_k^{ks} = a the attribute node has the same feature as the region feature of the object node it is connected to; for node_k^{ks} = r the feature is extracted from the union region of the two associated targets involved;
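A simplified PyTorch sketch of steps 2.2 and 2.3 follows; the patent extracts region features from the fc7 layer of Faster-RCNN, whereas this sketch approximates them with RoIAlign over the ResNet-101 feature map, so the exact layers and the roi_align stand-in are assumptions for illustration.

# Simplified sketch of step 2.2/2.3: ResNet global feature plus per-node region
# features. RoIAlign on the ResNet feature map stands in for Faster-RCNN fc7 here.
import torch
import torchvision
from torchvision.ops import roi_align

backbone = torchvision.models.resnet101(weights=None)   # pretrained weights omitted
backbone.eval()

def extract_features(image, node_boxes):
    """image: [1, 3, H, W]; node_boxes: [Ne, 4] boxes (x1, y1, x2, y2), one per ASG node."""
    with torch.no_grad():
        x = backbone.conv1(image)
        x = backbone.bn1(x); x = backbone.relu(x); x = backbone.maxpool(x)
        x = backbone.layer1(x); x = backbone.layer2(x)
        x = backbone.layer3(x); x = backbone.layer4(x)          # [1, 2048, H/32, W/32]
        g_ks = backbone.avgpool(x).flatten(1)                   # global feature, M1 = 2048
        rois = roi_align(x, [node_boxes], output_size=(1, 1),
                         spatial_scale=1.0 / 32)                # one vector per node box
        x_ks = rois.flatten(1)                                  # [Ne, 2048] region features
    return g_ks, x_ks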
step 3, constructing an encoder for extracting key features,
3.1) establishing a network for extracting key features of the image,
the structure of the coding network for extracting the key features of the image is shown in fig. 2; the region features X^{ks} obtained in step 2.3) are taken as input, and the different region features are weighted through a multi-head attention mechanism so that the network focuses on the important parts to be described, giving the weighted region features \tilde{X}^{ks}; the association relations between the objects are then determined through the AoA module and layer normalization, giving the attention information features \hat{X}^{ks}; finally, these features are combined with the node attributes through the node embedding module to obtain the output Z^{ks};
3.2) constructing a multi-head attention module,
the region features X^{ks} obtained in step 2.3) generally contain a large amount of redundant information, and every region feature is treated as equally important; constructing a multi-head attention module maps the feature vectors into different subspaces so that the model can understand the features from different angles, which strengthens the coding of the region features, makes the obtained region features more accurate and makes the description focus more prominent; the module is described in detail below,
3.2a) the region features x_k^{ks} obtained in step 2.3) are passed through three different linear transformations to obtain the query vector Q_k^{ks}, the key vector K_k^{ks} and the value vector V_k^{ks} of the same dimensionality; the linear transformations are expressed as follows:

Q_k^{ks} = W^Q x_k^{ks},  K_k^{ks} = W^K x_k^{ks},  V_k^{ks} = W^V x_k^{ks}

wherein W^Q, W^K and W^V are different randomly initialized mapping matrices obtained by network training;
3.2b) the query vector Q_k^{ks}, the key vector K_k^{ks} and the value vector V_k^{ks} are each divided into n_1 (n_1 is an empirical value, preferably n_1 = 8) query sub-features Q_{k,i}^{ks}, key sub-features K_{k,i}^{ks} and value sub-features V_{k,i}^{ks} of dimension M_2 = M_1/n_1, i = 1,2,...,n_1; the similarity score between Q_{k,i}^{ks} and K_{j,i}^{ks} is computed as follows:

s_{k,j,i}^{ks} = f_sim( Q_{k,i}^{ks}, K_{j,i}^{ks} )

wherein f_sim is the function computing the similarity score, defined as the scaled dot product:

f_sim( Q_{k,i}^{ks}, K_{j,i}^{ks} ) = Q_{k,i}^{ks} (K_{j,i}^{ks})^T / sqrt(M_2)

the similarity scores s_{k,j,i}^{ks} are then passed through a softmax operation and used as weights in a weighted summation, giving the spatial attention sub-feature head_{k,i}^{ks}:

head_{k,i}^{ks} = Σ_j softmax_j( s_{k,j,i}^{ks} ) · V_{j,i}^{ks}

finally, the weighted features of the n_1 sub-spaces are fused to obtain the region feature containing the attention weights, \tilde{x}_k^{ks}:

\tilde{x}_k^{ks} = W^O [ head_{k,1}^{ks}; head_{k,2}^{ks}; ...; head_{k,n_1}^{ks} ]

wherein W^O is a linear mapping obtained through network training;
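The following PyTorch sketch mirrors the multi-head attention of step 3.2 over the Ne region features of one image (M1 = 2048, n1 = 8, scaled dot-product similarity); the module and variable names are this example's assumptions.

# Illustrative sketch of the multi-head attention of step 3.2; names are assumptions.
import torch
import torch.nn as nn

class RegionMultiHeadAttention(nn.Module):
    def __init__(self, m1=2048, n1=8):
        super().__init__()
        assert m1 % n1 == 0
        self.n1, self.m2 = n1, m1 // n1
        self.w_q = nn.Linear(m1, m1, bias=False)   # W^Q
        self.w_k = nn.Linear(m1, m1, bias=False)   # W^K
        self.w_v = nn.Linear(m1, m1, bias=False)   # W^V
        self.w_o = nn.Linear(m1, m1, bias=False)   # W^O, fuses the n1 heads

    def forward(self, x):
        # x: [Ne, M1] region features of one sample image.
        ne = x.size(0)
        q = self.w_q(x).view(ne, self.n1, self.m2).transpose(0, 1)  # [n1, Ne, M2]
        k = self.w_k(x).view(ne, self.n1, self.m2).transpose(0, 1)
        v = self.w_v(x).view(ne, self.n1, self.m2).transpose(0, 1)
        scores = q @ k.transpose(1, 2) / self.m2 ** 0.5             # similarity scores
        alpha = scores.softmax(dim=-1)                               # softmax weights
        heads = alpha @ v                                            # [n1, Ne, M2]
        heads = heads.transpose(0, 1).reshape(ne, -1)                # concatenate heads
        return self.w_o(heads), self.w_q(x)                          # x_tilde and Q for AoA

x = torch.randn(10, 2048)                  # Ne = 10 region features
attn = RegionMultiHeadAttention()
x_tilde, q_full = attn(x)                  # weighted region features and query vectors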
3.3) constructing the AoA module,
in order to accurately predict the semantic relationship between two objects in a sample image, the region features \tilde{x}_k^{ks} containing the attention weights are combined with the query vectors Q_k^{ks} in the AoA module to improve the accuracy of word prediction based on these features;
referring to fig. 3, the AoA module comprises two separate linear transformations, which respectively generate an information vector f_k^{ks} and an attention gate vector m_k^{ks}:

f_k^{ks} = W_f^q Q_k^{ks} + W_f^x \tilde{x}_k^{ks} + b_f
m_k^{ks} = σ( W_m^q Q_k^{ks} + W_m^x \tilde{x}_k^{ks} + b_m )

wherein W_f^q, W_f^x, W_m^q and W_m^x are two-dimensional linear transformation weights learned by the network, b_f and b_m are one-dimensional constant terms, and σ is the sigmoid activation function;
the information vector f_k^{ks} and the attention gate vector m_k^{ks} are then combined by a dot-product operation to obtain the attention information feature \hat{x}_k^{ks}, which expresses the dependency relations between the objects more appropriately:

\hat{x}_k^{ks} = f_k^{ks} ⊙ m_k^{ks}

wherein ⊙ denotes the element-wise (dot) product, which makes the dimensions of a feature with larger values larger and those with smaller values smaller, thereby enlarging the differences between the features;
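Below is a small sketch of the AoA gating of step 3.3 (two linear branches over the concatenated query and attended feature, a sigmoid gate, and an element-wise product); the layer names and the use of a single concatenated input rather than separate weight matrices are simplifying assumptions.

# Sketch of the AoA gate of step 3.3; an assumption-level re-implementation only.
import torch
import torch.nn as nn

class AoAGate(nn.Module):
    def __init__(self, m1=2048):
        super().__init__()
        self.info = nn.Linear(2 * m1, m1)   # produces the information vector f_k
        self.gate = nn.Linear(2 * m1, m1)   # produces the attention gate vector m_k

    def forward(self, q, x_tilde):
        # q: query vectors [Ne, M1]; x_tilde: attention-weighted region features [Ne, M1].
        z = torch.cat([q, x_tilde], dim=-1)
        f = self.info(z)
        m = torch.sigmoid(self.gate(z))
        return f * m                        # element-wise product -> attention information feature

aoa = AoAGate()
x_hat = aoa(torch.randn(10, 2048), torch.randn(10, 2048))   # [Ne, M1]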
3.4) constructing a node embedding module,
since the attention information feature \hat{x}_k^{ks} alone cannot embody the controllability of the image text description, different node embeddings are applied to the features representing the different node types, giving the node-attribute-aware feature Z_k^{ks}:

Z_k^{ks} = \hat{x}_k^{ks} ⊙ W_r[1],              node_k^{ks} = o
Z_k^{ks} = \hat{x}_k^{ks} ⊙ ( W_r[2] + pos_k ),  node_k^{ks} = a
Z_k^{ks} = \hat{x}_k^{ks} ⊙ W_r[3],              node_k^{ks} = r

wherein W_r is the 3 × M_1 node embedding matrix obtained by network learning, W_r[1], W_r[2] and W_r[3] denote its first, second and third rows, node_k^{ks} is the attribute (type) of the kth node, and pos_k is an M_1-dimensional position embedding vector that increases the weight W_r[2] when the node is an attribute node, so as to distinguish the order of the different attribute nodes connected to the same object;
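The sketch below illustrates one plausible reading of the node embedding of step 3.4: each node type selects a row of W_r, attribute nodes additionally receive a positional embedding encoding their order, and the result modulates the attention information feature; the exact element-wise combination rule is an assumption of this sketch.

# Sketch of the role/node embedding of step 3.4; the combination rule is assumed.
import torch
import torch.nn as nn

class NodeEmbedding(nn.Module):
    def __init__(self, m1=2048, max_attr=10):
        super().__init__()
        self.role = nn.Parameter(torch.ones(3, m1))   # W_r: rows for o, a, r
        self.pos = nn.Embedding(max_attr, m1)          # pos_k for attribute order

    def forward(self, x_hat, node_types, attr_order):
        # x_hat: [Ne, M1]; node_types: [Ne] in {0: object, 1: attribute, 2: relation};
        # attr_order: [Ne] order of an attribute among the attributes of its object.
        w = self.role[node_types]                       # pick W_r[1], W_r[2] or W_r[3]
        is_attr = (node_types == 1).unsqueeze(-1).float()
        w = w + is_attr * self.pos(attr_order)          # attribute nodes also get pos_k
        return x_hat * w                                # node-attribute-aware Z_k

embed = NodeEmbedding()
z = embed(torch.randn(10, 2048),
          torch.tensor([0, 1, 1, 0, 1, 2, 0, 0, 0, 0]),
          torch.tensor([0, 0, 1, 0, 0, 0, 0, 0, 0, 0]))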
step 4, constructing a coding network that enhances the relationships between the image coding features,
according to the feature coding network shown in fig. 2, the region features Z_k^{ks} obtained in step 3.4) are taken as input, and adjacent node features are combined through graph convolution (GCN), so that the coding network can process feature information with structure and focus on semantically connected node features, giving a better image text description result;
4.1) constructing a bidirectional abstract scene graph,
since the influence of an edge between two connected nodes in the ASG is mutual, and since the node types differ, the way a message is passed from one type of node to another differs from the way it is passed in the reverse direction; therefore, the original directed one-way edges in the abstract scene graph ASG need to be expanded into bidirectional edges with different meanings, that is, G^{ks}=(N^{ks},E^{ks}), ks=1,2,...,N_s is changed into the multi-relation scene graph G'^{ks}=(N^{ks},E^{ks},R^{ks}), where R^{ks} comprises 6 interaction relations between the nodes: object-to-attribute oa, attribute-to-object ao, subject-to-relationship or, relationship-to-subject ro, object-to-relationship sr, and relationship-to-object rs;
4.2) carrying out graph convolution,
through the graph convolution operation, the node features Z_k^{ks} and the relations between them are encoded to obtain the final region features v_k^{ks}; one graph convolution layer is expressed as follows:

v_k^{ks,(l+1)} = σ( W_0^{(l)} v_k^{ks,(l)} + Σ_{s∈R^{ks}} Σ_{j∈N_s(k)} (1/|N_s(k)|) W_s^{(l)} v_j^{ks,(l)} ),  with v_k^{ks,(0)} = Z_k^{ks}

wherein N_s(k) denotes the neighbour nodes of node k under the relation s, σ is the ReLU activation function, W_0^{(l)} is the self-connection parameter of the l-th layer, and W_s^{(l)} is the parameter of the relation s in the l-th layer learned by the network;
using the GCN once brings feature information from the neighbouring nodes to each node, while stacking it several times obtains a wider context; preferably l ∈ [2,4]; finally, the output of the l-th layer is taken as the 10 region features of 512 dimensions output by the encoding stage, v^{ks} = [v_1^{ks}, v_2^{ks}, ..., v_{Ne}^{ks}]; the region features are then averaged to obtain the global coding feature \bar{v}^{ks}:

\bar{v}^{ks} = (1/Ne) Σ_{k=1}^{Ne} v_k^{ks}
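As an illustration of the graph convolution in step 4.2, the following sketch implements an R-GCN style layer with one learned weight per edge type plus a self-connection; the six-relation adjacency tensor layout, the dimensions and the layer names are assumptions.

# Sketch of a multi-relation graph convolution layer in the spirit of step 4.2.
import torch
import torch.nn as nn

class MultiRelationGCNLayer(nn.Module):
    def __init__(self, in_dim=2048, out_dim=512, num_relations=6):
        super().__init__()
        self.self_loop = nn.Linear(in_dim, out_dim)
        self.rel = nn.ModuleList(nn.Linear(in_dim, out_dim, bias=False)
                                 for _ in range(num_relations))

    def forward(self, z, adj):
        # z: [Ne, in_dim] node features; adj: [num_relations, Ne, Ne] with
        # adj[s, k, j] = 1 when j is a neighbour of k under edge type s.
        out = self.self_loop(z)
        for s, w_s in enumerate(self.rel):
            deg = adj[s].sum(dim=1, keepdim=True).clamp(min=1)   # |N_s(k)|
            out = out + (adj[s] / deg) @ w_s(z)                  # mean over neighbours
        return torch.relu(out)

layer = MultiRelationGCNLayer()
z = torch.randn(10, 2048)
adj = torch.zeros(6, 10, 10)
v = layer(z, adj)                        # [10, 512] region features
v_bar = v.mean(dim=0)                    # global coding feature (average over nodes)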
4.3) carrying out feature fusion,
the global coding feature \bar{v}^{ks} and the global feature g^{ks} obtained in step 2.2) are fused to obtain the global feature \bar{g}^{ks} output by the encoding stage;
step 5, inputting the coded features into a decoding network to output a sentence,
as shown in fig. 4, the decoding network model mainly consists of a double-layer LSTM network; the global coding feature \bar{g}^{ks} obtained in step 4 is taken as input, the weights of the nodes that need attention at each decoding step are calculated through Graph Attention (a published technique), the current word is predicted from the output of the double-layer LSTM network, and the output of the current LSTM is then fed back to update the global coding feature and recalculate the node weights so as to generate the next word, and so on; the specific process is as follows,
5.1) constructing a double-layer LSTM network model, which consists of a Top-Down Attention LSTM and a Language LSTM (a published technique); at time t, the Attention LSTM takes the global feature \bar{g}^{ks}, the output h_{t-1}^l of the Language LSTM at time t-1 and the word feature W_{t-1} of the previously generated word as input, and produces the output h_t^a of the Attention LSTM at time t:

h_t^a = LSTM_att( [\bar{g}^{ks}; h_{t-1}^l; W_{t-1}], h_{t-1}^a; θ_a )

wherein θ_a is a network parameter;
the output h_t^a of the Attention LSTM and the weighted region feature \hat{v}_t^{ks} are then taken as the input of the Language LSTM to generate the word prediction result w_t^{ks} at time t:

\hat{v}_t^{ks} = Σ_{k=1}^{Ne} α_{t,k} v_k^{ks},  with the weights α_{t,k} obtained through Graph Attention from W_ksc v_k^{ks} and W_hc h_t^a
w_t^{ks} = W_c · LSTM_lang( [h_t^a; \hat{v}_t^{ks}], h_{t-1}^l; θ_l )

wherein W_ksc, W_hc, W_c and θ_l are all parameters obtained by network training, and \hat{v}_t^{ks} is the updated region feature at time t;
the prediction result is then processed by the softmax function to obtain the word prediction probability matrix, and this probability matrix is fed back to update the feature \hat{v}_t^{ks} and carry out the next word prediction;
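The following sketch shows a double-layer (top-down attention + language) LSTM decoder in the spirit of step 5; the attention scorer, the hidden sizes and the vocabulary size (m + 3 = 10003) are assumptions of this example rather than the patent's exact design.

# Sketch of the two-layer LSTM decoder of step 5; names and sizes are assumptions.
import torch
import torch.nn as nn

class TwoLayerLSTMDecoder(nn.Module):
    def __init__(self, feat_dim=512, embed_dim=512, hidden=512, vocab=10003):
        super().__init__()
        self.embed = nn.Embedding(vocab, embed_dim)
        self.att_lstm = nn.LSTMCell(feat_dim + hidden + embed_dim, hidden)
        self.att_score = nn.Linear(feat_dim + hidden, 1)       # node-attention scorer
        self.lang_lstm = nn.LSTMCell(hidden + feat_dim, hidden)
        self.out = nn.Linear(hidden, vocab)                    # W_c

    def step(self, g_bar, v, prev_word, state):
        # g_bar: [B, D] global feature; v: [B, Ne, D] region features from the encoder.
        (h_a, c_a), (h_l, c_l) = state
        h_a, c_a = self.att_lstm(
            torch.cat([g_bar, h_l, self.embed(prev_word)], dim=-1), (h_a, c_a))
        h_rep = h_a.unsqueeze(1).expand(-1, v.size(1), -1)      # [B, Ne, H]
        alpha = self.att_score(torch.cat([v, h_rep], dim=-1)).softmax(dim=1)
        v_hat = (alpha * v).sum(dim=1)                          # attended region feature
        h_l, c_l = self.lang_lstm(torch.cat([h_a, v_hat], dim=-1), (h_l, c_l))
        logits = self.out(h_l)                                  # word prediction scores
        return logits, ((h_a, c_a), (h_l, c_l))

decoder = TwoLayerLSTMDecoder()
B, Ne = 2, 10
state = ((torch.zeros(B, 512), torch.zeros(B, 512)),
         (torch.zeros(B, 512), torch.zeros(B, 512)))
logits, state = decoder.step(torch.randn(B, 512), torch.randn(B, Ne, 512),
                             torch.full((B,), 10000, dtype=torch.long),  # assumed <start> id
                             state)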
step 6, constructing a training network according to the steps,
training the constructed coding network and decoding network through the steps 1-5, obtaining image characteristics with controllable conditions by using the coding network, and inputting the image characteristics into the decoding network to complete the description of the image text;
the network is trained using the standard cross-entropy loss; for the text description of image ks under the control condition G^{ks}, the loss L_{ks} is expressed as follows:

L_{ks} = − Σ_{t=1}^{L_{ks}} log p( y_t^{ks} | y_1^{ks},...,y_{t-1}^{ks}, G^{ks}, I^{ks} )

wherein y_t^{ks} is the t-th word of the reference description and I^{ks} is the input image;
the specific parameters set in the training process are: the Batch size is preferably 128, the number of iterations Epoch is preferably 50, and the initial Learning rate is preferably 0.0002; a complete constructed model is thus obtained, and a controllable image text description can be generated from an image and a designated ASG.
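Finally, a hypothetical training-loop fragment for step 6 is given below, using the stated hyper-parameters (cross-entropy loss, Batch size 128, Epoch 50, initial learning rate 0.0002); the encoder and decoder objects, the data loader fields and the padding index are placeholders, not the patent's code.

# Hypothetical training loop for step 6; encoder/decoder/loader are placeholders.
import torch
import torch.nn as nn

def train(encoder, decoder, loader, epochs=50, lr=2e-4, device="cpu"):
    # loader is assumed to be a torch.utils.data.DataLoader built with batch_size=128.
    params = list(encoder.parameters()) + list(decoder.parameters())
    optimizer = torch.optim.Adam(params, lr=lr)
    criterion = nn.CrossEntropyLoss(ignore_index=0)        # index 0 assumed to be padding
    for epoch in range(epochs):
        for images, asgs, captions in loader:               # captions: [B, L] word ids
            feats = encoder(images.to(device), asgs)        # condition-controllable features
            logits = decoder(feats, captions[:, :-1].to(device))   # teacher forcing
            loss = criterion(logits.reshape(-1, logits.size(-1)),
                             captions[:, 1:].reshape(-1).to(device))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()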

Claims (7)

1. A network construction method for improving the correctness of controllable image text description is characterized by comprising the following steps:
step 1, constructing a model data set;
step 2, extracting the characteristics of the data set;
step 3, constructing an encoder for extracting key features;
step 4, constructing a coding network that enhances the relationships between the image coding features;
step 5, inputting the coded features into a decoding network to output a sentence,
taking the global coding features obtained in step 4 as input, calculating the weight of the node that needs attention at each decoding step through Graph Attention, predicting the current word from the output of a double-layer LSTM network, and then feeding the output of the current LSTM back to update the global coding features and recalculate the node weights, and so on;
step 6, constructing a training network according to the above steps,
training the constructed coding network and decoding network through steps 1 to 5, obtaining condition-controllable image features with the coding network, and inputting the image features into the decoding network to complete the image text description.
2. The network construction method for improving the correctness of the text description of the controllable image according to claim 1, characterized in that: in the step 1, the specific process is,
1.1) establishing a training set of data samples and a validation set of image data,
selecting MSCOCO data set image samples and labels as data samples of a feature extraction network, randomly selecting 90% of the MSCOCO data set as a training set, and using the rest partial image samples as samples of a verification set, wherein the samples are collectively referred to as sample images;
1.2) establishing a data set of control conditions,
generating a respective abstract scene graph for each sample image by adopting an ASG generator; detecting the object nodes in each image with an RPN model, and adding attribute nodes to the object nodes through automatic sampling; for the relationship nodes, a simple classification network is adopted to judge whether a relationship node exists between two objects; finally, the ASG of a sample image is denoted G^{ks}=(N^{ks},E^{ks}), ks=1,2,...,N_s, where N_s is the number of samples in the data set, and the node set of the ASG is N^{ks}=[node_1^{ks},node_2^{ks},...,node_{Ne}^{ks}], node_k^{ks} ∈ {o,a,r}, k=1,2,...,Ne, where Ne is the number of nodes; the node number of each sample image is set to a fixed value; if the number of actually extracted nodes is larger than Ne, unrelated isolated nodes are eliminated or the number of attribute nodes is limited, and if it is smaller than Ne, the corresponding nodes are set to 0; the edge set of the ASG is E^{ks}=[e_{i,j}]_{Ne×Ne}, e_{i,j} ∈ {0,1}, i.e., 1 if two nodes are associated and 0 if not.
3. The network construction method for improving the correctness of the text description of the controllable image according to claim 2, characterized in that: in the step 2, the specific process is,
2.1) establishing a semantic dictionary describing the text,
selecting the first m words with the highest frequency of occurrence among all words in the labels of the sample images to form a semantic concept set; the total number m of words is selected according to the text description field and the required description accuracy, and based on the MSCOCO data set and general requirements the value range of m is [10000,12000]; an integer serial number is assigned to each word, and three special bits are appended, namely a start marker bit, an end marker bit and a low-frequency word bit, so that m + 3 integer serial numbers form the dictionary;
for the ks-th sample image in the training set, the text description is denoted Y^{ks} = [y_1^{ks}, y_2^{ks}, ..., y_{L_{ks}}^{ks}], where L_{ks} is the length of the text description of the ks-th image and y_k^{ks}, k=1,2,...,L_{ks}, is the integer serial number of the kth word in the text semantic dictionary; the established dictionary is used to label the data set sample images with this semantic dictionary form;
2.2) extracting the image characteristics,
extracting the global feature of each sample image with the ResNet network: the M1-dimensional feature vector output by the last average pooling layer of the convolutional network ResNet is taken as the global feature g^{ks} of the image, preferably M1 = 2048;
2.3) extracting abstract scene graph characteristics,
according to the ASG node set N^{ks}=[node_1^{ks},node_2^{ks},...,node_{Ne}^{ks}], node_k^{ks} ∈ {o,a,r}, k=1,2,...,Ne, ks=1,2,...,N_s obtained in step 1.2), a Faster-RCNN network is adopted, the fully connected fc7 layer of the Faster-RCNN is taken as the image region feature, and for convenience of calculation the region feature is taken as an M1-dimensional feature vector;
the region features of all the extracted nodes are expressed as X^{ks} = [x_1^{ks}, x_2^{ks}, ..., x_{Ne}^{ks}],
wherein for node_k^{ks} = o the target node is characterized by the feature extracted on the corresponding region; for node_k^{ks} = a the attribute node has the same feature as the region feature of the object node it is connected to; for node_k^{ks} = r the feature is extracted from the union region of the two associated targets involved.
4. The network construction method for improving the correctness of the text description of the controllable image according to claim 3, characterized in that: in the step 3, the specific process is,
3.1) establishing a network for extracting key features of the image,
the region features X^{ks} obtained in step 2.3) are taken as input, and the different region features are weighted through a multi-head attention mechanism so that the network focuses on the important parts to be described, giving the weighted region features \tilde{X}^{ks}; the association relations between the objects are then determined through the AoA module and layer normalization, giving the attention information features \hat{X}^{ks}; finally, these features are combined with the node attributes through the node embedding module to obtain the output Z^{ks};
3.2) constructing a multi-head attention module,
3.2a) the region features x_k^{ks} obtained in step 2.3) are passed through three different linear transformations to obtain the query vector Q_k^{ks}, the key vector K_k^{ks} and the value vector V_k^{ks} of the same dimensionality; the linear transformations are expressed as follows:

Q_k^{ks} = W^Q x_k^{ks},  K_k^{ks} = W^K x_k^{ks},  V_k^{ks} = W^V x_k^{ks}

wherein W^Q, W^K and W^V are different randomly initialized mapping matrices obtained by network training;
3.2b) the query vector Q_k^{ks}, the key vector K_k^{ks} and the value vector V_k^{ks} are each divided into n_1 query sub-features Q_{k,i}^{ks}, key sub-features K_{k,i}^{ks} and value sub-features V_{k,i}^{ks} of dimension M_2 = M_1/n_1, i = 1,2,...,n_1; the similarity score between Q_{k,i}^{ks} and K_{j,i}^{ks} is computed as follows:

s_{k,j,i}^{ks} = f_sim( Q_{k,i}^{ks}, K_{j,i}^{ks} )

wherein f_sim is the function computing the similarity score, defined as the scaled dot product:

f_sim( Q_{k,i}^{ks}, K_{j,i}^{ks} ) = Q_{k,i}^{ks} (K_{j,i}^{ks})^T / sqrt(M_2)

the similarity scores s_{k,j,i}^{ks} are then passed through a softmax operation and used as weights in a weighted summation, giving the spatial attention sub-feature head_{k,i}^{ks}:

head_{k,i}^{ks} = Σ_j softmax_j( s_{k,j,i}^{ks} ) · V_{j,i}^{ks}

finally, the weighted features of the n_1 sub-spaces are fused to obtain the region feature containing the attention weights, \tilde{x}_k^{ks}:

\tilde{x}_k^{ks} = W^O [ head_{k,1}^{ks}; head_{k,2}^{ks}; ...; head_{k,n_1}^{ks} ]

wherein W^O is a linear mapping obtained through network training;
3.3) constructing the AoA module,
the AoA module comprises two separate linear transformations, which respectively generate an information vector f_k^{ks} and an attention gate vector m_k^{ks}:

f_k^{ks} = W_f^q Q_k^{ks} + W_f^x \tilde{x}_k^{ks} + b_f
m_k^{ks} = σ( W_m^q Q_k^{ks} + W_m^x \tilde{x}_k^{ks} + b_m )

wherein W_f^q, W_f^x, W_m^q and W_m^x are two-dimensional linear transformation weights learned by the network, b_f and b_m are one-dimensional constant terms, and σ is the sigmoid activation function;
the information vector f_k^{ks} and the attention gate vector m_k^{ks} are then combined by a dot-product operation to obtain the attention information feature \hat{x}_k^{ks}, which expresses the dependency relations between the objects more appropriately:

\hat{x}_k^{ks} = f_k^{ks} ⊙ m_k^{ks}

wherein ⊙ denotes the element-wise (dot) product, which makes the dimensions of a feature with larger values larger and those with smaller values smaller, thereby enlarging the differences between the features;
3.4) constructing a node embedding module,
different node embeddings are applied to the features representing the different node types to obtain the node-attribute-aware feature Z_k^{ks}:

Z_k^{ks} = \hat{x}_k^{ks} ⊙ W_r[1],              node_k^{ks} = o
Z_k^{ks} = \hat{x}_k^{ks} ⊙ ( W_r[2] + pos_k ),  node_k^{ks} = a
Z_k^{ks} = \hat{x}_k^{ks} ⊙ W_r[3],              node_k^{ks} = r

wherein W_r is the 3 × M_1 node embedding matrix obtained by network learning, W_r[1], W_r[2] and W_r[3] denote its first, second and third rows, node_k^{ks} is the attribute (type) of the kth node, and pos_k is an M_1-dimensional position embedding vector that increases the weight W_r[2] when the node is an attribute node, so as to distinguish the order of the different attribute nodes connected to the same object.
5. the network construction method for improving the correctness of the text description of the controllable image according to claim 4, characterized in that: in the step 4, the specific process is,
4.1) constructing a bidirectional abstract scene graph,
the original directed one-way edges in the abstract scene graph ASG are expanded into bidirectional edges with different meanings, that is, G^{ks}=(N^{ks},E^{ks}), ks=1,2,...,N_s is changed into the multi-relation scene graph G'^{ks}=(N^{ks},E^{ks},R^{ks}), where R^{ks} comprises 6 interaction relations between the nodes: object-to-attribute oa, attribute-to-object ao, subject-to-relationship or, relationship-to-subject ro, object-to-relationship sr, and relationship-to-object rs;
4.2) carrying out graph convolution,
through the graph convolution operation, the node features Z_k^{ks} and the relations between them are encoded to obtain the final region features v_k^{ks}; one graph convolution layer is expressed as follows:

v_k^{ks,(l+1)} = σ( W_0^{(l)} v_k^{ks,(l)} + Σ_{s∈R^{ks}} Σ_{j∈N_s(k)} (1/|N_s(k)|) W_s^{(l)} v_j^{ks,(l)} ),  with v_k^{ks,(0)} = Z_k^{ks}

wherein N_s(k) denotes the neighbour nodes of node k under the relation s, σ is the ReLU activation function, W_0^{(l)} is the self-connection parameter of the l-th layer, and W_s^{(l)} is the parameter of the relation s in the l-th layer learned by the network;
using the GCN once brings feature information from the neighbouring nodes to each node, while stacking it several times obtains a wider context; the output of the last l-th layer is taken as the 10 region features of 512 dimensions output by the encoding stage, v^{ks} = [v_1^{ks}, v_2^{ks}, ..., v_{Ne}^{ks}]; the region features are then averaged to obtain the global coding feature \bar{v}^{ks}:

\bar{v}^{ks} = (1/Ne) Σ_{k=1}^{Ne} v_k^{ks}
4.3) carrying out feature fusion,
the global coding feature \bar{v}^{ks} and the global feature g^{ks} obtained in step 2.2) are fused to obtain the global feature \bar{g}^{ks} output by the encoding stage.
6. the network construction method for improving the correctness of the text description of the controllable image according to claim 5, characterized in that: in the step 5, the specific process is,
constructing a double-layer LSTM network model, which consists of a Top-Down Attention LSTM and a Language LSTM; at time t, the Attention LSTM takes the global feature \bar{g}^{ks}, the output h_{t-1}^l of the Language LSTM at time t-1 and the word feature W_{t-1} of the previously generated word as input, and produces the output h_t^a of the Attention LSTM at time t:

h_t^a = LSTM_att( [\bar{g}^{ks}; h_{t-1}^l; W_{t-1}], h_{t-1}^a; θ_a )

wherein θ_a is a network parameter;
the output h_t^a of the Attention LSTM and the weighted region feature \hat{v}_t^{ks} are then taken as the input of the Language LSTM to generate the word prediction result w_t^{ks} at time t:

\hat{v}_t^{ks} = Σ_{k=1}^{Ne} α_{t,k} v_k^{ks},  with the weights α_{t,k} obtained through Graph Attention from W_ksc v_k^{ks} and W_hc h_t^a
w_t^{ks} = W_c · LSTM_lang( [h_t^a; \hat{v}_t^{ks}], h_{t-1}^l; θ_l )

wherein W_ksc, W_hc, W_c and θ_l are all parameters obtained by network training, and \hat{v}_t^{ks} is the updated region feature at time t;
the prediction result is then processed by the softmax function to obtain the word prediction probability matrix, and this probability matrix is fed back to update the feature \hat{v}_t^{ks} and carry out the next word prediction.
7. The network construction method for improving the correctness of the text description of the controllable image according to claim 6, characterized in that: in the step 6, the specific process is,
the network is trained using the standard cross-entropy loss; for the text description of image ks under the control condition G^{ks}, the loss L_{ks} is expressed as follows:

L_{ks} = − Σ_{t=1}^{L_{ks}} log p( y_t^{ks} | y_1^{ks},...,y_{t-1}^{ks}, G^{ks}, I^{ks} )

wherein y_t^{ks} is the t-th word of the reference description and I^{ks} is the input image;
the specific parameters set in the training process are: the Batch size is 128, the number of iterations Epoch is 50, and the initial Learning rate is 0.0002; a complete constructed model is thus obtained, and a controllable image text description can be generated from an image and a designated ASG.
CN202110110377.2A 2021-01-27 2021-01-27 Network construction method for improving text description correctness of controllable image Active CN112733866B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110110377.2A CN112733866B (en) 2021-01-27 2021-01-27 Network construction method for improving text description correctness of controllable image

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110110377.2A CN112733866B (en) 2021-01-27 2021-01-27 Network construction method for improving text description correctness of controllable image

Publications (2)

Publication Number Publication Date
CN112733866A true CN112733866A (en) 2021-04-30
CN112733866B CN112733866B (en) 2023-09-26

Family

ID=75595341

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110110377.2A Active CN112733866B (en) 2021-01-27 2021-01-27 Network construction method for improving text description correctness of controllable image

Country Status (1)

Country Link
CN (1) CN112733866B (en)


Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10671878B1 (en) * 2019-01-11 2020-06-02 Capital One Services, Llc Systems and methods for text localization and recognition in an image of a document
CN110288665A (en) * 2019-05-13 2019-09-27 中国科学院西安光学精密机械研究所 Image Description Methods, computer readable storage medium based on convolutional neural networks, electronic equipment
US20200372225A1 (en) * 2019-05-22 2020-11-26 Royal Bank Of Canada System and method for controllable machine text generation architecture
US20200410054A1 (en) * 2019-06-27 2020-12-31 Conduent Business Services, Llc Neural network systems and methods for target identification from text
CN111488734A (en) * 2020-04-14 2020-08-04 西安交通大学 Emotional feature representation learning system and method based on global interaction and syntactic dependency
CN111985612A (en) * 2020-07-21 2020-11-24 西安理工大学 Encoder network model design method for improving video text description accuracy
CN112163416A (en) * 2020-10-09 2021-01-01 北京理工大学 Event joint extraction method for merging syntactic and entity relation graph convolution network

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
FEIRAN HUANG ET AL.: "Bi-Directional Spatial-Semantic Attention Networks for Image-Text Matching", IEEE TRANSACTIONS ON IMAGE PROCESSING *
YAO YI ET AL.: "Research on Structured Image Annotation Based on Deep Learning", Computer Knowledge and Technology (电脑知识与技术) *
HU CHAOJU ET AL.: "Aspect-Specific Sentiment Analysis Based on Deep-Attention LSTM", Application Research of Computers (计算机应用研究) *

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113343966A (en) * 2021-05-08 2021-09-03 武汉大学 Infrared and visible light image text description generation method
CN113343966B (en) * 2021-05-08 2022-04-29 武汉大学 Infrared and visible light image text description generation method
CN113283336A (en) * 2021-05-21 2021-08-20 湖南大学 Text recognition method and system
CN113487629A (en) * 2021-07-07 2021-10-08 电子科技大学 Image attribute editing method based on structured scene and text description
CN113487629B (en) * 2021-07-07 2023-04-07 电子科技大学 Image attribute editing method based on structured scene and text description
CN113449081A (en) * 2021-07-08 2021-09-28 平安国际智慧城市科技股份有限公司 Text feature extraction method and device, computer equipment and storage medium
CN113642630A (en) * 2021-08-10 2021-11-12 福州大学 Image description method and system based on dual-path characteristic encoder
CN113642630B (en) * 2021-08-10 2024-03-15 福州大学 Image description method and system based on double-path feature encoder
CN114021558B (en) * 2021-11-10 2022-05-10 北京航空航天大学杭州创新研究院 Intelligent evaluation method for consistency of graph and text meaning based on layering
CN114021558A (en) * 2021-11-10 2022-02-08 北京航空航天大学杭州创新研究院 Intelligent evaluation method for consistency of graph and text meaning based on layering
CN114399646B (en) * 2021-12-21 2022-09-20 北京中科明彦科技有限公司 Image description method and device based on transform structure
CN114399646A (en) * 2021-12-21 2022-04-26 北京中科明彦科技有限公司 Image description method and device based on Transformer structure
CN114625882A (en) * 2022-01-26 2022-06-14 西安理工大学 Network construction method for improving unique diversity of image text description
CN114625882B (en) * 2022-01-26 2024-04-16 西安理工大学 Network construction method for improving unique diversity of image text description
CN116504395A (en) * 2023-06-21 2023-07-28 广东省人民医院 Osteoporosis prediction method, system and storage medium based on artificial intelligence
CN116504395B (en) * 2023-06-21 2023-10-27 广东省人民医院 Osteoporosis prediction method, system and storage medium based on artificial intelligence

Also Published As

Publication number Publication date
CN112733866B (en) 2023-09-26

Similar Documents

Publication Publication Date Title
CN110298037B (en) Convolutional neural network matching text recognition method based on enhanced attention mechanism
CN110334705B (en) Language identification method of scene text image combining global and local information
CN112733866B (en) Network construction method for improving text description correctness of controllable image
CN110969020B (en) CNN and attention mechanism-based Chinese named entity identification method, system and medium
CN110263912B (en) Image question-answering method based on multi-target association depth reasoning
CN110674305B (en) Commodity information classification method based on deep feature fusion model
CN114298158A (en) Multi-mode pre-training method based on image-text linear combination
CN111966812B (en) Automatic question answering method based on dynamic word vector and storage medium
CN112949647B (en) Three-dimensional scene description method and device, electronic equipment and storage medium
CN113626589B (en) Multi-label text classification method based on mixed attention mechanism
CN110188827A (en) A kind of scene recognition method based on convolutional neural networks and recurrence autocoder model
CN115222998B (en) Image classification method
CN113240033B (en) Visual relation detection method and device based on scene graph high-order semantic structure
CN112434686B (en) End-to-end misplaced text classification identifier for OCR (optical character) pictures
Li et al. Multimodal fusion with co-attention mechanism
CN114048290A (en) Text classification method and device
CN111666375B (en) Text similarity matching method, electronic device and computer readable medium
CN116258504A (en) Bank customer relationship management system and method thereof
CN116662924A (en) Aspect-level multi-mode emotion analysis method based on dual-channel and attention mechanism
CN114881038A (en) Chinese entity and relation extraction method and device based on span and attention mechanism
CN115359486A (en) Method and system for determining custom information in document image
CN114417872A (en) Contract text named entity recognition method and system
CN113535928A (en) Service discovery method and system of long-term and short-term memory network based on attention mechanism
CN111259650A (en) Text automatic generation method based on class mark sequence generation type countermeasure model
CN114625882B (en) Network construction method for improving unique diversity of image text description

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right
TA01 Transfer of patent application right

Effective date of registration: 20230831

Address after: Room 1116, Building A, Xike Meiyu, No. 6 Lianhu 8th Road, Meixihu Street, Yuelu District, Changsha City, Hunan Province, 410000

Applicant after: Hunan Qianliyun Medical Technology Co.,Ltd.

Address before: 518000 1002, Building A, Zhiyun Industrial Park, No. 13, Huaxing Road, Henglang Community, Longhua District, Shenzhen, Guangdong Province

Applicant before: Shenzhen Wanzhida Technology Co.,Ltd.

Effective date of registration: 20230831

Address after: 518000 1002, Building A, Zhiyun Industrial Park, No. 13, Huaxing Road, Henglang Community, Longhua District, Shenzhen, Guangdong Province

Applicant after: Shenzhen Wanzhida Technology Co.,Ltd.

Address before: 710048 Shaanxi province Xi'an Beilin District Jinhua Road No. 5

Applicant before: Xi'an University of Technology

GR01 Patent grant
GR01 Patent grant