CN114547235B - Construction method of image text matching model based on priori knowledge graph - Google Patents

Info

Publication number
CN114547235B
Authority
CN
China
Prior art keywords
text
image
layer
word
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210060418.6A
Other languages
Chinese (zh)
Other versions
CN114547235A (en)
Inventor
郭军
解煜晨
肖云
任鹏真
任哲
王淑文
董智强
许鹏飞
陈晓江
房鼎益
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
NORTHWEST UNIVERSITY
Original Assignee
NORTHWEST UNIVERSITY
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by NORTHWEST UNIVERSITY filed Critical NORTHWEST UNIVERSITY
Priority to CN202210060418.6A priority Critical patent/CN114547235B/en
Publication of CN114547235A publication Critical patent/CN114547235A/en
Application granted granted Critical
Publication of CN114547235B publication Critical patent/CN114547235B/en


Classifications

    • G06F40/211: Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G06F16/3344: Query execution using natural language analysis
    • G06F40/216: Parsing using statistical methods
    • G06F40/289: Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/30: Semantic analysis
    • G06N3/045: Combinations of networks
    • G06N3/08: Learning methods
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The invention relates to a method for constructing an image-text matching model based on a prior knowledge graph. The constructed model comprises a prior knowledge graph module, an image-text matching module, and an integration module; the prior knowledge graph module and the image-text matching module are each connected to the integration module. The method builds an external prior knowledge graph to guide image-text matching, greatly enhancing the model's understanding of real scenes. Graph convolution is used to build the relationships within the prior knowledge graph, and local attention relationships between image regions and text fragments are derived from it rather than computed pairwise with a cross-attention mechanism; this reduces computation and parameter counts and improves the model's training and inference speed. A Transformer self-attention mechanism aggregates the attention relationships between image regions; the pre-trained BERT model extracts text feature vectors, and an attention mechanism then aggregates the attention relationships between words in the text. The accuracy of image-text matching is effectively improved.

Description

Construction method of image text matching model based on priori knowledge graph
Technical Field
The invention relates to the field of computer vision and natural language processing, in particular to a method for constructing an image text matching model based on a priori knowledge graph.
Background
Vision and language are the two most important modalities of information about the outside world, and many popular applications need to combine them, for example man-machine interaction, advertisement recommendation systems, and search engines. Image-text matching is a key technique for these tasks; it aims to measure the semantic similarity between an image and a text. Concretely, given an image, a model must retrieve the most relevant text from a text database, or, given a text, retrieve the most relevant image from an image database.
In recent years, image-text matching methods based on deep learning have made great breakthroughs; they can be roughly divided into one-to-one global matching methods and many-to-many fine-grained matching methods. One-to-one global matching methods generally extract global feature representations of the image and the sentence, embed them into a joint space, and measure their similarity by the feature distance in that space; however, a single global feature cannot capture the fine-grained correspondences between image regions and words.
To address this, many-to-many fine-grained matching methods were proposed. Most of them adopt a stacked cross-attention mechanism to find all alignment relationships between salient local regions of a picture and the words of a sentence. This achieves good performance, but exhaustively computing the similarity between all possible image-region and text-fragment pairs consumes enormous computing power and slows the model's inference.
In recent years, external prior knowledge has been applied to many mainstream deep learning tasks. Using prior knowledge effectively can accelerate a model's inference, and building an effective prior knowledge relation graph can provide the model with rich prior scene information, expand its semantic concepts, and enhance its generalization ability. However, external prior knowledge has not been well exploited in the field of image-text matching. At the same time, semantic relationships on the text side can be better captured with the pre-trained BERT model and 1D-CNNs.
Disclosure of Invention
Aiming at the defects of existing image-text matching methods, the invention provides a method for constructing an image-text matching model based on a prior knowledge graph; the model constructed by this method improves both the accuracy and the inference speed of image-text matching.
In order to achieve the above task, the present invention adopts the following technical solutions:
the method for constructing the image text matching model based on the prior knowledge graph is characterized in that the constructed image text matching model based on the prior knowledge graph comprises a prior knowledge graph module, an image text matching module and an integration module; the prior knowledge graph module and the image text matching module are respectively connected with the integration module, and the specific construction steps are as follows:
step 1, constructing a priori knowledge graph module:
extracting meaningful words from a text corpus by a statistical method, performing a word-embedding operation on the extracted words with GloVe, and representing the words as word feature vectors, which are called the prior knowledge; constructing a relation graph of the prior knowledge according to the co-occurrence statistics of words in the corpus; and learning the interdependencies between items of prior knowledge using graph convolution;
step 2, constructing an image text matching module:
after the image data and the text data are given, obtaining image feature vectors with a pre-trained Faster R-CNN model and text feature vectors with a pre-trained BERT model; performing intra-modality context aggregation on the image feature vectors with a self-attention mechanism to obtain the first-layer image features; and performing intra-modality context aggregation on the text feature vectors with a self-attention mechanism to obtain the first-layer text features;
step 3, construction of an integration module:
the prior knowledge learned by the graph convolution is utilized to guide the first-layer image features and the first-layer text features, and the second-layer image features and the second-layer text features guided by the prior knowledge graph are output; the second layer image features and the first layer image features are weighted and combined to obtain third layer image features of the integration module; the second layer text feature and the first layer text feature are weighted and combined to obtain a third layer text feature of the integration module;
step 4, constructing a loss function by utilizing the first layer image text characteristics and the third layer image text characteristics;
and 5, training and testing to obtain an image text matching model based on the priori knowledge graph.
Further, in step 1, the constructing the prior knowledge graph module further includes:
the extraction of words from the text corpus comprises: deleting rare words from the corpus and selecting words of three parts of speech: nouns, verbs, and adjectives; according to the statistical frequency of words in the corpus, the proportion of selected nouns, verbs, and adjectives is strictly limited to 7:2:1; a word-embedding operation is performed on the selected words with GloVe, representing each word as a word feature vector; these vectors are called the prior knowledge.
The construction of the prior knowledge relation graph comprises: modeling the relation graph in the form of a conditional probability matrix, with the specific formula:

$$P_{ij} = \frac{W_{ij}}{W_i}$$

where $W_i$ is the number of occurrences of word $i$ in the corpus, $W_{ij}$ is the number of texts in which words $i$ and $j$ co-occur, and $P_{ij}$ is the probability of word $j$ co-occurring with word $i$;
the convolution includes: and taking the word feature vector obtained by the glove technology as a node, taking the constructed prior knowledge relation graph as a correlation matrix, inputting the correlation matrix into a graph convolution network for training, and finally obtaining the feature vector of the prior knowledge graph.
Specifically, in step 2, the construction of the image text matching module further includes:
the image data and text data feature extraction includes: extracting 36 salient regions of each input image by using a Faster-RCNN pre-training model, and representing each salient region as an image feature vector through a full connection layer; extracting feature vectors of each text by using a BERT pre-training model, wherein the text feature vectors output by the BERT aggregate word segmentation features, semantic features and position features;
the self-attention mechanism includes: aggregating the attention relation among the image areas by using a transducer model, obtaining three inputs Q, K, V of a transducer by using three different full-connection layers of specific image area-level feature vectors, and finally obtaining the first layer of image features after the transducer is aggregated; an implementation of the text self-attention mechanism is: the context information of the sentences is explored by using three one-dimensional convolution networks with different sizes, so that the information of phrases with different lengths in the sentences can be captured, and finally, the text characteristics of the first layer are obtained.
Preferably, the loss function constructed from the first-layer and third-layer image-text features is implemented as follows:
a triplet loss function is used, with the basic formula:

$$L(\Omega, T) = [\alpha - S(\Omega, T) + S(\Omega, \hat{T})]_+ + [\alpha - S(\Omega, T) + S(\hat{\Omega}, T)]_+$$

where $\alpha$ is a predefined margin parameter, $S$ is a similarity function between image-text pairs (e.g. cosine similarity), $S(\Omega, T)$ is the similarity score of a positive (matching) image-text pair, and $S(\Omega, \hat{T})$ and $S(\hat{\Omega}, T)$ are the similarity scores of negative pairs formed by replacing the text and the image, respectively;
in the experiments, only the negative pairs within a mini-batch are used rather than accumulating all negative samples, and the triplet loss is applied to both the first-layer and the third-layer image-text feature pairs; in addition, a relative-entropy term on the importance scores of the semantic concepts is added to further strengthen the image-text similarity measure, giving the final loss:

$$L_{final} = \lambda_1 L(\Omega_1, T_1) + \lambda_2 L(\Omega_3, T_3) + \lambda_3 L_{KL}$$

where $\lambda_1, \lambda_2, \lambda_3$ are weight parameters balancing the different losses.
Compared with the prior art, the method for constructing the image text matching model based on the priori knowledge graph has the following technical effects:
1. An external prior knowledge graph is constructed to guide image-text matching, which expands the model's semantic concepts, greatly enhances its understanding of real scenes, and gives it good generalization ability. Meanwhile, the relationships within the prior knowledge graph are built with graph convolution, so local attention relationships between image regions and text fragments need not be computed pair by pair with a cross-attention mechanism; this greatly reduces the model's computation and parameter counts and improves its training and inference speed.
2. The attention relationships between image regions are aggregated with a Transformer self-attention mechanism; text feature vectors are extracted with the pre-trained BERT model, and an attention mechanism then aggregates the attention relationships between words in the text. In this way, both the image and text modalities are well represented.
Drawings
FIG. 1 is a schematic flow diagram of a method for constructing an image text matching model based on a priori knowledge map;
FIG. 2 is a block diagram of a constructed prior knowledge graph-based image text matching model structure;
FIG. 3 is a schematic diagram of the Transformer self-attention mechanism;
the invention is described in further detail below with reference to the drawings and examples.
Detailed Description
Referring to fig. 1 and fig. 2, the present embodiment provides a method for constructing an image text matching model based on a priori knowledge map, where the constructed image text matching model based on the priori knowledge map includes a priori knowledge map module, an image text matching module and an integration module; the prior knowledge graph module and the image text matching module are respectively connected with the integration module, and the specific construction steps are as follows:
step 1, constructing a priori knowledge graph module:
the method utilizes a statistical method to extract meaningful words from a text corpus, and the specific extraction strategy is as follows: deleting rare words in a text corpus, selecting names, verbs and adjectives with high occurrence frequency as priori knowledge, strictly limiting the ratio of three parts of speech to 7:2:1 according to the statistical probability of the three parts of speech in the corpus, marking the selected words as word labels, and recording as W tag
A word-embedding operation is performed on the extracted words with GloVe, representing them as word feature vectors; these vectors form a matrix $K$, which is called the prior knowledge;
According to the statistics of word co-occurrence in the corpus, the prior knowledge relation graph is built in the form of a conditional probability matrix:

$$P_{ij} = \frac{W_{ij}}{W_i}$$

where $W_i$ is the number of occurrences of word $i$ in the corpus, $W_{ij}$ is the number of texts in which words $i$ and $j$ co-occur, and $P_{ij}$ is the probability of word $j$ co-occurring with word $i$. However, the co-occurrence relationships counted in the corpus deviate from the statistics of real scenes, which can cause the model to overfit the training set and hurt its generalization ability. To avoid this, the matrix $P$ is binarized: a threshold $\psi$ filters out noisy edges, giving the final prior knowledge relation graph

$$A_{ij} = \begin{cases} 1, & P_{ij} \ge \psi \\ 0, & P_{ij} < \psi \end{cases}$$
the basic idea of graph convolution, which utilizes graph convolution to learn the interdependence relationship between priori knowledge, is to continuously update node characteristic representations by propagating neighborhood information between nodes, specifically: given priori knowledge K as node information of graph convolution and a priori knowledge relation graph A ij As a correlation matrix of graph convolution, the calculation process of the first layer of graph convolution is as follows:
wherein H is 0 =K,Is to use the relation matrix A ij Normalized symmetric matrix, W l Is a transfer matrix to be learned, sigma is a nonlinear activation function ReLU;
obtaining output H of last layer of graph convolution F And obtaining the final prior knowledge graph feature vector.
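One layer of this computation can be sketched in numpy as follows (a sketch under the standard symmetric-normalization convention with self-loops; the patent does not spell out its normalization):

```python
import numpy as np

def gcn_layer(H, A, W):
    """One graph-convolution layer: H' = ReLU(A_hat @ H @ W).

    A_hat is the symmetrically normalized adjacency D^{-1/2}(A+I)D^{-1/2}.
    H: (n_nodes, d_in) node features; W: (d_in, d_out) learnable weights.
    """
    A_tilde = A + np.eye(A.shape[0])            # add self-loops
    d = A_tilde.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    A_hat = D_inv_sqrt @ A_tilde @ D_inv_sqrt   # symmetric normalization
    return np.maximum(A_hat @ H @ W, 0.0)       # ReLU activation
```

Stacking such layers with $H^0 = K$ and taking the last output gives the prior knowledge graph features $H^F$.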
Step 2, constructing an image text matching module:
After the image data and text data are given, the pre-trained Faster R-CNN model yields the image feature vectors $\Omega = \{\omega_1, \omega_2, \ldots, \omega_n\}$, where $\omega_i$ is the region feature vector of the $i$-th region of a picture. The pre-trained BERT model yields the text feature vectors $T = \{t_1, t_2, \ldots, t_e\}$, where $t_i$ is the feature vector of the $i$-th word of a text;
for the image feature vectors $\Omega$, intra-modality context aggregation is performed with a Transformer self-attention mechanism, as shown in FIG. 3, where the self-attention formula is:

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d}}\right)V$$

To further enhance the representational power of the model, the Transformer aggregates context information from different subspaces with multiple parallel self-attention heads:

$$head_i = \text{Attention}(Q_i, K_i, V_i)$$

where $head_i$ is the output of the $i$-th head, and $Q_i, K_i, V_i$ are the results of passing the image feature vectors $\Omega$ through different fully connected layers:

$$Q_i = \Omega W_i^{Q},\quad K_i = \Omega W_i^{K},\quad V_i = \Omega W_i^{V}$$

The Transformer model then concatenates the heads to obtain the first-layer image features $\Omega_1$:

$$\Omega_1 = \text{MultiHead}(\Omega) = \text{concat}(head_1, head_2, \ldots, head_n)\,W^{O}$$
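The multi-head aggregation above can be sketched in numpy (weights here are placeholders for the learned fully connected layers):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d)) @ V

def multi_head(X, Wq, Wk, Wv, Wo):
    """Multi-head self-attention over region features X (n_regions x dim).

    Wq/Wk/Wv: lists with one projection matrix per head; Wo: output projection.
    """
    heads = [attention(X @ wq, X @ wk, X @ wv) for wq, wk, wv in zip(Wq, Wk, Wv)]
    return np.concatenate(heads, axis=-1) @ Wo
```

Applied to the 36 region vectors of an image, the output plays the role of the first-layer image features $\Omega_1$.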
For the text feature vectors $T$, three one-dimensional convolutional networks with different kernel sizes explore the context of sentences. The one-dimensional convolution of a kernel of size $m$ applied at the $k$-th word is:

$$p_{m,k} = \text{ReLU}(W_m t_{k:k+m-1} + b_m),\quad m \in \{1, 2, 3\}$$

where $W_m$ is a learnable convolution-kernel parameter, $b_m$ is a bias term, and $t_{k:k+m-1}$ is the feature vector of the $k$-th to $(k+m-1)$-th words. After the one-dimensional convolution outputs are obtained, max pooling is applied over all word positions:

$$q_m = \max\{p_{m,1}, p_{m,2}, \ldots, p_{m,e}\},\quad m \in \{1, 2, 3\}$$

Finally, $q_1, q_2, q_3$ are concatenated and passed through a fully connected layer with an $\ell_2$ regularization term, yielding the first-layer text features $T_1$.
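A simplified numpy sketch of the multi-scale convolution and pooling (one filter per window size for brevity; in practice each size would use many filters and a final fully connected layer):

```python
import numpy as np

def phrase_features(T, kernels, biases):
    """Multi-scale 1-D convolution over word features, then max-pool.

    T: (n_words, dim) word vectors; kernels[m]: (m, dim) filter for window
    size m in {1, 2, 3}; biases: one scalar bias per window size.
    Returns the pooled responses [q_1, q_2, q_3].
    """
    pooled = []
    for (m, W), b in zip(kernels.items(), biases):
        # p_{m,k} = ReLU(W_m . t_{k:k+m-1} + b_m) at every window position k
        p = [max((W * T[k:k + m]).sum() + b, 0.0) for k in range(len(T) - m + 1)]
        pooled.append(max(p))            # max-pool over all positions
    return np.array(pooled)
```

Window sizes 1, 2, and 3 correspond to unigram, bigram, and trigram phrase information.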
step 3, construction of an integration module:
The prior knowledge learned by graph convolution guides the first-layer image features and first-layer text features, and the second-layer image features and text features guided by the prior knowledge graph are output.
The guided second-layer image features are computed as:

$$A^{\Omega} = \text{softmax}(\lambda\, \Omega_1 (H^F)^{\top}),\qquad \Omega_2 = A^{\Omega} H^F$$

where $H^F$ is the prior knowledge graph feature vector, $A^{\Omega}$ is the importance score of each item of prior knowledge with respect to the first-layer image features $\Omega_1$, $\lambda$ is the smoothing parameter of the softmax function, and $\Omega_2$ is the resulting second-layer image feature vector.
For text, since the prior knowledge graph is obtained from the text corpus, the first-layer text features are guided jointly by the filtered word labels and the prior knowledge learned by graph convolution:

$$A^{T} = \text{softmax}(\lambda\, T_1 (H^F)^{\top}),\qquad T_2 = A^{T}\left(\omega\, W^{tag} + (1-\omega)\, H^F\right)$$

where $W^{tag}$ is the word-label matrix, $A^{T}$ is the importance score of each item of prior knowledge with respect to the first-layer text features $T_1$, $\lambda$ is the smoothing parameter of the softmax function, $\omega$ is an optimization parameter controlling the ratio of word labels to prior knowledge, and $T_2$ is the resulting second-layer text feature vector.
The second layer image features and the first layer image features are weighted and combined to obtain third layer image features of the integration module; and carrying out weighted combination on the second layer text features and the first layer text features to obtain third layer text features of the integration module, wherein the third layer text features comprise:
$$\Omega_3 = \delta\,\Omega_1 + (1-\delta)\,\Omega_2$$

$$T_3 = \delta\,T_1 + (1-\delta)\,T_2$$

where $\delta$ is an optimization parameter controlling the ratio of first-layer to second-layer features, and $\Omega_3$ and $T_3$ are the third-layer image and text feature vectors, respectively.
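The guidance and weighted combination steps can be sketched together in numpy (the exact score formula is an assumption reconstructed from the surrounding definitions, since the patent's equation images are not reproduced in the text; `lam` and `delta` values are placeholders):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def knowledge_guided(F1, HF, lam=9.0, delta=0.7):
    """Guide first-layer features F1 with knowledge-graph features HF.

    scores: softmax (smoothed by lam) over the similarity between each
    feature and every knowledge node; F2 is an importance-weighted sum of
    knowledge vectors; the third-layer output mixes the two layers:
    F3 = delta * F1 + (1 - delta) * F2.
    """
    scores = softmax(lam * F1 @ HF.T)   # importance of each knowledge node
    F2 = scores @ HF                    # knowledge-guided second-layer features
    return delta * F1 + (1 - delta) * F2
```

With `delta = 1` the knowledge guidance is disabled and the first-layer features pass through unchanged, which makes $\delta$ a convenient knob for ablation.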
Step 4, constructing the loss function from the first-layer and third-layer image-text features. The loss function used is the triplet loss function common in the field of image-text matching:

$$L(\Omega, T) = [\alpha - S(\Omega, T) + S(\Omega, \hat{T})]_+ + [\alpha - S(\Omega, T) + S(\hat{\Omega}, T)]_+$$

where $\alpha$ is a predefined margin parameter, $S$ is a similarity function between image-text pairs (e.g. cosine similarity), $S(\Omega, T)$ is the similarity score of a positive image-text pair, and $S(\Omega, \hat{T})$ and $S(\hat{\Omega}, T)$ are the similarity scores of negative pairs formed by replacing the text and the image, respectively.
In the experiments, only the negative pairs within a mini-batch are used rather than accumulating all negative samples, and the triplet loss is applied to both the first-layer and the third-layer image-text feature pairs. To further strengthen the image-text similarity measure, a relative-entropy term on the importance scores of the semantic concepts is added to the final loss:

$$L_{KL} = D_{KL}\!\left(A^{\Omega} \,\|\, A^{T}\right)$$

The final loss function is:

$$L_{final} = \lambda_1 L(\Omega_1, T_1) + \lambda_2 L(\Omega_3, T_3) + \lambda_3 L_{KL}$$
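A numpy sketch of the in-batch triplet loss (summing over all in-batch negatives; the patent does not state whether it sums or takes the hardest negative, so the summed variant here is an assumption):

```python
import numpy as np

def triplet_loss(S, alpha=0.2):
    """Hinge triplet loss over a mini-batch similarity matrix.

    S[i, j]: similarity of image i with text j; diagonal entries are the
    positive pairs, off-diagonal entries the in-batch negatives.
    """
    pos = np.diag(S)[:, None]                    # positive-pair scores, column
    cost_txt = np.maximum(alpha - pos + S, 0.0)  # negative texts per image
    cost_img = np.maximum(alpha - pos.T + S, 0.0)  # negative images per text
    n = S.shape[0]
    mask = 1.0 - np.eye(n)                       # exclude the positive pairs
    return ((cost_txt + cost_img) * mask).sum() / n
```

When every positive pair beats all its negatives by at least the margin $\alpha$, the loss is exactly zero.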
and 5, training and testing to obtain an image text matching model based on the priori knowledge graph.
The following are specific examples given by the inventors.
Examples:
experiment platform: the experiment was run on a NVIDIA TITAN RTX workstation using the Pytorch framework tool.
Datasets: the experiments adopt two benchmark datasets widely used in image-text matching: Flickr30k and MSCOCO. Flickr30k contains 31,783 pictures, each with 5 text captions; 1,000 pictures are used for validation, 1,000 for testing, and the rest for training. MSCOCO contains 123,287 pictures, each with 5 human-written text descriptions; 113,287 pictures are used for training, 5,000 for validation, and 5,000 for testing, with the MSCOCO test data evaluated in both a 1K and a 5K setting.
The inventors performed comparison experiments on the MSCOCO 1K and Flickr30K datasets; Table 1 compares the image-text matching model based on the prior knowledge graph constructed in this embodiment with strong image-text matching models of recent years.
Table 1: comparison of model results on MSCOCO1K and Flickr30K datasets
As Table 1 shows, the image-text matching model based on the prior knowledge graph constructed in this embodiment effectively improves the accuracy of image-text matching; at the same time, because the construction method does not compute pairwise local attention relationships between image regions and text fragments, the constructed model trains and infers faster.

Claims (4)

1. The method for constructing the image text matching model based on the prior knowledge graph is characterized in that the constructed image text matching model based on the prior knowledge graph comprises a prior knowledge graph module, an image text matching module and an integration module; the prior knowledge graph module and the image text matching module are respectively connected with the integration module, and the specific construction steps are as follows:
step 1, constructing a priori knowledge graph module:
extracting meaningful words from a text corpus by using a statistical method, performing word embedding operation on the extracted words by using a glove technology, and representing the words as word feature vectors, which are called prior knowledge; constructing a priori knowledge relation graph according to the co-occurrence statistical probability of the words in the corpus; learning the interdependencies between prior knowledge using graph convolution;
step 2, constructing an image text matching module:
after the image data and the text data are given, obtaining an image feature vector by using a pre-trained fast-RCNN model, and obtaining a text feature vector by using a pre-trained BERT model; carrying out intra-mode context information aggregation on the image feature vector by using a self-attention mechanism to obtain a first layer of image features; carrying out intra-mode context information aggregation on the text feature vector by using a self-attention mechanism to obtain a first-layer text feature;
step 3, construction of an integration module:
the prior knowledge learned by the graph convolution is utilized to guide the first-layer image features and the first-layer text features, and the second-layer image features and the second-layer text features guided by the prior knowledge graph are output;
the second layer image features and the first layer image features are weighted and combined to obtain third layer image features of the integration module;
the second layer text feature and the first layer text feature are weighted and combined to obtain a third layer text feature of the integration module;
step 4, constructing a loss function by utilizing the first layer image text characteristics and the third layer image text characteristics;
and 5, training and testing to obtain an image text matching model based on the priori knowledge graph.
2. The method of building as claimed in claim 1, wherein in step 1, the building of the prior knowledge graph module further comprises:
the extraction of words from the text corpus comprises: deleting rare words from the corpus and selecting words of three parts of speech: nouns, verbs, and adjectives; according to the statistical frequency of words in the corpus, the proportion of selected nouns, verbs, and adjectives is strictly limited to 7:2:1; a word-embedding operation is performed on the selected words with GloVe, representing each word as a word feature vector, called the prior knowledge;
the construction of the prior knowledge relation graph comprises the following steps: modeling a relation diagram in the form of a conditional probability matrix, wherein the specific formula is as follows:
in which W is i Representing the number of occurrences of word i in the corpus, W ij Representing the number of times word i and word j co-occur in a text in the corpus, then P ij Representing the probability of co-occurrence of word i and word j;
the convolution includes: and taking the word feature vector obtained by the glove technology as a node, taking the constructed prior knowledge relation graph as a correlation matrix, inputting the correlation matrix into a graph convolution network for training, and finally obtaining the feature vector of the prior knowledge graph.
3. The method of claim 1, wherein in step 2, the construction of the image text matching module further comprises:
the image data and text data feature extraction includes: extracting 36 salient regions of each input image by using a Faster-RCNN pre-training model, and representing each salient region as an image feature vector through a full connection layer; extracting feature vectors of each text by using a BERT pre-training model, wherein the text feature vectors output by the BERT aggregate word segmentation features, semantic features and position features;
the self-attention mechanism includes: aggregating the attention relation among the image areas by using a transducer model, obtaining three inputs Q, K, V of a transducer by using three different full-connection layers of specific image area-level feature vectors, and finally obtaining the first layer of image features after the transducer is aggregated; an implementation of the text self-attention mechanism is: the context information of the sentences is explored by using three one-dimensional convolution networks with different sizes, so that the information of phrases with different lengths in the sentences can be captured, and finally, the text characteristics of the first layer are obtained.
4. The construction method according to claim 1, wherein in step 4, the construction of the loss function using the first layer image text feature and the third layer image text feature is implemented by:
a triplet loss function is used, whose basic formula is:

L_triplet = [α − S(I, T) + S(I, T')]+ + [α − S(I, T) + S(I', T)]+

where [x]+ = max(x, 0), α is a predefined margin parameter, S is the similarity function for image-text pairs, S(I, T) denotes the similarity score of a positively matched image-text pair, and S(I, T') and S(I', T) denote the similarity scores of mismatched image-to-text and text-to-image pairs, respectively;
in the experiments, mismatched pairs are drawn within each mini-batch, and the triplet loss function is applied to both the first-layer and the third-layer image-text feature pairs;
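The triplet loss with the hardest in-batch mismatched pairs can be sketched as follows; the 2×2 similarity matrix and margin value are invented for illustration:

```python
import numpy as np

def triplet_loss(sim, alpha=0.2):
    """Hinge-based triplet loss over a similarity matrix sim[i, j] =
    S(image_i, text_j); diagonal entries are the matching pairs.
    Uses the hardest mismatched pair in the mini-batch per direction."""
    n = sim.shape[0]
    pos = np.diag(sim)
    mask = np.eye(n, dtype=bool)
    neg = np.where(mask, -np.inf, sim)
    hard_t = neg.max(axis=1)   # hardest non-matching text per image
    hard_i = neg.max(axis=0)   # hardest non-matching image per text
    return (np.maximum(0.0, alpha - pos + hard_t) +
            np.maximum(0.0, alpha - pos + hard_i)).mean()

sim = np.array([[0.9, 0.7],
                [0.6, 0.8]])
loss = triplet_loss(sim, alpha=0.2)
print(round(loss, 4))   # 0.05: only one hinge term is active
```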
the relative entropy is added on the importance scores of the semantic concepts to further strengthen the image-text similarity measure; the final loss function is:

L = λ1·L1 + λ2·L3 + λ3·L_KL

where λ1, λ2 and λ3 are weight parameters balancing the different losses, L1 and L3 are the triplet losses on the first-layer and third-layer features, and L_KL is the relative entropy term.
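A sketch of the combined objective follows. The weight values and the two triplet-loss inputs are placeholders; only the relative-entropy term is computed here, from a pair of concept importance-score distributions:

```python
import numpy as np

def kl_divergence(p, q, eps=1e-8):
    """Relative entropy KL(p || q) between two importance-score
    distributions over semantic concepts."""
    p = p / p.sum()
    q = q / q.sum()
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def total_loss(l1, l3, img_scores, txt_scores, lam=(1.0, 1.0, 0.1)):
    """L = lam1*L1 + lam2*L3 + lam3*L_KL, combining the first-layer and
    third-layer triplet losses with the relative-entropy term."""
    return lam[0] * l1 + lam[1] * l3 + lam[2] * kl_divergence(img_scores, txt_scores)

img = np.array([0.5, 0.3, 0.2])   # image-side concept importance scores
txt = np.array([0.5, 0.3, 0.2])   # text-side concept importance scores
final = total_loss(0.4, 0.2, img, txt)
print(round(final, 4))   # 0.6: KL of identical distributions contributes 0
```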
CN202210060418.6A 2022-01-19 2022-01-19 Construction method of image text matching model based on priori knowledge graph Active CN114547235B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210060418.6A CN114547235B (en) 2022-01-19 2022-01-19 Construction method of image text matching model based on priori knowledge graph


Publications (2)

Publication Number Publication Date
CN114547235A CN114547235A (en) 2022-05-27
CN114547235B true CN114547235B (en) 2024-04-16

Family

ID=81672097

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210060418.6A Active CN114547235B (en) 2022-01-19 2022-01-19 Construction method of image text matching model based on priori knowledge graph

Country Status (1)

Country Link
CN (1) CN114547235B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020143137A1 (en) * 2019-01-07 2020-07-16 北京大学深圳研究生院 Multi-step self-attention cross-media retrieval method based on restricted text space and system
CN111741236A (en) * 2020-08-24 2020-10-02 浙江大学 Method and device for generating positioning natural image subtitles based on consensus diagram characteristic reasoning
CN112084358A (en) * 2020-09-04 2020-12-15 中国石油大学(华东) Image-text matching method based on regional enhanced network with theme constraint
CN113191357A (en) * 2021-05-18 2021-07-30 中国石油大学(华东) Multilevel image-text matching method based on graph attention network


Non-Patent Citations (1)

Title
Image caption generation combining visual features and scene semantics; Li Zhixin, Wei Haiyang, Huang Feicheng, Zhang Canlong, Ma Huifang, Shi Zhongzhi; Chinese Journal of Computers; 2020-09-15 (09); full text *

Also Published As

Publication number Publication date
CN114547235A (en) 2022-05-27


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant