CN114896438A - Image-text retrieval method based on hierarchical alignment and generalized pooling graph attention mechanism - Google Patents

Image-text retrieval method based on hierarchical alignment and generalized pooling graph attention mechanism

Info

Publication number: CN114896438A (granted as CN114896438B)
Application number: CN202210504224.0A
Authority: CN (China)
Original language: Chinese (zh)
Legal status: Active (granted)
Inventors: 郭洁, 王孟瀛, 周妍, 高雅, 宋彬, 池育浩
Applicant and current assignee: Xidian University
Prior art keywords: feature vector, text, image, representing, similarity

Classifications

    • G06F16/532 Query formulation, e.g. graphical querying (information retrieval of still image data)
    • G06F16/5846 Retrieval using metadata automatically derived from the content, using extracted text
    • G06F18/22 Matching criteria, e.g. proximity measures (pattern recognition)
    • G06N3/045 Combinations of networks (neural networks based on biological models)
    • G06N3/08 Learning methods (neural networks)
    • G06V10/40 Extraction of image or video features
    • G06V10/761 Proximity, similarity or dissimilarity measures (image or video pattern matching)
    • G06V10/82 Image or video recognition or understanding using neural networks
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The invention relates to an image-text retrieval method based on hierarchical alignment and a generalized pooling graph attention mechanism, comprising the following steps: respectively extracting an initial image feature vector of a preset image and an initial text feature vector of a preset text; obtaining an image feature graph and a text feature graph according to the relations between different nodes in the initial image feature vector and the initial text feature vector; respectively inputting the image feature graph and the text feature graph into a combined graph attention and generalized pooling module to obtain the final image and text feature vectors; obtaining a comprehensive similarity based on the first similarity, the second similarity and the third similarity, calculating a loss function from the comprehensive similarity, and back-propagating the loss function to update the network parameters; and obtaining the retrieval matching result from the comprehensive similarity. The invention alleviates the alignment difficulty of the retrieval task and obtains more complete image and text feature vectors that better represent the image-text matching relationship, thereby improving retrieval accuracy.

Description

Image-text retrieval method based on hierarchical alignment and generalized pooling graph attention mechanism
Technical Field
The invention belongs to the technical field of data mining, and relates to an image-text retrieval method based on hierarchical alignment and a generalized pooling graph attention mechanism.
Background
In recent years, with the rapid development of the internet, people receive a large amount of data every day, and researchers have focused on how to accurately retrieve the required information from massive amounts of data. Image-text retrieval offers a solution to this problem.
The essence of image-text retrieval is to encode samples of the two modalities, image and text, to obtain their semantic representation features, and to design a corresponding similarity calculation method to compute the similarity between image features and text features. With an image-text retrieval model, a user can quickly find the image corresponding to a given text description, and conversely can quickly obtain the text description corresponding to a given image. Existing hierarchical alignment approaches only consider semantic alignment between the whole image and the whole sentence and between image regions and words, ignoring non-object elements such as global background information. Such semantic alignment is susceptible to negative examples that contain similar object entities but slightly different backgrounds. Meanwhile, traditional feature aggregation methods adopt max pooling or average pooling, ignoring the importance of the global-local cooperative relationship of multi-modal features.
Therefore, how to improve semantic alignment and how to strengthen the global-local cooperative relationship of multi-modal features has become an urgent problem.
Disclosure of Invention
In order to solve the above problems in the prior art, the invention provides an image-text retrieval method based on hierarchical alignment and a generalized pooling graph attention mechanism. The technical problem to be solved by the invention is realized by the following technical solutions:
An embodiment of the invention provides an image-text retrieval method based on hierarchical alignment and a generalized pooling graph attention mechanism, comprising the following steps:
Step 1, respectively extracting an initial image feature vector of a preset image and an initial text feature vector of a preset text, wherein the initial image feature vector is obtained by concatenating a global feature vector and a local feature vector;
Step 2, obtaining an image feature graph and a text feature graph according to the relations between different nodes in the initial image feature vector and the initial text feature vector;
Step 3, respectively inputting the image feature graph and the text feature graph into a combined graph attention and generalized pooling module to obtain a final image feature vector and a final text feature vector;
Step 4, obtaining a comprehensive similarity between the preset image and the preset text based on the first similarity between the local feature vector and the initial text feature vector, the second similarity between the initial image feature vector and the initial text feature vector, and the third similarity between the final image feature vector and the final text feature vector; calculating a loss function from the comprehensive similarity and back-propagating it to update the network parameters, wherein the network parameters are respectively located in the image feature vector extraction part, the text feature vector extraction part, and the combined graph attention and generalized pooling module;
Step 5, obtaining the retrieval matching result from the final comprehensive similarity output by the model after the network parameters are updated.
In one embodiment of the present invention, step 1 comprises:
Step 1.1, extracting the global feature vector V_G and the local feature vector V_L of the preset image;
Step 1.2, concatenating the global feature vector V_G and the local feature vector V_L to obtain the initial image feature vector;
Step 1.3, extracting the initial text feature vector T_S of the preset text.
In one embodiment of the invention, the global feature vector V_G is:

V_G = W_g G + b_g,

where V_G represents the global feature vector of the preset image, W_g represents a first weight matrix with W_g ∈ ℝ^(D×D_0), D represents the dimension of the output image feature vector, D_0 represents the size of each pixel, G represents the first output feature and satisfies G ∈ ℝ^(D_0×m²), m represents the size of the reconstructed feature map, and b_g represents a first bias constant;
the local feature vector V_L is:

V_L = W_l L + b_l,

where V_L represents the local feature vector of the preset image, W_l represents a second weight matrix with W_l ∈ ℝ^(D×D_k), D_k represents the dimension of each region feature, L represents the second output feature and satisfies L ∈ ℝ^(D_k×k), k represents the number of regions detected from the preset image, and b_l represents a second bias constant;
the initial image feature vector is:

V_U = V_G || V_L,

where V_U represents the initial image feature vector and || represents the concatenation operation; the dimension of the image feature vector is denoted D_U;
the initial text feature vector is:

T_S = W_S S + b_S,

where T_S represents the initial text feature vector, S represents the output feature and satisfies S ∈ ℝ^(D_1×l), D_1 represents the dimension of the text feature, l represents the number of words in the text, W_S represents a weight matrix with W_S ∈ ℝ^(D×D_1), and b_S represents a third bias constant.
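As a minimal sketch of the projections and concatenation above (a hedged illustration only: the module name, the use of nn.Linear, and all default dimensions are assumptions, not the claimed implementation):

```python
import torch
import torch.nn as nn

class FeatureProjection(nn.Module):
    """Illustrative sketch of step 1: linear projections V_G = W_g G + b_g,
    V_L = W_l L + b_l, T_S = W_S S + b_S, then node-wise concatenation
    V_U = V_G || V_L. All dimensions are placeholder assumptions."""
    def __init__(self, d_pixel=2048, d_region=2048, d_word=768,
                 d_img=2048, d_txt=768):
        super().__init__()
        self.proj_g = nn.Linear(d_pixel, d_img)   # V_G = W_g G + b_g
        self.proj_l = nn.Linear(d_region, d_img)  # V_L = W_l L + b_l
        self.proj_s = nn.Linear(d_word, d_txt)    # T_S = W_S S + b_S

    def forward(self, grid_feats, region_feats, word_feats):
        # grid_feats:   (m*m, d_pixel) pixel-level features G
        # region_feats: (k, d_region)  region features L
        # word_feats:   (l, d_word)    word features S
        v_g = self.proj_g(grid_feats)
        v_l = self.proj_l(region_feats)
        v_u = torch.cat([v_g, v_l], dim=0)        # V_U = V_G || V_L
        t_s = self.proj_s(word_feats)
        return v_u, t_s
```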
In one embodiment of the present invention, step 2 comprises:
Step 2.1, extracting the first image feature vector v_i^U of the i-th node and the second image feature vector v_j^U of the j-th node from the initial image feature vector;
Step 2.2, performing a dot product operation on the first image feature vector v_i^U and the second image feature vector v_j^U to obtain the first relation E_U;
Step 2.3, constructing the image feature graph from the initial image feature vector and the first relation E_U;
Step 2.4, extracting the first text feature vector t_{i1}^S of the i1-th node and the second text feature vector t_{j1}^S of the j1-th node from the initial text feature vector;
Step 2.5, performing a dot product operation on the first text feature vector t_{i1}^S and the second text feature vector t_{j1}^S to obtain the second relation E_S;
Step 2.6, constructing the text feature graph from the initial text feature vector and the second relation E_S.
In one embodiment of the invention, the first relation E_U is:

e_ij^U = v_i^U · v_j^U,

where · represents the dot product operation;
the image feature graph is:

G_V = (V_U, E_U),

where G_V represents the image feature graph, the features in the initial image feature vector serve as nodes, and the first relation E_U serves as edges;
the second relation E_S is:

e_{i1j1}^S = t_{i1}^S · t_{j1}^S;

the text feature graph is:

G_T = (T_S, E_S),

where G_T represents the text feature graph, the features in the initial text feature vector serve as nodes, and the second relation E_S serves as edges.
In one embodiment of the present invention, step 3 comprises:
Step 3.1, inputting the image feature graph into a graph attention network module, and propagating the initial image feature vector through a multi-head graph attention mechanism to obtain an updated image feature vector;
Step 3.2, inputting the text feature graph into a graph attention network module, and propagating the initial text feature vector through a multi-head graph attention mechanism to obtain an updated text feature vector;
Step 3.3, inputting the updated image feature vector into a generalized pooling module to obtain the final image feature vector;
Step 3.4, inputting the updated text feature vector into a generalized pooling module to obtain the final text feature vector.
In one embodiment of the invention, step 3.1 comprises:
Step 3.11, simultaneously inputting the initial image feature vector into each parallel layer in the graph attention network module, and obtaining the first feature quantization result of the h-th layer node by computing the vector dot product of the weight matrices and the input features;
Step 3.12, regularizing the first feature quantization result to obtain the first multi-head attention weight matrix;
Step 3.13, multiplying the first multi-head attention weight matrix and the learnable weight matrix with the initial image feature vector to obtain the first output feature of each layer;
Step 3.14, splicing all the first output features of the same image to obtain the spliced image feature;
Step 3.15, passing the spliced image feature through a regularization network to obtain the updated image feature vector.
Step 3.2 comprises:
Step 3.21, simultaneously inputting the initial text feature vector into each parallel layer in the graph attention network module, and obtaining the second feature quantization result of the h-th layer node by computing the vector dot product of the weight matrices and the input features;
Step 3.22, regularizing the second feature quantization result to obtain the second multi-head attention weight matrix;
Step 3.23, multiplying the second multi-head attention weight matrix and the learnable weight matrix with the initial text feature vector to obtain the second output feature of each layer;
Step 3.24, splicing all the second output features of the same text to obtain the spliced text feature;
Step 3.25, passing the spliced text feature through a regularization network to obtain the updated text feature vector.
In one embodiment of the invention, step 3.3 comprises:
Step 3.31, vectorizing the position indices of the updated image feature vector through a triangular position coding strategy to obtain the first position code;
Step 3.32, after the first position code is converted into a vector representation, generating the first pooling coefficients with a sequence model based on a bidirectional gated recurrent unit;
Step 3.33, based on the first pooling coefficients, obtaining the final image feature vector from the updated image feature vector:

v̂_i^U = Σ_{k=1}^{N} θ_k ṽ_{i,k}^U, θ_k = f(k, N), k = 1, 2, …, N,

where v̂_i^U represents the i-th feature in the final image feature vector, θ_k represents the first pooling coefficient, ṽ_{i,k}^U denotes the k-th sorted value of the i-th feature in the updated image feature vector, and the value of N is equal to D_U.
Step 3.4 comprises:
Step 3.41, vectorizing the position indices of the updated text feature vector through the triangular position coding strategy to obtain the second position code;
Step 3.42, after the second position code is converted into a vector representation, generating the second pooling coefficients with the sequence model based on the bidirectional gated recurrent unit;
Step 3.43, based on the second pooling coefficients, obtaining the final text feature vector from the updated text feature vector:

t̂_{i1}^S = Σ_{k1=1}^{N1} θ_{k1} t̃_{i1,k1}^S, θ_{k1} = f(k1, N1), k1 = 1, 2, …, N1,

where t̂_{i1}^S represents the i1-th feature in the final text feature vector, θ_{k1} represents the second pooling coefficient, t̃_{i1,k1}^S denotes the k1-th sorted value of the i1-th feature in the updated text feature vector, and the value of N1 is equal to D_S.
In one embodiment of the present invention, step 4 comprises:
Step 4.1, performing cosine similarity calculation on the local feature vector and the initial text feature vector to obtain the first similarity;
Step 4.2, performing cosine similarity calculation on the initial image feature vector and the initial text feature vector to obtain the second similarity;
Step 4.3, performing cosine similarity calculation on the final image feature vector and the final text feature vector to obtain the third similarity;
Step 4.4, obtaining the comprehensive similarity between the preset image and the preset text as the sum of the first, second and third similarities;
Step 4.5, calculating the loss function from the comprehensive similarity and back-propagating it to update the network parameters, which are respectively located in the image feature vector extraction part, the text feature vector extraction part, and the combined graph attention and generalized pooling module.
In one embodiment of the present invention, the first similarity is:

S_1(V_L, T_S) = (V_L · T_S) / (||V_L|| ||T_S||),

where S_1(V_L, T_S) represents the first similarity, V_L represents the local feature vector, T_S represents the initial text feature vector, and ||·|| represents the norm of a feature vector;
the second similarity is:

S_2(V_U, T_S) = (V_U · T_S) / (||V_U|| ||T_S||),

where S_2(V_U, T_S) represents the second similarity and V_U represents the initial image feature vector;
the third similarity is:

S_3(V̂_U, T̂_S) = (V̂_U · T̂_S) / (||V̂_U|| ||T̂_S||),

where S_3(V̂_U, T̂_S) represents the third similarity, V̂_U represents the final image feature vector, and T̂_S represents the final text feature vector;
the comprehensive similarity is:

S(I, T) = S_1(V_L, T_S) + S_2(V_U, T_S) + S_3(V̂_U, T̂_S),

where S(I, T) represents the comprehensive similarity, I represents the input image to be matched, and T represents the input text to be matched;
the loss function is calculated as follows:

L = [d + S(I′, T) - S(I, T)]_+ + [d + S(I, T′) - S(I, T)]_+,

where L represents the loss function, d represents a margin parameter, [x]_+ ≡ max(x, 0), ≡ denotes identity, I′ and T′ denote the hardest mismatched image and text with respect to a matched image-text pair, satisfying I′ = argmax_{X≠I} S(X, T) and T′ = argmax_{Y≠T} S(I, Y), X denotes image information that does not match the given text information, and Y denotes text information that does not match the given image information.
Compared with the prior art, the invention has the following beneficial effects:
1. The image-text retrieval method based on hierarchical alignment and a generalized pooling graph attention mechanism introduces a hierarchical composite similarity calculation into intra-modal and inter-modal semantic alignment, computing the similarity between image feature vectors and text feature vectors extracted under different conditions. This enriches the learning of intra-modal and inter-modal interaction information, alleviates the alignment difficulty of the retrieval task, and further improves retrieval accuracy.
2. The image-text retrieval method based on hierarchical alignment and a generalized pooling graph attention mechanism replaces traditional aggregation schemes such as max pooling and average pooling with generalized pooling, integrating the pooling operation into the graph attention mechanism instead of simply extracting the maximum value of the feature vector.
Other aspects and features of the present invention will become apparent from the following detailed description, which proceeds with reference to the accompanying drawings. It is to be understood, however, that the drawings are designed solely for purposes of illustration and not as a definition of the limits of the invention, for which reference should be made to the appended claims. It should be further understood that the drawings are not necessarily drawn to scale and that, unless otherwise indicated, they are merely intended to conceptually illustrate the structures and procedures described herein.
Drawings
fig. 1 is a schematic flowchart of an image-text retrieval method based on hierarchical alignment and a generalized pooling graph attention mechanism according to an embodiment of the present invention;
fig. 2 is a schematic diagram of the feature vector graph according to an embodiment of the present invention;
fig. 3 is a schematic diagram of the graph attention mechanism and generalized pooling module according to an embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to specific examples, but the embodiments of the present invention are not limited thereto.
Example one
Referring to fig. 1, fig. 1 is a schematic flowchart of an image-text retrieval method based on hierarchical alignment and a generalized pooling graph attention mechanism according to an embodiment of the present invention. The method comprises steps 1 to 5, wherein:
step 1, please refer to fig. 2, respectively extracting an initial image feature vector of a preset image and an initial text feature vector of a preset text, wherein the initial image feature vector is obtained by cascading a global feature vector and a local feature vector.
Specifically, the preset image is an image which needs to be matched with the text, the preset text is a text which needs to be matched with the image, if the preset image is 1 and the number of the initial texts is 5, the preset image and the 5 initial texts need to be retrieved to obtain a text with the highest similarity, and the text is used for describing the content of the image and serves as a matching result.
In a specific embodiment, step 1 may specifically include:
Step 1.1, extracting the global feature vector V_G and the local feature vector V_L of the preset image.
In this embodiment, the global feature vector V_G is:

V_G = W_g G + b_g,

where V_G represents the global feature vector of the preset image, W_g represents the first weight matrix with W_g ∈ ℝ^(D×D_0), D represents the dimension of the global feature vector of the output image, D_0 represents the size of each pixel, G represents the first output feature and satisfies G ∈ ℝ^(D_0×m²), m represents the size of the reconstructed feature map, and b_g represents the first bias constant.
In this embodiment, the local feature vector V_L is:

V_L = W_l L + b_l,

where V_L represents the local feature vector of the preset image, W_l represents the second weight matrix with W_l ∈ ℝ^(D×D_k), D_k represents the dimension of each region feature, L represents the second output feature and satisfies L ∈ ℝ^(D_k×k), k represents the number of regions detected from the preset image, and b_l represents the second bias constant.
Step 1.2, concatenating the global feature vector V_G and the local feature vector V_L to obtain the initial image feature vector:

V_U = V_G || V_L,

where V_U represents the initial image feature vector and || represents the concatenation operation; the dimension of the image feature vector is denoted D_U.
Step 1.3, extracting the initial text feature vector T_S of the preset text:

T_S = W_S S + b_S,

where T_S represents the initial text feature vector, S represents the output feature and satisfies S ∈ ℝ^(D_1×l), D_1 represents the dimension of the text feature, l represents the number of words in the text, W_S represents a weight matrix with W_S ∈ ℝ^(D×D_1), and b_S represents the third bias constant.
Optionally, the global feature extraction may use a ResNet152 encoder pre-trained on the ImageNet dataset to accurately extract pixel-level features of the image. For local feature extraction, a Faster R-CNN module pre-trained on the Visual Genome dataset may be used as the encoder. The image feature vector dimension is 2048 and is shared by the global and local features. The text feature extraction part uses a BERT pre-trained model with 12 layers, 12 heads, 768 hidden units and 110M parameters; the dimension of the resulting text feature vector is 768.
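A sketch of how the optional encoders above might be instantiated; torchvision only ships COCO detection weights, so the Visual Genome Faster R-CNN weights are assumed to come from an external checkpoint:

```python
import torchvision.models as models
from torchvision.models.detection import fasterrcnn_resnet50_fpn
from transformers import BertModel

# Global feature encoder: ResNet152 pre-trained on ImageNet
resnet152 = models.resnet152(pretrained=True)

# Local feature encoder: a Faster R-CNN detector (COCO weights shown here;
# Visual Genome weights would be loaded from a separate checkpoint)
detector = fasterrcnn_resnet50_fpn(pretrained=True)

# Text encoder: BERT-base (12 layers, 12 heads, 768 hidden units, ~110M params)
bert = BertModel.from_pretrained('bert-base-uncased')
```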
Step 2, obtaining an image feature graph and a text feature graph according to the relations between different nodes in the initial image feature vector and the initial text feature vector.
In a specific embodiment, step 2 may specifically include:
Step 2.1, extracting the first image feature vector v_i^U of the i-th node and the second image feature vector v_j^U of the j-th node from the initial image feature vector.
Step 2.2, performing a dot product operation on the first image feature vector v_i^U and the second image feature vector v_j^U to obtain the first relation E_U:

e_ij^U = v_i^U · v_j^U,

where · indicates the dot product operation.
Step 2.3, constructing the image feature graph from the initial image feature vector and the first relation E_U:

G_V = (V_U, E_U),

where G_V represents the image feature graph, with the features in the initial image feature vector as nodes and the first relation E_U as edges.
Step 2.4, extracting the first text feature vector t_{i1}^S of the i1-th node and the second text feature vector t_{j1}^S of the j1-th node from the initial text feature vector.
Step 2.5, performing a dot product operation on the first text feature vector t_{i1}^S and the second text feature vector t_{j1}^S to obtain the second relation E_S:

e_{i1j1}^S = t_{i1}^S · t_{j1}^S.

Step 2.6, constructing the text feature graph from the initial text feature vector and the second relation E_S:

G_T = (T_S, E_S),

where G_T represents the text feature graph, with the features in the initial text feature vector as nodes and the second relation E_S as edges.
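A minimal sketch of this graph construction (the function name is an illustrative assumption; it treats the graph as fully connected, with pairwise dot products as edge weights):

```python
import torch

def build_feature_graph(nodes: torch.Tensor):
    """Build a feature graph G = (nodes, edges) in which the edge weight
    between nodes i and j is the dot product of their feature vectors,
    mirroring E_U and E_S above. nodes: (n, d) feature matrix."""
    edges = nodes @ nodes.t()  # e_ij = x_i . x_j
    return nodes, edges

# Usage: G_V = build_feature_graph(v_u); G_T = build_feature_graph(t_s)
```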
Step 3, respectively inputting the image feature graph and the text feature graph into the combined graph attention and generalized pooling module to obtain the final image feature vector and the final text feature vector.
Specifically, the image feature graph and the text feature graph are respectively and simultaneously input into the combined graph attention and generalized pooling module, which iteratively updates the image and text feature vectors. Referring to fig. 3, fig. 3 is a schematic diagram of the graph attention mechanism and generalized pooling module according to an embodiment of the present invention. As shown, in this embodiment the image and text feature vectors are updated and aggregated by the constructed graph attention mechanism and the generalized pooling operation. Stacking multiple combined graph attention and generalized pooling modules enables better updating of the vectors.
In a specific embodiment, step 3 may specifically include steps 3.1-3.4, wherein:
and 3.1, inputting the image feature map into a map attention network module, and spreading the initial image feature vector through a multi-head map attention machine mechanism algorithm to obtain an updated image feature vector.
Specifically, step 3.1 may specifically include steps 3.11-3.15, wherein:
and 3.11, simultaneously inputting the initial image feature vectors into each parallel layer in the graph attention network module, and obtaining a first feature quantization result of the h-th layer node by calculating the vector dot product of the weight matrix and the input features.
Specifically, the initial image feature vector V of the image is obtained in step 1 U Is shown as
Figure BDA0003636741670000141
And (3) introducing a multi-head self-attention mechanism to calculate the attention coefficient of each node, and setting the number of parallel layers of the multi-head attention mechanism as H, wherein H is more than or equal to 1 and less than or equal to H. Inputting the image feature vectors into each parallel layer at the same time, and calculating the weight matrixInputting the vector dot product of the features to obtain a first preliminary quantization result of the node features, wherein the calculation mode of the first feature quantization result of the h-th layer node is as follows:
Figure BDA0003636741670000142
wherein,
Figure BDA0003636741670000143
represents the first preliminary quantization result, i.e., the importance of node i to node j in the h-th layer, D U Dimension, W, representing image features q And W k Each represents a learnable weight matrix.
And 3.12, regularizing the first characteristic quantization result to obtain a first multi-head attention weight matrix.
Specifically, the first feature quantization result is regularized to facilitate comparison of parameters between nodes, so as to obtain a first multi-start attention weight matrix α ij The specific calculation method is as follows:
Figure BDA0003636741670000144
wherein softmax represents a normalization function, Ν i A set of neighbor nodes representing node i.
Step 3.13, multiplying the first multi-head attention weight matrix, the learnable weight matrix and the initial image feature vector to obtain a first output feature of each layer, wherein the first output feature is as follows:
Figure BDA0003636741670000151
wherein the head h Representing a first output characteristic, W v h Representing a weight matrix that the h-th layer can learn.
Step 3.14, all the first output features of the same image are spliced (vector end-to-end connection), so as to obtain spliced image features, wherein the spliced image features are as follows:
Figure BDA0003636741670000152
wherein,
Figure BDA0003636741670000153
representing features of the stitched image, W o Representing a learnable weight matrix and concat represents a splicing function.
And 3.15, obtaining an updated image feature vector by the spliced image features through a regularization network.
Specifically, the spliced image features are subjected to final output representation through a regularization network, namely, the updated image feature vectors are refined through an image attention machine mechanism
Figure BDA0003636741670000154
Ith feature of updated image feature vector
Figure BDA0003636741670000155
The specific calculation method is as follows:
Figure BDA0003636741670000156
in which, ReLU is selected as the activation function, and BN layer is used to keep the input of each layer of neural network in the same distribution.
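A minimal PyTorch sketch of steps 3.11-3.15 under the formulas above (scaled dot-product attention per head, softmax normalization, concatenation of heads, then ReLU + BatchNorm); treating the graph as fully connected and all layer shapes are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadGraphAttention(nn.Module):
    """Sketch of the multi-head graph attention update (steps 3.11-3.15)."""
    def __init__(self, dim, num_heads=8):
        super().__init__()
        assert dim % num_heads == 0
        self.h, self.d = num_heads, dim
        self.w_q = nn.Linear(dim, dim, bias=False)
        self.w_k = nn.Linear(dim, dim, bias=False)
        self.w_v = nn.Linear(dim, dim, bias=False)
        self.w_o = nn.Linear(dim, dim, bias=False)
        self.bn = nn.BatchNorm1d(dim)

    def forward(self, x):
        # x: (n, dim) node features; the graph is treated as fully connected
        n = x.size(0)
        dh = self.d // self.h
        q = self.w_q(x).view(n, self.h, dh).transpose(0, 1)  # (h, n, dh)
        k = self.w_k(x).view(n, self.h, dh).transpose(0, 1)
        v = self.w_v(x).view(n, self.h, dh).transpose(0, 1)
        # step 3.11: e_ij = (W_q v_i) . (W_k v_j) / sqrt(D)
        e = q @ k.transpose(1, 2) / (self.d ** 0.5)          # (h, n, n)
        alpha = F.softmax(e, dim=-1)                          # step 3.12
        heads = alpha @ v                                     # step 3.13
        cat = heads.transpose(0, 1).reshape(n, self.d)        # step 3.14
        return self.bn(F.relu(self.w_o(cat)))                 # step 3.15
```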
Step 3.2, inputting the text feature graph into the graph attention network module, and propagating the initial text feature vector through the multi-head graph attention mechanism to obtain the updated text feature vector.
Specifically, step 3.2 may specifically include steps 3.21-3.25, wherein:
Step 3.21, simultaneously inputting the initial text feature vector into each parallel layer in the graph attention network module, and obtaining the second feature quantization result of the h-th layer node by computing the vector dot product of the weight matrices and the input features.
Specifically, the initial text feature vector T_S obtained in step 1 is expressed as a set of node features {t_{i1}^S}. The text feature vectors are input into every parallel layer simultaneously, and the vector dot product of the weight matrices and the input features is computed to obtain the second preliminary quantization result of the node features. The second feature quantization result of the h-th layer node is computed as:

e_{i1j1}^h = (W_q t_{i1}^S) · (W_k t_{j1}^S) / √(D_S),

where e_{i1j1}^h represents the second preliminary quantization result, i.e., the importance of node i1 to node j1 in the h-th layer, D_S represents the dimension of the text feature, and W_q and W_k each represent a learnable weight matrix.
Step 3.22, regularizing the second feature quantization result to obtain the second multi-head attention weight matrix.
Specifically, the second feature quantization result is normalized to make the parameters comparable across nodes, yielding the second multi-head attention weight matrix α_{i1j1}:

α_{i1j1} = softmax(e_{i1j1}^h) = exp(e_{i1j1}^h) / Σ_{j1'∈N_{i1}} exp(e_{i1j1'}^h),

where N_{i1} represents the set of neighbor nodes of node i1.
Step 3.23, multiplying the second multi-head attention weight matrix and the learnable weight matrix with the initial text feature vector to obtain the second output feature of each layer:

head1_h = Σ_{j1∈N_{i1}} α_{i1j1} W_s^h t_{j1}^S,

where head1_h represents the second output feature and W_s^h represents the learnable weight matrix of the h-th layer.
Step 3.24, splicing (connecting end-to-end) all the second output features of the same text to obtain the spliced text feature:

t̄_{i1}^S = W_o · concat(head1_1, …, head1_H),

where t̄_{i1}^S represents the spliced text feature.
Step 3.25, passing the spliced text feature through a regularization network to obtain the updated text feature vector.
Specifically, the spliced text feature is passed through a regularization network to produce the final output representation, i.e., the text feature vector refined by the graph attention mechanism. The i1-th feature t̃_{i1}^S of the updated text feature vector is computed as:

t̃_{i1}^S = BN(ReLU(t̄_{i1}^S)).
and 3.3, inputting the updated image feature vector into a generalized pooling module to obtain a final image feature vector.
Specifically, step 3.3 may specifically include steps 3.31-3.33, wherein:
and 3.31, vectorizing the position index by the updated image feature vector through a triangular position coding strategy to obtain a first position code.
Specifically, the generalized pooling module consists of a triangular position coding strategy and a sequence model based on a bidirectional gating cyclic unit. Firstly, vectorizing a position index by an updated image feature vector through a triangular position coding strategy, wherein the specific calculation mode is as follows:
Figure BDA0003636741670000173
Figure BDA0003636741670000174
wherein p is k Representing a first position code, d p Representing a first given vector dimension, j a Code representing the first position, j a Is given as d p 1/2, d of p Is equal to D U
Step 3.32, after the first position code is converted into vector representation, generating a first pooling coefficient by adopting a sequence model based on a bidirectional gating circulation unit, wherein the specific calculation mode is as follows:
Figure BDA0003636741670000175
wherein,
Figure BDA0003636741670000176
representing a set of first pooling coefficients, MLP representing a multi-layer neural network unit, BiGRU representing a bi-directional gated cyclic unit.
And 3.33, based on the first pooling coefficient, obtaining a final image feature vector according to the updated image feature vector.
Specifically, when the image passes through the generalized pooling module, the generalized pooling module performs sorting operation on the vectors, learns the pooling coefficient of each vector, performs weighted sum on the vectors, and finally outputs the node feature vector of the image
Figure BDA0003636741670000181
The specific calculation method is as followsThe following:
Figure BDA0003636741670000182
θ k =f(k,N),k=1,2,…,N
aggregating all image nodes to obtain a final image feature vector
Figure BDA0003636741670000183
Wherein,
Figure BDA0003636741670000184
representing the final image feature vector, f corresponding to the process of generating pooling coefficients, θ k Representing the first pooling coefficient, i.e. theta k Representing pooled coefficients of the classified kth vector and satisfying
Figure BDA0003636741670000185
Figure BDA0003636741670000186
Representing the updated image feature vector, with the value of N equal to D U
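A minimal sketch of the generalized pooling of steps 3.31-3.33 (sinusoidal position codes fed to a BiGRU and an MLP that produce the coefficients θ_k, then a weighted sum over per-dimension sorted features); the softmax normalization of θ and all sizes are assumptions:

```python
import torch
import torch.nn as nn

class GeneralizedPooling(nn.Module):
    """Sketch of steps 3.31-3.33: learn pooling coefficients theta_k from
    triangular (sinusoidal) position codes via a BiGRU + MLP, then aggregate
    the n node vectors by a weighted sum of their sorted values."""
    def __init__(self, d_pos=32, hidden=32):
        super().__init__()
        self.d_pos = d_pos
        self.gru = nn.GRU(d_pos, hidden, batch_first=True, bidirectional=True)
        self.mlp = nn.Sequential(nn.Linear(2 * hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))

    def positional_encoding(self, n):
        # p_{k,2j} = sin(k / 10000^{2j/d}), p_{k,2j+1} = cos(k / 10000^{2j/d})
        pos = torch.arange(n, dtype=torch.float32).unsqueeze(1)
        j = torch.arange(0, self.d_pos, 2, dtype=torch.float32)
        div = torch.pow(10000.0, j / self.d_pos)
        pe = torch.zeros(n, self.d_pos)
        pe[:, 0::2] = torch.sin(pos / div)
        pe[:, 1::2] = torch.cos(pos / div)
        return pe

    def forward(self, x):
        # x: (n, d) updated node features
        n = x.size(0)
        pe = self.positional_encoding(n).unsqueeze(0)             # (1, n, d_pos)
        out, _ = self.gru(pe)                                     # (1, n, 2*hidden)
        theta = torch.softmax(self.mlp(out).squeeze(-1), dim=-1)  # coefficients sum to 1
        x_sorted, _ = torch.sort(x, dim=0, descending=True)       # per-dimension sort
        return (theta.squeeze(0).unsqueeze(1) * x_sorted).sum(0)  # pooled vector (d,)
```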
Step 3.4, inputting the updated text feature vector into the generalized pooling module to obtain the final text feature vector.
Specifically, step 3.4 may specifically include steps 3.41-3.43, wherein:
Step 3.41, vectorizing the position indices of the updated text feature vector through the triangular position coding strategy to obtain the second position code:

p_{k1,2j_b} = sin(k1 / 10000^(2j_b/d_q)),
p_{k1,2j_b+1} = cos(k1 / 10000^(2j_b/d_q)),

where p_{k1} represents the second position code, d_q represents the second given vector dimension, j_b indexes the components of the second position code and runs up to d_q/2, and the value of d_q is equal to D_S.
Step 3.42, after the second position code is converted into a vector representation, the sequence model based on the bidirectional gated recurrent unit generates the second pooling coefficients:

{θ_1, …, θ_{N1}} = MLP(BiGRU(p_1, …, p_{N1})),

where {θ_{k1}} represents the set of second pooling coefficients.
Step 3.43, based on the second pooling coefficients, obtaining the final text feature vector from the updated text feature vector.
Specifically, when the text passes through the generalized pooling module, the module sorts the vectors, learns the pooling coefficient of each vector, and takes a weighted sum, finally outputting the node feature vector of the text:

t̂_{i1}^S = Σ_{k1=1}^{N1} θ_{k1} t̃_{i1,k1}^S,
θ_{k1} = f(k1, N1), k1 = 1, 2, …, N1,

and all text nodes are aggregated to obtain the final text feature vector T̂_S, where t̂_{i1}^S represents the i1-th feature of the final text feature vector, θ_{k1} represents the second pooling coefficient, t̃_{i1,k1}^S denotes the k1-th sorted value of the i1-th feature of the updated text feature vector, and the value of N1 is equal to D_S.
Step 4, obtaining the comprehensive similarity between the preset image and the preset text based on the first similarity between the local feature vector and the initial text feature vector, the second similarity between the initial image feature vector and the initial text feature vector, and the third similarity between the final image feature vector and the final text feature vector; calculating the loss function from the comprehensive similarity and back-propagating it to update the network parameters.
In one embodiment, step 4 comprises:
Step 4.1, performing cosine similarity calculation on the local feature vector and the initial text feature vector to obtain the first similarity:

S_1(V_L, T_S) = (V_L · T_S) / (||V_L|| ||T_S||),

where S_1(V_L, T_S) represents the first similarity, V_L represents the local feature vector, T_S represents the initial text feature vector, and ||·|| represents the norm of a feature vector.
Step 4.2, performing cosine similarity calculation on the initial image feature vector and the initial text feature vector to obtain the second similarity:

S_2(V_U, T_S) = (V_U · T_S) / (||V_U|| ||T_S||),

where S_2(V_U, T_S) represents the second similarity and V_U represents the initial image feature vector.
Step 4.3, performing cosine similarity calculation on the final image feature vector and the final text feature vector to obtain the third similarity:

S_3(V̂_U, T̂_S) = (V̂_U · T̂_S) / (||V̂_U|| ||T̂_S||),

where S_3(V̂_U, T̂_S) represents the third similarity, V̂_U represents the final image feature vector, and T̂_S represents the final text feature vector.
Step 4.4, obtaining the comprehensive similarity between the preset image and the preset text as the sum of the first, second and third similarities:

S(I, T) = S_1(V_L, T_S) + S_2(V_U, T_S) + S_3(V̂_U, T̂_S),

where S(I, T) represents the comprehensive similarity, I represents the input image to be matched, and T represents the input text to be matched.
Step 4.5, calculating the loss function from the comprehensive similarity and back-propagating it to update the network parameters, which are respectively located in the image feature vector extraction part, the text feature vector extraction part, and the combined graph attention and generalized pooling module.
Specifically, a loss function is introduced to train the model so that matched image-text pairs receive higher similarity scores than unmatched pairs:

L = [d + S(I′, T) - S(I, T)]_+ + [d + S(I, T′) - S(I, T)]_+,

where L represents the loss function, d represents a margin parameter, [x]_+ ≡ max(x, 0), ≡ denotes identity, I′ and T′ denote the hardest mismatched image and text with respect to a matched image-text pair, satisfying I′ = argmax_{X≠I} S(X, T) and T′ = argmax_{Y≠T} S(I, Y), X denotes image information that does not match the given text information, and Y denotes text information that does not match the given image information. With this loss, the comprehensive similarity score of matched image-text pairs becomes higher than that of unmatched pairs.
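A sketch of step 4 under the formulas above; it assumes all vectors have already been projected or pooled to a common dimension, and the batch-wise hardest-negative mining is an assumed implementation detail:

```python
import torch
import torch.nn.functional as F

def composite_similarity(v_l, v_u, v_hat, t_s, t_hat):
    """S(I,T) = S1(V_L,T_S) + S2(V_U,T_S) + S3(V_hat,T_hat), each a cosine
    similarity; all inputs are assumed to share the same last dimension."""
    s1 = F.cosine_similarity(v_l, t_s, dim=-1)
    s2 = F.cosine_similarity(v_u, t_s, dim=-1)
    s3 = F.cosine_similarity(v_hat, t_hat, dim=-1)
    return s1 + s2 + s3

def triplet_loss(sim_matrix, margin=0.2):
    """Hinge loss with the hardest negatives:
    L = [d + S(I',T) - S(I,T)]_+ + [d + S(I,T') - S(I,T)]_+ .
    sim_matrix: (B, B) similarities; diagonal entries are matched pairs."""
    b = sim_matrix.size(0)
    pos = sim_matrix.diag()
    mask = torch.eye(b, dtype=torch.bool, device=sim_matrix.device)
    neg = sim_matrix.masked_fill(mask, float('-inf'))
    hardest_img = neg.max(dim=0).values   # S(I',T): hardest mismatched image
    hardest_txt = neg.max(dim=1).values   # S(I,T'): hardest mismatched text
    return (F.relu(margin + hardest_img - pos) +
            F.relu(margin + hardest_txt - pos)).mean()
```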
Step 5, obtaining the retrieval matching result from the final comprehensive similarity output by the model after the network parameters are updated, wherein the model comprises the image feature vector extraction part, the text feature vector extraction part, and the combined graph attention and generalized pooling module.
Specifically, in the image-to-text retrieval task, for a preset image to be matched, the candidate preset texts are ranked according to the comprehensive similarity obtained in step 4, giving the text retrieval matching result for the preset image, i.e., the preset text with the highest comprehensive similarity score is taken as the final retrieval matching result. Similarly, in the text-to-image retrieval task, for a preset text to be matched, the candidate preset images are ranked according to the final comprehensive similarity output by the model after the network parameters are updated, giving the image retrieval matching result for the preset text, i.e., the preset image with the highest comprehensive similarity score is taken as the final retrieval matching result.
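A sketch of this ranking step (the function name is illustrative):

```python
import torch

def retrieve(query_to_candidates_sim: torch.Tensor, top_k: int = 5):
    """Rank candidates for one query by comprehensive similarity and return
    the indices of the top-k matches (index 0 is the final matching result)."""
    scores, indices = torch.sort(query_to_candidates_sim, descending=True)
    return indices[:top_k], scores[:top_k]
```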
Compared with the retrieval models of the prior art, the image-text retrieval method of this embodiment, based on hierarchical alignment and a generalized pooling graph attention mechanism, uses a hierarchical composite similarity calculation to perform semantic alignment between modalities on image and text feature vectors extracted under different conditions. This enriches the learning of intra-modal and inter-modal interaction information, alleviates the alignment difficulty of the retrieval task, and further improves retrieval accuracy. Compared with the prior art, the method also strengthens the local object semantic relations and the global context information of images and texts, obtaining more complete image and text feature vectors that better represent the image-text matching relationship and improving retrieval accuracy.
Example two
In this embodiment, a simulation experiment is performed on the image-text retrieval method based on hierarchical alignment and a generalized pooling graph attention mechanism of the first embodiment, and the effect of the invention is further illustrated by comparison with existing image-text retrieval methods.
1. Simulation experiment conditions:
Operating system: Ubuntu 16.04, Python 3.6
Experiment platform: PyTorch 1.7.1
Processor: Intel Xeon Gold 6226R CPU, 64GB RAM, 1TB SSD
Graphics card: NVIDIA Tesla A100 GPU
Memory: 64GB
2. Simulation experiment contents:
simulation experiment I: image retrieval text task and accuracy rate experiment of text retrieval image task
The following experiments were all performed in the same experimental environment. The data set 1 and the data set 2 are both image-text retrieval task classic data sets, and the image-text retrieval method based on hierarchical alignment and generalized pooling image attention machine provided by the invention and the reference method belong to image-text retrieval algorithms in various semantic alignment modes.
Table 1 baseline method under data set 1 and method recall comparison proposed by the present invention
Figure BDA0003636741670000231
TABLE 2 Baseline method under data set 2 and recall comparison of methods proposed by the invention
Figure BDA0003636741670000232
From table 1 and table 2, it can be seen that under different data sets, the method provided by the present invention has good performance on both the image retrieval text task and the text retrieval image task, and especially exceeds the baseline method on both the R @1 and Rsum indexes. The experimental results carried out on the data set 1 can respectively reach 81.1, 67.4 and 533.2, and compared with the existing retrieval method (baseline method 1 in the figure), the improvement of 2.3%, 0.8% and 3.8% is respectively realized; in the experimental results performed on the data set 2, the results corresponding to the R @1 index achieved 2.3% and 2% improvement, respectively, compared to the existing retrieval method (baseline method 2 in the figure). The above experimental results show that the retrieval accuracy can be greatly improved by introducing the generalized pooling method into the image-text retrieval task and guiding the updating of the generalized pooling method by utilizing the similarity of the feature vectors.
Simulation experiment II: visualization experiment comparing the importance of the combined graph attention and generalized pooling module in the model
The following experiments were all performed in the same experimental environment. The ablated alternative removes the combined graph attention and generalized pooling module and instead builds the model from a conventional graph attention mechanism and max pooling.
Table 3: visualization experiment comparing the importance of the combined graph attention and generalized pooling module (the table is reproduced as an image in the original publication)
As can be seen from Table 3, for the same image, the top five text descriptions retrieved by the proposed method are all correct, while the third of the top five results of the alternative method is wrong. These experimental results show that introducing the generalized pooling method into the image-text retrieval task and guiding its update with feature-vector similarity greatly improves retrieval accuracy.
In the description of the invention, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implying the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of the present invention, "a plurality" means two or more unless specifically defined otherwise.
In the description herein, references to the terms "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, such terms do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples, and those skilled in the art may combine different embodiments or examples described in this specification.
The foregoing is a detailed description of the invention in connection with specific preferred embodiments, and the specific implementation of the invention is not limited to these descriptions. For those skilled in the art to which the invention pertains, several simple deductions or substitutions can be made without departing from the spirit of the invention, and all such modifications shall be considered to fall within the protection scope of the invention.

Claims (10)

1. An image-text retrieval method based on hierarchical alignment and a generalized pooling graph attention mechanism, characterized by comprising the following steps:
step 1, respectively extracting an initial image feature vector of a preset image and an initial text feature vector of a preset text, wherein the initial image feature vector is obtained by cascading a global feature vector and a local feature vector;
step 2, obtaining an image feature graph and a text feature graph according to the connection relations between different nodes in the initial image feature vector and the initial text feature vector, respectively;
step 3, inputting the image feature graph and the text feature graph respectively into a combined graph attention and generalized pooling module to obtain a final image feature vector and a final text feature vector;
step 4, obtaining a comprehensive similarity between the preset image and the preset text based on a first similarity between the global feature vector and the initial text feature vector, a second similarity between the local feature vector and the initial text feature vector, and a third similarity between the final image feature vector and the final text feature vector; calculating a loss function using the comprehensive similarity, and back-propagating the loss function to update network parameters, wherein the network parameters are located in an image feature vector extraction part, a text feature vector extraction part, and the combined graph attention and generalized pooling module, respectively;
step 5, obtaining a retrieval matching result using the final comprehensive similarity output by the model after the network parameters are updated.
2. The image-text retrieval method based on hierarchical alignment and a generalized pooling graph attention mechanism according to claim 1, wherein step 1 comprises:
step 1.1, extracting the global feature vector $V_G$ and the local feature vector $V_L$ of the preset image;
step 1.2, cascading the global feature vector $V_G$ and the local feature vector $V_L$ to obtain the initial image feature vector;
step 1.3, extracting the initial text feature vector $T_S$ of the preset text.
3. The image-text retrieval method based on hierarchical alignment and a generalized pooling graph attention mechanism according to claim 2, wherein the global feature vector $V_G$ is:

$$V_G = W_g G + b_g,$$

where $V_G$ denotes the global feature vector of the preset image, $W_g$ denotes the first weight matrix with $W_g \in \mathbb{R}^{D \times D_0}$, $D$ denotes the dimension of the output image feature vector, $D_0$ denotes the size of each pixel, $G$ denotes the first output feature with $G \in \mathbb{R}^{D_0 \times m^2}$, $m$ denotes the size of the reconstructed feature map, and $b_g$ denotes the first bias constant;

the local feature vector $V_L$ is:

$$V_L = W_l L + b_l,$$

where $V_L$ denotes the local feature vector of the preset image, $W_l$ denotes the second weight matrix with $W_l \in \mathbb{R}^{D \times D_k}$, $D_k$ denotes the dimension of each region feature, $L$ denotes the second output feature with $L \in \mathbb{R}^{D_k \times k}$, $k$ denotes the number of regions detected from the preset image, and $b_l$ denotes the second bias constant;

the initial image feature vector is:

$$V_U = V_G \,\|\, V_L,$$

where $V_U$ denotes the initial image feature vector, $\|$ denotes the cascading operation, and $V_U \in \mathbb{R}^{D_U}$, with $D_U$ denoting the dimension of the image feature vector;

the initial text feature vector is:

$$T_S = W_S S + b_S,$$

where $T_S$ denotes the initial text feature vector, $S$ denotes the output feature with $S \in \mathbb{R}^{D_1 \times l}$, $D_1$ denotes the dimension of the text features, $l$ denotes the number of words in the text, $W_S$ denotes the weight matrix with $W_S \in \mathbb{R}^{D \times D_1}$, and $b_S$ denotes the third bias constant.
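By way of illustration only, the following PyTorch sketch shows one possible realization of the projections in claim 3. All dimension values are placeholders, the backbone features G, L and S are random stand-ins for a CNN grid, a region detector and a word encoder, and the mean-pooling used to reduce the grid to a single global vector is an assumption not fixed by the claim:

```python
import torch
import torch.nn as nn

# Illustrative sizes only; the claim fixes none of these values.
D, D0, Dk, D1 = 1024, 2048, 2048, 300   # output dim, pixel dim, region dim, word dim
m, k, l = 7, 36, 12                     # feature-map size, region count, word count

W_g = nn.Linear(D0, D)                  # first weight matrix + first bias (V_G = W_g G + b_g)
W_l = nn.Linear(Dk, D)                  # second weight matrix + second bias (V_L = W_l L + b_l)
W_s = nn.Linear(D1, D)                  # text weight matrix + third bias (T_S = W_S S + b_S)

G = torch.randn(m * m, D0)              # first output feature: one vector per grid cell
L = torch.randn(k, Dk)                  # second output feature: one vector per detected region
S = torch.randn(l, D1)                  # text output feature: one vector per word

V_G = W_g(G).mean(dim=0, keepdim=True)  # global feature vector (1 x D); pooling is assumed
V_L = W_l(L)                            # local feature vectors (k x D)
V_U = torch.cat([V_G, V_L], dim=0)      # initial image features: cascade of global and local
T_S = W_s(S)                            # initial text feature vectors (l x D)
```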
4. The image-text retrieval method based on hierarchical alignment and a generalized pooling graph attention mechanism according to claim 1, wherein step 2 comprises:
step 2.1, extracting, from the initial image feature vector, the first image feature vector $v_i$ of the $i$-th node and the second image feature vector $v_j$ of the $j$-th node;
step 2.2, performing a dot-product operation on the first image feature vector $v_i$ and the second image feature vector $v_j$ to obtain the first relation $E_U$;
step 2.3, constructing the image feature graph according to the initial image feature vector and the first relation $E_U$;
step 2.4, extracting, from the initial text feature vector, the first text feature vector $t_{i1}$ of the $i1$-th node and the second text feature vector $t_{j1}$ of the $j1$-th node;
step 2.5, performing a dot-product operation on the first text feature vector $t_{i1}$ and the second text feature vector $t_{j1}$ to obtain the second relation $E_S$;
step 2.6, constructing the text feature graph according to the initial text feature vector and the second relation $E_S$.
5. The image-text retrieval method based on hierarchical alignment and a generalized pooling graph attention mechanism according to claim 4, wherein the first relation $E_U$ is:

$$E_U = v_i \odot v_j,$$

where $\odot$ denotes the dot-product operation;

the image feature graph is:

$$G_V = (V_U, E_U),$$

where $G_V$ denotes the image feature graph, with the features in the initial image feature vector as nodes and the first relation $E_U$ as edges;

the second relation $E_S$ is:

$$E_S = t_{i1} \odot t_{j1};$$

the text feature graph is:

$$G_T = (T_S, E_S),$$

where $G_T$ denotes the text feature graph, with the features in the initial text feature vector as nodes and the second relation $E_S$ as edges.
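A minimal sketch of the graph construction in claims 4 and 5, assuming the node features are stacked row-wise so that all pairwise dot products can be taken at once; the node counts and dimension are placeholders:

```python
import torch

def build_feature_graph(nodes: torch.Tensor):
    """nodes: (num_nodes, D) node features. Returns (nodes, edges), where
    edges[i, j] = v_i . v_j is the pairwise dot-product relation of claim 5."""
    edges = nodes @ nodes.t()          # E_U or E_S: (num_nodes, num_nodes)
    return nodes, edges

V_U = torch.randn(37, 1024)            # initial image feature vectors (k + 1 nodes, assumed)
T_S = torch.randn(12, 1024)            # initial text feature vectors (l word nodes, assumed)
G_V = build_feature_graph(V_U)         # image feature graph G_V = (V_U, E_U)
G_T = build_feature_graph(T_S)         # text feature graph  G_T = (T_S, E_S)
```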
6. The image-text retrieval method based on hierarchical alignment and a generalized pooling graph attention mechanism according to claim 1, wherein step 3 comprises:
step 3.1, inputting the image feature graph into a graph attention network module, and propagating the initial image feature vector through a multi-head graph attention algorithm to obtain an updated image feature vector;
step 3.2, inputting the text feature graph into a graph attention network module, and propagating the initial text feature vector through a multi-head graph attention algorithm to obtain an updated text feature vector;
step 3.3, inputting the updated image feature vector into a generalized pooling module to obtain the final image feature vector;
step 3.4, inputting the updated text feature vector into a generalized pooling module to obtain the final text feature vector.
7. The image-text retrieval method based on hierarchical alignment and a generalized pooling graph attention mechanism according to claim 6, wherein step 3.1 comprises:
step 3.11, simultaneously inputting the initial image feature vector into each parallel layer in the graph attention network module, and obtaining a first feature quantization result for the nodes of the $h$-th layer by computing the vector dot product of the weight matrix and the input features;
step 3.12, regularizing the first feature quantization result to obtain a first multi-head attention weight matrix;
step 3.13, multiplying the first multi-head attention weight matrix, the learnable weight matrix, and the initial image feature vector to obtain the first output feature of each layer;
step 3.14, splicing all the first output features of the same image to obtain the spliced image features;
step 3.15, passing the spliced image features through a regularization network to obtain the updated image feature vector;
step 3.2 comprises:
step 3.21, simultaneously inputting the initial text feature vector into each parallel layer in the graph attention network module, and obtaining a second feature quantization result for the nodes of the $h$-th layer by computing the vector dot product of the weight matrix and the input features;
step 3.22, regularizing the second feature quantization result to obtain a second multi-head attention weight matrix;
step 3.23, multiplying the second multi-head attention weight matrix, the learnable weight matrix, and the initial text feature vector to obtain the second output feature of each layer;
step 3.24, splicing all the second output features of the same text to obtain the spliced text features;
step 3.25, passing the spliced text features through a regularization network to obtain the updated text feature vector.
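The multi-head propagation of claim 7 could look roughly as follows. The bilinear score, softmax as the "regularization" of the quantization results, LayerNorm as the regularization network, and the output projection after splicing are all assumptions the claim does not fix, and the sketch attends over all node pairs rather than masking by the relation graph:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadGraphAttention(nn.Module):
    """Sketch of steps 3.11-3.15 (and, symmetrically, 3.21-3.25)."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.W_score = nn.ModuleList(nn.Linear(dim, dim, bias=False) for _ in range(heads))
        self.W_value = nn.ModuleList(nn.Linear(dim, dim, bias=False) for _ in range(heads))
        self.proj = nn.Linear(heads * dim, dim)   # maps spliced heads back to dim
        self.norm = nn.LayerNorm(dim)             # "regularization network" (assumed LayerNorm)

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (num_nodes, dim)
        outs = []
        for W_s, W_v in zip(self.W_score, self.W_value):
            scores = W_s(x) @ x.t()              # per-head feature quantization (dot products)
            attn = F.softmax(scores, dim=-1)     # regularized into an attention weight matrix
            outs.append(attn @ W_v(x))           # attention x learnable weights x features
        spliced = torch.cat(outs, dim=-1)        # splice all per-head output features
        return self.norm(self.proj(spliced))     # updated feature vectors

x = torch.randn(37, 1024)                        # e.g. the image feature-graph nodes
updated = MultiHeadGraphAttention(dim=1024, heads=8)(x)   # (37, 1024)
```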
8. The image-text retrieval method based on hierarchical alignment and a generalized pooling graph attention mechanism according to claim 6, wherein step 3.3 comprises:
step 3.31, vectorizing the position indexes of the updated image feature vector through a triangular position coding strategy to obtain a first position code;
step 3.32, after converting the first position code into a vector representation, generating the first pooling coefficients using a sequence model based on a bidirectional gated recurrent unit;
step 3.33, based on the first pooling coefficients, obtaining the final image feature vector from the updated image feature vector, wherein the final image feature vector is:

$$\hat{v}_i = \sum_{k=1}^{N} \theta_k \, v_{(k),i}, \qquad \sum_{k=1}^{N} \theta_k = 1,$$

where $\hat{V}$ denotes the final image feature vector, $\hat{v}_i$ denotes the $i$-th feature of the final image feature vector, $\theta_k$ denotes the first pooling coefficient, $v_{(k),i}$ denotes the $i$-th feature of the $k$-th rank-ordered updated image feature vector, and the value of $N$ is equal to $D_U$;
step 3.4 comprises:
step 3.41, vectorizing the position indexes of the updated text feature vector through a triangular position coding strategy to obtain a second position code;
step 3.42, after converting the second position code into a vector representation, generating the second pooling coefficients using a sequence model based on a bidirectional gated recurrent unit;
step 3.43, based on the second pooling coefficients, obtaining the final text feature vector from the updated text feature vector, wherein the final text feature vector is:

$$\hat{t}_{i1} = \sum_{k1=1}^{N1} \theta_{k1} \, t_{(k1),i1}, \qquad \sum_{k1=1}^{N1} \theta_{k1} = 1,$$

where $\hat{T}$ denotes the final text feature vector, $\hat{t}_{i1}$ denotes the $i1$-th feature of the final text feature vector, $\theta_{k1}$ denotes the second pooling coefficient, $t_{(k1),i1}$ denotes the $i1$-th feature of the $k1$-th rank-ordered updated text feature vector, and the value of $N1$ is equal to $D_S$.
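A sketch of the generalized pooling of claim 8, reading "triangular position coding" as sinusoidal encoding and following a GPO-style interpretation in which features are rank-ordered per dimension before the coefficient-weighted sum; both readings, and all sizes, are assumptions:

```python
import torch
import torch.nn as nn

class GeneralizedPooling(nn.Module):
    """Sketch of steps 3.31-3.33 (and 3.41-3.43): sinusoidal position codes feed a
    bidirectional GRU that emits one pooling coefficient per position."""
    def __init__(self, hidden: int = 32):
        super().__init__()
        self.hidden = hidden
        self.gru = nn.GRU(input_size=hidden, hidden_size=hidden,
                          bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * hidden, 1)

    def position_code(self, n: int) -> torch.Tensor:
        pos = torch.arange(n, dtype=torch.float32).unsqueeze(1)      # position indexes
        i = torch.arange(self.hidden // 2, dtype=torch.float32)
        freq = torch.exp(-i * torch.log(torch.tensor(10000.0)) / (self.hidden // 2))
        angles = pos * freq                                          # (n, hidden/2)
        return torch.cat([torch.sin(angles), torch.cos(angles)], dim=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:              # x: (N, D)
        pe = self.position_code(x.size(0)).unsqueeze(0)              # (1, N, hidden)
        h, _ = self.gru(pe)                                          # BiGRU over positions
        theta = torch.softmax(self.out(h).squeeze(-1), dim=-1)       # coefficients sum to 1
        x_sorted, _ = torch.sort(x, dim=0, descending=True)          # rank per dimension
        return (theta.squeeze(0).unsqueeze(1) * x_sorted).sum(dim=0) # final feature (D,)

v_hat = GeneralizedPooling()(torch.randn(37, 1024))                  # final image feature vector
```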
9. The image-text retrieval method based on hierarchical alignment and a generalized pooling graph attention mechanism according to claim 1, wherein step 4 comprises:
step 4.1, performing a cosine similarity calculation on the local feature vector and the initial text feature vector to obtain the first similarity;
step 4.2, performing a cosine similarity calculation on the initial image feature vector and the initial text feature vector to obtain the second similarity;
step 4.3, performing a cosine similarity calculation on the final image feature vector and the final text feature vector to obtain the third similarity;
step 4.4, obtaining the comprehensive similarity between the preset image and the preset text as the sum of the first similarity, the second similarity, and the third similarity;
step 4.5, calculating a loss function using the comprehensive similarity, and back-propagating the loss function to update the network parameters, wherein the network parameters are located in the image feature vector extraction part, the text feature vector extraction part, and the combined graph attention and generalized pooling module, respectively.
10. The image-text retrieval method based on hierarchical alignment and a generalized pooling graph attention mechanism according to claim 9, wherein the first similarity is:

$$S_1(V_L, T_S) = \frac{V_L \cdot T_S}{\|V_L\|\,\|T_S\|},$$

where $S_1(V_L, T_S)$ denotes the first similarity, $V_L$ denotes the local feature vector, $T_S$ denotes the initial text feature vector, and $\|\cdot\|$ denotes the modulus of a feature vector;

the second similarity is:

$$S_2(V_U, T_S) = \frac{V_U \cdot T_S}{\|V_U\|\,\|T_S\|},$$

where $S_2(V_U, T_S)$ denotes the second similarity and $V_U$ denotes the initial image feature vector;

the third similarity is:

$$S_3(\hat{V}, \hat{T}) = \frac{\hat{V} \cdot \hat{T}}{\|\hat{V}\|\,\|\hat{T}\|},$$

where $S_3(\hat{V}, \hat{T})$ denotes the third similarity, $\hat{V}$ denotes the final image feature vector, and $\hat{T}$ denotes the final text feature vector;

the comprehensive similarity is:

$$S(I, T) = S_1(V_L, T_S) + S_2(V_U, T_S) + S_3(\hat{V}, \hat{T}),$$

where $S(I, T)$ denotes the comprehensive similarity, $I$ denotes the input image to be matched, and $T$ denotes the input text to be matched;

the loss function is calculated as follows:

$$L = [d + S(I', T) - S(I, T)]_+ + [d + S(I, T') - S(I, T)]_+,$$

where $L$ denotes the loss function, $d$ denotes the margin parameter, $[x]_+ \equiv \max(x, 0)$, with $\equiv$ denoting identity, and $I'$ and $T'$ denote the hardest mismatched counterparts of the matching image-text pair, satisfying $I' = \arg\max_{X \neq I} S(X, T)$ and $T' = \arg\max_{Y \neq T} S(I, Y)$, where $X$ denotes image information that does not match the given text information and $Y$ denotes text information that does not match the given image information.
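For claims 9 and 10, a minimal sketch of the similarity and loss computation; the mean-pooling that reduces the local and initial feature sets to single vectors and the margin value d = 0.2 are assumptions, and the hardest negatives I' and T' would in practice be mined over a training batch:

```python
import torch
import torch.nn.functional as F

def cosine(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    return F.cosine_similarity(a, b, dim=-1)

def comprehensive_similarity(V_L, T_S, V_U, V_hat, T_hat):
    """S(I, T) = S1 + S2 + S3 per claim 10; the mean-pooled reduction of the
    feature sets to single vectors is an assumption."""
    s1 = cosine(V_L.mean(0), T_S.mean(0))   # local features vs. initial text features
    s2 = cosine(V_U.mean(0), T_S.mean(0))   # initial image vs. initial text features
    s3 = cosine(V_hat, T_hat)               # final image vs. final text feature vector
    return s1 + s2 + s3

def hinge_loss(S_pos, S_neg_img, S_neg_txt, d: float = 0.2):
    """L = [d + S(I', T) - S(I, T)]_+ + [d + S(I, T') - S(I, T)]_+, where S_neg_img
    and S_neg_txt are the scores of the hardest mismatched image I' and text T'."""
    zero = torch.tensor(0.0)
    return torch.maximum(d + S_neg_img - S_pos, zero) + \
           torch.maximum(d + S_neg_txt - S_pos, zero)
```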
CN202210504224.0A 2022-05-10 2022-05-10 Image-text retrieval method based on hierarchical alignment and generalized pooling graph attention mechanism Active CN114896438B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210504224.0A CN114896438B (en) 2022-05-10 2022-05-10 Image-text retrieval method based on hierarchical alignment and generalized pooling graph attention mechanism

Publications (2)

Publication Number Publication Date
CN114896438A true CN114896438A (en) 2022-08-12
CN114896438B CN114896438B (en) 2024-06-28

Family

ID=82722248

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210504224.0A Active CN114896438B (en) Image-text retrieval method based on hierarchical alignment and generalized pooling graph attention mechanism

Country Status (1)

Country Link
CN (1) CN114896438B (en)


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108647350A (en) * 2018-05-16 2018-10-12 中国人民解放军陆军工程大学 Image-text associated retrieval method based on two-channel network
CN109903314A (en) * 2019-03-13 2019-06-18 腾讯科技(深圳)有限公司 A kind of method, the method for model training and the relevant apparatus of image-region positioning
US20210150373A1 (en) * 2019-11-15 2021-05-20 International Business Machines Corporation Capturing the global structure of logical formulae with graph long short-term memory
US20210303921A1 (en) * 2020-03-30 2021-09-30 Beijing Baidu Netcom Science And Technology Co., Ltd. Cross-modality processing method and apparatus, and computer storage medium
CN114168784A (en) * 2021-12-10 2022-03-11 桂林电子科技大学 Layered supervision cross-modal image-text retrieval method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
GUO Jie et al.: "HGAN: Hierarchical Graph Alignment Network for Image-Text Retrieval", IEEE Transactions on Multimedia, vol. 25, 28 February 2023 (2023-02-28), pages 9189-9202 *
ZHANG Tian; JIN Cong; TIE Yun; LI Xiaobing: "Research on a content matching method for audio databases oriented to cross-modal retrieval", Signal Processing, vol. 36, no. 06, 12 June 2020 (2020-06-12), pages 966-976 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115510193A (en) * 2022-10-10 2022-12-23 北京百度网讯科技有限公司 Query result vectorization method, query result determination method and related device
CN115510193B (en) * 2022-10-10 2024-04-16 北京百度网讯科技有限公司 Query result vectorization method, query result determination method and related devices
CN115985509A (en) * 2022-12-14 2023-04-18 广东省人民医院 Medical imaging data retrieval system, method, device and storage medium

Also Published As

Publication number Publication date
CN114896438B (en) 2024-06-28

Similar Documents

Publication Publication Date Title
Wu et al. Unsupervised Deep Hashing via Binary Latent Factor Models for Large-scale Cross-modal Retrieval.
CN108733742B (en) Global normalized reader system and method
CN110162593B (en) Search result processing and similarity model training method and device
CN110298037B (en) Convolutional neural network matching text recognition method based on enhanced attention mechanism
CN111581401B (en) Local citation recommendation system and method based on depth correlation matching
CN109753589A (en) A kind of figure method for visualizing based on figure convolutional network
CN114896438B (en) Image-text retrieval method based on hierarchical alignment and generalized pooling graph attention mechanism
CN114398961A (en) Visual question-answering method based on multi-mode depth feature fusion and model thereof
CN111291188B (en) Intelligent information extraction method and system
CN111209415B (en) Image-text cross-modal Hash retrieval method based on mass training
CN111581364B (en) Chinese intelligent question-answer short text similarity calculation method oriented to medical field
CN112015868A (en) Question-answering method based on knowledge graph completion
Diallo et al. Auto-attention mechanism for multi-view deep embedding clustering
Liu et al. Auto-weighted collective matrix factorization with graph dual regularization for multi-view clustering
CN113191357A (en) Multilevel image-text matching method based on graph attention network
CN110008365B (en) Image processing method, device and equipment and readable storage medium
CN112256866A (en) Text fine-grained emotion analysis method based on deep learning
CN117556067B (en) Data retrieval method, device, computer equipment and storage medium
CN112084312B (en) Intelligent customer service system constructed based on knowledge graph
Li et al. Multi-view clustering via adversarial view embedding and adaptive view fusion
CN116611024A (en) Multi-mode trans mock detection method based on facts and emotion oppositivity
CN110598022A (en) Image retrieval system and method based on robust deep hash network
CN117494051A (en) Classification processing method, model training method and related device
CN110674293B (en) Text classification method based on semantic migration
Yuan Emotional tendency of online legal course review texts based on SVM algorithm and network data acquisition

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant