CN111324752B

CN111324752B - Image and text retrieval method based on graphic neural network structure modeling

Info

Publication number: CN111324752B
Application number: CN202010104275.5A
Authority: CN
Inventors: 张勇东; 张天柱; 魏曦
Original assignee: University of Science and Technology of China USTC
Current assignee: University of Science and Technology of China USTC
Priority date: 2020-02-20
Filing date: 2020-02-20
Publication date: 2023-06-16
Anticipated expiration: 2040-02-20
Also published as: CN111324752A

Abstract

The invention discloses an image and text retrieval method based on graphic neural network structural modeling, which applies an attention mechanism to express fine granularity vision and text elements extracted from pictures and texts, so that the similarity of the pictures and the texts can be better calculated; the method has the advantages that the graph structure is adaptively constructed by utilizing visual and text elements, and the characteristics are updated by using a graph convolution method, so that the relationships between the intra-mode and inter-mode of the visual and text elements can be better considered; and a constraint mechanism is introduced between different picture and text pairs in the alignment process of visual and text elements, so that fine-granularity text elements can be favorably corresponding to corresponding picture areas, the reliability of picture and text level similarity calculation is further improved, and the accuracy of picture and text retrieval is further improved.

Description

Image and text retrieval method based on graphic neural network structure modeling

Technical Field

The invention relates to the technical field of multimedia retrieval, in particular to an image and text retrieval method based on graphic neural network structure modeling.

Background

As massive amounts of multimedia data flow into the internet, multimedia retrieval techniques across a variety of different modalities of data (visual, text, speech, etc.) play an increasingly important role.

Conventional image retrieval techniques often use labels to retrieve pictures. This process tends to be unidirectional and only uses discrete tag data. The bidirectional retrieval of the images and the texts contains richer semantics and accords with the habit of using natural language by human beings. However, there is a large difference between the data of the two different modalities of vision and text. In order to achieve cross-modal retrieval of images and text, it is desirable to well integrate computer vision with natural language understanding.

Recently, a cross-modal retrieval method for images and texts based on deep learning mainly maps the images and texts to a unified embedded space, compares global similarity between visual data and language data, and finally outputs a retrieval result. However, these methods rarely consider the alignment between fine-grained visual elements, text elements. This limits the overall similarity calculation of the image and text, affecting the final retrieval accuracy.

Disclosure of Invention

The invention aims to provide an image and text retrieval method based on graphic neural network structure modeling, which can obtain higher image and text retrieval accuracy.

The invention aims at realizing the following technical scheme:

an image and text retrieval method based on graph neural network structure modeling comprises the following steps:

training phase: extracting visual elements and initial text elements of a single picture and text pair, and introducing an attention mechanism to re-represent each text element; taking visual elements of single picture and text pairs and re-represented text elements as nodes, adaptively constructing a graph structure, and updating each node by utilizing a graph convolution method; calculating the autocorrelation of each initial text element by combining the updated text elements, and making alignment constraints of visual elements and text elements in different pictures and texts; meanwhile, the autocorrelation of the initial text elements is converged to measure the similarity of the whole text and the whole picture, so that a retrieval ordering result is generated according to the similarity; constructing a total loss function by using the loss of the element alignment process and the loss function of the retrieval ordering;

testing: for an input picture to be detected, extracting a corresponding visual element, combining text data in a database, and calculating the autocorrelation of an initial text element of each text data in the same way as a training stage, so as to calculate the similarity of each text data and the picture to be detected; extracting corresponding initial text elements for an input text to be detected, combining picture data in a database, and calculating the autocorrelation of the initial text elements of the text to be detected in the same way as a training stage, so as to calculate the similarity of the text to be detected and the picture; and sorting according to the similarity to obtain a retrieval result.

According to the technical scheme provided by the invention, pictures and texts are expressed as fine-grained visual and text elements under the attention mechanism, and all potential visual and text element alignment can be found; the graph structure is adaptively constructed, and the relations between the inside of the same modal data of visual and text elements and different modal data are better considered; in different picture and text pairs, constraints are added to the text elements, so that the text elements can be better aligned to the corresponding visual elements. Accurate and comprehensive vision and fine granularity alignment of text elements enable the method to better measure similarity of picture and text levels and obtain higher picture and text retrieval accuracy.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the description of the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

Fig. 1 is a flowchart of an image and text retrieval method based on a graph neural network structure modeling according to an embodiment of the present invention.

Detailed Description

The following description of the embodiments of the present invention will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are only some, but not all embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to fall within the scope of the invention.

The embodiment of the invention provides an image and text retrieval method based on graphic neural network structural modeling, as shown in fig. 1, which is a flow of the whole method, and the main processes of training and testing stages are the same, specifically:

testing: for an input picture to be detected, extracting a corresponding visual element, combining text data in a database, and calculating the autocorrelation of an initial text element of each text data in the same way as a training stage, so as to calculate the similarity of each text data and the picture to be detected; extracting corresponding initial text elements for an input text to be detected, combining picture data in a database, and calculating the autocorrelation of the initial text elements of the text to be detected in the same way as a training stage, so as to calculate the similarity of the text to be detected and the picture; and sorting according to the similarity to obtain a corresponding retrieval result.

The method provided by the embodiment of the invention is a fine-grained image and text retrieval method, and the method performs fine-grained visual and text element representation on the images and texts; in the single picture and text pair, fine granularity vision and text element relation extraction and alignment are performed; between different pairs of pictures and texts, a constraint mechanism of visual and text element alignment is performed. Therefore, the alignment relation of fine granularity among all the areas in the picture and all the words in the text can be fully considered, the similarity of a given corresponding picture and text pair can be well calculated, and a retrieval result is given. The method can be applied to a database of internet multimedia application and can be used for feeding back a picture/text retrieval request of a user. In implementation, the method can be installed on a background server of a company in a software mode, performs similarity calculation on a large amount of pictures and text data, and returns the most similar result for searching the pictures or the text.

The training phase and the testing phase are described in detail below.

1. Training stage.

1. The visual features and the text features of the single picture and text pair are mapped to the same space, so that the visual elements and the initial text elements are obtained, and a attention mechanism is introduced to re-represent each text element.

This step is mainly to achieve fine-grained representation and alignment of visual, text elements (picture areas/objects, text words/words) by means of an attention mechanism.

For a given picture I, visual features f= { F of multiple regions of the picture I are extracted using fast R-CNN (a general target detection algorithm based on convolutional neural network CNN) ₁ ,f ₂ ,...,f _n Next, feature F is mapped to the embedding space with full connection layer, denoted v= { V ₁ ,v ₂ ,...,v _n -a }; where n is the number of regions (objects) in the picture that have explicit semantic information. As will be appreciated by those skilled in the art, a region of explicit semantic information means that the semantic information of the corresponding region is known and explicit, e.g., explicit semantic information may be cat, house, etc.; that is, the area where the semantic information is clear may be an area where one cat is circled, an area where one house is circled, or the like.

For the text T, each word in the sentence is represented by an embedded vector, and then mapped to an embedded space by bi-GRU (bi-directional gated cyclic unit network, a general natural language processing network based on cyclic neural network RNN), denoted as e= { E ₁ ,e ₂ ,...,e _k And, where k is the number of words.

Thereafter, for the initial text element e _j Expressed as:

where j=1,.. _ij Is text element e _j With visual element v _i The attention coefficient between them is defined by v= { V ₁ ,v ₂ ,...,v _n} and E＝{e₁ ,e ₂ ,...,e _k Similarity matrix calculation.

2. The visual elements and the re-represented text elements of the single picture and text pair are used as input, a graph structure is adaptively constructed, the elements are used as nodes, cosine similarity among the nodes is used as the continuous edge of the nodes, M edges with the maximum pre-similarity are reserved, and then each node is updated by using a graph convolution method.

In the embodiment of the invention, the visual elements of the single picture and text pair and the re-represented text element { v } ₁ ...v _n ,a ₁ ...a _k As input, adaptively constructing a graph structure (graph structure modeling), taking into account the relationships between the internal and inter-modal visual and textual element modalities; wherein { v ₁ ...v _n The visual element of the picture, { a }, is ₁ ...a _k The text element that is represented again is noted, and the subscript is the element number.

Each characteristic element { v } ₁ ...v _n ,a ₁ ...a _k As nodes of the graph, cosine similarity b between the nodes _ij (i, j represents the serial number of two connected nodes) as the connecting edge of the node, for any node t _p (i.e. t _p ＝v _i Or t _p ＝a _j ) Only the M edges with the maximum cosine similarity, namely M (t _p )＝topm(b _pq ) The specific value of M can be set according to the actual situation.

Method for updating each node { v }, by using graph convolution ₁ ...v _n ,a ₁ ...a _k Characteristics of }, namely:

the parameter beta is the update strength and can be adjusted according to the actual situation. The characteristic element updated thereby is denoted +.>

Or->

t _p And t _q The element types of the node are not limited, namely, the node can be the same type of element (both visual element or text element) or different types of element (namely, one is visual element and the other is text element), so that the characteristics of each node simultaneously consider the relation between the modes of the visual element and the text element.

3. Extracting updated text elements, and calculating the autocorrelation of each initial text element to be used as the constraint of the alignment process of vision and the initial text elements between different pictures and text pairs; meanwhile, the self-correlation of the initial text element is utilized to measure the similarity between the related text element and the whole picture, and a retrieval ordering result is generated according to the similarity.

In the last step, updated visual and text elements are obtained

The updated visual elements and text elements are respectively marked with the subscript as the characteristic serial number; extracting->

To calculate each initial text element e _j Is the autocorrelation of (a):

because of the above

Is based on the element v= { V of the whole picture ₁ ,v ₂ ,...,v _n Finding based on a picture that does not match text +.>

Calculated autocorrelation of the original text element +.>

Autocorrelation of an initial text element calculated on the basis of a matching picture I>

Small, i.e.)>

Based on the unequal relations, the patent considers graph structure modeling between different pictures and text pairs to enable the text element e to be _j Better alignment with the visual elements in the true matching picture I.

In the embodiment of the invention, the triples are used for reflecting the unequal relations, and the sentences t in each training minimum batch (mini-batch) have the following losses:

where θ is the triplet loss margin, [ x ]] ₊ ＝max{0,x}，v _pos and v_neg Representing picture I and mismatch, respectively, matching sentence t

Is a visual element of a picture of (a).

In the embodiment of the invention, the text element e is utilized _j Is related to (a) by (b)

The similarity between the related text elements and the whole picture T is measured, and an average pooling method is adopted in calculation, and is expressed as follows:

likewise, the search ordering objective is represented using triplet loss:

wherein (I, T) represents the matching picture and text pair,

and />

Representing the most difficult non-matching samples to picture I, text T, respectively, i.e +.>

and />

The parameter gamma is the triplet loss margin.

Finally, the total loss function of the whole model in the training stage is as follows: l=l _IT +ηΣ _t l _tR The super-parameter eta is used for regulating two losses l _IT and ηΣ_t l _tR The factor of the weight.

In fig. 1, the pictures actually matching the input text and the pictures not matching the input text are denoted by "+", and "-", respectively. vi+, vi-are visual elements of the corresponding picture, ai+, ai-are text elements calculated using the attention mechanism in combination with visual features of the corresponding picture, ri+, ri-are autocorrelation of the text elements mentioned herein.

2. And (3) a testing stage.

In the test process, a user inputs a picture to be searched or a text to be searched, the invention calculates the similarity between the query content (picture or text) and all the contents (text or picture) to be searched in the database according to the similarity, and a final search sequencing result is generated according to the similarity.

If the text to be retrieved is input, the text element E= { E is extracted ₁ ,e ₂ ,...,e _k Then follow the visual element v= { V for each picture in the database alone ₁ ,v ₂ ,...,v _n Computing the corresponding a ₁ ...a _k Separately constructing the graph structure, and updating to obtain

Finally calculate the correlation->

Thereby calculating the similarity, and sorting according to the similarity to obtain a search result (namely, a series of text data).

If the input is the picture to be retrieved, the visual element V= { V of the picture is extracted ₁ ,v ₂ ,...,v _n Then separately follow the text elements E= { E of each text of the database ₁ ,e ₂ ,...,e _k Calculation of a ₁ ...a _k Separately constructing the graph structure, and updating to obtain

Finally calculate the correlation->

Thereby calculating the similarity, and further sequencing according to the size of the similarity to obtain a retrieval result (namely, a series of picture data).

According to the scheme provided by the embodiment of the invention, the attention mechanism is applied to express fine granularity vision and text elements extracted from the pictures and the texts, so that the similarity of the pictures and the texts can be better calculated; the method has the advantages that the graph structure is adaptively constructed by utilizing visual and text elements, and the characteristics are updated by using a graph convolution method, so that the relationships between the intra-mode and inter-mode of the visual and text elements can be better considered; and a constraint mechanism is introduced between different picture and text pairs in the alignment process of visual and text elements, so that fine-granularity text elements can be favorably corresponding to corresponding picture areas, the reliability of picture and text level similarity calculation is further improved, and the accuracy of picture and text retrieval is further improved.

From the description of the above embodiments, it will be apparent to those skilled in the art that the above embodiments may be implemented in software, or may be implemented by means of software plus a necessary general hardware platform. With such understanding, the technical solutions of the foregoing embodiments may be embodied in a software product, where the software product may be stored in a nonvolatile storage medium (may be a CD-ROM, a U-disk, a mobile hard disk, etc.), and include several instructions for causing a computer device (may be a personal computer, a server, or a network device, etc.) to perform the methods of the embodiments of the present invention.

The foregoing is only a preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions easily contemplated by those skilled in the art within the scope of the present invention should be included in the scope of the present invention. Therefore, the protection scope of the present invention should be subject to the protection scope of the claims.

Claims

1. An image and text retrieval method based on graph neural network structure modeling is characterized by comprising the following steps:

testing: for an input picture to be detected, extracting a corresponding visual element, combining text data in a database, and calculating the autocorrelation of an initial text element of each text data in the same way as a training stage, so as to calculate the similarity of each text data and the picture to be detected; extracting corresponding initial text elements for an input text to be detected, combining picture data in a database, and calculating the autocorrelation of the initial text elements of the text to be detected in the same way as a training stage, so as to calculate the similarity of the text to be detected and the picture; sorting according to the similarity to obtain a retrieval result;

extracting visual elements and initial text elements of a single picture and text pair includes:

for a given picture I, extracting visual features f= { F for multiple regions of picture I using fast R-CNN ₁ ,f ₂ ,...,f _n Next, feature F is mapped to the embedding space with full connection layer, denoted v= { V ₁ ,v ₂ ,...,v _n -a }; wherein n is the number of regions or targets with definite semantic information in the picture;

for the text T, each word in the sentence is first represented by an embedded vector, and then mapped to an embedded space by bi-GRU, denoted as e= { E ₁ ,e ₂ ,...,e _k -where k is the number of words;

the attention-directed mechanism re-represents each text element including:

for the initial text element e _j The attention-directing mechanism is re-expressed as:

where j=1,.. _ij Is the initial text element e _j With visual element v _i The attention coefficient between them is defined by v= { V ₁ ,v ₂ ,...,v _n} and E＝{e₁ ,e ₂ ,...,e _k Calculating a similarity matrix;

visual element and re-represented text element { v ] of single picture and text pair ₁ ...v _n ,a ₁ ...a _k As input, adaptively building a graph structure; wherein { v ₁ ...v _n The visual element of the picture, { a }, is ₁ ...a _k The text element is re-represented, and the subscript is an element number;

each element { v } ₁ ...v _n ,a ₁ ...a _k Using cosine similarity between nodes as connecting edge of nodes, and for any node t _p Only the M edges with the maximum cosine similarity, namely M (t _p )＝topm(b _pq )，t _p ＝v _i Or t _p ＝a _j ，j＝1,...,k，i＝1,...,n；

And updating the characteristics of each node by using a graph convolution method, namely:

the parameter beta is the update strength, and the updated characteristic is marked as +.>

2. The method for image and text retrieval based on structural modeling of a neural network of claim 1, wherein,

the updated text element is noted as

The subscript is the sequence number of the text element, and each initial text element e is calculated _j Is the autocorrelation of (a):

wherein ,

is based on the feature V= { V of the whole picture ₁ ,v ₂ ,...,v _n Finding based on a picture that does not match text +.>

Calculated autocorrelation of the original text element +.>

Small, i.e.)>

Using triples to reflect the above-mentioned inequality relationships, there is a loss of sentences t in each training minimum batch as follows:

where θ is the triplet loss margin, v _pos and v_neg Representing picture I and mismatch, respectively, matching sentence t

Is a visual element of a picture of (a).

3. The image and text retrieval method based on neural network structure modeling of claim 2, wherein the autocorrelation of the initial text element is utilized

wherein I represents text.

4. A method of image and text retrieval based on modeling of a neural network structure as claimed in claim 3, wherein the retrieval ordering objective is represented using triplet loss:

wherein (I, T) represents the matching picture and text pair,

and />

Respectively representing unmatched samples which are the most difficult to match with the picture I and the text T, wherein the parameter gamma is a triplet loss allowance;

the total loss function for the training phase is: l=l _IT +η∑ _t l _tR η is a weight factor.