CN115658934A - Image-text cross-modal retrieval method based on multi-class attention mechanism - Google Patents


Info

Publication number
CN115658934A
Authority
CN
China
Prior art keywords
similarity
text
image
attention
representing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211252004.XA
Other languages
Chinese (zh)
Inventor
代翔
周子杰
潘磊
孟令宣
钟海玲
张琳科
崔莹
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
CETC 10 Research Institute
Original Assignee
CETC 10 Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by CETC 10 Research Institute filed Critical CETC 10 Research Institute
Priority to CN202211252004.XA
Publication of CN115658934A
Legal status: Pending


Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides an image-text cross-modal retrieval method based on a multi-class attention mechanism, belonging to the technical field of image-text information retrieval, which addresses the large information redundancy, poor retrieval precision and semantic gap of traditional image-text retrieval. The method acquires an image-text data set and divides it into a training set and a test set; feature vectors required by the feature engineering are extracted through separate image and text input channels, a self-attention mechanism enhances the internal correlation of the respective features, a graph attention network builds constraint relations within each modality, and a first similarity is computed; a cross-modal attention network between the modalities is then designed by combining the region features and the word features, an attention separation method aggregates the alignment vectors to compute a second similarity, and finally the two similarities are fused to train the network, after which retrieval performance is computed, tested and evaluated. The method achieves high precision, high matching quality and high robustness, and is suitable for image-text mutual retrieval tasks in many fields.

Description

Image-text cross-modal retrieval method based on multi-class attention mechanism
Technical Field
The invention belongs to the technical field of image-text information retrieval, and particularly relates to an image-text cross-modal retrieval method based on a multi-class attention mechanism.
Background
With the development of intelligent devices and social networks, multimedia data on the Internet in fields such as digital libraries, intellectual property, medical health, fashion design, electronic commerce, environmental monitoring, earth information systems, communication systems and military systems is growing explosively. Large amounts of data describing the same object span various modalities such as text, images, video and audio; although their formats differ, these data are semantically connected.
Traditional information retrieval is single-modality retrieval, which requires the retrieval set and the query set to belong to the same modality: text is retrieved with text, images with images, and videos with videos. In image retrieval, single-modality techniques mainly include keyword-based retrieval, retrieval based on low-level image features, and semantic-model-based retrieval. These techniques achieve good results within a single modality, but the information they return is limited to data of one modality. Today, single-modality retrieval cannot meet users' demands for efficient, comprehensive and accurate retrieval, so effectively describing multiple modalities of data for the same object has become an important subject in the field of information retrieval.
Faced with massive, interconnected multimedia data, users need to find related auxiliary data of other modalities from data of one modality, for example extracting related text from a picture or retrieving related pictures from text. For the same object, text and pictures are different modalities, and the process of searching between such modalities is called cross-modal retrieval.
Research on the cross-modal retrieval problem has produced a variety of image-text retrieval methods, which can be roughly divided into three directions: coarse-grained retrieval based on global features, fine-grained retrieval based on local features, and multi-granularity combined retrieval.
Global coarse-grained retrieval mainly extracts holistic representations from whole images and whole sentences and projects them end to end into a shared subspace, where the similarity of the visual and textual embeddings can be computed directly with a similarity function. Early common-space learning typically relied on classical canonical correlation analysis (CCA), which encodes cross-modal data into a highly correlated common subspace by linear projection; DCCA learns the maximum correlation between image and text representations by stacking several nonlinear transformation layers. Later, many researchers introduced DNNs into the mapping process, combining DNNs with CCA into deep canonical correlation analysis, or proposed encoding images and texts with a CNN and an LSTM, respectively. Because both CNNs and LSTMs have strong representational power, images and texts obtain stronger feature expressions, which improves the performance of the corresponding models. VSE++ subsequently introduced the concept of hard negatives, which became the basis of many subsequent studies.
Although global coarse-grained retrieval can compute similarity through such mappings, it fails to extract much of the information in images and texts, so later research introduced local matching algorithms to better bridge the visual-semantic gap. Compared with a conventional CNN, region-based image-text matching uses object detection to locate objects in the image, while the text encoder outputs a word-feature matrix rather than a single global sentence vector, so that more accurate fine-grained matches between image regions and sentence words can be obtained with a local alignment algorithm. Methods were then proposed to detect objects in the image and encode them into a subspace, where the pairwise image-text similarity is computed by summarizing the similarities of all region-word pairs. SCAN introduced a bottom-up attention scheme, encoding images into region-level features with a pre-trained Faster R-CNN and texts into word-level features; PFAN adds the position information of local targets to the visual representation to improve retrieval; VSRN applies GCNs to perform visual reasoning and learn relationships between regions that are consistent with the textual modality.
In image-text matching, neither coarse-grained nor fine-grained methods alone can satisfy the demand for accurate matching; both global context alignment and matching between regional objects and words are required. Multi-granularity matching methods were therefore developed to learn the correlation between the global context and region-word-level concepts, aiming not only to learn the overall correlation between modalities but also to find fine-grained alignments of regional concepts accurately. One approach analyses the overall similarity and the local similarity of images and combines them into the final image-text similarity; GSLS learns global image properties with a global CNN and encodes local region properties with a Faster R-CNN model; CRAN adds relation matching on top of global-local matching and thus obtains accurate cross-modal correlation. While these methods successfully learn consistent matches at both coarse and fine granularity, computing global and local similarities separately limits the models' ability to learn the relationship between local objects and global information.
Multi-granularity retrieval can match images and texts to a great extent, but it lacks information interaction within and between modalities. Within a modality, self-attention and graph attention mechanisms strengthen the relations of single-modality context information; between modalities, a cross-modal attention mechanism aligns region information with word information and improves retrieval reliability. Attention-based approaches have also been adopted in image-text matching in recent years. Some focus on interactions between modalities to find all possible alignments between image regions and sentence words. SCAN is a typical method that associates regions with words of different weights; it has been highly influential and serves as the basis for improvement or comparison in a large number of experiments. BFAN further handles semantic misalignment and improves performance by attending only to relevant segments rather than all segments; IMRAM proposes an iterative matching scheme with recurrent attention memory that captures the correspondence between images and text through multi-step alignment. However, existing attention-based methods focus mainly on region-level relationships and pay less attention to the relationship between regional objects and global concepts; in multi-granularity scenarios, a multi-class attention method therefore has broader application prospects.
Disclosure of Invention
The invention aims to provide a multi-class-attention-based image-text cross-modal retrieval method with high precision, high matching quality and high robustness, addressing the large information redundancy, poor retrieval precision and semantic gap between modalities in multi-modal image-text data, so as to make up for the shortcomings of existing image-text retrieval techniques.
The invention adopts the following technical scheme to realize the purpose:
a multi-class attention mechanism-based image-text cross-modal retrieval method comprises the following steps:
acquiring an input image-text data set, and dividing the image-text data set into a training set and a test set in proportion;
extracting global feature vectors and regional feature vectors of images from image input channels of the training set, and extracting word feature vectors of texts from text input channels of the training set;
enhancing internal relevance of the global feature vector, the regional feature vector, and the word feature vector using a self-attention mechanism;
respectively constructing constraint relations in respective modalities for the enhanced global feature vector, the enhanced regional feature vector and the enhanced word feature vector through a graph attention network to obtain a graph-text feature pair, and performing similarity calculation on the graph-text feature pair to obtain a first similarity;
designing a cross-modal attention network among the modes by combining the enhanced region characteristic vectors and the enhanced word characteristic vectors, aggregating all alignment vectors by using an attention separation method, inhibiting meaningless alignment, and calculating the similarity to obtain a second similarity;
and fusing the first similarity and the second similarity, training the graph attention network and the cross-modal attention network, and evaluating retrieval performance by calculating the image-text feature pair similarity in the test set.
Specifically, the image-text data set is divided into a training set X containing N samples and a test set Y containing M samples, where:

X = \{(x_n, z_n)\}_{n=1}^{N}, \quad x_n \in \mathbb{R}^{d_1}, \ z_n \in \mathbb{R}^{d_2}

Y = \{(y_m, l_m)\}_{m=1}^{M}, \quad y_m \in \mathbb{R}^{d_1}, \ l_m \in \mathbb{R}^{d_2}

where \mathbb{R} denotes the real number space, d_1 the image dimension of a sample, d_2 the text dimension of a sample, x_n an image in the training set, z_n a text in the training set, y_m an image in the test set, and l_m a text in the test set.
Further, for the training set, a global feature vector G and a region feature vector R of each image are extracted with ResNet152 and bottom-up-attention-based Faster R-CNN, respectively. For the global feature vector G, the last fully connected layer of the ResNet152 network is removed and the output is reduced to d dimensions, giving the feature expression G = \{g_1, \dots, g_k\}, where g_i (i = 1, \dots, k) denotes a feature of the global feature map. For the region feature vector R, K region feature expressions are extracted from each input image with the bottom-up attention method, and a fully connected layer converts the output into d-dimensional vectors, giving the feature expression R = \{r_1, \dots, r_K\}, where r_i (i = 1, \dots, K) denotes a feature of the region feature map. The global feature vector G and the region feature vector R are embedded into a shared latent space through full connection, as follows:

V_g = W_g G + b_g

V_r = W_r R + b_r

where W_g, W_r denote weight matrices and b_g, b_r denote bias matrices.
Further, for the text input channel of the training set, each sentence is decomposed into L words by serialized segmentation, and the serialized word feature vectors are fed into a GRU network; the feature expression T_r = \{t_1, \dots, t_L\} of the words is obtained by averaging the forward and backward hidden states over all time steps, where t_i (i = 1, \dots, L) denotes the feature of each word.
Further, the self-attention mechanism is computed from three matrix variables, query, key and value, mapping a query matrix and a set of key-value pair matrices to an output. The output of the attention function is a weighted sum of the value matrix, with the weights determined by the query matrix and its corresponding key matrix, which can be expressed as:

\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{D_k}}\right)V

where softmax denotes the normalized exponential function, Q denotes the product of the input vector and the weight matrix W_q, K^T the product of the input vector and the weight matrix W_K, V the product of the input vector and the weight matrix W_V, and D_k the scaling factor.
Further, the construction process of the graph attention network is as follows:
given a full connection graph G = (V, E), wherein V = { V = { V) i ,...,v n Is a node characteristic, E is an edge set; the attention coefficient is calculated and normalized using the softmax function as follows:
Figure BDA0003888444030000052
wherein softmax represents a normalized exponential function, W q And W k Denotes a learnable parameter, v i Representing different node characteristics, D representing a scaling factor, a ij Representing the weight coefficients of any two nodes.
Further, according to the construction process of the graph attention network, a global graph G_g = (V_g, E_g) is constructed for the enhanced global feature vector, where the edge set E_g is the similarity matrix obtained by computing the similarity of every pair of global features v_i^g and v_j^g, as follows:

E_g^{ij} = \mathrm{sim}(v_i^g, v_j^g)

where v_i^g and v_j^g denote coarse-grained node features and E_g denotes the similarity matrix;
for the enhanced region feature vector, constructing a region map G r =(V r ,E r ) In which E r Is a reaction of with E g The edge sets of the similarity matrixes of the region features with the same calculation method;
global graph G to be constructed g And region map G r Respectively sent into the graph attention network to obtain the characteristics enhanced by graph attention
Figure BDA0003888444030000061
And
Figure BDA0003888444030000062
and sending the image data into the attention network again to obtain the multi-scale imageFeature vector of road
Figure BDA0003888444030000063
And fusing the multi-scale features into image-side features through a fusion algorithm
Figure BDA0003888444030000064
For the enhanced word feature vector, a full-connected graph G is constructed in the same way w =(T w ,E w ) And G is w Sending the graph attention network into the text channel to obtain the text end characteristics
Figure BDA0003888444030000065
For the image end characteristics
Figure BDA0003888444030000066
And text end features
Figure BDA0003888444030000067
And performing similarity evaluation, and obtaining the first similarity by adopting a cosine similarity function, wherein the formula is as follows:
Figure BDA0003888444030000068
in the formula (I), the compound is shown in the specification,
Figure BDA0003888444030000069
a feature vector representing a channel of the image,
Figure BDA00038884440300000610
a feature vector representing a text channel.
Further, the cross-modal attention network between the modalities aggregates all alignment vectors with an attention separation method; specifically, a word-to-region attention mode is used to match each region with its corresponding word, including:

Two fully connected layers convert the region representations and the word representations into the same dimension m, and the cosine similarities of all possible region-word pairs are computed as:

c_{ij} = \frac{v_i^{T} w_j}{\lVert v_i\rVert\,\lVert w_j\rVert}

where v_i denotes the i-th region, w_j denotes the j-th word, and c_{ij} denotes the cosine similarity between the i-th region and the j-th word.

From the cosine similarity matrix, the attention weight a_{ij} of each region is obtained as:

a_{ij} = \frac{\exp(\lambda \bar{c}_{ij})}{\sum_{i=1}^{K} \exp(\lambda \bar{c}_{ij})}

where \lambda denotes a weight coefficient and \bar{c}_{ij} denotes the normalization of the cosine similarity matrix, expressed as:

\bar{c}_{ij} = \frac{[c_{ij}]_+}{\sqrt{\sum_{i=1}^{K} [c_{ij}]_+^{2}}}

where [c_{ij}]_+ denotes the maximum of c_{ij} and 0.

The attended region feature of the j-th word is then computed, and a local similarity score is computed from it, as follows:

\hat{v}_j = \sum_{i=1}^{K} a_{ij} v_i

with the local similarity score s_j of the j-th word computed from \hat{v}_j and w_j through the parameter matrix W_l, where the denominator is the corresponding matrix norm.
Further, the attention separation method is specifically as follows: given the computed local and global similarity vectors \{s_1, \dots, s_n\}, a weight is computed for each similarity s_n as:

\beta_n = \mathrm{Sigmoid}\big(\mathrm{BN}(s_n)\big)

where Sigmoid() denotes the Sigmoid function and BN() denotes batch normalization; the desired similarity vectors are then aggregated with these weights, and the similarity is computed to obtain the second similarity, as follows:

s_{all} = \sum_{m} \beta_m s_m

where s_m denotes the different similarity vectors, \beta_m denotes the weight value, and s_{all} denotes the final aggregated similarity, i.e. the second similarity.
Preferably, retrieval performance is evaluated by ranking the similarity scores, specifically: for the image-text data in the test set, after the multi-class attention network, the R@K recall rate records the correct retrieval results among the top K returned items, with emphasis on R@1, R@5 and R@10; the total R@K value Rsum over image-to-text and text-to-image retrieval is also recorded, as follows:

Rsum = (R@1 + R@5 + R@10)_{I2T} + (R@1 + R@5 + R@10)_{T2I}

where I2T denotes retrieving relevant texts with an image as the query sample, and T2I denotes retrieving relevant images with a text as the query sample.
In summary, due to the adoption of the technical scheme, the invention has the following beneficial effects:
1. The method constructs separate training channels for the image-text training set; it introduces a self-attention mechanism and a graph attention mechanism so that intra-modal attention effectively combines context information and tightens the feature relations within each modality; it also introduces a cross-modal attention mechanism that associates and matches the local information of the image channel with that of the text channel to improve retrieval accuracy.
2. In order to make up for the defects of coarse granularity and fine granularity in feature matching, the invention designs a multi-granularity method considering global and local semantics, so that the interaction between coarse granularity information and fine granularity information is higher, and global coarse granularity features and local fine granularity features are respectively extracted from an image channel and combined into multi-granularity features.
3. The invention introduces a cross-modal attention mechanism, is different from the direct construction of a public subspace, and adds local feature alignment before constructing the subspace so that the feature matching of different modes is more accurate, and the defect of feature heterogeneity caused by the direct construction of the subspace is overcome.
4. The method is suitable for image-text mutual retrieval in many fields. Its core is to combine a multi-class attention mechanism with a multi-granularity alignment representation learning model, accurately mining the features and context information of image and text data within their respective modalities and performing local association matching of features between modalities; it is effective for any task involving image-text cross-modal retrieval.
Drawings
FIG. 1 is a schematic flow chart of a retrieval method of the present invention;
FIG. 2 is a schematic cross-modal local feature alignment process according to the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. The components of embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations.
Thus, the following detailed description of the embodiments of the present invention, presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be obtained by a person skilled in the art without inventive step based on the embodiments of the present invention, are within the scope of protection of the present invention.
As shown in fig. 1, a method for cross-modal retrieval of graphics based on multiple attention mechanisms includes:
acquiring an input image-text data set, and dividing the image-text data set into a training set and a test set in proportion;
extracting global feature vectors and regional feature vectors of images from an image input channel in a training set, and extracting word feature vectors of texts from a text input channel;
enhancing the internal relevance of the global feature vector, the regional feature vector and the word feature vector using a self-attention mechanism;
respectively constructing constraint relations in respective modalities for the enhanced global feature vector, the enhanced regional feature vector and the enhanced word feature vector through a graph attention network to obtain a graph-text feature pair, and performing similarity calculation on the graph-text feature pair to obtain a first similarity;
combining the enhanced region characteristic vectors and the enhanced word characteristic vectors, designing a cross-modal attention network among the modalities, using an attention separation method, aggregating all the alignment vectors, inhibiting meaningless alignment, and performing similarity calculation to obtain a second similarity;
and fusing the first similarity and the second similarity, training a graph attention network and a cross-modal attention network, and evaluating retrieval performance by calculating the similarity of the image-text characteristics in the test set.
This example will explain the method in detail according to the specific operation steps of the method.
Step 1: according to a set proportion, an image-text data set with (N + M) samples is divided into a training set X containing N samples and a test set Y containing M samples, as follows:

X = \{(x_n, z_n)\}_{n=1}^{N}, \quad x_n \in \mathbb{R}^{d_1}, \ z_n \in \mathbb{R}^{d_2}

Y = \{(y_m, l_m)\}_{m=1}^{M}, \quad y_m \in \mathbb{R}^{d_1}, \ l_m \in \mathbb{R}^{d_2}

where \mathbb{R} denotes the real number space, d_1 the image dimension of a sample, d_2 the text dimension of a sample, x_n an image in the training set, z_n a text in the training set, y_m an image in the test set, and l_m a text in the test set.
Step 2: features are extracted from the training set with different feature engineering for the image and text input channels. For the image channel, ResNet152 and bottom-up-attention-based Faster R-CNN are used to extract the global-level features G and the region-level features R, respectively. For global feature extraction, the last fully connected layer of the ResNet152 network is removed and the output is reduced to d dimensions, giving the feature expression G = \{g_1, \dots, g_k\}, where g_i denotes a feature of the global feature map. For local feature extraction, K region feature expressions are extracted from each input image with the bottom-up attention method, and a fully connected layer converts the output into d-dimensional vectors as the local feature expression R = \{r_1, \dots, r_K\}, where r_i denotes a local region feature. To embed them into the shared latent space, full connection is applied:

V_g = W_g G + b_g

V_r = W_r R + b_r

where W_g, W_r denote weight matrices and b_g, b_r denote bias matrices.

For the text channel, each sentence is decomposed into L words by serialized segmentation, and the serialized word feature vectors are fed into a GRU network; the representation T_r = \{t_1, \dots, t_L\} of the words is obtained by averaging the forward and backward hidden states over all time steps, where t_i denotes the feature of each word.
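The following PyTorch sketch illustrates this step under stated assumptions: the ResNet152 global features and Faster R-CNN region features are assumed to be extracted upstream, and the dimensions (2048-dimensional inputs, d = 1024, 300-dimensional word embeddings) are illustrative choices rather than values fixed by the patent.

```python
import torch
import torch.nn as nn

class ImageProjection(nn.Module):
    """Project pre-extracted global (ResNet152) and region (Faster R-CNN)
    features into the shared d-dimensional latent space: V = W*X + b."""
    def __init__(self, global_dim=2048, region_dim=2048, d=1024):
        super().__init__()
        self.fc_global = nn.Linear(global_dim, d)   # V_g = W_g G + b_g
        self.fc_region = nn.Linear(region_dim, d)   # V_r = W_r R + b_r

    def forward(self, G, R):
        # G: (batch, k, global_dim), R: (batch, K, region_dim)
        return self.fc_global(G), self.fc_region(R)

class TextEncoder(nn.Module):
    """Encode a tokenized sentence with a bidirectional GRU and average the
    forward/backward hidden states of every time step into word features."""
    def __init__(self, vocab_size, embed_dim=300, d=1024):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.gru = nn.GRU(embed_dim, d, batch_first=True, bidirectional=True)

    def forward(self, tokens):
        # tokens: (batch, L) word indices
        h, _ = self.gru(self.embed(tokens))   # (batch, L, 2*d)
        fwd, bwd = h.chunk(2, dim=-1)         # forward / backward hidden states
        return (fwd + bwd) / 2                # T_r = {t_1, ..., t_L}
```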
Step 3: a self-attention mechanism is used to enhance the relevance of the internal features of the extracted image and text features. Self-attention relies on three matrix variables, query, key and value, and can be viewed as mapping a query matrix and a set of key-value pair matrices to an output. The output of the attention function is a weighted sum of the values, where the weights are determined by the query and its corresponding key, which can be expressed as:

\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{D_k}}\right)V

where softmax denotes the normalized exponential function, Q denotes the product of the input vector and the weight matrix W_q, K^T the product of the input vector and the weight matrix W_K, V the product of the input vector and the weight matrix W_V, and D_k the scaling factor.
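A minimal sketch of this scaled dot-product self-attention; the weight matrices W_q, W_k, W_v correspond to the learnable parameters mentioned above, and the batched tensor layout is an assumption.

```python
import torch
import torch.nn.functional as F

def self_attention(x, W_q, W_k, W_v):
    """Scaled dot-product self-attention: softmax(Q K^T / sqrt(D_k)) V.
    x: (batch, n, d) features; W_q / W_k / W_v: (d, d) weight matrices."""
    Q = x @ W_q                                   # queries
    K = x @ W_k                                   # keys
    V = x @ W_v                                   # values
    D_k = Q.size(-1)                              # scaling factor
    scores = Q @ K.transpose(-2, -1) / D_k ** 0.5
    return F.softmax(scores, dim=-1) @ V          # attention-weighted values
```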
Step 4: constraint relations within each modality are constructed from the self-attention-enhanced features through a graph attention network, which is first built as follows:

Given a fully connected graph G = (V, E), where V = \{v_1, \dots, v_n\} are the node features and E is the edge set, the attention coefficients are computed and normalized with the softmax function as follows:

a_{ij} = \mathrm{softmax}\!\left(\frac{(W_q v_i)(W_k v_j)^{T}}{\sqrt{D}}\right)

where softmax denotes the normalized exponential function, W_q and W_k denote learnable parameters, v_i and v_j denote different node features, D denotes the scaling factor, and a_{ij} denotes the weight coefficient between any two nodes.
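A sketch of one pass of this graph attention computation over a fully connected graph; the final aggregation `a @ V` (updating each node with its attention-weighted neighbours) is the standard graph-attention update and is an assumption, since the text only spells out the coefficient a_ij.

```python
import torch
import torch.nn.functional as F

def graph_attention(V, W_q, W_k):
    """Graph attention over a fully connected graph G = (V, E):
    a_ij = softmax_j((W_q v_i)(W_k v_j)^T / sqrt(D)).
    V: (n, d) node features; W_q, W_k: (d, d) learnable parameters."""
    Q = V @ W_q
    K = V @ W_k
    D = Q.size(-1)
    a = F.softmax(Q @ K.t() / D ** 0.5, dim=-1)   # attention coefficients a_ij
    return a @ V                                   # assumed node update step
```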
In the image channel, two kinds of features are used, coarse-grained and fine-grained. Following the construction process of the graph attention network, a global graph G_g = (V_g, E_g) is built for the coarse-grained global features, where the edge set E_g is the similarity matrix obtained by computing the similarity of every pair of global features v_i^g and v_j^g, as follows:

E_g^{ij} = \mathrm{sim}(v_i^g, v_j^g)

where v_i^g and v_j^g denote coarse-grained node features and E_g denotes the similarity matrix. Related regions thus receive higher edge matching scores, which yields the coarse-grained graph. This process determines the extent to which each pixel is affected by other pixels and thereby facilitates the learning of pixel-by-pixel relationships.
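A small sketch of building such an edge set as a pairwise similarity matrix; cosine similarity is assumed here, since the text states only that the edges combine the similarities of feature pairs.

```python
import torch
import torch.nn.functional as F

def build_edge_set(node_feats):
    """Edge set E as the pairwise similarity matrix of node features,
    E[i, j] = sim(v_i, v_j). Cosine similarity is an assumed choice.
    node_feats: (n, d) tensor of node features."""
    v = F.normalize(node_feats, p=2, dim=-1)   # unit-normalize each node
    return v @ v.t()                            # (n, n) similarity matrix
```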
To accurately grasp the coarse-grained context information at the fine-grained level, the graph attention network is also used to capture regional relations: a region graph G_r = (V_r, E_r) is constructed for the enhanced region feature vectors, where E_r is the edge set given by the similarity matrix of the region features, computed in the same way as E_g.

The global graph G_g and the region graph G_r are fed into the graph attention network to obtain the graph-attention-enhanced coarse-grained features \hat{V}_g and fine-grained features \hat{V}_r. To add context information to the fine-grained features, the enhanced features are concatenated into the multi-scale image-channel feature vector V_{ms}, and a fusion algorithm fuses the multi-scale features into the final image-end feature \tilde{V}.
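A sketch of this multi-scale fusion under stated assumptions: the patent does not specify the fusion algorithm, so mean-pooling each enhanced feature set followed by concatenation and a linear layer is used here purely as an illustration.

```python
import torch
import torch.nn as nn

class MultiScaleFusion(nn.Module):
    """Fuse graph-attention-enhanced coarse-grained (global) and fine-grained
    (region) image features into a single image-end feature. The fusion
    algorithm itself is not spelled out in the text; pooling + concatenation
    + a linear layer is an assumed stand-in."""
    def __init__(self, d=1024):
        super().__init__()
        self.fuse = nn.Linear(2 * d, d)

    def forward(self, v_g_hat, v_r_hat):
        # v_g_hat: (batch, k, d) enhanced global features
        # v_r_hat: (batch, K, d) enhanced region features
        g = v_g_hat.mean(dim=1)                       # pooled coarse-grained context
        r = v_r_hat.mean(dim=1)                       # pooled fine-grained regions
        return self.fuse(torch.cat([g, r], dim=-1))   # image-end feature
```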
For the enhanced word feature vectors, a fully connected graph G_w = (T_w, E_w) is constructed in the same way, and G_w is fed into the graph attention network of the text channel to obtain the text-end feature \tilde{T}.

The similarity of the image-text feature pairs in the training set processed by the above steps is then computed with a similarity function; the cosine similarity function is usually used to evaluate the similarity of two vectors:

S_1(\tilde{V}, \tilde{T}) = \frac{\tilde{V}^{T}\tilde{T}}{\lVert\tilde{V}\rVert\,\lVert\tilde{T}\rVert}

where \tilde{V} denotes the feature vector of the image channel and \tilde{T} denotes the feature vector of the text channel.
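In code, the first similarity reduces to a cosine similarity between the fused image-end and text-end vectors, for example:

```python
import torch
import torch.nn.functional as F

def first_similarity(img_feat, txt_feat):
    """Cosine similarity S_1 = (V . T) / (||V|| * ||T||) between the fused
    image-end and text-end features. Both inputs: (batch, d)."""
    return F.cosine_similarity(img_feat, txt_feat, dim=-1)
```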
Step 5: as shown in fig. 2, a cross-modal attention network between the modalities is designed by combining the region features and the word features, and an attention separation method aggregates all alignment vectors while suppressing meaningless alignments. To better study the cross-modal relationship between image regions and sentence words and establish a more accurate matching relationship, a word-to-region attention mode is used to match each region with its corresponding word.

First, two fully connected layers convert the region representations and the word representations into the same dimension m, and the cosine similarities of all possible region-word pairs are computed:

c_{ij} = \frac{v_i^{T} w_j}{\lVert v_i\rVert\,\lVert w_j\rVert}

where v_i denotes the i-th region, w_j denotes the j-th word, and c_{ij} denotes the cosine similarity between the i-th region and the j-th word.

From the cosine similarity matrix, the attention weight a_{ij} of each region is obtained as:

a_{ij} = \frac{\exp(\lambda \bar{c}_{ij})}{\sum_{i=1}^{K} \exp(\lambda \bar{c}_{ij})}

where \lambda denotes a weight coefficient and \bar{c}_{ij} denotes the normalization of the cosine similarity matrix, expressed as:

\bar{c}_{ij} = \frac{[c_{ij}]_+}{\sqrt{\sum_{i=1}^{K} [c_{ij}]_+^{2}}}

where [c_{ij}]_+ denotes the maximum of c_{ij} and 0.

The attended region feature of the j-th word is then computed, and a local similarity score is computed from it, as follows:

\hat{v}_j = \sum_{i=1}^{K} a_{ij} v_i

with the local similarity score s_j of the j-th word computed from \hat{v}_j and w_j through the parameter matrix W_l, where the denominator is the corresponding matrix norm.
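A sketch of this word-to-region attention for a single image-sentence pair, following the equations above; the temperature value lam = 9.0 and the tensor layout are assumptions, and the subsequent local similarity score through W_l is omitted because its exact form is not given here.

```python
import torch
import torch.nn.functional as F

def word_to_region_attention(v, w, lam=9.0):
    """Align each word with image regions via word-to-region attention.
    v: (K, m) region features, w: (L, m) word features, lam: lambda.
    Returns attended region features v_hat: (L, m), one per word."""
    # c[j, i] = cosine similarity between word j and region i
    c = F.cosine_similarity(w.unsqueeze(1), v.unsqueeze(0), dim=-1)   # (L, K)
    c_bar = F.normalize(c.clamp(min=0), p=2, dim=1)   # [c]_+ / sqrt(sum_i [c]_+^2)
    a = F.softmax(lam * c_bar, dim=1)                 # attention weights over regions
    return a @ v                                      # v_hat_j = sum_i a_ij v_i
```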
By capturing the relationship between each specific word and its corresponding image regions with this inter-modal attention, finer visual-semantic alignment improves the similarity prediction.
Step 6: directly aggregating all possible alignment vectors would produce many meaningless alignments that hinder matching. To make the important matches more prominent, an attention separation method is adopted, specifically: given the computed local and global similarity vectors \{s_1, \dots, s_n\}, a weight is computed for each similarity s_n as:

\beta_n = \mathrm{Sigmoid}\big(\mathrm{BN}(s_n)\big)

where Sigmoid() denotes the Sigmoid function and BN() denotes batch normalization; the desired similarity vectors are then aggregated with these weights, and the similarity is computed to obtain the second similarity, as follows:

s_{all} = \sum_{m} \beta_m s_m

where s_m denotes the different similarity vectors, \beta_m denotes the weight value, and s_{all} denotes the final aggregated similarity.
A more informative similarity representation between modalities is obtained by learning the significance scores and aggregating the similarities.
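A sketch of this attention-separation aggregation; because the text names only the Sigmoid and BN operations, the linear scoring layer that maps each similarity vector to a scalar before batch normalization is an assumed detail, as is a possible final projection of s_all to a scalar matching score.

```python
import torch
import torch.nn as nn

class SimilarityAggregation(nn.Module):
    """Attention-separation aggregation: beta_n = Sigmoid(BN(s_n)),
    s_all = sum_m beta_m * s_m over local and global similarity vectors."""
    def __init__(self, sim_dim=256):
        super().__init__()
        self.score = nn.Linear(sim_dim, 1)   # assumed scoring projection
        self.bn = nn.BatchNorm1d(1)

    def forward(self, sims):
        # sims: (batch, n, sim_dim) stacked local + global similarity vectors
        b, n, d = sims.shape
        logits = self.bn(self.score(sims.view(b * n, d))).view(b, n, 1)
        beta = torch.sigmoid(logits)          # weight for each similarity vector
        return (beta * sims).sum(dim=1)       # aggregated similarity s_all
```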
Step 7: the similarity scores are ranked and used to evaluate retrieval performance. For the image-text data in the test set processed by the above steps, the R@K recall rate records the correct retrieval results among the top K returned items, with particular attention to R@1, R@5 and R@10. The total R@K value Rsum over image-to-text and text-to-image retrieval is also recorded:

Rsum = (R@1 + R@5 + R@10)_{I2T} + (R@1 + R@5 + R@10)_{T2I}

where I2T denotes retrieving relevant texts with an image as the query sample, and T2I denotes retrieving relevant images with a text as the query sample.
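A sketch of the R@K / Rsum evaluation; a one-to-one image-text ground truth is assumed for simplicity (real benchmarks such as Flickr30K pair each image with several captions, which changes the bookkeeping but not the metric).

```python
import numpy as np

def recall_at_k(sim_matrix, ks=(1, 5, 10)):
    """R@K given a (num_queries, num_candidates) similarity matrix where the
    i-th query's correct match is the i-th candidate (assumed 1:1 pairing)."""
    ranks = []
    for i, row in enumerate(sim_matrix):
        order = np.argsort(row)[::-1]                 # candidates by similarity
        ranks.append(int(np.where(order == i)[0][0])) # rank of the true match
    ranks = np.asarray(ranks)
    return {f"R@{k}": float((ranks < k).mean() * 100) for k in ks}

def rsum(sim_matrix):
    """Rsum = (R@1 + R@5 + R@10)_I2T + (R@1 + R@5 + R@10)_T2I."""
    i2t = recall_at_k(sim_matrix)      # image queries -> texts
    t2i = recall_at_k(sim_matrix.T)    # text queries -> images
    return sum(i2t.values()) + sum(t2i.values())
```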

Claims (10)

1. A multi-class attention mechanism-based image-text cross-modal retrieval method is characterized by comprising the following steps:
acquiring an input image-text data set, and dividing the image-text data set into a training set and a test set in proportion;
extracting global feature vectors and regional feature vectors of images from image input channels of the training set, and extracting word feature vectors of texts from text input channels of the training set;
enhancing internal relevance of the global feature vector, the regional feature vector, and the word feature vector using a self-attention mechanism;
respectively constructing constraint relations in respective modalities for the enhanced global feature vector, the enhanced regional feature vector and the enhanced word feature vector through a graph attention network to obtain a graph-text feature pair, and performing similarity calculation on the graph-text feature pair to obtain a first similarity;
combining the enhanced region characteristic vectors and the enhanced word characteristic vectors, designing a cross-modal attention network among the modalities, using an attention separation method, aggregating all the alignment vectors, inhibiting meaningless alignment, and performing similarity calculation to obtain a second similarity;
and fusing the first similarity and the second similarity, training the graph attention network and the cross-modal attention network, and evaluating the retrieval performance by calculating the similarity of the image-text characteristics in the test set.
2. The method of claim 1, wherein the image-text data set is divided into a training set X containing N samples and a test set Y containing M samples, wherein:

X = \{(x_n, z_n)\}_{n=1}^{N}, \quad x_n \in \mathbb{R}^{d_1}, \ z_n \in \mathbb{R}^{d_2}

Y = \{(y_m, l_m)\}_{m=1}^{M}, \quad y_m \in \mathbb{R}^{d_1}, \ l_m \in \mathbb{R}^{d_2}

wherein \mathbb{R} denotes the real number space, d_1 the image dimension of a sample, d_2 the text dimension of a sample, x_n an image in the training set, z_n a text in the training set, y_m an image in the test set, and l_m a text in the test set.
3. The method of claim 1, wherein for the training set, a global feature vector G and a region feature vector R of each image are extracted with ResNet152 and bottom-up-attention-based Faster R-CNN, respectively; for the global feature vector G, the last fully connected layer of the ResNet152 network is removed and the output is reduced to d dimensions, giving the feature expression G = \{g_1, \dots, g_k\}, where g_i (i = 1, \dots, k) denotes a feature of the global feature map; for the region feature vector R, K region feature expressions are extracted from each input image with the bottom-up attention method, and a fully connected layer converts the output into d-dimensional vectors, giving the feature expression R = \{r_1, \dots, r_K\}, where r_i (i = 1, \dots, K) denotes a feature of the region feature map; the global feature vector G and the region feature vector R are embedded into a shared latent space through full connection, as follows:

V_g = W_g G + b_g

V_r = W_r R + b_r

wherein W_g, W_r denote weight matrices and b_g, b_r denote bias matrices.
4. The method of claim 1, wherein for the text input channel of the training set, each sentence is decomposed into L words by serialized segmentation, and the serialized word feature vectors are fed into a GRU network; the feature expression T_r = \{t_1, \dots, t_L\} of the words is obtained by averaging the forward and backward hidden states over all time steps, where t_i (i = 1, \dots, L) denotes the feature of each word.
5. The method of claim 1, wherein the self-attention mechanism is computed from three matrix variables, query, key and value, mapping a query matrix and a set of key-value pair matrices to an output; the output of the attention function is a weighted sum of the value matrix, with the weights determined by the query matrix and its corresponding key matrix, expressed as:

\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{D_k}}\right)V

wherein softmax denotes the normalized exponential function, Q denotes the product of the input vector and the weight matrix W_q, K^T the product of the input vector and the weight matrix W_K, V the product of the input vector and the weight matrix W_V, and D_k the scaling factor.
6. The method of claim 1, wherein the graph attention network is constructed as follows: given a fully connected graph G = (V, E), where V = \{v_1, \dots, v_n\} are the node features and E is the edge set, the attention coefficients are computed and normalized with the softmax function as follows:

a_{ij} = \mathrm{softmax}\!\left(\frac{(W_q v_i)(W_k v_j)^{T}}{\sqrt{D}}\right)

wherein softmax denotes the normalized exponential function, W_q and W_k denote learnable parameters, v_i and v_j denote different node features, D denotes the scaling factor, and a_{ij} denotes the weight coefficient between any two nodes.
7. The method of claim 6, wherein: according to the construction process of the graph attention network, a global graph G_g = (V_g, E_g) is constructed for the enhanced global feature vector, where the edge set E_g is the similarity matrix obtained by computing the similarity of every pair of global features v_i^g and v_j^g, as follows:

E_g^{ij} = \mathrm{sim}(v_i^g, v_j^g)

wherein v_i^g and v_j^g denote coarse-grained node features and E_g denotes the similarity matrix;

for the enhanced region feature vector, a region graph G_r = (V_r, E_r) is constructed, where E_r is the edge set given by the similarity matrix of the region features, computed in the same way as E_g;

the constructed global graph G_g and region graph G_r are fed into the graph attention network to obtain the graph-attention-enhanced features \hat{V}_g and \hat{V}_r, which are fed into the attention network again to obtain the multi-scale image-channel feature vector V_{ms}; a fusion algorithm then fuses the multi-scale features into the image-end feature \tilde{V};

for the enhanced word feature vector, a fully connected graph G_w = (T_w, E_w) is constructed in the same way, and G_w is fed into the graph attention network of the text channel to obtain the text-end feature \tilde{T};

similarity between the image-end feature \tilde{V} and the text-end feature \tilde{T} is evaluated with the cosine similarity function to obtain the first similarity, as follows:

S_1(\tilde{V}, \tilde{T}) = \frac{\tilde{V}^{T}\tilde{T}}{\lVert\tilde{V}\rVert\,\lVert\tilde{T}\rVert}

wherein \tilde{V} denotes the feature vector of the image channel and \tilde{T} denotes the feature vector of the text channel.
8. The method of claim 1, wherein the cross-modal attention network between the modalities aggregates all alignment vectors with an attention separation method, specifically: a word-to-region attention mode is used to match each region with its corresponding word, including:

two fully connected layers convert the region representations and the word representations into the same dimension m, and the cosine similarities c_{ij} of all possible region-word pairs are computed as:

c_{ij} = \frac{v_i^{T} w_j}{\lVert v_i\rVert\,\lVert w_j\rVert}

wherein v_i denotes the i-th region, w_j denotes the j-th word, and c_{ij} denotes the cosine similarity between the i-th region and the j-th word;

from the cosine similarity matrix, the attention weight a_{ij} of each region is obtained as:

a_{ij} = \frac{\exp(\lambda \bar{c}_{ij})}{\sum_{i=1}^{K} \exp(\lambda \bar{c}_{ij})}

wherein \lambda denotes a weight coefficient and \bar{c}_{ij} denotes the normalization of the cosine similarity matrix, expressed as:

\bar{c}_{ij} = \frac{[c_{ij}]_+}{\sqrt{\sum_{i=1}^{K} [c_{ij}]_+^{2}}}

wherein [c_{ij}]_+ denotes the maximum of c_{ij} and 0;

the attended region feature of the j-th word is then computed, and a local similarity score is computed from it, as follows:

\hat{v}_j = \sum_{i=1}^{K} a_{ij} v_i

with the local similarity score s_j of the j-th word computed from \hat{v}_j and w_j through the parameter matrix W_l, where the denominator is the corresponding matrix norm.
9. The method of claim 8, wherein the attention separation method specifically comprises: given the computed local and global similarity vectors \{s_1, \dots, s_n\}, a weight is computed for each similarity s_n as:

\beta_n = \mathrm{Sigmoid}\big(\mathrm{BN}(s_n)\big)

wherein Sigmoid() denotes the Sigmoid function and BN() denotes batch normalization; the desired similarity vectors are aggregated with these weights, and the similarity is computed to obtain the second similarity, as follows:

s_{all} = \sum_{m} \beta_m s_m

wherein s_m denotes the different similarity vectors, \beta_m denotes the weight value, and s_{all} denotes the final aggregated similarity, i.e. the second similarity.
10. The method of claim 1, wherein retrieval performance is evaluated by ranking the similarity scores, specifically: for the image-text data in the test set, after the multi-class attention network, the R@K recall rate records the correct retrieval results among the top K returned items, with emphasis on R@1, R@5 and R@10; the total R@K value Rsum over image-to-text and text-to-image retrieval is also recorded, as follows:

Rsum = (R@1 + R@5 + R@10)_{I2T} + (R@1 + R@5 + R@10)_{T2I}

wherein I2T denotes retrieving relevant texts with an image as the query sample, and T2I denotes retrieving relevant images with a text as the query sample.
CN202211252004.XA 2022-10-13 2022-10-13 Image-text cross-modal retrieval method based on multi-class attention mechanism Pending CN115658934A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211252004.XA CN115658934A (en) 2022-10-13 2022-10-13 Image-text cross-modal retrieval method based on multi-class attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211252004.XA CN115658934A (en) 2022-10-13 2022-10-13 Image-text cross-modal retrieval method based on multi-class attention mechanism

Publications (1)

Publication Number Publication Date
CN115658934A true CN115658934A (en) 2023-01-31

Family

ID=84987974

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211252004.XA Pending CN115658934A (en) 2022-10-13 2022-10-13 Image-text cross-modal retrieval method based on multi-class attention mechanism

Country Status (1)

Country Link
CN (1) CN115658934A (en)


Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116127123A (en) * 2023-04-17 2023-05-16 中国海洋大学 Semantic instance relation-based progressive ocean remote sensing image-text retrieval method
CN116578738A (en) * 2023-07-14 2023-08-11 深圳须弥云图空间科技有限公司 Graph-text retrieval method and device based on graph attention and generating countermeasure network
CN116578738B (en) * 2023-07-14 2024-02-20 深圳须弥云图空间科技有限公司 Graph-text retrieval method and device based on graph attention and generating countermeasure network
CN117611245A (en) * 2023-12-14 2024-02-27 浙江博观瑞思科技有限公司 Data analysis management system and method for planning E-business operation activities
CN117611245B (en) * 2023-12-14 2024-05-31 浙江博观瑞思科技有限公司 Data analysis management system and method for planning E-business operation activities

Similar Documents

Publication Publication Date Title
CN112966127B (en) Cross-modal retrieval method based on multilayer semantic alignment
Yuan et al. Exploring a fine-grained multiscale method for cross-modal remote sensing image retrieval
CN111581961B (en) Automatic description method for image content constructed by Chinese visual vocabulary
CN113297975A (en) Method and device for identifying table structure, storage medium and electronic equipment
CN115658934A (en) Image-text cross-modal retrieval method based on multi-class attention mechanism
CN111125406B (en) Visual relation detection method based on self-adaptive cluster learning
CN114936623B (en) Aspect-level emotion analysis method integrating multi-mode data
CN115131638B (en) Training method, device, medium and equipment for visual text pre-training model
CN113780003B (en) Cross-modal enhancement method for space-time data variable-division encoding and decoding
WO2023179429A1 (en) Video data processing method and apparatus, electronic device, and storage medium
Li et al. Adapting clip for phrase localization without further training
CN112182275A (en) Trademark approximate retrieval system and method based on multi-dimensional feature fusion
CN116561305A (en) False news detection method based on multiple modes and transformers
CN113536015A (en) Cross-modal retrieval method based on depth identification migration
CN113159053A (en) Image recognition method and device and computing equipment
CN117556076A (en) Pathological image cross-modal retrieval method and system based on multi-modal characterization learning
CN111859979A (en) Ironic text collaborative recognition method, ironic text collaborative recognition device, ironic text collaborative recognition equipment and computer readable medium
CN116843175A (en) Contract term risk checking method, system, equipment and storage medium
CN116775929A (en) Cross-modal retrieval method based on multi-level fine granularity semantic alignment
CN115640418A (en) Cross-domain multi-view target website retrieval method and device based on residual semantic consistency
CN116089644A (en) Event detection method integrating multi-mode features
CN115346132A (en) Method and device for detecting abnormal events of remote sensing images by multi-modal representation learning
CN110852066A (en) Multi-language entity relation extraction method and system based on confrontation training mechanism
Zhou et al. Spatial-aware topic-driven-based image Chinese caption for disaster news
CN114896962A (en) Multi-view sentence matching model, application method and related device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination