CN115658934A - Image-text cross-modal retrieval method based on multi-class attention mechanism - Google Patents


Info

Publication number
CN115658934A
Authority
CN
China
Prior art keywords
similarity
text
image
attention
representing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211252004.XA
Other languages
Chinese (zh)
Inventor
代翔
周子杰
潘磊
孟令宣
钟海玲
张琳科
崔莹
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
CETC 10 Research Institute
Original Assignee
CETC 10 Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by CETC 10 Research Institute filed Critical CETC 10 Research Institute
Priority to CN202211252004.XA
Publication of CN115658934A
Legal status: Pending


Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides an image-text cross-modal retrieval method based on a multi-class attention mechanism, belonging to the technical field of image-text information retrieval, which addresses the large information redundancy, poor retrieval precision and semantic gap of traditional image-text retrieval. The method acquires an image-text data set and divides it into a training set and a test set; feature vectors required by the feature engineering are extracted through separate image and text input channels, a self-attention mechanism enhances the internal correlation of the respective features, a graph attention network builds constraint relations within each modality, and a first similarity is computed; a cross-modal attention network between the modalities is then designed by combining the region features and the word features, an attention separation method aggregates the alignment vectors to compute a second similarity, and finally the two similarities are fused to train the network, after which retrieval performance is computed, tested and evaluated. The method achieves high precision, high matching quality and high robustness, and is suitable for image-text mutual retrieval tasks in many fields.

Description

Image-text cross-modal retrieval method based on multi-class attention mechanism
Technical Field
The invention belongs to the technical field of image-text information retrieval, and particularly relates to an image-text cross-modal retrieval method based on a multi-class attention mechanism.
Background
With the development of intelligent devices and social networks, multimedia data on the Internet in fields such as digital libraries, intellectual property, medical health, fashion design, electronic commerce, environmental monitoring, earth information systems, communication systems and military systems is growing explosively. Large amounts of data describing the same object span various modalities such as text, images, video and audio; although their formats differ, these data are semantically connected.
Traditional information retrieval is single-modality retrieval, which requires the retrieval set and the query set to belong to the same modality: text is retrieved with text, images with images, and videos with videos. In image retrieval, single-modality techniques mainly include keyword-based retrieval, retrieval based on low-level image features, and semantic-model-based retrieval. These techniques achieve good results within a single modality, but the information they return is limited to data of one modality. Today, single-modality retrieval cannot meet users' demands for efficient, comprehensive and accurate retrieval, so effectively describing multiple modalities of data for the same object has become an important subject in the field of information retrieval.
Faced with massive, interconnected multimedia data, users need to find related auxiliary data of other modalities from data of one modality, for example extracting related text from a picture or retrieving related pictures from text. For the same object, text and pictures are different modalities, and the process of searching between such modalities is called cross-modal retrieval.
Research on the cross-modal retrieval problem has produced a variety of image-text retrieval methods, which can be roughly divided into three directions: coarse-grained retrieval based on global features, fine-grained retrieval based on local features, and multi-granularity combined retrieval.
Global coarse-grained retrieval mainly extracts holistic representations from whole images and whole sentences and projects them end to end into a shared subspace, where the similarity of the visual and textual embeddings can be computed directly with a similarity function. Early common-space learning typically relied on classical canonical correlation analysis (CCA), which encodes cross-modal data into a highly correlated common subspace by linear projection; DCCA learns the maximum correlation between image and text representations by stacking several nonlinear transformation layers. Later, many researchers introduced DNNs into the mapping process, combining DNNs with CCA into deep canonical correlation analysis, or proposed encoding images and texts with a CNN and an LSTM, respectively. Because both CNNs and LSTMs have strong representational power, images and texts obtain stronger feature expressions, which improves the performance of the corresponding models. VSE++ subsequently introduced the concept of hard negatives, which became the basis of many subsequent studies.
Although global coarse-grained retrieval can compute similarity through such mappings, it fails to extract much of the information in images and texts, so later research introduced local matching algorithms to better bridge the visual-semantic gap. Compared with a conventional CNN, region-based image-text matching uses object detection to locate objects in the image, while the text encoder outputs a word-feature matrix rather than a single global sentence vector, so that more accurate fine-grained matches between image regions and sentence words can be obtained with a local alignment algorithm. Methods were then proposed to detect objects in the image and encode them into a subspace, where the pairwise image-text similarity is computed by summarizing the similarities of all region-word pairs. SCAN introduced a bottom-up attention scheme, encoding images into region-level features with a pre-trained Faster R-CNN and texts into word-level features; PFAN adds the position information of local targets to the visual representation to improve retrieval; VSRN applies GCNs to perform visual reasoning and learn relationships between regions that are consistent with the textual modality.
In image-text matching, neither coarse-grained nor fine-grained methods alone can satisfy the demand for accurate matching; both global context alignment and matching between regional objects and words are required. Multi-granularity matching methods were therefore developed to learn the correlation between the global context and region-word-level concepts, aiming not only to learn the overall correlation between modalities but also to find fine-grained alignments of regional concepts accurately. One approach analyses the overall similarity and the local similarity of images and combines them into the final image-text similarity; GSLS learns global image properties with a global CNN and encodes local region properties with a Faster R-CNN model; CRAN adds relation matching on top of global-local matching and thus obtains accurate cross-modal correlation. While these methods successfully learn consistent matches at both coarse and fine granularity, computing global and local similarities separately limits the models' ability to learn the relationship between local objects and global information.
Multi-granularity retrieval can match images and texts to a great extent, but it lacks information interaction within and between modalities. Within a modality, self-attention and graph attention mechanisms strengthen the relations of single-modality context information; between modalities, a cross-modal attention mechanism aligns region information with word information and improves retrieval reliability. Attention-based approaches have also been adopted in image-text matching in recent years. Some focus on interactions between modalities to find all possible alignments between image regions and sentence words. SCAN is a typical method that associates regions with words of different weights; it has been highly influential and serves as the basis for improvement or comparison in a large number of experiments. BFAN further handles semantic misalignment and improves performance by attending only to relevant segments rather than all segments; IMRAM proposes an iterative matching scheme with recurrent attention memory that captures the correspondence between images and text through multi-step alignment. However, existing attention-based methods focus mainly on region-level relationships and pay less attention to the relationship between regional objects and global concepts; in multi-granularity scenarios, a multi-class attention method therefore has broader application prospects.
Disclosure of Invention
The invention aims to provide a multi-class-attention-based image-text cross-modal retrieval method with high precision, high matching quality and high robustness, addressing the large information redundancy, poor retrieval precision and semantic gap between modalities in multi-modal image-text data, so as to make up for the shortcomings of existing image-text retrieval techniques.
The invention adopts the following technical scheme to realize the purpose:
a multi-class attention mechanism-based image-text cross-modal retrieval method comprises the following steps:
acquiring an input image-text data set, and dividing the image-text data set into a training set and a test set in proportion;
extracting global feature vectors and regional feature vectors of images from image input channels of the training set, and extracting word feature vectors of texts from text input channels of the training set;
enhancing internal relevance of the global feature vector, the regional feature vector, and the word feature vector using a self-attention mechanism;
respectively constructing constraint relations in respective modalities for the enhanced global feature vector, the enhanced regional feature vector and the enhanced word feature vector through a graph attention network to obtain a graph-text feature pair, and performing similarity calculation on the graph-text feature pair to obtain a first similarity;
designing a cross-modal attention network among the modes by combining the enhanced region characteristic vectors and the enhanced word characteristic vectors, aggregating all alignment vectors by using an attention separation method, inhibiting meaningless alignment, and calculating the similarity to obtain a second similarity;
and fusing the first similarity and the second similarity, training the graph attention network and the cross-modal attention network, and evaluating retrieval performance by calculating the image-text feature pair similarity in the test set.
Specifically, the image-text data set is divided into a training set X containing N samples and a test set Y containing M samples, where:

X = \{(x_n, z_n)\}_{n=1}^{N}, \quad x_n \in \mathbb{R}^{d_1}, \ z_n \in \mathbb{R}^{d_2}

Y = \{(y_m, l_m)\}_{m=1}^{M}, \quad y_m \in \mathbb{R}^{d_1}, \ l_m \in \mathbb{R}^{d_2}

where \mathbb{R} denotes the real number space, d_1 the image dimension of a sample, d_2 the text dimension of a sample, x_n an image in the training set, z_n a text in the training set, y_m an image in the test set, and l_m a text in the test set.
Further, for the training set, a global feature vector G and a region feature vector R of each image are extracted with ResNet152 and bottom-up-attention-based Faster R-CNN, respectively. For the global feature vector G, the last fully connected layer of the ResNet152 network is removed and the output is reduced to d dimensions, giving the feature expression G = \{g_1, \dots, g_k\}, where g_i (i = 1, \dots, k) denotes a feature of the global feature map. For the region feature vector R, K region feature expressions are extracted from each input image with the bottom-up attention method, and a fully connected layer converts the output into d-dimensional vectors, giving the feature expression R = \{r_1, \dots, r_K\}, where r_i (i = 1, \dots, K) denotes a feature of the region feature map. The global feature vector G and the region feature vector R are embedded into a shared latent space through full connection, as follows:

V_g = W_g G + b_g

V_r = W_r R + b_r

where W_g, W_r denote weight matrices and b_g, b_r denote bias matrices.
Further, for the text input channel of the training set, each sentence is decomposed into L words by serialized segmentation, and the serialized word feature vectors are fed into a GRU network; the feature expression T_r = \{t_1, \dots, t_L\} of the words is obtained by averaging the forward and backward hidden states over all time steps, where t_i (i = 1, \dots, L) denotes the feature of each word.
Further, the self-attention mechanism is computed from three matrix variables, query, key and value, mapping a query matrix and a set of key-value pair matrices to an output. The output of the attention function is a weighted sum of the value matrix, with the weights determined by the query matrix and its corresponding key matrix, which can be expressed as:

\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{D_k}}\right)V

where softmax denotes the normalized exponential function, Q denotes the product of the input vector and the weight matrix W_q, K^T the product of the input vector and the weight matrix W_K, V the product of the input vector and the weight matrix W_V, and D_k the scaling factor.
Further, the construction process of the graph attention network is as follows:
given a full connection graph G = (V, E), wherein V = { V = { V) i ,...,v n Is a node characteristic, E is an edge set; the attention coefficient is calculated and normalized using the softmax function as follows:
Figure BDA0003888444030000052
wherein softmax represents a normalized exponential function, W q And W k Denotes a learnable parameter, v i Representing different node characteristics, D representing a scaling factor, a ij Representing the weight coefficients of any two nodes.
Further, according to the construction process of the graph attention network, a global graph G_g = (V_g, E_g) is constructed for the enhanced global feature vector, where the edge set E_g is the similarity matrix obtained by computing the similarity of every pair of global features v_i^g and v_j^g, as follows:

E_g^{ij} = \mathrm{sim}(v_i^g, v_j^g)

where v_i^g and v_j^g denote coarse-grained node features and E_g denotes the similarity matrix;
for the enhanced region feature vector, constructing a region map G r =(V r ,E r ) In which E r Is a reaction of with E g The edge sets of the similarity matrixes of the region features with the same calculation method;
global graph G to be constructed g And region map G r Respectively sent into the graph attention network to obtain the characteristics enhanced by graph attention
Figure BDA0003888444030000061
And
Figure BDA0003888444030000062
and sending the image data into the attention network again to obtain the multi-scale imageFeature vector of road
Figure BDA0003888444030000063
And fusing the multi-scale features into image-side features through a fusion algorithm
Figure BDA0003888444030000064
For the enhanced word feature vector, a full-connected graph G is constructed in the same way w =(T w ,E w ) And G is w Sending the graph attention network into the text channel to obtain the text end characteristics
Figure BDA0003888444030000065
For the image end characteristics
Figure BDA0003888444030000066
And text end features
Figure BDA0003888444030000067
And performing similarity evaluation, and obtaining the first similarity by adopting a cosine similarity function, wherein the formula is as follows:
Figure BDA0003888444030000068
in the formula (I), the compound is shown in the specification,
Figure BDA0003888444030000069
a feature vector representing a channel of the image,
Figure BDA00038884440300000610
a feature vector representing a text channel.
Further, the cross-modal attention network between the modalities aggregates all alignment vectors with an attention separation method; specifically, a word-to-region attention mode is used to match each region with its corresponding word, including:

Two fully connected layers convert the region representations and the word representations into the same dimension m, and the cosine similarities of all possible region-word pairs are computed as:

c_{ij} = \frac{v_i^{T} w_j}{\lVert v_i\rVert\,\lVert w_j\rVert}

where v_i denotes the i-th region, w_j denotes the j-th word, and c_{ij} denotes the cosine similarity between the i-th region and the j-th word.

From the cosine similarity matrix, the attention weight a_{ij} of each region is obtained as:

a_{ij} = \frac{\exp(\lambda \bar{c}_{ij})}{\sum_{i=1}^{K} \exp(\lambda \bar{c}_{ij})}

where \lambda denotes a weight coefficient and \bar{c}_{ij} denotes the normalization of the cosine similarity matrix, expressed as:

\bar{c}_{ij} = \frac{[c_{ij}]_+}{\sqrt{\sum_{i=1}^{K} [c_{ij}]_+^{2}}}

where [c_{ij}]_+ denotes the maximum of c_{ij} and 0.

The attended region feature of the j-th word is then computed, and a local similarity score is computed from it, as follows:

\hat{v}_j = \sum_{i=1}^{K} a_{ij} v_i

with the local similarity score s_j of the j-th word computed from \hat{v}_j and w_j through the parameter matrix W_l, where the denominator is the corresponding matrix norm.
Further, the attention separation method is specifically as follows: given the computed local and global similarity vectors \{s_1, \dots, s_n\}, a weight is computed for each similarity s_n as:

\beta_n = \mathrm{Sigmoid}\big(\mathrm{BN}(s_n)\big)

where Sigmoid() denotes the Sigmoid function and BN() denotes batch normalization; the desired similarity vectors are then aggregated with these weights, and the similarity is computed to obtain the second similarity, as follows:

s_{all} = \sum_{m} \beta_m s_m

where s_m denotes the different similarity vectors, \beta_m denotes the weight value, and s_{all} denotes the final aggregated similarity, i.e. the second similarity.
Preferably, retrieval performance is evaluated by ranking the similarity scores, specifically: for the image-text data in the test set, after the multi-class attention network, the R@K recall rate records the correct retrieval results among the top K returned items, with emphasis on R@1, R@5 and R@10; the total R@K value Rsum over image-to-text and text-to-image retrieval is also recorded, as follows:

Rsum = (R@1 + R@5 + R@10)_{I2T} + (R@1 + R@5 + R@10)_{T2I}

where I2T denotes retrieving relevant texts with an image as the query sample, and T2I denotes retrieving relevant images with a text as the query sample.
In summary, due to the adoption of the technical scheme, the invention has the following beneficial effects:
1. The method constructs separate training channels for the image-text training set; it introduces a self-attention mechanism and a graph attention mechanism so that intra-modal attention effectively combines context information and tightens the feature relations within each modality; it also introduces a cross-modal attention mechanism that associates and matches the local information of the image channel with that of the text channel to improve retrieval accuracy.
2. In order to make up for the defects of coarse granularity and fine granularity in feature matching, the invention designs a multi-granularity method considering global and local semantics, so that the interaction between coarse granularity information and fine granularity information is higher, and global coarse granularity features and local fine granularity features are respectively extracted from an image channel and combined into multi-granularity features.
3. The invention introduces a cross-modal attention mechanism, is different from the direct construction of a public subspace, and adds local feature alignment before constructing the subspace so that the feature matching of different modes is more accurate, and the defect of feature heterogeneity caused by the direct construction of the subspace is overcome.
4. The method is suitable for image-text mutual retrieval in many fields. Its core is to combine a multi-class attention mechanism with a multi-granularity alignment representation learning model, accurately mining the features and context information of image and text data within their respective modalities and performing local association matching of features between modalities; it is effective for any task involving image-text cross-modal retrieval.
Drawings
FIG. 1 is a schematic flow chart of a retrieval method of the present invention;
FIG. 2 is a schematic cross-modal local feature alignment process according to the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. The components of embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations.
Thus, the following detailed description of the embodiments of the present invention, presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be obtained by a person skilled in the art without inventive step based on the embodiments of the present invention, are within the scope of protection of the present invention.
As shown in fig. 1, a method for cross-modal retrieval of graphics based on multiple attention mechanisms includes:
acquiring an input image-text data set, and dividing the image-text data set into a training set and a test set in proportion;
extracting global feature vectors and regional feature vectors of images from an image input channel in a training set, and extracting word feature vectors of texts from a text input channel;
enhancing the internal relevance of the global feature vector, the regional feature vector and the word feature vector using a self-attention mechanism;
respectively constructing constraint relations in respective modalities for the enhanced global feature vector, the enhanced regional feature vector and the enhanced word feature vector through a graph attention network to obtain a graph-text feature pair, and performing similarity calculation on the graph-text feature pair to obtain a first similarity;
combining the enhanced region characteristic vectors and the enhanced word characteristic vectors, designing a cross-modal attention network among the modalities, using an attention separation method, aggregating all the alignment vectors, inhibiting meaningless alignment, and performing similarity calculation to obtain a second similarity;
and fusing the first similarity and the second similarity, training a graph attention network and a cross-modal attention network, and evaluating retrieval performance by calculating the similarity of the image-text characteristics in the test set.
This example will explain the method in detail according to the specific operation steps of the method.
Step 1: according to a set proportion, an image-text data set with (N + M) samples is divided into a training set X containing N samples and a test set Y containing M samples, as follows:

X = \{(x_n, z_n)\}_{n=1}^{N}, \quad x_n \in \mathbb{R}^{d_1}, \ z_n \in \mathbb{R}^{d_2}

Y = \{(y_m, l_m)\}_{m=1}^{M}, \quad y_m \in \mathbb{R}^{d_1}, \ l_m \in \mathbb{R}^{d_2}

where \mathbb{R} denotes the real number space, d_1 the image dimension of a sample, d_2 the text dimension of a sample, x_n an image in the training set, z_n a text in the training set, y_m an image in the test set, and l_m a text in the test set.
Step 2: features are extracted from the training set with different feature engineering for the image and text input channels. For the image channel, ResNet152 and bottom-up-attention-based Faster R-CNN are used to extract the global-level features G and the region-level features R, respectively. For global feature extraction, the last fully connected layer of the ResNet152 network is removed and the output is reduced to d dimensions, giving the feature expression G = \{g_1, \dots, g_k\}, where g_i denotes a feature of the global feature map. For local feature extraction, K region feature expressions are extracted from each input image with the bottom-up attention method, and a fully connected layer converts the output into d-dimensional vectors as the local feature expression R = \{r_1, \dots, r_K\}, where r_i denotes a local region feature. To embed them into the shared latent space, full connection is applied:

V_g = W_g G + b_g

V_r = W_r R + b_r

where W_g, W_r denote weight matrices and b_g, b_r denote bias matrices.

For the text channel, each sentence is decomposed into L words by serialized segmentation, and the serialized word feature vectors are fed into a GRU network; the representation T_r = \{t_1, \dots, t_L\} of the words is obtained by averaging the forward and backward hidden states over all time steps, where t_i denotes the feature of each word.
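The following PyTorch sketch illustrates this step under stated assumptions: the ResNet152 global features and Faster R-CNN region features are assumed to be extracted upstream, and the dimensions (2048-dimensional inputs, d = 1024, 300-dimensional word embeddings) are illustrative choices rather than values fixed by the patent.

```python
import torch
import torch.nn as nn

class ImageProjection(nn.Module):
    """Project pre-extracted global (ResNet152) and region (Faster R-CNN)
    features into the shared d-dimensional latent space: V = W*X + b."""
    def __init__(self, global_dim=2048, region_dim=2048, d=1024):
        super().__init__()
        self.fc_global = nn.Linear(global_dim, d)   # V_g = W_g G + b_g
        self.fc_region = nn.Linear(region_dim, d)   # V_r = W_r R + b_r

    def forward(self, G, R):
        # G: (batch, k, global_dim), R: (batch, K, region_dim)
        return self.fc_global(G), self.fc_region(R)

class TextEncoder(nn.Module):
    """Encode a tokenized sentence with a bidirectional GRU and average the
    forward/backward hidden states of every time step into word features."""
    def __init__(self, vocab_size, embed_dim=300, d=1024):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.gru = nn.GRU(embed_dim, d, batch_first=True, bidirectional=True)

    def forward(self, tokens):
        # tokens: (batch, L) word indices
        h, _ = self.gru(self.embed(tokens))   # (batch, L, 2*d)
        fwd, bwd = h.chunk(2, dim=-1)         # forward / backward hidden states
        return (fwd + bwd) / 2                # T_r = {t_1, ..., t_L}
```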
Step 3: a self-attention mechanism is used to enhance the relevance of the internal features of the extracted image and text features. Self-attention relies on three matrix variables, query, key and value, and can be viewed as mapping a query matrix and a set of key-value pair matrices to an output. The output of the attention function is a weighted sum of the values, where the weights are determined by the query and its corresponding key, which can be expressed as:

\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{D_k}}\right)V

where softmax denotes the normalized exponential function, Q denotes the product of the input vector and the weight matrix W_q, K^T the product of the input vector and the weight matrix W_K, V the product of the input vector and the weight matrix W_V, and D_k the scaling factor.
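A minimal sketch of this scaled dot-product self-attention; the weight matrices W_q, W_k, W_v correspond to the learnable parameters mentioned above, and the batched tensor layout is an assumption.

```python
import torch
import torch.nn.functional as F

def self_attention(x, W_q, W_k, W_v):
    """Scaled dot-product self-attention: softmax(Q K^T / sqrt(D_k)) V.
    x: (batch, n, d) features; W_q / W_k / W_v: (d, d) weight matrices."""
    Q = x @ W_q                                   # queries
    K = x @ W_k                                   # keys
    V = x @ W_v                                   # values
    D_k = Q.size(-1)                              # scaling factor
    scores = Q @ K.transpose(-2, -1) / D_k ** 0.5
    return F.softmax(scores, dim=-1) @ V          # attention-weighted values
```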
Step 4: constraint relations within each modality are constructed from the self-attention-enhanced features through a graph attention network, which is first built as follows:

Given a fully connected graph G = (V, E), where V = \{v_1, \dots, v_n\} are the node features and E is the edge set, the attention coefficients are computed and normalized with the softmax function as follows:

a_{ij} = \mathrm{softmax}\!\left(\frac{(W_q v_i)(W_k v_j)^{T}}{\sqrt{D}}\right)

where softmax denotes the normalized exponential function, W_q and W_k denote learnable parameters, v_i and v_j denote different node features, D denotes the scaling factor, and a_{ij} denotes the weight coefficient between any two nodes.
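A sketch of one pass of this graph attention computation over a fully connected graph; the final aggregation `a @ V` (updating each node with its attention-weighted neighbours) is the standard graph-attention update and is an assumption, since the text only spells out the coefficient a_ij.

```python
import torch
import torch.nn.functional as F

def graph_attention(V, W_q, W_k):
    """Graph attention over a fully connected graph G = (V, E):
    a_ij = softmax_j((W_q v_i)(W_k v_j)^T / sqrt(D)).
    V: (n, d) node features; W_q, W_k: (d, d) learnable parameters."""
    Q = V @ W_q
    K = V @ W_k
    D = Q.size(-1)
    a = F.softmax(Q @ K.t() / D ** 0.5, dim=-1)   # attention coefficients a_ij
    return a @ V                                   # assumed node update step
```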
In the image channel, two kinds of features are used, coarse-grained and fine-grained. Following the construction process of the graph attention network, a global graph G_g = (V_g, E_g) is built for the coarse-grained global features, where the edge set E_g is the similarity matrix obtained by computing the similarity of every pair of global features v_i^g and v_j^g, as follows:

E_g^{ij} = \mathrm{sim}(v_i^g, v_j^g)

where v_i^g and v_j^g denote coarse-grained node features and E_g denotes the similarity matrix. Related regions thus receive higher edge matching scores, which yields the coarse-grained graph. This process determines the extent to which each pixel is affected by other pixels and thereby facilitates the learning of pixel-by-pixel relationships.
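A small sketch of building such an edge set as a pairwise similarity matrix; cosine similarity is assumed here, since the text states only that the edges combine the similarities of feature pairs.

```python
import torch
import torch.nn.functional as F

def build_edge_set(node_feats):
    """Edge set E as the pairwise similarity matrix of node features,
    E[i, j] = sim(v_i, v_j). Cosine similarity is an assumed choice.
    node_feats: (n, d) tensor of node features."""
    v = F.normalize(node_feats, p=2, dim=-1)   # unit-normalize each node
    return v @ v.t()                            # (n, n) similarity matrix
```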
To accurately grasp the coarse-grained context information at the fine-grained level, the graph attention network is also used to capture regional relations: a region graph G_r = (V_r, E_r) is constructed for the enhanced region feature vectors, where E_r is the edge set given by the similarity matrix of the region features, computed in the same way as E_g.

The global graph G_g and the region graph G_r are fed into the graph attention network to obtain the graph-attention-enhanced coarse-grained features \hat{V}_g and fine-grained features \hat{V}_r. To add context information to the fine-grained features, the enhanced features are concatenated into the multi-scale image-channel feature vector V_{ms}, and a fusion algorithm fuses the multi-scale features into the final image-end feature \tilde{V}.
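A sketch of this multi-scale fusion under stated assumptions: the patent does not specify the fusion algorithm, so mean-pooling each enhanced feature set followed by concatenation and a linear layer is used here purely as an illustration.

```python
import torch
import torch.nn as nn

class MultiScaleFusion(nn.Module):
    """Fuse graph-attention-enhanced coarse-grained (global) and fine-grained
    (region) image features into a single image-end feature. The fusion
    algorithm itself is not spelled out in the text; pooling + concatenation
    + a linear layer is an assumed stand-in."""
    def __init__(self, d=1024):
        super().__init__()
        self.fuse = nn.Linear(2 * d, d)

    def forward(self, v_g_hat, v_r_hat):
        # v_g_hat: (batch, k, d) enhanced global features
        # v_r_hat: (batch, K, d) enhanced region features
        g = v_g_hat.mean(dim=1)                       # pooled coarse-grained context
        r = v_r_hat.mean(dim=1)                       # pooled fine-grained regions
        return self.fuse(torch.cat([g, r], dim=-1))   # image-end feature
```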
For the enhanced word feature vectors, a fully connected graph G_w = (T_w, E_w) is constructed in the same way, and G_w is fed into the graph attention network of the text channel to obtain the text-end feature \tilde{T}.

The similarity of the image-text feature pairs in the training set processed by the above steps is then computed with a similarity function; the cosine similarity function is usually used to evaluate the similarity of two vectors:

S_1(\tilde{V}, \tilde{T}) = \frac{\tilde{V}^{T}\tilde{T}}{\lVert\tilde{V}\rVert\,\lVert\tilde{T}\rVert}

where \tilde{V} denotes the feature vector of the image channel and \tilde{T} denotes the feature vector of the text channel.
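In code, the first similarity reduces to a cosine similarity between the fused image-end and text-end vectors, for example:

```python
import torch
import torch.nn.functional as F

def first_similarity(img_feat, txt_feat):
    """Cosine similarity S_1 = (V . T) / (||V|| * ||T||) between the fused
    image-end and text-end features. Both inputs: (batch, d)."""
    return F.cosine_similarity(img_feat, txt_feat, dim=-1)
```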
Step 5: as shown in fig. 2, a cross-modal attention network between the modalities is designed by combining the region features and the word features, and an attention separation method aggregates all alignment vectors while suppressing meaningless alignments. To better study the cross-modal relationship between image regions and sentence words and establish a more accurate matching relationship, a word-to-region attention mode is used to match each region with its corresponding word.

First, two fully connected layers convert the region representations and the word representations into the same dimension m, and the cosine similarities of all possible region-word pairs are computed:

c_{ij} = \frac{v_i^{T} w_j}{\lVert v_i\rVert\,\lVert w_j\rVert}

where v_i denotes the i-th region, w_j denotes the j-th word, and c_{ij} denotes the cosine similarity between the i-th region and the j-th word.

From the cosine similarity matrix, the attention weight a_{ij} of each region is obtained as:

a_{ij} = \frac{\exp(\lambda \bar{c}_{ij})}{\sum_{i=1}^{K} \exp(\lambda \bar{c}_{ij})}

where \lambda denotes a weight coefficient and \bar{c}_{ij} denotes the normalization of the cosine similarity matrix, expressed as:

\bar{c}_{ij} = \frac{[c_{ij}]_+}{\sqrt{\sum_{i=1}^{K} [c_{ij}]_+^{2}}}

where [c_{ij}]_+ denotes the maximum of c_{ij} and 0.

The attended region feature of the j-th word is then computed, and a local similarity score is computed from it, as follows:

\hat{v}_j = \sum_{i=1}^{K} a_{ij} v_i

with the local similarity score s_j of the j-th word computed from \hat{v}_j and w_j through the parameter matrix W_l, where the denominator is the corresponding matrix norm.
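A sketch of this word-to-region attention for a single image-sentence pair, following the equations above; the temperature value lam = 9.0 and the tensor layout are assumptions, and the subsequent local similarity score through W_l is omitted because its exact form is not given here.

```python
import torch
import torch.nn.functional as F

def word_to_region_attention(v, w, lam=9.0):
    """Align each word with image regions via word-to-region attention.
    v: (K, m) region features, w: (L, m) word features, lam: lambda.
    Returns attended region features v_hat: (L, m), one per word."""
    # c[j, i] = cosine similarity between word j and region i
    c = F.cosine_similarity(w.unsqueeze(1), v.unsqueeze(0), dim=-1)   # (L, K)
    c_bar = F.normalize(c.clamp(min=0), p=2, dim=1)   # [c]_+ / sqrt(sum_i [c]_+^2)
    a = F.softmax(lam * c_bar, dim=1)                 # attention weights over regions
    return a @ v                                      # v_hat_j = sum_i a_ij v_i
```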
By capturing the relationship between each specific word and its corresponding image regions with this inter-modal attention, finer visual-semantic alignment improves the similarity prediction.
Step 6: directly aggregating all possible alignment vectors would produce many meaningless alignments that hinder matching. To make the important matches more prominent, an attention separation method is adopted, specifically: given the computed local and global similarity vectors \{s_1, \dots, s_n\}, a weight is computed for each similarity s_n as:

\beta_n = \mathrm{Sigmoid}\big(\mathrm{BN}(s_n)\big)

where Sigmoid() denotes the Sigmoid function and BN() denotes batch normalization; the desired similarity vectors are then aggregated with these weights, and the similarity is computed to obtain the second similarity, as follows:

s_{all} = \sum_{m} \beta_m s_m

where s_m denotes the different similarity vectors, \beta_m denotes the weight value, and s_{all} denotes the final aggregated similarity.
A more informative similarity representation between modalities is obtained by learning the significance scores and aggregating the similarities.
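A sketch of this attention-separation aggregation; because the text names only the Sigmoid and BN operations, the linear scoring layer that maps each similarity vector to a scalar before batch normalization is an assumed detail, as is a possible final projection of s_all to a scalar matching score.

```python
import torch
import torch.nn as nn

class SimilarityAggregation(nn.Module):
    """Attention-separation aggregation: beta_n = Sigmoid(BN(s_n)),
    s_all = sum_m beta_m * s_m over local and global similarity vectors."""
    def __init__(self, sim_dim=256):
        super().__init__()
        self.score = nn.Linear(sim_dim, 1)   # assumed scoring projection
        self.bn = nn.BatchNorm1d(1)

    def forward(self, sims):
        # sims: (batch, n, sim_dim) stacked local + global similarity vectors
        b, n, d = sims.shape
        logits = self.bn(self.score(sims.view(b * n, d))).view(b, n, 1)
        beta = torch.sigmoid(logits)          # weight for each similarity vector
        return (beta * sims).sum(dim=1)       # aggregated similarity s_all
```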
Step 7: the similarity scores are ranked and used to evaluate retrieval performance. For the image-text data in the test set processed by the above steps, the R@K recall rate records the correct retrieval results among the top K returned items, with particular attention to R@1, R@5 and R@10. The total R@K value Rsum over image-to-text and text-to-image retrieval is also recorded:

Rsum = (R@1 + R@5 + R@10)_{I2T} + (R@1 + R@5 + R@10)_{T2I}

where I2T denotes retrieving relevant texts with an image as the query sample, and T2I denotes retrieving relevant images with a text as the query sample.
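A sketch of the R@K / Rsum evaluation; a one-to-one image-text ground truth is assumed for simplicity (real benchmarks such as Flickr30K pair each image with several captions, which changes the bookkeeping but not the metric).

```python
import numpy as np

def recall_at_k(sim_matrix, ks=(1, 5, 10)):
    """R@K given a (num_queries, num_candidates) similarity matrix where the
    i-th query's correct match is the i-th candidate (assumed 1:1 pairing)."""
    ranks = []
    for i, row in enumerate(sim_matrix):
        order = np.argsort(row)[::-1]                 # candidates by similarity
        ranks.append(int(np.where(order == i)[0][0])) # rank of the true match
    ranks = np.asarray(ranks)
    return {f"R@{k}": float((ranks < k).mean() * 100) for k in ks}

def rsum(sim_matrix):
    """Rsum = (R@1 + R@5 + R@10)_I2T + (R@1 + R@5 + R@10)_T2I."""
    i2t = recall_at_k(sim_matrix)      # image queries -> texts
    t2i = recall_at_k(sim_matrix.T)    # text queries -> images
    return sum(i2t.values()) + sum(t2i.values())
```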

Claims (10)

1. A multi-class attention mechanism-based image-text cross-modal retrieval method is characterized by comprising the following steps:
acquiring an input image-text data set, and dividing the image-text data set into a training set and a test set in proportion;
extracting global feature vectors and regional feature vectors of images from image input channels of the training set, and extracting word feature vectors of texts from text input channels of the training set;
enhancing internal relevance of the global feature vector, the regional feature vector, and the word feature vector using a self-attention mechanism;
respectively constructing constraint relations in respective modalities for the enhanced global feature vector, the enhanced regional feature vector and the enhanced word feature vector through a graph attention network to obtain a graph-text feature pair, and performing similarity calculation on the graph-text feature pair to obtain a first similarity;
combining the enhanced region characteristic vectors and the enhanced word characteristic vectors, designing a cross-modal attention network among the modalities, using an attention separation method, aggregating all the alignment vectors, inhibiting meaningless alignment, and performing similarity calculation to obtain a second similarity;
and fusing the first similarity and the second similarity, training the graph attention network and the cross-modal attention network, and evaluating the retrieval performance by calculating the similarity of the image-text characteristics in the test set.
2. The method of claim 1, wherein the image-text data set is divided into a training set X containing N samples and a test set Y containing M samples, wherein:

X = \{(x_n, z_n)\}_{n=1}^{N}, \quad x_n \in \mathbb{R}^{d_1}, \ z_n \in \mathbb{R}^{d_2}

Y = \{(y_m, l_m)\}_{m=1}^{M}, \quad y_m \in \mathbb{R}^{d_1}, \ l_m \in \mathbb{R}^{d_2}

wherein \mathbb{R} denotes the real number space, d_1 the image dimension of a sample, d_2 the text dimension of a sample, x_n an image in the training set, z_n a text in the training set, y_m an image in the test set, and l_m a text in the test set.
3. The method of claim 1, wherein for the training set, a global feature vector G and a region feature vector R of each image are extracted with ResNet152 and bottom-up-attention-based Faster R-CNN, respectively; for the global feature vector G, the last fully connected layer of the ResNet152 network is removed and the output is reduced to d dimensions, giving the feature expression G = \{g_1, \dots, g_k\}, where g_i (i = 1, \dots, k) denotes a feature of the global feature map; for the region feature vector R, K region feature expressions are extracted from each input image with the bottom-up attention method, and a fully connected layer converts the output into d-dimensional vectors, giving the feature expression R = \{r_1, \dots, r_K\}, where r_i (i = 1, \dots, K) denotes a feature of the region feature map; the global feature vector G and the region feature vector R are embedded into a shared latent space through full connection, as follows:

V_g = W_g G + b_g

V_r = W_r R + b_r

wherein W_g, W_r denote weight matrices and b_g, b_r denote bias matrices.
4. The method of claim 1, wherein for the text input channel of the training set, each sentence is decomposed into L words by serialized segmentation, and the serialized word feature vectors are fed into a GRU network; the feature expression T_r = \{t_1, \dots, t_L\} of the words is obtained by averaging the forward and backward hidden states over all time steps, where t_i (i = 1, \dots, L) denotes the feature of each word.
5. The method of claim 1, wherein the self-attention mechanism is computed from three matrix variables, query, key and value, mapping a query matrix and a set of key-value pair matrices to an output; the output of the attention function is a weighted sum of the value matrix, with the weights determined by the query matrix and its corresponding key matrix, expressed as:

\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{D_k}}\right)V

wherein softmax denotes the normalized exponential function, Q denotes the product of the input vector and the weight matrix W_q, K^T the product of the input vector and the weight matrix W_K, V the product of the input vector and the weight matrix W_V, and D_k the scaling factor.
6. The method of claim 1, wherein the graph attention network is constructed as follows: given a fully connected graph G = (V, E), where V = \{v_1, \dots, v_n\} are the node features and E is the edge set, the attention coefficients are computed and normalized with the softmax function as follows:

a_{ij} = \mathrm{softmax}\!\left(\frac{(W_q v_i)(W_k v_j)^{T}}{\sqrt{D}}\right)

wherein softmax denotes the normalized exponential function, W_q and W_k denote learnable parameters, v_i and v_j denote different node features, D denotes the scaling factor, and a_{ij} denotes the weight coefficient between any two nodes.
7. The method of claim 6, wherein: according to the construction process of the graph attention network, a global graph G_g = (V_g, E_g) is constructed for the enhanced global feature vector, where the edge set E_g is the similarity matrix obtained by computing the similarity of every pair of global features v_i^g and v_j^g, as follows:

E_g^{ij} = \mathrm{sim}(v_i^g, v_j^g)

wherein v_i^g and v_j^g denote coarse-grained node features and E_g denotes the similarity matrix;

for the enhanced region feature vector, a region graph G_r = (V_r, E_r) is constructed, where E_r is the edge set given by the similarity matrix of the region features, computed in the same way as E_g;

the constructed global graph G_g and region graph G_r are fed into the graph attention network to obtain the graph-attention-enhanced features \hat{V}_g and \hat{V}_r, which are fed into the attention network again to obtain the multi-scale image-channel feature vector V_{ms}; a fusion algorithm then fuses the multi-scale features into the image-end feature \tilde{V};

for the enhanced word feature vector, a fully connected graph G_w = (T_w, E_w) is constructed in the same way, and G_w is fed into the graph attention network of the text channel to obtain the text-end feature \tilde{T};

similarity between the image-end feature \tilde{V} and the text-end feature \tilde{T} is evaluated with the cosine similarity function to obtain the first similarity, as follows:

S_1(\tilde{V}, \tilde{T}) = \frac{\tilde{V}^{T}\tilde{T}}{\lVert\tilde{V}\rVert\,\lVert\tilde{T}\rVert}

wherein \tilde{V} denotes the feature vector of the image channel and \tilde{T} denotes the feature vector of the text channel.
8. The method of claim 1, wherein the cross-modal attention network between the modalities aggregates all alignment vectors with an attention separation method, specifically: a word-to-region attention mode is used to match each region with its corresponding word, including:

two fully connected layers convert the region representations and the word representations into the same dimension m, and the cosine similarities c_{ij} of all possible region-word pairs are computed as:

c_{ij} = \frac{v_i^{T} w_j}{\lVert v_i\rVert\,\lVert w_j\rVert}

wherein v_i denotes the i-th region, w_j denotes the j-th word, and c_{ij} denotes the cosine similarity between the i-th region and the j-th word;

from the cosine similarity matrix, the attention weight a_{ij} of each region is obtained as:

a_{ij} = \frac{\exp(\lambda \bar{c}_{ij})}{\sum_{i=1}^{K} \exp(\lambda \bar{c}_{ij})}

wherein \lambda denotes a weight coefficient and \bar{c}_{ij} denotes the normalization of the cosine similarity matrix, expressed as:

\bar{c}_{ij} = \frac{[c_{ij}]_+}{\sqrt{\sum_{i=1}^{K} [c_{ij}]_+^{2}}}

wherein [c_{ij}]_+ denotes the maximum of c_{ij} and 0;

the attended region feature of the j-th word is then computed, and a local similarity score is computed from it, as follows:

\hat{v}_j = \sum_{i=1}^{K} a_{ij} v_i

with the local similarity score s_j of the j-th word computed from \hat{v}_j and w_j through the parameter matrix W_l, where the denominator is the corresponding matrix norm.
9. The method of claim 8, wherein the attention separation method specifically comprises: given the computed local and global similarity vectors \{s_1, \dots, s_n\}, a weight is computed for each similarity s_n as:

\beta_n = \mathrm{Sigmoid}\big(\mathrm{BN}(s_n)\big)

wherein Sigmoid() denotes the Sigmoid function and BN() denotes batch normalization; the desired similarity vectors are aggregated with these weights, and the similarity is computed to obtain the second similarity, as follows:

s_{all} = \sum_{m} \beta_m s_m

wherein s_m denotes the different similarity vectors, \beta_m denotes the weight value, and s_{all} denotes the final aggregated similarity, i.e. the second similarity.
10. The method of claim 1, wherein retrieval performance is evaluated by ranking the similarity scores, specifically: for the image-text data in the test set, after the multi-class attention network, the R@K recall rate records the correct retrieval results among the top K returned items, with emphasis on R@1, R@5 and R@10; the total R@K value Rsum over image-to-text and text-to-image retrieval is also recorded, as follows:

Rsum = (R@1 + R@5 + R@10)_{I2T} + (R@1 + R@5 + R@10)_{T2I}

wherein I2T denotes retrieving relevant texts with an image as the query sample, and T2I denotes retrieving relevant images with a text as the query sample.
CN202211252004.XA 2022-10-13 2022-10-13 Image-text cross-modal retrieval method based on multi-class attention mechanism Pending CN115658934A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211252004.XA CN115658934A (en) 2022-10-13 2022-10-13 Image-text cross-modal retrieval method based on multi-class attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211252004.XA CN115658934A (en) 2022-10-13 2022-10-13 Image-text cross-modal retrieval method based on multi-class attention mechanism

Publications (1)

Publication Number Publication Date
CN115658934A true CN115658934A (en) 2023-01-31

Family

ID=84987974

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211252004.XA Pending CN115658934A (en) 2022-10-13 2022-10-13 Image-text cross-modal retrieval method based on multi-class attention mechanism

Country Status (1)

Country Link
CN (1) CN115658934A (en)


Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116127123A (en) * 2023-04-17 2023-05-16 中国海洋大学 Semantic instance relation-based progressive ocean remote sensing image-text retrieval method
CN116578738A (en) * 2023-07-14 2023-08-11 深圳须弥云图空间科技有限公司 Graph-text retrieval method and device based on graph attention and generating countermeasure network
CN116578738B (en) * 2023-07-14 2024-02-20 深圳须弥云图空间科技有限公司 Graph-text retrieval method and device based on graph attention and generating countermeasure network
CN117611245A (en) * 2023-12-14 2024-02-27 浙江博观瑞思科技有限公司 Data analysis management system and method for planning E-business operation activities
CN117611245B (en) * 2023-12-14 2024-05-31 浙江博观瑞思科技有限公司 Data analysis management system and method for planning E-business operation activities

Similar Documents

Publication Publication Date Title
CN112966127B (en) Cross-modal retrieval method based on multilayer semantic alignment
Yuan et al. Exploring a fine-grained multiscale method for cross-modal remote sensing image retrieval
CN111581961B (en) Automatic description method for image content constructed by Chinese visual vocabulary
CN113297975A (en) Method and device for identifying table structure, storage medium and electronic equipment
CN115658934A (en) Image-text cross-modal retrieval method based on multi-class attention mechanism
CN111125406B (en) Visual relation detection method based on self-adaptive cluster learning
CN114936623B (en) Aspect-level emotion analysis method integrating multi-mode data
CN115131638B (en) Training method, device, medium and equipment for visual text pre-training model
CN113780003B (en) Cross-modal enhancement method for space-time data variable-division encoding and decoding
WO2023179429A1 (en) Video data processing method and apparatus, electronic device, and storage medium
Li et al. Adapting clip for phrase localization without further training
CN112182275A (en) Trademark approximate retrieval system and method based on multi-dimensional feature fusion
CN116561305A (en) False news detection method based on multiple modes and transformers
CN113536015A (en) Cross-modal retrieval method based on depth identification migration
CN113159053A (en) Image recognition method and device and computing equipment
CN117556076A (en) Pathological image cross-modal retrieval method and system based on multi-modal characterization learning
CN111859979A (en) Ironic text collaborative recognition method, ironic text collaborative recognition device, ironic text collaborative recognition equipment and computer readable medium
CN116843175A (en) Contract term risk checking method, system, equipment and storage medium
CN116775929A (en) Cross-modal retrieval method based on multi-level fine granularity semantic alignment
CN115640418A (en) Cross-domain multi-view target website retrieval method and device based on residual semantic consistency
CN116089644A (en) Event detection method integrating multi-mode features
CN115346132A (en) Method and device for detecting abnormal events of remote sensing images by multi-modal representation learning
CN110852066A (en) Multi-language entity relation extraction method and system based on confrontation training mechanism
Zhou et al. Spatial-aware topic-driven-based image Chinese caption for disaster news
CN114896962A (en) Multi-view sentence matching model, application method and related device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination