CN116796032A - Multi-mode data retrieval model based on self-adaptive graph attention hash - Google Patents

Multi-mode data retrieval model based on self-adaptive graph attention hash

Info

Publication number
CN116796032A
CN116796032A CN202310380197.5A
Authority
CN
China
Prior art keywords
hash
modal
attention
data
graph
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310380197.5A
Other languages
Chinese (zh)
Inventor
李明勇
李业文
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing Normal University
Original Assignee
Chongqing Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing Normal University filed Critical Chongqing Normal University
Priority to CN202310380197.5A priority Critical patent/CN116796032A/en
Publication of CN116796032A publication Critical patent/CN116796032A/en
Pending legal-status Critical Current

Links

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 - Details of database functions independent of the retrieved data types
    • G06F16/901 - Indexing; Data structures therefor; Storage structures
    • G06F16/9014 - Hash tables
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 - Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 - Complex mathematical operations
    • G06F17/16 - Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G06N3/088 - Non-supervised learning, e.g. competitive learning
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a multi-modal data retrieval model based on adaptive graph attention hashing, which establishes a deep unsupervised cross-modal hashing model and introduces an attention mechanism and a graph neural network; a CLIP-based adaptive graph attention network (CAGAN) is constructed as an unsupervised cross-modal hash retrieval framework, and its objectives and loss functions are optimized; the data sets are first collected, comprehensive experiments and metric evaluations are carried out on the collected data, and the experimental details are then implemented; comparison experiments, ablation experiments and hyper-parameter sensitivity analyses are performed on the data sets; convergence experiments are conducted, and the cross-modal hash retrieval results are finally visualized. The multi-modal data retrieval model based on adaptive graph attention hashing has the characteristics of high cross-modal retrieval accuracy, low data storage consumption and high retrieval speed.

Description

Multi-mode data retrieval model based on self-adaptive graph attention hash
Technical Field
The invention belongs to the technical field of multi-modal data retrieval, and particularly relates to a multi-modal data retrieval model based on adaptive graph attention hashing.
Background
The basic idea of cross-modal hash retrieval is to learn a hash transformation for each modality from paired samples of different modalities and to map the data of the different modalities into a binary Hamming space. The similarity of the data is preserved during this mapping, and fast cross-modal retrieval is then performed in the Hamming space. Cross-modal hashing methods can be divided into two categories: supervised methods use semantic labels to bridge the heterogeneity gap and the semantic gap, while unsupervised methods eliminate the dependence on label information and consider only paired multimedia data. The unsupervised setting is far less explored than the supervised one, and this work aims to improve the retrieval performance of cross-modal hashing under unsupervised conditions. In recent years, thanks to the strong feature extraction capability of deep neural networks, unsupervised cross-modal hash retrieval methods based on deep learning have made great progress. Although these unsupervised methods achieve impressive performance, most of them suffer from inaccurate similarity measurement and unbalanced multi-modal learning, resulting in suboptimal retrieval results. In particular, it is difficult to comprehensively measure complex data correlations with the simple data features of different modalities. In the transition from real values to binary codes, the original structure of the hash code is destroyed and information is lost. In addition, modality gaps and data deviations cause unbalanced multi-modal learning, and the training efficiency of existing methods is still limited.
To address these issues, we propose a novel and efficient CLIP-based adaptive graph attention network for large-scale unsupervised cross-modal hash retrieval.
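For concreteness, the following sketch (added here for illustration and not part of the patent text) shows the retrieval step described above: database items, already encoded as ±1 hash codes, are ranked against a query code by Hamming distance. The array names and the 64-bit code length are hypothetical.

```python
import numpy as np

def hamming_rank(query_code, db_codes):
    """Rank database items by Hamming distance to a query code.

    query_code: (c,) array with entries in {-1, +1}
    db_codes:   (n, c) array with entries in {-1, +1}
    Returns database indices ordered from most to least similar.
    """
    c = query_code.shape[0]
    # For +/-1 codes: b_i . b_j = c - 2 * d_H(b_i, b_j), so all distances follow from one product
    dists = (c - db_codes @ query_code) / 2
    return np.argsort(dists)

# Toy usage: 64-bit codes for a database of 1000 items and a single query
rng = np.random.default_rng(0)
db = rng.choice([-1, 1], size=(1000, 64))
query = rng.choice([-1, 1], size=64)
top10 = hamming_rank(query, db)[:10]   # indices of the ten nearest items
```

Because the Hamming distances of ±1 codes follow from a single matrix product (or bitwise operations on packed codes), retrieval in the Hamming space is fast and the codes are cheap to store, which is the motivation stated above.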
Disclosure of Invention
The present invention aims to solve the above-mentioned problems and to provide a multi-modal data retrieval model based on adaptive graph attention hashing, which addresses the problems mentioned in the background art.
In order to solve these problems, the invention provides the following technical solution. A multi-modal data retrieval model based on adaptive graph attention hashing comprises the following specific steps:
step S101: establishing a deep unsupervised cross-modal hashing model, and introducing an attention mechanism and a graph neural network;
step S102: constructing an unsupervised cross-modal hash retrieval framework using a CLIP-based adaptive graph attention network (CAGAN), and optimizing its objectives and loss functions;
step S103: collecting the data sets, carrying out comprehensive experiments and metric evaluation on the collected data, and implementing the experimental details;
step S104: performing comparison experiments, ablation experiments and hyper-parameter sensitivity analysis on the data sets;
step S105: performing convergence experiments, and then visualizing the cross-modal hash retrieval results.
Firstly, the invention uses CLIP to extract cross-modal semantic features; CLIP learns transferable visual models from natural language supervision, so fine-grained semantic features of the multi-modal data can be extracted.
The invention designs a multi-modal similarity enhancement module to fuse and enhance the similarity information of different modality data, which can effectively alleviate the inaccurate similarity measurement of multi-modal data;
The invention adopts an attention mechanism to focus on the relevant salient features; through the attention module, the extracted features can be directed to the important information of the different modalities, so as to construct an attention-aware semantic fusion matrix;
The invention provides a GCN-based cross-modal hashing method; specifically, the GCN-based cross-modal hashing method adopts an individual GCN for each modality under semantic guidance;
wherein each GCN acts independently on its own modality to preserve the intra-modal similarity, and a graph convolutional neural network is adopted to aggregate the similarity information of the instances of each modality, thereby further mining the semantic relevance of different modality data;
The invention comprehensively utilizes the features of different modalities to construct a semantic affinity graph, thereby alleviating the inaccurate relation measurement among the data nodes;
The invention devises an adaptive graph attention module to solve this problem, which uses an attention mechanism to learn the semantic affinity graph and aggregates information between similar nodes through graph convolution, thereby making similar data produce more consistent hash codes.
In step S102, the framework includes a depth feature encoding module, a multi-modal similarity enhancement module, an adaptive graph attention module, and a hash code reconstruction module; the depth encoding module contains two main networks: a visual encoding network and a text encoding network;
an efficient and novel CLIP-based adaptive graph annotation network (CAGAN) is presented for use in unsupervised cross-modal hash retrieval tasks.
In this work, we apply the visual-language model CLIP to unsupervised image-text hash retrieval for the first time. To alleviate the problem of inaccurate similarity, we design a multi-modal similarity enhancement module to enhance the similarity of the data, which helps to improve retrieval accuracy.
In addition, an iterative approximate optimization strategy is used to reduce the information loss during the hash code binarization process.
Finally, a carefully designed adaptive graph attention module can assist in learning the hash network, improve the hash code representation capability and alleviate the problem of multi-modal learning imbalance. Extensive experiments on three benchmark data sets show that the proposed method is superior to several representative state-of-the-art methods, achieving the best retrieval accuracy.
Drawings
For ease of illustration, the invention is described in detail by the following detailed description and the accompanying drawings.
FIG. 1 is a workflow diagram of the present invention;
FIG. 2 is a diagram of an unsupervised cross-modal hash retrieval framework of the present invention;
FIG. 3 is a top-N precision curve comparison graph of 128-bit hash codes on three cross-modal retrieval reference data sets according to the present invention;
FIG. 4 is a graph of the present invention illustrating the analysis of hyper-parametric sensitivity in three multi-modal retrieved reference datasets;
FIG. 5 is a graph of the convergence of the loss function and MAP variation for the CAGAN of the present invention over three widely used multimedia data sets.
Detailed Description
As shown in fig. 1, this embodiment is described in detail as follows:
a multi-mode data retrieval model based on adaptive graph attention hash comprises the following specific steps:
step S101: establishing a deep unsupervised cross-modal hashing model, and introducing an attention mechanism and a graph neural network;
step S102: constructing an unsupervised cross-modal hash retrieval framework using a CLIP-based adaptive graph attention network (CAGAN), and optimizing its objectives and loss functions;
step S103: collecting the data sets, carrying out comprehensive experiments and metric evaluation on the collected data, and implementing the experimental details;
step S104: performing comparison experiments, ablation experiments and hyper-parameter sensitivity analysis on the data sets;
step S105: performing convergence experiments, and then visualizing the cross-modal hash retrieval results.
In step S101, the invention uses CLIP to extract cross-modal semantic features and learns a transferable visual model from natural language supervision; the multi-modal similarity enhancement module is used to fuse and enhance the similarity information of different modality data, which can effectively alleviate the inaccurate similarity measurement of multi-modal data; the invention provides a hashing network with an attention mechanism, which enhances the measurement of content similarity by selectively focusing on the informative parts of the multi-modal data and attends to the relevant salient features; through the attention module, the extracted features can be directed to the important information of the different modalities, so as to construct an attention-aware semantic fusion matrix; in addition, the invention devises an adaptive graph attention module to address these problems, which uses an attention mechanism to learn the semantic affinity graph and aggregates information between similar nodes through graph convolution, thereby making similar data produce more consistent hash codes.
In step S102, the symbols and the problem are first defined. A cross-modal dataset O = {(v_i, t_i)}_{i=1}^{n} is given, where v_i and t_i denote a pair of image and text; we divide the data into mini-batches O = {o_1, o_2, ···, o_m}, where m denotes the batch size and o_j = [v_j, t_j] denotes the j-th image-text pair in each batch of data. For each randomly sampled batch of training samples, we use F_v to denote the feature representation of the visual modality and F_t to denote the feature representation of the text modality. Meanwhile, we denote the hash codes generated by the hash encoding networks as B_v ∈ {-1,+1}^{m×c} and B_t ∈ {-1,+1}^{m×c}, and the hash codes generated by the graph convolutional neural network as B_gv ∈ {-1,+1}^{m×c} and B_gt ∈ {-1,+1}^{m×c}, where c denotes the length of the hash code;
in the phase of constructing the similarity matrix, we first set F v And F t Go through l 2 Normalized toAnd->We then calculate the similarity matrix for visual and text modalities respectively using cosine similarity +.>Andwhich in turn are used to describe the inherent similarity between the original image and the text data; this isIn addition, we can use the generated hash code B v And B t The feature vector of the vertex of the high-dimensional space can be only taken; from this perspective, adjacent vertices correspond to similar hash codes, that is, the hamming distance between two hash codes can be represented by their cosine angular distance;
the hash method saves the storage space and improves the retrieval speed by mapping the original features to the binary code (Hamming) space; at the same time, the similarity of the data should be kept during the mapping process (the highly similar data in the original space is mapped to hamming space, and the distance between hash codes is also small.
In step S102, the framework includes a depth feature encoding module, a multi-mode similarity enhancing module, an adaptive graph attention module, and a hash code reconstruction module; the depth coding module contains two main networks: a visual coding network and a text coding network; visual language pre-training (VLP) models with CLIP as a representation have proven to be more efficient at learning text and visual representations; in the invention, a CLIP visual encoder and a multi-layer perceptron are adopted as a backbone network, so that semantic information of original data can be fully extracted and cross-modal characteristics can be learned; we represent the visual encoder as Enc v The text encoder is denoted Enc t The feature encoding formula is expressed as follows:
where V and T denote a batch of image and text training samples, and θ_v and θ_t denote the parameters of the visual and text feature encoding networks; we then learn the hash functions with MLPs and generate the hash codes as follows:
H_v = MLP_v(F_v; θ_Hv) ∈ [-1,+1]^{m×c}, H_t = MLP_t(F_t; θ_Ht) ∈ [-1,+1]^{m×c}. (2)
Therefore, the method can encode rich semantic features of the different modalities, better describe the semantic similarity between the original data, and further guide the learning of the hash codes;
B_v = tanh(αH_v) ∈ [-1,+1]^{m×c}, B_t = tanh(αH_t) ∈ [-1,+1]^{m×c}, (3)
where α denotes the number of iterations; as the number of iterations increases, the hyperbolic tangent function converges to the sign function. This iterative approximate optimization strategy is used to reduce the information loss in the binarization process of the hash code. In particular, we use the mini-batch visual features F_v to construct the visual-modality similarity matrix S_v = cos(F_v, F_v) ∈ [-1,+1]^{m×m}; for the text modality, we directly use the bag-of-words features F_t to create the text cosine similarity matrix S_t = cos(F_t, F_t) ∈ [-1,+1]^{m×m};
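A minimal sketch of this encoding stage, assuming the publicly available OpenAI CLIP package as the visual/text encoder and a small MLP hash head; the layer widths, the 64-bit code length and the frozen encoders are illustrative assumptions, not the patent's exact configuration:

```python
import torch
import torch.nn as nn
import clip  # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git

class HashHead(nn.Module):
    """MLP mapping encoder features to continuous codes in [-1, +1]^c (cf. Eq. (2))."""
    def __init__(self, in_dim, code_len):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, 1024), nn.ReLU(),
            nn.Linear(1024, code_len),
        )

    def forward(self, features, alpha=1.0):
        # tanh(alpha * H): approaches sign(H) as alpha grows (cf. Eq. (3))
        return torch.tanh(alpha * self.mlp(features))

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Stand-in mini-batch: in practice the images come from `preprocess` and the
# captions from the paired text of the dataset.
images = torch.randn(8, 3, 224, 224, device=device)
texts = clip.tokenize(["a photo of a dog"] * 8).to(device)

with torch.no_grad():                           # CLIP encoders kept frozen in this sketch
    F_v = model.encode_image(images).float()    # visual features (cf. Eq. (1))
    F_t = model.encode_text(texts).float()      # text features   (cf. Eq. (1))

hash_v = HashHead(F_v.shape[1], code_len=64).to(device)
hash_t = HashHead(F_t.shape[1], code_len=64).to(device)
B_v = hash_v(F_v, alpha=1.0)   # relaxed codes; binarized later with sign()
B_t = hash_t(F_t, alpha=1.0)
```

The tanh(α·) relaxation in the head mirrors Eq. (3): training starts with a smooth surrogate and sharpens toward sign(·) as α grows.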
Subsequently, we construct a cross-modal similarity matrix to capture the co-occurrence similarity of instances from different modalities; in particular, we use the visual-modality similarity matrix S_v and the text-modality similarity matrix S_t to construct a cross-modal cosine similarity matrix S_c, so that the co-occurrence information between image and text modality instances is preserved; the fusion process is described as follows:
where (·)^T denotes the transpose of a matrix; furthermore, we construct a semantic-preserving affinity matrix S_A, which integrates the information from the different matrices and is formulated as follows:
where η, β and λ are balance hyper-parameters used to weigh the importance of the similarity matrices of the image and text modalities; finally, we apply similarity enhancement to the fused affinity matrix S_A, as follows:
where S_max, S_mean and S_min denote the maximum, mean and minimum values of the similarity matrix, respectively; the similarity matrix enhancement is formulated as follows:
after the similarity is enhanced, the similarity enhancement matrix can be expressed as:compared with the previous unsupervised method, the similarity enhancement enables similar data to be closer and dissimilar data to be dissimilar by setting the threshold value, so that a better supervised signal is provided for the learning of the hash code;
the self-adaptive graph attention module can learn graph neighborhood correlation of self-adaptive different modes and adopts an attention mechanism to learn a similarity matrix of the self-adaptive modes, and the formula is as follows:
where W_v and W_t denote the projection matrices of the visual and text modalities, and γ is a trade-off hyper-parameter; the information between similar nodes is aggregated through the GCN to generate more consistent hash codes. Subsequently, we pass the attention similarity matrix to a two-layer graph convolutional network, which aggregates the graph-neighborhood correlations between similar nodes:
where D_ii = Σ_j s_ij, W^(1) and W^(2) are parameter matrices, σ_1 and σ_2 denote the activation functions of the first and second layers, and the outputs of the i-th layer of the visual and text modality graph convolutional networks are obtained accordingly. During training, the attention matrix is iteratively updated to maximize the similarity relations between instances, and the information of similar nodes is then aggregated through the graph convolutional network to generate more consistent hash codes, thereby improving image and text retrieval performance. The hash codes generated by the graph convolution are as follows:
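A compact sketch of the adaptive graph attention followed by a two-layer graph convolution with the degree normalization D_ii = Σ_j s_ij; the attention form (a scaled dot-product mixed with S_E via γ), the symmetric normalization and the layer widths are our assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveGraphAttention(nn.Module):
    """Learns a modality-adaptive similarity (attention) matrix for a feature batch."""
    def __init__(self, dim, gamma=0.5):
        super().__init__()
        self.W = nn.Linear(dim, dim, bias=False)   # projection matrix (assumed one W per modality)
        self.gamma = gamma                         # trade-off hyper-parameter

    def forward(self, X, S_E):
        A = torch.softmax(self.W(X) @ self.W(X).t() / X.shape[1] ** 0.5, dim=1)
        # assumed mixing of the learned attention graph with the enhanced similarity matrix
        return self.gamma * A + (1.0 - self.gamma) * S_E

class TwoLayerGCN(nn.Module):
    """Two graph convolutions that aggregate neighborhood information into hash codes."""
    def __init__(self, in_dim, hid_dim, code_len):
        super().__init__()
        self.W1 = nn.Linear(in_dim, hid_dim)       # parameter matrix W^(1)
        self.W2 = nn.Linear(hid_dim, code_len)     # parameter matrix W^(2)

    def forward(self, X, S, alpha=1.0):
        d = S.sum(dim=1).clamp(min=1e-8)                  # D_ii = sum_j s_ij
        S_hat = S / torch.sqrt(d[:, None] * d[None, :])   # assumed D^-1/2 S D^-1/2 normalization
        h = F.relu(self.W1(S_hat @ X))                    # first graph convolution, sigma_1 = ReLU
        out = self.W2(S_hat @ h)                          # second graph convolution
        return torch.tanh(alpha * out)                    # relaxed graph hash codes

# Usage with the stand-in F_v and S_E from the previous sketch
att_v = AdaptiveGraphAttention(dim=512, gamma=0.5)
gcn_v = TwoLayerGCN(in_dim=512, hid_dim=1024, code_len=64)
B_gv = gcn_v(F_v, att_v(F_v, S_E), alpha=1.0)
```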
where α denotes the number of iterations and we use the iterative approximate optimization strategy to optimize the hash codes; when α → ∞, tanh(α·) approaches the sign function, so the discrete problem is converted into a series of continuous optimization problems, which effectively alleviates the problems of information loss and instability in the hash code binarization process;
to better optimize the hash code, we come from the hash code B that will be generated by the network v 、B t 、B v And B v To construct cosine similarity matrixWherein S is * =cos(S * ,S * ),*∈{v,t},Finally, we use them and the similarity enhancement matrix S E Constructing a loss function; these loss functions are formulated as follows:
where L_Intra and L_Cross denote the intra-modal and cross-modal losses, respectively, and L_Gcn denotes the graph convolution reconstruction loss; μ is a scale hyper-parameter that adjusts the quantization range of the enhancement matrix, and the multiplication symbol in the loss denotes element-wise matrix multiplication.
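A hedged sketch of how the three reconstruction losses could be assembled: cosine matrices of the generated codes are regressed onto the scaled enhancement matrix μS_E with a mean-squared (Frobenius) objective. The exact pairing and weighting in the patent's equations may differ.

```python
import torch
import torch.nn.functional as F

def cross_cosine(Ba, Bb):
    """Cosine similarity matrix between two batches of (relaxed) hash codes."""
    return F.normalize(Ba, p=2, dim=1) @ F.normalize(Bb, p=2, dim=1).t()

def reconstruction_losses(B_v, B_t, B_gv, B_gt, S_E, mu=1.5):
    """Assumed intra-modal, cross-modal and graph-convolution reconstruction losses:
    each code-similarity matrix is regressed onto the scaled enhancement matrix."""
    target = mu * S_E
    L_intra = F.mse_loss(cross_cosine(B_v, B_v), target) + \
              F.mse_loss(cross_cosine(B_t, B_t), target)
    L_cross = F.mse_loss(cross_cosine(B_v, B_t), target)
    L_gcn = F.mse_loss(cross_cosine(B_gv, B_gv), target) + \
            F.mse_loss(cross_cosine(B_gt, B_gt), target)
    return L_intra, L_cross, L_gcn
```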
In step S102, for the objective and loss-function optimization, the proposed method iteratively updates the parameters of the entire network through the back-propagation algorithm until the network converges, thereby completing the reconstruction process of the hash codes; the total loss is formulated as follows:
where the weights are trade-off hyper-parameters; minimizing the loss function allows similar data to generate more consistent hash codes. The CAGAN method is optimized in a mini-batch iterative manner, and high-quality hash codes are generated by minimizing the loss; the entire CAGAN model can be optimized using the SGD and Adam optimization algorithms.
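The batch-iterative optimization described above can be tied together schematically as below; the module names follow the earlier sketches (with per-modality instances of the attention and GCN modules assumed), the data loader is assumed to yield preprocessed image tensors and tokenized captions, and the CLIP encoders are kept frozen for simplicity.

```python
import itertools
import torch

# Per-modality modules from the earlier sketches; the CLIP encoders stay frozen here.
params = itertools.chain(hash_v.parameters(), hash_t.parameters(),
                         att_v.parameters(), att_t.parameters(),
                         gcn_v.parameters(), gcn_t.parameters())
optimizer = torch.optim.Adam(params, lr=1e-4)

num_epochs = 50
for epoch in range(num_epochs):
    alpha = epoch + 1                               # tanh(alpha*H) tightens toward sign(H)
    for images, texts in loader:                    # mini-batch iteration over image-text pairs
        with torch.no_grad():
            F_v = model.encode_image(images).float()
            F_t = model.encode_text(texts).float()
        B_v, B_t = hash_v(F_v, alpha), hash_t(F_t, alpha)
        S_E = enhance_similarity(fuse_affinity(cosine_similarity_matrix(F_v),
                                               cosine_similarity_matrix(F_t)))
        B_gv = gcn_v(F_v, att_v(F_v, S_E), alpha)
        B_gt = gcn_t(F_t, att_t(F_t, S_E), alpha)
        L_intra, L_cross, L_gcn = reconstruction_losses(B_v, B_t, B_gv, B_gt, S_E)
        loss = L_intra + L_cross + L_gcn            # trade-off weights omitted for brevity
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```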
In step S103, MIRFlickr-25K is a multi-label dataset from the Flickr website containing 25,000 photos with associated textual description tags from 24 different categories. The NUS-WIDE dataset comprises 269,648 images collected from real scenes together with their corresponding text descriptions and labels. MS COCO is a widely used, diverse dataset for object recognition, multimedia retrieval and semantic segmentation; this dataset contains 123,287 images obtained from complex daily scenes, with the objects in each photograph located by careful segmentation. In our experiments we used 87,081 photos with 91 categories of information, and each corresponding text is represented by a 2000-dimensional bag-of-words vector;
in the step S103, in the experiment, two widely used index measurement indexes are adopted; average accuracy (MAP) and top-N curve accuracy measure the search performance of the proposed model compared to other methods; the accuracy and ranking information may be well reflected in the measurement method.
The use of the invention is as follows:
In the comparative experiments, we compare two cross-modal retrieval tasks, I→T and T→I: querying text using an image and retrieving images using text. The invention compares all baselines and CAGAN on the two retrieval tasks using the MAP@5000 and top-N precision curve evaluation metrics.
MAP@5000 comparison results: Table 1 shows the MAP@5000 results of the proposed CAGAN and other state-of-the-art unsupervised cross-modal hashing methods at hash code lengths from 16 bits to 128 bits over three benchmark datasets (MIRFlickr-25K, NUS-WIDE and MS COCO). As can be seen from the data in Table 1, the proposed method is better than all compared baselines. Our approach yields about a 1.5%-3% performance improvement over the most advanced unsupervised cross-modal hashing methods, which demonstrates the superiority of the proposed CAGAN. The performance improvement of our method is more pronounced on datasets with a large number of classes (MS COCO), and good performance is still maintained at low hash code lengths. This reflects the excellent fine-grained retrieval ability of the proposed model, which makes it more suitable for practical applications.
Top-N precision curves: FIG. 3 shows the top-N precision curves comparing the proposed method with all 11 baseline methods over the three benchmark datasets. From the curves in FIG. 3, our method is better than all comparison baselines, which intuitively reflects the efficiency of our CAGAN. Notably, as the number of retrieved instances increases, the top-N precision curve of our proposed method drops slowly. Finally, together with the MAP comparison results, the top-N precision curves also indicate that the proposed method reduces the precision loss of the binarization process, thereby improving retrieval performance and maintaining higher precision as the number of retrieved samples increases.
Table 1: MAP@5000 results of the proposed method for the image-text retrieval tasks under different hash code lengths and datasets (I→T denotes the image-to-text retrieval task and vice versa).
The references for the methods compared in the table are as follows:
[1] Su, S., Zhong, Z., & Zhang, C. (2019). Deep joint-semantics reconstructing hashing for large-scale unsupervised cross-modal retrieval. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 3027-3035).
[2] Liu, S., Qian, S., Guan, Y., Zhan, J., & Ying, L. (2020, July). Joint-modal distribution-based similarity hashing for large-scale unsupervised deep cross-modal retrieval. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 1379-1388).
[3] Zhang, P. F., Li, Y., Huang, Z., & Xu, X. S. (2021). Aggregation-based graph convolutional hashing for unsupervised cross-modal retrieval. IEEE Transactions on Multimedia, 24, 466-479.
[4] Yu, J., Zhou, H., Zhan, Y., & Tao, D. (2021, May). Deep graph-neighbor coherence preserving network for unsupervised cross-modal hashing. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 35, No. 5, pp. 4626-4634).
[5] Yang, D., Wu, D., Zhang, W., Zhang, H., Li, B., & Wang, W. (2020, June). Deep semantic-alignment hashing for unsupervised cross-modal retrieval. In Proceedings of the 2020 International Conference on Multimedia Retrieval (pp. 44-52).
[6] Zhang, P. F., Luo, Y., Huang, Z., Xu, X. S., & Song, J. (2021). High-order nonlocal hashing for unsupervised cross-modal retrieval. World Wide Web, 24, 563-583.
[7] Mikriukov, G., Ravanbakhsh, M., & Demir, B. (2022). Deep unsupervised contrastive hashing for large-scale cross-modal text-image retrieval in remote sensing. arXiv preprint arXiv:2201.08125.
[8] Shi, Y., Zhao, Y., Liu, X., Zheng, F., Ou, W., You, X., & Peng, Q. (2022). Deep adaptively-enhanced hashing with discriminative similarity guidance for unsupervised cross-modal retrieval. IEEE Transactions on Circuits and Systems for Video Technology, 32(10), 7255-7268.
to demonstrate the effectiveness and contribution of each module in our proposed method, ablation experiments were performed on each module. To this end we designed a variant of five models to verify the effect of each module on the whole model. The comparative results of the ablation experiments are shown in table 5.
We studied the convergence and training efficiency of the proposed CAGAN on the three benchmark datasets. FIG. 5 shows the final loss function convergence curves at a 16-bit hash code length, together with the MAP variation curves as the number of iterations increases.
From the results in the figure, the following conclusions can be drawn. First, as the number of optimization iterations increases, the loss function gradually decreases, which shows that the optimization process improves the encoding capability of the hash function. Second, the method reduces training time consumption and improves training efficiency. Finally, the results show that the proposed network converges to the optimal point within a few tens of iterations.
While the basic principles, main features and advantages of the present invention have been shown and described, it will be understood by those skilled in the art that the present invention is not limited to the foregoing embodiments; the foregoing embodiments and the description merely illustrate the principles of the present invention, and various changes and modifications may be made without departing from the spirit and scope of the invention, which is defined by the appended claims and their equivalents.

Claims (5)

1. A multi-modal data retrieval model based on adaptive graph attention hashing, characterized in that the model comprises the following specific steps:
step S101: establishing a deep unsupervised cross-modal hashing model, and introducing an attention mechanism and a graph neural network;
step S102: constructing an unsupervised cross-modal hash retrieval framework using a CLIP-based adaptive graph attention network (CAGAN), and optimizing its objectives and loss functions;
step S103: collecting the data sets, carrying out comprehensive experiments and metric evaluation on the collected data, and implementing the experimental details;
step S104: performing comparison experiments, ablation experiments and hyper-parameter sensitivity analysis on the data sets;
step S105: performing convergence experiments and training efficiency analysis, and then performing cross-modal hash retrieval.
2. The multi-modal data retrieval model based on adaptive graph attention hashing of claim 1, wherein: in step S101, CLIP is used to extract cross-modal semantic features, and a transferable visual model is learned from natural language supervision; a multi-modal similarity enhancement module is used to fuse and enhance the similarity information of different modality data, which can effectively alleviate the inaccurate similarity measurement of multi-modal data; the attention mechanism addresses the problem of information redundancy by focusing on the information most critical to the current target among multiple inputs, and an attention-aware semantic fusion matrix is constructed based on the attention mechanism; an adaptive graph attention module is devised to solve this problem, which uses an attention mechanism to learn the semantic affinity graph and aggregates information between similar nodes through graph convolution, thereby making similar data produce more consistent hash codes.
3. The multi-modal data retrieval model based on adaptive graph attention hashing of claim 1, wherein: in step S102, the framework includes a depth feature encoding module, a multi-modal similarity enhancement module, an adaptive graph attention module, and a hash code reconstruction module; the depth encoding module contains two main networks, a visual encoding network and a text encoding network; a CLIP visual encoder and a multi-layer perceptron are used as the backbone network, which is able to fully extract the semantic information of the original data and learn cross-modal features.
4. The multi-modal data retrieval model based on adaptive graph attention hashing of claim 1, wherein: the adaptive graph attention module is capable of learning the graph-neighborhood correlations of different modalities and employs an attention mechanism to learn a modality-adaptive similarity matrix; the attention similarity matrix is then passed to a two-layer graph convolutional network that aggregates the graph-neighborhood correlations between similar nodes, so the similarity between different modality data can be learned using the attention mechanism; during training, the attention matrix is iteratively updated to maximize the similarity relations between instances, and the information of similar nodes is then aggregated through the graph convolutional network to generate more consistent hash codes, thereby improving image and text retrieval performance; an iterative approximate optimization strategy is used to optimize the hash codes, converting the discrete problem into a series of continuous optimization problems, which effectively alleviates the problems of information loss and instability in the hash code binarization process.
5. The multi-modal data retrieval model of claim 1, wherein: in step S102, the objective and loss-function optimization iteratively updates the parameters of the entire network through the back-propagation algorithm until the network converges, and the entire CAGAN model can be optimized using the SGD and Adam optimization algorithms.
CN202310380197.5A 2023-04-11 2023-04-11 Multi-mode data retrieval model based on self-adaptive graph attention hash Pending CN116796032A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310380197.5A CN116796032A (en) 2023-04-11 2023-04-11 Multi-mode data retrieval model based on self-adaptive graph attention hash

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310380197.5A CN116796032A (en) 2023-04-11 2023-04-11 Multi-mode data retrieval model based on self-adaptive graph attention hash

Publications (1)

Publication Number Publication Date
CN116796032A true CN116796032A (en) 2023-09-22

Family

ID=88046980

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310380197.5A Pending CN116796032A (en) 2023-04-11 2023-04-11 Multi-mode data retrieval model based on self-adaptive graph attention hash

Country Status (1)

Country Link
CN (1) CN116796032A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111914156A (en) * 2020-08-14 2020-11-10 中国科学院自动化研究所 Cross-modal retrieval method and system for self-adaptive label perception graph convolution network
CN112199532A (en) * 2020-09-01 2021-01-08 中国科学院信息工程研究所 Zero sample image retrieval method and device based on Hash coding and graph attention machine mechanism
CN115599942A (en) * 2022-11-08 2023-01-13 重庆师范大学(Cn) GCN-based deep unsupervised cross-modal retrieval method
CN115687571A (en) * 2022-10-28 2023-02-03 重庆师范大学 Depth unsupervised cross-modal retrieval method based on modal fusion reconstruction hash
CN115840827A (en) * 2022-11-07 2023-03-24 重庆师范大学 Deep unsupervised cross-modal Hash retrieval method

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111914156A (en) * 2020-08-14 2020-11-10 中国科学院自动化研究所 Cross-modal retrieval method and system for self-adaptive label perception graph convolution network
CN112199532A (en) * 2020-09-01 2021-01-08 中国科学院信息工程研究所 Zero sample image retrieval method and device based on Hash coding and graph attention machine mechanism
CN115687571A (en) * 2022-10-28 2023-02-03 重庆师范大学 Depth unsupervised cross-modal retrieval method based on modal fusion reconstruction hash
CN115840827A (en) * 2022-11-07 2023-03-24 重庆师范大学 Deep unsupervised cross-modal Hash retrieval method
CN115599942A (en) * 2022-11-08 2023-01-13 重庆师范大学(Cn) GCN-based deep unsupervised cross-modal retrieval method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
YEWEN LI et al.: "CLIP-Based Adaptive Graph Attention Network for Large-Scale Unsupervised Multi-Modal Hashing Retrieval", Sensors (Basel, Switzerland), vol. 23, no. 7, pages 3439 *

Similar Documents

Publication Publication Date Title
Nie et al. Deep multiscale fusion hashing for cross-modal retrieval
Shi et al. Deep adaptively-enhanced hashing with discriminative similarity guidance for unsupervised cross-modal retrieval
CN106033426B (en) Image retrieval method based on latent semantic minimum hash
CN116204706A (en) Multi-mode content retrieval method and system for text content and image analysis
Li et al. DAHP: Deep attention-guided hashing with pairwise labels
Zhang et al. Scalable discrete matrix factorization and semantic autoencoder for cross-media retrieval
Yang et al. Asymmetric cross–modal hashing with high–level semantic similarity
CN114281982B (en) Book propaganda abstract generation method and system adopting multi-mode fusion technology
Su et al. Semi-supervised knowledge distillation for cross-modal hashing
Tu et al. Unsupervised cross-modal hashing via semantic text mining
Liu et al. Deep cross-modal hashing based on semantic consistent ranking
Xu et al. Idhashgan: deep hashing with generative adversarial nets for incomplete data retrieval
Wang et al. Cross-modal image–text search via efficient discrete class alignment hashing
Zou et al. Transductive zero-shot hashing for multilabel image retrieval
Duan et al. A web knowledge-driven multimodal retrieval method in computational social systems: Unsupervised and robust graph convolutional hashing
Yu et al. Hadamard matrix-guided multi-modal hashing for multi-modal retrieval
CN116594994B (en) Application method of visual language knowledge distillation in cross-modal hash retrieval
Zhang et al. Graph convolution based efficient re-ranking for visual retrieval
Li et al. Cross-Model Hashing Retrieval Based on Deep Residual Network.
Sun et al. Learning from expert: Vision-language knowledge distillation for unsupervised cross-modal hashing retrieval
CN112035689A (en) Zero sample image hash retrieval method based on vision-to-semantic network
Mingyong et al. CLIP-based fusion-modal reconstructing hashing for large-scale unsupervised cross-modal retrieval
CN116796032A (en) Multi-mode data retrieval model based on self-adaptive graph attention hash
Xie et al. Multi-similarity reconstructing and clustering-based contrastive hashing for cross-modal retrieval
CN114637846A (en) Video data processing method, video data processing device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination