CN116796032A - Multi-mode data retrieval model based on self-adaptive graph attention hash - Google Patents

Multi-mode data retrieval model based on self-adaptive graph attention hash

Info

Publication number
CN116796032A
CN116796032A CN202310380197.5A
Authority
CN
China
Prior art keywords
hash
modal
attention
data
graph
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310380197.5A
Other languages
Chinese (zh)
Inventor
李明勇
李业文
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing Normal University
Original Assignee
Chongqing Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing Normal University filed Critical Chongqing Normal University
Priority to CN202310380197.5A priority Critical patent/CN116796032A/en
Publication of CN116796032A publication Critical patent/CN116796032A/en
Pending legal-status Critical Current

Links

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 - Details of database functions independent of the retrieved data types
    • G06F16/901 - Indexing; Data structures therefor; Storage structures
    • G06F16/9014 - Hash tables
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 - Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 - Complex mathematical operations
    • G06F17/16 - Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G06N3/088 - Non-supervised learning, e.g. competitive learning
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a multi-modal data retrieval model based on adaptive graph attention hashing, which establishes a deep unsupervised cross-modal hashing model and introduces an attention mechanism and a graph neural network; a CLIP-based adaptive graph attention network (CAGAN) is constructed as an unsupervised cross-modal hash retrieval framework, and its objectives and loss functions are optimized; the data sets are first collected, comprehensive experiments and metric evaluations are carried out on the collected data, and the experimental details are then implemented; comparison experiments, ablation experiments and hyper-parameter sensitivity analyses are performed on the data sets; convergence experiments are conducted, and the cross-modal hash retrieval results are finally visualized. The multi-modal data retrieval model based on adaptive graph attention hashing has the characteristics of high cross-modal retrieval accuracy, low data storage consumption and high retrieval speed.

Description

Multi-mode data retrieval model based on self-adaptive graph attention hash
Technical Field
The invention belongs to the technical field of multi-modal data retrieval, and particularly relates to a multi-modal data retrieval model based on adaptive graph attention hashing.
Background
The basic idea of cross-modal hash retrieval is to learn a hash transformation for each modality from paired samples of different modalities and to map the data of the different modalities into a binary Hamming space. The similarity of the data is preserved during this mapping, and fast cross-modal retrieval is then performed in the Hamming space. Cross-modal hashing methods can be divided into two categories: supervised methods use semantic labels to bridge the heterogeneity gap and the semantic gap, while unsupervised methods eliminate the dependence on label information and consider only paired multimedia data. The unsupervised setting is far less explored than the supervised one, and this work aims to improve the retrieval performance of cross-modal hashing under unsupervised conditions. In recent years, thanks to the strong feature extraction capability of deep neural networks, unsupervised cross-modal hash retrieval methods based on deep learning have made great progress. Although these unsupervised methods achieve impressive performance, most of them suffer from inaccurate similarity measurement and unbalanced multi-modal learning, resulting in suboptimal retrieval results. In particular, it is difficult to comprehensively measure complex data correlations with the simple data features of different modalities. In the transition from real values to binary codes, the original structure of the hash code is destroyed and information is lost. In addition, modality gaps and data deviations cause unbalanced multi-modal learning, and the training efficiency of existing methods is still limited.
To address these issues, we propose a novel and efficient CLIP-based adaptive graph attention network for large-scale unsupervised cross-modal hash retrieval.
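For concreteness, the following sketch (added here for illustration and not part of the patent text) shows the retrieval step described above: database items, already encoded as ±1 hash codes, are ranked against a query code by Hamming distance. The array names and the 64-bit code length are hypothetical.

```python
import numpy as np

def hamming_rank(query_code, db_codes):
    """Rank database items by Hamming distance to a query code.

    query_code: (c,) array with entries in {-1, +1}
    db_codes:   (n, c) array with entries in {-1, +1}
    Returns database indices ordered from most to least similar.
    """
    c = query_code.shape[0]
    # For +/-1 codes: b_i . b_j = c - 2 * d_H(b_i, b_j), so all distances follow from one product
    dists = (c - db_codes @ query_code) / 2
    return np.argsort(dists)

# Toy usage: 64-bit codes for a database of 1000 items and a single query
rng = np.random.default_rng(0)
db = rng.choice([-1, 1], size=(1000, 64))
query = rng.choice([-1, 1], size=64)
top10 = hamming_rank(query, db)[:10]   # indices of the ten nearest items
```

Because the Hamming distances of ±1 codes follow from a single matrix product (or bitwise operations on packed codes), retrieval in the Hamming space is fast and the codes are cheap to store, which is the motivation stated above.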
Disclosure of Invention
The present invention aims to solve the above-mentioned problems and to provide a multi-modal data retrieval model based on adaptive graph attention hashing, which addresses the problems mentioned in the background art.
In order to solve these problems, the invention provides the following technical solution. A multi-modal data retrieval model based on adaptive graph attention hashing comprises the following specific steps:
step S101: establishing a deep unsupervised cross-modal hashing model, and introducing an attention mechanism and a graph neural network;
step S102: constructing an unsupervised cross-modal hash retrieval framework using a CLIP-based adaptive graph attention network (CAGAN), and optimizing its objectives and loss functions;
step S103: collecting the data sets, carrying out comprehensive experiments and metric evaluation on the collected data, and implementing the experimental details;
step S104: performing comparison experiments, ablation experiments and hyper-parameter sensitivity analysis on the data sets;
step S105: performing convergence experiments, and then visualizing the cross-modal hash retrieval results.
Firstly, the invention uses CLIP to extract cross-modal semantic features; CLIP learns transferable visual models from natural language supervision, so fine-grained semantic features of the multi-modal data can be extracted.
The invention designs a multi-modal similarity enhancement module to fuse and enhance the similarity information of different modality data, which can effectively alleviate the inaccurate similarity measurement of multi-modal data;
The invention adopts an attention mechanism to focus on the relevant salient features; through the attention module, the extracted features can be directed to the important information of the different modalities, so as to construct an attention-aware semantic fusion matrix;
The invention provides a GCN-based cross-modal hashing method; specifically, the GCN-based cross-modal hashing method adopts an individual GCN for each modality under semantic guidance;
wherein each GCN acts independently on its own modality to preserve the intra-modal similarity, and a graph convolutional neural network is adopted to aggregate the similarity information of the instances of each modality, thereby further mining the semantic relevance of different modality data;
The invention comprehensively utilizes the features of different modalities to construct a semantic affinity graph, thereby alleviating the inaccurate relation measurement among the data nodes;
The invention devises an adaptive graph attention module to solve this problem, which uses an attention mechanism to learn the semantic affinity graph and aggregates information between similar nodes through graph convolution, thereby making similar data produce more consistent hash codes.
In step S102, the framework includes a depth feature encoding module, a multi-modal similarity enhancement module, an adaptive graph attention module, and a hash code reconstruction module; the depth encoding module contains two main networks: a visual encoding network and a text encoding network;
an efficient and novel CLIP-based adaptive graph annotation network (CAGAN) is presented for use in unsupervised cross-modal hash retrieval tasks.
In this work, we apply the visual-language model CLIP to unsupervised image-text hash retrieval for the first time. To alleviate the problem of inaccurate similarity, we design a multi-modal similarity enhancement module to enhance the similarity of the data, which helps to improve retrieval accuracy.
In addition, an iterative approximate optimization strategy is used to reduce the information loss during the hash code binarization process.
Finally, a carefully designed adaptive graph attention module can assist in learning the hash network, improve the hash code representation capability and alleviate the problem of multi-modal learning imbalance. Extensive experiments on three benchmark data sets show that the proposed method is superior to several representative state-of-the-art methods, achieving the best retrieval accuracy.
Drawings
For ease of illustration, the invention is described in detail by the following detailed description and the accompanying drawings.
FIG. 1 is a workflow diagram of the present invention;
FIG. 2 is a diagram of an unsupervised cross-modal hash retrieval framework of the present invention;
FIG. 3 is a top-N precision curve comparison graph of 128-bit hash codes on three cross-modal retrieval reference data sets according to the present invention;
FIG. 4 is a graph of the present invention illustrating the analysis of hyper-parametric sensitivity in three multi-modal retrieved reference datasets;
FIG. 5 is a graph of the convergence of the loss function and MAP variation for the CAGAN of the present invention over three widely used multimedia data sets.
Detailed Description
As shown in fig. 1, this embodiment is described in detail as follows:
a multi-mode data retrieval model based on adaptive graph attention hash comprises the following specific steps:
step S101: establishing a deep unsupervised cross-modal hashing model, and introducing an attention mechanism and a graph neural network;
step S102: constructing an unsupervised cross-modal hash retrieval framework using a CLIP-based adaptive graph attention network (CAGAN), and optimizing its objectives and loss functions;
step S103: collecting the data sets, carrying out comprehensive experiments and metric evaluation on the collected data, and implementing the experimental details;
step S104: performing comparison experiments, ablation experiments and hyper-parameter sensitivity analysis on the data sets;
step S105: performing convergence experiments, and then visualizing the cross-modal hash retrieval results.
In step S101, the invention uses CLIP to extract cross-modal semantic features and learns a transferable visual model from natural language supervision; the multi-modal similarity enhancement module is used to fuse and enhance the similarity information of different modality data, which can effectively alleviate the inaccurate similarity measurement of multi-modal data; the invention provides a hashing network with an attention mechanism, which enhances the measurement of content similarity by selectively focusing on the informative parts of the multi-modal data and attends to the relevant salient features; through the attention module, the extracted features can be directed to the important information of the different modalities, so as to construct an attention-aware semantic fusion matrix; in addition, the invention devises an adaptive graph attention module to address these problems, which uses an attention mechanism to learn the semantic affinity graph and aggregates information between similar nodes through graph convolution, thereby making similar data produce more consistent hash codes.
In step S102, the symbols and the problem are first defined. A cross-modal dataset O = {(v_i, t_i)}_{i=1}^{n} is given, where v_i and t_i denote a pair of image and text; we divide the data into mini-batches O = {o_1, o_2, ···, o_m}, where m denotes the batch size and o_j = [v_j, t_j] denotes the j-th image-text pair in each batch of data. For each randomly sampled batch of training samples, we use F_v to denote the feature representation of the visual modality and F_t to denote the feature representation of the text modality. Meanwhile, we denote the hash codes generated by the hash encoding networks as B_v ∈ {-1,+1}^{m×c} and B_t ∈ {-1,+1}^{m×c}, and the hash codes generated by the graph convolutional neural network as B_gv ∈ {-1,+1}^{m×c} and B_gt ∈ {-1,+1}^{m×c}, where c denotes the length of the hash code;
in the phase of constructing the similarity matrix, we first set F v And F t Go through l 2 Normalized toAnd->We then calculate the similarity matrix for visual and text modalities respectively using cosine similarity +.>Andwhich in turn are used to describe the inherent similarity between the original image and the text data; this isIn addition, we can use the generated hash code B v And B t The feature vector of the vertex of the high-dimensional space can be only taken; from this perspective, adjacent vertices correspond to similar hash codes, that is, the hamming distance between two hash codes can be represented by their cosine angular distance;
the hash method saves the storage space and improves the retrieval speed by mapping the original features to the binary code (Hamming) space; at the same time, the similarity of the data should be kept during the mapping process (the highly similar data in the original space is mapped to hamming space, and the distance between hash codes is also small.
In step S102, the framework includes a depth feature encoding module, a multi-mode similarity enhancing module, an adaptive graph attention module, and a hash code reconstruction module; the depth coding module contains two main networks: a visual coding network and a text coding network; visual language pre-training (VLP) models with CLIP as a representation have proven to be more efficient at learning text and visual representations; in the invention, a CLIP visual encoder and a multi-layer perceptron are adopted as a backbone network, so that semantic information of original data can be fully extracted and cross-modal characteristics can be learned; we represent the visual encoder as Enc v The text encoder is denoted Enc t The feature encoding formula is expressed as follows:
where V and T denote a batch of image and text training samples, and θ_v and θ_t denote the parameters of the visual and text feature encoding networks; we then learn the hash functions with MLPs and generate the hash codes as follows:
H_v = MLP_v(F_v; θ_Hv) ∈ [-1,+1]^{m×c}, H_t = MLP_t(F_t; θ_Ht) ∈ [-1,+1]^{m×c}. (2)
Therefore, the method can encode rich semantic features of the different modalities, better describe the semantic similarity between the original data, and further guide the learning of the hash codes;
B_v = tanh(αH_v) ∈ [-1,+1]^{m×c}, B_t = tanh(αH_t) ∈ [-1,+1]^{m×c}, (3)
where α denotes the number of iterations; as the number of iterations increases, the hyperbolic tangent function converges to the sign function. This iterative approximate optimization strategy is used to reduce the information loss in the binarization process of the hash code. In particular, we use the mini-batch visual features F_v to construct the visual-modality similarity matrix S_v = cos(F_v, F_v) ∈ [-1,+1]^{m×m}; for the text modality, we directly use the bag-of-words features F_t to create the text cosine similarity matrix S_t = cos(F_t, F_t) ∈ [-1,+1]^{m×m};
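A minimal sketch of this encoding stage, assuming the publicly available OpenAI CLIP package as the visual/text encoder and a small MLP hash head; the layer widths, the 64-bit code length and the frozen encoders are illustrative assumptions, not the patent's exact configuration:

```python
import torch
import torch.nn as nn
import clip  # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git

class HashHead(nn.Module):
    """MLP mapping encoder features to continuous codes in [-1, +1]^c (cf. Eq. (2))."""
    def __init__(self, in_dim, code_len):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, 1024), nn.ReLU(),
            nn.Linear(1024, code_len),
        )

    def forward(self, features, alpha=1.0):
        # tanh(alpha * H): approaches sign(H) as alpha grows (cf. Eq. (3))
        return torch.tanh(alpha * self.mlp(features))

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Stand-in mini-batch: in practice the images come from `preprocess` and the
# captions from the paired text of the dataset.
images = torch.randn(8, 3, 224, 224, device=device)
texts = clip.tokenize(["a photo of a dog"] * 8).to(device)

with torch.no_grad():                           # CLIP encoders kept frozen in this sketch
    F_v = model.encode_image(images).float()    # visual features (cf. Eq. (1))
    F_t = model.encode_text(texts).float()      # text features   (cf. Eq. (1))

hash_v = HashHead(F_v.shape[1], code_len=64).to(device)
hash_t = HashHead(F_t.shape[1], code_len=64).to(device)
B_v = hash_v(F_v, alpha=1.0)   # relaxed codes; binarized later with sign()
B_t = hash_t(F_t, alpha=1.0)
```

The tanh(α·) relaxation in the head mirrors Eq. (3): training starts with a smooth surrogate and sharpens toward sign(·) as α grows.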
Subsequently, we construct a cross-modal similarity matrix to capture the co-occurrence similarity of instances from different modalities; in particular, we use the visual-modality similarity matrix S_v and the text-modality similarity matrix S_t to construct a cross-modal cosine similarity matrix S_c, so that the co-occurrence information between image and text modality instances is preserved; the fusion process is described as follows:
where (·)^T denotes the transpose of a matrix; furthermore, we construct a semantic-preserving affinity matrix S_A, which integrates the information from the different matrices and is formulated as follows:
where η, β and λ are balance hyper-parameters used to weigh the importance of the similarity matrices of the image and text modalities; finally, we apply similarity enhancement to the fused affinity matrix S_A, as follows:
where S_max, S_mean and S_min denote the maximum, mean and minimum values of the similarity matrix, respectively; the similarity matrix enhancement is formulated as follows:
after the similarity is enhanced, the similarity enhancement matrix can be expressed as:compared with the previous unsupervised method, the similarity enhancement enables similar data to be closer and dissimilar data to be dissimilar by setting the threshold value, so that a better supervised signal is provided for the learning of the hash code;
the self-adaptive graph attention module can learn graph neighborhood correlation of self-adaptive different modes and adopts an attention mechanism to learn a similarity matrix of the self-adaptive modes, and the formula is as follows:
where W_v and W_t denote the projection matrices of the visual and text modalities, and γ is a trade-off hyper-parameter; the information between similar nodes is aggregated through the GCN to generate more consistent hash codes. Subsequently, we pass the attention similarity matrix to a two-layer graph convolutional network, which aggregates the graph-neighborhood correlations between similar nodes:
where D_ii = Σ_j s_ij, W^(1) and W^(2) are parameter matrices, σ_1 and σ_2 denote the activation functions of the first and second layers, and the outputs of the i-th layer of the visual and text modality graph convolutional networks are obtained accordingly. During training, the attention matrix is iteratively updated to maximize the similarity relations between instances, and the information of similar nodes is then aggregated through the graph convolutional network to generate more consistent hash codes, thereby improving image and text retrieval performance. The hash codes generated by the graph convolution are as follows:
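A compact sketch of the adaptive graph attention followed by a two-layer graph convolution with the degree normalization D_ii = Σ_j s_ij; the attention form (a scaled dot-product mixed with S_E via γ), the symmetric normalization and the layer widths are our assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveGraphAttention(nn.Module):
    """Learns a modality-adaptive similarity (attention) matrix for a feature batch."""
    def __init__(self, dim, gamma=0.5):
        super().__init__()
        self.W = nn.Linear(dim, dim, bias=False)   # projection matrix (assumed one W per modality)
        self.gamma = gamma                         # trade-off hyper-parameter

    def forward(self, X, S_E):
        A = torch.softmax(self.W(X) @ self.W(X).t() / X.shape[1] ** 0.5, dim=1)
        # assumed mixing of the learned attention graph with the enhanced similarity matrix
        return self.gamma * A + (1.0 - self.gamma) * S_E

class TwoLayerGCN(nn.Module):
    """Two graph convolutions that aggregate neighborhood information into hash codes."""
    def __init__(self, in_dim, hid_dim, code_len):
        super().__init__()
        self.W1 = nn.Linear(in_dim, hid_dim)       # parameter matrix W^(1)
        self.W2 = nn.Linear(hid_dim, code_len)     # parameter matrix W^(2)

    def forward(self, X, S, alpha=1.0):
        d = S.sum(dim=1).clamp(min=1e-8)                  # D_ii = sum_j s_ij
        S_hat = S / torch.sqrt(d[:, None] * d[None, :])   # assumed D^-1/2 S D^-1/2 normalization
        h = F.relu(self.W1(S_hat @ X))                    # first graph convolution, sigma_1 = ReLU
        out = self.W2(S_hat @ h)                          # second graph convolution
        return torch.tanh(alpha * out)                    # relaxed graph hash codes

# Usage with the stand-in F_v and S_E from the previous sketch
att_v = AdaptiveGraphAttention(dim=512, gamma=0.5)
gcn_v = TwoLayerGCN(in_dim=512, hid_dim=1024, code_len=64)
B_gv = gcn_v(F_v, att_v(F_v, S_E), alpha=1.0)
```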
where α denotes the number of iterations and we use the iterative approximate optimization strategy to optimize the hash codes; when α → ∞, tanh(α·) approaches the sign function, so the discrete problem is converted into a series of continuous optimization problems, which effectively alleviates the problems of information loss and instability in the hash code binarization process;
to better optimize the hash code, we come from the hash code B that will be generated by the network v 、B t 、B v And B v To construct cosine similarity matrixWherein S is * =cos(S * ,S * ),*∈{v,t},Finally, we use them and the similarity enhancement matrix S E Constructing a loss function; these loss functions are formulated as follows:
where L_Intra and L_Cross denote the intra-modal and cross-modal losses, respectively, and L_Gcn denotes the graph convolution reconstruction loss; μ is a scale hyper-parameter that adjusts the quantization range of the enhancement matrix, and the multiplication symbol in the loss denotes element-wise matrix multiplication.
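A hedged sketch of how the three reconstruction losses could be assembled: cosine matrices of the generated codes are regressed onto the scaled enhancement matrix μS_E with a mean-squared (Frobenius) objective. The exact pairing and weighting in the patent's equations may differ.

```python
import torch
import torch.nn.functional as F

def cross_cosine(Ba, Bb):
    """Cosine similarity matrix between two batches of (relaxed) hash codes."""
    return F.normalize(Ba, p=2, dim=1) @ F.normalize(Bb, p=2, dim=1).t()

def reconstruction_losses(B_v, B_t, B_gv, B_gt, S_E, mu=1.5):
    """Assumed intra-modal, cross-modal and graph-convolution reconstruction losses:
    each code-similarity matrix is regressed onto the scaled enhancement matrix."""
    target = mu * S_E
    L_intra = F.mse_loss(cross_cosine(B_v, B_v), target) + \
              F.mse_loss(cross_cosine(B_t, B_t), target)
    L_cross = F.mse_loss(cross_cosine(B_v, B_t), target)
    L_gcn = F.mse_loss(cross_cosine(B_gv, B_gv), target) + \
            F.mse_loss(cross_cosine(B_gt, B_gt), target)
    return L_intra, L_cross, L_gcn
```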
In step S102, for the objective and loss-function optimization, the proposed method iteratively updates the parameters of the entire network through the back-propagation algorithm until the network converges, thereby completing the reconstruction process of the hash codes; the total loss is formulated as follows:
where the weights are trade-off hyper-parameters; minimizing the loss function allows similar data to generate more consistent hash codes. The CAGAN method is optimized in a mini-batch iterative manner, and high-quality hash codes are generated by minimizing the loss; the entire CAGAN model can be optimized using the SGD and Adam optimization algorithms.
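The batch-iterative optimization described above can be tied together schematically as below; the module names follow the earlier sketches (with per-modality instances of the attention and GCN modules assumed), the data loader is assumed to yield preprocessed image tensors and tokenized captions, and the CLIP encoders are kept frozen for simplicity.

```python
import itertools
import torch

# Per-modality modules from the earlier sketches; the CLIP encoders stay frozen here.
params = itertools.chain(hash_v.parameters(), hash_t.parameters(),
                         att_v.parameters(), att_t.parameters(),
                         gcn_v.parameters(), gcn_t.parameters())
optimizer = torch.optim.Adam(params, lr=1e-4)

num_epochs = 50
for epoch in range(num_epochs):
    alpha = epoch + 1                               # tanh(alpha*H) tightens toward sign(H)
    for images, texts in loader:                    # mini-batch iteration over image-text pairs
        with torch.no_grad():
            F_v = model.encode_image(images).float()
            F_t = model.encode_text(texts).float()
        B_v, B_t = hash_v(F_v, alpha), hash_t(F_t, alpha)
        S_E = enhance_similarity(fuse_affinity(cosine_similarity_matrix(F_v),
                                               cosine_similarity_matrix(F_t)))
        B_gv = gcn_v(F_v, att_v(F_v, S_E), alpha)
        B_gt = gcn_t(F_t, att_t(F_t, S_E), alpha)
        L_intra, L_cross, L_gcn = reconstruction_losses(B_v, B_t, B_gv, B_gt, S_E)
        loss = L_intra + L_cross + L_gcn            # trade-off weights omitted for brevity
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```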
In step S103, MIRFlickr-25K is a multi-label dataset from the Flickr website containing 25,000 photos with associated textual description tags from 24 different categories. The NUS-WIDE dataset comprises 269,648 images collected from real scenes together with their corresponding text descriptions and labels. MS COCO is a widely used, diverse dataset for object recognition, multimedia retrieval and semantic segmentation; this dataset contains 123,287 images obtained from complex daily scenes, with the objects in each photograph located by careful segmentation. In our experiments we used 87,081 photos with 91 categories of information, and each corresponding text is represented by a 2000-dimensional bag-of-words vector;
in the step S103, in the experiment, two widely used index measurement indexes are adopted; average accuracy (MAP) and top-N curve accuracy measure the search performance of the proposed model compared to other methods; the accuracy and ranking information may be well reflected in the measurement method.
The use of the invention is as follows:
In the comparative experiments, we compare two cross-modal retrieval tasks, I→T and T→I: querying text using an image and retrieving images using text. The invention compares all baselines and CAGAN on the two retrieval tasks using the MAP@5000 and top-N precision curve evaluation metrics.
MAP@5000 comparison results: Table 1 shows the MAP@5000 results of the proposed CAGAN and other state-of-the-art unsupervised cross-modal hashing methods at hash code lengths from 16 bits to 128 bits over three benchmark datasets (MIRFlickr-25K, NUS-WIDE and MS COCO). As can be seen from the data in Table 1, the proposed method is better than all compared baselines. Our approach yields about a 1.5%-3% performance improvement over the most advanced unsupervised cross-modal hashing methods, which demonstrates the superiority of the proposed CAGAN. The performance improvement of our method is more pronounced on datasets with a large number of classes (MS COCO), and good performance is still maintained at low hash code lengths. This reflects the excellent fine-grained retrieval ability of the proposed model, which makes it more suitable for practical applications.
Top-N precision curves: FIG. 3 shows the top-N precision curves comparing the proposed method with all 11 baseline methods over the three benchmark datasets. From the curves in FIG. 3, our method is better than all comparison baselines, which intuitively reflects the efficiency of our CAGAN. Notably, as the number of retrieved instances increases, the top-N precision curve of our proposed method drops slowly. Finally, together with the MAP comparison results, the top-N precision curves also indicate that the proposed method reduces the precision loss of the binarization process, thereby improving retrieval performance and maintaining higher precision as the number of retrieved samples increases.
Table 1: MAP@5000 results of the proposed method for the image-text retrieval tasks under different hash code lengths and datasets (I→T denotes the image-to-text retrieval task and vice versa).
The references for the methods compared in the table are as follows:
[1] Su, S., Zhong, Z., & Zhang, C. (2019). Deep joint-semantics reconstructing hashing for large-scale unsupervised cross-modal retrieval. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 3027-3035).
[2] Liu, S., Qian, S., Guan, Y., Zhan, J., & Ying, L. (2020, July). Joint-modal distribution-based similarity hashing for large-scale unsupervised deep cross-modal retrieval. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 1379-1388).
[3] Zhang, P. F., Li, Y., Huang, Z., & Xu, X. S. (2021). Aggregation-based graph convolutional hashing for unsupervised cross-modal retrieval. IEEE Transactions on Multimedia, 24, 466-479.
[4] Yu, J., Zhou, H., Zhan, Y., & Tao, D. (2021, May). Deep graph-neighbor coherence preserving network for unsupervised cross-modal hashing. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 35, No. 5, pp. 4626-4634).
[5] Yang, D., Wu, D., Zhang, W., Zhang, H., Li, B., & Wang, W. (2020, June). Deep semantic-alignment hashing for unsupervised cross-modal retrieval. In Proceedings of the 2020 International Conference on Multimedia Retrieval (pp. 44-52).
[6] Zhang, P. F., Luo, Y., Huang, Z., Xu, X. S., & Song, J. (2021). High-order nonlocal hashing for unsupervised cross-modal retrieval. World Wide Web, 24, 563-583.
[7] Mikriukov, G., Ravanbakhsh, M., & Demir, B. (2022). Deep unsupervised contrastive hashing for large-scale cross-modal text-image retrieval in remote sensing. arXiv preprint arXiv:2201.08125.
[8] Shi, Y., Zhao, Y., Liu, X., Zheng, F., Ou, W., You, X., & Peng, Q. (2022). Deep adaptively-enhanced hashing with discriminative similarity guidance for unsupervised cross-modal retrieval. IEEE Transactions on Circuits and Systems for Video Technology, 32(10), 7255-7268.
to demonstrate the effectiveness and contribution of each module in our proposed method, ablation experiments were performed on each module. To this end we designed a variant of five models to verify the effect of each module on the whole model. The comparative results of the ablation experiments are shown in table 5.
We studied the convergence and training efficiency of the proposed CAGAN on the three benchmark datasets. FIG. 5 shows the final loss function convergence curves at a 16-bit hash code length, together with the MAP variation curves as the number of iterations increases.
From the results in the figure, the following conclusions can be drawn. First, as the number of optimization iterations increases, the loss function gradually decreases, which shows that the optimization process improves the encoding capability of the hash function. Second, the method reduces training time consumption and improves training efficiency. Finally, the results show that the proposed network converges to the optimal point within a few tens of iterations.
While the basic principles, main features and advantages of the present invention have been shown and described, it will be understood by those skilled in the art that the present invention is not limited to the foregoing embodiments; the foregoing embodiments and the description merely illustrate the principles of the present invention, and various changes and modifications may be made without departing from the spirit and scope of the invention, which is defined by the appended claims and their equivalents.

Claims (5)

1. A multi-modal data retrieval model based on adaptive graph attention hashing, characterized in that the model comprises the following specific steps:
step S101: establishing a deep unsupervised cross-modal hashing model, and introducing an attention mechanism and a graph neural network;
step S102: constructing an unsupervised cross-modal hash retrieval framework using a CLIP-based adaptive graph attention network (CAGAN), and optimizing its objectives and loss functions;
step S103: collecting the data sets, carrying out comprehensive experiments and metric evaluation on the collected data, and implementing the experimental details;
step S104: performing comparison experiments, ablation experiments and hyper-parameter sensitivity analysis on the data sets;
step S105: performing convergence experiments and training efficiency analysis, and then performing cross-modal hash retrieval.
2. The multi-modal data retrieval model based on adaptive graph attention hashing of claim 1, wherein: in step S101, CLIP is used to extract cross-modal semantic features, and a transferable visual model is learned from natural language supervision; a multi-modal similarity enhancement module is used to fuse and enhance the similarity information of different modality data, which can effectively alleviate the inaccurate similarity measurement of multi-modal data; the attention mechanism addresses the problem of information redundancy by focusing on the information most critical to the current target among multiple inputs, and an attention-aware semantic fusion matrix is constructed based on the attention mechanism; an adaptive graph attention module is devised to solve this problem, which uses an attention mechanism to learn the semantic affinity graph and aggregates information between similar nodes through graph convolution, thereby making similar data produce more consistent hash codes.
3. The multi-modal data retrieval model based on adaptive graph attention hashing of claim 1, wherein: in step S102, the framework includes a depth feature encoding module, a multi-modal similarity enhancement module, an adaptive graph attention module, and a hash code reconstruction module; the depth encoding module contains two main networks, a visual encoding network and a text encoding network; a CLIP visual encoder and a multi-layer perceptron are used as the backbone network, which is able to fully extract the semantic information of the original data and learn cross-modal features.
4. The multi-modal data retrieval model based on adaptive graph attention hashing of claim 1, wherein: the adaptive graph attention module is capable of learning the graph-neighborhood correlations of different modalities and employs an attention mechanism to learn a modality-adaptive similarity matrix; the attention similarity matrix is then passed to a two-layer graph convolutional network that aggregates the graph-neighborhood correlations between similar nodes, so the similarity between different modality data can be learned using the attention mechanism; during training, the attention matrix is iteratively updated to maximize the similarity relations between instances, and the information of similar nodes is then aggregated through the graph convolutional network to generate more consistent hash codes, thereby improving image and text retrieval performance; an iterative approximate optimization strategy is used to optimize the hash codes, converting the discrete problem into a series of continuous optimization problems, which effectively alleviates the problems of information loss and instability in the hash code binarization process.
5. The multi-modal data retrieval model of claim 1, wherein: in step S102, the objective and loss-function optimization iteratively updates the parameters of the entire network through the back-propagation algorithm until the network converges, and the entire CAGAN model can be optimized using the SGD and Adam optimization algorithms.
CN202310380197.5A 2023-04-11 2023-04-11 Multi-mode data retrieval model based on self-adaptive graph attention hash Pending CN116796032A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310380197.5A CN116796032A (en) 2023-04-11 2023-04-11 Multi-mode data retrieval model based on self-adaptive graph attention hash

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310380197.5A CN116796032A (en) 2023-04-11 2023-04-11 Multi-mode data retrieval model based on self-adaptive graph attention hash

Publications (1)

Publication Number Publication Date
CN116796032A true CN116796032A (en) 2023-09-22

Family

ID=88046980

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310380197.5A Pending CN116796032A (en) 2023-04-11 2023-04-11 Multi-mode data retrieval model based on self-adaptive graph attention hash

Country Status (1)

Country Link
CN (1) CN116796032A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111914156A (en) * 2020-08-14 2020-11-10 中国科学院自动化研究所 Cross-modal retrieval method and system for self-adaptive label perception graph convolution network
CN112199532A (en) * 2020-09-01 2021-01-08 中国科学院信息工程研究所 Zero sample image retrieval method and device based on Hash coding and graph attention machine mechanism
CN115599942A (en) * 2022-11-08 2023-01-13 重庆师范大学(Cn) GCN-based deep unsupervised cross-modal retrieval method
CN115687571A (en) * 2022-10-28 2023-02-03 重庆师范大学 Depth unsupervised cross-modal retrieval method based on modal fusion reconstruction hash
CN115840827A (en) * 2022-11-07 2023-03-24 重庆师范大学 Deep unsupervised cross-modal Hash retrieval method

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111914156A (en) * 2020-08-14 2020-11-10 中国科学院自动化研究所 Cross-modal retrieval method and system for self-adaptive label perception graph convolution network
CN112199532A (en) * 2020-09-01 2021-01-08 中国科学院信息工程研究所 Zero sample image retrieval method and device based on Hash coding and graph attention machine mechanism
CN115687571A (en) * 2022-10-28 2023-02-03 重庆师范大学 Depth unsupervised cross-modal retrieval method based on modal fusion reconstruction hash
CN115840827A (en) * 2022-11-07 2023-03-24 重庆师范大学 Deep unsupervised cross-modal Hash retrieval method
CN115599942A (en) * 2022-11-08 2023-01-13 重庆师范大学(Cn) GCN-based deep unsupervised cross-modal retrieval method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
YEWEN LI et al.: "CLIP-Based Adaptive Graph Attention Network for Large-Scale Unsupervised Multi-Modal Hashing Retrieval", Sensors (Basel, Switzerland), vol. 23, no. 7, pages 3439 *

Similar Documents

Publication Publication Date Title
Nie et al. Deep multiscale fusion hashing for cross-modal retrieval
Shi et al. Deep adaptively-enhanced hashing with discriminative similarity guidance for unsupervised cross-modal retrieval
CN106033426B (en) Image retrieval method based on latent semantic minimum hash
CN116204706A (en) Multi-mode content retrieval method and system for text content and image analysis
Li et al. DAHP: Deep attention-guided hashing with pairwise labels
Zhang et al. Scalable discrete matrix factorization and semantic autoencoder for cross-media retrieval
Yang et al. Asymmetric cross–modal hashing with high–level semantic similarity
CN114281982B (en) Book propaganda abstract generation method and system adopting multi-mode fusion technology
Su et al. Semi-supervised knowledge distillation for cross-modal hashing
Tu et al. Unsupervised cross-modal hashing via semantic text mining
Liu et al. Deep cross-modal hashing based on semantic consistent ranking
Xu et al. Idhashgan: deep hashing with generative adversarial nets for incomplete data retrieval
Wang et al. Cross-modal image–text search via efficient discrete class alignment hashing
Zou et al. Transductive zero-shot hashing for multilabel image retrieval
Duan et al. A web knowledge-driven multimodal retrieval method in computational social systems: Unsupervised and robust graph convolutional hashing
Yu et al. Hadamard matrix-guided multi-modal hashing for multi-modal retrieval
CN116594994B (en) Application method of visual language knowledge distillation in cross-modal hash retrieval
Zhang et al. Graph convolution based efficient re-ranking for visual retrieval
Li et al. Cross-Model Hashing Retrieval Based on Deep Residual Network.
Sun et al. Learning from expert: Vision-language knowledge distillation for unsupervised cross-modal hashing retrieval
CN112035689A (en) Zero sample image hash retrieval method based on vision-to-semantic network
Mingyong et al. CLIP-based fusion-modal reconstructing hashing for large-scale unsupervised cross-modal retrieval
CN116796032A (en) Multi-mode data retrieval model based on self-adaptive graph attention hash
Xie et al. Multi-similarity reconstructing and clustering-based contrastive hashing for cross-modal retrieval
CN114637846A (en) Video data processing method, video data processing device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination