CN116894120A

CN116894120A - Unsupervised cross-modal hash retrieval method based on dynamic multi-expert knowledge distillation

Info

Publication number: CN116894120A
Application number: CN202310579789.XA
Authority: CN
Inventors: 李明勇; 李业文; 张捷; 吴宏浩
Original assignee: Chongqing Normal University
Current assignee: Chongqing Normal University
Priority date: 2023-05-20
Filing date: 2023-05-20
Publication date: 2023-10-17

Abstract

The invention discloses an unsupervised cross-modal hash retrieval method based on dynamic multi-expert knowledge distillation, which relates to the technical field of retrieval methods and comprises the following parts: a student hash coding module (SHE), which encodes multi-modal data into feature vectors through a deep neural network and then maps the feature vectors into hash codes through a fully connected layer; a dynamic multi-expert selection module (DMES), which adopts several visual language pre-training (VLP) models as expert models and uses a student-adaptive dynamic multi-expert selection strategy to allocate the best-suited expert model to each batch of data; a graph convolutional hash coding module (GCHE); and a multi-level knowledge distillation module (MLKD), which provides a multi-level knowledge distillation framework, introduces a GNN to handle graph-based knowledge distillation, and transfers the topological semantics of the teacher network to the student network as topology-aware knowledge. Extensive experiments on three multi-modal retrieval benchmark datasets verify the superiority of the proposed method.

Description

Unsupervised cross-modal hash retrieval method based on dynamic multi-expert knowledge distillation

Technical Field

The invention relates to the technical field of retrieval methods, in particular to an unsupervised cross-modal hash retrieval method based on dynamic multi-expert knowledge distillation.

Background

With the rise of social networking platforms, a large amount of unstructured data is stored on the internet in different modalities (e.g., images, short videos, blogs, comments, and voice recordings). The proliferation of multi-modal data has changed how data is acquired, and the need for efficient multi-modal retrieval has become urgent. Among existing retrieval methods for unstructured data, unsupervised cross-modal hashing (UCMH) has attracted wide attention because of its low storage cost, fast retrieval, and independence from labels. However, limited retrieval accuracy remains a bottleneck for applying UCMH methods in production.

In a person's learning process, knowledge is acquired not only from existing books (inherent knowledge) but also from different teachers or experts, who pass on additional practical knowledge gained from experience. In such a process the learner is not confined to inherent knowledge, and what has been learned is checked against the practical knowledge of the teacher or expert, which yields strong generalization ability. Likewise, in machine learning a model needs to learn not only from inherent information (e.g., data labels and hand-crafted features) but also from the prior knowledge, rich in semantic information, of a teacher or expert model.

Common unsupervised methods represent only one-to-one relationships within a group of instances and ignore other neighborhood relationships, so their similarity measurement is inaccurate and their retrieval precision is limited. In addition, existing knowledge distillation methods have not explored multi-teacher knowledge distillation, so knowledge is distilled insufficiently.

Disclosure of Invention

The invention provides a dynamic multi-expert knowledge distillation (DMKD) method for unsupervised cross-modal hash retrieval. First, existing visual language pre-training models are used as multiple expert models to extract fine-grained semantic features of the multi-modal data. Second, a multi-expert selection mechanism is designed to dynamically assign expert models to different training samples, which optimizes the performance of the student model. Third, a multi-level knowledge distillation framework (MLKD) is developed, which comprises an auxiliary graph convolutional network and a multi-level (feature-level, relation-level, and response-level) knowledge distillation module; the MLKD framework performs neighborhood aggregation on the features extracted by the expert models and realizes multi-level knowledge distillation. The effectiveness of the proposed method is verified extensively on three multi-modal retrieval benchmark datasets.

The invention provides an unsupervised cross-modal hash retrieval method based on dynamic multi-expert knowledge distillation, which comprises four parts in sequence:

a student hash coding module (SHE), which encodes the multi-modal data into feature vectors through a deep neural network and then maps the feature vectors into hash codes through a fully connected layer;

a dynamic multi-expert selection module (DMES), which adopts several visual language pre-training (VLP) models as expert models and uses a student-adaptive dynamic multi-expert selection strategy to allocate the best-suited expert model to each batch of data;

a graph convolutional hash coding module (GCHE);

and a multi-level knowledge distillation module (MLKD), which provides a multi-level knowledge distillation framework, introduces a GNN to handle graph-based knowledge distillation, and transfers the topological semantics of the teacher network to the student network as topology-aware knowledge.

Preferably, the multi-level knowledge distillation Module (MLKD) comprises three levels of knowledge distillation, a response level, a feature level, and a relationship level, respectively.

Preferably, in the student hash coding module (SHE), the image and text data are encoded into middle-layer vectors H_I and H_T. The visual encoder is denoted Enc_I and the text feature encoder is denoted Enc_T. The formula is as follows:

H_I = Enc_I(I, θ_I), H_T = Enc_T(T, θ_T)

where I and T represent a mini-batch of training image-text pairs, θ_I and θ_T represent the parameters of the encoders of the two modalities, and m represents the batch size; H_I and H_T are then mapped to binary vectors through the fully connected layer.

Preferably, in the student hash coding module (SHE), after the image and text data are encoded, the hash codes B_I and B_T are generated by iterative quantization. The formula is expressed as follows:

where α represents the training epoch and Enc_*(·, θ_*), * ∈ {HI, HT}, represents the hash encoders of the image and text modalities. The hash codes B_I and B_T of the image and text modalities are used to construct the cosine self-similarity matrices S_BI and S_BT of the respective modalities.

Preferably, in the dynamic multi-expert selection module (DMES), several visual language pre-training (VLP) models are used as the expert models, and the multi-modal data are input into the VL encoders to obtain the corresponding features. The equation is expressed as follows:

where VLEnc_*(·, θ_*), * ∈ {I, T}, represents the VL transformer (expert model), k denotes the k-th selected expert model, and d_I and d_T represent the dimensions of the image and text feature vectors, respectively;

subsequently, the image and text features of each expert are used to construct the corresponding similarity matrix, which captures the fine-grained feature similarity of that expert model.

Preferably, after the corresponding similarity matrices are constructed, the feature similarity of each expert model is compared with the similarity of the student's middle-layer features, S_H = cos(H_I, H_T) ∈ [-1,+1]^{m×m}, to select the expert that best matches what the student model has learned. The formula is as follows:

where argmin(·) returns the index of the minimum value and k is the index of the expert model selected for the corresponding batch of training samples.

Preferably, in the graph convolutional hash coding module (GCHE), after the expert model features of the image and text modalities are obtained, an expert feature similarity matrix S_E is first constructed. The formula is expressed as follows:

s.t. 0 ≤ β, γ, δ ≤ 1, β + γ + δ = 1,

where β, γ, and δ are hyper-parameters that balance the similarities of the different modalities;

at the same time, the selected expert features and the similarity matrix are fed into the corresponding graph convolutional neural network, which aggregates the information of similar features to produce higher-quality hash codes;

the two-layer graph convolutional hash encoding process is described as follows:

where W^(1) and W^(2) are learnable parameter matrices, σ represents the activation function of the middle layer of the graph neural network, H_GI and H_GT represent the middle-layer features of the GCN, and α represents the training epoch; the quantization converts the discrete optimization of the hash codes into a series of continuous quantization problems, and because the neighborhood modeling capability of the graph neural network aggregates similar data features, the generated hash codes B_GI and B_GT naturally preserve the similarity of the original features.

Preferably, in the multi-level knowledge distillation module (MLKD), hash-level knowledge distillation compares the hash codes B_I, B_T, B_GI, and B_GT generated by the SHE component and the GCHE component;

relation-based knowledge distillation computes a loss between the hash similarities S_BI, S_BT of the student network and the hash similarities S_GI, S_GT of the graph convolutional network (GCN), where S_BI = cos(B_I, B_I), S_BT = cos(B_T, B_T) ∈ [-1,+1]^{m×m}, and the graph convolutional hash similarity matrices S_GI = cos(B_GI, B_GI), S_GT = cos(B_GT, B_GT) ∈ [-1,+1]^{m×m} are obtained in the same way;

for feature-level knowledge distillation, the middle-layer features H_I and H_T of the student network and the middle-layer features H_GI and H_GT of the graph convolutional network are used to construct a mean squared error loss;

finally, the intra-modal similarity matrices S_BI, S_BT and the inter-modal similarity matrix are made to approximate the expert feature similarity matrix S_E. These loss functions are expressed as follows:

where L_Hash represents the hash-code-level KD loss, which makes the hash codes generated by the GCN consistent with the hash codes of the student; L_Intra represents the intra-modal similarity reconstruction loss, which compares the self-similarity matrices S_BI, S_BT of the intra-modal hash codes with the expert matrix S_E so as to distill fine-grained similarity into the student network; μ is a scaling hyper-parameter that adjusts the quantization range of the hash codes; L_Cross represents the cross-modal similarity loss, which facilitates the fusion of hash codes of different modalities and preserves similarity, and in which the inner product of vectors is used; L_Relation represents the relation-based knowledge distillation loss, which distills the fine-grained similarity of the expert features into the student network; and finally, the feature-level knowledge distillation loss L_Feature distills the middle-layer features of the student and expert encoders.

Preferably, the dynamic multi-expert knowledge distillation (DMKD) method employed in the multi-level knowledge distillation module (MLKD) is implemented in the Python programming language with the PyTorch deep learning framework on a machine with an NVIDIA RTX 3090 GPU and 32 GB of memory.

Compared with the related art, the invention has the following beneficial effects:

Inspired by visual language pre-training, the invention provides an effective unsupervised cross-modal hash retrieval method, DMKD. Several multi-modal models are adopted as expert models, and a multi-expert selection strategy, DMES, is developed to dynamically assign expert models to different batches of training samples, so that the student model acquires an understanding of multi-modal knowledge from differentiated learning. A multi-level knowledge distillation module, MLKD, is provided, which comprises three levels of knowledge distillation (response-based, feature-based, and relation-based). By integrating these levels, the module distills the fine-grained multi-modal information of the expert models into the student model more comprehensively and effectively. Extensive experiments on three multi-modal retrieval benchmark datasets show that the proposed DMKD method optimizes hash functions more effectively than other unsupervised methods, which verifies the superiority of the proposed method.

Drawings

FIG. 1 is a table of the splits of the three multi-modal retrieval benchmark datasets;

FIG. 2 is a MAP@5000 comparison across the two retrieval tasks and the three benchmark datasets;

FIG. 3 is a comparison of methods based on the teacher-student paradigm on the MIRFLICKR-25K and NUS-WIDE datasets;

FIG. 4 shows the Top-k precision curves on the MS COCO benchmark dataset for the I→T and T→I tasks;

FIG. 5 shows the Top-k precision curves on the MIRFLICKR-25K benchmark dataset for the I→T and T→I tasks;

FIG. 6 shows the Top-k precision curves on the NUS-WIDE benchmark dataset for the I→T and T→I tasks;

FIG. 7 is a comparison of methods based on the teacher-student paradigm on the MIRFLICKR-25K and NUS-WIDE datasets;

FIG. 8 shows the results of the 128-bit parameter sensitivity analysis for the hyper-parameters ε, τ, η, λ;

FIG. 9 is an algorithm flow chart of the DMKD method employed in the present invention.

Detailed Description

The following description of the embodiments of the present invention will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are only some, but not all embodiments of the invention; all other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

1 Related work, including deep cross-modal hashing, visual language pre-training, and knowledge distillation

1.1 Deep cross-modal hashing:

Deep learning has achieved satisfactory results in many areas. The deep features extracted by deep learning methods contain richer semantic information and represent the original data more expressively. Combining deep learning with hashing can therefore remarkably improve the efficiency of multi-modal retrieval.

Based on this idea, many novel methods have been proposed in recent years. This section describes some representative recent deep cross-modal hashing methods.

Cross-modal hashing methods are broadly divided into two categories depending on whether labels are used: supervised and unsupervised. Supervised hashing methods use semantic labels to bridge the modality gap and align data of different modalities semantically, so that label similarity is preserved when the data are mapped to hash codes. The most classical deep cross-modal hashing work is DCMH, proposed by Jiang et al., which combines cross-modal hashing with deep learning for the first time and unifies feature learning and hash code learning in one end-to-end framework. IRGR provides a reasoning method based on a multi-instance relation graph, which makes full use of the fine-grained relations among instances to construct a similarity matrix and to build global and local instance relation graphs. However, these supervised methods need labels for training, and acquiring labels incurs significant labor costs; the labels also tend to be noisy.

Unsupervised hashing methods therefore have greater research value and application prospects because they do not depend on labels. One of the most representative works in unsupervised cross-modal hashing is Deep Joint-Semantics Reconstructing Hashing (DJSRH), which designs a joint semantic affinity matrix to unify the similarity relations of data from different modalities and reconstructs the feature similarity matrix from the hash matrix. Many novel methods are derived from this approach. For example, Joint-modal Distribution-based Similarity Hashing (JDSH) proposes a weighting scheme that generates more discriminative hash codes by pulling semantically similar instance pairs together and pushing semantically dissimilar instance pairs apart. Deep Adaptively-Enhanced Hashing (DAEH) proposes a strategy with discriminative similarity guidance and adaptive enhancement optimization, and uses an additional teacher network for enhancement. However, these unsupervised methods represent only one-to-one relationships within a set of instances and ignore other neighborhood relationships, so their similarity measurement is inaccurate and their retrieval accuracy is limited. Inspired by visual language pre-training, the invention explores knowledge distillation from multiple visual language pre-training models to improve retrieval performance.

1.2 Multi-modal representation learning:

Multi-modal representation learning is currently widely studied. This line of work aims to train large multi-modal pre-training models that can be adapted to a variety of downstream tasks (including image-text matching and retrieval, object detection, etc.). According to how the data of different modalities interact, existing multi-modal representation learning models can be roughly divided into two types: single-stream models and dual-stream models. A dual-stream model uses two independent encoders to learn high-level representations of vision and language, and aligns vision and language semantically through various modal interactions. A single-stream model cuts the visual input into patches or obtains visual regions through an object detector, encodes the text words into tokens, and then concatenates the visual regions and text tokens and feeds them into a unified encoder, where regions and words are aligned semantically. Among the classical single-stream models, UNITER powers heterogeneous downstream visual language tasks with a joint multi-modal embedding and uses conditional masking in its pre-training tasks.

Among the classical dual-stream models, CLIP uses an unsupervised contrastive learning method to pre-train on 400 million image-text pairs, learns a transferable visual model from natural language supervision, and achieves zero-shot transfer to downstream tasks. SimVLM reduces training complexity by exploiting large-scale weak supervision and is trained end to end with a single prefix language modeling objective.

Inspired by this work on visual language pre-training, and to the best of the inventors' knowledge, the invention is the first to explore the performance of dynamic multi-expert knowledge distillation on the unsupervised cross-modal hash retrieval task and to analyze it.

1.3 Knowledge Distillation (KD):

Knowledge distillation is a model compression method first proposed by Hinton et al. Specifically, knowledge distillation generally adopts a teacher-student learning paradigm: the feature representations learned by a complex teacher network with strong learning ability are distilled and transferred to a student network with few parameters and weaker learning ability.

Distillation lets the student learn the soft knowledge in the teacher model, which effectively improves the capability of the student model. Several retrieval methods based on knowledge distillation have been studied. For example, JOG proposes an effective joint-teaching unsupervised learning framework that pursues high-performance yet lightweight cross-modal retrieval. Its core idea is to use cross-task teachers to transfer knowledge that guides the student's learning.

The teacher model of KDCMH adopts a distribution-based unsupervised similarity hashing method. Specifically, the method propagates knowledge by optimizing teacher and student, and the teacher model adopts a distribution-based similarity weighting strategy, so that a more effective similarity matrix can be constructed. Although these approaches achieve considerable performance, they have some limitations. Existing knowledge distillation methods have not explored multi-teacher knowledge distillation, and they distill knowledge insufficiently. The invention therefore designs a distillation framework with three levels of knowledge, so that the knowledge of the teacher model is distilled into the student network more fully and the performance of unsupervised cross-modal hash retrieval is improved.

2 Method

2.1 Problem and notation

The invention considers a multi-modal image-text dataset in which I_i and T_i denote paired image-text data. The dataset is randomly split into batches of training samples o = {o_1, o_2, ..., o_j}, and each batch contains m image-text pairs, where m represents the batch size. The image and text features encoded by the VL expert models are denoted separately for the two modalities. In addition, the hash codes generated by the student hash encoder are denoted B_I ∈ {-1,+1}^{m×c} and B_T ∈ {-1,+1}^{m×c}, and the hash codes generated by the graph convolutional network are denoted B_GI ∈ {-1,+1}^{m×c} and B_GT ∈ {-1,+1}^{m×c}, where c represents the hash code length.

The invention uses the hash codes B_I, B_T, B_GI, and B_GT to construct the corresponding self-similarity matrices. The similarity matrices of the student hash codes are computed with the cosine similarity function: S_BI = cos(B_I, B_I) ∈ [-1,+1]^{m×m} and S_BT = cos(B_T, B_T) ∈ [-1,+1]^{m×m}. Similarly, the hash codes generated by the graph convolutional hash coding module are used to construct the similarity matrices S_GI = cos(B_GI, B_GI) ∈ [-1,+1]^{m×m} and S_GT = cos(B_GT, B_GT) ∈ [-1,+1]^{m×m}.
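The self-similarity matrices defined above can be illustrated with a minimal NumPy sketch. The function name cosine_similarity_matrix and the toy batch are hypothetical and chosen only for illustration; NumPy stands in here for the PyTorch tensors mentioned in the experimental setup.

```python
import numpy as np

def cosine_similarity_matrix(a, b):
    """Pairwise cosine similarity between the rows of a and the rows of b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    a_unit = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_unit = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a_unit @ b_unit.T

# Toy batch (assumption): m = 3 hash codes of length c = 4 with entries in {-1, +1}.
B_I = np.array([[1, 1, -1, -1],
                [1, 1, -1, 1],
                [-1, -1, 1, 1]])

# S_BI = cos(B_I, B_I) is an m x m matrix with entries in [-1, +1].
S_BI = cosine_similarity_matrix(B_I, B_I)
```

For codes with entries in {-1, +1}, every row has norm sqrt(c), so the cosine similarity reduces to the inner product divided by c.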

Unsupervised cross-modal hashing aims to enable fast queries by projecting data into a unified binary space using the information in paired samples of different modalities, while preserving the intrinsic semantic similarity of the data in the mapping.
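To make the fast-query property concrete, the sketch below ranks database codes by Hamming distance to a query code. It relies on the identity that, for two codes with entries in {-1, +1} and length c, the inner product equals c minus twice the Hamming distance. The function names and the toy codes are hypothetical; this is an illustration of retrieval in a shared binary space, not the procedure claimed by the invention.

```python
import numpy as np

def hamming_distances(query, database):
    """Hamming distance between one {-1,+1} code and every row of database."""
    query = np.asarray(query)
    database = np.asarray(database)
    c = query.shape[0]
    # For +/-1 codes: <q, b> = c - 2 * hamming(q, b).
    return (c - database @ query) // 2

def retrieve(query, database, top_k):
    """Indices of the top_k database codes closest to the query."""
    dist = hamming_distances(query, database)
    return np.argsort(dist, kind="stable")[:top_k]

# Toy image-to-text query (assumption): three text codes, one image code.
text_codes = np.array([[1, -1, 1, 1],
                       [-1, -1, 1, 1],
                       [-1, 1, -1, -1]])
image_query = np.array([1, -1, 1, 1])
ranking = retrieve(image_query, text_codes, top_k=2)
```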

2.2 model overview

As shown in fig. 2, the DMKD proposed by the present invention is an end-to-end learning framework that integrates four components: student hash coding, dynamic multi-expert selection, graph convolution hash coding and multi-level knowledge distillation module. The method comprises the following steps:

Student hash coding (SHE): to obtain a feature representation with deep semantic information, the proposed SHE module encodes the image and text data into middle-layer vectors H_I and H_T. The invention denotes the visual encoder as Enc_I and the text feature encoder as Enc_T. The formula is as follows:

H_I = Enc_I(I, θ_I), H_T = Enc_T(T, θ_T)

where I and T represent a mini-batch of training image-text pairs, θ_I and θ_T represent the parameters of the encoders of the two modalities, and m represents the batch size. H_I and H_T are mapped into binary vectors by fully connected layers, followed by iterative quantization to generate the hash codes B_I and B_T. The formula is expressed as follows:

where α represents the training epoch and Enc_*(·, θ_*), * ∈ {HI, HT}, represents the hash encoders of the image and text modalities. The iterative quantization strategy reduces the accuracy loss caused by binarizing the hash codes. Finally, the hash codes B_I and B_T of the image and text modalities are used to construct the cosine self-similarity matrices S_BI and S_BT of the respective modalities.
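The quantization formula is not reproduced in this text, so the following sketch rests on an assumption: that the relaxation has the common form tanh(α·h), where α grows with the training epoch so that the continuous output approaches sign(h). The names relaxed_hash and binarize are hypothetical.

```python
import numpy as np

def relaxed_hash(h, alpha):
    """Continuous surrogate for sign(h); it sharpens as alpha grows (assumed form)."""
    return np.tanh(alpha * np.asarray(h, dtype=float))

def binarize(h):
    """Final {-1,+1} codes used at retrieval time (zeros map to +1)."""
    return np.where(np.asarray(h) >= 0, 1, -1)

# Toy pre-activation values for one sample (assumption).
h = np.array([[0.3, -0.8, 0.05]])

# Largest gap between the relaxed and the binary code at increasing alpha.
gaps = [float(np.abs(relaxed_hash(h, alpha) - binarize(h)).max())
        for alpha in (1, 10, 100)]
```

Under this assumption the gap between the relaxed code and the binary code shrinks as α increases, which is one way to read the statement that iterative quantization reduces the accuracy loss of binarization.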

Dynamic Multiple Expert Selection (DMES): in experiments, it was found that a powerful expert model does not necessarily train a better student model, which is limited by the differences in training samples and the representation capabilities of the student network. Therefore, the invention designs a dynamic multi-expert selection mechanism to select proper expert (teacher) models for different batches of training samples. First, a plurality of visual language pre-training (VLP) models are used as the multi-expert model, and multimodal data is input into the VL encoder to obtain corresponding features. The equation is expressed as follows:

where VLEnc_*(·, θ_*), * ∈ {I, T}, represents the VL transformer (expert model) and k denotes the k-th selected expert model. d_I and d_T represent the dimensions of the image and text feature vectors, respectively.

Subsequently, the invention uses the image and text features of each expert to construct the corresponding similarity matrix. Notably, this matrix captures the fine-grained feature similarity of the expert model. Finally, to provide a basis for selection, the invention compares the feature similarity of each expert model with the similarity of the student's middle-layer features, S_H = cos(H_I, H_T) ∈ [-1,+1]^{m×m}, to select the expert that best matches what the student model has learned. The formula is as follows:
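The selection step can be sketched as follows. The distance used to compare an expert's similarity matrix with S_H is not reproduced in this text, so the Frobenius norm below is an assumption, as are the names select_expert and cosine_similarity_matrix and the random toy features.

```python
import numpy as np

def cosine_similarity_matrix(a, b):
    """Pairwise cosine similarity between the rows of a and the rows of b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

def select_expert(expert_features, h_img, h_txt):
    """Return the index k of the expert whose similarity matrix is closest to S_H."""
    s_h = cosine_similarity_matrix(h_img, h_txt)
    gaps = []
    for f_img, f_txt in expert_features:
        s_k = cosine_similarity_matrix(f_img, f_txt)
        # Frobenius norm of the difference (assumed distance measure).
        gaps.append(float(np.linalg.norm(s_k - s_h)))
    return int(np.argmin(gaps)), gaps

rng = np.random.default_rng(0)
h_img, h_txt = rng.normal(size=(4, 8)), rng.normal(size=(4, 8))
experts = [
    (rng.normal(size=(4, 16)), rng.normal(size=(4, 16))),  # unrelated to the student
    (2.0 * h_img, 3.0 * h_txt),                            # same geometry as the student
]
k, gaps = select_expert(experts, h_img, h_txt)
```

Because cosine similarity ignores scale, the second toy expert reproduces S_H exactly and is the one selected for this batch.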

Graph convolutional hash coding (GCHE): to further mine the fine-grained knowledge of the expert models, the invention designs the GCHE component to capture more structured semantics. Specifically, after the expert model features of the image and text modalities are obtained, an expert feature similarity matrix S_E is first constructed. The formula is expressed as follows:

where β, γ, and δ are hyper-parameters that balance the similarities of the different modalities.
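The formula for S_E is not reproduced in this text. The sketch below assumes that S_E is a convex combination of the image-image, text-text, and image-text cosine similarities of the selected expert's features; only the constraint on β, γ, δ (each in [0, 1], summing to 1) is taken from the document. The function names and the weight values are hypothetical.

```python
import numpy as np

def cosine_similarity_matrix(a, b):
    """Pairwise cosine similarity between the rows of a and the rows of b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

def expert_similarity(f_img, f_txt, beta, gamma, delta):
    """Fuse three similarity matrices with weights beta, gamma, delta (assumed terms)."""
    weights = np.array([beta, gamma, delta], dtype=float)
    if np.any(weights < 0) or np.any(weights > 1) or not np.isclose(weights.sum(), 1.0):
        raise ValueError("beta, gamma, delta must lie in [0, 1] and sum to 1")
    s_img = cosine_similarity_matrix(f_img, f_img)    # image-image
    s_txt = cosine_similarity_matrix(f_txt, f_txt)    # text-text
    s_cross = cosine_similarity_matrix(f_img, f_txt)  # image-text
    return beta * s_img + gamma * s_txt + delta * s_cross

rng = np.random.default_rng(0)
f_img, f_txt = rng.normal(size=(4, 8)), rng.normal(size=(4, 8))
S_E = expert_similarity(f_img, f_txt, beta=0.4, gamma=0.3, delta=0.3)
```

With weights that sum to 1, every entry of S_E stays in [-1, +1], the same range as the hash-code similarity matrices it is later compared with.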

At the same time, to obtain hash codes that carry richer semantic information, the selected expert features and the similarity matrix are fed into the corresponding graph convolutional neural network. Aggregating the information of similar features in this process yields higher-quality hash codes. The two-layer graph convolutional hash encoding process is described as follows:

where W^(1) and W^(2) are learnable parameter matrices and σ represents the activation function of the middle layer of the graph neural network. H_GI and H_GT represent the middle-layer features of the GCN. Furthermore, as in equation 2, α represents the training epoch, and the quantization converts the discrete optimization of the hash codes into a series of continuous quantization problems. Finally, because the neighborhood modeling capability of the graph neural network aggregates similar data features, the generated hash codes B_GI and B_GT naturally preserve the similarity of the original features. In particular, owing to the matrix S_E, similar data produce more closely related hash codes and dissimilar data produce more discriminative hash codes.
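A two-layer graph convolutional hash encoder of the kind described above might look like the following sketch. Several details are assumptions because the equation is not reproduced in this text: negative similarities are clipped before the adjacency is normalized symmetrically, ReLU is used as σ, and tanh(α·x) is used as the relaxed quantizer. All names and dimensions are hypothetical.

```python
import numpy as np

def normalize_adjacency(s):
    """Symmetrically normalized adjacency built from a similarity matrix."""
    a = np.clip(s, 0.0, None)   # keep only positive similarities (assumption)
    np.fill_diagonal(a, 1.0)    # self-loops
    d_inv_sqrt = 1.0 / np.sqrt(a.sum(axis=1))
    return a * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def gcn_hash_encoder(features, s_e, w1, w2, alpha):
    """Two graph-convolution layers: features -> middle layer -> relaxed hash codes."""
    a_hat = normalize_adjacency(s_e)
    h_g = np.maximum(a_hat @ features @ w1, 0.0)   # middle-layer features, ReLU as sigma
    b_g = np.tanh(alpha * (a_hat @ h_g @ w2))      # relaxed codes in (-1, +1)
    return h_g, b_g

# Toy sizes (assumption): m = 4 samples, d_f = 6 inputs, 5 hidden units, c = 3 bits.
rng = np.random.default_rng(0)
m, d_f, hidden, c = 4, 6, 5, 3
features = rng.normal(size=(m, d_f))
unit = features / np.linalg.norm(features, axis=1, keepdims=True)
s_e = unit @ unit.T
w1 = 0.1 * rng.normal(size=(d_f, hidden))
w2 = 0.1 * rng.normal(size=(hidden, c))
h_g, b_g = gcn_hash_encoder(features, s_e, w1, w2, alpha=1.0)
```

Each output row is a weighted average over the neighbors that the similarity matrix marks as close, which is the neighborhood aggregation the paragraph above refers to.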

Multi-level knowledge distillation (MLKD): to distill multi-modal similarity information into the student network more fully, the proposed MLKD component contains multiple levels of knowledge distillation. Specifically, hash-level knowledge distillation compares the hash codes B_I, B_T, B_GI, and B_GT generated by the SHE component and the GCHE component. Relation-based knowledge distillation computes a loss between the hash similarities S_BI, S_BT of the student network and the hash similarities S_GI, S_GT of the graph convolutional network (GCN), where S_BI = cos(B_I, B_I), S_BT = cos(B_T, B_T) ∈ [-1,+1]^{m×m}. The graph convolutional hash similarity matrices S_GI = cos(B_GI, B_GI), S_GT = cos(B_GT, B_GT) ∈ [-1,+1]^{m×m} are obtained in the same way. For feature-level knowledge distillation, the middle-layer features H_I and H_T of the student network and the middle-layer features H_GI and H_GT of the graph convolutional network are used to construct a mean squared error loss. Finally, the intra-modal similarity matrices S_BI, S_BT and the inter-modal similarity matrix are made to approximate the expert feature similarity matrix S_E. These loss functions are expressed as follows:

where L_Hash represents the hash-level KD loss, which makes the hash codes generated by the GCN consistent with the hash codes of the student. L_Intra represents the intra-modal similarity reconstruction loss. It compares the self-similarity matrices S_BI, S_BT of the intra-modal hash codes with the expert matrix S_E to distill fine-grained similarity into the student network. μ is a scaling hyper-parameter that adjusts the quantization range of the hash codes. L_Cross represents the cross-modal similarity loss, which facilitates the fusion of hash codes of different modalities and preserves similarity, and in which the inner product of vectors is used. L_Relation represents the relation-based knowledge distillation loss, which distills the fine-grained similarity of the expert features into the student network. Finally, the feature-level knowledge distillation loss L_Feature distills the middle-layer features of the student and expert encoders, which ensures smooth information transfer.
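The loss formulas are not reproduced in this text, so the sketch below only illustrates the structure of multi-level distillation for a single modality. It assumes a mean squared error for every term (the document states this only for the feature level), an intra-modal term of the form MSE(μ·S_E, S_B), and a plain weighted sum to combine the terms. The function names, dictionary keys, and toy values are hypothetical.

```python
import numpy as np

def mse(x, y):
    """Mean squared error between two arrays of the same shape."""
    return float(np.mean((np.asarray(x, dtype=float) - np.asarray(y, dtype=float)) ** 2))

def cos_matrix(a, b):
    """Pairwise cosine similarity between the rows of a and the rows of b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

def mlkd_losses(b_s, h_s, b_g, h_g, s_e, mu):
    """Distillation terms for one modality (student: b_s, h_s; GCN: b_g, h_g)."""
    s_b = cos_matrix(b_s, b_s)
    s_g = cos_matrix(b_g, b_g)
    return {
        "hash": mse(b_s, b_g),        # hash level: student codes vs GCN codes
        "relation": mse(s_b, s_g),    # relation level: code similarity structures
        "feature": mse(h_s, h_g),     # feature level: middle-layer features
        "intra": mse(mu * s_e, s_b),  # student similarity vs scaled expert similarity
    }

def total_loss(losses, weights):
    """Weighted sum of the individual terms (hypothetical trade-off weights)."""
    return sum(weights[name] * value for name, value in losses.items())

b = np.array([[1, -1, 1], [-1, -1, 1]])
h = np.array([[0.2, 0.4], [0.1, -0.3]])
s_e = cos_matrix(b, b)
weights = {"hash": 1.0, "relation": 1.0, "feature": 1.0, "intra": 1.0}
matched = mlkd_losses(b, h, b, h, s_e, mu=1.0)
mismatched = mlkd_losses(b, h, -b, h + 1.0, s_e, mu=1.0)
```

In the mismatched case the relation term is still zero, because flipping every bit of a code set leaves its self-similarity matrix unchanged; this is one reason a hash-level term is useful alongside the relation-level term.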

2.3 Overall cost function

The parameters of the whole model are learned iteratively with the SGD optimizer, and training ends when the retrieval precision of the model no longer improves. The overall cost function is formulated as follows:

s.t. B_I, B_T ∈ {-1,+1}^{m×c},

where ε, τ, η, λ are trade-off hyper-parameters. Notably, the multi-expert network used in the proposed DMKD framework can be replaced by other multi-modal representation learning models.

The student hash network is optimized continuously by minimizing the cost function; the detailed training and optimization procedure of the proposed DMKD method is set forth in FIG. 9.

3 Experiments

3.1 Datasets

MS COCO: MS COCO (Microsoft Common Objects in Context) is a large-scale dataset used for object detection, segmentation, image captioning, and similar tasks. It contains 123,287 images with corresponding text descriptions, and each image-text pair carries a multi-label annotation over 91 classes, which provides more context information.

MIRFLICKR-25K: This is a multi-label dataset for multimedia tasks that collects 25,000 photos from the Flickr website, covering 24 different categories, together with their associated text and tags. To represent the text content, it also provides a 1386-dimensional feature vector obtained by principal component analysis of the text.

NUS-WIDE: This is a multi-label dataset containing 269,648 image-text pairs collected from real scenes, with their corresponding labels. In the experiments of the invention, following the settings of previous related work, the 10 most widely used categories and the associated 186,577 image-text pairs are selected, and each text is represented by a 1000-dimensional bag-of-words (BOW) feature vector. The splits of these three retrieval benchmark datasets are shown in the table of FIG. 1 of the specification.

3.2 Experimental setup

In the present invention, the proposed DMKD method is implemented in the Python programming language with the PyTorch deep learning framework on a machine with an NVIDIA RTX 3090 GPU and 32 GB of memory. The hyper-parameters of the invention are set as follows: In addition, the stochastic gradient descent algorithm is adopted to optimize the parameters of the network, with the learning rate set to 0.01, the weight decay set to 5e-4, and the momentum set to 0.9.
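To show what the stated optimizer settings mean numerically, the following NumPy sketch performs one update of stochastic gradient descent with momentum and weight decay, using the common formulation in which the weight decay term is added to the gradient before the momentum buffer is updated. The function name sgd_step and the toy values are hypothetical; only the learning rate, weight decay, and momentum values come from the text.

```python
import numpy as np

def sgd_step(param, grad, velocity, lr=0.01, momentum=0.9, weight_decay=5e-4):
    """One SGD update with L2 weight decay and a momentum buffer."""
    g = grad + weight_decay * param        # weight decay enters as an extra gradient term
    velocity = momentum * velocity + g     # momentum buffer accumulates past gradients
    return param - lr * velocity, velocity

# Toy parameter and gradient (assumption).
param = np.array([1.0])
grad = np.array([0.5])
velocity = np.zeros(1)

param_1, velocity_1 = sgd_step(param, grad, velocity)
param_2, velocity_2 = sgd_step(param_1, grad, velocity_1)
```

With a constant gradient the second step is larger than the first, because the momentum buffer keeps 90% of the previous update direction.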

To keep the experimental setup consistent with previous advanced methods, VGG-16 is used as the image feature extractor, and the text encoder uses a fully connected layer as its backbone network. For the GCN hash encoding module, a two-layer graph convolutional network (d_f → 4096 → c) is used to generate the hash code, where d_f denotes the input dimension and c denotes the hash code length.

For the multi-expert model, the present invention uses different variants of a pre-trained visual language model (CLIP-RN101, CLIP-ViT-B/16 and CLIP-ViT-B/32) as the experts. Notably, the expert models may be replaced with other multimodal representation learning models.
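The two-layer graph convolutional hash encoder (d_f → 4096 → c) described above can be sketched as follows. This is an illustrative NumPy sketch under stated assumptions, not the patented implementation: the symmetric normalization of the similarity graph, the ReLU middle-layer activation, the tanh relaxation followed by sign binarization, and all function and variable names are assumptions, and the hidden width is reduced from 4096 to 64 to keep the demo small.

```python
import numpy as np

def normalize_adjacency(sim):
    """Symmetric normalization D^-1/2 A D^-1/2 of a similarity graph (assumed)."""
    a = np.clip(sim, 0.0, None)              # assumption: drop negative similarities
    deg = a.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(deg, 1e-12))
    return a * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def gcn_hash_encode(feats, sim, w1, w2):
    """Two graph-convolution layers: d_f -> hidden -> c, then binarize to {-1, +1}."""
    a_hat = normalize_adjacency(sim)
    hidden = np.maximum(a_hat @ feats @ w1, 0.0)   # middle-layer features (ReLU)
    relaxed = np.tanh(a_hat @ hidden @ w2)         # continuous relaxation of the codes
    codes = np.where(relaxed >= 0, 1.0, -1.0)      # hash codes
    return hidden, codes

rng = np.random.default_rng(0)
m, d_f, d_hidden, c = 8, 32, 64, 16               # batch size, input dim, hidden dim, code length
feats = rng.standard_normal((m, d_f))             # stand-in for expert features
unit = feats / np.linalg.norm(feats, axis=1, keepdims=True)
sim = unit @ unit.T                               # cosine similarity graph over the batch
w1 = rng.standard_normal((d_f, d_hidden)) * 0.1   # learnable in practice, random here
w2 = rng.standard_normal((d_hidden, c)) * 0.1
hidden, codes = gcn_hash_encode(feats, sim, w1, w2)
```

Because each layer multiplies by the normalized similarity graph, every sample's representation is a weighted average over its neighbours in the batch, which is the aggregation of similar features that the graph convolutional module relies on.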

3.3 Performance comparison

In the present invention, to demonstrate the effectiveness of the proposed DMKD method, comprehensive experiments were performed on three benchmark datasets (MS COCO, MIRFLICKR-25K and NUS-WIDE). The proposed method was analyzed as a whole through performance comparison, an ablation study and parameter analysis.

To demonstrate the efficacy of the proposed DMKD, it was compared on the three datasets with several unsupervised baseline methods, including DJSTH, JDSH, HNH, DSAH, DGCPN, DUCH and DAEH. The retrieval precision of all baselines is compared on the I→T (image-to-text) and T→I (text-to-image) tasks using MAP@5000 and Top-N precision curves; see figs. 2-6 of the specification.
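For reference, MAP@k over a Hamming ranking can be computed as in the sketch below. The function name `map_at_k`, the toy codes and the relevance matrix are hypothetical; the sketch assumes ±1 hash codes, treats relevance as given (in multi-label datasets two items are commonly counted as relevant when they share at least one label), and normalizes each average precision by the number of relevant items found in the top k, which is one common convention and not necessarily the one used in the reported experiments.

```python
import numpy as np

def map_at_k(query_codes, db_codes, relevance, k=5000):
    """Mean average precision over the top-k Hamming-ranked database items.

    query_codes, db_codes: arrays of +1/-1 hash codes.
    relevance[i, j] is 1 if database item j is relevant to query i, else 0.
    """
    c = query_codes.shape[1]
    hamming = 0.5 * (c - query_codes @ db_codes.T)   # Hamming distance for +1/-1 codes
    aps = []
    for dist, rel in zip(hamming, relevance):
        order = np.argsort(dist, kind="stable")[:k]
        hits = rel[order]
        if hits.sum() == 0:
            aps.append(0.0)
            continue
        ranks = np.arange(1, len(hits) + 1)
        precision_at_rank = np.cumsum(hits) / ranks
        aps.append(float((precision_at_rank * hits).sum() / hits.sum()))
    return float(np.mean(aps))

# Toy example: one query, four database items at Hamming distances 0, 1, 2, 4.
query = np.array([[1, 1, 1, 1]])
database = np.array([[1, 1, 1, 1], [1, 1, 1, -1], [1, 1, -1, -1], [-1, -1, -1, -1]])
relevance = np.array([[1, 0, 1, 0]])
score = map_at_k(query, database, relevance, k=5000)
```

In the toy example the relevant items sit at ranks 1 and 3, so the average precision is (1/1 + 2/3) / 2 = 5/6.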

3.4 Top-N precision curve comparison

Figs. 4-6 of the specification show the top-N precision curves obtained by the compared models on the three benchmark datasets.
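A top-N precision curve of this kind reports, for each N, the fraction of relevant items among the N nearest database items in Hamming space, averaged over all queries. The sketch below is illustrative only: the function name `top_n_precision` and the toy data are hypothetical, and ±1 hash codes with a precomputed relevance matrix are assumed.

```python
import numpy as np

def top_n_precision(query_codes, db_codes, relevance, n_values):
    """Precision among the N nearest database items, averaged over queries."""
    c = query_codes.shape[1]
    hamming = 0.5 * (c - query_codes @ db_codes.T)        # Hamming distance for +1/-1 codes
    order = np.argsort(hamming, axis=1, kind="stable")    # nearest first
    ranked_rel = np.take_along_axis(relevance, order, axis=1)
    return [float(ranked_rel[:, :n].mean()) for n in n_values]

# Toy example: one query, four database items at Hamming distances 0, 1, 2, 4.
query = np.array([[1, 1, 1, 1]])
database = np.array([[1, 1, 1, 1], [1, 1, 1, -1], [1, 1, -1, -1], [-1, -1, -1, -1]])
relevance = np.array([[1, 0, 1, 0]])
curve = top_n_precision(query, database, relevance, n_values=[1, 2, 4])
```

Plotting `curve` against `n_values` for each method gives one line per model, which is the form of comparison used in such figures.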

3.5 Ablation experiments

To demonstrate the contribution of each component, four variant models were designed. The results of the ablation study are shown in table 4, and the variant models are described as follows:

(1) DMKD-1: the variant that does not use hash-level knowledge distillation.

(2) DMKD-2: the variant that does not use feature-level knowledge distillation.

(3) DMKD-3: the variant that does not use the relation-level knowledge distillation loss.

(4) DMKD-4: the variant that does not use the dynamic multi-expert selection (DMES) component.

Fig. 7 of the specification shows a comparison, on the MIRFLICKR-25K and NUS-WIDE datasets, with a method based on the master-slave paradigm.

3.6 Parameter sensitivity analysis

The sensitivity of the proposed DMKD method to variations in the parameters epsilon, tau, eta and lambda was studied systematically. The parameter sensitivity analysis is shown in fig. 8.

4 Conclusion

The present invention proposes a novel and efficient unsupervised cross-modal hash retrieval method (DMKD). First, a student-adaptive dynamic multi-expert selection strategy is designed to select the optimal expert model for each batch of multi-modal data. This strategy enables the student model to better optimize the hash function through such differentiated learning. Second, a multi-level knowledge distillation framework is proposed for more efficient knowledge distillation. The framework comprises a GNN-based auxiliary coding network and a three-level knowledge distillation module (namely the feature level, the relation level and the hash code level), which distills the fine-grained multi-modal knowledge of the expert models into the student network more comprehensively, thereby improving the retrieval precision of UCMH. Finally, the DMKD method improves the retrieval performance of UCMH without increasing the number of model parameters, so the model remains lightweight. Extensive experiments on three widely used multimedia retrieval datasets show that the proposed method improves the hash representation learning capacity of the student model through visual language knowledge distillation, and that it outperforms recent representative unsupervised cross-modal hash methods on multiple evaluation metrics.
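The per-batch expert selection summarized above can be sketched as follows. This is a hedged illustration rather than the selection rule of the invention: the text only states that each expert's feature similarity is compared with the student's middle-layer similarity S_H = cos(H_I, H_T), so the mean squared difference used here as the matching criterion is an assumption, as are the names `cosine_matrix`, `select_expert` and the toy expert labels.

```python
import numpy as np

def cosine_matrix(a, b):
    """Pairwise cosine similarity between the rows of a and the rows of b."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

def select_expert(student_img, student_txt, expert_feats):
    """Pick the expert whose image-text similarity is closest to the student's.

    expert_feats maps an expert name to (image_features, text_features) for the batch.
    The mean squared difference is an assumed matching criterion.
    """
    s_h = cosine_matrix(student_img, student_txt)        # S_H, shape (m, m)
    distances = {
        name: float(np.mean((cosine_matrix(img, txt) - s_h) ** 2))
        for name, (img, txt) in expert_feats.items()
    }
    return min(distances, key=distances.get), distances

rng = np.random.default_rng(0)
m = 6                                                    # batch size
h_i, h_t = rng.standard_normal((m, 16)), rng.standard_normal((m, 16))
experts = {
    # Same directions as the student features, so the cosine similarities coincide.
    "expert_a": (2.0 * h_i, 2.0 * h_t),
    # Unrelated features with a different dimensionality.
    "expert_b": (rng.standard_normal((m, 32)), rng.standard_normal((m, 32))),
}
best, distances = select_expert(h_i, h_t, experts)
```

Because only the m×m similarity matrices are compared, experts with different feature dimensions can be ranked against the same student batch.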

It is noted that in the present invention, relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.

Although embodiments of the present invention have been shown and described, it will be understood by those skilled in the art that various changes, modifications, substitutions and alterations can be made therein without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims

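1. An unsupervised cross-modal hash retrieval method based on dynamic multi-expert knowledge distillation, characterized by comprising four parts in sequence:

a student hash coding module (SHE), which encodes the multi-modal data into feature vectors through a deep neural network and then maps the feature vectors into hash codes through a fully connected layer;

a dynamic multi-expert selection module (DMES), which adopts multiple visual language pre-training (VLP) models as expert models and designs a student-adaptive dynamic multi-expert selection strategy so as to allocate an optimal expert model to each batch of data;

a graph convolutional hash coding module (GCHE);

and a multi-level knowledge distillation module (MLKD), which designs a multi-level knowledge distillation framework, introduces a GNN to process graph-based knowledge distillation, and transfers the topological semantics of the teacher network into the student network as topology-aware knowledge.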

2. The method of claim 1, wherein the multi-level knowledge distillation module (MLKD) comprises three levels of knowledge distillation: a response level, a feature level, and a relation level.

3. The method for unsupervised cross-modal hash retrieval based on dynamic multi-expert knowledge distillation as claimed in claim 1, wherein in said student hash coding module (SHE), image and text data are encoded as middle-layer vectors H_I and H_T, the visual encoder is denoted Enc_I, and the text feature encoder is denoted Enc_T, with the formula as follows:

4. A method of unsupervised cross-modal hash retrieval based on dynamic multi-expert knowledge distillation according to claim 3, wherein in said student hash coding module (SHE), after the image and text data are encoded, hash codes B_I and B_T are generated by iterative quantization, with the formula expressed as follows:

wherein α denotes the number of training epochs, Enc_*(·, θ_HI), * ∈ {HI, HT}, denotes the hash encoders of the image and text modalities, and the hash codes B_I and B_T of the image and text modalities are used to construct the cosine self-similarity matrices S_BI and S_BT of the respective modalities.

5. The method of claim 4, wherein the dynamic multi-expert selection module (DMES) uses a plurality of visual language pre-training (VLP) models as the multi-expert model and inputs multi-modal data into the VL encoder to obtain corresponding features, the equation being expressed as follows:

subsequently, the expert image features and text features are used to construct the corresponding similarity matrix, which contains the fine-grained feature similarity of the expert models.

6. The method for unsupervised cross-modal hash retrieval based on dynamic multi-expert knowledge distillation as claimed in claim 5, wherein, after the corresponding similarity matrices are constructed, the feature similarity of each expert model is compared with the student middle-layer feature similarity S_H = cos(H_I, H_T) ∈ [-1, +1]^(m×m) to select the expert that best matches the learning of the student model, with the formula as follows:

7. The method for unsupervised cross-modal hash retrieval based on dynamic multi-expert knowledge distillation of claim 6, wherein in the graph convolutional hash coding module (GCHE), after the expert model features are obtained, an expert feature similarity matrix S_E is first constructed, with the formula expressed as follows:

s.t. 0 ≤ β, γ, δ ≤ 1, β + γ + δ = 1.

at the same time, the selected expert features and the corresponding similarity matrix are fed into the corresponding graph convolutional neural network, which aggregates information from similar features to produce higher-quality hash codes;

wherein W^(1) and W^(2) are learnable parameter matrices, σ denotes the activation function of the middle layer of the graph neural network, H_GI and H_GT denote the middle-layer features of the GCN, and α denotes the training epoch; the quantization converts the discrete optimization of the hash codes into a series of continuous quantization problems, and because similar data features are aggregated using the strong neighborhood feature modeling capability of the graph neural network, the generated hash codes B_GI and B_GT naturally preserve the similarity of the original features.

8. The method of claim 7, wherein in the multi-level knowledge distillation (MLKD), the hash-level knowledge distillation compares the hash codes B_I, B_T, B_GI and B_GT generated by the SHE component and the GCHE component;

the relation-based knowledge distillation compares the loss between the hash similarities S_BI, S_BT of the student network and the hash similarities S_GI, S_GT of the graph convolutional network (GCN), wherein S_BI = cos(B_I, B_I), S_BT = cos(B_T, B_T) ∈ [-1, +1]^(m×m), and similarly the graph convolutional hash similarity matrices S_GI = cos(B_GI, B_GI), S_GT = cos(B_GT, B_GT) ∈ [-1, +1]^(m×m) are obtained;

for the feature-level knowledge distillation, the student network middle-layer features H_I and H_T and the graph convolutional network middle-layer features H_GI and H_GT are used to construct a mean square error loss;

wherein L_Hash denotes the hash-code-level KD loss, which makes the hash codes generated by the GCN consistent with the hash codes of the student; L_Intra denotes the intra-modal similarity reconstruction loss, which compares the self-similarity matrices S_BI, S_BT of the intra-modal hash codes with the expert matrix S_E so as to distill the fine-grained similarity into the student network; μ is a scalable hyperparameter that adjusts the quantization range of the hash codes; L_Cross denotes the cross-modal similarity loss, which is computed with the inner product of vectors and facilitates the fusion of the hash codes of different modalities while preserving similarity; L_Relation denotes the relation-based knowledge distillation loss, which distills the fine-grained similarity of the expert features into the student network; and finally, the feature-level knowledge distillation loss L_Feature distills the middle-layer features of the student and expert encoders.

9. The unsupervised cross-modal hash retrieval method based on dynamic multi-expert knowledge distillation as claimed in any one of claims 1-8, wherein the dynamic multi-expert knowledge distillation (DMKD) method adopted in the multi-level knowledge distillation module (MLKD) is implemented in the Python programming language with the PyTorch deep learning framework, on a machine with an NVIDIA RTX 3090 GPU and 32 GB of memory.