CN117540039A - Data retrieval method based on unsupervised cross-modal hash algorithm - Google Patents

Data retrieval method based on unsupervised cross-modal hash algorithm Download PDF

Info

Publication number
CN117540039A
CN117540039A (publication number) · CN202311514255.5A (application number)
Authority
CN
China
Prior art keywords
text
hash
image
features
similarity matrix
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311514255.5A
Other languages
Chinese (zh)
Inventor
李祎
郭艳卿
付海燕
李梦栾
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dalian University of Technology
Original Assignee
Dalian University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dalian University of Technology filed Critical Dalian University of Technology
Priority to CN202311514255.5A priority Critical patent/CN117540039A/en
Publication of CN117540039A publication Critical patent/CN117540039A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/40 Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
    • G06F 16/48 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F 16/483 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/31 Indexing; Data structures therefor; Storage structures
    • G06F 16/316 Indexing structures
    • G06F 16/325 Hash tables
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/40 Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
    • G06F 16/41 Indexing; Data structures therefor; Storage structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G06N 3/0455 Auto-encoder networks; Encoder-decoder networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/088 Non-supervised learning, e.g. competitive learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Multimedia (AREA)
  • Library & Information Science (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a data retrieval method based on an unsupervised cross-modal hash algorithm, which comprises the following steps: obtaining depth features with the image encoder and text encoder of the pre-trained model CLIP; inputting the depth features into hash networks to generate hash codes, and fusing the depth features of the different modalities to obtain a similarity matrix of the fused features; combining this matrix with the intra-modality similarity matrices to obtain a final similarity matrix and training the hash networks; and outputting retrieval results based on the Hamming distances between hash codes of different modalities. The invention makes full use of the features of the different modalities, mines richer semantic similarity information, constructs a reliable cross-modal similarity matrix, and guides the training of the hash networks. In addition, the invention uses a fused hash reconstruction strategy to reduce the quantization loss between real-valued hash features and discrete binary hash codes.

Description

Data retrieval method based on unsupervised cross-modal hash algorithm
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a data retrieval method based on an unsupervised cross-modal hash algorithm.
Background
The cross-modal retrieval and matching task is mainly applied in the artificial intelligence industry, in particular the cross-modal retrieval task over the two most common modalities (images and text). With the advent of the information age and the rapid growth of the Internet, multimedia data such as text, images, audio and video have grown explosively. Massive multimedia data place higher demands on storage and on cross-modal semantic retrieval. Cross-modal hashing is a popular topic due to its compact representation and efficient similarity computation.
Cross-modal hashing is primarily concerned with mapping raw multi-modal data into a common Hamming space while maintaining the semantic similarity between different instances. Existing cross-modal hashing methods can generally be divided into shallow methods and deep methods according to whether a deep neural network is used during training.
Shallow cross-modal hashing methods typically use hand-crafted features to learn binary codes and hash functions. Based on whether supervision information is utilized, existing shallow cross-modal hashing methods can be roughly classified into unsupervised and supervised approaches. In general, unsupervised cross-modal hashing learns hash codes or projection functions from the intra-modal and inter-modal similarity structure of the data. In contrast, supervised cross-modal hashing methods can obtain more accurate semantic information by utilizing semantic labels, and achieve better performance under the supervision of accurate similarity semantics.
Deep cross-modal hashing methods train nonlinear hash networks using features extracted by deep networks; the basic framework is shown in fig. 1. A deep neural network extracts features of the different modalities, and the hash network of each modality maps the high-dimensional features into binary hash codes while preserving intra-modal and inter-modal similarity, i.e. similar hash codes correspond to semantically similar instances. In downstream matching and retrieval tasks, only the hash codes of the query data need to be obtained, and semantically similar instances can be found by comparing the Hamming distances between hash codes, which greatly improves retrieval speed and storage efficiency. Typical supervised deep cross-modal hashing methods include DCMH and SRLCH. DCMH constructs an end-to-end deep discrete hash learning framework to realize joint deep feature learning and hash learning. To learn more discriminative hash codes, SRLCH directly uses the relation information of the semantic labels by converting class labels into a subspace. The main challenge of unsupervised deep cross-modal hashing is how to mine deep semantic information between instances to guide the training of the hash network. DJSTH constructs a joint semantic affinity matrix to mine the latent internal semantic relationships between multimodal instances. JDSH proposes a distribution-based similarity decision and weighting method that further optimizes the construction of the joint similarity matrix. DGCPN proposes graph-neighbor similarity to mine information between an instance and its neighbors, obtaining more semantic information.
In real-world scenarios, large amounts of data are often unlabeled, and manually labeling large amounts of data is time consuming. Therefore, unsupervised cross-modal hashing has received a great deal of attention. Its main challenges are how to mine deep semantic information between instances and how to reduce the quantization loss of the binarization process.
Disclosure of Invention
In order to bridge the heterogeneity of data features and distributions among different modalities and to reduce quantization loss, the invention designs a data retrieval method based on an unsupervised cross-modal hash algorithm, which makes full use of the features of different modalities, mines richer semantic similarity information, constructs a reliable cross-modal similarity matrix, and guides the training of the hash network. In addition, the invention uses a fused hash reconstruction strategy to reduce the quantization loss between real-valued hash features and discrete binary hash codes.
The invention adopts the following technical means:
a data retrieval method based on an unsupervised cross-modal hash algorithm comprises the following steps:
acquiring training image instance data and text instance data, and constructing text-image data pairs;
extracting features of the training image instance data by using the image encoder of the pre-trained model CLIP, thereby obtaining training depth image features; extracting features of the training text instance data by using the text encoder of the pre-trained model CLIP, thereby obtaining training depth text features;
on the one hand, after normalizing the obtained depth image features, calculating the similarity matrix within the image modality, and after normalizing the obtained depth text features, calculating the similarity matrix within the text modality; on the other hand, after aggregating the depth image features and the depth text features, inputting them into a Transformer encoder to generate a fused hash code, and calculating the similarity matrix of the fused image and text modalities;
obtaining a final similarity matrix based on the similarity matrix within the image modality, the similarity matrix within the text modality, and the similarity matrix of the fused image and text modalities; inputting the obtained depth image features into an image hash network and the obtained depth text features into a text hash network, and training the hash network of the image modality and the hash network of the text modality based on the final similarity matrix, wherein the hash network of the image modality is used for outputting image hash codes and the hash network of the text modality is used for outputting text hash codes;
acquiring image instance data and text instance data to be retrieved, extracting the depth image features to be retrieved with the image encoder of the pre-trained model CLIP and the depth text features with the text encoder of the pre-trained model CLIP, inputting the depth image features to be retrieved into the trained hash network of the image modality to obtain an image hash code, and inputting the depth text features into the trained hash network of the text modality to obtain a text hash code;
and outputting retrieval results by calculating the Hamming distance between the image hash code and the text hash code.
Further, the method further comprises:
generating a strict binary image hash code and a strict binary text hash code through a sign function;
carrying out hash code fusion on the strict binary image hash code and the strict binary text hash code to generate a strict binary fused hash code;
inputting the strict binary fused hash code into a Transformer encoder for reconstruction, thereby obtaining real-valued reconstructed hash features;
and aligning the hash codes of different modalities with the reconstructed hash features to construct a loss function, and fine-tuning the hash network parameters.
Further, inputting the depth image feature to be retrieved into a trained hash network of an image modality to obtain an image hash code, and inputting the depth text feature into the trained hash network of a text modality to obtain a text hash code includes generating a relaxed real-valued hash feature according to the following formula:
H_* = F_*(F_*; θ_*), * ∈ {I, T}
where F_I(·; θ_I) denotes the hash network of the image modality, F_T(·; θ_T) the hash network of the text modality, F_I the depth image features extracted by the CLIP image encoder, F_T the depth text features extracted by the CLIP text encoder, θ_I the parameters of the image hash network, and θ_T the parameters of the text hash network.
Further, after the depth image features and the depth text features are aggregated, they are input into a Transformer encoder to generate a fused hash code, and the fused hash features are obtained according to the following formulas:
F_IT = F_I ⊕ F_T
H_C = Trans_C(F_IT; θ_C)
where F_IT denotes the aggregated depth features, F_I the depth image features, F_T the depth text features, ⊕ the merging operation, and θ_C the Transformer encoder parameters.
Further, calculating the similarity matrix within the image modality after normalizing the obtained depth image features, and calculating the similarity matrix within the text modality after normalizing the obtained depth text features, comprises calculating the intra-modality similarity matrices according to the following formula:
S_* = D_cos(F_*, F_*), * ∈ {I, T}
where S_* denotes the similarity matrix within each modality and D_cos(·,·) denotes the cosine similarity between different instances;
and calculating the similarity matrix of the fused image and text modalities according to the following formula:
S_C = D_cos(H_C, H_C)
where S_C denotes the fused similarity matrix.
Further, obtaining a final similarity matrix based on the similarity matrix within the image modality, the similarity matrix within the text modality, and the similarity matrix of the fused image and text modalities, comprises obtaining the similarity matrix according to the following formula:
S = αS_I + βS_T + γS_C, s.t. α, β, γ ≥ 0, α + β + γ = 1
where S_I denotes the image-modality similarity matrix, S_T the text-modality similarity matrix, S_C the similarity matrix of the fused image and text modalities, and α, β and γ the weight coefficients;
2S-1 is used as the final similarity matrix.
Further, outputting the retrieval result by calculating the Hamming distance between the image hash code and the text hash code comprises:
mapping the features of the different modalities into a common Hamming space, with the fusion objective function defined as follows:
where Tr(·) denotes the trace of a matrix, ‖·‖_F denotes the Frobenius norm, and x, y ∈ {I, T}.
Compared with the prior art, the invention has the following advantages:
1. The invention makes full use of the semantic information of the different modalities to obtain cross-modal consistent features, makes full use of co-occurrence information, reduces the heterogeneity between different modalities, and improves the accuracy of cross-modal retrieval.
2. The fused similarity matrix constructed by the invention fully captures the semantic similarity relationships between instances and provides more reliable supervision information for the learning of the unsupervised hash network.
3. The invention can bridge the gap between different modalities, reduce quantization error, and generate higher-quality hash codes.
For the above reasons, the invention can be widely applied in fields such as cross-modal image and text search engines, cross-modal recommendation systems, cross-modal image diagnosis, and multi-modal data analysis.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings required in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings described below show only some embodiments of the present invention, and that other drawings may be obtained from them by a person skilled in the art without inventive effort.
Fig. 1 is a prior art deep cross-modal hash basic framework.
Fig. 2 is a basic framework of a data retrieval method based on an unsupervised cross-modal hash algorithm.
Fig. 3 is an example of cross-modal hash image-text retrieval in an embodiment.
Detailed Description
In order that those skilled in the art will better understand the present invention, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. It is apparent that the described embodiments are only some, not all, embodiments of the present invention. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without inventive effort shall fall within the scope of the present invention.
In unsupervised cross-modal hashing, it is important to extract deep, rich semantic features from the original data, which then serve as supervision information to guide the learning of the hash network. Because of the heterogeneity of feature representations and data distributions among different modalities, many existing approaches focus on building a unified semantic similarity matrix using features extracted by single-modality networks. However, such features cannot capture the co-occurrence semantic information of paired data, nor can a similarity matrix calculated by a hand-designed formula adequately represent the complex semantic relationships of cross-modal data.
Aiming at the shortcomings of similarity matrices calculated with hand-designed formulas in cross-modal hashing, and at the quantization loss caused by the hash-code generation process, the invention designs a modality-consistency-preserving unsupervised cross-modal hash algorithm and, based on this algorithm, a data retrieval method that makes full use of the semantic information of the different modalities. Specifically, based on the powerful feature extraction capability of CLIP, a pre-trained CLIP image encoder and text encoder are employed to extract rich semantic information from the raw data. The features extracted by the CLIP model are fused into a common latent space, and the paired co-occurrence information is fully utilized to guide the learning of the hash codes. A trainable fusion similarity calculation method is designed, in which a Transformer is utilized to fully extract the similarity of the multi-modal fused features and obtain a reliable fused similarity matrix, which guides the training of the unsupervised cross-modal hash network. A fused hash-code reverse generation strategy is proposed to obtain consistent hash codes and alleviate the quantization loss of the binarization process.
Specifically, the method comprises the following steps:
s1, acquiring training image instance data and text instance data, and constructing a text image data pair.
The invention is applied to the cross-modal retrieval task over the two most commonly used modalities (images and text). Assume that there are n image-text data pairs {o_i}_{i=1}^{n}, where each pair o_i = [I_i, T_i] consists of an image I_i and its paired text T_i.
S2, performing feature extraction on training image instance data by using an image encoder of a pre-training model CLIP so as to obtain training depth image features; and extracting the characteristics of the training text instance data by using a text encoder of the pre-training model CLIP, thereby obtaining training depth text characteristics.
Owing to the strong feature extraction capability of CLIP, the algorithm uses the image encoder and the text encoder of the pre-trained model CLIP to obtain, from the raw data, depth features F_*, * ∈ {I, T}, that contain rich semantic information.
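As a purely illustrative sketch (not part of the original disclosure), the following Python code shows one way this feature-extraction step could be realized with a publicly available CLIP checkpoint; the Hugging Face model name, the example image paths and texts, and the 512-dimensional feature size of this particular checkpoint are assumptions.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Assumed checkpoint; the method only requires a pre-trained CLIP image/text encoder.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical paired training data (file paths and captions are placeholders).
images = [Image.open(p).convert("RGB") for p in ["cat.jpg", "dog.jpg"]]
texts = ["a photo of a cat", "a photo of a dog"]

with torch.no_grad():
    img_inputs = processor(images=images, return_tensors="pt")
    txt_inputs = processor(text=texts, return_tensors="pt", padding=True, truncation=True)
    F_I = model.get_image_features(**img_inputs)  # depth image features, shape (n, 512) for this checkpoint
    F_T = model.get_text_features(**txt_inputs)   # depth text features,  shape (n, 512) for this checkpoint
```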
S3, on the one hand, after normalizing the acquired depth image features, calculating the similarity matrix within the image modality, and after normalizing the acquired depth text features, calculating the similarity matrix within the text modality; on the other hand, after aggregating the depth image features and the depth text features, inputting them into a Transformer encoder to generate a fused hash code, and calculating the similarity matrix of the fused image and text modalities.
The depth features F_*, * ∈ {I, T}, are input into the modality-specific hash networks to generate hash codes. Since discrete binary codes cause the gradients of the neural network to vanish during back-propagation, the algorithm overcomes this problem by generating relaxed real-valued hash features:
H_* = F_*(F_*; θ_*), * ∈ {I, T} (1)
where F_I(·; θ_I) and F_T(·; θ_T) denote the hash networks of the image and text modalities, respectively.
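A minimal PyTorch sketch of the modality-specific hash networks of formula (1) is given below; the two-layer MLP structure, hidden width, code length and tanh relaxation are illustrative assumptions, since the description only requires networks that map depth features to relaxed real-valued hash features.

```python
import torch
import torch.nn as nn

class HashNet(nn.Module):
    """Modality-specific hash network: depth feature -> relaxed real-valued hash feature H_*."""
    def __init__(self, feat_dim: int = 512, hash_bits: int = 64, hidden: int = 1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, hash_bits),
            nn.Tanh(),  # relaxed codes in (-1, 1) avoid the vanishing gradient of sign()
        )

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.net(features)

# Stand-ins for the CLIP depth features of the previous step (batch of 8 pairs, 512-dim).
F_I, F_T = torch.randn(8, 512), torch.randn(8, 512)

image_hash_net, text_hash_net = HashNet(), HashNet()
H_I = image_hash_net(F_I)  # relaxed image hash features, formula (1) with * = I
H_T = text_hash_net(F_T)   # relaxed text hash features, formula (1) with * = T
```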
Further, unsupervised cross-modal hashing has no label information and compares instance similarities only through the internal data distribution and features. It is therefore necessary to mine the semantic consistency information of paired instances and to construct a reliable similarity matrix that guides the training of the hash network. The algorithm proposes a feature fusion strategy: the high-level nonlinear features F_I and F_T are aggregated and then fed into a Transformer encoder, which captures consistent information and eliminates the heterogeneity between modalities through network training. The process of feature fusion is expressed as:
F_IT = F_I ⊕ F_T
H_C = Trans_C(F_IT; θ_C) (3)
where ⊕ denotes the merging operation and θ_C the Transformer encoder parameters.
Rather than relying on features extracted by single-modality networks to calculate a unified similarity matrix, the proposed method uses the consistency features to construct a fused similarity matrix, so that richer cross-modal semantic information is preserved, the heterogeneity of the data is compensated, and good supervision information is provided for cross-modal hash learning. After normalizing F_I, F_T and H_C, the cosine similarity between different instances is calculated to obtain the intra-modality similarities and the fused similarity:
S_* = D_cos(F_*, F_*), * ∈ {I, T}, S_C = D_cos(H_C, H_C) (4)
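The following sketch illustrates the feature fusion of formula (3) and the similarity computation of formula (4); the use of nn.TransformerEncoder with a length-1 sequence, the concatenation merge, and all layer sizes are assumptions made for illustration only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as func

class FusionEncoder(nn.Module):
    """Trans_C(.; theta_C): merge image/text features and produce the fused hash feature H_C."""
    def __init__(self, feat_dim: int = 512, hash_bits: int = 64, n_heads: int = 8, n_layers: int = 2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=2 * feat_dim, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.proj = nn.Sequential(nn.Linear(2 * feat_dim, hash_bits), nn.Tanh())

    def forward(self, f_img: torch.Tensor, f_txt: torch.Tensor) -> torch.Tensor:
        f_it = torch.cat([f_img, f_txt], dim=-1)  # F_IT = F_I (+) F_T, merge by concatenation
        fused = self.encoder(f_it.unsqueeze(1))   # treat each pair as a length-1 sequence
        return self.proj(fused.squeeze(1))        # fused hash feature H_C

def cosine_similarity_matrix(x: torch.Tensor) -> torch.Tensor:
    """D_cos(x, x): pairwise cosine similarity between instances, as in formula (4)."""
    x = func.normalize(x, dim=-1)
    return x @ x.t()

# Stand-ins for the CLIP depth features (in practice, the outputs of the CLIP encoders).
F_I, F_T = torch.randn(8, 512), torch.randn(8, 512)
H_C = FusionEncoder()(F_I, F_T)       # fused hash feature, formula (3)
S_I = cosine_similarity_matrix(F_I)   # intra-image-modality similarity
S_T = cosine_similarity_matrix(F_T)   # intra-text-modality similarity
S_C = cosine_similarity_matrix(H_C)   # fused similarity
```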
s4, obtaining a final similarity matrix based on the similarity matrix in the image mode, the similarity matrix in the text mode and the similarity matrix of the fusion image mode and the text mode.
Obtaining a unified similarity matrixThe matrix contains inter-mode similarity and intra-mode similarity: s=αs I +βS T +γS C
s.t.α,β,γ≥0,α+β+γ=1 (5)
2S-1 was used as the final similarity matrix.
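A small sketch of the weighted combination of formula (5) and the final 2S-1 scaling, continuing the previous sketch; the particular weight values are placeholders.

```python
# Weighted combination of intra-modality and fused similarities, formula (5).
# alpha, beta and gamma are non-negative and sum to 1; the values here are placeholders.
alpha, beta, gamma = 0.3, 0.3, 0.4
S = alpha * S_I + beta * S_T + gamma * S_C  # S_I, S_T, S_C from the previous sketch
S_final = 2 * S - 1                         # final similarity matrix, scaled to [-1, 1]
```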
S5, inputting the acquired depth image features into the image hash network and the acquired depth text features into the text hash network, and training the hash network of the image modality and the hash network of the text modality based on the final similarity matrix, wherein the hash network of the image modality outputs the image hash codes and the hash network of the text modality outputs the text hash codes.
S6, acquiring the image instance data and text instance data to be retrieved, extracting the depth image features to be retrieved with the image encoder of the pre-trained model CLIP and the depth text features with its text encoder, inputting the depth image features to be retrieved into the trained hash network of the image modality to obtain image hash codes, and inputting the depth text features into the trained hash network of the text modality to obtain text hash codes; and outputting the retrieval results by calculating the Hamming distance between the image hash codes and the text hash codes.
The consistency features and the similarity matrix reinforce each other, effectively guiding the hash networks to map the features of different modalities into a common Hamming space. The fusion objective function is defined as follows:
The first term of the formula constrains the consistency features to preserve semantic similarity while optimizing the unified similarity S_C. The second term ensures the similarity between instance pairs; because setting the diagonal similarity to 1 may not achieve optimal results due to the modality gap, a balance factor η is introduced to address this. The third term maximizes the semantic similarity between the consistency features and the cross-modal hash features to preserve the co-occurrence information between paired instances.
Further, the method further comprises:
s7, generating a strict binary image hash code and a strict binary text hash code through a symbol function. A strict binary hash code may be generated by a sign function.
B * =sign(H * ),*∈{I,T} (7)
S8, carrying out hash code fusion on the strict binary image hash code and the strict binary text hash code to generate a strict binary fused hash code.
The purpose of cross-modal hashing is to map high-dimensional multi-modal features into a low-dimensional Hamming space, which inevitably causes a loss of information. In order to eliminate the gap between the real values and the binary codes, the algorithm proposes a fused hash-code reconstruction method. First, the binary hash codes B_I and B_T of the different modalities are concatenated to obtain a joint hash code B_C containing the multi-modal information, which satisfies the requirement that paired instances obtain the same hash code. The process of hash-code fusion can be expressed as:
B_C = B_I ⊕ B_T (8)
s9, inputting the strict binary fusion hash code into a transform encoder for reconstruction, so as to obtain a real value reconstruction hash code.
Reconstructing the fused binary hash code into a real-valued hash feature by using a transducer encoder:
H B =Trans B (B C ;θ B ) (9)
wherein,θ B is a transducer encoder parameter.
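The sign, fusion and reconstruction steps of formulas (7)-(9) can be sketched as follows; the reconstruction network's layer sizes and the length-1 sequence treatment are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class ReconstructionEncoder(nn.Module):
    """Trans_B(.; theta_B): reconstruct real-valued hash features from the fused binary code."""
    def __init__(self, hash_bits: int = 64, n_heads: int = 4, n_layers: int = 2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=2 * hash_bits, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.proj = nn.Linear(2 * hash_bits, hash_bits)

    def forward(self, b_c: torch.Tensor) -> torch.Tensor:
        out = self.encoder(b_c.unsqueeze(1))  # each fused code as a length-1 sequence
        return self.proj(out.squeeze(1))      # H_B, reconstructed real-valued hash feature

# Stand-ins for the relaxed hash features H_I, H_T of formula (1).
H_I, H_T = torch.randn(8, 64), torch.randn(8, 64)

B_I, B_T = torch.sign(H_I), torch.sign(H_T)  # formula (7): strict binary codes
B_C = torch.cat([B_I, B_T], dim=-1)          # formula (8): fuse the binary codes by concatenation
H_B = ReconstructionEncoder()(B_C)           # formula (9): reconstruct real-valued hash features
```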
The self-attention mechanism of the Transformer encoder exploits the fused hash code and generates a real-valued hash representation, which removes the gradient vanishing caused by the sign function during back-propagation and enables end-to-end training of the network. Similar to equation (8), the algorithm proposes a hash reconstruction loss:
in order to minimize the error between the similarity matrix and the instance similarity, and maximize the semantic similarity of the pair-wise hash codes, the algorithm expresses the semantic preserving objective function as:
the similarity maintaining objective function in the modes and among the modes is defined, and unified hash codes are generated for different modes:
where x, y, x ', y' e { I, T }.
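Since the exact objective functions referenced above are not reproduced in this text, the following sketch shows only a commonly used similarity-preserving form (aligning pairwise cosine similarities of the relaxed hash features with the target matrix 2S-1) to illustrate the idea; it should not be read as the patent's precise loss.

```python
import torch
import torch.nn.functional as func

def similarity_preserving_loss(h_x: torch.Tensor, h_y: torch.Tensor, s_final: torch.Tensor) -> torch.Tensor:
    """Push the cosine similarity of the relaxed hash features of modalities x and y
    towards the target matrix 2S - 1. A commonly used form, shown only for illustration."""
    h_x, h_y = func.normalize(h_x, dim=-1), func.normalize(h_y, dim=-1)
    return ((h_x @ h_y.t()) - s_final).pow(2).mean()

# Stand-ins for the relaxed hash features and the final similarity matrix.
H_I, H_T, S_final = torch.randn(8, 64), torch.randn(8, 64), torch.randn(8, 8)

# Intra-modality and inter-modality terms, x, y in {I, T}.
loss_sim = (similarity_preserving_loss(H_I, H_I, S_final)
            + similarity_preserving_loss(H_T, H_T, S_final)
            + similarity_preserving_loss(H_I, H_T, S_final)
            + similarity_preserving_loss(H_T, H_I, S_final))
```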
S10, aligning the hash codes of the different modalities with the reconstructed hash features to construct a loss function, and fine-tuning the encoder parameters.
In order to realize a common representation of the data of different modalities, the algorithm proposes a feature consistency loss that aligns the hash codes of the different modalities with the fused and reconstructed features:
where p, q ∈ {C, B, I, T}.
Combining equations (6) and (10)-(13), the final objective function is calculated as:
where λ_1, λ_2, λ_3, λ_4 and λ_5 are hyperparameters that balance the total loss. Since the magnitudes of the different losses are comparable, they are not fine-tuned and all weights are set to 1.
Cross-modal hashing is a technology for processing multimedia data. It maps different types of data (such as images, video and text) into a common Hamming space so that efficient similarity search and matching can be performed across modalities, enabling efficient matching, recommendation and fusion analysis of data from different modalities, improving storage and computation efficiency, and providing a powerful tool for multi-modal data processing.
The invention will be further described with reference to specific examples of application.
The image-text cross-modal retrieval task covers text retrieval from an image query and image retrieval from a text query, and instance retrieval within a single modality can also be performed. Given a query picture or text, a hash code is generated by the network, and the texts or pictures in the retrieval dataset that are semantically similar to the query are returned according to their Hamming distance from the query instance, as shown in fig. 3.
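As an illustration of the retrieval step, a minimal sketch of Hamming-distance ranking for an image-to-text query follows; the code length, database size and top-k value are placeholders, and the text-to-image direction is symmetric.

```python
import torch

@torch.no_grad()
def hamming_rank(query_code: torch.Tensor, db_codes: torch.Tensor, topk: int = 5) -> torch.Tensor:
    """Rank database hash codes by Hamming distance to one query code.
    Codes are in {-1, +1}^k, so hamming(q, d) = (k - <q, d>) / 2."""
    k = query_code.numel()
    dist = (k - db_codes @ query_code) / 2
    return torch.argsort(dist)[:topk]  # indices of the nearest database items

# Stand-ins: binary code of one query image and codes of a text database (k = 64 bits).
b_query = torch.sign(torch.randn(64))
db_text_codes = torch.sign(torch.randn(1000, 64))

nearest = hamming_rank(b_query, db_text_codes, topk=5)
print(nearest)  # indices of the 5 semantically closest texts for the image query
```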
The present invention is applicable to a variety of aspects:
cross-modality image and text search engines: through cross-modal hashing, a search engine of images and text can be built, enabling users to query related text using images, or to query related images using text. For example, a user may upload a picture and the system returns a textual description similar to the query picture by converting the picture to a hash code and matching the hash in the image database. This has important applications in advertisement recommendation, media management, and merchandise search.
Cross-modality recommendation system: the cross-modal hash may be applied to build a cross-modal recommendation system. By mapping the user's historical behavior, images, and text information to binary codes and calculating the similarity between the binary codes, recommendation of content related to the user's interests may be achieved. For example, in electronic commerce, the system may recommend merchandise with similar characteristics to a user based on merchandise purchased by the user in the past, browsed pictures, and text information.
Cross-modality image diagnosis: cross-modal hashing is also widely used in the medical field. It may correlate medical images (e.g., CT scan, MRI images, etc.) with related textual information (e.g., medical history, report). By converting both the image and text information into hash codes and calculating the similarity between the binary codes, more accurate disease diagnosis and treatment plan recommendation can be achieved.
Multi-modal data analysis: cross-modal hashing helps fuse and align data of different modalities to achieve a unified representation and analysis of multi-modal data. By mapping data of different modalities into the same binary coding space, tasks such as classification, clustering and generation of cross-modal data can be performed. This is significant for the in-depth analysis of latent relationships and patterns in multi-modal data. For example, a user may upload a photograph, and the system can generate the corresponding hash code and match it against text posted by the user on a social media platform, returning posts, comments or topics related to the photograph. Such functionality can help people better understand content in social media and extract and analyze text information related to pictures.
Finally, it should be noted that the above embodiments are only used to illustrate the technical solution of the present invention and not to limit it. Although the invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some or all of their technical features may be replaced by equivalents; such modifications and substitutions do not depart from the spirit of the invention.

Claims (7)

1. A data retrieval method based on an unsupervised cross-modal hash algorithm, characterized by comprising the following steps:
acquiring training image instance data and text instance data, and constructing text-image data pairs;
extracting features of the training image instance data by using the image encoder of the pre-trained model CLIP, thereby obtaining training depth image features; extracting features of the training text instance data by using the text encoder of the pre-trained model CLIP, thereby obtaining training depth text features;
on the one hand, after normalizing the obtained depth image features, calculating the similarity matrix within the image modality, and after normalizing the obtained depth text features, calculating the similarity matrix within the text modality; on the other hand, after aggregating the depth image features and the depth text features, inputting them into a Transformer encoder to generate a fused hash code, and calculating the similarity matrix of the fused image and text modalities;
obtaining a final similarity matrix based on the similarity matrix within the image modality, the similarity matrix within the text modality, and the similarity matrix of the fused image and text modalities;
inputting the obtained depth image features into an image hash network and the obtained depth text features into a text hash network, and training the hash network of the image modality and the hash network of the text modality based on the final similarity matrix, wherein the hash network of the image modality is used for outputting image hash codes and the hash network of the text modality is used for outputting text hash codes;
acquiring image instance data and text instance data to be retrieved, extracting the depth image features to be retrieved with the image encoder of the pre-trained model CLIP and the depth text features with the text encoder of the pre-trained model CLIP, inputting the depth image features to be retrieved into the trained hash network of the image modality to obtain an image hash code, and inputting the depth text features into the trained hash network of the text modality to obtain a text hash code;
and outputting retrieval results by calculating the Hamming distance between the image hash code and the text hash code.
2. The data retrieval method based on an unsupervised cross-modal hash algorithm according to claim 1, further comprising:
generating a strict binary image hash code and a strict binary text hash code through a sign function;
carrying out hash code fusion on the strict binary image hash code and the strict binary text hash code to generate a strict binary fused hash code;
inputting the strict binary fused hash code into a Transformer encoder for reconstruction, thereby obtaining real-valued reconstructed hash features;
and aligning the hash codes of different modalities with the reconstructed hash features to construct a loss function, and fine-tuning the hash network parameters.
3. The data retrieval method based on an unsupervised cross-modal hash algorithm according to claim 1, wherein inputting the depth image feature to be retrieved into a trained image-modal hash network to obtain an image hash code, inputting the depth text feature into a trained text-modal hash network to obtain a text hash code, comprises generating a relaxed real-valued hash feature according to the following formula:
H_* = F_*(F_*; θ_*), * ∈ {I, T}
where F_I(·; θ_I) denotes the hash network of the image modality, F_T(·; θ_T) the hash network of the text modality, F_I the depth image features extracted by the CLIP image encoder, F_T the depth text features extracted by the CLIP text encoder, θ_I the parameters of the image hash network, and θ_T the parameters of the text hash network.
4. The data retrieval method based on an unsupervised cross-modal hash algorithm according to claim 1, wherein aggregating the depth image features and the depth text features and inputting them into a Transformer encoder to generate a fused hash code comprises obtaining the fused hash features according to the following formulas:
F_IT = F_I ⊕ F_T
H_C = Trans_C(F_IT; θ_C)
where F_IT denotes the aggregated depth features, F_I the depth image features, F_T the depth text features, ⊕ the merging operation, and θ_C the Transformer encoder parameters.
5. The data retrieval method based on an unsupervised cross-modal hash algorithm according to claim 4, wherein calculating the similarity matrix within the image modality after normalizing the obtained depth image features, and calculating the similarity matrix within the text modality after normalizing the obtained depth text features, comprises calculating the intra-modality similarity matrices according to the following formula:
S_* = D_cos(F_*, F_*), * ∈ {I, T}
where S_* denotes the similarity matrix within each modality and D_cos(·,·) denotes the cosine similarity between different instances;
and calculating the similarity matrix of the fused image and text modalities according to the following formula:
S_C = D_cos(H_C, H_C)
where S_C denotes the fused similarity matrix.
6. The data retrieval method based on an unsupervised cross-modal hash algorithm according to claim 5, wherein obtaining a final similarity matrix based on the similarity matrix within the image modality, the similarity matrix within the text modality, and the similarity matrix of the fused image and text modalities comprises obtaining the similarity matrix according to the following formula:
S = αS_I + βS_T + γS_C, s.t. α, β, γ ≥ 0, α + β + γ = 1
where S_I denotes the image-modality similarity matrix, S_T the text-modality similarity matrix, S_C the similarity matrix of the fused image and text modalities, and α, β and γ the weight coefficients;
2S-1 is used as the final similarity matrix.
7. The data retrieval method based on an unsupervised cross-modal hash algorithm according to claim 6, wherein outputting the retrieval result by calculating the Hamming distance between the image hash code and the text hash code comprises:
mapping the features of the different modalities into a common Hamming space, with the fusion objective function defined as follows:
where Tr(·) denotes the trace of a matrix, ‖·‖_F denotes the Frobenius norm, and x, y ∈ {I, T}.
CN202311514255.5A 2023-11-14 2023-11-14 Data retrieval method based on unsupervised cross-modal hash algorithm Pending CN117540039A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311514255.5A CN117540039A (en) 2023-11-14 2023-11-14 Data retrieval method based on unsupervised cross-modal hash algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311514255.5A CN117540039A (en) 2023-11-14 2023-11-14 Data retrieval method based on unsupervised cross-modal hash algorithm

Publications (1)

Publication Number Publication Date
CN117540039A true CN117540039A (en) 2024-02-09

Family

ID=89791289

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311514255.5A Pending CN117540039A (en) 2023-11-14 2023-11-14 Data retrieval method based on unsupervised cross-modal hash algorithm

Country Status (1)

Country Link
CN (1) CN117540039A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118093907A (en) * 2024-04-22 2024-05-28 山东建筑大学 Online Hash multimedia data cross-modal retrieval method and system integrating similarity
CN118093907B (en) * 2024-04-22 2024-07-02 山东建筑大学 Online Hash multimedia data cross-modal retrieval method and system integrating similarity

Similar Documents

Publication Publication Date Title
CN111581405B (en) Cross-modal generalization zero sample retrieval method for generating confrontation network based on dual learning
Shi et al. Deep adaptively-enhanced hashing with discriminative similarity guidance for unsupervised cross-modal retrieval
WO2023065617A1 (en) Cross-modal retrieval system and method based on pre-training model and recall and ranking
CN112115253B (en) Depth text ordering method based on multi-view attention mechanism
CN111242033A (en) Video feature learning method based on discriminant analysis of video and character pairs
CN114896434B (en) Hash code generation method and device based on center similarity learning
CN117540039A (en) Data retrieval method based on unsupervised cross-modal hash algorithm
CN116796251A (en) Poor website classification method, system and equipment based on image-text multi-mode
Zhu et al. Multi-attention based semantic deep hashing for cross-modal retrieval
CN113094534A (en) Multi-mode image-text recommendation method and device based on deep learning
CN113239159A (en) Cross-modal retrieval method of videos and texts based on relational inference network
CN116561305A (en) False news detection method based on multiple modes and transformers
CN116610818A (en) Construction method and system of power transmission and transformation project knowledge base
CN117725261A (en) Cross-modal retrieval method, device, equipment and medium for video text
Jia et al. Semantic association enhancement transformer with relative position for image captioning
CN112950414A (en) Legal text representation method based on decoupling legal elements
Perdana et al. Instance-based deep transfer learning on cross-domain image captioning
CN115840827B (en) Deep unsupervised cross-modal hash retrieval method
CN116956128A (en) Hypergraph-based multi-mode multi-label classification method and system
CN116756363A (en) Strong-correlation non-supervision cross-modal retrieval method guided by information quantity
CN116109980A (en) Action recognition method based on video text matching
CN115659242A (en) Multimode emotion classification method based on mode enhanced convolution graph
CN113297485B (en) Method for generating cross-modal representation vector and cross-modal recommendation method
Wu et al. A self-relevant cnn-svm model for problem classification in k-12 question-driven learning
CN114842301A (en) Semi-supervised training method of image annotation model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination