CN115840827A - Deep unsupervised cross-modal Hash retrieval method - Google Patents

Deep unsupervised cross-modal Hash retrieval method

Info

Publication number
CN115840827A
Authority
CN
China
Prior art keywords
modal
semantic
text
information
attention
Prior art date
Legal status
Granted
Application number
CN202211382234.8A
Other languages
Chinese (zh)
Other versions
CN115840827B (en)
Inventor
李明勇
马龙飞
Current Assignee
Chongqing Normal University
Original Assignee
Chongqing Normal University
Priority date
Filing date
Publication date
Application filed by Chongqing Normal University
Priority to CN202211382234.8A
Publication of CN115840827A
Application granted
Publication of CN115840827B
Legal status: Active
Anticipated expiration

Classifications

    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a deep unsupervised cross-modal hash retrieval method, relating to the technical field of cross-modal hash retrieval. The text is treated as graph-structured data and text features are converted into node information in the graph; the sparse text features are further fused using a GAT network, in which an attention scoring mechanism fuses the information of related neighbor nodes with the original node. The attention scores also represent how closely the feature words are connected: the higher the score, the closer the relation. A self-encoder performs feature encoding and feature decoding on the extracted modality features. By introducing the graph attention network into the field of cross-modal hash retrieval, the text modality, whose semantic information is insufficient, can be mined in depth and its high-level feature representation enriched; CLIP is adopted as the visual encoder of the image modality to extract semantic features of finer granularity.

Description

Deep unsupervised cross-modal Hash retrieval method
Technical Field
The invention relates to the technical field of cross-modal hash retrieval, in particular to a deep unsupervised cross-modal hash retrieval method.
Background
With the rapid development of the internet and social networks, multimedia data such as visual and textual content is growing dramatically, and retrieving such data effectively is a great challenge. The purpose of cross-modal retrieval is to use one modality to retrieve heterogeneous modality data with similar semantic features, and hashing methods are widely used in retrieval tasks to improve storage and computation efficiency. Cross-modal hashing methods attempt to represent heterogeneous modality data as compact binary codes while preserving the semantic similarity among data of different modalities in a common feature space.
cross-mode hashing methods can be divided into two broad categories: supervised cross-modality hashing methods and unsupervised cross-modality hashing methods. Common supervised hashing methods have shown significant performance. However, a common drawback of the above unsupervised methods is that vision-
Co-occurrence information inherent in the text pair is easily ignored. Due to the lack of guidance of label information, the method is easy to be ignored in the high-level semantic feature extraction process. Further, the unsupervised model cannot accurately capture semantic relation between different modal data, so that the retrieval accuracy is not high;
furthermore, most existing cross-modality methods focus on semantic feature alignment between different modality data. The method simplifies the correlation between the modal internal semantic reconstruction features and the original features, omits coherent interaction between one modal feature vector and another modal vector and generated hash codes, and blocks the semantic interaction between the modal internal features and the reconstruction features and between the features and homogeneous and heterogeneous hash codes, so that the cross-modal hash retrieval cannot be perfectly compatible. Meanwhile, the inherent problem of modal gap in deep semantic interaction cannot be noticed, the alignment of modal characteristics with hash codes of homogeneous data and hash codes of heterogeneous data cannot be closed, and the retrieval result cannot reach the optimal solution.
Therefore, a deep unsupervised cross-modal hash retrieval method is provided to solve the problems.
Disclosure of Invention
The invention provides a deep unsupervised cross-modal Hash retrieval method, which solves the technical problem that the retrieval result cannot reach the optimal solution in the prior art.
In order to solve the technical problem, the invention provides a deep unsupervised cross-modal hash retrieval method, which comprises the following steps:
S1, designing a modal cyclic interaction method in which the features of one modality are reconstructed to perform semantic alignment with the features of the other modality, so that the semantic similarity relations within and between modalities are considered comprehensively;
S2, introducing the graph attention network into cross-modal retrieval and considering the influence of text neighbor-node information;
S3, introducing an attention mechanism so as to capture the global features of the text modality;
S4, performing fine-grained extraction of image features using the CLIP visual encoder;
and S5, learning hash codes through a hash function, so that query points are converted into hash codes and the corresponding modal information can be retrieved more quickly.
Preferably, S2 is specifically:
S2.1, representing text features as node information in a graph, constructing an adjacency matrix from the graph topology through the associations between the aggregated nodes, and fusing the information of each node and its neighbor nodes into a new node;
S2.2, since attention has shown state-of-the-art performance in the NLP and CV fields, introducing an attention mechanism into the graph neural network, assigning an attention score to each node through the attention algorithm and then fusing the information of different nodes, where feature words with low relevance receive low scores and feature words with high relevance receive high attention scores;
and S2.3, finally, fusing this information to strengthen the influence of the different feature words on the nodes.
Preferably, in S1, a dual-stream model is adopted to perform semantic feature extraction on the data information of the different modalities.
Preferably, S1 specifically is:
s1.1, compressing high-level semantic features into low-level semantic representations by using an Auto-Encoder, and reconstructing the low-level semantic features into features of heterogeneous data;
s1.2, inputting the characteristics of the image into a decoder, and mapping the obtained semantic information to a characteristic space of a text through the decoder to realize semantic alignment between the modes;
s1.3, after the reconstruction characteristics of the heterogeneous data are obtained, semantic alignment is carried out on the original image characteristics and the characteristics of the text reconstructed by the decoder, and the high-dimensional characteristics and the characteristics obtained by the encoder are also aligned once, so that the semantic alignment in the mode is realized.
Preferably, in S1, after the semantic features obtained by one modality through the feature extractor are decoded by the self-encoder, the semantic features are mapped to a semantic space corresponding to another modality, and the intra-modality constraint is performed in the same modality.
Compared with the related technology, the deep unsupervised cross-modal hash retrieval method provided by the invention has the following beneficial effects:
in the invention, a text is considered as data of a graph structure, text features are converted into node information in the graph, sparse text features are further fused by using a GAT network, related neighbor node information is fused with an original node by an attention scoring mechanism, attention scores also represent the closeness degree of connection among feature words, the higher the scores are, the closer the relation is, and feature coding and feature decoding are carried out on the extracted modal features by adopting an auto-encoder.
Drawings
FIG. 1 is a schematic flow chart of a deep unsupervised cross-modal Hash retrieval method;
Detailed Description
The following describes embodiments of the present invention in detail.
In the description of the present invention, it should be understood that if the terms "center", "longitudinal", "lateral", "length", "width", "thickness", "upper", "lower", "front", "rear", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", "outer", "clockwise", "counterclockwise", etc. are used, they refer to the orientation or positional relationship shown in the drawings and are used only for convenience and simplicity of description; they do not indicate or imply that the device or element referred to must have a particular orientation or be constructed and operated in a particular orientation, and thus should not be construed as limiting the present invention. Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. In the description of the present invention, "a plurality" means two or more unless specifically defined otherwise.
In the description of the present invention, it should be noted that, unless otherwise explicitly stated or limited, the terms "mounted", "connected", and "coupled" are to be construed broadly and may mean, for example, fixedly connected, detachably connected, or integrally connected; the connection may be mechanical or electrical; it may be a direct connection, an indirect connection through an intervening medium, or an internal communication or interaction between two elements. The specific meanings of the above terms in the present invention can be understood by those of ordinary skill in the art according to the specific situation.
In a first embodiment, as shown in fig. 1, a deep unsupervised cross-modal hash retrieval method includes the following steps:
S1, designing a modal cyclic interaction method in which the features of one modality are reconstructed to perform semantic alignment with the features of the other modality, so that the semantic similarity relations within and between modalities are considered comprehensively;
s1.1, compressing high-level semantic features into low-level semantic representations by using an Auto-Encoder, and reconstructing the low-level semantic features into features of heterogeneous data;
s1.2, inputting the characteristics of the image into a decoder, and mapping the obtained semantic information to a characteristic space of a text through the decoder to realize semantic alignment between the modes;
s1.3, after the reconstruction characteristics of the heterogeneous data are obtained, semantic alignment is carried out on the original image characteristics and the characteristics of the text reconstructed by the decoder, and the high-dimensional characteristics and the characteristics obtained by the encoder are also aligned once, so that the semantic alignment in the mode is realized.
The method specifically comprises the following steps: acquiring m image-text pairs;
the data set is defined as O = {o_k}, k = 1, ..., m;
I_i denotes the i-th original image datum and T_j denotes the j-th original text datum, so that each image-text pair instance can be denoted o_k = (I_k, T_k); the feature representation form is denoted F;
the semantic features extracted by the image feature encoder are denoted F_I, with F_I ∈ R^{m×D_I}, where D_I denotes the dimensionality of the high-level image features obtained from the original image by the image encoder;
the feature representation obtained after the text passes through the text encoder is defined as F_T ∈ R^{m×D_T}, where D_T denotes the dimensionality of the high-level text features and m is the number of sample instance points;
the hash code representation is defined as B_* ∈ {-1, +1}^{m×c}, * ∈ {I, T}, where c denotes the length of the hash code, and the hash code of the i-th original data item in B_* is denoted b_{*,i}. cos(·,·) is defined as the pairwise cosine similarity function, and sign(·) is the element-wise sign function, defined as sign(x) = +1 if x ≥ 0 and sign(x) = -1 otherwise;
||·|| denotes the ℓ2 norm, and the Frobenius norm is used for vectors and matrices;
the CLIP-based cyclic alignment hashing for unsupervised visual-text retrieval contains three modules in total: a feature extraction module, a cyclic semantic alignment module, and a hash code learning module.
S2, introducing the graph attention network into cross-modal retrieval, and considering the influence of the text neighbor node information.
S2.1, expressing node information in a graph mode, converting a graph topological structure into a constructed adjacency matrix through the association between aggregation nodes and nodes, and fusing the information of each node and the neighbor nodes of the node into a new node;
s2.2, with the attention showing advanced execution effects in the NLP and CV fields, introducing an attention mechanism into a graph network, introducing the attention mechanism into a graph neural network, giving each node an attention score through an attention algorithm, and then performing information fusion on different nodes, wherein the score of the feature word with low relevance is low, and the feature word with high relevance can obtain a high attention score;
and S2.3, finally, fusing the information to strengthen the influence of different feature words on the nodes.
S3, introducing an attention mechanism and capturing the global features of the text modality.
The attention mechanism is introduced into the graph neural network: an attention score is assigned to each node by the attention algorithm, and the information of different nodes is then fused. Feature words with weak relevance receive low scores, while feature words with strong relevance receive high attention scores; fusing this information strengthens the influence of the different feature words on the nodes, so that semantic information can be extracted better.
Since each text is a 1386-dimensional feature vector, these features are taken as node data, and each text can be represented as f_i ∈ R^{1386}. To obtain sufficient expressive power, the input features are transformed into higher-level features through a learnable weight matrix W ∈ R^{f×f}, after which a self-attention operation is performed on the nodes:
e_ij = LeakyReLU(a^T [W f_i || W f_j])
where a is the attention coefficient vector, T denotes the transpose, || denotes concatenation, and e_ij represents the importance of node j to node i; e_ij is computed for every neighbor node of node i. To make the coefficients comparable across different nodes, all neighbor nodes are normalized with the softmax function:
α_ij = softmax_j(e_ij) = exp(e_ij) / Σ_{k∈N_i} exp(e_ik)
where N_i is the neighborhood of node i. Applying this operation to all nodes turns the node information of the adjacency matrix into new node vectors that contain the attention features of all neighbor nodes, and the semantic information lacking in the text modality is fused by weighting, so that the text modality obtains a stronger representation. A graph attention network is a method that uses the attention mechanism to fuse the feature words associated with a given feature word; weighted fusion with the attention mechanism yields a new semantic feature representation containing neighbor-node information.
And S4, extracting fine granularity of the image features by using a CLIP visual encoder.
Different modality encoders are designed for data of different modalities. Compared with the text modality, the image modality contains richer semantic information; a single-stream model cannot close the inherent modality gap between modalities, cannot perform optimal feature extraction for each modality, and has limited ability to mine semantically consistent information from heterogeneous data. A dual-stream model is therefore adopted to extract semantic features from the data of each modality, and it shows a good effect throughout the training stage.
Image feature extraction: the contrastive learning approach of unsupervised learning is adopted and trained on a large-scale data set; compared with ViT [5], it achieves good results on multiple data sets. A pre-trained CLIP model is used in the model as the feature extractor for the image modality. The image branch uses the image encoder of CLIP: the original image is input into the pre-trained CLIP image encoder, and a 1024-dimensional high-level semantic vector is extracted, which can be defined as F_I ∈ R^{m×1024}.
Text feature extraction: considering that text modality data contains less high-level semantic information than image data, while text semantics carries contextual relations, the feature words of a text are treated as node information in a graph. The text modality uses a graph attention network (GAT [24]) to extract the semantic information of the text. GAT takes the text feature words as nodes; to obtain stronger expressive power, the input features are transformed into higher-level features, an attention mechanism is introduced, self-attention is performed on the nodes to obtain the attention weight coefficients between nodes, and weighted summation over the surrounding neighbor nodes yields information that aggregates all surrounding nodes, which better reflects the relations in the text information. By treating the text as a graph structure, a graph and its adjacency matrix are obtained; the information in the adjacency matrix indicates the relations between feature words, and weighting the features allows the semantic representation of the text to be handled better. The original text information is characterized as F_T ∈ R^{m×1386}.
For simplicity, the feature extractor is defined as F, and the mathematical notation of each modal feature extractor is defined as follows:
F_I = F(I; θ_I)
F_T = F(T; θ_T)
where I and T are the original image and text, and θ_I and θ_T are the parameters of the feature extractors. In this way, high-level representation features with rich semantics can be extracted for each modality, and these features can be used to fully mine the semantic relations among the data and further guide modality alignment and hash code learning.
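As an illustrative sketch of the dual-stream feature extractors F(I; θ_I) and F(T; θ_T), the snippet below assumes OpenAI's open-source clip package (its RN50 image encoder outputs 1024-dimensional embeddings, matching F_I ∈ R^{m×1024}) and reuses the GraphAttentionLayer sketched earlier for the text branch; the pooling step and layer sizes are assumptions, not the claimed design.

```python
import torch
import clip  # assumption: OpenAI's CLIP package (pip install git+https://github.com/openai/CLIP.git)

device = "cuda" if torch.cuda.is_available() else "cpu"

# Image branch F(I; theta_I): a pre-trained CLIP image encoder.
# The RN50 variant yields 1024-dimensional embeddings, matching F_I in R^{m x 1024}.
clip_model, preprocess = clip.load("RN50", device=device)

@torch.no_grad()
def extract_image_features(pil_images):
    batch = torch.stack([preprocess(img) for img in pil_images]).to(device)
    return clip_model.encode_image(batch).float()          # (m, 1024)

# Text branch F(T; theta_T): feature words as graph nodes fed to a graph attention layer
# (GraphAttentionLayer is the sketch given earlier; 1386 is the text feature size used above).
text_gat = GraphAttentionLayer(1386, 1386).to(device)

def extract_text_features(word_features, adj):
    # word_features: (num_words, 1386); adj: word-word adjacency; mean-pool nodes into one text vector
    return text_gat(word_features.to(device), adj.to(device)).mean(dim=0)
```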
To facilitate the alignment of high-level semantic feature representations within and between modalities, cyclic semantic alignment is proposed to pull semantically similar image-text pairs closer together in a common representation space and to push dissimilar pairs farther apart in that space. Intra-modality and inter-modality loss measures are employed to further align text and images. The function that compresses the high-level semantic representations is defined as follows:
TV_I = Enc(F_I; δ_I), TV_I ∈ R^{m×c}
TV_T = Enc(F_T; δ_T), TV_T ∈ R^{m×c}
where F_* denotes the original features of the images and texts, δ_* denotes the parameters of Enc(·; ·) for each modality, and * ∈ {I, T}.
After the high-level semantic features extracted by the feature extractor are compressed by the encoder, truth-value semantic representations with very strong representational capability, containing the high-level semantic features, are obtained; these are then reconstructed by a decoder into representations of the heterogeneous data, defined as:
F_{I,T} = Dec(TV_I), F_{I,T} ∈ R^{m×D_T}
F_{T,I} = Dec(TV_T), F_{T,I} ∈ R^{m×D_I}
the features of the image (text) are input into a decoder, and the obtained semantic information is mapped to the feature space of the text (image) through the decoder, so that semantic alignment among the modalities is realized. After the reconstruction characteristics of the heterogeneous data are obtained, in order to promote cross-mode information interaction, semantic alignment is carried out on the original image characteristics and the characteristics of the text reconstructed by a decoder. In order to ensure that the obtained compressed feature vector can represent the original high-dimensional feature characterization, the high-dimensional feature and the feature obtained by the encoder are aligned once, so that semantic alignment in the modality is realized.
Inter-modality: in order to promote information interaction between different data and realize cross-modal semantic interaction, the semantic features obtained by one modality through its feature extractor are decoded by the self-encoder and then mapped into the semantic space corresponding to the other modality; F_{T,I} denotes the vector representation obtained by mapping text features into the image feature space, and F_{I,T} denotes the vector representation obtained by mapping image features into the text feature space. The cross-modal semantic feature matrices are constructed as
Figure SMS_9
Figure SMS_10
Alignment of different modality types is achieved by minimizing cross-modality semantic loss, the loss function being as follows:
Figure SMS_11
Figure SMS_12
L_C_inter = L_C_inter1 + L_C_inter2
can fully utilize high-level semantic feature representation between two modes to carry out cross-mode alignment, and minimize L through calculation F To achieve cross-modal heterogeneous data alignment.
In-mode: in order to ensure representativeness of the semantic information in the modality and reduce semantic feature loss, the modality internal constraint is carried out under the same modality, the features extracted from the original image and the high-level semantic representation coded by the coder are also subjected to feature alignment, and L is minimized intra The representativeness and the integrity of high-level semantic information in the modality are ensured. Constructing an image modality feature matrix as
Figure SMS_13
and the hidden-state features obtained through the self-encoder are denoted S_Enc-I. The text features are likewise taken as
Figure SMS_14
and the corresponding hidden-state features are denoted S_Enc-T. Define:
L_C_intra = Σ ||S_F* - S_Enc-*||^2, * ∈ {I, T}
therefore, a semantic alignment method with intra-modality and inter-modality is constructed, the high-level semantic representation extracted by the image encoder and the text encoder is aligned with the compressed semantic features of the features after passing through the encoder, so that the semantic alignment inside the modality is realized, the high-latitude modal data can be reduced through a small amount of high-level features, the heterogeneous data is aligned with the original modality features through the mapping of the decoder, the information interaction of the cross-modality data is realized, and the alignment between the intra-modality and the inter-modality is realized. The loss of modal alignment is defined as:
L_C = L_C_inter + L_C_intra
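One plausible reading of the intra- and inter-modality alignment losses is sketched below: the intra-modality term follows the stated form L_C_intra = Σ ||S_F* - S_Enc-*||^2 using cosine-similarity matrices, while the inter-modality term, given only as figures in the original, is assumed here to align in the same way the similarity structure of each modality's original features with that of the features reconstructed from the other modality.

```python
import torch

def cos_matrix(A, B):
    """Pairwise cosine-similarity matrix between the rows of A and B."""
    A = torch.nn.functional.normalize(A, dim=1)
    B = torch.nn.functional.normalize(B, dim=1)
    return A @ B.t()

def intra_modal_loss(F_I, TV_I, F_T, TV_T):
    # L_C_intra = sum_* || S_F* - S_Enc-* ||^2, * in {I, T}
    loss_i = (cos_matrix(F_I, F_I) - cos_matrix(TV_I, TV_I)).pow(2).sum()
    loss_t = (cos_matrix(F_T, F_T) - cos_matrix(TV_T, TV_T)).pow(2).sum()
    return loss_i + loss_t

def inter_modal_loss(F_I, F_TI, F_T, F_IT):
    # assumed form: align each modality's similarity structure with that of the features
    # reconstructed from the other modality (F_{T,I} lives in image space, F_{I,T} in text space)
    loss1 = (cos_matrix(F_I, F_I) - cos_matrix(F_TI, F_TI)).pow(2).sum()
    loss2 = (cos_matrix(F_T, F_T) - cos_matrix(F_IT, F_IT)).pow(2).sum()
    return loss1 + loss2

# L_C = L_C_inter + L_C_intra
F_I, F_T = torch.randn(8, 1024), torch.randn(8, 1386)
TV_I, TV_T = torch.randn(8, 64), torch.randn(8, 64)
F_IT, F_TI = torch.randn(8, 1386), torch.randn(8, 1024)
L_C = inter_modal_loss(F_I, F_TI, F_T, F_IT) + intra_modal_loss(F_I, TV_I, F_T, TV_T)
```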
and S5, learning the Hash codes through a Hash function, converting the query points into the Hash codes and detecting corresponding transverse information more quickly.
After feature extraction and cyclic semantic alignment, the semantic information of texts and images can be extracted with good quality and linked to each other. The goal in the cross-modal retrieval field is to bring semantically similar heterogeneous data closer together; according to a defined similarity measure, data samples semantically related to a query point of one modality are searched for in the data set of the other modality. By converting the query point into a hash code, the corresponding modal information can be retrieved more quickly. Through the mapping of the self-encoder, the high-dimensional feature codes corresponding to each modality can be fully extracted in the training stage. The hash codes are mapped from the feature vectors generated by the self-encoder; owing to the feature extraction and semantic reconstruction operations, the hash codes are constructed from the truth-value features and generated through the tanh(·) function. The pairwise cosine similarity matrix is calculated and defined as follows.
The generated hash matrices are represented as follows: the hash matrix of the text modality is denoted B_T ∈ {-1, +1}^{m×c}, the hash matrix of the image modality is denoted B_I ∈ {-1, +1}^{m×c}, and the similarity matrix of the image-text pairs is defined as S^{IT} ∈ R^{m×m}, obtained by cosine similarity calculation over the matrix elements:
S^{IT}_ij = cos(b_{I,i}, b_{T,j})
In order to make full use of the semantic information jointly described by image-text pairs, a cross-modal hash-code similarity matrix is constructed. Co-occurring image-text pairs have the most similar labels or categories compared with other modality data, and the diagonal elements should be as close to 1 as possible; the hash codes of the image-text pairs are therefore computed and the loss over co-occurring instances is minimized:
Figure SMS_20
For the other elements, a similarity loss is used to bridge the connections between different modalities: the similarity of the same image-text pair should be independent of its position and related only to the feature information, and the semantics of the image-text pairs are drawn together by minimizing this loss, defined as:
Figure SMS_21
about
Figure SMS_22
The total loss was:
Figure SMS_23
matrix obtained by Hash function mapping
Figure SMS_24
and
Figure SMS_25
and a new matrix is introduced,
Figure SMS_26
which is a matrix formed from the original image-text labels; labels are not used for guidance during the training stage, and the label information is introduced mainly to compute the hash loss.
Although the hashing method can speed up the retrieval process, mapping the truth-value features to hash codes still causes partial information loss, resulting in a sub-optimal retrieval solution. In hash code learning, attention is also paid to the semantic relations among data of different modalities, and the similarity information of the different modalities is the central task of cross-modal retrieval. On this basis, the features within a single modality are aligned with the generated hash codes, which ensures that the generated hash codes represent the original data information more faithfully. k is a modality adjustment parameter that allows semantic similarity to be preserved more flexibly.
Figure SMS_27
A joint feature matrix is constructed: the text feature matrix and the image feature matrix are integrated by weighting and represented by a single common matrix
Figure SMS_28
where the images and texts are processed jointly through a weighting coefficient α.
Figure SMS_29
Our hash encoding is optimized based on matrix alignment.
Figure SMS_30
The total loss of hash coding is:
L_H = L_H_inter + βL_H_intra
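A sketch of the hash-code learning step is given below: real-valued codes are produced with tanh during training, binarised with sign(·) for retrieval, and a cross-modal similarity loss pulls the diagonal of cos(B_I, B_T), i.e. the co-occurring image-text pairs, toward 1. The exact loss formulas are given only as figures in the original, so the on-diagonal and off-diagonal penalties and their sum below are assumptions.

```python
import torch

def pairwise_cosine(A, B):
    A = torch.nn.functional.normalize(A, dim=1)
    B = torch.nn.functional.normalize(B, dim=1)
    return A @ B.t()

def relaxed_hash_codes(tv):
    """Training-time relaxation with tanh; sign(.) yields the final {-1, +1} codes."""
    return torch.tanh(tv)

def hash_similarity_loss(H_I, H_T):
    # S^{IT}_{ij} = cos(b_{I,i}, b_{T,j}); co-occurring pairs sit on the diagonal
    S = pairwise_cosine(H_I, H_T)
    diag = torch.diagonal(S)
    on_diag = (diag - 1.0).pow(2).sum()               # pull matching pairs toward similarity 1
    off_diag = (S - torch.diag(diag)).pow(2).sum()    # assumed penalty on non-matching pairs
    return on_diag + off_diag

TV_I, TV_T = torch.randn(8, 64), torch.randn(8, 64)
H_I, H_T = relaxed_hash_codes(TV_I), relaxed_hash_codes(TV_T)
L_H_inter = hash_similarity_loss(H_I, H_T)

B_I, B_T = torch.sign(H_I), torch.sign(H_T)           # final binary codes; exact zeros would be
                                                       # mapped to +1 as in sign_pm1 above
```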
although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that changes, modifications, substitutions and alterations can be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims (5)

1. A deep unsupervised cross-modal Hash retrieval method is characterized by comprising the following steps:
S1, designing a modal cyclic interaction method in which the features of one modality are reconstructed to perform semantic alignment with the features of the other modality, so that the semantic similarity relations within and between modalities are considered comprehensively;
S2, introducing the graph attention network into cross-modal retrieval and considering the influence of text neighbor-node information;
S3, introducing an attention mechanism so as to capture the global features of the text modality;
S4, performing fine-grained extraction of image features using the CLIP visual encoder;
and S5, learning hash codes through a hash function, so that query points are converted into hash codes and the corresponding modal information can be retrieved more quickly.
2. The deep unsupervised cross-modal hash retrieval method of claim 1, wherein: the S2 specifically comprises the following steps:
S2.1, representing text features as node information in a graph, constructing an adjacency matrix from the graph topology through the associations between the aggregated nodes, and fusing the information of each node and its neighbor nodes into a new node;
S2.2, since attention has shown state-of-the-art performance in the NLP and CV fields, introducing an attention mechanism into the graph neural network, assigning an attention score to each node through the attention algorithm and then fusing the information of different nodes, where feature words with low relevance receive low scores and feature words with high relevance receive high attention scores;
and S2.3, finally, fusing this information to strengthen the influence of the different feature words on the nodes.
3. The deep unsupervised cross-modal hash retrieval method of claim 1, wherein: in S1, a dual-stream model is adopted to perform semantic feature extraction on the data information of the different modalities.
4. The deep unsupervised cross-modal hash retrieval method according to claim 1, wherein S1 specifically is:
s1.1, compressing high-level semantic features into low-level semantic representations by using an Auto-Encoder, and reconstructing the low-level semantic features into features of heterogeneous data;
s1.2, inputting the characteristics of the image into a decoder, and mapping the obtained semantic information to a characteristic space of a text through the decoder to realize semantic alignment between the modes;
s1.3, after the reconstruction characteristics of the heterogeneous data are obtained, semantic alignment is carried out on the original image characteristics and the characteristics of the text reconstructed by the decoder, and the high-dimensional characteristics and the characteristics obtained by the encoder are also aligned once, so that the semantic alignment in the mode is realized.
5. The deep unsupervised cross-modal hash retrieval method of claim 4, wherein: in the S1, semantic features obtained by one mode through the feature extractor are mapped to a semantic space corresponding to the other mode after being decoded by the self-encoder, and intra-mode constraint is also performed in the same mode.
CN202211382234.8A 2022-11-07 2022-11-07 Deep unsupervised cross-modal hash retrieval method Active CN115840827B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211382234.8A CN115840827B (en) 2022-11-07 2022-11-07 Deep unsupervised cross-modal hash retrieval method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211382234.8A CN115840827B (en) 2022-11-07 2022-11-07 Deep unsupervised cross-modal hash retrieval method

Publications (2)

Publication Number Publication Date
CN115840827A true CN115840827A (en) 2023-03-24
CN115840827B CN115840827B (en) 2023-09-19

Family

ID=85576906

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211382234.8A Active CN115840827B (en) 2022-11-07 2022-11-07 Deep unsupervised cross-modal hash retrieval method

Country Status (1)

Country Link
CN (1) CN115840827B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116796032A (en) * 2023-04-11 2023-09-22 Chongqing Normal University Multi-modal data retrieval model based on adaptive graph attention hashing

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019148898A1 (en) * 2018-02-01 2019-08-08 北京大学深圳研究生院 Adversarial cross-media retrieving method based on restricted text space
CN109299216A (en) * 2018-10-29 2019-02-01 山东师范大学 Cross-modal hash retrieval method and system fusing supervision information
CN113971209A (en) * 2021-12-22 2022-01-25 松立控股集团股份有限公司 Unsupervised cross-modal retrieval method based on attention mechanism enhancement
CN115203442A (en) * 2022-09-15 2022-10-18 中国海洋大学 Cross-modal deep hash retrieval method, system and medium based on joint attention

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
樊花; 陈华辉: "Research progress on cross-modal retrieval based on hashing methods", 数据通信, no. 03 *

Also Published As

Publication number Publication date
CN115840827B (en) 2023-09-19

Similar Documents

Publication Publication Date Title
CN111581961B (en) Automatic description method for image content constructed by Chinese visual vocabulary
CN113971209B (en) Non-supervision cross-modal retrieval method based on attention mechanism enhancement
Shi et al. Deep adaptively-enhanced hashing with discriminative similarity guidance for unsupervised cross-modal retrieval
Han et al. Fine-grained cross-modal alignment network for text-video retrieval
CN111709518A (en) Method for enhancing network representation learning based on community perception and relationship attention
Zeng et al. Tag-assisted multimodal sentiment analysis under uncertain missing modalities
CN115687571B (en) Depth unsupervised cross-modal retrieval method based on modal fusion reconstruction hash
CN116702091B (en) Multi-mode ironic intention recognition method, device and equipment based on multi-view CLIP
Zhao et al. Videowhisper: Toward discriminative unsupervised video feature learning with attention-based recurrent neural networks
CN115017299A (en) Unsupervised social media summarization method based on de-noised image self-encoder
Peng et al. Multilevel hierarchical network with multiscale sampling for video question answering
CN116796251A (en) Poor website classification method, system and equipment based on image-text multi-mode
CN111368176B (en) Cross-modal hash retrieval method and system based on supervision semantic coupling consistency
Zhu et al. Multi-attention based semantic deep hashing for cross-modal retrieval
Amara et al. Cross-network representation learning for anchor users on multiplex heterogeneous social network
CN115840827A (en) Deep unsupervised cross-modal Hash retrieval method
Luo et al. Collaborative learning for extremely low bit asymmetric hashing
CN116610818A (en) Construction method and system of power transmission and transformation project knowledge base
CN111428518B (en) Low-frequency word translation method and device
CN117036833B (en) Video classification method, apparatus, device and computer readable storage medium
CN111680190B (en) Video thumbnail recommendation method integrating visual semantic information
Yu et al. Self-attentive clip hashing for unsupervised cross-modal retrieval
CN116956128A (en) Hypergraph-based multi-mode multi-label classification method and system
Wang et al. Incomplete multimodality-diffused emotion recognition
Cai et al. Deep learning approaches on multimodal sentiment analysis

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant