CN115840827A - Deep unsupervised cross-modal Hash retrieval method - Google Patents

Deep unsupervised cross-modal Hash retrieval method

Info

Publication number
CN115840827A
Authority
CN
China
Prior art keywords
modal
semantic
text
information
attention
Prior art date
Legal status
Granted
Application number
CN202211382234.8A
Other languages
Chinese (zh)
Other versions
CN115840827B (en)
Inventor
李明勇
马龙飞
Current Assignee
Chongqing Normal University
Original Assignee
Chongqing Normal University
Priority date
Filing date
Publication date
Application filed by Chongqing Normal University
Priority to CN202211382234.8A
Publication of CN115840827A
Application granted
Publication of CN115840827B
Legal status: Active
Anticipated expiration

Classifications

    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a deep unsupervised cross-modal hash retrieval method, relating to the technical field of cross-modal hash retrieval. The text is treated as graph-structured data and text features are converted into node information in the graph; the sparse text features are further fused using a GAT network, in which an attention scoring mechanism fuses the information of related neighbor nodes with the original node. The attention scores also represent how closely the feature words are connected: the higher the score, the closer the relation. A self-encoder performs feature encoding and feature decoding on the extracted modality features. By introducing the graph attention network into the field of cross-modal hash retrieval, the text modality, whose semantic information is insufficient, can be mined in depth and its high-level feature representation enriched; CLIP is adopted as the visual encoder of the image modality to extract semantic features of finer granularity.

Description

Deep unsupervised cross-modal Hash retrieval method
Technical Field
The invention relates to the technical field of cross-modal hash retrieval, in particular to a deep unsupervised cross-modal hash retrieval method.
Background
With the rapid development of the internet and social networks, multimedia data such as visual and textual content is growing dramatically, and retrieving such data effectively is a great challenge. The purpose of cross-modal retrieval is to use one modality to retrieve heterogeneous modality data with similar semantic features, and hashing methods are widely used in retrieval tasks to improve storage and computation efficiency. Cross-modal hashing methods attempt to represent heterogeneous modality data as compact binary codes while preserving the semantic similarity among data of different modalities in a common feature space.
cross-mode hashing methods can be divided into two broad categories: supervised cross-modality hashing methods and unsupervised cross-modality hashing methods. Common supervised hashing methods have shown significant performance. However, a common drawback of the above unsupervised methods is that vision-
Co-occurrence information inherent in the text pair is easily ignored. Due to the lack of guidance of label information, the method is easy to be ignored in the high-level semantic feature extraction process. Further, the unsupervised model cannot accurately capture semantic relation between different modal data, so that the retrieval accuracy is not high;
furthermore, most existing cross-modality methods focus on semantic feature alignment between different modality data. The method simplifies the correlation between the modal internal semantic reconstruction features and the original features, omits coherent interaction between one modal feature vector and another modal vector and generated hash codes, and blocks the semantic interaction between the modal internal features and the reconstruction features and between the features and homogeneous and heterogeneous hash codes, so that the cross-modal hash retrieval cannot be perfectly compatible. Meanwhile, the inherent problem of modal gap in deep semantic interaction cannot be noticed, the alignment of modal characteristics with hash codes of homogeneous data and hash codes of heterogeneous data cannot be closed, and the retrieval result cannot reach the optimal solution.
Therefore, a deep unsupervised cross-modal hash retrieval method is provided to solve the problems.
Disclosure of Invention
The invention provides a deep unsupervised cross-modal Hash retrieval method, which solves the technical problem that the retrieval result cannot reach the optimal solution in the prior art.
In order to solve the technical problem, the invention provides a deep unsupervised cross-modal hash retrieval method, which comprises the following steps:
S1, designing a modal cyclic interaction method in which the features of one modality are reconstructed to perform semantic alignment with the features of the other modality, so that the semantic similarity relations within and between modalities are considered comprehensively;
S2, introducing the graph attention network into cross-modal retrieval and considering the influence of text neighbor-node information;
S3, introducing an attention mechanism so as to capture the global features of the text modality;
S4, performing fine-grained extraction of image features using the CLIP visual encoder;
and S5, learning hash codes through a hash function, so that query points are converted into hash codes and the corresponding modal information can be retrieved more quickly.
Preferably, S2 is specifically:
S2.1, representing text features as node information in a graph, constructing an adjacency matrix from the graph topology through the associations between the aggregated nodes, and fusing the information of each node and its neighbor nodes into a new node;
S2.2, since attention has shown state-of-the-art performance in the NLP and CV fields, introducing an attention mechanism into the graph neural network, assigning an attention score to each node through the attention algorithm and then fusing the information of different nodes, where feature words with low relevance receive low scores and feature words with high relevance receive high attention scores;
and S2.3, finally, fusing this information to strengthen the influence of the different feature words on the nodes.
Preferably, in S1, a dual-stream model is adopted to perform semantic feature extraction on the data information of the different modalities.
Preferably, S1 specifically is:
s1.1, compressing high-level semantic features into low-level semantic representations by using an Auto-Encoder, and reconstructing the low-level semantic features into features of heterogeneous data;
s1.2, inputting the characteristics of the image into a decoder, and mapping the obtained semantic information to a characteristic space of a text through the decoder to realize semantic alignment between the modes;
s1.3, after the reconstruction characteristics of the heterogeneous data are obtained, semantic alignment is carried out on the original image characteristics and the characteristics of the text reconstructed by the decoder, and the high-dimensional characteristics and the characteristics obtained by the encoder are also aligned once, so that the semantic alignment in the mode is realized.
Preferably, in S1, after the semantic features obtained by one modality through the feature extractor are decoded by the self-encoder, the semantic features are mapped to a semantic space corresponding to another modality, and the intra-modality constraint is performed in the same modality.
Compared with the related technology, the deep unsupervised cross-modal hash retrieval method provided by the invention has the following beneficial effects:
in the invention, a text is considered as data of a graph structure, text features are converted into node information in the graph, sparse text features are further fused by using a GAT network, related neighbor node information is fused with an original node by an attention scoring mechanism, attention scores also represent the closeness degree of connection among feature words, the higher the scores are, the closer the relation is, and feature coding and feature decoding are carried out on the extracted modal features by adopting an auto-encoder.
Drawings
FIG. 1 is a schematic flow chart of a deep unsupervised cross-modal Hash retrieval method;
Detailed Description
The following describes embodiments of the present invention in detail.
In the description of the present invention, it should be understood that if the terms "center", "longitudinal", "lateral", "length", "width", "thickness", "upper", "lower", "front", "rear", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", "outer", "clockwise", "counterclockwise", etc. are used, they refer to the orientation or positional relationship shown in the drawings and are used only for convenience and simplicity of description; they do not indicate or imply that the device or element referred to must have a particular orientation or be constructed and operated in a particular orientation, and thus should not be construed as limiting the present invention. Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. In the description of the present invention, "a plurality" means two or more unless specifically defined otherwise.
In the description of the present invention, it should be noted that, unless otherwise explicitly stated or limited, the terms "mounted", "connected", and "coupled" are to be construed broadly and may mean, for example, fixedly connected, detachably connected, or integrally connected; the connection may be mechanical or electrical; it may be a direct connection, an indirect connection through an intervening medium, or an internal communication or interaction between two elements. The specific meanings of the above terms in the present invention can be understood by those of ordinary skill in the art according to the specific situation.
In a first embodiment, as shown in fig. 1, a deep unsupervised cross-modal hash retrieval method includes the following steps:
S1, designing a modal cyclic interaction method in which the features of one modality are reconstructed to perform semantic alignment with the features of the other modality, so that the semantic similarity relations within and between modalities are considered comprehensively;
s1.1, compressing high-level semantic features into low-level semantic representations by using an Auto-Encoder, and reconstructing the low-level semantic features into features of heterogeneous data;
s1.2, inputting the characteristics of the image into a decoder, and mapping the obtained semantic information to a characteristic space of a text through the decoder to realize semantic alignment between the modes;
s1.3, after the reconstruction characteristics of the heterogeneous data are obtained, semantic alignment is carried out on the original image characteristics and the characteristics of the text reconstructed by the decoder, and the high-dimensional characteristics and the characteristics obtained by the encoder are also aligned once, so that the semantic alignment in the mode is realized.
The method specifically comprises the following steps: acquiring m image-text pairs;
the data set is defined as O = {o_k}, k = 1, ..., m;
I_i denotes the i-th original image datum and T_j denotes the j-th original text datum, so that each image-text pair instance can be denoted o_k = (I_k, T_k); the feature representation form is denoted F;
the semantic features extracted by the image feature encoder are denoted F_I, with F_I ∈ R^{m×D_I}, where D_I denotes the dimensionality of the high-level image features obtained from the original image by the image encoder;
the feature representation obtained after the text passes through the text encoder is defined as F_T ∈ R^{m×D_T}, where D_T denotes the dimensionality of the high-level text features and m is the number of sample instance points;
the hash code representation is defined as B_* ∈ {-1, +1}^{m×c}, * ∈ {I, T}, where c denotes the length of the hash code, and the hash code of the i-th original data item in B_* is denoted b_{*,i}. cos(·,·) is defined as the pairwise cosine similarity function, and sign(·) is the element-wise sign function, defined as sign(x) = +1 if x ≥ 0 and sign(x) = -1 otherwise;
||·|| denotes the ℓ2 norm, and the Frobenius norm is used for vectors and matrices;
the CLIP-based cyclic alignment hashing for unsupervised visual-text retrieval contains three modules in total: a feature extraction module, a cyclic semantic alignment module, and a hash code learning module.
S2, introducing the graph attention network into cross-modal retrieval, and considering the influence of the text neighbor node information.
S2.1, expressing node information in a graph mode, converting a graph topological structure into a constructed adjacency matrix through the association between aggregation nodes and nodes, and fusing the information of each node and the neighbor nodes of the node into a new node;
s2.2, with the attention showing advanced execution effects in the NLP and CV fields, introducing an attention mechanism into a graph network, introducing the attention mechanism into a graph neural network, giving each node an attention score through an attention algorithm, and then performing information fusion on different nodes, wherein the score of the feature word with low relevance is low, and the feature word with high relevance can obtain a high attention score;
and S2.3, finally, fusing the information to strengthen the influence of different feature words on the nodes.
S3, introducing an attention mechanism and capturing the global features of the text modality.
The attention mechanism is introduced into the graph neural network: an attention score is assigned to each node by the attention algorithm, and the information of different nodes is then fused. Feature words with weak relevance receive low scores, while feature words with strong relevance receive high attention scores; fusing this information strengthens the influence of the different feature words on the nodes, so that semantic information can be extracted better.
Since each text is a 1386-dimensional feature vector, these features are taken as node data, and each text can be represented as f_i ∈ R^{1386}. To obtain sufficient expressive power, the input features are transformed into higher-level features through a learnable weight matrix W ∈ R^{f×f}, after which a self-attention operation is performed on the nodes:
e_ij = LeakyReLU(a^T [W f_i || W f_j])
where a is the attention coefficient vector, T denotes the transpose, || denotes concatenation, and e_ij represents the importance of node j to node i; e_ij is computed for every neighbor node of node i. To make the coefficients comparable across different nodes, all neighbor nodes are normalized with the softmax function:
α_ij = softmax_j(e_ij) = exp(e_ij) / Σ_{k∈N_i} exp(e_ik)
where N_i is the neighborhood of node i. Applying this operation to all nodes turns the node information of the adjacency matrix into new node vectors that contain the attention features of all neighbor nodes, and the semantic information lacking in the text modality is fused by weighting, so that the text modality obtains a stronger representation. A graph attention network is a method that uses the attention mechanism to fuse the feature words associated with a given feature word; weighted fusion with the attention mechanism yields a new semantic feature representation containing neighbor-node information.
And S4, extracting fine granularity of the image features by using a CLIP visual encoder.
Different modality encoders are designed for data of different modalities. Compared with the text modality, the image modality contains richer semantic information; a single-stream model cannot close the inherent modality gap between modalities, cannot perform optimal feature extraction for each modality, and has limited ability to mine semantically consistent information from heterogeneous data. A dual-stream model is therefore adopted to extract semantic features from the data of each modality, and it shows a good effect throughout the training stage.
Image feature extraction: the contrastive learning approach of unsupervised learning is adopted and trained on a large-scale data set; compared with ViT [5], it achieves good results on multiple data sets. A pre-trained CLIP model is used in the model as the feature extractor for the image modality. The image branch uses the image encoder of CLIP: the original image is input into the pre-trained CLIP image encoder, and a 1024-dimensional high-level semantic vector is extracted, which can be defined as F_I ∈ R^{m×1024}.
Text feature extraction: considering that text modality data contains less high-level semantic information than image data, while text semantics carries contextual relations, the feature words of a text are treated as node information in a graph. The text modality uses a graph attention network (GAT [24]) to extract the semantic information of the text. GAT takes the text feature words as nodes; to obtain stronger expressive power, the input features are transformed into higher-level features, an attention mechanism is introduced, self-attention is performed on the nodes to obtain the attention weight coefficients between nodes, and weighted summation over the surrounding neighbor nodes yields information that aggregates all surrounding nodes, which better reflects the relations in the text information. By treating the text as a graph structure, a graph and its adjacency matrix are obtained; the information in the adjacency matrix indicates the relations between feature words, and weighting the features allows the semantic representation of the text to be handled better. The original text information is characterized as F_T ∈ R^{m×1386}.
For simplicity, the feature extractor is defined as F, and the mathematical notation of each modal feature extractor is defined as follows:
F_I = F(I; θ_I)
F_T = F(T; θ_T)
where I and T are the original image and text, and θ_I and θ_T are the parameters of the feature extractors. In this way, high-level representation features with rich semantics can be extracted for each modality, and these features can be used to fully mine the semantic relations among the data and further guide modality alignment and hash code learning.
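As an illustrative sketch of the dual-stream feature extractors F(I; θ_I) and F(T; θ_T), the snippet below assumes OpenAI's open-source clip package (its RN50 image encoder outputs 1024-dimensional embeddings, matching F_I ∈ R^{m×1024}) and reuses the GraphAttentionLayer sketched earlier for the text branch; the pooling step and layer sizes are assumptions, not the claimed design.

```python
import torch
import clip  # assumption: OpenAI's CLIP package (pip install git+https://github.com/openai/CLIP.git)

device = "cuda" if torch.cuda.is_available() else "cpu"

# Image branch F(I; theta_I): a pre-trained CLIP image encoder.
# The RN50 variant yields 1024-dimensional embeddings, matching F_I in R^{m x 1024}.
clip_model, preprocess = clip.load("RN50", device=device)

@torch.no_grad()
def extract_image_features(pil_images):
    batch = torch.stack([preprocess(img) for img in pil_images]).to(device)
    return clip_model.encode_image(batch).float()          # (m, 1024)

# Text branch F(T; theta_T): feature words as graph nodes fed to a graph attention layer
# (GraphAttentionLayer is the sketch given earlier; 1386 is the text feature size used above).
text_gat = GraphAttentionLayer(1386, 1386).to(device)

def extract_text_features(word_features, adj):
    # word_features: (num_words, 1386); adj: word-word adjacency; mean-pool nodes into one text vector
    return text_gat(word_features.to(device), adj.to(device)).mean(dim=0)
```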
To facilitate the alignment of high-level semantic feature representations within and between modalities, cyclic semantic alignment is proposed to pull semantically similar image-text pairs closer together in a common representation space and to push dissimilar pairs farther apart in that space. Intra-modality and inter-modality loss measures are employed to further align text and images. The function that compresses the high-level semantic representations is defined as follows:
TV_I = Enc(F_I; δ_I), TV_I ∈ R^{m×c}
TV_T = Enc(F_T; δ_T), TV_T ∈ R^{m×c}
where F_* denotes the original features of the images and texts, δ_* denotes the parameters of Enc(·; ·) for each modality, and * ∈ {I, T}.
After the high-level semantic features extracted by the feature extractor are compressed by the encoder, truth-value semantic representations with very strong representational capability, containing the high-level semantic features, are obtained; these are then reconstructed by a decoder into representations of the heterogeneous data, defined as:
F_{I,T} = Dec(TV_I), F_{I,T} ∈ R^{m×D_T}
F_{T,I} = Dec(TV_T), F_{T,I} ∈ R^{m×D_I}
the features of the image (text) are input into a decoder, and the obtained semantic information is mapped to the feature space of the text (image) through the decoder, so that semantic alignment among the modalities is realized. After the reconstruction characteristics of the heterogeneous data are obtained, in order to promote cross-mode information interaction, semantic alignment is carried out on the original image characteristics and the characteristics of the text reconstructed by a decoder. In order to ensure that the obtained compressed feature vector can represent the original high-dimensional feature characterization, the high-dimensional feature and the feature obtained by the encoder are aligned once, so that semantic alignment in the modality is realized.
Inter-modality: in order to promote information interaction between different data and realize cross-modal semantic interaction, the semantic features obtained by one modality through its feature extractor are decoded by the self-encoder and then mapped into the semantic space corresponding to the other modality; F_{T,I} denotes the vector representation obtained by mapping text features into the image feature space, and F_{I,T} denotes the vector representation obtained by mapping image features into the text feature space. The cross-modal semantic feature matrices are constructed as
Figure SMS_9
Figure SMS_10
Alignment of different modality types is achieved by minimizing cross-modality semantic loss, the loss function being as follows:
Figure SMS_11
Figure SMS_12
L_C_inter = L_C_inter1 + L_C_inter2
can fully utilize high-level semantic feature representation between two modes to carry out cross-mode alignment, and minimize L through calculation F To achieve cross-modal heterogeneous data alignment.
In-mode: in order to ensure representativeness of the semantic information in the modality and reduce semantic feature loss, the modality internal constraint is carried out under the same modality, the features extracted from the original image and the high-level semantic representation coded by the coder are also subjected to feature alignment, and L is minimized intra The representativeness and the integrity of high-level semantic information in the modality are ensured. Constructing an image modality feature matrix as
Figure SMS_13
and the hidden-state features obtained through the self-encoder are denoted S_Enc-I. The text features are likewise taken as
Figure SMS_14
and the corresponding hidden-state features are denoted S_Enc-T. Define:
L_C_intra = Σ ||S_F* - S_Enc-*||^2, * ∈ {I, T}
therefore, a semantic alignment method with intra-modality and inter-modality is constructed, the high-level semantic representation extracted by the image encoder and the text encoder is aligned with the compressed semantic features of the features after passing through the encoder, so that the semantic alignment inside the modality is realized, the high-latitude modal data can be reduced through a small amount of high-level features, the heterogeneous data is aligned with the original modality features through the mapping of the decoder, the information interaction of the cross-modality data is realized, and the alignment between the intra-modality and the inter-modality is realized. The loss of modal alignment is defined as:
L_C = L_C_inter + L_C_intra
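One plausible reading of the intra- and inter-modality alignment losses is sketched below: the intra-modality term follows the stated form L_C_intra = Σ ||S_F* - S_Enc-*||^2 using cosine-similarity matrices, while the inter-modality term, given only as figures in the original, is assumed here to align in the same way the similarity structure of each modality's original features with that of the features reconstructed from the other modality.

```python
import torch

def cos_matrix(A, B):
    """Pairwise cosine-similarity matrix between the rows of A and B."""
    A = torch.nn.functional.normalize(A, dim=1)
    B = torch.nn.functional.normalize(B, dim=1)
    return A @ B.t()

def intra_modal_loss(F_I, TV_I, F_T, TV_T):
    # L_C_intra = sum_* || S_F* - S_Enc-* ||^2, * in {I, T}
    loss_i = (cos_matrix(F_I, F_I) - cos_matrix(TV_I, TV_I)).pow(2).sum()
    loss_t = (cos_matrix(F_T, F_T) - cos_matrix(TV_T, TV_T)).pow(2).sum()
    return loss_i + loss_t

def inter_modal_loss(F_I, F_TI, F_T, F_IT):
    # assumed form: align each modality's similarity structure with that of the features
    # reconstructed from the other modality (F_{T,I} lives in image space, F_{I,T} in text space)
    loss1 = (cos_matrix(F_I, F_I) - cos_matrix(F_TI, F_TI)).pow(2).sum()
    loss2 = (cos_matrix(F_T, F_T) - cos_matrix(F_IT, F_IT)).pow(2).sum()
    return loss1 + loss2

# L_C = L_C_inter + L_C_intra
F_I, F_T = torch.randn(8, 1024), torch.randn(8, 1386)
TV_I, TV_T = torch.randn(8, 64), torch.randn(8, 64)
F_IT, F_TI = torch.randn(8, 1386), torch.randn(8, 1024)
L_C = inter_modal_loss(F_I, F_TI, F_T, F_IT) + intra_modal_loss(F_I, TV_I, F_T, TV_T)
```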
and S5, learning the Hash codes through a Hash function, converting the query points into the Hash codes and detecting corresponding transverse information more quickly.
After feature extraction and cyclic semantic alignment, the semantic information of texts and images can be extracted with good quality and linked to each other. The goal in the cross-modal retrieval field is to bring semantically similar heterogeneous data closer together; according to a defined similarity measure, data samples semantically related to a query point of one modality are searched for in the data set of the other modality. By converting the query point into a hash code, the corresponding modal information can be retrieved more quickly. Through the mapping of the self-encoder, the high-dimensional feature codes corresponding to each modality can be fully extracted in the training stage. The hash codes are mapped from the feature vectors generated by the self-encoder; owing to the feature extraction and semantic reconstruction operations, the hash codes are constructed from the truth-value features and generated through the tanh(·) function. The pairwise cosine similarity matrix is calculated and defined as follows.
The generated hash matrices are represented as follows: the hash matrix of the text modality is denoted B_T ∈ {-1, +1}^{m×c}, the hash matrix of the image modality is denoted B_I ∈ {-1, +1}^{m×c}, and the similarity matrix of the image-text pairs is defined as S^{IT} ∈ R^{m×m}, obtained by cosine similarity calculation over the matrix elements:
S^{IT}_ij = cos(b_{I,i}, b_{T,j})
In order to make full use of the semantic information jointly described by image-text pairs, a cross-modal hash-code similarity matrix is constructed. Co-occurring image-text pairs have the most similar labels or categories compared with other modality data, and the diagonal elements should be as close to 1 as possible; the hash codes of the image-text pairs are therefore computed and the loss over co-occurring instances is minimized:
Figure SMS_20
For the other elements, a similarity loss is used to bridge the connections between different modalities: the similarity of the same image-text pair should be independent of its position and related only to the feature information, and the semantics of the image-text pairs are drawn together by minimizing this loss, defined as:
Figure SMS_21
about
Figure SMS_22
The total loss was:
Figure SMS_23
matrix obtained by Hash function mapping
Figure SMS_24
and
Figure SMS_25
and a new matrix is introduced,
Figure SMS_26
which is a matrix formed from the original image-text labels; labels are not used for guidance during the training stage, and the label information is introduced mainly to compute the hash loss.
Although the hashing method can speed up the retrieval process, mapping the truth-value features to hash codes still causes partial information loss, resulting in a sub-optimal retrieval solution. In hash code learning, attention is also paid to the semantic relations among data of different modalities, and the similarity information of the different modalities is the central task of cross-modal retrieval. On this basis, the features within a single modality are aligned with the generated hash codes, which ensures that the generated hash codes represent the original data information more faithfully. k is a modality adjustment parameter that allows semantic similarity to be preserved more flexibly.
Figure SMS_27
A joint feature matrix is constructed: the text feature matrix and the image feature matrix are integrated by weighting and represented by a single common matrix
Figure SMS_28
where the images and texts are processed jointly through a weighting coefficient α.
Figure SMS_29
Our hash encoding is optimized based on matrix alignment.
Figure SMS_30
The total loss of hash coding is:
L_H = L_H_inter + βL_H_intra
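A sketch of the hash-code learning step is given below: real-valued codes are produced with tanh during training, binarised with sign(·) for retrieval, and a cross-modal similarity loss pulls the diagonal of cos(B_I, B_T), i.e. the co-occurring image-text pairs, toward 1. The exact loss formulas are given only as figures in the original, so the on-diagonal and off-diagonal penalties and their sum below are assumptions.

```python
import torch

def pairwise_cosine(A, B):
    A = torch.nn.functional.normalize(A, dim=1)
    B = torch.nn.functional.normalize(B, dim=1)
    return A @ B.t()

def relaxed_hash_codes(tv):
    """Training-time relaxation with tanh; sign(.) yields the final {-1, +1} codes."""
    return torch.tanh(tv)

def hash_similarity_loss(H_I, H_T):
    # S^{IT}_{ij} = cos(b_{I,i}, b_{T,j}); co-occurring pairs sit on the diagonal
    S = pairwise_cosine(H_I, H_T)
    diag = torch.diagonal(S)
    on_diag = (diag - 1.0).pow(2).sum()               # pull matching pairs toward similarity 1
    off_diag = (S - torch.diag(diag)).pow(2).sum()    # assumed penalty on non-matching pairs
    return on_diag + off_diag

TV_I, TV_T = torch.randn(8, 64), torch.randn(8, 64)
H_I, H_T = relaxed_hash_codes(TV_I), relaxed_hash_codes(TV_T)
L_H_inter = hash_similarity_loss(H_I, H_T)

B_I, B_T = torch.sign(H_I), torch.sign(H_T)           # final binary codes; exact zeros would be
                                                       # mapped to +1 as in sign_pm1 above
```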
although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that changes, modifications, substitutions and alterations can be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims (5)

1. A deep unsupervised cross-modal Hash retrieval method is characterized by comprising the following steps:
S1, designing a modal cyclic interaction method in which the features of one modality are reconstructed to perform semantic alignment with the features of the other modality, so that the semantic similarity relations within and between modalities are considered comprehensively;
S2, introducing the graph attention network into cross-modal retrieval and considering the influence of text neighbor-node information;
S3, introducing an attention mechanism so as to capture the global features of the text modality;
S4, performing fine-grained extraction of image features using the CLIP visual encoder;
and S5, learning hash codes through a hash function, so that query points are converted into hash codes and the corresponding modal information can be retrieved more quickly.
2. The deep unsupervised cross-modal hash retrieval method of claim 1, wherein: the S2 specifically comprises the following steps:
S2.1, representing text features as node information in a graph, constructing an adjacency matrix from the graph topology through the associations between the aggregated nodes, and fusing the information of each node and its neighbor nodes into a new node;
S2.2, since attention has shown state-of-the-art performance in the NLP and CV fields, introducing an attention mechanism into the graph neural network, assigning an attention score to each node through the attention algorithm and then fusing the information of different nodes, where feature words with low relevance receive low scores and feature words with high relevance receive high attention scores;
and S2.3, finally, fusing this information to strengthen the influence of the different feature words on the nodes.
3. The deep unsupervised cross-modal hash retrieval method of claim 1, wherein: in S1, a dual-stream model is adopted to perform semantic feature extraction on the data information of the different modalities.
4. The deep unsupervised cross-modal hash retrieval method according to claim 1, wherein S1 specifically is:
s1.1, compressing high-level semantic features into low-level semantic representations by using an Auto-Encoder, and reconstructing the low-level semantic features into features of heterogeneous data;
s1.2, inputting the characteristics of the image into a decoder, and mapping the obtained semantic information to a characteristic space of a text through the decoder to realize semantic alignment between the modes;
s1.3, after the reconstruction characteristics of the heterogeneous data are obtained, semantic alignment is carried out on the original image characteristics and the characteristics of the text reconstructed by the decoder, and the high-dimensional characteristics and the characteristics obtained by the encoder are also aligned once, so that the semantic alignment in the mode is realized.
5. The deep unsupervised cross-modal hash retrieval method of claim 4, wherein: in the S1, semantic features obtained by one mode through the feature extractor are mapped to a semantic space corresponding to the other mode after being decoded by the self-encoder, and intra-mode constraint is also performed in the same mode.
CN202211382234.8A 2022-11-07 2022-11-07 Deep unsupervised cross-modal hash retrieval method Active CN115840827B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211382234.8A CN115840827B (en) 2022-11-07 2022-11-07 Deep unsupervised cross-modal hash retrieval method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211382234.8A CN115840827B (en) 2022-11-07 2022-11-07 Deep unsupervised cross-modal hash retrieval method

Publications (2)

Publication Number Publication Date
CN115840827A true CN115840827A (en) 2023-03-24
CN115840827B CN115840827B (en) 2023-09-19

Family

ID=85576906

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211382234.8A Active CN115840827B (en) 2022-11-07 2022-11-07 Deep unsupervised cross-modal hash retrieval method

Country Status (1)

Country Link
CN (1) CN115840827B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116796032A (en) * 2023-04-11 2023-09-22 Chongqing Normal University Multi-modal data retrieval model based on adaptive graph attention hashing

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019148898A1 (en) * 2018-02-01 2019-08-08 北京大学深圳研究生院 Adversarial cross-media retrieving method based on restricted text space
CN109299216A (en) * 2018-10-29 2019-02-01 山东师范大学 Cross-modal hash retrieval method and system fusing supervision information
CN113971209A (en) * 2021-12-22 2022-01-25 松立控股集团股份有限公司 Unsupervised cross-modal retrieval method based on attention mechanism enhancement
CN115203442A (en) * 2022-09-15 2022-10-18 中国海洋大学 Cross-modal deep hash retrieval method, system and medium based on joint attention

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
樊花; 陈华辉: "Research progress on cross-modal retrieval based on hashing methods", 数据通信, no. 03 *

Also Published As

Publication number Publication date
CN115840827B (en) 2023-09-19

Similar Documents

Publication Publication Date Title
CN111581961B (en) Automatic description method for image content constructed by Chinese visual vocabulary
CN113971209B (en) Non-supervision cross-modal retrieval method based on attention mechanism enhancement
Shi et al. Deep adaptively-enhanced hashing with discriminative similarity guidance for unsupervised cross-modal retrieval
Han et al. Fine-grained cross-modal alignment network for text-video retrieval
CN111709518A (en) Method for enhancing network representation learning based on community perception and relationship attention
Zeng et al. Tag-assisted multimodal sentiment analysis under uncertain missing modalities
CN115687571B (en) Depth unsupervised cross-modal retrieval method based on modal fusion reconstruction hash
CN116702091B (en) Multi-mode ironic intention recognition method, device and equipment based on multi-view CLIP
Zhao et al. Videowhisper: Toward discriminative unsupervised video feature learning with attention-based recurrent neural networks
CN115017299A (en) Unsupervised social media summarization method based on de-noised image self-encoder
Peng et al. Multilevel hierarchical network with multiscale sampling for video question answering
CN116796251A (en) Poor website classification method, system and equipment based on image-text multi-mode
CN111368176B (en) Cross-modal hash retrieval method and system based on supervision semantic coupling consistency
Zhu et al. Multi-attention based semantic deep hashing for cross-modal retrieval
Amara et al. Cross-network representation learning for anchor users on multiplex heterogeneous social network
CN115840827A (en) Deep unsupervised cross-modal Hash retrieval method
Luo et al. Collaborative learning for extremely low bit asymmetric hashing
CN116610818A (en) Construction method and system of power transmission and transformation project knowledge base
CN111428518B (en) Low-frequency word translation method and device
CN117036833B (en) Video classification method, apparatus, device and computer readable storage medium
CN111680190B (en) Video thumbnail recommendation method integrating visual semantic information
Yu et al. Self-attentive clip hashing for unsupervised cross-modal retrieval
CN116956128A (en) Hypergraph-based multi-mode multi-label classification method and system
Wang et al. Incomplete multimodality-diffused emotion recognition
Cai et al. Deep learning approaches on multimodal sentiment analysis

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant