CN113095415B - Cross-modal hashing method and system based on multi-modal attention mechanism - Google Patents

Cross-modal hashing method and system based on multi-modal attention mechanism

Info

Publication number
CN113095415B
CN113095415B CN202110407112.9A
Authority
CN
China
Prior art keywords
image
modal
feature vector
text
inputting
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110407112.9A
Other languages
Chinese (zh)
Other versions
CN113095415A
Inventor
鲁芹
吴吉祥
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Qilu University of Technology
Original Assignee
Qilu University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Qilu University of Technology filed Critical Qilu University of Technology
Priority to CN202110407112.9A priority Critical patent/CN113095415B/en
Publication of CN113095415A publication Critical patent/CN113095415A/en
Application granted granted Critical
Publication of CN113095415B publication Critical patent/CN113095415B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention belongs to the field of fusing multi-modal attention mechanisms with cross-modal hash networks, and provides a cross-modal hashing method and system based on a multi-modal attention mechanism. The method comprises a training process and a retrieval process. Training process: inputting image-text pairs with the same semantics, together with their class labels, into the cross-modal hash network model of the multi-modal attention mechanism for training until the model converges, to obtain a trained cross-modal hash network model of the multi-modal attention mechanism. Retrieval process: inputting the image or text to be queried into the trained cross-modal hash network model of the multi-modal attention mechanism, and obtaining the top k retrieved texts or images according to the similarity.

Description

Cross-modal hashing method and system based on multi-modal attention mechanism
Technical Field
The invention belongs to the field of fusing multi-modal attention mechanisms with cross-modal hash networks, and particularly relates to a cross-modal hashing method and system based on a multi-modal attention mechanism.
Background
The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.
Cross-modal retrieval uses data of one modality as a query to retrieve content of another modality with similar semantics. For mutual retrieval between images and texts in particular, this retrieval mode can meet many requirements of people's daily life and work. In the feature extraction of existing cross-modal hashing methods, approaches based on global representation alignment cannot accurately locate the semantically meaningful parts of images and texts, while approaches based on local representation alignment carry a huge computational burden because the similarities between image fragments and text words must be exhaustively aggregated.
With the development of deep learning in various fields, many studies have shown that feature representations extracted by deep learning have stronger expressive power than those of traditional shallow learning methods. Current advanced methods select two similar structural branches to extract deep features of image data and text data respectively, then further process the extracted features of the two modalities and compute the similarity between them. Although this approach has made some progress, problems remain in using deep learning architectures for cross-modal retrieval. The deep features are extracted only from the overall feature information of each modality; they are insufficient to express the local key feature information of the modality and cannot mine the semantic associations between different modalities, which affects retrieval precision and accuracy. In addition, when searching some widely used data sets, the large amount of data and the high computational cost greatly reduce the retrieval speed.
Disclosure of Invention
In order to solve the technical problems in the background art, the invention provides a cross-modal hashing method and system based on a multi-modal attention mechanism, comprising a training process and a retrieval process. In the training process, image features and text features are extracted; a multi-modal attention mechanism performs fine-grained interaction between the features of the image modality and those of the text modality and extracts more refined key feature information within the image and text modalities; finally, the hash representations of the two modalities are learned. In the retrieval process, the image or text to be queried is input into the training module to obtain its binary hash code, this code is compared against the hash codes in the retrieval library with the Hamming distance formula, and the retrieval results are output in order of increasing Hamming distance to obtain the required list of images or texts.
In order to achieve the purpose, the invention adopts the following technical scheme:
a first aspect of the invention provides a cross-modal hashing method based on a multi-modal attention mechanism.
A cross-modal hashing method based on a multi-modal attention mechanism comprises a training process and a retrieval process.
Training process: inputting image-text pairs with the same semantics, together with their class labels, into the cross-modal hash network model of the multi-modal attention mechanism for training until the model converges, to obtain a trained cross-modal hash network model of the multi-modal attention mechanism;
Retrieval process: inputting the image or text to be queried into the trained cross-modal hash network model of the multi-modal attention mechanism, and obtaining the top k retrieved texts or images according to the similarity.
Further, the training process comprises:
step (1-1): inputting images of different categories into an image modal feature extraction network, and extracting global feature vectors of the images;
step (1-2): inputting text data corresponding to the image data in the step (1-1) into a text modal feature extraction network, and extracting a global feature vector of the text;
step (1-3): respectively inputting the global feature vector of the image and the global feature vector of the text into a multi-modal interactive gate, respectively inputting the obtained multi-modal image context feature vector and multi-modal text context feature vector into the cross-modal hash network, and respectively inputting the obtained image feature vector and text feature vector into a hash layer to obtain the binary hash code corresponding to the image feature vector and the binary hash code corresponding to the text feature vector.
Further, the step (1-1) includes:
step (1-1-1): extracting coarse-grained characteristic vectors of an image modality by adopting a Convolutional Neural Network (CNN);
step (1-1-2): inputting the extracted coarse-grained features of the image modality into a mean pooling layer to obtain an image global context feature vector;
step (1-1-3): inputting the coarse-grained feature vector of the image modality into a recurrent neural network GRU to obtain a spatial position feature vector of the image;
step (1-1-4): and adding the image global context feature vector and the spatial position feature vector of the image to obtain the image global feature vector.
Further, the step (1-2) comprises:
step (1-2-1): extracting coarse-grained feature vectors of a text mode by adopting Bi-LSTM in a recurrent neural network;
step (1-2-2): and inputting the coarse-grained feature vector of the text mode into a mean pooling layer to obtain a global feature vector of the text.
Further, the step (1-3) comprises:
step (1-3-11): inputting the global feature vector of the image into a multi-modal interactive gate to obtain a multi-modal image context feature vector;
step (1-3-12): inputting the multi-modal image context feature vector and the coarse-grained feature vector of an image modality into a multi-modal attention function of the image together, and calculating the attention weight of each image region;
step (1-3-13): according to the attention weight of each image region, the coarse-grained feature vector of the image modality and the bias term b_m, calculating the image feature vector through a weighted average;
step (1-3-14): inputting the image feature vector into a hash layer, and calculating a binary hash code corresponding to the image feature vector;
further, the step (1-3) comprises:
step (1-3-21): inputting the global feature vector of the text into a multi-mode interactive gate to obtain a multi-mode text context feature vector;
step (1-3-22): inputting the multi-modal text context feature vector and the coarse-grained feature vector of the text mode into a multi-modal attention function of the text together, and calculating the attention weight of the vocabulary in each text;
step (1-3-23): according to the attention weight of the vocabulary in each text, the coarse-grained feature vector of the text modality and the bias term b_l, calculating the text feature vector through a weighted average;
step (1-3-24): and inputting the text feature vector into a hash layer, and calculating a binary hash code corresponding to the text feature vector.
Further, the retrieval process includes:
step (2-1): inputting the image or text to be queried into the cross-modal hash network model of the multi-modal attention mechanism to obtain the binary hash code corresponding to the image or text;
step (2-2): inputting the binary hash code of the image or of the text into the query library to be retrieved, calculating the Hamming distance between this hash code and each hash code in the query library, and outputting the top k retrieved texts or images in order of increasing Hamming distance.
Further, a cross-modal retrieval loss function is adopted to calculate the similarity between images and texts that share the same class label; the similarity is computed from the loss functions of image-retrieves-image, image-retrieves-text, text-retrieves-text and text-retrieves-image.
A second aspect of the invention provides a cross-modal hashing system based on a multi-modal attention mechanism.
A cross-modal hashing system based on a multi-modal attention mechanism, comprising: a training module and a retrieval module, wherein,
a training module configured to: input image-text pairs with the same semantics, together with their class labels, into the cross-modal hash network model of the multi-modal attention mechanism for training until the model converges, to obtain a trained cross-modal hash network model of the multi-modal attention mechanism;
a retrieval module configured to: input the image or text to be queried into the trained cross-modal hash network model of the multi-modal attention mechanism, and obtain the top k retrieved texts or images according to the similarity.
A third aspect of the invention provides a computer-readable storage medium.
A computer-readable storage medium, on which a computer program is stored, which program, when executed by a processor, performs the steps in the cross-modal hashing method based on a multimodal attention mechanism as described in the first aspect above.
A fourth aspect of the invention provides a computer apparatus.
A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor when executing the program implementing the steps in the multi-modal attention mechanism based cross-modal hashing method as described in the first aspect above.
Compared with the prior art, the invention has the beneficial effects that:
1. The method uses a ResNet-152 network pre-trained on ImageNet to extract image features with deep learning; on this basis it further extracts fine-grained context features of the image, extracts the spatial position features of the image with GRUs, and finally combines the two fine-grained features as the global feature of the image. For text features, features are extracted with a bidirectional LSTM, whose long-term memory alleviates the gradient-explosion problem, preserves semantic consistency within the modality to a certain extent, and improves the computation of the similarity measure.
2. The invention designs a multi-modal interaction gate to perform fine-grained interaction between the image and text modalities, so as to mine the semantic association features between different modalities and balance the amount of information and the semantic complementarity between them. The gated features are input into an attention mechanism to capture the local key information features of the image or text modality, and the attended features are then input into a hash function to obtain the binary hash code representations of the image and the text respectively. During retrieval, the modality to be queried is passed through this process to obtain a hash code, the Hamming distances between this hash code and the hash codes in the retrieval library are calculated, and the retrieval results are output in order of distance.
3. Experiments conducted on several public data sets show that the mAP value of the newly proposed HX_MAN model is improved over existing cross-modal retrieval methods, which verifies the superior performance of the method proposed by the present invention.
Advantages of additional aspects of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, are included to provide a further understanding of the invention; they illustrate exemplary embodiments of the invention and, together with the description, serve to explain the invention without limiting it.
FIG. 1 is a block diagram of a cross-modal Hash network model based on a multi-modal attention mechanism proposed by the present invention;
FIG. 2(a) is a first comparison graph of the importance of visual-spatial position information and semantic complementarity in a cross-modal image retrieval model in an embodiment of the present invention;
FIG. 2(b) is a second comparison graph of the importance of visual-spatial position information and semantic complementarity in a cross-modal image retrieval model in an embodiment of the present invention;
FIG. 2(c) is a third comparison graph of the importance of visual-spatial position information and semantic complementarity in a cross-modal image retrieval model in an embodiment of the present invention;
FIG. 3(a) is a graph of accuracy versus a line for various methods of implementing an "image → text" search using a NUS-WIDE dataset in an embodiment of the present invention;
FIG. 3(b) is a graph of accuracy versus a line for various methods of implementing a "text → image" search using a NUS-WIDE dataset in an embodiment of the present invention;
FIG. 4(a) is a line graph comparing accuracy of various methods for implementing an "image → text" search using the MIR-Flickr25K dataset in an embodiment of the present invention;
FIG. 4(b) is a line graph comparing accuracy of various methods for implementing a "text → image" search using the MIR-Flickr25K dataset in an embodiment of the present invention;
FIG. 5 is a page display diagram of a cross-modal hashing system based on a multi-modal attention mechanism according to an embodiment of the present invention;
FIG. 6 is a comparison graph of the retrieval result on the data set of the cross-modal hashing method based on the multi-modal attention mechanism and the two existing methods in the embodiment of the present invention;
FIG. 7(a) is a visual display diagram of a first retrieval case in an embodiment of the present invention;
fig. 7(b) is a visual display diagram of a second search case in the embodiment of the present invention.
Detailed Description
The invention is further described with reference to the following figures and examples.
It is to be understood that the following detailed description is exemplary and is intended to provide further explanation of the invention as claimed. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of exemplary embodiments according to the invention. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, and it should be understood that when the terms "comprises" and/or "comprising" are used in this specification, they specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof, unless the context clearly indicates otherwise.
Example one
As shown in fig. 1, the present embodiment provides a cross-modal hashing method based on a multi-modal attention mechanism, and the present embodiment is exemplified by applying the method to a server, it is to be understood that the method may also be applied to a terminal, and may also be applied to a system including a terminal and a server, and implemented by interaction between the terminal and the server. The server may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a network server, cloud communication, middleware service, a domain name service, a security service CDN, a big data and artificial intelligence platform, and the like. The terminal may be, but is not limited to, a smart phone, a tablet computer, a laptop computer, a desktop computer, a smart speaker, a smart watch, and the like. The terminal and the server may be directly or indirectly connected through wired or wireless communication, and the application is not limited herein. In this embodiment, the method includes the steps of:
step (1) training process: inputting image-text pairs with the same semantics, together with their class labels, into the cross-modal hash network model of the multi-modal attention mechanism for training until the model converges, to obtain a trained cross-modal hash network model of the multi-modal attention mechanism;
step (2) retrieval process: inputting the image or text to be queried into the trained cross-modal hash network model of the multi-modal attention mechanism, and obtaining the top k retrieved texts or images according to the similarity.
The training process comprises the following steps:
in the training process, the embodiment utilizes the strong feature extraction capability of deep learning to extract the global coarse-grained feature vectors of the image and text modes, and utilizes a multi-mode attention mechanism to perform fine interaction on different modes, so as to search fine-grained association between the image and the text feature vectors on a bottom layer, and then pay attention to local information of the fine-grained features, thereby solving the problem of irrelevant semantics between different modes to a certain extent, and expressing the feature information of the modes from a deep-level network.
Extraction and representation of features:
the feature extraction of images and texts is to preprocess a group of digital data of the images and the texts through a series of steps, then to shrink the dimensionality of the data to a certain degree, and finally to obtain another group of digital vectors capable of expressing modal information, wherein the quality of the group of digital vectors has a great influence on the generalization capability. In the image and text feature extraction of the part, the convolutional neural network CNN verified by a plurality of people is selected to extract image features, and for the text feature extraction, Bi-LSTM in the convolutional neural network is adopted to extract text features.
(1) Image representation: ResNet-152 pre-trained on ImageNet is used as the image feature encoder; the image is scaled to 448 × 448 and input into the CNN. We make one change at this step, namely removing the last pooling layer and outputting the final result as the image coarse-grained feature I. Previous experiments have shown that removing this pooling layer has little impact on the network of this embodiment. The coarse-grained features are then input into the mean pooling layer network. For convenience of description, we denote these input coarse-grained features by {I_1, ..., I_M}, where M is the number of regions in the image and I_i (i ∈ [1, M]) denotes the i-th region of the image.
After the coarse-grained feature representation is obtained, these features are input into the mean pooling layer, whose output expresses deeper feature information and the contextual information of the image; we denote it as the image global context feature vector I^(g):

$$I^{(g)} = \tanh\Big(P^{(0)} \cdot \frac{1}{M}\sum_{i=1}^{M} I_i\Big) \qquad (1)$$

where tanh() is an activation function that performs a non-linear mapping of the feature vector so as to project the features into a common subspace, and P^(0) is a weight matrix by which the image feature vector and the text feature vector can be embedded in the same common space.
Sometimes the effect we see visually may deviate slightly from the latent information expressed by the image, leading to wrong judgments; the reason is that the spatial position information of the image is neglected. As shown in FIG. 2(a) and FIG. 2(b), both images at a glance contain the same two concepts, "car" and "man", but the information they express is completely different. It is therefore difficult to distinguish the two images using only the coarse-grained features mentioned above, because coarse-grained features discard some spatial position information during mean pooling. The spatial position information and the coarse-grained feature information are equally important, and neither can be ignored. To address this, the present embodiment further parses the spatial position information of the images through GRUs, so that the two images can be better distinguished visually. The GRU is a special type of recurrent neural network that has few parameters and is computationally very efficient.
For the resulting image feature vectors {I_1, ..., I_M}, we align them and input them into the GRU in turn to output the positional features among them. This process can be defined by equation (2):

$$h_t = \mathrm{GRU}\big(I_t,\ h_{t-1}\big) \qquad (2)$$

where h_t denotes the hidden state of the GRU at time step t, h_{t-1} denotes the hidden state passed down from the previous node, and I_t denotes the image feature of the t-th region. These states are combined into one hidden state vector {h_1, ..., h_M}; a pooling operation over this group of vectors then yields the spatial position feature I^(d) of the image, which expresses the visual position information of the image.

Finally, the two important pieces of image feature information, I^(g) and I^(d), are added together to obtain the final global feature vector I^(0) of the image:

$$I^{(0)} = I^{(g)} + I^{(d)} \qquad (3)$$
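As an illustration of the image branch described above, the following PyTorch-style sketch (our own simplification, not code from the patent; the module names and the region/hidden dimensions are assumptions) computes the mean-pooled context feature I^(g), the GRU-based spatial feature I^(d), and their sum I^(0) from a set of CNN region features:

```python
import torch
import torch.nn as nn

class ImageGlobalEncoder(nn.Module):
    """Sketch of eqs. (1)-(3): region features -> mean pooling + tanh (I_g),
    GRU over regions -> pooled spatial feature (I_d), summed into I_0."""
    def __init__(self, region_dim=2048, hidden_dim=512):
        super().__init__()
        self.proj = nn.Linear(region_dim, hidden_dim)       # plays the role of P^(0)
        self.gru = nn.GRU(region_dim, hidden_dim, batch_first=True)

    def forward(self, regions):                             # regions: (B, M, region_dim)
        i_g = torch.tanh(self.proj(regions.mean(dim=1)))    # global context feature I^(g)
        states, _ = self.gru(regions)                       # hidden states h_1 .. h_M
        i_d = states.mean(dim=1)                            # pooled spatial feature I^(d)
        return i_g + i_d                                    # global image feature I^(0)
```

In practice the region features would come from a ResNet-152 backbone with its last pooling layer removed, as described above.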
(2) Text representation: for the feature representation of text, a bidirectional LSTM is used as the encoder to generate the coarse-grained feature vectors of the text. Assume the text input is denoted {w_1, ..., w_L}. Each word is first represented by a one-hot vector that indexes the word in the vocabulary. Each one-hot vector is then embedded into the vector space by e_l = P w_l, where P is the embedding matrix. Finally, the vectors are arranged in order and input into the bidirectional LSTM. This process can be represented by equation (4):

$$\overrightarrow{h}_t = \overrightarrow{\mathrm{LSTM}}\big(e_t,\ \overrightarrow{h}_{t-1}\big), \qquad \overleftarrow{h}_t = \overleftarrow{\mathrm{LSTM}}\big(e_t,\ \overleftarrow{h}_{t+1}\big) \qquad (4)$$

where $\overrightarrow{h}_t$ and $\overleftarrow{h}_t$ represent the hidden states of the forward and backward LSTM at time step t, respectively. The two hidden states are added at each time step, i.e. $T_t = \overrightarrow{h}_t + \overleftarrow{h}_t$, constructing the coarse-grained feature vectors {T_1, ..., T_L} of the text.
For the deep feature extraction of the text modality: when the coarse-grained features of the text are extracted, each segment already inherits the sequence information of the previous time step. Therefore, there is no need to extract two kinds of feature information separately as in the image branch; mean pooling alone is used to turn the coarse-grained features of the text into the global feature T^(0) of the text, which encodes the contextual semantics of the words over all sentences of the text modality:

$$T^{(0)} = \frac{1}{L}\sum_{l=1}^{L} T_l \qquad (5)$$
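A corresponding sketch for the text branch, again our own simplified illustration (the vocabulary size, embedding and hidden dimensions are assumptions), producing the word features T_1 .. T_L and the mean-pooled global feature T^(0):

```python
import torch
import torch.nn as nn

class TextGlobalEncoder(nn.Module):
    """Sketch of eqs. (4)-(5): word indices -> embedding -> Bi-LSTM,
    forward/backward states summed per step, then mean-pooled into T^(0)."""
    def __init__(self, vocab_size=10000, embed_dim=300, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)    # embedding matrix P
        self.lstm = nn.LSTM(embed_dim, hidden_dim,
                            batch_first=True, bidirectional=True)

    def forward(self, tokens):                              # tokens: (B, L) word indices
        states, _ = self.lstm(self.embed(tokens))           # (B, L, 2 * hidden_dim)
        fwd, bwd = states.chunk(2, dim=-1)                  # forward / backward halves
        t_words = fwd + bwd                                 # T_1 .. T_L
        return t_words, t_words.mean(dim=1)                 # word features and T^(0)
```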
multimodal attention network:
in most of the previous retrieval methods, they only train out global feature information of different modalities and then mathematically project the feature information into a common space to measure the similarity between each image region and the word. Although the similarity of the two modes can be measured to a certain extent by the method, the global feature information not only consumes more computing resources, but also cannot represent the key information of the modes, and further cannot dig out the depth relation between the two modes at the bottom layer, so that the retrieval accuracy and speed are reduced.
Later, as the multi-modal field developed, scholars proposed the attention mechanism, which has since been widely applied in various fields. Inspired by this prior work, the present embodiment improves on existing methods and proposes a new attention mechanism. As its name suggests, the aim of "attention" is to find out which part most needs to be emphasized. By exploiting its capability for extracting local information, the key information within the modalities can easily be exposed, so that the degree of matching of feature information between different modalities can be better analysed.
Although the above method can enrich the local key information of images and sentences to a certain extent, and its performance is better than that of models not using it, it only mines the key regions within each image or text modality and does not complete the interaction between heterogeneous data, so there remains a problem in capturing the semantic associations between different modalities. As shown in FIG. 2(b) and FIG. 2(c), the language descriptions of the two images are semantically very close, yet the two images are still difficult to distinguish by visual observation. The reason is that only the key information of the text modality is attended to, while the semantic complementarity between the visual part and the text is not considered.
In view of the above problems, the present embodiment adds a multi-modal interaction gate that lets the image and text modalities interact before the attention mechanism is applied, enhancing the representation capability of the image and text by exploiting the semantic complementarity between different modalities. The interaction gate finely fuses the fine-grained image features with the abstract representations of the words, makes the semantics of the different modalities complementary through their interaction, and thereby mines the underlying association between them, improving retrieval accuracy.
In the initial experimental design phase, we considered the simplest way of letting image and text features interact to be adding them directly. As experimentation progressed, however, we found that this direct addition may in practice lead to relatively poor performance. This may be because the image context features and the text context features are not produced by the same extraction method during the training phase; if they are fused in this simple way, some meaningful information of one modality may be obscured by the other in the process. To address this problem of obscured modality information, the interaction gate is designed to let image and text features complement each other semantically, so that these two features from different modalities can interact at the bottom layer.
Specifically, as shown in FIG. 1, the present embodiment inputs the context feature vectors of the image and the text, I^(0) and T^(0), into the semantically complementary interaction gate to let them interact. This process can be represented by equation (6):

$$o^{(I)} = \sigma\big(\alpha \cdot U_I(I^{(0)}) + (1-\alpha)\cdot U_T(T^{(0)})\big)$$
$$o^{(T)} = \sigma\big(\alpha \cdot U_T(T^{(0)}) + (1-\alpha)\cdot U_I(I^{(0)})\big) \qquad (6)$$

where U_I and U_T are dimension-reducing matrices and α is a parameter that prevents information loss while fusing the image and text context features. Finally, each feature produced by the interaction is rescaled to [0, 1] by the sigmoid activation function σ. o^(I) and o^(T) denote the more refined feature vectors output by the multi-modal interaction gate; for convenience, they are referred to as the multimodal image context feature vector and the multimodal text context feature vector, respectively.
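The interaction gate of equation (6) can be sketched as follows (an illustrative reading rather than the patent's reference implementation; the dimensions, the value of α, and the cross-modal mixing follow the interpretation given above):

```python
import torch
import torch.nn as nn

class InteractionGate(nn.Module):
    """Sketch of eq. (6): each modality's context vector is blended with the
    other modality's (weight alpha) and squashed into [0, 1] by a sigmoid."""
    def __init__(self, in_dim=512, out_dim=512, alpha=0.5):
        super().__init__()
        self.u_i = nn.Linear(in_dim, out_dim)   # dimension-reducing matrix U_I
        self.u_t = nn.Linear(in_dim, out_dim)   # dimension-reducing matrix U_T
        self.alpha = alpha                      # fusion parameter alpha

    def forward(self, i0, t0):                  # I^(0), T^(0): (B, in_dim)
        o_i = torch.sigmoid(self.alpha * self.u_i(i0) + (1 - self.alpha) * self.u_t(t0))
        o_t = torch.sigmoid(self.alpha * self.u_t(t0) + (1 - self.alpha) * self.u_i(i0))
        return o_i, o_t                         # multimodal image / text context vectors
```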
After the image and text feature vectors interact at the bottom layer and the semantic association between them is obtained through semantic complementarity, the local key information within the image or text modality can be captured with the attention mechanism. Attention captures what is needed after learning and directly disregards unimportant information regions; the learning result is usually output as a probability map or probability feature vector. The objective of multi-modal attention is to independently exploit the data information of the multimodal image or text context features, which carry semantic complementarity, to explore the fine-grained associations among image regions or words. This is achieved by computing convex combinations of the local features of the image regions or text.
Specifically, for the multi-modal attention module of the image, as shown in FIG. 1, the obtained image feature vectors {I_1, ..., I_M} and the multimodal image context feature vector o^(I), used as the query, are input into the multi-modal attention function f_att of the image to compute the attention weight α_{I,m} of each image region. The multi-modal attention function f_att of the image adopts a two-layer feed-forward perceptron, and a softmax function keeps the weighting over the whole process balanced. The attention weight α_{I,m} can be defined by equation (7):

$$h_{I,m} = \tanh\big(W_I I_m + b_I + W_{I,q}\, o^{(I)} + b_{I,q}\big)$$
$$\alpha_{I,m} = \mathrm{softmax}\big(W_{I,h}\, h_{I,m} + b_{I,h}\big) \qquad (7)$$

where W_I, W_{I,q} and W_{I,h} are parameters of the perceptron, b_I, b_{I,q} and b_{I,h} are its bias terms, h_{I,m} denotes the hidden state of the image's multi-modal attention function at time step m, and tanh() is an activation function. After obtaining the attention weight of each image region, the attended image feature representation vector I^(1) can be computed by weighted averaging:

$$I^{(1)} = P^{(1)}\Big(\sum_{m=1}^{M} \alpha_{I,m}\, I_m + b_m\Big) \qquad (8)$$

where P^(1) is a weight matrix by which the image and text feature vectors can be embedded in the same common space, and b_m is a bias term.
The multi-modal attention module of the text serves the same purpose: it expresses an abstract high-level representation of the words in the text sentence through the attention mechanism, thereby extracting context semantic features with multi-modal attention. The attention weight α_{T,l} is obtained by a soft attention module consisting of a two-layer feed-forward perceptron and a softmax function, and the multi-modal context feature vector T^(1) of the text can be defined by equation (9):

$$h_{T,l} = \tanh\big(W_T T_l + b_T + W_{T,q}\, o^{(T)} + b_{T,q}\big)$$
$$\alpha_{T,l} = \mathrm{softmax}\big(W_{T,h}\, h_{T,l} + b_{T,h}\big)$$
$$T^{(1)} = \sum_{l=1}^{L} \alpha_{T,l}\, T_l + b_l \qquad (9)$$

where W_T, W_{T,q} and W_{T,h} are parameters of the perceptron, b_T, b_{T,q} and b_{T,h} are its bias terms, h_{T,l} denotes the hidden state of the text's multi-modal attention at time step l, T_l is a coarse-grained feature of the text, and b_l is a bias term. Unlike the multi-modal attention model of the image, the multi-modal attention of the text no longer requires an embedding layer after the weighted averaging, because the text features {T_1, ..., T_L} already lie in the common space and training is done in an end-to-end fashion.
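Both attention branches share the two-layer-perceptron structure of equations (7)-(9); a generic sketch follows (our own illustration with assumed dimensions, omitting the final P^(1) projection used only on the image side):

```python
import torch
import torch.nn as nn

class MultiModalAttention(nn.Module):
    """Sketch of eqs. (7)-(9): local features are scored against the gated
    context vector, softmax-normalised, and combined by weighted averaging."""
    def __init__(self, feat_dim=512, ctx_dim=512, hidden_dim=256):
        super().__init__()
        self.w_feat = nn.Linear(feat_dim, hidden_dim)   # W and b for local features
        self.w_ctx = nn.Linear(ctx_dim, hidden_dim)     # W_q and b_q for the context query
        self.w_h = nn.Linear(hidden_dim, 1)             # W_h and b_h scoring layer

    def forward(self, feats, ctx):                      # feats: (B, N, D), ctx: (B, D)
        h = torch.tanh(self.w_feat(feats) + self.w_ctx(ctx).unsqueeze(1))
        alpha = torch.softmax(self.w_h(h).squeeze(-1), dim=1)   # attention weights
        return (alpha.unsqueeze(-1) * feats).sum(dim=1)         # attended feature I^(1)/T^(1)
```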
Hash layer:
In the hash layer, the image feature I^(1) and the text feature T^(1), both carrying multi-modal attention, are input into the hash layer respectively, and binary representations of the different modal features are obtained by learning a hash function. In the hash layer, the Tanh activation function keeps the output of each neuron between -1 and 1, and a Sign function with a threshold of 0 converts it into a binary code: a code value of 1 means the neuron's output is greater than or equal to 0, and a code value of 0 means the output is less than 0. The hash functions of the image and the text are shown in equations (10) and (11), respectively:

$$H_I = \mathrm{Sign}\big(\mathrm{Tanh}(w^{(I)} I^{(1)} + b^{(I)})\big) \qquad (10)$$
$$H_T = \mathrm{Sign}\big(\mathrm{Tanh}(w^{(T)} T^{(1)} + b^{(T)})\big) \qquad (11)$$

where w^(I) and w^(T) are the network parameters of the image and text modality respectively, b^(I) and b^(T) are bias terms, and H_I and H_T are the hash representations of the image and the text, respectively.
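A minimal sketch of the hash layer of equations (10)-(11), under the assumption of one fully connected layer per modality; the code length, feature dimension, and the use of the tanh output as a training-time relaxation of the non-differentiable sign are our assumptions:

```python
import torch
import torch.nn as nn

class HashLayer(nn.Module):
    """Sketch of eqs. (10)-(11): a linear map, tanh into [-1, 1], and a
    zero-threshold sign to obtain binary codes."""
    def __init__(self, feat_dim=512, code_len=64):
        super().__init__()
        self.fc = nn.Linear(feat_dim, code_len)   # w^(I)/w^(T) with bias b^(I)/b^(T)

    def forward(self, x):                         # x: attended feature I^(1) or T^(1)
        h = torch.tanh(self.fc(x))                # relaxed real-valued codes for training
        codes = torch.sign(h)                     # ±1 binary code (threshold 0)
        return codes, h                           # map -1 to 0 for a 0/1 code as in the text
```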
(II) a retrieval process:
in the training process, the present embodiment utilizes the deep learning underlying feature mining capability and the attention mechanism to capture the local key feature information to obtain the respective binary hash code representations of the features of the image modality or the text modality through the hash function. Therefore, when the cross-modal search is performed, a sample of any one modality is taken as a query object, and a similar sample of another different modality can be searched. Specifically, as shown in fig. 1, for image query, a user inputs an image to be queried into a training module to convert image features into a form of a trained binary hash code, inputs the trained hash code into a query library to be retrieved, calculates hamming distances between the hash code and the hash codes in the query library, and sequentially outputs the first k search results from small to large according to the size sequence of the hamming distances; similarly, for text query, a user takes text data as a query object, hash codes of a text mode are obtained through an end-to-end network framework in a training module, then hamming distances between the hash codes and the hash codes in a database to be retrieved are calculated and sequenced, and finally the retrieved first k pictures are output.
An objective function:
the goal of the cross-modal search penalty function is to preserve both intra-modal and inter-heterogeneous modal semantic similarity. The cross-modal search penalty function is shown in equation (12):
F=min(Fv→v+Fv→t+Ft→t+Ft→v) (12)
where v → v, v → t, t → t and t → v denote an image retrieval image, an image retrieval text, a text retrieval text and a text retrieval image, respectively. And Fv→tA loss function representing the image retrieval text, and the remaining loss functions are similar. Loss function F of image retrieval textv→tIs defined as:
Figure GDA0003533861270000171
wherein, (i, j, k) is a triple, representing a minimum edge distance.
Figure GDA0003533861270000172
Representing the euclidean distance of the image currently being the query modality from the positive sample,
Figure GDA0003533861270000173
representing the euclidean distance of the current modality from the negative examples. Fv→tIs the triple ordering penalty, meaning that image i has greater similarity to text j than image i has to text k.
Experimental results and analysis:
in the embodiment, firstly, detailed analysis is performed on data results of a training module in HX _ MAN and a current advanced cross-modal retrieval method, and then, in two public data sets of NUS-WIDE data set and MIR-Flickr25K data set, some evaluation indexes are calculated. Then, benchmarking analysis is performed by using the HX _ MAN model proposed by the embodiment and several existing methods.
Data set and evaluation index:
(1) data set
The NUS-WIDE dataset is a large web image dataset created by a media search laboratory. The dataset contains 260648 images gathered from the Flickr website and 5018 distinct class labels. Each image has its corresponding text label, forming an image-text pair; the texts describing the images are the sets of tag words entered by users when uploading the images. This embodiment performs the baseline analysis on the 194600 image-text pairs belonging to the 20 most common labels in this dataset, where the text of each pair is represented as a 1000-dimensional bag-of-words (BOW) vector. If an image and a text share a label of the same concept, they are considered similar; otherwise they are not.
The MIR-Flickr25K dataset contains 25000 multi-label images collected from the Flickr website, with 24 manually annotated category labels. The experimental data of this example retain the image-text pairs whose text labels appear at least 20 times, giving a total of 20015 pairs, each annotated with one of the 24 category labels. The text of each pair is represented as a 1386-dimensional BOW vector. If an image and a text share a label they are considered similar; otherwise they are considered dissimilar.
(2) Evaluation index
This example uses the mean Average Precision (mAP) to evaluate the model herein. The calculation formula of mAP is shown in equation (14):

$$\mathrm{mAP} = \frac{1}{|Q|}\sum_{q\in Q} \mathrm{AP}(q) \qquad (14)$$

where |Q| denotes the size of the query data set Q, q denotes a given query, and AP denotes the Average Precision:

$$\mathrm{AP}(q) = \frac{1}{M}\sum_{i=1}^{n} P_q(i)\,\delta(i) \qquad (15)$$

where M denotes the number of true neighbors of q in the queried data, n denotes the total amount of data, P_q(i) denotes the precision of the first i retrieved instances, and δ(i) is an indicator function: δ(i) = 1 means the i-th instance is relevant to the query, and δ(i) = 0 means it is not.
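The mAP of equations (14)-(15) can be computed as in the following sketch (our own helper; the input format, a per-query binary relevance vector over the ranked results together with the number of true neighbors M, is an assumption):

```python
import numpy as np

def mean_average_precision(relevance_per_query, true_neighbors_per_query):
    """Sketch of eqs. (14)-(15): relevance_per_query[q][i] is delta(i) for the
    i-th ranked result of query q; true_neighbors_per_query[q] is M for query q."""
    aps = []
    for rel, m in zip(relevance_per_query, true_neighbors_per_query):
        rel = np.asarray(rel, dtype=float)
        p_at_i = np.cumsum(rel) / (np.arange(len(rel)) + 1)   # P_q(i)
        aps.append((p_at_i * rel).sum() / max(m, 1))          # AP(q)
    return float(np.mean(aps))                                # average over |Q| queries
```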
Analysis by the reference method:
as another embodiment, we compare the HX _ MAN model proposed in this embodiment with several existing cross-modal search methods, so as to verify the performance of the model proposed by us. To be able to achieve the results we have expected, we have compared not only the shallow structure based methods (CMFH, SCM, STMH, SePH), but also the two deep structure based methods (DCMH and SDCH). For experimental fairness, we used the ResNet-152 network model pre-trained on ImageNet for all methods for feature extraction for image modalities; for the text modality, we also use Bi-LSTM to extract features. In terms of splitting the dataset, we use 2500 pairs of data in the MIR-Flickr25K dataset as queries and the remaining pairs of data as a search library. For the NUS-WIDE dataset, we chose 1% of the dataset as queries and the rest as the search pool. We took 5500 pairs of data from the corpus as a training set of two data sets. All parameters were randomly initialized using a gaussian function with a mean of 0 and a standard deviation of 0.01. The network is trained here by stochastic gradient descent with a batch value of 64, a total epoch of 60, a learning rate of 0.05, and 1/10 where the learning rate becomes the current value after every 20 iterations.
The comparison between this experiment and the other retrieval methods is shown in Table 1, where "image → text" means the query data is the image modality and the retrieved data is the text modality, and "text → image" means the query data is the text modality and the retrieved data is the image modality. We compare the mAP values of each method with code lengths of 16 bits, 32 bits and 64 bits on the NUS-WIDE dataset and the MIR-Flickr25K dataset. From the experimental results and the comparison data in the table, the methods based on deep structures clearly perform better than those based on shallow structures. This shows, to some extent, that the deep-level features extracted by deep learning improve the accuracy of cross-modal retrieval, and thus that the model proposed herein makes some progress in cross-modal retrieval.
TABLE 1 Comparison data of the HX_MAN model with other cross-modal retrieval models
In addition, to visually demonstrate the comparison between the model herein and the other methods, we show the comparison data as line graphs. FIG. 3(a) compares the accuracy of the methods for "image → text" retrieval on the NUS-WIDE dataset; FIG. 3(b) compares the accuracy for "text → image" retrieval on the NUS-WIDE dataset; FIG. 4(a) compares the accuracy for "image → text" retrieval on the MIR-Flickr25K dataset; FIG. 4(b) compares the accuracy for "text → image" retrieval on the MIR-Flickr25K dataset. As the four figures show, the mAP values of the method of this embodiment are slightly higher on the MIR-Flickr25K dataset than on the NUS-WIDE dataset, and the mAP values for text-retrieves-image are also slightly higher than those for image-retrieves-text. It can be seen that the HX_MAN model of this embodiment outperforms the other methods, which also verifies that the image and text modalities can be better associated through the interaction of the stacked attention mechanism, and that the hashing method improves the speed of cross-modal retrieval.
Visual analysis:
this embodiment will show the page of the cross-modal search system designed by us, and compare the search results with the DCMH method and the SDCH method.
As shown in FIG. 5, our cross-modal retrieval system page is divided into two main parts: image-retrieves-text and text-retrieves-image. For the image-retrieves-text part, the image to be queried is uploaded to the system, which processes it step by step with the method designed herein so as to retrieve image descriptions semantically similar to the image content; the few descriptions with the highest similarity are output in text form and presented to the user. The text-retrieves-image part is analogous: the text content to be queried is uploaded to the system, which then outputs the few images most similar to that text content.
In addition, we randomly picked 3 textual descriptions from the test set of the MIR-Flickr25K dataset for comparative analysis with the DCMH method and the SDCH method. As shown in FIG. 6, each of the three models produces output with its own method, and the best result is selected for comparison. For the first text description, the "dog" in the image output by the DCMH method is "lying on its stomach". For the second text description, the action of the "dog" in the image output by the SDCH method is not "standing". The same problem occurs in the third description. The comparison shows that, after the position feature information is extracted with deep learning, the method of this embodiment produces images that match the visual information in the text description more accurately and clearly, which also illustrates to a certain extent that the method of this embodiment improves retrieval accuracy while maintaining speed.
Although this method improves accuracy and speed compared with other methods, it is not as perfect as expected, and there are small errors in the output results. FIG. 7(a) is a visual display diagram of the first retrieval case in the embodiment of the present invention, in which all 5 of the original descriptions in the visualized result are correct; FIG. 7(b) is a visual display diagram of the second retrieval case, in which the 5th sentence of the visualized result is retrieved incorrectly, although the description retains a certain rationality because it can reasonably fit the real-world background of the picture.
Example two
The embodiment provides a cross-modal hashing system based on a multi-modal attention mechanism.
A cross-modal hashing system based on a multi-modal attention mechanism, comprising: a training module and a retrieval module, wherein,
a training module configured to: input image-text pairs with the same semantics, together with their class labels, into the cross-modal hash network model of the multi-modal attention mechanism for training until the model converges, to obtain a trained cross-modal hash network model of the multi-modal attention mechanism;
a retrieval module configured to: input the image or text to be queried into the trained cross-modal hash network model of the multi-modal attention mechanism, and obtain the top k retrieved texts or images according to the similarity.
It should be noted here that the training module and the retrieving module correspond to the step (1) to the step (2) in the first embodiment, and the modules are the same as the examples and application scenarios realized by the corresponding steps, but are not limited to the disclosure of the first embodiment. It should be noted that the modules described above as part of a system may be implemented in a computer system such as a set of computer-executable instructions.
EXAMPLE III
The present embodiment provides a computer-readable storage medium, on which a computer program is stored, and the program, when executed by a processor, implements the steps in the cross-modal hashing method based on a multi-modal attention mechanism as described in the first embodiment.
Example four
The embodiment provides a computer device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and the processor executes the program to implement the steps in the cross-modal hashing method based on the multi-modal attention mechanism according to the embodiment.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of a hardware embodiment, a software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (7)

1. A cross-modal hashing method based on a multi-modal attention mechanism, comprising a training process and a retrieval process, characterized in that:
training process: inputting image-text pairs with the same semantics, together with their class labels, into the cross-modal hash network model of the multi-modal attention mechanism for training until the cross-modal hash network model of the multi-modal attention mechanism converges, to obtain a trained cross-modal hash network model of the multi-modal attention mechanism;
retrieval process: inputting the image or text to be queried into the trained cross-modal hash network model of the multi-modal attention mechanism, and obtaining the top k retrieved texts or images according to the similarity;
the training process comprises:
step (1-1): inputting images of different categories into an image-modality feature extraction network, and extracting the global feature vector of each image;
step (1-2): inputting the text data corresponding to the image data of step (1-1) into a text-modality feature extraction network, and extracting the global feature vector of the text;
step (1-3): inputting the global feature vector of the image and the global feature vector of the text respectively into a multi-modal interactive gate; inputting the resulting multi-modal image context feature vector and multi-modal text context feature vector respectively into a cross-modal hash network; inputting the resulting image feature vector and text feature vector respectively into a hash layer; and obtaining the binary hash code corresponding to the image feature vector and the binary hash code corresponding to the text feature vector;
the step (1-1) comprises:
step (1-1-1): extracting the coarse-grained feature vector of the image modality by using a convolutional neural network (CNN);
step (1-1-2): inputting the extracted coarse-grained feature vector of the image modality into a mean pooling layer to obtain the image global context feature vector;
step (1-1-3): inputting the coarse-grained feature vector of the image modality into a gated recurrent unit (GRU) recurrent network to obtain the spatial position feature vector of the image;
step (1-1-4): adding the image global context feature vector and the spatial position feature vector of the image to obtain the global feature vector of the image;
the step (1-2) comprises:
step (1-2-1): extracting the coarse-grained feature vector of the text modality by using a bidirectional LSTM (Bi-LSTM) recurrent network;
step (1-2-2): inputting the coarse-grained feature vector of the text modality into a mean pooling layer to obtain the global feature vector of the text;
the step (1-3) comprises:
step (1-3-11): inputting the global feature vector of the image into a multi-modal interactive gate to obtain a multi-modal image context feature vector;
step (1-3-12): inputting the multi-modal image context feature vector and the coarse-grained feature vector of the image modality together into the multi-modal attention function of the image, and calculating the attention weight of each image region;
step (1-3-13): calculating an image feature vector through a weighted average according to the attention weight of each image region, the coarse-grained feature vector of the image modality, and b_m;
step (1-3-14): inputting the image feature vector into a hash layer, and calculating the binary hash code corresponding to the image feature vector.
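The image branch recited in claim 1 can be summarized in code. The sketch below is illustrative only: the layer sizes, the sigmoid-gated form of the multi-modal interactive gate, the single-layer attention scorer, and all names (ImageBranch, hash_bits, text_global, ...) are assumptions made for exposition rather than details taken from the patent, and b_m of step (1-3-13) is treated as a zero bias.

```python
import torch
import torch.nn as nn

class ImageBranch(nn.Module):
    def __init__(self, hash_bits=64, dim=256):
        super().__init__()
        # step (1-1-1): a small CNN stands in for the image-modality feature extractor
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, dim, 3, stride=2, padding=1), nn.ReLU(),
        )
        # step (1-1-3): GRU over the sequence of region features
        self.gru = nn.GRU(dim, dim, batch_first=True)
        # multi-modal interactive gate (assumed sigmoid-gated fusion of the two global vectors)
        self.gate = nn.Linear(2 * dim, dim)
        # multi-modal attention function (assumed single-layer scorer)
        self.att = nn.Linear(2 * dim, 1)
        # hash layer; sign() of its tanh output gives the binary code
        self.hash = nn.Linear(dim, hash_bits)

    def forward(self, images, text_global):
        feats = self.cnn(images)                          # (B, dim, H, W) coarse-grained features
        regions = feats.flatten(2).transpose(1, 2)        # (B, R, dim), one vector per region
        global_ctx = regions.mean(dim=1)                  # step (1-1-2): mean pooling
        spatial, _ = self.gru(regions)                    # step (1-1-3): spatial position features
        img_global = global_ctx + spatial[:, -1]          # step (1-1-4): sum of the two vectors
        # step (1-3-11): interactive gate -> multi-modal image context vector
        g = torch.sigmoid(self.gate(torch.cat([img_global, text_global], dim=-1)))
        ctx = g * img_global + (1 - g) * text_global
        # step (1-3-12): attention weight for every image region
        expanded = ctx.unsqueeze(1).expand_as(regions)
        alpha = torch.softmax(self.att(torch.cat([regions, expanded], dim=-1)), dim=1)
        # step (1-3-13): weighted average of the region features (b_m treated as zero here)
        img_feat = (alpha * regions).sum(dim=1)
        # step (1-3-14): hash layer, then sign() -> binary hash code in {-1, +1}
        return torch.sign(torch.tanh(self.hash(img_feat)))

# usage (shapes only): ImageBranch()(torch.randn(2, 3, 64, 64), torch.randn(2, 256))
```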
2. The cross-modal hashing method based on the multi-modal attention mechanism according to claim 1, wherein step (1-3) further comprises:
step (1-3-21): inputting the global feature vector of the text into the multi-modal interactive gate to obtain a multi-modal text context feature vector;
step (1-3-22): inputting the multi-modal text context feature vector and the coarse-grained feature vector of the text modality together into the multi-modal attention function of the text, and calculating the attention weight of each word in the text;
step (1-3-23): calculating a text feature vector through a weighted average according to the attention weight of each word in the text, the coarse-grained feature vector of the text modality, and b_l;
step (1-3-24): inputting the text feature vector into a hash layer, and calculating the binary hash code corresponding to the text feature vector.
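A matching sketch of the text branch in claim 2, under the same assumptions as the image-branch sketch above (the Bi-LSTM hidden size, the gate, and the attention form are illustrative guesses, and b_l of step (1-3-23) is again treated as a zero bias):

```python
import torch
import torch.nn as nn

class TextBranch(nn.Module):
    def __init__(self, vocab_size=10000, dim=256, hash_bits=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        # step (1-2-1): Bi-LSTM gives coarse-grained word-level features
        self.bilstm = nn.LSTM(dim, dim // 2, batch_first=True, bidirectional=True)
        self.gate = nn.Linear(2 * dim, dim)   # multi-modal interactive gate (assumed form)
        self.att = nn.Linear(2 * dim, 1)      # multi-modal attention over words (assumed form)
        self.hash = nn.Linear(dim, hash_bits)

    def forward(self, tokens, image_global):
        words, _ = self.bilstm(self.embed(tokens))        # (B, T, dim) coarse-grained features
        txt_global = words.mean(dim=1)                    # step (1-2-2): mean pooling
        # step (1-3-21): interactive gate -> multi-modal text context vector
        g = torch.sigmoid(self.gate(torch.cat([txt_global, image_global], dim=-1)))
        ctx = g * txt_global + (1 - g) * image_global
        # step (1-3-22): attention weight of every word
        expanded = ctx.unsqueeze(1).expand_as(words)
        alpha = torch.softmax(self.att(torch.cat([words, expanded], dim=-1)), dim=1)
        txt_feat = (alpha * words).sum(dim=1)             # step (1-3-23): weighted average
        # step (1-3-24): hash layer, then sign() -> binary hash code
        return torch.sign(torch.tanh(self.hash(txt_feat)))

# usage (shapes only): TextBranch()(torch.randint(0, 10000, (2, 12)), torch.randn(2, 256))
```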
3. The cross-modal hashing method based on the multi-modal attention mechanism according to claim 1, wherein the retrieval process comprises:
step (2-1): inputting the image or text to be queried into the cross-modal hash network model based on the multi-modal attention mechanism, to obtain the binary hash code corresponding to the image or text;
step (2-2): comparing the binary hash code of the image or text with the hash codes in the retrieval database, calculating the Hamming distance between the query hash code and each hash code in the database, and outputting the top k retrieved texts or images in ascending order of Hamming distance.
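The retrieval step of claim 3 reduces to ranking stored hash codes by Hamming distance to the query code. A minimal sketch, assuming the codes are {-1, +1} vectors as produced by a sign() hash layer; the function name and array shapes are illustrative:

```python
import numpy as np

def retrieve_top_k(query_code, db_codes, k=10):
    """query_code: (bits,) in {-1, +1}; db_codes: (N, bits); returns indices of the k closest items."""
    bits = query_code.shape[0]
    # for {-1, +1} codes, Hamming distance = (bits - inner product) / 2
    hamming = (bits - db_codes @ query_code) / 2
    return np.argsort(hamming)[:k]          # ascending Hamming distance, as in step (2-2)

# usage:
# db = np.sign(np.random.randn(1000, 64)); q = np.sign(np.random.randn(64))
# print(retrieve_top_k(q, db, k=5))
```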
4. The cross-modal hashing method based on the multi-modal attention mechanism, characterized in that a cross-modal retrieval loss function is adopted to calculate the similarity between images and texts having the same class labels, and the similarities between images and between texts are calculated according to the loss functions for image-retrieves-image, image-retrieves-text, and text-retrieves-image retrieval.
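Claim 4 does not give the loss in closed form. As an assumed stand-in, the sketch below uses a pairwise negative log-likelihood similarity loss of the kind common in deep cross-modal hashing, applied to the image-retrieves-image, image-retrieves-text, and text-retrieves-image directions named in the claim; the real-valued hash features (before binarization) and a label-based similarity matrix are the inputs, and all names are hypothetical.

```python
import torch
import torch.nn.functional as F

def pairwise_nll(a, b, sim):
    """a, b: (N, bits) real-valued hash features; sim[i, j] = 1 if same class label, else 0."""
    theta = 0.5 * a @ b.t()                              # pairwise inner products
    return (F.softplus(theta) - sim * theta).mean()      # negative log-likelihood of sim

def cross_modal_loss(img_feats, txt_feats, sim):
    return (pairwise_nll(img_feats, img_feats, sim)      # image retrieves image
            + pairwise_nll(img_feats, txt_feats, sim)    # image retrieves text
            + pairwise_nll(txt_feats, img_feats, sim))   # text retrieves image

# usage: cross_modal_loss(torch.randn(8, 64), torch.randn(8, 64), (torch.rand(8, 8) > 0.5).float())
```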
5. A cross-modal hashing system based on a multi-modal attention mechanism, comprising a training module and a retrieval module, characterized in that:
the training module is configured to: input image-text pairs with the same semantics, together with their class labels, into a cross-modal hash network model based on the multi-modal attention mechanism, and train until the model converges, to obtain a trained cross-modal hash network model based on the multi-modal attention mechanism;
the retrieval module is configured to: input the image or text to be queried into the trained cross-modal hash network model based on the multi-modal attention mechanism, and obtain the top k retrieved texts or images according to similarity;
the training module comprises:
step (1-1): inputting images of different categories into an image-modality feature extraction network, and extracting the global feature vector of each image;
step (1-2): inputting the text data corresponding to the image data of step (1-1) into a text-modality feature extraction network, and extracting the global feature vector of the text;
step (1-3): inputting the global feature vector of the image and the global feature vector of the text respectively into a multi-modal interactive gate; inputting the resulting multi-modal image context feature vector and multi-modal text context feature vector respectively into a cross-modal hash network; inputting the resulting image feature vector and text feature vector respectively into a hash layer; and obtaining the binary hash code corresponding to the image feature vector and the binary hash code corresponding to the text feature vector;
the step (1-1) comprises:
step (1-1-1): extracting the coarse-grained feature vector of the image modality by using a convolutional neural network (CNN);
step (1-1-2): inputting the extracted coarse-grained feature vector of the image modality into a mean pooling layer to obtain the image global context feature vector;
step (1-1-3): inputting the coarse-grained feature vector of the image modality into a gated recurrent unit (GRU) recurrent network to obtain the spatial position feature vector of the image;
step (1-1-4): adding the image global context feature vector and the spatial position feature vector of the image to obtain the global feature vector of the image;
the step (1-2) comprises:
step (1-2-1): extracting the coarse-grained feature vector of the text modality by using a bidirectional LSTM (Bi-LSTM) recurrent network;
step (1-2-2): inputting the coarse-grained feature vector of the text modality into a mean pooling layer to obtain the global feature vector of the text;
the step (1-3) comprises:
step (1-3-11): inputting the global feature vector of the image into a multi-modal interactive gate to obtain a multi-modal image context feature vector;
step (1-3-12): inputting the multi-modal image context feature vector and the coarse-grained feature vector of the image modality together into the multi-modal attention function of the image, and calculating the attention weight of each image region;
step (1-3-13): calculating an image feature vector through a weighted average according to the attention weight of each image region, the coarse-grained feature vector of the image modality, and b_m;
step (1-3-14): inputting the image feature vector into a hash layer, and calculating the binary hash code corresponding to the image feature vector.
6. A computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, carries out the steps of the cross-modal hashing method based on a multi-modal attention mechanism according to any one of claims 1-4.
7. A computer device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the steps of the cross-modal hashing method based on a multi-modal attention mechanism according to any one of claims 1-4.
CN202110407112.9A 2021-04-15 2021-04-15 Cross-modal hashing method and system based on multi-modal attention mechanism Active CN113095415B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110407112.9A CN113095415B (en) 2021-04-15 2021-04-15 Cross-modal hashing method and system based on multi-modal attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110407112.9A CN113095415B (en) 2021-04-15 2021-04-15 Cross-modal hashing method and system based on multi-modal attention mechanism

Publications (2)

Publication Number Publication Date
CN113095415A CN113095415A (en) 2021-07-09
CN113095415B true CN113095415B (en) 2022-06-14

Family

ID=76678153

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110407112.9A Active CN113095415B (en) 2021-04-15 2021-04-15 Cross-modal hashing method and system based on multi-modal attention mechanism

Country Status (1)

Country Link
CN (1) CN113095415B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113657380B (en) * 2021-08-17 2023-08-18 福州大学 Image aesthetic quality evaluation method integrating multi-mode attention mechanism
CN114022735B (en) * 2021-11-09 2023-06-23 北京有竹居网络技术有限公司 Training method, device, equipment and medium for visual language pre-training model
CN114201621B (en) * 2021-11-24 2024-04-02 人民网股份有限公司 Cross-modal retrieval model construction and retrieval method based on graphic and text cooperative attention
CN113971209B * 2021-12-22 2022-04-19 松立控股集团股份有限公司 Unsupervised cross-modal retrieval method based on attention mechanism enhancement
CN114841243B (en) * 2022-04-02 2023-04-07 中国科学院上海高等研究院 Cross-modal retrieval model training method, cross-modal retrieval method, device and medium
CN115098620B (en) * 2022-07-26 2024-03-29 北方民族大学 Cross-modal hash retrieval method for attention similarity migration
CN115410717B (en) * 2022-09-15 2024-05-21 北京京东拓先科技有限公司 Model training method, data retrieval method, image data retrieval method and device

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110110122A * 2018-06-22 2019-08-09 北京交通大学 Image-text cross-modal retrieval based on a multilayer semantic deep hash algorithm
CN110222140A * 2019-04-22 2019-09-10 中国科学院信息工程研究所 A cross-modal retrieval method based on adversarial learning and asymmetric hashing
CN110765281A * 2019-11-04 2020-02-07 山东浪潮人工智能研究院有限公司 A multi-semantic deeply-supervised cross-modal hash retrieval method

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109299216B * 2018-10-29 2019-07-23 山东师范大学 A cross-modal hash retrieval method and system fusing supervision information
CN110309331B (en) * 2019-07-04 2021-07-27 哈尔滨工业大学(深圳) Cross-modal deep hash retrieval method based on self-supervision
CN111209415B (en) * 2020-01-10 2022-09-23 重庆邮电大学 Image-text cross-modal Hash retrieval method based on mass training
CN111639240B (en) * 2020-05-14 2021-04-09 山东大学 Cross-modal Hash retrieval method and system based on attention awareness mechanism

Also Published As

Publication number Publication date
CN113095415A (en) 2021-07-09

Similar Documents

Publication Publication Date Title
CN113095415B (en) Cross-modal hashing method and system based on multi-modal attention mechanism
CN110298037B (en) Convolutional neural network matching text recognition method based on enhanced attention mechanism
CN112966127B (en) Cross-modal retrieval method based on multilayer semantic alignment
CN112966074B (en) Emotion analysis method and device, electronic equipment and storage medium
CN109376222B (en) Question-answer matching degree calculation method, question-answer automatic matching method and device
CN111753189A (en) Common characterization learning method for few-sample cross-modal Hash retrieval
CN111159485B (en) Tail entity linking method, device, server and storage medium
CN111858940B (en) Multi-head attention-based legal case similarity calculation method and system
WO2023011382A1 (en) Recommendation method, recommendation model training method, and related product
CN108536735B (en) Multi-mode vocabulary representation method and system based on multi-channel self-encoder
KR102379660B1 (en) Method for utilizing deep learning based semantic role analysis
CN115438674B (en) Entity data processing method, entity linking method, entity data processing device, entity linking device and computer equipment
CN113392209A (en) Text clustering method based on artificial intelligence, related equipment and storage medium
CN110969023B (en) Text similarity determination method and device
CN115438215A (en) Image-text bidirectional search and matching model training method, device, equipment and medium
CN111881256B (en) Text entity relation extraction method and device and computer readable storage medium equipment
CN108920446A (en) A processing method for engineering documents
CN115983271A (en) Named entity recognition method and named entity recognition model training method
CN111145914B (en) Method and device for determining text entity of lung cancer clinical disease seed bank
CN115374845A (en) Commodity information reasoning method and device
CN115860006A (en) Aspect level emotion prediction method and device based on semantic syntax
CN110532562B (en) Neural network training method, idiom misuse detection method and device and electronic equipment
Bucher et al. Semantic bottleneck for computer vision tasks
CN114841151A (en) Medical text entity relation joint extraction method based on decomposition-recombination strategy
CN112632223B (en) Case and event knowledge graph construction method and related equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant