CN113095415B - Cross-modal hashing method and system based on multi-modal attention mechanism - Google Patents

Cross-modal hashing method and system based on multi-modal attention mechanism

Info

Publication number
CN113095415B
CN113095415B CN202110407112.9A
Authority
CN
China
Prior art keywords
image
modal
feature vector
text
inputting
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110407112.9A
Other languages
Chinese (zh)
Other versions
CN113095415A
Inventor
鲁芹
吴吉祥
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Qilu University of Technology
Original Assignee
Qilu University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Qilu University of Technology filed Critical Qilu University of Technology
Priority to CN202110407112.9A priority Critical patent/CN113095415B/en
Publication of CN113095415A publication Critical patent/CN113095415A/en
Application granted granted Critical
Publication of CN113095415B publication Critical patent/CN113095415B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention belongs to the field of fusing multi-modal attention mechanisms with cross-modal hash networks, and provides a cross-modal hashing method and system based on a multi-modal attention mechanism. The method comprises a training process and a retrieval process. Training process: inputting image-text pairs with the same semantics, together with their class labels, into the cross-modal hash network model of the multi-modal attention mechanism for training until the model converges, to obtain a trained cross-modal hash network model of the multi-modal attention mechanism. Retrieval process: inputting the image or text to be queried into the trained cross-modal hash network model of the multi-modal attention mechanism, and obtaining the top k retrieved texts or images according to the similarity.

Description

Cross-modal hashing method and system based on multi-modal attention mechanism
Technical Field
The invention belongs to the field of fusing multi-modal attention mechanisms with cross-modal hash networks, and particularly relates to a cross-modal hashing method and system based on a multi-modal attention mechanism.
Background
The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.
Cross-modal retrieval uses data of one modality as a query to retrieve content of another modality with similar semantics. For mutual retrieval between images and texts in particular, this retrieval mode can meet many requirements of people's daily life and work. In the feature extraction of existing cross-modal hashing methods, approaches based on global representation alignment cannot accurately locate the semantically meaningful parts of images and texts, while approaches based on local representation alignment carry a huge computational burden because the similarities between image fragments and text words must be exhaustively aggregated.
With the development of deep learning in various fields, many studies have shown that feature representations extracted by deep learning have stronger expressive power than those of traditional shallow learning methods. Current advanced methods select two similar structural branches to extract deep features of image data and text data respectively, then further process the extracted features of the two modalities and compute the similarity between them. Although this approach has made some progress, problems remain in using deep learning architectures for cross-modal retrieval. The deep features are extracted only from the overall feature information of each modality; they are insufficient to express the local key feature information of the modality and cannot mine the semantic associations between different modalities, which affects retrieval precision and accuracy. In addition, when searching some widely used data sets, the large amount of data and the high computational cost greatly reduce the retrieval speed.
Disclosure of Invention
In order to solve the technical problems in the background art, the invention provides a cross-modal hashing method and system based on a multi-modal attention mechanism, comprising a training process and a retrieval process. In the training process, image features and text features are extracted; a multi-modal attention mechanism performs fine-grained interaction between the features of the image modality and those of the text modality and extracts more refined key feature information within the image and text modalities; finally, the hash representations of the two modalities are learned. In the retrieval process, the image or text to be queried is input into the training module to obtain its binary hash code, this code is compared against the hash codes in the retrieval library with the Hamming distance formula, and the retrieval results are output in order of increasing Hamming distance to obtain the required list of images or texts.
In order to achieve the purpose, the invention adopts the following technical scheme:
a first aspect of the invention provides a cross-modal hashing method based on a multi-modal attention mechanism.
A cross-modal hashing method based on a multi-modal attention mechanism comprises a training process and a retrieval process.
Training process: inputting image-text pairs with the same semantics, together with their class labels, into the cross-modal hash network model of the multi-modal attention mechanism for training until the model converges, to obtain a trained cross-modal hash network model of the multi-modal attention mechanism;
Retrieval process: inputting the image or text to be queried into the trained cross-modal hash network model of the multi-modal attention mechanism, and obtaining the top k retrieved texts or images according to the similarity.
Further, the training process comprises:
step (1-1): inputting images of different categories into an image modal feature extraction network, and extracting global feature vectors of the images;
step (1-2): inputting text data corresponding to the image data in the step (1-1) into a text modal feature extraction network, and extracting a global feature vector of the text;
step (1-3): respectively inputting the global feature vector of the image and the global feature vector of the text into a multi-modal interactive gate, respectively inputting the obtained multi-modal image context feature vector and multi-modal text context feature vector into the cross-modal hash network, and respectively inputting the obtained image feature vector and text feature vector into a hash layer to obtain the binary hash code corresponding to the image feature vector and the binary hash code corresponding to the text feature vector.
Further, the step (1-1) includes:
step (1-1-1): extracting coarse-grained characteristic vectors of an image modality by adopting a Convolutional Neural Network (CNN);
step (1-1-2): inputting the extracted coarse-grained features of the image modality into a mean pooling layer to obtain an image global context feature vector;
step (1-1-3): inputting the coarse-grained feature vector of the image modality into a recurrent neural network GRU to obtain a spatial position feature vector of the image;
step (1-1-4): and adding the image global context feature vector and the spatial position feature vector of the image to obtain the image global feature vector.
Further, the step (1-2) comprises:
step (1-2-1): extracting coarse-grained feature vectors of a text mode by adopting Bi-LSTM in a recurrent neural network;
step (1-2-2): and inputting the coarse-grained feature vector of the text mode into a mean pooling layer to obtain a global feature vector of the text.
Further, the step (1-3) comprises:
step (1-3-11): inputting the global feature vector of the image into a multi-modal interactive gate to obtain a multi-modal image context feature vector;
step (1-3-12): inputting the multi-modal image context feature vector and the coarse-grained feature vector of an image modality into a multi-modal attention function of the image together, and calculating the attention weight of each image region;
step (1-3-13): according to the attention weight of each image region, the coarse-grained feature vector of the image modality and the bias term b_m, calculating the image feature vector through a weighted average;
step (1-3-14): inputting the image feature vector into a hash layer, and calculating a binary hash code corresponding to the image feature vector;
further, the step (1-3) comprises:
step (1-3-21): inputting the global feature vector of the text into a multi-mode interactive gate to obtain a multi-mode text context feature vector;
step (1-3-22): inputting the multi-modal text context feature vector and the coarse-grained feature vector of the text mode into a multi-modal attention function of the text together, and calculating the attention weight of the vocabulary in each text;
step (1-3-23): according to the attention weight of the vocabulary in each text, the coarse-grained feature vector of the text modality and the bias term b_l, calculating the text feature vector through a weighted average;
step (1-3-24): and inputting the text feature vector into a hash layer, and calculating a binary hash code corresponding to the text feature vector.
Further, the retrieval process includes:
step (2-1): inputting the image or text to be queried into the cross-modal hash network model of the multi-modal attention mechanism to obtain the binary hash code corresponding to the image or text;
step (2-2): inputting the binary hash code of the image or of the text into the query library to be retrieved, calculating the Hamming distance between this hash code and each hash code in the query library, and outputting the top k retrieved texts or images in order of increasing Hamming distance.
Further, a cross-modal retrieval loss function is adopted to calculate the similarity between images and texts that share the same class label; the similarity is computed from the loss functions of image-retrieves-image, image-retrieves-text, text-retrieves-text and text-retrieves-image.
A second aspect of the invention provides a cross-modal hashing system based on a multi-modal attention mechanism.
A cross-modal hashing system based on a multi-modal attention mechanism, comprising: a training module and a retrieval module, wherein,
a training module configured to: input image-text pairs with the same semantics, together with their class labels, into the cross-modal hash network model of the multi-modal attention mechanism for training until the model converges, to obtain a trained cross-modal hash network model of the multi-modal attention mechanism;
a retrieval module configured to: input the image or text to be queried into the trained cross-modal hash network model of the multi-modal attention mechanism, and obtain the top k retrieved texts or images according to the similarity.
A third aspect of the invention provides a computer-readable storage medium.
A computer-readable storage medium, on which a computer program is stored, which program, when executed by a processor, performs the steps in the cross-modal hashing method based on a multimodal attention mechanism as described in the first aspect above.
A fourth aspect of the invention provides a computer apparatus.
A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor when executing the program implementing the steps in the multi-modal attention mechanism based cross-modal hashing method as described in the first aspect above.
Compared with the prior art, the invention has the beneficial effects that:
1. The method uses a ResNet-152 network pre-trained on ImageNet to extract image features with deep learning; on this basis it further extracts fine-grained context features of the image, extracts the spatial position features of the image with GRUs, and finally combines the two fine-grained features as the global feature of the image. For text features, features are extracted with a bidirectional LSTM, whose long-term memory alleviates the gradient-explosion problem, preserves semantic consistency within the modality to a certain extent, and improves the computation of the similarity measure.
2. The invention designs a multi-modal interaction gate to perform fine-grained interaction between the image and text modalities, so as to mine the semantic association features between different modalities and balance the amount of information and the semantic complementarity between them. The gated features are input into an attention mechanism to capture the local key information features of the image or text modality, and the attended features are then input into a hash function to obtain the binary hash code representations of the image and the text respectively. During retrieval, the modality to be queried is passed through this process to obtain a hash code, the Hamming distances between this hash code and the hash codes in the retrieval library are calculated, and the retrieval results are output in order of distance.
3. Experiments conducted on several public data sets show that the mAP value of the newly proposed HX_MAN model is improved over existing cross-modal retrieval methods, which verifies the superior performance of the method proposed by the present invention.
Advantages of additional aspects of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, are included to provide a further understanding of the invention; they illustrate exemplary embodiments of the invention and, together with the description, serve to explain the invention without limiting it.
FIG. 1 is a block diagram of a cross-modal Hash network model based on a multi-modal attention mechanism proposed by the present invention;
FIG. 2(a) is a first comparison graph of the importance of visual-spatial position information and semantic complementarity in a cross-modal image retrieval model in an embodiment of the present invention;
FIG. 2(b) is a second comparison graph of the importance of visual-spatial position information and semantic complementarity in a cross-modal image retrieval model in an embodiment of the present invention;
FIG. 2(c) is a third comparison graph of the importance of visual-spatial position information and semantic complementarity in a cross-modal image retrieval model in an embodiment of the present invention;
FIG. 3(a) is a graph of accuracy versus a line for various methods of implementing an "image → text" search using a NUS-WIDE dataset in an embodiment of the present invention;
FIG. 3(b) is a graph of accuracy versus a line for various methods of implementing a "text → image" search using a NUS-WIDE dataset in an embodiment of the present invention;
FIG. 4(a) is a line graph comparing accuracy of various methods for implementing an "image → text" search using the MIR-Flickr25K dataset in an embodiment of the present invention;
FIG. 4(b) is a line graph comparing accuracy of various methods for implementing a "text → image" search using the MIR-Flickr25K dataset in an embodiment of the present invention;
FIG. 5 is a page display diagram of a cross-modal hashing system based on a multi-modal attention mechanism according to an embodiment of the present invention;
FIG. 6 is a comparison graph of the retrieval result on the data set of the cross-modal hashing method based on the multi-modal attention mechanism and the two existing methods in the embodiment of the present invention;
FIG. 7(a) is a visual display diagram of a first retrieval case in an embodiment of the present invention;
fig. 7(b) is a visual display diagram of a second search case in the embodiment of the present invention.
Detailed Description
The invention is further described with reference to the following figures and examples.
It is to be understood that the following detailed description is exemplary and is intended to provide further explanation of the invention as claimed. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of exemplary embodiments according to the invention. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, and it should be understood that when the terms "comprises" and/or "comprising" are used in this specification, they specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof, unless the context clearly indicates otherwise.
Example one
As shown in fig. 1, the present embodiment provides a cross-modal hashing method based on a multi-modal attention mechanism, and the present embodiment is exemplified by applying the method to a server, it is to be understood that the method may also be applied to a terminal, and may also be applied to a system including a terminal and a server, and implemented by interaction between the terminal and the server. The server may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a network server, cloud communication, middleware service, a domain name service, a security service CDN, a big data and artificial intelligence platform, and the like. The terminal may be, but is not limited to, a smart phone, a tablet computer, a laptop computer, a desktop computer, a smart speaker, a smart watch, and the like. The terminal and the server may be directly or indirectly connected through wired or wireless communication, and the application is not limited herein. In this embodiment, the method includes the steps of:
step (1) training process: inputting image-text pairs with the same semantics, together with their class labels, into the cross-modal hash network model of the multi-modal attention mechanism for training until the model converges, to obtain a trained cross-modal hash network model of the multi-modal attention mechanism;
step (2) retrieval process: inputting the image or text to be queried into the trained cross-modal hash network model of the multi-modal attention mechanism, and obtaining the top k retrieved texts or images according to the similarity.
The training process comprises the following steps:
in the training process, the embodiment utilizes the strong feature extraction capability of deep learning to extract the global coarse-grained feature vectors of the image and text modes, and utilizes a multi-mode attention mechanism to perform fine interaction on different modes, so as to search fine-grained association between the image and the text feature vectors on a bottom layer, and then pay attention to local information of the fine-grained features, thereby solving the problem of irrelevant semantics between different modes to a certain extent, and expressing the feature information of the modes from a deep-level network.
Extraction and representation of features:
the feature extraction of images and texts is to preprocess a group of digital data of the images and the texts through a series of steps, then to shrink the dimensionality of the data to a certain degree, and finally to obtain another group of digital vectors capable of expressing modal information, wherein the quality of the group of digital vectors has a great influence on the generalization capability. In the image and text feature extraction of the part, the convolutional neural network CNN verified by a plurality of people is selected to extract image features, and for the text feature extraction, Bi-LSTM in the convolutional neural network is adopted to extract text features.
(1) Image representation: ResNet-152 pre-trained on ImageNet is used as the image feature encoder; the image is scaled to 448 × 448 and input into the CNN. We make one change at this step, namely removing the last pooling layer and outputting the final result as the image coarse-grained feature I. Previous experiments have shown that removing this pooling layer has little impact on the network of this embodiment. The coarse-grained features are then input into the mean pooling layer network. For convenience of description, we denote these input coarse-grained features by {I_1, ..., I_M}, where M is the number of regions in the image and I_i (i ∈ [1, M]) denotes the i-th region of the image.
After the coarse-grained feature representation is obtained, these features are input into the mean pooling layer, whose output expresses deeper feature information and the contextual information of the image; we denote it as the image global context feature vector I^(g):

$$I^{(g)} = \tanh\Big(P^{(0)} \cdot \frac{1}{M}\sum_{i=1}^{M} I_i\Big) \qquad (1)$$

where tanh() is an activation function that performs a non-linear mapping of the feature vector so as to project the features into a common subspace, and P^(0) is a weight matrix by which the image feature vector and the text feature vector can be embedded in the same common space.
Sometimes the effect we see visually may deviate slightly from the latent information expressed by the image, leading to wrong judgments; the reason is that the spatial position information of the image is neglected. As shown in FIG. 2(a) and FIG. 2(b), both images at a glance contain the same two concepts, "car" and "man", but the information they express is completely different. It is therefore difficult to distinguish the two images using only the coarse-grained features mentioned above, because coarse-grained features discard some spatial position information during mean pooling. The spatial position information and the coarse-grained feature information are equally important, and neither can be ignored. To address this, the present embodiment further parses the spatial position information of the images through GRUs, so that the two images can be better distinguished visually. The GRU is a special type of recurrent neural network that has few parameters and is computationally very efficient.
For the resulting image feature vectors {I_1, ..., I_M}, we align them and input them into the GRU in turn to output the positional features among them. This process can be defined by equation (2):

$$h_t = \mathrm{GRU}\big(I_t,\ h_{t-1}\big) \qquad (2)$$

where h_t denotes the hidden state of the GRU at time step t, h_{t-1} denotes the hidden state passed down from the previous node, and I_t denotes the image feature of the t-th region. These states are combined into one hidden state vector {h_1, ..., h_M}; a pooling operation over this group of vectors then yields the spatial position feature I^(d) of the image, which expresses the visual position information of the image.

Finally, the two important pieces of image feature information, I^(g) and I^(d), are added together to obtain the final global feature vector I^(0) of the image:

$$I^{(0)} = I^{(g)} + I^{(d)} \qquad (3)$$
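As an illustration of the image branch described above, the following PyTorch-style sketch (our own simplification, not code from the patent; the module names and the region/hidden dimensions are assumptions) computes the mean-pooled context feature I^(g), the GRU-based spatial feature I^(d), and their sum I^(0) from a set of CNN region features:

```python
import torch
import torch.nn as nn

class ImageGlobalEncoder(nn.Module):
    """Sketch of eqs. (1)-(3): region features -> mean pooling + tanh (I_g),
    GRU over regions -> pooled spatial feature (I_d), summed into I_0."""
    def __init__(self, region_dim=2048, hidden_dim=512):
        super().__init__()
        self.proj = nn.Linear(region_dim, hidden_dim)       # plays the role of P^(0)
        self.gru = nn.GRU(region_dim, hidden_dim, batch_first=True)

    def forward(self, regions):                             # regions: (B, M, region_dim)
        i_g = torch.tanh(self.proj(regions.mean(dim=1)))    # global context feature I^(g)
        states, _ = self.gru(regions)                       # hidden states h_1 .. h_M
        i_d = states.mean(dim=1)                            # pooled spatial feature I^(d)
        return i_g + i_d                                    # global image feature I^(0)
```

In practice the region features would come from a ResNet-152 backbone with its last pooling layer removed, as described above.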
(2) Text representation: for the feature representation of text, a bidirectional LSTM is used as the encoder to generate the coarse-grained feature vectors of the text. Assume the text input is denoted {w_1, ..., w_L}. Each word is first represented by a one-hot vector that indexes the word in the vocabulary. Each one-hot vector is then embedded into the vector space by e_l = P w_l, where P is the embedding matrix. Finally, the vectors are arranged in order and input into the bidirectional LSTM. This process can be represented by equation (4):

$$\overrightarrow{h}_t = \overrightarrow{\mathrm{LSTM}}\big(e_t,\ \overrightarrow{h}_{t-1}\big), \qquad \overleftarrow{h}_t = \overleftarrow{\mathrm{LSTM}}\big(e_t,\ \overleftarrow{h}_{t+1}\big) \qquad (4)$$

where $\overrightarrow{h}_t$ and $\overleftarrow{h}_t$ represent the hidden states of the forward and backward LSTM at time step t, respectively. The two hidden states are added at each time step, i.e. $T_t = \overrightarrow{h}_t + \overleftarrow{h}_t$, constructing the coarse-grained feature vectors {T_1, ..., T_L} of the text.
For the deep feature extraction of the text modality: when the coarse-grained features of the text are extracted, each segment already inherits the sequence information of the previous time step. Therefore, there is no need to extract two kinds of feature information separately as in the image branch; mean pooling alone is used to turn the coarse-grained features of the text into the global feature T^(0) of the text, which encodes the contextual semantics of the words over all sentences of the text modality:

$$T^{(0)} = \frac{1}{L}\sum_{l=1}^{L} T_l \qquad (5)$$
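A corresponding sketch for the text branch, again our own simplified illustration (the vocabulary size, embedding and hidden dimensions are assumptions), producing the word features T_1 .. T_L and the mean-pooled global feature T^(0):

```python
import torch
import torch.nn as nn

class TextGlobalEncoder(nn.Module):
    """Sketch of eqs. (4)-(5): word indices -> embedding -> Bi-LSTM,
    forward/backward states summed per step, then mean-pooled into T^(0)."""
    def __init__(self, vocab_size=10000, embed_dim=300, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)    # embedding matrix P
        self.lstm = nn.LSTM(embed_dim, hidden_dim,
                            batch_first=True, bidirectional=True)

    def forward(self, tokens):                              # tokens: (B, L) word indices
        states, _ = self.lstm(self.embed(tokens))           # (B, L, 2 * hidden_dim)
        fwd, bwd = states.chunk(2, dim=-1)                  # forward / backward halves
        t_words = fwd + bwd                                 # T_1 .. T_L
        return t_words, t_words.mean(dim=1)                 # word features and T^(0)
```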
multimodal attention network:
in most of the previous retrieval methods, they only train out global feature information of different modalities and then mathematically project the feature information into a common space to measure the similarity between each image region and the word. Although the similarity of the two modes can be measured to a certain extent by the method, the global feature information not only consumes more computing resources, but also cannot represent the key information of the modes, and further cannot dig out the depth relation between the two modes at the bottom layer, so that the retrieval accuracy and speed are reduced.
Later, as the multi-modal field developed, scholars proposed the attention mechanism, which has since been widely applied in various fields. Inspired by this prior work, the present embodiment improves on existing methods and proposes a new attention mechanism. As its name suggests, the aim of "attention" is to find out which part most needs to be emphasized. By exploiting its capability for extracting local information, the key information within the modalities can easily be exposed, so that the degree of matching of feature information between different modalities can be better analysed.
Although the above method can enrich the local key information of images and sentences to a certain extent, and its performance is better than that of models not using it, it only mines the key regions within each image or text modality and does not complete the interaction between heterogeneous data, so there remains a problem in capturing the semantic associations between different modalities. As shown in FIG. 2(b) and FIG. 2(c), the language descriptions of the two images are semantically very close, yet the two images are still difficult to distinguish by visual observation. The reason is that only the key information of the text modality is attended to, while the semantic complementarity between the visual part and the text is not considered.
In view of the above problems, the present embodiment adds a multi-modal interaction gate that lets the image and text modalities interact before the attention mechanism is applied, enhancing the representation capability of the image and text by exploiting the semantic complementarity between different modalities. The interaction gate finely fuses the fine-grained image features with the abstract representations of the words, makes the semantics of the different modalities complementary through their interaction, and thereby mines the underlying association between them, improving retrieval accuracy.
In the initial experimental design phase, we considered the simplest way of letting image and text features interact to be adding them directly. As experimentation progressed, however, we found that this direct addition may in practice lead to relatively poor performance. This may be because the image context features and the text context features are not produced by the same extraction method during the training phase; if they are fused in this simple way, some meaningful information of one modality may be obscured by the other in the process. To address this problem of obscured modality information, the interaction gate is designed to let image and text features complement each other semantically, so that these two features from different modalities can interact at the bottom layer.
Specifically, as shown in FIG. 1, the present embodiment inputs the context feature vectors of the image and the text, I^(0) and T^(0), into the semantically complementary interaction gate to let them interact. This process can be represented by equation (6):

$$o^{(I)} = \sigma\big(\alpha \cdot U_I(I^{(0)}) + (1-\alpha)\cdot U_T(T^{(0)})\big)$$
$$o^{(T)} = \sigma\big(\alpha \cdot U_T(T^{(0)}) + (1-\alpha)\cdot U_I(I^{(0)})\big) \qquad (6)$$

where U_I and U_T are dimension-reducing matrices and α is a parameter that prevents information loss while fusing the image and text context features. Finally, each feature produced by the interaction is rescaled to [0, 1] by the sigmoid activation function σ. o^(I) and o^(T) denote the more refined feature vectors output by the multi-modal interaction gate; for convenience, they are referred to as the multimodal image context feature vector and the multimodal text context feature vector, respectively.
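The interaction gate of equation (6) can be sketched as follows (an illustrative reading rather than the patent's reference implementation; the dimensions, the value of α, and the cross-modal mixing follow the interpretation given above):

```python
import torch
import torch.nn as nn

class InteractionGate(nn.Module):
    """Sketch of eq. (6): each modality's context vector is blended with the
    other modality's (weight alpha) and squashed into [0, 1] by a sigmoid."""
    def __init__(self, in_dim=512, out_dim=512, alpha=0.5):
        super().__init__()
        self.u_i = nn.Linear(in_dim, out_dim)   # dimension-reducing matrix U_I
        self.u_t = nn.Linear(in_dim, out_dim)   # dimension-reducing matrix U_T
        self.alpha = alpha                      # fusion parameter alpha

    def forward(self, i0, t0):                  # I^(0), T^(0): (B, in_dim)
        o_i = torch.sigmoid(self.alpha * self.u_i(i0) + (1 - self.alpha) * self.u_t(t0))
        o_t = torch.sigmoid(self.alpha * self.u_t(t0) + (1 - self.alpha) * self.u_i(i0))
        return o_i, o_t                         # multimodal image / text context vectors
```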
After the image and text feature vectors interact at the bottom layer and the semantic association between them is obtained through semantic complementarity, the local key information within the image or text modality can be captured with the attention mechanism. Attention captures what is needed after learning and directly disregards unimportant information regions; the learning result is usually output as a probability map or probability feature vector. The objective of multi-modal attention is to independently exploit the data information of the multimodal image or text context features, which carry semantic complementarity, to explore the fine-grained associations among image regions or words. This is achieved by computing convex combinations of the local features of the image regions or text.
Specifically, for the multi-modal attention module of the image, as shown in FIG. 1, the obtained image feature vectors {I_1, ..., I_M} and the multimodal image context feature vector o^(I), used as the query, are input into the multi-modal attention function f_att of the image to compute the attention weight α_{I,m} of each image region. The multi-modal attention function f_att of the image adopts a two-layer feed-forward perceptron, and a softmax function keeps the weighting over the whole process balanced. The attention weight α_{I,m} can be defined by equation (7):

$$h_{I,m} = \tanh\big(W_I I_m + b_I + W_{I,q}\, o^{(I)} + b_{I,q}\big)$$
$$\alpha_{I,m} = \mathrm{softmax}\big(W_{I,h}\, h_{I,m} + b_{I,h}\big) \qquad (7)$$

where W_I, W_{I,q} and W_{I,h} are parameters of the perceptron, b_I, b_{I,q} and b_{I,h} are its bias terms, h_{I,m} denotes the hidden state of the image's multi-modal attention function at time step m, and tanh() is an activation function. After obtaining the attention weight of each image region, the attended image feature representation vector I^(1) can be computed by weighted averaging:

$$I^{(1)} = P^{(1)}\Big(\sum_{m=1}^{M} \alpha_{I,m}\, I_m + b_m\Big) \qquad (8)$$

where P^(1) is a weight matrix by which the image and text feature vectors can be embedded in the same common space, and b_m is a bias term.
The multi-modal attention module of the text serves the same purpose: it expresses an abstract high-level representation of the words in the text sentence through the attention mechanism, thereby extracting context semantic features with multi-modal attention. The attention weight α_{T,l} is obtained by a soft attention module consisting of a two-layer feed-forward perceptron and a softmax function, and the multi-modal context feature vector T^(1) of the text can be defined by equation (9):

$$h_{T,l} = \tanh\big(W_T T_l + b_T + W_{T,q}\, o^{(T)} + b_{T,q}\big)$$
$$\alpha_{T,l} = \mathrm{softmax}\big(W_{T,h}\, h_{T,l} + b_{T,h}\big)$$
$$T^{(1)} = \sum_{l=1}^{L} \alpha_{T,l}\, T_l + b_l \qquad (9)$$

where W_T, W_{T,q} and W_{T,h} are parameters of the perceptron, b_T, b_{T,q} and b_{T,h} are its bias terms, h_{T,l} denotes the hidden state of the text's multi-modal attention at time step l, T_l is a coarse-grained feature of the text, and b_l is a bias term. Unlike the multi-modal attention model of the image, the multi-modal attention of the text no longer requires an embedding layer after the weighted averaging, because the text features {T_1, ..., T_L} already lie in the common space and training is done in an end-to-end fashion.
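Both attention branches share the two-layer-perceptron structure of equations (7)-(9); a generic sketch follows (our own illustration with assumed dimensions, omitting the final P^(1) projection used only on the image side):

```python
import torch
import torch.nn as nn

class MultiModalAttention(nn.Module):
    """Sketch of eqs. (7)-(9): local features are scored against the gated
    context vector, softmax-normalised, and combined by weighted averaging."""
    def __init__(self, feat_dim=512, ctx_dim=512, hidden_dim=256):
        super().__init__()
        self.w_feat = nn.Linear(feat_dim, hidden_dim)   # W and b for local features
        self.w_ctx = nn.Linear(ctx_dim, hidden_dim)     # W_q and b_q for the context query
        self.w_h = nn.Linear(hidden_dim, 1)             # W_h and b_h scoring layer

    def forward(self, feats, ctx):                      # feats: (B, N, D), ctx: (B, D)
        h = torch.tanh(self.w_feat(feats) + self.w_ctx(ctx).unsqueeze(1))
        alpha = torch.softmax(self.w_h(h).squeeze(-1), dim=1)   # attention weights
        return (alpha.unsqueeze(-1) * feats).sum(dim=1)         # attended feature I^(1)/T^(1)
```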
Hash layer:
In the hash layer, the image feature I^(1) and the text feature T^(1), both carrying multi-modal attention, are input into the hash layer respectively, and binary representations of the different modal features are obtained by learning a hash function. In the hash layer, the Tanh activation function keeps the output of each neuron between -1 and 1, and a Sign function with a threshold of 0 converts it into a binary code: a code value of 1 means the neuron's output is greater than or equal to 0, and a code value of 0 means the output is less than 0. The hash functions of the image and the text are shown in equations (10) and (11), respectively:

$$H_I = \mathrm{Sign}\big(\mathrm{Tanh}(w^{(I)} I^{(1)} + b^{(I)})\big) \qquad (10)$$
$$H_T = \mathrm{Sign}\big(\mathrm{Tanh}(w^{(T)} T^{(1)} + b^{(T)})\big) \qquad (11)$$

where w^(I) and w^(T) are the network parameters of the image and text modality respectively, b^(I) and b^(T) are bias terms, and H_I and H_T are the hash representations of the image and the text, respectively.
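A minimal sketch of the hash layer of equations (10)-(11), under the assumption of one fully connected layer per modality; the code length, feature dimension, and the use of the tanh output as a training-time relaxation of the non-differentiable sign are our assumptions:

```python
import torch
import torch.nn as nn

class HashLayer(nn.Module):
    """Sketch of eqs. (10)-(11): a linear map, tanh into [-1, 1], and a
    zero-threshold sign to obtain binary codes."""
    def __init__(self, feat_dim=512, code_len=64):
        super().__init__()
        self.fc = nn.Linear(feat_dim, code_len)   # w^(I)/w^(T) with bias b^(I)/b^(T)

    def forward(self, x):                         # x: attended feature I^(1) or T^(1)
        h = torch.tanh(self.fc(x))                # relaxed real-valued codes for training
        codes = torch.sign(h)                     # ±1 binary code (threshold 0)
        return codes, h                           # map -1 to 0 for a 0/1 code as in the text
```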
(II) a retrieval process:
in the training process, the present embodiment utilizes the deep learning underlying feature mining capability and the attention mechanism to capture the local key feature information to obtain the respective binary hash code representations of the features of the image modality or the text modality through the hash function. Therefore, when the cross-modal search is performed, a sample of any one modality is taken as a query object, and a similar sample of another different modality can be searched. Specifically, as shown in fig. 1, for image query, a user inputs an image to be queried into a training module to convert image features into a form of a trained binary hash code, inputs the trained hash code into a query library to be retrieved, calculates hamming distances between the hash code and the hash codes in the query library, and sequentially outputs the first k search results from small to large according to the size sequence of the hamming distances; similarly, for text query, a user takes text data as a query object, hash codes of a text mode are obtained through an end-to-end network framework in a training module, then hamming distances between the hash codes and the hash codes in a database to be retrieved are calculated and sequenced, and finally the retrieved first k pictures are output.
An objective function:
the goal of the cross-modal search penalty function is to preserve both intra-modal and inter-heterogeneous modal semantic similarity. The cross-modal search penalty function is shown in equation (12):
F=min(Fv→v+Fv→t+Ft→t+Ft→v) (12)
where v → v, v → t, t → t and t → v denote an image retrieval image, an image retrieval text, a text retrieval text and a text retrieval image, respectively. And Fv→tA loss function representing the image retrieval text, and the remaining loss functions are similar. Loss function F of image retrieval textv→tIs defined as:
Figure GDA0003533861270000171
wherein, (i, j, k) is a triple, representing a minimum edge distance.
Figure GDA0003533861270000172
Representing the euclidean distance of the image currently being the query modality from the positive sample,
Figure GDA0003533861270000173
representing the euclidean distance of the current modality from the negative examples. Fv→tIs the triple ordering penalty, meaning that image i has greater similarity to text j than image i has to text k.
Experimental results and analysis:
in the embodiment, firstly, detailed analysis is performed on data results of a training module in HX _ MAN and a current advanced cross-modal retrieval method, and then, in two public data sets of NUS-WIDE data set and MIR-Flickr25K data set, some evaluation indexes are calculated. Then, benchmarking analysis is performed by using the HX _ MAN model proposed by the embodiment and several existing methods.
Data set and evaluation index:
(1) data set
The NUS-WIDE dataset is a large web image dataset created by a media search laboratory. The dataset contains 260648 images gathered from the Flickr website and 5018 distinct class labels. Each image has its corresponding text label, forming an image-text pair; the texts describing the images are the sets of tag words entered by users when uploading the images. This embodiment performs the baseline analysis on the 194600 image-text pairs belonging to the 20 most common labels in this dataset, where the text of each pair is represented as a 1000-dimensional bag-of-words (BOW) vector. If an image and a text share a label of the same concept, they are considered similar; otherwise they are not.
The MIR-Flickr25K dataset contains 25000 multi-label images collected from the Flickr website, with 24 manually annotated category labels. The experimental data of this example retain the image-text pairs whose text labels appear at least 20 times, giving a total of 20015 pairs, each annotated with one of the 24 category labels. The text of each pair is represented as a 1386-dimensional BOW vector. If an image and a text share a label they are considered similar; otherwise they are considered dissimilar.
(2) Evaluation index
This example uses the mean Average Precision (mAP) to evaluate the model herein. The calculation formula of mAP is shown in equation (14):

$$\mathrm{mAP} = \frac{1}{|Q|}\sum_{q\in Q} \mathrm{AP}(q) \qquad (14)$$

where |Q| denotes the size of the query data set Q, q denotes a given query, and AP denotes the Average Precision:

$$\mathrm{AP}(q) = \frac{1}{M}\sum_{i=1}^{n} P_q(i)\,\delta(i) \qquad (15)$$

where M denotes the number of true neighbors of q in the queried data, n denotes the total amount of data, P_q(i) denotes the precision of the first i retrieved instances, and δ(i) is an indicator function: δ(i) = 1 means the i-th instance is relevant to the query, and δ(i) = 0 means it is not.
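The mAP of equations (14)-(15) can be computed as in the following sketch (our own helper; the input format, a per-query binary relevance vector over the ranked results together with the number of true neighbors M, is an assumption):

```python
import numpy as np

def mean_average_precision(relevance_per_query, true_neighbors_per_query):
    """Sketch of eqs. (14)-(15): relevance_per_query[q][i] is delta(i) for the
    i-th ranked result of query q; true_neighbors_per_query[q] is M for query q."""
    aps = []
    for rel, m in zip(relevance_per_query, true_neighbors_per_query):
        rel = np.asarray(rel, dtype=float)
        p_at_i = np.cumsum(rel) / (np.arange(len(rel)) + 1)   # P_q(i)
        aps.append((p_at_i * rel).sum() / max(m, 1))          # AP(q)
    return float(np.mean(aps))                                # average over |Q| queries
```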
Analysis by the reference method:
as another embodiment, we compare the HX _ MAN model proposed in this embodiment with several existing cross-modal search methods, so as to verify the performance of the model proposed by us. To be able to achieve the results we have expected, we have compared not only the shallow structure based methods (CMFH, SCM, STMH, SePH), but also the two deep structure based methods (DCMH and SDCH). For experimental fairness, we used the ResNet-152 network model pre-trained on ImageNet for all methods for feature extraction for image modalities; for the text modality, we also use Bi-LSTM to extract features. In terms of splitting the dataset, we use 2500 pairs of data in the MIR-Flickr25K dataset as queries and the remaining pairs of data as a search library. For the NUS-WIDE dataset, we chose 1% of the dataset as queries and the rest as the search pool. We took 5500 pairs of data from the corpus as a training set of two data sets. All parameters were randomly initialized using a gaussian function with a mean of 0 and a standard deviation of 0.01. The network is trained here by stochastic gradient descent with a batch value of 64, a total epoch of 60, a learning rate of 0.05, and 1/10 where the learning rate becomes the current value after every 20 iterations.
The comparison between this experiment and the other retrieval methods is shown in Table 1, where "image → text" means the query data is the image modality and the retrieved data is the text modality, and "text → image" means the query data is the text modality and the retrieved data is the image modality. We compare the mAP values of each method with code lengths of 16 bits, 32 bits and 64 bits on the NUS-WIDE dataset and the MIR-Flickr25K dataset. From the experimental results and the comparison data in the table, the methods based on deep structures clearly perform better than those based on shallow structures. This shows, to some extent, that the deep-level features extracted by deep learning improve the accuracy of cross-modal retrieval, and thus that the model proposed herein makes some progress in cross-modal retrieval.
TABLE 1 Comparison data of the HX_MAN model with other cross-modal retrieval models
In addition, to visually demonstrate the comparison between the model herein and the other methods, we show the comparison data as line graphs. FIG. 3(a) compares the accuracy of the methods for "image → text" retrieval on the NUS-WIDE dataset; FIG. 3(b) compares the accuracy for "text → image" retrieval on the NUS-WIDE dataset; FIG. 4(a) compares the accuracy for "image → text" retrieval on the MIR-Flickr25K dataset; FIG. 4(b) compares the accuracy for "text → image" retrieval on the MIR-Flickr25K dataset. As the four figures show, the mAP values of the method of this embodiment are slightly higher on the MIR-Flickr25K dataset than on the NUS-WIDE dataset, and the mAP values for text-retrieves-image are also slightly higher than those for image-retrieves-text. It can be seen that the HX_MAN model of this embodiment outperforms the other methods, which also verifies that the image and text modalities can be better associated through the interaction of the stacked attention mechanism, and that the hashing method improves the speed of cross-modal retrieval.
Visual analysis:
this embodiment will show the page of the cross-modal search system designed by us, and compare the search results with the DCMH method and the SDCH method.
As shown in FIG. 5, our cross-modal retrieval system page is divided into two main parts: image-retrieves-text and text-retrieves-image. For the image-retrieves-text part, the image to be queried is uploaded to the system, which processes it step by step with the method designed herein so as to retrieve image descriptions semantically similar to the image content; the few descriptions with the highest similarity are output in text form and presented to the user. The text-retrieves-image part is analogous: the text content to be queried is uploaded to the system, which then outputs the few images most similar to that text content.
In addition, we randomly picked 3 textual descriptions from the test set of the MIR-Flickr25K dataset for comparative analysis with the DCMH method and the SDCH method. As shown in FIG. 6, each of the three models produces output with its own method, and the best result is selected for comparison. For the first text description, the "dog" in the image output by the DCMH method is "lying on its stomach". For the second text description, the action of the "dog" in the image output by the SDCH method is not "standing". The same problem occurs in the third description. The comparison shows that, after the position feature information is extracted with deep learning, the method of this embodiment produces images that match the visual information in the text description more accurately and clearly, which also illustrates to a certain extent that the method of this embodiment improves retrieval accuracy while maintaining speed.
Although this method improves accuracy and speed compared with other methods, it is not as perfect as expected, and there are small errors in the output results. FIG. 7(a) is a visual display diagram of the first retrieval case in the embodiment of the present invention, in which all 5 of the original descriptions in the visualized result are correct; FIG. 7(b) is a visual display diagram of the second retrieval case, in which the 5th sentence of the visualized result is retrieved incorrectly, although the description retains a certain rationality because it can reasonably fit the real-world background of the picture.
Example two
The embodiment provides a cross-modal hashing system based on a multi-modal attention mechanism.
A cross-modal hashing system based on a multi-modal attention mechanism, comprising: a training module and a retrieval module, wherein,
a training module configured to: input image-text pairs with the same semantics, together with their class labels, into the cross-modal hash network model of the multi-modal attention mechanism for training until the model converges, to obtain a trained cross-modal hash network model of the multi-modal attention mechanism;
a retrieval module configured to: input the image or text to be queried into the trained cross-modal hash network model of the multi-modal attention mechanism, and obtain the top k retrieved texts or images according to the similarity.
It should be noted here that the training module and the retrieving module correspond to the step (1) to the step (2) in the first embodiment, and the modules are the same as the examples and application scenarios realized by the corresponding steps, but are not limited to the disclosure of the first embodiment. It should be noted that the modules described above as part of a system may be implemented in a computer system such as a set of computer-executable instructions.
EXAMPLE III
The present embodiment provides a computer-readable storage medium, on which a computer program is stored, and the program, when executed by a processor, implements the steps in the cross-modal hashing method based on a multi-modal attention mechanism as described in the first embodiment.
Example four
The embodiment provides a computer device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and the processor executes the program to implement the steps in the cross-modal hashing method based on the multi-modal attention mechanism according to the embodiment.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of a hardware embodiment, a software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (7)

1. A cross-modal hashing method based on a multi-modal attention mechanism, comprising a training process and a retrieval process, characterized in that:
training process: inputting image-text pairs with the same semantics, together with their class labels, into the cross-modal hash network model of the multi-modal attention mechanism for training until the cross-modal hash network model of the multi-modal attention mechanism converges, to obtain a trained cross-modal hash network model of the multi-modal attention mechanism;
retrieval process: inputting the image or text to be queried into the trained cross-modal hash network model of the multi-modal attention mechanism, and obtaining the top k retrieved texts or images according to the similarity;
the training process comprises:
step (1-1): inputting images of different categories into an image-modality feature extraction network, and extracting the global feature vector of each image;
step (1-2): inputting the text data corresponding to the image data of step (1-1) into a text-modality feature extraction network, and extracting the global feature vector of the text;
step (1-3): inputting the global feature vector of the image and the global feature vector of the text respectively into a multi-modal interactive gate; inputting the resulting multi-modal image context feature vector and multi-modal text context feature vector respectively into a cross-modal hash network; inputting the resulting image feature vector and text feature vector respectively into a hash layer; and obtaining the binary hash code corresponding to the image feature vector and the binary hash code corresponding to the text feature vector;
the step (1-1) comprises:
step (1-1-1): extracting the coarse-grained feature vector of the image modality by using a convolutional neural network (CNN);
step (1-1-2): inputting the extracted coarse-grained feature vector of the image modality into a mean pooling layer to obtain the image global context feature vector;
step (1-1-3): inputting the coarse-grained feature vector of the image modality into a gated recurrent unit (GRU) recurrent network to obtain the spatial position feature vector of the image;
step (1-1-4): adding the image global context feature vector and the spatial position feature vector of the image to obtain the global feature vector of the image;
the step (1-2) comprises:
step (1-2-1): extracting the coarse-grained feature vector of the text modality by using a bidirectional LSTM (Bi-LSTM) recurrent network;
step (1-2-2): inputting the coarse-grained feature vector of the text modality into a mean pooling layer to obtain the global feature vector of the text;
the step (1-3) comprises:
step (1-3-11): inputting the global feature vector of the image into a multi-modal interactive gate to obtain a multi-modal image context feature vector;
step (1-3-12): inputting the multi-modal image context feature vector and the coarse-grained feature vector of the image modality together into the multi-modal attention function of the image, and calculating the attention weight of each image region;
step (1-3-13): calculating an image feature vector through a weighted average according to the attention weight of each image region, the coarse-grained feature vector of the image modality, and b_m;
step (1-3-14): inputting the image feature vector into a hash layer, and calculating the binary hash code corresponding to the image feature vector.
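The image branch recited in claim 1 can be summarized in code. The sketch below is illustrative only: the layer sizes, the sigmoid-gated form of the multi-modal interactive gate, the single-layer attention scorer, and all names (ImageBranch, hash_bits, text_global, ...) are assumptions made for exposition rather than details taken from the patent, and b_m of step (1-3-13) is treated as a zero bias.

```python
import torch
import torch.nn as nn

class ImageBranch(nn.Module):
    def __init__(self, hash_bits=64, dim=256):
        super().__init__()
        # step (1-1-1): a small CNN stands in for the image-modality feature extractor
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, dim, 3, stride=2, padding=1), nn.ReLU(),
        )
        # step (1-1-3): GRU over the sequence of region features
        self.gru = nn.GRU(dim, dim, batch_first=True)
        # multi-modal interactive gate (assumed sigmoid-gated fusion of the two global vectors)
        self.gate = nn.Linear(2 * dim, dim)
        # multi-modal attention function (assumed single-layer scorer)
        self.att = nn.Linear(2 * dim, 1)
        # hash layer; sign() of its tanh output gives the binary code
        self.hash = nn.Linear(dim, hash_bits)

    def forward(self, images, text_global):
        feats = self.cnn(images)                          # (B, dim, H, W) coarse-grained features
        regions = feats.flatten(2).transpose(1, 2)        # (B, R, dim), one vector per region
        global_ctx = regions.mean(dim=1)                  # step (1-1-2): mean pooling
        spatial, _ = self.gru(regions)                    # step (1-1-3): spatial position features
        img_global = global_ctx + spatial[:, -1]          # step (1-1-4): sum of the two vectors
        # step (1-3-11): interactive gate -> multi-modal image context vector
        g = torch.sigmoid(self.gate(torch.cat([img_global, text_global], dim=-1)))
        ctx = g * img_global + (1 - g) * text_global
        # step (1-3-12): attention weight for every image region
        expanded = ctx.unsqueeze(1).expand_as(regions)
        alpha = torch.softmax(self.att(torch.cat([regions, expanded], dim=-1)), dim=1)
        # step (1-3-13): weighted average of the region features (b_m treated as zero here)
        img_feat = (alpha * regions).sum(dim=1)
        # step (1-3-14): hash layer, then sign() -> binary hash code in {-1, +1}
        return torch.sign(torch.tanh(self.hash(img_feat)))

# usage (shapes only): ImageBranch()(torch.randn(2, 3, 64, 64), torch.randn(2, 256))
```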
2. The cross-modal hashing method based on the multi-modal attention mechanism according to claim 1, wherein step (1-3) further comprises:
step (1-3-21): inputting the global feature vector of the text into the multi-modal interactive gate to obtain a multi-modal text context feature vector;
step (1-3-22): inputting the multi-modal text context feature vector and the coarse-grained feature vector of the text modality together into the multi-modal attention function of the text, and calculating the attention weight of each word in the text;
step (1-3-23): calculating a text feature vector through a weighted average according to the attention weight of each word in the text, the coarse-grained feature vector of the text modality, and b_l;
step (1-3-24): inputting the text feature vector into a hash layer, and calculating the binary hash code corresponding to the text feature vector.
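A matching sketch of the text branch in claim 2, under the same assumptions as the image-branch sketch above (the Bi-LSTM hidden size, the gate, and the attention form are illustrative guesses, and b_l of step (1-3-23) is again treated as a zero bias):

```python
import torch
import torch.nn as nn

class TextBranch(nn.Module):
    def __init__(self, vocab_size=10000, dim=256, hash_bits=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        # step (1-2-1): Bi-LSTM gives coarse-grained word-level features
        self.bilstm = nn.LSTM(dim, dim // 2, batch_first=True, bidirectional=True)
        self.gate = nn.Linear(2 * dim, dim)   # multi-modal interactive gate (assumed form)
        self.att = nn.Linear(2 * dim, 1)      # multi-modal attention over words (assumed form)
        self.hash = nn.Linear(dim, hash_bits)

    def forward(self, tokens, image_global):
        words, _ = self.bilstm(self.embed(tokens))        # (B, T, dim) coarse-grained features
        txt_global = words.mean(dim=1)                    # step (1-2-2): mean pooling
        # step (1-3-21): interactive gate -> multi-modal text context vector
        g = torch.sigmoid(self.gate(torch.cat([txt_global, image_global], dim=-1)))
        ctx = g * txt_global + (1 - g) * image_global
        # step (1-3-22): attention weight of every word
        expanded = ctx.unsqueeze(1).expand_as(words)
        alpha = torch.softmax(self.att(torch.cat([words, expanded], dim=-1)), dim=1)
        txt_feat = (alpha * words).sum(dim=1)             # step (1-3-23): weighted average
        # step (1-3-24): hash layer, then sign() -> binary hash code
        return torch.sign(torch.tanh(self.hash(txt_feat)))

# usage (shapes only): TextBranch()(torch.randint(0, 10000, (2, 12)), torch.randn(2, 256))
```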
3. The cross-modal hashing method based on the multi-modal attention mechanism according to claim 1, wherein the retrieval process comprises:
step (2-1): inputting the image or text to be queried into the cross-modal hash network model based on the multi-modal attention mechanism, to obtain the binary hash code corresponding to the image or text;
step (2-2): comparing the binary hash code of the image or text with the hash codes in the retrieval database, calculating the Hamming distance between the query hash code and each hash code in the database, and outputting the top k retrieved texts or images in ascending order of Hamming distance.
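The retrieval step of claim 3 reduces to ranking stored hash codes by Hamming distance to the query code. A minimal sketch, assuming the codes are {-1, +1} vectors as produced by a sign() hash layer; the function name and array shapes are illustrative:

```python
import numpy as np

def retrieve_top_k(query_code, db_codes, k=10):
    """query_code: (bits,) in {-1, +1}; db_codes: (N, bits); returns indices of the k closest items."""
    bits = query_code.shape[0]
    # for {-1, +1} codes, Hamming distance = (bits - inner product) / 2
    hamming = (bits - db_codes @ query_code) / 2
    return np.argsort(hamming)[:k]          # ascending Hamming distance, as in step (2-2)

# usage:
# db = np.sign(np.random.randn(1000, 64)); q = np.sign(np.random.randn(64))
# print(retrieve_top_k(q, db, k=5))
```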
4. The cross-modal hashing method based on the multi-modal attention mechanism, characterized in that a cross-modal retrieval loss function is adopted to calculate the similarity between images and texts having the same class labels, and the similarities between images and between texts are calculated according to the loss functions for image-retrieves-image, image-retrieves-text, and text-retrieves-image retrieval.
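Claim 4 does not give the loss in closed form. As an assumed stand-in, the sketch below uses a pairwise negative log-likelihood similarity loss of the kind common in deep cross-modal hashing, applied to the image-retrieves-image, image-retrieves-text, and text-retrieves-image directions named in the claim; the real-valued hash features (before binarization) and a label-based similarity matrix are the inputs, and all names are hypothetical.

```python
import torch
import torch.nn.functional as F

def pairwise_nll(a, b, sim):
    """a, b: (N, bits) real-valued hash features; sim[i, j] = 1 if same class label, else 0."""
    theta = 0.5 * a @ b.t()                              # pairwise inner products
    return (F.softplus(theta) - sim * theta).mean()      # negative log-likelihood of sim

def cross_modal_loss(img_feats, txt_feats, sim):
    return (pairwise_nll(img_feats, img_feats, sim)      # image retrieves image
            + pairwise_nll(img_feats, txt_feats, sim)    # image retrieves text
            + pairwise_nll(txt_feats, img_feats, sim))   # text retrieves image

# usage: cross_modal_loss(torch.randn(8, 64), torch.randn(8, 64), (torch.rand(8, 8) > 0.5).float())
```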
5. A cross-modal hashing system based on a multi-modal attention mechanism, comprising a training module and a retrieval module, characterized in that:
the training module is configured to: input image-text pairs with the same semantics, together with their class labels, into a cross-modal hash network model based on the multi-modal attention mechanism, and train until the model converges, to obtain a trained cross-modal hash network model based on the multi-modal attention mechanism;
the retrieval module is configured to: input the image or text to be queried into the trained cross-modal hash network model based on the multi-modal attention mechanism, and obtain the top k retrieved texts or images according to similarity;
the training module comprises:
step (1-1): inputting images of different categories into an image-modality feature extraction network, and extracting the global feature vector of each image;
step (1-2): inputting the text data corresponding to the image data of step (1-1) into a text-modality feature extraction network, and extracting the global feature vector of the text;
step (1-3): inputting the global feature vector of the image and the global feature vector of the text respectively into a multi-modal interactive gate; inputting the resulting multi-modal image context feature vector and multi-modal text context feature vector respectively into a cross-modal hash network; inputting the resulting image feature vector and text feature vector respectively into a hash layer; and obtaining the binary hash code corresponding to the image feature vector and the binary hash code corresponding to the text feature vector;
the step (1-1) comprises:
step (1-1-1): extracting the coarse-grained feature vector of the image modality by using a convolutional neural network (CNN);
step (1-1-2): inputting the extracted coarse-grained feature vector of the image modality into a mean pooling layer to obtain the image global context feature vector;
step (1-1-3): inputting the coarse-grained feature vector of the image modality into a gated recurrent unit (GRU) recurrent network to obtain the spatial position feature vector of the image;
step (1-1-4): adding the image global context feature vector and the spatial position feature vector of the image to obtain the global feature vector of the image;
the step (1-2) comprises:
step (1-2-1): extracting the coarse-grained feature vector of the text modality by using a bidirectional LSTM (Bi-LSTM) recurrent network;
step (1-2-2): inputting the coarse-grained feature vector of the text modality into a mean pooling layer to obtain the global feature vector of the text;
the step (1-3) comprises:
step (1-3-11): inputting the global feature vector of the image into a multi-modal interactive gate to obtain a multi-modal image context feature vector;
step (1-3-12): inputting the multi-modal image context feature vector and the coarse-grained feature vector of the image modality together into the multi-modal attention function of the image, and calculating the attention weight of each image region;
step (1-3-13): calculating an image feature vector through a weighted average according to the attention weight of each image region, the coarse-grained feature vector of the image modality, and b_m;
step (1-3-14): inputting the image feature vector into a hash layer, and calculating the binary hash code corresponding to the image feature vector.
6. A computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, carries out the steps of the cross-modal hashing method based on a multi-modal attention mechanism according to any one of claims 1-4.
7. A computer device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the steps of the cross-modal hashing method based on a multi-modal attention mechanism according to any one of claims 1-4.
CN202110407112.9A 2021-04-15 2021-04-15 Cross-modal hashing method and system based on multi-modal attention mechanism Active CN113095415B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110407112.9A CN113095415B (en) 2021-04-15 2021-04-15 Cross-modal hashing method and system based on multi-modal attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110407112.9A CN113095415B (en) 2021-04-15 2021-04-15 Cross-modal hashing method and system based on multi-modal attention mechanism

Publications (2)

Publication Number Publication Date
CN113095415A CN113095415A (en) 2021-07-09
CN113095415B true CN113095415B (en) 2022-06-14

Family

ID=76678153

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110407112.9A Active CN113095415B (en) 2021-04-15 2021-04-15 Cross-modal hashing method and system based on multi-modal attention mechanism

Country Status (1)

Country Link
CN (1) CN113095415B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113657380B (en) * 2021-08-17 2023-08-18 福州大学 Image aesthetic quality evaluation method integrating multi-mode attention mechanism
CN114022735B (en) * 2021-11-09 2023-06-23 北京有竹居网络技术有限公司 Training method, device, equipment and medium for visual language pre-training model
CN114201621B (en) * 2021-11-24 2024-04-02 人民网股份有限公司 Cross-modal retrieval model construction and retrieval method based on graphic and text cooperative attention
CN113971209B * 2021-12-22 2022-04-19 松立控股集团股份有限公司 Unsupervised cross-modal retrieval method based on attention mechanism enhancement
CN114841243B (en) * 2022-04-02 2023-04-07 中国科学院上海高等研究院 Cross-modal retrieval model training method, cross-modal retrieval method, device and medium
CN115098620B (en) * 2022-07-26 2024-03-29 北方民族大学 Cross-modal hash retrieval method for attention similarity migration
CN115410717B (en) * 2022-09-15 2024-05-21 北京京东拓先科技有限公司 Model training method, data retrieval method, image data retrieval method and device

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110110122A * 2018-06-22 2019-08-09 北京交通大学 Image-text cross-modal retrieval based on a multilayer semantic deep hash algorithm
CN110222140A * 2019-04-22 2019-09-10 中国科学院信息工程研究所 A cross-modal retrieval method based on adversarial learning and asymmetric hashing
CN110765281A * 2019-11-04 2020-02-07 山东浪潮人工智能研究院有限公司 A multi-semantic deeply-supervised cross-modal hash retrieval method

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109299216B * 2018-10-29 2019-07-23 山东师范大学 A cross-modal hash retrieval method and system fusing supervision information
CN110309331B (en) * 2019-07-04 2021-07-27 哈尔滨工业大学(深圳) Cross-modal deep hash retrieval method based on self-supervision
CN111209415B (en) * 2020-01-10 2022-09-23 重庆邮电大学 Image-text cross-modal Hash retrieval method based on mass training
CN111639240B (en) * 2020-05-14 2021-04-09 山东大学 Cross-modal Hash retrieval method and system based on attention awareness mechanism

Also Published As

Publication number Publication date
CN113095415A (en) 2021-07-09

Similar Documents

Publication Publication Date Title
CN113095415B (en) Cross-modal hashing method and system based on multi-modal attention mechanism
CN110298037B (en) Convolutional neural network matching text recognition method based on enhanced attention mechanism
CN112966127B (en) Cross-modal retrieval method based on multilayer semantic alignment
CN112966074B (en) Emotion analysis method and device, electronic equipment and storage medium
CN109376222B (en) Question-answer matching degree calculation method, question-answer automatic matching method and device
CN111753189A (en) Common characterization learning method for few-sample cross-modal Hash retrieval
CN111159485B (en) Tail entity linking method, device, server and storage medium
CN111858940B (en) Multi-head attention-based legal case similarity calculation method and system
WO2023011382A1 (en) Recommendation method, recommendation model training method, and related product
CN108536735B (en) Multi-mode vocabulary representation method and system based on multi-channel self-encoder
KR102379660B1 (en) Method for utilizing deep learning based semantic role analysis
CN115438674B (en) Entity data processing method, entity linking method, entity data processing device, entity linking device and computer equipment
CN113392209A (en) Text clustering method based on artificial intelligence, related equipment and storage medium
CN110969023B (en) Text similarity determination method and device
CN115438215A (en) Image-text bidirectional search and matching model training method, device, equipment and medium
CN111881256B (en) Text entity relation extraction method and device and computer readable storage medium equipment
CN108920446A (en) A processing method for engineering documents
CN115983271A (en) Named entity recognition method and named entity recognition model training method
CN111145914B (en) Method and device for determining text entity of lung cancer clinical disease seed bank
CN115374845A (en) Commodity information reasoning method and device
CN115860006A (en) Aspect level emotion prediction method and device based on semantic syntax
CN110532562B (en) Neural network training method, idiom misuse detection method and device and electronic equipment
Bucher et al. Semantic bottleneck for computer vision tasks
CN114841151A (en) Medical text entity relation joint extraction method based on decomposition-recombination strategy
CN112632223B (en) Case and event knowledge graph construction method and related equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant