CN111639240B - Cross-modal Hash retrieval method and system based on attention awareness mechanism

Cross-modal Hash retrieval method and system based on attention awareness mechanism

Info

Publication number
CN111639240B
Authority
CN
China
Prior art keywords
modal
cross
hash
data
attention
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010408302.8A
Other languages
Chinese (zh)
Other versions
CN111639240A (en)
Inventor
罗昕 (Luo Xin)
姚洪磊 (Yao Honglei)
许信顺 (Xu Xinshun)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong University
Original Assignee
Shandong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong University
Priority to CN202010408302.8A
Publication of CN111639240A
Application granted
Publication of CN111639240B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/903 Querying
    • G06F16/90335 Query processing
    • G06F16/9035 Filtering based on additional data, e.g. user or group profiles
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/08 Learning methods

Abstract

The invention discloses a cross-modal hash retrieval method and system based on an attention-aware mechanism, comprising the following steps: performing feature extraction and attention feature extraction on a training set in a cross-modal data set to obtain cross-modal features weighted by the attention features; inputting the cross-modal features of cross-modal data pairs into a hash learning model, and optimizing the hash learning model according to the output cross-modal hash codes with minimization of a loss function as the objective; and screening out, according to the hash code of the data to be retrieved obtained by the optimized hash learning model, the modal data satisfying the retrieval requirement from the hash codes of modal data whose modality differs from that of the data to be retrieved. The attention mechanism is applied to the cross-modal hash retrieval task, and a novel attention method, the attention-aware mechanism, is proposed; noise and redundancy in the original data are suppressed while the key attention regions are enhanced, improving the generation quality of the hash codes.

Description

Cross-modal Hash retrieval method and system based on attention awareness mechanism
Technical Field
The invention relates to the technical field of cross-modal hash retrieval, and in particular to a cross-modal hash retrieval method and system based on an attention-aware mechanism.
Background
The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.
With the explosive growth of network multimedia data, it is often necessary to retrieve texts or videos related to an existing image, or to retrieve images or videos based on a text, that is, to use data of one modality to search for similar samples of another modality; meanwhile, efficient storage and fast querying of such data have also become difficult problems.
Cross-modal retrieval technology aims to retrieve, from existing data, data of a different modality that matches it, such as searching a database for a set of pictures that fit a given text description. The prior art can be divided into deep models and non-deep models according to whether deep learning techniques are employed. A traditional deep cross-modal hash retrieval model generally proceeds in three steps: first, features of the different modalities are extracted with deep networks; then, from the extracted features, a hash function is learned with a fully-connected network under the supervision of a cross-entropy loss and a sample similarity matrix; finally, samples are converted into hash codes by the hash function and stored in a database.
At present, many cross-modal hash retrieval methods have been proposed, but the inventors found that the prior art has at least the following problems. For a retrieval task, real data often contains noise and redundancy; during feature extraction, the most useful visual information needs to be extracted while background information is ignored, because background information interferes with retrieval. In actual data, however, the valuable category information covers only a small part, and most regions are background; most current cross-modal retrieval methods neglect this problem and learn features directly from the original data, so that invalid or redundant information misleads the features and low-quality hash codes are generated. In addition, in order to improve retrieval performance, many deep cross-modal hash retrieval models with better results introduce network models with more parameters, such as GANs (generative adversarial networks), which greatly increases training and retrieval time.
Disclosure of Invention
In order to solve the above problems, the invention provides a cross-modal hash retrieval method and system based on an attention-aware mechanism, in which the attention mechanism is applied to the cross-modal hash retrieval task and a novel attention method, the attention-aware mechanism, is proposed.
In order to achieve the purpose, the invention adopts the following technical scheme:
in a first aspect, the present invention provides a cross-modal hash retrieval method based on an attention-aware mechanism, including:
performing feature extraction and attention feature extraction on a training set in the cross-modal data set to obtain cross-modal features weighted by the attention features;
inputting the cross-modal features of the cross-modal data pairs in the training set into a hash learning model, and optimizing the hash learning model according to the output cross-modal hash codes with minimization of a loss function as the objective;
and screening out, according to the hash code of the data to be retrieved obtained by the optimized hash learning model, the modal data satisfying the retrieval requirement from the hash codes of modal data in the cross-modal data set whose modality differs from that of the data to be retrieved.
In a second aspect, the present invention provides a cross-modal hash retrieval system based on an attention-aware mechanism, including:
the feature extraction module is used for performing feature extraction and attention feature extraction on a training set in the cross-modal data set to obtain cross-modal features weighted by the attention features;
the hash learning module is used for inputting the cross-modal features of the cross-modal data pairs in the training set into the hash learning model and optimizing the hash learning model according to the output cross-modal hash codes with minimization of a loss function as the objective;
and the retrieval module is used for screening out, according to the hash code of the data to be retrieved obtained by the optimized hash learning model, the modal data satisfying the retrieval requirement from the hash codes of modal data in the cross-modal data set whose modality differs from that of the data to be retrieved.
In a third aspect, the present invention provides an electronic device comprising a memory, a processor, and computer instructions stored on the memory and executable on the processor; when the computer instructions are executed by the processor, the method of the first aspect is performed.
In a fourth aspect, the present invention provides a computer readable storage medium for storing computer instructions which, when executed by a processor, perform the method of the first aspect.
Compared with the prior art, the invention has the beneficial effects that:
the cross-modal data set comprises multiple modal data, and the multiple modal data can simultaneously perform feature learning and hash code learning, so that the hash code generation efficiency is improved.
The invention provides a novel attention method, the attention-aware mechanism, which applies the attention mechanism to the cross-modal hash retrieval task and weights the two different modalities. It can highlight the key parts of the cross-modal data, such as the region of a picture where an object is located or a certain word in the text input, and can suppress the influence of redundant or invalid parts on the retrieval effect, such as the picture background or interfering words in the text. This effectively improves the quality of hash-code generation, and the method is applicable to cross-modal retrieval tasks in various multi-modal data scenarios.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate exemplary embodiments of the invention and, together with the description, serve to explain the invention without limiting it.
FIGS. 1(a)-(b) are picture modality data;
FIG. 1(c) shows the ten most frequent annotation words of the texts in the public data set MIRFlickr-25K;
FIG. 1(d) is the text annotation data of FIG. 1(a);
fig. 2 is a flowchart of a cross-modal hash retrieval method based on an attention-aware mechanism according to embodiment 1 of the present invention;
fig. 3 is a flowchart of image attention feature extraction provided in embodiment 1 of the present invention;
fig. 4 is a flowchart of text attention feature extraction provided in embodiment 1 of the present invention;
fig. 5 is a structural diagram of a cross-modal hash retrieval system based on an attention-aware mechanism according to embodiment 1 of the present invention.
Detailed Description of the Embodiments:
the invention is further described with reference to the following figures and examples.
It is to be understood that the following detailed description is exemplary and is intended to provide further explanation of the invention as claimed. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit exemplary embodiments according to the invention. As used herein, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should further be understood that the terms "comprises" and "comprising", and any variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
The embodiments and features of the embodiments of the present invention may be combined with each other without conflict.
Embodiment 1
At present, various cross-modal hash retrieval methods have been proposed, but because real data contains noise and redundancy, current retrieval methods that learn features directly from the original data allow invalid or redundant information to mislead the features, so that low-quality hash codes are generated. Taking the two modalities of pictures and texts as an example, as shown in FIGS. 1(a)-1(b): for the picture of FIG. 1(a), it is necessary to highlight the regions where the bees and flowers are located and ignore the background behind them, because the background interferes with retrieval; likewise, for the picture of FIG. 1(b), whose labels, i.e. the supervision information, are "animal", "flower" and "plant life", the most useful visual information might be the butterfly hovering over the flower. However, these valuable categories of information cover only a small portion of the entire image, while most of the area in the image is background.
As shown in FIG. 1(c), which lists the ten most frequent annotation words of the texts in the public data set MIRFlickr-25K, half of the words ("explore", "canon", "bw", "nikon", and "2007") are invalid words that have no direct relationship to the image content; FIG. 1(d) shows the text labels of FIG. 1(a), of which only the word "bees" is relevant to the retrieval task.
Therefore, if the noise and redundancy in the original data are not suppressed, low-quality hash codes are easily generated and the retrieval result is affected.
The attention mechanism has been widely applied in recent years in fields such as natural language processing, object detection, image recognition and speech recognition, but it is rarely used in the cross-modal retrieval direction. The traditional attention mechanism, as used for image recognition, can automatically find the parts of a picture that need attention: a mask of the same size as the picture representation (the picture representation can be the original picture, a feature map, etc.) is generated through learning, and for attended regions the corresponding mask positions have higher activation values. Attention models can generally be divided into spatial attention models and channel attention models according to the region they act on. The spatial attention model generates attention values for different positions in the feature map; mapped back to the original picture, this means that different positions in the picture influence the task to different degrees. The channel attention mechanism generates attention values for different channels of the feature map and is more abstract.
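For illustration, a minimal PyTorch sketch of such a spatial attention mask follows; the 1x1-convolution mask network and the tensor shapes are assumptions chosen for the example, not the architecture of this embodiment:

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Minimal spatial attention: learn a (B, 1, H, W) mask over a
    (B, C, H, W) feature map and weight the features with it."""
    def __init__(self, in_channels: int):
        super().__init__()
        # a 1x1 convolution collapses the channels into a single mask plane
        self.mask_conv = nn.Conv2d(in_channels, 1, kernel_size=1)

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        # mask values lie in (0, 1); attended positions get higher activations
        mask = torch.sigmoid(self.mask_conv(feat))   # (B, 1, H, W)
        return feat * mask                           # broadcast over channels

feat = torch.randn(2, 512, 7, 7)
weighted = SpatialAttention(512)(feat)               # same shape as feat
```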
This embodiment incorporates a spatial attention mechanism, applies the attention mechanism to the cross-modal hash retrieval task, and proposes, on the basis of the traditional attention mechanism, a new attention method, namely an attention-aware mechanism, which is used to weight the two different modalities.
That is, in the cross-modal hash retrieval method based on the attention-aware mechanism of this embodiment, noise and redundancy in the raw data are suppressed and the key attention regions are enhanced in order to extract the attention matrix; this improves the quality of the generated hash codes, and the method can be used for cross-modal information retrieval in various multi-modal data scenarios. As shown in FIG. 2, the method specifically includes the following steps:
S1: performing feature extraction and attention feature extraction on the training set in the cross-modal data set to obtain cross-modal features weighted by the attention features;
S2: inputting the cross-modal features of the cross-modal data pairs in the training set into a hash learning model, and optimizing the hash learning model according to the output cross-modal hash codes with minimization of a loss function as the objective;
S3: screening out, according to the hash code of the data to be retrieved obtained by the optimized hash learning model, the modal data satisfying the retrieval requirement from the hash codes of modal data in the cross-modal data set whose modality differs from that of the data to be retrieved.
In step S1, the cross-modal data set includes data of multiple modalities. This embodiment takes image modality data and text modality data as an example; it is understood that the modality types may be extended to others, such as video and voice.
The cross-modal data set is divided into a training set and a test set, and two parallel convolutional neural networks are adopted to perform feature extraction and attention feature extraction on the image and text cross-modal data of the training set simultaneously. Specifically: an initial attention matrix is acquired, the convolutional neural network is trained with minimization of a loss function, and an improved attention matrix is output; a dot-product operation is then performed between the attention matrix and the feature matrix output by the convolutional neural network to obtain the cross-modal features weighted by the attention features.
The image feature extraction and image attention feature extraction performed on the images in the training set specifically include:
S1-1: in the image feature extraction process, the convolutional neural network CNN-F is used as the basic network structure, and the image feature matrix is output at the fifth convolutional layer Conv5;
S1-2: the image attention feature extraction process includes: (1) introducing an attention layer between the fifth convolutional layer and the fully-connected layers, thereby modifying the residual network ResNet-50: as shown in FIG. 3, the fully-connected layers are replaced by a new convolutional layer Conv6 and a max-pooling layer (Max pooling), where the Conv6 layer is introduced to ensure that the size of the final attention map is consistent with the size of the image feature matrix output by the Conv5 layer in the image feature extraction process. The initial attention matrix O is extracted using the modified ResNet-50 network, and the network is pre-trained using a cross-entropy function as the loss function.
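A hedged sketch of this modified network in PyTorch is given below; the 1x1 kernel for Conv6, the use of torchvision's ResNet-50, and the 6x6 target grid are assumptions chosen so that the attention map matches a Conv5-sized feature matrix:

```python
import torch
import torch.nn as nn
import torchvision

class AttentionBackbone(nn.Module):
    """ResNet-50 with the fully-connected head replaced by a new
    convolutional layer (Conv6) plus max pooling, as in S1-2 (1)."""
    def __init__(self, num_classes: int, grid: int = 6):
        super().__init__()
        resnet = torchvision.models.resnet50(weights=None)
        # keep everything up to and including the last residual stage
        self.body = nn.Sequential(*list(resnet.children())[:-2])
        self.conv6 = nn.Conv2d(2048, num_classes, kernel_size=1)  # per-class maps
        self.pool = nn.AdaptiveMaxPool2d(grid)   # match the Conv5 map size

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # output O: (B, num_classes, grid, grid) class scores per region;
        # the network is pre-trained under a cross-entropy loss and O serves
        # as the initial attention matrix
        return self.pool(self.conv6(self.body(x)))
```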
(2) The initial attention matrix is then refined:

O'_ir = sigmoid(max_k(O_irk)),

where O'_ir is the attention weight corresponding to the r-th region of picture I_i, and O_irk is the value of the k-th class (N_c classes in total) at the same position in the pre-training network output O.

The final attention matrix is obtained from O'_i by thresholding against a threshold mu_i (the exact piecewise form is given in the original equation image: regions whose weight exceeds mu_i are enhanced and the remaining regions are suppressed). The threshold mu_i can be calculated as follows: the attention values of the different regions of the picture are sorted in ascending order; assuming that about p% (0 < p < 100) of a picture is background and the remaining part (about (100-p)%) is the key region, mu_i takes the activation value at the corresponding position of the sorted O'_i (approximately the (N_r * p/100)-th value), where N_r = n x n denotes the number of regions.

(3) The final attention matrix is extended along the channel dimension to obtain a new weight matrix, and a point-multiplication operation is then performed with the image feature matrix output by the Conv5 layer to obtain the image features weighted by the image attention features.
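The per-region weighting and thresholding of steps (2)-(3) can be sketched as follows; since the exact piecewise rule is only given in the original equation image, the enhance-above-threshold / keep-below-threshold rule used here is an assumption:

```python
import torch

def image_attention(O: torch.Tensor, p: float = 50.0) -> torch.Tensor:
    """O: (Nr, Nc) pre-training network output for one picture,
    with Nr = n*n regions and Nc classes. Returns (Nr,) region weights."""
    # O'_ir = sigmoid(max over the Nc classes at each region)
    o_prime = torch.sigmoid(O.max(dim=1).values)
    # mu_i: activation value at the p% position of the ascending sort
    sorted_vals, _ = torch.sort(o_prime)
    idx = min(int(len(sorted_vals) * p / 100.0), len(sorted_vals) - 1)
    mu = sorted_vals[idx]
    # assumed rule: enhance key regions (>= mu_i), keep the rest as-is
    return torch.where(o_prime >= mu, 1.0 + o_prime, o_prime)

def weight_conv5(conv5_feat: torch.Tensor, attn: torch.Tensor) -> torch.Tensor:
    """conv5_feat: (C, n, n) image feature matrix; attn: (n*n,) weights."""
    C, n, _ = conv5_feat.shape
    mask = attn.reshape(1, n, n).expand(C, n, n)   # extend over the channels
    return conv5_feat * mask                       # point-wise product
```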
The text feature extraction and text attention feature extraction performed on the texts in the training set specifically include:
S1-3: in the text feature extraction process, two fully-connected layers are adopted to obtain the text features;
S1-4: the text attention feature extraction process includes: (1) introducing an attention layer before the first fully-connected layer Fc1; a neural network without hidden layers, i.e. a two-layer nonlinear classification network, is adopted to obtain the mapping relation W between each label appearing in the input text representation and its corresponding class, as shown in FIG. 4. W serves as the initial attention matrix, and the training of the classification network is guided by a least-squares error loss.
(2) The initial attention matrix is then refined. W is normalized with the SoftMax function (original equation image), and the contribution of text y_i to the different classes is assumed to obey a distribution F_i(.):

F_i(l_j) = W'_ij,

where l_j is the label information corresponding to the j-th sample.

The information entropy corresponding to each label is then solved (original equation image), and W''_i = -E_i, so that labels whose class distribution has high entropy, i.e. less informative labels, receive lower weights.

The final attention matrix is then solved (the original equation images give its piecewise form), where v is a computable threshold obtained as follows: the attention matrix W''_i is sorted in ascending order, and v is set to the value at the position given by the original expression (an index determined by N_t), where N_t denotes the number of different labels in the text label set.

(3) The original text features are multiplied by the text attention map to obtain the text features weighted by the text attention features; here the original text features are represented by BoW, but other forms such as Word2Vec may also be used.
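A corresponding sketch for the text branch follows; the softmax and entropy steps follow the description above, while the thresholding rule mirrors the assumed image-side rule (the patent gives the exact form only in equation images), and the BoW dimension is hypothetical:

```python
import torch

def text_attention(W_i: torch.Tensor, q: float = 50.0) -> torch.Tensor:
    """W_i: (Nt,) initial attention scores linking one text's Nt labels
    to their classes. Returns (Nt,) label weights."""
    F_i = torch.softmax(W_i, dim=0)              # W'_ij, a distribution F_i
    E_i = -F_i * torch.log(F_i + 1e-12)          # per-label entropy term
    W2_i = -E_i                                  # W''_i = -E_i
    # v: value at the q% position of the ascending sort (assumed quantile)
    sorted_vals, _ = torch.sort(W2_i)
    idx = min(int(len(sorted_vals) * q / 100.0), len(sorted_vals) - 1)
    v = sorted_vals[idx]
    # assumed enhance/keep rule around the threshold v
    return torch.where(W2_i >= v, 1.0 + torch.sigmoid(W2_i), torch.sigmoid(W2_i))

bow = torch.rand(1386)                  # hypothetical BoW text vector
attn = text_attention(torch.randn(1386))
weighted_text = bow * attn              # text features weighted by attention
```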
In step S2, the image features and the text features are input into the hash learning network model, the binarized hash codes are obtained with the sign function, and a global objective function is constructed with minimization of the loss function as the objective:

[global objective function equation image: the sum of a negative log-likelihood term over the similarity matrix S, a quantization loss term, and a semantic-preserving loss term]

where n is the number of samples in the sample set; B_x is the binary hash code corresponding to the picture modality and B_y the binary hash code corresponding to the text modality, with B set as B_x = B_y = sign(gamma(F + G)); W_x, W_y are the initial attention matrices corresponding to the picture modality data and text modality data; F_*i = f_x(x_i; theta_x), where theta_x is the image network parameter and F is the output of the image network; G_*j = f_y(y_j; theta_y), where theta_y is the text network parameter and G is the output of the text network; a pairwise term is defined from F and G (original equation image); gamma and eta are both hyper-parameters. The similarity matrix S is defined as follows: for two different samples i, j, if the labels of the two samples share at least one class, S_ij is set to 1, and otherwise to 0.
In this embodiment, the first term of the global objective function is a negative log-likelihood loss function and the second term is a quantization loss function. Since the similarity relation between samples is obtained through the label information L, in order to make fuller use of the sample supervision information, this embodiment proposes the third loss term, namely a semantic-preserving loss function.
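Since the global objective appears only as an equation image, the following is a hedged, DCMH-style sketch of the three named terms; the pairwise term theta, the hypothetical label-projection matrix P used for the semantic-preserving term, and the exact weightings are assumptions:

```python
import torch
import torch.nn.functional as nnf

def global_objective(F, G, B, S, L, P, gamma=1.0, eta=1.0):
    """F, G: (n, c) real-valued network outputs; B: (n, c) binary codes;
    S: (n, n) 0/1 similarity matrix; L: (n, num_classes) labels;
    P: (c, num_classes) hypothetical projection for the semantic term."""
    theta = 0.5 * (F @ G.t())                 # assumed pairwise term
    # negative log-likelihood of S under a sigmoid likelihood
    nll = -(S * theta - nnf.softplus(theta)).sum()
    # quantization loss: real outputs should stay close to the binary codes
    quant = ((B - F) ** 2).sum() + ((B - G) ** 2).sum()
    # semantic-preserving loss (assumed form): codes should predict labels L
    sem = ((B @ P - L) ** 2).sum()
    return nll + gamma * quant + eta * sem
```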
In step S2, the hash learning model is optimized with minimization of the loss function as the objective. The variables to be optimized are B, F, G, W_x and W_y. In this embodiment, an iterative (alternating) optimization scheme is adopted to minimize the loss function: only one variable is optimized at a time while the other variables are kept unchanged. The specific optimization strategy is as follows:

S2-1: fix the variables B, G, W_x, W_y and update the variable F: for a sample point x_i, F_*i is optimized with the stochastic gradient descent method; the gradient of the loss with respect to F_*i is computed (original equation image), the gradient with respect to theta_x is then obtained by the chain rule (original equation images), and the parameter theta_x of the image network is updated via back-propagation.

S2-2: fix the variables B, F, G, W_y and update the variable W_x: the variable is updated with the stochastic gradient descent method (gradient given in the original equation image).

S2-3: fix the variables B, F, W_x, W_y and update the variable G: similarly to the update of F, for a sample point y_j the gradient with respect to G_*j is first computed (original equation image), the gradient with respect to theta_y is obtained by the chain rule (original equation image), and the parameter theta_y is updated.

S2-4: fix the variables B, F, G, W_x and update the variable W_y, again by stochastic gradient descent (gradient given in the original equation image).

S2-5: fix the variables F, G, W_x, W_y and update the variable B in closed form:

B = sign(V), where V = gamma(F + G),

consistent with the setting B_x = B_y = sign(gamma(F + G)) above.
In step S3, after the hash learning model has been optimized, all samples in the cross-modal data set are passed through the optimized model to obtain their corresponding hash codes.
When a retrieval task is performed, the query data is input into the model to obtain its hash code; among the hash codes of the modal data in the cross-modal data set whose modality differs from that of the query, the N hash codes with the smallest Hamming distance are retrieved, and the cross-modal data satisfying the retrieval requirement is screened out.
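A minimal sketch of this retrieval step, assuming +/-1 hash codes so that the Hamming distance reduces to an inner product:

```python
import torch

def retrieve(query_code: torch.Tensor, db_codes: torch.Tensor, N: int) -> torch.Tensor:
    """query_code: (c,) in {-1, +1}; db_codes: (m, c) codes of the other
    modality; returns the indices of the N closest database items."""
    c = query_code.numel()
    # for +/-1 codes: Hamming(q, b) = (c - <q, b>) / 2
    dists = (c - db_codes @ query_code) / 2
    return torch.topk(dists, k=N, largest=False).indices

# e.g. a text query against the image database:
# idx = retrieve(text_query_code, image_db_codes, N=10)
```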
Embodiment 2
As shown in fig. 5, the present embodiment provides a cross-modal hash retrieval system based on an attention-aware mechanism, including:
the feature extraction module is used for performing feature extraction and attention feature extraction on a training set in the cross-modal data set to obtain cross-modal features weighted by the attention features;
the hash learning module is used for inputting the cross-modal features of the cross-modal data pairs in the training set into the hash learning model and optimizing the hash learning model according to the output cross-modal hash codes with minimization of a loss function as the objective;
and the retrieval module is used for screening out, according to the hash code of the data to be retrieved obtained by the optimized hash learning model, the modal data satisfying the retrieval requirement from the hash codes of modal data in the cross-modal data set whose modality differs from that of the data to be retrieved.
It should be noted that the above modules correspond to steps S1 to S3 in Embodiment 1, and the modules share the same examples and application scenarios as the corresponding steps, but are not limited to the disclosure of Embodiment 1. It should also be noted that the modules described above, as part of a system, may be implemented in a computer system such as a set of computer-executable instructions.
In this embodiment, the feature extraction module receives pictures and texts and performs feature learning and hash-code learning on the image data and the text data simultaneously. The image feature extraction network contains an image attention feature extraction module, and the text feature extraction network contains a text attention feature extraction module. Finally, the attention-weighted features are input into the hash learning module to guide the generation of hash codes, which improves the quality of hash-code generation; the system is applicable to cross-modal retrieval tasks in various multi-modal data scenarios.
In further embodiments, there is also provided:
an electronic device comprising a memory, a processor, and computer instructions stored on the memory and executable on the processor; when the computer instructions are executed by the processor, the method of Embodiment 1 is performed. For brevity, details are not repeated here.
It should be understood that in this embodiment the processor may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and so on. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
The memory may include both read-only memory and random access memory, and may provide instructions and data to the processor, and a portion of the memory may also include non-volatile random access memory. For example, the memory may also store device type information.
A computer readable storage medium storing computer instructions which, when executed by a processor, perform the method described in embodiment 1.
The method in Embodiment 1 may be implemented directly by a hardware processor, or by a combination of hardware and software modules in the processor. The software modules may be located in RAM, flash memory, ROM, PROM or EPROM, registers, or other storage media well known in the art. The storage medium is located in the memory, and the processor reads the information in the memory and completes the steps of the method in combination with its hardware. To avoid repetition, this is not described in detail here.
Those of ordinary skill in the art will appreciate that the various illustrative units and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or as combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and the design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The above is only a preferred embodiment of the present invention, and is not intended to limit the present invention, and various modifications and changes will occur to those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.
Although the embodiments of the present invention have been described with reference to the accompanying drawings, this does not limit the scope of protection of the present invention; those skilled in the art should understand that various modifications and variations made without inventive effort on the basis of the technical solution of the present invention remain within its scope.

Claims (9)

1. A cross-modal hash retrieval method based on an attention-aware mechanism is characterized by comprising the following steps:
performing feature extraction and attention feature extraction on a training set in the cross-modal data set to obtain cross-modal features weighted by the attention features;
inputting the cross-modal features of the cross-modal data pairs in the training set into a hash learning model, and optimizing the hash learning model according to the output cross-modal hash codes with minimization of a loss function as the objective; and constructing a global objective function according to the output cross-modal hash codes with minimization of the loss function as the objective;
the global objective function is:

[global objective function equation image: a negative log-likelihood term, a quantization loss term, and a semantic-preserving loss term]

where n is the number of samples in the sample set; B_x, B_y are the hash codes corresponding to the x-modality data and the y-modality data in a cross-modal data pair; theta_x, theta_y are the network parameters of the networks corresponding to the x-modality data and the y-modality data; W_x, W_y are the initial attention matrices corresponding to the x-modality data and the y-modality data; S_ij is the similarity matrix; gamma and eta are both hyper-parameters; F, G are the outputs of the networks corresponding to the x-modality data and the y-modality data; and L is the label information;
and screening out, according to the hash code of the data to be retrieved obtained by the optimized hash learning model, the modal data satisfying the retrieval requirement from the hash codes of modal data in the cross-modal data set whose modality differs from that of the data to be retrieved.
2. The attention-aware mechanism-based cross-modal hash retrieval method of claim 1, wherein the cross-modal data set comprises data of multiple modalities, the training set comprises multiple cross-modal data pairs, and two parallel convolutional neural networks are adopted to perform feature extraction and attention feature extraction on the cross-modal data pairs simultaneously.
3. The cross-modal hash retrieval method based on attention awareness mechanism as claimed in claim 1, wherein said attention feature extraction comprises:
acquiring an initial attention feature matrix, training the convolutional neural network with minimization of a loss function, and outputting an improved attention feature matrix;
and performing a dot-product operation between the attention feature matrix and the feature matrix output by the convolutional neural network to obtain the cross-modal features weighted by the attention features.
4. The attention-aware mechanism-based cross-modal hash retrieval method of claim 1, wherein the global objective function comprises a negative log-likelihood loss function, a quantization loss function, and a semantic preserving loss function.
5. The cross-modal hash retrieval method based on the attention-aware mechanism as claimed in claim 1, wherein the hash learning model is optimized with an iterative optimization method, and the optimized variables comprise the hash codes of the cross-modal data pairs, the outputs of the networks corresponding to the cross-modal data pairs, and the initial attention matrices.
6. The cross-modal hash retrieval method based on the attention-aware mechanism as claimed in claim 1, wherein, among the hash codes of modal data in the cross-modal data set whose modality differs from that of the data to be retrieved, the Hamming distances to the hash code of the data to be retrieved are compared, the N hash codes with the smallest Hamming distance are retrieved, and the cross-modal data satisfying the retrieval requirement are screened out.
7. A cross-modal hash retrieval system based on an attention-aware mechanism, comprising:
the feature extraction module is used for performing feature extraction and attention feature extraction on a training set in the cross-modal data set to obtain cross-modal features weighted by the attention features;
the hash learning module is used for inputting the cross-modal features of the cross-modal data pairs in the training set into the hash learning model and optimizing the hash learning model according to the output cross-modal hash codes with minimization of a loss function as the objective; and constructing a global objective function according to the output cross-modal hash codes with minimization of the loss function as the objective;
the global objective function is:

[global objective function equation image: a negative log-likelihood term, a quantization loss term, and a semantic-preserving loss term]

where n is the number of samples in the sample set; B_x, B_y are the hash codes corresponding to the x-modality data and the y-modality data in a cross-modal data pair; theta_x, theta_y are the network parameters of the networks corresponding to the x-modality data and the y-modality data; W_x, W_y are the initial attention matrices corresponding to the x-modality data and the y-modality data; S_ij is the similarity matrix; gamma and eta are both hyper-parameters; F, G are the outputs of the networks corresponding to the x-modality data and the y-modality data; and L is the label information;
and the retrieval module is used for screening out, according to the hash code of the data to be retrieved obtained by the optimized hash learning model, the modal data satisfying the retrieval requirement from the hash codes of modal data in the cross-modal data set whose modality differs from that of the data to be retrieved.
8. An electronic device comprising a memory, a processor, and computer instructions stored on the memory and executable on the processor, wherein, when the computer instructions are executed by the processor, the method of any one of claims 1-6 is performed.
9. A computer-readable storage medium storing computer instructions which, when executed by a processor, perform the method of any one of claims 1 to 6.
CN202010408302.8A 2020-05-14 2020-05-14 Cross-modal Hash retrieval method and system based on attention awareness mechanism Active CN111639240B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010408302.8A CN111639240B (en) 2020-05-14 2020-05-14 Cross-modal Hash retrieval method and system based on attention awareness mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010408302.8A CN111639240B (en) 2020-05-14 2020-05-14 Cross-modal Hash retrieval method and system based on attention awareness mechanism

Publications (2)

Publication Number Publication Date
CN111639240A (en) 2020-09-08
CN111639240B (en) 2021-04-09

Family

ID=72331952

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010408302.8A Active CN111639240B (en) 2020-05-14 2020-05-14 Cross-modal Hash retrieval method and system based on attention awareness mechanism

Country Status (1)

Country Link
CN (1) CN111639240B (en)

Families Citing this family (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112199375B (en) * 2020-09-30 2024-03-01 三维通信股份有限公司 Cross-modal data processing method and device, storage medium and electronic device
CN112364198B (en) * 2020-11-17 2023-06-30 深圳大学 Cross-modal hash retrieval method, terminal equipment and storage medium
CN112329439B (en) * 2020-11-18 2021-11-19 北京工商大学 Food safety event detection method and system based on graph convolution neural network model
CN112287159B (en) * 2020-12-18 2021-04-09 北京世纪好未来教育科技有限公司 Retrieval method, electronic device and computer readable medium
CN112598067A (en) * 2020-12-25 2021-04-02 中国联合网络通信集团有限公司 Emotion classification method and device for event, electronic equipment and storage medium
CN112817914A (en) * 2021-01-21 2021-05-18 深圳大学 Attention-based deep cross-modal Hash retrieval method and device and related equipment
CN112734625B (en) * 2021-01-29 2022-06-07 成都视海芯图微电子有限公司 Hardware acceleration system and method based on 3D scene design
CN112862727B (en) * 2021-03-16 2023-06-23 上海壁仞智能科技有限公司 Cross-modal image conversion method and device
CN113095415B (en) * 2021-04-15 2022-06-14 齐鲁工业大学 Cross-modal hashing method and system based on multi-modal attention mechanism
CN113032614A (en) * 2021-04-28 2021-06-25 泰康保险集团股份有限公司 Cross-modal information retrieval method and device
CN113220919B (en) * 2021-05-17 2022-04-22 河海大学 Dam defect image text cross-modal retrieval method and model
CN113343014A (en) * 2021-05-25 2021-09-03 武汉理工大学 Cross-modal image audio retrieval method based on deep heterogeneous correlation learning
CN113239237B (en) * 2021-07-13 2021-11-30 北京邮电大学 Cross-media big data searching method and device
CN116776157B (en) * 2023-08-17 2023-12-12 鹏城实验室 Model learning method supporting modal increase and device thereof
CN117194740B (en) * 2023-11-08 2024-01-30 武汉大学 Geographic information retrieval intention updating method and system based on guided iterative feedback

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107885764A (en) * 2017-09-21 2018-04-06 银江股份有限公司 Based on the quick Hash vehicle retrieval method of multitask deep learning
CN108170755A (en) * 2017-12-22 2018-06-15 西安电子科技大学 Cross-module state Hash search method based on triple depth network

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104346440B (en) * 2014-10-10 2017-06-23 浙江大学 A kind of across media hash indexing methods based on neutral net
CN107562812B (en) * 2017-08-11 2021-01-15 北京大学 Cross-modal similarity learning method based on specific modal semantic space modeling
CA3022998A1 (en) * 2017-11-02 2019-05-02 Royal Bank Of Canada Method and device for generative adversarial network training
US10248664B1 (en) * 2018-07-02 2019-04-02 Inception Institute Of Artificial Intelligence Zero-shot sketch-based image retrieval techniques using neural networks for sketch-image recognition and retrieval
US11556581B2 (en) * 2018-09-04 2023-01-17 Inception Institute of Artificial Intelligence, Ltd. Sketch-based image retrieval techniques using generative domain migration hashing
CN109992686A (en) * 2019-02-24 2019-07-09 复旦大学 Based on multi-angle from the image-text retrieval system and method for attention mechanism
CN109960732B (en) * 2019-03-29 2023-04-18 广东石油化工学院 Deep discrete hash cross-modal retrieval method and system based on robust supervision
CN110222140B (en) * 2019-04-22 2021-07-13 中国科学院信息工程研究所 Cross-modal retrieval method based on counterstudy and asymmetric hash
CN110472642B (en) * 2019-08-19 2022-02-01 齐鲁工业大学 Fine-grained image description method and system based on multi-level attention
CN111125457A (en) * 2019-12-13 2020-05-08 山东浪潮人工智能研究院有限公司 Deep cross-modal Hash retrieval method and device

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107885764A (en) * 2017-09-21 2018-04-06 银江股份有限公司 Based on the quick Hash vehicle retrieval method of multitask deep learning
CN108170755A (en) * 2017-12-22 2018-06-15 西安电子科技大学 Cross-module state Hash search method based on triple depth network

Also Published As

Publication number Publication date
CN111639240A (en) 2020-09-08

Similar Documents

Publication Publication Date Title
CN111639240B (en) Cross-modal Hash retrieval method and system based on attention awareness mechanism
CN108984724B (en) Method for improving emotion classification accuracy of specific attributes by using high-dimensional representation
CN108960073B (en) Cross-modal image mode identification method for biomedical literature
US11928602B2 (en) Systems and methods to enable continual, memory-bounded learning in artificial intelligence and deep learning continuously operating applications across networked compute edges
Sharma et al. Era of deep neural networks: A review
CN110188346B (en) Intelligent research and judgment method for network security law case based on information extraction
CN109783666B (en) Image scene graph generation method based on iterative refinement
WO2017052791A1 (en) Semantic multisensory embeddings for video search by text
US11288324B2 (en) Chart question answering
CN108170848B (en) Chinese mobile intelligent customer service-oriented conversation scene classification method
CN114565104A (en) Language model pre-training method, result recommendation method and related device
CN111027576B (en) Cooperative significance detection method based on cooperative significance generation type countermeasure network
CN114743020A (en) Food identification method combining tag semantic embedding and attention fusion
CN110097096B (en) Text classification method based on TF-IDF matrix and capsule network
CN111461175B (en) Label recommendation model construction method and device of self-attention and cooperative attention mechanism
CN111858984A (en) Image matching method based on attention mechanism Hash retrieval
CN114359631A (en) Target classification and positioning method based on coding-decoding weak supervision network model
KR20200071865A (en) Image object detection system and method based on reduced dimensional
Qian Exploration of machine algorithms based on deep learning model and feature extraction
CN111930972B (en) Cross-modal retrieval method and system for multimedia data by using label level information
CN106503066B (en) Processing search result method and apparatus based on artificial intelligence
US20240037335A1 (en) Methods, systems, and media for bi-modal generation of natural languages and neural architectures
CN116958677A (en) Internet short video classification method based on multi-mode big data
Bibi et al. Deep features optimization based on a transfer learning, genetic algorithm, and extreme learning machine for robust content-based image retrieval
CN116663523A (en) Semantic text similarity calculation method for multi-angle enhanced network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant