CN110674333B - Large-scale image high-speed retrieval method based on multi-view enhanced depth hashing - Google Patents

Large-scale image high-speed retrieval method based on multi-view enhanced depth hashing Download PDF

Info

Publication number
CN110674333B
CN110674333B (application CN201910712046.9A)
Authority
CN
China
Prior art keywords
view
code
fusion
hash
views
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910712046.9A
Other languages
Chinese (zh)
Other versions
CN110674333A (en)
Inventor
颜成钢
龚镖
白俊杰
孙垚棋
张继勇
张勇东
沈韬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Dianzi University
Original Assignee
Hangzhou Dianzi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Dianzi University filed Critical Hangzhou Dianzi University
Priority to CN201910712046.9A priority Critical patent/CN110674333B/en
Publication of CN110674333A publication Critical patent/CN110674333A/en
Application granted granted Critical
Publication of CN110674333B publication Critical patent/CN110674333B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/50 - Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F 16/53 - Querying
    • G06F 16/532 - Query formulation, e.g. graphical querying
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/50 - Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F 16/58 - Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F 16/583 - Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods
    • G06N 3/084 - Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Biomedical Technology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Library & Information Science (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a large-scale image high-speed retrieval method based on multi-view enhanced depth hashing. The invention comprises the following steps: step 1, acquiring a multi-view feature representation of an image; step 2, calculating a view relation matrix; step 3, designing a loss function of the model; step 4, fusion and enhancement; step 5, training the built model on a large-scale image training data set; step 6, testing the trained model to generate hash codes, and then performing hash retrieval; and step 7, evaluating indexes in an experiment. The expansion of the Hamming radius has little effect on the result, and the precision remains stable as the code length increases.

Description

Large-scale image high-speed retrieval method based on multi-view enhanced depth hashing
Technical Field
The invention belongs to the technical field of computer vision and artificial intelligence, specifically addresses the problem of high-speed retrieval over large-scale image data sets, and relates to multi-view learning, deep learning and hash learning.
Background
With the explosive growth of image data, many tasks urgently need efficient large-scale image retrieval algorithms. Approximate nearest neighbor search has attracted increasing attention as a way to balance retrieval time and retrieval quality on large-scale data sets. Hashing is an efficient method for nearest neighbor search in a large-scale data space: it embeds high-dimensional feature descriptors into a low-dimensional, similarity-preserving Hamming space. However, large-scale high-speed retrieval with binary codes suffers some loss of retrieval accuracy compared with conventional retrieval methods.
Hash learning is an emerging and highly efficient nearest neighbor search method, but its precision is limited. Learning to hash for large-scale image retrieval mainly aims to automatically generate a hash function. The binary codes output by the hash function can be compared by Hamming distance in Hamming space to obtain nearest neighbors. In recent years, several hash models built on convolutional neural networks have been proposed; supervised hashing is a typical representative of these methods for image retrieval. Meanwhile, many impressive studies, such as deep Cauchy hashing, have greatly improved retrieval precision in Hamming space and pushed efficient large-scale image retrieval into the next stage. These methods automatically learn a good image representation tailored to hashing together with a set of hash functions. However, the foregoing approaches focus on learning binary codes from data with only a single view (i.e., using a single convolutional feature). Recently, many multi-view hashing methods for efficient similarity search have been proposed, such as multi-view anchor graph hashing and multi-view alignment hashing. These methods rely primarily on spectral, graph or deep learning techniques to achieve structure-preserving coding of the data. However, in most cases such hashing methods simply collect multi-view information to supplement the components missing from the single-view hash code, which ignores the relationships between views. In addition, these methods also suffer from high computational complexity.
We propose a supervised multi-view hashing model that can enhance multi-view information through neural networks. The method actively explores the relationships between views by an effective view stability evaluation, which influences the optimization direction of the whole network. We also design multiple data fusion methods in Hamming space to preserve the advantages of both the convolutional and the multi-view representations. The proposed method was evaluated systematically on the CIFAR-10 and NUS-WIDE datasets, and the results show that our method is significantly superior to state-of-the-art single-view and multi-view hashing methods.
Disclosure of Invention
In this context, we propose deep multi-view enhanced hashing (D-MVE-Hash) and multi-view hashing (MV-Hash) for image retrieval. Multi-view hashing is a non-convolutional multi-view submodule of the multi-view enhanced hash that measures view relationships in a manner called view stability evaluation. To obtain a stability assessment of the views, the multi-view hash is first pre-trained on the labeled dataset, and then similar images are repeatedly input in different view spaces to compare their stability. A view relationship matrix quantifies the relationships between views. The whole process satisfies the requirements of back propagation, so it can be optimized by gradient descent.
In our framework, we use three enhancement methods to incorporate the view relationship matrix and the various binary codes learned from the single-view and multi-view spaces into the backbone network. The three deep multi-view enhanced hash fusion methods make full use of the view relationship matrix. Copy fusion (Fusion-R) enhances the effect of dominant views by iteratively repeating specific code segments while attenuating the effect of useless views. View code fusion (Fusion-C) takes into account the most primitive view relationship matrix with the fewest artificial constraints; it overcomes the difficulty that, in copy fusion, the number of repetitions must be set manually to keep the dimensionality of the input data uniform under a dynamic view relation matrix. Probability view pooling (Fusion-P) is a probability-based view pooling approach. We also design a memory network to eliminate the high time complexity of view stability evaluation; it shares its input with the backbone network, making deep multi-view enhanced hashing a two-step model. The pipeline is shown in FIG. 1 and the framework in FIG. 2. The experimental results and the visualization results demonstrate the effectiveness of deep multi-view enhanced hashing and multi-view hashing on the image retrieval task.
The technical scheme adopted by the invention for solving the technical problems is as follows:
1. The large-scale image high-speed retrieval method based on the multi-view enhanced depth hash is characterized by comprising the following steps of:
step 1, acquiring multi-view characteristic representation of an image;
step 2, calculating a view relation matrix;
step 3, designing a loss function of the model;
step 4, fusing and enhancing;
step 5, training the built model on a large-scale image training data set;
step 6, testing the trained model to generate a hash code, and then performing hash retrieval;
and step 7, evaluating indexes in an experiment.
The steps 1 and 2 are realized as follows:
2-1. problem definition and multiview hash description:
suppose that
Figure BDA0002154112120000031
is a set of objects and the corresponding features:
Figure BDA0002154112120000032
where d_m is the dimension of the m-th view, M is the number of views, and N is the number of objects; the integrated binary code matrix is
Figure BDA0002154112120000033
Wherein b isiIs and oiAn associated binary code, and q is the code length;
2-2. setting a mapping function
Figure BDA0002154112120000034
Wherein the mapping function is capable of converting a stack of similar objects into classification scores in different views;
2-3. Defining the potentially desired hash function
Figure BDA0002154112120000035
whose composition is as follows:
Figure BDA0002154112120000036
where ε is an evaluation function; each view network is trained in advance on a labeled data set to perform a classification task before the stability evaluation starts; the following loss function is used:
Figure BDA0002154112120000037
2-4. Abstracting the test process, whose output dimension is consistent with the number of classes;
given images I = {i_1, ..., i_N}, let Q = F(I); the dimensionality of Q is M × N × C, where M is the number of views, N is the number of pictures, and C is the number of categories; ε(F) is defined as follows:
Figure BDA0002154112120000039
ε is expressed as [ε_1, ..., ε_M], and then ε is normalized:
Figure BDA00021541121200000310
the step 3 is specifically realized as follows:
Training a multi-view binary code generation network by using the view relation information; at the beginning, set a pair of images i_1, i_2 and the corresponding binary network outputs b_1, b_2 ∈ B; a relaxation mapping is applied from {-1, +1}^q to [-1, +1]^q; define y = 1 if the pair is similar, otherwise y = -1; the following formula is the loss function for the m-th view:
Figure BDA0002154112120000041
where ‖·‖_1 is the 1-norm, |·| is the absolute value, α > 0 is the boundary (margin) control, and the third term is a regularization term to avoid vanishing gradients; for the more general image set I = {i_1, ..., i_N}, the corresponding output binary codes in the multi-view space are denoted
Figure BDA0002154112120000042
To obtain an equation representation in matrix form, B is introduced
Figure BDA0002154112120000043
The formula is given below:
Figure BDA0002154112120000044
The merged
Figure BDA0002154112120000045
is expressed in the form of the second multiplication term of p(I); then the regularization terms and the similarity matrix are supplemented, and the following global objective function is obtained:
Figure BDA0002154112120000046
the view relationship matrix E is
Figure BDA0002154112120000047
The overall loss function is rewritten as:
Figure BDA0002154112120000048
With this function, the network is trained using the back-propagation algorithm with mini-batch gradient descent, and the view relation matrix E can affect all layers of the network.
The step 4 is specifically realized as follows:
Sorting the view relation matrix E to find important views, and enhancing the importance of a view by repeating the binary code of the corresponding view within the multi-view binary code; specifically, the basic binary code is denoted as B; the intermediate code (the multi-view binary code) is represented as
Figure BDA00021541121200000411
Setting a fusion vector v to guide the multi-view binary code to be repeated under various views; the following formula represents the fusion process:
Figure BDA0002154112120000049
where H represents the input binary code of the fusion layer; φ(·) is a self-concatenation operation of the vector, from 1 to M; the second parameter in φ(·) represents the number of self-replications;
Figure BDA00021541121200000410
is the ranking function in dimension d; the advantage of this fusion method is that it converts E into a discrete control vector, so E only determines the order between views; the strength of the enhancement or weakening is controlled manually through the fusion vector;
In view code fusion (Fusion-C) the fusion vector is eliminated; in copy fusion that vector is needed to keep the dimensionality of the input data uniform under the dynamic view relation matrix; first, the entire binary string H is encoded as a header code (H_h), a middle code (H_m) and a tail code (H_e); H_h is the same as in copy fusion; H_m directly uses the product of the binary code length and the coefficient of the corresponding view as the number of repetitions of the current code segment; this operation produces a series of dummy bytes (i.e., H_e) whose lengths are not equal; secondly, a specific and distinct view codeword is assigned to each view, which is a random number belonging to [-1, 1]; in contrast to H_m, H_e uses the view codeword instead of the multi-view binary code; thus, regardless of the dynamic view relationship matrix and code length, H can be fully populated;
A probability view pool with the view relation matrix is provided as a multi-view fusion method, and a view probability distribution is generated according to E; in each pooling filter, a random sample drawn from the view probability distribution activates the selected view.
The step 5 is specifically realized as follows:
Establishing a module called the memory network, which is independent of the model but participates in training together with it; the memory network learns the view relation matrix E in step 1, and then in step 2 the view relation matrix E is obtained through this module without stability evaluation; the structure of the memory network is a multilayer convolutional neural network, but its output layer corresponds to the view relation matrix E; and the loss function during training is
Figure BDA0002154112120000051
Figure BDA0002154112120000052
l_n = (I_n - E_n)^2
The invention has the following beneficial effects:
Experiments with different code lengths were designed without loss of generality. Compared with single-view hash models and common multi-view hash models, deep multi-view enhanced hashing not only obtains higher mean average precision (mAP) but also has lower computational cost. Especially in long-code environments, deep multi-view enhanced hashing can achieve better retrieval results.
The proposed multi-view hashing achieves better performance on the NUS-WIDE dataset than state-of-the-art multi-view hashing methods. For example, when retrieving with 16-bit, 32-bit and 48-bit hash codes, multi-view hashing achieves gains of 3.44%, 1.65% and 2.46% compared with SSMDH.
In FIG. 3 it can be seen that the performance curve of the original binary code drops drastically as the code length increases, while the performance curve of the deep multi-view enhanced hash is barely affected. The enhanced binary code can maintain stable retrieval performance at long code lengths. In FIG. 5 we see that the mAPs of the 128-bit deep multi-view enhanced hash using view code fusion are between 77.82% and 83.21%, and the best retrieval Hamming radius is 5. The mAPs of the 128-bit deep multi-view enhanced hash using copy fusion are between 76.68% and 83.39%, which is 1.14% and 0.18% lower than using view code fusion.
Two advantages of this method are summarized from the experimental results: (1) the expansion of the Hamming radius has little effect on the result; (2) as the code length increases, the accuracy remains stable. Deep multi-view enhanced hashing not only uses a convolutional neural network to obtain potential hash functions, but also combines the multi-view information in each view to generate a binary code. In contrast to other multi-view approaches, deep multi-view enhanced hashing uses a view relationship matrix, allowing the network to actively consider the relationships between views to control the training direction. Moreover, the view relation matrix is not learned by a pre-existing fixed neural network, so it is not an uninterpretable black box. To visualize the differences more intuitively, the search results are presented in FIG. 7.
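For reference, the retrieval metric discussed above can be computed as in the following NumPy sketch of mean average precision (mAP) restricted to a Hamming radius. It is illustrative only: the array names and the brute-force distance computation are assumptions, not the patented implementation.

```python
import numpy as np

def hamming_dist(query_code, db_codes):
    """Hamming distance between one {-1,+1} query code and all database codes."""
    q = query_code.shape[0]
    # for ±1 codes: distance = (q - inner product) / 2
    return (q - db_codes @ query_code) // 2

def map_within_radius(query_codes, query_labels, db_codes, db_labels, radius=2):
    """mAP over queries, counting only database items within the given Hamming radius."""
    aps = []
    for code, label in zip(query_codes, query_labels):
        dist = hamming_dist(code, db_codes)
        order = np.argsort(dist, kind="stable")
        order = order[dist[order] <= radius]            # prune by Hamming radius
        if order.size == 0:
            aps.append(0.0)
            continue
        rel = (db_labels[order] == label).astype(np.float32)
        cum_rel = np.cumsum(rel)
        precision_at_k = cum_rel / (np.arange(order.size) + 1)
        aps.append(float((precision_at_k * rel).sum() / max(rel.sum(), 1.0)))
    return float(np.mean(aps))
```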
Drawings
FIG. 1 is a pipeline diagram of the present invention for a large-scale image high-speed retrieval method based on multi-view enhanced depth hashing;
FIG. 2 is a global framework architecture diagram of a large-scale image high-speed retrieval method based on multi-view enhanced depth hashing provided by the invention;
FIG. 3 is a graphical illustration of single-view intra-space and multi-view generalization constraint rules under a two-sample condition;
FIG. 4 shows the mean average precision and the precision-recall curves obtained experimentally;
FIG. 5 shows the average retrieval mean average precision under different code lengths and different Hamming radii, obtained experimentally;
FIG. 6 is a graph of the loss variation of the model during training;
FIG. 7 is a visual search result presentation of a model.
Detailed Description
The invention is further illustrated by the following figures and examples.
The invention combines the deep hash learning and the multi-view method for the first time through the deep multi-view enhanced hash. Sub-module multi-view hashing finds and quantifies view relationships under non-deep learning conditions. The deep multi-view enhanced hash retains the inherent advantages of the multi-view approach and can be applied to any single-view hash retrieval model.
The invention comprises the following steps:
step 1, problem definition and multi-view Hash (MV-Hash) detailed solution
Suppose that
Figure BDA0002154112120000061
is a set of objects and the corresponding features:
Figure BDA0002154112120000071
where d_m is the dimension of the m-th view, M is the number of views, and N is the number of objects. We also denote the integrated binary code matrix
Figure BDA0002154112120000072
Figure BDA0002154112120000073
where b_i is the binary code associated with o_i and q is the code length. A mapping function is formulated
Figure BDA0002154112120000074
where the function can convert a stack of similar objects into classification scores in different views. A potentially desirable hash function is then defined
Figure BDA0002154112120000075
whose composition is as follows:
Figure BDA0002154112120000076
ε is an evaluation function; each view network is pre-trained on the labeled dataset to perform the classification task before the stability evaluation starts. The following loss function is used:
Figure BDA0002154112120000077
Abstracting the test process: its output dimension is consistent with the number of classes. Specific to image data, given images I = {i_1, ..., i_N}, let Q = F(I); the dimension of Q is M × N × C, where M is the number of views, N is the number of pictures, and C is the number of categories. ε(F) is defined as follows:
Figure BDA0002154112120000078
Representing ε as [ε_1, ..., ε_M], a simple normalization of ε is then made:
Figure BDA0002154112120000079
Then consider training the multi-view binary code generation network with the view relationship information. At the beginning, consider a pair of images i_1, i_2 and the corresponding binary network outputs b_1, b_2 ∈ B, and apply a relaxation mapping from {-1, +1}^q to [-1, +1]^q; define y = 1 if they are similar, otherwise y = -1. The following formula is the loss function for the m-th view:
Figure BDA00021541121200000710
where ‖·‖_1 is the 1-norm, |·| is the absolute value, α > 0 is the boundary (margin) control, and the third term is a regularization term to avoid vanishing gradients. For the more general image set I = {i_1, ..., i_N}, the corresponding output binary codes in the multi-view space are denoted
Figure BDA0002154112120000081
To obtain an equation representation in matrix form, B is introduced
Figure BDA0002154112120000082
The formula is given below:
Figure BDA0002154112120000083
The merged
Figure BDA0002154112120000084
is expressed in the form of the second multiplication term of p(I). Then the regularization terms and the similarity matrix are supplemented, and the following global objective function is obtained:
Figure BDA0002154112120000085
To intuitively show the effect and location of the view stability assessment, the symbol E is used. The view relationship matrix E is
Figure BDA0002154112120000086
The overall loss function is rewritten as:
Figure BDA0002154112120000087
With this objective function, the network is trained using the back-propagation algorithm with mini-batch gradient descent, and the view relation matrix E can affect all layers of the network.
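The per-view pairwise loss appears only as a formula image above. The PyTorch-style sketch below implements a contrastive hashing loss of the kind the text describes: similar pairs are pulled together, dissimilar pairs are pushed beyond a margin α, and an l1 regularizer keeps the relaxed codes near ±1 to avoid vanishing gradients. The exact form and weighting in the patent are given by the images, so this is an assumption for illustration only.

```python
import torch

def pairwise_view_loss(b1, b2, y, alpha=2.0, lam=0.01):
    """b1, b2: relaxed codes in [-1, 1]^q for one view; y = 1 (similar) or -1 (dissimilar)."""
    y = torch.as_tensor(y, dtype=b1.dtype)
    sim = (y + 1.0) / 2.0                                   # 1 for similar, 0 for dissimilar
    d = torch.norm(b1 - b2, p=2, dim=-1) ** 2               # squared distance of the pair
    contrastive = sim * d + (1.0 - sim) * torch.clamp(alpha - d, min=0.0)
    # l1 regularizer pushing |b| toward 1 (i.e., binary values), as the text notes
    reg = (b1.abs() - 1.0).abs().sum(dim=-1) + (b2.abs() - 1.0).abs().sum(dim=-1)
    return (contrastive + lam * reg).mean()
```

As the rewritten overall objective above suggests, per-view losses of this kind from the M views are combined under the influence of the view relation matrix E during training.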
Step 2, fusion and enhancement
Copy fusion (Fusion-R) is a relatively simple solution that depends on parameters. We rank E to find important views and enhance the importance of a view by repeating the binary code of the corresponding view within the multi-view binary code. Specifically, the basic binary code is denoted B. The intermediate code (the multi-view binary code) is represented as
Figure BDA00021541121200000810
The fusion vector v is set to direct the multi-view binary code to be repeated under various views, and the following formula represents the fusion process:
Figure BDA0002154112120000088
where H represents the input binary code of the fusion layer, φ(·) is a self-concatenation operation of the vector from 1 to M, and the second parameter in φ(·) indicates the number of self-replications.
Figure BDA0002154112120000089
is the ranking function in dimension d. The advantage of this fusion method is that it converts E into a discrete control vector, so E only determines the order between views. The strength of the enhancement or weakening is controlled manually through the fusion vector.
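A minimal sketch of the copy-fusion idea follows: each view's sub-code is repeated a manually chosen number of times, with the repetition counts assigned to views in the order given by E. The concatenation order and the name `fusion_vector` are assumptions for illustration.

```python
import torch

def fusion_r(base_code, view_codes, E, fusion_vector):
    """base_code: (q,) convolutional code B.
    view_codes: list of M per-view codes (the intermediate multi-view code).
    E: (M,) view relation vector; fusion_vector: M repetition counts, one per rank.
    """
    order = torch.argsort(E, descending=True)      # rank views by importance
    parts = [base_code]
    for rank, view_idx in enumerate(order):
        repeats = int(fusion_vector[rank])         # manually set repetition count
        parts.append(view_codes[int(view_idx)].repeat(repeats))
    return torch.cat(parts)                        # fusion-layer input H
```

With `fusion_vector = [3, 2, 1]`, for example, the dominant view's sub-code appears three times and the weakest view's sub-code only once.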
View code fusion (Fusion-C) takes into account the most primitive view relationship matrix and the fewest artificial constraints. In particular, we wish to eliminate the fusion vector, which in copy fusion is used to keep the dimensions of the input data uniform under the dynamic view relation matrix. First, the entire binary string H is encoded as a header code (H_h), a middle code (H_m) and a tail code (H_e). H_h is the same as in copy fusion. H_m directly uses the product of the binary code length and the coefficient of the corresponding view as the number of repetitions of the current code segment. This operation produces a series of dummy bytes (i.e., H_e) whose lengths are not equal. Secondly, we assign a specific and distinct view codeword to each view, which is a random number belonging to [-1, 1]. In contrast to H_m, H_e uses the view codeword instead of the multi-view binary code. Thus, regardless of the dynamic view relationship matrix and code length, H can be fully populated. The advantage of view code fusion is that it makes full use of the information contained in the view relation matrix. In our experiments we found that view code fusion is limited by the view stability assessment, which means that it can exceed copy fusion when the number of views increases.
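A sketch of view code fusion under these assumptions is given below: the header is the base code, the middle repeats each view's code in proportion to its coefficient in E, and the tail pads with a fixed per-view random codeword so the total length is constant. The rounding rule, the padding choice and the parameter `total_len` are illustrative, not taken from the patent.

```python
import torch

def fusion_c(base_code, view_codes, E, total_len, view_codewords):
    """view_codewords: list of M random scalars in [-1, 1], one fixed codeword per view."""
    q = view_codes[0].numel()
    parts = [base_code]                                   # header code H_h
    for m, code in enumerate(view_codes):                 # middle code H_m
        repeats = max(int(round(float(E[m]) * q)), 1)     # repetitions ∝ code length × coefficient
        parts.append(code.repeat(repeats))
    h = torch.cat(parts)
    pad = total_len - h.numel()                           # tail code H_e: dummy bytes
    if pad > 0:
        top = int(torch.argmax(E))                        # illustrative choice of which codeword fills the tail
        h = torch.cat([h, torch.full((pad,), float(view_codewords[top]), dtype=h.dtype)])
    return h[:total_len]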
Probability view pooling (Fusion-P): the invention provides a probability view pool with the view relation matrix as a multi-view fusion method. Conventional pooling operations select a maximum or average value as the result of each pooling cell. A view pool is a dimension-reduction method that uses element-wise maximization across views to unify the data of multiple views into one view. Since pooling operations inevitably cause information loss, we extend the length of the multi-view binary code before the probability view pool to preserve as much multi-view information as possible. A view probability distribution is then generated from E. In each pooling filter, a random sample drawn from the view probability distribution activates the selected view. The code fragment of this view is used for the conventional pooling operation. This ensures that sub-binary codes of high-priority views are more likely to appear during the fusion process.
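A sketch of the probability view pool under these assumptions: the multi-view code is treated as an M × L matrix, a view index is sampled from the distribution derived from E for every pooling window, and a conventional max-pool is applied to the sampled view's fragment. The window size and the use of max (rather than mean) pooling are illustrative choices.

```python
import torch

def fusion_p(multi_view_codes, E, window=4):
    """multi_view_codes: (M, L) relaxed codes, one row per view; E: (M,) view relation vector."""
    probs = torch.softmax(E, dim=0)               # view probability distribution from E
    M, L = multi_view_codes.shape
    out = []
    for start in range(0, L - window + 1, window):
        v = int(torch.multinomial(probs, 1))      # activate one view for this pooling filter
        fragment = multi_view_codes[v, start:start + window]
        out.append(fragment.max())                # conventional pooling on the selected view
    return torch.stack(out)
```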
Step 3, retrieval acceleration
To avoid spending excessive computational resources on stability evaluation during the search process, we build a module called the memory network, which is independent of the model but participates in training together with it. View stability evaluation seeks view relationships in the multi-view space, which is very time consuming and unsuitable for image retrieval. The memory network learns the view relation matrix E in step 1, and then in step 2 we can obtain the view relation matrix E through this module without stability evaluation. The structure of the memory network is a multilayer convolutional neural network (e.g., VGG, ResNet, DenseNet, etc.), but its output layer corresponds to the view relationship matrix E. The loss function during training is
Figure BDA0002154112120000091
l_n = (I_n - E_n)^2. FIG. 1 shows the different states of the deep multi-view enhanced hash in the two steps and the association between them. The model not only contains the stability assessment, but also pre-trains some layers to perform the stability assessment in step 1. In step 2, we can obtain the view relation matrix E without stability evaluation, which greatly raises efficiency.
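A minimal sketch of the memory network under these assumptions: a small convolutional network that shares the image input with the backbone and regresses the M-dimensional view relation vector, trained with the element-wise squared error l_n = (I_n - E_n)^2 against the E obtained from stability evaluation. The layer sizes are illustrative placeholders, not the VGG/ResNet/DenseNet variants named in the text.

```python
import torch
import torch.nn as nn

class MemoryNetwork(nn.Module):
    """Predicts the view relation vector E directly from the input image."""
    def __init__(self, num_views: int):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(32, num_views)        # output layer corresponds to E

    def forward(self, x):
        f = self.features(x).flatten(1)
        return torch.softmax(self.head(f), dim=1)

def memory_loss(pred_E, target_E):
    """Sum over views of l_n = (pred_n - target_n)^2, averaged over the batch."""
    return ((pred_E - target_E) ** 2).sum(dim=1).mean()
```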
Step 4, experimental comparison
In the present invention we provide experiments on several common data sets and compare with the most advanced hashing methods; multi-view hashing methods are also within the comparison scope. Two reference image datasets were used to evaluate our method: CIFAR-10 and NUS-WIDE. We obtain 2D multi-view image information through an RGB color-space color histogram, an HSV color-space color histogram, and texture features. In addition to this multi-view information, we also learn the hash function from the convolutional view using VGG-19. Dropout and batch normalization were used for each fully connected layer to avoid overfitting. The activation function is ReLU and the hidden layer sizes are 4096 × 4096. We used mini-batch stochastic gradient descent (SGD) with 0.9 momentum and a learning-rate scheduler. We follow the standard Hamming-space search evaluation protocol, which includes two consecutive steps: (1) pruning, using a hash table lookup to return the data points within Hamming radius 2 of each query; (2) scanning, re-ranking the returned data points in ascending order of their distance to the query computed with continuous codes.
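The two-step evaluation protocol can be sketched as follows (prune by Hamming radius, then re-rank the survivors by continuous-code distance). The brute-force "hash table" below is a simplification used only to keep the example short.

```python
import numpy as np

def hamming_space_search(query_bin, query_cont, db_bin, db_cont, radius=2):
    """query_bin/db_bin: {-1,+1} binary codes; query_cont/db_cont: continuous codes."""
    q = query_bin.shape[0]
    ham = (q - db_bin @ query_bin) // 2                 # Hamming distance for ±1 codes
    candidates = np.flatnonzero(ham <= radius)          # step 1: pruning within the radius
    d = np.linalg.norm(db_cont[candidates] - query_cont, axis=1)
    return candidates[np.argsort(d)]                    # step 2: scanning, ascending re-rank
```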
We compare the retrieval performance of D-MVE-Hash with several classical hashing methods (ITQ-CCA, KSH), state-of-the-art CNN-based single-view methods (CNNH, HashNet, DCH) and multi-view hashing methods (CHMIS, MVAGH, MAH, SSMDH). Table 1 shows the mAP results for different code lengths on the CIFAR-10 dataset. Since the multi-view hashing methods do not use deep convolution, we split out and name the multi-view part of our method MV-Hash and compare them in Table 2.
TABLE 1 mAP of re-ranking with different numbers of bits on the CIFAR-10 dataset
Figure BDA0002154112120000101
TABLE 2 mAP of re-ranking with different numbers of bits on the NUS-WIDE dataset
Figure BDA0002154112120000102
Compared with state-of-the-art CNN-based single-view methods and traditional single-view hashing methods, the proposed deep multi-view enhanced hashing achieves higher retrieval performance on the CIFAR-10 dataset. For example, deep multi-view enhanced hashing achieves gains of 5.21%, 4.6%, 3.57% and 4.30% over DCH when retrieving with 16-bit, 32-bit, 48-bit and 64-bit hash codes. Similar results were observed in other experiments. The proposed multi-view hashing achieves better performance on the NUS-WIDE dataset than state-of-the-art multi-view hashing methods. For example, when retrieving with 16-bit, 32-bit and 48-bit hash codes, multi-view hashing achieves gains of 3.44%, 1.65% and 2.46% compared with SSMDH.
In FIG. 3 we can see that the performance curve of the original binary code drops drastically as the code length increases, while the performance curve of the deep multi-view enhanced hash is barely affected. The enhanced binary code can maintain stable retrieval performance at long code lengths. In FIG. 5 we see that the mAPs of the 128-bit deep multi-view enhanced hash using view code fusion are between 77.82% and 83.21%, and the best retrieval Hamming radius is 5. The mAPs of the 128-bit deep multi-view enhanced hash using copy fusion are between 76.68% and 83.39%, which is 1.14% and 0.18% lower than using view code fusion.
We summarize two advantages of this approach from the experimental results: (1) the expansion of the Hamming radius has little effect on the result; (2) as the code length increases, the accuracy remains stable. Deep multi-view enhanced hashing not only uses a convolutional neural network to obtain potential hash functions, but also combines the multi-view information in each view to generate a binary code. In contrast to other multi-view approaches, deep multi-view enhanced hashing uses a view relationship matrix, allowing the network to actively consider the relationships between views to control the training direction. Moreover, the view relation matrix is not learned by a pre-existing fixed neural network, so it is not an uninterpretable black box. To visualize the differences more intuitively, we present the search results in FIG. 7.

Claims (4)

1. The large-scale image high-speed retrieval method based on the multi-view enhanced depth hash is characterized by comprising the following steps of:
step 1, acquiring multi-view characteristic representation of an image;
step 2, calculating a view relation matrix;
step 3, designing a loss function of the model;
step 4, fusing and enhancing;
step 5, training the built model on a large-scale image training data set;
step 6, testing the trained model to generate a hash code, and then performing hash retrieval;
step 7, evaluating indexes in an experiment;
the method is characterized in that the steps 1 and 2 are realized as follows:
2-1. problem definition and multiview hash description:
suppose that
Figure FDA0003414986350000011
is a set of objects and the corresponding features:
Figure FDA0003414986350000012
where d_m is the dimension of the m-th view, M is the number of views, and N is the number of objects; the integrated binary code matrix is
Figure FDA0003414986350000013
where b_i is the binary code associated with o_i, and q is the code length;
2-2. setting a mapping function
Figure FDA0003414986350000014
Wherein the mapping function is capable of converting a stack of similar objects into classification scores in different views;
2-3. Defining the potentially desired hash function
Figure FDA0003414986350000015
whose composition is as follows:
Figure FDA0003414986350000016
where ε is an evaluation function; each view network is trained in advance on a labeled data set to perform a classification task before the stability evaluation starts; the following loss function is used:
Figure FDA0003414986350000017
2-4. Abstracting the test process, whose output dimension is consistent with the number of classes;
given images I = {i_1, ..., i_N}, let Q = F(I); the dimension of Q is M × N × C, where M is the number of views, N is the number of pictures, and C is the number of categories; ε(F) is defined as follows:
Figure FDA0003414986350000022
ε is expressed as [ε_1, ..., ε_M], and then ε is normalized:
Figure FDA0003414986350000023
2. The large-scale image high-speed retrieval method based on multi-view enhanced depth hashing according to claim 1, wherein the step 3 is implemented as follows:
Training a multi-view binary code generation network by using the view relation information; at the beginning, set a pair of images i_1, i_2 and the corresponding binary network outputs b_1, b_2 ∈ B; a relaxation mapping is applied from {-1, +1}^q to [-1, +1]^q; define y = 1 if the pair is similar, otherwise y = -1; the following formula is the loss function for the m-th view:
Figure FDA0003414986350000024
where ‖·‖_1 is the 1-norm, |·| is the absolute value, α > 0 is the boundary (margin) control, and the third term is a regularization term to avoid vanishing gradients; for the more general image set I = {i_1, ..., i_N}, the corresponding output binary codes in the multi-view space are denoted
Figure FDA0003414986350000025
To obtain an equation representation in matrix form, B is introduced
Figure FDA0003414986350000026
The formula is given below:
Figure FDA0003414986350000027
The merged
Figure FDA0003414986350000028
is expressed in the form of the second multiplication term of p(I); then the regularization terms and the similarity matrix are supplemented, and the following global objective function is obtained:
Figure FDA0003414986350000029
the view relationship matrix E is
Figure FDA00034149863500000210
The overall loss function is rewritten as:
Figure FDA00034149863500000211
With this function, the network is trained using the back-propagation algorithm with mini-batch gradient descent, and the view relation matrix E can affect all layers of the network.
3. The large-scale image high-speed retrieval method based on multi-view enhanced depth hashing according to claim 2, wherein the step 4 is implemented as follows:
sorting the view relation matrix E to find important views, and enhancing the importance of the views by repeating the binary codes of the corresponding views in the multi-view binary codes; specifically, the basic binary code is denoted as B; the intermediate code is expressed as
Figure FDA0003414986350000031
Setting a fusion vector v to guide the multi-view binary code to be repeated under various views; the following formula represents the fusion process:
Figure FDA0003414986350000032
where H represents the input binary code of the fusion layer; φ(·) is a self-concatenation operation of the vector, from 1 to M; the second parameter in φ(·) represents the number of self-replications;
Figure FDA0003414986350000033
is the ranking function in dimension d; the advantage of this fusion method is that it converts E into a discrete control vector, so E only determines the order between views; the strength of the enhancement or weakening is controlled manually through the fusion vector;
In view code fusion the fusion vector is eliminated; in copy fusion that vector is needed to keep the dimensionality of the input data uniform under the dynamic view relation matrix; first, the entire binary string H is encoded as a header code H_h, a middle code H_m and a tail code H_e; H_h is the same as in copy fusion; H_m directly uses the product of the binary code length and the coefficient of the corresponding view as the number of repetitions of the current code segment; this operation produces a series of dummy bytes, i.e. H_e, whose lengths are not equal; secondly, a specific and distinct view codeword is assigned to each view, which is a random number belonging to [-1, 1]; in contrast to H_m, H_e uses the view codeword instead of the multi-view binary code; thus, regardless of the dynamic view relationship matrix and code length, H can be fully populated;
A probability view pool with the view relation matrix is provided as a multi-view fusion method, and a view probability distribution is generated according to E; in each pooling filter, a random sample drawn from the view probability distribution activates the selected view.
4. The large-scale image high-speed retrieval method based on multi-view enhanced depth hashing according to claim 3, wherein the step 5 is implemented as follows:
Establishing a module called the memory network, which is independent of the model but participates in training together with it; the memory network learns the view relation matrix E in step 1, and then in step 2 the view relation matrix E is obtained through this module without stability evaluation; the structure of the memory network is a multilayer convolutional neural network, but its output layer corresponds to the view relation matrix E; and the loss function during training is
Figure FDA0003414986350000041
Figure FDA0003414986350000042
l_n = (I_n - E_n)^2
CN201910712046.9A 2019-08-02 2019-08-02 Large-scale image high-speed retrieval method based on multi-view enhanced depth hashing Active CN110674333B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910712046.9A CN110674333B (en) 2019-08-02 2019-08-02 Large-scale image high-speed retrieval method based on multi-view enhanced depth hashing

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910712046.9A CN110674333B (en) 2019-08-02 2019-08-02 Large-scale image high-speed retrieval method based on multi-view enhanced depth hashing

Publications (2)

Publication Number Publication Date
CN110674333A CN110674333A (en) 2020-01-10
CN110674333B true CN110674333B (en) 2022-04-01

Family

ID=69068682

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910712046.9A Active CN110674333B (en) 2019-08-02 2019-08-02 Large-scale image high-speed retrieval method based on multi-view enhanced depth hashing

Country Status (1)

Country Link
CN (1) CN110674333B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111310821B (en) * 2020-02-11 2023-11-21 佛山科学技术学院 Multi-view feature fusion method, system, computer equipment and storage medium
CN112907712A (en) * 2021-01-22 2021-06-04 杭州电子科技大学 Three-dimensional model feature representation method based on multi-view hash enhanced hash
CN113377981B (en) * 2021-06-29 2022-05-27 山东建筑大学 Large-scale logistics commodity image retrieval method based on multitask deep hash learning

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104679835A (en) * 2015-02-09 2015-06-03 浙江大学 Book recommending method based on multi-view hash
CN106649715A (en) * 2016-12-21 2017-05-10 中国人民解放军国防科学技术大学 Cross-media retrieval method based on local sensitive hash algorithm and neural network
CN107016708A (en) * 2017-03-24 2017-08-04 杭州电子科技大学 A kind of image Hash coding method based on deep learning
CN110059205A (en) * 2019-03-20 2019-07-26 杭州电子科技大学 A kind of threedimensional model classification retrieving method based on multiple view

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104679835A (en) * 2015-02-09 2015-06-03 浙江大学 Book recommending method based on multi-view hash
CN106649715A (en) * 2016-12-21 2017-05-10 中国人民解放军国防科学技术大学 Cross-media retrieval method based on local sensitive hash algorithm and neural network
CN107016708A (en) * 2017-03-24 2017-08-04 杭州电子科技大学 A kind of image Hash coding method based on deep learning
CN110059205A (en) * 2019-03-20 2019-07-26 杭州电子科技大学 A kind of threedimensional model classification retrieving method based on multiple view

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
"Unsupervised segmentation of multiview feature semantics by hashing model"; Jia Cui et al.; Signal Processing; 2019-02-15; pp. 106-110 *

Also Published As

Publication number Publication date
CN110674333A (en) 2020-01-10

Similar Documents

Publication Publication Date Title
CN108920720B (en) Large-scale image retrieval method based on depth hash and GPU acceleration
CN111723220B (en) Image retrieval method and device based on attention mechanism and Hash and storage medium
US20180276528A1 (en) Image Retrieval Method Based on Variable-Length Deep Hash Learning
CN110674333B (en) Large-scale image high-speed retrieval method based on multi-view enhanced depth hashing
CN110222218B (en) Image retrieval method based on multi-scale NetVLAD and depth hash
Shi et al. Deep adaptively-enhanced hashing with discriminative similarity guidance for unsupervised cross-modal retrieval
CN111198959A (en) Two-stage image retrieval method based on convolutional neural network
CN108875076B (en) Rapid trademark image retrieval method based on Attention mechanism and convolutional neural network
CN111080551B (en) Multi-label image complement method based on depth convolution feature and semantic neighbor
CN114358188A (en) Feature extraction model processing method, feature extraction model processing device, sample retrieval method, sample retrieval device and computer equipment
CN109918507B (en) textCNN (text-based network communication network) improved text classification method
CN109960732B (en) Deep discrete hash cross-modal retrieval method and system based on robust supervision
CN115248876B (en) Remote sensing image overall recommendation method based on content understanding
CN107180079B (en) Image retrieval method based on convolutional neural network and tree and hash combined index
WO2023036157A1 (en) Self-supervised spatiotemporal representation learning by exploring video continuity
KR102305575B1 (en) Method and system for highlighting similar areas using similarity between images
CN110598022A (en) Image retrieval system and method based on robust deep hash network
CN112766458A (en) Double-current supervised depth Hash image retrieval method combining classification loss
Shen et al. Unsupervised multiview distributed hashing for large-scale retrieval
CN117556067B (en) Data retrieval method, device, computer equipment and storage medium
CN117635275B (en) Intelligent electronic commerce operation commodity management platform and method based on big data
CN115564013B (en) Method for improving learning representation capability of network representation, model training method and system
CN115080699A (en) Cross-modal retrieval method based on modal specific adaptive scaling and attention network
CN113641790A (en) Cross-modal retrieval model based on distinguishing representation depth hash
CN114168770A (en) Deep learning-based method and device for searching images by images

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant