CN113191445B - Large-scale image retrieval method based on a self-supervised adversarial hashing algorithm - Google Patents

Large-scale image retrieval method based on a self-supervised adversarial hashing algorithm

Info

Publication number
CN113191445B
CN113191445B (application CN202110531130.8A)
Authority
CN
China
Prior art keywords
image
hash
generator
hash code
encoder
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110531130.8A
Other languages
Chinese (zh)
Other versions
CN113191445A (en)
Inventor
曹媛
刘峻玮
桂杰
许晓伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ocean University of China
Original Assignee
Ocean University of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ocean University of China filed Critical Ocean University of China
Priority to CN202110531130.8A priority Critical patent/CN113191445B/en
Publication of CN113191445A publication Critical patent/CN113191445A/en
Application granted granted Critical
Publication of CN113191445B publication Critical patent/CN113191445B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F 18/22: Matching criteria, e.g. proximity measures
    • G06F 18/24: Classification techniques
    • G06F 18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2413: Classification techniques relating to the classification model based on distances to training or reference patterns
    • G06F 18/24147: Distances to closest patterns, e.g. nearest neighbour classification
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G06N 3/08: Learning methods
    • G06N 3/084: Backpropagation, e.g. using gradient descent

Abstract

The invention provides a large-scale image retrieval method based on a self-supervised adversarial hashing algorithm. A new hash learning framework, called self-supervised adversarial hashing, is proposed; the framework learns discriminative hash codes primarily by using an image-rotation-based self-supervised similarity metric together with a generative adversarial network. The neural network model mainly comprises an encoder that produces the hash code, a generator that generates a pseudo image, and a discriminator that distinguishes real images from generated ones. A loss function composed of an approximate semantic similarity loss, a feature loss and an adversarial loss is designed to preserve the similarity between images and hash codes. Self-supervised features are added to the whole model so that low-level semantic information is ignored and high-level semantic information is retained; in particular for short hash codes, the high-level semantic information of the image is better preserved. Experimental results show that, compared with existing retrieval methods, the proposed method achieves better image retrieval performance.

Description

Large-scale image retrieval method based on a self-supervised adversarial hashing algorithm
Technical Field
The invention belongs to the technical field of deep learning, and particularly relates to a large-scale image data retrieval method based on a self-supervised adversarial hashing algorithm.
Background
Hash algorithms have attracted increasing attention for large-scale image retrieval because of their low storage requirements and high search efficiency. They can be divided into supervised and unsupervised hashing according to whether image labels are used, and supervised hashing methods generally perform better than unsupervised ones. In most cases, however, no useful label information is available in the dataset, and manual annotation requires considerable labor. To address this problem, many researchers have tried to improve unsupervised methods. For example, Gidaris et al. proposed a self-supervised method based on image rotation; however, this may lead to different feature representations of an image before and after rotation. Although Misra et al. addressed this issue, they did not map the similarity relations of similar images in the original space into the feature space.
With the rise of deep learning, deep learning algorithms can be divided into two categories: supervised learning and unsupervised learning. Supervised learning algorithms are favored for their high accuracy, but manually annotated labels are not readily available and require substantial human resources. In recent years, unsupervised learning algorithms have therefore received increasing attention. Self-supervised learning is a popular choice within unsupervised learning, and its popularity is not accidental: once mainstream supervised learning tasks mature, data becomes the most important bottleneck. Learning effective information from unlabeled data is an important research topic, and self-supervised learning offers rich possibilities.
Disclosure of Invention
The invention aims to provide a large-scale image retrieval method based on a self-supervised adversarial hashing algorithm to make up for the shortcomings of the prior art.
In order to achieve the purpose, the invention adopts the following specific technical scheme:
A large-scale image retrieval method based on a self-supervised adversarial hashing algorithm comprises the following steps:
S1: acquiring image data comprising a training set and a test set;
S2: optimizing the encoder by using the training set;
S3: rotating the test-set images and inputting them into the encoder optimized in S2 to obtain hash codes;
S4: calculating the Hamming distances between the hash codes obtained in S3 and the hash codes of the training set from S2, sorting the Hamming distances from small to large, and outputting the first k retrieval results to complete the retrieval; a minimal sketch of this ranking step is given below.
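The following Python sketch illustrates the Hamming-distance ranking step S4; the function name hamming_rank, the 0/1 bit representation of the codes and the use of NumPy are assumptions made for illustration and are not part of the claimed method.

    import numpy as np

    def hamming_rank(query_code, db_codes, k):
        # query_code: (L,) array of 0/1 bits for one test image
        # db_codes:   (N, L) array of 0/1 bits for the training set
        dists = np.count_nonzero(db_codes != query_code, axis=1)  # Hamming distances
        order = np.argsort(dists, kind="stable")                  # small to large
        return order[:k]                                          # indices of the top-k results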
Further, in S2: the encoder uses a structure similar to VGG19, comprising five convolutional layers, two fully connected layers and one hash layer; for feature comparison, a fully connected layer is added at the end. Using the relationship between image neighborhood structures, i.e. the relationship between the hash code B and the semantic similarity matrix S, the following objective function l_s is proposed to learn hash codes that approximate the original data distribution in the projection space as closely as possible:

l_s = min_E (1/N^2) Σ_{i,j} ((1/L) b_i^T b_j - S_{ij})^2    (4)

where L is the length of the hash code, S is the similarity matrix, and E indicates that this objective optimizes the encoder, which is used to generate the hash codes B = {b_i}_{i=1}^N. Optimizing l_s makes similar images in the original space have similar hash codes when mapped to the hash space. An illustrative sketch of this loss is given below.
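As an illustration of the loss l_s, a minimal PyTorch sketch follows; the relaxed (non-binary) codes and the batch-wise evaluation are assumptions made for differentiability and are not specified in the text.

    import torch

    def similarity_loss(b, s):
        # b: (N, L) relaxed hash codes in [-1, 1] produced by the encoder
        # s: (N, N) semantic similarity matrix with entries in {-1, +1}
        n, code_len = b.shape
        inner = b @ b.t() / code_len            # (1/L) * b_i^T b_j
        return ((inner - s) ** 2).sum() / (n * n)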
Further, the encoder optimization in S2 specifically includes:
S2-1: obtaining feature vectors of the training-set images, calculating the cosine distances between the images and sorting them to obtain a similarity ranking;
S2-2: analyzing the similarity ranking and setting a threshold to obtain a similarity matrix;
S2-3: rotating the training-set images and inputting them into the encoder to obtain hash codes;
S2-4: inputting the hash codes into a generator to obtain pseudo images;
S2-5: inputting the pseudo images and the real images into a discriminator simultaneously for adversarial training;
S2-6: optimizing the encoder, the generator and the discriminator according to an objective function; the optimized encoder, generator and discriminator form the self-supervised adversarial hashing algorithm.
Further, S2-1 is specifically: for the database points X = {x_i}_{i=1}^N, feature vectors F = {f_i}_{i=1}^N are extracted from the pool5 layer of a VGG model using the k-nearest-neighbor (KNN) method, and the cosine distances between them are calculated and sorted in ascending order to obtain the similarity ranking.
Further, in S2-2: according to the cosine similarity of each image, its K1 nearest neighbors are taken as its neighborhood, giving the initial matrix S1, computed as:

S1_{ij} = 1 if x_j ∈ K1-NN(x_i), and -1 otherwise    (1)

where x_i and x_j are image feature vectors and K1-NN(x_i) denotes the K1 nearest neighbors of x_i. On the basis of S1, the rows of S1 are compared and used to construct S2 as follows:

S2_{ij} = 1 if x_j ∈ K2-NN(x_i) under the ranking induced by the rows of S1, and -1 otherwise    (2)

where K2-NN(x_i) denotes the K2 nearest neighbors of x_i. Finally, the two matrices are combined into S, computed as follows:

S_{ij} = 1 if S1_{ij} = 1 or S2_{ij} = 1, and -1 otherwise    (3)

A sketch of this construction is given below.
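For illustration, a Python sketch of the construction of S in S2-1/S2-2 follows, using the embodiment's values K1 = 20 and K2 = 30 as defaults; the second ranking over the rows of S1 is one interpretation of formula (2), and the helper name build_similarity_matrix is illustrative.

    import numpy as np

    def build_similarity_matrix(features, k1=20, k2=30):
        # features: (N, d) array, e.g. VGG pool5 features of the training images
        n = features.shape[0]
        f = features / np.linalg.norm(features, axis=1, keepdims=True)
        cos_dist = 1.0 - f @ f.T                       # cosine distance (S2-1)
        nn1 = np.argsort(cos_dist, axis=1)[:, :k1]     # K1 nearest neighbours

        s1 = -np.ones((n, n))
        s1[np.repeat(np.arange(n), k1), nn1.ravel()] = 1.0   # S1: direct neighbours

        # S2: rank the images again, using the rows of S1 as new feature vectors
        row_dist = 1.0 - (s1 @ s1.T) / n
        nn2 = np.argsort(row_dist, axis=1)[:, :k2]
        s2 = -np.ones((n, n))
        s2[np.repeat(np.arange(n), k2), nn2.ravel()] = 1.0

        # merge: a pair is similar if either ranking marks it as neighbours
        return np.where((s1 == 1.0) | (s2 == 1.0), 1.0, -1.0)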
further, in the S2-4: the generator consists of a full-link layer and four deconvolution layers, and is generatedIn the device, a hash code is input as 'noise' to generate a new image; specifically, the hash code is input into a fully concatenated layer of size 8 × 8 × 256, and then 3 deconvolution layers of 5 × 5 and 1 × 1 are used, the number of kernels being 256, 128, 32, and 3, respectively; for the image I generated by the generatorGAnd an original image I, and an objective function l is provided between the feature vectorsf1(ii) a The objective function is defined as follows:
Figure BDA0003067929210000034
where Ψ (-) denotes the convolution-activated feature vector, w and h denote the sizes of the corresponding features, and D denotes the adjustment of the parameters in the arbiter; the generator generates a new image by using the hash code of the rotated image, so that a larger difference exists between the feature vector of the new image and the original image in consideration of low-level semantic information in the image; based on this problem, the image I after rotationRAn objective function is arranged between the feature vectors of the original image I and the feature vectors of the original image I, the objective function being to ensure that the feature vectors of the same image are as similar as possible irrespective of rotation, thereby reducing the new image IRAnd a rotated image I obtained from the encoderGThe feature vector of the original image is obtained from the discriminator. Therefore, we use this loss function to optimize the encoder and the discriminator; so the finally set objective function lf2The following were used:
Figure BDA0003067929210000035
wherein, IRThe image is rotated, and I is an original image; the method aims to ignore semantic information of a lower layer, so that the loss of characteristics of the lower layer is not considered. Wherein lf=lf1+γlf2And gamma is a weight parameter.
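A PyTorch sketch of the generator and of the feature loss l_f = l_f1 + γ l_f2 is given below; the strides, paddings and the 32 × 32 output size are assumptions needed to make the layer sizes concrete, since the text only specifies the kernel counts.

    import torch
    import torch.nn as nn

    class Generator(nn.Module):
        # hash code ("noise") -> pseudo image, as described in S2-4
        def __init__(self, code_len=12):
            super().__init__()
            self.fc = nn.Linear(code_len, 8 * 8 * 256)
            # 5x5 deconvolutions plus a final 1x1 layer with 256, 128, 32 and 3 kernels;
            # strides and paddings are chosen to reach a 32x32x3 output (an assumption)
            self.deconv = nn.Sequential(
                nn.ConvTranspose2d(256, 256, 5, stride=1, padding=2), nn.ReLU(),
                nn.ConvTranspose2d(256, 128, 5, stride=2, padding=2, output_padding=1), nn.ReLU(),
                nn.ConvTranspose2d(128, 32, 5, stride=2, padding=2, output_padding=1), nn.ReLU(),
                nn.ConvTranspose2d(32, 3, 1), nn.Tanh(),
            )

        def forward(self, b):
            x = self.fc(b).view(-1, 256, 8, 8)
            return self.deconv(x)

    def feature_loss(feat_fake, feat_real, feat_rotated, gamma=3.0):
        # l_f = l_f1 + gamma * l_f2 over discriminator feature maps, cf. (5)-(6)
        wh = feat_real.shape[-2] * feat_real.shape[-1]
        l_f1 = ((feat_fake - feat_real) ** 2).sum() / wh
        l_f2 = ((feat_rotated - feat_real) ** 2).sum() / wh
        return l_f1 + gamma * l_f2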
Further, in S2-5: following the structure of adversarial learning, a discriminator D is set up to judge whether an image is real or fake, so as to optimize the generator G; the optimization of a generative adversarial network is a minimax game problem. The discriminator D consists of four convolutional layers with 32, 128, 256 and 256 kernels, followed by a fully connected layer of size 1024, with ELU as the activation function. The generative model is essentially a maximum likelihood estimate used to generate data with a specified distribution; its role is to capture the distribution of the sample data and, through the transformation of the parameters in the maximum likelihood estimation, map the distribution of the original input into samples with the specified distribution. The standard objective function of a generative adversarial network is used:

l_d' = min_G max_D E_{I~p_data(I)}[log D(I)] + E_b[log(1 - D(G(b)))]    (7)

where G indicates that the parameters of the generator are adjusted, D(·) denotes the output of the last layer of the discriminator, and G(·) denotes the output of the last layer of the generator. To make the effect of the generative adversarial network more pronounced, random noise is also input into the generator, and the generator and the discriminator are optimized with the following loss function:

l_d'' = min_G max_D E_{I~p_data(I)}[log D(I)] + E_{z~p_z(z)}[log(1 - D(G(z)))]    (8)

where z is random noise, and l_d = l_d' + l_d''.
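A PyTorch sketch of the discriminator and of a binary-cross-entropy form of the adversarial objective follows; the convolution strides and kernel sizes, the 32 × 32 input size and the non-saturating loss form are assumptions, not details given in the text.

    import torch
    import torch.nn as nn

    class Discriminator(nn.Module):
        # real/fake decision plus intermediate features used for the feature loss (S2-5)
        def __init__(self):
            super().__init__()
            self.conv = nn.Sequential(
                nn.Conv2d(3, 32, 5, stride=2, padding=2), nn.ELU(),
                nn.Conv2d(32, 128, 5, stride=2, padding=2), nn.ELU(),
                nn.Conv2d(128, 256, 5, stride=2, padding=2), nn.ELU(),
                nn.Conv2d(256, 256, 5, stride=2, padding=2), nn.ELU(),
            )
            self.head = nn.Sequential(
                nn.Flatten(),
                nn.Linear(256 * 2 * 2, 1024), nn.ELU(),   # assumes 32x32 input images
                nn.Linear(1024, 1),
            )

        def forward(self, img):
            feat = self.conv(img)
            return self.head(feat), feat                  # (logit, feature map)

    def adversarial_losses(logit_real, logit_fake):
        # binary cross-entropy form of the minimax objective
        bce = nn.functional.binary_cross_entropy_with_logits
        d_loss = (bce(logit_real, torch.ones_like(logit_real))
                  + bce(logit_fake, torch.zeros_like(logit_fake)))
        g_loss = bce(logit_fake, torch.ones_like(logit_fake))
        return d_loss, g_loss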
Further, in S2-6: the overall loss function, i.e. the objective function, is the weighted sum of the above loss terms, with two weights α and β:

L = l_s + αl_f + βl_d    (9);

network optimization is performed using (9) above, and a hash layer is added after the last fully connected layer to obtain the hash code; the hash-layer weights (denoted W here) and the parameters θ, η and ξ need to be learned, and the computation proceeds as follows:

b_i = E(x_i^R; W, θ)    (10)

where x_i^R denotes the input rotated picture, and W and θ denote parameters of the encoder network;

I_G = G(b_i; η)    (11)

where η denotes the parameters of the generator;

D(I; ξ)    (12)

where D(·; ξ) denotes the discriminator output and ξ denotes the parameters of the discriminator; the objective function is optimized using back-propagation and stochastic gradient descent. A sketch of one such optimization step is given below.
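The following PyTorch sketch illustrates one optimization step of the overall objective (9), reusing the helper functions sketched above; relaxing the hash layer with tanh during training and updating all three networks with a single optimizer are simplifying assumptions (in practice the generator and discriminator would typically be updated alternately).

    import torch

    def train_step(encoder, generator, discriminator, optimizer,
                   images, rotated, s_batch, alpha=1.0, beta=1.0, gamma=3.0):
        # one optimization step of L = l_s + alpha*l_f + beta*l_d, cf. Eq. (9)
        b = torch.tanh(encoder(rotated))            # relaxed hash codes b_i, cf. Eq. (10)
        fake = generator(b)                         # I_G = G(b_i; eta), Eq. (11)

        logit_real, feat_real = discriminator(images)
        logit_fake, feat_fake = discriminator(fake)
        _, feat_rot = discriminator(rotated)

        l_s = similarity_loss(b, s_batch)
        l_f = feature_loss(feat_fake, feat_real, feat_rot, gamma)
        d_loss, g_loss = adversarial_losses(logit_real, logit_fake)

        loss = l_s + alpha * l_f + beta * (d_loss + g_loss)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return float(loss)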
The invention has the advantages and technical effects that:
the invention provides a new Hash learning framework, which is called self-supervision counterattack Hash; the framework primarily learns discriminative hash codes using image rotation based self-supervised similarity metrics and generation countermeasure networks (GANs). The neural network model mainly comprises an encoder for acquiring a hash code, a generator for generating a pseudo image and a discriminator for distinguishing a true image from a false image; a loss function consisting of approximate semantic similarity loss, feature loss and antagonism loss is designed to maintain the similarity between the image and the hash code. Adding self-supervision characteristics in the whole model, neglecting bottom semantic information and keeping high-level semantic information; particularly for short hash codes, the high-level semantic information of the image can be better maintained.
And the experimental result shows that compared with the conventional retrieval method, the image retrieval method provided by the invention has better image retrieval performance.
Drawings
FIG. 1 is a diagram illustrating the self-supervised adversarial hashing process of the present invention.
FIG. 2 is an effect diagram of the similarity matrix S generated by the present invention.
FIG. 3 is a graph comparing the effect of the weight parameter γ on the mean average precision (mAP) for different bit lengths.
FIG. 4 is a comparison of the results with and without the pixel-level loss in the loss function.
Detailed Description
The invention will be further explained and illustrated by means of specific embodiments and with reference to the drawings.
Example 1:
A large-scale image retrieval method based on a self-supervised adversarial hashing algorithm comprises the following steps (as shown in FIG. 1):
Step 1: first, a semantic similarity matrix S is extracted from the dataset (see the semantic similarity matrix part of FIG. 1 and FIG. 2);
Step 2: the image is rotated by a certain angle and input into the encoder to obtain the hash code (see the Encoder part of FIG. 1);
Step 3: the hash code is input into the generator to obtain a new image (see the Generator part of FIG. 1);
Step 4: the original image and the new image are input into a discriminator for adversarial discrimination (see the Discriminator part of FIG. 1);
Step 5: the network is optimized according to the objective function. The performance (Table 1) and training time (Table 2) of the self-supervised adversarial hashing algorithm (SHGan) and several hashing algorithms (iterative quantization (ITQ), locality-sensitive hashing (LSH), spectral hashing (SH), spherical hashing, deep binary hashing (DeepBit), deep hashing (DH), binary adversarial hashing (BGAN)) on the Cifar-10 dataset are shown below:
TABLE 1 Mean average precision (mAP) results on the Cifar-10 dataset after rotating the images by 90 degrees
[table data not reproduced]
TABLE 2 Training and testing times for deep hashing (DH), binary adversarial hashing (BGAN) and self-supervised adversarial hashing (SHGan)
[table data not reproduced]
TABLE 3 Mean average precision of 12-bit hash codes obtained with 90- and 180-degree rotation in self-supervised adversarial hashing
Rotation angle mAP
90 0.495
180 0.510
Example 2:
Embodiment 1 comprises the following concrete steps:
Step 1: for the database points X = {x_i}_{i=1}^N, feature vectors F = {f_i}_{i=1}^N are extracted from the pool5 layer of a VGG model using the k-nearest-neighbor (KNN) method, and the cosine distances between them are calculated and sorted in ascending order. According to the cosine similarity of each image, its K1 nearest neighbors are taken as its neighborhood, giving the initial matrix S1, computed as:

S1_{ij} = 1 if x_j ∈ K1-NN(x_i), and -1 otherwise    (1)

where x_i and x_j are image feature vectors and K1-NN(x_i) denotes the K1 nearest neighbors of x_i. On the basis of S1, the rows of S1 are compared and used to construct S2 as follows:

S2_{ij} = 1 if x_j ∈ K2-NN(x_i) under the ranking induced by the rows of S1, and -1 otherwise    (2)

where K2-NN(x_i) denotes the K2 nearest neighbors of x_i. Finally, the two matrices are combined into S, computed as follows:

S_{ij} = 1 if S1_{ij} = 1 or S2_{ij} = 1, and -1 otherwise    (3)
Step 2: the picture is rotated by a certain angle and then input into the encoder E, which uses a structure similar to VGG19, comprising five convolutional layers, two fully connected layers and one hash layer. For feature comparison, a fully connected layer is added at the end. Using the relationship between image neighborhood structures, i.e. the relationship between the hash code B and the semantic similarity matrix S, the following objective function is proposed to learn hash codes that approximate the original data distribution in the projection space as closely as possible:

l_s = min_E (1/N^2) Σ_{i,j} ((1/L) b_i^T b_j - S_{ij})^2    (4)

where L is the length of the hash code, S is the similarity matrix, and E indicates that this objective optimizes the encoder, which is used to generate the hash codes B = {b_i}_{i=1}^N. Optimizing l_s makes similar images in the original space have similar hash codes when mapped to the hash space.
Step 3: in the generator G, the hash code B is input as "noise" to generate a new image. Specifically, the hash code B is input into a fully connected layer of size 8 × 8 × 256, followed by 5 × 5 deconvolution layers and a 1 × 1 deconvolution layer with 256, 128, 32 and 3 kernels, respectively. For the image I_G generated by the generator and the original image I, an objective function is defined on their feature vectors as follows:

l_f1 = min_D (1/(wh)) Σ ||Ψ(I_G) - Ψ(I)||^2    (5)

where Ψ(·) denotes the convolution-activated feature vector, w and h denote the sizes of the corresponding feature maps, and D indicates that the parameters of the discriminator are adjusted. However, because the generator generates the new image from the hash code of the rotated image, the feature vector of the new image may differ considerably from that of the original image when low-level semantic information is taken into account. To address this, an objective function is set between the feature vectors of the rotated image I_R and the original image I; its purpose is to ensure that the feature vectors of the same image are as similar as possible regardless of rotation, thereby reducing the gap between the generated image I_G (obtained from the rotated image via the encoder) and the original image, whose feature vector is obtained from the discriminator. This loss function is therefore used to optimize the encoder and the discriminator. The objective function is set as follows:

l_f2 = min_{E,D} (1/(wh)) Σ ||Ψ(I_R) - Ψ(I)||^2    (6)

where I_R is the rotated image and I is the original image; the aim is to ignore low-level semantic information, so low-level feature losses are not considered. The feature loss is l_f = l_f1 + γ l_f2, where γ is a weight parameter.
Step 4: the original picture I and the pseudo picture I_G are input into the discriminator; following the structure of adversarial learning, a discriminator D is set up to judge whether a picture is real or fake, so as to optimize the generator. The optimization of a generative adversarial network is a minimax game problem. The discriminator D consists of four convolutional layers with 32, 128, 256 and 256 kernels, followed by a fully connected layer of size 1024, with ELU as the activation function. The generative model is essentially a maximum likelihood estimate used to generate data with a specified distribution; its role is to capture the distribution of the sample data and, through the transformation of the parameters in the maximum likelihood estimation, map the distribution of the original input into samples with the specified distribution. The standard objective function of a generative adversarial network is used:

l_d' = min_G max_D E_{I~p_data(I)}[log D(I)] + E_b[log(1 - D(G(b)))]    (7)

where G indicates that the parameters of the generator are adjusted, D(·) denotes the output of the last layer of the discriminator, and G(·) denotes the output of the last layer of the generator. To make the effect of the generative adversarial network more pronounced, random noise is also input into the generator, and the generator and the discriminator are optimized with the following loss function:

l_d'' = min_G max_D E_{I~p_data(I)}[log D(I)] + E_{z~p_z(z)}[log(1 - D(G(z)))]    (8)

where z is random noise, and l_d = l_d' + l_d''.
The overall loss function is the weighted sum of the above loss terms, with two weights α and β:

L = l_s + αl_f + βl_d    (9)
Step 5: the network is optimized using (9) above, and a hash layer is added after the last fully connected layer to obtain the hash code; the hash-layer weights (denoted W here) and the parameters θ, η and ξ need to be learned, and the computation proceeds as follows:

b_i = E(x_i^R; W, θ)    (10)

where x_i^R denotes the input rotated picture, and W and θ denote parameters of the encoder network.

I_G = G(b_i; η)    (11)

where η denotes the parameters of the generator G.

D(I; ξ)    (12)

where D(·; ξ) denotes the discriminator output and ξ denotes the parameters of the discriminator D; the objective function is optimized using back-propagation and stochastic gradient descent.
Example 3:
Experiments were performed on Cifar-10. Cifar-10 is a dataset compiled by Alex Krizhevsky and Ilya Sutskever. It contains 60000 images (32 × 32) in 10 categories of 6000 pictures each. In Cifar-10, 1000 pictures per class are randomly drawn as the training set and 100 pictures per class as the test set.
The evaluation metrics mean average precision (mAP) and average precision (AP) are used to evaluate the method. For each query, AP is computed over the top-k results, and mAP is the mean of AP over all queries. AP is calculated as:

AP = (1/N) Σ_{k=1}^{K} P(k) δ(k)    (13)

where N is the number of instances in the database that are relevant to the query according to the ground truth, P(k) is the precision over the first k retrieved instances, and δ(k) = 1 if the k-th instance is relevant to the query and 0 otherwise.
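A short Python sketch of this evaluation metric, as given in (13), follows; the relevance flags are assumed to be available per query in ranked retrieval order, and the function names are illustrative.

    import numpy as np

    def average_precision(relevant, k, n_relevant=None):
        # relevant: 0/1 flags for the retrieved instances, in ranked order
        rel = np.asarray(relevant[:k], dtype=float)
        n = n_relevant if n_relevant is not None else rel.sum()   # N in Eq. (13)
        if n == 0:
            return 0.0
        p_at_k = np.cumsum(rel) / (np.arange(len(rel)) + 1)       # P(k) at each cut-off
        return float((p_at_k * rel).sum() / n)

    def mean_average_precision(relevance_lists, k):
        # mAP: mean of the per-query average precisions
        return float(np.mean([average_precision(r, k) for r in relevance_lists]))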
Results on the cifar-10 dataset:
first, let K1 be 20 and K2 be 30 to obtain the semantic similarity matrix S. Fig. 2 shows a part of data in the matrix. Then, the minimum batch size was set to 256, and the learning rate was set to 0.0001. Setting alpha-1, beta-1 and gamma-3. The rotation angle is 90 degrees.
Table 1 shows the average precision mean results for 12, 24, 32 and 48 bits. The results show that the average precision mean results of 12 bits, 24 bits and 48 bits are respectively improved by 9.4%, 9.8% and 5.2%. This shows that the present invention has better performance on fewer bits, and the hash code provided by the present invention can better represent high-level semantic information of an image, thereby verifying the above inference. To further verify this idea, the image was rotated by 180 degrees again for the experiment, and the results are shown in table 3. As shown in table 3, the result of the 12-bit hash code is improved by 10.9%.
The training and testing times of deep hashing (DH), binary adversarial hashing (BGAN) and self-supervised adversarial hashing (SHGan) were further compared, as shown in Table 2. BGAN and SHGan have more parameters, so their training and testing times are longer than those of DH. However, thanks to the advantages of the hashing algorithm, both BGAN and SHGan generate hash codes very quickly.
The influence of the parameter gamma on the experimental results was investigated. As shown in fig. 3, it was found that γ had little influence on the experimental results.
Following the pixel loss function in binary adversarial hashing (BGAN), a pixel-level loss was added to the self-supervised adversarial hashing for comparison. The formula is as follows:

l_p = (1/(wh)) Σ_{i,j} (I_{ij} - I^G_{ij})^2    (14)

where I_{ij} and I^G_{ij} denote the original image and the generated pseudo image, respectively. Since higher bit numbers are more representative of the pixel information in the image, the experimental results of the 32-bit and 48-bit hash codes are compared; the results are shown in FIG. 4. The mean average precision (mAP) of the method decreases significantly when the pixel loss is added, which further shows that the invention enables the neural network to learn the high-level semantic information of the image.
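For reference, a short Python sketch of this pixel-level loss follows; the normalization by the image size is an assumption made here for consistency with the feature losses above.

    import torch

    def pixel_loss(real, fake):
        # pixel-level loss between the original images and the generated pseudo images, cf. (14)
        wh = real.shape[-2] * real.shape[-1]
        return ((real - fake) ** 2).sum() / wh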
In summary, the invention provides a self-supervised hashing algorithm based on a generative adversarial network, called self-supervised adversarial hashing. It consists of an encoder, a generator and a discriminator. A loss function composed of an approximate semantic similarity loss, a feature loss and an adversarial loss is designed to preserve the similarity between images and hash codes. The learned hash codes better represent the high-level semantic information of images, which improves the accuracy of image retrieval. Experimental results on the Cifar-10 dataset show that the proposed self-supervised adversarial hashing achieves higher performance. The invention provides a self-supervised learning method that uses rotation-based (or other transformation-based) self-supervised information to design the objective function; the generative adversarial network is one of the most promising self-supervised learning approaches and can effectively generate synthetic data from the latent space that resembles the training data.

Claims (6)

1. A large-scale image retrieval method based on a self-supervised adversarial hashing algorithm, characterized by comprising the following steps:
S1: acquiring image data comprising a training set and a test set;
S2: optimizing the encoder by using the training set; the encoder optimization specifically comprises:
S2-1: obtaining feature vectors of the training-set images, calculating the cosine distances between the images and sorting them to obtain a similarity ranking;
S2-2: analyzing the similarity ranking and setting a threshold to obtain a similarity matrix;
S2-3: rotating the training-set images and inputting them into the encoder to obtain hash codes;
S2-4: inputting the hash codes into a generator to obtain pseudo images;
S2-5: inputting the pseudo images and the real images into a discriminator simultaneously for adversarial training;
S2-6: optimizing the encoder, the generator and the discriminator according to an objective function; the optimized encoder, generator and discriminator form the self-supervised adversarial hashing algorithm;
S3: rotating the test-set image data and inputting it into the encoder optimized in S2 to obtain hash codes;
S4: calculating the Hamming distances between the hash codes obtained in S3 and the hash codes of the training set from S2, sorting the Hamming distances from small to large, and outputting the first k retrieval results to complete the retrieval;
in S2: the encoder uses a structure similar to VGG19, comprising five convolutional layers, two fully connected layers and one hash layer; a fully connected layer is added; using the relationship between image neighborhood structures, i.e. the relationship between the hash code and the semantic similarity matrix, the following objective function is proposed to learn hash codes that approximate the original data distribution in the projection space:

l_s = min_E (1/N^2) Σ_{i,j} ((1/L) b_i^T b_j - S_{ij})^2    (4)

where L is the length of the hash code, and the encoder is used to generate the hash codes B = {b_i}_{i=1}^N; optimizing l_s makes similar images in the original space have similar hash codes when mapped to the hash space.
2. The large-scale image retrieval method according to claim 1, wherein S2-1 specifically is: for the database points X = {x_i}_{i=1}^N, feature vectors F = {f_i}_{i=1}^N are extracted from the pool5 layer of a VGG model using the k-nearest-neighbor (KNN) method, and the cosine distances between them are calculated and sorted in ascending order to obtain the similarity ranking.
3. The large-scale image retrieval method according to claim 1, wherein in S2-2: according to the cosine similarity of each image, its K1 nearest neighbors are taken as its neighborhood, giving the initial matrix S1, computed as:

S1_{ij} = 1 if x_j ∈ K1-NN(x_i), and -1 otherwise    (1)

where x_i and x_j are image feature vectors and K1-NN(x_i) denotes the K1 nearest neighbors of x_i; on the basis of S1, the rows of S1 are compared and used to construct S2 as follows:

S2_{ij} = 1 if x_j ∈ K2-NN(x_i) under the ranking induced by the rows of S1, and -1 otherwise    (2)

where K2-NN(x_i) denotes the K2 nearest neighbors of x_i; finally, the two matrices are combined into S, computed as follows:

S_{ij} = 1 if S1_{ij} = 1 or S2_{ij} = 1, and -1 otherwise    (3).
4. The large-scale image retrieval method according to claim 1, wherein in S2-4: the generator consists of one fully connected layer and four deconvolution layers; in the generator, the hash code is input as "noise" to generate a new image; specifically, the hash code is input into a fully connected layer of size 8 × 8 × 256, followed by 5 × 5 and 1 × 1 deconvolution layers with 256, 128, 32 and 3 kernels, respectively; for the image I_G generated by the generator and the original image I, an objective function is defined on their feature vectors as follows:

l_f1 = min_D (1/(wh)) Σ ||Ψ(I_G) - Ψ(I)||^2    (5)

where Ψ(·) denotes the convolution-activated feature vector and w and h denote the sizes of the corresponding feature maps; the generator generates the new image from the hash code of the rotated image, and an objective function is set between the feature vectors of the rotated image I_R and the original image I, the feature vector of the original image being obtained from the discriminator; this loss function is used to optimize the encoder and the discriminator; the final objective function is therefore as follows:

l_f2 = min_{E,D} (1/(wh)) Σ ||Ψ(I_R) - Ψ(I)||^2    (6)

where l_f = l_f1 + γl_f2 and γ is a weight parameter.
5. The large-scale image retrieval method according to claim 1, wherein in S2-5: following the structure of adversarial learning, a discriminator D is set up to judge whether an image is real or fake, so as to optimize the generator G; the discriminator D consists of four convolutional layers with 32, 128, 256 and 256 kernels, followed by a fully connected layer of size 1024, with ELU as the activation function; the standard objective function of a generative adversarial network is used:

l_d' = min_G max_D E_{I~p_data(I)}[log D(I)] + E_b[log(1 - D(G(b)))]    (7)

to make the effect of the generative adversarial network more pronounced, random noise is input into the generator, and the generator and the discriminator are optimized with the following loss function:

l_d'' = min_G max_D E_{I~p_data(I)}[log D(I)] + E_{z~p_z(z)}[log(1 - D(G(z)))]    (8)

where z is random noise, and l_d = l_d' + l_d''.
6. The large-scale image retrieval method according to claim 1, wherein in S2-6: the overall loss function, i.e. the objective function, is the sum of the three loss terms, weighted by the two weights α and β:

L = l_s + αl_f + βl_d    (9);

network optimization is performed using (9) above, and a hash layer is added after the last fully connected layer to obtain the hash code; the hash-layer weights (denoted W here) and the parameters θ, η and ξ need to be learned, and the computation proceeds as follows:

b_i = E(x_i^R; W, θ)    (10)

where x_i^R denotes the input rotated picture, and W and θ denote parameters of the encoder network;

I_G = G(b_i; η)    (11)

where η denotes the parameters of the generator;

D(I; ξ)    (12)

where D(·; ξ) denotes the discriminator output and ξ denotes the parameters of the discriminator; the objective function is optimized using back-propagation and stochastic gradient descent.
CN202110531130.8A 2021-05-16 2021-05-16 Large-scale image retrieval method based on a self-supervised adversarial hashing algorithm Active CN113191445B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110531130.8A CN113191445B (en) 2021-05-16 2021-05-16 Large-scale image retrieval method based on a self-supervised adversarial hashing algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110531130.8A CN113191445B (en) 2021-05-16 2021-05-16 Large-scale image retrieval method based on a self-supervised adversarial hashing algorithm

Publications (2)

Publication Number Publication Date
CN113191445A CN113191445A (en) 2021-07-30
CN113191445B true CN113191445B (en) 2022-07-19

Family

ID=76981846

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110531130.8A Active CN113191445B (en) Large-scale image retrieval method based on a self-supervised adversarial hashing algorithm

Country Status (1)

Country Link
CN (1) CN113191445B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113326390B (en) * 2021-08-03 2021-11-02 中国海洋大学 Image retrieval method based on depth feature consistent Hash algorithm
CN113946710A (en) * 2021-10-12 2022-01-18 浙江大学 Video retrieval method based on multi-mode and self-supervision characterization learning

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109063112A (en) * 2018-07-30 2018-12-21 成都快眼科技有限公司 A kind of fast image retrieval method based on multi-task learning deep semantic Hash, model and model building method
CN109918528A (en) * 2019-01-14 2019-06-21 北京工商大学 A kind of compact Hash code learning method based on semanteme protection
CN109960737A (en) * 2019-03-15 2019-07-02 西安电子科技大学 Remote Sensing Images search method of the semi-supervised depth confrontation from coding Hash study
CN110110128A (en) * 2019-05-06 2019-08-09 西南大学 The discrete hashing image searching system of quickly supervision for distributed structure/architecture
CN110516095A (en) * 2019-08-12 2019-11-29 山东师范大学 Weakly supervised depth Hash social activity image search method and system based on semanteme migration
CN112214623A (en) * 2020-09-09 2021-01-12 鲁东大学 Image-text sample-oriented efficient supervised image embedding cross-media Hash retrieval method

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111597298A (en) * 2020-03-26 2020-08-28 浙江工业大学 Cross-modal retrieval method and device based on deep confrontation discrete hash learning
CN112199520B (en) * 2020-09-19 2022-07-22 复旦大学 Cross-modal Hash retrieval algorithm based on fine-grained similarity matrix
CN112214570A (en) * 2020-09-23 2021-01-12 浙江工业大学 Cross-modal retrieval method and device based on counterprojection learning hash

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109063112A (en) * 2018-07-30 2018-12-21 成都快眼科技有限公司 A kind of fast image retrieval method based on multi-task learning deep semantic Hash, model and model building method
CN109918528A (en) * 2019-01-14 2019-06-21 北京工商大学 A kind of compact Hash code learning method based on semanteme protection
CN109960737A (en) * 2019-03-15 2019-07-02 西安电子科技大学 Remote Sensing Images search method of the semi-supervised depth confrontation from coding Hash study
CN110110128A (en) * 2019-05-06 2019-08-09 西南大学 The discrete hashing image searching system of quickly supervision for distributed structure/architecture
CN110516095A (en) * 2019-08-12 2019-11-29 山东师范大学 Weakly supervised depth Hash social activity image search method and system based on semanteme migration
CN112214623A (en) * 2020-09-09 2021-01-12 鲁东大学 Image-text sample-oriented efficient supervised image embedding cross-media Hash retrieval method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
"Learning to Hash with Dimension Analysis based Quantizer for Image Retrieval";Yuan Cao et al.;《IEEE》;20201231;第1-12页 *
"适用于图像检索的强化对抗生成哈希方法";施鸿源 等;《小型微型计算机系统》;20210507;第42卷(第5期);第1039-1043页 *

Also Published As

Publication number Publication date
CN113191445A (en) 2021-07-30

Similar Documents

Publication Publication Date Title
CN107480261B (en) Fine-grained face image fast retrieval method based on deep learning
CN113190699B (en) Remote sensing image retrieval method and device based on category-level semantic hash
CN111291836B (en) Method for generating student network model
CN107122809B (en) Neural network feature learning method based on image self-coding
CN113191445B (en) Large-scale image retrieval method based on self-supervision countermeasure Hash algorithm
CN113657561B (en) Semi-supervised night image classification method based on multi-task decoupling learning
CN111898689A (en) Image classification method based on neural network architecture search
CN108984642A (en) A kind of PRINTED FABRIC image search method based on Hash coding
CN109960732B (en) Deep discrete hash cross-modal retrieval method and system based on robust supervision
CN112686376A (en) Node representation method based on timing diagram neural network and incremental learning method
CN114092747A (en) Small sample image classification method based on depth element metric model mutual learning
CN111008224A (en) Time sequence classification and retrieval method based on deep multitask representation learning
CN111783688B (en) Remote sensing image scene classification method based on convolutional neural network
Bai et al. Learning high-level image representation for image retrieval via multi-task dnn using clickthrough data
CN111079840B (en) Complete image semantic annotation method based on convolutional neural network and concept lattice
CN112699782A (en) Radar HRRP target identification method based on N2N and Bert
CN116977725A (en) Abnormal behavior identification method and device based on improved convolutional neural network
CN111507472A (en) Precision estimation parameter searching method based on importance pruning
CN116543250A (en) Model compression method based on class attention transmission
CN114168782B (en) Deep hash image retrieval method based on triplet network
CN112446432B (en) Handwriting picture classification method based on quantum self-learning self-training network
CN115131605A (en) Structure perception graph comparison learning method based on self-adaptive sub-graph
CN114387524A (en) Image identification method and system for small sample learning based on multilevel second-order representation
CN113887653A (en) Positioning method and system for tightly-coupled weak supervised learning based on ternary network
CN114170426A (en) Algorithm model for classifying rare tumor category small samples based on cost sensitivity

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant