CN112395438A

CN112395438A - Hash code generation method and system for multi-label image

Info

Publication number: CN112395438A
Application number: CN202011226768.2A
Authority: CN
Inventors: 刘渝; 汪洋涛; 谢延昭; 周可; 夏天; 冯树耀
Original assignee: Huazhong University of Science and Technology
Current assignee: Huazhong University of Science and Technology
Priority date: 2020-11-05
Filing date: 2020-11-05
Publication date: 2021-02-23

Abstract

The invention discloses a hash code generation method and system for multi-label images and belongs to the field of artificial intelligence image retrieval. The method first combines a convolutional neural network and a graph convolutional network to generate an image representation and a label co-occurrence embedding, respectively, then fuses the two modal vectors with MFB, and finally learns a hash model through a loss function based on the Cauchy distribution. Mutual dependencies among objects are explored through the co-occurrence probability of the objects in the label set, and attention-based multi-modal bilinear pooling combines the co-occurrence features with the image features. This improves the ability of the hash codes to measure the object-relation dependencies among data and thereby improves the performance of the hash codes. Using the co-occurrence relationship together with MFB not only improves the accuracy of the hash codes but also accelerates hash learning.

Description

Hash code generation method and system for multi-label image

Technical Field

The invention belongs to the field of artificial intelligence image retrieval, and particularly relates to a hash code generation method and system for a multi-label image.

Background

Similarity hash codes are widely used for large-scale image retrieval because they are cheap to store (compact binary codes) and fast to compare (exclusive-or). In classical image hashing, correctly identifying the objects in an image is an important factor in retrieval accuracy. In multi-label image retrieval, however, each image contains more objects, and identifying them correctly becomes more challenging.
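As a brief illustration of why exclusive-or comparison is cheap, the following minimal sketch packs each hash code into an integer and counts the differing bits. The function name and the two 8-bit codes are hypothetical and are not taken from the invention.

```python
def hamming_distance(code_a: int, code_b: int) -> int:
    """Number of differing bits between two hash codes packed into integers."""
    # XOR leaves a 1 exactly where the two codes disagree; counting those 1s
    # gives the Hamming distance.
    return bin(code_a ^ code_b).count("1")


# Two hypothetical 8-bit hash codes that differ in two bit positions.
a = 0b10110100
b = 0b10011100
print(hamming_distance(a, b))
```

A single XOR followed by a bit count replaces an element-by-element comparison of floating-point feature vectors, which is what makes large code libraries fast to scan.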

The existing solutions have the following problems:

1. The target dependencies are ambiguous: it is unclear how to construct an ideal topology and which dependencies it should express.

2. An end-to-end training approach that uses topology information cannot be established: it is very difficult to learn the images of a hashing task end to end while exploiting the topology information.

Disclosure of Invention

In view of the defects and improvement requirements of the prior art, the invention provides a hash code generation method and system for multi-label images. The aim is to replace DP with an improved MFB, use the MFB to fuse the image features with the multi-label correlation features obtained by a graph convolutional network, train an end-to-end hash model through a Cauchy loss function, and improve the accuracy of the hashing method by exploiting multi-label correlation information.

To achieve the above object, according to a first aspect of the present invention, there is provided a hash code generation method for a multi-label image, the method including the steps of:

S1, counting all labels in a multi-label image set, mapping each label to a label word vector to obtain the label word vector matrix corresponding to the multi-label image set, and calculating the co-occurrence probability between any two labels to obtain the label co-occurrence correlation matrix corresponding to the multi-label image set;

S2, extracting the image feature vector of each multi-label image in the multi-label image set with a convolutional neural network, and convolving the label word vector matrix and the label co-occurrence correlation matrix with a graph convolutional network to obtain the label co-occurrence embedded feature vectors corresponding to the multi-label image set;

S3, fusing each image feature vector with the label co-occurrence embedded feature vectors using attention-based multi-modal bilinear pooling to obtain the fused feature vector of each multi-label image;

S4, inputting the fused feature vector of each multi-label image into a hash activation layer to generate the corresponding hash code;

S5, calculating, based on the Cauchy distribution, the total loss value of all hash codes generated for the whole multi-label image set;

S6, adjusting the parameters of the convolutional neural network, the graph convolutional network, the attention-based multi-modal bilinear pooling and the hash activation layer according to the total loss value so as to minimize the total loss value;

S7, repeating steps S2 to S6 until the stop condition is met, and obtaining the trained hash code generation model for multi-label images and the hash code library of the multi-label image set.

Beneficial effects: by modeling the label correlation dependency in the form of conditional probabilities and extracting the label correlation information by graph convolution, the invention determines the label correlation features of the target objects. The hash function based on the improved Cauchy distribution overcomes the problem that the traditional sigmoid function concentrates similar samples poorly within a short Hamming distance, and therefore achieves a better effect. Because the correlation among labels is also taken into account, the identification of multiple targets is improved and a more accurate hash model is obtained.

Preferably, in step S1, the label dependency is modeled in the form of conditional probabilities, i.e.

$P(r_i \mid r_j) = T_{ij} / T_j$

wherein T_j denotes the number of occurrences of label r_j in the multi-label image set, and T_ij denotes the number of times the two objects r_i and r_j appear together.

Beneficial effects: modeling the label dependency in the form of conditional probabilities accurately reflects the correlation among the labels, and thus describes the label correlation accurately.

Preferably, in step S2, the convolutional neural network employs a pre-trained ResNet-101.

Beneficial effects: different pre-trained convolutional neural network models were compared for image feature extraction, and the chosen model strikes a balance between effect and training speed among the pre-trained models.

In step S3, for the i-th label, i = 1, 2, …, R, where R is the number of label word vectors in the label word vector matrix, the multi-modal bilinear model is as follows:

$z_i = \mathbb{1}^{T}\left(U_i^{T}x \circ V_i^{T}E\right)$

wherein z_i is the fused feature corresponding to the i-th label feature, x is the image feature vector, E is the label co-occurrence embedded feature vector, k is the latent dimension of the factorized matrices, U_i and V_i are the trainable parameters corresponding to the i-th label feature, 1 is the all-one vector of dimension k, ∘ is the Hadamard product, i.e. the element-wise multiplication of two vectors, and the function D(·) returns the dimension of its argument.

Beneficial effects: the Hadamard product and sum pooling, rather than DP, are used to increase the interaction between the vector elements of the different modalities, thereby improving accuracy. In addition, sum pooling reduces the overfitting and parameter explosion caused by the increased interaction, thereby speeding up model convergence.

Preferably, in step S4, a fully connected layer precedes the hash activation layer: the fused feature vector first enters the fully connected layer and then the hash activation layer, and the fully connected layer has the same number of nodes as the hash activation layer.

Beneficial effects: a fully connected layer is itself difficult to train, but a single fully connected hash layer has a small parameter capacity and struggles to learn complex transformations. Adding a fully connected layer in front of the hash layer enlarges the parameter capacity, whereas too many fully connected layers hinder training. Based on experiments, one fully connected layer followed by the activation layer was chosen, which balances effect against ease of training.

Preferably, in step S5, the total loss function

L＝λL_cce+(1-λ)L_cq

Cauchy cross entropy error

Cauchy quantization error

wherein λ is the weight of the Cauchy cross-entropy error, w_ij is the weight of the training sample pair (x_i, x_j, s_ij), s_ij is the similarity relationship between multi-label images x_i and x_j, s_ij = 1 indicates similarity, s_ij = 0 indicates dissimilarity, S is the set of similarity relationships, S_s = {s_ij ∈ S : s_ij = 1} is the set of similar pairs, S_d = {s_ij ∈ S : s_ij = 0} is the set of dissimilar pairs, |·| is the operator giving the number of elements of a set, h_i, h_j ∈ {-1, 1}^K denote the outputs of the fully connected hash layer for inputs x_i and x_j respectively, δ(h_i, h_j) is the Hamming distance between h_i and h_j, γ is the Cauchy distribution parameter, N is the number of multi-label images in the multi-label image set, and K is the hash code length.

Preferably, the hamming distance is calculated as follows:

Beneficial effects: the invention regularizes the Cauchy distance in the Cauchy loss function through the Hamming distance. For ease of computation, the original definition of the Hamming distance is not used; an approximate calculation is provided instead, which yields better model performance.

Preferably, the method is applied to the field of image multi-label retrieval.

Beneficial effects: by improving the existing hash generation method in several respects, based on the introduction of MFB and the use of an improved Cauchy loss, the invention obtains an excellent hash code generation scheme, and the generated hash codes achieve the best existing retrieval performance.

To achieve the above object, according to a second aspect of the present invention, there is provided a hash code generation system for a multi-label image, including: a computer-readable storage medium and a processor;

the computer-readable storage medium is used for storing executable instructions;

the processor is configured to read executable instructions stored in the computer-readable storage medium, and execute the hash code generation method for a multi-label image according to the first aspect.

Generally, by the above technical solution conceived by the present invention, the following beneficial effects can be obtained:

The method first combines a convolutional neural network (CNN) and a graph convolutional network (GCN) to generate an image representation and a label co-occurrence embedding, respectively, then fuses the two modal vectors with MFB, and finally learns a hash model through a loss function based on the Cauchy distribution. Mutual dependencies among objects are explored through the co-occurrence probability of the objects in the label set, and attention-based multi-modal bilinear pooling (MFB) combines the co-occurrence features with the image features, which improves the ability of the hash codes to measure the object-relation dependencies among data and thereby improves the performance of the hash codes. Extensive experiments on public datasets show that the method achieves state-of-the-art retrieval results, and that using the co-occurrence relationship together with MFB both improves the accuracy of the hash codes and accelerates hash learning.

Drawings

Fig. 1 is a flowchart of a hash code generation method for a multi-label image according to the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.

As shown in fig. 1, the present invention provides a hash code generation method for a multi-label image, including the following steps:

S1, counting all labels in a multi-label image set, mapping each label to a label word vector to obtain the label word vector matrix corresponding to the multi-label image set, and calculating the co-occurrence probability between any two labels to obtain the label co-occurrence correlation matrix corresponding to the multi-label image set.

Wherein T_j denotes the number of occurrences of label r_j in the multi-label image set, and T_ij (equal to T_ji) denotes the number of times the two objects appear together.
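The counting in step S1 can be sketched as follows. This is a minimal illustration that assumes the conditional probability takes the form P(r_i | r_j) = T_ij / T_j, which is consistent with the definitions of T_j and T_ij above; the function name, the toy dataset and the choice of a zero diagonal are assumptions made for the example.

```python
from itertools import combinations


def cooccurrence_matrix(image_labels, labels):
    """Return A where A[i][j] = T_ij / T_j, read as P(label i | label j)."""
    index = {name: pos for pos, name in enumerate(labels)}
    n = len(labels)
    occurrences = [0] * n                    # T_j: images containing label j
    together = [[0] * n for _ in range(n)]   # T_ij: images containing both (symmetric)
    for label_set in image_labels:
        ids = sorted({index[name] for name in label_set})
        for j in ids:
            occurrences[j] += 1
        for i, j in combinations(ids, 2):
            together[i][j] += 1
            together[j][i] += 1
    # The diagonal stays 0 here; how self-co-occurrence is treated is an assumption.
    return [
        [together[i][j] / occurrences[j] if occurrences[j] else 0.0
         for j in range(n)]
        for i in range(n)
    ]


# Hypothetical label sets for four images.
labels = ["person", "football", "court", "goal"]
dataset = [
    {"person", "football"},
    {"person", "football", "court"},
    {"person"},
    {"court", "goal"},
]
A = cooccurrence_matrix(dataset, labels)
for name, row in zip(labels, A):
    print(name, [round(value, 2) for value in row])
```

Although T_ij is symmetric, the resulting matrix is not: in this toy data every image with a football also shows a person, while only two of the three images with a person show a football.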

For example, the multi-label image shown in fig. 1 includes four labels {person, football, court, goal}, each of which is mapped to a label word vector, resulting in four label word vectors [00], [01], [10], [11].

In order to avoid the long-tail phenomenon caused by rare samples, the matrix A is binarized using a threshold τ:

wherein the resulting matrix is a binary correlation matrix, and q ∈ (0, 1).
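The thresholding step can be sketched as below. The binarization formula is not reproduced in this text, so the sketch assumes the usual rule in which entries below τ become 0 and all other entries become 1; whether the boundary case counts as 1 is likewise an assumption, and the function name and sample values are hypothetical.

```python
def binarize(matrix, tau):
    """Map entries below tau to 0 and all other entries to 1."""
    return [[1 if value >= tau else 0 for value in row] for row in matrix]


# Hypothetical co-occurrence probabilities for three labels.
A = [
    [0.0, 0.9, 0.1],
    [0.6, 0.0, 0.3],
    [0.05, 0.4, 0.0],
]
print(binarize(A, tau=0.4))
```

Weak co-occurrences, which are often produced by a handful of rare images, are dropped so that they do not act as noisy edges in the label graph.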

S2, extracting the image feature vector of each multi-label image in the multi-label image set with a convolutional neural network, and convolving the label word vector matrix and the label co-occurrence correlation matrix with a graph convolutional network to obtain the label co-occurrence embedded feature vectors corresponding to the multi-label image set.

In this embodiment, the target image data is processed by a pre-trained deep model to obtain a feature map of dimension 2048 × 14 × 14, and a global max pooling layer is then introduced to generate the image-level features

Wherein θ represents a parameter of CNN and

Feature extraction is performed by a graph convolution function F_gcn. The word description of each label in the label set is converted into a vector V(r), wherein V^c ∈ R^{R×D(V(r))} denotes the input of the c-th layer and D(V(r)) denotes the dimension of V(r). The relation input is the correlation matrix A ∈ R^{R×R}, and the updated node features are denoted V^{c+1} ∈ R^{R×D(V(r))′}. The propagation function of each GCN layer is described as:

wherein,

In this embodiment, two GCN layers are used, i.e. E(r) = V^{c+2}. Experiments show that the two-layer structure extracts the features adequately while keeping training fast.
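A two-layer label GCN can be sketched as follows. The propagation formula is not reproduced in this text, so the sketch assumes the widely used rule V^{c+1} = f(Â V^c W^c), where Â is a symmetrically normalized correlation matrix with self-loops and f is a LeakyReLU. The normalization, the activation, the absence of an activation after the last layer, and all names and values are assumptions made for illustration.

```python
def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def leaky_relu(matrix, slope=0.2):
    return [[v if v > 0 else slope * v for v in row] for row in matrix]


def normalize(adj):
    """Symmetric normalization D^-1/2 (A + I) D^-1/2 (assumed form)."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    degree = [sum(row) for row in a_hat]
    return [[a_hat[i][j] / (degree[i] ** 0.5 * degree[j] ** 0.5) for j in range(n)]
            for i in range(n)]


def gcn_layer(a_norm, features, weight, activate=True):
    """One propagation step: aggregate neighbours, then apply the layer weights."""
    out = matmul(matmul(a_norm, features), weight)
    return leaky_relu(out) if activate else out


# Hypothetical example: R = 3 labels with 2-dimensional word vectors.
A = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]        # binary correlation matrix
V = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]     # label word vector matrix
W0 = [[0.5, -0.5, 1.0], [1.0, 0.5, -1.0]]    # first layer weights (2 x 3)
W1 = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]    # second layer weights (3 x 2)

a_norm = normalize(A)
E = gcn_layer(a_norm, gcn_layer(a_norm, V, W0), W1, activate=False)
print(E)   # one embedded feature vector per label
```

Stacking two layers lets each label embedding absorb information from labels up to two hops away in the correlation graph while keeping the number of weight matrices small.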

S3, fusing each image feature vector with the label co-occurrence embedded feature vectors using attention-based multi-modal bilinear pooling to obtain the fused feature vector of each multi-label image.

Preferably, in step S3, for the features of the i-th object, the multi-modal bilinear model with two low-rank matrices is as follows:

$z_i = \mathbb{1}^{T}\left(U_i^{T}x \circ V_i^{T}E(r)\right)$

wherein x is the image feature vector, E(r) is the label co-occurrence embedded feature vector, k is the latent dimension of the factorized matrices, U_i and V_i are trainable parameters, 1 is the all-one vector of dimension k, ∘ is the Hadamard product, i.e. the element-wise multiplication of two vectors, and D(·) is the function that returns the dimension of its argument.

The two projections are each carried out by a parallel k-dimensional fully connected (fc) layer, and pooling is introduced after the multiplication to obtain:

wherein the sum function denotes summing and merging over one-dimensional non-overlapping windows of the specified size.
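The fusion step can be sketched in a few lines. The sketch assumes the common factorized bilinear layout in which both inputs are projected to the same length, multiplied element-wise, and then sum-pooled with a window equal to the latent dimension k. The window size, the matrix shapes, the omission of bias terms and all names are assumptions for illustration.

```python
def project(weight, vector):
    """Return weight^T · vector, where weight has len(vector) rows."""
    return [sum(w * v for w, v in zip(column, vector)) for column in zip(*weight)]


def sum_pool(vector, window):
    """Sum over one-dimensional non-overlapping windows of the given size."""
    if len(vector) % window:
        raise ValueError("vector length must be a multiple of the window size")
    return [sum(vector[s:s + window]) for s in range(0, len(vector), window)]


def mfb_fuse(x, e, u, v, k):
    """Fuse image feature x with label embedding e via Hadamard product and sum pooling."""
    joint = [a * b for a, b in zip(project(u, x), project(v, e))]  # Hadamard product
    return sum_pool(joint, k)


# Hypothetical sizes: 2-d image feature, 3-d label embedding, k = 2, output length 2.
x = [1.0, 2.0]
e = [0.5, -1.0, 2.0]
U = [[1, 0, 1, 0],
     [0, 1, 0, 1]]            # 2 x 4 projection for the image feature
V = [[2, 0, 0, 2],
     [0, 1, 0, 0],
     [0, 0, 1, 0]]            # 3 x 4 projection for the label embedding
print(mfb_fuse(x, e, U, V, k=2))   # [-1.0, 4.0]
```

Each output element is a sum of k products, so every pair of projected features interacts, while the pooled output stays k times shorter than the projected vectors.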

S4, inputting the fused feature vector of each multi-label image into a hash activation layer to generate the corresponding hash code.

The deep network is fitted through the loss function. Its last two layers are a fully connected layer and a fully connected hash layer; taking the obtained matrix Z as input, they yield the predicted hash codes and the final hash model.
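The last two layers can be sketched as follows. The text does not name the hash activation function, so the sketch assumes tanh as the relaxed activation used during training and the sign of the output as the final binary code. It also assumes no non-linearity between the two layers. The function names and weights are hypothetical.

```python
import math


def dense(vector, weight, bias):
    """Fully connected layer: weight^T · vector + bias."""
    return [sum(w * v for w, v in zip(column, vector)) + b
            for column, b in zip(zip(*weight), bias)]


def hash_layer(z, w_fc, b_fc, w_hash, b_hash):
    """Fully connected layer followed by a fully connected hash layer of equal width."""
    hidden = dense(z, w_fc, b_fc)
    relaxed = [math.tanh(a) for a in dense(hidden, w_hash, b_hash)]  # values in (-1, 1)
    code = [1 if a >= 0 else -1 for a in relaxed]                    # binary hash code
    return relaxed, code


# Hypothetical fused feature and identity weights, so the effect of tanh is visible.
z = [0.5, -2.0]
identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [0.0, 0.0]
relaxed, code = hash_layer(z, identity, zeros, identity, zeros)
print(relaxed, code)
```

The relaxed output keeps the layer differentiable during training, and the binary code is what would be stored in the hash code library.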

S5, calculating, based on the Cauchy distribution, the total loss value of all hash codes generated for the whole multi-label image set.

Preferably, in step S5, the total loss function

L＝λL_cce+(1-λ)L_cq

Cauchy cross entropy error

Cauchy quantization error

wherein λ is the weight of the Cauchy cross-entropy error, w_ij is the weight of the training sample pair (x_i, x_j, s_ij), s_ij is the similarity relationship between x_i and x_j, s_ij = 1 indicates similarity, s_ij = 0 indicates dissimilarity, S is the set of similarity relationships, S_s = {s_ij ∈ S : s_ij = 1} is the set of similar pairs, S_d = {s_ij ∈ S : s_ij = 0} is the set of dissimilar pairs, |·| is the operator giving the number of elements of a set, h_i, h_j ∈ {-1, 1}^K denote the outputs of the fully connected hash layer for inputs x_i and x_j respectively, δ(h_i, h_j) is the Hamming distance between h_i and h_j, γ is the Cauchy distribution parameter, N is the input size, and K is the hash code length.
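The loss can be sketched as below. The formulas for the two error terms and for the approximate Hamming distance are not reproduced in this text, so the sketch assumes the pairwise form commonly used in Cauchy-based deep hashing: a cosine-based relaxation δ = K/2 · (1 − cos(h_i, h_j)), a cross-entropy term w_ij (s_ij log(δ/γ) + log(1 + γ/δ)), a quantization term log(1 + δ(|h_i|, 1)/γ), and pair weights |S|/|S_s| or |S|/|S_d|. Only the combination L = λL_cce + (1 − λ)L_cq is taken from the text; all other choices, names and values are assumptions.

```python
import math


def approx_hamming(h_i, h_j):
    """Cosine-based relaxation of the Hamming distance: K/2 * (1 - cos(h_i, h_j))."""
    dot = sum(a * b for a, b in zip(h_i, h_j))
    norm = math.sqrt(sum(a * a for a in h_i)) * math.sqrt(sum(b * b for b in h_j))
    return len(h_i) / 2 * (1 - dot / norm)


def cauchy_loss(codes, pairs, gamma=1.0, lam=0.5, eps=1e-6):
    """Total loss lam * L_cce + (1 - lam) * L_cq under the assumptions stated above."""
    n_sim = sum(1 for _, _, s in pairs if s == 1)
    n_dis = len(pairs) - n_sim
    cce = 0.0
    for i, j, s in pairs:
        # Re-weight so that the rarer kind of pair is not drowned out.
        weight = len(pairs) / (n_sim if s == 1 else n_dis)
        d = approx_hamming(codes[i], codes[j]) + eps   # eps avoids log(0)
        cce += weight * (s * math.log(d / gamma) + math.log(1 + gamma / d))
    ones = [1.0] * len(codes[0])
    # Quantization term: pushes every component of h towards -1 or +1.
    cq = sum(math.log(1 + approx_hamming([abs(a) for a in h], ones) / gamma)
             for h in codes)
    return lam * cce + (1 - lam) * cq


# Hypothetical relaxed codes: images 0 and 1 are close, image 2 is the opposite of 0.
codes = [[0.9, 0.8, -0.9], [0.8, 0.9, -0.7], [-0.9, -0.8, 0.9]]
consistent = [(0, 1, 1), (0, 2, 0)]     # labels agree with the code geometry
inconsistent = [(0, 1, 0), (0, 2, 1)]   # labels contradict the code geometry
print(cauchy_loss(codes, consistent), cauchy_loss(codes, inconsistent))
```

For codes whose components are exactly −1 or +1, the relaxation coincides with the true Hamming distance, which is why it can stand in for the non-differentiable bit count during training.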

Preferably, the hamming distance is calculated as follows:

S6, adjusting the parameters of the convolutional neural network, the graph convolutional network, the attention-based multi-modal bilinear pooling and the hash activation layer according to the total loss value so as to minimize the total loss value.

In this embodiment, the parameters of each module are adjusted by gradient descent so that the total loss value is minimized.

S7, repeating steps S2 to S6 until the stop condition is met, and obtaining the trained hash code generation model for multi-label images and the hash code library of the multi-label image set.

Preferably, in step S7, the stop condition is that the Hamming distance is smaller than a set threshold; in this embodiment, the set threshold is 2. Reaching a specified number of iterations may also be used as the stop condition.

Preferably, the method is applied to image multi-label retrieval.

For example, when the method is applied to cloud photo album retrieval, the multi-label image set contains all photos of the user, and each photo may carry several labels, which are the types of objects contained in the image, such as people, dogs and tables. A picture to be queried is input, and its feature vector is extracted by the convolutional neural network. The feature vector and the label co-occurrence embedded feature vector of the multi-label image set are then fed into the MFB, and the hash code is obtained after the hash activation layer. This hash code is compared with the hash codes in the hash code library, and the approximate pictures whose similarity lies within a set threshold range are returned as the retrieval result. The similar pictures in the user's cloud photo album can be used to determine the user's photographing preferences, or to determine whether the album contains near-duplicate pictures, so that these can be deleted to save cloud storage space.

As another example, when the method is applied to product image retrieval, the multi-label image set contains all product images in the product database, and each image may carry several labels, which are the types of objects contained in the image, such as a bag, a car or a computer of a certain brand. A product picture to be queried is input, and its feature vector is extracted by the convolutional neural network. The feature vector and the label co-occurrence embedded feature vector of the multi-label image set are then fed into the MFB, and the hash code is obtained after the hash activation layer. This hash code is compared with the hash codes in the hash code library, and the approximate pictures whose similarity lies within a set threshold range are returned as the retrieval result. Each returned picture corresponds to a product, so that products can be retrieved by picture.
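The comparison against the hash code library in both examples can be sketched as a threshold search on Hamming distance. The library contents, the image identifiers and the threshold value are hypothetical, and a real system would generate the query code with the trained model rather than write it by hand.

```python
def hamming(code_a, code_b):
    """Number of positions at which two {-1, 1} hash codes differ."""
    return sum(1 for a, b in zip(code_a, code_b) if a != b)


def retrieve(query_code, code_library, max_distance):
    """Return ids of library images within max_distance of the query, nearest first."""
    hits = [(hamming(query_code, code), image_id)
            for image_id, code in code_library.items()]
    return [image_id for distance, image_id in sorted(hits) if distance <= max_distance]


# Hypothetical hash code library with 4-bit codes.
library = {
    "img_a": [1, 1, -1, -1],
    "img_b": [1, -1, -1, -1],
    "img_c": [-1, -1, 1, 1],
}
query = [1, 1, -1, -1]
print(retrieve(query, library, max_distance=1))   # ['img_a', 'img_b']
```

A tight threshold returns near-duplicates, which suits album de-duplication, while a looser threshold returns visually related items, which suits product search.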

The invention provides a hash code generation system of a multi-label image, which comprises: a computer-readable storage medium and a processor;

the processor is used for reading the executable instructions stored in the computer-readable storage medium and executing the hash code generation method of the multi-label image.

It will be understood by those skilled in the art that the foregoing is only a preferred embodiment of the present invention, and is not intended to limit the invention, and that any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims

1. A hash code generation method of a multi-label image is characterized by comprising the following steps:

2. The method of claim 1, wherein in step S1, the label dependency is modeled in the form of conditional probabilities

3. The method of claim 1 or claim 2, wherein in step S2, the convolutional neural network employs a pre-trained ResNet-101.

4. The method according to any one of claims 1 to 3, wherein in step S3, for the i-th label, i = 1, 2, …, R, where R is the number of label word vectors in the label word vector matrix, the multi-modal bilinear model is as follows:

wherein z_i is the fused feature corresponding to the i-th label feature,

is the all-one vector of dimension k, ∘ is the Hadamard product, i.e. the element-wise multiplication of two vectors, and the function D(·) returns the dimension of its argument.

5. The method according to any one of claims 1 to 4, wherein in step S4, the hash activation layer is preceded by a fully connected layer, the fused feature vector first enters the fully connected layer and then the hash activation layer, and the fully connected layer and the hash activation layer have the same number of nodes.

6. The method of claim 5, wherein in step S5, the total loss function

L＝λL_cce+(1-λ)L_cq

Cauchy cross entropy error

Cauchy quantization error

wherein λ is the weight of the Cauchy cross-entropy error, w_ij is the weight of the training sample pair (x_i, x_j, s_ij), s_ij is the similarity relationship between multi-label images x_i and x_j, s_ij = 1 indicates similarity, s_ij = 0 indicates dissimilarity, S is the set of similarity relationships, S_s = {s_ij ∈ S : s_ij = 1} is the set of similar pairs, S_d = {s_ij ∈ S : s_ij = 0} is the set of dissimilar pairs, |·| is the operator giving the number of elements of a set, h_i, h_j ∈ {-1, 1}^K denote the outputs of the fully connected hash layer for inputs x_i and x_j respectively, δ(h_i, h_j) is the Hamming distance between h_i and h_j, γ is the Cauchy distribution parameter, N is the number of multi-label images in the multi-label image set, and K is the hash code length.

7. The method according to any one of claims 1 to 6, wherein in step S7, a Hamming distance less than a set threshold is used as the stop condition.

8. The method of claim 6, wherein the hamming distance is calculated as follows:

9. The method of any one of claims 1 to 8, applied to image multi-label retrieval or image multi-label classification.

10. A hash code generation system for a multi-label image, comprising: a computer-readable storage medium and a processor;

the processor is configured to read executable instructions stored in the computer-readable storage medium, and execute the hash code generation method of the multi-label image according to any one of claims 1 to 9.