CN109472282B - Depth image hashing method based on few training samples - Google Patents

Depth image hashing method based on few training samples

Info

Publication number
CN109472282B
CN109472282B (application number CN201811053140.XA)
Authority
CN
China
Prior art keywords
hash
samples
training
network
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201811053140.XA
Other languages
Chinese (zh)
Other versions
CN109472282A (en)
Inventor
耿立冰
潘炎
印鉴
赖韩江
潘文杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sun Yat Sen University
Guangzhou Zhongda Nansha Technology Innovation Industrial Park Co Ltd
Original Assignee
Sun Yat Sen University
Guangzhou Zhongda Nansha Technology Innovation Industrial Park Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sun Yat Sen University, Guangzhou Zhongda Nansha Technology Innovation Industrial Park Co Ltd filed Critical Sun Yat Sen University
Priority to CN201811053140.XA priority Critical patent/CN109472282B/en
Publication of CN109472282A publication Critical patent/CN109472282A/en
Application granted granted Critical
Publication of CN109472282B publication Critical patent/CN109472282B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/044: Recurrent networks, e.g. Hopfield networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods
    • G06N3/084: Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a deep image hashing method based on few training samples. Existing traditional hashing methods and deep-learning-based hashing methods both presuppose a large number of training samples, yet in a real production environment the cost of obtaining large numbers of labeled training samples is very high. A method that can obtain a reasonably effective image hash model from few training samples therefore has great practical value.

Description

Depth image hashing method based on few training samples
Technical Field
The invention relates to the field of image retrieval and computer vision, in particular to a depth image hashing method based on few training samples.
Background
In recent years, with the rapid development of big data and information technology, the amount of image data generated every day is beyond estimation, and how to find the images a user wants among this vast volume of data has become important. Information retrieval technology has likewise developed and been applied greatly, and image hashing is one of the more important techniques in the information retrieval field.
From the implementation point of view, image hashing techniques can be divided into traditional image hashing and deep-learning-based image hashing (deep hashing). With the rapid development of deep learning in recent years, deep hashing has become the leading image hashing approach. A deep hash model has strong representational power, but it also needs a large number of training samples to learn the whole deep neural network. In a real-world environment, however, a large number of training samples is often hard to obtain, which raises the question: how can a relatively good hash model be designed when the training samples for some classes are few? This is the problem the present invention addresses; accordingly, a deep hashing method that learns few new samples from existing prior knowledge (few-shot hashing) is provided.
Disclosure of Invention
The invention provides a depth image hashing method based on few training samples that can obtain a relatively effective image hash model.
In order to achieve the technical effects, the technical scheme of the invention is as follows:
a depth image hashing method based on few training samples comprises the following steps:
s1: task definition and data division;
s2: constructing a triplet-based universal deep hash model;
s3: constructing a support memory based on a universal deep hash model;
s4: learning a feature representation of few samples through a bidirectional long-short term memory subnetwork and supporting memory;
s5: training the depth image Hash model under few samples, and performing retrieval test on the test set of few samples.
Further, the specific process of step S1 is:
s11: taking the cifar100 dataset as an example, a specific definition of few-shot hashing is given. Dividing the cifar100 into 2 parts, wherein the first part comprises 80 classes, each class comprises 500 training pictures which are sufficient and are marked as S (support set); the other part has 20 classes, each with only a small number of 3 (or 5, 10) training samples, and this part is denoted as l (learning set). The goal of this is to train a deep hash model so that pictures belonging to these 20 classes can be retrieved relatively efficiently across a class 100 image database.
Further, the specific process of step S2 is:
s21: for the task of depth image hashing, a feature learning sub-network, namely a deep convolutional network (CNN), needs to be constructed first. The convolution network is formed by stacking a convolution layer, an activation layer and a pooling layer and has strong characteristic expression capability;
s22: after passing through a convolution sub-network, each picture is converted into a semantic feature vector, and then a full connection layer with the output neuron number q and a corresponding sigmod activation function layer are added behind the feature vector. Thus, each image is converted into a q-dimensional real number vector ranging from 0 to 1, namely a Hash vector;
s23: after the hash vectors are obtained, constraint is carried out through a triple loss function (triple ranking loss), and the purpose of the triple loss function is that the distance between the approximate hash vectors of similar pictures is far smaller than the distance between the hash vectors of dissimilar pictures through learning;
s24: and training the triplet-based universal deep hash network to obtain a universal deep hash model.
Further, the specific process of step S3 is as follows:
s31: from the previous task definition, there are 2 parts of the data set, one part is S (support set), and the other part is l (learning set) which is concerned and has few training samples, and each class of S has enough training samples and can correspond to what has been seen or learned; there are few training samples in L, corresponding to newly seen things;
s32: and (4) carrying out feature extraction on the sample of S by using the trained triplet-based universal deep hash model. The method specifically comprises the following steps: sequentially inputting samples I [ I ] [ j ] (I is more than or equal to 1 and less than or equal to S, j is more than or equal to 1 and less than or equal to n, S is the number of the types of S, and n is the number of the samples of each type) into a deep hash model to obtain the semantic dimensional characteristics of each picture;
s33: arranging all the characteristics into M [ i ] [ j ], specifically, each row i is the same, representing that the characteristic vector of the row belongs to the same class, different columns represent the jth sample characteristic vector of the class, and M is a support memory (support memory);
further, the specific process of step S4 is as follows:
s41: in each iteration, the support pops up a feature vector for each class of features according to a specified sequence, and the feature vector is recorded as ft, wherein t is more than or equal to 1 and is less than or equal to s.
S42: the forward and backward unrolling of the bidirectional long short term memory subnetwork (BLSTM) is s time steps.
S43: let flTime-invariant (static) input x as a bidirectional long-short term memory subnetworkstaticLet ftInput x as time-varying (time-varying) for bidirectional long-short term memory subnetworkst
S44: through the interaction of the bidirectional long-short term memory sub-network and the support memory, the final feature representation of few new samples is obtained
S45: the new feature representation is constrained with a triplet loss function.
Further, the specific process of step S5 is as follows:
s51: and training the whole network by using a random gradient descent method.
S52: and searching the test set of the L in the whole image database, and calculating a test result.
Compared with the prior art, the technical scheme of the invention has the beneficial effects that:
the invention is designed on the premise that the existing traditional Hash method and the Hash method based on deep learning are both designed on the premise of a large number of training samples, and the cost for obtaining a large number of marked training samples is very high in a real production environment, so that under the condition of few training samples, if an image Hash model with relatively good effect can be obtained, the invention has very great practical value.
Drawings
FIG. 1 is a schematic diagram of the triplet-based universal deep hash network;
FIG. 2 is a diagram of the overall network architecture of the present invention;
FIG. 3 is a diagram of a network architecture of a bidirectional long-short term memory subnetwork;
FIG. 4 is the result of an NDCG experiment on the SUN dataset;
FIG. 5 is the result of NDCG experiment on CIFAR-10 dataset;
FIG. 6 shows the results of NDCG experiments on the CIFAR-100 dataset.
Detailed Description
The drawings are for illustrative purposes only and are not to be construed as limiting the patent;
for the purpose of better illustrating the embodiments, certain features of the drawings may be omitted, enlarged or reduced, and do not represent the size of an actual product;
it will be understood by those skilled in the art that certain well-known structures in the drawings and descriptions thereof may be omitted.
The technical solution of the present invention is further described below with reference to the accompanying drawings and examples.
Example 1
1. Task definition and data partitioning
When a new training sample arrives, a deep neural network updates the whole network end to end; if the new training samples are few, overfitting inevitably occurs and the result is poor. Humans, however, tend to associate a new thing with things seen before: a child seeing a tiger for the first time may search his memory and find that it resembles the cat he is already familiar with, so that after seeing a tiger only once he can remember what a tiger looks like.
Inspired by this and applying it to the image hashing problem: if the training pictures of some things are very few, for example only 3 or 5 pictures per class, it is possible to learn "prior knowledge" (prior knowledge) or a "support memory" (support memory) from a large number of samples of other things, and then learn the new samples from that prior knowledge. This is called few-shot hashing.
A specific definition of few-shot hashing is given below using the cifar100 dataset as an example. Divide cifar100 into 2 parts: the first part contains 80 classes, each with a sufficient 500 training pictures, and is denoted S (support set); the other part contains 20 classes, each with only 3 (or 5, or 10) training samples, and is denoted L (learning set). The goal is to train a deep hash model so that pictures belonging to these 20 classes can be retrieved relatively effectively across the whole 100-class image database.
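For concreteness, this partition can be sketched as follows (a minimal Python illustration with hypothetical names; the patent does not prescribe an implementation):

```python
import random
from collections import defaultdict

def split_few_shot(samples, support_classes, k_shot=3, seed=0):
    """Split (image, label) pairs into the support set S (all samples of the
    support classes) and the learning set L (k_shot samples per other class)."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for image, label in samples:
        by_class[label].append(image)

    S, L = {}, {}
    for label, images in by_class.items():
        if label in support_classes:
            S[label] = images               # e.g. 500 pictures per class
        else:
            rng.shuffle(images)
            L[label] = images[:k_shot]      # only 3 (or 5, 10) pictures
    return S, L

# e.g. for cifar100: S, L = split_few_shot(pairs, set(range(80)), k_shot=3)
```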
2. Construction of triple-based universal deep hash model
The deep hash model is widely applied in the field of image retrieval, for example in "search by image" and "find similar items" features such as Taobao's. Moreover, the deep hash model is the basic building block of the few-shot hashing model, so it is explained first. As shown in FIG. 1, it is divided into three main parts: a feature learning subnetwork, a hash code generation subnetwork, and a loss function.
a) Feature learning subnetwork
In the image field, a deep convolutional network (CNN) is formed by stacking convolutional layers, activation layers and pooling layers and has strong feature expression capability; AlexNet, GoogLeNet, VGG, ResNet and the like are common examples. A convolutional network converts an image into a feature vector: for example, the 1024-dimensional vector of the last pooling layer of GoogLeNet or the 4096-dimensional vector of the last fully-connected layer of VGG can serve as the feature representation of the image, and such deep features are far better than traditional hand-crafted features such as GIST and SIFT. GoogLeNet is used in the examples below.
b) Hash code generation subnetwork
After the CNN, each picture has been transformed into a 1024-dimensional feature vector, and the final goal is a 0/1 hash code of a specific length, such as a 12-bit hash code. The most intuitive and common practice is therefore to add a fully-connected layer with 12 output neurons after the 1024-dimensional feature vector, followed by a sigmoid activation function. Thus each image is converted into a 12-dimensional real vector with entries between 0 and 1, called the approximate hash code vector.
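A minimal PyTorch-style sketch of such a hash-generation head (the 12-bit setting is just the example above; the class name is my own):

```python
import torch
import torch.nn as nn

class HashHead(nn.Module):
    """Maps a CNN feature vector to an approximate hash code in (0, 1)^q."""
    def __init__(self, feature_dim=1024, hash_bits=12):
        super().__init__()
        self.fc = nn.Linear(feature_dim, hash_bits)  # 12 output neurons

    def forward(self, features):
        # sigmoid squashes each neuron into (0, 1): the approximate hash vector
        return torch.sigmoid(self.fc(features))

# usage: codes = HashHead()(torch.randn(8, 1024))  # shape (8, 12), values in (0, 1)
```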
c) Triplet loss function (triplet ranking loss)
The loss functions of deep hash models are various and can be roughly divided into two main classes: pair-based and triplet-based. The triplet ranking loss used herein is described in detail below.
The input to the triplet ranking loss is image triplets. Given an image dataset in which I denotes a sample and sim denotes the similarity between 2 images, if sim(I, I+) > sim(I, I−), then (I, I+, I−) is called a triplet. For example, in a single-label image dataset: if image a and image b belong to the same class and image c does not, then (a, b, c) is a triplet.
The purpose of the triplet ranking loss is that, through learning, the distance between the hash vectors of I and I+ becomes far smaller than the distance between the hash vectors of I and I−. Its mathematical definition is as follows:
l_tri(v(I), v(I+), v(I−)) = max(0, m + ||v(I) − v(I+)|| − ||v(I) − v(I−)||)    (1)
s.t. v(I), v(I+), v(I−) ∈ [0, 1]^q
where v(I) denotes the approximate hash code vector and m denotes the distance parameter (margin). As equation (1) shows, when the distance between I and I− is smaller than the distance between I and I+ plus the margin, the loss is positive, and minimizing it pushes I and I− further apart while pulling I and I+ closer; when the distance between I and I− is greater than the distance between I and I+ plus the margin, the loss is zero, indicating that this triplet has been learned.
After the deep hash model is trained, a user submits a picture; the picture is turned into an approximate hash vector by the deep hash model, and after quantization (each bit of the approximate hash vector becomes 1 if it is greater than or equal to 0.5, otherwise 0) the vector becomes a binary hash code. The Hamming distance between this hash code and the hash codes of all images in the database is computed, all Hamming distances are sorted from small to large, and the top-k results can then be quickly returned and presented to the user.
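As a sketch of this quantization-and-retrieval step (a minimal NumPy illustration with hypothetical names, not the patent's own code):

```python
import numpy as np

def quantize(approx_codes):
    """Binarize approximate hash vectors: a bit becomes 1 if >= 0.5, else 0."""
    return (np.asarray(approx_codes) >= 0.5).astype(np.uint8)

def hamming_top_k(query_code, db_codes, k=10):
    """Indices of the k database codes closest to the query in Hamming distance."""
    distances = (db_codes != query_code).sum(axis=1)  # per-row Hamming distance
    return np.argsort(distances, kind="stable")[:k]   # sorted small to large

# usage: binary = quantize(model_output); top = hamming_top_k(binary, db_binary)
```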
3. Construction of support memory based on universal deep hash model
From the previous task definition, the data set has 2 parts: one is S (support set) and the other is L (learning set), which is the part of interest and also the part with few training samples. Each class in S has sufficient training samples and corresponds to things a child has already seen or learned; the training samples in L are rare, corresponding to newly seen things. First, the prior knowledge, i.e. the support memory, is constructed from S.
A triplet-based deep hash network is trained with all the data of S, as shown in FIG. 1. Because the training samples of S are sufficient, the parameters of the deep hash network can be learned well, yielding an effective hash model denoted the Support Hashing Model (SHM).
Then the SHM is used to construct the "support memory". Specifically: the samples I[i][j] (1 ≤ i ≤ s, 1 ≤ j ≤ n, where s is the number of classes of S, e.g. 80, and n is the number of samples per class, e.g. 500) are sequentially input into the SHM to obtain the 1024-dimensional feature (the last pooling layer) of each picture, as shown in FIG. 2. All the features are arranged as M[i][j]: within each row i the feature vectors belong to the same class, different columns hold the j-th sample feature vector of that class, and M is the support memory.
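The arrangement of M can be illustrated with a short sketch (assuming a hypothetical shm_extract function that returns the 1024-dimensional SHM feature of an image):

```python
import numpy as np

def build_support_memory(shm_extract, support_set, n_per_class):
    """Arrange SHM features of S into M[i][j]: row i is a class, column j is
    the j-th sample of that class. support_set maps class label -> images."""
    classes = sorted(support_set)  # the s support classes
    M = np.zeros((len(classes), n_per_class, 1024), dtype=np.float32)
    for i, label in enumerate(classes):
        for j, image in enumerate(support_set[label][:n_per_class]):
            M[i, j] = shm_extract(image)  # feature of sample I[i][j]
    return M  # the support memory, shape (s, n, 1024)
```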
4. Learning feature representations of few samples through bidirectional long-short term memory subnetworks and supporting memory
This section is the core of few-shot hashing and will describe how to learn few new samples from the support memory.
First, the overall network structure of few-shot hashing is given. The main difference is that a support memory and a bidirectional long-short term memory sub-network are added on top of the triplet-based deep hash network, as shown in FIG. 2.
as can be seen from FIG. 2, during training, each new sample I is subjected to feature extraction through a convolution sub-network, and the feature is flIt is worth noting here that the parameters of the convolution sub-network are shared with the SHM and that part of the parameters are not updated, i.e., the trained SHM is used to act as a feature extractor for new samples in addition to being used to construct the support memory.
After feature extraction, a bidirectional long-short term memory network (BLSTM) is designed to carry out the interaction and learning between a new sample and the support memory, as shown in FIG. 3.
Specifically, in each iteration of the training phase, M pops out one feature vector for each class of features in a specified order, denoted f_t (1 ≤ t ≤ s); meanwhile, the forward and backward unrolling of the BLSTM is s time steps. Then, as shown in FIG. 3, f_l serves as the non-time-varying (static) input x_static of the BLSTM and f_t serves as its time-varying input x_t, in mathematical form:
x′_t = concat(x_t, x_static)    (2)
where the concat function is the concatenation of feature vectors, e.g. concatenating the two 1024-dimensional vectors f_t and f_l into the 2048-dimensional vector x′_t. The hidden size of the BLSTM is set to 1024 (consistent with the original feature dimension); over the s time steps of the BLSTM, each forward LSTM cell outputs a 1024-dimensional hf_t and each backward LSTM cell outputs a 1024-dimensional hb_t according to equation (3):
hf_t = LSTM_f(hf_{t−1}, x′_{t−1}), 1 < t ≤ s    (3)
hb_t = LSTM_b(hb_{t+1}, x′_{t+1}), 1 ≤ t < s
The new feature l_new of the new sample can then be expressed as:
l_sum = eltwise_sum(hf_s, hb_1)    (4)
l_new = eltwise_product(l_sum, 0.5)
where eltwise_sum is element-wise addition of vectors and eltwise_product is element-wise multiplication; intuitively, equation (4) adds hf_s and hb_1 and then takes their average as the new feature representation.
Thus, after interacting with the support memory through the BLSTM sub-network, each new sample obtains a new 1024-dimensional feature representation.
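The interaction just described can be sketched as follows (a minimal PyTorch illustration with names of my own; it uses the standard LSTMCell recurrence, which consumes x′_t at step t, a slight simplification of the index convention printed in equation (3)):

```python
import torch
import torch.nn as nn

def few_shot_feature(f_l, popped_features, lstm_f, lstm_b):
    """Compute l_new for one new sample.
    f_l:             (1024,) static feature of the new sample (from the SHM).
    popped_features: (s, 1024) one feature f_t popped from M per support class.
    lstm_f, lstm_b:  nn.LSTMCell(2048, 1024) forward / backward cells."""
    s, dim = popped_features.shape
    # x'_t = concat(x_t, x_static)
    x = torch.cat([popped_features, f_l.expand(s, dim)], dim=1)

    hf = cf = hb = cb = torch.zeros(1, 1024)
    for t in range(s):                    # forward unrolling over s steps
        hf, cf = lstm_f(x[t:t + 1], (hf, cf))
    for t in reversed(range(s)):          # backward unrolling
        hb, cb = lstm_b(x[t:t + 1], (hb, cb))

    # equation (4): average hf_s and hb_1 as the new representation
    return 0.5 * (hf + hb)                # l_new, shape (1, 1024)
```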
After the new feature representation is obtained, model training can be performed through the same hash generation sub-network and triplet ranking loss, as shown in FIG. 2.
5. Results of the experiment
1) Data set
SUN: 64 classes of pictures, 430 samples per class, 27,520 pictures in total. SUN is divided into 2 parts: the first part, S, contains all samples of 54 classes, 23,200 pictures in total. The remaining 10 classes form the second part, L, of newly learned samples, with only 3, 5 or 10 training samples per class (the three few-shot settings are referred to herein as 3-shot, 5-shot and 10-shot). All samples of S and L, except the test samples of L, make up the retrieval database.
CIFAR-10: 10 classes, 6,000 samples per class, 60,000 pictures in total. The first part S of CIFAR-10 contains the 48,000 samples of the first 8 classes; the remaining 2 classes form L. As before, the number of training samples per class of L has 3 settings: 3-shot, 5-shot and 10-shot. All samples of S and L, except the test samples of L, make up the retrieval database.
CIFAR-100 is similar to CIFAR-10, except that it contains 100 classes with 600 samples per class. The first 80 classes form S and the last 20 classes form L. As before, L has only 3, 5 or 10 training samples per class.
2) Evaluation index
The most common metrics in the information retrieval field, Mean Average Precision (MAP) and Normalized Discounted Cumulative Gain (NDCG), are selected as evaluation indexes. The larger the MAP and NDCG, the better the retrieval effect.
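For reference, minimal sketches of the two metrics (helper names are my own; rel holds the relevance of the retrieved images in ranked order):

```python
import numpy as np

def average_precision(rel):
    """AP for one query; rel is a 0/1 array in ranked order. MAP is its mean."""
    rel = np.asarray(rel, dtype=float)
    if rel.sum() == 0:
        return 0.0
    precision_at_k = np.cumsum(rel) / (np.arange(len(rel)) + 1)
    return float((precision_at_k * rel).sum() / rel.sum())

def ndcg(rel, k=None):
    """NDCG@k for one query; rel holds graded relevance in ranked order."""
    rel = np.asarray(rel, dtype=float)[:k]
    discounts = 1.0 / np.log2(np.arange(len(rel)) + 2)
    dcg = ((2.0 ** rel - 1) * discounts).sum()
    ideal = ((2.0 ** np.sort(rel)[::-1] - 1) * discounts).sum()
    return float(dcg / ideal) if ideal > 0 else 0.0
```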
3) Comparative test
The following are comparative tests on 3 data sets:
table 1: MAP experimental results on SUN dataset
Figure BDA0001795088980000081
Table 2: MAP experimental results on CIFAR-10 dataset
Figure BDA0001795088980000082
Table 3: MAP experimental results on CIFAR-100 dataset
Figure BDA0001795088980000083
The results show that the invention is a great improvement over prior methods: by partitioning the data, constructing a support memory (prior knowledge) from a large number of samples, and reasonably using the bidirectional long-short term memory sub-network together with the support memory, it learns the feature representation of the few new samples. The overall network structure of the invention is shown in FIG. 2.
The same or similar reference numerals correspond to the same or similar parts;
the positional relationships depicted in the drawings are for illustrative purposes only and are not to be construed as limiting the present patent;
it should be understood that the above-described embodiments of the present invention are merely examples for clearly illustrating the present invention, and are not intended to limit the embodiments of the present invention. Other variations and modifications will be apparent to persons skilled in the art in light of the above description. And are neither required nor exhaustive of all embodiments. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the claims of the present invention.

Claims (3)

1. A depth image hashing method based on few training samples, characterized by comprising the following steps:
s1: task definition and data division;
s2: constructing a triplet-based universal deep hash model;
s3: constructing a support memory based on a universal deep hash model;
s4: learning a feature representation of few samples through a bidirectional long-short term memory subnetwork and the support memory;
s5: training a depth image hash model under few samples, and performing retrieval test on a test set of few samples;
the specific process of step S1 is:
dividing the cifar100 dataset, taken as an example, into 2 parts, wherein the first part contains 80 classes, each class containing a sufficient 500 training pictures, and is denoted S; the other part has 20 classes, each class having only 3, 5 or 10 training samples, and is denoted L; the goal is to train a deep hash model so that pictures belonging to these 20 classes can be retrieved relatively effectively across the whole 100-class image database;
the specific process of step S2 is:
s21: for the task of depth image hashing, a feature learning sub-network, namely a deep convolutional network, is first constructed, wherein the convolutional network is formed by stacking convolutional layers, activation layers and pooling layers and has strong feature expression capability;
s22: after passing through the convolution sub-network, each picture is converted into a semantic feature vector; a fully connected layer with q output neurons and a corresponding sigmoid activation function layer are then added after the feature vector, and each image is converted into a q-dimensional real vector with entries in the range 0 to 1, namely the hash vector;
s23: after the hash vectors are obtained, they are constrained by a triplet loss function (triplet ranking loss), wherein the triplet loss function aims to make, through learning, the distance between the approximate hash vectors of similar pictures far smaller than the distance between the hash vectors of dissimilar pictures;
s24: training a triplet-based universal deep hash network to obtain a universal deep hash model;
the specific process of step S3 is:
s31: from the previous task definition, there are 2 parts of the data set, one part is S, and the other part is L, and each class in S has sufficient training samples, which can correspond to what has been seen or learned; there are few training samples in L, corresponding to newly seen things;
s32: carrying out feature extraction on the samples in S with the trained triplet-based universal deep hash model, specifically: sequentially inputting the samples I[i][j] into the universal deep hash model to obtain the semantic features of each picture, wherein 1 ≤ i ≤ s, 1 ≤ j ≤ n, s is the number of classes of S, and n is the number of samples per class;
s33: all features are arranged as M [ i ] [ j ], specifically: each row i is the same, indicating that the feature vectors of the row belong to the same class, the different columns indicate the jth sample feature vector of the class, and M is the support memory.
2. The depth image hashing method based on few training samples according to claim 1, wherein the specific process of said step S4 is as follows:
s41: in each iteration, the support memory pops out one feature vector for each class of features in a specified order, denoted f_t, wherein 1 ≤ t ≤ s;
s42: the forward and reverse expansion of the bidirectional long and short term memory sub-network is s time steps;
s43: letting f_l be the time-invariant input x_static of the bidirectional long-short term memory subnetwork, and letting f_t be the time-varying input x_t of the bidirectional long-short term memory subnetwork;
s44: through the interaction of the bidirectional long-short term memory sub-network and the support memory, the final feature representation of the few new samples is obtained;
s45: the new feature representation is constrained with a triplet loss function.
3. The depth image hashing method based on few training samples according to claim 2, wherein said step S5 comprises the following steps:
s51: training the whole network by stochastic gradient descent;
s52: and searching the test set of the L in the whole image database, and calculating a test result.
CN201811053140.XA 2018-09-10 2018-09-10 Depth image hashing method based on few training samples Expired - Fee Related CN109472282B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811053140.XA CN109472282B (en) 2018-09-10 2018-09-10 Depth image hashing method based on few training samples

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811053140.XA CN109472282B (en) 2018-09-10 2018-09-10 Depth image hashing method based on few training samples

Publications (2)

Publication Number Publication Date
CN109472282A CN109472282A (en) 2019-03-15
CN109472282B true CN109472282B (en) 2022-05-06

Family

ID=65664206

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811053140.XA Expired - Fee Related CN109472282B (en) 2018-09-10 2018-09-10 Depth image hashing method based on few training samples

Country Status (1)

Country Link
CN (1) CN109472282B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110134803B (en) * 2019-05-17 2020-12-11 哈尔滨工程大学 Image data quick retrieval method based on Hash learning
CN110210988B (en) * 2019-05-31 2021-04-27 北京理工大学 Symbolic social network embedding method based on deep hash
CN110674335B (en) * 2019-09-16 2022-08-23 重庆邮电大学 Hash code and image bidirectional conversion method based on multiple generation and multiple countermeasures
CN113780245B (en) * 2021-11-02 2022-06-14 山东建筑大学 Method and system for retrieving articles in multiple scenes

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106682233A (en) * 2017-01-16 2017-05-17 华侨大学 Method for Hash image retrieval based on deep learning and local feature fusion
CN107871014A (en) * 2017-11-23 2018-04-03 清华大学 A kind of big data cross-module state search method and system based on depth integration Hash

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106682233A (en) * 2017-01-16 2017-05-17 华侨大学 Method for Hash image retrieval based on deep learning and local feature fusion
CN107871014A (en) * 2017-11-23 2018-04-03 清华大学 A kind of big data cross-module state search method and system based on depth integration Hash

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Deep Semantic Hashing with Generative Adversarial Networks; Zhaofan Qiu et al.; Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval; 2017-08-31; pp. 225-234 *
Deep Visual-Semantic Hashing for Cross-Modal Retrieval; Yue Cao et al.; Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; 2016-08-31; pp. 1445-1454 *
FP-CNNH: a fast image hashing algorithm based on deep convolutional neural networks; Liu Ye et al.; Computer Science; 2016-09-30; Vol. 43, No. 9; pp. 39-51 *
Object-Location-Aware Hashing for Multi-Label Image Retrieval via Automatic Mask Learning; Changqin Huang et al.; IEEE Transactions on Image Processing; 2018-05-21; Vol. 27; pp. 4490-4502 *

Also Published As

Publication number Publication date
CN109472282A (en) 2019-03-15

Similar Documents

Publication Publication Date Title
CN110188227B (en) Hash image retrieval method based on deep learning and low-rank matrix optimization
CN109472282B (en) Depth image hashing method based on few training samples
CN111858954B (en) Task-oriented text-generated image network model
CN110059160B (en) End-to-end context-based knowledge base question-answering method and device
CN109635083B (en) Document retrieval method for searching topic type query in TED (tele) lecture
CN104317834B (en) A kind of across media sort methods based on deep neural network
CN107562812A (en) A kind of cross-module state similarity-based learning method based on the modeling of modality-specific semantic space
CN108875076B (en) Rapid trademark image retrieval method based on Attention mechanism and convolutional neural network
CN112819023B (en) Sample set acquisition method, device, computer equipment and storage medium
CN110837578B (en) Video clip recommendation method based on graph convolution network
CN112100346A (en) Visual question-answering method based on fusion of fine-grained image features and external knowledge
CN111898374B (en) Text recognition method, device, storage medium and electronic equipment
CN106844518B (en) A kind of imperfect cross-module state search method based on sub-space learning
CN112949740B (en) Small sample image classification method based on multilevel measurement
CN112966091A (en) Knowledge graph recommendation system fusing entity information and heat
CN113239159B (en) Cross-modal retrieval method for video and text based on relational inference network
CN116580257A (en) Feature fusion model training and sample retrieval method and device and computer equipment
CN110674326A (en) Neural network structure retrieval method based on polynomial distribution learning
CN112148886A (en) Method and system for constructing content knowledge graph
CN110598022A (en) Image retrieval system and method based on robust deep hash network
CN113297410A (en) Image retrieval method and device, computer equipment and storage medium
CN111985520A (en) Multi-mode classification method based on graph convolution neural network
CN114219824A (en) Visible light-infrared target tracking method and system based on deep network
CN108446605A (en) Double interbehavior recognition methods under complex background
CN111078952A (en) Cross-modal variable-length Hash retrieval method based on hierarchical structure

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20220506