CN107133348B

CN107133348B - Approximate searching method based on semantic consistency in large-scale picture set

Info

Publication number: CN107133348B
Application number: CN201710368677.4A
Authority: CN
Inventors: 胡鸣珂; 胡海峰; 吕成钢
Original assignee: Individual
Current assignee: Individual
Priority date: 2017-05-23
Filing date: 2017-05-23
Publication date: 2021-04-30
Anticipated expiration: 2037-05-23
Also published as: CN107133348A

Abstract

The invention discloses an approximate search method based on semantic consistency in a large-scale picture set. The method comprises two processes. Training process: semantic consistency is introduced when calculating the similarity between the pictures in the picture set and the sampled pictures, and the transformation matrix required by the next stage is obtained. Hash coding process: the optimized similarity between the pictures and the sampled pictures is calculated from the transformation matrix obtained in training, a similarity matrix is constructed from the optimized similarity, and each picture in the picture set is binary coded using a hash coding technique; the Hamming distance between the binary code of a new query picture and that of each picture is then compared to find the neighbors of the query picture. Because semantic consistency is introduced when measuring picture similarity, the similarity between pictures is measured more accurately; a stochastic gradient descent method reduces the training time of the algorithm, so the method can be applied effectively to large-scale picture data sets.

Description

Approximate searching method based on semantic consistency in large-scale picture set

Technical Field

The invention relates to a method for approximately searching pictures in a large-scale picture data set, and belongs to the technical field of machine learning.

Background

An important application of neighbor queries is the approximate search of pictures. In the big data era, picture data have two salient characteristics: the data scale is extremely large, and the feature dimension of each picture is very high. Efficient and accurate neighbor queries over massive high-dimensional picture sets are therefore of great value to research in fields such as computer vision and machine learning.

Conventional neighbor query algorithms, such as search algorithms based on tree index structures, suffer from the curse of dimensionality: their performance degrades rapidly when approximate neighbor search is carried out on high-dimensional picture data, so they are not suited to the current big data era. The most popular approach today is approximate neighbor search based on hashing. Classical hashing algorithms for approximate search, such as Locality Sensitive Hashing (LSH), convert the neighbor search problem into one of finding similar binary codes. Hashing-based approximate search algorithms have a simpler index structure and require less storage space. However, to ensure both precision and recall, LSH needs to construct multiple hash tables, which greatly increases query time and storage overhead.

Graph-based hashing algorithms, such as Spectral Hashing (SH) and Anchor Graph Hashing (AGH), have also emerged. They produce more efficient codes and achieve better performance because they measure the similarity between picture samples more accurately. However, these algorithms are too one-sided when looking for neighboring pictures: they consider only the actual storage location of a picture in the data set and ignore the semantic label information that the picture may carry, which makes them less effective in approximate picture search. In a real large-scale picture data set, many pictures carry semantic label information, and different class labels indicate that pictures belong to different classes. For example, two pictures may be stored far apart in the data set, but if they share the class label "sky", they are also approximate pictures. The currently popular approximate picture search algorithms therefore often perform poorly on large-scale picture data sets and do not solve the practical problem well.

Disclosure of Invention

The technical problem to be solved by the invention is as follows: an approximate picture searching method based on semantic consistency applied to a large-scale picture data set is provided. The method mainly solves the approximate searching problem of the pictures and maps similar pictures into the same or similar binary codes through a Hash technology.

The invention adopts the following technical scheme for solving the technical problems:

the invention provides an approximate searching method based on semantic consistency in a large-scale picture set, which comprises the following steps:

step 1: inputting a picture set sample matrix X and a semantic class label matrix Y corresponding to the picture set, wherein X is an n × d matrix, Y is an n × c matrix, n is the number of picture samples, d is the dimension of picture features, and c is the number of class labels;

step 2: randomly extracting a part of pictures from the picture set as a sampling picture set;

step 3: defining a relation matrix W between pictures in the picture set and the sampled picture set, combining the relation matrix with semantic consistency to construct an objective function, solving it iteratively by a stochastic gradient descent algorithm, and obtaining the optimized transformation matrix A after the objective function converges;

step 4: for each picture sample x, substituting the transformation matrix A into the relation matrix defined in step 3 to obtain the value of each element of the relation matrix; constructing a similarity matrix Z from the relation matrix, obtaining a coding matrix from the similarity matrix, hash coding each picture in the large-scale picture data set with the coding matrix, and thereby compressing the original d-dimensional features of each picture into a k-bit binary code;

step 5: for a new query picture q, calculating the binary code of the query picture through the coding matrix and comparing its Hamming distance to the binary code of each picture in the picture data set; if the Hamming distance is smaller than a set threshold r, the two pictures are considered approximate pictures.

Further, in the approximate search method of the present invention, the transformation matrix A is calculated as follows:

step (1), defining a relation matrix W between pictures, wherein each element in the relation matrix is defined as:

W_ij = exp(-||A(x_i - u_j)||²) (1)

in the above formula, A represents the transformation matrix, x_i represents the i-th picture in the picture set, and u_j represents the j-th picture in the sampled picture set;

step (2), defining an objective function formula as follows:

wherein f_i represents the class label vector of the i-th picture sample; the class label vector is a c-dimensional column vector whose elements take the value 1 or 0, indicating that the picture does or does not belong to the corresponding class; f_j represents the class label vector of a picture in the sampled picture set; and ||f_i - f_j||² is the semantic consistency term introduced when training the transformation matrix;

step (3), optimizing the transformation matrix A by a stochastic gradient descent algorithm, wherein the iterative update rule is as follows:

wherein γ_t is the optimization step size in each iteration, the initial value of the transformation matrix is I/δ, I is the d × d identity matrix, and δ is the median of the Euclidean distances between pictures in the picture set;

step (4), after all the picture samples in the picture set have been traversed, obtaining the finally optimized transformation matrix A.
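By way of illustration only, the two quantities defined in steps (1) and (2), namely the relation matrix of formula (1) and the label disagreement ||f_i - f_j||², can be sketched in Python with NumPy as follows. The function names `relation_matrix` and `label_disagreement`, the toy sizes, and the random data are assumptions made for this sketch and are not part of the invention; A is simply set to the identity matrix here.

```python
import numpy as np

def relation_matrix(X, U, A):
    """W[i, j] = exp(-||A (x_i - u_j)||^2), as in formula (1)."""
    XA = X @ A.T                       # row i is (A x_i)^T
    UA = U @ A.T                       # row j is (A u_j)^T
    # Squared Euclidean distances between all transformed pairs.
    sq = (XA**2).sum(1)[:, None] + (UA**2).sum(1)[None, :] - 2.0 * XA @ UA.T
    return np.exp(-np.maximum(sq, 0.0))   # clip tiny negative round-off

def label_disagreement(F, F_sampled):
    """D[i, j] = ||f_i - f_j||^2 for 0/1 class label rows."""
    diff = F[:, None, :] - F_sampled[None, :, :]
    return (diff**2).sum(-1)

# Hypothetical toy data: n pictures, d features, c classes, m sampled pictures.
rng = np.random.default_rng(0)
n, d, c, m = 6, 4, 3, 2
X = rng.normal(size=(n, d))
F = rng.integers(0, 2, size=(n, c)).astype(float)
idx = rng.choice(n, size=m, replace=False)
U, F_s = X[idx], F[idx]

A = np.eye(d)                          # placeholder transformation matrix
W = relation_matrix(X, U, A)           # n x m relation matrix
D = label_disagreement(F, F_s)         # n x m label disagreement
```

In this sketch a picture compared with itself gives W = 1 and D = 0, and D counts the class labels on which two pictures differ.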

Further, in the approximate search method of the present invention, the step size γ_t is selected from one of the following values: 1×10^-5, 1×10^-4, 1×10^-3, or 1×10^-2.

Further, the approximate search method of the present invention, step 4, is specifically as follows:

step a, after obtaining the transformation matrix A, calculating by formula (1) the optimized similarity, with semantic consistency introduced, between each picture and each sampled picture, that is, obtaining the value of each element of the relation matrix W; if the sampled picture set contains m picture samples, constructing a similarity matrix Z from the relation matrix,

the Z matrix calculation formula is defined as follows:

wherein <i> denotes the sampled picture set, i.e. the corresponding element of the Z matrix is calculated only when the picture belongs to the sampled picture set; otherwise the corresponding element of the Z matrix is 0;

step b, with the number of pictures in the sampled picture set being m, constructing an m × m matrix M, defined as follows:

M = Λ^(-1/2) Z^T Z Λ^(-1/2) (5)

wherein Λ = diag(Z^T 1) is a diagonal matrix; the k × k diagonal matrix consisting of the k largest eigenvalues of M is calculated as Σ = diag(δ_1, ..., δ_k) ∈ R^(k×k), and the m × k matrix consisting of the eigenvectors corresponding to the k largest eigenvalues as V = [v_1, ..., v_k] ∈ R^(m×k);

step c, constructing the final coding matrix Y from the matrices obtained by the above formulas, wherein the Y matrix is defined as follows:

Y is an n × k matrix, where n is the number of pictures in the picture set and k is the number of code bits in the binary code; each row of the coding matrix Y is a coding function, each picture is mapped by the coding function to a k-dimensional vector, and the vector is then binarized by sgn(Y) to obtain the binary code of each picture in the picture set.
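Steps a to c can be sketched as follows, purely as an illustration. The definitions of Z and Y are not reproduced in this text, so the sketch assumes the forms commonly used in anchor-graph style hashing: Z is W with each row normalized to sum to 1, and Y = sqrt(n) Z Λ^(-1/2) V Σ^(-1/2). Both are assumptions, as are the function name `hash_encode` and the random stand-in for W. Formula (5) and the choice of the k largest eigenvalues follow the text.

```python
import numpy as np

def hash_encode(W, k):
    """Binary-code n pictures from an n x m relation matrix W (entries > 0)."""
    n = W.shape[0]
    # Assumption: Z is W with every row normalized to sum to 1.
    Z = W / W.sum(axis=1, keepdims=True)
    lam = Z.sum(axis=0)                       # diagonal of Lambda = diag(Z^T 1)
    lam_isqrt = 1.0 / np.sqrt(lam)
    # Formula (5): M = Lambda^(-1/2) Z^T Z Lambda^(-1/2), an m x m matrix.
    M = lam_isqrt[:, None] * (Z.T @ Z) * lam_isqrt[None, :]
    evals, evecs = np.linalg.eigh(M)          # eigh returns ascending eigenvalues
    top = np.argsort(evals)[::-1][:k]         # indices of the k largest
    sigma, V = evals[top], evecs[:, top]
    # Assumption: Y = sqrt(n) Z Lambda^(-1/2) V Sigma^(-1/2), an n x k matrix.
    Y = np.sqrt(n) * ((Z * lam_isqrt[None, :]) @ V) / np.sqrt(sigma)[None, :]
    codes = (Y > 0).astype(np.uint8)          # sgn(Y), stored as 1 / 0
    return codes, Y, sigma

rng = np.random.default_rng(0)
W = np.exp(-rng.uniform(0.0, 3.0, size=(50, 8)))   # stand-in for formula (1)
codes, Y, sigma = hash_encode(W, k=4)
```

Under the row-normalization assumption, the largest eigenvalue of M equals 1 and its column of Y is constant, so that bit is the same for every picture. Anchor-graph hashing implementations usually skip this trivial eigenvalue; the sketch keeps it because the text takes the k largest eigenvalues.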

Further, in the approximate search method of the present invention, r is selected from one of the following values: 1,2,3, or 4.

The key technology of the invention is as follows:

(1) approximate search algorithm based on semantic consistency

The approximate search algorithm based on semantic consistency introduces semantic consistency when calculating the similarity between each picture in the picture data set and the pictures in the sampled picture set, and constructs an objective function containing semantic information. The objective function is then solved iteratively by a stochastic gradient descent algorithm, and after it converges a transformation matrix reflecting the inherent semantic consistency between pictures is obtained. The pictures are mapped to k-bit binary codes by a hash technique, such that similar input pictures are mapped to binary codes that are close in Hamming distance.

(2) Stochastic gradient descent (SGD) algorithm:

The stochastic gradient descent algorithm is an improvement on the gradient descent (GD) algorithm. It mainly addresses the problems that the original gradient descent algorithm converges too slowly and easily falls into local optima, and it is an iterative method for minimizing a loss function or risk function. The invention uses the stochastic gradient descent algorithm to reduce the training time of the transformation matrix in the semantic consistency approximate search method.

Compared with the prior art, the invention adopting the technical scheme has the following technical effects:

1. The problem of excessively slow convergence with the traditional gradient descent algorithm is solved.

2. The optimized similarity between pictures is calculated using the transformation matrix, which avoids the excessive dependence on sensitive parameters that arises when picture similarity is measured with a traditional Gaussian kernel function.

3. The original picture of d dimension is compressed and mapped into binary coding of k bits by using a Hash technology, so that the efficiency of the algorithm is greatly improved, and the occupation of the memory space is greatly reduced.
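The storage saving in effect 3 can be illustrated with simple arithmetic. The sizes below (one million pictures, 512 float32 features, 64-bit codes) are hypothetical values chosen for this sketch and do not come from the invention.

```python
import numpy as np

# Hypothetical sizes, for illustration only.
n, d, k = 1_000_000, 512, 64

feature_bytes = n * d * 4        # float32 features: 4 bytes per value
code_bytes = n * k // 8          # k bits per picture, 8 bits per byte
ratio = feature_bytes / code_bytes

# One 64-bit code stored as 0/1 values, then packed into 8 bytes.
bits = np.array([[1, 0, 1, 1, 0, 0, 1, 0] * 8], dtype=np.uint8)
packed = np.packbits(bits, axis=1)
```

With these assumed sizes the raw features occupy about 2 GB, the packed codes 8 MB, and each picture's code fits in 8 bytes.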

Drawings

FIG. 1 is a system framework diagram of the present invention.

Fig. 2 is a flow chart of the method of the present invention.

Detailed Description

The technical scheme of the invention is further explained in detail by combining the attached drawings:

Semantic consistency is introduced when measuring picture similarity, so that the similarity between pictures can be measured more accurately. A sampled picture set is generated to calculate the optimized similarity after semantic consistency is introduced, and a stochastic gradient descent method reduces the training time, so that the algorithm can be applied effectively to large-scale picture data sets. Efficient binary codes are then generated by a hash coding technique, so that better performance is obtained in approximate picture search.

As shown in fig. 1, the invention provides a method that binary codes the pictures of a large-scale picture set on the basis of semantic consistency using a hash coding technique, and finds approximate pictures by comparing the Hamming distances between picture codes.

The invention mainly comprises two parts: a process of training a transformation matrix and a process of hash coding.

The process of training the conversion matrix mainly introduces semantic consistency when calculating the similarity between the pictures in the picture set and the sampled pictures and obtains the conversion matrix required by the next stage.

The hash coding process mainly calculates the optimized similarity between the pictures and the sampled pictures from the transformation matrix obtained in training, constructs a similarity matrix from the optimized similarity, and then binary codes each picture in the picture set using the hash coding technique. The Hamming distance between the binary code of a new query picture and that of each picture is then compared to find the neighbors of the query picture.

I. Training the transformation matrix:

The process of training the transformation matrix builds a model on the idea of semantic consistency and obtains the transformation matrix required in the encoding stage; this matrix reflects the inherent semantic consistency between pictures. The invention reduces training time by using a stochastic gradient descent (SGD) algorithm in this process. If the feature dimension of the pictures is d, the trained transformation matrix is a square matrix with d rows and d columns.

The basic idea of the approximate search method based on semantic consistency in a large-scale picture set is to map each picture, by introducing semantic consistency, from its initial d-dimensional features to a compressed k-bit binary code, such that similar input pictures are mapped to binary codes that are close in Hamming distance.

Step 1: when calculating the optimized similarity, if the picture data set contains n picture samples, defining a relation matrix W between the pictures as an n × n dimensional matrix, wherein each element in the relation matrix is defined as:

W_ij＝exp(-||A(x_i-u_j)||²) (1)

in the above formula, A represents the transformation matrix, x_i represents the i-th picture in the data set, and u_j represents the j-th picture in the sampled picture set.

Step 2: in the process of training the transformation matrix A, semantic consistency is introduced to establish an objective function, and the transformation matrix required by the encoding stage is obtained by iterative solution with a stochastic gradient descent (SGD) algorithm. The objective function is defined as:

In the above objective function, f_i represents the class label vector of the i-th picture sample (the class label vector is a c-dimensional column vector, c is the number of classes, and each element takes the value 1 or 0, indicating that the picture does or does not belong to the corresponding class). f_j represents the class label vector of a picture in the sampled picture set. ||f_i - f_j||² is the semantic consistency term introduced when training the transformation matrix. Combined with the feature similarity between pictures, it allows more accurate binary codes to be generated.

Step 3: the invention employs a stochastic gradient descent algorithm in the optimization process to reduce the time taken to train the transformation matrix. The initial value of the transformation matrix is I/δ, where I is the d × d identity matrix and δ is the median of the Euclidean distances between pictures in the data set. The transformation matrix A is then optimized by the stochastic gradient descent algorithm, with the following iterative update rule:

wherein γ_t is the optimization step size in each iteration, which can be selected from the following values: 1×10^-5, 1×10^-4, 1×10^-3, 1×10^-2. After all the picture samples in the picture data set have been traversed, the finally optimized transformation matrix A is obtained. At this point the training of the transformation matrix is finished, and the transformation matrix A is output.
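The initialization described in Step 3 can be sketched as follows. The text defines δ as the median of the Euclidean distances between pictures; estimating that median from randomly drawn pairs, as done here, is an assumption made so that the sketch stays cheap on large picture sets. The function name `initial_transform` and the random demo data are likewise hypothetical.

```python
import numpy as np

def initial_transform(X, n_pairs=10_000, seed=0):
    """Return (A0, delta) with A0 = I / delta."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # Assumption: estimate the median from random pairs rather than
    # from all n * (n - 1) / 2 pairs.
    i = rng.integers(0, n, size=n_pairs)
    j = rng.integers(0, n, size=n_pairs)
    keep = i != j                              # drop self-pairs (distance 0)
    dist = np.linalg.norm(X[i[keep]] - X[j[keep]], axis=1)
    delta = float(np.median(dist))
    return np.eye(d) / delta, delta

X = np.random.default_rng(1).normal(size=(500, 16))   # demo features
A0, delta = initial_transform(X)
```

With A = I/δ the exponent in formula (1) becomes ||x_i - u_j||²/δ², so a pair at the median distance starts with a similarity of exp(-1), which keeps the initial similarities in a usable range.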

II. Hash coding process:

As shown in fig. 2, the hash coding process constructs, from the transformation matrix obtained in the previous stage, a similarity matrix Z reflecting the optimized similarity between each sample and the sampled picture set. Each picture in the large-scale picture data set is then hash coded. To search the data set for approximate pictures of a new picture, the Hamming distances between the binary codes of the pictures are compared; if the Hamming distance is smaller than a set threshold r, the two pictures are considered approximate pictures.

Step 1: after the conversion matrix A is obtained, the optimized similarity of each picture and the sampled picture after introducing the semantic consistency is calculated through the formula (1). I.e. the values of the individual elements of the relation matrix W are obtained. If m picture samples exist in the sampling picture set, a similarity matrix Z required by the Hash coding technology can be constructed through the relation matrix. The Z matrix calculation formula is defined as follows:

where the < i > set represents a sampled picture set. That is, the value of the corresponding element on the Z matrix is calculated only when the picture belongs to the sampled picture set, otherwise the value of the corresponding element on the Z matrix is 0.

Step 2: with the number of pictures in the sampled picture set being m, an m × m matrix M is constructed. The M matrix is defined as follows:

M = Λ^(-1/2) Z^T Z Λ^(-1/2) (5)

wherein Λ = diag(Z^T 1) is a diagonal matrix. The k × k diagonal matrix consisting of the k largest eigenvalues of M is calculated as Σ = diag(δ_1, ..., δ_k) ∈ R^(k×k), and the m × k matrix consisting of the eigenvectors corresponding to the k largest eigenvalues as V = [v_1, ..., v_k] ∈ R^(m×k).

Step 3: the final coding matrix Y is constructed from the matrices obtained by the above formulas, wherein the Y matrix is defined as follows:

Y is an n × k matrix, where n is the number of pictures in the picture set and k is the number of code bits in the binary code. Each row of the coding matrix Y is a coding function; each picture is mapped by the coding function to a k-dimensional vector, and the vector is then binarized: sgn(Y). This yields the binary code of each picture in the picture data set.

Step 4: if approximate pictures are to be found for a new query picture, the binary code of the query picture is calculated using the coding function. The Hamming distances between the code of the query picture and the codes of all pictures in the picture data set are then compared. A Hamming distance threshold r is defined (r can be 1, 2, 3, or 4); if the Hamming distance between the query picture and a picture is less than r, that picture is considered an approximate picture of the query picture. Traversing the picture data set finds all approximate pictures of the query picture.
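The Hamming distance comparison of Step 4 can be sketched as follows. The function name `hamming_neighbors` and the four 8-bit database codes are hypothetical examples; a practical system would typically pack the bits and use XOR with a population count rather than compare unpacked arrays.

```python
import numpy as np

def hamming_neighbors(query_code, db_codes, r):
    """Return (indices with Hamming distance < r, all distances)."""
    # Count the bit positions where each database code differs from the query.
    dist = np.count_nonzero(db_codes != query_code[None, :], axis=1)
    return np.flatnonzero(dist < r), dist

# Hypothetical 8-bit codes for four database pictures and one query.
db = np.array([[0, 1, 1, 0, 1, 0, 0, 1],
               [0, 1, 1, 0, 1, 0, 0, 0],
               [1, 0, 0, 1, 0, 1, 1, 0],
               [0, 1, 0, 0, 1, 0, 1, 1]], dtype=np.uint8)
q = np.array([0, 1, 1, 0, 1, 0, 0, 1], dtype=np.uint8)

hits, dist = hamming_neighbors(q, db, r=2)
```

Here the distances are 0, 1, 8 and 2, so with r = 2 only the first two pictures are returned, because the comparison is strict (smaller than r), as in the text.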

The overall method flow of the invention is as follows:

Step 1: inputting the sample matrix X of the picture data set (X is an n × d matrix, n is the number of pictures and can be very large, and d is the dimension of picture features) and the semantic class label matrix Y corresponding to the picture data set (Y is an n × c matrix, n is the number of samples, and c is the number of class labels).

Step 2: randomly extracting a part of the pictures from the picture set as the sampled picture set. The purpose of the sampled picture set is that similarity is calculated only between each picture and the sampled pictures, which greatly reduces the computation time and improves the efficiency of the algorithm.

Step 3: for each picture in the picture data set, introducing semantic consistency to construct an objective function O(A), where A (a d × d matrix, d being the dimension of picture features) is the transformation matrix required in the encoding stage. The objective function is solved iteratively by a stochastic gradient descent algorithm, and the optimized transformation matrix A is obtained after it converges.

Step 4: for each picture sample x, the transformation matrix A is substituted into formula (1) to calculate the similarity between x and each sampled picture, which gives the optimized similarity after semantic consistency is introduced. The pictures are then encoded by the hash technique, and the original d-dimensional features of each picture are compressed into a k-bit binary code.

Step 5: for a new query picture q, its approximate pictures are found as follows. The transformation matrix A trained in step 3 is first used to calculate the similarity between q and each sampled picture, which gives the optimized similarity after semantic consistency is introduced. The binary code of the query picture is then calculated through the coding function, and its Hamming distance to the binary code of each picture in the picture data set is compared. If the Hamming distance is smaller than a set threshold r, the two pictures are considered approximate pictures.

By adopting the technical implementation scheme, compared with the prior art, the invention solves the problems as follows:

(1) The poor performance caused by not introducing semantic consistency in the training process of traditional approximate search algorithms: many conventional picture neighbor search algorithms are too one-sided when searching for the neighbors of a query picture and do not consider the semantic information that pictures may carry, so they perform poorly in practical approximate picture search. The invention introduces semantic consistency when measuring picture similarity, so that the similarity between pictures can be measured more accurately and the method can be applied effectively to realistic approximate picture search.

(2) The optimized similarity is calculated using the sampled picture set, which solves the problem that similarity calculation on large-scale picture data sets is too slow: in a large-scale picture data set, calculating the similarity between all pairs of pictures by traditional measures has a very large time cost and is not feasible in practice. The method randomly extracts a small number of pictures from the massive picture set as the sampled picture set and calculates only the optimized similarity between each picture and the sampled picture set. This greatly reduces the time overhead of the algorithm and improves its efficiency.
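The sampling described in item (2) and the resulting reduction in the number of similarity values can be sketched as follows. The function name `sample_pictures`, the demo data, and the sizes n = 1,000,000 and m = 300 are hypothetical values chosen for illustration.

```python
import numpy as np

def sample_pictures(X, m, seed=0):
    """Randomly draw m distinct pictures (rows of X) as the sampled set."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(X.shape[0], size=m, replace=False)
    return idx, X[idx]

X = np.random.default_rng(1).normal(size=(1000, 8))   # demo features
idx, U = sample_pictures(X, m=30)

# Hypothetical large-scale sizes: number of similarity values to compute.
n, m = 1_000_000, 300
pairwise_entries = n * n        # every picture against every picture
sampled_entries = n * m         # every picture against the sampled set only
```

With these assumed sizes the sampled scheme needs 3 × 10^8 similarity values instead of 10^12, a reduction by a factor of n/m, which is roughly 3333.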

(3) The problem that the objective function converges too slowly is solved by using a stochastic gradient descent algorithm: the original gradient descent algorithm, also called the batch gradient descent algorithm, minimizes the loss function over all training data, so that the final solution is a global optimal solution, i.e. the solved parameters are those that minimize the loss function value. However, each iteration of the batch gradient descent algorithm requires all data in the training set, so it is very slow when the number of pictures in the data set is large. The stochastic gradient descent algorithm uses only one data sample per iterative update and is therefore fast, and the speed advantage is especially pronounced for large-scale picture data sets. Moreover, with the stochastic gradient descent algorithm the target loss function can converge without traversing the entire data set. The invention replaces the batch gradient descent algorithm with the stochastic gradient descent algorithm to solve the objective function iteratively, thereby solving the problem of slow convergence.
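The difference between a batch update and a stochastic update can be illustrated on a toy problem. The objective function and update rule of the invention are not reproduced in this text, so the sketch below uses an ordinary least-squares loss as a stand-in; the data, the function names `batch_step` and `sgd_step`, and the step sizes are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 2000, 5
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true                        # noiseless toy targets

def batch_step(w, lr):
    # Gradient of the mean squared error over ALL n samples.
    grad = 2.0 * X.T @ (X @ w - y) / n
    return w - lr * grad

def sgd_step(w, i, lr):
    # Gradient of the squared error on the single sample i.
    grad = 2.0 * X[i] * (X[i] @ w - y[i])
    return w - lr * grad

# One batch update touches all n samples once.
w_batch = batch_step(np.zeros(d), lr=0.1)

# One pass of SGD also touches each sample once but makes n updates.
w = np.zeros(d)
for i in rng.permutation(n):
    w = sgd_step(w, i, lr=1e-2)

err_batch = np.linalg.norm(w_batch - w_true)
err_sgd = np.linalg.norm(w - w_true)
```

For the same single pass over the data, the stochastic variant performs n small updates where the batch variant performs one, which is why it tends to approach the solution much sooner on large data sets.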

In summary, the present invention uses a transformation matrix reflecting the inherent semantic consistency between pictures to calculate the optimized similarity between pictures. To improve search efficiency, a part of the pictures is randomly selected from the large-scale picture set as a sampled picture set for measuring similarity, and a stochastic gradient descent method reduces the training time when the transformation matrix is trained. After the similarity matrix used for coding is obtained from the optimized similarity between pictures, the original pictures are mapped to k-bit binary codes by the hash coding technique. When searching for the neighbors of a new query picture, the binary code of the query picture is first obtained through the coding function of the model, and the Hamming distance between this code and the code of every picture in the picture set is then compared. A picture is considered an approximate picture of the query picture when the Hamming distance between them is less than a given Hamming distance threshold.

The foregoing describes only some embodiments of the present invention. It should be noted that those skilled in the art can make various modifications and refinements without departing from the principle of the present invention, and these modifications and refinements should also be regarded as falling within the protection scope of the present invention.

Claims

1. An approximate search method based on semantic consistency in a large-scale picture set, characterized by comprising the following steps:

step 1: inputting a picture set sample matrix X and a semantic class label matrix Y corresponding to the picture set, wherein X is an n × d matrix, Y is an n × c matrix, n represents the number of pictures in the picture set, d is the dimension of picture features, and c is the number of class labels;

step 3: defining a relation matrix W between pictures in the picture set and the sampled picture set, combining the relation matrix with semantic consistency to construct an objective function, solving it iteratively by a stochastic gradient descent algorithm, and obtaining the optimized transformation matrix A after the objective function converges; the transformation matrix A is calculated as follows:

W_ij＝exp(-||A(x_i-u_j)||²) (1)

step (2), defining an objective function formula as follows:

wherein f_i represents the class label vector of the i-th picture sample, the class label vector is a c-dimensional column vector whose elements take the value 1 or 0, indicating that the picture does or does not belong to the corresponding class, and f_j represents the class label vector of the j-th picture in the sampled picture set;

step (4), after all the picture samples in the picture set are traversed, obtaining a conversion matrix A which is finally optimized;

2. The approximate search method according to claim 1, wherein: the step size γ_t is selected from one of the following values: 1×10^-5, 1×10^-4, 1×10^-3, or 1×10^-2.

3. The approximate search method according to claim 1, wherein: the step 4 is as follows:

the Z matrix calculation formula is defined as follows:

M = Λ^(-1/2) Z^T Z Λ^(-1/2) (5)

wherein Λ = diag(Z^T 1) is a diagonal matrix; the l × l diagonal matrix consisting of the l largest eigenvalues of M is calculated as Σ = diag(δ_1, ..., δ_l) ∈ R^(l×l), and the m × l matrix consisting of the eigenvectors corresponding to the l largest eigenvalues as V = [v_1, ..., v_l] ∈ R^(m×l);

step c, constructing the final coding matrix Y_t from the matrices obtained by the above formulas, wherein the Y_t matrix is defined as follows:

Y_t is an n × k matrix, where n represents the number of pictures in the picture set and k represents the number of code bits in the binary code; each row of the coding matrix Y_t is a coding function, each picture is mapped by the coding function to a k-dimensional vector, and the vector is then binarized by sgn(Y_t) to obtain the binary code of each picture in the picture set.

4. The approximate search method according to claim 1, wherein: r is selected to be one of the following values: 1,2,3, or 4.