CN109634953B - Weighted quantization Hash retrieval method for high-dimensional large data set - Google Patents

Weighted quantization Hash retrieval method for high-dimensional large data set

Info

Publication number
CN109634953B
CN109634953B (application CN201811316883.1A)
Authority
CN
China
Prior art keywords
matrix
dimensional
data
weighted
equal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811316883.1A
Other languages
Chinese (zh)
Other versions
CN109634953A (en)
Inventor
孙瑶 (Sun Yao)
钱江波 (Qian Jiangbo)
胡伟 (Hu Wei)
任艳多 (Ren Yanduo)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dragon Totem Technology Hefei Co ltd
Guangzhou Ruifeng Data Service Co.,Ltd.
Original Assignee
Ningbo University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ningbo University filed Critical Ningbo University
Priority to CN201811316883.1A priority Critical patent/CN109634953B/en
Publication of CN109634953A publication Critical patent/CN109634953A/en
Application granted granted Critical
Publication of CN109634953B publication Critical patent/CN109634953B/en
Legal status: Active

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/213 - Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G06F 18/2135 - Feature extraction based on approximation criteria, e.g. principal component analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a weighted quantization hash retrieval method for high-dimensional large data sets. A principal component analysis algorithm first reduces the dimensions of the original high-dimensional data and of a given query datum. A loss function is then constructed according to the pairwise-similarity-preserving principle under a relaxed orthogonality constraint, and minimizing this loss function yields a final binary coding matrix and a final weight matrix. From these, a weighted binary coding matrix and the binary code corresponding to the given query datum are obtained; the row vector data in the weighted binary coding matrix with the smallest weighted Hamming distance to the query's binary code are retrieved, completing the hash retrieval of the given query datum. Because the loss function is constructed with a relaxed orthogonality constraint and retrieval uses the weighted Hamming distance, the method improves both the retrieval efficiency and the accuracy of hash retrieval.

Description

Weighted quantization Hash retrieval method for high-dimensional large data set
Technical Field
The invention relates to a data retrieval method, in particular to a weighted quantization Hash retrieval method for a high-dimensional large data set.
Background
Nearest neighbor search has long been a fundamental research problem in computer science. Hash-based retrieval is an effective technique for large-scale high-dimensional data retrieval: hash-based similarity queries offer good query performance and storage efficiency. However, most existing hash methods treat every dimension of a hash code as equally important; that is, they measure the similarity between two data items directly by the Hamming distance. In practice, different choices of mapping direction yield different classification quality, and each dimension of a hash code carries a different amount of information, so different coding dimensions influence inter-data similarity differently.
Using the plain Hamming distance as the metric therefore judges the similarity of data only coarsely; it cannot fully capture the distance between data items and needs improvement.
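The distinction above can be made concrete with a small numeric example. The codes and per-dimension weights below are purely illustrative (the patent learns the weights; these are hand-picked): two codes that disagree in two positions always have plain Hamming distance 2, but their weighted Hamming distance depends on which dimensions disagree.

```python
import numpy as np

# Two 6-bit codes in {-1, +1} encoding, as used later in the patent.
a = np.array([1, -1, 1, 1, -1, 1])
b = np.array([1, 1, 1, -1, -1, 1])

# Plain Hamming distance: number of positions where the codes differ.
plain = int(np.sum(a != b))

# Hypothetical per-dimension weights: dimensions carry unequal information.
w = np.array([0.9, 0.1, 0.8, 0.7, 0.5, 0.3])

# Weighted Hamming distance: each disagreeing position contributes its
# weight, so a mismatch on an informative dimension costs more.
weighted = float(np.sum(w * (a != b)))

print(plain)     # 2 (positions 1 and 3 disagree)
print(weighted)  # 0.1 + 0.7 = 0.8
```

A mismatch on a high-weight dimension (e.g. weight 0.9) would have pushed the weighted distance much higher, which is exactly the effect the plain Hamming distance cannot express.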
Disclosure of Invention
The technical problem to be solved by the invention is to provide a weighted quantization hash retrieval method for high-dimensional large data sets that can effectively improve the retrieval efficiency and accuracy of hash retrieval.
The technical scheme adopted by the invention to solve the above technical problem is as follows: a weighted quantization hash retrieval method for a high-dimensional large data set, comprising the following steps:
Step 1: obtain an original high-dimensional data set X consisting of n original high-dimensional data and a given query datum q, wherein X is an n × d matrix and q is a 1 × d vector; reduce the dimension of X by a principal component analysis algorithm to obtain a low-dimensional vector set V corresponding to X,

[matrix equation shown as an image in the original]

wherein V is an n × c matrix with c < d, v_ij denotes the low-dimensional vector element in V corresponding to the j-th dimension of the i-th original high-dimensional datum, 1 ≤ i ≤ n and 1 ≤ j ≤ c; then reduce the dimension of q by the principal component analysis algorithm to obtain a 1 × c low-dimensional vector q' corresponding to q;
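This dimensionality-reduction step can be sketched with a plain SVD-based PCA. The helper name `pca_reduce` and the toy sizes are illustrative, not from the patent; note that the query must be projected with the same basis learned from X, not with a PCA fitted on q alone.

```python
import numpy as np

def pca_reduce(X, c):
    """Reduce an n x d data matrix X to n x c via PCA (minimal sketch):
    center the data, then project onto the top-c right singular vectors."""
    mean = X.mean(axis=0)
    Xc = X - mean
    # SVD of the centered data; rows of Vt are the principal directions.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = Vt[:c].T                      # d x c projection matrix
    return Xc @ P, mean, P

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 16))    # toy "high-dimensional" set, d = 16
q = rng.standard_normal((1, 16))      # toy query datum

V, mean, P = pca_reduce(X, c=4)       # n x c low-dimensional set
q_low = (q - mean) @ P                # project q with the SAME basis
print(V.shape, q_low.shape)           # (100, 4) (1, 4)
```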
Step 2: obtain the final binary coding matrix B' and the final weight matrix W' by iteration, the specific process being as follows:
Step 2-1: set the maximum number of iterations; randomly give an initial binary coding matrix B ∈ {-1, 1}^{n×c}; randomly give an initial weight matrix W = diag(w_1, w_2, …, w_j, …, w_c), wherein w_j denotes the dimension weight of the j-th dimension and diag(·) denotes a diagonal matrix;
Step 2-2: construct a loss function according to the pairwise-similarity-preserving principle among the hash function construction principles, introduce a full orthogonality constraint, and relax that constraint, thereby constructing the loss function

[loss function equation shown as an image in the original]

wherein ‖·‖_F denotes the Frobenius norm of a matrix, the superscript 2 denotes squaring, B^T denotes the transpose of B, and I denotes the identity matrix;
Step 2-3: start the iterative process. In the current iteration, first keep W unchanged and minimize the loss function with respect to B, updating B by a gradient descent method; the B obtained by this update at the minimum is denoted B',

[equation shown as an image in the original]

wherein b_ij denotes the updated binary coding value corresponding to the j-th dimension element of the i-th original high-dimensional datum in X in the current iteration;
then keep B' unchanged and minimize the loss function with respect to W; the W obtained by this update at the minimum is denoted W';
Step 2-4: judge whether the iteration count of the current iteration has reached the set maximum number of iterations. If not, let W = W' and B = B' (here "=" denotes assignment), add 1 to the iteration count, and return to step 2-3 to start the next iteration; if the maximum number of iterations has been reached, take the W' updated in the current iteration as the final weight matrix W', and take the B' updated in the current iteration as the final binary coding matrix B';
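The alternating scheme of steps 2-1 through 2-4 can be sketched in code. The patent's exact loss appears only as an image, so the sketch below assumes a common pairwise-similarity-preserving form, L(B, W) = ‖VVᵀ − BWBᵀ‖²_F + λ‖BᵀB − nI‖²_F, with the orthogonality constraint relaxed into the λ-penalty; this choice of loss, the function names, and the learning rate are assumptions, not the patent's exact formulation.

```python
import numpy as np

def loss(V, B, W, lam):
    # ASSUMED pairwise-similarity-preserving loss with a relaxed
    # orthogonality penalty (the patent's exact formula is an image).
    n, c = B.shape
    fit = np.linalg.norm(V @ V.T - B @ W @ B.T, 'fro') ** 2
    ortho = np.linalg.norm(B.T @ B - n * np.eye(c), 'fro') ** 2
    return fit + lam * ortho

def alternate(V, c, iters=50, lam=0.1, lr=1e-3, seed=0):
    """Alternating minimization: fix W and gradient-step B, then fix B
    and solve for the diagonal weights (sketch under the assumed loss)."""
    rng = np.random.default_rng(seed)
    n = V.shape[0]
    B = np.sign(rng.standard_normal((n, c)))   # random init, entries in {-1, 1}
    B[B == 0] = 1.0
    w = np.ones(c)                             # initial dimension weights
    S = V @ V.T                                # pairwise similarity target
    for _ in range(iters):
        W = np.diag(w)
        # W fixed: one gradient-descent step on a real relaxation of B,
        # then re-binarize with sign() to stay in {-1, 1}^{n x c}.
        R = B @ W @ B.T - S
        grad_B = 4.0 * R @ B @ W + 4.0 * lam * B @ (B.T @ B - n * np.eye(c))
        B = np.sign(B - lr * grad_B)
        B[B == 0] = 1.0
        # B fixed: coordinate-wise least-squares update of each weight w_j.
        for j in range(c):
            bj = B[:, [j]]
            resid = S - B @ np.diag(w) @ B.T + w[j] * (bj @ bj.T)
            w[j] = (bj.T @ resid @ bj).item() / (bj.T @ bj).item() ** 2
    return B, np.diag(w)

V = np.random.default_rng(1).standard_normal((30, 8))  # toy low-dim set
B_fin, W_fin = alternate(V, c=4, iters=10)
print(B_fin.shape, W_fin.shape)  # (30, 4) (4, 4)
```

The re-binarization after each gradient step is one plausible way to keep B in {-1, 1}^{n×c} while still "updating B by gradient descent," as the patent states; other projections are possible.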
Step 3: weight and quantize each element of B' according to W' to obtain the weighted binary coding matrix Z;
Step 4: according to W' and B', obtain the binary code that minimizes

[equation shown as an image in the original]

and use it as the binary code q″ corresponding to q'; search Z for the row vector data closest to q″ in weighted Hamming distance, take the original high-dimensional datum corresponding to that row vector data as the final nearest-neighbor query result, and complete the hash retrieval of q.
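Steps 3 and 4 can be sketched as follows, assuming (as a reading of the patent, not its exact formulas) that Z scales each column of B' by its dimension weight and that the search ranks rows by weighted Hamming distance to the query code. The function name and the tiny 3 × 3 example are illustrative.

```python
import numpy as np

def weighted_hamming_search(B_final, w, q_code):
    """Return the index of the database item nearest to q_code under the
    weighted Hamming distance, plus the weighted coding matrix Z.
    B_final: n x c matrix in {-1, 1}; w: c dimension weights;
    q_code: length-c binary code of the query (sketch; names assumed)."""
    # Weighted binary coding matrix Z: each column scaled by its weight.
    Z = B_final * w                    # broadcasting, n x c
    # Weighted Hamming distance: sum of weights on disagreeing dimensions.
    disagree = B_final != q_code       # n x c boolean
    dists = disagree @ w               # one distance per database row
    return int(np.argmin(dists)), Z

B_final = np.array([[ 1,  1, -1],
                    [-1,  1,  1],
                    [ 1, -1,  1]])
w = np.array([0.6, 0.3, 0.1])
q_code = np.array([1, -1, 1])
idx, Z = weighted_hamming_search(B_final, w, q_code)
print(idx)   # row 2 matches q_code exactly, so its distance is 0
```

Rows 0 and 1 disagree with the query on weights summing to 0.4 and 0.9 respectively, so the exact match (row 2, distance 0) wins; with unweighted Hamming distance the same ranking happens to hold here, but the weighted ranking can differ whenever mismatches land on unequal-weight dimensions.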
The maximum number of iterations set in step 2-1 is 50.
Compared with the prior art, the method has the following advantages. First, a principal component analysis algorithm reduces the original high-dimensional data to a corresponding low-dimensional vector set and reduces the given query datum to a corresponding low-dimensional vector. A loss function is then constructed according to the pairwise-similarity-preserving principle under a relaxed orthogonality constraint, and minimizing it yields the final binary coding matrix and the final weight matrix. Each element of the final binary coding matrix is weighted and quantized by the final weight matrix to obtain the weighted binary coding matrix, and the binary code corresponding to the given query datum is obtained from the final binary coding matrix and the final weight matrix. The row vector data in the weighted binary coding matrix with the smallest weighted Hamming distance to the query's binary code are retrieved, and the original high-dimensional datum corresponding to those row vector data is taken as the final nearest-neighbor query result, completing the hash retrieval of the given query datum. Performing hash retrieval with the weighted Hamming distance better mines the information in the data set and preserves the similarity information between data. Adopting the relaxed orthogonality constraint when constructing the loss function improves the effectiveness of the coding and lets the hash method select better projection directions during projection, further improving the retrieval efficiency and accuracy of hash retrieval.
Drawings
FIG. 1 is a flow chart of the steps of the present invention.
Detailed Description
The invention is described in further detail below with reference to the accompanying drawings and an embodiment.
A weighted quantization Hash retrieval method for a high-dimensional large data set comprises the following steps:
Step 1: obtain an original high-dimensional data set X consisting of n original high-dimensional data and a given query datum q, wherein X is an n × d matrix and q is a 1 × d vector; reduce the dimension of X by a principal component analysis algorithm to obtain a low-dimensional vector set V corresponding to X,

[matrix equation shown as an image in the original]

wherein V is an n × c matrix with c < d, v_ij denotes the low-dimensional vector element in V corresponding to the j-th dimension of the i-th original high-dimensional datum, 1 ≤ i ≤ n and 1 ≤ j ≤ c; then reduce the dimension of q by the principal component analysis algorithm to obtain a 1 × c low-dimensional vector q' corresponding to q;
Step 2: obtain the final binary coding matrix B' and the final weight matrix W' by iteration, the specific process being as follows:
Step 2-1: set the maximum number of iterations; randomly give an initial binary coding matrix B ∈ {-1, 1}^{n×c}; randomly give an initial weight matrix W = diag(w_1, w_2, …, w_j, …, w_c), wherein w_j denotes the dimension weight of the j-th dimension and diag(·) denotes a diagonal matrix. The set maximum number of iterations may be 50.
Step 2-2: construct a loss function according to the pairwise-similarity-preserving principle among the hash function construction principles, introduce a full orthogonality constraint, and relax that constraint, thereby constructing the loss function

[loss function equation shown as an image in the original]

wherein ‖·‖_F denotes the Frobenius norm of a matrix, the superscript 2 denotes squaring, B^T denotes the transpose of B, and I denotes the identity matrix;
Step 2-3: start the iterative process. In the current iteration, first keep W unchanged and minimize the loss function with respect to B, updating B by a gradient descent method; the B obtained by this update at the minimum is denoted B',

[equation shown as an image in the original]

wherein b_ij denotes the updated binary coding value corresponding to the j-th dimension element of the i-th original high-dimensional datum in X in the current iteration;
then keep B' unchanged and minimize the loss function with respect to W; the W obtained by this update at the minimum is denoted W';
Step 2-4: judge whether the iteration count of the current iteration has reached the set maximum number of iterations. If not, let W = W' and B = B' (here "=" denotes assignment), add 1 to the iteration count, and return to step 2-3 to start the next iteration; if the maximum number of iterations has been reached, take the W' updated in the current iteration as the final weight matrix W', and take the B' updated in the current iteration as the final binary coding matrix B';
Step 3: weight and quantize each element of B' according to W' to obtain the weighted binary coding matrix Z;
Step 4: according to W' and B', obtain the binary code that minimizes

[equation shown as an image in the original]

and use it as the binary code q″ corresponding to q'; search Z for the row vector data closest to q″ in weighted Hamming distance, take the original high-dimensional datum corresponding to that row vector data as the final nearest-neighbor query result, and complete the hash retrieval of q.

Claims (2)

1. A weighted quantization Hash retrieval method for a high-dimensional large data set is characterized by comprising the following steps:
step one: obtain an original high-dimensional data set X consisting of n original high-dimensional data and a given query datum q, wherein X is an n × d matrix and q is a 1 × d vector; reduce the dimension of X by a principal component analysis algorithm to obtain a low-dimensional vector set V corresponding to X,

[matrix equation shown as an image in the original]

wherein V is an n × c matrix with c < d, v_ij denotes the low-dimensional vector element in V corresponding to the j-th dimension of the i-th original high-dimensional datum, 1 ≤ i ≤ n and 1 ≤ j ≤ c; then reduce the dimension of q by the principal component analysis algorithm to obtain a 1 × c low-dimensional vector q' corresponding to q;
step two: obtaining a final binary coding matrix B 'and a final weight matrix W' through iteration, and the specific process is as follows:
step two-1: setting the maximum iteration times, randomly giving an initial binary coding matrix B, wherein B belongs to { -1,1}n×cRandomly giving an initial weight matrix W, W ═ diag (W)1,w2,...,wj,...,wc) Wherein w isjRepresents the dimension weight of the j-th dimension, diag () represents the diagonal matrix;
step 2: constructing a loss function according to a pairwise-preserving similarity principle in a Hash function construction principle, introducing a complete orthogonal constraint condition, and relaxing the complete orthogonal constraint condition to construct the loss function
Figure FDA0003112205430000012
Wherein | | | purple hairFTo take the F-norm sign of the matrix,
Figure FDA0003112205430000013
2 in is a square symbol, BTA transposed matrix representing B, I representing an identity matrix;
step two-3: starting an iterative process, and in the current iterative process, firstly keeping W unchanged for
Figure FDA0003112205430000014
Performing minimum solution, updating B by gradient descent method
Figure FDA0003112205430000015
B obtained by updating at the minimum is marked as B',
Figure FDA0003112205430000016
bijrepresenting the updated binary coding value corresponding to the jth dimension element of the ith original high-dimensional data in the X in the current iteration process;
then keep B' unchanged and minimize the loss function with respect to W; the W obtained by this update at the minimum is denoted W';
step two-4: judging whether the iteration frequency of the current iteration process reaches the set maximum iteration frequency, if not, making W equal to W ', B equal to B', returning to the step (II) -3 to start the next iteration process, and adding 1 to the iteration frequency, wherein W equal to W 'and B equal to B' are assignment symbols; if the maximum iteration number is reached, taking W 'obtained by updating in the current iteration process as a final weight matrix W', and taking B 'obtained by updating in the current iteration process as a final binary coding matrix B';
step three: weighting and quantizing each element in B 'according to W' to obtain a weighted binary coding matrix Z;
step IV: according to W 'and B', obtaining
Figure FDA0003112205430000021
And the smallest q ' is used as a binary code q ' corresponding to q ', the row vector data closest to the weighted hamming distance of q ' is searched in Z, and the original high-dimensional data corresponding to the row vector data closest to the weighted hamming distance of q ' is used as a final nearest neighbor query result to finish the hash retrieval process of q.
2. The weighted quantization hash retrieval method for a high-dimensional large data set according to claim 1, wherein the maximum number of iterations set in step two-1 is 50.
CN201811316883.1A 2018-11-07 2018-11-07 Weighted quantization Hash retrieval method for high-dimensional large data set Active CN109634953B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811316883.1A CN109634953B (en) 2018-11-07 2018-11-07 Weighted quantization Hash retrieval method for high-dimensional large data set

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811316883.1A CN109634953B (en) 2018-11-07 2018-11-07 Weighted quantization Hash retrieval method for high-dimensional large data set

Publications (2)

Publication Number Publication Date
CN109634953A CN109634953A (en) 2019-04-16
CN109634953B true CN109634953B (en) 2021-08-17

Family

ID=66067314

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811316883.1A Active CN109634953B (en) 2018-11-07 2018-11-07 Weighted quantization Hash retrieval method for high-dimensional large data set

Country Status (1)

Country Link
CN (1) CN109634953B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111143625B (en) * 2019-09-03 2023-04-25 西北工业大学 Cross-modal retrieval method based on semi-supervised multi-modal hash coding
CN110750731B (en) * 2019-09-27 2023-10-27 成都数联铭品科技有限公司 Method and system for removing duplicate of news public opinion

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103226585A (en) * 2013-04-10 2013-07-31 大连理工大学 Self-adaptation Hash rearrangement method for image retrieval
CN104820696A (en) * 2015-04-29 2015-08-05 山东大学 Large-scale image retrieval method based on multi-label least square Hash algorithm
CN106776856A (en) * 2016-11-29 2017-05-31 江南大学 A kind of vehicle image search method of Fusion of Color feature and words tree
CN106777388A (en) * 2017-02-20 2017-05-31 华南理工大学 A kind of multilist hashing image search method of dual compensation
CN107423309A (en) * 2016-06-01 2017-12-01 国家计算机网络与信息安全管理中心 Magnanimity internet similar pictures detecting system and method based on fuzzy hash algorithm

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080133496A1 (en) * 2006-12-01 2008-06-05 International Business Machines Corporation Method, computer program product, and device for conducting a multi-criteria similarity search

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103226585A (en) * 2013-04-10 2013-07-31 大连理工大学 Self-adaptation Hash rearrangement method for image retrieval
CN104820696A (en) * 2015-04-29 2015-08-05 山东大学 Large-scale image retrieval method based on multi-label least square Hash algorithm
CN107423309A (en) * 2016-06-01 2017-12-01 国家计算机网络与信息安全管理中心 Magnanimity internet similar pictures detecting system and method based on fuzzy hash algorithm
CN106776856A (en) * 2016-11-29 2017-05-31 江南大学 A kind of vehicle image search method of Fusion of Color feature and words tree
CN106777388A (en) * 2017-02-20 2017-05-31 华南理工大学 A kind of multilist hashing image search method of dual compensation

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
"面向大规模数据检索的哈希学习研究进展" ["Research Progress of Hash Learning for Large-Scale Data Retrieval"]; Ren Yanduo et al.; Wireless Communication Technology (《无线通信技术》); 2017-12-15; Vol. 26, No. 4; pp. 21-25 *

Also Published As

Publication number Publication date
CN109634953A (en) 2019-04-16

Similar Documents

Publication Publication Date Title
Li et al. Generalized uncorrelated regression with adaptive graph for unsupervised feature selection
CN110309343B (en) Voiceprint retrieval method based on deep hash
CN112732864B (en) Document retrieval method based on dense pseudo query vector representation
CN109634953B (en) Weighted quantization Hash retrieval method for high-dimensional large data set
Chehreghani et al. Information theoretic model validation for spectral clustering
Murray et al. Interferences in match kernels
Shi et al. Query-efficient black-box adversarial attack with customized iteration and sampling
CN112256727B (en) Database query processing and optimizing method based on artificial intelligence technology
Ozan et al. K-subspaces quantization for approximate nearest neighbor search
Habib et al. Retracted: Forecasting model for wind power integrating least squares support vector machine, singular spectrum analysis, deep belief network, and locality‐sensitive hashing
Li et al. Deep multi-similarity hashing for multi-label image retrieval
CN111612319A (en) Load curve depth embedding clustering method based on one-dimensional convolution self-encoder
Zhang et al. CapsNet-based supervised hashing
Liang et al. Distrihd: A memory efficient distributed binary hyperdimensional computing architecture for image classification
CN107133348B (en) Approximate searching method based on semantic consistency in large-scale picture set
Qiu et al. Efficient document retrieval by end-to-end refining and quantizing BERT embedding with contrastive product quantization
Li et al. Embedding Compression in Recommender Systems: A Survey
CN109710607B (en) Hash query method for high-dimensional big data based on weight solving
Wang et al. Grassmann hashing for approximate nearest neighbor search in high dimensional space
Ferdowsi et al. Sparse ternary codes for similarity search have higher coding gain than dense binary codes
CN117079744A (en) Artificial intelligent design method for energetic molecule
Zhang et al. Efficient indexing of binary LSH for high dimensional nearest neighbor
CN115344693A (en) Clustering method based on fusion of traditional algorithm and neural network algorithm
Zare et al. A Novel multiple kernel-based dictionary learning for distributive and collective sparse representation based classifiers
Ye et al. Fast search in large-scale image database using vector quantization

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20231214

Address after: Room 309-1, No. 62 Huamei Road, Tianhe District, Guangzhou City, Guangdong Province, 510000 (office only)

Patentee after: Guangzhou Ruifeng Data Service Co.,Ltd.

Address before: 230000 floor 1, building 2, phase I, e-commerce Park, Jinggang Road, Shushan Economic Development Zone, Hefei City, Anhui Province

Patentee before: Dragon totem Technology (Hefei) Co.,Ltd.

Effective date of registration: 20231214

Address after: 230000 floor 1, building 2, phase I, e-commerce Park, Jinggang Road, Shushan Economic Development Zone, Hefei City, Anhui Province

Patentee after: Dragon totem Technology (Hefei) Co.,Ltd.

Address before: 315211, Fenghua Road, Jiangbei District, Zhejiang, Ningbo 818

Patentee before: Ningbo University
