Disclosure of Invention
The technical problem to be solved by the invention is to provide a weighted quantization hash retrieval method oriented to high-dimensional large data sets, which can effectively improve the retrieval efficiency and accuracy of hash retrieval.
The technical scheme adopted by the invention to solve the above technical problem is as follows: a weighted quantization hash retrieval method for a high-dimensional large data set, comprising the following steps:
firstly, obtain an original high-dimensional data set X consisting of n original high-dimensional data and a given query datum q, where X is an n × d matrix and q is a 1 × d vector; reduce the dimension of X with the principal component analysis algorithm to obtain the low-dimensional vector set V corresponding to X, where V is an n × c matrix, c < d, and V_ij denotes the low-dimensional vector element in V corresponding to the j-th dimension of the i-th original high-dimensional datum, with 1 ≤ i ≤ n and 1 ≤ j ≤ c; then reduce the dimension of q with the principal component analysis algorithm to obtain the 1 × c low-dimensional vector q' corresponding to q;
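The dimensionality reduction of step "firstly" can be sketched as follows. This is a minimal illustration, not the patent's exact procedure: the variable names (X, q, V, q_prime, c) follow the text, but centering X on its mean and taking the top-c right singular vectors are assumptions, since the text does not specify the PCA details.

```python
import numpy as np

def pca_reduce(X, q, c):
    # Center the data set; project both X and q onto the top-c principal axes.
    mean = X.mean(axis=0)
    Xc = X - mean
    # Principal directions = top-c right singular vectors of the centered data.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = Vt[:c].T                    # d x c projection matrix
    V = Xc @ P                      # n x c low-dimensional vector set
    q_prime = (q - mean) @ P        # 1 x c low-dimensional query q'
    return V, q_prime

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 16))  # n = 100 data of dimension d = 16
q = rng.standard_normal((1, 16))
V, q_prime = pca_reduce(X, q, 8)    # reduce to c = 8 dimensions
print(V.shape, q_prime.shape)       # (100, 8) (1, 8)
```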
secondly, obtain a final binary coding matrix B' and a final weight matrix W' by iteration; the specific process is as follows:
secondly-1, set the maximum number of iterations; randomly give an initial binary coding matrix B, with B ∈ {-1, 1}^{n×c}; randomly give an initial weight matrix W = diag(w_1, w_2, …, w_j, …, w_c), where w_j denotes the dimension weight of the j-th dimension and diag(·) denotes a diagonal matrix;
secondly-2, construct a loss function according to the pairwise-similarity-preserving principle among the hash-function construction principles, introduce a full orthogonality constraint, and relax that constraint, thereby constructing the loss function, where ‖·‖_F denotes the Frobenius norm of a matrix, the superscript 2 denotes squaring, B^T denotes the transpose of B, and I denotes the identity matrix;
secondly-3, start the iterative process: in the current iteration, first keep W unchanged and perform a minimization solution of the loss function with respect to B, updating B by the gradient descent method; record the B that minimizes the loss function as B', where b'_ij denotes the updated binary coding value corresponding to the j-th dimension element of the i-th original high-dimensional datum in X in the current iteration;
secondly-4, keep B' unchanged and perform a minimization solution of the loss function with respect to W; record the W that minimizes the loss function as W';
secondly-5, judge whether the number of iterations of the current iteration process has reached the set maximum; if not, let W = W' and B = B' (where "=" denotes assignment), add 1 to the iteration count, and return to step secondly-3 to start the next iteration; if the maximum number of iterations has been reached, take the W' updated in the current iteration as the final weight matrix W', and the B' updated in the current iteration as the final binary coding matrix B';
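The alternating iteration of steps secondly-1 through secondly-5 can be sketched as below. Important caveat: the patent's actual loss function formula is not reproduced in the text, so this sketch substitutes an illustrative stand-in loss, ‖VW − B‖_F² + λ‖BᵀB − nI‖_F² (a common pairwise-similarity-preserving objective with a relaxed orthogonality term using the symbols B^T and I defined above); the B-step uses gradient descent on a real-valued relaxation, and the W-step uses a per-dimension closed form. All of these specific choices are assumptions, not the patented formulation.

```python
import numpy as np

def alternate_minimize(V, max_iter=50, inner=10, lr=0.05, lam=0.1, seed=0):
    """Alternating minimization of the ILLUSTRATIVE loss
    ||V W - B||_F^2 + lam * ||B^T B - n I||_F^2, B in {-1,1}^{n x c}, W diagonal."""
    n, c = V.shape
    rng = np.random.default_rng(seed)
    B = np.where(rng.standard_normal((n, c)) >= 0, 1.0, -1.0)  # random initial codes
    w = np.ones(c)                                             # initial dimension weights
    for _ in range(max_iter):
        # B-step: keep W fixed, run gradient descent on a real-valued
        # relaxation Br of B, then quantize back to {-1, 1}.
        Br = B.copy()
        for _ in range(inner):
            grad = 2.0 * (Br - V * w)                          # d/dBr ||VW - B||^2
            grad += lam * 4.0 * Br @ (Br.T @ Br - n * np.eye(c)) / n  # relaxed orthogonality
            Br -= lr * grad
        B = np.where(Br >= 0, 1.0, -1.0)
        # W-step: keep B fixed; each diagonal weight has a closed-form
        # least-squares solution w_j = <V[:,j], B[:,j]> / <V[:,j], V[:,j]>.
        w = np.einsum('ij,ij->j', V, B) / np.einsum('ij,ij->j', V, V)
    return B, np.diag(w)

rng = np.random.default_rng(0)
V = rng.standard_normal((100, 8))
B_final, W_final = alternate_minimize(V, max_iter=5)
print(B_final.shape, W_final.shape)   # (100, 8) (8, 8)
```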
thirdly, weight and quantize each element in B' according to W' to obtain the weighted binary coding matrix Z;
fourthly, obtain from W' and B' the binary code q″ corresponding to q' (the code minimizing the loss function for q'); search Z for the row vector data with the smallest weighted Hamming distance to q″, take the original high-dimensional data corresponding to that row vector data as the final nearest-neighbor query result, and complete the hash retrieval process of q.
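The weighted Hamming search of steps thirdly and fourthly can be sketched as follows: the distance between a query code and a database code is the sum of the weights w_j over the bit positions where the two codes disagree. The toy codes, the weight values, and the derivation of the query code are all made up for illustration; how q″ is actually obtained from W' depends on the loss-function formula not reproduced in the text.

```python
import numpy as np

def weighted_hamming_search(B_final, W_final, q_code):
    # Weighted Hamming distance of q_code to every row of B_final:
    # d(i) = sum_j w_j * [b_ij != q_j]
    w = np.diag(W_final)                 # per-dimension weights from diag(W')
    mismatch = (B_final != q_code)       # n x c boolean mismatch matrix
    dists = mismatch @ w                 # weighted mismatch count per row
    return int(np.argmin(dists)), dists

# Toy usage: 4 database codes of 3 bits, weights favoring the first bit.
B_final = np.array([[ 1,  1, -1],
                    [-1,  1,  1],
                    [ 1, -1,  1],
                    [ 1,  1,  1]])
W_final = np.diag([2.0, 1.0, 0.5])
q_code = np.array([1, 1, 1])
best, dists = weighted_hamming_search(B_final, W_final, q_code)
print(best)   # 3: the last row matches q_code exactly (distance 0)
```

Note that an unweighted Hamming distance would tie rows 0, 1 and 2 (one mismatched bit each); the weights break such ties in favor of disagreement on low-weight dimensions.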
The maximum number of iterations set in step secondly-1 is 50.
Compared with the prior art, the invention has the following advantages. First, the principal component analysis algorithm is used to reduce the dimension of the original high-dimensional data to obtain the corresponding low-dimensional vector set, and to reduce the dimension of the given query data to obtain the corresponding low-dimensional vector. A loss function is then constructed according to the pairwise-similarity-preserving principle under a relaxed orthogonality constraint, and the final binary coding matrix and the final weight matrix are obtained by minimizing this loss function. Each element of the final binary coding matrix is weighted and quantized according to the final weight matrix to obtain the weighted binary coding matrix. The binary code corresponding to the given query data is then obtained from the final binary coding matrix and the final weight matrix; the row vector data with the smallest weighted Hamming distance to this binary code is searched in the weighted binary coding matrix, and the original high-dimensional data corresponding to that row vector data is taken as the final nearest-neighbor query result, completing the hash retrieval of the given query data. By performing hash retrieval with the weighted Hamming distance, the data information in the data set can be better mined and the similarity information among the data is preserved. By adopting the relaxed orthogonality constraint when constructing the loss function, the effectiveness of the coding is improved, and the hash method selects projection directions with better effect during projection, further improving the retrieval efficiency and accuracy of the hash retrieval method.
Detailed Description
The invention is described in further detail below with reference to embodiments.
A weighted quantization hash retrieval method for a high-dimensional large data set comprises the following steps:
firstly, obtain an original high-dimensional data set X consisting of n original high-dimensional data and a given query datum q, where X is an n × d matrix and q is a 1 × d vector; reduce the dimension of X with the principal component analysis algorithm to obtain the low-dimensional vector set V corresponding to X, where V is an n × c matrix, c < d, and V_ij denotes the low-dimensional vector element in V corresponding to the j-th dimension of the i-th original high-dimensional datum, with 1 ≤ i ≤ n and 1 ≤ j ≤ c; then reduce the dimension of q with the principal component analysis algorithm to obtain the 1 × c low-dimensional vector q' corresponding to q;
secondly, obtain a final binary coding matrix B' and a final weight matrix W' by iteration; the specific process is as follows:
secondly-1, set the maximum number of iterations; randomly give an initial binary coding matrix B, with B ∈ {-1, 1}^{n×c}; randomly give an initial weight matrix W = diag(w_1, w_2, …, w_j, …, w_c), where w_j denotes the dimension weight of the j-th dimension and diag(·) denotes a diagonal matrix; the set maximum number of iterations may be 50.
secondly-2, construct a loss function according to the pairwise-similarity-preserving principle among the hash-function construction principles, introduce a full orthogonality constraint, and relax that constraint, thereby constructing the loss function, where ‖·‖_F denotes the Frobenius norm of a matrix, the superscript 2 denotes squaring, B^T denotes the transpose of B, and I denotes the identity matrix;
secondly-3, start the iterative process: in the current iteration, first keep W unchanged and perform a minimization solution of the loss function with respect to B, updating B by the gradient descent method; record the B that minimizes the loss function as B', where b'_ij denotes the updated binary coding value corresponding to the j-th dimension element of the i-th original high-dimensional datum in X in the current iteration;
secondly-4, keep B' unchanged and perform a minimization solution of the loss function with respect to W; record the W that minimizes the loss function as W';
secondly-5, judge whether the number of iterations of the current iteration process has reached the set maximum; if not, let W = W' and B = B' (where "=" denotes assignment), add 1 to the iteration count, and return to step secondly-3 to start the next iteration; if the maximum number of iterations has been reached, take the W' updated in the current iteration as the final weight matrix W', and the B' updated in the current iteration as the final binary coding matrix B';
thirdly, weight and quantize each element in B' according to W' to obtain the weighted binary coding matrix Z;
fourthly, obtain from W' and B' the binary code q″ corresponding to q' (the code minimizing the loss function for q'); search Z for the row vector data with the smallest weighted Hamming distance to q″, take the original high-dimensional data corresponding to that row vector data as the final nearest-neighbor query result, and complete the hash retrieval process of q.
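As a compact illustration of one possible embodiment, the whole pipeline above can be sketched end to end. This is a deliberately simplified stand-in, not the patented method: PCA is done via SVD, the binary codes are obtained with a single sign quantization B = sign(V) instead of the gradient-descent iteration of steps secondly-1 through secondly-5, the weights are fitted by per-dimension least squares, and the query code q″ is taken as sign(q'); all of these shortcuts are assumptions.

```python
import numpy as np

def retrieve(X, q, c):
    # PCA reduction of the data set X and the query q (step firstly).
    mean = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    P = Vt[:c].T
    V = (X - mean) @ P                       # low-dimensional vector set
    q_low = (q - mean) @ P                   # low-dimensional query q'
    # Simplified binary codes in {-1, 1} (stand-in for the iterative step).
    B = np.where(V >= 0, 1.0, -1.0)
    # Per-dimension least-squares weights w_j = <V_j, B_j> / <V_j, V_j>.
    w = np.einsum('ij,ij->j', V, B) / np.einsum('ij,ij->j', V, V)
    # Query code and weighted Hamming search (steps thirdly and fourthly).
    q_code = np.where(q_low >= 0, 1.0, -1.0)
    dists = (B != q_code) @ w                # weighted Hamming distances
    return int(np.argmin(dists))             # index of the nearest neighbor

rng = np.random.default_rng(1)
X = rng.standard_normal((200, 32))           # n = 200 data of dimension d = 32
q = X[7].copy()                              # query identical to data point 7
print(retrieve(X, q, 16))                    # 7: row 7 has weighted distance 0
```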