CN109634953B - Weighted quantization Hash retrieval method for high-dimensional large data set - Google Patents

Weighted quantization Hash retrieval method for high-dimensional large data set

Info

Publication number
CN109634953B
CN109634953B (application CN201811316883.1A)
Authority
CN
China
Prior art keywords
matrix
dimensional
data
weighted
equal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811316883.1A
Other languages
Chinese (zh)
Other versions
CN109634953A (en)
Inventor
孙瑶 (Sun Yao)
钱江波 (Qian Jiangbo)
胡伟 (Hu Wei)
任艳多 (Ren Yanduo)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dragon Totem Technology Hefei Co ltd
Guangzhou Ruifeng Data Service Co.,Ltd.
Original Assignee
Ningbo University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ningbo University filed Critical Ningbo University
Priority to CN201811316883.1A priority Critical patent/CN109634953B/en
Publication of CN109634953A publication Critical patent/CN109634953A/en
Application granted granted Critical
Publication of CN109634953B publication Critical patent/CN109634953B/en
Legal status: Active

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/213 - Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G06F 18/2135 - Feature extraction based on approximation criteria, e.g. principal component analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a weighted quantization hash retrieval method for high-dimensional large data sets. A principal component analysis algorithm first reduces the dimensions of the original high-dimensional data and of a given query datum. A loss function is then constructed according to the pairwise-similarity-preserving principle under a relaxed orthogonality constraint, and minimizing this loss function yields a final binary coding matrix and a final weight matrix. From these, a weighted binary coding matrix and the binary code corresponding to the given query datum are obtained; the row vector data in the weighted binary coding matrix with the smallest weighted Hamming distance to the query's binary code are retrieved, completing the hash retrieval of the given query datum. Because the loss function is constructed with a relaxed orthogonality constraint and retrieval uses the weighted Hamming distance, the method improves both the retrieval efficiency and the accuracy of hash retrieval.

Description

Weighted quantization Hash retrieval method for high-dimensional large data set
Technical Field
The invention relates to a data retrieval method, in particular to a weighted quantization Hash retrieval method for a high-dimensional large data set.
Background
Nearest neighbor search has long been a fundamental research problem in computer science. Hash-based retrieval is an effective technique for large-scale high-dimensional data retrieval: hash-based similarity queries offer good query performance and storage efficiency. However, most existing hash methods treat every dimension of a hash code as equally important; that is, they measure the similarity between two data items directly by the Hamming distance. In practice, different choices of mapping direction yield different classification quality, and each dimension of a hash code carries a different amount of information, so different coding dimensions influence inter-data similarity differently.
Using the plain Hamming distance as the metric therefore judges the similarity of data only coarsely; it cannot fully capture the distance between data items and needs improvement.
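The distinction above can be made concrete with a small numeric example. The codes and per-dimension weights below are purely illustrative (the patent learns the weights; these are hand-picked): two codes that disagree in two positions always have plain Hamming distance 2, but their weighted Hamming distance depends on which dimensions disagree.

```python
import numpy as np

# Two 6-bit codes in {-1, +1} encoding, as used later in the patent.
a = np.array([1, -1, 1, 1, -1, 1])
b = np.array([1, 1, 1, -1, -1, 1])

# Plain Hamming distance: number of positions where the codes differ.
plain = int(np.sum(a != b))

# Hypothetical per-dimension weights: dimensions carry unequal information.
w = np.array([0.9, 0.1, 0.8, 0.7, 0.5, 0.3])

# Weighted Hamming distance: each disagreeing position contributes its
# weight, so a mismatch on an informative dimension costs more.
weighted = float(np.sum(w * (a != b)))

print(plain)     # 2 (positions 1 and 3 disagree)
print(weighted)  # 0.1 + 0.7 = 0.8
```

A mismatch on a high-weight dimension (e.g. weight 0.9) would have pushed the weighted distance much higher, which is exactly the effect the plain Hamming distance cannot express.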
Disclosure of Invention
The technical problem to be solved by the invention is to provide a weighted quantization hash retrieval method for high-dimensional large data sets that can effectively improve the retrieval efficiency and accuracy of hash retrieval.
The technical scheme adopted by the invention to solve the above technical problem is as follows: a weighted quantization hash retrieval method for a high-dimensional large data set, comprising the following steps:
Step 1: obtain an original high-dimensional data set X consisting of n original high-dimensional data and a given query datum q, wherein X is an n × d matrix and q is a 1 × d vector; reduce the dimension of X by a principal component analysis algorithm to obtain a low-dimensional vector set V corresponding to X,

[matrix equation shown as an image in the original]

wherein V is an n × c matrix with c < d, v_ij denotes the low-dimensional vector element in V corresponding to the j-th dimension of the i-th original high-dimensional datum, 1 ≤ i ≤ n and 1 ≤ j ≤ c; then reduce the dimension of q by the principal component analysis algorithm to obtain a 1 × c low-dimensional vector q' corresponding to q;
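This dimensionality-reduction step can be sketched with a plain SVD-based PCA. The helper name `pca_reduce` and the toy sizes are illustrative, not from the patent; note that the query must be projected with the same basis learned from X, not with a PCA fitted on q alone.

```python
import numpy as np

def pca_reduce(X, c):
    """Reduce an n x d data matrix X to n x c via PCA (minimal sketch):
    center the data, then project onto the top-c right singular vectors."""
    mean = X.mean(axis=0)
    Xc = X - mean
    # SVD of the centered data; rows of Vt are the principal directions.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = Vt[:c].T                      # d x c projection matrix
    return Xc @ P, mean, P

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 16))    # toy "high-dimensional" set, d = 16
q = rng.standard_normal((1, 16))      # toy query datum

V, mean, P = pca_reduce(X, c=4)       # n x c low-dimensional set
q_low = (q - mean) @ P                # project q with the SAME basis
print(V.shape, q_low.shape)           # (100, 4) (1, 4)
```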
Step 2: obtain the final binary coding matrix B' and the final weight matrix W' by iteration, the specific process being as follows:
Step 2-1: set the maximum number of iterations; randomly give an initial binary coding matrix B ∈ {-1, 1}^{n×c}; randomly give an initial weight matrix W = diag(w_1, w_2, …, w_j, …, w_c), wherein w_j denotes the dimension weight of the j-th dimension and diag(·) denotes a diagonal matrix;
Step 2-2: construct a loss function according to the pairwise-similarity-preserving principle among the hash function construction principles, introduce a full orthogonality constraint, and relax that constraint, thereby constructing the loss function

[loss function equation shown as an image in the original]

wherein ‖·‖_F denotes the Frobenius norm of a matrix, the superscript 2 denotes squaring, B^T denotes the transpose of B, and I denotes the identity matrix;
Step 2-3: start the iterative process. In the current iteration, first keep W unchanged and minimize the loss function with respect to B, updating B by a gradient descent method; the B obtained by this update at the minimum is denoted B',

[equation shown as an image in the original]

wherein b_ij denotes the updated binary coding value corresponding to the j-th dimension element of the i-th original high-dimensional datum in X in the current iteration;
then keep B' unchanged and minimize the loss function with respect to W; the W obtained by this update at the minimum is denoted W';
Step 2-4: judge whether the iteration count of the current iteration has reached the set maximum number of iterations. If not, let W = W' and B = B' (here "=" denotes assignment), add 1 to the iteration count, and return to step 2-3 to start the next iteration; if the maximum number of iterations has been reached, take the W' updated in the current iteration as the final weight matrix W', and take the B' updated in the current iteration as the final binary coding matrix B';
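The alternating scheme of steps 2-1 through 2-4 can be sketched in code. The patent's exact loss appears only as an image, so the sketch below assumes a common pairwise-similarity-preserving form, L(B, W) = ‖VVᵀ − BWBᵀ‖²_F + λ‖BᵀB − nI‖²_F, with the orthogonality constraint relaxed into the λ-penalty; this choice of loss, the function names, and the learning rate are assumptions, not the patent's exact formulation.

```python
import numpy as np

def loss(V, B, W, lam):
    # ASSUMED pairwise-similarity-preserving loss with a relaxed
    # orthogonality penalty (the patent's exact formula is an image).
    n, c = B.shape
    fit = np.linalg.norm(V @ V.T - B @ W @ B.T, 'fro') ** 2
    ortho = np.linalg.norm(B.T @ B - n * np.eye(c), 'fro') ** 2
    return fit + lam * ortho

def alternate(V, c, iters=50, lam=0.1, lr=1e-3, seed=0):
    """Alternating minimization: fix W and gradient-step B, then fix B
    and solve for the diagonal weights (sketch under the assumed loss)."""
    rng = np.random.default_rng(seed)
    n = V.shape[0]
    B = np.sign(rng.standard_normal((n, c)))   # random init, entries in {-1, 1}
    B[B == 0] = 1.0
    w = np.ones(c)                             # initial dimension weights
    S = V @ V.T                                # pairwise similarity target
    for _ in range(iters):
        W = np.diag(w)
        # W fixed: one gradient-descent step on a real relaxation of B,
        # then re-binarize with sign() to stay in {-1, 1}^{n x c}.
        R = B @ W @ B.T - S
        grad_B = 4.0 * R @ B @ W + 4.0 * lam * B @ (B.T @ B - n * np.eye(c))
        B = np.sign(B - lr * grad_B)
        B[B == 0] = 1.0
        # B fixed: coordinate-wise least-squares update of each weight w_j.
        for j in range(c):
            bj = B[:, [j]]
            resid = S - B @ np.diag(w) @ B.T + w[j] * (bj @ bj.T)
            w[j] = (bj.T @ resid @ bj).item() / (bj.T @ bj).item() ** 2
    return B, np.diag(w)

V = np.random.default_rng(1).standard_normal((30, 8))  # toy low-dim set
B_fin, W_fin = alternate(V, c=4, iters=10)
print(B_fin.shape, W_fin.shape)  # (30, 4) (4, 4)
```

The re-binarization after each gradient step is one plausible way to keep B in {-1, 1}^{n×c} while still "updating B by gradient descent," as the patent states; other projections are possible.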
Step 3: weight and quantize each element of B' according to W' to obtain the weighted binary coding matrix Z;
Step 4: according to W' and B', obtain the binary code that minimizes

[equation shown as an image in the original]

and use it as the binary code q″ corresponding to q'; search Z for the row vector data closest to q″ in weighted Hamming distance, take the original high-dimensional datum corresponding to that row vector data as the final nearest-neighbor query result, and complete the hash retrieval of q.
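Steps 3 and 4 can be sketched as follows, assuming (as a reading of the patent, not its exact formulas) that Z scales each column of B' by its dimension weight and that the search ranks rows by weighted Hamming distance to the query code. The function name and the tiny 3 × 3 example are illustrative.

```python
import numpy as np

def weighted_hamming_search(B_final, w, q_code):
    """Return the index of the database item nearest to q_code under the
    weighted Hamming distance, plus the weighted coding matrix Z.
    B_final: n x c matrix in {-1, 1}; w: c dimension weights;
    q_code: length-c binary code of the query (sketch; names assumed)."""
    # Weighted binary coding matrix Z: each column scaled by its weight.
    Z = B_final * w                    # broadcasting, n x c
    # Weighted Hamming distance: sum of weights on disagreeing dimensions.
    disagree = B_final != q_code       # n x c boolean
    dists = disagree @ w               # one distance per database row
    return int(np.argmin(dists)), Z

B_final = np.array([[ 1,  1, -1],
                    [-1,  1,  1],
                    [ 1, -1,  1]])
w = np.array([0.6, 0.3, 0.1])
q_code = np.array([1, -1, 1])
idx, Z = weighted_hamming_search(B_final, w, q_code)
print(idx)   # row 2 matches q_code exactly, so its distance is 0
```

Rows 0 and 1 disagree with the query on weights summing to 0.4 and 0.9 respectively, so the exact match (row 2, distance 0) wins; with unweighted Hamming distance the same ranking happens to hold here, but the weighted ranking can differ whenever mismatches land on unequal-weight dimensions.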
The maximum number of iterations set in step 2-1 is 50.
Compared with the prior art, the method has the following advantages. First, a principal component analysis algorithm reduces the original high-dimensional data to a corresponding low-dimensional vector set and reduces the given query datum to a corresponding low-dimensional vector. A loss function is then constructed according to the pairwise-similarity-preserving principle under a relaxed orthogonality constraint, and minimizing it yields the final binary coding matrix and the final weight matrix. Each element of the final binary coding matrix is weighted and quantized by the final weight matrix to obtain the weighted binary coding matrix, and the binary code corresponding to the given query datum is obtained from the final binary coding matrix and the final weight matrix. The row vector data in the weighted binary coding matrix with the smallest weighted Hamming distance to the query's binary code are retrieved, and the original high-dimensional datum corresponding to those row vector data is taken as the final nearest-neighbor query result, completing the hash retrieval of the given query datum. Performing hash retrieval with the weighted Hamming distance better mines the information in the data set and preserves the similarity information between data. Adopting the relaxed orthogonality constraint when constructing the loss function improves the effectiveness of the coding and lets the hash method select better projection directions during projection, further improving the retrieval efficiency and accuracy of hash retrieval.
Drawings
FIG. 1 is a flow chart of the steps of the present invention.
Detailed Description
The invention is described in further detail below with reference to the accompanying drawings and an embodiment.
A weighted quantization Hash retrieval method for a high-dimensional large data set comprises the following steps:
Step 1: obtain an original high-dimensional data set X consisting of n original high-dimensional data and a given query datum q, wherein X is an n × d matrix and q is a 1 × d vector; reduce the dimension of X by a principal component analysis algorithm to obtain a low-dimensional vector set V corresponding to X,

[matrix equation shown as an image in the original]

wherein V is an n × c matrix with c < d, v_ij denotes the low-dimensional vector element in V corresponding to the j-th dimension of the i-th original high-dimensional datum, 1 ≤ i ≤ n and 1 ≤ j ≤ c; then reduce the dimension of q by the principal component analysis algorithm to obtain a 1 × c low-dimensional vector q' corresponding to q;
Step 2: obtain the final binary coding matrix B' and the final weight matrix W' by iteration, the specific process being as follows:
Step 2-1: set the maximum number of iterations; randomly give an initial binary coding matrix B ∈ {-1, 1}^{n×c}; randomly give an initial weight matrix W = diag(w_1, w_2, …, w_j, …, w_c), wherein w_j denotes the dimension weight of the j-th dimension and diag(·) denotes a diagonal matrix. The set maximum number of iterations may be 50.
Step 2-2: construct a loss function according to the pairwise-similarity-preserving principle among the hash function construction principles, introduce a full orthogonality constraint, and relax that constraint, thereby constructing the loss function

[loss function equation shown as an image in the original]

wherein ‖·‖_F denotes the Frobenius norm of a matrix, the superscript 2 denotes squaring, B^T denotes the transpose of B, and I denotes the identity matrix;
Step 2-3: start the iterative process. In the current iteration, first keep W unchanged and minimize the loss function with respect to B, updating B by a gradient descent method; the B obtained by this update at the minimum is denoted B',

[equation shown as an image in the original]

wherein b_ij denotes the updated binary coding value corresponding to the j-th dimension element of the i-th original high-dimensional datum in X in the current iteration;
then keep B' unchanged and minimize the loss function with respect to W; the W obtained by this update at the minimum is denoted W';
Step 2-4: judge whether the iteration count of the current iteration has reached the set maximum number of iterations. If not, let W = W' and B = B' (here "=" denotes assignment), add 1 to the iteration count, and return to step 2-3 to start the next iteration; if the maximum number of iterations has been reached, take the W' updated in the current iteration as the final weight matrix W', and take the B' updated in the current iteration as the final binary coding matrix B';
Step 3: weight and quantize each element of B' according to W' to obtain the weighted binary coding matrix Z;
Step 4: according to W' and B', obtain the binary code that minimizes

[equation shown as an image in the original]

and use it as the binary code q″ corresponding to q'; search Z for the row vector data closest to q″ in weighted Hamming distance, take the original high-dimensional datum corresponding to that row vector data as the final nearest-neighbor query result, and complete the hash retrieval of q.

Claims (2)

1. A weighted quantization Hash retrieval method for a high-dimensional large data set is characterized by comprising the following steps:
step one: obtain an original high-dimensional data set X consisting of n original high-dimensional data and a given query datum q, wherein X is an n × d matrix and q is a 1 × d vector; reduce the dimension of X by a principal component analysis algorithm to obtain a low-dimensional vector set V corresponding to X,

[matrix equation shown as an image in the original]

wherein V is an n × c matrix with c < d, v_ij denotes the low-dimensional vector element in V corresponding to the j-th dimension of the i-th original high-dimensional datum, 1 ≤ i ≤ n and 1 ≤ j ≤ c; then reduce the dimension of q by the principal component analysis algorithm to obtain a 1 × c low-dimensional vector q' corresponding to q;
step two: obtaining a final binary coding matrix B 'and a final weight matrix W' through iteration, and the specific process is as follows:
step two-1: setting the maximum iteration times, randomly giving an initial binary coding matrix B, wherein B belongs to { -1,1}n×cRandomly giving an initial weight matrix W, W ═ diag (W)1,w2,...,wj,...,wc) Wherein w isjRepresents the dimension weight of the j-th dimension, diag () represents the diagonal matrix;
step 2: constructing a loss function according to a pairwise-preserving similarity principle in a Hash function construction principle, introducing a complete orthogonal constraint condition, and relaxing the complete orthogonal constraint condition to construct the loss function
Figure FDA0003112205430000012
Wherein | | | purple hairFTo take the F-norm sign of the matrix,
Figure FDA0003112205430000013
2 in is a square symbol, BTA transposed matrix representing B, I representing an identity matrix;
step two-3: starting an iterative process, and in the current iterative process, firstly keeping W unchanged for
Figure FDA0003112205430000014
Performing minimum solution, updating B by gradient descent method
Figure FDA0003112205430000015
B obtained by updating at the minimum is marked as B',
Figure FDA0003112205430000016
bijrepresenting the updated binary coding value corresponding to the jth dimension element of the ith original high-dimensional data in the X in the current iteration process;
then keep B' unchanged and minimize the loss function with respect to W; the W obtained by this update at the minimum is denoted W';
step two-4: judging whether the iteration frequency of the current iteration process reaches the set maximum iteration frequency, if not, making W equal to W ', B equal to B', returning to the step (II) -3 to start the next iteration process, and adding 1 to the iteration frequency, wherein W equal to W 'and B equal to B' are assignment symbols; if the maximum iteration number is reached, taking W 'obtained by updating in the current iteration process as a final weight matrix W', and taking B 'obtained by updating in the current iteration process as a final binary coding matrix B';
step three: weighting and quantizing each element in B 'according to W' to obtain a weighted binary coding matrix Z;
step IV: according to W 'and B', obtaining
Figure FDA0003112205430000021
And the smallest q ' is used as a binary code q ' corresponding to q ', the row vector data closest to the weighted hamming distance of q ' is searched in Z, and the original high-dimensional data corresponding to the row vector data closest to the weighted hamming distance of q ' is used as a final nearest neighbor query result to finish the hash retrieval process of q.
2. The weighted quantization hash retrieval method for a high-dimensional large data set according to claim 1, wherein the maximum number of iterations set in step two-1 is 50.
CN201811316883.1A 2018-11-07 2018-11-07 Weighted quantization Hash retrieval method for high-dimensional large data set Active CN109634953B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811316883.1A CN109634953B (en) 2018-11-07 2018-11-07 Weighted quantization Hash retrieval method for high-dimensional large data set

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811316883.1A CN109634953B (en) 2018-11-07 2018-11-07 Weighted quantization Hash retrieval method for high-dimensional large data set

Publications (2)

Publication Number Publication Date
CN109634953A CN109634953A (en) 2019-04-16
CN109634953B true CN109634953B (en) 2021-08-17

Family

ID=66067314

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811316883.1A Active CN109634953B (en) 2018-11-07 2018-11-07 Weighted quantization Hash retrieval method for high-dimensional large data set

Country Status (1)

Country Link
CN (1) CN109634953B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111143625B (en) * 2019-09-03 2023-04-25 西北工业大学 Cross-modal retrieval method based on semi-supervised multi-modal hash coding
CN110750731B (en) * 2019-09-27 2023-10-27 成都数联铭品科技有限公司 Method and system for removing duplicate of news public opinion

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103226585A (en) * 2013-04-10 2013-07-31 大连理工大学 Self-adaptation Hash rearrangement method for image retrieval
CN104820696A (en) * 2015-04-29 2015-08-05 山东大学 Large-scale image retrieval method based on multi-label least square Hash algorithm
CN106776856A (en) * 2016-11-29 2017-05-31 江南大学 A kind of vehicle image search method of Fusion of Color feature and words tree
CN106777388A (en) * 2017-02-20 2017-05-31 华南理工大学 A kind of multilist hashing image search method of dual compensation
CN107423309A (en) * 2016-06-01 2017-12-01 国家计算机网络与信息安全管理中心 Magnanimity internet similar pictures detecting system and method based on fuzzy hash algorithm

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080133496A1 (en) * 2006-12-01 2008-06-05 International Business Machines Corporation Method, computer program product, and device for conducting a multi-criteria similarity search

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103226585A (en) * 2013-04-10 2013-07-31 大连理工大学 Self-adaptation Hash rearrangement method for image retrieval
CN104820696A (en) * 2015-04-29 2015-08-05 山东大学 Large-scale image retrieval method based on multi-label least square Hash algorithm
CN107423309A (en) * 2016-06-01 2017-12-01 国家计算机网络与信息安全管理中心 Magnanimity internet similar pictures detecting system and method based on fuzzy hash algorithm
CN106776856A (en) * 2016-11-29 2017-05-31 江南大学 A kind of vehicle image search method of Fusion of Color feature and words tree
CN106777388A (en) * 2017-02-20 2017-05-31 华南理工大学 A kind of multilist hashing image search method of dual compensation

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
"面向大规模数据检索的哈希学习研究进展" ["Research Progress of Hash Learning for Large-Scale Data Retrieval"]; Ren Yanduo et al.; Wireless Communication Technology (《无线通信技术》); 2017-12-15; Vol. 26, No. 4; pp. 21-25 *

Also Published As

Publication number Publication date
CN109634953A (en) 2019-04-16

Similar Documents

Publication Publication Date Title
Li et al. Generalized uncorrelated regression with adaptive graph for unsupervised feature selection
CN110309343B (en) Voiceprint retrieval method based on deep hash
CN112732864B (en) Document retrieval method based on dense pseudo query vector representation
CN109634953B (en) Weighted quantization Hash retrieval method for high-dimensional large data set
Chehreghani et al. Information theoretic model validation for spectral clustering
Murray et al. Interferences in match kernels
Shi et al. Query-efficient black-box adversarial attack with customized iteration and sampling
CN112256727B (en) Database query processing and optimizing method based on artificial intelligence technology
Ozan et al. K-subspaces quantization for approximate nearest neighbor search
Habib et al. Retracted: Forecasting model for wind power integrating least squares support vector machine, singular spectrum analysis, deep belief network, and locality‐sensitive hashing
Li et al. Deep multi-similarity hashing for multi-label image retrieval
CN111612319A (en) Load curve depth embedding clustering method based on one-dimensional convolution self-encoder
Zhang et al. CapsNet-based supervised hashing
Liang et al. Distrihd: A memory efficient distributed binary hyperdimensional computing architecture for image classification
CN107133348B (en) Approximate searching method based on semantic consistency in large-scale picture set
Qiu et al. Efficient document retrieval by end-to-end refining and quantizing BERT embedding with contrastive product quantization
Li et al. Embedding Compression in Recommender Systems: A Survey
CN109710607B (en) Hash query method for high-dimensional big data based on weight solving
Wang et al. Grassmann hashing for approximate nearest neighbor search in high dimensional space
Ferdowsi et al. Sparse ternary codes for similarity search have higher coding gain than dense binary codes
CN117079744A (en) Artificial intelligent design method for energetic molecule
Zhang et al. Efficient indexing of binary LSH for high dimensional nearest neighbor
CN115344693A (en) Clustering method based on fusion of traditional algorithm and neural network algorithm
Zare et al. A Novel multiple kernel-based dictionary learning for distributive and collective sparse representation based classifiers
Ye et al. Fast search in large-scale image database using vector quantization

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20231214

Address after: Room 309-1, No. 62 Huamei Road, Tianhe District, Guangzhou City, Guangdong Province, 510000 (office only)

Patentee after: Guangzhou Ruifeng Data Service Co.,Ltd.

Address before: 230000 floor 1, building 2, phase I, e-commerce Park, Jinggang Road, Shushan Economic Development Zone, Hefei City, Anhui Province

Patentee before: Dragon totem Technology (Hefei) Co.,Ltd.

Effective date of registration: 20231214

Address after: 230000 floor 1, building 2, phase I, e-commerce Park, Jinggang Road, Shushan Economic Development Zone, Hefei City, Anhui Province

Patentee after: Dragon totem Technology (Hefei) Co.,Ltd.

Address before: 315211, Fenghua Road, Jiangbei District, Zhejiang, Ningbo 818

Patentee before: Ningbo University
