CN103605653B - Big data retrieval method based on sparse hash - Google Patents

Big data retrieval method based on sparse hash

Info

Publication number
CN103605653B
CN103605653B (application CN201310457033.4A)
Authority
CN
China
Prior art keywords
big data
hash function
dimensional
hash
training set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201310457033.4A
Other languages
Chinese (zh)
Other versions
CN103605653A (en)
Inventor
朱晓峰
张师超
刘星毅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangxi Normal University
Original Assignee
Guangxi Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangxi Normal University filed Critical Guangxi Normal University
Priority to CN201310457033.4A priority Critical patent/CN103605653B/en
Publication of CN103605653A publication Critical patent/CN103605653A/en
Application granted granted Critical
Publication of CN103605653B publication Critical patent/CN103605653B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22Indexing; Data structures therefor; Storage structures
    • G06F16/2228Indexing structures
    • G06F16/2255Hash tables
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2455Query execution
    • G06F16/24553Query execution of query operations
    • G06F16/24558Binary matching operations

Abstract

The present invention is an approximate retrieval method for big data, specifically a big data retrieval method based on sparse hashing, developed mainly for the storage and retrieval of big data. First, the size of the training set is determined by sampling, according to statistical theory and the available computer memory. The training set is then used to learn the hash functions and the binary codes of the training examples. Next, the learned hash functions are used to binary-encode the whole big data set. At this point online retrieval becomes possible: for a test example, its binary code is first obtained from the learned hash functions, and the binary codes of the big data are then searched in real time. The method has linear time complexity in the size of the big data, solves the problem that manifold learning lacks an explicit function, and reduces the storage of the big data by a factor of tens of thousands. It is easy to implement, involving only a few simple mathematical models when writing the code.

Description

Big data retrieval method based on sparse hash
Technical field
The present invention relates to the fields of computer science and technology and of information technology, and in particular to big data: specifically, a retrieval method that uses sparse hashing to retrieve big data such as pictures, text and music.
Background art
Big data refers to data sets whose contents cannot, under present conditions, be retrieved and managed with conventional tools. Large volume, varied data types, low value density and high processing speed are the four most prominent characteristics of big data. Current research on knowledge discovery in big data concentrates on four aspects: partitioning, clustering, retrieval, and incremental (batch, online or parallel) learning.
Research on big data retrieval is still comparatively scarce. When retrieving, users generally hope to obtain what they need quickly from all the data, which raises the problem of trading off speed against accuracy. Ten or even twenty years ago, researchers pursued accuracy: various tree structures such as the KD-tree and M-tree were designed for exact database retrieval and found wide application. Over the last decade, with the spread of networks and the emergence of big data, exact retrieval can no longer meet users' needs. A large body of literature shows that when the data have fewer than about 10 dimensions, exact retrieval satisfies users' actual needs well; but once the dimensionality exceeds this threshold, the complexity of exact retrieval grows very high, in the worst case reaching that of traversing the entire database, which is infeasible for big data.
In recent years, approximate retrieval has developed considerably, particularly for network retrieval, where users pursue fast, approximate multimedia search. Among the many approximate retrieval methods, hashing is the most prominent. The principle of hashing is to reduce high-dimensional real-valued data to low-dimensional binary data while preserving the similarity between data points as far as possible, so that the large data set can be kept in computer memory or on external disk and searched quickly.
Summary of the invention
The present invention studies the approximate retrieval of big data.
The object of the invention is to provide a simple and effective approximate retrieval algorithm for big data. The method addresses the high complexity and low accuracy of big data retrieval. By preserving the manifold structure of the data, that is, by making the binary codes keep as much of the local structure of the original high-dimensional data as possible, it improves hashing performance; an efficient optimization method reduces the algorithmic complexity to linear. The invention comprises two key processes: hash-function learning and real-time retrieval of big data. Hash-function learning consists of two steps, converting high-dimensional real values to low-dimensional real values and then converting the low-dimensional real values to d-dimensional binary codes. Real-time retrieval first converts an example to binary with the learned hash functions and then searches in computer memory.
The specific steps of the method are as follows:
(1) Sample data from the big data as a training set for learning the hash functions. The volume of big data is enormous, and by statistical theory it is not necessary to take all the data as the training set, so the invention first samples part of the data. The size n of the extracted training set is determined from t_{α/2}, the critical value of the t-distribution at the chosen confidence level, and ε, the maximum allowable error; the various parameter settings are listed in the table.
This yields the training set X.
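Since the patent gives the sample-size formula only as an image, the sketch below assumes the standard worst-case bound n = (t_{α/2}/(2ε))² and approximates the t critical value by the normal quantile (the two coincide for large n); the function names are illustrative:

```python
import math
import random
from statistics import NormalDist

def training_set_size(alpha: float, eps: float) -> int:
    """Worst-case sample size n = (t_{alpha/2} / (2 * eps))**2, with the
    normal quantile standing in for the t critical value t_{alpha/2}."""
    t = NormalDist().inv_cdf(1 - alpha / 2)
    return math.ceil((t / (2 * eps)) ** 2)

def sample_training_set(big_data, alpha=0.05, eps=0.01, seed=0):
    """Draw the training set X by simple random sampling."""
    n = training_set_size(alpha, eps)
    rng = random.Random(seed)
    return rng.sample(big_data, min(n, len(big_data)))
```

For a 95% confidence level and 1% allowable error this gives the familiar n = 9604, far smaller than a typical big data set.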
(2) Train the hash functions with X. First an objective function is designed to convert the high-dimensional real data to low-dimensional data. The objective function is defined as:
min_{B,S} ||X − BS||² + λ1 Σ_{i,j} w_{i,j} ||s_i − s_j||² + λ2 ||S||_1, subject to S ≥ 0,
where X is the training set; B is the basis space, each vector of B being a basis vector trained from X; S is the low-dimensional real value of X projected onto the basis space B; λ1 and λ2 are tunable parameters obtained by ten-fold cross-validation; w_{i,j} is the Gaussian-kernel projection of the Euclidean distance between two examples x_i and x_j in X; s_i and s_j are two vectors of the matrix S; B_{i,j} is the element in row i and column j of the matrix B; i = 1, ..., n indexes the examples and j = 1, ..., k indexes the basis vectors, n being the number of examples and k the number of basis vectors; S ≥ 0 means that every element of S is non-negative.
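As an illustration, the value of this objective can be computed directly. The NumPy sketch below assumes w_{i,j} = exp(−||x_i − x_j||²/(2σ²)), one common reading of "Euclidean distance projected in a Gaussian kernel"; the kernel width σ and the function name are assumptions, not the patent's prescription:

```python
import numpy as np

def objective(X, B, S, lam1, lam2, sigma=1.0):
    """||X - BS||^2 + lam1 * sum_ij w_ij ||s_i - s_j||^2 + lam2 * ||S||_1,
    with S constrained to be non-negative.
    X: (D, n) training set; B: (D, k) basis; S: (k, n) codes.
    w_ij = exp(-||x_i - x_j||^2 / (2 sigma^2))  (assumed kernel form)."""
    assert np.all(S >= 0), "every element of S must be non-negative"
    recon = np.linalg.norm(X - B @ S) ** 2                     # term 1: reconstruction
    d2 = np.sum((X[:, :, None] - X[:, None, :]) ** 2, axis=0)  # pairwise ||x_i - x_j||^2
    W = np.exp(-d2 / (2 * sigma ** 2))                         # Gaussian-kernel weights
    s2 = np.sum((S[:, :, None] - S[:, None, :]) ** 2, axis=0)  # pairwise ||s_i - s_j||^2
    manifold = np.sum(W * s2)                                  # term 2: local structure
    sparsity = np.abs(S).sum()                                 # term 3: L1 sparsity
    return recon + lam1 * manifold + lam2 * sparsity
```

Minimizing this value over B and S (the patent's optimization procedure) trades reconstruction error against preservation of local structure and sparsity of S.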
The goal of the first term, ||X − BS||², is to reconstruct the training set X in the basis space B so as to obtain S with the smallest possible reconstruction error. The second term, Σ_{i,j} w_{i,j} ||s_i − s_j||², preserves the local manifold structure of the original training set X; this ensures that the binary data keep the similarity of the original high-dimensional data and thus guarantees the quality of the hashing. The third term makes the resulting S sparse, and the fourth condition, S ≥ 0, guarantees that S is non-negative. Under this objective, the resulting S is a low-dimensional representation of X.
The second step of training the hash functions converts S into binary codes: non-zero elements of S are converted into 1, and zero elements into 0.
The third step of training obtains the hash functions. Suppose S has dimension d and X has dimension D, with D ≫ d; the binary codes then have length d. Each of the d dimensions is treated as a vector that is binary (i.e. a two-class problem in classification), and the invention establishes one hash function for each dimension, d hash functions in all. The process of establishing a hash function is very simple: in the training set X, the examples whose hash value in dimension m is 1 form the class A_{m1}, m = 1, ..., d, and the remaining examples, whose hash value is 0, form the class A_{m0}, m = 1, ..., d, yielding 2d classes. The hash function is defined as
sign_m(x_i) = arg min_{j ∈ {0,1}} ||x_i − A_{mj} s_i||², m = 1, ..., d,
where x_i is the i-th vector of the matrix X, s_i is the i-th vector of the matrix S, and i = 1, ..., n.
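The binarization rule and the per-dimension hash functions of step (2) can be sketched as follows. The patent defines the classes A_{m1} and A_{m0} but not how the arg-min is evaluated for a new example, so this sketch summarizes each class by its centroid; that summary, and the function names, are interpretive assumptions:

```python
import numpy as np

def binarize(S):
    """The patent's rule: non-zero entries of S become 1, zeros become 0."""
    return (S != 0).astype(np.uint8)

def build_hash_classes(X, codes):
    """For each bit m, split the training examples into A_m0 (bit 0) and
    A_m1 (bit 1), giving 2d classes in total; each class is summarized
    here by its centroid (an assumption, not the patent's prescription).
    X: (D, n) training data; codes: (d, n) binary codes."""
    classes = []
    for m in range(codes.shape[0]):
        A0 = X[:, codes[m] == 0]
        A1 = X[:, codes[m] == 1]
        c0 = A0.mean(axis=1) if A0.size else np.zeros(X.shape[0])
        c1 = A1.mean(axis=1) if A1.size else np.zeros(X.shape[0])
        classes.append((c0, c1))
    return classes

def hash_example(x, classes):
    """Bit m = arg min over j in {0, 1} of ||x - centroid(A_mj)||^2."""
    return np.array([int(np.linalg.norm(x - c1) < np.linalg.norm(x - c0))
                     for c0, c1 in classes], dtype=np.uint8)
```

For example, with one-dimensional training data [0, 1, 10, 11] and codes [0, 0, 1, 1], the class centroids are 0.5 and 10.5, so a new example 9.0 hashes to bit 1.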
(3) The process of binary-encoding the examples of the large data set that have not yet received binary codes is: for each example x, obtain the low-dimensional real value of x by s = (B'B + 2I)⁻¹B'x, and then obtain its low-dimensional binary code through the hash functions; here B is the basis space defined in the previous step and I is the identity matrix of the same order as B'B. In this way the whole big data set is encoded, so that it can be stored in computer memory or on external disk.
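The projection s = (B'B + 2I)⁻¹B'x is a ridge-regularized least-squares fit of x onto the basis B, and can be sketched as (function name illustrative):

```python
import numpy as np

def encode(B, x):
    """Low-dimensional real value s = (B'B + 2I)^(-1) B'x: a
    ridge-regularized projection of the example x onto the basis B.
    B: (D, k) basis; x: (D,) example; returns s: (k,)."""
    k = B.shape[1]
    # Solve (B'B + 2I) s = B'x rather than forming the inverse explicitly.
    return np.linalg.solve(B.T @ B + 2 * np.eye(k), B.T @ x)
```

With B the identity, for instance, the projection simply shrinks x by a factor of 3, which shows the effect of the fixed 2I regularizer.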
(4) For a new test example x_t, obtain its low-dimensional real value by s_t = (B'B + 2I)⁻¹B'x_t, and then obtain its low-dimensional binary code through the hash functions; here B is the basis space defined in the previous step and I is the identity matrix of the same order as B'B. Finally, perform a similarity search between the binary code of the test example and the binary codes of the big data to obtain its similar examples.
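The final similarity search over binary codes is a Hamming-distance ranking. A minimal sketch follows; the patent fixes no ranking or tie-breaking rule, so a plain stable argsort is used here:

```python
import numpy as np

def hamming_search(query_code, database_codes, top=5):
    """Rank database items by Hamming distance to the query's binary code.
    query_code: (d,) 0/1 array; database_codes: (N, d) 0/1 array.
    Returns the indices of the `top` nearest items."""
    # Hamming distance = number of differing bits per database row.
    dists = np.count_nonzero(database_codes != query_code, axis=1)
    return np.argsort(dists, kind="stable")[:top]
```

Because each comparison is a fixed-length bit count, one pass over N codes costs O(N d), which is what makes the in-memory real-time retrieval of step (4) feasible.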
Step (2) of the invention is the crucial one and guarantees both the efficiency and the effectiveness of the algorithm. Its complexity is roughly cubic in the dimension D; since in big data applications the dimension D is far smaller than the number of examples, the complexity of the algorithm is linear in the number of examples. Because step (2) preserves the manifold structure of the data, the effectiveness of the algorithm is assured; and because the generated low-dimensional real values are non-negative, the results are easy to interpret.
The sparse-hashing big data retrieval model of the invention is characterized by: using a sparse algorithm and sampling to reduce the algorithmic complexity; using manifold-learning theory to generate the hash functions and improve hashing performance; generating explicit hash functions, avoiding the implicit hash functions of manifold learning; a binarization principle that makes the hash results interpretable; and a greatly reduced storage requirement for the big data.
Sampling the big data: carrying out data-mining studies over the whole of a big data set is generally extremely difficult, and even when feasible the complexity is very high. The sampling approach makes operating on the big data feasible and reduces the complexity to linear, which is the result big data mining hopes for.
Manifold-embedded hash learning model: manifold theory has proved to be a very effective method of preserving local structure, which is particularly important for building hash models. The invention adds a manifold regularization factor during hash learning. Its primary purpose is to keep the manifold structure of the data set and so guarantee good hashing performance; in addition, a novel optimization method yields an explicit expression for the hash functions, overcoming the usual difficulty that manifold learning has no explicit expression.
Interpretability of the binarization: when the low-dimensional real values are turned into d-dimensional binary codes, the use of a non-negative representation and a novel binary conversion makes the resulting binary representation interpretable and keeps similarity preserved. This distinguishes it from the binarization used in existing hash methods.
Low complexity: thanks to the efficient optimization method and the sampling approach, the complexity of learning the hash functions is independent of the number of big data examples, and the worst-case complexity is linear.
Low storage: because the binary codes innovatively replace the storage of the actual data, the storage of the big data is reduced by a factor of tens of thousands.
Brief description of the drawings
Fig. 1 shows the dimensionality-reduction result for a test example;
Fig. 2 shows the binary code of the picture in Fig. 1.
Detailed description of the invention
70,000 animal pictures were taken at random from the network. Suppose each picture needs 1 MB of storage (note that at this size the pictures are not of particularly high fidelity); the whole data set then needs about 70 GB. The invention replaces each picture with a 4-bit binary code, so the whole set needs only about 35 KB of storage in total, a saving of roughly two million times over the original storage.
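Taking the figures literally, 70,000 pictures at 1 MB each against 4 bits per picture, the storage arithmetic can be checked directly:

```python
n_pictures = 70_000
raw_bytes = n_pictures * 1_000_000   # 70,000 pictures at 1 MB each: ~70 GB
code_bytes = n_pictures * 4 // 8     # 4 bits per picture: 35,000 bytes (~35 KB)
saving = raw_bytes // code_bytes     # reduction factor over raw storage
print(code_bytes, saving)
```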
(1) Since an ordinary computer with 4 GB of memory can process 100,000 examples with the algorithm of the invention, no sampling is needed for this data set: the hash functions are trained directly on the 70,000 pictures, and each example is finally represented with 4 binary bits.
(2) For each test example, the invention first obtains its low-dimensional real-valued representation, here (0.4, 0, 0.1, 0.7) (see Fig. 1).
This representation shows that: 1) the example has been reduced from the 784 dimensions of the original picture description to 4 dimensions; 2) its local structure is maintained, i.e. its neighbours in the original space remain its neighbours in the low-dimensional space; 3) it is non-negative, which gives the inventive method a clear semantics, i.e. interpretability. According to the figure, the invention regards the picture of the monkey as reconstructed from four bases, the weight of each base being the corresponding coordinate of its four-dimensional representation (0.4, 0, 0.1, 0.7); clearly the weight of the second dimension is 0, so the picture is not composed of the second base. By the binarization principle of the invention, the binary code of this picture is (1, 0, 1, 1) (see Fig. 2).
(3) From this binary code the invention can likewise read off that the picture is not composed of the second base, so the encoding process is explainable. It is also easy to show that the invention preserves similarity. Suppose two four-dimensional pictures are (0.51, 0.51, 0.51, 0.51) and (0.49, 0.49, 0.49, 0.49); the invention encodes both of them as (1, 1, 1, 1). Their Euclidean distance in real-valued space clearly shows that they are similar, and the codes obtained by the invention remain similar. With a common hash binarization, however, the two pictures would be encoded as (1, 1, 1, 1) and (0, 0, 0, 0): the similarity of the original space is not kept in the binary (i.e. Hamming) space. This shows that the similarity preservation of the invention is effective.
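The similarity-preservation contrast in this example can be reproduced in a few lines; `threshold_binarize` stands in for the "common hash encoding" (thresholding each coordinate at 0.5), which is an assumption about that baseline:

```python
import numpy as np

def nonzero_binarize(v):
    """The invention's rule: non-zero coordinates -> 1, zeros -> 0."""
    return (np.asarray(v) != 0).astype(int)

def threshold_binarize(v, t=0.5):
    """An assumed common baseline: threshold each coordinate at t."""
    return (np.asarray(v) > t).astype(int)

a = [0.51, 0.51, 0.51, 0.51]
b = [0.49, 0.49, 0.49, 0.49]
# The non-zero rule keeps the two similar pictures identical in Hamming
# space; thresholding at 0.5 pushes them to opposite codes.
print(nonzero_binarize(a), nonzero_binarize(b))
print(threshold_binarize(a), threshold_binarize(b))
```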

Claims (3)

1. A big data retrieval method based on sparse hashing, comprising the steps of:
(1) sampling data from the big data as a training set X;
(2) training hash functions with X;
(3) binary-encoding the examples of the large data set that have not yet received binary codes, and storing the encoded big data in computer memory or on external disk;
(4) for a new test example, first obtaining its low-dimensional real value, then obtaining its low-dimensional binary code, and finally performing a similarity search between the binary code of the test example and the binary codes of the big data to obtain its similar examples;
wherein the size n of the training set X of step (1) is determined from t_{α/2}, the critical value of the t-distribution at the chosen confidence level, and ε, the maximum allowable error as set;
wherein step (2) comprises the following processes:
a) establishing the objective function:
min_{B,S} ||X − BS||² + λ1 Σ_{i,j} w_{i,j} ||s_i − s_j||² + λ2 ||S||_1, subject to S ≥ 0,
where X is the training set; B is the basis space, each vector of B being a basis vector trained from X; S is the low-dimensional real value of X projected onto the basis space B; λ1 and λ2 are tunable parameters obtained by ten-fold cross-validation; w_{i,j} is the Gaussian-kernel projection of the Euclidean distance between two examples x_i and x_j in X; s_i and s_j are two vectors of the matrix S; B_{i,j} is the element in row i and column j of the matrix B; i = 1, ..., n indexes the examples, j = 1, ..., k indexes the basis vectors, n is the number of examples, and k is the number of basis vectors; S ≥ 0 means that every element of S is non-negative;
b) converting S into binary codes: non-zero elements of S are converted into 1, and zero elements into 0;
c) establishing the hash functions: in the training set X, the examples whose hash value in dimension m is 1 form the class A_{m1}, m = 1, ..., d, and the remaining examples, whose hash value is 0, form the class A_{m0}, m = 1, ..., d, yielding 2d classes; the hash function is defined as
sign_m(x_i) = arg min_{j ∈ {0,1}} ||x_i − A_{mj} s_i||², m = 1, ..., d;
where, S having dimension d and X dimension D with D ≫ d, each of the d dimensions is a binary vector and one hash function is established for each of the d dimensions, d hash functions in all; x_i is the i-th vector of the matrix X, s_i is the i-th vector of the matrix S, and i = 1, ..., n.
2. The method according to claim 1, wherein step (3) obtains, for each example x of the big data, the low-dimensional real value of x by s = (B'B + 2I)⁻¹B'x, and then obtains its low-dimensional binary code through the hash functions; B is the basis space defined in the previous step, and I is the identity matrix of the same order as B'B.
3. The method according to claim 1, wherein step (4) obtains, for each example x_t of the test data set, the low-dimensional real value of x_t by s_t = (B'B + 2I)⁻¹B'x_t, and then obtains its low-dimensional binary code through the hash functions; B is the basis space defined in the previous step, and I is the identity matrix of the same order as B'B.
CN201310457033.4A 2013-09-29 2013-09-29 Big data retrieval method based on sparse hash Active CN103605653B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310457033.4A CN103605653B (en) 2013-09-29 2013-09-29 Big data retrieval method based on sparse hash


Publications (2)

Publication Number Publication Date
CN103605653A CN103605653A (en) 2014-02-26
CN103605653B true CN103605653B (en) 2017-01-04

Family

ID=50123878

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310457033.4A Active CN103605653B (en) 2013-09-29 2013-09-29 Big data retrieval method based on sparse hash

Country Status (1)

Country Link
CN (1) CN103605653B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104484566A (en) * 2014-12-16 2015-04-01 芜湖乐锐思信息咨询有限公司 Big data analysis system and big data analysis method
CN104462458A (en) * 2014-12-16 2015-03-25 芜湖乐锐思信息咨询有限公司 Data mining method of big data system
CN104462459A (en) * 2014-12-16 2015-03-25 芜湖乐锐思信息咨询有限公司 Neural network based big data analysis and processing system and method
CN113377294B (en) * 2021-08-11 2021-10-22 武汉泰乐奇信息科技有限公司 Big data storage method and device based on binary data conversion

Citations (1)

Publication number Priority date Publication date Assignee Title
CN102402617A (en) * 2011-12-23 2012-04-04 天津神舟通用数据技术有限公司 Easily compressed database index storage system using fragments and sparse bitmap, and corresponding construction, scheduling and query processing methods


Non-Patent Citations (3)

Title
Sparse Hashing for Fast Multimedia Search; Xiaofeng Zhu et al.; ACM Transactions on Information Systems; 2013-05-31; vol. 31, no. 2; pp. 9:7-9:13 *
基于稀疏谱哈希的图像索引 (Image Indexing Based on Sparse Spectral Hashing); 张啸; China Masters' Theses Full-text Database; 2011-07-15 (no. 7); I138-506 *
基于结构化稀疏谱哈希的图像索引算法 (An Image Indexing Algorithm Based on Structured Sparse Spectral Hashing); 欧阳遄飞; China Masters' Theses Full-text Database; 2012-07-15 (no. 7); I138-2166 *


Similar Documents

Publication Publication Date Title
Liu et al. Deep sketch hashing: Fast free-hand sketch-based image retrieval
Zafar et al. A novel discriminating and relative global spatial image representation with applications in CBIR
Ali et al. A hybrid geometric spatial image representation for scene classification
US8428397B1 (en) Systems and methods for large scale, high-dimensional searches
US8849030B2 (en) Image retrieval using spatial bag-of-features
Zafar et al. Image classification by addition of spatial information based on histograms of orthogonal vectors
EP3166020A1 (en) Method and apparatus for image classification based on dictionary learning
Tabia et al. Compact vectors of locally aggregated tensors for 3D shape retrieval
Yang et al. An improved Bag-of-Words framework for remote sensing image retrieval in large-scale image databases
Serra et al. Gold: Gaussians of local descriptors for image representation
Zhang et al. Fast orthogonal projection based on kronecker product
CN106033426A (en) A latent semantic min-Hash-based image retrieval method
Ali et al. Modeling global geometric spatial information for rotation invariant classification of satellite images
Picard et al. Efficient image signatures and similarities using tensor products of local descriptors
López-Sastre et al. Evaluating 3d spatial pyramids for classifying 3d shapes
CN103605653B (en) Big data retrieval method based on sparse hash
Sadeghi-Tehran et al. Scalable database indexing and fast image retrieval based on deep learning and hierarchically nested structure applied to remote sensing and plant biology
Bu et al. Local deep feature learning framework for 3D shape
Hu et al. Fast binary coding for the scene classification of high-resolution remote sensing imagery
Wu et al. A multi-sample, multi-tree approach to bag-of-words image representation for image retrieval
CN106250918A (en) A kind of mixed Gauss model matching process based on the soil-shifting distance improved
CN105760875A (en) Binary image feature similarity discrimination method based on random forest algorithm
CN103324942B (en) A kind of image classification method, Apparatus and system
CN109145111B (en) Multi-feature text data similarity calculation method based on machine learning
Wang et al. Random angular projection for fast nearest subspace search

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant