CN103605653B - Big data retrieval method based on sparse hash - Google Patents

Big data retrieval method based on sparse hash

Info

Publication number
CN103605653B
CN103605653B (application CN201310457033.4A)
Authority
CN
China
Prior art keywords
big data
hash function
dimensional
hash
training set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201310457033.4A
Other languages
Chinese (zh)
Other versions
CN103605653A (en)
Inventor
朱晓峰
张师超
刘星毅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangxi Normal University
Original Assignee
Guangxi Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangxi Normal University filed Critical Guangxi Normal University
Priority to CN201310457033.4A priority Critical patent/CN103605653B/en
Publication of CN103605653A publication Critical patent/CN103605653A/en
Application granted granted Critical
Publication of CN103605653B publication Critical patent/CN103605653B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22Indexing; Data structures therefor; Storage structures
    • G06F16/2228Indexing structures
    • G06F16/2255Hash tables
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2455Query execution
    • G06F16/24553Query execution of query operations
    • G06F16/24558Binary matching operations

Abstract

The present invention is an approximate retrieval method for big data, specifically a big data retrieval method based on sparse hashing, developed mainly for the storage and retrieval of big data. First, the size of the training set is determined by sampling, according to statistical theory and the available computer memory. The training set is then used to learn the hash functions and the binary codes of the training examples. Next, the learned hash functions are used to binary-encode the whole big data set. At this point online retrieval becomes possible: for a test example, its binary code is first obtained from the learned hash functions, and the binary codes of the big data are then searched in real time. The method has linear time complexity in the size of the big data, solves the problem that manifold learning lacks an explicit function, and reduces the storage of the big data by a factor of tens of thousands. It is easy to implement, involving only a few simple mathematical models when writing the code.

Description

Big data retrieval method based on sparse hash
Technical field
The present invention relates to the fields of computer science and technology and of information technology, and in particular to big data: specifically, a retrieval method that uses sparse hashing to retrieve big data such as pictures, text and music.
Background art
Big data refers to data sets whose contents cannot, under present conditions, be retrieved and managed with conventional tools. Large volume, varied data types, low value density and high processing speed are the four most prominent characteristics of big data. Current research on knowledge discovery in big data concentrates on four aspects: partitioning, clustering, retrieval, and incremental (batch, online or parallel) learning.
Research on big data retrieval is still comparatively scarce. When retrieving, users generally hope to obtain what they need quickly from all the data, which raises the problem of trading off speed against accuracy. Ten or even twenty years ago, researchers pursued accuracy: various tree structures such as the KD-tree and M-tree were designed for exact database retrieval and found wide application. Over the last decade, with the spread of networks and the emergence of big data, exact retrieval can no longer meet users' needs. A large body of literature shows that when the data have fewer than about 10 dimensions, exact retrieval satisfies users' actual needs well; but once the dimensionality exceeds this threshold, the complexity of exact retrieval grows very high, in the worst case reaching that of traversing the entire database, which is infeasible for big data.
In recent years, approximate retrieval has developed considerably, particularly for network retrieval, where users pursue fast, approximate multimedia search. Among the many approximate retrieval methods, hashing is the most prominent. The principle of hashing is to reduce high-dimensional real-valued data to low-dimensional binary data while preserving the similarity between data points as far as possible, so that the large data set can be kept in computer memory or on external disk and searched quickly.
Summary of the invention
The present invention studies the approximate retrieval of big data.
The object of the invention is to provide a simple and effective approximate retrieval algorithm for big data. The method addresses the high complexity and low accuracy of big data retrieval. By preserving the manifold structure of the data, that is, by making the binary codes keep as much of the local structure of the original high-dimensional data as possible, it improves hashing performance; an efficient optimization method reduces the algorithmic complexity to linear. The invention comprises two key processes: hash-function learning and real-time retrieval of big data. Hash-function learning consists of two steps, converting high-dimensional real values to low-dimensional real values and then converting the low-dimensional real values to d-dimensional binary codes. Real-time retrieval first converts an example to binary with the learned hash functions and then searches in computer memory.
The specific steps of the method are as follows:
(1) Sample data from the big data as a training set for learning the hash functions. The volume of big data is enormous, and by statistical theory it is not necessary to take all the data as the training set, so the invention first samples part of the data. The size n of the extracted training set is determined from t_{α/2}, the critical value of the t-distribution at the chosen confidence level, and ε, the maximum allowable error; the various parameter settings are listed in the table.
This yields the training set X.
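Since the patent gives the sample-size formula only as an image, the sketch below assumes the standard worst-case bound n = (t_{α/2}/(2ε))² and approximates the t critical value by the normal quantile (the two coincide for large n); the function names are illustrative:

```python
import math
import random
from statistics import NormalDist

def training_set_size(alpha: float, eps: float) -> int:
    """Worst-case sample size n = (t_{alpha/2} / (2 * eps))**2, with the
    normal quantile standing in for the t critical value t_{alpha/2}."""
    t = NormalDist().inv_cdf(1 - alpha / 2)
    return math.ceil((t / (2 * eps)) ** 2)

def sample_training_set(big_data, alpha=0.05, eps=0.01, seed=0):
    """Draw the training set X by simple random sampling."""
    n = training_set_size(alpha, eps)
    rng = random.Random(seed)
    return rng.sample(big_data, min(n, len(big_data)))
```

For a 95% confidence level and 1% allowable error this gives the familiar n = 9604, far smaller than a typical big data set.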
(2) Train the hash functions with X. First an objective function is designed to convert the high-dimensional real data to low-dimensional data. The objective function is defined as:
min_{B,S} ||X − BS||² + λ1 Σ_{i,j} w_{i,j} ||s_i − s_j||² + λ2 ||S||_1, subject to S ≥ 0,
where X is the training set; B is the basis space, each vector of B being a basis vector trained from X; S is the low-dimensional real value of X projected onto the basis space B; λ1 and λ2 are tunable parameters obtained by ten-fold cross-validation; w_{i,j} is the Gaussian-kernel projection of the Euclidean distance between two examples x_i and x_j in X; s_i and s_j are two vectors of the matrix S; B_{i,j} is the element in row i and column j of the matrix B; i = 1, ..., n indexes the examples and j = 1, ..., k indexes the basis vectors, n being the number of examples and k the number of basis vectors; S ≥ 0 means that every element of S is non-negative.
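As an illustration, the value of this objective can be computed directly. The NumPy sketch below assumes w_{i,j} = exp(−||x_i − x_j||²/(2σ²)), one common reading of "Euclidean distance projected in a Gaussian kernel"; the kernel width σ and the function name are assumptions, not the patent's prescription:

```python
import numpy as np

def objective(X, B, S, lam1, lam2, sigma=1.0):
    """||X - BS||^2 + lam1 * sum_ij w_ij ||s_i - s_j||^2 + lam2 * ||S||_1,
    with S constrained to be non-negative.
    X: (D, n) training set; B: (D, k) basis; S: (k, n) codes.
    w_ij = exp(-||x_i - x_j||^2 / (2 sigma^2))  (assumed kernel form)."""
    assert np.all(S >= 0), "every element of S must be non-negative"
    recon = np.linalg.norm(X - B @ S) ** 2                     # term 1: reconstruction
    d2 = np.sum((X[:, :, None] - X[:, None, :]) ** 2, axis=0)  # pairwise ||x_i - x_j||^2
    W = np.exp(-d2 / (2 * sigma ** 2))                         # Gaussian-kernel weights
    s2 = np.sum((S[:, :, None] - S[:, None, :]) ** 2, axis=0)  # pairwise ||s_i - s_j||^2
    manifold = np.sum(W * s2)                                  # term 2: local structure
    sparsity = np.abs(S).sum()                                 # term 3: L1 sparsity
    return recon + lam1 * manifold + lam2 * sparsity
```

Minimizing this value over B and S (the patent's optimization procedure) trades reconstruction error against preservation of local structure and sparsity of S.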
The goal of the first term, ||X − BS||², is to reconstruct the training set X in the basis space B so as to obtain S with the smallest possible reconstruction error. The second term, Σ_{i,j} w_{i,j} ||s_i − s_j||², preserves the local manifold structure of the original training set X; this ensures that the binary data keep the similarity of the original high-dimensional data and thus guarantees the quality of the hashing. The third term makes the resulting S sparse, and the fourth condition, S ≥ 0, guarantees that S is non-negative. Under this objective, the resulting S is a low-dimensional representation of X.
The second step of training the hash functions converts S into binary codes: non-zero elements of S are converted into 1, and zero elements into 0.
The third step of training obtains the hash functions. Suppose S has dimension d and X has dimension D, with D ≫ d; the binary codes then have length d. Each of the d dimensions is treated as a vector that is binary (i.e. a two-class problem in classification), and the invention establishes one hash function for each dimension, d hash functions in all. The process of establishing a hash function is very simple: in the training set X, the examples whose hash value in dimension m is 1 form the class A_{m1}, m = 1, ..., d, and the remaining examples, whose hash value is 0, form the class A_{m0}, m = 1, ..., d, yielding 2d classes. The hash function is defined as
sign_m(x_i) = arg min_{j ∈ {0,1}} ||x_i − A_{mj} s_i||², m = 1, ..., d,
where x_i is the i-th vector of the matrix X, s_i is the i-th vector of the matrix S, and i = 1, ..., n.
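The binarization rule and the per-dimension hash functions of step (2) can be sketched as follows. The patent defines the classes A_{m1} and A_{m0} but not how the arg-min is evaluated for a new example, so this sketch summarizes each class by its centroid; that summary, and the function names, are interpretive assumptions:

```python
import numpy as np

def binarize(S):
    """The patent's rule: non-zero entries of S become 1, zeros become 0."""
    return (S != 0).astype(np.uint8)

def build_hash_classes(X, codes):
    """For each bit m, split the training examples into A_m0 (bit 0) and
    A_m1 (bit 1), giving 2d classes in total; each class is summarized
    here by its centroid (an assumption, not the patent's prescription).
    X: (D, n) training data; codes: (d, n) binary codes."""
    classes = []
    for m in range(codes.shape[0]):
        A0 = X[:, codes[m] == 0]
        A1 = X[:, codes[m] == 1]
        c0 = A0.mean(axis=1) if A0.size else np.zeros(X.shape[0])
        c1 = A1.mean(axis=1) if A1.size else np.zeros(X.shape[0])
        classes.append((c0, c1))
    return classes

def hash_example(x, classes):
    """Bit m = arg min over j in {0, 1} of ||x - centroid(A_mj)||^2."""
    return np.array([int(np.linalg.norm(x - c1) < np.linalg.norm(x - c0))
                     for c0, c1 in classes], dtype=np.uint8)
```

For example, with one-dimensional training data [0, 1, 10, 11] and codes [0, 0, 1, 1], the class centroids are 0.5 and 10.5, so a new example 9.0 hashes to bit 1.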
(3) The process of binary-encoding the examples of the large data set that have not yet received binary codes is: for each example x, obtain the low-dimensional real value of x by s = (B'B + 2I)⁻¹B'x, and then obtain its low-dimensional binary code through the hash functions; here B is the basis space defined in the previous step and I is the identity matrix of the same order as B'B. In this way the whole big data set is encoded, so that it can be stored in computer memory or on external disk.
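The projection s = (B'B + 2I)⁻¹B'x is a ridge-regularized least-squares fit of x onto the basis B, and can be sketched as (function name illustrative):

```python
import numpy as np

def encode(B, x):
    """Low-dimensional real value s = (B'B + 2I)^(-1) B'x: a
    ridge-regularized projection of the example x onto the basis B.
    B: (D, k) basis; x: (D,) example; returns s: (k,)."""
    k = B.shape[1]
    # Solve (B'B + 2I) s = B'x rather than forming the inverse explicitly.
    return np.linalg.solve(B.T @ B + 2 * np.eye(k), B.T @ x)
```

With B the identity, for instance, the projection simply shrinks x by a factor of 3, which shows the effect of the fixed 2I regularizer.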
(4) For a new test example x_t, obtain its low-dimensional real value by s_t = (B'B + 2I)⁻¹B'x_t, and then obtain its low-dimensional binary code through the hash functions; here B is the basis space defined in the previous step and I is the identity matrix of the same order as B'B. Finally, perform a similarity search between the binary code of the test example and the binary codes of the big data to obtain its similar examples.
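The final similarity search over binary codes is a Hamming-distance ranking. A minimal sketch follows; the patent fixes no ranking or tie-breaking rule, so a plain stable argsort is used here:

```python
import numpy as np

def hamming_search(query_code, database_codes, top=5):
    """Rank database items by Hamming distance to the query's binary code.
    query_code: (d,) 0/1 array; database_codes: (N, d) 0/1 array.
    Returns the indices of the `top` nearest items."""
    # Hamming distance = number of differing bits per database row.
    dists = np.count_nonzero(database_codes != query_code, axis=1)
    return np.argsort(dists, kind="stable")[:top]
```

Because each comparison is a fixed-length bit count, one pass over N codes costs O(N d), which is what makes the in-memory real-time retrieval of step (4) feasible.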
Step (2) of the invention is the crucial one and guarantees both the efficiency and the effectiveness of the algorithm. Its complexity is roughly cubic in the dimension D; since in big data applications the dimension D is far smaller than the number of examples, the complexity of the algorithm is linear in the number of examples. Because step (2) preserves the manifold structure of the data, the effectiveness of the algorithm is assured; and because the generated low-dimensional real values are non-negative, the results are easy to interpret.
The sparse-hashing big data retrieval model of the invention is characterized by: using a sparse algorithm and sampling to reduce the algorithmic complexity; using manifold-learning theory to generate the hash functions and improve hashing performance; generating explicit hash functions, avoiding the implicit hash functions of manifold learning; a binarization principle that makes the hash results interpretable; and a greatly reduced storage requirement for the big data.
Sampling the big data: carrying out data-mining studies over the whole of a big data set is generally extremely difficult, and even when feasible the complexity is very high. The sampling approach makes operating on the big data feasible and reduces the complexity to linear, which is the result big data mining hopes for.
Manifold-embedded hash learning model: manifold theory has proved to be a very effective method of preserving local structure, which is particularly important for building hash models. The invention adds a manifold regularization factor during hash learning. Its primary purpose is to keep the manifold structure of the data set and so guarantee good hashing performance; in addition, a novel optimization method yields an explicit expression for the hash functions, overcoming the usual difficulty that manifold learning has no explicit expression.
Interpretability of the binarization: when the low-dimensional real values are turned into d-dimensional binary codes, the use of a non-negative representation and a novel binary conversion makes the resulting binary representation interpretable and keeps similarity preserved. This distinguishes it from the binarization used in existing hash methods.
Low complexity: thanks to the efficient optimization method and the sampling approach, the complexity of learning the hash functions is independent of the number of big data examples, and the worst-case complexity is linear.
Low storage: because the binary codes innovatively replace the storage of the actual data, the storage of the big data is reduced by a factor of tens of thousands.
Brief description of the drawings
Fig. 1 shows the dimensionality-reduction result for a test example;
Fig. 2 shows the binary code of the picture in Fig. 1.
Detailed description of the invention
70,000 animal pictures were taken at random from the network. Suppose each picture needs 1 MB of storage (note that at this size the pictures are not of particularly high fidelity); the whole data set then needs about 70 GB. The invention replaces each picture with a 4-bit binary code, so the whole set needs only about 35 KB of storage in total, a saving of roughly two million times over the original storage.
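Taking the figures literally, 70,000 pictures at 1 MB each against 4 bits per picture, the storage arithmetic can be checked directly:

```python
n_pictures = 70_000
raw_bytes = n_pictures * 1_000_000   # 70,000 pictures at 1 MB each: ~70 GB
code_bytes = n_pictures * 4 // 8     # 4 bits per picture: 35,000 bytes (~35 KB)
saving = raw_bytes // code_bytes     # reduction factor over raw storage
print(code_bytes, saving)
```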
(1) Since an ordinary computer with 4 GB of memory can process 100,000 examples with the algorithm of the invention, no sampling is needed for this data set: the hash functions are trained directly on the 70,000 pictures, and each example is finally represented with 4 binary bits.
(2) For each test example, the invention first obtains its low-dimensional real-valued representation, here (0.4, 0, 0.1, 0.7) (see Fig. 1).
This representation shows that: 1) the example has been reduced from the 784 dimensions of the original picture description to 4 dimensions; 2) its local structure is maintained, i.e. its neighbours in the original space remain its neighbours in the low-dimensional space; 3) it is non-negative, which gives the inventive method a clear semantics, i.e. interpretability. According to the figure, the invention regards the picture of the monkey as reconstructed from four bases, the weight of each base being the corresponding coordinate of its four-dimensional representation (0.4, 0, 0.1, 0.7); clearly the weight of the second dimension is 0, so the picture is not composed of the second base. By the binarization principle of the invention, the binary code of this picture is (1, 0, 1, 1) (see Fig. 2).
(3) From this binary code the invention can likewise read off that the picture is not composed of the second base, so the encoding process is explainable. It is also easy to show that the invention preserves similarity. Suppose two four-dimensional pictures are (0.51, 0.51, 0.51, 0.51) and (0.49, 0.49, 0.49, 0.49); the invention encodes both of them as (1, 1, 1, 1). Their Euclidean distance in real-valued space clearly shows that they are similar, and the codes obtained by the invention remain similar. With a common hash binarization, however, the two pictures would be encoded as (1, 1, 1, 1) and (0, 0, 0, 0): the similarity of the original space is not kept in the binary (i.e. Hamming) space. This shows that the similarity preservation of the invention is effective.
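The similarity-preservation contrast in this example can be reproduced in a few lines; `threshold_binarize` stands in for the "common hash encoding" (thresholding each coordinate at 0.5), which is an assumption about that baseline:

```python
import numpy as np

def nonzero_binarize(v):
    """The invention's rule: non-zero coordinates -> 1, zeros -> 0."""
    return (np.asarray(v) != 0).astype(int)

def threshold_binarize(v, t=0.5):
    """An assumed common baseline: threshold each coordinate at t."""
    return (np.asarray(v) > t).astype(int)

a = [0.51, 0.51, 0.51, 0.51]
b = [0.49, 0.49, 0.49, 0.49]
# The non-zero rule keeps the two similar pictures identical in Hamming
# space; thresholding at 0.5 pushes them to opposite codes.
print(nonzero_binarize(a), nonzero_binarize(b))
print(threshold_binarize(a), threshold_binarize(b))
```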

Claims (3)

1. A big data retrieval method based on sparse hashing, comprising the steps of:
(1) sampling data from the big data as a training set X;
(2) training hash functions with X;
(3) binary-encoding the examples of the large data set that have not yet received binary codes, and storing the encoded big data in computer memory or on external disk;
(4) for a new test example, first obtaining its low-dimensional real value, then obtaining its low-dimensional binary code, and finally performing a similarity search between the binary code of the test example and the binary codes of the big data to obtain its similar examples;
wherein the size n of the training set X of step (1) is determined from t_{α/2}, the critical value of the t-distribution at the chosen confidence level, and ε, the maximum allowable error as set;
wherein step (2) comprises the following processes:
a) establishing the objective function:
min_{B,S} ||X − BS||² + λ1 Σ_{i,j} w_{i,j} ||s_i − s_j||² + λ2 ||S||_1, subject to S ≥ 0,
where X is the training set; B is the basis space, each vector of B being a basis vector trained from X; S is the low-dimensional real value of X projected onto the basis space B; λ1 and λ2 are tunable parameters obtained by ten-fold cross-validation; w_{i,j} is the Gaussian-kernel projection of the Euclidean distance between two examples x_i and x_j in X; s_i and s_j are two vectors of the matrix S; B_{i,j} is the element in row i and column j of the matrix B; i = 1, ..., n indexes the examples, j = 1, ..., k indexes the basis vectors, n is the number of examples, and k is the number of basis vectors; S ≥ 0 means that every element of S is non-negative;
b) converting S into binary codes: non-zero elements of S are converted into 1, and zero elements into 0;
c) establishing the hash functions: in the training set X, the examples whose hash value in dimension m is 1 form the class A_{m1}, m = 1, ..., d, and the remaining examples, whose hash value is 0, form the class A_{m0}, m = 1, ..., d, yielding 2d classes; the hash function is defined as
sign_m(x_i) = arg min_{j ∈ {0,1}} ||x_i − A_{mj} s_i||², m = 1, ..., d;
where, S having dimension d and X dimension D with D ≫ d, each of the d dimensions is a binary vector and one hash function is established for each of the d dimensions, d hash functions in all; x_i is the i-th vector of the matrix X, s_i is the i-th vector of the matrix S, and i = 1, ..., n.
2. The method according to claim 1, wherein step (3) obtains, for each example x of the big data, the low-dimensional real value of x by s = (B'B + 2I)⁻¹B'x, and then obtains its low-dimensional binary code through the hash functions; B is the basis space defined in the previous step, and I is the identity matrix of the same order as B'B.
3. The method according to claim 1, wherein step (4) obtains, for each example x_t of the test data set, the low-dimensional real value of x_t by s_t = (B'B + 2I)⁻¹B'x_t, and then obtains its low-dimensional binary code through the hash functions; B is the basis space defined in the previous step, and I is the identity matrix of the same order as B'B.
CN201310457033.4A 2013-09-29 2013-09-29 Big data retrieval method based on sparse hash Active CN103605653B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310457033.4A CN103605653B (en) 2013-09-29 2013-09-29 Big data retrieval method based on sparse hash


Publications (2)

Publication Number Publication Date
CN103605653A CN103605653A (en) 2014-02-26
CN103605653B true CN103605653B (en) 2017-01-04

Family

ID=50123878

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310457033.4A Active CN103605653B (en) 2013-09-29 2013-09-29 Big data retrieval method based on sparse hash

Country Status (1)

Country Link
CN (1) CN103605653B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104484566A (en) * 2014-12-16 2015-04-01 芜湖乐锐思信息咨询有限公司 Big data analysis system and big data analysis method
CN104462458A (en) * 2014-12-16 2015-03-25 芜湖乐锐思信息咨询有限公司 Data mining method of big data system
CN104462459A (en) * 2014-12-16 2015-03-25 芜湖乐锐思信息咨询有限公司 Neural network based big data analysis and processing system and method
CN113377294B (en) * 2021-08-11 2021-10-22 武汉泰乐奇信息科技有限公司 Big data storage method and device based on binary data conversion

Citations (1)

Publication number Priority date Publication date Assignee Title
CN102402617A (en) * 2011-12-23 2012-04-04 天津神舟通用数据技术有限公司 Easily compressed database index storage system using fragments and sparse bitmap, and corresponding construction, scheduling and query processing methods


Non-Patent Citations (3)

Title
Sparse Hashing for Fast Multimedia Search; Xiaofeng Zhu et al.; ACM Transactions on Information Systems; 2013-05-31; vol. 31, no. 2; pp. 9:7-9:13 *
基于稀疏谱哈希的图像索引 (Image Indexing Based on Sparse Spectral Hashing); 张啸; China Masters' Theses Full-text Database; 2011-07-15 (no. 7); I138-506 *
基于结构化稀疏谱哈希的图像索引算法 (An Image Indexing Algorithm Based on Structured Sparse Spectral Hashing); 欧阳遄飞; China Masters' Theses Full-text Database; 2012-07-15 (no. 7); I138-2166 *


Similar Documents

Publication Publication Date Title
Liu et al. Deep sketch hashing: Fast free-hand sketch-based image retrieval
Zafar et al. A novel discriminating and relative global spatial image representation with applications in CBIR
Ali et al. A hybrid geometric spatial image representation for scene classification
US8428397B1 (en) Systems and methods for large scale, high-dimensional searches
US8849030B2 (en) Image retrieval using spatial bag-of-features
Zafar et al. Image classification by addition of spatial information based on histograms of orthogonal vectors
EP3166020A1 (en) Method and apparatus for image classification based on dictionary learning
Tabia et al. Compact vectors of locally aggregated tensors for 3D shape retrieval
Yang et al. An improved Bag-of-Words framework for remote sensing image retrieval in large-scale image databases
Serra et al. Gold: Gaussians of local descriptors for image representation
Zhang et al. Fast orthogonal projection based on kronecker product
CN106033426A (en) A latent semantic min-Hash-based image retrieval method
Ali et al. Modeling global geometric spatial information for rotation invariant classification of satellite images
Picard et al. Efficient image signatures and similarities using tensor products of local descriptors
López-Sastre et al. Evaluating 3d spatial pyramids for classifying 3d shapes
CN103605653B (en) Big data retrieval method based on sparse hash
Sadeghi-Tehran et al. Scalable database indexing and fast image retrieval based on deep learning and hierarchically nested structure applied to remote sensing and plant biology
Bu et al. Local deep feature learning framework for 3D shape
Hu et al. Fast binary coding for the scene classification of high-resolution remote sensing imagery
Wu et al. A multi-sample, multi-tree approach to bag-of-words image representation for image retrieval
CN106250918A (en) A kind of mixed Gauss model matching process based on the soil-shifting distance improved
CN105760875A (en) Binary image feature similarity discrimination method based on random forest algorithm
CN103324942B (en) A kind of image classification method, Apparatus and system
CN109145111B (en) Multi-feature text data similarity calculation method based on machine learning
Wang et al. Random angular projection for fast nearest subspace search

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant