CN109829549A - Hash learning method based on an evolving tree and its unsupervised online hash learning method - Google Patents

Hash learning method based on an evolving tree and its unsupervised online hash learning method

Info

Publication number
CN109829549A
Authority
CN
China
Prior art keywords
tree
node
evolving
data
hash
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910088472.XA
Other languages
Chinese (zh)
Inventor
寿震宇
钱江波
杨安邦
袁明汶
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ningbo University
Original Assignee
Ningbo University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ningbo University
Priority to CN201910088472.XA
Publication of CN109829549A
Priority to CN202010070802.5A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/243 Classification techniques relating to the number of classes
    • G06F 18/24323 Tree-organised classifiers

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The present invention relates to a hash learning method based on an evolving tree. The evolving tree is trained with the data points in a data set to obtain a trained evolving tree; all nodes of the trained evolving tree except the root node are initialized with Hamming codes; the similarity-preserving loss function of the whole evolving tree is optimized with a greedy path-coding strategy, and the Hamming codes corresponding to the minimum of the similarity-preserving loss function are taken as the hash codes of the leaf nodes of the evolving tree. To encode a data point, its best match point in the evolving tree is computed, the split path from the root node to the leaf node corresponding to that best match point is found, and the hash codes of the corresponding leaf nodes along the split path are concatenated in order as the hash code of the data point. An unsupervised online hash learning method is also disclosed. The hashing method reduces coding complexity and has good query performance.

Description

Hash learning method based on an evolving tree and its unsupervised online hash learning method
Technical field
The present invention relates to the field of data processing, in particular to a hash learning method based on an evolving tree and its unsupervised online hash learning method.
Background technique
With the rapid development of the Internet and all kinds of electronic devices, data of every type, such as text, images, and video, have grown explosively. In many application scenarios, people need to retrieve related content from such large-scale data. However, on large-scale data, the computation time required to find the exact nearest neighbors of a given query point is unacceptable. To solve this problem, a large body of recent research has turned to approximate nearest neighbor (ANN) search: on large-scale data, ANN retrieval can replace exact nearest-neighbor retrieval at much higher speed. ANN retrieval based on hash learning is one of the better-known ANN retrieval techniques. It uses machine learning to map data points into Hamming space, replacing the Euclidean distance of the original data with the Hamming distance, which greatly reduces retrieval time and storage cost while preserving accuracy. In recent years, many excellent hash learning algorithms have emerged; according to whether the learning model uses the label information of the samples, they can be divided into unsupervised models and supervised models. Since obtaining label information requires enormous labor cost, unsupervised hash learning algorithms have found wider application.
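As a brief illustration of the mechanism described above (not part of the original disclosure), the Hamming distance between two binary codes can be computed with an XOR followed by a bit count; a minimal Python sketch, with an illustrative function name:

```python
def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two binary codes differ."""
    return bin(a ^ b).count("1")

# Two 4-bit codes that differ in exactly one position have distance 1.
assert hamming_distance(0b1011, 0b0011) == 1
```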
In general, current mainstream hashing algorithms fall into two classes: data-independent hashing and data-dependent hashing. In data-independent hashing, the family of hash functions is generated independently of the data set. The typical representative is locality-sensitive hashing (LSH), which builds hash tables from a set of random hash functions so that similar data points are mapped into the same hash bucket with high probability; its drawback is that the index construction is data-independent, so it performs poorly in retrieval on real large-scale data sets. Data-dependent hashing, also known as hash learning, maps data to similarity-preserving binary codes through machine learning and is a typical application of machine learning techniques in the field of data retrieval. The most important goal of hash learning is to make the hash codes similarity-preserving: specifically, two data points that are close in the original space should still have a small Hamming distance after being mapped into Hamming space, while data points that are far apart should still have a large Hamming distance after the mapping. In recent years, many hash learning algorithms have been proposed; according to whether the learning model uses the label information of the samples, they can be divided into unsupervised and supervised hashing algorithms. Famous representatives of unsupervised hashing include principal component analysis hashing (PCAH), iterative quantization (ITQ), and K-means hashing (KMH). PCAH projects the input data space into a low-dimensional space by principal component analysis and then maps the low-dimensional data to hash codes. ITQ tries to find a rotation of the original data that minimizes the quantization loss when the original data is mapped to binary codes. KMH designs hash codes from the angle of clustering: its basic idea is to cluster the data into K classes and, within each class, quantize the data to the value of the cluster center with a vector quantization strategy, while each cluster center is coded according to a similarity-preserving principle; in the query stage, the distance between data points x and y is approximated by the Hamming distance of the hash codes of their corresponding cluster centers. Supervised hashing algorithms mainly include RBM, BRE, MFH, IMH, and MLH. Although supervised hashing shows higher retrieval accuracy than unsupervised hashing methods, its training requires label information; in the era of massive data, data scale is large, update speed is fast, and obtaining data labels usually requires enormous labor cost, so unsupervised hashing is more meaningful in practical applications. However, most unsupervised hashing algorithms need to load all data at once, occupy a large amount of memory, and cannot be applied to streaming data, and related research is scarce.
Summary of the invention
The first technical problem to be solved by the present invention, in view of the above-described prior art, is to provide a hash learning method based on an evolving tree that makes the evolving tree converge stably and reduces coding complexity.
The second technical problem to be solved by the present invention, in view of the above-described prior art, is to provide an unsupervised online hash learning method that uses the above hash learning method based on an evolving tree; this method has good query performance and can be applied to streaming data.
The technical solution adopted by the present invention to solve the above first technical problem is: a hash learning method based on an evolving tree, used to train the evolving tree with the data points x_i in a data set X to obtain a trained evolving tree, perform similarity-preserving coding on the trained evolving tree to obtain the hash code of each leaf node in the evolving tree, and compute the best match point of any data point in the evolving tree to obtain the hash code of that data point, characterized by comprising the following steps:
Step 1: create an evolving tree, where the initialized evolving tree has only one root node, and assign a weight vector to the root node;
Step 2: train the root node: form all data points in the data set into a data stream in random order, take the root node as the best match point of the first data point in the data stream, record the number of times the root node has become a best match point, and go to step 4;
Step 3: use the first data point in the data stream to train the leaf nodes of the evolving tree after splitting: compute the Euclidean distance between each node in the evolving tree and the data point, find the node with the smallest Euclidean distance to the data point, and judge whether that node is a leaf node; if so, take the currently trained node in the evolving tree as the best match point of the data point, record the number of times each leaf node in the evolving tree has become a best match point, and go to step 4; if not, go to step 6;
Step 4: for the root node and all leaf nodes in the evolving tree, perform the following in turn: judge whether the number of times the currently trained node in the evolving tree has become a best match point is less than the first preset value, where the currently trained node in the evolving tree is the root node or any leaf node; if so, update the weight vector of the currently trained node in the evolving tree and go to step 6; if not, go to step 5; the update formula for the weight vector of the currently trained node in the evolving tree is:
w_i(t+1) = x(t)
where w_i(t+1) is the weight vector of the currently trained node in the evolving tree after the update, w_i(t) is the weight vector of the currently trained node in the evolving tree before the update, and x(t) is the vector of the data point for which the currently trained node in the evolving tree is the best match point;
Step 5: judge whether the current depth of the currently trained node in the evolving tree is less than the maximum depth of the evolving tree, the maximum depth of the evolving tree being a preset value; if so, split the currently trained node in the evolving tree into n leaf nodes, assign a different weight vector to each leaf node, record the split node as a trunk node, re-form the data stream, count the number of times the data stream has been formed, and go to step 3; if not, the evolving tree at this point is the trained evolving tree, and go to step 8; the weight vectors of the n leaf nodes are computed as:
w'(t) = (1 - β)·w(t) + β·r(t)
where w'(t) is the weight vector of a new leaf node, w(t) is the weight vector of the trunk node corresponding to the new leaf node, r(t) is a random unit vector with the same dimension as w(t), and β is a preset hyperparameter that controls the degree of random perturbation;
Step 6: judge whether all data points in the data stream have been used for training; if not, use the next data point in the data stream to train the evolving tree, continue recording the number of times each node in the evolving tree becomes a best match point, and go to step 4; if so, go to step 7;
Step 7: judge whether the number of times the data stream has been formed is less than the second preset value; if so, re-form the data stream, train the evolving tree again, accumulate the counts of the trained nodes in the evolving tree becoming best match points, and go to step 4; if not, the evolving tree at this point is the trained evolving tree, and go to step 8;
Step 8: initialize Hamming codes for all nodes in the trained evolving tree except the root node, optimize the similarity-preserving loss function of the whole evolving tree with a greedy path-coding strategy, and take the Hamming codes corresponding to the minimum of the similarity-preserving loss function as the hash codes of the leaf nodes of the evolving tree;
The optimization objective is:
min E = Σ_{W_k ∈ N} F(W_k), with
F(W_k) = Σ_{i<j} ( d(w_i, w_j) - λ·d_h(b(w_i), b(w_j)) )^2
where E is the similarity-preserving loss of the whole evolving tree; W_k is the weight-vector set of trunk node k of the whole evolving tree, W_k = {w_1, w_2, ..., w_n}, and w_1, w_2, ..., w_n are the weight vectors of the n leaf nodes split out of trunk node k; N = {W_1, W_2, ..., W_c} is the set of all trunk nodes in the whole evolving tree; F(W_k) is the similarity-preserving loss function of the codes of the child nodes of each trunk node, where w_i is the weight vector of the i-th leaf node under trunk node k and w_j is the weight vector of the j-th leaf node under trunk node k; d(w_i, w_j) denotes the Euclidean distance between leaf nodes w_i and w_j; λ is a preset hyperparameter; b(w_i) denotes the Hamming code of leaf node w_i, b(w_j) denotes the Hamming code of leaf node w_j, and d_h(b(w_i), b(w_j)) denotes the Hamming distance between b(w_i) and b(w_j);
Step 9: compute the best match point of a given data point in the evolving tree, find the split path from the root node to the leaf node corresponding to the best match point of the data point, and, according to the hash codes of the leaf nodes of the evolving tree obtained in step 8, concatenate the hash codes of the corresponding leaf nodes along the split path of the best match point in order of increasing depth as the hash code of the data point; the hash code of the data point is expressed as: y = u_1 u_2 ... u_{dep-1}, where u_1 is the hash code at the corresponding node of depth 2 of the evolving tree, u_2 is the hash code at the corresponding node of depth 3 of the evolving tree, dep is the maximum depth of the evolving tree, and u_{dep-1} is the hash code at the corresponding node of the maximum depth of the evolving tree.
The technical solution adopted by the present invention to solve the above second technical problem is: an unsupervised online hash learning method, characterized in that multiple evolving trees are created and formed into a forest in order, and the forest is trained with the above hash learning method based on an evolving tree. The data points in the data set are formed into a data stream at random, and the first data point in the data stream is used to train each evolving tree in the forest in turn: for each evolving tree in the forest, a number is randomly sampled from a Poisson distribution with intensity 1 and denoted K, and the data point in the data stream is used to train that evolving tree K times; the forest is trained with the data points in the data stream in turn. After training is completed, similarity-preserving coding is performed on the leaf nodes of each evolving tree in the forest to obtain the hash code of each evolving tree in the forest; the best match point of data point x_i on each evolving tree in the forest is computed, and the hash codes of the best match points of x_i on the corresponding evolving trees are concatenated in order to form the hash code of x_i, expressed as: Y_i = y_i^1 y_i^2 ... y_i^T, where y_i^k denotes the code of the k-th evolving tree in the forest for data point x_i, and T denotes the total number of evolving trees in the forest.
Compared with the prior art, the advantages of the present invention are as follows: hash learning is carried out on an evolving tree; during the training of the evolving tree, a weight-inheritance mechanism is introduced and only the best match point is adjusted in each update, which makes the training process simpler and keeps the evolving tree as balanced as possible and stably convergent; the similarity-preserving loss function is optimized with a greedy path-coding strategy, realizing local similarity preservation between child nodes; furthermore, a forest hash learning method is proposed on the basis of the evolving-tree hash learning method, which uses longer code lengths, achieves better query performance, and can be applied to streaming data.
Detailed description of the invention
Fig. 1 is the training flow chart of the evolving tree in an embodiment of the present invention;
Fig. 2 shows the original data-space distribution in this embodiment;
Fig. 3 shows the initial stage of the evolving tree in this embodiment;
Fig. 4 shows the second stage, after the first split of the evolving tree in Fig. 3;
Fig. 5 shows the third stage, after the second split of the evolving tree in Fig. 3;
Fig. 6 shows the structure of the evolving tree in Fig. 5;
Fig. 7 shows the leaf-node distribution after the training of the evolving tree in Fig. 3 is completed.
Specific embodiment
The present invention will be described in further detail below with reference to the accompanying drawings and embodiments.
As shown in Fig. 1, a hash learning method based on an evolving tree is used to train the evolving tree with the data points x_i in a data set X to obtain a trained evolving tree, perform similarity-preserving coding on the trained evolving tree to obtain the hash code of each leaf node in the evolving tree, and compute the best match point of any data point in the evolving tree to obtain the hash code of that data point, comprising the following steps:
Step 1: create an evolving tree, where the initialized evolving tree has only one root node, and assign a weight vector to the root node;
Step 2: train the root node: form all data points in the data set into a data stream in random order, take the root node as the best match point of the first data point in the data stream, record the number of times the root node has become a best match point, and go to step 4;
Step 3: use the first data point in the data stream to train the leaf nodes of the evolving tree after splitting: compute the Euclidean distance between each node in the evolving tree and the data point, find the node with the smallest Euclidean distance to the data point, and judge whether that node is a leaf node; if so, take the currently trained node in the evolving tree as the best match point of the data point, record the number of times each leaf node in the evolving tree has become a best match point, and go to step 4; if not, go to step 6;
Step 4: for the root node and all leaf nodes in the evolving tree, perform the following in turn: judge whether the number of times the currently trained node in the evolving tree has become a best match point is less than the first preset value, where the currently trained node in the evolving tree is the root node or any leaf node; if so, update the weight vector of the currently trained node in the evolving tree and go to step 6; if not, go to step 5; the update formula for the weight vector of the currently trained node in the evolving tree is:
w_i(t+1) = x(t)
where w_i(t+1) is the weight vector of the currently trained node in the evolving tree after the update, w_i(t) is the weight vector of the currently trained node in the evolving tree before the update, and x(t) is the vector of the data point for which the currently trained node in the evolving tree is the best match point; in this embodiment, the first preset value is 60;
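As an illustrative sketch only (the Node structure and function names are assumptions, not the patent's reference implementation), the best-match search of step 3 and the update rule of step 4 can be written as:

```python
import numpy as np

class Node:
    """Evolving-tree node: a weight vector, a depth, children, and a BMU counter."""
    def __init__(self, weight: np.ndarray, depth: int = 1):
        self.weight = weight      # weight vector w_i(t)
        self.depth = depth        # the root node has depth 1
        self.children = []        # an empty list means this is a leaf node
        self.bmu_count = 0        # times this node became a best match point
        self.code = ""            # local Hamming code, assigned in step 8

def find_best_match(nodes, x: np.ndarray) -> Node:
    """Step 3: the node with the smallest Euclidean distance to data point x."""
    return min(nodes, key=lambda n: float(np.linalg.norm(n.weight - x)))

def update_weight(node: Node, x: np.ndarray) -> None:
    """Step 4 update rule w_i(t+1) = x(t): the best match point adopts x."""
    node.weight = x.copy()
```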
Step 5: judge whether the current depth of the currently trained node in the evolving tree is less than the maximum depth of the evolving tree, the maximum depth of the evolving tree being a preset value; if so, split the currently trained node in the evolving tree into n leaf nodes, assign a different weight vector to each leaf node, record the split node as a trunk node, re-form the data stream, count the number of times the data stream has been formed, and go to step 3; if not, the evolving tree at this point is the trained evolving tree, and go to step 8; the weight vectors of the n leaf nodes are computed as:
w'(t) = (1 - β)·w(t) + β·r(t)
where w'(t) is the weight vector of a new leaf node, w(t) is the weight vector of the trunk node corresponding to the new leaf node, r(t) is a random unit vector with the same dimension as w(t), and β is a preset hyperparameter that controls the degree of random perturbation; in this embodiment, β = 0.05;
In this embodiment, a weight-inheritance mechanism is introduced: when a trunk node splits out new leaf nodes, each new leaf node inherits most of the weight of its parent node, with a small random perturbation added, which guarantees stable convergence of the evolving tree.
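Continuing the same sketch (which relies on the Node class above), the split of step 5 with the weight-inheritance rule w'(t) = (1 - β)·w(t) + β·r(t) and β = 0.05 might look like:

```python
import numpy as np

def split_node(trunk: Node, n: int = 3, beta: float = 0.05) -> list:
    """Split a node into n leaf children that inherit its weight with a small
    random perturbation: w'(t) = (1 - beta) * w(t) + beta * r(t)."""
    dim = trunk.weight.shape[0]
    for _ in range(n):
        r = np.random.randn(dim)
        r /= np.linalg.norm(r)                        # random unit vector r(t)
        child_weight = (1 - beta) * trunk.weight + beta * r
        trunk.children.append(Node(child_weight, trunk.depth + 1))
    return trunk.children                             # the node is now a trunk node
```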
Step 6: judge whether all data points in the data stream have been used for training; if not, use the next data point in the data stream to train the evolving tree, continue recording the number of times each node in the evolving tree becomes a best match point, and go to step 4; if so, go to step 7;
Step 7: judge whether the number of times the data stream has been formed is less than the second preset value; if so, re-form the data stream, train the evolving tree again, accumulate the counts of the trained nodes in the evolving tree becoming best match points, and go to step 4; if not, the evolving tree at this point is the trained evolving tree, and go to step 8; in this embodiment, the second preset value is 10;
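Steps 2 to 7 can be combined into an outer training loop. The sketch below simplifies the control flow (in particular, it re-forms the data stream once per pass rather than immediately after each split); the helper names are assumptions carried over from the sketches above, with the first preset value 60 and the second preset value 10 as in this embodiment:

```python
import numpy as np

def collect_leaves(node: Node) -> list:
    """All current leaf nodes of the subtree rooted at node
    (before any split, the root itself is the only candidate)."""
    if not node.children:
        return [node]
    leaves = []
    for child in node.children:
        leaves.extend(collect_leaves(child))
    return leaves

def train_evolving_tree(root, data, max_depth, split_limit=60, epochs=10, n=3):
    """Simplified sketch of steps 2-7: stream the data, update or split BMUs."""
    for _ in range(epochs):                           # second preset value
        for idx in np.random.permutation(len(data)):  # re-formed data stream
            x = data[idx]
            bmu = find_best_match(collect_leaves(root), x)
            bmu.bmu_count += 1
            if bmu.bmu_count < split_limit:           # first preset value
                update_weight(bmu, x)                 # step 4
            elif bmu.depth < max_depth:               # step 5
                split_node(bmu, n)
    return root
```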
As shown in Figs. 2 to 7, the evolving tree is trained on the data set shown in Fig. 2. The training data in Fig. 2 consist of 890 two-dimensional coordinate points forming 5 clusters in two-dimensional space. Figs. 3, 4, and 5 show the first three stages of the evolving-tree training process, where the dotted arrows indicate the growth direction of the leaf nodes at the current stage. In the initial stage, as shown in Fig. 3, there is only one root node R in the two-dimensional space. As training proceeds, in the second stage, as shown in Fig. 4, root node R splits out three leaf nodes A, B, C; in the third stage, as shown in Fig. 5, leaf nodes A, B, C split out their respective child nodes {A1, A2, A3}, {B1, B2, B3}, {C1, C2, C3}; Fig. 6 shows the evolving-tree structure of the third stage. From the positions of these leaf nodes it can be seen that, at this point, the evolving tree has roughly learned the topological structure of the data. Fig. 7 shows the spatial positions of all leaf nodes after the evolving tree completes training; it can be seen that the evolving tree has learned the topological structure of the training data.
Step 8: initialize Hamming codes for all nodes in the trained evolving tree except the root node, optimize the similarity-preserving loss function of the whole evolving tree with a greedy path-coding strategy, and take the Hamming codes corresponding to the minimum of the similarity-preserving loss function as the hash codes of the leaf nodes of the evolving tree;
The optimization objective is:
min E = Σ_{W_k ∈ N} F(W_k), with
F(W_k) = Σ_{i<j} ( d(w_i, w_j) - λ·d_h(b(w_i), b(w_j)) )^2
where E is the similarity-preserving loss of the whole evolving tree; W_k is the weight-vector set of trunk node k of the whole evolving tree, W_k = {w_1, w_2, ..., w_n}, and w_1, w_2, ..., w_n are the weight vectors of the n leaf nodes split out of trunk node k; N = {W_1, W_2, ..., W_c} is the set of all trunk nodes in the whole evolving tree; F(W_k) is the similarity-preserving loss function of the codes of the child nodes of each trunk node, where w_i is the weight vector of the i-th leaf node under trunk node k and w_j is the weight vector of the j-th leaf node under trunk node k; d(w_i, w_j) denotes the Euclidean distance between leaf nodes w_i and w_j; λ is a preset hyperparameter; b(w_i) denotes the Hamming code of leaf node w_i, b(w_j) denotes the Hamming code of leaf node w_j, and d_h(b(w_i), b(w_j)) denotes the Hamming distance between b(w_i) and b(w_j); in this embodiment, the effect is best when λ = 0.6;
Optimizing each F(W_k) is independent of the others, and the operations are identical, so a local Hamming codebook M_local shared by all trunk nodes can be designed. To strictly guarantee local similarity preservation between the child nodes of each trunk node, M_local must meet the following two requirements: 1. the number of Hamming codes equals the number n of child nodes of a trunk node, and the Hamming codes are pairwise distinct; 2. the Hamming distances between the codes lie in the range [1, n-1], and the distances from any Hamming code to each of the other Hamming codes differ.
In this embodiment, n = 3 and M_local = {00, 11, 01}. Since the number of leaf nodes does not significantly affect how well the evolving tree learns the topological structure of the original data, in order to reduce the algorithmic complexity of the coding part, the number of leaf nodes is generally set to 3 or 4, in which case the number of full permutations I of M_local is only 6 or 24. The optimal local coding is found by traversing I, so the time complexity of the coding algorithm is O(6n) or O(24n), which reduces coding complexity.
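A sketch of this local coding search, under the assumption (made explicit here, since the loss formula is reconstructed above rather than reproduced from the original) that F(W_k) penalizes the squared gap between the Euclidean distance and λ times the Hamming distance of sibling codes; it enumerates the 6 permutations of M_local over the children of one trunk node and keeps the cheapest assignment:

```python
from itertools import permutations
import numpy as np

M_LOCAL = ["00", "11", "01"]    # shared local Hamming codebook for n = 3

def hamming(a: str, b: str) -> int:
    return sum(c1 != c2 for c1, c2 in zip(a, b))

def local_loss(children, codes, lam=0.6):
    """Assumed form of F(W_k): sum over sibling pairs of
    (d(w_i, w_j) - lam * d_h(b(w_i), b(w_j)))^2."""
    loss = 0.0
    for i in range(len(children)):
        for j in range(i + 1, len(children)):
            d = float(np.linalg.norm(children[i].weight - children[j].weight))
            loss += (d - lam * hamming(codes[i], codes[j])) ** 2
    return loss

def assign_codes(trunk: Node, lam=0.6) -> None:
    """Traverse all permutations of M_local (6 when n = 3) and keep the best."""
    best = min(permutations(M_LOCAL),
               key=lambda p: local_loss(trunk.children, p, lam))
    for child, code in zip(trunk.children, best):
        child.code = code
```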
Step 9: compute the best match point of a given data point in the evolving tree, find the split path from the root node to the leaf node corresponding to the best match point of the data point, and, according to the hash codes of the leaf nodes of the evolving tree obtained in step 8, concatenate in order the hash codes of the corresponding leaf nodes along the split path of the best match point of the data point as the hash code of the data point; the hash code of the data point is expressed as: y = u_1 u_2 ... u_{dep-1}, where u_1 is the hash code at the corresponding node of depth 2 of the evolving tree, u_2 is the hash code at the corresponding node of depth 3 of the evolving tree, dep is the maximum depth of the evolving tree, and u_{dep-1} is the hash code at the corresponding node of the maximum depth of the evolving tree.
Since the evolving tree is not a balanced tree, if the maximum depth of the best match point corresponding to a data point x_i is less than the maximum depth of the evolving tree, then, to guarantee a uniform code length, the code at the maximum depth of that best match point is used to pad the missing codes. In this case, the hash code of the data point is expressed as: y = u_1 u_2 ... u_{max-1} ... u_{dep-1}, where u_1 is the code at the corresponding node of depth 2 of the evolving tree, u_2 is the code at the corresponding node of depth 3 of the evolving tree, max is the maximum depth of the best match point, u_max = u_{max+1} = ... = u_{dep-1} = u_{max-1}, dep is the maximum depth of the evolving tree, and u_{dep-1} is the code of the data point at the corresponding node of the maximum depth of the evolving tree.
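A sketch of step 9 together with the padding rule above; for simplicity it recovers the split path by descending greedily from the root to the best match leaf (all names are the sketch's own assumptions):

```python
def hash_code(root: Node, x, max_depth: int) -> str:
    """Concatenate the codes u_1 u_2 ... along the split path to x's best
    match leaf; pad short paths by repeating the deepest code."""
    codes, node = [], root
    while node.children:
        node = find_best_match(node.children, x)   # follow the best match down
        codes.append(node.code)                    # code assigned in step 8
    while len(codes) < max_depth - 1:              # leaf shallower than dep:
        codes.append(codes[-1])                    # repeat the deepest code
    return "".join(codes)
```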
In the above evolving-tree hashing, the leaf nodes obtained by training are converted into similarity-preserving binary codes with a greedy path-coding strategy, and the complexity of the hash coding of the evolving tree is small. By computing the best match point of any data point in the evolving tree and taking the corresponding code of the best match point in the evolving tree as the hash code of the data point, the similarity between data points can be obtained by computing the Hamming distance between their hash codes, which reduces computational complexity.
However, the scope of application of this evolving-tree hashing method is limited to short codes, and the query performance of short codes is usually worse than that of long codes, making it hard to handle practical tasks. Obtaining long codes with higher query performance from the evolving-tree hashing method is very difficult: analyzed from the angle of space-time overhead, the code length of evolving-tree hashing is proportional to the depth of the tree, and simply extending the code by increasing the depth of the evolving tree is impractical. As the depth of the evolving tree grows, the number of nodes of the whole evolving tree grows exponentially, and it is almost impossible to complete the training and path coding of the evolving tree within limited memory resources and an acceptable time; moreover, the number of leaf nodes would then far exceed the number of data points, so such a quantization scheme would be meaningless. To solve the long-code problem, and for the conciseness and efficiency of the coding, the Bagging method from parallel ensemble learning is used to extend the coding; meanwhile, considering that traditional Bagging needs to obtain all sample data and is not suitable for sampling streaming data, an unsupervised online hash learning method is proposed on the basis of Online-Bagging.
An unsupervised online hash learning method: create multiple evolving trees and form them into a forest in order, and train the forest with the above hash learning method based on an evolving tree. Form the data points in the data set into a data stream at random, and use the first data point in the data stream to train each evolving tree in the forest in turn: for each evolving tree in the forest, randomly sample a number from a Poisson distribution with intensity 1, denote it K, and use the data point in the data stream to train that evolving tree K times; train the forest with the data points in the data stream in turn. After training is completed, perform similarity-preserving coding on the leaf nodes of each evolving tree in the forest to obtain the hash code of each evolving tree in the forest; compute the best match point of data point x_i on each evolving tree in the forest, and concatenate in order the hash codes of the best match points of x_i on the corresponding evolving trees to form the hash code of x_i, expressed as: Y_i = y_i^1 y_i^2 ... y_i^T, where y_i^k denotes the code of the k-th evolving tree in the forest for data point x_i, and T denotes the total number of evolving trees in the forest.
Since the training of the evolving tree is online, Online-Bagging produces different random sub-sample sets and thereby introduces another source of randomness, namely the randomness of the growth direction of the evolving trees. After the evolving tree is extended to an evolving forest, the spatial distribution of the data can be captured at random from multiple positions in the original space; the more evolving trees there are, the more comprehensively the topological structure of the data is captured, which alleviates the defect of path coding.
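An assumed sketch of this Online-Bagging scheme (the tree wrapper fields root, split_limit, and max_depth are the sketch's own, and the helpers come from the sketches above): each incoming point trains each tree K times, with K drawn per tree from a Poisson distribution of intensity 1, and query codes are concatenated across the T trees:

```python
import numpy as np

def train_on_point(tree, x) -> None:
    """One streaming update (steps 3-5) of a single evolving tree."""
    bmu = find_best_match(collect_leaves(tree.root), x)
    bmu.bmu_count += 1
    if bmu.bmu_count < tree.split_limit:
        update_weight(bmu, x)
    elif bmu.depth < tree.max_depth:
        split_node(bmu)

def train_forest_online(forest, stream) -> None:
    """Online-Bagging: each tree sees each point K ~ Poisson(1) times."""
    for x in stream:
        for tree in forest:
            for _ in range(np.random.poisson(1.0)):
                train_on_point(tree, x)

def forest_code(forest, x) -> str:
    """Concatenate the per-tree codes y_i^1 ... y_i^T into the hash code of x."""
    return "".join(hash_code(tree.root, x, tree.max_depth) for tree in forest)
```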
Therefore, the unsupervised online hash learning method is a further improvement on the hash learning method based on an evolving tree: the code length can be adjusted flexibly by adjusting the depth and the number of the evolving trees, and the training and coding scheme is very concise yet has good query performance and can be applied to streaming data. Since the evolving trees in the forest are mutually independent, the method is also well suited to distributed platforms.

Claims (2)

1. A hash learning method based on an evolving tree, used to train the evolving tree with the data points x_i in a data set X to obtain a trained evolving tree, perform similarity-preserving coding on the trained evolving tree to obtain the hash code of each leaf node in the evolving tree, and compute the best match point of any data point in the evolving tree to obtain the hash code of that data point, characterized by comprising the following steps:
Step 1: create an evolving tree, where the initialized evolving tree has only one root node, and assign a weight vector to the root node;
Step 2: train the root node: form all data points in the data set into a data stream in random order, take the root node as the best match point of the first data point in the data stream, record the number of times the root node has become a best match point, and go to step 4;
Step 3: use the first data point in the data stream to train the leaf nodes of the evolving tree after splitting: compute the Euclidean distance between each node in the evolving tree and the data point, find the node with the smallest Euclidean distance to the data point, and judge whether that node is a leaf node; if so, take the currently trained node in the evolving tree as the best match point of the data point, record the number of times each leaf node in the evolving tree has become a best match point, and go to step 4; if not, go to step 6;
Step 4: for the root node and all leaf nodes in the evolving tree, perform the following in turn: judge whether the number of times the currently trained node in the evolving tree has become a best match point is less than the first preset value, where the currently trained node in the evolving tree is the root node or any leaf node; if so, update the weight vector of the currently trained node in the evolving tree and go to step 6; if not, go to step 5; the update formula for the weight vector of the currently trained node in the evolving tree is:
w_i(t+1) = x(t)
where w_i(t+1) is the weight vector of the currently trained node in the evolving tree after the update, w_i(t) is the weight vector of the currently trained node in the evolving tree before the update, and x(t) is the vector of the data point for which the currently trained node in the evolving tree is the best match point;
Step 5: judge whether the current depth of the currently trained node in the evolving tree is less than the maximum depth of the evolving tree, the maximum depth of the evolving tree being a preset value; if so, split the currently trained node in the evolving tree into n leaf nodes, assign a different weight vector to each leaf node, record the split node as a trunk node, re-form the data stream, count the number of times the data stream has been formed, and go to step 3; if not, the evolving tree at this point is the trained evolving tree, and go to step 8; the weight vectors of the n leaf nodes are computed as:
w'(t) = (1 - β)·w(t) + β·r(t)
where w'(t) is the weight vector of a new leaf node, w(t) is the weight vector of the trunk node corresponding to the new leaf node, r(t) is a random unit vector with the same dimension as w(t), and β is a preset hyperparameter that controls the degree of random perturbation;
Step 6: judge whether all data points in the data stream have been used for training; if not, use the next data point in the data stream to train the evolving tree, continue recording the number of times each node in the evolving tree becomes a best match point, and go to step 4; if so, go to step 7;
Step 7: judge whether the number of times the data stream has been formed is less than the second preset value; if so, re-form the data stream, train the evolving tree again, accumulate the counts of the trained nodes in the evolving tree becoming best match points, and go to step 4; if not, the evolving tree at this point is the trained evolving tree, and go to step 8;
Step 8: initialize Hamming codes for all nodes in the trained evolving tree except the root node, optimize the similarity-preserving loss function of the whole evolving tree with a greedy path-coding strategy, and take the Hamming codes corresponding to the minimum of the similarity-preserving loss function as the hash codes of the leaf nodes of the evolving tree;
The optimization objective is:
min E = Σ_{W_k ∈ N} F(W_k), with
F(W_k) = Σ_{i<j} ( d(w_i, w_j) - λ·d_h(b(w_i), b(w_j)) )^2
where E is the similarity-preserving loss of the whole evolving tree; W_k is the weight-vector set of trunk node k of the whole evolving tree, W_k = {w_1, w_2, ..., w_n}, and w_1, w_2, ..., w_n are the weight vectors of the n leaf nodes split out of trunk node k; N = {W_1, W_2, ..., W_c} is the set of all trunk nodes in the whole evolving tree; F(W_k) is the similarity-preserving loss function of the codes of the child nodes of each trunk node, where w_i is the weight vector of the i-th leaf node under trunk node k and w_j is the weight vector of the j-th leaf node under trunk node k; d(w_i, w_j) denotes the Euclidean distance between leaf nodes w_i and w_j; λ is a preset hyperparameter; b(w_i) denotes the Hamming code of leaf node w_i, b(w_j) denotes the Hamming code of leaf node w_j, and d_h(b(w_i), b(w_j)) denotes the Hamming distance between b(w_i) and b(w_j);
Step 9: compute the best match point of a given data point in the evolving tree, find the split path from the root node to the leaf node corresponding to the best match point of the data point, and, according to the hash codes of the leaf nodes of the evolving tree obtained in step 8, concatenate in order the hash codes of the corresponding leaf nodes along the split path of the best match point of the data point as the hash code of the data point, the hash code of the data point being expressed as: y = u_1 u_2 ... u_{dep-1}, where u_1 is the hash code at the corresponding node of depth 2 of the evolving tree, u_2 is the hash code at the corresponding node of depth 3 of the evolving tree, dep is the maximum depth of the evolving tree, and u_{dep-1} is the hash code at the corresponding node of the maximum depth of the evolving tree.
2. An unsupervised online hash learning method, characterized in that: multiple evolving trees are created and formed into a forest in order, and the forest is trained with the method according to claim 1; the data points in the data set are formed into a data stream at random, and the first data point in the data stream is used to train each evolving tree in the forest in turn: for each evolving tree in the forest, a number is randomly sampled from a Poisson distribution with intensity 1 and denoted K, and the data point in the data stream is used to train that evolving tree K times; the forest is trained with the data points in the data stream in turn; after training is completed, similarity-preserving coding is performed on the leaf nodes of each evolving tree in the forest to obtain the hash code of each evolving tree in the forest; the best match point of data point x_i on each evolving tree in the forest is computed, and the hash codes of the best match points of x_i on the corresponding evolving trees are concatenated in order to form the hash code of x_i, expressed as: Y_i = y_i^1 y_i^2 ... y_i^T, where y_i^k denotes the code of the k-th evolving tree in the forest for data point x_i, and T denotes the total number of evolving trees in the forest.
CN201910088472.XA 2019-01-30 2019-01-30 Hash learning method based on an evolving tree and its unsupervised online hash learning method Pending CN109829549A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201910088472.XA CN109829549A (en) 2019-01-30 2019-01-30 Hash learning method based on an evolving tree and its unsupervised online hash learning method
CN202010070802.5A CN111079949A (en) 2019-01-30 2020-01-21 Hash learning method, unsupervised online Hash learning method and application thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910088472.XA CN109829549A (en) 2019-01-30 2019-01-30 Hash learning method based on an evolving tree and its unsupervised online hash learning method

Publications (1)

Publication Number Publication Date
CN109829549A (en) 2019-05-31

Family

ID=66863000

Family Applications (2)

Application Number Title Priority Date Filing Date
CN201910088472.XA Pending CN109829549A (en) 2019-01-30 2019-01-30 Hash learning method based on an evolving tree and its unsupervised online hash learning method
CN202010070802.5A Pending CN111079949A (en) 2019-01-30 2020-01-21 Hash learning method, unsupervised online Hash learning method and application thereof

Family Applications After (1)

Application Number Title Priority Date Filing Date
CN202010070802.5A Pending CN111079949A (en) 2019-01-30 2020-01-21 Hash learning method, unsupervised online Hash learning method and application thereof

Country Status (1)

Country Link
CN (2) CN109829549A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110209867A (en) * 2019-06-05 2019-09-06 腾讯科技(深圳)有限公司 Training method, device, equipment and the storage medium of image encrypting algorithm
CN110674335A (en) * 2019-09-16 2020-01-10 重庆邮电大学 Hash code and image bidirectional conversion method based on multi-generation and multi-countermeasure
CN110909027A (en) * 2019-10-17 2020-03-24 宁波大学 Hash retrieval method
CN111079949A (en) * 2019-01-30 2020-04-28 宁波大学 Hash learning method, unsupervised online Hash learning method and application thereof
CN111078911A (en) * 2019-12-13 2020-04-28 宁波大学 Unsupervised hashing method based on self-encoder
CN111625258A (en) * 2020-05-22 2020-09-04 深圳前海微众银行股份有限公司 Mercker tree updating method, device, equipment and readable storage medium
CN112699942A (en) * 2020-12-30 2021-04-23 东软睿驰汽车技术(沈阳)有限公司 Operating vehicle identification method, device, equipment and storage medium

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108830333A (en) * 2018-06-22 2018-11-16 河南广播电视大学 A kind of nearest neighbor search method based on three times bit quantization and non symmetrical distance

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106777038B (en) * 2016-12-09 2019-06-14 厦门大学 A kind of ultralow complexity image search method retaining Hash based on sequence
CN108182256A (en) * 2017-12-31 2018-06-19 厦门大学 It is a kind of based on the discrete efficient image search method for being locally linear embedding into Hash
CN109166615B (en) * 2018-07-11 2021-09-10 重庆邮电大学 Medical CT image storage and retrieval method based on random forest hash
CN109829549A (en) * 2019-01-30 2019-05-31 宁波大学 Hash learning method and its unsupervised online Hash learning method based on the tree that develops

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108830333A (en) * 2018-06-22 2018-11-16 河南广播电视大学 A kind of nearest neighbor search method based on three times bit quantization and non symmetrical distance

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111079949A (en) * 2019-01-30 2020-04-28 宁波大学 Hash learning method, unsupervised online Hash learning method and application thereof
CN110209867A (en) * 2019-06-05 2019-09-06 腾讯科技(深圳)有限公司 Training method, device, equipment and the storage medium of image encrypting algorithm
CN110209867B (en) * 2019-06-05 2023-05-16 腾讯科技(深圳)有限公司 Training method, device, equipment and storage medium for image retrieval model
CN110674335A (en) * 2019-09-16 2020-01-10 重庆邮电大学 Hash code and image bidirectional conversion method based on multi-generation and multi-countermeasure
CN110674335B (en) * 2019-09-16 2022-08-23 重庆邮电大学 Hash code and image bidirectional conversion method based on multiple generation and multiple countermeasures
CN110909027A (en) * 2019-10-17 2020-03-24 宁波大学 Hash retrieval method
CN110909027B (en) * 2019-10-17 2022-04-01 宁波大学 Hash retrieval method
CN111078911B (en) * 2019-12-13 2022-03-22 宁波大学 Unsupervised hashing method based on self-encoder
CN111078911A (en) * 2019-12-13 2020-04-28 宁波大学 Unsupervised hashing method based on self-encoder
WO2021233182A1 (en) * 2020-05-22 2021-11-25 深圳前海微众银行股份有限公司 Merkle tree updating method, apparatus and device, and readable storage medium
CN111625258B (en) * 2020-05-22 2021-08-27 深圳前海微众银行股份有限公司 Mercker tree updating method, device, equipment and readable storage medium
CN111625258A (en) * 2020-05-22 2020-09-04 深圳前海微众银行股份有限公司 Mercker tree updating method, device, equipment and readable storage medium
CN112699942A (en) * 2020-12-30 2021-04-23 东软睿驰汽车技术(沈阳)有限公司 Operating vehicle identification method, device, equipment and storage medium

Also Published As

Publication number Publication date
CN111079949A (en) 2020-04-28

Similar Documents

Publication Publication Date Title
CN109829549A (en) Hash learning method based on an evolving tree and its unsupervised online hash learning method
CN106503106B (en) Image hash index construction method based on deep learning
CN113868366B (en) Online cross-modal retrieval method and system for streaming data
CN107220180A (en) Code classification method based on a neural network language model
CN108734223A (en) Social network friend recommendation method based on community division
CN109818971B (en) Network data anomaly detection method and system based on high-order association mining
CN107729290A (en) Representation learning method for very large graphs using locality-sensitive hashing optimization
CN108710948A (en) Transfer learning method based on cluster balancing and weight matrix optimization
CN104915388B (en) Book label recommendation method based on spectral clustering and crowdsourcing
Wang et al. A new approach of obtaining reservoir operation rules: Artificial immune recognition system
CN108876595A (en) P2P personal credit assessment method and device based on data mining
CN116862024A (en) Trustworthy personalized federated learning method and device based on clustering and knowledge distillation
CN115828143A (en) Node classification method realizing heterogeneous meta-path aggregation based on graph convolution and a self-attention mechanism
CN114580763A (en) Power load forecasting method based on an improved dragonfly algorithm and a lightweight gradient boosting tree model
CN115544029A (en) Data processing method and related device
CN107886132A (en) Time-series method and system for music volume forecasting
Dong et al. Knowledge Restore and Transfer for Multi-Label Class-Incremental Learning
CN115905903A (en) Multi-view clustering method and system based on a graph-attention autoencoder
CN115099309A (en) Method for designing a cost evaluation model for storage and indexing of graph data
CN115131362A (en) Local region feature coding method for large-scale point clouds
Sarkar et al. Accuracy-based learning classification system
Koli et al. Parallel decision tree with MapReduce model for big data analytics
CN112817959A (en) Construction method of a paleontological phylogenetic tree based on multi-metric index weights
Wang et al. Prediction model of glutamic acid production based on data mining in the R language
CN109766371B (en) Hash ranking method based on listwise supervision

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20190531