CN109829549A - Hash learning method based on an evolving tree and its unsupervised online hash learning method - Google Patents

Hash learning method based on an evolving tree and its unsupervised online hash learning method

Info

Publication number
CN109829549A
Authority
CN
China
Prior art keywords
tree
node
evolving
data
hash
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910088472.XA
Other languages
Chinese (zh)
Inventor
寿震宇
钱江波
杨安邦
袁明汶
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ningbo University
Original Assignee
Ningbo University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ningbo University
Priority to CN201910088472.XA
Publication of CN109829549A
Priority to CN202010070802.5A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/243 Classification techniques relating to the number of classes
    • G06F 18/24323 Tree-organised classifiers

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The present invention relates to a hash learning method based on an evolving tree. The evolving tree is trained with the data points in a data set to obtain a trained evolving tree; all nodes of the trained evolving tree except the root node are initialized with Hamming codes; the similarity-preserving loss function of the whole evolving tree is optimized with a greedy path-coding strategy, and the Hamming codes corresponding to the minimum of the similarity-preserving loss function are taken as the hash codes of the leaf nodes of the evolving tree. To encode a data point, its best match point in the evolving tree is computed, the split path from the root node to the leaf node corresponding to that best match point is found, and the hash codes of the corresponding leaf nodes along the split path are concatenated in order as the hash code of the data point. An unsupervised online hash learning method is also disclosed. The hashing method reduces coding complexity and has good query performance.

Description

Hash learning method based on an evolving tree and its unsupervised online hash learning method
Technical field
The present invention relates to the field of data processing, in particular to a hash learning method based on an evolving tree and its unsupervised online hash learning method.
Background technique
With the rapid development of the Internet and all kinds of electronic devices, data of every type, such as text, images, and video, have grown explosively. In many application scenarios, people need to retrieve related content from such large-scale data. However, on large-scale data, the computation time required to find the exact nearest neighbors of a given query point is unacceptable. To solve this problem, a large body of recent research has turned to approximate nearest neighbor (ANN) search: on large-scale data, ANN retrieval can replace exact nearest-neighbor retrieval at much higher speed. ANN retrieval based on hash learning is one of the better-known ANN retrieval techniques. It uses machine learning to map data points into Hamming space, replacing the Euclidean distance of the original data with the Hamming distance, which greatly reduces retrieval time and storage cost while preserving accuracy. In recent years, many excellent hash learning algorithms have emerged; according to whether the learning model uses the label information of the samples, they can be divided into unsupervised models and supervised models. Since obtaining label information requires enormous labor cost, unsupervised hash learning algorithms have found wider application.
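As a brief illustration of the mechanism described above (not part of the original disclosure), the Hamming distance between two binary codes can be computed with an XOR followed by a bit count; a minimal Python sketch, with an illustrative function name:

```python
def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two binary codes differ."""
    return bin(a ^ b).count("1")

# Two 4-bit codes that differ in exactly one position have distance 1.
assert hamming_distance(0b1011, 0b0011) == 1
```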
In general, current mainstream hashing algorithms fall into two classes: data-independent hashing and data-dependent hashing. In data-independent hashing, the family of hash functions is generated independently of the data set. The typical representative is locality-sensitive hashing (LSH), which builds hash tables from a set of random hash functions so that similar data points are mapped into the same hash bucket with high probability; its drawback is that the index construction is data-independent, so it performs poorly in retrieval on real large-scale data sets. Data-dependent hashing, also known as hash learning, maps data to similarity-preserving binary codes through machine learning and is a typical application of machine learning techniques in the field of data retrieval. The most important goal of hash learning is to make the hash codes similarity-preserving: specifically, two data points that are close in the original space should still have a small Hamming distance after being mapped into Hamming space, while data points that are far apart should still have a large Hamming distance after the mapping. In recent years, many hash learning algorithms have been proposed; according to whether the learning model uses the label information of the samples, they can be divided into unsupervised and supervised hashing algorithms. Famous representatives of unsupervised hashing include principal component analysis hashing (PCAH), iterative quantization (ITQ), and K-means hashing (KMH). PCAH projects the input data space into a low-dimensional space by principal component analysis and then maps the low-dimensional data to hash codes. ITQ tries to find a rotation of the original data that minimizes the quantization loss when the original data is mapped to binary codes. KMH designs hash codes from the angle of clustering: its basic idea is to cluster the data into K classes and, within each class, quantize the data to the value of the cluster center with a vector quantization strategy, while each cluster center is coded according to a similarity-preserving principle; in the query stage, the distance between data points x and y is approximated by the Hamming distance of the hash codes of their corresponding cluster centers. Supervised hashing algorithms mainly include RBM, BRE, MFH, IMH, and MLH. Although supervised hashing shows higher retrieval accuracy than unsupervised hashing methods, its training requires label information; in the era of massive data, data scale is large, update speed is fast, and obtaining data labels usually requires enormous labor cost, so unsupervised hashing is more meaningful in practical applications. However, most unsupervised hashing algorithms need to load all data at once, occupy a large amount of memory, and cannot be applied to streaming data, and related research is scarce.
Summary of the invention
The first technical problem to be solved by the present invention, in view of the above-described prior art, is to provide a hash learning method based on an evolving tree that makes the evolving tree converge stably and reduces coding complexity.
The second technical problem to be solved by the present invention, in view of the above-described prior art, is to provide an unsupervised online hash learning method that uses the above hash learning method based on an evolving tree; this method has good query performance and can be applied to streaming data.
The technical solution adopted by the present invention to solve the above first technical problem is: a hash learning method based on an evolving tree, used to train the evolving tree with the data points x_i in a data set X to obtain a trained evolving tree, perform similarity-preserving coding on the trained evolving tree to obtain the hash code of each leaf node in the evolving tree, and compute the best match point of any data point in the evolving tree to obtain the hash code of that data point, characterized by comprising the following steps:
Step 1: create an evolving tree, where the initialized evolving tree has only one root node, and assign a weight vector to the root node;
Step 2: train the root node: form all data points in the data set into a data stream in random order, take the root node as the best match point of the first data point in the data stream, record the number of times the root node has become a best match point, and go to step 4;
Step 3: use the first data point in the data stream to train the leaf nodes of the evolving tree after splitting: compute the Euclidean distance between each node in the evolving tree and the data point, find the node with the smallest Euclidean distance to the data point, and judge whether that node is a leaf node; if so, take the currently trained node in the evolving tree as the best match point of the data point, record the number of times each leaf node in the evolving tree has become a best match point, and go to step 4; if not, go to step 6;
Step 4: for the root node and all leaf nodes in the evolving tree, perform the following in turn: judge whether the number of times the currently trained node in the evolving tree has become a best match point is less than the first preset value, where the currently trained node in the evolving tree is the root node or any leaf node; if so, update the weight vector of the currently trained node in the evolving tree and go to step 6; if not, go to step 5; the update formula for the weight vector of the currently trained node in the evolving tree is:
w_i(t+1) = x(t)
where w_i(t+1) is the weight vector of the currently trained node in the evolving tree after the update, w_i(t) is the weight vector of the currently trained node in the evolving tree before the update, and x(t) is the vector of the data point for which the currently trained node in the evolving tree is the best match point;
Step 5: judge whether the current depth of the currently trained node in the evolving tree is less than the maximum depth of the evolving tree, the maximum depth of the evolving tree being a preset value; if so, split the currently trained node in the evolving tree into n leaf nodes, assign a different weight vector to each leaf node, record the split node as a trunk node, re-form the data stream, count the number of times the data stream has been formed, and go to step 3; if not, the evolving tree at this point is the trained evolving tree, and go to step 8; the weight vectors of the n leaf nodes are computed as:
w'(t) = (1 - β)·w(t) + β·r(t)
where w'(t) is the weight vector of a new leaf node, w(t) is the weight vector of the trunk node corresponding to the new leaf node, r(t) is a random unit vector with the same dimension as w(t), and β is a preset hyperparameter that controls the degree of random perturbation;
Step 6: judge whether all data points in the data stream have been used for training; if not, use the next data point in the data stream to train the evolving tree, continue recording the number of times each node in the evolving tree becomes a best match point, and go to step 4; if so, go to step 7;
Step 7: judge whether the number of times the data stream has been formed is less than the second preset value; if so, re-form the data stream, train the evolving tree again, accumulate the counts of the trained nodes in the evolving tree becoming best match points, and go to step 4; if not, the evolving tree at this point is the trained evolving tree, and go to step 8;
Step 8: initialize Hamming codes for all nodes in the trained evolving tree except the root node, optimize the similarity-preserving loss function of the whole evolving tree with a greedy path-coding strategy, and take the Hamming codes corresponding to the minimum of the similarity-preserving loss function as the hash codes of the leaf nodes of the evolving tree;
The optimization objective is:
min E = Σ_{W_k ∈ N} F(W_k), with
F(W_k) = Σ_{i<j} ( d(w_i, w_j) - λ·d_h(b(w_i), b(w_j)) )^2
where E is the similarity-preserving loss of the whole evolving tree; W_k is the weight-vector set of trunk node k of the whole evolving tree, W_k = {w_1, w_2, ..., w_n}, and w_1, w_2, ..., w_n are the weight vectors of the n leaf nodes split out of trunk node k; N = {W_1, W_2, ..., W_c} is the set of all trunk nodes in the whole evolving tree; F(W_k) is the similarity-preserving loss function of the codes of the child nodes of each trunk node, where w_i is the weight vector of the i-th leaf node under trunk node k and w_j is the weight vector of the j-th leaf node under trunk node k; d(w_i, w_j) denotes the Euclidean distance between leaf nodes w_i and w_j; λ is a preset hyperparameter; b(w_i) denotes the Hamming code of leaf node w_i, b(w_j) denotes the Hamming code of leaf node w_j, and d_h(b(w_i), b(w_j)) denotes the Hamming distance between b(w_i) and b(w_j);
Step 9: compute the best match point of a given data point in the evolving tree, find the split path from the root node to the leaf node corresponding to the best match point of the data point, and, according to the hash codes of the leaf nodes of the evolving tree obtained in step 8, concatenate the hash codes of the corresponding leaf nodes along the split path of the best match point in order of increasing depth as the hash code of the data point; the hash code of the data point is expressed as: y = u_1 u_2 ... u_{dep-1}, where u_1 is the hash code at the corresponding node of depth 2 of the evolving tree, u_2 is the hash code at the corresponding node of depth 3 of the evolving tree, dep is the maximum depth of the evolving tree, and u_{dep-1} is the hash code at the corresponding node of the maximum depth of the evolving tree.
The technical solution adopted by the present invention to solve the above second technical problem is: an unsupervised online hash learning method, characterized in that multiple evolving trees are created and formed into a forest in order, and the forest is trained with the above hash learning method based on an evolving tree. The data points in the data set are formed into a data stream at random, and the first data point in the data stream is used to train each evolving tree in the forest in turn: for each evolving tree in the forest, a number is randomly sampled from a Poisson distribution with intensity 1 and denoted K, and the data point in the data stream is used to train that evolving tree K times; the forest is trained with the data points in the data stream in turn. After training is completed, similarity-preserving coding is performed on the leaf nodes of each evolving tree in the forest to obtain the hash code of each evolving tree in the forest; the best match point of data point x_i on each evolving tree in the forest is computed, and the hash codes of the best match points of x_i on the corresponding evolving trees are concatenated in order to form the hash code of x_i, expressed as: Y_i = y_i^1 y_i^2 ... y_i^T, where y_i^k denotes the code of the k-th evolving tree in the forest for data point x_i, and T denotes the total number of evolving trees in the forest.
Compared with the prior art, the advantages of the present invention are as follows: hash learning is carried out on an evolving tree; during the training of the evolving tree, a weight-inheritance mechanism is introduced and only the best match point is adjusted in each update, which makes the training process simpler and keeps the evolving tree as balanced as possible and stably convergent; the similarity-preserving loss function is optimized with a greedy path-coding strategy, realizing local similarity preservation between child nodes; furthermore, a forest hash learning method is proposed on the basis of the evolving-tree hash learning method, which uses longer code lengths, achieves better query performance, and can be applied to streaming data.
Detailed description of the invention
Fig. 1 is the training flow chart of the evolving tree in an embodiment of the present invention;
Fig. 2 shows the original data-space distribution in this embodiment;
Fig. 3 shows the initial stage of the evolving tree in this embodiment;
Fig. 4 shows the second stage, after the first split of the evolving tree in Fig. 3;
Fig. 5 shows the third stage, after the second split of the evolving tree in Fig. 3;
Fig. 6 shows the structure of the evolving tree in Fig. 5;
Fig. 7 shows the leaf-node distribution after the training of the evolving tree in Fig. 3 is completed.
Specific embodiment
The present invention will be described in further detail below with reference to the accompanying drawings and embodiments.
As shown in Fig. 1, a hash learning method based on an evolving tree is used to train the evolving tree with the data points x_i in a data set X to obtain a trained evolving tree, perform similarity-preserving coding on the trained evolving tree to obtain the hash code of each leaf node in the evolving tree, and compute the best match point of any data point in the evolving tree to obtain the hash code of that data point, comprising the following steps:
Step 1: create an evolving tree, where the initialized evolving tree has only one root node, and assign a weight vector to the root node;
Step 2: train the root node: form all data points in the data set into a data stream in random order, take the root node as the best match point of the first data point in the data stream, record the number of times the root node has become a best match point, and go to step 4;
Step 3: use the first data point in the data stream to train the leaf nodes of the evolving tree after splitting: compute the Euclidean distance between each node in the evolving tree and the data point, find the node with the smallest Euclidean distance to the data point, and judge whether that node is a leaf node; if so, take the currently trained node in the evolving tree as the best match point of the data point, record the number of times each leaf node in the evolving tree has become a best match point, and go to step 4; if not, go to step 6;
Step 4: for the root node and all leaf nodes in the evolving tree, perform the following in turn: judge whether the number of times the currently trained node in the evolving tree has become a best match point is less than the first preset value, where the currently trained node in the evolving tree is the root node or any leaf node; if so, update the weight vector of the currently trained node in the evolving tree and go to step 6; if not, go to step 5; the update formula for the weight vector of the currently trained node in the evolving tree is:
w_i(t+1) = x(t)
where w_i(t+1) is the weight vector of the currently trained node in the evolving tree after the update, w_i(t) is the weight vector of the currently trained node in the evolving tree before the update, and x(t) is the vector of the data point for which the currently trained node in the evolving tree is the best match point; in this embodiment, the first preset value is 60;
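As an illustrative sketch only (the Node structure and function names are assumptions, not the patent's reference implementation), the best-match search of step 3 and the update rule of step 4 can be written as:

```python
import numpy as np

class Node:
    """Evolving-tree node: a weight vector, a depth, children, and a BMU counter."""
    def __init__(self, weight: np.ndarray, depth: int = 1):
        self.weight = weight      # weight vector w_i(t)
        self.depth = depth        # the root node has depth 1
        self.children = []        # an empty list means this is a leaf node
        self.bmu_count = 0        # times this node became a best match point
        self.code = ""            # local Hamming code, assigned in step 8

def find_best_match(nodes, x: np.ndarray) -> Node:
    """Step 3: the node with the smallest Euclidean distance to data point x."""
    return min(nodes, key=lambda n: float(np.linalg.norm(n.weight - x)))

def update_weight(node: Node, x: np.ndarray) -> None:
    """Step 4 update rule w_i(t+1) = x(t): the best match point adopts x."""
    node.weight = x.copy()
```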
Step 5: judge whether the current depth of the currently trained node in the evolving tree is less than the maximum depth of the evolving tree, the maximum depth of the evolving tree being a preset value; if so, split the currently trained node in the evolving tree into n leaf nodes, assign a different weight vector to each leaf node, record the split node as a trunk node, re-form the data stream, count the number of times the data stream has been formed, and go to step 3; if not, the evolving tree at this point is the trained evolving tree, and go to step 8; the weight vectors of the n leaf nodes are computed as:
w'(t) = (1 - β)·w(t) + β·r(t)
where w'(t) is the weight vector of a new leaf node, w(t) is the weight vector of the trunk node corresponding to the new leaf node, r(t) is a random unit vector with the same dimension as w(t), and β is a preset hyperparameter that controls the degree of random perturbation; in this embodiment, β = 0.05;
In this embodiment, a weight-inheritance mechanism is introduced: when a trunk node splits out new leaf nodes, each new leaf node inherits most of the weight of its parent node, with a small random perturbation added, which guarantees stable convergence of the evolving tree.
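Continuing the same sketch (which relies on the Node class above), the split of step 5 with the weight-inheritance rule w'(t) = (1 - β)·w(t) + β·r(t) and β = 0.05 might look like:

```python
import numpy as np

def split_node(trunk: Node, n: int = 3, beta: float = 0.05) -> list:
    """Split a node into n leaf children that inherit its weight with a small
    random perturbation: w'(t) = (1 - beta) * w(t) + beta * r(t)."""
    dim = trunk.weight.shape[0]
    for _ in range(n):
        r = np.random.randn(dim)
        r /= np.linalg.norm(r)                        # random unit vector r(t)
        child_weight = (1 - beta) * trunk.weight + beta * r
        trunk.children.append(Node(child_weight, trunk.depth + 1))
    return trunk.children                             # the node is now a trunk node
```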
Step 6: judge whether all data points in the data stream have been used for training; if not, use the next data point in the data stream to train the evolving tree, continue recording the number of times each node in the evolving tree becomes a best match point, and go to step 4; if so, go to step 7;
Step 7: judge whether the number of times the data stream has been formed is less than the second preset value; if so, re-form the data stream, train the evolving tree again, accumulate the counts of the trained nodes in the evolving tree becoming best match points, and go to step 4; if not, the evolving tree at this point is the trained evolving tree, and go to step 8; in this embodiment, the second preset value is 10;
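Steps 2 to 7 can be combined into an outer training loop. The sketch below simplifies the control flow (in particular, it re-forms the data stream once per pass rather than immediately after each split); the helper names are assumptions carried over from the sketches above, with the first preset value 60 and the second preset value 10 as in this embodiment:

```python
import numpy as np

def collect_leaves(node: Node) -> list:
    """All current leaf nodes of the subtree rooted at node
    (before any split, the root itself is the only candidate)."""
    if not node.children:
        return [node]
    leaves = []
    for child in node.children:
        leaves.extend(collect_leaves(child))
    return leaves

def train_evolving_tree(root, data, max_depth, split_limit=60, epochs=10, n=3):
    """Simplified sketch of steps 2-7: stream the data, update or split BMUs."""
    for _ in range(epochs):                           # second preset value
        for idx in np.random.permutation(len(data)):  # re-formed data stream
            x = data[idx]
            bmu = find_best_match(collect_leaves(root), x)
            bmu.bmu_count += 1
            if bmu.bmu_count < split_limit:           # first preset value
                update_weight(bmu, x)                 # step 4
            elif bmu.depth < max_depth:               # step 5
                split_node(bmu, n)
    return root
```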
As shown in Figs. 2 to 7, the evolving tree is trained on the data set shown in Fig. 2. The training data in Fig. 2 consist of 890 two-dimensional coordinate points forming 5 clusters in two-dimensional space. Figs. 3, 4, and 5 show the first three stages of the evolving-tree training process, where the dotted arrows indicate the growth direction of the leaf nodes at the current stage. In the initial stage, as shown in Fig. 3, there is only one root node R in the two-dimensional space. As training proceeds, in the second stage, as shown in Fig. 4, root node R splits out three leaf nodes A, B, C; in the third stage, as shown in Fig. 5, leaf nodes A, B, C split out their respective child nodes {A1, A2, A3}, {B1, B2, B3}, {C1, C2, C3}; Fig. 6 shows the evolving-tree structure of the third stage. From the positions of these leaf nodes it can be seen that, at this point, the evolving tree has roughly learned the topological structure of the data. Fig. 7 shows the spatial positions of all leaf nodes after the evolving tree completes training; it can be seen that the evolving tree has learned the topological structure of the training data.
Step 8: initialize Hamming codes for all nodes in the trained evolving tree except the root node, optimize the similarity-preserving loss function of the whole evolving tree with a greedy path-coding strategy, and take the Hamming codes corresponding to the minimum of the similarity-preserving loss function as the hash codes of the leaf nodes of the evolving tree;
The optimization objective is:
min E = Σ_{W_k ∈ N} F(W_k), with
F(W_k) = Σ_{i<j} ( d(w_i, w_j) - λ·d_h(b(w_i), b(w_j)) )^2
where E is the similarity-preserving loss of the whole evolving tree; W_k is the weight-vector set of trunk node k of the whole evolving tree, W_k = {w_1, w_2, ..., w_n}, and w_1, w_2, ..., w_n are the weight vectors of the n leaf nodes split out of trunk node k; N = {W_1, W_2, ..., W_c} is the set of all trunk nodes in the whole evolving tree; F(W_k) is the similarity-preserving loss function of the codes of the child nodes of each trunk node, where w_i is the weight vector of the i-th leaf node under trunk node k and w_j is the weight vector of the j-th leaf node under trunk node k; d(w_i, w_j) denotes the Euclidean distance between leaf nodes w_i and w_j; λ is a preset hyperparameter; b(w_i) denotes the Hamming code of leaf node w_i, b(w_j) denotes the Hamming code of leaf node w_j, and d_h(b(w_i), b(w_j)) denotes the Hamming distance between b(w_i) and b(w_j); in this embodiment, the effect is best when λ = 0.6;
Optimizing each F(W_k) is independent of the others, and the operations are identical, so a local Hamming codebook M_local shared by all trunk nodes can be designed. To strictly guarantee local similarity preservation between the child nodes of each trunk node, M_local must meet the following two requirements: 1. the number of Hamming codes equals the number n of child nodes of a trunk node, and the Hamming codes are pairwise distinct; 2. the Hamming distances between the codes lie in the range [1, n-1], and the distances from any Hamming code to each of the other Hamming codes differ.
In this embodiment, n = 3 and M_local = {00, 11, 01}. Since the number of leaf nodes does not significantly affect how well the evolving tree learns the topological structure of the original data, in order to reduce the algorithmic complexity of the coding part, the number of leaf nodes is generally set to 3 or 4, in which case the number of full permutations I of M_local is only 6 or 24. The optimal local coding is found by traversing I, so the time complexity of the coding algorithm is O(6n) or O(24n), which reduces coding complexity.
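A sketch of this local coding search, under the assumption (made explicit here, since the loss formula is reconstructed above rather than reproduced from the original) that F(W_k) penalizes the squared gap between the Euclidean distance and λ times the Hamming distance of sibling codes; it enumerates the 6 permutations of M_local over the children of one trunk node and keeps the cheapest assignment:

```python
from itertools import permutations
import numpy as np

M_LOCAL = ["00", "11", "01"]    # shared local Hamming codebook for n = 3

def hamming(a: str, b: str) -> int:
    return sum(c1 != c2 for c1, c2 in zip(a, b))

def local_loss(children, codes, lam=0.6):
    """Assumed form of F(W_k): sum over sibling pairs of
    (d(w_i, w_j) - lam * d_h(b(w_i), b(w_j)))^2."""
    loss = 0.0
    for i in range(len(children)):
        for j in range(i + 1, len(children)):
            d = float(np.linalg.norm(children[i].weight - children[j].weight))
            loss += (d - lam * hamming(codes[i], codes[j])) ** 2
    return loss

def assign_codes(trunk: Node, lam=0.6) -> None:
    """Traverse all permutations of M_local (6 when n = 3) and keep the best."""
    best = min(permutations(M_LOCAL),
               key=lambda p: local_loss(trunk.children, p, lam))
    for child, code in zip(trunk.children, best):
        child.code = code
```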
Step 9: compute the best match point of a given data point in the evolving tree, find the split path from the root node to the leaf node corresponding to the best match point of the data point, and, according to the hash codes of the leaf nodes of the evolving tree obtained in step 8, concatenate in order the hash codes of the corresponding leaf nodes along the split path of the best match point of the data point as the hash code of the data point; the hash code of the data point is expressed as: y = u_1 u_2 ... u_{dep-1}, where u_1 is the hash code at the corresponding node of depth 2 of the evolving tree, u_2 is the hash code at the corresponding node of depth 3 of the evolving tree, dep is the maximum depth of the evolving tree, and u_{dep-1} is the hash code at the corresponding node of the maximum depth of the evolving tree.
Since the evolving tree is not a balanced tree, if the maximum depth of the best match point corresponding to a data point x_i is less than the maximum depth of the evolving tree, then, to guarantee a uniform code length, the code at the maximum depth of that best match point is used to pad the missing codes. In this case, the hash code of the data point is expressed as: y = u_1 u_2 ... u_{max-1} ... u_{dep-1}, where u_1 is the code at the corresponding node of depth 2 of the evolving tree, u_2 is the code at the corresponding node of depth 3 of the evolving tree, max is the maximum depth of the best match point, u_max = u_{max+1} = ... = u_{dep-1} = u_{max-1}, dep is the maximum depth of the evolving tree, and u_{dep-1} is the code of the data point at the corresponding node of the maximum depth of the evolving tree.
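A sketch of step 9 together with the padding rule above; for simplicity it recovers the split path by descending greedily from the root to the best match leaf (all names are the sketch's own assumptions):

```python
def hash_code(root: Node, x, max_depth: int) -> str:
    """Concatenate the codes u_1 u_2 ... along the split path to x's best
    match leaf; pad short paths by repeating the deepest code."""
    codes, node = [], root
    while node.children:
        node = find_best_match(node.children, x)   # follow the best match down
        codes.append(node.code)                    # code assigned in step 8
    while len(codes) < max_depth - 1:              # leaf shallower than dep:
        codes.append(codes[-1])                    # repeat the deepest code
    return "".join(codes)
```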
In the above evolving-tree hashing, the leaf nodes obtained by training are converted into similarity-preserving binary codes with a greedy path-coding strategy, and the complexity of the hash coding of the evolving tree is small. By computing the best match point of any data point in the evolving tree and taking the corresponding code of the best match point in the evolving tree as the hash code of the data point, the similarity between data points can be obtained by computing the Hamming distance between their hash codes, which reduces computational complexity.
However, the scope of application of this evolving-tree hashing method is limited to short codes, and the query performance of short codes is usually worse than that of long codes, making it hard to handle practical tasks. Obtaining long codes with higher query performance from the evolving-tree hashing method is very difficult: analyzed from the angle of space-time overhead, the code length of evolving-tree hashing is proportional to the depth of the tree, and simply extending the code by increasing the depth of the evolving tree is impractical. As the depth of the evolving tree grows, the number of nodes of the whole evolving tree grows exponentially, and it is almost impossible to complete the training and path coding of the evolving tree within limited memory resources and an acceptable time; moreover, the number of leaf nodes would then far exceed the number of data points, so such a quantization scheme would be meaningless. To solve the long-code problem, and for the conciseness and efficiency of the coding, the Bagging method from parallel ensemble learning is used to extend the coding; meanwhile, considering that traditional Bagging needs to obtain all sample data and is not suitable for sampling streaming data, an unsupervised online hash learning method is proposed on the basis of Online-Bagging.
An unsupervised online hash learning method: create multiple evolving trees and form them into a forest in order, and train the forest with the above hash learning method based on an evolving tree. Form the data points in the data set into a data stream at random, and use the first data point in the data stream to train each evolving tree in the forest in turn: for each evolving tree in the forest, randomly sample a number from a Poisson distribution with intensity 1, denote it K, and use the data point in the data stream to train that evolving tree K times; train the forest with the data points in the data stream in turn. After training is completed, perform similarity-preserving coding on the leaf nodes of each evolving tree in the forest to obtain the hash code of each evolving tree in the forest; compute the best match point of data point x_i on each evolving tree in the forest, and concatenate in order the hash codes of the best match points of x_i on the corresponding evolving trees to form the hash code of x_i, expressed as: Y_i = y_i^1 y_i^2 ... y_i^T, where y_i^k denotes the code of the k-th evolving tree in the forest for data point x_i, and T denotes the total number of evolving trees in the forest.
Since the training of the evolving tree is online, Online-Bagging produces different random sub-sample sets and thereby introduces another source of randomness, namely the randomness of the growth direction of the evolving trees. After the evolving tree is extended to an evolving forest, the spatial distribution of the data can be captured at random from multiple positions in the original space; the more evolving trees there are, the more comprehensively the topological structure of the data is captured, which alleviates the defect of path coding.
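An assumed sketch of this Online-Bagging scheme (the tree wrapper fields root, split_limit, and max_depth are the sketch's own, and the helpers come from the sketches above): each incoming point trains each tree K times, with K drawn per tree from a Poisson distribution of intensity 1, and query codes are concatenated across the T trees:

```python
import numpy as np

def train_on_point(tree, x) -> None:
    """One streaming update (steps 3-5) of a single evolving tree."""
    bmu = find_best_match(collect_leaves(tree.root), x)
    bmu.bmu_count += 1
    if bmu.bmu_count < tree.split_limit:
        update_weight(bmu, x)
    elif bmu.depth < tree.max_depth:
        split_node(bmu)

def train_forest_online(forest, stream) -> None:
    """Online-Bagging: each tree sees each point K ~ Poisson(1) times."""
    for x in stream:
        for tree in forest:
            for _ in range(np.random.poisson(1.0)):
                train_on_point(tree, x)

def forest_code(forest, x) -> str:
    """Concatenate the per-tree codes y_i^1 ... y_i^T into the hash code of x."""
    return "".join(hash_code(tree.root, x, tree.max_depth) for tree in forest)
```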
Therefore, the unsupervised online hash learning method is a further improvement on the hash learning method based on an evolving tree: the code length can be adjusted flexibly by adjusting the depth and the number of the evolving trees, and the training and coding scheme is very concise yet has good query performance and can be applied to streaming data. Since the evolving trees in the forest are mutually independent, the method is also well suited to distributed platforms.

Claims (2)

1. A hash learning method based on an evolving tree, used to train the evolving tree with the data points x_i in a data set X to obtain a trained evolving tree, perform similarity-preserving coding on the trained evolving tree to obtain the hash code of each leaf node in the evolving tree, and compute the best match point of any data point in the evolving tree to obtain the hash code of that data point, characterized by comprising the following steps:
Step 1: create an evolving tree, where the initialized evolving tree has only one root node, and assign a weight vector to the root node;
Step 2: train the root node: form all data points in the data set into a data stream in random order, take the root node as the best match point of the first data point in the data stream, record the number of times the root node has become a best match point, and go to step 4;
Step 3: use the first data point in the data stream to train the leaf nodes of the evolving tree after splitting: compute the Euclidean distance between each node in the evolving tree and the data point, find the node with the smallest Euclidean distance to the data point, and judge whether that node is a leaf node; if so, take the currently trained node in the evolving tree as the best match point of the data point, record the number of times each leaf node in the evolving tree has become a best match point, and go to step 4; if not, go to step 6;
Step 4: for the root node and all leaf nodes in the evolving tree, perform the following in turn: judge whether the number of times the currently trained node in the evolving tree has become a best match point is less than the first preset value, where the currently trained node in the evolving tree is the root node or any leaf node; if so, update the weight vector of the currently trained node in the evolving tree and go to step 6; if not, go to step 5; the update formula for the weight vector of the currently trained node in the evolving tree is:
w_i(t+1) = x(t)
where w_i(t+1) is the weight vector of the currently trained node in the evolving tree after the update, w_i(t) is the weight vector of the currently trained node in the evolving tree before the update, and x(t) is the vector of the data point for which the currently trained node in the evolving tree is the best match point;
Step 5: judge whether the current depth of the currently trained node in the evolving tree is less than the maximum depth of the evolving tree, the maximum depth of the evolving tree being a preset value; if so, split the currently trained node in the evolving tree into n leaf nodes, assign a different weight vector to each leaf node, record the split node as a trunk node, re-form the data stream, count the number of times the data stream has been formed, and go to step 3; if not, the evolving tree at this point is the trained evolving tree, and go to step 8; the weight vectors of the n leaf nodes are computed as:
w'(t) = (1 - β)·w(t) + β·r(t)
where w'(t) is the weight vector of a new leaf node, w(t) is the weight vector of the trunk node corresponding to the new leaf node, r(t) is a random unit vector with the same dimension as w(t), and β is a preset hyperparameter that controls the degree of random perturbation;
Step 6: judge whether all data points in the data stream have been used for training; if not, use the next data point in the data stream to train the evolving tree, continue recording the number of times each node in the evolving tree becomes a best match point, and go to step 4; if so, go to step 7;
Step 7: judge whether the number of times the data stream has been formed is less than the second preset value; if so, re-form the data stream, train the evolving tree again, accumulate the counts of the trained nodes in the evolving tree becoming best match points, and go to step 4; if not, the evolving tree at this point is the trained evolving tree, and go to step 8;
Step 8: initialize Hamming codes for all nodes in the trained evolving tree except the root node, optimize the similarity-preserving loss function of the whole evolving tree with a greedy path-coding strategy, and take the Hamming codes corresponding to the minimum of the similarity-preserving loss function as the hash codes of the leaf nodes of the evolving tree;
The optimization objective is:
min E = Σ_{W_k ∈ N} F(W_k), with
F(W_k) = Σ_{i<j} ( d(w_i, w_j) - λ·d_h(b(w_i), b(w_j)) )^2
where E is the similarity-preserving loss of the whole evolving tree; W_k is the weight-vector set of trunk node k of the whole evolving tree, W_k = {w_1, w_2, ..., w_n}, and w_1, w_2, ..., w_n are the weight vectors of the n leaf nodes split out of trunk node k; N = {W_1, W_2, ..., W_c} is the set of all trunk nodes in the whole evolving tree; F(W_k) is the similarity-preserving loss function of the codes of the child nodes of each trunk node, where w_i is the weight vector of the i-th leaf node under trunk node k and w_j is the weight vector of the j-th leaf node under trunk node k; d(w_i, w_j) denotes the Euclidean distance between leaf nodes w_i and w_j; λ is a preset hyperparameter; b(w_i) denotes the Hamming code of leaf node w_i, b(w_j) denotes the Hamming code of leaf node w_j, and d_h(b(w_i), b(w_j)) denotes the Hamming distance between b(w_i) and b(w_j);
Step 9: compute the best match point of a given data point in the evolving tree, find the split path from the root node to the leaf node corresponding to the best match point of the data point, and, according to the hash codes of the leaf nodes of the evolving tree obtained in step 8, concatenate in order the hash codes of the corresponding leaf nodes along the split path of the best match point of the data point as the hash code of the data point, the hash code of the data point being expressed as: y = u_1 u_2 ... u_{dep-1}, where u_1 is the hash code at the corresponding node of depth 2 of the evolving tree, u_2 is the hash code at the corresponding node of depth 3 of the evolving tree, dep is the maximum depth of the evolving tree, and u_{dep-1} is the hash code at the corresponding node of the maximum depth of the evolving tree.
2. An unsupervised online hash learning method, characterized in that: multiple evolving trees are created and formed into a forest in order, and the forest is trained with the method according to claim 1; the data points in the data set are formed into a data stream at random, and the first data point in the data stream is used to train each evolving tree in the forest in turn: for each evolving tree in the forest, a number is randomly sampled from a Poisson distribution with intensity 1 and denoted K, and the data point in the data stream is used to train that evolving tree K times; the forest is trained with the data points in the data stream in turn; after training is completed, similarity-preserving coding is performed on the leaf nodes of each evolving tree in the forest to obtain the hash code of each evolving tree in the forest; the best match point of data point x_i on each evolving tree in the forest is computed, and the hash codes of the best match points of x_i on the corresponding evolving trees are concatenated in order to form the hash code of x_i, expressed as: Y_i = y_i^1 y_i^2 ... y_i^T, where y_i^k denotes the code of the k-th evolving tree in the forest for data point x_i, and T denotes the total number of evolving trees in the forest.
CN201910088472.XA 2019-01-30 2019-01-30 Hash learning method based on an evolving tree and its unsupervised online hash learning method Pending CN109829549A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201910088472.XA CN109829549A (en) 2019-01-30 2019-01-30 Hash learning method based on an evolving tree and its unsupervised online hash learning method
CN202010070802.5A CN111079949A (en) 2019-01-30 2020-01-21 Hash learning method, unsupervised online Hash learning method and application thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910088472.XA CN109829549A (en) 2019-01-30 2019-01-30 Hash learning method based on an evolving tree and its unsupervised online hash learning method

Publications (1)

Publication Number Publication Date
CN109829549A (en) 2019-05-31

Family

ID=66863000

Family Applications (2)

Application Number Title Priority Date Filing Date
CN201910088472.XA Pending CN109829549A (en) 2019-01-30 2019-01-30 Hash learning method based on an evolving tree and its unsupervised online hash learning method
CN202010070802.5A Pending CN111079949A (en) 2019-01-30 2020-01-21 Hash learning method, unsupervised online Hash learning method and application thereof

Family Applications After (1)

Application Number Title Priority Date Filing Date
CN202010070802.5A Pending CN111079949A (en) 2019-01-30 2020-01-21 Hash learning method, unsupervised online Hash learning method and application thereof

Country Status (1)

Country Link
CN (2) CN109829549A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110209867A (en) * 2019-06-05 2019-09-06 腾讯科技(深圳)有限公司 Training method, device, equipment and the storage medium of image encrypting algorithm
CN110674335A (en) * 2019-09-16 2020-01-10 重庆邮电大学 Hash code and image bidirectional conversion method based on multi-generation and multi-countermeasure
CN110909027A (en) * 2019-10-17 2020-03-24 宁波大学 Hash retrieval method
CN111079949A (en) * 2019-01-30 2020-04-28 宁波大学 Hash learning method, unsupervised online Hash learning method and application thereof
CN111078911A (en) * 2019-12-13 2020-04-28 宁波大学 Unsupervised hashing method based on self-encoder
CN111625258A (en) * 2020-05-22 2020-09-04 深圳前海微众银行股份有限公司 Mercker tree updating method, device, equipment and readable storage medium
CN112699942A (en) * 2020-12-30 2021-04-23 东软睿驰汽车技术(沈阳)有限公司 Operating vehicle identification method, device, equipment and storage medium

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108830333A (en) * 2018-06-22 2018-11-16 河南广播电视大学 A kind of nearest neighbor search method based on three times bit quantization and non symmetrical distance

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106777038B (en) * 2016-12-09 2019-06-14 厦门大学 A kind of ultralow complexity image search method retaining Hash based on sequence
CN108182256A (en) * 2017-12-31 2018-06-19 厦门大学 It is a kind of based on the discrete efficient image search method for being locally linear embedding into Hash
CN109166615B (en) * 2018-07-11 2021-09-10 重庆邮电大学 Medical CT image storage and retrieval method based on random forest hash
CN109829549A (en) * 2019-01-30 2019-05-31 宁波大学 Hash learning method and its unsupervised online Hash learning method based on the tree that develops

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108830333A (en) * 2018-06-22 2018-11-16 河南广播电视大学 A kind of nearest neighbor search method based on three times bit quantization and non symmetrical distance

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111079949A (en) * 2019-01-30 2020-04-28 宁波大学 Hash learning method, unsupervised online Hash learning method and application thereof
CN110209867A (en) * 2019-06-05 2019-09-06 腾讯科技(深圳)有限公司 Training method, device, equipment and the storage medium of image encrypting algorithm
CN110209867B (en) * 2019-06-05 2023-05-16 腾讯科技(深圳)有限公司 Training method, device, equipment and storage medium for image retrieval model
CN110674335A (en) * 2019-09-16 2020-01-10 重庆邮电大学 Hash code and image bidirectional conversion method based on multi-generation and multi-countermeasure
CN110674335B (en) * 2019-09-16 2022-08-23 重庆邮电大学 Hash code and image bidirectional conversion method based on multiple generation and multiple countermeasures
CN110909027A (en) * 2019-10-17 2020-03-24 宁波大学 Hash retrieval method
CN110909027B (en) * 2019-10-17 2022-04-01 宁波大学 Hash retrieval method
CN111078911B (en) * 2019-12-13 2022-03-22 宁波大学 Unsupervised hashing method based on self-encoder
CN111078911A (en) * 2019-12-13 2020-04-28 宁波大学 Unsupervised hashing method based on self-encoder
WO2021233182A1 (en) * 2020-05-22 2021-11-25 深圳前海微众银行股份有限公司 Merkle tree updating method, apparatus and device, and readable storage medium
CN111625258B (en) * 2020-05-22 2021-08-27 深圳前海微众银行股份有限公司 Mercker tree updating method, device, equipment and readable storage medium
CN111625258A (en) * 2020-05-22 2020-09-04 深圳前海微众银行股份有限公司 Mercker tree updating method, device, equipment and readable storage medium
CN112699942A (en) * 2020-12-30 2021-04-23 东软睿驰汽车技术(沈阳)有限公司 Operating vehicle identification method, device, equipment and storage medium

Also Published As

Publication number Publication date
CN111079949A (en) 2020-04-28

Similar Documents

Publication Publication Date Title
CN109829549A (en) Hash learning method based on an evolving tree and its unsupervised online hash learning method
CN106503106B (en) Image hash index construction method based on deep learning
CN113868366B (en) Online cross-modal retrieval method and system for streaming data
CN107220180A (en) Code classification method based on a neural network language model
CN108734223A (en) Social network friend recommendation method based on community division
CN109818971B (en) Network data anomaly detection method and system based on high-order association mining
CN107729290A (en) Representation learning method for very large graphs using locality-sensitive hashing optimization
CN108710948A (en) Transfer learning method based on cluster balancing and weight matrix optimization
CN104915388B (en) Book label recommendation method based on spectral clustering and crowdsourcing
Wang et al. A new approach of obtaining reservoir operation rules: Artificial immune recognition system
CN108876595A (en) P2P personal credit assessment method and device based on data mining
CN116862024A (en) Trustworthy personalized federated learning method and device based on clustering and knowledge distillation
CN115828143A (en) Node classification method realizing heterogeneous meta-path aggregation based on graph convolution and a self-attention mechanism
CN114580763A (en) Power load forecasting method based on an improved dragonfly algorithm and a lightweight gradient boosting tree model
CN115544029A (en) Data processing method and related device
CN107886132A (en) Time-series method and system for music volume forecasting
Dong et al. Knowledge Restore and Transfer for Multi-Label Class-Incremental Learning
CN115905903A (en) Multi-view clustering method and system based on a graph-attention autoencoder
CN115099309A (en) Method for designing a cost evaluation model for storage and indexing of graph data
CN115131362A (en) Local region feature coding method for large-scale point clouds
Sarkar et al. Accuracy-based learning classification system
Koli et al. Parallel decision tree with MapReduce model for big data analytics
CN112817959A (en) Construction method of a paleontological phylogenetic tree based on multi-metric index weights
Wang et al. Prediction model of glutamic acid production based on data mining in the R language
CN109766371B (en) Hash ranking method based on listwise supervision

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20190531