CN107729290A - Representation learning method of super-large scale graph by using locality sensitive hash optimization - Google Patents

Representation learning method of super-large scale graph by using locality sensitive hash optimization

Info

Publication number
CN107729290A
CN107729290A
Authority
CN
China
Prior art keywords
node
vector
target
hash function
training sample
Prior art date
Legal status
Granted
Application number
CN201710857844.1A
Other languages
Chinese (zh)
Other versions
CN107729290B (en)
Inventor
李笑宇
陈修司
周畅
高军
Current Assignee
Peking University Shenzhen Graduate School
Original Assignee
Peking University Shenzhen Graduate School
Priority date
Filing date
Publication date
Application filed by Peking University Shenzhen Graduate School filed Critical Peking University Shenzhen Graduate School
Priority to CN201710857844.1A
Publication of CN107729290A
Application granted
Publication of CN107729290B
Expired - Fee Related

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10Complex mathematical operations
    • G06F17/16Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F18/24147Distances to closest patterns, e.g. nearest neighbour classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Mathematical Analysis (AREA)
  • Computational Mathematics (AREA)
  • Pure & Applied Mathematics (AREA)
  • Mathematical Optimization (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Algebra (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a representation learning method for super-large scale graphs optimized with locality-sensitive hashing. The method is as follows: each node of a target graph is processed with locality-sensitive hash functions, and the node vector of the node is defined from the hash results; training samples are obtained from the graph structure of the target graph; based on the training samples, the node vectors of the nodes in the target graph are trained with the skip-gram model to obtain the node vector representation corresponding to each node in the target graph. The invention resolves the difficulties caused by the "long-tail phenomenon" prevalent in real network structures, exploits both the content information and the structural information of the network, is suited to distributed implementation, and has strong scalability.

Description

Representation learning method of super-large scale graph by using locality sensitive hash optimization
Technical field
The invention belongs to the field of information technology, relates to representation learning methods for large-scale graph structures, and in particular to a representation learning method for super-large scale graphs optimized with locality-sensitive hashing.
Background technology
Representation learning on graphs, also known as "graph embedding" (Graph Embedding), refers to algorithms that map each node of a graph to a low-dimensional vector representation that preserves the node's features. The node vectors produced by a graph embedding algorithm can be regarded as the essential features of the graph's nodes and serve as general-purpose input for other machine learning tasks on the graph.
Most earlier research on graph embedding algorithms learns primarily from the network structure of the graph; typical methods include DeepWalk, LINE and node2vec. However, real-world network structures are often sparse and unevenly distributed: a small number of "popular" nodes attach to most of the edges and carry dense structural information, while a large number of "unpopular" nodes have very few edges and carry sparse structural information. This is the "long-tail effect". As a result, methods based only on structural information often give unsatisfactory representations for the large number of structurally sparse "long-tail" nodes.
Summary of the invention
The present invention proposes a scheme for computing node vector representations of large-scale networks, aimed in particular at resolving the difficulties caused by the "long-tail phenomenon" prevalent in real network structures. The algorithm exploits both the content information and the structural information of the network, is suited to distributed implementation, and has strong scalability.
The present invention uses locality-sensitive hash functions to establish content-based associations between the nodes of a graph, generates training samples from the original graph structure by random walks, and trains the corresponding node vectors with the skip-gram model. The method thus learns representations from both the content information of each node and the overall network structure, is highly scalable, and is particularly effective on graphs with a pronounced structural long-tail effect.
The present application uses locality-sensitive hash functions that take node content information as input. The defining property of locality-sensitive hashing guarantees that nodes with similar content information also receive close hash mappings, so that nodes with similar content become linked through shared hash outputs. The invention redefines the vector of each node from its hash outputs; this redefinition ensures that nodes with similar content information obtain similar final representation vectors. The invention obtains training samples from the original graph structure by random walks and then trains the redefined node vectors with the skip-gram model. Compared with purely structure-based methods, this method has a clear advantage on large graphs with a pronounced structural long-tail effect, which demonstrates the benefit of fusing content information; it is also suited to distributed architectures and therefore highly scalable.
Addressing the shortcomings of existing purely structure-based graph representation learning methods, the present invention innovatively fuses content information by means of locality-sensitive hash functions. Compared with purely structure-based methods, the present invention has the following advantages:
1) The present invention exploits both the content information and the structural information of the network. The combination works as follows: associations are first established between nodes with similar content information through the outputs of locality-sensitive hashing, and training then proceeds on samples generated from the network structure by random walks. Such a combination can associate nodes that are far apart in the graph yet similar in content, which structure-based methods cannot do. The content-based associations enrich graph structure that would otherwise be sparse: once linked by content, structurally sparse unpopular nodes can share the structural information of popular nodes. This effectively solves the poor performance of purely structure-based graph embedding schemes on graph structures with a pronounced "long-tail effect".
2) The present invention saves redundant space. It redefines node vectors through locality-sensitive hash functions that take node content information as input, replacing the earlier scheme in which each node owns an exclusive vector. Real networks often contain large numbers of homogeneous nodes; for example, the commodity graph of an e-commerce website contains essentially identical goods sold by many different shops. If every node occupied its own exclusive storage, the vectors of homogeneous nodes would be stored repeatedly. Under the redefinition, nodes with similar content information share vector parameters with high probability, which eliminates the redundant storage caused by homogeneous nodes.
3) The present invention is suited to distributed computing frameworks and has strong scalability. The algorithm has been implemented smoothly on the Alibaba Cloud distributed computing framework ODPS PS (the Alibaba Cloud parameter server), with graph sizes reaching the order of tens of millions of nodes.
Brief description of the drawings
Fig. 1 is a schematic diagram of redefining node vector representations with locality-sensitive hashing;
Fig. 2 is a schematic diagram of the gradient update for a single sample during training.
Detailed description of the embodiments
The algorithm flow of the method of the invention is explained below through specific embodiments, with reference to the accompanying drawings.
The invention first introduces the design of the locality-sensitive hash functions used in the algorithm, then explains how the vector of each node is redefined, and finally describes the training process that produces the node vector representations.
(1) Locality-sensitive hash functions with node content information as input
Step a. The invention processes the content information of each node into a low-dimensional real-valued vector, which serves as the input to the subsequent hash functions. Any existing algorithm may be used to convert the content information into a low-dimensional real-valued vector. For example, in the invention's experiments on Alibaba Cloud, the graph structure is a commodity graph constructed from users' click sequences; each commodity corresponds to a node in the graph, and the commodity's title text is taken as the content information of the corresponding node. The invention converts a commodity title into a low-dimensional vector as follows: first the title texts of all commodities are segmented into words; then, with all titles as the training corpus and each title treated as one context window, the skip-gram model of word2vec is trained to obtain a d-dimensional word vector for every word in the vocabulary. Because titles are usually short texts, the invention takes the average of the word vectors of all words in a title as the title vector, i.e. the content vector of the corresponding node.
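The following minimal sketch illustrates step a. It is an illustration, not the patented implementation: it assumes the gensim library for word2vec, a pre-segmented corpus, and an arbitrary window size.

import numpy as np
from gensim.models import Word2Vec

def content_vectors(titles, d=200):
    """titles: one list of word tokens per node. Returns an (n, d) array."""
    # Each title is one skip-gram context window, as in the description;
    # window=10 is an assumed value, large enough to cover a short title.
    model = Word2Vec(sentences=titles, vector_size=d, sg=1,
                     window=10, min_count=1)
    # A node's content vector is the average of its title's word vectors.
    return np.array([np.mean([model.wv[w] for w in ws], axis=0)
                     for ws in titles])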
Step b. The invention designs the locality-sensitive hash function by hyperplane cutting. A locality-sensitive hash function must have the property that the closer two input values (node content vectors) are, the higher the probability that they are mapped to the same bucket number. The design is as follows: generate k random d-dimensional hyperplanes. In d-dimensional space, a hyperplane that divides the space into two parts can be represented by a d-dimensional real vector, so this in fact means generating k random d-dimensional real vectors $w_1, w_2, \dots, w_k$; together, these k hyperplanes constitute one locality-sensitive hash function. Two input d-dimensional node content vectors are considered to have the same final output value under this hash function if they lie on the same side of all k hyperplanes. Concretely, the hash function is computed as follows: for the input d-dimensional content vector $e$ of some node, take the inner product of $e$ with the vector $w_i$ of each hyperplane; an inner product greater than or equal to 0 is recorded as 1, and otherwise as 0. An input content vector $e$ thus yields a k-dimensional 0/1 vector, i.e. a k-bit binary code, which is the node's output under this hash function. Two input vectors $e_1$ and $e_2$ have the same output value under the hash function if and only if $\mathrm{sign}(w_i \cdot e_1) = \mathrm{sign}(w_i \cdot e_2)$ for all $i = 1, 2, \dots, k$. It follows that each hash function designed by the invention has $2^k$ buckets.
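The following minimal sketch illustrates one such hyperplane hash function, under the assumption of Gaussian random hyperplanes; it is an illustration, not the patented implementation.

import numpy as np

def make_hash(d, k, rng):
    """One locality-sensitive hash function made of k random hyperplanes."""
    W = rng.standard_normal((k, d))          # hyperplane normals w_1..w_k
    def h(e):
        bits = (W @ e >= 0).astype(int)      # 1 iff e lies on the positive side
        return int("".join(map(str, bits)), 2)  # k-bit code -> bucket in [0, 2^k)
    return h

Two content vectors with a small angle between them flip few of the k bits, so they land in the same bucket with high probability, which is exactly the locality-sensitivity property described above.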
Step c. Each node of the graph obtains an m-dimensional discrete code representation through locality-sensitive hashing. Applying the construction of the previous step m times yields a family of hash functions $\{h^{(j)}\}, j \in \{1, 2, \dots, m\}$, where the range of each hash function $h^{(j)}$ consists of the $2^k$ values $\{0, 1, \dots, 2^k - 1\}$ (each hash function here generates its own k random hyperplanes, so the whole family uses m·k hyperplanes in total). After the content information vectors of all nodes in the graph have been fed through these m hash functions, each node obtains m bucket numbers, i.e. each node has a new m-dimensional discrete code representation, derived from the node's content information through the locality-sensitive hash mapping. By the property of locality-sensitive hash functions, nodes with similar content have similar m-dimensional discrete codes. The next section explains how the node vectors are redefined from the m-dimensional discrete code of each node.
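Continuing the sketch (again illustrative, reusing make_hash from above; m, k and d are assumed values): the hash family is just m independently constructed hash functions, and a node's discrete code is its list of m bucket numbers.

import numpy as np

m = 4                                        # family size (an assumed value)
rng = np.random.default_rng(0)
family = [make_hash(d=200, k=8, rng=rng) for _ in range(m)]

def discrete_code(e):
    """The node's m-dimensional discrete code: one bucket number per h^(j)."""
    return [h(e) for h in family]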
(2) Redefinition of node vectors
In earlier purely structure-based "graph embedding" algorithms, each node u is mapped to two vectors $s_u$ and $t_u$, its "source vector" and "target vector". The algorithm of the present invention retains the concepts of "source vector" and "target vector"; the difference is that each node no longer independently owns its own vectors. Instead, each hash bucket of each hash function in the family $\{h^{(j)}\}$ owns a pair consisting of a "source vector" and a "target vector". Specifically, the bucket numbered i ($i \in \{0, 1, \dots, 2^k - 1\}$) of hash function $h^{(j)}$ ($j \in \{1, 2, \dots, m\}$) owns the "source vector" $S^{(j)}_i$ and the "target vector" $T^{(j)}_i$.
For a node u in the graph, let $e_u$ denote its "content vector", and let its m-dimensional discrete code after the hash mapping be $ind_u = (h^{(1)}(e_u), h^{(2)}(e_u), \dots, h^{(m)}(e_u))$, where "ind" stands for index. The invention then defines the "source vector" $s_u$ and "target vector" $t_u$ of the node as

$s_u = \frac{1}{m}\sum_{j=1}^{m} S^{(j)}_{h^{(j)}(e_u)}, \qquad t_u = \frac{1}{m}\sum_{j=1}^{m} T^{(j)}_{h^{(j)}(e_u)}.$
That is, the invention defines the "source vector" of a node as the average of the "source vectors" of the m hash buckets it maps to under the hash family $\{h^{(j)}\}$, and its "target vector" as the average of the "target vectors" of those m hash buckets. As analysed above, the more similar the content information of two nodes, the more hash buckets they share after the mapping; and since the vectors of each node are simply the averages of its bucket vectors, the more similar the content information of two nodes, the closer their "source vectors" and "target vectors" will be. In this way the invention exploits node content information: the vector representations of nodes with similar content information become linked through shared hash buckets, and the computation of the hash functions automatically determines which nodes have similar content information. Next, the invention brings this new definition into a structure-based training framework, thereby combining textual and structural information.
Fig. 1 illustrates the redefinition of node vectors; a minimal sketch follows.
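The sketch below is illustrative only (it reuses m from the sketch above; k and the embedding size dim are assumed values): the trainable parameters are bucket-level vector tables S and T, and a node's vectors are averages over the m buckets of its discrete code.

import numpy as np

k, dim = 8, 128                              # bits per hash, embedding size (assumed)
S = 0.01 * np.random.randn(m, 2**k, dim)     # S[j, i]: source vector of bucket i of h^(j)
T = 0.01 * np.random.randn(m, 2**k, dim)     # T[j, i]: target vector of bucket i of h^(j)

def node_vectors(code):
    """code: the node's m bucket numbers from discrete_code()."""
    s_u = np.mean([S[j, i] for j, i in enumerate(code)], axis=0)
    t_u = np.mean([T[j, i] for j, i in enumerate(code)], axis=0)
    return s_u, t_u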
(3) The training process
Once the node vectors have been redefined, the invention brings the new definition into a structure-based training framework and obtains the node vector representations it needs. The model used is the skip-gram model, popular in recent years in the field of natural language processing. On the original graph, training samples are generated from the graph structure by a random walk strategy with a termination probability.
The random walk strategy (which is existing work, not a contribution of the present invention) is as follows: a walk starts from some node u. If the node currently reached is t, the walk terminates at t with probability p; with probability 1 − p it continues, moving to one of the nodes connected to t by an edge, where the probability of moving to a given node is proportional to the weight of the edge between t and that node (edge weights in real graphs usually reflect how closely related the entities at the two endpoints are, so making the transition probability proportional to the edge weight is reasonable). The walk continues until the algorithm terminates it at some node v. The pair (u, v), the two endpoints of the path, is then one positive sample. The parameter p can be controlled by the algorithm designer; the larger p is, the shorter the average length of the paths obtained by the random walk. A sketch of this sampling follows.
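The sketch below is illustrative: adj is an assumed adjacency structure mapping a node to (neighbor, weight) pairs, and p = 0.15 is an arbitrary choice.

import random

def sample_pair(adj, u, p=0.15):
    """One walk from u; returns the positive sample (u, v)."""
    nodes, weights = zip(*adj[u])
    t = random.choices(nodes, weights=weights)[0]      # take at least one step
    while random.random() >= p:                        # continue with prob. 1 - p
        nbrs = adj.get(t, [])
        if not nbrs:
            break                                      # dead end: terminate here
        nodes, weights = zip(*nbrs)
        t = random.choices(nodes, weights=weights)[0]  # edge-weight-proportional move
    return (u, t)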
For a positive sample (u, v), the invention defines the probability that source node u predicts target node v with the softmax function:

$p(v \mid u) = \frac{\exp(s_u \cdot t_v)}{\sum_{n \in V} \exp(s_u \cdot t_n)}$
where V is the vertex set of the graph, and $s_u$ and $t_v$ are the redefined "source vector" of node u and "target vector" of node v. Attempting to optimize this probability formula directly is very time-consuming, because the formula requires the inner products of the node's source vector with the target vectors of all nodes in the graph. To improve the efficiency of training, the invention adopts the strategy of negative sampling and redefines the probability that source node u predicts target node v:

$\log p(v \mid u) = \log \sigma(s_u \cdot t_v) + \sum_{i=1}^{K} \mathbb{E}_{n \sim P_D}\left[\log \sigma(-s_u \cdot t_n)\right]$
where σ is the sigmoid function, $P_D$ is a preset node distribution, usually the uniform distribution over nodes (every node is sampled with equal probability), $n \sim P_D$ denotes drawing a node n at random from the distribution $P_D$, and $t_n$ is the "target vector" of the sampled node n. For each positive sample, the invention draws K negative samples at random from this distribution. Let $\#(u, v)$ be the number of times the positive sample (u, v) is obtained over the whole sampling process; the global objective function of training is then as follows:

$O = \sum_{(u,v)} \#(u,v)\left[\log \sigma(s_u \cdot t_v) + \sum_{i=1}^{K} \mathbb{E}_{n \sim P_D} \log \sigma(-s_u \cdot t_n)\right]$
Note that in the formula above $s_u, t_v, t_n$ are not independent parameters; they are computed from the vectors of the hash buckets to which the nodes are mapped by the locality-sensitive hashing. The invention therefore substitutes the node-vector redefinition of the previous section, i.e. brings the redefined node vectors into the structure-based training framework above, and obtains the final global objective function:

$O = \sum_{(u,v)} \#(u,v)\Big[\log \sigma\Big(\tfrac{1}{m}\sum_{j=1}^{m} S^{(j)}_{h^{(j)}(e_u)} \cdot \tfrac{1}{m}\sum_{j=1}^{m} T^{(j)}_{h^{(j)}(e_v)}\Big) + \sum_{i=1}^{K} \mathbb{E}_{n \sim P_D} \log \sigma\Big(-\tfrac{1}{m}\sum_{j=1}^{m} S^{(j)}_{h^{(j)}(e_u)} \cdot \tfrac{1}{m}\sum_{j=1}^{m} T^{(j)}_{h^{(j)}(e_n)}\Big)\Big]$
Training maximizes the global objective function, i.e. the sum of the log-probabilities of all samples, by updating the "source vector" and "target vector" of each hash bucket of each hash function. The final result of training is the "source vectors" and "target vectors" of all hash buckets of all hash functions, from which the vector representations of all nodes in the graph can in turn be computed through the node-vector redefinition.
The global objective function looks cumbersome, but during training the parameter update process is in fact simple. Given a training positive sample (u, v), the invention maximizes the probability of predicting node v from node u and accordingly updates the m "source vectors" corresponding to node u and the m "target vectors" corresponding to node v. The update proceeds as shown in Fig. 2: first, the m "source vectors" of node u and the m "target vectors" of node v are located through the hash bucket indices of u and v; averaging them gives the corresponding vectors $s_u$ and $t_v$; substituting $s_u$ and $t_v$ into the differentiated formula gives the gradient magnitude of each parameter; the gradients are then propagated back to the vectors of the corresponding hash buckets.
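The sketch below (illustrative; it reuses S, T and node_vectors from the sketches above, and the learning rate is an assumed value) performs the single-sample update of Fig. 2 with negative sampling. Because $s_u$ is the average of m bucket vectors, each bucket receives 1/m of the gradient.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def update(code_u, code_v, neg_codes, lr=0.025):
    s_u, _ = node_vectors(code_u)
    grad_s = np.zeros_like(s_u)
    # label 1 for the positive target v, label 0 for each sampled negative n
    for code, label in [(code_v, 1.0)] + [(c, 0.0) for c in neg_codes]:
        _, t = node_vectors(code)
        g = label - sigmoid(s_u @ t)        # derivative w.r.t. the inner product
        grad_s += g * t
        for j, i in enumerate(code):        # 1/m of the target gradient per bucket
            T[j, i] += lr * g * s_u / len(code)
    for j, i in enumerate(code_u):          # 1/m of the source gradient per bucket
        S[j, i] += lr * grad_s / len(code_u)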
Embodiment
Example of graph representation learning on the Taobao commodity graph:
Taking Taobao's online commodities as graph nodes, edges are constructed from the intra-day click sequences of Taobao users (an edge connects two commodities that are adjacent in a click sequence), forming a commodity relation graph whose size reaches the order of tens of millions of nodes.
The invention uses the "title" of the Taobao commodity corresponding to each node as the node's "content information" (for example, a sporting-goods title such as "men's mesh-upper air-cushion running shoes, luminous, jogging"). In the experiments, the invention converts each "title" into a 200-dimensional real-valued vector. The conversion first segments the title texts of all commodities into words; then, using all titles as the training corpus and each title as one window, the skip-gram model of word2vec is applied to obtain vector representations of all words in the corpus. Because commodity titles are all short texts, the invention uses the average of the vector representations of all words in a title as the "content vector" of the corresponding node.
After the content vector of each node has been obtained, the locality-sensitive hash functions are constructed by the method described in the invention and the vector representation of each commodity is trained. Experimental results show that when the vector representations computed by the method of the invention are applied to real online recommendation, the click-through rate improves significantly compared with the results of the purely structure-based graph representation learning method (APP, Scalable Graph Embedding for Asymmetric Proximity, AAAI 2017).
The above embodiments merely illustrate the technical solution of the present invention and do not limit it. A person of ordinary skill in the art may modify the technical solution or substitute equivalents for it without departing from the spirit and scope of the present invention; the scope of protection of the present invention shall be defined by the claims.

Claims (9)

1. A representation learning method for super-large scale graphs optimized with locality-sensitive hashing, the steps of which include:
computing each node of a target graph with locality-sensitive hash functions, and defining the node vector of the node according to the results of the computation;
obtaining training samples from the graph structure of the target graph;
based on the training samples, training the node vectors of the nodes in the target graph with the skip-gram model, to obtain the node vector representation corresponding to each node in the target graph.
2. The method of claim 1, wherein m locality-sensitive hash functions are designed by hyperplane cutting, yielding a hash function family $\{h^{(j)}\}, j \in \{1, 2, \dots, m\}$; wherein each locality-sensitive hash function comprises k randomly generated d-dimensional hyperplanes, and if two input d-dimensional vectors lie on the same side of each of the k d-dimensional hyperplanes, the two d-dimensional vectors have the same output value under that hash function.
3. The method of claim 2, wherein the number of buckets of each locality-sensitive hash function is $2^k$.
4. The method of claim 2, wherein each locality-sensitive hash function is computed as follows: for an input d-dimensional vector, the inner product of the d-dimensional vector with the vector corresponding to each hyperplane is taken; an inner product greater than or equal to 0 is recorded as 1, and otherwise as 0; the output result for the d-dimensional vector is the resulting k-bit binary code.
5. The method of claim 2 or 3, wherein computing each node of the target graph with the locality-sensitive hash functions and defining the node vector of the node according to the results comprises:
1) processing the node content information of the target graph into d-dimensional vectors and inputting them into each locality-sensitive hash function in the hash function family $\{h^{(j)}\}$; wherein the j-th locality-sensitive hash function $h^{(j)}$ maps the d-dimensional vector $e_u$ of node u to a hash bucket, and the source vector $S^{(j)}_i$ and target vector $T^{(j)}_i$ of the hash bucket numbered $i = h^{(j)}(e_u)$ correspond to node u;
2) for each node of the target graph, averaging the source vectors of the m hash buckets corresponding to the node as the source vector of the node, and averaging the target vectors of the m hash buckets corresponding to the node as the target vector of the node.
6. The method of claim 5, wherein two nodes whose node vectors are most similar are treated as nodes with similar content information, and an association is established between the vector representations of nodes with similar content information through shared hash buckets.
7. The method of claim 5, wherein training the node vectors of the nodes in the target graph with the skip-gram model based on the training samples comprises: for a training sample (u, v), the training sample (u, v) being a positive sample, computing the probability that source node u predicts target node v with the formula $\log p(v \mid u) = \log \sigma(s_u \cdot t_v) + \sum_{i=1}^{K} \mathbb{E}_{n \sim P_D}[\log \sigma(-s_u \cdot t_n)]$; wherein $n \sim P_D$ denotes drawing a node n at random from the node distribution $P_D$, $s_u$ is the source vector of source node u, $t_v$ is the target vector of target node v, and $t_n$ is the target vector of the sampled node n; $P_D$ is a preset node distribution, and for each training sample (u, v), K negative samples are drawn at random from the node distribution $P_D$; then the global objective function $O = \sum_{(u,v)} \#(u,v)[\log \sigma(s_u \cdot t_v) + \sum_{i=1}^{K} \mathbb{E}_{n \sim P_D} \log \sigma(-s_u \cdot t_n)]$ is used to update the source vector and target vector corresponding to each hash bucket of each locality-sensitive hash function; and the node vector representation corresponding to each node in the target graph is then computed according to the update results; wherein $\#(u, v)$ is the number of occurrences of the positive sample (u, v).
8. The method of claim 5, wherein training the node vectors of the nodes in the target graph with the skip-gram model based on the training samples comprises: for a training sample (u, v), the training sample (u, v) being a positive sample, computing the probability that source node u predicts target node v with the formula $\log p(v \mid u) = \log \sigma(s_u \cdot t_v) + \sum_{i=1}^{K} \mathbb{E}_{n \sim P_D}[\log \sigma(-s_u \cdot t_n)]$; wherein $n \sim P_D$ denotes drawing a node n at random from the node distribution $P_D$, $s_u$ is the source vector of source node u, $t_v$ is the target vector of target node v, and $t_n$ is the target vector of the sampled node n; $P_D$ is a preset node distribution, and for each training sample (u, v), K negative samples are drawn at random from the node distribution $P_D$; then the global objective function with the redefined node vectors substituted, $O = \sum_{(u,v)} \#(u,v)\big[\log \sigma\big(\frac{1}{m}\sum_{j=1}^{m} S^{(j)}_{h^{(j)}(e_u)} \cdot \frac{1}{m}\sum_{j=1}^{m} T^{(j)}_{h^{(j)}(e_v)}\big) + \sum_{i=1}^{K} \mathbb{E}_{n \sim P_D} \log \sigma\big(-\frac{1}{m}\sum_{j=1}^{m} S^{(j)}_{h^{(j)}(e_u)} \cdot \frac{1}{m}\sum_{j=1}^{m} T^{(j)}_{h^{(j)}(e_n)}\big)\big]$, is used to update the source vector and target vector corresponding to each hash bucket of each locality-sensitive hash function; and the node vector representation corresponding to each node in the target graph is then computed according to the update results; wherein $\#(u, v)$ is the number of occurrences of the positive sample (u, v).
9. The method of claim 1, wherein the training samples are generated from the graph structure of the target graph by a random walk method.
CN201710857844.1A 2017-09-21 2017-09-21 Representation learning method of super-large scale graph by using locality sensitive hash optimization Expired - Fee Related CN107729290B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710857844.1A CN107729290B (en) 2017-09-21 2017-09-21 Representation learning method of super-large scale graph by using locality sensitive hash optimization

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710857844.1A CN107729290B (en) 2017-09-21 2017-09-21 Representation learning method of super-large scale graph by using locality sensitive hash optimization

Publications (2)

Publication Number Publication Date
CN107729290A 2018-02-23
CN107729290B CN107729290B (en) 2021-05-11

Family

ID=61207259

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710857844.1A Expired - Fee Related CN107729290B (en) 2017-09-21 2017-09-21 Representation learning method of super-large scale graph by using locality sensitive hash optimization

Country Status (1)

Country Link
CN (1) CN107729290B (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107944489A (en) * 2017-11-17 2018-04-20 清华大学 Extensive combination chart feature learning method based on structure semantics fusion
CN109118053A (en) * 2018-07-17 2019-01-01 阿里巴巴集团控股有限公司 It is a kind of steal card risk trade recognition methods and device
CN109194707A (en) * 2018-07-24 2019-01-11 阿里巴巴集团控股有限公司 The method and device of distribution figure insertion
CN109992606A (en) * 2019-03-14 2019-07-09 北京达佳互联信息技术有限公司 A kind of method for digging of target user, device, electronic equipment and storage medium
CN110232393A (en) * 2018-03-05 2019-09-13 腾讯科技(深圳)有限公司 Processing method, device, storage medium and the electronic device of data
WO2020038141A1 (en) * 2018-08-24 2020-02-27 阿里巴巴集团控股有限公司 Distributed graph embedding method, apparatus and system, and device
CN111160552A (en) * 2019-12-17 2020-05-15 北京百度网讯科技有限公司 Negative sampling processing method, device, equipment and computer storage medium

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101170578A (en) * 2007-11-30 2008-04-30 北京理工大学 Hierarchical peer-to-peer network structure and constructing method based on syntax similarity
US20110249899A1 (en) * 2010-04-07 2011-10-13 Sony Corporation Recognition device, recognition method, and program
CN102298606A (en) * 2011-06-01 2011-12-28 清华大学 Random walking image automatic annotation method and device based on label graph model
CN103020321A (en) * 2013-01-11 2013-04-03 广东图图搜网络科技有限公司 Neighbor searching method and neighbor searching system
US20140279738A1 (en) * 2013-03-15 2014-09-18 Bazaarvoice, Inc. Non-Linear Classification of Text Samples
CN104794223A (en) * 2015-04-29 2015-07-22 厦门美图之家科技有限公司 Subtitle matching method and system based on image retrieval
CN104866471A (en) * 2015-06-05 2015-08-26 南开大学 Instance matching method based on local sensitive Hash strategy
CN106649715A (en) * 2016-12-21 2017-05-10 中国人民解放军国防科学技术大学 Cross-media retrieval method based on local sensitive hash algorithm and neural network
CN106682233A (en) * 2017-01-16 2017-05-17 华侨大学 Method for Hash image retrieval based on deep learning and local feature fusion
CN106780639A (en) * 2017-01-20 2017-05-31 中国海洋大学 Hash coding method based on the sparse insertion of significant characteristics and extreme learning machine

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
WENGANG ZHOU et al.: "Scalar quantization for large scale image search", ACM *
邹浩 (Zou Hao): "Design and Implementation of a Distributed Image Retrieval Training System", Wanfang Database *
高毫林 (Gao Haolin): "Research on Image Retrieval Based on Hashing Techniques", China Doctoral Dissertations Full-text Database *

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107944489A (en) * 2017-11-17 2018-04-20 清华大学 Extensive combination chart feature learning method based on structure semantics fusion
CN107944489B (en) * 2017-11-17 2018-10-16 清华大学 Extensive combination chart feature learning method based on structure semantics fusion
CN110232393B (en) * 2018-03-05 2022-11-04 腾讯科技(深圳)有限公司 Data processing method and device, storage medium and electronic device
CN110232393A (en) * 2018-03-05 2019-09-13 腾讯科技(深圳)有限公司 Processing method, device, storage medium and the electronic device of data
CN109118053A (en) * 2018-07-17 2019-01-01 阿里巴巴集团控股有限公司 It is a kind of steal card risk trade recognition methods and device
CN109118053B (en) * 2018-07-17 2022-04-05 创新先进技术有限公司 Method and device for identifying card stealing risk transaction
CN109194707B (en) * 2018-07-24 2020-11-20 创新先进技术有限公司 Distributed graph embedding method and device
CN109194707A (en) * 2018-07-24 2019-01-11 阿里巴巴集团控股有限公司 The method and device of distribution figure insertion
TWI703460B (en) * 2018-08-24 2020-09-01 香港商阿里巴巴集團服務有限公司 Distributed graph embedding method, device, equipment and system
WO2020038141A1 (en) * 2018-08-24 2020-02-27 阿里巴巴集团控股有限公司 Distributed graph embedding method, apparatus and system, and device
US11074295B2 (en) 2018-08-24 2021-07-27 Advanced New Technologies Co., Ltd. Distributed graph embedding method and apparatus, device, and system
CN109992606A (en) * 2019-03-14 2019-07-09 北京达佳互联信息技术有限公司 A kind of method for digging of target user, device, electronic equipment and storage medium
CN111160552A (en) * 2019-12-17 2020-05-15 北京百度网讯科技有限公司 Negative sampling processing method, device, equipment and computer storage medium
CN111160552B (en) * 2019-12-17 2023-09-26 北京百度网讯科技有限公司 News information recommendation processing method, device, equipment and computer storage medium

Also Published As

Publication number Publication date
CN107729290B (en) 2021-05-11

Similar Documents

Publication Publication Date Title
CN107729290A (en) A kind of expression learning method of ultra-large figure using the optimization of local sensitivity Hash
CN110334219A (en) The knowledge mapping for incorporating text semantic feature based on attention mechanism indicates learning method
CN107515855B (en) Microblog emotion analysis method and system combined with emoticons
CN108073711A (en) A kind of Relation extraction method and system of knowledge based collection of illustrative plates
CN103927394B (en) A kind of multi-tag Active Learning sorting technique and system based on SVM
Pang et al. DeepCity: A feature learning framework for mining location check-ins
KR20210040892A (en) Information Recommendation Method based on Fusion Relation Network, Apparatus, Electronic Device, Non-transitory Computer Readable Medium, and Computer Program
CN109902203A (en) The network representation learning method and device of random walk based on side
CN109376857A (en) A kind of multi-modal depth internet startup disk method of fusion structure and attribute information
CN104298873A (en) Attribute reduction method and mental state assessment method on the basis of genetic algorithm and rough set
CN111210111B (en) Urban environment assessment method and system based on online learning and crowdsourcing data analysis
CN108228728A (en) A kind of paper network node of parametrization represents learning method
CN110383302A (en) Small Maastricht Treaty Rana Fermi's subcode
CN113065974A (en) Link prediction method based on dynamic network representation learning
Li et al. Intelligent medical heterogeneous big data set balanced clustering using deep learning
Bien et al. Non-convex global minimization and false discovery rate control for the TREX
CN107368521A (en) A kind of Promote knowledge method and system based on big data and deep learning
CN109086463A (en) A kind of Ask-Answer Community label recommendation method based on region convolutional neural networks
CN112000788A (en) Data processing method and device and computer readable storage medium
CN110008411A (en) It is a kind of to be registered the deep learning point of interest recommended method of sparse matrix based on user
CN110222839A (en) A kind of method, apparatus and storage medium of network representation study
Wen et al. Attention-aware path-based relation extraction for medical knowledge graph
CN116386895B (en) Epidemic public opinion entity identification method and device based on heterogeneous graph neural network
CN111159424B (en) Method and device for labeling knowledge graph entity, storage medium and electronic equipment
CN108694232A (en) A kind of socialization recommendation method based on trusting relationship feature learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20210511

Termination date: 20210921