CN107451703A

CN107451703A - A social network multi-task prediction method based on a factor graph model

Info

Publication number: CN107451703A
Application number: CN201710770816.6A
Authority: CN
Inventors: 张子柯; 林松; 刘闯
Original assignee: Hangzhou Normal University
Current assignee: Hangzhou Normal University
Priority date: 2017-08-31
Filing date: 2017-08-31
Publication date: 2017-12-08

Abstract

A social network multi-task prediction method based on a factor graph model, comprising the following steps: first, network data acquisition, which includes crawling the network data and preprocessing it; second, establishing a multi-task factor graph model, which includes network feature extraction, construction of network migration structures, and construction of the factor graph model; third, evaluation of the prediction results.

Description

A social network multi-task prediction method based on a factor graph model

Technical field

The present invention relates to machine learning methods, factor graph models, personalized recommendation technology, and techniques for constructing network migration structures. It is suitable for solving multi-task link prediction problems in heterogeneous networks.

Background

With the progress of Internet technology and the spread of online social networks, network scale keeps growing, and the networks people deal with are shifting from simple interpersonal relationship networks to coupled social networks. A coupled social network usually has a complex structure, containing multiple types of nodes (such as users and commodities) and multiple types of edges (such as social links and scoring links). Link prediction research on coupled social networks has traditionally concentrated on either social link prediction or scoring link prediction, and the different types of link prediction tasks are generally treated as independent of each other. In real-world networks, however, the two kinds of prediction are often related. For example, if two people are friends, they are more likely to buy or review the same commodities; likewise, if two people often buy or review the same commodities, they are more likely to share interests and hobbies and have a higher probability of becoming friends. Therefore, building network migration structures and combining them with a factor graph model, so that multiple prediction tasks are associated through information flow, is of great theoretical and practical significance for heterogeneous link prediction research in complex networks.

The existing approach to link prediction first computes node features, such as in-degree, out-degree, and clustering coefficient, and network topological features, such as the common neighbours index, the AA (Adamic-Adar) index, the Salton index, the Jaccard index, and the HPI index. The features are then integrated as

$$R = \sum_i w_i x_i$$

where R is the integrated result, x_i is the i-th extracted feature, and w_i is the weight of the i-th feature. Finally, the result is fed into an existing machine learning model, and the weight vector w is obtained by training.

However, because social link prediction and scoring link prediction operate on different network structures, the extracted features also differ, so multiple models have to be learned. Current applications of factor graph models to link prediction in social networks treat the prediction tasks as independent, so making several kinds of prediction requires training different models. So far, no method performs multi-task link prediction by mining network migration structures.

Existing prediction techniques do not solve the data sparsity problem, do not make full use of the structural information of the network, and cannot accommodate multi-task link prediction.

Multi-task prediction requires learning multiple machine learning models, so computational efficiency is low, and the mutually reinforcing coupling between prediction tasks is not fully taken into account.

Summary of the invention

To overcome the above shortcomings of the prior art, the present invention provides a social network multi-task prediction method based on a factor graph model.

The flow chart of the method is shown in Fig. 1. The method uses traditional complex-network link prediction indices as features and builds network migration structures so that information flows between the prediction tasks and couples them. This addresses data sparsity and low computational efficiency while improving the accuracy of the prediction results. The method comprises the following steps.

First step, network data acquisition: users' social information and behavioural information are collected by a web crawler, and the crawled data are cleaned to simplify subsequent computation. This step mainly comprises network data crawling and data preprocessing.

(1) Network data crawling: crawl users' social behaviour information and users' behaviour information towards commodities. Each record contains either a user UserID and another user UserID, or a user UserID and a commodity ItemID.

(2) Data preprocessing: to simplify subsequent computation, redundant and incomplete records are removed from the data, forming the unified matrices required by the model: the user-user social behaviour matrix w₁ and the user-commodity scoring behaviour matrix w₂. In matrix w₁, element w₁^ij represents a relation such as friendship or following between user i and user j (the relation may be one-way or two-way). In matrix w₂, element w₂^ij represents a relation such as collecting, purchasing, or reviewing between user i and commodity j.
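The preprocessing step can be sketched in Python as follows. This is a minimal illustration under assumptions that are not stated in the source: records arrive as (UserID, UserID) and (UserID, ItemID) tuples, an incomplete record is one with a missing ID, the matrices are binary, and the function name `build_matrices` is hypothetical.

```python
import numpy as np

def build_matrices(social_pairs, rating_pairs, directed=False):
    """Build binary matrices w1 (user-user) and w2 (user-item) from crawled ID pairs."""
    # Drop incomplete records (missing IDs) and self-links; sets remove duplicates.
    social = {(u, v) for u, v in social_pairs if u and v and u != v}
    rating = {(u, a) for u, a in rating_pairs if u and a}

    # Assign a stable index to every user and item that survives cleaning.
    users = sorted({u for pair in social for u in pair} | {u for u, _ in rating})
    items = sorted({a for _, a in rating})
    uidx = {u: k for k, u in enumerate(users)}
    iidx = {a: k for k, a in enumerate(items)}

    w1 = np.zeros((len(users), len(users)), dtype=np.int8)
    for u, v in social:
        w1[uidx[u], uidx[v]] = 1
        if not directed:
            # Treat the relation as two-way unless told otherwise.
            w1[uidx[v], uidx[u]] = 1

    w2 = np.zeros((len(users), len(items)), dtype=np.int8)
    for u, a in rating:
        w2[uidx[u], iidx[a]] = 1
    return users, items, w1, w2
```

Passing `directed=True` keeps one-way relations such as following asymmetric in w₁.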

Second step, establishing the multi-task factor graph model:

(1) Network feature extraction: the factor graph model is a supervised learning model, so features for social links and scoring links must be extracted from the heterogeneous information in the network. For a specific node i in the social network, we can extract node features, including degree k(v_i), out-degree k_out(v_i), in-degree k_in(v_i), and clustering coefficient c_i. For a pair of social network nodes i and j, similarity indices are the features most relevant to predicting whether they are connected in the network. We therefore extract several traditional similarity indices as features.

The crossover network (the scoring relations between users and commodities) also carries hidden information about pairs of social network nodes: for example, the more commodities two users have both reviewed, the more likely they are to be friends. On this basis, we also extract similarity indices from the crossover network. Similarly, for a target node pair in the crossover network, user i and commodity a, we can extract features based on similarity indices.
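The similarity indices named in the background (common neighbours, Jaccard, Salton, HPI, AA) can be computed from neighbour sets as in the sketch below. The source does not say which indices are actually used or how they are implemented, so the function names, the dictionary-of-sets graph representation, and the co-rated-items feature are illustrative assumptions.

```python
import math

def similarity_features(adj, x, y):
    """Neighbourhood-based similarity indices for nodes x and y.

    adj maps each node to the set of its neighbours (undirected graph).
    """
    nx, ny = adj[x], adj[y]
    common = nx & ny
    kx, ky = len(nx), len(ny)
    return {
        "CN": len(common),
        "Jaccard": len(common) / len(nx | ny) if nx | ny else 0.0,
        "Salton": len(common) / math.sqrt(kx * ky) if kx and ky else 0.0,
        "HPI": len(common) / min(kx, ky) if kx and ky else 0.0,
        # Adamic-Adar: common neighbours weighted by inverse log-degree.
        "AA": sum(1.0 / math.log(len(adj[z])) for z in common if len(adj[z]) > 1),
    }

def co_rated_items(ratings, i, j):
    """Crossover-network feature: number of items both users have rated."""
    return len(ratings.get(i, set()) & ratings.get(j, set()))
```

Each value can serve as one attribute x_ei of a node pair; the crossover-network feature follows the observation that users who review many of the same commodities are more likely to be friends.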

(2) Construction of network migration structures: migration structures are an important factor in the factor graph model, because the label information of edges can be transferred through them. In this work, we build migration structures from triangles, so that information can propagate within the social network and between the social network and the crossover network. The migration structures are shown in Fig. 2-1 to Fig. 2-18.
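A simplified way to enumerate triangle-shaped candidates is sketched below. It ignores edge direction and edge labels, which the eighteen structures in the figures presumably distinguish, so it should be read as an assumption-laden illustration rather than the construction used in the invention; the function name and data layout are hypothetical.

```python
from itertools import combinations

def migration_triangles(social, ratings):
    """Enumerate candidate triangles linking node pairs.

    social  : dict user -> set of linked users (undirected)
    ratings : dict user -> set of rated items
    Returns (user-user-user triples, user-item-user triples); in each triple
    the middle element is the shared neighbour of the two outer users.
    """
    uuu = set()
    for u, nbrs in social.items():
        # u is the shared neighbour; each pair of its neighbours forms a triple.
        for v, w in combinations(sorted(nbrs), 2):
            uuu.add((v, u, w))

    # Invert the rating relation: item -> set of users who rated it.
    raters = {}
    for u, items in ratings.items():
        for a in items:
            raters.setdefault(a, set()).add(u)
    uiu = {(v, a, w) for a, us in raters.items() for v, w in combinations(sorted(us), 2)}
    return uuu, uiu
```

Each triple groups the labels of its three node pairs, which is what allows the label of one pair to inform the others.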

(3) Factor graph model construction: a coupled network G = (G^S, G^C) can be divided into a social network G^S and a crossover network G^C. Our goal is to learn a single model that simultaneously predicts potential social links and scoring links (the orange dashed lines in Fig. 3b).

For a node pair e_ij in the network, we use a label y_e to represent its state: y_e = 1 means that an edge exists between the node pair, and y_e = 0 means that no edge exists. The final output of the model is the probability P(y_e = 1) that the label is y_e = 1.

(a) Joint probability distribution

For a coupled social network G = (V, E, X), V = {v_i} denotes the node set, E = {e_ij} denotes the set of node pairs, and X is an attribute matrix in which each row is the attribute vector of a node pair e_ij. Our goal is to estimate the probability P(y_e | x_e) that each unknown link forms. We denote the joint probability distribution of the network by P(Y | X, G), where G represents all information about the network. This joint distribution expresses that the label of a link depends not only on the local attributes of the node pair but also on the network structure. It can be instantiated as:

$$P(Y \mid X, G) = \prod_{e \in E^S} \prod_{i=1}^{d} P(y_e^s \mid x_{ei}^s) \prod_{e \in E^C} \prod_{i=1}^{d'} P(y_e^c \mid x_{ei}^c) \prod_{\pi \in \Pi} \prod_{\varepsilon \in \pi} P(Y_\varepsilon) \qquad (1)$$

where d and d' are the feature dimensions of the social network and the crossover network respectively, x_ei is the i-th attribute value of node pair e, E^S is the set of node pairs in the social network, E^C is the set of node pairs in the crossover network, P(y_e^s | x_ei^s) is the probability of label y_e^s in the social network given attribute x_ei^s, P(y_e^c | x_ei^c) is the probability of label y_e^c in the crossover network given attribute x_ei^c, P(Y_ε) represents the influence of a migration structure, Π is the set of migration structure types, π is one type of migration structure, and ε is one migration structure of that type.

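(b) Factor instantiation

In principle, the attribute-association feature functions and the social-relation feature functions can be instantiated in different ways. Here we model them with the Hammersley-Clifford theorem for Markov random fields:

$$P(y_e^s \mid x_{ei}^s) = \frac{1}{Z_1} \exp\{\alpha_i f_i(x_{ei}^s, y_e^s)\} \qquad (2)$$

$$P(y_e^c \mid x_{ei}^c) = \frac{1}{Z_2} \exp\{\beta_i g_i(x_{ei}^c, y_e^c)\} \qquad (3)$$

$$P(Y_\varepsilon) = \frac{1}{Z_3} \exp\{\gamma_\varepsilon h_\varepsilon(Y_\varepsilon)\} \qquad (4)$$

where f_i(·), g_i(·), and h_ε(·) are the feature functions of the social network, the crossover network, and the migration structures respectively, α_i, β_i, and γ_ε are their corresponding weights, and Z₁, Z₂, and Z₃ are normalization factors.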
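To make the exponential form of a single attribute factor concrete, the sketch below evaluates P(y = 1 | x) for one attribute of one node pair. Two things here are assumptions for illustration only: the normalizer is taken over the two label values of that single pair, and the feature function defaults to the hypothetical choice f(x, y) = x · y. The source specifies neither.

```python
import math

def local_factor_prob(x, alpha, feature=lambda x, y: x * y):
    """P(y=1 | x) = exp(alpha * f(x, 1)) / Z, with Z summed over y in {0, 1}."""
    # Unnormalized factor values for y = 0 and y = 1.
    scores = [math.exp(alpha * feature(x, y)) for y in (0, 1)]
    return scores[1] / sum(scores)
```

Under these assumptions the factor reduces to a logistic function of alpha * x, so a zero weight or a zero attribute value gives probability 0.5.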

(c) Objective function optimization

Combining the above formulas, we finally obtain the objective function:

where Z = Z₁Z₂Z₃ is the normalization factor.
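One way to write such an objective, assuming it is the log-likelihood of the joint distribution P(Y | X, G) with exponential factors of the form (1/Z_k) exp{weight × feature function} and with all normalization constants collected into Z, is sketched below. The symbol O and the log-likelihood form are assumptions, not the text of the original formula.

```latex
\mathcal{O}(\alpha, \beta, \gamma) = \log P(Y \mid X, G)
  = \sum_{e \in E^S} \sum_{i=1}^{d} \alpha_i f_i(x_{ei}^s, y_e^s)
  + \sum_{e \in E^C} \sum_{i=1}^{d'} \beta_i g_i(x_{ei}^c, y_e^c)
  + \sum_{\pi \in \Pi} \sum_{\varepsilon \in \pi} \gamma_\varepsilon h_\varepsilon(Y_\varepsilon)
  - \log Z
```

The three sums correspond to the social network factors, the crossover network factors, and the migration structure factors, so a single set of weights (α, β, γ) is learned for both prediction tasks.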

With the method for stochastic gradient descent, the gradient of each parameter can be obtained：

With E [h_ε(Y_ε)] respectively represent data distribution function h_ε(Y_ε) expectation,WithIt is according to estimation model In P_{α, beta, gamma}Expectation under (Y | X, G) distribution.
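For a log-linear objective, the standard result is that the gradient with respect to each weight is the difference between the expectation of its feature function under the data and under the model. Assuming the objective O is the log-likelihood of P(Y | X, G), and introducing a hypothetical learning rate η, the gradients and the update can be sketched as:

```latex
\frac{\partial \mathcal{O}}{\partial \alpha_i}
  = E[f_i] - E_{P_{\alpha,\beta,\gamma}(Y \mid X, G)}[f_i], \qquad
\frac{\partial \mathcal{O}}{\partial \beta_i}
  = E[g_i] - E_{P_{\alpha,\beta,\gamma}(Y \mid X, G)}[g_i], \qquad
\frac{\partial \mathcal{O}}{\partial \gamma_\varepsilon}
  = E[h_\varepsilon(Y_\varepsilon)] - E_{P_{\alpha,\beta,\gamma}(Y \mid X, G)}[h_\varepsilon(Y_\varepsilon)]

\theta \leftarrow \theta + \eta \, \frac{\partial \mathcal{O}}{\partial \theta},
  \qquad \theta \in \{\alpha_i, \beta_i, \gamma_\varepsilon\}
```

The model expectations depend on the normalization factor Z and generally have to be approximated; the source does not state which inference procedure is used.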

Third step, evaluation of the prediction results

Three indices are used to measure the effectiveness of the method: AUC, Precision, and Ranking Score. They emphasize different aspects of prediction accuracy. AUC (area under the receiver operating characteristic curve) measures the accuracy of the algorithm as a whole. Precision only considers whether the edges ranked in the top L positions are predicted correctly. Ranking Score pays more attention to the ranking of the predicted edges.

AUC can be understood as the probability that the score of an edge in the test set is higher than the score of a randomly selected non-existent edge. That is, each time an edge is chosen at random from the test set and compared with a randomly chosen non-existent edge: if the score of the test-set edge is greater, 1 point is added; if the two scores are equal, 0.5 point is added. After n independent comparisons, if the test-set edge has the higher score n' times and the two scores are equal n'' times, AUC is defined as:

$$AUC = \frac{n' + 0.5\,n''}{n}$$

Obviously, if all scores are generated at random, AUC = 0.5. The degree to which AUC exceeds 0.5 therefore measures how much more accurate the algorithm is than random selection.
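The sampling procedure for AUC described above translates directly into code. The sketch below assumes the scores of test-set edges and of non-existent edges are available as two lists; the function name, the default number of comparisons, and the fixed seed are illustrative choices.

```python
import random

def auc_by_sampling(test_scores, nonexistent_scores, n=10000, seed=0):
    """Estimate AUC from n independent random comparisons."""
    rng = random.Random(seed)
    n_higher = n_equal = 0
    for _ in range(n):
        s_test = rng.choice(test_scores)          # score of a random test-set edge
        s_none = rng.choice(nonexistent_scores)   # score of a random non-existent edge
        if s_test > s_none:
            n_higher += 1
        elif s_test == s_none:
            n_equal += 1
    return (n_higher + 0.5 * n_equal) / n
```

Here `n_higher` and `n_equal` play the roles of n' and n'' in the definition.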

Precision is defined as the proportion of correct predictions among the top L predicted edges. If m predictions are correct, that is, m of the top L ranked edges are in the test set, Precision is defined as:

$$Precision = \frac{m}{L}$$

Obviously, the larger the Precision, the more accurate the prediction. If two algorithms have the same AUC and the Precision of algorithm 1 is greater than that of algorithm 2, then algorithm 1 is better, because it tends to rank truly connected node pairs nearer the top.
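Precision as defined above can be computed as in the following sketch, which assumes the predicted scores are held in a dictionary keyed by node pair; the function name and data layout are hypothetical.

```python
def precision_at_l(scores, test_edges, L):
    """Fraction of the top-L scored edges that are in the test set.

    scores     : dict edge -> predicted score, for all unknown edges
    test_edges : set of held-out true edges
    """
    ranked = sorted(scores, key=scores.get, reverse=True)[:L]
    m = sum(1 for e in ranked if e in test_edges)
    return m / L
```

Edges with equal scores are ordered arbitrarily by the sort, so ties at the cut-off position L can change the result.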

Ranking Score mainly considers the positions of the test-set edges in the final ranking. Let H = U − E^T be the set of unknown edges (that is, the union of the test-set edges and the non-existent edges), where U is the set of all node pairs and E^T is the training set, and let r_i be the rank of an unknown edge i ∈ E^P in the ranking, where E^P is the test set. The Ranking Score of this unknown edge is then RS_i = r_i / |H|, where |H| is the number of elements in H. Traversing all edges in the test set, the Ranking Score of the system is:

$$RS = \frac{1}{|E^P|} \sum_{i \in E^P} RS_i = \frac{1}{|E^P|} \sum_{i \in E^P} \frac{r_i}{|H|}$$
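As an illustration, the Ranking Score can be computed as below, assuming every edge in H has a predicted score stored in a dictionary and that ranks start at 1 for the highest score; the function name and data layout are hypothetical.

```python
def ranking_score(scores, test_edges):
    """Average of r_i / |H| over the test-set edges.

    scores     : dict edge -> predicted score, for every edge in H
    test_edges : set of held-out true edges (a subset of H)
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    rank = {e: pos for pos, e in enumerate(ranked, start=1)}
    h = len(ranked)
    return sum(rank[e] / h for e in test_edges) / len(test_edges)
```

With this convention a smaller value means the test-set edges sit nearer the top of the ranking.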

The key of the present invention is to build network migration structures and combine them with a factor graph model so that information flows between networks of different types, thereby realizing multi-task link prediction in heterogeneous networks.

The distinguishing features of the invention lie in network feature extraction, the construction of network migration structures, and the construction of the multi-task factor graph model.

The advantages of the invention are as follows. Because data are obtained by crawling, relevant data can be collected efficiently and comprehensively. Empirical analysis of the individual and interactive behaviour of social network users reveals the mutual influence between user propagation behaviour and network structure. We characterize the inherent coupling between the prediction tasks through network migration structures: users' preferences influence the formation of their interactions in the social network, and the social network in turn influences users' preferences. On this basis, we use a multi-task factor graph model that lets information propagate within and between networks, which solves the multi-task prediction problem and the data sparsity problem at the same time.

Brief description of the drawings

Fig. 1 is the flow chart of the method of the invention.

Fig. 2-1 to Fig. 2-10 show the user-user-user migration structures. Fig. 2-11 to Fig. 2-18 show the user-commodity-user migration structures.

Fig. 3a is an example of a coupled social network; Fig. 3b shows that a coupled social network can be divided into a social network and a crossover network; Fig. 3c shows examples of migration structures; Fig. 3d shows the output: the probabilities of missing edges in the social network and the scoring network.

Fig. 3a to Fig. 3d are schematic diagrams of multi-task link prediction in a coupled network according to the invention.

Detailed description of the embodiments

The technical scheme of the invention is further described below with reference to the accompanying drawings.

Second step, establishing the multi-task factor graph model:

(3) Factor graph model construction: a coupled network G = (G^S, G^C) can be divided into a social network G^S and a crossover network G^C. Our goal is to learn a single model that simultaneously predicts potential social links and scoring links (the orange dashed lines in Fig. 3b).

(a) Joint probability distribution

$$P(Y \mid X, G) = \prod_{e \in E^S} \prod_{i=1}^{d} P(y_e^s \mid x_{ei}^s) \prod_{e \in E^C} \prod_{i=1}^{d'} P(y_e^c \mid x_{ei}^c) \prod_{\pi \in \Pi} \prod_{\varepsilon \in \pi} P(Y_\varepsilon) \qquad (1)$$

where d and d' are the feature dimensions of the social network and the crossover network respectively, x_ei is the i-th attribute value of node pair e, P(y_e^s | x_ei^s) is the probability of label y_e^s in the social network given attribute x_ei^s, P(y_e^c | x_ei^c) is the probability of label y_e^c in the crossover network given attribute x_ei^c, P(Y_ε) represents the influence of a migration structure, and π is one type of migration structure.

(b) Factor instantiation

(c) Objective function optimization

Combining the above formulas, we finally obtain the objective function:

where Z = Z₁Z₂Z₃ is the normalization factor.

Third step, evaluation of the prediction results

Claims

1. A social network multi-task prediction method based on a factor graph model, comprising the following steps:

First step, network data acquisition: users' social information and behavioural information are collected by a web crawler, and the crawled data are cleaned to simplify subsequent computation; this step mainly comprises network data crawling and data preprocessing.

(11) Network data crawling: crawl users' social behaviour information and users' behaviour information towards commodities; each record contains either a user UserID and another user UserID, or a user UserID and a commodity ItemID.

(12) Data preprocessing: to simplify subsequent computation, redundant and incomplete records are removed from the data, forming the unified matrices required by the model: the user-user social behaviour matrix w₁ and the user-commodity scoring behaviour matrix w₂. In matrix w₁, element w₁^ij represents the friendship or following relation between user i and user j; in matrix w₂, element w₂^ij represents the collecting, purchasing, or reviewing relation between user i and commodity j.

Second step, establishing the multi-task factor graph model:

(21) Network feature extraction: the factor graph model is a supervised learning model, so features for social links and scoring links must be extracted from the heterogeneous information in the network. For a specific node i in the social network, node features are extracted, including degree k(v_i), out-degree k_out(v_i), in-degree k_in(v_i), and clustering coefficient c_i. For a pair of social network nodes i and j, similarity indices are the features most relevant to predicting whether they are connected in the network; therefore, several traditional similarity indices are extracted as features.

The crossover network (the scoring relations between users and commodities) also carries hidden information about pairs of social network nodes: the more commodities two users have both reviewed, the more likely they are to be friends. On this basis, several similarity indices are extracted from the crossover network. Similarly, for a target node pair in the crossover network, user i and commodity a, features are extracted based on the above similarity indices.

(22) Construction of network migration structures: migration structures are an important factor in the factor graph model, because the label information of edges can be transferred through them. The migration structures are built from triangles, so that information can propagate within the social network and between the social network and the crossover network.

(23) Factor graph model construction: a coupled network G = (G^S, G^C) can be divided into a social network G^S and a crossover network G^C; the goal is to learn a single model that simultaneously predicts potential social links and scoring links.

For a node pair e_ij in the network, a label y_e represents its state: y_e = 1 means that an edge exists between the node pair, and y_e = 0 means that no edge exists. The final output of the model is the probability P(y_e = 1) that the label is y_e = 1.

(a) Joint probability distribution

For a coupled social network G = (V, E, X), V = {v_i} denotes the node set, E = {e_ij} denotes the set of node pairs, and X is an attribute matrix in which each row is the attribute vector of a node pair e_ij. The goal is to estimate the probability P(y_e | x_e) that each unknown link forms. The joint probability distribution of the network is denoted by P(Y | X, G), where G represents all information about the network. This joint distribution expresses that the label of a link depends not only on the local attributes of the node pair but also on the network structure. It can be instantiated as:

$$P(Y \mid X, G) = \prod_{e \in E^S} \prod_{i=1}^{d} P(y_e^s \mid x_{ei}^s) \prod_{e \in E^C} \prod_{i=1}^{d'} P(y_e^c \mid x_{ei}^c) \prod_{\pi \in \Pi} \prod_{\varepsilon \in \pi} P(Y_\varepsilon) \qquad (1)$$

where d and d' are the feature dimensions of the social network and the crossover network respectively, x_ei is the i-th attribute value of node pair e, E^S is the set of node pairs in the social network, E^C is the set of node pairs in the crossover network, P(y_e^s | x_ei^s) is the probability of label y_e^s in the social network given attribute x_ei^s, P(y_e^c | x_ei^c) is the probability of label y_e^c in the crossover network given attribute x_ei^c, P(Y_ε) represents the influence of a migration structure, Π is the set of migration structure types, π is one type of migration structure, and ε is one migration structure of that type.

(b) Factor instantiation

In principle, the attribute-association feature functions and the social-relation feature functions can be instantiated in different ways. Here they are modelled with the Hammersley-Clifford theorem for Markov random fields:

$$P(y_e^s \mid x_{ei}^s) = \frac{1}{Z_1} \exp\{\alpha_i f_i(x_{ei}^s, y_e^s)\} \qquad (2)$$

$$P(y_e^c \mid x_{ei}^c) = \frac{1}{Z_2} \exp\{\beta_i g_i(x_{ei}^c, y_e^c)\} \qquad (3)$$

$$P(Y_\varepsilon) = \frac{1}{Z_3} \exp\{\gamma_\varepsilon h_\varepsilon(Y_\varepsilon)\} \qquad (4)$$

where f_i(·), g_i(·), and h_ε(·) are the feature functions of the social network, the crossover network, and the migration structures respectively, α_i, β_i, and γ_ε are their corresponding weights, and Z₁, Z₂, and Z₃ are normalization factors.

(c) Objective function optimization

Combining the above formulas, the objective function is finally obtained:

where Z = Z₁Z₂Z₃ is the normalization factor.

Third step, evaluation of the prediction results:

Three indices are used to measure the effectiveness of the method: AUC, Precision, and Ranking Score. They emphasize different aspects of prediction accuracy. AUC (area under the receiver operating characteristic curve) measures the accuracy of the algorithm as a whole. Precision only considers whether the edges ranked in the top L positions are predicted correctly. Ranking Score pays more attention to the ranking of the predicted edges.

AUC can be understood as the probability that the score of an edge in the test set is higher than the score of a randomly selected non-existent edge. That is, each time an edge is chosen at random from the test set and compared with a randomly chosen non-existent edge: if the score of the test-set edge is greater, 1 point is added; if the two scores are equal, 0.5 point is added. After n independent comparisons, if the test-set edge has the higher score n' times and the two scores are equal n'' times, AUC is defined as:

$$AUC = \frac{n' + 0.5\,n''}{n}$$

Obviously, if all scores are generated at random, AUC = 0.5. The degree to which AUC exceeds 0.5 therefore measures how much more accurate the algorithm is than random selection.

Precision is defined as the proportion of correct predictions among the top L predicted edges. If m predictions are correct, that is, m of the top L ranked edges are in the test set, Precision is defined as:

$$Precision = \frac{m}{L}$$

Obviously, the larger the Precision, the more accurate the prediction. If two algorithms have the same AUC and the Precision of algorithm 1 is greater than that of algorithm 2, then algorithm 1 is better, because it tends to rank truly connected node pairs nearer the top.

Ranking Score mainly considers the positions of the test-set edges in the final ranking. Let H = U − E^T be the set of unknown edges (that is, the union of the test-set edges and the non-existent edges), where U is the set of all node pairs and E^T is the training set, and let r_i be the rank of an unknown edge i ∈ E^P in the ranking, where E^P is the test set. The Ranking Score of this unknown edge is then RS_i = r_i / |H|, where |H| is the number of elements in H. Traversing all edges in the test set, the Ranking Score of the system is:

$$RS = \frac{1}{|E^P|} \sum_{i \in E^P} RS_i = \frac{1}{|E^P|} \sum_{i \in E^P} \frac{r_i}{|H|}$$