CN111160483B - Network relation type prediction method based on multi-classifier fusion model

Info

Publication number: CN111160483B
Application number: CN201911414801.1A
Authority: CN
Inventors: 刘闯; 于柿民; 张子柯
Original assignee: Hangzhou Normal University
Current assignee: Hangzhou Normal University
Priority date: 2019-12-31
Filing date: 2019-12-31
Publication date: 2023-03-17
Anticipated expiration: 2039-12-31
Also published as: CN111160483A

Abstract

The invention discloses a network relation type prediction method based on a multi-classifier fusion model. The method comprises the following steps: preprocessing the network data, including extracting the network structure from semi-structured data and converting it into an edge list or an adjacency list; performing node representation learning on the edge list or adjacency list with the Node2Vec network embedding method to obtain a k-dimensional feature vector for each node in the network; obtaining a 3×k-dimensional sample feature vector by concatenating the feature vectors of two adjacent nodes with the difference vector of those two feature vectors, and dividing the samples into a training set and a test set; using the classifiers obtained after hyperparameter tuning as base learners to make decisions; and fusing the classification results of the base learners as the input data of a meta-learner, which outputs the final network relation type prediction. The method is suitable for common social network data and improves prediction accuracy.

Description

Network relation type prediction method based on multi-classifier fusion model

Technical Field

The invention belongs to the field of computer technology, in particular to link prediction in complex networks, and relates to a network relation type prediction method based on a multi-classifier fusion model.

Background

Complex social relationships and the structures of biomolecules alike can be abstracted into network representations. Most real network data are incomplete: for example, social relationship networks suffer from sampling problems or hidden personal relationships, and in microbiology and medicine many experiments are needed to infer the interactions between cells or tissues. Constructing a network structure suited to the research problem of a given field, and then analyzing and mining the 'possible connections' or 'missing connections' in that structure, can bring new findings to the field and promote its development.

Link prediction in a network is defined as "predicting, from information such as network structure and network attributes, the possibility that a connection exists between two nodes in the network that are not yet connected". Link prediction algorithms draw on methods from multiple disciplines, such as similarity analysis, network dynamics, Bayesian models, machine learning, motif analysis and maximum likelihood analysis.

In the field of social networks, most researchers want to mine not only whether two nodes in a network are related but also what kind of relationship connects them. Network relation type prediction is applied in scenarios such as recommendation, advertising and community detection. The connection types in a social network are divided into "+" and "-", which respectively represent a good or bad relationship, a high or low status or skill level, and the like. Researchers have applied balance theory to network relation type prediction, i.e., "the friend of my friend is my friend, the friend of my enemy is my enemy, the enemy of my friend is my enemy, and the enemy of my enemy is my friend". Balance theory explores the characteristics of triangular network structures, capturing the relations among three users in order to predict positive and negative links. In reality, however, there are often a large number of "bridge" edges, i.e. edges connecting two adjacent nodes that have no common neighbor, whereas a "triangle" edge connects two adjacent nodes that share at least one common neighbor.

Although most edges in a directed network with signed relationships are contained in triangles, "bridge" edges also exist throughout the network. According to balance theory, the nodes in a triangle are connected to one another, so the relation type of an edge inside a triangle can be predicted effectively. A "bridge" edge lacks triangle information and cannot be modeled by balance theory. Predicting the relation type of "bridge" edges is therefore a challenging task.
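The distinction between "triangle" and "bridge" edges can be illustrated with a minimal Python sketch. It assumes edges are given as (start, end, sign) triples and that direction is ignored when looking for common neighbors; the function names and the toy data are hypothetical and are not taken from the patent.

```python
from collections import defaultdict

def build_neighbors(signed_edges):
    """Map each node to the set of its neighbours, ignoring edge direction."""
    neighbors = defaultdict(set)
    for u, v, _sign in signed_edges:
        neighbors[u].add(v)
        neighbors[v].add(u)
    return neighbors

def edge_kind(u, v, neighbors):
    """Return 'triangle' if u and v share a common neighbour, else 'bridge'."""
    common = (neighbors[u] & neighbors[v]) - {u, v}
    return "triangle" if common else "bridge"

def balanced_sign(sign_uw, sign_wv):
    """Sign that balance theory suggests for (u, v) given the path u-w-v."""
    return sign_uw * sign_wv

# Toy signed edges (illustrative only): a-b-c form a triangle, c-d is a bridge.
edges = [("a", "b", 1), ("b", "c", -1), ("a", "c", -1), ("c", "d", 1)]
nbrs = build_neighbors(edges)
kinds = {(u, v): edge_kind(u, v, nbrs) for u, v, _ in edges}
```

For the edge ("c", "d") no third node is available, so `balanced_sign` has nothing to multiply; this is the case the embedding-based method is meant to cover.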

Disclosure of Invention

The invention aims to provide a network relation type prediction method based on a multi-classifier fusion model aiming at the defects of the prior art.

The method uses the public signed social network data sets Slashdot, Epinions and Wikirfa, adopts the Node2Vec network embedding method together with the idea of model fusion from machine learning, and designs a relation type prediction method for social networks that performs well on Roc_auc, Binary_F1, Macro_F1 and Micro_F1 on network data sets containing "bridge" edges.

The method specifically comprises the following steps:

Step (1): preprocessing the network data, specifically:

(1-1) Structured representation of network data: the network data comprises structured network data and semi-structured network data, and the semi-structured network data is converted into the structured network data $\{(v_{s1}, v_{e1}, flag_1), (v_{s2}, v_{e2}, flag_2), \ldots, (v_{si}, v_{ei}, flag_i), \ldots, (v_{sn}, v_{en}, flag_n)\}$, where $n$ is the number of edges in the network, $v_{si}$ and $v_{ei}$ are the start node and end node of the $i$-th edge, and the sample label $flag_i$ equals 1 or -1, indicating that the actual relationship type is "friendly" or "hostile", $i = 1, 2, \ldots, n$.

(1-2) Standardized representation of structured network data: the structured network data is converted into an edge list or an adjacency list.

Edge list: $\{(v_{s1}, v_{e1}), (v_{s2}, v_{e2}), \ldots, (v_{si}, v_{ei}), \ldots, (v_{sn}, v_{en})\}$, where $(v_{si}, v_{ei})$ denotes that an edge exists between $v_{si}$ and $v_{ei}$.

Adjacency list:

$\{(v_{s1}, v_{s1\text{-}e1}, v_{s1\text{-}e2}, \ldots, v_{s1\text{-}ei}, \ldots, v_{s1\text{-}en}), (v_{s2}, v_{s2\text{-}e1}, v_{s2\text{-}e2}, \ldots, v_{s2\text{-}ei}, \ldots, v_{s2\text{-}en}), \ldots, (v_{sk}, v_{sk\text{-}e1}, v_{sk\text{-}e2}, \ldots, v_{sk\text{-}ei}, \ldots, v_{sk\text{-}en})\}$; $k$ denotes the dimension of each node in the network.
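The preprocessing of step (1) can be sketched as follows. This is only an illustration: the semi-structured record format `src=..;dst=..;sign=..` is a hypothetical stand-in for whatever raw format a data set uses, and all function names are assumptions.

```python
from collections import defaultdict

def parse_records(lines):
    """Turn hypothetical 'src=..;dst=..;sign=..' records into (start, end, flag) triples."""
    triples = []
    for line in lines:
        fields = dict(part.split("=", 1) for part in line.strip().split(";") if part)
        flag = int(fields["sign"])
        if flag not in (1, -1):
            continue  # keep only edges labelled friendly (1) or hostile (-1)
        triples.append((fields["src"], fields["dst"], flag))
    return triples

def to_edge_list(triples):
    """Drop the label: one (start, end) pair per edge."""
    return [(s, e) for s, e, _flag in triples]

def to_adjacency_list(triples):
    """One tuple per start node: the node followed by its end nodes."""
    adjacency = defaultdict(list)
    for s, e, _flag in triples:
        adjacency[s].append(e)
    return [(s, *targets) for s, targets in adjacency.items()]

# Toy records (illustrative only); the last one has no usable label and is skipped.
records = ["src=a;dst=b;sign=1", "src=a;dst=c;sign=-1",
           "src=b;dst=c;sign=1", "src=c;dst=a;sign=0"]
triples = parse_records(records)
edge_list = to_edge_list(triples)
adjacency_list = to_adjacency_list(triples)
```

Either output can then be handed to a network embedding tool; the labels stay in `triples` and are only joined back in when the samples are built.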

Step (2): node representation learning is performed on the edge list or adjacency list with the Node2Vec network embedding method to obtain a k-dimensional feature vector for each node in the network, specifically:

A biased random walk is adopted, in which the probability of walking from one node to the next is controlled by the parameters p (return parameter) and q (in-out parameter).

Given node $v$, the probability of selecting the next node $x$ by random walk is:

$$P(x \mid v) = \begin{cases} \pi_{vx} / Z, & \text{if an edge exists between } v \text{ and } x \\ 0, & \text{otherwise} \end{cases}$$

where $\pi_{vx}$ is the unnormalized transition probability between node $v$ and node $x$, and $Z$ is a normalization constant. Suppose the current random walk has reached node $v$ through edge $(t, v)$; then $\pi_{vx} = \alpha_{pq}(t, x) \cdot w_{vx}$, where $\alpha_{pq}(t, x)$ is the bias of node $x$ relative to node $t$, $w_{vx}$ is the weight between node $v$ and node $x$, and node $t$ is the node preceding node $v$ in the random walk sequence:

$$\alpha_{pq}(t, x) = \begin{cases} 1/p, & d_{tx} = 0 \\ 1, & d_{tx} = 1 \\ 1/q, & d_{tx} = 2 \end{cases}$$

$p$ and $q$ are hyperparameters, and $d_{tx} \in \{0, 1, 2\}$ is the shortest distance between node $t$ and node $x$.
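A minimal sketch of one biased walk is given below. It assumes the standard Node2Vec search bias (1/p when d_tx = 0, 1 when d_tx = 1, 1/q when d_tx = 2), an undirected toy graph stored as a dictionary of neighbor sets, and a default edge weight of 1 when no weight is supplied; the names `search_bias`, `transition_probs` and `biased_walk` are hypothetical.

```python
import random

def search_bias(t, x, neighbors, p, q):
    """alpha_pq(t, x): 1/p if x is t, 1 if x is adjacent to t, otherwise 1/q."""
    if x == t:               # d_tx = 0: step back to the previous node
        return 1.0 / p
    if x in neighbors[t]:    # d_tx = 1: x is also a neighbour of t
        return 1.0
    return 1.0 / q           # d_tx = 2: x moves away from t

def transition_probs(t, v, neighbors, weights, p, q):
    """Normalised probabilities pi_vx / Z for every neighbour x of v."""
    unnorm = {x: search_bias(t, x, neighbors, p, q) * weights.get((v, x), 1.0)
              for x in sorted(neighbors[v])}
    z = sum(unnorm.values())
    return {x: pi / z for x, pi in unnorm.items()}

def biased_walk(start, length, neighbors, weights, p, q, rng):
    """Sample one walk of at most `length` nodes starting from `start`."""
    walk = [start]
    while len(walk) < length:
        v = walk[-1]
        candidates = sorted(neighbors[v])
        if not candidates:
            break
        if len(walk) == 1:
            # No previous node yet: choose in proportion to edge weight only.
            w = [weights.get((v, x), 1.0) for x in candidates]
            walk.append(rng.choices(candidates, weights=w)[0])
        else:
            probs = transition_probs(walk[-2], v, neighbors, weights, p, q)
            walk.append(rng.choices(list(probs), weights=list(probs.values()))[0])
    return walk

# Toy undirected graph (illustrative only).
neighbors = {"t": {"v", "x1"}, "v": {"t", "x1", "x2"}, "x1": {"t", "v"}, "x2": {"v"}}
probs = transition_probs("t", "v", neighbors, {}, p=2.0, q=0.5)
walk = biased_walk("t", 5, neighbors, {}, p=2.0, q=0.5, rng=random.Random(0))
```

With p = 2 and q = 0.5 the unnormalized values at node v are 0.5 for returning to t, 1 for x1 and 2 for x2, so the walk is pushed outward, away from the node it just left.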

The sampled vertex sequences are learned with Node2Vec to obtain the set of network representation vectors of the nodes: $\{(f_{v1\text{-}1}, f_{v1\text{-}2}, \ldots, f_{v1\text{-}k}), (f_{v2\text{-}1}, f_{v2\text{-}2}, \ldots, f_{v2\text{-}k}), \ldots, (f_{vi\text{-}1}, f_{vi\text{-}2}, \ldots, f_{vi\text{-}k}), \ldots, (f_{vn\text{-}1}, f_{vn\text{-}2}, \ldots, f_{vn\text{-}k})\}$, where $(f_{vi\text{-}1}, f_{vi\text{-}2}, \ldots, f_{vi\text{-}k})$ is the k-dimensional feature vector corresponding to node $i$ in the network.

Step (3): network feature engineering is performed on the set of network representation vectors, and the result is divided into a training set and a test set, specifically:

(3-1) Network feature engineering: the feature vectors of two adjacent nodes are concatenated with the difference vector of those two feature vectors to obtain a 3×k-dimensional sample feature vector; that is, the structured network data is converted into $GraphData = \{sample_1, \ldots, sample_i, \ldots, sample_n\}$, where the sample $sample_i = ([f_{si\text{-}1}, f_{si\text{-}2}, \ldots, f_{si\text{-}k}], [f_{ei\text{-}1}, f_{ei\text{-}2}, \ldots, f_{ei\text{-}k}], [f_{si\text{-}1} - f_{ei\text{-}1}, f_{si\text{-}2} - f_{ei\text{-}2}, \ldots, f_{si\text{-}k} - f_{ei\text{-}k}], flag_i)$ corresponds to $(v_{si}, v_{ei}, flag_i)$ in the structured network data of step (1-1);

(3-2) Data set partitioning: GraphData is sampled at random, with 70-80% of the data used as the training set and 20-30% as the test set.
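The feature construction and the random split of step (3) can be sketched in plain Python as follows. The embeddings here are made-up 2-dimensional vectors rather than real Node2Vec output, and the helper names and the fixed seed are assumptions chosen for illustration.

```python
import random

def edge_features(f_start, f_end):
    """Concatenate start vector, end vector and their difference (3 x k values)."""
    diff = [a - b for a, b in zip(f_start, f_end)]
    return list(f_start) + list(f_end) + diff

def build_graph_data(triples, embeddings):
    """One (feature_vector, flag) sample per labelled edge."""
    return [(edge_features(embeddings[s], embeddings[e]), flag)
            for s, e, flag in triples]

def split_train_test(samples, train_ratio=0.8, seed=0):
    """Shuffle a copy of the samples and cut it at the requested ratio."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(round(train_ratio * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]

# Toy embeddings with k = 2 (illustrative values only).
embeddings = {"a": [1.0, 2.0], "b": [0.5, -1.0], "c": [0.0, 0.0]}
triples = [("a", "b", 1), ("b", "c", -1), ("a", "c", 1), ("c", "a", -1), ("b", "a", 1)] * 2
graph_data = build_graph_data(triples, embeddings)
train_set, test_set = split_train_test(graph_data, train_ratio=0.8)
```

Because the difference vector depends on the order of the two nodes, the samples for (a, b) and (b, a) differ, which keeps the direction of the edge visible to the classifiers.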

Step (4): on the training set, the GridCV parameter tuning method is used to optimize the hyperparameters of the base learners ExtraTrees, GradientBoosting, LightGBM and XGBoost respectively: the hyperparameter combination of each of ExtraTrees, GradientBoosting, LightGBM and XGBoost is obtained through GridCV, and each base learner is initialized with its hyperparameter combination to obtain the tuned models $BaseModel_1$, $BaseModel_2$, $BaseModel_3$ and $BaseModel_4$.
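The idea behind grid-based tuning with cross-validation can be sketched without any external library. In the sketch below a toy `ThresholdClassifier` stands in for a real base learner such as ExtraTrees, and `grid_search`, `cv_accuracy` and `k_fold_indices` are hypothetical helpers; accuracy is used as the selection score purely as an assumption.

```python
from itertools import product

class ThresholdClassifier:
    """Toy stand-in for a base learner: predicts 1 when one feature exceeds a threshold."""
    def __init__(self, feature=0, threshold=0.0):
        self.feature = feature
        self.threshold = threshold

    def fit(self, X, y):
        return self  # nothing is learned; behaviour is fixed by the hyperparameters

    def predict(self, X):
        return [1 if row[self.feature] > self.threshold else -1 for row in X]

def k_fold_indices(n, k):
    """Split range(n) into k contiguous folds of near-equal size."""
    folds, start = [], 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def cv_accuracy(make_model, X, y, k):
    """Fraction of samples predicted correctly when each fold is held out once."""
    correct = 0
    for fold in k_fold_indices(len(X), k):
        held_out = set(fold)
        train = [i for i in range(len(X)) if i not in held_out]
        model = make_model().fit([X[i] for i in train], [y[i] for i in train])
        preds = model.predict([X[i] for i in fold])
        correct += sum(pred == y[i] for pred, i in zip(preds, fold))
    return correct / len(X)

def grid_search(model_cls, param_grid, X, y, k=3):
    """Try every hyperparameter combination and keep the best-scoring one."""
    names = sorted(param_grid)
    best_params, best_score = None, -1.0
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = cv_accuracy(lambda: model_cls(**params), X, y, k)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Toy data (illustrative only): the label follows feature 0 crossing 0.5.
X = [[0.1, 5.0], [0.2, -3.0], [0.9, 4.0], [0.8, -2.0], [0.3, 1.0], [0.7, -1.0]]
y = [-1, -1, 1, 1, -1, 1]
best_params, best_score = grid_search(
    ThresholdClassifier, {"feature": [0, 1], "threshold": [0.0, 0.5]}, X, y, k=3)
tuned_model = ThresholdClassifier(**best_params)
```

In practice the same loop would be run once per base learner, each with its own parameter grid, and the winning combination would initialize the corresponding tuned model.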

Step (5): the tuned models are used as base learners to predict on the training set by K-fold cross-validation, the prediction results of the base learners are fused as the input data of the meta-learner RandomForest, and the final network relation type prediction is output, specifically:

(5-1) K-fold cross-validation of the base learners: K-fold cross-validation is performed on the training set for each of $BaseModel_1$, $BaseModel_2$, $BaseModel_3$ and $BaseModel_4$, i.e. the training set is divided into K parts to generate training subsets, and the four base learners produce the prediction sets $R_1$, $R_2$, $R_3$ and $R_4$ for the training subsets;

(5-2) Meta-learner training: the prediction sets $R_1$, $R_2$, $R_3$ and $R_4$ are concatenated to obtain a new training set, RandomForest is trained on it as the meta-learner, and the trained base learners and the meta-learner are combined to obtain the final fusion model StackingModel;

(5-3) Prediction and evaluation: the test set is predicted with StackingModel, and the prediction results are evaluated with Roc_auc, Binary_F1, Macro_F1 and Micro_F1 as the model performance evaluation indices.
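The fusion in step (5) can be read as stacking with out-of-fold predictions; the following sketch illustrates that reading under stated assumptions. `FeatureSign` is a toy stand-in for a tuned base model and `LookupMeta` is a toy stand-in for the RandomForest meta-learner; both classes, the interleaved fold assignment and all function names are hypothetical.

```python
from collections import Counter, defaultdict

class FeatureSign:
    """Toy base learner: predicts the sign of one feature."""
    def __init__(self, feature):
        self.feature = feature

    def fit(self, X, y):
        return self

    def predict(self, X):
        return [1 if row[self.feature] > 0 else -1 for row in X]

class LookupMeta:
    """Toy meta-learner: majority label for each combination of base predictions."""
    def fit(self, R, y):
        votes = defaultdict(Counter)
        for row, label in zip(R, y):
            votes[tuple(row)][label] += 1
        self.table = {key: c.most_common(1)[0][0] for key, c in votes.items()}
        self.default = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, R):
        return [self.table.get(tuple(row), self.default) for row in R]

def out_of_fold_predictions(make_model, X, y, k):
    """Predict every training sample with a model that did not see it."""
    oof = [None] * len(X)
    for fold in range(k):
        test_idx = [i for i in range(len(X)) if i % k == fold]
        train_idx = [i for i in range(len(X)) if i % k != fold]
        model = make_model().fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        for i, pred in zip(test_idx, model.predict([X[i] for i in test_idx])):
            oof[i] = pred
    return oof

def fit_stacking(base_factories, make_meta, X, y, k=3):
    """Build R_1..R_m column-wise, train the meta-learner, then refit the bases."""
    columns = [out_of_fold_predictions(f, X, y, k) for f in base_factories]
    R = [list(row) for row in zip(*columns)]
    meta = make_meta().fit(R, y)
    bases = [f().fit(X, y) for f in base_factories]
    return bases, meta

def predict_stacking(bases, meta, X):
    columns = [b.predict(X) for b in bases]
    return meta.predict([list(row) for row in zip(*columns)])

# Toy data (illustrative only): the label is 1 only when both features are positive.
X_train = [[1, 1], [1, -1], [-1, 1], [-1, -1], [2, 3], [2, -3], [-2, 3], [-2, -3]]
y_train = [1, -1, -1, -1, 1, -1, -1, -1]
bases, meta = fit_stacking(
    [lambda: FeatureSign(0), lambda: FeatureSign(1)], LookupMeta, X_train, y_train, k=3)
predictions = predict_stacking(bases, meta, [[5, 5], [5, -5], [-5, 5]])
```

Neither toy base learner can express the label on its own, but the meta-learner trained on their joint predictions can, which is the effect the fusion model aims for.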

The invention provides an effective network relation type prediction method based on a multi-classifier fusion model and improves prediction accuracy. It proposes a network feature engineering method that combines the node vector difference features with the features of the nodes themselves, and applies the idea of model fusion to the network relation type prediction task. The method can better serve practical applications such as recommendation, advertising and community discovery.

Detailed Description

The technical solution of the present invention is further explained below.

A network relation type prediction method based on a multi-classifier fusion model comprises the following specific steps:

Step (1): preprocessing the network data, specifically:

(1-1) Structured representation of network data: the network data comprises structured network data and semi-structured network data, and the semi-structured network data is converted into the structured network data $\{(v_{s1}, v_{e1}, flag_1), (v_{s2}, v_{e2}, flag_2), \ldots, (v_{si}, v_{ei}, flag_i), \ldots, (v_{sn}, v_{en}, flag_n)\}$, where $n$ is the number of edges in the network, $v_{si}$ and $v_{ei}$ are the start node and end node of the $i$-th edge, and the sample label $flag_i$ equals 1 or -1, indicating that the actual relationship type is "friendly" or "hostile", $i = 1, 2, \ldots, n$.

Adjacency list:

$\{(v_{s1}, v_{s1\text{-}e1}, v_{s1\text{-}e2}, \ldots, v_{s1\text{-}ei}, \ldots, v_{s1\text{-}en}), (v_{s2}, v_{s2\text{-}e1}, v_{s2\text{-}e2}, \ldots, v_{s2\text{-}ei}, \ldots, v_{s2\text{-}en}), \ldots, (v_{sk}, v_{sk\text{-}e1}, v_{sk\text{-}e2}, \ldots, v_{sk\text{-}ei}, \ldots, v_{sk\text{-}en})\}$; $k$ denotes the dimension of each node in the network.

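Given node $v$, the probability of selecting the next node $x$ by random walk is:

$$P(x \mid v) = \begin{cases} \pi_{vx} / Z, & \text{if an edge exists between } v \text{ and } x \\ 0, & \text{otherwise} \end{cases}$$

where $\pi_{vx}$ is the unnormalized transition probability between node $v$ and node $x$, and $Z$ is a normalization constant.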

Step (3): network feature engineering is performed on the set of network representation vectors, and the result is divided into a training set and a test set, specifically:

(3-1) Network feature engineering: the feature vectors of two adjacent nodes are concatenated with the difference vector of those two feature vectors to obtain a 3×k-dimensional sample feature vector; that is, the structured network data is converted into $GraphData = \{sample_1, \ldots, sample_i, \ldots, sample_n\}$, where the sample $sample_i = ([f_{si\text{-}1}, f_{si\text{-}2}, \ldots, f_{si\text{-}k}], [f_{ei\text{-}1}, f_{ei\text{-}2}, \ldots, f_{ei\text{-}k}], [f_{si\text{-}1} - f_{ei\text{-}1}, f_{si\text{-}2} - f_{ei\text{-}2}, \ldots, f_{si\text{-}k} - f_{ei\text{-}k}], flag_i)$ corresponds to $(v_{si}, v_{ei}, flag_i)$ in the structured network data of step (1-1);

(3-2) Data set partitioning: GraphData is sampled at random, with 70-80% of the data used as the training set and 20-30% as the test set; in this embodiment, 80% of the data is taken as the training set and 20% as the test set.

(5-1) K-fold cross-validation of the base learners: K-fold cross-validation is performed on the training set for each of $BaseModel_1$, $BaseModel_2$, $BaseModel_3$ and $BaseModel_4$, i.e. the training set is divided into K parts to generate training subsets, and the four base learners produce the prediction sets $R_1$, $R_2$, $R_3$ and $R_4$ for the training subsets;

(5-3) Prediction and evaluation: the test set is predicted with StackingModel, and Roc_auc, Binary_F1, Macro_F1 and Micro_F1 are introduced as the model performance evaluation indices to evaluate the prediction results.

Applying the method to the social network data sets Slashdot, Epinions and Wikirfa verifies the network relation type prediction task on large-scale networks. For example, on the Slashdot data set the evaluation indices Roc_auc, Binary_F1, Macro_F1 and Micro_F1 are 91.22%, 91.82%, 80.13% and 87.01% respectively; all four scores are better than those of existing methods, and the prediction method is generally applicable to network relation type prediction problems.
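For reference, the four evaluation indices can be computed as in the sketch below. It uses the usual textbook definitions (which are assumed, not quoted from the patent), treats label 1 as the positive class, and runs on made-up labels and scores that have no relation to the reported Slashdot figures.

```python
def f1_for_class(y_true, y_pred, cls):
    """F1 score of one class treated as the positive class."""
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def binary_f1(y_true, y_pred, positive=1):
    return f1_for_class(y_true, y_pred, positive)

def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)

def micro_f1(y_true, y_pred):
    """With one label per sample, micro-averaged F1 equals accuracy."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def roc_auc(y_true, scores, positive=1):
    """Probability that a positive sample is scored above a negative one (ties count half)."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy labels, predictions and scores (illustrative only).
y_true = [1, 1, 1, -1, -1]
y_pred = [1, 1, -1, -1, 1]
scores = [0.9, 0.8, 0.4, 0.3, 0.6]
results = {
    "Roc_auc": roc_auc(y_true, scores),
    "Binary_F1": binary_f1(y_true, y_pred),
    "Macro_F1": macro_f1(y_true, y_pred),
    "Micro_F1": micro_f1(y_true, y_pred),
}
```

Macro_F1 weights the "hostile" class as heavily as the "friendly" class, so on data sets dominated by positive edges it is typically the lowest of the four scores.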

Claims

1. A network relation type prediction method based on a multi-classifier fusion model is characterized by comprising the following steps:

step (1), preprocessing the network data, specifically:

(1-1) structured representation of network data: the network data comprises structured network data and semi-structured network data, and the semi-structured network data is converted into the structured network data $\{(v_{s1}, v_{e1}, flag_1), (v_{s2}, v_{e2}, flag_2), \ldots, (v_{si}, v_{ei}, flag_i), \ldots, (v_{sn}, v_{en}, flag_n)\}$; where $n$ is the number of edges in the network, $v_{si}$ and $v_{ei}$ are the start node and end node of the $i$-th edge, and the sample label $flag_i$ equals 1 or -1, indicating that the actual relationship type is friendly or hostile, $i = 1, 2, \ldots, n$;

adjacency list:

$\{(v_{s1}, v_{s1\text{-}e1}, v_{s1\text{-}e2}, \ldots, v_{s1\text{-}ei}, \ldots, v_{s1\text{-}en}), (v_{s2}, v_{s2\text{-}e1}, v_{s2\text{-}e2}, \ldots, v_{s2\text{-}ei}, \ldots, v_{s2\text{-}en}), \ldots, (v_{sk}, v_{sk\text{-}e1}, v_{sk\text{-}e2}, \ldots, v_{sk\text{-}ei}, \ldots, v_{sk\text{-}en})\}$; $k$ denotes the dimension of each node in the network;

adopting a biased random walk, in which the probability of walking from one node to the next is controlled by the parameters p and q; given node $v$, the probability of selecting the next node $x$ by random walk is:

$$P(x \mid v) = \begin{cases} \pi_{vx} / Z, & \text{if an edge exists between } v \text{ and } x \\ 0, & \text{otherwise} \end{cases}$$

where $\pi_{vx}$ is the unnormalized transition probability between node $v$ and node $x$, and $Z$ is a normalization constant; supposing the current random walk has reached node $v$ through edge $(t, v)$, $\pi_{vx} = \alpha_{pq}(t, x) \cdot w_{vx}$; $\alpha_{pq}(t, x)$ is the bias of node $x$ relative to node $t$, $w_{vx}$ is the weight between node $v$ and node $x$, and node $t$ is the node preceding node $v$ in the random walk sequence;

$$\alpha_{pq}(t, x) = \begin{cases} 1/p, & d_{tx} = 0 \\ 1, & d_{tx} = 1 \\ 1/q, & d_{tx} = 2 \end{cases}$$

$d_{tx} \in \{0, 1, 2\}$ is the shortest distance between node $t$ and node $x$;

learning the sampled vertex sequences with Node2Vec to obtain the set of network representation vectors of the nodes: $\{(f_{v1\text{-}1}, f_{v1\text{-}2}, \ldots, f_{v1\text{-}k}), (f_{v2\text{-}1}, f_{v2\text{-}2}, \ldots, f_{v2\text{-}k}), \ldots, (f_{vi\text{-}1}, f_{vi\text{-}2}, \ldots, f_{vi\text{-}k}), \ldots, (f_{vn\text{-}1}, f_{vn\text{-}2}, \ldots, f_{vn\text{-}k})\}$; $(f_{vi\text{-}1}, f_{vi\text{-}2}, \ldots, f_{vi\text{-}k})$ is the k-dimensional feature vector corresponding to node $i$ in the network;

(3-1) network feature engineering: the feature vectors of two adjacent nodes are concatenated with the difference vector of those two feature vectors to obtain a 3×k-dimensional sample feature vector, that is, the structured network data is converted into $GraphData = \{sample_1, \ldots, sample_i, \ldots, sample_n\}$, where the sample $sample_i = ([f_{si\text{-}1}, f_{si\text{-}2}, \ldots, f_{si\text{-}k}], [f_{ei\text{-}1}, f_{ei\text{-}2}, \ldots, f_{ei\text{-}k}], [f_{si\text{-}1} - f_{ei\text{-}1}, f_{si\text{-}2} - f_{ei\text{-}2}, \ldots, f_{si\text{-}k} - f_{ei\text{-}k}], flag_i)$ corresponds to $(v_{si}, v_{ei}, flag_i)$ in the structured network data of step (1-1);

(3-2) data set partitioning: GraphData is sampled at random, with 70-80% of the data used as the training set and 20-30% as the test set;

step (4), on the training set, using the GridCV parameter tuning method to optimize the hyperparameters of the base learners ExtraTrees, GradientBoosting, LightGBM and XGBoost respectively: the hyperparameter combination of each of ExtraTrees, GradientBoosting, LightGBM and XGBoost is obtained through GridCV, and each base learner is initialized with its hyperparameter combination to obtain the tuned models $BaseModel_1$, $BaseModel_2$, $BaseModel_3$ and $BaseModel_4$;

(5-2) meta-learner training: the prediction sets $R_1$, $R_2$, $R_3$ and $R_4$ are concatenated to obtain a new training set, RandomForest is trained on it as the meta-learner, and the trained base learners and the meta-learner are combined to obtain the final fusion model StackingModel;

(5-3) prediction and evaluation: the test set is predicted with StackingModel, and the prediction results are evaluated with Roc_auc, Binary_F1, Macro_F1 and Micro_F1 as the model performance evaluation indices.