CN112039700A - Social network link anomaly prediction method based on stacked generalization and cost-sensitive learning - Google Patents

Social network link anomaly prediction method based on stacked generalization and cost-sensitive learning

Info

Publication number
CN112039700A
CN112039700A (application CN202010873960.4A)
Authority
CN
China
Prior art keywords
model
probability
data
prediction
value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010873960.4A
Other languages
Chinese (zh)
Other versions
CN112039700B (en
Inventor
刘小洋
李祥
叶舒
马敏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University of Technology
Original Assignee
Chongqing University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University of Technology filed Critical Chongqing University of Technology
Priority to CN202010873960.4A priority Critical patent/CN112039700B/en
Publication of CN112039700A publication Critical patent/CN112039700A/en
Application granted granted Critical
Publication of CN112039700B publication Critical patent/CN112039700B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 41/00: Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L 41/14: Network analysis or design
    • H04L 41/147: Network analysis or design for predicting network behaviour
    • H04L 41/142: Network analysis or design using statistical or mathematical methods
    • H04L 41/145: Network analysis or design involving simulating, designing, planning or modelling of a network
    • H04L 43/00: Arrangements for monitoring or testing data switching networks
    • H04L 43/08: Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters
    • H04L 43/16: Threshold monitoring

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Physics & Mathematics (AREA)
  • Algebra (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Mathematical Physics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Pure & Applied Mathematics (AREA)
  • Environmental & Geological Engineering (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention provides a social network link anomaly prediction method based on stacked generalization and cost-sensitive learning, which comprises the following steps: S1, obtaining social network node data and taking the similarity indices in the obtained data as the features learned by the base models; S2, determining the hyper-parameters of the base models; S3, relearning the prediction results of the base models and obtaining the final prediction result. The method can predict link anomalies between social network nodes.

Description

Social network link anomaly prediction method based on stacked generalization and cost-sensitive learning
Technical Field
The invention relates to the technical field of social networks, and in particular to a social network link anomaly prediction method based on stacked generalization and cost-sensitive learning.
Background
In the real world, networks are ubiquitous, for example social networks, collaboration networks, protein-protein interaction networks, and communication networks. Analyzing these networks has attracted increasing attention not only in computer science but also in sociology, physics, bioinformatics, and statistics. Link prediction in social networks is a basic network analysis task: given known information (e.g., the network's nodes and structure), predict the likelihood that a link exists between two currently unconnected nodes. It should be noted that link prediction covers both the prediction of existing (missing) links and the prediction of future links.
Link prediction for social networks has been studied intensively. Over the past decades, various link prediction methods have been proposed, most of them based on the network structure. Here we briefly review the two mainstream families of link prediction methods: similarity methods (including node similarity and structural similarity) and likelihood estimation methods. To date, similarity-based link prediction has produced a series of results and has accordingly been widely applied in many fields. Similarity-based methods can be further divided into three categories, namely neighbor-based, path-based, and random-walk-based methods. The simplest link prediction methods rest on the assumption that two nodes are more likely to be linked if they share more common neighbors. Newman first used the Common Neighbor index (CN) to measure the similarity of two nodes, and many variants of CN followed, such as the Salton index, Resource Allocation index (RA), Adamic-Adar index (AA), Jaccard coefficient, Hub Promoted index (HPI), Leicht-Holme-Newman index (LHN), and Preferential Attachment index (PA). Extensive experiments on real networks show that the RA index performs best overall, while the PA index performs worst. Path-based methods compute the similarity of a node pair using the paths between the two nodes; examples include the Local Path index (LP) and the Katz index. The LP index only considers local paths of length 2 and 3, whereas the Katz index is based on all paths and can achieve high performance on practical networks. Random-walk-based methods use random walks to model interactions between nodes in a network. Representative methods include Average Commute Time (ACT), SimRank, Random Walk with Restart (RWR), and Local Random Walk (LRW). The ACT index is based on the average number of steps a random walker needs to travel from one node to another. SimRank measures how soon two random walkers, starting from two different nodes, will meet at the same node. RWR is a direct application of the PageRank algorithm. LRW is a local index that only considers a few steps of the random walk; it is known to outperform the ACT index with lower computational complexity than ACT and RWR. The second family of methods is based on likelihood estimation. Clauset et al. proposed a general technique to infer the hierarchical structure of a network and used it to predict missing links. The stochastic block model divides network nodes into several groups, and the connection probability between any two nodes is determined by the groups the nodes belong to. Pan et al. maximized the likelihood of the observed network based on a predefined structural Hamiltonian and scored unobserved links by the conditional probability of adding the link. Liben-Nowell and Kleinberg proposed a likelihood estimation method for link prediction, and new link prediction methods based on likelihood analysis have appeared since. Although computationally complex, these maximum likelihood methods can provide valuable insight.
The similarity methods and the likelihood estimation methods each have advantages and disadvantages. Similarity-based methods have low computational complexity, but their results are affected by the network structure: on networks with different structural features the results are unstable and not robust. Likelihood-estimation-based methods have a solid mathematical foundation and high prediction accuracy, but they require strict assumptions, are computationally expensive, and are not suitable for large-scale networks.
Disclosure of Invention
The invention aims to solve at least the above technical problems in the prior art, and in particular provides a social network link anomaly prediction method based on stacked generalization and cost-sensitive learning.
In order to achieve the above object, the present invention provides a social network link anomaly prediction method based on stack generalization and cost-sensitive learning, including the following steps:
s1, obtaining social network node data, and taking similarity indexes in the obtained social network node data as characteristics of basic model learning;
s2, determining the hyper-parameters of the basic model;
s3, relearning the prediction result of the base model; and obtaining a final prediction result.
In a preferred embodiment of the present invention, the base model in step S1 includes:
Given a dataset D = {(x_1, y_1), (x_2, y_2), (x_3, y_3), ..., (x_N, y_N)}, where x_i ∈ R^n and y_i ∈ {0,1}; when y_i = 0, y_i represents the negative class; when y_i = 1, y_i represents the positive class; i = 1, 2, 3, ..., N. R^n denotes the sample feature space, n is the number of features of each sample, and N is the number of samples in the dataset D.
Since w^T x + b takes continuous values, where w is a column vector of dimension (n, 1), T denotes transposition, x is a column vector of dimension (n, 1), and b is of dimension (1, 1), it cannot fit a discrete variable directly; it can instead be used to fit the conditional probability P(Y = 1 | x). However, for w ≠ 0 (if w is the zero vector there is nothing to solve), the value of w^T x + b ranges over the real numbers R and does not satisfy the requirement that a probability lie between 0 and 1, so a generalized linear model is considered.
Since the unit step function is not differentiable, the logistic (log-odds) function is a typical surrogate:
y = 1 / (1 + e^{-(w^T x + b)})
Thus:
ln(y / (1 - y)) = w^T x + b
If y is the probability that x is a positive example, then 1 - y is the probability that x is a negative example. The ratio of the two is called the odds, i.e., the ratio of the probability that the event occurs to the probability that it does not occur; if the probability of the event occurring is P, the log-odds is:
logit(P) = ln(P / (1 - P))
Regarding y as an estimate of the class posterior probability P(Y = 1 | x), the formula can be rewritten as:
ln(P(Y = 1 | x) / (1 - P(Y = 1 | x))) = w^T x + b
P(Y = 1 | x) = e^{w^T x + b} / (1 + e^{w^T x + b})
That is, the log-odds of the output Y = 1 is represented by a linear function of the input x; this is the logistic regression model. The closer the value of w^T x + b is to positive infinity, the closer the probability P(Y = 1 | x) is to 1. The idea of logistic regression is therefore to fit a decision boundary first and then establish the probabilistic connection between the boundary and the classes, thereby obtaining the class probabilities in the binary case.
After the mathematical form of the logistic regression model is determined, it remains to solve for the parameters in the model. In statistics, maximum likelihood estimation is commonly used: a set of parameters is sought under which the likelihood of the data is maximized. Let:
P(Y = 1 | x) = p(x)
P(Y = 0 | x) = 1 - p(x)
p(x_i) denotes the probability that the i-th sample belongs to the positive class (Y = 1) given its features x_i;
y_i is the label of the i-th sample in the given dataset D of the binary classification problem, y_i ∈ {0,1}.
The likelihood is the product over samples of p(x_i)^{y_i} (1 - p(x_i))^{1 - y_i}; for convenience, taking logarithms of both sides gives the log-likelihood function:
L(w) = Σ_{i=1}^{N} [y_i ln p(x_i) + (1 - y_i) ln(1 - p(x_i))]
In machine learning there is the concept of a loss function, which measures the degree of model prediction error. Taking the average log-likelihood loss over the entire dataset gives:
J(w) = -(1/N) L(w) = -(1/N) Σ_{i=1}^{N} [y_i ln p(x_i) + (1 - y_i) ln(1 - p(x_i))]
where N represents the number of samples in the dataset D;
that is, in the logistic regression model, maximizing the likelihood function and minimizing the loss function are equivalent.
There are many methods for solving logistic regression; here the gradient descent method is mainly used. The objective of the optimization is to find a direction in which to move the parameters so that the value of the loss function decreases; this direction is usually found from the first-order or second-order partial derivatives. The loss function of logistic regression is:
J(w) = -(1/N) Σ_{i=1}^{N} [y_i ln p(x_i) + (1 - y_i) ln(1 - p(x_i))]
Gradient descent finds the descent direction via the first derivative of J(w) with respect to w and updates the parameters iteratively:
g_i = ∂J(w)/∂w_i = (p(x_i) - y_i) x_i
w_i^{k+1} = w_i^k - α g_i
w_i^k denotes the weight parameter after the k-th iterative update of the i-th weight parameter;
α is the learning rate and controls the step size of one parameter update;
w_i^{k+1} denotes the weight parameter after the (k+1)-th iterative update of the i-th weight parameter;
w_i denotes the i-th weight parameter.
In a preferred embodiment of the present invention, in step S2, the method for determining the hyper-parameters of the base models includes one of, or any combination of, cross-validation, grid search, and early stopping.
In a preferred embodiment of the present invention, in step S3, it is determined according to the final prediction result: if the final prediction result FinalPredictionLabel is greater than or equal to the preset result threshold, the two nodes are abnormal links; and if the final prediction result FinalPredictionLabel is smaller than the preset result threshold, the two nodes are normal links.
In conclusion, due to the adoption of the technical scheme, the method and the device can predict the link abnormity of the social network node.
Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
The above and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
FIG. 1 is a schematic diagram of the stacked generalization training process of the present invention.
FIG. 2 is a schematic diagram of LLSLP model framework of the present invention.
FIG. 3 is a schematic diagram of a confusion matrix according to the present invention.
FIG. 4 is a schematic diagram of ROC for each algorithm on the FBK data set of the present invention.
FIG. 5 is a PR diagram of the algorithms on the FBK data set of the present invention.
FIG. 6 is a PR diagram of the algorithms on the FBK data set of the present invention.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the accompanying drawings are illustrative only for the purpose of explaining the present invention, and are not to be construed as limiting the present invention.
1 Introduction
1.1 background
In the real world, networks are ubiquitous, for example social networks, collaboration networks, protein-protein interaction networks, and communication networks. Analyzing these networks has attracted increasing attention not only in computer science but also in sociology, physics, bioinformatics, and statistics. Link prediction in social networks is a basic network analysis task: given known information (e.g., the network's nodes and structure), predict the likelihood that a link exists between two currently unconnected nodes. It should be noted that link prediction covers both the prediction of existing (missing) links and the prediction of future links.
Link prediction for social networks has been studied intensively. Over the past decades, various link prediction methods have been proposed, most of them based on the network structure. Here we briefly review the two mainstream families of link prediction methods: similarity methods (including node similarity and structural similarity) and likelihood estimation methods. To date, similarity-based link prediction has produced a series of results and has accordingly been widely applied in many fields. Similarity-based methods can be further divided into three categories, namely neighbor-based, path-based, and random-walk-based methods. The simplest link prediction methods rest on the assumption that two nodes are more likely to be linked if they share more common neighbors. Newman first used the Common Neighbor index (CN) to measure the similarity of two nodes, and many variants of CN followed, such as the Salton index, Resource Allocation index (RA), Adamic-Adar index (AA), Jaccard coefficient, Hub Promoted index (HPI), Leicht-Holme-Newman index (LHN), and Preferential Attachment index (PA). Extensive experiments on real networks show that the RA index performs best overall, while the PA index performs worst. Path-based methods compute the similarity of a node pair using the paths between the two nodes; examples include the Local Path index (LP) and the Katz index. The LP index only considers local paths of length 2 and 3, whereas the Katz index is based on all paths and can achieve high performance on practical networks. Random-walk-based methods use random walks to model interactions between nodes in a network. Representative methods include Average Commute Time (ACT), SimRank, Random Walk with Restart (RWR), and Local Random Walk (LRW). The ACT index is based on the average number of steps a random walker needs to travel from one node to another. SimRank measures how soon two random walkers, starting from two different nodes, will meet at the same node. RWR is a direct application of the PageRank algorithm. LRW is a local index that only considers a few steps of the random walk; it is known to outperform the ACT index with lower computational complexity than ACT and RWR. The second family of methods is based on likelihood estimation. Clauset et al. proposed a general technique to infer the hierarchical structure of a network and used it to predict missing links. The stochastic block model divides network nodes into several groups, and the connection probability between any two nodes is determined by the groups the nodes belong to. Pan et al. maximized the likelihood of the observed network based on a predefined structural Hamiltonian and scored unobserved links by the conditional probability of adding the link. Liben-Nowell and Kleinberg proposed a likelihood estimation method for link prediction, and new link prediction methods based on likelihood analysis have appeared since. Although computationally complex, these maximum likelihood methods can provide valuable insight.
The similarity methods and the likelihood estimation methods each have advantages and disadvantages. Similarity-based methods have low computational complexity, but their results are affected by the network structure: on networks with different structural features the results are unstable and not robust. Likelihood-estimation-based methods have a solid mathematical foundation and high prediction accuracy, but they require strict assumptions, are computationally expensive, and are not suitable for large-scale networks.
1.2 major contributions
1) Aiming at the problem that traditional link prediction algorithms only consider a single similarity index, are easily influenced by the network structure, and generalize poorly, a new social network link prediction method (LLSLP) is proposed on the basis of fusing 15 traditional similarity indices.
2) The proposed LLSLP method not only fuses the traditional similarity indices but also introduces the Stacking idea. A Logistic Regression model and a LightGBM model perform nonlinear computation on the 15 traditional similarity indices to obtain fused index features. On this basis, a Logistic Regression model learns the fused features, and cross-validation, grid search, and early stopping are used for optimization, so that the proposed LLSLP obtains more complementarity, a more stable effect, and good generalization.
3) Detailed, systematic evaluation and analysis were performed on 10 social network data sets (SMG, EML, NSC, YST, HMT, KHN, FBK, UGP, ADV, and GRQ) from different domains, with different scales and network structures. In addition, 7 different evaluation indices are adopted to measure the performance of the algorithms and models more comprehensively. The LLSLP method presented herein is compared with the single traditional algorithms and models.
4) The experimental results show that the overall performance of LLSLP on each experimental data set is better than that of the traditional algorithms and models: the AUC value not only reaches more than 98.71%, but is also on average 10.52% higher than that of the 15 traditional link prediction algorithms CN, Sal, Jac, Sor, HPI, HDI, LHN-I, PA, AA, RA, LP, Katz, ACT, Cos, and RWR. Under extreme class imbalance of the data sets, the F1-score and MCC values achieve improvements of 3.25%-9.73% and 5.90%-10.21%, respectively, relative to the 15 traditional link prediction algorithms. Better predictions are obtained on different data sets, and the effectiveness, stability, and generalization of the algorithm are verified through analysis of the results.
The rest of the text is arranged as follows. In section 2, Logistic Regression, LightGBM, and Stacking are described in detail. In section 3, the proposed LLSLP method is introduced. The experimental setup is discussed in section 4 and the experimental results are analyzed in comparison. Finally, this document is summarized in section 5. In addition, the appendix section supplements the relevant experimental data graphs, including ROC graphs, PR graphs, and confusion matrix graphs.
2 basic model
2.1 Logistic Regression
The essence of Logistic Regression (LR) is to assume that the data obey the logistic distribution and then use maximum likelihood estimation for parameter estimation. Although it is called regression, it is actually a classification model and is commonly used for binary classification. Logistic Regression is described below taking the binary classification problem as an example:
given a dataset D ═ x considering a binary problem1,y1),(x2,y2),(x3,y3),……,(xN,yN) Wherein, in the step (A),
Figure BDA0002652022070000061
yie {0,1 }; when y isiWhen equal to 0, yiRepresents a negative class; when y isiWhen 1, yiRepresents a positive class; 1,2,3, …, N;
Figure BDA0002652022070000062
representing a sample feature space, wherein n represents the feature number of each sample; n represents the number of samples in the data set D.
Due to wTThe x + b values are continuous, wherein w represents a column vector and the dimension is (n, 1); t represents transposition; x represents a column vector with dimension (n, 1); b represents a column vector with dimension (1, 1); it cannot fit discrete variables and can be considered to fit the conditional probability P (Y ═ 1| x). However, for w ≠ 0 (with zero vectors, there is no value for solving), wTThe value of x + b is a real number R, and the value of unsatisfied probability is 0 to 1, so that a generalized linear model is considered.
Since the unit step function is not trivial, the log probability function is a typical alternative function:
Figure BDA0002652022070000071
thus, there are:
Figure BDA0002652022070000072
if y is the probability that x is positive, then 1-y is the probability that x is negative. The ratio of the two is called probability (odds), which refers to the ratio of the probability of the event occurring to the probability of not occurring, and if the probability of the event occurring is P, the log probability:
Figure BDA0002652022070000073
regarding y as the class posterior probability estimation, the rewrite formula is:
Figure BDA0002652022070000074
Figure BDA0002652022070000075
that is, the log-probability of output Y being 1 is a model represented by a linear function of input x, which is a logistic regression model. When w isTThe more the value of + b getsTo be more infinite, the probability value of P (Y ═ 1| x) is closer to 1. The idea of logistic regression is to fit a decision boundary (not limited to linear but also polynomial) and then establish the probability relationship between the boundary and the classification, so as to obtain the probability under the two classification conditions.
After the mathematical form of the logistic regression model is determined, how to solve the parameters in the model remains. In statistics, a maximum likelihood estimation method is often used to solve, that is, a set of parameters is found, so that the likelihood (probability) of data is maximum under the set of parameters. Order:
Figure BDA0002652022070000076
Figure BDA0002652022070000077
p(xi) The conditional probability in expression (6) indicates that the ith sample has a known characteristic of xiIn the case of (2), the probability of the positive type (Y ═ 1) is used.
yiThat is, the two-class problem is given in the data set D, i.e., yi=y1,y2,y3,...,yn,yi∈{0,1};
For more convenient solution, logarithms are taken from two sides of the peer-to-peer equation and written into log-likelihood functions:
Figure BDA0002652022070000078
the notion of a loss function is lost in machine learning, which measures how wrong the model predicts. If the average log-likelihood loss over the entire data set is taken, one can obtain:
Figure BDA0002652022070000079
wherein N represents the number of samples in the data set D;
i.e., in a logistic regression model, the maximum likelihood function and the minimum loss function are practically equivalent.
There are many methods for solving logistic regression, and here, a gradient descent method is mainly used. The main objective of the optimization is to find a direction towards which the parameters are moved to enable the value of the loss function to be reduced, often by first order partial derivatives or various combinations of second order partial derivatives. The loss function of the logistic regression is:
Figure BDA0002652022070000081
gradient descent finds the descending direction by the first derivative of j (w) to w, and updates the parameters in an iterative manner by:
Figure BDA0002652022070000082
Figure BDA0002652022070000083
Figure BDA0002652022070000084
representing the updated weight parameter of the kth iteration of the ith sample weight parameter;
alpha represents the learning rate and represents the speed of 1-time parameter iterative updating;
Figure BDA0002652022070000085
representing the weight parameter after the (k + 1) th iteration update of the ith sample weight parameter;
wirepresenting the weight parameter of the ith sample.
Where k is the number of iterations. After each update of the parameters, the parameters can be updated by comparing | | | J (w)k+1)-J(wk) I is less than a threshold value orThe maximum number of iterations is reached and the iteration is stopped.
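For illustration only (not part of the original description), the following is a minimal NumPy sketch of binary logistic regression fitted by the batch gradient descent of Equations (9)-(12); the function names, step size, stopping threshold, and synthetic data are assumptions made for this example.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def fit_logistic_regression(X, y, alpha=0.1, max_iter=1000, tol=1e-6):
        # Batch gradient descent on the average negative log-likelihood J(w).
        N, n = X.shape
        Xb = np.hstack([X, np.ones((N, 1))])   # append a constant column for the bias b
        w = np.zeros(n + 1)
        prev_loss = np.inf
        for k in range(max_iter):
            p = sigmoid(Xb @ w)                 # p(x_i) = P(Y = 1 | x_i)
            loss = -np.mean(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))
            grad = Xb.T @ (p - y) / N           # first derivative of J(w) with respect to w
            w -= alpha * grad                   # w^{k+1} = w^k - alpha * gradient
            if abs(prev_loss - loss) < tol:     # stop when |J(w^{k+1}) - J(w^k)| < threshold
                break
            prev_loss = loss
        return w

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        X = rng.normal(size=(200, 15))          # e.g. 15 similarity features per node pair
        true_w = rng.normal(size=15)
        y = (X @ true_w > 0).astype(float)      # synthetic labels for the sketch
        w = fit_logistic_regression(X, y)
        p = sigmoid(np.hstack([X, np.ones((200, 1))]) @ w)
        print("training accuracy:", np.mean((p >= 0.5) == y))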
2.1.1 regularization
Regularization is a general algorithm and idea, so any algorithm that exhibits overfitting can use regularization to avoid it. On the basis of minimizing the empirical risk (i.e., minimizing the training error), a model that is as simple as possible is adopted, which can effectively improve the generalization prediction accuracy. If the model is too complex, small changes in the variable values cause large changes in the prediction, which harms accuracy. Regularization is effective because it reduces the weights of the features and makes the model simpler. Regularization typically takes the L1 or L2 norm form, Φ(w) = ||w||_1 or Φ(w) = ||w||_2, respectively.
1) L1 regularization (LASSO regression) is equivalent to adding the following prior knowledge to the model: w follows a zero-mean Laplace distribution. The Laplace density is:
f(w | μ, σ) = (1/(2σ)) exp(-|w - μ| / σ)   (13)
μ is the location parameter of the Laplace distribution; when μ = 0, the symmetry axis of the Laplace density curve is the y-axis. σ is the scale parameter.
With this prior knowledge introduced, the likelihood function becomes:
L(w) = Π_{i=1}^{N} p(x_i)^{y_i} (1 - p(x_i))^{1 - y_i} Π_{j=1}^{d} (1/(2σ)) exp(-|w_j| / σ)   (14)
where d represents the number of weight parameters w that need to be regularized.
Taking the logarithm and then the negative (dropping constant terms) gives the objective function:
J(w) = -Σ_{i=1}^{N} [y_i ln p(x_i) + (1 - y_i) ln(1 - p(x_i))] + (1/σ) Σ_{j=1}^{d} |w_j|   (15)
Equation (15) is equivalent to the original loss function followed by an L1 regularization term, so the essence of L1 regularization is to add to the model the prior knowledge that the model parameters obey a zero-mean Laplace distribution.
2) L2 regularization (ridge regression) is equivalent to adding the following prior knowledge to the model: w follows a zero-mean normal distribution. The normal density is:
f(w | 0, σ) = (1/(√(2π) σ)) exp(-w^2 / (2σ^2))   (16)
With this prior knowledge introduced, the likelihood function becomes:
L(w) = Π_{i=1}^{N} p(x_i)^{y_i} (1 - p(x_i))^{1 - y_i} Π_{j=1}^{d} (1/(√(2π) σ)) exp(-w_j^2 / (2σ^2))   (17)
Taking the logarithm and then the negative (dropping constant terms) gives the objective function:
J(w) = -Σ_{i=1}^{N} [y_i ln p(x_i) + (1 - y_i) ln(1 - p(x_i))] + (1/(2σ^2)) Σ_{j=1}^{d} w_j^2   (18)
Equation (18) is equivalent to the original loss function followed by an L2 regularization term, so the essence of L2 regularization is to add to the model the prior knowledge that the model parameters follow a zero-mean normal distribution.
L1 regularization adds the L1 norm as the regularization term after the loss function; adding the L1 norm tends to produce sparse solutions (with many zero entries). L2 regularization adds the square of the L2 norm as the regularization term after the loss function; compared with L1 regularization, L2 regularization yields a smoother (non-sparse) solution, but it also keeps many dimensions of the solution close to 0 (though not exactly 0, hence the smoothness), which reduces the complexity of the model.
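As a brief illustration of how the penalties of Equations (15) and (18) are applied in practice, the sketch below uses scikit-learn's LogisticRegression with the two penalty types; the tooling choice, parameter values, and synthetic data are assumptions of this example rather than requirements of the method.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(1)
    X = rng.normal(size=(300, 15))
    y = (X[:, 0] - X[:, 1] + 0.1 * rng.normal(size=300) > 0).astype(int)

    # L1 penalty (zero-mean Laplace prior): tends to produce sparse coefficients.
    l1_model = LogisticRegression(penalty="l1", solver="liblinear", C=1.0).fit(X, y)

    # L2 penalty (zero-mean Gaussian prior): shrinks coefficients smoothly toward zero.
    l2_model = LogisticRegression(penalty="l2", solver="lbfgs", C=1.0).fit(X, y)

    print("non-zero L1 coefficients:", int(np.sum(l1_model.coef_ != 0)))
    print("non-zero L2 coefficients:", int(np.sum(l2_model.coef_ != 0)))

Here C is the inverse regularization strength, so a smaller C corresponds to a larger σ-dependent penalty in Equations (15) and (18).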
2.2 LightGBM
Boosting trees are learned as an additive model via a forward stagewise algorithm and have several efficient implementations, such as XGBoost, pGBRT, and GBDT (Gradient Boosting Decision Tree). GBDT uses the negative gradient as the splitting indicator (information gain), while XGBoost also uses second derivatives. Their common disadvantage is that computing the information gain requires scanning all samples to find the optimal split point, so their efficiency and scalability are unsatisfactory when facing large amounts of data or high feature dimensions. The direct way to solve this problem is to reduce the number of features and the amount of data without affecting accuracy. Some work accelerates the boosting process by sampling according to data weights, but GBDT has no sample weights, so such methods cannot be applied directly.
Microsoft's open-source LightGBM (based on GBDT) solves these problems well; it mainly contains two algorithms:
1) Gradient-based One-Side Sampling (GOSS), which reduces the number of samples: most of the small-gradient samples are excluded, and only the remaining samples are used to compute the information gain. GBDT has no data weights, but each data instance has a different gradient, and by the definition of the information gain, instances with large gradients have a larger influence on it. Therefore, when down-sampling, the samples with large gradients (above a preset threshold, or in the top percentiles) should be kept as far as possible, while the samples with small gradients are removed at random. This is shown to achieve more accurate results than random sampling at the same sampling rate, especially when the information gain has a large range.
2) Exclusive Feature Bundling (EFB), which reduces the number of features: mutually exclusive features, i.e., features that rarely take non-zero values at the same time, are bundled (replaced by a single composite feature). In applications, although the number of features is relatively large, the feature space is very sparse, which raises the question of whether a lossless method can be designed to reduce the number of effective features. In sparse feature spaces in particular, many features are almost mutually exclusive (for example, one-hot features are never simultaneously non-zero), and such mutually exclusive features can be bundled.
2.2.1 Gradient-based One-Side Sampling
GBDT uses decision trees to learn a function that maps the input space to the gradient space. Assume that the training set has n instances {x_1, ..., x_n}, each with feature dimension s. At each gradient iteration, the negative gradients of the loss function with respect to the model outputs are denoted {g_1, ..., g_n}, and the decision tree splits the data at each node through the optimal split point (the point of maximum information gain). GBDT measures the information gain by the variance after the split.
Definition 1: Let O denote the training set of a fixed node. The variance gain of split point d of feature j is defined as:
V_{j|O}(d) = (1/n_O) [ (Σ_{x_i ∈ O: x_{ij} ≤ d} g_i)^2 / n_{l|O}^j(d) + (Σ_{x_i ∈ O: x_{ij} > d} g_i)^2 / n_{r|O}^j(d) ]   (19)
where n_O = Σ I[x_i ∈ O];
V_{j|O}(d) denotes the variance gain;
d is the candidate split point of the feature;
n_{l|O}^j(d) = Σ I[x_i ∈ O: x_{ij} ≤ d] counts the instances to the left of the split point;
n_{r|O}^j(d) = Σ I[x_i ∈ O: x_{ij} > d] counts the instances to the right of the split point;
x_{ij} denotes the j-th feature of the i-th sample.
For each feature j, every split point d is traversed to find d_j* = argmax_d V_j(d) and to compute the maximum information gain V_j(d_j*). The data are then split by split point d_{j*} of the selected feature j* into left and right child nodes.
In GOSS:
1) first, the training instances are sorted in descending order according to the absolute values of their gradients;
2) the top a fraction of instances is retained as a data subset A;
3) from the remaining instances, a data subset B of size b is obtained by random sampling;
4) finally, the information gain is estimated by:
Ṽ_j(d) = (1/n) [ (Σ_{x_i ∈ A_l} g_i + ((1 - a)/b) Σ_{x_i ∈ B_l} g_i)^2 / n_l^j(d) + (Σ_{x_i ∈ A_r} g_i + ((1 - a)/b) Σ_{x_i ∈ B_r} g_i)^2 / n_r^j(d) ]   (20)
A_l denotes the part of subset A to the left of split point d and A_r the part to the right; B_l and B_r are defined analogously for subset B.
Equation (20) differs from Equation (19) in that Equation (19) computes the exact variance gain V_{j|O}(d) on the full training set O of a fixed node, whereas Equation (20) is an estimate Ṽ_j(d) of the information gain computed only on the sampled subset A ∪ B. Because GOSS estimates the information gain on a much smaller data set, the amount of computation is greatly reduced. More importantly, the following theory shows that GOSS does not lose much training accuracy and outperforms random sampling.
Definition 2 (GOSS approximation error): Let the approximation error of GOSS be ε(d) = |Ṽ_j(d) - V_j(d)|, and let ḡ_l^j(d) and ḡ_r^j(d) denote the mean absolute gradients to the left and right of split point d. Then with probability at least 1 - δ:
ε(d) ≤ C_{a,b}^2 ln(1/δ) · max{1/n_l^j(d), 1/n_r^j(d)} + 2 D C_{a,b} √(ln(1/δ) / n)   (21)
where
V_j(d) denotes the true information gain on the data set;
A denotes the data subset consisting of the retained top-a large-gradient instances, and A^C is the complement of A;
C_{a,b} = ((1 - a)/√b) · max_{x_i ∈ A^C} |g_i|, and D = max(ḡ_l^j(d), ḡ_r^j(d)).
From the above theory, the following conclusions are drawn:
1) The asymptotic approximation ratio of GOSS is O(1/n_l^j(d) + 1/n_r^j(d) + 1/√n). If the split is not very unbalanced (i.e., n_l^j(d) ≥ O(√n) and n_r^j(d) ≥ O(√n)), the approximation error in Equation (21) is dominated by the second term, which tends to 0 as n tends to infinity (when the amount of data is large); i.e., the larger the data volume, the smaller the error and the higher the precision.
2) Random sampling is the special case of GOSS with a = 0. In most cases GOSS performs better than random sampling, namely when C_{0,β} > C_{a,β-a}, which is equivalent to α_a/√β > (1 - a)/√(β - a), where α_a = max_{x_i ∈ A ∪ A^C} |g_i| / max_{x_i ∈ A^C} |g_i|.
The generalization of GOSS is analyzed below. Consider the GOSS generalization error E_GOSS^gen(d) = |Ṽ_j(d) - V_*(d)|, which is the gap between the variance gain computed by GOSS on the sampled instances and the true variance gain on the underlying distribution. It can be decomposed as E_GOSS^gen(d) ≤ ε_GOSS(d) + |V_j(d) - V_*(d)|. Thus, where the GOSS approximation is accurate, the GOSS generalization error is close to that obtained with the full amount of real data. On the other hand, sampling increases the diversity of the base learners (since the data obtained may differ at each sampling), which tends to improve generalization.
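The sampling step of GOSS described above (retain the top a fraction of instances by gradient magnitude, randomly sample a b fraction of the rest, and re-weight the sampled small-gradient instances by (1 - a)/b) can be sketched as follows; this is an illustrative reimplementation under those assumptions, not LightGBM's internal code.

    import numpy as np

    def goss_sample(gradients, a=0.2, b=0.1, rng=None):
        # Return indices of the selected instances and the weights used for gain estimation.
        if rng is None:
            rng = np.random.default_rng(0)
        n = len(gradients)
        order = np.argsort(-np.abs(gradients))        # sort descending by |gradient|
        top_k, rand_k = int(a * n), int(b * n)
        top_idx = order[:top_k]                       # subset A: large-gradient instances
        rand_idx = rng.choice(order[top_k:], size=rand_k, replace=False)  # subset B
        idx = np.concatenate([top_idx, rand_idx])
        weights = np.ones(len(idx))
        weights[top_k:] = (1.0 - a) / b               # amplify the sampled small-gradient instances
        return idx, weights

    g = np.random.default_rng(2).normal(size=1000)
    idx, w = goss_sample(g)
    print(len(idx), "instances kept out of", len(g))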
2.2.2 Exclusive Feature Bundling
EFB is a method of reducing the feature dimension (in effect, a dimension-reduction technique) by bundling features, thereby improving computational efficiency. Typically, the bundled features are mutually exclusive (when one feature takes a non-zero value, the other is 0), so the two features can be bundled together without losing information. If two features are not completely mutually exclusive (in some cases both features take non-zero values), an index called the conflict ratio can be used to measure the degree of non-exclusiveness; when this value is small, two features that are not completely mutually exclusive can still be bundled without affecting the final accuracy. The algorithm steps of EFB are as follows:
1) sorting the features according to the number of non-0 values;
2) calculating the conflict ratio between different characteristics;
3) each feature is traversed and attempts are made to merge the features to minimize the collision ratio.
High-dimensional data are usually sparse, so a lossless method can be designed to reduce the dimensionality of the features. In particular, in a sparse feature space many features are mutually exclusive, e.g., they are never simultaneously non-zero. Mutually exclusive features can be bundled into a single feature, and by carefully designing the bundling algorithm, a feature histogram identical to those of the individual features can be constructed from the feature bundle. In this way the time complexity of histogram construction is reduced from O(#data × #feature) to O(#data × #bundle), and since #bundle << #feature, this greatly accelerates the training process of GBDT without losing accuracy. There are, however, two problems:
1) how to decide which features need to be bundled together;
2) how to construct the bundled features.
Theoretically, achieving the optimal feature bundling is an NP-hard (Non-deterministic Polynomial-time hard) problem. This means that it is not possible to find an exact solution in polynomial time. The exclusive feature bundling problem is addressed as shown in Algorithm 1.
[Algorithm 1: greedy feature bundling (pseudocode figure in the original).]
First, a graph with weighted edges is constructed, whose edge weights correspond to the total conflict rate between features. Second, the features are sorted in descending order by their degrees in the graph. Finally, each feature in the ordered list is examined and either assigned to an existing bundle with which it has a low conflict rate or used to create a new bundle. The time complexity of Algorithm 1 is quadratic in the number of features, and it is processed only once before training. This complexity is acceptable when the number of features is not very large, but it becomes significant when faced with millions of features. To further improve efficiency, a more efficient sorting strategy that does not build the graph is provided: sorting by the number of non-zero values. This is similar to sorting by degree, since more non-zero values usually imply a higher chance of conflict. Because only the ordering policy changes, the details of the new algorithm are omitted to avoid repetition.
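A simplified sketch of the greedy bundling idea just described, using the graph-free ordering by non-zero counts and a conflict-count threshold; it is an illustrative approximation of Algorithm 1, not the original pseudocode.

    import numpy as np

    def greedy_bundle(X, max_conflict_rate=0.05):
        # Group features whose non-zero patterns rarely overlap (i.e. rarely conflict).
        n_samples, n_features = X.shape
        nonzero = X != 0
        order = np.argsort(-nonzero.sum(axis=0))      # sort by number of non-zero values
        max_conflicts = int(max_conflict_rate * n_samples)
        bundles, bundle_masks = [], []
        for j in order:
            placed = False
            for k, mask in enumerate(bundle_masks):
                conflicts = int(np.sum(mask & nonzero[:, j]))
                if conflicts <= max_conflicts:        # low conflict: add feature j to bundle k
                    bundles[k].append(int(j))
                    bundle_masks[k] = mask | nonzero[:, j]
                    placed = True
                    break
            if not placed:                            # otherwise open a new bundle
                bundles.append([int(j)])
                bundle_masks.append(nonzero[:, j].copy())
        return bundles

    rng = np.random.default_rng(3)
    X = (rng.random((100, 8)) < 0.15) * rng.random((100, 8))   # sparse synthetic features
    print(greedy_bundle(X))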
For the second problem, to reduce the training complexity, a good method is needed to merge two features that should be bundled. The key is to ensure that the values of the original features can be identified from the value of the bundled feature. Because histogram-based algorithms store discrete bins rather than continuous feature values, the bundled feature can be constructed by letting the mutually exclusive features occupy different bins, which is done by adding offsets to the original feature values. For example, assume two features are to be combined into one bundle: feature A takes values in [0, 10) and feature B takes values in [0, 20). An offset of 10 is added to feature B, so that its values fall in [10, 30). Features A and B can then be merged into one feature with range [0, 30) that replaces the original A and B.
The Algorithm details are embodied in Algorithm 2.
[Algorithm 2: merging exclusive features (pseudocode figure in the original).]
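To make the offset example above concrete, a minimal sketch of merging mutually exclusive features with cumulative offsets is given below; the function and the explicit range list are assumptions of this illustration, not the original Algorithm 2.

    import numpy as np

    def merge_exclusive_features(features, ranges):
        # Merge (nearly) mutually exclusive features into one feature using cumulative offsets.
        merged = np.zeros(len(features[0]))
        offset = 0.0
        for f, upper in zip(features, ranges):
            f = np.asarray(f, dtype=float)
            nz = f != 0
            merged[nz] = f[nz] + offset        # shift this feature into its own value range
            offset += upper                    # the next feature occupies [offset, offset + its range)
        return merged

    a = np.array([0, 3, 0, 7, 0], dtype=float)     # feature A, values in [0, 10)
    b = np.array([5, 0, 12, 0, 0], dtype=float)    # feature B, values in [0, 20)
    print(merge_exclusive_features([a, b], ranges=[10, 20]))   # B's non-zeros land in [10, 30)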
The EFB algorithm can bundle many mutually exclusive features into a small number of dense features, which avoids unnecessary computation on zero feature values. In practice, histogram-based algorithms can also be optimized by using a table to record the non-zero values of each feature and ignoring the zero-valued entries. By scanning only the data in this table, the time complexity of histogram construction is reduced from O(#data) to O(#non_zero_data). However, extra memory and effort are required to maintain and update the table during the whole tree-building process. This optimization is implemented in LightGBM as a basic function. Notably, it does not conflict with the EFB algorithm, since it can still be used when the bundled features remain sparse.
2.3 Stacked Generalization
Ensemble learning is a machine learning paradigm. In ensemble learning, multiple models (often referred to as "weak learners") are trained to solve the same problem and combined to achieve better results. The most important assumptions are: when weak models are combined correctly, a more accurate or robust model can be obtained. In most cases, these basic models themselves do not perform very well, either because they have a high bias (e.g., low-degree-of-freedom models) or because their variance is too large resulting in a less robust (e.g., high-degree-of-freedom models). The idea of the integration method is to create a strong learner (or "integrated model") by combining the biases or variances of these weak learners to achieve better performance. Weak learners are generally combined mainly by the following three methods:
1) bagging, which generally considers homogeneous weak learners, learns these weak learners in parallel independently of each other and combines them according to some deterministic averaging process.
2) Boosting, which is also a common consideration for homogeneous weak learners. It learns the weak learners sequentially in a highly adaptive way and combines them according to some deterministic strategy.
3) Stacking, which generally considers heterogeneous weak learners, learns them in parallel and combines them by training a meta-model that outputs the final prediction based on the predictions of the different weak models.
There are two main differences between Stacking and Bagging/Boosting. First, Stacking generally considers heterogeneous weak learners (different learning algorithms are combined), while Bagging and Boosting mainly consider homogeneous weak learners. Second, Stacking combines the base models by learning a meta-model, whereas Bagging and Boosting combine the weak learners according to deterministic algorithms. The idea of Stacking is to learn several different weak learners, combine them by training a "meta-model", and then output the final prediction based on the multiple predictions returned by these weak models. Therefore, to build a Stacking model, two things need to be defined: the L learners to be fitted and the meta-model that combines them. For example, for a classification problem, a KNN classifier, Logistic Regression, and an SVM may be selected as weak learners, and a neural network may be chosen as the meta-model. The neural network then takes the outputs of the three weak learners as inputs and returns a final prediction based on them. Therefore, to fit a Stacking ensemble consisting of L weak learners, the following steps must be followed:
1) dividing training data into two groups;
2) selecting L weak learners, fitting them to the first set of data;
3) causing each of the L learners to make a prediction of observed data in the second set of data;
4) a meta-model is fitted over the second set of data, using predictions made by the weak learner as input.
In the previous step the data set is split in two, because predictions on the data used to train a weak learner must not be used to train the meta-model. An obvious disadvantage of splitting the data set into two parts is that only half of the data is used to train the base models and the other half to train the meta-model. To overcome this limitation, a training method similar to k-fold cross-validation can be used. All the observations can then be used to train the meta-model: for any observation, the weak-learner predictions are made by instances of the weak learners trained on the k-1 folds that do not contain the observation considered, as shown in FIG. 1.
In other words, the learners are trained on k-1 folds in order to predict the remaining fold. By repeating this process iteratively, a prediction is obtained for every fold of the data. This yields a relevant prediction for each observation in the data set, and all of these predictions are then used to train the meta-model. The Stacking method thus trains a meta-model that produces the final output from the outputs returned by the lower-layer weak learners.
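A compact sketch of the stacked generalization procedure described above, in which out-of-fold predictions of two heterogeneous base learners become the inputs of a meta-model; the choice of base learners, the fold count, and the synthetic data are illustrative assumptions.

    import numpy as np
    from sklearn.model_selection import KFold
    from sklearn.linear_model import LogisticRegression
    from sklearn.ensemble import RandomForestClassifier

    def stacking_oof_features(models, X, y, n_splits=5):
        # Out-of-fold probability predictions of each base model become the meta-features.
        meta = np.zeros((len(X), len(models)))
        kf = KFold(n_splits=n_splits, shuffle=True, random_state=0)
        for train_idx, val_idx in kf.split(X):
            for m, model in enumerate(models):
                model.fit(X[train_idx], y[train_idx])                        # fit on k-1 folds
                meta[val_idx, m] = model.predict_proba(X[val_idx])[:, 1]     # predict held-out fold
        return meta

    rng = np.random.default_rng(4)
    X = rng.normal(size=(500, 15))
    y = (X[:, 0] + X[:, 1] ** 2 > 1).astype(int)

    base_models = [LogisticRegression(max_iter=1000), RandomForestClassifier(n_estimators=50)]
    meta_X = stacking_oof_features(base_models, X, y)
    meta_model = LogisticRegression().fit(meta_X, y)   # meta-model trained on base predictions
    print("meta-model training accuracy:", meta_model.score(meta_X, y))

For predicting new data, the base models would finally be refit on the full training set before producing the meta-features of the test set.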
3 proposed Link prediction method
3.1 proposed Link prediction Algorithm (LLSLP)
Link prediction for social networks is considered a binary classification problem and takes into account 15 similarity indicators per two nodes, namely CN, Sal, Jac, Sor, HPI, HDI, LHN-I, PA, AA, RA, LP, Katz, ACT, Cos, RWR. First, the similarity index is considered a characteristic of any two nodes in the network. Then, Logistic Regression and LightGBM were selected as basic models. And finally, introducing a Stacking idea, and relearning the prediction result of the basic model to obtain a better prediction result.
3.1.1 partitioning node pairs
Considering a social network with n nodes, there are n(n - 1)/2 node pairs. A data set of all node pairs in the network is constructed, including the feature set F and the category set C. First, stratified sampling is adopted to divide all node pairs into an original training set and an original test set at a ratio of 8:2.
3.1.2 construction of training set and test set
In the original training set and the original test set, the 15 similarity indices (CN, Sal, Jac, Sor, HPI, HDI, LHN-I, PA, AA, RA, LP, Katz, ACT, Cos, RWR) are computed separately for each node pair (n_x, n_y), and these 15 similarity indices are taken as 15 different features between nodes n_x and n_y, which yields the feature set F for all node pairs. In the original network, a node pair is labeled class 1 if the two nodes are connected and class 0 otherwise; this yields the category set C of the network node pairs. Finally, the feature set and the category set are combined to obtain the training set D_train and the test set D_test.
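Purely as an illustration (the patent does not prescribe tooling), a few of the neighbor-based similarity indices used as node-pair features, namely CN, Jaccard, AA, RA, and PA, could be computed with networkx as sketched below; the graph and the function name are assumptions of the example.

    import math
    import networkx as nx

    def pair_features(G, u, v):
        # A subset of the similarity indices used as features for the node pair (u, v).
        nu, nv = set(G[u]), set(G[v])
        common = nu & nv
        return {
            "CN": len(common),                                        # common neighbors
            "Jac": len(common) / len(nu | nv) if nu | nv else 0.0,    # Jaccard coefficient
            "AA": sum(1.0 / math.log(G.degree(z)) for z in common if G.degree(z) > 1),
            "RA": sum(1.0 / G.degree(z) for z in common),             # resource allocation
            "PA": len(nu) * len(nv),                                  # preferential attachment
        }

    G = nx.karate_club_graph()
    print(pair_features(G, 0, 33))   # the label would be 1 if (0, 33) is an edge, 0 otherwise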
3.1.3 unbalance problem
In a classification task, a data set is said to be "class-imbalanced" when the numbers of samples of the different classes differ greatly. Obviously, the links in a network are sparse: for the nodes in a network, the number of node pairs with a connecting edge is much smaller than the number of node pairs without one. Meanwhile, node pairs with connecting edges, i.e., the minority class, are usually of more concern in link prediction. Thus, the classes in the training set and in the test set are imbalanced. In machine learning, learning from imbalanced samples is prone to overfitting, so that the generalization ability of the model is poor and the prediction becomes meaningless. For imbalanced data, in order not to change the original data distribution, a cost-sensitive learning strategy is used herein. Cost-sensitive learning assigns a higher misclassification cost to minority-class samples and a smaller misclassification cost to majority-class samples. In this way, the importance of the minority-class samples is raised during the training of the learner, which relieves the classifier's preference for the majority class. Cost-sensitive learning is briefly introduced below, taking Logistic Regression as an example.
The maximum likelihood objective is known from Equation (8); the corresponding loss is:
J(w) = -Σ_{i=1}^{N} [y_i ln p(x_i) + (1 - y_i) ln(1 - p(x_i))]   (22)
Minimizing the sample prediction error then means minimizing J(w). Under the cost-sensitive premise, positive- and negative-sample weights [α, β] are added before taking derivatives, and Equation (22) becomes:
J(w) = -Σ_{i=1}^{N} [α y_i ln p(x_i) + β (1 - y_i) ln(1 - p(x_i))]   (23)
Differentiation gives:
∂J(w)/∂w_j = -Σ_{i=1}^{N} [α y_i + (β - α) y_i p(x_i) - β p(x_i)] x_{ij}   (24)
Distinguishing the cases y_i = 1 and y_i = 0:
∂J(w)/∂w_j = -Σ_{y_i = 1} α (1 - p(x_i)) x_{ij} + Σ_{y_i = 0} β p(x_i) x_{ij}   (25)
w_j is iterated until convergence:
w_j := w_j + μ [α y_i + (β - α) p(x_i) y_i - β p(x_i)] x_{ij}   (26)
In summary, the positive- and negative-sample weights [α, β] amplify the cost of misjudging a particular class. For a binary classification model, the weights are set according to the proportions of positive and negative samples:
α = N / (2 N_+), β = N / (2 N_-)   (27)
where N_+ and N_- denote the numbers of positive and negative samples. The more general form is:
w_c = N / (n_classes × N_c)   (28)
where n_classes is the number of sample classes and N_c is the number of samples of class c.
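A short sketch of the class-weighting rule of Equations (27)-(28), w_c = N/(n_classes × N_c), together with one common way of applying such cost-sensitive weights in logistic regression; the scikit-learn usage and the synthetic imbalanced data are assumptions of this illustration.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def balanced_class_weights(y):
        # w_c = N / (n_classes * N_c): larger misclassification cost for the minority class.
        classes, counts = np.unique(y, return_counts=True)
        weights = len(y) / (len(classes) * counts)
        return dict(zip(classes.tolist(), weights.tolist()))

    rng = np.random.default_rng(5)
    X = rng.normal(size=(2000, 15))
    y = (rng.random(2000) < 0.05).astype(int)          # roughly 5% positive links: heavily imbalanced

    weights = balanced_class_weights(y)                 # minority class receives the larger weight
    model = LogisticRegression(class_weight=weights, max_iter=1000).fit(X, y)
    print("class weights:", weights)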
3.1.4 The proposed LLSLP algorithm
After the training set D_train and the test set D_test are obtained and the solution to the class imbalance is determined, D_train and D_test are each fed into the first learning layer for learning. This layer contains 2 base learners, LR and LightGBM, and cross-validation, grid search, and early stopping are used to determine the hyper-parameters of the models; this yields the fused features that the 2 base learners produce from the 15 traditional similarity indices. The 2 fused indices learned by the base learners are then combined to construct a new training set and a new test set, which are fed into the second learning layer for learning. Layer 2 contains only one Meta-Classifier, an LR model, and cross-validation, grid search, and early stopping are likewise used to determine its hyper-parameters during learning. Finally, the trained Meta-Classifier model h(x) = h'(h_1(x), h_2(x), ..., h_T(x)) is used to predict the new test set and obtain the final prediction result FinalPredictionLabel: if FinalPredictionLabel is greater than or equal to the preset result threshold, the two nodes form an abnormal link; if FinalPredictionLabel is smaller than the preset result threshold, the two nodes form a normal link. The details of the proposed algorithm are shown in Algorithm 3.
[Algorithm 3: the LLSLP algorithm (pseudocode figure in the original).]
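A condensed, illustrative sketch of the two-layer pipeline outlined above: the LR and LightGBM base learners produce link-probability features, an LR Meta-Classifier relearns them, and the final label is obtained by thresholding. The libraries (scikit-learn and the lightgbm package), the parameter grids, and the 0.5 threshold are assumptions of the example; the early stopping used in the original Algorithm 3 is omitted here for brevity.

    import numpy as np
    from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_predict
    from sklearn.linear_model import LogisticRegression
    from lightgbm import LGBMClassifier

    def llslp_predict(X_train, y_train, X_test, threshold=0.5):
        cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
        # Layer 1: two heterogeneous base learners with cost-sensitive class weights.
        lr = GridSearchCV(LogisticRegression(class_weight="balanced", max_iter=1000),
                          {"C": [0.1, 1.0, 10.0]}, cv=cv)
        gbm = GridSearchCV(LGBMClassifier(class_weight="balanced", n_estimators=200),
                           {"num_leaves": [15, 31]}, cv=cv)
        # Out-of-fold link probabilities become the fused features of the second layer.
        meta_train = np.column_stack([
            cross_val_predict(m, X_train, y_train, cv=cv, method="predict_proba")[:, 1]
            for m in (lr, gbm)])
        meta_test = np.column_stack([
            m.fit(X_train, y_train).predict_proba(X_test)[:, 1] for m in (lr, gbm)])
        # Layer 2: a single LR Meta-Classifier relearns the base predictions.
        meta = LogisticRegression(class_weight="balanced", max_iter=1000).fit(meta_train, y_train)
        prob = meta.predict_proba(meta_test)[:, 1]
        labels = (prob >= threshold).astype(int)    # >= threshold: abnormal link, else normal link
        return labels, prob

    rng = np.random.default_rng(6)
    X = rng.normal(size=(1000, 15))
    y = (rng.random(1000) < 0.1).astype(int)
    labels, prob = llslp_predict(X[:800], y[:800], X[800:])
    print(labels[:10], prob[:5])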
3.2 Link prediction model construction
In order to obtain a better prediction effect, base models with large differences are selected. Logistic Regression is a linear computational model, while LightGBM is a tree model; an algorithm that integrates the two has better accuracy and generalization. Logistic Regression and LightGBM are used as the two base models to train on the training set, and 5-fold cross-validation, grid search, and early stopping are adopted to determine the hyper-parameters of the base models. After the base models are trained, the Stacking method is introduced to integrate them: the probabilities of link existence and non-existence predicted by Logistic Regression and LightGBM are taken as new features. Since the effectiveness of Stacking comes mainly from feature extraction, and representation learning is always accompanied by an overfitting risk (the features of the second layer come from learning on the first-layer data), the original features should not be included among the second-layer features, so as to reduce the risk of overfitting. Also to reduce overfitting, the second-layer classifier should be a simpler classifier, and a generalized linear model such as Logistic Regression is a good choice. Complex nonlinear transformations have already been used during feature extraction (i.e., the learning of the first layer), so it is not appropriate to use an overly complex classifier in the output layer; this is similar to the activation function or output layer of a neural network, both of which are simple functions that control complexity. In addition, Logistic Regression with L1 regularization can select effective features and remove unnecessary models from the first-layer base models, saving computation and further preventing overfitting, and its output can be interpreted as a probability, which suits this classification task. In summary, Logistic Regression is selected as the Meta-Classifier and retrained on the newly learned features to determine the final prediction result. The LLSLP model framework is shown in FIG. 2.
4 results and analysis of the experiments
4.1 data set
To fully evaluate the effectiveness of the proposed LLSLP method for link prediction, 10 real networks from various domains were used in the experiments. UPG is a power distribution network; YST is a biological network; KHN, SMG, NSC and GRQ are co-author networks in different research areas; HMT, FBK and ADV are social networks; EML is a network of individuals who share email. These networks were chosen to cover a wide range of attributes, including different sizes, average degrees, clustering coefficients, heterogeneity indices and imbalance ratios IR (imbalance ratio), defined as the ratio of connected to non-connected edges. The structural characteristics of the networks used in the experiments are summarized in Table 1.
TABLE 1 data set
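As an illustration of how the structural characteristics summarized in Table 1 (size, average degree, clustering coefficient and the imbalance ratio IR defined above) could be computed for any of the 10 networks, a sketch assuming NetworkX is available; the edge-list path is a placeholder.

```python
import networkx as nx


def network_summary(edge_list_path):
    """Structural statistics of the kind summarized in Table 1 (illustrative)."""
    G = nx.read_edgelist(edge_list_path)
    n, m = G.number_of_nodes(), G.number_of_edges()
    avg_degree = 2 * m / n
    clustering = nx.average_clustering(G)
    # IR: ratio of connected node pairs to non-connected node pairs.
    non_edges = n * (n - 1) // 2 - m
    ir = m / non_edges
    return {"nodes": n, "edges": m, "avg_degree": avg_degree,
            "clustering": clustering, "imbalance_ratio": ir}
```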
4.2 Evaluation of the proposed LLSLP link prediction model
Because the proportions of existing links and non-existing links between network nodes are unbalanced, the final link prediction cannot be measured only by the accuracy of a single prediction. In order to evaluate the link prediction model established in the first three steps, 7 indexes, such as AUC and Recall, are used to test the performance of the model. AUC, Recall and Precision are common indicators for evaluating classification problems. For data with unbalanced sample classes, the model is additionally measured using the Confusion Matrix, the Precision-Recall Curve, F1-score and MCC (Matthews correlation coefficient). The Confusion Matrix allows the prediction results of the model to be observed intuitively and concretely. MCC is a correlation coefficient between -1 and +1 that is generally considered a balanced metric and can be used even when the class sizes differ widely. The Precision-Recall Curve and F1-score together reflect the relationship between Precision and Recall. Therefore, these 4 additional indexes are considered herein when evaluating the LLSLP method.
4.3 evaluation index
7 indicators were used in the experiment to evaluate the performance of the LLSLP link prediction algorithm. They are defined as follows:
1) AUC (area Under the receiver operating characteristic curve) is a metric that takes into account the overall ranking results. Generally, the AUC of a link prediction is defined as:
AUC = (n1 + 0.5 × n2) / n

where n is the number of independent comparisons, n1 is the number of times the missing link scores higher than the non-existent link, and n2 is the number of times the two scores are equal.
2) In the field of machine learning, and in particular statistical classification problems, a confusion matrix (also known as an error matrix) is a specific table layout that can visualize the performance of an algorithm, typically an algorithm with supervised learning (in unsupervised learning, commonly referred to as a matching matrix). Each row of the matrix represents an instance in the prediction class and each column represents an instance in the actual class. Taking the second classification as an example:
the predictive classification model corresponds to a confusion matrix, in which the larger the values of TP (True Positive) and TN (True Negative) and the smaller the values of FP (False Positive) and FN (False Negative), the better the model performs. On the basis of these statistics, the confusion matrix can be extended to the 5 indexes in FIG. 3:
3) precision ratio (Precision) — indicates how many of the samples predicted to be positive are true positive samples.
Precision = TP / (TP + FP)
4) Recall: indicates how many of the actual positive samples are predicted correctly.

Recall = TP / (TP + FN)
5) F1-score-results combining Precision and Recall outputs. The value of F1-score ranges from 0 to 1, with 1 representing the best output of the model and 0 representing the worst output of the model.
F1-score = (2 × Precision × Recall) / (Precision + Recall)
6) Matthews Correlation Coefficient (MCC): an evaluation index that takes the numbers of all sample types into account and can be used under both class balance and class imbalance.

MCC = (TP × TN - FP × FN) / sqrt((TP + FP)(TP + FN)(TN + FP)(TN + FN))
7) Precision-Recall Curve: Precision is plotted on the y-axis and Recall on the x-axis. Precision and Recall influence each other; ideally both are high, and the higher both are, the more effective the model. In general, however, when Precision is high the Recall tends to be low, and when Recall is high the Precision tends to be low.
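A compact sketch showing how the 7 evaluation indexes above could be computed from a model's predicted probabilities, assuming scikit-learn; the function and variable names are placeholders. In the experiments these indexes would be computed per algorithm and per data set to fill Tables 5 to 9.

```python
import numpy as np
from sklearn.metrics import (confusion_matrix, f1_score, matthews_corrcoef,
                             precision_recall_curve, precision_score,
                             recall_score, roc_auc_score)


def evaluate(y_true, y_prob, threshold=0.5):
    y_prob = np.asarray(y_prob)
    y_pred = (y_prob >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    p_curve, r_curve, _ = precision_recall_curve(y_true, y_prob)
    return {
        "AUC": roc_auc_score(y_true, y_prob),          # ranking quality
        "Confusion": (tp, fp, fn, tn),                 # TP, FP, FN, TN counts
        "Precision": precision_score(y_true, y_pred),  # TP / (TP + FP)
        "Recall": recall_score(y_true, y_pred),        # TP / (TP + FN)
        "F1": f1_score(y_true, y_pred),                # harmonic mean of P and R
        "MCC": matthews_corrcoef(y_true, y_pred),      # usable under class imbalance
        "PR-curve": (p_curve, r_curve),                # points for the PR curve
    }
```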
4.4 Reference algorithms
In this part, 15 similarity indexes based on the network topology are selected for comparison with LLSLP; the similarity indexes used are based on local information, on global information and on random walks, respectively. Tables 2 to 4 list the similarity indexes based on local node information, on global node information and on random walks, respectively.
TABLE 2 similarity index based on node local information
TABLE 3 Global information based similarity index
TABLE 4 similarity index based on random walks
In Tables 2 to 4, Γ(x) denotes the neighbor node set of node x, Γ(y) denotes the neighbor node set of node y, Γ(x) ∩ Γ(y) denotes the set of common neighbor nodes of x and y, and k(x) = |Γ(x)| denotes the degree of node x.
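For illustration, a sketch of three representative local-information similarity indexes (CN, Jaccard and AA) written with the Γ(x) and k(x) notation above, assuming NetworkX; it covers only a subset of the 15 indexes used in the experiments.

```python
import math

import networkx as nx


def local_similarity(G, x, y):
    """CN, Jaccard and AA scores for the node pair (x, y); Γ(·) is the neighbour set."""
    gx, gy = set(G.neighbors(x)), set(G.neighbors(y))
    common = gx & gy                        # Γ(x) ∩ Γ(y)
    cn = len(common)                        # Common Neighbours
    union = gx | gy
    jaccard = cn / len(union) if union else 0.0
    aa = sum(1.0 / math.log(G.degree(z)) for z in common if G.degree(z) > 1)
    return {"CN": cn, "Jaccard": jaccard, "AA": aa}
```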
4.5 Experimental results and analysis
The experimental results herein are all the performance of the models on the test set. Each column of a table corresponds to one of the data sets, and each row corresponds to the proposed LLSLP or to one of the remaining 17 comparison algorithms and models. Owing to space constraints, AUC, Precision, Recall, F1-score and MCC were chosen as the main evaluation indicators, while the remaining evaluation indicators of the proposed LLSLP on the experimental data sets are shown in the appendix. The values of AUC, Precision, Recall and F1-score all lie in [0, 1], and a higher value indicates a better effect of the model on the data set. MCC lies in [-1, 1]: a value of 1 indicates that the model's predictions are completely consistent with the actual results on the data set, a value of 0 indicates that the predictions are equivalent to random predictions, and a value of -1 indicates that the predictions are completely inconsistent with the actual results. The AUC, Precision, Recall, F1-score and MCC results on the data sets under the different algorithms are shown in Tables 5 to 9.
TABLE 5 AUC values with algorithms
TABLE 6 Precision values with algorithms
TABLE 7 Recall values with algorithms
Table 8.F1-score values with algorithms
TABLE 9 MCC values with algorithms
Next, the evaluation indexes of the proposed LLSLP and the comparison algorithms are analyzed on each data set. As can be seen from Table 5, the AUC value of LLSLP ranks in the top 2 on every data set listed here, and it always performs better than the traditional algorithms. The AUC value on the UPG network is the highest, 0.9998, which is superior to traditional comparison algorithms and models such as CN, Sal and LightGBM; the base model LR also performs well and is close to the proposed LLSLP method. Among the traditional algorithms the best performer is Cos, with an AUC of 0.77197, so the AUC of the proposed LLSLP method is 29.536% higher than that of Cos. The base model LightGBM performs relatively poorly, with an AUC of 0.63459, much smaller than the proposed LLSLP and the other base model LR; evidently the prediction of the base model LightGBM on the UPG data set is not very accurate. This does not mean that LightGBM is a bad model: on the NSC data set, for example, the prediction performance of the base model LightGBM is better than that of every algorithm and model other than the proposed LLSLP. It can be seen that the base models LR and LightGBM differ in performance on different data sets, whereas the proposed LLSLP still performs well across the various data sets, indicating better stability.
Next, the Precision, Recall and F1-score of each algorithm and model on the experimental data sets are analyzed from Tables 6, 7 and 8. LLSLP is superior to the comparison algorithms and models in overall performance, and its Precision value on most data sets is higher than those of the traditional algorithms and models. It should be noted, however, that it does not perform best on every experimental data set: the Precision value on the NSC data set is lower than those of the traditional CN and AA algorithms (largely because the base model LR behaves too poorly there); the Precision values on the EML and FBK data sets are lower than that of the base model LR; the Precision value on the SMG data set is lower than that of CN; and the performance on the KHN data set is inferior to the traditional algorithm CN and the base model LightGBM. The Recall values of the proposed LLSLP on the experimental data sets are high overall but still have certain shortcomings. As can be seen from Table 7, the overall performance of the proposed LLSLP on each data set is better than that of the traditional algorithms CN, AA, LP, RWR, etc.: the Recall values of the traditional algorithms average about 50%, whereas the lowest Recall value of LLSLP exceeds 90%. It is also observed that the Recall value of LLSLP is close to that of the base models, and it must be acknowledged that the high Recall of LLSLP benefits to some extent from the 2 base models (Logistic Regression and LightGBM). In general, Precision and Recall are negatively correlated: the larger the Precision value, the smaller the Recall value, and vice versa. For different data sets and different practical requirements, an algorithm or model cannot be judged by the Precision value or the Recall value alone; F1-score combines the two to measure performance comprehensively. As shown in Table 8, the F1-score of LLSLP ranks in the top 2 on the experimental data sets, which demonstrates the overall good performance of the proposed LLSLP. It is also noted that in binary classification a classifier may have a high F1-score but a low MCC, which shows that a single index cannot fully measure all the advantages and disadvantages of a classifier.
Finally, Table 9 is analyzed. The problem of class imbalance in a data set cannot be completely solved in machine learning: since a machine learning model learns from data, imbalanced data causes the trained model to develop a preference for the majority classes, making the classes with little data difficult to identify. For evaluating model performance on extremely unbalanced social network data sets, MCC shows good discrimination: the overall performance of the traditional similarity algorithms is relatively poor when the classes are unbalanced, whereas the LLSLP proposed herein obtains relatively good performance even under extreme class imbalance, which demonstrates that the proposed LLSLP is effective.
From fig. 4 to 6, a series of results and conclusions can be observed.
(1) As can be seen from FIG. 4: the ROC curve of the proposed LLSLP is comparable to those of most algorithms, and only a few algorithms, PA, Katz, ACT and LightGBM, can be clearly distinguished (from Table 5 it can be observed that on the FBK data set the area under the ROC curve is 0.84832 for PA, 0.52412 for Katz, 0.83605 for ACT and 0.80372 for LightGBM, while it is above 0.90000 for the remaining algorithms and models). Relying on the ROC curve alone is not enough to judge the degree of difference between LLSLP and the traditional algorithms and models, so performance evaluation needs to be carried out with other indexes as well.
(2) From FIG. 5 it can be seen that the PR graph clearly shows the proposed LLSLP to be better than most of the traditional algorithms such as CN, Sal, RA, Cos and RWR, although the difference from the base models LR and LightGBM is not large; the PR graph has better discrimination, but it is of limited use for distinguishing the base models. It is worth mentioning that the traditional algorithms HPI, PA, LHN-I, ACT, Cos and RWR have high AUC values (from Table 5 it can be observed that HPI is 0.98592, LHN-I is 0.96262, Cos is 0.97020 and RWR is 0.99126), but perform poorly on the PR curve, because the Recall value is low when the Precision value is high, or the Precision value is low when the Recall value is high (from Tables 6 and 7 it can be observed that on the FBK data set the Precision values of HPI, LHN-I, Katz, Cos and RWR are 0.11996, 0.03803, 0.00138, 0.00632 and 0.01471 respectively, while the Recall values are 0.74376, 0.80827, 0.60511, 0.97767 and 0.98744 respectively). This leads to low F1-score values (from Table 8, the F1-score values of HPI, LHN-I, ACT, Cos and RWR on the FBK data set are 0.20615, 0.07264, 0.00276, 0.01255 and 0.02889 respectively). Note that the PR profiles of PA and ACT are also poor because of their low Precision and Recall values (the Precision values of PA and ACT are 0.09422 and 0.00022 respectively, and the Recall values are 0.19520 and 0.01866 respectively), resulting in low F1-score values of 0.12709 and 0.00043 respectively. This shows that F1-score is indeed effective in combining Precision and Recall and evaluates the performance of algorithms and models more objectively. It further verifies that evaluating an algorithm or model by a single index alone is not comprehensive and cannot fully measure its performance, so measuring with multiple evaluation indexes is both necessary and comprehensive.
(3) To address the limitation of the PR graph, the confusion matrix diagrams of FIG. 6 show the specific classification results of each algorithm and model completely, and reflect the performance of an algorithm or model more concretely and comprehensively. Lighter colors in the figure represent larger numbers, and darker colors represent smaller numbers; the confusion matrix diagram is thus a good complement to the PR graph. It can be seen from FIG. 6 that the number of samples misclassified by the proposed LLSLP is lower than those of the traditional algorithms and the base models, i.e., it performs better. Defining the deviation percentage as the absolute difference between the two kinds of misclassification counts divided by the larger of the two, it can be observed that in the erroneous results of the base models LR and LightGBM an obvious bias towards one class appears: the former tends to predict samples as the positive class and the latter as the negative class (from FIG. 6, LR predicts class 0 as class 1 4455 times and class 1 as class 0 157 times, a deviation percentage of 96.476%, while LightGBM predicts class 0 as class 1 569 times and class 1 as class 0 3451 times, a deviation percentage of 83.512%). It can also be observed from FIG. 6 that the bias of the other traditional algorithms on the FBK data set is even more serious: CN predicts class 0 as class 1 1342 times and class 1 as class 0 14670 times, a deviation percentage of 90.852%; LHN-I predicts class 0 as class 1 359469 times and class 1 as class 0 3370 times, a deviation percentage of 99.063%; Cos and RWR deviate even more severely, with Cos predicting class 0 as class 1 2760377 times and class 1 as class 0 35 times (deviation percentage 99.999%), and RWR predicting class 0 as class 1 1174532 times and class 1 as class 0 45 times (deviation percentage 99.996%). The proposed LLSLP is more balanced and does not produce an excessive bias towards either class (it predicts class 0 as class 1 2103 times and class 1 as class 0 1500 times, a deviation percentage of only 28.673%), which again shows that the proposed LLSLP is more effective than the other traditional algorithms and models and combines the advantages of each base model well.
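A small sketch of the deviation-percentage calculation used above, checked against the LR misclassification counts on the FBK data set (4455 and 157); the function name is illustrative.

```python
def deviation_percentage(a, b):
    """Absolute difference between the two misclassification counts, divided by the larger one."""
    return 100.0 * abs(a - b) / max(a, b)


# LR on the FBK data set: class 0 predicted as class 1 4455 times,
# class 1 predicted as class 0 157 times.
print(round(deviation_percentage(4455, 157), 3))  # 96.476
```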
From the analysis on the experimental data sets, the performance of the LLSLP proposed herein is improved to some extent relative to the comparison algorithms and models used in the experiments, and as the data sets grow larger, the overall performance of the model also improves. It must be acknowledged, of course, that although the proposed LLSLP achieves good performance on every evaluation index, there is still some room for improvement; it is not difficult to see from the overall experimental results that the performance of LLSLP depends to some extent on the performance of the 2 base models (Logistic Regression and LightGBM), so the choice of base models is particularly important. Moreover, as the number of base models grows, the complexity of the model increases and the running time becomes longer. It should be noted that even though the performance of LLSLP is affected by its base models, with a certain number of base models the performance of LLSLP is not affected much when one of them performs poorly on a particular data set, which is an expression of its stability.
The social network link prediction method (LLSLP) proposed herein is an improvement over other existing algorithms. Traditional similarity-based social network link prediction algorithms focus on a single similarity index. In order to better integrate the existing similarity indexes and thereby further improve the stability and accuracy of link prediction, 15 similarity indexes are combined herein; these 15 similarity indexes describe different characteristics of the network. The proposed LLSLP method not only integrates the indexes but also exploits the Logistic Regression and LightGBM models: taking Logistic Regression and LightGBM as base models and the 15 indexes as the features to be learned, the hyper-parameters of the base models are determined by Cross-validation, Grid searching and Early stopping; then, using the Stacking idea of model integration with Logistic Regression as the Meta-Classifier, a new link prediction method is obtained. The method considers the complementarity of different similarity indexes and of different models, so its stability is stronger. To demonstrate its effectiveness and feasibility, 10 networks are finally taken as examples, and 7 indexes such as AUC and Precision are used to verify the rationality, effectiveness, reliability and stability of the LLSLP method relative to single indexes and models. The main contributions herein are the integration of the existing similarity indexes and the introduction of the Stacking idea into model integration for link prediction in social networks.
In the future, exploring other algorithms, more base models and different types of model integration strategies to design new link prediction methods will have important theoretical and practical significance for link prediction in social networks.
While embodiments of the invention have been shown and described, it will be understood by those of ordinary skill in the art that: various changes, modifications, substitutions and alterations can be made to the embodiments without departing from the principles and spirit of the invention, the scope of which is defined by the claims and their equivalents.

Claims (3)

1. A social network link abnormity prediction method based on stack generalization and cost sensitive learning is characterized by comprising the following steps:
s1, obtaining social network node data, and taking similarity indexes in the obtained social network node data as characteristics of basic model learning;
s2, determining the hyper-parameters of the basic model;
s3, relearning the prediction result of the base model; and obtaining a final prediction result.
2. The method for predicting link abnormality of social network based on stacked generalization and cost-sensitive learning according to claim 1, wherein the base model in step S1 comprises:
given a data set D = {(x1, y1), (x2, y2), (x3, y3), …, (xN, yN)}, wherein xi ∈ X ⊆ R^n and yi ∈ {0, 1}; when yi = 0, yi represents the negative class; when yi = 1, yi represents the positive class; i = 1, 2, 3, …, N; X ⊆ R^n represents the sample feature space, n represents the number of features of each sample, and N represents the number of samples in the data set D;
since the value of w^T x + b is continuous, where w represents a column vector of dimension (n, 1), T represents transposition, x represents a column vector of dimension (n, 1) and b represents a column vector of dimension (1, 1), it cannot fit a discrete variable directly, but it can be considered to fit the conditional probability P(Y = 1|x); however, for w ≠ 0 (there is no solution when w equals the zero vector), the value of w^T x + b ranges over the real numbers R and does not satisfy the requirement that a probability take values between 0 and 1, so a generalized linear model is considered;
since the unit step function is not continuous, the log-odds (logistic) function is a typical surrogate function:

y = 1 / (1 + e^(-z))

thus, substituting z = w^T x + b, there is:

y = 1 / (1 + e^(-(w^T x + b)))
if y is the probability that x is a positive example, then 1 - y is the probability that x is a negative example; the ratio of the two is called the odds, i.e., the ratio of the probability that an event occurs to the probability that it does not occur; if the probability that the event occurs is P, the log odds is:

ln(P / (1 - P))
regarding y as the class posterior probability estimate P(Y = 1|x), the formula is rewritten as:

ln( P(Y = 1|x) / (1 - P(Y = 1|x)) ) = w^T x + b

P(Y = 1|x) = e^(w^T x + b) / (1 + e^(w^T x + b))
that is, the log odds of the output Y = 1 is expressed as a linear function of the input x, which is the logistic regression model; the closer the value of w^T x + b is to positive infinity, the closer the probability P(Y = 1|x) is to 1; therefore, the idea of logistic regression is to first fit a decision boundary and then establish the probabilistic relationship between the boundary and the classification, thereby obtaining the probabilities in the binary classification case;

after the mathematical form of the logistic regression model is determined, the remaining problem is how to solve for the parameters of the model; in statistics, the maximum likelihood estimation method is often used, i.e., a set of parameters is found such that the likelihood of the data is maximized under those parameters; let:

P(Y = 1|x) = p(x)

P(Y = 0|x) = 1 - p(x)
p(xi) denotes the probability that the ith sample, with known feature xi, belongs to the positive class (Y = 1);

yi is the binary class label given in the data set D, yi ∈ {0, 1}, i = 1, 2, …, N;
the likelihood function is L(w) = ∏(i=1..N) [p(xi)]^yi · [1 - p(xi)]^(1-yi); for more convenient solution, taking logarithms on both sides gives the log-likelihood function:

ln L(w) = Σ(i=1..N) [ yi·ln p(xi) + (1 - yi)·ln(1 - p(xi)) ]
in machine learning there is the concept of a loss function, which measures the degree of the model's prediction error; taking the average log-likelihood loss over the entire data set, one obtains:

J(w) = -(1/N) · Σ(i=1..N) [ yi·ln p(xi) + (1 - yi)·ln(1 - p(xi)) ]
wherein N represents the number of samples in the data set D;
that is, in the logistic regression model, maximizing the likelihood function and minimizing the loss function are in fact equivalent;
there are many methods for solving logistic regression, and here, a gradient descent method is mainly used; the main objective of the optimization is to find a direction towards which the parameter moves so that the value of the loss function can be reduced, which direction is often found by various combinations of first order partial derivatives or second order partial derivatives; the loss function of the logistic regression is:
Figure FDA0002652022060000031
gradient descent finds the descent direction from the first-order derivative of J(w) with respect to w and updates the parameters in an iterative manner:

g_i^(k) = ∂J(w)/∂w_i evaluated at w = w^(k)

w_i^(k+1) = w_i^(k) - α · g_i^(k)

w_i^(k) represents the weight parameter after the kth iterative update of the ith weight parameter;

α represents the learning rate, i.e., the step size of one iterative parameter update;

w_i^(k+1) represents the weight parameter after the (k+1)th iterative update of the ith weight parameter;

w_i represents the ith weight parameter.
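A minimal sketch of the gradient-descent fitting procedure described in this claim, written with NumPy; the learning rate, iteration count and the absorption of b into w are illustrative choices rather than values fixed by the claim.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def fit_logistic_regression(X, y, alpha=0.1, iterations=1000):
    """Minimize the average log-likelihood loss J(w) by gradient descent."""
    N, n = X.shape
    Xb = np.hstack([X, np.ones((N, 1))])  # a column of 1s absorbs b into w
    w = np.zeros(n + 1)
    for _ in range(iterations):
        p = sigmoid(Xb @ w)               # p(x_i) = P(Y = 1 | x_i)
        grad = Xb.T @ (p - y) / N         # dJ(w)/dw for the loss defined above
        w = w - alpha * grad              # w^(k+1) = w^(k) - alpha * gradient
    return w


def predict_proba(X, w):
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    return sigmoid(Xb @ w)
```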
3. The social network link abnormality prediction method based on stacked generalization and cost-sensitive learning of claim 1, wherein in step S2, the method for determining the hyperparameter in the base model comprises one or any combination of cross validation, grid search and early-stop method.
CN202010873960.4A 2020-08-26 2020-08-26 Social network link abnormity prediction method based on stack generalization and cost sensitive learning Active CN112039700B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010873960.4A CN112039700B (en) 2020-08-26 2020-08-26 Social network link abnormity prediction method based on stack generalization and cost sensitive learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010873960.4A CN112039700B (en) 2020-08-26 2020-08-26 Social network link abnormity prediction method based on stack generalization and cost sensitive learning

Publications (2)

Publication Number Publication Date
CN112039700A true CN112039700A (en) 2020-12-04
CN112039700B CN112039700B (en) 2021-11-23

Family

ID=73580093

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010873960.4A Active CN112039700B (en) 2020-08-26 2020-08-26 Social network link abnormity prediction method based on stack generalization and cost sensitive learning

Country Status (1)

Country Link
CN (1) CN112039700B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20230300032A1 (en) * 2022-03-18 2023-09-21 The Mitre Corporation Systems and methods for behavioral link prediction for network access microsegmentation policy
CN116798233A (en) * 2023-08-25 2023-09-22 湖南天宇汽车制造有限公司 Ambulance rapid passing guiding system
CN117649153A (en) * 2024-01-29 2024-03-05 南京典格通信科技有限公司 Mobile communication network user experience quality prediction method based on information integration

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107451703A (en) * 2017-08-31 2017-12-08 杭州师范大学 A kind of social networks multitask Forecasting Methodology based on factor graph model
CN109245952A (en) * 2018-11-16 2019-01-18 大连理工大学 A kind of disappearance link prediction method based on MPA model
US20190132224A1 (en) * 2017-10-26 2019-05-02 Accenture Global Solutions Limited Systems and methods for identifying and mitigating outlier network activity
CN111275113A (en) * 2020-01-20 2020-06-12 西安理工大学 Skew time series abnormity detection method based on cost sensitive hybrid network

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107451703A (en) * 2017-08-31 2017-12-08 杭州师范大学 A kind of social networks multitask Forecasting Methodology based on factor graph model
US20190132224A1 (en) * 2017-10-26 2019-05-02 Accenture Global Solutions Limited Systems and methods for identifying and mitigating outlier network activity
CN109245952A (en) * 2018-11-16 2019-01-18 大连理工大学 A kind of disappearance link prediction method based on MPA model
CN111275113A (en) * 2020-01-20 2020-06-12 西安理工大学 Skew time series abnormity detection method based on cost sensitive hybrid network

Non-Patent Citations (8)

* Cited by examiner, † Cited by third party
Title
Wu Jiehua et al.: Link classification in multiplex social networks based on transfer component analysis, Data Analysis and Knowledge Discovery *
Liu Si et al.: Link prediction algorithm based on network representation learning and random walk, Journal of Computer Applications *
Liu Wei et al.: Link prediction in complex networks, Information and Control *
Sun Wei: Application of a cost-sensitive improved AdaBoost algorithm to imbalanced data, China Master's Theses Full-text Database, Information Science and Technology *
Sun Cheng et al.: Neural-network-based link prediction method in social networks, Mathematical Modeling and Its Applications *
Li Kuanyang: Link learning and prediction in complex networks based on similarity indexes, China Master's Theses Full-text Database, Information Science and Technology *
Xie Yixi et al.: A link prediction index fusion method based on an improved Logistic model, Journal of Information Engineering University *
Huang Xianying et al.: Research on the propagation rate model of emergencies in social networks, Journal of University of Electronic Science and Technology of China *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20230300032A1 (en) * 2022-03-18 2023-09-21 The Mitre Corporation Systems and methods for behavioral link prediction for network access microsegmentation policy
CN116798233A (en) * 2023-08-25 2023-09-22 湖南天宇汽车制造有限公司 Ambulance rapid passing guiding system
CN116798233B (en) * 2023-08-25 2024-01-09 湖南天宇汽车制造有限公司 Ambulance rapid passing guiding system
CN117649153A (en) * 2024-01-29 2024-03-05 南京典格通信科技有限公司 Mobile communication network user experience quality prediction method based on information integration
CN117649153B (en) * 2024-01-29 2024-04-16 南京典格通信科技有限公司 Mobile communication network user experience quality prediction method based on information integration

Also Published As

Publication number Publication date
CN112039700B (en) 2021-11-23

Similar Documents

Publication Publication Date Title
Jean et al. Semi-supervised deep kernel learning: Regression with unlabeled data by minimizing predictive variance
CN112039700B (en) Social network link abnormity prediction method based on stack generalization and cost sensitive learning
CN112073227B (en) Social network link abnormity detection method by utilizing cascading generalization and cost sensitive learning
Armina et al. A review on missing value estimation using imputation algorithm
Nagi et al. Classification of microarray cancer data using ensemble approach
Phyu Survey of classification techniques in data mining
Uncu et al. A novel feature selection approach: combining feature wrappers and filters
Peng et al. User preferences based software defect detection algorithms selection using MCDM
Zhang et al. Incorporating implicit link preference into overlapping community detection
Serafino et al. Ensemble learning for multi-type classification in heterogeneous networks
CN112073298B (en) Social network link abnormity prediction system integrating stacked generalization and cost sensitive learning
Zhao et al. Deep bayesian unsupervised lifelong learning
Dubey et al. Data mining based handling missing data
Za’in et al. Evolving large-scale data stream analytics based on scalable PANFIS
Dutta et al. Clustering by multi objective genetic algorithm
Probst Hyperparameters, tuning and meta-learning for random forest and other machine learning algorithms
Zhou et al. Online recommendation based on incremental-input self-organizing map
Ganji et al. Parallel fuzzy rule learning using an ACO-based algorithm for medical data mining
Sainin et al. A direct ensemble classifier for imbalanced multiclass learning
Berral-García When and how to apply Statistics, Machine Learning and Deep Learning techniques
Carmona et al. An analysis on the use of pre-processing methods in evolutionary fuzzy systems for subgroup discovery
Vukićević et al. Finding best algorithmic components for clustering microarray data
Marco et al. An Improving Long Short Term Memory-Grid Search Based Deep Learning Neural Network for Software Effort Estimation.
Zhang et al. Learning causal fuzzy logic rules by leveraging markov blankets
Fakhraei et al. Adaptive neighborhood graph construction for inference in multi-relational networks

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant