CN112039700A - Social network link anomaly prediction method based on stacked generalization and cost-sensitive learning - Google Patents

Social network link anomaly prediction method based on stacked generalization and cost-sensitive learning

Info

Publication number
CN112039700A
CN112039700A (application CN202010873960.4A)
Authority
CN
China
Prior art keywords
model
probability
data
prediction
value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010873960.4A
Other languages
Chinese (zh)
Other versions
CN112039700B (en
Inventor
刘小洋
李祥
叶舒
马敏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University of Technology
Original Assignee
Chongqing University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University of Technology filed Critical Chongqing University of Technology
Priority to CN202010873960.4A priority Critical patent/CN112039700B/en
Publication of CN112039700A publication Critical patent/CN112039700A/en
Application granted granted Critical
Publication of CN112039700B publication Critical patent/CN112039700B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 41/00: Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L 41/14: Network analysis or design
    • H04L 41/147: Network analysis or design for predicting network behaviour
    • H04L 41/142: Network analysis or design using statistical or mathematical methods
    • H04L 41/145: Network analysis or design involving simulating, designing, planning or modelling of a network
    • H04L 43/00: Arrangements for monitoring or testing data switching networks
    • H04L 43/08: Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters
    • H04L 43/16: Threshold monitoring

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Physics & Mathematics (AREA)
  • Algebra (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Mathematical Physics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Pure & Applied Mathematics (AREA)
  • Environmental & Geological Engineering (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention provides a social network link anomaly prediction method based on stacked generalization and cost-sensitive learning, which comprises the following steps: S1, obtaining social network node data and taking the similarity indices in the obtained data as the features learned by the base models; S2, determining the hyper-parameters of the base models; S3, relearning the prediction results of the base models and obtaining the final prediction result. The method can predict link anomalies between social network nodes.

Description

Social network link anomaly prediction method based on stacked generalization and cost-sensitive learning
Technical Field
The invention relates to the technical field of social networks, and in particular to a social network link anomaly prediction method based on stacked generalization and cost-sensitive learning.
Background
In the real world, networks are ubiquitous, for example social networks, collaboration networks, protein-protein interaction networks, and communication networks. Analyzing these networks has attracted increasing attention not only in computer science but also in sociology, physics, bioinformatics, and statistics. Link prediction in social networks is a basic network analysis task: given known information (e.g., the network's nodes and structure), predict the likelihood that a link exists between two currently unconnected nodes. It should be noted that link prediction covers both the prediction of existing (missing) links and the prediction of future links.
Link prediction for social networks has been studied intensively. Over the past decades, various link prediction methods have been proposed, most of them based on the network structure. Here we briefly review the two mainstream families of link prediction methods: similarity methods (including node similarity and structural similarity) and likelihood estimation methods. To date, similarity-based link prediction has produced a series of results and has accordingly been widely applied in many fields. Similarity-based methods can be further divided into three categories, namely neighbor-based, path-based, and random-walk-based methods. The simplest link prediction methods rest on the assumption that two nodes are more likely to be linked if they share more common neighbors. Newman first used the Common Neighbor index (CN) to measure the similarity of two nodes, and many variants of CN followed, such as the Salton index, Resource Allocation index (RA), Adamic-Adar index (AA), Jaccard coefficient, Hub Promoted index (HPI), Leicht-Holme-Newman index (LHN), and Preferential Attachment index (PA). Extensive experiments on real networks show that the RA index performs best overall, while the PA index performs worst. Path-based methods compute the similarity of a node pair using the paths between the two nodes; examples include the Local Path index (LP) and the Katz index. The LP index only considers local paths of length 2 and 3, whereas the Katz index is based on all paths and can achieve high performance on practical networks. Random-walk-based methods use random walks to model interactions between nodes in a network. Representative methods include Average Commute Time (ACT), SimRank, Random Walk with Restart (RWR), and Local Random Walk (LRW). The ACT index is based on the average number of steps a random walker needs to travel from one node to another. SimRank measures how soon two random walkers, starting from two different nodes, will meet at the same node. RWR is a direct application of the PageRank algorithm. LRW is a local index that only considers a few steps of the random walk; it is known to outperform the ACT index with lower computational complexity than ACT and RWR. The second family of methods is based on likelihood estimation. Clauset et al. proposed a general technique to infer the hierarchical structure of a network and used it to predict missing links. The stochastic block model divides network nodes into several groups, and the connection probability between any two nodes is determined by the groups the nodes belong to. Pan et al. maximized the likelihood of the observed network based on a predefined structural Hamiltonian and scored unobserved links by the conditional probability of adding the link. Liben-Nowell and Kleinberg proposed a likelihood estimation method for link prediction, and new link prediction methods based on likelihood analysis have appeared since. Although computationally complex, these maximum likelihood methods can provide valuable insight.
The similarity methods and the likelihood estimation methods each have advantages and disadvantages. Similarity-based methods have low computational complexity, but their results are affected by the network structure: on networks with different structural features the results are unstable and not robust. Likelihood-estimation-based methods have a solid mathematical foundation and high prediction accuracy, but they require strict assumptions, are computationally expensive, and are not suitable for large-scale networks.
Disclosure of Invention
The invention aims to solve at least the above technical problems in the prior art, and in particular provides a social network link anomaly prediction method based on stacked generalization and cost-sensitive learning.
In order to achieve the above object, the present invention provides a social network link anomaly prediction method based on stack generalization and cost-sensitive learning, including the following steps:
s1, obtaining social network node data, and taking similarity indexes in the obtained social network node data as characteristics of basic model learning;
s2, determining the hyper-parameters of the basic model;
s3, relearning the prediction result of the base model; and obtaining a final prediction result.
In a preferred embodiment of the present invention, the base model in step S1 includes:
Given a dataset D = {(x_1, y_1), (x_2, y_2), (x_3, y_3), ..., (x_N, y_N)}, where x_i ∈ R^n and y_i ∈ {0,1}; when y_i = 0, y_i represents the negative class; when y_i = 1, y_i represents the positive class; i = 1, 2, 3, ..., N. R^n denotes the sample feature space, n is the number of features of each sample, and N is the number of samples in the dataset D.
Since w^T x + b takes continuous values, where w is a column vector of dimension (n, 1), T denotes transposition, x is a column vector of dimension (n, 1), and b is of dimension (1, 1), it cannot fit a discrete variable directly; it can instead be used to fit the conditional probability P(Y = 1 | x). However, for w ≠ 0 (if w is the zero vector there is nothing to solve), the value of w^T x + b ranges over the real numbers R and does not satisfy the requirement that a probability lie between 0 and 1, so a generalized linear model is considered.
Since the unit step function is not differentiable, the logistic (log-odds) function is a typical surrogate:
y = 1 / (1 + e^{-(w^T x + b)})
Thus:
ln(y / (1 - y)) = w^T x + b
If y is the probability that x is a positive example, then 1 - y is the probability that x is a negative example. The ratio of the two is called the odds, i.e., the ratio of the probability that the event occurs to the probability that it does not occur; if the probability of the event occurring is P, the log-odds is:
logit(P) = ln(P / (1 - P))
Regarding y as an estimate of the class posterior probability P(Y = 1 | x), the formula can be rewritten as:
ln(P(Y = 1 | x) / (1 - P(Y = 1 | x))) = w^T x + b
P(Y = 1 | x) = e^{w^T x + b} / (1 + e^{w^T x + b})
That is, the log-odds of the output Y = 1 is represented by a linear function of the input x; this is the logistic regression model. The closer the value of w^T x + b is to positive infinity, the closer the probability P(Y = 1 | x) is to 1. The idea of logistic regression is therefore to fit a decision boundary first and then establish the probabilistic connection between the boundary and the classes, thereby obtaining the class probabilities in the binary case.
After the mathematical form of the logistic regression model is determined, it remains to solve for the parameters in the model. In statistics, maximum likelihood estimation is commonly used: a set of parameters is sought under which the likelihood of the data is maximized. Let:
P(Y = 1 | x) = p(x)
P(Y = 0 | x) = 1 - p(x)
p(x_i) denotes the probability that the i-th sample belongs to the positive class (Y = 1) given its features x_i;
y_i is the label of the i-th sample in the given dataset D of the binary classification problem, y_i ∈ {0,1}.
The likelihood is the product over samples of p(x_i)^{y_i} (1 - p(x_i))^{1 - y_i}; for convenience, taking logarithms of both sides gives the log-likelihood function:
L(w) = Σ_{i=1}^{N} [y_i ln p(x_i) + (1 - y_i) ln(1 - p(x_i))]
In machine learning there is the concept of a loss function, which measures the degree of model prediction error. Taking the average log-likelihood loss over the entire dataset gives:
J(w) = -(1/N) L(w) = -(1/N) Σ_{i=1}^{N} [y_i ln p(x_i) + (1 - y_i) ln(1 - p(x_i))]
where N represents the number of samples in the dataset D;
that is, in the logistic regression model, maximizing the likelihood function and minimizing the loss function are equivalent.
There are many methods for solving logistic regression; here the gradient descent method is mainly used. The objective of the optimization is to find a direction in which to move the parameters so that the value of the loss function decreases; this direction is usually found from the first-order or second-order partial derivatives. The loss function of logistic regression is:
J(w) = -(1/N) Σ_{i=1}^{N} [y_i ln p(x_i) + (1 - y_i) ln(1 - p(x_i))]
Gradient descent finds the descent direction via the first derivative of J(w) with respect to w and updates the parameters iteratively:
g_i = ∂J(w)/∂w_i = (p(x_i) - y_i) x_i
w_i^{k+1} = w_i^k - α g_i
w_i^k denotes the weight parameter after the k-th iterative update of the i-th weight parameter;
α is the learning rate and controls the step size of one parameter update;
w_i^{k+1} denotes the weight parameter after the (k+1)-th iterative update of the i-th weight parameter;
w_i denotes the i-th weight parameter.
In a preferred embodiment of the present invention, in step S2, the method for determining the hyper-parameters of the base models includes one of, or any combination of, cross-validation, grid search, and early stopping.
In a preferred embodiment of the present invention, in step S3, it is determined according to the final prediction result: if the final prediction result FinalPredictionLabel is greater than or equal to the preset result threshold, the two nodes are abnormal links; and if the final prediction result FinalPredictionLabel is smaller than the preset result threshold, the two nodes are normal links.
In conclusion, due to the adoption of the technical scheme, the method and the device can predict the link abnormity of the social network node.
Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
The above and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
FIG. 1 is a schematic diagram of the stacked generalization training process of the present invention.
FIG. 2 is a schematic diagram of LLSLP model framework of the present invention.
FIG. 3 is a schematic diagram of a confusion matrix according to the present invention.
FIG. 4 is a schematic diagram of ROC for each algorithm on the FBK data set of the present invention.
FIG. 5 is a PR diagram of the algorithms on the FBK data set of the present invention.
FIG. 6 is a PR diagram of the algorithms on the FBK data set of the present invention.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the accompanying drawings are illustrative only for the purpose of explaining the present invention, and are not to be construed as limiting the present invention.
1 Introduction
1.1 background
In the real world, networks are ubiquitous, for example social networks, collaboration networks, protein-protein interaction networks, and communication networks. Analyzing these networks has attracted increasing attention not only in computer science but also in sociology, physics, bioinformatics, and statistics. Link prediction in social networks is a basic network analysis task: given known information (e.g., the network's nodes and structure), predict the likelihood that a link exists between two currently unconnected nodes. It should be noted that link prediction covers both the prediction of existing (missing) links and the prediction of future links.
Link prediction for social networks has been studied intensively. Over the past decades, various link prediction methods have been proposed, most of them based on the network structure. Here we briefly review the two mainstream families of link prediction methods: similarity methods (including node similarity and structural similarity) and likelihood estimation methods. To date, similarity-based link prediction has produced a series of results and has accordingly been widely applied in many fields. Similarity-based methods can be further divided into three categories, namely neighbor-based, path-based, and random-walk-based methods. The simplest link prediction methods rest on the assumption that two nodes are more likely to be linked if they share more common neighbors. Newman first used the Common Neighbor index (CN) to measure the similarity of two nodes, and many variants of CN followed, such as the Salton index, Resource Allocation index (RA), Adamic-Adar index (AA), Jaccard coefficient, Hub Promoted index (HPI), Leicht-Holme-Newman index (LHN), and Preferential Attachment index (PA). Extensive experiments on real networks show that the RA index performs best overall, while the PA index performs worst. Path-based methods compute the similarity of a node pair using the paths between the two nodes; examples include the Local Path index (LP) and the Katz index. The LP index only considers local paths of length 2 and 3, whereas the Katz index is based on all paths and can achieve high performance on practical networks. Random-walk-based methods use random walks to model interactions between nodes in a network. Representative methods include Average Commute Time (ACT), SimRank, Random Walk with Restart (RWR), and Local Random Walk (LRW). The ACT index is based on the average number of steps a random walker needs to travel from one node to another. SimRank measures how soon two random walkers, starting from two different nodes, will meet at the same node. RWR is a direct application of the PageRank algorithm. LRW is a local index that only considers a few steps of the random walk; it is known to outperform the ACT index with lower computational complexity than ACT and RWR. The second family of methods is based on likelihood estimation. Clauset et al. proposed a general technique to infer the hierarchical structure of a network and used it to predict missing links. The stochastic block model divides network nodes into several groups, and the connection probability between any two nodes is determined by the groups the nodes belong to. Pan et al. maximized the likelihood of the observed network based on a predefined structural Hamiltonian and scored unobserved links by the conditional probability of adding the link. Liben-Nowell and Kleinberg proposed a likelihood estimation method for link prediction, and new link prediction methods based on likelihood analysis have appeared since. Although computationally complex, these maximum likelihood methods can provide valuable insight.
The similarity methods and the likelihood estimation methods each have advantages and disadvantages. Similarity-based methods have low computational complexity, but their results are affected by the network structure: on networks with different structural features the results are unstable and not robust. Likelihood-estimation-based methods have a solid mathematical foundation and high prediction accuracy, but they require strict assumptions, are computationally expensive, and are not suitable for large-scale networks.
1.2 major contributions
1) Aiming at the problem that traditional link prediction algorithms only consider a single similarity index, are easily influenced by the network structure, and generalize poorly, a new social network link prediction method (LLSLP) is proposed on the basis of fusing 15 traditional similarity indices.
2) The proposed LLSLP method not only fuses the traditional similarity indices but also introduces the Stacking idea. A Logistic Regression model and a LightGBM model perform nonlinear computation on the 15 traditional similarity indices to obtain fused index features. On this basis, a Logistic Regression model learns the fused features, and cross-validation, grid search, and early stopping are used for optimization, so that the proposed LLSLP obtains more complementarity, a more stable effect, and good generalization.
3) Detailed, systematic evaluation and analysis were performed on 10 social network data sets (SMG, EML, NSC, YST, HMT, KHN, FBK, UGP, ADV, and GRQ) from different domains, with different scales and network structures. In addition, 7 different evaluation indices are adopted to measure the performance of the algorithms and models more comprehensively. The LLSLP method presented herein is compared with the single traditional algorithms and models.
4) The experimental results show that the overall performance of LLSLP on each experimental data set is better than that of the traditional algorithms and models: the AUC value not only reaches more than 98.71%, but is also on average 10.52% higher than that of the 15 traditional link prediction algorithms CN, Sal, Jac, Sor, HPI, HDI, LHN-I, PA, AA, RA, LP, Katz, ACT, Cos, and RWR. Under extreme class imbalance of the data sets, the F1-score and MCC values achieve improvements of 3.25%-9.73% and 5.90%-10.21%, respectively, relative to the 15 traditional link prediction algorithms. Better predictions are obtained on different data sets, and the effectiveness, stability, and generalization of the algorithm are verified through analysis of the results.
The rest of the text is arranged as follows. In section 2, Logistic Regression, LightGBM, and Stacking are described in detail. In section 3, the proposed LLSLP method is introduced. The experimental setup is discussed in section 4 and the experimental results are analyzed in comparison. Finally, this document is summarized in section 5. In addition, the appendix section supplements the relevant experimental data graphs, including ROC graphs, PR graphs, and confusion matrix graphs.
2 basic model
2.1 Logistic Regression
The essence of Logistic Regression (LR) is to assume that the data obey the logistic distribution and then use maximum likelihood estimation for parameter estimation. Although it is called regression, it is actually a classification model and is commonly used for binary classification. Logistic Regression is described below taking the binary classification problem as an example:
given a dataset D ═ x considering a binary problem1,y1),(x2,y2),(x3,y3),……,(xN,yN) Wherein, in the step (A),
Figure BDA0002652022070000061
yie {0,1 }; when y isiWhen equal to 0, yiRepresents a negative class; when y isiWhen 1, yiRepresents a positive class; 1,2,3, …, N;
Figure BDA0002652022070000062
representing a sample feature space, wherein n represents the feature number of each sample; n represents the number of samples in the data set D.
Due to wTThe x + b values are continuous, wherein w represents a column vector and the dimension is (n, 1); t represents transposition; x represents a column vector with dimension (n, 1); b represents a column vector with dimension (1, 1); it cannot fit discrete variables and can be considered to fit the conditional probability P (Y ═ 1| x). However, for w ≠ 0 (with zero vectors, there is no value for solving), wTThe value of x + b is a real number R, and the value of unsatisfied probability is 0 to 1, so that a generalized linear model is considered.
Since the unit step function is not trivial, the log probability function is a typical alternative function:
Figure BDA0002652022070000071
thus, there are:
Figure BDA0002652022070000072
if y is the probability that x is positive, then 1-y is the probability that x is negative. The ratio of the two is called probability (odds), which refers to the ratio of the probability of the event occurring to the probability of not occurring, and if the probability of the event occurring is P, the log probability:
Figure BDA0002652022070000073
regarding y as the class posterior probability estimation, the rewrite formula is:
Figure BDA0002652022070000074
Figure BDA0002652022070000075
that is, the log-probability of output Y being 1 is a model represented by a linear function of input x, which is a logistic regression model. When w isTThe more the value of + b getsTo be more infinite, the probability value of P (Y ═ 1| x) is closer to 1. The idea of logistic regression is to fit a decision boundary (not limited to linear but also polynomial) and then establish the probability relationship between the boundary and the classification, so as to obtain the probability under the two classification conditions.
After the mathematical form of the logistic regression model is determined, how to solve the parameters in the model remains. In statistics, a maximum likelihood estimation method is often used to solve, that is, a set of parameters is found, so that the likelihood (probability) of data is maximum under the set of parameters. Order:
Figure BDA0002652022070000076
Figure BDA0002652022070000077
p(xi) The conditional probability in expression (6) indicates that the ith sample has a known characteristic of xiIn the case of (2), the probability of the positive type (Y ═ 1) is used.
yiThat is, the two-class problem is given in the data set D, i.e., yi=y1,y2,y3,...,yn,yi∈{0,1};
For more convenient solution, logarithms are taken from two sides of the peer-to-peer equation and written into log-likelihood functions:
Figure BDA0002652022070000078
the notion of a loss function is lost in machine learning, which measures how wrong the model predicts. If the average log-likelihood loss over the entire data set is taken, one can obtain:
Figure BDA0002652022070000079
wherein N represents the number of samples in the data set D;
i.e., in a logistic regression model, the maximum likelihood function and the minimum loss function are practically equivalent.
There are many methods for solving logistic regression, and here, a gradient descent method is mainly used. The main objective of the optimization is to find a direction towards which the parameters are moved to enable the value of the loss function to be reduced, often by first order partial derivatives or various combinations of second order partial derivatives. The loss function of the logistic regression is:
Figure BDA0002652022070000081
gradient descent finds the descending direction by the first derivative of j (w) to w, and updates the parameters in an iterative manner by:
Figure BDA0002652022070000082
Figure BDA0002652022070000083
Figure BDA0002652022070000084
representing the updated weight parameter of the kth iteration of the ith sample weight parameter;
alpha represents the learning rate and represents the speed of 1-time parameter iterative updating;
Figure BDA0002652022070000085
representing the weight parameter after the (k + 1) th iteration update of the ith sample weight parameter;
wirepresenting the weight parameter of the ith sample.
Where k is the number of iterations. After each update of the parameters, the parameters can be updated by comparing | | | J (w)k+1)-J(wk) I is less than a threshold value orThe maximum number of iterations is reached and the iteration is stopped.
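For illustration only (not part of the original description), the following is a minimal NumPy sketch of binary logistic regression fitted by the batch gradient descent of Equations (9)-(12); the function names, step size, stopping threshold, and synthetic data are assumptions made for this example.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def fit_logistic_regression(X, y, alpha=0.1, max_iter=1000, tol=1e-6):
        # Batch gradient descent on the average negative log-likelihood J(w).
        N, n = X.shape
        Xb = np.hstack([X, np.ones((N, 1))])   # append a constant column for the bias b
        w = np.zeros(n + 1)
        prev_loss = np.inf
        for k in range(max_iter):
            p = sigmoid(Xb @ w)                 # p(x_i) = P(Y = 1 | x_i)
            loss = -np.mean(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))
            grad = Xb.T @ (p - y) / N           # first derivative of J(w) with respect to w
            w -= alpha * grad                   # w^{k+1} = w^k - alpha * gradient
            if abs(prev_loss - loss) < tol:     # stop when |J(w^{k+1}) - J(w^k)| < threshold
                break
            prev_loss = loss
        return w

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        X = rng.normal(size=(200, 15))          # e.g. 15 similarity features per node pair
        true_w = rng.normal(size=15)
        y = (X @ true_w > 0).astype(float)      # synthetic labels for the sketch
        w = fit_logistic_regression(X, y)
        p = sigmoid(np.hstack([X, np.ones((200, 1))]) @ w)
        print("training accuracy:", np.mean((p >= 0.5) == y))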
2.1.1 regularization
Regularization is a general algorithm and idea, so any algorithm that exhibits overfitting can use regularization to avoid it. On the basis of minimizing the empirical risk (i.e., minimizing the training error), a model that is as simple as possible is adopted, which can effectively improve the generalization prediction accuracy. If the model is too complex, small changes in the variable values cause large changes in the prediction, which harms accuracy. Regularization is effective because it reduces the weights of the features and makes the model simpler. Regularization typically takes the L1 or L2 norm form, Φ(w) = ||w||_1 or Φ(w) = ||w||_2, respectively.
1) L1 regularization (LASSO regression) is equivalent to adding the following prior knowledge to the model: w follows a zero-mean Laplace distribution. The Laplace density is:
f(w | μ, σ) = (1/(2σ)) exp(-|w - μ| / σ)   (13)
μ is the location parameter of the Laplace distribution; when μ = 0, the symmetry axis of the Laplace density curve is the y-axis. σ is the scale parameter.
With this prior knowledge introduced, the likelihood function becomes:
L(w) = Π_{i=1}^{N} p(x_i)^{y_i} (1 - p(x_i))^{1 - y_i} Π_{j=1}^{d} (1/(2σ)) exp(-|w_j| / σ)   (14)
where d represents the number of weight parameters w that need to be regularized.
Taking the logarithm and then the negative (dropping constant terms) gives the objective function:
J(w) = -Σ_{i=1}^{N} [y_i ln p(x_i) + (1 - y_i) ln(1 - p(x_i))] + (1/σ) Σ_{j=1}^{d} |w_j|   (15)
Equation (15) is equivalent to the original loss function followed by an L1 regularization term, so the essence of L1 regularization is to add to the model the prior knowledge that the model parameters obey a zero-mean Laplace distribution.
2) L2 regularization (ridge regression) is equivalent to adding the following prior knowledge to the model: w follows a zero-mean normal distribution. The normal density is:
f(w | 0, σ) = (1/(√(2π) σ)) exp(-w^2 / (2σ^2))   (16)
With this prior knowledge introduced, the likelihood function becomes:
L(w) = Π_{i=1}^{N} p(x_i)^{y_i} (1 - p(x_i))^{1 - y_i} Π_{j=1}^{d} (1/(√(2π) σ)) exp(-w_j^2 / (2σ^2))   (17)
Taking the logarithm and then the negative (dropping constant terms) gives the objective function:
J(w) = -Σ_{i=1}^{N} [y_i ln p(x_i) + (1 - y_i) ln(1 - p(x_i))] + (1/(2σ^2)) Σ_{j=1}^{d} w_j^2   (18)
Equation (18) is equivalent to the original loss function followed by an L2 regularization term, so the essence of L2 regularization is to add to the model the prior knowledge that the model parameters follow a zero-mean normal distribution.
L1 regularization adds the L1 norm as the regularization term after the loss function; adding the L1 norm tends to produce sparse solutions (with many zero entries). L2 regularization adds the square of the L2 norm as the regularization term after the loss function; compared with L1 regularization, L2 regularization yields a smoother (non-sparse) solution, but it also keeps many dimensions of the solution close to 0 (though not exactly 0, hence the smoothness), which reduces the complexity of the model.
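As a brief illustration of how the penalties of Equations (15) and (18) are applied in practice, the sketch below uses scikit-learn's LogisticRegression with the two penalty types; the tooling choice, parameter values, and synthetic data are assumptions of this example rather than requirements of the method.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(1)
    X = rng.normal(size=(300, 15))
    y = (X[:, 0] - X[:, 1] + 0.1 * rng.normal(size=300) > 0).astype(int)

    # L1 penalty (zero-mean Laplace prior): tends to produce sparse coefficients.
    l1_model = LogisticRegression(penalty="l1", solver="liblinear", C=1.0).fit(X, y)

    # L2 penalty (zero-mean Gaussian prior): shrinks coefficients smoothly toward zero.
    l2_model = LogisticRegression(penalty="l2", solver="lbfgs", C=1.0).fit(X, y)

    print("non-zero L1 coefficients:", int(np.sum(l1_model.coef_ != 0)))
    print("non-zero L2 coefficients:", int(np.sum(l2_model.coef_ != 0)))

Here C is the inverse regularization strength, so a smaller C corresponds to a larger σ-dependent penalty in Equations (15) and (18).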
2.2 LightGBM
Boosting trees are learned as an additive model via a forward stagewise algorithm and have several efficient implementations, such as XGBoost, pGBRT, and GBDT (Gradient Boosting Decision Tree). GBDT uses the negative gradient as the splitting indicator (information gain), while XGBoost also uses second derivatives. Their common disadvantage is that computing the information gain requires scanning all samples to find the optimal split point, so their efficiency and scalability are unsatisfactory when facing large amounts of data or high feature dimensions. The direct way to solve this problem is to reduce the number of features and the amount of data without affecting accuracy. Some work accelerates the boosting process by sampling according to data weights, but GBDT has no sample weights, so such methods cannot be applied directly.
Microsoft's open-source LightGBM (based on GBDT) solves these problems well; it mainly contains two algorithms:
1) Gradient-based One-Side Sampling (GOSS), which reduces the number of samples: most of the small-gradient samples are excluded, and only the remaining samples are used to compute the information gain. GBDT has no data weights, but each data instance has a different gradient, and by the definition of the information gain, instances with large gradients have a larger influence on it. Therefore, when down-sampling, the samples with large gradients (above a preset threshold, or in the top percentiles) should be kept as far as possible, while the samples with small gradients are removed at random. This is shown to achieve more accurate results than random sampling at the same sampling rate, especially when the information gain has a large range.
2) Exclusive Feature Bundling (EFB), which reduces the number of features: mutually exclusive features, i.e., features that rarely take non-zero values at the same time, are bundled (replaced by a single composite feature). In applications, although the number of features is relatively large, the feature space is very sparse, which raises the question of whether a lossless method can be designed to reduce the number of effective features. In sparse feature spaces in particular, many features are almost mutually exclusive (for example, one-hot features are never simultaneously non-zero), and such mutually exclusive features can be bundled.
2.2.1 Gradient-based One-Side Sampling
GBDT uses decision trees to learn a function that maps the input space to the gradient space. Assume that the training set has n instances {x_1, ..., x_n}, each with feature dimension s. At each gradient iteration, the negative gradients of the loss function with respect to the model outputs are denoted {g_1, ..., g_n}, and the decision tree splits the data at each node through the optimal split point (the point of maximum information gain). GBDT measures the information gain by the variance after the split.
Definition 1: Let O denote the training set of a fixed node. The variance gain of split point d of feature j is defined as:
V_{j|O}(d) = (1/n_O) [ (Σ_{x_i ∈ O: x_{ij} ≤ d} g_i)^2 / n_{l|O}^j(d) + (Σ_{x_i ∈ O: x_{ij} > d} g_i)^2 / n_{r|O}^j(d) ]   (19)
where n_O = Σ I[x_i ∈ O];
V_{j|O}(d) denotes the variance gain;
d is the candidate split point of the feature;
n_{l|O}^j(d) = Σ I[x_i ∈ O: x_{ij} ≤ d] counts the instances to the left of the split point;
n_{r|O}^j(d) = Σ I[x_i ∈ O: x_{ij} > d] counts the instances to the right of the split point;
x_{ij} denotes the j-th feature of the i-th sample.
For each feature j, every split point d is traversed to find d_j* = argmax_d V_j(d) and to compute the maximum information gain V_j(d_j*). The data are then split by split point d_{j*} of the selected feature j* into left and right child nodes.
In GOSS:
1) first, the training instances are sorted in descending order according to the absolute values of their gradients;
2) the top a fraction of instances is retained as a data subset A;
3) from the remaining instances, a data subset B of size b is obtained by random sampling;
4) finally, the information gain is estimated by:
Ṽ_j(d) = (1/n) [ (Σ_{x_i ∈ A_l} g_i + ((1 - a)/b) Σ_{x_i ∈ B_l} g_i)^2 / n_l^j(d) + (Σ_{x_i ∈ A_r} g_i + ((1 - a)/b) Σ_{x_i ∈ B_r} g_i)^2 / n_r^j(d) ]   (20)
A_l denotes the part of subset A to the left of split point d and A_r the part to the right; B_l and B_r are defined analogously for subset B.
Equation (20) differs from Equation (19) in that Equation (19) computes the exact variance gain V_{j|O}(d) on the full training set O of a fixed node, whereas Equation (20) is an estimate Ṽ_j(d) of the information gain computed only on the sampled subset A ∪ B. Because GOSS estimates the information gain on a much smaller data set, the amount of computation is greatly reduced. More importantly, the following theory shows that GOSS does not lose much training accuracy and outperforms random sampling.
Definition 2 (GOSS approximation error): Let the approximation error of GOSS be ε(d) = |Ṽ_j(d) - V_j(d)|, and let ḡ_l^j(d) and ḡ_r^j(d) denote the mean absolute gradients to the left and right of split point d. Then with probability at least 1 - δ:
ε(d) ≤ C_{a,b}^2 ln(1/δ) · max{1/n_l^j(d), 1/n_r^j(d)} + 2 D C_{a,b} √(ln(1/δ) / n)   (21)
where
V_j(d) denotes the true information gain on the data set;
A denotes the data subset consisting of the retained top-a large-gradient instances, and A^C is the complement of A;
C_{a,b} = ((1 - a)/√b) · max_{x_i ∈ A^C} |g_i|, and D = max(ḡ_l^j(d), ḡ_r^j(d)).
From the above theory, the following conclusions are drawn:
1) The asymptotic approximation ratio of GOSS is O(1/n_l^j(d) + 1/n_r^j(d) + 1/√n). If the split is not very unbalanced (i.e., n_l^j(d) ≥ O(√n) and n_r^j(d) ≥ O(√n)), the approximation error in Equation (21) is dominated by the second term, which tends to 0 as n tends to infinity (when the amount of data is large); i.e., the larger the data volume, the smaller the error and the higher the precision.
2) Random sampling is the special case of GOSS with a = 0. In most cases GOSS performs better than random sampling, namely when C_{0,β} > C_{a,β-a}, which is equivalent to α_a/√β > (1 - a)/√(β - a), where α_a = max_{x_i ∈ A ∪ A^C} |g_i| / max_{x_i ∈ A^C} |g_i|.
The generalization of GOSS is analyzed below. Consider the GOSS generalization error E_GOSS^gen(d) = |Ṽ_j(d) - V_*(d)|, which is the gap between the variance gain computed by GOSS on the sampled instances and the true variance gain on the underlying distribution. It can be decomposed as E_GOSS^gen(d) ≤ ε_GOSS(d) + |V_j(d) - V_*(d)|. Thus, where the GOSS approximation is accurate, the GOSS generalization error is close to that obtained with the full amount of real data. On the other hand, sampling increases the diversity of the base learners (since the data obtained may differ at each sampling), which tends to improve generalization.
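The sampling step of GOSS described above (retain the top a fraction of instances by gradient magnitude, randomly sample a b fraction of the rest, and re-weight the sampled small-gradient instances by (1 - a)/b) can be sketched as follows; this is an illustrative reimplementation under those assumptions, not LightGBM's internal code.

    import numpy as np

    def goss_sample(gradients, a=0.2, b=0.1, rng=None):
        # Return indices of the selected instances and the weights used for gain estimation.
        if rng is None:
            rng = np.random.default_rng(0)
        n = len(gradients)
        order = np.argsort(-np.abs(gradients))        # sort descending by |gradient|
        top_k, rand_k = int(a * n), int(b * n)
        top_idx = order[:top_k]                       # subset A: large-gradient instances
        rand_idx = rng.choice(order[top_k:], size=rand_k, replace=False)  # subset B
        idx = np.concatenate([top_idx, rand_idx])
        weights = np.ones(len(idx))
        weights[top_k:] = (1.0 - a) / b               # amplify the sampled small-gradient instances
        return idx, weights

    g = np.random.default_rng(2).normal(size=1000)
    idx, w = goss_sample(g)
    print(len(idx), "instances kept out of", len(g))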
2.2.2 Exclusive Feature Bundling
EFB is a method of reducing the feature dimension (in effect, a dimension-reduction technique) by bundling features, thereby improving computational efficiency. Typically, the bundled features are mutually exclusive (when one feature takes a non-zero value, the other is 0), so the two features can be bundled together without losing information. If two features are not completely mutually exclusive (in some cases both features take non-zero values), an index called the conflict ratio can be used to measure the degree of non-exclusiveness; when this value is small, two features that are not completely mutually exclusive can still be bundled without affecting the final accuracy. The algorithm steps of EFB are as follows:
1) sorting the features according to the number of non-0 values;
2) calculating the conflict ratio between different characteristics;
3) each feature is traversed and attempts are made to merge the features to minimize the collision ratio.
High-dimensional data are usually sparse, so a lossless method can be designed to reduce the dimensionality of the features. In particular, in a sparse feature space many features are mutually exclusive, e.g., they are never simultaneously non-zero. Mutually exclusive features can be bundled into a single feature, and by carefully designing the bundling algorithm, a feature histogram identical to those of the individual features can be constructed from the feature bundle. In this way the time complexity of histogram construction is reduced from O(#data × #feature) to O(#data × #bundle), and since #bundle << #feature, this greatly accelerates the training process of GBDT without losing accuracy. There are, however, two problems:
1) how to decide which features need to be bundled together;
2) how to construct the bundled features.
Theoretically, achieving the optimal feature bundling is an NP-hard (Non-deterministic Polynomial-time hard) problem. This means that it is not possible to find an exact solution in polynomial time. The exclusive feature bundling problem is addressed as shown in Algorithm 1.
[Algorithm 1: greedy feature bundling (pseudocode figure in the original).]
First, a graph with weighted edges is constructed, whose edge weights correspond to the total conflict rate between features. Second, the features are sorted in descending order by their degrees in the graph. Finally, each feature in the ordered list is examined and either assigned to an existing bundle with which it has a low conflict rate or used to create a new bundle. The time complexity of Algorithm 1 is quadratic in the number of features, and it is processed only once before training. This complexity is acceptable when the number of features is not very large, but it becomes significant when faced with millions of features. To further improve efficiency, a more efficient sorting strategy that does not build the graph is provided: sorting by the number of non-zero values. This is similar to sorting by degree, since more non-zero values usually imply a higher chance of conflict. Because only the ordering policy changes, the details of the new algorithm are omitted to avoid repetition.
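A simplified sketch of the greedy bundling idea just described, using the graph-free ordering by non-zero counts and a conflict-count threshold; it is an illustrative approximation of Algorithm 1, not the original pseudocode.

    import numpy as np

    def greedy_bundle(X, max_conflict_rate=0.05):
        # Group features whose non-zero patterns rarely overlap (i.e. rarely conflict).
        n_samples, n_features = X.shape
        nonzero = X != 0
        order = np.argsort(-nonzero.sum(axis=0))      # sort by number of non-zero values
        max_conflicts = int(max_conflict_rate * n_samples)
        bundles, bundle_masks = [], []
        for j in order:
            placed = False
            for k, mask in enumerate(bundle_masks):
                conflicts = int(np.sum(mask & nonzero[:, j]))
                if conflicts <= max_conflicts:        # low conflict: add feature j to bundle k
                    bundles[k].append(int(j))
                    bundle_masks[k] = mask | nonzero[:, j]
                    placed = True
                    break
            if not placed:                            # otherwise open a new bundle
                bundles.append([int(j)])
                bundle_masks.append(nonzero[:, j].copy())
        return bundles

    rng = np.random.default_rng(3)
    X = (rng.random((100, 8)) < 0.15) * rng.random((100, 8))   # sparse synthetic features
    print(greedy_bundle(X))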
For the second problem, to reduce the training complexity, a good method is needed to merge two features that should be bundled. The key is to ensure that the values of the original features can be identified from the value of the bundled feature. Because histogram-based algorithms store discrete bins rather than continuous feature values, the bundled feature can be constructed by letting the mutually exclusive features occupy different bins, which is done by adding offsets to the original feature values. For example, assume two features are to be combined into one bundle: feature A takes values in [0, 10) and feature B takes values in [0, 20). An offset of 10 is added to feature B, so that its values fall in [10, 30). Features A and B can then be merged into one feature with range [0, 30) that replaces the original A and B.
The Algorithm details are embodied in Algorithm 2.
[Algorithm 2: merging exclusive features (pseudocode figure in the original).]
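To make the offset example above concrete, a minimal sketch of merging mutually exclusive features with cumulative offsets is given below; the function and the explicit range list are assumptions of this illustration, not the original Algorithm 2.

    import numpy as np

    def merge_exclusive_features(features, ranges):
        # Merge (nearly) mutually exclusive features into one feature using cumulative offsets.
        merged = np.zeros(len(features[0]))
        offset = 0.0
        for f, upper in zip(features, ranges):
            f = np.asarray(f, dtype=float)
            nz = f != 0
            merged[nz] = f[nz] + offset        # shift this feature into its own value range
            offset += upper                    # the next feature occupies [offset, offset + its range)
        return merged

    a = np.array([0, 3, 0, 7, 0], dtype=float)     # feature A, values in [0, 10)
    b = np.array([5, 0, 12, 0, 0], dtype=float)    # feature B, values in [0, 20)
    print(merge_exclusive_features([a, b], ranges=[10, 20]))   # B's non-zeros land in [10, 30)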
The EFB algorithm can bundle many mutually exclusive features into a small number of dense features, which avoids unnecessary computation on zero feature values. In practice, histogram-based algorithms can also be optimized by using a table to record the non-zero values of each feature and ignoring the zero-valued entries. By scanning only the data in this table, the time complexity of histogram construction is reduced from O(#data) to O(#non_zero_data). However, extra memory and effort are required to maintain and update the table during the whole tree-building process. This optimization is implemented in LightGBM as a basic function. Notably, it does not conflict with the EFB algorithm, since it can still be used when the bundled features remain sparse.
2.3 Stacked Generalization
Ensemble learning is a machine learning paradigm. In ensemble learning, multiple models (often referred to as "weak learners") are trained to solve the same problem and combined to achieve better results. The most important assumptions are: when weak models are combined correctly, a more accurate or robust model can be obtained. In most cases, these basic models themselves do not perform very well, either because they have a high bias (e.g., low-degree-of-freedom models) or because their variance is too large resulting in a less robust (e.g., high-degree-of-freedom models). The idea of the integration method is to create a strong learner (or "integrated model") by combining the biases or variances of these weak learners to achieve better performance. Weak learners are generally combined mainly by the following three methods:
1) bagging, which generally considers homogeneous weak learners, learns these weak learners in parallel independently of each other and combines them according to some deterministic averaging process.
2) Boosting, which is also a common consideration for homogeneous weak learners. It learns the weak learners sequentially in a highly adaptive way and combines them according to some deterministic strategy.
3) Stacking, which generally considers heterogeneous weak learners, learns them in parallel and combines them by training a meta-model that outputs the final prediction based on the predictions of the different weak models.
There are two main differences between Stacking and Bagging/Boosting. First, Stacking generally considers heterogeneous weak learners (different learning algorithms are combined), while Bagging and Boosting mainly consider homogeneous weak learners. Second, Stacking combines the base models by learning a meta-model, whereas Bagging and Boosting combine the weak learners according to deterministic algorithms. The idea of Stacking is to learn several different weak learners, combine them by training a "meta-model", and then output the final prediction based on the multiple predictions returned by these weak models. Therefore, to build a Stacking model, two things need to be defined: the L learners to be fitted and the meta-model that combines them. For example, for a classification problem, a KNN classifier, Logistic Regression, and an SVM may be selected as weak learners, and a neural network may be chosen as the meta-model. The neural network then takes the outputs of the three weak learners as inputs and returns a final prediction based on them. Therefore, to fit a Stacking ensemble consisting of L weak learners, the following steps must be followed:
1) dividing training data into two groups;
2) selecting L weak learners, fitting them to the first set of data;
3) causing each of the L learners to make a prediction of observed data in the second set of data;
4) a meta-model is fitted over the second set of data, using predictions made by the weak learner as input.
In the previous step the data set is split in two, because predictions on the data used to train a weak learner must not be used to train the meta-model. An obvious disadvantage of splitting the data set into two parts is that only half of the data is used to train the base models and the other half to train the meta-model. To overcome this limitation, a training method similar to k-fold cross-validation can be used. All the observations can then be used to train the meta-model: for any observation, the weak-learner predictions are made by instances of the weak learners trained on the k-1 folds that do not contain the observation considered, as shown in FIG. 1.
In other words, the learners are trained on k-1 folds in order to predict the remaining fold. By repeating this process iteratively, a prediction is obtained for every fold of the data. This yields a relevant prediction for each observation in the data set, and all of these predictions are then used to train the meta-model. The Stacking method thus trains a meta-model that produces the final output from the outputs returned by the lower-layer weak learners.
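A compact sketch of the stacked generalization procedure described above, in which out-of-fold predictions of two heterogeneous base learners become the inputs of a meta-model; the choice of base learners, the fold count, and the synthetic data are illustrative assumptions.

    import numpy as np
    from sklearn.model_selection import KFold
    from sklearn.linear_model import LogisticRegression
    from sklearn.ensemble import RandomForestClassifier

    def stacking_oof_features(models, X, y, n_splits=5):
        # Out-of-fold probability predictions of each base model become the meta-features.
        meta = np.zeros((len(X), len(models)))
        kf = KFold(n_splits=n_splits, shuffle=True, random_state=0)
        for train_idx, val_idx in kf.split(X):
            for m, model in enumerate(models):
                model.fit(X[train_idx], y[train_idx])                        # fit on k-1 folds
                meta[val_idx, m] = model.predict_proba(X[val_idx])[:, 1]     # predict held-out fold
        return meta

    rng = np.random.default_rng(4)
    X = rng.normal(size=(500, 15))
    y = (X[:, 0] + X[:, 1] ** 2 > 1).astype(int)

    base_models = [LogisticRegression(max_iter=1000), RandomForestClassifier(n_estimators=50)]
    meta_X = stacking_oof_features(base_models, X, y)
    meta_model = LogisticRegression().fit(meta_X, y)   # meta-model trained on base predictions
    print("meta-model training accuracy:", meta_model.score(meta_X, y))

For predicting new data, the base models would finally be refit on the full training set before producing the meta-features of the test set.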
3 proposed Link prediction method
3.1 proposed Link prediction Algorithm (LLSLP)
Link prediction for social networks is considered a binary classification problem and takes into account 15 similarity indicators per two nodes, namely CN, Sal, Jac, Sor, HPI, HDI, LHN-I, PA, AA, RA, LP, Katz, ACT, Cos, RWR. First, the similarity index is considered a characteristic of any two nodes in the network. Then, Logistic Regression and LightGBM were selected as basic models. And finally, introducing a Stacking idea, and relearning the prediction result of the basic model to obtain a better prediction result.
3.1.1 partitioning node pairs
Considering a social network with n nodes, there are n(n - 1)/2 node pairs. A data set of all node pairs in the network is constructed, including the feature set F and the category set C. First, stratified sampling is adopted to divide all node pairs into an original training set and an original test set at a ratio of 8:2.
3.1.2 construction of training set and test set
In the original training set and the original test set, the 15 similarity indices (CN, Sal, Jac, Sor, HPI, HDI, LHN-I, PA, AA, RA, LP, Katz, ACT, Cos, RWR) are computed separately for each node pair (n_x, n_y), and these 15 similarity indices are taken as 15 different features between nodes n_x and n_y, which yields the feature set F for all node pairs. In the original network, a node pair is labeled class 1 if the two nodes are connected and class 0 otherwise; this yields the category set C of the network node pairs. Finally, the feature set and the category set are combined to obtain the training set D_train and the test set D_test.
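Purely as an illustration (the patent does not prescribe tooling), a few of the neighbor-based similarity indices used as node-pair features, namely CN, Jaccard, AA, RA, and PA, could be computed with networkx as sketched below; the graph and the function name are assumptions of the example.

    import math
    import networkx as nx

    def pair_features(G, u, v):
        # A subset of the similarity indices used as features for the node pair (u, v).
        nu, nv = set(G[u]), set(G[v])
        common = nu & nv
        return {
            "CN": len(common),                                        # common neighbors
            "Jac": len(common) / len(nu | nv) if nu | nv else 0.0,    # Jaccard coefficient
            "AA": sum(1.0 / math.log(G.degree(z)) for z in common if G.degree(z) > 1),
            "RA": sum(1.0 / G.degree(z) for z in common),             # resource allocation
            "PA": len(nu) * len(nv),                                  # preferential attachment
        }

    G = nx.karate_club_graph()
    print(pair_features(G, 0, 33))   # the label would be 1 if (0, 33) is an edge, 0 otherwise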
3.1.3 unbalance problem
In a classification task, a data set is said to be "class-imbalanced" when the numbers of samples of the different classes differ greatly. Obviously, the links in a network are sparse: for the nodes in a network, the number of node pairs with a connecting edge is much smaller than the number of node pairs without one. Meanwhile, node pairs with connecting edges, i.e., the minority class, are usually of more concern in link prediction. Thus, the classes in the training set and in the test set are imbalanced. In machine learning, learning from imbalanced samples is prone to overfitting, so that the generalization ability of the model is poor and the prediction becomes meaningless. For imbalanced data, in order not to change the original data distribution, a cost-sensitive learning strategy is used herein. Cost-sensitive learning assigns a higher misclassification cost to minority-class samples and a smaller misclassification cost to majority-class samples. In this way, the importance of the minority-class samples is raised during the training of the learner, which relieves the classifier's preference for the majority class. Cost-sensitive learning is briefly introduced below, taking Logistic Regression as an example.
The maximum likelihood objective is known from Equation (8); the corresponding loss is:
J(w) = -Σ_{i=1}^{N} [y_i ln p(x_i) + (1 - y_i) ln(1 - p(x_i))]   (22)
Minimizing the sample prediction error then means minimizing J(w). Under the cost-sensitive premise, positive- and negative-sample weights [α, β] are added before taking derivatives, and Equation (22) becomes:
J(w) = -Σ_{i=1}^{N} [α y_i ln p(x_i) + β (1 - y_i) ln(1 - p(x_i))]   (23)
Differentiation gives:
∂J(w)/∂w_j = -Σ_{i=1}^{N} [α y_i + (β - α) y_i p(x_i) - β p(x_i)] x_{ij}   (24)
Distinguishing the cases y_i = 1 and y_i = 0:
∂J(w)/∂w_j = -Σ_{y_i = 1} α (1 - p(x_i)) x_{ij} + Σ_{y_i = 0} β p(x_i) x_{ij}   (25)
w_j is iterated until convergence:
w_j := w_j + μ [α y_i + (β - α) p(x_i) y_i - β p(x_i)] x_{ij}   (26)
In summary, the positive- and negative-sample weights [α, β] amplify the cost of misjudging a particular class. For a binary classification model, the weights are set according to the proportions of positive and negative samples:
α = N / (2 N_+), β = N / (2 N_-)   (27)
where N_+ and N_- denote the numbers of positive and negative samples. The more general form is:
w_c = N / (n_classes × N_c)   (28)
where n_classes is the number of sample classes and N_c is the number of samples of class c.
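A short sketch of the class-weighting rule of Equations (27)-(28), w_c = N/(n_classes × N_c), together with one common way of applying such cost-sensitive weights in logistic regression; the scikit-learn usage and the synthetic imbalanced data are assumptions of this illustration.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def balanced_class_weights(y):
        # w_c = N / (n_classes * N_c): larger misclassification cost for the minority class.
        classes, counts = np.unique(y, return_counts=True)
        weights = len(y) / (len(classes) * counts)
        return dict(zip(classes.tolist(), weights.tolist()))

    rng = np.random.default_rng(5)
    X = rng.normal(size=(2000, 15))
    y = (rng.random(2000) < 0.05).astype(int)          # roughly 5% positive links: heavily imbalanced

    weights = balanced_class_weights(y)                 # minority class receives the larger weight
    model = LogisticRegression(class_weight=weights, max_iter=1000).fit(X, y)
    print("class weights:", weights)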
3.1.4 The proposed LLSLP algorithm
After the training set D_train and the test set D_test are obtained and the solution to the class imbalance is determined, D_train and D_test are each fed into the first learning layer for learning. This layer contains 2 base learners, LR and LightGBM, and cross-validation, grid search, and early stopping are used to determine the hyper-parameters of the models; this yields the fused features that the 2 base learners produce from the 15 traditional similarity indices. The 2 fused indices learned by the base learners are then combined to construct a new training set and a new test set, which are fed into the second learning layer for learning. Layer 2 contains only one Meta-Classifier, an LR model, and cross-validation, grid search, and early stopping are likewise used to determine its hyper-parameters during learning. Finally, the trained Meta-Classifier model h(x) = h'(h_1(x), h_2(x), ..., h_T(x)) is used to predict the new test set and obtain the final prediction result FinalPredictionLabel: if FinalPredictionLabel is greater than or equal to the preset result threshold, the two nodes form an abnormal link; if FinalPredictionLabel is smaller than the preset result threshold, the two nodes form a normal link. The details of the proposed algorithm are shown in Algorithm 3.
[Algorithm 3: the LLSLP algorithm (pseudocode figure in the original).]
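A condensed, illustrative sketch of the two-layer pipeline outlined above: the LR and LightGBM base learners produce link-probability features, an LR Meta-Classifier relearns them, and the final label is obtained by thresholding. The libraries (scikit-learn and the lightgbm package), the parameter grids, and the 0.5 threshold are assumptions of the example; the early stopping used in the original Algorithm 3 is omitted here for brevity.

    import numpy as np
    from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_predict
    from sklearn.linear_model import LogisticRegression
    from lightgbm import LGBMClassifier

    def llslp_predict(X_train, y_train, X_test, threshold=0.5):
        cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
        # Layer 1: two heterogeneous base learners with cost-sensitive class weights.
        lr = GridSearchCV(LogisticRegression(class_weight="balanced", max_iter=1000),
                          {"C": [0.1, 1.0, 10.0]}, cv=cv)
        gbm = GridSearchCV(LGBMClassifier(class_weight="balanced", n_estimators=200),
                           {"num_leaves": [15, 31]}, cv=cv)
        # Out-of-fold link probabilities become the fused features of the second layer.
        meta_train = np.column_stack([
            cross_val_predict(m, X_train, y_train, cv=cv, method="predict_proba")[:, 1]
            for m in (lr, gbm)])
        meta_test = np.column_stack([
            m.fit(X_train, y_train).predict_proba(X_test)[:, 1] for m in (lr, gbm)])
        # Layer 2: a single LR Meta-Classifier relearns the base predictions.
        meta = LogisticRegression(class_weight="balanced", max_iter=1000).fit(meta_train, y_train)
        prob = meta.predict_proba(meta_test)[:, 1]
        labels = (prob >= threshold).astype(int)    # >= threshold: abnormal link, else normal link
        return labels, prob

    rng = np.random.default_rng(6)
    X = rng.normal(size=(1000, 15))
    y = (rng.random(1000) < 0.1).astype(int)
    labels, prob = llslp_predict(X[:800], y[:800], X[800:])
    print(labels[:10], prob[:5])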
3.2 Link prediction model construction
In order to obtain a better prediction effect, base models with large differences are selected. Logistic Regression is a linear computational model, while LightGBM is a tree model; an algorithm that integrates the two has better accuracy and generalization. Logistic Regression and LightGBM are used as the two base models to train on the training set, and 5-fold cross-validation, grid search, and early stopping are adopted to determine the hyper-parameters of the base models. After the base models are trained, the Stacking method is introduced to integrate them: the probabilities of link existence and non-existence predicted by Logistic Regression and LightGBM are taken as new features. Since the effectiveness of Stacking comes mainly from feature extraction, and representation learning is always accompanied by an overfitting risk (the features of the second layer come from learning on the first-layer data), the original features should not be included among the second-layer features, so as to reduce the risk of overfitting. Also to reduce overfitting, the second-layer classifier should be a simpler classifier, and a generalized linear model such as Logistic Regression is a good choice. Complex nonlinear transformations have already been used during feature extraction (i.e., the learning of the first layer), so it is not appropriate to use an overly complex classifier in the output layer; this is similar to the activation function or output layer of a neural network, both of which are simple functions that control complexity. In addition, Logistic Regression with L1 regularization can select effective features and remove unnecessary models from the first-layer base models, saving computation and further preventing overfitting, and its output can be interpreted as a probability, which suits this classification task. In summary, Logistic Regression is selected as the Meta-Classifier and retrained on the newly learned features to determine the final prediction result. The LLSLP model framework is shown in FIG. 2.
4 results and analysis of the experiments
4.1 data set
To fully evaluate the effectiveness of the proposed LLSLP method for link prediction, 10 real networks from various domains were used in the experiments. UPG is a power distribution network; YST is a biological network; KHN, SMG, NSC and GRQ are co-author networks in different research areas; HMT, FBK and ADV are social networks; EML is a network of individuals who share email. These networks were chosen to cover a wide range of attributes, including different sizes, average degrees, clustering coefficients, heterogeneity indices and imbalance ratios IR (imbalance ratio), defined as the ratio of connected to non-connected edges. The structural characteristics of the networks used in the experiments are summarized in Table 1.
TABLE 1 data set
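As an illustration of how the structural characteristics summarized in Table 1 (size, average degree, clustering coefficient and the imbalance ratio IR defined above) could be computed for any of the 10 networks, a sketch assuming NetworkX is available; the edge-list path is a placeholder.

```python
import networkx as nx


def network_summary(edge_list_path):
    """Structural statistics of the kind summarized in Table 1 (illustrative)."""
    G = nx.read_edgelist(edge_list_path)
    n, m = G.number_of_nodes(), G.number_of_edges()
    avg_degree = 2 * m / n
    clustering = nx.average_clustering(G)
    # IR: ratio of connected node pairs to non-connected node pairs.
    non_edges = n * (n - 1) // 2 - m
    ir = m / non_edges
    return {"nodes": n, "edges": m, "avg_degree": avg_degree,
            "clustering": clustering, "imbalance_ratio": ir}
```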
4.2 Evaluation of the proposed LLSLP link prediction model
Because the proportions of existing links and non-existing links between network nodes are unbalanced, the final link prediction cannot be measured only by the accuracy of a single prediction. In order to evaluate the link prediction model established in the first three steps, 7 indexes, such as AUC and Recall, are used to test the performance of the model. AUC, Recall and Precision are common indicators for evaluating classification problems. For data with unbalanced sample classes, the model is additionally measured using the Confusion Matrix, the Precision-Recall Curve, F1-score and MCC (Matthews correlation coefficient). The Confusion Matrix allows the prediction results of the model to be observed intuitively and concretely. MCC is a correlation coefficient between -1 and +1 that is generally considered a balanced metric and can be used even when the class sizes differ widely. The Precision-Recall Curve and F1-score together reflect the relationship between Precision and Recall. Therefore, these 4 additional indexes are considered herein when evaluating the LLSLP method.
4.3 evaluation index
7 indicators were used in the experiment to evaluate the performance of the LLSLP link prediction algorithm. They are defined as follows:
1) AUC (area Under the receiver operating characteristic curve) is a metric that takes into account the overall ranking results. Generally, the AUC of a link prediction is defined as:
AUC = (n1 + 0.5 × n2) / n

where n is the number of independent comparisons, n1 is the number of times the missing link scores higher than the non-existent link, and n2 is the number of times the two scores are equal.
2) In the field of machine learning, and in particular statistical classification problems, a confusion matrix (also known as an error matrix) is a specific table layout that can visualize the performance of an algorithm, typically an algorithm with supervised learning (in unsupervised learning, commonly referred to as a matching matrix). Each row of the matrix represents an instance in the prediction class and each column represents an instance in the actual class. Taking the second classification as an example:
the predictive classification model corresponds to a confusion matrix, in which the larger the values of TP (True Positive) and TN (True Negative) and the smaller the values of FP (False Positive) and FN (False Negative), the better the model performs. On the basis of these statistics, the confusion matrix can be extended to the 5 indexes in FIG. 3:
3) precision ratio (Precision) — indicates how many of the samples predicted to be positive are true positive samples.
Precision = TP / (TP + FP)
4) Recall: indicates how many of the actual positive samples are predicted correctly.

Recall = TP / (TP + FN)
5) F1-score-results combining Precision and Recall outputs. The value of F1-score ranges from 0 to 1, with 1 representing the best output of the model and 0 representing the worst output of the model.
F1-score = (2 × Precision × Recall) / (Precision + Recall)
6) Matthews Correlation Coefficient (MCC): an evaluation index that takes the numbers of all sample types into account and can be used under both class balance and class imbalance.

MCC = (TP × TN - FP × FN) / sqrt((TP + FP)(TP + FN)(TN + FP)(TN + FN))
7) Precision-Recall Curve: Precision is plotted on the y-axis and Recall on the x-axis. Precision and Recall influence each other; ideally both are high, and the higher both are, the more effective the model. In general, however, when Precision is high the Recall tends to be low, and when Recall is high the Precision tends to be low.
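A compact sketch showing how the 7 evaluation indexes above could be computed from a model's predicted probabilities, assuming scikit-learn; the function and variable names are placeholders. In the experiments these indexes would be computed per algorithm and per data set to fill Tables 5 to 9.

```python
import numpy as np
from sklearn.metrics import (confusion_matrix, f1_score, matthews_corrcoef,
                             precision_recall_curve, precision_score,
                             recall_score, roc_auc_score)


def evaluate(y_true, y_prob, threshold=0.5):
    y_prob = np.asarray(y_prob)
    y_pred = (y_prob >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    p_curve, r_curve, _ = precision_recall_curve(y_true, y_prob)
    return {
        "AUC": roc_auc_score(y_true, y_prob),          # ranking quality
        "Confusion": (tp, fp, fn, tn),                 # TP, FP, FN, TN counts
        "Precision": precision_score(y_true, y_pred),  # TP / (TP + FP)
        "Recall": recall_score(y_true, y_pred),        # TP / (TP + FN)
        "F1": f1_score(y_true, y_pred),                # harmonic mean of P and R
        "MCC": matthews_corrcoef(y_true, y_pred),      # usable under class imbalance
        "PR-curve": (p_curve, r_curve),                # points for the PR curve
    }
```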
4.4 Reference algorithms
In this part, 15 similarity indexes based on the network topology are selected for comparison with LLSLP; the similarity indexes used are based on local information, on global information and on random walks, respectively. Tables 2 to 4 list the similarity indexes based on local node information, on global node information and on random walks, respectively.
TABLE 2 similarity index based on node local information
TABLE 3 Global information based similarity index
TABLE 4 similarity index based on random walks
In Tables 2 to 4, Γ(x) denotes the neighbor node set of node x, Γ(y) denotes the neighbor node set of node y, Γ(x) ∩ Γ(y) denotes the set of common neighbor nodes of x and y, and k(x) = |Γ(x)| denotes the degree of node x.
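For illustration, a sketch of three representative local-information similarity indexes (CN, Jaccard and AA) written with the Γ(x) and k(x) notation above, assuming NetworkX; it covers only a subset of the 15 indexes used in the experiments.

```python
import math

import networkx as nx


def local_similarity(G, x, y):
    """CN, Jaccard and AA scores for the node pair (x, y); Γ(·) is the neighbour set."""
    gx, gy = set(G.neighbors(x)), set(G.neighbors(y))
    common = gx & gy                        # Γ(x) ∩ Γ(y)
    cn = len(common)                        # Common Neighbours
    union = gx | gy
    jaccard = cn / len(union) if union else 0.0
    aa = sum(1.0 / math.log(G.degree(z)) for z in common if G.degree(z) > 1)
    return {"CN": cn, "Jaccard": jaccard, "AA": aa}
```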
4.5 Experimental results and analysis
The experimental results herein are all the performance of the models on the test set. Each column of a table corresponds to one of the data sets, and each row corresponds to the proposed LLSLP or to one of the remaining 17 comparison algorithms and models. Owing to space constraints, AUC, Precision, Recall, F1-score and MCC were chosen as the main evaluation indicators, while the remaining evaluation indicators of the proposed LLSLP on the experimental data sets are shown in the appendix. The values of AUC, Precision, Recall and F1-score all lie in [0, 1], and a higher value indicates a better effect of the model on the data set. MCC lies in [-1, 1]: a value of 1 indicates that the model's predictions are completely consistent with the actual results on the data set, a value of 0 indicates that the predictions are equivalent to random predictions, and a value of -1 indicates that the predictions are completely inconsistent with the actual results. The AUC, Precision, Recall, F1-score and MCC results on the data sets under the different algorithms are shown in Tables 5 to 9.
TABLE 5 AUC values with algorithms
TABLE 6 Precision values with algorithms
TABLE 7 Recall values with algorithms
Table 8.F1-score values with algorithms
TABLE 9 MCC values with algorithms
Next, the evaluation indexes of the proposed LLSLP and the comparison algorithms are analyzed on each data set. As can be seen from Table 5, the AUC value of LLSLP ranks in the top 2 on every data set listed here, and it always performs better than the traditional algorithms. The AUC value on the UPG network is the highest, 0.9998, which is superior to traditional comparison algorithms and models such as CN, Sal and LightGBM; the base model LR also performs well and is close to the proposed LLSLP method. Among the traditional algorithms the best performer is Cos, with an AUC of 0.77197, so the AUC of the proposed LLSLP method is 29.536% higher than that of Cos. The base model LightGBM performs relatively poorly, with an AUC of 0.63459, much smaller than the proposed LLSLP and the other base model LR; evidently the prediction of the base model LightGBM on the UPG data set is not very accurate. This does not mean that LightGBM is a bad model: on the NSC data set, for example, the prediction performance of the base model LightGBM is better than that of every algorithm and model other than the proposed LLSLP. It can be seen that the base models LR and LightGBM differ in performance on different data sets, whereas the proposed LLSLP still performs well across the various data sets, indicating better stability.
Next, the Precision, Recall and F1-score of each algorithm and model on the experimental data sets are analyzed from Tables 6, 7 and 8. LLSLP is superior to the comparison algorithms and models in overall performance, and its Precision value on most data sets is higher than those of the traditional algorithms and models. It should be noted, however, that it does not perform best on every experimental data set: the Precision value on the NSC data set is lower than those of the traditional CN and AA algorithms (largely because the base model LR behaves too poorly there); the Precision values on the EML and FBK data sets are lower than that of the base model LR; the Precision value on the SMG data set is lower than that of CN; and the performance on the KHN data set is inferior to the traditional algorithm CN and the base model LightGBM. The Recall values of the proposed LLSLP on the experimental data sets are high overall but still have certain shortcomings. As can be seen from Table 7, the overall performance of the proposed LLSLP on each data set is better than that of the traditional algorithms CN, AA, LP, RWR, etc.: the Recall values of the traditional algorithms average about 50%, whereas the lowest Recall value of LLSLP exceeds 90%. It is also observed that the Recall value of LLSLP is close to that of the base models, and it must be acknowledged that the high Recall of LLSLP benefits to some extent from the 2 base models (Logistic Regression and LightGBM). In general, Precision and Recall are negatively correlated: the larger the Precision value, the smaller the Recall value, and vice versa. For different data sets and different practical requirements, an algorithm or model cannot be judged by the Precision value or the Recall value alone; F1-score combines the two to measure performance comprehensively. As shown in Table 8, the F1-score of LLSLP ranks in the top 2 on the experimental data sets, which demonstrates the overall good performance of the proposed LLSLP. It is also noted that in binary classification a classifier may have a high F1-score but a low MCC, which shows that a single index cannot fully measure all the advantages and disadvantages of a classifier.
Finally, Table 9 is analyzed. The problem of class imbalance in a data set cannot be completely solved in machine learning: since a machine learning model learns from data, imbalanced data causes the trained model to develop a preference for the majority classes, making the classes with little data difficult to identify. For evaluating model performance on extremely unbalanced social network data sets, MCC shows good discrimination: the overall performance of the traditional similarity algorithms is relatively poor when the classes are unbalanced, whereas the LLSLP proposed herein obtains relatively good performance even under extreme class imbalance, which demonstrates that the proposed LLSLP is effective.
From fig. 4 to 6, a series of results and conclusions can be observed.
(1) As can be seen from FIG. 4: the ROC curve of the proposed LLSLP is comparable to those of most algorithms, and only a few algorithms, PA, Katz, ACT and LightGBM, can be clearly distinguished (from Table 5 it can be observed that on the FBK data set the area under the ROC curve is 0.84832 for PA, 0.52412 for Katz, 0.83605 for ACT and 0.80372 for LightGBM, while it is above 0.90000 for the remaining algorithms and models). Relying on the ROC curve alone is not enough to judge the degree of difference between LLSLP and the traditional algorithms and models, so performance evaluation needs to be carried out with other indexes as well.
(2) From FIG. 5 it can be seen that the PR graph clearly shows the proposed LLSLP to be better than most of the traditional algorithms such as CN, Sal, RA, Cos and RWR, although the difference from the base models LR and LightGBM is not large; the PR graph has better discrimination, but it is of limited use for distinguishing the base models. It is worth mentioning that the traditional algorithms HPI, PA, LHN-I, ACT, Cos and RWR have high AUC values (from Table 5 it can be observed that HPI is 0.98592, LHN-I is 0.96262, Cos is 0.97020 and RWR is 0.99126), but perform poorly on the PR curve, because the Recall value is low when the Precision value is high, or the Precision value is low when the Recall value is high (from Tables 6 and 7 it can be observed that on the FBK data set the Precision values of HPI, LHN-I, Katz, Cos and RWR are 0.11996, 0.03803, 0.00138, 0.00632 and 0.01471 respectively, while the Recall values are 0.74376, 0.80827, 0.60511, 0.97767 and 0.98744 respectively). This leads to low F1-score values (from Table 8, the F1-score values of HPI, LHN-I, ACT, Cos and RWR on the FBK data set are 0.20615, 0.07264, 0.00276, 0.01255 and 0.02889 respectively). Note that the PR profiles of PA and ACT are also poor because of their low Precision and Recall values (the Precision values of PA and ACT are 0.09422 and 0.00022 respectively, and the Recall values are 0.19520 and 0.01866 respectively), resulting in low F1-score values of 0.12709 and 0.00043 respectively. This shows that F1-score is indeed effective in combining Precision and Recall and evaluates the performance of algorithms and models more objectively. It further verifies that evaluating an algorithm or model by a single index alone is not comprehensive and cannot fully measure its performance, so measuring with multiple evaluation indexes is both necessary and comprehensive.
(3) To address the limitation of the PR graph, the confusion matrix diagrams of FIG. 6 show the specific classification results of each algorithm and model completely, and reflect the performance of an algorithm or model more concretely and comprehensively. Lighter colors in the figure represent larger numbers, and darker colors represent smaller numbers; the confusion matrix diagram is thus a good complement to the PR graph. It can be seen from FIG. 6 that the number of samples misclassified by the proposed LLSLP is lower than those of the traditional algorithms and the base models, i.e., it performs better. Defining the deviation percentage as the absolute difference between the two kinds of misclassification counts divided by the larger of the two, it can be observed that in the erroneous results of the base models LR and LightGBM an obvious bias towards one class appears: the former tends to predict samples as the positive class and the latter as the negative class (from FIG. 6, LR predicts class 0 as class 1 4455 times and class 1 as class 0 157 times, a deviation percentage of 96.476%, while LightGBM predicts class 0 as class 1 569 times and class 1 as class 0 3451 times, a deviation percentage of 83.512%). It can also be observed from FIG. 6 that the bias of the other traditional algorithms on the FBK data set is even more serious: CN predicts class 0 as class 1 1342 times and class 1 as class 0 14670 times, a deviation percentage of 90.852%; LHN-I predicts class 0 as class 1 359469 times and class 1 as class 0 3370 times, a deviation percentage of 99.063%; Cos and RWR deviate even more severely, with Cos predicting class 0 as class 1 2760377 times and class 1 as class 0 35 times (deviation percentage 99.999%), and RWR predicting class 0 as class 1 1174532 times and class 1 as class 0 45 times (deviation percentage 99.996%). The proposed LLSLP is more balanced and does not produce an excessive bias towards either class (it predicts class 0 as class 1 2103 times and class 1 as class 0 1500 times, a deviation percentage of only 28.673%), which again shows that the proposed LLSLP is more effective than the other traditional algorithms and models and combines the advantages of each base model well.
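A small sketch of the deviation-percentage calculation used above, checked against the LR misclassification counts on the FBK data set (4455 and 157); the function name is illustrative.

```python
def deviation_percentage(a, b):
    """Absolute difference between the two misclassification counts, divided by the larger one."""
    return 100.0 * abs(a - b) / max(a, b)


# LR on the FBK data set: class 0 predicted as class 1 4455 times,
# class 1 predicted as class 0 157 times.
print(round(deviation_percentage(4455, 157), 3))  # 96.476
```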
From the analysis on the experimental data sets, the performance of the LLSLP proposed herein is improved to some extent relative to the comparison algorithms and models used in the experiments, and as the data sets grow larger, the overall performance of the model also improves. It must be acknowledged, of course, that although the proposed LLSLP achieves good performance on every evaluation index, there is still some room for improvement; it is not difficult to see from the overall experimental results that the performance of LLSLP depends to some extent on the performance of the 2 base models (Logistic Regression and LightGBM), so the choice of base models is particularly important. Moreover, as the number of base models grows, the complexity of the model increases and the running time becomes longer. It should be noted that even though the performance of LLSLP is affected by its base models, with a certain number of base models the performance of LLSLP is not affected much when one of them performs poorly on a particular data set, which is an expression of its stability.
The social network link prediction method (LLSLP) proposed herein is an improvement over other existing algorithms. Traditional similarity-based social network link prediction algorithms focus on a single similarity index. In order to better integrate the existing similarity indexes and thereby further improve the stability and accuracy of link prediction, 15 similarity indexes are combined herein; these 15 similarity indexes describe different characteristics of the network. The proposed LLSLP method not only integrates the indexes but also exploits the Logistic Regression and LightGBM models: taking Logistic Regression and LightGBM as base models and the 15 indexes as the features to be learned, the hyper-parameters of the base models are determined by Cross-validation, Grid searching and Early stopping; then, using the Stacking idea of model integration with Logistic Regression as the Meta-Classifier, a new link prediction method is obtained. The method considers the complementarity of different similarity indexes and of different models, so its stability is stronger. To demonstrate its effectiveness and feasibility, 10 networks are finally taken as examples, and 7 indexes such as AUC and Precision are used to verify the rationality, effectiveness, reliability and stability of the LLSLP method relative to single indexes and models. The main contributions herein are the integration of the existing similarity indexes and the introduction of the Stacking idea into model integration for link prediction in social networks.
In the future, exploring other algorithms, more base models and different types of model integration strategies to design new link prediction methods will have important theoretical and practical significance for link prediction in social networks.
While embodiments of the invention have been shown and described, it will be understood by those of ordinary skill in the art that: various changes, modifications, substitutions and alterations can be made to the embodiments without departing from the principles and spirit of the invention, the scope of which is defined by the claims and their equivalents.

Claims (3)

1. A social network link abnormity prediction method based on stack generalization and cost sensitive learning is characterized by comprising the following steps:
s1, obtaining social network node data, and taking similarity indexes in the obtained social network node data as characteristics of basic model learning;
s2, determining the hyper-parameters of the basic model;
s3, relearning the prediction result of the base model; and obtaining a final prediction result.
2. The method for predicting link abnormality of social network based on stacked generalization and cost-sensitive learning according to claim 1, wherein the base model in step S1 comprises:
given a data set D = {(x1, y1), (x2, y2), (x3, y3), …, (xN, yN)}, wherein xi ∈ X ⊆ R^n and yi ∈ {0, 1}; when yi = 0, yi represents the negative class; when yi = 1, yi represents the positive class; i = 1, 2, 3, …, N; X ⊆ R^n represents the sample feature space, n represents the number of features of each sample, and N represents the number of samples in the data set D;
since the value of w^T x + b is continuous, where w represents a column vector of dimension (n, 1), T represents transposition, x represents a column vector of dimension (n, 1) and b represents a column vector of dimension (1, 1), it cannot fit a discrete variable directly, but it can be considered to fit the conditional probability P(Y = 1|x); however, for w ≠ 0 (there is no solution when w equals the zero vector), the value of w^T x + b ranges over the real numbers R and does not satisfy the requirement that a probability take values between 0 and 1, so a generalized linear model is considered;
since the unit step function is not continuous, the log-odds (logistic) function is a typical surrogate function:

y = 1 / (1 + e^(-z))

thus, substituting z = w^T x + b, there is:

y = 1 / (1 + e^(-(w^T x + b)))
if y is the probability that x is a positive example, then 1 - y is the probability that x is a negative example; the ratio of the two is called the odds, i.e., the ratio of the probability that an event occurs to the probability that it does not occur; if the probability that the event occurs is P, the log odds is:

ln(P / (1 - P))
regarding y as the class posterior probability estimate P(Y = 1|x), the formula is rewritten as:

ln( P(Y = 1|x) / (1 - P(Y = 1|x)) ) = w^T x + b

P(Y = 1|x) = e^(w^T x + b) / (1 + e^(w^T x + b))
that is, the log odds of the output Y = 1 is expressed as a linear function of the input x, which is the logistic regression model; the closer the value of w^T x + b is to positive infinity, the closer the probability P(Y = 1|x) is to 1; therefore, the idea of logistic regression is to first fit a decision boundary and then establish the probabilistic relationship between the boundary and the classification, thereby obtaining the probabilities in the binary classification case;

after the mathematical form of the logistic regression model is determined, the remaining problem is how to solve for the parameters of the model; in statistics, the maximum likelihood estimation method is often used, i.e., a set of parameters is found such that the likelihood of the data is maximized under those parameters; let:

P(Y = 1|x) = p(x)

P(Y = 0|x) = 1 - p(x)
p(xi) denotes the probability that the ith sample, with known feature xi, belongs to the positive class (Y = 1);

yi is the binary class label given in the data set D, yi ∈ {0, 1}, i = 1, 2, …, N;
the likelihood function is L(w) = ∏(i=1..N) [p(xi)]^yi · [1 - p(xi)]^(1-yi); for more convenient solution, taking logarithms on both sides gives the log-likelihood function:

ln L(w) = Σ(i=1..N) [ yi·ln p(xi) + (1 - yi)·ln(1 - p(xi)) ]
in machine learning there is the concept of a loss function, which measures the degree of the model's prediction error; taking the average log-likelihood loss over the entire data set, one obtains:

J(w) = -(1/N) · Σ(i=1..N) [ yi·ln p(xi) + (1 - yi)·ln(1 - p(xi)) ]
wherein N represents the number of samples in the data set D;
that is, in the logistic regression model, maximizing the likelihood function and minimizing the loss function are in fact equivalent;
there are many methods for solving logistic regression, and here, a gradient descent method is mainly used; the main objective of the optimization is to find a direction towards which the parameter moves so that the value of the loss function can be reduced, which direction is often found by various combinations of first order partial derivatives or second order partial derivatives; the loss function of the logistic regression is:
Figure FDA0002652022060000031
gradient descent finds the descent direction from the first-order derivative of J(w) with respect to w and updates the parameters in an iterative manner:

g_i^(k) = ∂J(w)/∂w_i evaluated at w = w^(k)

w_i^(k+1) = w_i^(k) - α · g_i^(k)

w_i^(k) represents the weight parameter after the kth iterative update of the ith weight parameter;

α represents the learning rate, i.e., the step size of one iterative parameter update;

w_i^(k+1) represents the weight parameter after the (k+1)th iterative update of the ith weight parameter;

w_i represents the ith weight parameter.
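A minimal sketch of the gradient-descent fitting procedure described in this claim, written with NumPy; the learning rate, iteration count and the absorption of b into w are illustrative choices rather than values fixed by the claim.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def fit_logistic_regression(X, y, alpha=0.1, iterations=1000):
    """Minimize the average log-likelihood loss J(w) by gradient descent."""
    N, n = X.shape
    Xb = np.hstack([X, np.ones((N, 1))])  # a column of 1s absorbs b into w
    w = np.zeros(n + 1)
    for _ in range(iterations):
        p = sigmoid(Xb @ w)               # p(x_i) = P(Y = 1 | x_i)
        grad = Xb.T @ (p - y) / N         # dJ(w)/dw for the loss defined above
        w = w - alpha * grad              # w^(k+1) = w^(k) - alpha * gradient
    return w


def predict_proba(X, w):
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    return sigmoid(Xb @ w)
```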
3. The social network link abnormality prediction method based on stacked generalization and cost-sensitive learning of claim 1, wherein in step S2, the method for determining the hyperparameter in the base model comprises one or any combination of cross validation, grid search and early-stop method.
CN202010873960.4A 2020-08-26 2020-08-26 Social network link abnormity prediction method based on stack generalization and cost sensitive learning Active CN112039700B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010873960.4A CN112039700B (en) 2020-08-26 2020-08-26 Social network link abnormity prediction method based on stack generalization and cost sensitive learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010873960.4A CN112039700B (en) 2020-08-26 2020-08-26 Social network link abnormity prediction method based on stack generalization and cost sensitive learning

Publications (2)

Publication Number Publication Date
CN112039700A true CN112039700A (en) 2020-12-04
CN112039700B CN112039700B (en) 2021-11-23

Family

ID=73580093

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010873960.4A Active CN112039700B (en) 2020-08-26 2020-08-26 Social network link abnormity prediction method based on stack generalization and cost sensitive learning

Country Status (1)

Country Link
CN (1) CN112039700B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20230300032A1 (en) * 2022-03-18 2023-09-21 The Mitre Corporation Systems and methods for behavioral link prediction for network access microsegmentation policy
CN116798233A (en) * 2023-08-25 2023-09-22 湖南天宇汽车制造有限公司 Ambulance rapid passing guiding system
CN117649153A (en) * 2024-01-29 2024-03-05 南京典格通信科技有限公司 Mobile communication network user experience quality prediction method based on information integration

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107451703A (en) * 2017-08-31 2017-12-08 杭州师范大学 A kind of social networks multitask Forecasting Methodology based on factor graph model
CN109245952A (en) * 2018-11-16 2019-01-18 大连理工大学 A kind of disappearance link prediction method based on MPA model
US20190132224A1 (en) * 2017-10-26 2019-05-02 Accenture Global Solutions Limited Systems and methods for identifying and mitigating outlier network activity
CN111275113A (en) * 2020-01-20 2020-06-12 西安理工大学 Skew time series abnormity detection method based on cost sensitive hybrid network

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107451703A (en) * 2017-08-31 2017-12-08 杭州师范大学 A kind of social networks multitask Forecasting Methodology based on factor graph model
US20190132224A1 (en) * 2017-10-26 2019-05-02 Accenture Global Solutions Limited Systems and methods for identifying and mitigating outlier network activity
CN109245952A (en) * 2018-11-16 2019-01-18 大连理工大学 A kind of disappearance link prediction method based on MPA model
CN111275113A (en) * 2020-01-20 2020-06-12 西安理工大学 Skew time series abnormity detection method based on cost sensitive hybrid network

Non-Patent Citations (8)

* Cited by examiner, † Cited by third party
Title
Wu Jiehua et al.: Link classification in multiplex social networks based on transfer component analysis, Data Analysis and Knowledge Discovery *
Liu Si et al.: Link prediction algorithm based on network representation learning and random walk, Journal of Computer Applications *
Liu Wei et al.: Link prediction in complex networks, Information and Control *
Sun Wei: Application of a cost-sensitive improved AdaBoost algorithm to imbalanced data, China Master's Theses Full-text Database, Information Science and Technology *
Sun Cheng et al.: Neural-network-based link prediction method in social networks, Mathematical Modeling and Its Applications *
Li Kuanyang: Link learning and prediction in complex networks based on similarity indexes, China Master's Theses Full-text Database, Information Science and Technology *
Xie Yixi et al.: A link prediction index fusion method based on an improved Logistic model, Journal of Information Engineering University *
Huang Xianying et al.: Research on the propagation rate model of emergencies in social networks, Journal of University of Electronic Science and Technology of China *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20230300032A1 (en) * 2022-03-18 2023-09-21 The Mitre Corporation Systems and methods for behavioral link prediction for network access microsegmentation policy
CN116798233A (en) * 2023-08-25 2023-09-22 湖南天宇汽车制造有限公司 Ambulance rapid passing guiding system
CN116798233B (en) * 2023-08-25 2024-01-09 湖南天宇汽车制造有限公司 Ambulance rapid passing guiding system
CN117649153A (en) * 2024-01-29 2024-03-05 南京典格通信科技有限公司 Mobile communication network user experience quality prediction method based on information integration
CN117649153B (en) * 2024-01-29 2024-04-16 南京典格通信科技有限公司 Mobile communication network user experience quality prediction method based on information integration

Also Published As

Publication number Publication date
CN112039700B (en) 2021-11-23

Similar Documents

Publication Publication Date Title
Jean et al. Semi-supervised deep kernel learning: Regression with unlabeled data by minimizing predictive variance
CN112039700B (en) Social network link abnormity prediction method based on stack generalization and cost sensitive learning
CN112073227B (en) Social network link abnormity detection method by utilizing cascading generalization and cost sensitive learning
Armina et al. A review on missing value estimation using imputation algorithm
Nagi et al. Classification of microarray cancer data using ensemble approach
Phyu Survey of classification techniques in data mining
Uncu et al. A novel feature selection approach: combining feature wrappers and filters
Peng et al. User preferences based software defect detection algorithms selection using MCDM
Zhang et al. Incorporating implicit link preference into overlapping community detection
Serafino et al. Ensemble learning for multi-type classification in heterogeneous networks
CN112073298B (en) Social network link abnormity prediction system integrating stacked generalization and cost sensitive learning
Zhao et al. Deep bayesian unsupervised lifelong learning
Dubey et al. Data mining based handling missing data
Za’in et al. Evolving large-scale data stream analytics based on scalable PANFIS
Dutta et al. Clustering by multi objective genetic algorithm
Probst Hyperparameters, tuning and meta-learning for random forest and other machine learning algorithms
Zhou et al. Online recommendation based on incremental-input self-organizing map
Ganji et al. Parallel fuzzy rule learning using an ACO-based algorithm for medical data mining
Sainin et al. A direct ensemble classifier for imbalanced multiclass learning
Berral-García When and how to apply Statistics, Machine Learning and Deep Learning techniques
Carmona et al. An analysis on the use of pre-processing methods in evolutionary fuzzy systems for subgroup discovery
Vukićević et al. Finding best algorithmic components for clustering microarray data
Marco et al. An Improving Long Short Term Memory-Grid Search Based Deep Learning Neural Network for Software Effort Estimation.
Zhang et al. Learning causal fuzzy logic rules by leveraging markov blankets
Fakhraei et al. Adaptive neighborhood graph construction for inference in multi-relational networks

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant