CN112131569B

CN112131569B - Risk user prediction method based on graph network random walk

Info

Publication number: CN112131569B
Application number: CN202010966200.8A
Authority: CN
Inventors: 易钰奇; 程帆; 张冬梅
Original assignee: Shanghai Jiaotong University
Current assignee: Shanghai Jiaotong University
Priority date: 2020-09-15
Filing date: 2020-09-15
Publication date: 2024-01-05
Anticipated expiration: 2040-09-15
Also published as: CN112131569A

Abstract

The invention relates to a risk user prediction method based on graph network random walk, comprising the following steps: 1) acquiring data in graph-network form as an original data set; 2) preprocessing the original data set and constructing a graph network; 3) applying a random-walk-based clustering algorithm to the preprocessed data to obtain the probability corresponding to each node, which serves as the user's risk score; 4) integrating the user node probabilities obtained by the clustering algorithm and outputting the final risk user prediction result. Compared with the prior art, the invention offers better scalability, requires no feature engineering, and achieves good prediction performance.

Description

Risk user prediction method based on graph network random walk

Technical Field

The invention relates to the technical field of data mining, in particular to a risk user prediction method based on graph network random walk.

Background

As information technology advances, data volumes grow and the networks formed by interactions among data become more complex, which poses great challenges for data mining on graph networks. Predicting risk users often requires a large amount of complex data screening and mining. Some companies employ professionals to perform this analysis, but doing so incurs extremely high labor costs.

Although some algorithm models on existing single-machine platforms achieve effective results, they scale poorly and have limited capacity for processing massive data. A risk user prediction method based on graph network random walk is therefore needed, one that does not require professionals to analyze all data item by item and that can handle massive data through horizontal scaling of the system.

Disclosure of Invention

The invention aims to overcome the defects of the prior art and provide a risk user prediction method based on graph network random walk.

The aim of the invention can be achieved by the following technical scheme:

A risk user prediction method based on graph network random walk comprises the following steps:

1) Acquiring data in graph-network form as an original data set;

2) Preprocessing the original data set and constructing a graph network;

3) Applying a random-walk-based clustering algorithm to the preprocessed data to obtain the probability corresponding to each node, which serves as the risk score of the user;

4) Integrating the user node probabilities obtained by the clustering algorithm, and outputting the final risk user prediction result.

The data sets in graph-network form comprise public competition data sets, university public data sets, and enterprise public data sets. The public competition data sets comprise public data sets from the Kaggle and KDD competition websites, the university public data sets are public data sets from Stanford University's open-source data set website, and the enterprise public data sets comprise public data sets from Microsoft and Yahoo.

The step 2) specifically comprises the following steps:

21) Acquiring feature data from the original data, and filtering noise data, namely edge data with low weight;

22) Supplementing possibly missing data by means of a relation prediction model;

23) Uniformly coding the node numbers in the graph network;

24) Normalizing the weights of the edges in the graph network.
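Steps 21), 23) and 24) can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function name `preprocess_edges`, the cutoff `min_weight`, and the choice to normalize each node's outgoing weights so that they sum to 1 are all hypothetical, since the text does not fix a cutoff or a normalization scheme. Step 22) is left out here.

```python
from collections import defaultdict


def preprocess_edges(edges, min_weight=0.1):
    """edges: list of (src, dst, weight) tuples.

    Returns (index, normalized_edges). `min_weight` is a hypothetical cutoff.
    """
    # Step 21: drop noisy edges whose weight is below the cutoff.
    kept = [(u, v, w) for u, v, w in edges if w >= min_weight]

    # Step 23: map arbitrary node identifiers to consecutive integer codes.
    index = {}
    for u, v, _ in kept:
        for node in (u, v):
            index.setdefault(node, len(index))

    # Step 24 (assumed scheme): scale weights so that each node's
    # outgoing weights sum to 1.
    out_total = defaultdict(float)
    for u, _, w in kept:
        out_total[u] += w
    normalized = [(index[u], index[v], w / out_total[u]) for u, v, w in kept]
    return index, normalized
```

With this normalization, the weights leaving a node can be read directly as transition probabilities for the later random walk.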

In step 21), the feature data include the graph node features, the edge weights, the direction features of the data, and the risk nodes selected as the initial nodes of the subsequent random walk.

The graph node feature is the label data of a user and represents the user's risk performance score, which takes the value 0 or 1, corresponding to risk and no risk. The edges represent relationships among users, including call relationships, follow relationships, and social friend relationships, and the edge weights represent the closeness of the relationships among users.

In step 22), supplementing the possibly missing data specifically comprises:

for numerical data: supplementing missing values by linear interpolation using a linear model;

for categorical features: supplementing missing values with the most frequent feature value of that category.
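The two supplementation rules can be illustrated with a short sketch. The helper names `interpolate_linear` and `fill_with_mode` are hypothetical, missing values are assumed to be represented as `None`, and interpolating along the order of the sequence is an assumption, since the text does not say which variable the linear model is fitted against.

```python
from collections import Counter


def interpolate_linear(values):
    """Fill None gaps in a numeric sequence by linear interpolation
    between the nearest known neighbors on each side."""
    filled = list(values)
    known = [i for i, v in enumerate(filled) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (filled[right] - filled[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = filled[left] + step * (i - left)
    # Gaps before the first or after the last known value stay None here.
    return filled


def fill_with_mode(values):
    """Replace None in a categorical sequence with the most frequent
    observed category."""
    mode = Counter(v for v in values if v is not None).most_common(1)[0][0]
    return [mode if v is None else v for v in values]
```

For example, `interpolate_linear([1.0, None, 3.0])` fills the gap with the midpoint of its neighbors, and `fill_with_mode(["a", None, "b", "a"])` fills the gap with `"a"`.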

In the step 3), the rule of the random walk is specifically as follows:

all directed edges in the graph network are treated as undirected edges, and if several edges exist between two nodes, they are merged into one edge whose weight is the average of the weights of those edges;

a known risk user is selected as the seed node at the start of the random walk, the next node of the walk is selected with probability proportional to the edge weight, the random walk stops once the probability with which each node in the graph network appears in the random walk path has stabilized, and the visit probability of each node is taken as the risk score of the corresponding user;

at each step of the random walk, the walker moves with probability b from the current node to a randomly chosen neighbor of the current node, and with probability 1-b returns directly from the current node to the initial seed node, specifically:

$r^{t+1} = b \cdot r^{t} + (1-b) \cdot r^{0}$

where $r^{t}$ represents the probability that a node is accessed in the graph network at time $t$, and $r^{0}$ represents the probability that the initial seed node is accessed.

The probability b takes a value of 0.9.

During the random walk, to reduce the computational complexity, an access probability threshold δ = 0.0001 is set: when the random walk reaches a node whose access probability is not 0 but is less than δ, the walk returns to the initial seed node.
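The edge-merging rule and the weight-proportional choice of the next node can be sketched as below. The names `merge_undirected` and `next_node` and the nested-dictionary adjacency layout are illustrative assumptions. Self-loops are skipped in this sketch, which the text does not address.

```python
import random
from collections import defaultdict


def merge_undirected(edges):
    """Collapse directed and parallel edges into undirected edges whose
    weight is the mean of the merged weights."""
    buckets = defaultdict(list)
    for u, v, w in edges:
        buckets[frozenset((u, v))].append(w)  # direction is ignored
    adjacency = defaultdict(dict)
    for pair, weights in buckets.items():
        if len(pair) != 2:
            continue  # self-loop: skipped in this sketch
        u, v = tuple(pair)
        mean = sum(weights) / len(weights)
        adjacency[u][v] = mean
        adjacency[v][u] = mean
    return adjacency


def next_node(adjacency, current, rng=random):
    """Pick a neighbor of `current` with probability proportional to
    the weight of the connecting edge."""
    neighbors = list(adjacency[current])
    weights = [adjacency[current][n] for n in neighbors]
    return rng.choices(neighbors, weights=weights, k=1)[0]
```

Passing a `random.Random` instance as `rng` makes the choice reproducible, which is convenient when comparing runs.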

In the step 4), when the risk score of the user to be predicted exceeds the set risk score threshold, the user to be predicted is judged to be a risk user.
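The decision rule of step 4) amounts to a threshold test on the risk scores. The sketch below assumes the scores are held in a dictionary keyed by user identifier. The function name `flag_risk_users` and the example threshold are hypothetical, since the text does not give a threshold value.

```python
def flag_risk_users(scores, threshold):
    """Return the users whose risk score exceeds the threshold,
    sorted by identifier for a stable result."""
    return sorted(user for user, score in scores.items() if score > threshold)


# Illustrative scores only; these are not results from the patent.
example_scores = {"u1": 0.31, "u2": 0.05, "u3": 0.47}
flagged = flag_risk_users(example_scores, threshold=0.3)
```

The comparison is strict because the text says a user is judged a risk user when the score exceeds the threshold.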

Compared with the prior art, the invention has the following advantages:

1. The invention avoids the extremely high labor cost of manually predicting risk users.

2. The invention can process the graph network data, and can better use the relevance information among samples compared with the common characteristic processing method.

3. The invention has strong expandability and can well support the distributed computing system.

4. The method has wide application range and commercial significance, can process the public data set, and can be popularized to the processing of business data in enterprises.

Drawings

FIG. 1 is a flow chart of the preprocessing and training of the present invention.

FIG. 2 is a flow chart of the present invention.

Detailed Description

The invention will now be described in detail with reference to the drawings and specific examples.

Examples

The present invention is described in further detail below in order to explain its objects, technical solutions, and key points more clearly and thoroughly. It should be understood that the implementations described herein merely illustrate specific methods of the invention and do not limit it. Those skilled in the art can implement and extend the invention according to the principles set forth herein, and can apply it to similar scenarios by simply modifying the structured data set to be processed.

As shown in FIG. 1 and FIG. 2, the invention first preprocesses the original data and then applies a random walk algorithm to the preprocessed data, which yields a risk score for every node in the graph. When the model is used, the risk score of a given user is simply looked up by query. The invention therefore comprises three stages: a data preprocessing stage, a random walk stage, and a model usage stage, as follows:

1) Data preprocessing: a graph network data set is acquired as the original data set and the original data are preprocessed, as follows:

First, the required graph network data are collected. In these data, users are the nodes of the graph network, and relationships among users, such as social friend relationships and call relationships, are the edges. The labels are the users' risk performance, for example whether a loan default has occurred in a financial scenario. The node data are processed first. The feature data are mostly provided by the data set and generally consist of information such as the user's gender, age, and historical behavior. Noise in the feature data is filtered by fitting the corresponding feature data with a simple linear regression or decision tree model and then removing the data points that lie far from the fitted curve.

Linear regression model formula: $f(x) = w^{T}x + b$

The error function used for training is:

where $x_i$ and $y_i$ are the feature attribute values of a sample and the label of that sample, and $w$ and $b$ are the model parameters of the linear model.
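The fit-then-remove filtering described above can be sketched for a single feature. This sketch assumes the ordinary least-squares error, the sum over samples of $(f(x_i) - y_i)^2$, which is a common choice but is an assumption here. The names `fit_line`, `drop_outliers`, and the residual cutoff `max_residual` are likewise hypothetical.

```python
def fit_line(xs, ys):
    """Closed-form least-squares fit of y = w * x + b for one feature.

    Assumes at least two distinct x values.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    w = sxy / sxx
    b = mean_y - w * mean_x
    return w, b


def drop_outliers(xs, ys, max_residual):
    """Keep only the samples whose distance from the fitted line is at
    most `max_residual` (a hypothetical cutoff)."""
    w, b = fit_line(xs, ys)
    return [(x, y) for x, y in zip(xs, ys) if abs(w * x + b - y) <= max_residual]
```

A single pass is shown. Because a strong outlier also pulls the fitted line toward itself, refitting after removal may be worthwhile in practice.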

For edge data, an edge represents a relationship between users, and its weight represents how close that relationship is. In the noise filtering step, edges with too low a weight are filtered out. For edges whose weight is missing, an edge weight prediction model is applied to supplement the data set; specifically, the graph network edge weight prediction model proposed by Srijan Kumar et al. (2016) in "Edge Weight Prediction in Weighted Signed Networks" can be used. The value predicted by the model is used as the weight of the corresponding edge, and edges whose predicted weight is too low are filtered out.

2) Random walk phase:

In the random walk stage, the walk rule is defined first. In the experimental system, the customized walk rule selects the next node of the walk with probability proportional to the edge weight. All directed edges in the system are treated as undirected edges, and when several edges exist between two nodes they are merged into one edge whose weight is the average of the weights of those edges. Furthermore, at each step the walker moves with probability 0.9 from its current node to a randomly chosen neighbor of that node, and with probability 0.1 returns directly from the current node to the initial seed node. Specifically, the iterative formula is as follows:

$r^{t+1} = 0.9 \cdot r^{t} + 0.1 \cdot r^{0}$

where $r^{t}$ represents the probability of node access in the graph at time $t$, and $r^{0}$ represents the probability that the initial node is accessed, i.e. the seed node has a value of 1 and all other nodes have a value of 0.
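The iteration can also be evaluated deterministically, one update per step, instead of by sampling paths. In the sketch below, the move-to-a-neighbor step is applied by spreading each node's probability over its neighbors in proportion to edge weight. That spreading step is taken from the verbal rule and is an assumption, since the formula as printed does not show a transition term. The function name `rwr_scores`, the tolerance, and the handling of isolated nodes are also hypothetical.

```python
def rwr_scores(adjacency, seed, b=0.9, tol=1e-10, max_iter=10000):
    """Iterate the restart update until the scores stop changing.

    adjacency: {node: {neighbor: weight}}; every neighbor must also
    appear as a key.
    """
    nodes = list(adjacency)
    r0 = {n: 1.0 if n == seed else 0.0 for n in nodes}
    r = dict(r0)
    for _ in range(max_iter):
        nxt = {n: (1 - b) * r0[n] for n in nodes}  # restart share
        for u in nodes:
            total = sum(adjacency[u].values())
            if total == 0:
                nxt[seed] += b * r[u]  # isolated node: assumed to restart
                continue
            for v, w in adjacency[u].items():
                nxt[v] += b * r[u] * w / total  # weight-proportional move
        if max(abs(nxt[n] - r[n]) for n in nodes) < tol:
            return nxt
        r = nxt
    return r
```

Stopping when the largest per-node change falls below `tol` is one way to make "the probabilities have stabilized" concrete.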

In addition, to reduce the computational complexity of the system, a threshold δ = 0.0001 is set: when the random walk reaches a node whose access probability is not 0 but is less than δ, the walk returns to the seed node. The above constitutes the walk rules of the random walk stage.

Once the rules are defined, random walk paths are simulated repeatedly according to them, starting from the pre-selected seed nodes, which in practice are the nodes carrying risk labels. When the probability with which each node of the graph appears in the random walk paths has stabilized, the simulation stops and that probability is taken as the risk score of the corresponding node. A higher risk score means the user is more likely to be a new risk user.
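The path simulation can be sketched as a Monte Carlo walk that counts how often each node is visited. This is a simplified illustration under stated assumptions: it runs for a fixed, hypothetical number of steps instead of testing for stabilization, it omits the δ rule, and the name `simulate_walk` and the fixed random seed are illustrative.

```python
import random


def simulate_walk(adjacency, seed, b=0.9, steps=200_000, rng_seed=0):
    """Estimate visit probabilities by simulating one long walk.

    adjacency: {node: {neighbor: weight}}; every neighbor must also
    appear as a key.
    """
    rng = random.Random(rng_seed)
    visits = {n: 0 for n in adjacency}
    current = seed
    for _ in range(steps):
        visits[current] += 1
        neighbors = list(adjacency[current])
        if neighbors and rng.random() < b:
            # Move to a neighbor, chosen in proportion to edge weight.
            weights = [adjacency[current][n] for n in neighbors]
            current = rng.choices(neighbors, weights=weights, k=1)[0]
        else:
            # Restart at the seed (also used when a node has no neighbors).
            current = seed
    return {n: count / steps for n, count in visits.items()}
```

The visit frequencies returned here play the role of the risk scores. Running the walk longer makes the estimates vary less from run to run.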

3) Model usage stage:

The data to be queried are input into the system, which matches them to the corresponding node in the graph network and outputs that node's risk score as the result for the queried sample.

The invention uses a random-walk-based graph network data mining algorithm. It overcomes the inability of traditional structured-data processing methods to handle graph network features, mitigates the poor performance of neural networks on categorical feature data, reduces labor costs, and helps process networked data more effectively.

Those skilled in the art will readily understand that the above process is only one specific example of the present invention, and in actual industrial production, those skilled in the art may modify and improve some details according to the above description and the actual data set, so that the specific operation is more suitable for the actual application scenario.

Claims

1. The risk user prediction method based on the graph network random walk is characterized by comprising the following steps of:

1) Acquiring networking data containing a graph as an original data set;

2) Preprocessing an original data set and constructing a graph network;

4) Integrating the user node probability obtained by the clustering algorithm, and outputting a final risk user prediction result;

in the step 3), the rule of the random walk is specifically as follows:

the edges represent the relationship among users, including conversation relationship, attention relationship and social friend relationship, and the weight of the edges represents the tightness of the relationship among users;

$r^{t+1} = b \cdot r^{t} + (1-b) \cdot r^{0}$

2. The risk user prediction method based on graph network random walk according to claim 1, wherein the data sets in graph-network form comprise public competition data sets, university public data sets, and enterprise public data sets, the public competition data sets comprise public data sets from the Kaggle and KDD competition websites, the university public data sets are public data sets from Stanford University's open-source data set website, and the enterprise public data sets comprise public data sets from Microsoft and Yahoo.

3. The risk user prediction method based on graph network random walk according to claim 1, wherein the step 2) specifically comprises the following steps:

23) Uniformly coding the node numbers in the graph network;

24) Normalizing the weights of the edges in the graph network.

4. A graph network random walk-based risk user prediction method according to claim 3, wherein in the step 21), the type of feature data includes graph node features of the data, weights of edges, direction features, and risk nodes selected as the initial nodes of the subsequent random walk.

5. The risk user prediction method based on graph network random walk according to claim 4, wherein the graph node feature is the label data of a user and represents the user's risk performance score, which takes the value 0 or 1, corresponding to risk and no risk.

6. A risk user prediction method based on graph network random walk according to claim 3, wherein in step 22), supplementing the data that may be missing specifically includes:

7. A risk user prediction method based on graph network random walk according to claim 1, wherein the probability has a value of 0.9.

8. A risk user prediction method based on graph network random walk according to claim 1, characterized in that in order to reduce the computational complexity during the random walk, an access probability threshold δ=0.0001 is set, when the random walk reaches a certain node, the access probability of which is not 0 and less than δ, then the random walk returns to the first seed node.

9. The method for predicting risk users based on graph network random walk according to claim 8, wherein in the step 4), when the risk score of the user to be predicted exceeds the set risk score threshold, the user to be predicted is determined to be a risk user.