CN112131569B - Risk user prediction method based on graph network random walk - Google Patents

Risk user prediction method based on graph network random walk

Info

Publication number
CN112131569B
Authority
CN
China
Prior art keywords
random walk
node
data
risk
graph network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010966200.8A
Other languages
Chinese (zh)
Other versions
CN112131569A (en)
Inventor
易钰奇
程帆
张冬梅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Jiaotong University
Original Assignee
Shanghai Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Jiaotong University
Priority to CN202010966200.8A
Publication of CN112131569A
Application granted
Publication of CN112131569B
Current legal status: Active
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55Detecting local intrusion or implementing counter-measures
    • G06F21/552Detecting local intrusion or implementing counter-measures involving long-term monitoring or reporting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2458Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2465Query processing support for facilitating data mining operations in structured databases
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2458Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2474Sequence data queries, e.g. querying versioned data
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Mathematical Physics (AREA)
  • Computer Security & Cryptography (AREA)
  • Fuzzy Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Hardware Design (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a risk user prediction method based on graph network random walk, comprising the following steps: 1) acquiring graph-structured network data as an original data set; 2) preprocessing the original data set and constructing a graph network; 3) applying a random-walk-based clustering algorithm to the preprocessed data to obtain the probability corresponding to each node, which serves as the risk score of the user; 4) integrating the user node probabilities obtained by the clustering algorithm and outputting the final risk user prediction result. Compared with the prior art, the invention offers better scalability, requires no feature engineering, and achieves good prediction results.

Description

Risk user prediction method based on graph network random walk
Technical Field
The invention relates to the technical field of data mining, in particular to a risk user prediction method based on graph network random walk.
Background
With the continuing progress of information technology, data volumes keep growing and the networks formed by interactions among data become increasingly complex. This poses great challenges for data mining on graph networks. Predicting risk users often requires a large amount of complex data screening and mining; some companies employ professionals to perform this analysis, but doing so incurs extremely high labor costs.
Although some existing algorithm models on single-machine platforms achieve effective results, they scale poorly and have limited capacity for processing massive data. A risk user prediction method based on graph network random walk is therefore needed that avoids these problems: professionals no longer need to analyze all data item by item, and the challenges of massive data can be addressed by scaling the system horizontally.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provide a risk user prediction method based on graph network random walk.
The aim of the invention can be achieved by the following technical scheme:
A risk user prediction method based on graph network random walk comprises the following steps:
1) Acquiring graph-structured network data as an original data set;
2) Preprocessing the original data set and constructing a graph network;
3) Applying a random-walk-based clustering algorithm to the preprocessed data to obtain the probability corresponding to each node, which serves as the risk score of the user;
4) Integrating the user node probabilities obtained by the clustering algorithm, and outputting the final risk user prediction result.
The graph-structured data sets include public competition data sets, university public data sets and enterprise public data sets. The public competition data sets include those published on the Kaggle and KDD competition websites, the university public data sets are those published on the open-source data set website of Stanford University, and the enterprise public data sets include those published by Microsoft and Yahoo.
Step 2) specifically comprises the following steps:
21) Acquiring feature data from the original data, and filtering out noise data, namely edge data with low weight;
22) Supplementing possibly missing data using a relation prediction model;
23) Uniformly encoding the node numbers in the graph network;
24) Normalizing the weights of the edges in the graph network.
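Steps 21), 23) and 24) can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the function name `preprocess_edges`, the cutoff `min_weight`, and the choice of dividing by the maximum weight as the normalization are all assumptions, and step 22) (filling in missing data) is omitted here.

```python
def preprocess_edges(raw_edges, min_weight=0.1):
    """raw_edges: iterable of (user_a, user_b, weight) tuples."""
    # Step 21: drop low-weight (noisy) edges; min_weight is an assumed cutoff.
    kept = [(a, b, w) for a, b, w in raw_edges if w >= min_weight]

    # Step 23: map arbitrary user identifiers to consecutive integer ids.
    node_id = {}
    for a, b, _ in kept:
        for user in (a, b):
            if user not in node_id:
                node_id[user] = len(node_id)

    # Step 24: normalize weights; dividing by the maximum is an assumed scheme.
    max_w = max((w for _, _, w in kept), default=1.0)
    edges = [(node_id[a], node_id[b], w / max_w) for a, b, w in kept]
    return node_id, edges


raw = [("alice", "bob", 4.0), ("bob", "carol", 2.0), ("carol", "dave", 0.05)]
node_id, edges = preprocess_edges(raw)
```

In this toy input the edge to "dave" falls below the cutoff, so "dave" never receives a node id.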
In step 21), the types of feature data include the graph node features, the edge weights, the direction features of the data, and the risk nodes selected as the initial nodes of the subsequent random walk.
The graph node features are the label data of users, representing each user's risk performance score, which takes the value 0 or 1 to distinguish at-risk users from risk-free users. The edges represent relationships among users, including call relationships, follow relationships and social friend relationships, and the edge weights represent the closeness of the relationships among users.
In step 22), supplementing possibly missing data specifically includes:
for numerical data: filling in missing values by linear interpolation using a linear model;
for categorical features: filling in missing values with the most frequently occurring value of that category.
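The two filling rules above might look like the following sketch. It assumes the numerical values form an ordered sequence in which `None` marks a gap (the patent does not say along which axis interpolation is done), and the helper names `fill_numeric` and `fill_categorical` are hypothetical.

```python
from collections import Counter


def fill_numeric(values):
    """Linearly interpolate None gaps; gaps at the ends copy the nearest known value."""
    known = [i for i, v in enumerate(values) if v is not None]
    if not known:
        return list(values)
    out = list(values)
    for i, v in enumerate(values):
        if v is not None:
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:
            out[i] = values[right]
        elif right is None:
            out[i] = values[left]
        else:
            frac = (i - left) / (right - left)
            out[i] = values[left] + frac * (values[right] - values[left])
    return out


def fill_categorical(values):
    """Replace None with the most frequent observed category."""
    known = [v for v in values if v is not None]
    if not known:
        return list(values)
    most_common = Counter(known).most_common(1)[0][0]
    return [most_common if v is None else v for v in values]
```

For example, `fill_numeric([1.0, None, 3.0, None])` fills the interior gap with the midpoint and the trailing gap with the last known value.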
In step 3), the rules of the random walk are specifically as follows:
all directed edges in the graph network are treated as undirected edges, and if multiple edges exist between two nodes, they are merged into a single edge whose weight is the average of the weights of those edges;
at the start of the random walk, a known risk user is selected as the seed node; the next node of the walk is selected with probability proportional to the edge weight; the random walk stops once the probability with which each node in the graph network appears in the random walk path has stabilized, and the visit probability of each node is taken as the risk score of the corresponding user;
at each step, the walker moves with probability b from the current node to a randomly chosen neighbor of the current node, and with probability 1-b returns directly from the current node to the initial seed node, specifically:
$r_{t+1} = b \cdot r_t + (1 - b) \cdot r_0$
where $r_t$ represents the probability that a node is visited in the graph network at time $t$, and $r_0$ represents the probability that the initial seed node is visited.
The probability b takes a value of 0.9.
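The walk rules above can be sketched as a power iteration. Note one assumption made loudly here: the formula as printed contains no transition matrix, so this sketch interprets the "move to a neighbor" step as multiplication by a weight-normalized transition matrix `P`, which is the usual form of random walk with restart. The function names, `tol`, and `max_iter` are illustrative choices, not values from the patent.

```python
import numpy as np


def merge_edges(edges, n):
    """Symmetric weight matrix; direction is ignored and parallel edges are averaged."""
    total = np.zeros((n, n))
    count = np.zeros((n, n))
    for u, v, w in edges:
        if u == v:
            continue  # self-loops are skipped in this sketch (assumption)
        a, b = min(u, v), max(u, v)
        total[a, b] += w
        count[a, b] += 1
    avg = np.divide(total, count, out=np.zeros_like(total), where=count > 0)
    return avg + avg.T


def random_walk_with_restart(weights, seeds, b=0.9, tol=1e-10, max_iter=1000):
    n = weights.shape[0]
    col_sums = weights.sum(axis=0)
    # Column j of P holds the probabilities of moving from node j to each neighbor,
    # proportional to edge weight.
    P = np.divide(weights, col_sums, out=np.zeros_like(weights), where=col_sums > 0)
    r0 = np.zeros(n)
    r0[list(seeds)] = 1.0 / len(seeds)
    r = r0.copy()
    for _ in range(max_iter):
        r_next = b * (P @ r) + (1 - b) * r0
        if np.abs(r_next - r).sum() < tol:  # visit probabilities have stabilized
            return r_next
        r = r
        r = r_next
    return r


W = merge_edges([(0, 1, 1.0), (1, 0, 3.0), (1, 2, 1.0), (2, 3, 1.0)], 4)
scores = random_walk_with_restart(W, seeds=[0])
```

On this four-node path graph with node 0 as the seed, the scores decrease from node 1 to node 3, so nodes closer to the known risk user receive higher risk scores. The seed itself need not have the single highest score.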
During the random walk, to reduce computational complexity, an access probability threshold δ=0.0001 is set: when the walk reaches a node whose access probability is non-zero but less than δ, the walk returns to the initial seed node.
In step 4), when the risk score of the user to be predicted exceeds the set risk score threshold, that user is judged to be a risk user.
Compared with the prior art, the invention has the following advantages:
1. The invention avoids the extremely high labor cost of manually predicting risk users.
2. The invention processes graph network data directly and makes better use of the relational information among samples than common feature-processing methods.
3. The invention is highly scalable and is well suited to distributed computing systems.
4. The method has a wide range of applications and commercial value: it can process public data sets and can be extended to the processing of business data within enterprises.
Drawings
FIG. 1 is a flow chart of the preprocessing and training of the present invention.
Fig. 2 is a flow chart of the present invention.
Detailed Description
The invention will now be described in detail with reference to the drawings and specific examples.
Examples
The invention is described in further detail below in order to explain its objects, technical solutions and key points more clearly and thoroughly. It should be understood that the implementations described herein merely illustrate specific methods of the invention and do not limit it. Those skilled in the art can implement the invention according to the principles set forth herein, and can extend it to similar application scenarios by simply modifying the structured data set to be processed.
As shown in Figs. 1 and 2, the invention first preprocesses the original data and then applies a random walk algorithm to the preprocessed data, which yields a risk score for every node in the graph. When the model is used, the risk score of the corresponding user is simply looked up by query. The invention therefore comprises three stages, namely the data preprocessing stage, the random walk stage and the model usage stage, as follows:
1) Data preprocessing stage: a graph network data set is acquired as the original data set and the original data is preprocessed, as follows:
First, the required graph network data is collected. In this data, users are the nodes of the graph network, and relationships among users, such as social friend relationships and call relationships, are the edges. The labels of the data are the users' risk performance, for example whether a user has defaulted on a loan in a financial scenario. The node data is processed first. The feature data is mostly provided by the data set and generally consists of information such as a user's gender, age and historical behavior. Noise in the feature data is filtered by fitting the corresponding feature data with a simple linear regression or decision tree model and then removing the data points that lie far from the fitted curve;
Linear regression model formula: $f(x) = w^T x + b$
The error function used for training is:
where $x_i$ and $y_i$ are the feature attribute values of sample $i$ and the label of that sample, and $w$ and $b$ are the model parameters of the linear model.
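The fit-then-discard noise filter described above could be sketched as follows. The error function is not reproduced in this text, so the sketch assumes an ordinary least-squares fit; the rule that "far from the fitted curve" means a residual larger than `k` times the root-mean-square residual, the value `k=2.0`, and the function name `noise_mask` are likewise assumptions for illustration.

```python
import numpy as np


def noise_mask(X, y, k=2.0):
    """Fit f(x) = w^T x + b by least squares; return True for samples to keep."""
    y = np.asarray(y, dtype=float)
    X = np.asarray(X, dtype=float).reshape(len(y), -1)
    A = np.hstack([X, np.ones((X.shape[0], 1))])  # last column carries the bias b
    params, *_ = np.linalg.lstsq(A, y, rcond=None)
    residuals = A @ params - y
    scale = np.sqrt(np.mean(residuals ** 2))
    # Small epsilon keeps every sample when the fit is numerically exact.
    return np.abs(residuals) <= k * scale + 1e-12


x = np.arange(10.0)
y = 2.0 * x + 1.0
y[5] = 100.0          # one injected outlier
keep = noise_mask(x, y)
```

Here the injected outlier at index 5 is flagged for removal while the nine points on the line are kept. A decision tree regressor could be substituted for the linear fit, as the description allows.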
The edge data is processed next. Edges represent relationships between users, and the weight of an edge represents how close that relationship is. In the noise filtering step, edges whose weights are too low are filtered out. For edges with missing weights, an edge weight prediction model is applied to supplement the data set; specifically, the graph network edge weight prediction model proposed by Srijan Kumar et al. in "Edge Weight Prediction in Weighted Signed Networks" (2016) can be used. Finally, the value predicted by the model is used as the weight of the corresponding edge, and edges whose predicted weights are too low are filtered out.
2) Random walk stage:
In the random walk stage, the walk rules are defined first. In the experimental system, the customized rule is to select the next node of the walk with probability proportional to the edge weight. All directed edges in the system are treated as undirected edges, and when multiple edges exist between two nodes they are merged into a single edge whose weight is the average of the weights of those edges. Furthermore, at each step the walker moves with probability 0.9 from its current node to a randomly chosen neighbor of that node, and with probability 0.1 returns directly from the current node to the initial seed node. Specifically, the iterative formula is as follows:
$r_{t+1} = 0.9 \cdot r_t + 0.1 \cdot r_0$
where $r_t$ represents the probability that a node in the graph is visited at time $t$, and $r_0$ represents the initial visit probabilities, i.e. the seed node has a value of 1 and all other nodes have a value of 0.
In addition, to reduce the computational complexity of the system, a threshold δ=0.0001 is set: when the walk reaches a node whose access probability is non-zero but less than δ, the walk returns to the seed node. These are the walk rules of the random walk stage.
Once the rules are defined, random walk paths are simulated repeatedly according to them, starting from the pre-selected seed nodes, which in practice are the nodes with risk labels. When the probability with which each node in the graph appears in the random walk paths has stabilized, the simulation stops and that probability is taken as the risk score of the corresponding node. A higher risk score means the user is more likely to be a new risk user.
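The path simulation described in this stage can be illustrated with a small Monte Carlo sketch. Several simplifications are assumed: a fixed step budget `steps` stands in for the "probabilities have stabilized" stopping test, the access probability of a node is estimated as its visit count divided by the number of steps so far, and the function name `simulate_walk` and the `rng_seed` parameter are hypothetical.

```python
import random


def simulate_walk(neighbors, seed, b=0.9, delta=0.0001, steps=20000, rng_seed=0):
    """neighbors: {node: [(neighbor, weight), ...]} for an undirected graph."""
    rng = random.Random(rng_seed)
    visits = {node: 0 for node in neighbors}
    current = seed
    for step in range(1, steps + 1):
        visits[current] += 1
        if rng.random() < b and neighbors[current]:
            nodes, weights = zip(*neighbors[current])
            # Next node chosen with probability proportional to edge weight.
            candidate = rng.choices(nodes, weights=weights, k=1)[0]
            freq = visits[candidate] / step
            # Delta rule: a non-zero but very small access probability sends
            # the walker back to the seed instead.
            current = seed if 0 < freq < delta else candidate
        else:
            current = seed  # restart with probability 1 - b
    return {node: count / steps for node, count in visits.items()}


graph = {
    0: [(1, 2.0)],
    1: [(0, 2.0), (2, 1.0)],
    2: [(1, 1.0), (3, 1.0)],
    3: [(2, 1.0)],
}
risk = simulate_walk(graph, seed=0)
```

The returned visit frequencies sum to 1 and play the role of the risk scores; a production system would more likely monitor the change in these frequencies between batches of steps to decide when to stop.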
3) Model usage stage:
The data to be queried is input into the system, the system matches it to the corresponding node in the graph network, and the risk score of that node, i.e. of the sample being queried, is returned as the output.
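The query stage amounts to a lookup followed by a threshold comparison, as in this minimal sketch. The function name `query_risk`, the data layout (a `node_id` mapping plus a list of scores), and the threshold value 0.05 are assumptions; the patent only states that a risk score threshold is set.

```python
def query_risk(scores, node_id, user, threshold=0.05):
    """Return (score, is_risk_user) for a user, or None if the user is not in the graph."""
    idx = node_id.get(user)
    if idx is None:
        return None
    score = scores[idx]
    return score, score > threshold


node_id = {"alice": 0, "bob": 1}
scores = [0.30, 0.01]
result = query_risk(scores, node_id, "alice")
```

Users absent from the graph have no precomputed score, so the sketch returns `None` for them rather than guessing.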
The invention uses a random-walk-based graph network data mining algorithm. It overcomes the inability of traditional structured data processing methods to handle graph network features, mitigates the poor performance of neural networks on categorical feature data, reduces labor costs, and helps process networked data more effectively.
Those skilled in the art will readily understand that the above process is only one specific example of the invention. In actual industrial production, those skilled in the art may modify and improve certain details according to the above description and the actual data set, so that the specific operation better suits the actual application scenario.

Claims (9)

1. A risk user prediction method based on graph network random walk, characterized by comprising the following steps:
1) Acquiring graph-structured network data as an original data set;
2) Preprocessing the original data set and constructing a graph network;
3) Applying a random-walk-based clustering algorithm to the preprocessed data to obtain the probability corresponding to each node, which serves as the risk score of the user;
4) Integrating the user node probabilities obtained by the clustering algorithm, and outputting the final risk user prediction result;
in step 3), the rules of the random walk are specifically as follows:
all directed edges in the graph network are treated as undirected edges, and if multiple edges exist between two nodes, they are merged into a single edge whose weight is the average of the weights of those edges;
the edges represent relationships among users, including call relationships, follow relationships and social friend relationships, and the edge weights represent the closeness of the relationships among users;
at the start of the random walk, a known risk user is selected as the seed node; the next node of the walk is selected with probability proportional to the edge weight; the random walk stops once the probability with which each node in the graph network appears in the random walk path has stabilized, and the visit probability of each node is taken as the risk score of the corresponding user;
at each step, the walker moves with probability b from the current node to a randomly chosen neighbor of the current node, and with probability 1-b returns directly from the current node to the initial seed node, specifically:
$r_{t+1} = b \cdot r_t + (1 - b) \cdot r_0$
where $r_t$ represents the probability that a node is visited in the graph network at time $t$, and $r_0$ represents the probability that the initial seed node is visited.
2. The risk user prediction method based on graph network random walk according to claim 1, wherein the graph-structured data sets include public competition data sets, university public data sets and enterprise public data sets, the public competition data sets include those published on the Kaggle and KDD competition websites, the university public data sets are those published on the open-source data set website of Stanford University, and the enterprise public data sets include those published by Microsoft and Yahoo.
3. The risk user prediction method based on graph network random walk according to claim 1, wherein step 2) specifically comprises the following steps:
21) Acquiring feature data from the original data, and filtering out noise data, namely edge data with low weight;
22) Supplementing possibly missing data using a relation prediction model;
23) Uniformly encoding the node numbers in the graph network;
24) Normalizing the weights of the edges in the graph network.
4. The risk user prediction method based on graph network random walk according to claim 3, wherein in step 21), the types of feature data include the graph node features of the data, the edge weights, the direction features, and the risk nodes selected as the initial nodes of the subsequent random walk.
5. The risk user prediction method based on graph network random walk according to claim 4, wherein the graph node features are the label data of the users and represent each user's risk performance score, which takes the value 0 or 1 to distinguish at-risk users from risk-free users.
6. The risk user prediction method based on graph network random walk according to claim 3, wherein in step 22), supplementing possibly missing data specifically includes:
for numerical data: filling in missing values by linear interpolation using a linear model;
for categorical features: filling in missing values with the most frequently occurring value of that category.
7. The risk user prediction method based on graph network random walk according to claim 1, wherein the probability b has a value of 0.9.
8. The risk user prediction method based on graph network random walk according to claim 1, characterized in that, to reduce computational complexity during the random walk, an access probability threshold δ=0.0001 is set, and when the random walk reaches a node whose access probability is non-zero but less than δ, the random walk returns to the initial seed node.
9. The risk user prediction method based on graph network random walk according to claim 8, wherein in step 4), when the risk score of the user to be predicted exceeds the set risk score threshold, that user is determined to be a risk user.
CN202010966200.8A 2020-09-15 2020-09-15 Risk user prediction method based on graph network random walk Active CN112131569B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010966200.8A CN112131569B (en) 2020-09-15 2020-09-15 Risk user prediction method based on graph network random walk

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010966200.8A CN112131569B (en) 2020-09-15 2020-09-15 Risk user prediction method based on graph network random walk

Publications (2)

Publication Number Publication Date
CN112131569A (en) 2020-12-25
CN112131569B (en) 2024-01-05

Family

ID=73846983

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010966200.8A Active CN112131569B (en) 2020-09-15 2020-09-15 Risk user prediction method based on graph network random walk

Country Status (1)

Country Link
CN (1) CN112131569B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117689450B (en) * 2024-01-29 2024-04-19 北京一起网科技股份有限公司 Digital marketing system based on big data

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109189936A (en) * 2018-08-13 2019-01-11 天津科技大学 A kind of label semanteme learning method measured based on network structure and semantic dependency
CN109951377A (en) * 2019-03-20 2019-06-28 西安电子科技大学 A kind of good friend's group technology, device, computer equipment and storage medium
CN110175299A (en) * 2019-05-28 2019-08-27 腾讯科技(上海)有限公司 A kind of method and server that recommendation information is determining
CN111008447A (en) * 2019-12-21 2020-04-14 杭州师范大学 Link prediction method based on graph embedding method

Also Published As

Publication number Publication date
CN112131569A (en) 2020-12-25

Similar Documents

Publication Publication Date Title
CN104008166A (en) Dialogue short text clustering method based on form and semantic similarity
WO2022166115A1 (en) Recommendation system with adaptive thresholds for neighborhood selection
CN112364242B (en) Graph convolution recommendation system for context awareness
CN111666468A (en) Method for searching personalized influence community in social network based on cluster attributes
CN112465226B (en) User behavior prediction method based on feature interaction and graph neural network
JP2023545940A (en) Graph data processing method, device, computer equipment and computer program
CN110569883A (en) Air quality index prediction method based on Kohonen network clustering and Relieff feature selection
CN111177578B (en) Search method for most influential community around user
CN112131569B (en) Risk user prediction method based on graph network random walk
binti Oseman et al. Data mining in churn analysis model for telecommunication industry
CN117272195A (en) Block chain abnormal node detection method and system based on graph convolution attention network
CN112487110A (en) Overlapped community evolution analysis method and system based on network structure and node content
CN112115359A (en) Recommendation system and method based on multi-order neighbor prediction
CN115905188A (en) Data quality improving method based on knowledge graph
CN116993374A (en) Model optimization method, device, equipment and medium based on deep neural network
CN114969511A (en) Content recommendation method, device and medium based on fragments
Annam et al. Entropy based informative content density approach for efficient web content extraction
CN111737461B (en) Text processing method and device, electronic equipment and computer readable storage medium
CN114519605A (en) Advertisement click fraud detection method, system, server and storage medium
CN113744023A (en) Dual-channel collaborative filtering recommendation method based on graph convolution network
CN114118094B (en) Semantic community discovery method based on nonnegative matrix factorization
CN114338442B (en) Network traffic identification method and system based on feature data and deep learning
CN113313417B (en) Method and device for classifying complaint risk signals based on decision tree model
CN117763462A (en) Interaction method, device, equipment and storage medium based on carbon reduction and consumption reduction
CN115617652A (en) Test case processing method and device, computing equipment and computer storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant