CN112131569B - Risk user prediction method based on graph network random walk - Google Patents
- Publication number
- CN112131569B
- Authority
- CN
- China
- Prior art keywords
- random walk
- node
- data
- risk
- graph network
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/50—Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
- G06F21/55—Detecting local intrusion or implementing counter-measures
- G06F21/552—Detecting local intrusion or implementing counter-measures involving long-term monitoring or reporting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2458—Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
- G06F16/2465—Query processing support for facilitating data mining operations in structured databases
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2458—Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
- G06F16/2474—Sequence data queries, e.g. querying versioned data
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
Abstract
The invention relates to a risk user prediction method based on graph network random walk, comprising the following steps: 1) acquiring data in graph-network form as the original data set; 2) preprocessing the original data set and constructing a graph network; 3) applying a random-walk-based clustering algorithm to the preprocessed data to obtain the probability corresponding to each node, which serves as the user's risk score; 4) integrating the user node probabilities obtained by the clustering algorithm and outputting the final risk user prediction result. Compared with the prior art, the invention offers better scalability, requires no feature engineering, and achieves good prediction results.
Description
Technical Field
The invention relates to the technical field of data mining, in particular to a risk user prediction method based on graph network random walk.
Background
With the continuing progress of information technology, data volumes keep growing and the networks formed by interactions among data become more complex. This poses great challenges for data mining on graph networks. Predicting risk users often requires a large amount of complex data screening and mining work. Some companies employ professionals to analyze the data, but this incurs extremely high labor costs.
Although some algorithm models on existing single-machine platforms achieve useful results, they scale poorly and have limited capacity for processing massive data. A risk user prediction method based on graph network random walk is therefore needed that avoids these problems: it does not require professionals to analyze the data item by item, and it handles massive data by scaling the system horizontally.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provide a risk user prediction method based on graph network random walk.
The aim of the invention can be achieved by the following technical scheme:
A risk user prediction method based on graph network random walk comprises the following steps:
1) acquiring data in graph-network form as the original data set;
2) preprocessing the original data set and constructing a graph network;
3) applying a random-walk-based clustering algorithm to the preprocessed data to obtain the probability corresponding to each node, which serves as the user's risk score;
4) integrating the user node probabilities obtained by the clustering algorithm and outputting the final risk user prediction result.
The data sets in graph-network form include public competition data sets, university public data sets and enterprise public data sets. The public competition data sets include those published on the Kaggle and KDD competition websites; the university public data sets are those on Stanford University's open-source data set website; and the enterprise public data sets include those published by Microsoft and Yahoo.
Step 2) specifically comprises the following steps:
21) acquiring feature data from the original data and filtering out noise data, namely edge data with low weight;
22) supplementing possibly missing data using a relation prediction model;
23) uniformly encoding the node numbers in the graph network;
24) normalizing the weights of the edges in the graph network.
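Steps 21), 23) and 24) can be sketched in a few lines. This is a minimal illustration rather than the patented implementation: the function name `preprocess_edges`, the cutoff `min_weight`, and the choice to normalize by the maximum weight are all assumptions, since the text does not fix a threshold or a normalization scheme. Step 22) is omitted because it depends on a separate prediction model.

```python
def preprocess_edges(edges, min_weight=0.1):
    """Filter, encode and normalize a list of (src, dst, weight) edges.

    Returns (encoded_edges, node_ids); node_ids maps original names to integers.
    """
    # Step 21: drop noisy edges, i.e. edges whose weight is below the cutoff.
    kept = [(u, v, w) for u, v, w in edges if w >= min_weight]
    if not kept:
        return [], {}

    # Step 23: encode node names as consecutive integers (order of first appearance).
    node_ids = {}
    for u, v, _ in kept:
        for name in (u, v):
            if name not in node_ids:
                node_ids[name] = len(node_ids)

    # Step 24: normalize weights into (0, 1] by dividing by the largest weight
    # (assumed scheme; the source does not specify one).
    max_w = max(w for _, _, w in kept)
    encoded = [(node_ids[u], node_ids[v], w / max_w) for u, v, w in kept]
    return encoded, node_ids
```

For example, with edges `alice-bob` (4.0), `bob-carol` (0.05) and `carol-alice` (2.0), the second edge is dropped as noise and the remaining weights become 1.0 and 0.5.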
In step 21), the types of feature data include the graph node features, the edge weights, the direction features of the data, and the risk nodes selected as the initial nodes of the subsequent random walk.
The graph node feature is the user's label data, representing the user's risk performance score, which takes the value 0 or 1 to indicate whether or not the user is at risk. The edges represent relationships among users, including call relationships, follow relationships and social friend relationships, and the edge weights represent the closeness of the relationships among users.
In step 22), supplementing the possibly missing data specifically comprises:
for numerical data: filling in the missing values by linear interpolation using a linear model;
for categorical features: filling in the missing value with the most frequently occurring value of that category.
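The two filling rules above can be illustrated as follows. This is a sketch under stated assumptions: missing entries are represented as `None`, numerical values are interpolated along their position in the list, and both function names are hypothetical. The source does not say along which axis the linear model interpolates.

```python
from collections import Counter


def fill_numeric_linear(values):
    """Fill None entries by linear interpolation between the nearest known values.

    Gaps at either end are filled with the nearest known value.
    """
    known = [(i, v) for i, v in enumerate(values) if v is not None]
    if not known:
        return list(values)
    filled = list(values)
    for i, v in enumerate(values):
        if v is not None:
            continue
        left = [(j, x) for j, x in known if j < i]
        right = [(j, x) for j, x in known if j > i]
        if left and right:
            (j0, x0), (j1, x1) = left[-1], right[0]
            filled[i] = x0 + (x1 - x0) * (i - j0) / (j1 - j0)
        else:
            filled[i] = (left[-1] if left else right[0])[1]
    return filled


def fill_category_mode(values):
    """Fill None entries with the most frequent non-missing category."""
    counts = Counter(v for v in values if v is not None)
    if not counts:
        return list(values)
    mode = counts.most_common(1)[0][0]
    return [mode if v is None else v for v in values]
```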
In step 3), the rules of the random walk are as follows:
all directed edges in the graph network are treated as undirected edges, and if multiple edges exist between two nodes, they are merged into one edge whose weight is the average of the weights of those edges;
a known risk user is selected as the seed node at the start of the random walk, the next node of the walk is selected with probability proportional to the edge weight, the random walk stops once the probability with which every node in the graph network appears in the random walk path has stabilized, and the visit probability of each node is taken as the risk score of the corresponding user;
at each step, the random walker moves from the current node to a randomly chosen neighbor node with probability b, and returns directly from the current node to the initial seed node with probability 1-b, specifically:
$$r_{t+1} = b \cdot r_t + (1-b) \cdot r_0$$
where $r_t$ denotes the probability that a node is visited in the graph network at time $t$, and $r_0$ denotes the probability that the initial seed node is visited.
The probability b takes the value 0.9.
During the random walk, to reduce computational complexity, a visit probability threshold δ = 0.0001 is set: when the walk reaches a node whose visit probability is nonzero but less than δ, the walk returns to the initial seed node.
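The walk rules above can also be evaluated deterministically by iterating the visit probabilities until they stop changing. The sketch below is one possible reading, not the patented implementation. It assumes the graph is already an undirected weighted adjacency dict in which every neighbor is also a key. It makes the neighbor step explicit by spreading each node's probability over its neighbors in proportion to edge weight, which the printed formula leaves implicit. The names `random_walk_with_restart`, `tol` and `max_iter` are illustrative, and the δ pruning rule is left out for brevity.

```python
def random_walk_with_restart(adj, seed, b=0.9, tol=1e-10, max_iter=1000):
    """Iterate visit probabilities for a walk that restarts at `seed`.

    adj: {node: {neighbor: weight}} for an undirected graph.
    Returns {node: visit probability}, used here as the risk score.
    """
    nodes = list(adj)
    r0 = {n: 1.0 if n == seed else 0.0 for n in nodes}  # seed = 1, others = 0
    r = dict(r0)
    for _ in range(max_iter):
        # Restart term: (1 - b) * r0
        nxt = {n: (1.0 - b) * r0[n] for n in nodes}
        # Walk term: b * (probability pushed to neighbors, weighted by edge weight)
        for u in nodes:
            total = sum(adj[u].values())
            if total == 0:
                nxt[seed] += b * r[u]  # isolated node: return to seed (assumption)
                continue
            for v, w in adj[u].items():
                nxt[v] += b * r[u] * w / total
        diff = sum(abs(nxt[n] - r[n]) for n in nodes)
        r = nxt
        if diff < tol:  # probabilities have stabilized
            break
    return r
```

On a three-node path `s - a - b` with unit weights and seed `s`, the scores sum to 1 and `s` scores higher than the more distant node `b`.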
In the step 4), when the risk score of the user to be predicted exceeds the set risk score threshold, the user to be predicted is judged to be a risk user.
Compared with the prior art, the invention has the following advantages:
1. The invention avoids the extremely high labor cost of manually predicting risk users.
2. The invention processes graph network data and, compared with ordinary feature processing methods, makes better use of the correlation information among samples.
3. The invention is highly scalable and supports distributed computing systems well.
4. The method has a wide range of application and commercial value: it can process public data sets and can be extended to processing business data within enterprises.
Drawings
FIG. 1 is a flow chart of the preprocessing and training of the present invention.
Fig. 2 is a flow chart of the present invention.
Detailed Description
The invention will now be described in detail with reference to the drawings and specific examples.
Examples
The present invention is described in further detail below in order to explain its objects, technical solutions and key points more clearly and thoroughly. It should be understood that the implementations described herein merely illustrate specific methods of the invention and do not limit it. Those skilled in the art can implement and extend the invention according to the principles set forth herein, and can apply it to similar scenarios by simply modifying the structured data set to be processed.
As shown in FIG. 1 and FIG. 2, the invention first preprocesses the original data and then applies a random walk algorithm to the preprocessed data. After the random walk, a risk score is available for every node in the graph, so using the model only requires querying the risk score of the corresponding user. The invention therefore comprises three stages: data preprocessing, random walk, and model use, as follows:
1) Data preprocessing phase: a graph network data set is acquired as the original data set and the original data is preprocessed, as follows:
First, the required graph network data is collected. In this data, users are the nodes of the graph network; relationships among users, such as social friend relationships and call relationships, are the edges; and the labels are the users' risk performance, for example whether a loan default has occurred in a financial scenario. The node data is processed first. The feature data is mostly provided by the data set and generally consists of information such as the user's gender, age and historical behavior. Noise in the feature data is filtered by fitting the corresponding feature data with a simple linear regression or decision tree model and then removing the data points that lie far from the fitted curve.
Linear regression model formula: $f(x) = w^T x + b$
The error function used for training is:
where $x_i$ and $y_i$ are the feature attribute values and the label of the $i$-th sample, and $w$ and $b$ are the parameters of the linear model.
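The error function itself is not preserved in this text. Assuming an ordinary least-squares fit, which is the usual training objective for a linear regression model, a form consistent with the symbols defined here would be the following, where $m$ (the number of samples) is a symbol introduced only for illustration:

```latex
E(w, b) = \sum_{i=1}^{m} \bigl( f(x_i) - y_i \bigr)^2
        = \sum_{i=1}^{m} \bigl( w^T x_i + b - y_i \bigr)^2
```

Under this assumption, samples with a large residual $|f(x_i) - y_i|$ are the points that lie far from the fitted curve and would be treated as noise.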
Next comes the edge data. Edges represent relationships between users, and the weight of an edge represents how close the relationship is. In the noise filtering step, edges whose weights are too low are filtered out. For edges with missing weights, an edge weight prediction model is applied to supplement the data set; specifically, the graph network edge weight prediction model proposed by Srijan Kumar et al. in "Edge Weight Prediction in Weighted Signed Networks" (2016) can be used. Finally, the value predicted by the model is used as the weight of the corresponding edge, and edges whose predicted weight is too low are filtered out.
2) Random walk phase:
In the random walk stage, the walk rules are defined first. In the experimental system, the customized walk rule is to select the next node of the walk with probability proportional to the edge weight. All directed edges in the system are treated as undirected edges, and when multiple edges exist between two nodes, they are merged into one edge whose weight is the average of the weights of those edges. Furthermore, at each step the random walker moves from its current node to a randomly chosen neighbor node with probability 0.9, and returns directly from the current node to the initial seed node with probability 0.1. The iterative formula is:
$$r_{t+1} = 0.9 \cdot r_t + 0.1 \cdot r_0$$
where $r_t$ denotes the probability that a node in the graph is visited at time $t$, and $r_0$ denotes the initial visit probability, that is, the seed node has a value of 1 and all other nodes have a value of 0.
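The edge-merging rule and the weight-proportional choice of the next node can be sketched as below. The function names are hypothetical, self-loops are skipped, and weights are assumed to be positive; none of these details are stated in the source.

```python
from collections import defaultdict


def merge_to_undirected(edges):
    """Merge directed (src, dst, weight) edges into an undirected adjacency dict.

    All edges between the same pair of nodes, in either direction, collapse
    into one edge whose weight is the average of their weights.
    """
    buckets = defaultdict(list)
    for u, v, w in edges:
        if u == v:
            continue  # self-loops are ignored in this sketch (assumption)
        buckets[frozenset((u, v))].append(w)

    adj = defaultdict(dict)
    for pair, weights in buckets.items():
        u, v = tuple(pair)
        avg = sum(weights) / len(weights)
        adj[u][v] = avg
        adj[v][u] = avg
    return {node: dict(nbrs) for node, nbrs in adj.items()}


def transition_probabilities(adj, node):
    """Probability of stepping to each neighbor, proportional to edge weight."""
    total = sum(adj[node].values())
    return {nbr: w / total for nbr, w in adj[node].items()}
```

For instance, the edges `a→b` (1.0) and `b→a` (3.0) merge into a single undirected edge of weight 2.0.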
Likewise, to reduce the computational complexity of the system, a threshold δ = 0.0001 is set: when the random walk reaches a node whose visit probability is nonzero but less than δ, the walk returns to the seed node. The above constitutes the walk rules of the random walk stage.
Once the rules are defined, random walk paths are simulated repeatedly according to these rules, starting from the pre-selected seed nodes, which in practice are nodes carrying a risk label. When the probability with which each node in the graph appears in the random walk paths has stabilized, the simulation stops and that probability is taken as the risk score of the corresponding node. A higher risk score means the user is more likely to be a new risk user.
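The simulation described above might look like the following sketch. It is an illustration under several assumptions: it runs a fixed number of steps instead of testing for stability, it uses a single seed node, and it applies the δ rule literally by comparing a node's running visit frequency with δ. The function name and the `steps` and `rng_seed` parameters are hypothetical.

```python
import random


def simulate_risk_scores(adj, seed, b=0.9, delta=0.0001, steps=20000, rng_seed=0):
    """Estimate visit probabilities by simulating one long walk with restarts.

    adj: {node: {neighbor: weight}} for an undirected graph.
    Returns {node: visit frequency}, used here as the risk score.
    """
    rng = random.Random(rng_seed)  # fixed seed keeps the sketch reproducible
    visits = {node: 0 for node in adj}
    current = seed
    for step in range(1, steps + 1):
        visits[current] += 1
        neighbors = list(adj[current])
        # With probability 1 - b (or at a dead end) jump back to the seed.
        if not neighbors or rng.random() >= b:
            current = seed
            continue
        # Otherwise pick a neighbor with probability proportional to edge weight.
        weights = [adj[current][n] for n in neighbors]
        nxt = rng.choices(neighbors, weights=weights, k=1)[0]
        # Delta rule: a rarely visited node (0 < frequency < delta) sends
        # the walk back to the seed instead.
        estimate = visits[nxt] / step
        if 0 < estimate < delta:
            nxt = seed
        current = nxt
    return {node: count / steps for node, count in visits.items()}
```

On a five-node path `s - a - b - c - d` with seed `s`, the nodes near the seed end up with clearly higher scores than the far end `d`.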
3) Model use phase:
The data to be queried is input into the system, the system matches it to the corresponding node in the graph network, and the risk score of that node, i.e. of the queried sample, is returned as the output.
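The query step, combined with the threshold decision of step 4), reduces to a lookup. In this sketch the function name, the dictionary layout and the default `threshold` value are assumptions; the source does not state a threshold.

```python
def query_risk(user_id, node_ids, scores, threshold=0.05):
    """Look up a user's risk score and compare it with a threshold.

    node_ids: {user identifier: node number}; scores: {node number: risk score}.
    Returns None when the user has no matching node in the graph.
    """
    node = node_ids.get(user_id)
    if node is None or node not in scores:
        return None
    score = scores[node]
    # A user is judged a risk user when the score exceeds the threshold.
    return {"user": user_id, "score": score, "is_risk": score > threshold}
```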
The invention uses a random-walk-based graph network data mining algorithm. It overcomes the inability of traditional structured data processing methods to handle graph network features, mitigates the poor performance of neural networks on categorical feature data, reduces labor costs, and helps in processing networked data more effectively.
Those skilled in the art will readily understand that the above process is only one specific example of the present invention, and in actual industrial production, those skilled in the art may modify and improve some details according to the above description and the actual data set, so that the specific operation is more suitable for the actual application scenario.
Claims (9)
1. The risk user prediction method based on the graph network random walk is characterized by comprising the following steps of:
1) Acquiring networking data containing a graph as an original data set;
2) Preprocessing an original data set and constructing a graph network;
3) Obtaining the probability corresponding to the node, namely the risk score of the user, of the preprocessed data through a clustering algorithm based on random walk;
4) Integrating the user node probability obtained by the clustering algorithm, and outputting a final risk user prediction result;
in the step 3), the rule of the random walk is specifically as follows:
all directed edges in the graph network are treated as undirected edges, and if multiple edges exist between two nodes, they are merged into one edge whose weight is the average of the weights of those edges;
the edges represent relationships among users, including call relationships, follow relationships and social friend relationships, and the edge weights represent the closeness of the relationships among users;
a known risk user is selected as the seed node at the start of the random walk, the next node of the walk is selected with probability proportional to the edge weight, the random walk stops once the probability with which every node in the graph network appears in the random walk path has stabilized, and the visit probability of each node is taken as the risk score of the corresponding user;
at each step, the random walker moves from the current node to a randomly chosen neighbor node with probability b, and returns directly from the current node to the initial seed node with probability 1-b, specifically:
$$r_{t+1} = b \cdot r_t + (1-b) \cdot r_0$$
where $r_t$ denotes the probability that a node is visited in the graph network at time $t$, and $r_0$ denotes the probability that the initial seed node is visited.
2. The risk user prediction method based on graph network random walk according to claim 1, wherein the data sets in graph-network form include public competition data sets, university public data sets and enterprise public data sets; the public competition data sets include those published on the Kaggle and KDD competition websites; the university public data sets are those on Stanford University's open-source data set website; and the enterprise public data sets include those published by Microsoft and Yahoo.
3. The risk user prediction method based on graph network random walk according to claim 1, wherein the step 2) specifically comprises the following steps:
21) acquiring feature data from the original data and filtering out noise data, namely edge data with low weight;
22) supplementing possibly missing data using a relation prediction model;
23) uniformly encoding the node numbers in the graph network;
24) normalizing the weights of the edges in the graph network.
4. A graph network random walk-based risk user prediction method according to claim 3, wherein in the step 21), the type of feature data includes graph node features of the data, weights of edges, direction features, and risk nodes selected as the initial nodes of the subsequent random walk.
5. The risk user prediction method based on graph network random walk according to claim 4, wherein the graph node feature is the user's label data, representing the user's risk performance score, which takes the value 0 or 1 to indicate whether or not the user is at risk.
6. A risk user prediction method based on graph network random walk according to claim 3, wherein in step 22), supplementing the data that may be missing specifically includes:
for numerical data: filling in the missing values by linear interpolation using a linear model;
for categorical features: filling in the missing value with the most frequently occurring value of that category.
7. A risk user prediction method based on graph network random walk according to claim 1, wherein the probability has a value of 0.9.
8. A risk user prediction method based on graph network random walk according to claim 1, characterized in that in order to reduce the computational complexity during the random walk, an access probability threshold δ=0.0001 is set, when the random walk reaches a certain node, the access probability of which is not 0 and less than δ, then the random walk returns to the first seed node.
9. The method for predicting risk users based on graph network random walk according to claim 8, wherein in the step 4), when the risk score of the user to be predicted exceeds the set risk score threshold, the user to be predicted is determined to be a risk user.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010966200.8A CN112131569B (en) | 2020-09-15 | 2020-09-15 | Risk user prediction method based on graph network random walk |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112131569A | 2020-12-25 |
CN112131569B | 2024-01-05 |
Family
ID=73846983
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010966200.8A Active CN112131569B (en) | 2020-09-15 | 2020-09-15 | Risk user prediction method based on graph network random walk |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112131569B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117689450B (en) * | 2024-01-29 | 2024-04-19 | 北京一起网科技股份有限公司 | Digital marketing system based on big data |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109189936A (en) * | 2018-08-13 | 2019-01-11 | 天津科技大学 | A kind of label semanteme learning method measured based on network structure and semantic dependency |
CN109951377A (en) * | 2019-03-20 | 2019-06-28 | 西安电子科技大学 | A kind of good friend's group technology, device, computer equipment and storage medium |
CN110175299A (en) * | 2019-05-28 | 2019-08-27 | 腾讯科技(上海)有限公司 | A kind of method and server that recommendation information is determining |
CN111008447A (en) * | 2019-12-21 | 2020-04-14 | 杭州师范大学 | Link prediction method based on graph embedding method |
Also Published As
Publication number | Publication date |
---|---|
CN112131569A (en) | 2020-12-25 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||