CN110706095B - Target node key information filling method and system based on associated network - Google Patents


Info

Publication number
CN110706095B
Authority
CN
China
Prior art keywords
nodes
node
key information
target
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910939414.3A
Other languages
Chinese (zh)
Other versions
CN110706095A (en)
Inventor
郑乐
韩晗
刘嵩
陈锐浩
毛正冉
王张琦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sichuan XW Bank Co Ltd
Original Assignee
Sichuan XW Bank Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q40/00 Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q40/03 Credit; Loans; Processing thereof
    • G06Q50/00 Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/01 Social networking

Landscapes

  • Business, Economics & Management (AREA)
  • Engineering & Computer Science (AREA)
  • General Business, Economics & Management (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Economics (AREA)
  • Marketing (AREA)
  • Strategic Management (AREA)
  • Finance (AREA)
  • Physics & Mathematics (AREA)
  • Accounting & Taxation (AREA)
  • Technology Law (AREA)
  • Development Economics (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Human Resources & Organizations (AREA)
  • Primary Health Care (AREA)
  • Tourism & Hospitality (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a method and a system for filling in key information of a target node based on an associated network, belonging to the technical fields of data mining, machine learning and graph theory. It addresses the low accuracy with which the prior art fills in key information of a target node. A relationship network is established for each of a large number of nodes according to the application scenario. From these relationship networks, the association networks of target nodes that have the relevant key information are acquired and integrated into a data structure comprising the target nodes, labels, associated nodes, associated-node weights and attribute vectors. The data structure is sampled multiple times in three dimensions using an improved random forest method to obtain subsets for training a plurality of decision trees; the decision trees are trained on these subsets and then integrated to obtain a final model. For a target node whose key information is to be filled in, the final model makes a prediction from each of its associated nodes, and the resulting predictions are weighted and averaged to obtain the final filling information.

Description

Target node key information filling method and system based on associated network
Technical Field
A method and a system for filling in key information of a target node based on an association network, belonging to the technical fields of data mining, machine learning and graph theory.
Background
In many scenarios, key information about a target must be predicted even though too little information about the target itself is available. Typical scenarios include financial credit, e-commerce recommendation and health assessment.
Scenario 1: financial credit, specifically how to assess credit when admitting credit-blank users (users with no credit history). A credit-blank user lacks the basic credit information a financial institution needs to evaluate willingness and capacity to repay. In this case, the repayment willingness of the target node can be evaluated using information about the people close to it (that is, its adjacent network nodes). The association network can be built from the financial institution's data on existing nodes, together with dimensions such as the emergency contacts, relatives and frequent telephone contacts that the target node supplies at onboarding.
Scenario 2: e-commerce recommendation. When the target node has low activity, that is, no shopping records and no browsing information, the question is how to predict its potential shopping tendency and so recommend products more accurately. An association network can be built from social information; browsing data, purchase data and similar data of the associated nodes are collected from adjacent network nodes, and a model is built to evaluate the shopping tendency of the target node.
Scenario 3: health assessment, in particular how to predict the probability that a currently healthy person will develop a certain disease in the future. Since family medical history has a scientific basis as a predictor, an association network can be constructed from a person's relatives, and the probability that the target person develops a certain disease can be predicted from dimensions of the associated nodes such as physical fitness evaluation, age at onset, disease type, eating habits and living habits, so that prevention can start earlier.
The most typical scenario is how, in the financial field, credit is granted to a credit-blank account. When a new applicant has no historical credit record (and therefore lacks feature variables), the lending institution cannot use its existing scoring model to evaluate credit risk. The methods generally employed at present are:
1. Replace the user's missing feature dimensions with the mean, median, a quantile, the mode, a random value or the like computed over existing users, then feed them into the model for scoring. The missing feature dimensions are filled poorly, which amounts to artificially adding noise.
2. Use the other known features in a prediction model to estimate the missing variables. The problem is that if the other variables are few and uncorrelated with the missing features, the predicted result is meaningless; whereas if the prediction is quite accurate, the predicted variable is strongly correlated with the known variables and need not be added to the model. The right degree of fit between the unknown features and the known features is therefore difficult to judge.
3. Ignore these feature dimensions and look for other feature dimensions as substitutes. A common approach, when the user lacks a given feature variable, is to find feature data that corresponds to it and substitute that data. This method is effective in theory but has the following problems. First, it assumes that corresponding features exist; finding them may require significant effort and expense, and sometimes they cannot be obtained at all. Second, if the user lacks a large number of feature dimensions, the overall feature missing rate remains high even if corresponding features are found for a few dimensions, and the user's key behaviors cannot be predicted accurately.
With the development of graph theory, using an association network to predict key information of a target node has become another possibility. Complex social relationships exist among people, and the behaviors of people within a network built from those relationships are often correlated. Taking credit risk prediction as an example, the main steps currently used to predict node default with an association network are as follows:
1. A complex network is defined, and first-order and second-order adjacent nodes are defined according to the distance (closeness) between the network nodes and the target node.
2. Credit risk is propagated, mainly by one of two methods:
(1) Weight training. Different propagation weights are set according to the risk values and node types of adjacent nodes, and a model is built to train the propagation weights. A general risk propagation formula is obtained from weights trained on a large sample. This approach assumes that 1) the risk values in the association network are correlated, and 2) a general solution exists for risk propagation weights across different networks. In real life, people's social relationships are complex, and the structures and propagation patterns of different networks are diverse, so a general solution for risk propagation weights is difficult to find. The prediction performance of this method is often poor.
(2) Adding social information. The established association network is used to derive social network information about the target, such as overdue loans and normally performing (non-overdue) loans of the people in the adjacent nodes. This approach essentially increases the feature dimensions of the target node, similar to general method 3 above. Its problem is that the derived social network information is not strongly related to the key information of the node (whether it is overdue), cannot be used on its own, and must still be combined with the existing features of the target node for prediction. This brings a further problem: combining existing features with social features sharply increases the feature dimension, and the variability of social network information introduces feature sparsity, which makes model training harder and consumes more computing resources.
Disclosure of Invention
In view of the above problems, the invention aims to provide a method and a system for filling in key information of a target node based on an associated network, solving the following problems of the prior art: (1) filling in key information of a target node depends on the target node's own features, so the key information cannot be filled in at all when the target node has no relevant features; (2) the accuracy of the filled-in key information is low; (3) a large amount of resources is occupied.
To achieve this purpose, the invention adopts the following technical scheme:
A method for filling in key information of a target node based on an associated network comprises the following steps:
S1, according to the application scenario, establishing a relationship network for each node among a large number of nodes, to obtain a large number of relationship networks;
S2, from the large number of relationship networks, acquiring the nodes that have the relevant key information as target nodes, and integrating the corresponding relationship networks, as association networks, into a data structure containing the target nodes, the label of the key information of each target node, the associated nodes of each target node, the node weight of each associated node, and the attribute vector of each associated node that is associated with the key information of the target node, to obtain an integrated training set, wherein the key information refers to the behavior to be predicted;
S3, performing multiple rounds of three-dimensional sampling on the integrated training set based on an improved random forest method, to obtain subsets for training a plurality of decision trees;
S4, training a corresponding plurality of decision trees on the subsets, and integrating the trained decision trees to obtain a final model;
S5, for a target node whose key information is to be filled in, predicting with the final model from the feature vector and weight of each of its associated nodes to obtain a plurality of results, and taking the weighted average of the results to obtain the final filling information, that is, the key information.
Further, the application scenario in step S1 includes a financial credit scenario, an e-commerce recommendation scenario, or a health assessment scenario; the dimensions of the relationship network include the frequent contacts, relatives, friends and colleagues of a known node; the associated nodes in a relationship network are either given different weights according to the closeness of the relationship or given equal weights; the large number of nodes means more than ten thousand nodes, and the number of relationship networks equals the number of nodes.
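The weighting rule of step S1 (weights by closeness of relationship, or equal weights) could be sketched as follows. This is only an illustration: the relation labels, the numeric closeness scores and the normalisation of weights to sum to 1 are assumptions, none of which are specified in the text.

```python
from typing import Dict, Optional

# Hypothetical closeness scores per relation type (not given in the source).
CLOSENESS = {"relative": 3.0, "friend": 2.0, "colleague": 1.0, "contact": 1.0}


def assign_weights(relations: Dict[str, str],
                   closeness: Optional[Dict[str, float]] = None) -> Dict[str, float]:
    """Return one weight per associated node, normalised to sum to 1 (an assumption).

    relations maps an associated-node id to its relation type.
    With closeness=None every associated node receives the same weight.
    """
    if not relations:
        return {}
    if closeness is None:
        raw = {node: 1.0 for node in relations}
    else:
        raw = {node: closeness[rel] for node, rel in relations.items()}
    total = sum(raw.values())
    return {node: value / total for node, value in raw.items()}


# Illustrative relationship network of one known node.
network = {"li_si": "relative", "wang_wu": "colleague"}
equal = assign_weights(network)                 # equal weights
by_degree = assign_weights(network, CLOSENESS)  # weights by closeness
```

Normalising here keeps the later weighted sampling and weighted averaging steps independent of the scale of the raw scores.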
Further, the specific steps of step S2 are:
S2.1, preparing training samples by a supervised machine learning method based on the large number of relationship networks, that is, selecting the nodes that have the relevant key information as target nodes (training samples), all selected target nodes with key information forming the training set, wherein the key information refers to the behavior to be predicted: in a financial credit scenario, whether the user defaults; in an e-commerce recommendation scenario, whether the user has purchase intent; in a health assessment scenario, the level of the user's risk of developing a certain disease;
S2.2, integrating the relationship network of each training sample, as an association network, into a data structure containing the target node, the label of the key information of the target node, the associated nodes of the target node, the node weight of each associated node, and the attribute vector of each associated node that is associated with the key information of the target node, to obtain the integrated training set.
Further, in step S2.2, in the financial credit scenario the attribute vector includes the historical borrowing records, income, education and age of the associated node; in the e-commerce recommendation scenario the attribute vector includes browsing data and purchase data; in the health assessment scenario the attribute vector includes physical fitness evaluation, age at onset, disease type, eating habits and living habits.
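One possible in-memory layout for the data structure of step S2.2 is sketched below. The class names, field names and sample values are hypothetical; the source only lists the five components (target node, label, associated nodes, node weights, attribute vectors). Note that the target node carries no attributes of its own.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class AssociatedNode:
    node_id: str
    weight: float                 # node weight of this associated node
    attributes: Dict[str, float]  # attribute vector, keyed by attribute category


@dataclass
class TargetSample:
    target_id: str        # the target node
    label: Optional[int]  # key-information label; None when it is still to be filled in
    associated: List[AssociatedNode] = field(default_factory=list)


# Illustrative sample for a financial credit scenario (all values made up).
sample = TargetSample(
    target_id="zhang_san",
    label=0,  # e.g. 0 = did not default
    associated=[
        AssociatedNode("li_si", 0.6, {"income": 8000.0, "education": 3.0, "age": 35.0}),
        AssociatedNode("wang_wu", 0.4, {"income": 5000.0, "education": 2.0, "age": 28.0}),
    ],
)
integrated_training_set = [sample]
```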
Further, the specific steps of step S3 are:
S3.1, sample perturbation: using bootstrap sampling, uniformly draw, with replacement, $m$ subsets $D_i$ from the integrated training set, each containing $N'$ target nodes, where $m$ is the number of sampling rounds, $N'$ is the number of target nodes in each subset and equals the number of target nodes in the original training set, and $D_i$ is the $i$-th of the $m$ subsets;
S3.2, attribute category perturbation: for a subset $D_i$ with $N'$ target nodes, given that the attribute vector of an associated node has dimension $K$, randomly draw from the $K$ dimensions an attribute vector $K_i$ of no more than $K$ dimensions as the attribute vector of subset $D_i$, that is, the attribute vector of the associated nodes of every target node is $K_i$, where $K_i$ denotes the attribute vector of the $i$-th subset after attribute category perturbation;
S3.3, attribute value perturbation: perform attribute value perturbation on each target node $O_N$ in a subset $D_i$ that has undergone attribute category perturbation. That is, given that the target node $O_N$ has $M$ associated nodes $R_{NM}$, $M$ association weights $W_{NM}$ and $M$ attribute vector groups $X_{NM}$, the attribute values for the subset's attribute vector $K_i$ are drawn from the $M$ attribute vector groups $X_{NM}$, where the probability of each attribute vector being drawn is
$$P_j = \frac{W_j}{\sum_{l=1}^{M} W_l}, \quad j = 1, 2, \ldots, M,$$
where $P_j$ is the probability that the attribute values of the $j$-th associated node are drawn, $W_j$ is the weight of the $j$-th associated node, and $\sum_{l=1}^{M} W_l$ is the sum of the weights of all associated nodes;
S3.4, attribute category perturbation and attribute value perturbation are applied in turn to each of the $m$ subsets $D_i$, yielding $m$ subsets for training decision trees.
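Steps S3.1 to S3.4 can be sketched with the standard library alone. This is a minimal illustration and not the patented implementation: the tuple-based sample layout, the function name and the toy data are assumptions, and every target node is assumed to have at least one associated node with positive weight.

```python
import random
from typing import Dict, List, Tuple

# One sample: (label, [(weight, {attribute_name: value}), ...]) -- a hypothetical layout.
Sample = Tuple[int, List[Tuple[float, Dict[str, float]]]]


def three_dimensional_sample(training_set: List[Sample], attribute_names: List[str],
                             m: int, rng: random.Random):
    """Return m subsets; each is (chosen_attribute_names, rows), rows = [(features, label)]."""
    subsets = []
    n = len(training_set)
    for _ in range(m):
        # S3.1 sample perturbation: bootstrap N' target nodes with replacement.
        boot = [training_set[rng.randrange(n)] for _ in range(n)]
        # S3.2 attribute category perturbation: keep a random K_i of at most K categories.
        k_i = rng.randint(1, len(attribute_names))
        chosen = rng.sample(attribute_names, k_i)
        rows = []
        for label, associated in boot:
            # S3.3 attribute value perturbation: draw one associated node with
            # probability W_j / sum(W) and use its values for the chosen categories.
            weights = [w for w, _ in associated]
            _, attrs = rng.choices(associated, weights=weights, k=1)[0]
            rows.append(([attrs[name] for name in chosen], label))
        subsets.append((chosen, rows))
    return subsets


# Toy data (values made up for illustration).
data = [
    (0, [(0.6, {"height": 170.0, "education": 3.0, "age": 35.0}),
         (0.4, {"height": 165.0, "education": 2.0, "age": 28.0})]),
    (1, [(1.0, {"height": 180.0, "education": 1.0, "age": 41.0})]),
]
subsets = three_dimensional_sample(data, ["height", "education", "age"],
                                   m=5, rng=random.Random(0))
```

Each returned subset is an ordinary feature table, so any conventional decision tree learner could be trained on it in step S4.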
Further, in step S4, when the decision tree output is a 0/1 variable, the task is a classification problem and the trained decision trees are integrated by majority voting; when the decision tree output is a continuous variable, the task is a regression problem and the trained decision trees are integrated by averaging.
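The two integration rules of step S4 could look like the sketch below. The trained trees are represented by hypothetical stand-in callables, and the tie-breaking behaviour of the vote is an implementation detail of this sketch that the source does not specify.

```python
from collections import Counter
from statistics import mean
from typing import Callable, Dict, List

# A trained tree is modelled as any callable from attribute values to an output.
Tree = Callable[[Dict[str, float]], float]


def integrate(trees: List[Tree], attributes: Dict[str, float], task: str) -> float:
    outputs = [tree(attributes) for tree in trees]
    if task == "classification":
        # Majority voting over 0/1 outputs; on a tie, Counter keeps the value seen first.
        return Counter(outputs).most_common(1)[0][0]
    if task == "regression":
        # Averaging over continuous outputs.
        return mean(outputs)
    raise ValueError(f"unknown task: {task}")


# Hypothetical stand-ins for trained trees.
voters = [lambda a: 1, lambda a: 0, lambda a: 1]
regressors = [lambda a: 0.2, lambda a: 0.4, lambda a: 0.6]
vote = integrate(voters, {"age": 35.0}, "classification")
average = integrate(regressors, {"age": 35.0}, "regression")
```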
Further, in step S5, for a target node to be filled in, the final model predicts from the feature vector and weight of each of its associated nodes to obtain a plurality of results, and the weighted average of the results gives the final filling information, according to the formula:
$$P_{final} = \sum (W_M \times P_M),$$
where $P_{final}$ is the key information of the target node, and $W_M$ and $P_M$ are, respectively, the weight of an associated node of the target node and the final model's prediction for that associated node.
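The filling formula of step S5 can be illustrated as follows. The function name and the toy model are hypothetical, and the sketch assumes the weights are normalised to sum to 1 so that the sum is a true weighted average; the source does not state this explicitly.

```python
from typing import Callable, Dict, List, Tuple

Model = Callable[[Dict[str, float]], float]


def fill_key_information(associated: List[Tuple[float, Dict[str, float]]],
                         final_model: Model) -> float:
    """Compute the sum of W_M * P_M over the associated nodes of one target node."""
    total = sum(w for w, _ in associated)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    # Assumption: divide by the weight sum so the result is a weighted average.
    return sum((w / total) * final_model(attrs) for w, attrs in associated)


# Toy stand-in for the final model: a default probability from income alone.
toy_model = lambda attrs: 0.2 if attrs["income"] > 6000 else 0.7

p_final = fill_key_information(
    [(0.6, {"income": 8000.0}), (0.4, {"income": 5000.0})],
    toy_model,
)  # 0.6 * 0.2 + 0.4 * 0.7, about 0.40
```

Only associated-node attributes enter the computation, so the target node itself needs no features.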
A system for filling in key information of a target node based on an association network comprises:
a network operation module: according to the application scenario, establishing a relationship network for each node among a large number of nodes, to obtain a large number of relationship networks;
a data integration module: from the large number of relationship networks, acquiring the nodes that have the relevant key information as target nodes, and integrating the corresponding relationship networks, as association networks, into a data structure containing the target nodes, the label of the key information of each target node, the associated nodes of each target node, the node weight of each associated node, and the attribute vector of each associated node that is associated with the key information of the target node, to obtain an integrated training set, wherein the key information refers to the behavior to be predicted;
a three-dimensional sampling module: performing multiple rounds of three-dimensional sampling on the integrated training set based on an improved random forest method, to obtain subsets for training a plurality of decision trees;
a model training module: training a corresponding plurality of decision trees on the subsets, and integrating the trained decision trees to obtain a final model;
a prediction module: for a target node whose key information is to be filled in, predicting with the final model from the feature vector and weight of each of its associated nodes to obtain a plurality of results, and taking the weighted average of the results to obtain the final filling information, that is, the key information.
Compared with the prior art, the invention has the following beneficial effects:
First, no attribute variable (attribute vector) of the target node is used in the training samples or in prediction; only the feature variables (attribute vectors) of the associated nodes are used. From the prediction perspective, this enables earlier prediction: because the method does not depend on the target node's attribute vector, the key behavioral tendency of the target node can be predicted without waiting for the user to generate the related behaviors that would form the user's own attribute variables. From the user's perspective, in some scenarios the user's own attributes cannot be obtained at all. For a credit-blank consumer-credit user, for example, there is a vicious circle: no historical credit record (no attribute vector), hence no credit evaluation, hence still no credit record (no attribute vector). This technique can break that circle.
Second, compared with the traditional approach of adding social network variable dimensions, the method does not use social network attribute variables; instead the attribute vectors are stacked in two dimensions, rows and layers (attribute categories form the rows and attribute values form the layers). This reduces the feature dimension, avoids the feature sparsity problem, and lowers the training complexity of the machine learning model and the consumption of computing performance.
Third, compared with the traditional random forest method, the method is better suited to a data structure that integrates the association network of each target node; the sampled sub-sample sets are more varied, the trained sub-classifiers differ more from one another, and the prediction after fusion into the final model is better.
Fourth, unlike traditional association network applications, the method predicts key information of the target node from various kinds of basic information of the associated nodes. It does not require the strong assumption that key information propagates directly between nodes; it only assumes that basic information propagates partially and approximately between nodes, and the key information of the target node is ultimately predicted accurately by stacking a large number of basic information dimensions.
The method is generally applicable in the field of data mining and places no special requirements on computing hardware resources.
Drawings
FIG. 1 is a schematic flow diagram of the present invention;
FIG. 2 is a system framework diagram of the present invention;
FIG. 3 is a data structure diagram of the present invention;
FIG. 4 is a diagram of the associated network nodes of samples in the integrated training set according to an embodiment of the present invention;
FIG. 5 is a diagram of the associated network nodes of samples in the integrated validation set according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of the final model evaluating validation samples one by one and outputting, for each validation sample, as many probability estimates as it has associated nodes, according to an embodiment of the present invention;
FIG. 7 is a schematic diagram, taking 2 groups as an example, of layering validation samples by their number of associated nodes and evaluating the model on each layer, according to an embodiment of the present invention.
Detailed Description
The invention will be further described with reference to the accompanying drawings and specific embodiments.
A method for filling in key information of a target node based on an associated network comprises the following steps:
S1, according to the application scenario, establishing a relationship network for each node among a large number of nodes, to obtain a large number of relationship networks. The application scenario includes a financial credit scenario, an e-commerce recommendation scenario, or a health assessment scenario; other application scenarios are also possible. The dimensions of the relationship network include the frequent contacts, relatives, friends and colleagues of a known node. The associated nodes in a relationship network are either given different weights according to the closeness of the relationship or given equal weights. The large number of nodes means more than ten thousand nodes, and the number of relationship networks equals the number of nodes.
S2, from the large number of relationship networks, acquiring the nodes that have the relevant key information as target nodes, and integrating the corresponding relationship networks, as association networks, into a data structure containing the target nodes, the label of the key information of each target node, the associated nodes of each target node, the node weight of each associated node, and the attribute vector of each associated node that is associated with the key information of the target node, to obtain an integrated training set, wherein the key information refers to the behavior to be predicted;
The specific steps are as follows:
S2.1, preparing training samples by a supervised machine learning method based on the large number of relationship networks, that is, selecting the nodes that have the relevant key information as target nodes (training samples), all selected target nodes with key information forming the training set, wherein the key information refers to the behavior to be predicted: in a financial credit scenario, whether the user defaults; in an e-commerce recommendation scenario, whether the user has purchase intent; in a health assessment scenario, the level of the user's risk of developing a certain disease. In each scenario, other key information may also be used as required;
S2.2, integrating the relationship network of each training sample, as an association network, into a data structure containing the target node, the label of the key information of the target node, the associated nodes of the target node, the node weight of each associated node, and the attribute vector of each associated node that is associated with the key information of the target node, to obtain the integrated training set.
In a financial credit scenario, the attribute vector includes the historical borrowing records, income, education and age of the associated node; in the e-commerce recommendation scenario, the attribute vector includes browsing data and purchase data; in the health assessment scenario, the attribute vector includes physical fitness evaluation, age at onset, disease type, eating habits and living habits. In each scenario, other attribute vectors may also be used as required.
S3, performing multiple rounds of three-dimensional sampling on the integrated training set based on an improved random forest method, to obtain subsets for training a plurality of decision trees;
The specific steps are as follows:
S3.1, sample perturbation: using bootstrap sampling, uniformly draw, with replacement, $m$ subsets $D_i$ from the integrated training set, each containing $N'$ target nodes, where $m$ is the number of sampling rounds, $N'$ is the number of target nodes in each subset and equals the number of target nodes in the original training set, and $D_i$ is the $i$-th of the $m$ subsets. Specifically: the integrated training set contains $N'$ target nodes, and bootstrap sampling uniformly draws, with replacement, $m$ subsets $D_i$ each containing $N'$ target nodes; bootstrap sampling thus reuses the training samples, expanding the $N'$ target nodes of the training set into $m$ subsets $D_i$ of $N'$ target nodes each.
S3.2, attribute category perturbation: for a subset $D_i$ with $N'$ target nodes, given that the attribute vector of an associated node has dimension $K$, randomly draw from the $K$ dimensions an attribute vector $K_i$ of no more than $K$ dimensions as the attribute vector of subset $D_i$, that is, the attribute vector of the associated nodes of every target node is $K_i$, where $K_i$ denotes the attribute vector of the $i$-th subset after attribute category perturbation. Specifically: for a subset $D_i$ with $N'$ target nodes whose $K$-dimensional attribute vector consists of height, education, age and income, the attributes height, education and age are selected from the $K$ dimensions as the attribute vector of the associated nodes of every target node in $D_i$;
S3.3, attribute value perturbation: perform attribute value perturbation on each target node $O_N$ in a subset $D_i$ that has undergone attribute category perturbation. That is, given that the target node $O_N$ has $M$ associated nodes $R_{NM}$, $M$ association weights $W_{NM}$ and $M$ attribute vector groups $X_{NM}$, the attribute values for the subset's attribute vector $K_i$ are drawn from the $M$ attribute vector groups $X_{NM}$, where the probability of each attribute vector being drawn is
$$P_j = \frac{W_j}{\sum_{l=1}^{M} W_l}, \quad j = 1, 2, \ldots, M,$$
where $P_j$ is the probability that the attribute values of the $j$-th associated node are drawn, $W_j$ is the weight of the $j$-th associated node, and $\sum_{l=1}^{M} W_l$ is the sum of the weights of all associated nodes. Specifically: suppose Zhang San is a target node in some subset $D_i$ after attribute category perturbation and has two associated persons, Wang Wu and Li Si; Zhang San's attribute values are then one set of attribute values (height, education and age) drawn at random from those of Wang Wu and Li Si.
S3.4, attribute category perturbation and attribute value perturbation are applied in turn to each of the $m$ subsets $D_i$, yielding $m$ subsets for training decision trees.
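As a worked instance of the draw probability in S3.3, consider a target node with two associated nodes, as in the two-person example of that step. The weights $W_1 = 3$ and $W_2 = 1$ are hypothetical values chosen only to make the arithmetic visible:

```latex
P_1 = \frac{W_1}{W_1 + W_2} = \frac{3}{3 + 1} = 0.75,
\qquad
P_2 = \frac{W_2}{W_1 + W_2} = \frac{1}{3 + 1} = 0.25,
\qquad
P_1 + P_2 = 1 .
```

With equal weights the same formula gives $P_1 = P_2 = 0.5$, so the attribute values of either associated node are equally likely to be drawn.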
S4, training a corresponding plurality of (conventional) decision trees on the subsets, and integrating the trained decision trees to obtain a final model. When the decision tree output is a 0/1 variable, the task is a classification problem and the trained decision trees are integrated by majority voting; when the decision tree output is a continuous variable, the task is a regression problem and the trained decision trees are integrated by averaging.
S5, based on the feature vectors and weights of the associated nodes of the target node whose key information is to be filled, prediction is performed with the final model to obtain a plurality of results, and the results are weighted and averaged to obtain the final filling information, namely the key information. The weighted average is computed as:
$P_{final} = \sum_{M} (W_M \times P_M)$,
where P_final is the key information of the target node, and W_M and P_M are, respectively, the weight of an associated node of the target node and the prediction that the final model outputs for that associated node.
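A minimal sketch of the weighted average in S5 follows. It assumes the weights are normalised to sum to 1 before being applied, which makes the sum a true weighted average; the patent formula itself does not state the normalisation, and the function name is hypothetical.

```python
def fill_key_information(weights, predictions):
    """Weighted average of per-associated-node predictions.

    Assumption: weights are divided by their sum, so that
    P_final = sum(W_M * P_M) is a proper weighted average.
    """
    if len(weights) != len(predictions) or not weights:
        raise ValueError("need one prediction per associated node weight")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    return sum((w / total) * p for w, p in zip(weights, predictions))


# Two associated nodes with weights 3 and 1 and model outputs 1.0 and 0.0.
p_final = fill_key_information([3.0, 1.0], [1.0, 0.0])
```

If the association weights of a target node already sum to 1, the normalisation step leaves them unchanged.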
A target node key information filling system based on an associated network comprises:
a network operation module: according to the application scenario, establishing a relationship network for each node in a large number of nodes to obtain a large number of relationship networks;
a data integration module: based on the large number of relationship networks, taking the nodes that have the relevant key information as target nodes, and integrating the corresponding relationship networks, as association networks, into a data structure comprising the target node, the label of the key information of the target node, the associated nodes of the target node, the node weight of each associated node, and the attribute vector of each associated node that is related to the key information of the target node, so as to obtain an integrated training set, where the key information is the behavior to be predicted;
a three-dimensional sampling module: performing three-dimensional sampling on the integrated training set multiple times, based on an improved random forest method, to obtain multiple subsets for training decision trees;
a model training module: training a corresponding number of decision trees on the subsets, and integrating the trained decision trees to obtain the final model;
a prediction module: based on the feature vectors and weights of the associated nodes of the target node whose key information is to be filled, predicting with the final model to obtain a plurality of results, and weighting and averaging the results to obtain the final filling information, namely the key information.
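The data structure produced by the data integration module can be pictured with the following Python sketch. The class and field names are hypothetical and chosen only to mirror the five elements listed above (target node, label, associated nodes, node weights, attribute vectors).

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class AssociatedNode:
    node_id: str
    weight: float                 # node weight of the associated node
    attributes: Dict[str, float]  # attribute vector related to the key information


@dataclass
class TargetSample:
    target_id: str
    label: int                    # label of the key information, e.g. 1 = overdue
    associated: List[AssociatedNode] = field(default_factory=list)


# One entry of a hypothetical integrated training set.
sample = TargetSample(
    target_id="target-001",
    label=1,
    associated=[
        AssociatedNode("node-a", 0.6, {"age": 35.0, "income": 8.5}),
        AssociatedNode("node-b", 0.4, {"age": 41.0, "income": 6.0}),
    ],
)
integrated_training_set = [sample]
```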
Examples
According to a financial credit application scenario, relationship networks are established based on 50200 nodes, and the 23090 nodes that have key information (whether overdue) are selected as target nodes to establish association networks; the 23090 relationship networks corresponding to the 23090 target nodes are the 23090 association networks;
the 23090 association networks of the target nodes are integrated into a data structure comprising the target node, the label of the key information of the target node, the associated nodes of the target node, the node weight of each associated node, and the attribute vector of each associated node that is related to the key information of the target node, so as to obtain an integrated training set;
the associated nodes are found through the association networks and their attribute vectors are mined. The integrated training set then contains 927 positive samples (key information positive) and 22163 negative samples (key information negative). The associated nodes of the samples in the integrated training set are distributed as shown in fig. 4: 20177 target nodes have 1 associated node, corresponding to 1 group of attribute vectors; 2690 target nodes have 2 associated nodes, each corresponding to 2 groups of attribute vectors; and so on. A sample network complexity index Na is defined as the ratio of the number of target nodes with exactly 1 associated node to the number of target nodes with more than 1 associated node: Na = 20177 / 2913 ≈ 6.93. The smaller Na is, the higher the network complexity of the sample and the richer the choice of associated node attribute vectors for each target node.
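The Na figure can be checked with a short Python sketch. The function name is hypothetical. Only the counts 20177 and 2690 are stated in the text; the remaining 223 target nodes (2913 minus 2690) have more than 2 associated nodes, and because their exact breakdown is given only in fig. 4, the value 3 below is a stand-in.

```python
def network_complexity(associated_counts):
    """Na = (#targets with exactly 1 associated node) / (#targets with more than 1)."""
    single = sum(1 for count in associated_counts if count == 1)
    multiple = sum(1 for count in associated_counts if count > 1)
    if multiple == 0:
        raise ValueError("no target node has more than one associated node")
    return single / multiple


# 20177 targets with 1 associated node, 2690 with 2, and 223 with more than 2
# (represented here by the placeholder value 3).
counts = [1] * 20177 + [2] * 2690 + [3] * 223
na = network_complexity(counts)
```

Rounded to two decimals, the result matches the 6.93 reported for the training set.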
2000 subsets are obtained from the training samples by three-dimensional sampling, and a given model (an existing decision tree) is trained on each of the 2000 subsets, yielding 2000 trained decision trees. Model fusion, i.e. integrating the 2000 trained decision trees, gives the final model; the fusion uses the mean of the output probabilities. Five-fold cross-validation on the integrated training set gives a final model performance of AUC = 0.66 and KS = 0.25.
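The patent reports AUC and KS without defining them. The sketch below assumes the usual definitions: AUC as the probability that a positive sample scores higher than a negative one (ties counted as one half), and KS as the maximum gap between the true positive rate and the false positive rate over all score thresholds. The function name and the toy data are illustrative only.

```python
def auc_and_ks(labels, scores):
    """Compute AUC (pairwise definition) and the KS statistic for binary labels."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    # AUC: fraction of (positive, negative) pairs ranked correctly.
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    auc = wins / (len(positives) * len(negatives))
    # KS: largest |TPR - FPR| over thresholds taken from the observed scores.
    ks = 0.0
    for threshold in sorted(set(scores)):
        tpr = sum(p >= threshold for p in positives) / len(positives)
        fpr = sum(n >= threshold for n in negatives) / len(negatives)
        ks = max(ks, abs(tpr - fpr))
    return auc, ks


toy_auc, toy_ks = auc_and_ks([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

The pairwise AUC loop is quadratic in the number of samples, which is acceptable for a sketch but not for data sets of the size described here.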
After the final model is obtained, its performance is verified on a verification set, specifically as follows:
according to a financial credit application scenario, relationship networks are established based on 66110 nodes, and the 30050 nodes that have key information (whether overdue) are selected as target nodes to establish association networks; the 30050 relationship networks corresponding to the 30050 target nodes are the 30050 association networks;
the 30050 association networks of the target nodes are integrated into a data structure comprising the target node, the label of the key information of the target node, the associated nodes of the target node, the node weight of each associated node, and the attribute vector of each associated node that is related to the key information of the target node, so as to obtain an integrated verification set;
the associated nodes are found through the association networks and their attribute vectors are mined to form the integrated verification set, which contains 801 positive samples (key information positive) and 29249 negative samples (key information negative); the associated nodes of the samples in the integrated verification set are distributed as shown in fig. 5, and the network complexity index of the verification samples is Na = 7.36.
The final model evaluates the verification samples one by one and, for each verification sample, outputs as many probability evaluations as the sample has associated nodes, as shown in fig. 6; the multiple probability evaluations of each verification sample are averaged to obtain its final probability evaluation. On the verification samples the prediction performance is AUC = 0.67 and KS = 0.26, similar to the five-fold cross-validation results on the integrated training set.
The final model is built from the training samples alone. The five-fold cross-validation on the training samples serves only to improve the generalization performance of the model, whereas the verification samples are used solely to test that generalization performance and therefore represent the actual prediction effect.
In the layered prediction scheme for the verification samples, the verification samples are layered by their number of associated nodes and the model is evaluated on each layer. Taking 2 layers as an example, the result is shown in fig. 7: when the number of associated nodes is greater than or equal to 2, the prediction performance of the final model on the verification samples improves substantially.
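The layering step can be sketched as a simple split on the number of associated nodes. The record layout, the layer names and the default threshold of 2 are assumptions for illustration; the per-layer evaluation metric is left to the caller.

```python
def stratify_by_associated_count(samples, threshold=2):
    """Split verification samples into two layers by number of associated nodes."""
    layers = {"below": [], "at_or_above": []}
    for sample in samples:
        key = "at_or_above" if sample["n_associated"] >= threshold else "below"
        layers[key].append(sample)
    return layers


# Hypothetical verification samples: id, number of associated nodes, label, score.
verification = [
    {"id": "v1", "n_associated": 1, "label": 0, "score": 0.20},
    {"id": "v2", "n_associated": 2, "label": 1, "score": 0.70},
    {"id": "v3", "n_associated": 3, "label": 0, "score": 0.10},
]
layers = stratify_by_associated_count(verification)
```

Each layer can then be scored separately, for example with AUC, to compare targets that have a single associated node against those that have two or more.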
By comparison, the AUC of a final model obtained with the weight training method is generally about 0.6 to 0.65, lower than that of the layered prediction scheme when the number of associated nodes is greater than or equal to 2 (AUC of 0.749). Existing models that predict from strong attribute variables of the target node itself usually reach an AUC of about 0.75, comparable to the layered prediction scheme when the number of associated nodes is greater than or equal to 2 (AUC of 0.749). The model established by the invention therefore reaches production-usable quality, and the invention has low complexity, occupies few resources, and can predict in advance.
The above are merely representative examples of the many specific applications of the present invention and do not limit its scope of protection in any way. All technical solutions formed by transformation or equivalent substitution fall within the protection scope of the present invention.

Claims (6)

1. A method for filling key information of a target node based on an associated network, characterized by comprising the following steps:
S1, establishing a relationship network for each node in a large number of nodes according to the application scenario, to obtain a large number of relationship networks;
S2, based on the large number of relationship networks, taking the nodes that have the relevant key information as target nodes, and integrating the corresponding relationship networks, as association networks, into a data structure comprising the target node, the label of the key information of the target node, the associated nodes of the target node, the node weight of each associated node, and the attribute vector of each associated node that is related to the key information of the target node, so as to obtain an integrated training set, where the key information is the behavior to be predicted;
S3, performing three-dimensional sampling on the integrated training set multiple times, based on an improved random forest method, to obtain a plurality of subsets for training decision trees;
S4, training a corresponding number of decision trees on the subsets, and integrating the trained decision trees to obtain a final model;
S5, based on the feature vectors and weights of the associated nodes of the target node whose key information is to be filled, predicting with the final model to obtain a plurality of results, and weighting and averaging the results to obtain the final filling information, namely the key information;
the application scenario in step S1 includes a financial credit scenario, an e-commerce recommendation scenario, or a health assessment scenario; the dimensions of the relationship network comprise the common contacts, relatives, friends and colleagues of a known node; the associated nodes in the relationship network are assigned different weights according to the closeness of the relationship, or are assigned equal weights; the number of nodes is more than ten thousand, and the number of relationship networks equals the number of nodes;
the specific steps of step S2 are:
S2.1, based on the large number of relationship networks, preparing training samples using a supervised machine learning method, i.e. selecting the nodes that have the relevant key information in the large number of relationship networks as target nodes, which are the training samples, and forming a training set from all selected target nodes with key information, where the key information is the behavior to be predicted: in a financial credit scenario, whether the user defaults; in an e-commerce recommendation scenario, whether the user has purchase intent; in a health assessment scenario, the level of the user's risk of having a certain disease;
S2.2, integrating the relationship network of each training sample, as an association network, into a data structure comprising the target node, the label of the key information of the target node, the associated nodes of the target node, the node weight of each associated node, and the attribute vector of each associated node that is related to the key information of the target node, so as to obtain the integrated training set.
2. The method for filling key information of a target node based on an associated network according to claim 1, wherein in step S2.2, in a financial credit scenario, the attribute vector includes the historical borrowing records, income, education level and age of the associated node; in an e-commerce recommendation scenario, the attribute vector includes browsing data and purchase data; in a health assessment scenario, the attribute vector includes physical fitness assessment, disease age, disease category, eating habits and living habits.
3. The method for filling key information of a target node based on an associated network according to claim 1 or 2, wherein step S3 specifically comprises the following steps:
S3.1, sample perturbation: using bootstrap sampling, m subsets D_i, each containing N' target nodes, are drawn uniformly with replacement from the integrated training set, where m is the number of sampling rounds, N' is the number of target nodes in each subset and equals the number of target nodes in the original training set, and D_i is the i-th of the m subsets;
S3.2, attribute category perturbation: for a subset D_i with N' target nodes, given that the attribute vector of the associated nodes has K dimensions, an attribute vector K_i of no more than K dimensions is drawn at random from the K dimensions and used as the attribute vector of the subset D_i, i.e. the attribute vector of the associated nodes of every target node is K_i, where K_i denotes the attribute vector of the i-th subset after attribute category perturbation;
S3.3, attribute value perturbation: attribute value perturbation is performed on each target node O_N in a subset D_i that has undergone attribute category perturbation. The target node O_N is known to have M associated nodes R_NM, M association weights W_NM and M attribute vector groups X_NM. For the subset attribute vector K_i, the attribute values are drawn from one of the M attribute vector groups X_NM, where the probability of each attribute vector being drawn is
$P_j = W_j / \sum_{k=1}^{M} W_k$,
where P_j is the probability that the attribute values of the j-th associated node are drawn, W_j is the weight of the j-th associated node, and $\sum_{k=1}^{M} W_k$ is the sum of the weights of all associated nodes;
S3.4, for the m subsets D_i, attribute category perturbation and attribute value perturbation sampling are carried out in sequence, yielding m subsets for training the decision trees.
4. The method for filling key information of a target node based on an associated network according to claim 1, wherein in step S4, when the decision tree output is a 0/1 variable, the task is a classification problem and the trained decision trees are integrated by majority voting; when the decision tree output is a continuous variable, the task is a regression problem and the trained decision trees are integrated by averaging.
5. The method for filling key information of a target node based on an associated network according to claim 1, wherein in step S5, based on the feature vectors and weights of the associated nodes of the target node to be filled, prediction is performed with the final model to obtain a plurality of results, and the results are weighted and averaged to obtain the final filling information according to the formula:
$P_{final} = \sum_{M} (W_M \times P_M)$,
where P_final is the key information of the target node, and W_M and P_M are, respectively, the weight of an associated node of the target node and the prediction that the final model outputs for that associated node.
6. A system for filling key information of a target node based on an associated network, characterized by comprising:
a network operation module: according to the application scenario, establishing a relationship network for each node in a large number of nodes to obtain a large number of relationship networks; the application scenario includes a financial credit scenario, an e-commerce recommendation scenario, or a health assessment scenario; the dimensions of the relationship network comprise the common contacts, relatives, friends and colleagues of a known node; the associated nodes in the relationship network are assigned different weights according to the closeness of the relationship, or are assigned equal weights; the number of nodes is more than ten thousand, and the number of relationship networks equals the number of nodes;
a data integration module: based on the large number of relationship networks, preparing training samples using a supervised machine learning method, i.e. selecting the nodes that have the relevant key information in the large number of relationship networks as target nodes, which are the training samples, and forming a training set from all selected target nodes with key information, where the key information is the behavior to be predicted: in a financial credit scenario, whether the user defaults; in an e-commerce recommendation scenario, whether the user has purchase intent; in a health assessment scenario, the level of the user's risk of having a certain disease; and integrating the relationship network of each training sample, as an association network, into a data structure comprising the target node, the label of the key information of the target node, the associated nodes of the target node, the node weight of each associated node, and the attribute vector of each associated node that is related to the key information of the target node, so as to obtain an integrated training set;
a three-dimensional sampling module: performing three-dimensional sampling on the integrated training set multiple times, based on an improved random forest method, to obtain multiple subsets for training decision trees;
a model training module: training a corresponding number of decision trees on the subsets, and integrating the trained decision trees to obtain the final model;
a prediction module: based on the feature vectors and weights of the associated nodes of the target node whose key information is to be filled, predicting with the final model to obtain a plurality of results, and weighting and averaging the results to obtain the final filling information, namely the key information.
CN201910939414.3A 2019-09-30 2019-09-30 Target node key information filling method and system based on associated network Active CN110706095B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910939414.3A CN110706095B (en) 2019-09-30 2019-09-30 Target node key information filling method and system based on associated network

Publications (2)

Publication Number Publication Date
CN110706095A CN110706095A (en) 2020-01-17
CN110706095B true CN110706095B (en) 2022-04-15

Family

ID=69197491

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910939414.3A Active CN110706095B (en) 2019-09-30 2019-09-30 Target node key information filling method and system based on associated network

Country Status (1)

Country Link
CN (1) CN110706095B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111325340B (en) * 2020-02-17 2023-06-02 南方科技大学 Information network relation prediction method and system
CN111340147B (en) * 2020-05-22 2021-12-07 四川新网银行股份有限公司 Decision behavior generation method and system based on decision tree
CN111797994B (en) * 2020-06-28 2024-04-05 北京百度网讯科技有限公司 Risk assessment method, apparatus, device and storage medium
CN112885480A (en) * 2021-02-23 2021-06-01 东软集团股份有限公司 User information processing method and device, storage medium and electronic equipment

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109639469A (en) * 2018-11-30 2019-04-16 中国科学技术大学 A kind of sparse net with attributes characterizing method of combination learning and system
CN109672674A (en) * 2018-12-19 2019-04-23 中国科学院信息工程研究所 A kind of Cyberthreat information confidence level recognition methods

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150188941A1 (en) * 2013-12-26 2015-07-02 Telefonica Digital Espana, S.L.U. Method and system for predicting victim users and detecting fake user accounts in online social networks

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Exploring Efficiency of Data Mining Techniques for Missing Link in Online Social Network;Chainarong Sirisup 等;《2018 International Joint Symposium on Artificial Intelligence and Natural Language Processing (iSAI-NLP)》;20190418;1-6 *
一种面向非平衡数据集分类问题的组合选择方法;职为梅 等;《小型微型计算机系统》;20140430;第35卷(第4期);770-775 *
基于随机森林的公路隧道运营缺失数据插补方法;钱超 等;《交通运输系统工程与信息》;20160630;第16卷(第3期);81-87 *

Also Published As

Publication number Publication date
CN110706095A (en) 2020-01-17

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant