CN110706095B

CN110706095B - Target node key information filling method and system based on associated network

Info

Publication number: CN110706095B
Application number: CN201910939414.3A
Authority: CN
Inventors: 郑乐; 韩晗; 刘嵩; 陈锐浩; 毛正冉; 王张琦
Original assignee: Sichuan XW Bank Co Ltd
Current assignee: Sichuan XW Bank Co Ltd
Priority date: 2019-09-30
Filing date: 2019-09-30
Publication date: 2022-04-15
Anticipated expiration: 2039-09-30
Also published as: CN110706095A

Abstract

The invention discloses a method and a system for filling key information of a target node based on an association network, and belongs to the technical fields of data mining, machine learning and graph theory. It addresses the low accuracy of target-node key information filled by the prior art. According to the application scenario, a relationship network is established for each of a large number of nodes. Nodes that have the relevant key information are taken as target nodes, and their relationship networks are integrated, as association networks, into a data structure comprising the target node, its label, its associated nodes, the associated-node weights and the associated-node attribute vectors. Based on an improved random forest method, this data structure is sampled multiple times in three dimensions to obtain subsets for training a plurality of decision trees; the subsets are given to the decision trees for training, and the trained trees are integrated into a final model. For a target node whose key information is to be filled, the final model predicts from each of its associated nodes, and the resulting predictions are weighted and averaged to obtain the final filling information.

Description

Target node key information filling method and system based on associated network

Technical Field

The invention relates to a method and a system for filling key information of a target node based on an association network, and belongs to the technical fields of data mining, machine learning and graph theory.

Background

In many scenarios, there is a need to predict target critical information in the event that there is insufficient target information. Specific scenarios include the field of financial credits, e-commerce recommendations, and health assessments, among others.

Scenario one: financial credit, specifically how to assess credit when admitting a credit-blank user (a user with no credit history). Such a user does not have enough basic credit information for the financial institution to evaluate repayment willingness and repayment capacity. In this case, the repayment willingness of the target node can be evaluated using information about the people closely related to it (namely its adjacent network nodes). The association network can be built from the financial institution's data on its existing nodes, together with dimensions such as the emergency contacts, relatives and frequent telephone contacts that the target node fills in when applying.

Scenario two: e-commerce recommendation. When the target node's activity is low, that is, when it has no shopping records and no browsing information, the question is how to predict its potential shopping tendency and thereby recommend products more accurately. An association network can be built from social information, the browsing data, purchase data and the like of related nodes can be collected from the adjacent network nodes, and a model can be built to evaluate the shopping tendency of the target node.

Scenario three: health assessment, in particular how to predict the probability that a currently healthy person will develop a certain disease in the future. Given that family medical history has some scientific basis as a predictor, an association network can be built from a person's relatives, and the target person's probability of developing certain diseases can be predicted from dimensions of the associated nodes such as physical fitness evaluation, disease age, disease type, eating habits and living habits, so that prevention can begin earlier.

The most typical scenario is how a financial institution grants credit to a credit-blank user. When a new applicant has no historical credit record (that is, lacks feature variables), the lending institution cannot use its existing scoring model to evaluate credit risk. The methods generally used in this case are:

1. Replace the user's missing feature dimensions with the mean, median, quantile, mode, a random value or the like computed over existing users, then feed them into the model for scoring. Missing feature dimensions filled in this way are of poor quality, which amounts to artificially adding noise.

2. Build a prediction model from the other known features to estimate the missing variables. The problem is that if the other variables are few and uncorrelated with the missing features, the predicted result is meaningless; whereas if the prediction is quite accurate, the predicted variable must be strongly correlated with the known variables and need not be added to the model at all. The right degree (that is, how closely the unknown features should be fitted to the known features) is therefore difficult to judge.

3. Ignore these feature dimensions and look for other feature dimensions as substitutes. A common approach, when the user lacks a given feature variable, is to find corresponding feature data for that user and use it instead. This is effective in theory but has the following problems. First, it assumes that corresponding features exist; finding them may take considerable effort and expense, and sometimes they cannot be obtained at all. Second, if the user lacks a large number of feature dimensions, then even if corresponding features are found for a few dimensions, the overall feature missing rate remains high and the user's key behaviors cannot be predicted accurately.

With the development of graph theory, using an association network to predict the key information of a target node has become another possibility. Complex social relationships exist among people, and the behaviors of people within a network built from those relationships are often correlated. Taking credit risk prediction as an example, the main steps currently used to predict node default with an association network are as follows:

1. Define a complex network, and define first-order and second-order adjacent nodes according to the distance (closeness) between each network node and the target node.

2. Propagate credit risk, mainly by one of two methods:

(1) Weight training method. Different propagation weights are set according to the risk values and node types of adjacent nodes, and a model is established to train the propagation weights. A general risk propagation formula is obtained from weights trained on a large sample. This method rests on two assumptions: 1) the risk values in the association network are correlated; 2) there is a general solution for the risk propagation weights across different networks. In real life, people's social relationships are complex, the structures and propagation patterns of different networks are diverse, and a general solution for the risk propagation weights is difficult to find. The prediction effect of this method is often poor.

(2) Adding social information. The established association network is used to derive social network information about the target, such as how many people among the adjacent nodes have overdue loans and how many have loans that are not overdue. This approach essentially increases the feature dimensions of the target node, similar to general method 3 above. Its problem is that the derived social network information is not strongly related to the node's key information (whether overdue), cannot be used on its own, and must still be combined with the existing features of the target node for prediction. This brings a further problem: combining existing features with social features sharply increases the feature dimensionality, and the variability of social network information also causes feature sparsity, which makes model training harder and consumes more computing resources.

Disclosure of Invention

In view of the above problems, the invention aims to provide a method and a system for filling key information of a target node based on an association network, which solve the following problems of the prior art: (1) filling the key information of a target node depends on the target node's own features, so the key information cannot be filled at all when the target node has no relevant features; (2) the accuracy of the filled key information is low; (3) the resource consumption is large.

In order to achieve the purpose, the invention adopts the following technical scheme:

A method for filling key information of a target node based on an association network comprises the following steps:

S1, according to the application scenario, establishing a relationship network for each node among a large number of nodes, to obtain a large number of relationship networks;

S2, based on the large number of relationship networks, taking the nodes that have the relevant key information as target nodes, and integrating their relationship networks, as association networks, into a data structure containing the target node, the label of the key information corresponding to the target node, the associated nodes corresponding to the target node, the node weight of each associated node, and the attribute vector of each associated node that is related to the key information of the target node, to obtain an integrated training set, wherein the key information refers to the behavior to be predicted;

S3, performing multiple rounds of three-dimensional sampling on the integrated training set based on an improved random forest method, to obtain a plurality of subsets for training decision trees;

S4, giving the subsets to a corresponding plurality of decision trees for training, and integrating the trained decision trees to obtain a final model;

S5, for a target node whose key information is to be filled, predicting with the final model based on the feature vectors and weights of its associated nodes to obtain a plurality of results, and computing the weighted average of the results to obtain the final filling information, namely the key information.

Further, the application scenario in step S1 includes a financial credit scenario, an e-commerce recommendation scenario, or a health assessment scenario; the dimensions of the relationship network include the frequent contacts, relatives, friends and colleagues of the known nodes; the associated nodes in the relationship network are given different weights according to the closeness of the relationship, or are given equal weights; "a large number of nodes" means more than ten thousand nodes, and the number of relationship networks is the same as the number of nodes.

Further, the specific steps of step S2 are:

S2.1, preparing training samples for supervised machine learning based on the large number of relationship networks, that is, selecting the nodes that have the relevant key information as target nodes (the training samples), all the selected target nodes with key information forming a training set, wherein the key information refers to the behavior to be predicted: in a financial credit scenario, whether the user defaults; in an e-commerce recommendation scenario, whether the user has purchase intent; in a health assessment scenario, the magnitude of the user's risk of having a certain disease;

S2.2, integrating the relationship network corresponding to each training sample, as an association network, into a data structure containing the target node, the label of the key information corresponding to the target node, the associated nodes corresponding to the target node, the node weight of each associated node, and the attribute vector of each associated node that is related to the key information of the target node, to obtain an integrated training set.

Further, in step S2.2, in the financial credit scenario, the attribute vector includes the historical borrowing records, income, education and age of the associated node; in the e-commerce recommendation scenario, the attribute vector includes browsing data and purchase data; in the health assessment scenario, the attribute vector includes physical fitness evaluation, disease age, disease type, eating habits and living habits.
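The data structure of step S2.2 can be pictured with a minimal Python sketch. The class and field names (`TrainingSample`, `AssociatedNode`, `weight`, `attributes`) and all values are assumptions made for illustration; the source only lists what the structure contains, not how it is represented.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class AssociatedNode:
    """One associated node of a target node (hypothetical field names)."""
    node_id: str
    weight: float                 # node weight, e.g. closeness of the relationship
    attributes: Dict[str, float]  # attribute vector, keyed by attribute category


@dataclass
class TrainingSample:
    """One entry of the integrated training set: a target node and its association network."""
    target_id: str
    label: int                    # key information, e.g. 1 = overdue, 0 = not overdue
    associates: List[AssociatedNode] = field(default_factory=list)


# Toy values for illustration only; they are not taken from the patent.
sample = TrainingSample(
    target_id="target-001",
    label=0,
    associates=[
        AssociatedNode("assoc-A", 0.6, {"income": 8000.0, "education": 3.0, "age": 41.0}),
        AssociatedNode("assoc-B", 0.4, {"income": 5200.0, "education": 2.0, "age": 35.0}),
    ],
)

# The sample carries no attribute vector for the target node itself.
total_weight = sum(a.weight for a in sample.associates)
```

Each target node contributes one label and a variable number of (weight, attribute vector) pairs, which is what the later sampling steps draw from.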

Further, the specific steps of step S3 are:

S3.1, sample perturbation: using bootstrap sampling, uniformly selecting with replacement m subsets D_i, each containing N' target nodes, from the integrated training set, where m is the number of sampling rounds, N' is the number of target nodes in each subset (the same as the number of target nodes in the original training set), and D_i is the i-th of the m subsets;

S3.2, attribute category perturbation: for a subset D_i with N' target nodes, given that the attribute vector of an associated node has K dimensions, randomly drawing from the K dimensions an attribute vector K_i of no more than K dimensions as the attribute vector of subset D_i, that is, the attribute vector of the associated nodes of every target node in D_i is K_i, where K_i denotes the attribute vector of the i-th subset after attribute category perturbation;

S3.3, attribute value perturbation: performing attribute value perturbation on each target node O_N in the subset D_i after attribute category perturbation, that is, given that the target node O_N has M associated nodes R_NM, M association weights W_NM and M groups of attribute vectors X_NM, the attribute values for the subset's attribute vector K_i are drawn from the M groups of attribute vectors X_NM, where the probability of each group being drawn is

P_j = W_j / ∑_{k=1..M} W_k, j = 1, 2, …, M,

where P_j is the probability that the attribute values of the j-th associated node are drawn, W_j is the weight of the j-th associated node, and ∑_{k=1..M} W_k is the sum of the weights of all associated nodes;

S3.4, performing attribute category perturbation and attribute value perturbation sampling in turn on each of the m subsets D_i, to obtain m subsets for training decision trees.
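Steps S3.1 to S3.4 can be sketched in Python as follows. This is an illustration under stated assumptions, not the patented implementation: the tuple layout of a sample, the function names, the rule for how many attribute categories to keep (a uniformly random count between 1 and K) and the toy data are all hypothetical.

```python
import random
from typing import Dict, List, Tuple

# A sample is (label, associates); each associate is (weight, attribute dict).
Associate = Tuple[float, Dict[str, float]]
Sample = Tuple[int, List[Associate]]


def three_dimensional_sample(
    training_set: List[Sample], rng: random.Random
) -> Tuple[List[List[float]], List[int], List[str]]:
    """Draw one decision-tree training subset (steps S3.1 to S3.3)."""
    # S3.1 sample perturbation: bootstrap N' target nodes with replacement.
    n = len(training_set)
    subset = [training_set[rng.randrange(n)] for _ in range(n)]

    # S3.2 attribute category perturbation: keep a random non-empty subset
    # of the K attribute categories (the count rule is an assumption).
    all_keys = sorted(training_set[0][1][0][1].keys())
    k_i = rng.randint(1, len(all_keys))
    keys = sorted(rng.sample(all_keys, k_i))

    # S3.3 attribute value perturbation: for each target node, pick ONE
    # associated node with probability P_j = W_j / sum(W) and use its values.
    rows, labels = [], []
    for label, associates in subset:
        weights = [w for w, _ in associates]
        _, attrs = rng.choices(associates, weights=weights, k=1)[0]
        rows.append([attrs[key] for key in keys])
        labels.append(label)
    return rows, labels, keys


def build_subsets(training_set: List[Sample], m: int, seed: int = 0):
    """S3.4: repeat the sampling m times to get m subsets, one per tree."""
    rng = random.Random(seed)
    return [three_dimensional_sample(training_set, rng) for _ in range(m)]


# Toy training set, invented for illustration.
toy = [
    (0, [(0.7, {"height": 170.0, "education": 3.0, "age": 40.0, "income": 9.0}),
         (0.3, {"height": 165.0, "education": 2.0, "age": 31.0, "income": 6.0})]),
    (1, [(1.0, {"height": 180.0, "education": 1.0, "age": 25.0, "income": 3.0})]),
]
subsets = build_subsets(toy, m=5)
```

Each returned subset is an ordinary feature matrix with one row per sampled target node, so it could be handed to any existing decision tree learner.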

Further, in step S4, when the decision tree output is a 0/1 variable, the task is a classification problem and the trained decision trees are integrated by majority voting; when the decision tree output is a continuous variable, the task is a regression problem and the trained decision trees are integrated by averaging.
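The two integration rules of step S4 can be illustrated with a short Python sketch. The function names and toy outputs are assumptions; the source does not say how ties in the vote are broken, so the tie behavior here is simply that of `Counter.most_common`.

```python
from collections import Counter
from statistics import mean
from typing import Sequence


def majority_vote(tree_outputs: Sequence[int]) -> int:
    """Classification: return the class predicted by the most trees.

    Ties are resolved by first occurrence, which is an assumption.
    """
    return Counter(tree_outputs).most_common(1)[0][0]


def average_output(tree_outputs: Sequence[float]) -> float:
    """Regression: return the mean of the trees' continuous outputs."""
    return mean(tree_outputs)


votes = [1, 0, 1, 1, 0]      # toy 0/1 outputs of five trees
scores = [0.2, 0.4, 0.3]     # toy continuous outputs of three trees
voted_class = majority_vote(votes)
averaged = average_output(scores)
```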

Further, in step S5, for the target node to be filled, prediction is performed with the final model based on the feature vectors and weights of its associated nodes to obtain a plurality of results, and the weighted average of the results is computed to obtain the final filling information, according to the formula:

P_final = ∑(W_M × P_M),

where P_final is the key information of the target node, and W_M and P_M are, respectively, the weight of an associated node of the target node and the final model's prediction for that associated node.
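The weighted average of step S5 amounts to a few lines of Python. In this sketch the weights are normalized to sum to 1 before applying the formula; that normalization is an assumption, as the source gives only the formula itself, and the function name and numbers are hypothetical.

```python
from typing import Sequence


def fill_key_information(weights: Sequence[float], predictions: Sequence[float]) -> float:
    """P_final = sum(W_M * P_M), with weights normalized to sum to 1 (assumption)."""
    if len(weights) != len(predictions) or not weights:
        raise ValueError("need one prediction per associated node")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    return sum(w / total * p for w, p in zip(weights, predictions))


# Toy numbers: two associated nodes with weights 3 and 1.
p_final = fill_key_information([3.0, 1.0], [0.2, 0.6])
```

With equal weights the same function reduces to a plain mean of the per-node predictions.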

A target node key information filling system based on an association network comprises:

a network operation module: according to the application scenario, establishing a relationship network for each node among a large number of nodes, to obtain a large number of relationship networks;

a data integration module: based on the large number of relationship networks, taking the nodes that have the relevant key information as target nodes, and integrating their relationship networks, as association networks, into a data structure containing the target node, the label of the key information corresponding to the target node, the associated nodes corresponding to the target node, the node weight of each associated node, and the attribute vector of each associated node that is related to the key information of the target node, to obtain an integrated training set, wherein the key information refers to the behavior to be predicted;

a three-dimensional sampling module: performing multiple rounds of three-dimensional sampling on the integrated training set based on an improved random forest method, to obtain a plurality of subsets for training decision trees;

a model training module: giving the subsets to a corresponding plurality of decision trees for training, and integrating the trained decision trees to obtain a final model;

a prediction module: for a target node whose key information is to be filled, predicting with the final model based on the feature vectors and weights of its associated nodes to obtain a plurality of results, and computing the weighted average of the results to obtain the final filling information, namely the key information.

Compared with the prior art, the invention has the beneficial effects that:

First, neither prediction nor the training samples use any attribute variable (that is, attribute vector) of the target node; only the feature variables (attribute vectors) of the associated nodes are used. From the prediction perspective, this enables prediction in advance: because the method does not depend on the target node's attribute vector, the target node's key behavioral tendency can be predicted early, without waiting for the user to generate the related behavior that would form the user's own attribute variables. From the user's perspective, in some scenarios the user's own attributes cannot be obtained at all. For example, a credit-blank consumer-credit user is caught in a vicious circle: no historical credit record (no attribute vector), hence no credit evaluation, hence still no credit record (no attribute vector). This technique can break that circle.

Second, compared with the traditional approach of adding social-network variable dimensions, the invention does not add the attribute variables of the social network as extra dimensions; instead it stacks the attribute vectors in two dimensions, rows and layers (the attribute categories form the rows and the attribute values form the layers). This reduces the feature dimensionality, avoids the feature sparsity problem, and lowers both the training complexity of the machine learning model and the computing resources consumed.

Third, compared with the traditional random forest method, the method of the invention is better suited to a data structure in which an association network is integrated around each target node: the sampled sub-sample sets vary more, the trained sub-classifiers differ more from one another, and the prediction effect after final model fusion is better.

Fourth, unlike traditional association network applications, the invention predicts the key information of the target node from various kinds of basic information of the associated nodes. It does not need the strong assumption that key information is transmitted directly between nodes; it only needs to assume that basic information is partially and approximately transmitted between nodes. By stacking a large number of basic information dimensions, the key information of the target node is ultimately predicted accurately.

The method is generally used in the field of data mining, and has no special requirements on computing hardware resources.

Drawings

FIG. 1 is a schematic flow diagram of the present invention;

FIG. 2 is a system framework diagram of the present invention;

FIG. 3 is a data structure diagram of the present invention;

FIG. 4 is a diagram illustrating the relationship between network nodes of samples in an integrated training set according to an embodiment of the present invention;

FIG. 5 is a diagram illustrating the associated network nodes of samples in the integrated verification set according to an embodiment of the present invention;

FIG. 6 is a schematic diagram of the final model evaluating the verification samples one by one and outputting, for each verification sample, as many probability evaluations as it has associated nodes, according to an embodiment of the present invention;

FIG. 7 is a schematic diagram of the verification samples being layered by number of associated nodes and each layer being evaluated by the model, taking 2 groups as an example, according to an embodiment of the present invention.

Detailed Description

The invention will be further described with reference to the accompanying drawings and specific embodiments.

S1, according to the application scenario, establishing a relationship network for each node among a large number of nodes, to obtain a large number of relationship networks. The application scenario includes a financial credit scenario, an e-commerce recommendation scenario, or a health assessment scenario; other application scenarios are also possible. The dimensions of the relationship network include the frequent contacts, relatives, friends and colleagues of the known nodes. The associated nodes in the relationship network are given different weights according to the closeness of the relationship, or are given equal weights. "A large number of nodes" means more than ten thousand nodes, and the number of relationship networks is the same as the number of nodes.

the method comprises the following specific steps:

S2.1, preparing training samples for supervised machine learning based on the large number of relationship networks, that is, selecting the nodes that have the relevant key information as target nodes (the training samples), all the selected target nodes with key information forming a training set, wherein the key information refers to the behavior to be predicted: in a financial credit scenario, whether the user defaults; in an e-commerce recommendation scenario, whether the user has purchase intent; in a health assessment scenario, the magnitude of the user's risk of having a certain disease. In each scenario, the key information may also be whatever else is required;

In the financial credit scenario, the attribute vector includes the historical borrowing records, income, education and age of the associated node; in the e-commerce recommendation scenario, the attribute vector includes browsing data and purchase data; in the health assessment scenario, the attribute vector includes physical fitness evaluation, disease age, disease type, eating habits and living habits. In each scenario, other attribute vectors may also be used as required.

the method comprises the following specific steps:

S3.1, sample perturbation: using bootstrap sampling, uniformly selecting with replacement m subsets D_i, each containing N' target nodes, from the integrated training set, where m is the number of sampling rounds, N' is the number of target nodes in each subset (the same as the number of target nodes in the original training set), and D_i is the i-th of the m subsets. Specifically: the integrated training set has N' target nodes; in each round of bootstrap sampling, N' target nodes are drawn uniformly with replacement to form a subset D_i. Bootstrap sampling thus reuses the training samples many times, expanding the N' target nodes of the training set into m subsets D_i of N' target nodes each;

S3.2, attribute category perturbation: for a subset D_i with N' target nodes, given that the attribute vector of an associated node has K dimensions, randomly drawing from the K dimensions an attribute vector K_i of no more than K dimensions as the attribute vector of subset D_i, that is, the attribute vector of the associated nodes of every target node in D_i is K_i, where K_i denotes the attribute vector of the i-th subset after attribute category perturbation. Specifically: for a subset D_i with N' target nodes whose K-dimensional attribute vector consists of height, education, age and income, height, education and age are randomly selected from the K dimensions as the attribute vector of the associated nodes of each target node in D_i;

S3.3, attribute value perturbation: performing attribute value perturbation on each target node O_N in the subset D_i after attribute category perturbation, that is, given that the target node O_N has M associated nodes R_NM, M association weights W_NM and M groups of attribute vectors X_NM, the attribute values for the subset's attribute vector K_i are drawn from the M groups of attribute vectors X_NM, where the probability of each group being drawn is

P_j = W_j / ∑_{k=1..M} W_k, j = 1, 2, …, M,

where P_j is the probability that the attribute values of the j-th associated node are drawn, W_j is the weight of the j-th associated node, and ∑_{k=1..M} W_k is the sum of the weights of all associated nodes. Specifically: the target node is Zhang San, a target node in some subset D_i after attribute category perturbation. Zhang San has two associated people, Wang Wu and Li Si, and Zhang San's attribute values are one group of attribute vectors (height, education and age) randomly drawn from those of Wang Wu and Li Si.
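As a worked illustration of the extraction probability for a target node with two associated people, the Python sketch below turns node weights into probabilities. The weights 3 and 1, the labels `assoc_1` and `assoc_2` and the function name are hypothetical; the source does not state the weights in its example.

```python
from typing import Dict


def extraction_probabilities(weights: Dict[str, float]) -> Dict[str, float]:
    """P_j = W_j / (sum of all weights), for each associated node j."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    return {name: w / total for name, w in weights.items()}


# Hypothetical weights for the two associated people of one target node.
probs = extraction_probabilities({"assoc_1": 3.0, "assoc_2": 1.0})
```

Under these assumed weights the first associated person's attribute values would be drawn three times as often as the second's.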

S4, giving the subsets to a corresponding plurality of decision trees (an existing decision tree algorithm is used) for training, and integrating the trained decision trees to obtain a final model; when the decision tree output is a 0/1 variable, the task is a classification problem and the trained decision trees are integrated by majority voting; when the decision tree output is a continuous variable, the task is a regression problem and the trained decision trees are integrated by averaging.

For the target node to be filled, prediction is performed with the final model based on the feature vectors and weights of its associated nodes to obtain a plurality of results, and the weighted average of the results is computed to obtain the final filling information, according to the formula:

P_final = ∑(W_M × P_M).

Examples

According to the financial credit application scenario, relationship networks are established based on 50200 nodes, and 23090 nodes that have the key information (whether overdue) are selected as target nodes to build association networks; the 23090 relationship networks corresponding to the 23090 target nodes are the 23090 association networks;

integrating 23090 associated networks based on the target nodes into a data structure comprising the target nodes, labels of key information corresponding to the target nodes, associated nodes corresponding to the target nodes, node weights of the associated nodes and attribute vectors of the associated nodes associated with the key information of the target nodes, so as to obtain an integrated training set;

The associated nodes are found through the association networks and their attribute vectors are mined. The integrated training set then has 927 positive samples (key information positive) and 22163 negative samples (key information negative). The associated network nodes of the samples in the integrated training set are shown in FIG. 4: 20177 target nodes have 1 associated node, each corresponding to 1 group of attribute vectors; 2690 target nodes have 2 associated nodes, each corresponding to 2 groups of attribute vectors; and so on. Define the sample network complexity index Na as the ratio of the number of target nodes with exactly 1 associated node to the number of target nodes with more than 1 associated node: Na = 20177 : 2913 = 6.93. The smaller Na is, the higher the network complexity of the sample and the richer the choice of associated-node attribute vectors for each target node.
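The Na figure can be checked with a few lines of Python using the counts reported above (23090 target nodes, 20177 of them with exactly one associated node). The function name is an assumption made for illustration.

```python
def network_complexity(num_single: int, num_multiple: int) -> float:
    """Na = (# target nodes with exactly 1 associated node) / (# with more than 1)."""
    if num_multiple == 0:
        raise ValueError("Na is undefined when no target node has more than one associated node")
    return num_single / num_multiple


total_targets = 23090
single = 20177
multiple = total_targets - single   # target nodes with more than 1 associated node
na = network_complexity(single, multiple)
```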

Three-dimensional sampling of the training samples yields 2000 subsets, and the given model (an existing decision tree) is trained on the 2000 subsets to obtain 2000 trained decision trees. Model fusion (that is, integrating the 2000 trained decision trees) yields the final model; the fusion takes the mean of the output probabilities. Five-fold cross-validation on the integrated training set gave a final model evaluation of AUC 0.66 and KS 0.25.
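For readers unfamiliar with the two metrics, the following Python sketch computes AUC and KS from labels and scores. It assumes the common definitions (AUC as the area under the ROC curve, KS as the largest gap between true positive rate and false positive rate); the source does not spell out its definitions, and the toy labels and scores are invented and unrelated to the reported results.

```python
from typing import List, Sequence, Tuple


def roc_points(labels: Sequence[int], scores: Sequence[float]) -> List[Tuple[float, float]]:
    """Return (FPR, TPR) points, sweeping the threshold from high to low."""
    pos = sum(labels)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need at least one positive and one negative sample")
    order = sorted(zip(scores, labels), key=lambda t: -t[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(order):
        j = i
        # Process tied scores together so each point matches one threshold.
        while j < len(order) and order[j][0] == order[i][0]:
            tp += order[j][1]
            fp += 1 - order[j][1]
            j += 1
        points.append((fp / neg, tp / pos))
        i = j
    return points


def auc_and_ks(labels: Sequence[int], scores: Sequence[float]) -> Tuple[float, float]:
    pts = roc_points(labels, scores)
    # AUC by the trapezoidal rule over the ROC curve.
    auc = sum((x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(pts, pts[1:]))
    # KS is the largest gap between TPR and FPR.
    ks = max(tpr - fpr for fpr, tpr in pts)
    return auc, ks


toy_labels = [1, 0, 1, 0, 0, 1]
toy_scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
auc, ks = auc_and_ks(toy_labels, toy_scores)
```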

After the final model is obtained, the effect of the model is verified by using a verification set, which specifically comprises the following steps:

according to a financial credit application scene, establishing a relationship network based on 66110 nodes, selecting 30050 nodes with key information (whether overdue) as target nodes to establish an association network, and obtaining 30050 relationship networks corresponding to the 30050 target nodes, namely 30050 association networks;

30050 associated networks based on the target nodes are integrated into a data structure containing the target nodes, labels of key information corresponding to the target nodes, the associated nodes corresponding to the target nodes, node weights of the associated nodes and attribute vectors of the associated nodes associated with the key information of the target nodes, so as to obtain an integrated verification set;

The associated nodes are found through the association networks and their attribute vectors are mined. The integrated verification set then has 801 positive samples (key information positive) and 29249 negative samples (key information negative). The associated network nodes of the samples in the integrated verification set are shown in FIG. 5; the network complexity index Na of the verification samples is 7.36.

The final model evaluates the verification samples one by one, and outputs probability evaluation of corresponding number aiming at the number of the associated nodes of the verification samples, as shown in fig. 6; and averaging the multiple probability evaluations of each verification sample to obtain the final probability evaluation of the verification sample. On the validation samples, the predictive model effect was as follows, similar to the five-fold cross validation results on the integrated training set, with AUC of 0.67 and KS of 0.26.

The final model can be completed only by training samples, and the 5-fold cross validation of the training samples is only to improve the generalization performance of the model, but the validation samples are only used for truly validating the generalization performance, which represents the actual prediction effect.

According to the verification sample layered prediction scheme, the number of the associated nodes of the verification sample is layered, and model evaluation is performed on each layer, taking 2 groups as an example, and the result is shown in fig. 7, so that when the number of the associated nodes is greater than or equal to 2, the prediction effect of the final model on the verification sample is greatly improved.
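The layered evaluation can be sketched in Python as below: verification samples are grouped by number of associated nodes (exactly 1 versus 2 or more) and an AUC is computed per layer. The record layout, function names and toy records are assumptions; the numbers are invented and do not reproduce FIG. 7.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def pairwise_auc(labels: List[int], scores: List[float]) -> float:
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("each layer needs both classes")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def layer_of(num_associates: int) -> str:
    """Two layers, as in the 2-group example: exactly 1 versus 2 or more."""
    return "1" if num_associates == 1 else ">=2"


def layered_auc(records: List[Tuple[int, int, float]]) -> Dict[str, float]:
    """records: (number of associated nodes, label, final probability)."""
    groups = defaultdict(lambda: ([], []))
    for n_assoc, label, score in records:
        labels, scores = groups[layer_of(n_assoc)]
        labels.append(label)
        scores.append(score)
    return {layer: pairwise_auc(l, s) for layer, (l, s) in groups.items()}


# Toy records, invented for illustration only.
toy_records = [
    (1, 1, 0.4), (1, 0, 0.5), (1, 0, 0.3), (1, 1, 0.6),
    (2, 1, 0.8), (3, 0, 0.2), (2, 0, 0.3), (4, 1, 0.7),
]
result = layered_auc(toy_records)
```

The pairwise AUC is quadratic in the layer size, which is fine for a sketch; a sort-based computation would be preferable on tens of thousands of samples.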

By comparison, the final model of a weight training method generally has an AUC of about 0.6-0.65, which is lower than that of the layered prediction scheme when the number of associated nodes is greater than or equal to 2 (AUC of 0.749). Existing models that predict from strong attribute variables of the target node itself usually reach an AUC of about 0.75, which is similar to that of the layered prediction scheme when the number of associated nodes is greater than or equal to 2 (AUC of 0.749). The model established by the invention is therefore usable in production, and the invention has low complexity, occupies few resources and can predict in advance.

The above are merely representative examples of the many specific applications of the present invention, and do not limit the scope of the invention in any way. All technical solutions formed by transformation or equivalent substitution fall within the protection scope of the present invention.

Claims

1. A method for filling key information of a target node based on an associated network is characterized by comprising the following steps:

S5, predicting with the final model based on the feature vectors and weights of the associated nodes of the target node whose key information is to be filled, to obtain a plurality of results, and carrying out a weighted average of the plurality of results to obtain the final filling information, namely the key information;

the application scenario in the step S1 includes a financial credit scenario, an e-commerce recommendation scenario, or a health assessment scenario; the dimensions of the relationship network comprise the common contacts, relatives, friends and colleagues of the known nodes; the associated nodes in the relational network are given different weights according to the degree of the relationship, or are given equal weights; the number of the nodes is more than ten thousand, and the number of the relational networks is the same as that of the nodes;

the specific steps of step S2 are:

S2.2, integrating the relation network of each training sample, as an association network, into a data structure which comprises the target node, the label of the key information corresponding to the target node, the associated nodes corresponding to the target node, the node weight of each associated node, and the attribute vector of each associated node related to the key information of the target node, so as to obtain an integrated training set.
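One possible in-memory layout for the data structure recited in step S2.2 is sketched below. The class names, field names and the `integrate` helper are hypothetical and chosen only for illustration; the claim does not prescribe a concrete representation.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class AssociatedNode:
    node_id: str             # identifier of the associated node
    weight: float            # node weight, e.g. by degree of relationship
    attributes: List[float]  # attribute vector related to the key information

@dataclass
class TrainingSample:
    target_id: str           # the target node
    label: int               # label of the key information (e.g. 0 or 1)
    associated: List[AssociatedNode] = field(default_factory=list)

def integrate(target_id, label, relation_network):
    """Build one sample from (node_id, weight, attribute_vector) tuples."""
    nodes = [AssociatedNode(n, w, list(a)) for n, w, a in relation_network]
    return TrainingSample(target_id, label, nodes)

# Hypothetical example: one target node with two associated nodes.
sample = integrate("t1", 1, [("a", 0.7, [1.0, 30.0]), ("b", 0.3, [0.0, 45.0])])
print(sample)
```

An integrated training set is then simply a list of such samples, one per target node.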

2. The method for filling key information of a target node based on an associated network according to claim 1, wherein in the step S2.2, in the financial credit scenario, the attribute vector includes the historical borrowing records, income, educational background and age of the associated nodes; in the e-commerce recommendation scenario, the attribute vector comprises browsing data and purchasing data; in the health assessment scenario, the attribute vector includes physical fitness assessment, disease age, disease category, eating habits, and living habits.

3. The method for filling key information of a target node based on an associated network according to claim 1 or 2, wherein the step S3 specifically comprises the steps of:

wherein P_j is the probability that the attribute value of the j-th associated node is drawn, W_j is the weight corresponding to the j-th associated node,

the weight sum of all the associated nodes;

S3.4, for the m subsets D_i, performing attribute category perturbation and attribute value perturbation sampling in sequence to obtain the subsets of the m training decision trees.
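The attribute value perturbation can be illustrated with a weighted draw. The formula itself does not appear in the text above; the sketch assumes, from the definitions of P_j, W_j and the weight sum, that P_j equals W_j divided by the sum of all associated-node weights. The function names and numbers are hypothetical.

```python
import random

def sampling_probabilities(weights):
    """Assumed form: P_j = W_j / (sum of the weights of all associated nodes)."""
    total = sum(weights)
    return [w / total for w in weights]

def sample_attribute_value(values, weights, rng):
    """Draw one associated node's attribute value, proportionally to its weight."""
    return rng.choices(values, weights=weights, k=1)[0]

rng = random.Random(0)       # fixed seed so the sketch is repeatable
weights = [3.0, 1.0]         # hypothetical weights of two associated nodes
values = [52000, 38000]      # hypothetical values of one attribute (e.g. income)

probs = sampling_probabilities(weights)
drawn = sample_attribute_value(values, weights, rng)
print(probs, drawn)
```

Repeating the draw for each attribute and each of the m subsets would give every decision tree a differently perturbed view of the same target node.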

4. The method as claimed in claim 1, wherein in step S4, when the decision tree result is a binary variable 0 or 1, the problem is a classification problem, and the trained decision trees are integrated by majority voting; and when the decision tree result is a continuous variable, the problem is a regression problem, and the trained decision trees are integrated by averaging.
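The two integration rules of step S4 can be sketched as follows. The function names are hypothetical, and the tie-breaking behaviour (the first-seen class wins) is a property of this sketch rather than something the claim specifies.

```python
from collections import Counter
from statistics import mean

def integrate_classification(tree_outputs):
    """Majority vote over 0/1 tree outputs; ties go to the first-seen class."""
    return Counter(tree_outputs).most_common(1)[0][0]

def integrate_regression(tree_outputs):
    """Plain average over continuous tree outputs."""
    return mean(tree_outputs)

votes = integrate_classification([1, 0, 1, 1, 0])   # three of five trees vote 1
average = integrate_regression([0.2, 0.4, 0.6])
print(votes, average)
```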

5. The method for filling key information of a target node based on an association network according to claim 1, wherein in step S5, a final model is used to predict based on the feature vector and weight of the association node of the target node to be filled, so as to obtain a plurality of results, and the plurality of results are weighted and averaged, so as to obtain the final filling information, where the formula is as follows:

P_final = ∑(W_M × P_M),
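A minimal sketch of this weighted average is shown below. It assumes that the weights W_M are normalized to sum to 1 before use, which the formula as printed does not state; the function name and the numbers are hypothetical.

```python
def weighted_average(probs, weights):
    """P_final = sum(W_M * P_M), with the weights normalized to sum to 1 (assumption)."""
    total = sum(weights)
    return sum((w / total) * p for w, p in zip(weights, probs))

# Hypothetical per-associated-node predictions and node weights.
p_final = weighted_average([0.8, 0.4], [3.0, 1.0])
print(p_final)
```

With equal weights this reduces to the plain average of the per-node predictions.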

6. A system for filling key information of a target node based on an associated network is characterized by comprising the following components:

a network operation module: according to an application scenario, establishing a relational network based on each node in a large number of nodes to obtain a large number of relational networks; the application scenario comprises a financial credit scenario, an e-commerce recommendation scenario or a health assessment scenario; the dimensions of the relationship network comprise the common contacts, relatives, friends and colleagues of the known nodes; the associated nodes in the relational network are given different weights according to the degree of the relationship, or are given equal weights; the number of the nodes is more than ten thousand, and the number of the relational networks is the same as that of the nodes;

the data integration module: based on the large number of relational networks, preparing training samples by a supervised machine learning method, namely selecting nodes related to the key information from the large number of relational networks as target nodes, namely training samples, and forming a training set from all the selected target nodes with key information, wherein the key information refers to the behavior to be predicted: in a financial credit scenario, whether a user defaults; in an e-commerce recommendation scenario, whether the user has purchase intention; in a health assessment scenario, the magnitude of the user's risk of having a certain disease; integrating the relation network of each training sample, as an association network, into a data structure containing the target node, the label of the key information corresponding to the target node, the associated nodes corresponding to the target node, the node weight of each associated node and the attribute vector of each associated node related to the key information of the target node, so as to obtain an integrated training set;