A feature importance ranking system based on a random forest algorithm in a multi-center mode
Technical Field
The present invention belongs to the technical field of feature selection, and in particular relates to a feature importance ranking system based on a random forest algorithm in a multi-center mode.
Background Art
Feature selection is the process of selecting the most effective features from a set of features in order to reduce the dimensionality of the feature space. Feature selection reduces the number of features and the dimensionality of the data, improves the generalization ability of a model, reduces over-fitting, and deepens the understanding of the features and their values; it is one of the key problems in data science. In the biomedical field it is often necessary to process high-dimensional data such as omics data sets, in which the number of variables is usually far larger than the number of individuals, and in such cases feature selection is particularly important. Random forest is an ensemble learning algorithm that is widely used in biomedicine; it can provide estimates of variable importance during classification and is therefore regarded as an effective feature selection algorithm.
Multi-center collaborative data computing is an application scenario that has emerged in the context of big data: a geographically dispersed group of parties uses computer and network technology to cooperate in completing a common task. Feature selection over multi-center data is one of the important problems in this setting, and the demand for collaborative computation across centers keeps growing.
Existing solutions need to extract the data of each center and gather it on a central server, where feature selection is then performed to obtain a global feature selection result. However, moving data out of the individual centers carries serious risks, including security problems such as data leakage, which greatly dampens the centers' willingness to participate in collaborative computing. This is especially true in biomedicine, where the data of each center, that is, each hospital, contains the personal privacy of its patients; extracting the data for centralized processing is unfavorable to protecting patient privacy and entails considerable risk.
Summary of the Invention
In view of the shortcomings of the prior art and in response to practical needs, the purpose of the present invention is to provide a feature importance ranking system based on a random forest algorithm in a multi-center mode that does not expose the data of any center. In this system the data of each center always remains at that center; only the intermediate parameters of the model, and never the original data, are transmitted to the central server, and a safe and effective global feature importance ranking result is finally obtained.
The purpose of the present invention is achieved through the following technical solution: a feature importance ranking system based on a random forest algorithm in a multi-center mode, the system comprising: front-end processors deployed at each center participating in the collaborative computation; a central server that receives and integrates the feature importance ranking results of the centers; and a result display module that feeds the final feature importance ranking result back to the user.
The front-end processor is used to read data from the database interface of its center and to compute the feature importance ranking result of that center with the random forest algorithm; the specific computation steps are as follows:
A. Read data from the database interface of the center as the sample set;
B. Randomly select n samples from the sample set with replacement (bootstrap) to form a training set;
C. Grow one decision tree from the sampled training set; at every node of the decision tree, select d features at random without repetition and use these d features to split the training set;
D. Repeat steps B-C a total of q times, where q is the number of decision trees in the random forest;
E. Use the trained random forest to predict the sample set;
F. Use the Gini index as the evaluation criterion to rank the features by importance for the prediction of step E, comprising the following sub-steps:
a) Suppose the sample set has h features X_1, X_2, X_3, ..., X_h. For each feature X_j, compute the importance of X_j at node m, denoted VIM_{jm}^{(Gini)}, i.e., the change in the Gini index before and after node m is split:

VIM_{jm}^{(Gini)} = GI_m - GI_l - GI_r

where GI_m denotes the Gini index of node m before the split, and GI_l and GI_r denote the Gini indices of the two new nodes l and r after the split. The Gini index of a node x is computed as

GI_x = 1 - \sum_{k=1}^{K} p_{xk}^2

where K is the number of classes and p_{xk} is the proportion of class k in node x;
b) Suppose the nodes at which feature X_j appears in decision tree i form the set E. Then the importance of X_j in the i-th decision tree, VIM_{ij}^{(Gini)}, is

VIM_{ij}^{(Gini)} = \sum_{m \in E} VIM_{jm}^{(Gini)};
c) Given that the random forest contains q trees, compute the Gini index score of each feature X_j, denoted VIM_j^{(Gini)}, i.e., the average change in node-splitting impurity caused by the j-th feature over all decision trees of the random forest:

VIM_j^{(Gini)} = \frac{1}{q} \sum_{i=1}^{q} VIM_{ij}^{(Gini)};
d) Normalize the Gini index score of feature X_j:

VIM_j^{(norm)} = \frac{VIM_j^{(Gini)}}{\sum_{j'=1}^{h} VIM_{j'}^{(Gini)}};
e) Sort the normalized Gini index scores of all features in descending order (a minimal code sketch of this per-center computation is given after this list).
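To make steps A-F concrete, the following Python sketch illustrates one way a front-end processor could compute the local scores. It is only an illustration, not the claimed implementation: the function name local_feature_scores is hypothetical, X and y stand for the feature matrix and labels already read from the center's database interface, and NumPy and scikit-learn are assumed to be available. The per-node change is accumulated exactly as defined in sub-step a), without the per-node sample weighting used by some library implementations.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def local_feature_scores(X, y, n, d, q, seed=0):
    """Grow q trees on bootstrap samples of size n, with d candidate features
    per split, and return the normalized Gini index score of every feature."""
    rng = np.random.default_rng(seed)
    h = X.shape[1]
    vim = np.zeros(h)                                  # accumulated VIM_j over trees
    for _ in range(q):                                 # step D: q trees in total
        idx = rng.integers(0, len(X), size=n)          # step B: bootstrap sample
        tree = DecisionTreeClassifier(criterion="gini", max_features=d)
        tree.fit(X[idx], y[idx])                       # step C: grow one tree
        t = tree.tree_
        for m in range(t.node_count):                  # sub-step a): per-node change
            j = t.feature[m]
            if j < 0:                                  # skip leaf nodes
                continue
            l, r = t.children_left[m], t.children_right[m]
            vim[j] += t.impurity[m] - t.impurity[l] - t.impurity[r]   # GI_m - GI_l - GI_r
    vim /= q                                           # sub-step c): average over q trees
    return vim / vim.sum()                             # sub-step d): normalization

# sub-step e): feature indices in descending order of importance
# ranking = np.argsort(-local_feature_scores(X, y, n, d, q))

Note that sub-steps a) and b) are folded into the inner loop: summing the per-node changes over all internal nodes of a tree at which X_j is used is equivalent to first forming the set E for each tree.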
The central server computes the global feature importance ranking result through the following sub-steps:
A. Receive the feature importance ranking results transmitted by the centers;
B. For each feature, take the mean of the feature's Gini index scores over all centers as its global feature importance value;
C. Re-order the features in descending order of their global feature importance values (a sketch of this aggregation follows).
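A minimal sketch of the central server's integration step is given below. The dictionary-based message format and the name aggregate_global_ranking are assumptions made for illustration; the only requirement is that each center transmits its vector of normalized scores and never any raw records.

def aggregate_global_ranking(center_scores):
    """center_scores: one dict {feature_name: normalized Gini index score}
    per participating center. Returns (feature, global importance) pairs
    sorted in descending order of the mean score across centers."""
    features = center_scores[0].keys()
    global_scores = {f: sum(s[f] for s in center_scores) / len(center_scores)
                     for f in features}
    return sorted(global_scores.items(), key=lambda kv: kv[1], reverse=True)

# Example call with three centers reporting two features each:
# aggregate_global_ranking([{"HbA1c": 0.18, "age": 0.05},
#                           {"HbA1c": 0.15, "age": 0.07},
#                           {"HbA1c": 0.08, "age": 0.06}])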
The beneficial effects of the present invention are as follows: based on a multi-center random forest algorithm, the present invention computes a feature importance ranking result at each center separately, and the central server integrates the ranking results of the centers into a global feature importance ranking result. The data of each center never leaves that center; only the intermediate parameters of the model, and never the original data, are transmitted to the central server, which effectively safeguards data security and the personal privacy contained in the data.
Brief Description of the Drawings
Figure 1 is a flowchart of the implementation of the feature importance ranking system based on a random forest algorithm in a multi-center mode according to the present invention;
Figure 2 is a block diagram of the composition of the feature importance ranking system based on a random forest algorithm in a multi-center mode according to the present invention;
Figure 3 is a flowchart of the feature importance ranking performed in the front-end processor of each center;
Figure 4 is a flowchart of the global importance ranking performed in the central server.
Detailed Description of the Embodiments
The present invention is described in further detail below with reference to the drawings and specific embodiments.
As shown in Figures 1 and 2, the present invention provides a feature importance ranking system based on a random forest algorithm in a multi-center mode, the system comprising: front-end processors deployed at each center participating in the collaborative computation; a central server that receives and integrates the feature importance ranking results of the centers; and a result display module that feeds the final feature importance ranking result back to the user.
The front-end processor is used to read data from the database interface of its center and to compute the feature importance ranking result of that center with the random forest algorithm, as shown in Figure 3; the specific computation steps are as follows:
A. Read data from the database interface of the center as the sample set;
B. Randomly select n samples from the sample set with replacement (bootstrap) to form a training set;
C. Grow one decision tree from the sampled training set; at every node of the decision tree, select d features at random without repetition and use these d features to split the training set;
D. Repeat steps B-C a total of q times, where q is the number of decision trees in the random forest;
E. Use the trained random forest to predict the sample set;
F. Use the Gini index as the evaluation criterion to rank the features by importance for the prediction of step E, comprising the following sub-steps:
a) Suppose the sample set has h features X_1, X_2, X_3, ..., X_h. For each feature X_j, compute the importance of X_j at node m, denoted VIM_{jm}^{(Gini)}, i.e., the change in the Gini index before and after node m is split:

VIM_{jm}^{(Gini)} = GI_m - GI_l - GI_r

where GI_m denotes the Gini index of node m before the split, and GI_l and GI_r denote the Gini indices of the two new nodes l and r after the split. The Gini index of a node x is computed as

GI_x = 1 - \sum_{k=1}^{K} p_{xk}^2

where K is the number of classes and p_{xk} is the proportion of class k in node x;
b) Suppose the nodes at which feature X_j appears in decision tree i form the set E. Then the importance of X_j in the i-th decision tree, VIM_{ij}^{(Gini)}, is

VIM_{ij}^{(Gini)} = \sum_{m \in E} VIM_{jm}^{(Gini)};
c) Given that the random forest contains q trees, compute the Gini index score of each feature X_j, denoted VIM_j^{(Gini)}, i.e., the average change in node-splitting impurity caused by the j-th feature over all decision trees of the random forest:

VIM_j^{(Gini)} = \frac{1}{q} \sum_{i=1}^{q} VIM_{ij}^{(Gini)};
d) Normalize the Gini index score of feature X_j:

VIM_j^{(norm)} = \frac{VIM_j^{(Gini)}}{\sum_{j'=1}^{h} VIM_{j'}^{(Gini)}};
e) Sort the normalized Gini index scores of all features in descending order.
The central server computes the global feature importance ranking result, as shown in Figure 4, through the following sub-steps:
A. Receive the feature importance ranking results transmitted by the centers;
B. For each feature, take the mean of the feature's Gini index scores over all centers as its global feature importance value;
C. Re-order the features in descending order of their global feature importance values. The end-to-end data flow between the two modules is sketched below.
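The following hedged sketch summarizes that data flow. The transport functions (send_to_server, receive_all_scores) are placeholders rather than a real API; the point is simply that only the intermediate parameters of the model cross the network, while the raw samples stay inside each center.

def run_front_end(read_local_data, compute_local_scores, send_to_server):
    X, y = read_local_data()              # raw data stays inside the center
    scores = compute_local_scores(X, y)   # e.g. the per-center sketch shown earlier
    send_to_server(scores)                # only the score vector leaves the center

def run_central_server(receive_all_scores, aggregate, display_results):
    center_scores = receive_all_scores()  # one score vector per center
    display_results(aggregate(center_scores))   # global ranking shown to the user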
The following is a specific example, which shows a feature importance ranking system based on a random forest algorithm in a multi-center mode for predicting the risk of diabetes from physical examination data. The system comprises: front-end processors deployed inside each hospital participating in the collaborative computation; a central server that receives and integrates the feature importance ranking results of the hospitals; and a result display module that feeds the final feature importance ranking result back to the user.
The front-end processor is used to read physical examination data from the database interface of its hospital, to predict the risk of diabetes with the random forest algorithm, and to compute the diabetes-risk feature importance ranking result within that hospital; the specific computation steps are as follows:
A. Read physical examination data from the database interface of the hospital as the sample set; assume there are 5000 physical examination records in total;
B. Randomly select 70 samples from the sample set with replacement (bootstrap) to form a training set;
C. Grow one decision tree from the sampled training set; at every node of the decision tree, select 7 features at random without repetition and use these 7 features to split the training set;
D. Repeat steps B-C a total of 15 times; 15 is the number of decision trees in the random forest;
E. Use the trained random forest to predict the sample set;
F. Use the Gini index as the evaluation criterion to rank the features by importance for the prediction of step E, comprising the following sub-steps:
a) Suppose the sample set has 50 features (age, gender, education level, waist circumference, blood type, systolic blood pressure, hemoglobin, and so on), denoted X_1, X_2, X_3, ..., X_50. For each feature X_j, compute the importance of X_j at node m, denoted VIM_{jm}^{(Gini)}, i.e., the change in the Gini index before and after node m is split:

VIM_{jm}^{(Gini)} = GI_m - GI_l - GI_r

where GI_m denotes the Gini index of node m before the split, and GI_l and GI_r denote the Gini indices of the two new nodes l and r after the split. The Gini index of a node x is computed as

GI_x = 1 - \sum_{k=1}^{K} p_{xk}^2

where K is the number of classes and p_{xk} is the proportion of class k in node x;
b) Suppose the nodes at which feature X_j appears in decision tree i form the set E. Then the importance of X_j in the i-th decision tree, VIM_{ij}^{(Gini)}, is

VIM_{ij}^{(Gini)} = \sum_{m \in E} VIM_{jm}^{(Gini)};
c) Given that the random forest contains 15 trees, compute the Gini index score of each feature X_j, denoted VIM_j^{(Gini)}, i.e., the average change in node-splitting impurity caused by the j-th feature over all decision trees of the random forest:

VIM_j^{(Gini)} = \frac{1}{15} \sum_{i=1}^{15} VIM_{ij}^{(Gini)};
d) Normalize the Gini index score of feature X_j:

VIM_j^{(norm)} = \frac{VIM_j^{(Gini)}}{\sum_{j'=1}^{50} VIM_{j'}^{(Gini)}};
e) Sort the normalized Gini index scores of all features in descending order (see the library-based sketch after this list).
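For the concrete configuration of this example (70-sample bootstraps, 7 candidate features per split, 15 trees), an approximately equivalent forest can also be fitted with a standard library, as sketched below. This is only an approximation of the procedure described above: scikit-learn's feature_importances_ weights each node's impurity decrease by the fraction of samples reaching that node, so its values will not exactly equal the unweighted scores of sub-steps a)-d). Here X and y stand for the assumed 5000-by-50 physical-examination matrix and the diabetes labels.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=15,     # q = 15 trees
    max_features=7,      # d = 7 candidate features per split
    max_samples=70,      # n = 70 samples per bootstrap (requires bootstrap=True)
    bootstrap=True,
    criterion="gini",
    random_state=0,
)
rf.fit(X, y)                              # X: 5000 x 50 examination features, y: labels
scores = rf.feature_importances_          # normalized Gini-based importances
ranking = np.argsort(-scores)             # feature indices in descending importance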
The central server computes the global ranking of the features in the physical examination data that influence the risk of diabetes, through the following sub-steps:
A. Receive the feature importance ranking results transmitted by the hospitals;
B. For each feature, take the mean of the feature's Gini index scores over all hospitals as its global feature importance value. For example, for the feature glycated hemoglobin, if its feature importance score is 0.182483 in hospital A, 0.150948 in hospital B, and 0.078243 in hospital C, then its global feature importance value in the multi-center diabetes risk prediction study jointly carried out by hospitals A, B, and C is (0.182483 + 0.150948 + 0.078243) / 3 ≈ 0.137224;
C. Re-order the features in descending order of their global feature importance values. This averaging is checked in the short snippet below.
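The averaging in step B can be verified directly; the two lines below reproduce the glycated hemoglobin figure quoted above (the exact mean is 0.1372246..., which is 0.137224 when truncated to six decimal places).

hba1c_scores = [0.182483, 0.150948, 0.078243]               # hospitals A, B and C
global_importance = sum(hba1c_scores) / len(hba1c_scores)   # 0.1372246...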
In the present invention, each site computes a local variable importance ranking based on the Gini index and sends it to the central server. The central server integrates the variable importance rankings of the sites and computes the final ranking result. In this process the central server receives only the variable importance ranking results of the sites, and no patient-level data needs to be exchanged. An effective global solution is obtained while the security of the data is effectively safeguarded, providing a safe, reliable, and efficient solution for building feature screening models.
The above is merely an embodiment of the present invention and is not intended to limit the scope of protection of the present invention. Any modification, equivalent replacement, improvement, etc. made within the spirit and principles of the present invention without creative effort falls within the scope of protection of the present invention.