WO2020233259A1 - Multi-center mode random forest algorithm-based feature importance sorting system - Google Patents


Info

Publication number
WO2020233259A1
WO2020233259A1 (PCT/CN2020/083589, CN2020083589W)
Authority
WO
WIPO (PCT)
Prior art keywords
feature
center
random forest
feature importance
gini index
Prior art date
Application number
PCT/CN2020/083589
Other languages
French (fr)
Chinese (zh)
Inventor
李劲松
王丰
胡佩君
张莹
杨子玥
Original Assignee
之江实验室
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 之江实验室
Priority to JP2021532354A (granted as JP7064681B2)
Publication of WO2020233259A1

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/211Selection of the most significant subset of features
    • G06F18/2113Selection of the most significant subset of features by ranking or filtering the set of features, e.g. using a measure of variance or of feature cross-correlation



Abstract

Disclosed is a multi-center mode random forest algorithm-based feature importance sorting system, comprising a front-end processor deployed in each center participating in collaborative computing, a central server receiving and integrating feature importance sorting results of the various centers, and a result display module feeding back a final feature importance sorting result to a user. A feature importance sorting result is respectively calculated at each center according to a multi-center-based random forest algorithm, and the central server integrates the sorting results of the various centers to form a global feature importance sorting result. The present invention operates under the condition that data in the various centers is not exposed, such that the data in the various centers remains in the various centers, only intermediate parameters are transmitted to the central server, and the original data is not transmitted, so as to effectively ensure data security and the personal privacy included in the data.

Description

A feature importance ranking system based on a random forest algorithm in a multi-center mode

Technical Field

The invention belongs to the technical field of feature selection, and in particular relates to a feature importance ranking system based on a random forest algorithm in a multi-center mode.

Background Art

Feature selection is the process of selecting the most effective features from a feature set so as to reduce the dimensionality of the feature space. It reduces the number of features, lowers dimensionality, strengthens model generalization, reduces over-fitting, and deepens the understanding of features and feature values; it is one of the key problems in data science. In biomedicine it is often necessary to process high-dimensional data such as omics data sets, in which the number of variables is usually far larger than the number of individuals; in that setting feature selection is especially important. Random forest is an ensemble learning algorithm widely used in biomedicine; it provides estimates of variable importance during classification and is regarded as an effective feature selection algorithm.

Multi-center collaborative data computing is an application scenario that has emerged with big data: a geographically dispersed group uses computer and network technology to cooperate on a common task. Feature selection over multi-center data is one of its important problems, and the demand for collaborative computation over the centers' data keeps growing.

Existing solutions require the data of every center to be exported and pooled on a central server, where feature selection is then performed to obtain a global result. However, moving data out of the centers is fraught with hidden dangers and may raise security problems such as data leakage, which greatly dampens the centers' enthusiasm for collaborative computing. In biomedicine in particular, the data of each center (i.e., each hospital) contains the personal privacy of its patients, so extracting the data for centralized processing endangers patient privacy and carries great risk.

Summary of the Invention

The purpose of the present invention is to address the shortcomings of the prior art and, in accordance with practical needs, to provide a feature importance ranking system based on a random forest algorithm in a multi-center mode that never exposes any center's data: the data of each center always stays at that center, only intermediate model parameters are transmitted to the central server, the original data is never transmitted, and a secure and effective global feature importance ranking is finally obtained.

The purpose of the present invention is achieved through the following technical solution: a feature importance ranking system based on a random forest algorithm in a multi-center mode, comprising: front-end processors deployed at each center participating in the collaborative computation; a central server that receives and integrates the feature importance ranking results of the centers; and a result display module that feeds the final feature importance ranking back to the user.

The front-end processor reads data from its center's database interface and uses the random forest algorithm to compute that center's feature importance ranking. The specific calculation steps are as follows:

A. Read data from the center's database interface as the sample set;

B. Randomly select n samples from the sample set with replacement (bootstrap) as a training set;

C. Grow a decision tree from the sampled training set; at every node of the tree, select d features at random without repetition and use these d features to split the training set;

D. Repeat steps B-C a total of q times, where q is the number of decision trees in the random forest;

E. Use the trained random forest to predict on the sample set;

F. Use the Gini index as the evaluation metric to rank the features by importance for the predictions of step E, via the following sub-steps:
a) Suppose the sample set has h features X_1, X_2, X_3, ..., X_h. For each feature X_j, compute its importance at node m, VIM_jm^(Gini), i.e., the change in the Gini index before and after node m is split:

VIM_jm^(Gini) = GI_m - GI_l - GI_r

where GI_m is the Gini index of node m before the split, and GI_l and GI_r are the Gini indices of the two new nodes l and r produced by the split. The Gini index of a node x is calculated as

GI_x = 1 - Σ_{k=1}^{K} p_xk^2

where K is the number of classes and p_xk is the proportion of class k at node x;
b) Let E be the set of nodes of decision tree i at which feature X_j appears; the importance of X_j in the i-th decision tree is then

VIM_ij^(Gini) = Σ_{m∈E} VIM_jm^(Gini);
c) Given the q trees of the random forest, compute the Gini score of each feature X_j, i.e., the average change in node-splitting impurity caused by the j-th feature over all decision trees of the forest:

VIM_j^(Gini) = (1/q) Σ_{i=1}^{q} VIM_ij^(Gini);
d) Normalize the Gini score of feature X_j over all h features:

VIM_j = VIM_j^(Gini) / Σ_{c=1}^{h} VIM_c^(Gini)

e) Sort the normalized Gini scores of all features in descending order.
The central server computes the global feature importance ranking via the following sub-steps:

A. Receive the feature importance ranking results sent by each center;

B. For each feature, take the average of its Gini scores over all centers as its global feature importance value;

C. Re-rank the features by global feature importance value in descending order.

The beneficial effects of the present invention are as follows: based on a multi-center random forest algorithm, the feature importance ranking is computed at each center separately, and the central server integrates the centers' rankings into a global feature importance ranking. Because the data of each center always stays at that center and only intermediate model parameters, never the original data, are transmitted to the central server, data security and the personal privacy contained in the data are effectively safeguarded, without any center's data being exposed.
Brief Description of the Drawings

Figure 1 is a flowchart of the implementation of the feature importance ranking system based on a random forest algorithm in a multi-center mode of the present invention;

Figure 2 is a block diagram of the components of the system;

Figure 3 is a flowchart of the feature importance ranking inside each center's front-end processor;

Figure 4 is a flowchart of the global importance ranking inside the central server.

Detailed Description of the Embodiments

The present invention is described in further detail below with reference to the drawings and specific embodiments.
As shown in Figures 1 and 2, the present invention provides a feature importance ranking system based on a random forest algorithm in a multi-center mode, comprising: front-end processors deployed at each center participating in the collaborative computation; a central server that receives and integrates the feature importance ranking results of the centers; and a result display module that feeds the final feature importance ranking back to the user.

The front-end processor reads data from its center's database interface and uses the random forest algorithm to compute that center's feature importance ranking. As shown in Figure 3, the specific calculation steps are as follows:

A. Read data from the center's database interface as the sample set;

B. Randomly select n samples from the sample set with replacement (bootstrap) as a training set;

C. Grow a decision tree from the sampled training set; at every node of the tree, select d features at random without repetition and use these d features to split the training set;

D. Repeat steps B-C a total of q times, where q is the number of decision trees in the random forest;

E. Use the trained random forest to predict on the sample set;

F. Use the Gini index as the evaluation metric to rank the features by importance for the predictions of step E, via the following sub-steps:
a) Suppose the sample set has h features X_1, X_2, X_3, ..., X_h. For each feature X_j, compute its importance at node m, VIM_jm^(Gini), i.e., the change in the Gini index before and after node m is split:

VIM_jm^(Gini) = GI_m - GI_l - GI_r

where GI_m is the Gini index of node m before the split, and GI_l and GI_r are the Gini indices of the two new nodes l and r produced by the split. The Gini index of a node x is calculated as

GI_x = 1 - Σ_{k=1}^{K} p_xk^2

where K is the number of classes and p_xk is the proportion of class k at node x;
b) Let E be the set of nodes of decision tree i at which feature X_j appears; the importance of X_j in the i-th decision tree is then

VIM_ij^(Gini) = Σ_{m∈E} VIM_jm^(Gini);
c) Given the q trees of the random forest, compute the Gini score of each feature X_j, i.e., the average change in node-splitting impurity caused by the j-th feature over all decision trees of the forest:

VIM_j^(Gini) = (1/q) Σ_{i=1}^{q} VIM_ij^(Gini);
d) Normalize the Gini score of feature X_j over all h features:

VIM_j = VIM_j^(Gini) / Σ_{c=1}^{h} VIM_c^(Gini)

e) Sort the normalized Gini scores of all features in descending order.
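The per-node and normalization formulas of sub-steps a) and d) above can be sketched in plain Python. This is an illustrative sketch of the stated formulas only, not the patented front-end processor; the toy node labels and feature scores are invented for the example.

```python
from collections import Counter

def gini(labels):
    """Gini index of a node: GI_x = 1 - sum over classes k of p_xk^2."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def node_importance(parent, left, right):
    """Change in Gini index before and after a split (sub-step a):
    VIM_jm = GI_m - GI_l - GI_r."""
    return gini(parent) - gini(left) - gini(right)

def normalize(scores):
    """Sub-step d: divide each feature's score by the sum over all features."""
    total = sum(scores.values())
    return {f: s / total for f, s in scores.items()}

# Toy node: 4 positives and 4 negatives, split into two pure children.
parent = [1, 1, 1, 1, 0, 0, 0, 0]
left, right = [1, 1, 1, 1], [0, 0, 0, 0]
print(gini(parent))                          # 0.5
print(node_importance(parent, left, right))  # 0.5 (a pure split removes all impurity)
print(normalize({"X1": 0.5, "X2": 0.3, "X3": 0.2}))
```

Note that, as written in the document, the node importance is an unweighted difference of Gini indices; some implementations additionally weight the child terms by sample fractions.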
The central server computes the global feature importance ranking as shown in Figure 4, via the following sub-steps:

A. Receive the feature importance ranking results sent by each center;

B. For each feature, take the average of its Gini scores over all centers as its global feature importance value;

C. Re-rank the features by global feature importance value in descending order.
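The central server's aggregation in steps A-C above amounts to an element-wise average followed by a sort. A minimal sketch, with hypothetical feature names and per-center scores:

```python
def global_importance(center_scores):
    """Average each feature's Gini score over all centers (step B),
    then re-rank in descending order of the global value (step C)."""
    features = center_scores[0].keys()
    avg = {f: sum(c[f] for c in center_scores) / len(center_scores)
           for f in features}
    return sorted(avg.items(), key=lambda kv: kv[1], reverse=True)

# Two centers report normalized importance scores for the same features.
centers = [
    {"age": 0.5, "bmi": 0.3, "sex": 0.2},
    {"age": 0.4, "bmi": 0.4, "sex": 0.2},
]
print(global_importance(centers))  # "age" ranks first with global score 0.45
```

Only these small score dictionaries cross the network; no row-level data leaves a center, which is the privacy property the invention claims.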
The following specific example shows a feature importance ranking system, based on a random forest algorithm in a multi-center mode, for predicting diabetes risk from physical examination data. The system comprises: front-end processors deployed in each hospital participating in the collaborative computation; a central server that receives and integrates the feature importance ranking results of the hospitals; and a result display module that feeds the final feature importance ranking back to the user.

The front-end processor reads physical examination data from its hospital's database interface, uses the random forest algorithm to predict diabetes risk, and computes that hospital's ranking of feature importance for diabetes risk. The specific calculation steps are as follows:

A. Read physical examination data from the hospital's database interface as the sample set; suppose there are 5000 physical examination records in total;

B. Randomly select 70 samples from the sample set with replacement (bootstrap) as a training set;

C. Grow a decision tree from the sampled training set; at every node of the tree, select 7 features at random without repetition and use these 7 features to split the training set;

D. Repeat steps B-C a total of 15 times, 15 being the number of decision trees in the random forest;

E. Use the trained random forest to predict on the sample set;

F. Use the Gini index as the evaluation metric to rank the features by importance for the predictions of step E, via the following sub-steps:
a) Suppose the sample set has 50 features (age, gender, education level, waist circumference, blood type, systolic blood pressure, hemoglobin, and so on), denoted X_1, X_2, X_3, ..., X_50. For each feature X_j, compute its importance at node m, VIM_jm^(Gini), i.e., the change in the Gini index before and after node m is split:

VIM_jm^(Gini) = GI_m - GI_l - GI_r

where GI_m is the Gini index of node m before the split, and GI_l and GI_r are the Gini indices of the two new nodes l and r produced by the split. The Gini index of a node x is calculated as

GI_x = 1 - Σ_{k=1}^{K} p_xk^2

where K is the number of classes and p_xk is the proportion of class k at node x;
b) Let E be the set of nodes of decision tree i at which feature X_j appears; the importance of X_j in the i-th decision tree is then

VIM_ij^(Gini) = Σ_{m∈E} VIM_jm^(Gini);
c) Given the 15 trees of the random forest, compute the Gini score of each feature X_j, i.e., the average change in node-splitting impurity caused by the j-th feature over all decision trees of the forest:

VIM_j^(Gini) = (1/15) Σ_{i=1}^{15} VIM_ij^(Gini);
d) Normalize the Gini score of feature X_j over all 50 features:

VIM_j = VIM_j^(Gini) / Σ_{c=1}^{50} VIM_c^(Gini)

e) Sort the normalized Gini scores of all features in descending order.
The central server computes the global importance ranking of the features in the physical examination data that affect diabetes risk, via the following sub-steps:

A. Receive the feature importance ranking results sent by each hospital;

B. For each feature, take the average of its Gini scores over all hospitals as its global feature importance value. For example, the feature glycated hemoglobin has an importance score of 0.182483 at hospital A, 0.150948 at hospital B, and 0.078243 at hospital C, so its global feature importance value in the multi-center diabetes risk prediction study jointly carried out by hospitals A, B, and C is (0.182483 + 0.150948 + 0.078243) / 3 = 0.137224;

C. Re-rank the features by global feature importance value in descending order.
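The worked average in step B above can be checked directly. The three hospital scores are taken from the example in the text; note the text reports a truncated value (0.137224) where rounding would give 0.137225.

```python
# Per-hospital importance scores for glycated hemoglobin, from the example.
scores = {"hospital A": 0.182483, "hospital B": 0.150948, "hospital C": 0.078243}

global_value = sum(scores.values()) / len(scores)
print(global_value)  # ≈ 0.1372247 (reported in the text as 0.137224)
```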
本发明在每个站点计算基于基尼指数的局部变量重要性排序,并将其发送到中心服务器。中心服务器整合各个站点的变量重要性排序并计算得出最终的排序结果。在此过程中,中心服务器仅接收各站点的变量重要性排序结果,无需交换患者级别的数据,既得到了有效的全局解,又有效地保障了数据的安全性,为构建特征筛选模型提供了安全可靠高效的解决方案。The present invention calculates the local variable importance ranking based on the Gini index at each site and sends it to the central server. The central server integrates the variable importance ranking of each site and calculates the final ranking result. In this process, the central server only receives the variable importance ranking results of each site, and does not need to exchange patient-level data. This not only obtains an effective global solution, but also effectively guarantees the security of the data, which provides security for the construction of feature screening models. Reliable and efficient solution.
The above is merely an embodiment of the present invention and does not limit its scope of protection. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention without inventive effort falls within the scope of protection of the present invention.

Claims (1)

  1. A feature importance ranking system based on the random forest algorithm in a multi-center mode, characterized in that the system comprises: a front-end processor deployed at each center participating in the collaborative computation; a central server that receives and integrates the feature importance ranking results of each center; and a result display module that feeds the final feature importance ranking result back to the user.
    The front-end processor reads data from its center's database interface and computes that center's feature importance ranking with the random forest algorithm, in the following steps:
    A. Read data from the center's database interface as the sample set;
    B. Randomly draw n samples from the sample set, with replacement (bootstrap), to form a training set;
    C. Grow one decision tree on the sampled training set; at each node of the tree, select d features at random without repetition, and use these d features to split the training set;
    D. Repeat steps B-C a total of q times, q being the number of decision trees in the random forest;
    E. Use the trained random forest to predict on the sample set;
    F. Rank the features by importance, using the Gini index as the evaluation metric on the prediction results of step E, in the following sub-steps:
    a) Suppose the sample set has h features X_1, X_2, X_3, ..., X_h. For each feature X_j, compute its importance at node m, $VIM_{jm}^{(Gini)}$, i.e. the change in the Gini index before and after the split at node m:

    $$VIM_{jm}^{(Gini)} = GI_m - GI_l - GI_r$$

    where $GI_m$ is the Gini index of node m before the split, and $GI_l$ and $GI_r$ are the Gini indices of the two new nodes l and r after the split. The Gini index is computed as:

    $$GI_x = 1 - \sum_{k=1}^{K} p_{xk}^2$$

    where K is the number of classes and $p_{xk}$ is the proportion of class k at node x;
    b) Let E be the set of nodes at which feature X_j appears in decision tree i; the importance of X_j in the i-th decision tree, $VIM_{ij}^{(Gini)}$, is:

    $$VIM_{ij}^{(Gini)} = \sum_{m \in E} VIM_{jm}^{(Gini)}$$
    c) With q trees in the random forest, compute the Gini index score of each feature X_j, $VIM_{j}^{(Gini)}$, i.e. the average change in node-split impurity attributable to the j-th feature over all decision trees of the random forest:

    $$VIM_{j}^{(Gini)} = \frac{1}{q} \sum_{i=1}^{q} VIM_{ij}^{(Gini)}$$
    d) Normalize the Gini index score $VIM_{j}^{(Gini)}$ of feature X_j as follows:

    $$VIM_j = \frac{VIM_j^{(Gini)}}{\sum_{j'=1}^{h} VIM_{j'}^{(Gini)}}$$
    e) Sort all features' normalized Gini index scores in descending order.
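A compact sketch of sub-steps c)-e) above (the helper name `rank_features` and the per-tree numbers are hypothetical):

```python
# Hypothetical sketch of sub-steps c)-e): average per-tree importances over
# the q trees, normalize so the scores sum to 1, then sort descending.
def rank_features(per_tree_importances):
    """per_tree_importances: one {feature: VIM_ij} dict per decision tree."""
    q = len(per_tree_importances)
    features = per_tree_importances[0].keys()
    # c) Gini index score: mean over the q trees
    raw = {f: sum(tree[f] for tree in per_tree_importances) / q
           for f in features}
    # d) normalize by the sum over all h features
    total = sum(raw.values())
    # e) descending order
    return sorted(((f, v / total) for f, v in raw.items()),
                  key=lambda kv: -kv[1])

per_tree = [{"HbA1c": 0.3, "BMI": 0.1},
            {"HbA1c": 0.5, "BMI": 0.1}]
ranking = rank_features(per_tree)  # HbA1c ≈ 0.8, BMI ≈ 0.2
```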
    The computation of the global feature importance ranking by the central server comprises the following sub-steps:
    A. Receive the feature importance ranking results sent by each center;
    B. For each feature, take the mean of its Gini index scores across all centers as its global feature importance value;
    C. Re-order the features by global feature importance value in descending order.
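The node-level quantities of step F can be illustrated directly from the formulas above (a pure-Python sketch on hypothetical class labels; note that the claim's formula subtracts the child-node Gini indices unweighted, whereas many random forest implementations weight each child by its sample fraction):

```python
# Hypothetical sketch of the Gini index and the per-node importance
# VIM_jm = GI_m - GI_l - GI_r, exactly as written in the claim.
def gini(labels):
    """GI_x = 1 - sum over the K classes of p_xk^2, at a node holding labels."""
    n = len(labels)
    return 1.0 - sum((labels.count(k) / n) ** 2 for k in set(labels))

def node_importance(parent, left, right):
    """Change in Gini index before and after splitting node m."""
    return gini(parent) - gini(left) - gini(right)

# A node with 4 diabetic (1) and 4 non-diabetic (0) samples, split perfectly:
parent = [0, 0, 0, 0, 1, 1, 1, 1]
vim = node_importance(parent, [0, 0, 0, 0], [1, 1, 1, 1])  # 0.5: a pure split removes all impurity
```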
PCT/CN2020/083589 2019-07-12 2020-04-07 Multi-center mode random forest algorithm-based feature importance sorting system WO2020233259A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2021532354A JP7064681B2 (en) 2019-07-12 2020-04-07 Feature importance sorting system based on random forest algorithm in multi-center mode

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910629792.1A CN110728291B (en) 2019-07-12 2019-07-12 Feature importance ranking system based on random forest algorithm in multi-center mode
CN201910629792.1 2019-07-12

Publications (1)

Publication Number Publication Date
WO2020233259A1 true WO2020233259A1 (en) 2020-11-26






Also Published As

Publication number Publication date
JP2022508333A (en) 2022-01-19
CN110728291B (en) 2022-02-22
JP7064681B2 (en) 2022-05-11
CN110728291A (en) 2020-01-24


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20809900

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2021532354

Country of ref document: JP

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20809900

Country of ref document: EP

Kind code of ref document: A1
