A feature importance ranking system based on a random forest algorithm in a multi-center mode
Technical Field
The present invention belongs to the technical field of feature selection, and in particular relates to a feature importance ranking system based on a random forest algorithm in a multi-center mode.
Background Art
Feature selection is the process of selecting the most effective features from a set of features in order to reduce the dimensionality of the feature space. Feature selection reduces the number of features and the dimensionality of the data, improves the generalization ability of a model, reduces over-fitting, and deepens the understanding of the features and their values; it is one of the key problems in data science. In the biomedical field it is often necessary to process high-dimensional data such as omics data sets, in which the number of variables is usually far larger than the number of individuals, and in such cases feature selection is particularly important. Random forest is an ensemble learning algorithm that is widely used in biomedicine; it can provide estimates of variable importance during classification and is therefore regarded as an effective feature selection algorithm.
Multi-center collaborative data computing is an application scenario that has emerged in the context of big data: a geographically dispersed group of parties uses computer and network technology to cooperate in completing a common task. Feature selection over multi-center data is one of the important problems in this setting, and the demand for collaborative computation across centers keeps growing.
Existing solutions need to extract the data of each center and gather it on a central server, where feature selection is then performed to obtain a global feature selection result. However, moving data out of the individual centers carries serious risks, including security problems such as data leakage, which greatly dampens the centers' willingness to participate in collaborative computing. This is especially true in biomedicine, where the data of each center, that is, each hospital, contains the personal privacy of its patients; extracting the data for centralized processing is unfavorable to protecting patient privacy and entails considerable risk.
Summary of the Invention
In view of the shortcomings of the prior art and in response to practical needs, the purpose of the present invention is to provide a feature importance ranking system based on a random forest algorithm in a multi-center mode that does not expose the data of any center. In this system the data of each center always remains at that center; only the intermediate parameters of the model, and never the original data, are transmitted to the central server, and a safe and effective global feature importance ranking result is finally obtained.
The purpose of the present invention is achieved through the following technical solution: a feature importance ranking system based on a random forest algorithm in a multi-center mode, the system comprising: front-end processors deployed at each center participating in the collaborative computation; a central server that receives and integrates the feature importance ranking results of the centers; and a result display module that feeds the final feature importance ranking result back to the user.
The front-end processor is used to read data from the database interface of its center and to compute the feature importance ranking result of that center with the random forest algorithm; the specific computation steps are as follows:
A. Read data from the database interface of the center as the sample set;
B. Randomly select n samples from the sample set with replacement (bootstrap) to form a training set;
C. Grow one decision tree from the sampled training set; at every node of the decision tree, select d features at random without repetition and use these d features to split the training set;
D. Repeat steps B-C a total of q times, where q is the number of decision trees in the random forest;
E. Use the trained random forest to predict the sample set;
F. Use the Gini index as the evaluation criterion to rank the features by importance for the prediction of step E, comprising the following sub-steps:
a) Suppose the sample set has h features X_1, X_2, X_3, ..., X_h. For each feature X_j, compute the importance of X_j at node m, denoted VIM_{jm}^{(Gini)}, i.e., the change in the Gini index before and after node m is split:

VIM_{jm}^{(Gini)} = GI_m - GI_l - GI_r

where GI_m denotes the Gini index of node m before the split, and GI_l and GI_r denote the Gini indices of the two new nodes l and r after the split. The Gini index of a node x is computed as

GI_x = 1 - \sum_{k=1}^{K} p_{xk}^2

where K is the number of classes and p_{xk} is the proportion of class k in node x;
b) Suppose the nodes at which feature X_j appears in decision tree i form the set E. Then the importance of X_j in the i-th decision tree, VIM_{ij}^{(Gini)}, is

VIM_{ij}^{(Gini)} = \sum_{m \in E} VIM_{jm}^{(Gini)};
c) Given that the random forest contains q trees, compute the Gini index score of each feature X_j, denoted VIM_j^{(Gini)}, i.e., the average change in node-splitting impurity caused by the j-th feature over all decision trees of the random forest:

VIM_j^{(Gini)} = \frac{1}{q} \sum_{i=1}^{q} VIM_{ij}^{(Gini)};
d) Normalize the Gini index score of feature X_j:

VIM_j^{(norm)} = \frac{VIM_j^{(Gini)}}{\sum_{j'=1}^{h} VIM_{j'}^{(Gini)}};
e) Sort the normalized Gini index scores of all features in descending order (a minimal code sketch of this per-center computation is given after this list).
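To make steps A-F concrete, the following Python sketch illustrates one way a front-end processor could compute the local scores. It is only an illustration, not the claimed implementation: the function name local_feature_scores is hypothetical, X and y stand for the feature matrix and labels already read from the center's database interface, and NumPy and scikit-learn are assumed to be available. The per-node change is accumulated exactly as defined in sub-step a), without the per-node sample weighting used by some library implementations.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def local_feature_scores(X, y, n, d, q, seed=0):
    """Grow q trees on bootstrap samples of size n, with d candidate features
    per split, and return the normalized Gini index score of every feature."""
    rng = np.random.default_rng(seed)
    h = X.shape[1]
    vim = np.zeros(h)                                  # accumulated VIM_j over trees
    for _ in range(q):                                 # step D: q trees in total
        idx = rng.integers(0, len(X), size=n)          # step B: bootstrap sample
        tree = DecisionTreeClassifier(criterion="gini", max_features=d)
        tree.fit(X[idx], y[idx])                       # step C: grow one tree
        t = tree.tree_
        for m in range(t.node_count):                  # sub-step a): per-node change
            j = t.feature[m]
            if j < 0:                                  # skip leaf nodes
                continue
            l, r = t.children_left[m], t.children_right[m]
            vim[j] += t.impurity[m] - t.impurity[l] - t.impurity[r]   # GI_m - GI_l - GI_r
    vim /= q                                           # sub-step c): average over q trees
    return vim / vim.sum()                             # sub-step d): normalization

# sub-step e): feature indices in descending order of importance
# ranking = np.argsort(-local_feature_scores(X, y, n, d, q))

Note that sub-steps a) and b) are folded into the inner loop: summing the per-node changes over all internal nodes of a tree at which X_j is used is equivalent to first forming the set E for each tree.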
The central server computes the global feature importance ranking result through the following sub-steps:
A. Receive the feature importance ranking results transmitted by the centers;
B. For each feature, take the mean of the feature's Gini index scores over all centers as its global feature importance value;
C. Re-order the features in descending order of their global feature importance values (a sketch of this aggregation follows).
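A minimal sketch of the central server's integration step is given below. The dictionary-based message format and the name aggregate_global_ranking are assumptions made for illustration; the only requirement is that each center transmits its vector of normalized scores and never any raw records.

def aggregate_global_ranking(center_scores):
    """center_scores: one dict {feature_name: normalized Gini index score}
    per participating center. Returns (feature, global importance) pairs
    sorted in descending order of the mean score across centers."""
    features = center_scores[0].keys()
    global_scores = {f: sum(s[f] for s in center_scores) / len(center_scores)
                     for f in features}
    return sorted(global_scores.items(), key=lambda kv: kv[1], reverse=True)

# Example call with three centers reporting two features each:
# aggregate_global_ranking([{"HbA1c": 0.18, "age": 0.05},
#                           {"HbA1c": 0.15, "age": 0.07},
#                           {"HbA1c": 0.08, "age": 0.06}])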
The beneficial effects of the present invention are as follows: based on a multi-center random forest algorithm, the present invention computes a feature importance ranking result at each center separately, and the central server integrates the ranking results of the centers into a global feature importance ranking result. The data of each center never leaves that center; only the intermediate parameters of the model, and never the original data, are transmitted to the central server, which effectively safeguards data security and the personal privacy contained in the data.
Brief Description of the Drawings
Figure 1 is a flowchart of the implementation of the feature importance ranking system based on a random forest algorithm in a multi-center mode according to the present invention;
Figure 2 is a block diagram of the composition of the feature importance ranking system based on a random forest algorithm in a multi-center mode according to the present invention;
Figure 3 is a flowchart of the feature importance ranking performed in the front-end processor of each center;
Figure 4 is a flowchart of the global importance ranking performed in the central server.
Detailed Description of the Embodiments
The present invention is described in further detail below with reference to the drawings and specific embodiments.
As shown in Figures 1 and 2, the present invention provides a feature importance ranking system based on a random forest algorithm in a multi-center mode, the system comprising: front-end processors deployed at each center participating in the collaborative computation; a central server that receives and integrates the feature importance ranking results of the centers; and a result display module that feeds the final feature importance ranking result back to the user.
The front-end processor is used to read data from the database interface of its center and to compute the feature importance ranking result of that center with the random forest algorithm, as shown in Figure 3; the specific computation steps are as follows:
A. Read data from the database interface of the center as the sample set;
B. Randomly select n samples from the sample set with replacement (bootstrap) to form a training set;
C. Grow one decision tree from the sampled training set; at every node of the decision tree, select d features at random without repetition and use these d features to split the training set;
D. Repeat steps B-C a total of q times, where q is the number of decision trees in the random forest;
E. Use the trained random forest to predict the sample set;
F. Use the Gini index as the evaluation criterion to rank the features by importance for the prediction of step E, comprising the following sub-steps:
a) Suppose the sample set has h features X_1, X_2, X_3, ..., X_h. For each feature X_j, compute the importance of X_j at node m, denoted VIM_{jm}^{(Gini)}, i.e., the change in the Gini index before and after node m is split:

VIM_{jm}^{(Gini)} = GI_m - GI_l - GI_r

where GI_m denotes the Gini index of node m before the split, and GI_l and GI_r denote the Gini indices of the two new nodes l and r after the split. The Gini index of a node x is computed as

GI_x = 1 - \sum_{k=1}^{K} p_{xk}^2

where K is the number of classes and p_{xk} is the proportion of class k in node x;
b) Suppose the nodes at which feature X_j appears in decision tree i form the set E. Then the importance of X_j in the i-th decision tree, VIM_{ij}^{(Gini)}, is

VIM_{ij}^{(Gini)} = \sum_{m \in E} VIM_{jm}^{(Gini)};
c) Given that the random forest contains q trees, compute the Gini index score of each feature X_j, denoted VIM_j^{(Gini)}, i.e., the average change in node-splitting impurity caused by the j-th feature over all decision trees of the random forest:

VIM_j^{(Gini)} = \frac{1}{q} \sum_{i=1}^{q} VIM_{ij}^{(Gini)};
d) Normalize the Gini index score of feature X_j:

VIM_j^{(norm)} = \frac{VIM_j^{(Gini)}}{\sum_{j'=1}^{h} VIM_{j'}^{(Gini)}};
e) Sort the normalized Gini index scores of all features in descending order.
The central server computes the global feature importance ranking result, as shown in Figure 4, through the following sub-steps:
A. Receive the feature importance ranking results transmitted by the centers;
B. For each feature, take the mean of the feature's Gini index scores over all centers as its global feature importance value;
C. Re-order the features in descending order of their global feature importance values. The end-to-end data flow between the two modules is sketched below.
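The following hedged sketch summarizes that data flow. The transport functions (send_to_server, receive_all_scores) are placeholders rather than a real API; the point is simply that only the intermediate parameters of the model cross the network, while the raw samples stay inside each center.

def run_front_end(read_local_data, compute_local_scores, send_to_server):
    X, y = read_local_data()              # raw data stays inside the center
    scores = compute_local_scores(X, y)   # e.g. the per-center sketch shown earlier
    send_to_server(scores)                # only the score vector leaves the center

def run_central_server(receive_all_scores, aggregate, display_results):
    center_scores = receive_all_scores()  # one score vector per center
    display_results(aggregate(center_scores))   # global ranking shown to the user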
The following is a specific example, which shows a feature importance ranking system based on a random forest algorithm in a multi-center mode for predicting the risk of diabetes from physical examination data. The system comprises: front-end processors deployed inside each hospital participating in the collaborative computation; a central server that receives and integrates the feature importance ranking results of the hospitals; and a result display module that feeds the final feature importance ranking result back to the user.
The front-end processor is used to read physical examination data from the database interface of its hospital, to predict the risk of diabetes with the random forest algorithm, and to compute the diabetes-risk feature importance ranking result within that hospital; the specific computation steps are as follows:
A. Read physical examination data from the database interface of the hospital as the sample set; assume there are 5000 physical examination records in total;
B. Randomly select 70 samples from the sample set with replacement (bootstrap) to form a training set;
C. Grow one decision tree from the sampled training set; at every node of the decision tree, select 7 features at random without repetition and use these 7 features to split the training set;
D. Repeat steps B-C a total of 15 times; 15 is the number of decision trees in the random forest;
E. Use the trained random forest to predict the sample set;
F. Use the Gini index as the evaluation criterion to rank the features by importance for the prediction of step E, comprising the following sub-steps:
a) Suppose the sample set has 50 features (age, gender, education level, waist circumference, blood type, systolic blood pressure, hemoglobin, and so on), denoted X_1, X_2, X_3, ..., X_50. For each feature X_j, compute the importance of X_j at node m, denoted VIM_{jm}^{(Gini)}, i.e., the change in the Gini index before and after node m is split:

VIM_{jm}^{(Gini)} = GI_m - GI_l - GI_r

where GI_m denotes the Gini index of node m before the split, and GI_l and GI_r denote the Gini indices of the two new nodes l and r after the split. The Gini index of a node x is computed as

GI_x = 1 - \sum_{k=1}^{K} p_{xk}^2

where K is the number of classes and p_{xk} is the proportion of class k in node x;
b) Suppose the nodes at which feature X_j appears in decision tree i form the set E. Then the importance of X_j in the i-th decision tree, VIM_{ij}^{(Gini)}, is

VIM_{ij}^{(Gini)} = \sum_{m \in E} VIM_{jm}^{(Gini)};
c) Given that the random forest contains 15 trees, compute the Gini index score of each feature X_j, denoted VIM_j^{(Gini)}, i.e., the average change in node-splitting impurity caused by the j-th feature over all decision trees of the random forest:

VIM_j^{(Gini)} = \frac{1}{15} \sum_{i=1}^{15} VIM_{ij}^{(Gini)};
d) Normalize the Gini index score of feature X_j:

VIM_j^{(norm)} = \frac{VIM_j^{(Gini)}}{\sum_{j'=1}^{50} VIM_{j'}^{(Gini)}};
e) Sort the normalized Gini index scores of all features in descending order (see the library-based sketch after this list).
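For the concrete configuration of this example (70-sample bootstraps, 7 candidate features per split, 15 trees), an approximately equivalent forest can also be fitted with a standard library, as sketched below. This is only an approximation of the procedure described above: scikit-learn's feature_importances_ weights each node's impurity decrease by the fraction of samples reaching that node, so its values will not exactly equal the unweighted scores of sub-steps a)-d). Here X and y stand for the assumed 5000-by-50 physical-examination matrix and the diabetes labels.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=15,     # q = 15 trees
    max_features=7,      # d = 7 candidate features per split
    max_samples=70,      # n = 70 samples per bootstrap (requires bootstrap=True)
    bootstrap=True,
    criterion="gini",
    random_state=0,
)
rf.fit(X, y)                              # X: 5000 x 50 examination features, y: labels
scores = rf.feature_importances_          # normalized Gini-based importances
ranking = np.argsort(-scores)             # feature indices in descending importance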
The central server computes the global ranking of the features in the physical examination data that influence the risk of diabetes, through the following sub-steps:
A. Receive the feature importance ranking results transmitted by the hospitals;
B. For each feature, take the mean of the feature's Gini index scores over all hospitals as its global feature importance value. For example, for the feature glycated hemoglobin, if its feature importance score is 0.182483 in hospital A, 0.150948 in hospital B, and 0.078243 in hospital C, then its global feature importance value in the multi-center diabetes risk prediction study jointly carried out by hospitals A, B, and C is (0.182483 + 0.150948 + 0.078243) / 3 ≈ 0.137224;
C. Re-order the features in descending order of their global feature importance values. This averaging is checked in the short snippet below.
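The averaging in step B can be verified directly; the two lines below reproduce the glycated hemoglobin figure quoted above (the exact mean is 0.1372246..., which is 0.137224 when truncated to six decimal places).

hba1c_scores = [0.182483, 0.150948, 0.078243]               # hospitals A, B and C
global_importance = sum(hba1c_scores) / len(hba1c_scores)   # 0.1372246...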
In the present invention, each site computes a local variable importance ranking based on the Gini index and sends it to the central server. The central server integrates the variable importance rankings of the sites and computes the final ranking result. In this process the central server receives only the variable importance ranking results of the sites, and no patient-level data needs to be exchanged. An effective global solution is obtained while the security of the data is effectively safeguarded, providing a safe, reliable, and efficient solution for building feature screening models.
The above is merely an embodiment of the present invention and is not intended to limit the scope of protection of the present invention. Any modification, equivalent replacement, improvement, etc. made within the spirit and principles of the present invention without creative effort falls within the scope of protection of the present invention.