JP7064681B2

JP7064681B2 - Feature importance sorting system based on random forest algorithm in multi-center mode

Info

Publication number: JP7064681B2
Application number: JP2021532354A
Authority: JP
Inventors: ▲勁▼松李; ▲豊▼ 王; 佩君胡; ▲瑩▼ ▲張▼; 子▲ユエ▼ ▲楊▼
Original assignee: 之江実験室
Priority date: 2019-07-12
Filing date: 2020-04-07
Publication date: 2022-05-11
Anticipated expiration: 2040-04-07
Also published as: JP2022508333A; CN110728291B; CN110728291A; WO2020233259A1

Description

The present invention belongs to the technical field of feature selection, and in particular relates to a feature importance sorting system based on a random forest algorithm in a multi-center mode.

Feature selection is the process of selecting the most effective features from a set of features in order to reduce the dimensionality of the feature space. It reduces the number of features and the dimensionality, improves a model's generalization performance, reduces overfitting, and improves understanding of the features and their values; it is one of the key problems in data science. In biomedicine, high-dimensional data such as omics datasets must frequently be processed, and the number of variables is usually far larger than the number of individuals, which makes feature selection especially important. Random forest is an ensemble learning algorithm widely used in biomedicine; it can estimate variable importance during classification and is regarded as an effective feature selection algorithm.

Multi-center collaborative computing is an application scenario that has emerged against the background of big data. It means that a group of geographically distributed parties cooperate, using computer and network technology, to accomplish a single task. Performing feature selection on multi-center data is one important problem in this setting, and the demand for collaborative computing across centers is growing.

In the conventional solution, the data of each center must be extracted and gathered on a center server, where feature selection is then performed to obtain a global feature selection result. However, extracting data from the centers carries many potential risks and may cause security problems such as data leakage, which greatly reduces the centers' willingness to take part in collaborative computing. In biomedicine in particular, the data of each center, that is, each hospital, contains private information about the patients treated there, so extracting the data for centralized processing works against the protection of patient privacy and carries high risk.

An object of the present invention is to address the shortcomings of the prior art and, in line with practical needs, to provide a feature importance sorting system based on a random forest algorithm in a multi-center mode that does not leak the data of any center. In this system, the data of each center always remains at that center; only intermediate model parameters, never the original data, are sent to the center server, and a secure and effective global feature importance sort result is finally obtained.

The object of the present invention is achieved by the following technical solution.
A feature importance sorting system based on a random forest algorithm in a multi-center mode comprises a front-end processor deployed at each center participating in the collaborative computation, a center server that receives and integrates the feature importance sort results of the centers, and a result display module that feeds the final feature importance sort result back to the user.
The front-end processor reads data from the database interface of its center and computes that center's feature importance sort result with a random forest algorithm. The specific calculation steps are as follows.
Step A: read data from the center's database interface as a sample set;
Step B: randomly select n samples from the sample set as one training set by the bootstrap method;
Step C: generate one decision tree from the sampled training set; at each node of the decision tree, randomly select d features without repetition and use these d features to partition the training set;
Step D: repeat steps B to C a total of q times, where q is the number of decision trees in the random forest;
Step E: predict the sample set with the trained random forest;
Step F: sort the features by importance for the prediction result of step E, using the Gini index as the evaluation index.
Step F comprises the following sub-steps.
Sub-step a): assuming the sample set has h features $X_1, X_2, \ldots, X_h$, compute for each feature $X_j$ its importance at node m, $VIM_{jm}$, that is, the change in the Gini index before and after node m is split:
$$VIM_{jm} = GI_m - GI_l - GI_r$$
where $GI_m$ is the Gini index of node m before the split, and $GI_l$ and $GI_r$ are the Gini indices of the two new nodes l and r after the split. The Gini index is computed as
$$GI_x = 1 - \sum_{k=1}^{K} p_{xk}^{2}$$
where K is the number of classes and $p_{xk}$ is the proportion of class k in node x.
Sub-step b): assuming the nodes at which feature $X_j$ appears in decision tree i form the set E, the importance of $X_j$ in the i-th decision tree is
$$VIM_{ij} = \sum_{m \in E} VIM_{jm}$$
Sub-step c): assuming the random forest has q decision trees, compute the Gini index score $VIM_j$ of each feature $X_j$, that is, the average change in node-split impurity of the j-th feature over all decision trees of the random forest:
$$VIM_j = \frac{1}{q} \sum_{i=1}^{q} VIM_{ij}$$
Sub-step d): normalize the Gini index score $VIM_j$ of feature $X_j$:
$$VIM_j^{\mathrm{norm}} = \frac{VIM_j}{\sum_{c=1}^{h} VIM_c}$$
Sub-step e): sort the normalized Gini index scores of all features in descending order.
The center server computes the global feature importance sort result through the following sub-steps.
Sub-step A: receive the feature importance sort result sent from each center;
Sub-step B: for each feature, take the mean of that feature's Gini index scores over all centers as the global feature importance value;
Sub-step C: re-sort the features in descending order of global feature importance value.
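Sub-steps a) to e) of step F can be sketched in plain Python. This is a minimal illustration, not the patented implementation: the function names and the representation of a tree as a list of `(feature_index, parent_labels, left_labels, right_labels)` splits are assumptions made for the example. The node importance follows the unweighted difference GI_m - GI_l - GI_r stated in the text.

```python
from collections import Counter


def gini_index(labels):
    """Gini index of a node: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def node_importance(parent, left, right):
    """Sub-step a): GI_m - GI_l - GI_r for one split (unweighted, as in the text)."""
    return gini_index(parent) - gini_index(left) - gini_index(right)


def tree_importance(splits, h):
    """Sub-step b): sum the node importances of each feature within one tree.

    `splits` is a hypothetical representation: one tuple
    (feature_index, parent_labels, left_labels, right_labels) per split node.
    """
    importance = [0.0] * h
    for j, parent, left, right in splits:
        importance[j] += node_importance(parent, left, right)
    return importance


def forest_scores(trees, h):
    """Sub-step c): average each feature's per-tree importance over q trees."""
    q = len(trees)
    totals = [0.0] * h
    for splits in trees:
        for j, value in enumerate(tree_importance(splits, h)):
            totals[j] += value
    return [total / q for total in totals]


def normalize(scores):
    """Sub-step d): divide each score by the sum of all scores."""
    total = sum(scores)
    return [s / total for s in scores] if total else list(scores)


def rank_features(scores):
    """Sub-step e): feature indices in descending order of score."""
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
```

Because the difference is unweighted, a weak split can yield a negative node importance; many libraries instead weight the child impurities by node size, which is a different formula from the one described here.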

The beneficial effects of the present invention are as follows.
Based on a multi-center random forest algorithm, the present invention computes a feature importance sort result at each center and integrates the per-center results at the center server to form a global feature importance sort result. Because the data of each center always remains at that center, and only intermediate model parameters, never the original data, are sent to the center server, no center's data is leaked, and both data security and the privacy of the individuals contained in the data are effectively protected.

FIG. 1 is an implementation flowchart of the feature importance sorting system based on a random forest algorithm in a multi-center mode according to the present invention.
FIG. 2 is a structural block diagram of the feature importance sorting system based on a random forest algorithm in a multi-center mode according to the present invention.
FIG. 3 is a flowchart of feature importance sorting in the front-end processor of each center.
FIG. 4 is a flowchart of global importance sorting in the center server.

The present invention is described in more detail below through specific embodiments with reference to the drawings.

As shown in FIGS. 1 and 2, the feature importance sorting system based on a random forest algorithm in a multi-center mode according to the present invention comprises a front-end processor deployed at each center participating in the collaborative computation, a center server that receives and integrates the feature importance sort results of the centers, and a result display module that feeds the final feature importance sort result back to the user.

The front-end processor reads data from the database interface of its center and computes that center's feature importance sort result with a random forest algorithm. As shown in FIG. 3, the specific calculation steps are as follows.
Step A: read data from the center's database interface as a sample set;
Step B: randomly select n samples from the sample set as one training set by the bootstrap method;
Step C: generate one decision tree from the sampled training set; at each node of the decision tree, randomly select d features without repetition and use these d features to partition the training set;
Step D: repeat steps B to C a total of q times, where q is the number of decision trees in the random forest;
Step E: predict the sample set with the trained random forest;
Step F: sort the features by importance for the prediction result of step E, using the Gini index as the evaluation index.
Step F comprises the following sub-steps.
Sub-step a): assuming the sample set has h features $X_1, X_2, \ldots, X_h$, compute for each feature $X_j$ its importance at node m as the change in the Gini index before and after the split, $VIM_{jm} = GI_m - GI_l - GI_r$, where the Gini index of a node x is $GI_x = 1 - \sum_{k=1}^{K} p_{xk}^{2}$, K is the number of classes, and $p_{xk}$ is the proportion of class k in node x.
Sub-step b): with E the set of nodes at which feature $X_j$ appears in decision tree i, the importance of $X_j$ in the i-th decision tree is $VIM_{ij} = \sum_{m \in E} VIM_{jm}$.
Sub-step c): with q decision trees in the random forest, the Gini index score of each feature $X_j$ is
$$VIM_j = \frac{1}{q} \sum_{i=1}^{q} VIM_{ij}$$
Sub-step d): normalize the Gini index score $VIM_j$ of feature $X_j$:
$$VIM_j^{\mathrm{norm}} = \frac{VIM_j}{\sum_{c=1}^{h} VIM_c}$$
Sub-step e): sort the normalized Gini index scores of all features in descending order.
As shown in FIG. 4, the center server computes the global feature importance sort result through the following sub-steps.
Sub-step A: receive the feature importance sort result sent from each center;
Sub-step B: for each feature, take the mean of that feature's Gini index scores over all centers as the global feature importance value;
Sub-step C: re-sort the features in descending order of global feature importance value.
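Steps B to D of the front-end processor, described above, rest on two sampling operations: drawing a bootstrap training set and picking d distinct candidate features at a node. The sketch below illustrates only these two operations with Python's standard `random` module; the function names and the fixed seed are assumptions for the example, and tree growing itself is left out.

```python
import random


def bootstrap_sample(sample_set, n, rng):
    """Step B: draw n samples from the sample set with replacement."""
    return [rng.choice(sample_set) for _ in range(n)]


def candidate_features(h, d, rng):
    """Step C: pick d distinct feature indices out of h at a tree node."""
    return rng.sample(range(h), d)


def draw_training_sets(sample_set, n, q, seed=0):
    """Step D: repeat the bootstrap draw q times, once per decision tree.

    The seed is a hypothetical parameter added only for reproducibility.
    """
    rng = random.Random(seed)
    return [bootstrap_sample(sample_set, n, rng) for _ in range(q)]
```

A call such as `draw_training_sets(records, n, q)` returns q training sets of n samples each; a sample may appear more than once in a set, whereas the d candidate features at a node never repeat.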

The following is one specific example: a feature importance sorting system, based on a random forest algorithm in a multi-center mode, for predicting diabetes risk from physical examination data. The system comprises a front-end processor deployed in each hospital participating in the collaborative computation, a center server that receives and integrates the feature importance sort results of the hospitals, and a result display module that feeds the final feature importance sort result back to the user.

The front-end processor reads physical examination data from the database interface of its hospital, predicts diabetes risk with a random forest algorithm, and computes the feature importance sort result for diabetes risk within that hospital. The specific calculation steps are as follows.
Step A: read physical examination data from the hospital's database interface as a sample set, assumed here to contain 5000 physical examination records in total;
Step B: randomly select 70 samples from the sample set as one training set by the bootstrap method;
Step C: generate one decision tree from the sampled training set; at each node of the decision tree, randomly select 7 features without repetition and use these 7 features to partition the training set;
Step D: repeat steps B to C a total of 15 times, where 15 is the number of decision trees in the random forest;
Step E: predict the sample set with the trained random forest;
Step F: sort the features by importance for the prediction result of step E, using the Gini index as the evaluation index.
Step F comprises the following sub-steps.
Sub-step a): assume the sample set has 50 features, such as age, sex, education level, waist circumference, blood type, systolic blood pressure, and hemoglobin, denoted $X_1, X_2, \ldots, X_{50}$. For each feature $X_j$, compute its importance at node m as the change in the Gini index before and after the split, $VIM_{jm} = GI_m - GI_l - GI_r$, where the Gini index of a node x is $GI_x = 1 - \sum_{k=1}^{K} p_{xk}^{2}$.
Sub-step b): with E the set of nodes at which feature $X_j$ appears in decision tree i, the importance of $X_j$ in the i-th decision tree is $VIM_{ij} = \sum_{m \in E} VIM_{jm}$.
Sub-step c): the random forest is known to have 15 decision trees, so the Gini index score of each feature $X_j$ is
$$VIM_j = \frac{1}{15} \sum_{i=1}^{15} VIM_{ij}$$
Sub-step d): normalize the Gini index score $VIM_j$ of feature $X_j$:
$$VIM_j^{\mathrm{norm}} = \frac{VIM_j}{\sum_{c=1}^{50} VIM_c}$$
Sub-step e): sort the normalized Gini index scores of all features in descending order.
The center server computes the global sort result of the importance of the features that affect diabetes risk in the physical examination data, through the following sub-steps.
Sub-step A: receive the feature importance sort result sent from each hospital;
Sub-step B: for each feature, take the mean of that feature's Gini index scores over all hospitals as the global feature importance value. For example, if the feature glycated hemoglobin has a feature importance score of 0.182483 at hospital 1, 0.150948 at hospital 2, and 0.078243 at hospital 3, then its global feature importance value in the multi-center study on predicting diabetes risk from physical examination data conducted jointly by hospitals 1, 2, and 3 is (0.182483 + 0.150948 + 0.078243) / 3 ≈ 0.137224;
Sub-step C: re-sort the features in descending order of global feature importance value.
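The center-server sub-steps A to C can be sketched as a short aggregation routine. The glycated hemoglobin scores below are the ones given in the example above; the waist circumference scores, the dictionary layout, and the function name are hypothetical and serve only to show the re-sorting. The sketch also assumes every hospital reports the same set of features.

```python
def global_importance(center_scores):
    """Average each feature's score over all centers and sort descending.

    `center_scores` is a list with one dict per center, mapping a feature
    name to that center's Gini index score (hypothetical layout).
    """
    features = center_scores[0].keys()
    n_centers = len(center_scores)
    means = {
        name: sum(scores[name] for scores in center_scores) / n_centers
        for name in features
    }
    return sorted(means.items(), key=lambda item: item[1], reverse=True)


hospital_scores = [
    {"glycated_hemoglobin": 0.182483, "waist_circumference": 0.05},  # hospital 1
    {"glycated_hemoglobin": 0.150948, "waist_circumference": 0.06},  # hospital 2
    {"glycated_hemoglobin": 0.078243, "waist_circumference": 0.04},  # hospital 3
]
ranking = global_importance(hospital_scores)
```

Only the per-feature scores cross the hospital boundary in this sketch; no patient-level record appears in `hospital_scores`. The mean for glycated hemoglobin is 0.411674 / 3, which is 0.137224 when truncated to six decimal places.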

The present invention computes, at each site, a local variable importance sort based on the Gini index and sends it to the center server. The center server integrates the variable importance sorts of all sites to obtain the final sort result. In this process the center server receives only each site's variable importance sort result, and no patient-level data needs to be exchanged. The approach therefore obtains an effective global solution while effectively ensuring data security, providing a safe, reliable, and efficient solution for building feature selection models.

The above is merely an embodiment of the present invention and is not intended to limit its scope of protection. Any modification, equivalent substitution, or improvement made within the spirit and principles of the present invention without creative effort shall fall within the scope of protection of the present invention.

Claims

A feature importance sorting system based on a random forest algorithm in a multi-center mode, comprising
a front-end processor deployed at each center participating in the collaborative computation, a center server that receives and integrates the feature importance sort results of the centers, and a result display module that feeds the final feature importance sort result back to the user,
wherein the front-end processor reads data from the database interface of its center and computes that center's feature importance sort result with a random forest algorithm, the specific calculation steps comprising:
step A of reading data from the center's database interface as a sample set;
step B of randomly selecting n samples from the sample set as one training set by the bootstrap method;
step C of generating one decision tree from the sampled training set, randomly selecting d features without repetition at each node of the decision tree, and using these d features to partition the training set;
step D of repeating steps B to C a total of q times, q being the number of decision trees in the random forest;
step E of predicting the sample set with the trained random forest; and
step F of sorting the features by importance for the prediction result of step E, using the Gini index as the evaluation index,
wherein step F comprises:
sub-step a) of, where the sample set has h features $X_1, X_2, \ldots, X_h$, computing for each feature $X_j$ its importance at node m, $VIM_{jm}$, that is, the change in the Gini index before and after node m is split, by
$$VIM_{jm} = GI_m - GI_l - GI_r$$
where $GI_m$ is the Gini index of node m before the split, $GI_l$ and $GI_r$ are the Gini indices of the two new nodes l and r after the split, and the Gini index is computed as
$$GI_x = 1 - \sum_{k=1}^{K} p_{xk}^{2}$$
where K is the number of classes and $p_{xk}$ is the proportion of class k in node x;
sub-step b) in which, where the nodes at which feature $X_j$ appears in decision tree i form the set E, the importance of $X_j$ in the i-th decision tree is
$$VIM_{ij} = \sum_{m \in E} VIM_{jm}$$
sub-step c) of, where the random forest has q decision trees, computing the Gini index score of each feature $X_j$ by
$$VIM_j = \frac{1}{q} \sum_{i=1}^{q} VIM_{ij}$$
the Gini index score $VIM_j$ of feature $X_j$

is normalized, the formula being