JP6240804B1

JP6240804B1 - Filtered feature selection algorithm based on improved information measurement and GA

Info

Publication number: JP6240804B1
Application number: JP2017130667A
Authority: JP
Inventors: 魏小鵬; 張強; 周昌軍; 韓麗君
Original assignee: 大連大学
Priority date: 2017-04-13
Filing date: 2017-07-03
Publication date: 2017-11-29
Anticipated expiration: 2037-07-03
Also published as: JP2018181290A

Abstract

PROBLEM TO BE SOLVED: To design a filter-type feature selection algorithm (IMGA) based on improved information measurement and a genetic algorithm (GA), in which the fitness function of the genetic algorithm is constructed from information measures and the genetic algorithm is adapted to feature selection on gene expression data so that an optimal gene subset can be searched for, with an SVM used as the classifier to evaluate the classification performance of the selected feature subset. SOLUTION: Compared with conventional methods, the proposed method is an effective feature selection method and achieves better classification performance. SELECTED DRAWING: Figure 1

Description

The present invention relates to a feature selection method. Specifically, a fitness function for a genetic algorithm is constructed from information measures, the genetic algorithm is adapted so that it can be applied to feature selection on gene expression data, and an optimal gene subset is searched for. The invention belongs to the field of tumor gene expression data analysis.

Accurate diagnosis of tumor type is important for the clinical treatment of tumors. With the continued development of microarray technology, a large amount of tumor gene expression data has been acquired, but only a small number of genes are intrinsically related to the sample category. This makes tumor classification more convenient and, at the same time, raises new challenges. How to analyze tumor gene expression profiling data effectively and select feature genes that are strongly related to the classification is therefore very important.

Feature selection is one effective approach. Feature selection (gene selection) aims to identify the genes most useful for tumor classification and to remove as many irrelevant genes as possible. Removing noise genes improves the performance and efficiency of the classification model and reduces overfitting. Over the past few decades, many researchers have studied gene selection and developed a number of effective methods, which fall into three broad types according to how the classifier is used: filter methods, wrapper methods, and embedded methods. Filter methods quickly discard redundant information by evaluating, against some criterion, the relevance between features and categories or the relationships among features. M. Dash and H. Liu divide the conventional criteria into five types: distance measures, information measures, dependency measures, consistency measures, and misclassification-rate measures. The advantages of filter methods are that they do not depend on a classifier and are fast to compute. Representative methods include SNR, Relief, and mRMR. Wrapper methods use the recognition rate of a classifier as the index and search for the feature subset that maximizes it; common search strategies include sequential forward selection (SFS), heuristic search, and the genetic algorithm (GA). Embedded methods perform feature selection during the training of a specific classifier, using some property of the classifier as the criterion for feature quality. Representative methods include SVM-RFE and Random Forest. Wrapper and embedded methods achieve higher classification accuracy than filter methods, but they depend on a classifier. Searching for the optimal feature subset with a wrapper method is NP-hard and prone to overfitting. Embedded methods require knowledge of the classifier's objective function in order to raise or lower the influence of each feature on it, so they are tied to a specific classifier and have high time complexity.

The genetic algorithm is an intelligent search algorithm that simulates natural selection and natural genetic mechanisms in the biosphere. First proposed by Professor Holland in 1975, it is known as the simple genetic algorithm (SGA). Because it supports global, parallel search and is simple, general-purpose, and robust, it is widely applied in fields such as computer science, artificial intelligence, and automatic control. Feature selection is a typical combinatorial optimization problem, and the genetic algorithm, as a global search optimization algorithm, can search for an optimal combination of features. In 1989, Sklansky applied a genetic algorithm to feature selection and obtained good results. However, conventional genetic algorithms generally use the accuracy of a classifier as the objective function for the feature subset search, which increases the computational complexity.

A feature selection method for tumor classification aims to select, from the original genes, those that are strongly related to the tumor category, and to obtain as high a sample classification accuracy as possible with as few informative genes as possible. The present invention proposes using a filter-type feature selection algorithm based on information measurement and a genetic algorithm for tumor classification. To speed up the search, the Bhattacharyya distance is first used to quickly discard a large number of irrelevant genes, leaving 150 feature genes. Next, the objective function of the genetic algorithm is constructed from information measures to search for the optimal subset, and the genetic algorithm is adapted so that it can be applied to gene selection. Finally, a support vector machine is used to classify with the selected optimal gene subset. The method was evaluated experimentally on three public cancer data sets, and the results show that it achieves high classification accuracy with few informative genes.

Japanese Translation of PCT Application Publication No. 2011-526783

An object of the present invention is to propose a filter-type feature selection algorithm based on improved information measurement and GA (the IMGA algorithm). The fitness function of the genetic algorithm is constructed from information measures, the genetic algorithm is adapted so that it can be applied to feature selection on gene expression data, the optimal gene subset is searched for, and finally the data are classified with a support vector machine. In other words, when the filter-type feature selection algorithm based on improved information measurement and GA is used for tumor classification, informative genes can be selected, and a small feature subset can stand in for the original gene data while achieving higher classification accuracy.

In the technical solution of the present invention, a candidate subset of 150 genes is first screened out using the Bhattacharyya distance, which keeps informative genes that are strongly related to the category and removes a large number of irrelevant genes. Next, the filter-type feature selection algorithm based on improved information measurement and GA (the IMGA algorithm) selects the most relevant feature gene subset from these 150 candidates. Finally, the feature subset data are normalized to facilitate data analysis and are classified with a support vector machine. The performance of the method is verified experimentally on public tumor data sets.

Compared with the prior art, the present invention has the following advantages.
1. The relationship between genes and categories is judged independently of classifier performance, so the informative genes obtained are more reliable and effective.
2. The number of informative genes found by the IMGA algorithm can be controlled freely, which makes it possible to subsequently study how the number of genes affects classification accuracy.
3. The IMGA algorithm has excellent search performance and high search speed.
4. The informative genes found by the IMGA algorithm are highly relevant to the classification, so high classification accuracy can be achieved with a small gene subset.

In short, because the IMGA algorithm judges the relationship between genes and categories independently of classifier performance, the informative genes obtained are more reliable and effective. Because the number of informative genes found by the IMGA algorithm can be controlled freely, it is possible to subsequently study how the number of genes affects classification accuracy. The IMGA algorithm has excellent search performance and high search speed. The informative genes it finds are strongly relevant to the classification, so high classification accuracy can be achieved with a small gene subset.

FIG. 1 shows a gene expression array.
FIG. 2 shows the fitness change curves of IMGA on the three data sets.
FIG. 3 shows a comparison of the three methods on the leukemia data set.
FIG. 4 shows a comparison of the three methods on the lung cancer data set.
FIG. 5 shows a comparison of the three methods on the prostate cancer data set.

Hereinafter, embodiments of the present invention will be described with reference to the drawings.

In the filter-type feature selection algorithm based on improved information measurement and GA, the gene expression data are stored in suitable files in a suitable format so that a computer program can read and process them. The specific steps are as follows.
Step 1: First, a candidate subset of 150 genes is screened out using the Bhattacharyya distance, which keeps informative genes that are strongly related to the category and removes a large number of irrelevant genes.
Step 2: The information entropy between genes and the mutual information between genes and categories are calculated, and the fitness function is constructed.
Step 3: The IMGA algorithm is used to select the most relevant feature gene subset from the 150 candidate genes.
Step 4: A support vector machine is used to classify the tumor gene expression data.

Specifically, the embodiment below is implemented on the basis of the technical solution of the present invention. A detailed embodiment and a specific operating procedure are described, but the scope of protection of the present invention is not limited to the following embodiment.

First, a candidate subset of 150 genes is screened out using the Bhattacharyya distance. Next, let the gene expression data be denoted $E(X, Y) = \{x_1, x_2, \ldots, x_m; Y\}$, where $X = \{x_1, x_2, \ldots, x_m\}$ denotes the $m$ genes and $Y$ denotes the category.
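The pre-screening step can be sketched as follows. This is a minimal illustration, not the patent's implementation: the text does not say how the Bhattacharyya distance is estimated, so the sketch assumes a two-class problem and models each gene's expression within a class as a one-dimensional Gaussian, for which the distance has a closed form. The names `bhattacharyya_gaussian` and `prescreen`, the `eps` guard, and the data layout are all hypothetical.

```python
import math
from statistics import mean, pvariance


def bhattacharyya_gaussian(a, b, eps=1e-12):
    """Bhattacharyya distance between two samples, each modelled as a
    1-D Gaussian (ASSUMPTION: the source does not specify the estimator)."""
    mu_a, mu_b = mean(a), mean(b)
    # eps avoids division by zero / log(0) for constant genes
    var_a, var_b = pvariance(a) + eps, pvariance(b) + eps
    mean_term = 0.25 * (mu_a - mu_b) ** 2 / (var_a + var_b)
    var_term = 0.5 * math.log((var_a + var_b) / (2.0 * math.sqrt(var_a * var_b)))
    return mean_term + var_term


def prescreen(expr, labels, n_keep=150):
    """expr[g][s] is the expression of gene g in sample s; labels[s] is 0 or 1.
    Returns the indices of the n_keep genes with the largest distance."""
    scores = []
    for g, row in enumerate(expr):
        class0 = [v for v, y in zip(row, labels) if y == 0]
        class1 = [v for v, y in zip(row, labels) if y == 1]
        scores.append((bhattacharyya_gaussian(class0, class1), g))
    scores.sort(reverse=True)  # largest class separation first
    return [g for _, g in scores[:n_keep]]
```

With `n_keep=150` this would yield the candidate subset that the later steps search over; a gene whose two class distributions coincide scores close to zero and is dropped first.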

Deriving the fitness function: so that the genetic algorithm can find a subset of informative genes that is as small as possible while achieving as high a sample classification rate as possible, the subset must satisfy the following conditions.
1. Each gene itself carries a large amount of information.
2. The genes are strongly related to the category.
3. Within the subset, the redundancy between genes is small.
The amount of information contained in a gene is expressed by its information entropy $H(x)$, the relevance between a gene and the category by the mutual information $I(x; y)$ between the gene and the category, and the redundancy between two genes by the mutual information $I(x_i; x_j)$ between them. Information measures can serve as an effective criterion, and the present invention proposes constructing the fitness function from them. The function is based on the minimum-redundancy maximum-relevance (mRMR) algorithm and additionally takes into account how the amount of information contained in each gene affects the gene subset found by the genetic algorithm and the resulting classification performance. The fitness function may therefore be given by formula (1), after which the improved genetic algorithm searches for the optimal gene subset.
Formula (1)
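The three information measures named above can be computed from discretized expression values as in the sketch below. The entropy and mutual-information functions follow the standard definitions. The way they are combined in `illustrative_fitness` is a hypothetical stand-in, because formula (1) itself is not reproduced in this text: the sketch simply adds mean entropy and mean relevance and subtracts mean pairwise redundancy, in the spirit of mRMR, and assumes that larger values are better. Discretization of the expression values is also assumed to have been done beforehand.

```python
import math
from collections import Counter


def entropy(xs):
    """Shannon entropy H(x) in bits of a sequence of discrete values."""
    n = len(xs)
    return -sum((c / n) * math.log2(c / n) for c in Counter(xs).values())


def mutual_information(xs, ys):
    """I(x; y) = H(x) + H(y) - H(x, y) for two aligned discrete sequences."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))


def illustrative_fitness(subset, genes, labels):
    """HYPOTHETICAL combination, NOT the patent's formula (1):
    mean entropy + mean relevance - mean pairwise redundancy."""
    k = len(subset)
    info = sum(entropy(genes[g]) for g in subset) / k
    relevance = sum(mutual_information(genes[g], labels) for g in subset) / k
    pairs = [(a, b) for i, a in enumerate(subset) for b in subset[i + 1:]]
    redundancy = 0.0
    if pairs:
        redundancy = sum(mutual_information(genes[a], genes[b]) for a, b in pairs) / len(pairs)
    return info + relevance - redundancy
```

Under this illustrative scoring, a subset made of two identical genes is penalized by the redundancy term even if both are perfectly predictive of the category, which is the behavior that condition 3 asks for.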

The steps of IMGA are described in detail below.
Input: the number of genes $k$, the information entropy $H(x)$ of each gene, the mutual information $I(x; y)$ between each gene and the category, and the mutual information $I(x_i; x_j)$ between genes.
Output: the index numbers of $k$ genes.
(1) Encoding is the first problem a genetic algorithm has to settle. Because the number of genes to be selected is $k$, the length of the code string is $k$. With $m$ genes, the numbers 1 to $m$ directly represent genes 1 to $m$. As a result, $k$ integers between 1 and $m$ are output, representing the numbers of the genes found.
(2) A population of $N_P$ individuals, that is, $N_P$ code strings, is generated at random. The larger the population, the more likely it is that a global solution is found.
(3) The fitness value is calculated by formula (1).
(4) The individuals are ranked in ascending order of fitness value. The individual with the best fitness value is selected and passed directly to the genetic operations of the next generation. If the probability of selecting the individual with the best fitness value is defined as $q$, the selection probability of the $i$-th individual after ranking is defined as $P_i$. The father is selected by the roulette strategy, and, to strengthen the randomness of the search, the mother is selected at random. An individual with higher fitness is more likely to be selected as the father.
(5) To increase the diversity of individuals and make a global solution easier to find, the two parents are not simply crossed once to produce two offspring. Instead, a crossover is performed every time a new individual is generated, and during crossover the $i$-th gene of either the father or the mother is chosen at random as the $i$-th gene of the new individual.
(6) A very small mutation probability $P_m$ is set. When the condition is met, the corresponding gene in the individual's code is mutated to another one.
(7) Because the encoding uses numeric labels, new individuals produced by crossover and mutation will inevitably contain duplicate numbers. A number not yet used in the individual is therefore found and substituted for each duplicate.
(8) Steps (3), (4), (5), (6), and (7) are repeated until the maximum number of generations is reached or the constraint condition is satisfied. The algorithm then terminates automatically and outputs the index numbers of the optimal gene subset.
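Steps (1) to (8) can be sketched as a small search loop. This is an illustration under stated assumptions rather than the patent's implementation: the expression for $P_i$ is not reproduced in this text, so the sketch assumes the common geometric rank form $q(1-q)^{i-1}$ as roulette weights; it sorts best-first and treats larger fitness as better; and it carries exactly one elite individual forward. The function names `repair` and `imga_sketch`, the default parameter values, and the `fitness` callback are hypothetical.

```python
import random


def repair(child, m, rng):
    """Step (7): replace repeated gene numbers with numbers unused in the individual."""
    used = set(child)
    unused = [g for g in range(1, m + 1) if g not in used]
    rng.shuffle(unused)
    seen = set()
    for i, g in enumerate(child):
        if g in seen:
            child[i] = unused.pop()
        seen.add(child[i])
    return child


def imga_sketch(fitness, m, k, n_pop=30, q=0.2, p_m=0.02, generations=50, seed=0):
    """Search for k distinct gene numbers in 1..m that maximize `fitness`."""
    rng = random.Random(seed)
    # Steps (1)-(2): each individual is a length-k string of distinct gene numbers.
    pop = [rng.sample(range(1, m + 1), k) for _ in range(n_pop)]
    # ASSUMED rank-based selection weights; the source's P_i formula is not shown.
    weights = [q * (1 - q) ** i for i in range(n_pop)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)      # steps (3)-(4): rank, best first
        nxt = [pop[0][:]]                        # best individual carried over
        while len(nxt) < n_pop:
            father = rng.choices(pop, weights=weights, k=1)[0]  # roulette on rank
            mother = rng.choice(pop)                            # uniform at random
            # Step (5): fresh crossover per child, gene i from either parent.
            child = [rng.choice(pair) for pair in zip(father, mother)]
            # Step (6): rare mutation to another gene number.
            child = [rng.randint(1, m) if rng.random() < p_m else g for g in child]
            nxt.append(repair(child, m, rng))
        pop = nxt                                # step (8): next generation
    return sorted(max(pop, key=fitness))
```

A call such as `imga_sketch(score, m=150, k=10)`, where `score` is any function mapping a list of gene numbers to a number, returns ten distinct gene numbers. Fixing `k` in the encoding is what lets the number of selected genes be controlled directly.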

Finally, classification is performed with a support vector machine to obtain the classification accuracy.

Following the steps above, the procedure was applied to three real tumor gene expression data sets (leukemia, lung cancer, and prostate cancer). The effectiveness of the IMGA algorithm was evaluated in two respects: its search performance, and the classification performance of the gene subsets it finds. For search performance, the fitness change curves of the IMGA algorithm on the three data sets are shown (see FIG. 2). For the classification performance of the gene subsets found by the IMGA algorithm, comparisons with two conventional filter-type selection algorithms (mRMR and Relief) are shown (see FIGS. 3, 4, and 5).

As described above, a filter-type feature selection method, IMGA, based on information measurement and a genetic algorithm is provided, in which the fitness function is constructed from information entropy and mutual information and the genetic algorithm is adapted so that it can be applied to feature selection on tumor gene expression data.

The advantages of IMGA are as follows.
1. The relationship between genes and categories is judged independently of classifier performance, so the informative genes obtained are more reliable and effective.
2. The number of informative genes found by the IMGA algorithm can be controlled freely, which makes it possible to subsequently study how the number of genes affects classification accuracy.
3. The IMGA algorithm has excellent search performance and high search speed.
4. The informative genes found by the IMGA algorithm are strongly relevant to the classification, so high classification accuracy can be obtained with a small gene subset.

The above is only a preferred embodiment of the present invention and does not limit its scope of protection. All equivalent substitutions and changes that a person skilled in the art makes on the basis of the technical solution and inventive concept of the present invention, without departing from the technical scope disclosed herein, shall fall within the scope of protection of the present invention.

Claims

A filter-type feature selection algorithm based on improved information measurement and GA, comprising:
first screening out a candidate subset of 150 genes using the Bhattacharyya distance, thereby keeping informative genes that are strongly related to the category and removing a large number of irrelevant genes;
constructing a fitness function by calculating the information entropy between genes and the mutual information between genes and categories;
selecting the most relevant feature gene subset from the 150 candidate genes using the IMGA algorithm; and
classifying the tumor gene expression data using a support vector machine,
wherein step 2 specifically comprises:
step 201 of estimating a marginal probability of each gene and a joint probability distribution of the gene and the category;
step 202 of calculating the information entropy between genes, the information entropy of the category, and the joint probability of the gene and the category; and
step 203 of determining the mutual information between the gene and the category,
and wherein step 3 specifically comprises:
step 301 of encoding with numeric numbers;
step 302 of randomly generating a population;
step 303 of calculating the fitness function;
step 304 of ranking the individuals in ascending order of fitness value, selecting the individual with the best fitness value and passing it directly to the genetic operations of the next generation, selecting the father based on the roulette strategy, and selecting the mother at random;
step 305 of crossing the parents each time a new individual is generated;
step 306 of mutating some genes in the individual's code to others;
step 307 of replacing duplicate numbers in the individual; and
repeating step 303, step 304, step 305, step 306, and step 307 until the maximum number of generations is reached or the constraint condition is satisfied, whereupon the algorithm terminates automatically and outputs the index numbers of the optimal gene subset.