JP4421971B2 - Analysis engine exchange system and data analysis program

Info

Publication number: JP4421971B2
Application number: JP2004229532A
Authority: JP
Inventors: 正貴安東; 彰斎藤; 雄一石橋; 正明松浦; 敏宮田; 大牛嶋; 親民中村; 義男三木; 哲生野田
Original assignee: Japanese Foundation for Cancer Research; NEC Corp; Japan Biological Informatics Consortium
Current assignee: Japanese Foundation for Cancer Research; NEC Corp; Japan Biological Informatics Consortium
Priority date: 2004-08-05
Filing date: 2004-08-05
Publication date: 2010-02-24
Anticipated expiration: 2024-08-05
Also published as: JP2006048429A

Description

The present invention relates to statistical methods, ranging from univariate analyses such as the t-test, the Mann-Whitney U test, and Fisher's exact test to multivariate analyses such as regression analysis, logistic regression analysis, analysis of variance, discriminant analysis, and principal component analysis, and to data mining methods such as neural networks, binary tree analysis, and support vector machines (SVM). In particular, it relates to an analysis engine exchange type system, and a program for that system (a data analysis program), that efficiently narrow down the significant variables in analysis data with a very large number of variables, such as gene data.

First, the first related technique will be described.

In general, one purpose of statistically analyzing real phenomena is to find relationships among various characteristics and to make predictions. In such cases it is common to find some relationship in the data and predict a given variable by using a generalized linear model, which covers regression analysis, logistic regression analysis, discriminant functions, and the like, or a data mining method such as SVM. An example is analyzing the relationship of a plurality of explanatory variables $x_1, x_2, \ldots, x_p$ to an objective variable $y$. If a model formula is built from every variable in the data, the model loses generality and is very likely not to fit when applied to other data. Particularly when the data has many variables, the model formula must be built from as few well-chosen variables as possible while still explaining the objective variable $y$ well. Variables are generally selected so that the model formula contains from several to several tens of explanatory variables. For this purpose, general statistical analysis systems provide variable selection methods and a brute-force (all-subsets) method, so that the model considered optimal can be chosen from models built on various combinations of variables.

Next, a second related technique will be described.

General statistical analysis systems and gene analysis systems provide various analysis methods, for example statistical methods such as generalized linear models (including regression analysis, logistic regression analysis, and discriminant functions) and data mining methods such as SVM. However, they assume that a combination of variables is specified once and analyzed once. To analyze many different combinations of variables repeatedly on data with tens of thousands of variables, special processing must be added using the analysis system's programming or macro functions.

Furthermore, examining combinations of variables blindly cannot finish in a realistic time. The combinations must therefore be explored efficiently, keeping their number as small as possible while still including the significant ones, but the above analysis systems provide no such algorithm. For example, when estimating a model with about five variables from 10,000 variables, the number of five-variable combinations is as follows, and computing all of them is practically infeasible.

$$ {}_{10000}C_{5} = \frac{10000 \times 9999 \times 9998 \times 9997 \times 9996}{5!} \approx \frac{10^{20}}{5!} $$

In addition, there is no standard for how to select an optimal model combination, or a significant variable combination, from the many models estimated for various combinations of variables. For this reason, in order to select an optimal model or variable combination in the above general analysis systems, special processing must be incorporated using a programming function or a macro function.
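The size of the combination count quoted above can be checked directly. The following is a minimal Python sketch, not part of the patent; the function name `n_models` is an illustrative assumption.

```python
import math

def n_models(p: int, k: int) -> int:
    """Number of distinct k-variable models that can be drawn from p variables."""
    return math.comb(p, k)

exact = n_models(10_000, 5)            # exact value of 10000C5
approx = 10**20 / math.factorial(5)    # the approximation 10^20 / 5!

print(f"exact  10000C5  = {exact:.3e}")
print(f"approx 10^20/5! = {approx:.3e}")
print(f"relative gap    = {abs(approx - exact) / exact:.2%}")
```

Both values are on the order of 10^17, which is why fitting every five-variable model is treated as impractical.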

Further, a third related technique will be described.

To select the optimal combination of models or a significant combination of variables, it is also becoming necessary to use statistical methods such as generalized linear models and data mining methods such as SVM across the board, rather than relying on a single type of method. For example, to narrow down variables using different analysis methods A and B as shown in FIG. 8, special processing must be incorporated in the analysis system using a programming function or a macro function.

Claim 1 of Patent Document 1 describes: "With every input value except one explanatory variable held at a predetermined constant, the influence relationship between the resulting output value of the neural network and the objective variable is evaluated with the F value or t value used in statistical analysis; this is carried out in turn for every explanatory variable, and all unnecessary explanatory variables at or below a predetermined value are discarded."

Column 8, lines 16 to 21 of Patent Document 2 state: "The variable ($x_4$) whose reference value was largest in step 54 (FIG. 3) is selected; in step 51 (FIG. 3), the two-variable combinations containing this variable $x_4$, namely $(x_4, x_1)$, $(x_4, x_2)$, and $(x_4, x_3)$, are formed one at a time on each pass through the loop; and the reference value is calculated in step 52." However, in the two-variable combinations $(x_4, x_1)$, $(x_4, x_2)$, $(x_4, x_3)$, the only selected (narrowed-down) variable is $x_4$; $x_1$, $x_2$, and $x_3$ are not selected (narrowed-down) variables. That is, Patent Document 2 does not disclose first selecting a plurality of significant variables from all the variables and then forming all combinations of at least two variables from among the plurality of selected (narrowed-down) variables.

Column 2, lines 39 to 44 of Patent Document 3 disclose "a multistage multivariate analysis method in which an analysis that keeps the number of explanatory variables used per analysis constant and automatically narrows down the abnormal items (explanatory variables) with the known stepwise selection method is performed a plurality of times, and the final analysis is performed using only the items narrowed down in each analysis." The known stepwise selection method used in Patent Document 3 first selects one significant variable from all the variables, then fixes that selected variable and selects one variable from the remaining variables to form two-variable combinations. Therefore, Patent Document 3 likewise does not disclose first selecting a plurality of significant variables from all the variables and then forming all combinations of at least two variables from among the plurality of selected (narrowed-down) variables.

The abstract of Patent Document 4 contains the statement "analyzing the relationship between genetic polymorphism site information and phenotype."

Patent Document 1: JP 2000-31511 A
Patent Document 2: JP-A-7-93284
Patent Document 3: JP 2002-110493 A
Patent Document 4: JP 2003-67389 A
Non-Patent Document 1: Cox, "Analysis of Binary Data", Asakura Shoten

The first problem with the related techniques described above is that the number of explanatory variables exceeds the number of samples, so the algorithms used in statistical multivariate analysis (variable selection methods, the brute-force method, and so on) cannot be applied to data with a very large number of variables, such as data from DNA chips and microarrays for gene expression analysis. In the forward selection and stepwise selection methods of conventional variable selection, when variables are added to or removed from the model, only one variable at a time, one that becomes statistically significant by being added or removed, is added or removed. Variables therefore cannot be selected efficiently in tasks such as screening, where candidate variables are narrowed down from a very large set. The backward selection method needs an initial model that includes every variable, but a model consisting of 10,000 explanatory variables cannot realistically be considered. The brute-force method examines every combination of variables, so with $p$ variables it tries $2^p - 1$ models; when $p$ is as large as 10,000, this cannot realistically be computed.
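The brute-force model count is the number of non-empty subsets of the variables. A small Python sketch, with illustrative names that do not come from the patent, enumerates the subsets for a tiny variable set and then only counts them for 10,000 variables:

```python
from itertools import combinations

def all_subsets(variables):
    """Yield every non-empty subset of the variables, as a brute-force search would."""
    for k in range(1, len(variables) + 1):
        yield from combinations(variables, k)

def brute_force_model_count(p: int) -> int:
    """Number of non-empty subsets of p variables: 2^p - 1."""
    return 2**p - 1

small = ["x1", "x2", "x3", "x4"]
# Enumeration and the closed-form count agree for a small case.
assert len(list(all_subsets(small))) == brute_force_model_count(len(small))

# For 10,000 variables the count is only computed, never enumerated.
digits = len(str(brute_force_model_count(10_000)))
print(f"2^10000 - 1 has {digits} decimal digits")
```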

The second problem is that, for gene expression data obtained by high-throughput measurement, it is important to perform univariate statistical analysis on the expression results of each individual gene and to evaluate each gene's expression result, yet no dedicated device has been developed that automatically repeats a specified analysis over tens of thousands of genes.

The third problem is that the statistical analysis methods applied to DNA chips and microarrays are wide-ranging: they start from univariate analyses such as the t-test, the Mann-Whitney U test, and Fisher's exact test, and extend not only to multivariate analyses such as regression analysis, logistic regression analysis, analysis of variance, discriminant analysis, and principal component analysis but also to data mining methods such as neural networks, binary tree analysis, and support vector machines (SVM). No device has been developed that efficiently processes data consisting of a very large number of variables, such as DNA chip and microarray data, while combining or exchanging these methods. Nor is there any disclosure of first selecting a plurality of significant variables from all the variables and then forming all combinations of at least two variables from among the plurality of selected (narrowed-down) variables.

An object of the present invention is to provide an analysis engine exchange type system, and a data analysis program for an analysis engine exchange type system, that can eliminate the above problems.

A further object of the present invention is to provide an analysis engine exchange type system, and a data analysis program for such a system, that first select a plurality of significant explanatory variables from all the explanatory variables and then form all combinations of at least two variables from among the plurality of selected (narrowed-down) variables.

The analysis engine exchange type system according to the present invention and the data analysis program according to the present invention are as follows.

[Claim 1] An analysis engine exchange type system comprising a data analysis device and an input device that inputs a data file to be analyzed into the data analysis device, wherein
the data analysis device has an analysis engine control unit and an analysis engine unit,
the analysis engine control unit, upon receiving the data file to be analyzed, which consists of one objective variable $y$ and $p$ explanatory variables $x_1, x_2, \ldots, x_p$, sequentially passes to the analysis engine unit, together with the objective variable, all combinations obtained by taking one explanatory variable from the $p$ explanatory variables, as ${}_pC_1$ ($= p$) sets of data $(y, x_1), (y, x_2), \ldots, (y, x_p)$,
the analysis engine unit executes a predetermined analysis on each of the $p$ sets of data sent to it and sends the analysis results to the analysis engine control unit,
the analysis engine control unit selects, based on the analysis results, the $p'$ top-ranked explanatory variables $x'_1, \ldots, x'_{p'}$ ($p' < p$) from among the $p$ explanatory variables, and then sequentially passes to the analysis engine unit, together with the objective variable, all combinations obtained by taking two explanatory variables from the $p'$ explanatory variables, as ${}_{p'}C_2$ ($= p'(p'-1)/2$) sets of data $(y, x'_1, x'_2), (y, x'_1, x'_3), \ldots, (y, x'_{p'-1}, x'_{p'})$,
the analysis engine unit executes another predetermined analysis on each of the $p'(p'-1)/2$ sets of data sent to it and sends the other analysis results to the analysis engine control unit, and
the analysis engine control unit selects, based on the other analysis results, a number of top-ranked explanatory variables smaller than $p'$ from among the $p'$ explanatory variables.

[Claim 2] The analysis engine exchange type system according to claim 1,
further comprising an output device connected to the analysis engine control unit of the data analysis device, wherein
the analysis engine control unit has a function of causing the output device to display the analysis results and the other display results.

[Claim 3] The analysis engine exchange type system according to claim 1, wherein
the analysis engine unit, as the predetermined analysis on the $p$ sets of data sent to it, estimates each of the $p$ models expressed by
$y = f(x_i), \quad i = 1, 2, \ldots, p$
and sends to the analysis engine control unit, as the analysis results, the degree of fit of each of the $p$ models and the significance of each of the $p$ explanatory variables, and
the analysis engine control unit selects the $p'$ top-ranked explanatory variables $x'_1, \ldots, x'_{p'}$ from among the $p$ explanatory variables, based on the result of comparing the degrees of fit of the $p$ models with a reference value and the result of comparing the significances of the $p$ explanatory variables with another reference value.

[Claim 4] The analysis engine exchange type system according to claim 3, wherein
the analysis engine unit, as the other predetermined analysis on the $p'(p'-1)/2$ sets of data sent to it, estimates each of the $p'(p'-1)/2$ models expressed by
$y = f(x_i, x_j), \quad i, j = 1, 2, \ldots, p', \quad i \neq j$
and sends to the analysis engine control unit, as the other analysis results, the degree of fit of each of the $p'(p'-1)/2$ models and the significance for the $p'(p'-1)/2$ explanatory variables, and
the analysis engine control unit selects a number of top-ranked explanatory variables smaller than $p'$ from among the $p'$ explanatory variables, based on the result of comparing the degrees of fit of the $p'(p'-1)/2$ models with a reference value and the result of comparing the significances for the $p'(p'-1)/2$ explanatory variables with another reference value.

[Claim 5] The analysis engine exchange type system according to claim 4, wherein
the analysis engine control unit, in addition to selecting the explanatory variables fewer than $p'$, causes the analysis engine unit to estimate the next model using the selected smaller set of explanatory variables, with the number of explanatory variables in the next model increased by one, and selects, based on the result of that estimation, a still smaller number of explanatory variables from among the selected smaller set.

[Claim 6] The analysis engine exchange type system according to claim 1,
further comprising a recording medium that is connected to the data analysis device and on which a data analysis program is recorded, wherein
the data analysis program is read from the recording medium into the data analysis device and controls the above-described operations of the analysis engine control unit and the analysis engine unit of the data analysis device.

[Claim 7] A data analysis program for use in an analysis engine exchange type system that has a data analysis device having an analysis engine control unit and an analysis engine unit, an input device that inputs a data file to be analyzed into the data analysis device, and a recording medium on which the data analysis program for causing the data analysis device to execute predetermined processing is recorded, wherein
the predetermined processing comprises:
a first step in which the analysis engine control unit, upon receiving the data file to be analyzed, which consists of one objective variable $y$ and $p$ explanatory variables $x_1, x_2, \ldots, x_p$, sequentially passes to the analysis engine unit, together with the objective variable, all combinations obtained by taking one explanatory variable from the $p$ explanatory variables, as ${}_pC_1$ ($= p$) sets of data $(y, x_1), (y, x_2), \ldots, (y, x_p)$;
a second step in which the analysis engine unit executes a predetermined analysis on each of the $p$ sets of data sent to it and sends the analysis results to the analysis engine control unit;
a third step in which the analysis engine control unit selects, based on the analysis results, the $p'$ top-ranked explanatory variables $x'_1, \ldots, x'_{p'}$ ($p' < p$) from among the $p$ explanatory variables, and then sequentially passes to the analysis engine unit, together with the objective variable, all combinations obtained by taking two explanatory variables from the $p'$ explanatory variables, as ${}_{p'}C_2$ ($= p'(p'-1)/2$) sets of data $(y, x'_1, x'_2), (y, x'_1, x'_3), \ldots, (y, x'_{p'-1}, x'_{p'})$;
a fourth step in which the analysis engine unit executes another predetermined analysis on each of the $p'(p'-1)/2$ sets of data sent to it and sends the other analysis results to the analysis engine control unit; and
a fifth step in which the analysis engine control unit selects, based on the other analysis results, a number of top-ranked explanatory variables smaller than $p'$ from among the $p'$ explanatory variables.

[Claim 8] The data analysis program according to claim 7, further comprising,
in a case where the analysis engine exchange type system further has an output device connected to the analysis engine control unit of the data analysis device, a step in which the analysis engine control unit causes the output device to display the analysis results and the other display results.

[Claim 9] The data analysis program according to claim 7, wherein
the second step is a step in which the analysis engine unit, as the predetermined analysis on the $p$ sets of data sent to it, estimates each of the $p$ models expressed by
$y = f(x_i), \quad i = 1, 2, \ldots, p$
and sends to the analysis engine control unit, as the analysis results, the degree of fit of each of the $p$ models and the significance of each of the $p$ explanatory variables, and
the third step is a step in which the analysis engine control unit selects the $p'$ top-ranked explanatory variables $x'_1, \ldots, x'_{p'}$ from among the $p$ explanatory variables, based on the result of comparing the degrees of fit of the $p$ models with a reference value and the result of comparing the significances of the $p$ explanatory variables with another reference value.

[Claim 10] The data analysis program according to claim 9, wherein
the fourth step is a step in which the analysis engine unit, as the other predetermined analysis on the $p'(p'-1)/2$ sets of data sent to it, estimates each of the $p'(p'-1)/2$ models expressed by
$y = f(x_i, x_j), \quad i, j = 1, 2, \ldots, p', \quad i \neq j$
and sends to the analysis engine control unit, as the other analysis results, the degree of fit of each of the $p'(p'-1)/2$ models and the significance for the $p'(p'-1)/2$ explanatory variables, and
the fifth step is a step in which the analysis engine control unit selects a number of top-ranked explanatory variables smaller than $p'$ from among the $p'$ explanatory variables, based on the result of comparing the degrees of fit of the $p'(p'-1)/2$ models with a reference value and the result of comparing the significances for the $p'(p'-1)/2$ explanatory variables with another reference value.

[Claim 11] The data analysis program according to claim 10, further comprising
a step in which the analysis engine control unit, in addition to selecting the explanatory variables fewer than $p'$, causes the analysis engine unit to estimate the next model using the selected smaller set of explanatory variables, with the number of explanatory variables in the next model increased by one, and selects, based on the result of that estimation, a still smaller number of explanatory variables from among the selected smaller set.

According to the present invention, an analysis engine exchange type system and a data analysis program for such a system are obtained that first select a plurality of significant explanatory variables from all the explanatory variables and then form all combinations of at least two variables from among the plurality of selected (narrowed-down) variables, which makes it possible to complete the analysis of the whole set of explanatory variables efficiently.

Next, a first embodiment of the present invention will be described in detail with reference to the drawings.

Referring to FIG. 1, the analysis engine exchange type system according to the first embodiment of the present invention includes a data analysis device 2 that operates under program control, an input device 1 that inputs a data file to be analyzed into the data analysis device 2, and an output device 3 such as a display device or a printing device. The data file to be analyzed consists of one objective variable and $p$ explanatory variables. The data analysis device 2 includes an analysis engine control unit 21 and an analysis engine unit 22.

The analysis engine control unit 21 extracts the objective variable and the selected explanatory variables from the given data file and sends the data to the analysis engine unit 22. The analysis engine unit 22 executes a predetermined analysis on the data sent to it and sends the analysis results to the analysis engine control unit 21. The output device 3 displays the analysis results sent from the analysis engine control unit 21, sorted by a statistic or parameter (for example, the significance probability based on the statistic for each explanatory variable). Based on the analysis results, the analysis engine control unit 21 selects the top $p'$ explanatory variables ($p' < p$), increases the number of explanatory variables included in the next model, and runs the analysis again, starting from the input from the input device 1. Based on the results of that analysis, the number of selected explanatory variables is reduced further. As this process is repeated, the number of explanatory variables included in each model grows while the number of explanatory variables under analysis shrinks, so the number of analysis runs is smaller than with the brute-force method and the analysis of the whole set of variables can be completed efficiently.
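The control loop described above can be sketched in Python as follows. This is a rough illustration under stated assumptions, not the patented implementation: the engines are hypothetical callables that return a score where larger is better (here the coefficient of determination of a one-variable and a two-variable linear regression), the numbers of variables kept per stage are arbitrary, and the rule of collecting variables from the top-ranked combinations is one possible reading of "selecting the top-ranked variables".

```python
from itertools import combinations
from math import sqrt

def pearson(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sab = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    saa = sum((u - ma) ** 2 for u in a)
    sbb = sum((v - mb) ** 2 for v in b)
    return sab / sqrt(saa * sbb) if saa > 0 and sbb > 0 else 0.0

def engine_single(y, cols):
    """Hypothetical engine A: R^2 of the regression y = a + b*x."""
    (x,) = cols
    return pearson(y, x) ** 2

def engine_pair(y, cols):
    """Hypothetical engine B: R^2 of y = a + b1*x1 + b2*x2 (closed form)."""
    x1, x2 = cols
    r1, r2, r12 = pearson(y, x1), pearson(y, x2), pearson(x1, x2)
    denom = 1.0 - r12 ** 2
    if denom < 1e-12:              # collinear pair: no gain over a single variable
        return r1 ** 2
    return (r1 * r1 + r2 * r2 - 2.0 * r1 * r2 * r12) / denom

def staged_selection(y, X, engines, keep):
    """Stage k scores every k-variable combination of the surviving variables
    with engines[k-1], then keeps keep[k-1] variables from the best combinations."""
    survivors = list(X)
    for k, (engine, n_keep) in enumerate(zip(engines, keep), start=1):
        scored = [(engine(y, [X[v] for v in combo]), combo)
                  for combo in combinations(survivors, k)]
        scored.sort(key=lambda item: item[0], reverse=True)   # best score first
        kept = []
        for _, combo in scored:    # walk down the ranking, collecting variables
            for v in combo:
                if v not in kept and len(kept) < n_keep:
                    kept.append(v)
        survivors = kept
    return survivors

# Toy data (made up for illustration): y depends only on x1 and x2.
X = {
    "x1": [1, 2, 3, 4, 5, 6, 7, 8],
    "x2": [1, -1, 1, -1, 1, -1, 1, -1],
    "n1": [3, 1, 4, 1, 5, 9, 2, 6],
    "n2": [2, 7, 1, 8, 2, 8, 1, 8],
}
y = [2 * a + 3 * b for a, b in zip(X["x1"], X["x2"])]

selected = staged_selection(y, X, [engine_single, engine_pair], keep=[3, 2])
print(selected)
```

Because each engine is just a callable with the same signature, a different analysis (for example a logistic regression or an SVM score) could be swapped in per stage without changing the control loop, which is the sense in which the engine is exchangeable in this sketch.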

次に、図２を参照して本実施の形態の動作について詳細に説明する。 Next, the operation of the present embodiment will be described in detail with reference to FIG.

解析対象となるデータファイルにおけるデータは、下記の数式１に示すように、１個の目的変数とｐ個の説明変数から成り立っている。 The data in the data file to be analyzed is composed of one objective variable and p explanatory variables, as shown in Equation 1 below.

解析エンジン制御部２１はデータファイルを入力装置１から受け取り、ｐ個の説明変数から１つの説明変数を取り出す全ての組み合わせを、順次、１個の目的変数と共に、解析エンジン部２２に渡していく。つまり、下記の数式２に示すｐ個の組みのデータを渡す。 The analysis engine control unit 21 receives the data file from the input device 1 and sequentially passes all combinations for extracting one explanatory variable from the p explanatory variables to the analysis engine unit 22 together with one objective variable. That is, p sets of data shown in the following Equation 2 are passed.

The analysis engine unit 22 performs an analysis such as regression analysis or logistic regression analysis on each set of data. In this case, it estimates the p models of Equation 3 below; that is, the calculation is repeated p times.

In the case of regression analysis, the regression model is as shown in Equation 3 above. The engine calculates the regression coefficient of the explanatory variable, a statistic representing its significance, and a statistic representing the goodness of fit of the model. Any statistics may be defined for the goodness of fit and for the significance of the regression coefficient; as an example, Equation 4 below defines the regression coefficient of each model, the multiple correlation coefficient as the goodness-of-fit statistic, and the t-value and p-value as the significance statistics.

Analyzing the p sets of data yields the degree of fit of each of the p models and the significance of each of the p explanatory variables. The analysis engine control unit 21 applies criteria such as those shown in Equation 5 below to these results to select models and variables.

This narrows the p explanatory variables down to p′ explanatory variables (p′ < p).
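The one-variable stage described above can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes ordinary least-squares regression as the engine, ranks variables by the absolute t-value of the slope (for models with the same degrees of freedom this gives the same order as ranking by two-sided p-value), and keeps a fixed number of variables in place of the criteria of Equation 5. The function names and the toy data are hypothetical.

```python
import math

def simple_regression_t(y, x):
    """Fit y = b0 + b*x by least squares; return (slope, t-value of the slope)."""
    n = len(y)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    if sxx == 0:
        return 0.0, 0.0  # constant column carries no information
    b = sxy / sxx
    b0 = my - b * mx
    rss = sum((yi - b0 - b * xi) ** 2 for xi, yi in zip(x, y))
    if rss == 0:
        return b, math.inf  # perfect fit
    se = math.sqrt(rss / (n - 2) / sxx)  # standard error of the slope
    return b, b / se

def screen_univariate(y, columns, keep):
    """Run one model per explanatory variable and return the top `keep` names."""
    scored = [(abs(simple_regression_t(y, x)[1]), name) for name, x in columns.items()]
    scored.sort(key=lambda s: s[0], reverse=True)
    return [name for _, name in scored[:keep]]

# Hypothetical toy data: one objective variable and p = 3 explanatory variables.
y = [1.0, 2.1, 2.9, 4.2, 5.0, 6.1]
columns = {
    "x1": [1, 2, 3, 4, 5, 6],  # strongly related to y
    "x2": [3, 1, 4, 1, 5, 9],  # weakly related to y
    "x3": [2, 2, 2, 2, 2, 2],  # constant
}
selected = screen_univariate(y, columns, keep=2)  # p' = 2
print(selected)
```

Because each of the p models is fitted independently, this stage costs exactly p engine calls regardless of which engine is plugged in.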

At this point, as shown in Equation 6 below, the data file consists of one objective variable and the p′ explanatory variables selected above.

The analysis engine control unit 21 either receives this data file from the input device 1 or creates it internally. It then passes every combination of two explanatory variables taken from the p′ explanatory variables, each together with the single objective variable, to the analysis engine unit 22 in sequence. That is, it passes the _{p′}C_2 = p′ × (p′ − 1) / 2! = p′ × (p′ − 1) / 2 sets of data shown in Equation 7 below.

The analysis engine unit 22 performs an analysis such as regression analysis or logistic regression analysis on each set of data. In this case, it estimates the p′ × (p′ − 1) / 2 models shown in Equation 8 below.

The statistics for regression analysis and logistic regression, namely the statistic indicating the degree of fit of the model and the statistics indicating the significance of each explanatory variable, can be obtained in the same way from Equation 4 above, but with p = 2.

In the same way, the explanatory variables are further narrowed down to several tens according to the criteria shown in Equation 9 below.

Next, using the narrowed-down explanatory variables, the number of explanatory variables in the model is increased by one, the models are estimated, and the process is repeated. In this way the explanatory variables are narrowed down to about 10 to 20, so that the relationship between each explanatory variable and the objective variable can be investigated individually.
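The repeated procedure (estimate all models of a given size, keep the best, grow the model size by one) can be sketched as below. It is an illustrative reading under stated assumptions: the engine is any callable that returns a score where higher is better, the selection criterion is simply "variables that appear in the top-ranked models", and `r_squared`, `staged_screen`, `keep_models` and the toy data are hypothetical names and values that do not come from the patent.

```python
from itertools import combinations

def r_squared(y, cols):
    """Example engine: R^2 of a least-squares fit of y on cols plus an intercept."""
    n = len(y)
    rows = [[1.0] + [c[i] for c in cols] for i in range(n)]
    k = len(rows[0])
    # Normal equations (X'X) b = X'y stored as an augmented matrix.
    a = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))] for i in range(k)]
    for i in range(k):  # Gauss-Jordan elimination with partial pivoting
        piv = max(range(i, k), key=lambda r: abs(a[r][i]))
        if abs(a[piv][i]) < 1e-12:
            return 0.0  # collinear variables: treat the model as uninformative
        a[i], a[piv] = a[piv], a[i]
        for r in range(k):
            if r != i:
                f = a[r][i] / a[i][i]
                a[r] = [u - f * v for u, v in zip(a[r], a[i])]
    b = [a[i][k] / a[i][i] for i in range(k)]
    my = sum(y) / n
    tss = sum((yi - my) ** 2 for yi in y)
    if tss == 0:
        return 0.0
    rss = sum((yi - sum(bj * rj for bj, rj in zip(b, r))) ** 2 for r, yi in zip(rows, y))
    return 1.0 - rss / tss

def staged_screen(y, columns, engine, keep_models, max_size):
    """keep_models[k-1] is the number of top models retained at model size k."""
    candidates = sorted(columns)
    history = []
    for size in range(1, max_size + 1):
        results = [(engine(y, [columns[v] for v in combo]), combo)
                   for combo in combinations(candidates, size)]
        results.sort(key=lambda r: r[0], reverse=True)
        top = results[:keep_models[size - 1]]
        history.append(top)
        # Variables appearing in the retained models become the next candidates.
        candidates = sorted({v for _, combo in top for v in combo})
    return candidates, history

# Hypothetical toy data in which y = x1 + 3 * x4 exactly.
columns = {
    "x1": [1, 2, 3, 4, 5, 6, 7, 8],
    "x2": [1, 0, 1, 0, 1, 0, 1, 0],
    "x3": [2, 7, 1, 8, 2, 8, 1, 8],
    "x4": [0, 0, 1, 1, 0, 0, 1, 1],
    "x5": [5, 3, 5, 3, 4, 4, 6, 2],
}
y = [a + 3 * b for a, b in zip(columns["x1"], columns["x4"])]
finalists, history = staged_screen(y, columns, r_squared, keep_models=[3, 1], max_size=2)
print(finalists)
```

Passing a different callable as `engine` changes the statistical method without touching the loop, which is the property the engine-exchange design relies on.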

FIG. 2 illustrates the procedure described above.

The analysis engine unit 22 is not limited to regression analysis: by replacing it with various statistical methods such as logistic regression analysis, discriminant functions, the t-test, or the Mann-Whitney U test, any analysis method can be used. What makes this possible is the interface between the analysis engine unit 22 and the analysis engine control unit 21.

FIG. 3 shows the interface between the analysis engine unit 22 and the analysis engine control unit 21.

In FIG. 3, the analysis engine control unit 21 acts as an iteration control block. Through the parameter and data input unit 22a, it sends the number of parameters for one analysis run (such as the iteration number) and the list of those parameters 31 to the analysis engine unit 22, which acts as the statistical analysis calculation unit. The analysis engine unit 22 sends the results 35, such as statistics and test statistics, to the analysis engine control unit 21 through the result output unit 22b. In block 21a, the analysis engine control unit 21 edits and outputs the results and causes the output device 3 (FIG. 1) to show the result display 33. Also in block 21a, it extracts variables (explanatory variables) according to the reference values and causes the output device 3 to show the list of extracted variables as the per-iteration variable list display 36.

FIG. 4 shows the data structure of item 31 in FIG. 3 and the data structure of item 35 in FIG. 3.
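One way to picture the controller/engine interface is a pair of record types and a single method that every engine implements. The field layout of items 31 and 35 in FIG. 4 is not given in the text, so everything below (`AnalysisRequest`, `AnalysisResult`, `AnalysisEngine`, `MeanDifferenceEngine`, `Controller`, and their fields) is a hypothetical sketch that only mirrors the kinds of information the description mentions: an iteration number, variable identifiers, a status, and statistics.

```python
from dataclasses import dataclass, field
from typing import Protocol, Sequence

@dataclass
class AnalysisRequest:               # stands in for the parameter list (item 31)
    iteration: int                   # iteration number kept by the controller
    objective_index: int             # which column is the objective variable
    explanatory_indices: tuple       # columns that enter this one model
    y: Sequence[float]
    xs: Sequence[Sequence[float]]

@dataclass
class AnalysisResult:                # stands in for the results (item 35)
    objective_index: int
    explanatory_indices: tuple
    status: str                      # "ok" or "error"
    statistics: dict = field(default_factory=dict)

class AnalysisEngine(Protocol):
    def analyze(self, request: AnalysisRequest) -> AnalysisResult: ...

class MeanDifferenceEngine:
    """Toy engine: absolute difference of the x means between y == 1 and y == 0."""
    def analyze(self, request):
        x = request.xs[0]
        g1 = [v for v, yi in zip(x, request.y) if yi == 1]
        g0 = [v for v, yi in zip(x, request.y) if yi == 0]
        ids = (request.objective_index, request.explanatory_indices)
        if not g1 or not g0:
            return AnalysisResult(*ids, "error")
        score = abs(sum(g1) / len(g1) - sum(g0) / len(g0))
        return AnalysisResult(*ids, "ok", {"score": score})

class Controller:
    """Iteration control: builds requests, collects results, sorts them."""
    def __init__(self, engine: AnalysisEngine):
        self.engine = engine

    def run(self, y, columns, iteration=1):
        results = [self.engine.analyze(AnalysisRequest(iteration, 0, (j + 1,), y, [col]))
                   for j, col in enumerate(columns)]
        ok = [r for r in results if r.status == "ok"]
        return sorted(ok, key=lambda r: r.statistics["score"], reverse=True)

ranked = Controller(MeanDifferenceEngine()).run(
    y=[1, 1, 0, 0], columns=[[1, 1, 0, 0], [1, 0, 1, 0]])
print([r.explanatory_indices for r in ranked])
```

The controller never inspects how a score was produced, so swapping `MeanDifferenceEngine` for a regression or rank-test engine needs no change on the controller side.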

Next, a second embodiment of the present invention will be described in detail with reference to the drawings.

Referring to FIG. 5, the analysis engine exchange type system according to the second embodiment of the present invention includes a data analysis device 5, typically the CPU (Central Processing Unit) of a computer, and the same input device 1 and output device 3 as in FIG. 1. The system further includes a recording medium 4 on which a data analysis program is recorded. The recording medium 4 may be portable or fixed, and may be a magnetic disk, a semiconductor memory, a CD-ROM, or another recording medium.

A computer program capable of executing this method (the data analysis program) can also be stored in a storage device of a computer connected to a network and transferred to other computers over the network. The medium that provides the computer program executing this algorithm (the data analysis program) can be distributed as a computer-readable medium in various forms and is not limited to any particular type of medium.

The data analysis program is read from the recording medium 4 into the data analysis device 5 and controls its operation, causing the data analysis device 5 to perform, on the data file input from the input device 1, the same processing as the data analysis device 2 of FIG. 1.

Next, an example of the present invention will be described concretely with reference to real data. This example corresponds to the analysis engine exchange type system according to the first embodiment of FIG. 1. As shown in FIG. 6, the analysis data are SNP (Single Nucleotide Polymorphism) data and clinical data; the objective variable (Y) is the presence or absence of a side effect (1 or 0), and the explanatory variables (X) are the SNP data. In the SNP data, two variables are assigned to each SNP, for example (A/A, A/T, T/T) = (10, 11, 01). As a result, about 5,500 explanatory variables were used in the analysis, and the number of cases is 54.
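The two-variables-per-SNP coding can be illustrated with a short sketch. It assumes one reading of the mapping (A/A, A/T, T/T) = (10, 11, 01), namely that the first variable flags the presence of the first allele and the second flags the presence of the second allele; the text gives only the mapping itself, and the function names are hypothetical.

```python
def encode_snp(genotype, allele_a, allele_b):
    """Return two 0/1 indicators: (carries allele_a, carries allele_b)."""
    alleles = genotype.split("/")
    return (int(allele_a in alleles), int(allele_b in alleles))

def encode_case(genotypes, allele_pairs):
    """Flatten one case's genotypes into a row of explanatory variables."""
    row = []
    for genotype, (a, b) in zip(genotypes, allele_pairs):
        row.extend(encode_snp(genotype, a, b))
    return row

codes = [encode_snp(g, "A", "T") for g in ("A/A", "A/T", "T/T")]
row = encode_case(["A/T", "G/G"], [("A", "T"), ("C", "G")])
print(codes, row)
```

With this coding each SNP contributes two columns, so the number of explanatory variables is twice the number of SNPs encoded.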

These data are analyzed by fitting a logistic regression model. For n individuals, the logistic regression model with p explanatory variables assumes Equation 10 below.

Here, θ_i is the success probability for individual i, and λ_i is the logistic transform of θ_i. α_ik is the value of the k-th explanatory variable for individual i, and β_k is the regression coefficient of the k-th explanatory variable on the logistic scale.

Given the binary response observations y_1, y_2, …, y_n for the n individuals, the log-likelihood is given by Equation 11 below.
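The equation images are not reproduced in this text. For orientation only, the textbook form of a logistic regression model and its log-likelihood that is consistent with the symbols defined above (θ_i, λ_i, α_ik, β_k, y_i, with β_0 as the constant term) is shown below; whether the patent's Equations 10 and 11 are written in exactly this form is an assumption.

```latex
\lambda_i = \log\frac{\theta_i}{1-\theta_i}
          = \beta_0 + \sum_{k=1}^{p} \beta_k \alpha_{ik},
          \qquad i = 1, \dots, n

L(\beta) = \sum_{i=1}^{n} \Bigl[\, y_i \log\theta_i + (1 - y_i)\log(1 - \theta_i) \Bigr]
```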

Here, with L(β) as defined by Equation 12, finding the value of β that maximizes L(β) requires solving a nonlinear optimization problem with L(β) as the objective function. The Newton-Raphson method is used here as the solution method (see Non-Patent Document 1).
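A Newton-Raphson fit for the single-variable case can be sketched as follows. This is a generic textbook iteration, not the procedure of Non-Patent Document 1: it assumes a model with a constant term and one explanatory variable, starts from zero, and takes the standard error from the inverse information matrix at the solution. The function names and the ten-case toy data are hypothetical.

```python
import math

def _score_and_information(b0, b, y, x):
    """Gradient of the log-likelihood and the information matrix entries."""
    theta = [1.0 / (1.0 + math.exp(-(b0 + b * xi))) for xi in x]
    g0 = sum(yi - ti for yi, ti in zip(y, theta))
    g1 = sum((yi - ti) * xi for yi, ti, xi in zip(y, theta, x))
    w = [ti * (1.0 - ti) for ti in theta]
    h00 = sum(w)
    h01 = sum(wi * xi for wi, xi in zip(w, x))
    h11 = sum(wi * xi * xi for wi, xi in zip(w, x))
    return g0, g1, h00, h01, h11

def fit_logistic_1d(y, x, max_iter=25, tol=1e-10):
    """Maximize the log-likelihood of logit(theta_i) = b0 + b * x_i."""
    b0 = b = 0.0
    for _ in range(max_iter):
        g0, g1, h00, h01, h11 = _score_and_information(b0, b, y, x)
        det = h00 * h11 - h01 * h01
        if det < 1e-12:
            raise ValueError("information matrix is singular")
        # Newton step: solve the 2x2 system (information) * step = gradient.
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0, b = b0 + d0, b + d1
        if abs(d0) + abs(d1) < tol:
            break
    _, _, h00, h01, h11 = _score_and_information(b0, b, y, x)
    se_b = math.sqrt(h00 / (h00 * h11 - h01 * h01))  # s.e. of the slope
    return b0, b, se_b

# Hypothetical data: indicator x = 1 has 4 of 5 responders, x = 0 has 1 of 5.
x = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
y = [1, 1, 1, 1, 0, 1, 0, 0, 0, 0]
b0, b, se_b = fit_logistic_1d(y, x)
t_value = b / se_b  # coefficient divided by its standard error
print(b0, b, se_b, t_value)
```

For a binary indicator the maximum-likelihood estimates have a closed form (the log-odds in each group), which makes this toy case convenient for checking the iteration. If the data are perfectly separated the estimates diverge, so a production engine would need a guard for that case.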

With a single explanatory variable, the models give analysis results such as those shown in FIG. 7.

In FIG. 7, each row represents the result of one analysis run. Among the columns, "R_Variable No." is the number identifying the objective variable and "E_Variable No." is the number identifying the explanatory variable. "Status" indicates whether the analysis ended in an error. "X2L" is the test statistic of the logistic regression model. "B0" and "B" are the constant term and the regression coefficient, respectively. "t value" and "P value" are the test statistics for the explanatory variable.

In FIG. 7, the rows are sorted in ascending order of the p-value corresponding to the test statistic (t-value) of the estimated logistic regression coefficient, so the results can be read in descending order of the influence of the logistic regression coefficient.

Pearson's χ² statistic, the test statistic of the logistic regression model, is calculated by Equation 13 below.

The t-value of each explanatory variable is calculated as in Equation 14 below.

Here, s.e.( ) denotes the standard error of the quantity in parentheses.

The P value can be calculated as the upper-tail probability of the t distribution corresponding to the t-value, since the t-value of Equation 14 above follows a t distribution with N − p − 1 degrees of freedom.
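The upper-tail probability can be computed without a statistics library, as in the sketch below. It assumes numerical integration of the Student t density by Simpson's rule, which is one of several possible methods and not necessarily the one used by the inventors; `t_pdf` and `t_upper_tail` are hypothetical names. Because it relies on `math.gamma`, it is only suitable for moderate degrees of freedom such as the 54 − 1 − 1 = 52 of the one-variable example.

```python
import math

def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + t * t / df) ** (-(df + 1) / 2)

def t_upper_tail(t, df, steps=2000):
    """P(T > t), using Simpson's rule on [0, |t|]; `steps` must be even."""
    a = abs(t)
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3          # P(0 < T < |t|)
    upper = 0.5 - area            # the density is symmetric about zero
    return upper if t >= 0 else 1.0 - upper

n_cases, n_vars = 54, 1           # N and p of the one-variable example
p_value = t_upper_tail(2.0, n_cases - n_vars - 1)
print(p_value)
```

With one degree of freedom the t distribution is the Cauchy distribution, for which P(T > 1) = 1/4 exactly; this gives a simple check of the integration.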

These results show which genes are strongly associated with the presence or absence of the side effect, making it possible to narrow down the strongly associated genes.

According to the first and second embodiments, various statistical analysis methods can be applied to data with a very large number of variables, such as data from DNA chips for gene expression analysis and microarrays. With a total of about 30,000 variables, the number of iterations is:
One explanatory variable: 30,000 runs
Two explanatory variables: 500,000 runs (about 1,000 explanatory variables are extracted at the one-variable stage)
Three explanatory variables: 170,000 runs (about 100 explanatory variables are extracted at the two-variable stage)
:
:
and so on, so the processing can be completed within a realistic time.
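The run counts above follow from binomial coefficients, which can be checked with a few lines. The stage sizes (30,000, then about 1,000, then about 100) are taken from the text; the comparison with an unscreened three-variable search is added here only for scale.

```python
from math import comb

# (number of candidate variables, model size) at each stage, as quoted in the text.
stages = [(30000, 1), (1000, 2), (100, 3)]
stage_counts = [comb(n, k) for n, k in stages]
print(stage_counts)            # [30000, 499500, 161700]

# For scale: all three-variable models over 30,000 variables without screening.
brute_force_triples = comb(30000, 3)
print(brute_force_triples)     # 4499550010000, about 4.5e12
```

The exact values 499,500 and 161,700 are of the same order as the roughly 500,000 and 170,000 runs quoted in the text.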

Furthermore, according to the first and second embodiments, candidate variables can be narrowed down from a large number of variables more efficiently than with existing variable selection methods, because candidate variables can be selected independently for each number of explanatory variables in the model. In addition, by extending Equation 3 above to a model in which specific explanatory variables are fixed, as in Equation 15 below, the existing forward selection and stepwise methods can also be applied.
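The idea of fixing specific explanatory variables can be pictured as a change in how candidate models are enumerated: the fixed variables appear in every model, and only the remaining slots are filled from the candidate pool. Equation 15 is not reproduced in this text, so the sketch below is an assumed reading; `models_with_fixed` is a hypothetical helper.

```python
from itertools import combinations

def models_with_fixed(fixed, candidates, extra):
    """Enumerate variable sets that contain all `fixed` variables plus
    `extra` further variables drawn from the remaining candidates."""
    pool = [v for v in candidates if v not in fixed]
    return [tuple(fixed) + combo for combo in combinations(pool, extra)]

# Forward-selection style step: x3 is already in the model, try adding one more.
forward_step = models_with_fixed(["x3"], ["x1", "x2", "x3", "x4"], extra=1)
print(forward_step)
```

Each enumerated tuple would then be passed to the analysis engine in the same way as the unrestricted combinations.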

In addition, according to the first and second embodiments, it is easy to set the criterion for choosing the optimal combination of models, or the combination of significant variables, from the many estimated models for various combinations of variables. Statistics of the estimated model itself, such as the multiple correlation coefficient or F-value, and per-variable statistics such as the t-value or p-value, can be chosen freely, and the combinations of variables at or above (or at or below) a reference value can be selected.

Furthermore, according to the first and second embodiments, models can be extracted from a wider range of candidate variables than with existing variable selection methods. Existing stepwise, backward elimination, and forward selection methods extract only one model. In the present invention, however, all computed model results are saved, so a criterion can be set for the models and the top K analysis results can be displayed to the user for examination. Moreover, the explanatory variables contained in these top K models can be carried into the next model selection step, which makes it possible to extract models from a wider range of candidate variables than existing variable selection methods.

In addition, according to the first and second embodiments, multiple analysis methods can be combined across stages. The methods applied to DNA chips and microarrays are diverse: univariate analyses such as the t-test, the Mann-Whitney U test, and Fisher's exact test; statistical methods including multivariate analyses such as regression analysis, logistic regression analysis, analysis of variance, discriminant analysis, and principal component analysis; and data mining methods such as neural networks, binary tree analysis, and support vector machines (SVM). By replacing the analysis engine unit with any of these methods, different methods can be combined while narrowing down the explanatory variables.

FIG. 1 is a block diagram of the analysis engine exchange type system according to the first embodiment of the present invention.
FIG. 2 is a flowchart showing the process of narrowing down variables in the analysis engine control unit of the analysis engine exchange type system of FIG. 1.
FIG. 3 is a block diagram showing the interface between the analysis engine unit and the analysis engine control unit of the analysis engine exchange type system of FIG. 1.
FIG. 4 is a diagram showing the data structure of part 31 of FIG. 3 and the data structure of part 35 of FIG. 3.
FIG. 5 is a block diagram of the analysis engine exchange type system according to the second embodiment of the present invention.
FIG. 6 is a diagram showing the structure of the analysis data used to describe the operation of the example corresponding to the analysis engine exchange type system of FIG. 1.
FIG. 7 is a diagram showing the analysis results in the example.
FIG. 8 is a flowchart showing the processing for narrowing down variables using different analysis methods.

Explanation of symbols

1 Input device
2 Data analysis device
3 Output device
4 Recording medium
5 Data analysis device
21 Analysis engine control unit
22 Analysis engine unit

Claims

An analysis engine exchange type system comprising a data analysis device and an input device for inputting a data file to be analyzed to the data analysis device, wherein:
the data analysis device includes an analysis engine control unit and an analysis engine unit;
the analysis engine control unit has first control means which, on receiving the data file to be analyzed, consisting of one objective variable y and p explanatory variables x_1, x_2, ..., x_p, passes every combination of one explanatory variable taken from the p explanatory variables, together with the objective variable, to the analysis engine unit in sequence as _pC_1 (= p) sets of data (y, x_1), (y, x_2), ..., (y, x_p);
the analysis engine unit has second control means which executes a predetermined analysis on each of the p sets of data received and sends the analysis results to the analysis engine control unit;
the analysis engine control unit has third control means which, based on the analysis results, selects from the p explanatory variables the p′ top-ranked explanatory variables x′_1, ..., x′_{p′}, and then passes every combination of two explanatory variables taken from the p′ explanatory variables, together with the objective variable, to the analysis engine unit in sequence as _{p′}C_2 (= p′ × (p′ − 1) / 2) sets of data (y, x′_1, x′_2), (y, x′_1, x′_3), ..., (y, x′_{p′−1}, x′_{p′});
the analysis engine unit has fourth control means which executes another predetermined analysis on the (p′ × (p′ − 1) / 2) sets of data received and sends the other analysis results to the analysis engine control unit;
the analysis engine control unit has fifth control means which, based on the other analysis results, selects from the p′ explanatory variables a number of top-ranked explanatory variables smaller than p′; and
the predetermined analysis and the other predetermined analysis are regression analyses, the explanatory variables are genetic data, the objective variable is the presence or absence of a side effect (1/0), and the system narrows down the genetic data strongly related to the presence or absence of the side effect.

The analysis engine exchange type system according to claim 1, further comprising an output device connected to the analysis engine control unit of the data analysis device, wherein the analysis engine control unit further has means for outputting the analysis results and the other analysis results to the output device.

The analysis engine exchange type system according to claim 1, wherein:
the second control means in the analysis engine unit, as the predetermined analysis on the p sets of data received, estimates each of the p models represented by
y = f(x_i), i = 1, 2, ..., p
and sends, as the analysis results, the degree of fit of the p models and the significance of the p explanatory variables to the analysis engine control unit; and
the third control means in the analysis engine control unit selects the p′ top-ranked explanatory variables x′_1, ..., x′_{p′} from the p explanatory variables, based on the result of comparing the degree of fit of the p models with a reference value and the result of comparing the significance of the p explanatory variables with another reference value.

The analysis engine exchange type system according to claim 3, wherein:
the fourth control means in the analysis engine unit, as the other predetermined analysis on the (p′ × (p′ − 1) / 2) sets of data received, estimates the (p′ × (p′ − 1) / 2) models represented by
y = f(x_i, x_j), i, j = 1, 2, ..., p′, i ≠ j
and sends, as the other analysis results, the degree of fit of the (p′ × (p′ − 1) / 2) models and the significance of the (p′ × (p′ − 1) / 2) explanatory variables to the analysis engine control unit; and
the fifth control means in the analysis engine control unit selects, from the p′ explanatory variables, a number of top-ranked explanatory variables smaller than p′, based on the result of comparing the degree of fit of the (p′ × (p′ − 1) / 2) models with a reference value and the result of comparing the significance of the (p′ × (p′ − 1) / 2) explanatory variables with another reference value.

The analysis engine exchange type system according to claim 1, further comprising a recording medium on which a data analysis program to be read by the data analysis device is recorded, wherein the data analysis program is read from the recording medium into the data analysis device and controls the first to fifth control means of the analysis engine control unit and the analysis engine unit of the data analysis device.

A data analysis program for causing a computer to function as the analysis engine control unit and the analysis engine unit in an analysis engine exchange type system comprising a data analysis device having an analysis engine control unit and an analysis engine unit, and an input device for inputting a data file to be analyzed to the data analysis device, the program causing the computer to execute processing including:
a first step in which the analysis engine control unit, on receiving the data file to be analyzed, consisting of one objective variable y and p explanatory variables x_1, x_2, ..., x_p, passes every combination of one explanatory variable taken from the p explanatory variables, together with the objective variable, to the analysis engine unit in sequence as _pC_1 (= p) sets of data (y, x_1), (y, x_2), ..., (y, x_p);
a second step in which the analysis engine unit executes a predetermined analysis on each of the p sets of data received and sends the analysis results to the analysis engine control unit;
a third step in which the analysis engine control unit, based on the analysis results, selects from the p explanatory variables the p′ top-ranked explanatory variables x′_1, ..., x′_{p′}, and then passes every combination of two explanatory variables taken from the p′ explanatory variables, together with the objective variable, to the analysis engine unit in sequence as _{p′}C_2 (= p′ × (p′ − 1) / 2) sets of data (y, x′_1, x′_2), (y, x′_1, x′_3), ..., (y, x′_{p′−1}, x′_{p′});
a fourth step in which the analysis engine unit executes another predetermined analysis on the (p′ × (p′ − 1) / 2) sets of data received and sends the other analysis results to the analysis engine control unit; and
a fifth step in which the analysis engine control unit, based on the other analysis results, selects from the p′ explanatory variables a number of top-ranked explanatory variables smaller than p′,
wherein the predetermined analysis and the other predetermined analysis are regression analyses, the explanatory variables are genetic data, the objective variable is the presence or absence of a side effect (1/0), and the program narrows down the genetic data strongly related to the presence or absence of the side effect.

The data analysis program according to claim 6, wherein, when the analysis engine exchange type system further includes an output device connected to the analysis engine control unit of the data analysis device, the processing further includes a step in which the analysis engine control unit outputs the analysis results and the other analysis results to the output device.

The data analysis program according to claim 6, wherein:
the second step is a step in which the analysis engine unit, as the predetermined analysis on the p sets of data received, estimates each of the p models represented by
y = f(x_i), i = 1, 2, ..., p
and sends, as the analysis results, the degree of fit of the p models and the significance of the p explanatory variables to the analysis engine control unit; and
the third step is a step in which the analysis engine control unit selects the p′ top-ranked explanatory variables x′_1, ..., x′_{p′} from the p explanatory variables, based on the result of comparing the degree of fit of the p models with a reference value and the result of comparing the significance of the p explanatory variables with another reference value.

The data analysis program according to claim 8, wherein:
the fourth step is a step in which the analysis engine unit, as the other predetermined analysis on the (p′ × (p′ − 1) / 2) sets of data received, estimates the (p′ × (p′ − 1) / 2) models represented by
y = f(x_i, x_j), i, j = 1, 2, ..., p′, i ≠ j
and sends, as the other analysis results, the degree of fit of the (p′ × (p′ − 1) / 2) models and the significance of the (p′ × (p′ − 1) / 2) explanatory variables to the analysis engine control unit; and
the fifth step is a step in which the analysis engine control unit selects, from the p′ explanatory variables, a number of top-ranked explanatory variables smaller than p′, based on the result of comparing the degree of fit of the (p′ × (p′ − 1) / 2) models with a reference value and the result of comparing the significance of the (p′ × (p′ − 1) / 2) explanatory variables with another reference value.