JP2021009540A

JP2021009540A - Data analysis device and method

Info

Publication number: JP2021009540A
Application number: JP2019122748A
Authority: JP
Inventors: 山本 博之 (Hiroyuki Yamamoto)
Original assignee: Human Metabolome Technologies Inc
Current assignee: Human Metabolome Technologies Inc
Priority date: 2019-07-01
Filing date: 2019-07-01
Publication date: 2021-01-28
Anticipated expiration: 2039-07-01
Also published as: JP7437003B2

Abstract

PROBLEM TO BE SOLVED: To enable various data analyses while taking into account the order among statistical samples. SOLUTION: A data analysis device (5) performs multivariate analysis of a plurality of data items for a plurality of statistical samples. The data analysis device includes a storage unit (52) and a control unit (51). The storage unit records statistical data (X), which manages the plurality of data items for each statistical sample, and order information (D) indicating the order among the plurality of statistical samples. The control unit performs arithmetic processing based on the statistical data and the order information. The control unit calculates a first vector (wx) corresponding to an explanatory variable (t) and a second vector (wy) corresponding to an auxiliary variable (s) so as to optimize the covariance between the explanatory variable in principal component analysis of the statistical data and the auxiliary variable, for which a constraint condition according to the order information is set (S12), and calculates scores for the plurality of statistical samples on the basis of at least one of the first and second vectors (S13). SELECTED DRAWING: Figure 9

Description

The present invention relates to a data analysis device, method, and program for performing data analysis by statistical methods.

Conventionally, in metabolomics for example, principal component analysis (PCA) and partial least squares (PLS) are often used as multivariate analysis methods for analyzing data on a large number of metabolites and the like (see, for example, Non-Patent Document 1).

Patent Document 1 discloses a method called kernel PLS-ROG, which introduces the concept of the kernel method into PLS-ROG (Rank Order of Groups), an application of PLS. With kernel PLS-ROG, integrated analysis of various statistical data can be performed while the order of the groups formed by the statistical samples is reflected in the scores, which enables diverse data analyses that take the group order into account.

Patent Document 1: International Publication No. 2017/090566

Non-Patent Document 1: Hiroyuki Yamamoto, et al., "Dimensionality reduction for metabolome data using PCA, PLS, OPLS, and RFDA with differential penalties to latent variables", Chemom. Intell. Lab. Syst., 98 (2009) 136-142.
Non-Patent Document 2: Yasumune Nakayama, et al., "Novel Strategy for Non-Targeted Isotope-Assisted Metabolomics by Means of Metabolic Turnover and Multivariate Analysis", Metabolites 2014, 4(3), 722-739.
Non-Patent Document 3: Pongsuwan W, et al., "Prediction of Japanese green tea ranking by gas chromatography/mass spectrometry-based hydrophilic metabolite fingerprinting", J Agric Food Chem. 2007 Jan 24; 55(2): 231-6.

PLS is a supervised dimensionality reduction method, whereas PCA is an unsupervised method. The inventor of the present application has conducted extensive research on a method that, in a PCA-like analysis technique, reflects the order among samples in the scores while enabling diverse data analyses such as hypothesis testing of loadings.

An object of the present invention is to provide a data analysis device and method that enable diverse data analyses while taking into account the order among statistical samples.

The data analysis device according to the present invention is a device that performs multivariate analysis of a plurality of data items for a plurality of statistical samples. The data analysis device includes a storage unit and a control unit. The storage unit records statistical data, which manages the plurality of data items for each statistical sample, and order information indicating the order among the plurality of statistical samples. The control unit performs predetermined arithmetic processing based on the statistical data and the order information. The control unit calculates a first vector corresponding to an explanatory variable and a second vector corresponding to an auxiliary variable so as to optimize the covariance between the explanatory variable in principal component analysis of the statistical data and the auxiliary variable, for which a constraint condition according to the order information is set, and calculates scores for the plurality of statistical samples based on at least one of the first vector and the second vector.

The data analysis method according to the present invention is a method in which a computer performs multivariate analysis of a plurality of data items for a plurality of statistical samples. The storage unit 52 of the computer records statistical data, which manages the plurality of data items for each statistical sample, and order information indicating the order among the plurality of statistical samples. The method includes a step in which the computer calculates a first vector corresponding to an explanatory variable and a second vector corresponding to an auxiliary variable so as to optimize the covariance between the explanatory variable in principal component analysis of the statistical data and the auxiliary variable, for which a constraint condition according to the order information is set, and a step in which the computer calculates scores for the plurality of statistical samples based on at least one of the first vector and the second vector.

According to the data analysis device and method of the present invention, applying a theory that optimizes the covariance between the explanatory variable in principal component analysis of the statistical data and the auxiliary variable, for which a constraint condition according to the order information is set, makes it possible to perform diverse data analyses while taking into account the order among statistical samples.

FIG. 1: Diagram for explaining the theory of OS-PCA
FIG. 2: Diagram showing the PCA analysis results in Case 1 of data analysis
FIG. 3: Diagram showing the OS-PCA analysis results in Case 1 of data analysis
FIG. 4: Chart showing an example of hypothesis testing of OS-PCA loadings in Case 1 of data analysis
FIG. 5: Diagram showing the PCA analysis results in Case 2 of data analysis
FIG. 6: Diagram showing the OS-PCA analysis results in Case 2 of data analysis
FIG. 7: Block diagram showing the configuration of the data analysis device according to Embodiment 1
FIG. 8: Flowchart showing data analysis processing by the data analysis device
FIG. 9: Flowchart showing OS-PCA arithmetic processing in the data analysis processing

Hereinafter, embodiments of the data analysis device, method, and program according to the present invention will be described with reference to the accompanying drawings. In each of the following embodiments, the same components are denoted by the same reference numerals.

(Embodiment 1)
1. Outline
An outline of statistical analysis by the data analysis method according to Embodiment 1 of the present invention will be described. An example of applying this data analysis method to metabolomics is described below.

Metabolomics is a research field that comprehensively analyzes low-molecular-weight metabolites in living organisms. In metabolomics, biological samples such as animal tissues, microbial cells, and human blood and urine are measured with various analytical instruments, and the concentrations and other properties of the metabolites contained in the samples are analyzed. Metabolome data, in which the measured concentration values of the various metabolites are recorded, is represented, for example, in the form of a data matrix X of n rows and q columns as shown below.
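The image of equation (1) is not reproduced in this text. A form consistent with the surrounding description (n rows, q columns, one row per sample) would be the following, where the element names x_{ij} are assumed for illustration:

```latex
X =
\begin{pmatrix}
x_{11} & x_{12} & \cdots & x_{1q} \\
x_{21} & x_{22} & \cdots & x_{2q} \\
\vdots & \vdots & \ddots & \vdots \\
x_{n1} & x_{n2} & \cdots & x_{nq}
\end{pmatrix}
\tag{1}
```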

Here, n is the number of samples, and q is the number of measured metabolites (that is, the number of data items). In the form of equation (1), each row of the metabolome data can record the measured values of the q metabolites for the sample corresponding to that row number. Various calculated values (for example, isotopomer ratios) may be recorded instead of measured values.

When principal component analysis is used, metabolome data is analyzed by the following procedure. First, the distribution of the samples is visualized in the scores obtained by principal component analysis of the metabolome data, and a principal component related to the desired phenotype, group information, time-series information, or the like is identified. Then, significant metabolites are selected based on a hypothesis test of the loadings corresponding to that principal component, so that, for example, the relationship between the selected group of metabolites and metabolic pathways can be investigated.

In metabolome data analysis as described above, additional information on the relationships among samples or groups may be given in advance in addition to the metabolome data. Conventional multivariate analysis is useful for analyzing metabolome data, but it does not take such additional information into account. Consequently, in typical PCA and similar methods, the additional information is not reflected in the scores used to visualize the samples, and it may be difficult to proceed with the analysis. To avoid this problem, the present inventors previously proposed "smoothed PCA" as an analysis method that incorporates additional information (Non-Patent Document 1).

Smoothed PCA is useful for analyzing metabolome data of samples collected over time. For example, in research on microbial culture and fermentation, metabolome data is analyzed in order to observe changes in the concentrations of various substances over time. In research by the present inventors, smoothed PCA was applied to metabolome data for the purpose of visualizing the fermentation process of yeast, and its effectiveness was confirmed (Non-Patent Document 1). Here, the present inventor focused on the problem that smoothed PCA cannot theoretically explain the statistical meaning of its loadings, which makes it difficult to select metabolites from the loadings on a statistical basis.

The present inventor therefore studied this problem intensively and devised a principal component analysis method, "orthogonal smoothed PCA (OS-PCA)", which yields calculation results equivalent to those of smoothed PCA and also makes it possible to select statistically significant metabolites by hypothesis testing of the loadings.

2. Theory
The theory of OS-PCA according to the present embodiment is described below.

2-1. Smoothed PCA
In analysis methods such as OS-PCA and smoothed PCA, the statistical data to be analyzed is represented, for example, as the data matrix X of equation (1). Hereinafter, the data in the p-th column of the data matrix X is denoted "x_p". The data matrix X is used, for example, after scaling the data x_p of each column (p = 1 to q) to mean "0" and variance "1" across its n components (that is, across samples).

For the scores of principal component analysis of the data matrix X, the explanatory variable t, an n-dimensional vector given by the following equation (2), can be used.
t = Xw_x (2)

In equation (2), the weight vector w_x is a q-dimensional vector with q components. According to equation (2), each component of the weight vector w_x indicates the weight of the corresponding data item of the data matrix X in the explanatory variable t. The n values of the explanatory variable t are the scores of the respective samples.
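The scaling step and equation (2) can be sketched as follows. This is a minimal illustration, not the patented implementation: the helper names `autoscale` and `scores`, the random test data, and the use of the sample standard deviation (`ddof=1`) are assumptions made for the example.

```python
import numpy as np

def autoscale(X):
    """Scale each column (data item) to mean 0 and sample variance 1."""
    X = np.asarray(X, dtype=float)
    # ddof=1 (sample standard deviation) is an assumption of this sketch
    return (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

def scores(X, w):
    """Equation (2): t = X w, giving one score per sample (row of X)."""
    return X @ w

rng = np.random.default_rng(0)
raw = rng.normal(size=(10, 4))   # hypothetical data: n = 10 samples, q = 4 items
X = autoscale(raw)
w_x = np.ones(4) / 2.0           # arbitrary weight vector of length 1
t = scores(X, w_x)               # n-dimensional score vector
```

Each entry of `t` is the weighted sum of one sample's scaled data items, which is what the score plots described in the later sections visualize.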

In addition to the variables described above, OS-PCA and smoothed PCA both use a smoothing parameter κ and a dummy matrix D, which are described later. First, smoothed PCA is formulated as in the following equations (31) and (32) (Non-Patent Document 1).

In equations (31) and (32), the symbol ' denotes the transpose of a matrix or vector (the same applies hereinafter). The second term on the left side of equation (32) is a smoothing term based on the smoothing parameter κ.

From equations (31) and (32), smoothed PCA reduces to a generalized eigenvalue problem as in the following equation (33), where I is the identity matrix and λ is an eigenvalue.

2-2. OS-PCA
To incorporate the smoothing term through a formulation different from that of smoothed PCA described above, OS-PCA according to the present embodiment introduces an auxiliary variable s given by the following equation (3).
s = Xw_y (3)

The auxiliary variable s is a supplementary variable on which the constraint condition described later is imposed (see equation (6)). Like the explanatory variable t, the auxiliary variable s is an n-dimensional vector and can provide a score for each sample. In equation (3), the weight vector w_y is a q-dimensional vector, like the weight vector w_x of the explanatory variable t. According to equation (3), each component of the weight vector w_y indicates the weight of the corresponding data item of the data matrix X in the auxiliary variable s.

Smoothed PCA maximizes the variance of the single variable t corresponding to the principal component score (see equation (31)). In contrast, OS-PCA of the present embodiment is formulated so as to obtain the principal components by maximizing the covariance of the two variables t and s. Specifically, the method is formulated as in the following equations (4) to (6).

In equations (4) to (6), the smoothing parameter κ is set within the range 0 < κ < 1, and the matrix P is given by the following equation (7).
P = (1-κ)I + κX'D'DX (7)
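The images of equations (4) to (6) are not reproduced in this text. A reconstruction consistent with the description of the normalization condition (5), the constraint condition (6), and the matrix P of equation (7) would be the following. The 1/(n-1) normalization of the covariance is an assumption of this sketch.

```latex
\begin{align}
\max_{w_x,\, w_y}\ & \operatorname{cov}(t, s) = \frac{1}{n-1}\, w_x' X' X w_y \tag{4} \\
\text{subject to}\ & w_x' w_x = 1 \tag{5} \\
& (1-\kappa)\, w_y' w_y + \kappa\, w_y' X' D' D X w_y = 1 \tag{6}
\end{align}
```

With P as in equation (7), the left side of (6) equals w_y' P w_y, so the constraint can also be read as w_y' P w_y = 1.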

In equation (4), the arguments of the covariance cov(t, s) do not include an objective variable. Thus, unlike PLS and similar methods, this method does not use information on an objective variable and is an unsupervised method. In this method, the maximization in equation (4) may be local: eigenvectors for a plurality of eigenvalues can be calculated so as to optimize the covariance cov(t, s) within the range satisfying conditions (5) and (6).

Condition (5) sets the magnitude of the weight vector w_x to "1" (that is, it is a normalization condition). Condition (6) is a constraint that shifts the magnitude of the weight vector w_y away from "1" by an amount determined by the smoothing parameter κ. The second term on the left side of condition (6) is a smoothing term that smooths the data across samples in the data matrix X by means of the dummy matrix D.

The dummy matrix D is a matrix for setting the smoothing according to the order among samples. As the dummy matrix D, for example, a first-order difference matrix D^(1) or a second-order difference matrix D^(2) can be adopted, as shown in FIGS. 1(A) and 1(B). Each row of the difference matrices D^(1) and D^(2) takes a difference between samples in order, which smooths the data across those samples.

FIGS. 1(A) and 1(B) illustrate the numbers of rows and columns of the difference matrices D^(1) and D^(2) when the samples form a single group. When there are a plurality of groups (G groups), the dummy matrix D can be set in block-diagonal form, as shown in FIG. 1(C), using dummy matrices D_(1) to D_(G) for the respective groups. For each of the per-group dummy matrices D_(1) to D_(G), a difference matrix similar to those of FIGS. 1(A) and 1(B) can be adopted among the samples of the same group.
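A dummy matrix of this kind can be sketched as follows. FIG. 1 is not reproduced in this text, so the concrete layout is an assumption: the sketch uses the standard first-order difference matrix of shape (n-1) x n with rows [-1, 1, 0, ...] and the second-order one of shape (n-2) x n with rows [1, -2, 1, 0, ...]. The function names are hypothetical.

```python
import numpy as np

def difference_matrix(n, order=1):
    """Difference matrix of the given order for n ordered samples.

    order=1 gives rows like [-1, 1, 0, ...]; order=2 gives [1, -2, 1, 0, ...].
    """
    D = np.eye(n)
    for _ in range(order):
        D = np.diff(D, axis=0)   # difference between consecutive rows
    return D

def block_diagonal(blocks):
    """Place per-group matrices on the diagonal of one larger dummy matrix."""
    rows = sum(b.shape[0] for b in blocks)
    cols = sum(b.shape[1] for b in blocks)
    out = np.zeros((rows, cols))
    r = c = 0
    for b in blocks:
        out[r:r + b.shape[0], c:c + b.shape[1]] = b
        r += b.shape[0]
        c += b.shape[1]
    return out

D1 = difference_matrix(4, order=1)   # single group of 4 ordered samples
D2 = difference_matrix(4, order=2)
# Hypothetical case of G = 2 groups with 3 and 4 ordered samples each
D = block_diagonal([difference_matrix(3), difference_matrix(4)])
```

Only the product D'D enters the matrix P of equation (7), so the overall sign of each row does not affect the result. The block-diagonal form ensures that no difference is taken across group boundaries.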

OS-PCA formulated as in equations (4) to (6) above can be described, by using the method of Lagrange multipliers, as an optimization problem of the following Lagrange function J (λ_x and λ_y are Lagrange multipliers).

Partially differentiating the function J with respect to each of the vectors w_x and w_y yields the following equations (8) and (9), respectively.

Equations (8) and (9) can be rearranged for the vectors w_x and w_y into the following equations (10) and (11), respectively.

In equations (10) and (11), the eigenvalue λ satisfies λ = 4λ_xλ_y. In equation (10), the right side is the product of the eigenvalue λ and the weight vector w_x, and the left side is the product of a symmetric matrix and the weight vector w_x.
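The images of the Lagrange function J and of equations (8) to (11) are not reproduced in this text. A reconstruction consistent with the description (the relation λ = 4λ_xλ_y and the symmetric matrix on the left side of (10)) would be the following. The sign convention of the multiplier terms and the 1/(n-1) normalization are assumptions of this sketch.

```latex
\begin{align}
J &= \frac{1}{n-1}\, w_x' X' X w_y
     - \lambda_x \left( w_x' w_x - 1 \right)
     - \lambda_y \left( w_y' P w_y - 1 \right) \notag \\
\frac{\partial J}{\partial w_x} &= \frac{1}{n-1}\, X' X w_y - 2 \lambda_x w_x = 0 \tag{8} \\
\frac{\partial J}{\partial w_y} &= \frac{1}{n-1}\, X' X w_x - 2 \lambda_y P w_y = 0 \tag{9} \\
\frac{1}{(n-1)^2}\, X' X P^{-1} X' X\, w_x &= \lambda\, w_x \tag{10} \\
\frac{1}{(n-1)^2}\, P^{-1} X' X X' X\, w_y &= \lambda\, w_y \tag{11}
\end{align}
```

Equation (10) follows by solving (9) for w_y and substituting into (8); equation (11) follows by solving (8) for w_x and substituting into (9). In both cases the factor 4λ_xλ_y appears on the right side, which gives λ = 4λ_xλ_y.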

According to equation (10), this method is described as an eigenvalue problem for the weight vector w_x of the explanatory variable t. Because smoothed PCA reduces to a generalized eigenvalue problem, its eigenvectors are not mutually orthogonal. In contrast, in OS-PCA of the present embodiment, the matrix on the left side of the eigenvalue problem is symmetric, so the eigenvectors of the weight vector w_x for different eigenvalues λ are mutually orthogonal.

With OS-PCA as described above, the weight vectors w_x and w_y are calculated as eigenvectors for each eigenvalue λ of equations (10) and (11) and substituted into equations (2) and (3), whereby the sample scores are obtained as the components of the variables t and s. Hereinafter, the scores for the largest eigenvalue λ may be referred to as the first principal component, and the scores for the next largest eigenvalue λ as the second principal component.
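The computation described above can be sketched as follows. Because the images of equations (10) and (11) are not reproduced in this text, the sketch assumes the symmetric eigenvalue problem (1/(n-1)^2) X'X P^{-1} X'X w_x = λ w_x, which is what the stated formulation yields under a 1/(n-1) covariance normalization. The function name `os_pca` and the random test data are hypothetical.

```python
import numpy as np

def os_pca(X, D, kappa, n_components=2):
    """Sketch of OS-PCA for a column-scaled data matrix X (n x q).

    Returns eigenvalues, weight vectors W_x and W_y (columns), and the
    scores T = X W_x (equation (2)) and S = X W_y (equation (3)).
    """
    n, q = X.shape
    C = X.T @ X / (n - 1)                               # covariance matrix of X
    DX = D @ X
    P = (1.0 - kappa) * np.eye(q) + kappa * DX.T @ DX   # equation (7)
    A = C @ np.linalg.solve(P, C)                       # symmetric matrix C P^-1 C
    A = (A + A.T) / 2.0                                 # remove rounding asymmetry
    lam, V = np.linalg.eigh(A)                          # ascending eigenvalues
    order = np.argsort(lam)[::-1][:n_components]        # largest eigenvalues first
    lam, W_x = lam[order], V[:, order]
    # w_y is proportional to P^-1 C w_x; dividing by sqrt(lam) gives w_y' P w_y = 1
    W_y = np.linalg.solve(P, C @ W_x) / np.sqrt(lam)
    return lam, W_x, W_y, X @ W_x, X @ W_y

rng = np.random.default_rng(0)
raw = rng.normal(size=(12, 4))                          # hypothetical data
X = (raw - raw.mean(axis=0)) / raw.std(axis=0, ddof=1)
D = np.diff(np.eye(12), axis=0)                         # first-order differences
lam, W_x, W_y, T, S = os_pca(X, D, kappa=0.5)
```

Using `numpy.linalg.eigh` on the symmetric matrix returns orthonormal eigenvectors, which reflects the orthogonality property of the weight vectors w_x. Under these assumptions, the covariance between the k-th columns of `T` and `S` equals the square root of the k-th eigenvalue.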

2-2-1. Hypothesis testing of loadings
With OS-PCA as described above, the order information among samples can be reflected in the scores through the smoothing term, and in addition the weight vector w_x satisfies a statistical property that enables hypothesis testing of the loadings (equation (13)). This point is described below.

First, the correlation coefficient corr(s, x_p) between the score s and the data x_p of the p-th data item (metabolite) of the data matrix X (p = 1 to q) is given by the following equation (12).

From the variance Var(x_p) = 1, which follows from the scaling of the data matrix X, and from equations (3), (8), and (12), the correlation coefficient corr(s, x_p) can be expressed as in the following equation (13).

In equation (13), w_x,p is the p-th component of the weight vector w_x. The denominator on the right side of equation (13) is common to all data items and does not depend on p. It follows that the weight vector w_x has the statistical property that its p-th component is proportional to the correlation coefficient corr(s, x_p) between the p-th data x_p and the score s.

Furthermore, letting R = corr(s, x_p), the t-statistic of the following equation (14) follows a t distribution with n-2 degrees of freedom.

Thus, with this method, a p-value or the like based on the above t-statistic can be obtained for each data item, such as a metabolite, by using the components of the weight vector w_x. That is, with OS-PCA of the present embodiment, statistical hypothesis testing of the loadings can be performed in the same way as with PCA and similar methods.
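The loading test can be sketched as follows. The images of equations (12) to (14) are not reproduced in this text, so the sketch assumes the standard forms: the Pearson correlation coefficient R between the score s and each data item, and the statistic t = R sqrt(n-2) / sqrt(1 - R^2). The function name and the random test data are hypothetical.

```python
import numpy as np

def loading_t_statistics(X, s):
    """Correlation of each data item with the score s, and its t-statistic.

    Assumes t = R * sqrt(n - 2) / sqrt(1 - R^2), which follows a
    t distribution with n - 2 degrees of freedom under the null hypothesis.
    """
    n = X.shape[0]
    Xc = X - X.mean(axis=0)          # center each data item
    sc = s - s.mean()                # center the score
    R = (Xc.T @ sc) / np.sqrt((Xc ** 2).sum(axis=0) * (sc ** 2).sum())
    t_stat = R * np.sqrt(n - 2) / np.sqrt(1.0 - R ** 2)
    return R, t_stat

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 5))         # hypothetical data: n = 20, q = 5
s = X @ rng.normal(size=5)           # hypothetical score vector
R, t_stat = loading_t_statistics(X, s)
```

A two-sided p-value for each data item can then be taken from the t distribution with n-2 degrees of freedom, for example with `scipy.stats.t.sf(abs(t_stat), n - 2) * 2` if SciPy is available.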

2-2-2. Averaging operation
When a single sample has been measured repeatedly, so that the data matrix X contains a plurality of data from the repeated measurements, OS-PCA of the present embodiment can introduce an averaging operation over the data derived from the same sample in order to handle such data. OS-PCA with the averaging operation is expressed as in the following equations (15) to (17).

In equations (15) to (17), the dummy matrix M for averaging is represented by a matrix of n rows and g columns as in the following equation (18). Here, g is the number of samples after the repetitions have been resolved, and can also be regarded as the number of groups (formed by the repeated data) among the n samples before resolution.

In equation (18), each of the vectors m_1 to m_g has a dimension equal to the number of repeated data for the corresponding sample. For example, the vector m_1 for averaging the first sample is expressed, based on the number n1 of its repeated data, as in the following equation (19).
m_1' = [1/n1, 1/n1, 1/n1, ..., 1/n1] (19)

The matrix Q in equation (17) corresponds to the matrix P used when no averaging operation is performed, and is given by the following equation (20).
Q = (1-κ)I + X'M'D'DMX (20)

According to equations (15) to (17), the averaging matrix M realizes the averaging operation for each repeatedly measured sample. OS-PCA in this case can also be described as an eigenvalue problem, as in the case described above. Specifically, it is described by the following equations (21) and (22).
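The averaging step itself can be sketched as follows. The image of equation (18) is not reproduced in this text, so the orientation of the operator is an assumption: the sketch builds a g x n matrix (called `A` here, a hypothetical name) whose i-th row holds 1/n_i at the positions of the n_i repeated measurements of sample i, as in equation (19), so that `A @ X` contains one averaged row per sample.

```python
import numpy as np

def averaging_operator(labels):
    """Build a g x n operator that averages rows sharing the same label."""
    labels = np.asarray(labels)
    unique = list(dict.fromkeys(labels.tolist()))   # keep first-appearance order
    A = np.zeros((len(unique), labels.size))
    for i, u in enumerate(unique):
        members = labels == u
        A[i, members] = 1.0 / members.sum()         # entries 1/n_i, as in (19)
    return A

labels = [0, 0, 0, 1, 1, 1]      # hypothetical: 2 samples, each measured 3 times
X = np.arange(12, dtype=float).reshape(6, 2)
A = averaging_operator(labels)
X_mean = A @ X                   # one averaged row per sample
```

After averaging, the difference matrix D acts on the g averaged rows rather than on the n original rows, so it has g columns in this arrangement.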

3. Verification cases
The theory of OS-PCA described above was verified using actual metabolome data. As two verification cases, OS-PCA was applied to a turnover analysis and to metabolome data of green tea, and its usefulness was confirmed by comparison with the results of ordinary principal component analysis. Each case is described below.

3-1. Case 1
In Case 1, typical PCA and OS-PCA were applied to a turnover analysis similar to that of Non-Patent Document 2.

In this case, samples of the yeast Saccharomyces cerevisiae strains BY4742 (amino acid cocktail) and X2180 (minimal medium and amino acid cocktail) that were isotope-labeled with 13C-glucose were used. Sampling was performed in a time series (that is, an order among samples) of 0 seconds, 10 seconds, 20 seconds, 40 seconds, 80 seconds, 160 seconds, 320 seconds, 640 seconds, 1280 seconds, and 2560 seconds. For each sampling result, isotopomer ratios were calculated from the metabolite measurements obtained by GC/MS (metabolome data), and the calculated values were used as the statistical data to be analyzed (that is, the data matrix X).

FIG. 2 shows the result of first applying ordinary PCA (that is, κ = 0) to the above statistical data. In FIG. 2, the horizontal axis shows the score of the first principal component, and the vertical axis shows the score of the second principal component (PC2). According to FIG. 2, with ordinary PCA the time series can be seen in the first principal component, but the difference between strains cannot be seen.

In Non-Patent Document 2, principal component analysis is performed on data obtained by subtracting the mean over all samples from the above isotopomer ratios, so that the difference between strains appears in the principal component scores. Based on that result, Lysine 4TMS and Isoleucine 2TMS are cited as notable metabolites. However, the time-series information is lost in the method of Non-Patent Document 2. Furthermore, because the isotopomer ratios themselves are not used directly as data, visual confirmation becomes necessary when selecting the related metabolites.

Next, the results of OS-PCA according to the present embodiment are shown in FIGS. 3(A) and 3(B). In this example, OS-PCA was applied to the above statistical data with the smoothing parameter κ = 0.999.

In FIG. 3(A), the horizontal axis shows the score of the first principal component of the explanatory variable t in OS-PCA (PC1t), and the vertical axis shows the score of the second principal component of the same variable t (PC2t). In FIG. 3(B), the horizontal axis shows the score of the first principal component of the auxiliary variable s in OS-PCA (PC1s), and the vertical axis shows the score of the second principal component of the same variable s (PC2s).

図３（Ａ），（Ｂ）に示す結果より、ＯＳ−ＰＣＡでは各変数ｔ，ｓについて、第１主成分で時系列を確認できると共に、第２主成分で株間の差すなわち群間差を確認することができた。第２主成分のスコアＰＣ２ｓについては、特に培地による違いが現れていることから、対応するローディングに着目した。図４に、本事例におけるローディングの仮説検定結果を示す。 The results in FIGS. 3(A) and 3(B) show that, for each of the variables t and s, OS-PCA captures the time series in the first principal component and the difference between strains, that is, the difference between groups, in the second principal component. Because the second principal component score PC2s in particular reflects a difference due to the culture medium, the corresponding loading was examined. FIG. 4 shows the results of the loading hypothesis test for this case.

図４に示すように、ローディングとしてLysine_3TMS_Minor::C00047、Lysine_4TMS_Major::C00047、Histidine::C00135+0、及びPeak-63の４つのピーク（代謝物）について、上記スコアＰＣ２ｓと有意に負の相関が確認された。この結果は、非特許文献２で注目すべき代謝物として挙げているLysineの4TMSを含んでおり、既存の報告とも一致していることが分かる。 As shown in FIG. 4, the loadings of four peaks (metabolites), Lysine_3TMS_Minor::C00047, Lysine_4TMS_Major::C00047, Histidine::C00135+0, and Peak-63, showed a significant negative correlation with the score PC2s. This result includes the 4TMS derivative of Lysine, which Non-Patent Document 2 cites as a notable metabolite, and is therefore consistent with the existing report.

以上のように、本実施形態に係るＯＳ−ＰＣＡを用いることで、時系列の情報および群間差が確認され、ローディングの統計的仮説検定を用いて選出した代謝物についても妥当および結果が得られた。 As described above, using the OS-PCA according to the present embodiment, both the time-series information and the difference between groups were confirmed, and valid results were also obtained for the metabolites selected by the statistical hypothesis test of the loadings.

３−２．事例２ 3-2. Case 2
本事例では、緑茶の品評会でランク付けされた緑茶の葉のメタボロームデータを解析対象として用いた（非特許文献３）。本データは、１位、６位、１１位、１６位、２１位、３１位、３６位、４１位、４６位、及び５１位といった順序を有する各々の緑茶について、それぞれ３回ずつ測定されたデータである。これにより、３サンプルずつの群が形成され得る。 In this case, metabolome data of green tea leaves ranked at a green tea competition were used as the analysis target (Non-Patent Document 3). The data were measured three times for each of the green teas ranked 1st, 6th, 11th, 16th, 21st, 31st, 36th, 41st, 46th, and 51st, so that each tea can form a group of three samples.

上記の統計データに関して、まずＰＣＡの結果を図５に示す。図５では、図２と同様に第１及び第２主成分のスコアを示している。図５によると、ＰＣＡでは幾つかの群の傾向は確認できるが、品評会のランキングとの関連性は確認できない。 FIG. 5 first shows the result of ordinary PCA on the above statistical data. As in FIG. 2, FIG. 5 plots the scores of the first and second principal components. According to FIG. 5, PCA reveals tendencies for some of the groups, but no relationship with the competition ranking can be confirmed.

次に、本実施形態に係るＯＳ−ＰＣＡの結果を図６に示す。本例では、平滑化パラメータκ＝０．１においてＯＳ−ＰＣＡを上記の統計データに適用した。図６（Ａ）では、ＯＳ−ＰＣＡにおける補助変数ｓの第１主成分のスコア（ＰＣ１ｏｓ）を横軸に示し、同変数ｓの第２成分のスコア（ＰＣ２ｏｓ）を縦軸に示す。 Next, the result of OS-PCA according to the present embodiment is shown in FIG. 6. In this example, OS-PCA was applied to the above statistical data with a smoothing parameter κ = 0.1. In FIG. 6(A), the horizontal axis shows the score of the first principal component (PC1os) of the auxiliary variable s in OS-PCA, and the vertical axis shows the score of the second component (PC2os) of the same variable.

図６に示す結果より、ＯＳ−ＰＣＡにおける第１主成分のスコアＰＣ１ｏｓでは、（２１位のサンプルについては比較的スコアが低いものの）概ねランクの順序に合った関係が確認できる。そこで、第１主成分のスコアＰＣ１ｏｓについてのローディングの統計的仮説検定を行った。 From the results shown in FIG. 6, it can be confirmed that the score PC1os of the first principal component in OS-PCA generally matches the order of rank (although the score is relatively low for the 21st-ranked sample). Therefore, a statistical hypothesis test of loading was performed for the score PC1os of the first principal component.

上記の仮説検定の結果としては、未知のピーク（代謝物）も含めた２２５物質中、ｐ＜０．０５で有意なものは７３個あり、ｑ＜０．０５で有意なものは５７個あった。その中でも特に上記のスコアＰＣ１ｏｓとの相関係数が０．７より高く、名前が既知のものは、下記の５物質であった。
Raffinose(R=-0.8600, p=1.133×10^-9, q=2.550×10^-7)
threo-3-Hydroxy-L-aspartic acid(R=-0.7912, p=1.941×10^-7, q=1.764×10^-5)
Arabinose(R=-0.7880, p=2.352×10^-7, q=1.764×10^-5)
Shikimic acid(R=-0.7334, p=4.023×10^-6, q=2.073×10^-4)
Galactose(R=-0.7228, p=6.450×10^-6, q=2.073×10^-4)
As a result of the above hypothesis test, of the 225 substances including unknown peaks (metabolites), 73 were significant at p < 0.05 and 57 were significant at q < 0.05. Among these, the following five substances had a known name and a correlation coefficient with the score PC1os whose magnitude exceeded 0.7.
Raffinose (R = -0.8600, p = 1.133 × 10^-9, q = 2.550 × 10^-7)
threo-3-Hydroxy-L-aspartic acid (R = -0.7912, p = 1.941 × 10^-7, q = 1.764 × 10^-5)
Arabinose (R = -0.7880, p = 2.352 × 10^-7, q = 1.764 × 10^-5)
Shikimic acid (R = -0.7334, p = 4.023 × 10^-6, q = 2.073 × 10^-4)
Galactose (R = -0.7228, p = 6.450 × 10^-6, q = 2.073 × 10^-4)
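The q-values listed above are numerically consistent with a Benjamini-Hochberg adjustment over the 225 tested peaks: if Arabinose has the third-smallest p-value, then 2.352 × 10^-7 × 225 / 3 ≈ 1.764 × 10^-5, which is the reported q-value. The source does not name the adjustment method, so this is an inference from the numbers, not a statement of the procedure actually used. A minimal sketch of that adjustment follows; the function name `bh_qvalues` and the input values are illustrative only.

```python
def bh_qvalues(pvalues):
    """Benjamini-Hochberg adjusted p-values (q-values), returned in input order."""
    m = len(pvalues)
    # Indices of the p-values sorted in ascending order.
    order = sorted(range(m), key=lambda i: pvalues[i])
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down to the smallest so that the
    # adjusted values never increase as the rank decreases.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        q[i] = running_min
    return q


# Small illustrative input (not data from the source).
print(bh_qvalues([0.01, 0.04, 0.03, 0.005]))
```

The downward pass is what makes two neighbouring ranks share one q-value, as seen for threo-3-Hydroxy-L-aspartic acid and Arabinose in the list above.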

既存の報告では、品評会でのランクと関連する物質として、糖類、アミノ酸、及びQuinic acidが挙げられている。上記のＯＳ−ＰＣＡによる解析結果において、糖類については、Raffinose、Arabinose、GalactoseがスコアＰＣ１ｏｓと高い負の相関を有し、ランクの高い緑茶にはこれらの糖類が多く含まれることが確認された。又、アミノ酸については、上記の各糖類に比べると相関は小さいものの、Serine(R=0.5427, p=1.945×10^-3, q=1.287×10^-2)、Glycine(R=0.5385, p=2.140×10^-3, q=1.338×10^-2)がスコアＰＣ１ｏｓと有意な正の相関を有し、この２つのアミノ酸は、ランクの高い緑茶には少ない傾向がある。また、その他いくつかのアミノ酸も統計的に有意な相関が確認できた。なお、Quinic acidについてはスコアＰＣ１ｏｓとの統計的な有意な相関は確認されなかった。 Existing reports list sugars, amino acids, and quinic acid as substances associated with the rank at the competition. In the above OS-PCA results, the sugars Raffinose, Arabinose, and Galactose had a strong negative correlation with the score PC1os, confirming that highly ranked green teas contain larger amounts of these sugars. As for amino acids, although the correlation is weaker than for the sugars above, Serine (R = 0.5427, p = 1.945 × 10^-3, q = 1.287 × 10^-2) and Glycine (R = 0.5385, p = 2.140 × 10^-3, q = 1.338 × 10^-2) had a significant positive correlation with the score PC1os, so these two amino acids tend to be less abundant in highly ranked green teas. Statistically significant correlations were also confirmed for several other amino acids. No statistically significant correlation with the score PC1os was confirmed for quinic acid.

以上のように、平滑化ＰＣＡの問題点を改良したＯＳ−ＰＣＡを提案し、ローディングの統計的な性質を理論的に示した。実際のメタボローム解析に適用し、ＯＳ−ＰＣスコアに注目すべきパターンを確認できると共に、統計的仮説検定を用いて代謝物を選出し、従来の知見との一致を確認することができた。 As described above, OS-PCA was proposed as an improvement that addresses the problems of smoothed PCA, and the statistical properties of its loadings were shown theoretically. When applied to actual metabolome analyses, noteworthy patterns were confirmed in the OS-PC scores, metabolites were selected using the statistical hypothesis test, and agreement with prior findings was confirmed.

４．データ解析装置について 4. Data analysis device
以上のようなＯＳ−ＰＣＡを実現するデータ解析装置について、以下説明する。 The data analysis device that realizes the OS-PCA described above is explained below.

４−１．構成 4-1. Configuration
本実施形態に係るデータ解析装置５の構成について、図７を用いて説明する。図７は、データ解析装置５の構成を示すブロック図である。 The configuration of the data analysis device 5 according to the present embodiment will be described with reference to FIG. 7. FIG. 7 is a block diagram showing the configuration of the data analysis device 5.

データ解析装置５は、例えばＰＣ（パーソナルコンピュータ）などの情報処理装置で構成される。データ解析装置５は、図７に示すように、制御部５１と、記憶部５２と、操作部５３と、表示部５４と、機器インタフェース５５と、ネットワークインタフェース５６とを備える。 The data analysis device 5 is composed of an information processing device such as a PC (personal computer). As shown in FIG. 7, the data analysis device 5 includes a control unit 51, a storage unit 52, an operation unit 53, a display unit 54, an equipment interface 55, and a network interface 56.

制御部５１は、例えばソフトウェアと協働して所定の機能を実現するＣＰＵやＭＰＵ等を含み、データ解析装置５の全体動作を制御する。制御部５１は、記憶部５２に格納されたデータやプログラムを読み出して種々の演算処理を行い、各種の機能を実現する。例えば、制御部５１は、本実施形態に係るデータ解析方法をデータ解析装置５に行わせるための命令群を含んだプログラムを実行する。上記のプログラムは、インターネット等の通信ネットワークから提供されてもよいし、可搬性を有する記録媒体に格納されていてもよい。 The control unit 51 includes, for example, a CPU, an MPU, or the like that realizes a predetermined function in cooperation with software, and controls the overall operation of the data analysis device 5. The control unit 51 reads data and programs stored in the storage unit 52 and performs various arithmetic processes to realize various functions. For example, the control unit 51 executes a program including a group of instructions for causing the data analysis device 5 to perform the data analysis method according to the present embodiment. The above program may be provided from a communication network such as the Internet, or may be stored in a portable recording medium.

また、制御部５１は、所定の機能を実現するように設計された専用の電子回路や再構成可能な電子回路などのハードウェア回路であってもよい。制御部５１は、ＣＰＵ、ＭＰＵ、ＧＰＵ、マイコン、ＤＳＰ、ＦＰＧＡ、ＡＳＩＣ等の種々の半導体集積回路で構成されてもよい。 Further, the control unit 51 may be a hardware circuit such as a dedicated electronic circuit or a reconfigurable electronic circuit designed to realize a predetermined function. The control unit 51 may be composed of various semiconductor integrated circuits such as a CPU, MPU, GPU, microcomputer, DSP, FPGA, and ASIC.

記憶部５２は、データ解析装置５の機能を実現するために必要なプログラム及びデータを記憶する記録媒体であり、例えばハードディスク（ＨＤＤ）や半導体記憶装置（ＳＳＤ）を備える。また、記憶部５２は、例えば、ＤＲＡＭやＳＲＡＭ等の半導体デバイスを備えてもよく、データを一時的に記憶するとともに制御部５１の作業エリアとしても機能する。 The storage unit 52 is a recording medium that stores programs and data necessary for realizing the functions of the data analysis device 5, and includes, for example, a hard disk (HDD) or a semiconductor storage device (SSD). Further, the storage unit 52 may be provided with a semiconductor device such as a DRAM or SRAM, and temporarily stores data and also functions as a work area of the control unit 51.

操作部５３は、ユーザが操作を行うユーザインタフェースである。操作部５３は、例えば、キーボード、タッチパッド、タッチパネル、ボタン、スイッチ、及びこれらの組み合わせで構成される。操作部５３は、ユーザによって入力される諸情報を取得する取得部の一例である。 The operation unit 53 is a user interface on which the user operates. The operation unit 53 is composed of, for example, a keyboard, a touch pad, a touch panel, buttons, switches, and a combination thereof. The operation unit 53 is an example of an acquisition unit that acquires various information input by the user.

表示部５４は、例えば、液晶ディスプレイや有機ＥＬディスプレイで構成される。表示部５４は、例えば操作部５３から入力された情報など、種々の情報を表示する。 The display unit 54 is composed of, for example, a liquid crystal display or an organic EL display. The display unit 54 displays various information such as information input from the operation unit 53.

機器インタフェース５５は、データ解析装置５に他の機器を接続するための回路（モジュール）である。機器インタフェース５５は、所定の通信規格にしたがい通信を行う取得部の一例である。所定の規格には、ＵＳＢ、ＨＤＭＩ（登録商標）、ＩＥＥＥ１３９４、ＷｉＦｉ、Ｂｌｕｅｔｏｏｔｈ（登録商標）等が含まれる。 The device interface 55 is a circuit (module) for connecting another device to the data analysis device 5. The device interface 55 is an example of an acquisition unit that communicates according to a predetermined communication standard. The predetermined standards include USB, HDMI (registered trademark), IEEE 1394, Wi-Fi, Bluetooth (registered trademark), and the like.

ネットワークインタフェース５６は、無線または有線の通信回線を介してデータ解析装置５をネットワークに接続するための回路（モジュール）である。ネットワークインタフェース５６は、所定の通信規格に準拠した通信を行う取得部の一例である。所定の通信規格には、ＩＥＥＥ８０２．３，ＩＥＥＥ８０２．１１ａ／１１ｂ／１１ｇ／１１ａｃ等の通信規格が含まれる。 The network interface 56 is a circuit (module) for connecting the data analysis device 5 to the network via a wireless or wired communication line. The network interface 56 is an example of an acquisition unit that performs communication conforming to a predetermined communication standard. Predetermined communication standards include communication standards such as IEEE802.3 and IEEE802.11a / 11b / 11g / 11ac.

以上の説明では、ＰＣ等で構成されるデータ解析装置５の一例を説明した。データ解析装置５はこれに限定されず、種々の情報処理装置（即ちコンピュータ）であってもよい。例えば、データ解析装置５は、ＡＳＰサーバなどの一つ又は複数のサーバ装置であってもよい。また、コンピュータクラスタ或いはクラウドコンピューティングなどにおいて、本開示に係るデータ解析方法が実現されてもよい。 In the above description, an example of the data analysis device 5 composed of a PC or the like has been described. The data analysis device 5 is not limited to this, and may be various information processing devices (that is, a computer). For example, the data analysis device 5 may be one or more server devices such as an ASP server. Further, the data analysis method according to the present disclosure may be realized in a computer cluster, cloud computing, or the like.

例えば、データ解析装置５は、外部から通信ネットワークを介して入力されたメタボロームデータをネットワークインタフェース５６により取得して、本実施形態のデータ解析方法を実行してもよい。データ解析装置５は、ネットワークインタフェース５６から外部に、データ解析方法の解析結果を送信してもよい。 For example, the data analysis device 5 may acquire the metabolome data input from the outside via the communication network by the network interface 56 and execute the data analysis method of the present embodiment. The data analysis device 5 may transmit the analysis result of the data analysis method to the outside from the network interface 56.

４−２．動作 4-2. Operation
本実施形態に係るデータ解析装置５の動作について、図８〜図９を用いて説明する。図８は、データ解析装置５によるデータ解析処理を示すフローチャートである。図９は、データ解析処理におけるＯＳ−ＰＣＡ演算処理を示すフローチャートである。 The operation of the data analysis device 5 according to the present embodiment will be described with reference to FIGS. 8 and 9. FIG. 8 is a flowchart showing the data analysis process performed by the data analysis device 5. FIG. 9 is a flowchart showing the OS-PCA calculation process within the data analysis process.

図８に示すフローチャートの各処理は、データ解析装置５の制御部５１によって実行される。 Each process of the flowchart shown in FIG. 8 is executed by the control unit 51 of the data analysis device 5.

まず、制御部５１は、解析対象の統計データの一例として、データ行列Ｘを取得する（Ｓ１）。例えばメタボロミクスの解析対象の統計データとして、メタボロームデータを示すデータ行列ＸがステップＳ１において取得される。データ行列Ｘにおけるデータは、代謝物の測定値であってもよいし、測定結果に基づく各種の計算値（例えばアイソトポマー比）であってもよい。 First, the control unit 51 acquires the data matrix X as an example of the statistical data to be analyzed (S1). For example, as statistical data to be analyzed for metabolomics, a data matrix X showing metabolome data is acquired in step S1. The data in the data matrix X may be a measured value of the metabolite or various calculated values (for example, isotopomer ratio) based on the measurement result.

ステップＳ１において、制御部５１は、例えば記憶部５２において予め格納されたデータを作業エリアに読み出して、データ行列Ｘを取得する。制御部５１は、操作部５３におけるユーザの操作によりデータを入力してもよいし、制御部５１は、外部から各種インタフェース５５，５６を用いて、データ行列Ｘを取得してもよい。 In step S1, the control unit 51 reads, for example, the data stored in advance in the storage unit 52 into the work area to acquire the data matrix X. The control unit 51 may input data by the operation of the user in the operation unit 53, or the control unit 51 may acquire the data matrix X from the outside using various interfaces 55 and 56.

また、制御部５１は、データ行列Ｘにおけるサンプル間の順序に関する順序情報の一例であるダミー行列Ｄを取得する（Ｓ２）。例えば、ユーザの操作によってメタボロームデータの入力時等に、サンプル間の順序の情報が設定される。 Further, the control unit 51 acquires a dummy matrix D, which is an example of order information regarding the order between samples in the data matrix X (S2). For example, information on the order between samples is set when metabolome data is input by a user operation.

ステップＳ２において、制御部５１は、例えば記憶部５２に格納された情報を参照して、ダミー行列Ｄを取得する。例えば、制御部５１は、サンプル間に設定された順序において近接する二つ以上のサンプルのデータ間の差分を取るように行列要素の値を決定してダミー行列Ｄを生成し、記憶部５２の作業エリアに保持する。制御部５１は、各種インタフェース５５，５６或いは操作部５３を用いて、ダミー行列Ｄを取得してもよい。 In step S2, the control unit 51 acquires the dummy matrix D by referring, for example, to the information stored in the storage unit 52. For example, the control unit 51 generates the dummy matrix D by determining the values of the matrix elements so as to take the difference between the data of two or more samples that are adjacent in the order set between the samples, and holds it in the work area of the storage unit 52. The control unit 51 may also acquire the dummy matrix D via the interfaces 55 and 56 or the operation unit 53.
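As an illustration of a matrix that "takes the difference between adjacent samples", the sketch below builds a first-order difference matrix. It rests on several assumptions: the samples are already arranged in their order along the rows, only two neighbouring samples are differenced at a time, and the helper names `first_difference_matrix` and `apply_matrix` are hypothetical. The dummy matrix D actually used in the embodiment (including its per-group form) is defined elsewhere in the document and may differ.

```python
def first_difference_matrix(n):
    """Return an (n-1) x n matrix whose k-th row maps v to v[k+1] - v[k]."""
    rows = []
    for k in range(n - 1):
        row = [0.0] * n
        row[k] = -1.0      # subtract the earlier sample
        row[k + 1] = 1.0   # add the next sample in the order
        rows.append(row)
    return rows


def apply_matrix(matrix, vector):
    """Plain matrix-vector product for lists of floats."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


D = first_difference_matrix(4)
# Differences between neighbouring samples of an ordered series.
print(apply_matrix(D, [1.0, 2.0, 4.0, 7.0]))
```

A penalty built on such differences is small when values change gradually along the sample order, which is the sense in which an order constraint can act as a smoothing term.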

さらに、制御部５１は、取得したデータ行列Ｘにおいて平均化操作の対象となるデータすなわち繰り返しサンプルがあるか否かを判断する（Ｓ３）。制御部５１は、繰り返しサンプルがないと判断した場合（Ｓ３でＮＯ）、特にステップＳ４の処理は行わず、ステップＳ５に進む。ステップＳ３，Ｓ４の処理は、例えばユーザの操作に応じて実行される。 Further, the control unit 51 determines whether the acquired data matrix X contains data subject to the averaging operation, that is, repeated samples (S3). When the control unit 51 determines that there are no repeated samples (NO in S3), it skips step S4 and proceeds to step S5. The processes of steps S3 and S4 are executed, for example, in response to a user operation.

制御部５１は、データ行列Ｘにおいて繰り返しサンプルがあると判断した場合（Ｓ３でＹＥＳ）、繰り返しサンプル間で平均化操作を行うためのダミー行列Ｍを取得する（Ｓ４）。ステップＳ３，Ｓ４の処理は、例えば制御部５１が取得したデータ行列Ｘにおいて行方向に記録されたデータ項目の情報を参照することによって、実行されてもよい。例えば、制御部５１は、データ行列Ｘ中の繰り返しサンプルの個数に応じて、ダミー行列Ｍを生成する（式（１８）参照）。 When the control unit 51 determines that there are repeated samples in the data matrix X (YES in S3), the control unit 51 acquires a dummy matrix M for performing an averaging operation between the repeated samples (S4). The processes of steps S3 and S4 may be executed, for example, by referring to the information of the data items recorded in the row direction in the data matrix X acquired by the control unit 51. For example, the control unit 51 generates a dummy matrix M according to the number of repeated samples in the data matrix X (see equation (18)).
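One possible form of a matrix that averages over repeated samples is sketched below. Equation (18) is not reproduced in this section, so the layout here is an assumption: replicates of the same sample are stored in consecutive rows, each output row holds the weights 1/k for one group of k replicates, and the names `averaging_matrix` and `apply_matrix` are hypothetical.

```python
def averaging_matrix(group_sizes):
    """Return a (groups x samples) matrix that averages consecutive replicates."""
    n = sum(group_sizes)
    rows = []
    start = 0
    for size in group_sizes:
        row = [0.0] * n
        for j in range(start, start + size):
            row[j] = 1.0 / size  # equal weight for every replicate in the group
        rows.append(row)
        start += size
    return rows


def apply_matrix(matrix, vector):
    """Plain matrix-vector product for lists of floats."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


# Two groups of three replicates each (illustrative values only).
M = averaging_matrix([3, 3])
print(apply_matrix(M, [1.0, 2.0, 3.0, 10.0, 20.0, 30.0]))
```

For the green tea data in Case 2, ten teas measured three times each would correspond to `averaging_matrix([3] * 10)`, a 10 x 30 matrix under this assumed layout.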

次に、制御部５１は、取得したデータ行列Ｘ及びダミー行列Ｄ，Ｍに基づいて、上述したＯＳ−ＰＣＡの理論を適用してスコアを算出する処理であるＯＳ−ＰＣＡ演算処理を行う（Ｓ５）。図９のフローチャートを用いて、ＯＳ−ＰＣＡ演算処理（Ｓ５）の一例を説明する。 Next, based on the acquired data matrix X and the dummy matrices D and M, the control unit 51 performs the OS-PCA calculation process, which calculates scores by applying the OS-PCA theory described above (S5). An example of the OS-PCA calculation process (S5) will be described with reference to the flowchart of FIG. 9.

図９の例において、まず、制御部５１は、データ行列Ｘにおいて代謝物などのデータ項目毎にサンプル間の平均が「０」で且つ分散が「１」になるように、データのスケーリング（規格化）を行う（Ｓ１０）。なお、データのスケーリング（Ｓ１０）は、データ行列Ｘの取得時（Ｓ１）に行われてもよい。又、取得されたデータ行列Ｘがスケーリング済みの場合、ステップＳ１０の処理は省略可能である。 In the example of FIG. 9, the control unit 51 first scales (normalizes) the data so that, for each data item such as a metabolite in the data matrix X, the mean across samples is 0 and the variance is 1 (S10). The data scaling (S10) may instead be performed when the data matrix X is acquired (S1). If the acquired data matrix X has already been scaled, step S10 can be omitted.
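The scaling of step S10 can be sketched as follows, assuming the data matrix is held as a list of rows (one row per sample, one column per data item) and that the variance uses the n-1 denominator. The source does not say which denominator is used, and the function name `autoscale` is hypothetical.

```python
from statistics import mean, stdev


def autoscale(X):
    """Scale each column of X (a list of rows) to mean 0 and unit sample variance."""
    columns = list(zip(*X))  # transpose: one tuple per data item
    scaled_columns = []
    for col in columns:
        mu, sd = mean(col), stdev(col)  # stdev uses the n-1 denominator
        if sd == 0:
            raise ValueError("a constant column cannot be scaled to unit variance")
        scaled_columns.append([(v - mu) / sd for v in col])
    # Transpose back to one row per sample.
    return [list(row) for row in zip(*scaled_columns)]


# Three samples, two data items (illustrative values only).
Z = autoscale([[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]])
print(Z)
```

Scaling every data item to the same variance keeps items with large numeric ranges from dominating the covariance that the later steps optimize.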

次に、制御部５１は、ＯＳ−ＰＣＡの理論における演算式に、スケーリングされたデータ行列Ｘ及びダミー行列Ｄ，Ｍを代入する（Ｓ１１）。繰り返しサンプルがない場合（Ｓ３でＮＯ）、制御部５１は、各行列Ｘ，Ｄに基づきステップＳ１１の演算式として式（１０）等を用いる。繰り返しサンプルがある場合（Ｓ３でＹＥＳ）、制御部５１は、各行列Ｘ，Ｄ，Ｍに基づき演算式として式（２１）等を用いる。各演算式は、例えば記憶部５２に予め格納されている。 Next, the control unit 51 substitutes the scaled data matrix X and the dummy matrices D and M into the arithmetic expression in the theory of OS-PCA (S11). When there is no repeating sample (NO in S3), the control unit 51 uses the equation (10) or the like as the arithmetic expression in step S11 based on the matrices X and D. When there is a repeating sample (YES in S3), the control unit 51 uses the equation (21) or the like as an arithmetic expression based on the respective matrices X, D, and M. Each calculation formula is stored in advance in, for example, a storage unit 52.

次に、制御部５１は、代入した演算式による固有値問題における１つ又は複数の固有値λおよび固有ベクトルを計算する（Ｓ１２）。これにより、共分散ｃｏｖ（ｔ，ｓ）を最適化するように各重みベクトルｗ_ｘ，ｗ_ｙが算出される。 Next, the control unit 51 calculates one or more eigenvalues λ and eigenvectors of the eigenvalue problem given by the substituted arithmetic expression (S12). The weight vectors w_x and w_y are thereby calculated so as to optimize the covariance cov(t, s).

ステップＳ１２において、例えば制御部５１は、式（１０）の各固有値λを算出し、算出した固有値λが大きい順に固有ベクトルとして、１個以上（ｎ−１）個以下の重みベクトルｗ_ｘを算出する。さらに、制御部５１は、算出した重みベクトルｗ_ｘの固有値λを式（１１）に代入して、対応する重みベクトルｗ_ｙを算出する。なお、重みベクトルｗ_ｙの算出には、式（８），（９）が用いられてもよい。 In step S12, for example, the control unit 51 calculates the eigenvalues λ of equation (10) and, taking the eigenvectors in descending order of eigenvalue, calculates at least one and at most (n-1) weight vectors w_x. The control unit 51 then substitutes the eigenvalue λ of each calculated weight vector w_x into equation (11) to calculate the corresponding weight vector w_y. Equations (8) and (9) may also be used to calculate the weight vector w_y.

次に、制御部５１は、固有値λ及び固有ベクトルの計算結果に基づいて、対応するスコアを算出する（Ｓ１３）。制御部５１は、スコアの算出（Ｓ１３）によってＯＳ−ＰＣＡ演算処理（図８のＳ５）を終了し、ステップＳ６に進む。 Next, the control unit 51 calculates the corresponding scores based on the calculated eigenvalues λ and eigenvectors (S13). With the score calculation (S13), the control unit 51 ends the OS-PCA calculation process (S5 in FIG. 8) and proceeds to step S6.

ステップＳ１３において、例えば制御部５１は、別々の固有値λによる固有ベクトル毎に、重みベクトルｗ_ｘ及び式（２）に基づき説明変数ｔのｎ個の値を各サンプルのスコアとして算出する。又、補助変数ｓについても同様に、制御部５１は、重みベクトルｗ_ｙ及び式（３）に基づき補助変数ｓの値をスコアとして算出する。なお、ステップＳ１３では、二変数ｔ，ｓのうちの一方のみによるスコアが算出されてもよい。スコアの算出は、例えば固有値λが大きい順に、第１主成分、或いは第１及び第２主成分などと制限して行われてもよい。 In step S13, for example, for each eigenvector belonging to a distinct eigenvalue λ, the control unit 51 calculates the n values of the explanatory variable t as the scores of the samples, based on the weight vector w_x and equation (2). Similarly, for the auxiliary variable s, the control unit 51 calculates the values of s as scores based on the weight vector w_y and equation (3). In step S13, scores may be calculated for only one of the two variables t and s. The score calculation may also be restricted, in descending order of eigenvalue λ, to the first principal component alone or to the first and second principal components, for example.

図８に戻り、ＯＳ−ＰＣＡ演算処理（Ｓ５）の算出結果に基づいて、制御部５１は、算出したスコアを表示するように表示部５４を制御する（Ｓ６）。例えば、制御部５１は、二変数ｔ，ｓのそれぞれについて、例えば図３（Ａ），（Ｂ）のように、第１及び第２主成分の各スコアをそれぞれサンプル毎のプロットとして表示部５４に表示させる。 Returning to FIG. 8, the control unit 51 controls the display unit 54 to display the scores calculated in the OS-PCA calculation process (S5) (S6). For example, for each of the two variables t and s, the control unit 51 causes the display unit 54 to display the scores of the first and second principal components as a per-sample plot, as in FIGS. 3(A) and 3(B).

次に、制御部５１は、操作部５３においてユーザの操作を受け付け、ユーザがさらなるデータ解析のため、表示したスコアの種類（第１又は第２主成分等）のいずれかを選択したか否かを判断する（Ｓ７）。例えば、ユーザは、表示部５４に表示されたスコアのプロット画像により、サンプル間の順序が反映されたスコアの種類を選択することができる（図３（Ａ），（Ｂ）参照）。ステップＳ７の選択は、例えば補助変数ｓによるスコアの種類について受け付けられる。 Next, the control unit 51 accepts the user's operation in the operation unit 53, and whether or not the user has selected one of the displayed score types (first or second principal component, etc.) for further data analysis. Is determined (S7). For example, the user can select the type of score that reflects the order between the samples from the plot image of the score displayed on the display unit 54 (see FIGS. 3A and 3B). The selection of step S7 is accepted, for example, for the type of score by the auxiliary variable s.

制御部５１は、ユーザがスコアの種類を選択しなかったと判断した場合（Ｓ７でＮＯ）、本処理を終了する。 When the control unit 51 determines that the user has not selected a score type (NO in S7), it ends this process.

一方、ユーザがスコアの種類のいずれかを選択したと判断した場合（Ｓ７でＹＥＳ）、制御部５１は、選択した主成分に対応する重みベクトルｗ_ｘに基づいて、ローディングの仮説検定を実施するための処理を実行する（Ｓ８〜Ｓ９）。 On the other hand, when it is determined that the user has selected one of the score types (YES in S7), the control unit 51 executes the processing for the loading hypothesis test based on the weight vector w_x corresponding to the selected principal component (S8 to S9).

例えば、制御部５１は、選択したスコアの補助変数ｓと、データ行列Ｘにおける代謝物などのデータ項目毎のデータｘ_ｐとの相関係数ｃｏｒｒ（ｓ，ｘ_ｐ）を計算する（Ｓ８）。また、制御部５１は、例えば式（１４）のｔ統計量に基づき、各データ項目のｐ値を取得する。 For example, the control unit 51 calculates the correlation coefficient corr (s, x _p ) between the auxiliary variable s of the selected score and the data x _p for each data item such as a metabolite in the data matrix X (S8). Further, the control unit 51 acquires the p-value of each data item based on, for example, the t-statistic of the equation (14).
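Step S8 can be sketched with the standard test for a Pearson correlation coefficient, t = r·sqrt(n-2)/sqrt(1-r²) with n-2 degrees of freedom. That equation (14) has exactly this form is an assumption, since the equation is not reproduced in this section; the function names are hypothetical. The two-sided p-value is obtained from the regularized incomplete beta function, evaluated here with the usual continued-fraction expansion so that only the standard library is needed.

```python
from math import exp, lgamma, log, sqrt


def _beta_cf(a, b, x, max_iter=300, eps=1e-14):
    """Continued fraction for the incomplete beta function (modified Lentz method)."""
    tiny = 1e-300
    qab, qap, qam = a + b, a + 1.0, a - 1.0
    c = 1.0
    d = 1.0 - qab * x / qap
    d = 1.0 / (d if abs(d) > tiny else tiny)
    h = d
    for m in range(1, max_iter + 1):
        m2 = 2 * m
        # Even step of the recurrence.
        aa = m * (b - m) * x / ((qam + m2) * (a + m2))
        d = 1.0 + aa * d
        d = 1.0 / (d if abs(d) > tiny else tiny)
        c = 1.0 + aa / c
        c = c if abs(c) > tiny else tiny
        h *= d * c
        # Odd step of the recurrence.
        aa = -(a + m) * (qab + m) * x / ((a + m2) * (qap + m2))
        d = 1.0 + aa * d
        d = 1.0 / (d if abs(d) > tiny else tiny)
        c = 1.0 + aa / c
        c = c if abs(c) > tiny else tiny
        delta = d * c
        h *= delta
        if abs(delta - 1.0) < eps:
            break
    return h


def reg_inc_beta(a, b, x):
    """Regularized incomplete beta function I_x(a, b)."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    bt = exp(lgamma(a + b) - lgamma(a) - lgamma(b) + a * log(x) + b * log(1.0 - x))
    if x < (a + 1.0) / (a + b + 2.0):
        return bt * _beta_cf(a, b, x) / a
    return 1.0 - bt * _beta_cf(b, a, 1.0 - x) / b


def corr_t_statistic(r, n):
    """t-statistic for a Pearson correlation r computed from n samples."""
    if abs(r) >= 1.0:
        raise ValueError("|r| must be smaller than 1")
    return r * sqrt(n - 2) / sqrt(1.0 - r * r)


def corr_p_value(r, n):
    """Two-sided p-value for the null hypothesis of zero correlation."""
    df = n - 2
    t = corr_t_statistic(r, n)
    return reg_inc_beta(df / 2.0, 0.5, df / (df + t * t))


# Check against Case 2: R = -0.8600 for Raffinose, 10 teas x 3 replicates = 30 samples.
print(f"t = {corr_t_statistic(-0.8600, 30):.3f}, p = {corr_p_value(-0.8600, 30):.3e}")
```

With n = 30, the value R = -0.8600 reported for Raffinose gives a p-value on the order of 10^-9, in line with the p = 1.133 × 10^-9 listed in Case 2. This suggests, but does not prove, that the reported p-values were obtained with a test of this form on all 30 samples.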

さらに、制御部５１は、各データ項目のｐ値と所定のしきい値（「α」とする）とを比較して、しきい値α未満のｐ値を有するデータ項目を選出する（Ｓ９）。しきい値αは、統計的に有意な水準を示し、例えばα＝０．０５である。ステップＳ９により、例えばデータ項目が代謝物に関する場合、統計的な有意水準を満たす代謝物が、自動的に選出される。制御部５１は、ステップＳ８，Ｓ９の計算結果を示すリスト等（例えば図４）を生成してもよい。 Further, the control unit 51 compares the p-value of each data item with a predetermined threshold value (referred to as “α”), and selects a data item having a p-value less than the threshold value α (S9). .. The threshold value α indicates a statistically significant level, for example, α = 0.05. In step S9, for example, if the data item relates to a metabolite, the metabolite that meets the statistical significance level is automatically selected. The control unit 51 may generate a list or the like (for example, FIG. 4) showing the calculation results of steps S8 and S9.

制御部５１は、以上のようにローディングの仮説検定の処理（Ｓ８，Ｓ９）を実行すると、図８に示すデータ解析処理を終了する。 When the control unit 51 executes the loading hypothesis testing process (S8, S9) as described above, the data analysis process shown in FIG. 8 ends.

以上のデータ解析処理によると、ＯＳ−ＰＣＡの理論に基づくデータ解析方法を実施して、サンプル間の順序を反映したスコアが得られる。さらに、当該スコアに対する相関が統計的に有意な代謝物等を選出するようなローディングの仮説検定を実現できる。 According to the above data analysis process, a data analysis method based on the theory of OS-PCA is carried out, and a score reflecting the order between the samples is obtained. Furthermore, a loading hypothesis test can be realized in which a metabolite or the like whose correlation with the score is statistically significant is selected.

なお、以上の説明では、ステップＳ９において統計的に有意なデータ項目が自動的に選出される例を説明したが、当該選出は自動的に行われなくてもよい。例えば、ユーザが、ステップＳ５の処理結果を用いることにより、仮説検定の計算を適宜行って統計的に有意なデータ項目を選出してもよい。 In the above description, an example in which statistically significant data items are automatically selected in step S9 has been described, but the selection does not have to be performed automatically. For example, the user may appropriately calculate the hypothesis test and select statistically significant data items by using the processing result of step S5.

５．まとめ 5. Summary
以上のように、本実施形態のデータ解析装置５は、複数の統計サンプルに対して複数のデータ項目に関する多変量解析を行う。データ解析装置５は、記憶部５２と、制御部５１とを備える。記憶部５２は、統計サンプル毎に複数のデータ項目を管理する統計データの一例であるデータ行列Ｘ、及び複数の統計サンプル間の順序を示す順序情報の一例であるダミー行列Ｄを記録する。制御部５１は、統計データ及び順序情報に基づく所定の演算処理（Ｓ５）を行う。制御部５１は、統計データの主成分分析における説明変数ｔと、順序情報に従う制約条件（式（６），（１７））が設定される補助変数ｓとの間の共分散を最適化するように、説明変数ｔに対応する重みベクトルｗ_ｘ（第１のベクトル）と、補助変数ｓに対応する重みベクトルｗ_ｙ（第２のベクトル）とを算出する（Ｓ１２）。制御部５１は、第１のベクトルと第２のベクトルとの内の少なくとも一方に基づいて、複数の統計サンプルに対するスコアを算出する（Ｓ１３）。 As described above, the data analysis device 5 of the present embodiment performs multivariate analysis of a plurality of data items on a plurality of statistical samples. The data analysis device 5 includes a storage unit 52 and a control unit 51. The storage unit 52 records a data matrix X, which is an example of statistical data managing a plurality of data items for each statistical sample, and a dummy matrix D, which is an example of order information indicating the order among the plurality of statistical samples. The control unit 51 performs predetermined arithmetic processing (S5) based on the statistical data and the order information. The control unit 51 calculates a weight vector w_x (first vector) corresponding to the explanatory variable t and a weight vector w_y (second vector) corresponding to the auxiliary variable s so as to optimize the covariance between the explanatory variable t in the principal component analysis of the statistical data and the auxiliary variable s, on which a constraint condition according to the order information (equations (6) and (17)) is set (S12). The control unit 51 calculates scores for the plurality of statistical samples based on at least one of the first vector and the second vector (S13).

以上のデータ解析装置５によると、ＯＳ−ＰＣＡの理論に従って、ローディングの仮説検定が可能な重みベクトルｗ_ｘに基づき、サンプル間の順序を反映したスコアが得られ、統計サンプル間の順序を考慮しながら多様なデータ解析を可能にすることができる。 According to the data analysis device 5 described above, scores reflecting the order between samples are obtained, in accordance with the OS-PCA theory, from a weight vector w_x on which a loading hypothesis test can be performed, which enables a variety of data analyses that take the order between statistical samples into account.

本実施形態において、制約条件は、順序情報が示す順序において統計サンプル毎のデータを平滑化する平滑化項（式（６），（１７）の左辺第２項）によって規定される。こうした補助変数ｓの重みベクトルｗ_ｙに関する平滑化項により、サンプル間の順序を反映したスコアと、ローディングの仮説検定が可能な重みベクトルｗ_ｘとを両立することができる。 In the present embodiment, the constraint condition is defined by a smoothing term (the second term on the left side of equations (6) and (17)) that smooths the data of the statistical samples in the order indicated by the order information. This smoothing term on the weight vector w_y of the auxiliary variable s makes it possible to obtain both scores that reflect the order between samples and a weight vector w_x on which a loading hypothesis test can be performed.

本実施形態において、スコアは、例えば図３（Ｃ）に示すように、ダミー行列Ｄのような順序情報が示す順序において増大又は減少する。本実施形態のデータ解析装置５によると、このようにスコアにサンプル間の順序を反映できる。 In the present embodiment, the score increases or decreases in the order indicated by the order information such as the dummy matrix D, for example, as shown in FIG. 3C. According to the data analysis device 5 of the present embodiment, the order between the samples can be reflected in the score in this way.

本実施形態における順序情報は、例えば図１（Ｃ）に示すダミー行列Ｄのように、複数の統計サンプルが成す群毎に、統計サンプル間の順序を示してもよい。これにより、サンプル間の群の情報をスコアに反映することも可能である。 The order information in the present embodiment may indicate the order between the statistical samples for each group formed by a plurality of statistical samples, for example, as in the dummy matrix D shown in FIG. 1C. This makes it possible to reflect the group information between the samples in the score.

本実施形態において、重みベクトルｗ_ｘは、統計データにおけるデータ項目毎のデータｘ_ｐと、重みベクトルｗ_ｙに基づくスコアｓとの間の相関係数ｃｏｒｒ（ｓ，ｘ_ｐ）に比例する複数ｑ個の成分を有する。制御部５１は、重みベクトルｗ_ｘの各成分に基づいて、複数のデータ項目の中から、統計的な有意水準を満たすデータ項目を選出してもよい（Ｓ９）。これにより、ローディングの仮説検定を自動化することもできる。 In the present embodiment, the weight vector w_x has a plurality of (q) components, each proportional to the correlation coefficient corr(s, x_p) between the data x_p of a data item in the statistical data and the score s based on the weight vector w_y. Based on the components of the weight vector w_x, the control unit 51 may select, from among the plurality of data items, those that satisfy a statistical significance level (S9). This also makes it possible to automate the loading hypothesis test.

本実施形態において、統計データの一例のデータ行列Ｘは、生体内の複数の代謝物を複数のデータ項目として、データ項目毎に対応する代謝物に関する測定値および計算値の少なくとも一方を含む。代謝物に関するデータ行列ＸにＯＳ−ＰＣＡを適用することにより、メタボロミクスにおいて統計サンプル間の順序を考慮しながら多様なデータ解析を可能にすることができる。 In the present embodiment, the data matrix X, an example of the statistical data, treats a plurality of metabolites in a living body as the plurality of data items and contains, for each data item, at least one of a measured value and a calculated value for the corresponding metabolite. Applying OS-PCA to a data matrix X of metabolites enables a variety of data analyses in metabolomics that take the order between statistical samples into account.

本実施形態のデータ解析方法は、データ解析装置５のようなコンピュータが複数の統計サンプルに対して複数のデータ項目に関する多変量解析を行う方法である。コンピュータの記憶部５２には、統計サンプル毎に複数のデータ項目を管理する統計データ、及び複数の統計サンプル間の順序を示す順序情報が記録されている。本方法は、コンピュータが、統計データの主成分分析における説明変数ｔと、順序情報に従う制約条件が設定される補助変数ｓとの間の共分散を最適化するように、説明変数ｔに対応する第１のベクトルと、補助変数ｓに対応する第２のベクトルとを算出するステップ（Ｓ１２）と、第１のベクトルと第２のベクトルとの内の少なくとも一方に基づいて、複数の統計サンプルに対するスコアを算出するステップ（Ｓ１３）とを含む。 The data analysis method of the present embodiment is a method in which a computer such as the data analysis device 5 performs multivariate analysis on a plurality of data items on a plurality of statistical samples. In the storage unit 52 of the computer, statistical data for managing a plurality of data items for each statistical sample and order information indicating the order among the plurality of statistical samples are recorded. The method corresponds to the explanatory variable t so that the computer optimizes the covariance between the explanatory variable t in the principal component analysis of the statistical data and the auxiliary variable s in which the constraints according to the order information are set. For a plurality of statistical samples based on the step (S12) of calculating the first vector and the second vector corresponding to the auxiliary variable s, and at least one of the first vector and the second vector. It includes a step (S13) of calculating a score.

本実施形態では、上記のデータ解析方法をコンピュータに実行させるためのプログラムが提供される。このプログラムは、各種のコンピュータ可読で非一時的な記録媒体に格納して提供可能である。上記のデータ解析方法及びプログラムによると、説明変数ｔと、順序情報に従う制約条件が設定される補助変数ｓとの間の共分散ｃｏｖ（ｔ，ｓ）を最適化する理論ＯＳ−ＰＣＡの適用により、統計サンプル間の順序を考慮しながら多様なデータ解析を可能にすることができる。 In the present embodiment, a program for causing a computer to execute the above data analysis method is provided. The program can be provided stored on various computer-readable, non-transitory recording media. According to the above data analysis method and program, applying the OS-PCA theory, which optimizes the covariance cov(t, s) between the explanatory variable t and the auxiliary variable s on which a constraint condition according to the order information is set, enables a variety of data analyses that take the order between statistical samples into account.

(Other embodiments)
In the first embodiment described above, an example of applying this data analysis method to metabolomics was described. The data analysis method is not limited to metabolomics and may be applied to various omics analyses and to multivariate analysis in chemometrics. In this case, the measurement data may be data obtained by omics analysis or chemometrics within the same living body.

5 Data analysis device
51 Control unit
52 Storage unit

Claims

A data analysis device that performs multivariate analysis of a plurality of data items for a plurality of statistical samples, the device comprising:
a storage unit that records statistical data that manages the plurality of data items for each statistical sample, and order information that indicates the order among the plurality of statistical samples; and
a control unit that performs predetermined arithmetic processing based on the statistical data and the order information,
wherein the control unit
calculates a first vector corresponding to an explanatory variable and a second vector corresponding to an auxiliary variable so as to optimize the covariance between the explanatory variable in a principal component analysis of the statistical data and the auxiliary variable, for which a constraint condition according to the order information is set, and
calculates scores for the plurality of statistical samples based on at least one of the first vector and the second vector.

The data analysis device according to claim 1, wherein the constraint condition is defined by a smoothing term that smooths the data for each statistical sample in the order indicated by the order information.

The data analysis device according to claim 1 or 2, wherein the score increases or decreases in the order indicated by the order information.

The data analysis device according to any one of claims 1 to 3, wherein the order information indicates the order among the statistical samples for each group formed by a plurality of statistical samples.

The data analysis device according to any one of claims 1 to 4, wherein the first vector has a plurality of components proportional to the correlation coefficients between the data for each data item in the statistical data and the score based on the second vector, and
the control unit selects, from the plurality of data items, a data item satisfying a statistical significance level based on each component of the first vector.

The data analysis device according to any one of claims 1 to 5, wherein the statistical data takes a plurality of biotransformers in a living body as the plurality of data items and includes, for each data item, at least one of a measured value and a calculated value relating to the corresponding biotransformer.

A data analysis method in which a computer performs multivariate analysis of a plurality of data items for a plurality of statistical samples, wherein
a storage unit of the computer records statistical data that manages the plurality of data items for each statistical sample, and order information indicating the order among the plurality of statistical samples, and
the method comprises:
a step in which the computer calculates a first vector corresponding to an explanatory variable and a second vector corresponding to an auxiliary variable so as to optimize the covariance between the explanatory variable in a principal component analysis of the statistical data and the auxiliary variable, for which a constraint condition according to the order information is set; and
a step in which the computer calculates scores for the plurality of statistical samples based on at least one of the first vector and the second vector.

A program for causing a computer to execute the data analysis method according to claim 7.