JP5854346B2 - Transcriptome analysis method, disease determination method, computer program, storage medium, and analysis apparatus

Info

Publication number: JP5854346B2
Application number: JP2010214256A
Authority: JP
Inventors: Tomokazu Konishi (小西 智一)
Original assignee: Akita Prefectural University
Current assignee: Akita Prefectural University
Priority date: 2010-07-21
Filing date: 2010-09-24
Publication date: 2016-02-09
Anticipated expiration: 2030-09-24
Also published as: JP2015043782A; JP2012039994A

Description

The present invention relates to a principal component calculation method, a transcriptome analysis method, a gene, an aging determination method, a computer program, a storage medium, and an analysis apparatus, and more particularly to a principal component calculation method that calculates principal components from experimental data, and to a transcriptome analysis method, a gene, an aging determination method, a computer program, a storage medium, and an analysis apparatus.

Principal component analysis is a multivariate analysis technique that summarizes matrix data by reducing its dimensionality.
According to Non-Patent Document 1, principal component analysis originated in Pearson's study of the distances between a space and the elements of a matrix. According to Non-Patent Document 2, Hotelling then organized it into a method.
According to Non-Patent Documents 3 and 4, principal component analysis is widely used, and its application to the analysis of transcriptome data, which has particularly high dimensionality, has been considered.
The transcriptome represents the overall state of mRNA (messenger RNA, transcripts) expression levels in a cell under given conditions. Although an organism normally carries the same genetic information (genome) throughout a single individual, the transcriptome differs according to tissue cell type, differentiation state, age, responses to extracellular stimuli, and so on.
The expression levels of the many mRNAs that make up a transcriptome can be measured using a DNA array (microarray) or the like.

First, the principle of principal component analysis will be described with reference to FIG. 13.
FIG. 13 is an explanatory diagram showing the principle of conventional principal component analysis. In the example of FIG. 13, an analysis target consisting of 9 samples in 3 groups with 4 measurement items is treated as a 9 × 4 matrix X.
In this calculation, singular value decomposition is applied to the matrix X to obtain the axes as singular vectors U and V, and the principal components PC are obtained using those vectors.

Principal component analysis finds eigenvectors among items and among samples in multivariate data that has many measurement items and samples and that is linear or can be transformed to be linear. Evaluating the data with those vectors summarizes the multivariate data efficiently.

In many measurements, the data can be represented as a matrix of s samples × g measurement items.
This matrix can be read either as "vectors of measurement-item elements expressed in the sample dimensions" or as "vectors of sample elements expressed in the measurement-item dimensions".
Either way the number of dimensions tends to be large, but in practice these dimensions are not necessarily orthogonal, nor do they represent the differences among the elements efficiently.
Principal component analysis sets up new axes to represent the dimensions of the matrix. The new axes are mutually orthogonal. The first axis runs along the center of the group of elements, and the second axis runs along the center of the residual not represented by the first axis.
In this way, the newly set axes approximate the data efficiently with fewer dimensions than the original matrix.

This operation can be described in terms of singular value decomposition.
Let X be a data matrix standardized, for example, by centering each item on its mean, and let X' be the transpose of X. Then

X = U · L^1/2 · V'

Here U and V are unitary matrices holding the singular vectors; V records the axes for the samples and U records the axes for the items. L^1/2 is a diagonal matrix whose diagonal entries are the singular values sorted in descending order, and V' is the transpose of V.
The principal components of the samples, PCs, and the principal components of the items, PCg, are defined by the following equations.

PCg = X · V

Similarly,

PCs = X' · U

PCs is the principal component of X'.
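The decomposition above can be sketched numerically. The following is a minimal illustration, assuming NumPy and a made-up 9 × 4 matrix (9 samples, 4 items, the shape used in FIG. 13). The random data and all variable names are assumptions for illustration and do not come from the patent.

```python
import numpy as np

rng = np.random.default_rng(0)
raw = rng.normal(size=(9, 4))          # hypothetical data: 9 samples x 4 items

# Standardize by centering each item (column) on its mean.
X = raw - raw.mean(axis=0)

# Singular value decomposition: X = U . L^1/2 . V'
U, sing, Vt = np.linalg.svd(X, full_matrices=False)
V = Vt.T

PCg = X @ V        # PCg = X . V
PCs = X.T @ U      # PCs = X' . U

# Multiplying each principal component by the transpose of the matching
# singular-vector matrix recovers the matrix it was computed from.
X_from_PCg = PCg @ V.T
Xt_from_PCs = PCs @ U.T
```

In this sketch `PCg` equals `U` with each column multiplied by its singular value, and `PCs` equals `V` treated the same way, which is why both can be rotated back to `X` and `X'`.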

As is clear from the defining equation of the singular value decomposition, each principal component reproduces the original X or X' when its inner product with the corresponding unitary matrix is taken.
The principal components are therefore rotations of X and X'. Equivalently, new orthogonal axes have been set while the positional relationships among the elements of the original matrix are left unchanged.
Because these axes are mutually orthogonal and are chosen from the directions that best express the differences among the elements, the data can be represented with fewer dimensions than with the original axes. This is the principle behind compressing the dimensionality of the data.
Each principal component depends on the number of samples or the number of genes. Its values represent the sum of the distances from the origin when the original elements are projected onto each new axis. That is, for the sample principal component PCs it is the sum of the item distances, and for the item principal component PCg it is the sum of the sample distances. Naturally, these values change if the number of samples or items changes.
In other words, according to Non-Patent Document 4, a principal component is a relative value that has meaning only within its own X.
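The dependence on the number of samples can be seen in a small experiment. The sketch below, assuming NumPy and made-up data, stacks the same centered matrix twice, so that every sample appears two times, and compares the resulting PCs = X' · U. The helper name `pcs_of` and the data are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
raw = rng.normal(size=(6, 5))              # hypothetical: 6 samples x 5 items
X = raw - raw.mean(axis=0)                 # center each item on its mean


def pcs_of(X):
    """Return PCs = X' . U together with the singular values of X."""
    U, sing, _ = np.linalg.svd(X, full_matrices=False)
    return X.T @ U, sing


PCs_once, sing_once = pcs_of(X)

# Include every sample twice: the geometry is unchanged, only the count doubles.
X_twice = np.vstack([X, X])
PCs_twice, sing_twice = pcs_of(X_twice)

# Ratio of magnitudes on the first axis (signs of singular vectors are arbitrary).
ratio = np.abs(PCs_twice[:, 0]) / np.abs(PCs_once[:, 0])
```

Every entry of `ratio` is the square root of 2, even though no new information was added. The raw values therefore cannot be compared between matrices of different sizes.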

As a conventional information processing apparatus that analyzes or predicts transcriptome formation linearly, Patent Document 1 describes an apparatus that approximates the mechanism of transcriptome formation with a thermodynamic model and uses that model to process transcriptome information (hereinafter, Prior Art 1).
The information processing apparatus of Prior Art 1 defines the concentration of each mRNA using an energy parameter that determines its synthesis rate and an energy parameter that determines its degradation rate. It defines those energy parameters using the local intracellular concentrations of factors that bind to RNA or DNA in a base-sequence-specific manner and coefficients specific to the base sequences that those factors can target.
In Prior Art 1, at least one of the mRNA concentrations, the local intracellular concentrations of the factors, and the sequence-specific coefficients is input to the thermodynamic model, and the remaining values are calculated as unknowns and output.
According to the information processing apparatus of Prior Art 1, representing the interactions between sequences and protein factors objectively makes it possible to decode the quantitative information of the genome at the transcriptome level and to reproduce the transcriptome, which provides a platform for comparing the results of various experiments and measurements and for integrating findings.

Patent Document 1: JP 2006-236011 A

Non-Patent Document 1: Pearson, K. (1901), 'On Lines and Planes of Closest Fit to Systems of Points in Space', Philosophical Magazine, 2 (6), 559-72.
Non-Patent Document 2: Hotelling, H. (1933), 'Analysis of a complex of statistical variables into principal components', Journal of Educational Psychology, 24 (7), 498-520.
Non-Patent Document 3: Jackson, J. Edward (1991), A User's Guide to Principal Components (New York: John Wiley & Sons, Inc).
Non-Patent Document 4: Shaw, Peter J. A. (2003), Multivariate Statistics for the Environmental Sciences (London: Hodder Arnold).

First, the information processing apparatus of Prior Art 1 requires values such as mRNA concentrations, local intracellular concentrations of factors, and sequence-specific coefficients to be substituted into the model, which made it difficult to apply to general-purpose microarray data, which has high dimensionality.
It was therefore desirable to analyze transcriptome data, in which mRNA levels are measured with ordinary microarrays, using conventional principal component analysis, which is suited to high-dimensional data.
However, applying conventional principal component analysis to transcriptome data raised the following problems.

First, the test items recorded as transcriptome data are often changed. In addition, many kinds of microarrays for examining transcriptomes exist among commercial products alone, and the number of kinds grows with each update.
Furthermore, microarrays often differ in the genes they cover, so the test items vary from one array to another.
Conventional principal component analysis, however, cannot accommodate such changes of microarray or of the test items in microarray data.

One reason is that the mRNA samples, test items, and gene types measured with a microarray do not all carry the same weight or importance.
In addition, microarray measurements are in many cases repeated using multiple biological samples, and the number of repetitions is not necessarily the same across experiments. The samples in the data matrix are therefore not all equally independent, nor do they all carry the same weight.
Conventional principal component analysis, however, does not account for such differences in weight and offers no means of correcting for them.

In microarray experiments, it also often happens that samples share some variation that is unrelated to the commonality expected from experimental replication.
For example, when several samples from different groups in an experiment contracted the same disease, the effect of that disease was detected by principal component analysis.
Such trends unrelated to the groups prevented meaningful changes from being found as principal components and also caused false positives.

Conventional principal component analysis also does not account for imbalance among groups.
For example, when principal component analysis is used in toxicology to analyze the transcriptome associated with the drug response of cells, and the data matrix contains many similar substances (drugs), the directions of the axes found by the analysis overrepresent those substances.

Nor could conventional principal component analysis handle data such as health checkup results, in which the measurement items differ to some extent between hospitals but are numerous.

The transcriptome analysis method of the present invention is a transcriptome analysis method that uses an analysis apparatus to calculate principal components from a data matrix and analyze a transcriptome, in which the analysis apparatus: scales a principal component by dividing it by the square root of the number of samples or the number of measurement items used to calculate that principal component; selects samples from the scaled principal components using a predetermined threshold; calculates the principal components from the data matrix of changes in expression level relating to the transcriptome; scales each principal component by dividing it by the square root of the number of samples of the data matrix used to calculate it, or by the square root of the number of measurement items of that data matrix; determines and selects, from the scaled principal components and using the predetermined threshold, that the expression level has changed; uses training data to obtain the axes of the principal components, which are represented by singular vectors, the training data using a representative value for each group consisting of the group mean and being standardized by reference values set for the items; sets the reference values by specifying the data to serve as a reference, the reference data so set being the origin of the principal components, such that the axes of the principal components are shared among multiple measurements while the reference data is defined for each measurement; defines the axes of the principal components using the training data; and classifies samples by applying the axes of the principal components to the data of individual samples.
In the transcriptome analysis method of the present invention, the representative values of the training data are determined by selecting samples and measurement items, the measurement items being narrowed down in advance by confirming significant differences between groups through analysis of variance.
In the transcriptome analysis method of the present invention, the change in expression level includes any of the amount of RNA, the amount of translated protein, the activity of translated protein, and the amount of metabolites produced through metabolism by proteins.
In the transcriptome analysis method of the present invention, the predetermined threshold is a threshold that, when the scaled principal component is compared with a normal distribution, allows two-sided false positives with a probability of 0.001.
In the transcriptome analysis method of the present invention, a change in the expression level is determined by comparing two or more scaled principal components.
In the transcriptome analysis method of the present invention, the training data is created by selecting measurement items of the data matrix, and the data of the unselected items is replaced with zero so that the size of the original matrix is preserved.
In the transcriptome analysis method of the present invention, missing data is replaced with zero when the principal components are calculated.
In the transcriptome analysis method of the present invention, the axes obtained from the training data are applied to the data matrix to calculate the principal components.
In the transcriptome analysis method of the present invention, the axes obtained from the training data are used as weights for evaluating data.
In the transcriptome analysis method of the present invention, selected data other than the data mean is used as the reference when the axes are obtained from the training data.
In the transcriptome analysis method of the present invention, selected data other than the data mean is used as the reference when the principal components are calculated.
In the transcriptome analysis method of the present invention, the principal components are calculated using a data matrix X_s and a data matrix X_p that are re-standardized by centering according to the following formula:

(formula not reproduced in this text)

where p is the number of the experimental group.
In the transcriptome analysis method of the present invention, when the data matrix X_p is subjected to singular value decomposition, the left singular vectors U_p, the diagonal matrix L^1/2, and the right singular vectors V_p satisfy the following relation:

(formula not reproduced in this text)

In the transcriptome analysis method of the present invention, the principal component PC_s for each sample is given by the following formula:

(formula not reproduced in this text)

In the transcriptome analysis method of the present invention, the principal component PC_g for each gene is given by the following formula:

(formula not reproduced in this text)

The disease determination method of the present invention is a principal component calculation method that uses an analysis apparatus to calculate principal components from a data matrix, in which the analysis apparatus: scales a principal component by dividing it by the square root of the number of samples or the number of measurement items used to calculate that principal component; selects samples from the scaled principal components using a predetermined threshold; compares a disease group with a control group; calculates the principal components from the data matrix of health checkup measurement items; scales each principal component by dividing it by the square root of the number of samples of the data matrix used to calculate it, or by the square root of the number of measurement items of that data matrix; determines and selects, from the scaled principal components and using the predetermined threshold, that a health checkup measurement item has changed; uses training data to obtain the axes of the principal components, which are represented by singular vectors, the training data using a representative value for each group consisting of the group mean and being standardized by reference values set for the items; sets the reference values by specifying the data to serve as a reference, the reference data so set being the origin of the principal components, such that the axes of the principal components are shared among multiple measurements while the reference data is defined for each measurement; defines the axes of the principal components using the training data; and classifies samples by applying the axes of the principal components to the data of individual samples.
The computer program of the present invention executes the transcriptome analysis method or the disease determination method.
The storage medium of the present invention is a storage medium storing the computer program.
The analysis apparatus of the present invention comprises: a principal component calculation unit that calculates principal components from a data matrix; a principal component scaling unit that scales each principal component by dividing it by the square root of the number of samples of the data matrix used to calculate it, or by the square root of the number of measurement items of that data matrix; and a training data creation unit that selects samples and measurement items and creates training data. Samples are selected from the scaled principal components using a predetermined threshold. The principal component calculation unit calculates the principal components from the data matrix of a transcriptome or of health checkup measurement items. The scaling unit scales each principal component by dividing it by the square root of the number of samples of the data matrix used to calculate it, or by the square root of the number of measurement items of that data matrix, and it is determined and selected, from the scaled principal components and using the predetermined threshold, that an expression level relating to the transcriptome or a health checkup measurement item has changed. The training data creation unit uses, as the training data, a representative value for each group consisting of the group mean, sets reference values for the items, and standardizes the data by those reference values. The reference values are set by specifying the data to serve as a reference, and the reference data so set is the origin of the principal components; the axes of the principal components are shared among multiple measurements while the reference data is defined for each measurement. The axes of the principal components are defined using the training data, and samples are classified by applying the axes of the principal components to the data of individual samples.
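The threshold described above can be illustrated with the Python standard library. This is a minimal sketch under two assumptions that the text does not settle: that 0.001 is the total two-sided probability (0.0005 in each tail), and that the scaled principal component is compared against a normal distribution with mean 0 and a known standard deviation `sigma`. The function names and example values are hypothetical.

```python
from statistics import NormalDist


def two_sided_threshold(alpha=0.001, sigma=1.0):
    """Cut-off t such that P(|Z| > t) = alpha for Z ~ Normal(0, sigma)."""
    return NormalDist(mu=0.0, sigma=sigma).inv_cdf(1.0 - alpha / 2.0)


def select_changed(scaled_pc, alpha=0.001, sigma=1.0):
    """Return the indices whose scaled principal component exceeds the cut-off."""
    t = two_sided_threshold(alpha, sigma)
    return [i for i, value in enumerate(scaled_pc) if abs(value) > t]


threshold = two_sided_threshold()            # about 3.29 for alpha = 0.001
hypothetical_spc = [0.2, -3.6, 1.1, 3.3, -0.4]
changed = select_changed(hypothetical_spc)   # indices 1 and 3 exceed the cut-off
```

If the intended meaning were 0.001 in each tail, the same function could be called with `alpha=0.002`.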

According to the present invention, the orthogonal axes are found from training data rather than from the data being analyzed, and scaling is applied. This provides a principal component calculation method that analyzes, more accurately than before, matrix data in which the test items have changed, the weights differ, or many similar substances are included.

FIG. 1 is a block diagram showing the control configuration of the analysis apparatus 10 according to the first embodiment of the present invention.
FIG. 2 is a conceptual diagram of the technique of separating the discovery and setting of axes from their application, according to the first embodiment of the present invention.
FIG. 3 is a flowchart of the principal component analysis process for transcriptomes according to the first embodiment of the present invention.
FIG. 4 is a diagram showing an example in which the axes were determined from all the data and an example in which the axes were determined from training data, in Example 1 according to the first embodiment of the present invention.
FIG. 5 is a diagram showing an observation example in which the samples used to find the axes were biased and an example without bias, in Example 2 according to the first embodiment of the present invention.
FIG. 6 is a diagram showing an example of a biplot displaying sPCs and sPCg on the same axes for the data of Example 1 according to the first embodiment of the present invention.
FIG. 7 is a plot showing the results of the principal component analysis used to create the gene list according to the second embodiment of the present invention.
FIG. 8 is a diagram showing an example of ortholog search results according to the second embodiment of the present invention.
FIG. 9 is a diagram showing an example in which 10 genes were selected from the list according to the second embodiment of the present invention and the degree of aging of each sample was calculated.
FIG. 10 is a diagram showing the parameters calculated by standardization in Example 3 according to the second embodiment of the present invention.
FIG. 11 is a histogram showing the frequency distribution of sPC1g in Example 4 according to the second embodiment of the present invention.
FIG. 12 is a QQ plot showing the difference between sPC1g and the normal distribution according to the second embodiment of the present invention.
FIG. 13 is a conceptual diagram explaining the conventional principal component analysis technique.

<First Embodiment>
[Control Configuration of Analysis Apparatus 10]
First, the control configuration of the analysis apparatus 10 (transcriptome analysis apparatus) according to the first embodiment of the present invention will be described with reference to FIG. 1.
The analysis apparatus 10 is a computing device such as a PC/AT-compatible machine or a general-purpose computer, on which an OS such as Linux (registered trademark) or Windows (registered trademark) is installed.
The main components of the analysis apparatus 10 are: a control unit 100, which is a control and arithmetic device such as a CPU (Central Processing Unit) or GPU (Graphics Processing Unit); a storage unit 110, which is a storage device such as RAM (Random Access Memory), ROM (Read Only Memory), an HDD (Hard Disk Drive), flash memory, or an SSD (Solid State Drive); an input unit 130, which includes a keyboard, a pointing device such as a mouse, a touch panel, and I/O interfaces to external equipment such as a microarray analyzer; a display unit 140, such as a liquid crystal display, an organic EL display, or a printer; and a network input/output unit 150, such as a LAN board conforming to a standard such as 1000BASE-T or a wireless LAN board.
In the analysis apparatus 10, the control unit 100 executes the various programs stored mainly in the storage unit 110, using data that includes a database, so that the transcriptome analysis method according to the first embodiment of the present invention is realized using hardware resources.

記憶部１１０には、本発明の第１の実施の形態に係るトランスクリプトーム解析方法を実現するためのコンピュータプログラムとデータが記憶されている。この記憶部１１０のプログラムとデータを用いて、本発明の第１の実施の形態に係るトランスクリプトーム解析方法を実行することができる。
このプログラムとデータは、トレーニングデータ作成部２１０と、特異ベクトル演算部２２０と、主成分演算部２３０と、主成分スケーリング部２４０と、データベース２５０とを含んで構成される。 The storage unit 110 stores a computer program and data for realizing the transcriptome analysis method according to the first embodiment of the present invention. The transcriptome analysis method according to the first embodiment of the present invention can be executed using the program and data in the storage unit 110.
The program and data include a training data creation unit 210, a singular vector calculation unit 220, a principal component calculation unit 230, a principal component scaling unit 240, and a database 250.

トレーニングデータ作成部２１０は、サンプルの選択と測定項目の選択をして、さらに基準となる項目値を決定し、トレーニングデータを作成する部位である。 The training data creation unit 210 is a part for creating training data by selecting a sample and selecting a measurement item, further determining a reference item value.

特異ベクトル演算部２２０は、トレーニングデータを特異値分解ないし固有値分解して特異ベクトル又はその部分を求め、保存する部位である。 The singular vector calculation unit 220 is a part that obtains and stores a singular vector or a part thereof by performing singular value decomposition or eigen value decomposition on the training data.

主成分演算部２３０は、上述の特異ベクトル又はその部分を読み込み、基準とサンプルデータとから作成された標準化データを処理して、主成分を求める部位である。 The principal component calculation unit 230 is a part that reads the above-described singular vector or part thereof, processes the standardized data created from the reference and sample data, and obtains the principal component.

主成分スケーリング部２４０は、主成分分析により求められた主成分をスケーリングする部位である。 The principal component scaling unit 240 is a part that scales the principal components obtained by principal component analysis.

データベース２５０は、ＳＱＬ等のデータベースや各種データを記憶する部位である。
データベース２５０には、主にマイクロアレイデータ２５１、トレーニングデータ２５２、軸データ２５３、主成分データ２５４を記憶している。 The database 250 is a part that stores a database such as an SQL database and various data.
The database 250 mainly stores microarray data 251, training data 252, axis data 253, and principal component data 254.

マイクロアレイデータ２５１は、各実験における群を比較するための、一般的なマイクロアレイのデータを行列データ等で記憶する部位である。
マイクロアレイデータ２５１は、例えば、アフィメトリクス社製のＡｆｆｙｍｅｔｒｉｘＭｕｒｉｎｅＧｅｎｏｍｅＵ７４Ｖｅｒｓｉｏｎ２Ａｒｒａｙの測定データを用いることができる。
また、マイクロアレイデータ２５１は、行列の要素の欠落等である欠失したデータを補った測定データを行列データとして記憶する。この行列データを、トレーニングデータ２５２から求た主成分分析の直交軸に適用（評価）することで、主成分分析による分析結果が得られる。
また、マイクロアレイデータ２５１には、後述する代表値も記憶することができる。 The microarray data 251 is a part for storing general microarray data as matrix data or the like for comparing groups in each experiment.
For example, the measurement data of Affymetrix Murine Genome U74 Version 2 Array manufactured by Affymetrix can be used as the microarray data 251.
Further, the microarray data 251 stores measurement data supplemented with missing data such as missing matrix elements as matrix data. By applying (evaluating) this matrix data to the orthogonal axis of the principal component analysis obtained from the training data 252, the analysis result by the principal component analysis is obtained.
The microarray data 251 can also store representative values described later.

トレーニングデータ２５２は、主成分分析を行う際に、測定値の偏りを排して、軸の発見を行い、主成分を求めるためのトレーニングデータである。
このトレーニングデータは、行列データＸ_tとして記憶する。 The training data 252 is training data for finding the principal component by eliminating the bias of the measured value and finding the axis when performing the principal component analysis.
This training data is stored as matrix data X_t.

軸データ２５３は、主成分分析において、行列データのなかから見いだす直交する軸の値を記憶するデータである。軸データ２５３は、後述するようにスケーリングされて保持される。
この軸データ２５３としては、行列データＸ_tから求めた特異ベクトル等を記すユニタリ行列であるＵ_t及びＶ_t、行列データＸ_tから求めたｄｉａｇｏｎａｌｍａｔｒｉｘであるＬ_t ^1/2等を記憶する。 The axis data 253 is data for storing values of orthogonal axes found from matrix data in the principal component analysis. The axis data 253 is scaled and held as will be described later.
The axis data 253 stores, for example, the unitary matrices U_t and V_t, which hold the singular vectors obtained from the matrix data X_t, and the diagonal matrix L_t^{1/2}, which is also obtained from X_t.

主成分データ２５４は、マイクロアレイデータ２５１の行列データを軸データ２５３に適用して得られる主成分を記憶するデータである。
この主成分データ２５４としては、主成分ＰＣｇ及びこれに直交する主成分ＰＣｓを記憶する。
また、ＰＣｇをスケーリングした主成分であるｓＰＣｇ、ＰＣｓをスケーリングした主成分であるｓＰＣｓを記憶する。 The principal component data 254 is data for storing principal components obtained by applying the matrix data of the microarray data 251 to the axis data 253.
As the principal component data 254, a principal component PCg and a principal component PCs orthogonal thereto are stored.
Further, sPCg which is a principal component obtained by scaling PCg and sPCs which is a principal component obtained by scaling PCs are stored.

〔トランスクリプトーム用主成分分析処理〕
次に、図２〜図３を参照して、本発明の第１の実施の形態に係るトランスクリプトーム用主成分分析方法を実行するトランスクリプトーム用主成分分析処理について説明する。
なお、本実施形態において用いるトランスクリプトームデータは、ｍＲＮＡの発現量だけでなく、タンパク質の増減やタンパク質の活性等、幅広い分野のトランスクリプトームデータに対応することができる。 [Principal component analysis for transcriptome]
Next, a transcriptome principal component analysis process for executing the transcriptome principal component analysis method according to the first embodiment of the present invention will be described with reference to FIGS.
The transcriptome data used in the present embodiment can correspond not only to the expression level of mRNA but also to transcriptome data in a wide range of fields such as protein increase / decrease and protein activity.

上述したように、主成分分析は、行列データのなかから幾つかの直交する軸を見いだし、その軸に沿ってデータを解析することでデータを要約する方法である。
この主成分分析は、特に大きな次元をもつデータを効率よく客観的に要約することができるが、その結果はデータの中だけで意味をもつ相対値であり、一般性がない。また、行列データを構成するサンプル中に偏りがあると、その偏りは結果に反映される。 As described above, principal component analysis is a method of summarizing data by finding several orthogonal axes from matrix data and analyzing the data along the axes.
This principal component analysis can efficiently and objectively summarize data having a particularly large dimension, but the result is a relative value that has meaning only in the data and has no generality. Further, if there is a bias in the samples constituting the matrix data, the bias is reflected in the result.

このため、本発明の第１の実施の形態に係るトランスクリプトーム用主成分分析方法においては、軸の発見と適用（評価）とを分離する。
軸の発見においては、主成分分析の直交軸を、分析するデータではなくトレーニングデータから見いだす。この上で、実際のマイクロアレイの実験データを、発見した軸に適用し、主成分を求める。
このように、トレーニングデータから軸を発見することで、サンプルの偏りを排することができる。
また、軸の発見と適用とを分離することによって、軸を広く共有することを可能にするため、分析結果が一般性を持つようになるという効果が得られる。
さらに、主成分をスケーリングすることで、分析値を絶対値で表すことができる。 For this reason, in the principal component analysis method for transcriptome according to the first embodiment of the present invention, axis discovery and application (evaluation) are separated.
In finding the axis, the orthogonal axis of principal component analysis is found from the training data, not the data to be analyzed. On this basis, the actual microarray experimental data is applied to the discovered axis to determine the principal component.
Thus, by finding the axis from the training data, it is possible to eliminate sample bias.
In addition, by separating the discovery and application of the axis, it is possible to share the axis widely, so that the effect that the analysis result becomes general can be obtained.
Furthermore, the analysis value can be expressed as an absolute value by scaling the principal component.

ここで、図２を参照して、本発明の第１の実施の形態に係るトランスクリプトーム用主成分分析方法の概要について説明する。
図２は、本実施形態において、軸の発見と設定を、その適用から切り離すトランスクリプトーム用主成分分析方法についての概念図である。
本実施形態に係るトランスクリプトーム用主成分分析方法では、軸を求める際にトレーニングデータを用いる。図２の例では、それぞれの群の代表値を用いている。
また、図２の例では、項目２を非選択とし、当該データを０で置き換えている。
さらに、本発明の第１の実施の形態に係るトランスクリプトーム用主成分分析方法では、特異値分解で軸を特異ベクトルとして求め、それらベクトルを用いて主成分ＰＣを求める。すなわち、主成分分析の軸を発見するために、行列Ｘの全てを使わずに、Ｘの一部、ないしＸから導かれた、より小さい行列Ｘ_t（トレーニングデータ）を用い、その軸を用いて解析する。
スケーリングについては、図２の例では、項目数及びサンプル数が３であるので、３の平方根で除することでスケーリングする。
このように構成することで、主成分分析の拡張と一般化により、従来の主成分分析処理では解析が難しかったマイクロアレイデータについて解析できる。
以下で、図３のフローチャートを参照して、本発明の第１の実施の形態に係るトランスクリプトーム用主成分分析処理の詳細について説明する。
これらの処理は、制御部１００が記憶部１１０のプログラムとデータを実行することで実現する。 Here, an outline of the principal component analysis method for transcriptome according to the first embodiment of the present invention will be described with reference to FIG.
FIG. 2 is a conceptual diagram of a principal component analysis method for transcriptome that separates the discovery and setting of axes from the application in the present embodiment.
In the principal component analysis method for transcriptome according to the present embodiment, training data is used when determining an axis. In the example of FIG. 2, the representative value of each group is used.
In the example of FIG. 2, item 2 is not selected and the data is replaced with 0.
Furthermore, in the principal component analysis method for transcriptome according to the first embodiment of the present invention, the axes are obtained as singular vectors by singular value decomposition, and the principal components PC are obtained using those vectors. That is, to find the axes of principal component analysis, the method does not use the whole matrix X; it uses a part of X, or a smaller matrix X_t (training data) derived from X, and performs the analysis with the axes found from it.
Regarding the scaling, in the example of FIG. 2, the number of items and the number of samples are 3, so scaling is performed by dividing by the square root of 3.
By configuring in this way, it is possible to analyze microarray data, which is difficult to analyze by conventional principal component analysis processing, by extending and generalizing principal component analysis.
Details of the transcriptome principal component analysis process according to the first embodiment of the present invention will be described below with reference to the flowchart of FIG.
These processes are realized by the control unit 100 executing the program and data in the storage unit 110.

ステップＳ１０１において、制御部１００は、初期化処理を行う。
具体的には、制御部１００は、記憶部１１０のデータベース２５０のマイクロアレイデータ２５１を参照して、欠失データがあった場合は、これを０（ゼロ）等で置き換える処理を行う。
また、制御部１００は、トレーニングデータ２５２、軸データ２５３、主成分データ２５４のような記憶領域を確保し、各種プログラムの初期化にあたる処理をする。 In step S101, the control unit 100 performs an initialization process.
Specifically, the control unit 100 refers to the microarray data 251 in the database 250 of the storage unit 110 and, if any data is missing, replaces it with 0 (zero) or the like.
The control unit 100 secures storage areas such as training data 252, axis data 253, and principal component data 254, and performs processing for initializing various programs.

（欠失データの取り扱い）
具体的に、この初期化処理における欠失データの取り扱いについて説明する。
たとえば、マイクロアレイを用いた具体的な実験においては、マイクロアレイ上のゴミや異物等、工学系のトラブル、信号トラブル等で、完全なマイクロアレイデータが得られないことがある。すなわち、いずれかの項目が測定できないことがあり、この場合、マイクロアレイデータの一部の欠失として記憶される。
ここで、従来の主成分分析のように、軸とデータが別々に測定される場合、このようなデータの欠失が重要になる可能性がある。たとえば、ひとつのデータの欠失によって、ひとつのサンプルの主成分が算出不能になる。
このため、本発明の第１の実施の形態に係るトランスクリプトーム用主成分分析処理においては、欠失したデータをゼロで置き換えて、欠失したデータを補う。
欠失したデータをゼロで置き換えるのは、いわゆるフェイルセーフのような考えに基づく措置である。欠失したデータをゼロで置き換えることにより、主成分は、いささかゼロに近づく。これは、距離総和から置き換えた要素が消えるからである。
しかしながら、主成分が逆に遠ざかることはないので、項目の値となるｓＰＣｇや、サンプルの値となるｓＰＣｓにしても、欠失データによって絶対値が大きくなることがないという効果が得られるため好適である。 (Handling of missing data)
Specifically, the handling of missing data in this initialization process is as follows.
For example, in an actual microarray experiment, complete microarray data may not be obtained because of dust or foreign matter on the microarray, engineering-related trouble, signal trouble, and so on. That is, some items may not be measurable, in which case they are stored as missing parts of the microarray data.
Here, when the axes and the data are measured separately, as in the conventional principal component analysis, such missing data can become significant. For example, a single missing value can make the principal component of an entire sample impossible to calculate.
For this reason, in the transcriptome principal component analysis process according to the first embodiment of the present invention, missing data is compensated for by replacing it with zero.
Replacing missing data with zero is a measure based on a fail-safe way of thinking. Replacing a missing value with zero moves the principal component slightly toward zero, because the replaced element drops out of the sum of distances.
However, the principal component does not move away from zero as a result, so neither sPCg (the value for an item) nor sPCs (the value for a sample) increases in absolute value because of missing data, which makes this approach preferable.
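The zero-replacement step can be illustrated with a minimal sketch. The function names, the use of `None`/NaN as the missing-value marker, and the example axis are assumptions made for illustration; the source only states that missing values are replaced with 0 "or the like".

```python
import math

def fill_missing_with_zero(matrix):
    """Return a copy of `matrix` with None/NaN entries replaced by 0.0."""
    def is_missing(v):
        return v is None or (isinstance(v, float) and math.isnan(v))
    return [[0.0 if is_missing(v) else float(v) for v in row] for row in matrix]

def project(row, axis):
    """Inner product of one sample (row of item values) with one axis."""
    return sum(x * a for x, a in zip(row, axis))

axis = [0.6, 0.8, 0.0]            # hypothetical unit-length axis
complete = [1.0, 2.0, 3.0]        # sample with every item measured
with_gap = fill_missing_with_zero([[1.0, None, 3.0]])[0]

pc_complete = project(complete, axis)   # 0.6 + 1.6 + 0.0
pc_with_gap = project(with_gap, axis)   # the missing item's term drops out
```

In this example the term of the missing item simply disappears from the sum, so the principal component can still be calculated and its value lies closer to zero than that of the complete sample.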

次に、ステップＳ１０２において、制御部１００は、トレーニングデータ作成部２１０を用いて、トレーニングデータ決定処理を行う。
このトレーニングデータ決定処理においては、制御部１００は、測定項目の選択、基準値の設定、代表値の選択、項目の選択、項目の基準値の設定、基準値での標準化等を行う。 Next, in step S 102, the control unit 100 performs a training data determination process using the training data creation unit 210.
In this training data determination process, the control unit 100 performs measurement item selection, reference value setting, representative value selection, item selection, item reference value setting, standardization with reference values, and the like.

まず、制御部１００は、サンプル及びは測定項目を選択してトレーニングデータを作成する。
この際、制御部１００は、平均などでサンプル情報を要約して設定して使用することもできる。 First, the control unit 100 selects a sample and a measurement item and creates training data.
At this time, the control unit 100 can summarize and set the sample information using an average or the like.

（サンプル及び測定項目の選択）
まず、制御部１００は、予め分散分析などで群間の有意差を確認して、測定項目を絞っておくことで、サンプル及び測定項目を選択し、トレーニングデータの行列Ｘ_tに設定する。これにより、代表値を定めることができる。
このようなサンプル及び測定項目の選択を行い、群間で有意な違いがあった測定項目に限定することで、擬陽性の過誤の可能性を小さくすることが可能になる。
また、同様に、制御部１００は、測定限界から外れた項目も対象外にする。この際、対象から除外された項目を削除するのではなく、該当する要素の値を全てゼロに置き換えることで、行列の型を保ちながら解析することが可能になる。
これにより、トランスクリプトームデータにおいて、軸を共有することが可能になる。 (Selection of sample and measurement item)
First, the control unit 100 checks in advance for significant differences between groups, for example by analysis of variance, and narrows down the measurement items accordingly. It thereby selects the samples and measurement items and sets them in the training data matrix X_t. The representative values can be determined in this way.
By selecting such samples and measurement items and limiting them to measurement items that have significant differences between groups, the possibility of false positive errors can be reduced.
Similarly, the control unit 100 excludes items outside the measurement limit. At this time, instead of deleting the items excluded from the target, by replacing all the values of the corresponding elements with zero, analysis can be performed while maintaining the matrix type.
This makes it possible to share axes in transcriptome data.
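The idea of zeroing excluded items rather than deleting them can be sketched as follows. The function name and the layout (rows are samples, columns are items) are assumptions made for illustration.

```python
def mask_items(matrix, selected):
    """Zero every column (item) whose index is not in `selected`.

    The matrix keeps its shape, so axes found from it still line up
    with data that contains all of the items.
    """
    return [[v if j in selected else 0.0 for j, v in enumerate(row)]
            for row in matrix]

data = [[1.0, 2.0, 3.0],
        [4.0, 5.0, 6.0]]
# Item index 1 is treated as not selected, similar to item 2 in FIG. 2.
masked = mask_items(data, selected={0, 2})
```

Because the masked matrix has the same number of columns as the original, no re-indexing of items is needed when the axes are later applied to other data.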

（トレーニングデータの構造）
以上のような処理におけるトレーニングデータの構造については、データや測定値の偏りを排するためには、軸の発見に用いるトレーニングデータの構造を均一にすることが望ましい。たとえば、一つの分野の薬剤が複数回測定されていて、他の分野の薬剤に比べて多い場合、その頻度を薬剤の分野ごとに調節するべきだ。
また、繰り返し測定がおこなわれている場合、その一つ一つのサンプルは独立したものではなくなる。繰り返し測定された箇所のデータを、サンプル平均値で置き換えれば、個体差の影響は減少される。
このようなトレーニングデータを作成することで、「群をまたいで偶然に一致した何らかの原因による」変動を、誤って検出する可能性を、従来の主成分分析よりずっと小さくすることが可能になる。 (Structure of training data)
Regarding the structure of the training data in the processing described above, it is desirable to make the structure of the training data used for finding the axes uniform, in order to eliminate bias in the data and measured values. For example, if drugs in one field have been measured more often than drugs in other fields, the frequency should be adjusted for each field.
In addition, when measurements are repeated, the individual samples are no longer independent. Replacing repeatedly measured data with the sample mean reduces the influence of individual differences.
Creating training data in this way makes the possibility of erroneously detecting a variation "due to some cause that happens to coincide across groups" much smaller than in the conventional principal component analysis.
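A minimal sketch of building training data from group means is shown below. The function name, the label list, and the choice of the arithmetic mean as the representative value are assumptions made for illustration.

```python
from collections import defaultdict

def group_means(samples, labels):
    """Collapse replicate samples into one mean row per group.

    `samples` is a list of rows (one row of item values per sample) and
    `labels` gives the group of each row. Returns the group names in
    first-seen order and the matrix of group means.
    """
    groups = defaultdict(list)
    for row, label in zip(samples, labels):
        groups[label].append(row)
    names = list(groups)
    means = [[sum(col) / len(col) for col in zip(*groups[name])]
             for name in names]
    return names, means

samples = [[1.0, 2.0], [3.0, 4.0], [10.0, 20.0]]
labels = ["a", "a", "b"]
names, x_t = group_means(samples, labels)
```

Each group contributes exactly one row to the resulting matrix regardless of how many replicates it has, which is one way to keep the structure of the training data uniform.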

次に、ステップＳ１０３において、制御部１００は、特異ベクトル演算部２２０を用いて、軸設定・発見処理を行う。
具体的には、制御部１００は、異値分解や固有値分解等を行い、特異ベクトルを求める。
たとえば、特異値分解を用いる場合、制御部１００は、選択されたサンプルと測定項目からなるデータの行列Ｘ_tについて特異値分解をし、以下の式により特異ベクトルを求める。

Ｘ_t ＝Ｕ_t・Ｌ_t ^1/2・Ｖ_t’

ここで、Ｖ_tはサンプルのための軸を、Ｕ_tは項目のための軸に係るデータである。 Next, in step S 103, the control unit 100 performs axis setting / discovery processing using the singular vector calculation unit 220.
Specifically, the control unit 100 obtains singular vectors by performing singular value decomposition, eigenvalue decomposition, or the like.
For example, when singular value decomposition is used, the control unit 100 decomposes the data matrix X_t, which consists of the selected samples and measurement items, and obtains the singular vectors from the following equation (the prime denotes the transpose):

X_t = U_t · L_t^{1/2} · V_t'

Here, V_t holds the axes for the samples and U_t holds the axes for the items.
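As an illustration of the decomposition, the sketch below finds only the leading singular triplet of a small matrix by power iteration. This is a stand-in chosen so that the example needs no external library; the source does not say which algorithm is used, and a practical implementation would more likely call a full SVD routine. The function names and the example matrix are assumptions.

```python
import math

def matvec(a, v):
    """Multiply matrix `a` (list of rows) by vector `v`."""
    return [sum(x * y for x, y in zip(row, v)) for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def leading_singular_triplet(x_t, iterations=200):
    """Power iteration on X_t' X_t; returns (u, s, v) with X_t v = s u.

    Assumes X_t is non-zero and that its largest singular value is
    strictly larger than the second one.
    """
    x_t_T = transpose(x_t)
    v = [1.0] * len(x_t[0])
    for _ in range(iterations):
        w = matvec(x_t_T, matvec(x_t, v))
        norm = math.sqrt(sum(c * c for c in w))
        v = [c / norm for c in w]
    xv = matvec(x_t, v)
    s = math.sqrt(sum(c * c for c in xv))
    u = [c / s for c in xv]
    return u, s, v

# Hypothetical 2 x 2 training matrix with singular values 2 and 1.
x_t = [[1.2, 1.6],
       [-0.8, 0.6]]
u1, s1, v1 = leading_singular_triplet(x_t)
```

For this matrix the iteration converges to the singular value 2 with v1 close to (0.6, 0.8) and u1 close to (1, 0); s1 corresponds to the first diagonal element of L_t^{1/2}.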

次に、ステップＳ１０４において、制御部１００は、特異ベクトル演算部２２０を用いて、軸保存処理を行う。
具体的には、制御部１００は、異値分解や固有値分解等により求めた特異ベクトル等を、軸データ２５３に記憶する。 Next, in step S 104, the control unit 100 performs an axis storage process using the singular vector calculation unit 220.
Specifically, the control unit 100 stores the singular vectors and the like obtained by singular value decomposition, eigenvalue decomposition, or the like in the axis data 253.

次に、ステップＳ１０５において、制御部１００は、主成分演算部２３０を用いて、軸読み込み処理を行う。
具体的には、制御部１００は、上述のステップＳ１０４にて気押した軸データ２５３を読み出して、主成分の演算をするためにＲＡＭ等に配置する。 Next, in step S 105, the control unit 100 performs an axis reading process using the principal component calculation unit 230.
Specifically, the control unit 100 reads the axis data 253 stored in step S104 described above and places it in the RAM or the like for calculating the principal components.

次に、ステップＳ１０６において、制御部１００は、主成分演算部２３０を用いて、データ読み込み処理を行う。
具体的には、制御部１００は、上述のトレーニングデータから、項目の基準値を設定して、この基準値で標準化（正規化）を行う。 Next, in step S 106, the control unit 100 performs a data reading process using the principal component calculation unit 230.
Specifically, the control unit 100 sets a reference value for each item from the above-described training data and performs standardization (normalization) with this reference value.

（基準にするデータの設定）
制御部１００は、項目の基準値を基準にするデータを特定してトレーニングデータに設定する。
この際、制御部１００は、基準にするデータとして、全データの平均値を選択することが可能である。
また当然のごとく、制御部１００は、基準にするデータについて、全データの平均値ではないデータを選択をすることもできる。
この設定された基準にするデータは主成分の原点となる。すなわち、ある特定の基準やコントロール実験が考えられる際は、これを用いるべきである。
さらに、基準にするデータは、例えば、それぞれの実験環境下毎で、解析装置１０のユーザやデータの提供者が入力部１３０を用いて設定することができる。
このようにして定められた基準にするデータによって、環境の違いを補正することが期待できる。
つまり、軸は複数の測定値で共有しつつ、基準にするデータは各測定値で定めることが好適である。 (Setting of standard data)
The control unit 100 identifies the data that serves as the reference for the item reference values and sets it in the training data.
At this time, the control unit 100 can select the average of all data as the reference data.
Naturally, the control unit 100 can also select data other than the average of all data as the reference data.
The reference data set in this way becomes the origin of the principal components. Accordingly, when a specific reference or control experiment is available, it should be used.
Furthermore, the reference data can be set, for example, for each experimental environment by the user of the analysis apparatus 10 or by the data provider using the input unit 130.
Reference data determined in this way can be expected to correct for differences between environments.
In other words, it is preferable to share the axes among multiple measurements while determining the reference data for each measurement.
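A minimal sketch of applying a reference is given below. It assumes that standardization with the reference means subtracting the reference value of each item from every sample; the source does not spell out the formula, so this is only one plausible reading, and the function names are hypothetical.

```python
def column_means(matrix):
    """Mean of every item (column) over all samples (rows)."""
    return [sum(col) / len(col) for col in zip(*matrix)]

def subtract_reference(matrix, reference):
    """Shift every sample so that `reference` becomes the origin."""
    return [[v - r for v, r in zip(row, reference)] for row in matrix]

data = [[1.0, 4.0],
        [3.0, 8.0]]

# Option 1: use the average of all data as the reference.
centered_on_mean = subtract_reference(data, column_means(data))

# Option 2: use a specific control sample (here the first row).
centered_on_control = subtract_reference(data, data[0])
```

With the second option the control sample maps exactly to the origin, which matches the statement that the reference data becomes the origin of the principal components.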

次に、ステップＳ１０７において、制御部１００は、主成分演算部２３０を用いて、主成分計算処理を行う。
具体的には、制御部１００は、上述のトレーニングデータを用いて作成した軸データ２５３を、マイクロアレイデータ２５１の行列データに適用する。より具体的には、制御部１００は、図２により説明したように、主成分ＰＣｓとＰＣｇとを下記の式により求める：

ＰＣｇ＝Ｘ_t’・Ｕ_t
ＰＣｓ＝Ｘ・Ｖ_t

制御部１００は、求めたＰＣｇ及びＰＣｓを主成分データ２５４に記憶する。 Next, in step S 107, the control unit 100 uses the principal component calculation unit 230 to perform principal component calculation processing.
Specifically, the control unit 100 applies the axis data 253, created from the training data described above, to the matrix data of the microarray data 251. More specifically, as described with reference to FIG. 2, the control unit 100 obtains the principal components PCs and PCg from the following equations:

PCg = X_t' · U_t
PCs = X · V_t

The control unit 100 stores the obtained PCg and PCs in the principal component data 254.
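The two products can be sketched with plain matrix multiplication. The matrices below are hypothetical and were constructed so that X_t = U_t · L_t^{1/2} · V_t' holds with U_t equal to the identity and singular values 2 and 1; rows are taken to be samples and columns items, which is an assumption made so that the product X · V_t is dimensionally consistent.

```python
def matmul(a, b):
    """Product of two matrices given as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols]
            for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

# Hypothetical axes and training matrix (2 groups x 2 items).
u_t = [[1.0, 0.0],
       [0.0, 1.0]]
v_t = [[0.6, -0.8],
       [0.8, 0.6]]
x_t = [[1.2, 1.6],
       [-0.8, 0.6]]

pcg = matmul(transpose(x_t), u_t)   # PCg = X_t' · U_t, one row per item
pcs_train = matmul(x_t, v_t)        # training samples on the shared axes

# A new sample that was not used to find the axes.
x_new = [[1.0, 2.0]]
pcs_new = matmul(x_new, v_t)        # PCs = X · V_t
```

The new sample is placed on axes that were found without it, which is the separation of axis discovery from application described in the text.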

次に、ステップＳ１０８において、制御部１００は、主成分スケーリング部２４０を用いて、スケーリング処理を行う。
ここで、トレーニングデータ行列Ｘ_tにより求められた軸を用いて主成分分析を行うためには、主成分の一般化、つまり項目やサンプルが変わってもその値を比べられることが必要である。
値を比較することで、トレーニングデータを作成する際の項目やサンプル群の選択の妥当性を確認することができる。
この一般化を実現するために、下記で説明する主成分の値のスケーリングを行う。 Next, in step S 108, the control unit 100 performs a scaling process using the principal component scaling unit 240.
Here, to perform principal component analysis using the axes obtained from the training data matrix X_t, the principal components must be generalized; that is, their values must be comparable even when the items or samples change.
Comparing the values makes it possible to check the validity of the items and sample groups selected when the training data was created.
To achieve this generalization, the principal component values are scaled as described below.

制御部１００は、主成分の値を、その計算に用いられた実質の項目ないしサンプルの数の平方根で除することでスケーリングを行う。これは、特異ベクトルの要素の二乗和が１になることと、主成分の要素の数とから演繹される。
たとえば、Ｘの要素の数が４倍になれば、ベクトルの各要素の期待値は１／２倍になる。このため、主成分の期待値は４／２＝２倍になると見込まれる。この場合、ルート（４）＝２で主成分を除することで、最初のＸの主成分と同じスケールをもたせることができる。
このように、項目ないしサンプルの数の平方根で除しておけば、項目ないしサンプルの平均値として主成分を扱うことができる。よって、要素数にかかわらず比較が可能になるという効果が得られる。
具体的なスケーリング方法としては、制御部１００は、主成分ＰＣｇについて、サンプル数ｎ＿ｓａｍｐｌｅであるときに、前述したユニタリ行列Ｕ_tを用いて、以下の式により、ｓＰＣｇを求める：

ｓＰＣｇ＝ＰＣｇ／（ｎ＿ｓａｍｐｌｅ^1/2）
＝Ｘ_t’・Ｕ_t ／（ｎ＿ｓａｍｐｌｅ^1/2）

ｓＰＣｇの値は、項目の主成分に含まれる、ひとつのサンプルの寄与の平均値である。 The controller 100 performs scaling by dividing the value of the principal component by the square root of the number of substantial items or samples used in the calculation. This is deduced from the fact that the sum of squares of the elements of the singular vector is 1 and the number of elements of the principal component.
For example, if the number of elements of X is quadrupled, the expected value of each element of the singular vector is halved. The expected value of the principal component is therefore 4/2 = 2 times as large. In this case, dividing the principal component by √4 = 2 gives it the same scale as the principal component of the original X.
Dividing by the square root of the number of items or samples in this way allows the principal component to be treated as an average over the items or samples, so values can be compared regardless of the number of elements.
As a specific scaling method, for the principal component PCg, when the number of samples is n_sample, the control unit 100 obtains sPCg from the following equation, using the unitary matrix U_t described above:

sPCg = PCg / n_sample^{1/2}
     = X_t' · U_t / n_sample^{1/2}

The value of sPCg is the average contribution of one sample to the principal component of an item.

また、制御部１００は、同様に選択された項目の数の値がｎ＿ｇｅｎｅであるとき、サンプルの値に含まれるひとつの項目の寄与平均値であるｓＰＣｓを求める：

ｓＰＣｓ＝Ｘ・Ｖ_t／（ｎ＿ｇｅｎｅ^1/2）

ｓＰＣｇやｓＰＣｓの値は、異なる数のサンプルや項目から求めたとしても、それぞれの一つあたりの寄与として表わされるために比較可能である。
制御部１００は、求めたｓＰＣｇ、ｓＰＣｓの値も主成分データ２５４に記憶する。
以上により、本発明の第１の実施の形態に係るトランスクリプトーム用主成分分析処理を終了する。 Similarly, when the value of the number of items selected in the same way is n_gene, the control unit 100 obtains sPCs, which is a contribution average value of one item included in the sample value:

sPCs = X · V_t / n_gene^{1/2}

Because the values of sPCg and sPCs are expressed as the contribution per sample or per item, they can be compared even when they are obtained from different numbers of samples or items.
The control unit 100 also stores the obtained sPCg and sPCs values in the principal component data 254.
Thus, the transcriptome principal component analysis process according to the first embodiment of the present invention is completed.
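The scaling argument can be checked numerically with a small sketch. The example sample and axis are hypothetical; the fourfold data set is built by repeating the same items four times and rescaling the axis so that its sum of squares stays 1.

```python
import math

def principal_component(row, axis):
    """Unscaled principal component of one sample on one axis."""
    return sum(x * a for x, a in zip(row, axis))

def scaled_pc(row, axis, n_items):
    """sPCs: principal component divided by the square root of n_items."""
    return principal_component(row, axis) / math.sqrt(n_items)

row = [1.0, 2.0]
axis = [0.6, 0.8]                     # unit length: 0.36 + 0.64 = 1

# Four times as many items: each axis element is halved so that the
# sum of squares is still 1.
row4 = row * 4
axis4 = [a / 2 for a in axis] * 4

pc1 = principal_component(row, axis)
pc4 = principal_component(row4, axis4)
s1 = scaled_pc(row, axis, len(row))
s4 = scaled_pc(row4, axis4, len(row4))
```

Here pc4 is twice pc1, as the text predicts for a fourfold increase in elements, while s1 and s4 agree, so the scaled values remain comparable across the two item counts.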

以上の構成により、以下のような効果を得ることができる。
まず、従来技術１の情報処理装置は、大きな次元をもつ汎用的なマイクロアレイのデータに適用することが難しかった。
しかしながら、大きな次元をもつデータを解析するのに適している従来の主成分分析では、検査項目が変更されたり、重みの違いがあったり、同じような物質群が多く含まれていたりする通常の実験で用いられるトランスクリプトームのデータで、正確な解析ができないという問題があった。
これに対して、本発明の第１の実施の形態に係る解析装置１０は、分析するデータではなくトレーニングデータから主成分分析の軸を発見し、スケーリングを行うことで、これらのトランスクリプトームデータを分析することができる。 With the above configuration, the following effects can be obtained.
First, the information processing apparatus of Prior Art 1 was difficult to apply to general-purpose microarray data having a large number of dimensions.
Conventional principal component analysis, on the other hand, is suited to analyzing data with a large number of dimensions, but it could not accurately analyze the transcriptome data used in ordinary experiments, in which measurement items change, weights differ, or many similar substances are included.
In contrast, the analysis apparatus 10 according to the first embodiment of the present invention can analyze such transcriptome data by finding the axes of principal component analysis from training data rather than from the data to be analyzed, and by performing scaling.

また、本発明の第１の実施の形態に係る解析装置１０は、軸の発見と設定を適用から切り離して主成分分析を行う。
これにより、軸を異なる分析者・ラボ（研究所）間・異なる測定項目をもつ測定間で共有することが可能になる。 Further, the analysis apparatus 10 according to the first embodiment of the present invention performs principal component analysis by separating axis discovery and setting from application.
This makes it possible to share the axis between different analysts and laboratories and between measurements with different measurement items.

さらに、本発明の第１の実施の形態に係る解析装置１０は、軸を共有することで、異なる分析者・ラボ間で同一の分析を行うことができる。そのため。分析結果が、あるデータの組み合わせのなかで閉じたものではなくなる。すなわち、ある分析結果を、他の実験データの分析結果と客観性をもって比較することが可能になる。
また、本発明の第１の実施の形態に係る解析装置１０は、スケーリングをすることで値が相対値ではなくなる。
本発明の第１の実施の形態に係る解析装置１０は、これらの処理により、主成分に一般性を持たせることができる。
このため、既存の軸を未知資料に適用することで、その資料を分類することもできる。 Furthermore, the analysis apparatus 10 according to the first embodiment of the present invention can perform the same analysis across different analysts and laboratories by sharing the axes. As a result, an analysis result is no longer closed within a particular combination of data; that is, it can be compared objectively with the analysis results of other experimental data.
Moreover, because the analysis apparatus 10 according to the first embodiment of the present invention performs scaling, the values are no longer relative values.
Through these processes, the analysis apparatus 10 according to the first embodiment of the present invention can give generality to the principal components.
Consequently, an unknown sample can also be classified by applying existing axes to it.

さらに、本発明の第１の実施の形態に係る解析装置１０は、トレーニングデータを用いることで、サンプルや群の偏りにたいして分析がよりロバストになり、実験の目的に沿った結果を得ることができる。 Furthermore, the analysis apparatus 10 according to the first embodiment of the present invention can use the training data to make the analysis more robust with respect to the bias of the sample and the group, and can obtain a result according to the purpose of the experiment. .

また、従来の主成分分析では、偶然によって定められる主成分の符号を除けば、行列が与えられれば、主成分がほぼ一元的に求まっていた。すなわち、従来の主成分分析で解析者であるユーザーに委ねられていたのは距離の定義だけであった。距離の定義は、行列をいかに標準化するかの選択によって変わり、この標準化のあとには選択肢はなかった。このため、従来の主成分分析の解析結果では、ある意味、客観性が保証されていた。
しかしながら、従来の主成分分析を、トランスクリプトームデータに対応させるため、項目の数やサンプルの数が変わったデータに適用しようとすると、距離の和である主成分のスケールが変わるので、それらの値は比較できないという問題があった。
これに対して、本発明の第１の実施の形態に係る解析装置１０においても、トレーニングデータを使うので、従来の主成分分析方法とは、定性的に異なる点が生じる。すなわち、軸をどのデータ行列から調査するのかに任意性が与えられれば、「どの項目を選択し、どのサンプルを選択するか（代表値をどう導くか）」という選択肢が生じる。
これにより、一見したところ客観性が損われるように思われる。しかしながら、本発明の第１の実施の形態に係る解析装置１０は、主成分の値をスケーリングにより絶対値とすることで、異なる選択による結果の間に比較可能性をもたせることができる。
よって、いずれの選択肢がより適切であるかを検討できるように保つことができる。 Further, in the conventional principal component analysis, once a matrix is given, the principal components are determined almost uniquely, apart from their signs, which are determined by chance. That is, the only thing left to the user performing the analysis was the definition of distance. The definition of distance depended on how the matrix was standardized, and after standardization no choices remained. For this reason, the results of conventional principal component analysis had, in a sense, guaranteed objectivity.
However, when the conventional principal component analysis is applied to data in which the number of items or samples varies, as is needed for transcriptome data, the scale of the principal component, which is a sum of distances, changes, so the values cannot be compared.
In contrast, the analysis apparatus 10 according to the first embodiment of the present invention uses training data and therefore differs qualitatively from the conventional principal component analysis method. That is, once there is freedom in choosing the data matrix from which the axes are found, choices arise as to which items and which samples to select (that is, how to derive the representative values).
At first glance this seems to impair objectivity. However, the analysis apparatus 10 according to the first embodiment of the present invention makes the principal component values absolute by scaling, so that results obtained from different selections can be compared.
It therefore remains possible to examine which choice is more appropriate.

以下、本発明の第１の実施の形態に係る解析装置１０を用いて、具体的なマイクロアレイの実験データを使用した解析処理を行い、その結果がどう変化するのかを示す。 Hereinafter, analysis processing using specific microarray experimental data is performed using the analysis apparatus 10 according to the first embodiment of the present invention, and how the results change will be described.

〔実施例１〕
まず、図４を参照して、マウス乳腺の妊娠と出産にかかわるタイムコース実験の解析に用いた例を示す。この実験では、ＮＣＢＩのＧＥＯデータベースにあるＳｅｒｉｅｓＧＳＥ８１９１のデータ（ＵＲＬ＜ｈｔｔｐ：／／ｗｗｗ．ｎｃｂｉ．ｎｌｍ．ｎｉｈ．ｇｏｖ／ｇｅｏ／ｑｕｅｒｙ／ａｃｃ．ｃｇｉ？ａｃｃ＝ＧＳＥ８１９１＞、「Ｋｅｙｓｔａｇｅｓｉｎｍａｍｍａｒｙｇｌａｎｄｄｅｖｅｌｏｐｍｅｎｔ．」）を用いた。具体的には、ＮＣＢＩのＧＥＯデータベースにあるＳｅｒｉｅｓＧＳＭ２０２６６６から（続き番号で）ＧＳＭ２０２７０５までの４０データを用い、使用されたチップはＡｆｆｙｍｅｔｒｉｘＭｕｒｉｎｅＧｅｎｏｍｅＵ７４Ｖｅｒｓｉｏｎ２Ａｒｒａｙである。
より具体的には、図４は、図はサンプルの、スケーリングした主成分であるｓＰＣｓ１とｓＰＣｓ２値を示している。図中に１から６までの数値で表されているのが妊娠の進行に伴う経過、７から９は出産後、１０は断乳後であり、各群４サンプル分のデータを示している。
図４（ａ）は、全データから軸を発見した例を示す。また、図４（ｂ）は、それぞれの群の平均値からなるトレーニングデータから軸を発見した例を示す。
この結果から明らかなように、ｓＰＣｓ１は母乳産生のための乳腺の発達過程を、ｓＰＣｓ２は断乳後の過程を、それぞれ軸として検出していると考えられる。
このように、トレーニングデータを使うことで、郡内のばらつきが減少しており、それはｓＰＣｓ２で特に顕著である。
すなわち、軸を発見・定義するためのトレーニングデータと、分析対象のデータとを分離することで、分析がより目的に叶ったものになる。この効果は、たとえば群間の分離の改善となって現れる。
実際に、図４（ｂ）においては、群間の分離が著しく改善されている。これは、特にｓＰＣｓ２の軸が、サンプルの個体差の影響から免れ、より現象をよく反映するようになったからだと考えられる。 [Example 1]
First, with reference to FIG. 4, an example is shown in which the method was used to analyze a time-course experiment on pregnancy and delivery in the mouse mammary gland. In this experiment, the data of Series GSE8191 in the NCBI GEO database (URL <http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE8191>, "Key stages in mammary gland development.") were used. Specifically, the 40 data sets GSM202666 through GSM202705 (consecutive accession numbers) in the NCBI GEO database were used; the chip used was the Affymetrix Murine Genome U74 Version 2 Array.
More specifically, FIG. 4 shows the values of sPCs1 and sPCs2, the scaled principal components of the samples. In the figure, the numbers 1 to 6 indicate the progress of pregnancy, 7 to 9 indicate the period after delivery, and 10 indicates the period after weaning; data for 4 samples are shown for each group.
FIG. 4(a) shows an example in which the axes were found from all data. FIG. 4(b) shows an example in which the axes were found from training data consisting of the mean of each group.
As is clear from these results, sPCs1 appears to detect the development of the mammary gland for milk production as an axis, and sPCs2 the process after weaning.
Thus, using training data reduces the within-group variation, which is particularly noticeable for sPCs2.
That is, separating the training data used to find and define the axes from the data to be analyzed makes the analysis better suited to its purpose. This effect appears, for example, as improved separation between groups.
Indeed, in FIG. 4(b) the separation between groups is markedly improved. This is presumably because the sPCs2 axis in particular is freed from the influence of individual differences among samples and reflects the phenomenon more faithfully.

〔実施例２〕
次に、図５を参照して、いわゆるトキシコロジーの分野のデータの分析に用いた結果を示す。この実験では、毒性が強いサンプル１，３，５を、そうでない２，４，６および薬物を与えないＣ群と比較したものである。毒性のない２，４，６群はＣ群の近くに位置している。なお、６のサンプルのひとつは、おそらく毒性のあるサンプルと取り違えたものだと考えられる。
より具体的には、図５（ａ）は、軸を発見するサンプルに偏りを持たせた観察例を示す。また、図５（ｂ）は、偏りがない例を示す。
図５においては、いずれの結果も、群の平均値をトレーニングデータとして軸を決定している。しかしながら、図５（ａ）では群５（アスタリスクで強調している）だけ、代表値ではなく全てのデータをトレーニングデータの中に含めてある。この操作によって、データ数の偏りを人為的に起こして、その影響を観察した。
図５（ａ）では、ｓＰＣｓ２は群５の郡内の差を分離することに費やされていることが明白である。これに対して、図５（ｂ）では、それぞれの群が同じような主成分をとっており、ｓＰＣｓ２では１，３，５群が分離している。もちろん、郡内の差はサンプルの個体差を反映したものであり、着目すべき重要なものではない。
つまり、図５（ａ）では、サンプルの偏りが、本来の調査目的を隠してしまっている。これは、サンプルの種類に偏りがある場合、従来の主成分分析法では避けられない現象であった。
これに対して、図５（ｂ）では、そうした場合でも適切なトレーニングデータを用いることで、偏りの影響を避けられることを示している。
すなわち、トレーニングデータを用いることで、サンプルの偏りに起因する軸の重み付けの間違いが解決する。これはサンプルの偏りに対する頑健さとなって現れる。 [Example 2]
Next, referring to FIG. 5, the results of applying the method to data from the field of toxicology are shown. In this experiment, the highly toxic samples 1, 3, and 5 were compared with the less toxic samples 2, 4, and 6 and with group C, which received no drug. The non-toxic groups 2, 4, and 6 are located close to group C. Note that one of the samples in group 6 was probably swapped with a toxic sample.
More specifically, FIG. 5 (a) shows an observation example in which the samples used for finding the axes are biased. FIG. 5 (b) shows an example in which there is no bias.
In both results in FIG. 5, the axes are determined using the mean of each group as training data. In FIG. 5 (a), however, for group 5 only (highlighted with an asterisk), all of the data rather than the representative value are included in the training data. This operation artificially introduced a bias in the number of data points so that its effect could be observed.
In FIG. 5 (a), it is clear that sPCs2 is spent on separating the differences within group 5. In FIG. 5 (b), by contrast, the samples of each group take similar principal component values, and sPCs2 separates groups 1, 3, and 5. Of course, the within-group differences reflect individual differences among the samples and are not of interest here.
That is, in FIG. 5A, the bias of the sample hides the original investigation purpose. This is a phenomenon that cannot be avoided by the conventional principal component analysis method when the types of samples are biased.
On the other hand, FIG. 5B shows that the influence of bias can be avoided by using appropriate training data even in such a case.
That is, by using training data, the axis weighting error due to sample bias is solved. This appears as robustness against sample bias.
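The two training sets compared in FIG. 5 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function names `group_means` and `build_training`, the `keep_all` parameter, and the toy values are all hypothetical, and each sample is assumed to be a list of per-gene values.

```python
from collections import defaultdict

def group_means(columns, labels):
    """Average the sample columns of each group, gene by gene."""
    grouped = defaultdict(list)
    for col, label in zip(columns, labels):
        grouped[label].append(col)
    return {
        label: [sum(vals) / len(vals) for vals in zip(*cols)]
        for label, cols in grouped.items()
    }

def build_training(columns, labels, keep_all=()):
    """Return training columns: one mean per group, except for groups in
    keep_all, whose individual samples are all kept (the biased case)."""
    means = group_means(columns, labels)
    training = []
    for label in dict.fromkeys(labels):  # preserve first-seen group order
        if label in keep_all:
            training += [c for c, l in zip(columns, labels) if l == label]
        else:
            training.append(means[label])
    return training

# Toy data (hypothetical): 3 groups x 2 replicates, 2 genes per sample.
samples = [[1.0, 0.0], [3.0, 2.0], [0.0, 1.0], [0.0, 3.0], [5.0, 5.0], [7.0, 5.0]]
labels = ["C", "C", "1", "1", "5", "5"]

balanced = build_training(samples, labels)                # one column per group
biased = build_training(samples, labels, keep_all={"5"})  # group 5 over-represented
```

In the biased set, group 5 contributes more columns than any other group, so an axis found from it can be drawn toward the differences within that group, which is the effect described for FIG. 5 (a).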

図６は、図４（ｂ）と同じデータについて、ｓＰＣｓとｓＰＣｇを同時に表示する、いわゆるバイプロットを行ったものである。
図中の一つの円はそれぞれの遺伝子のｓＰＣｇを、番号はそれぞれの群のｓＰＣｓを示している。
約１万の測定項目をもつｓＰＣｓと、たかだか１０のサンプル代表値から計算されるｓＰＣｇが同じ軸の上で表示されていることが、主成分のスケーリングの効果を端的に表している。
ここで、非特許文献４を参照すると、スケーリングをしない場合、軸の目盛りは共有できない。
これに対して、図６では、ｓＰＣが負である遺伝子が、群１０を特徴付けていることが簡単に理解できる。 FIG. 6 shows a so-called biplot in which sPCs and sPCg are simultaneously displayed for the same data as in FIG. 4B.
One circle in the figure indicates the sPCg of each gene, and the number indicates the sPCs of each group.
The fact that sPCs having about 10,000 measurement items and sPCg calculated from at most 10 sample representative values are displayed on the same axis directly represents the effect of scaling of the principal component.
Here, referring to Non-Patent Document 4, the axis scale cannot be shared unless scaling is performed.
On the other hand, in FIG. 6, it can be easily understood that the genes having negative sPC characterize the group 10.

また、図４の実施例１と、図５の実施例２とでは、全く異なる測定であるにもかかわらず、その軸のスケールがだいたい同じであった。
このことから、これらの実験でのトランスクリプトームの変化の規模はだいたい同一であったことがわかる。
すなわち、それぞれの測定で測定するためのｍＲＮＡ等のチップコンテンツが異なり、遺伝子数が異なるにもかかわらず、こうした比較ができることも、主成分のスケーリングの効果のひとつである。 Further, in Example 1 in FIG. 4 and Example 2 in FIG. 5, the scale of the axis was almost the same, although the measurement was completely different.
This indicates that the scale of transcriptome change in these experiments was roughly the same.
That is, one of the effects of scaling the principal components is that such a comparison is possible even though the chip contents (such as the mRNAs measured) differ between the measurements and the number of genes differs.

また、主成分分析はもともと、多数の測定項目をもつデータのなかからトレンドを見いだし、軸を定義する方法である。
上述した本実施形態に係るトランスクリプトーム用主成分分析方法を用いることで、トレーニングデータを用いて、たとえば健康診断で得られるデータのなかから、特定の疾病を示唆する測定項目とそれぞれの重みを発見することができる。この軸を個々の測定データに適用することで、その疾病を発見することが可能になる。 Principal component analysis is, by nature, a method for finding trends in data with many measurement items and defining axes.
By using the principal component analysis method for transcriptome according to the present embodiment with training data, it is possible to find, for example in data obtained from health checkups, the measurement items that suggest a specific disease and their respective weights. Applying this axis to individual measurement data makes it possible to detect the disease.

また、本実施形態に係るトランスクリプトーム用主成分分析方法においては、トキシコロジーなどのマイクロアレイを分析する際に、適切なトレーニングデータを用いて軸を定義し、それを個々のサンプルのデータに適用することで、サンプルをクラス分けすることができる。
これによって、新たなサンプルについても、どんな種類の毒性があるのかを調べることができる。 Moreover, in the principal component analysis method for transcriptome according to the present embodiment, when analyzing a microarray such as toxicology, an axis is defined using appropriate training data and applied to data of individual samples. This makes it possible to classify samples.
This makes it possible to determine what kind of toxicity the new sample has.

また、本発明の本発明の第１の実施の形態に係るトランスクリプトーム用主成分分析方法は、主成分を、その主成分の算出に用いたサンプル数または測定項目数の平方根で除することでスケーリングすることを特徴とする。
また、本発明の本発明の第１の実施の形態に係るトランスクリプトーム用主成分分析方法は、複数のスケーリングした主成分を比較することを特徴とする。
また、本発明の本発明の第１の実施の形態に係るトランスクリプトーム用主成分分析方法は、特異ベクトルで表されるような主成分の軸を求めるために、トレーニングデータを用いることを特徴とする。
また、本発明の本発明の第１の実施の形態に係るトランスクリプトーム用主成分分析方法は、データの測定項目を選択してトレーニングデータを作成する際に、選択されなかった項目のデータをゼロで置き換えて、オリジナルの行列の大きさを保つことを特徴とする。
また、本発明の本発明の第１の実施の形態に係るトランスクリプトーム用主成分分析方法は、主成分を算出する際に、欠損データをゼロで置き換えることを特徴とする。
また、本発明の本発明の第１の実施の形態に係るトランスクリプトーム用主成分分析方法は、トレーニングデータから求めた軸を用いてデータを評価し、主成分を求めることを特徴とする。
また、本発明の本発明の第１の実施の形態に係るトランスクリプトーム用主成分分析方法は、トレーニングデータから求めた軸を、データ評価のための重みとして使用することを特徴とする。
また、本発明の本発明の第１の実施の形態に係るトランスクリプトーム用主成分分析方法は、トレーニングデータから軸を求める際に、データ平均以外の任意のデータを基準に使用することを特徴とする。
また、本発明の本発明の第１の実施の形態に係るトランスクリプトーム用主成分分析方法は、主成分を求める際に、データ平均以外の任意のデータを基準に使用すること。
また、本発明の本発明の第１の実施の形態に係るコンピュータプログラムは、前記トランスクリプトーム用主成分分析方法を実行することを特徴とする。
また、本発明の本発明の第１の実施の形態に係る計算装置は、前記トランスクリプトーム用主成分分析方法を実行することを特徴とする。 In the principal component analysis method for transcriptome according to the first embodiment of the present invention, the principal component is scaled by dividing it by the square root of the number of samples or the number of measurement items used to calculate that principal component.
The principal component analysis method for transcriptome according to the first embodiment of the present invention is characterized by comparing a plurality of scaled principal components.
The principal component analysis method for transcriptome according to the first embodiment of the present invention is characterized in that training data is used to obtain the axes of the principal components, which are represented by singular vectors.
The principal component analysis method for transcriptome according to the first embodiment of the present invention is characterized in that, when training data is created by selecting measurement items of the data, the data of the items not selected are replaced with zero so that the size of the original matrix is maintained.
The principal component analysis method for transcriptome according to the first embodiment of the present invention is characterized in that missing data is replaced with zero when calculating a principal component.
The principal component analysis method for transcriptome according to the first embodiment of the present invention is characterized in that data is evaluated using the axes obtained from training data to obtain the principal components.
The principal component analysis method for transcriptome according to the first embodiment of the present invention is characterized in that an axis obtained from training data is used as a weight for data evaluation.
The principal component analysis method for transcriptome according to the first embodiment of the present invention is characterized in that, when obtaining the axes from training data, arbitrary data other than the data average is used as the reference.
The principal component analysis method for transcriptome according to the first embodiment of the present invention is characterized in that, when obtaining the principal components, arbitrary data other than the data average is used as the reference.
A computer program according to the first embodiment of the present invention is characterized by executing the principal component analysis method for transcriptome.
The computing device according to the first embodiment of the present invention is characterized by executing the principal component analysis method for transcriptome.
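The scaling step in the list above can be illustrated with a short sketch. It assumes one reading of the text: a per-sample component is a sum over measurement items and is therefore divided by the square root of the number of items, while a per-item component is a sum over samples and is divided by the square root of the number of samples. The function name `scale_pc` and the numbers are hypothetical.

```python
import math

def scale_pc(pc_values, n_items):
    """Divide principal component values by sqrt(n_items), where n_items is
    the number of items summed over when that component was computed."""
    root = math.sqrt(n_items)
    return [v / root for v in pc_values]

# Hypothetical unscaled scores.
pc_s = [200.0, -100.0]     # per-sample scores, each computed from 10,000 genes
pc_g = [6.0, -3.0, 0.0]    # per-gene scores, each computed from 9 sample values

spc_s = scale_pc(pc_s, 10_000)   # scaled by the number of measurement items
spc_g = scale_pc(pc_g, 9)        # scaled by the number of samples
```

After scaling, the two kinds of scores fall in a comparable numeric range even though they were computed from very different numbers of values, which is what allows them to share one set of axes in a biplot such as FIG. 6.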

＜第２の実施の形態＞
〈遺伝子発現を用いた、皮膚の老化過程の指標の作成方法〉
次に、本発明の第２の実施の形態に係る遺伝子発現を用いた皮膚の老化過程の指標の作成方法について説明する。本発明の第２の実施の形態に係る遺伝子発現を用いた皮膚の老化過程の指標の作成方法では、上述の第１の実施の形態に係るトランスクリプトーム用主成分分析方法を用いて、皮膚の老化に関するトランスクリプトームを解析し、皮膚の老化過程の指標を作成する。 <Second Embodiment>
<Method of creating index of skin aging process using gene expression>
Next, a method for creating an index of the skin aging process using gene expression according to the second embodiment of the present invention will be described. This method uses the principal component analysis method for transcriptome according to the first embodiment described above to analyze the transcriptome related to skin aging and to create an index of the skin aging process.

老化は、他の多くの生理現象と同じく、老化は一つの遺伝子によっておきるのではなく、複数の遺伝子がかかわる現象であると考えられる。
老化にともなって皮膚組織の性質は変化する。この変化を検出し、また老化の程度を客観的に測定することは、老化を研究調査し、老化に対抗する措置を開発する上で重要である。 Aging, like many other physiological phenomena, is thought to be caused not by a single gene but by multiple genes acting together.
The nature of skin tissue changes with aging. Detecting this change and objectively measuring the extent of aging are important in researching aging and developing measures to combat aging.

本発明の第２の実施の形態に係る遺伝子発現を用いた皮膚の老化過程の指標の作成方法は、皮膚組織において発現を特異的に変化させる遺伝子のリストを提供する。
また、この遺伝子のリストに記載した遺伝子の発現量を計測した値に係数を乗じてから合算することで、皮膚の老化の指標を算出する方法を提供する。 The method for creating an index of the skin aging process using gene expression according to the second embodiment of the present invention provides a list of genes whose expression is specifically changed in skin tissue.
It also provides a method for calculating an index of skin aging by multiplying the measured expression level of each gene in this list by a coefficient and then summing the results.

指標の使用目的の一つは、たとえば物質や療法・施術のためのスクリーニングである。生物個体の皮膚、または培養細胞を用いて、様々な薬剤を投与し、老化の指標を変化させるものを選択することができる。 One of the purposes of use of the indicator is, for example, screening for substances, therapy, and treatment. Various agents can be administered using the skin of living organisms or cultured cells, and those that change the index of aging can be selected.

また指標は、生物個体の皮膚の老化の度合いを計測する際にも重要である。これはたとえば、スクリーニングされた物質が実際に効果を持ったかどうかを確認するときに使用される。
本発明の実施の形態に係る皮膚老化指標作成方法によれば、皮膚の老化の度合いを客観的に評価することができる。 The index is also important when measuring the degree of skin aging of living organisms. This is used, for example, to determine whether the screened substance actually has an effect.
According to the skin aging index creation method according to the embodiment of the present invention, the degree of skin aging can be objectively evaluated.

（指標の作成）
マイクロアレイを用いて遺伝子発現を網羅的に調べることで、どの遺伝子がどの程度に老化にかかわるのかを明らかにした。データ解析の際に、複数の生理条件にあるサンプルのデータを主成分分析することで、老化に特異的にはたらく遺伝子を同定し、表１と２に掲げる遺伝子のリストを作成した。 (Creating the index)
By comprehensively examining gene expression with microarrays, it was clarified which genes are involved in aging and to what extent. In the data analysis, principal component analysis of data from samples under several physiological conditions identified genes that act specifically in aging, and the gene lists given in Tables 1 and 2 were created.

データの標準化には、客観的なパラメトリック法である３パラメータ対数正規分布を利用する方法を用いた。これはデータの統計学的な分布を手がかりにして、その分布の母数を求め、正規分布へとデータを標準化する方法である。
具体的な標準化の実行方法は、国際公開第０２／００１４７７号公報、国際公開第２００８／０５６６９３号公報、特表２０１０−５１０５５７号公報、特開２００４−０１３５７３号公報、特開２００６−２３６０１１号公報、Ｋｏｎｉｓｈｉ，Ｔｏｍｏｋａｚｕ（２００４）， 'Ｔｈｒｅｅ−ｐａｒａｍｅｔｅｒｌｏｇｎｏｒｍａｌｄｉｓｔｒｉｂｕｔｉｏｎｕｂｉｑｕｉｔｏｕｓｌｙｆｏｕｎｄｉｎｃＤＮＡｍｉｃｒｏａｒｒａｙｄａｔａａｎｄｉｔｓａｐｐｌｉｃａｔｉｏｎｔｏｐａｒａｍｅｔｒｉｃｄａｔａｔｒｅａｔｍｅｎｔ'，ＢＭＣＢｉｏｉｎｆｏｒｍａｔｉｃｓ，５，５．、Ｋｏｎｉｓｈｉ，Ｔｏｍｏｋａｚｕ（２００８）， 'ＤａｔａＤｉｓｔｒｉｂｕｔｉｏｎｏｆＳｈｏｒｔＯｌｉｇｏｎｕｃｌｅｏｔｉｄｅＥｘｐｒｅｓｓｉｏｎＡｒｒａｙｓａｎｄＩｔｓＡｐｐｌｉｃａｔｉｏｎｔｏｔｈｅＣｏｎｓｔｒｕｃｔｉｏｎｏｆａＧｅｎｅｒａｌｉｚｅｄＩｎｔｅｌｌｅｃｔｕａｌＦｒａｍｅｗｏｒｋ'，ＳｔａｔＡｐｐｌＧｅｎｅｔＭｏｌＢｉｏｌ．，７（１），Ａｒｔｉｃｌｅ２５．等を参照して実現することができる。 For standardization of data, a method using a three-parameter lognormal distribution, which is an objective parametric method, was used. This is a method of standardizing data to a normal distribution by obtaining a statistical parameter distribution and obtaining a parameter of the distribution.
Specific methods of performing the standardization can be implemented with reference to, among others, International Publication No. WO 02/001477, International Publication No. WO 2008/056693, Japanese Translation of PCT International Publication No. 2010-510557, Japanese Unexamined Patent Publication No. 2004-013573, Japanese Unexamined Patent Publication No. 2006-236011, Konishi, Tomokazu (2004), 'Three-parameter lognormal distribution ubiquitously found in cDNA microarray data and its application to parametric data treatment', BMC Bioinformatics, 5, 5, and Konishi, Tomokazu (2008), 'Data Distribution of Short Oligonucleotide Expression Arrays and Its Application to the Construction of a Generalized Intellectual Framework', Stat Appl Genet Mol Biol., 7(1), Article 25.

またデータの解析には、第１の実施の形態に係るトランスクリプトーム用主成分分析方法を用いた。この際の解析装置の構成は、第１の実施の形態に係る解析装置１０と同様である。
主成分分析は、分析者であるユーザーが設定する自由パラメータがないので、元々、客観性が高い。また、第１の実施の形態に係るトランスクリプトーム用主成分分析方法は、マイクロアレイデータのように、独立性が高くないこともあるデータにおいても、客観性の高い分析データを得ることができる。
また、主成分分析は、老化や紫外線（ＵｌｔｒａＶｉｏｌｅｔ、ＵＶ）刺激といった異なる方向性のシグナルの影響を分離して見分けるために好適である。
以下で、皮膚の老化に関するマイクロアレイの実験データを用いて、トランスクリプトーム用主成分分析方法を実行した例についての詳細を説明する。 In the data analysis, the transcriptome principal component analysis method according to the first embodiment was used. The configuration of the analysis device at this time is the same as that of the analysis device 10 according to the first embodiment.
Principal component analysis is originally highly objective because there are no free parameters set by the analyst user. In addition, the transcriptome principal component analysis method according to the first embodiment can obtain highly objective analysis data even in data that may not be highly independent, such as microarray data.
The principal component analysis is suitable for separating and distinguishing the influence of signals having different directions such as aging and ultraviolet (UV) stimulation.
Hereinafter, details of an example in which the principal component analysis method for transcriptome is executed using microarray experimental data on skin aging will be described.

（トランスクリプトーム用主成分分析方法の軸発見とサンプルへの適用の説明）
まず、ｚ標準化されたマイクロアレイデータをサンプルｓと遺伝子ｇの行列で表す。この行列から、遺伝子毎に当該遺伝子の平均を減じる、いわゆるセンタリングを行い、再標準化する。これは、全データの遺伝子毎の平均により、それぞれの遺伝子の値を減ずることで、主成分分析の結果のゼロを原点に重ねる処理である。この再標準化したデータ行列Ｘ_sを、軸設定・発見処理の計算の対象に用いる。
また、各実験群ｐのサンプルの代表値を同様なデータの行列で表す。この代表値としては、例えば、その群内での遺伝子の平均値を用いることができる。このサンプルの代表値のデータの行列についても、センタリングを行って再標準化する。この再標準化したデータ行列Ｘ_pについても、軸設定・発見処理の計算の対象に用いる。
このＸ_sとＸ_pをベクトルとして表現すると、以下の数式の通りである： (Explanation of axis discovery of transcriptome principal component analysis method and application to sample)
First, the z-standardized microarray data are represented as a matrix of samples s and genes g. This matrix is centered, that is, the mean of each gene is subtracted from that gene's values, and thereby re-standardized. Subtracting the per-gene mean over all data from each gene's values places the zero of the principal component analysis result at the origin. This re-standardized data matrix X_s is used in the calculation of the axis setting and discovery process.
In addition, the representative values of the samples of each experimental group p are represented as a matrix of the same form. As the representative value, for example, the mean of each gene within the group can be used. This matrix of representative values is also centered and re-standardized. The re-standardized data matrix X_p is likewise used in the calculation of the axis setting and discovery process.
Expressed as vectors, X_s and X_p are as follows:

次に、標準化したデータ行列Ｘ_pを特異値分解すると、左特異ベクトルＵ_pと対角行列Ｌ^1/2および右特異ベクトルＶ_pが得られる。なお、Ｕ_p及びＶ_pは、それぞれ第１の実施の形態に係るＵ_t及びＶ_tとそれぞれ同様のベクトルを示す。
この際のＸ_p、Ｕ_p、Ｌ^1/2、Ｖ_pの関係は、以下の数式の通りである： Next, singular value decomposition of the standardized data matrix X_p yields the left singular vectors U_p, the diagonal matrix L^1/2, and the right singular vectors V_p. U_p and V_p denote the same kind of vectors as U_t and V_t in the first embodiment, respectively.
The relationship among X_p, U_p, L^1/2, and V_p is as follows:

ここで、Ｖ_p’はＶ_pの転置行列である。

Here, V_p′ is the transpose of V_p.
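The centering and singular value decomposition described above can be checked numerically. The sketch below is an illustration under stated assumptions: rows are taken to be genes and columns to be group representative values (the text does not fix the orientation), the data are random toy values, and the helper name `center_by_gene` is hypothetical. It uses NumPy's `numpy.linalg.svd`, which returns the singular values as a vector and the right singular vectors already transposed.

```python
import numpy as np

def center_by_gene(X):
    """Subtract each gene's mean (rows = genes, columns = samples or groups)."""
    return X - X.mean(axis=1, keepdims=True)

rng = np.random.default_rng(0)
X_p = center_by_gene(rng.normal(size=(6, 3)))  # 6 genes x 3 group means (toy data)

# Thin SVD: U_p is 6x3, sqrt_L holds the diagonal of L^(1/2), Vt_p is V_p'.
U_p, sqrt_L, Vt_p = np.linalg.svd(X_p, full_matrices=False)

# The three factors reproduce the centered matrix: X_p = U_p L^(1/2) V_p'.
reconstructed = U_p @ np.diag(sqrt_L) @ Vt_p
```

Because the columns of a centered matrix sum to zero across each row, the smallest singular value here is numerically zero; the number of informative axes is at most one less than the number of groups.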

サンプル毎の主成分ＰＣ_sは、以下の数式により算出する。 The principal component PC_s for each sample is calculated by the following formula.

また、遺伝子ごとの主成分ＰＣ_gは、以下の数式により算出する。 The principal component PC_g for each gene is calculated by the following formula.
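The formulas for PC_s and PC_g are not present in this text. The sketch below shows one reading that is consistent with the surrounding description (axes are found from X_p and then applied to the individual samples in X_s); it is an assumption, not the patent's stated formula. Rows are assumed to be genes, the data are random toy values, and the groups are assumed to consist of consecutive pairs of samples.

```python
import numpy as np

rng = np.random.default_rng(1)
X_s = rng.normal(size=(5, 6))                    # 5 genes x 6 samples (toy data)
X_s = X_s - X_s.mean(axis=1, keepdims=True)      # center each gene over all samples

# Training matrix: means of 3 groups of 2 consecutive samples, centered again.
X_p = X_s.reshape(5, 3, 2).mean(axis=2)
X_p = X_p - X_p.mean(axis=1, keepdims=True)

U_p, sqrt_L, Vt_p = np.linalg.svd(X_p, full_matrices=False)

# Assumed reading: apply the gene axes U_p to every individual sample.
PC_s = X_s.T @ U_p        # shape (6 samples, 3 components)

# Assumed reading: gene scores from the training matrix and the sample axes V_p.
PC_g = X_p @ Vt_p.T       # shape (5 genes, 3 components)
```

Under this reading, PC_g equals U_p with each column multiplied by the corresponding singular value, so the gene scores and the gene axes differ only by a per-component scale.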

（遺伝子リストの作成）
老化に関連する遺伝子を、例えば、以下のように同定し、表１と表２を作成した。 (Generation of gene list)
Genes associated with aging were identified as follows, for example, and Tables 1 and 2 were prepared.

マウス（ＢＬ６）の生後１週間の個体、その母親（生後２か月）、およびリタイアした老齢マウス（生後２年）の３群から皮膚組織を得、それぞれよりトータルＲＮＡを抽出し、アフィメトリクス社製ＧｅｎｅＣｈｉｐによるマイクロアレイ測定を行った。
測定値をパラメトリック法で標準化し、各遺伝子について群間で有意な発現の差があることを、ＡＮＯＶＡ法を用いてｐ−ｖａｌｕｅの閾値０．０１で確認した。
さらに、有意差が確認された遺伝子について、それぞれの群の平均値をつかってＸ_pを求め、上述のトランスクリプトーム用主成分分析方法により、主成分分析を行った。その結果、ＰＣ１とＰＣ２を得た。
ＰＣ１は、ＰＣｓにあたるサンプルに着目した、マイクロアレイデータの主成分である。
ＰＣ２は、ＰＣｇにあたる項目に着目した、マイクロアレイデータの主成分である。ここでは、ＰＣｇは、遺伝子を示す。 Skin tissue was obtained from three groups of BL6 mice: 1-week-old pups, their mothers (2 months old), and retired aged mice (2 years old). Total RNA was extracted from each, and microarray measurements were performed with Affymetrix GeneChip arrays.
The measured values were standardized by the parametric method, and for each gene a significant difference in expression between the groups was confirmed by ANOVA with a p-value threshold of 0.01.
Furthermore, for the genes in which a significant difference was confirmed, X_p was obtained from the mean of each group, and principal component analysis was performed by the transcriptome principal component analysis method described above. As a result, PC1 and PC2 were obtained.
PC1 is a principal component of the microarray data that focuses on the samples, corresponding to PCs.
PC2 is a principal component of the microarray data that focuses on the items, corresponding to PCg. Here, the items of PCg are genes.

図７を参照して、このＰＣ１とＰＣ２について説明する。図７においては、ＰＣ１の主成分スコアを横軸、ＰＣ２の主成分スコアを縦軸に示す。丸はそれぞれ一つの遺伝子に対応する。文字Ｃは生後１週間、Ｍは２か月の母親、Ｏは２年の老齢マウスのサンプルを現す。
ＰＣ１について、大きな絶対値をとる遺伝子には皮膚に特異的に発現するものが多く見られた。またＰＣ２では、乳腺および抗体産生に大きく関与するものがみられた。そこでＰＣ２は授乳期の母親の特性が、ＰＣ１は皮膚の老化が顕れていると判断した。実際、それぞれのマウス個体の年齢とＰＣ１上の位置は対応していた。
それぞれのサンプルのＰＣ値は、それぞれ６倍の値を用いてプロットしてある。 PC1 and PC2 are described with reference to FIG. 7. In FIG. 7, the horizontal axis shows the PC1 principal component score and the vertical axis the PC2 principal component score. Each circle corresponds to one gene. The letter C denotes samples from the 1-week-old mice, M from the 2-month-old mothers, and O from the 2-year-old aged mice.
For PC1, many of the genes with large absolute values are expressed specifically in the skin. For PC2, genes strongly involved in the mammary gland and in antibody production were found. It was therefore judged that PC2 reflects the characteristics of lactating mothers and PC1 reflects skin aging. Indeed, the age of each mouse corresponded to its position on PC1.
The PC values of the samples are plotted at 6 times their values.

老化の指標となる遺伝子群のリストは、有意な発現の違いをもち（Ｐ＜０．０１）、ＰＣ１の主成分スコアの絶対値が大きく（０．３以上）、且つ主成分スコアのＰＣ２の絶対値が小さい（０．３以下）を基準にして選ばれた。 The list of genes serving as indicators of aging was selected on the criteria that the gene shows a significant difference in expression (P < 0.01), the absolute value of its PC1 principal component score is large (0.3 or more), and the absolute value of its PC2 principal component score is small (0.3 or less).
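The three selection criteria can be written as a simple filter. This is a minimal sketch: the function name `select_aging_genes`, the record layout, and the gene ids and values are hypothetical, while the thresholds (P < 0.01, |PC1| of 0.3 or more, |PC2| of 0.3 or less) are the ones stated in the text.

```python
def select_aging_genes(genes, p_max=0.01, pc1_min=0.3, pc2_max=0.3):
    """Keep genes with p < p_max, |PC1| >= pc1_min and |PC2| <= pc2_max."""
    return [
        g["id"] for g in genes
        if g["p"] < p_max and abs(g["pc1"]) >= pc1_min and abs(g["pc2"]) <= pc2_max
    ]

# Hypothetical records; the ids and values are illustrative only.
genes = [
    {"id": "gene_a", "p": 0.001, "pc1": 0.8, "pc2": 0.1},   # passes all criteria
    {"id": "gene_b", "p": 0.001, "pc1": -0.5, "pc2": 0.2},  # passes (negative PC1)
    {"id": "gene_c", "p": 0.001, "pc1": 0.9, "pc2": 0.6},   # |PC2| too large
    {"id": "gene_d", "p": 0.2, "pc1": 0.9, "pc2": 0.0},     # not significant
    {"id": "gene_e", "p": 0.001, "pc1": 0.1, "pc2": 0.0},   # |PC1| too small
]
selected = select_aging_genes(genes)
```

Requiring a small |PC2| excludes genes that respond strongly along the second axis, which the text associates with lactation rather than aging.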

以下の表１に、ＰＣ１及びＰＣ２を用いて選択された、老化によって遺伝子発現を増大させる遺伝子群を示す。表１では、選択に用いた主成分ＰＣ_g１およびＰＣ_g２を併せて示す。
この遺伝子を特定する手段として、アフィメトリクス社のＩＤ番号、通常使われている遺伝子の略称、および公的なデータベースの登録番号としてＵｎｉＧｅｎｅＩＤ番号を示す。これらの遺伝子の配列は公知であり、それぞれの番号から容易に検索することが可能である。 Table 1 below shows the group of genes, selected using PC1 and PC2, whose expression increases with aging. Table 1 also shows the principal components PC_g1 and PC_g2 used for the selection.
As a means for specifying this gene, an Affymetrix ID number, an abbreviation of a commonly used gene, and a UniGeneID number as a public database registration number are shown. The sequences of these genes are known and can be easily searched from their respective numbers.

また、老化によって遺伝子発現を減少させる遺伝子群を下記の表２に示す。 In addition, Table 2 below shows the group of genes whose expression decreases with aging.

すなわち、アフィメトリクス社の遺伝子ＩＤ番号が以下のチップコンテンツで測定される遺伝子とそのオーソログのｍＲＮＡ等の発現量を皮膚の老化の指標に用いることができる：
１４３９２００＿ｘ＿ａｔ、１４３９６２５＿ａｔ、１４５３５１１＿ａｔ、１４２９８３５＿ａｔ、１４５７９６７＿ａｔ、１４５０４５５＿ｓ＿ａｔ、１４１６２３９＿ａｔ、１４４９４７５＿ａｔ、１４４１９９１＿ａｔ、１４２１００１＿ａ＿ａｔ、１４２２８２５＿ａｔ、１４５１３８２＿ａｔ、１４５３００９＿ａｔ、１４１６７７６＿ａｔ、１４３５７９２＿ａｔ、１４１８９８９＿ａｔ、１４３７４３１＿ａｔ、１４３１１７１＿ａｔ、１４５０４７５＿ａｔ、１４４８４７０＿ａｔ、１４５１４２４＿ａｔ、１４２３２７１＿ａｔ、１４４８３９７＿ａｔ、１４４２０８９＿ａｔ、１４４８３０３＿ａｔ、１４２０５３８＿ａｔ、１４４８９３２＿ａｔ、１４３０１３２＿ａｔ、１４２１５８９＿ａｔ、１４２７１７９＿ａｔ、１４２０４０９＿ａｔ、１４３６５５７＿ａｔ、１４２７３７８＿ａｔ、１４６０１８５＿ａｔ、１４３１１６５＿ａｔ、１４５０５３６＿ｓ＿ａｔ、１４２６２０３＿ａｔ、１４２１６９１＿ａｔ、１４２９９５７＿ａｔ、１４２７３６６＿ａｔ、１４３１６５０＿ａｔ、１４５０５４０＿ｘ＿ａｔ、１４２２２０９＿ｓ＿ａｔ、１４３６０５５＿ａｔ、１４５０７７４＿ａｔ、１４３８２３９＿ａｔ、１４３０６３５＿ａｔ、１４４９５５９＿ａｔ、１４３５１８４＿ａｔ、１４１９３２３＿ａｔ、１４１９７６７＿ａｔ、１４２２７６０＿ａｔ、１４４９１７０＿ａｔ、１４２０４６７＿ａｔ、１４２２２４０＿ｓ＿ａｔ、１４４８０２１＿ａｔ、１４２７８６６＿ｘ＿ａｔ、１４３３９２４＿ａｔ、１４６００４９＿ｓ＿ａｔ、１４１５９２７＿ａｔ、１４１５８３２＿ａｔ、１４３６１１９＿ａｔ、１４３４４４９＿ａｔ、１４１９０２８＿ａｔ、１４４８４２１＿ｓ＿ａｔ、１４２４２６６＿ｓ＿ａｔ、１４５０８７１＿ａ＿ａｔ、１４３１８５６＿ａ＿ａｔ、１４２４５２８＿ａｔ、１４１８７９６＿ａｔ、１４２７１６８＿ａ＿ａｔ、１４２７８８４＿ａｔ、１４２２４３７＿ａｔ、１４２６２５１＿ａｔ、１４５２９６８＿ａｔ、１４５０８３９＿ａｔ、１４４１９２８＿ｘ＿ａｔ、１４２０８５４＿ａｔ、１４３４２０２＿ａ＿ａｔ、１４１６８０３＿ａｔ、１４３８９６６＿ｘ＿ａｔ、１４２９４０３＿ｘ＿ａｔ、１４３６１１５＿ａｔ、１４１７８３６＿ａｔ、１４４８１９４＿ａ＿ａｔ、１４１７７１４＿ｘ＿ａｔ、１４２２６１０＿ｓ＿ａｔ、１４３７６６５＿ａｔ、１４５１０４７＿ａｔ、１４１６６４０＿ａｔ、１４１８５３８＿ａｔ、１４１８０６３＿ａｔ、１４３５８５１＿ａｔ、１４４８２２８＿ａｔ、１４１７２７５＿ａｔ、１４５４６５１＿ｘ＿ａｔ、１４２６７５８＿ｓ＿ａｔ、１４１７３５９＿ａｔ、１４２４０１０＿ａｔ、１４２３２５３＿ａｔ、１４１９４８７＿ａｔ、１４３５３８２＿ａｔ、１４５００７９＿ａｔ、１４１７１４９＿ａｔ、１４２８８９６＿ａｔ、１４１７３５５＿ａｔ、１４５６３１５＿ａ＿ａｔ、１４２４５５６＿ａｔ、１４２７５８０＿ａ＿ａｔ、１４４８２０１＿ａｔ、１４２０８８４＿ａｔ、１４３６８５３＿ａ＿ａｔ、１４４９２０６＿ａｔ、１４３５５８５＿ａｔ、１４２２９７３＿ａ＿ａｔ、１４１６７１３＿ａｔ、１４５１８０１＿ａｔ、１４５４６０８＿ｘ＿ａｔ、１４１９０６３＿ａｔとそのオーソログ。 That is, the expression level of genes such as Affymetrix's gene ID number measured in the following chip content and its ortholog mRNA can be used as an indicator of skin aging:
1439200_x_at, 1439625_at, 1453511_at, 1429835_at, 1457967_at, 1450455_s_at, 1416239_at, 1449475_at, 1441991_at, 1421001_a_at, 1422825_at, 1451382_at, 1453009_at, 1416776_at, 1435792_at, 1418989_at, 1437431_at, 1431171_at, 1450475_at, 1448470_at, 1451424_at, 1423271_at, 1448397_at, 1442089_at, 1448303_at, 1420538_at, 1448932_at, 1430132_at, 1421589_at, 1427179_at, 1420409_at, 1436557_at, 1427378_at, 1460185_at, 1431165_at, 1450536_s_at, 1426203_at, 1421691_at, 1429957_at, 1427366_at, 1431650_at, 1450540_x_at, 1422209_s_at, 1436055_at, 1450774_at, 1438239_at, 1430635_at, 1449559_at, 1435184_at, 1419323_at, 1419767_at, 1422760_at, 1449170_at, 1420467_at, 1422240_s_at, 1448021_at, 1427866_x_at, 1433924_at, 1460049_s_at, 1415927_at, 1415832_at, 1436119_at, 1434449_at, 1419028_at, 1448421_s_at, 1424266_s_at, 1450871_a_at, 1431856_a_at, 1424528_at, 1418796_at, 1427168_a_at, 1427884_at, 1422437_at, 1426251_at, 1452968_at, 1450839_at, 1441928_x_at, 1420854_at, 1434202_a_at, 1416803_at, 1438966_x_at, 1429403_x_at, 1436115_at, 1417836_at, 1448194_a_at, 1417714_x_at, 1422610_s_at, 1437665_at, 1451047_at, 1416640_at, 1418538_at, 1418063_at, 1435851_at, 1448228_at, 1417275_at, 1454651_x_at, 1426758_s_at, 1417359_at, 1424010_at, 1423253_at, 1419487_at, 1435382_at, 1450079_at, 1417149_at, 1428896_at, 1417355_at, 1456315_a_at, 1424556_at, 1427580_a_at, 1448201_at, 1420884_at, 1436853_a_at, 1449206_at, 1435585_at, 1422973_a_at, 1416713_at, 1451801_at, 1454608_x_at, 1419063_at, and their orthologs.

また、ＵｎｉＧｅｎｅＩＤ番号が以下の遺伝子とそのオーソログのｍＲＮＡ等の発現量についても、皮膚の老化の指標に用いることができる：
Ｍｍ．４６４８８６、Ｍｍ．４５４５２６、Ｍｍ．１５８７６６、Ｍｍ．３３３６６１、Ｍｍ．８６３３１、Ｍｍ．２７４４７、Ｍｍ．３２１７、Ｍｍ．２７３２７１、Ｍｍ．４２５４９１、Ｍｍ．２３２５２３、Ｍｍ．７５４９８、Ｍｍ．３５０８３、Ｍｍ．３３９３３２、Ｍｍ．９１１４、Ｍｍ．３６２６４４、Ｍｍ．２３０２４９、Ｍｍ．３２０３１７、Ｍｍ．１７１３５７、Ｍｍ．５１９４、Ｍｍ．４２３０７８、Ｍｍ．９９９８９、Ｍｍ．３９０６８３、Ｍｍ．２５６５２、Ｍｍ．３４０７９１、Ｍｍ．３０２６０２、Ｍｍ．４９９０２、Ｍｍ．４２２７９９、Ｍｍ．１８０２５６、Ｍｍ．４３９６７３、Ｍｍ．４３９７３８、Ｍｍ．３７９５２、Ｍｍ．２９１４９８、Ｍｍ．１０６８６８、、Ｍｍ．４４１６７２、Ｍｍ．３４３７２、Ｍｍ．１９６６８９、Ｍｍ．４６１０９、Ｍｍ．３０９６７、Ｍｍ．１５８２８１、Ｍｍ．４１６８４４、Ｍｍ．３８９９９３、Ｍｍ．４２２８００、Ｍｍ．２９０６７７、Ｍｍ．２４６６９７、Ｍｍ．３４４４１、Ｍｍ．１３８４３７、Ｍｍ．１７６３、Ｍｍ．２５２５９、Ｍｍ．２０８５４、Ｍｍ．２０８５１、Ｍｍ．２５０３５８、Ｍｍ．８５２５３、Ｍｍ．３４２０１、Ｍｍ．１０６９３、Ｍｍ．４４０１６７、Ｍｍ．４６７４９５、Ｍｍ．３９２１７６、Ｍｍ．５０１０９、Ｍｍ．６８６、Ｍｍ．２６７９、Ｍｍ．２６３１３８、Ｍｍ．２５０７８６、Ｍｍ．２９７４４４、Ｍｍ．３８３２１６、Ｍｍ．２９１１０、Ｍｍ．４６０６、Ｍｍ．３４７７６、Ｍｍ．４５１２７、Ｍｍ．２０４２８、Ｍｍ．２９７８５９、Ｍｍ．２４９５５５、Ｍｍ．１０２９９、Ｍｍ．１０８５５７、Ｍｍ．４１５５６、Ｍｍ．４０７４１５、Ｍｍ．２７１９７３、Ｍｍ．２７５３２０、Ｍｍ．２５６０５８、Ｍｍ．２４７２０、Ｍｍ．２８７１４６、Ｍｍ．１９１２８１、Ｍｍ．８１９１６、Ｍｍ．２０１６４、Ｍｍ．１４８０２、Ｍｍ．１９６１１０、Ｍｍ．２８１０１８、Ｍｍ．３３１９７９、Ｍｍ．１９３、Ｍｍ．５８５０７、Ｍｍ．２９８１９９、Ｍｍ．６２２８、Ｍｍ．２９８２５１、Ｍｍ．１７２、Ｍｍ．３９０４０、Ｍｍ．２５２０６３、Ｍｍ．２８９６４５、Ｍｍ．７３８６、Ｍｍ．２７２２７８、Ｍｍ．９９８６、Ｍｍ．３７９０６７、Ｍｍ．４００２５３、Ｍｍ．２２３６７、Ｍｍ．３７０５、Ｍｍ．２８４２４６、Ｍｍ．３８９８００、Ｍｍ．２４１２０５、Ｍｍ．１２７７３１、Ｍｍ．２９３２６３、Ｍｍ．１９１５５、Ｍｍ．２９１３２、Ｍｍ．１７４８４、Ｍｍ．３１６８８５、Ｍｍ．１８１２５、Ｍｍ．２８５８５、Ｍｍ．２９３５８、Ｍｍ．３３８５０８、Ｍｍ．２１０８、Ｍｍ．３０６０２１とそのオーソログ。 Moreover, the expression level of the gene having the UniGene ID number and the ortholog mRNA thereof can also be used as an index of skin aging:
Mm.464886, Mm.454526, Mm.158766, Mm.333661, Mm.86331, Mm.27447, Mm.3217, Mm.273271, Mm.425491, Mm.232523, Mm.75498, Mm.35083, Mm.339332, Mm.9114, Mm.362644, Mm.230249, Mm.320317, Mm.171357, Mm.5194, Mm.423078, Mm.99989, Mm.390683, Mm.25652, Mm.340791, Mm.302602, Mm.49902, Mm.422799, Mm.180256, Mm.439673, Mm.439738, Mm.37952, Mm.291498, Mm.106868, Mm.441672, Mm.34372, Mm.196689, Mm.46109, Mm.30967, Mm.158281, Mm.416844, Mm.389993, Mm.422800, Mm.290677, Mm.246697, Mm.34441, Mm.138437, Mm.1763, Mm.25259, Mm.20854, Mm.20851, Mm.250358, Mm.85253, Mm.34201, Mm.10693, Mm.440167, Mm.467495, Mm.392176, Mm.50109, Mm.686, Mm.2679, Mm.263138, Mm.250786, Mm.297444, Mm.383216, Mm.29110, Mm.4606, Mm.34776, Mm.45127, Mm.20428, Mm.297859, Mm.249555, Mm.10299, Mm.108557, Mm.41556, Mm.407415, Mm.271973, Mm.275320, Mm.256058, Mm.24720, Mm.287146, Mm.191281, Mm.81916, Mm.20164, Mm.14802, Mm.196110, Mm.281018, Mm.331979, Mm.193, Mm.58507, Mm.298199, Mm.6228, Mm.298251, Mm.172, Mm.39040, Mm.252063, Mm.289645, Mm.7386, Mm.272278, Mm.9986, Mm.379067, Mm.400253, Mm.22367, Mm.3705, Mm.284246, Mm.389800, Mm.241205, Mm.127731, Mm.293263, Mm.19155, Mm.29132, Mm.17484, Mm.316885, Mm.18125, Mm.28585, Mm.29358, Mm.338508, Mm.2108, Mm.306021, and their orthologs.

（指標の算出）
次に、リストにある遺伝子のひとつ、望ましくは複数の遺伝子について、被測定サンプルでの発現量を測定して、それらの遺伝子の発現量を、あらかじめ定めた基準値と比較し、発現量の変化を調べる。 (Calculation of indicators)
Next, for one gene in the list, or preferably several, the expression level in the sample under test is measured and compared with a predetermined reference value to determine the change in expression level.

この発現量の変化に、あらかじめ定めておいた係数を乗ずる。係数は、たとえば老化で発現が減少することがわかっている遺伝子では負値、発現が増大する遺伝子では正値になるように定める。 This change in expression level is multiplied by a predetermined coefficient. The coefficient is set, for example, to a negative value for genes whose expression is known to decrease with aging and to a positive value for genes whose expression increases.

上記で遺伝子ごとに得た値を合算して老化の指標とする。
ここで、サンプルｓの指標ＡＩ_sは、ｎ個の遺伝子ｇの測定値ｘ_s,gより、下記の数式を用いて求める。 The values obtained for each gene above are added together as an index of aging.
Here, the index AI_s of sample s is obtained from the measured values x_s,g of the n genes g using the following formula.

ここで、ｂ_gはその遺伝子の基準値、ｋ_gは遺伝子ごとの係数である。 Here, b_g is the reference value of the gene, and k_g is the coefficient for each gene.
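The formula itself is not present in this text. From the steps described (take the change of each gene relative to its reference value, multiply by the coefficient, and sum over the genes), it can be inferred to be AI_s = Σ_g k_g (x_s,g − b_g). The sketch below implements that inferred form; the function name `aging_index`, the gene names, and the values are hypothetical.

```python
def aging_index(x, b, k):
    """AI_s = sum over genes g of k_g * (x_{s,g} - b_g)."""
    return sum(k[g] * (x[g] - b[g]) for g in k)

# Hypothetical genes and values (z-scores).
k = {"gene_up": 1.0, "gene_down": -1.0}   # coefficients k_g
b = {"gene_up": 0.5, "gene_down": 2.0}    # reference values b_g
x = {"gene_up": 1.5, "gene_down": 0.5}    # measurements x_{s,g} for sample s

ai = aging_index(x, b, k)
```

In this toy case the gene that rises with aging is above its reference and the gene that falls with aging is below its reference, so both terms contribute positively to the index.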

指標に用いられる遺伝子は、表１および表２にある遺伝子のうちのいずれか、又はいくつかの組み合わせを用いて、得ることができる。 The gene used for the indicator can be obtained by using any one or some combination of the genes shown in Table 1 and Table 2.

測定値ｘ_s,gは、ｚ標準化したマイクロアレイデータの場合は、そのｚスコアであるか、又はセンタリングして再標準化したｚスコアを用いることができる。 In the case of z-standardized microarray data, the measured value x_s,g can be the z-score itself or the z-score after centering and re-standardization.

いわゆる「発現量」として、例えば、ｍＲＮＡやタンパク質の細胞内濃度や活性のように、対数変換していない数値が測定値として得られる場合には、ｘ_s,gはそれらの値の対数値を用いる。
この際、対数の底は統一する必要があるが、どの値でもかまわない。 When the so-called "expression level" is obtained as a value that has not been log-transformed, such as the intracellular concentration or activity of an mRNA or protein, the logarithm of that value is used as x_s,g.
At this time, the base of the logarithm must be unified, but any value is acceptable.
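For raw (non-logarithmic) measurements, the transformation can be sketched as follows. The function name `to_log`, the choice of base 2, and the values are hypothetical; the only requirement taken from the text is that the same base is used for the sample values and the reference values.

```python
import math

def to_log(values, base=2.0):
    """Log-transform raw (non-logarithmic) expression values with one base."""
    return [math.log(v, base) for v in values]

raw_sample = [8.0, 2.0]       # hypothetical raw measurements for two genes
raw_reference = [4.0, 4.0]    # hypothetical raw reference values

x = to_log(raw_sample)
b = to_log(raw_reference)
changes = [xi - bi for xi, bi in zip(x, b)]   # log ratios relative to the reference
```

Changing the base multiplies every log ratio by the same constant, so it rescales the index uniformly without changing the ordering of samples.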

基準値ｂ_gは、それぞれの遺伝子について、たとえば１週令のマウスにおける平均値として定義することができる。 The reference value b_g can be defined for each gene as, for example, the mean value in 1-week-old mice.

係数ｋ_gは、遺伝子に関しての主成分、または特異値分解によって得る２種類のユニタリ行列のうち左特異ベクトルＵ_pを用いればよい。
もちろん、ベクトルの方向を分析者が指定できないため、主成分分析の結果は符号が逆になりうる。その際は、符号を逆転させて、老化が進行する方向を正にすればよい。また、指標として値を扱いやすくするために、共通の任意な定数を乗じてもよい。 The coefficient k_g may be the principal component for the gene, or the left singular vector U_p, one of the two unitary matrices obtained by singular value decomposition.
Of course, since the analyst cannot specify the direction of the vector, the sign of the result of the principal component analysis can be reversed. In that case, it is only necessary to reverse the sign to make the direction in which aging progresses positive. Moreover, in order to make it easy to handle a value as an index, a common arbitrary constant may be multiplied.

係数ｋ_gは、最も簡単には、たとえば主成分が正ならプラス１，負ならマイナス１とすることができる。 Most simply, the coefficient k_g can be set to, for example, +1 if the principal component is positive and -1 if it is negative.

さらに、遺伝子には発現を変動させやすいものとそうでないものがある。この遺伝子発現の変動を標準化するためには、主成分で１を除した値を係数ｋ_gにすることで対応可能である。 In addition, the expression of some genes fluctuates readily while that of others does not. To standardize this variation in gene expression, the coefficient k_g can be set to the reciprocal of the principal component (1 divided by the principal component).
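The coefficient choices described above (the sign of the principal component, or its reciprocal) and the sign correction can be sketched as follows. The function names, the `older_scores_higher` flag, and the values are hypothetical; the sketch assumes that every selected gene has a non-zero principal component, as the |PC1| threshold of 0.3 would imply.

```python
def orient(pc_g, older_scores_higher):
    """Flip every sign when older samples score lower, so aging is positive."""
    return pc_g if older_scores_higher else {g: -v for g, v in pc_g.items()}

def sign_coefficients(pc_g):
    """Simplest choice: k_g = +1 for a positive principal component, -1 otherwise."""
    return {g: (1.0 if v > 0 else -1.0) for g, v in pc_g.items()}

def reciprocal_coefficients(pc_g):
    """k_g = 1 / PC_g, which damps genes whose expression varies widely."""
    return {g: 1.0 / v for g, v in pc_g.items()}

# Hypothetical per-gene principal components on the aging axis.
pc_g = {"gene_up": 0.5, "gene_down": -0.25}

k_sign = sign_coefficients(pc_g)
k_recip = reciprocal_coefficients(pc_g)
flipped = orient(pc_g, older_scores_higher=False)
```

With the reciprocal choice, a gene with a large principal component receives a small coefficient, so each gene's typical change contributes a comparable amount to the summed index.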

以上のように構成することで、以下のような効果を得ることができる。
従来、老化に関する実験の遺伝子データから、老化の指標となる遺伝子候補のリストを得るのは難しかった。これは、網羅的な遺伝子発現データは測定誤差を含み、また遺伝子発現は老化以外の条件でも変化するためである。すなわち、網羅的な遺伝子発現データから、どの遺伝子に着目すればいいかを見出すのは難しい課題であった。 With the configuration described above, the following effects can be obtained.
Conventionally, it has been difficult to obtain a list of gene candidates that serve as indicators of aging from genetic data of experiments concerning aging. This is because exhaustive gene expression data includes measurement errors, and gene expression changes even under conditions other than aging. That is, it has been a difficult task to find out which genes should be focused on from comprehensive gene expression data.

この従来の問題点の具体例として、どの遺伝子も、発現量にはある程度の揺らぎがある。また測定値には誤差が含まれる。さらに、どの遺伝子も、老化とは無関係な刺激でその発現を変化させることがあり得る。
そこで、単一の遺伝子の発現測定の結果は、かならずしも老化を正しく反映しない。たとえば、特開平１０−１２３１３０号公報ではエラスターゼの活性だけを測定しているが、この活性のゆらぎはそのままデータに反映される。 As a specific example of this conventional problem, there is some fluctuation in the expression level of any gene. The measurement value includes an error. In addition, any gene can change its expression with a stimulus independent of aging.
Thus, the results of single gene expression measurements do not necessarily reflect aging correctly. For example, in JP-A-10-123130, only the activity of elastase is measured, but the fluctuation of this activity is reflected in the data as it is.

また、着目している遺伝子が、皮膚の老化を調べるという目的のために最も適切かどうかは、網羅的に遺伝子を調べないことには判明しない。 Moreover, whether the gene of interest is the most appropriate one for examining skin aging cannot be determined without examining genes exhaustively.

From this viewpoint of comprehensiveness, determining a gene group by a differential screening method, as in JP-T-2002-535997 for example, is incomplete.
This is because only part of the gene group can be observed, depending on the primers used or the screening conditions.
In addition, methods of this kind are generally not quantitative, so it is difficult to select appropriate genes from among gene expression that changes due to many factors other than aging.

However, comprehensive gene measurement often involves difficulties in the mathematical processing of the data. Specifically, the data cannot be processed objectively, and it is easy to make the error of interpreting measurement noise as signal.

In particular, comprehensive analysis tools such as microarrays produce data as imperfect relative values, so data standardization has a large influence on the analysis results.

Genes also differ in how much their expression can vary under conditions other than aging and in how much their expression varies with aging.
For example, if genes are selected merely because their expression varies greatly, as in JP-A-2007-259851, these characteristics cannot be reflected. This is a cause of error.

Various kinds of clustering, for example, are available as methods that visually assist an analyst in understanding expression changes. However, these methods are carried out after defining a similarity of change, and that definition lacks objectivity.
Nevertheless, these methods are sometimes used to select genes, as seen in JP-A-2008-178390, JP-T-2005-524382, and the like. Because of this limitation in principle, selecting gene groups by clustering often causes serious errors.

It is also important how to integrate the expression of multiple genes objectively and express it as one index or a limited small number of indices. That is, conventionally, the changes could not be evaluated unless the expression changes of the individual genes were combined into one to several indices. In addition, the index was required to be objective.
This is because information on the expression of a large number of genes is difficult to understand on its own.

In contrast, the list according to the second embodiment of the present invention provides an index that uses gene expression in order to evaluate aging objectively.
To this end, in the second embodiment of the present invention, microarray measurements were performed on a plurality of experimental groups to examine gene expression, and the results were scrutinized by principal component analysis to obtain principal components. A list of genes involved in aging was obtained using, as the criterion, that a gene is related to this principal component and not to other factors. The expression of the genes in this list is measured in a subject, and the values are combined by summation to give an index of aging. The coefficients obtained from the principal component analysis are used in the summation.
As a result, the method is robust to fluctuations in expression level, the data can be processed objectively, the accuracy is higher than that of conventional clustering, and gene expression can be obtained as a small number of indices. The list according to the second embodiment of the present invention can therefore provide an index that uses gene expression related to aging.

(Application to organisms other than mice)
Skin aging is considered to occur in other organisms, particularly in higher animals including other mammals, in the same way as in mice.
This is because aging is a phenomenon governed by the genome and proceeds through very similar processes, such as loss of elasticity, loss of gloss, and hair loss.

Many genes are common among these higher animals. In other words, genes of the same origin are at work.
Many genes are also present in common across many species, and each plays a common role.
Such a homologous gene in a different species is called an ortholog of the gene.
Naturally, the ortholog of a gene discovered in mice performs the same function in humans, for example. A gene expressed at the aging stage in mice is expected to be expressed at the aging stage in humans as well.

That is, it is clear that the orthologs of the mouse genes in the above list function in common in other species, including humans, as they do in mice.

Such orthologs in organisms other than mice can be easily identified by the methods described below.
First, orthologs can be found from information provided by Affymetrix. For example, a file named Mouse430_2.na30.ortholog.csv is published and provided over the Internet.
This file was created by searching the company's other chips for orthologs of the genes on the Mouse430_2 chip used in this experiment. For a given Probe Set ID, it indicates which gene on which chip is the ortholog by means of that chip's Probe Set ID.
Once the chip and Probe Set ID are specified, the UniGene ID of the gene can be looked up in the annotation file prepared by the company. For the Mouse430_2 chip, for example, a file named Mouse430_2.na30.annot.csv is published.
By specifying this ID, the base sequence of the gene can be obtained through a public database such as NCBI.

FIG. 8 shows an example in which orthologs were searched for several genes. In FIG. 8, the Probe Set IDs on the human chip were found from the Probe Set IDs on the Mouse430_2 chip, and the UniGene ID was obtained for each.
This series of operations can be easily performed by those skilled in the art. The species to be searched is not limited to humans; all species offered by Affymetrix can be targeted.

Orthologs can also be found in databases by using the sequence similarity of the gene.
The base sequences of the genes on the Mouse430_2 chip described in the above example are publicly available. Using these base sequences, or the amino acid sequences translated from them, public databases can be searched with a local alignment algorithm such as BLAST to find orthologs. In doing so, an ortholog can be identified using a condition such as having the highest score or the lowest E-value within the species of interest.
In this way, orthologs can be found even in species that Affymetrix does not cover. This series of operations can be easily performed by those skilled in the art.
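As a rough sketch of the selection rule above, the following picks, for one query gene, the hit with the lowest E-value within the species of interest and breaks ties by the higher score. The `Hit` record and its field names are hypothetical; they do not correspond to any particular BLAST output format, and the hits are assumed to have been parsed already.

```python
from typing import NamedTuple, Optional, Sequence


class Hit(NamedTuple):
    """One alignment hit for a query gene (hypothetical record layout)."""
    subject_id: str
    species: str
    bit_score: float
    e_value: float


def best_ortholog_candidate(hits: Sequence[Hit], species: str) -> Optional[Hit]:
    """Return the hit with the lowest E-value (then highest score) in one species."""
    candidates = [h for h in hits if h.species == species]
    if not candidates:
        return None
    return min(candidates, key=lambda h: (h.e_value, -h.bit_score))
```

The result is only a candidate: the rule implements the "highest score or lowest E-value" condition and nothing more.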

Orthologs can also be identified by cloning based on sequence similarity.
In addition, a gene can be cloned from a DNA library of the species of interest using a probe derived from the mouse gene.
Similarly, primers can be designed based on the sequence of the mouse gene, and the gene can be amplified by RT-PCR or the like and cloned.
Cloning can also be performed with an expression library by using antibodies.
This series of operations can be easily performed by those skilled in the art.

A microarray, which allows comprehensive measurement, was used to create the list according to the second embodiment of the present invention. Of course, microarrays can also be used for screening and for measuring the degree of aging.
However, the principal component analysis method for transcriptomes of the present embodiment can also perform principal component analysis on matrix data other than microarray data.
For example, the list can be created even when the expression levels are measured by a simpler method than microarrays, because comprehensiveness is not required.

As a method other than microarrays for measuring expression levels, measuring the amount of mRNA, the transcript, by techniques such as RT-PCR or real-time PCR is the first to be considered.
In this case, the measurement can be standardized using a control transcript such as that of a housekeeping gene, and how far the transcript deviates from the reference value can be measured.

(Definition of expression level)
In the first or second embodiment of the present invention, the "expression level" of a gene refers to the amount of transcript from the gene, the amount of its translation product, the activity of the translation product, the amount of substance produced by that activity, and the like.

That is, in the first and second embodiments of the present invention, "expression level" is defined as a broader concept and does not merely indicate an increase or decrease in the amount of mRNA.
For example, an increase or decrease in the amount of mRNA is considered to correspond to an increase or decrease in the amount of the protein it encodes. Accordingly, detecting the protein with a specific antibody allows even simpler measurement, and the result can be obtained as matrix data of "expression levels". With protein detection, an index can be obtained by multiplying the logarithm of the rate of increase or decrease of each protein by a coefficient and summing the results.
In addition to mRNA, the "expression level" of RNAs involved in intracellular regulation, such as snRNA, can be measured and used as matrix data.

The activity of a protein can also be used in the matrix data as the "expression level" in place of the amount of protein.

Data from a screening system using cultured cells can also be used in the matrix data as the "expression level".
To construct such a screening system, a gene in which a reporter gene is joined to the regulatory region of the gene of interest, that is, a promoter sequence, a cis-acting sequence, or the like, is prepared, giving an indicator gene (construct) whose activity is easy to measure. As the reporter gene, a reporter gene with enzymatic activity such as CAT (chloramphenicol acetyltransferase) or a gene that produces luminescence or the like, such as luciferase, can be used.
Introducing the selected gene into cultured cells enables easy screening while the activity of the reporter gene is measured.

(Application of the principal component calculation method according to the embodiments of the present invention to other fields)
The principal component calculation method according to the first or second embodiment of the present invention can be applied, as an extended principal component analysis method, not only to transcriptome analysis but also to data such as health examination data, in which there are many measurement items and the items differ to some extent between hospitals, for example.
For example, to examine whether some disease can be detected by any item of a health examination, a disease group and a control group are set, and a representative value of each group is obtained, for instance by taking the mean. At this time, the measured values, qualitative data included, are expressed as numerical values that are as linear as possible. The data are then centered item by item so that the mean of each item becomes zero. If an item has not been measured at some hospitals, the missing values are replaced with zero. From the two-group, multi-item matrix obtained in this way, the unitary matrices representing the axes, PCg1 (the principal component of the items), and PCs1 can be obtained. Measurement items with a large absolute value in PCg1 are items that represent the disease well. In addition, PCs1 or sPCs1 of each individual can be obtained from the resulting unitary matrix V_p.
As a result, by examining the distribution of PCs1 or sPCs1 of the individuals in a random sample from a reasonably large population, a threshold can be calculated using the calculation method described in Example 4 below.
If the distribution of the principal component PCs is substantially normal, or if the proportion of individuals whose PCs has an absolute value larger than the threshold is clearly smaller than the prevalence of the disease, the disease cannot be evaluated with the health examination items used. In the opposite case, the disease can be evaluated with those items. Furthermore, when attention is focused on one disease only, the measurement items to be performed can be chosen from among the items for which PC1g has a large absolute value, taking into account the ease and cost of measurement. PC1g is, of course, also an important finding for research into the cause and treatment of the disease.
An individual whose PCs1 exceeds the threshold is suspected of having the disease. Needless to say, when multiple diseases are of interest, the number of disease groups increases, and so does the number of principal components to be considered. However, this number is not necessarily equal to the number of diseases; diseases with similar symptoms probably affect the same principal component and would be judged by that principal component.

[Example 3]
With reference to FIG. 9, a case is described in which 10 genes were selected from the list according to the second embodiment of the present invention and the degree of aging of each sample was measured.
FIG. 9 shows a method of obtaining the index from centered, standardized data.
The reference value was obtained from these data, for each gene, as the mean of the young mice. The coefficient used was the principal component PC_g1 multiplied by the constant 17, which makes the index easier to read.
The resulting values were summed to obtain the index. The value of each sample is shown as a bar graph.
The specific calculation method of Example 3 is described in more detail below.
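As a compact sketch of the index calculation summarized above, the following assumes that each term of the sum is the coefficient times the difference between a gene's expression value and its reference value. That reading of "the resulting values were summed" is an assumption, as are the function names and the list-based data layout.

```python
from statistics import mean

SCALE = 17  # constant from Example 3, used only to make the index easier to read


def reference_values(young_samples):
    """Per-gene mean over the young-mouse samples (one list of values per sample)."""
    return [mean(values) for values in zip(*young_samples)]


def aging_index(sample, reference, pc_g1, scale=SCALE):
    """Sum of (scale * PC_g1) * (expression - reference) over the selected genes."""
    return sum(scale * pc * (x - ref)
               for x, ref, pc in zip(sample, reference, pc_g1))
```

A sample whose expression equals the reference values for every gene gets an index of zero, so the index measures departure from the young-mouse baseline along the first principal component.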

(Standardization)
First, the 40 data sets from GSM202666 to GSM202705 in the NCBI GEO database, the same as in Example 1, were obtained as microarray matrix data.
Using the SuperNORM data standardization service of Skylight Biotech Inc., the PM data were standardized by a parametric method based on the three-parameter lognormal distribution, and z-scores were obtained.
The expression level of each gene was obtained from the trimmed mean of the standardized PM data, following "Konishi T (2008) Data Distribution of Short Oligonucleotide Expression Arrays and Its Application to the Construction of a Generalized Intellectual Framework. Stat Appl Genet Mol Biol 7: Article 25." and "Konishi T (2004) Three-parameter lognormal distribution ubiquitously found in cDNA microarray data and its application to parametric data treatment. BMC Bioinformatics, 5, 5."

(Parameters calculated by standardization)
The parameters calculated by the above standardization are described with reference to FIG. 10. The parameters are as follows:

lower: lower limit of the confidence interval
upper: upper limit of the confidence interval
saturation: measurement limit
gamma: γ (background)
sigma: σ (width of the distribution)
mu: μ (center of the distribution)

The base of the logarithm used is 10.

(Selection of genes with sufficient repeated measurements)
Principal component analysis is also a method for finding the overall tendency in data, so variation due to individual differences among the samples in the data acts as noise.
Some genes are unstable, and changes in their expression level and the like do not become clear without a certain number of repeated measurements. This is true even when the expression level changes greatly.
Of course, the number of microarray replicates that can be measured from the same sample is limited, so it is possible that, for example, about half of the genes on the chip have not been measured a sufficient number of times.
Information from these genes is considered noisy and may reduce the accuracy of the principal component analysis. Information from these genes was therefore excluded.
To judge gene by gene whether the number of observations was sufficient, an analysis of variance (two-way ANOVA) was performed for each gene. This method tests whether expression differs significantly between groups while matching up the z-scores of the PM data corresponding to each gene. The null hypothesis is that "the expression level is the same in every group". The assumed model is:

Difference in expression level = difference in PM cell sensitivity + difference between groups

and a two-sided test with a threshold of 0.002 was performed. That is, genes for which the P value calculated for the difference between groups was 0.001 or less were selected as having a sufficient number of observations. This threshold setting is commonly used for testing microarray data.
Although a large number of tests are performed, the multiplicity of testing is not taken into account. This is because the stability of each gene is different, so the test result for each gene should be judged individually.
The test was carried out according to the method of "Konishi T, Konishi F, Takasaki S, Inoue K, Nakayama K, Konagaya A (2008) Coincidence between Transcriptome Analyses on Different Microarray Platforms Using a Parametric Framework. PLoS ONE 3: e3555."

(Data subjected to principal component analysis for microarrays)
This analysis used data in which the PM data were summarized for each gene (by trimmed mean). Specifically, the z-scores in the PIVOT output file provided by the SuperNORM data standardization service of Skylight Biotech were used.
To remove information from genes for which the null hypothesis was not rejected in the above analysis of variance, all values for these genes were replaced with zero.
This prevents noise from affecting the result of the principal component analysis. Replacing the values with zero, rather than deleting specific genes, also keeps the shape of the matrix from changing.
All missing data were also replaced with zero. As described above, principal component analysis cannot be computed when data are missing, so the missing data must be replaced. Replacing missing data with zero is a so-called fail-safe measure: as described above, as long as missing data are replaced with zero, they do not cause false positives.

Next, the matrix data in which the missing data had been replaced were centered for every measured element (gene). This centering subtracts, from each value of a gene, the mean of that gene over all data. As a result, the zero of the principal component analysis result coincides with the origin.
If the value of one of the control experimental groups were subtracted instead, the origin would coincide with that group. The variances were not unified, because differences in gene expression level correlate with biological function.
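The preparation steps above (zeroing filtered genes, replacing missing values with zero, and centering each gene on its mean over all data) can be sketched as follows. The row-per-sample layout, the use of `None` for missing values, and the `keep_gene` flags that stand for the outcome of the per-gene analysis of variance are all assumptions made for illustration.

```python
from statistics import mean


def prepare_matrix(data, keep_gene):
    """Return a centered copy of data with filtered genes set to zero.

    data[s][g]:   z-score of gene g in sample s, or None if missing.
    keep_gene[g]: True if gene g passed the per-gene test, False otherwise.
    """
    # Missing values become zero so that the decomposition can be computed.
    filled = [[0.0 if v is None else float(v) for v in row] for row in data]
    # Centering: subtract each gene's mean over all samples.
    gene_means = [mean(column) for column in zip(*filled)]
    # Filtered genes are zeroed rather than dropped, so the matrix keeps its shape.
    return [
        [row[g] - gene_means[g] if keep else 0.0
         for g, keep in enumerate(keep_gene)]
        for row in filled
    ]
```

No variance scaling is applied, in line with the statement that the variances were not unified.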

The mean of each gene was obtained for each group and used as the representative value of the group. Singular value decomposition was performed using these representative values as X_t, giving three matrices U_t, L^1/2_t, and V_t'.
As described in the first or second embodiment above, the contents of X differ between the case where the axes are determined from all the data and the simulations in which the groups are biased.
PC_g, the principal component for each gene, was obtained from X_t and U_t.
PC_s, the principal component for the samples, was obtained from all the data X and V_t. Accordingly, a value is calculated for each sample rather than for the representative value of each group. This makes it possible to observe how much individual difference there is between samples.
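The following is a minimal sketch of the first component only, assuming that rows are groups or samples and columns are genes, so that PC_g1 = X_t' u1 and PC_s1 = X v1. Power iteration stands in here for a full singular value decomposition, which the description uses; the function names and toy numbers are illustrative, and the scaled components (sPC) are not covered.

```python
import math


def first_axis(x_t, iterations=200):
    """Power iteration on x_t' x_t.

    x_t[k][g] is the centered mean of gene g in group k.
    Returns (v1, sigma1): the first gene-side unit vector and its singular value.
    The uniform start vector is an arbitrary choice; if it happens to be
    orthogonal to v1, another start vector is needed.
    """
    n_groups, n_genes = len(x_t), len(x_t[0])
    v = [1.0 / math.sqrt(n_genes)] * n_genes
    sigma = 0.0
    for _ in range(iterations):
        scores = [sum(row[g] * v[g] for g in range(n_genes)) for row in x_t]
        w = [sum(x_t[k][g] * scores[k] for k in range(n_groups))
             for g in range(n_genes)]
        norm = math.sqrt(sum(c * c for c in w))
        if norm == 0.0:
            break
        v = [c / norm for c in w]
        sigma = math.sqrt(norm)  # at convergence norm approaches sigma1 ** 2
    return v, sigma


def project(x, v):
    """Score of every row of x (one row per sample) on the axis v."""
    return [sum(a * b for a, b in zip(row, v)) for row in x]


# Toy data: two group means over three genes, and three individual samples.
x_t = [[2.0, -1.0, 0.0], [-2.0, 1.0, 0.0]]
x = [[2.2, -0.9, 0.1], [1.8, -1.1, -0.1], [-2.0, 1.0, 0.0]]
v1, sigma1 = first_axis(x_t)
pc_g1 = [sigma1 * c for c in v1]  # equals x_t' u1
pc_s1 = project(x, v1)            # one value per sample, not per group
```

Because the axis is fixed from the group means and the individual samples are only projected onto it, the spread of `pc_s1` within a group shows the individual differences mentioned above.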

In this way, the matrix data X_s were created and stored in the microarray data 251. The principal component analysis method for transcriptomes of the first embodiment and the processes of the second embodiment were then performed to obtain the gene list.

[Example 4]
Next, a method for selecting genes from the results of the principal component analysis of microarray data is described with reference to FIGS. 11 and 12.
First, as in Example 1 above, the 40 data sets from GSM202666 to GSM202705 in the NCBI GEO database were obtained as microarray matrix data and analyzed using the microarray data principal component analysis method according to the first embodiment described above. Genes were then selected.

As described above, each principal component in principal component analysis is obtained by summing many numbers. For example, the principal component of a gene is the sum of elements calculated for each sample.
If these elements had little biological meaning, there would be no clear correlation between them, and they would be independent.
Since each element is derived from differences between samples, the elements can be assumed to share the same form of distribution.
Furthermore, if the elements are of a kind that can be simulated by random numbers, the central limit theorem predicts that their sum will be normally distributed.

図１１は、取得したｓＰＣ１ｇの度数分布を示すグラフヒストグラムである。縦軸は、度数（Ｆｒｅｑｕｅｎｃｙ）、横軸は各要素の数を示す。図１１の確認されたｓＰＣ１ｇの分布は、概要として正規分布していた。特に、その分布中心は、正規分布に沿った分布をしていた。
図１２は、取得したｓＰＣ１ｇの分布と、理論的な正規分布とを比較したＱＱプロットの例である。ＱＱプロットは、ある確率ｐを与えたときに、２つの確率点（ｑｕａｎｔｉｌｅ）となるｑ１とｑ２とを、それぞれ縦軸、横軸にとってプロットした確率プロットである（「Ｇｎａｎａｄｅｓｉｋａｎ，Ｒ．；Ｗｉｌｋ，Ｍ．Ｂ．（１９６８）， "Ｐｒｏｂａｂｉｌｉｔｙｐｌｏｔｔｉｎｇｍｅｔｈｏｄｓｆｏｒｔｈｅａｎａｌｙｓｉｓｏｆｄａｔａ"，Ｂｉｏｍｅｔｒｉｋａ５５（１）：１〜１７」を参照）。このＱＱプロットでは、ソートしたｓＰＣ１ｇの実データと正規分布の理論値を一次近似した。
ノイズの影響を避けて直線部分だけからパラメータを求めるために、ロバストなチューキーの方法を用いた（「Ｔｕｋｅｙ，Ｊ．Ｗ．（１９７７）．ＥｘｐｌｏｒａｔｏｒｙＤａｔａＡｎａｌｙｓｉｓ，ＲｅａｄｉｎｇＭａｓｓａｃｈｕｓｅｔｔｓ：Ａｄｄｉｓｏｎ−Ｗｅｓｌｅｙ．」を参照）。図１２の実線は、近似直線式を示す（ｙ＝０．０９ｘ）。
図１２のＱＱプロットによると、分布中心は、正規分布に沿った分布をしていることは、明らかである。ただし、分布の両端はより絶対値の大きな値を示す傾向が顕著で、グラフの上下方向にプロットが曲がった。これは、ランダムでない要素間の相関があることを示唆している。
具体的には、近似直線と実データは、実データの値として±０．１７くらいから乖離しはじめる。この程度の値から、強い意味をもつ遺伝子群が混じってくると考えることができる。逆に、全てがランダムだったと仮定すると、実データはこの近似直線上にのっていたはずである。 FIG. 11 is a graph histogram showing the frequency distribution of the acquired sPC1g. The vertical axis represents frequency, and the horizontal axis represents the number of each element. The confirmed distribution of sPC1g in FIG. 11 was normally distributed. In particular, the distribution center has a distribution along a normal distribution.
FIG. 12 is an example of a QQ plot comparing the distribution of the acquired sPC1g with a theoretical normal distribution. A QQ plot is a probability plot in which, for a given probability p, the two corresponding quantiles q1 and q2 are plotted against each other on the vertical and horizontal axes (see "Gnanadesikan, R.; Wilk, M. B. (1968), 'Probability plotting methods for the analysis of data', Biometrika 55 (1): 1-17"). In this QQ plot, the sorted actual sPC1g data were fitted against the theoretical values of the normal distribution by a first-order (straight-line) approximation.
To obtain the parameters from the straight portion only, avoiding the influence of noise, Tukey's robust method was used (see "Tukey, J. W. (1977). Exploratory Data Analysis, Reading, Massachusetts: Addison-Wesley."). The solid line in FIG. 12 shows the fitted line (y = 0.09x).
According to the QQ plot of FIG. 12, the center of the distribution clearly follows the normal distribution. However, both ends of the distribution show a marked tendency toward values of larger absolute magnitude, so that the plot bends away from the line in the vertical direction of the graph. This suggests that there are correlations among elements that are not random.
Specifically, the fitted line and the actual data begin to diverge at actual data values of about ±0.17. From values of about this magnitude, genes that carry strong meaning can be considered to start mixing in. Conversely, if everything had been random, the actual data would have lain on this fitted line.
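
The QQ-plot fit described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the procedure of the embodiment: the data are simulated (a normal background with standard deviation 0.09 plus a few large-magnitude values standing in for meaningful genes), the function names `normal_qq` and `resistant_line` are hypothetical, and the fit is a simplified three-group median form of Tukey's resistant line, which may differ in detail from the cited method.

```python
import random
from statistics import NormalDist, median


def normal_qq(values):
    """Pair sorted data with standard-normal quantiles at matching probabilities."""
    ys = sorted(values)
    n = len(ys)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    xs = [std_normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    return xs, ys


def resistant_line(xs, ys):
    """Simplified resistant line: slope from the medians of the outer thirds.

    Medians are insensitive to the extreme tails, so the fit is driven
    by the bulk of the points rather than by the bent ends of the plot.
    """
    k = len(xs) // 3
    x_lo, y_lo = median(xs[:k]), median(ys[:k])
    x_hi, y_hi = median(xs[-k:]), median(ys[-k:])
    slope = (y_hi - y_lo) / (x_hi - x_lo)
    intercept = median([y - slope * x for x, y in zip(xs, ys)])
    return slope, intercept


# Simulated stand-in for sPC1g (assumption, not data from the document).
random.seed(0)
background = [random.gauss(0.0, 0.09) for _ in range(5000)]
tails = [sign * random.uniform(0.4, 1.0) for sign in (-1, 1) for _ in range(25)]

xs, ys = normal_qq(background + tails)
slope, intercept = resistant_line(xs, ys)

# The largest observation lies above the fitted line, i.e. the plot bends upward.
top_residual = ys[-1] - (slope * xs[-1] + intercept)
print(f"slope={slope:.3f} intercept={intercept:.3f} top residual={top_residual:.3f}")
```

Because the slope is taken from medians of the outer thirds, the 50 simulated tail values barely move the fitted line, which is the property the text relies on when it estimates the line from the straight portion only.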

ランダムな要素の組み合わせとして、確率０．００１の両側の擬陽性を受け入れるとすると、理論値としてｚスコア±３．３が、所定の閾値となる。これを、図１２の、縦の波線として示す。
あるいは、０．００１／２の確率で、分布中心で観測されたようなランダムな効果は、±３．３というｚスコアを記録しうることになる。これは近似直線から、ｓＰＣ１ｇの値として±０．３に相当する。これを、図１２の横の波線として示す。
そこで、ｓＰＣ１ｇの値がこれらを超える遺伝子を選択した。この選択した中の遺伝子に期待される擬陽性の確率は、０．００１よりも小さくなる。
同様の計算を実施例２トキシコロジーの分野のデータからの分析結果であるｓＰＣｇ２にも行い、０．３という域値を得た。所定の閾値により、トキシコロジーに関連する遺伝子を得ることも可能であった。 If two-sided false positives with a probability of 0.001 are accepted for combinations of random elements, the theoretical z-score of ±3.3 becomes the predetermined threshold. This is shown as the vertical wavy lines in FIG. 12.
Put another way, a random effect such as that observed at the center of the distribution can record a z-score of ±3.3 with a probability of 0.001/2. From the fitted line, this corresponds to an sPC1g value of ±0.3. This is shown as the horizontal wavy lines in FIG. 12.
Therefore, genes whose sPC1g values exceed these thresholds were selected. The expected false positive probability for the selected genes is less than 0.001.
A similar calculation was performed on sPCg2, the analysis result obtained from the toxicology data of Example 2, and a threshold of 0.3 was obtained. With the predetermined threshold, it was also possible to obtain genes related to toxicology.
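
The threshold arithmetic above can be checked with a short Python sketch. The probability 0.001 and the slope 0.09 come from the text; the gene names and sPC1g values are hypothetical and serve only to show the selection step.

```python
from statistics import NormalDist

ALPHA = 0.001  # two-sided false positive probability accepted in the text
SLOPE = 0.09   # slope of the fitted QQ line in FIG. 12 (y = 0.09x)

# z-score that leaves ALPHA / 2 in each tail of the standard normal distribution.
z_threshold = NormalDist().inv_cdf(1.0 - ALPHA / 2.0)  # about 3.29

# The fitted line converts the z-score into the scale of sPC1g.
spc_threshold = SLOPE * z_threshold  # about 0.30

# Hypothetical scaled principal component values keyed by made-up gene names.
spc1g = {"gene_a": 0.02, "gene_b": -0.41, "gene_c": 0.29, "gene_d": 0.35}

# Keep genes whose absolute value exceeds the threshold on either side.
selected = sorted(g for g, v in spc1g.items() if abs(v) > spc_threshold)
print(round(z_threshold, 2), round(spc_threshold, 2), selected)
```

The rounded values agree with the ±3.3 and ±0.3 quoted in the text; the selection compares absolute values because the threshold applies to both tails.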

なお、上記実施の形態の構成及び動作は例であって、本発明の趣旨を逸脱しない範囲で適宜変更して実行することができることは言うまでもない。 Note that the configuration and operation of the above-described embodiment are examples, and it is needless to say that the configuration and operation can be appropriately changed and executed without departing from the gist of the present invention.

１０解析装置
１００制御部
１１０記憶部
１３０入力部
１４０表示部
１５０ネットワーク入出力部
２１０トレーニングデータ作成部
２２０特異ベクトル演算部
２３０主成分演算部
２４０主成分スケーリング部
２５０データベース
２５１マイクロアレイデータ
２５２トレーニングデータ
２５３軸データ
２５４主成分データ
DESCRIPTION OF SYMBOLS
10 Analysis apparatus
100 Control unit
110 Storage unit
130 Input unit
140 Display unit
150 Network input/output unit
210 Training data creation unit
220 Singular vector calculation unit
230 Principal component calculation unit
240 Principal component scaling unit
250 Database
251 Microarray data
252 Training data
253 Axis data
254 Principal component data

Claims

A transcriptome analysis method for analyzing a transcriptome by calculating a principal component from a data matrix using an analysis apparatus,
The analysis apparatus scales the principal component by dividing it by the square root of the number of samples or of measurement items used to calculate the principal component,
The analysis apparatus selects a sample from the scaled principal component using a predetermined threshold,
The principal component is calculated from the data matrix of expression level changes relating to the transcriptome,
The principal component is scaled by dividing it by the square root of the number of samples of the data matrix used to calculate the principal component, or by the square root of the number of measurement items of the data matrix used to calculate the principal component,
From the scaled principal component, changes in the expression level are determined and selected using the predetermined threshold,
Training data is used to find the axis of the principal component, which is represented by a singular vector,
The training data uses, as the representative value of each group, the average value of that group, and is standardized by setting a reference value for each item,
The reference value is set by specifying data to serve as the reference, the data set as the reference becomes the origin of the principal component, and the axis of the principal component is shared among a plurality of measurement values while the reference data is determined for each measurement value,
A transcriptome analysis method comprising classifying samples by defining the axis of the principal component using the training data and applying the axis of the principal component to the data of individual samples.

The transcriptome analysis method according to claim 1, wherein the representative value of the training data is determined by selecting samples and measurement items, a significant difference between groups being confirmed in advance by analysis of variance to narrow down the measurement items.

The transcriptome analysis method according to claim 1 or 2, wherein the change in the expression level includes any of the amount of RNA, the amount of translated protein, the activity of the translated protein, and the amount of a metabolite produced by protein metabolism.

The transcriptome analysis method according to claim 1, wherein the predetermined threshold is a threshold that accepts two-sided false positives with a probability of 0.001, obtained by comparing the scaled principal component with a normal distribution.

The transcriptome analysis method according to any one of claims 1 to 4, wherein it is determined that the expression level has changed by comparing two or more scaled principal components.

The training data is created by selecting measurement items of the data matrix,
The transcriptome analysis method according to any one of claims 1 to 5, wherein the size of the original matrix is maintained by replacing data of items not selected with zero.

The transcriptome analysis method according to any one of claims 1 to 6, wherein when the principal component is calculated, the deleted data is replaced with zero.

The transcriptome analysis method according to claim 1, wherein the principal component is calculated by applying an axis obtained from the training data to the data matrix.

The transcriptome analysis method according to any one of claims 1 to 8, wherein an axis obtained from the training data is used as a weight for data evaluation.

The transcriptome analysis method according to any one of claims 1 to 9, wherein, when obtaining an axis from the training data, selected data other than the data average is used as a reference.

The transcriptome analysis method according to any one of claims 1 to 10, wherein when the principal component is calculated, selected data other than the data average is used as a reference.

The transcriptome analysis method according to any one of claims 1 to 11, wherein, when the principal component is calculated, a data matrix X_s and a data matrix X_p that have been re-standardized by centering according to the following formula are used,

where p is an experimental group number.

The transcriptome analysis method according to claim 12, wherein, when singular value decomposition is performed on the data matrix X_p, the relationship among the left singular vector U_p, the diagonal matrix L^(1/2), and the right singular vector V_p is expressed by the following formula.

The transcriptome analysis method according to claim 12 or 13, wherein, among the principal components, the principal component PC_s for each sample is expressed by the following formula.

The transcriptome analysis method according to any one of claims 12 to 14, wherein, among the principal components, the principal component PC_g for each gene is expressed by the following formula.

A principal component calculation method for calculating a principal component from a data matrix using an analysis apparatus,
The analysis apparatus scales the principal component by dividing it by the square root of the number of samples or of measurement items used to calculate the principal component,
The analysis apparatus selects a sample from the scaled principal component using a predetermined threshold,
A disease group is compared with a control group,
The principal component is calculated from the data matrix for the measurement items of a health check,
The principal component is scaled by dividing it by the square root of the number of samples of the data matrix used to calculate the principal component, or by the square root of the number of measurement items of the data matrix used to calculate the principal component,
From the scaled principal component, changes in the measurement items of the health check are determined and selected using the predetermined threshold,
Training data is used to find the axis of the principal component, which is represented by a singular vector,
The training data uses, as the representative value of each group, the average value of that group, and is standardized by setting a reference value for each item,
The reference value is set by specifying data to serve as the reference, the data set as the reference becomes the origin of the principal component, and the axis of the principal component is shared among a plurality of measurement values while the reference data is determined for each measurement value,
A disease determination method comprising classifying samples by defining the axis of the principal component using the training data and applying the axis of the principal component to the data of individual samples.

A computer program for executing the transcriptome analysis method according to claim 1 or the disease determination method according to claim 16.

A storage medium storing the computer program according to claim 17.

A principal component calculation unit that calculates a principal component from a data matrix;
A principal component scaling unit that scales the principal component by dividing it by the square root of the number of samples of the data matrix used to calculate the principal component, or by the square root of the number of measurement items of the data matrix used to calculate the principal component; and
A training data creation unit that selects samples and measurement items and creates training data,
A sample is selected from the scaled principal component using a predetermined threshold,
The principal component calculation unit calculates the principal component from the data matrix for a transcriptome or for measurement items of a health check,
The principal component scaling unit divides the principal component by the square root of the number of samples of the data matrix used to calculate the principal component, or by the square root of the number of measurement items of the data matrix used to calculate the principal component, and from the scaled principal component, changes in the expression level relating to the transcriptome or in the measurement items of the health check are determined and selected using the predetermined threshold,
The training data creation unit uses, as the training data, the representative value of each group, consisting of the average value of that group, and standardizes the training data by setting a reference value for each item; the reference value is set by specifying data to serve as the reference, the data set as the reference becomes the origin of the principal component, and the axis of the principal component is shared among a plurality of measurement values while the reference data is determined for each measurement value,
An analysis apparatus characterized in that samples are classified by defining the axis of the principal component using the training data and applying the axis of the principal component to the data of individual samples.