JP2014035653A

JP2014035653A - Population size estimation method, population estimation method, gene expression analysis method, data analysis method, sensitivity measurement method, program, recording medium and population size estimation system

Info

Publication number: JP2014035653A
Application number: JP2012176595A
Authority: JP
Inventors: Tomokazu Konishi; 智一小西
Original assignee: Akita Prefectural University
Current assignee: Akita Prefectural University
Priority date: 2012-08-09
Filing date: 2012-08-09
Publication date: 2014-02-24

Abstract

PROBLEM TO BE SOLVED: To provide a population size estimation method for precisely estimating the size of population using measurement data obtained by using a measurement method having detection limit.SOLUTION: The total number of data is estimated on measurement data obtained using a measurement method having detection limit is made on a computer. The measured data is sorted in an order from the largest/smallest data. Subsequently, a range of candidate values of the total number of data is set based on the measured data. Then, scatter diagrams of probability distribution of a normalized measured data and an estimated distribution of the measured data are drawn with respect to all candidate values within the range. These scatter diagrams are evaluated on the linearity using a correlation like Pearson's R-value or a function on a linear relationship to estimate an optimum candidate value as the population size.

Description

本発明は、母集団大きさ推定方法、母数推定方法、遺伝子発現量解析方法、データ分析方法、感度測定方法、プログラム、記録媒体、及び母集団大きさ推定装置に係り、特に、検出感度に限界があるデータを用いた際の母集団大きさ推定方法、母数推定方法、遺伝子発現量解析方法、データ分析方法、感度測定方法、プログラム、記録媒体、及び母集団大きさ推定装置に関する。 The present invention relates to a population size estimation method, a parameter estimation method, a gene expression level analysis method, a data analysis method, a sensitivity measurement method, a program, a recording medium, and a population size estimation apparatus, and more particularly to detection sensitivity. The present invention relates to a population size estimation method, a population estimation method, a gene expression level analysis method, a data analysis method, a sensitivity measurement method, a program, a recording medium, and a population size estimation apparatus when using data with limitations.

遺伝する生物情報はゲノムにＤＮＡの塩基配列の形で記録されている。ＤＮＡは細胞内でＲＮＡへと転写され、これらＲＮＡは細胞内に蓄積する。
それらＲＮＡは特定のタンパク質の情報を保持しているが、頻繁に翻訳されるＲＮＡは細胞内の濃度（発現量）が高く保たれる。つまり、これらＲＮＡの濃度は、遺伝情報がはじめて物理的な形に変わる段階である。細胞に含まれるＲＮＡの種類と発現量に係るコンテンツ（内容）が、トランスクリプトームと称されている。各ＲＮＡの発現量は、その細胞の機能や状態を知る上で重要な手掛かりとなる。 The inherited biological information is recorded in the genome in the form of a DNA base sequence. DNA is transcribed into RNA in the cell, and the RNA accumulates in the cell.
These RNAs retain information on specific proteins, but RNAs that are frequently translated maintain high intracellular concentrations (expression levels). That is, the concentration of these RNAs is a stage where genetic information changes to a physical form for the first time. Content related to the type and expression level of RNA contained in a cell is referred to as a transcriptome. The expression level of each RNA is an important clue for knowing the function and state of the cell.

ＲＮＡの種類は数万に登るため、それらの種類を見分けた上で定量するのは容易ではない。しかし、１９９０年代にＤＮＡチップを使ったＤＮＡマイクロアレイ法が登場し、その網羅的な解析が可能になった。これはハイブリダイゼーションの画像等から、遺伝子ごとに感度が異なる相対値を測定値として得る方法である。
また、ＤＮＡチップを用いないトランスクリプトームの測定方法として、定量ＰＣＲを使う方法と、ＤＮＡシーケンサーを使う方法がある。
さらに、細胞内の代謝産物である、有機化合物やタンパク質や代謝産物等を、クロマトグラフィ等の方法で分離・定量しながら質量分析し、物質の同定を行う方法も実用化されている。これらの物質の同定方法の例として、ＬＣ−ＭＳ（ＬｉｑｕｉｄＣｈｒｏｍａｔｏｇｒａｐｈＭａｓｓＳｐｅｃｔｏｍｅｔｅｒ）等が存在する。 Since there are tens of thousands of types of RNA, it is not easy to quantify after identifying these types. However, the DNA microarray method using a DNA chip appeared in the 1990s, and its comprehensive analysis became possible. This is a method of obtaining relative values having different sensitivities for each gene as measured values from hybridization images and the like.
There are two methods for measuring transcriptome without using a DNA chip: a method using quantitative PCR and a method using a DNA sequencer.
Furthermore, a method for identifying a substance by mass-analyzing an organic compound, protein, metabolite or the like, which is an intracellular metabolite, while separating and quantifying it by a method such as chromatography has been put into practical use. Examples of methods for identifying these substances include LC-MS (Liquid Chromatography Mass Spectrometer).

ここで、複数のサンプルからなる相対値の測定データを解析する際に、最初に行うことは、当該測定データの標準化である。測定に用いられるサンプルは、先に何らかの方法でその量を揃えることが多いが、揃えられた状態にて取得される数値は、絶対値にはならない。これは、測定の原理上、サンプルごとに感度が異なるからである。この感度差のために、測定間においてデータを直接に比較することはできない。そのため、測定データを別途、標準化することが必要である。この標準化により、各サンプルが比較可能になる。
測定データの標準化を行うためには、データの母集団の分布をあらわす「母数」を推定する。ここでは、あるサンプルに内包されている被測定物質のすべてを、母集団であると考える。たとえば、ある特定の強度域から失われた「強め」に測定されたサンプルのデータからでも、その母集団の性質を表す母数が分かれば、そのサンプルがどの程度に強めに測定されたのかを推定することができる。そのため、母数を使うことで、測定間の比較が可能になる。
たとえば、正規分布モデルであるなら、平均で推定される分布の中心の母数μと、標準偏差で推定される分布の巾の母数σがある。これらの母数の推定は、その後のあらゆる解析の基盤となるために重要である。 Here, when analyzing measurement data of a relative value composed of a plurality of samples, the first thing to do is standardization of the measurement data. Samples used for measurement are often arranged in some way in advance, but numerical values obtained in the aligned state are not absolute values. This is because the sensitivity varies from sample to sample on the principle of measurement. Because of this sensitivity difference, data cannot be directly compared between measurements. Therefore, it is necessary to standardize the measurement data separately. This standardization allows each sample to be compared.
In order to standardize measurement data, a “parameter” representing the distribution of the population of data is estimated. Here, all the substances to be measured contained in a certain sample are considered to be a population. For example, even if the data of a “strongly” sample that was lost from a specific intensity range is known, it is possible to determine how strongly the sample was measured if the parameter representing the nature of the population is known. Can be estimated. Therefore, comparison between measurements is possible by using the parameter.
For example, in the case of a normal distribution model, there is a parameter μ at the center of the distribution estimated by the average and a parameter σ of the width of the distribution estimated by the standard deviation. These parameter estimates are important to provide the basis for any subsequent analysis.

図９の例を参照して、測定データの標準化と母集団の関係について、より詳しく説明する。
たとえば、図９（ａ）のように、本来の母集団が平均μ＝０，標準偏差σ＝１の正規分布に従うデータである場合、そのデータのヒストグラムは、図９（ａ）のようなものになる。
さて、従来から、感度限界等により、検出できたものだけがデータとして得られるような検出限界のある測定方法は多い。たとえば電気泳動、クロマトグラフィ、質量分析のように、データを経時的に記録するものは、必ず検出限界がある。また、次世代型シーケンサーを用いたＲＮＡ−ｓｅｑのように、データが基本的に「読まれる・読まれない」というベルヌーイ試行を用いた測定の場合も、読まれなかったデータは得られないために、検出限界が存在する。さらに、蛍光や電気的な検知方法を用いたハイブリダイゼーション等も、機器の感度限界に係る検出限界がある。
図９（ｂ）は検出限界があるデータの測定値のヒストグラムの例である。このように、検出限界より小さい分布の一部が特異的に失われることが多い。逆に、検出限界を超えて大きいシグナルが無視されたり、測定機の不全を招くことで、当該データが失われることもある。 With reference to the example of FIG. 9, the relationship between the standardization of measurement data and the population will be described in more detail.
For example, as shown in FIG. 9A, when the original population is data according to a normal distribution with mean μ = 0 and standard deviation σ = 1, the histogram of the data is as shown in FIG. 9A. become.
Conventionally, there are many measurement methods with a detection limit such that only data that can be detected is obtained as data due to a sensitivity limit or the like. For example, those that record data over time, such as electrophoresis, chromatography, and mass spectrometry, always have detection limits. In addition, as with RNA-seq using a next-generation sequencer, data that has not been read cannot be obtained even in the measurement using the Bernoulli trial that data is basically “read / not read”. There is a detection limit. Furthermore, hybridization using fluorescence or an electrical detection method has a detection limit related to the sensitivity limit of the instrument.
FIG. 9B is an example of a histogram of measured values of data having a detection limit. In this way, a part of the distribution smaller than the detection limit is often specifically lost. Conversely, a large signal that exceeds the detection limit may be ignored, or the data may be lost due to failure of the measuring instrument.

従来から、このように検出限界がある測定方法を用いる場合、測定データの母集団の大きさが不明となり、このため分布特性や母数が求められないことが多かった。
ここで、「母集団の大きさ」は、被測定物の種類の総数を示す値である。従来は、データが母集団からのランダムサンプルだと仮定し、たとえば正規分布を仮定して母数を推定して、母手段の大きさを標準化してきた。
しかしながら、この図９（ｂ）のデータから、例えば、平均を算出すると０．３で、標準偏差は０．８程度となる。これは本来の母集団の平均０、標準偏差１から大きく外れている。こうした違いは、データが、ある特定の強度域から失われ、ランダムなサンプリングであるという前提が崩れるために引き起こされる。 Conventionally, when a measurement method having a detection limit is used, the size of the population of measurement data is unknown, and thus distribution characteristics and parameters are often not obtained.
Here, the “population size” is a value indicating the total number of types of objects to be measured. Conventionally, assuming that the data is a random sample from the population, for example, assuming a normal distribution, the number of parameters is estimated, and the size of the generating means has been standardized.
However, for example, when the average is calculated from the data of FIG. 9B, the standard deviation is 0.3 and the standard deviation is about 0.8. This is far from the average 0 and standard deviation 1 of the original population. These differences are caused by the assumption that data is lost from a certain intensity range and is a random sampling.

また、このように分布の一部を特異的に失う場合、いわゆるロバストな計算方法を用いても、正しく母数を算出できない。これは、撹乱の原因が少数の外れ値ではないためである。たとえば、図９（ｂ）のデータで、よく使われるロバストなμの推定方法である中央値を計算すると０．２２となり、同じくロバストなσの推定方法であるＭＡＤ（ｍｅｄｉａｎａｂｓｏｌｕｔｅｄｅｖｉａｔｉｏｎ）は０．８４となる。
加えて、実際には、検出限界はサンプルや測定によって異なる。これは、たとえ測定機器が同じであっても、測定されている個々のサンプル毎に、そのシグナル総量が一定しないためである。
つまり、図９（ｃ）、（ｄ）、（ｅ）の例で灰色（塗りつぶし）で示すような、感度限界以下で検出できない消失領域の割合は、一定にはならない。このため、図９（ｃ）では、平均が０．１５で標準偏差が０．８８となり、図９（ｄ）では、平均が０．３０で標準偏差が０．８０となり、図９（ｅ）では平均が０．５１で標準偏差が０．７０となる。 In addition, when a part of the distribution is specifically lost in this way, the parameter cannot be calculated correctly even by using a so-called robust calculation method. This is because the cause of the disturbance is not a small number of outliers. For example, if the median, which is a commonly used robust μ estimation method, is calculated with the data in FIG. 9B, it is 0.22, and MAD (median absolute deviation), which is also a robust σ estimation method, is 0. 84.
In addition, in practice, the detection limit depends on the sample and measurement. This is because the total amount of signal is not constant for each individual sample being measured, even if the measurement equipment is the same.
That is, the proportion of disappearing areas that cannot be detected below the sensitivity limit, as shown in gray (filled) in the examples of FIGS. 9C, 9D, and 9E, is not constant. Therefore, in FIG. 9C, the average is 0.15 and the standard deviation is 0.88, and in FIG. 9D, the average is 0.30 and the standard deviation is 0.80. Then the average is 0.51 and the standard deviation is 0.70.

この消失領域の割合のばらつきは、とくに母集団が対数正規分布である場合には顕著になる。たとえば細胞の熱力学的な特性のために、ｍＲＮＡは対数正規分布する。この性質から、多くの量的なオミックスデータは同様な分布をするものと予想される。
対数正規分布では、もっとも下位のデータがもっとも頻度よく現れる。そのため、僅かな検出感度の違いが、得られるデータの数を大きく変えることになる。当然、分布の母数の推定値は失われるデータ数によって大きく影響される。 This variation in the ratio of the disappearing region becomes remarkable particularly when the population has a lognormal distribution. For example, due to the thermodynamic properties of cells, mRNA is log-normally distributed. Because of this property, many quantitative omics data are expected to have a similar distribution.
In the lognormal distribution, the lowest data appears most frequently. Therefore, a slight difference in detection sensitivity greatly changes the number of data obtained. Naturally, the estimate of the distribution parameter is greatly influenced by the number of data lost.

ここで、従来のＤＮＡマイクロアレイデータの標準化方法として、特許文献１を参照すると、一つのＤＮＡマイクロアレイから得られたデータセットにおいて、各データを大きさ順に並べ替えてデータの順位を決定するステップと、あるデータの順位を、標準正規累積分布関数の逆関数に入力して、当該データの標準化後の予測値を出力するステップと、当該データを前記予測値で置き換えるステップと、を備えたＤＮＡマイクロアレイデータの処理方法が開示されている（以下、従来技術１とする。）。
従来技術１の処理方法によれば、従来のパラメトリック手法の標準化における直線性からの歪みと、パラメータ推定に必要な繰り返し演算の必要性とを回避する標準化方法を提供することができる。 Here, as a standardization method of conventional DNA microarray data, with reference to Patent Document 1, in a data set obtained from one DNA microarray, a step of rearranging each data in order of size and determining the rank of the data; DNA microarray data comprising: inputting a rank of certain data to an inverse function of a standard normal cumulative distribution function, outputting a predicted value after standardization of the data; and replacing the data with the predicted value (Hereinafter referred to as prior art 1).
According to the processing method of the prior art 1, it is possible to provide a standardization method that avoids the distortion from linearity in the standardization of the conventional parametric method and the necessity of the iterative calculation necessary for parameter estimation.

国際公開第２００８／０５６６９３号International Publication No. 2008/056693

従来の検出限界がある測定方法では、いずれも感度に限界があり、その限界が測定ごとに異なる。またトランスクリプトーム等に係る発現量データは、基本的に対数正規分布をする。このため、僅かな測定限界の違いに影響される測定データのデータ数（母集団の大きさ）の違いが大きくなる傾向がある。
しかしながら、従来技術１は、検出限界のためにデータを失う場合における、測定データの母集団の大きさの正確な推定は想定していなかった。 Any conventional measurement method with a detection limit has a limit in sensitivity, and the limit differs for each measurement. In addition, the expression level data related to transcriptome and the like basically has a lognormal distribution. For this reason, the difference in the number of measurement data (population size) that is influenced by a slight difference in measurement limit tends to increase.
However, the prior art 1 did not assume an accurate estimation of the size of the population of measurement data when data is lost due to a detection limit.

本発明は、このような状況に鑑みてなされたものであり、上述の課題を解消することを課題とする。 This invention is made | formed in view of such a condition, and makes it a subject to eliminate the above-mentioned subject.

本発明の母集団大きさ推定方法は、コンピュータに、検出限界がある測定方法で得られた測定データの総データ数の推定を実行させる母集団大きさ推定方法において、前記測定データをソートし、前記測定データにより算出された範囲で、前記総データ数の大きさの候補値を複数用意し、前記候補値毎に、前記測定データの母集団の分布に係る理論値のセットを用意し、ソートされた前記測定データと前記理論値のセットとの一致を相関関係又は線形関係に係る関数により評価し、前記関数による評価が最適な前記候補値を母集団の大きさとして推定することを特徴とする。
本発明の母集団大きさ推定方法は、前記関数は、上位からの同順の前記測定データの値と前記理論値のセットの値との直線性又は距離を評価することを特徴とする。
本発明の母集団大きさ推定方法は、前記関数として、ピアソンの積率相関係数を用い、最適な前記候補値が前記範囲の端であった場合、又は用意した候補値の精度が低い場合には、前記候補値の前記範囲を調整し、十分な精度が得られたら評価が最適な前記候補値を確定することを特徴とする。
本発明の母数推定方法は、前記評価が最適な前記候補値の前記理論値のセットと、ソートされた前記測定データとを用いて、上位からの同順の前記測定データの値と前記理論値のセットの値とを対比させ、一次近似した切片と傾きを算出し、前記母集団の分布の特性を示す母数を算出することを特徴とする。
本発明の遺伝子発現量解析方法は、測定データとして、遺伝子発現量の定量化データを用い、前記母集団大きさ推定方法により推定された前記総データ数により前記遺伝子発現量を標準化することを特徴とする。
本発明の遺伝子発現量解析方法は、前記遺伝子発現量の定量化データは、トランスクリプトの発現量、ｃＤＮＡの読み取り回数、及び派生データであることを特徴とする。
本発明のプロテオーム又はメタボロームのデータ分析方法は、測定データとして、タンパク質量又は物質量の定量化データを用い、前記母集団大きさ推定方法により推定された前記総データ数により前記タンパク質量又は物質量を標準化することを特徴とする。
本発明のＤＮＡマイクロアレイの感度測定方法は、測定データとして、ＲＮＡｓｅｑ及びＤＮＡマイクロアレイの遺伝子発現量の定量化データを用い、前記母集団大きさ推定方法により推定された前記総データ数によりそれぞれ標準化し、前記ＤＮＡマイクロアレイの各スポットの感度を、標準化された前記ＲＮＡｓｅｑの遺伝子発現量で補正して求めることを特徴とする。
本発明のプログラムは、コンピュータに、検出限界がある測定方法で得られた測定データの総データ数を推定する母集団大きさ推定方法を実行させるプログラムにおいて、前記コンピュータに、前記測定データをソートさせ、前記測定データにより算出された範囲で、前記総データ数の大きさの候補値を複数用意させ、前記候補値毎に、前記測定データの母集団の分布に係る理論値のセットを用意させ、ソートされた前記測定データと前記理論値のセットとの一致を相関関係又は線形関係に係る関数により評価し、前記関数による評価が最適な前記候補値を母集団の大きさとして推定させる母集団大きさ推定方法を実行させることを特徴とする。
本発明の記録媒体は、コンピュータに、検出限界がある測定方法で得られた測定データの総データ数を推定する母集団大きさ推定方法を実行させるプログラムを記録したコンピュータ読み取り可能な記録媒体において、前記コンピュータに、前記測定データをソートさせ、前記測定データにより算出された範囲で、前記総データ数の大きさの候補値を複数用意させ、前記候補値毎に、前記測定データの母集団の分布に係る理論値のセットを用意させ、ソートされた前記測定データと前記理論値のセットとの一致を相関関係又は線形関係に係る関数により評価し、前記関数による評価が最適な前記候補値を母集団の大きさとして推定させる母集団大きさ推定方法を実行させるためのプログラムを記録したことを特徴とする。
本発明の母集団大きさ推定装置は、検出限界がある測定方法で得られた測定データの総データ数の推定を実行させる母集団大きさ推定装置において、前記測定データをソートする、測定データソート手段と、前記測定データにより算出された範囲で、前記総データ数の大きさの候補値を複数用意する候補値用意手段と、前記候補値毎に、前記測定データの母集団の分布に係る理論値のセットを用意する分布用意手段と、ソートされた前記測定データと前記理論値のセットとの一致を相関関係又は線形関係に係る関数により評価し、前記関数による評価が最適な前記候補値を母集団の大きさとして推定する候補値評価手段とを備えることを特徴とする。 The population size estimation method of the present invention sorts the measurement data in a population size estimation method that causes a computer to perform estimation of the total number of measurement data obtained by a measurement method with a detection limit. Prepare a plurality of candidate values of the total number of data within the range calculated from the measurement data, prepare a set of theoretical values related to the population distribution of the measurement data for each candidate value, and sort The measured data and the set of theoretical values are evaluated by a function related to a correlation or linear relationship, and the candidate value that is optimally evaluated by the function is estimated as a population size, To do.
The population size estimation method according to the present invention is characterized in that the function evaluates a linearity or a distance between the value of the measurement data in the same order from the top and the value of the set of theoretical values.
The population size estimation method of the present invention uses the Pearson product moment correlation coefficient as the function, and the optimal candidate value is at the end of the range, or the prepared candidate value has low accuracy Is characterized in that the range of the candidate values is adjusted, and when the sufficient accuracy is obtained, the candidate values that are optimally evaluated are determined.
The parameter estimation method of the present invention uses the set of the theoretical values of the candidate values that are optimally evaluated and the sorted measurement data, and the values of the measurement data in the same order from the top and the theory It is characterized in that a set of values is compared, a first-order approximation intercept and slope are calculated, and a parameter indicating the characteristics of the population distribution is calculated.
The gene expression level analysis method of the present invention uses the gene expression level quantification data as measurement data, and standardizes the gene expression level based on the total number of data estimated by the population size estimation method. And
The gene expression level analysis method of the present invention is characterized in that the quantification data of the gene expression level is a transcript expression level, a cDNA reading count, and derivative data.
The proteome or metabolome data analysis method of the present invention uses the protein amount or substance amount quantification data as measurement data, and the protein amount or substance amount based on the total number of data estimated by the population size estimation method. Is standardized.
The sensitivity measurement method of the DNA microarray of the present invention uses RNAseq and quantification data of the gene expression level of the DNA microarray as measurement data, standardized by the total number of data estimated by the population size estimation method, The sensitivity of each spot of the DNA microarray is obtained by correcting the standardized gene expression level of the RNAseq.
The program of the present invention is a program for causing a computer to execute a population size estimation method for estimating the total number of measurement data obtained by a measurement method having a detection limit, and causing the computer to sort the measurement data. , In a range calculated by the measurement data, to prepare a plurality of candidate values of the total number of data, for each candidate value, to prepare a set of theoretical values related to the population distribution of the measurement data, A population size that evaluates a match between the sorted measurement data and the set of theoretical values by a function related to a correlation or linear relationship, and estimates the candidate value that is optimally evaluated by the function as the size of the population It is characterized in that the estimation method is executed.
The recording medium of the present invention is a computer-readable recording medium that records a program that causes a computer to execute a population size estimation method for estimating the total number of measurement data obtained by a measurement method with a detection limit. Causing the computer to sort the measurement data and prepare a plurality of candidate values of the total number of data within a range calculated by the measurement data; for each candidate value, a distribution of the population of the measurement data A set of theoretical values according to the above is prepared, and the coincidence between the sorted measurement data and the set of theoretical values is evaluated by a function related to a correlation or linear relationship, and the candidate value that is optimally evaluated by the function is determined as a mother A program for executing a population size estimation method for estimating the size of a population is recorded.
A population size estimation device according to the present invention sorts the measurement data in a population size estimation device that performs estimation of the total number of measurement data obtained by a measurement method having a detection limit. Means for preparing a plurality of candidate values for the total number of data in the range calculated from the measurement data, and a theory relating to the distribution of the population of the measurement data for each candidate value Distribution preparing means for preparing a set of values, and evaluating the match between the sorted measurement data and the set of theoretical values by a function related to a correlation or linear relationship, and the candidate value that is optimally evaluated by the function It further comprises candidate value evaluation means for estimating the size of the population.

本発明によれば、候補値の範囲を設定して想定される分布と散布図で比較することで、検出限界のある測定データの母集団の大きさを正確に推定して標準化に用いることができる母集団大きさ推定方法を提供することができる。 According to the present invention, it is possible to accurately estimate the size of a population of measurement data having a detection limit and use it for standardization by setting a range of candidate values and comparing the assumed distribution with a scatter diagram. A possible population size estimation method can be provided.

本発明の実施の形態に係る解析装置１０の制御構成を示すブロック図である。It is a block diagram which shows the control structure of the analyzer 10 which concerns on embodiment of this invention. 本発明の実施の形態に係る母集団大きさ推定処理のフローチャートである。It is a flowchart of the population size estimation process which concerns on embodiment of this invention. 本発明の実施の形態に係る測定データ２５１と散布図の描画の例を示すグラフである。It is a graph which shows the example of drawing of the measurement data 251 and scatter diagram concerning an embodiment of the invention. 本発明の実施の形態に係る実施例１にて、母集団大きさ推定方法をＲＮＡｓｅｑに適用した結果を示すグラフである。It is a graph which shows the result of having applied the population size estimation method to RNAseq in Example 1 which concerns on embodiment of this invention. 本発明の実施の形態に係る実施例２にて、母集団大きさ推定方法によるＲＮＡｓｅｑデータの標準化と、従来の標準化との比較を行った結果を示すグラフである。It is a graph which shows the result of having compared standardization of RNAseq data by the population size estimation method with the conventional standardization in Example 2 which concerns on embodiment of this invention. 本発明の実施の形態に係る実施例３にて、母集団大きさ推定方法を用いた際の、ＲＮＡｓｅｑデータとＤＮＡマイクロアレイデータとの一致を調べた結果を示すグラフである。It is a graph which shows the result of having investigated the coincidence of RNAseq data and DNA microarray data at the time of using the population size estimation method in Example 3 which concerns on embodiment of this invention. 本発明の実施の形態に係る実施例４にて、母集団大きさ推定方法を用いて、ＲＮＡｓｅｑデータを用いてＤＮＡマイクロアレイの感度測定を行った結果を示すグラフである。It is a graph which shows the result of having performed the sensitivity measurement of a DNA microarray using RNAseq data using the population size estimation method in Example 4 which concerns on embodiment of this invention. 本発明の実施の形態に係る実施例５にて、実際にＲ言語のコードを使用して、作成した計算機実験用のヒストグラムである。In Example 5 which concerns on embodiment of this invention, it is the histogram for computer experiments produced actually using the code of R language. 従来の検出限界のある測定データの標準化の手法を説明する概念図である。It is a conceptual diagram explaining the conventional method of standardization of the measurement data with a detection limit.

＜実施の形態＞
〔解析装置１０の制御構成〕
まず、図１を参照して、本発明の実施の形態に係る解析装置１０（コンピュータ、母集団大きさ推定装置、解析手段）の制御構成について説明する。
解析装置１０は、例えばＰＣ／ＡＴ互換機や汎用機等である計算装置であって、Ｌｉｎｕｘ（登録商標）、Ｗｉｎｄｏｗｓ（登録商標）等のＯＳがインストールされている。
解析装置１０の主な構成要素としては、ＣＰＵ（ＣｅｎｔｒａｌＰｒｏｃｅｓｓｉｎｇＵｎｉｔ）やＧＰＵ（ＧｒａｐｈｉｃｓＰｒｏｃｅｓｓｉｎｇＵｎｉｔ）等の制御・演算装置である制御部１００と、ＲＡＭ（ＲａｎｄｏｍＡｃｃｅｓｓＭｅｍｏｒｙ）やＲＯＭ（ＲｅａｄＯｎｌｙＭｅｍｏｒｙ）やＨＤＤ（ＨａｒｄＤｉｓｋＤｒｉｖｅ）やフラッシュメモリやＳＳＤ（ＳｏｌｉｄＳｔａｔｅＤｒａｉｖｅ）等の記憶装置／記録媒体である記憶部１１０と、キーボードやマウス等のポインティングデバイスやタッチパネル等やマイクロアレイ解析装置等の外部機器からのＩ／Ｏインターフェイス等を含む入力部１３０と、液晶ディスプレイや有機ＥＬディスプレイや印刷を行うプリンタ等である表示部１４０と、１０００Ｂａｓｅ−Ｔ等の規格のＬＡＮボードや無線ＬＡＮボードやシリアル／パラレル／ＵＳＢインターフェイス等である入出力部１５０とを備えている。
解析装置１０は、主に記憶部１１０に記憶された各種プログラムと、データベース等を含むデータとを用いて制御部１００が実行することで、本発明の実施の形態に係る母集団大きさ推定方法をハードウェア資源を用いて実現することができる。 <Embodiment>
[Control Configuration of Analysis Device 10]
First, with reference to FIG. 1, the control configuration of the analysis apparatus 10 (computer, population size estimation apparatus, analysis means) according to the embodiment of the present invention will be described.
The analysis apparatus 10 is a computing apparatus such as a PC / AT compatible machine or a general-purpose machine, and an OS such as Linux (registered trademark) or Windows (registered trademark) is installed therein.
The main components of the analysis apparatus 10 include a control unit 100 that is a control / arithmetic unit such as a CPU (Central Processing Unit) and a GPU (Graphics Processing Unit), a RAM (Random Access Memory), and a ROM (Read Only Memory). Storage device 110 which is a storage device / recording medium such as HDD, Hard Disk Drive (HDD), flash memory, SSD (Solid State Drive), and other external devices such as a pointing device such as a keyboard and a mouse, a touch panel, and a microarray analyzer An input unit 130 including an I / O interface, a liquid crystal display, an organic EL display, a display unit 140 such as a printer for printing, and 1000 Bass A LAN board or wireless LAN board, serial / parallel / USB interface such as standards such as -T and an input-output unit 150.
The analysis apparatus 10 is executed by the control unit 100 mainly using various programs stored in the storage unit 110 and data including a database and the like, so that the population size estimation method according to the embodiment of the present invention is performed. Can be realized using hardware resources.

記憶部１１０は、本発明の実施の形態に係る母集団大きさ推定方法を実現するためのプログラム及びデータが記憶されている。この記憶部１１０のプログラム及びデータは、制御部１００により、ハードウェア資源を用いて実行／処理することができる。
このプログラム及びデータは、データベース２００、測定データソート部２１０（測定データソート手段）、候補値範囲設定部２２０（候補値用意手段）、分布作成部２２５（分布用意手段）、候補値描画部２３０（候補値描画手段）、及び候補値評価部２４０（候補値評価手段、母数算出手段）とを含んで構成される。
また、このプログラム及びデータは、別途記録媒体に記録され、解析装置１０にインストール可能に構成されていてもよい。また、インターネット等からダウンロードしてインストールすることも可能である。 Storage unit 110 stores a program and data for realizing the population size estimation method according to the embodiment of the present invention. The program and data in the storage unit 110 can be executed / processed by the control unit 100 using hardware resources.
The program and data include the database 200, the measurement data sorting unit 210 (measurement data sorting unit), the candidate value range setting unit 220 (candidate value preparation unit), the distribution creation unit 225 (distribution preparation unit), and the candidate value drawing unit 230 ( Candidate value drawing means) and a candidate value evaluation unit 240 (candidate value evaluation means, parameter calculation means).
The program and data may be separately recorded on a recording medium and configured to be installable in the analysis apparatus 10. It can also be downloaded from the Internet and installed.

データベース２００は、ＳＱＬ等のデータベースや各種データを記憶する部位である。データベース２００には、主に、測定データ２５１、分布データ２５２（理論値のセット）、設定データ２５３、及び元データ２５４を記憶している。 The database 200 is a part for storing a database such as SQL and various data. The database 200 mainly stores measurement data 251, distribution data 252 (theoretical value set), setting data 253, and original data 254.

測定データ２５１は、元データ２５４の値を、測定データソート部２１０により、大きい順にソート等して作成されたデータである。
分布データ２５２は、分布作成部２２５で作成された、元データ２５４や測定データ２５１と同じと想定される分布の理論値のセットに係るデータである。なお、分布データ２５２には、想定される分布の乱数値を、候補値分記憶しておいてもよい。
設定データ２５３は、元データ２５４から測定データ２５１を作成するための設定、測定データ２５１の想定される分布の種類、候補値の範囲を設定するためのパラメータ、候補値の範囲のパラメータ、候補値の範囲、候補値のセット等を記憶するデータである。
元データ２５４は、検出限界がある測定方法を用いる実験機器から取得した、元のデータである。この元データ２５４は、例えば、トランスクリプトームに係る遺伝子発現量の定量化データや、タンパク質や低分子有機化合物等の質量計測のデータ等を用いることができる。遺伝子発現量の定量化データとしては、例えば、解析する遺伝子のＲＮＡや翻訳されたタンパク質の量等に対応する、ＲＮＡｓｅｑのリード数、ＤＮＡマイクロアレイのシグナル強度、ＲＴ−ＰＣＲの発現量の値、リアルタイムＲＴ−ＰＣＲの蛍光値（蛍光強度比）、ＳＡＧＥ法（ＳｅｒｉａｌＡｎａｌｙｓｉｓｏｆＧｅｎｅＥｘｐｒｅｓｓｉｏｎ）のタグの計測数、及びこれらの派生データ等を用いることができる。また、検出限界がある質量測定のデータとしては、ＬＣ−ＭＳのデータのような各種質量計測のデータも用いることができる。さらに、計測値が大きさの定まる数値のセットとして得られる、各種の生物学的、非生物学的な値のデータを元データ２５４として用いることが可能である。 The measurement data 251 is data created by sorting the values of the original data 254 in ascending order by the measurement data sorting unit 210.
The distribution data 252 is data related to a set of theoretical values of distributions that are assumed to be the same as the original data 254 and the measurement data 251 created by the distribution creating unit 225. The distribution data 252 may store a random value of an assumed distribution for the candidate value.
The setting data 253 includes a setting for creating the measurement data 251 from the original data 254, an assumed distribution type of the measurement data 251, a parameter for setting a candidate value range, a candidate value range parameter, and a candidate value. Data, a set of candidate values, and the like.
The original data 254 is original data acquired from experimental equipment using a measurement method with a detection limit. As the original data 254, for example, quantification data of the gene expression level related to the transcriptome, mass measurement data such as proteins and low molecular organic compounds, and the like can be used. Examples of the quantification data of gene expression level include the number of RNAseq reads, DNA microarray signal intensity, RT-PCR expression level value, real time corresponding to the RNA of the gene to be analyzed and the amount of translated protein. The RT-PCR fluorescence value (fluorescence intensity ratio), the number of tags measured by the SAGE method (Serial Analysis of Gene Expression), derived data thereof, and the like can be used. As mass measurement data having a detection limit, various mass measurement data such as LC-MS data can also be used. Furthermore, data of various biological and non-biological values obtained as a set of numerical values whose measurement values are determined can be used as the original data 254.

測定データソート部２１０は、実験機器から取得した元データ２５４を、大きさによりソートする等の初期化を行って測定データ２５１として記憶部１１０に記憶する。
また、測定データソート部２１０は、測定データ２５１から、候補値の範囲を決定するため、最大値、最小値、平均、メディアン、分散等の統計的な値を算出する。 The measurement data sorting unit 210 performs initialization such as sorting the original data 254 acquired from the experimental equipment by size, and stores the data as measurement data 251 in the storage unit 110.
In addition, the measurement data sorting unit 210 calculates statistical values such as a maximum value, a minimum value, an average, a median, and a variance from the measurement data 251 in order to determine a range of candidate values.

候補値範囲設定部２２０は、測定データ２５１の本来の完全なデータ数として想定される理論値（以下、「候補値」という。）の範囲を設定する部位である。
また、候補値範囲設定部２２０は、候補値評価部２４０の評価に従って新たな候補値の範囲を設定する。この候補値の範囲の設定については後述する。 The candidate value range setting unit 220 is a part that sets a range of theoretical values (hereinafter referred to as “candidate values”) that are assumed as the original complete data number of the measurement data 251.
The candidate value range setting unit 220 sets a new candidate value range in accordance with the evaluation of the candidate value evaluation unit 240. The setting of the range of candidate values will be described later.

分布作成部２２５は、設定データ２５３の想定される分布の種類の設定に従って、設定された候補値に対応する分布の値（理論値）のセットを、分布データ２５２として作成する。ここでは、分布作成部２２５は、所定のｚスコアにおける分布の理論値のセットを算出し、分布データ２５２に記憶することができる。
たとえば、分布作成部２２５は、測定データ２５１が正規分布と想定される場合、標準正規分布を用意して理論値のセットを算出し、分布データ２５２に記憶する。つまり、分布作成部２２５は、例えば、正規分布ならμ＝０，σ＝１の分布を用意する。
また、例えば、測定データ２５１が対数正規分布と想定される場合、分布作成部２２５は、作成した正規分布を対数正規分布に変換して、対数正規分布の値のセットを理論値のセットとして分布データ２５２に記憶する。
なお、分布作成部２２５は、乱数等を用いて、候補値に対応した正規分布の値のセットを直接、作成することもできる。この場合、分布作成部２２５は、作成した分布の値のセットをソートして、分布データ２５２として記憶することもできる。
また、分布作成部２２５は、想定される分布の種類に係る分布関数や確率密度関数自体を分布データ２５２に記憶し、各候補値における所定のｚスコアにおける分布の理論値をその都度算出してもよい。 The distribution creation unit 225 creates a set of distribution values (theoretical values) corresponding to the set candidate values as the distribution data 252 according to the assumed distribution type setting of the setting data 253. Here, the distribution creation unit 225 can calculate a set of theoretical values of the distribution at a predetermined z score and store it in the distribution data 252.
For example, when the measurement data 251 is assumed to be a normal distribution, the distribution creation unit 225 prepares a standard normal distribution, calculates a set of theoretical values, and stores the set in the distribution data 252. That is, for example, the distribution creating unit 225 prepares a distribution of μ = 0 and σ = 1 for a normal distribution.
Further, for example, when the measurement data 251 is assumed to be a log normal distribution, the distribution creation unit 225 converts the created normal distribution to a log normal distribution, and distributes a set of log normal distribution values as a set of theoretical values. Store in data 252.
The distribution creation unit 225 can also directly create a set of normal distribution values corresponding to the candidate values using random numbers or the like. In this case, the distribution creating unit 225 can sort the created set of distribution values and store them as distribution data 252.
In addition, the distribution creation unit 225 stores the distribution function related to the assumed distribution type and the probability density function itself in the distribution data 252, and calculates the theoretical value of the distribution at a predetermined z score for each candidate value each time. Also good.

候補値描画部２３０は、分布データ２５２と測定データ２５１との散布図を作成する部位である。
候補値描画部２３０は、例えば、用意した分布データ２５２の分布の値のセットと、測定データ２５１を各候補値で標準化した値のセットとを、それぞれ大きい順に対応させ、散布図を作成、描画する。 The candidate value drawing unit 230 is a part that creates a scatter diagram of the distribution data 252 and the measurement data 251.
For example, the candidate value drawing unit 230 creates and draws a scatter diagram by associating a set of distribution values of the prepared distribution data 252 with a set of values obtained by standardizing the measurement data 251 with each candidate value in descending order. To do.

候補値描画部２３０は、本実施形態においては、データが正規分布である場合に、いわゆるｚスコアを算出し、推定された母数を用いた標準化を行う。すなわち、下記の式（１）により、ｚスコアを算出し標準化する：

ｚ＝（測定値−ｍ）／ｓ …… 式（１）

ただし、上述の式（１）において、ｍは母平均μの推定値、ｓは母標準偏差の推定値である。 In this embodiment, the candidate value drawing unit 230 calculates a so-called z-score and performs standardization using the estimated parameter when the data has a normal distribution. That is, the z-score is calculated and standardized by the following equation (1):

z = (measured value−m) / s (1)

In the above equation (1), m is an estimated value of the population mean μ, and s is an estimated value of the population standard deviation.

候補値評価部２４０は、分布データ２５２と測定データ２５１との関係性を評価する部位である。
候補値評価部２４０は、ソートされた測定データ２５１と、設定された候補値に係る分布データ２５２との一致を、「直線性」または「順位の理論値−ソートデータ間の距離」等で評価する。
候補値評価部２４０は、例えば、直線性の評価関数として、ピアソンの積率相関係数ｒ値を算出して評価することができる。
また、候補値評価部２４０は、ソートデータ間の距離として、分布データ２５２と測定データ２５１の各順位の差の絶対値の総和や平均、二乗和の総和や平均等のマハラノビス汎距離を用いることができる。
候補値評価部２４０は、もっとも評価の高い候補値を、母集団の大きさの推定値として確定する。
また、候補値評価部２４０は、母集団の大きさの推定値を使って、母集団の分布の特性を示す母数を算出することもできる。 The candidate value evaluation unit 240 is a part that evaluates the relationship between the distribution data 252 and the measurement data 251.
The candidate value evaluation unit 240 evaluates the match between the sorted measurement data 251 and the distribution data 252 related to the set candidate value by “linearity”, “theoretical ranking value—the distance between the sorted data” or the like. To do.
The candidate value evaluation unit 240 can calculate and evaluate the Pearson product-moment correlation coefficient r value, for example, as a linearity evaluation function.
Further, the candidate value evaluation unit 240 uses the Mahalanobis general distance such as the sum and average of absolute values of the differences between the ranks of the distribution data 252 and the measurement data 251 and the sum and average of the sum of squares as the distance between the sort data. Can do.
The candidate value evaluation unit 240 determines the candidate value with the highest evaluation as an estimated value of the size of the population.
In addition, the candidate value evaluation unit 240 can also calculate a parameter indicating the distribution characteristics of the population using the estimated size of the population.

〔母集団大きさ推定処理〕
次に、図２〜図３を参照して、本発明の実施の形態に係る母集団大きさ推定方法を実行する母集団大きさ推定処理について説明する。
ここで、本実施形態の母集団大きさ推定処理では、被測定物の種類の数を「母集団の大きさ」として推定する。この「母集団の大きさ」として、例えば、電気泳動の一種であるＳＤＳ−ＰＡＧＥの実験データの場合、疎水性が大きいタンパクや、電荷の強いタンパクはそもそも電気泳動されず、検出できないため、母集団に入れないことが好適である。すなわち、この例では、ＳＤＳ−ＰＡＧＥで検出することが原理的に可能なタンパクの全てが母集団となる。
つまり、被測定物として多くの種類があり、そのうちいくらかを所定の実験手法方法で測定可能である場合、その測定可能な種類の数が母集団の大きさとなる。 [Population size estimation process]
Next, a population size estimation process for executing the population size estimation method according to the embodiment of the present invention will be described with reference to FIGS.
Here, in the population size estimation processing according to the present embodiment, the number of types of objects to be measured is estimated as “the size of the population”. As the “population size”, for example, in the case of SDS-PAGE experimental data, which is a kind of electrophoresis, proteins with high hydrophobicity or proteins with high charge are not electrophoresed in the first place and cannot be detected. It is preferred not to enter the group. That is, in this example, all the proteins that can be detected by SDS-PAGE in principle are the population.
That is, when there are many types of objects to be measured and some of them can be measured by a predetermined experimental method, the number of types that can be measured becomes the size of the population.

本実施形態の母集団大きさ推定処理では、全体の総データ数ｎを未知数とおいて、この総データ数ｎに様々な候補値を仮に代入しながら、最もふさわしい候補値を探していく。この候補値のふさわしさの確認のためには、理論値とデータの一致を相関関係又は線形関係に係る関数により評価する。この際、相関関係又は線形関係としては、「直線性」又は「順位の理論値−ソートデータ間の距離」等で評価する。このソートデータ間の距離としては、マハラノビス距離や「ノルム」等を用いることができる。
ここで、理論値と評価された候補値とをＱＱプロット（Ｑｕａｎｔｉｌｅ−Ｑｕａｎｔｉｌｅｐｌｏｔ）の変形である散布図等で描画した場合に、直線に近づくほど評価が高く、総データ数ｎに対して最もふさわしい値となる。すなわち、望ましい「母集団の大きさ」の推定結果となる。 In the population size estimation process of the present embodiment, the total number of data n is an unknown, and the most suitable candidate value is searched for while substituting various candidate values for the total number of data n. In order to confirm the suitability of the candidate value, the coincidence between the theoretical value and the data is evaluated by a function related to the correlation or linear relationship. In this case, the correlation or linear relationship is evaluated by “linearity”, “theoretical value of rank—distance between sorted data”, or the like. As the distance between the sorted data, Mahalanobis distance, “norm”, or the like can be used.
Here, when the theoretical value and the evaluated candidate value are drawn in a scatter diagram or the like which is a modification of the QQ plot (Quantile-Quantile plot), the evaluation becomes higher as the line approaches, and the evaluation is highest for the total number of data n The value is appropriate. That is, it is an estimation result of a desirable “population size”.

つまり、本実施形態では、想定される分布の種類が分かっており、検出限界がある測定方法で得られた測定データ２５１の母集団の大きさを推定する。これにより、測定データ２５１の標準化を行う。このために、測定データ２５１の本来の完全なデータ数として想定される総データ数ｎを未知とし、その値を解析的に解く処理を行う。
より具体的に説明すると、まず解析装置１０は、測定データソート部２１０を用いて、元データ２５４をソートし、測定データ２５１を作成する。
次に、解析装置１０は、候補値範囲設定部２２０を用いて、総データ数ｎの候補値の範囲を決定する。この上で、解析装置１０は、候補値範囲設定部２２０を用いて、範囲内から複数の候補値を用意する。
そして、解析装置１０は、候補値描画部２３０により、想定される分布である分布データ２５２と、範囲内の各候補値で標準化した測定データ２５１とを、それぞれ大きい順に対応させ、ＱＱプロットの変形である散布図を描画する。
次に、解析装置１０は、候補値評価部２４０の評価関数により、この総データ数ｎに係る理論値と測定データ２５１データとの一致を評価する。なお、散布図を描画せずに、直接データ間で一致を評価することもできる。
解析装置１０は、候補値範囲設定部２２０により、範囲内の端が最適な候補値と算出されたり、用意した候補値の精度が低かったりした場合、候補値の範囲を調整しつつ、計算を繰り返す。そして、解析装置１０は、十分な精度が得られたら候補値を確定する。この確定した候補値が、測定データ２５１の本来の完全な総データ数ｎの期待値となる。
以下で、図２のフローチャートを参照して、各母集団大きさ推定処理について、ステップ毎に詳しく説明する。 That is, in this embodiment, the type of distribution assumed is known, and the size of the population of measurement data 251 obtained by a measurement method with a detection limit is estimated. Thereby, the measurement data 251 is standardized. For this purpose, the total data number n assumed as the original complete data number of the measurement data 251 is unknown, and the value is analytically solved.
More specifically, the analysis apparatus 10 first sorts the original data 254 using the measurement data sorting unit 210 to create measurement data 251.
Next, the analysis apparatus 10 uses the candidate value range setting unit 220 to determine a range of candidate values for the total number of data n. Then, the analysis apparatus 10 uses the candidate value range setting unit 220 to prepare a plurality of candidate values from within the range.
Then, the analysis apparatus 10 causes the candidate value drawing unit 230 to correspond the distribution data 252 that is an assumed distribution and the measurement data 251 standardized with each candidate value in the range in order of increasing, and transform the QQ plot. Draw a scatter diagram.
Next, the analysis apparatus 10 evaluates the coincidence between the theoretical value related to the total number of data n and the measurement data 251 data using the evaluation function of the candidate value evaluation unit 240. It is also possible to directly evaluate coincidence between data without drawing a scatter diagram.
When the candidate value range setting unit 220 calculates the end of the range as the optimal candidate value or the prepared candidate value has low accuracy, the analysis apparatus 10 performs the calculation while adjusting the range of the candidate value. repeat. Then, the analysis device 10 determines the candidate value when sufficient accuracy is obtained. This determined candidate value becomes the expected value of the original number n of total complete data of the measurement data 251.
Hereinafter, with reference to the flowchart of FIG. 2, each population size estimation process will be described in detail for each step.

（ステップＳ１０１）
制御部１００は、測定データソート部２１０を用いて、データ読み込みソート処理を行う。
この処理においては、まず、制御部１００は、検出限界がある測定方法で得られた元データ２５４を取得する。この元データ２５４は、他の記録媒体から、記憶部１１０に記憶されていてもよい。また、元データ２５４は、入出力部１５０を介して、ネットワーク上の他のサーバや実験機器等（図示せず）から取得してもよい。
制御部１００は、取得した元データ２５４の発現量等、標準化したい値のセットを、大きさにより、小さい順又は大きい順になるよう並べ替え（ソート）する。
そして、制御部１００は、ソートされた値のセットを、測定データ２５１として記憶部１１０に記憶する。
また、制御部１００は、測定データ２５１の最大値、最小値、平均、分散（標準偏差）等の統計的な値を求め、記憶部１１０に記憶する。
また、図３（ａ）を参照すると、制御部１００は、測定データ２５１を標準正規分布とみて標準化した場合の、ｚスコアを横軸、値の頻度を縦軸としたヒストグラムを描画し、又は度数分布表を作成して記憶部１１０に記憶することもできる。図３（ａ）は、図９（ｂ）と同様の正規分布のデータを描画した例を示している。 (Step S101)
The control unit 100 performs a data reading sort process using the measurement data sort unit 210.
In this process, first, the control unit 100 acquires original data 254 obtained by a measurement method having a detection limit. The original data 254 may be stored in the storage unit 110 from another recording medium. Further, the original data 254 may be acquired from another server on the network, experimental equipment, or the like (not shown) via the input / output unit 150.
The control unit 100 rearranges (sorts) a set of values to be standardized, such as the expression level of the acquired original data 254, in order of decreasing or increasing order according to size.
Then, the control unit 100 stores the sorted set of values in the storage unit 110 as measurement data 251.
Further, the control unit 100 obtains statistical values such as the maximum value, minimum value, average, and variance (standard deviation) of the measurement data 251 and stores them in the storage unit 110.
3A, the control unit 100 draws a histogram with the z score as the horizontal axis and the frequency of values as the vertical axis when the measurement data 251 is standardized as a standard normal distribution, or A frequency distribution table may be created and stored in the storage unit 110. FIG. 3A shows an example in which normal distribution data similar to FIG. 9B is drawn.

データの並べ替えに関しては、例えば、統計用プログラミング言語である「Ｒ言語」を用いた場合、下記のようなコードを用いることができる：

sorted_data <- sort(data, na.last=T, decreasing=T)

ここでは、配列ｄａｔａの並び替えを行い、配列ｓｏｒｔｅｄ＿ｄａｔａに代入している。 Regarding the rearrangement of data, for example, when the “R language” that is a statistical programming language is used, the following codes can be used:

sorted_data <-sort (data, na.last = T, decreasing = T)

Here, the array data is rearranged and assigned to the array sorted_data.

（ステップＳ１０２）
次に、制御部１００は、候補値範囲設定部２２０を用いて、候補値範囲初期設定処理を行う。
この処理では、制御部１００は、測定データ２５１を用いて、測定データ２５１の総データ数ｎの推定値となる「候補値」の範囲の初期値を決定する。
制御部１００は、この候補値の範囲として、まず、設定データ２５３から、想定される分布の種類を取得する。この上で、測定データ２５１の統計的な値を用いて、想定される分布からあり得る範囲を初期値として設定し、設定データ２５３に記憶する。
この候補値の範囲としては、例えば、測定データ２５１を標準化した場合には、ｚスコアで−４から−１までの分布になるような値の積分値を算出することができる。また、データの種類毎に所定値を用意したり、以前に計測した同種のデータから、候補値の範囲の初期値を用意してもよい。 (Step S102)
Next, the control unit 100 uses the candidate value range setting unit 220 to perform candidate value range initial setting processing.
In this process, the control unit 100 uses the measurement data 251 to determine an initial value of a “candidate value” range that is an estimated value of the total number n of measurement data 251.
As the range of candidate values, the control unit 100 first acquires an assumed distribution type from the setting data 253. Then, using a statistical value of the measurement data 251, a possible range from the assumed distribution is set as an initial value and stored in the setting data 253.
As the range of candidate values, for example, when the measurement data 251 is standardized, an integral value of values that have a distribution of −4 to −1 in z score can be calculated. In addition, a predetermined value may be prepared for each type of data, or an initial value in a range of candidate values may be prepared from the same type of data measured previously.

（ステップＳ１０３）
制御部１００は、候補値範囲設定部２２０を用いて、候補値用意処理を行う。
ここでは、制御部１００は、設定された候補値の範囲から、候補値を複数個用意して記憶部１１０の設定データ２５３に記憶する。
制御部１００は、例えば、設定された候補値の範囲を所定値で等分して、この所定値＋１の値分の候補値を用意できる。つまり、上述の例の場合、この所定値が１００の場合、制御部１００は、０〜６５０の範囲を１００等分して、１０１個の候補値を用意することができる。
なお、測定データ２５１がリード数のように整数の場合は、総データ数ｎを整数と仮定し、整数の候補値を用意することもできる。 (Step S103)
The control unit 100 uses the candidate value range setting unit 220 to perform candidate value preparation processing.
Here, the control unit 100 prepares a plurality of candidate values from the set range of candidate values and stores them in the setting data 253 of the storage unit 110.
For example, the control unit 100 can equally divide the set range of candidate values by a predetermined value and prepare candidate values for the predetermined value + 1. That is, in the case of the above example, when the predetermined value is 100, the control unit 100 can prepare 101 candidate values by dividing the range of 0 to 650 into 100 equal parts.
If the measurement data 251 is an integer such as the number of leads, the total number of data n is assumed to be an integer, and integer candidate values can be prepared.

この候補値用意処理について、Ｒ言語のコードとして、例えば、９０００〜１１０００の範囲を、２０刻みで１０１個の候補値を用意する場合、下記のようなコードを用いることができる：

n_candidates <- seq(from=9000, to=11000, by=20)

ここで、n_candidatesは、候補値の配列である。
また、後述する直線性評価処理において、散布図の直線性を、例えばピアソンのｒ値を評価関数の値として算出し、判定する。このため、各候補値に係る評価関数の値を記憶するためのベクターｒｓも、下記のＲ言語のコードのように用意する：

rs <- 1:101 + NA For this candidate value preparation process, for example, in the case of preparing 101 candidate values in increments of 20 in the range of 9000 to 11000 as R language codes, the following codes can be used:

n_candidates <-seq (from = 9000, to = 11000, by = 20)

Here, n_candidates is an array of candidate values.
Further, in the linearity evaluation process described later, the linearity of the scatter diagram is determined by calculating, for example, the Pearson r value as the value of the evaluation function. For this reason, a vector rs for storing the value of the evaluation function related to each candidate value is also prepared like the following R language code:

rs <-1: 101 + NA

（ステップＳ１０４）
制御部１００は、分布作成部２２５を用いて、分布用意処理を行う。
この処理において、まず、制御部１００は、用意された候補値から、ひとつの候補値を選択する。
次に、制御部１００は、設定データ２５３を参照し、測定データ２５１として想定される分布の理論値のセットを用意する。
制御部１００は、正規分布なら、例えば、Ｎ（０，１）、すなわち平均０で分散σ²＝１の標準正規分布を用意し、分布データ２５２として、記憶部１１０に記憶する。
また、制御部１００は、異なる分布モデルを使う場合には、それぞれの標準状態のものを使う。制御部１００は、例えば、対数正規分布なら、γ＝０、μ＝０、σ＝１の対数正規分布を用意し、分布データ２５２として、記憶部１１０に記憶する。 (Step S104)
The control unit 100 uses the distribution creation unit 225 to perform distribution preparation processing.
In this process, first, the control unit 100 selects one candidate value from the prepared candidate values.
Next, the control unit 100 refers to the setting data 253 and prepares a set of theoretical values of distribution assumed as the measurement data 251.
For the normal distribution, the control unit 100 prepares, for example, a standard normal distribution with N (0, 1), that is, an average of 0 and a variance σ ² = 1, and stores the data as distribution data 252 in the storage unit 110.
Moreover, the control part 100 uses the thing of each standard state, when using a different distribution model. For example, in the case of a lognormal distribution, the control unit 100 prepares a lognormal distribution with γ = 0, μ = 0, and σ = 1, and stores the logarithmic normal distribution in the storage unit 110 as distribution data 252.

（ステップＳ１０５）
制御部１００は、候補値描画部２３０を用いて、散布図描画処理を行う。
具体的には、まず、制御部１００は、測定データ２５１と、用意した分布である分布データ２５２とを、それぞれ大きい順に対応させて散布図を描く。
ここで、従来のＱＱプロットでは、それぞれ同じ分位点にあるデータを対応させた図である。これに対して、制御部１００は、同じ順位にある数を対応させた図を、本実施形態の散布図として描画する。
このため、本実施形態の散布図は、最も低い領域にあっては、理論値又はソートしたデータが、対応させる相手を持たないという状況が起こる。この場合、制御部１００は、対応がつくペアだけを用いて描画する。 (Step S105)
The control unit 100 uses the candidate value drawing unit 230 to perform a scatter diagram drawing process.
Specifically, first, the control unit 100 draws a scatter diagram by associating the measurement data 251 with the prepared distribution data 252 in descending order.
Here, in the conventional QQ plot, the data at the same quantile are associated with each other. On the other hand, the control part 100 draws the figure which matched the number in the same order | rank as a scatter diagram of this embodiment.
For this reason, in the scatter diagram of the present embodiment, in the lowest region, a situation occurs in which the theoretical value or the sorted data does not have a counterpart to be associated. In this case, the control unit 100 draws using only a pair that can be matched.

図３（ｂ）〜（ｄ）を参照して詳細に説明すると、制御部１００は、ある確率ｐを与えたときに、２つの確率点（ｑｕａｎｔｉｌｅ）となるｑ１とｑ２とを、例えば、それぞれＸ軸（横軸）、Ｙ軸（縦軸）にとってプロットした散布図（確率プロット）を作成する。ここでは、分布データ２５２をｑ１として描画する。また、制御部１００は、測定データ２５１を、選択された候補値により標準化し、ｑ２として描画する。これは、ＱＱプロット（「Ｇｎａｎａｄｅｓｉｋａｎ，Ｒ．；Ｗｉｌｋ，Ｍ．Ｂ．（１９６８）， "Ｐｒｏｂａｂｉｌｉｔｙｐｌｏｔｔｉｎｇｍｅｔｈｏｄｓｆｏｒｔｈｅａｎａｌｙｓｉｓｏｆｄａｔａ"，Ｂｉｏｍｅｔｒｉｋａ５５（１）：１〜１７」を参照）の変形となる。つまり、この散布図では、Ｘ軸の座標に用意した分布の理論値である分布データ２５２を、Ｙ軸の座標にソートしたデータである測定データ２５１をとって比較している。この際、分布データ２５２を大きい順に、また測定データ２５１を大きい順に対応させて、その関係を調べている。 If it demonstrates in detail with reference to FIG.3 (b)-(d), when the control part 100 gives a certain probability p, q1 and q2 used as two probability points (quantile) will be each, for example, respectively. A scatter diagram (probability plot) plotted for the X axis (horizontal axis) and the Y axis (vertical axis) is created. Here, the distribution data 252 is drawn as q1. In addition, the control unit 100 standardizes the measurement data 251 with the selected candidate value and renders it as q2. This is a QQ plot (see Gnanadesikan, R .; Wilk, MB (1968), "Probability plotting methods for the analysis of data", Biometrica 55 (1): variants 1 to 17). . That is, in this scatter diagram, the distribution data 252 that is the theoretical value of the distribution prepared for the X-axis coordinates is compared with the measurement data 251 that is the data sorted on the Y-axis coordinates. At this time, the distribution data 252 is correlated in ascending order and the measurement data 251 is correlated in ascending order to examine the relationship.

たとえば、この散布図を表示するＲ言語のコードとしては、下記のようなコードを用いることができる：

ideal <- -qnorm(ppoints(n_candidates[i]))
s_limit <- min(length(data[!is.na(data)]), n_candidates[i])
plot(sorted_data[1:s_limit], ideal[1:s_limit]) For example, the following code can be used as an R language code for displaying the scatter diagram:

ideal <--qnorm (ppoints (n_candidates [i]))
s_limit <-min (length (data [! is.na (data)]), n_candidates [i])
plot (sorted_data [1: s_limit], ideal [1: s_limit])

図３を再び参照すると、図３（ｂ）は、図３（ａ）のデータにおいて、候補値が７０００の際の散布図の例を示している。
図３（ｃ）は、図３（ｂ）と同様に、候補値が１００００の際の散布図の例を示している。
図３（ｄ）は、図３（ｂ）（ｃ）と同様に、候補値が１４０００の際の散布図の例を示している。 Referring back to FIG. 3, FIG. 3B shows an example of a scatter diagram when the candidate value is 7000 in the data of FIG.
FIG. 3C shows an example of a scatter diagram when the candidate value is 10,000, as in FIG.
FIG. 3D shows an example of a scatter diagram when the candidate value is 14000, similarly to FIGS. 3B and 3C.

なお、制御部１００は、分布作成部２２５を用いて、上述の分布用意処理において、選択された候補値に対応する分布の値のセットを（疑似）乱数列から直接作成し、これを分布データ２５２として記憶することもできる。この際、制御部１００は、正規分布を、ボックス＝ミューラー法（Ｂｏｘ−Ｍｕｌｌｅｒｔｒａｎｓｆｏｒｍ）や、中心極限定理を用いた方法等を用いて作成可能である。また、制御部１００は、対数正規分布も、正規分布から作成することができる。
また、制御部１００は、候補値描画部２３０を用いて、このような分布をもつ乱数列である分布データ２５２と、測定データ２５１とを対応させた散布図を、直接描画するような処理も可能である。
また、制御部１００は、この散布図を描画せずに、直接下記の候補値評価処理を行うこともできる。 Note that the control unit 100 uses the distribution creation unit 225 to directly create a set of distribution values corresponding to the selected candidate value from the (pseudo) random number sequence in the above-described distribution preparation process, and generate this distribution data. It can also be stored as 252. At this time, the control unit 100 can create a normal distribution by using a Box = Muller transform, a method using a central limit theorem, or the like. The control unit 100 can also create a lognormal distribution from the normal distribution.
The control unit 100 also uses the candidate value drawing unit 230 to perform a process of directly drawing a scatter diagram in which the distribution data 252 that is a random number sequence having such a distribution and the measurement data 251 are associated with each other. Is possible.
Moreover, the control part 100 can also perform the following candidate value evaluation process directly, without drawing this scatter diagram.

（ステップＳ１０６）
制御部１００は、候補値評価部２４０を用いて、候補値評価処理を行う。
この処理において、制御部１００は、分布データ２５２と測定データ２５１との一致について、相関関係又は線形関係に係る関数で評価する。この相関関係又は線形関係に係る関数としては、「直線性」又は「順位の理論値−ソートデータ間の距離」等に係る所定の評価関数を用いることができる。
直線性により評価する場合、この評価関数として、制御部１００は、例えば、ピアソンのｒ値を算出して用いることができる。
また、制御部１００は、このピアソンのｒ値の算出の際、正規分布の場合には、μとσを、それぞれ直線のｙ切片と傾きとして推定することもできる。 (Step S106)
The control unit 100 uses the candidate value evaluation unit 240 to perform candidate value evaluation processing.
In this process, the control unit 100 evaluates the coincidence between the distribution data 252 and the measurement data 251 with a function related to a correlation or linear relationship. As a function related to this correlation or linear relationship, a predetermined evaluation function related to “linearity” or “theoretical value of rank—the distance between sorted data” or the like can be used.
When evaluating by linearity, as this evaluation function, the control part 100 can calculate and use Pearson's r value, for example.
Further, when calculating the Pearson r value, the control unit 100 can also estimate μ and σ as the y-intercept and the slope of a straight line, respectively, in the case of a normal distribution.

図３を再び参照すると、図３（ｂ）の散布図では、ピアソンのｒ値は、０．９８４４３６３であった。また、標準化した測定データ２５１のμとσの推定値は（０．４６，０．７１）であった。
図３（ｃ）では、ピアソンのｒ値は０．９９９８９９６であり、標準化に用いた測定データ２５１のμとσの推定値は（−０．０１，０．９９）であった。
図３（ｄ）では、ピアソンのｒ値は０．９９８１４１６であり、標準化に用いた測定データ２５１のμとσの推定値は（−０．４３，１．１４）であった。
このように、図３（ａ）のデータは、図３（ｃ）のように候補値が１００００の際に、ピアソンのｒ値が大きく、直線性が高くなり、測定データ２５１の総データ数ｎの推定値として良好となることが分かる。 Referring again to FIG. 3, in the scatter diagram of FIG. 3 (b), Pearson's r value was 0.9844363. The estimated values of μ and σ of the standardized measurement data 251 were (0.46, 0.71).
In FIG. 3C, Pearson's r value was 0.9998996, and the estimated values of μ and σ of the measurement data 251 used for standardization were (−0.01, 0.99).
In FIG. 3D, Pearson's r value was 0.9981416, and the estimated values of μ and σ of the measurement data 251 used for standardization were (−0.43, 1.14).
As described above, the data of FIG. 3A has a large Pearson r value and high linearity when the candidate value is 10,000 as shown in FIG. It can be seen that the estimated value is good.

なお、制御部１００は、分布データ２５２と測定データ２５１とを「順位の理論値−ソートデータ間の距離」により評価する場合、この「距離」として、各順位の値の差の絶対値の総和や平均、二乗和の総和や平均等のマハラノビス汎距離等を算出して用いることができる。 When the control unit 100 evaluates the distribution data 252 and the measurement data 251 based on “theoretical value of the rank—the distance between the sort data”, the total of the absolute values of the differences between the values of the ranks is used as the “distance”. And Mahalanobis general distance such as average, sum of squares, average, etc. can be calculated and used.

（ステップＳ１０７）
制御部１００は、候補値評価部２４０を用いて、範囲内で用意されたすべての候補値を評価したか否かを判定する。
Ｙｅｓ、すなわち全ての候補値を評価し終わった場合、制御部１００は、処理をステップＳ１０８に進める。
Ｎｏ、すなわち範囲内で候補値がある場合、制御部１００は、処理をステップＳ１０４に戻して、分布の用意、散布図の描画、直線性の評価を続ける。この際、例えば、候補値の範囲内で、大きい値、小さい値について交互に評価を行い、徐々にもっとも確からしい候補値を求めるように範囲を狭め、計算を高速に行っても良い。また、ニュートン法を用いてもよい。 (Step S107)
The control unit 100 uses the candidate value evaluation unit 240 to determine whether or not all candidate values prepared within the range have been evaluated.
If Yes, that is, if all candidate values have been evaluated, the control unit 100 advances the process to step S108.
No, that is, if there is a candidate value within the range, the control unit 100 returns the process to step S104, and continues preparation of distribution, drawing of a scatter diagram, and evaluation of linearity. At this time, for example, evaluation may be performed alternately for a large value and a small value within a range of candidate values, and the range may be narrowed to gradually obtain the most probable candidate value, and the calculation may be performed at high speed. Further, Newton's method may be used.

このような直線性の評価のＲ言語のコードとして、例えば、下記のようなコードを用いることができる：

for (i in 1:101){
n <- n_candidates[i]
ideal <- -qnorm(ppoints(n)) # 理想値
s_limit <- min(length(data[!is.na(data)]), n)
rs[i]<- cor(sorted_data[1:s_limit], ideal[1:s_limit]) } # Peasronのr値

このコードでは、上述のコードで用意したベクターｒｓについて、単純に全ての候補値を評価して代入する例を示している。
この上で、下記のようなコードを用いて、候補値の中から最適なものを選抜する。ここでは、制御部１００は、最も大きい評価値を与えた候補を選択する。

selected <- n_candidates[which(rs==max(rs))]

変数ｓｅｌｅｃｔｅｄは、ベクターｒｓ中の評価が最尤な候補値となる。 As an R language code for such linearity evaluation, for example, the following codes can be used:

for (i in 1: 101) {
n <-n_candidates [i]
ideal <--qnorm (ppoints (n)) # ideal value
s_limit <-min (length (data [! is.na (data)]), n)
rs [i] <-cor (sorted_data [1: s_limit], ideal [1: s_limit])} # Peasron r value

This code shows an example in which all candidate values are simply evaluated and substituted for the vector rs prepared in the above code.
On this basis, the optimum code is selected from the candidate values using the following code. Here, the control unit 100 selects a candidate that gives the largest evaluation value.

selected <-n_candidates [which (rs == max (rs))]

The variable selected is a candidate value that is most likely to be evaluated in the vector rs.

図３（ｅ）を参照すると、図３（ａ）とは別の測定データ２５１について、０〜６５０を候補値の範囲として、散布図を描いた際のピアソンのｒ値の変化の例を示している。この図３（ｅ）のＸ軸は各候補値、Ｙ軸はピアソンのｒ値である。
このように、制御部１００は、候補値の値を変化させながら、関係の直線性を、上述したようにピアソンのｒ値のような評価関数でモニターして、もっとも確からしい候補値を探すことができる。この関係は、単純に上に凸か下に凸かであると考えられる。
このため、もっとも確からしい候補値を求める作業に、ニュートン法を適用することができ、計算資源を節約することが可能である。 Referring to FIG. 3 (e), an example of a change in the Pearson r value when a scatter diagram is drawn with 0 to 650 as a range of candidate values for measurement data 251 different from FIG. 3 (a) is shown. ing. In FIG. 3E, the X-axis is each candidate value, and the Y-axis is Pearson's r value.
In this way, the control unit 100 monitors the relationship linearity with an evaluation function such as the Pearson r-value as described above while changing the value of the candidate value, and searches for the most likely candidate value. Can do. This relationship is considered to be simply convex upward or downward.
For this reason, Newton's method can be applied to the task of obtaining the most probable candidate value, and it is possible to save computational resources.

なお、もしこの図が上に凸でなければ、候補値が不適であると考えられるため、制御部１００は、後述する候補値範囲ずらし処理を行う。または、制御部１００は、候補をもっと狭い範囲から選択する候補値範囲狭め処理を行う。 If this figure is not convex upward, it is considered that the candidate value is inappropriate, and therefore the control unit 100 performs a candidate value range shifting process described later. Alternatively, the control unit 100 performs a candidate value range narrowing process for selecting candidates from a narrower range.

また、ピアソンのｒ値ではない、別の評価関数を用いて、上述の散布図を評価することもできる。
図３（ｆ）（ｇ）を参照すると、例えば、上述のＲ言語のコードのうち、ベクターｒｓに代入する箇所を下記のように変更することもできる：

rs[i]<- -mean((sorted_data[1:s_limit]-ideal[1:s_limit])^2) # 別の評価関数

図３（ｆ）のＸ軸は各候補値、Ｙ軸はこの別の評価関数の値である。このように、評価関数が異なると値が変わるため、データの特性に合わせて評価関数を選択することができる。
図３（ｇ）は、図３（ｃ）と同様に、この別の評価関数の値が最大値となった候補値を代入した変数ｓｅｌｅｃｔｅｄについて、下記コードで散布図を描画した例である：

ideal <- -qnorm(ppoints(selected))
s_limit <- min(length(data[!is.na(data)]), selected)
plot(sorted_data[1:s_limit], ideal[1:s_limit]) It is also possible to evaluate the above scatter plot using another evaluation function that is not the Pearson r value.
Referring to FIGS. 3 (f) and 3 (g), for example, in the R language code described above, the place to be assigned to the vector rs can be changed as follows:

rs [i] <--mean ((sorted_data [1: s_limit] -ideal [1: s_limit]) ^ 2) # Another evaluation function

In FIG. 3F, the X-axis represents each candidate value, and the Y-axis represents the value of this other evaluation function. As described above, since the value changes depending on the evaluation function, the evaluation function can be selected in accordance with the data characteristics.
FIG. 3G is an example in which a scatter diagram is drawn with the following code for the variable selected to which the candidate value having the maximum value of the other evaluation function is substituted, as in FIG.

ideal <--qnorm (ppoints (selected))
s_limit <-min (length (data [! is.na (data)]), selected)
plot (sorted_data [1: s_limit], ideal [1: s_limit])

（ステップＳ１０８）
制御部１００は、候補値評価部２４０を用いて、用意した候補値の外側（端）により最適な候補値があるか否かを判定する。
ここでは、制御部１００は、上述した評価関数のグラフを、極値がほぼなく、単調増加／減少の場合、もっと最適な候補値がある、すなわちＹｅｓと判定する。つまり、制御部１００は、用意した候補値の端で、評価関数の値が最高値の場合、Ｙｅｓと判定する。それ以外の場合は、Ｎｏと判定する。
Ｙｅｓの場合、制御部１００は、処理をステップＳ１０９に進め、候補値の範囲を変更する。
Ｎｏの場合、制御部１００は、処理をステップＳ１１０に進める。 (Step S108)
The control unit 100 uses the candidate value evaluation unit 240 to determine whether or not there is an optimal candidate value on the outside (end) of the prepared candidate value.
Here, the control unit 100 determines that there is a more optimal candidate value when the graph of the evaluation function described above has almost no extreme value and monotonously increases / decreases, that is, Yes. That is, the control unit 100 determines Yes when the value of the evaluation function is the highest value at the end of the prepared candidate value. Otherwise, it is determined as No.
In the case of Yes, the control unit 100 proceeds with the process to step S109 and changes the range of candidate values.
In No, the control part 100 advances a process to step S110.

（ステップＳ１０９）
制御部１００は、候補値の範囲を変更する必要がある場合、候補値範囲設定部２２０を用いて、候補値範囲ずらし処理を行う。
ここでは、用意した候補値のいずれかの端が、評価関数の値が最高値であった場合、候補値の範囲をそちら側に調整してずらす。つまり、例えば、用意した候補値の範囲が、５００〜１０００であり、候補値１０００の評価関数の値が最大値であった場合、候補値の範囲を１００１〜１５０１に再設定するような処理を行う。
制御部１００は、その後処理をステップＳ１０３に戻して、ずらされた候補値の範囲で再度計算を進める。
なお、候補値の再設定の際、候補値の範囲を同一の幅でずらすのではなく、評価関数のグラフの増加率／減少率を基に、範囲の幅を設定することもできる。 (Step S109)
When the candidate value range needs to be changed, the control unit 100 uses the candidate value range setting unit 220 to perform candidate value range shifting processing.
Here, if the value of the evaluation function is the highest value at any end of the prepared candidate values, the range of candidate values is adjusted and shifted to that side. That is, for example, when the prepared candidate value range is 500 to 1000, and the value of the evaluation function of the candidate value 1000 is the maximum value, processing for resetting the candidate value range to 1001 to 1501 is performed. Do.
The control unit 100 then returns the process to step S103 and proceeds with the calculation again within the range of the shifted candidate values.
When resetting candidate values, the range of the candidate values can be set based on the rate of increase / decrease of the graph of the evaluation function, instead of shifting the range of the candidate values by the same width.

（ステップＳ１１０）
制御部１００は、候補値の範囲内にもっとも確からしい候補値があった場合、候補値評価部２４０を用いて、より精度が必要か否かを判定する。
制御部１００は、この精度については、例えば、用意された候補値の範囲の幅と、設定データ２５３の候補値の範囲のパラメータを参照し、より狭い幅で検索する必要がある場合にはＹｅｓと判定する。それ以外の場合、たとえば候補値が整数で用意され、評価した連続する整数内に最適な候補値があった場合には、Ｎｏと判定する。
Ｙｅｓの場合、制御部１００は、処理をステップＳ１１１に進める。
Ｎｏの場合、制御部１００は、処理をステップＳ１１２に進める。 (Step S110)
When there is a most probable candidate value within the range of candidate values, the control unit 100 uses the candidate value evaluation unit 240 to determine whether more accuracy is necessary.
For example, the control unit 100 refers to the prepared candidate value range width and the candidate value range parameter of the setting data 253, and if the search needs to be performed with a narrower width, Yes. Is determined. In other cases, for example, candidate values are prepared as integers, and if there is an optimum candidate value in the evaluated consecutive integers, it is determined as No.
In Yes, control part 100 advances a process to Step S111.
In No, the control part 100 advances a process to step S112.

（ステップＳ１１１）
制御部１００は、候補値範囲設定部２２０を用いて、候補値範囲狭め処理を行う。
制御部１００は、候補値の範囲を狭めて再度設定する。ここでは、制御部１００は、上述の直線性の評価関数の値が最高値であった候補値の前後の値を、新たな候補値の範囲と設定することができる。
制御部１００は、その後処理をステップＳ１０３に戻して、狭められた候補値の範囲で再度計算を進める。 (Step S111)
The control unit 100 uses the candidate value range setting unit 220 to perform candidate value range narrowing processing.
The control unit 100 narrows and sets the range of candidate values again. Here, the control unit 100 can set values before and after the candidate value having the highest value of the linearity evaluation function as a range of new candidate values.
The control unit 100 then returns the process to step S103 and proceeds with the calculation again within the narrowed range of candidate values.

（ステップＳ１１２）
制御部１００は、充分な精度で最適な候補値が得られた場合、候補値評価部２４０を用いて、候補値確定処理を行う。
ここでは、制御部１００は、十分な精度が得られたら候補値を確定する。
これが、測定データ２５１の本来の完全な総データ数ｎの期待値、すなわち母集団の大きさと推定される。 (Step S112)
When the optimal candidate value is obtained with sufficient accuracy, the control unit 100 uses the candidate value evaluation unit 240 to perform candidate value determination processing.
Here, the control unit 100 determines the candidate value when sufficient accuracy is obtained.
This is estimated as the expected value of the original complete data number n of the measurement data 251, that is, the size of the population.

（ステップＳ１１３）
次に、制御部１００は、候補値評価部２４０を用いて、確定した候補値を基に母数の推定を行い、この母数により測定データ２５１を標準化して記憶する。
この際に、制御部１００は、失われたデータ数を勘案しながら、例えば、直線近似を用いて算出することが可能である。
この際、制御部１００は、ソートデータと理論値を上位からの同順の値でもって対比させ、一次近似して切片と傾きを算出する。制御部１００は、算出された切片と傾きを用いて、母数を推定する。制御部１００は、この一次近似において、最小二乗法や、ロバストな別の方法等を用いることができる。 (Step S113)
Next, the control unit 100 uses the candidate value evaluation unit 240 to estimate a parameter based on the determined candidate value, and standardizes and stores the measurement data 251 using this parameter.
At this time, the control unit 100 can perform calculation using, for example, linear approximation while considering the number of lost data.
At this time, the control unit 100 compares the sort data and the theoretical value with values in the same order from the top, and performs a first-order approximation to calculate the intercept and the slope. The control unit 100 estimates the parameter using the calculated intercept and slope. The controller 100 can use a least square method, another robust method, or the like in this linear approximation.

このような母数の推定のためのパラメータの設定に関するＲ言語のコードとして、例えば、下記のようなコードを用いることができる：

u_limit <- selected-length(data[!is.na(data)])
params <- coef ( lm(sorted_data[u_limit: s_limit] ~ ideal[u_limit: s_limit]))

この上で、μとσの推定値は、それぞれ：

(mu <- params[1]) # μ
(sigma <- params[2]) # σ

のようなコードで算出できる。すなわち、正規分布の場合、この例のようにμが切片、σが傾きとなる。
制御部１００は、これらの算出された母数を用いて、測定データ２５１を標準化して記憶する。 For example, the following code can be used as the R language code for setting parameters for parameter estimation:

u_limit <-selected-length (data [! is.na (data)])
params <-coef (lm (sorted_data [u_limit: s_limit] ~ ideal [u_limit: s_limit]))

On top of this, the estimated values for μ and σ are:

(mu <-params [1]) # μ
(sigma <-params [2]) # σ

It can be calculated with a code such as That is, in the case of a normal distribution, μ is an intercept and σ is a slope as in this example.
The control unit 100 standardizes and stores the measurement data 251 using these calculated parameters.

（ステップＳ１１４）
制御部１００は、候補値評価部２４０を用いて、分析処理を行う。
次に、必要な場合、制御部１００は、実際の測定データと、他に標準化された測定データ２５１との比較を行う。たとえば、制御部１００は、同じ発現量データであるＲＮＡｓｅｑとＤＮＡマイクロアレイのデータを比較することができる。これにより、ＤＮＡマイクロアレイの各スポットの感度を、標準化されたＲＮＡｓｅｑの遺伝子発現量で補正して求めることが可能となる。
また、制御部１００は、主成分分析等のデータ解析を行うこともできる。
以上により母集団大きさ推定処理を終了する。 (Step S114)
The control unit 100 uses the candidate value evaluation unit 240 to perform analysis processing.
Next, if necessary, the control unit 100 compares the actual measurement data with other standardized measurement data 251. For example, the control unit 100 can compare RNAseq, which is the same expression level data, with data of the DNA microarray. As a result, the sensitivity of each spot of the DNA microarray can be obtained by correcting with the normalized gene expression level of RNAseq.
The control unit 100 can also perform data analysis such as principal component analysis.
The population size estimation process is thus completed.

以上のように構成することで、以下のような効果を得ることができる。
上述したように、検出限界のためにデータを失う測定方法のデータでは、母数の正確な推定は困難であった。
これに対して、本発明の実施の形態に係る母集団大きさ推定方法は、検出限界がある測定方法の測定データ２５１の分布モデルと、想定される分布との関係から母集団の大きさである総データ数ｎを推定し、μやσのような分布の母数も求めることもできる。このため、標準化した値の精度と確度が高まる。特に、対数正規分布をする測定データ２５１においては、総データ数ｎは、母数を正確に求める際に重要である。本発明はこれにより、検出限界がある測定方法のデータについて、標準化を正確に行うことができる。
すなわち、本実施形態の母集団大きさ推定方法により母集団の大きさを推定して標準化することで、ＲＮＡｓｅｑ、メタボローム解析、ｑＰＣＲを用いるマイクロアレイ等のトランスクリプトーム解析、次世代シーケンサーのデータ分析等を、従来より正確に行うことができる。
また、測定では検出限界が変化することがあるが、本実施形態に係る母集団大きさ推定方法は、こうした場合でも、標準化した値を比較することができる。よって、例えば、有意差が得られる測定項目を増やすことができる。
また、本発明の実施の形態に係る母集団大きさ推定方法で、疾病等に係るトランスクリプトームのデータを標準化して、主成分分析等に用いることで、より正確な分析を行うことができる。 With the configuration described above, the following effects can be obtained.
As described above, it is difficult to accurately estimate the parameter with the data of the measurement method that loses data due to the detection limit.
On the other hand, the population size estimation method according to the embodiment of the present invention uses the size of the population based on the relationship between the distribution model of the measurement data 251 of the measurement method having a detection limit and the assumed distribution. A total number n of data can be estimated, and a population parameter such as μ or σ can be obtained. This increases the accuracy and accuracy of standardized values. In particular, in the measurement data 251 having a lognormal distribution, the total number of data n is important in accurately obtaining the parameter. Accordingly, the present invention can accurately standardize the data of the measuring method having a detection limit.
That is, by estimating and standardizing the population size by the population size estimation method of the present embodiment, RNAseq, metabolome analysis, transcriptome analysis such as microarray using qPCR, data analysis of next generation sequencer, etc. Can be performed more accurately than in the past.
Although the detection limit may change in measurement, the population size estimation method according to the present embodiment can compare standardized values even in such a case. Therefore, for example, it is possible to increase the measurement items from which a significant difference is obtained.
In addition, with the population size estimation method according to the embodiment of the present invention, more accurate analysis can be performed by standardizing transcriptome data relating to diseases and the like and using it for principal component analysis or the like. .

また、従来、ＤＮＡマイクロアレイのそれぞれのスポットの感度は、測定方法がなかった。そのため、異なる設計のＤＮＡマイクロアレイのデータは、たとえ同じ遺伝子を測っていたとしても、直接に比較することが困難であった。これは、遺伝子のどの配列を測定するかによって、プローブに用いられる塩基配列が違うため、そのプローブの感度が異なるからであった。
本実施形態の母集団大きさ推定方法は、こういったＤＮＡマイクロアレイのデータに関しても、別の測定方法と対応させて標準化することで、各スポットの感度を推定することができる。このように、各スポットの感度を求めることで、他の臓器や患者等のＤＮＡマイクロアレイのデータを補正することが可能となる。これによって、マイクロアレイのデータを一種の絶対値として用いることができる。このため、異なる設計のチップのデータを、直接に比較できるようになる。 Conventionally, there was no method for measuring the sensitivity of each spot of a DNA microarray. Therefore, it was difficult to directly compare the data of DNA microarrays with different designs, even if they measured the same gene. This is because the base sequence used for the probe differs depending on which sequence of the gene is measured, and the sensitivity of the probe differs.
The population size estimation method of the present embodiment can estimate the sensitivity of each spot by standardizing such DNA microarray data in correspondence with another measurement method. Thus, by obtaining the sensitivity of each spot, it becomes possible to correct data of DNA microarrays of other organs and patients. Thereby, the microarray data can be used as a kind of absolute value. For this reason, it becomes possible to directly compare data of chips having different designs.

またＤＮＡマイクロアレイは、チップの設計にないＲＮＡは検出できない。このため、特に近年その重要性がクローズアップされてきた、タンパクをコードしないＲＮＡのような、新規／未知の対象を測定することはできないという問題があった。
これに対して、本実施形態の母集団大きさ推定方法は、ＲＮＡｓｅｑのような新規／未知の対象を測定できるトランスクリプトーム等の解析にも適用できる。また、よく知られているＤＮＡマイクロアレイのデータと比較することで、当該新規／未知の対象の発現量等を確実に測定できる。また、ＤＮＡマイクロアレイの測定結果の感度差に関わらず、ＲＮＡｓｅｑと結果を直接比較することができる。 Also, DNA microarrays cannot detect RNA that is not on the chip design. For this reason, there has been a problem that it is not possible to measure new / unknown objects such as RNA that does not encode proteins, which has been particularly important in recent years.
On the other hand, the population size estimation method of the present embodiment can be applied to analysis of a transcriptome or the like that can measure a new / unknown target such as RNAseq. Further, by comparing with well-known DNA microarray data, the expression level of the new / unknown target can be reliably measured. In addition, RNAseq and the results can be directly compared regardless of the sensitivity difference in the measurement results of the DNA microarray.

なお、本実施形態の母集団大きさ推定方法は、生物学的データ以外にも適用することができる。
たとえば、食品、土壌、水、空気中に含まれる物質を、クロマトグラフィ等の物質の分離方法と質量分析等の検出方法と組み合わせて同定・定量する場合等に応用することが可能である。その場合、これらの物質の分布形式に関する情報が必要である。たとえば、食品なら、生物由来なので対数正規分布を用いることができる。 Note that the population size estimation method of the present embodiment can be applied to other than biological data.
For example, the present invention can be applied to a case where substances contained in food, soil, water, and air are identified and quantified in combination with a separation method of substances such as chromatography and a detection method such as mass spectrometry. In that case, information on the distribution form of these substances is required. For example, since a food is derived from living organisms, a lognormal distribution can be used.

次に図面に基づき本発明を実施例によりさらに説明するが、以下の具体例は本発明を限定するものではない。 EXAMPLES Next, although an Example demonstrates this invention further based on drawing, the following specific examples do not limit this invention.

〔ＲＮＡｓｅｑのデータの標準化への適用〕
ＲＮＡｓｅｑは、ＲＮＡを逆転写してＤＮＡにして、そのＤＮＡ断片の配列を決定するトランスクリプトームの測定方法である。ＲＮＡｓｅｑでは、より多量に存在したＲＮＡの相補配列がより高い頻度で観測されるため、この頻度から各ＲＮＡの濃度を推定する。ＲＮＡｓｅｑは、スプライシング等が異なる未知のＲＮＡでも測定可能である。またデータが観測される／されないというデジタルな測定方法であり、画像のような仲介物もないため、より正確な測定結果が期待されている。
しかしながら、ＲＮＡｓｅｑは、測定項目が大きい多変量データであり、解析は容易ではない。また観測数を数えるベルヌーイ試行に由来する、二項分布のノイズを持つ。当初は標準化が不要ではないかとさえ言われたが、総リード数の違いをあわせるためにも、データの比較には当然、標準化が必要である。
このため、本実施例において、ＲＮＡｓｅｑデータについて、本実施形態の母集団大きさ推定方法により母集団の大きさを推定して、標準化を行った。 [Application to standardization of RNAseq data]
RNAseq is a transcriptome measurement method that reverse-transcribes RNA into DNA and determines the sequence of the DNA fragment. In RNAseq, since a complementary sequence of RNA present in a larger amount is observed at a higher frequency, the concentration of each RNA is estimated from this frequency. RNAseq can be measured even with unknown RNAs with different splicing and the like. In addition, it is a digital measurement method in which data is observed / not observed, and since there is no intermediary such as an image, a more accurate measurement result is expected.
However, RNAseq is multivariate data with large measurement items, and analysis is not easy. It also has binomial noise, derived from Bernoulli trials that count the number of observations. Initially, it was even said that standardization was unnecessary, but of course, standardization is necessary for data comparison in order to match the total number of leads.
Therefore, in this example, the RNAseq data was standardized by estimating the population size by the population size estimation method of this embodiment.

（使用データ）
文献「ＪｏｈｎＣ．Ｍａｒｉｏｎｉｅｔ．ａｌ．，ＲＮＡ−ｓｅｑ：Ａｎａｓｓｅｓｓｍｅｎｔｏｆｔｅｃｈｎｉｃａｌｒｅｐｒｏｄｕｃｉｂｉｌｉｔｙａｎｄｃｏｍｐａｒｉｓｏｎｗｉｔｈｇｅｎｅｅｘｐｒｅｓｓｉｏｎａｒｒａｙｓ，ＧｅｎｏｍｅＲｅｓ．２００８，８（９）：１５０９−１５１７．」に報告されたデータを元データ２５４として使用した。
これは単一のヒト検体からの腎臓および肝臓のサンプルをイルミナ社製の次世代シーケンサーによるＲＮＡｓｅｑ、及びアフィメトリクス社製のＤＮＡマイクロアレイで測定したデータである。 (Use data)
The data “John C. Marioni et. Al., RNA-seq: An assessment of technical reproductivity and comparison with gene expression arrays, Gen. Res. Used as 254.
This is data obtained by measuring kidney and liver samples from a single human specimen with RNAseq using a next-generation sequencer manufactured by Illumina and a DNA microarray manufactured by Affymetrix.

（結果）
肝臓（Ｒ１Ｌ４Ｌｉｖｅｒ）と腎臓（Ｒ１Ｌ１Ｋｉｄｎｅｙ）のサンプルについて、初期範囲として、１２０００〜２３０００を指定し、元データ２５４の分布として対数正規分布を用い、本実施形態の母集団大きさ推定方法で各候補値の散布図を描画し、ピアソンのｒ値を測定した。対数の底には１０を用いた。
結果として、腎臓のサンプルでは、最適な候補値は、１２６２０であった。つまり、総データ数ｎは、１２６２０と推定される。この腎臓のサンプルの母数は、平均μ＝１．７７、標準偏差σ＝０．５１２であった。
また、肝臓のサンプルでは、最適な候補値は１３０６０であった。つまり、総データ数ｎは、１３０６０と推定される。この肝臓のサンプルの母数は、平均μ＝１．５１５、標準偏差σ＝０．６１０であった。
図４を参照して、この測定データ２５１と分布データ２５２の分布モデルの関係とを、下記で説明する。 (result)
For the liver (R1L4Liver) and kidney (R1L1Kidney) samples, specify 12000 to 23000 as the initial range, use the lognormal distribution as the distribution of the original data 254, and use the population size estimation method of this embodiment to set each candidate value. A scatter diagram was drawn and Pearson's r value was measured. 10 was used for the base of the logarithm.
As a result, the optimal candidate value for the kidney sample was 12620. That is, the total number of data n is estimated to be 12620. The population of this kidney sample had an average μ = 1.77 and a standard deviation σ = 0.512.
In the liver sample, the optimal candidate value was 13060. That is, the total number of data n is estimated to be 13060. The parameters of this liver sample were mean μ = 1.515 and standard deviation σ = 0.610.
The relationship between the measurement data 251 and the distribution model of the distribution data 252 will be described below with reference to FIG.

図４（ａ）は、肝臓のサンプルについて描画した、上述のＱＱプロットの変形である散布図である。図４（ａ）のＸ軸は用意した正規分布のｚスコア、Ｙ軸は肝臓のサンプルから作成した測定データ２５１を標準化したもののｚスコアである。また、丸印で示される各点は、算出した確率点を示す。
図４（ｂ）は、この肝臓のサンプルを標準化した際のヒストグラムである。Ｘ軸は用意した正規分布のｚスコア、Ｙ軸のバーは各サンプルの頻度を示す。また、点線は、最適な候補値の分布から期待される頻度を示す。
この図４（ａ）（ｂ）の結果のように、ｚスコアにして、下が０．５付近までのデータが、算出した最適な候補値により求められた母数のモデルと一致することがわかる。これは全体の約４０％に相当する。ｚスコアで０．５以下のデータは、測定限界以下であり、サンプル由来のノイズ、及び二項分布に従う測定機器／実験方法由来のノイズに影響されると考えられる。 FIG. 4A is a scatter diagram, which is a modification of the above-described QQ plot, drawn for a liver sample. In FIG. 4A, the X-axis is a z-score of a prepared normal distribution, and the Y-axis is a z-score of standardized measurement data 251 created from a liver sample. Each point indicated by a circle indicates a calculated probability point.
FIG. 4B is a histogram when the liver sample is standardized. The X axis shows the z-score of the prepared normal distribution, and the Y axis bar shows the frequency of each sample. The dotted line indicates the frequency expected from the optimal candidate value distribution.
As shown in the results of FIGS. 4A and 4B, the data up to about 0.5 below the z score may coincide with the model of the parameter obtained from the calculated optimal candidate value. Recognize. This corresponds to about 40% of the total. Data with a z-score of 0.5 or less is below the measurement limit and is considered to be affected by noise from the sample and noise from the measuring instrument / experimental method according to the binomial distribution.

図４（ｃ）は、図４（ａ）（ｂ）と同じ肝臓のサンプルを散布図ではなく、大きさ順に直接プロットした図である。この図でも、Ｘ軸は用意した正規分布のｚスコア、Ｙ軸は測定データ２５１を標準化したもののｚスコアである。この図では、分布データ２５２に合わせて算出した最適な候補値分のデータを作成して大きさ順に並べ、測定データ２５１とペアで描画して比較する。
この結果は上述の図４（ａ）（ｂ）と同様に、０．５以下のデータはむしろノイズによる影響が大きく、データに再現性が期待できない。
このため、このサンプルにおいて、大きさがｚスコアで０．５以下のデータを用いないで発現量を比較することも考えられる。 FIG. 4C is a diagram in which the same liver samples as in FIGS. 4A and 4B are directly plotted in order of size, not a scatter diagram. Also in this figure, the X-axis is a prepared normal distribution z-score, and the Y-axis is a standardized measurement data 251 z-score. In this figure, data for the optimum candidate values calculated in accordance with the distribution data 252 are created, arranged in order of size, and drawn and compared with the measurement data 251 in pairs.
Similar to FIGS. 4 (a) and 4 (b), the result is that the data of 0.5 or less is rather affected by noise, and reproducibility of the data cannot be expected.
For this reason, in this sample, it is also possible to compare the expression levels without using data with a z score of 0.5 or less.

図４（ｄ）（ｅ）は、上述の図４（ａ）（ｂ）と同様に、腎臓のサンプルについて、本実施の形態の母集団大きさ推定方法で標準化した測定データ２５１と正規分布との散布図による比較、及びヒストグラムを示す。
この結果として、ｚスコアにして−１程度のところで、算出した最適な候補値により求められた母数のモデルからの乖離が観測される。これ以下のデータは、サンプル由来のノイズ、又は二項分布のノイズの影響を受けていることが示唆される。 4 (d) and 4 (e) are similar to FIGS. 4 (a) and 4 (b) described above, with respect to the kidney sample, the measurement data 251 and the normal distribution standardized by the population size estimation method of the present embodiment. The comparison by a scatter diagram and a histogram are shown.
As a result, when the z-score is about −1, a deviation from the model of the parameter obtained from the calculated optimal candidate value is observed. Data below this suggests that it is affected by noise from the sample or by binomial noise.

〔ＲＮＡｓｅｑデータの従来の標準化法との比較〕
次に、図５を参照して、上述の実施例１のＭａｒｉｏｎｉらの文献と同じＲＮＡｓｅｑのデータを用いて、よく用いられている従来の標準化法（以下、「総和一定」と称する。）と、本実施形態の母集団大きさ推定方法とで、発現量を比較した。
従来の「総和一定」の標準化法は、サンプル（リード）数の総数で全体を除して、測定した値の強度を合わせる標準化法である。 [Comparison with conventional standardization method of RNAseq data]
Next, referring to FIG. 5, a conventional standardization method (hereinafter referred to as “constant summation”) that is often used, using the same RNAseq data as the above-mentioned Marioni et al. The expression level was compared with the population size estimation method of this embodiment.
The conventional “total sum constant” standardization method is a standardization method in which the whole is divided by the total number of samples (leads) and the intensities of the measured values are combined.

（遺伝子の選択）
７回繰り返し測定をした肝臓及び腎臓のＲＮＡｓｅｑのデータを、それぞれの遺伝子ごとにｔ検定し、肝臓又は腎臓のいずれかで両側１％の有意水準で発現量（リード数）の差がある、すなわちポジティヴ（陽性）となる遺伝子を選んだ。
これら遺伝子に着目し、肝臓及び腎臓で対になる発現量の組み合わせを得た。つまり、統計検定で１％有意となった遺伝子において、肝臓と腎臓の発現量の対数比（ログレシオ）を求めて比較した。 (Gene selection)
The liver and kidney RNAseq data measured 7 times were t-tested for each gene, and there was a difference in the expression level (number of reads) at a significance level of 1% on both sides in either the liver or kidney. A gene that was positive was selected.
Focusing on these genes, a combination of expression levels paired in the liver and kidney was obtained. That is, the logarithmic ratio (log ratio) of the expression levels of the liver and kidney was compared for genes that became 1% significant by statistical test.

（結果）
図５（ａ）は、従来の、総数で全体を除した総和一定の標準化法により標準化した各遺伝子の発現量を用い、２組のデータから腎臓と肝臓の差を求めて、それぞれを比較したプロットである。Ｘ軸とＹ軸はそれぞれ、肝臓と腎臓の遺伝子発現の差を、ログレシオを用いて示している。各点は、各遺伝子の腎臓及び肝臓の発現量の差を、Ｘ座標に一方のデータ組み合わせ，Ｙ座標に他方の組み合わせとしてプロットしたものである。図中の「ｒ」は、ピアソンの相関係数ｒ値を示す。
図５（ａ）のように、総和一定の標準化だと、発現変化の再現性がさほど高くないことがわかる。特に、原点（Ｘ座標，Ｙ座標）＝（０，０）付近でｙ＝−ｘ方向にばらつく、即ち逆相関を示す遺伝子が散見される。これは、特に発現量の差が少ない遺伝子群において、検定が偽陽性を生じやすいという傾向を示している。この現象は、測定にノイズ成分が多いときに起きる。 (result)
FIG. 5 (a) shows the difference between the kidney and the liver from two sets of data using the expression levels of each gene standardized by the conventional standardization method with the sum total divided by the total number, and compared each. It is a plot. The X-axis and Y-axis indicate the difference in gene expression between the liver and kidney using the log ratio, respectively. Each point is a plot of the difference in the expression level of the kidney and liver of each gene as one data combination on the X coordinate and the other combination on the Y coordinate. “R” in the figure indicates the Pearson correlation coefficient r value.
As shown in FIG. 5 (a), it is understood that the reproducibility of the expression change is not so high when the summation is constant. In particular, there are some genes that vary in the y = −x direction near the origin (X coordinate, Y coordinate) = (0, 0), that is, exhibit an inverse correlation. This indicates a tendency that the test tends to generate false positives, particularly in a gene group with a small difference in expression level. This phenomenon occurs when there are many noise components in the measurement.

図５（ｂ）は、本実施形態の母集団大きさ推定方法で発現量を標準化した測定データ２５１の腎臓と肝臓での発現量の違いを、（ａ）と同じ２組の測定の組み合わせにおいて比較したプロットである。Ｘ軸とＹ軸はそれぞれ、肝臓と腎臓の遺伝子発現の差をログレシオを用いて示している。各点は、各遺伝子の腎臓及び肝臓の発現量を図５（ａ）と同様にプロットしたもので、図中の「ｒ」はピアソンの相関係数ｒ値である。
図５（ｂ）では、原点を境に、各プロットが直線ｙ＝ｘ上で２つに分離されている。つまり、逆相関をする遺伝子はほとんど検出されず、検定が偽陽性を生じにくいことを示している。これは、測定結果のノイズ成分が、シグナルに比べて小さいときにおきる現象である。測定の再現性は高くなっており、ｒ値も従来の０．８９７から０．９４９程度に増えている。
また、またｙ＝ｘで大きな値をとる遺伝子の数が、本実施形態の母集団大きさ推定方法のプロットのほうが明らかに多いことがわかる。これは、ノイズ成分が減少したために検定で有意差が検出される遺伝子が増加したからだと考えられる。すなわち、結果的に相関係数を高めているだけでなく、データのもつ価値を高めることができる。 FIG. 5B shows the difference in the expression level between the kidney and the liver in the measurement data 251 in which the expression level is standardized by the population size estimation method of the present embodiment in the same two sets of measurements as in FIG. It is the compared plot. The X-axis and Y-axis indicate the difference in gene expression between the liver and kidney using the log ratio, respectively. Each point is a plot of the expression level of each gene in the kidney and liver in the same manner as in FIG. 5A, and “r” in the figure is the Pearson correlation coefficient r value.
In FIG. 5B, each plot is separated into two on the straight line y = x with the origin as a boundary. That is, almost no genes that are inversely correlated are detected, indicating that the test is unlikely to produce false positives. This is a phenomenon that occurs when the noise component of the measurement result is smaller than the signal. The reproducibility of measurement is high, and the r value is increased from the conventional 0.897 to about 0.949.
It can also be seen that the number of genes having a large value at y = x is clearly larger in the plot of the population size estimation method of the present embodiment. This is thought to be because the number of genes for which a significant difference was detected by the test increased because the noise component decreased. That is, as a result, not only the correlation coefficient is increased, but also the value of the data can be increased.

〔ＲＮＡｓｅｑデータとＤＮＡマイクロアレイデータとの一致〕
次に、上述の実施例１、２と同じＭａｒｉｏｎｉらのＲＮＡｓｅｑとＤＮＡマイクロアレイのデータについて、同じ遺伝子の発現量を測定したＲＮＡｓｅｑと、ＤＮＡマイクロアレイのデータとが、一致するか比較した。具体的には、本実施形態の母集団大きさ推定方法が影響するか否かかを調べた。
実施例３では、上述の実施例２と同じ、肝臓又は腎臓のいずれかのＲＮＡｓｅｑの発現量のデータにおいて、両側１％の有意水準でｔ検定をかけ、発現量（リード数）の差があった遺伝子の発現量を元データ２５４として用いた。また、これに加えて、ＤＮＡマイクロアレイでも、１％有意水準で発現量の差があったものを選択して元データ２５４として用いた。 [The agreement between RNAseq data and DNA microarray data]
Next, regarding the same RNAionq and DNA microarray data of Marioni et al. As in Examples 1 and 2, the RNAseq measured for the expression level of the same gene and the DNA microarray data were compared. Specifically, it was examined whether or not the population size estimation method of this embodiment affects.
In Example 3, the same data as in Example 2 above, either in the expression level of RNAseq in either liver or kidney, a t-test was performed at a significance level of 1% on both sides, and there was a difference in expression level (number of reads). The gene expression level was used as the original data 254. In addition, a DNA microarray having a difference in expression level at a 1% significance level was selected and used as original data 254.

図６（ａ）は、元々のＭａｒｉｏｎｉらのオリジナルのデータをプロットしたコントロールである。Ｙ軸がＲＮＡｓｅｑでの発現量の対数比であり、Ｘ軸がアフィメトリクス社製のＤＮＡマイクロアレイの発現量データである。これは、ＤＮＡマイクロアレイの発現データを、従来のＲＭＡ（ＲｏｂｕｓｔＭｕｌｔｉ−ＡｒｒａｙＡｖｅｒａｇｅ）法で標準化処理して算出した値である。ＲＮＡｓｅｑで２５０リード以上であった遺伝子を濃い色（濃い塗りつぶし）の点で示す。それより少ないデータをものを薄い色（薄い塗りつぶし）の点で示す。２５０リード以上の遺伝子は、上位６．４％のデータに限られている。ピアソンのｒ値は、０．８７程度であった。 FIG. 6A is a control in which original data of the original Marioni et al. Is plotted. The Y-axis is the logarithmic ratio of the expression level in RNAseq, and the X-axis is the expression level data of DNA microarray manufactured by Affymetrix. This is a value calculated by standardizing DNA microarray expression data using a conventional RMA (Robust Multi-Array Average) method. Genes that have 250 or more reads in RNAseq are indicated by dark colored dots. Less data is shown in light-colored (lightly filled) dots. Genes with 250 reads or more are limited to the top 6.4% of data. Pearson's r value was about 0.87.

図６（ｂ）は、本実施の形態の母集団大きさ推定方法で標準化を行った後、同じ遺伝子について図６（ａ）と同様にプロットを行ったものである。特に、薄い色（薄い塗りつぶし）の点で示す、２５０リード未満の比較的低いリード数において、上下へのバラツキが少なくなっている。具体的には、ピアソンのｒ値が０．９１４程度に向上している。
すなわち、少ない発現量の遺伝子においても、ＲＮＡｓｅｑ及びＤＮＡマイクロアレイの発現量を再現性高く、誤差を少なくして分析できる。また、ＲＮＡｓｅｑ及びＤＮＡマイクロアレイの発現量を、比較することも可能になる。 FIG. 6B shows the same gene plotted as in FIG. 6A after standardization by the population size estimation method of the present embodiment. In particular, there is less vertical variation at a relatively low number of leads of less than 250 leads, indicated by the light color (light fill). Specifically, Pearson's r value is improved to about 0.914.
That is, even in a gene with a small expression level, the expression level of RNAseq and DNA microarray can be analyzed with high reproducibility and with less error. It is also possible to compare the expression levels of RNAseq and DNA microarray.

〔ＤＮＡマイクロアレイの感度測定〕
次に、上述の実施例３と同じデータを用いて、ＲＮＡｓｅｑとＤＮＡマイクロアレイとの比較から、ＤＮＡマイクロアレイの各スポットの感度の測定を行った。
従来のＤＮＡマイクロアレイの発現量データの解析では、通常、コントロールの組織の測定量と、調べたい組織の発現量を測定した際の測定量との「倍率」を算出して、遺伝子発現の変化量を求めている。このため、ＤＮＡマイクロアレイの各スポットに形成されたＤＮＡ配列のＴｍ値（ｍｅｌｔｉｎｇｔｅｍｐｅｒａｔｕｒｅ）や収率等により生じる感度差を相殺することができている。しかしながら、より一般性の高い定量値を得るために、各スポットの感度を測定したいという要望があった。
これに対して、本実施例では、同一遺伝子のＲＮＡｓｅｑのデータを用いて、ＤＮＡマイクロアレイの各スポット自体の感度を測定した。 [Measurement of DNA microarray sensitivity]
Next, using the same data as in Example 3 above, the sensitivity of each spot of the DNA microarray was measured by comparing RNAseq with the DNA microarray.
In the analysis of conventional DNA microarray expression level data, the amount of change in gene expression is usually calculated by calculating the “magnification” between the measurement level of the control tissue and the measurement level when measuring the expression level of the tissue to be examined. Seeking. For this reason, the difference in sensitivity caused by the Tm value (melting temperature) or yield of the DNA sequence formed in each spot of the DNA microarray can be offset. However, there has been a desire to measure the sensitivity of each spot in order to obtain a more general quantitative value.
In contrast, in this example, the sensitivity of each spot of the DNA microarray was measured using RNAseq data of the same gene.

（感度差の推定）
上述の実施例３の肝臓のデータを、ＲＮＡｓｅｑとＤＮＡマイクロアレイを本実施形態の母集団大きさ推定方法で標準化し、各ＤＮＡマイクロアレイの各遺伝子に対応するスポットの感度を推定する。
この各スポットの感度を用いて、各ＤＮＡマイクロアレイの各スポットの感度差を補正する。具体的には、各スポットについて、肝臓のＲＮＡｓｅｑとＤＮＡマイクロアレイでで得られた感度差を推定し、この感度差の値を腎臓のデータに掛けて標準化する。 (Estimation of sensitivity difference)
The liver data of Example 3 described above is standardized for RNAseq and DNA microarrays using the population size estimation method of this embodiment, and the sensitivity of spots corresponding to each gene in each DNA microarray is estimated.
The sensitivity difference of each spot of each DNA microarray is corrected using the sensitivity of each spot. Specifically, for each spot, the difference in sensitivity obtained by liver RNAseq and DNA microarray is estimated, and the value of this sensitivity difference is multiplied by the kidney data and standardized.

図７（ａ）（ｂ）は、それぞれＹ軸にマイクロアレイの測定値を標準化した値、Ｘ軸にＲＮＡ−ｓｅｑを本実施形態の母集団大きさ推定方法で標準化した値をとって比較したものである。Ｘ軸、Ｙ軸ともに、感度の倍率ではなく、対数値のｚスコアを示している。また、図７（ａ）は肝臓のデータ、図７（ｂ）は、腎臓のデータをそれぞれ示している。
結果として、正の相関が見られるものの、大きくばらついている。すなわち、実施例３では、倍率を観察して感度が相殺されるため、結果が一致した。しかしながら、上述したように、マイクロアレイにはスポットごとに異なる感度があるため、絶対的な感度で比較すると、必ずしも結果が一致しないことが分かる。 FIGS. 7A and 7B are values obtained by comparing the values obtained by standardizing the measurement values of the microarray on the Y axis and the values obtained by standardizing RNA-seq on the X axis by the population size estimation method of the present embodiment. It is. Both the X-axis and the Y-axis indicate the logarithmic value z-score, not the sensitivity magnification. FIG. 7A shows liver data, and FIG. 7B shows kidney data.
As a result, although there is a positive correlation, it varies greatly. That is, in Example 3, since the sensitivity was canceled by observing the magnification, the results matched. However, as described above, since microarrays have different sensitivities for each spot, it can be seen that the results do not always match when compared with absolute sensitivity.

図７（ｃ）は、図７（ａ）の結果を基にして、「ＲＮＡ−ｓｅｑの標準化データ−マイクロアレイの標準化データ」として感度を推定し、この感度を図７（ｂ）のマイクロアレイの結果に和することで補正したものである。すなわち、図７（ｃ）は、図７（ａ）の結果を基にして、肝臓のデータから推定された各スポット（遺伝子）の感度差を使って補正したプロットである。図７（ａ）（ｂ）と同様に、Ｘ軸、Ｙ軸とも絶対値（ログスケール）を示している。
結果として、ＲＮＡ−ｓｅｑとの相関が、明らかに図７（ｂ）よりも向上している。このように、ＤＮＡマイクロアレイの感度測定により、結果の散らばりを抑えて、各遺伝子毎に、発現量の測定を正確に行うことができる。また、感度推定のためのデータ数を増やせば、さらに精度を高めることができる。 FIG. 7 (c) estimates the sensitivity as “standardized data of RNA-seq-standardized data of microarray” based on the result of FIG. 7 (a), and this sensitivity is the result of the microarray of FIG. 7 (b). Is corrected by adding to That is, FIG. 7C is a plot corrected using the sensitivity difference of each spot (gene) estimated from the liver data based on the result of FIG. 7A. Similar to FIGS. 7A and 7B, both the X-axis and the Y-axis indicate absolute values (log scale).
As a result, the correlation with RNA-seq is clearly improved as compared with FIG. Thus, by measuring the sensitivity of the DNA microarray, it is possible to accurately measure the expression level for each gene while suppressing the dispersion of results. Further, if the number of data for sensitivity estimation is increased, the accuracy can be further increased.

〔Ｒ言語での実装例〕
次に、上述の実施の形態に係る母数の推定を実際にＲ言語で実装して、計算機実験を行った実施例を以下で示す。
まず、以下のコードで、正規分布する乱数を一万個を用意して、配列ｒａｎｄに記憶した。
また、配列ｒａｎｄから、ｚスコアで−０．５より小さいデータを消去して作成した配列ｄａｔａを用意して、測定限界をシミュレートした：

rand <- rnorm(10000) #
data <- rand ; data[ data< -0.5] <-NA [Example of implementation in R language]
Next, an example in which the estimation of the parameter according to the above-described embodiment is actually implemented in the R language and a computer experiment is performed will be described below.
First, 10,000 random numbers with normal distribution were prepared and stored in the array rand with the following code.
In addition, array data prepared by deleting data with a z score smaller than −0.5 from the array rand was prepared to simulate the measurement limit:

rand <-rnorm (10000) #
data <-rand; data [data <-0.5] <-NA

この用意した乱数ｒａｎｄと配列ｄａｔａとについてヒストグラムの作成を行った：

hr <- hist(rand, breaks=-50:50/10) #ヒストグラムの作成
hd <- hist(data, breaks=-50:50/10) A histogram was created for the prepared random number rand and array data:

hr <-hist (rand, breaks = -50: 50/10) #Create histogram
hd <-hist (data, breaks = -50: 50/10)

この上で、以下のコードで、ヒストグラムを重ね書きして描画した：

plot(hr, col="gray50", xlab="", ylab="", main="")
par(new=T)
plot(hd, col="white", xlab="", ylab="", main="") On top of this, the histogram was overwritten with the following code:

plot (hr, col = "gray50", xlab = "", ylab = "", main = "")
par (new = T)
plot (hd, col = "white", xlab = "", ylab = "", main = "")

図８は、描画したヒストグラムの例である。図８の影がついている部分が、図９と同様に、検出されなかったデータに相当する。
次に、得られた配列ｒａｎｄの母数を、以下のコードで推定した：

# 平均
> mean(rand, na.rm=T)
[1] -0.0009032517

# 中央値（ロバストなμの推定）
> median(rand, na.rm=T)
[1] 0.0079797

# 標準偏差（ロバストなμの推定）
> sd(rand, na.rm=T)
[1] 0.9975465

# 標準偏差（ロバストなＭＡＤの推定）
> mad(rand, na.rm=T)
[1] 0.9996847 FIG. 8 is an example of a drawn histogram. The shaded portion in FIG. 8 corresponds to data that has not been detected, as in FIG.
Next, the parameter of the resulting array rand was estimated with the following code:

# Average
> mean (rand, na.rm = T)
[1] -0.0009032517

# Median (Robust μ estimate)
> median (rand, na.rm = T)
[1] 0.0079797

# Standard deviation (Robust μ estimate)
> sd (rand, na.rm = T)
[1] 0.9975465

# Standard deviation (Robust MAD estimation)
> mad (rand, na.rm = T)
[1] 0.9996847

これらの配列ｒａｎｄから求めた母数の値は、どれもよく作成した正規分布の設定を反映していた。
次に、従来方法である下記のコードで、得られた配列ｄａｔａの母数を推定した：

# 平均
> mean(data, na.rm=T)
[1] 0.5033331

# 中央値（ロバストなμの推定）
> median(data, na.rm=T)
[1] 0.3849361

# 標準偏差（ロバストなμの推定）
> sd(data, na.rm=T)
[1] 0.6977276

# 標準偏差（ロバストなＭＡＤの推定）
> mad(data, na.rm=T)
[1] 0.7157046 The values of the parameters obtained from these arrays rand reflected the well-created normal distribution settings.
Next, the parameter of the obtained array data was estimated with the following code, which is a conventional method:

# Average
> mean (data, na.rm = T)
[1] 0.5033331

# Median (Robust μ estimate)
> median (data, na.rm = T)
[1] 0.3849361

# Standard deviation (Robust μ estimate)
> sd (data, na.rm = T)
[1] 0.6977276

# Standard deviation (Robust MAD estimation)
> mad (data, na.rm = T)
[1] 0.7157046

このように、配列ｄａｔａは、従来方法で計算すると、ロバストな計算方法でも、本来の正規分布の設定から離れている。
これに対して、上述の実施形態のＲ言語のコードにより、失われたデータ数を勘案しつつ、直線近似を用いて母数を推定した：

(mu <- params[1]) # μ
-0.0003264361

(sigma <- params[2]) # σ
0.9854621 As described above, when the array data is calculated by the conventional method, it is far from the original normal distribution setting even by the robust calculation method.
On the other hand, the R language code of the above-described embodiment estimated the parameter using linear approximation while taking into account the number of lost data:

(mu <-params [1]) # μ
-0.0003264361

(sigma <-params [2]) # σ
0.9854621

結果として、それぞれ、設定値に近い母数の推定に成功した。 As a result, we succeeded in estimating the parameters close to the set values.

なお、上記実施の形態の構成及び動作は例であって、本発明の趣旨を逸脱しない範囲で適宜変更して実行することができることは言うまでもない。 Note that the configuration and operation of the above-described embodiment are examples, and it is needless to say that the configuration and operation can be appropriately changed and executed without departing from the gist of the present invention.

本発明の母集団大きさ推定方法は、次世代シーケンサーやＤＮＡマイクロアレイ等を用いたトランスクリプトームの解析に幅広く利用でき、産業上に利用することができる。 The population size estimation method of the present invention can be widely used for transcriptome analysis using next-generation sequencers, DNA microarrays, and the like, and can be used industrially.

１０解析装置
１００制御部
１１０記憶部
１３０入力部
１４０表示部
１５０入出力部
２００データベース
２１０測定データソート部
２２０候補値範囲設定部
２２５分布作成部
２３０候補値描画部
２４０候補値評価部
２５１測定データ
２５２分布データ
２５３設定データ
２５４元データ DESCRIPTION OF SYMBOLS 10 Analysis apparatus 100 Control part 110 Storage part 130 Input part 140 Display part 150 Input / output part 200 Database 210 Measurement data sort part 220 Candidate value range setting part 225 Distribution creation part 230 Candidate value drawing part 240 Candidate value evaluation part 251 Measurement data 252 Distribution data 253 Setting data 254 Original data

Claims

In a population size estimation method that causes a computer to perform estimation of the total number of measurement data obtained by a measurement method with a detection limit,
Sort the measurement data,
In the range calculated from the measurement data, preparing a plurality of candidate values for the total number of data,
For each candidate value, prepare a set of theoretical values related to the population distribution of the measurement data,
The matching between the sorted measurement data and the set of theoretical values is evaluated by a function related to a correlation or linear relationship, and the candidate value that is optimally evaluated by the function is estimated as a population size. And population size estimation method.

The population size estimation method according to claim 1, wherein the function evaluates linearity or distance between the value of the measurement data in the same order from the top and the value of the set of theoretical values.

Using the Pearson product moment correlation coefficient as the function,
If the optimal candidate value is at the end of the range, or if the prepared candidate value is less accurate, adjust the range of the candidate value,
The population size estimation method according to claim 1, wherein the candidate value with the best evaluation is determined when sufficient accuracy is obtained.

Using the set of theoretical values of the candidate values with the best evaluation according to any one of claims 1 to 3, and the sorted measurement data,
Comparing the value of the measurement data in the same order from the top and the value of the set of theoretical values, calculating a first-order approximation intercept and slope, and calculating a parameter indicating the characteristics of the population distribution A characteristic parameter estimation method.

As measurement data, quantification data of gene expression level is used,
The gene expression level analysis method, wherein the gene expression level is standardized by the total number of data estimated by the population size estimation method according to any one of claims 1 to 3.

The gene expression level analysis method according to claim 5, wherein the quantification data of the gene expression level is a transcript expression level, a cDNA read count, and derivative data.

As measurement data, use quantification data of protein amount or substance amount,
The method for analyzing data of a proteome or metabolome, wherein the protein amount or the substance amount is standardized based on the total number of data estimated by the population size estimation method according to any one of claims 1 to 3.

As measurement data, using quantification data of RNAseq and DNA microarray gene expression level,
Standardized by the total number of data estimated by the population size estimation method according to any one of claims 1 to 3,
A sensitivity measurement method for a DNA microarray, wherein the sensitivity of each spot of the DNA microarray is determined by correcting the standardized gene expression level of the RNAseq.

In a program for causing a computer to execute a population size estimation method for estimating the total number of measurement data obtained by a measurement method with a detection limit,
In the computer,
Sort the measurement data;
In the range calculated from the measurement data, a plurality of candidate values for the size of the total data number are prepared,
For each candidate value, a set of theoretical values related to the population distribution of the measurement data is prepared,
A population size that evaluates a match between the sorted measurement data and the set of theoretical values by a function related to a correlation or linear relationship, and estimates the candidate value that is optimally evaluated by the function as the size of the population A program characterized by causing the estimation method to be executed.

In a computer-readable recording medium recording a program for causing a computer to execute a population size estimation method for estimating the total number of measurement data obtained by a measurement method having a detection limit,
In the computer,
Sort the measurement data;
In the range calculated from the measurement data, a plurality of candidate values for the size of the total data number are prepared,
For each candidate value, a set of theoretical values related to the population distribution of the measurement data is prepared,
A population size that evaluates a match between the sorted measurement data and the set of theoretical values by a function related to a correlation or linear relationship, and estimates the candidate value that is optimally evaluated by the function as the size of the population A recording medium on which is recorded a program for executing the estimation method.

In a population size estimation device that performs estimation of the total number of measurement data obtained by a measurement method with a detection limit,
A measurement data sorting means for sorting the measurement data;
Candidate value preparation means for preparing a plurality of candidate values of the size of the total number of data within a range calculated from the measurement data;
Distribution preparation means for preparing a set of theoretical values related to the population distribution of the measurement data for each candidate value;
Candidate value evaluation for evaluating coincidence between the sorted measurement data and the set of theoretical values by a function related to a correlation or linear relationship, and estimating the candidate value that is optimally evaluated by the function as the size of a population A population size estimation device.