JP5242568B2

JP5242568B2 - Clustering method, program and apparatus

Info

Publication number: JP5242568B2
Application number: JP2009525454A
Authority: JP
Inventors: 哲也田邊
Original assignee: Olympus Corp
Current assignee: Olympus Corp
Priority date: 2007-08-01
Filing date: 2008-07-31
Publication date: 2013-07-24
Anticipated expiration: 2028-07-31
Also published as: JPWO2009017204A1; WO2009017204A1

Description

The present invention relates to a clustering method, a program, and an apparatus for clustering a plurality of data items.

Analyzing the diverse phenomena and characteristics specific to samples derived from the same or different organisms enables biologically and medically important evaluations. In gene polymorphism analysis, for example, the speed at which the polymorphism discrimination reaction proceeds differs from sample to sample depending on factors such as the sample concentration and the presence or absence of inhibitors. Gene polymorphism analysis therefore yields numerical data (polymorphism data) with a wide distribution.

When the obtained numerical data is classified, an operator may perform the clustering by visually inspecting a scatter plot of the data. When classification is left to operators, however, the results can differ from one operator to another.

Under these circumstances, various attempts have been made to classify numerical data automatically. Patent Document 1 below, for example, discloses a technique that applies statistical methods to the signals from samples in gene polymorphism analysis. With this technique, however, numerical data corresponding to a rare gene polymorphism, one present in only a few samples out of several hundred, is not statistically meaningful, so such data is difficult to handle.

A technique that incorporates genetic-statistical methods into the statistical methods used in gene polymorphism analysis has therefore also been disclosed (see, for example, Patent Document 2). This technique uses the Hardy-Weinberg equilibrium to evaluate, in genetic-statistical terms, the reliability of the numerical data obtained by gene polymorphism analysis.

Patent Document 1: JP 2004-272350 A
Patent Document 2: JP 2006-107396 A

However, gene polymorphism analysis that incorporates genetic-statistical methods requires random sampling. Data obtained by biased sampling, such as pedigree samples or patient samples, is therefore unsuitable for genetic-statistical analysis. Moreover, when statistical methods are used, reliable statistics cannot be obtained if the polymorphism frequency is low, and the determination may be wrong.

The present invention has been made in view of the above, and its object is to provide a clustering method, a program, and an apparatus that can accurately cluster the numerical data associated with a set of samples regardless of how the samples were selected.

To solve the problems described above and achieve the object, the clustering method according to the present invention is a clustering method in which a computer provided with storage means for storing a plurality of multidimensional numerical data divides the plurality of multidimensional numerical data into one or more clusters. The method comprises: a data conversion step of reading the plurality of multidimensional numerical data from the storage means and converting each of the read data into lower-dimensional numerical data; a probability density function generation step of generating a plurality of probability density functions, each giving the data existence probability corresponding to one of the numerical data converted in the data conversion step; a reliability distribution function generation step of generating, by taking a linear sum of the plurality of probability density functions generated in the probability density function generation step, a reliability distribution function that numerically defines the reliability of the plurality of multidimensional numerical data; and a cluster division step of dividing the plurality of multidimensional numerical data into clusters based on the reliability distribution function generated in the reliability distribution function generation step.

Further, in the clustering method according to the present invention, in the above invention, the data conversion step converts each of the plurality of multidimensional numerical data into numerical data of one lower dimension by using the ratio of different components of the multidimensional numerical data.

Further, in the clustering method according to the present invention, in the above invention, the multidimensional numerical data has two dimensions, and the one-dimensional numerical data obtained by the conversion in the data conversion step sum to 1.

Further, in the clustering method according to the present invention, in the above invention, the probability density function generated in the probability density function generation step is a Gaussian function. The mean of the Gaussian function is determined by the ratio of the two components of the two-dimensional numerical data of interest. The variance of the Gaussian function is determined using the distance, on the two-dimensional plane that gives the distribution of the plurality of two-dimensional numerical data, between the two-dimensional numerical data of interest and the two-dimensional numerical data lying within a predetermined range of it.

Further, in the clustering method according to the present invention, in the above invention, the two-dimensional numerical data is detection data for the alleles of a single nucleotide polymorphism, and the data converted in the data conversion step is the allele concentration.

Further, in the clustering method according to the present invention, in the above invention, the cluster division step includes: a reliability distribution function differentiation step of differentiating the reliability distribution function with respect to the numerical data converted in the data conversion step; a minimum value calculation step of calculating the local minima of the reliability distribution function from the values obtained in the reliability distribution function differentiation step; a minimum value feature quantity calculation step of calculating feature quantities that characterize the local minima calculated in the minimum value calculation step; and a cluster division position setting step of setting, using the feature quantities calculated in the minimum value feature quantity calculation step, the cluster division positions in the space in which the multidimensional numerical data is distributed.
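The differentiation and minimum-finding steps above can be pictured with a minimal sketch. It assumes the reliability distribution function is available as values sampled on a one-dimensional grid and uses a finite-difference slope in place of an analytic derivative. The function names `find_valleys` and `assign_clusters`, the test curve, and the rule of cutting at every local minimum are illustrative assumptions; the feature-quantity screening of the minima is not modeled here.

```python
import numpy as np

def find_valleys(x, g):
    """Return the x positions of interior local minima of the sampled curve g(x)."""
    x = np.asarray(x, dtype=float)
    slope = np.diff(np.asarray(g, dtype=float))  # finite-difference derivative
    # A local minimum sits where the slope changes sign from negative to positive.
    is_min = (slope[:-1] < 0) & (slope[1:] > 0)
    return x[1:-1][is_min]

def assign_clusters(values, cut_positions):
    """Label each 1-D value by the number of cut positions lying below it."""
    return np.searchsorted(np.sort(cut_positions), values)

# Hypothetical reliability curve: two Gaussian bumps with a valley between them.
x = np.linspace(0.0, 1.0, 1001)
g = np.exp(-0.5 * ((x - 0.3) / 0.08) ** 2) + np.exp(-0.5 * ((x - 0.7) / 0.08) ** 2)
cuts = find_valleys(x, g)
labels = assign_clusters(np.array([0.25, 0.35, 0.72]), cuts)
```

In this sketch every local minimum becomes a cut; a fuller version would first discard minima whose feature quantities indicate an insignificant valley.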

Further, the clustering method according to the present invention, in the above invention, further comprises a cluster division result output step of outputting the result of the cluster division performed in the cluster division step.

A clustering program according to the present invention causes the computer to execute the clustering method according to any one of the above inventions.

A clustering apparatus according to the present invention divides a plurality of multidimensional numerical data into one or more clusters. The apparatus comprises: storage means for storing the plurality of multidimensional numerical data; data conversion means for reading the plurality of multidimensional numerical data from the storage means and converting each of the read data into lower-dimensional numerical data; function generation means for generating a plurality of probability density functions, each giving the data existence probability corresponding to one of the numerical data converted by the data conversion means, and for generating, by taking a linear sum of the generated probability density functions, a reliability distribution function that numerically defines the reliability of the plurality of multidimensional numerical data; and cluster division means for dividing the plurality of multidimensional numerical data into clusters based on the reliability distribution function generated by the function generation means.

According to the present invention, each of a plurality of multidimensional numerical data is converted into lower-dimensional numerical data; a plurality of probability density functions, each giving the data existence probability corresponding to one of the converted data, is generated; a reliability distribution function that numerically defines the reliability of the plurality of multidimensional numerical data is generated by taking a linear sum of those probability density functions; and the plurality of multidimensional numerical data is divided into clusters based on this reliability distribution function. This makes the processing independent of the characteristics of the numerical data. The numerical data associated with a set of samples can therefore be clustered accurately regardless of how the samples were selected.

FIG. 1 is a block diagram showing the configuration of a clustering apparatus according to an embodiment of the present invention.
FIG. 2 is a schematic diagram showing the general configuration of the measurement apparatus.
FIG. 3 is a flowchart outlining the processing of the clustering method according to the embodiment of the present invention.
FIG. 4 is a flowchart showing the details of the data conversion processing.
FIG. 5 is a diagram showing an example (first example) of a generated reliability distribution function.
FIG. 6 is a diagram showing an example (second example) of a generated reliability distribution function.
FIG. 7 is a flowchart showing the details of the cluster division processing.
FIG. 8 is a diagram schematically showing the width and depth of a valley of the reliability distribution function.
FIG. 9 is a diagram showing an example (first example) of the displayed cluster division result.
FIG. 10 is a diagram showing an example (second example) of the displayed cluster division result.

Explanation of symbols

1 Clustering apparatus
2, 141 Transmission/reception unit
3 Input unit
4, 104 Control unit
5 Storage unit
6 Output unit
41 Data conversion unit
42 Function generation unit
43 Cluster division unit
51 Measurement data storage unit
52 Conversion data storage unit
53 Function storage unit
54 Cluster division result storage unit
101 Measurement apparatus
102 Microarray
103 Fluorescence detector
M1, M2 Peaks
V Valley

The best mode for carrying out the present invention (hereinafter, the "embodiment") is described below with reference to the accompanying drawings. FIG. 1 shows the configuration of a clustering apparatus according to an embodiment of the present invention. The clustering apparatus 1 shown in the figure clusters a plurality of measurement data (numerical data) transmitted from a measurement apparatus 101, and is implemented using a computer.

The clustering apparatus 1 includes: a transmission/reception unit 2 that exchanges data with the measurement apparatus 101; an input unit 3, implemented by a keyboard, a mouse, or the like, through which information is input from outside; a control unit 4 that performs the various calculations for clustering the measurement data and controls the operation of the clustering apparatus 1; a storage unit 5 that stores information including the measurement data and the calculation results of the control unit 4; and an output unit 6 that outputs information including the clustering results obtained by the calculations of the control unit 4.

The control unit 4 includes a data conversion unit 41 that applies a predetermined conversion to the measurement data input from the measurement apparatus 101, a function generation unit 42 that generates predetermined functions using the data converted by the data conversion unit 41, and a cluster division unit 43 that divides the measurement data into clusters using the functions generated by the function generation unit 42. The control unit 4 is implemented using a CPU (Central Processing Unit) or the like having calculation and control functions. The data conversion unit 41 constitutes at least part of the data conversion means, the function generation unit 42 constitutes at least part of the function generation means, and the cluster division unit 43 constitutes at least part of the cluster division means.

The storage unit 5 includes a measurement data storage unit 51 that stores the measurement data, a conversion data storage unit 52 that stores the data obtained by converting the measurement data, a function storage unit 53 that stores the functions generated from the converted data, and a cluster division result storage unit 54 that stores the cluster division results for the measurement data. The storage unit 5 is implemented using, for example, a ROM (Read Only Memory) that stores in advance the clustering program according to the present embodiment and a program for starting a predetermined OS, and a RAM (Random Access Memory) that temporarily stores information used by the control unit 4 during calculation. It constitutes at least part of the storage means. The storage unit 5 may also include an external storage device such as a hard disk.

The output unit 6 generates an image based on a control signal from the control unit 4 and displays the generated image. It is implemented using a display such as a liquid crystal, plasma, or organic EL display.

FIG. 2 is a schematic diagram showing the general configuration of the measurement apparatus 101. The measurement apparatus 101 performs SNP typing, that is, detection of single nucleotide polymorphisms (SNPs) of genes. It includes: a microarray 102 in which a plurality of spots SP is formed on a substrate; a fluorescence detector 103 that has a laser light source for irradiating each spot SP of the microarray 102 with laser light as excitation light and a photomultiplier tube for detecting the intensity of the fluorescence produced by the laser light; and a control unit 104 that controls the operation of the fluorescence detector 103.

The control unit 104 includes a transmission/reception unit 141 that exchanges information, including the measurement data, with the clustering apparatus 1.

Each spot SP of the microarray 102 is spotted with a gene (probe) whose sequence is complementary to the gene corresponding to a specific SNP of a sample. These probes include probes for internal control, which serve as the reference for the measurement data and are spotted on predetermined spots SP. The internal control probe is hereinafter referred to as the "hybridization control". A spot SP that carries a probe other than the hybridization control is written $SP_{mn}$, where the SNP is labeled by $m$ and the sample by $n$ ($m$ and $n$ are natural numbers).

When SNP typing is performed with the measurement apparatus 101, labeled DNAs (tags) are hybridized with the probes spotted on the spots $SP_{mn}$ of the microarray 102. The labeled DNAs are cDNAs of the alleles generated from a sample ($n$), labeled with the fluorescent dyes Cy3 (white circles (○) in FIG. 2) and Cy5 (black circles (●) in FIG. 2), respectively. The fluorescence detector 103 then detects the intensity of the fluorescence produced by the hybridization (the signal luminance). When the fluorescence detector 103 detects a fluorescence signal corresponding to one of the two fluorescent dyes, the alleles of that SNP are homozygous (○○ or ●● in FIG. 2). When it detects fluorescence signals corresponding to both fluorescent dyes, the alleles of that SNP are heterozygous (●○ in FIG. 2).

Using the fluorescence from each spot SP detected by the fluorescence detector 103, the control unit 104 calculates the signal luminances corresponding to the fluorescent dyes Cy3 and Cy5, which together form the two-dimensional numerical data used as measurement data, and transmits the calculated signal luminances to the clustering apparatus 1 via the transmission/reception unit 141.

The details of the SNP typing method performed by the measurement apparatus 101 are essentially the same as the methods described in the following references: N. Nishida, T. Tanabe, K. Hashido, K. Hirayasu, M. Takasu, A. Suyama, K. Tokunaga, "DigiTag assay for multiplex single nucleotide polymorphism typing with high success rate", Anal Biochem. 346 (2005) 281-288; N. Nishida, T. Tanabe, M. Takasu, A. Suyama, K. Tokunaga, "Further development of multiplex single nucleotide polymorphism typing method, the DigiTag2 assay", Anal Biochem. 364 (2007) 78-85.

FIG. 3 is a flowchart outlining the processing of the clustering method according to the present embodiment. In the present embodiment, the following two points, (1-1) and (1-2), are assumed regarding the reliability of the measurement data used for clustering.
(1-1) Measurement data with high signal luminance is highly reliable.
(1-2) In a plot of the distribution of the measured signal luminances, a measurement that has measurements of other samples distributed near it is highly reliable.

Under these premises, the data conversion unit 41 of the clustering apparatus 1 reads the fluorescence signal luminances stored as measurement data in the measurement data storage unit 51 and converts them according to a predetermined rule (step S1).

In general, the signal luminance varies under the influence of the concentration of the sample introduced into the system, the concentration of the probe spotted on the microarray 102, the intensity of the laser light emitted by the fluorescence detector 103, the sensitivity of the photomultiplier tube of the fluorescence detector 103, and so on. To eliminate these influences of the measurement system, the data conversion unit 41 uses the signal luminance of the hybridization control as a reference and converts the signal luminance of each spot $SP_{mn}$ into a quantity that does not depend on the probe spotting amount or the labeled DNA concentration. The following three points, (2-1) to (2-3), are assumed for this data conversion processing.
(2-1) The signal luminance is proportional to the amount of labeled DNA spotted on the microarray 102.
(2-2) The luminous efficiency of a fluorescent dye depends only on the dye and not on the DNA sequence.
(2-3) The molar ratio of the hybridization-control labeled DNAs labeled with the fluorescent dyes Cy3 and Cy5 is 1:1.

Based on assumptions (2-1) to (2-3), the signal luminance $I$ of the fluorescence emitted from the microarray 102 is defined as

$$I = d \cdot S \cdot C \tag{1}$$

where $d$ is the luminous efficiency of the corresponding fluorescent dye, $S$ is a coefficient proportional to the amount of probe spotted on the spot SP, and $C$ is the concentration of the labeled DNA. In what follows, the luminous efficiency of the fluorescent dye Cy3 is written $d_{Cy3}$ and that of the fluorescent dye Cy5 is written $d_{Cy5}$.

The specific calculations performed by the data conversion unit 41 in step S1 are described with reference to the flowchart in FIG. 4. The data conversion unit 41 obtains the ratio $d_{Cy3}/d_{Cy5}$ of the luminous efficiency of the fluorescent dye Cy3 to that of the fluorescent dye Cy5 (the luminous efficiency ratio) from the hybridization-control signal luminances $I_{HybriContCy3}$ and $I_{HybriContCy5}$ of the two dyes (step S11). By Equation (1), these signal luminances are expressed as

$$I_{HybriContCy3} = d_{Cy3} \cdot S_{HybriCont} \cdot C_{HybriContED-1} \tag{2}$$

$$I_{HybriContCy5} = d_{Cy5} \cdot S_{HybriCont} \cdot C_{HybriContED-2} \tag{3}$$

where $S_{HybriCont}$ denotes the coefficient $S$ of the hybridization-control spot, and ED-1 and ED-2 in the subscripts of the labeled DNA concentration $C$ identify the labeled DNAs labeled with the fluorescent dyes Cy3 and Cy5, respectively.

From assumption (2-3), the labeled DNA concentrations $C_{HybriContED-1}$ and $C_{HybriContED-2}$ are equal ($C_{HybriContED-1} = C_{HybriContED-2}$). Therefore,

$$\frac{d_{Cy3}}{d_{Cy5}} = \frac{I_{HybriContCy3}}{I_{HybriContCy5}} \tag{4}$$

is obtained, and the luminous efficiency ratio $d_{Cy3}/d_{Cy5}$ of the fluorescent dyes Cy3 and Cy5 is expressed using the ratio $I_{HybriContCy3}/I_{HybriContCy5}$ of the hybridization-control signal luminances, which are measured data.
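As a small numerical illustration of step S11, the sketch below estimates the luminous efficiency ratio from the two hybridization-control luminances. The function name `luminous_efficiency_ratio` and all numerical values are hypothetical; the only modeling input is the proportionality $I = d \cdot S \cdot C$ together with assumption (2-3).

```python
def luminous_efficiency_ratio(i_hyb_cy3, i_hyb_cy5):
    """Estimate d_Cy3 / d_Cy5 from the two hybridization-control luminances."""
    if i_hyb_cy5 <= 0:
        raise ValueError("Cy5 control luminance must be positive")
    return i_hyb_cy3 / i_hyb_cy5

# Hypothetical values: both control signals share the same spotting
# coefficient S and the same labeled-DNA concentration C (assumption 2-3).
d_cy3, d_cy5, s_ctrl, c_ctrl = 2.0, 0.5, 3.0, 1.5
ratio = luminous_efficiency_ratio(d_cy3 * s_ctrl * c_ctrl, d_cy5 * s_ctrl * c_ctrl)
```

Because $S$ and $C$ cancel, the estimate recovers $d_{Cy3}/d_{Cy5}$ whatever the spotting amount or control concentration.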

Next, the data conversion unit 41 corrects the signal luminances of each spot $SP_{mn}$ using the luminous efficiency ratio $d_{Cy3}/d_{Cy5}$ of the corresponding fluorescent dyes (step S12). The signal luminances $I_{SNPmSAMPLEnCy3}$ and $I_{SNPmSAMPLEnCy5}$ of the spot $SP_{mn}$ are defined as

$$I_{SNPmSAMPLEnCy3} = d_{Cy3} \cdot S_{SNPmSAMPLEn} \cdot C_{SNPmSAMPLEnED-1} \tag{5}$$

$$I_{SNPmSAMPLEnCy5} = d_{Cy5} \cdot S_{SNPmSAMPLEn} \cdot C_{SNPmSAMPLEnED-2} \tag{6}$$

where $S_{SNPmSAMPLEn}$ is the probe spotting amount of the spot $SP_{mn}$, and $C_{SNPmSAMPLEnED-1}$ and $C_{SNPmSAMPLEnED-2}$ are the concentrations of the labeled DNAs of the spot $SP_{mn}$ labeled with the fluorescent dyes Cy3 and Cy5, respectively.

Using the luminous efficiencies $d_{Cy3}$ and $d_{Cy5}$ of the fluorescent dyes Cy3 and Cy5, the data conversion unit 41 corrects the signal luminances $I_{SNPmSAMPLEnCy3}$ and $I_{SNPmSAMPLEnCy5}$ to the values $I'_{SNPmSAMPLEnCy3}$ and $I'_{SNPmSAMPLEnCy5}$ given by Equations (7) and (8).

Next, the data conversion unit 41 redefines the sum of the corrected signal luminances $I'_{SNPmSAMPLEnCy3}$ and $I'_{SNPmSAMPLEnCy5}$ as the coefficient $S'_{SNPmSAMPLEn}$ proportional to the probe spotting amount of the spot $SP_{mn}$ (step S13). That is, the coefficient proportional to the probe spotting amount of the spot $SP_{mn}$ is redefined as

$$S'_{SNPmSAMPLEn} = I'_{SNPmSAMPLEnCy3} + I'_{SNPmSAMPLEnCy5} \tag{9}$$

The data conversion unit 41 then calculates the correction values of the labeled DNA concentrations $C_{HybriContED-1}$ and $C_{HybriContED-2}$, which are defined using the corrected signal luminances $I'_{SNPmSAMPLEnCy3}$ and $I'_{SNPmSAMPLEnCy5}$ and the redefined coefficient $S'_{SNPmSAMPLEn}$, and writes them to the conversion data storage unit 52 (step S14). The correction values $C'_{SNPmSAMPLEnCy3}$ and $C'_{SNPmSAMPLEnCy5}$ of the labeled DNA concentration calculated in step S14 are defined by Equations (10) and (11), which are derived using Equations (5) to (9). In the derivation of Equation (10), for example, Equation (9) is substituted at the second equality, Equations (7) and (8) at the third equality, and Equations (5) and (6) at the last equality.

The corrected labeled DNA concentrations $C'_{SNPmSAMPLEnCy3}$ and $C'_{SNPmSAMPLEnCy5}$ are values normalized by the sum of the labeled DNA concentrations before correction, and correspond to the concentrations of the alleles labeled with the fluorescent dyes Cy3 and Cy5, respectively. In this way, the data conversion unit 41 converts the two measured values $I_{SNPmSAMPLEnCy3}$ and $I_{SNPmSAMPLEnCy5}$ at the spot $SP_{mn}$ into a one-dimensional quantity that does not depend on the probe spotting amount of the spot $SP_{mn}$ or on the sample concentration.
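The net effect of steps S12 to S14 can be sketched as follows. Because the exact forms of Equations (7), (8), (10), and (11) are not reproduced in this text, the sketch makes an assumption: it rescales the Cy5 luminance by the luminous efficiency ratio so that both channels share one efficiency scale, then normalizes by their sum. Under $I = d \cdot S \cdot C$ this yields $C_{ED-1}/(C_{ED-1}+C_{ED-2})$ and its complement, which matches the stated normalization by the sum of the concentrations. The function name `allele_fractions` and the numbers are hypothetical.

```python
def allele_fractions(i_cy3, i_cy5, efficiency_ratio):
    """Convert a (Cy3, Cy5) luminance pair into two fractions that sum to 1.

    efficiency_ratio is d_Cy3 / d_Cy5, e.g. estimated from the control spots.
    """
    scaled_cy5 = efficiency_ratio * i_cy5  # put Cy5 on the Cy3 efficiency scale
    total = i_cy3 + scaled_cy5             # proportional to S * (C_ED-1 + C_ED-2)
    if total <= 0:
        raise ValueError("total corrected luminance must be positive")
    return i_cy3 / total, scaled_cy5 / total

# Hypothetical spot: d_Cy3 = 2.0, d_Cy5 = 0.5, concentrations 3.0 and 1.0.
# The result should not change when the spotting coefficient S changes.
d_cy3, d_cy5, c_ed1, c_ed2 = 2.0, 0.5, 3.0, 1.0
results = [
    allele_fractions(d_cy3 * s * c_ed1, d_cy5 * s * c_ed2, d_cy3 / d_cy5)
    for s in (7.0, 70.0)
]
```

Both entries of `results` describe the same allele balance, which illustrates why the converted quantity is independent of the spotting amount.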

Next, the function generation unit 42 generates probability density functions that give the data existence probability, that is, the probability that the true value of each measurement distributed in one dimension in step S1 lies at a given position (step S2). Specifically, the function generation unit 42 reads the one-dimensional measurement data $C'_{SNPmSAMPLEnCy3}$ and $C'_{SNPmSAMPLEnCy5}$ from the conversion data storage unit 52, generates as the probability density function the Gaussian function $g_{SNPmSAMPLEn}(x)$ of Equation (12), which gives a normal distribution centered on the measurement point of each data item, and writes it to the function storage unit 53.

In Expression (12), the coefficient I_{SNPmSAMPLEn}, which corresponds to the area of the Gaussian function, is the quantity defined by Expression (13).

The constant d_{SNPmSAMPLEn}, which corresponds to the variance of the Gaussian function in Expression (12), is a quantity (representative distance) determined using the distances from the sample of interest to the samples within a predetermined range on the two-dimensional plane (I_{SNPmSAMPLECy3}, I_{SNPmSAMPLECy5}) showing the distribution of signal luminance, and is defined by Expression (14).

Here, Δθ^(k)_{SNPmSAMPLEn} is the angle difference on the two-dimensional plane (I_{SNPmSAMPLECy3}, I_{SNPmSAMPLECy5}) between the sample of interest and a sample located in its vicinity. The constant a^(k)_{SNPxSAMPLEy} is a number related to the smoothing of the distance and is set as appropriate to a value larger than 1. In Expression (14), the representative distance d_{SNPmSAMPLEn} is calculated using the angle differences Δθ^(k)_{SNPmSAMPLEn} to the three nearest samples.

Further, r_{SNPmSAMPLEn} in Expression (12) is the ratio C'_{SNPmSAMPLEnCy3} / C'_{SNPmSAMPLEnCy5} of the corrected labeled DNA concentrations for the respective fluorescent dyes.
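As an illustration of a per-sample Gaussian with the three parameters just described, the following is a minimal sketch. Expression (12) itself is not reproduced in this text, so the exact normalization is an assumption: here `area` plays the role of the coefficient I_{SNPmSAMPLEn}, `center` the role of r_{SNPmSAMPLEn}, and `width` the role of d_{SNPmSAMPLEn}, treated as a standard deviation. The function name `sample_gaussian` is hypothetical.

```python
import math


def sample_gaussian(x: float, area: float, center: float, width: float) -> float:
    """Gaussian whose integral over x equals `area`, centred at `center`.

    `width` is used as the standard deviation; this is an assumption,
    since the text only says the constant corresponds to the variance.
    """
    if width <= 0:
        raise ValueError("width must be positive")
    norm = area / (width * math.sqrt(2.0 * math.pi))
    return norm * math.exp(-((x - center) ** 2) / (2.0 * width ** 2))


# Hypothetical parameters for one sample.
peak_value = sample_gaussian(0.5, area=2.0, center=0.5, width=0.1)
```

A sample with many close neighbours would get a small `width` and hence a tall, narrow curve, while an isolated sample would get a broad, low one.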

Thereafter, the function generation unit 42 calculates, as a reliability distribution function that numerically defines the reliability of the measurement data, the function G_SNPm(x) defined as the sum of the probability density functions g_SNPmSAMPLEn(x) of all samples for the same SNP, and writes and stores it in the function storage unit 53 (step S3). FIGS. 5 and 6 are diagrams showing examples of reliability distribution functions generated for different combinations of samples and SNPs. In the reliability distribution function G_SNPm(x) shown in these figures, the Gaussian functions g_SNPmSAMPLEn(x) for the individual samples are added together, so that a plurality of peaks and valleys appear.
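The summation in step S3 can be sketched as follows. The per-sample Gaussian is written with an assumed normalization (the width parameter is treated as a standard deviation), and the sample parameters and grid are hypothetical values chosen only to produce three visible peaks.

```python
import math
from typing import List, Sequence, Tuple


def gaussian(x: float, area: float, center: float, width: float) -> float:
    """Per-sample Gaussian; the normalization is an assumption."""
    norm = area / (width * math.sqrt(2.0 * math.pi))
    return norm * math.exp(-((x - center) ** 2) / (2.0 * width ** 2))


def reliability(x: float, samples: Sequence[Tuple[float, float, float]]) -> float:
    """Sum of the per-sample Gaussians at x; each sample is (area, center, width)."""
    return sum(gaussian(x, a, c, w) for a, c, w in samples)


# Hypothetical samples for one SNP: (area, center, width).
samples = [(1.0, 0.1, 0.05), (1.0, 0.5, 0.05), (1.0, 0.9, 0.05)]
grid: List[float] = [i / 200 for i in range(201)]
curve = [reliability(x, samples) for x in grid]
```

Evaluating the sum on a grid, as in `curve`, gives the sampled function on which later steps can look for valleys.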

Next, the cluster dividing unit 43 divides the clusters on the two-dimensional plane based on the reliability distribution function G_SNPm(x) read from the function storage unit 53 for each SNP (step S4). The cluster division process is described in detail below with reference to the flowchart of FIG. 7. First, the cluster dividing unit 43 obtains the numerical derivative of the reliability distribution function G_SNPm(x) with respect to x (step S41).

Thereafter, the cluster dividing unit 43 calculates the local minima of the reliability distribution function G_SNPm(x) using the result of step S41 (step S42).
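Steps S41 and S42 can be sketched on a sampled curve as follows. The text does not specify the differentiation scheme, so the forward difference used here is an assumption, and the function names are hypothetical. A local minimum is taken to be a grid point where the slope changes from negative to positive; flat-bottomed valleys are not detected by this simple rule.

```python
from typing import List, Sequence


def forward_difference(values: Sequence[float], step: float = 1.0) -> List[float]:
    """Slope between each pair of neighbouring grid points."""
    return [(values[i + 1] - values[i]) / step for i in range(len(values) - 1)]


def local_minima(values: Sequence[float], step: float = 1.0) -> List[int]:
    """Indices where the slope turns from negative to positive."""
    slope = forward_difference(values, step)
    return [
        i
        for i in range(1, len(values) - 1)
        if slope[i - 1] < 0 and slope[i] > 0
    ]


# Hypothetical sampled curve with two valleys.
minima = local_minima([1.0, 3.0, 2.0, 5.0, 4.0, 6.0])
```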

Subsequently, the cluster dividing unit 43 calculates minimum value feature quantities that characterize each local minimum calculated in step S42 (step S43). The minimum value feature quantities here are the width and depth of the valley whose bottom is the local minimum of the reliability distribution function G_SNPm(x). FIG. 8 is a diagram schematically showing the width and depth of a valley of the reliability distribution function G_SNPm(x). The width w of the valley V shown in the figure is the horizontal distance between the summits of the peaks M1 and M2 that adjoin each other across the valley V. The depth p of the valley V is the average (p1 + p2)/2 of the height p1 of the peak M1 and the height p2 of the peak M2, each measured from the bottom of the valley V (the position of the local minimum).
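The width and depth defined above can be computed on a sampled curve roughly as follows. This is a sketch under the assumption that the adjacent peaks are found by walking uphill from the valley bottom in each direction; the function name `valley_features` and the sample values are hypothetical.

```python
from typing import Sequence, Tuple


def valley_features(
    xs: Sequence[float], values: Sequence[float], valley_index: int
) -> Tuple[float, float]:
    """Return (width w, depth p) of the valley whose bottom is at valley_index."""
    # Walk uphill to the left until the curve starts to fall again.
    left = valley_index
    while left > 0 and values[left - 1] >= values[left]:
        left -= 1
    # Walk uphill to the right in the same way.
    right = valley_index
    while right < len(values) - 1 and values[right + 1] >= values[right]:
        right += 1
    width = xs[right] - xs[left]              # horizontal distance between summits
    p1 = values[left] - values[valley_index]  # height of the left peak above the bottom
    p2 = values[right] - values[valley_index] # height of the right peak above the bottom
    return width, (p1 + p2) / 2.0


# Hypothetical curve: peaks at x = 1 and x = 3, valley bottom at x = 2.
w, p = valley_features([0.0, 1.0, 2.0, 3.0, 4.0], [1.0, 5.0, 2.0, 4.0, 0.0], 2)
```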

Next, the cluster dividing unit 43 sets the cluster division positions using the minimum value feature quantities calculated in step S43 (step S44). Specifically, for each valley V, the cluster dividing unit 43 obtains the evaluation function Q_V of Expression (16), which is defined using the width w and depth p of the valley V and a constant b.

The cluster dividing unit 43 then extracts, as cluster division points, up to the top two of the valleys V whose evaluation function Q_V obtained according to Expression (16) exceeds a predetermined threshold Q_th. Note that the evaluation function Q_V of the valley V defined by Expression (16) is merely an example; any other function may be used as long as it is defined using the width w and depth p of the valley V.
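The selection rule in step S44 can be sketched as below. Expression (16) is not reproduced in this text, so `placeholder_score` is a made-up stand-in (depth times width raised to the power b), used only because the text allows any function of w and p; it should not be read as the evaluation function of the patent. All names and values are hypothetical.

```python
from typing import List, Sequence, Tuple


def placeholder_score(width: float, depth: float, b: float) -> float:
    """Stand-in for Q_V; NOT Expression (16), which is unavailable here."""
    return depth * width ** b


def pick_division_points(
    valleys: Sequence[Tuple[float, float, float]],
    threshold: float,
    b: float = 1.0,
    max_points: int = 2,
) -> List[float]:
    """valleys holds (position, width, depth); return up to max_points positions."""
    scored = [(placeholder_score(w, p, b), pos) for pos, w, p in valleys]
    passing = [item for item in scored if item[0] > threshold]
    passing.sort(reverse=True)  # highest score first
    return sorted(pos for _, pos in passing[:max_points])


# Hypothetical valleys: (position, width, depth).
valleys = [(0.2, 0.3, 4.0), (0.5, 0.1, 0.5), (0.6, 0.2, 2.5), (0.8, 0.4, 2.0)]
points = pick_division_points(valleys, threshold=0.1)
```

With two division points the axis splits into three clusters, with one point into two, and with none the data stay in a single cluster.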

Thereafter, the output unit 6 outputs the cluster division result of step S4 (step S5). FIGS. 9 and 10 are diagrams showing display output examples of cluster division results (division into three clusters Cr1 to Cr3) for different combinations of samples and SNPs. FIG. 9 shows the result of clustering using the reliability distribution function G_SNPm(x) shown in FIG. 5, and FIG. 10 shows the result of clustering using the reliability distribution function G_SNPm(x) shown in FIG. 6. In FIGS. 9 and 10, the three divided clusters Cr1 to Cr3 correspond to mutually different polymorphism data.
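One way to turn division points into per-sample cluster labels is sketched below. It works on the one-dimensional converted values only and assumes that the division points are positions on that same axis; how the result is drawn on the two-dimensional plane of FIGS. 9 and 10 is not covered. The function name `assign_clusters` and the sample values are hypothetical.

```python
from bisect import bisect_right
from typing import List, Sequence


def assign_clusters(values: Sequence[float], division_points: Sequence[float]) -> List[int]:
    """Label each value with the index of the interval it falls into."""
    cuts = sorted(division_points)
    return [bisect_right(cuts, v) for v in values]


# Hypothetical converted values and two division points, giving three clusters.
labels = assign_clusters([0.05, 0.5, 0.95, 0.1], [0.3, 0.7])
```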

Note that when there is only one valley V whose evaluation function Q_V exceeds the threshold Q_th, the cluster dividing unit 43 divides the data into two clusters. In this case, taking as a boundary the straight line on the two-dimensional plane along which the vertical-axis and horizontal-axis values are equal, the cluster dividing unit 43 determines whether the position of the valley V belongs to the region between this line and the vertical axis or to the region between this line and the horizontal axis, and thereby determines the type of polymorphism data to which each cluster belongs.

According to the clustering method described above, the clusters are divided without using statistical values or genetic-statistical indices, so the reliability of the statistics and genetic-statistical indices relating to the SNP typing results can be improved.

According to the embodiment of the present invention described above, each of a plurality of multidimensional numerical data is converted into lower-dimensional numerical data; a plurality of probability density functions giving the data existence probability corresponding to each of the converted numerical data are generated; a reliability distribution function that numerically defines the reliability of the plurality of multidimensional numerical data is generated by taking the linear sum of those probability density functions; and the plurality of multidimensional numerical data are divided into clusters based on this reliability distribution function. Processing that does not depend on the characteristics of the numerical data can therefore be realized. Consequently, the numerical data related to a sample can be clustered accurately regardless of how the samples are selected.

Further, according to the present embodiment, when cluster division is performed based on the reliability distribution function, the local minima of the reliability distribution function and the minimum value feature quantities (valley width and depth) characterizing each local minimum are calculated, and the cluster division positions are set using an evaluation function defined with the calculated minimum value feature quantities. Cluster division is therefore never performed at a position that does not satisfy the predetermined condition. Consequently, for a data set in which the number of clusters to be divided is smaller than in an ordinary data set, unnecessary clustering is avoided.

In the embodiment described above, detection data for SNP alleles are used as the multidimensional numerical data. However, as a method of fractionating (or classifying) multidimensional numerical data, the present invention can also be applied effectively to other biological measurements in which a large number of numerical data show large variation.

In the embodiment described above, analysis is performed based on fluorescence signals from various samples immobilized on a microarray. However, the present invention is widely applicable to solid phase assays other than microarrays, such as those using beads or affinity columns.

Further, in the present invention, as a method that does not rely on a solid phase assay, a classification method such as MAS (Magic Angle Spinning) may be applied, which uses mass spectrometry tags that differ only in molecular weight instead of using optical labels such as fluorescence as identification tags.

In the present invention, labels related to luminescence (chemiluminescence or bioluminescence), absorbance (colorimetry or turbidity), scattered light, or polarization may be used as optical labels in addition to fluorescence. Furthermore, depending on the object, electromagnetic energy such as radiation, magnetism, atomic force, electron beams, or electromagnetic ultrasonic waves (EMAT) may be used as a label.

The present invention is also suitable for cell-based assays in which shape parameters of, for example, various blood cells and somatic cells are imaged optically or electromagnetically and quantified by image analysis.

In the clustering method according to the present invention, the data conversion step can in general convert multidimensional numerical data into numerical data of a lower dimension.

Note that the clustering apparatus according to the present invention may be configured to be communicably connected to the measurement apparatus via a communication network formed by an appropriate combination of the Internet, an intranet, a fixed telephone network, a mobile phone network, a leased line network, and the like.

The clustering program according to the present invention can also be recorded on a computer-readable recording medium such as a hard disk, a flexible disk, a CD-ROM, a DVD-ROM, a flash memory, or an MO disk and distributed widely.

As described above, the present invention can include various embodiments not described herein, and various design changes and the like can be made without departing from the technical idea specified by the claims.

The clustering method, program, and apparatus according to the present invention are suitable for analyzing the various phenomena and characteristics peculiar to samples derived from the same or different living bodies, and are particularly suitable for gene polymorphism analysis.

Claims

A clustering method in which a computer comprising storage means for storing a plurality of multidimensional numerical data divides the plurality of multidimensional numerical data into one or a plurality of clusters, the method comprising:
a data conversion step of reading the plurality of multidimensional numerical data from the storage means and converting each of the read multidimensional numerical data into lower-dimensional numerical data;
a probability density function generation step of generating a plurality of probability density functions giving data existence probabilities corresponding to each of the numerical data converted in the data conversion step;
a reliability distribution function generation step of generating a reliability distribution function that numerically defines the reliability of the plurality of multidimensional numerical data by taking a linear sum of the plurality of probability density functions generated in the probability density function generation step; and
a cluster division step of performing cluster division of the plurality of multidimensional numerical data based on the reliability distribution function generated in the reliability distribution function generation step.

The clustering method according to claim 1, wherein the data conversion step converts each of the plurality of multidimensional numerical data into numerical data of one lower dimension by using the value of a ratio between different components of the multidimensional numerical data.

The clustering method according to claim 1 or 2, wherein the number of dimensions of the multidimensional numerical data is 2, and the sum of the one-dimensional numerical data after conversion in the data conversion step is 1.

The clustering method according to claim 3, wherein the probability density function generated in the probability density function generation step is a Gaussian function,
the mean of the Gaussian function is determined by the ratio between the components of the two-dimensional numerical data of interest, and
the variance of the Gaussian function is determined using the distance, on a two-dimensional plane giving the distribution of the plurality of two-dimensional numerical data, between the two-dimensional numerical data of interest and two-dimensional numerical data within a predetermined range from the two-dimensional numerical data of interest.

The clustering method according to claim 3 or 4, wherein the two-dimensional numerical data are detection data for alleles of a single nucleotide polymorphism, and the data converted in the data conversion step are allele concentrations.

The clustering method according to claim 1, wherein the cluster division step includes:
a reliability distribution function differentiation step of differentiating the reliability distribution function with respect to the numerical data converted in the data conversion step;
a minimum value calculation step of calculating a minimum value of the reliability distribution function from the value differentiated in the reliability distribution function differentiation step;
a minimum value feature quantity calculation step of calculating a minimum value feature quantity characterizing the minimum value calculated in the minimum value calculation step; and
a cluster division position setting step of setting a cluster division position in the space in which the multidimensional numerical data are distributed, using the minimum value feature quantity calculated in the minimum value feature quantity calculation step.

The clustering method according to claim 1, further comprising a cluster division result output step for outputting a cluster division result in the cluster division step.

A clustering program that causes the computer to execute the clustering method according to claim 1.

A clustering apparatus that divides a plurality of multidimensional numerical data into one or a plurality of clusters, the apparatus comprising:
storage means for storing the plurality of multidimensional numerical data;
data conversion means for reading the plurality of multidimensional numerical data from the storage means and converting each of the read multidimensional numerical data into lower-dimensional numerical data;
function generation means for generating a plurality of probability density functions giving data existence probabilities corresponding to each of the numerical data converted by the data conversion means, and for generating a reliability distribution function that numerically defines the reliability of the plurality of multidimensional numerical data by taking a linear sum of the generated probability density functions; and
cluster dividing means for performing cluster division of the plurality of multidimensional numerical data based on the reliability distribution function generated by the function generation means.